Lecture 7: Floating Point Arithmetic, Memory Hierarchy, and Cache
Question:
Suppose we want to compute, using four-digit decimal arithmetic:
¾ S = 1.000 + 1.000×10^4 – 1.000×10^4
¾ What's the answer?
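The order of evaluation decides the answer. A quick sketch with Python's `decimal` module, rounding every operation to four significant digits (decimal rather than binary floating point, but the rounding effect is the same):

```python
from decimal import Decimal, getcontext

# Simulate four-digit decimal arithmetic: round after every operation.
getcontext().prec = 4

a = Decimal("1.000")
b = Decimal("1.000E4")

left_to_right = (a + b) - b   # fl(fl(1.000 + 1.000e4) - 1.000e4)
regrouped     = a + (b - b)   # fl(1.000 + fl(1.000e4 - 1.000e4))

print(left_to_right == 0, regrouped == 1)   # True True
```

Left to right, the 1.000 is absorbed when 1.0001×10^4 is rounded to four digits, and the sum is 0; regrouping gives 1.000. Floating point addition is not associative.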
Ariane 5 rocket
¾ June 1996: exploded when a 64-bit floating point number relating to the horizontal velocity of the rocket was converted to a 16-bit signed integer. The number was larger than 32,767, and thus the conversion failed.
¾ $500M rocket and cargo
Defining Floating Point Arithmetic
Representable numbers
¾ Scientific notation: +/- d.d…d × r^exp
¾ sign bit +/-
¾ radix r (usually 2 or 10, sometimes 16)
¾ significand d.d…d (how many base-r digits d?)
¾ exponent exp (range?)
¾ others?
Operations:
¾ arithmetic: +, -, ×, /, …
» how to round result to fit in format
¾ comparison (<, =, >)
¾ conversion between different formats
» short to long FP numbers, FP to integer
¾ exception handling
» what to do for 0/0, 2*largest_number, etc.
¾ binary/decimal conversion
» for I/O, when radix not 10
IEEE Floating Point Arithmetic Standard 754 (1985) - Normalized Numbers
Normalized nonzero representable numbers: +- 1.d…d × 2^exp
¾ Macheps = machine epsilon = 2^(-#significand bits) = relative error in each operation = smallest number ε such that fl(1 + ε) > 1
¾ OV = overflow threshold = largest number
¾ UN = underflow threshold = smallest number
Format            # bits   # significand bits   macheps             # exponent bits   exponent range
Single            32       23+1                 2^-24 (~10^-7)      8                 2^-126 to 2^127 (~10^±38)
Double            64       52+1                 2^-53 (~10^-16)     11                2^-1022 to 2^1023 (~10^±308)
Double Extended   >=80     >=64                 <=2^-64 (~10^-19)   >=15              2^-16382 to 2^16383 (~10^±4932)
(Double Extended is 80 bits on all Intel machines)
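The table's macheps column can be checked empirically. A minimal sketch for IEEE double (Python floats): halve eps until fl(1 + eps/2) no longer exceeds 1. Note that this classic loop yields the ulp-style epsilon 2^-52; the table's 2^-53 (the maximum relative rounding error) is half of that, because of round-to-nearest.

```python
# Find the smallest power of 2 for which fl(1 + eps/2) > 1 fails.
eps = 1.0
while 1.0 + eps / 2 > 1.0:   # stops once 1 + eps/2 rounds back down to 1
    eps /= 2
print(eps)                   # 2.220446049250313e-16
print(eps == 2.0 ** -52)     # True: ulp of 1.0 for IEEE double
```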
+- Zero: sign bit +-, significand and exponent all zero
¾ Why bother with -0? Later.
2 IEEE Floating Point Arithmetic Standard 754 - “Denorms”
Denormalized numbers: +- 0.d…d × 2^min_exp
¾ sign bit, nonzero significand, minimum exponent
¾ Fills in gap between UN and 0
Underflow exception
¾ occurs when exact nonzero result is less than underflow threshold UN
¾ Ex: UN/3
¾ return a denorm, or zero
IEEE Floating Point Arithmetic Standard 754 - +- Infinity
+- Infinity: sign bit, zero significand, maximum exponent
Overflow exception
¾ occurs when exact finite result too large to represent accurately
¾ Ex: 2*OV
¾ return +- infinity
Divide by zero exception
¾ return +- infinity = 1/+-0
¾ sign of zero important!
Also return +- infinity for
¾ 3+infinity, 2*infinity, infinity*infinity
¾ Result is exact, not an exception!
IEEE Floating Point Arithmetic Standard 754 - NAN (Not A Number)
NAN: sign bit, nonzero significand, maximum exponent
Invalid exception
¾ occurs when exact result not a well-defined real number
¾ 0/0
¾ sqrt(-1)
¾ infinity-infinity, infinity/infinity, 0*infinity
¾ NAN + 3
¾ NAN > 3?
¾ Return a NAN in all these cases
Two kinds of NANs
¾ Quiet - propagates without raising an exception
¾ Signaling - generates an exception when touched
» good for detecting uninitialized data
Error Analysis
Basic error formula
¾ fl(a op b) = (a op b)*(1 + d) where
» op is one of +, -, *, /
» |d| <= macheps
» assuming no overflow, underflow, or divide by zero
Example: adding 4 numbers
¾ fl(x1+x2+x3+x4) = {[(x1+x2)*(1+d1) + x3]*(1+d2) + x4}*(1+d3)
     = x1*(1+d1)*(1+d2)*(1+d3) + x2*(1+d1)*(1+d2)*(1+d3)
       + x3*(1+d2)*(1+d3) + x4*(1+d3)
     = x1*(1+e1) + x2*(1+e2) + x3*(1+e3) + x4*(1+e4)
     where each |ei| <~ 3*macheps
¾ get exact sum of slightly changed summands xi*(1+ei)
¾ Backward Error Analysis - algorithm called numerically stable if it gives the exact result for slightly changed inputs
¾ Numerical Stability is an algorithm design goal
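A two-line demonstration that the (1+d) factors are real: because each partial sum is rounded, the same summands can give different answers in different orders (Python doubles, macheps = 2^-53):

```python
# Floating point addition is not associative: the rounding model
# fl(a+b) = (a+b)*(1+d) applies at every partial sum.
x = [1.0, 1e16, -1e16]
s_forward  = (x[0] + x[1]) + x[2]   # the 1.0 is absorbed into 1e16 first
s_backward = x[0] + (x[1] + x[2])   # the big terms cancel exactly first
print(s_forward, s_backward)        # 0.0 1.0
```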
Backward error
Approximate solution is the exact solution to a modified problem.
How large a modification to the original problem is required to give the result actually obtained?
How much data error in the initial input would be required to explain all the error in the computed results?
Approximate solution is good if it is the exact solution to a "nearby" problem.
[Diagram: true input x maps under f to f(x); the computed result f'(x) is the exact result f(x') for a nearby input x'. Forward error = f'(x) – f(x); backward error = x' – x.]
Sensitivity and Conditioning Problem is insensitive or well conditioned if relative change in input causes commensurate relative change in solution. Problem is sensitive or ill-conditioned, if relative change in solution can be much larger than that in input data.
Cond = |Relative change in solution| / |Relative change in input data| = |[f(x’) – f(x)]/f(x)| / |(x’ – x)/x|
Problem is sensitive, or ill-conditioned, if cond >> 1.
When function f is evaluated for approximate input x' = x + h instead of the true input value x:
Absolute error = f(x + h) – f(x) ≈ h f'(x)
Relative error = [f(x + h) – f(x)] / f(x) ≈ h f'(x) / f(x)
Sensitivity: 2 Examples cos(π/2) and 2-d System of Equations
Consider the problem of computing the cosine function for arguments near π/2 (second example: the 2-d system a*x1 + b*x2 = f, c*x1 + d*x2 = g).
Let x ≈ π/2 and let h be a small perturbation to x. Then:
absolute error = cos(x+h) – cos(x) ≈ –h sin(x) ≈ –h
relative error ≈ –h tan(x) ≈ ∞
So a small change in x near π/2 causes a large relative change in cos(x), regardless of the method used.
cos(1.57079) = 0.63267949 × 10^-5
cos(1.57078) = 1.63267949 × 10^-5
Relative change in output is a quarter million times greater than relative change in input.
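The numbers on this slide can be reproduced directly; the condition estimate below is just the computed analogue of cond = |relative change in solution| / |relative change in input|:

```python
import math

x1, x2 = 1.57079, 1.57078          # two nearby inputs close to pi/2
f1, f2 = math.cos(x1), math.cos(x2)

rel_in  = abs(x2 - x1) / x1
rel_out = abs(f2 - f1) / abs(f1)
cond = rel_out / rel_in

print(f1, f2, cond)   # roughly 6.33e-06, 1.63e-05, 2.5e5
```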
Exception Handling
What happens when the "exact value" is not a real number, or too small or too large to represent accurately?
5 Exceptions:
¾ Overflow - exact result > OV, too large to represent
¾ Underflow - exact result nonzero and < UN, too small to represent
¾ Divide-by-zero - nonzero/0
¾ Invalid - 0/0, sqrt(-1), …
¾ Inexact - you made a rounding error (very common!)
Possible responses
¾ Stop with error message (unfriendly, not default)
¾ Keep computing (default, but how?)
Summary of Values Representable in IEEE FP
                       sign   exponent             significand
+- Zero                 +-    0…0                  0…………0
Normalized nonzero      +-    not 0…0 or all 1s    anything
Denormalized            +-    0…0                  nonzero
+- Infinity             +-    1…1                  0…………0
NANs                    +-    1…1                  nonzero
¾ Signaling and quiet
¾ Many systems have only quiet
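These encodings are easy to inspect. A sketch that prints the 64-bit pattern (1 sign bit, 11 exponent bits, 52 significand bits) of each special value, using Python's `struct` to reinterpret the double's bytes:

```python
import math
import struct

def bits(x: float) -> str:
    # Big-endian IEEE-754 double, shown as 64 binary digits.
    return format(struct.unpack(">Q", struct.pack(">d", x))[0], "064b")

print(bits(0.0))       # sign 0, exponent all 0s, significand all 0s
print(bits(-0.0))      # sign 1, rest zero: why -0 matters, e.g. 1/-0 = -inf
print(bits(math.inf))  # exponent all 1s, significand all 0s
print(bits(math.nan))  # exponent all 1s, significand nonzero
print(bits(5e-324))    # denorm: exponent all 0s, significand nonzero
```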
Questions?
More on the In-Class Presentations
Start at 1:00 on Monday, 5/5/08
Turn in reports on Monday, 5/2/08
Presentations roughly 20 minutes each
Use powerpoint or pdf
Describe your project, perhaps motivate via application
Describe your method/approach
Provide comparison and results
See me about your topic
Cache and Memory
Cache and Its Importance in Performance
Motivation:
¾ Time to run code = clock cycles running code + clock cycles waiting for memory
¾ For many years, CPUs have sped up an average of 50% per year over memory chip speedups.
Hence, memory access is the bottleneck to computing fast.
Latency in a Single System
[Figure: memory access time over 1997–2009. CPU clock period (µProc, 60%/yr, 2X/1.5 yr) vs. memory system access time (DRAM, 9%/yr, 2X/10 yrs), with the memory-to-CPU-time ratio on a log scale. The processor-memory performance gap grows 50% per year.]
Here’s your problem
Say 2.26 GHz
¾ 2 ops/cycle DP
¾ 4.52 Gflop/s peak
FSB 533 MHz
¾ 32-bit data path (4 bytes), or 2.132 GB/s
With 8 bytes/word (DP)
¾ 266.5 MW/s from memory
Intel Clovertown
Quad-core processor
Each core does 4 floating point ops/cycle
Say 2.4 GHz
¾ thus 4 cores × 4 flops/cycle × 2.4 GHz = 38.4 Gflop/s peak
FSB 1.066 GHz
¾ 1.066 GHz × 4 B / 8 (B/W) = 533 MW/s
» There’s your problem
Commodity Processor Trends Bandwidth/Latency is the Critical Issue, not FLOPS
Got Bandwidth?

                              Annual increase   Typical in 2005    Typical in 2010    Typical in 2020
Single-chip floating-point    59%               4 GFLOP/s          32 GFLOP/s         3300 GFLOP/s
  performance
Front-side bus bandwidth      23%               1 GWord/s          3.5 GWord/s        27 GWord/s
                                                = 0.25 word/flop   = 0.11 word/flop   = 0.008 word/flop
DRAM latency                  (-5.5%)           70 ns              50 ns              28 ns
                                                = 280 FP ops       = 1600 FP ops      = 94,000 FP ops
                                                = 70 loads         = 170 loads        = 780 loads

Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.
Solving the Memory Bottleneck
Since we cannot make fast enough memories, we invented the memory hierarchy
¾ L1 Cache (on chip)
¾ L2 Cache
¾ Optional L3 Cache
¾ Main Memory
¾ Hard Drive
Cache Memories Cache memories are small, fast SRAM-based memories managed automatically in hardware. ¾ Hold frequently accessed blocks of main memory CPU looks first for data in L1, then in L2, then in main memory. Typical bus structure:
[Diagram: CPU chip with register file and ALU; L1 cache on the cache bus; L2 cache and bus interface on the system bus; I/O bridge connecting to main memory over the memory bus.]
What is a cache?
Small, fast storage used to improve average access time to slow memory.
Exploits spatial and temporal locality.
In computer architecture, almost everything is a cache!
¾ Registers a "cache" on variables - software managed
¾ First-level cache a cache on second-level cache
¾ Second-level cache a cache on memory
¾ Memory a cache on disk (virtual memory)
¾ TLB a cache on page table
¾ Branch-prediction a cache on prediction information?
[Diagram: memory pyramid - Proc/Regs, L1 cache, L2 cache, memory, then disk, tape, etc.; levels get bigger going down and faster going up.]
Cache Performance Metrics
Miss Rate
¾ Fraction of memory references not found in cache (misses/references)
¾ Typical numbers:
» 3-10% for L1
» can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit Time
¾ Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
¾ Typical numbers:
» 1 clock cycle for L1
» 3-8 clock cycles for L2
Miss Penalty
¾ Additional time required because of a miss
» Typically 25-100 cycles for main memory
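These three metrics combine into the average memory access time (AMAT). A sketch with illustrative numbers picked from the typical ranges above (the exact figures are assumptions, not measurements):

```python
# AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * penalty)
l1_hit, l1_miss = 1.0, 0.05    # cycles, miss fraction (illustrative)
l2_hit, l2_miss = 6.0, 0.01    # cycles, miss fraction (illustrative)
mem_penalty = 100.0            # cycles to main memory (illustrative)

amat = l1_hit + l1_miss * (l2_hit + l2_miss * mem_penalty)
print(amat)   # ≈ 1.35 cycles per reference
```

Even with a 100-cycle memory penalty, two levels of cache keep the average access near one cycle, which is why the hierarchy works.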
Traditional Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level? (Block placement)
¾ Fully Associative, Set Associative, Direct Mapped
Q2: How is a block found if it is in the upper level? (Block identification)
¾ Tag/Block
Q3: Which block should be replaced on a miss? (Block replacement)
¾ Random, LRU
Q4: What happens on a write? (Write strategy)
¾ Write Back or Write Through (with Write Buffer)
Cache-Related Terms
ICACHE: Instruction cache
DCACHE (L1): Data cache closest to registers
SCACHE (L2): Secondary data cache
TCACHE (L3): Third level data cache
¾ Data from SCACHE has to go through DCACHE to registers
¾ TCACHE is larger than SCACHE, and SCACHE is larger than DCACHE
¾ Not all processors have TCACHE
Line Replacement Policy
When 2 memory lines are in the cache and a 3rd line comes in, one of the two previous ones must be evicted: which one to choose? Since one doesn't know the future, heuristics:
¾ LRU: Least Recently Used
» Hard to implement
¾ FIFO
» Easy to implement
¾ Random
» Even easier to implement
Overall, associative caches...
¾ can alleviate thrashing
¾ require more complex circuitry than direct-mapped
¾ are more expensive than direct-mapped
¾ are slower than direct-mapped
Three Types of Cache Misses
Compulsory (or cold-start) misses ¾ First access to data ¾ Can be reduced via bigger cache lines ¾ Can be reduced via some pre-fetching Capacity misses ¾ Misses due to the cache not being big enough ¾ Can be reduced via a bigger cache Conflict misses ¾ Misses due to some other memory line having evicted the needed cache line ¾ Can be reduced via higher associativity
Write Policy: Write-Through
What happens when the processor modifies memory that is in cache?
Option #1: Write-through
¾ Write goes BOTH to cache and to main memory
¾ Memory and cache always consistent
[Diagram: write-through - CPU stores go to both the cache and main memory; loads are served from the cache.]
Write Policy: Write-Back
Option #2
¾ Write goes only to cache
¾ Cache lines are written back to memory when evicted
¾ Requires a "dirty" bit to indicate whether a cache line was written to or not
¾ Memory not always consistent with the cache
[Diagram: write-back - CPU stores go to the cache; dirty lines are written back to main memory when evicted; loads are served from the cache.]
Cache Basics
Cache hit: a memory access that is found in the cache -- cheap Cache miss: a memory access that is not in the cache - expensive, because we need to get the data from elsewhere Consider a tiny cache (for illustration only)
[Diagram: a tiny 8-entry cache, addresses X000-X111; an address splits into tag | line | offset fields.]
Cache line length: number of bytes loaded together in one entry Direct mapped: only one address (line) in a given range in cache Associative: 2 or more lines with different addresses exist 40
Direct-Mapped Cache
Direct mapped cache: A block from main memory can go in exactly one place in the cache. This is called direct mapped because there is a direct mapping from any block address in memory to a single location in the cache.
Fully Associative Cache
Fully associative cache: A block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be associated with any entry in the cache.
Set Associative Cache
Set associative cache: The middle range of designs between direct mapped cache and fully associative cache is called set-associative cache. In an n-way set-associative cache a block from main memory can go into N (N > 1) locations in the cache. [Diagram: 2-way set-associative cache.]
Here assume cache has 8 blocks, while memory has 32
¾ Fully associative: block 12 can go anywhere
¾ Direct mapped: block 12 can go only into cache block 4 (12 mod 8)
¾ Set associative (2-way, 4 sets): block 12 can go anywhere in set 0 (12 mod 4)
[Diagram: cache blocks 0-7; memory blocks 0-31.]
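The mod arithmetic in this example is all there is to block placement; a minimal sketch:

```python
# Where can memory block 12 live in an 8-block cache?
n_blocks, n_sets = 8, 4   # 4 sets in an 8-block cache => 2-way set associative
block = 12                # memory block number

direct_mapped_slot = block % n_blocks   # exactly one allowed slot
set_index = block % n_sets              # either way within this set
# Fully associative: any of the n_blocks slots, no index computed at all.

print(direct_mapped_slot, set_index)    # 4 0
```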
Tuning for Caches
1. Preserve locality. 2. Reduce cache thrashing. 3. Loop blocking when out of cache. 4. Software pipelining.
Registers
Registers are the source and destination of most CPU data operations.
They hold one element each.
They are made of static RAM (SRAM), which is very expensive.
The access time is usually 1-1.5 CPU clock cycles.
Registers are at the top of the memory subsystem.
The Principle of Locality
The Principle of Locality:
¾ Programs access a relatively small portion of the address space at any instant of time.
Two Different Types of Locality:
¾ Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
¾ Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access)
Last 15 years, HW relied on locality for speed
Principles of Locality
Temporal: an item referenced now will be again soon.
Spatial: an item referenced now causes neighbors to be referenced soon.
Lines, not words, are moved between memory levels. Both principles are satisfied. There is an optimal line size based on the properties of the data bus and the memory subsystem designs.
Cache lines are typically 32-128 bytes, with 1024 being the longest currently.
Cache Thrashing
Thrashing occurs when frequently used cache lines replace each other. There are three primary causes for thrashing:
¾ Instructions and data can conflict, particularly in unified caches.
¾ Too many variables or too large of arrays are accessed that do not fit into cache.
¾ Indirect addressing, e.g., sparse matrices.
Machine architects can add sets to the associativity. Users can buy another vendor’s machine. However, neither solution is realistic.
Counting cache misses
n×n 2-D array, element size = e bytes, cache line size = b bytes

Traversal along cache lines (contiguous access):
¾ One cache miss for every cache line: n^2 × e/b misses
¾ Total number of memory accesses: n^2
¾ Miss rate: e/b
¾ Example: miss rate = 4 bytes / 64 bytes = 6.25%, unless the array is very small

Traversal across cache lines (each access touches a new line):
¾ One cache miss for every access
¾ Example: miss rate = 100%, unless the array is very small
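Both miss rates can be reproduced with a toy direct-mapped cache model (the 64 KB cache with 64-byte lines is an assumed geometry for illustration): scanning a 256×256 array of 4-byte elements along cache lines misses once per line, scanning across them misses on every access because each line is evicted before it is reused.

```python
def miss_rate(addresses, line_bytes=64, n_lines=1024):
    # Tiny direct-mapped cache model (1024 lines of 64 bytes = 64 KB):
    # one tag per slot; a miss whenever the stored tag differs.
    tags = [None] * n_lines
    misses = 0
    for a in addresses:
        line = a // line_bytes
        slot = line % n_lines
        if tags[slot] != line:
            tags[slot] = line
            misses += 1
    return misses / len(addresses)

n, e = 256, 4                          # 256x256 array of 4-byte elements (256 KB > cache)
addr = lambda i, j: (i * n + j) * e    # element (i,j), stored row-contiguously

along_lines  = [addr(i, j) for i in range(n) for j in range(n)]
across_lines = [addr(i, j) for j in range(n) for i in range(n)]

print(miss_rate(along_lines))    # 0.0625 = e/b
print(miss_rate(across_lines))   # 1.0
```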
Cache Coherence for Multiprocessors
All data must be coherent between memory levels. Multiple processors with separate caches must inform the other processors quickly about data modifications (by the cache line). Only hardware is fast enough to do this.
Standard protocols on multiprocessors: ¾ Snoopy: all processors monitor the memory bus. ¾ Directory based: Cache lines maintain an extra 2 bits per processor to maintain clean/dirty status bits.
False sharing occurs when two different shared variables are located in the same cache block, causing the block to be exchanged between the processors even though the processors are accessing different variables. Size of block (line) is important.
Indirect Addressing

      d = 0
      do i = 1,n
         j = ind(i)
         d = d + sqrt( x(j)*x(j) + y(j)*y(j) + z(j)*z(j) )
      end do
Change the loop statement to

         d = d + sqrt( r(1,j)*r(1,j) + r(2,j)*r(2,j) + r(3,j)*r(3,j) )
Note that r(1,j)-r(3,j) are in contiguous memory and probably are in the same cache line (d is probably in a register and is irrelevant). The original form uses 3 cache lines on every iteration of the loop and can cause cache thrashing.
Cache Thrashing by Memory Allocation
parameter ( m = 1024*1024 ) real a(m), b(m)
For a 4 MB direct mapped cache, a(i) and b(i) are always mapped to the same cache line. This is trivially avoided using padding.
real a(m), extra(32), b(m)
extra is at least 128 bytes in length, which is longer than a cache line on all but one memory subsystem that is available today.
Cache Blocking
We want blocks to fit into cache. On parallel computers we have p × the cache, so data may fit into cache on p processors, but not on one. This leads to superlinear speedup! Consider matrix-matrix multiply.
      do k = 1,n
        do j = 1,n
          do i = 1,n
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
          end do
        end do
      end do

An alternate form is ...
Cache Blocking

      do kk = 1,n,nblk
        do jj = 1,n,nblk
          do ii = 1,n,nblk
            do k = kk,kk+nblk-1
              do j = jj,jj+nblk-1
                do i = ii,ii+nblk-1
                  c(i,j) = c(i,j) + a(i,k)*b(k,j)
                end do
              end do
            end do
          end do
        end do
      end do

[Diagram: C (M×N) updated by an M×K block of A times a K×N block of B, NB at a time.]
Lessons
The actual performance of a simple program can be a complicated function of the architecture
Slight changes in the architecture or program change the performance significantly
Since we want to write fast programs, we must take the architecture into account, even on uniprocessors
Since the actual performance is so complicated, we need simple models to help us design efficient algorithms
We will illustrate with a common technique for improving cache performance, called blocking
Assignment 4
Strassen’s Matrix Multiply
The traditional algorithm (with or without tiling) has O(n^3) flops
Strassen discovered an algorithm with asymptotically lower flops
¾ O(n^2.81)
Consider a 2×2 matrix multiply, normally 8 multiplies
¾ Strassen does it with 7 multiplies and 18 adds
Let M = [m11 m12; m21 m22] = [a11 a12; a21 a22] * [b11 b12; b21 b22]
Let p1 = (a12 - a22) * (b21 + b22)
    p2 = (a11 + a22) * (b11 + b22)
    p3 = (a11 - a21) * (b11 + b12)
    p4 = (a11 + a12) * b22
    p5 = a11 * (b12 - b22)
    p6 = a22 * (b21 - b11)
    p7 = (a21 + a22) * b11
Then m11 = p1 + p2 - p4 + p6
     m12 = p4 + p5
     m21 = p6 + p7
     m22 = p2 - p3 + p5 - p7
Extends to n×n by divide & conquer
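The seven products transcribe directly into code; a sketch for the 2×2 case, checked against the ordinary 8-multiply formula:

```python
def strassen_2x2(a11, a12, a21, a22, b11, b12, b21, b22):
    # Strassen's 7 products and 18 additions for a 2x2 multiply.
    p1 = (a12 - a22) * (b21 + b22)
    p2 = (a11 + a22) * (b11 + b22)
    p3 = (a11 - a21) * (b11 + b12)
    p4 = (a11 + a12) * b22
    p5 = a11 * (b12 - b22)
    p6 = a22 * (b21 - b11)
    p7 = (a21 + a22) * b11
    m11 = p1 + p2 - p4 + p6
    m12 = p4 + p5
    m21 = p6 + p7
    m22 = p2 - p3 + p5 - p7
    return m11, m12, m21, m22

# Sanity check: [[1,2],[3,4]] * [[5,6],[7,8]] = [[19,22],[43,50]]
print(strassen_2x2(1, 2, 3, 4, 5, 6, 7, 8))   # (19, 22, 43, 50)
```

Applied recursively to the four n/2 × n/2 blocks, this is exactly the divide-and-conquer step that gives O(n^log2 7).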
Strassen (continued) T(n) = Cost of multiplying nxn matrices = 7*T(n/2) + 18*(n/2)^2 = O(n^log_2 7) = O(n^2.81)
° Available in several libraries
° Up to several times faster if n large enough (100s)
° Needs more memory than standard algorithm
° Can be less accurate because of roundoff error
° Current world’s record is O(n^2.376…)
Other Fast Matrix Multiplication Algorithms
• Current world’s record is O(n^2.376…) (Coppersmith & Winograd)
• Possibility of O(n^(2+ε)) algorithm! (Cohn, Umans, Kleinberg, 2003)
• http://www.siam.org/pdf/news/174.pdf
• http://arxiv.org/PS_cache/math/pdf/0511/0511460.pdf
• Fast methods (besides Strassen) may need unrealistically large n
Amdahl’s Law
Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl’s Law are given below:
tN = (fp/N + fs)t1 Effect of multiple processors on run time
S = 1/(fs + fp/N) Effect of multiple processors on speedup
Where:
fs = serial fraction of code
fp = parallel fraction of code = 1 - fs
N = number of processors
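A sketch of the speedup expression S = 1/(fs + fp/N), showing how quickly even a small serial fraction bites:

```python
def amdahl_speedup(fs, N):
    # S = 1 / (fs + fp/N) with fp = 1 - fs
    return 1.0 / (fs + (1.0 - fs) / N)

for fs in (0.0, 0.01, 0.1):
    print(fs, amdahl_speedup(fs, 100))
# fs = 0 gives the ideal speedup of 100 on 100 processors; a 1% serial
# fraction already cuts that roughly in half; 10% caps it near 1/fs = 10.
```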
Illustration of Amdahl’s Law
It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.
[Figure: speedup vs. number of processors (0-250) for fp = 1.000, 0.999, 0.990, 0.900.]
Amdahl’s Law Vs. Reality
Amdahl’s Law provides a theoretical upper limit on parallel speedup assuming that there are no costs for communications. In reality, communications (and I/O) will result in a further degradation of performance.
[Figure: speedup vs. number of processors (0-250) for fp = 0.99 - Amdahl's Law vs. reality.]
Optimizing Matrix Addition for Caches
Dimension A(n,n), B(n,n), C(n,n)
A, B, C stored by column (as in Fortran)
Algorithm 1:
¾ for i=1:n, for j=1:n, A(i,j) = B(i,j) + C(i,j)
Algorithm 2:
¾ for j=1:n, for i=1:n, A(i,j) = B(i,j) + C(i,j)
What is the "memory access pattern" for Algs 1 and 2?
Which is faster?
What if A, B, C are stored by row (as in C)?
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }
2 misses per access to a & c vs. one miss per access; improves spatial locality
Optimizing Matrix Multiply for Caches
Several techniques for making this faster on modern processors
¾ heavily studied
Some optimizations done automatically by compiler, but can do much better
In general, you should use optimized libraries (often supplied by vendor) for this and other very common linear algebra operations
¾ BLAS = Basic Linear Algebra Subroutines
Other algorithms you may want are not going to be supplied by vendor, so need to know these techniques
Using a Simple Model of Memory to Optimize
Assume just 2 levels in the hierarchy, fast and slow
All data initially in slow memory
¾ m = number of memory elements (words) moved between fast and slow memory
¾ tm = time per slow memory operation
¾ f = number of arithmetic operations
¾ tf = time per arithmetic operation << tm
¾ q = f/m = average number of flops per slow memory access (computational intensity: key to algorithm efficiency)
Minimum possible time = f*tf when all data in fast memory
Actual time
¾ f*tf + m*tm = f*tf * (1 + tm/tf * 1/q)
Larger q means time closer to minimum f*tf
¾ q >= tm/tf needed to get at least half of peak speed (tm/tf is the machine balance: key to machine efficiency)
Warm up: Matrix-vector multiplication y = y + A*x
for i = 1:n
   for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)
[Diagram: y(i) = y(i) + A(i,:)*x(:).]
Warm up: Matrix-vector multiplication
{read x(1:n) into fast memory}
{read y(1:n) into fast memory}
for i = 1:n
   {read row i of A into fast memory}
   for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)
{write y(1:n) back to slow memory}
• m = number of slow memory refs = 3n + n^2
• f = number of arithmetic operations = 2n^2
• q = f/m ~= 2
• Matrix-vector multiplication limited by slow memory speed
• Think of q as reuse of data
Matrix Multiply C=C+A*B
for i = 1 to n
   for j = 1 to n
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
[Diagram: C(i,j) = C(i,j) + A(i,:)*B(:,j).]
Matrix Multiply C=C+A*B (unblocked, or untiled)
for i = 1 to n
   {read row i of A into fast memory}
   for j = 1 to n
      {read C(i,j) into fast memory}
      {read column j of B into fast memory}
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}
Matrix Multiply q=ops/slow mem ref (unblocked, or untiled)
Number of slow memory references on unblocked matrix multiply
m = n^3      (read each column of B n times)
  + n^2      (read each row of A once)
  + 2*n^2    (read and write each element of C once)
= n^3 + 3*n^2
So q = f/m = (2*n^3)/(n^3 + 3*n^2) ~= 2 for large n: no improvement over matrix-vector multiply
Naïve Matrix Multiply on RS/6000
[Figure: log cycles/flop vs. log problem size; the fit is T = N^4.7. Size 2000 took 5 days; size 12000 would take 1095 years.]
O(N^3) performance would have constant cycles/flop
Performance looks like O(N^4.7)
Slide source: Larry Carter, UCSD
Naïve Matrix Multiply on RS/6000
[Figure: log cycles/flop vs. log problem size, annotated with the miss regimes: cache miss every 16 iterations; TLB miss every iteration; page miss every 512 iterations; page miss every iteration.]
Slide source: Larry Carter, UCSD
Matrix Multiply (blocked, or tiled)
Consider A, B, C to be N by N matrices of b by b subblocks, where b = n/N is called the blocksize
for i = 1 to N
   for j = 1 to N
      {read block C(i,j) into fast memory}
      for k = 1 to N
         {read block A(i,k) into fast memory}
         {read block B(k,j) into fast memory}
         C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
      {write block C(i,j) back to slow memory}
[Diagram: block C(i,j) = C(i,j) + A(i,k)*B(k,j).]
Matrix Multiply (blocked or tiled) q=ops/slow mem ref
n = size of matrix, b = blocksize, N = number of blocks (why is this algorithm correct?)
Number of slow memory references on blocked matrix multiply
m = N*n^2    (read each block of B N^3 times: N^3 * n/N * n/N)
  + N*n^2    (read each block of A N^3 times)
  + 2*n^2    (read and write each block of C once)
= (2*N + 2)*n^2
So q = f/m = 2*n^3 / ((2*N + 2)*n^2) ~= n/N = b for large n
So we can improve performance by increasing the blocksize b
Can be much faster than matrix-vector multiply (q=2)
Limit: All three blocks from A,B,C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large: 3*b2 <= M, so q ~= b <= sqrt(M/3)
Theorem (Hong, Kung, 1981): Any reorganization of this algorithm (that uses only associativity) is limited to q = O(sqrt(M))
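A plain-Python sketch of the tiled loop nest above (lists of lists, blocksize b), checked against the untiled triple loop:

```python
def matmul_tiled(A, B, n, b):
    # Blocked C = A*B: loop over b x b tiles so all three blocks
    # can sit in fast memory at once (3*b^2 <= M in the analysis above).
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            for kk in range(0, n, b):
                for i in range(ii, min(ii + b, n)):
                    for j in range(jj, min(jj + b, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + b, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C

n, b = 8, 4
A = [[(i + j) % 5 for j in range(n)] for i in range(n)]
B = [[(i * j) % 7 for j in range(n)] for i in range(n)]
ref = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
assert matmul_tiled(A, B, n, b) == ref   # same answer as the untiled loop
```

The reordering only regroups the additions into C(i,j), which is why the result is unchanged; only the memory traffic differs.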
More on BLAS (Basic Linear Algebra Subroutines)
Industry standard interface (evolving)
Vendors, others supply optimized implementations
History
¾ BLAS1 (1970s):
» vector operations: dot product, saxpy (y=a*x+y), etc
» m=2*n, f=2*n, q ~1 or less
¾ BLAS2 (mid 1980s)
» matrix-vector operations: matrix vector multiply, etc
» m=n^2, f=2*n^2, q~2, less overhead
» somewhat faster than BLAS1
¾ BLAS3 (late 1980s)
» matrix-matrix operations: matrix matrix multiply, etc
» m >= 4n^2, f=O(n^3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2
Good algorithms use BLAS3 when possible (LAPACK)
www.netlib.org/blas, www.netlib.org/lapack
BLAS for Performance
Intel Pentium 4 w/SSE2 1.7 GHz
[Figure: Mflop/s vs. order of vectors/matrices (10-500) for Level 1, 2, and 3 BLAS.]
Development of blocked algorithms important for performance
BLAS for Performance
Alpha EV 5/6 500MHz (1Gflop/s peak)
[Figure: Mflop/s vs. order of vectors/matrices (10-500): BLAS 3 (n-by-n matrix matrix multiply) vs BLAS 2 (n-by-n matrix vector multiply) vs BLAS 1 (saxpy of n vectors).]
Development of blocked algorithms important for performance
Optimizing in practice
Tiling for registers
¾ loop unrolling, use of named "register" variables
Tiling for multiple levels of cache
Exploiting fine-grained parallelism within the processor
¾ super scalar
¾ pipelining
Complicated compiler interactions
Hard to do by hand (but you'll try)
Automatic optimization an active research area
¾ PHIPAC: www.icsi.berkeley.edu/~bilmes/phipac
¾ www.cs.berkeley.edu/~iyer/asci_slides.ps
¾ ATLAS: www.netlib.org/atlas/index.html
Summary
Performance programming on uniprocessors requires ¾ understanding of memory system » levels, costs, sizes ¾ understanding of fine-grained parallelism in processor to produce good instruction mix Blocking (tiling) is a basic approach that can be applied to many matrix algorithms Applies to uniprocessors and parallel processors ¾ The technique works for any architecture, but choosing the blocksize b and other details depends on the architecture Similar techniques are possible on other data structures You will get to try this in Assignment 2 (see the class homepage)
Summary: Memory Hierarchy
Virtual memory was controversial at the time: can SW automatically manage 64KB across many programs? ¾1000X DRAM growth removed the controversy Today VM allows many processes to share single memory without having to swap all processes to disk; today VM protection is more important than memory hierarchy Today CPU time is a function of (ops, cache misses) vs. just f(ops): What does this mean to Compilers, Data structures, Algorithms?
Performance = Effective Use of Memory Hierarchy
Can only do arithmetic on data at the top of the hierarchy
Higher level BLAS lets us do this

BLAS                 Memory Refs   Flops   Flops/Memory Ref
Level 1 (y = y+αx)   3n            2n      2/3
Level 2 (y = y+Ax)   n^2           2n^2    2
Level 3 (C = C+AB)   4n^2          2n^3    n/2

[Figure: Level 1, 2 & 3 BLAS, Mflop/s vs. order of vectors/matrices, Intel PII 450 MHz.]
Development of blocked algorithms important for performance
Improving Ratio of Floating Point Operations to Memory Accesses

      subroutine mult(n1,nd1,n2,nd2,y,a,x)
      implicit real*8 (a-h,o-z)
      dimension a(nd1,nd2),y(nd2),x(nd1)
      do 10, i=1,n1
         t=0.d0
         do 20, j=1,n2
   20       t=t+a(j,i)*x(j)        ! **** 2 FLOPS, 2 LOADS
   10    y(i)=t
      return
      end
Improving Ratio of Floating Point Operations to Memory Accesses

c     works correctly when n1,n2 are multiples of 4
      dimension a(nd1,nd2), y(nd2), x(nd1)
      do i=1,n1-3,4
        t1=0.d0
        t2=0.d0
        t3=0.d0
        t4=0.d0
        do j=1,n2-3,4
          t1=t1+a(j+0,i+0)*x(j+0)+a(j+1,i+0)*x(j+1)+
     1          a(j+2,i+0)*x(j+2)+a(j+3,i+0)*x(j+3)
          t2=t2+a(j+0,i+1)*x(j+0)+a(j+1,i+1)*x(j+1)+
     1          a(j+2,i+1)*x(j+2)+a(j+3,i+1)*x(j+3)
          t3=t3+a(j+0,i+2)*x(j+0)+a(j+1,i+2)*x(j+1)+
     1          a(j+2,i+2)*x(j+2)+a(j+3,i+2)*x(j+3)
          t4=t4+a(j+0,i+3)*x(j+0)+a(j+1,i+3)*x(j+1)+
     1          a(j+2,i+3)*x(j+2)+a(j+3,i+3)*x(j+3)
        enddo                      ! 32 FLOPS, 20 LOADS per iteration
        y(i+0)=t1
        y(i+1)=t2
        y(i+2)=t3
        y(i+3)=t4
      enddo
Amdahl’s Law - Theoretical Maximum Speedup of parallel execution
¾ speedup = 1/(P/N + S)
» P (parallel code fraction), S (serial code fraction), N (processors)
¾ Example: Image processing
» 30 minutes of preparation (serial)
» One minute to scan a region
» 30 minutes of cleanup (serial)
Speedup is restricted by serial portion. And, speedup increases with greater number of cores!
Illustration of Amdahl’s Law
It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.
[Figure: speedup vs. number of processors (0-250) for fp = 1.000, 0.999, 0.990, 0.900. What's going on here?]
Amdahl's Law vs. Reality

Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communication. In reality, communication (and I/O) will result in a further degradation of performance.
[Figure: speedup vs. number of processors (0-250) for fp = 0.99, comparing Amdahl's Law with measured reality; the measured curve falls below the theoretical one.]
Gustafson's Law
Thus, Amdahl's Law predicts that there is a maximum scalability for an application, determined by its parallel fraction, and this limit is generally not large.

There is a way around this: increase the problem size
¾ bigger problems mean bigger grids or more particles: bigger arrays
¾ the number of serial operations generally remains constant, while the number of parallel operations increases: the parallel fraction increases
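The resulting scaled speedup is usually written as Gustafson's Law, S = fs + fp·N. A minimal sketch contrasting it with the fixed-size (Amdahl) limit (the formula is the standard textbook form, not quoted from the slide):

```python
def amdahl_speedup(fp, n):
    # fixed problem size: serial fraction dominates as n grows
    return 1.0 / ((1.0 - fp) + fp / n)

def gustafson_speedup(fp, n):
    # scaled problem size: S = fs + fp*N, so the parallel part grows
    # with the machine and the speedup keeps increasing
    return (1.0 - fp) + fp * n

# With fp = 0.99, the fixed-size speedup saturates near 100,
# while the scaled speedup grows essentially linearly with N:
for n in (100, 1000):
    print(n, round(amdahl_speedup(0.99, n), 1),
          round(gustafson_speedup(0.99, n), 1))
```

The comparison makes the slide's point numerically: growing the problem with the machine keeps the parallel fraction high, so scalability is no longer capped by 1/fs.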
Parallel Performance Metrics: Speedup
[Figure: relative performance (speedup vs. processors, with ideal curves) and absolute performance (MFLOPS vs. processors) for the T3E and O2K.]

Speedup is only one characteristic of a program - it is not synonymous with performance. In this comparison of two machines the code achieves comparable speedups, but one of the machines is faster.
Fixed-Problem Size Scaling

• a.k.a. fixed-load, fixed-problem-size, strong scaling, problem-constrained, constant-problem-size (CPS), variable-subgrid scaling
• Amdahl limit: SA(n) = T(1) / T(n) = 1 / ( f/n + (1 - f) )
• This bounds the speedup based only on the fraction of the code that cannot use parallelism (1 - f); it ignores all other factors
• SA --> 1 / (1 - f) as n --> ∞
Fixed-Problem Size Scaling (Cont’d)
• Efficiency (n) = T(1) / [ T(n) * n]
• Per-processor memory requirements decrease with n
• The surface-to-volume ratio of each processor's subdomain increases with n, so communication grows relative to computation
• Superlinear speedup is possible from cache effects
• Motivation: what is the largest number of processors I can use effectively, and what is the fastest time in which I can solve a given problem?
• Problems:
  - Sequential runs are often not possible (large problems)
  - Speedup (and efficiency) is misleading if the processors are slow
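The efficiency definition above is just speedup divided by n. A small sketch with hypothetical strong-scaling timings (the numbers are made up for illustration, not measured data):

```python
def speedup(t1, tn):
    # S(n) = T(1) / T(n)
    return t1 / tn

def efficiency(t1, tn, n):
    # Efficiency(n) = T(1) / (T(n) * n) = S(n) / n
    return t1 / (tn * n)

# Hypothetical timings (seconds) for one fixed-size problem:
timings = {1: 100.0, 4: 27.0, 16: 8.5, 64: 3.9}
for n, tn in timings.items():
    print(n, round(speedup(100.0, tn), 1), round(efficiency(100.0, tn, n), 2))
```

The typical strong-scaling pattern appears: speedup keeps rising while efficiency steadily falls, which is exactly why "largest n I can use effectively" is the question worth asking.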
Fixed-Problem Size Scaling: Examples

S. Goedecker and Adolfy Hoisie, "Achieving High Performance in Numerical Computations on RISC Workstations and Parallel Systems," International Conference on Computational Physics (PC'97), Santa Cruz, August 25-28, 1997.
Fixed-Problem Size Scaling Examples
Scaled Speedup Experiments

• a.k.a. fixed subgrid-size, weak scaling, Gustafson scaling
• Motivation: we want to use a larger machine to solve a larger global problem in the same amount of time.
• Memory and surface-to-volume effects remain constant.
Scaled Speedup Experiments
• Be wary of benchmarks that scale problems to unreasonably-large sizes
- scale the problem to fill the machine when a smaller size will do;
- simplify the science in order to add computation -> “World’s largest MD simulation - 10 gazillion particles!”
- run grid sizes for only a few cycles, because the full run won't finish during this lifetime or because the resolution makes no sense compared with the resolution of the input data
• Suggested alternate approach (Gustafson): constant-time benchmarks - run the code for a fixed time and measure the work done
Example of a Scaled Speedup Experiment

Processors  NChains   Time   Natoms   Time per   Time per      Efficiency
                                      Atom       Atom per PE
      1        32     38.4     2368   1.62E-02   1.62E-02      1.000
      2        64     38.4     4736   8.11E-03   1.62E-02      1.000
      4       128     38.5     9472   4.06E-03   1.63E-02      0.997
      8       256     38.6    18944   2.04E-03   1.63E-02      0.995
     16       512     38.7    37888   1.02E-03   1.63E-02      0.992
     32       940     35.7    69560   5.13E-04   1.64E-02      0.987
     64      1700     32.7   125800   2.60E-04   1.66E-02      0.975
    128      2800     27.4   207200   1.32E-04   1.69E-02      0.958
    256      4100    20.75   303400   6.84E-05   1.75E-02      0.926
    512      5300    14.49   392200   3.69E-05   1.89E-02      0.857

(TBON on ASCI Red)
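The derived columns in the table follow directly from the raw numbers: weak-scaling efficiency here is the 1-processor time per atom divided by the time per atom per PE at N processors. A sketch reproducing a few rows:

```python
# (processors, time, natoms) rows taken from the table above
rows = [(1, 38.4, 2368), (8, 38.6, 18944), (64, 32.7, 125800), (512, 14.49, 392200)]

base = rows[0][1] / rows[0][2]          # time per atom on 1 PE: ~1.62e-2
for n, t, atoms in rows:
    per_atom_per_pe = t / atoms * n     # time per PE's share of the atoms
    print(n, round(base / per_atom_per_pe, 3))
```

This reproduces the table's efficiency column (1.0, 0.995, 0.975, 0.857) to three decimals; the slow drift below 1.0 is the weak-scaling overhead growing with machine size.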
[Figure: efficiency (0.44-1.04) vs. number of processors (0-600) for the data above.]