Lecture 7: Floating Point Arithmetic, Memory Hierarchy, and Cache

Question:

‹ Suppose we want to compute using four-decimal-digit arithmetic: ¾ S = 1.000 + 1.000 x 10^4 – 1.000 x 10^4 ¾ What’s the answer?

‹ Ariane 5 rocket ¾ June 1996: exploded when a 64-bit floating point number relating to the horizontal velocity of the rocket was converted to a 16-bit signed integer. The number was larger than 32,767, and thus the conversion failed. ¾ $500M rocket and cargo
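The same loss of a small addend can be reproduced in IEEE single precision. A minimal C sketch (my own illustration, not from the slides), using 1.0e8 because it needs more than 24 significand bits and so swallows an added 1.0:

    #include <stdio.h>

    int main(void) {
        /* Four-digit decimal isn't directly available in C, but single precision
           shows the same effect. The volatile temporaries keep the compiler from
           evaluating in higher intermediate precision. */
        volatile float t = 1.0f + 1.0e8f;   /* rounds to 1.0e8f                */
        float s1 = t - 1.0e8f;              /* 0.0: the 1.0 has vanished       */

        volatile float u = 1.0e8f - 1.0e8f; /* 0.0                             */
        float s2 = 1.0f + u;                /* 1.0: reordering saves the term  */

        printf("left-to-right: %g, reassociated: %g\n", s1, s2);
        return 0;
    }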

Defining Floating Point Arithmetic

‹ Representable numbers: +/- d.d…d x r^exp ¾ sign bit +/- ¾ radix r (usually 2 or 10, sometimes 16) ¾ d.d…d (how many base-r digits d?) ¾ exponent exp (range?) ¾ others? ‹ Operations: ¾ arithmetic: +, -, x, /, ... » how to round result to fit in format ¾ comparison (<, =, >) ¾ conversion between different formats » short to long FP numbers, FP to integer ¾ exception handling » what to do for 0/0, 2*largest_number, etc. ¾ binary/decimal conversion » for I/O, when radix not 10

IEEE Floating Point Arithmetic Standard 754 (1985) - Normalized Numbers

‹ Normalized Nonzero Representable Numbers: +- 1.d…d x 2^exp ¾ Macheps = 2^(-#significand bits) = relative error in each operation = smallest number ε such that fl(1 + ε) > 1 (a small experiment below illustrates this)

¾ OV = overflow threshold = largest number ¾ UN = underflow threshold = smallest number

Format            # bits   # significand bits   macheps              # exponent bits   exponent range
Single            32       23+1                 2^-24  (~10^-7)      8                 2^-126 to 2^127    (~10^±38)
Double            64       52+1                 2^-53  (~10^-16)     11                2^-1022 to 2^1023  (~10^±308)
Double Extended   >=80     >=64                 <=2^-64 (~10^-19)    >=15              2^-16382 to 2^16383 (~10^±4932)
(Double Extended is 80 bits on all Intel machines)

‹ +- Zero: sign +-, significand and exponent all zero ¾ Why bother with -0? (later)
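A minimal C sketch (illustrative, not from the slides) that finds the machine epsilon empirically by halving ε until fl(1 + ε) no longer exceeds 1, assuming plain IEEE double arithmetic without extended intermediate precision:

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        /* Shrink eps until adding it to 1.0 no longer changes the result.
           The loop stops at the spacing between 1.0 and the next representable
           number, DBL_EPSILON = 2^-52. */
        double eps = 1.0;
        while (1.0 + eps / 2.0 > 1.0)
            eps /= 2.0;

        printf("empirical eps = %g\n", eps);          /* ~2.22e-16 */
        printf("DBL_EPSILON   = %g\n", DBL_EPSILON);  /* same value */
        /* The "macheps" in the table, 2^-53, is the bound on the relative error
           of a single rounded operation; it is half of DBL_EPSILON. */
        printf("unit roundoff = %g\n", DBL_EPSILON / 2.0);
        return 0;
    }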

IEEE Floating Point Arithmetic Standard 754 - “Denorms”

‹ Denormalized Numbers: +-0.d…d x 2^min_exp ¾ sign bit, nonzero significand, minimum exponent ¾ Fills in gap between UN and 0 ‹ Underflow Exception ¾ occurs when exact nonzero result is less than underflow threshold UN ¾ Ex: UN/3 ¾ return a denorm, or zero

IEEE Floating Point Arithmetic Standard 754 - +- Infinity

‹+- Infinity: Sign bit, zero significand, maximum exponent ‹Overflow Exception ¾occurs when exact finite result too large to represent accurately ¾Ex: 2*OV ¾return +- infinity ‹Divide by zero Exception ¾return +- infinity = 1/+-0 ¾sign of zero important! ‹Also return +- infinity for ¾3+infinity, 2*infinity, infinity*infinity ¾Result is exact, not an exception!
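A small C sketch (my own example) of these rules, assuming IEEE 754 double arithmetic with the default exception responses (a result is returned and execution continues):

    #include <stdio.h>

    int main(void) {
        double big  = 1.0e308;            /* a large finite double, near OV      */
        double zero = 0.0;

        double inf = 1.0 / zero;          /* divide-by-zero: returns +infinity   */
        printf("2 * big   : %g\n", 2.0 * big);    /* overflow: returns inf       */
        printf("1 / +0    : %g\n", 1.0 / zero);   /* inf                         */
        printf("1 / -0    : %g\n", 1.0 / -zero);  /* -inf: sign of zero matters  */
        printf("3 + inf   : %g\n", 3.0 + inf);    /* inf: exact, not an exception */
        printf("inf * inf : %g\n", inf * inf);    /* inf: exact, not an exception */
        return 0;
    }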

IEEE Floating Point Arithmetic Standard 754 - NAN (Not A Number)

‹NAN: Sign bit, nonzero significand, maximum exponent ‹Invalid Exception ¾occurs when exact result not a well-defined real number ¾0/0 ¾sqrt(-1) ¾infinity-infinity, infinity/infinity, 0*infinity ¾NAN + 3 ¾NAN > 3? ¾Return a NAN in all these cases ‹Two kinds of NANs ¾Quiet - propagates without raising an exception ¾Signaling - generate an exception when touched » good for detecting uninitialized data
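A hedged C sketch of the invalid operations and of how NaN behaves (illustrative; the exact text printed for NaN varies by platform):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double zero = 0.0;
        double inf  = 1.0 / zero;      /* +infinity, as on the previous slide */

        double a = zero / zero;        /* invalid: 0/0 -> NaN       */
        double b = sqrt(-1.0);         /* invalid: sqrt(-1) -> NaN  */
        double c = inf - inf;          /* invalid: inf - inf -> NaN */

        printf("0/0 = %g, sqrt(-1) = %g, inf-inf = %g\n", a, b, c);
        printf("NaN + 3 = %g\n", a + 3.0);     /* NaN propagates              */
        printf("NaN > 3 is %d\n", a > 3.0);    /* 0: comparisons with NaN     */
        printf("NaN == NaN is %d\n", a == a);  /* 0: are all false            */
        return 0;
    }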


Error Analysis

‹Basic error formula ¾fl(a op b) = (a op b)*(1 + d) where » op one of +, -, *, / » |d| <= macheps » assuming no overflow, underflow, or divide by zero ‹Example: adding 4 numbers

¾fl(x1+x2+x3+x4) = {[(x1+x2)*(1+d1) + x3]*(1+d2) + x4}*(1+d3) = x1*(1+d1)*(1+d2)*(1+d3) + x2*(1+d1)*(1+d2)*(1+d3) + x3*(1+d2)*(1+d3) + x4*(1+d3) = x1*(1+e1) + x2*(1+e2) + x3*(1+e3) + x4*(1+e4) where each |ei| <~ 3*macheps ¾get exact sum of slightly changed summands xi*(1+ei) ¾Backward Error Analysis - an algorithm is called numerically stable if it gives the exact result for slightly changed inputs ¾Numerical Stability is an algorithm design goal
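A small C experiment (my own, not from the slides) that makes the bound visible: summing n copies of 0.1 in single precision drifts noticeably, but stays within the backward-error bound of roughly n * macheps:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* Each of the n additions contributes a rounding error bounded by
           macheps, so the relative error can grow up to about n * 2^-24. */
        const int n = 1000000;
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += 0.1f;

        double reference = 0.1 * (double)n;   /* far more accurate, for comparison */
        double rel = fabs((double)s - reference) / reference;
        printf("float sum = %.8g, relative error = %.2e\n", (double)s, rel);
        /* Typically prints a relative error on the order of 1e-2: large, but
           well within the worst-case bound n * macheps ~ 6e-2. */
        return 0;
    }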

Backward error

‹ Approximate solution is exact solution to modified problem. ‹ How large a modification to the original problem is required to give the result actually obtained? ‹ How much data error in initial input would be required to explain all the error in computed results? ‹ Approximate solution is good if it is exact solution to “nearby” problem.

[Diagram: the true map f takes x to f(x); the computed map f’ takes x to f’(x). Backward error = the perturbation from x to x’ such that f(x’) = f’(x); forward error = the difference between f(x) and f’(x).]

Sensitivity and Conditioning ‹ Problem is insensitive or well conditioned if relative change in input causes commensurate relative change in solution. ‹ Problem is sensitive or ill-conditioned, if relative change in solution can be much larger than that in input data.

Cond = |Relative change in solution| / |Relative change in input data| = |[f(x’) – f(x)]/f(x)| / |(x’ – x)/x|

‹ Problem is sensitive, or ill-conditioned, if cond >> 1.

‹ When function f is evaluated for approximate input x’ = x + h instead of true input value of x: ‹ Absolute error = f(x + h) – f(x) ≈ h f’(x) ‹ Relative error = [f(x + h) – f(x)] / f(x) ≈ h f’(x) / f(x)

Sensitivity: 2 Examples: cos(π/2) and 2-d System of Equations

‹ Consider the problem of computing the cosine function for arguments near π/2 (the second example is the 2-d system a*x1 + b*x2 = f, c*x1 + d*x2 = g). ‹ Let x ≈ π/2 and let h be a small perturbation to x. Then: Abs: f(x + h) – f(x) ≈ h f’(x); Rel: [f(x + h) – f(x)] / f(x) ≈ h f’(x) / f(x); absolute error = cos(x+h) – cos(x) ≈ -h sin(x) ≈ -h; relative error ≈ -h tan(x) ≈ ∞

‹ So small change in x near π/2 causes large relative change in cos(x) regardless of method used. ‹ cos(1.57079) = 0.63267949 x 10^-5 ‹ cos(1.57078) = 1.63267949 x 10^-5 ‹ Relative change in output is a quarter million times greater than relative change in input.
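A quick C check of these numbers (illustrative; the exact printed digits depend on the math library):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double x1 = 1.57079, x2 = 1.57078;   /* two nearby inputs close to pi/2 */
        double y1 = cos(x1), y2 = cos(x2);

        double rel_in  = fabs(x2 - x1) / fabs(x1);   /* ~6.4e-6 */
        double rel_out = fabs(y2 - y1) / fabs(y1);   /* ~1.6    */

        printf("cos(%.5f) = %.8e\n", x1, y1);
        printf("cos(%.5f) = %.8e\n", x2, y2);
        printf("amplification = %.0f\n", rel_out / rel_in);  /* ~2.5e5: the
                                                                 problem is ill-conditioned */
        return 0;
    }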


Exception Handling

‹What happens when the “exact value” is not a real number, or too small or too large to represent accurately? ‹5 Exceptions: ¾Overflow - exact result > OV, too large to represent ¾Underflow - exact result nonzero and < UN, too small to represent ¾Divide-by-zero - nonzero/0 ¾Invalid - 0/0, sqrt(-1), … ¾Inexact - you made a rounding error (very common!) ‹Possible responses ¾Stop with error message (unfriendly, not default) ¾Keep computing (default, but how?)

Summary of Values Representable in IEEE FP

Value                          Sign   Exponent           Significand
+- Zero                        +-     0…0                0……………………0
Normalized nonzero numbers     +-     not 0 or all 1s    anything
Denormalized numbers           +-     0…0                nonzero
+- Infinity                    +-     1….1               0……………………0
NANs                           +-     1….1               nonzero
  ¾ Signaling and quiet
  ¾ Many systems have only quiet

Questions?

More on the In-Class Presentations

‹ Start at 1:00 on Monday, 5/5/08 ‹ Turn in reports on Monday, 5/2/08 ‹ Presentations roughly 20 minutes each ‹ Use PowerPoint or PDF ‹ Describe your project, perhaps motivate via application ‹ Describe your method/approach ‹ Provide comparison and results

‹ See me about your topic

Cache and Memory

Cache and Its Importance in Performance

‹ Motivation: ¾ Time to run code = clock cycles running code + clock cycles waiting for memory ¾ For many years, CPUs have sped up an average of 50% per year over memory chip speedups.

‹ Hence, memory access is the bottleneck to computing fast.

[Figure: Latency in a Single System, 1997-2009: CPU clock period (ns), memory system access time (ns), and their ratio. µProc speed improves ~60%/yr (2X/1.5 yr) while DRAM latency improves ~9%/yr (2X/10 yrs), so the processor-memory performance gap grows ~50% per year.]

Here’s your problem

‹ Say 2.26 GHz ¾ 2 ops/cycle DP ¾ 4.52 Gflop/s peak ‹ FSB 533 MHz ¾ 32-bit data path (4 bytes), or 2.132 GB/s ‹ With 8 bytes/word (DP) ¾ 266.5 MW/s from memory

Intel Clovertown

‹ Quad-core processor ‹ Each core does 4 floating point ops/cycle ‹ Say 2.4 GHz ¾ thus 4 cores * 4 flops/cycle * 2.4 GHz = 38.4 Gflop/s peak ‹ FSB 1.066 GHz ¾ 1.066 GHz * 4 B = 4.26 GB/s, / 8 B/word = 533 MW/s

» There’s your problem

Commodity Processor Trends Bandwidth/Latency is the Critical Issue, not FLOPS

Got Bandwidth?

                                 Annual     Typical value      Typical value      Typical value
                                 increase   in 2005            in 2010            in 2020
Single-chip floating-point       59%        4 GFLOP/s          32 GFLOP/s         3300 GFLOP/s
performance
Front-side bus bandwidth         23%        1 GWord/s          3.5 GWord/s        27 GWord/s
                                            = 0.25 word/flop   = 0.11 word/flop   = 0.008 word/flop
DRAM latency                     (5.5%)     70 ns              50 ns              28 ns
                                            = 280 FP ops       = 1600 FP ops      = 94,000 FP ops
                                            = 70 loads         = 170 loads        = 780 loads

Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.

Solving the Memory Bottleneck

‹ Since we cannot make fast enough memories, we invented the memory hierarchy ¾ L1 Cache (on chip) ¾ L2 Cache ¾ Optional L3 Cache ¾ Main Memory ¾ Hard Drive

Cache Memories ‹ Cache memories are small, fast SRAM-based memories managed automatically in hardware. ¾ Hold frequently accessed blocks of main memory ‹ CPU looks first for data in L1, then in L2, then in main memory. ‹ Typical bus structure:

[Diagram: CPU chip containing the register file, ALU, and L1 cache; a cache bus to the L2 cache; a system bus through the I/O bridge; a memory bus to main memory.]

What is a cache?

‹ Small, fast storage used to improve average access time to slow memory. ‹ Exploits spatial and temporal locality ‹ In computer architecture, almost everything is a cache! ¾ Registers a “cache” on variables – software managed ¾ First-level cache a cache on second-level cache ¾ Second-level cache a cache on memory ¾ Memory a cache on disk (virtual memory) ¾ TLB a cache on page table ¾ Branch-prediction a cache on prediction information?

[Diagram: memory hierarchy pyramid: Proc/Regs, L1-Cache, L2-Cache, Memory, Disk/Tape, etc.; levels get bigger going down and faster going up.]

Cache Performance Metrics

‹ Miss Rate ¾ Fraction of memory references not found in cache (misses/references) ¾ Typical numbers: » 3-10% for L1 » can be quite small (e.g., < 1%) for L2, depending on size, etc. ‹ Hit Time ¾ Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache) ¾ Typical numbers: » 1 clock cycle for L1 » 3-8 clock cycles for L2 ‹ Miss Penalty ¾ Additional time required because of a miss » Typically 25-100 cycles for main memory
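These three metrics combine into an average memory access time (AMAT). A small C sketch (my own, with assumed values drawn from the typical ranges above, not measurements):

    #include <stdio.h>

    int main(void) {
        /* Assumed, representative values: L1 hit = 1 cycle, L1 miss rate = 5%,
           L2 hit = 6 cycles, L2 miss rate = 1%, main memory penalty = 60 cycles. */
        double l1_hit = 1.0, l1_miss = 0.05;
        double l2_hit = 6.0, l2_miss = 0.01;
        double mem_penalty = 60.0;

        /* AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * memory penalty) */
        double amat = l1_hit + l1_miss * (l2_hit + l2_miss * mem_penalty);
        printf("average memory access time = %.2f cycles\n", amat);  /* 1.33 */
        return 0;
    }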

Traditional Four Questions for Memory Hierarchy Designers

‹ Q1: Where can a block be placed in the upper level? (Block placement) ¾Fully Associative, Set Associative, Direct Mapped ‹ Q2: How is a block found if it is in the upper level? (Block identification) ¾Tag/Block ‹ Q3: Which block should be replaced on a miss? (Block replacement) ¾Random, LRU ‹ Q4: What happens on a write? (Write strategy) ¾Write Back or Write Through (with Write Buffer)

Cache-Related Terms

‹ ICACHE : Instruction cache ‹ DCACHE (L1) : Data cache closest to registers ‹ SCACHE (L2) : Secondary data cache ‹ TCACHE (L3) : Third level data cache ¾ Data from SCACHE has to go through DCACHE to registers ¾ TCACHE is larger than SCACHE, and SCACHE is larger than DCACHE ¾ Not all processors have TCACHE

Line Replacement Policy

‹ When 2 memory lines are in the cache and a 3rd line comes in, one of the two previous ones must be evicted: which one to choose? ‹ Since one doesn’t know the future, heuristics: ¾ LRU: Least Recently Used » Hard to implement ¾ FIFO » Easy to implement ¾ Random » Even easier to implement ‹ Overall, associative caches... ¾ can alleviate thrashing ¾ require more complex circuitry than direct-mapped ¾ are more expensive than direct-mapped ¾ are slower than direct-mapped

Three Types of Cache Misses

‹ Compulsory (or cold-start) misses ¾ First access to data ¾ Can be reduced via bigger cache lines ¾ Can be reduced via some pre-fetching ‹ Capacity misses ¾ Misses due to the cache not being big enough ¾ Can be reduced via a bigger cache ‹ Conflict misses ¾ Misses due to some other memory line having evicted the needed cache line ¾ Can be reduced via higher associativity

Write Policy: Write-Through

‹ What happens when the processor modifies memory that is in cache? ‹ Option #1: Write-through ¾ Write goes BOTH to cache and to main memory ¾ Memory and cache always consistent

[Diagram: write-through: CPU stores go to both the cache and main memory; loads are served from the cache.]

Write Policy: Write-Back

‹ Option #2 ¾ Write goes only to cache ¾ Cache lines are written back to memory when evicted ¾ Requires a “dirty” bit to indicate whether a cache line was written to or not ¾ Memory not always consistent with the cache

[Diagram: write-back: CPU stores go to the cache; dirty lines are written back to memory when evicted; loads are served from the cache.]

Cache Basics

‹Cache hit: a memory access that is found in the cache -- cheap ‹Cache miss: a memory access that is not in the cache - expensive, because we need to get the data from elsewhere ‹Consider a tiny cache (for illustration only)

[Diagram: a tiny cache with 3-bit addresses X000 … X111; each address splits into tag | line | offset fields.]

‹ Cache line length: number of bytes loaded together in one entry ‹ Direct mapped: only one address (line) in a given range in cache ‹ Associative: 2 or more lines with different addresses exist
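A minimal C sketch (my own illustration) of how an address splits into tag, line index, and offset for a direct-mapped cache; the geometry is an assumption chosen for the example:

    #include <stdio.h>

    int main(void) {
        /* Assumed geometry: 64-byte lines, 256 lines, i.e. a 16 KB direct-mapped cache. */
        unsigned long line_bytes = 64;
        unsigned long num_lines  = 256;

        unsigned long addr = 0x12345678UL;   /* an arbitrary example address */

        unsigned long offset = addr % line_bytes;                /* byte within the line */
        unsigned long index  = (addr / line_bytes) % num_lines;  /* which line slot      */
        unsigned long tag    = addr / (line_bytes * num_lines);  /* identifies the block */

        printf("addr 0x%lx -> tag 0x%lx, index %lu, offset %lu\n",
               addr, tag, index, offset);
        return 0;
    }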

Direct-Mapped Cache

‹ Direct mapped cache: A block from main memory can go in exactly one place in the cache. This is called direct mapped because there is a direct mapping from any block address in memory to a single location in the cache.

[Diagram: direct-mapped cache and main memory.]

Fully Associative Cache

‹ Fully Associative Cache: A block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be associated with any entry in the cache.

[Diagram: fully associative cache and main memory.]

Set Associative Cache

‹ Set associative cache: The middle range of designs between direct mapped cache and fully associative cache is called set-associative cache. In an N-way set-associative cache a block from main memory can go into N (N > 1) locations in the cache.

[Diagram: 2-way set-associative cache and main memory.]

Here assume cache has 8 blocks, while memory has 32

Fully associative: block 12 can go anywhere. Direct mapped: block 12 can go only into cache block 4 (12 mod 8). Set associative: block 12 can go anywhere in set 0 (12 mod 4).

[Diagram: cache block numbers 0-7 for each organization, and memory block numbers 0-31.]


Tuning for Caches

1. Preserve locality. 2. Reduce cache thrashing. 3. Loop blocking when out of cache. 4. Software pipelining.

Registers

‹Registers are the source and destination of most CPU data operations.

‹They hold one element each.

‹They are made of static RAM (SRAM), which is very expensive.

‹The access time is usually 1-1.5 CPU clock cycles.

‹Registers are at the top of the memory subsystem.

The Principle of Locality

‹The Principle of Locality: ¾Programs access a relatively small portion of the address space at any instant of time. ‹Two Different Types of Locality: ¾Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse) ¾Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access) ‹Last 15 years, HW relied on locality for speed

Principles of Locality

‹ Temporal: an item referenced now will be again soon.

‹ Spatial: an item referenced now causes neighbors to be referenced soon.

‹ Lines, not words, are moved between memory levels. Both principles are satisfied. There is an optimal line size based on the properties of the data bus and the memory subsystem designs.

‹ Cache lines are typically 32-128 bytes, with 1024 being the longest currently.

Cache Thrashing

‹ Thrashing occurs when frequently used cache lines replace each other. There are three primary causes for thrashing: ¾Instructions and data can conflict, particularly in unified caches. ¾Too many variables or too large of arrays are accessed that do not fit into cache. ¾Indirect addressing, e.g., sparse matrices.

‹ Machine architects can add sets to the associativity. Users can buy another vendor’s machine. However, neither solution is realistic.

Counting cache misses

‹ nxn 2-D array, element size = e bytes, cache line size = b bytes

Traversal in the order the array is stored (consecutive elements share a cache line): ¾ One cache miss for every cache line: n^2 x e/b misses ¾ Total number of memory accesses: n^2 ¾ Miss rate: e/b ¾ Example: Miss rate = 4 bytes / 64 bytes = 6.25% ¾ Unless the array is very small

Traversal across the storage order (each access jumps to a different cache line): ¾ One cache miss for every access ¾ Example: Miss rate = 100% ¾ Unless the array is very small
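A small C sketch (my own illustration) of the two access patterns over a row-major array; timing either function with a large N makes the difference visible:

    #include <stdlib.h>

    #define N 4096

    /* Row-order traversal of a row-major array: consecutive accesses fall in
       the same cache line, so the miss rate is roughly e/b = 4/64 = 6.25%
       (assuming 4-byte elements and 64-byte lines). */
    static double sum_row_order(float (*a)[N]) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Column-order traversal of the same array: each access jumps
       N*sizeof(float) bytes, so nearly every access touches a different cache
       line once the array is much larger than the cache (miss rate ~100%). */
    static double sum_col_order(float (*a)[N]) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        float (*a)[N] = malloc(sizeof(float) * N * N);
        if (!a) return 1;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0f;
        double s = sum_row_order(a) + sum_col_order(a);  /* same sums, very
                                                            different memory behavior */
        free(a);
        return s > 0.0 ? 0 : 1;
    }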

Cache Coherence for Multiprocessors

‹ All data must be coherent between memory levels. Multiple processors with separate caches must inform the other processors quickly about data modifications (by the cache line). Only hardware is fast enough to do this.

‹ Standard protocols on multiprocessors: ¾ Snoopy: all processors monitor the memory bus. ¾ Directory based: Cache lines maintain an extra 2 bits per processor to maintain clean/dirty status bits.

‹ False sharing occurs when two different shared variables are located in the same cache block, causing the block to be exchanged between the processors even though the processors are accessing different variables. Size of block (line) important.

Indirect Addressing

      d = 0
      do i = 1,n
         j = ind(i)
         d = d + sqrt( x(j)*x(j) + y(j)*y(j) + z(j)*z(j) )
      end do

‹ Change the loop statement to

      d = d + sqrt( r(1,j)*r(1,j) + r(2,j)*r(2,j) + r(3,j)*r(3,j) )

‹ Note that r(1,j)-r(3,j) are in contiguous memory and probably are in the same cache line (d is probably in a register and is irrelevant). The original form uses 3 cache lines at every instance of the loop and can cause cache thrashing.

Cache Thrashing by Memory Allocation

parameter ( m = 1024*1024 ) real a(m), b(m)

‹ For a 4 Mb direct mapped cache, a(i) and b(i) are always mapped to the same cache line. This is trivially avoided using padding.

real a(m), extra(32), b(m)

‹ extra is at least 128 bytes in length, which is longer than a cache line on all but one memory subsystem that is available today.

Cache Blocking

‹ We want blocks to fit into cache. On parallel computers we have p x cache so that data may fit into cache on p processors, but not one. This leads to superlinear speed up! Consider matrix-matrix multiply.

      do k = 1,n
        do j = 1,n
          do i = 1,n
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
          end do
        end do
      end do

‹ An alternate form is ...

Cache Blocking

[Diagram: C (M x N) += A (M x K) * B (K x N), processed in NB x NB blocks.]

      do kk = 1,n,nblk
        do jj = 1,n,nblk
          do ii = 1,n,nblk
            do k = kk,kk+nblk-1
              do j = jj,jj+nblk-1
                do i = ii,ii+nblk-1
                  c(i,j) = c(i,j) + a(i,k) * b(k,j)
                end do
              end do
            end do
          end do
        end do
      end do

Lessons

‹ The actual performance of a simple program can be a complicated function of the architecture ‹ Slight changes in the architecture or program change the performance significantly ‹ Since we want to write fast programs, we must take the architecture into account, even on uniprocessors ‹ Since the actual performance is so complicated, we need simple models to help us design efficient algorithms ‹ We will illustrate with a common technique for improving cache performance, called blocking

Assignment 4

Strassen’s Matrix Multiply

‹ The traditional algorithm (with or without tiling) has O(n^3) flops ‹ Strassen discovered an algorithm with asymptotically lower flops ¾ O(n^2.81) ‹ Consider a 2x2 matrix multiply, normally 8 multiplies ¾ Strassen does it with 7 multiplies and 18 adds

Let M = [ m11 m12 ] = [ a11 a12 ] [ b11 b12 ]
        [ m21 m22 ]   [ a21 a22 ] [ b21 b22 ]

Let p1 = (a12 - a22) * (b21 + b22)     p5 = a11 * (b12 - b22)
    p2 = (a11 + a22) * (b11 + b22)     p6 = a22 * (b21 - b11)
    p3 = (a11 - a21) * (b11 + b12)     p7 = (a21 + a22) * b11
    p4 = (a11 + a12) * b22

Then m11 = p1 + p2 - p4 + p6           m12 = p4 + p5
     m21 = p6 + p7                     m22 = p2 - p3 + p5 - p7

Extends to nxn by divide & conquer

Strassen (continued) T(n) = Cost of multiplying nxn matrices = 7*T(n/2) + 18*(n/2)^2 = O(n^log_2 7) = O(n^2.81)
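For completeness, a brief unrolling of this recurrence (a standard divide-and-conquer argument, sketched here; not from the original slides) shows where the exponent comes from:

    \begin{aligned}
    T(n) &= 7\,T(n/2) + \tfrac{18}{4}\,n^2 \\
         &= 7^{k}\,T(n/2^{k}) + \tfrac{18}{4}\,n^2 \sum_{i=0}^{k-1}\left(\tfrac{7}{4}\right)^{i}.
    \end{aligned}

With k = log_2 n levels of recursion, both terms are of order 7^(log_2 n) = n^(log_2 7), so T(n) = O(n^(log_2 7)) ≈ O(n^2.81).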

° Available in several libraries ° Up to several times faster if n large enough (100s) ° Needs more memory than standard algorithm ° Can be less accurate because of roundoff error ° Current world’s record is O(n^2.376..)

Other Fast Matrix Multiplication Algorithms

• Current world’s record is O(n^2.376...) (Coppersmith & Winograd)

• Possibility of O(n^(2+ε)) algorithm! (Cohn, Umans, Kleinberg, 2003) • http://www.siam.org/pdf/news/174.pdf • http://arxiv.org/PS_cache/math/pdf/0511/0511460.pdf

• Fast methods (besides Strassen) may need unrealistically large n

Amdahl’s Law

Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl’s Law are given below:

tN = (fp/N + fs)t1 Effect of multiple processors on run time

S = 1/(fs + fp/N) Effect of multiple processors on speedup

Where: fs = serial fraction of code fp = parallel fraction of code = 1 - fs N = number of processors
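A tiny C sketch (my own) that evaluates the second form of the law; the serial fractions and processor counts are arbitrary example values:

    #include <stdio.h>

    /* Amdahl speedup: S = 1 / (fs + fp/N), with fp = 1 - fs. */
    static double amdahl_speedup(double fs, int nprocs) {
        double fp = 1.0 - fs;
        return 1.0 / (fs + fp / nprocs);
    }

    int main(void) {
        /* Example: 1% serial fraction. */
        printf("N=16:   S = %.1f\n", amdahl_speedup(0.01, 16));    /* ~13.9 */
        printf("N=256:  S = %.1f\n", amdahl_speedup(0.01, 256));   /* ~72   */
        printf("N->inf: S -> %.0f\n", 1.0 / 0.01);                 /* 100   */
        return 0;
    }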

Illustration of Amdahl’s Law

It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.

[Figure: speedup vs. number of processors (0-250) for fp = 1.000, 0.999, 0.990, 0.900.]

Amdahl’s Law Vs. Reality Amdahl’s Law provides a theoretical upper limit on parallel speedup assuming that there are no costs for communications. In reality, communications ( and I/O) will result in a further degradation of performance.

[Figure: speedup vs. number of processors (0-250) for fp = 0.99: Amdahl's Law prediction vs. reality.]

Optimizing Matrix Addition for Caches

‹ Dimension A(n,n), B(n,n), C(n,n) ‹ A, B, C stored by column (as in Fortran) ‹ Algorithm 1: ¾for i=1:n, for j=1:n, A(i,j) = B(i,j) + C(i,j) ‹ Algorithm 2: ¾for j=1:n, for i=1:n, A(i,j) = B(i,j) + C(i,j) ‹ What is the “memory access pattern” for Algs 1 and 2? ‹ Which is faster? ‹ What if A, B, C stored by row (as in C)?
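For C, where storage is row-major, the roles of the two loop orders reverse. A minimal C sketch (my own illustration) of both orders; the second visits memory with stride 1 and is the cache-friendly one:

    #define N 1024

    /* Analogue of Algorithm 1 for row-major storage: the inner loop walks down
       a column, so consecutive iterations jump N*sizeof(double) bytes and
       almost every access starts a new cache line. */
    void add_col_inner(double A[N][N], double B[N][N], double C[N][N]) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                A[i][j] = B[i][j] + C[i][j];
    }

    /* Analogue of Algorithm 2: the inner loop walks along a row, so each cache
       line brought in is fully used before it is evicted. */
    void add_row_inner(double A[N][N], double B[N][N], double C[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = B[i][j] + C[i][j];
    }

    int main(void) {
        static double A[N][N], B[N][N], C[N][N];  /* static: too large for the stack */
        add_col_inner(A, B, C);
        add_row_inner(A, B, C);
        return 0;
    }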


Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

2 misses per access to a & c vs. one miss per access; improve spatial locality

Optimizing Matrix Multiply for Caches

‹Several techniques for making this faster on modern processors ¾heavily studied ‹Some optimizations done automatically by compiler, but can do much better ‹In general, you should use optimized libraries (often supplied by vendor) for this and other very common linear algebra operations ¾BLAS = Basic Linear Algebra Subroutines ‹Other algorithms you may want are not going to be supplied by vendor, so need to know these techniques

Using a Simple Model of Memory to Optimize

‹ Assume just 2 levels in the hierarchy, fast and slow ‹ All data initially in slow memory ¾ m = number of memory elements (words) moved between fast and slow memory ¾ t_m = time per slow memory operation ¾ f = number of arithmetic operations ¾ t_f = time per arithmetic operation << t_m ¾ q = f / m = average number of flops per slow memory access (computational intensity: key to algorithm efficiency)

‹ Minimum possible time = f * t_f when all data in fast memory ‹ Actual time ¾ f * t_f + m * t_m = f * t_f * (1 + t_m/t_f * 1/q) ‹ Larger q means time closer to minimum f * t_f ¾ q ≥ t_m/t_f (the machine balance, key to machine efficiency) is needed to get at least half of peak speed

Warm up: Matrix-vector multiplication y = y + A*x

for i = 1:n
   for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)

[Diagram: y(i) = y(i) + A(i,:) * x(:).]

Warm up: Matrix-vector multiplication

{read x(1:n) into fast memory}
{read y(1:n) into fast memory}
for i = 1:n
   {read row i of A into fast memory}
   for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)
{write y(1:n) back to slow memory}

• m = number of slow memory refs = 3n + n^2
• f = number of arithmetic operations = 2n^2
• q = f / m ~= 2

• Matrix-vector multiplication limited by slow memory speed • Think of q as reuse of data

Matrix Multiply C = C + A*B

for i = 1 to n
   for j = 1 to n
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Diagram: C(i,j) = C(i,j) + A(i,:) * B(:,j).]

Matrix Multiply C = C + A*B (unblocked, or untiled)

for i = 1 to n
   {read row i of A into fast memory}
   for j = 1 to n
      {read C(i,j) into fast memory}
      {read column j of B into fast memory}
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}

Matrix Multiply q = ops/slow mem ref (unblocked, or untiled)

Number of slow memory references on unblocked matrix multiply
m  =  n^3       read each column of B n times
   +  n^2       read each row of A once for each i
   +  2*n^2     read and write each element of C once
   =  n^3 + 3*n^2
So q = f/m = (2*n^3)/(n^3 + 3*n^2) ~= 2 for large n; no improvement over matrix-vector multiply

Naïve Matrix Multiply on RS/6000

[Figure: log cycles/flop vs. log problem size. Measured time grows like T = N^4.7; size 2000 took 5 days, size 12000 would take 1095 years.]

O(N^3) performance would have constant cycles/flop. Performance looks like O(N^4.7). Slide source: Larry Carter, UCSD

Naïve Matrix Multiply on RS/6000

[Figure: log cycles/flop vs. log problem size, annotated with where misses set in: page miss every iteration, TLB miss every iteration, cache miss every 16 iterations, page miss every 512 iterations. Slide source: Larry Carter, UCSD]

Matrix Multiply (blocked, or tiled)

Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the blocksize

for i = 1 to N
   for j = 1 to N
      {read block C(i,j) into fast memory}
      for k = 1 to N
         {read block A(i,k) into fast memory}
         {read block B(k,j) into fast memory}
         C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
      {write block C(i,j) back to slow memory}

[Diagram: block C(i,j) = C(i,j) + A(i,k) * B(k,j).]
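A self-contained C sketch of the same tiled algorithm (my own illustration; the block size BS is an assumption and should be tuned so that three BS x BS blocks fit in cache):

    #include <stdio.h>
    #include <stdlib.h>

    #define BS 64   /* assumed block size: 3 * 64 * 64 * 8 bytes ~ 96 KB working set */

    static int min(int a, int b) { return a < b ? a : b; }

    /* C = C + A*B for n x n row-major matrices, processed block by block so the
       working set of each inner kernel (one block each of A, B, C) stays in cache. */
    void matmul_tiled(int n, const double *A, const double *B, double *C) {
        for (int ii = 0; ii < n; ii += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int kk = 0; kk < n; kk += BS)
                    /* multiply the (ii,kk) block of A by the (kk,jj) block of B
                       and accumulate into the (ii,jj) block of C */
                    for (int i = ii; i < min(ii + BS, n); i++)
                        for (int k = kk; k < min(kk + BS, n); k++) {
                            double aik = A[(size_t)i * n + k];
                            for (int j = jj; j < min(jj + BS, n); j++)
                                C[(size_t)i * n + j] += aik * B[(size_t)k * n + j];
                        }
    }

    int main(void) {
        int n = 512;
        double *A = calloc((size_t)n * n, sizeof(double));
        double *B = calloc((size_t)n * n, sizeof(double));
        double *C = calloc((size_t)n * n, sizeof(double));
        if (!A || !B || !C) return 1;
        for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 1.0; }
        matmul_tiled(n, A, B, C);
        printf("C[0][0] = %g (expect %d)\n", C[0], n);  /* each entry sums n ones */
        free(A); free(B); free(C);
        return 0;
    }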

Matrix Multiply (blocked or tiled) q = ops/slow mem ref

(n = size of matrix, b = blocksize, N = number of blocks. Why is this algorithm correct?)
Number of slow memory references on blocked matrix multiply
m  =  N*n^2     read each block of B N^3 times (N^3 * n/N * n/N)
   +  N*n^2     read each block of A N^3 times
   +  2*n^2     read and write each block of C once
   =  (2*N + 2)*n^2
So q = f/m = 2*n^3 / ((2*N + 2)*n^2) ~= n/N = b for large n

So we can improve performance by increasing the blocksize b. Can be much faster than matrix-vector multiply (q=2)

Limit: All three blocks from A, B, C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large: 3*b^2 <= M, so q ~= b <= sqrt(M/3)

Theorem (Hong, Kung, 1981): Any reorganization of this algorithm (that uses only associativity) is limited to q = O(sqrt(M))

More on BLAS (Basic Linear Algebra Subroutines)

‹ Industry standard interface (evolving) ‹ Vendors, others supply optimized implementations ‹ History ¾ BLAS1 (1970s): » vector operations: dot product, saxpy (y=a*x+y), etc » m=2*n, f=2*n, q ~1 or less ¾ BLAS2 (mid 1980s) » matrix-vector operations: matrix vector multiply, etc » m=n^2, f=2*n^2, q~2, less overhead » somewhat faster than BLAS1 ¾ BLAS3 (late 1980s) » matrix-matrix operations: matrix matrix multiply, etc » m >= 4n^2, f=O(n^3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2 ‹ Good algorithms use BLAS3 when possible (LAPACK) ‹ www.netlib.org/blas, www.netlib.org/lapack

BLAS for Performance

[Figure: Mflop/s vs. order of vector/matrices (10-500) on an Intel Pentium 4 w/SSE2, 1.7 GHz. Level 3 BLAS is well above Level 2 BLAS, which is above Level 1 BLAS.]

‹ Development of blocked algorithms important for performance

BLAS for Performance

[Figure: Mflop/s vs. order of vector/matrices (10-500) on an Alpha EV 5/6, 500 MHz (1 Gflop/s peak). Level 3 BLAS is well above Level 2 and Level 1 BLAS.]

‹ Development of blocked algorithms important for performance: BLAS 3 (n-by-n matrix matrix multiply) vs BLAS 2 (n-by-n matrix vector multiply) vs BLAS 1 (saxpy of n vectors)

Optimizing in practice

‹ Tiling for registers ¾loop unrolling, use of named “register” variables ‹ Tiling for multiple levels of cache ‹ Exploiting fine-grained parallelism within the processor ¾super scalar ¾pipelining ‹ Complicated compiler interactions ‹ Hard to do by hand (but you’ll try) ‹ Automatic optimization an active research area ¾PHIPAC: www.icsi.berkeley.edu/~bilmes/phipac ¾ www.cs.berkeley.edu/~iyer/asci_slides.ps ¾ATLAS: www.netlib.org/atlas/index.html

Summary

‹ Performance programming on uniprocessors requires ¾ understanding of memory system » levels, costs, sizes ¾ understanding of fine-grained parallelism in processor to produce good instruction mix ‹ Blocking (tiling) is a basic approach that can be applied to many matrix algorithms ‹ Applies to uniprocessors and parallel processors ¾ The technique works for any architecture, but choosing the blocksize b and other details depends on the architecture ‹ Similar techniques are possible on other data structures ‹ You will get to try this in Assignment 2 (see the class homepage)

Summary: Memory Hierarchy

‹ Virtual memory was controversial at the time: can SW automatically manage 64KB across many programs? ¾1000X DRAM growth removed the controversy ‹ Today VM allows many processes to share single memory without having to swap all processes to disk; today VM protection is more important than memory hierarchy ‹ Today CPU time is a function of (ops, cache misses) vs. just f(ops): What does this mean to Compilers, Data structures, Algorithms?


Performance = Effective Use of Memory Hierarchy

‹ Can only do arithmetic on data at the top of the hierarchy ‹ Higher level BLAS lets us do this ‹ Development of blocked algorithms important for performance

[Figure: Level 1, 2 & 3 BLAS Mflop/s vs. order of vector/matrices on an Intel PII, 450 MHz.]

BLAS                   Memory Refs   Flops    Flops/Memory Refs
Level 1 (y = y+αx)     3n            2n       2/3
Level 2 (y = y+Ax)     n^2           2n^2     2
Level 3 (C = C+AB)     4n^2          2n^3     n/2

Improving Ratio of Floating Point Operations to Memory Accesses

      subroutine mult(n1,nd1,n2,nd2,y,a,x)
      implicit real*8 (a-h,o-z)
      dimension a(nd1,nd2),y(nd2),x(nd1)
      do 10, i=1,n1
        t=0.d0
        do 20, j=1,n2
 20       t=t+a(j,i)*x(j)
 10     y(i)=t
      return
      end

The inner statement performs 2 FLOPS for 2 LOADS (a(j,i) and x(j)), a ratio of 1.

Improving Ratio of Floating Point Operations to Memory Accesses

c     works correctly when n1,n2 are multiples of 4
      dimension a(nd1,nd2), y(nd2), x(nd1)
      do i=1,n1-3,4
        t1=0.d0
        t2=0.d0
        t3=0.d0
        t4=0.d0
        do j=1,n2-3,4
          t1=t1+a(j+0,i+0)*x(j+0)+a(j+1,i+0)*x(j+1)+
     1          a(j+2,i+0)*x(j+2)+a(j+3,i+0)*x(j+3)
          t2=t2+a(j+0,i+1)*x(j+0)+a(j+1,i+1)*x(j+1)+
     1          a(j+2,i+1)*x(j+2)+a(j+3,i+1)*x(j+3)
          t3=t3+a(j+0,i+2)*x(j+0)+a(j+1,i+2)*x(j+1)+
     1          a(j+2,i+2)*x(j+2)+a(j+3,i+2)*x(j+3)
          t4=t4+a(j+0,i+3)*x(j+0)+a(j+1,i+3)*x(j+1)+
     1          a(j+2,i+3)*x(j+2)+a(j+3,i+3)*x(j+3)
        enddo
        y(i+0)=t1
        y(i+1)=t2
        y(i+2)=t3
        y(i+3)=t4
      enddo

The unrolled inner loop performs 32 FLOPS for 20 LOADS (16 elements of a, 4 of x), improving the ratio from 1 to 1.6.

Amdahl’s Law

Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl’s Law are given below:

tN = (fp/N + fs)t1 Effect of multiple processors on run time

S = 1/(fs + fp/N) Effect of multiple processors on speedup

Where: fs = serial fraction of code fp = parallel fraction of code = 1 - fs N = number of processors


Amdahl’s Law - Theoretical Maximum Speedup of parallel execution

‹ speedup = 1/(P/N + S) » P (parallel code fraction), S (serial code fraction), N (processors) ¾Example: Image processing » 30 minutes of preparation (serial) » One minute to scan a region » 30 minutes of cleanup (serial)

‹ Speedup is restricted by the serial portion. Speedup still increases with a greater number of processors, but only up to the limit 1/S.

Illustration of Amdahl’s Law

It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.

[Figure: speedup vs. number of processors (0-250) for fp = 1.000, 0.999, 0.990, 0.900, annotated "What's going on here?"]

Amdahl’s Law Vs. Reality Amdahl’s Law provides a theoretical upper limit on parallel speedup assuming that there are no costs for communications. In reality, communications ( and I/O) will result in a further degradation of performance.

[Figure: speedup vs. number of processors (0-250) for fp = 0.99: Amdahl's Law prediction vs. reality.]

Gustafson’s Law

‹ Thus, Amdahl’s Law predicts that there is a maximum scalability for an application, determined by its parallel fraction, and this limit is generally not large. ‹ There is a way around this: increase the problem size ¾ bigger problems mean bigger grids or more particles: bigger arrays ¾ number of serial operations generally remains constant; number of parallel operations increases: parallel fraction increases
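Gustafson's observation is usually written as a scaled speedup formula. A small C sketch (my own, using the common form S(N) = N - s*(N - 1), where s is the serial fraction of the scaled run; this formula is not stated on the slide itself):

    #include <stdio.h>

    /* Gustafson scaled speedup: the parallel part grows with the machine,
       so S(N) = s + (1 - s) * N = N - s * (N - 1). */
    static double gustafson_speedup(double s, int nprocs) {
        return nprocs - s * (nprocs - 1);
    }

    int main(void) {
        double s = 0.01;   /* example serial fraction of the scaled run */
        printf("N=256:  scaled speedup = %.1f\n", gustafson_speedup(s, 256));   /* ~253  */
        printf("N=1024: scaled speedup = %.1f\n", gustafson_speedup(s, 1024));  /* ~1014 */
        /* Contrast with Amdahl for the same s: fixed-size speedup is capped at 1/s = 100. */
        return 0;
    }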

Parallel Performance Metrics: Speedup

Relative performance: Absolute performance:

[Figures: (left) relative speedup vs. processors (0-48) for T3E and O2K with ideal lines; (right) absolute MFLOPS vs. processors (0-60) for the same runs.]

Speedup is only one characteristic of a program - it is not synonymous with performance. In this comparison of two machines the code achieves comparable speedups but one of the machines is faster.

Fixed-Problem Size Scaling

• a.k.a. Fixed-load, Fixed-Problem Size, Strong Scaling, Problem-Constrained, constant-problem size (CPS), variable subgrid

• Amdahl Limit: S_A(n) = T(1) / T(n) = 1 / (f/n + (1 - f))
• This bounds the speedup based only on the fraction of the code that cannot use parallelism (1 - f); it ignores all other factors

• S_A --> 1 / (1 - f) as n --> ∞

Fixed-Problem Size Scaling (Cont’d)

• Efficiency (n) = T(1) / [ T(n) * n]

• Memory requirements decrease with n

• Surface-to-volume ratio increases with n

• Superlinear speedup possible from cache effects • Motivation: what is the largest # of procs I can use effectively and what is the fastest time that I can solve a given problem?

• Problems: - Sequential runs often not possible (large problems) - Speedup (and efficiency) is misleading if processors are slow

Fixed-Problem Size Scaling: Examples

S. Goedecker and Adolfy Hoisie, Achieving High Performance in Numerical Computations on RISC Workstations and Parallel Systems, International Conference on Computational Physics: PC'97, Santa Cruz, August 25-28, 1997.


Fixed-Problem Size Scaling Examples


Scaled Speedup Experiments

• a.k.a. Fixed Subgrid-Size, Weak Scaling, Gustafson scaling. • Motivation: Want to use a larger machine to solve a larger global problem in the same amount of time. • Memory and surface-to-volume effects remain constant.

Scaled Speedup Experiments

• Be wary of benchmarks that scale problems to unreasonably-large sizes

- scale the problem to fill the machine when a smaller size will do;

- simplify the science in order to add computation -> “World’s largest MD simulation - 10 gazillion particles!”

- run grid sizes for only a few cycles because the full run won’t finish during this lifetime, or because the resolution makes no sense compared with resolution of input data

• Suggested alternate approach (Gustafson): Constant time benchmarks - run code for a fixed time and measure work done


Example of a Scaled Speedup Experiment: TBON on ASCI Red

Processors   NChains   Time    Natoms   Time per Atom   Time per Atom per PE   Efficiency
1            32        38.4    2368     1.62E-02        1.62E-02               1.000
2            64        38.4    4736     8.11E-03        1.62E-02               1.000
4            128       38.5    9472     4.06E-03        1.63E-02               0.997
8            256       38.6    18944    2.04E-03        1.63E-02               0.995
16           512       38.7    37888    1.02E-03        1.63E-02               0.992
32           940       35.7    69560    5.13E-04        1.64E-02               0.987
64           1700      32.7    125800   2.60E-04        1.66E-02               0.975
128          2800      27.4    207200   1.32E-04        1.69E-02               0.958
256          4100      20.75   303400   6.84E-05        1.75E-02               0.926
512          5300      14.49   392200   3.69E-05        1.89E-02               0.857

[Figure: parallel efficiency vs. number of processors (0-600) for the runs in the table, declining from 1.000 toward 0.857.]
