FUTURE TECHNOLOGIES GROUP

The Roofline Model

Samuel Williams, Lawrence Berkeley National Laboratory

[email protected]

1 Outline

- Introduction
- The Roofline Model
- Example: high-order finite difference stencils
- New issues at exascale

2 Challenges / Goals

- At petascale, there are a wide variety of architectures (superscalar CPUs, embedded CPUs, GPUs/accelerators, etc.).
- The computational characteristics of numerical methods can vary dramatically.

- The result is that performance and the benefit of optimization can vary significantly from one architecture x kernel combination to the next.

- We wish to quickly quantify performance bounds for a variety of architecture x implementation x algorithm combinations.
- Moreover, we wish to identify performance bottlenecks and enumerate potential remediation strategies.

3 Arithmetic Intensity

[Figure: spectrum of arithmetic intensity. O(1): SpMV, BLAS1,2, stencils (PDEs), lattice methods, PIC codes; O(log N): FFTs; O(N): dense linear algebra (BLAS3), naïve particle methods]

- True Arithmetic Intensity (AI) ~ Total Flops / Total DRAM Bytes

- Some HPC kernels have an arithmetic intensity that scales with problem size (temporal locality increases with problem size), while others have constant arithmetic intensity.

- Arithmetic intensity is ultimately limited by compulsory traffic.
- Arithmetic intensity is diminished by conflict or capacity misses.

4 Arithmetic Intensity

[Figure: the arithmetic intensity spectrum from the previous slide, repeated]

- Note, we are free to define arithmetic intensity in different terms:
  - e.g. replace DRAM bytes with PCIe or network bytes
  - replace flop's with stencil's

- Thus, we might consider performance (MStencil/s) as a function of arithmetic intensity (stencils per byte of ____).
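As a sketch of this reformulation (the machine and kernel numbers below are illustrative assumptions, not measurements from this talk), a roofline in stencil units is just the flop-based roofline with both bounds rescaled by the per-stencil costs:

```python
# Hypothetical sketch: the roofline restated in stencil units rather than flop's.

def roofline_mstencil_per_s(peak_gflops, stream_gb_s, flops_per_stencil, bytes_per_stencil):
    """Attainable MStencil/s = min(compute bound, bandwidth bound)."""
    compute_bound = peak_gflops * 1e9 / flops_per_stencil    # stencils/s if compute-bound
    bandwidth_bound = stream_gb_s * 1e9 / bytes_per_stencil  # stencils/s if bandwidth-bound
    return min(compute_bound, bandwidth_bound) / 1e6         # MStencil/s

# e.g. an 8-flop, 16-byte stencil on an assumed 74 GFLOP/s, 16 GB/s machine:
print(roofline_mstencil_per_s(74, 16, 8, 16))  # 1000.0 MStencil/s (bandwidth-bound)
```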

5

Roofline Model

6 Overlap of Communication

- Consider a simple example in which a FP kernel maintains a working set in DRAM.
- We assume we can perfectly overlap computation with communication, or vice versa, either through prefetching/DMA and/or pipelining (decoupling of communication and computation).
- Thus, time is the maximum of the time required to transfer the data and the time required to perform the floating-point operations.

[Diagram: communication time (Byte's / STREAM Bandwidth) and computation time (Flop's / Flop/s) overlapped on a timeline; total time = the maximum of the two]
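The overlap assumption above can be written as a one-line time model; the kernel and machine numbers in the example are hypothetical:

```python
# A minimal sketch of the perfect-overlap time model:
# time = max(bytes / bandwidth, flops / peak).

def kernel_time(total_flops, total_bytes, peak_flops, stream_bw):
    t_compute = total_flops / peak_flops  # seconds of floating-point work
    t_memory = total_bytes / stream_bw    # seconds streaming the working set
    return max(t_compute, t_memory)       # perfect overlap: the slower side dominates

# a 1 Gflop kernel moving 1 GB on an assumed 100 GFLOP/s, 10 GB/s machine:
t = kernel_time(1e9, 1e9, 100e9, 10e9)
print(t)  # 0.1 (memory-bound: the transfer hides the 0.01 s of compute)
```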

7 Roofline Model Basic Concept

- Synthesize communication, computation, and locality into a single visually-intuitive performance figure using bound and bottleneck analysis.

Attainable Performance(i,j) = min( FLOP/s with Optimizations(1..i) , AI * Bandwidth with Optimizations(1..j) )

- where optimization i can be SIMDize, or unroll, or SW prefetch, …
- Given a kernel's arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance.

- Moreover, it provides insights as to which optimizations will potentially be beneficial.
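A minimal sketch of this bound; the peak and bandwidth values are illustrative assumptions (roughly in the spirit of the Opteron example that follows), and the "ceiling" is simply a lower value substituted for peak:

```python
# Sketch of the roofline bound: attainable performance is the minimum of an
# in-core ceiling and AI times a bandwidth ceiling. Numbers are assumptions.

def attainable_gflops(ai, flops_ceiling, bw_ceiling):
    """ai in flops/byte, flops_ceiling in GFLOP/s, bw_ceiling in GB/s."""
    return min(flops_ceiling, ai * bw_ceiling)

peak = 73.6         # assumed peak: full SIMD + mul/add balance + ILP
no_simd = peak / 2  # ceiling if instructions aren't SIMDized
print(attainable_gflops(4.0, peak, 16.0))     # 64.0: bandwidth-bound
print(attainable_gflops(8.0, no_simd, 16.0))  # 36.8: capped by the no-SIMD ceiling
```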

8 Example

- Consider the Opteron 2356:
  - dual socket (NUMA)
  - limited HW stream prefetchers
  - quad-core (8 cores total)
  - 2.3 GHz
  - 2-way SIMD (DP)
  - separate FPMUL and FPADD datapaths
  - 4-cycle FP latency

- Assuming expression of parallelism is the challenge on this architecture, what would the roofline model look like?
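A back-of-envelope peak follows directly from the bullets above, and is consistent with where the "peak DP" ceiling sits in the figures that follow:

```python
# Peak DP for the Opteron 2356 from its listed characteristics:
# 2 sockets x 4 cores x 2.3 GHz x 2-way DP SIMD x concurrent FP mul + add.

sockets, cores, ghz = 2, 4, 2.3
simd_width = 2  # 2-way SIMD in double precision
pipes = 2       # separate FPMUL and FPADD datapaths
peak_gflops = sockets * cores * ghz * simd_width * pipes
print(peak_gflops)  # 73.6 GFLOP/s of peak DP
```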

9 Roofline Model Basic Concept

Opteron 2356 (Barcelona)

- Naively, one might assume peak performance is always attainable.

[Figure: attainable GFLOP/s (log scale, 0.5-256) with a horizontal "peak DP" line]

10 Roofline Model Basic Concept

Opteron 2356 (Barcelona)

- However, with a lack of locality, DRAM bandwidth can be a bottleneck.
- Plot on a log-log scale.
- Given AI, we can easily bound performance.
- But architectures are much more complicated.
- We will bound performance as we eliminate specific forms of in-core parallelism.

[Figure: attainable GFLOP/s vs. actual FLOP:Byte ratio; a horizontal "peak DP" line and a diagonal DRAM bandwidth line]

11 Roofline Model computational ceilings

Opteron 2356 (Barcelona)

- Opterons have dedicated multipliers and adders.
- If the code is dominated by adds, then attainable performance is half of peak.

- We call these Ceilings.
- They act like constraints on performance.

[Figure: roofline with a "mul / add imbalance" ceiling drawn below peak DP]

12 Roofline Model computational ceilings

Opteron 2356 (Barcelona)

- Opterons have 128-bit datapaths.
- If instructions aren't SIMDized, attainable performance will be halved.

[Figure: roofline with "mul / add imbalance" and "w/out SIMD" ceilings]

13 Roofline Model computational ceilings

Opteron 2356 (Barcelona)

- On Opterons, floating-point instructions have a 4-cycle latency.
- If we don't express 4-way ILP, performance will drop by as much as 4x.

[Figure: roofline with "mul / add imbalance", "w/out SIMD", and "w/out ILP" ceilings]

14 Roofline Model communication ceilings

Opteron 2356 (Barcelona)

- We can perform a similar exercise, taking away parallelism from the memory subsystem.

[Figure: roofline; the bandwidth diagonal is the full STREAM bandwidth]

15 Roofline Model communication ceilings

Opteron 2356 (Barcelona)

- Explicit software prefetch instructions are required to achieve peak bandwidth.

[Figure: roofline; a lower bandwidth diagonal shows attainable bandwidth without software prefetch]

16 Roofline Model communication ceilings

Opteron 2356 (Barcelona)

- Opterons are NUMA.
- As such, memory traffic must be correctly balanced among the two sockets to achieve good STREAM bandwidth.
- We could continue this by examining strided or random memory access patterns.

[Figure: roofline; successively lower bandwidth diagonals for unbalanced NUMA traffic]

17 Roofline Model computation + communication ceilings

Opteron 2356 (Barcelona)

- We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.

[Figure: roofline combining the computational ceilings (mul / add imbalance, w/out SIMD, w/out ILP) with the bandwidth ceilings]

18 Roofline Model locality walls

Opteron 2356 (Barcelona)

- Remember, memory traffic includes more than just compulsory misses.
- As such, actual arithmetic intensity may be substantially lower.
- Walls are unique to the architecture-kernel combination.

  AI = FLOPs / Compulsory Misses

[Figure: roofline with a vertical "only compulsory miss traffic" wall bounding the kernel's AI]

19 Cache Behavior

- Knowledge of the underlying cache operation can be critical.
- For example, caches are organized into lines. Lines are organized into sets & ways (associativity).
  - Thus, we must mimic the effect of Mark Hill's 3C's of caches.
  - Impacts of conflict, compulsory, and capacity misses are both architecture- and application-dependent.
  - Ultimately they reduce the actual flop:byte ratio.
- Moreover, many caches are write allocate.
  - A write-allocate cache reads in an entire cache line upon a write miss.
  - If the application ultimately overwrites that line, the read was superfluous (further reduces the flop:byte ratio).
- Because programs access data in words, but hardware transfers it in 64 or 128B cache lines, spatial locality is key.
  - Array-of-structures data layouts can lead to dramatically lower flop:byte ratios.
  - e.g. if a program only operates on the "red" field of a pixel, bandwidth is wasted.

20 Roofline Model locality walls

Opteron 2356 (Barcelona)

- Remember, memory traffic includes more than just compulsory misses.
- As such, actual arithmetic intensity may be substantially lower.
- Walls are unique to the architecture-kernel combination.

  AI = FLOPs / (Allocations + Compulsory Misses)

[Figure: roofline with "only compulsory miss traffic" and "+write allocation traffic" AI walls]

21 Roofline Model locality walls

Opteron 2356 (Barcelona)

  AI = FLOPs / (Capacity + Allocations + Compulsory)

[Figure: roofline adding a "+capacity miss traffic" wall]

22 Roofline Model locality walls

Opteron 2356 (Barcelona)

  AI = FLOPs / (Conflict + Capacity + Allocations + Compulsory)

[Figure: roofline adding a "+conflict miss traffic" wall]

23 Roofline Model locality walls

Opteron 2356 (Barcelona)

- SW optimizations remove these walls and ceilings, which act to constrain performance.
- One marker = the naïve implementation, constrained by low BW, lack of ILP/DLP, and poor arithmetic intensity.
- The other marker = the ultimate performance limit (16x better performance).

[Figure: roofline with all ceilings and walls; markers show the naïve implementation and the ultimate performance limit]

24

Examples

25 7-point Stencil

- Simplest derivation of the Laplacian operator results in a constant-coefficient 7-point stencil for all x,y,z:

  u(x,y,z,t+1) = alpha*u(x,y,z,t) + beta*( u(x,y,z-1,t) + u(x,y-1,z,t) +
                 u(x-1,y,z,t) + u(x+1,y,z,t) + u(x,y+1,z,t) + u(x,y,z+1,t) )

- Clearly each stencil performs:
  - 8 floating-point operations
  - 8 memory references (all but 2 should be filtered by an ideal cache)
  - 6 memory streams (all but 2 should be filtered; less than # HW prefetchers)

[Figure: 7-point stencil on the 3D PDE grid for the heat equation: center point x,y,z plus neighbors x±1, y±1, z±1]
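For concreteness, the update above can be written as a naive NumPy sweep (an illustrative sketch, not the optimized implementations studied in this work):

```python
# One naive Jacobi sweep of the constant-coefficient 7-point heat-equation stencil.
import numpy as np

def sweep_7pt(u, alpha, beta):
    """Apply the 7-point update to all interior points; boundary is left untouched."""
    new = u.copy()
    new[1:-1, 1:-1, 1:-1] = (
        alpha * u[1:-1, 1:-1, 1:-1]
        + beta * (u[1:-1, 1:-1, :-2] + u[1:-1, :-2, 1:-1] + u[:-2, 1:-1, 1:-1]
                  + u[2:, 1:-1, 1:-1] + u[1:-1, 2:, 1:-1] + u[1:-1, 1:-1, 2:])
    )
    return new

u = np.ones((8, 8, 8))
out = sweep_7pt(u, alpha=-6.0, beta=1.0)  # a Laplacian: constant field maps to 0
print(out[4, 4, 4])  # 0.0
```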

26 Roofline Model – 7pt Stencil

Xeon X5550 (Nehalem)

- Where are we on the roofline?
- 8 floating-point operations per stencil.
- An asymptotic limit of 16 (24) bytes per stencil (with write allocate).
- AI < 0.5 (0.33).
- There is a heavy imbalance between multiplies and adds.

[Figure: Nehalem roofline (1-512 GFLOP/s) with "only compulsory traffic" and "with write allocate" AI walls for the 7-point stencil]

27 Roofline Model – 7pt Stencil

Xeon X5550 (Nehalem)

- However, there is a heavy imbalance between multiplies and adds.
- 66% of peak is the in-core performance limit.
- To attain maximum performance, we must:
  - program for NUMA
  - bypass the cache
  - express ILP x DLP
  - select a parallelization that minimizes the cache working set

[Figure: Nehalem roofline; the stencil's AI walls intersect the bandwidth diagonal below the mul/add-imbalance ceiling]

28 High-order

- Use the roofline to predict the performance of high-order stencils.
- Consider (still finite difference) 2nd, 4th, and 8th-order versions of the heat equation.
- We end up with stencils like:

[Figure: 7-point (2nd order), 13-point (4th order), and 25-point (8th order) star stencils, reaching neighbors out to ±1, ±2, and ±4 in each dimension respectively]

- and computational characteristics:

              2nd order   4th order   8th order
flop's            8          15          29
D$ accesses       8          14          26
DRAM bytes       16          16          16
AI             <0.5        <0.94       <1.8
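The AI row follows directly from the flop counts and the fixed 16 DRAM bytes per stencil (only cache traffic, not DRAM traffic, scales with order):

```python
# Reproducing the AI row of the table above from flop and byte counts.

kernels = {"2nd": 8, "4th": 15, "8th": 29}  # flop's per stencil
dram_bytes = 16                             # identical DRAM traffic for all three
for order, flops in kernels.items():
    print(order, flops / dram_bytes)        # 0.5, 0.9375, 1.8125 flops per byte
```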

29 Roofline Model – High Order Stencils

Xeon X5550 (Nehalem)

- We can plot the AI for these three kernels on the roofline.
- Clearly, the degree of requisite optimization becomes quite high.
- Nevertheless, because:
  1. they are all roughly bound by stream bandwidth
  2. they all transfer the same data
  they will all have approximately the same run time.

[Figure: Nehalem roofline with the 2nd, 4th, and 8th-order stencil AIs marked along the bandwidth diagonal]

30 What about horizontal communication?

- Unfortunately we neglected to include the impact of the deeper ghost zones (= "grow cells" = "halos") on arithmetic intensity.

- If the grid is small (32^3), then the overhead of a 1-, 2-, or 4-deep ghost zone can be progressively more severe:
  - 2nd order: scale AI by 83% (<0.42 flops per byte)
  - 4th order: scale AI by 70% (<0.66 flops per byte)
  - 8th order: scale AI by 51% (<0.92 flops per byte)

- Clearly this will impinge upon our performance bounds…
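The scale factors quoted above can be recovered from the ghost-zone volume overhead (a sketch of the accounting, assuming a cubical n^3 grid wrapped in a depth-d halo that must also be transferred):

```python
# Fraction of transferred points that are interior points on an n^3 grid
# with a ghost zone of the given depth: n^3 / (n + 2*depth)^3.

def ghost_zone_scale(n, depth):
    return n**3 / (n + 2 * depth) ** 3

for order, depth in [("2nd", 1), ("4th", 2), ("8th", 4)]:
    print(order, round(ghost_zone_scale(32, depth), 2))  # 0.83, 0.7, 0.51
```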

31 Roofline Model – High Order Stencils

Xeon X5550 (Nehalem)

- Thus, we go from….

[Figure: Nehalem roofline with the compulsory-traffic AIs of the 2nd, 4th, and 8th-order stencils]

32 Roofline Model – High Order Stencils

Xeon X5550 (Nehalem)

- …to this.
- Optimization is still important, but communication has reduced attainable performance.
- Although the requisite number of flop's scales with order, performance will not.
- As such, time per sweep will increase (slowly) with order.

[Figure: Nehalem roofline with the ghost-zone-adjusted (lower) AIs]

33 Roofline Model – High Order Stencils

Xeon X??? (Sandy Bridge-E)

- What about the next generation SNB-e?
- We can use the roofline to predict the performance of future machines.

- Lack of AVX can hurt performance by 4x.
- Still bandwidth-bound.
- Although peak increased by >4x (8 cores + AVX + GHz), sustained performance will likely only increase by ~2.5x.

[Figure: hypothetical Sandy Bridge-E roofline with a "w/out AVX" ceiling; the stencil AIs remain on the bandwidth diagonal]

34 Communication-Avoiding

- As DRAM bandwidth clearly constrains performance, we should investigate (DRAM) communication-avoiding algorithms.
- In a MG solver, one might apply multiple relaxes.
- By properly orchestrating data movement, one can reduce the number of reads of the grid.
- However, this requires redundant computation on ghost/grow cells.

- Consider three examples (all counting raw flop's):
  - 4 relaxes with a 2nd order stencil on a 32^3 grid: AI ~ 1.5 flops per byte
  - 2 relaxes with a 4th order stencil on a 32^3 grid: AI ~ 1.27 flops per byte
  - 1 relax with an 8th order stencil on a 32^3 grid: AI ~ 1.16 flops per byte

35 Roofline Model – Communication-Avoiding on 32^3

Xeon X5550 (Nehalem)

- When communication-avoiding is applied, (raw) performance is roughly constant as a function of order.
- No surprise: they all perform roughly equal work for equal (vertical) data movement.
- However, there is some degree of redundancy we have ignored:
  - 2nd order: 66% of flop's are useful
  - 4th order: 75% of flop's are useful
  - 8th order: 80% of flop's are useful

[Figure: Nehalem roofline; the three communication-avoiding AIs cluster together on the bandwidth diagonal]

36 Roofline Model – Communication-Avoiding on 32^3

Xeon X5550 (Nehalem)

- When we plot useful flop/s, we scale performance down accordingly.
- The result is that we should be able to perform:
  - 4 x 2nd order relaxes,
  - 2 x 4th order relaxes, or
  - 1 x 8th order relax
  in roughly equal time.
- Need FastMath scientists to decide which will provide better overall time-to-solution.

[Figure: Nehalem roofline with useful-flop/s points for the three communication-avoiding variants]

37

Exascale

38 Exascale machines

- Exascale machines are still on the drawing board, but we can proxy one with the following characteristics:
  - >100K nodes
  - 8-16 TF/s of peak performance per node
  - a hierarchical memory (a la GPU) with
    - >1 TB/s of bandwidth to 10's-100's of GB of "near" DRAM
    - >0.1 TB/s of bandwidth to >1 TB of "far" DRAM
- This motivates us to think about locality in terms of not only flops per byte of near DRAM, but also flops per byte of far DRAM:
  - need to perform >128 flop's per 64b word from near DRAM
  - >1,280 floating-point operations per 64b word from far DRAM

- Can you fit the entire problem in 100 GB of near DRAM?
- If not, but you can fit a MG level solve in near DRAM, do you really perform 128e12 floating-point operations per node per level solve?
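The flops-per-word balance numbers above follow directly from the proxy node's peak and bandwidths (taking the aggressive 16 TF/s end of the assumed range):

```python
# Machine balance of the proxy exascale node, in flop's per 64-bit (8-byte) word.

peak, bw_near, bw_far, word = 16e12, 1e12, 0.1e12, 8
flops_per_near_word = peak / bw_near * word  # flop's per word from near DRAM
flops_per_far_word = peak / bw_far * word    # flop's per word from far DRAM
print(flops_per_near_word, flops_per_far_word)  # 128.0 1280.0
```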

39 Roofline Model

Hypothetical Exascale Node

- We now have two critical arithmetic intensities (from near and far DRAM).
- In this example, avoiding near communication is irrelevant.
- Rather, we must:
  - reinvent algorithms to boost flop:far-byte ratios
  - demand HW designers trade higher far bandwidth for reduced near capacity
  - demand HW designers boost near capacity to fit another working-set plateau

[Figure: two-level roofline, attainable GFLOP/s (64-16384) against both FLOP:near-byte and FLOP:far-byte ratios, with peak DP, "no FMA", and "w/out SIMD" ceilings]
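The two-level bound sketched by this figure is a straightforward extension of the roofline minimum; the parameter defaults below are the proxy-node assumptions from the previous slide:

```python
# Sketch of a two-level roofline: performance is bounded by the in-core ceiling
# and by BOTH memory levels. All parameter values are assumptions.

def two_level_roofline(ai_near, ai_far, peak_gflops=16000.0,
                       bw_near_gb=1000.0, bw_far_gb=100.0):
    return min(peak_gflops, ai_near * bw_near_gb, ai_far * bw_far_gb)

# with plentiful near locality, the far-DRAM term can still dominate:
print(two_level_roofline(ai_near=32.0, ai_far=2.0))    # 200.0: far-bandwidth bound
print(two_level_roofline(ai_near=32.0, ai_far=160.0))  # 16000.0: compute bound
```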

40

Questions?

Acknowledgments

Research supported by DOE Office of Science under contract number DE-AC02-05CH11231.

41

BACKUP SLIDES

42

Alternate Rooflines

43 No overlap of communication and computation

- Previously, we assumed perfect overlap of communication and computation.
- What happens if there is a dependency (either inherent or from a lack of optimization) that serializes communication and computation?

[Diagram: communication time (Byte's / STREAM Bandwidth) and computation time (Flop's / Flop/s) laid end-to-end on a timeline, rather than overlapped]

- Time is the sum of communication time and computation time.
- The result is that flop/s grows asymptotically.
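Side by side, the two time models (a small sketch with assumed machine numbers) show the worst case: at machine balance, serialization costs exactly 2x:

```python
# With overlap, time = max(comm, comp); without overlap, time = comm + comp.

def t_overlap(flops, nbytes, peak, bw):
    return max(flops / peak, nbytes / bw)

def t_serial(flops, nbytes, peak, bw):
    return flops / peak + nbytes / bw

peak, bw = 100e9, 10e9      # assumed 100 GFLOP/s, 10 GB/s machine
flops, nbytes = 1e9, 0.1e9  # AI = 10 flops/byte = the machine balance point
r = t_serial(flops, nbytes, peak, bw) / t_overlap(flops, nbytes, peak, bw)
print(r)  # 2.0
```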

44 No overlap of communication and computation

- Consider a generic machine.
- If we can perfectly decouple and overlap communication with computation, the roofline is sharp/angular.
- However, without overlap, the roofline is smoothed, and attainable performance is degraded by up to a factor of 2x.

45 Alternate Bandwidths

- Thus far, we assumed a synergy between streaming applications and bandwidth (proxied by STREAM).
- STREAM is NOT a good proxy for short-stanza / random cacheline access patterns, as memory latency (instead of just bandwidth) is being exposed.
- Thus one might conceive of alternate memory benchmarks to provide a bandwidth upper bound (ceiling).

- Similarly, if data is primarily local in the LLC, one should construct rooflines based on LLC bandwidth and flop:LLC byte ratios.

- For GPUs/accelerators, PCIe bandwidth can be an impediment. Thus one can construct a roofline model based on PCIe bandwidth and the flop:PCIe byte ratio.

46 Alternate Computations

- Arising from HPC kernels, it's no surprise rooflines use DP flop/s.
- Of course, one could use:
  - SP flop/s,
  - integer ops,
  - bit operations,
  - pairwise comparisons (sorting),
  - graphics operations,
  - etc.

47 Time-based roofline

- In some cases, it is easier to visualize performance in terms of seconds (i.e. time-to-solution).
- We can invert the roofline (seconds per flop) and simply multiply by the number of requisite flop's.

- Additionally, we could change the horizontal axis from locality to some more appealing metric.

48 Alternate Axes

- Rather than thinking of requisite optimizations, think of performance (MStencil/s) as a function of flop's per stencil and DRAM bytes per stencil.
- We can plot DRAM bandwidth to divide the space into bandwidth-bound and compute-bound regions.
- We can also plot iso-curves of constant MStencil/s.
- We observe the effect of communication-avoiding or moving to high-order methods.

[Figure: flop's per stencil vs. DRAM bytes per stencil, with iso-curves of constant MStencil/s; arrows show "move to high-order" and "communication-avoiding"]

49 Little's Law

- Little's Law:

  Concurrency = Latency * Bandwidth
      - or -
  Effective Throughput = Expressed Concurrency / Latency

- Bandwidth:
  - conventional memory bandwidth
  - # of floating-point units
- Latency:
  - memory latency
  - functional unit latency
- Concurrency:
  - bytes expressed to the memory subsystem
  - concurrent (parallel) memory operations
- For example, consider a CPU with 2 FPUs, each with a 4-cycle latency. Little's Law states that we must express 8-way ILP to fully utilize the machine.
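The FPU example above, and the memory example worked on the next slide, both drop straight out of Little's Law:

```python
# Little's Law: concurrency = latency * bandwidth.

mem_bandwidth = 20e9  # bytes/s   (the memory example on the next slide)
mem_latency = 100e-9  # seconds
print(mem_bandwidth * mem_latency)  # 2000 bytes (~2KB) of memory concurrency

fpus, fp_latency_cycles = 2, 4      # 2 FPUs x 4-cycle latency
print(fpus * fp_latency_cycles)     # 8-way ILP to keep both FPUs busy
```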

50 Little’s Law Examples

Applied to Memory:
- Consider a CPU with 20 GB/s of bandwidth and 100 ns memory latency.
- Little's Law states that we must express 2KB of concurrency (independent memory operations) to the memory subsystem to attain peak performance.
- On today's superscalar processors, hardware stream prefetchers speculatively load consecutive elements.
- Solution: express the memory access pattern in a streaming fashion in order to engage the prefetchers.

Applied to FPUs:
- Consider a CPU with 2 FPUs, each with a 4-cycle latency.
- Little's Law states that we must express 8-way ILP to fully utilize the machine.
- Solution: unroll/jam the code to express 8 independent FP operations.
- Note, simply unrolling dependent operations (e.g. a reduction) does not increase ILP. It simply amortizes loop overhead.

51 Three Classes of Locality

- Temporal Locality
  - reusing data (either registers or cache lines) multiple times
  - amortizes the impact of limited bandwidth
  - transform loops or algorithms to maximize reuse

- Spatial Locality
  - data is transferred from cache to registers in words
  - however, data is transferred to the cache in 64-128 byte lines
  - using every word in a line maximizes spatial locality
  - transform data structures into a structure-of-arrays (SoA) layout

- Sequential Locality
  - Many memory address patterns access cache lines sequentially.
  - CPUs' hardware stream prefetchers exploit this observation, speculatively loading data to hide memory latency.
  - Transform loops to generate (a few) long, unit-stride accesses.

52 NUMA

- Recent multicore SMPs have integrated the memory controllers on chip.
- As a result, memory access is non-uniform (NUMA).
- That is, the bandwidth to read a given address varies dramatically among cores.
- Exploit NUMA (affinity + first touch) when you malloc/init data.
- The concept is similar to data decomposition for distributed memory.


54 Various Kernels

- We have examined and heavily optimized a number of kernels and applications for both CPUs and GPUs.
- We observe that for most, performance is highly correlated with DRAM bandwidth, particularly on the GPU.
- Note, GTC has a strong scatter/gather component that skews STREAM-based rooflines.

[Figure: rooflines for the Xeon X5550 (Nehalem) and NVIDIA C2050 (Fermi), with measured kernels plotted: DGEMM, RTM/wave eqn., 27pt stencil, 7pt stencil, GTC/pushi, SpMV, GTC/chargei]
