FUTURE TECHNOLOGIES GROUP
The Roofline Model
Samuel Williams, Lawrence Berkeley National Laboratory
Outline
Introduction
The Roofline Model
Example: high-order finite difference stencils
New issues at exascale
Challenges / Goals
At petascale, there are a wide variety of architectures (superscalar CPUs, embedded CPUs, GPUs/accelerators, etc.). The computational characteristics of numerical methods can vary dramatically. The result is that performance, and the benefit of optimization, can vary significantly from one architecture x kernel combination to the next.
We wish to quickly quantify performance bounds for a variety of architecture x implementation x algorithm combinations. Moreover, we wish to identify performance bottlenecks and enumerate potential remediation strategies.
Arithmetic Intensity
True arithmetic intensity (AI) ~ Total flops / Total DRAM bytes
Some HPC kernels have an arithmetic intensity that scales with problem size (temporal locality increases with problem size), while others have constant arithmetic intensity:
O(1): SpMV, BLAS1/2, stencils (PDEs), PIC codes, lattice methods
O(log N): FFTs
O(N): dense linear algebra (BLAS3), naive particle methods
Arithmetic intensity is ultimately limited by compulsory traffic, and is diminished by conflict or capacity misses.
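The definition above can be sanity-checked by counting flops and compulsory DRAM traffic for a simple kernel. A minimal sketch (the kernel and byte counts are illustrative assumptions, not measurements):

```python
def arithmetic_intensity(total_flops, total_dram_bytes):
    """True AI ~ total flops / total DRAM bytes."""
    return total_flops / total_dram_bytes

# Example: DAXPY (y = a*x + y) on N doubles, a BLAS1 kernel.
# Per element: 2 flops; compulsory traffic: read x, read y, write y = 24 bytes.
N = 1 << 20
ai_daxpy = arithmetic_intensity(2 * N, 24 * N)
print(ai_daxpy)  # ~0.083 flops/byte, independent of N, i.e. O(1)
```

Note that N cancels: a BLAS1 kernel sits at constant AI no matter the problem size, which is why it lands in the O(1) group above.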
Arithmetic Intensity
Note, we are free to define arithmetic intensity in different terms, e.g.:
DRAM bytes -> cache, PCIe, or network bytes
flop's -> stencil's
Thus, we might consider performance (MStencil/s) as a function of arithmetic intensity (stencils per byte of ____).
Roofline Model
Overlap of Communication
Consider a simple example in which a FP kernel maintains a working set in DRAM. We assume we can perfectly overlap computation with communication (or vice versa), either through prefetching/DMA and/or pipelining (decoupling of communication and computation).
Thus, time is the maximum of the time required to transfer the data and the time required to perform the floating-point operations:
time = max( Bytes / STREAM Bandwidth, Flops / Flop/s )
Roofline Model: basic concept
Synthesize communication, computation, and locality into a single visually-intuitive performance figure using bound and bottleneck analysis:
Attainable Flop/s = min( Peak Flop/s with Optimizations 1..i, AI * Bandwidth with Optimizations 1..j )
where optimization i can be SIMDize, unroll, SW prefetch, etc. Given a kernel's arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance. Moreover, it provides insight as to which optimizations will potentially be beneficial.
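The bound above is trivial to evaluate directly. A minimal sketch (the peak and bandwidth numbers are illustrative, not measurements of any particular machine):

```python
def roofline(ai, peak_gflops, bw_gbs):
    """Attainable GFlop/s = min(peak flop rate, AI * memory bandwidth)."""
    return min(peak_gflops, ai * bw_gbs)

# Illustrative machine: 74 GFlop/s peak, 16 GB/s STREAM bandwidth.
peak, bw = 74.0, 16.0
ridge = peak / bw                 # AI at which a kernel becomes compute-bound
print(ridge)                      # 4.625 flops/byte
print(roofline(0.33, peak, bw))   # bandwidth-bound: 5.28 GFlop/s
print(roofline(8.0, peak, bw))    # compute-bound: 74.0 GFlop/s
```

Ceilings are just the same formula with a reduced peak (e.g. halved without SIMD) or a reduced bandwidth (e.g. without NUMA-aware allocation).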
Example
Consider the Opteron 2356:
dual-socket (NUMA)
limited HW stream prefetchers
quad-core (8 cores total), 2.3 GHz
2-way SIMD (DP)
separate FP mul and FP add datapaths
4-cycle FP latency
Assuming expression of parallelism is the challenge on this architecture, what would the roofline model look like?
Roofline Model: basic concept
Opteron 2356 (Barcelona): naively, one might assume peak (DP) performance is always attainable.
[Figure series: roofline for the Opteron 2356 (Barcelona), attainable GFlop/s vs. actual flop:byte ratio on a log-log scale; successive slides add ceilings (mul/add imbalance, w/out SIMD, w/out ILP) and bandwidth/locality constraints]

Roofline Model: basic concept
However, with a lack of locality, DRAM bandwidth can be a bottleneck.
Plot on a log-log scale. Given AI, we can easily bound performance.
But architectures are much more complicated. We will bound performance as we eliminate specific forms of in-core parallelism.

Roofline Model: computational ceilings
Opterons have dedicated multipliers and adders. If the code is dominated by adds, then attainable performance is half of peak (mul/add imbalance).
We call these "ceilings". They act like constraints on performance.

Roofline Model: computational ceilings
Opterons have 128-bit datapaths. If instructions aren't SIMDized, attainable performance will be halved (w/out SIMD).

Roofline Model: computational ceilings
On Opterons, floating-point instructions have a 4-cycle latency. If we don't express 4-way ILP, performance will drop by as much as 4x (w/out ILP).

Roofline Model: communication ceilings
We can perform a similar exercise, taking away parallelism from the memory subsystem.

Roofline Model: communication ceilings
Explicit software prefetch instructions are required to achieve peak bandwidth.

Roofline Model: communication ceilings
Opterons are NUMA. As such, memory traffic must be correctly balanced between the two sockets to achieve good STREAM bandwidth. We could continue this by examining strided or random memory access patterns.

Roofline Model: computation + communication ceilings
We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.

Roofline Model: locality walls
Remember, memory traffic includes more than just compulsory misses. As such, actual arithmetic intensity may be substantially lower. Walls are unique to the architecture-kernel combination.
AI = Flops / Compulsory Misses (only compulsory miss traffic)

Cache Behavior
Knowledge of the underlying cache operation can be critical. For example, caches are organized into lines, and lines are organized into sets & ways (associativity). Thus, we must mimic the effect of Mark Hill's 3C's of caches. The impacts of conflict, compulsory, and capacity misses are both architecture- and application-dependent. Ultimately they reduce the actual flop:byte ratio.
Moreover, many caches are write-allocate: a write-allocate cache reads in an entire cache line upon a write miss. If the application ultimately overwrites that line, the read was superfluous (further reducing the flop:byte ratio).
Because programs access data in words, but hardware transfers it in 64 or 128B cache lines, spatial locality is key. Array-of-structures data layouts can lead to dramatically lower flop:byte ratios, e.g. if a program only operates on the "red" field of a pixel, bandwidth is wasted.

Roofline Model: locality walls
Write-allocation traffic further diminishes arithmetic intensity:
AI = Flops / (Allocations + Compulsory Misses)

Roofline Model: locality walls
Capacity miss traffic diminishes it further still:
AI = Flops / (Capacity + Allocations + Compulsory)

Roofline Model: locality walls
As does conflict miss traffic:
AI = Flops / (Conflict + Capacity + Allocations + Compulsory)

Roofline Model: locality walls
SW optimizations remove these walls and ceilings, which act to constrain performance. A naive implementation is constrained by low BW, lack of ILP/DLP, and poor arithmetic intensity; removing every wall and ceiling reaches the ultimate performance limit (16x better performance).
Examples
7-point Stencil
The simplest derivation of the Laplacian operator results in a constant-coefficient 7-point stencil, applied for all x,y,z:
u(x,y,z,t+1) = alpha*u(x,y,z,t) + beta*( u(x,y,z-1,t) + u(x,y-1,z,t) + u(x-1,y,z,t) + u(x+1,y,z,t) + u(x,y+1,z,t) + u(x,y,z+1,t) )
Clearly, each stencil performs:
8 floating-point operations
8 memory references, all but 2 of which should be filtered by an ideal cache
6 memory streams, all but 2 of which should be filtered (fewer than the # of HW prefetchers)
[Figure: PDE grid and 7-point stencil for the heat equation]
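The update above can be sketched directly in code. A minimal pure-Python sweep (grid size and coefficients are illustrative; a real implementation would be blocked, vectorized, and NUMA-aware):

```python
def sweep_7pt(u, alpha, beta, n):
    """One out-of-place sweep of the constant-coefficient 7-point stencil
    over the interior of an n^3 grid stored as a flat list (8 flops/point)."""
    nn = n * n
    out = u[:]  # boundary values are carried over unchanged
    for z in range(1, n - 1):
        for y in range(1, n - 1):
            for x in range(1, n - 1):
                i = z * nn + y * n + x
                out[i] = alpha * u[i] + beta * (
                    u[i - nn] + u[i - n] + u[i - 1] +
                    u[i + 1] + u[i + n] + u[i + nn])
    return out

n = 8
u = [1.0] * n**3
# With alpha=1, beta=-1/6 a constant grid is mapped to 0 at interior points.
v = sweep_7pt(u, 1.0, -1.0 / 6.0, n)
print(v[n * n * 4 + n * 4 + 4])  # 0.0 (interior); boundaries remain 1.0
```

Note the three unit-stride streams in z-1, z, z+1 planes (plus the write stream): this is where the "6 memory streams, all but 2 filtered" count comes from.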
Roofline Model: 7-point stencil
Xeon X5550 (Nehalem): where are we on the roofline?
8 floating-point operations per stencil
asymptotic limit of 16 (24 with write allocate) bytes per stencil
AI < 0.5 (0.33 with write-allocate traffic)
There is a heavy imbalance between multiplies and adds.
[Figure: roofline for the Xeon X5550 (Nehalem) with the 7-point stencil's AI marked]

Roofline Model: 7-point stencil
Because of the heavy imbalance between multiplies and adds, 66% of peak is the in-core performance limit.
To attain maximum performance, we must:
program for NUMA
bypass the cache
express ILP x DLP
select a parallelization that minimizes the cache working set

High-order
Use the roofline to predict the performance of high-order stencils. Consider (still finite-difference) 2nd-, 4th-, and 8th-order versions of the heat equation. We end up with progressively wider stencils:
[Figure: the 2nd-, 4th-, and 8th-order stencils, reaching 1, 2, and 4 points out along each axis]
and computational characteristics:

               2nd order   4th order   8th order
flop's              8          15          29
D$ accesses         8          14          26
DRAM bytes         16          16          16
AI               <0.5        <0.94       <1.8
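The AI row follows directly from the rest of the table: every order still moves the same 16 compulsory DRAM bytes per point, so AI scales with the flop count alone. As a quick check:

```python
# Flops per stencil for the 2nd-, 4th-, and 8th-order heat-equation stencils,
# each moving ~16 compulsory DRAM bytes per point.
flops = {2: 8, 4: 15, 8: 29}
dram_bytes = 16.0
ai = {order: f / dram_bytes for order, f in flops.items()}
print(ai)  # {2: 0.5, 4: 0.9375, 8: 1.8125}
```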
Roofline Model: high-order stencils
Xeon X5550 (Nehalem): we can plot the AI for these three kernels on the roofline.
Clearly, the degree of requisite optimization becomes quite high. Nevertheless, because:
1. they are all roughly bound by STREAM bandwidth, and
2. they all transfer the same data,
they will all have approximately the same run time.
[Figure: roofline for the Xeon X5550 with the 2nd-, 4th-, and 8th-order AI's marked]

What about horizontal communication?
Unfortunately, we neglected to include the impact of the deeper ghost zones (= "grow cells" = "halos") on arithmetic intensity.
If the grid is small (32^3), then the overhead of a 1-, 2-, or 4-deep ghost zone is progressively more severe:
2nd order: scale AI by 83% (<0.42 flops per byte)
4th order: scale AI by 70% (<0.66 flops per byte)
8th order: scale AI by 51% (<0.92 flops per byte)
Clearly this will impinge upon our performance bounds…
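The scale factors above follow from the ratio of interior points to total points fetched: a depth-g ghost zone around a 32^3 grid inflates the footprint to (32+2g)^3. A quick check:

```python
def ghost_zone_ai_scale(n, g):
    """Fraction of fetched data that is interior for an n^3 grid
    with a g-deep ghost zone on every face."""
    return n**3 / (n + 2 * g)**3

n = 32
for order, g, ai in [(2, 1, 0.5), (4, 2, 0.94), (8, 4, 1.8)]:
    s = ghost_zone_ai_scale(n, g)
    print(order, round(s, 2), round(ai * s, 2))
# 2 0.83 0.42
# 4 0.7  0.66
# 8 0.51 0.92
```

For a larger grid the penalty shrinks rapidly, since the surface-to-volume ratio drops.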
Roofline Model: high-order stencils
Thus, we go from the compulsory-only AI's...
[Figure: Xeon X5550 roofline with compulsory-only AI's for the 2nd-, 4th-, and 8th-order stencils]

Roofline Model: high-order stencils
...to this. Optimization is still important, but communication has reduced attainable performance. Although the requisite number of flop's scales with order, performance will not. As such, time per sweep will increase (slowly) with order.
[Figure: Xeon X5550 roofline with ghost-zone-adjusted AI's]

Roofline Model: high-order stencils
What about the next-generation SNB-e (Sandy Bridge-E)? We can use the roofline to predict the performance of future machines:
lack of AVX can hurt performance by 4x
still bandwidth-bound
although peak increased by >4x (8 cores + AVX + GHz), sustained performance will likely only increase by ~2.5x
[Figure: projected roofline for a Sandy Bridge-E Xeon]

Communication-Avoiding
As DRAM bandwidth clearly constrains performance, we should investigate (DRAM) communication-avoiding algorithms. In a MG solver, one might apply multiple relaxes. By properly orchestrating data movement, one can reduce the number of reads of the grid. However, this requires redundant computation on ghost/grow cells.
Consider three examples (all counting raw flop's):
4 relaxes with a 2nd-order stencil on a 32^3 grid: AI ~1.5 flops per byte
2 relaxes with a 4th-order stencil on a 32^3 grid: AI ~1.27 flops per byte
1 relax with an 8th-order stencil on a 32^3 grid: AI ~1.16 flops per byte
Roofline Model: communication-avoiding on 32^3
Xeon X5550 (Nehalem): when communication-avoiding is applied, (raw) performance is roughly constant as a function of order. No surprise: they all perform roughly equal work for equal (vertical) data movement.
However, there is some degree of redundancy we have ignored:
2nd order: 66% of flop's are useful
4th order: 75% of flop's are useful
8th order: 80% of flop's are useful
[Figure: roofline with the three communication-avoiding variants at roughly equal AI]

Roofline Model: communication-avoiding on 32^3
When we plot useful flop/s, we scale performance down accordingly. The result is that we should be able to perform:
4 x 2nd-order relaxes,
2 x 4th-order relaxes, or
1 x 8th-order relax
in roughly equal time. We need FastMath scientists to decide which will provide better overall time-to-solution.
Exascale
Exascale machines
Exascale machines are still on the drawing board, but we can proxy one with the following characteristics:
>100K nodes
8-16 TF/s of peak performance
a hierarchical memory (a la GPU):
>1 TB/s of bandwidth to 10's-100's of GB of "near" DRAM
>0.1 TB/s of bandwidth to >1 TB of "far" DRAM
This motivates us to think about locality in terms of not only flops per byte of near DRAM, but also flops per byte of far DRAM:
need to perform >128 flop's per 64b word from near DRAM
need to perform >1,280 floating-point operations per 64b word from far DRAM
Can you fit the entire problem in 100 GB of near DRAM? If not, but you can fit a MG level solve in near DRAM, do you really perform 128e12 floating-point operations per node per level solve?
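The per-word flop requirements above follow from dividing the peak flop rate by the word bandwidth. A quick check, using the 16 TF/s end of the range:

```python
peak_flops = 16e12               # 16 TF/s peak
near_bw, far_bw = 1e12, 0.1e12   # bytes/s to "near" and "far" DRAM
word = 8                         # 64-bit word

flops_per_near_word = peak_flops / (near_bw / word)
flops_per_far_word = peak_flops / (far_bw / word)
print(flops_per_near_word, flops_per_far_word)  # 128.0 1280.0
```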
Roofline Model
Hypothetical exascale node: we now have two critical arithmetic intensities (from near and far DRAM). In this example, avoiding near communication is irrelevant. Rather, we must:
reinvent algorithms to boost flop:far-byte intensity
demand HW designers trade higher far bandwidth for reduced near capacity
demand HW designers boost near capacity to fit another working set plateau
[Figure: roofline for a hypothetical exascale node (peak DP, no-FMA and w/out-SIMD ceilings) with separate bandwidth diagonals plotted against both flop:near-byte and flop:far-byte axes]
Questions?
Acknowledgments
Research supported by DOE Office of Science under contract number DE-AC02-05CH11231.
BACKUP SLIDES
Alternate Rooflines
No overlap of communication and computation
Previously, we assumed perfect overlap of communication and computation. What happens if there is a dependency (either inherent, or due to a lack of optimization) that serializes communication and computation?
time = Bytes / STREAM Bandwidth + Flops / Flop/s
Time is the sum of communication time and computation time. The result is that flop/s only approaches its bound asymptotically.

No overlap of communication and computation
Consider a generic machine. If we can perfectly decouple and overlap communication with computation, the roofline is sharp/angular. However, without overlap, the roofline is smoothed, and attainable performance is degraded by up to a factor of 2x.
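The two time models are easy to compare directly. A minimal sketch (peak and bandwidth numbers are illustrative): the gap is largest at the ridge point, where communication and computation times are equal.

```python
def time_overlap(flops, nbytes, peak, bw):
    """Perfect overlap: time is the max of compute and transfer time."""
    return max(flops / peak, nbytes / bw)

def time_serial(flops, nbytes, peak, bw):
    """No overlap: compute and transfer times add."""
    return flops / peak + nbytes / bw

peak, bw = 74e9, 16e9        # illustrative: 74 GFlop/s, 16 GB/s
nbytes = 1e9
flops = nbytes * peak / bw   # AI at the ridge point: equal times
ratio = time_serial(flops, nbytes, peak, bw) / time_overlap(flops, nbytes, peak, bw)
print(ratio)  # 2.0, i.e. up to 2x degradation without overlap
```

Far from the ridge point, one term dominates and the two models converge, which is why the smoothed roofline hugs the sharp one at its extremes.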
Alternate Bandwidths
Thus far, we assumed a synergy between streaming applications and bandwidth (proxied by the STREAM benchmark). STREAM is NOT a good proxy for short-stanza / random cacheline access patterns, as memory latency (instead of just bandwidth) is exposed. Thus one might conceive of alternate memory benchmarks to provide a bandwidth upper bound (ceiling).
Similarly, if data is primarily local in the LLC cache, one should construct rooflines based on LLC bandwidth and flop:LLC-byte ratios.
For GPUs/accelerators, PCIe bandwidth can be an impediment. Thus one can construct a roofline model based on PCIe bandwidth and the flop:PCIe-byte ratio.
Alternate Computations
Arising from HPC kernels, it's no surprise rooflines use DP flop/s. Of course, one could instead use SP flop/s, integer ops, bit operations, pairwise comparisons (sorting), graphics operations, etc.
Time-based Roofline
In some cases, it is easier to visualize performance in terms of seconds (i.e. time-to-solution). We can invert the roofline (seconds per flop) and simply multiply by the number of requisite flop's.
Additionally, we could change the horizontal axis from locality to some more appealing metric.
Alternate Axes
Rather than thinking of requisite optimizations, think of performance (MStencil/s) as a function of flop's per stencil and DRAM bytes per stencil. We can plot DRAM bandwidth to divide the space into bandwidth-bound and compute-bound regions. We can also plot iso-curves of constant MStencil/s. On such a plot we can observe the effect of communication-avoiding or of moving to high-order methods.
[Figure: flop's per stencil vs. DRAM bytes per stencil, with curves of constant MStencil/s; communication-avoiding moves left along the bytes axis, high-order methods move up the flop's axis]

Little's Law
Little's Law:
Concurrency = Latency * Bandwidth
or, equivalently:
Effective Throughput = Expressed Concurrency / Latency
Bandwidth: conventional memory bandwidth, or # of floating-point units
Latency: memory latency, or functional-unit latency
Concurrency: bytes expressed to the memory subsystem, or concurrent (parallel) memory operations
For example, consider a CPU with 2 FPU's, each with a 4-cycle latency. Little's law states that we must express 8-way ILP to fully utilize the machine.
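Both applications of the law reduce to the same product. A quick check, using the memory numbers from the examples that follow (20 GB/s with 100 ns latency) and the FPU example above:

```python
def required_concurrency(bandwidth, latency):
    """Little's law: concurrency = latency * bandwidth."""
    return bandwidth * latency

# Memory: 20 GB/s with 100 ns latency -> bytes that must be in flight.
mem_bytes = required_concurrency(20e9, 100e-9)
print(mem_bytes)   # ~2000 bytes, i.e. ~2 KB of outstanding memory traffic

# FPUs: 2 results/cycle with a 4-cycle latency -> independent operations.
ilp = required_concurrency(2, 4)
print(ilp)         # 8-way ILP
```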
Little's Law Examples
Applied to memory: consider a CPU with 20 GB/s of bandwidth and 100 ns memory latency. Little's law states that we must express 2 KB of concurrency (independent memory operations) to the memory subsystem to attain peak performance. On today's superscalar processors, hardware stream prefetchers speculatively load consecutive elements. Solution: express the memory access pattern in a streaming fashion in order to engage the prefetchers.
Applied to FPUs: consider a CPU with 2 FPU's, each with a 4-cycle latency. Little's law states that we must express 8-way ILP to fully utilize the machine. Solution: unroll/jam the code to express 8 independent FP operations. Note, simply unrolling dependent operations (e.g. a reduction) does not increase ILP; it simply amortizes loop overhead.

Three Classes of Locality
Temporal locality: reusing data (either registers or cache lines) multiple times amortizes the impact of limited bandwidth. Transform loops or algorithms to maximize reuse.
Spatial locality: data is transferred from cache to registers in words, but data is transferred to the cache in 64-128 byte lines. Using every word in a line maximizes spatial locality. Transform data structures into a structure-of-arrays (SoA) layout.
Sequential locality: many memory address patterns access cache lines sequentially. CPU hardware stream prefetchers exploit this observation to speculatively load data and hide memory latency. Transform loops to generate (a few) long, unit-stride accesses.

NUMA
Recent multicore SMPs have integrated the memory controllers on chip. As a result, memory access is non-uniform (NUMA): the bandwidth to read a given address varies dramatically between cores. Exploit NUMA (affinity + first touch) when you malloc/init data. The concept is similar to data decomposition for distributed memory.

Various Kernels
54 Various Kernels
FUTURE TECHNOLOGIES GROUP We have examined and heavily optimized a number of kernels and applications for both CPUs and GPUs. We observe that for most , performance is highly correlated with DRAM bandwidth – particularly on the GPU. Note, GTC has a strong scatter/gather component that skews STREAM- based rooflines.
1024 Xeon X5550 (Nehalem) 1024 NVIDIA C2050 (Fermi) 512 512
256 256 DGEMM
128 128 RTM/wave eqn.
64 DGEMM 64 RTM/wave eqn. 27pt Stencil 32 32 27pt Stencil 7pt Stencil 7pt Stencil 16 16 GTC/pushi GTC/pushi SpMV 8 SpMV 8 GTC/chargei 4 4 GTC/chargei 2 2 1 1 1 1 1 1 1 1 1 1 /32 /16 /8 /4 /2 12481632 /32 /16 /8 /4 /2 12481632