FUTURE TECHNOLOGIES GROUP

The Roofline Model

Samuel Williams, Lawrence Berkeley National Laboratory

[email protected]

1 Outline

- Introduction
- The Roofline Model
- Example: high-order finite difference stencils
- New issues at exascale

2 Challenges / Goals

- At petascale, there are a wide variety of architectures (superscalar CPUs, embedded CPUs, GPUs/accelerators, etc.).
- The computational characteristics of numerical methods can vary dramatically.

- The result is that performance and the benefit of optimization can vary significantly from one architecture x kernel combination to the next.

- We wish to quickly quantify performance bounds for a variety of architecture x implementation x algorithm combinations.
- Moreover, we wish to identify performance bottlenecks and enumerate potential remediation strategies.

3 Arithmetic Intensity

[Figure: spectrum of arithmetic intensity. O(1): SpMV, BLAS1,2, stencils (PDEs), lattice methods, PIC codes; O(log N): FFTs; O(N): dense linear algebra (BLAS3), naïve particle methods]

- True Arithmetic Intensity (AI) ~ Total Flops / Total DRAM Bytes

- Some HPC kernels have an arithmetic intensity that scales with problem size (temporal locality increases with problem size), while others have constant arithmetic intensity.

- Arithmetic intensity is ultimately limited by compulsory traffic.
- Arithmetic intensity is diminished by conflict or capacity misses.

4 Arithmetic Intensity

[Figure: the arithmetic intensity spectrum from the previous slide, repeated]

- Note, we are free to define arithmetic intensity in different terms:
  - e.g. replace DRAM bytes with PCIe or network bytes
  - replace flop's with stencil's

- Thus, we might consider performance (MStencil/s) as a function of arithmetic intensity (stencils per byte of ____).
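As a sketch of this reformulation (the machine and kernel numbers below are illustrative assumptions, not measurements from this talk), a roofline in stencil units is just the flop-based roofline with both bounds rescaled by the per-stencil costs:

```python
# Hypothetical sketch: the roofline restated in stencil units rather than flop's.

def roofline_mstencil_per_s(peak_gflops, stream_gb_s, flops_per_stencil, bytes_per_stencil):
    """Attainable MStencil/s = min(compute bound, bandwidth bound)."""
    compute_bound = peak_gflops * 1e9 / flops_per_stencil    # stencils/s if compute-bound
    bandwidth_bound = stream_gb_s * 1e9 / bytes_per_stencil  # stencils/s if bandwidth-bound
    return min(compute_bound, bandwidth_bound) / 1e6         # MStencil/s

# e.g. an 8-flop, 16-byte stencil on an assumed 74 GFLOP/s, 16 GB/s machine:
print(roofline_mstencil_per_s(74, 16, 8, 16))  # 1000.0 MStencil/s (bandwidth-bound)
```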

5

Roofline Model

6 Overlap of Communication

- Consider a simple example in which a FP kernel maintains a working set in DRAM.
- We assume we can perfectly overlap computation with communication, or vice versa, either through prefetching/DMA and/or pipelining (decoupling of communication and computation).
- Thus, time is the maximum of the time required to transfer the data and the time required to perform the floating-point operations.

[Diagram: communication time (Byte's / STREAM Bandwidth) and computation time (Flop's / Flop/s) overlapped on a timeline; total time = the maximum of the two]
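The overlap assumption above can be written as a one-line time model; the kernel and machine numbers in the example are hypothetical:

```python
# A minimal sketch of the perfect-overlap time model:
# time = max(bytes / bandwidth, flops / peak).

def kernel_time(total_flops, total_bytes, peak_flops, stream_bw):
    t_compute = total_flops / peak_flops  # seconds of floating-point work
    t_memory = total_bytes / stream_bw    # seconds streaming the working set
    return max(t_compute, t_memory)       # perfect overlap: the slower side dominates

# a 1 Gflop kernel moving 1 GB on an assumed 100 GFLOP/s, 10 GB/s machine:
t = kernel_time(1e9, 1e9, 100e9, 10e9)
print(t)  # 0.1 (memory-bound: the transfer hides the 0.01 s of compute)
```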

7 Roofline Model Basic Concept

- Synthesize communication, computation, and locality into a single visually-intuitive performance figure using bound and bottleneck analysis.

Attainable Performance(i,j) = min( FLOP/s with Optimizations(1..i) , AI * Bandwidth with Optimizations(1..j) )

- where optimization i can be SIMDize, or unroll, or SW prefetch, …
- Given a kernel's arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance.

- Moreover, it provides insights as to which optimizations will potentially be beneficial.
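A minimal sketch of this bound; the peak and bandwidth values are illustrative assumptions (roughly in the spirit of the Opteron example that follows), and the "ceiling" is simply a lower value substituted for peak:

```python
# Sketch of the roofline bound: attainable performance is the minimum of an
# in-core ceiling and AI times a bandwidth ceiling. Numbers are assumptions.

def attainable_gflops(ai, flops_ceiling, bw_ceiling):
    """ai in flops/byte, flops_ceiling in GFLOP/s, bw_ceiling in GB/s."""
    return min(flops_ceiling, ai * bw_ceiling)

peak = 73.6         # assumed peak: full SIMD + mul/add balance + ILP
no_simd = peak / 2  # ceiling if instructions aren't SIMDized
print(attainable_gflops(4.0, peak, 16.0))     # 64.0: bandwidth-bound
print(attainable_gflops(8.0, no_simd, 16.0))  # 36.8: capped by the no-SIMD ceiling
```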

8 Example

- Consider the Opteron 2356:
  - dual socket (NUMA)
  - limited HW stream prefetchers
  - quad-core (8 cores total)
  - 2.3 GHz
  - 2-way SIMD (DP)
  - separate FPMUL and FPADD datapaths
  - 4-cycle FP latency

- Assuming expression of parallelism is the challenge on this architecture, what would the roofline model look like?
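A back-of-envelope peak follows directly from the bullets above, and is consistent with where the "peak DP" ceiling sits in the figures that follow:

```python
# Peak DP for the Opteron 2356 from its listed characteristics:
# 2 sockets x 4 cores x 2.3 GHz x 2-way DP SIMD x concurrent FP mul + add.

sockets, cores, ghz = 2, 4, 2.3
simd_width = 2  # 2-way SIMD in double precision
pipes = 2       # separate FPMUL and FPADD datapaths
peak_gflops = sockets * cores * ghz * simd_width * pipes
print(peak_gflops)  # 73.6 GFLOP/s of peak DP
```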

9 Roofline Model Basic Concept

Opteron 2356 (Barcelona)

- Naively, one might assume peak performance is always attainable.

[Figure: attainable GFLOP/s (log scale, 0.5-256) with a horizontal "peak DP" line]

10 Roofline Model Basic Concept

Opteron 2356 (Barcelona)

- However, with a lack of locality, DRAM bandwidth can be a bottleneck.
- Plot on a log-log scale.
- Given AI, we can easily bound performance.
- But architectures are much more complicated.
- We will bound performance as we eliminate specific forms of in-core parallelism.

[Figure: attainable GFLOP/s vs. actual FLOP:Byte ratio; a horizontal "peak DP" line and a diagonal DRAM bandwidth line]

11 Roofline Model computational ceilings

Opteron 2356 (Barcelona)

- Opterons have dedicated multipliers and adders.
- If the code is dominated by adds, then attainable performance is half of peak.

- We call these Ceilings.
- They act like constraints on performance.

[Figure: roofline with a "mul / add imbalance" ceiling drawn below peak DP]

12 Roofline Model computational ceilings

Opteron 2356 (Barcelona)

- Opterons have 128-bit datapaths.
- If instructions aren't SIMDized, attainable performance will be halved.

[Figure: roofline with "mul / add imbalance" and "w/out SIMD" ceilings]

13 Roofline Model computational ceilings

Opteron 2356 (Barcelona)

- On Opterons, floating-point instructions have a 4-cycle latency.
- If we don't express 4-way ILP, performance will drop by as much as 4x.

[Figure: roofline with "mul / add imbalance", "w/out SIMD", and "w/out ILP" ceilings]

14 Roofline Model communication ceilings

Opteron 2356 (Barcelona)

- We can perform a similar exercise, taking away parallelism from the memory subsystem.

[Figure: roofline; the bandwidth diagonal is the full STREAM bandwidth]

15 Roofline Model communication ceilings

Opteron 2356 (Barcelona)

- Explicit software prefetch instructions are required to achieve peak bandwidth.

[Figure: roofline; a lower bandwidth diagonal shows attainable bandwidth without software prefetch]

16 Roofline Model communication ceilings

Opteron 2356 (Barcelona)

- Opterons are NUMA.
- As such, memory traffic must be correctly balanced among the two sockets to achieve good STREAM bandwidth.
- We could continue this by examining strided or random memory access patterns.

[Figure: roofline; successively lower bandwidth diagonals for unbalanced NUMA traffic]

17 Roofline Model computation + communication ceilings

Opteron 2356 (Barcelona)

- We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.

[Figure: roofline combining the computational ceilings (mul / add imbalance, w/out SIMD, w/out ILP) with the bandwidth ceilings]

18 Roofline Model locality walls

Opteron 2356 (Barcelona)

- Remember, memory traffic includes more than just compulsory misses.
- As such, actual arithmetic intensity may be substantially lower.
- Walls are unique to the architecture-kernel combination.

  AI = FLOPs / Compulsory Misses

[Figure: roofline with a vertical "only compulsory miss traffic" wall bounding the kernel's AI]

19 Cache Behavior

- Knowledge of the underlying cache operation can be critical.
- For example, caches are organized into lines. Lines are organized into sets & ways (associativity).
  - Thus, we must mimic the effect of Mark Hill's 3C's of caches.
  - Impacts of conflict, compulsory, and capacity misses are both architecture- and application-dependent.
  - Ultimately they reduce the actual flop:byte ratio.
- Moreover, many caches are write allocate.
  - A write-allocate cache reads in an entire cache line upon a write miss.
  - If the application ultimately overwrites that line, the read was superfluous (further reduces the flop:byte ratio).
- Because programs access data in words, but hardware transfers it in 64 or 128B cache lines, spatial locality is key.
  - Array-of-structures data layouts can lead to dramatically lower flop:byte ratios.
  - e.g. if a program only operates on the "red" field of a pixel, bandwidth is wasted.

20 Roofline Model locality walls

Opteron 2356 (Barcelona)

- Remember, memory traffic includes more than just compulsory misses.
- As such, actual arithmetic intensity may be substantially lower.
- Walls are unique to the architecture-kernel combination.

  AI = FLOPs / (Allocations + Compulsory Misses)

[Figure: roofline with "only compulsory miss traffic" and "+write allocation traffic" AI walls]

21 Roofline Model locality walls

Opteron 2356 (Barcelona)

  AI = FLOPs / (Capacity + Allocations + Compulsory)

[Figure: roofline adding a "+capacity miss traffic" wall]

22 Roofline Model locality walls

Opteron 2356 (Barcelona)

  AI = FLOPs / (Conflict + Capacity + Allocations + Compulsory)

[Figure: roofline adding a "+conflict miss traffic" wall]

23 Roofline Model locality walls

Opteron 2356 (Barcelona)

- SW optimizations remove these walls and ceilings, which act to constrain performance.
- One marker = the naïve implementation, constrained by low BW, lack of ILP/DLP, and poor arithmetic intensity.
- The other marker = the ultimate performance limit (16x better performance).

[Figure: roofline with all ceilings and walls; markers show the naïve implementation and the ultimate performance limit]

24

Examples

25 7-point Stencil

- Simplest derivation of the Laplacian operator results in a constant-coefficient 7-point stencil for all x,y,z:

  u(x,y,z,t+1) = alpha*u(x,y,z,t) + beta*( u(x,y,z-1,t) + u(x,y-1,z,t) +
                 u(x-1,y,z,t) + u(x+1,y,z,t) + u(x,y+1,z,t) + u(x,y,z+1,t) )

- Clearly each stencil performs:
  - 8 floating-point operations
  - 8 memory references (all but 2 should be filtered by an ideal cache)
  - 6 memory streams (all but 2 should be filtered; less than # HW prefetchers)

[Figure: 7-point stencil on the 3D PDE grid for the heat equation: center point x,y,z plus neighbors x±1, y±1, z±1]
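For concreteness, the update above can be written as a naive NumPy sweep (an illustrative sketch, not the optimized implementations studied in this work):

```python
# One naive Jacobi sweep of the constant-coefficient 7-point heat-equation stencil.
import numpy as np

def sweep_7pt(u, alpha, beta):
    """Apply the 7-point update to all interior points; boundary is left untouched."""
    new = u.copy()
    new[1:-1, 1:-1, 1:-1] = (
        alpha * u[1:-1, 1:-1, 1:-1]
        + beta * (u[1:-1, 1:-1, :-2] + u[1:-1, :-2, 1:-1] + u[:-2, 1:-1, 1:-1]
                  + u[2:, 1:-1, 1:-1] + u[1:-1, 2:, 1:-1] + u[1:-1, 1:-1, 2:])
    )
    return new

u = np.ones((8, 8, 8))
out = sweep_7pt(u, alpha=-6.0, beta=1.0)  # a Laplacian: constant field maps to 0
print(out[4, 4, 4])  # 0.0
```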

26 Roofline Model – 7pt Stencil

Xeon X5550 (Nehalem)

- Where are we on the roofline?
- 8 floating-point operations per stencil.
- An asymptotic limit of 16 (24) bytes per stencil (with write allocate).
- AI < 0.5 (0.33).
- There is a heavy imbalance between multiplies and adds.

[Figure: Nehalem roofline (1-512 GFLOP/s) with "only compulsory traffic" and "with write allocate" AI walls for the 7-point stencil]

27 Roofline Model – 7pt Stencil

Xeon X5550 (Nehalem)

- However, there is a heavy imbalance between multiplies and adds.
- 66% of peak is the in-core performance limit.
- To attain maximum performance, we must:
  - program for NUMA
  - bypass the cache
  - express ILP x DLP
  - select a parallelization that minimizes the cache working set

[Figure: Nehalem roofline; the stencil's AI walls intersect the bandwidth diagonal below the mul/add-imbalance ceiling]

28 High-order

- Use the roofline to predict the performance of high-order stencils.
- Consider (still finite difference) 2nd, 4th, and 8th-order versions of the heat equation.
- We end up with stencils like:

[Figure: 7-point (2nd order), 13-point (4th order), and 25-point (8th order) star stencils, reaching neighbors out to ±1, ±2, and ±4 in each dimension respectively]

- and computational characteristics:

              2nd order   4th order   8th order
flop's            8          15          29
D$ accesses       8          14          26
DRAM bytes       16          16          16
AI             <0.5        <0.94       <1.8
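The AI row follows directly from the flop counts and the fixed 16 DRAM bytes per stencil (only cache traffic, not DRAM traffic, scales with order):

```python
# Reproducing the AI row of the table above from flop and byte counts.

kernels = {"2nd": 8, "4th": 15, "8th": 29}  # flop's per stencil
dram_bytes = 16                             # identical DRAM traffic for all three
for order, flops in kernels.items():
    print(order, flops / dram_bytes)        # 0.5, 0.9375, 1.8125 flops per byte
```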

29 Roofline Model – High Order Stencils

Xeon X5550 (Nehalem)

- We can plot the AI for these three kernels on the roofline.
- Clearly, the degree of requisite optimization becomes quite high.
- Nevertheless, because:
  1. they are all roughly bound by stream bandwidth
  2. they all transfer the same data
  they will all have approximately the same run time.

[Figure: Nehalem roofline with the 2nd, 4th, and 8th-order stencil AIs marked along the bandwidth diagonal]

30 What about horizontal communication?

- Unfortunately we neglected to include the impact of the deeper ghost zones (= "grow cells" = "halos") on arithmetic intensity.

- If the grid is small (32^3), then the overhead of a 1-, 2-, or 4-deep ghost zone can be progressively more severe:
  - 2nd order: scale AI by 83% (<0.42 flops per byte)
  - 4th order: scale AI by 70% (<0.66 flops per byte)
  - 8th order: scale AI by 51% (<0.92 flops per byte)

- Clearly this will impinge upon our performance bounds…
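The scale factors quoted above can be recovered from the ghost-zone volume overhead (a sketch of the accounting, assuming a cubical n^3 grid wrapped in a depth-d halo that must also be transferred):

```python
# Fraction of transferred points that are interior points on an n^3 grid
# with a ghost zone of the given depth: n^3 / (n + 2*depth)^3.

def ghost_zone_scale(n, depth):
    return n**3 / (n + 2 * depth) ** 3

for order, depth in [("2nd", 1), ("4th", 2), ("8th", 4)]:
    print(order, round(ghost_zone_scale(32, depth), 2))  # 0.83, 0.7, 0.51
```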

31 Roofline Model – High Order Stencils

Xeon X5550 (Nehalem)

- Thus, we go from….

[Figure: Nehalem roofline with the compulsory-traffic AIs of the 2nd, 4th, and 8th-order stencils]

32 Roofline Model – High Order Stencils

Xeon X5550 (Nehalem)

- …to this.
- Optimization is still important, but communication has reduced attainable performance.
- Although the requisite number of flop's scales with order, performance will not.
- As such, time per sweep will increase (slowly) with order.

[Figure: Nehalem roofline with the ghost-zone-adjusted (lower) AIs]

33 Roofline Model – High Order Stencils

Xeon X??? (Sandy Bridge-E)

- What about the next generation SNB-e?
- We can use the roofline to predict the performance of future machines.

- Lack of AVX can hurt performance by 4x.
- Still bandwidth-bound.
- Although peak increased by >4x (8 cores + AVX + GHz), sustained performance will likely only increase by ~2.5x.

[Figure: hypothetical Sandy Bridge-E roofline with a "w/out AVX" ceiling; the stencil AIs remain on the bandwidth diagonal]

34 Communication-Avoiding

- As DRAM bandwidth clearly constrains performance, we should investigate (DRAM) communication-avoiding algorithms.
- In a MG solver, one might apply multiple relaxes.
- By properly orchestrating data movement, one can reduce the number of reads of the grid.
- However, this requires redundant computation on ghost/grow cells.

- Consider three examples (all counting raw flop's):
  - 4 relaxes with a 2nd order stencil on a 32^3 grid: AI ~ 1.5 flops per byte
  - 2 relaxes with a 4th order stencil on a 32^3 grid: AI ~ 1.27 flops per byte
  - 1 relax with an 8th order stencil on a 32^3 grid: AI ~ 1.16 flops per byte

35 Roofline Model – Communication-Avoiding on 32^3

Xeon X5550 (Nehalem)

- When communication-avoiding is applied, (raw) performance is roughly constant as a function of order.
- No surprise: they all perform roughly equal work for equal (vertical) data movement.
- However, there is some degree of redundancy we have ignored:
  - 2nd order: 66% of flop's are useful
  - 4th order: 75% of flop's are useful
  - 8th order: 80% of flop's are useful

[Figure: Nehalem roofline; the three communication-avoiding AIs cluster together on the bandwidth diagonal]

36 Roofline Model – Communication-Avoiding on 32^3

Xeon X5550 (Nehalem)

- When we plot useful flop/s, we scale performance down accordingly.
- The result is that we should be able to perform:
  - 4 x 2nd order relaxes,
  - 2 x 4th order relaxes, or
  - 1 x 8th order relax
  in roughly equal time.
- Need FastMath scientists to decide which will provide better overall time-to-solution.

[Figure: Nehalem roofline with useful-flop/s points for the three communication-avoiding variants]

37

Exascale

38 Exascale machines

- Exascale machines are still on the drawing board, but we can proxy one with the following characteristics:
  - >100K nodes
  - 8-16 TF/s of peak performance per node
  - a hierarchical memory (a la GPU) with
    - >1 TB/s of bandwidth to 10's-100's of GB of "near" DRAM
    - >0.1 TB/s of bandwidth to >1 TB of "far" DRAM
- This motivates us to think about locality in terms of not only flops per byte of near DRAM, but also flops per byte of far DRAM:
  - need to perform >128 flop's per 64b word from near DRAM
  - >1,280 floating-point operations per 64b word from far DRAM

- Can you fit the entire problem in 100 GB of near DRAM?
- If not, but you can fit a MG level solve in near DRAM, do you really perform 128e12 floating-point operations per node per level solve?
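The flops-per-word balance numbers above follow directly from the proxy node's peak and bandwidths (taking the aggressive 16 TF/s end of the assumed range):

```python
# Machine balance of the proxy exascale node, in flop's per 64-bit (8-byte) word.

peak, bw_near, bw_far, word = 16e12, 1e12, 0.1e12, 8
flops_per_near_word = peak / bw_near * word  # flop's per word from near DRAM
flops_per_far_word = peak / bw_far * word    # flop's per word from far DRAM
print(flops_per_near_word, flops_per_far_word)  # 128.0 1280.0
```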

39 Roofline Model

Hypothetical Exascale Node

- We now have two critical arithmetic intensities (from near and far DRAM).
- In this example, avoiding near communication is irrelevant.
- Rather, we must:
  - reinvent algorithms to boost flop:far-byte ratios
  - demand HW designers trade higher far bandwidth for reduced near capacity
  - demand HW designers boost near capacity to fit another working-set plateau

[Figure: two-level roofline, attainable GFLOP/s (64-16384) against both FLOP:near-byte and FLOP:far-byte ratios, with peak DP, "no FMA", and "w/out SIMD" ceilings]
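The two-level bound sketched by this figure is a straightforward extension of the roofline minimum; the parameter defaults below are the proxy-node assumptions from the previous slide:

```python
# Sketch of a two-level roofline: performance is bounded by the in-core ceiling
# and by BOTH memory levels. All parameter values are assumptions.

def two_level_roofline(ai_near, ai_far, peak_gflops=16000.0,
                       bw_near_gb=1000.0, bw_far_gb=100.0):
    return min(peak_gflops, ai_near * bw_near_gb, ai_far * bw_far_gb)

# with plentiful near locality, the far-DRAM term can still dominate:
print(two_level_roofline(ai_near=32.0, ai_far=2.0))    # 200.0: far-bandwidth bound
print(two_level_roofline(ai_near=32.0, ai_far=160.0))  # 16000.0: compute bound
```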

40

Questions?

Acknowledgments

Research supported by DOE Office of Science under contract number DE-AC02-05CH11231.

41

BACKUP SLIDES

42

Alternate Rooflines

43 No overlap of communication and computation

- Previously, we assumed perfect overlap of communication and computation.
- What happens if there is a dependency (either inherent or from a lack of optimization) that serializes communication and computation?

[Diagram: communication time (Byte's / STREAM Bandwidth) and computation time (Flop's / Flop/s) laid end-to-end on a timeline, rather than overlapped]

- Time is the sum of communication time and computation time.
- The result is that flop/s grows asymptotically.
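Side by side, the two time models (a small sketch with assumed machine numbers) show the worst case: at machine balance, serialization costs exactly 2x:

```python
# With overlap, time = max(comm, comp); without overlap, time = comm + comp.

def t_overlap(flops, nbytes, peak, bw):
    return max(flops / peak, nbytes / bw)

def t_serial(flops, nbytes, peak, bw):
    return flops / peak + nbytes / bw

peak, bw = 100e9, 10e9      # assumed 100 GFLOP/s, 10 GB/s machine
flops, nbytes = 1e9, 0.1e9  # AI = 10 flops/byte = the machine balance point
r = t_serial(flops, nbytes, peak, bw) / t_overlap(flops, nbytes, peak, bw)
print(r)  # 2.0
```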

44 No overlap of communication and computation

- Consider a generic machine.
- If we can perfectly decouple and overlap communication with computation, the roofline is sharp/angular.
- However, without overlap, the roofline is smoothed, and attainable performance is degraded by up to a factor of 2x.

45 Alternate Bandwidths

- Thus far, we assumed a synergy between streaming applications and bandwidth (proxied by STREAM).
- STREAM is NOT a good proxy for short-stanza / random cacheline access patterns, as memory latency (instead of just bandwidth) is being exposed.
- Thus one might conceive of alternate memory benchmarks to provide a bandwidth upper bound (ceiling).

- Similarly, if data is primarily local in the LLC, one should construct rooflines based on LLC bandwidth and flop:LLC byte ratios.

- For GPUs/accelerators, PCIe bandwidth can be an impediment. Thus one can construct a roofline model based on PCIe bandwidth and the flop:PCIe byte ratio.

46 Alternate Computations

- Arising from HPC kernels, it's no surprise rooflines use DP flop/s.
- Of course, one could use:
  - SP flop/s,
  - integer ops,
  - bit operations,
  - pairwise comparisons (sorting),
  - graphics operations,
  - etc.

47 Time-based roofline

- In some cases, it is easier to visualize performance in terms of seconds (i.e. time-to-solution).
- We can invert the roofline (seconds per flop) and simply multiply by the number of requisite flop's.

- Additionally, we could change the horizontal axis from locality to some more appealing metric.

48 Alternate Axes

- Rather than thinking of requisite optimizations, think of performance (MStencil/s) as a function of flop's per stencil and DRAM bytes per stencil.
- We can plot DRAM bandwidth to divide the space into bandwidth-bound and compute-bound regions.
- We can also plot iso-curves of constant MStencil/s.
- We observe the effect of communication-avoiding or moving to high-order methods.

[Figure: flop's per stencil vs. DRAM bytes per stencil, with iso-curves of constant MStencil/s; arrows show "move to high-order" and "communication-avoiding"]

49 Little's Law

- Little's Law:

  Concurrency = Latency * Bandwidth
      - or -
  Effective Throughput = Expressed Concurrency / Latency

- Bandwidth:
  - conventional memory bandwidth
  - # of floating-point units
- Latency:
  - memory latency
  - functional unit latency
- Concurrency:
  - bytes expressed to the memory subsystem
  - concurrent (parallel) memory operations
- For example, consider a CPU with 2 FPUs, each with a 4-cycle latency. Little's Law states that we must express 8-way ILP to fully utilize the machine.
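The FPU example above, and the memory example worked on the next slide, both drop straight out of Little's Law:

```python
# Little's Law: concurrency = latency * bandwidth.

mem_bandwidth = 20e9  # bytes/s   (the memory example on the next slide)
mem_latency = 100e-9  # seconds
print(mem_bandwidth * mem_latency)  # 2000 bytes (~2KB) of memory concurrency

fpus, fp_latency_cycles = 2, 4      # 2 FPUs x 4-cycle latency
print(fpus * fp_latency_cycles)     # 8-way ILP to keep both FPUs busy
```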

50 Little’s Law Examples

Applied to Memory:
- Consider a CPU with 20 GB/s of bandwidth and 100 ns memory latency.
- Little's Law states that we must express 2KB of concurrency (independent memory operations) to the memory subsystem to attain peak performance.
- On today's superscalar processors, hardware stream prefetchers speculatively load consecutive elements.
- Solution: express the memory access pattern in a streaming fashion in order to engage the prefetchers.

Applied to FPUs:
- Consider a CPU with 2 FPUs, each with a 4-cycle latency.
- Little's Law states that we must express 8-way ILP to fully utilize the machine.
- Solution: unroll/jam the code to express 8 independent FP operations.
- Note, simply unrolling dependent operations (e.g. a reduction) does not increase ILP. It simply amortizes loop overhead.

51 Three Classes of Locality

- Temporal Locality
  - reusing data (either registers or cache lines) multiple times
  - amortizes the impact of limited bandwidth
  - transform loops or algorithms to maximize reuse

- Spatial Locality
  - data is transferred from cache to registers in words
  - however, data is transferred to the cache in 64-128 byte lines
  - using every word in a line maximizes spatial locality
  - transform data structures into a structure-of-arrays (SoA) layout

- Sequential Locality
  - Many memory address patterns access cache lines sequentially.
  - CPUs' hardware stream prefetchers exploit this observation, speculatively loading data to hide memory latency.
  - Transform loops to generate (a few) long, unit-stride accesses.

52 NUMA

- Recent multicore SMPs have integrated the memory controllers on chip.
- As a result, memory access is non-uniform (NUMA).
- That is, the bandwidth to read a given address varies dramatically among cores.
- Exploit NUMA (affinity + first touch) when you malloc/init data.
- The concept is similar to data decomposition for distributed memory.


54 Various Kernels

- We have examined and heavily optimized a number of kernels and applications for both CPUs and GPUs.
- We observe that for most, performance is highly correlated with DRAM bandwidth, particularly on the GPU.
- Note, GTC has a strong scatter/gather component that skews STREAM-based rooflines.

[Figure: rooflines for the Xeon X5550 (Nehalem) and NVIDIA C2050 (Fermi), with measured kernels plotted: DGEMM, RTM/wave eqn., 27pt stencil, 7pt stencil, GTC/pushi, SpMV, GTC/chargei]
