
Performance Tuning of Scientific Codes with the Roofline Model

1:30pm  Introduction to Roofline (Samuel Williams)
2:00pm  Using Roofline in NESAP (Jack Deslippe)
2:20pm  Using LIKWID for Roofline (Charlene Yang)
2:40pm  Using NVProf for Roofline (Protonu Basu)

3:00pm break / setup NERSC accounts

3:30pm  Introduction to Intel Advisor (Charlene Yang)
3:50pm  Hands-on with Intel Advisor (Samuel Williams)
4:45pm  closing remarks / Q&A (all)

Introductions

Samuel Williams
Computational Research Division
Lawrence Berkeley National Lab
[email protected]

Jack Deslippe
NERSC
Lawrence Berkeley National Lab
[email protected]

Charlene Yang
NERSC
Lawrence Berkeley National Lab
[email protected]

Protonu Basu
Computational Research Division
Lawrence Berkeley National Lab
[email protected]

Introduction to the Roofline Model

Samuel Williams
Computational Research Division
Lawrence Berkeley National Lab
[email protected]

Acknowledgements

§ This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.
§ This material is based upon work supported by the DOE RAPIDS SciDAC Institute.
§ This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-05CH11231.
§ Special Thanks to:
• Zakhar Matveev, Intel Corporation
• Roman Belenov, Intel Corporation

Introduction to Performance Modeling

Why Use Performance Models or Tools?

§ Identify performance bottlenecks
§ Motivate software optimizations
§ Determine when we're done optimizing
• Assess performance relative to machine capabilities
• Motivate need for algorithmic changes
§ Predict performance on future machines / architectures
• Sets realistic expectations on performance for future procurements
• Used for HW/SW co-design to ensure future architectures are well-suited for the computational needs of today's applications

Performance Models

§ Many different components can contribute to kernel run time.
§ Some are characteristics of the application, some are characteristics of the machine, and some are characteristics of both (e.g. data movement depends on both the access pattern and the caches).

Application characteristic    Machine characteristic
#FP operations                Flop/s
Cache data movement           Cache GB/s
DRAM data movement            DRAM GB/s
PCIe data movement            PCIe bandwidth
Depth                         OMP Overhead
MPI Message Size              Network Bandwidth
MPI Send:Wait ratio           Network Gap
#MPI Wait's                   Network Latency

Performance Models

§ Can’t think about all these terms all the time for every application…

[Figure: the same application/machine component list, with the #FP operations / Flop/s terms grouped under "Computational Complexity"]

Performance Models

§ Because there are so many components, performance models often conceptualize the system as being dominated by one or more of these components.

[Figure: the component list, with the compute and data-movement terms (#FP operations / Flop/s, cache and DRAM data movement / bandwidths) grouped under the "Roofline Model"]

Williams et al., "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM, 2009.

Performance Models

§ Because there are so many components, performance models often conceptualize the system as being dominated by one or more of these components.

[Figure: the component list, with the PCIe data movement / PCIe bandwidth terms grouped under "LogCA"]

Bin Altaf et al., "LogCA: A High-Level Performance Model for Hardware Accelerators", ISCA, 2017.

Performance Models

§ Because there are so many components, performance models often conceptualize the system as being dominated by one or more of these components.

[Figure: the component list, with the network terms (MPI message size / network bandwidth, MPI send:wait ratio / network gap, #MPI wait's / network latency) grouped under "LogGP"]

Alexandrov et al., "LogGP: Incorporating Long Messages into the LogP Model - One Step Closer Towards a Realistic Model for Parallel Computation", SPAA, 1995.

Performance Models

§ Because there are so many components, performance models often conceptualize the system as being dominated by one or more of these components.

[Figure: the component list, with the network gap and latency terms grouped under "LogP". The right model depends on the app and the problem size.]

Culler et al., "LogP: A Practical Model of Parallel Computation", CACM, 1996.

Roofline Model: Arithmetic Intensity and Bandwidth

Performance Models / Simulators

§ Historically, many performance models and simulators tracked latencies to predict performance (i.e. counting cycles)

§ The last two decades saw a number of latency-hiding techniques…
• Out-of-order execution (hardware discovers parallelism to hide latency)
• HW stream prefetching (hardware speculatively loads data)
• Massive thread parallelism (independent threads satisfy the latency-bandwidth product)

§ Effective latency hiding has resulted in a shift from a latency-limited computing regime to a throughput-limited computing regime

Roofline Model

§ Roofline Model is a throughput-oriented performance model…
• Tracks rates not times
• Augmented with Little's Law (concurrency = latency * bandwidth)
• Independent of ISA and architecture (applies to CPUs, GPUs, TPUs1, etc…)
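For example (illustrative numbers, not from the slides): by Little's Law, sustaining 400 GB/s of DRAM bandwidth at 100 ns memory latency requires 400 GB/s * 100 ns = 40 KB of data in flight, i.e. on the order of 600 outstanding 64-byte cache lines.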

https://crd.lbl.gov/departments/computer-science/PAR/research/roofline

1 Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", ISCA, 2017.

(DRAM) Roofline

§ One could hope to always attain peak performance (Flop/s)
§ However, finite locality (reuse) and bandwidth limit performance.
§ Assume:
• Idealized processor/caches
• Cold start (data in DRAM)

[Figure: CPU (compute, flop/s) connected to DRAM (data, GB) by DRAM bandwidth (GB/s)]

Time = max { #FP ops / Peak GFlop/s , #Bytes / Peak GB/s }

(DRAM) Roofline

§ One could hope to always attain peak performance (Flop/s)
§ However, finite locality (reuse) and bandwidth limit performance.
§ Assume:
• Idealized processor/caches
• Cold start (data in DRAM)

[Figure: CPU (compute, flop/s) connected to DRAM (data, GB) by DRAM bandwidth (GB/s)]

Time / #FP ops = max { 1 / Peak GFlop/s , (#Bytes / #FP ops) / Peak GB/s }

(DRAM) Roofline

§ One could hope to always attain peak performance (Flop/s)
§ However, finite locality (reuse) and bandwidth limit performance.
§ Assume:
• Idealized processor/caches
• Cold start (data in DRAM)

[Figure: CPU (compute, flop/s) connected to DRAM (data, GB) by DRAM bandwidth (GB/s)]

#FP ops / Time = min { Peak GFlop/s , (#FP ops / #Bytes) * Peak GB/s }

(DRAM) Roofline

§ One could hope to always attain peak performance (Flop/s)
§ However, finite locality (reuse) and bandwidth limit performance.
§ Assume:
• Idealized processor/caches
• Cold start (data in DRAM)

GFlop/s = min { Peak GFlop/s , AI * Peak GB/s }

Note: Arithmetic Intensity (AI) = Flops / Bytes (as presented to DRAM)
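The bound above is simple enough to compute directly. Below is a minimal C sketch; the peak values are illustrative placeholders (taken from the ERT measurements for Cori/KNL quoted later in this deck), not properties of the model itself:

#include <stdio.h>

/* Attainable GFlop/s for a given arithmetic intensity (Flop/Byte). */
double roofline_bound(double ai, double peak_gflops, double peak_gbs) {
    double mem_bound = ai * peak_gbs;           /* bandwidth-limited rate */
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void) {
    double peak_gflops = 2450.0;   /* assumed peak compute (GFlop/s) */
    double peak_gbs    = 412.9;    /* assumed DRAM bandwidth (GB/s)  */
    for (double ai = 0.0625; ai <= 64.0; ai *= 2.0)
        printf("AI = %7.4f Flop/Byte -> bound = %8.1f GFlop/s\n",
               ai, roofline_bound(ai, peak_gflops, peak_gbs));
    return 0;
}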

(DRAM) Roofline

§ Plot the Roofline bound using Arithmetic Intensity as the x-axis
§ Log-log scale makes it easy to doodle, extrapolate performance along Moore's Law, etc…
§ Kernels with AI less than machine balance are ultimately DRAM-bound (we'll refine this later…)

[Figure: Roofline plot of Attainable Flop/s vs. Arithmetic Intensity (Flop:Byte); a flat Peak Flop/s ceiling meets a sloped DRAM GB/s line, with the DRAM-bound region to the left of the ridge point and the compute-bound region to the right]

Roofline Example #1

§ Typical machine balance is 5-10 flops per byte…
• 40-80 flops per double to exploit compute capability
• Artifact of technology and money
• Unlikely to improve

§ Consider STREAM Triad…
• 2 flops per iteration
• 24 bytes transferred per iteration (read X[i] and Y[i], write Z[i])
• AI = 0.083 flops per byte == memory bound (Gflop/s ≤ AI * DRAM GB/s)

#pragma omp parallel for
for(i=0;i<N;i++){
  Z[i] = X[i] + alpha*Y[i];
}
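Plugging in: with AI = 0.083 and, say, the 412.9 DRAM GB/s that ERT measures later in this deck for Cori/KNL, the bound is roughly 34 GFlop/s, far below peak, so STREAM Triad is firmly memory bound.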

Roofline Example #2

§ Conversely, consider a 7-point constant coefficient stencil…
• 7 flops per point
• 8 memory references (7 reads, 1 store) per point
• Cache can filter all but 1 read and 1 write per point
• AI = 0.44 flops per byte == memory bound, but 5x the STREAM flop rate (Gflop/s ≤ AI * DRAM GB/s)

#pragma omp parallel for
for(k=1;k<dim+1;k++){
for(j=1;j<dim+1;j++){
for(i=1;i<dim+1;i++){
  new[k][j][i] = -6.0*old[k][j][i]
               + old[k][j][i-1] + old[k][j][i+1]
               + old[k][j-1][i] + old[k][j+1][i]
               + old[k-1][j][i] + old[k+1][j][i];
}}}
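Checking the arithmetic: 7 flops against 16 bytes of DRAM traffic (one 8-byte read and one 8-byte write per point once the cache filters the rest) gives AI = 0.44, and 0.44 / 0.083 ≈ 5.3, hence roughly 5x STREAM's flop rate at the same DRAM bandwidth.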

Hierarchical Roofline

§ Real processors have multiple levels of memory
• Registers
• L1, L2, L3 cache
• MCDRAM/HBM (KNL and GPU device memory)
• DDR (main memory)
• NVRAM (non-volatile memory)
§ Applications can have locality in each level
§ Unique data movements imply unique AI's
§ Moreover, each level will have a unique bandwidth

Hierarchical Roofline

§ Construct a superposition of Rooflines…
§ Measure a bandwidth for each level of memory
§ Measure an AI for each level of memory
• Although a loop nest may have multiple AI's and multiple bounds (flops, L1, L2, … DRAM)…
• … performance is bound by the minimum

[Figure: Roofline with Peak Flop/s, L2 GB/s, MCDRAM cache GB/s, and DDR GB/s ceilings; here DDR AI*BW < MCDRAM AI*BW, so the kernel is DDR bound]
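To make the superposition concrete, here is a minimal C sketch (assuming the per-level AI's and bandwidths have already been measured; the function name is ours, not from any tool):

/* Hierarchical Roofline bound: the minimum of the compute peak and
 * every per-level bandwidth bound AI[l] * BW[l]. */
double hierarchical_bound(double peak_gflops,
                          const double ai[], const double bw_gbs[],
                          int nlevels) {
    double bound = peak_gflops;
    for (int l = 0; l < nlevels; l++) {
        double b = ai[l] * bw_gbs[l];   /* bound imposed by memory level l */
        if (b < bound) bound = b;
    }
    return bound;
}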

Hierarchical Roofline

§ Construct a superposition of Rooflines…
§ Measure a bandwidth for each level of memory
§ Measure an AI for each level of memory
• Although a loop nest may have multiple AI's and multiple bounds (flops, L1, L2, … DRAM)…
• … performance is bound by the minimum

[Figure: the DDR bottleneck pulls performance below the MCDRAM Roofline]

Hierarchical Roofline

§ Construct a superposition of Rooflines…
§ Measure a bandwidth for each level of memory
§ Measure an AI for each level of memory
• Although a loop nest may have multiple AI's and multiple bounds (flops, L1, L2, … DRAM)…
• … performance is bound by the minimum

[Figure: here MCDRAM AI*BW < DDR AI*BW, so the kernel is MCDRAM bound]

Hierarchical Roofline

§ Construct a superposition of Rooflines…
§ Measure a bandwidth for each level of memory
§ Measure an AI for each level of memory
• Although a loop nest may have multiple AI's and multiple bounds (flops, L1, L2, … DRAM)…
• … performance is bound by the minimum

[Figure: the MCDRAM bottleneck pulls performance below the DDR Roofline]

Roofline Model: Modeling In-core Performance Effects

Data, Instruction, Thread-Level Parallelism…

§ Modern CPUs use several techniques to increase per-core Flop/s

Fused Multiply-Add
• w = x*y + z is a common idiom in linear algebra
• Rather than having separate multiply and add instructions, processors can use a fused multiply-add (FMA)
• The FPU chains the multiply and add in a single pipeline so that it can complete one FMA per cycle
• Resurgence: Tensor Cores, QFMA, etc.

Vector Instructions
• Many HPC codes apply the same operation to a vector of elements
• Vendors provide vector instructions that apply the same operation to 2, 4, 8, 16 elements… x[0:7]*y[0:7]+z[0:7]
• Vector FPUs complete 8 vector operations/cycle

Deep Pipelines
• The hardware for a FMA is substantial.
• Breaking a single FMA up into several smaller operations and pipelining them allows vendors to increase GHz
• Little's Law applies… need FP_Latency * FP_bandwidth independent instructions
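As a concrete illustration of the FMA idiom (a minimal sketch; the function name is ours), with vectorization enabled compilers typically turn each iteration into part of a vector FMA instruction:

void fma_loop(int n, double *w,
              const double *x, const double *y, const double *z) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        w[i] = x[i]*y[i] + z[i];   /* one fused multiply-add per element */
}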

Data, Instruction, Thread-Level Parallelism…

§ If every instruction were an ADD (instead of FMA), performance would drop by 2x on KNL or 4x on Haswell
§ Similarly, if one failed to vectorize, performance would drop by another 8x on KNL and 4x on Haswell
§ Lack of threading (or load imbalance) will reduce performance by another 64x on KNL

[Figure: Roofline with in-core ceilings (Peak Flop/s, Add-only/No FMA, No vectorization); poor vectorization pulls performance below the DDR Roofline]

Superscalar vs. Instruction Mix

§ Define in-core ceilings based on instruction mix…
§ e.g. Haswell
• 4-issue superscalar
• Only 2 FP data paths
• Requires 50% of the instructions to be FP to get peak performance

[Figure: ceilings at ≥50% FP, 25% FP (75% int), and 12% FP (88% int)]

Superscalar vs. Instruction Mix

§ Define in-core ceilings based on instruction mix…
§ e.g. Haswell
• 4-issue superscalar
• Only 2 FP data paths
• Requires 50% of the instructions to be FP to get peak performance
§ e.g. KNL
• 2-issue superscalar
• 2 FP data paths
• Requires 100% of the instructions to be FP to get peak performance

[Figure: ceilings at 100% FP, 50% FP (50% int), and 25% FP (75% int)]


Superscalar vs. Instruction Mix

§ Define in-core ceilings based on instruction mix…
§ e.g. Haswell
• 4-issue superscalar
• Only 2 FP data paths
• Requires 50% of the instructions to be FP to get peak performance
§ e.g. KNL
• 2-issue superscalar
• 2 FP data paths
• Requires 100% of the instructions to be FP to get peak performance

[Figure: non-FP instructions can sap instruction issue bandwidth and pull performance below the Roofline]

Divides and Other Slow FP Instructions

§ FP divides (sqrt, rsqrt, …) might support only limited pipelining
§ As such, their throughput is substantially lower than FMA's
§ Even if only 3% of the flop's come from divides, performance can be cut in half
Ø Penalty varies substantially between architectures and generations (e.g. IVB, HSW, KNL, …)

[Figure: a "6% VDIVPD" ceiling well below Peak Flop/s; a divide in the inner loop can easily cut peak performance in half]
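The arithmetic behind this is worth a quick sketch (the 30x cost ratio below is illustrative; the actual divide-to-FMA ratio varies by architecture):

/* Relative run time vs. an all-FMA instruction stream when a fraction
 * frac_div of the flop's are divides, each costing div_cost FMA-slots. */
double fp_slowdown(double frac_div, double div_cost) {
    return (1.0 - frac_div) + frac_div * div_cost;
}
/* e.g. fp_slowdown(0.03, 30.0) ~= 1.87: ~3% divides can halve throughput */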

Roofline Model: Modeling Cache Effects

Locality Walls

§ Naively, we can bound AI using only compulsory cache misses

AI = #Flop's / Compulsory Misses

[Figure: Roofline with Peak Flop/s, No FMA, and No vectorization ceilings; a vertical wall at the compulsory AI]

Locality Walls

§ Naively, we can bound AI using only compulsory cache misses
§ However, write-allocate caches can lower AI

AI = #Flop's / (Compulsory Misses + Write Allocates)

[Figure: write-allocate traffic shifts the AI wall left of the compulsory AI]

Locality Walls

§ Naively, we can bound AI using only compulsory cache misses
§ However, write-allocate caches can lower AI
§ Cache capacity misses can have a huge penalty

AI = #Flop's / (Compulsory Misses + Write Allocates + Capacity Misses)

[Figure: capacity misses shift the AI wall further left]

Locality Walls

§ Naively, we can bound AI using only compulsory cache misses
§ However, write-allocate caches can lower AI
§ Cache capacity misses can have a huge penalty
Ø Compute bound became memory bound
Ø Know the theoretical bounds on your AI!

AI = #Flop's / (Compulsory Misses + Write Allocates + Capacity Misses)

[Figure: the successive AI walls (compulsory, +write-allocate, +capacity) move the kernel from compute bound to memory bound]
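In code form (a minimal sketch; all traffic terms in bytes, with compulsory and write-allocate traffic computable from array sizes, and capacity-miss traffic coming from measurement or a cache simulator):

/* AI falls as each additional traffic term enters the denominator. */
double bounded_ai(double flops, double compulsory_bytes,
                  double write_allocate_bytes, double capacity_bytes) {
    return flops / (compulsory_bytes + write_allocate_bytes + capacity_bytes);
}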

Roofline Model: General Strategy Guide

General Strategy Guide

§ Broadly speaking, there are three approaches to improving performance:

[Figure: Roofline with Peak Flop/s, No FMA, and DDR GB/s ceilings]

General Strategy Guide

§ Broadly speaking, there are three approaches to improving performance:
§ Maximize in-core performance (e.g. get the compiler to vectorize)

[Figure: an arrow pushing performance up toward the Peak Flop/s ceiling]

General Strategy Guide

§ Broadly speaking, there are three approaches to improving performance:
§ Maximize in-core performance (e.g. get the compiler to vectorize)
§ Maximize memory bandwidth (e.g. NUMA-aware allocation)

[Figure: an arrow pushing performance up toward the DDR GB/s line]

General Strategy Guide

§ Broadly speaking, there are three approaches to improving performance:
§ Maximize in-core performance (e.g. get the compiler to vectorize)
§ Maximize memory bandwidth (e.g. NUMA-aware allocation)
§ Minimize data movement (increase AI)

[Figure: an arrow moving the kernel from its current AI toward the compulsory AI]

Constructing a Roofline Model requires answering some questions…

Questions can overwhelm users…

§ Properties of the target machine (Benchmarking):
• What is my machine's peak flop/s?
• What is my machine's DDR GB/s? L2 GB/s?
• How important is vectorization or FMA on my machine?
§ Fundamental properties of the kernel, constrained by hardware (Theory / communication lower bounds):
• What is my kernel's compulsory AI?
• Can my kernel ever be vectorized?
§ Properties of an application's execution (Instrumentation):
• How many flop's did my kernel actually do?
• How much data did my kernel actually move?
• Did my kernel vectorize? How much did that divide hurt?

We need tools…

Node Characterization?

§ "Marketing numbers" can be deceptive…
• Pin BW vs. real bandwidth
• TurboMode / underclock for AVX
• Compiler failings on high-AI loops

§ LBL developed the Empirical Roofline Toolkit (ERT)…
• Characterize CPU/GPU systems
• Peak Flop rates
• Bandwidths for each level of memory
• MPI+OpenMP/CUDA == multiple GPUs

[Figure: Empirical Roofline graphs. Cori / KNL (Results.quadflat.2t/Run.002): 2450.0 GFLOPs/sec maximum; L1 6442.9 GB/s, L2 1965.4 GB/s, DRAM 412.9 GB/s. SummitDev / 4 GPUs (Results.summitdev.ccs.ornl.gov.02.MPI4/Run.001): 17904.6 GFLOPs/sec maximum; L1 6506.5 GB/s, DRAM 1929.7 GB/s.]

https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/

Instrumentation with Performance Counters?

§ Characterizing applications with performance counters can be problematic…
x Flop counters can be broken/missing in production processors
x Vectorization/masking can complicate counting flop's
x Counting loads and stores doesn't capture cache reuse, while counting cache misses doesn't account for prefetchers
x DRAM counters (uncore PMU) might be accurate, but…
x … are privileged and thus nominally inaccessible in user mode
x … may need vendor (e.g. Cray) and center (e.g. NERSC) approved OS/kernel changes

Forced to Cobble Together Tools…

§ Use tools known/observed to work on NERSC's Cori (KNL, HSW)…
• Used Intel SDE (Pin binary instrumentation + emulation) to create software flop counters
• Used the Intel VTune performance tool (NERSC/Cray approved) to access uncore counters
Ø Accurate measurement of flop's (HSW) and DRAM data movement (HSW and KNL)
Ø Used by NESAP (NERSC KNL application readiness project) to characterize apps on Cori…

http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/

NERSC is LBL’s production computing division CRD is LBL’s Computational Research Division NESAP is NERSC’s KNL application readiness project 51 LBL is part of SUPER (DOE SciDAC3 Computer Science Institute) Initial Roofline Analysis of NESAP Codes MFDn EMGeo PICSAR 10000" 10000" 10000"

Roofline"Model" 1000" 1000" 1000" Roofline"Model" wo/FMA" Roofline"Model" 100" wo/FMA" 100" Original" 100" wo/FMA" GFLOP/s" GFLOP/s" 1"RHS" GFLOP/s" SELL" Original" 10" 10" 10" 4"RHS" SB" Rooflinew/Tiling" 8"RHS" SELL+SB" w/Tiling+Vect"

2P HSW 1" 1" 1" -only 0.01" 0.1" 1" 10" 0.1" 1" 10" nRHS+SELL+SB" 0.1" 1" 10" Arithme3c"Intensity"(FLOP/byte)" Arithme3c"Intensity"(FLOP/byte)" Arithme3c"Intensity"(FLOP/byte)"DRAMinsufficient for was PICSAR 10000" 10000" 10000"

Roofline"Model" 1000" 1000" 1000" Roofline"Model" wo/FMA" Roofline"Model" 100" wo/FMA" 100" Original" 100" wo/FMA" GFLOP/s" GFLOP/s" 1"RHS" GFLOP/s" SELL" Original" 10" 4"RHS" 10" SB" 10" w/Tiling" KNL 8"RHS" SELL+SB" w/Tiling+Vect" 1" 1" 1" 0.01" 0.1" 1" 10" 0.1" 1" 10" nRHS+SELL+SB" 0.1" 1" 10" Arithme3c"Intensity"(FLOP/byte)" Arithme3c"Intensity"(FLOP/byte)" Arithme3c"Intensity"(FLOP/byte)"

Evaluation of LIKWID

§ LIKWID provides easy-to-use wrappers for measuring performance counters…
ü Works on NERSC production systems
ü Minimal overhead (<1%)
ü Scalable in distributed memory (MPI-friendly)
ü Fast, high-level characterization
x No detailed timing breakdown or optimization advice
x Limited by the quality of the hardware performance counter implementation (garbage in / garbage out)
Ø Useful tool that complements other tools

[Figure: AMReX application characterization (2Px16c HSW == Cori Phase 1): L2, L3, and DRAM bandwidths vs. the Roofline for Nyx, MFIX, WarpX, HPGMG, PeleLM, and Combustor at 4Px8T and 32Px1T]
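For reference, a hedged sketch of LIKWID's marker API (see the LIKWID documentation for exact usage; typically compiled with -DLIKWID_PERFMON and run under likwid-perfctr -m; the kernel function here is hypothetical):

#include <likwid-marker.h>

void compute(void);   /* hypothetical kernel to characterize */

int main(void) {
    LIKWID_MARKER_INIT;              /* set up the marker API     */
    LIKWID_MARKER_START("kernel");   /* begin a named region      */
    compute();
    LIKWID_MARKER_STOP("kernel");    /* end the named region      */
    LIKWID_MARKER_CLOSE;             /* write out counter results */
    return 0;
}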

https://github.com/RRZE-HPC/likwid

Intel Advisor

§ Includes Roofline Automation…
ü Automatically instruments applications (one dot per loop nest/function)
ü Computes FLOPS and AI for each function (CARM)
ü AVX-512 support that incorporates masks
ü Integrated Cache Simulator1 (hierarchical Roofline / multiple AI's)
ü Automatically benchmarks target system (calculates ceilings)
ü Full integration with existing Advisor capabilities

[Figure: Advisor Roofline plot. Memory-bound loops: invest in cache blocking, etc. Compute-bound loops: invest in SIMD, etc.]

http://www.nersc.gov/users/training/events/roofline-training-1182017-1192017

1 Technology Preview, not in official product roadmap so far.

Tools and Platforms for Roofline Modeling

Ø Use ERT to benchmark systems
Ø Use LIKWID for fast, scalable app-level instrumentation
Ø Use Advisor for loop-level instrumentation and analysis on Intel targets

[Table: for each tool (STREAM, ERT, Intel SDE, Intel VTune, LIKWID, Intel Advisor, NVIDIA NVProf), which benchmark metrics (peak MFlops, peak MIPS, DRAM BW, cache BW), execution metrics (MFlops, MIPS, %SIMD, DRAM BW, cache BW, auto-Roofline), and platforms (Intel CPUs, IBM Power8, NVIDIA GPUs, AMD CPUs, AMD GPUs, ARM) are supported; the individual support marks are not recoverable]

Questions?

Backup: Complexity, Depth, …

Why Use Performance Models or Tools?

§ Identify performance bottlenecks
§ Motivate software optimizations
§ Determine when we're done optimizing
• Assess performance relative to machine capabilities
• Motivate need for algorithmic changes
§ Predict performance on future machines / architectures
• Sets realistic expectations on performance for future procurements
• Used for HW/SW co-design to ensure future architectures are well-suited for the computational needs of today's applications

Computational Complexity

§ Assume run time is correlated with the number of floating-point operations performed

[Code: two OpenMP loop-nest examples, not recoverable from the extraction]

Data Movement Complexity

§ Assume run time is correlated with the amount of data accessed (or moved)
§ Easy to calculate the amount of data accessed… count array accesses
§ Data moved is more complex, as it requires understanding cache behavior…
• Compulsory1 data movement (array sizes) is a good initial guess…
• … but needs refinement for the effects of finite cache capacities

Which is more expensive… performing flop's, or moving words from memory?

Operation   Flop's      Data
DAXPY       O(N)        O(N)
DGEMV       O(N^2)      O(N^2)
DGEMM       O(N^3)      O(N^2)
FFTs        O(NlogN)    O(N)
CG          O(N^1.33)   O(N^1.33)
MG          O(N)        O(N)
N-body      O(N^2)      O(N)

1 Hill et al., "Evaluating Associativity in CPU Caches", IEEE Trans. Comput., 1989.

Machine Balance and Arithmetic Intensity

§ Data movement and computation can operate at different rates
§ We define machine balance as the ratio of…

Balance = Peak DP Flop/s / Peak Bandwidth

§ …and arithmetic intensity as the ratio of…

AI = Flop's Performed / Data Moved

Ø Kernels with AI greater than machine balance are ultimately compute limited

Operation   Flop's      Data        AI (ideal)
DAXPY       O(N)        O(N)        O(1)
DGEMV       O(N^2)      O(N^2)      O(1)
DGEMM       O(N^3)      O(N^2)      O(N)
FFTs        O(NlogN)    O(N)        O(logN)
CG          O(N^1.33)   O(N^1.33)   O(1)
MG          O(N)        O(N)        O(1)
N-body      O(N^2)      O(N)        O(N)
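Plugging in the ERT numbers quoted earlier for Cori/KNL as an illustration (a sketch using this deck's own measurements, not a general claim):

/* Machine balance from measured peaks; prints ~5.9 flops/byte for the
 * Cori/KNL numbers quoted earlier, within the 5-10 range cited above. */
#include <stdio.h>
int main(void) {
    double peak_gflops = 2450.0;   /* ERT maximum (GFlop/s)     */
    double dram_gbs    = 412.9;    /* ERT DRAM bandwidth (GB/s) */
    printf("machine balance = %.1f flops/byte\n", peak_gflops / dram_gbs);
    return 0;
}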

Distributed Memory Performance Modeling

§ In distributed memory, one communicates by sending messages between processors.
§ Messaging time can be constrained by several components…
• Overhead (CPU time to send/receive a message)
• Latency (time the message is in the network; can be hidden)
• Message throughput (rate at which one can send small messages… messages/second)
• Bandwidth (rate at which one can send large messages… GBytes/s)
§ Bandwidths and latencies are further constrained by the interplay of network architecture and contention
§ Distributed memory versions of our algorithms can be differently stressed by these components depending on N and P (#processors)
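A common way to combine the first few components is a simple postal-style cost model (a sketch under the usual assumptions, not a model prescribed by these slides):

/* Time to deliver one message of `bytes` bytes. */
double msg_time(double overhead_s, double latency_s,
                double bytes, double bandwidth_Bps) {
    return overhead_s + latency_s + bytes / bandwidth_Bps;
}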

Computational Depth

§ Parallel machines incur substantial overheads on synchronization (barriers), point-to-point communication, reductions, and broadcasts.
§ We can classify algorithms by depth (max depth of the algorithm's dependency chain)
Ø If the dependency chain crosses process boundaries, we incur substantial overheads.
Ø Overheads can dominate at high concurrency or for small problems!

Operation   Flop's      Data        AI (ideal)  Depth
DAXPY       O(N)        O(N)        O(1)        O(1)
DGEMV       O(N^2)      O(N^2)      O(1)        O(logN)
DGEMM       O(N^3)      O(N^2)      O(N)        O(logN)
FFTs        O(NlogN)    O(N)        O(logN)     O(logN)
CG          O(N^1.33)   O(N^1.33)   O(1)        O(N^0.33)
MG          O(N)        O(N)        O(1)        O(logN)
N-body      O(N^2)      O(N)        O(N)        O(logN)

Modeling NUMA

NUMA Effects

§ Cori's Haswell nodes are built from 2 processors (sockets)
• Memory attached to each socket (fast)
• Interconnect that allows remote memory access (slow == NUMA)
• Improper memory allocation can result in more than a 2x performance penalty

[Figure: CPU0 (cores 0-15) and CPU1 (cores 16-31), each with ~50GB/s to its local DRAM; Roofline with a DDR GB/s ceiling and a lower DDR GB/s (NUMA) ceiling]

Hierarchical Roofline vs. Cache-Aware Roofline

…understanding different Roofline formulations in Advisor

There are two major Roofline formulations:

§ Hierarchical Roofline (original Roofline w/ DRAM, L3, L2, …)…
• Williams et al., "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM, 2009
• Chapter 4 of "Auto-tuning Performance on Multicore Computers", 2008
• Defines multiple bandwidth ceilings and multiple AI's per kernel
• Performance bound is the minimum of flops and the memory intercepts (superposition of original, single-metric Rooflines)
§ Cache-Aware Roofline
• Ilic et al., "Cache-aware Roofline model: Upgrading the loft", IEEE Computer Architecture Letters, 2014
• Defines multiple bandwidth ceilings, but uses a single AI (flop:L1 bytes)
• As one loses cache locality (capacity, conflict, …) performance falls from one BW ceiling to a lower one at constant AI
§ Why does this matter?
• Some tools use the Hierarchical Roofline, some use the Cache-Aware Roofline == users need to understand the differences
• The Cache-Aware Roofline model was integrated into the production Intel Advisor
• An evaluation version of the Hierarchical Roofline1 (cache simulator) has also been integrated into Intel Advisor

1 Technology Preview, not in official product roadmap so far.

Hierarchical Roofline vs. Cache-Aware Roofline

Hierarchical Roofline:
• Captures cache effects
• AI is Flop:Bytes after being filtered by lower cache levels
• Multiple arithmetic intensities (one per level of memory)
• AI dependent on problem size (capacity misses reduce AI)
• Memory/cache/locality effects are observed as decreased AI
• Requires performance counters or a cache simulator to correctly measure AI

Cache-Aware Roofline:
• Captures cache effects
• AI is Flop:Bytes as presented to the L1 cache (plus non-temporal stores)
• Single arithmetic intensity
• AI independent of problem size
• Memory/cache/locality effects are observed as decreased performance
• Requires static analysis or binary instrumentation to measure AI

Example: STREAM

§ L1 AI…
• 2 flops per iteration
• 24 bytes per iteration (read X[i] and Y[i], write Z[i])
• = 0.083 flops per byte
§ No cache reuse: iteration i touches no data used by any other iteration, so the DRAM AI equals the L1 AI

#pragma omp parallel for
for(i=0;i<N;i++){
  Z[i] = X[i] + alpha*Y[i];
}

Example: STREAM

[Figure: two Roofline plots, both at AI = 0.083. Hierarchical Roofline: multiple AI's (Flop:DRAM bytes and Flop:L1 bytes, here the same); performance is bound to the minimum of the two intercepts, AI_L1 * L1 GB/s and AI_DRAM * DRAM GB/s. Cache-Aware Roofline: a single AI based on flop:L1 bytes; observed performance is correlated with DRAM bandwidth.]

Example: 7-point Stencil (Small Problem)

§ L1 AI…
• 7 flops per point
• 8 memory references (7 reads, 1 store) per point
• = 0.11 flops per byte (flop:L1)

#pragma omp parallel for
for(k=1;k<dim+1;k++){
for(j=1;j<dim+1;j++){
for(i=1;i<dim+1;i++){
  new[k][j][i] = -6.0*old[k][j][i]
               + old[k][j][i-1] + old[k][j][i+1]
               + old[k][j-1][i] + old[k][j+1][i]
               + old[k-1][j][i] + old[k+1][j][i];
}}}
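Checking the L1 arithmetic: 7 flops against 8 references * 8 bytes = 64 bytes gives 7/64 ≈ 0.11 flops per byte, the flop:L1 value used in the plots that follow.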

Example: 7-point Stencil (Small Problem)

[Figure: Hierarchical Roofline with multiple AI's (flop:DRAM ~ 0.44, flop:L1 ~ 0.11); the performance bound is the minimum of the two intercepts. Cache-Aware Roofline with a single AI at 0.11.]

Example: 7-point Stencil (Small Problem)

[Figure: Hierarchical Roofline: the performance bound is the minimum of the two intercepts (flop:DRAM ~ 0.44, flop:L1 ~ 0.11). Cache-Aware Roofline: a single AI based on flop:L1 bytes; observed performance falls between the L1 and DRAM lines (== some cache locality).]

Example: 7-point Stencil (Large Problem)

[Figure: Hierarchical Roofline: capacity misses reduce the DRAM AI (flop:DRAM ~ 0.20) and performance. Cache-Aware Roofline: the single flop:L1 AI stays at ~0.11, but observed performance moves closer to the DRAM line (== less cache locality).]

Example: 7-point Stencil (Observed Perf.)

[Figure: on both formulations, actual observed performance is tied to the bottlenecked resource and can be well below a cache Roofline (e.g. L1). Hierarchical Roofline: dot plotted at flop:DRAM ~ 0.20. Cache-Aware Roofline: dot at the single flop:L1 AI ~ 0.11, closer to the DRAM line (== less cache locality).]

Example: 7-point Stencil (Observed Perf.)

[Figure: as on the previous slide, with the DRAM GB/s ceiling drawn on the Hierarchical Roofline as well]
