ERLANGEN REGIONAL COMPUTING CENTER

SIMD: Data parallel execution
J. Eitzinger
PATC, 12.6.2018

Stored Program Computer: Base setting

for (int j=0; j<size; j++) {
    sum = sum + A[j];
}

401d08:  f3 0f 58 04 82   addss  xmm0,[rdx + rax * 4]
401d0d:  48 83 c0 01      add    rax,1
401d11:  39 c7            cmp    edi,eax
401d13:  77 f3            ja     401d08

[Block diagram: CPU with Arithmetic Logical Unit and Control Unit, connected to Input and Output]

Architect's view: Make the common case fast!

Execution and memory: improvements for relevant software.

Strategies – what are the technical opportunities?
. Increase clock speed
. Parallelism
. Specialization
Plus: economical concerns, marketing concerns

2 Data parallel execution units (SIMD)

for (int j=0; j<size; j++) {
    sum = sum + A[j];
}

• 4 operands (AVX)

• 8 operands (AVX512)

3 Data parallel execution units (SIMD)

for (int j=0; j<size; j++) {
    sum = sum + A[j];
}

• 2 operands (SSE)

• 4 operands (AVX)

• 8 operands (AVX512)

4 History of short vector SIMD

1995–2016: VIS, MDMX, MMX, 3DNow, SSE, AltiVec, SSE2, SSE4, VSX, AVX, NEON, IMCI, AVX2, AVX512

Register width: 64 bit → 128 bit → 256 bit → 512 bit

Vendors package other ISA features under the SIMD flag: monitor/mwait, NT stores, prefetch ISA, application-specific instructions, FMA

Vendors package other ISA features under the SIMD flag: monitor/mwait, NT stores, prefetch ISA, application specific instructions, FMA

x86:     Intel 512 bit, AMD 256 bit
Power:   IBM 128 bit
ARM:     64 bit / 128 bit (A5 upward)
SPARC64: Fujitsu 256 bit (K Computer 128 bit)

5 Technologies Driving Performance

Technology trends 1991–2018:

Clock:     33 MHz → 200 MHz → 1.1 GHz → 2 GHz → 3.8 GHz → 3.2 GHz → 2.9 GHz → 2.7 GHz → 1.9 GHz → 1.7 GHz
ILP/SMT:   SMT → SMT2 → SMT4 → SMT8
SIMD:      SSE → SSE2 → AVX → AVX512
Multicore: 2C → 4C → 8C → 12C → 15C → 18C → 22C → 28C
Memory:    3.2 → 6.4 → 12.8 → 25.6 → 42.7 → 60 → 128 GB/s

ILP Obstacle: Not more parallelism available

Clock Obstacle: Power/Heat dissipation

Multi-/Manycore Obstacle: Getting data to/from cores

SIMD Obstacle: Power

6 History of Intel chip performance

[Chart: performance history; package power 145 W, 173 W, 135 W, 96 W. Trade cores for frequency.]

7 The real picture

[Chart: measured performance over time for SSE2, AVX, and AVX512 FMA code paths]

8 Finding the right compromise

[Chart: design space spanned by core complexity, # cores, SIMD width, and frequency; Intel Skylake-EP, Intel KNL, and Nvidia GP100 choose different weights. Area is total power budget!]

Turbo: Change weights within the same architecture!

9 AVX and Turbo mode

Haswell 2.3 GHz / Broadwell 2.3 GHz

Xeon E5-2630v4 (10 cores, 2.2 GHz): turning off Turbo is not an option because the AVX base clock is low!

10 LINPACK frequency game

Highest performance at knee of roofline plot!

11 The driving forces behind performance 2012

Intel IvyBridge-EP

P = ncore * F * S * n


Number of cores            ncore   12
FP instructions per cycle  F       2
FP ops per instruction     S       4 (DP) / 8 (SP)
Clock speed [GHz]          n       2.7
Performance [GF/s]         P       259 (DP) / 518 (SP)

TOP500 rank 1 (1996)

But: P=5.4 GF/s for serial, non-SIMD code
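The arithmetic behind these numbers can be checked directly. A minimal sketch of the formula P = ncore * F * S * n with the Ivy Bridge-EP table values (the helper name peak_gflops is ours, not from the slides):

```c
#include <assert.h>

/* P = ncore * F * S * n: cores x FP instructions per cycle x
   FP ops per instruction x clock [GHz]. Helper name is illustrative. */
double peak_gflops(double ncore, double F, double S, double n) {
    return ncore * F * S * n;
}

/* peak_gflops(12, 2, 4, 2.7) = 259.2 GF/s (DP)
   peak_gflops(12, 2, 8, 2.7) = 518.4 GF/s (SP)
   peak_gflops( 1, 2, 1, 2.7) =   5.4 GF/s: one core, no SIMD */
```

Dropping both the core count and the SIMD factor is exactly what yields the 5.4 GF/s serial, non-SIMD figure.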

12 The driving forces behind performance 2018

Intel Skylake-SP

P = ncore * F * M * S * n


Number of cores            ncore   28
FP instructions per cycle  F       2
FMA factor                 M       2
FP ops per instruction     S       8 (DP) / 16 (SP)
Clock speed [GHz]          n       2.3 (scalar 2.8)
Performance [GF/s]         P       2060 (DP) / 4122 (SP)

But: P=5.6 GF/s for serial, non-SIMD code

13 Future of SIMD in new architectures

• For Intel, SIMD is a closed chapter

• SIMD will be an important aspect of the architecture but not a communicated performance driver

• Whether other vendors go beyond 32b SIMD is questionable

• Future developments will include: • Special purpose instructions • Usability improvements • Incremental improvements for existing functionality

• What is the next big thing? What about GPGPUs

14 SIMD: THE BASICS

15 SIMD processing – Basics

Steps (done by the compiler) for “SIMD processing”:

for(int i=0; i<n; i++)
    C[i] = A[i] + B[i];

Step 1: "unroll" the loop by the SIMD width (4 doubles for AVX):

for(int i=0; i<n; i+=4) {
    C[i]   = A[i]   + B[i];
    C[i+1] = A[i+1] + B[i+1];
    C[i+2] = A[i+2] + B[i+2];
    C[i+3] = A[i+3] + B[i+3];
}

Step 2: replace the unrolled body by packed SIMD instructions:

LABEL1:
    VLOAD R0 <- A[i]       ; load 256 bits starting from address of A[i] to register R0
    VLOAD R1 <- B[i]
    V64ADD[R0,R1] -> R2    ; add the corresponding 64-bit entries in R0 and R1, store the 4 results to R2
    VSTORE R2 -> C[i]      ; store R2 (256 bits) to address starting at C[i]
    i <- i+4
    i<(n-4)? JMP LABEL1
// remainder loop handling
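In plain C the generated structure, including the remainder handling that the pseudo-code only hints at, looks like this sketch (function name ours):

```c
#include <assert.h>

/* Sketch of the compiler's output structure for C[i] = A[i] + B[i]:
   a 4-way "vector" body plus a scalar remainder loop for n % 4. */
void vadd(const double *A, const double *B, double *C, int n) {
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        /* one pass of this body corresponds to VLOAD, VLOAD, V64ADD, VSTORE */
        C[i]     = A[i]     + B[i];
        C[i + 1] = A[i + 1] + B[i + 1];
        C[i + 2] = A[i + 2] + B[i + 2];
        C[i + 3] = A[i + 3] + B[i + 3];
    }
    for (; i < n; i++)       /* remainder loop handling */
        C[i] = A[i] + B[i];
}
```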

16 SIMD processing – Basics

No SIMD vectorization for loops with data dependencies:

for(int i=1; i<n; i++)
    A[i] = A[i-1] * s;

“Pointer aliasing” may prevent SIMDification:

void f(double *A, double *B, double *C, int n) {
    for(int i=0; i<n; i++)
        A[i] = B[i] + C[i];
}

void f(double * restrict A, double * restrict B, double * restrict C, int n) {…}
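To see why the compiler must be conservative without restrict, here is a minimal sketch (helper name and values are ours) where calling f with overlapping pointers turns the loop into a dependency chain:

```c
#include <assert.h>

/* Same shape as the slide's example: without restrict the compiler
   must assume A may overlap B or C. */
static void f(double *A, double *B, double *C, int n) {
    for (int i = 0; i < n; i++)
        A[i] = B[i] + C[i];
}

/* With A = buf+1 and B = buf, iteration i reads the value iteration
   i-1 just wrote. Blindly executing 4 iterations at once would give
   a different answer than the sequential semantics. */
static double aliasing_demo(void) {
    double buf[5]  = {1, 1, 1, 1, 1};
    double ones[4] = {1, 1, 1, 1};
    f(buf + 1, buf, ones, 4);
    return buf[4];  /* sequential: 1+1=2, 2+1=3, 3+1=4, 4+1=5 */
}
```

With restrict-qualified pointers this call would be undefined, which is exactly the promise that frees the compiler to vectorize.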

17 Vectorization compiler options (Intel)

. The compiler will vectorize starting with -O2.
. To enable specific SIMD extensions use the -x option:
  . -xSSE2: vectorize for SSE2-capable machines
. Available SIMD extensions: SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, ...

. -xAVX on Sandy/Ivy Bridge processors . -xCORE-AVX2 on Haswell/Broadwell . -xCORE-AVX512 on Skylake (certain models) . -xMIC-AVX512 on Xeon Phi Knights Landing

Recommended option:
. -xHost will optimize for the architecture you compile on
  (Caveat: do not use on standalone KNL, use -xMIC-AVX512)
. To really enable 512-bit (zmm) SIMD with current Intel compilers you need to set: -qopt-zmm-usage=high

18 Vectorization source code directives

. Fine-grained control of loop vectorization . Use !DEC$ (Fortran) or #pragma (C/C++) sentinel to start a compiler directive

. #pragma vector always: vectorize even if it seems inefficient (hint!)
. #pragma novector: do not vectorize even if possible (use with caution)
. #pragma vector nontemporal: use NT stores if possible (i.e., SIMD and alignment conditions are met)
. #pragma vector aligned: specifies that all array accesses are aligned to SIMD address boundaries
  (DANGEROUS! You must not lie about this!)
. Evil interactions with OpenMP possible

20 User mandated vectorization (OpenMP4)

. Since Intel Compiler 12.0 the #pragma simd directive is available (now deprecated)
. #pragma simd enforces vectorization where the other pragmas fail
. Prerequisites:
  . Countable loop
  . Innermost loop
  . Must conform to the for-loop style of OpenMP worksharing constructs
. There are additional clauses: reduction, vectorlength, private
. Refer to the compiler manual for further details

#pragma simd reduction(+:x)
for (int i=0; i<n; i++)
    x = x + A[i];

. NOTE: With #pragma simd the compiler may generate incorrect code if the loop violates the vectorization rules! It may also generate inefficient code!
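The portable successor to the Intel-specific pragma is OpenMP 4's omp simd construct. A sketch of the reduction example above in that form (function name ours):

```c
#include <assert.h>

/* OpenMP 4 replacement for the deprecated #pragma simd. Compile with
   -qopenmp (Intel) or -fopenmp / -fopenmp-simd (GCC/Clang); without
   these flags the pragma is ignored and the loop runs scalar with
   identical results. */
double simd_sum(const double *A, int n) {
    double x = 0.0;
    #pragma omp simd reduction(+:x)
    for (int i = 0; i < n; i++)
        x = x + A[i];
    return x;
}
```

The reduction clause tells the compiler that the partial sums accumulated in the SIMD lanes may be combined at loop exit, which is what makes the recurrence on x legal to vectorize.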

21 Architecture: SIMD and Alignment

. Alignment issues
  . Alignment of arrays should optimally be on SIMD-width address boundaries to allow packed aligned loads (and NT stores on x86)
  . Otherwise the compiler will revert to unaligned loads/stores
  . Modern x86 CPUs pay a smaller (but not zero) penalty for misaligned LOAD/STORE; Xeon Phi KNC, however, relies heavily on alignment!
. How is manual alignment accomplished?

. Stack variables: alignas keyword (C++11/C11)
. Dynamic allocation of aligned memory (align = alignment boundary):
  . C before C11 and C++ before C++17: posix_memalign(void **ptr, size_t align, size_t size);
  . C11 and C++17 (size must be a multiple of align): aligned_alloc(size_t align, size_t size);
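A minimal allocation sketch using the POSIX variant (the wrapper name and the 64-byte choice are ours; 64 bytes covers a full cache line and any current x86 SIMD width):

```c
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdlib.h>

enum { SIMD_ALIGN = 64 };  /* one cache line; enough for zmm loads */

/* Returns an allocation whose address is a multiple of SIMD_ALIGN.
   C11 alternative: aligned_alloc(SIMD_ALIGN, bytes), where bytes
   must be a multiple of SIMD_ALIGN. */
double *alloc_simd(size_t n) {
    void *p = NULL;
    if (posix_memalign(&p, SIMD_ALIGN, n * sizeof(double)) != 0)
        return NULL;
    return (double *)p;  /* release with free() */
}
```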

22 Assembler: Why and how?

Why check the assembly code?
. Sometimes the only way to make sure the compiler "did the right thing"
. Example: the "LOOP WAS VECTORIZED" message is printed, but loads & stores may still be scalar!
. Get the assembler code (Intel compiler): icc -S -O3 -xHost triad.c -o a.out
. Disassemble the executable: objdump -d ./a.out | less

The x86 ISA is documented in: Intel Software Development Manual (SDM) 2A and 2B AMD64 Architecture Programmer's Manual Vol. 1-5

23 Basics of the x86-64 ISA

. Instructions have 0 to 3 operands (4 with AVX-512)
. Operands can be registers, memory references or immediates
. Opcodes (binary representation of instructions) vary from 1 to 17 bytes
. There are two assembler syntax forms: Intel (left) and AT&T (right)
. Addressing mode: BASE + INDEX * SCALE + DISPLACEMENT
. C: A[i] is equivalent to *(A+i) (pointer arithmetic is typed: for double A, A+i advances by i*8 bytes)

Intel syntax:                      AT&T syntax:
movaps [rdi + rax*8+48], xmm3      movaps %xmm3, 48(%rdi,%rax,8)
add    rax, 8                      addq   $8, %rax
js     1b                          js     ..B1.4

401b9f: 0f 29 5c c7 30 movaps %xmm3,0x30(%rdi,%rax,8) 401ba4: 48 83 c0 08 add $0x8,%rax 401ba8: 78 a6 js 401b50

24 Basics of the x86-64 ISA with extensions

16 general purpose registers (64 bit): rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, r8-r15
  alias with the eight 32-bit registers: eax, ebx, ecx, edx, esi, edi, esp, ebp

8 opmask registers (16 bit or 64 bit, AVX512 only): k0-k7

Floating point SIMD registers:
  xmm0-xmm15 (xmm31 with AVX512)   SSE (128 bit), alias with 256-bit and 512-bit registers
  ymm0-ymm15 (ymm31 with AVX512)   AVX (256 bit), alias with 512-bit registers
  zmm0-zmm31                       AVX512 (512 bit)

SIMD instructions are distinguished by:
. VEX/EVEX prefix: v
. Operation: mul, add, mov
. Modifier: nontemporal (nt), unaligned (u), aligned (a), high (h)
. Width: scalar (s), packed (p)
. Data type: single (s), double (d)

25 ISA support on Intel chips

Skylake supports all legacy ISA extensions: MMX, SSE, AVX, AVX2

AVX-512 subsets:
. AVX-512 Foundation (F), KNL and Skylake
. AVX-512 Conflict Detection Instructions (CD), KNL and Skylake
. AVX-512 Exponential and Reciprocal Instructions (ER), KNL only
. AVX-512 Prefetch Instructions (PF), KNL only

AVX-512 extensions only supported on Skylake: . AVX-512 Byte and Word Instructions (BW) . AVX-512 Doubleword and Quadword Instructions (DQ) . AVX-512 Vector Length Extensions (VL)

ISA Documentation: Intel Architecture Instruction Set Extensions Programming Reference

26 Example for masked execution

Masking for predication is very helpful in cases such as remainder loop handling or conditional handling.
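A plain-C sketch of the idea (not real intrinsics; names and the 8-lane width are ours, modeled on AVX-512 with doubles): every lane executes, but an opmask decides which lanes write their result back, so the n % 8 remainder needs no scalar cleanup loop.

```c
#include <assert.h>

enum { VLEN = 8 };  /* 8 doubles per 512-bit register */

/* Emulated masked vector operation: all VLEN lanes "compute", but only
   lanes whose mask bit is set commit their result, mimicking an
   AVX-512 opmask (k1 here is just an unsigned bitmask). */
void scale_masked(double *A, int n, double s) {
    for (int i = 0; i < n; i += VLEN) {
        unsigned k1 = 0;                 /* build the remainder mask */
        for (int l = 0; l < VLEN; l++)
            if (i + l < n)
                k1 |= 1u << l;
        for (int l = 0; l < VLEN; l++)   /* predicated write-back */
            if (k1 & (1u << l))
                A[i + l] *= s;
    }
}
```

The same mechanism serves conditional handling: the mask can come from a vector compare instead of a loop bound.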

27 SIMD processing – The whole picture

SIMD influences instruction execution in the core – other runtime contributions stay the same!

Per-cacheline (8 iterations) cycle counts:

         Execution   Cache   Memory
Scalar      12         15      21
SSE          6         15      21
AVX          3         15      21

AVX example: execution units 3 cy, caches 15 cy, memory 21 cy.

Comparing total runtime with data loaded from memory:
Scalar 48 cy, SSE 42 cy, AVX 39 cy.

SIMD is only effective if the runtime is dominated by instruction execution!

34 Limits of SIMD processing

. Only part of the application may be vectorized: arithmetic vs. load/store (Amdahl's law), data transfers
. Memory saturation often makes SIMD obsolete

Per-cacheline cycle counts:

         Execution   Cache   Memory
Scalar     16 cy      4 cy    4 cy
SSE         4 cy      4 cy    4 cy
AVX         2 cy      4 cy    4 cy
AVX512      1 cy      4 cy    4 cy

Diminishing returns (Amdahl)! Possible solution: improve cache bandwidth.

35 Rules for vectorizable loops

1. Countable
2. Single entry and single exit
3. Straight-line code
4. No function calls (exception: intrinsic math functions)

Better performance with:
1. Simple inner loops with unit stride
2. Minimized indirect addressing
3. Aligned data structures (SSE: 16 bytes, AVX: 32 bytes)
4. In C, use the restrict keyword for pointers to rule out aliasing

Obstacles for vectorization: . Non-contiguous memory access . Data dependencies
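The two kinds of loop can be put side by side in a short sketch (function names ours): a loop-carried dependency blocks lane-parallel execution, while an independent, unit-stride loop with restrict satisfies all the rules above.

```c
#include <assert.h>

/* NOT vectorizable: iteration i reads the value iteration i-1 just
   wrote (loop-carried read-after-write dependency). */
void prefix_sum(double *A, int n) {
    for (int i = 1; i < n; i++)
        A[i] += A[i - 1];
}

/* Vectorizable: countable, unit stride, independent iterations,
   restrict rules out aliasing between X and Y. */
void daxpy(double * restrict Y, const double * restrict X,
           double a, int n) {
    for (int i = 0; i < n; i++)
        Y[i] += a * X[i];
}
```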

36 A STRATEGY FOR ANALYZING EXECUTION-BOUND CODES

37 Prerequisites

. Acquire a runtime profile (gprof)
. Instrument hot loops with likwid for HPM measurements

Benchmarking:
. Rule out memory bandwidth saturation
  . Measure peak bandwidth with a suitable benchmark
  . Do scaling runs in one memory domain
  . Measure bandwidth with likwid-perfctr
. Rule out runtime contributions from other data transfers
  . Measure bandwidth to all cache levels with likwid-perfctr
. Is CPI low (below 0.8)?

38 Instruction Analysis: Background

. We assume your useful work is based on floating point arithmetic!

Metric groups:

. Algorithmic work: FP operations
. Instruction mix: total instructions, instruction decomposition
  → How much overhead is added? Get the job done with as few instructions as possible.
. Processor level: CPI, frequency
  → How efficiently are the instructions executed? Exploit as much instruction-level parallelism as possible.

39 Instruction Analysis: Data acquisition

. Can be based on . Static analysis: by hand or using tool (IACA) . Measurement: using any HPM tool (e.g. likwid-perfctr)

. Quantify the following instruction counts: . Total instructions (any likwid group) . Arithmetic FP instructions (FLOPS_DP group) › Scalar › SIMD (SSE, AVX, AVX512) . Load/Store instructions (DATA group) . Branch instructions (BRANCH group)

40 Example: Measure with likwid-perfctr

$ likwid-perfctr -g BRANCH -C N:0 -m ../miniMD-ICC -t 1 -n 100

Region halfneigh, Group 1: BRANCH
+-------------------------------+----------+
| Region Info                   | Core 0   |
+-------------------------------+----------+
| RDTSC Runtime [s]             | 3.432568 |
| call count                    | 102      |
+-------------------------------+----------+
+-------------------------------+---------+------------+
| Event                         | Counter | Core 0     |
+-------------------------------+---------+------------+
| INSTR_RETIRED_ANY             | FIXC0   | 6443543000 |
| CPU_CLK_UNHALTED_CORE         | FIXC1   | 8227216000 |
| CPU_CLK_UNHALTED_REF          | FIXC2   | 8227212000 |
| BR_INST_RETIRED_ALL_BRANCHES  | PMC0    | 195287900  |
| BR_MISP_RETIRED_ALL_BRANCHES  | PMC1    | 13036900   |
+-------------------------------+---------+------------+
+----------------------------+-----------+
| Metric                     | Core 0    |
+----------------------------+-----------+
| Runtime (RDTSC) [s]        | 3.4326    |
| Runtime unhalted [s]       | 3.4280    |
| Clock [MHz]                | 2400.0248 |
| CPI                        | 1.2768    |
| Branch rate                | 0.0303    |
| Branch misprediction rate  | 0.0020    |
| Branch misprediction ratio | 0.0668    |
| Instructions per branch    | 32.9951   |
+----------------------------+-----------+
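All derived metrics in the output are simple ratios of the raw fixed/PMC counters; a sketch recomputing them (struct and function names are ours):

```c
#include <assert.h>

typedef struct {
    double cpi, branch_rate, misp_rate, misp_ratio, instr_per_branch;
} BranchMetrics;

/* Derived metrics as ratios of the raw counters from the likwid output. */
BranchMetrics derive(double instr, double clk_core,
                     double branches, double mispredicted) {
    BranchMetrics m;
    m.cpi              = clk_core / instr;        /* cycles per instruction */
    m.branch_rate      = branches / instr;
    m.misp_rate        = mispredicted / instr;
    m.misp_ratio       = mispredicted / branches;
    m.instr_per_branch = instr / branches;
    return m;
}
```

Feeding in INSTR_RETIRED_ANY, CPU_CLK_UNHALTED_CORE, and the two branch counters reproduces the 1.2768 CPI, 0.0303 branch rate, 0.0668 misprediction ratio, and ~33 instructions per branch shown above.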

41 Instruction Analysis: Comparison of variants

. Compute percentage of total instructions for: . FP arithmetic instructions (total and different types) . Load/store instructions . Branch instructions . The rest

. Additional metrics . SIMD fraction (arithmetic only) . Instruction overhead (rest/total)

42 Factor analysis

Runtime factor = Instruction factor X CPI factor

. Runtime factor . runtime scalar / runtime SIMD

. Instruction factors . Arithmetic: # arith. instructions (scalar) / # arith. instructions (SIMD) . Total: # total instructions (scalar) / # total instructions (SIMD)

. CPI factor . CPI scalar / CPI SIMD

. Algorithmic factor
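The decomposition holds because runtime = instructions × CPI / clock, so at fixed clock the runtime factor is exactly the instruction factor times the CPI factor. A sketch (helper names and the absolute instruction counts are ours; the CPI values are from the fullneigh column of the table below):

```c
#include <assert.h>

/* runtime = #instructions * CPI / clock */
double runtime_s(double instructions, double cpi, double clock_hz) {
    return instructions * cpi / clock_hz;
}

/* scalar-to-SIMD runtime factor; the clock cancels out */
double runtime_factor(double instr_scalar, double cpi_scalar,
                      double instr_simd, double cpi_simd,
                      double clock_hz) {
    return runtime_s(instr_scalar, cpi_scalar, clock_hz)
         / runtime_s(instr_simd, cpi_simd, clock_hz);
}
```

With an instruction factor of 1.80 and CPIs of 0.98 (scalar) vs. 0.69 (AVX) this yields about 2.56, matching the measured fullneigh runtime factor.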

43 EXAMPLE: MINI-MD

44 Benchmark overview: MiniMD

Proxy app from MANTEVO. https://mantevo.org

. MiniMD – MPI + OpenMP
. Approx. 3000 lines, C++
. Extracted from LAMMPS (http://lammps.sandia.gov), a molecular dynamics application code
. Other implementations: OpenACC, OpenCL, Kokkos
. Supports Lennard-Jones (default) and EAM force calculation
. Uses neighbour lists

45 Runtime profile

% time   self seconds   calls   name
 83.05       21.44        402   ForceLJ::compute(Atom&, Neighbor&, Comm&, int)
 13.29        3.43         21   Neighbor::build(Atom&)
  1.63        0.42          1   Integrate::run(Atom&, Force*, Neighbor&, Comm&)
  0.50        0.13       2280   Atom::pack_comm(int, int*, double*, int*)
  0.46        0.12       2412   Atom::unpack_reverse(int, int*, double*)
  0.23        0.06         41   Neighbor::binatoms(Atom&, int)
  0.19        0.05       2280   Atom::unpack_comm(int, int, double*)
  0.19        0.05         21   Comm::borders(Atom&)
  0.15        0.04         20   Atom::sort(Neighbor&)
  0.12        0.03                intel_avx_rep_memcpy

46 Code review

. ForceLJ::compute is a proxy routine to jump to different implementations: . compute_halfneigh – Default for running with 1 thread . compute_halfneigh_threaded – Default with more than 1 thread . compute_fullneigh – Optional algorithm

. The optimal algorithm is halfneigh, updating both partners in the force calculation. But it requires atomics for OpenMP parallelisation and only vectorizes if forced with #pragma simd.

47 Code review: halfneigh compute

for(int i = 0; i < nlocal; i++) {                 // nlocal = 131072 (slide annotation)
    neighs = &neighbor.neighbors[i * neighbor.maxneighs];
    const int numneighs = neighbor.numneigh[i];
    const MMD_float xtmp = x[i * PAD + 0];
    const MMD_float ytmp = x[i * PAD + 1];
    const MMD_float ztmp = x[i * PAD + 2];
    MMD_float fix = 0.0;
    MMD_float fiy = 0.0;
    MMD_float fiz = 0.0;

    for(int k = 0; k < numneighs; k++) {          // numneighs = 78 (slide annotation)
        const int j = neighs[k];
        const MMD_float delx = xtmp - x[j * PAD + 0];
        const MMD_float dely = ytmp - x[j * PAD + 1];
        const MMD_float delz = ztmp - x[j * PAD + 2];
        const MMD_float rsq = delx * delx + dely * dely + delz * delz;
        // CUTOFF LOOP, SEE NEXT SLIDE
    }

    f[i * PAD + 0] += fix;
    f[i * PAD + 1] += fiy;
    f[i * PAD + 2] += fiz;
}

48 Code review: cutoff loop, halfneigh vs fullneigh

halfneigh:

if(rsq < cutforcesq) {
    const MMD_float sr2 = 1.0 / rsq;
    const MMD_float sr6 = sr2 * sr2 * sr2 * sigma6;
    const MMD_float force = 48.0 * sr6 * (sr6 - 0.5) * sr2 * epsilon;

    fix += delx * force;
    fiy += dely * force;
    fiz += delz * force;

    f[j * PAD + 0] -= delx * force;
    f[j * PAD + 1] -= dely * force;
    f[j * PAD + 2] -= delz * force;
}

// branch entered 54 times out of 78

fullneigh (per hit: 10 mul, 4 add, 1 div/reciprocal):

if(rsq < cutforcesq) {
    const MMD_float sr2 = 1.0 / rsq;
    const MMD_float sr6 = sr2 * sr2 * sr2 * sigma6;
    const MMD_float force = 48.0 * sr6 * (sr6 - 0.5) * sr2 * epsilon;

    fix += delx * force;
    fiy += dely * force;
    fiz += delz * force;
}

49 Likwid HPM Analysis: Memory hierarchy and branch prediction for halfneigh Code compiled with ICC 17 with AVX support and benchmarked on Xeon Broadwell system (2 sockets, 18 cores per socket, 2.3 GHz)

. L3 data volume 6 GB, bandwidth 1.4 GB/s

. L2 data volume 9.5 GB, bandwidth 2.3 GB/s

. CPI 0.54

. 7 instructions per branch, 4.5% mispredicted

No impact of memory hierarchy on performance!

51 Likwid HPM instruction breakdown

Runtime [s]   scalar         sse           avx           avx2
halfneigh      9.00           8.13          6.98          8.19
fullneigh     15.17 (+69%)    9.15 (+13%)   5.92 (-15%)   5.68 (-19%)

Instruction type     Full scalar   Full AVX   Half scalar   Half AVX
CPI                     0.98         0.69        0.78         0.55
Arithmetic scalar      69.16%        4.83%      51.5%         4.84%
Arithmetic SSE          0%           0.21%       0%           0.14%
Arithmetic AVX          0%          35.82%       0%          12.6%
Arithmetic total       69.16%       40.85%     51.5%        17.58%
Load                   14.19%       17.03%     16.96%       13.65%
Store                   0.23%        1.06%      4.49%        5.64%
Branch                  5.27%        3.61%      4.06%       14.34%
Other                  10.45%       37.45%     22.99%       48.80%

52 Making sense of instruction breakdown

Scaling scalar to AVX SIMD (Optimum x4)

                          FullNeigh   HalfNeigh
Runtime factor              2.56        1.29
Arithmetic instr. factor    3.04        2.66
Instruction factor          1.80        0.91
CPI factor                  1.42        1.42
Overall factor              2.55        1.29

(Overall factor = instruction scaling × CPI scaling)

. Halfneigh requires a factor of 1.78 less arithmetic work, which is reflected in a scalar runtime ratio of 1.69

. But because fullneigh shows better SIMD scalability with AVX, it beats the better algorithm by a small margin

53 MiniMD Assembly AVX

AVX: k is processed in blocks of 4. The compiler gathers x[j * PAD + 0..2] for four neighbors j with scalar loads and shuffles (vmovq, vmovd, vmovhpd, vmovups, vinsertf128, vpunpckhqdq, vunpcklpd, vunpckhpd), then does the actual work with packed arithmetic (vsubpd for delx/dely/delz, vmulpd and vfmadd231pd for rsq).

Gathered layout x[j+c] for neighbors j = 0..3 and components c = 0..2:

    0+0 0+1 0+2   1+0 1+1 1+2
    2+0 2+1 2+2   3+0 3+1 3+2

transposed into ymm registers, one component per register:

    ymm: 0+0 1+0 2+0 3+0
    ymm: 0+1 1+1 2+1 3+1
    ymm: 0+2 1+2 2+2 3+2

54 Conclusion

. MiniMD is an interesting case illustrating the compromises you face in a real-world situation.

[Diagram: balancing algorithm, instruction count, superscalar execution, and SIMD]

55 PATTERN BASED PERFORMANCE ENGINEERING

Best practices / Basic PE process

56 Basics of optimization

1. Define relevant test cases
2. Establish a sensible performance metric
3. Acquire a runtime profile (sequential)
4. Identify hot kernels (hopefully there are some!)
5. Carry out the optimization process for each kernel (iteratively)

Motivation: • Understand observed performance • Learn about code characteristics and machine capabilities • Develop a well founded performance expectation • Deliberately decide on optimizations

57 Best practices for benchmarking

. Preparation
  . Reliable timing (what is the minimum time that can be measured?)
  . Document code generation (flags, compiler version)
  . Get access to an exclusive system
  . System state (clock speed, turbo mode, memory, caches)
  . Consider automating runs with a script (shell, python, perl)

. Doing
  . Affinity control
  . Check: Is the result reasonable?
  . Is the result deterministic and reproducible?
  . Statistics: mean, best?
  . Basic variants: thread count, affinity, working set size

58 Performance Engineering Tasks: Software

Optimizing software for a specific hardware requires aligning several orthogonal dimensions.

On the software side it is mostly about reducing algorithmic and processor work. Still, decisions here may also restrict the options on the hardware side.

1. Reduce algorithmic work (Algorithm)

2. Minimize processor work (Implementation → instruction code, the unit of work for the processor)

59 Performance Engineering Tasks: Hardware

Parallelism: horizontal dimension
Data paths: vertical dimension

3. Distribute work and data for optimal utilization of parallel resources
4. Avoid bottlenecks
5. Use the most effective execution units on the chip

[Diagram: two sockets, each with Memory, shared L3 segments, per-core L2 and L1 caches, and SIMD/FMA units in every core]

60 Thinking in bottlenecks

• A bottleneck is a performance-limiting setting
• Microarchitectures expose numerous bottlenecks

Observation 1: Most applications face a single bottleneck at a time!

Observation 2: There is a limited number of relevant bottlenecks!

61 Performance Engineering Process: Analysis

[Inputs: algorithm/code analysis, hardware/instruction set architecture, microbenchmarking, application benchmarking → Pattern]

Performance patterns are typical performance-limiting motifs.
The set of input data indicating a pattern is its signature.

Step 1 Analysis: Understanding observed performance

62 Performance Engineering Process: Modeling

[Flow: Pattern (qualitative view) → Performance model (quantitative view) → Validation against traces/HW metrics. A wrong pattern or a wrong model sends you back a step.]

Step 2 Formulate Model: Validate pattern and get quantitative insight

63 Performance Engineering Process: Optimization

[Flow: Pattern → Performance model → Optimization: optimize for better resource utilization, eliminate non-expedient activity. Performance improves until the next bottleneck is hit.]

Step 3 Optimization: Improve utilization of available resources

64 Performance pattern classification

1. Maximum resource utilization (computing at a bottleneck)

2. Optimal use of parallel resources

3. Hazards (something “goes wrong”)

4. Use of most effective instructions

5. Work related (too much work or too inefficiently done)

65 Patterns (I): Bottlenecks & parallelism

Pattern: Bandwidth saturation
  Performance behavior: saturating speedup across cores sharing a data path
  Metric signature (LIKWID groups): bandwidth meets BW of a suitable streaming benchmark (MEM, L3)

Pattern: ALU saturation
  Performance behavior: throughput at design limit(s)
  Metric signature: good (low) CPI, integral ratio of cycles to specific instruction count(s) (FLOPS_*, DATA, CPI)

Pattern: Bad ccNUMA page placement
  Performance behavior: bad or no scaling across NUMA domains, performance improves with interleaved page placement
  Metric signature: unbalanced bandwidth on memory interfaces / high remote traffic (MEM)

Pattern: Load imbalance / serial fraction
  Performance behavior: saturating/sub-linear speedup
  Metric signature: different amount of "work" on the cores (FLOPS_*); note that instruction count is not reliable!

66 Patterns (II): Hazards

Pattern: False sharing of cache lines
  Performance behavior: large discrepancy from the performance model in the parallel case, bad scalability
  Metric signature (LIKWID groups): frequent (remote) CL evicts (CACHE)

Pattern: Pipelining issues
  Performance behavior: in-core throughput far from design limit, performance insensitive to data set size
  Metric signature: (large) integral ratio of cycles to specific instruction count(s), bad (high) CPI (FLOPS_*, DATA, CPI)

Pattern: Control flow issues
  Performance behavior: see above
  Metric signature: high branch rate and branch miss ratio (BRANCH)

Pattern: Micro-architectural anomalies
  Performance behavior: large discrepancy from a simple performance model based on LD/ST and arithmetic throughput
  Metric signature: relevant events are very hardware-specific, e.g., memory aliasing stalls, conflict misses, unaligned LD/ST, requeue events

Pattern: Latency-bound data access
  Performance behavior: simple bandwidth performance model much too optimistic
  Metric signature: low BW utilization / low cache hit ratio, frequent CL evicts or replacements (CACHE, DATA, MEM)

67 Patterns (III): Work-related

Pattern: Synchronization overhead
  Performance behavior: speedup going down as more cores are added / no speedup with small problem sizes / cores busy but low FP performance
  Metric signature (LIKWID groups): large non-FP instruction count (growing with number of cores used) / low CPI (FLOPS_*, CPI)

Pattern: Instruction overhead
  Performance behavior: low application performance, good scaling across cores, performance insensitive to problem size
  Metric signature: large non-FP instruction count (constant vs. number of cores), low CPI near theoretical limit (FLOPS_*, DATA, CPI)

Pattern: Excess data volume
  Performance behavior: simple bandwidth performance model much too optimistic
  Metric signature: low BW utilization / low cache hit ratio, frequent CL evicts or replacements (CACHE, DATA, MEM)

Pattern: Code composition – expensive instructions
  Performance behavior: similar to instruction overhead
  Metric signature: many cycles per instruction (CPI) if the problem is large-latency arithmetic

Pattern: Code composition – ineffective instructions
  Performance behavior: similar to instruction overhead
  Metric signature: scalar instructions dominating in data-parallel loops (FLOPS_*, CPI)

68 Patterns conclusion

. Pattern signature = performance behavior + hardware metrics
. Hardware metrics alone are almost useless without a pattern

. Patterns are applied hotspot (loop) by hotspot

. Patterns map to typical execution bottlenecks

. Patterns are extremely helpful in classifying performance issues
  . The first pattern is always a hypothesis
  . Validation by taking data (more performance behavior, HW metrics)
  . Refinement or change of pattern

. Performance models are crucial for most patterns . Model follows from pattern

69 References

Book: . G. Hager and G. Wellein: Introduction to High Performance Computing for Scientists and Engineers. CRC Computational Science Series, 2010. ISBN 978-1439811924 http://www.hpc.rrze.uni-erlangen.de/HPC4SE/

Papers:
. J. Hammer, G. Hager, J. Eitzinger, and G. Wellein: Automatic Loop Kernel Analysis and Performance Modeling With Kerncraft. Proc. PMBS15, the 6th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, in conjunction with ACM/IEEE Supercomputing 2015 (SC15), November 16, 2015, Austin, TX. DOI: 10.1145/2832087.2832092, Preprint: arXiv:1509.03778
. M. Wittmann, G. Hager, T. Zeiser, J. Treibig, and G. Wellein: Chip-level and multi-node analysis of energy-optimized lattice-Boltzmann CFD simulations. Concurrency and Computation: Practice and Experience (2015). DOI: 10.1002/cpe.3489, Preprint: arXiv:1304.7664
. H. Stengel, J. Treibig, G. Hager, and G. Wellein: Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. Proc. ICS15. DOI: 10.1145/2751205.2751240, Preprint: arXiv:1410.5010
. G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience (2013). DOI: 10.1002/cpe.3180, Preprint: arXiv:1208.2908

70 References

Papers continued:

. J. Treibig, G. Hager, and G. Wellein: Performance patterns and hardware metrics on modern multicore processors: Best practices for performance engineering. Workshop on Productivity and Performance (PROPER 2012) at Euro-Par 2012, August 28, 2012, Rhodes Island, Greece. DOI: 10.1007/978-3-642-36949-0_50, Preprint: arXiv:1206.3738
. J. Treibig, G. Hager, H. Hofmann, J. Hornegger, and G. Wellein: Pushing the limits for medical image reconstruction on recent standard multicore processors. International Journal of High Performance Computing Applications (published online before print). DOI: 10.1177/1094342012442424
. J. Treibig, G. Hager, and G. Wellein: LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. Proc. PSTI2010, the First International Workshop on Parallel Software Tools and Tool Infrastructures, San Diego, CA, September 13, 2010. DOI: 10.1109/ICPPW.2010.38, Preprint: arXiv:1004.4431
. J. Treibig, G. Wellein, and G. Hager: Efficient multicore-aware parallelization strategies for iterative stencil computations. Journal of Computational Science 2 (2), 130-137 (2011). DOI: 10.1016/j.jocs.2011.01.010
. J. Treibig, G. Hager, and G. Wellein: Complexities of performance prediction for bandwidth-limited loop kernels on multi-core architectures. DOI: 10.1007/978-3-642-13872-0_1, Preprint: arXiv:0910.4865
