Levels of Parallelism

Hans Pabst, Application Engineer

Acknowledgements: Martyn Corden, Steve “Dr. Fortran” Lionel, and others

Contents

• Parallelism on Intel Architecture
• Vectorization of code: overview and terminology
• Introduction to SIMD ISA for Intel® processors
  • Intel® AVX, Intel® AVX2, Intel® AVX-512
  • SIMD instruction set switches for Intel® compilers
• SIMD vectorization

Opening statement and Motivation

“Parallelism == Performance” … which leads to:

Optimization – making sure the above statement is true!

Speedup using Parallelism: Four-Step Development
1. Analyze: Amplifier XE (event-based sampling/EBS, hotspots)
2. Implement: Composer XE (compiler, OpenMP, Cilk Plus) plus libraries (MKL, TBB, IPP)
3. Debug: Inspector XE (threads, memory)
4. Tune: Amplifier XE (concurrency, waits & locks)

Speedup by (re-)using Libraries

▪ Highly optimized, threaded, and vectorized math functions that maximize performance on each processor family
▪ Utilizes industry-standard C and Fortran APIs for compatibility with popular BLAS, LAPACK, and FFTW functions: no code changes required
▪ Dispatches optimized code for each processor automatically, without the need to branch code

What’s New in the 2018 edition
▪ Improved small matrix multiplication performance in GEMM & LAPACK
▪ Improved ScaLAPACK performance for distributed computation
▪ 24 new vector math functions
▪ Simplified license for easier adoption & redistribution
▪ Additional distributions via YUM, APT-GET, & Conda

More about Intel MKL: software.intel.com/mkl
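As an illustration of the “no code changes required” point, here is a minimal sketch of calling the standard BLAS routine DGEMM from Fortran (the matrix size and names are illustrative); MKL supplies the optimized implementation when you link against it, e.g. via the compiler’s -mkl option:

program gemm_example
  implicit none
  integer, parameter :: n = 512
  real(8) :: a(n,n), b(n,n), c(n,n)
  call random_number(a)
  call random_number(b)
  ! Standard BLAS interface: C = alpha*A*B + beta*C; MKL dispatches the
  ! best code path for the host processor at run time.
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  print *, 'c(1,1) =', c(1,1)
end program gemm_example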

Seven Levels of Parallelism

#1 – Node-level Parallelism

Levels of Parallelism: Node

What can I do? Increase per-node performance, identify scalability issues (Intel® Trace Analyzer & Collector), employ communication-avoiding algorithms, etc. (or simply use more nodes ;-)

#2 – Socket-level Parallelism

Levels of Parallelism: Node, Socket

Do I have NUMA issues? Check with ‘numactl -i all‘, or Intel® VTune™.

What can I do? Perform data initialization in parallel, separate data structures, take control with a NUMA allocator (libnuma).
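On a system with a first-touch page-placement policy, one way to “perform data initialization in parallel” is to touch the data with the same OpenMP thread layout the compute loops will later use; a minimal sketch (array size and names are illustrative):

program first_touch
  implicit none
  integer, parameter :: n = 64*1024*1024
  real, allocatable :: a(:)
  integer :: i
  allocate(a(n))            ! virtual allocation; pages are not mapped yet
  !$omp parallel do schedule(static)
  do i = 1, n
    a(i) = 0.0              ! first touch maps each page near the touching thread
  end do
  !$omp parallel do schedule(static)
  do i = 1, n
    a(i) = 2.0 * a(i)       ! same static schedule: threads mostly hit local pages
  end do
end program first_touch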

2S / 4S / 8S Configurations

[Diagram: 2S, 4S and 8S platform topologies built from Intel® Xeon® Scalable (SKL) sockets connected by Intel® UPI links, with LBG chipsets attached via DMI, 3x16 PCIe* lanes per socket, and optional Intel® OP Fabric (1x100G); the 2S-2UPI & 2S-3UPI and 4S-2UPI & 4S-3UPI variants are shown.]

The Intel® Xeon® Scalable processor supports up to 8 sockets without the need for an additional node controller.

#3 – Core-level Parallelism

Levels of Parallelism: Node, Socket, Core/Thread-Level (Core 1 … Core 4)

#4 – Thread-level Parallelism (with Hyperthreading, aka SMT)

Levels of Parallelism: Node, Socket, Core/Thread-Level (Hyperthreading): each core (Core 1 … Core 4) runs two hardware threads (Thread 1, Thread 2).

Hyperthreading

• Buffers keep the architectural state (Architectural State A / B)
• The execution blocks are shared between the two threads
• Enabled by BIOS settings

Background
• Extracts instruction-level parallelism (ILP) over a single execution block
• Complements out-of-order execution
• On Intel Core, intra-core slowdowns due to HT are eliminated, BUT slowdowns may still happen due to missing thread affinization or because of synchronization (locks).
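In practice, thread affinization can be controlled with standard OpenMP environment variables; an illustrative setup pinning one thread per physical core (the core count and executable name are placeholders):

$ export OMP_PLACES=cores      # one place per physical core
$ export OMP_PROC_BIND=close   # pack threads onto neighboring cores
$ export OMP_NUM_THREADS=28    # e.g., one thread per core of a 28-core socket
$ ./my_app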

Parallelism on Intel® Architecture (Xeon Server Cores)

Model:    64-bit  5100  5500  5600  SNB  IVB  HSW  BDW  SKX   KNC   KNL
Cores:         1     2     4     6    8   10   18   22   28    61    72
Threads:       2     2     8    12   16   20   36   44   56   244   288

(From early 64-bit Intel® Xeon® through Intel® Xeon® E5 to the Intel® Xeon® Scalable family; KNC is the Intel® Xeon Phi™ coprocessor, KNL the 2nd-generation Intel® Xeon Phi™ processor.)

SNB: Sandy Bridge, IVB: Ivy Bridge, HSW: Haswell, BDW: Broadwell, SKL: Skylake (client), SKX: Skylake Server, KNC: Knights Corner, KNL: Knights Landing. The SKX server core refines the SKL client core significantly.

#5 – GPU-CPU Parallelism

Levels of Parallelism: Node, Socket, Core/Thread-Level (Hyperthreading), GPU-CPU

[Diagram: integrated processor graphics across 2nd, 3rd, 4th and 5th Gen Intel® Core™ processors (Intel® HD Graphics, with eDRAM on later generations).]

Lots of compute power for data-parallel (client) applications

#6 – Instruction-level Parallelism

Levels of Parallelism: Node, Socket, Core/Thread-Level (Hyperthreading), GPU-CPU, Instruction

Execution Units on Haswell/Broadwell

[Diagram: a unified scheduler issues to Ports 0–7. Ports 0/1/5/6 host the ALUs, the FMA/FP MUL and FMA/FP ADD/MUL units, vector integer ALU/MUL, shuffle, logic, FDIV and branch units; Ports 2/3 handle loads and store addresses (STA), Port 4 store data (STD), Port 7 a new store AGU; vector lanes 0–255 bits; memory control and L1 data cache below.]

Newly introduced:
• New branch unit on Port 6 reduces Port 0 conflicts; a 2nd EU helps branch-heavy code
• 2x FMA doubles peak FLOPS; two FP MUL units benefit legacy code
• 4th ALU: great for integer workloads, frees Ports 0 & 1 for vector work
• New AGU for stores leaves Ports 2 & 3 for loads

• The race to higher frequencies has slowed down
• New logic is introduced with each new generation, leading to higher complexity
• Fused multiply-add (FMA) for performance
• Separate address generation unit (AGU) for address calculations

Execution Units on Skylake Server

#7 – Data-level Parallelism

Levels of Parallelism: Node, Socket, Core/Thread-Level (Hyperthreading), GPU-CPU, Instruction, Data (Vectorization)

Contents

• Parallelism on Intel Architecture
• Vectorization of code: overview and terminology
• Introduction to SIMD ISA for Intel® processors
  • Intel® AVX, Intel® AVX2, Intel® AVX-512
  • SIMD instruction set switches for Intel® compilers
• SIMD vectorization

Vectorization of code

• Transform sequential code to exploit the SIMD processing capabilities of Intel® processors
  • By calling a vectorized library
  • Automatically, by tools like a compiler
  • Manually, by explicit syntax

for(i = 0; i <= MAX; i++)
   c[i] = a[i] + b[i];

[Diagram: one vector add computes c[i] … c[i+7] = a[i] … a[i+7] + b[i] … b[i+7] in a single instruction.]

Vectorization terminology

• Single Instruction Multiple Data (SIMD)
  • Processes a whole vector with a single operation
  • Provides data-level parallelism (DLP)
  • More efficient than scalar processing due to DLP
• Vector
  • Consists of more than one element
  • Elements are of the same scalar data type (e.g., floats, integers, …)
  • Vector length (VL): the number of elements in the vector

[Diagram: scalar processing computes one C = A + B at a time; vector processing computes VL elements Ci = Ai + Bi in one operation.]

Contents

• Parallelism on Intel Architecture
• Vectorization of code: overview and terminology
• Introduction to SIMD ISA for Intel® processors
  • Intel® AVX, Intel® AVX2, Intel® AVX-512
  • SIMD instruction set switches for Intel® compilers
• SIMD vectorization

History of SIMD ISA extensions*

Intel® Pentium® processor (1993)

MMX™ (1997)

Intel® Streaming SIMD Extensions (Intel® SSE in 1999 to Intel® SSE4.2 in 2008)

Intel® Advanced Vector Extensions (Intel® AVX in 2011 and Intel® AVX2 in 2013)

Intel® AVX-512 in 2016

* Illustrated in the original slide with the number of 32-bit data elements processed by one “packed” instruction: 1 (scalar), 2 (MMX™), 4 (SSE), 8 (AVX), 16 (AVX-512).

Intel® AVX

• A 256-bit vector extension to SSE:
  • SSE uses dedicated 128-bit registers called XMM (16 for Intel® 64)
  • AVX extends all XMM registers to 256 bit, called YMM
  • The lower 128 bit of each YMM register are mapped/shared with XMM
• AVX works on either
  • The whole 256 bit, or
  • The lower 128 bit (zeroing the higher 128 bit)
• AVX counterparts exist for almost all existing SSE instructions; the initial generation (Intel® AVX) provides full 256-bit vectors for FP only

[Diagram: XMM, 128 bits (1999); YMM, 256 bits (2010).]

Intel® AVX2

• Basically the same as Intel® AVX, with the following additions:
  • Doubles the width of integer vector instructions to 256 bits
  • Floating-point fused multiply-add (FMA)
  • Bit Manipulation Instructions (BMI)
  • Gather instructions
  • Any-to-any permutes
  • Vector-vector shifts

Processor Family                                  Instruction Set   SP FLOPs/Clock   DP FLOPs/Clock
Pre-2nd-generation Intel® Core™ processors        SSE 4.2                 8                4
2nd/3rd-generation Intel® Core™ processors        AVX                    16                8   (2x)
4th-generation Intel® Core™ processors            AVX2                   32               16   (4x)
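A typical beneficiary is a multiply-add kernel like the following sketch (names are illustrative); with AVX2 the compiler can fuse the multiply and the add into FMA instructions:

subroutine axpy(n, a, x, y)
  integer, intent(in) :: n
  real, intent(in) :: a, x(n)
  real, intent(inout) :: y(n)
  integer :: i
  do i = 1, n
    y(i) = a * x(i) + y(i)   ! multiply-add: a candidate for one vfmadd per vector
  end do
end subroutine axpy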

Intel® AVX and AVX2 vector types

Intel® AVX:
• 8x single-precision FP
• 4x double-precision FP

Intel® AVX2:
• 32x 8-bit integer
• 16x 16-bit integer
• 8x 32-bit integer
• 4x 64-bit integer

plus plain 256-bit operations

Intel® AVX-512 for Intel® CPUs

• Intel® Xeon Phi™ and Intel® Xeon® processors share a large set of instructions
• The instruction sets are not identical
• Subsets are represented by individual feature flags (CPUID)

[Diagram: common instruction set stack per platform. NHM: SSE. Intel® Xeon® E5/E5v2: SSE, AVX. Intel® Xeon® E5v3/E5v4: SSE, AVX, AVX2. Intel® Xeon Phi™ processor family: SSE, AVX, AVX2, AVX-512F, AVX-512CD, AVX-512ER, AVX-512PF. Intel® Xeon® Scalable processor family: SSE, AVX, AVX2, AVX-512F, AVX-512CD, AVX-512BW, AVX-512DQ, AVX-512VL, plus MPX, SHA, …]

Intel® AVX-512 (1/3)

• The Intel® AVX-512 instruction set is split into different subsets
• Intel® AVX-512 Foundation (AVX-512F):
  • Extension of the known AVX instruction sets, including mask registers
  • Available in all products supporting Intel® AVX-512

Double/Quadword Integer Arithmetic: including gather/scatter with double/quad-word indices
Math Support: IEEE division and square root; DP FP transcendental primitives; new transcendental support instructions
New Permutation Primitives: two-source shuffles; compress & expand
Bit Manipulation: vector rotate; universal ternary logical operation; new mask instructions

Intel® AVX-512 (2/3)

• Intel® AVX-512 Conflict Detection (AVX-512CD):
  • Checks for identical values inside a vector (for 32- or 64-bit integers)
  • Used for finding colliding indices (32 or 64 bit) before a gather-operation-scatter sequence
  • Likely to be available in the future for both Intel® Xeon Phi™ and Intel® Xeon® processors
• Intel® AVX-512 Vector Length Extension (AVX-512VL):
  • Freely select the vector length: 512 bit, 256 bit or 128 bit
  • Orthogonal extension, but planned for future Intel® Xeon® processors only
• Intel® AVX-512 Byte/Word (AVX-512BW) and Doubleword/Quadword (AVX-512DQ):
  • Two groups, planned for future Intel® Xeon® processors:
    • 8- and 16-bit integers (BW)
    • 32- and 64-bit integers and FP (DQ)
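The classic pattern AVX-512CD targets is a histogram-style update, where several lanes of one vector may address the same bin; a minimal sketch (names are illustrative):

subroutine histogram(n, nbins, bin, h)
  integer, intent(in) :: n, nbins, bin(n)
  integer, intent(inout) :: h(nbins)
  integer :: i
  ! Without conflict detection the compiler must assume bin(i) values can
  ! collide within a vector; AVX-512CD detects duplicates at run time so
  ! this gather-update-scatter loop can still be vectorized.
  do i = 1, n
    h(bin(i)) = h(bin(i)) + 1
  end do
end subroutine histogram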

Intel® AVX-512 (3/3)

• Intel® AVX-512 Exponential & Reciprocal Instructions (AVX-512ER):
  • Higher-accuracy HW-based square root and reciprocal (28 bit) and exponential function (22 bit)
  • Likely only for future Intel® Xeon Phi™ processors
• Intel® AVX-512 Prefetch Instructions (AVX-512PF):
  • Manage data streams for higher throughput (incl. gather and scatter)
  • Likely only for future Intel® Xeon Phi™ processors
• More information: https://software.intel.com/en-us/blogs/additional-avx-512-instructions

Intel® AVX-512 registers

• Extended VEX encoding (EVEX) introduces another prefix
• Extends the previous AVX and SSE registers to 512 bit:
  • 32-bit mode: 8 ZMM registers (same count as YMM/XMM)
  • 64-bit mode: 32 ZMM registers (2x the YMM/XMM count)
  • 32 FP* and 8 K/mask registers (K0 is special)

[Diagram: XMM0-15 (128 bit) and YMM0-15 (256 bit) are available in 32- and 64-bit mode; ZMM0-31 (512 bit) and K0-7 mask registers in 64-bit mode.]

• No penalty when switching between XMM, YMM and ZMM!

* Twice as many FP registers (when compared to AVX/AVX2). In fact, AVX2 code can use 32 registers as well (via EVEX).

Intel® AVX-512 vector types

Intel® AVX-512:
• 16x single-precision FP
• 8x double-precision FP
• 64x 8-bit integer
• 32x 16-bit integer
• 16x 32-bit integer
• 8x 64-bit integer

plus plain 512-bit operations and 64-bit masks

• Includes AVX and AVX2

Intel® Turbo Boost and AVX*

• The amount of turbo frequency achieved depends on: the type of workload, the number of active cores, estimated current and power consumption, and processor temperature
• Due to this workload dependency, separate AVX base and turbo frequencies are defined for Intel® Xeon® processors with the 4th generation Intel® Core™ architecture and later

[Diagram: frequency ranges. Previous generations: one shared AVX/rated base and AVX/rated turbo. 4th generation Intel® Core™ architecture and later: a lower AVX base and AVX turbo below the rated base and rated turbo.]

*AVX refers to Intel® AVX, Intel® AVX2 or Intel® AVX-512

Contents

• Parallelism on Intel Architecture
• Vectorization of code: overview and terminology
• Introduction to SIMD ISA for Intel® processors
  • Intel® AVX, Intel® AVX2, Intel® AVX-512
  • SIMD instruction set switches for Intel® compilers
• SIMD vectorization

SIMD instruction set switches (1/3) for Intel® compilers

• Linux*, OS X*: -x<feature>, Windows*: /Qx<feature>
  • Might enable Intel processor-specific optimizations
  • A processor check is added to the “main” routine: the application errors out, with an appropriate/informative message, if the SIMD feature is missing or a non-Intel processor is detected
• Linux*, OS X*: -ax<feature>, Windows*: /Qax<feature>
  • Multiple code paths: baseline and optimized/processor-specific
  • Optimized code paths for Intel processors as defined by <feature>
  • Multiple SIMD features/paths are possible, e.g.: -axSSE2,CORE-AVX2
  • The baseline code path defaults to -msse2 (/arch:sse2)
  • The baseline code path can be modified by -m<feature> or -x<feature> (/arch:<feature> or /Qx<feature>)
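Illustrative command lines (the source file name is a placeholder):

$ ifort -xCORE-AVX2 prog.f90               # Intel-specific path + run-time processor check
$ ifort -axCORE-AVX512,CORE-AVX2 prog.f90  # SSE2 baseline plus two optimized code paths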

SIMD instruction set switches (2/3) for Intel® compilers

• Linux*, OS X*: -m<feature>, Windows*: /arch:<feature>
  • Neither a processor check nor Intel-specific optimizations: the application is optimized for both Intel and non-Intel processors for the selected SIMD feature
  • The missing check can cause the application to fail if the extension is not available
  • Default for Linux*: -msse2, Windows*: /arch:sse2
    • Activated implicitly
    • Implies the need for a target processor with at least Intel® SSE2
  • Default for OS X*: -xsse3 (IA-32), -xssse3 (Intel® 64)

SIMD instruction set switches (3/3) for Intel® compilers

• Special: Linux*, OS X*: -xHost, Windows*: /QxHost
  • The compiler checks the SIMD features of the host processor (where the build runs) and makes use of the latest SIMD feature available
  • The code only executes on processors with the same SIMD feature (or later) as the build host
  • As with -x or /Qx: if the “main” routine is built with -xHost or /QxHost, the final executable only runs on Intel processors
• Disabling vectorization: Linux*, OS X*: -no-vec, Windows*: /Qvec-
  • Disables vectorization for the compile unit
  • The compiler can still use some SIMD features
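Again illustrative:

$ ifort -xHost prog.f90    # target the best SIMD feature of the build machine
$ ifort -no-vec prog.f90   # keep other optimizations, but do not vectorize loops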

SIMD feature set names (1/2) for Intel® compilers

AVX
  May generate Intel® Advanced Vector Extensions (Intel® AVX), SSE4.2, SSE4.1, SSE3, SSE2, SSE and Intel SSSE3 instructions.

ATOM_SSE4.2
  May generate MOVBE instructions for Intel processors (depending on the setting of -minstruction or /Qinstruction). May also generate Intel® SSE4.2, SSE3, SSE2 and SSE instructions for Intel processors. Optimizes for Intel® Atom™ processors that support Intel® SSE4.2 and MOVBE instructions.

SSE4.2
  May generate Intel® SSE4.2, SSE4.1, SSE3, SSE2, SSE and Intel SSSE3 instructions.

SSE4.1
  May generate Intel® SSE4.1, SSE3, SSE2, SSE and Intel SSSE3 instructions.

ATOM_SSSE3
  May generate MOVBE instructions for Intel processors (depending on the setting of -minstruction or /Qinstruction). May also generate Intel® SSSE3, SSE3, SSE2 and SSE instructions for Intel processors. Optimizes for Intel® Atom™ processors that support Intel® SSSE3 and MOVBE instructions. (Deprecated aliases: SSE3_ATOM & SSSE3_ATOM.)

SSSE3
  May generate Intel® SSSE3, SSE3, SSE2 and SSE instructions.

SSE3
  May generate Intel® SSE3, SSE2 and SSE instructions.

SSE2
  May generate Intel® SSE2 and SSE instructions.

SIMD feature set names (2/2) for Intel® compilers

CORE-AVX512
  May generate Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Foundation instructions, Intel® AVX-512 Conflict Detection instructions, and the other AVX-512 subsets that will be available on future Intel® Xeon® architectures. Optimizes for Intel® processors that support Intel® AVX-512 instructions. Sets -qopt-zmm-usage=low by default.

MIC-AVX512
  May generate Intel® AVX-512 Foundation instructions, Intel® AVX-512 Conflict Detection instructions, Intel® AVX-512 Exponential and Reciprocal instructions, Intel® AVX-512 Prefetch instructions for Intel® processors, and the instructions enabled with CORE-AVX2. Optimizes for Intel® processors that support Intel® AVX-512 instructions.

COMMON-AVX512
  May generate Intel® AVX-512 Foundation instructions and Intel® AVX-512 Conflict Detection instructions. Optimizes for Intel® processors that support Intel® AVX-512 instructions. Sets -qopt-zmm-usage=high by default.

CORE-AVX2
  May generate Intel® Advanced Vector Extensions 2 (Intel® AVX2), Intel® AVX, SSE4.2, SSE4.1, SSE3, SSE2, SSE and Intel SSSE3 instructions.

CORE-AVX-I
  May generate Intel® Advanced Vector Extensions (Intel® AVX), including instructions introduced with 3rd generation Intel® Core™ processors, plus Intel® SSE4.2, SSE4.1, SSE3, SSE2, SSE and Intel SSSE3 instructions.

* AVX2: -mavx2 -mfma; AVX-512: -mavx512f -mavx512cd; AVX-512/Core: -mavx512f -mavx512cd -mavx512dq -mavx512bw -mavx512vl

Compatibility (HW & SW)

Compiler compatibility:
• Linux*: GCC
• Windows*: Microsoft* Visual Studio
• MacOS*: Apple XCode*

Will my code run on a non-Intel CPU?
• Yes (32-/64-bit compatibles)
• The Intel compiler is a plug-&-play replacement for the platform compiler (MS/GCC), but with improved code performance: object and binary compatibility, comparable option flags

Compatibility (cont.)

The compiler can generate code that is:
• Intel-friendly: control using -x (Linux) / /Qx (Windows); will only run on Intel CPUs
• Compatible-friendly: control using -m (Linux) / /arch (Windows); will run on Intel and non-Intel CPUs

Contents

• Parallelism on Intel Architecture
• Vectorization of code: overview and terminology
• Introduction to SIMD ISA for Intel® processors
  • Intel® AVX, Intel® AVX2, Intel® AVX-512
  • SIMD instruction set switches for Intel® compilers
• SIMD vectorization

Vectorization software architecture

Vector options, ordered from ease of use (top) to fine control (bottom):

• Intel® Math Kernel Library
• Auto vectorization
• Semi-auto vectorization: #pragma (vector, ivdep, …)
• Array notation: Fortran, Intel® Cilk™ Plus
• C/C++ vector classes (F32vec16, F64vec8)
• OpenCL*
• Intrinsics

Overview of vector code types

• Auto-vectorization

  DO i = 1, N
    A(i) = B(i) + C(i)
  END DO

• Array notation

  A(:) = B(:) + C(:)

• OpenMP SIMD construct

  !$omp simd
  DO i = 1, N
    A(i) = B(i) + C(i)
  END DO

• OpenMP SIMD function

  real function ef(a, b)
    real :: a, b
    !$omp declare simd linear(ref(a,b))
    ef = a + b
  end function ef

  !$omp simd
  DO i = 1, N
    A(i) = ef(B(i), C(i))
  END DO

Automatic vectorization

• The compiler vectorizer works similarly for SSE, AVX, AVX2 and AVX-512 (C/C++, Fortran)
• Enabled by default at optimization level -O2
• Some ISA features, such as vector masks, gather/scatter instructions and fused multiply-add (FMA), enable better vectorization of code
• Vectorized loops may be recognized by:
  • Compiler vectorization and optimization reports: -qopt-report-phase=vec -qopt-report=5
  • Looking at the assembly code: -S
  • Using Intel® VTune™ or Advisor

Optimization Report Example

• Example novec.f90:

  1: subroutine fd(y)
  2:   integer :: i
  3:   real, dimension(10), intent(inout) :: y
  4:   do i=2,10
  5:     y(i) = y(i-1) + 1
  6:   end do
  7: end subroutine fd

$ ifort novec.f90 -qopt-report=5
ifort: remark #10397: optimization reports are generated in *.optrpt files in the output location

$ cat novec.optrpt
…
LOOP BEGIN at novec.f90(4,5)
  remark #15344: loop was not vectorized: vector dependence prevents vectorization
  remark #15346: vector dependence: assumed FLOW dependence between y line 5 and y line 5
  remark #25436: completely unrolled by 9
LOOP END
…

Reasons why automatic vectorization fails

• The compiler prioritizes code correctness
• Compiler heuristics estimate vectorization efficiency
• Vectorization could lead to incorrect or inefficient code due to:
  • Data dependencies
  • Alignment
  • Function calls in the loop block
  • Complex control flow / conditional branches
  • Mixed data types
  • Non-unit stride between elements
  • Loop body too complex (register pressure)
  • ...
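When a dependence is only assumed rather than real, the programmer can assert its absence; a sketch using the Intel-specific !dir$ ivdep directive (names are illustrative, and the assertion is only valid if idx(:) contains no duplicate values):

subroutine scatter_add(n, idx, a, b)
  integer, intent(in) :: n, idx(n)
  real, intent(inout) :: a(:)
  real, intent(in) :: b(n)
  integer :: i
  !dir$ ivdep   ! promise to the compiler: no two idx(i) are equal
  do i = 1, n
    a(idx(i)) = a(idx(i)) + b(i)
  end do
end subroutine scatter_add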

Factors preventing code vectorization

1. Loop-carried dependencies

  DO i = 1, N
    A(i + M) = A(i) + B(i)
  END DO

2. Pointer aliasing (compiler-specific)

  SUBROUTINE scale(a, b, z)
    REAL, POINTER, CONTIGUOUS :: a(:), b(:)
    REAL :: z
    INTEGER :: i
    DO i = 1, N
      b(i) = z * a(i)
    END DO
  END SUBROUTINE

3. Function calls (incl. indirect)

  DO i = 1, NX
    x = x0 + i * h
    sumx = sumx + func(x, y, xp)
  END DO

4. Loop structure, boundary condition

  struct _x { int d; int bound; };
  void doit(int *a, struct _x *x) {
    for(int i = 0; i < x->bound; i++)
      a[i] = 0;
  }

5. Outer vs. inner loops

  DO j = 1, JMAX
    DO i = 1, IMAX
      D(i,j) = D(i,j) + 1
    END DO
  END DO

6. Cost-benefit (compiler-specific)

7. And others...

Factors slowing down vectorized code

1. Indirect memory access

  DO i = 1, N
    A(B(i)) = C(i) * D(i)
  END DO

2. Memory sub-system latency / throughput

  SUBROUTINE scale(n, j, a, b, c)
    INTEGER :: n, j, i
    REAL, CONTIGUOUS :: a(:,:), b(:), c(:)
    DO i = 1, VERY_BIG
      c(i) = z * a(i,j)
      b(i) = z * a(i,i)
    END DO
  END SUBROUTINE

3. Serialized or “sub-optimal” function calls

  DO i = 1, NX
    sumx = sumx + serialized_func_call(x, y, xp)
  END DO

4. Small trip counts that are not a multiple of VL

  SUBROUTINE doit(a, b, unknown_small_value)
    REAL, CONTIGUOUS :: a(:), b(:)
    INTEGER :: unknown_small_value, i
    DO i = 1, unknown_small_value
      a(i) = z * b(i)
    END DO
  END SUBROUTINE

5. Branchy codes

  DO i = 1, MAX
    IF (D(i) < N) THEN
      call do_this(D(i))
    ELSE IF (D(i) > M) THEN
      call do_that(D(i))
    END IF
  END DO

6. MANY others: spill/fill, FP accuracy trade-offs, FMA, DIV/SQRT, unrolling, even AVX throttling...

Preparing code for SIMD

[Flowchart:]
1. Identify hotspots
2. Integer or FP? If FP, precision is important: can you convert to single precision? (A change to SP impacts the SIMD width.)
3. Re-layout data for SIMD efficiency
4. Align data structures
5. Convert code to SIMD form, following the SIMD coding guidelines
6. Optimize memory access patterns and prefetch (if appropriate)
7. Further optimization

Data layout – why it is important

• Instruction-level
  • Hardware is optimized for contiguous loads/stores
  • Support for non-contiguous accesses differs with hardware (e.g., AVX2/AVX-512 gather)
• Memory-level
  • Contiguous memory accesses are cache-friendly
  • The number of memory streams can place pressure on the prefetchers

Data layout – common layouts

Array-of-Structs (AoS): x y z x y z x y z …
• Pros: good locality of {x, y, z}, 1 memory stream
• Cons: potential for gather/scatter

Struct-of-Arrays (SoA): x x x … / y y y … / z z z …
• Pros: contiguous load/store
• Cons: poor locality of {x, y, z}, 3 memory streams

Hybrid (AoSoA): x x y y z z x x y y z z …
• Pros: contiguous load/store, 1 memory stream
• Cons: not a “normal” layout
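In Fortran the two main layouts might look as follows (a sketch; type and component names are illustrative):

module layouts
  implicit none
  type point_aos                            ! AoS: x y z interleaved per element
    real :: x, y, z
  end type point_aos
  type points_soa                           ! SoA: each component is contiguous
    real, allocatable :: x(:), y(:), z(:)
  end type points_soa
end module layouts
! type(point_aos) :: p(1000)  stores  x y z x y z ...
! a points_soa value stores   x x x ... / y y y ... / z z z ...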

Data alignment – why it is important

[Diagram: elements 0-3 sit entirely within cache line 0; elements 6-9 span cache lines 0 and 1.]

▪ Aligned load: the address is aligned; one cache line; one instruction
▪ Unaligned load: the address is not aligned; potentially multiple cache lines; potentially multiple instructions

Data alignment – sample applications

• 1) Align the memory

  REAL, ALLOCATABLE :: array(:,:)
  !dir$ attributes align:64::array

• 2) Access the memory in an aligned way

  DO J=1,M
    DO I=1,N
      ARRAY(I,J)=…

• 3) Tell the compiler

  !dir$ assume_aligned (array, 64)
  ! or
  !$omp simd aligned(array:64)
  !dir$ assume (mod(N,16) .eq. 0)  ! important for rank > 1

Alignment impact – example

• Both cases compiled using -xCORE-AVX512

• (Potentially) unaligned access:

  subroutine mult(n, a, b, c)
    integer, intent(in) :: n
    real, intent(in) :: a(n), b(n)
    real, intent(out) :: c(n)
    integer :: i
    !$omp simd
    do i=1,n
      c(i) = a(i) * b(i)
    end do
  end subroutine mult

  LOOP BEGIN at mult.F90(9,3)
    remark #15389: vectorization support: reference c(i) has unaligned access [ mult.F90(10,6) ]
    remark #15389: vectorization support: reference a(i) has unaligned access [ mult.F90(10,13) ]
    remark #15389: vectorization support: reference b(i) has unaligned access [ mult.F90(10,20) ]
    remark #15381: vectorization support: unaligned access used inside loop body
    remark #15305: vectorization support: vector length 16
    remark #15309: vectorization support: normalized vectorization overhead 1.778
    remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
    remark #15450: unmasked unaligned unit stride loads: 2
    remark #15451: unmasked unaligned unit stride stores: 1
    remark #15475: --- begin vector cost summary ---
    remark #15476: scalar cost: 6
    remark #15477: vector cost: 0.560
    remark #15478: estimated potential speedup: 8.780
    remark #15488: --- end vector cost summary ---
  LOOP END

• Aligned access:

  subroutine amult(n, a, b, c)
    integer, intent(in) :: n
    real, intent(in) :: a(n), b(n)
    real, intent(out) :: c(n)
    integer :: i
    !$omp simd aligned(a,b,c:64)
    do i=1,n
      c(i) = a(i) * b(i)
    end do
  end subroutine amult

  LOOP BEGIN at amult.F90(9,3)
    remark #15388: vectorization support: reference c(i) has aligned access [ amult.F90(10,6) ]
    remark #15388: vectorization support: reference a(i) has aligned access [ amult.F90(10,13) ]
    remark #15388: vectorization support: reference b(i) has aligned access [ amult.F90(10,20) ]
    remark #15305: vectorization support: vector length 16
    remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
    remark #15448: unmasked aligned unit stride loads: 2
    remark #15449: unmasked aligned unit stride stores: 1
    remark #15475: --- begin vector cost summary ---
    remark #15476: scalar cost: 6
    remark #15477: vector cost: 0.310
    remark #15478: estimated potential speedup: 18.000
    remark #15488: --- end vector cost summary ---
  LOOP END

Data alignment – real-life applications

[Diagram, built up over several animation steps: a 1-D data array (indices 0 1 2 3 4 5 6 7 8 9 …) surrounded by halo cells, with padding added at the end; the padding is not strictly necessary.]

Recommendation: Data Alignment

• Adopt LAPACK/BLAS-like interfaces that allow “leading dimension(s)”.
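A sketch of what such an interface looks like (routine and argument names are hypothetical): the leading dimension lda decouples the logical row count m from the allocated row count, so arrays can be padded for alignment without changing the call.

subroutine scale_matrix(m, n, a, lda, z)
  integer, intent(in) :: m, n, lda      ! lda >= m allows padded rows
  real, intent(in) :: z
  real, intent(inout) :: a(lda, n)
  integer :: i, j
  do j = 1, n
    do i = 1, m                         ! only the m logical rows are touched
      a(i, j) = z * a(i, j)
    end do
  end do
end subroutine scale_matrix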

Summary

• SIMD is (yet) another level of parallelism available in the CPU
• Compiler flags can be used to target a specific ISA
  • This is unrelated to the optimization level (-O<n>)
  • Performance := optimization level + target flag
• Data layout is essential for efficient SIMD

Improved Floating Point Consistency

Three Metrics of Interest: Accuracy, Performance, Reproducibility
• Usually in conflict with each other!
• Careful use of compiler options can help control the trade-offs

Compiler Option -fp-model <name>
• Lots of options, but they can be confusing: fast=1|2, double/extended/source, precise, except, strict, fma(-)
• The options span a spectrum from speed (fast=2, fast, …) to FP precision & reproducibility (…, strict)

Use these two switches – it’s easier

-fp-model=consistent
• A good portable compromise (17.0+)
• Cost: ~12% performance

-fimf-use-svml=true
• Eliminates inconsistency between scalar code and vectorized code
• Can recover performance, with a slight loss of accuracy

SVML: Short Vector Math Library. One can still use -fp-model along with -fimf-use-svml and -fimf-arch-consistency. See https://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler
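An illustrative invocation combining the two switches (mirroring the spelling used above; the source file name is a placeholder):

$ ifort -fp-model=consistent -fimf-use-svml=true prog.f90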

Benefits of using -fp-model=consistent

Without this switch:
• Aggressive optimizations
• Changes operation order for the specific architecture
• Favours performance over accuracy
• Uses optimized math libraries (imf)

With this switch:
• Robust, conservative optimizations
• Preserves accuracy across architectures
• Uses portable math libraries

Benefits of using -fimf-use-svml=true

Without this switch (and with -fp-model=consistent):
• Scalar code uses LIBM, vectorised code may use SVML
• Different algorithms are used (scalar vs. vector)
• Leads to different results

With this switch:
• Scalar code uses SVML, vectorised code uses SVML
• The same algorithms are used
• Leads to similar results

Additionally, loops containing math function calls can be vectorised (often the compiler cannot vectorise loops that have function calls within them).

Conditional Numerical Reproducibility (CNR) in Intel® Math Kernel Library