Vectorization & Cache Organization
ASD Shared Memory HPC Workshop

Computer Systems Group

Research School of Computer Science, Australian National University, Canberra, Australia

February 11, 2020

Schedule - Day 2

Single Instruction Multiple Data (SIMD) Operations - Outline

1 Single Instruction Multiple Data (SIMD) Operations
  SIMD CPU Extensions
  Understanding SIMD Operations
  SIMD Registers
  Using SIMD Operations
2 Cache Basics
3 Multiprocessor Cache Organization
4 Thread Basics

Flynn’s Taxonomy

SISD: Single instruction single data
MISD: Multiple instructions single data (streaming processors)
SIMD: Single instruction multiple data (array, vector processors)
MIMD: Multiple instructions multiple data (multi-threaded processors)

Mike Flynn, ‘Very High-Speed Computing Systems’, Proceedings of the IEEE, 1966

Types of Parallelism

Data Parallelism: performing the same operation on different pieces of data
  SIMD: e.g. summing two vectors element by element
Task Parallelism: executing different threads of control in parallel
Instruction Level Parallelism: multiple instructions are executed concurrently
  Superscalar: multiple functional units
  Out-of-order execution and pipelining
  Very long instruction word (VLIW)

SIMD: multiple operations are concurrent, while the instructions are the same

History of SIMD - Vector Processors

Instructions operate on vectors rather than scalar values
Vector registers hold vectors to be loaded from or stored to
Vectors may be of variable length, i.e. vector registers must support variable vector lengths
Data elements to be loaded into a vector register may not be contiguous in memory, i.e. support is needed for strides (distances between two elements of a vector)
The Cray-1 used vector processors
  Clocked at 80 MHz, installed at Los Alamos National Lab in 1976
  Introduced CPU registers for SIMD vector operations
  250 MFLOPS when SIMD operations were utilized effectively
Primary disadvantage: works well only if the parallelism is regular
Superseded by contemporary scalar processors with support for vector operations, i.e. SIMD extensions

SIMD Extensions

Extensive use of SIMD extensions in contemporary hardware:
Complex Instruction Set Computers (CISC)
  Intel MMX: 64-bit wide registers; the first widely used SIMD instruction set on the desktop computer (1996)
  Intel Streaming SIMD Extensions (SSE): 128-bit wide XMM registers
  Intel Advanced Vector Extensions (AVX): 256-bit wide YMM registers
Reduced Instruction Set Computers (RISC)
  SPARC64 VIIIfx (HPC-ACE): 128-bit registers
  PowerPC A2 (AltiVec, VSX): 128-bit registers
  ARMv7, ARMv8 (NEON): 64-bit and 128-bit registers
A similar architecture, Single Instruction Multiple Thread (SIMT), is used in GPUs

SIMD Processing - Vector addition

C[i] = A[i] + B[i]

void VectorAdd(float *a, float *b, float *c, size_t size) {
    size_t i;
    for (i = 0; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}

Assume arrays A and B contain 8-bit short integers
No dependencies between operations, i.e. embarrassingly parallel
Note: arrays A and B may not be contiguously allocated
How can this operation be parallelized?

SIMD Processing - Vector addition

Scalar: 8 loads + 4 scalar adds + 4 stores = 16 ops
Vector: 2 loads + 1 vector add + 1 store = 4 ops
Speedup: 16/4 = 4×
Fundamental idea: perform multiple operations on multiple data items concurrently using single instructions
Advantages:
  Performance improvement
  Fewer instructions: reduced code size, maximization of data bandwidth
  Automatic parallelization by the compiler for vectorizable code

Intel SSE

Intel Streaming SIMD Extensions (1999): 70 new instructions
SSE2 (2000): 144 new instructions, adding support for double-precision data and 32-bit integers
SSE3 (2005): 13 new instructions for multi-thread support and HyperThreading
SSE4 (2007): 54 new instructions for text processing, strings and fixed-point arithmetic

8 (in 32-bit mode) or 16 (in 64-bit mode) 128-bit XMM registers: XMM0 - XMM15
Hold 8, 16, 32 and 64-bit integers, and 32-bit single-precision and 64-bit double-precision floating-point values

Intel AVX

Intel Advanced Vector Extensions (2008): extended vectors to 256 bits
AVX2 (2013): expands most integer SSE and AVX instructions to 256 bits
Intel FMA3 (2013): fused multiply-add, introduced in Haswell
8 or 16 256-bit YMM registers: YMM0 - YMM15
SSE instructions operate on the lower half of the YMM registers

Introduces new three-operand instructions, i.e. one destination and two source operands
Previously, SSE instructions had the form a = a + b
With AVX, the source operands are preserved, i.e. c = a + b (see the sketch below)
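A hedged illustration of the difference (not from the slides; function names are illustrative, the intrinsics are the standard SSE/AVX ones, and compiling the AVX routine assumes -mavx support):

#include <immintrin.h>

/* SSE: two-operand form at the assembly level, one source is overwritten */
__m128 sse_sum(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);      /* addps xmm_a, xmm_b: register holding a is overwritten */
}

/* AVX: three-operand form, both sources preserved */
__m256 avx_sum(__m256 a, __m256 b) {
    return _mm256_add_ps(a, b);   /* vaddps ymm_c, ymm_a, ymm_b: a and b preserved */
}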

ARM NEON

ARM Advanced SIMD (NEON); ARM Advanced SIMDv2 adds support for fused multiply-add and the half-precision extension (available in ARM Cortex-A15)
Separate register file: 32 64-bit registers, shared with VFPv3/VFPv4 instructions
Separate 10-stage execution pipeline
NEON register views:
  D0-D31: 32 64-bit double-word registers
  Q0-Q15: 16 128-bit quad-word registers
Data types: 8, 16, 32 and 64-bit integers; 32-bit SP floating-point on ARMv7; 32-bit SP and 64-bit DP floating-point on ARMv8

SIMD Instruction Types

Data movement: load and store vectors between main memory and SIMD registers
Arithmetic operations: addition, subtraction, multiplication, division, absolute difference, maximum, minimum, saturation arithmetic, square root, multiply-accumulate, multiply-subtract, halving-subtract, folding maximum and minimum
Logical operations: bitwise AND, OR, NOT and their combinations
Data value comparisons: =, <=, <, >=, >
Pack, unpack, shuffle: initializing vectors from bit patterns, rearranging bits based on a control mask
Conversion: between floating-point and integer data types, using saturation arithmetic
Bit shift: often used for integer arithmetic such as division and multiplication
Other: cache-specific operations, casting, bit insert, cache line flush, data prefetch, execution pause, etc.

How to use SIMD operations

Compiler auto-vectorization: requires a compiler with vectorizing capabilities. Least time consuming; performance is variable and entirely dependent on compiler quality.
Compiler intrinsic functions: an almost one-to-one mapping to assembly instructions, without having to deal with register allocation, instruction scheduling, type checking or call stack maintenance.
Inline assembly: writing assembly instructions directly into higher-level code.
Low-level assembly: best approach for high performance. Most time consuming, least portable.

Compiler Auto-vectorization

Requires a vectorizing compiler, e.g. gcc, icc, clang
Loop unrolling combined with the generation of packed SIMD instructions
GCC enables vectorization with -O3; the Intel compiler enables it at -O2
The instruction set is specified with -msse2 (-msse4.1, -mavx) on Intel systems and with -mfpu=neon on ARM systems
Reports from the vectorization process:
  -ftree-vectorizer-verbose=<level> (gcc), where level is between 1 and 5
  -vec-report5 (Intel icc)

Compiler Auto-vectorization - Loops

What kind of loops are good candidates for auto-vectorization?
Countable: the loop trip count must be known at entry to the loop at runtime
Single entry and single exit: no break
Straight-line code: different iterations must not follow different flow control (must not branch); if statements are allowed if they can be implemented as masked assignments
The innermost loop of a nest: loop interchange may occur in earlier optimization phases
No function calls: some intrinsic math functions are allowed (sin, log, pow, etc.)
No aliasing: pointers to vector arguments should be declared with the keyword restrict, which guarantees that no aliases exist for them (see the sketch below)
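A minimal sketch (not from the slides) of a loop that satisfies these criteria, with restrict-qualified pointers and an if statement the vectorizer can turn into a masked assignment; the function and array names are illustrative:

/* Countable, single entry/exit, innermost, no calls, no aliasing. */
void clamp_add(float* __restrict__ c, const float* __restrict__ a,
               const float* __restrict__ b, int n) {
    for (int i = 0; i < n; i++) {
        float t = a[i] + b[i];
        if (t > 1.0f)       /* vectorizable: becomes a compare + blend (mask) */
            t = 1.0f;
        c[i] = t;
    }
}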

Compiler Auto-vectorization - Obstacles

Non-contiguous memory accesses: four consecutive floats (or ints) can be loaded directly; if there is a stride, they have to be loaded separately using multiple instructions.
Non-aligned data structures: may result in multiple load instructions. Arrays can be aligned (here on 16-byte boundaries) dynamically or statically as follows:

float *a = (float *) memalign(16, N * sizeof(float));
float b[N] __attribute__ ((aligned (16)));

Data dependencies: RAW, WAR, WAW (a RAW example is sketched below)
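A minimal illustrative sketch of such a blocker (names are illustrative): a read-after-write (RAW) dependence carried across iterations, which prevents the iterations from being executed in vector-width groups:

void prefix_like(float *a, const float *b, int n) {
    for (int i = 1; i < n; i++) {
        a[i] = a[i-1] + b[i];   /* RAW dependence: reads the value written in the previous iteration */
    }
}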

Compiler Auto-vectorization - Example

SAXPY: Y = αX + Y in single precision
Here α is a scalar constant and X, Y are SP vectors
Used in the BLAS (Basic Linear Algebra Subprograms) library

void saxpy(int n, float a, float* __restrict__ X, float* __restrict__ Y) {
    int i;
    for (i = 0; i < n; i++)
        Y[i] = a * X[i] + Y[i];
}

$ gcc -O3 -ftree-vectorizer-verbose=1 saxpy.cc -o saxpy
Analyzing loop at saxpy.cc:31
Vectorizing loop at saxpy.cc:31
saxpy.cc:31: note: === vect_do_peeling_for_alignment ===
saxpy.cc:31: note: === vect_update_inits_of_dr ===
saxpy.cc:31: note: === vect_do_peeling_for_loop_bound === Setting upper bound of nb iterations for epilogue loop to 2
saxpy.cc:31: note: LOOP VECTORIZED.
saxpy.cc:28: note: vectorized 1 loops in function.
saxpy.cc:31: note: Completely unroll loop 2 times
saxpy.cc:28: note: Completely unroll loop 3 times

SIMD Data Types - Intel

Data type and content:
  __m64: 8×char, 4×short, 2×int32, 2×float, 1×int64, 1×double
  __m128: 4×float
  __m128d: 2×double
  __m128i: 16×char, 8×short, 4×int32, 2×int64
  __m256: 8×float
  __m256d: 4×double
  __m256i: 32×char, 16×short, 8×int32, 4×int64
(__m128 is available from SSE, __m128d and __m128i from SSE2, and the __m256 types require AVX)

SIMD Data Types - ARM

NEON vector data types have the pattern <type><size>x<lanes>_t; 64-bit types occupy a D register and 128-bit types a Q register:
  int8x8_t / int8x16_t: 8-bit int
  int16x4_t / int16x8_t: 16-bit int
  int32x2_t / int32x4_t: 32-bit int
  int64x1_t / int64x2_t: 64-bit int
  uint8x8_t / uint8x16_t: 8-bit unsigned int
  uint16x4_t / uint16x8_t: 16-bit unsigned int
  uint32x2_t / uint32x4_t: 32-bit unsigned int
  uint64x1_t / uint64x2_t: 64-bit unsigned int
  float16x4_t / float16x8_t: 16-bit floating-point
  float32x2_t / float32x4_t: 32-bit floating-point
  poly8x8_t / poly8x16_t: 8-bit polynomial
  poly16x4_t / poly16x8_t: 16-bit polynomial

It is also possible to have types representing arrays of vectors, of size 1 to 4
E.g. int8x8x2_t represents an array of two int8x8_t vectors
Individual vectors in these arrays can be accessed using .val[0], .val[1], etc. (see the sketch below)
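A small hedged sketch (not from the slides) using one of these array-of-vector types: vld2q_f32 loads eight floats de-interleaved into a float32x4x2_t, whose members are accessed via .val[]; the function name is illustrative:

#include <arm_neon.h>

float32x4_t sum_even(const float32_t *p) {
    float32x4x2_t v = vld2q_f32(p);   /* loads p[0..7], de-interleaved into two vectors */
    return v.val[0];                  /* the even elements: p[0], p[2], p[4], p[6] */
}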

Compiler Intrinsic Functions - Intel

SSE and AVX intrinsic function names use the notational convention _mm_<op>_<suffix>:
<op>: indicates the basic operation of the intrinsic function; for example add for addition and sub for subtraction
<suffix>: denotes the type of data the instruction operates on
  The first one or two letters of the suffix denote whether the data is:
    p: packed
    ep: extended packed
    s: scalar
  The remaining letters and numbers denote the type, with notation as follows:
    s: single-precision floating-point
    d: double-precision floating-point
    i32: signed 32-bit integer
    u32: unsigned 32-bit integer
    ...
Some decoded examples follow below.
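A few decoded examples of this convention (these are standard SSE/SSE2 intrinsic names):
  _mm_add_ps: add, packed single-precision (4 floats in an XMM register)
  _mm_add_ss: add, scalar single-precision (lowest element only)
  _mm_add_pd: add, packed double-precision (2 doubles)
  _mm_add_epi32: add, extended packed signed 32-bit integers (4 int32 values)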

Compiler Intrinsic Functions - ARM

NEON intrinsic function names use the notational convention v<op><modifiers>_<type>:
<op>: indicates the basic operation of the intrinsic function; for example add for addition and sub for subtraction
<number> (between 1 and 4): denotes the array size of the result vector, i.e. size 1 (int16x4_t), size 2 (int16x4x2_t), size 3 (int16x4x3_t) or size 4 (int16x4x4_t)
q: denotes that Q registers are used by both operands and result
l: long shape; the number of bits in each result element is double the number of bits in each operand element, i.e. the operands are usually double-word vectors and the result is a quad-word vector
w: wide shape; the result and first operand are twice the width of the second operand, i.e. a quad-word and a double-word operand give a quad-word result
<type>: denotes the type of data the instruction operates on
Some decoded examples follow below.
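A few decoded examples of the NEON convention (these are standard NEON intrinsic names):
  vadd_s16: add on D registers, signed 16-bit lanes (int16x4_t operands and result)
  vaddq_s16: add on Q registers (q), signed 16-bit lanes (int16x8_t)
  vaddl_s16: long add (l), int16x4_t operands, int32x4_t result
  vaddw_s16: wide add (w), int32x4_t first operand, int16x4_t second operand, int32x4_t result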

Compiler Intrinsic Functions - Examples

Load four SP FP values, address aligned
  Intel SSE2: __m128 _mm_load_ps(float* p);
  ARM NEON: float32x4_t vld1q_f32(const float32_t* p);
Result lanes: R0 = p[0], R1 = p[1], R2 = p[2], R3 = p[3]

Store four SP FP values; the address must be 16-byte aligned
  Intel SSE2: void _mm_store_ps(float* p, __m128 a);
  ARM NEON: void vst1q_f32(float32_t* p, float32x4_t a);
Result: p[0] = a0, p[1] = a1, p[2] = a2, p[3] = a3

Compiler Intrinsic Functions - Examples

Add four SP FP values
  Intel SSE2: __m128 _mm_add_ps(__m128 a, __m128 b);
  ARM NEON: float32x4_t vaddq_f32(float32x4_t a, float32x4_t b);
Result lanes: R0 = a0+b0, R1 = a1+b1, R2 = a2+b2, R3 = a3+b3

Fused multiply-add of four SP FP values
  Intel FMA3: __m128 _mm_fmadd_ps(__m128 a, __m128 b, __m128 c);
  ARM NEON: float32x4_t vfmaq_f32(float32x4_t a, float32x4_t b, float32x4_t c);
Result lanes: R0 = a0*b0 + c0, R1 = a1*b1 + c1, R2 = a2*b2 + c2, R3 = a3*b3 + c3
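Putting the load, FMA and store examples together, a hedged sketch of a SAXPY kernel using the Intel intrinsics above (illustrative only; assumes x and y are 16-byte aligned, n is a multiple of 4, and the target supports FMA3, e.g. compiled with -mfma):

#include <immintrin.h>

void saxpy_fma(int n, float a, const float *x, float *y) {
    __m128 va = _mm_set1_ps(a);                 /* broadcast alpha to all 4 lanes */
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_load_ps(&x[i]);
        __m128 vy = _mm_load_ps(&y[i]);
        _mm_store_ps(&y[i], _mm_fmadd_ps(va, vx, vy));   /* y = a*x + y */
    }
}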

Compiler Intrinsic Functions - Portability

Code vectorized with intrinsic functions for a specific processor is not portable; it only runs if the architecture supports that SIMD extension
Portability is ensured by using conditional compilation in the code
The program must contain both a scalar and a vectorized version of the same computation
Using #ifdef statements, the applicable version is chosen at compile time
For example in GCC, if the -msse2 option is passed, then the macro __SSE2__ is defined
Usage of conditional compilation:

#ifdef __SSE2__
/* Code optimized with SSE2 intrinsic functions */
#elif defined(__ARM_NEON__)
/* Code optimized with ARM NEON intrinsic functions */
#else
/* Scalar code */
#endif

Reference Manuals

Intel:
  Intel Intrinsics Guide
  Intel 64 and IA-32 Architectures Developer’s Manual
ARM:
  ARM NEON Programmer’s Guide (version 1.0)
  ARM NEON Intrinsics in GCC

Optimizing Vector Addition

C[i] = A[i] + B[i]

void VectorAdd(float *a, float *b, float *c, size_t size) {
    size_t i;
    for (i = 0; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}

Optimizing Vector Addition - Intel SSE2

void VectorAddSSE(float* __restrict__ a, float* __restrict__ b,
                  float* __restrict__ c, size_t size) {
    size_t i;
    for (i = 0; i < (size/4) * 4; i += 4) {
        /* Load into SSE XMM registers */
        __m128 sse_a = _mm_load_ps(&a[i]);
        __m128 sse_b = _mm_load_ps(&b[i]);

        /* Perform addition */
        __m128 sse_c = _mm_add_ps(sse_a, sse_b);

        /* Store back to memory */
        _mm_store_ps(&c[i], sse_c);
    }

    /* Handle any remaining elements */
    for (i = (size/4) * 4; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}

Optimizing Vector Addition - ARM NEON

void VectorAddNEON(float32_t* __restrict__ a, float32_t* __restrict__ b,
                   float32_t* __restrict__ c, size_t size) {
    size_t i;
    /* Declare vector data types */
    float32x4_t a4, b4, c4;

    for (i = 0; i < (size/4)*4; i += 4) {
        /* Load into quad NEON registers */
        a4 = vld1q_f32(a+i);
        b4 = vld1q_f32(b+i);

        /* Perform addition */
        c4 = vaddq_f32(a4, b4);

        /* Store back to memory */
        vst1q_f32(c+i, c4);
    }

    /* Handle any remaining elements */
    for (i = (size/4)*4; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}

Hands-on Exercise: Vectorizing Loops

Objective: Vectorizing loops using SSE & NEON instructions

Cache Basics - Outline

1 Single Instruction Multiple Data (SIMD) Operations
2 Cache Basics
3 Multiprocessor Cache Organization
4 Thread Basics

Importance

An infinitely fast CPU must still load and store its data to memory
HPC applications have execution time t(n) = O(n^m) for problem size n, with at least n data items and up to t(n) data accesses for t(n) instructions, for example:
  Discrete Fourier Transform: O(n) data items, O(n log n) or O(n^2) operations
  Reduction to upper/lower triangular form: O(n^2) data items, O(n^3) operations
  Forward/backward substitution: O(n^2) data items, O(n^2) operations
Hence we need large memories with fast access times

Achieved by a cost (per bit) vs. speed trade-off:
  cache memory: low cost
  wide (parallel) memory access: moderate cost
  faster technology (?): high cost

Access Times

Memory Access Time: the amount of time it takes to read or write a memory location
Memory Cycle Time: how quickly you can repeat a memory access

For example, a memory chip may have an access time of 200 ns, but a cycle time of 50 ns
Over the last ≈20 years, CPU speed has improved much faster than memory access times
  In the mid 1980s, commodity DRAMs had an access time of 200 ns, while the IBM PC had a CPU clock of 4.77 MHz (a 210 ns period)
  Today commodity DRAMs have access times of around 50 ns, but CPU clock periods have decreased to 1 ns or less
  This is in part because the CPU is now within one chip, while memory is on external chips
Memory sizes are also now much larger

Memory Technologies

Two main memory technologies:
  SRAM (Static RAM): high cost/bit, low access time (≈10 ns), used in caches
  DRAM (Dynamic RAM): low cost/bit, high access time (≈50 ns), used in most main memories
SRAM: each bit uses at least 3 transistors (6 for best) and a constant power supply
DRAM: each bit uses 1 transistor and a capacitor; over time the charge leaks from the capacitor and it must be refreshed
A memory access involves the stages: select row address, select column address, (R/W) access the selected bit(s), as well as transferring addresses and data over the relevant bus; hence there is scope for pipelining

Memory Hierarchy

Modern microprocessors have a memory hierarchy:

Access speed:
  Registers: a clock cycle
  Cache: a few cycles
  Memory (DRAM): many cycles
  Virtual memory: long!

Idea: data that is “currently most needed” is brought into a (smaller) faster memory
Observation: memory accesses in most programs exhibit:
  Temporal locality: if address X is accessed, it is likely to be accessed again soon
  Spatial locality: if address X is accessed, address X+1 is likely to be accessed soon
⇒ caches are organized into lines (units) of L words (L = 2^l, e.g. L = 2, 4, 8)
  + blocked memory accesses (faster) & less control info needed (per word)
  - redundant memory traffic if only 1 word per line is ever used, e.g. pointer chasing (non-unit stride guaranteed!)
If locality holds, this yields a good cost-speed trade-off

Memory Hierarchy

Intel Core i7 Xeon 5500 at 2.8 GHz

Registers: 1 cycle (≈0.3 ns)
L1 cache hit: 4 cycles (≈1.5 ns)
L2 cache hit: 10 cycles (≈3.5 ns)
L3 cache hit: 40 cycles (≈15 ns)
Local DRAM: 160 cycles (≈60 ns)
Remote DRAM: 280 cycles (≈100 ns)

Memory access time includes cost of getting data over the bus

Registers

Very limited resource, accessed in one cycle
Goal is to keep operands in registers as much as possible, e.g. X = G * 2.41 + A/W - W/B
RISC instructions are limited to two source operands, thus the results of G * 2.41, A/W and W/B must be stored back to registers before adding them together
Also, W is used twice and we don't want it loaded from (slow) memory twice
A fundamental job of the compiler is to optimize register use, e.g.

ld   [%W],%f1       !load W from memory into register f1
ld   [%A],%f2       !load A from memory into register f2
fdiv %f2,%f1,%f2    !form A/W and overwrite f2 (A) with result
ld   [%B],%f3       !load B from memory into register f3
fdiv %f1,%f3,%f3    !form W/B and overwrite f3 (B) with result

Cache Memory

Small amount of SRAM memory
Cache hit rate: the % of (word) accesses in a program for which the data is in cache
  Needs to be high (e.g. > 95%) for good performance
  Problem: programs may need to be re-written to achieve this!
  Only possible if there is sufficient inherent data re-use g(n)/n, where g(n) is the operation count for n data items, e.g.:
    (n^(1/2) × n^(1/2)) matrix multiply: g(n) = 2n^(3/2) (mixed unit and non-unit stride)

FFT (Fast Fourier Transform): g(n) = 8n lg2(n) (power-of-2 stride)
Thus, the FFT is more likely to be dominated by memory access time
Consistency of data cache & main memory: when a store instruction is executed, the relevant line is updated first in the cache
  write-through: the resulting word is written to main memory at the same time; slow, but consistency is maintained
  copy-back: write the (dirty) cache line to memory when it is thrown out of the cache

Cache Friendly

Friendly:

for (i=0; i<100000; i+=1) {
    sum += a[i];
}

double a[200][200];
for (i=0; i<200; i++) {
    for (j=0; j<200; j++) {
        sum += a[i][j];
    }
}

Unfriendly:

for (i=0; i<800000; i+=8) {
    sum += a[i];
}

double a[200][200];
for (i=0; i<200; i++) {
    for (j=0; j<200; j++) {
        sum += a[j][i];
    }
}

while (ptr->next != NULL)
    ptr = ptr->next;

Direct-Mapped Caches

Cache memory is organized into lines of size 2^l; for a cache of size 2^c there are C' = 2^(c-l) lines

A 32-bit address X is divided into fields a0 a1 a2: a2 = bits l-1..0 (offset within the line), a1 = bits c-1..l (line index), a0 = bits 31..c (tag)
All addresses with the same a1 are mapped to the same cache line (see the sketch below)
+ easy to implement & low chip area per word (note: the cache must store the value of a0) ⇒ a large C' is possible (better performance)
- cache conflicts: 2 (or more) words from memory map to the same cache line; conflicts can be instruction-instruction, instruction-data or data-data, and can cause large, often unpredictable, performance losses
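A minimal sketch (not from the slides) of how the a0/a1/a2 fields are extracted from an address, assuming lines of 2^l bytes and a cache of 2^c bytes; the names are illustrative:

#include <stdint.h>

typedef struct { uint32_t tag, index, offset; } cache_addr_t;

cache_addr_t split_address(uint32_t x, unsigned c, unsigned l) {
    cache_addr_t f;
    f.offset = x & ((1u << l) - 1);               /* a2: byte within the line        */
    f.index  = (x >> l) & ((1u << (c - l)) - 1);  /* a1: which of the C' lines       */
    f.tag    = x >> c;                            /* a0: stored alongside the line   */
    return f;
}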

K-way Set Associative Caches

Like a direct-mapped cache of size C', but every line is extended to a set of K lines (the total number of cache lines is now C'K)
Addresses with the same a1 can map into the corresponding line in any of the K sets
Typically K = 1, 2, 4, 5, 6, 8
Reduces the chance of conflicts by a factor of K, but at some extra cost

Examples of Cache Thrashing

4K direct mapped cache:

float a[1024], b[1024];
for (i=0; i<1024; i++) {
    a[i] = a[i]+b[i];
}

4K 2-way set associative cache:

float a[1024], b[1024], c[1024];
for (i=0; i<1024; i++) {
    a[i] = a[i]+b[i]+c[i];
}

Cache: Other Issues

Modern processors usually have separate data & instruction 1st-level caches
Multiple (2 or 3) levels of cache (i.e. a deep memory hierarchy)
  Typically: top-level (data) cache: C' = 16KB, K = 4, write-through; 2nd-level (instruction/data) cache: C' = 1MB, K = 1, copy-back
  Harder still to tune a program for 2 levels!
(Top-level) cache prefetching (requires load/store pipelines)
  An access time (latency) of δ cycles can be hidden if each load is performed δ cycles in advance of when it is needed (see the sketch below)
  Can be done via:
    prefetching or software pipelining (by programmer or compiler, e.g. UltraSPARC)
    H/W instruction re-ordering (e.g. Pentium IV): more effective, but the H/W is more expensive and complex
Increase the data bus width to the cache line size L
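A hedged sketch of software prefetching using GCC's __builtin_prefetch (illustrative only; the distance DIST would be tuned so the prefetch issues roughly δ cycles ahead of the use):

#define DIST 16   /* illustrative prefetch distance, in elements */

double sum_with_prefetch(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0, 3);  /* read, keep in cache */
        s += a[i];
    }
    return s;
}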

Hands-on Exercise: IPC and Cache Misses

Objective: Using the PAPI advanced interface to measure instructions per cycle for a variety of loops and cache behaviour

Multiprocessor Cache Organization - Outline

1 Single Instruction Multiple Data (SIMD) Operations
2 Cache Basics
3 Multiprocessor Cache Organization
4 Thread Basics

Shared Memory Hardware

(Fig 2.5 Grama et al, Intro to Parallel Computing)

Shared Address Space Systems

Systems with caches but an otherwise flat memory are generally called UMA
If access to local memory is cheaper than access to remote memory (NUMA), this should be built into your algorithm
Global address space systems are easier to program
  Read-only interactions are invisible to the programmer and coded as in a sequential program
  Read/write interactions are harder, requiring mutual exclusion for concurrent accesses
Programmed using threads, with synchronization using locks and related mechanisms

Cache Hierarchy on Intel Core i7 (2013)

(64 byte cache line size)

Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1

Caches on Multiprocessors

Multiple copies of some data word may be manipulated by two or more processors at the same time
Usually there are separate I-caches (instruction caches) on the first 1-2 levels (usually of similar sizes to the data caches)
  How is instruction access different to data access? Why is this useful?
Two requirements:
  An address translation mechanism that locates each physical memory word in the system
  Concurrent operations on multiple copies have well defined semantics
The latter is generally known as a cache coherency protocol
Input/output using direct memory access (DMA) on machines with caches also leads to coherency issues
Some machines only provide shared address space mechanisms and leave coherence to the programmer, e.g. the Texas Instruments Keystone II system

Cache Coherency

Intuitive behaviour: reading the value at address X should return the last value written to address X by any processor
What does last mean?
  What if writes are simultaneous, or closer in time than the time required to communicate between two processors?
In a sequential program, last is determined by program order (not time)
  This holds true within a thread of a parallel program, but what does it mean with multiple threads?

Cache/Memory Coherency

A memory system is coherent if:
  Ordered as issued: a read by processor P to address X that follows a write by P to X returns the value of that write (assuming no other processor writes to X in between)
  Write propagation: a read by processor P1 to address X that follows a write by processor P2 to X returns the written value, if the read and write are sufficiently separated in time (assuming no other write to X occurs in between)
  Write serialization: writes to the same address are serialized; two writes by any two processors are observed in the same order by all processors
(Later to be contrasted with memory consistency!)

Two Cache Coherency Protocols

(Fig 2.21 Grama et al, Intro to Parallel Computing)

Cache Line View

Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1

Need to augment cache line information with information regarding validity

Update vs. Invalidate

Update protocol: when a data item is written, all of its copies in the system are updated
Invalidate protocol (most common): before a data item is written, all other copies are marked as invalid
Comparison:
  Multiple writes to the same word with no intervening reads require multiple write broadcasts in an update protocol, but only one initial invalidation
  With multi-word cache blocks, each word written in a cache block must be broadcast in an update protocol, but only one invalidate per line is required
  The delay between writing a word on one processor and reading the written value on another is usually less for update

False Sharing

Two processors modify different parts of the same cache line
An invalidate protocol leads to ping-ponged cache lines
An update protocol performs reads locally, but updates generate much traffic between processors
This effect is entirely an artefact of the hardware
Need to design parallel systems / programs with this issue in mind:
  Cache line size: the longer the line, the more likely false sharing becomes
  Alignment of data structures with respect to the cache line size (see the sketch below)
Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1
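A minimal illustrative sketch of false sharing (not from the slides; counts and names are arbitrary): two threads update adjacent longs that sit in the same 64-byte line, and padding each counter onto its own line removes the contention without changing the program's logic:

long counters[2];                      /* adjacent longs: both in one 64-byte line  */
/* struct { long v; char pad[56]; } counters[2];   padded variant: one line each    */

void *worker(void *arg) {              /* pthread-style start routine               */
    long id = (long) arg;              /* 0 or 1, passed as the thread argument     */
    for (long i = 0; i < 100000000L; i++)
        counters[id]++;                /* counters[id].v++ in the padded variant    */
    return NULL;
}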

Implementing Cache Coherency

On small-scale bus-based machines:
  A processor must obtain access to the bus to broadcast a write invalidation
  With two competing processors, the first to gain access to the bus invalidates the other's data
A cache miss needs to locate the top copy of the data
  Easy for write-through caches
  For write-back caches, each processor snoops the bus and responds by providing the data if it holds the top copy
For writes, we would like to know if any other copies of the block are cached, i.e. whether a write-back cache needs to put details on the bus
  Handled by having a tag to indicate shared status
Minimizing processor stalls: either by duplication of tags or by having multiple inclusive caches

3 State (MSI) Cache Coherency Protocol

read: local read
write: local write
c_read (coherency read): a read on a remote processor gives rise to the shown transition in the local cache
c_write (coherency write): a write miss, or a write in the Shared state, on a remote processor gives rise to the shown transition in the local cache
(Fig 2.22 Grama et al, Intro to Parallel Computing)

MSI Coherency Protocol

(Fig 2.23 Grama et al, Intro to Parallel Computing)

Snoopy Cache Systems

All caches broadcast all transactions (read or write misses, writes in the S state)
Suited to bus or ring interconnects; however, scalability is limited (i.e. ≤ 8 processors)
All processors monitor the bus for transactions of interest
Each processor's cache has a set of tag bits that determine the state of the cache block
Tags are updated according to the state diagram for the relevant protocol
  e.g. if the snoop hardware detects that a read has been issued for a cache block of which it holds a dirty copy, it asserts control of the bus, puts the data out (to the requesting cache and to main memory) and sets the tag to the S state
What sort of data access characteristics are likely to perform well/badly on snoopy based systems?

Snoopy Cache Based System

(Fig 2.24 Grama et al, Intro to Parallel Computing)

Snoopy Cache-Based System: Ring

The Core i7 (Sandy Bridge) on-chip interconnect revisited:
  a ring-based interconnect between Cores, Graphics, Last Level Cache (LLC) and System Agent domains
  has 4 physical rings: Data (32B), Request, Acknowledge and Snoop rings
  fully pipelined; bandwidth, latency and power scale with cores
  the shortest path is chosen to minimize latency
  has distributed arbitration & sophisticated protocols to handle coherency and ordering

(courtesy www.lostcircuits.com)

Directory Cache Based Systems

The need to broadcast is clearly not scalable
A solution is to send information only to the processing elements specifically interested in that data
This requires a directory to store information
Augment global memory with a presence bitmap to indicate which caches each memory block is located in

Directory Based Cache Coherency

To implement this, we must track the state of each cache block; a simple protocol might be:
  Shared: one or more processors have the block cached, and the value in memory is up to date
  Uncached: no processor has a copy
  Exclusive: only one processor (the owner) has a copy, and the value in memory is out of date
We must handle a read/write miss and a write to a shared, clean cache block; these:
  first reference the directory entry to determine the current state of the block
  then update the entry's status and presence bitmap
  then send the appropriate state update transactions to the processors in the presence bitmap
(a sketch of a directory entry follows)
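A hedged sketch of what a directory entry for this simple protocol might look like (illustrative only; assumes at most 64 caches so the presence bitmap fits in one 64-bit word):

#include <stdint.h>

enum dir_state { UNCACHED, SHARED, EXCLUSIVE };

struct dir_entry {
    enum dir_state state;      /* Uncached, Shared or Exclusive             */
    uint64_t       presence;   /* bit p set => cache p holds a copy         */
    int            owner;      /* meaningful only in the Exclusive state    */
};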

Directory-Based Cache Coherency

(Fig 2.25 Grama et al, Intro to Parallel Computing)

Directory-Based Systems

How much memory is required to store the directory?
What sort of data access characteristics are likely to perform well/badly on directory based systems?
How do distributed and centralized systems compare?
Should the presence bitmaps be replicated in the caches? Must they be?
How would you implement sending an invalidation message to all (and only to all) processors in the presence bitmap?

Costs on SGI Origin 3000 (clock cycles)

                                             <= 16 CPU   > 16 CPU
Cache hit                                          1           1
Cache miss to local memory                        85          85
Cache miss to remote home directory              125         150
Cache miss to remotely cached data (3 hop)       140         170

Figure from http://people.nas.nasa.gov/~schang/origin opt.html
Data from: Computer Architecture: A Quantitative Approach, David A. Patterson, John L. Hennessy, David Goldberg, 3rd Ed, Morgan Kaufmann, 2003

Real Cache Coherency Protocols

From Wikipedia:
Most modern systems use variants of the MSI protocol to reduce the amount of traffic in the coherency interconnect
  The MESI protocol adds an “Exclusive” state to reduce the traffic caused by writes of blocks that only exist in one cache
  The MOSI protocol adds an “Owned” state to reduce the traffic caused by write-backs of blocks that are read by other caches [the processor owning the cache line services requests for that data]
  The MOESI protocol does both of these things
  The MESIF protocol uses the “Forward” state to reduce the traffic caused by multiple responses to read requests when the coherency architecture allows caches to respond to snoop requests with data

MESI (on a bus)

Ref: https://www.cs.tcd.ie/Jeremy.Jones/vivio/caches/MESIHelp.htm

Multi-Level Cache

What is visibility of changes between levels of cache?

Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1
The easiest model is inclusive: if a line is in the owned state in L1, it is also in the owned state in L2

Cache Summary

Cache coherency arises because the abstraction of a single shared address space is not actually implemented by a single storage unit in a machine
Three components to cache coherency: issue order, write propagation, write serialization
Two implementations: broadcast and directory
False sharing is a potential performance issue, more likely the longer the cache line

Hands-on Exercise: Matrix Multiplication Performance

Objective: To understand the effect of various matrix multiply loop orderings on IPC and cache misses

Thread Basics - Outline

1 Single Instruction Multiple Data (SIMD) Operations
2 Cache Basics
3 Multiprocessor Cache Organization
4 Thread Basics

Fork/Join Programming Model

(Heavyweight) UNIX Processes

An O/S like UNIX is based on the notion of a process
The CPU is shared between different processes
UNIX processes are created via fork()
  The child is an exact copy of the parent, except that it has a unique process ID
Processes are "joined" using the system call wait()
  This introduces a synchronization point

UNIX Fork Example

pid = fork();
if (pid == 0) {
    // code to be executed by child
} else {
    // code to be executed by parent
}
if (pid == 0)
    exit(0);
else
    wait(0);

Processes and Threads

Why Threads

Software portability: applications can be developed on a serial machine and run on parallel machines without changes (is this really true?)
Latency hiding: ability to mask access to memory, I/O or communication by having another thread execute in the meantime (but how quickly can execution switch between threads?)
Scheduling and load balancing: for unstructured and dynamic applications (e.g. game playing), load balancing can be very hard; one option is to create more threads than CPU resources and let the O/S sort out the scheduling
Ease of programming, widespread use: due to the above, threaded programs are easier to write (or develop incrementally), so there has been widespread acceptance of the POSIX thread API (generally referred to as pthreads)

Pthread Creation

Threads and Threaded Code

pthread_self() provides the ID of the calling thread
pthread_equal(thread1, thread2) tests IDs for equality
Detached threads: threads that will never synchronize via a join; specified via an attribute (see the sketch below)
Re-entrant or thread-safe routines are those that can be safely called while another instance has been suspended in the middle of its invocation
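A minimal sketch of creating a detached thread via an attribute (illustrative helper only; the start routine worker is assumed to be supplied by the caller):

#include <pthread.h>

void spawn_detached(void *(*worker)(void *), void *arg) {
    pthread_t tid;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    pthread_create(&tid, &attr, worker, arg);   /* can never be joined */
    pthread_attr_destroy(&attr);
}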

Example: Computing Pi

The ratio of the area of the circle to the area of the square is π/4
Generate points within the domain of the square at random
Identify those that are at distance less than 1 from the origin
The ratio of points in the circle to total points approaches π/4

Example: Computing Pi

#include <pthread.h>
#include <stdlib.h>
#define MAX_THREADS 512
void *compute_pi (void *);
....
main () {
    ...
    pthread_t p_threads[MAX_THREADS];
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    for (i = 0; i < num_threads; i++) {
        hits[i] = i;
        pthread_create(&p_threads[i], &attr, compute_pi,
                       (void *) &hits[i]);
    }
    for (i = 0; i < num_threads; i++) {
        pthread_join(p_threads[i], NULL);
        total_hits += hits[i];
    }
    ...
}

Example: Computing Pi (cont)

void *compute_pi(void *s) {
    int seed, i, *hit_pointer;
    int local_hits;
    hit_pointer = (int *) s;
    seed = *hit_pointer;
    local_hits = 0;
    double rx, ry;
    for (i = 0; i < sample_points_per_thread; i++) {
        rx = ((double) rand_r(&seed)) / RAND_MAX - 0.5;
        ry = ((double) rand_r(&seed)) / RAND_MAX - 0.5;
        if (rx*rx + ry*ry < 0.25)
            local_hits++;
    }
    *hit_pointer = local_hits;
    pthread_exit(0);
}

Programming and Performance Notes

Note the use of the function rand_r (instead of superior random number generators such as drand48)
Executing this on a 4-processor SGI Origin, we observe a 3.91-fold speedup at 32 threads; this corresponds to a parallel efficiency of 0.98!
We can also modify the program slightly to observe the effect of false sharing
The program can also be used to assess the secondary cache line size

Performance

4-processor SGI Origin system using up to 32 threads
Instead of incrementing local_hits, we add to a shared array using a stride of 1, 16 and 32 (see the sketch below)
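A hedged sketch of that modification (illustrative only; it reuses sample_points_per_thread, the headers and the thread-creation code from the earlier listing, with hits[i] = i supplying each thread's index): each thread updates its own slot of a shared array, and varying STRIDE changes whether slots of different threads share a cache line:

#define STRIDE 1                          /* try 1, 16 and 32 */
extern int sample_points_per_thread;      /* global from the earlier listing */
int shared_hits[512 * 32];                /* large enough for any stride tried */

void *compute_pi_shared(void *s) {
    int id = *((int *) s);                /* thread index */
    unsigned int seed = id;
    for (int i = 0; i < sample_points_per_thread; i++) {
        double rx = ((double) rand_r(&seed)) / RAND_MAX - 0.5;
        double ry = ((double) rand_r(&seed)) / RAND_MAX - 0.5;
        if (rx * rx + ry * ry < 0.25)
            shared_hits[id * STRIDE]++;   /* shared array updated directly */
    }
    pthread_exit(0);
}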

(Fig 7.2 Grama et al, Intro to Parallel Computing)

Hands-on Exercise: False Cache Line Aliasing

Objective: To observe the effect of cache line contention and false cache line sharing on performance

Summary

Topics covered today - Vectorization & Cache Organization:
  SIMD operations
  Cache basics
  Multiprocessor cache organization

Tomorrow - Multi Processor Parallelism!
