Computer Architecture

Lecture 10 – Vector Machines (Data-Level Parallelism)

Pengju Ren, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, 2021
http://gr.xjtu.edu.cn/web/pengjuren

SISD, SIMD, MISD and MIMD

                         Data Streams
                         Single                    Multiple
Instruction    Single    SISD: Intel Pentium 4     SIMD: SSE instructions of x86
Streams        Multiple  MISD: No example today    MIMD: Intel Core i7

SISD: Single Instruction stream, Single Data stream
SIMD: Single Instruction stream, Multiple Data streams
MISD: Multiple Instruction streams, Single Data stream
MIMD: Multiple Instruction streams, Multiple Data streams
SPMD (Single Program, Multiple Data), e.g. GPUs

Agenda

. Vector Processors
. Single Instruction Multiple Data (SIMD)
. Instruction Set Extensions (Neon@ARM, AVX@Intel, etc.)


Modern SIMD Processors

SIMD architectures can exploit significant data-level parallelism for:
 Matrix-oriented scientific computing
 Media-oriented image and sound processing
 Machine learning algorithms

Most modern CPUs have SIMD architectures:
 Intel MMX, SSE, AVX, AVX2 (Multimedia Extensions, Streaming SIMD Extensions, Advanced Vector Extensions)
 ARM NEON, MIPS MDMX

These architectures include instruction set extensions which allow both sequential and parallel instructions to be executed. Some architectures include separate SIMD coprocessors for handling these instructions.

ARM NEON
 Included in Cortex-A8 and Cortex-A9 processors
Intel SSE and AVX
 SSE introduced in 1999 in the Pentium III processor
 AVX-512 currently used in the Xeon series

Vector Processor

 Basic idea:
   Read sets of data elements into "vector registers"
   Operate on those registers
   Disperse the results back into memory
 Registers are controlled by the compiler
   Used to hide memory latency
   Leverage memory bandwidth
 Overcoming limitations of ILP:
   – Dramatic reduction in fetch and decode bandwidth.
   – No data hazards between elements of the same vector.
   – Data-hazard logic is required only between two vector instructions.
   – Heavily interleaved memory banks, so the latency of initiating a memory access (versus a cache access) is amortized.
   – Since loops are reduced to vector instructions, there are no control hazards.
   – Good performance even for poor locality.

Vector Programming Model

[Figure: vector programming model]
 Scalar registers x0 .. x31
 Vector registers v0 .. v31, each holding elements [0] .. [MAXVL-1]
 Vector length register VL
 Vector arithmetic instructions, e.g. vadd v3, v1, v2: elementwise add over elements [0] .. [VL-1]
 Vector load and store instructions, e.g. vld/vst v1, x1, x2 (base address in x1, stride in x2), moving a vector register to or from memory

Vector Code Example

# C code
for (i=0; i<64; i++)
    C[i] = a*A[i] + B[i];

# Scalar Code
      li     x4, 64
      li     x6, a
loop: fld    f1, 0(x1)
      fld    f2, 0(x2)
      fmul.d f3, f1, x6
      fadd.d f4, f3, f2
      fsd    f4, 0(x3)
      addi   x1, 8
      addi   x2, 8
      addi   x3, 8
      subi   x4, 1
      bnez   x4, loop

# Vector Code
      li    x4, 64
      li    x6, a
      setvl x4
      vld   v1, x1
      vld   v2, x2
      vmul  v3, v1, x6
      vadd  v4, v3, v2
      vst   v4, x3

Vector Instruction Set Advantages

. Compact
  – one short instruction encodes N operations
. Expressive, tells hardware that these N operations:
  – are independent
  – use the same functional unit
  – access disjoint registers
  – access registers in the same pattern as previous instructions
  – access a contiguous block of memory (unit-stride load/store)
  – access memory in a known pattern (strided load/store)
. Scalable
  – can run the same code on more parallel pipelines (lanes)

RV64V Extension (RISC-V + Vector Extension)

 Vector registers: 32 x 64-bit (16 read ports and 8 write ports)
 Vector functional units: each unit is fully pipelined
 Vector load/store unit
 A set of scalar registers: the normal 31 general-purpose registers

Suffix: .vv when both operands are vectors, .vs when the second operand is a scalar, and .sv when the first operand is a scalar register.
Dynamic register typing: if a vector register holds 32 64-bit elements, then views of 128 16-bit elements or 256 8-bit elements are equally valid.

Vector Arithmetic Execution

• Use a deep pipeline (=> fast clock) to execute element operations
• Control of the deep pipeline is simple because elements in a vector are independent (=> no hazards!)
• No data hazards
• No bypassing needed

[Figure: six-stage multiply pipeline computing v3 <- v1 * v2]

Vector Instruction Execution

vadd vc, va, vb

[Figure: execution of vadd vc, va, vb using one pipelined functional unit (one element enters per cycle: A[6]+B[6] issues as C[0] completes) versus four pipelined functional units (four elements enter per cycle: A[24..27]+B[24..27] issue as C[0..3] complete)]

Vector Instruction Parallelism

. Can overlap execution of multiple vector instructions
  – example machine has 32 elements per vector register and 8 lanes

[Figure: issue timeline for the load, multiply, and add units; each unit processes 8 elements per cycle, and successive load/mul/add instructions overlap in time]

Complete 24 operations/cycle while issuing 1 short instruction/cycle.

Vector Chaining

. Vector version of register bypassing
  – introduced with the Cray-1

vld  v1
vmul v3, v1, v2
vadd v5, v3, v4

[Figure: chaining paths across registers V1 .. V5: memory feeds the load unit, which chains into the multiplier, which chains into the adder]

. Chaining allows a vector operation to start as soon as the individual elements of its vector source operand become available.

Vector Chaining Advantage

• Without chaining, must wait for the last element of a result to be written before starting the dependent instruction
• With chaining, can start the dependent instruction as soon as the first result appears

[Figure: Load/Mul/Add timelines: without chaining the three units run back-to-back; with chaining they overlap]

Vector Execution Time

. Execution time depends on three factors:
  – Length of the operand vectors
  – Structural hazards among the operations
  – Data dependencies
. Convoy
  – Set of vector instructions that could potentially execute together
. Chime
  – Unit of time to execute one convoy
  – m convoys execute in m chimes for vector length n
  – For a vector length of n, requires m x n clock cycles (single lane)

Example

vld  v0, x5      # Load vector X
vmul v1, v0, f0  # Vector-scalar multiply
vld  v2, x6      # Load vector Y
vadd v3, v1, v2  # Vector-vector add
vst  v3, x6      # Store the sum

Convoys:
1. vld, vmul
2. vld, vadd
3. vst

3 chimes, 2 FP ops per result, cycles per FLOP = 1.5
For 32-element vectors, requires 32 x 3 = 96 clock cycles

Vector Processor Optimization

How can a vector processor execute a single vector faster than one element per clock cycle?
 Multiple lanes: beyond one element per cycle

How does a vector processor handle programs where the vector lengths are not the same as the maximum vector length?
 Vector-length register: handling loops not equal to MVL (strip mining)

What happens when there is an IF statement inside the code to be vectorized?
 Predicate registers: vector-mask control

What does a vector processor need from the memory system?
 Memory banks: supplying bandwidth for vector load/store units

How does a vector processor handle multidimensional matrices?
 Strides: the data structure must vectorize

How does a vector processor handle sparse matrices?
 Vector scatter/gather

Vector Unit Structure: Multiple Lanes

[Figure: four parallel lanes; each lane contains a slice of the vector register file (elements 0, 4, 8, …; 1, 5, 9, …; 2, 6, 10, …; 3, 7, 11, …), a pipeline of each functional unit, and a port into the memory subsystem]

The same element position in the input and output registers is referred to as a lane. There cannot be a carry or overflow from one lane to another.

T0 Vector Microprocessor (UCB/ICSI, 1995)

[Figure: vector register elements striped over 8 lanes; lane 0 holds elements 0, 8, 16, 24, lane 1 holds 1, 9, 17, 25, and so on up to lane 7]

Vector Strip Mining

Problem: What happens if the loop length does not match the length of the vector registers?
A vector-length register (VLR) contains the number of elements used within a vector register.
Solution: "strip mining": split a large loop into loops whose length is less than or equal to the maximum vector length (MVL), as in the sketch below.
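A minimal C sketch of strip mining, assuming a hypothetical maximum vector length MVL of 64 and the DAXPY-style loop from the earlier code example; the function and variable names are illustrative only. The first strip handles the N mod MVL remainder, and every later strip is a full MVL elements:

#include <stddef.h>

#define MVL 64   /* assumed maximum vector length */

void daxpy_strip_mined(size_t n, double a,
                       const double *x, const double *y, double *c)
{
    size_t low = 0;
    size_t vl = n % MVL;            /* first strip: the remainder */
    if (vl == 0) vl = MVL;
    while (low < n) {
        /* this inner loop is what the hardware would run as one
           vector operation of length vl (the VLR value) */
        for (size_t i = low; i < low + vl; i++)
            c[i] = a * x[i] + y[i];
        low += vl;
        vl = MVL;                   /* all later strips are full length */
    }
}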

Vector Strip Mining: Example

andi x1, xN, 63   # N mod 64, the length of the first strip
for (i=0; i<N; i++) …

Vector Conditional Execution

Problem: Want to vectorize loops with conditional code:

for (i=0; i<N; i++)
    if (A[i] > 0) A[i] = B[i];

Solution: Add vector mask (or flag) registers
 – vector version of predicate registers, 1 bit per element
…and maskable vector instructions
 – a vector operation becomes a bubble ("NOP") at elements where the mask bit is clear

Code example:

cvm            # Turn on all mask bits
vld vA, (xA)   # Load entire A vector
vgt vA, f0     # Set bits in mask register where A > 0
vld vA, (xB)   # Load B vector into A under mask
vst vA, (xA)   # Store A back to memory under mask
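As a rough scalar model of what the masked sequence above computes (a sketch only; the array names and the comparison against zero follow the example):

#include <stddef.h>

void masked_assign(size_t n, double *A, const double *B)
{
    for (size_t i = 0; i < n; i++) {
        int mask = (A[i] > 0.0);   /* vgt: one mask bit per element */
        if (mask)
            A[i] = B[i];           /* vld/vst take effect only under mask */
    }
}

On real hardware all n element operations are issued as vectors; the mask merely suppresses the effect of the disabled elements.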

Masked Vector Instructions

Simple implementation
 – execute all N operations, turn off result writeback according to the mask

Density-time implementation
 – scan the mask vector and only execute elements with non-zero mask bits

[Figure: both datapaths; the simple version computes every A[i] op B[i] and gates the write-enable of the write data port with M[i], while the density-time version skips elements whose mask bit is zero and writes only the enabled results]

Interleaved Vector Memory System

. Memory system must be designed to support high bandwidth for vector loads and stores
. Spread accesses across multiple banks
 – Control bank addresses independently
 – Load or store non-sequential words (needs independent bank addressing)
 – Support multiple vector processors sharing the same memory

[Figure: address generator (base + stride) feeding the vector registers from 16 interleaved memory banks, 0 through F]


Stride: Handling Multidimensional Arrays

Problem: Want to vectorize the rows/columns of a matrix multiply:

for (i=0; i<100; i++)
    for (j=0; j<100; j++) {
        A[i][j] = 0.0;
        for (k=0; k<100; k++)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }

Solution: non-unit strides
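For example, column j of the row-major matrix D above lies at a stride of 100 doubles (800 bytes); here is a scalar sketch of the access pattern that a single strided vector load would cover (the function name is illustrative):

void load_column(double D[100][100], int j, double col[100])
{
    /* successive k touch addresses 100*sizeof(double) bytes apart */
    for (int k = 0; k < 100; k++)
        col[k] = D[k][j];
}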


Accessing non-sequential memory locations and reshaping them into a dense structure is one of the major advantages of a vector architecture.
RV64V: VLDS (load vector with stride), VSTS (store vector with stride)

Compress/Expand Operations

. Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register
 – the population count of the mask vector gives the packed vector length
. Expand performs the inverse operation
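A minimal scalar model of compress (a sketch; the names and types are illustrative, not from any real vector ISA):

#include <stddef.h>

size_t vcompress(size_t vl, const double *src, const int *mask, double *dst)
{
    size_t k = 0;
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            dst[k++] = src[i];   /* masked elements packed contiguously */
    return k;                    /* population count of the mask */
}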

[Figure: compress gathers the elements whose mask bit is set (A[7], A[5], A[4], A[1]) to the front of the destination register; expand scatters them back to their original masked positions]

Used for density-time conditionals and also for general selection operations.

Vector Scatter-Gather

Problem: handling sparse matrices
Solution: gather/scatter operations

. Consider:

for (i = 0; i < n; i++)
    A[K[i]] = A[K[i]] + C[M[i]];

. Use index vectors:

vsetdcfg 4*FP64    # 4 64-bit FP vector registers
vld   v0, x7       # Load K[]
vldx  v1, x5, v0   # Load A[K[]] (gather)
vld   v2, x28      # Load M[]
vldx  v3, x6, v2   # Load C[M[]] (gather)
vadd  v1, v1, v3   # Add them
vstx  v1, x5, v0   # Store A[K[]] (scatter)
vdisable           # Disable vector registers

Automatic Code Vectorization

for (i=0; i < N; i++)
    C[i] = A[i] + B[i];

[Figure: the scalar sequential code performs load, load, add, store once per iteration; the vectorized code reorders these into one vector load per operand, one vector add, and one vector store covering all iterations]

Vectorization is a massive compile-time reordering of operation sequencing
 requires extensive loop-dependence analysis

Vector Reductions

Problem: loop-carried dependence on reduction variables

sum = 0;
for (i=0; i<N; i++)
    sum += A[i];   /* loop-carried dependence on sum */

Solution: re-associate the sum. Keep VL partial sums in a vector register, accumulate A into them with a strip-mined vector loop, and then combine the partial sums with a binary-tree reduction, halving VL each step while VL > 1. A sketch follows below.
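A minimal C sketch of that scheme, assuming a hypothetical vector length VL of 64; the names are illustrative:

#include <stddef.h>

#define VL 64   /* assumed vector length */

double vector_sum(const double *A, size_t n)
{
    double partial[VL] = {0.0};           /* VL independent partial sums */
    size_t i = 0;

    for (; i + VL <= n; i += VL)          /* vectorizable: no carried dep */
        for (int l = 0; l < VL; l++)
            partial[l] += A[i + l];
    for (; i < n; i++)                    /* scalar cleanup for the tail */
        partial[i % VL] += A[i];

    for (int w = VL / 2; w >= 1; w /= 2)  /* binary-tree combine */
        for (int l = 0; l < w; l++)
            partial[l] += partial[l + w];

    return partial[0];
}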

Vector Memory-Memory versus Vector Register Machines

. Vector memory-memory instructions hold all vector operands in main memory
. The first vector machines, the CDC Star-100 ('73) and TI ASC ('71), were memory-memory machines
. The Cray-1 ('76) was the first vector register machine

Vector Memory-Memory Code Example

Source code:
for (i=0; i<N; i++)
    C[i] = A[i] + B[i];

Memory-memory code:
vadd C, A, B

Packed SIMD Extensions

[Figure: one 64-bit register viewed as 1x64b, 2x32b, 4x16b, or 8x8b packed elements]

. Very short vectors added to existing ISAs for microprocessors
. Use existing 64-bit registers split into 2x32b or 4x16b or 8x8b
 – Lincoln Labs TX-2 from 1957 had a 36b datapath split into 2x18b or 4x9b
 – Newer designs have wider registers
   • 128b for PowerPC Altivec, Intel SSE2/3/4
   • 256b/512b for Intel AVX
. Single instruction operates on all elements within the register

[Figure: a 4x16b add: four pairs of 16-bit lanes added in parallel by one instruction]

AArch64 (Arm v8 Neon)

Neon is a brand name which refers to Arm's implementations of the Advanced SIMD architecture.

A 128-bit Neon vector can contain the following element sizes:
 Sixteen 8-bit elements (operand suffix .16B, where B indicates byte)
 Eight 16-bit elements (operand suffix .8H, where H indicates halfword)
 Four 32-bit elements (operand suffix .4S, where S indicates word)
 Two 64-bit elements (operand suffix .2D, where D indicates doubleword)

Instruction Examples for AArch64 (Arm v8 Neon)

ADD V0.8H, V1.8H, V2.8H
 performs a parallel addition of eight lanes of 16-bit (8 x 16 = 128) integer elements from the vectors in V1 and V2, storing the result in V0.

MUL V0.4S, V2.4S, V3.S[2]
 multiplies each of the four 32-bit elements in V2 by the 32-bit scalar value in lane 2 of V3, storing the result vector in V0.

Packed SIMD versus Vectors

. Limited instruction set:
 – no vector length control
 – no strided load/store or scatter/gather
 – unit-stride loads must be aligned to a 64/128-bit boundary
. Limited vector register length:
 – requires superscalar dispatch to keep the multiply/add/load units busy
 – loop unrolling to hide latencies increases register pressure
. Trend towards fuller vector support in microprocessors
 – Better support for misaligned memory accesses
 – Support for double-precision (64-bit floating-point) arithmetic
 – New Intel AVX spec (announced April 2008): 256b vector registers (expandable up to 1024b), gather added, scatter to follow
 – Arm Scalable Vector Extension (SVE, SVE2)

Advantages of Vector Processors

. Require lower instruction bandwidth
  Reduced by fewer fetches and decodes
. Easier addressing of main memory
  Load/store units access memory with known patterns
. Elimination of memory wastage
  Unlike cache access, every data element requested by the processor is actually used: no cache misses
  Latency only occurs once per vector during pipelined loading
. Simplification of control hazards
  Loop-related control hazards are eliminated
. Scalable platform
  Increase performance by using more hardware resources
. Reduced code size
  A short, single instruction can describe N operations

Summary: Performance Optimizations

. Increase memory bandwidth
  Memory banks are used to reduce load/store latency
  Allow multiple simultaneous outstanding memory requests
. Strip mining
  Generates code to handle vector operands whose size is less than or greater than the size of the vector registers
. Vector chaining
  The equivalent of data forwarding in vector processors
  Results of one pipeline are fed into the operand registers of another pipeline
. Scatter and gather
  Retrieve data elements scattered throughout memory and pack them into sequential vectors in vector registers
  Promotes data locality and reduces data pollution
. Multiple parallel lanes, or pipes
  Allow a vector operation to be performed in parallel on multiple elements of the vector

Next Lecture: Multithreading and Multicore (Thread-Level Parallelism)


Acknowledgements

. Some slides contain material developed and copyright by:
 – Arvind (MIT)
 – Krste Asanovic (MIT/UCB)
 – Joel Emer (Intel/MIT)
 – James Hoe (CMU)
 – David Patterson (UCB)
 – David Wentzlaff (Princeton University)

. MIT material derived from course 6.823
. UCB material derived from courses CS252 and CS 61C

Fundamentals of Armv8 Neon Technology


Neon for Arm SIMD

Arm Neon technology is an advanced Single Instruction Multiple Data (SIMD) architecture extension for the Arm Cortex-A and Cortex-R series processors.

. Neon intrinsics
. Neon-enabled libraries (Arm Compute Library)
. Auto-vectorization by your compiler (Arm Compiler, LLVM, GCC)
   armclang --target=aarch64-arm-none-eabi -mcpu=cortex-a53
   Auto-vectorization is enabled by default at optimization level -O2 and higher, or with -fvectorize
. Hand-coded Neon assembler

Neon Intrinsics Example

int64x2_t vaddl_high_s32 (int32x4_t a, int32x4_t b)

Description: Signed Add Long (vector). This instruction adds each vector element in the upper half of the first source SIMD&FP register to the corresponding vector element of the second source SIMD&FP register, places the results into a vector, and writes the vector to the destination SIMD&FP register. The destination vector elements are twice as long as the source vector elements. All values in this instruction are signed integer values.
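A small usage sketch of this intrinsic (assuming an AArch64 target with <arm_neon.h>; the data values are made up):

#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int32_t av[4] = {1, 2, 3, 4};
    int32_t bv[4] = {10, 20, 30, 40};

    int32x4_t a = vld1q_s32(av);
    int32x4_t b = vld1q_s32(bv);

    /* adds the upper two 32-bit lanes, widening each result to 64 bits:
       r = { 3 + 30, 4 + 40 } = { 33, 44 } */
    int64x2_t r = vaddl_high_s32(a, b);

    printf("%lld %lld\n",
           (long long)vgetq_lane_s64(r, 0),
           (long long)vgetq_lane_s64(r, 1));
    return 0;
}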


Arm Compute Library

The Arm Compute Library is a collection of low-level machine learning functions optimized for Cortex-A CPU and Mali GPU architectures. The library provides ML acceleration on Cortex-A CPUs through Neon or SVE, and acceleration on Mali GPUs through OpenCL. The library is extremely flexible and allows developers to use machine learning functions individually or as part of complex pipelines to accelerate their algorithms and applications. Enabled by Arm microarchitecture optimizations, the Arm Compute Library provides superior performance to other OSS alternatives and immediate support for new Arm technologies, for example SVE2. The Arm Compute Library is open-source software available under a permissive MIT license.

 Over 100 machine learning functions for CPU and GPU
 Multiple convolution algorithms (GEMM, Winograd, FFT, and Direct)

Example of Source Code for Compiler Optimization

You can make the following changes to assist the compiler's auto-vectorization, as in the sketch below:
 Compile at -O2 or higher, or with -fvectorize.
 Specify #pragma clang loop vectorize(enable) before the loop as a hint to the compiler.
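A small illustrative loop written with those hints in mind (the function is hypothetical, not from the original slides). The restrict qualifiers promise the compiler that the arrays do not alias, which is often what blocks vectorization in practice:

#include <stddef.h>

void scale_add(size_t n, const float *restrict a,
               const float *restrict b, float *restrict c)
{
    #pragma clang loop vectorize(enable)   /* hint: vectorize this loop */
    for (size_t i = 0; i < n; i++)
        c[i] = 2.0f * a[i] + b[i];
}

Compiled with, for example, armclang --target=aarch64-arm-none-eabi -mcpu=cortex-a53 -O2, the loop body maps naturally onto Neon lanes.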

