Computer Architecture

Lecture 10 – Vector Machines (Data-Level Parallelism)

Pengju Ren, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, 2021
http://gr.xjtu.edu.cn/web/pengjuren

SISD, SIMD, MISD and MIMD

                         Data Streams
                         Single                    Multiple
Instruction    Single    SISD: Intel Pentium 4     SIMD: SSE instructions of x86
Streams        Multiple  MISD: No example today    MIMD: Intel Core i7

SISD: Single Instruction stream, Single Data stream
SIMD: Single Instruction stream, Multiple Data streams
MISD: Multiple Instruction streams, Single Data stream
MIMD: Multiple Instruction streams, Multiple Data streams
SPMD (Single Program, Multiple Data), e.g. GPUs

Agenda

. Vector Processors
. Single Instruction Multiple Data (SIMD)
. Instruction Set Extensions (Neon@ARM, AVX@Intel, etc.)


Modern SIMD Processors

SIMD architectures can exploit significant data-level parallelism for:
 Matrix-oriented scientific computing
 Media-oriented image and sound processing
 Machine learning algorithms

Most modern CPUs have SIMD architectures:
 Intel MMX, SSE, AVX, AVX2 (Multimedia Extensions, Streaming SIMD Extensions, Advanced Vector Extensions)
 ARM NEON, MIPS MDMX

These architectures include instruction set extensions which allow both sequential and parallel instructions to be executed. Some architectures include separate SIMD coprocessors for handling these instructions.

ARM NEON
 Included in Cortex-A8 and Cortex-A9 processors
Intel SSE and AVX
 SSE introduced in 1999 in the Pentium III processor
 AVX-512 currently used in the Xeon series

Vector Processor

 Basic idea:
   Read sets of data elements into "vector registers"
   Operate on those registers
   Disperse the results back into memory
 Registers are controlled by the compiler
   Used to hide memory latency
   Leverage memory bandwidth
 Overcoming limitations of ILP:
   – Dramatic reduction in fetch and decode bandwidth.
   – No data hazards between elements of the same vector.
   – Data-hazard logic is required only between two vector instructions.
   – Heavily interleaved memory banks, so the latency of initiating a memory access (versus a cache access) is amortized.
   – Since loops are reduced to vector instructions, there are no control hazards.
   – Good performance even for poor locality.

Vector Programming Model

[Figure: vector programming model]
 Scalar registers x0 .. x31
 Vector registers v0 .. v31, each holding elements [0] .. [MAXVL-1]
 Vector length register VL
 Vector arithmetic instructions, e.g. vadd v3, v1, v2: elementwise add over elements [0] .. [VL-1]
 Vector load and store instructions, e.g. vld/vst v1, x1, x2 (base address in x1, stride in x2), moving a vector register to or from memory

Vector Code Example

# C code
for (i=0; i<64; i++)
    C[i] = a*A[i] + B[i];

# Scalar Code
      li     x4, 64
      li     x6, a
loop: fld    f1, 0(x1)
      fld    f2, 0(x2)
      fmul.d f3, f1, x6
      fadd.d f4, f3, f2
      fsd    f4, 0(x3)
      addi   x1, 8
      addi   x2, 8
      addi   x3, 8
      subi   x4, 1
      bnez   x4, loop

# Vector Code
      li    x4, 64
      li    x6, a
      setvl x4
      vld   v1, x1
      vld   v2, x2
      vmul  v3, v1, x6
      vadd  v4, v3, v2
      vst   v4, x3

Vector Instruction Set Advantages

. Compact
  – one short instruction encodes N operations
. Expressive, tells hardware that these N operations:
  – are independent
  – use the same functional unit
  – access disjoint registers
  – access registers in the same pattern as previous instructions
  – access a contiguous block of memory (unit-stride load/store)
  – access memory in a known pattern (strided load/store)
. Scalable
  – can run the same code on more parallel pipelines (lanes)

RV64V Extension (RISC-V + Vector Extension)

 Vector registers: 32 x 64-bit (16 read ports and 8 write ports)
 Vector functional units: each unit is fully pipelined
 Vector load/store unit
 A set of scalar registers: the normal 31 general-purpose registers

Suffix: .vv when both operands are vectors, .vs when the second operand is a scalar, and .sv when the first operand is a scalar register.
Dynamic register typing: if a vector register holds 32 64-bit elements, then views of 128 16-bit elements or 256 8-bit elements are equally valid.

Vector Arithmetic Execution

• Use a deep pipeline (=> fast clock) to execute element operations
• Control of the deep pipeline is simple because elements in a vector are independent (=> no hazards!)
• No data hazards
• No bypassing needed

[Figure: six-stage multiply pipeline computing v3 <- v1 * v2]

Vector Instruction Execution

vadd vc, va, vb

[Figure: execution of vadd vc, va, vb using one pipelined functional unit (one element enters per cycle: A[6]+B[6] issues as C[0] completes) versus four pipelined functional units (four elements enter per cycle: A[24..27]+B[24..27] issue as C[0..3] complete)]

Vector Instruction Parallelism

. Can overlap execution of multiple vector instructions
  – example machine has 32 elements per vector register and 8 lanes

[Figure: issue timeline for the load, multiply, and add units; each unit processes 8 elements per cycle, and successive load/mul/add instructions overlap in time]

Complete 24 operations/cycle while issuing 1 short instruction/cycle.

Vector Chaining

. Vector version of register bypassing
  – introduced with the Cray-1

vld  v1
vmul v3, v1, v2
vadd v5, v3, v4

[Figure: chaining paths across registers V1 .. V5: memory feeds the load unit, which chains into the multiplier, which chains into the adder]

. Chaining allows a vector operation to start as soon as the individual elements of its vector source operand become available.

Vector Chaining Advantage

• Without chaining, must wait for the last element of a result to be written before starting the dependent instruction
• With chaining, can start the dependent instruction as soon as the first result appears

[Figure: Load/Mul/Add timelines: without chaining the three units run back-to-back; with chaining they overlap]

Vector Execution Time

. Execution time depends on three factors:
  – Length of the operand vectors
  – Structural hazards among the operations
  – Data dependencies
. Convoy
  – Set of vector instructions that could potentially execute together
. Chime
  – Unit of time to execute one convoy
  – m convoys execute in m chimes for vector length n
  – For a vector length of n, requires m x n clock cycles (single lane)

Example

vld  v0, x5      # Load vector X
vmul v1, v0, f0  # Vector-scalar multiply
vld  v2, x6      # Load vector Y
vadd v3, v1, v2  # Vector-vector add
vst  v3, x6      # Store the sum

Convoys:
1. vld, vmul
2. vld, vadd
3. vst

3 chimes, 2 FP ops per result, cycles per FLOP = 1.5
For 32-element vectors, requires 32 x 3 = 96 clock cycles

Vector Processor Optimization

How can a vector processor execute a single vector faster than one element per clock cycle?
 Multiple lanes: beyond one element per cycle

How does a vector processor handle programs where the vector lengths are not the same as the maximum vector length?
 Vector-length register: handling loops not equal to MVL (strip mining)

What happens when there is an IF statement inside the code to be vectorized?
 Predicate registers: vector-mask control

What does a vector processor need from the memory system?
 Memory banks: supplying bandwidth for vector load/store units

How does a vector processor handle multidimensional matrices?
 Strides: the data structure must vectorize

How does a vector processor handle sparse matrices?
 Vector scatter/gather

Vector Unit Structure: Multiple Lanes

[Figure: four parallel lanes; each lane contains a slice of the vector register file (elements 0, 4, 8, …; 1, 5, 9, …; 2, 6, 10, …; 3, 7, 11, …), a pipeline of each functional unit, and a port into the memory subsystem]

The same element position in the input and output registers is referred to as a lane. There cannot be a carry or overflow from one lane to another.

T0 Vector Microprocessor (UCB/ICSI, 1995)

[Figure: vector register elements striped over 8 lanes; lane 0 holds elements 0, 8, 16, 24, lane 1 holds 1, 9, 17, 25, and so on up to lane 7]

Vector Strip Mining

Problem: What happens if the loop length does not match the length of the vector registers?
A vector-length register (VLR) contains the number of elements used within a vector register.
Solution: "strip mining": split a large loop into loops whose length is less than or equal to the maximum vector length (MVL), as in the sketch below.
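A minimal C sketch of strip mining, assuming a hypothetical maximum vector length MVL of 64 and the DAXPY-style loop from the earlier code example; the function and variable names are illustrative only. The first strip handles the N mod MVL remainder, and every later strip is a full MVL elements:

#include <stddef.h>

#define MVL 64   /* assumed maximum vector length */

void daxpy_strip_mined(size_t n, double a,
                       const double *x, const double *y, double *c)
{
    size_t low = 0;
    size_t vl = n % MVL;            /* first strip: the remainder */
    if (vl == 0) vl = MVL;
    while (low < n) {
        /* this inner loop is what the hardware would run as one
           vector operation of length vl (the VLR value) */
        for (size_t i = low; i < low + vl; i++)
            c[i] = a * x[i] + y[i];
        low += vl;
        vl = MVL;                   /* all later strips are full length */
    }
}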

Vector Strip Mining: Example

andi x1, xN, 63   # N mod 64, the length of the first strip
for (i=0; i<N; i++) …

Vector Conditional Execution

Problem: Want to vectorize loops with conditional code:

for (i=0; i<N; i++)
    if (A[i] > 0) A[i] = B[i];

Solution: Add vector mask (or flag) registers
 – vector version of predicate registers, 1 bit per element
…and maskable vector instructions
 – a vector operation becomes a bubble ("NOP") at elements where the mask bit is clear

Code example:

cvm            # Turn on all mask bits
vld vA, (xA)   # Load entire A vector
vgt vA, f0     # Set bits in mask register where A > 0
vld vA, (xB)   # Load B vector into A under mask
vst vA, (xA)   # Store A back to memory under mask
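As a rough scalar model of what the masked sequence above computes (a sketch only; the array names and the comparison against zero follow the example):

#include <stddef.h>

void masked_assign(size_t n, double *A, const double *B)
{
    for (size_t i = 0; i < n; i++) {
        int mask = (A[i] > 0.0);   /* vgt: one mask bit per element */
        if (mask)
            A[i] = B[i];           /* vld/vst take effect only under mask */
    }
}

On real hardware all n element operations are issued as vectors; the mask merely suppresses the effect of the disabled elements.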

Masked Vector Instructions

Simple implementation
 – execute all N operations, turn off result writeback according to the mask

Density-time implementation
 – scan the mask vector and only execute elements with non-zero mask bits

[Figure: both datapaths; the simple version computes every A[i] op B[i] and gates the write-enable of the write data port with M[i], while the density-time version skips elements whose mask bit is zero and writes only the enabled results]

Interleaved Vector Memory System

. Memory system must be designed to support high bandwidth for vector loads and stores
. Spread accesses across multiple banks
 – Control bank addresses independently
 – Load or store non-sequential words (needs independent bank addressing)
 – Support multiple vector processors sharing the same memory

[Figure: address generator (base + stride) feeding the vector registers from 16 interleaved memory banks, 0 through F]


Stride: Handling Multidimensional Arrays

Problem: Want to vectorize the rows/columns of a matrix multiply:

for (i=0; i<100; i++)
    for (j=0; j<100; j++) {
        A[i][j] = 0.0;
        for (k=0; k<100; k++)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }

Solution: non-unit strides
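For example, column j of the row-major matrix D above lies at a stride of 100 doubles (800 bytes); here is a scalar sketch of the access pattern that a single strided vector load would cover (the function name is illustrative):

void load_column(double D[100][100], int j, double col[100])
{
    /* successive k touch addresses 100*sizeof(double) bytes apart */
    for (int k = 0; k < 100; k++)
        col[k] = D[k][j];
}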


Accessing non-sequential memory locations and reshaping them into a dense structure is one of the major advantages of a vector architecture.
RV64V: VLDS (load vector with stride), VSTS (store vector with stride)

Compress/Expand Operations

. Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register
 – the population count of the mask vector gives the packed vector length
. Expand performs the inverse operation
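A minimal scalar model of compress (a sketch; the names and types are illustrative, not from any real vector ISA):

#include <stddef.h>

size_t vcompress(size_t vl, const double *src, const int *mask, double *dst)
{
    size_t k = 0;
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            dst[k++] = src[i];   /* masked elements packed contiguously */
    return k;                    /* population count of the mask */
}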

[Figure: compress gathers the elements whose mask bit is set (A[7], A[5], A[4], A[1]) to the front of the destination register; expand scatters them back to their original masked positions]

Used for density-time conditionals and also for general selection operations.

Vector Scatter-Gather

Problem: handling sparse matrices
Solution: gather/scatter operations

. Consider:

for (i = 0; i < n; i++)
    A[K[i]] = A[K[i]] + C[M[i]];

. Use index vectors:

vsetdcfg 4*FP64    # 4 64-bit FP vector registers
vld   v0, x7       # Load K[]
vldx  v1, x5, v0   # Load A[K[]] (gather)
vld   v2, x28      # Load M[]
vldx  v3, x6, v2   # Load C[M[]] (gather)
vadd  v1, v1, v3   # Add them
vstx  v1, x5, v0   # Store A[K[]] (scatter)
vdisable           # Disable vector registers

Automatic Code Vectorization

for (i=0; i < N; i++)
    C[i] = A[i] + B[i];

[Figure: the scalar sequential code performs load, load, add, store once per iteration; the vectorized code reorders these into one vector load per operand, one vector add, and one vector store covering all iterations]

Vectorization is a massive compile-time reordering of operation sequencing
 requires extensive loop-dependence analysis

Vector Reductions

Problem: loop-carried dependence on reduction variables

sum = 0;
for (i=0; i<N; i++)
    sum += A[i];   /* loop-carried dependence on sum */

Solution: re-associate the sum. Keep VL partial sums in a vector register, accumulate A into them with a strip-mined vector loop, and then combine the partial sums with a binary-tree reduction, halving VL each step while VL > 1. A sketch follows below.
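A minimal C sketch of that scheme, assuming a hypothetical vector length VL of 64; the names are illustrative:

#include <stddef.h>

#define VL 64   /* assumed vector length */

double vector_sum(const double *A, size_t n)
{
    double partial[VL] = {0.0};           /* VL independent partial sums */
    size_t i = 0;

    for (; i + VL <= n; i += VL)          /* vectorizable: no carried dep */
        for (int l = 0; l < VL; l++)
            partial[l] += A[i + l];
    for (; i < n; i++)                    /* scalar cleanup for the tail */
        partial[i % VL] += A[i];

    for (int w = VL / 2; w >= 1; w /= 2)  /* binary-tree combine */
        for (int l = 0; l < w; l++)
            partial[l] += partial[l + w];

    return partial[0];
}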

Vector Memory-Memory versus Vector Register Machines

. Vector memory-memory instructions hold all vector operands in main memory
. The first vector machines, the CDC Star-100 ('73) and TI ASC ('71), were memory-memory machines
. The Cray-1 ('76) was the first vector register machine

Vector Memory-Memory Code Example

Source code:
for (i=0; i<N; i++)
    C[i] = A[i] + B[i];

Memory-memory code:
vadd C, A, B

Packed SIMD Extensions

[Figure: one 64-bit register viewed as 1x64b, 2x32b, 4x16b, or 8x8b packed elements]

. Very short vectors added to existing ISAs for microprocessors
. Use existing 64-bit registers split into 2x32b or 4x16b or 8x8b
 – Lincoln Labs TX-2 from 1957 had a 36b datapath split into 2x18b or 4x9b
 – Newer designs have wider registers
   • 128b for PowerPC Altivec, Intel SSE2/3/4
   • 256b/512b for Intel AVX
. Single instruction operates on all elements within the register

[Figure: a 4x16b add: four pairs of 16-bit lanes added in parallel by one instruction]

AArch64 (Arm v8 Neon)

Neon is a brand name which refers to Arm's implementations of the Advanced SIMD architecture.

A 128-bit Neon vector can contain the following element sizes:
 Sixteen 8-bit elements (operand suffix .16B, where B indicates byte)
 Eight 16-bit elements (operand suffix .8H, where H indicates halfword)
 Four 32-bit elements (operand suffix .4S, where S indicates word)
 Two 64-bit elements (operand suffix .2D, where D indicates doubleword)

Instruction Examples for AArch64 (Arm v8 Neon)

ADD V0.8H, V1.8H, V2.8H
 performs a parallel addition of eight lanes of 16-bit (8 x 16 = 128) integer elements from the vectors in V1 and V2, storing the result in V0.

MUL V0.4S, V2.4S, V3.S[2]
 multiplies each of the four 32-bit elements in V2 by the 32-bit scalar value in lane 2 of V3, storing the result vector in V0.

Packed SIMD versus Vectors

. Limited instruction set:
 – no vector length control
 – no strided load/store or scatter/gather
 – unit-stride loads must be aligned to a 64/128-bit boundary
. Limited vector register length:
 – requires superscalar dispatch to keep the multiply/add/load units busy
 – loop unrolling to hide latencies increases register pressure
. Trend towards fuller vector support in microprocessors
 – Better support for misaligned memory accesses
 – Support for double-precision (64-bit floating-point) arithmetic
 – New Intel AVX spec (announced April 2008): 256b vector registers (expandable up to 1024b), gather added, scatter to follow
 – Arm Scalable Vector Extension (SVE, SVE2)

Advantages of Vector Processors

. Require lower instruction bandwidth
  Reduced by fewer fetches and decodes
. Easier addressing of main memory
  Load/store units access memory with known patterns
. Elimination of memory wastage
  Unlike cache access, every data element requested by the processor is actually used: no cache misses
  Latency only occurs once per vector during pipelined loading
. Simplification of control hazards
  Loop-related control hazards are eliminated
. Scalable platform
  Increase performance by using more hardware resources
. Reduced code size
  A short, single instruction can describe N operations

Summary: Performance Optimizations

. Increase memory bandwidth
  Memory banks are used to reduce load/store latency
  Allow multiple simultaneous outstanding memory requests
. Strip mining
  Generates code to handle vector operands whose size is less than or greater than the size of the vector registers
. Vector chaining
  The equivalent of data forwarding in vector processors
  Results of one pipeline are fed into the operand registers of another pipeline
. Scatter and gather
  Retrieve data elements scattered throughout memory and pack them into sequential vectors in vector registers
  Promotes data locality and reduces data pollution
. Multiple parallel lanes, or pipes
  Allow a vector operation to be performed in parallel on multiple elements of the vector

Next Lecture: Multithreading and Multicore (Thread-Level Parallelism)


Acknowledgements

. Some slides contain material developed and copyright by:
 – Arvind (MIT)
 – Krste Asanovic (MIT/UCB)
 – Joel Emer (Intel/MIT)
 – James Hoe (CMU)
 – David Patterson (UCB)
 – David Wentzlaff (Princeton University)

. MIT material derived from course 6.823
. UCB material derived from courses CS252 and CS 61C

Fundamentals of Armv8 Neon Technology


Neon for Arm SIMD

Arm Neon technology is an advanced Single Instruction Multiple Data (SIMD) architecture extension for the Arm Cortex-A and Cortex-R series processors.

. Neon intrinsics
. Neon-enabled libraries (Arm Compute Library)
. Auto-vectorization by your compiler (Arm Compiler, LLVM, GCC)
   armclang --target=aarch64-arm-none-eabi -mcpu=cortex-a53
   Auto-vectorization is enabled by default at optimization level -O2 and higher, or with -fvectorize
. Hand-coded Neon assembler

Neon Intrinsics Example

int64x2_t vaddl_high_s32 (int32x4_t a, int32x4_t b)

Description: Signed Add Long (vector). This instruction adds each vector element in the upper half of the first source SIMD&FP register to the corresponding vector element of the second source SIMD&FP register, places the results into a vector, and writes the vector to the destination SIMD&FP register. The destination vector elements are twice as long as the source vector elements. All values in this instruction are signed integer values.
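A small usage sketch of this intrinsic (assuming an AArch64 target with <arm_neon.h>; the data values are made up):

#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int32_t av[4] = {1, 2, 3, 4};
    int32_t bv[4] = {10, 20, 30, 40};

    int32x4_t a = vld1q_s32(av);
    int32x4_t b = vld1q_s32(bv);

    /* adds the upper two 32-bit lanes, widening each result to 64 bits:
       r = { 3 + 30, 4 + 40 } = { 33, 44 } */
    int64x2_t r = vaddl_high_s32(a, b);

    printf("%lld %lld\n",
           (long long)vgetq_lane_s64(r, 0),
           (long long)vgetq_lane_s64(r, 1));
    return 0;
}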


Arm Compute Library

The Arm Compute Library is a collection of low-level machine learning functions optimized for Cortex-A CPU and Mali GPU architectures. The library provides ML acceleration on Cortex-A CPUs through Neon or SVE, and acceleration on Mali GPUs through OpenCL. The library is extremely flexible and allows developers to use machine learning functions individually or as part of complex pipelines to accelerate their algorithms and applications. Enabled by Arm microarchitecture optimizations, the Arm Compute Library provides superior performance to other OSS alternatives and immediate support for new Arm technologies, for example SVE2. The Arm Compute Library is open-source software available under a permissive MIT license.

 Over 100 machine learning functions for CPU and GPU
 Multiple convolution algorithms (GEMM, Winograd, FFT, and Direct)

Example of Source Code for Compiler Optimization

You can make the following changes to assist the compiler's auto-vectorization, as in the sketch below:
 Compile at -O2 or higher, or with -fvectorize.
 Specify #pragma clang loop vectorize(enable) before the loop as a hint to the compiler.
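A small illustrative loop written with those hints in mind (the function is hypothetical, not from the original slides). The restrict qualifiers promise the compiler that the arrays do not alias, which is often what blocks vectorization in practice:

#include <stddef.h>

void scale_add(size_t n, const float *restrict a,
               const float *restrict b, float *restrict c)
{
    #pragma clang loop vectorize(enable)   /* hint: vectorize this loop */
    for (size_t i = 0; i < n; i++)
        c[i] = 2.0f * a[i] + b[i];
}

Compiled with, for example, armclang --target=aarch64-arm-none-eabi -mcpu=cortex-a53 -O2, the loop body maps naturally onto Neon lanes.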

