Computer Architecture: Vector Processing (SIMD/Vector/GPU): Exploiting Regular (Data) Parallelism

Prof. Onur Mutlu (edited by Seth), Carnegie Mellon University

Data Parallelism

• Concurrency arises from performing the same operations on different pieces of data
  • Single instruction multiple data (SIMD)
  • E.g., dot product of two vectors
• Contrast with data flow
  • Concurrency arises from executing different operations in parallel (in a data-driven manner)
• Contrast with thread ("control") parallelism
  • Concurrency arises from executing different threads of control in parallel
• SIMD exploits instruction-level parallelism
  • Multiple instructions are concurrent; the instructions happen to be the same

SIMD Processing

• Single instruction operates on multiple data elements
  • In time or in space
• Multiple processing elements
• Time-space duality
  • Array processor: instruction operates on multiple data elements at the same time
  • Vector processor: instruction operates on multiple data elements in consecutive time steps

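For instance, the dot product mentioned under Data Parallelism is data parallel: every iteration applies the same multiply-add to a different pair of elements. A minimal C sketch (function and array names are illustrative, not from the slides):

    /* Dot product: the same operation is applied to different data elements,
       so the per-element work can be done by SIMD hardware in space (array
       processor) or in time (vector processor). */
    double dot(const double *a, const double *b, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];   /* same operation, different pieces of data */
        return sum;
    }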
Array vs. Vector Processors

• Instruction stream (executed by both):

      LD  VR ← A[3:0]
      ADD VR ← VR, 1
      MUL VR ← VR, 2
      ST  A[3:0] ← VR

• ARRAY PROCESSOR: same op @ same time, different ops @ same space (time runs down, space runs across)

      LD0  LD1  LD2  LD3
      AD0  AD1  AD2  AD3
      MU0  MU1  MU2  MU3
      ST0  ST1  ST2  ST3

• VECTOR PROCESSOR: different ops @ same time, same op @ same space (each element streams through the operations in consecutive cycles)

      LD0
      LD1  AD0
      LD2  AD1  MU0
      LD3  AD2  MU1  ST0
           AD3  MU2  ST1
                MU3  ST2
                     ST3

SIMD Array Processing vs. VLIW

• VLIW: multiple different, independent operations packed into one (very long) instruction word and executed in parallel across functional units
• Array processor: a single operation executed on multiple different data elements

Vector Processors

• A vector is a one-dimensional array of numbers
• Many scientific/commercial programs use vectors

      for (i = 0; i <= 49; i++)
          C[i] = (A[i] + B[i]) / 2;

• A vector processor is one whose instructions operate on vectors rather than scalar (single data) values
• Basic requirements
  • Need to load/store vectors → vector registers (contain vectors)
  • Need to operate on vectors of different lengths → vector length register (VLEN)
  • Elements of a vector might be stored apart from each other in memory → vector stride register (VSTR)
    • Stride: distance between two elements of a vector (see the short sketch after this slide)

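In software terms, VLEN and VSTR describe a vector load that pulls VLEN elements, spaced VSTR apart in memory, into a dense vector register. A minimal C sketch of that behavior (function and variable names are illustrative, not an actual ISA definition):

    #include <stddef.h>

    /* Emulate a vector load (VLD): copy vlen elements, spaced 'stride'
       apart in memory, into a dense vector register vreg[]. */
    void vld(double *vreg, const double *mem, size_t vlen, size_t stride) {
        for (size_t i = 0; i < vlen; i++)
            vreg[i] = mem[i * stride];   /* stride = distance between vector elements */
    }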
Vector Processors (II)

• A vector instruction performs an operation on each element in consecutive cycles
  • Vector functional units are pipelined
  • Each stage operates on a different data element
• Vector instructions allow deeper pipelines
  • No intra-vector dependencies → no hardware interlocking needed within a vector
  • No control flow within a vector
  • Known stride allows prefetching of vectors into cache/memory

Vector Processor Advantages

+ No dependencies within a vector
  • Pipelining and parallelization work well
  • Can have very deep pipelines, no dependencies!
+ Each instruction generates a lot of work
  • Reduces instruction fetch bandwidth
+ Highly regular memory access pattern
  • Interleaving multiple banks for higher memory bandwidth
  • Prefetching
+ No need to explicitly code loops
  • Fewer branches in the instruction sequence

Vector Processor Disadvantages

-- Works (only) if parallelism is regular (data/SIMD parallelism)
   ++ Vector operations
   -- Very inefficient if parallelism is irregular
   -- How about searching for a key in a linked list?

   Fisher, "Very Long Instruction Word architectures and the ELI-512," ISCA 1983.

Vector Processor Limitations

-- Memory (bandwidth) can easily become a bottleneck, especially if
   1. compute/memory operation balance is not maintained
   2. data is not mapped appropriately to memory banks

Vector Registers

• Each vector data register holds N M-bit values
  • (Figure: register V0 holds elements V0,0 through V0,N-1; each element is M bits wide.)
• Vector control registers: VLEN, VSTR, VMASK
• Vector Mask Register (VMASK)
  • Indicates which elements of the vector to operate on
  • Set by vector test instructions
    • e.g., VMASK[i] = (Vk[i] == 0)
• Maximum VLEN can be N
  • Maximum number of elements stored in a vector register

Vector Functional Units

• Use a deep pipeline (=> fast clock) to execute element operations
• Simplifies control of the deep pipeline because elements in a vector are independent
• Example: a six-stage multiply pipeline computing V3 <- V1 * V2

Slide credit: Krste Asanovic

Vector Machine Organization (CRAY-1)

• CRAY-1
  • Russell, "The CRAY-1 computer system," CACM 1978.
• Scalar and vector modes
• 8 64-element vector registers
• 64 bits per element
• 16 memory banks
• 8 64-bit scalar registers
• 8 24-bit address registers

Memory Banking

• Example: 16 banks; can start one bank access per cycle
• Bank latency: 11 cycles
• Can sustain 16 parallel accesses if they go to different banks
• (Figure: banks 0 through 15, each with its own MDR/MAR, connected to the CPU by shared data and address buses.)

Slide credit: Derek Chiou

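A small sketch of the word-interleaved banking described above, using the example values from the slide (16 banks, 11-cycle bank latency); the mapping function is illustrative, not a description of any specific machine:

    #define NUM_BANKS    16
    #define BANK_LATENCY 11   /* cycles, from the example above */

    /* Word interleaving: consecutive word addresses map to consecutive banks. */
    int bank_of(unsigned word_addr) { return (int)(word_addr % NUM_BANKS); }

    /* A stream of sequential word addresses touches a different bank each
       cycle; since NUM_BANKS > BANK_LATENCY, a bank is free again by the time
       the stream wraps back around to it, so a new access can start every cycle. */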
Vector Memory System

• (Figure: base and stride registers feed an address generator, which spreads the accesses of a vector load/store across memory banks 0-F that fill the vector registers.)

Slide credit: Krste Asanovic

Scalar Code Example

• For i = 0 to 49
  • C[i] = (A[i] + B[i]) / 2
• Scalar code: 304 dynamic instructions

        MOVI R0 = 50              1
        MOVA R1 = A               1
        MOVA R2 = B               1
        MOVA R3 = C               1
    X:  LD   R4 = MEM[R1++]       11    ; autoincrement addressing
        LD   R5 = MEM[R2++]       11
        ADD  R6 = R4 + R5         4
        SHFR R7 = R6 >> 1         1
        ST   MEM[R3++] = R7       11
        DECBNZ --R0, X            2     ; decrement and branch if NZ

Scalar Code Execution Time

• Scalar execution time on an in-order processor with 1 bank
  • First two loads in the loop cannot be pipelined: 2*11 cycles
  • 4 + 50*40 = 2004 cycles
• Scalar execution time on an in-order processor with 16 banks (word-interleaved)
  • First two loads in the loop can be pipelined
  • 4 + 50*30 = 1504 cycles
• Why 16 banks?
  • 11-cycle memory access latency
  • Having 16 (> 11) banks ensures there are enough banks to overlap enough memory operations to cover the memory latency

Vectorizable Loops

• A loop is vectorizable if each iteration is independent of any other
• For i = 0 to 49
  • C[i] = (A[i] + B[i]) / 2
• 7 dynamic instructions
• Vectorized loop:

    MOVI  VLEN = 50          1
    MOVI  VSTR = 1           1
    VLD   V0 = A             11 + VLEN - 1
    VLD   V1 = B             11 + VLEN - 1
    VADD  V2 = V0 + V1       4 + VLEN - 1
    VSHFR V3 = V2 >> 1       1 + VLEN - 1
    VST   C = V3             11 + VLEN - 1

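As a quick sanity check on the counts quoted above, the sketch below recomputes the dynamic instruction count and the two scalar cycle counts from the per-instruction latencies in the scalar code; the 16-bank line uses one plausible accounting of the load overlap (only the second load's issue cycle remains exposed), so treat it as illustrative rather than definitive.

    #include <stdio.h>

    int main(void) {
        /* Scalar loop: 4 setup instructions + 6 instructions per iteration, 50 iterations. */
        int dynamic_insts = 4 + 50 * 6;                 /* = 304 */

        /* 1 bank: the two 11-cycle loads serialize. */
        int iter_1_bank   = 11 + 11 + 4 + 1 + 11 + 2;   /* = 40 cycles */
        /* 16 banks: the second load overlaps the first, leaving only its issue cycle. */
        int iter_16_banks = 11 + 1 + 4 + 1 + 11 + 2;    /* = 30 cycles */

        printf("dynamic instructions: %d\n", dynamic_insts);        /* 304  */
        printf("1 bank:   %d cycles\n", 4 + 50 * iter_1_bank);      /* 2004 */
        printf("16 banks: %d cycles\n", 4 + 50 * iter_16_banks);    /* 1504 */
        return 0;
    }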
Vector Code Performance

• No chaining
  • i.e., the output of a vector functional unit cannot be used as the input of another (no vector data forwarding)
• One memory port (one address generator)
• 16 memory banks (word-interleaved)
• 285 cycles

Vector Chaining

• Vector chaining: data forwarding from one vector functional unit to another

    LV   V1
    MULV V3, V1, V2
    ADDV V5, V3, V4

• (Figure: the load unit chains into the multiplier, which chains into the adder, using vector registers V1 through V5.)

Slide credit: Krste Asanovic

Vector Code Performance - Chaining

• Vector chaining: data forwarding from one vector functional unit to another
• Strict assumption: each memory bank has a single port (memory bandwidth bottleneck)
  • These two VLDs cannot be pipelined. WHY?
  • VLD and VST cannot be pipelined. WHY?
• 182 cycles

Vector Code Performance - Multiple Memory Ports

• Chaining and 2 load ports, 1 store port in each bank
• 79 cycles

Questions (I)

• What if # data elements > # elements in a vector register?
  • Need to break loops so that each iteration operates on # elements in a vector register
• What if the vector data is not stored with a regular stride in memory (indirect/irregular accesses)?
  • Use indirection to combine elements into vector registers
  • Called scatter/gather operations

Gather/Scatter Operations

• Want to vectorize loops with indirect accesses:

      for (i = 0; i < N; i++)
          A[i] = B[i] + C[D[i]];

• Indexed load instruction (gather):

      LV    vD, rD        # Load indices in D vector
      LVI   vC, rC, vD    # Load indirect from rC base
      LV    vB, rB        # Load B vector
      ADDV  vA, vB, vC    # Do add
      SV    vA, rA        # Store result

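In software terms, a gather pulls elements selected by an index vector into a dense register, and a scatter writes a dense register back to selected locations. A minimal plain-C model of those two operations (function names and element types are illustrative, not from any real ISA):

    #include <stddef.h>

    /* "Gather": vC[i] = base_C[vD[i]] for i = 0..len-1 */
    void gather(double *vC, const double *base_C, const int *vD, size_t len) {
        for (size_t i = 0; i < len; i++)
            vC[i] = base_C[vD[i]];
    }

    /* "Scatter": base_A[vD[i]] = vA[i] for i = 0..len-1 */
    void scatter(double *base_A, const int *vD, const double *vA, size_t len) {
        for (size_t i = 0; i < len; i++)
            base_A[vD[i]] = vA[i];
    }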

Gather/Scatter Operations (II)

• Gather/scatter operations are often implemented in hardware to handle sparse matrices
• Vector loads and stores use an index vector, which is added to the base register to generate the addresses
• Example:

      Index Vector    Data Vector    Equivalent
      1               3.14           3.14
      3               6.5            0.0
      7               71.2           6.5
      8               2.71           0.0
                                     0.0
                                     0.0
                                     0.0
                                     71.2
                                     2.71

Conditional Operations in a Loop

• What if some operations should not be executed on a vector (based on a dynamically determined condition)?

      loop:  if (a[i] != 0) then b[i] = a[i] * b[i]
             goto loop

• Idea: masked operations
  • The VMASK register is a bit mask determining which data elements should not be acted upon

      VLD   V0 = A
      VLD   V1 = B
      VMASK = (V0 != 0)
      VMUL  V1 = V0 * V1
      VST   B = V1

  • Does this look familiar? This is essentially predicated execution.

Another Example with Masking

      for (i = 0; i < 64; ++i)
          if (a[i] >= b[i]) then c[i] = a[i]
          else c[i] = b[i]

• Steps to execute the loop:
  1. Compare A, B to get VMASK
  2. Masked store of A into C
  3. Complement VMASK
  4. Masked store of B into C

      A     B     VMASK
      1     2     0
      2     2     1
      3     2     1
      4     10    0
     -5    -4     0
      0    -3     1
      6     5     1
     -7    -8     1

Masked Vector Instructions

• Simple implementation: execute all N operations, turn off result writeback according to the mask
• Density-time implementation: scan the mask vector and only execute elements with non-zero masks
• (Figure: the simple implementation feeds every element pair A[i], B[i] through the pipeline and uses the mask bits as a write enable on the write data port; the density-time implementation skips masked-off elements entirely.)

Slide credit: Krste Asanovic

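The four masking steps above can be modeled in plain C as follows; the boolean array stands in for VMASK, and the names and vector length are illustrative:

    #include <stdbool.h>
    #include <stddef.h>

    #define N 64

    /* c[i] = (a[i] >= b[i]) ? a[i] : b[i], expressed as masked operations. */
    void masked_select(const int a[N], const int b[N], int c[N]) {
        bool vmask[N];

        for (size_t i = 0; i < N; i++)     /* 1. compare A, B to get VMASK */
            vmask[i] = (a[i] >= b[i]);

        for (size_t i = 0; i < N; i++)     /* 2. masked store of A into C */
            if (vmask[i]) c[i] = a[i];

        for (size_t i = 0; i < N; i++)     /* 3 + 4. complement VMASK, masked store of B */
            if (!vmask[i]) c[i] = b[i];
    }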
Some Issues

• Stride and banking
  • As long as the stride and the number of banks are relatively prime to each other and there are enough banks to cover the bank access latency, consecutive accesses proceed in parallel
• Storage of a matrix
  • Row major: consecutive elements in a row are laid out consecutively in memory
  • Column major: consecutive elements in a column are laid out consecutively in memory
  • You need to change the stride when accessing a row versus a column

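To make the last point concrete, here is a small C sketch of a row-major matrix: walking a row uses stride 1, while walking a column uses a stride equal to the number of columns. The dimensions and function names are illustrative.

    #define ROWS 4
    #define COLS 5

    /* 'base' points to a ROWS x COLS matrix stored in row-major order,
       so element (r, c) lives at base[r*COLS + c]. */

    double sum_row(const double *base, int r) {     /* stride 1 between elements */
        double s = 0.0;
        for (int i = 0; i < COLS; i++)
            s += base[r * COLS + i];
        return s;
    }

    double sum_col(const double *base, int c) {     /* stride COLS between elements */
        double s = 0.0;
        for (int i = 0; i < ROWS; i++)
            s += base[i * COLS + c];
        return s;
    }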
Array vs. Vector Processors, Revisited

• The array vs. vector processor distinction is a "purist's" distinction
• Most "modern" SIMD processors are a combination of both
  • They exploit data parallelism in both time and space

Remember: Array vs. Vector Processors

• Same instruction stream and execution diagrams as shown earlier: the array processor performs the same op at the same time across space, while the vector processor performs the same op in the same space across consecutive time steps.


Vector Instruction Execution

    ADDV C, A, B

• Execution using one pipelined functional unit: one element pair enters the pipeline per cycle, and results C[0], C[1], C[2], ... complete one per cycle
• Execution using four pipelined functional units: four element pairs enter per cycle; unit 0 handles elements 0, 4, 8, ..., unit 1 handles 1, 5, 9, ..., unit 2 handles 2, 6, 10, ..., unit 3 handles 3, 7, 11, ...

Slide credit: Krste Asanovic

Vector Unit Structure

• (Figure: each lane contains a slice of the vector registers plus one pipeline of each functional unit; lane 0 holds elements 0, 4, 8, ..., lane 1 holds elements 1, 5, 9, ..., and so on, and the lanes connect to the memory subsystem.)

Slide credit: Krste Asanovic

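The element-to-lane assignment implied by the figure can be written down directly; a tiny C sketch, assuming the interleaved mapping described above (the lane count is illustrative):

    #define NUM_LANES 4

    /* Interleaved mapping: lane 0 holds elements 0, 4, 8, ...,
       lane 1 holds 1, 5, 9, ..., etc. */
    int lane_of(int element)          { return element % NUM_LANES; }
    int slot_within_lane(int element) { return element / NUM_LANES; }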
Vector Instruction Level Parallelism

• Can overlap execution of multiple vector instructions
  • Example machine has 32 elements per vector register and 8 lanes
  • Complete 24 operations/cycle while issuing 1 short vector instruction/cycle
• (Figure: the load unit, multiply unit, and add unit each work on a different vector instruction in the same cycle.)

Slide credit: Krste Asanovic

Automatic Code Vectorization

    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];

• Scalar sequential code: each iteration issues its own load, load, add, and store in program order
• Vectorized code: one vector load each for A and B, one vector add, and one vector store cover all iterations
• Vectorization is a compile-time reordering of operation sequencing → requires extensive loop dependence analysis

Slide credit: Krste Asanovic

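To illustrate why that dependence analysis matters, here is a hedged pair of loops: the first has no cross-iteration dependences and can be vectorized as above, while the second carries a dependence from iteration i-1 to i and cannot be vectorized directly. Function names are illustrative.

    /* Vectorizable: every iteration is independent of the others. */
    void add_vectors(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* Not directly vectorizable: c[i] depends on c[i-1] computed in the
       previous iteration (a loop-carried dependence). */
    void prefix_sum(const float *a, float *c, int n) {
        c[0] = a[0];
        for (int i = 1; i < n; i++)
            c[i] = c[i-1] + a[i];
    }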
Vector/SIMD Processing Summary

• Vector/SIMD machines are good at exploiting regular data-level parallelism
  • Same operation performed on many data elements
  • Improve performance, simplify design (no intra-vector dependencies)
• Performance improvement is limited by the vectorizability of the code
  • Scalar operations limit vector machine performance
  • Amdahl's Law
  • The CRAY-1 was the fastest SCALAR machine of its time!
• Many existing ISAs include (vector-like) SIMD operations
  • MMX/SSEn/AVX, PowerPC AltiVec, ARM Advanced SIMD

SIMD Operations in Modern ISAs

Intel Pentium MMX Operations

• Idea: one instruction operates on multiple data elements simultaneously
  • A la array processing (yet much more limited)
  • Designed with multimedia (graphics) operations in mind
• No VLEN register
• Opcode determines the data type:
  • 8 8-bit bytes
  • 4 16-bit words
  • 2 32-bit doublewords
  • 1 64-bit quadword
• Stride is always equal to 1.

Peleg and Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro, 1996.

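A plain-C model of what one such packed operation computes: PADDB is the actual MMX opcode for a packed, wraparound 8-bit add, but the union type and function below are only an illustrative sketch of its element-by-element behavior, not Intel's implementation.

    #include <stdint.h>

    /* The operand layout MMX works on: one 64-bit quadword interpreted as
       8 bytes, 4 words, 2 doublewords, or 1 quadword, selected by the opcode. */
    typedef union {
        uint64_t q;        /* 1 x 64-bit quadword    */
        uint32_t d[2];     /* 2 x 32-bit doublewords */
        uint16_t w[4];     /* 4 x 16-bit words       */
        uint8_t  b[8];     /* 8 x 8-bit bytes        */
    } mmx_reg_t;

    /* What a packed byte add (PADDB) computes, element by element. */
    mmx_reg_t paddb(mmx_reg_t x, mmx_reg_t y) {
        mmx_reg_t r;
        for (int i = 0; i < 8; i++)
            r.b[i] = (uint8_t)(x.b[i] + y.b[i]);   /* wraps modulo 256; no carries cross elements */
        return r;
    }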
MMX Example: Image Overlaying (I)

• (Figure: an image-overlay example built from MMX packed operations; from Peleg and Weiser, IEEE Micro, 1996.)

MMX Example: Image Overlaying (II)

Graphics Processing Units: SIMD not Exposed to the Programmer (SIMT)

High-Level View of a GPU

• (Figure.)

Concept of "Thread Warps" and SIMT

• Warp: a set of threads that execute the same instruction (on different data elements) → SIMT (NVIDIA-speak)
• All threads run the same kernel
• Warp: the threads that run lengthwise in a woven fabric ...
• (Figure: thread warps 3, 7, 8, ... share a SIMD pipeline; the scalar threads W, X, Y, Z within a warp share a common PC.)

Loop Iterations as Threads

    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];

• (Figure: in the scalar sequential code, iteration 1 and iteration 2 each execute load, load, add, store in sequence; in the vectorized code, the corresponding loads, adds, and stores of all iterations are grouped into vector instructions, so each iteration can equally be viewed as a separate thread.)

Slide credit: Krste Asanovic

SIMT Memory Access

• Same instruction in different threads uses the thread id to index and access different data elements
• Let's assume N = 16, blockDim = 4 → 4 blocks
• (Figure: thread ids 0-15 are split into 4 blocks; each thread adds one pair of elements.)

Slide credit: Hyesoon Kim

Sample GPU SIMT Code (Simplified)

• CPU code:

      for (ii = 0; ii < 100; ++ii) {
          C[ii] = A[ii] + B[ii];
      }

• CUDA code:

      // there are 100 threads
      __global__ void KernelFunction(…) {
          int tid = blockDim.x * blockIdx.x + threadIdx.x;
          int varA = aa[tid];
          int varB = bb[tid];
          C[tid] = varA + varB;
      }

Slide credit: Hyesoon Kim

Sample GPU Program (Less Simplified)

• (Figure: a fuller version of the same program, including the host-side setup and the kernel launch.)

Slide credit: Hyesoon Kim

Latency Hiding with "Thread Warps"

• Warp: a set of threads that execute the same instruction (on different data elements)
• Fine-grained multithreading
  • One instruction per thread in the pipeline at a time (no branch prediction)
  • Interleave warp execution to hide latencies
• Register values of all threads stay in the register file
• No OS context switching
• Memory latency hiding
  • Graphics has millions of pixels
• (Figure: warps available for scheduling are fetched, decoded, read the register file, and execute on the SIMD pipeline; warps that miss in the D-cache wait among the warps accessing memory, while warps that hit write back.)

Slide credit: Tor Aamodt

Warp-based SIMD vs. Traditional SIMD

• Traditional SIMD contains a single thread
  • Lock step
  • Programming model is SIMD (no threads) → SW needs to know the vector length
  • ISA contains vector/SIMD instructions
• Warp-based SIMD consists of multiple scalar threads executing in a SIMD manner (i.e., the same instruction is executed by all threads)
  • Does not have to be lock step
  • Each thread can be treated individually (i.e., placed in a different warp) → programming model is not SIMD
    • SW does not need to know the vector length
    • Enables memory and branch latency tolerance
  • ISA is scalar → vector instructions are formed dynamically
  • Essentially, it is the SPMD programming model implemented on SIMD hardware

SPMD

• Single procedure/program, multiple data
  • This is a programming model rather than a computer organization
• Each processing element executes the same procedure, except on different data elements
  • Procedures can synchronize at certain points in the program, e.g., barriers
• Essentially, multiple instruction streams execute the same program
  • Each program/procedure can 1) execute a different control-flow path, 2) work on different data, at run-time
  • Many scientific applications are programmed this way and run on MIMD machines (multiprocessors)
  • Modern GPUs are programmed in a similar way on a SIMD computer

Branch Divergence Problem in Warp-based SIMD

• SPMD execution on SIMD hardware
  • NVIDIA calls this "Single Instruction, Multiple Thread" ("SIMT") execution
• (Figure: a warp of four threads with a common PC executes a program with basic blocks A through G; different threads follow different control-flow paths through it.)

Slide credit: Tor Aamodt

Control Flow Problem in GPUs/SIMD

• GPU uses a SIMD pipeline to save area on control logic
  • Group scalar threads into warps
• Branch divergence occurs when threads inside a warp branch to different execution paths
• (Figure: after a branch, some threads of the warp follow path A while the others follow path B.)

Slide credit: Tor Aamodt

Branch Divergence Handling (I)

• (Figure: a reconvergence stack tracks, for each divergent branch, the reconvergence PC, the next PC, and the active mask; blocks A and E execute with mask 1111, block C with 1001, block D with 0110, and block G with 1111.)

Slide credit: Tor Aamodt

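A minimal software model of how a warp might serialize a divergent branch, assuming a 4-thread warp: both paths run one after the other, each under the appropriate active mask, and the warp reconverges afterwards. The function, the warp size, and the specific operation are illustrative, not a description of any particular GPU.

    #define WARP_SIZE 4

    /* Warp-wide execution of: if (a[t] != 0) b[t] = a[t] * b[t]; else b[t] = 0; */
    void warp_branch(int a[WARP_SIZE], int b[WARP_SIZE]) {
        unsigned mask = 0;

        /* All threads evaluate the branch condition; build the active mask. */
        for (int t = 0; t < WARP_SIZE; t++)
            if (a[t] != 0) mask |= 1u << t;

        /* Taken path: only threads whose mask bit is set produce results. */
        for (int t = 0; t < WARP_SIZE; t++)
            if (mask & (1u << t)) b[t] = a[t] * b[t];

        /* Not-taken path: runs under the complemented mask. */
        for (int t = 0; t < WARP_SIZE; t++)
            if (!(mask & (1u << t))) b[t] = 0;

        /* Reconvergence: all threads are active again after both paths complete. */
    }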
Dynamic Warp Formation/Merging

• Idea: dynamically merge threads executing the same instruction (after branch divergence)
• Form a new warp at divergence
  • Enough threads branching to each path can create full new warps
• (Figure: after a branch, threads headed down path A and threads headed down path B are regrouped into new warps.)
• Fung et al., "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," MICRO 2007.

Dynamic Warp Formation Example

• (Figure: warps x and y both execute basic blocks A through G. In the baseline, the divergent blocks C, D, and F each issue twice, once per warp with a partial active mask; with dynamic warp formation, a new warp is created from the scalar threads of both warp x and warp y executing at basic block D, so the divergent blocks occupy fewer issue slots.)

Slide credit: Tor Aamodt

What About Memory Divergence?

• Modern GPUs have caches
• Ideally: want all threads in the warp to hit (without conflicting with each other)
• Problem: one thread in a warp can stall the entire warp if it misses in the cache
• Need techniques to
  • Tolerate memory divergence
  • Integrate solutions to branch and memory divergence

NVIDIA GeForce GTX 285

• NVIDIA-speak:
  • 240 stream processors
  • "SIMT execution"
• Generic speak:
  • 30 cores
  • 8 SIMD functional units per core

Slide credit: Kayvon Fatahalian

NVIDIA GeForce GTX 285 "core"

• 64 KB of storage for fragment contexts (registers)
• (Figure legend: SIMD functional unit with control shared across 8 units; multiply-add; multiply; instruction stream decode; execution context storage)

Slide credit: Kayvon Fatahalian

NVIDIA GeForce GTX 285 "core" (continued)

• 64 KB of storage for thread contexts (registers)
• Groups of 32 threads share an instruction stream (each group is a warp)
• Up to 32 warps are simultaneously interleaved
• Up to 1024 thread contexts can be stored

Slide credit: Kayvon Fatahalian

NVIDIA GeForce GTX 285

• There are 30 of these cores on the GTX 285: 30 cores x 32 warps x 32 threads/warp = 30,720 threads
• (Figure: the full chip, with the 30 cores and their texture (Tex) units.)

Slide credit: Kayvon Fatahalian