
CS252 Graduate Architecture
Lecture 20: Vector Processing => Multimedia
David E. Culler (many slides due to Christoforos E. Kozyrakis)
4/9/02

Vector Processors
• Initially developed for supercomputing applications, today important for multimedia
• Vector processors have high-level operations that work on linear arrays of numbers: "vectors"

       SCALAR                      VECTOR
    (1 operation)               (N operations)

     r1    r2                    v1    v2
       \  /                        \  /    vector
        +                           +      length
        r3                          v3

    add r3, r1, r2              vadd.vv v3, v1, v2

Properties of Vector Processors
• Single vector instruction implies lots of work (loop)
– Fewer instruction fetches
• Each result independent of previous result
– Multiple operations can be executed in parallel
– Simpler design, high clock rate
– Compiler (programmer) ensures no dependencies
• Reduces branches and branch problems in pipelines
• Vector instructions access memory with known pattern
– Effective prefetching
– Amortize memory latency over large number of elements
– Can exploit a high-bandwidth memory system
– No (data) caches required!

Styles of Vector Architectures
• Memory-memory vector processors
– All vector operations are memory to memory
• Vector-register processors
– All vector operations between vector registers (except vector load and store)
– Vector equivalent of load-store architectures
– Includes all vector machines since late 1980s
– We assume vector-register for rest of the lecture


Historical Perspective
• Mid-60s: fear that performance stagnates
• SIMD processor arrays actively developed during late 60's to mid 70's
– bit-parallel machines for image processing: PEPE, STARAN, MPP
– word-parallel for scientific: Illiac IV
• Cray develops fast scalar: CDC 6600, 7600
• CDC bets on vectors with Star-100
• Amdahl argues against vector

Cray-1 Breakthrough
• Fast, simple scalar processor
– 80 MHz!
– single-phase, latches
• Exquisite electrical and mechanical design
• Vector register concept
– vast simplification of instruction set
– reduced necessary memory bandwidth
• Tight integration of vector and scalar
• Piggy-back off 7600 stacklib
• Later vectorizing compilers developed
• Owned high-performance computing for a decade
– what happened then? VLIW competition


Components of a Vector Processor
• Scalar CPU: registers, datapaths, instruction fetch logic
• Vector register
– Fixed-length memory bank holding a single vector
– Typically 8-32 vector registers, each holding 1 to 8 Kbits
– Has at least 2 read and 1 write ports
– MM: can be viewed as array of 64b, 32b, 16b, or 8b elements
• Vector functional units (FUs)
– Fully pipelined, start new operation every clock
– Typically 2 to 8 FUs: integer and FP
– Multiple datapaths (pipelines) used for each unit to process multiple elements per cycle
• Vector load-store units (LSUs)
– Fully pipelined unit to load or store a vector
– Multiple elements fetched/stored per cycle
– May have multiple LSUs
• Cross-bar to connect FUs, LSUs, registers

Cray-1 Block Diagram
• Simple 16-bit RR instr
• 32-bit with immed
• Natural combinations of scalar and vector
• Scalar bit-vectors match vector length
• Gather/scatter M-R
• Cond. merge


Basic Vector Instructions

  Instr.   Operands   Operation             Comment
  VADD.VV  V1,V2,V3   V1=V2+V3              vector + vector
  VADD.SV  V1,R0,V2   V1=R0+V2              scalar + vector
  VMUL.VV  V1,V2,V3   V1=V2xV3              vector x vector
  VMUL.SV  V1,R0,V2   V1=R0xV2              scalar x vector
  VLD      V1,R1      V1=M[R1..R1+63]       load, stride=1
  VLDS     V1,R1,R2   V1=M[R1..R1+63*R2]    load, stride=R2
  VLDX     V1,R1,V2   V1=M[R1+V2i,i=0..63]  indexed ("gather")
  VST      V1,R1      M[R1..R1+63]=V1       store, stride=1
  VSTS     V1,R1,R2   M[R1..R1+63*R2]=V1    store, stride=R2
  VSTX     V1,R1,V2   M[R1+V2i,i=0..63]=V1  indexed ("scatter")

+ all the regular scalar instructions (RISC style)…

Vector Memory Operations
• Load/store operations move groups of data between registers and memory
• Three types of addressing
– Unit stride: fastest
– Non-unit (constant) stride
– Indexed (gather-scatter)
• Vector equivalent of register indirect
• Good for sparse arrays of data
• Increases number of programs that vectorize
• compress/expand variant also
• Support for various combinations of data widths in memory
– {.L, .W, .H, .B} x {64b, 32b, 16b, 8b}
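The indexed forms are easy to pin down in plain C. A minimal sketch of the gather (VLDX) and scatter (VSTX) semantics, with the vector length passed explicitly; the function names and signatures are illustrative, not from any real ISA:

```c
#include <assert.h>
#include <stddef.h>

/* Gather: V1[i] = M[R1 + V2[i]] for i = 0..vl-1 */
void vldx(double *v1, const double *mem, const size_t *idx, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        v1[i] = mem[idx[i]];
}

/* Scatter: M[R1 + V2[i]] = V1[i] for i = 0..vl-1 */
void vstx(const double *v1, double *mem, const size_t *idx, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        mem[idx[i]] = v1[i];
}
```

This is exactly the access pattern that makes sparse arrays vectorizable: the index vector carries the sparsity structure.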


Vector Code Example: Y[0:63] = Y[0:63] + a*X[0:63]

64 element SAXPY: scalar
        LD    R0,a
        ADDI  R4,Rx,#512
  loop: LD    R2,0(Rx)
        MULTD R2,R0,R2
        LD    R4,0(Ry)
        ADDD  R4,R2,R4
        SD    R4,0(Ry)
        ADDI  Rx,Rx,#8
        ADDI  Ry,Ry,#8
        SUB   R20,R4,Rx
        BNZ   R20,loop

64 element SAXPY: vector
        LD      R0,a      #load scalar a
        VLD     V1,Rx     #load vector X
        VMUL.SV V2,R0,V1  #vector mult
        VLD     V3,Ry     #load vector Y
        VADD.VV V4,V2,V3  #vector add
        VST     Ry,V4     #store vector Y

Vector Length
• A vector register can hold some maximum number of elements for each data width (maximum vector length, or MVL)
• What to do when the application vector length is not exactly MVL?
• Vector-length (VL) register controls the length of any vector operation, including a vector load or store
– E.g. vadd.vv with VL=10 is: for (I=0; I<10; I++) V1[I]=V2[I]+V3[I]
• VL can be anything from 0 to MVL
• How do you code an application where the vector length is not known until run-time?
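The vector sequence above computes, element by element, exactly this C loop; here `vl` stands in for the VL register (the helper name is ours, not the slide's):

```c
#include <assert.h>
#include <stddef.h>

/* SAXPY with an explicit vector length: one VMUL.SV + VADD.VV + VST,
   applied to elements 0..vl-1. */
void saxpy_vl(size_t vl, double a, const double *x, double *y) {
    for (size_t i = 0; i < vl; i++)
        y[i] = a * x[i] + y[i];
}
```

With vl < MVL, the same instruction sequence handles a short leftover vector with no code change — which is the point of the VL register.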


Strip Mining
• Suppose application vector length > MVL
• Strip mining
– Generation of a loop that handles MVL elements per iteration
– A set of operations on MVL elements is translated to a single vector instruction
• Example: vector saxpy of N elements
– First loop handles (N mod MVL) elements, the rest handle MVL

  VL = (N mod MVL);          // set VL = N mod MVL
  for (I=0; I<VL; I++)       // 1st loop: a single set of
    Y[I] = A*X[I] + Y[I];    //   vector instructions
  low = (N mod MVL);
  VL = MVL;                  // set VL to MVL
  for (I=low; I<N; I+=MVL)   // 2nd loop: N/MVL sets of
    for (J=I; J<I+MVL; J++)  //   vector instructions
      Y[J] = A*X[J] + Y[J];

Optimization 1: Chaining
• Suppose:
  vmul.vv V1,V2,V3
  vadd.vv V4,V1,V5   # RAW hazard
• Chaining
– Vector register (V1) is not treated as a single entity but as a group of individual registers
– Pipeline forwarding can work on individual vector elements
• Flexible chaining: allow a vector instruction to chain to any other active vector operation => more read/write ports

(figure: timing diagram; unchained, the vadd cannot start until the entire vmul completes; chained, the vadd starts as soon as the first vmul element is available)

• Cray X-MP introduces memory chaining
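The strip-mining pattern can be checked in plain C. This sketch assumes MVL = 64 and folds the two loops into one while loop whose first trip handles the short N mod MVL strip:

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* maximum vector length assumed for this sketch */

/* Strip-mined SAXPY for arbitrary n: the first strip handles
   n mod MVL elements, every later strip handles exactly MVL. */
void saxpy_stripmined(size_t n, double a, const double *x, double *y) {
    size_t vl = n % MVL;   /* VL = N mod MVL */
    size_t low = 0;
    while (low < n) {
        for (size_t i = low; i < low + vl; i++)  /* one set of vector instrs */
            y[i] = a * x[i] + y[i];
        low += vl;
        vl = MVL;          /* all remaining strips are full length */
    }
}
```

When n is a multiple of MVL the first strip is empty (vl = 0) and the loop falls straight through to full-length strips.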

Optimization 2: Multi-lane Implementation

(figure: a multi-lane vector unit; each lane holds a partition of the vector register file and pipelined slices of the functional units, connected to/from the memory system)

• Elements for vector registers interleaved across the lanes
• Each lane receives identical control
• Multiple element operations executed per cycle
• Modular, scalable design
• No need for inter-lane communication for most vector instructions

Chaining & Multi-lane Example

  instruction stream: vld; vmul.vv; vadd.vv; addu; vld; vmul.vv; vadd.vv; addu

• VL=16, 4 lanes, 2 FUs, 1 LSU, chaining -> 12 ops/cycle
• Just one new instruction issued per cycle!!!!

Optimization 3: Conditional Execution
• Suppose you want to vectorize this:
  for (I=0; I<N; I++)
    if (A[I] != B[I]) A[I] -= B[I];
• Solution: conditional (masked) vector execution
– Cray uses vector mask & merge

Two Ways to View Vectorization
• Inner loop vectorization (classic approach)
• Outer loop vectorization
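One way to model mask-controlled execution in C, assuming a compare that sets a flag register and a subtract that only writes elements whose flag is set. The helper names echo the slide's instruction style but are illustrative, not Cray syntax:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* vcmp.neq: set flag f[i] where the elements differ */
void vcmp_neq(bool *f, const int *v1, const int *v2, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        f[i] = (v1[i] != v2[i]);
}

/* masked vsub: a[i] -= b[i] only for elements whose flag is set */
void vsub_masked(int *a, const int *b, const bool *f, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        if (f[i])
            a[i] -= b[i];
}
```

Together the two calls vectorize the if-body without a branch: the data-dependent control flow becomes a data-dependent mask.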

Vectorizing Matrix Mult

  // Matrix-matrix multiply:
  // sum a[i][t] * b[t][j] to get c[i][j]
  for (i=1; i<n; i++)
    for (j=1; j<n; j++) {
      sum = 0;
      for (t=1; t<n; t++)
        sum += a[i][t] * b[t][j];  // loop-carried dependence on sum
      c[i][j] = sum;
    }

Parallelize Inner Product

(figure: sum of partial products; a tree of adders combines the partial products so the inner-product reduction takes log steps instead of a serial chain)


Outer-loop Approach

  // Outer-loop Matrix-matrix multiply:
  // sum a[i][t] * b[t][j] to get c[i][j]
  // 32 elements of the result calculated in parallel
  // with each iteration of the j-loop (c[i][j:j+31])
  for (i=1; i<n; i++)
    for (j=1; j<n; j+=32) {
      sum[0:31] = 0;
      for (t=1; t<n; t++)
        sum[0:31] += a[i][t] * b[t][j:j+31];  // scalar x vector
      c[i][j:j+31] = sum[0:31];
    }

Approaches to Mediaprocessing

(figure: design space for multimedia processing, spanning general-purpose processors with SIMD extensions, vector processors, DSPs, and ASICs/FPGAs)
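The outer-loop scheme above runs as plain C if the vector strip is made explicit. A sketch with strip width W (4 here instead of the slide's 32, purely to keep the example small); it assumes row-major flat arrays and that n is a multiple of W:

```c
#include <assert.h>
#include <stddef.h>

#define W 4  /* strip width standing in for the 32 elements on the slide */

/* Outer-loop matrix multiply: a strip sum[0..W-1] of the result row is
   kept live across the whole t-loop, so each t iteration is one
   scalar-times-vector multiply-add. */
void matmul_outer(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j += W) {
            double sum[W] = {0};                 /* sum[0:W-1] = 0 */
            for (size_t t = 0; t < n; t++)
                for (size_t k = 0; k < W; k++)   /* scalar x vector */
                    sum[k] += a[i*n + t] * b[t*n + j + k];
            for (size_t k = 0; k < W; k++)
                c[i*n + j + k] = sum[k];
        }
}
```

Note there is no loop-carried dependence across the k lane: each of the W accumulators reduces independently, which is why this form vectorizes where the inner-product form does not.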


What is Multimedia Processing?
• Desktop:
– 3D graphics (games)
– Speech recognition (voice input)
– Video/audio decoding (mpeg-mp3 playback)
• Servers:
– Video/audio encoding (video servers, IP telephony)
– Digital libraries and media mining (video servers)
– Computer animation, 3D modeling & rendering (movies)
• Embedded:
– 3D graphics (game consoles)
– Video/audio decoding & encoding (set top boxes)
– Image processing (digital cameras)
– Signal processing (cellular phones)

The Need for Multimedia ISAs
• Why aren't general-purpose processors and ISAs sufficient for multimedia (despite Moore's law)?
• Performance
– A 1.2GHz Athlon can do MPEG-4 encoding at 6.4fps
– One 384Kbps W-CDMA channel requires 6.9 GOPS
• Power consumption
– A 1.2GHz Athlon consumes ~60W
– Power consumption increases with clock frequency and complexity
• Cost
– A 1.2GHz Athlon costs ~$62 to manufacture and has a list price of ~$600 (module)
– Cost increases with complexity, area, power, etc


Example: MPEG Decoding (load breakdown)
  Input stream
  -> Parsing (10%)
  -> Dequantization (20%)
  -> IDCT (25%)
  -> Block Reconstruction (30%)
  -> RGB->YUV (15%)
  -> Output to screen

Example: 3D Graphics (load breakdown)
  Display lists
  -> Geometry pipe: Transform (10%), Lighting (10%)
  -> Setup, Clipping (10%)
  -> Rendering pipe: Rasterization, Anti-aliasing (35%); Shading, fogging, texture mapping, alpha blending, Z-buffer, frame-buffer ops (55%)
  -> Output to screen

Characteristics of Multimedia Apps (1)
• Requirement for real-time response
– "Incorrect" result often preferred to slow result
– Unpredictability can be bad (e.g. dynamic execution)
• Narrow data-types
– Typical width of data in memory: 8 to 16 bits
– Typical width of data during computation: 16 to 32 bits
– 64-bit data types rarely needed
– Fixed-point arithmetic often replaces floating-point
• Fine-grain (data) parallelism
– Identical operation applied on streams of input data
– Branches have high predictability
– High instruction locality in small loops or kernels

Characteristics of Multimedia Apps (2)
• Coarse-grain parallelism
– Most apps organized as a pipeline of functions
– Multiple threads of execution can be used
• Memory requirements
– High bandwidth requirements but can tolerate high latency
– High spatial locality (predictable pattern) but low temporal locality
– Cache bypassing and prefetching can be crucial


Examples of Media Functions
• Matrix transpose/multiply (3D graphics)
• DCT/FFT (video, audio, communications)
• Motion estimation (video)
• Gamma correction (3D graphics)
• Haar transform (media mining)
• Median filter (image processing)
• Separable convolution (image processing)
• Viterbi decode (communications, speech)
• Bit packing (communications, cryptography)
• Galois-field arithmetic (communications, cryptography)
• …

SIMD Extensions for GPPs
• Motivation
– Low media-processing performance of GPPs
– Cost and lack of flexibility of specialized ASICs for graphics/video
– Underutilized datapaths and registers
• Basic idea: sub-word parallelism
– Treat a 64-bit register as a vector of 2 32-bit or 4 16-bit or 8 8-bit values (short vectors)
– Partition 64-bit datapaths to handle multiple narrow operations in parallel
• Initial constraints
– No additional architecture state (registers)
– No additional exceptions
– Minimum area overhead
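Sub-word parallelism can even be done by hand, without hardware support, by masking off inter-lane carries. A minimal sketch: wrapping (non-saturating) addition of eight 8-bit lanes packed in one 64-bit word, which is the operation SIMD extensions implement by physically partitioning the datapath:

```c
#include <assert.h>
#include <stdint.h>

/* Add eight packed 8-bit lanes; carries never cross lane boundaries. */
uint64_t padd8(uint64_t x, uint64_t y) {
    const uint64_t low7 = 0x7f7f7f7f7f7f7f7fULL;
    /* Add the low 7 bits of each lane: any carry stays inside the lane,
       landing in that lane's own top bit. */
    uint64_t sum = (x & low7) + (y & low7);
    /* Fix up each lane's top bit (xor of the operands' top bits and the
       carry-in), so nothing leaks into the neighboring lane. */
    return sum ^ ((x ^ y) & ~low7);
}
```

The appeal to hardware designers is visible here: a real partitioned adder just cuts the carry chain at lane boundaries, at almost zero area cost.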


Overview of SIMD Extensions

  Vendor    Extension    Year   # Instr       Registers
  HP        MAX-1 and 2  94,95  9,8 (int)     Int 32x64b
  Sun       VIS          95     121 (int)     FP 32x64b
  Intel     MMX          97     57 (int)      FP 8x64b
  AMD       3DNow!       98     21 (fp)       FP 8x64b
  Motorola  Altivec      98     162 (int,fp)  32x128b (new)
  Intel     SSE          98     70 (fp)       8x128b (new)
  MIPS      MIPS-3D      ?      23 (fp)       FP 32x64b
  AMD       E 3DNow!     99     24 (fp)       8x128b (new)
  Intel     SSE-2        01     144 (int,fp)  8x128b (new)

Summary of SIMD Operations (1)
• Integer arithmetic
– Addition and subtraction with saturation
– Fixed-point rounding modes for multiply and shift
– Sum of absolute differences
– Multiply-add, multiplication with reduction
– Min, max
• Floating-point arithmetic
– Packed floating-point operations
– Square root, reciprocal
– Exception masks
• Data communication
– Merge, insert, extract
– Pack, unpack (width conversion)
– Permute, shuffle
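Saturation is the signature integer operation in these extensions: a pixel sum that overflows should clamp to white, not wrap to black. A one-lane sketch of unsigned 8-bit saturating add (MMX's PADDUSB applies this per lane across the packed register):

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 8-bit add that clamps at 255 instead of wrapping. */
uint8_t addu8_sat(uint8_t a, uint8_t b) {
    uint16_t s = (uint16_t)a + b;       /* widen so the overflow is visible */
    return s > 255 ? 255 : (uint8_t)s;  /* saturate */
}
```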


Summary of SIMD Operations (2)
• Comparisons
– Integer and FP packed comparison
– Compare absolute values
– Element masks and bit vectors
• Memory
– No new load-store instructions for short vectors
• No support for strides or indexing
– Short vectors handled with 64b load and store instructions
– Pack, unpack, shift, rotate, shuffle to handle alignment of narrow data-types within a wider one
– Prefetch instructions for utilizing temporal locality

Programming with SIMD Extensions
• Optimized shared libraries
– Written in assembly, distributed by vendor
– Need well-defined API for data format and use
• Language macros for variables and operations
– C/C++ wrappers for short vector variables and function calls
– Allows instruction scheduling and register allocation optimizations for specific processors
– Lack of portability, non-standard
• Compilers for SIMD extensions
– No commercially available compiler so far
– Problems
• Language support for expressing fixed-point arithmetic and SIMD parallelism
• Complicated model for loading/storing vectors
• Frequent updates
• Assembly coding


SIMD Performance

(figure: arithmetic and geometric mean speedup over the base architecture on Berkeley media benchmarks for Athlon, Alpha 21264, Pentium III, PowerPC G4, and UltraSparc IIi; most speedups fall between about 1.3x and 7.6x)

• Limitations
– Memory bandwidth
– Overhead of handling alignment and data width adjustments

A Closer Look at MMX/SSE

(figure: per-kernel speedups on a Pentium III (500MHz) with MMX/SSE, ranging up to 31.1x)

• Higher speedup for kernels with narrow data where 128b SSE instructions can be used
• Lower speedup for those with irregular or strided accesses

Choosing the Data Type Width
• Alternatives for selecting the width of elements in a vector register (64b, 32b, 16b, 8b)
• Separate instructions for each width
– E.g. vadd64, vadd32, vadd16, vadd8
– Popular with SIMD extensions for GPPs
– Uses too many opcodes
• Specify it in a control register
– Virtual-processor width (VPW)
– Updated only on width changes
• NOTE
– MVL increases when width (VPW) gets narrower
– E.g. with 2Kbits per register, MVL is 32, 64, 128, 256 for 64-, 32-, 16-, 8-bit data respectively
– Always pick the narrowest VPW needed by the application

Other Features for Multimedia
• Support for fixed-point arithmetic
– Saturation, rounding modes etc
• Permutation instructions on vector registers
– For reductions and FFTs
– Not general permutations (too expensive)
• Example: permutation for reductions
– Move 2nd half of a vector register into another one

    V0: | 0 … 15 | 16 … 63 |
    V1: | 0 … 15 | 16 … 63 |

– Repeatedly use with vadd to execute reduction
– Vector length halved after each step
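The halving reduction reads naturally as C. A minimal sketch in which each step plays the move-upper-half permutation followed by a vadd, so the live vector length halves until one element remains; it assumes vl is a power of two:

```c
#include <assert.h>
#include <stddef.h>

/* Sum-reduce v[0..vl-1] by repeated halving: each step adds the upper
   half of the live region into the lower half (the "permute + vadd"
   pair from the slide), so log2(vl) steps finish the reduction. */
int vreduce_sum(int *v, size_t vl) {
    while (vl > 1) {
        vl /= 2;                   /* vector length halves each step */
        for (size_t i = 0; i < vl; i++)
            v[i] += v[i + vl];     /* vadd with the moved half */
    }
    return v[0];
}
```

The serial loop would take vl-1 dependent adds; this form exposes vl/2 independent adds per step, which is what the permutation instruction buys.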


Designing a Vector Processor
• Changes to scalar core?
• How to pick the maximum vector length?
• How to pick the number of vector registers?
• Context switch overhead?
• Exception handling?
• Masking and flag instructions?

Changes to Scalar Core
• Decode vector instructions
• Send scalar registers to vector unit (vector-scalar ops)
• Synchronization for results back from vector register, including exceptions
• Things that don't run in vector mode don't have high ILP, so the scalar CPU can be kept simple


How to Pick Max. Vector Length?
• Vector length must keep all VFUs busy:

  vector length >= (# lanes) x (# VFUs) / (# vector instr. issued/cycle)

• Notes:
– Single instruction issue is always the simplest
– Don't forget you have to issue some scalar instructions as well
– Cray gets mileage from VL <= word length

How to Pick # of Vector Registers?
• More vector registers:
– Reduce vector register "spills" (save/restore)
– Allow aggressive scheduling of vector instructions: better compiling to take advantage of ILP
• Fewer vector registers:
– Fewer bits in instruction format (usually 3 fields)
• 32 vector registers are usually enough


Context Switch Overhead?
• The vector register file holds a huge amount of architectural state
– Too expensive to save and restore all of it on each context switch
– Cray: exchange packet
• Extra dirty bit per processor
– If vector registers are not written, don't need to save them on a context switch
• Extra valid bit per vector register, cleared on process start
– Don't need to restore on context switch until needed
– Save/restore vector state only if the new context needs to issue vector instructions

Exception Handling: Arithmetic
• Arithmetic traps are hard
• Precise interrupts => large performance loss
– Multimedia applications don't care much about arithmetic traps anyway
• Alternative model
– Store exception information in vector flag registers
– A set flag bit indicates that the corresponding element operation caused an exception
– Software inserts trap barrier instructions to check the flag bits as needed
– IEEE floating point requires 5 flag registers (5 types of traps)


Exception Handling: Page Faults
• Page faults must be precise
– Instruction page faults not a problem
– Data page faults harder
• Option 1: save/restore internal vector unit state
– Freeze pipeline, (dump all vector state), fix fault, (restore state and) continue vector pipeline
• Option 2: expand memory pipeline to check all addresses before sending them to memory
– Requires address and instruction buffers to avoid stalls during address checks
– On a page fault one only needs to save the state in those buffers
– Instructions that have cleared the buffer can be allowed to complete

Exception Handling: Interrupts
• Interrupts due to external sources
– I/O, timers etc
• Handled by the scalar core
• Should the vector unit be interrupted?
– Not immediately (no context switch)
– Only if it causes an exception or the interrupt handler needs to execute a vector instruction


Vector Power Consumption
• Can trade off parallelism for power
– Power = C * Vdd^2 * f
– If we double the lanes, peak performance doubles
– Halving f restores peak performance but also allows halving of Vdd
– Power_new = (2C) * (Vdd/2)^2 * (f/2) = Power/4
• Simpler logic
– Replicated control for all lanes
– No multiple issue or dynamic execution logic
• Simpler to gate clocks
– Each vector instruction explicitly describes all the resources it needs for a number of cycles
– Conditional execution leads to further savings

Why Vectors for Multimedia?
• Natural match to parallelism in multimedia
– Vector operations with VL the image or frame width
– Easy to efficiently support vectors of narrow data types
• High performance at low cost
– Multiple ops/cycle while issuing 1 instr/cycle
– Multiple ops/cycle at low power consumption
– Structured access pattern for registers and memory
• Scalable
– Get higher performance by adding lanes without architecture modifications
• Compact code size
– Describe N operations with 1 short instruction (v. VLIW)
• Predictable performance
– No need for caches, no dynamic execution
• Mature, developed compiler technology
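The power arithmetic on this slide is worth checking numerically. A tiny sketch, with `dyn_power` an illustrative helper for the dynamic-power formula, confirming that doubling capacitance (lanes) while halving both Vdd and f lands at one quarter of the original power:

```c
#include <assert.h>
#include <math.h>

/* Dynamic power model from the slide: Power = C * Vdd^2 * f */
double dyn_power(double c, double vdd, double f) {
    return c * vdd * vdd * f;
}
```

Usage: dyn_power(2*C, Vdd/2, f/2) evaluates to (2C)*(Vdd^2/4)*(f/2) = C*Vdd^2*f / 4, matching the slide's Power/4, while the doubled lanes at half frequency preserve peak element throughput.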


A Vector Media-Processor: VIRAM
• Technology: IBM SA-27E
– 0.18µm CMOS, 6 copper layers
• 280 mm² die area
– 158 mm² DRAM, 50 mm² logic
• Transistor count: ~115M
– 14 Mbytes DRAM
• Power supply & consumption
– 1.2V for logic, 1.8V for DRAM
– 2W at 1.2V
• Peak performance
– 1.6/3.2/6.4 Gops (64/32/16b ops)
– 3.2/6.4/12.8 Gops (with madd)
– 1.6 Gflops (single-precision)
• Designed by 5 graduate students

Performance Comparison

                      VIRAM   MMX
  iDCT                0.75    3.75 (5.0x)
  Color Conversion    0.78    8.00 (10.2x)
  Image Convolution   1.23    5.49 (4.5x)
  QCIF (176x144)      7.1M    33M  (4.6x)
  CIF (352x288)       28M     140M (5.0x)

• QCIF and CIF numbers are in clock cycles per frame
• All other numbers are in clock cycles per pixel
• MMX results assume no first-level cache misses


FFT (1): FFT (floating-point, 1024 points)

(figure: bar chart of execution time in µsec; bars for VIRAM, Wildstar, TigerSHARC, ADSP-21160, TMS320C6701, PPC 604E, and Pentium, spanning roughly 7 to 125 µsec)

FFT (2): FFT (fixed-point, 256 points)

(figure: bar chart of execution time in µsec; bars for VIRAM, Pathfinder-1, Pathfinder-2, Carmel, TigerSHARC, PPC 604E, and Pentium, spanning roughly 7 to 151 µsec)

SIMD Summary
• Narrow vector extensions for GPPs
– 64b or 128b registers as vectors of 32b, 16b, and 8b elements
• Based on sub-word parallelism and partitioned datapaths
• Instructions
– Packed fixed- and floating-point, multiply-add, reductions
– Pack, unpack, permutations
– Limited memory support
• 2x to 4x performance improvement over base architecture
– Limited by memory bandwidth
• Difficult to use (no compilers)

Vector Summary
• Alternative model for explicitly expressing data parallelism
• If code is vectorizable, then simpler hardware, more power efficient, and a better real-time model than out-of-order machines with SIMD support
• Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations
• Will multimedia popularity revive vector architectures?

