
CS252 Graduate Computer Architecture
Lecture 20: Vector Processing => Multimedia
David E. Culler, 4/9/02
Many slides due to Christoforos E. Kozyrakis

Vector Processors
• Initially developed for supercomputing applications, today also important for multimedia.
• Vector processors have high-level operations that work on linear arrays of numbers ("vectors"):

    SCALAR (1 operation):   add r3, r1, r2       # r3 = r1 + r2
    VECTOR (N operations):  vadd.vv v3, v1, v2   # v3[i] = v1[i] + v2[i], over the vector length

Properties of Vector Processors
• A single vector instruction implies lots of work (a whole loop)
  – Fewer instruction fetches
• Each result is independent of previous results
  – Multiple operations can be executed in parallel
  – Simpler design, high clock rate
  – Compiler (or programmer) ensures no dependencies
• Reduces branches and branch problems in pipelines
• Vector instructions access memory with a known pattern
  – Effective prefetching
  – Amortize memory latency over a large number of elements
  – Can exploit a high-bandwidth memory system
  – No (data) caches required!

Styles of Vector Architectures
• Memory-memory vector processors
  – All vector operations are memory to memory
• Vector-register processors
  – All vector operations between vector registers (except vector load and store)
  – Vector equivalent of load-store architectures
  – Includes all vector machines since the late 1980s
  – We assume vector-register machines for the rest of the lecture

Historical Perspective
• Mid-60s: fear that performance stagnates
• SIMD processor arrays actively developed during the late 60s to mid 70s
  – bit-parallel machines for image processing: PEPE, STARAN, MPP
  – word-parallel machines for scientific computing: Illiac IV
• Cray develops fast scalar machines: CDC 6600, 7600
• CDC bets on vectors with the Star-100
• Amdahl argues against vectors

Cray-1 Breakthrough
• Fast, simple scalar processor
  – 80 MHz!
  – single-phase clock, latches
• Exquisite electrical and mechanical design
• Semiconductor memory
• Vector register concept
  – vast simplification of the instruction set
  – reduced necessary memory bandwidth
• Tight integration of vector and scalar units
• Piggy-backed off the 7600 stacklib
• Vectorizing compilers developed later
• Owned high-performance computing for a decade
  – what happened then? VLIW competition

Components of a Vector Processor
• Scalar CPU: registers, datapaths, instruction fetch logic
• Vector registers
  – Fixed-length memory bank holding a single vector
  – Typically 8-32 vector registers, each holding 1 to 8 Kbits
  – At least 2 read ports and 1 write port
  – Can be viewed as an array of 64b, 32b, 16b, or 8b elements
• Vector functional units (FUs)
  – Fully pipelined, start a new operation every clock
  – Typically 2 to 8 FUs: integer and FP
  – Multiple datapaths (pipelines) per unit to process multiple elements per cycle
• Vector load-store units (LSUs)
  – Fully pipelined unit to load or store a vector
  – Multiple elements fetched/stored per cycle
  – May have multiple LSUs
• Cross-bar to connect FUs, LSUs, and registers

Cray-1 Block Diagram
• Simple 16-bit register-register instructions
• 32-bit instructions with immediates
• Natural combinations of scalar and vector operations
• Scalar bit-vectors match the vector length
• Gather/scatter memory references
• Conditional merge

Basic Vector Instructions

    Instr.    Operands    Operation                 Comment
    VADD.VV   V1,V2,V3    V1=V2+V3                  vector + vector
    VADD.SV   V1,R0,V2    V1=R0+V2                  scalar + vector
    VMUL.VV   V1,V2,V3    V1=V2xV3                  vector x vector
    VMUL.SV   V1,R0,V2    V1=R0xV2                  scalar x vector
    VLD       V1,R1       V1=M[R1..R1+63]           load, stride=1
    VLDS      V1,R1,R2    V1=M[R1..R1+63*R2]        load, stride=R2
    VLDX      V1,R1,V2    V1=M[R1+V2i, i=0..63]     indexed ("gather")
    VST       V1,R1       M[R1..R1+63]=V1           store, stride=1
    VSTS      V1,R1,R2    M[R1..R1+63*R2]=V1        store, stride=R2
    VSTX      V1,R1,V2    M[R1+V2i, i=0..63]=V1     indexed ("scatter")

  + all the regular scalar instructions (RISC style)…

Vector Memory Operations
• Load/store operations move groups of data between registers and memory
• Three types of addressing
  – Unit stride: fastest
  – Non-unit (constant) stride
  – Indexed (gather-scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases the number of programs that vectorize
    • compress/expand variants also exist
• Support for various combinations of data widths in memory
  – {.L, .W, .H, .B} x {64b, 32b, 16b, 8b}

Vector Code Example
Y[0:63] = Y[0:63] + a*X[0:63]

    64-element SAXPY: scalar          64-element SAXPY: vector
    LD    R0,a                        LD       R0,a       #load scalar a
    ADDI  R4,Rx,#512                  VLD      V1,Rx      #load vector X
    loop: LD R2, 0(Rx)                VMUL.SV  V2,R0,V1   #vector mult
    MULTD R2,R0,R2                    VLD      V3,Ry      #load vector Y
    LD    R8, 0(Ry)                   VADD.VV  V4,V2,V3   #vector add
    ADDD  R8,R2,R8                    VST      Ry,V4      #store vector Y
    SD    R8, 0(Ry)
    ADDI  Rx,Rx,#8
    ADDI  Ry,Ry,#8
    SUB   R20,R4,Rx
    BNZ   R20,loop

Vector Length
• A vector register can hold some maximum number of elements for each data width (the maximum vector length, or MVL)
• What to do when the application vector length is not exactly MVL?
• A vector-length (VL) register controls the length of any vector operation, including vector loads and stores
  – E.g. vadd.vv with VL=10 is: for (I=0; I<10; I++) V1[I]=V2[I]+V3[I]
• VL can be anything from 0 to MVL
• How do you code an application where the vector length is not known until run time?
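The vector SAXPY above can be emulated instruction by instruction in plain C — a minimal sketch, with helper names (vld, vmul_sv, vadd_vv, vst, saxpy64) that are ours, not from the slide; each helper performs the VLEN element operations that one vector instruction covers, which is the SCALAR (1 operation) vs VECTOR (N operations) contrast:

```c
#include <string.h>

#define VLEN 64  /* fixed vector length of the slide's examples */

/* Emulated vector instructions: each call does VLEN element ops. */
static void vld(double *v, const double *mem) { memcpy(v, mem, VLEN * sizeof *v); }
static void vst(double *mem, const double *v) { memcpy(mem, v, VLEN * sizeof *v); }
static void vmul_sv(double *d, double s, const double *v) {
    for (int i = 0; i < VLEN; i++) d[i] = s * v[i];
}
static void vadd_vv(double *d, const double *u, const double *v) {
    for (int i = 0; i < VLEN; i++) d[i] = u[i] + v[i];
}

/* The slide's vector SAXPY, one call per vector instruction. */
void saxpy64(double a, const double *x, double *y) {
    double v1[VLEN], v2[VLEN], v3[VLEN], v4[VLEN];
    vld(v1, x);           /* VLD     V1,Rx    */
    vmul_sv(v2, a, v1);   /* VMUL.SV V2,R0,V1 */
    vld(v3, y);           /* VLD     V3,Ry    */
    vadd_vv(v4, v2, v3);  /* VADD.VV V4,V2,V3 */
    vst(y, v4);           /* VST     Ry,V4    */
}
```

Five calls do the work of the eleven-instruction scalar loop iterated 64 times, which is where the reduction in instruction fetches comes from.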
Strip Mining
• Suppose the application vector length > MVL
• Strip mining
  – Generate a loop that handles MVL elements per iteration
  – A set of operations on MVL elements is translated to a single vector instruction
• Example: vector SAXPY of N elements
  – The first loop handles (N mod MVL) elements; the rest handle MVL each

    VL = (N mod MVL);        // set VL = N mod MVL
    for (I=0; I<VL; I++)     // 1st loop is a single set of
      Y[I]=A*X[I]+Y[I];      //   vector instructions
    low = (N mod MVL);
    VL = MVL;                // set VL to MVL
    for (I=low; I<N; I++)    // 2nd loop requires N/MVL
      Y[I]=A*X[I]+Y[I];      //   sets of vector instructions

Optimization 1: Chaining
• Suppose:
    vmul.vv V1,V2,V3
    vadd.vv V4,V1,V5   # RAW hazard on V1
• Chaining
  – Vector register (V1) is treated not as a single entity but as a group of individual registers
  – Pipeline forwarding can work on individual vector elements
• Flexible chaining: allow a vector to chain to any other active vector operation => requires more read/write ports
• Unchained, the vadd cannot start until the vmul completes; chained, the vadd consumes each vmul result as it is produced
• The Cray X-MP introduces memory chaining

Optimization 2: Multi-lane Implementation
[Figure: one lane = a partition of the vector register file plus pipelined functional-unit and LSU datapaths; lanes are replicated and connected to the memory system]
• Elements of the vector registers are interleaved across the lanes
• Each lane receives identical control
• Multiple element operations executed per cycle
• Modular, scalable design
• No inter-lane communication needed for most vector instructions

Chaining & Multi-lane Example
[Figure: issue and element-operation timeline for the sequence vld; vmul.vv; vadd.vv; addu, repeated twice]
• VL=16, 4 lanes, 2 FUs, 1 LSU, chaining -> 12 ops/cycle
• Just one new instruction issued per cycle!
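The strip-mining transformation above can be written out in plain C — a sketch assuming MVL=64 (a typical value); each inner loop stands in for one vector instruction executed with the current VL, and the function name saxpy_stripmined is ours:

```c
#define MVL 64  /* assumed maximum vector length */

/* Strip-mined SAXPY as in the slide: the first strip covers the
 * remainder (n mod MVL), every later strip covers a full MVL
 * elements, so all strips after the first map to full-length
 * vector instructions. */
void saxpy_stripmined(int n, double a, const double *x, double *y) {
    int vl = n % MVL;                  /* VL register, first strip */
    int low = 0;
    while (low < n) {
        for (int i = 0; i < vl; i++)   /* one VL-long vector op */
            y[low + i] = a * x[low + i] + y[low + i];
        low += vl;
        vl = MVL;                      /* remaining strips are full */
    }
}
```

Note the edge case the slide's code also handles implicitly: when n is a multiple of MVL the first strip has vl = 0 and does nothing, and every real strip is full length.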
Optimization 3: Conditional Execution
• Suppose you want to vectorize this:
    for (I=0; I<N; I++)
      if (A[I] != B[I]) A[I] -= B[I];
• Solution: vector conditional execution
  – Add vector flag registers with single-bit elements
  – Use a vector compare to set a flag register
  – Use the flag register as a mask for the vector subtract
• The subtraction is executed only for vector elements whose corresponding flag element is set
• Vector code:
    vld  V1, Ra
    vld  V2, Rb
    vcmp.neq.vv F0, V1, V2    # vector compare
    vsub.vv V3, V2, V1, F0    # conditional vsub
    vst  V3, Ra
  – Cray uses vector mask & merge

Two Ways to View Vectorization
• Inner-loop vectorization (classic approach)
  – Think of the machine as, say, 32 vector registers, each with 16 elements
  – 1 instruction updates the 16 elements of 1 vector register
  – Good for vectorizing single-dimension arrays or regular kernels (e.g. SAXPY)
• Outer-loop vectorization (post-CM2)
  – Think of the machine as 16 "virtual processors" (VPs), each with 32 scalar registers (like a multithreaded processor)
  – 1 instruction updates 1 scalar register in each of the 16 VPs
  – Good for irregular kernels or kernels with loop-carried dependences in the inner loop
• These are just two compiler perspectives
  – The hardware is the same for both

Vectorizing Matrix Mult

    // Matrix-matrix multiply:
    // sum a[i][t] * b[t][j] to get c[i][j]
    for (i=1; i<n; i++) {
      for (j=1; j<n; j++) {
        sum = 0;
        for (t=1; t<n; t++) {
          sum += a[i][t] * b[t][j];  // loop-carried dependence
        }
        c[i][j] = sum;
      }
    }

Parallelize Inner Product
[Figure: sum of partial products — pairwise multiplies feed a tree of adders, reducing the inner product in logarithmic depth]
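The masked sequence above can be modeled in plain C — a minimal sketch assuming a fixed vector length of 16; the flag array stands in for the vector flag register F0, and the helper name masked_sub is ours, not from the slide:

```c
#define VL 16  /* assumed vector length for the example */

/* Models the slide's conditional execution: the vector compare
 * (vcmp.neq.vv F0,V1,V2) sets one flag bit per element, and the
 * masked subtract updates only elements whose flag is set --
 * the same effect as: if (A[i] != B[i]) A[i] -= B[i]; */
void masked_sub(int *a, const int *b) {
    int flag[VL];
    for (int i = 0; i < VL; i++)   /* vcmp.neq.vv F0, V1, V2 */
        flag[i] = (a[i] != b[i]);
    for (int i = 0; i < VL; i++)   /* vsub under mask F0     */
        if (flag[i]) a[i] -= b[i];
}
```

In hardware both loops are single vector instructions, and elements with a clear flag are simply not written back, so no branch is needed.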
Outer-loop Approach

    // Outer-loop matrix-matrix multiply:
    // sum a[i][t] * b[t][j] to get c[i][j]
    // 32 elements of the result calculated in parallel
    // with each iteration of the j-loop (c[i][j:j+31])
    for (i=1; i<n; i++) {
      for (j=1; j<n; j+=32) {                // loop being vectorized
        sum[0:31] = 0;
        for (t=1; t<n; t++) {
          ascalar = a[i][t];                 // scalar load
          bvector[0:31] = b[t][j:j+31];      // vector load
          prod[0:31] = bvector[0:31]*ascalar;  // vector mul
          sum[0:31] += prod[0:31];           // vector add
        }
        c[i][j:j+31] = sum[0:31];            // vector store
      }
    }

Approaches to Mediaprocessing
• Vector processors
• General-purpose processors with SIMD extensions
• VLIW with SIMD extensions (aka mediaprocessors)
• DSPs
• ASICs/FPGAs
All of these target multimedia processing from different points in the design space.
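The outer-loop pseudocode above maps directly to plain C — a sketch assuming 0-based indexing over the full row-major matrices and n a multiple of the 32-element strip width; the function name matmul_outer is ours:

```c
#define VW 32  /* strip width: the slide computes c[i][j:j+31] at once */

/* Outer-loop vectorized matrix multiply: a strip of 32 partial sums
 * (the "vector register" sum[]) is updated together on every trip of
 * the t loop, so the loop-carried dependence is between whole vector
 * operations, not between individual elements. Assumes n % VW == 0. */
void matmul_outer(int n, const double *a, const double *b, double *c) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j += VW) {      /* loop being vectorized */
            double sum[VW] = {0};              /* sum[0:31] = 0    */
            for (int t = 0; t < n; t++) {
                double ascalar = a[i*n + t];   /* scalar load      */
                for (int k = 0; k < VW; k++)   /* vector mul + add */
                    sum[k] += ascalar * b[t*n + j + k];
            }
            for (int k = 0; k < VW; k++)
                c[i*n + j + k] = sum[k];       /* vector store     */
        }
}
```

Each k-loop corresponds to one vector instruction; compared with the inner-product version, no scalar reduction appears in the innermost loop, which is why this form vectorizes cleanly.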