
CS252 Graduate Computer Architecture
Lecture 13: Vector Processing (Con’t), Intro to Multiprocessing
March 8th, 2010
John Kubiatowicz
Electrical Engineering and Computer Sciences, University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252

Recall: Vector Programming Model
• Scalar registers r0-r15 alongside vector registers v0-v15; each vector register holds elements [0], [1], [2], ..., [VLRMAX-1]
• Vector Length Register (VLR) sets how many elements a vector instruction operates on
• Vector arithmetic instructions, e.g. ADDV v3, v1, v2: element-wise add of v1 and v2 into v3 over elements [0] through [VLR-1]
• Vector load and store instructions, e.g. LV v1, r1, r2: load vector register v1 from memory starting at base address r1 with stride r2
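For concreteness, here is a minimal C sketch of these two instruction semantics; the element count vlr and the element-granularity stride are illustrative parameters, not part of any particular ISA.

    #include <stddef.h>

    /* LV v1, r1, r2: load vlr elements starting at base, stepping by stride
       (stride given in elements here; illustrative, not a real ISA encoding) */
    void lv(double *v1, const double *base, size_t stride, size_t vlr) {
        for (size_t i = 0; i < vlr; i++)
            v1[i] = base[i * stride];
    }

    /* ADDV v3, v1, v2: element-wise add of the first vlr elements */
    void addv(double *v3, const double *v1, const double *v2, size_t vlr) {
        for (size_t i = 0; i < vlr; i++)
            v3[i] = v1[i] + v2[i];
    }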

Vector Memory-Memory vs. Vector Register Machines
• Vector memory-memory instructions hold all vector operands in main memory
• The first vector machines, CDC Star-100 (‘73) and TI ASC (‘71), were memory-memory machines
• Cray-1 (’76) was first vector register machine
• Example source code:
      for (i=0; i<N; i++)
          C[i] = A[i] + B[i];
• Vector memory-memory code:
      ADDV C, A, B
  A vector register machine instead loads A and B into vector registers (LV), adds them (ADDV), and stores the result back (SV)

Recall: Vector Unit Structure
(Figure: a functional unit replicated across lanes; the vector register file is partitioned so one lane holds elements 0, 4, 8, ..., the next holds 1, 5, 9, ..., then 2, 6, 10, ..., then 3, 7, 11, ..., with all lanes sharing the memory subsystem)

Vector Stripmining
• Problem: Vector registers have finite length
• Solution: Break loops into pieces that fit into vector registers, “stripmining”
      ANDI R1, N, 63    # N mod 64: length of the first, odd-sized piece; the remaining pieces use the full vector length
  (a C sketch appears below)

Vector Instruction Parallelism
• Can overlap execution of multiple vector instructions
  – example machine has 32 elements per vector register and 8 lanes
(Figure: load unit, multiply unit, and add unit each busy with a different vector instruction in the same cycle)
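A minimal C sketch of stripmining for a C = A + B loop, assuming a maximum vector length of 64 elements; the MVL constant and function name are illustrative.

    #include <stddef.h>

    #define MVL 64                      /* assumed maximum vector length */

    void stripmined_add(double *C, const double *A, const double *B, size_t N) {
        size_t vl = N % MVL;            /* first piece: N mod 64 elements */
        if (vl == 0) vl = MVL;          /* no odd-sized piece if N divides evenly */
        for (size_t i = 0; i < N; ) {
            /* one strip: set VLR = vl, then LV, LV, ADDV, SV */
            for (size_t j = 0; j < vl; j++)
                C[i + j] = A[i + j] + B[i + j];
            i += vl;
            vl = MVL;                   /* all later pieces are full length */
        }
    }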

Vector Chaining
      LV    v1
      MULV  v3, v1, v2
      ADDV  v5, v3, v4
• With chaining, can start a dependent instruction as soon as the first result appears
(Figure: over time, element results flow from memory through the load unit, are chained into the multiply unit, and then into the add unit, so all three units overlap)


Vector Startup
• Two components of vector startup penalty:
  – functional unit latency (time through pipeline)
  – dead time or recovery time (time before another vector instruction can start down pipeline)
(Figure: per-element pipeline occupancy, R X X X W for each element of the first vector instruction, then 4 cycles of dead time before the second vector instruction can enter the pipeline)

Dead Time and Short Vectors
• T0, eight lanes: no dead time, 100% efficiency with 8-element vectors
• Cray C90, two lanes: 4-cycle dead time, 64 cycles active
  – Maximum efficiency 94% with 128-element vectors (128 elements / 2 lanes = 64 active cycles; 64 / (64 + 4) ≈ 94%)

Vector Scatter/Gather
• Want to vectorize loops with indirect accesses, e.g.
      for (i=0; i<N; i++)
          A[i] = B[i] + C[D[i]];   /* index vector D drives the access to C */
• Solution: an indexed (gather) vector load reads the index vector first, then loads elements from base + index; an indexed (scatter) store works the same way in reverse (see the C sketch below)

Vector Conditional Execution
• Problem: Want to vectorize loops with conditional code:
      for (i=0; i<N; i++)
          if (A[i] > 0) then A[i] = B[i];
• Solution: Add vector mask (or flag) registers
  – vector version of predicate registers, 1 bit per element
  – a vector operation becomes a NOP at elements where the mask bit is clear
• Code example:
      # a vector compare of A against 0 sets the per-element mask bits
      LV vA, rB   # Load B vector into A under mask
      SV vA, rA   # Store A back to memory under mask
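The following C sketch shows what the hardware is doing in both cases: a gather through an index vector and a masked (predicated) update. Both loop bodies are the textbook examples quoted above, and the function names are illustrative.

    /* Gather: A[i] = B[i] + C[D[i]] -- the load from C is indexed by vector D */
    void gather_add(double *A, const double *B, const double *C,
                    const int *D, int N) {
        for (int i = 0; i < N; i++)
            A[i] = B[i] + C[D[i]];
    }

    /* Masked execution: if (A[i] > 0) A[i] = B[i];
       The vector compare sets one mask bit per element; the masked load and
       store only touch elements whose mask bit is set. */
    void masked_copy(double *A, const double *B, int N) {
        for (int i = 0; i < N; i++) {
            int mask = (A[i] > 0);      /* compare sets the mask bit */
            if (mask)
                A[i] = B[i];            /* LV/SV performed under mask */
        }
    }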

Masked Vector Instructions
• Simple implementation
  – execute all N operations, turn off result writeback according to mask
• Density-time implementation
  – scan mask vector and only execute elements with non-zero masks
(Figure: on the left, the write-enable of the write data port is gated by mask bits M[7]..M[0]; on the right, only the enabled elements are streamed to the write data port)

Compress/Expand Operations
• Compress packs non-masked elements from one vector register contiguously at start of destination vector register
  – population count of mask vector gives packed vector length
• Expand performs inverse operation
• Used for density-time conditionals and also for general selection operations
(Figure: compress and expand of elements A[7]..A[0] under mask bits M[7]..M[0])

Administrivia
• Exam: one week from Wednesday (3/17)
• Location: 310 Soda
• Time: 6:00-9:00

Vector Reductions
• Problem: Loop-carried dependence on reduction variables
      sum = 0;
      for (i=0; i<N; i++)
          sum += A[i];        # loop-carried dependence on sum
• Solution: re-associate the operations, accumulating a vector of partial sums across stripmined chunks, then repeatedly halving the vector length and adding the upper half of the partials into the lower half until only one element remains (see the C sketch below)
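A C sketch of that re-association, with VL partial sums kept in what would be one vector register; VL, the helper name, and the assumption that N is a multiple of VL are all illustrative.

    #define VL 64                            /* assumed vector length */

    double vector_sum(const double *A, int N) {
        double partial[VL] = {0};            /* one vector register of partial sums */
        /* stripmined vector adds: partial[j] accumulates every VL-th element
           (this sketch assumes N is a multiple of VL) */
        for (int i = 0; i < N; i += VL)
            for (int j = 0; j < VL; j++)
                partial[j] += A[i + j];
        /* binary tree: halve the number of partials each step until one is left */
        for (int vl = VL / 2; vl >= 1; vl /= 2)
            for (int j = 0; j < vl; j++)
                partial[j] += partial[j + vl];
        return partial[0];
    }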


Novel Matrix Multiply Solution / Optimized Vector Example
• Consider the following:
      /* Multiply a[m][k] * b[k][n] to get c[m][n] */
      for (i=1; i<m; i++) {
          ...
      }
  (a reduction-free vectorization is sketched below)
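One reduction-free way to vectorize this multiply is to vectorize across columns of c, so each element of a vector register accumulates an independent dot product. Whether this matches the slide's exact code is an assumption; VL, the flattened row-major layout, and the requirement that n be a multiple of VL are illustrative choices.

    #define VL 32                                  /* assumed vector length */

    /* c[m][n] = a[m][k] * b[k][n], vectorizing across columns of c so each
       vector element holds an independent running dot product (no reduction).
       Arrays are row-major flat buffers; this sketch assumes n % VL == 0. */
    void matmul(int m, int k, int n,
                const double *a, const double *b, double *c) {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j += VL) {
                double acc[VL] = {0};              /* one vector register of sums */
                for (int t = 0; t < k; t++)
                    for (int v = 0; v < VL; v++)   /* scalar a[i][t] times a vector of b */
                        acc[v] += a[i*k + t] * b[t*n + j + v];
                for (int v = 0; v < VL; v++)       /* store the finished strip of row i */
                    c[i*n + j + v] = acc[v];
            }
    }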

Multimedia Extensions
• Very short vectors added to existing ISAs for microprocessors
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b elements
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
  – no vector length control
  – no strided load/store or scatter/gather
  – unit-stride loads must be aligned to 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors

“Vector” for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
  – similar to Intel 860, Mot. 88110, HP PA-7100LC, UltraSPARC
• 3 data types: 8x8-bit, 4x16-bit, 2x32-bit, packed in 64 bits
  – reuse 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
  – use in drivers or added to library routines; no compiler support
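The core packed-arithmetic idea is easy to state in plain C: eight independent 8-bit adds carried out on one 64-bit word, each lane wrapping on its own. The function below is an illustrative emulation, not an actual MMX intrinsic.

    #include <stdint.h>

    /* Add eight packed 8-bit lanes of two 64-bit words; each lane wraps
       around independently (illustrative emulation of a packed add) */
    uint64_t packed_add_8x8(uint64_t x, uint64_t y) {
        uint64_t result = 0;
        for (int lane = 0; lane < 8; lane++) {
            uint8_t a = (uint8_t)(x >> (8 * lane));
            uint8_t b = (uint8_t)(y >> (8 * lane));
            result |= (uint64_t)(uint8_t)(a + b) << (8 * lane);
        }
        return result;
    }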


VLIW: Very Large Instruction Word (revisited)
• Each “instruction” has explicit coding for multiple operations
  – In IA-64, grouping called a “packet”
  – In Transmeta, grouping called a “molecule” (with “atoms” as ops)
• Tradeoff instruction space for simple decoding
  – The long instruction word has room for many operations
  – By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
  – E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
    » 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
  – Need compiling technique that schedules across several branches

Recall: Loop Unrolling in VLIW
      Memory ref 1        Memory ref 2        FP operation 1      FP operation 2      Int. op/branch      Clock
      L.D  F0,0(R1)       L.D  F6,-8(R1)                                                                  1
      L.D  F10,-16(R1)    L.D  F14,-24(R1)                                                                2
      L.D  F18,-32(R1)    L.D  F22,-40(R1)    ADD.D F4,F0,F2      ADD.D F8,F6,F2                          3
      L.D  F26,-48(R1)                        ADD.D F12,F10,F2    ADD.D F16,F14,F2                        4
                                              ADD.D F20,F18,F2    ADD.D F24,F22,F2                        5
      S.D  0(R1),F4       S.D  -8(R1),F8      ADD.D F28,F26,F2                                            6
      S.D  -16(R1),F12    S.D  -24(R1),F16                                                                7
      S.D  -32(R1),F20    S.D  -40(R1),F24                                        DSUBUI R1,R1,#48        8
      S.D  -0(R1),F28                                                             BNEZ R1,LOOP            9
• Unrolled 7 times to avoid delays
• 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
• Average: 2.5 ops per clock, 50% efficiency
• Note: Need more registers in VLIW (15 vs. 6 in SS)
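The schedule above is the unrolled form of a loop that adds a scalar (held in F2) to every element of an array of doubles, walking downward through memory via R1. A C sketch of that source loop, with an illustrative trip count:

    /* Scalar source of the unrolled VLIW schedule above: add the scalar s to
       each element of x, walking from high addresses down.  The bound of 1000
       is illustrative; the point is that the iterations are independent. */
    void add_scalar(double *x, double s) {
        for (int i = 1000; i > 0; i--)
            x[i] = x[i] + s;
    }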

Paper Discussion: VLIW and the ELI-512 (Joshua Fisher)
• Trace Scheduling:
  – Find common paths through code (“traces”)
  – Compact them
  – Build fixup code for trace exits (the “split” boxes)
  – Must not overwrite live variables: use extra variables to store results
• N+1 way jumps
  – Used to handle exit conditions from traces
• Software prediction of memory bank usage
  – Use it to avoid bank conflicts / deal with limited routing

Problems with 1st Generation VLIW
• Increase in code size
  – generating enough operations in a straight-line code fragment requires ambitiously unrolling loops
  – whenever VLIW instructions are not full, unused functional units translate to wasted bits in instruction encoding
• Operated in lock-step; no hazard detection HW
  – a stall in any functional unit pipeline caused the entire processor to stall, since all functional units must be kept synchronized
  – Compiler might predict functional unit latencies, but caches are hard to predict
• Binary code compatibility
  – Pure VLIW => different numbers of functional units and unit latencies require different versions of the code

Intel/HP IA-64 “Explicitly Parallel Instruction Computer (EPIC)”
• IA-64: instruction set architecture
  – 128 64-bit integer regs + 128 82-bit floating point regs
    » Not separate register files per functional unit as in old VLIW
  – Hardware checks dependencies (interlocks => binary compatibility over time)
• 3 instructions in 128-bit “bundles”; a field determines whether instructions are dependent or independent
  – Smaller code size than old VLIW, larger than x86/RISC
  – Groups can be linked to show independence > 3 instr
• Predicated execution (select 1 out of 64 1-bit flags)
  => 40% fewer mispredictions?
• Speculation support:
  – deferred exception handling with “poison bits”
  – speculative movement of loads above stores + check to see if incorrect
• Itanium™ was first implementation (2001)
  – Highly parallel and deeply pipelined hardware at 800 MHz
  – 6-wide, 10-stage pipeline at 800 MHz on 0.18 µ process
• Itanium 2™ is name of 2nd implementation (2005)
  – 6-wide, 8-stage pipeline at 1666 MHz on 0.13 µ process
  – Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3

EPIC Design Maximizes SW-HW Synergy (Copyright: Intel at Hotchips ’00)
• Architecture features programmed by the compiler: register stack & rotation, branch hints, explicit parallelism, data & control speculation, predication, memory hints
• Micro-architecture features in hardware: fetch, issue, register handling, control, parallel resources, memory subsystem
(Figure: fast, simple 6-issue front end with instruction cache & branch predictors; 128 GR & 128 FR with register remap & stack engine; parallel resources of 4 integer + 4 MMX units, 2 FMACs (4 for SSE), 2 LD/ST units, 32-entry ALAT; bypasses & dependencies; three levels of cache: L1, L2, L3; speculation deferral management)

10 Stage In-Order Core Pipeline (Copyright: Intel at Hotchips ’00)
• Front end: pre-fetch/fetch of up to 6 instructions/cycle, hierarchy of branch predictors, decoupling buffer
• Instruction delivery: dispersal of up to 6 instructions on 9 ports, register remapping, register stack engine
• Operand delivery: register read + bypasses, register scoreboard, predicated dependencies
• Execution: 4 single-cycle ALUs, 2 ld/str, advanced load control, predicate delivery & branch, nat/exception/retirement
• Stages: IPG (instruction pointer generation), FET (fetch), ROT (rotate), EXP (expand), REN (rename), WLD (word-line decode), REG (register read), EXE (execute), DET (exception detect), WRB (write-back)

What is Parallel Architecture?
• A parallel computer is a collection of processing elements that cooperate to solve large problems
  – Most important new element: It is all about communication!
• What does the programmer (or OS or compiler) think about?
  – Models of computation:
    » PRAM? BSP? Sequential Consistency?
  – Resource allocation:
    » how powerful are the elements?
    » how much memory?
• What mechanisms must be in hardware vs software?
  – What does a single processor look like?
    » High performance general purpose processor
    » SIMD processor / vector processor
  – Data access, communication and synchronization
    » how do the elements cooperate and communicate?
    » how are data transmitted between processors?
    » what are the abstractions and primitives for cooperation?

Flynn’s Classification (1966)
Broad classification of parallel computing systems:
• SISD: Single Instruction, Single Data
  – conventional uniprocessor
• SIMD: Single Instruction, Multiple Data
  – one instruction stream, multiple data paths
  – distributed memory SIMD (MPP, DAP, CM-1&2, Maspar)
  – shared memory SIMD (STARAN, vector computers)
• MIMD: Multiple Instruction, Multiple Data
  – message passing machines (Transputers, nCube, CM-5)
  – non-cache-coherent shared memory machines (BBN Butterfly, T3D)
  – cache-coherent shared memory machines (Sequent, Sun Starfire, SGI Origin)
• MISD: Multiple Instruction, Single Data
  – Not a practical configuration

Examples of MIMD Machines
• Symmetric Multiprocessor
  – Multiple processors in a box with shared memory communication (figure: processors P on a bus to memory)
  – Current multicore chips are like this
  – Every processor runs a copy of the OS
• Non-uniform shared-memory with separate I/O through host
  – Multiple processors
    » Each with local memory
    » general scalable network
  – Extremely light “OS” on each node provides simple services
    » Scheduling/synchronization
  – Network-accessible host for I/O
• Cluster
  – Many independent machines connected with a general network
  – Communication through messages

Categories of Thread Execution
(Figure: issue slots over time for superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading; slots filled by Threads 1-5 or left idle)

Parallel Programming Models
• Programming model is made up of the languages and libraries that create an abstract view of the machine
• Control
  – How is parallelism created?
  – What orderings exist between operations?
  – How do different threads of control synchronize?
• Data
  – What data is private vs. shared?
  – How is logically shared data accessed or communicated?
• Synchronization
  – What operations can be used to coordinate parallelism?
  – What are the atomic (indivisible) operations?
• Cost
  – How do we account for the cost of each of the above?

Simple Programming Example
• Consider applying a function f to the elements of an array A and then computing its sum:
      Σ_{i=0..n-1} f(A[i])
• Questions:
  – Where does A live? All in single memory? Partitioned?
  – What work will be done by each processor?
  – They need to coordinate to get a single result, how?
• Sequential view:
      A = array of all data
      fA = f(A)
      s = sum(fA)
(Figure: A flows through f to produce fA, which is summed to give s)

Programming Model 1: Shared Memory
• Program is a collection of threads of control.
  – Can be created dynamically, mid-execution, in some languages
• Each thread has a set of private variables, e.g., local stack variables
• Also a set of shared variables, e.g., static variables, shared common blocks, or global heap.
  – Threads communicate implicitly by writing and reading shared variables.
  – Threads coordinate by synchronizing on shared variables
(Figure: a shared memory holding s and code like y = ..s.., with threads P0, P1, ..., Pn each holding a private i)


• Shared memory strategy: static int s = 0; n  1 – small number p << n=size(A) processors f ( A [ i ])  Thread 1 Thread 2 – attached to single memory i  0 • Parallel Decomposition: for i = 0, n/2-1 for i = n/2, n-1 – Each evaluation and each partial sum is a task. s = s + f(A[i]) s = s + f(A[i]) • Assign n/p numbers to each of p procs – Each computes independent “private” results and partial sum. – Collect the p partial sums and compute a global sum. • Problem is a race condition on variable s in the program •A race condition or data race occurs when: Two Classes of Data: - two processors (or two threads) access the same • Logically Shared variable, and at least one does a write. – The original n numbers, the global sum. - The accesses are concurrent (not synchronized) so • Logically Private they could happen simultaneously – The individual function evaluations. – What about the individual partial sums?
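A runnable pthreads version of this pseudocode, with f taken to be squaring; the two-thread split, the array contents, and all names here are illustrative. It exhibits exactly the race on s described above.

    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    static int A[N] = {1, 2, 3, 4, 5, 6, 7, 8};   /* illustrative data */
    static int s = 0;                             /* shared, unprotected */

    static int f(int x) { return x * x; }         /* example: f is squaring */

    static void *worker(void *arg) {
        int lo = *(int *)arg;                     /* each thread sums half of A */
        for (int i = lo; i < lo + N / 2; i++)
            s = s + f(A[i]);                      /* racy read-modify-write of s */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        int lo1 = 0, lo2 = N / 2;
        pthread_create(&t1, NULL, worker, &lo1);
        pthread_create(&t2, NULL, worker, &lo2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("s = %d\n", s);                    /* may be wrong because of the race */
        return 0;
    }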


A Closer Look
• Assume A = [3,5], f is the square function, and s=0 initially
• For this program to work, s should be 34 at the end, but it may be 34, 9, or 25
• The atomic operations are reads and writes
  – Never see ½ of one number, but the += operation is not atomic
  – All computations happen in (private) registers

      static int s = 0;

      Thread 1                                  Thread 2
      ...                                       ...
      compute f(A[i]) and put in reg0     9     compute f(A[i]) and put in reg0     25
      reg1 = s                            0     reg1 = s                            0
      reg1 = reg1 + reg0                  9     reg1 = reg1 + reg0                  25
      s = reg1                            9     s = reg1                            25
      ...                                       ...

Improved Code for Sum
      static int s = 0;
      static lock lk;

      Thread 1                            Thread 2
      local_s1 = 0                        local_s2 = 0
      for i = 0, n/2-1                    for i = n/2, n-1
          local_s1 = local_s1 + f(A[i])       local_s2 = local_s2 + f(A[i])
      lock(lk);                           lock(lk);
      s = s + local_s1                    s = s + local_s2
      unlock(lk);                         unlock(lk);

• Since addition is associative, it’s OK to rearrange order
• Most computation is on private variables
  – Sharing frequency is also reduced, which might improve speed
  – But there is still a race condition on the update of shared s
  – The race condition can be fixed by adding locks (only one thread can hold a lock at a time; others wait for it); see the pthreads sketch below

What about Synchronization?
• All shared-memory programs need synchronization
• Barrier – global (/coordinated) synchronization
  – simple use of barriers -- all threads hit the same one
      work_on_my_subgrid();
      barrier;
      read_neighboring_values();
      barrier;
• Mutexes – mutual exclusion locks
  – threads are mostly independent and must access common data
      lock *l = alloc_and_init();   /* shared */
      lock(l);
      access data
      unlock(l);
• Need atomic operations bigger than loads/stores
  – Actually, Dijkstra’s algorithm can get by with only loads/stores, but this is quite complex (and doesn’t work under all circumstances)
  – Example: atomic swap, test-and-test-and-set
• Another option: Transactional memory
  – Hardware equivalent of optimistic concurrency
  – Some think that this is the answer to all parallel programming

Programming Model 2: Message Passing
• Program consists of a collection of named processes.
  – Usually fixed at program startup time
  – Thread of control plus local address space -- NO shared data.
  – Logically shared data is partitioned over local processes.
• Processes communicate by explicit send/receive pairs
  – Coordination is implicit in every communication event.
  – MPI (Message Passing Interface) is the most commonly used SW
(Figure: processes P0 ... Pn, each with a private memory holding its own s and i; P0 executes "send P1, s" while Pn executes "receive Pn, s"; the processes are connected only by a network)
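The pthreads version of “Improved Code for Sum” above: private partial sums plus one locked update of s, with pthread_mutex_t standing in for the slide’s lock lk (data and names again illustrative).

    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    static int A[N] = {1, 2, 3, 4, 5, 6, 7, 8};   /* illustrative data */
    static int s = 0;
    static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

    static int f(int x) { return x * x; }

    static void *worker(void *arg) {
        int lo = *(int *)arg;
        int local_s = 0;                          /* private partial sum */
        for (int i = lo; i < lo + N / 2; i++)
            local_s += f(A[i]);
        pthread_mutex_lock(&lk);                  /* single protected update of s */
        s += local_s;
        pthread_mutex_unlock(&lk);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        int lo1 = 0, lo2 = N / 2;
        pthread_create(&t1, NULL, worker, &lo1);
        pthread_create(&t2, NULL, worker, &lo2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("s = %d\n", s);                    /* now always the correct sum */
        return 0;
    }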

Compute A[1]+A[2] on each processor
• First possible solution – what could go wrong?
      Processor 1                 Processor 2
      xlocal = A[1]               xlocal = A[2]
      send xlocal, proc2          send xlocal, proc1
      receive xremote, proc2      receive xremote, proc1
      s = xlocal + xremote        s = xlocal + xremote
• If send blocks until the matching receive is posted, both processors block in send and deadlock; if sends are buffered, the exchange completes, so the send/receive ordering and semantics matter

MPI – the de facto standard
• MPI has become the de facto standard for parallel computing using message passing
• Example: for(i=1;i ... (an MPI sketch of the two-processor exchange appears below)
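A minimal MPI sketch of that two-processor exchange; the tag, datatype, and use of MPI_Sendrecv (which avoids the deadlock of two blocking sends) are choices made here for illustration, not the slide’s code.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        if (nprocs != 2)                          /* this sketch assumes 2 ranks */
            MPI_Abort(MPI_COMM_WORLD, 1);

        double A[2] = {1.0, 2.0};                 /* illustrative data */
        double xlocal = A[rank], xremote;
        int other = 1 - rank;

        /* Combined send+receive avoids the deadlock of two blocking sends */
        MPI_Sendrecv(&xlocal, 1, MPI_DOUBLE, other, 0,
                     &xremote, 1, MPI_DOUBLE, other, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d: s = %f\n", rank, xlocal + xremote);
        MPI_Finalize();
        return 0;
    }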

Example: Multidimensional Meshes and Tori
(Figure: a 2D grid, a 3D cube, and a 2D torus)
• n-dimensional array
  – N = k_{d-1} x ... x k_0 nodes
  – described by n-vector of coordinates (i_{n-1}, ..., i_0)
• n-dimensional k-ary mesh: N = k^n
  – k = N^(1/n)
  – described by n-vector of radix-k coordinates (see the C sketch below)
• n-dimensional k-ary torus (or k-ary n-cube)?

Links and Channels
(Figure: a transmitter turns the symbol stream "...ABC123" into a signal on the link; the receiver recovers the stream, shown as "...QR67", at the other end)
• transmitter converts stream of digital symbols into a signal that is driven down the link
• receiver converts it back
  – tran/rcv share physical protocol
• trans + link + rcv form a Channel for digital info flow between switches
• link-level protocol segments the stream of symbols into larger units: packets or messages (framing)
• node-level protocol embeds commands for the destination communication assist within the packet
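A small C sketch of the coordinate description for the k-ary mesh/torus above: converting a node number to its n-vector of radix-k coordinates and back (the routine names are illustrative).

    /* Node id <-> coordinate vector for an n-dimensional k-ary mesh or torus:
       node ids 0 .. k^n - 1 map to coordinates (i_{n-1}, ..., i_0), radix k. */
    void node_to_coords(int node, int k, int n, int coords[]) {
        for (int d = 0; d < n; d++) {
            coords[d] = node % k;          /* i_d is the d-th radix-k digit */
            node /= k;
        }
    }

    int coords_to_node(const int coords[], int k, int n) {
        int node = 0;
        for (int d = n - 1; d >= 0; d--)
            node = node * k + coords[d];
        return node;
    }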

3/8/2010 cs252-S10, Lecture 13 47 3/8/2010 cs252-S10, Lecture 13 48 Clock Synchronization? Conclusion • Receiver must be synchronized to transmitter • Vector is alternative model for exploiting ILP – To know when to latch data – If code is vectorizable, then simpler hardware, more energy efficient, and • Fully Synchronous better real-time model than Out-of-order machines – Same clock and phase: Isochronous – Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, – Same clock, different phase: Mesochronous conditional operations » High-speed serial links work this way » Use of encoding (8B/10B) to ensure sufficient high-frequency • VLIW: Explicitly Parallel component for clock recovery – Trace Scheduling: Select primary “trace” to compress + fixup code • Fully Asynchronous • Itanium/EPIC/VLIW is not a breakthrough in ILP – No clock: Request/Ack signals – If anything, it is as complex or more so than a dynamic processor – Different clock: Need some sort of clock recovery? • Multiprocessing – Multiple processors connect together Transmitter Asserts Data – It is all about communication! Data • Programming Models: – Shared Memory – Message Passing Req • Networking and Communication Interfaces – Fundamental aspect of multiprocessing Ack t0 t1 t2 t3 t4 t5 3/8/2010 cs252-S10, Lecture 13 49 3/8/2010 cs252-S10, Lecture 13 50