CS252 Graduate Computer Architecture
Lecture 13: Vector Processing (Con’t), Intro to Multiprocessing
March 8th, 2010
John Kubiatowicz
Electrical Engineering and Computer Sciences, University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252

Recall: Vector Programming Model
• Scalar registers (r0–r15) and vector registers (v0–v15), each vector register holding elements [0], [1], [2], ..., [VLRMAX-1]
• Vector Length Register (VLR) determines how many elements an instruction operates on
• Vector arithmetic instructions work element-wise over positions [0] ... [VLR-1]:
    ADDV v3, v1, v2      # v3[i] = v1[i] + v2[i]
• Vector load and store instructions move whole vectors between memory and a vector register:
    LV v1, r1, r2        # base address in r1, stride in r2

Recall: Vector Unit Structure
[Figure: vector register elements striped across lanes — lane 0 holds elements 0, 4, 8, ..., lane 1 holds 1, 5, 9, ..., lane 2 holds 2, 6, 10, ..., lane 3 holds 3, 7, 11, ...; each lane has its own functional-unit pipelines and its own port to the memory subsystem]

Vector Memory-Memory vs. Vector Register Machines
• Vector memory-memory instructions hold all vector operands in main memory
• The first vector machines, CDC Star-100 (’73) and TI ASC (’71), were memory-memory machines
• Cray-1 (’76) was the first vector register machine

Example source code:
    for (i=0; i<N; i++) {
      C[i] = A[i] + B[i];
      D[i] = A[i] - B[i];
    }

Vector memory-memory code:
    ADDV C, A, B
    SUBV D, A, B

Vector register code:
    LV   V1, A
    LV   V2, B
    ADDV V3, V1, V2
    SV   V3, C
    SUBV V4, V1, V2
    SV   V4, D

Vector Memory-Memory vs. Vector Register Machines (cont’d)
• Vector memory-memory architectures (VMMAs) require greater main memory bandwidth. Why?
  – All operands must be read from and written back to memory
• VMMAs make it difficult to overlap execution of multiple vector operations. Why?
  – Must check dependencies on memory addresses
• VMMAs incur greater startup latency
  – Scalar code was faster on the CDC Star-100 for vectors < 100 elements
  – For the Cray-1, the vector/scalar breakeven point was around 2 elements
• Apart from the CDC follow-ons (Cyber-205, ETA-10), all major vector machines since the Cray-1 have had vector register architectures (we ignore vector memory-memory from now on)

Automatic Code Vectorization
    for (i=0; i<N; i++)
      C[i] = A[i] + B[i];
• Scalar sequential code executes each iteration’s load, load, add, store in order; vectorized code gathers the loads of all iterations into one vector load, the adds into one vector add, and the stores into one vector store
• Vectorization is a massive compile-time reordering of operation sequencing and requires extensive loop dependence analysis

Vector Stripmining
• Problem: vector registers have finite length
• Solution: break loops into pieces that fit into vector registers ("stripmining")
    for (i=0; i<N; i++)
      C[i] = A[i] + B[i];

          ANDI   R1, N, 63       # N mod 64
          MTC1   VLR, R1         # Do remainder first
    loop: LV     V1, RA
          DSLL   R2, R1, 3       # Multiply by 8 (bytes per element)
          DADDU  RA, RA, R2      # Bump pointer
          LV     V2, RB
          DADDU  RB, RB, R2
          ADDV.D V3, V1, V2
          SV     V3, RC
          DADDU  RC, RC, R2
          DSUBU  N, N, R1        # Subtract elements done
          LI     R1, 64
          MTC1   VLR, R1         # Reset to full length
          BGTZ   N, loop         # Any more to do?

Vector Instruction Parallelism
• Can overlap execution of multiple vector instructions
  – example machine has 32 elements per vector register and 8 lanes
[Figure: a vector load, a vector multiply, and a vector add in flight at once across the load, multiply, and add units, one instruction issued per cycle]
• Complete 24 operations/cycle while issuing 1 short instruction/cycle
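For concreteness, the same strip-mining pattern can be written directly in C. This is a minimal sketch, not the compiler's actual output: the function name stripmine_add and the VLEN constant are illustrative, and the inner chunk loop stands in for one vector instruction of length vl.

    #include <stddef.h>

    #define VLEN 64   /* assumed maximum vector length, matching the 64-element registers above */

    /* Strip-mined C[i] = A[i] + B[i]: process a short remainder chunk first,
     * then full VLEN-element chunks, mirroring the assembly loop above. */
    void stripmine_add(double *C, const double *A, const double *B, size_t N)
    {
        size_t done = 0;
        size_t vl = N % VLEN;            /* first (possibly short) chunk = N mod 64 */
        if (vl == 0 && N > 0) vl = VLEN; /* if N is a multiple of VLEN, start with a full chunk */

        while (done < N) {
            /* This inner loop stands in for one vector instruction of length vl. */
            for (size_t i = 0; i < vl; i++)
                C[done + i] = A[done + i] + B[done + i];
            done += vl;
            vl = VLEN;                   /* every remaining chunk runs at full length */
        }
    }

Handling the short chunk first means every later trip through the loop runs at the full vector length, which is exactly what the two MTC1 writes to VLR accomplish in the assembly.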
Vector Chaining
• Vector version of register bypassing
  – introduced with the Cray-1
    LV   v1
    MULV v3, v1, v2
    ADDV v5, v3, v4
[Figure: results stream from the load unit into the multiplier and from the multiplier into the adder through vector registers V1–V5, so the three instructions execute as a chain]

Vector Chaining Advantage
• Without chaining, must wait for the last element of a result to be written before starting a dependent instruction
• With chaining, can start the dependent instruction as soon as the first result element appears

Vector Startup
Two components of vector startup penalty:
• functional unit latency (time through the pipeline)
• dead time or recovery time (time before another vector instruction can start down the pipeline)
[Figure: per-element pipeline diagram (R X X X W) showing the first vector instruction, the dead time, and the second vector instruction]

Dead Time and Short Vectors
• No dead time (T0, eight lanes): 100% efficiency with 8-element vectors
• 4 cycles of dead time per 64 active cycles (Cray C90, two lanes): maximum efficiency 94%, and only with 128-element vectors

Vector Scatter/Gather
Want to vectorize loops with indirect accesses:
    for (i=0; i<N; i++)
      A[i] = B[i] + C[D[i]];
Indexed load instruction (gather):
    LV     vD, rD       # Load indices in D vector
    LVI    vC, rC, vD   # Load indirect from rC base
    LV     vB, rB       # Load B vector
    ADDV.D vA, vB, vC   # Do add
    SV     vA, rA       # Store result

Vector Conditional Execution
Problem: want to vectorize loops with conditional code:
    for (i=0; i<N; i++)
      if (A[i] > 0) A[i] = B[i];
Solution: add vector mask (or flag) registers
  – vector version of predicate registers, 1 bit per element
...and maskable vector instructions
  – a vector operation becomes a NOP at elements where the mask bit is clear
Code example:
    CVM                 # Turn on all mask bits
    LV      vA, rA      # Load entire A vector
    SGTVS.D vA, F0      # Set bits in mask register where A > 0
    LV      vA, rB      # Load B vector into A under mask
    SV      vA, rA      # Store A back to memory under mask

Masked Vector Instructions
• Simple implementation: execute all N operations, but turn off result writeback according to the mask
• Density-time implementation: scan the mask vector and execute only the elements with non-zero mask bits
[Figure: both implementations processing elements A[0]..A[7] and B[0]..B[7] under mask bits M[0]..M[7]; one gates the write-enable of each result, the other gates issue of each element operation]

Compress/Expand Operations
• Compress packs the non-masked elements of one vector register contiguously at the start of the destination vector register
  – the population count of the mask vector gives the packed vector length
• Expand performs the inverse operation
• Used for density-time conditionals and also for general selection operations
[Figure: compress gathers the elements whose mask bit is set into the low positions; expand scatters them back to the masked positions]
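To pin down the semantics, here is a scalar C reference model of compress and expand under a mask; the function names vcompress/vexpand and the explicit vl parameter are illustrative, not part of any particular ISA.

    #include <stddef.h>

    /* Reference semantics of vector compress: pack the elements of src whose
     * mask bit is set contiguously at the start of dst.  Returns the packed
     * length, i.e. the population count of the mask. */
    size_t vcompress(double *dst, const double *src, const unsigned char *mask, size_t vl)
    {
        size_t n = 0;
        for (size_t i = 0; i < vl; i++)
            if (mask[i])
                dst[n++] = src[i];
        return n;
    }

    /* Reference semantics of vector expand: the inverse operation.  Consecutive
     * elements of src are scattered to the positions of dst whose mask bit is
     * set; positions with a clear mask bit are left unchanged. */
    void vexpand(double *dst, const double *src, const unsigned char *mask, size_t vl)
    {
        size_t n = 0;
        for (size_t i = 0; i < vl; i++)
            if (mask[i])
                dst[i] = src[n++];
    }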
Administrivia
• Exam: one week from Wednesday (3/17)
  – Location: 310 Soda
  – Time: 6:00-9:00
  – This info is on the Lecture page (has been)
  – Bring one 8½ x 11 sheet of notes (both sides)
  – Meet at LaVal’s afterwards for Pizza and Beverages
• I have your proposals. We need to meet to discuss them
  – Time this week? Wednesday after class

Vector Reductions
Problem: loop-carried dependence on a reduction variable
    sum = 0;
    for (i=0; i<N; i++)
      sum += A[i];                    # Loop-carried dependence on sum
Solution: re-associate the operations if possible, and use a binary tree to perform the reduction
    # Rearrange as:
    sum[0:VL-1] = 0                   # Vector of VL partial sums
    for (i=0; i<N; i+=VL)             # Stripmine VL-sized chunks
      sum[0:VL-1] += A[i:i+VL-1];     # Vector sum
    # Now have VL partial sums in one vector register
    do {
      VL = VL/2;                      # Halve vector length
      sum[0:VL-1] += sum[VL:2*VL-1];  # Halve no. of partials
    } while (VL > 1);

Novel Matrix Multiply Solution
• Consider the following:
    /* Multiply a[m][k] * b[k][n] to get c[m][n] */
    for (i=1; i<m; i++) {
      for (j=1; j<n; j++) {
        sum = 0;
        for (t=1; t<k; t++)
          sum += a[i][t] * b[t][j];
        c[i][j] = sum;
      }
    }
• Do you need to do a bunch of reductions? NO!

Optimized Vector Example
    /* Multiply a[m][k] * b[k][n] to get c[m][n] */
    for (i=1; i<m; i++) {
      for (j=1; j<n; j+=32) {              /* Step j 32 at a time. */
        sum[0:31] = 0;                     /* Init vector reg to zeros. */
        for (t=1; t<k; t++) {
          a_scalar = a[i][t];              /* Get scalar */
          b_vector[0:31] = b[t][j:j+31];   /* Get vector */
          /* Do a vector-scalar multiply. */
          prod[0:31] = b_vector[0:31] * a_scalar;
          /* Vector-vector add into results. */
          sum[0:31] += prod[0:31];
        }
        c[i][j:j+31] = sum[0:31];          /* Store vector of results. */
      }
    }
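The slice notation above is pseudocode. Below is a minimal runnable C sketch of the same column-strip idea, with indices starting at 0 and, for brevity, the assumption that n is a multiple of the strip width; the function name matmul_strips and the VREG constant are illustrative, and the 32-element sum array stands in for one vector register.

    #define VREG 32  /* assumed strip width, matching the 32-element vector register above */

    /* Multiply a[m][k] * b[k][n] into c[m][n] by accumulating 32 column-wise
     * partial sums at a time.  This sketch assumes n is a multiple of VREG. */
    void matmul_strips(int m, int n, int k,
                       double a[m][k], double b[k][n], double c[m][n])
    {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j += VREG) {       /* step j 32 at a time */
                double sum[VREG] = {0};               /* stands in for a vector register */
                for (int t = 0; t < k; t++) {
                    double a_scalar = a[i][t];
                    for (int v = 0; v < VREG; v++)    /* vector-scalar multiply-add */
                        sum[v] += a_scalar * b[t][j + v];
                }
                for (int v = 0; v < VREG; v++)        /* store the strip of results */
                    c[i][j + v] = sum[v];
            }
        }
    }

Each pass over t adds a_scalar times a strip of row t of b into the 32 partial sums, so the reduction in the scalar code becomes 32 independent per-column accumulations and no cross-element reduction is needed.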
