Overview of 15-740
Vector Computers and GPUs
15-740, Fall 2018, Nathan Beckmann (based on slides by Daniel Sanchez, MIT)

Today: Vector Machines / GPUs
◦ Focus on throughput, not latency
◦ Programmer/compiler must expose parallel work directly
◦ Works on regular, replicated codes (i.e., data-parallel / SIMD)

Background: Supercomputer Applications
Typical application areas:
◦ Military research (nuclear weapons aka "computational fluid dynamics", cryptography)
◦ Scientific research
◦ Weather forecasting
◦ Oil exploration
◦ Industrial design (car crash simulation)
◦ Bioinformatics
◦ Cryptography
All involve huge computations on large data sets.
In the 70s-80s, supercomputer = vector machine.

Vector Supercomputers
Epitomized by the Cray-1 (1976):
Scalar unit
◦ Load/store architecture
Vector extension
◦ Vector registers
◦ Vector instructions
Implementation
◦ Hardwired control
◦ Highly pipelined functional units
◦ Interleaved memory system
◦ No data caches
◦ No virtual memory

Cray-1 (1976)
[Block diagram: eight vector registers V0-V7 plus vector mask and vector length registers; primary scalar registers S0-S7 backed by 64 temporary T registers; primary address registers A0-A7 backed by 64 temporary B registers; functional units for FP add/multiply/reciprocal, integer add/logic/shift, population count, and address add/multiply; single-port, high-bandwidth memory of 16 banks of 64-bit words; 4 instruction buffers; 12.5 ns processor cycle (80 MHz), 50 ns memory bank cycle.]

Vector Programming Model
[Diagram: scalar registers r0-r15; vector registers v0-v15, each holding elements [0] through [VLRMAX-1]; a vector length register VLR.
A vector arithmetic instruction such as ADDV v3, v1, v2 adds v1 and v2 elementwise into v3, over elements [0] through [VLR-1].
A vector load such as LV v1, r1, r2 fills v1 from memory using base address r1 and stride r2; vector stores work symmetrically.]
Vector Code Example

# C code
for (i=0; i<64; i++)
  C[i] = A[i] + B[i];

# Scalar code
  LI   R4, 64
loop:
  LD   F0, 0(R1)
  LD   F2, 0(R2)
  ADD  F4, F2, F0
  ST   F4, 0(R3)
  ADD  R1, R1, 8
  ADD  R2, R2, 8
  ADD  R3, R3, 8
  SUB  R4, R4, 1
  BNEZ R4, loop

# Vector code
  LI   VLR, 64
  LI   R4, 4
  LV   V1, R1, R4
  LV   V2, R2, R4
  ADDV V3, V1, V2
  SV   V3, R3, R4

What if we want to execute larger loops than the vector registers can hold?

Vector Stripmining
Problem: Vector registers are finite.
Solution: Break loops into chunks that fit into the vector registers ("stripmining").

# C code
for (i=0; i<N; i++)
  C[i] = A[i] + B[i];

# Stripmined vector code
  LI      R1, N       # R1 is ops left to perform
  LI      R2, 8       # R2 is stride
loop:
  VSETVLR R3, R1      # R3 = ops done = min(VLRMAX, R1)
  LV      V1, RA, R2  # Inner loop using vector registers
  LV      V2, RB, R2
  VADD    V3, V1, V2
  SV      V3, RC, R2
  SLL     R3, R3, 3   # R3 = R3 * 8
  ADD     RA, RA, R3  # Lots of pointer updates
  ADD     RB, RB, R3
  ADD     RC, RC, R3
  SUB     R1, R1, R3  # Any more to do?
  BNEZ    R1, loop
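The stripmining loop above can be sketched in portable C; VLRMAX stands in for the hardware's maximum vector length (the value 64 here is an assumption matching a Cray-1-sized register):

```c
#include <stddef.h>

#define VLRMAX 64  /* assumed maximum vector length, as on the Cray-1 */

/* Stripmined C[i] = A[i] + B[i]: process min(VLRMAX, remaining)
 * elements per outer iteration, mirroring the VSETVLR-based loop above.
 * The inner for-loop models one vector instruction's worth of work. */
void vadd_stripmined(double *C, const double *A, const double *B, size_t N) {
    size_t done = 0;
    while (done < N) {
        size_t vl = N - done;            /* ops left to perform */
        if (vl > VLRMAX) vl = VLRMAX;    /* VSETVLR: vl = min(VLRMAX, left) */
        for (size_t i = 0; i < vl; i++)  /* LV / VADD / SV */
            C[done + i] = A[done + i] + B[done + i];
        done += vl;                      /* pointer/counter updates */
    }
}
```

Real vector ISAs perform the min() in hardware each iteration (the Cray's VLR, or vsetvli in the RISC-V vector extension), so the same binary handles any N, including the remainder chunk.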
Vector Instruction Set Advantages
Compact
◦ One short instruction encodes N operations
Expressive — tells hardware that these N operations:
◦ are independent
◦ use the same functional unit
◦ access disjoint registers
◦ access registers in the same pattern as previous instructions
◦ access a contiguous block of memory (unit-stride load/store)
◦ access memory in a known pattern (strided load/store)
Scalable & (somewhat) portable
◦ Can run the same code on more parallel pipelines (lanes)

Vector Arithmetic Execution
Use a deep pipeline to execute element operations
◦ Deep pipeline → fast clock!
◦ Operations are independent → no pipeline hazards → much simpler pipeline control!
Vectors maximize the advantages of pipelining and avoid its downsides.
[Diagram: a six-stage multiply pipeline computing V3 <- V1 * V2, one element pair entering per cycle.]

Vector Instruction Execution
A vector machine can microarchitecturally vary the number of "lanes".
[Diagram: ADDV C, A, B executed on one pipelined functional unit (one element per cycle) vs. on four pipelined functional units (four elements per cycle).]

Vector Unit Structure
Registers are kept near the functional units to minimize data movement.
[Diagram: each lane pairs a slice of the vector register file with its own functional units — lane 0 holds elements 0, 4, 8, …; lane 1 holds 1, 5, 9, …; lane 2 holds 2, 6, 10, …; lane 3 holds 3, 7, 11, …. All lanes connect to the memory subsystem.]

Vector Memory System
The challenge is bandwidth → aggressive banking.
Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency.
◦ More on this in GPUs…
[Diagram: an address generator (base + stride) feeding 16 memory banks (0-F) that supply the vector registers.]

Vector Chaining
Vector analog of bypassing:
  LV   v1
  MULV v3, v1, v2
  ADDV v5, v3, v4
[Diagram: loaded elements chain from the load unit into the multiplier, and products chain from the multiplier into the adder, before each producing instruction finishes.]
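The effect of stride on a banked memory like the Cray-1's (16 banks of 64-bit words) can be sketched in C. The low-order-interleaved bank mapping used here is an assumption for illustration:

```c
#include <stdbool.h>
#include <stddef.h>

#define NBANKS 16  /* Cray-1: 16 memory banks of 64-bit words */

/* Low-order interleaving: consecutive 8-byte words land in
 * consecutive banks (an assumed, typical mapping). */
static unsigned bank_of(size_t byte_addr) {
    return (unsigned)((byte_addr / 8) % NBANKS);
}

/* Count how many distinct banks a 64-element strided access touches.
 * Fewer distinct banks => more bank conflicts => accesses serialize
 * on the bank busy time and bandwidth drops. */
unsigned banks_touched(size_t base, size_t stride_bytes) {
    bool hit[NBANKS] = { false };
    unsigned count = 0;
    for (int i = 0; i < 64; i++) {
        unsigned b = bank_of(base + (size_t)i * stride_bytes);
        if (!hit[b]) { hit[b] = true; count++; }
    }
    return count;
}
```

A unit-stride access (8 bytes) cycles through all 16 banks, so banks have time to recover; a stride of 128 bytes (16 words) maps every element to the same bank, and each access must wait out the 4-cycle bank busy time.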
Vector Chaining Advantage
◦ Without chaining, a dependent instruction must wait for the last element of its source to be written before starting.
◦ With chaining, a dependent instruction can start as soon as the first result element appears.
[Timing diagram: Load, Mul, Add serialized back-to-back without chaining vs. overlapped with chaining.]

Vector Instruction-Level Parallelism
Can overlap execution of multiple vector instructions
◦ Example machine has 32 elements per vector register and 8 lanes
[Timing diagram: the load, multiply, and add units each pipeline successive vector instructions as they issue.]
Complete 24 operations/cycle while issuing <1 short instruction/cycle.

Vector Conditional Execution
Problem: We want to vectorize loops with conditional code:
  for (i=0; i<N; i++)
    if (A[i]>0) A[i] = B[i];
Solution: Add vector mask (or flag) registers
◦ the vector version of predicate registers: 1 bit per element
…and maskable vector instructions
◦ a vector operation becomes a NOP at elements where the mask bit is clear
Code example:
  CVM              # Turn on all elements
  LV   vA, rA      # Load entire A vector
  SGTV vA, F0      # Set bits in mask register where A>0
  LV   vA, rB      # Load B vector into A under mask
  SV   vA, rA      # Store A back to memory under mask

Masked Vector Instructions
◦ Simple implementation: execute all N operations, but turn off result writeback according to the mask.
◦ Efficient implementation: scan the mask vector and only execute elements with non-zero mask bits.
[Diagram: with an example mask, the simple scheme gates the write-enable on each element's result, while the efficient scheme compacts the work down to only the enabled elements.]

Vector Scatter/Gather
We want to vectorize loops with indirect accesses:
  for (i=0; i<N; i++)
    A[i] = B[i] + C[D[i]]
Indexed load instruction (gather):
  LV   vD, rD      # Load indices in D vector
  LVI  vC, rC, vD  # Load indirect from rC base
  LV   vB, rB      # Load B vector
  ADDV vA, vB, vC  # Do add
  SV   vA, rA      # Store result

Vector Scatter/Gather
Scatter example:
  for (i=0; i<N; i++)
    A[B[i]]++;
Is the following a correct
translation?
  LV   vB, rB      # Load indices in B vector
  LVI  vA, rA, vB  # Gather initial A values
  ADDV vA, vA, 1   # Increment
  SVI  vA, rA, vB  # Scatter incremented values

Multimedia Extensions
Short vectors added to existing general-purpose ISAs.
Initially, 64-bit registers split into 2x32b, 4x16b, or 8x8b lanes.
Limited instruction set:
◦ No vector length control
◦ No strided load/store or scatter/gather
◦ Unit-stride loads must be aligned to a 64-bit boundary
Limitation: short vector registers
◦ Requires superscalar dispatch to keep multiply/add/load units busy
◦ Loop unrolling to hide latencies increases register pressure
Trend towards fuller vector support in microprocessors
◦ e.g. x86: MMX → SSEx (128 bits) → AVX (256 bits) → AVX-512 (512 bits)

Intel Larrabee Motivation
Design experiment: not a real 10-core chip!

                          Out-of-order CPU   Larrabee
  # CPU cores             2 out-of-order     10 in-order
  Instructions per issue  4 per clock        2 per clock
  VPU lanes per core      4-wide SSE         16-wide
  L2 cache size           4 MB               4 MB
  Single-stream           4 per clock        2 per clock
  Vector throughput       8 per clock        160 per clock

→ 20 times the multiply-add operations per clock.
Data in chart taken from Seiler, L., Carmean, D., et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing" (SIGGRAPH 2008).
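Returning to the scatter/gather question above: the gather/increment/scatter sequence is a correct translation only when the index vector B contains no repeated values. With duplicates, every gather reads the old value of the shared element, and the scatters then overwrite one another, losing increments. A small C model (illustrative only, not tied to any particular ISA) makes the divergence concrete:

```c
#include <stddef.h>

/* Scalar reference: A[B[i]]++ for each i, in order. */
void scatter_inc_scalar(int *A, const int *B, size_t N) {
    for (size_t i = 0; i < N; i++)
        A[B[i]]++;
}

/* Model of the vectorized sequence: all gathers complete before any
 * scatter, as in the LVI / ADDV / SVI translation above.
 * tmp must have room for N elements (it models the vector register). */
void scatter_inc_vector(int *A, const int *B, size_t N, int *tmp) {
    for (size_t i = 0; i < N; i++) tmp[i] = A[B[i]];  /* LVI:  gather  */
    for (size_t i = 0; i < N; i++) tmp[i] += 1;       /* ADDV: +1      */
    for (size_t i = 0; i < N; i++) A[B[i]] = tmp[i];  /* SVI:  scatter */
}
```

With B = {0, 0} and A initially zero, the scalar loop yields A[0] == 2, but the vector model yields A[0] == 1: both gathers read the old A[0], so one increment is lost. Some later vector ISAs address this with conflict-detection instructions that find duplicate indices.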