CS252 Spring 2017 Graduate Computer Architecture

Lecture 9: Vector Supercomputers
Lisa Wu, Krste Asanovic
http://inst.eecs.berkeley.edu/~cs252/sp17

WU UCB CS252 SP17

Last Time in Lecture 8

Overcoming the worst hazards in OoO superscalars:
• Branch prediction
  - Bimodal
  - Local/Branch History Table
  - Global/gselect, gshare
  - Tournament
• Branch address cache (predict multiple branches per cycle)
• Trace cache
• Return address predictors
• Today: Load/Store Queues, Vector Supercomputers

Load-Store Queue Design

§ After control hazards, data hazards through memory are probably the next most important bottleneck to superscalar performance
§ Modern superscalars use very sophisticated load-store reordering techniques to reduce effective memory latency by allowing loads to be issued speculatively

CS252, Spring 2015, Lecture 9 © Krste Asanovic, 2015

Speculative Store Buffer

§ Just like register updates, stores should not modify the memory until after the instruction is committed. A speculative store buffer is a structure introduced to hold speculative store data.
§ During decode, a store buffer slot is allocated in program order
§ Stores split into “store address” and “store data” micro-operations
§ “Store address” execution writes the tag
§ “Store data” execution writes the data
§ Store commits when it is the oldest instruction and both address and data are available:
  - clear speculative bit and eventually move data to cache
§ On store abort:
  - clear valid bit

[Figure: speculative store buffer entries (V, S, Tag, Data fields) with a store commit path into the L1 data cache tags and data]
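The lifecycle above can be sketched as a tiny Python model. The V/S/Tag/Data field names follow the figure; the class and method names are illustrative assumptions, not any real machine's interface.

```python
# Minimal model of a speculative store buffer (illustrative sketch, not real hardware).
# Each entry carries the figure's fields: valid (V), speculative (S), tag, data.

class SpeculativeStoreBuffer:
    def __init__(self):
        self.entries = []  # kept in program order (oldest first)

    def allocate(self):
        """At decode: reserve a slot in program order; address/data arrive later."""
        e = {"V": False, "S": True, "tag": None, "data": None}
        self.entries.append(e)
        return e

    def store_address(self, e, addr):
        """'Store address' micro-op execution writes the tag."""
        e["tag"] = addr
        e["V"] = True

    def store_data(self, e, value):
        """'Store data' micro-op execution writes the data."""
        e["data"] = value

    def commit_oldest(self, cache):
        """Oldest store with both address and data: clear speculative bit,
        move data to the cache, and free the slot."""
        e = self.entries[0]
        assert e["V"] and e["data"] is not None
        e["S"] = False
        cache[e["tag"]] = e["data"]
        self.entries.pop(0)

    def abort_youngest(self):
        """On store abort (mis-speculation): drop the entry (clear valid bit)."""
        self.entries.pop()

cache = {}
ssb = SpeculativeStoreBuffer()
e = ssb.allocate()
ssb.store_address(e, 0x1000)
ssb.store_data(e, 42)
ssb.commit_oldest(cache)   # data reaches the cache only at commit
```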

Load bypass from speculative store buffer

[Figure: a load address is checked against the speculative store buffer tags in parallel with the L1 data cache lookup; on a match, store buffer data is bypassed to the load instead of the cache data]

§ If data is in both the store buffer and the cache, which should we use?
  - The speculative store buffer
§ If the same address is in the store buffer twice, which should we use?
  - The youngest store older than the load

Memory Dependencies

sd x1, (x2)
ld x3, (x4)

§ When can we execute the load?

In-Order Memory Queue

§ Execute all loads and stores in program order

§ => Load and store cannot leave ROB for execution until all previous loads and stores have completed execution

§ Can still execute loads and stores speculatively, and out-of-order with respect to other instructions

§ Need a structure to handle memory ordering…

Conservative O-o-O Load Execution

sd x1, (x2)
ld x3, (x4)

§ Can execute load before store, if addresses known and x4 != x2
§ Each load address compared with addresses of all previous uncommitted stores
  - can use partial conservative check, i.e., bottom 12 bits of address, to save hardware
§ Don’t execute load if any previous store address not known
§ (MIPS R10K, 16-entry address queue)
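The partial-address trick can be checked in a few lines. The point is that comparing only the low 12 bits can produce false conflicts (costing performance, never correctness), but can never miss a true conflict, so the check stays conservative.

```python
# Sketch of the partial conservative address check: compare only the low
# 12 bits of load and store addresses to save comparator hardware.

def may_conflict(load_addr, store_addr, bits=12):
    mask = (1 << bits) - 1
    return (load_addr & mask) == (store_addr & mask)

# A true conflict is always detected (same full address => same low bits):
same = may_conflict(0x1234, 0x1234)
# Different pages with equal low bits give a false positive (safe, just slow):
false_pos = may_conflict(0x1234, 0xF234)
# Different low bits: provably no conflict, load may be reordered:
no_conf = may_conflict(0x1234, 0x1235)
```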

Address Speculation

sd x1, (x2)
ld x3, (x4)

§ Guess that x4 != x2
§ Execute load before store address known
§ Need to hold all completed but uncommitted load/store addresses in program order
§ If subsequently find x4 == x2, squash load and all following instructions

§ => Large penalty for inaccurate address speculation

Memory Dependence Prediction (Alpha 21264)

sd x1, (x2) ld x3, (x4)

§ Guess that x4 != x2 and execute load before store

§ If later find x4==x2, squash load and all following instructions, but mark load instruction as store-wait

§ Subsequent executions of the same load instruction will wait for all previous stores to complete

§ Periodically clear store-wait bits

Supercomputer Applications

§ Typical application areas
- Military research (nuclear weapons, cryptography)
- Scientific research
- Weather forecasting
- Oil exploration
- Industrial design (car crash simulation)
- Bioinformatics
- Cryptography
§ All involve huge computations on large data sets

§ Supercomputers: CDC6600, CDC7600, Cray-1, …

§ In the 70s–80s, Supercomputer ≡ Vector Machine

Vector Supercomputers

§ Epitomized by Cray-1, 1976:
§ Scalar Unit
- Load/Store Architecture
§ Vector Extension
- Vector Registers
- Vector Instructions
§ Implementation
- Hardwired Control
- Highly Pipelined Functional Units
- Interleaved Memory System
- No Data Caches
- No Virtual Memory

[©Cray Research, 1976]

Cray-1 Internals displayed at EPFL

Photograph by Rama, Wikimedia Commons, Cc-by-sa-2.0-fr

13 WU UCB CS252 SP17 Vector Programming Model Scalar Registers Vector Registers x31 v31

x0 v0 [0] [1] [2] [VLRMAX-1]

Vector Length Register VLR v1 Vector Arithmetic v2 Instructions + + + + + + vadd v3, v1, v2 v3 [0] [1] [VLR-1]

Vector Load and Store Vector Register Instructions v1 vld v1, x1, x2

Memory Base, x1 Stride, x2

Vector Code Example

# C code
for (i=0; i<64; i++)
  C[i] = A[i] + B[i];

# Scalar Code
      li x4, 64
loop: fld f1, 0(x1)
      fld f2, 0(x2)
      fadd.d f3,f1,f2
      fsd f3, 0(x3)
      addi x1, 8
      addi x2, 8
      addi x3, 8
      subi x4, 1
      bnez x4, loop

# Vector Code
      li x4, 64
      setvlr x4
      vfld v1, x1
      vfld v2, x2
      vfadd.d v3,v1,v2
      vfsd v3, x3

Cray-1 (1976)

[Figure: Cray-1 datapath — eight 64-element vector registers V0–V7 with vector mask and vector length registers; scalar registers S0–S7 backed by 64 T registers; address registers A0–A7 backed by 64 B registers; functional units for FP add, FP multiply, FP reciprocal, integer add, integer logic, integer shift, population count, address add, and address multiply; single-port memory with 16 banks of 64-bit words, 8-bit SECDED, 80 MW/sec data load/store and 320 MW/sec instruction buffer refill; four 64-entry instruction buffers (NIP/CIP/LIP); memory bank cycle 50 ns, processor cycle 12.5 ns (80 MHz)]

Vector Instruction Set Advantages

§ Compact
- one short instruction encodes N operations
§ Expressive, tells hardware that these N operations:
- are independent
- use the same functional unit
- access disjoint registers
- access registers in same pattern as previous instructions
- access a contiguous block of memory (unit-stride load/store)
- access memory in a known pattern (strided load/store)
§ Scalable
- can run same code on more parallel pipelines (lanes)

Vector Arithmetic Execution

• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)

Six-stage multiply pipeline

v3 <- v1 * v2

Vector Instruction Execution

vfadd.d vc, va, vb

Execution using one pipelined functional unit vs. execution using four pipelined functional units:

[Figure: with one pipelined unit, element pairs A[i], B[i] enter one per cycle and results C[0], C[1], C[2], … drain in sequence; with four pipelined units, four pairs enter per cycle (e.g. A[24..27] with B[24..27]) and results drain four at a time (C[0..3], C[4..7], C[8..11], …)]

Interleaved Vector Memory System

§ Bank busy time: time before a bank is ready to accept the next request
§ Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency

[Figure: base and stride feed an address generator (adder); successive element addresses are distributed across 16 memory banks (0–F), which return data to the vector registers]
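The interaction between stride and banking can be sketched with a toy cycle model. The bank count (16) and busy time (4 cycles) come from the Cray-1 slide; the one-address-per-cycle issue model is a simplifying assumption.

```python
# Sketch: how access stride interacts with an interleaved memory system.
# 16 banks and 4-cycle bank busy time per the Cray-1 slide; the cycle
# accounting itself is a deliberately simplified model.

def cycles_for_vector(n_elem, stride, n_banks=16, busy=4):
    """Issue one element address per cycle; stall when the target bank
    is still busy with a previous request."""
    ready = [0] * n_banks          # cycle at which each bank is next free
    t = 0
    for i in range(n_elem):
        bank = (i * stride) % n_banks
        t = max(t, ready[bank])    # wait out the bank busy time if needed
        ready[bank] = t + busy
        t += 1
    return t

# Unit stride revisits a bank only every 16 accesses (> 4-cycle busy): no stalls.
unit = cycles_for_vector(64, 1)
# Stride 16 hammers one bank: every access pays the 4-cycle busy time.
worst = cycles_for_vector(64, 16)
```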

Vector Unit Structure

[Figure: vector registers partitioned across four lanes — lane 0 holds elements 0, 4, 8, …; lane 1 holds 1, 5, 9, …; lane 2 holds 2, 6, 10, …; lane 3 holds 3, 7, 11, …; each lane contains its own pipeline in each functional unit and its own port to the memory subsystem]

T0 Vector (UCB/ICSI, 1995)

[Figure: vector register elements striped over 8 lanes — lane j holds elements [j], [j+8], [j+16], [j+24]]
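The striping in the T0 figure is just a modulo mapping, easy to state in code:

```python
# Element-to-lane striping as in the T0 figure: with 8 lanes and 32-element
# vector registers, lane j holds elements j, j+8, j+16, j+24.

def lane_of(element, n_lanes=8):
    return element % n_lanes

def elements_in_lane(lane, vlen=32, n_lanes=8):
    return list(range(lane, vlen, n_lanes))

lanes_of_13 = lane_of(13)          # element 13 lives in lane 5
lane0 = elements_in_lane(0)        # [0, 8, 16, 24]
lane7 = elements_in_lane(7)        # [7, 15, 23, 31]
```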

Vector Instruction Parallelism

§ Can overlap execution of multiple vector instructions
- example machine has 32 elements per vector register and 8 lanes

[Figure: load unit, multiply unit, and add unit each pipeline a different vector instruction over time; one instruction is issued per cycle and overlaps with those already in flight]

Complete 24 operations/cycle while issuing 1 short instruction/cycle

Vector Chaining

§ Vector version of register bypassing - introduced with Cray-1

vld v1
vfmul v3, v1, v2
vfadd v5, v3, v4

[Figure: memory feeds the load unit; the load unit's v1 results are chained into the multiplier, and the multiplier's v3 results are chained into the adder]

Vector Chaining Advantage

• Without chaining, must wait for last element of result to be written before starting dependent instruction
• With chaining, can start dependent instruction as soon as first result appears

[Figure: timelines for Load, Mul, Add — serialized end-to-end without chaining vs. overlapped, offset only by pipeline latency, with chaining]
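The timing advantage can be quantified with a simple model. The unit latencies below are hypothetical placeholders, not Cray-1 numbers; the structure of the formula is what matters.

```python
# Sketch of the chaining timing argument for a dependent chain of vector
# instructions (e.g. load -> mul -> add) over n elements.

def finish_time(n, latencies, chaining):
    """Cycle at which the last unit writes its last element.
    Without chaining, each op waits for the previous op's LAST element;
    with chaining, it starts as soon as the previous op's FIRST result appears."""
    start = 0
    for lat in latencies:
        if chaining:
            start += lat           # offset by pipeline latency only
        else:
            start += lat + n - 1   # wait for the whole previous vector
    return start + n - 1           # last element drains through the final unit

n = 64
lats = [6, 7, 6]                   # hypothetical load, mul, add latencies
chained = finish_time(n, lats, chaining=True)
unchained = finish_time(n, lats, chaining=False)
```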

Vector Startup

§ Two components of vector startup penalty
- functional unit latency (time through pipeline)
- dead time or recovery time (time before another vector instruction can start down pipeline)

[Figure: per-element pipeline diagrams (R X X X W) — the first vector instruction's elements flow through the pipeline (functional unit latency), then dead time elapses before the second vector instruction's elements can enter]

Dead Time and Short Vectors

§ T0, eight lanes: no dead time; 100% efficiency with 8-element vectors
§ Cray C90, two lanes: 4 cycles dead time, 64 cycles active; maximum efficiency 94% with 128-element vectors
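The slide's efficiency figures follow directly from the numbers given, which a few lines of Python can confirm:

```python
# Checking the slide's dead-time efficiency numbers: a vector of n elements
# on `lanes` lanes keeps a unit busy n/lanes cycles, then the dead time
# elapses before the next vector instruction can start.

def efficiency(n_elem, lanes, dead_time):
    active = n_elem // lanes
    return active / (active + dead_time)

# Cray C90: 128-element vectors on two lanes = 64 active cycles, 4 dead.
c90 = efficiency(128, 2, 4)        # 64 / 68 ~= 0.94
# T0: no dead time, so even 8-element vectors on eight lanes hit 100%.
t0 = efficiency(8, 8, 0)
```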

Vector Memory-Memory versus Vector Register Machines

§ Vector memory-memory instructions hold all vector operands in main memory
§ The first vector machines, CDC Star-100 (’73) and TI ASC (’71), were memory-memory machines
§ Cray-1 (’76) was first vector register machine

Vector Memory-Memory Code Example

Source Code:
for (i=0; i<N; i++)
  C[i] = A[i] + B[i];

Vector Memory-Memory Code:
ADDV C, A, B

Automatic Code Vectorization

for (i=0; i < N; i++)
  C[i] = A[i] + B[i];

[Figure: scalar sequential code performs load, load, add, store for iteration 1, then iteration 2, and so on through time; vectorized code replaces each column of operations — all the loads, all the adds, all the stores — with a single vector instruction]

Vectorization is a massive compile-time reordering of operation sequencing => requires extensive loop dependence analysis

Vector Stripmining

Problem: Vector registers have finite length
Solution: Break loops into pieces that fit in registers, “Stripmining”

for (i=0; i<N; i++)
  C[i] = A[i] + B[i];

andi x1, xN, 63   # N mod 64
setvlr x1         # Do remainder
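The stripmining structure (remainder strip first, then full-length strips) can be sketched in plain Python. The inner loop stands in for one vector instruction; MVL = 64 follows the slide's `N mod 64` remainder computation.

```python
# Stripmining sketch: handle the N mod 64 remainder first, then run
# full 64-element strips, mirroring the slide's andi/setvlr structure.

MVL = 64  # maximum vector length

def stripmined_add(A, B):
    n = len(A)
    C = [0.0] * n
    i = 0
    vl = n % MVL or MVL        # first strip = remainder (or full if divisible)
    while i < n:
        for j in range(i, i + vl):   # stands in for one vector instruction
            C[j] = A[j] + B[j]
        i += vl
        vl = MVL               # all subsequent strips use the full length
    return C

A = list(range(130))           # 130 = 2 (remainder) + 64 + 64
B = [1] * 130
C = stripmined_add(A, B)
```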

Vector Conditional Execution

Problem: Want to vectorize loops with conditional code:
for (i=0; i<N; i++)
  if (A[i] > 0) then A[i] = B[i];

Solution: Add vector mask (or flag) registers
- vector version of predicate registers, 1 bit per element
…and maskable vector instructions
- vector operation becomes bubble (“NOP”) at elements where mask bit is clear

Code example:
cvm              # Turn on all elements
vld vA, xA       # Load entire A vector
vfsgts.d vA, f0  # Set bits in mask register where A>0
vld vA, xB       # Load B vector into A under mask
vsd vA, xA       # Store A back to memory under mask
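The masked sequence above can be mirrored in Python to make the semantics concrete (a behavioral sketch of the slide's code, not an ISA simulator):

```python
# Behavioral sketch of masked execution for: if (A[i] > 0) A[i] = B[i];

def masked_conditional(A, B):
    mask = [a > 0 for a in A]        # vfsgts.d: set mask bits where A > 0
    vA = list(A)
    for i, m in enumerate(mask):     # maskable op: bubble where mask is clear
        if m:
            vA[i] = B[i]             # load B into A under mask
    return vA                        # store back under mask

out = masked_conditional([1, -2, 3, 0], [10, 20, 30, 40])
```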

Masked Vector Instructions

Simple Implementation
- execute all N operations, turn off result writeback according to mask

Density-Time Implementation
- scan mask vector and only execute elements with non-zero masks

[Figure: the simple implementation streams every element pair A[i], B[i] through the pipeline and uses mask bits M[i] as write enables at the write data port; the density-time implementation scans the mask (here M[7], M[5], M[4], M[1] set) and sends only the enabled elements to the write data port]

Compress/Expand Operations

§ Compress packs non-masked elements from one vector register contiguously at start of destination vector register
- population count of mask vector gives packed vector length
§ Expand performs inverse operation
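The two operations are easy to pin down in Python, using the same mask pattern as the figure:

```python
# Sketch of compress/expand semantics.

def compress(src, mask):
    """Pack elements where the mask is set to the front of the destination;
    the packed length equals the population count of the mask."""
    return [x for x, m in zip(src, mask) if m]

def expand(packed, mask, background):
    """Inverse: scatter packed elements back to the masked positions,
    leaving background values where the mask is clear."""
    out, it = list(background), iter(packed)
    for i, m in enumerate(mask):
        if m:
            out[i] = next(it)
    return out

mask = [0, 1, 0, 0, 1, 1, 0, 1]     # M[1], M[4], M[5], M[7] set, as in the figure
A = ['A0', 'A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7']
B = ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7']
packed = compress(A, mask)           # length = popcount(mask) = 4
restored = expand(packed, mask, B)
```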

[Figure: with mask bits M[7], M[5], M[4], M[1] set, compress packs A[7], A[5], A[4], A[1] to the front of the destination; expand scatters a packed vector back to those positions, leaving the B values where the mask is clear]

Used for density-time conditionals and also for general selection operations

Vector Reductions

Problem: Loop-carried dependence on reduction variables
sum = 0;
for (i=0; i<N; i++)
  sum += A[i];  # Loop-carried dependence on sum
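The slide's solution code is truncated, but the standard recipe is to keep a vector of partial sums and then combine them with a binary tree by repeatedly halving the vector length. The sketch below is my reconstruction of that technique, not the slide's exact code:

```python
# Sketch of the usual vector-reduction recipe: accumulate VL partial sums
# in a vector (stripmined over A), then halve the vector length repeatedly
# to combine the partials in a binary tree.

def vector_sum(A, VL=8):
    partial = [0] * VL
    for i in range(0, len(A), VL):            # stripmined accumulation
        for j, x in enumerate(A[i:i + VL]):   # one vector add per strip
            partial[j] += x
    while VL > 1:                             # binary-tree combine: log2(VL) steps
        VL //= 2
        for j in range(VL):
            partial[j] += partial[j + VL]
    return partial[0]

total = vector_sum(list(range(100)))
```

The accumulation phase has no loop-carried dependence across elements, and the tree phase needs only log2(VL) dependent vector adds.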

Vector Scatter/Gather

Want to vectorize loops with indirect accesses:
for (i=0; i<N; i++)
  A[i] = B[i] + C[D[i]];

Indexed load instruction (Gather):
vld vD, xD        # Load indices in D vector
vldi vC, xC, vD   # Load indirect from xC base
vld vB, xB        # Load B vector
vfadd.d vA,vB,vC  # Do add
vsd vA, xA        # Store result

Vector Scatter/Gather

Histogram example:
for (i=0; i<N; i++)
  A[B[i]]++;

Is the following a correct translation?
vld vB, xB       # Load indices in B vector
vldi vA, xA, vB  # Gather initial A values
vadd vA, vA, 1   # Increment
vsdi vA, xA, vB  # Scatter incremented values
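The slide's question can be answered by running both versions. When an index repeats, both gathered copies read the old value and the scatters overwrite each other, so an update is lost:

```python
# Why gather/increment/scatter is NOT a correct histogram translation
# when indices repeat.

def histogram_scalar(A, B):
    A = list(A)
    for b in B:                      # sequential: A[B[i]]++
        A[b] += 1
    return A

def histogram_gather_scatter(A, B):
    A = list(A)
    vA = [A[b] for b in B]           # vldi: gather initial A values
    vA = [x + 1 for x in vA]         # vadd: increment
    for b, x in zip(B, vA):          # vsdi: scatter incremented values
        A[b] = x
    return A

A, B = [0, 0, 0], [1, 1, 2]          # index 1 appears twice
correct = histogram_scalar(A, B)      # A[1] counted twice
wrong = histogram_gather_scatter(A, B)  # both gathers saw the old A[1]
```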

Vector Memory Models

§ Most vector machines have a very relaxed memory model, e.g.
vsd v1, x1  # Store vector to x1
vld v2, x1  # Load vector from x1
- No guarantee that elements of v2 will have value of elements of v1 even when store and load execute by same processor!

§ Requires an explicit memory fence:
vsd v1, x1    # Store vector to x1
fence.vs.vl   # Enforce ordering s->l
vld v2, x1    # Load vector from x1

§ Vector machines support highly parallel memory systems (multiple lanes and multiple load and store units) with long latency (100+ clock cycles)
- hardware coherence checks would be prohibitively expensive
- vectorizing can eliminate most dependencies

A Recent Vector Super: NEC SX-9 (2008)

§ 65nm CMOS technology
§ Vector unit (3.2 GHz)
- 8 foreground VRegs + 64 background VRegs (256x64-bit elements/VReg)
- 64-bit functional units: 2 multiply, 2 add, 1 divide/sqrt, 1 logical, 1 mask unit
- 8 lanes (32+ FLOPS/cycle, 100+ GFLOPS peak per CPU)
- 1 load or store unit (8 x 8-byte accesses/cycle)
§ Scalar unit (1.6 GHz)
- 4-way superscalar with out-of-order and speculative execution
- 64KB I-cache and 64KB data cache
[©NEC]
• Memory system provides 256GB/s DRAM bandwidth per CPU
• Up to 16 CPUs and up to 1TB DRAM form shared-memory node
- total of 4TB/s bandwidth to shared DRAM memory
• Up to 512 nodes connected via 128GB/s network links (message passing between nodes)
[New announcement: SX-ACE, 4x16-lane vector CPUs on one chip]

Acknowledgements

§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
- Arvind (MIT)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)
