
CS252 Spring 2017 Graduate Computer Architecture
Lecture 9: Vector Supercomputers
Lisa Wu, Krste Asanovic
http://inst.eecs.berkeley.edu/~cs252/sp17

Last Time in Lecture 8
Overcoming the worst hazards in OoO superscalars:
• Branch prediction
  • Bimodal
  • Local / Branch History Table
  • Global / gselect, gshare
  • Tournament
  • Branch address cache (predict multiple branches per cycle)
  • Trace cache
  • Return address predictors
• Today: Load/Store Queues, Vector Supercomputers

Load-Store Queue Design
§ After control hazards, data hazards through memory are probably the next most important bottleneck to superscalar performance
§ Modern superscalars use very sophisticated load-store reordering techniques to reduce effective memory latency by allowing loads to be issued speculatively

Speculative Store Buffer
§ Just like register updates, stores should not modify memory until after the instruction is committed. A speculative store buffer is a structure introduced to hold speculative store data.
§ During decode, a store buffer slot is allocated in program order
§ Stores are split into "store address" and "store data" micro-operations
§ "Store address" execution writes the tag; "store data" execution writes the data
§ A store commits when it is the oldest instruction and both address and data are available:
  - clear the speculative bit and eventually move the data to the cache
§ On store abort:
  - clear the valid bit
[Figure: speculative store buffer entries (valid bit, speculative bit, address tag, data) sit in front of the L1 data cache; committed stores drain to the cache over the store commit path.]

Load bypass from speculative store buffer
[Figure: a load address is checked against the speculative store buffer in parallel with the L1 data cache tags; matching store data bypasses the cache to supply the load.]
§ If the data is in both the store buffer and the cache, which should we use?
  - The speculative store buffer
§ If the same address is in the store buffer twice, which should we use?
  - The youngest store older than the load

Memory Dependencies
  sd x1, (x2)
  ld x3, (x4)
§ When can we execute the load?
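To make the store-buffer discussion above concrete, here is a minimal C sketch of the load-bypass check, assuming a small array of entries with the valid/speculative/tag/data fields from the slide; the struct layout, field names, and function name are hypothetical and not part of the lecture. A load searches all older uncommitted stores for an address match and, if several match, takes the youngest store older than itself; otherwise it reads the L1 data cache.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SB_ENTRIES 6

    /* Hypothetical speculative store buffer entry: valid and speculative bits,
     * an address tag, the store data, and a program-order age (smaller = older). */
    typedef struct {
        bool     valid;
        bool     speculative;
        uint64_t addr;
        uint64_t data;
        uint64_t age;
    } sb_entry_t;

    /* Load-bypass check: search all stores older than the load for an address
     * match; if several match, use the youngest such store.  Returns true and
     * fills *data_out on a hit, false if the load must read the L1 data cache. */
    static bool sb_load_bypass(const sb_entry_t sb[SB_ENTRIES],
                               uint64_t load_addr, uint64_t load_age,
                               uint64_t *data_out)
    {
        bool found = false;
        uint64_t best_age = 0;
        for (int i = 0; i < SB_ENTRIES; i++) {
            if (!sb[i].valid) continue;
            if (sb[i].age >= load_age) continue;     /* only stores older than the load */
            if (sb[i].addr != load_addr) continue;   /* address (tag) match */
            if (!found || sb[i].age > best_age) {    /* keep the youngest older store */
                found     = true;
                best_age  = sb[i].age;
                *data_out = sb[i].data;
            }
        }
        return found;
    }

    int main(void)
    {
        /* Two speculative stores to the same address; a younger load should see
         * the data of the second (youngest older) store. */
        sb_entry_t sb[SB_ENTRIES] = {
            { true, true, 0x1000, 111, /* age */ 1 },
            { true, true, 0x1000, 222, /* age */ 3 },
        };
        uint64_t data;
        if (sb_load_bypass(sb, 0x1000, /* load age */ 5, &data))
            printf("bypass from store buffer: %llu\n", (unsigned long long)data); /* 222 */
        else
            printf("read from L1 data cache\n");
        return 0;
    }

In real designs the address comparison can be partial (e.g., low-order bits only), trading occasional false matches for less comparator hardware, as the conservative-check slide below notes.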
In-Order Memory Queue
§ Execute all loads and stores in program order
§ => A load or store cannot leave the ROB for execution until all previous loads and stores have completed execution
§ Can still execute loads and stores speculatively, and out of order with respect to other instructions
§ Need a structure to handle memory ordering…

Conservative O-o-O Load Execution
  sd x1, (x2)
  ld x3, (x4)
§ Can execute the load before the store, if the addresses are known and x4 != x2
§ Each load address is compared with the addresses of all previous uncommitted stores
  - can use a partial conservative check, i.e., bottom 12 bits of the address, to save hardware
§ Don't execute the load if any previous store address is not known
§ (MIPS R10K: 16-entry address queue)

Address Speculation
  sd x1, (x2)
  ld x3, (x4)
§ Guess that x4 != x2
§ Execute the load before the store address is known
§ Need to hold all completed but uncommitted load/store addresses in program order
§ If we subsequently find x4 == x2, squash the load and all following instructions
§ => Large penalty for inaccurate address speculation

Memory Dependence Prediction (Alpha 21264)
  sd x1, (x2)
  ld x3, (x4)
§ Guess that x4 != x2 and execute the load before the store
§ If we later find x4 == x2, squash the load and all following instructions, but mark the load instruction as store-wait
§ Subsequent executions of the same load instruction will wait for all previous stores to complete
§ Periodically clear the store-wait bits

Supercomputer Applications
§ Typical application areas
  - Military research (nuclear weapons, cryptography)
  - Scientific research
  - Weather forecasting
  - Oil exploration
  - Industrial design (car crash simulation)
  - Bioinformatics
  - Cryptography
§ All involve huge computations on large data sets
§ Supercomputers: CDC 6600, CDC 7600, Cray-1, …
§ In the 70s-80s, Supercomputer ≡ Vector Machine

Vector Supercomputers
§ Epitomized by the Cray-1, 1976 [© Cray Research, 1976]
§ Scalar Unit
  - Load/store architecture
§ Vector Extension
  - Vector registers
  - Vector instructions
§ Implementation
  - Hardwired control
  - Highly pipelined functional units
  - Interleaved memory system
  - No data caches
  - No virtual memory

Cray-1 Internals
[Photo: Cray-1 internals displayed at EPFL. Photograph by Rama, Wikimedia Commons, CC-BY-SA-2.0-FR]

Vector Programming Model
§ Scalar registers x0-x31; vector registers v0-v31, each holding elements [0] through [VLRMAX-1]
§ Vector length register VLR gives the number of elements operated on
§ Vector arithmetic instructions, e.g. vadd v3, v1, v2, apply the operation element-wise to the first VLR elements
§ Vector load and store instructions, e.g. vld v1, x1, x2, move VLR elements between memory and a vector register using a base address (x1) and a stride (x2)

Vector Code Example
# C code
    for (i=0; i<64; i++)
      C[i] = A[i] + B[i];

# Scalar Code
          li     x4, 64
    loop: fld    f1, 0(x1)
          fld    f2, 0(x2)
          fadd.d f3, f1, f2
          fsd    f3, 0(x3)
          addi   x1, x1, 8
          addi   x2, x2, 8
          addi   x3, x3, 8
          subi   x4, x4, 1
          bnez   x4, loop

# Vector Code
          li      x4, 64
          setvlr  x4
          vfld    v1, x1
          vfld    v2, x2
          vfadd.d v3, v1, v2
          vfsd    v3, x3
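As a rough functional model of the vector constructs used in the example above, the C sketch below mimics a strided vector load, an element-wise vector add governed by the vector length register, and a store back to memory. VLRMAX, the helper names, and the base/stride-in-bytes convention are assumptions for illustration only, not the lecture's definitions.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    #define VLRMAX 64                      /* maximum vector length, Cray-1 style */

    typedef struct { double elem[VLRMAX]; } vreg_t;   /* one vector register */

    /* vld v1, x1, x2 : load vlr elements from memory, starting at 'base' and
     * advancing 'stride' bytes between elements (unit stride = sizeof(double)). */
    static void vld(vreg_t *vd, const uint8_t *base, ptrdiff_t stride, int vlr)
    {
        for (int i = 0; i < vlr; i++)
            vd->elem[i] = *(const double *)(base + (ptrdiff_t)i * stride);
    }

    /* vfadd.d v3, v1, v2 : element-wise add of the first vlr elements; every
     * element operation is independent, which is what lets hardware pipeline
     * them deeply or spread them across lanes. */
    static void vfadd_d(vreg_t *vd, const vreg_t *va, const vreg_t *vb, int vlr)
    {
        for (int i = 0; i < vlr; i++)
            vd->elem[i] = va->elem[i] + vb->elem[i];
    }

    /* Store counterpart, same base/stride convention. */
    static void vsd(const vreg_t *vs, uint8_t *base, ptrdiff_t stride, int vlr)
    {
        for (int i = 0; i < vlr; i++)
            *(double *)(base + (ptrdiff_t)i * stride) = vs->elem[i];
    }

    int main(void)
    {
        double A[64], B[64], C[64];
        for (int i = 0; i < 64; i++) { A[i] = i; B[i] = 2.0 * i; }

        vreg_t v1, v2, v3;
        int vlr = 64;                                       /* setvlr x4 */
        vld(&v1, (const uint8_t *)A, sizeof(double), vlr);  /* vfld v1, x1 */
        vld(&v2, (const uint8_t *)B, sizeof(double), vlr);  /* vfld v2, x2 */
        vfadd_d(&v3, &v1, &v2, vlr);                        /* vfadd.d v3, v1, v2 */
        vsd(&v3, (uint8_t *)C, sizeof(double), vlr);        /* vfsd v3, x3 */

        printf("C[10] = %g\n", C[10]);                      /* expect 30 */
        return 0;
    }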
Cray-1 (1976)
[Figure: Cray-1 register and functional-unit block diagram]
§ 8 vector registers (V0-V7), 64 elements each; vector length and vector mask registers
§ 8 scalar registers (S0-S7) backed by 64 T registers; 8 address registers (A0-A7) backed by 64 B registers
§ Functional units: FP add, FP multiply, FP reciprocal; integer add, logic, shift, population count; address add and address multiply
§ Single-port memory, 16 banks of 64-bit words, 8-bit SECDED
§ 80 MW/sec data load/store; 320 MW/sec instruction buffer refill; 4 instruction buffers (64-bit x 16)
§ Memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz)

Vector Instruction Set Advantages
§ Compact
  - one short instruction encodes N operations
§ Expressive, tells hardware that these N operations:
  - are independent
  - use the same functional unit
  - access disjoint registers
  - access registers in the same pattern as previous instructions
  - access a contiguous block of memory (unit-stride load/store)
  - access memory in a known pattern (strided load/store)
§ Scalable
  - can run the same code on more parallel pipelines (lanes)

Vector Arithmetic Execution
• Use a deep pipeline (=> fast clock) to execute element operations
• Simplifies control of the deep pipeline because elements in a vector are independent (=> no hazards!)
[Figure: a six-stage multiply pipeline streaming elements of v1 and v2 to compute v3 <- v1 * v2]

Vector Instruction Execution
  vfadd.d vc, va, vb
[Figure: execution using one pipelined functional unit completes one element of C per cycle; execution using four pipelined functional units stripes the elements across the pipelines and completes four elements of C per cycle.]

Interleaved Vector Memory System
§ Bank busy time: time before a bank is ready to accept the next request
§ Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency
[Figure: a base+stride address generator feeds 16 interleaved memory banks (0-F), which supply the vector registers.]

Vector Unit Structure
[Figure: four lanes, each containing a slice of the vector register file and one pipeline of each functional unit; elements 0, 4, 8, … map to lane 0, elements 1, 5, 9, … to lane 1, and so on; all lanes share the memory subsystem.]

T0 Vector Microprocessor (UCB/ICSI, 1995)
§ Vector register elements are striped over the lanes
[Figure: die photo of T0 with eight lanes; lane i holds elements i, i+8, i+16, and i+24 of each vector register.]

Vector Instruction Parallelism
§ Can overlap execution of multiple vector instructions (see the sketch after the chaining slide below)
  - example machine has 32 elements per vector register and 8 lanes
[Figure: the load, multiply, and add units each work on a different vector instruction in the same cycle.]
§ Complete 24 operations/cycle while issuing 1 short instruction/cycle

Vector Chaining
§ Vector version of register bypassing
  - introduced with the Cray-1
    vld   v1
    vfmul v3, v1, v2
    vfadd v5, v3, v4
[Figure: results chain from the load unit into the multiplier, and from the multiplier into the adder, without waiting for whole vectors to complete.]
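As a small C sketch of the two mappings described above, under assumed parameters (4 lanes as in the vector unit figure, 16 banks as in the Cray-1 memory system; the simple modulo mappings and function names are illustrative, not taken from the lecture), the fragment below prints which lane handles each element and which bank a unit-stride stream of 64-bit words touches.

    #include <stdio.h>
    #include <stdint.h>

    #define NUM_LANES 4    /* as in the "Vector Unit Structure" figure */
    #define NUM_BANKS 16   /* Cray-1 style interleaved memory */

    /* Element i of a vector instruction is handled by lane i mod NUM_LANES,
     * so lane 0 sees elements 0, 4, 8, ... and lane 1 sees 1, 5, 9, ... */
    static int lane_of_element(int i) { return i % NUM_LANES; }

    /* With 64-bit (8-byte) words interleaved word-by-word across banks,
     * the bank index is the word address modulo the number of banks. */
    static int bank_of_address(uint64_t byte_addr) {
        return (int)((byte_addr / 8) % NUM_BANKS);
    }

    int main(void) {
        uint64_t base = 0x1000, stride = 8;           /* unit-stride doubles */
        for (int i = 0; i < 8; i++) {
            uint64_t addr = base + (uint64_t)i * stride;
            printf("element %d: lane %d, bank %d\n",
                   i, lane_of_element(i), bank_of_address(addr));
        }
        return 0;   /* with unit stride, successive elements hit successive banks */
    }

With this mapping a unit-stride access visits all 16 banks before reusing one, which is why the 4-cycle bank busy time on the earlier slide does not throttle unit-stride streams; strides that share a common factor with the bank count would revisit the same banks sooner.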
Vector Chaining Advantage
• Without chaining, must wait for the last element of a result to be written before starting a dependent instruction
[Figure: load, multiply, and add execute strictly back-to-back in time.]
• With chaining, can start the dependent instruction as soon as the first result appears
[Figure: the multiply overlaps the tail of the load, and the add overlaps the tail of the multiply.]

Vector Startup
§ Two components of vector startup penalty
  - functional unit latency (time through the pipeline)
  - dead time or recovery time (time before another vector instruction can start down the pipeline)
[Figure: pipeline diagram (R X X X W per element) of two back-to-back vector instructions; the first element's trip through the pipeline shows the functional unit latency, and the gap before the second instruction's first element can enter shows the dead time.]
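To see how the two startup components combine, here is a simplified back-of-the-envelope timing model in C; the formula and all of the constants are assumptions chosen for illustration, not Cray-1 or T0 figures from the lecture.

    #include <stdio.h>

    /* Simplified timing model for back-to-back vector instructions sharing
     * one functional unit; example parameters only. */
    int main(void) {
        int vector_length = 64;   /* elements per vector instruction */
        int lanes         = 4;    /* elements issued per cycle */
        int fu_latency    = 6;    /* cycles for the first element to traverse the pipeline */
        int dead_time     = 4;    /* recovery cycles before the next vector instruction */

        /* One vector instruction: pipeline fill plus one issue group per cycle. */
        int issue_cycles = (vector_length + lanes - 1) / lanes;
        int one_instr    = fu_latency + issue_cycles;

        /* Two independent instructions on the same unit, separated by dead time. */
        int two_instr = one_instr + dead_time + issue_cycles;

        printf("one vector instruction : %d cycles\n", one_instr);
        printf("two back-to-back       : %d cycles\n", two_instr);
        printf("efficiency, 2nd case   : %.1f%% of peak lane throughput\n",
               100.0 * (2.0 * vector_length) / ((double)two_instr * lanes));
        return 0;
    }

The model makes the slide's point numerically: for long vectors the per-element cost dominates, but latency and dead time eat a growing fraction of the time as vectors get shorter, which is why dead time matters.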