CPS 104 Computer Organization and Programming, Lecture 20: Superscalar Processors, Multiprocessors

Robert Wagner

Faster and Faster Processors

● So much to do, so little time...

● How can we make programs execute faster?

◆ Faster clock => more instructions/second. (technology constraints)

◆ Pipelining => faster clock

◆ Execute more than one instruction per cycle (Superscalar)

◆ Use multiple processors and divide the computation (Multiprocessors, Clustered computing)

Multiple Pipelines: Floating Point

[Figure: multiple execution pipelines sharing the IF, ID/RF, and WB stages: an integer pipeline (EX, MEM), a seven-stage floating-point multiply pipeline (M1-M7), a four-stage floating-point add pipeline (A1-A4), and an FP/INT divide unit that is not pipelined and takes 25 clocks.]

Superscalar Design (CPI < 1)

● Pipelining can get CPI = 1 and a fast clock. Can we do better?
● Superscalar design: execute multiple instructions every clock.
● Problems for superscalar design:
◆ Need multiple execution units (pipelines)
◆ Structural hazards:
➤ Need multiple accesses to register files
➤ Might need multiple accesses to caches
◆ Data hazards:
➤ How to deal with data dependencies (keep program semantics)?
➤ What to do with stalled instructions?
◆ Control hazards:
➤ What to do about conditional branches?

Superscalar Design Solutions

● Multiple pipelines are not a problem; we already had them in the "regular" design.
● Structural hazards: build register files with many read and write ports (e.g., 7 read and 3 write ports); build multi-port caches.
● Data hazard solutions:
◆ Issue instructions in order; use a score-board to eliminate data hazards by stalling instructions.
◆ "Better solution": issue instructions out of order, use register renaming to avoid data hazards, graduate instructions in order (see the sketch below).
● Control hazard solutions:
◆ Use branch prediction:
◆ Make sure that the branch is resolved before registers are modified,
◆ OR use speculation and roll back results if branches were predicted wrong.
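To make the hazard discussion concrete, here is a hypothetical C fragment (not from the lecture); the comments assume a 2-wide issue machine purely for illustration.

```c
/* Hypothetical example (not from the slides): which statements a 2-wide
 * superscalar could issue together, assuming each line maps to one ALU op. */
int combine(int a, int b, int c, int d) {
    int x = a + b;   /* independent of the next line                        */
    int y = c + d;   /* no hazard with x = a + b: the two can dual-issue    */
    int z = x * y;   /* RAW hazard on x and y: must wait (stall, or an      */
                     /* out-of-order machine issues something else first)   */
    int w = c - d;   /* independent: an out-of-order core can issue this    */
                     /* ahead of z even though it appears later in program  */
                     /* order; renaming keeps the program semantics intact  */
    return z + w;
}
```

An in-order, scoreboarded machine simply stalls at z; an out-of-order machine issues w early and still graduates everything in program order.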

The Alpha 21164 Superscalar

● Can issue up to four instructions per clock cycle

● Deep pipeline: 7-stage integer, 9-stage floating point, up to 13 stages for an on-chip load/store.

● There are two Integer and two Floating-point pipelines.

● In-order issue, in-order execution.

● Use score-board to stall instructions with conflicts.

● Use score-board to compute all register forwarding operations.

● The integer register file has 4 read ports and 2 write ports.

● The floating-point register file has 6 read ports and 3 write ports.

● Use Branch Prediction to keep the pipe full.
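For intuition, here is a minimal sketch of a generic 2-bit saturating-counter branch predictor; the table size and indexing are illustrative assumptions, not the 21164's actual design.

```c
#include <stdbool.h>

/* Generic 2-bit saturating-counter predictor (illustrative; not the 21164's
 * exact scheme). Counter values 0-1 predict not-taken, 2-3 predict taken.   */
#define PHT_ENTRIES 2048
static unsigned char pht[PHT_ENTRIES];          /* pattern history table     */

bool predict_taken(unsigned pc) {
    return pht[(pc >> 2) % PHT_ENTRIES] >= 2;   /* taken if counter is 2-3   */
}

void train(unsigned pc, bool taken) {
    unsigned char *c = &pht[(pc >> 2) % PHT_ENTRIES];
    if (taken  && *c < 3) (*c)++;               /* strengthen toward taken   */
    if (!taken && *c > 0) (*c)--;               /* strengthen toward not taken */
}
```

The two-bit counter tolerates one mispredicted iteration (e.g., a loop exit) without flipping the prediction for the next encounter, which helps keep the deep pipe full.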

The Alpha 21164 Superscalar Pipeline

[Figure: block diagram of the 21164. Front end: PC logic, instruction TLB, instruction and refill buffers, next-index logic, slot logic, issue logic, and scoreboard logic. Floating-point side: FP register file, FP add & divide pipeline (with FP divider), FP multiply pipeline (pipeline 0), and FP store data path. Integer side: integer register file, integer pipelines 1 and 2, integer multiply logic, and integer store path. Memory side: data TLB, data cache with store & fill paths to the FP units, and the level-2 cache.]

Alpha 21164 Pipeline Stages

Instruction fetch and issue (S0-S3):
S0: read instruction cache
S1: buffer instruction
S2: slot: steer to execution pipeline
S3: determine instruction issue, fetch integer registers

Integer pipeline (S4-S6):
S4: first integer pipeline stage
S5: second integer pipeline stage
S6: write integer register file

Floating-point pipeline (S4-S8):
S4: read floating-point registers
S5: first floating-point pipeline stage
S6-S7: remaining floating-point pipeline stages
S8: last floating-point stage, write register

Memory access pipeline (S4-S12):
S4: calculate virtual address, begin data cache read
S5: end data cache read, translate to physical address
S6: use data; write store to cache; start L2 tag access
S7: end L2 tag access
S8: start L2 data access
S9: end L2 data access
S10: begin data cache fill
S11: end data cache fill
S12: use L2 data

MIPS R10000 Superscalar

● Issues 4 instructions at a time
● Has 5 execution units: 2 FP units, 2 integer units, and a load/store unit
● Out-of-order execution
● Speculative execution: predicts up to four branches at a time

MIPS R10000 CPU

[Figure: block diagram of the R10000. Floating-point side: FP queue (16 entries), FP register file (64x64, 5 read / 3 write ports), FP add pipeline (align, add/normalize, pack), FP multiply pipeline (multiply, sum/normalize, pack), and divide and square-root units. Front end: 32 KB instruction cache with pre-decode, instruction decode, register map table, free list, active list (32 entries), busy-bit table, and branch unit. Memory side: load-store queue (16 entries), address calculation, TLB (64x2 entries), and 32 KB data cache. Integer side: integer queue (16 entries), integer register file (64x64, 7 read / 3 write ports), and two integer ALUs.]

MIPS R10000 Pipelines

[Figure: R10000 pipelines (stages 1-7 after I-fetch, I-decode, I-issue). A common fetch/decode/map/write pipeline fills the instruction queues with 4 instructions in parallel; up to 4 branch instructions are predicted, with speculative fetching from the I-cache. Execution pipelines: floating-point add (issue, RF, alignment, add, pack, WB) and floating-point multiply (issue, RF, multiply, sum product, pack, WB), each with latency 2; load/store (issue, RF, address calc, D-cache, load WB), integer or FP, latency 2; two integer ALU pipelines (issue, RF, ALU, WB), latency 1.]

What is a Parallel Computer?

● A parallel computer is a collection of processing elements that cooperate to solve large problems fast
◆ how large a collection?
◆ how powerful are the elements?
◆ how does it scale up?
◆ how do they cooperate and communicate?
◆ how is data transmitted between processors?
◆ what are the primitive abstractions?
◆ how does it all translate to performance?

Parallel Computation: Why and Why Not?

● Pros
◆ Performance
◆ Cost-effectiveness (commodity parts)
◆ Smooth upgrade path
◆ Fault tolerance
● Cons
◆ Difficult to parallelize applications
◆ Requires automatic parallelization or parallel program development
◆ Software! AAHHHH!

Applications: Science and Engineering

● Examples
◆ Weather prediction
◆ Evolution of galaxies
◆ Oil reservoir simulation
◆ Automobile crash tests
◆ Drug development
◆ VLSI CAD
◆ Nuclear BOMBS!
● Typically model physical systems or phenomena
● Problems are 2D or 3D
● Usually requires "number crunching"
● Involves "true" parallelism

Applications: Commercial

● Examples
◆ Transaction processing
◆ Databases
◆ Financial models
● Involves data movement, not much number crunching
● Involves throughput parallelism

Applications: Multi-media/Home

● Examples
◆ speech recognition
◆ data compression/decompression
◆ 3D graphics
● Will become ubiquitous
● Involves everything (crunching, data movement, true parallelism, and throughput parallelism)

Flynn Taxonomy

● SISD
◆ Single Instruction, Single Data
◆ Standard sequential machines
● SIMD
◆ Single Instruction, Multiple Data
◆ Early "vector" computers -- CRAY 1, CDC Star
◆ On a single chip today, multimedia (decompression)
◆ Special applications (graphics, image processing, cryptography); see the loop sketch below
● MIMD
◆ Multiple Instruction, Multiple Data
◆ Most of today's parallel machines
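The kind of loop SIMD hardware targets looks like the following C sketch (illustrative, not from the slides): one instruction stream applies the same operation to many data elements.

```c
/* Illustrative sketch: the same operation applied to every element -- exactly
 * what a SIMD machine executes as a single instruction over multiple data
 * lanes (e.g., brightening pixels during multimedia decompression).          */
void brighten(unsigned char *pixel, int n, int delta) {
    for (int i = 0; i < n; i++) {
        int v = pixel[i] + delta;            /* one operation, many pixels    */
        pixel[i] = v > 255 ? 255 : v;        /* saturating add on SIMD hw     */
    }
}
```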

Message Passing Architectures

[Figure: four nodes (Node 0 through Node 3), each with a processor (P), memory (Mem, addresses 0..N-1), cache ($), and communication assist (CA), connected by an interconnect.]
● Cannot directly access memory on another node
● IBM SP-2, Intel Paragon
● Cluster of workstations

Message Passing Programming Model

[Figure: Process P and Process Q, each with its own local address space. P executes "Send x, Q, t" from address x; Q executes "Recv y, P, t" into address y; the send and receive are matched by process and tag.]
● User-level send/receive abstraction
◆ local buffer (x, y), process (Q, P), and tag (t)
◆ naming and synchronization
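A minimal MPI sketch of this send/receive abstraction (the lecture does not prescribe MPI; the buffer names and tag value are illustrative):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, tag = 7;
    double x = 3.14, y = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {            /* plays "Process P": Send x, Q, t          */
        MPI_Send(&x, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {     /* plays "Process Q": Recv y, P, t          */
        MPI_Recv(&y, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received y = %f\n", y);
    }

    MPI_Finalize();
    return 0;
}
```

Run with at least two processes (e.g., mpirun -np 2): the matching send and receive name each other by rank and tag, which is the naming and synchronization the slide refers to.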

Single Shared Address Space

[Figure: the machine's physical address space. Each process P0..Pn maps a common set of physical addresses (the shared portion of its address space) plus its own private portion (P0 private, P1 private, P2 private, ..., Pn private); loads and stores to the shared portion reach the same physical locations.]
● Communication, sharing, and synchronization with store/load on shared variables (see the threads sketch below)
● Must map virtual pages to physical page frames
● Consider OS support for a good mapping
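A minimal sketch, assuming POSIX threads, of communicating through loads and stores on a shared variable (variable and function names are illustrative):

```c
#include <pthread.h>
#include <stdio.h>

static long shared_sum = 0;                    /* lives in the shared portion */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long my_part = (long)arg;
    pthread_mutex_lock(&lock);                 /* synchronization             */
    shared_sum += my_part;                     /* plain store to shared       */
    pthread_mutex_unlock(&lock);               /* memory: both threads see    */
    return NULL;                               /* the same address            */
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)10L);
    pthread_create(&t2, NULL, worker, (void *)32L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_sum = %ld\n", shared_sum);  /* 42: both updates visible    */
    return 0;
}
```

Unlike message passing, nothing is sent explicitly: the threads communicate simply by loading and storing the same shared location.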

Small Scale Multiprocessors

[Figure: several processors (P), each with its own cache(s) and TLB ($), sharing a bus to one main memory (addresses 0..N-1).]

● Small number of processors connected to one shared memory
● Memory is equidistant from all processors (UMA)
● Kernel can run on any processor (symmetric MP)

Cache Coherence Problem (Initial State)

[Figure: two processors, P1 and P2, each with a private cache, connected by a bus to main memory; location x resides in main memory and is not yet cached.]

Cache Coherence Problem (Step 1)

[Figure: one processor executes ld r2, x; a copy of x is brought from main memory into its cache.]

Cache Coherence Problem (Step 2)

[Figure: the other processor also executes ld r2, x; both caches now hold copies of x.]

Cache Coherence Problem (Step 3)

[Figure: one processor then computes add r1, r2, r4 and executes st x, r1, updating the value of x in its own cache; the copy of x in the other processor's cache (and in main memory) is now stale.]
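A toy walkthrough of the three steps (not from the slides), modeling each private cache as a plain variable to show how the copies diverge when there is no coherence protocol:

```c
#include <stdio.h>

int main(void) {
    int memory_x = 5;               /* location x in main memory             */
    int cache_p1, cache_p2;         /* per-processor cached copies of x      */

    cache_p1 = memory_x;            /* step 1: ld r2, x -> one cache gets x  */
    cache_p2 = memory_x;            /* step 2: ld r2, x -> both caches do    */

    int r1 = cache_p1 + 4;          /* step 3: add r1, r2, r4                */
    cache_p1 = r1;                  /*         st x, r1 updates only the     */
                                    /*         writer's cached copy          */

    printf("writer sees x = %d, other cache sees x = %d, memory x = %d\n",
           cache_p1, cache_p2, memory_x);   /* 9, 5, 5 -> incoherent         */
    return 0;
}
```

Real shared-memory hardware prevents this outcome with a cache coherence protocol, which is the subject of the next slides.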

Snoopy Cache-Coherence Protocols

● Bus provides a serialization point for consistency
● Each cache controller "snoops" all bus transactions
◆ a transaction is relevant if it is for a block the cache contains
◆ take action to ensure coherence (see the state sketch below)
➤ invalidate
➤ update
➤ supply value
◆ action depends on the state of the block and the protocol
● Simultaneous operation of independent controllers
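A minimal sketch of one common snoopy scheme, a three-state (MSI) write-invalidate protocol; the lecture does not commit to a particular protocol, so the states and transitions here are the generic textbook ones.

```c
/* Per-block cache line state in a simple MSI write-invalidate protocol. */
typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

/* Action taken when this cache snoops another processor's bus transaction. */
line_state_t snoop(line_state_t st, int bus_read, int bus_write) {
    if (bus_write)                   /* another cache wants to write the block */
        return INVALID;              /* -> invalidate our copy                 */
    if (bus_read && st == MODIFIED)  /* we hold the only up-to-date copy       */
        return SHARED;               /* -> supply the value, drop to shared    */
    return st;                       /* otherwise nothing to do                */
}

/* Action taken on this processor's own access to the block. */
line_state_t cpu_access(line_state_t st, int is_write) {
    if (is_write)                    /* put an invalidating transaction on the */
        return MODIFIED;             /* bus, then write: exclusive ownership   */
    return (st == INVALID) ? SHARED : st;   /* read miss -> fetch as shared    */
}
```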

Large Scale Shared Memory Multiprocessors

[Figure: many nodes, each containing processor(s) (P), memory (Mem), cache ($), and a memory controller/network interface (Cntrl/NI), connected by a general interconnect.]
● 100s to 1000s of nodes (processors) with a single shared physical address space
● Use a general-purpose interconnection network
◆ still have a cache coherence protocol
◆ use messages instead of bus transactions
◆ no hardware broadcast
● Communication assist

Directory Based Cache Coherence

● Avoid broadcasting the request to all nodes on a miss
◆ traffic
◆ time
● Maintain a directory of which nodes have cached copies of the block (directory controller + directory state); see the sketch below
● On a miss, send a message to the directory
● The directory determines what (if any) protocol action is required
◆ e.g., invalidation
● The directory waits for the protocol actions to finish and then responds to the original request
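A minimal sketch of what a full-bit-vector directory entry and a write-miss action might look like; the field names, node limit, and omitted network sends are illustrative assumptions, not details from the lecture.

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_NODES 64                 /* assumed machine size for a 64-bit map */

/* Directory state kept for one memory block at its home node. */
typedef struct {
    uint64_t sharers;   /* bit i set => node i holds a cached copy           */
    bool     dirty;     /* true => exactly one node holds a modified copy    */
    uint8_t  owner;     /* valid only when dirty: which node owns the block  */
} dir_entry_t;

/* On a write miss from 'node': invalidate every other cached copy, then
 * record the requester as the exclusive (dirty) owner.                      */
void handle_write_miss(dir_entry_t *e, unsigned node) {
    for (unsigned i = 0; i < MAX_NODES; i++) {
        if (((e->sharers >> i) & 1) && i != node) {
            /* send an invalidation message to node i and await its ack
             * (network send omitted in this sketch)                         */
        }
    }
    e->sharers = 1ULL << node;
    e->dirty   = true;
    e->owner   = (uint8_t)node;
}
```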

Today's Parallel Computer Architecture

● Extension of traditional computer architecture to support communication and cooperation
◆ Communications architecture

[Figure: layers of the communications architecture. User level: programming models (shared memory, message passing, data parallel, multiprogramming). System support level: the communication abstraction provided by libraries and the operating system. Below the hardware/software boundary: the communication hardware and the physical communication medium.]

Toward a Generic Parallel Machine

[Figure: four nodes (Node 0 through Node 3), each with processor(s) (P), memory (Mem), cache ($), and a communication assist (CA), connected by an interconnect.]
● Separation of programming models from architectures
● All models require communication
● Node with processor(s), memory, and communication assist
