Pipelining, Superscalar, Multiprocessors

Admin Homework 6 due Wed Projects due Wed Final Tuesday April 28, 7pm-10pm Review (Jie will have one to work problems) Pipelining, Superscalar, Multiprocessors CPS 104 cps 104 pipeline.1 ©ARL 2001 cps 104 pipeline.2 ©ARL 2001 Single Cycle, Multiple Cycle, vs. Pipeline Pipelining Summary Cycle 1 Cycle 2 Clk Single Cycle Implementation: Most modern processors use pipelining Load Store Waste Pentium 4 has 24 (35) stage pipeline! Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Intel Core 2 duo has 14 stages Clk Alpha 21164 has 7 stages Multiple Cycle Implementation: Pipelining creates more headaches for exceptions, etc… Load Store R-type Ifetch Reg Exec Mem Wr Ifetch Reg Exec Mem Ifetch Pipelining augmented with superscalar capabilities Pipeline Implementation: Load Ifetch Reg Exec Mem Wr Store Ifetch Reg Exec Mem Wr R-type Ifetch Reg Exec Mem Wr cps 104 pipeline.3 ©ARL 2001 cps 104 pipeline.4 ©ARL 2001 Superscalar Processors Superscalar Processors Key idea: execute more than one instruction per cycle Key Challenge: Finding the independent instructions Instruction level parallelism (ILP) Pipelining exploits parallelism in the “stages” of instruction execution Superscalar exploits parallelism of independent instructions Option 1: Compiler Example Code: Static scheduling (Alpha 21064, 21164; UltraSPARC I, II; sub $2, $1, $3 Pentium) and $12, $3, $5 or $13, $6, $2 Option 2: Hardware add $3, $3, $2 Dynamic Scheduling (Alpha 21264; PowerPC; Pentium Pro, sw $15, 100($2) 3, 4) Out-of-order instruction processing Superscalar Execution sub $2, $1, $3 and $12, $3, $5 or $13, $6, $2 add $3, $3, $2 sw $15, 100($2) cps 104 pipeline.5 ©ARL 2001 cps 104 pipeline.6 ©ARL 2001 Instruction Level Parallelism Exposing Instruction Level Parallelism Problems: Branch prediction Program structure: branch every 4-8 instructions Hardware can remember if branch was taken Limited number of registers Next time it sees the branch it uses this to predict outcome Static scheduling: compiler must find and move instructions from other basic blocks Register renaming Indirection! The CS solution to almost everything Dynamic scheduling: Hardware creates a big “window” to look During decode, map register name to real register location for independent instructions New location allocated when new value is written to reg. Must know branch directions before branch is executed! Determines true dependencies. Example Code: sub $2 , $1, $3 # writes $2 = $p1 Example Code: sub $2 , $1, $3 and $12, $3, $2 # reads $p1 and $12, $3, $2 or $2 , $6, $4 # writes $2 = $p3 or $2 , $6, $4 add $3, $3, $2 # reads $p3 add $3, $3, $2 sw $15, 100( $2 ) # reads $p3 sw $15, 100($2) cps 104 pipeline.7 ©ARL 2001 cps 104 pipeline.8 ©ARL 2001 CPU design Summary Additional Notes All Modern CPUs use pipelines. Disadvantages of the Single Cycle Processor Many CPUs have 8-12 pipeline stages. Long cycle time The latest generation processors (Pentium-4, PowerPC G4, SUN’s Cycle time is too long for all instructions except the Load UltraSPARC) use multiple pipelines to get higher speed ( Superscalar design). Multiple Clock Cycle Processor : Divide the instructions into smaller steps The course: CPS220: Advanced Computer Architecture I covers Superscalar processors. Execute each step (instead of the entire instruction) in one cycle Now, Parallel Architectures… Pipeline Processor : Natural enhancement of the multiple clock cycle processor Each functional unit can only be used once per instruction If a instruction is going to use a functional unit: it must use it at the same stage as all other instructions Pipeline Control : Each stage’s control signal depends ONLY on the instruction that is currently in that stage cps 104 pipeline.9 ©ARL 2001 cps 104 pipeline.10 ©ARL 2001 What is Parallel Computer Architecture? Parallel Computation: Why and Why Not? A Parallel Computer is a collection of processing elements that Pros cooperate to solve large problems fast Performance how large a collection? Cost-effectiveness (commodity parts) how powerful are the elements? Smooth upgrade path how does it scale up? Fault Tolerance how do they cooperate and communicate? Cons how is data transmitted between processors? Difficult to parallelize applications what are the primitive abstractions? Requires automatic parallelization or parallel program development how does it all translate to performance? Software! AAHHHH! cps 104 pipeline.11 ©ARL 2001 cps 104 pipeline.12 ©ARL 2001 §7.2 The Difficulty The Difficulty Parallel Creating of §7.2Processing Simple Problem Parallel Programming Parallel software is the problem for i = 1 to N A[i] = (A[i] + B[i]) * C[i] Need to get significant performance sum = sum + A[i] improvement Split the loops Otherwise, just use a faster uniprocessor, Independent iterations since it’s easier! for i = 1 to N A[i] = (A[i] + B[i]) * C[i] Difficulties Programs for i = 1 to N Partitioning sum = sum + A[i] Coordination Communications overhead Chapter 7 — Multicores, Multiprocessors, and Clusters — 14 cps 104 pipeline.13 ©ARL 2001 cps 104 pipeline.14 ©ARL 2001 Amdahl’s Law Small Scale Shared Memory Multiprocessors Sequential part can limit speedup Cache(s) P P P P P P P P and TLB $ $ $ $ $ $ $ $ Example: 100 processors, 90× speedup? Tnew = Tparallelizable /100 + Tsequential Main Memory 1 Speedup = = 90 0 N-1 − + (1 Fparalleliz able ) Fparalleliz able /100 Small number of processors connected to one shared memory Solving: F = 0.999 parallelizable Memory is equidistant from all processors (UMA) Kernel can run on any processor (symmetric MP) Need sequential part to be 0.1% of original time Intel dual/quad Pentium Multicore Chapter 7 — Multicores, Multiprocessors, and Clusters — 15 cps 104 pipeline.15 ©ARL 2001 cps 104 pipeline.16 ©ARL 2001 §7.11 Real Stuff: Benchmarking Four Multicores RealStuff: BenchmarkingMulticores … Four §7.11 Four Example Systems Four Example Systems 2 × quad-core Intel Xeon e5345 2 × oct-core (Clovertown) Sun UltraSPARC T2 5140 (Niagara 2) 2 × quad-core AMD Opteron X4 2356 2 × oct-core (Barcelona) IBM Cell QS20 Chapter 7 — Multicores, Multiprocessors, and Clusters — 17 Chapter 7 — Multicores, Multiprocessors, and Clusters — 18 cps 104 pipeline.17 ©ARL 2001 cps 104 pipeline.18 ©ARL 2001 Cache Coherence Problem (Initial State) Cache Coherence Problem (Step 1) P1 P2 P1 P2 ld r2, x Time Time BUS BUS x x Main Memory Main Memory cps 104 pipeline.19 ©ARL 2001 cps 104 pipeline.20 ©ARL 2001 Cache Coherence Problem (Step 2) Cache Coherence Problem (Step 3) P1 P2 ld r2, x P1 P2 ld r2, x ld r2, x ld r2, x add r1, r2, r4 Time Time st x, r1 BUS BUS x x Main Memory Main Memory cps 104 pipeline.21 ©ARL 2001 cps 104 pipeline.22 ©ARL 2001 §7.4 Clusters and Other Clusters Message-PassingOther and§7.4 Multiproces Message Passing Loosely Coupled Clusters Network of independent computers Each processor has private physical Each has private memory and OS address space Connected using I/O system E.g., Ethernet/switch, Internet Hardware sends/receives messages Suitable for applications with independent tasks between processors Web servers, databases, simulations, … High availability, scalable, affordable Problems sors Administration cost (prefer virtual machines) Low interconnect bandwidth c.f. processor/memory bandwidth on an SMP Chapter 7 — Multicores, Multiprocessors, and Clusters — 23 Chapter 7 — Multicores, Multiprocessors, and Clusters — 24 cps 104 pipeline.23 ©ARL 2001 cps 104 pipeline.24 ©ARL 2001 §7.5 Hardware Multithreading Hardware §7.5 Grid Computing Multithreading Performing multiple threads of execution in parallel Separate computers interconnected by long-haul networks Replicate registers, PC, etc. Fast switching between threads E.g., Internet connections Fine-grain multithreading Work units farmed out, results sent back Switch threads after each cycle Interleave instruction execution Can make use of idle time on PCs If one thread stalls, others are executed E.g., SETI@home, World Community Grid Coarse-grain multithreading Only switch on long stall (e.g., L2-cache miss) Simplifies hardware, but doesn’t hide short stalls (eg, data hazards) Chapter 7 — Multicores, Multiprocessors, and Clusters — 25 Chapter 7 — Multicores, Multiprocessors, and Clusters — 26 cps 104 pipeline.25 ©ARL 2001 cps 104 pipeline.26 ©ARL 2001.

Load more