Admin
Homework 6 due Wed
Projects due Wed
Final Tuesday April 28, 7pm-10pm
Review session (Jie will hold one to work problems): Pipelining, Superscalar, Multiprocessors
Single Cycle, Multiple Cycle, vs. Pipeline

  Single Cycle Implementation: one long clock cycle per instruction; the cycle is sized for the slowest instruction (the Load), so a Store wastes part of its cycle.
    [Timing diagram: Clk; Load then Store across Cycles 1-2, the Store's unused time marked "Waste"]

  Multiple Cycle Implementation: each instruction takes a sequence of short cycles.
    [Timing diagram: Clk; Cycles 1-10]
    Load:   Ifetch Reg Exec Mem Wr
    Store:  Ifetch Reg Exec Mem
    R-type: Ifetch ...

  Pipeline Implementation: instructions overlap, a new one starting each cycle.
    Load:   Ifetch Reg Exec Mem Wr
    Store:  Ifetch Reg Exec Mem Wr
    R-type: Ifetch Reg Exec Mem Wr

Pipelining Summary

  Most modern processors use pipelining
    Pentium 4 has a 24 (35) stage pipeline!
    Intel Core 2 Duo has 14 stages
    Alpha 21164 has 7 stages
  Pipelining creates more headaches for exceptions, etc.
  Pipelining is augmented with superscalar capabilities
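As a quick check on what these diagrams imply, here is a small C sketch of the idealized cycle counts, assuming a 5-stage machine, uniform cycles, no stalls, and every instruction taking all five steps:

    /* Idealized cycle counts for n instructions on a 5-stage machine:
       multicycle runs the steps back-to-back, the pipeline overlaps them. */
    #include <stdio.h>

    int main(void) {
        int n = 100, stages = 5;
        printf("multicycle: %d cycles\n", n * stages);       /* 500 */
        printf("pipelined:  %d cycles\n", stages + (n - 1)); /* 104 */
        return 0;
    }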
Superscalar Processors

  Key idea: execute more than one instruction per cycle
  Instruction level parallelism (ILP)
    Pipelining exploits parallelism in the “stages” of instruction execution
    Superscalar exploits parallelism of independent instructions
  Key challenge: finding the independent instructions
    Option 1: the compiler, via static scheduling (Alpha 21064, 21164; UltraSPARC I, II; Pentium)
    Option 2: the hardware, via dynamic scheduling (Alpha 21264; PowerPC; Pentium Pro, 3, 4), i.e., out-of-order instruction processing

  Example Code:
    sub $2, $1, $3
    and $12, $3, $5
    or  $13, $6, $2
    add $3, $3, $2
    sw  $15, 100($2)

  Superscalar Execution: sub and and are independent (and reads $3 and $5, neither of which sub writes), so they can issue in the same cycle; or, add, and sw all read the $2 that sub produces, so they must issue after it.
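A minimal C sketch of the independence test behind such a schedule: two instructions can issue together only if neither writes a register that the other reads or writes. The Insn struct here is an illustrative assumption, not a real ISA data structure.

    #include <stdbool.h>

    typedef struct {
        int src1, src2;  /* registers read */
        int dst;         /* register written, or -1 if none */
    } Insn;

    /* True if a and b have no RAW, WAR, or WAW conflict. */
    bool independent(Insn a, Insn b) {
        bool raw = (a.dst != -1) && (a.dst == b.src1 || a.dst == b.src2);
        bool war = (b.dst != -1) && (b.dst == a.src1 || b.dst == a.src2);
        bool waw = (a.dst != -1) && (a.dst == b.dst);
        return !(raw || war || waw);
    }

For the example above, independent(sub, and) is true, while independent(sub, or) is false because or reads sub's destination $2.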
Instruction Level Parallelism

  Problems
    Program structure: a branch every 4-8 instructions
    Limited number of registers
  Static scheduling: the compiler must find and move instructions from other basic blocks
  Dynamic scheduling: hardware creates a big “window” in which to look for independent instructions
    Must know branch directions before the branch is executed!

  Example Code:
    sub $2, $1, $3
    and $12, $3, $2
    or  $2, $6, $4
    add $3, $3, $2
    sw  $15, 100($2)

Exposing Instruction Level Parallelism

  Branch prediction
    Hardware can remember whether a branch was taken
    The next time it sees the branch, it uses this history to predict the outcome
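A sketch of that idea in C: a table of two-bit saturating counters, a common refinement of simply remembering the last outcome. The table size and PC indexing are assumptions for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    #define TABLE_SIZE 1024           /* assumed predictor table size */
    uint8_t counters[TABLE_SIZE];     /* 0..3; >= 2 means predict taken */

    bool predict_taken(uint32_t pc) {
        return counters[(pc >> 2) % TABLE_SIZE] >= 2;
    }

    void train(uint32_t pc, bool taken) {
        uint8_t *c = &counters[(pc >> 2) % TABLE_SIZE];
        if (taken  && *c < 3) (*c)++; /* strengthen the "taken" state */
        if (!taken && *c > 0) (*c)--; /* strengthen "not taken" */
    }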
  Register renaming
    Indirection! The CS solution to almost everything
    During decode, map each register name to a real register location
    A new location is allocated when a new value is written to the register
    This determines the true dependences

  Example Code (annotated with renamed registers):
    sub $2, $1, $3     # writes $2 = $p1
    and $12, $3, $2    # reads $p1
    or  $2, $6, $4     # writes $2 = $p3
    add $3, $3, $2     # reads $p3
    sw  $15, 100($2)   # reads $p3
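A minimal C sketch of renaming at decode, under assumed structures: a map table from architectural to physical registers and a naive allocation counter (real hardware recycles freed physical registers from a free list).

    #include <assert.h>

    #define NUM_ARCH 32
    #define NUM_PHYS 64

    int map_table[NUM_ARCH];       /* arch reg -> current phys reg */
    int next_free = NUM_ARCH;      /* naive allocator: count upward */

    /* Rename one instruction "dst = op(src1, src2)". */
    void rename(int dst, int src1, int src2,
                int *p_dst, int *p_src1, int *p_src2) {
        *p_src1 = map_table[src1]; /* sources read the current mapping */
        *p_src2 = map_table[src2];
        *p_dst  = next_free++;     /* each write gets a fresh location */
        assert(next_free <= NUM_PHYS);
        map_table[dst] = *p_dst;   /* later readers of dst see the new one */
    }

Run the example through this: sub's write of $2 gets one fresh register and or's later write of $2 gets another, so and still reads the old value and the write-after-write on $2 disappears.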
CPU Design Summary

  Disadvantages of the single cycle processor
    Long cycle time
    The cycle time is sized for the Load, so it is too long for every other instruction
  Multiple clock cycle processor
    Divide each instruction into smaller steps
    Execute each step (instead of the entire instruction) in one cycle
  Pipeline processor
    Natural enhancement of the multiple clock cycle processor
    Each functional unit can only be used once per instruction
    If an instruction is going to use a functional unit, it must use it at the same stage as all other instructions
  Pipeline control
    Each stage’s control signals depend ONLY on the instruction that is currently in that stage
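A minimal C sketch of why that works: control bits are decoded once and then travel with the instruction through the pipeline registers, so no stage needs to look at any other stage. The struct fields are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint32_t insn;       /* the instruction held by this stage */
        bool reg_write;      /* control bits decoded with the insn */
        bool mem_read;
        bool mem_write;
    } StageReg;

    /* One pipeline register between each pair of stages. */
    StageReg if_id, id_ex, ex_mem, mem_wb;

    void clock_tick(void) {
        /* On each clock edge every instruction, with its control
           bits, advances one stage; shift back-to-front. */
        mem_wb = ex_mem;
        ex_mem = id_ex;
        id_ex  = if_id;
        /* if_id would be refilled by instruction fetch here */
    }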
Additional Notes

  All modern CPUs use pipelines.
  Many CPUs have 8-12 pipeline stages.
  The latest generation of processors (Pentium 4, PowerPC G4, Sun’s UltraSPARC) uses multiple pipelines to get higher speed (superscalar design).
  The course CPS 220 (Advanced Computer Architecture I) covers superscalar processors.

Now, Parallel Architectures…
What is Parallel Computer Architecture?

  A parallel computer is a collection of processing elements that cooperate to solve large problems fast
    How large a collection?
    How powerful are the elements?
    How does it scale up?
    How do they cooperate and communicate?
    How is data transmitted between processors?
    What are the primitive abstractions?
    How does it all translate to performance?

Parallel Computation: Why and Why Not?

  Pros
    Performance
    Cost-effectiveness (commodity parts)
    Smooth upgrade path
    Fault tolerance
  Cons
    Difficult to parallelize applications
    Requires automatic parallelization or parallel program development
    Software! AAHHHH!
§7.2 The Difficulty of Creating Parallel Processing Programs

Parallel Programming

  Parallel software is the problem
  Need to get significant performance improvement
    Otherwise, just use a faster uniprocessor, since it’s easier!
  Difficulties
    Partitioning
    Coordination
    Communications overhead

A Simple Problem

    for i = 1 to N
        A[i] = (A[i] + B[i]) * C[i]
        sum = sum + A[i]

  Split the loops; the iterations of the first loop are independent:

    for i = 1 to N
        A[i] = (A[i] + B[i]) * C[i]
    for i = 1 to N
        sum = sum + A[i]
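A sketch of the split loops in C, using OpenMP as one possible way to run them in parallel (the array size and the choice of OpenMP are assumptions, not from the slides):

    #include <omp.h>

    #define N 100000
    double A[N], B[N], C[N];

    double compute(void) {
        double sum = 0.0;

        /* Independent iterations: safe to run fully in parallel. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            A[i] = (A[i] + B[i]) * C[i];

        /* The sum is a reduction: each thread accumulates a private
           partial sum, combined at the end to avoid a race on sum. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += A[i];

        return sum;
    }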
Amdahl’s Law

  The sequential part can limit speedup
  Example: 100 processors, 90× speedup?

    T_new = T_parallelizable/100 + T_sequential

    Speedup = 1 / ((1 - F_parallelizable) + F_parallelizable/100) = 90

    Solving: F_parallelizable = 0.999
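A small C check of the arithmetic above (the function is just Amdahl's formula; with the rounded f = 0.999 the speedup comes out slightly above the 90× target):

    #include <stdio.h>

    /* Amdahl's Law: speedup for parallel fraction f on p processors. */
    double amdahl_speedup(double f, int p) {
        return 1.0 / ((1.0 - f) + f / p);
    }

    int main(void) {
        printf("%.1f\n", amdahl_speedup(0.999, 100)); /* prints 91.0 */
        return 0;
    }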
  The sequential part must be just 0.1% of the original time

Small Scale Shared Memory Multiprocessors

  [Figure: processors 0 to N-1, each with a cache ($) and TLB, sharing a bus to one Main Memory]

  Small number of processors connected to one shared memory
  Memory is equidistant from all processors (UMA)
  Kernel can run on any processor (symmetric MP)
  E.g., Intel dual/quad Pentium; multicore chips
§7.11 Real Stuff: Benchmarking Four Multicores

Four Example Systems

  2 × quad-core Intel Xeon e5345 (Clovertown)
  2 × quad-core AMD Opteron X4 2356 (Barcelona)
  2 × oct-core Sun UltraSPARC T2 5140 (Niagara 2)
  2 × oct-core IBM Cell QS20
Cache Coherence Problem

  [Figure, repeated across four slides: P1 and P2, each with a private cache, on a shared BUS to Main Memory, which holds x; time runs downward]

  Initial state: x is only in main memory; neither cache holds a copy.
  Step 1: P2 executes ld r2, x, and P2's cache gets a copy of x.
  Step 2: P1 also executes ld r2, x; both caches now hold copies of x.
  Step 3: P1 executes add r1, r2, r4 and then st x, r1, updating x in its own cache; P2's cached copy of x is now stale.
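A toy C model of the problem, not of real hardware: each "cache" holds a private copy, and a store updates only the owner's copy, so the other processor keeps reading a stale value. The initial value of x is an assumption.

    #include <stdio.h>

    int memory_x = 5;                        /* assumed initial x */

    typedef struct { int valid; int val; } CacheLine;

    int load(CacheLine *c) {                 /* ld r2, x */
        if (!c->valid) {                     /* miss: fetch from memory */
            c->val = memory_x;
            c->valid = 1;
        }
        return c->val;                       /* hit: reuse cached copy */
    }

    void store(CacheLine *c, int v) {        /* st x, r1, no invalidation */
        c->val = v;
        c->valid = 1;
    }

    int main(void) {
        CacheLine p1 = {0, 0}, p2 = {0, 0};
        load(&p2);                           /* step 1: P2 loads x */
        int r2 = load(&p1);                  /* step 2: P1 loads x */
        store(&p1, r2 + 1);                  /* step 3: P1 stores new x */
        printf("P1 sees %d, P2 sees %d\n", load(&p1), load(&p2));
        return 0;                            /* P1 sees 6, P2 sees 5 */
    }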
§7.4 Clusters and Other Message-Passing Multiprocessors

Message Passing

  Each processor has a private physical address space
  Hardware sends/receives messages between processors

Loosely Coupled Clusters

  Network of independent computers
    Each has private memory and OS
    Connected using the I/O system (e.g., Ethernet/switch, Internet)
  Suitable for applications with independent tasks
    Web servers, databases, simulations, …
  High availability, scalable, affordable
  Problems
    Administration cost (prefer virtual machines)
    Low interconnect bandwidth (c.f. processor/memory bandwidth on an SMP)
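A minimal message-passing sketch in C using MPI, one common programming interface for such clusters (assumes an MPI installation; the payload and ranks are arbitrary):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;                      /* some work unit */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }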
Grid Computing

  Separate computers interconnected by long-haul networks
    E.g., Internet connections
  Work units farmed out, results sent back
  Can make use of idle time on PCs
    E.g., SETI@home, World Community Grid

§7.5 Hardware Multithreading

Multithreading

  Performing multiple threads of execution in parallel
    Replicate registers, PC, etc.
    Fast switching between threads
  Fine-grain multithreading
    Switch threads after each cycle
    Interleave instruction execution
    If one thread stalls, others are executed
  Coarse-grain multithreading
    Only switch on a long stall (e.g., L2-cache miss)
    Simplifies hardware, but doesn’t hide short stalls (e.g., data hazards)
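A minimal C sketch of the fine-grain policy: on every cycle, rotate to the next thread whose context is not stalled. The Thread struct and stall flag are illustrative assumptions about the hardware state.

    #include <stdbool.h>

    #define NUM_THREADS 4

    typedef struct {
        unsigned pc;      /* replicated per-thread program counter */
        bool stalled;     /* e.g., waiting on a cache miss */
    } Thread;

    Thread threads[NUM_THREADS];

    /* Pick the thread to issue from this cycle, or -1 if all stalled. */
    int select_thread(int last) {
        for (int i = 1; i <= NUM_THREADS; i++) {
            int t = (last + i) % NUM_THREADS;
            if (!threads[t].stalled)
                return t;                 /* round-robin, skip stalls */
        }
        return -1;                        /* every thread is stalled */
    }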