Admin

 Homework 6 due Wed

 Projects due Wed

 Final Tuesday April 28, 7pm-10pm

 Review session (Jie will hold one to work problems): Pipelining, Superscalar, Multiprocessors

CPS 104

cps 104 pipeline.1 ©ARL 2001

Single Cycle, Multiple Cycle, vs. Pipeline

 Single Cycle Implementation (Cycles 1-2):
   Clk: Load | Store | Waste
   (the cycle must be long enough for the slowest instruction, so time is wasted on all others)
 Multiple Cycle Implementation (Cycles 1-10):
   Load:  Ifetch Reg Exec Mem Wr
   Store: Ifetch Reg Exec Mem
   R-type: Ifetch ...
 Pipeline Implementation:
   Load:   Ifetch Reg Exec Mem Wr
   Store:    Ifetch Reg Exec Mem Wr
   R-type:     Ifetch Reg Exec Mem Wr

Pipelining Summary

 Most modern processors use pipelining
   Pentium 4 has a 24 (35) stage pipeline!
   Intel Core 2 Duo has 14 stages
   Alpha 21164 has 7 stages
 Pipelining creates more headaches for exceptions, etc.
 Pipelining augmented with superscalar capabilities
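The stage overlap in the pipeline diagram above can be sketched in a few lines of Python. This is a minimal illustration assuming an ideal pipeline with no hazards or stalls; the function name `pipeline_schedule` is ours, not from the slides.

```python
# Sketch: cycle-by-cycle occupancy of the classic 5-stage pipeline
# (Ifetch, Reg, Exec, Mem, Wr) for the Load/Store/R-type sequence above.
# Assumes no hazards or stalls -- purely illustrative.

STAGES = ["Ifetch", "Reg", "Exec", "Mem", "Wr"]

def pipeline_schedule(instructions):
    """Return {cycle: [(instr, stage), ...]} for an ideal pipeline."""
    schedule = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            cycle = i + s + 1  # instruction i enters Ifetch in cycle i+1
            schedule.setdefault(cycle, []).append((instr, stage))
    return schedule

sched = pipeline_schedule(["Load", "Store", "R-type"])
# In cycle 3 all three instructions are in flight simultaneously:
print(sched[3])  # [('Load', 'Exec'), ('Store', 'Reg'), ('R-type', 'Ifetch')]
```

Note that the three instructions finish in 7 cycles rather than 15: overlap, not shorter instruction latency, is where pipelining's speedup comes from.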

Superscalar Processors

 Key idea: execute more than one instruction per cycle
 Instruction level parallelism (ILP)
   Pipelining exploits parallelism in the "stages" of instruction execution
   Superscalar exploits parallelism of independent instructions
 Example Code:
   sub $2, $1, $3
   and $12, $3, $5
   or  $13, $6, $2
   add $3, $3, $2
   sw  $15, 100($2)
 Superscalar Execution:
   sub $2, $1, $3
   and $12, $3, $5
   or  $13, $6, $2
   add $3, $3, $2
   sw  $15, 100($2)

Superscalar Processors

 Key Challenge: Finding the independent instructions
 Option 1: Compiler
   Static scheduling (Alpha 21064, 21164; UltraSPARC I, II; Pentium)
 Option 2: Hardware
   Dynamic scheduling (Alpha 21264; PowerPC; Pentium Pro, 3, 4)
   Out-of-order instruction processing

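The independence test that the slide calls the "key challenge" can be sketched directly on the example code. This is a minimal illustration, assuming each instruction is summarized by its write-set and read-set of registers; the `independent` helper is ours, not a real scheduler.

```python
# Sketch: how a scheduler (compiler or hardware) might test whether two
# instructions are independent, using the example code above.
# Each instruction is (text, registers written, registers read).

code = [
    ("sub $2, $1, $3",   {"$2"},  {"$1", "$3"}),
    ("and $12, $3, $5",  {"$12"}, {"$3", "$5"}),
    ("or  $13, $6, $2",  {"$13"}, {"$6", "$2"}),
    ("add $3, $3, $2",   {"$3"},  {"$3", "$2"}),
    ("sw  $15, 100($2)", set(),   {"$15", "$2"}),
]

def independent(a, b):
    """True if a and b have no RAW, WAR, or WAW dependence."""
    _, wa, ra = a
    _, wb, rb = b
    return not (wa & rb or wb & ra or wa & wb)

# 'sub' writes $2, which 'or' reads -> dependent (RAW):
print(independent(code[0], code[2]))  # False
# 'and' and 'or' touch disjoint registers -> could issue in the same cycle:
print(independent(code[1], code[2]))  # True
```

A static scheduler runs tests like this at compile time; dynamic scheduling hardware does the equivalent comparison on instructions in its window every cycle.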
Instruction Level Parallelism

 Problems:
   Program structure: branch every 4-8 instructions
   Limited number of registers
   Static scheduling: compiler must find and move instructions from other basic blocks
   Dynamic scheduling: hardware creates a big "window" to look for independent instructions
     Must know branch directions before the branch is executed!
 Example Code:
   sub $2, $1, $3
   and $12, $3, $2
   or  $2, $6, $4
   add $3, $3, $2
   sw  $15, 100($2)

Exposing Instruction Level Parallelism

 Branch prediction
   Hardware can remember if a branch was taken
   Next time it sees the branch it uses this to predict the outcome
 Register renaming
   Indirection! The CS solution to almost everything
   During decode, map each register name to a real register location
   A new location is allocated when a new value is written to the register
   Determines true dependencies
 Example Code:
   sub $2, $1, $3     # writes $2 = $p1
   and $12, $3, $2    # reads $p1
   or  $2, $6, $4     # writes $2 = $p3
   add $3, $3, $2     # reads $p3
   sw  $15, 100($2)   # reads $p3

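The renaming step described above can be sketched as a small decode-time table walk. This is a minimal illustration, assuming instructions are given as (destination, sources) pairs and that a fresh physical register is allocated on every write; the `rename` function and `$pN` naming follow the slide's annotations but are otherwise ours.

```python
# Sketch: register renaming during decode, following the slide's example.
# Architectural names ($2, $3, ...) are mapped to physical registers
# ($p1, $p2, ...); a fresh physical register is allocated on every write,
# which removes WAR/WAW hazards and exposes the true (RAW) dependencies.

from itertools import count

def rename(instructions):
    """instructions: list of (dest_or_None, [source registers])."""
    phys = count(1)
    mapping = {}            # architectural name -> current physical name
    out = []
    for dest, srcs in instructions:
        renamed_srcs = [mapping.get(s, s) for s in srcs]  # read current map
        if dest is not None:
            mapping[dest] = f"$p{next(phys)}"             # fresh reg per write
            out.append((mapping[dest], renamed_srcs))
        else:
            out.append((None, renamed_srcs))
    return out

# sub $2,$1,$3 / and $12,$3,$2 / or $2,$6,$4 / add $3,$3,$2 / sw $15,100($2)
prog = [("$2", ["$1", "$3"]), ("$12", ["$3", "$2"]),
        ("$2", ["$6", "$4"]), ("$3", ["$3", "$2"]), (None, ["$15", "$2"])]
for line in rename(prog):
    print(line)
```

After renaming, `or` no longer appears to conflict with `and` even though both mention `$2`: `and` reads `$p1` while `or` writes `$p3`, so only the true dependencies remain.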
CPU Design Summary

 Disadvantages of the Single Cycle Processor
   Long cycle time
   Cycle time is too long for all instructions except the Load
 Multiple Clock Cycle Processor:
   Divide the instructions into smaller steps
   Execute each step (instead of the entire instruction) in one cycle
 Pipeline Processor:
   Natural enhancement of the multiple clock cycle processor
   Each functional unit can only be used once per instruction
   If an instruction is going to use a functional unit:
     it must use it at the same stage as all other instructions
 Pipeline Control:
   Each stage's control signals depend ONLY on the instruction that is currently in that stage

Additional Notes

 All modern CPUs use pipelines.
 Many CPUs have 8-12 pipeline stages.
 The latest generation of processors (Pentium-4, PowerPC G4, SUN's UltraSPARC) use multiple pipelines to get higher speed (superscalar design).
 The course CPS 220 (Advanced I) covers superscalar processors.
 Now, Parallel Architectures…

What is Parallel Computer Architecture?

 A Parallel Computer is a collection of processing elements that cooperate to solve large problems fast
   how large a collection?
   how powerful are the elements?
   how does it scale up?
   how do they cooperate and communicate?
   how is data transmitted between processors?
   what are the primitive abstractions?
   how does it all translate to performance?

Parallel Computation: Why and Why Not?

 Pros
   Performance
   Cost-effectiveness (commodity parts)
   Smooth upgrade path
   Fault Tolerance
 Cons
   Difficult to parallelize applications
   Requires automatic parallelization or parallel program development
   Software! AAHHHH!

The Simple Problem (§7.2)

 for i = 1 to N
   A[i] = (A[i] + B[i]) * C[i]
   sum = sum + A[i]

 Split the loops
 Independent iterations

 for i = 1 to N
   A[i] = (A[i] + B[i]) * C[i]
 for i = 1 to N
   sum = sum + A[i]

The Difficulty of Creating Parallel Programs

 Parallel software is the problem
 Need to get significant performance improvement
   Otherwise, just use a faster uniprocessor, since it's easier!
 Difficulties
   Partitioning
   Coordination
   Communications overhead

Chapter 7 — Multicores, Multiprocessors, and Clusters — 14
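The loop split above can be sketched in Python and checked against the original fused loop. This is a minimal illustration: the chunked phase-1 loop stands in for work divided among processors, and the function names (`fused`, `split`) are ours.

```python
# Sketch of the loop split shown above: the fused loop is separated into
# an element-wise update (fully independent iterations, easy to divide
# among processors) and a reduction, then verified against the original.

def fused(A, B, C):
    total = 0
    for i in range(len(A)):
        A[i] = (A[i] + B[i]) * C[i]
        total += A[i]
    return total

def split(A, B, C, nprocs=4):
    # Phase 1: independent iterations -- each "processor" gets a slice.
    n = len(A)
    for p in range(nprocs):
        for i in range(p, n, nprocs):       # interleaved partitioning
            A[i] = (A[i] + B[i]) * C[i]
    # Phase 2: the reduction (this is where coordination and
    # communication overhead appear in a real parallel program).
    return sum(A)

A1, A2 = [1, 2, 3, 4], [1, 2, 3, 4]
B, C = [5, 6, 7, 8], [2, 2, 2, 2]
assert fused(A1, B, C) == split(A2, B, C)
```

Phase 1 needs no coordination at all; every cost listed on the slide (partitioning, coordination, communication) is concentrated in the phase-2 reduction.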

Amdahl's Law

 Sequential part can limit speedup
 Example: 100 processors, 90× speedup?
   T_new = T_parallelizable/100 + T_sequential
   Speedup = 1 / ((1 − F_parallelizable) + F_parallelizable/100) = 90
   Solving: F_parallelizable = 0.999
 Need the sequential part to be 0.1% of the original time

Small Scale Multiprocessors

 [Figure: processors P0 … PN-1, each with cache(s) ($) and TLB, connected to one shared Main Memory]
 Small number of processors connected to one shared memory
 Memory is equidistant from all processors (UMA)
 Kernel can run on any processor (symmetric MP)
 Examples: Intel dual/quad Pentium; multicore chips

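The Amdahl's Law example above can be checked with two one-line functions. This is a minimal sketch; `speedup` and `frac_needed` are our names for the slide's formula and its inverse.

```python
# Sketch of the Amdahl's Law arithmetic above: speedup with p processors
# when a fraction f of the original execution time is parallelizable.

def speedup(f, p):
    # T_new = (1 - f) + f/p   (normalized so T_old = 1)
    return 1.0 / ((1.0 - f) + f / p)

def frac_needed(target, p):
    # Invert speedup = 1/((1-f) + f/p) for f:
    #   f = (1 - 1/target) / (1 - 1/p)
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / p)

# The slide's example: a 90x speedup on 100 processors requires
# f ≈ 0.999, i.e. the sequential part must be ~0.1% of the original time.
print(round(frac_needed(90, 100), 3))  # 0.999
```

Running `speedup(frac_needed(90, 100), 100)` gives back exactly 90, confirming the slide's algebra.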
Four Example Systems (§7.11)

 2 × quad-core Intel Xeon e5345 (Clovertown)
 2 × oct-core Sun UltraSPARC T2 5140 (Niagara 2)
 2 × quad-core AMD Opteron X4 2356 (Barcelona)
 2 × oct-core IBM Cell QS20

Cache Coherence Problem (Initial State)

 [Figure: P1 and P2, each with a private cache, share a BUS to Main Memory; memory holds x, and both caches start empty]

Cache Coherence Problem (Step 1)

 [Figure: one processor executes "ld r2, x"; a copy of x is brought over the BUS into its cache]

Cache Coherence Problem (Step 2)

 [Figure: the other processor also executes "ld r2, x"; both caches now hold copies of x]

Cache Coherence Problem (Step 3)

 [Figure: the first processor executes "add r1, r2, r4" then "st x, r1", updating x only in its own cache; the other processor's cached copy of x is now stale]
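The stale-copy outcome of the three steps above can be reproduced with a toy cache model. This is a minimal sketch of caches with no coherence protocol at all (the `Cache` class and its write-back-without-snooping behavior are our simplification, not a real protocol).

```python
# Sketch of the steps above: two caches with no coherence protocol.
# After both processors cache x and one of them writes it, the other
# still reads its old copy.

class Cache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}                 # addr -> value (never invalidated!)

    def load(self, addr):
        if addr not in self.lines:      # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):       # write in own cache, no bus snooping
        self.lines[addr] = value

memory = {"x": 5}
p1, p2 = Cache(memory), Cache(memory)

p2.load("x")            # Step 1: P2 caches x = 5
p1.load("x")            # Step 2: P1 caches x = 5
p1.store("x", 7)        # Step 3: P1 writes x = 7 in its own cache only
print(p2.load("x"))     # 5  -- P2 still sees the stale value
```

A coherence protocol (e.g., invalidating other copies on a write) is exactly the machinery needed to make `p2.load("x")` return 7 here.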

Message Passing (§7.4: Clusters and Other Message-Passing Multiprocessors)

 Each processor has a private physical address space
 Hardware sends/receives messages between processors

Loosely Coupled Clusters

 Network of independent computers
   Each has private memory and OS
   Connected using the I/O system (e.g., Ethernet/switch, Internet)
 Suitable for applications with independent tasks
   Web servers, databases, simulations, …
 High availability, scalable, affordable
 Problems
   Administration cost (prefer virtual machines)
   Low interconnect bandwidth
     c.f. processor/memory bandwidth on an SMP

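The send/receive style described above can be sketched with threads and queues standing in for cluster nodes and the interconnect. This is an illustrative sketch only: each worker keeps its data in private state and communicates solely through explicit messages; the `worker`/`inbox` names are ours.

```python
# Sketch: message passing between two workers with private state, using
# a queue as the interconnect (standing in for the cluster's network).
# Each worker sums its own data; worker 1 sends its partial sum to
# worker 0, which combines them -- no shared memory is used.

import threading
from queue import Queue

def worker(rank, data, inbox, peer_inbox, result):
    partial = sum(data)                 # computed in private memory
    if rank == 1:
        peer_inbox.put(partial)         # send message to worker 0
    else:
        other = inbox.get()             # receive worker 1's partial sum
        result.append(partial + other)

inbox0 = Queue()
result = []
t0 = threading.Thread(target=worker, args=(0, [1, 2, 3], inbox0, None, result))
t1 = threading.Thread(target=worker, args=(1, [4, 5, 6], None, inbox0, result))
t0.start(); t1.start(); t0.join(); t1.join()
print(result[0])  # 21
```

In a real cluster the `put`/`get` pair would be network sends and receives (e.g., over Ethernet), which is why the slide's "low interconnect bandwidth" matters: every exchanged partial result pays that cost.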
Hardware Multithreading (§7.5)

 Performing multiple threads of execution in parallel
   Replicate registers, PC, etc.
   Fast switching between threads
 Fine-grain multithreading
   Switch threads after each cycle
   Interleave instruction execution
   If one thread stalls, others are executed
 Coarse-grain multithreading
   Only switch on a long stall (e.g., L2-cache miss)
   Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)

Grid Computing

 Separate computers interconnected by long-haul networks
   E.g., Internet connections
   Work units farmed out, results sent back
 Can make use of idle time on PCs
   E.g., SETI@home, World Community Grid
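The fine-grain policy above ("switch threads after each cycle; if one stalls, others are executed") can be sketched as a round-robin issue loop. This is a minimal model under simplifying assumptions: stalls are given explicitly as (thread, cycle) pairs, and `fine_grain_issue` is our name, not hardware terminology.

```python
# Sketch: fine-grain hardware multithreading as round-robin issue.
# Each thread is a list of instruction labels; each cycle the "core"
# issues from the next ready thread, skipping any thread that is
# stalled in that cycle (e.g., waiting on a cache miss).

def fine_grain_issue(threads, stalls, cycles):
    """threads: {tid: [instrs]}; stalls: set of (tid, cycle) pairs."""
    pcs = {tid: 0 for tid in threads}
    order = sorted(threads)
    issued, nxt = [], 0
    for cycle in range(cycles):
        for k in range(len(order)):             # find next ready thread
            tid = order[(nxt + k) % len(order)]
            if (tid, cycle) not in stalls and pcs[tid] < len(threads[tid]):
                issued.append((tid, threads[tid][pcs[tid]]))
                pcs[tid] += 1
                nxt = (order.index(tid) + 1) % len(order)
                break
        else:
            issued.append(None)                 # nothing ready: bubble
    return issued

threads = {0: ["A0", "A1"], 1: ["B0", "B1"]}
print(fine_grain_issue(threads, set(), 4))
# [(0, 'A0'), (1, 'B0'), (0, 'A1'), (1, 'B1')]
print(fine_grain_issue(threads, {(0, 0)}, 4))
# thread 0 stalls in cycle 0, so thread 1 issues instead -- no bubble
```

Coarse-grain multithreading would instead keep issuing from one thread and only rotate on a long stall, which is simpler but leaves short stalls (data hazards) unhidden.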