inst.eecs.berkeley.edu/~cs61c
UC Berkeley CS61C : Machine Structures
Lecture 40 – Hardware Parallel Computing
2006-12-06
Thanks to John Lazzaro for his CS152 slides: inst.eecs.berkeley.edu/~cs152/
Head TA Scott Beamer, inst.eecs./~cs61c-tb

Spam Emails on the Rise: New studies show spam emails have increased by 50% worldwide. Worse yet, much of the spamming is being done by criminals, so laws may not help.

http://www.cnn.com/2006/WORLD/europe/11/27/uk.spam.reut/index.html

Outline

• Last time was about how to exploit parallelism from the software point of view
• Today is about how to implement this in hardware
• Some parallel hardware techniques
• A couple current examples
• Won't cover out-of-order execution, since too complicated

Introduction

• Given many threads (somehow generated by software), how do we implement this in hardware?
• Recall the performance equation: Execution Time = (Inst. Count)(CPI)(Cycle Time) (a worked example follows this list)
• Hardware parallelism improves:
  - Instruction Count - if the equation is applied to each CPU, each CPU needs to do less
  - CPI - if the equation is applied to the system as a whole, more is done per cycle
  - Cycle Time - will probably be made worse in the process
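A quick worked example of these tradeoffs, with invented numbers purely for illustration: suppose a program of \(10^9\) instructions runs at CPI 1.0 with a 1 ns cycle, and splitting it across 2 CPUs halves each CPU's instruction count but worsens the cycle time to 1.1 ns.

\[ T_{1\,\mathrm{CPU}} = (10^9)(1.0)(1\,\mathrm{ns}) = 1.0\,\mathrm{s} \qquad T_{\mathrm{each\ of\ 2\ CPUs}} = (0.5\times 10^9)(1.0)(1.1\,\mathrm{ns}) = 0.55\,\mathrm{s} \]

Even with the worse cycle time, the split program finishes in roughly half the time, provided the work divides evenly.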

Disclaimers

• Please don't let today's material confuse what you have already learned about CPUs and pipelining
• When a programmer is mentioned today, it means whoever is generating the assembly code (so it is probably a compiler)
• Many of the concepts described today are difficult to implement, so if it sounds easy, think of possible hazards

Superscalar

• Add more functional units or pipelines to the CPU
• Directly reduces CPI by doing more per cycle
• Consider what happens if we (see the sketch after this list):
  - Added another ALU
  - Added 2 more read ports to the RegFile
  - Added 1 more write port to the RegFile
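A minimal sketch of why the extra ports and second ALU pay off, using plain C to stand in for the generated assembly (the variables are invented for illustration): two independent adds can issue in the same cycle, while a dependent pair cannot.

#include <stdio.h>

int main(void) {
    int a = 1, b = 2, d = 3, f = 4;

    /* Independent adds: a 2-way superscalar can issue both in one cycle.
       Doing so reads 4 registers and writes 2 in that cycle, which is
       exactly why the RegFile needs 2 more read ports and 1 more write
       port to feed the second ALU. */
    int c = a + b;
    int e = d + f;

    /* Dependent adds: the second needs the first's result (g), so the
       pair must go back to back and the extra ALU sits idle. */
    int g = a + d;
    int h = g + f;

    printf("%d %d %d %d\n", c, e, g, h);
    return 0;
}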

Simple Superscalar MIPS CPU

• Can now do 2 instructions in 1 cycle!

[Figure: 2-way superscalar MIPS datapath. The instruction memory fetches two instructions (Inst0, Inst1) per cycle; the register file gains read ports Rc/Rd and write port W1 alongside Ra/Rb/W0; a second ALU operates in parallel with the first, sharing the PC and the data memory]

Simple Superscalar MIPS CPU (cont.)

• Considerations
  - ISA now has to be changed
  - Forwarding for pipelining now harder
• Limitations
  - Programmer must explicitly generate parallel code
  - Improvement only if other instructions can fill slots
  - Doesn't scale well

Single Instruction Multiple Data (SIMD)

• Often done in a vector form, so all data has the same operation applied to it
• Example: AltiVec (like SSE)
  - 128-bit registers can hold 4 floats, 4 ints, 8 shorts, 16 chars, etc.
  - Processes the whole vector (a sketch follows the figure below)

[Figure: a single 128-bit ALU takes two 128-bit vector operands, A and B, and operates on every element of the vectors at once]
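The lecture's example is AltiVec; the same idea can be sketched with x86 SSE intrinsics instead (an illustrative substitution, not the slide's own code): one instruction adds all four float lanes of two 128-bit registers at once.

#include <stdio.h>
#include <xmmintrin.h>  /* x86 SSE intrinsics; AltiVec is the PowerPC analogue */

int main(void) {
    float a[4] = {1, 2, 3, 4};
    float b[4] = {10, 20, 30, 40};
    float c[4];

    __m128 va = _mm_loadu_ps(a);    /* load 4 floats into a 128-bit register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb); /* ONE instruction does 4 adds */
    _mm_storeu_ps(c, vc);

    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);  /* 11 22 33 44 */
    return 0;
}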

Superscalar in Practice

• ISAs have extensions for these vector operations
• One instruction, that has parallelism internally
• Performance improvement depends on the program and programmer being able to fully utilize all slots
• Can be parts other than the ALU (like load)
• Usefulness will be more apparent when combined with other parallel techniques

Thread Review

• A Thread is a single stream of instructions
• It has its own registers, PC, etc.
• Threads from the same process operate in the same virtual address space
• Threads are an easy way to describe/think about parallelism
• A single CPU can execute many threads by Time Division Multiplexing (a runnable sketch follows)

[Figure: over time, one CPU switches among Thread0, Thread1, and Thread2]
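A minimal runnable sketch in C with POSIX threads (the worker function and loop counts are invented for illustration): each thread has its own registers, PC, and stack, yet all of them see the same global array because threads of one process share a virtual address space.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 3

/* Shared: every thread of the process sees the same globals. */
static int counts[NTHREADS];

/* Each thread runs its own instruction stream with its own registers
   and stack; the OS time-division multiplexes them onto the CPU(s). */
static void *worker(void *arg) {
    int id = *(int *)arg;      /* private, per-thread value */
    for (int i = 0; i < 1000; i++)
        counts[id]++;          /* each thread owns one slot, so no race */
    return NULL;
}

int main(void) {
    pthread_t tids[NTHREADS];
    int ids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&tids[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    for (int i = 0; i < NTHREADS; i++)
        printf("thread %d did %d increments\n", i, counts[i]);
    return 0;
}

(Compile with gcc -pthread.)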

Multithreading

• Multithreading is running multiple threads through the same hardware
• Could we do Time Division Multiplexing better in hardware?
• Consider: what if we gave the OS the abstraction of having 4 physical CPUs that share memory and each execute one thread, but we did it all on 1 physical CPU?

Static Multithreading Example

• Appears to be 4 CPUs at 1/4 clock
• Introduced in 1964 by Seymour Cray

[Figure: a shared ALU pipeline in which each pipeline stage holds a different one of the 4 threads, rotating among them every cycle]

Static Multithreading Example Analyzed

• Results:
  - 4 Threads running in hardware
  - Pipeline hazards reduced
    . No more need to forward
    . No control issues
    . Fewer structural hazards
  - Depends on being able to generate 4 threads with evenly balanced work
    . Example: if 1 Thread does 75% of the work (the arithmetic is written out below)
      Utilization = (% time run)(% work done) = (.25)(.75) + (.75)(.25) = .375 = 37.5%
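Writing the slide's utilization arithmetic out in full (this is my reading of the slide's formula, weighting each share of the round-robin slots by the fraction of the total work available to fill it): the busy thread holds 75% of the work but gets only 25% of the cycles, while the other three threads hold 25% of the work but get 75% of the cycles.

\[ U = (0.25)(0.75) + (0.75)(0.25) = 0.1875 + 0.1875 = 0.375 = 37.5\% \]

Either way it is computed, the takeaway is the slide's: uneven threads leave most of the machine's slots idle.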

Dynamic Multithreading

• Adds flexibility in choosing the time to switch threads
• Simultaneous Multithreading (SMT)
  - Called Hyperthreading by Intel
  - Run multiple threads at the same time
  - Just allocate functional units when available (a toy sketch of the idea follows)
  - Superscalar helps with this
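A toy software sketch of the allocation idea only; real SMT issue logic is hardware and far more involved, and the ready-instruction counts here are invented: each cycle, hand each of the 8 functional-unit slots to whichever thread still has a ready instruction.

#include <stdio.h>

#define UNITS 8
#define THREADS 2

int main(void) {
    int ready[THREADS]  = {3, 6};  /* instructions each thread could issue this cycle */
    int issued[THREADS] = {0, 0};

    /* Greedy allocation for one cycle: round-robin over threads, giving
       a slot to any thread with work left. One thread alone would leave
       5 of the 8 slots idle; two threads together fill all 8. */
    int slots = UNITS, t = 0;
    while (slots > 0 && (ready[0] > 0 || ready[1] > 0)) {
        if (ready[t] > 0) {
            ready[t]--;
            issued[t]++;
            slots--;
        }
        t = (t + 1) % THREADS;
    }
    printf("issued: thread0=%d thread1=%d, idle slots=%d\n",
           issued[0], issued[1], slots);
    return 0;
}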

Dynamic Multithreading Example

[Figure: two functional-unit occupancy charts over cycles 1-9, each with 8 units per cycle (2 memory, 2 fixed-point, 2 floating-point, branch, condition code). With one thread, many of the 8 slots go empty each cycle; with two threads sharing the same 8 units, far more slots are filled]

Multicore

• Put multiple CPUs on the same die
• Why is this better than multiple dies?
  - Smaller
  - Cheaper
  - Closer, so lower inter-processor latency
  - Can share an L2 Cache (details)
  - Less power
• Cost of multicore: complexity and slower single-thread execution

Multicore Example (IBM Power5)

[Figure: IBM Power5 die layout: Core #1 and Core #2 with shared stuff between them]

Administrivia

• Proj4 due tonight at 11:59pm
• Proj1 - check the newsgroup for the posting about Proj1 regrades; you may want one
• Lab tomorrow will have 2 surveys
• Come to class Friday for the HKN course survey

Upcoming Calendar

Week #15 (This Week):
  Mon: Parallel Computing in Software (Scott)
  Wed: Parallel Computing in Hardware
  Thu Lab: I/O Networking & 61C Feedback Survey
  Fri: LAST CLASS - Summary, Review, & HKN Evals
Week #16:
  Final Review: Sun 2pm, 10 Evans
  FINAL EXAM: THU 12-14, 12:30pm-3:30pm, 234 Hearst Gym

Final exam: Same rules as the Midterm, except you get 2 double-sided handwritten review sheets (1 from your midterm, 1 new one) + green sheet. [Don't bring backpacks]

Real World Example 1: Cell Processor

• Multicore, and more….

Real World Example 1: Cell Processor

• 9 Cores (1 PPE, 8 SPEs) at 3.2GHz
• Power Processing Element (PPE)
  - Supervises all activities, allocates work
  - Is multithreaded (2 threads)
• Synergistic Processing Element (SPE)
  - Where work gets done
  - Very superscalar
  - No Cache, only Local Store

Real World Example 1: Cell Processor

• Great for other multimedia applications such as HDTV, cameras, etc.
• Really dependent on the programmer using the SPEs and the Local Store to get the most out of it

Real World Example 2: Niagara Processor

• Multithreaded and Multicore
• 32 Threads (8 cores, 4 threads each) at 1.2GHz
• Designed for low power
• Has simpler pipelines to fit more on
• Maximizes thread level parallelism
• Project Blackbox

Real World Example 2: Niagara Processor

• Each thread runs slower (1.2GHz), and there is less number-crunching ability (no FP unit per core), but tons of threads
• This is great for webservers, where there are typically many simple requests and many data stalls
• Can beat "faster" and more expensive CPUs, while using less power

Peer Instruction

1. The majority of PS3's processing power comes from the Cell processor
2. A computer that has max utilization can get more done multithreaded
3. Current multicore techniques can scale well to many (32+) cores

Answer choices (ABC): 1: FFF, 2: FFT, 3: FTF, 4: FTT, 5: TFF, 6: TFT, 7: TTF, 8: TTT

Peer Instruction Answer

1. FALSE - the whole PS3 is 2.18 TFLOPS; the Cell is only 204 GFLOPS (the GPU can do a lot…)
2. FALSE - at max utilization there is no more functional power to exploit
3. FALSE - sharing memory and caches is a huge barrier (which is why the Cell has a Local Store)

Summary

• Superscalar: more functional units
• Multithread: multiple threads executing on the same CPU
• Multicore: multiple CPUs on the same die
• The gains from all these parallel hardware techniques rely heavily on the programmer being able to map their task well to multiple threads
• Hit up CS150, CS152, CS162 and Wikipedia for more info
