CPS 104 Computer Organization and Programming, Lecture 20: Superscalar Processors, Multiprocessors

Robert Wagner

Faster and Faster Processors

● So much to do, so little time...

● How can we make programs execute faster?

◆ Faster clock => more instructions/second. (technology constraints)

◆ Pipelining => faster clock

◆ Execute more than one instruction per cycle (Superscalar)

◆ Use multiple processors and divide the computation (Multiprocessors, Clustered computing)

Multiple Pipelines: Floating Point

[Figure: multiple execution pipelines sharing the IF, ID/RF, and WB stages: an integer pipeline (EX, MEM), a seven-stage floating-point multiply pipeline (M1-M7), a four-stage floating-point add pipeline (A1-A4), and an FP/INT divide unit that is not pipelined and takes 25 clocks.]

Superscalar Design (CPI < 1)

● Pipelining can get CPI = 1 and a fast clock. Can we do better?
● Superscalar design: execute multiple instructions every clock.
● Problems for superscalar design:
◆ Need multiple execution units (pipelines)
◆ Structural hazards:
➤ Need multiple accesses to register files
➤ Might need multiple accesses to caches
◆ Data hazards:
➤ How to deal with data dependencies (keep program semantics)?
➤ What to do with stalled instructions?
◆ Control hazards:
➤ What to do about conditional branches?

Superscalar Design Solutions

● Multiple pipelines are not a problem; we already had them in the "regular" design.
● Structural hazards: build register files with many read and write ports (e.g., 7 read and 3 write ports); build multi-port caches.
● Data hazard solutions:
◆ Issue instructions in order; use a score-board to eliminate data hazards by stalling instructions.
◆ "Better solution": issue instructions out of order, use register renaming to avoid data hazards, graduate instructions in order (see the sketch below).
● Control hazard solutions:
◆ Use branch prediction:
◆ Make sure that the branch is resolved before registers are modified,
◆ OR use speculation and roll back results if branches were predicted wrong.
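To make the hazard discussion concrete, here is a hypothetical C fragment (not from the lecture); the comments assume a 2-wide issue machine purely for illustration.

```c
/* Hypothetical example (not from the slides): which statements a 2-wide
 * superscalar could issue together, assuming each line maps to one ALU op. */
int combine(int a, int b, int c, int d) {
    int x = a + b;   /* independent of the next line                        */
    int y = c + d;   /* no hazard with x = a + b: the two can dual-issue    */
    int z = x * y;   /* RAW hazard on x and y: must wait (stall, or an      */
                     /* out-of-order machine issues something else first)   */
    int w = c - d;   /* independent: an out-of-order core can issue this    */
                     /* ahead of z even though it appears later in program  */
                     /* order; renaming keeps the program semantics intact  */
    return z + w;
}
```

An in-order, scoreboarded machine simply stalls at z; an out-of-order machine issues w early and still graduates everything in program order.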

The Alpha 21164 Superscalar

● Can issue up to four instructions per clock cycle

● Deep pipeline: 7-stage integer, 9-stage floating point, up to 13 stages for an on-chip load/store.

● There are two Integer and two Floating-point pipelines.

● In-order issue, in-order execution.

● Use score-board to stall instructions with conflicts.

● Use score-board to compute all register forwarding operations.

● The integer register file has 4 read ports and 2 write ports.

● The floating-point register file has 6 read ports and 3 write ports.

● Use Branch Prediction to keep the pipe full.
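For intuition, here is a minimal sketch of a generic 2-bit saturating-counter branch predictor; the table size and indexing are illustrative assumptions, not the 21164's actual design.

```c
#include <stdbool.h>

/* Generic 2-bit saturating-counter predictor (illustrative; not the 21164's
 * exact scheme). Counter values 0-1 predict not-taken, 2-3 predict taken.   */
#define PHT_ENTRIES 2048
static unsigned char pht[PHT_ENTRIES];          /* pattern history table     */

bool predict_taken(unsigned pc) {
    return pht[(pc >> 2) % PHT_ENTRIES] >= 2;   /* taken if counter is 2-3   */
}

void train(unsigned pc, bool taken) {
    unsigned char *c = &pht[(pc >> 2) % PHT_ENTRIES];
    if (taken  && *c < 3) (*c)++;               /* strengthen toward taken   */
    if (!taken && *c > 0) (*c)--;               /* strengthen toward not taken */
}
```

The two-bit counter tolerates one mispredicted iteration (e.g., a loop exit) without flipping the prediction for the next encounter, which helps keep the deep pipe full.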

The Alpha 21164 Superscalar Pipeline

[Figure: block diagram of the 21164. Front end: PC logic, instruction TLB, instruction and refill buffers, next-index logic, slot logic, issue logic, and scoreboard logic. Floating-point side: FP register file, FP add & divide pipeline (with FP divider), FP multiply pipeline (pipeline 0), and FP store data path. Integer side: integer register file, integer pipelines 1 and 2, integer multiply logic, and integer store path. Memory side: data TLB, data cache with store & fill paths to the FP units, and the level-2 cache.]

Alpha 21164 Pipeline Stages

Instruction fetch and issue (S0-S3):
S0: read instruction cache
S1: buffer instruction
S2: slot: steer to execution pipeline
S3: determine instruction issue, fetch integer registers

Integer pipeline (S4-S6):
S4: first integer pipeline stage
S5: second integer pipeline stage
S6: write integer register file

Floating-point pipeline (S4-S8):
S4: read floating-point registers
S5: first floating-point pipeline stage
S6-S7: remaining floating-point pipeline stages
S8: last floating-point stage, write register

Memory access pipeline (S4-S12):
S4: calculate virtual address, begin data cache read
S5: end data cache read, translate to physical address
S6: use data; write store to cache; start L2 tag access
S7: end L2 tag access
S8: start L2 data access
S9: end L2 data access
S10: begin data cache fill
S11: end data cache fill
S12: use L2 data

MIPS R10000 Superscalar

● Issues 4 instructions at a time
● Has 5 execution units: 2 FP units, 2 integer units, and a load/store unit
● Out-of-order execution
● Speculative execution: predicts up to four branches at a time

MIPS R10000 CPU

[Figure: block diagram of the R10000. Floating-point side: FP queue (16 entries), FP register file (64x64, 5 read / 3 write ports), FP add pipeline (align, add/normalize, pack), FP multiply pipeline (multiply, sum/normalize, pack), and divide and square-root units. Front end: 32 KB instruction cache with pre-decode, instruction decode, register map table, free list, active list (32 entries), busy-bit table, and branch unit. Memory side: load-store queue (16 entries), address calculation, TLB (64x2 entries), and 32 KB data cache. Integer side: integer queue (16 entries), integer register file (64x64, 7 read / 3 write ports), and two integer ALUs.]

MIPS R10000 Pipelines

[Figure: R10000 pipelines (stages 1-7 after I-fetch, I-decode, I-issue). A common fetch/decode/map/write pipeline fills the instruction queues with 4 instructions in parallel; up to 4 branch instructions are predicted, with speculative fetching from the I-cache. Execution pipelines: floating-point add (issue, RF, alignment, add, pack, WB) and floating-point multiply (issue, RF, multiply, sum product, pack, WB), each with latency 2; load/store (issue, RF, address calc, D-cache, load WB), integer or FP, latency 2; two integer ALU pipelines (issue, RF, ALU, WB), latency 1.]

What is a Parallel Computer?

● A parallel computer is a collection of processing elements that cooperate to solve large problems fast
◆ how large a collection?
◆ how powerful are the elements?
◆ how does it scale up?
◆ how do they cooperate and communicate?
◆ how is data transmitted between processors?
◆ what are the primitive abstractions?
◆ how does it all translate to performance?

Parallel Computation: Why and Why Not?

● Pros
◆ Performance
◆ Cost-effectiveness (commodity parts)
◆ Smooth upgrade path
◆ Fault tolerance
● Cons
◆ Difficult to parallelize applications
◆ Requires automatic parallelization or parallel program development
◆ Software! AAHHHH!

Applications: Science and Engineering

● Examples
◆ Weather prediction
◆ Evolution of galaxies
◆ Oil reservoir simulation
◆ Automobile crash tests
◆ Drug development
◆ VLSI CAD
◆ Nuclear BOMBS!
● Typically model physical systems or phenomena
● Problems are 2D or 3D
● Usually requires "number crunching"
● Involves "true" parallelism

Applications: Commercial

● Examples
◆ Transaction processing
◆ Databases
◆ Financial models
● Involves data movement, not much number crunching
● Involves throughput parallelism

Applications: Multi-media/Home

● Examples
◆ speech recognition
◆ data compression/decompression
◆ 3D graphics
● Will become ubiquitous
● Involves everything (crunching, data movement, true parallelism, and throughput parallelism)

Flynn Taxonomy

● SISD
◆ Single Instruction, Single Data
◆ Standard sequential machines
● SIMD
◆ Single Instruction, Multiple Data
◆ Early "vector" computers -- CRAY 1, CDC Star
◆ On a single chip today, multimedia (decompression)
◆ Special applications (graphics, image processing, cryptography); see the loop sketch below
● MIMD
◆ Multiple Instruction, Multiple Data
◆ Most of today's parallel machines
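The kind of loop SIMD hardware targets looks like the following C sketch (illustrative, not from the slides): one instruction stream applies the same operation to many data elements.

```c
/* Illustrative sketch: the same operation applied to every element -- exactly
 * what a SIMD machine executes as a single instruction over multiple data
 * lanes (e.g., brightening pixels during multimedia decompression).          */
void brighten(unsigned char *pixel, int n, int delta) {
    for (int i = 0; i < n; i++) {
        int v = pixel[i] + delta;            /* one operation, many pixels    */
        pixel[i] = v > 255 ? 255 : v;        /* saturating add on SIMD hw     */
    }
}
```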

Message Passing Architectures

[Figure: four nodes (Node 0 through Node 3), each with a processor (P), memory (Mem, addresses 0..N-1), cache ($), and communication assist (CA), connected by an interconnect.]
● Cannot directly access memory on another node
● IBM SP-2, Intel Paragon
● Cluster of workstations

Message Passing Programming Model

[Figure: Process P and Process Q, each with its own local address space. P executes "Send x, Q, t" from address x; Q executes "Recv y, P, t" into address y; the send and receive are matched by process and tag.]
● User-level send/receive abstraction
◆ local buffer (x, y), process (Q, P), and tag (t)
◆ naming and synchronization
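A minimal MPI sketch of this send/receive abstraction (the lecture does not prescribe MPI; the buffer names and tag value are illustrative):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, tag = 7;
    double x = 3.14, y = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {            /* plays "Process P": Send x, Q, t          */
        MPI_Send(&x, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {     /* plays "Process Q": Recv y, P, t          */
        MPI_Recv(&y, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received y = %f\n", y);
    }

    MPI_Finalize();
    return 0;
}
```

Run with at least two processes (e.g., mpirun -np 2): the matching send and receive name each other by rank and tag, which is the naming and synchronization the slide refers to.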

Single Shared Address Space

[Figure: the machine's physical address space. Each process P0..Pn maps a common set of physical addresses (the shared portion of its address space) plus its own private portion (P0 private, P1 private, P2 private, ..., Pn private); loads and stores to the shared portion reach the same physical locations.]
● Communication, sharing, and synchronization with store/load on shared variables (see the threads sketch below)
● Must map virtual pages to physical page frames
● Consider OS support for a good mapping
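A minimal sketch, assuming POSIX threads, of communicating through loads and stores on a shared variable (variable and function names are illustrative):

```c
#include <pthread.h>
#include <stdio.h>

static long shared_sum = 0;                    /* lives in the shared portion */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long my_part = (long)arg;
    pthread_mutex_lock(&lock);                 /* synchronization             */
    shared_sum += my_part;                     /* plain store to shared       */
    pthread_mutex_unlock(&lock);               /* memory: both threads see    */
    return NULL;                               /* the same address            */
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)10L);
    pthread_create(&t2, NULL, worker, (void *)32L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_sum = %ld\n", shared_sum);  /* 42: both updates visible    */
    return 0;
}
```

Unlike message passing, nothing is sent explicitly: the threads communicate simply by loading and storing the same shared location.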

Small Scale Multiprocessors

[Figure: several processors (P), each with its own cache(s) and TLB ($), sharing a bus to one main memory (addresses 0..N-1).]

● Small number of processors connected to one shared memory
● Memory is equidistant from all processors (UMA)
● Kernel can run on any processor (symmetric MP)

Cache Coherence Problem (Initial State)

[Figure: two processors, P1 and P2, each with a private cache, connected by a bus to main memory; location x resides in main memory and is not yet cached.]

Cache Coherence Problem (Step 1)

[Figure: one processor executes ld r2, x; a copy of x is brought from main memory into its cache.]

Cache Coherence Problem (Step 2)

[Figure: the other processor also executes ld r2, x; both caches now hold copies of x.]

Cache Coherence Problem (Step 3)

[Figure: one processor then computes add r1, r2, r4 and executes st x, r1, updating the value of x in its own cache; the copy of x in the other processor's cache (and in main memory) is now stale.]
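A toy walkthrough of the three steps (not from the slides), modeling each private cache as a plain variable to show how the copies diverge when there is no coherence protocol:

```c
#include <stdio.h>

int main(void) {
    int memory_x = 5;               /* location x in main memory             */
    int cache_p1, cache_p2;         /* per-processor cached copies of x      */

    cache_p1 = memory_x;            /* step 1: ld r2, x -> one cache gets x  */
    cache_p2 = memory_x;            /* step 2: ld r2, x -> both caches do    */

    int r1 = cache_p1 + 4;          /* step 3: add r1, r2, r4                */
    cache_p1 = r1;                  /*         st x, r1 updates only the     */
                                    /*         writer's cached copy          */

    printf("writer sees x = %d, other cache sees x = %d, memory x = %d\n",
           cache_p1, cache_p2, memory_x);   /* 9, 5, 5 -> incoherent         */
    return 0;
}
```

Real shared-memory hardware prevents this outcome with a cache coherence protocol, which is the subject of the next slides.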

Snoopy Cache-Coherence Protocols

● Bus provides a serialization point for consistency
● Each cache controller "snoops" all bus transactions
◆ a transaction is relevant if it is for a block the cache contains
◆ take action to ensure coherence (see the state sketch below)
➤ invalidate
➤ update
➤ supply value
◆ action depends on the state of the block and the protocol
● Simultaneous operation of independent controllers
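A minimal sketch of one common snoopy scheme, a three-state (MSI) write-invalidate protocol; the lecture does not commit to a particular protocol, so the states and transitions here are the generic textbook ones.

```c
/* Per-block cache line state in a simple MSI write-invalidate protocol. */
typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

/* Action taken when this cache snoops another processor's bus transaction. */
line_state_t snoop(line_state_t st, int bus_read, int bus_write) {
    if (bus_write)                   /* another cache wants to write the block */
        return INVALID;              /* -> invalidate our copy                 */
    if (bus_read && st == MODIFIED)  /* we hold the only up-to-date copy       */
        return SHARED;               /* -> supply the value, drop to shared    */
    return st;                       /* otherwise nothing to do                */
}

/* Action taken on this processor's own access to the block. */
line_state_t cpu_access(line_state_t st, int is_write) {
    if (is_write)                    /* put an invalidating transaction on the */
        return MODIFIED;             /* bus, then write: exclusive ownership   */
    return (st == INVALID) ? SHARED : st;   /* read miss -> fetch as shared    */
}
```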

Large Scale Shared Memory Multiprocessors

[Figure: many nodes, each containing processor(s) (P), memory (Mem), cache ($), and a memory controller/network interface (Cntrl/NI), connected by a general interconnect.]
● 100s to 1000s of nodes (processors) with a single shared physical address space
● Use a general-purpose interconnection network
◆ still have a cache coherence protocol
◆ use messages instead of bus transactions
◆ no hardware broadcast
● Communication assist

Directory Based Cache Coherence

● Avoid broadcasting the request to all nodes on a miss
◆ traffic
◆ time
● Maintain a directory of which nodes have cached copies of the block (directory controller + directory state); see the sketch below
● On a miss, send a message to the directory
● The directory determines what (if any) protocol action is required
◆ e.g., invalidation
● The directory waits for the protocol actions to finish and then responds to the original request
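A minimal sketch of what a full-bit-vector directory entry and a write-miss action might look like; the field names, node limit, and omitted network sends are illustrative assumptions, not details from the lecture.

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_NODES 64                 /* assumed machine size for a 64-bit map */

/* Directory state kept for one memory block at its home node. */
typedef struct {
    uint64_t sharers;   /* bit i set => node i holds a cached copy           */
    bool     dirty;     /* true => exactly one node holds a modified copy    */
    uint8_t  owner;     /* valid only when dirty: which node owns the block  */
} dir_entry_t;

/* On a write miss from 'node': invalidate every other cached copy, then
 * record the requester as the exclusive (dirty) owner.                      */
void handle_write_miss(dir_entry_t *e, unsigned node) {
    for (unsigned i = 0; i < MAX_NODES; i++) {
        if (((e->sharers >> i) & 1) && i != node) {
            /* send an invalidation message to node i and await its ack
             * (network send omitted in this sketch)                         */
        }
    }
    e->sharers = 1ULL << node;
    e->dirty   = true;
    e->owner   = (uint8_t)node;
}
```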

Today's Parallel Computer Architecture

● Extension of traditional computer architecture to support communication and cooperation
◆ Communications architecture

[Figure: layers of the communications architecture. User level: programming models (shared memory, message passing, data parallel, multiprogramming). System support level: the communication abstraction provided by libraries and the operating system. Below the hardware/software boundary: the communication hardware and the physical communication medium.]

Toward a Generic Parallel Machine

[Figure: four nodes (Node 0 through Node 3), each with processor(s) (P), memory (Mem), cache ($), and a communication assist (CA), connected by an interconnect.]
● Separation of programming models from architectures
● All models require communication
● Node with processor(s), memory, and communication assist
