CSE 240B
Dean Tullsen

Multiprocessors
• Why would you want a multiprocessor?
• What things can it do well?
• What things can't it do well?
• What things can it do that a bunch of computers can't do?
• How much are you willing to pay?

[Figure: processors, each with its own cache, connected by a single bus to memory and I/O]


Classifying Multiprocessors
• Flynn Taxonomy
• Interconnection Network
• Memory Topology
• Programming Model

Flynn Taxonomy
• SISD (Single Instruction Single Data)
– Uniprocessors
• SIMD (Single Instruction Multiple Data)
– Examples: Illiac-IV, CM-2
» Simple programming model
» Low overhead
» All custom
• MIMD (Multiple Instruction Multiple Data)
– Examples: many, nearly all modern MPs
» Flexible
» Use off-the-shelf micros
• MISD (Multiple Instruction Single Data)
– ???

Interconnection Network
• Bus
• Network
• pros/cons?

[Figure: processors with caches on a single bus shared with memory and I/O, versus processors with caches and per-processor memories connected by a network]

Memory Topology
• UMA (Uniform Memory Access)
• NUMA (Non-uniform Memory Access)
• pros/cons?

[Figure: UMA, processors with caches sharing one memory and I/O over a single bus; NUMA, cpu/memory (M) pairs connected by a network]

Programming Model
• Shared Memory -- every processor can name every address location
• Message Passing -- each processor can name only its local memory. Communication is through explicit messages.

[Figure: processors, each with a cache and a memory, connected by a network]

A shared memory architecture with network interconnection is sometimes called Distributed Shared Memory (DSM).

Parallel Programming -- Review

i = 47
Processor A: index = i++;
Processor B: index = i++;

• Shared-memory programming requires synchronization to provide mutual exclusion and prevent race conditions
– locks (semaphores)
– barriers

Small-Scale Multiprocessors
• Caches serve to:
– Reduce latency of access
– Preserve bus/memory bandwidth
– Valuable for both private data and shared data
• What about cache coherence/consistency?

[Figure: processors with caches on a single bus shared with memory and I/O]

Communication Models -- Shared Memory
• Shared Memory
– Processors communicate through shared address space
– Easy on small-scale machines
– Advantages:
» Model of choice for uniprocessors, small-scale MPs
» Ease of programming
» Lower latency
» Easier to use hardware controlled caching
• Message passing
– Processors have private memories, communicate via messages
– Advantages:
» Less hardware, easier to design
» Focuses attention on costly non-local operations
• Can support either model on either HW base
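The `index = i++` race from the Parallel Programming review, and the two communication models, can be sketched in a few lines. This is only an illustration: Python threads stand in for processors, and the names `SharedCounter`, `run_shared_memory`, and `run_message_passing` are hypothetical, not from the slides. In the shared-memory version a lock provides the mutual exclusion; in the message-passing version only one "owner" can name `i`, and the others must send explicit requests.

```python
import queue
import threading


class SharedCounter:
    """Shared-memory style: every thread can name `i` directly."""

    def __init__(self, start):
        self.i = start
        self.lock = threading.Lock()

    def fetch_and_increment(self):
        # Equivalent of `index = i++`; the lock makes the
        # read-modify-write sequence mutually exclusive.
        with self.lock:
            index = self.i
            self.i += 1
            return index


def run_shared_memory(start=47, workers=2):
    counter = SharedCounter(start)
    results = [None] * workers

    def worker(slot):
        results[slot] = counter.fetch_and_increment()

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results), counter.i


def run_message_passing(start=47, workers=2):
    """Message-passing style: only the owner thread can name `i`."""
    requests = queue.Queue()

    def owner():
        i = start  # private to the owner
        for _ in range(workers):
            reply_box = requests.get()  # explicit request message
            reply_box.put(i)            # explicit reply message
            i += 1

    results = [None] * workers

    def worker(slot):
        reply_box = queue.Queue()
        requests.put(reply_box)
        results[slot] = reply_box.get()

    threads = [threading.Thread(target=owner)]
    threads += [threading.Thread(target=worker, args=(n,)) for n in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```

In both versions the two workers receive distinct index values; without the lock (or the single owner), both could read 47.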


The Problem of Cache Coherence

[Figure: processors with private caches on a single bus shared with memory and I/O]

What Does Coherence Mean?
• Informally:
– Any read must return the most recent write
– Too strict and very difficult to implement
• Better:
– A processor sees its own writes to a location in the correct order.
– Any write must eventually be seen by a read
– All writes are seen in order ("serialization"). Writes to the same location are seen in the same order by all processors.
• Without these guarantees, synchronization doesn't work.
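To make the coherence problem concrete, here is a minimal sketch of two private write-back caches with no snooping at all. The `IncoherentCache` class and the address `"X"` are hypothetical, chosen only for illustration. After P1 writes, P2 keeps reading its stale copy, so the write is never seen by P2's reads.

```python
class IncoherentCache:
    """Private write-back cache with no coherence mechanism (toy model)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def read(self, addr):
        # On a miss, fetch from memory; afterwards always hit locally.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # The new value stays in this cache; no other cache is told.
        self.lines[addr] = value


def demo_stale_read():
    memory = {"X": 0}
    p1, p2 = IncoherentCache(memory), IncoherentCache(memory)
    before = p2.read("X")   # P2 caches X = 0
    p1.write("X", 1)        # P1 updates only its own copy
    after = p2.read("X")    # P2 still hits on its old copy
    return before, after, p1.read("X")
```

Here `demo_stale_read()` shows P1 seeing its own write while P2 does not, which is exactly the situation a coherence protocol has to rule out.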

Potential Solutions
• Snooping Solution (Snoopy Bus):
– Send all requests for unknown data to all processors
– Processors snoop to see if they have a copy and respond accordingly
– Requires "broadcast", since caching information is at processors
– Works well with bus (natural broadcast medium)
– Dominates for small scale machines (most of the market)
• Directory-Based Schemes
– Keep track of what is being shared in one centralized place
– Distributed memory => distributed directory (avoids bottlenecks)
– Send point-to-point requests to processors
– Scales better than Snoop
– Actually existed BEFORE Snoop-based schemes

Basic Snoopy Protocols
• Write Invalidate Protocol:
– Write to shared data: an invalidate is sent to all caches which snoop and invalidate any copies
– Read Miss:
» Write-through: memory is always up-to-date
» Write-back: snoop in caches to find most recent copy
• Write Update Protocol:
– Write to shared data: broadcast on bus, processors snoop, and update copies
– Read miss: memory is always up-to-date
• Write serialization: bus serializes requests
– Bus is single point of arbitration

Basic Snoopy Protocols
• Write Invalidate versus Broadcast:
– Invalidate requires one transaction per write-run
– Invalidate exploits spatial locality: one transaction per block
– Broadcast has lower latency between write and read
– Broadcast: BW (increased) vs. latency (decreased) tradeoff

An Example Snoopy Protocol
• Invalidation protocol, write-back cache
• Each block of memory is in one state:
– Clean in all caches and up-to-date in memory
– Dirty in exactly one cache
– Not in any caches
• Each cache block is in one state:
– (S)hared: block can be read
– (E)xclusive: cache has only copy, it's writeable, and dirty
– (I)nvalid: block contains no data
• Read misses: cause all caches to snoop
• Writes to clean line are treated as misses
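The "one transaction per write-run" point in the invalidate-versus-broadcast comparison can be checked with a rough count. The sketch below is a simplified model, not a description of any real machine: it tracks a single shared block, assumes every processor starts with a valid copy, and counts only the bus transactions caused by writes and by read misses. The function name `count_bus_transactions` is hypothetical.

```python
def count_bus_transactions(trace):
    """Count bus transactions on one shared block.

    trace: list of (proc, op) with op in {"R", "W"}.
    Returns (write_invalidate_count, write_update_count).
    Assumes every processor starts with a valid copy of the block.
    """
    procs = {proc for proc, _ in trace}

    # Write invalidate: the first write of a write-run invalidates the
    # other copies; later writes by the same processor stay local.
    valid = set(procs)   # processors holding a valid copy
    owner = None         # processor holding the block exclusively
    invalidate = 0
    for proc, op in trace:
        if op == "W":
            if owner != proc:
                invalidate += 1
                valid = {proc}
                owner = proc
        elif proc not in valid:
            invalidate += 1      # read miss re-fetches the block
            valid.add(proc)
            owner = None         # the block is shared again

    # Write update: every write is broadcast, so copies never go stale
    # and reads never miss.
    update = sum(1 for _, op in trace if op == "W")
    return invalidate, update
```

Under these assumptions, a run of three writes by P1 followed by one read by P2 costs 2 transactions with invalidate and 3 with update, while a write/read ping-pong between P1 and P2 favors update (4 versus 2 over two rounds).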

Other protocols (more common)
• ESI = Exclusive, Shared, Invalid
• MESI = Modified, Exclusive, Shared, Invalid
– Exclusive = private, clean
– Modified = private, dirty
• MOESI = Modified, Owned, Exclusive, Shared, Invalid
– Owned = responsible for writing back shared, dirty line.

Snoopy-Cache State Machine

[Figure: two state diagrams over the states Invalid, Shared (read only), and Exclusive (read/write). One shows cache state transitions based on requests from the CPU (CPU read hit, CPU read miss, CPU write hit, CPU write miss; actions: place read miss on bus, place write miss on bus, write-back cache block). The other shows cache state transitions based on requests from the bus (read miss for this block, write miss for this block; actions: write-back block, abort memory access).]


Example
ESI Protocol

                      P1                 P2                 Bus                        Memory
step                  State Addr Value   State Addr Value   Action Proc. Addr Value    Addr Value
P1: Write 10 to A1    E     A1   10      I                  WM     1     A1            A1   old
P1: Read A1           E     A1   10
P2: Read A1           S     A1   10                         WB     1     A1   10       A1   10    (abort RM)
                                         S     A1   10      RM     2     A1   10       A1   10
P2: Write 20 to A1    I                  E     A1   20      WM     2     A1            A1   10
P2: Write 40 to A2    I                  E     A2   40      WB     2     A1   20       A1   20
                                                            WM     2     A2   40       A2   old

Assumes A1 and A2 map to same cache block
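The example can be replayed with a small simulation of the three-state invalidation protocol. This is a sketch under stated assumptions, not the course's reference implementation: each cache holds a single block (so A1 and A2 conflict), the bus actions WM, RM, and WB are read as write miss, read miss, and write-back, the aborted and re-issued read miss is not modelled, and each miss is logged before the write-back it triggers in another cache. The names `Cache`, `access`, and `run_example` are hypothetical.

```python
class Cache:
    """One-block cache: A1 and A2 are assumed to map to this single block."""

    def __init__(self, name):
        self.name = name
        self.state, self.addr, self.value = "I", None, None


def access(caches, memory, who, op, addr, value=None):
    """Apply one CPU read ("R") or write ("W"); return the bus actions."""
    me = caches[who]
    bus = []
    hit = me.state != "I" and me.addr == addr
    if hit and (op == "R" or me.state == "E"):
        if op == "W":
            me.value = value
        return bus  # read hit, or write hit on an exclusive block

    # Miss path. A write to a Shared (clean) block is treated as a miss.
    # First evict a dirty block that occupies the same cache block.
    if me.state == "E" and me.addr != addr:
        memory[me.addr] = me.value
        bus.append(("WB", who, me.addr, me.value))
    bus.append(("WM" if op == "W" else "RM", who, addr))

    # Every other cache snoops the miss.
    for other in caches.values():
        if other is me or other.state == "I" or other.addr != addr:
            continue
        if other.state == "E":  # dirty copy: write it back first
            memory[addr] = other.value
            bus.append(("WB", other.name, addr, other.value))
        other.state = "I" if op == "W" else "S"

    me.addr = addr
    if op == "W":
        me.state, me.value = "E", value
    else:
        me.state, me.value = "S", memory[addr]
    return bus


def run_example():
    memory = {"A1": "old", "A2": "old"}
    caches = {"P1": Cache("P1"), "P2": Cache("P2")}
    steps = [
        ("P1", "W", "A1", 10),
        ("P1", "R", "A1", None),
        ("P2", "R", "A1", None),
        ("P2", "W", "A1", 20),
        ("P2", "W", "A2", 40),
    ]
    log = [access(caches, memory, *step) for step in steps]
    return log, caches, memory
```

Calling `run_example()` returns the bus actions per step along with the final cache and memory contents, which can be compared row by row against the table.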

Snoopy Cache: State Machine
Extensions (some already mentioned):
– Clean exclusive state (no miss for private data on write): Illinois Protocol (also MESI)
– Ownership
– Cache-cache transfers

Multiprocessors -- Key Points
• Network vs. Bus
• Message-passing vs. Shared Memory
• Shared Memory is more intuitive, but creates problems for both the programmer (memory consistency, requiring synchronization) and the architect (cache coherence).
• Snoopy cache coherence protocols trade off complexity (more states) for reduced bandwidth demands.
