Multiprocessors and Multithreading

Multiprocessors

• Why would you want a multiprocessor? Is more always better than a single processor?

[Figure: two processors, each with its own cache, sharing a single bus to memory and I/O]

Classifying Multiprocessors

• Flynn Taxonomy
• Interconnection Network
• Memory Topology
• Programming Model

Flynn Taxonomy

• SISD (Single Instruction, Single Data)
  – Uniprocessors
• SIMD (Single Instruction, Multiple Data)
  – Examples: Illiac-IV, CM-2, Nvidia GPUs, etc.
    » Simple programming model
    » Low overhead
• MIMD (Multiple Instruction, Multiple Data)
  – Examples: many; nearly all modern multiprocessors or multicores
    » Flexible
    » Can use off-the-shelf cores
• MISD (Multiple Instruction, Single Data)
  – ???

(A small code contrast between the SIMD and MIMD styles follows below.)
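To make the SIMD/MIMD distinction concrete, here is a minimal C sketch (illustrative, not from the slides; the function names simd_style and worker are made up). The SIMD style applies one operation to many data elements in lockstep; the MIMD style runs independent instruction streams as POSIX threads.

    #include <pthread.h>
    #include <stdio.h>

    #define N 8

    /* SIMD style: one instruction stream, many data elements.
       A vector unit or GPU warp would perform this add on all
       elements of a[] in lockstep. */
    void simd_style(int *a, int n) {
        for (int i = 0; i < n; i++)   /* conceptually one vector add */
            a[i] = a[i] + 1;
    }

    /* MIMD style: independent instruction streams. Each thread
       executes its own code on its own data. */
    void *worker(void *arg) {
        int id = *(int *)arg;
        printf("thread %d doing its own work\n", id);
        return NULL;
    }

    int main(void) {
        int a[N] = {0};
        simd_style(a, N);
        printf("after SIMD-style add: a[0] = %d\n", a[0]);

        pthread_t t[4];
        int id[4];
        for (int i = 0; i < 4; i++) {
            id[i] = i;
            pthread_create(&t[i], NULL, worker, &id[i]);
        }
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }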

Interconnection Network

• Bus
• Network
• pros/cons?

[Figure: left, processors with caches sharing a single bus to memory and I/O; right, processors P0–P7 connected by a network]

Memory Topology

• UMA (Uniform Memory Access)
• NUMA (Non-Uniform Memory Access)
• pros/cons?

[Figure: UMA — processors with caches on a single bus to one shared memory and I/O; NUMA — each CPU paired with its own local memory, all connected by a network]

Programming Model

• Shared Memory -- every processor can name every address location
• Message Passing -- each processor can name only its own local memory. Communication is through explicit messages.
• pros/cons?

Parallel Programming

• Example: i starts at 47 in shared memory, and Processor A and Processor B each execute index = i++; What values can index and i end up with? (A sketch of this race, and a lock-based fix, follows below.)
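A minimal C/pthreads sketch of the race above (illustrative, not from the slides). The key point: i++ compiles to a read-modify-write sequence, so two unsynchronized threads can both read 47 and one increment is lost.

    #include <pthread.h>
    #include <stdio.h>

    int i = 47;   /* the shared variable from the slide */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Unsynchronized: both threads may read 47, both may get
       index = 47, and i can end at 48 instead of 49. */
    void *racy(void *arg) {
        (void)arg;
        int index = i++;           /* read-modify-write: not atomic */
        return (void *)(long)index;
    }

    /* With mutual exclusion the increments serialize: one thread
       gets index = 47, the other 48, and i always ends at 49. */
    void *safe(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);
        int index = i++;
        pthread_mutex_unlock(&lock);
        return (void *)(long)index;
    }

    static void run(void *(*f)(void *), const char *name) {
        pthread_t a, b;
        i = 47;
        pthread_create(&a, NULL, f, NULL);
        pthread_create(&b, NULL, f, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("%s: i = %d\n", name, i);
    }

    int main(void) {
        run(racy, "racy");   /* usually 49, occasionally 48 */
        run(safe, "safe");   /* always 49 */
        return 0;
    }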

[Figure: processors, each with its own cache, connected through a network to distributed memories]

• Shared-memory programming requires synchronization to provide mutual exclusion and prevent race conditions
  – locks (semaphores)
  – barriers
• Exercise: find the max of 100,000 integers on 10 processors (a sketch follows below).
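One hedged sketch of the exercise, assuming POSIX threads (all names are illustrative): each of 10 threads finds a local max over its 10,000-element slice, then merges its result under a lock; the joins at the end serve as the final barrier.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 100000
    #define NTHREADS 10

    int data[N];
    int global_max;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Each thread scans its own slice locally (no sharing),
       then takes the lock exactly once to merge its partial max. */
    void *find_max(void *arg) {
        long id = (long)arg;
        int lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
        int local_max = data[lo];
        for (int i = lo + 1; i < hi; i++)
            if (data[i] > local_max) local_max = data[i];

        pthread_mutex_lock(&lock);       /* mutual exclusion */
        if (local_max > global_max) global_max = local_max;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) data[i] = rand();
        global_max = data[0];

        pthread_t t[NTHREADS];
        for (long id = 0; id < NTHREADS; id++)
            pthread_create(&t[id], NULL, find_max, (void *)id);
        for (int id = 0; id < NTHREADS; id++)
            pthread_join(t[id], NULL);   /* joins act as the barrier */

        printf("max = %d\n", global_max);
        return 0;
    }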

Multiprocessor Caches (Shared Memory)

• the problem -- cache coherency
• the solution?

[Figure: processors, each with its own cache, on a single bus to memory and I/O]

What Does Coherence Mean?

• Informally:
  – Any read must return the most recent write
  – Too strict, and very difficult to implement
• Better:
  – A processor sees its own writes to a location in the correct order.
  – Any write must eventually be seen by a read.
  – All writes are seen in order ("serialization"): writes to the same location are seen in the same order by all processors.
• Without these guarantees, synchronization doesn't work (the sketch below shows why).
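An illustrative C11 sketch (not from the slides) of why synchronization depends on these guarantees: the consumer spins on a flag and then reads the data. That pattern only terminates if the producer's write to the flag eventually becomes visible, and it only reads 42 if the writes are seen in the stated order.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    int data;              /* the payload */
    atomic_int ready = 0;  /* the synchronization flag */

    /* Producer: write the data, then publish the flag. */
    void *producer(void *arg) {
        (void)arg;
        data = 42;
        atomic_store(&ready, 1);  /* must not be seen before the data write */
        return NULL;
    }

    /* Consumer: spin until the flag is seen, then read the data.
       If the write to ready were never propagated, this would spin
       forever; if writes were seen out of order, it could read stale
       data. Coherence (plus ordering) is what makes it work. */
    void *consumer(void *arg) {
        (void)arg;
        while (atomic_load(&ready) == 0)
            ;   /* spin */
        printf("data = %d\n", data);
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }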


Potential Solutions

• Snooping Solution (Snoopy Bus):
  – Send all requests for unknown data to all processors
  – Processors snoop to see if they have a copy and respond accordingly
  – Requires "broadcast", since caching information is at the processors
  – Works well with a bus (natural broadcast medium)
  – Dominates for small-scale machines (most of the market)
• Directory-Based Schemes
  – Keep track of what is being shared in one centralized place
  – Distributed memory => distributed directory (avoids bottlenecks)
  – Send point-to-point requests to processors
  – Scales better than snooping
  – Actually existed BEFORE snoop-based schemes

Basic Snoopy Protocols

• write-update
  – on each write, each cache holding that location updates its value
• write-invalidate <= most common
  – on each write, each cache holding that location invalidates the cache line
• both schemes are MUCH easier on a bus-based multiprocessor
• potentially requires a LOT of messages, but...
• Write-Invalidate versus Broadcast (write-update):
  – Invalidate requires one transaction per write-run
  – Invalidate exploits spatial locality: one transaction per block
  – Broadcast has lower latency between write and read
  – Broadcast: BW (increased) vs. latency (decreased) tradeoff

An Example Snoopy Protocol

• Invalidation protocol, write-back cache
• Each block of memory is in one state:
  – Clean in all caches and up-to-date in memory
  – Dirty in exactly one cache
  – Not in any caches
• Each cache block is in one state:
  – (S)hared: block can be read
  – (E)xclusive: this cache has the only copy; it is writeable and dirty
  – (I)nvalid: block contains no data
• Read misses: cause all caches to snoop
• Writes to a clean line are treated as misses


Snoopy-Cache State Machine

[Figure: three cache-line states — Invalid, Shared (read only), and Exclusive (read/write). Transitions based on requests from the CPU: a CPU read miss takes Invalid to Shared (place read miss on bus); a CPU write takes Invalid or Shared to Exclusive (place write miss on bus); a CPU write miss while Exclusive writes back the block and places a write miss on the bus; CPU read and write hits stay in place. Transitions based on requests from the bus: a write miss for this block invalidates it (write-back the block and abort the memory access if Exclusive); a read miss for this block while Exclusive writes back the block, aborts the memory access, and downgrades to Shared. A code sketch of these transitions follows below.]

Other protocols (more common)

• ESI = Exclusive, Shared, Invalid
• MESI = Modified, Exclusive, Shared, Invalid
  – Exclusive = private, clean
  – Modified = private, dirty
• MOESI = Modified, Owned, Exclusive, Shared, Invalid
  – Owned = responsible for writing back a shared, dirty line
• How does MESI avoid sending messages across the bus (compared to ESI)?
• What traffic does MOESI avoid?
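A simplified software model of the three-state invalidate protocol from the state machine above (a sketch only: it tracks a single line and ignores conflict misses and the actual bus hardware).

    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE } line_state;

    typedef enum {
        CPU_READ, CPU_WRITE,          /* requests from this core  */
        BUS_READ_MISS, BUS_WRITE_MISS /* snooped from the bus     */
    } event;

    line_state next_state(line_state s, event e,
                          int *put_miss_on_bus, int *write_back) {
        *put_miss_on_bus = 0;
        *write_back = 0;
        switch (e) {
        case CPU_READ:
            if (s == INVALID) { *put_miss_on_bus = 1; return SHARED; }
            return s;                      /* read hit */
        case CPU_WRITE:
            if (s != EXCLUSIVE) *put_miss_on_bus = 1;
            return EXCLUSIVE;              /* sole, dirty, writeable copy */
        case BUS_READ_MISS:                /* another core reads our block */
            if (s == EXCLUSIVE) { *write_back = 1; return SHARED; }
            return s;
        case BUS_WRITE_MISS:               /* another core writes our block */
            if (s == EXCLUSIVE) *write_back = 1;
            return INVALID;
        }
        return s;
    }

    int main(void) {
        int miss, wb;
        line_state a = INVALID, b = INVALID;
        a = next_state(a, CPU_READ, &miss, &wb);       /* A: I -> S  */
        b = next_state(b, CPU_WRITE, &miss, &wb);      /* B: I -> E  */
        a = next_state(a, BUS_WRITE_MISS, &miss, &wb); /* A snoops B's
                                                          write: S -> I */
        printf("a=%d b=%d\n", a, b);                   /* a=0 (I), b=2 (E) */
        return 0;
    }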

Cache Coherency

• How do you know when an external (not this processor core) read/write occurs?
• Snooping protocols
• Directory protocols (a directory-entry sketch follows below)

[Figure: snooping — processors whose caches keep duplicate snoop tags alongside the tag-and-data arrays, all on a single bus to memory and I/O; directory — each memory has a directory, with everything connected by a network]
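A hedged sketch of what one directory entry might track (field names and sizes are illustrative, not from the slides): the home node records each block's state plus a bit vector of sharers, so coherence messages can be point-to-point instead of broadcast.

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { UNCACHED, SHARED_CLEAN, DIRTY } dir_state;

    typedef struct {
        dir_state state;
        uint64_t  sharers;   /* bit i set => cache i holds a copy */
    } dir_entry;

    /* Read miss from cache c: add it to the sharer set. If some
       cache holds the block dirty, the directory would forward the
       request to that owner and have it write back (not modeled). */
    void dir_read_miss(dir_entry *e, int c) {
        e->sharers |= (uint64_t)1 << c;
        if (e->state == UNCACHED) e->state = SHARED_CLEAN;
    }

    /* Write miss from cache c: send invalidations only to the
       caches named in e->sharers (point-to-point), then make c
       the sole dirty owner. */
    void dir_write_miss(dir_entry *e, int c) {
        /* would send invalidations to every sharer except c */
        e->sharers = (uint64_t)1 << c;
        e->state = DIRTY;
    }

    int main(void) {
        dir_entry e = { UNCACHED, 0 };
        dir_read_miss(&e, 0);    /* cache 0 reads              */
        dir_read_miss(&e, 3);    /* cache 3 reads: two sharers */
        dir_write_miss(&e, 3);   /* cache 3 writes: 0 invalidated */
        printf("state=%d sharers=%llx\n",
               e.state, (unsigned long long)e.sharers);
        return 0;
    }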


Simultaneous Multithreading

[Figure: title slide showing two processors, each with its own cache and memory]

Hardware Multithreading

[Figure: a conventional processor holds one thread context — one PC and one register file feeding the CPU; a multithreaded processor holds several hardware contexts — multiple PCs and register files — sharing one CPU]

Superscalar Execution

[Figure: issue slots over time (processor cycles) for a single instruction stream; a completely empty issue cycle is vertical waste, a partially filled cycle is horizontal waste]

Superscalar Execution with Fine-Grain Multithreading

[Figure: the same issue slots with threads 1–3 interleaved, a different thread each cycle; vertical waste is recovered, but horizontal waste within each cycle remains]

Simultaneous Multithreading

[Figure: issue slots over time (processor cycles) with instructions from threads 1–5 issuing together in the same cycle, attacking both vertical and horizontal waste]

SMT Performance

[Figure: throughput (instructions per cycle, 0–7) vs. number of threads (1–8); SMT keeps scaling where fine-grain multithreading and the conventional superscalar flatten out]

Goals

We had three primary goals for this architecture:
1. Minimize the architectural impact on conventional superscalar design.
2. Minimize the performance impact on a single thread.
3. Achieve significant throughput gains with many threads.

A Conventional Superscalar Architecture

[Figure: a fetch unit (PC + instruction cache) fetches up to 8 instructions per cycle into decode; a floating-point instruction queue feeds the fp units and fp registers, and an integer instruction queue feeds 6 int/ld-store units and the integer registers, backed by a data cache; execution is out-of-order and speculative, issuing up to 3 floating-point and 6 integer instructions per cycle]

Performance of the Naïve Design

[Figure: throughput (instructions per cycle, up to 5) vs. number of threads (2–8); the unmodified superscalar gains little as threads are added]

An SMT Architecture

[Figure: the superscalar pipeline again — fetch unit with PCs, instruction cache, decode, register renaming, fp and integer instruction queues, units, and register files, and a data cache — still fetching up to 8 instructions per cycle, executing out-of-order and speculatively, and issuing up to 3 floating-point and 6 integer instructions per cycle]

Bottlenecks of the Baseline Architecture

• Instruction queue full conditions (12-21% of cycles)
  – Lack of parallelism in the queue.
• Fetch throughput (4.2 instructions per cycle when the queue is not full)

Improving Fetch Throughput

• The fetch unit in an SMT architecture has two distinct advantages over a conventional architecture:
  – Can fetch from multiple threads at once.
  – Can choose which threads to fetch.


Improved Fetch Performance

• Fetching from 2 threads/cycle achieved most of the performance of multiple-thread fetch.
• Fetching from the thread(s) which have the fewest unissued instructions in flight significantly increases parallelism and throughput (the ICOUNT fetch policy; see the sketch at the end of this section).

Improved Performance

[Figure: instructions per cycle (up to 5) vs. number of threads (2–8); the improved design pulls well ahead of the baseline and the unmodified superscalar]

This SMT Architecture, then:

• Borrows heavily from conventional superscalar design.
• Minimizes the impact on single-thread performance.
• Achieves significant throughput gains over the superscalar (2.5X, up to 5.4 IPC).
• Requires surprisingly few changes to a conventional out-of-order processor.

Multithreaded Processors

• Coarse-grain multithreading (Alewife-MIT)
  – context switch at long-latency operations (cache misses)
• Fine-grain multithreading (Tera)
  – context switch every cycle
  – Sun Niagara T1
• Simultaneous multithreading (SMT)
  – execute instructions from multiple threads in the same cycle
  – is only different from fine-grain multithreading in the context of superscalar execution
  – Was announced to be featured in the next Compaq Alpha processor (21464), but that processor was never completed.
  – Introduced in the Pentium 4 processor, announced as "Hyper-Threading Technology" (HT Technology).
  – IBM Power 5 and 6 have 2 cores, each 2-way SMT.
  – Nehalem multicore has 2 threads/core.
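The ICOUNT policy mentioned above, as a small illustrative sketch (not the actual hardware logic): each cycle, prefer the thread with the fewest unissued instructions in flight, since it is moving through the queue fastest.

    /* icount[] is assumed to be maintained by the front end:
       incremented when a thread's instructions are fetched,
       decremented when they issue. */
    #define NTHREADS 8

    int icount[NTHREADS];   /* unissued instructions per thread */

    int pick_thread_to_fetch(void) {
        int best = 0;
        for (int t = 1; t < NTHREADS; t++)
            if (icount[t] < icount[best])
                best = t;   /* favor threads draining the queue */
        return best;
    }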


Multi-Core Processors (aka Chip Multiprocessors)

• Multiple cores on the same die, which may or may not share an L2 cache.
• Intel and AMD both have quad-core processors. Sun Niagara T2 is 8 cores x 8 threads (64 contexts!).
• Everyone's roadmap seems to be increasingly multi-core.

[Figure: six CPUs on a single die]

Why Multicore, Multithreaded?

[Figure: performance vs. number of threads, from 1 to 25]

Intel Nehalem

• First instantiation – Intel Core i7

Intel Nehalem Core

[Figure: block diagram of a single Nehalem core]


Intel Nehalem, summary

• Up to 8 cores (i7: 4 cores)
• 2 SMT threads per core
• 20+ stage pipeline
• x86 instructions translated to RISC-like uops
• Superscalar: 4 "instructions" (uops) per cycle (more with fusing)
• Caches (i7):
  – 32KB 4-way set-associative I-cache per core
  – 32KB 8-way set-associative D-cache per core
  – 256KB unified 8-way set-associative L2 cache per core
  – 8MB shared 16-way set-associative L3 cache

Multiprocessors -- Key Points

• Network vs. Bus
• Message-Passing vs. Shared Memory
• Shared Memory is more intuitive, but creates problems for both the programmer (memory consistency, requiring synchronization) and the architect (cache coherency).
• Multithreading gives the illusion of multiprocessing (including, in many cases, the performance) with very little additional hardware.
• When multiprocessing happens within a single die/processor, we call that a chip multiprocessor, or a multi-core architecture.
