Multiprocessors and Multithreading
CSE 141, Dean Tullsen


Multiprocessors

• Why would you want a multiprocessor? Is more always better?
[Figure: three processors, each with a private cache, sharing a single bus with memory and I/O]

Classifying Multiprocessors

• Flynn Taxonomy
• Interconnection Network
• Memory Topology
• Programming Model

Flynn Taxonomy

• SISD (Single Instruction, Single Data)
  – Uniprocessors
• SIMD (Single Instruction, Multiple Data)
  – Examples: Illiac-IV, CM-2, Nvidia GPUs, etc.
    » Simple programming model
    » Low overhead
• MIMD (Multiple Instruction, Multiple Data)
  – Examples: many; nearly all modern multiprocessors or multicores
    » Flexible
    » Can use off-the-shelf microprocessors or microprocessor cores
• MISD (Multiple Instruction, Single Data)
  – ???

Interconnection Network

• Bus
• Network
• Pros/cons?
[Figure: processors P0-P7 attached to a single shared bus vs. connected through a network]

Memory Topology

• UMA (Uniform Memory Access)
• NUMA (Non-Uniform Memory Access)
• Pros/cons?
[Figure: UMA: processors with caches share one bus and one memory. NUMA: each CPU has its own local memory, with all nodes connected by a network]

Programming Model

• Shared Memory -- every processor can name every address location
• Message Passing -- each processor can name only its local memory. Communication is through explicit messages.
• Pros/cons?
[Figure: a shared-memory machine (processors and caches over a single memory) vs. a message-passing machine (processor/memory pairs connected by a network)]

Parallel Programming

• Example: with i = 47, Processor A and Processor B each execute index = i++. Without synchronization, both can read 47, so the outcome depends on timing.
• Shared-memory programming requires synchronization to provide mutual exclusion and prevent race conditions:
  – locks (semaphores)
  – barriers
• Exercise: find the max of 100,000 integers on 10 processors (a sketch follows the lock example below).
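To make the race concrete, here is a minimal C/pthreads sketch of the i = 47 example above. The lock is what the mutual-exclusion bullet calls for; removing the two mutex calls reintroduces the race. Thread and variable names are illustrative.

```c
#include <pthread.h>
#include <stdio.h>

int i = 47;                       /* shared, as in the slide's example */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    int index;
    pthread_mutex_lock(&lock);    /* without this, i++ is a read-modify-write
                                     race: both threads can read 47 */
    index = i++;
    pthread_mutex_unlock(&lock);
    printf("thread %ld got index %d\n", (long)arg, index);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, (void *)0);
    pthread_create(&b, NULL, worker, (void *)1);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("final i = %d\n", i);  /* 49 with the lock; can be 48 without */
    return 0;
}
```

With the lock, the two threads get indices 47 and 48 in some order and i ends at 49; without it, both loads can observe 47, one increment is lost, and i can end at 48.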
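And a sketch of the exercise, under the obvious partitioning: each of 10 threads finds a local maximum over its tenth of the array, the joins serve as the barrier, and one thread then reduces the partial results. The array contents are illustrative.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N 100000
#define NTHREADS 10

int data[N];
int partial_max[NTHREADS];

/* Each thread scans its own tenth of the array: no sharing, no locks. */
void *find_max(void *arg) {
    long t = (long)arg;
    int lo = t * (N / NTHREADS), hi = lo + N / NTHREADS;
    int m = data[lo];
    for (int j = lo + 1; j < hi; j++)
        if (data[j] > m) m = data[j];
    partial_max[t] = m;
    return NULL;
}

int main(void) {
    for (int j = 0; j < N; j++) data[j] = rand();

    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, find_max, (void *)t);

    /* the joins act as the barrier; then reduce the 10 partial maxima */
    int m = 0;
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        if (partial_max[t] > m) m = partial_max[t];
    }
    printf("max = %d\n", m);
    return 0;
}
```

Each thread writes only its own partial_max slot, so no locks are needed; this local-work-then-reduce pattern is the standard shape of the answer.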
Multiprocessor Caches (Shared Memory)

• The problem -- cache coherency
• The solution?
[Figure: three processors with private caches on a single bus with memory and I/O]

What Does Coherence Mean?

• Informally:
  – Any read must return the most recent write
  – Too strict, and very difficult to implement
• Better:
  – A processor sees its own writes to a location in the correct order.
  – Any write must eventually be seen by a read.
  – All writes are seen in order ("serialization"): writes to the same location are seen in the same order by all processors.
• Without these guarantees, synchronization doesn't work.

Potential Solutions

• Snooping solution (snoopy bus):
  – Send all requests for unknown data to all processors
  – Processors snoop to see if they have a copy and respond accordingly
  – Requires "broadcast", since caching information is at the processors
  – Works well with a bus (a natural broadcast medium)
  – Dominates for small-scale machines (most of the market)
• Directory-based schemes:
  – Keep track of what is being shared in one centralized place
  – Distributed memory => distributed directory (avoids bottlenecks)
  – Send point-to-point requests to processors
  – Scale better than snooping
  – Actually existed BEFORE snoop-based schemes
  – Potentially require a LOT of messages, but...
• Both schemes are MUCH easier on a bus-based multiprocessor

Basic Snoopy Protocols

• Write-update
  – On each write, each cache holding that location updates its value
• Write-invalidate <= most common
  – On each write, each cache holding that location invalidates the cache line
• Write-invalidate versus broadcast (update):
  – Invalidate requires one transaction per write-run
  – Invalidate exploits spatial locality: one transaction per block
  – Broadcast has lower latency between write and read
  – Broadcast: a bandwidth (increased) vs. latency (decreased) tradeoff

An Example Snoopy Protocol

• Invalidation protocol, write-back cache
• Each block of memory is in one state:
  – Clean in all caches and up-to-date in memory
  – Dirty in exactly one cache
  – Not in any caches
• Each cache block is in one state:
  – (S)hared: block can be read
  – (E)xclusive: this cache has the only copy; it is writeable and dirty
  – (I)nvalid: block contains no data
• Read misses cause all caches to snoop
• Writes to a clean line are treated as misses

Snoopy-Cache State Machine

[Figure: two state diagrams over the states Invalid, Shared (read only), and Exclusive (read/write). One shows cache state transitions on requests from the CPU: read and write hits stay put; a CPU read miss places a read miss on the bus (Invalid -> Shared); a CPU write places a write miss on the bus (Invalid or Shared -> Exclusive); misses out of Exclusive first write back the block. The other shows transitions on requests snooped from the bus: a write miss for this block invalidates it; a read or write miss for an Exclusive block forces a write-back and aborts the memory access.]

Other Protocols (more common)

• ESI = Exclusive, Shared, Invalid
• MESI = Modified, Exclusive, Shared, Invalid
  – Exclusive = private, clean
  – Modified = private, dirty
• MOESI = Modified, Owned, Exclusive, Shared, Invalid
  – Owned = responsible for writing back the shared, dirty line
• How does MESI avoid sending messages across the bus (compared to ESI)?
• What traffic does MOESI avoid? (See the sketch and note below.)
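As a concrete illustration, here is a minimal C sketch of the three-state (ESI-style) invalidation protocol above, for a single line in a single cache. Bus transactions and write-backs are modeled as printouts, and the event names are simplified stand-ins for real bus messages.

```c
#include <stdio.h>

/* Cache-line states for the simple ESI invalidation protocol above. */
typedef enum { INVALID, SHARED, EXCLUSIVE } State;

/* Events seen by one cache: requests from its own CPU, or snooped bus traffic. */
typedef enum {
    CPU_READ, CPU_WRITE,          /* from this core */
    BUS_READ_MISS, BUS_WRITE_MISS /* snooped from other cores */
} Event;

/* One transition of the snoopy state machine for a single cache line. */
State next_state(State s, Event e) {
    switch (s) {
    case INVALID:
        if (e == CPU_READ)  { puts("place read miss on bus");  return SHARED; }
        if (e == CPU_WRITE) { puts("place write miss on bus"); return EXCLUSIVE; }
        return INVALID;  /* bus traffic for a line we don't hold: ignore */
    case SHARED:
        if (e == CPU_READ)  return SHARED;  /* read hit */
        if (e == CPU_WRITE) { puts("place write miss on bus"); return EXCLUSIVE; }
        if (e == BUS_WRITE_MISS) return INVALID;  /* another core will write */
        return SHARED;   /* another core's read miss: sharing is fine */
    case EXCLUSIVE:
        if (e == CPU_READ || e == CPU_WRITE) return EXCLUSIVE;  /* hits */
        /* another core touched our dirty line: supply it, abort memory */
        puts("write back block; abort memory access");
        return (e == BUS_READ_MISS) ? SHARED : INVALID;
    }
    return s;
}
```

This also suggests the answers to the questions above: MESI adds a clean Exclusive state, so a line no other cache holds can later be written without any bus transaction (in ESI, every write to a clean line must go on the bus); and MOESI's Owned state lets a cache supply a dirty line to other readers without first writing it back to memory.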
Cache Coherency

• How do you know when an external (not this processor core) read/write occurs?
  – Snooping protocols
  – Directory protocols
[Figure: a bus-based system where each cache keeps snoop tags beside its data tags vs. a distributed system where each memory has a directory and nodes communicate over a network]

Hardware Multithreading

[Figure: a conventional processor holds the state of one instruction stream (one PC, one register set); a multithreaded processor holds several (a PC and register set per thread) sharing one CPU]

Superscalar Execution

[Figure: issue slots over time (processor cycles). Unfilled slots within a cycle are horizontal waste; entirely unfilled cycles are vertical waste]

Superscalar Execution with Fine-Grain Multithreading

[Figure: issue slots over time with threads 1-3 interleaved, a different thread each cycle. Vertical waste disappears, but horizontal waste remains]

Simultaneous Multithreading

[Figure: issue slots over time with threads 1-5; instructions from multiple threads issue in the same cycle, attacking both vertical and horizontal waste]

SMT Performance

[Figure: throughput (instructions per cycle) vs. number of threads, 1-8. Simultaneous multithreading keeps scaling; fine-grain multithreading and the conventional superscalar flatten out much lower]

Goals

We had three primary goals for this architecture:
1. Minimize the architectural impact on conventional superscalar design.
2. Minimize the performance impact on a single thread.
3. Achieve significant throughput gains with many threads.

A Conventional Superscalar Architecture

[Figure: a fetch unit with one PC feeds the instruction cache; decode and register renaming feed a floating-point instruction queue (fp units, fp registers) and an integer instruction queue (int/load-store units, integer registers), backed by a data cache]
• Fetch up to 8 instructions per cycle
• Out-of-order, speculative execution
• Issue 3 floating-point and 6 integer instructions per cycle

An SMT Architecture

[Figure: the same pipeline, but the fetch unit holds multiple PCs, one per thread]

Performance of the Naïve Design

[Figure: throughput (instructions per cycle) vs. number of threads, 2-8, for the unmodified superscalar running multiple threads; throughput saturates well below the machine's issue width]

Bottlenecks of the Baseline Architecture

• Instruction queue full conditions (12-21% of cycles)
  – Lack of parallelism in the queue
• Fetch throughput (4.2 instructions per cycle when the queue is not full)

Improving Fetch Throughput

• The fetch unit in an SMT architecture has two distinct advantages over a conventional architecture:
  – It can fetch from multiple threads at once.
  – It can choose which threads to fetch.

Improved Fetch Performance

• Fetching from 2 threads/cycle achieved most of the performance of multiple-thread fetch.
• Fetching from the thread(s) with the fewest unissued instructions in flight significantly increases parallelism and throughput (the ICOUNT fetch policy; a sketch follows).
[Figure: throughput vs. number of threads for the improved fetch design, the baseline, and the unmodified superscalar]
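A minimal sketch of the ICOUNT selection rule, assuming a hypothetical per-thread counter of fetched-but-not-yet-issued instructions (the real policy also decides how to partition fetch bandwidth among the chosen threads):

```c
#include <limits.h>

#define NTHREADS 8

/* Hypothetical per-thread count of instructions fetched but not yet issued,
 * i.e., occupying the instruction queues. Updated at fetch and at issue. */
int unissued[NTHREADS];

/* ICOUNT: fetch for the thread with the fewest unissued instructions in
 * flight. Such a thread is moving through the queues quickly, so fetching
 * for it adds issuable work instead of piling up behind a stalled thread. */
int pick_fetch_thread(void) {
    int best = 0, best_count = INT_MAX;
    for (int t = 0; t < NTHREADS; t++) {
        if (unissued[t] < best_count) {
            best_count = unissued[t];
            best = t;
        }
    }
    return best;
}
```

This directly targets the two bottlenecks above: it keeps the instruction queues supplied with parallel, issuable work, and it raises fetch throughput by favoring threads that can actually consume fetch bandwidth.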
This SMT Architecture, Then:

• Borrows heavily from conventional superscalar design.
• Minimizes the impact on single-thread performance.
• Achieves significant throughput gains over the superscalar (2.5X, up to 5.4 IPC).

Multithreaded Processors

• Coarse-grain multithreading (Alewife, MIT)
  – Context switch at long-latency operations (cache misses)
• Fine-grain multithreading (Tera supercomputer)
  – Context switch every cycle
  – Sun Niagara T1
• Simultaneous multithreading (SMT)
  – Executes instructions from multiple threads in the same cycle
  – Is only different from fine-grain multithreading in the context of superscalar execution
  – Requires surprisingly few changes to a conventional superscalar design
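To pin down the taxonomy, here is a tiny hypothetical thread-selection sketch contrasting the first two policies; SMT has no single selection step of this kind, since instructions from any mix of ready threads may issue in one cycle.

```c
#define NTHREADS 4

/* Fine-grain multithreading: rotate to a different thread every cycle. */
int pick_thread_fine_grain(int cycle) {
    return cycle % NTHREADS;
}

/* Coarse-grain multithreading: keep running the current thread until it
 * hits a long-latency event (e.g., a cache miss), then switch. */
int pick_thread_coarse_grain(int current, int long_latency_stall) {
    return long_latency_stall ? (current + 1) % NTHREADS : current;
}
```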