
Unit 10: Hardware Multithreading

This Unit: Multithreading (MT)

• Why multithreading (MT)?
• Utilization vs. performance
• Three implementations
  • Coarse-grained MT
  • Fine-grained MT
  • Simultaneous MT (SMT)

[Figure: where we are in the system stack — Application, OS, Firmware, CPU, I/O, Memory, Digital Circuits, Gates & Transistors]


Readings

• H+P
  • Chapter 3.5–3.6
• Paper
  • Tullsen et al., "Exploiting Choice…"

Performance And Utilization

• Performance (IPC) important
• Utilization (actual IPC / peak IPC) important too

• Even moderate superscalars (e.g., 4-way) are not fully utilized
  • Average sustained IPC: 1.5–2 → less than 50% utilization
  • Mis-predicted branches
  • Cache misses, especially L2
  • Data dependences

• Multi-threading (MT)
  • Improve utilization by multiplexing multiple threads on a single CPU
  • If one thread cannot fully utilize the CPU, maybe 2, 4 (or 100) can

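To make the utilization definition above concrete, here is a minimal Python sketch; the 4-wide issue width and the 1.5–2 sustained IPC figures come from the slide, while the function name is ours.

# Utilization = actual (sustained) IPC / peak IPC, as defined above.
def utilization(sustained_ipc, issue_width):
    return sustained_ipc / issue_width

# A 4-way superscalar sustaining 1.5-2 IPC uses at most half of its issue slots.
for ipc in (1.5, 2.0):
    print(f"IPC {ipc} on 4-way: {utilization(ipc, 4):.0%} utilization")
# -> 38% and 50%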
Superscalar Under-utilization

• Time evolution of issue slots, 4-issue processor

[Figure: issue slots over time on a superscalar; cache misses leave many slots empty]

Simple Multithreading

• Time evolution of issue slots, 4-issue processor
• Fill in the empty slots with instructions from another thread
• Where does it find a thread? Same problem as multi-core
  • Same shared-memory abstraction

[Figure: superscalar vs. multithreading issue slots; the second thread fills slots idled by cache misses]

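The "fill in with instructions from another thread" idea can be illustrated with a small Python sketch; the per-cycle issue counts and the miss pattern below are invented for illustration, not measurements.

# Each cycle offers 4 issue slots; a thread issues some number of insns
# (0 while stalled on a cache miss).  Multithreading lets a second thread
# use the leftover slots in the same machine.
WIDTH = 4
thread_a = [3, 1, 0, 0, 0, 2, 4, 1]   # zeros model cycles stalled on a miss
thread_b = [2, 3, 4, 2, 3, 4, 2, 3]   # independent thread with work available

alone  = sum(thread_a)                               # slots A uses by itself
shared = sum(min(WIDTH, a + b) for a, b in zip(thread_a, thread_b))

total = len(thread_a) * WIDTH
print(f"A alone: {alone}/{total} slots used; A+B: {shared}/{total} slots used")
# -> A alone: 11/32 slots used; A+B: 29/32 slots used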
Latency vs. Throughput

• MT trades (single-thread) latency for throughput
  – Sharing the processor degrades the latency of individual threads
  + But improves aggregate latency of both threads
  + Improves utilization
• Example
  • Thread A: individual latency = 10s, latency with thread B = 15s
  • Thread B: individual latency = 20s, latency with thread A = 25s
  • Sequential latency (first A then B, or vice versa): 30s
  • Parallel latency (A and B simultaneously): 25s
  – MT slows each thread by 5s
  + But improves total latency by 5s
• Different workloads have different parallelism
  • SpecFP has lots of ILP (can use an 8-wide machine)
  • Server workloads have TLP (can use multiple threads)

MT Implementations: Similarities

• How do multiple threads share a single processor?
  • Different sharing mechanisms for different kinds of structures
  • Depends on what kind of state the structure stores
• No state: ALUs
  • Dynamically shared
• Persistent hard state (aka "context"): PC, registers
  • Replicated
• Persistent soft state: caches, bpred
  • Dynamically partitioned (like on a multi-programmed uni-processor)
  • TLBs need thread IDs, caches/bpred tables don't
  • Exception: ordered "soft" state (BHR, RAS) is replicated
• Transient state: latches, ROB, RS
  • Partitioned … somehow

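The latency numbers in the example above can be checked directly; this little Python sketch just restates the slide's arithmetic.

# Individual latencies vs. latencies when the two threads share the processor.
a_alone, b_alone   = 10, 20      # seconds, each thread running by itself
a_shared, b_shared = 15, 25      # each thread is slower while sharing

sequential = a_alone + b_alone           # run A to completion, then B
parallel   = max(a_shared, b_shared)     # both done when the later one finishes

print(f"each thread slowed by {a_shared - a_alone}s and {b_shared - b_alone}s")
print(f"total: {sequential}s sequential vs {parallel}s multithreaded")
# -> each slowed by 5s, but total drops from 30s to 25s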
MT Implementations: Differences

• Main question: thread scheduling policy
  • When to switch from one thread to another?
• Related question: pipeline partitioning
  • How exactly do threads share the pipeline itself?

• Choice depends on
  • What kind of latencies (specifically, what lengths) you want to tolerate
  • How much single-thread performance you are willing to sacrifice

• Three designs
  • Coarse-grain multithreading (CGMT)
  • Fine-grain multithreading (FGMT)
  • Simultaneous multithreading (SMT)

The Standard Multithreading Picture

• Time evolution of issue slots
• Color = thread

[Figure: issue slots over time for Superscalar, CGMT, FGMT, and SMT]


Coarse-Grain Multithreading (CGMT)

• Coarse-Grain Multi-Threading (CGMT)
  + Sacrifices very little single-thread performance (of one thread)
  – Tolerates only long latencies (e.g., L2 misses)
• Thread scheduling policy
  • Designate a "preferred" thread (e.g., thread A)
  • Switch to thread B on a thread A L2 miss
  • Switch back to A when A's L2 miss returns
• Pipeline partitioning
  • None, flush on switch
  – Can't tolerate latencies shorter than twice the pipeline depth
  • Need a short in-order pipeline for good performance
  • Example: IBM Northstar/Pulsar

CGMT

[Figure: baseline pipeline (I$, branch predictor, regfile, D$) vs. the CGMT pipeline, which adds a thread scheduler, a register file per thread, and a switch triggered by "L2 miss?"]

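As a rough sketch of the CGMT scheduling policy above (run the preferred thread, switch away on its L2 miss, switch back when the miss returns), here is a hypothetical Python fragment; the event names and the generator structure are our own invention, not any real design's interface.

# Coarse-grain policy: thread A is preferred; run B only while A waits on an
# L2 miss.  Each switch also implies flushing the pipeline (not modeled here).
def cgmt_schedule(events):
    """events: iterable of (cycle, event), event in {'A_l2_miss', 'A_miss_returns'};
    yields (cycle, running_thread)."""
    running = "A"                       # designated preferred thread
    for cycle, event in events:
        if event == "A_l2_miss" and running == "A":
            running = "B"               # switch on the long-latency miss
        elif event == "A_miss_returns":
            running = "A"               # switch back to the preferred thread
        yield cycle, running

for cycle, thr in cgmt_schedule([(5, "A_l2_miss"), (305, "A_miss_returns")]):
    print(f"cycle {cycle}: running thread {thr}")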
Fine-Grain Multithreading (FGMT)

• Fine-Grain Multithreading (FGMT)
  – Sacrifices significant single-thread performance
  + Tolerates latencies of all kinds (e.g., L2 misses, mispredicted branches, etc.)
• Thread scheduling policy
  • Switch threads every cycle (round-robin), L2 miss or no
• Pipeline partitioning
  • Dynamic, no flushing
  • Length of pipeline doesn't matter so much
  – Need a lot of threads
• Extreme example: Denelcor HEP
  • So many threads (100+), it didn't even need caches
  • Failed commercially
• Not popular today

Fine-Grain Multithreading

• FGMT
  • Multiple threads in the pipeline at once
  • (Many) more threads
  • Many threads → many register files

[Figure: FGMT pipeline with a thread scheduler and one register file per thread]


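The FGMT policy (switch threads every cycle, miss or no miss) is plain round-robin selection over the hardware thread contexts; a one-function Python sketch, with the thread count chosen arbitrarily.

# Fine-grain policy: rotate through the thread contexts every cycle.
def fgmt_pick(cycle, num_threads):
    return cycle % num_threads          # round-robin, L2 miss or no

print([fgmt_pick(c, 4) for c in range(8)])   # -> [0, 1, 2, 3, 0, 1, 2, 3]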
Vertical and Horizontal Under-Utilization

• FGMT and CGMT reduce vertical under-utilization
  • Loss of all slots in an issue cycle
• They do not help with horizontal under-utilization
  • Loss of some slots in an issue cycle (in a superscalar)

[Figure: issue slots over time for CGMT, FGMT, and SMT; only SMT fills partially used cycles]

Simultaneous Multithreading (SMT)

• What can issue insns from multiple threads in one cycle?
  • The same thing that issues insns from multiple parts of the same program…
  • …out-of-order execution
• Simultaneous multithreading (SMT): OOO + FGMT
  • Aka "hyper-threading"
  • Observation: once insns are renamed, the scheduler doesn't care which thread they come from (well, for non-loads at least)
• Some examples
  • IBM Power5: 4-way issue, 2 threads
  • Pentium4: 3-way issue, 2 threads
  • Intel "Nehalem": 4-way issue, 2 threads
  • Alpha 21464: 8-way issue, 4 threads (canceled)
• Notice a pattern? #threads (T) * 2 = issue width (N)

Simultaneous Multithreading (SMT)

• SMT
  • Replicate the map table, share a (larger) physical register file

[Figure: SMT pipeline with a thread scheduler, a map table per thread, and one shared physical register file]

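The key SMT observation — after renaming, the out-of-order scheduler can fill an issue cycle with ready instructions from any thread — can be sketched in Python; the instruction records and field names below are made up for illustration.

# After renaming, a waiting instruction is just "ready or not"; an SMT
# scheduler fills the issue width from ready insns regardless of thread.
def smt_issue(waiting, width=4):
    ready = [insn for insn in waiting if insn["ready"]]
    return ready[:width]                 # oldest-first, one simple heuristic

waiting = [
    {"thread": 0, "op": "add", "ready": True},
    {"thread": 1, "op": "mul", "ready": True},
    {"thread": 0, "op": "ld",  "ready": False},   # still waiting on operands
    {"thread": 1, "op": "sub", "ready": True},
]
print([(i["thread"], i["op"]) for i in smt_issue(waiting)])
# -> [(0, 'add'), (1, 'mul'), (1, 'sub')]  -- one cycle, two threads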

SMT Resource Partitioning

• Physical regfile and insn buffer entries are shared at fine grain
  • Physically unordered, so fine-grain sharing is possible
• How are physically ordered structures (ROB/LSQ) shared?
  – Fine-grain sharing (below) would entangle commit (and squash)
  • Allowing threads to commit independently is important

[Figure: SMT pipeline with per-thread map tables and a fine-grain-shared ROB/LSQ]

Static & Dynamic Resource Partitioning

• Static partitioning (below)
  • T equal-sized contiguous partitions
  ± No starvation, but sub-optimal utilization (fragmentation)
• Dynamic partitioning
  • P > T partitions, available partitions assigned on a need basis
  ± Better utilization, but possible starvation
  • ICOUNT: fetch policy prefers the thread with the fewest in-flight insns
  • Couple both with larger ROBs/LSQs

[Figure: ROB/LSQ split into per-thread partitions, statically vs. dynamically assigned]

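The ICOUNT policy mentioned above just steers fetch toward the thread with the fewest instructions in flight; a minimal Python sketch, with the in-flight counts invented for the example.

# ICOUNT fetch policy: fetch from the thread that currently has the fewest
# insns in flight (it is the least likely to be clogging the machine).
def icount_pick(in_flight):
    return min(in_flight, key=in_flight.get)

print(icount_pick({"T0": 43, "T1": 12}))   # -> 'T1'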
Multithreading Issues

• Shared soft state (caches, branch predictors, TLBs, etc.)
  • Key example: cache interference
    • General concern for all MT variants
    • Can the working sets of multiple threads fit in the caches?
    • SPMD threads help here
      + Same insns → shared I$
      + Shared data → less D$ contention
    • MT is good for workloads with shared insns/data
  • To keep miss rates low, SMT might need a larger L2 (which is OK)
    • Out-of-order execution tolerates L1 misses
• Large physical register file (and map table)
  • physical registers = (#threads * #arch-regs) + #in-flight insns
  • map table entries = (#threads * #arch-regs)

Notes About Sharing Soft State

• Caches are shared naturally…
  • Physically-tagged: address translation distinguishes different threads
• … but TLBs need explicit thread IDs to be shared
  • Virtually-tagged: entries of different threads are otherwise indistinguishable
  • Thread IDs are only a few bits: enough to identify all on-chip contexts
• Thread IDs make sense in the BTB (branch target buffer)
  • BTB entries are already large, a few extra bits per entry won't matter
  • Different thread's target prediction → automatic mis-prediction
• … but not in the BHT (branch history table)
  • BHT entries are small, a few extra bits per entry is huge overhead
  • Different thread's direction prediction → mis-prediction not automatic
• Ordered soft state should be replicated
  • Examples: Branch History Register (BHR), Return Address Stack (RAS)
  • Otherwise it becomes meaningless… fortunately, it is typically small

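The sizing rules for the physical register file and map table are easy to plug numbers into; the configuration below (2 threads, 32 architectural registers, 64 in-flight insns) is an assumed example, not from the slides.

# physical registers = (#threads * #arch-regs) + #in-flight insns
# map table entries  = (#threads * #arch-regs)
threads, arch_regs, in_flight = 2, 32, 64   # assumed SMT configuration

phys_regs   = threads * arch_regs + in_flight
map_entries = threads * arch_regs
print(f"{phys_regs} physical registers, {map_entries} total map-table entries")
# -> 128 physical registers, 64 total map-table entries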
Multithreading vs. Multicore

• If you wanted to run multiple threads, would you build a…
  • Multicore: multiple separate pipelines?
  • Multithreaded processor: a single larger pipeline?
• Both will get you throughput on multiple threads
  • A multicore core will be simpler, and possibly have a faster clock
  • SMT will get you better performance (IPC) on a single thread
    • SMT is basically an ILP engine that converts TLP into ILP
    • Multicore is mainly a TLP (thread-level parallelism) engine
• Do both
  • Sun's Niagara (UltraSPARC T1)
    • 8 processors, each with 4 threads (non-SMT threading)
    • 1 GHz clock, in-order, short pipeline (6 stages or so)
    • Designed for power-efficient "throughput computing"

Research: Speculative Multithreading

• Speculative multithreading
  • Use multiple threads/processors for single-thread performance
  • Speculatively parallelize sequential loops that might not be parallel
  • Processing elements (PEs) arranged in a logical ring
  • Compiler or hardware assigns iterations to consecutive PEs
  • Hardware tracks logical order to detect mis-parallelization
• Techniques for doing this on non-loop code too
  • Detect reconvergence points (function calls, conditional code)
• Effectively chains the ROBs of different processors into one big ROB
  • A global commit "head" travels from one PE to the next
  • Mis-parallelization flushes one PE, but not all PEs
• Also known as split-window or "Multiscalar"
• Not commercially available yet…
  • But it is the "biggest idea" from academia not yet adopted

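As a very rough sketch of the ring-of-PEs organization described above (iterations assigned to consecutive PEs, with a commit head that advances in logical order), the Python fragment below is our own simplification rather than any published design.

# Speculative multithreading: loop iterations are assigned round-robin to
# processing elements (PEs) on a logical ring; the global commit "head"
# then visits the PEs in the same order, flushing a PE if it mis-speculated.
NUM_PES = 4

def pe_for_iteration(i):
    return i % NUM_PES                  # consecutive iterations -> consecutive PEs

for i in range(6):
    print(f"iteration {i} runs speculatively on PE {pe_for_iteration(i)}")
# commit head then travels PE 0 -> 1 -> 2 -> 3 -> 0 -> 1 ...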
Research: Multithreading for Reliability

• Can multithreading help with reliability?
  • Design bugs / manufacturing defects? No
  • Gradual defects, e.g., thermal wear? No
  • Transient errors? Yes
• Staggered redundant multithreading (SRT)
  • Run two copies of the program at a slight stagger
  • Compare results; on a difference, flush both copies and restart
  – Significant performance overhead
  • (A minimal code sketch of the compare-and-retry idea follows after the summary)

Multithreading Summary

• Latency vs. throughput
• Partitioning different processor resources
• Three multithreading variants
  • Coarse-grain: no single-thread degradation, but tolerates long latencies only
  • Fine-grain: the other end of the trade-off
  • Simultaneous: fine-grain scheduling with out-of-order execution
• Multithreading vs. chip multiprocessing (multicore)

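Finally, the compare-and-retry idea behind redundant multithreading, sketched in software; in real SRT the two copies are staggered hardware threads and comparison happens on retirement, so this Python fragment is only an analogy.

# Run the same computation twice, compare, and redo both copies if a
# transient error made the results disagree.
def run_redundant(compute, *args):
    while True:
        leading, trailing = compute(*args), compute(*args)
        if leading == trailing:          # results agree: safe to commit
            return leading
        # results differ: flush both copies and restart

print(run_redundant(lambda x, y: x + y, 2, 3))   # -> 5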