Introduction

- Performance of a single serial program is limited by its available ILP and by long-latency operations
- Thread-level parallelism (TLP): increase the overall instruction throughput of the processor
- What counts as a "thread"?
  - A full program (a single-threaded UNIX process)
  - An OS thread, e.g., a POSIX thread
  - A compiler-generated thread, e.g., a microthread
  - A hardware-generated thread
- Workloads: time-sharing / multiprogramming workloads; parallel applications (with synchronization)
- This lecture: chip-level multithreading and multiprocessing

Credit: Zhichun Zhu, UIC. All copyrights reserved.

Exploit Thread-level Parallelism

- Multiprocessor system
  - Replicate an entire processor for each thread
  - Requires support for cache coherence and memory consistency
- Multithreaded processor
  - Support multiple threads within a single processor chip
  - Issue instructions from a single thread in a cycle: fine-grained (FGMT) and coarse-grained (CGMT) multithreading
  - Issue instructions from multiple threads in a cycle: simultaneous multithreading (SMT)

Explicitly Multithreading Processors

- Explicit multithreading
  - Interleave the execution of instructions of different user-defined threads (OS threads or processes)
  - Chip multiprocessors (CMP), fine-grained, coarse-grained, and simultaneous multithreading (SMT)
- Implicit multithreading
  - Dynamically generate threads from single-threaded programs and execute such speculative threads concurrently with the lead thread
  - Examples: Multiscalar, dynamic multithreading, speculative multithreading, ...

Chip Multiprocessing

- Replicate an entire processor core for each thread on a single chip
- [Figure: Core 0 and Core 1, each with its own L1 I$ and L1 D$, sharing an on-chip L2 $]

Running Multiple Threads on One Chip

- Advantages
  - Low context switch overhead
  - Reduced latencies for processor-to-processor communication and synchronization
- Drawbacks
  - One more complicated uniprocessor vs. several simple CMP cores
  - Lower clock frequency
- [Figure: issue slots of threads A and B over time. Horizontal loss = issue slots left empty within an active cycle; vertical loss = entirely idle cycles. CMP is a spatial partition (per FU); FGMT and CGMT are temporal partitions (per cycle). CMP, FGMT, and CGMT statically partition execution resources, while SMT partitions them dynamically.]

Fine-grained Multithreading

- Provide two or more thread contexts on chip
- Switch from one thread to the next on a fixed, fine-grained schedule (e.g., every cycle)
- Example: Tera MTA machine
  - 128 threads (128 register contexts)
  - Switches threads on every clock cycle
  - Fully masks the 128-cycle memory access latency (no cache)
- Drawback: sacrifices single-thread performance for overall throughput
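The Tera-style fixed round-robin schedule can be sketched as a toy simulator (a hypothetical model for illustration, not actual MTA code; `cycles_to_finish` and its parameters are my own names):

```python
# Toy model of fine-grained multithreading: one instruction issues per cycle,
# rotating round-robin over the thread contexts on a fixed schedule.
# With at least `mem_latency` contexts, a thread's memory result is always
# ready by its next turn, so no issue slot is ever wasted.

def cycles_to_finish(n_threads, insts_per_thread, mem_latency):
    """Every instruction is a load: its result is ready mem_latency cycles
    after issue, and a thread cannot issue until its last load returns."""
    ready = [0] * n_threads          # cycle at which each thread may issue again
    done = [0] * n_threads           # instructions completed per thread
    cycle = 0
    while min(done) < insts_per_thread:
        t = cycle % n_threads        # fixed, fine-grained schedule
        if done[t] < insts_per_thread and ready[t] <= cycle:
            done[t] += 1
            ready[t] = cycle + mem_latency
        cycle += 1                   # if the thread isn't ready, the slot is lost
    return cycle
```

With 128 contexts and a 128-cycle memory latency the pipeline never idles (128 threads x 10 loads finish in exactly 1280 cycles), whereas a single thread wastes `mem_latency - 1` cycles per load.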

Coarse-grained Multithreading

- Provide multiple thread contexts within the processor core
- The currently active thread is executed until it reaches a situation that triggers a context switch (e.g., a stall on a long-latency event, such as a cache miss)

Models of CGMT

- Static: a context switch occurs each time the same instruction is executed
  - Explicit context-switch instructions
  - Implicit switch: switch-on-load, switch-on-store, switch-on-branch
  - Advantage: low context-switch overhead (0 or 1 cycle)
  - Disadvantage: switches contexts more often than necessary
- Dynamic: a context switch is triggered by a dynamic event
  - Switch-on-cache-miss, switch-on-signal, switch-on-use
  - Advantage: avoids unnecessary context switches
  - Disadvantage: higher context-switch overhead
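A dynamic switch-on-cache-miss policy can be sketched as follows (an illustrative model, not any real machine; `run_cgmt` and the trace encoding are my own):

```python
# Toy model of coarse-grained multithreading with switch-on-cache-miss.
# Each thread runs until it misses in the cache; the core then pays a small
# switch penalty and resumes another ready thread, overlapping the miss
# latency with useful work from the other contexts.

def run_cgmt(traces, miss_latency, switch_penalty):
    """traces: one list per thread of instructions, each 'hit' or 'miss'.
    Returns total cycles to drain all traces (1 cycle per instruction,
    miss latency overlapped with other threads where possible)."""
    n = len(traces)
    pc = [0] * n                 # next instruction index per thread
    ready_at = [0] * n           # cycle at which a pending miss resolves
    cycle = 0
    cur = 0
    while any(pc[t] < len(traces[t]) for t in range(n)):
        runnable = [t for t in range(n)
                    if pc[t] < len(traces[t]) and ready_at[t] <= cycle]
        if not runnable:         # every remaining thread is waiting on memory
            cycle = min(ready_at[t] for t in range(n) if pc[t] < len(traces[t]))
            continue
        if cur not in runnable:  # dynamic event forces a context switch
            cur = runnable[0]
            cycle += switch_penalty
        inst = traces[cur][pc[cur]]
        pc[cur] += 1
        if inst == 'miss':       # issue the access, then switch away next cycle
            ready_at[cur] = cycle + miss_latency
        cycle += 1
    return cycle
```

The switch penalty models the dynamic model's disadvantage: the triggering event is only noticed after the miss, so switching costs extra cycles that the static model avoids.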

Simultaneous Multithreading (SMT)

- Allow instructions from multiple active threads to be interleaved within and across pipeline stages
- Reduces both horizontal and vertical losses
- Maximizes processor resource utilization

Cost of Thread Switches

- Dynamic events that trigger context switches may only be detected late in the pipeline
- A naïve implementation → several pipeline bubbles
- Replicating registers for each thread and saving the current pipeline state at a context switch → avoids the switch penalty but increases complexity

Fairness and Priority

- Fairness
  - Cache miss rate + OS-controlled context switches
  - Threads with low miss rates are preempted after a time slice expires
  - Threads are protected from being switched out for a minimum quantum
- Priority
  - Thread enters a critical section → increase its priority
  - Thread leaves a critical section → reduce its priority
  - Thread spins on a lock or enters an idle loop → reduce its priority
- Which approach should be used?
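Why interleaving threads within a cycle recovers both kinds of loss can be shown with a toy issue-slot model (the numbers and the greedy fill policy are illustrative assumptions, not measurements):

```python
# Toy illustration of SMT issue-slot sharing on a 4-wide core. Each cycle a
# thread offers some number of independently issuable instructions. Alone, a
# thread with bursty ILP leaves slots empty (horizontal loss) or whole cycles
# idle (vertical loss); with SMT, another thread's instructions fill them.

WIDTH = 4

def issued_per_cycle(ready_streams):
    """ready_streams: one list per thread of ready-instruction counts per
    cycle. Returns instructions issued each cycle when the threads share
    the issue slots (greedy fill, thread 0 first)."""
    issued = []
    for counts in zip(*ready_streams):
        slots = WIDTH
        total = 0
        for c in counts:          # leftover slots go to the next thread
            take = min(c, slots)
            total += take
            slots -= take
        issued.append(total)
    return issued

a = [4, 0, 2, 0, 4, 1]   # bursty thread: idle cycles and half-empty cycles
b = [1, 3, 2, 4, 0, 3]   # second thread with complementary demand
```

Running thread `a` alone issues 11 instructions in 6 cycles (46% utilization); sharing the same 6 cycles with `b` issues 23 of a possible 24.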

SMT Resource Sharing

- Dedicated resources → low utilization
- Shared resources → complicated design, sometimes poor performance

SMT Sharing of Pipeline Stages

- [Figure: three SMT pipeline organizations with progressively more partitioning. All have per-thread Fetch0/Fetch1 and Retire0/Retire1 and shared Ex and Mem stages; they differ in whether Decode, Rename, and Issue are shared or split per thread (Decode0/Decode1, Rename0/Rename1, Issue0/Issue1).]
- Fetch
  - Time-share an instruction cache port among multiple threads
- Branch prediction
  - Time-sharing, but the RAS and the global BHR are better kept dedicated per thread
- Decode
  - For RISC machines, the major complexity is resolving dependences (O(n²) complexity); partitioning would reduce complexity but could compromise single-thread performance
  - For CISC machines, determining instruction semantics and decomposing instructions can be very complex, so time-sharing the decode stage may be more beneficial

SMT Sharing of Pipeline Stages

- Issue
  - Selection must involve multiple threads
  - Wakeup is limited to intra-thread dependences
  - Partition the instruction window?
- Execute
  - Sharing is straightforward
  - Design tradeoffs on the bypass network
- Memory
  - Sharing cache ports is straightforward
  - Design tradeoff for the load/store queue
    - Sharing → potential consistency problem
    - Partitioning → simpler but lower utilization
- Retire
  - Partition or time-share

CMP vs. SMT

- CMP is easier to implement
- SMT can hide long latencies
- SMT has better resource utilization
- CMP + SMT? e.g., IBM Power5

Comparisons between Multithreading Schemes

MT Approach    | Resources Shared between Threads                                             | Context Switch Mechanism
None           | Everything                                                                   | Explicit OS context switch
Fine-grained   | Everything but register file and control logic/state                         | Switch every cycle
Coarse-grained | Everything but I-fetch buffers, RF, and control logic/state                  | Switch on pipeline stall
SMT            | Everything but I-fetch buffers, RAS, ARF, control logic/state, ROB, SQ, ...  | All contexts concurrently active; no switch
CMP            | L2 cache, system interconnect                                                | All contexts concurrently active; no switch

Intel's Hyper-Threading Technology

- Mechanism
  - A single physical processor appears as two logical processors by applying a two-threaded SMT approach
  - Each logical processor maintains a complete set of the architecture state (general-purpose registers, control registers, ...)
  - The logical processors share nearly all other resources, such as caches, execution units, branch predictors, control logic, and buses
  - ROB entries and load/store buffer entries are statically partitioned between the two threads
  - Partitioned resources are recombined when only one thread is active
- Adds less than 5% to the relative chip size
- Improves performance by 16% to 28% on server applications
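The partition-then-recombine behavior of the statically split buffers can be sketched in a few lines (the function name and entry counts are made-up illustrative values; only the halving/recombining policy comes from the text):

```python
# Sketch of Hyper-Threading-style static partitioning of a buffered resource
# such as ROB or store-buffer entries: halved when both logical processors
# are active, recombined into one full-size queue in single-thread mode.
# Static halves keep one stalled thread from hogging the whole queue.

def entries_per_thread(total_entries, active_threads):
    """Entries available to each active logical processor."""
    if active_threads == 1:
        return total_entries      # single-thread mode: full resource
    return total_entries // 2     # two threads: a fixed half each
```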

Explicit vs. Implicit Multithreading

- Explicit
  - Improves instruction throughput
  - Programmer-created threads
- Implicit
  - Improves an individual application's performance
  - Dynamically generated threads

Challenges in IMT Processor Designs

- Resolving control dependences
- Resolving register data dependences
- Resolving memory data dependences

Resolving Control Dependences

- Spawn implicit future threads at subsequent control-independent points in the program's control flow
- [Figure: control-flow graphs with basic blocks A through H; future threads are spawned at the control-independent join points that follow branches, rather than along a single predicted path.]

Resolving Register Data Dependences

- Dependences within a thread
  - Resolved with standard techniques
- Dependences across threads
  - Disallow interthread register data dependences; communicate all shared operands through memory with loads/stores
  - Have the compiler identify interthread dependences explicitly
  - Data dependence speculation

Resolving Memory Data Dependences

- Interthread false dependences (WAR and WAW)
  - Buffer writes from future threads and commit them when those threads retire
- Interthread true dependences (RAW)
  - Future threads assume no dependence violations + extensions to snoop-based cache coherence to detect violations
  - Track loads/stores from each thread in separate per-thread load/store queues

Disjoint Eager Execution

- Choose the branch path with the highest cumulative prediction rate
- [Figure: a branch tree assuming roughly 75% prediction accuracy per branch. Cumulative probabilities down the predicted path are 1, 0.75, 0.56, 0.42, 0.32, 0.24; the alternate paths off the first branches have cumulative probabilities 0.25, 0.19, 0.14. Resources 1-5 are assigned to paths in decreasing order of cumulative probability.]
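Disjoint eager execution's ranking rule can be sketched with a priority queue (an illustrative model with a uniform per-branch accuracy, which reproduces the figure's numbers; `dee_order` is my own name):

```python
# Sketch of disjoint eager execution path selection: always give the next
# speculation resource to the pending path with the highest cumulative
# prediction probability, whether it extends the main path or a side path.

import heapq

def dee_order(accuracy, n_paths):
    """Return cumulative probabilities in the order disjoint eager
    execution explores paths, assuming every branch is predicted
    correctly with probability `accuracy`."""
    heap = [(-1.0, ())]          # max-heap of (-probability, branch outcomes)
    order = []
    while len(order) < n_paths:
        neg_p, path = heapq.heappop(heap)
        p = -neg_p
        order.append(p)
        # following this branch yields two pending paths: predicted, not-predicted
        heapq.heappush(heap, (-p * accuracy, path + (1,)))
        heapq.heappush(heap, (-p * (1 - accuracy), path + (0,)))
    return order
```

With `accuracy = 0.75` the first picks are 1, 0.75, 0.5625, 0.421875, 0.31640625, then the first side path at 0.25, matching the rounded values in the figure: the sixth resource goes to a mispredicted-path alternative, not the main path.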

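The write-buffering scheme for interthread memory dependences can be sketched as follows (an illustrative design; the class, method names, and dict-based memory are my own, not a specific processor's mechanism):

```python
# Sketch of buffering a future (speculative) thread's stores until it retires.
# Speculative stores stay in a private per-thread buffer, so earlier threads
# never see them (avoiding WAR/WAW hazards); loads that read architectural
# memory are recorded so that a true RAW violation can be detected when an
# earlier thread later commits a store to the same address.

class SpeculativeThread:
    def __init__(self, memory):
        self.memory = memory         # shared architectural memory (a dict)
        self.store_buffer = {}       # addr -> value, private until retire
        self.loaded_addrs = set()    # speculative reads of architectural state

    def store(self, addr, value):
        self.store_buffer[addr] = value   # buffered, invisible to other threads

    def load(self, addr):
        if addr in self.store_buffer:     # forward from own buffer first
            return self.store_buffer[addr]
        self.loaded_addrs.add(addr)       # speculative read of shared memory
        return self.memory.get(addr, 0)

    def retire(self):
        # commit buffered writes in thread order; ordering resolves WAW/WAR
        self.memory.update(self.store_buffer)
        self.store_buffer.clear()

    def violated_by(self, addr):
        """An earlier thread committed a store to addr: true (RAW) violation
        if this future thread already read architectural memory there."""
        return addr in self.loaded_addrs
```

A detected violation would squash and restart the future thread; that recovery path is omitted here.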
Executing the Same Thread

- Execute the same instructions in multiple contexts
- Uses
  - Fault detection (transient errors)
  - Prefetching
  - Branch resolution

Real Processor: IBM Power5

- Each processor chip has two full-performance processor cores
- Each core supports two-way SMT
- Right picture: a Power5 MCM (multi-chip module) with four processor chips (16 logical CPUs)
- Each chip has 276M transistors and a die size of 389 mm²
- [Photo caption: POWER5 chief scientist Balaram Sinharoy holding a POWER5 MCM]

Further Reading

1. Reference book, Chapter 11, "Executing Multiple Threads"
2. T. Ungerer, B. Robic, and J. Silc, "A survey of processors with explicit multithreading", ACM Computing Surveys, Vol. 35, No. 1, March 2003, pages 29-63
