Introduction
- Performance of a single serial program is limited by its available ILP and by long-latency operations
- Sources of thread-level parallelism: time-sharing, multiprogramming workloads, parallel applications, synchronization

Thread
A thread can be:
- A full program (a single-threaded UNIX process)
- An operating system thread, e.g., a POSIX thread
- A compiler-generated thread, e.g., a microthread
- A hardware-generated thread

Chip-level Multithreading and Multiprocessing
- Goal: increase the overall instruction throughput of the processor
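An operating-system thread of the kind listed above can be created directly from a high-level language; the sketch below uses Python's threading module, which is backed by POSIX threads on UNIX systems. The worker function and names are illustrative, not from the slides.

```python
import threading

# Each worker runs in its own OS thread (a POSIX thread on UNIX systems).
def worker(name, results):
    results[name] = sum(range(1_000))  # some independent work per thread

results = {}
threads = [threading.Thread(target=worker, args=(f"t{i}", results))
           for i in range(4)]
for t in threads:
    t.start()   # launch all four OS threads
for t in threads:
    t.join()    # wait for them to finish
print(results)  # one result per thread
```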
Credit: Zhichun Zhu, UIC. All rights reserved.
Exploit Thread-level Parallelism
- Multiprocessor system
  - Shared memory: cache coherence, memory consistency
  - Message passing
- Multithreaded processor
  - Interleave the execution of instructions of different user-defined threads (OS threads or processes) within a single processor
  - Chip multiprocessors (CMP); fine-grained, coarse-grained, and simultaneous multithreading (SMT)

Explicit Multithreading
- Explicit: execute user-defined threads
  - Issuing instructions from multiple threads in a cycle: simultaneous multithreading (SMT)
  - Issuing instructions from a single thread in a cycle: fine-grained (FGMT) and coarse-grained (CGMT) multithreading
- Implicit: dynamically generate threads from single-threaded programs and execute such speculative threads concurrently with the lead thread
  - Multiscalar, dynamic multithreading, speculative multithreading, ...

Chip Multiprocessing Processors
- Replicate an entire processor core for each thread within a single processor chip (CMP)
- [Figure: a two-core CMP; Core 0 and Core 1 each have private L1 I$ and L1 D$ and share an L2 $]
Running Multiple Threads on One Chip
- [Figure: issue slots over time for threads A and B; spatial partition (per FU) vs. temporal partition (per cycle); unused slots within a cycle are horizontal loss, entirely idle cycles are vertical loss]
- CMPs: static partitioning of execution resources
- FGMT, CGMT, SMT: dynamic partitioning of execution resources

CMP
- Advantage
  - Low context switch overhead
  - Reduced latencies for processor-to-processor communication and synchronization
- Drawback
  - Simple CMP cores (vs. a more complicated uniprocessor) run at a lower frequency

Fine-grained Multithreading
- Provide two or more thread contexts on chip
- Switch from one thread to the next on a fixed, fine-grained schedule (e.g., every cycle)
- Example: Tera MTA machine
  - 128 threads (128 register contexts)
  - Switches threads on every clock cycle
  - Fully masks the 128-cycle memory access latency (no cache)
- Drawback: sacrifices single-thread performance for overall throughput
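The Tera MTA arithmetic above can be checked with a toy model: round-robin one thread per cycle, where each thread starts a memory access whose result is needed on its next turn. The function and cycle counts are illustrative, not from the slides.

```python
# Toy FGMT model: N threads, round-robin one per cycle. Each thread's
# memory access takes MEM_LAT cycles; with N >= MEM_LAT the latency is
# fully hidden, because a thread is revisited only after MEM_LAT cycles.
MEM_LAT = 128

def utilization(n_threads, n_cycles=10_000):
    ready_at = [0] * n_threads      # cycle when each thread's outstanding access completes
    issued = 0
    for cycle in range(n_cycles):
        t = cycle % n_threads       # fixed fine-grained schedule: next thread every cycle
        if ready_at[t] <= cycle:    # operand has arrived: issue, start a new access
            issued += 1
            ready_at[t] = cycle + MEM_LAT
    return issued / n_cycles

print(utilization(128))  # 128 contexts fully mask the 128-cycle latency -> 1.0
print(utilization(16))   # too few threads: vertical losses remain
```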
Coarse-grained Multithreading
- Provide multiple thread contexts within the processor core
- The currently active thread executes until it reaches a situation that triggers a context switch (e.g., a stall on a long-latency event, such as a cache miss)

Models of CGMT
- Static: a context switch occurs each time the same instruction is executed
  - Explicit context-switch instructions
  - Implicit switch: switch-on-load, switch-on-store, switch-on-branch
  - Advantage: low context-switching overhead (0 or 1 cycle)
  - Disadvantage: switches contexts more often than necessary
- Dynamic: a context switch is triggered by a dynamic event
  - Switch-on-cache-miss, switch-on-signal, switch-on-use
  - Advantage: avoids unnecessary context switches
  - Disadvantage: higher context-switching overhead
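The cost of the dynamic model can be sketched with a toy switch-on-cache-miss simulation: the active thread runs at one instruction per cycle until a miss triggers a switch, and the incoming thread pays a refill penalty. The miss rate and penalty values are illustrative assumptions.

```python
# Toy dynamic CGMT model (switch-on-cache-miss): a thread runs until it
# misses in the cache, then the processor pays a switch penalty before
# resuming another ready thread.
import random

def ipc(miss_rate, switch_penalty, n_insts=100_000, seed=0):
    rng = random.Random(seed)
    cycles = 0
    for _ in range(n_insts):
        cycles += 1                     # one instruction per cycle while hitting
        if rng.random() < miss_rate:    # long-latency event triggers a switch
            cycles += switch_penalty    # pipeline refill for the incoming thread
    return n_insts / cycles             # instructions per cycle

print(ipc(miss_rate=0.02, switch_penalty=0))  # ideal 0-cycle switch -> 1.0
print(ipc(miss_rate=0.02, switch_penalty=6))  # late-detected event -> bubbles
```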
Simultaneous Multithreading (SMT)
- Allow instructions from multiple active threads to be interleaved within and across pipeline stages
- Reduces both horizontal and vertical losses
- Maximizes processor resource utilization

Cost of Thread Switches
- Dynamic events that trigger context switches may only be detected late in the pipeline
  - Naive implementation → several pipeline bubbles
- Replicate registers for each thread and save the current pipeline state at a context switch → avoids the switch penalty but increases complexity
- Which approach should be used?

Fairness and Priority
- Fairness
  - Cache miss rate + OS-controlled context switch: threads with low miss rates are preempted after a time slice expires
  - Threads are protected from preemption for a minimum quantum
- Priority
  - Thread enters a critical section → increase priority
  - Thread leaves a critical section → reduce priority
  - Thread spins on a lock or enters an idle loop → reduce priority
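The horizontal-loss argument can be illustrated with a toy 4-wide issue model: FGMT issues from only one thread per cycle and wastes slots that thread cannot fill, while SMT fills leftover slots from the other threads. The distribution of ready instructions is an illustrative assumption.

```python
# Toy 4-wide issue model comparing FGMT (one thread per cycle, horizontal
# waste when it has fewer than 4 ready instructions) against SMT (fills
# leftover slots from other threads in the same cycle).
import random

WIDTH, THREADS, CYCLES = 4, 4, 10_000
rng = random.Random(1)

fgmt = smt = 0
for cycle in range(CYCLES):
    ready = [rng.randint(0, WIDTH) for _ in range(THREADS)]  # ready insts per thread
    fgmt += ready[cycle % THREADS]                           # FGMT: one thread per cycle
    slots = WIDTH
    for r in sorted(ready, reverse=True):                    # SMT: mix threads in one cycle
        take = min(r, slots)
        slots -= take
        smt += take

print(f"FGMT IPC {fgmt / CYCLES:.2f}, SMT IPC {smt / CYCLES:.2f}")
```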
SMT Resource Sharing
- Dedicated resources → low utilization
- Shared resources → complicated design, sometimes poor performance

SMT Sharing of Pipeline Stages
- [Figure: three pipeline organizations, from fully shared (one Fetch/Decode/Rename/Issue serving both threads) to fully partitioned (Fetch0/Fetch1 through Retire0/Retire1)]
- Fetch: time-share an instruction cache port among multiple threads
- Branch predictor: time-sharing works, but the RAS and the global BHR are better dedicated per thread
- Decode/Rename
  - For RISC machines, the major complexity is resolving dependences (O(n^2) complexity); partitioning would reduce complexity but could compromise single-thread performance
  - For CISC machines, determining instruction semantics and decomposing them can be very complex; time-sharing the decode stage may be more beneficial
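The O(n^2) point can be made concrete: renaming a group of n instructions requires checking each instruction's sources against every earlier destination in the group, so partitioning the rename stage between threads cuts the comparator count. The widths below are illustrative.

```python
# Pairwise dependence cross-checks when renaming an n-instruction group:
# each later instruction's sources are compared against every earlier
# destination in the same group -> n*(n-1)/2 comparisons.
def rename_checks(n):
    return n * (n - 1) // 2

shared = rename_checks(8)           # one 8-wide rename stage shared by two threads
partitioned = 2 * rename_checks(4)  # two 4-wide per-thread rename stages
print(shared, partitioned)          # partitioning more than halves the checks
```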
SMT Sharing of Pipeline Stages (cont.)
- Issue
  - Selection must involve multiple threads
  - Wakeup is limited to intra-thread dependences
  - Partition the instruction window?
- Execute
  - Sharing is straightforward
  - Design tradeoffs on the bypass network
- Retire: partition or time-share
- Memory
  - Sharing cache ports is straightforward
  - Design tradeoff for the load/store queue
    - Sharing → potential consistency problems
    - Partitioning → simpler, but lower utilization

CMP vs. SMT
- CMP is easier to implement
- SMT can hide long latencies
- SMT has better resource utilization
- CMP + SMT? e.g., IBM Power5
Comparisons between Multithreading Schemes

MT Approach    | Resources Shared between Threads                                          | Context Switch Mechanism
None           | Everything                                                                | Explicit OS context switch
Fine-grained   | Everything but register file and control logic/state                      | Switch every cycle
Coarse-grained | Everything but I-fetch buffers, RF, and control logic/state               | Switch on pipeline stall
SMT            | Everything but I-fetch buffers, RAS, ARF, control logic/state, ROB, SQ, … | All contexts concurrently active; no switch
CMP            | L2 cache, system interconnect                                             | All contexts concurrently active; no switch

Intel's Hyper-Threading Technology
- A single physical processor appears as two logical processors by applying a two-threaded SMT approach
- Each logical processor maintains a complete set of the architecture state (general-purpose registers, control registers, …)
- Logical processors share nearly all other resources, such as caches, execution units, branch predictors, control logic, and buses
- ROB entries and load and store buffer entries are statically partitioned between the two threads
- Partitioned resources are recombined when only one thread is active
- Adds less than 5% to the relative chip size
- Improves performance by 16% to 28% on server applications
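The static-partitioning policy described for Hyper-Threading can be sketched in a few lines: queue entries are split evenly between two active logical processors and recombined in single-thread mode. The 128-entry total is a hypothetical number for illustration.

```python
# Sketch of Hyper-Threading-style static partitioning: buffer entries
# (e.g., ROB or store-buffer slots) are split evenly between two active
# logical processors, and recombined when only one thread is running.
def partition(total_entries, active_threads):
    """Per-thread entry budgets for 1 or 2 active logical processors."""
    if active_threads == 1:
        return [total_entries]            # single-thread mode: full machine
    half = total_entries // 2
    return [half, total_entries - half]   # static halves, no dynamic stealing

print(partition(128, 1))  # [128]
print(partition(128, 2))  # [64, 64]
```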
Explicit vs. Implicit Multithreading
- Explicit
  - Improve instruction throughput
  - Programmer-created threads
- Implicit
  - Improve an individual application's performance
  - Dynamically generated threads

Challenges in IMT Processor Designs
- Resolving control dependences
- Resolving register data dependences
- Resolving memory data dependences

Resolving Control Dependences
- Spawn implicit future threads at subsequent control-independent points in the program's control flow
- [Figure: a control-flow graph (blocks A–H); future threads are spawned at blocks where the divergent paths reconverge]
Resolving Register Data Dependences
- Dependences within a thread: resolved with standard techniques
- Dependences across threads
  - Disallow interthread register data dependences; communicate all shared operands through memory with loads/stores
  - Have the compiler identify interthread dependences explicitly
  - Data dependence speculation

Resolving Memory Data Dependences
- Interthread false dependences (WAR and WAW): buffer writes from future threads and commit them when those threads retire
- Interthread true dependences (RAW)
  - Future threads assume no dependence violations + extensions to snoop-based cache coherence
  - Track loads/stores from each thread in separate per-thread load/store queues

Disjoint Eager Execution
- Choose the branch path with the highest cumulative prediction rate
- [Figure: a branch tree with per-branch prediction accuracy 0.75; cumulative probabilities 0.75, 0.56, 0.42, 0.32, 0.24 down the predicted path (branches 1–5) and 0.25, 0.19, 0.14 on the not-predicted side paths]
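The cumulative probabilities in the slide's tree (0.75, 0.56, 0.42, ... on the predicted path; 0.25, 0.19, 0.14 on the side paths) can be reproduced with a greedy sketch that always fetches the most probable unfetched block next. The function and path encoding are illustrative.

```python
# Disjoint eager execution sketch: with per-branch prediction accuracy p,
# greedily fetch the basic block with the highest cumulative probability.
# "T" = predicted direction, "N" = not-predicted direction.
import heapq

def disjoint_eager(p, n_fetch):
    heap = [(-1.0, "")]                  # (-cumulative probability, path)
    picked = []
    while heap and len(picked) < n_fetch:
        neg, path = heapq.heappop(heap)  # most probable unfetched block
        prob = -neg
        picked.append((path or "root", prob))
        heapq.heappush(heap, (-(prob * p), path + "T"))
        heapq.heappush(heap, (-(prob * (1 - p)), path + "N"))
    return picked

for path, prob in disjoint_eager(0.75, 8):
    print(path, prob)
```

With p = 0.75 the fetch order follows the predicted path down to cumulative 0.32, then switches to the first side path (0.25) before continuing, exactly as the slide's numbering suggests.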
Executing the Same Thread
- Execute the same instructions in multiple contexts
  - Fault detection (transient errors)
  - Prefetching
  - Branch resolution

Real Processor: IBM Power5
- Each processor has two full-performance processor cores
- Each core supports two-way SMT
- Right picture: a Power5 MCM (multi-chip module) with four processor chips (16 logical CPUs)
- Each chip has 276M transistors and a die size of 389 mm²
- [Photo: POWER5 chief scientist Balaram Sinharoy holding a POWER5 MCM]

Further Reading
1. Reference book, Chapter 11, "Executing Multiple Threads"
2. "A survey of processors with explicit multithreading", Theo Ungerer, Borut Robic, and Jurij Silc, ACM Computing Surveys, Vol. 35, No. 1, March 2003, pages 29-63