Simultaneous Multithreading (SMT, aka Hyperthreading)
Haowen Chan Vladislav Shkapenyuk
References
• Jack Lo, Susan Eggers, Joel Emer, Henry Levy, Rebecca Stamm, and Dean Tullsen. "Converting Thread-Level Parallelism into Instruction-Level Parallelism via Simultaneous Multithreading." ACM Transactions on Computer Systems, August 1997, pp. 322–354.
• D. Marr, F. Binns, D. Hill, G. Hinton, D. Koufaty, J. Miller, and M. Upton. "Hyper-Threading Technology Architecture and Microarchitecture: A Hypertext History." Intel Technology Journal.
• Joel Emer. "The Post-ultimate Alpha." PACT 2001 keynote speech.
• Jack L. Lo, Sujay S. Parekh, Susan J. Eggers, Henry M. Levy, and Dean M. Tullsen. "Software-Directed Register Deallocation for Simultaneous Multithreaded Processors." University of Washington Technical Report UW-CSE-97-12-01.
• S. Hily and A. Seznec. "Contention on 2nd Level Cache May Limit the Effectiveness of Simultaneous Multithreading." IRISA Report No. 1086, February 1997.
Problem Statement
• Consider small-scale (2–8 processor) on-chip shared-memory MPs vs. wide-issue superscalar processors with similar numbers of FUs
• Tradeoff between instruction-level parallelism (ILP) and thread-level parallelism (TLP)
• Which architecture is better for applications with high ILP and low TLP? For applications with high TLP and low ILP?
Superscalar Processors
• Does well when there is lots of ILP, but ILP in general applications is limited by dependencies
• Cannot exploit TLP effectively, since only one context can be served at any one time
Multiprocessors
• Can exploit TLP, but ability to exploit ILP is limited due to static partitioning of resources between processors (e.g. each thread only has access to 1/n of total available functional units on-chip).
Simultaneous Multithreading (SMT)
• Best of both worlds: run multiple threads simultaneously on a unified processor so that resource utilization is maximized
• Good for applications with either good ILP or good TLP, and for workloads with variable characteristics
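The ILP/TLP tradeoff above can be made concrete with a toy throughput model (an illustrative sketch, not from the slides; the function names and the FU/ILP numbers are hypothetical):

```python
# Toy model: compare a statically partitioned on-chip MP against an
# SMT core that pools the same total number of functional units (FUs).
# Throughput here is "instructions issued per cycle", idealized.

def mp_throughput(fus_total, n_procs, thread_ilps):
    """On-chip MP: each thread runs on one processor and can use at
    most its static 1/n share of the FUs."""
    fus_per_proc = fus_total // n_procs
    return sum(min(ilp, fus_per_proc) for ilp in thread_ilps[:n_procs])

def smt_throughput(fus_total, thread_ilps):
    """SMT: all threads draw, each cycle, from one pooled set of FUs."""
    return min(sum(thread_ilps), fus_total)

# 8 FUs total; the MP is 4-way, so each processor gets 2 FUs.
print(mp_throughput(8, 4, [2, 2, 2, 2]))  # high TLP, low ILP: 8
print(smt_throughput(8, [2, 2, 2, 2]))    # 8 -- both do well
print(mp_throughput(8, 4, [6]))           # low TLP, high ILP: 2
print(smt_throughput(8, [6]))             # 6 -- not confined to 1/n
```

The single high-ILP thread is capped at its 2-FU partition on the MP, while SMT lets it use the whole pool; with many low-ILP threads the two designs converge.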
Implementation
• Can be designed on top of an out-of-order superscalar processor by adding support for n simultaneous threads
• The processor is presented to the OS as n logical processors; hardware provides support for n independent contexts
• Context-supporting resources are either replicated n times or pooled and shared
Processor Resources
Replicated (one per thread)
• Program counters
• Subroutine return stacks
• I-cache ports
• Instruction TLB
• Architectural registers and register-renaming logic
• Per-thread instruction retirement

Pooled (shared by all threads)
• Functional units
• Caches
• Physical registers
SMT Operation
• Significant changes from a conventional superscalar only in instruction fetch and the register file structure
• IF stage: conventional branch prediction drives instruction fetch, augmented by extra thread-id bits in the BTB
• Fetch mechanism selects n threads and fetches m instructions from each thread
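One published way to pick the n threads is the ICOUNT heuristic from the University of Washington SMT work: prefer threads with the fewest instructions in the pre-execution pipeline stages. A minimal sketch (function names are illustrative):

```python
def select_fetch_threads(icounts, n=2):
    """ICOUNT heuristic (sketch): favor the n threads with the fewest
    instructions in the decode/rename/queue stages, so fast-moving
    threads get fetch bandwidth and clogged ones are throttled.
    icounts maps thread id -> in-flight pre-execution instruction count."""
    return sorted(icounts, key=lambda tid: (icounts[tid], tid))[:n]

def fetch_cycle(icounts, n=2, m=8):
    """Fetch up to m instructions from each of the n selected threads
    (the 'ICOUNT.n.m' naming used in the original SMT papers)."""
    return {tid: m for tid in select_fetch_threads(icounts, n)}

# Threads 1 (3 in flight) and 2 (7 in flight) win the fetch slots:
print(fetch_cycle({0: 12, 1: 3, 2: 7, 3: 9}))  # {1: 8, 2: 8}
```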
SMT Operation
• Register file may be substantially larger to support multiple simultaneous threads without running out of register capacity
• Hence register reads and writes will be slower and may require an extra pipeline stage
• Renaming, instruction queueing, and execution remain similar to a conventional superscalar implementation
• Intel reports only 5% added to the die area when SMT was added to the Xeon
SMT Benefits
• Higher resource utilization => more speedup
• High flexibility of resource allocation => good performance over a wider range of applications
• Unified cache => no cache-coherence overhead
• Fast and easy thread-level synchronization can be leveraged from shared resources, e.g.:
– a "synchronization functional unit"
– synchronization primitives stored in the shared L1 cache
SMT Tradeoffs
• Register underutilization: conventional architectures are very conservative in register deallocation (a physical register is typically freed only when the instruction that redefines its architectural register commits)
• Result: the register file is the limiting factor in SMT machines
• A larger register file means higher cost and slower access time; it may increase the cycle time or require multi-cycle register access
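The pressure on the shared physical pool can be sketched with a toy rename map (a hypothetical model, not a real design): each thread has its own rename table, all tables draw from one free list, and a superseded physical register returns to the pool only at commit, so one register-hungry thread can stall renaming for every thread.

```python
class RenameMap:
    """Sketch of per-thread register renaming over one shared
    physical register pool, with conservative deallocation."""

    def __init__(self, n_phys):
        self.free = list(range(n_phys))   # free list shared by all threads
        self.table = {}                   # (thread id, arch reg) -> phys reg

    def rename_dest(self, tid, arch_reg):
        """Allocate a fresh physical register for a destination.
        The previous mapping is NOT freed here -- only when this
        redefining instruction commits (the conservative policy)."""
        if not self.free:
            raise RuntimeError("rename stall: physical registers exhausted")
        prev = self.table.get((tid, arch_reg))
        self.table[(tid, arch_reg)] = self.free.pop()
        return prev  # caller returns this to the pool at commit time

    def commit(self, prev):
        """At commit, the superseded physical register finally
        rejoins the shared free list."""
        if prev is not None:
            self.free.append(prev)
```

With a small pool, any thread's live mappings shrink the free list seen by all threads, which is why the slides call the register file the limiting factor.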
SMT Tradeoffs
• Unified cache may lead to cache interference between threads
• Additional stress on memory hierarchy - L2 cache bandwidth might be insufficient for more than a few threads
SMT Tradeoffs
• In parallel applications, threads tend to have the same resource needs at the same time – flexibility in allocation may not gain much
• Scheduling overhead in SMT might result in performance degradation if OS doesn’t use all available logical CPUs
• Interference in BTB and branch prediction structures
• The SMT hardware's priority mechanism for selecting which threads to serve may interfere with OS-level priority mechanisms
Existing SMT Implementations
• Intel's Xeon and 3 GHz Pentium 4 implement a 2-context SMT processor
• Compaq designed a 4-context SMT processor, the Alpha 21464 (EV8), now officially dead
• Clearwater Networks has an 8-context SMT network processor
• Intel implemented 16-context SMT in its IXP1200 network processor
• Sun's UltraSPARC V and IBM's POWER5 also promise to implement some form of SMT
SMT in Intel Xeon
Intel Xeon Front-end
Intel Xeon Execution Engine
Performance Results