Simultaneous Multithreading (SMT, aka Hyperthreading)
Haowen Chan Vladislav Shkapenyuk
References
• Jack Lo, Susan Eggers, Joel Emer, Henry Levy, Rebecca Stamm, and Dean Tullsen. "Converting Thread-Level Parallelism into Instruction-Level Parallelism via Simultaneous Multithreading." ACM Transactions on Computer Systems, August 1997, pp. 322–354.
• D. Marr, F. Binns, D. Hill, G. Hinton, D. Koufaty, J. Miller, and M. Upton. "Hyper-Threading Technology Architecture and Microarchitecture: A Hypertext History." Intel Technology Journal.
• Joel Emer. "The Post-ultimate Alpha." PACT 2001 keynote speech.
• Jack L. Lo, Sujay S. Parekh, Susan J. Eggers, Henry M. Levy, and Dean M. Tullsen. "Software-Directed Register Deallocation for Simultaneous Multithreaded Processors." University of Washington Technical Report UW-CSE-97-12-01.
• S. Hily and A. Seznec. "Contention on 2nd Level Cache May Limit the Effectiveness of Simultaneous Multithreading." IRISA Report No. 1086, February 1997.
Problem Statement
• Consider small-scale (2–8 processor) on-chip shared-memory MPs vs. wide-issue superscalar processors with similar numbers of FUs
• Tradeoff between instruction-level parallelism (ILP) and thread-level parallelism (TLP)
• Which architecture is better for applications with high ILP and low TLP? For applications with high TLP and low ILP?
Superscalar Processors
• Does well when there is lots of ILP, but ILP in general applications is limited by dependencies
• Cannot exploit TLP effectively, since only one context can be served at any one time
Multiprocessors
• Can exploit TLP, but ability to exploit ILP is limited due to static partitioning of resources between processors (e.g. each thread only has access to 1/n of total available functional units on-chip).
Simultaneous Multithreading (SMT)
• Best of both worlds: run multiple threads simultaneously on a unified processor so that resource utilization is maximized
• Good for applications with either good ILP or good TLP, and for workloads with variable characteristics
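The ILP/TLP tradeoff above can be made concrete with a toy throughput model (an illustrative sketch, not from the slides; the function names and the FU/ILP numbers are hypothetical):

```python
# Toy model: compare a statically partitioned on-chip MP against an
# SMT core that pools the same total number of functional units (FUs).
# Throughput here is "instructions issued per cycle", idealized.

def mp_throughput(fus_total, n_procs, thread_ilps):
    """On-chip MP: each thread runs on one processor and can use at
    most its static 1/n share of the FUs."""
    fus_per_proc = fus_total // n_procs
    return sum(min(ilp, fus_per_proc) for ilp in thread_ilps[:n_procs])

def smt_throughput(fus_total, thread_ilps):
    """SMT: all threads draw, each cycle, from one pooled set of FUs."""
    return min(sum(thread_ilps), fus_total)

# 8 FUs total; the MP is 4-way, so each processor gets 2 FUs.
print(mp_throughput(8, 4, [2, 2, 2, 2]))  # high TLP, low ILP: 8
print(smt_throughput(8, [2, 2, 2, 2]))    # 8 -- both do well
print(mp_throughput(8, 4, [6]))           # low TLP, high ILP: 2
print(smt_throughput(8, [6]))             # 6 -- not confined to 1/n
```

The single high-ILP thread is capped at its 2-FU partition on the MP, while SMT lets it use the whole pool; with many low-ILP threads the two designs converge.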
Implementation
• Can be designed on top of an out-of-order superscalar processor by adding support for n simultaneous threads
• The processor is presented to the OS as n logical processors; hardware provides support for n independent contexts
• Context-supporting resources are either replicated n times or pooled and shared
Processor Resources
Replicated (one per thread)
• Program counters
• Subroutine return stacks
• I-cache ports
• Instruction TLB
• Architectural registers and register-renaming logic
• Per-thread instruction retirement

Pooled (shared by all threads)
• Functional units
• Caches
• Physical registers
SMT Operation
• Significant changes from a conventional superscalar only in instruction fetch and the register file structure
• IF stage: conventional branch prediction drives instruction fetch, augmented by extra thread-id bits in the BTB
• Fetch mechanism selects n threads and fetches m instructions from each thread
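One published way to pick the n threads is the ICOUNT heuristic from the University of Washington SMT work: prefer threads with the fewest instructions in the pre-execution pipeline stages. A minimal sketch (function names are illustrative):

```python
def select_fetch_threads(icounts, n=2):
    """ICOUNT heuristic (sketch): favor the n threads with the fewest
    instructions in the decode/rename/queue stages, so fast-moving
    threads get fetch bandwidth and clogged ones are throttled.
    icounts maps thread id -> in-flight pre-execution instruction count."""
    return sorted(icounts, key=lambda tid: (icounts[tid], tid))[:n]

def fetch_cycle(icounts, n=2, m=8):
    """Fetch up to m instructions from each of the n selected threads
    (the 'ICOUNT.n.m' naming used in the original SMT papers)."""
    return {tid: m for tid in select_fetch_threads(icounts, n)}

# Threads 1 (3 in flight) and 2 (7 in flight) win the fetch slots:
print(fetch_cycle({0: 12, 1: 3, 2: 7, 3: 9}))  # {1: 8, 2: 8}
```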
SMT Operation
• Register file may be substantially larger to support multiple simultaneous threads without running out of register capacity
• Hence register reads and writes will be slower and may require an extra pipeline stage
• Renaming, instruction queueing, and execution remain similar to a conventional superscalar implementation
• Intel reports only 5% added to the die area when SMT was added to the Xeon
SMT Benefits
• Higher resource utilization => more speedup
• High flexibility of resource allocation => good performance over a wider range of applications
• Unified cache => no cache-coherence overhead
• Fast and easy thread-level synchronization can be leveraged from shared resources, e.g.:
– a "synchronization functional unit"
– synchronization primitives stored in the shared L1 cache
SMT Tradeoffs
• Register underutilization: conventional architectures are very conservative in register deallocation (a physical register is typically freed only when the instruction that redefines its architectural register commits)
• Result: the register file is the limiting factor in SMT machines
• A larger register file means higher cost and slower access time; it may increase the cycle time or require multi-cycle register access
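The pressure on the shared physical pool can be sketched with a toy rename map (a hypothetical model, not a real design): each thread has its own rename table, all tables draw from one free list, and a superseded physical register returns to the pool only at commit, so one register-hungry thread can stall renaming for every thread.

```python
class RenameMap:
    """Sketch of per-thread register renaming over one shared
    physical register pool, with conservative deallocation."""

    def __init__(self, n_phys):
        self.free = list(range(n_phys))   # free list shared by all threads
        self.table = {}                   # (thread id, arch reg) -> phys reg

    def rename_dest(self, tid, arch_reg):
        """Allocate a fresh physical register for a destination.
        The previous mapping is NOT freed here -- only when this
        redefining instruction commits (the conservative policy)."""
        if not self.free:
            raise RuntimeError("rename stall: physical registers exhausted")
        prev = self.table.get((tid, arch_reg))
        self.table[(tid, arch_reg)] = self.free.pop()
        return prev  # caller returns this to the pool at commit time

    def commit(self, prev):
        """At commit, the superseded physical register finally
        rejoins the shared free list."""
        if prev is not None:
            self.free.append(prev)
```

With a small pool, any thread's live mappings shrink the free list seen by all threads, which is why the slides call the register file the limiting factor.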
SMT Tradeoffs
• Unified cache may lead to cache interference between threads
• Additional stress on memory hierarchy - L2 cache bandwidth might be insufficient for more than a few threads
SMT Tradeoffs
• In parallel applications, threads tend to have the same resource needs at the same time – flexibility in allocation may not gain much
• Scheduling overhead in SMT might result in performance degradation if OS doesn’t use all available logical CPUs
• Interference in BTB and branch prediction structures
• The SMT hardware's priority mechanism for selecting which threads to serve may interfere with OS-level priority mechanisms
Existing SMT Implementations
• Intel's Xeon and 3 GHz Pentium 4 implement a 2-context SMT processor
• Compaq designed a 4-context SMT processor, the Alpha 21464 (EV8), now officially dead
• Clearwater Networks has an 8-context SMT network processor
• Intel implemented 16-context SMT in its IXP1200 network processor
• Sun's UltraSPARC V and IBM's POWER5 also promise to implement some form of SMT
SMT in Intel Xeon
Intel Xeon Front-end
Intel Xeon Execution Engine
Performance Results