Simultaneous Multithreading (SMT, a.k.a. Hyperthreading)

Haowen Chan
Vladislav Shkapenyuk

References

• Jack L. Lo, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Rebecca L. Stamm, and Dean M. Tullsen. "Converting Thread-Level Parallelism into Instruction-Level Parallelism via Simultaneous Multithreading." ACM Transactions on Computer Systems, August 1997, pages 322-354.
• D. Marr, F. Binns, D. Hill, G. Hinton, D. Koufaty, J. Miller, and M. Upton. "Hyper-Threading Technology Architecture and Microarchitecture." Intel Technology Journal, 2002.
• Joel Emer. "The Post-Ultimate Alpha." PACT 2001 keynote speech.
• Jack L. Lo, Sujay S. Parekh, Susan J. Eggers, Henry M. Levy, and Dean M. Tullsen. "Software-Directed Register Deallocation for Simultaneous Multithreaded Processors." University of Washington Technical Report UW-CSE-97-12-01.
• S. Hily and A. Seznec. "Contention on 2nd Level Cache May Limit the Effectiveness of Simultaneous Multithreading." IRISA Report No. 1086, February 1997.

Problem Statement

• Consider small-scale (2-8 processor) on-chip shared-memory multiprocessors versus wide-issue superscalar processors with similar numbers of functional units.
• The comparison is a tradeoff between instruction-level parallelism (ILP) and thread-level parallelism (TLP).
• Which architecture is better for applications with high ILP and low TLP? For applications with high TLP and low ILP? (A C++ sketch contrasting ILP and TLP follows this section.)

Superscalar Processors

• Do well when there is plenty of ILP, but ILP in general applications is limited by dependencies.
• Cannot exploit TLP effectively, since only one context can be served at a time.

Multiprocessors

• Can exploit TLP, but their ability to exploit ILP is limited by the static partitioning of resources between processors (e.g., each thread has access to only 1/n of the functional units on the chip).

Simultaneous Multithreading (SMT)

• Best of both worlds: run multiple threads simultaneously on a single unified processor so that resource utilization is maximized.
• Good for applications with either good ILP or good TLP, and for workloads with variable characteristics.

Implementation

• Can be designed on top of an out-of-order superscalar processor by adding support for n simultaneous threads.
• The processor is presented to the OS as n logical processors; the hardware provides n independent contexts. (See the logical-processor sketch after this section.)
• Context-supporting resources are either replicated n times or pooled and shared.

Processor Resources

Replicated:
• Program counters
• Subroutine return stacks
• I-cache ports
• Instruction TLB
• Architectural registers and register-renaming logic
• Per-thread instruction retirement

Pooled:
• Functional units
• Caches
• Physical registers

SMT Operation

• Differs significantly from a conventional superscalar only in instruction fetch and in the structure of the register file.
• IF stage: conventional branch prediction drives instruction fetch, augmented by extra thread-ID bits in the BTB.
• Each cycle, the fetch mechanism selects n threads and fetches m instructions from each. (A toy model of this fetch stage appears after this section.)
• The register file must be substantially larger to hold the state of multiple simultaneous threads without running out of registers.
• Register reads and writes are therefore slower and may require an extra pipeline stage.
• Renaming, instruction queueing, and execution remain similar to a conventional superscalar implementation.
• Intel reports that adding SMT to the Xeon increased the die area by only about 5%.
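To make the ILP/TLP distinction from the problem statement concrete, here is a minimal C++ sketch of a hypothetical array-sum workload (not from the original slides): the serial loop is one long dependence chain that a wide-issue superscalar cannot overlap, while the threaded version exposes independent work that an SMT processor can co-schedule on shared functional units.

```cpp
// A minimal ILP-vs-TLP sketch (hypothetical workload).
// Build: g++ -O2 -pthread ilp_tlp.cpp
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// Low ILP: every add depends on the previous one, so a wide-issue
// superscalar cannot overlap the additions.
uint64_t sum_dependent(const std::vector<uint64_t>& v) {
    uint64_t s = 0;
    for (uint64_t x : v) s += x;          // serial dependence chain
    return s;
}

// TLP: split the array across threads; an SMT core can run the
// independent partial sums through the same pipeline simultaneously.
uint64_t sum_threaded(const std::vector<uint64_t>& v, unsigned nthreads) {
    std::vector<uint64_t> partial(nthreads, 0);
    std::vector<std::thread> ts;
    size_t chunk = v.size() / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        size_t lo = t * chunk;
        size_t hi = (t + 1 == nthreads) ? v.size() : lo + chunk;
        ts.emplace_back([&, lo, hi, t] {
            uint64_t s = 0;
            for (size_t i = lo; i < hi; ++i) s += v[i];
            partial[t] = s;               // one independent result per thread
        });
    }
    for (auto& th : ts) th.join();
    uint64_t s = 0;
    for (uint64_t p : partial) s += p;
    return s;
}

int main() {
    std::vector<uint64_t> v(1 << 24, 1);
    std::printf("serial:   %llu\n", (unsigned long long)sum_dependent(v));
    std::printf("threaded: %llu\n", (unsigned long long)sum_threaded(v, 2));
}
```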
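As a small illustration of the "n logical processors" point in the implementation slide: standard C++ already exposes the contexts the OS sees. std::thread::hardware_concurrency() typically reports logical processors, so a 2-context SMT (Hyper-Threaded) part reports twice its physical core count.

```cpp
// A minimal sketch: the OS sees each hardware thread context as a
// logical processor. hardware_concurrency() typically counts logical
// processors (it may return 0 if the count is unknown).
#include <cstdio>
#include <thread>

int main() {
    unsigned logical = std::thread::hardware_concurrency();
    std::printf("logical processors visible to the OS: %u\n", logical);
    // On Linux, /proc/cpuinfo's "siblings" vs. "cpu cores" fields
    // distinguish SMT contexts from physical cores.
}
```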
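The fetch rule just described ("select n threads, fetch m instructions from each") can be modeled as a toy simulation. The constants and the round-robin selection policy below are illustrative assumptions, not the actual hardware policy.

```cpp
// A toy model (not hardware-accurate) of the SMT fetch stage: each
// cycle, pick N_SEL threads round-robin and fetch up to M_FETCH
// instructions from each selected thread's program counter.
#include <array>
#include <cstdio>

constexpr int N_THREADS = 4;  // hardware contexts (assumed)
constexpr int N_SEL     = 2;  // threads selected per cycle (assumed)
constexpr int M_FETCH   = 4;  // instructions per selected thread (assumed)

int main() {
    std::array<int, N_THREADS> pc{};  // one program counter per context
    int next = 0;                     // round-robin pointer
    for (int cycle = 0; cycle < 3; ++cycle) {
        std::printf("cycle %d:", cycle);
        for (int s = 0; s < N_SEL; ++s) {
            int t = (next + s) % N_THREADS;
            std::printf("  T%d fetch [%d..%d)", t, pc[t], pc[t] + M_FETCH);
            pc[t] += M_FETCH;         // advance that thread's PC
        }
        next = (next + N_SEL) % N_THREADS;
        std::printf("\n");
    }
}
```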
SMT Benefits

• Higher resource utilization => more speedup.
• Highly flexible resource allocation => good performance over a wider range of applications.
• Unified cache => no cache-coherency overhead.
• Fast, cheap thread-level synchronization can be built on the shared resources, e.g.:
– a dedicated "synchronization functional unit";
– synchronization primitives stored in the shared L1 cache (see the spinlock sketch at the end of this section).

SMT Tradeoffs

• Register underutilization: conventional architectures are very conservative about register deallocation. As a result, the register file is the limiting factor in SMT machines.
• A larger register file means higher cost and slower access time; it may lengthen the cycle time or require multi-cycle register access.
• The unified cache may lead to cache interference between threads.
• Additional stress on the memory hierarchy: L2 cache bandwidth may be insufficient for more than a few threads (see the contention sketch at the end of this section).
• In parallel applications, threads tend to have the same resource needs at the same time, so flexibility in allocation may not gain much.
• SMT scheduling overhead may degrade performance if the OS does not use all available logical CPUs.
• Threads interfere with one another in the BTB and other branch-prediction structures.
• The hardware mechanism that prioritizes which threads to serve may interfere with OS priority mechanisms.

Existing SMT Implementations

• The Intel Xeon and the 3 GHz Pentium 4 implement a 2-context SMT processor.
• Compaq designed a 4-context SMT processor, the Alpha 21464 (EV8); the project is now officially dead.
• Clearwater Networks has an 8-context SMT network processor.
• Intel implemented 16-context SMT in its IXP1200 network processor.
• Sun's UltraSPARC V and IBM's POWER5 also promise to implement some form of SMT.

SMT in Intel Xeon

(The remaining slides were figures: block diagrams of the Intel Xeon front-end and execution engine, followed by performance results.)
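A minimal sketch of the synchronization benefit above, assuming a conventional x86 SMT part rather than the hypothetical synchronization functional unit: the lock word naturally sits in the L1 cache shared by the sibling contexts, and the x86 PAUSE hint (_mm_pause, from <immintrin.h>, so this sketch is x86-specific) frees pipeline resources for the other thread while spinning.

```cpp
// SMT-friendly spinlock sketch. Build: g++ -O2 -pthread spin.cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <immintrin.h>

std::atomic_flag lock_word = ATOMIC_FLAG_INIT;  // lives in the shared L1
long counter = 0;                               // protected by lock_word

void worker(int iters) {
    for (int i = 0; i < iters; ++i) {
        while (lock_word.test_and_set(std::memory_order_acquire))
            _mm_pause();              // yield pipeline resources to sibling
        ++counter;                    // critical section
        lock_word.clear(std::memory_order_release);
    }
}

int main() {
    std::thread a(worker, 100000), b(worker, 100000);
    a.join(); b.join();
    std::printf("counter = %ld\n", counter);  // expect 200000
}
```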
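And a rough sketch of the cache-contention tradeoff, in the spirit of the Hily and Seznec report cited above: each thread streams over a private buffer, and if the combined working set exceeds the shared cache, co-scheduled threads evict each other's lines. The buffer size, pass count, and thread placement are assumptions to tune per machine; comparing the co-run time against a single-thread run, with the threads pinned to SMT siblings via OS affinity calls, would expose the interference.

```cpp
// Cache-contention microbenchmark sketch.
// Build: g++ -O2 -pthread contention.cpp
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

constexpr size_t WS_PER_THREAD = 1 << 20;  // 1 MiB per thread (assumed)

void stream(std::vector<char>& buf, int passes, long& sink) {
    long s = 0;
    for (int p = 0; p < passes; ++p)
        for (size_t i = 0; i < buf.size(); i += 64)  // one touch per line
            s += buf[i];
    sink = s;  // keep the loads observable
}

int main() {
    std::vector<char> b1(WS_PER_THREAD, 1), b2(WS_PER_THREAD, 2);
    long s1 = 0, s2 = 0;
    auto t0 = std::chrono::steady_clock::now();
    std::thread a(stream, std::ref(b1), 200, std::ref(s1));
    std::thread b(stream, std::ref(b2), 200, std::ref(s2));
    a.join(); b.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    std::printf("co-run time: %lld ms (sinks %ld %ld)\n",
                (long long)ms, s1, s2);
}
```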
