
Simultaneous Multithreading (SMT, aka Hyperthreading)

Haowen Chan Vladislav Shkapenyuk

References

• Jack Lo, Susan Eggers, Joel Emer, Henry Levy, Rebecca Stamm, and Dean Tullsen. "Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading." ACM Transactions on Computer Systems, August 1997, pages 322-354.
• Marr, D., Binns, F., Hill, D., Hinton, G., Koufaty, D., Miller, J., Upton, M. "Hyper-Threading Technology Architecture and Microarchitecture." Intel Technology Journal.
• Joel Emer. "The Post-Ultimate Alpha." PACT 2001 keynote speech.
• Jack L. Lo, Sujay S. Parekh, Susan J. Eggers, Henry M. Levy, and Dean M. Tullsen. "Software-Directed Register Deallocation for Simultaneous Multithreaded Processors." University of Washington Technical Report #UW-CSE-97-12-01.
• S. Hily, A. Seznec. "Contention on 2nd Level Cache May Limit the Effectiveness of Simultaneous Multithreading." IRISA Report No. 1086, Feb. 1997.

Problem Statement

• Consider small-scale (2-8 processor) on-chip shared-memory multiprocessors vs. wide-issue superscalar processors with similar numbers of functional units.
• Tradeoff between instruction-level parallelism (ILP) and thread-level parallelism (TLP).
• Which architecture is better for applications with high ILP and low TLP? For applications with high TLP and low ILP?
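The tradeoff can be made concrete with a toy throughput model. All numbers below (issue widths, ILP values) are illustrative assumptions, not figures from the slides:

```python
# Toy model: per-cycle throughput of a wide superscalar vs. a small
# on-chip MP, given a workload's available ILP and TLP.
# All numbers are illustrative assumptions, not measurements.

def superscalar_ipc(issue_width, ilp_per_thread):
    # One context: limited by whichever is smaller, the machine
    # width or the single thread's ILP.
    return min(issue_width, ilp_per_thread)

def mp_ipc(n_cores, width_per_core, ilp_per_thread, n_threads):
    # Resources are statically partitioned: each thread sees only
    # width_per_core issue slots, and idle cores help nobody.
    active = min(n_cores, n_threads)
    return active * min(width_per_core, ilp_per_thread)

# 8-wide superscalar vs. a 4 x 2-wide MP (same total issue slots).
# High ILP, low TLP: one thread with ILP of 6.
print(superscalar_ipc(8, 6))   # 6 - the superscalar wins
print(mp_ipc(4, 2, 6, 1))      # 2 - only one 2-wide core is busy

# High TLP, low ILP: four threads, each with ILP of 2.
print(superscalar_ipc(8, 2))   # 2 - only one context runs at a time
print(mp_ipc(4, 2, 2, 4))      # 8 - all four cores are busy
```

Each organization wins one workload and loses the other, which is exactly the gap SMT aims to close.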

Superscalar Processors

• Does well when there is lots of ILP, but ILP in general applications is limited due to dependencies.
• Cannot exploit TLP effectively, since only one thread context can be served at any one time.

Multiprocessors

• Can exploit TLP, but ability to exploit ILP is limited due to static partitioning of resources between processors (e.g. each thread only has access to 1/n of total available functional units on-chip).

Simultaneous Multithreading (SMT)

• Best of both worlds – run multiple threads simultaneously on a unified core so that resource utilization is maximized.
• Good for applications with either good ILP or TLP, or for workloads with variable characteristics.

Implementation

• Can be designed on top of an out-of-order superscalar processor by adding support for n simultaneous threads.
• Processor is presented to the OS as n logical processors; hardware provides support for n independent contexts.
• Context-supporting resources are either replicated n times or pooled together and shared.

Processor Resources

Replicated:
• Program counters
• Subroutine return stacks
• I-cache ports
• Instruction TLB
• Architectural registers and renaming logic
• Per-thread instruction retirement

Pooled:
• Functional units
• Caches
• Physical registers

SMT Operation

• Significant changes from a conventional superscalar only in instruction fetch and the register file.
• IF stage: conventional branch prediction drives instruction fetch (augmented by extra thread-id bits in the BTB).
• Fetch mechanism selects n threads and fetches m instructions from each thread.
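The "select n threads, fetch m instructions from each" step can be sketched as follows. The round-robin thread choice here is an assumption for illustration; real designs use smarter policies (e.g. ICOUNT, which favors threads with the fewest instructions in flight):

```python
# Sketch of an SMT fetch stage: each cycle, pick up to n_sel threads
# and fetch up to m instructions from each. Thread selection is
# simple round-robin (an assumption; real policies differ).

from collections import deque

def fetch_cycle(threads, start, n_sel, m):
    """threads: one deque of pending instructions per hardware context.
    Returns (fetched (tid, insn) pairs, next round-robin start index)."""
    fetched = []
    n = len(threads)
    picked = 0
    for i in range(n):
        tid = (start + i) % n
        if threads[tid] and picked < n_sel:
            for _ in range(min(m, len(threads[tid]))):
                fetched.append((tid, threads[tid].popleft()))
            picked += 1
    return fetched, (start + 1) % n

# Two contexts; select 2 threads, up to 4 instructions each per cycle.
t = [deque(f"i{k}" for k in range(8)), deque(f"j{k}" for k in range(3))]
got, nxt = fetch_cycle(t, 0, 2, 4)
print(len(got))   # 7: four from thread 0, three from thread 1
```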

SMT Operation

• Register file may be substantially larger to support multiple simultaneous threads without running out of register capacity.
• Hence register reads and writes will be slower and may require an extra pipeline stage.
• Renaming, instruction queueing, and execution remain similar to a conventional superscalar implementation.
• Intel reports only 5% added to the die area when SMT was added to the Xeon.
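A back-of-the-envelope sizing shows why the register file grows: each context needs its own architectural registers, while rename registers can be pooled. The register counts below are assumptions for illustration:

```python
# Rough physical register file sizing (illustrative numbers only).
# Each hardware context needs a full set of architectural registers;
# the rename registers are pooled across all threads.
ARCH_REGS = 32        # architectural integer registers per context
RENAME_REGS = 100     # pooled rename registers (assumed)

def phys_regs(n_contexts):
    return n_contexts * ARCH_REGS + RENAME_REGS

print(phys_regs(1))   # 132 for a conventional superscalar
print(phys_regs(4))   # 228 for a 4-context SMT - ~73% larger
```

More entries plus more read/write ports is what pushes register access toward an extra pipeline stage.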

SMT Benefits

• Higher resource utilization => higher throughput.
• High flexibility of resource allocation => good performance over a wider range of applications.
• Unified cache => no cache coherency overhead.
• Fast and easy thread-level synchronization can be leveraged from shared resources, e.g.:
  – a "synchronization function unit"
  – synchronization primitives stored in the shared L1 cache

SMT Tradeoffs

• Register underutilization – conventional architectures are very conservative in register deallocation.

• Result – the register file is the limiting factor in SMT machines.
• Larger register file – higher cost and slower access time; may increase the cycle time or require multi-cycle register access.
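The conservatism can be illustrated with a toy trace: hardware typically frees a physical register only when a later redefinition of the same architectural register commits, not at the value's last use. The cycle numbers below are assumptions for illustration:

```python
# Toy illustration of conservative register deallocation: the value
# mapped to architectural register r1 stays pinned to a physical
# register long after its last read, because it is freed only when
# r1 is redefined. Cycle numbers are assumed, for illustration only.

trace = [
    ("write", "r1", 0),   # value produced at cycle 0
    ("read",  "r1", 1),   # last use at cycle 1
    ("write", "r1", 50),  # r1 redefined (old copy freed) at cycle 50
]

last_use = max(c for op, r, c in trace if op == "read" and r == "r1")
freed_at = max(c for op, r, c in trace if op == "write" and r == "r1")
wasted = freed_at - last_use
print(wasted)  # 49 cycles of a dead value occupying a physical register
```

Multiply that waste by several threads and the pooled register file fills with dead values, which is what the software-directed deallocation work in the references targets.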

SMT Tradeoffs

• Unified cache may lead to cache interference between threads.

• Additional stress on the memory hierarchy – L2 cache bandwidth might be insufficient for more than a few threads.

SMT Tradeoffs

• In parallel applications, threads tend to have the same resource needs at the same time – flexibility in allocation may not gain much

• Scheduling overhead in SMT might result in performance degradation if OS doesn’t use all available logical CPUs

• Interference in BTB and branch prediction structures

• The SMT hardware's priority mechanism for selecting which threads to serve may interfere with OS priority mechanisms.

Existing SMT Implementations

• Intel Xeon and Pentium 4 (3 GHz) implement a 2-context SMT processor.
• Compaq designed a 4-context SMT processor (EV-8) – now officially dead.
• Clearwater Networks has an 8-context SMT design.
• Intel implemented 16-context SMT in its network processor IXP1200.
• Sun UltraSparc 5 and IBM Power5 also promise to implement some form of SMT.

SMT in Intel Xeon

Intel Xeon Front-end

Intel Xeon Execution Engine

Performance Results
