Hyper-Threading: Simultaneous Multithreading on Pentium 4
Presented by: Thomas Repantis [email protected]
CS203B - Advanced Computer Architecture, Spring 2004

Overview
Multiple threads executing on a single processor without switching.
1. Threads
2. SMT
3. Hyper-Threading on P4
4. OS and Compiler Support
5. Performance for Different Applications
Threads
• Process: "A task being run by the computer."
• Context: Describes a process's current state of execution (registers, flags, PC, ...).
• Thread: A "light-weight" process (it has its own PC and SP, but shares a single address space and global variables).
• Each process consists of at least one thread.
• Threads allow faster context switching and fine-grained multitasking.
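The shared-address-space point above can be sketched in a few lines of Python (a hypothetical illustration, not part of the original slides): several threads of one process update the same global variable, something separate processes could not do without explicit IPC.

```python
import threading

counter = 0               # global state, shared by all threads of the process
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:        # serialize updates to the shared variable
            counter += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 4000: all four threads wrote the same address space
```

Each thread keeps its own execution state (its own stack and program counter) while the data is common, which is exactly what makes thread context switches cheap.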
Single-Threaded CPU

A lot of bubbles in the instruction issue and in the pipeline!
Single-Threaded SMP
Executing processes are doubled, but bubbles are doubled as well!
Superthreaded CPU

Each issue slot and each pipeline stage can contain instructions of the same thread only.
Hyper-Threaded CPU (SMT)

Instructions of different threads can be scheduled on the same stage.
SMT vs Tera MTA

• Each processor of the Tera MTA has 128 streams, each of which includes a PC and 32 registers.
• Each stream is assigned to a thread.
• Instructions from different streams can be pipelined on the same processor.
• However, in the Tera MTA only a single thread is active on any given cycle.
SMT Benefits
SMT:
• Gives the OS the illusion of several (currently two) logical processors.
• Makes efficient use of resources.
• Overcomes the limited amount of ILP available within a single thread.
• Is implemented by dividing processor resources into replicated, partitioned, and shared ones.
Replicated Resources
Each logical processor has independent:
• Instruction Pointer
• Register Renaming Logic
• Instruction TLB
• Return Stack Predictor
• Advanced Programmable Interrupt Controller
• Other architectural registers
Partitioned Resources

Each logical processor gets exactly half of:
• Re-order buffers (ROBs)
• Load/store buffers
• Several queues (e.g. the scheduling and uop (micro-operation) queues)

Partitioning prevents a logical processor from monopolizing the resources.
Statically Partitioned Queue

Specific positions are assigned to each processor.
Dynamically Partitioned Queue
A limit is imposed on how many positions each processor can use, but no specific positions are assigned.
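The dynamic-partitioning policy can be sketched as follows (a hypothetical Python model, with invented names like `DynPartQueue`, not Intel's implementation): any slot may hold either logical processor's entries, but each logical processor is capped at half the capacity and must stall once its cap is reached.

```python
# Sketch of a dynamically partitioned queue: no slot is reserved for
# either logical processor (LP), but neither LP may occupy more than
# half of the total capacity.
class DynPartQueue:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.limit = capacity // 2      # per-LP occupancy cap
        self.entries = []               # (lp_id, uop) pairs, any positions

    def enqueue(self, lp_id, uop):
        used = sum(1 for lp, _ in self.entries if lp == lp_id)
        if used >= self.limit or len(self.entries) >= self.capacity:
            return False                # this LP must stall: cap reached
        self.entries.append((lp_id, uop))
        return True

q = DynPartQueue(capacity=8)
ok = [q.enqueue(0, f"uop{i}") for i in range(5)]
print(ok)  # [True, True, True, True, False] -- LP0 is capped at 4 slots
```

A statically partitioned queue would instead hard-wire which slot indices each LP may use; the dynamic variant achieves the same anti-monopolization guarantee with more placement freedom.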
Shared Resources

Each logical processor shares SMT-unaware resources:
• Execution units
• Microarchitectural registers (GPRs, FPRs)
• Caches: trace cache, L1, L2, L3

Sharing:
+ enables efficient use of resources, but...
- allows a thread to monopolize a resource (e.g. cache thrashing).
Pentium 4

• 32-bit
• 2.4 to 3.4 GHz clock frequency
• 800 MHz system bus
• 0.13-micron technology
• 8 KB L1 data cache, 12K-uop L1 trace (instruction) cache, 256 KB to 1 MB L2 cache, 2 MB L3 cache
• NetBurst microarchitecture (hyper-pipelined)
• Hyper-Threading technology
Front-End Pipeline

Two cases: (a) trace cache hit, (b) trace cache miss.
Out-Of-Order Execution Engine Pipeline
Implementation Goals Achieved

• Minimal die area cost (less than 5% additional die area).
• A stall of one logical processor does not stall the other (buffering queues sit between pipeline logic blocks).
• When only one thread is running, speed should be the same as without H-T (partitioned resources are dedicated to it).
Single- and Multi-Task Modes
Partitioned resources are dedicated to one of the logical processors when the other is HALTed.
Operating System Optimizations

When the OS schedules threads to logical processors it should:
• HALT an inactive logical processor, to avoid wasting resources on idle loops (continuously checking for available work).
• Schedule threads to logical processors on different physical processors instead of the same one (when possible), to avoid contending for the same physical execution resources.
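The second policy above can be sketched as a placement function (a hypothetical Python model; `pick_logical_cpu`, the topology map, and the load counts are all invented for illustration): given which physical core each logical CPU belongs to, a new thread goes to a logical CPU on the least-loaded physical core.

```python
# Sketch of the "spread across physical processors" policy: place each
# new thread on a logical CPU whose physical core currently runs the
# fewest threads, so threads avoid sharing execution resources.
def pick_logical_cpu(topology, load):
    # topology: {logical_cpu: physical_core}; load: {physical_core: count}
    return min(topology, key=lambda cpu: load[topology[cpu]])

topology = {0: 0, 1: 0, 2: 1, 3: 1}   # two cores, two logical CPUs each
load = {0: 1, 1: 0}                   # core 0 already runs one thread
print(pick_logical_cpu(topology, load))  # 2: a logical CPU on the idle core 1
```

Only once every physical core is busy does it pay to double threads up on sibling logical CPUs of the same core.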
OS Optimizations
The Linux kernel (2.6 series) distinguishes between logical and physical processors:
• H-T-aware passive and active load-balancing
• H-T-aware task pickup
• H-T-aware affinity
• H-T-aware wakeup
Compiler Optimizations
Intel 8.0 C++ and FORTRAN compilers:
Automatic optimizations:
• Vectorization
• Advanced instruction selection

Programmer-controlled optimizations:
• Insertion of Streaming-SIMD-Extensions 3 (SSE3) instructions
• Insertion of OpenMP directives
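OpenMP directives themselves are C/C++/Fortran pragmas; as a rough Python analogue (a sketch, not the Intel compiler's mechanism), the snippet below splits a data-parallel loop across a thread pool the way `#pragma omp parallel for` splits loop iterations across threads.

```python
# Rough analogue of an OpenMP parallel-for: loop iterations are divided
# among worker threads, which H-T can then run on the two logical CPUs.
from concurrent.futures import ThreadPoolExecutor

def scale(x):
    return 2 * x

data = list(range(8))
with ThreadPoolExecutor(max_workers=2) as pool:   # e.g. two logical processors
    result = list(pool.map(scale, data))          # order of results preserved

print(result)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

The point of such directives is that the programmer marks independent iterations and the compiler/runtime generates the multiple threads that Hyper-Threading needs to show a benefit.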
Performance gain from automatic optimizations

SPEC CPU 2000 shows significant speedups not only from the H-T-specific (QxP) optimizations but also from the general P4 (QxN) ones.
Performance gain from manual optimizations
SPEC OMPM 2001 shows speedup achieved by automatic optimizations in combination with OpenMP directives.
Thread-level Parallelism of Desktop Applications

• Unlike server workloads, interactive desktop applications focus on response time rather than end-to-end throughput.
• The measured average response-time improvement of a dual-processor over a uniprocessor system was 22%.
• The application programmer has to exploit multi-threading explicitly.
• More than 2 processors yield no great improvement.
Performance in Client-Server Applications

While H-T offers no gain, or even a slight degradation, for API-call and user-application workloads, it achieves considerable speedups in multi-threaded workloads.
Performance in File Server Workloads

Good speedups in multi-threaded workloads, whether they combine filesystem and socket calls or make socket calls only.
Performance in Online Transaction Processing

A 21% performance gain in both the 1- and 2-processor cases.
Performance in Web Serving
16 to 28% performance gain.
Conclusions

• Hyper-Threading enables thread-level parallelism by duplicating the architectural state of the processor, while sharing one set of processor execution resources.
• When scheduling threads, the OS sees two logical processors.
• While not providing the performance achieved by adding a second physical processor, Hyper-Threading can offer up to a 30% improvement.
• Resource contention limits the performance benefits for certain applications.
• Performance gains are evident in multi-threaded workloads, which are usually found in servers.

References
1. D. Marr et al., "Hyper-Threading Technology Architecture and Microarchitecture", Intel Technology Journal, Volume 06, Issue 01, 2002.
2. D. Tullsen et al., "Simultaneous Multithreading: Maximizing On-Chip Parallelism", ISCA, 1995.
3. J. Stokes, "Introduction to Multithreading, Superthreading and Hyperthreading", Ars Technica, 2002.
4. K. Smith et al., "Support for the Intel Pentium 4 Processor with Hyper-Threading Technology in Intel 8.0 Compilers", Intel Technology Journal, Volume 08, Issue 01, 2004.
5. D. Vianney, "Hyper-Threading speeds Linux", IBM developerWorks, 2003.
6. J. Hennessy, D. Patterson, "Computer Architecture: A Quantitative Approach", 3rd Edition, pp. 608-615, 2003.
7. "Hyper-Threading Technology on the Intel Xeon Processor Family for Servers", Intel White Paper, 2004.
8. K. Flautner et al., "Thread-level Parallelism and Interactive Performance of Desktop Applications", ASPLOS, 2000.
9. L. Carter et al., "Performance and Programming Experience on the Tera MTA", SIAM Conference on Parallel Processing, 1999.

Thank you!
Questions/Comments?