Hyper-Threading: Simultaneous Multithreading on the Pentium 4

Presented by: Thomas Repantis

CS203B-Advanced Computer Architecture, Spring 2004

Overview

Multiple threads executing on a single processor without switching.

1. Threads
2. SMT
3. Hyper-Threading on P4
4. OS and Compiler Support
5. Performance for Different Applications

Threads

• Process: “A task being run by the computer.”
• Context: Describes a process’s current state of execution (registers, flags, PC...).
• Thread: A “light-weight” process (has its own PC and SP, but shares a single address space and global variables).
• Each process consists of at least one thread.
• Threads allow faster context-switching and fine-grain multitasking.
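
The shared address space can be illustrated with a minimal Python sketch. The names `counter`, `worker`, and `run` are hypothetical, chosen only for this example: each thread keeps its own local state, while all threads update one global variable.

```python
import threading

counter = 0              # global variable: one copy, visible to every thread
lock = threading.Lock()  # protects updates to the shared counter


def worker(increments):
    """Count privately, then publish the result to the shared global."""
    global counter
    local_total = 0      # local variable: private to this thread's stack
    for _ in range(increments):
        local_total += 1
    with lock:
        counter += local_total


def run(num_threads=4, increments=1000):
    """Start the threads, wait for them, and return the shared total."""
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(increments,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Calling `run(4, 1000)` returns the sum contributed by all four threads, because they all write to the same `counter`.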

Single-Threaded CPU

A lot of bubbles in the instruction issue and in the pipeline!

Single-Threaded SMP

Executing processes are doubled, but bubbles are doubled as well!

Superthreaded CPU

Each issue and each pipeline stage can contain instructions of the same thread only.

Hyper-Threaded CPU (SMT)

Instructions of different threads can be scheduled on the same stage.
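
The difference between the superthreaded and the SMT issue policy can be sketched with a toy model. This is only an illustration under simplifying assumptions: `ready[t][c]` is a made-up count of instructions thread `t` could issue in cycle `c`, instructions that are not issued do not carry over to the next cycle, and the superthreaded CPU simply alternates threads each cycle. The function names are hypothetical.

```python
def superthreaded_issue(ready, width):
    """One thread owns all issue slots in a given cycle (round-robin)."""
    issued = 0
    num_threads = len(ready)
    for cycle in range(len(ready[0])):
        owner = cycle % num_threads          # only this thread may issue
        issued += min(ready[owner][cycle], width)
    return issued


def smt_issue(ready, width):
    """Slots of one cycle may be filled with instructions of any thread."""
    issued = 0
    for cycle in range(len(ready[0])):
        free = width
        for per_thread in ready:
            take = min(per_thread[cycle], free)
            issued += take
            free -= take
    return issued


# Two threads, four cycles, a 4-wide issue stage (16 slots in total).
ready = [[2, 0, 1, 3],
         [1, 2, 2, 0]]
print(superthreaded_issue(ready, 4), "of 16 slots used when superthreaded")
print(smt_issue(ready, 4), "of 16 slots used with SMT")
```

In this model SMT fills slots that the owning thread would otherwise leave empty, which is the source of its better utilization.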

SMT vs Tera MTA

• Each processor of the Tera MTA has 128 streams, each of which includes a PC and 32 registers.
• Each stream is assigned to a thread.
• Instructions from different streams can be pipelined on the same processor.
• However, in the Tera MTA only a single thread is active on any given cycle.

SMT Benefits

SMT:
• Gives the OS the illusion of several (currently two) logical processors.
• Makes efficient use of resources.
• Overcomes the barrier of the limited amount of ILP within just one thread.
• Is implemented by dividing processor resources into replicated, partitioned, and shared.

Replicated Resources

Each logical processor has independent:
• Instruction Pointer
• Register Renaming Logic
• Instruction TLB
• Return Stack Predictor
• Advanced Programmable Interrupt Controller
• Other architectural registers

Partitioned Resources

Each logical processor gets exactly half of:
• Re-order buffers (ROBs)
• Load/Store buffers
• Several queues (e.g. scheduling, uop (micro-operations))

Partitioning prevents a logical processor from monopolizing the resources.

Statically Partitioned Queue

Specific positions are assigned to each processor.

Dynamically Partitioned Queue

A limit is imposed on the number of positions each processor can use, but no specific positions are assigned.
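
A dynamically partitioned queue can be sketched in software as follows. This is a minimal model, not the hardware design: the class name, the method names, and the equal per-processor limit (`capacity // num_logical`) are assumptions made for the example.

```python
from collections import deque


class DynamicallyPartitionedQueue:
    """Shared FIFO with a per-logical-processor cap on occupied entries."""

    def __init__(self, capacity, num_logical=2):
        self.limit = capacity // num_logical   # assumed: equal caps
        self.entries = deque()                 # no position is reserved
        self.used = [0] * num_logical          # entries held per processor

    def try_push(self, logical_id, item):
        """Insert an item unless this processor has reached its limit."""
        if self.used[logical_id] >= self.limit:
            return False                       # this processor must stall
        self.entries.append((logical_id, item))
        self.used[logical_id] += 1
        return True

    def pop(self):
        """Remove the oldest entry, whichever processor it belongs to."""
        logical_id, item = self.entries.popleft()
        self.used[logical_id] -= 1
        return logical_id, item
```

A statically partitioned queue would instead reserve fixed slots for each processor; here only the count is limited, so a stalled processor cannot fill the whole queue but entries may sit in any position.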

Shared Resources

Each logical processor shares SMT-unaware resources:
• Execution Units
• Microarchitectural registers (GPRs, FPRs)
• Caches: trace cache, L1, L2, L3

Sharing:
+ Enables efficient use of resources, but...
- Allows a thread to monopolize a resource (e.g. cache thrashing).


• 32-bit
• 2.4 to 3.4 GHz clock frequency
• 800 MHz system bus
• 0.13-micron technology
• 8KB L1 data cache, 12KB L1 instruction cache, 256KB to 1MB L2 cache, 2MB L3 cache
• NetBurst microarchitecture (hyper-pipelined)
• Hyper-Threading technology

Front-End Pipeline

(a) Trace Cache Hit

(b) Trace Cache Miss

Out-Of-Order Execution Engine Pipeline

Implementation Goals Achieved

• Minimal die area cost (less than 5% more die area).
• Stall of one logical processor does not stall the other (buffering queues between pipeline logic blocks).
• When only one thread is running, speed should be the same as without H-T (partitioned resources are dedicated to it).

Single- and Multi-Task Modes

Partitioned resources are dedicated to one of the logical processors when the other is HALTed.
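
The mode switch can be expressed as a small allocation rule. The sketch below assumes exactly two logical processors and an illustrative buffer size; the function name `partition_shares` and the figure of 128 entries are hypothetical.

```python
def partition_shares(total_entries, active):
    """Return the entries each of two logical processors may use.

    active is a pair of booleans; False means that processor is HALTed.
    """
    running = sum(active)
    if running == 0:
        return [0, 0]                      # both halted: nothing allocated
    if running == 2:
        half = total_entries // 2          # multi-task mode: half each
        return [half, half]
    # Single-task mode: the running processor gets the whole resource.
    return [total_entries if is_active else 0 for is_active in active]


print(partition_shares(128, (True, True)))    # multi-task mode
print(partition_shares(128, (True, False)))   # single-task mode
```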

Operating System Optimizations

When the OS schedules threads to logical processors it should:
• HALT an inactive logical processor, to avoid wasting resources on idle loops (continuously checking for available work).
• Schedule threads to logical processors on different physical processors rather than on the same one (when possible), to avoid competing for the same physical execution resources.
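
The second rule can be sketched as a placement function. This is a simplified illustration, not how any particular OS scheduler is written: `topology` is an assumed mapping from logical CPU number to physical package number, and `pick_logical_cpu` is a hypothetical name.

```python
def pick_logical_cpu(topology, busy):
    """Choose an idle logical CPU, preferring a fully idle physical package.

    topology: dict mapping logical CPU id -> physical package id
    busy:     set of logical CPU ids that are currently running a thread
    """
    busy_packages = {topology[cpu] for cpu in busy}
    idle = [cpu for cpu in sorted(topology) if cpu not in busy]
    for cpu in idle:
        if topology[cpu] not in busy_packages:
            return cpu                 # sibling is idle too: no contention
    return idle[0] if idle else None   # fall back to sharing a package


# Two physical processors, each with two logical processors.
topology = {0: 0, 1: 0, 2: 1, 3: 1}
print(pick_logical_cpu(topology, busy={0}))      # prefers the other package
print(pick_logical_cpu(topology, busy={0, 2}))   # must share a package
```

With one thread already on logical CPU 0, the new thread goes to the second package; only when both packages are in use does it share execution resources with a sibling.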

OS Optimizations

The Linux kernel (2.6 series) distinguishes between logical and physical processors:

• H-T-aware passive and active load-balancing
• H-T-aware task pickup
• H-T-aware affinity
• H-T-aware wakeup

Compiler Optimizations

Intel 8.0 C++ and FORTRAN compilers:

Automatic optimizations:
• Vectorization
• Advanced instruction selection

Programmer-controlled optimizations:
• Insertion of Streaming-SIMD-Extensions 3 (SSE3) instructions
• Insertion of OpenMP directives

Performance gain from automatic optimizations

SPEC CPU 2000 shows significant speedup not only from H-T-specific (QxP) optimizations but also from general P4 (QxN) optimizations.

Performance gain from manual optimizations

SPEC OMPM 2001 shows speedup achieved by automatic optimizations in combination with OpenMP directives.

Thread-level Parallelism of Desktop Applications

• Unlike server workloads, interactive desktop applications focus on response time and not on end-to-end throughput.
• Average response time improved by 22% on a dual-processor versus a uniprocessor system.
• The application programmer has to exploit multi-threading.
• More than 2 processors yield no great improvements.

Performance in Client-Server Applications

While H-T offers no gain or degradation in API calls and user application workloads, it achieves considerable speedups in multi-threaded workloads.

Performance in File Server Workloads

Good speedups in multi-threaded workloads, whether filesystem and socket calls, or just socket calls.

Performance in Online Transaction Processing

21% performance gain in the case of 1 and 2 processors.

Performance in Web Serving

16 to 28% performance gain.

Conclusions

• Hyper-Threading enables thread-level parallelism by duplicating the architectural state of the processor, while sharing one set of processor execution resources.
• When scheduling threads, the OS sees two logical processors.
• While not providing the performance achieved by adding a second processor, Hyper-Threading can offer a 30% improvement.
• Resource contention limits the performance benefits for certain applications.
• Performance gains are evident in multi-threaded workloads, which are usually found in servers.

Thank you!

Questions/Comments?
