Hyper-Threading Technology Architecture and Microarchitecture

Hyper-Threading Technology Architecture and Microarchitecture Deborah T. Marr, Desktop Products Group, Intel Corp. Frank Binns, Desktop ProductsGroup, Intel Corp. David L. Hill, Desktop Products Group, Intel Corp. Glenn Hinton, Desktop Products Group, Intel Corp. David A. Koufaty, Desktop Products Group, Intel Corp. J. Alan Miller, Desktop Products Group, Intel Corp. Michael Upton, CPU Architecture, Desktop Products Group, Intel Corp. Index words: architecture, microarchitecture, Hyper-Threading Technology, simultaneous multi- threading, multiprocessor INTRODUCTION ABSTRACT The amazing growth of the Internet and telecommunications is powered by ever-faster systems Intel’s Hyper-Threading Technology brings the concept demanding increasingly higher levels of processor of simultaneous multi-threading to the Intel performance. To keep up with this demand we cannot Architecture. Hyper-Threading Technology makes a rely entirely on traditional approaches to processor single physical processor appear as two logical design. Microarchitecture techniques used to achieve processors; the physical execution resources are shared past processor performance improvement–super- and the architecture state is duplicated for the two pipelining, branch prediction, super-scalar execution, logical processors. From a software or architecture out-of-order execution, caches–have made perspective, this means operating systems and user microprocessors increasingly more complex, have more programs can schedule processes or threads to logical transistors, and consume more power. In fact, transistor processors as they would on multiple physical counts and power are increasing at rates greater than processors. From a microarchitecture perspective, this processor performance. Processor architects are means that instructions from both logical processors therefore looking for ways to improve performance at a will persist and execute simultaneously on shared greater rate than transistor counts and power execution resources. dissipation. Intel’s Hyper-Threading Technology is one This paper describes the Hyper-Threading Technology solution. architecture, and discusses the microarchitecture details of Intel's first implementation on the Intel Xeon Processor Microarchitecture processor family. Hyper-Threading Technology is an Traditional approaches to processor design have important addition to Intel’s enterprise product line and focused on higher clock speeds, instruction-level will be integrated into a wide variety of products. parallelism (ILP), and caches. Techniques to achieve higher clock speeds involve pipelining the microarchitecture to finer granularities, also called super-pipelining. Higher clock frequencies can greatly improve performance by increasing the number of instructions that can be executed each second. Because Intel is a registered trademark of Intel Corporation or there will be far more instructions in-flight in a super- its subsidiaries in the United States and other countries. pipelined microarchitecture, handling of events that Xeon is a trademark of Intel Corporation or its disrupt the pipeline, e.g., cache misses, interrupts and subsidiaries in the United States and other countries. branch mispredictions, can be costly. Hyper-Threading Technology Architecture and Microarchitecture 1 Intel Technology Journal Q1, 2002 ILP refers to techniques to increase the number of 25 instructions executed each clock cycle. For example, a Power super-scalar processor has multiple parallel execution 20 Die Size units that can process instructions simultaneously. With SPECInt Perf super-scalar execution, several instructions can be 15 executed each clock cycle. However, with simple in- order execution, it is not enough to simply have multiple 10 execution units. The challenge is to find enough 5 instructions to execute. One technique is out-of-order execution where a large window of instructions is 0 simultaneously evaluated and sent to execution units, i486 Pentium(TM) Pentium(TM) 3 Pentium(TM) 4 Processor Processor Processor based on instruction dependencies rather than program order. Figure 1: Single-stream performance vs. cost Accesses to DRAM memory are slow compared to Figure 1 shows the relative increase in performance and execution speeds of the processor. One technique to the costs, such as die size and power, over the last ten reduce this latency is to add fast caches close to the years on Intel processors1. In order to isolate the processor. Caches can provide fast memory access to microarchitecture impact, this comparison assumes that frequently accessed data or instructions. However, the four generations of processors are on the same caches can only be fast when they are small. For this silicon process technology and that the speed-ups are reason, processors often are designed with a cache normalized to the performance of an Intel486 hierarchy in which fast, small caches are located and processor. Although we use Intel’s processor history in operated at access latencies very close to that of the this example, other high-performance processor processor core, and progressively larger caches, which manufacturers during this time period would have handle less frequently accessed data or instructions, are similar trends. Intel’s processor performance, due to implemented with longer access latencies. However, microarchitecture advances alone, has improved integer there will always be times when the data needed will not performance five- or six-fold1. Most integer be in any processor cache. Handling such cache misses applications have limited ILP and the instruction flow requires accessing memory, and the processor is likely can be hard to predict. to quickly run out of instructions to execute before Over the same period, the relative die size has gone up stalling on the cache miss. fifteen-fold, a three-times-higher rate than the gains in The vast majority of techniques to improve processor integer performance. Fortunately, advances in silicon performance from one generation to the next is complex process technology allow more transistors to be packed and often adds significant die-size and power costs. into a given amount of die area so that the actual These techniques increase performance but not with measured die size of each generation microarchitecture 100% efficiency; i.e., doubling the number of execution has not increased significantly. units in a processor does not double the performance of The relative power increased almost eighteen-fold the processor, due to limited parallelism in instruction 1 during this period . Fortunately, there exist a number of flows. Similarly, simply doubling the clock rate does known techniques to significantly reduce power not double the performance due to the number of consumption on processors and there is much on-going processor cycles lost to branch mispredictions. research in this area. However, current processor power dissipation is at the limit of what can be easily dealt with in desktop platforms and we must put greater emphasis on improving performance in conjunction with new technology, specifically to control power. 1 These data are approximate and are intended only to show trends, not actual performance. Intel486 is a trademark of Intel Corporation or its subsidiaries in the United States and other countries. Hyper-Threading Technology Architecture and Microarchitecture 2 Intel Technology Journal Q1, 2002 Thread-Level Parallelism event multi-threading techniques do not achieve optimal overlap of many sources of inefficient resource usage, such as branch mispredictions, instruction A look at today’s software trends reveals that server dependencies, etc. applications consist of multiple threads or processes that can be executed in parallel. On-line transaction Finally, there is simultaneous multi-threading, where processing and Web services have an abundance of multiple threads can execute on a single processor software threads that can be executed simultaneously without switching. The threads execute simultaneously for faster performance. Even desktop applications are and make much better use of the resources. This becoming increasingly parallel. Intel architects have approach makes the most effective use of processor been trying to leverage this so-called thread-level resources: it maximizes the performance vs. transistor parallelism (TLP) to gain a better performance vs. count and power consumption. transistor count and power ratio. Hyper-Threading Technology brings the simultaneous In both the high-end and mid-range server markets, multi-threading approach to the Intel architecture. In multiprocessors have been commonly used to get more this paper we discuss the architecture and the first performance from the system. By adding more implementation of Hyper-Threading Technology on the processors, applications potentially get substantial performance improvement by executing multiple Intel Xeon processor family. threads on multiple processors at the same time. These threads might be from the same application, from HYPER-THREADING TECHNOLOGY different applications running simultaneously, from ARCHITECTURE operating system services, or from operating system Hyper-Threading Technology makes a single physical threads doing background maintenance. Multiprocessor processor appear as multiple logical processors [11, 12]. systems have been used for many years, and high-end To do this, there is one copy of the architecture state for programmers are familiar with the techniques to exploit each logical

Hyper-Threading Technology Architecture and Microarchitecture

Parallel Programming

Real-Time Performance During CUDA™ a Demonstration and Analysis of Redhawk™ CUDA RT Optimizations

Unit: 4 Processes and Threads in Distributed Systems

Gpu Concurrency

Scheduling Many-Task Workloads on Supercomputers: Dealing with Trailing Tasks

Intel's Hyper-Threading

COSC 6385 Computer Architecture - Multi-Processors (IV) Simultaneous Multi-Threading and Multi-Core Processors Edgar Gabriel Spring 2011

IMTEC-91-58 High-Performance Computing: Industry Uses Of

Introduction to Parallel Computing

Computer-System Architecture (Cont.) Symmetrically Constructed Clusters (Cont.) Advantages: 1

Lect. 9: Multithreading

IV. Parallel Operating Systems