<<

Simultaneous Multithreading: Exploiting Instruction-Level and Thread-Level Parallelism in Microprocessors

Jeffrey Butera
Senior Thesis in Computer Science
Advised by Dave Wonnacott
Haverford College
April 25, 2011

Abstract

Increasing the performance of microprocessors has always been a pressing issue in the fields of Computer Science and Computer Engineering. The first processors functioned by executing a single stream of sequential instructions. The initial approach to increasing the speed of instruction execution involved using smaller and faster transistors so that the clock frequency could be increased. This approach was stymied by slow memory accesses that could not keep up with a fast clock cycle. Cache memories were developed to keep frequently accessed data closer to the CPU to prevent these types of stalls. RISC (reduced instruction set computing) architectures were also designed to allow for faster clock cycles and execution. After RISC, the focus shifted away from speeding up the execution of a single instruction and moved towards increasing the number of instructions that can be active in a single clock cycle. Different designs, such as pipelining and superscalars, allowed multiple instructions to be issued in the same clock cycle, but were limited by increased power consumption, heat, and chip size. Multi-core and multiprocessor designs allowed multiple threads of instructions to be simultaneously executed on different processors or cores. These designs were improved by the development of multithreading, which aimed to increase thread-level parallelism on a single core. This thesis addresses the intricacies of a specific type of hardware multithreading: Simultaneous Multithreading (SMT). SMT is a technique used to increase the performance of a microprocessor by exploiting parallelism in all available forms. SMT combines instruction-level and thread-level parallelism and enables multiple instructions from multiple threads to be issued in the same clock cycle. This paper is a detailed examination of Simultaneous Multithreading and will specifically address: how multiple instructions are fetched and issued simultaneously, how hardware resources are designated amongst threads, the performance upgrade associated with SMT in terms of instructions per cycle, the energy-efficiency of the design, and how SMT is utilized in general purpose microprocessors such as the Intel Pentium 4 Hyper-Threading processor.

Contents

1 Background

2 Exploiting Instruction-Level Parallelism

3 Maximizing Thread-Level Parallelism

4 Increasing ILP and TLP

5 How SMT Works
5.1 The Instruction Fetch Unit
5.1.1 Thread Selection
5.1.2 Instruction Selection
5.1.3 Clearing Up the Fetch Unit
5.2 Resource Sharing

6 SMT Performance

7 The Energy-Efficiency of SMT

8 Intel Hyper-Threading

9 Conclusion

1 Background

A microprocessor, or central processing unit (CPU), is the brain of a computer. Much like a human brain, it processes information and sends out instructions to tell the different parts of the computer what to do. The operation of a microprocessor can be broken down into four major steps. First, an instruction is fetched from memory. Second, the instruction is decoded so that the CPU can be told what to do. Next, the instruction is executed by the hardware, and finally the result of the execution is written back to some form of memory.

In the 1970s, the first microprocessors contained a single processor that could execute a single stream of instructions in a sequential fashion. The first approaches to speeding up a processor involved creating smaller and faster transistors so that the clock speed could be increased. Eventually, the clock speed could no longer be increased because slow memory accesses held back performance gains. Cache memories were developed to keep frequently accessed data in smaller, faster memories closer to the CPU. If, during the execution of an instruction, the CPU needs to access data that is stored in memory, it first checks the cache for the data. If the data is found, a cache hit, then it simply reads the data. If that data is not in the cache, a cache miss, then the data must be fetched from the main memory. The processor stalls while waiting for the data to be fetched from memory, increasing the time for the instruction's execution.

The next approach to increasing processor performance dealt with the instructions themselves. The reduced instruction set computing (RISC) strategy was developed to simplify instructions. The idea behind RISC was to eliminate the complex instructions that may take multiple, longer clock cycles to complete. By converting a complex instruction into a sequence of simpler instructions that could be executed faster, the clock rate could be increased and the same result could be achieved faster. The development of RISC-based architectures influenced designers to come up with more ways to exploit instruction-level parallelism to increase performance.
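The cache behavior described above can be sketched in a few lines of Python. This is only an illustration: the direct-mapped organization, line size, and latency figures are assumptions chosen to show how a hit and a miss differ in cost, not parameters of any particular processor.

```python
# Minimal sketch of a direct-mapped cache lookup (illustrative values only).
CACHE_LINES = 64          # number of lines in the cache (assumed)
LINE_SIZE   = 32          # bytes per line (assumed)
HIT_CYCLES  = 1           # latency of a cache hit (assumed)
MISS_CYCLES = 50          # latency of fetching from main memory (assumed)

cache_tags = [None] * CACHE_LINES   # tag currently stored in each cache line

def access(address):
    """Return the number of cycles this memory access costs."""
    line = (address // LINE_SIZE) % CACHE_LINES   # which cache line the address maps to
    tag  = address // (LINE_SIZE * CACHE_LINES)   # remaining high-order bits
    if cache_tags[line] == tag:                   # cache hit: data is already close to the CPU
        return HIT_CYCLES
    cache_tags[line] = tag                        # cache miss: fetch the line from main memory
    return MISS_CYCLES

# Repeated accesses to the same line hit after the first miss.
print(access(0x1000), access(0x1004))   # 50 (miss), then 1 (hit)
```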

2 Exploiting Instruction-Level Parallelism

The first major attempt to maximize instruction-level parallelism (ILP) was a technique called pipelining. Pipelining breaks down a processor into multiple stages and creates a pipeline that instructions pass through. This pipeline functions much like an assembly line. An instruction enters at one end, passes through the different stages of the pipe, and exits at the other end.

The pipeline allows for multiple instructions to flow through the pipe at the same time. Different instructions can be executed in parallel from each stage of the pipeline, thus increasing the throughput, or number of instructions leaving the pipeline per unit time [4]. Most pipelined processors are broken down into five stages: Fetch, Decode, Execute, Memory, and Writeback. This means that five instructions can execute at the same time, one from each stage of the pipeline. Also, each stage of the pipeline only has to deal with 1/5 of the processor's logic, making the stages faster than full instructions and allowing the clock frequency to be increased [3]. A pipelined architecture must be aware of the length of each pipeline stage, because the clock rate is dependent on the slowest stage of the pipeline. That is to say, no instruction can move to the next stage of the pipeline until all instructions have finished at their current stage.

The next development in increasing instruction-level parallelism was superscalar processors. Superscalars achieve ILP and execute multiple instructions per cycle by having multiple copies of hardware datapaths. A superscalar that has two datapaths can fetch two instructions simultaneously from memory. This means that the processor must also have double the logic to fetch and decode two instructions at the same time [3]. Superscalars also utilize pipelining to fully maximize parallelism. A sample execution of a superscalar processor can be seen in Figure 1a. The superscalar is executing a single program and each box represents a potential issue slot for an instruction. This particular superscalar has four datapaths and attempts to fill these issue slots each processor cycle to achieve ILP.

Further advances in pipelining and superscalars are limited. In the case of pipelining, increasing the clock rate means an increase in power consumption and heat. Superscalars are limited by their width, or number of datapaths. As the width increases, the amount of hardware needed on the chip grows as well. More hardware means bigger chips and more power consumed. These issues make it impractical to rely solely on instruction-level parallelism to increase processor performance, and they have become the driving force behind the shift to increasing thread-level parallelism.
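The assembly-line behavior of the five-stage pipeline can be made concrete with a small sketch. It assumes an idealized pipeline in which every stage takes exactly one cycle and there are no hazards or stalls, which real hardware must of course handle.

```python
# Idealized 5-stage pipeline: one instruction enters per cycle, no stalls or hazards.
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Writeback"]

def pipeline_schedule(instructions):
    """Return, for each instruction, the cycle in which it occupies each stage."""
    schedule = {}
    for i, instr in enumerate(instructions):
        # Instruction i enters Fetch in cycle i and advances one stage per cycle.
        schedule[instr] = {stage: i + s for s, stage in enumerate(STAGES)}
    return schedule

for instr, timing in pipeline_schedule(["add", "load", "store"]).items():
    print(instr, timing)
# In cycle 2, "add" is in Execute while "load" is in Decode and "store" is being fetched:
# several instructions are in flight at once even though each still takes five cycles.
```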

3 Maximizing Thread-Level Parallelism

A second form of parallelism, thread-level parallelism (TLP), arose to make use of multiprocessor and multi-core designs. A multiprocessor exploits TLP by connecting two or more processors. Rather than overlapping individual instructions from one thread (ILP), multiple streams of instructions are simultaneously executed on the different processors [4].

Figure 1: Sample execution sequence for (a) a superscalar processor, (b) a multithreaded superscalar processor, and (c) an SMT processor [1].

In a multi-core architecture, multiple processor cores are situated on a single chip. The core of the processor is regarded as the unit that actually interprets and executes the instructions. Each core usually has its own cache, but shares lower-level caches and the input/output buses [4]. Multi-core designs achieve TLP by assigning each core its own stream of instructions to execute.

A technique called multithreading takes thread-level parallelism a step further by increasing TLP in a single core. A core with multithreading capabilities is assigned multiple threads from which it can execute instructions. While one thread is waiting for a response from memory, the processor can switch threads and execute another instruction. Each thread also has its own program counter and set of registers to hold its architectural state so that the processor can pick up where it left off when it switches back to that thread. This ensures that more than one thread can be active at one time [3].

The basic idea of enhancing TLP through the use of multithreading can be explained with a simple example. Imagine enhancing the quality of a picture using some sort of image processing software.

A single-core, non-multithreaded processor would work on enhancing the picture, pixel by pixel, from left to right and top to bottom. Multithreading can be used to speed up this process, but the software orchestrating the image enhancement must be written to support multiple threads. If a software program was written to support four threads, then each thread could handle enhancing a quarter of the image. A quad-core processor running this software could enhance the image up to four times faster than the single-core processor by sending one thread of instructions to each core of the processor. A single-core processor that supports multithreading would also experience performance gains. In this case, the processor would begin enhancing the first quarter of the image until a stall is encountered. This stall could be caused by a cache miss or because some needed hardware unit is already tied up in a computation. Instead of waiting for this stall to be resolved, the processor can switch threads and begin enhancing the second quarter of the image. This switching between threads allows for maximum processor utilization and a significant increase in performance. In most cases a multi-core processor outperforms a multithreaded, single-core processor because each thread has its own processor core, but multi-core processors require more hardware, which leads to more power consumption and heat dissipation.

Multithreading is a technique that purely improves thread-level parallelism; it does not increase the number of instructions that can be executed from an individual thread. Whereas a single thread of instructions may leave processor resources unused, multithreading allows for the use of more resources, if threads are ready to run, and therefore increases the overall throughput of a processor. Thread-level parallelism can be maximized by combining multithreading with multi-core or multiprocessor architectures.
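A rough sketch of the four-thread image-enhancement example follows. The image buffer and the `enhance_region` function are placeholders invented for illustration; note also that in CPython the global interpreter lock would keep this particular loop from running truly in parallel, so the sketch only shows how the work would be divided among four threads (or cores).

```python
# Sketch of the four-thread image-enhancement example; the image and the
# enhancement step are stand-ins, not a real image-processing library.
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 640, 480
image = [[0] * WIDTH for _ in range(HEIGHT)]   # stand-in for real pixel data

def enhance_region(rows):
    # Hypothetical per-pixel "enhancement"; a real program would do the image work here.
    for y in rows:
        for x in range(WIDTH):
            image[y][x] += 1

# Split the rows into four quarters, one per thread.
quarters = [range(i * HEIGHT // 4, (i + 1) * HEIGHT // 4) for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(enhance_region, quarters))   # each thread enhances one quarter of the image
```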

4 Increasing ILP and TLP

It is possible to increase both ILP and TLP by combining multithreading with a superscalar processor. This architecture allows for multiple instructions from a single thread to be issued in the same processor cycle. As illustrated in Figure 1b, multiple instructions are issued from a thread and, while those instructions are being executed, the processor can switch threads and issue more instructions. This works in a round-robin fashion; the processor cycles through the threads and looks for multiple instructions to issue each cycle.

My thesis addresses a specific type of multithreading, Simultaneous Multithreading (SMT), that takes the idea of a multithreaded superscalar processor one step further. SMT attempts to increase parallelism in all forms by combining the ILP experienced by pipelined, superscalar architectures with

the TLP of multithreading. As previously stated, a multithreaded superscalar can issue multiple instructions each clock cycle from only one thread. SMT allows a processor to issue multiple instructions from multiple threads each clock cycle. Figure 1c illustrates the parallelism of SMT. It can be seen that multiple instructions from different threads are issued in the same processor cycle. In this paper I will explain the intricacies of SMT and address the following questions:

• How are multiple instructions fetched and issued simultaneously?

• How are the processor’s resources designated amongst the threads?

• How is SMT implemented on top of an existing processor?

• How much of a performance upgrade is provided by SMT?

• Is SMT an energy-efficient design?

• Are there commercial processors that embrace and utilize SMT?

For the purposes of explaining how an SMT processor works, I am going to base my discussion on the SMT processor proposed by Dean Tullsen, Susan Eggers, and Henry Levy in 1996. This SMT model is based on the Alpha 21264 and MIPS superscalar processors [1, 7].

5 How SMT Works

The most effective way to explain how SMT achieves increased parallelism and outperforms other processor models is to expand on the previous example illustrated in Figure 1. In Figure 1, the different rows represent the potential issue slots of a processor execution sequence. The filled boxes represent an instruction that is ready to be executed, while the empty boxes indicate an unused slot. These unused slots are categorized as horizontal and vertical waste. Horizontal waste is created when some issue slots in a specific processor cycle are not used. Vertical waste occurs when no instructions are issued in a given cycle, often due to a long-latency instruction that induces a stall [1].

A superscalar processor (Figure 1a) can succumb to both horizontal and vertical waste, meaning that it does not achieve great instruction-level parallelism. A multithreaded superscalar processor (Figure 1b) decreases vertical waste by allowing the processor to switch threads rather than stall, but it does not decrease horizontal waste. SMT (Figure 1c) effectively decreases

both vertical and horizontal waste by allowing the processor to select any instruction from any thread that is ready to issue an instruction. Waste is decreased by increasing parallelism. If one thread has high instruction-level parallelism, i.e. it is ready to issue multiple instructions, then SMT allows that thread to execute all of its instructions in one cycle. If multiple threads have low instruction-level parallelism, i.e. they can only issue one instruction, then one instruction from each thread can be executed in the same cycle [1].

The basic hardware unit for SMT is a superscalar processor with the ability to issue eight instructions per cycle. Only minimal hardware extensions are needed to implement SMT on top of this design. The main hardware additions are: program counters for each thread, separate return stacks for each thread, and a larger register file that can support all threads and handle register renaming. Enlarging the register file makes accessing it much slower. To compensate for this, two extra stages are added to the pipeline: one to read values into a buffer closer to the functional units, and another to send the data to be executed. The same method is used to write to the register file. Further discussion of hardware additions and the ramifications of adding these extra stages to the pipeline can be found in Tullsen, et al. [7], and is out of the scope of this paper.

When implementing SMT, there are many models that can be used. The first model, Full Simultaneous Issue, allows eight threads to compete for the eight issue slots available each cycle. This model embraces the theoretical potential of SMT but requires a high level of hardware complexity and is not practical to implement. The next three models (Single Issue, Dual Issue, and Four Issue) deal with restrictions to the Full Issue model and require less hardware complexity. Single, Dual, and Four Issue limit the number of instructions that can be issued by each thread in a given cycle. For example, a Four Issue design would allow a thread to issue four instructions in a cycle, meaning that there must be at least two threads to fill all of the issue slots. Likewise, the Dual Issue model allows each thread to issue two instructions, making it necessary to have four threads [8].
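The difference between the issue models boils down to a per-thread cap on how many of the eight issue slots a single thread may claim in one cycle. The following sketch illustrates this with invented per-thread readiness counts; it is a simplified picture, not taken from any of the cited designs.

```python
# Fill up to 8 issue slots per cycle, with at most `per_thread_cap` instructions
# taken from any one thread (Full = 8, Four Issue = 4, Dual Issue = 2, Single Issue = 1).
def fill_issue_slots(ready_per_thread, per_thread_cap, total_slots=8):
    issued = {}
    remaining = total_slots
    for thread, ready in ready_per_thread.items():
        take = min(ready, per_thread_cap, remaining)   # respect readiness, cap, and slot budget
        if take:
            issued[thread] = take
            remaining -= take
    return issued

ready = {"T0": 6, "T1": 3, "T2": 1}                 # invented per-thread readiness
print(fill_issue_slots(ready, per_thread_cap=8))    # Full Issue:  {'T0': 6, 'T1': 2}
print(fill_issue_slots(ready, per_thread_cap=4))    # Four Issue:  {'T0': 4, 'T1': 3, 'T2': 1}
```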

Regardless of the model implemented, it is necessary for the processor to be aware of which instructions are being fetched and issued so that the full parallelism potential of an SMT processor can be reached. Because instructions can be fetched from multiple threads, the processor can choose the threads that will guarantee higher performance. This requires a smart instruction fetch unit that can continuously fetch useful instructions. After the instructions have been fetched, the processor dynamically distributes its resources amongst the instructions to ensure the highest level of hardware utilization. The instruction fetch unit actually becomes the main performance bottleneck of an SMT processor because it must consistently fetch useful instructions that will not clog the instruction queue [1]. The intricacies of the fetch unit will be explored in the following section.

5.1 The Instruction Fetch Unit

A conventional microprocessor fetches instructions from a single thread and then sends them to be executed. An SMT processor fetches instructions in a similar way; it has a single fetch unit, but it can access instructions from multiple threads. The performance of an SMT processor greatly relies on the fetch unit because it must be able to fetch more instructions per cycle in order to exploit the parallelism potential of SMT. Performance also relies on the fetch unit selecting useful instructions that can be executed in parallel and are not dependent on a mispredicted branch. There are a variety of fetch architectures and policies that can be implemented to improve instruction throughput. The performance analysis of the fetch policies described below was conducted using the SPEC92 benchmark suite. For multithreaded architectures, distinct programs are given to each thread in the processor [7].

5.1.1 Thread Selection

First, I will outline the different fetch schemes that can be used by the fetch unit to decide where to look for instructions. Recall that we are referring to a multithreaded processor that is able to fetch up to eight instructions per cycle. The different schemes are as follows: RR.1.8, RR.2.4, RR.4.2, and RR.2.8. The "RR" stands for the round-robin manner in which threads are selected, the first number represents the number of threads that can be fetched from in a cycle, and the second number represents the number of instructions that can be fetched per thread. The RR.1.8 scheme allows eight instructions to be fetched from one thread each cycle. The RR.2.4 and RR.4.2 schemes fetch fewer instructions per thread but have access to more threads. As you can imagine, it is more efficient for the fetch unit to be able to choose instructions from more threads because there is a higher probability of finding instructions that can be executed in parallel [7].

At first glance one might think that RR.4.2 or RR.2.4 would provide the best odds for finding useful instructions, but these schemes suffer from two flaws: thread shortage and extra hardware. Thread shortage is encountered when there are not enough threads ready to issue an instruction, therefore creating empty issue slots. For example, if the fetch unit is only able to fetch two instructions from up to four threads (RR.4.2) and only three threads are capable of issuing instructions, then two instruction issue slots are left empty. Also, to implement these schemes more circuitry is needed to deal

with the increased number of data paths and selection logic. Due to these flaws, RR.2.4 actually outperforms RR.4.2 because there is less chance of thread shortage and fewer hardware add-ons. Also, schemes such as RR.4.4 or RR.4.8 are not considered because the extra selection hardware would have a negative impact on cache access time [7].

Figure 2: Instruction throughput for the different fetch schemes using round-robin thread selection [7].

The most efficient scheme is RR.2.8. By fetching up to eight instructions from either of two threads, RR.2.8 is able to avoid thread shortage and does not require excessive hardware additions. Figure 2 shows the instruction throughput for the different fetch schemes based on the number of threads. As you can see, RR.2.4 and RR.4.2 do not perform well with fewer threads; RR.1.8 actually outperforms RR.4.2 until there are six threads. RR.2.4 outperforms RR.1.8 by 9% at eight threads but suffers from a 12% performance degradation when there is a single thread. This can be a serious issue when dealing with programs that are not multithreaded. The RR.2.8 scheme provides the greatest all-around performance upgrade; it improves instruction throughput by up to 10% and does not suffer when there is only a single thread [7].
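A rough sketch of the RR.2.8 idea: each cycle the fetch unit walks the threads in round-robin order, picks the first two that have instructions ready, and fetches at most eight instructions in total from them. The per-thread counts used in the example are invented for illustration.

```python
# Sketch of an RR.2.8-style fetch: round-robin over threads, fetch from up to
# two threads per cycle, at most eight instructions in total.
def rr_2_8_fetch(ready_counts, start, threads_per_cycle=2, fetch_width=8):
    n = len(ready_counts)
    fetched, budget, picked = {}, fetch_width, 0
    for offset in range(n):                      # round-robin starting point rotates each cycle
        t = (start + offset) % n
        if ready_counts[t] == 0 or picked == threads_per_cycle or budget == 0:
            continue                             # skip empty threads, or stop once limits are hit
        take = min(ready_counts[t], budget)
        fetched[t] = take
        budget -= take
        picked += 1
    return fetched

ready = [3, 0, 6, 2]                 # invented: instructions ready per thread
print(rr_2_8_fetch(ready, start=0))  # {0: 3, 2: 5} -- eight instructions from two threads
```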

Now that the fetch unit knows how many instructions it can fetch from each thread, it needs to know which threads to look at to get instructions. The most basic method for selecting threads is a round-robin scheme where the fetch unit sequentially checks each thread. The round-robin scheme does not take into account certain factors that may make one thread more desirable than another. A certain thread may be less desirable if it has encountered a branch misprediction, which would send it down a wrong path of instructions, or if it contains instructions that must wait in the instruction queue due to a cache miss or a dependency on a previous instruction. If threads have instructions that fall into these categories, then the instruction queue can become clogged with un-issueable instructions, thus decreasing the instruction throughput. Tullsen, et al. [7], outline different policies for choosing threads that will be less likely to create wrong paths or clog the instruction queue. These thread selection policies function based on feedback from the different pipeline stages:

• BRCOUNT - Threads with the fewest unresolved branches are given priority because they are less likely to be on a wrong path.

• MISSCOUNT - Threads with the fewest cache misses are given priority to prevent clogging of the instruction queue.

• ICOUNT - Threads with the fewest instructions in the instruction queue are given priority.

• IQPOSN - Threads that do not have old instructions in the integer or floating point queues are given priority.

Although all of these fetch policies provide increased throughput in comparison to the round-robin heuristic (see Figure 3), the ICOUNT policy is the most consistent. ICOUNT is the most efficient way to prevent instruction clogging because it takes a more general approach. By giving priority to the threads with the fewest instructions in the pipeline, it combats clogging in three ways. First, it prevents a single thread from filling up the instruction queue. Second, it ensures that the threads that are most successful in moving instructions through the pipeline are given priority. Finally, it encourages thread-level parallelism by providing an even distribution of instructions from the threads that are continuously moving instructions through the pipeline. The ICOUNT policy also helps to avoid thread starvation because long-latency instructions will eventually be executed and then those threads can again be chosen for instruction fetching [7].

After running benchmark tests, Tullsen, et al. [7], found that ICOUNT provides an instruction throughput of up to 5.3 instructions per cycle, compared to the 2.5 instructions per cycle seen by a superscalar processor.

Figure 3: Instruction throughput for the different fetch policies [7].

ICOUNT outperforms round-robin by 23% and IQPOSN, the second most successful policy, by 4%. Keep in mind that the BRCOUNT policy is the only scheme that attempts to avoid wrong paths due to branch mispredictions, whereas all of the other policies focus on preventing instruction queue clogging. Tullsen, et al. [7], claim that combining ICOUNT and BRCOUNT would provide the greatest increase in performance, but do not provide any specific results to support their claim. They do not implement this combination because it would require excessive additions to the circuitry that provides feedback to the fetch unit about the instructions in the pipeline.
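Stripped of the hardware details, ICOUNT amounts to ranking threads by how few of their instructions are already occupying the front end of the machine. A minimal sketch, with invented in-flight counts:

```python
# ICOUNT-style thread selection: favor the threads with the fewest instructions
# already in the decode/rename stages and the instruction queue.
def icount_priority(in_flight_counts, threads_per_cycle=2):
    """Return the thread ids to fetch from this cycle, fewest in-flight instructions first."""
    ranked = sorted(in_flight_counts, key=in_flight_counts.get)
    return ranked[:threads_per_cycle]

in_flight = {"T0": 12, "T1": 3, "T2": 7, "T3": 5}   # invented per-thread counts
print(icount_priority(in_flight))                   # ['T1', 'T3'] -- the least-clogged threads win
```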

5.1.2 Instruction Selection

Just as the instruction fetch unit determines which threads will ensure the best instruction throughput, the issue logic can determine which instructions should be issued to avoid horizontal waste. Recall that horizontal waste is created when issue slots go unused in a specific clock cycle. Issue slots can be wasted if unnecessary instructions are executed due to branch mispredictions or optimistic speculation. There are specific instruction selection algorithms that can be implemented to help reduce this type of issue waste.

The default instruction selection method used by a single-threaded processor is OLDEST_FIRST, which looks for instructions that have been in the instruction queue the longest because they are less likely to be on a wrong path. The location of an instruction in the queue should not matter for an SMT processor because the queue now has a mix of instructions from multiple threads. Tullsen, et al. [7], introduce these new methods as possible performance optimizations: OPT_LAST, SPEC_LAST, and BRANCH_FIRST. OPT_LAST and SPEC_LAST only issue optimistic and speculative instructions after all other instructions have been issued. Speculative and optimistic instructions are instructions that are behind a branch in the instruction queue. These types of instructions are speculatively executed before it is known if they are needed, making their results more readily available if they are. By executing speculative or optimistic instructions last, more issue slots will be available for instructions that are known to be useful. The BRANCH_FIRST method prevents unnecessary issue slot use by issuing branches first so that wrong paths can be identified as soon as possible. Tullsen, et al. [7], report that the only method to improve on the 7% useless instruction issue of the default, OLDEST_FIRST, method was OPT_LAST, and it only reduces it to 6%. The decision was made to stick with the default method because the 1% difference is not worth the extra hardware complexity that OPT_LAST introduces [7].
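These selection heuristics can be pictured as different sort keys over the instruction queue. The sketch below uses an invented queue of (age, is_speculative) entries to show how OLDEST_FIRST and OPT_LAST would order the same candidates; it illustrates the idea rather than the actual issue logic.

```python
# Instruction-selection heuristics expressed as sort keys over the instruction queue.
# Each queue entry is (age_in_queue, is_speculative); a higher age means an older instruction.
queue = [(5, False), (9, True), (2, False), (7, False), (3, True)]   # invented entries

oldest_first = sorted(queue, key=lambda e: -e[0])            # issue the oldest instructions first
opt_last     = sorted(queue, key=lambda e: (e[1], -e[0]))    # non-speculative first, then oldest

print(oldest_first)  # [(9, True), (7, False), (5, False), (3, True), (2, False)]
print(opt_last)      # [(7, False), (5, False), (2, False), (9, True), (3, True)]
```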

5.1.3 Clearing Up the Fetch Unit

Now that the different fetch policies have been defined, it is apparent that instruction throughput greatly relies on the fetch unit. This means that there will be significant performance degradation if the fetch unit becomes blocked. The ICOUNT scheme discussed earlier helps to make sure that the instruction queue does not become clogged by choosing threads that have been successfully moving instructions through the queue. By combining the ICOUNT policy with a more specific algorithm that prevents issuing instructions that experience a cache miss, clogging of the instruction queue can be further avoided. This new algorithm, ITAG, adds a stage to the front of the pipeline that is used to look up memory-dependent instructions a cycle early. By preemptively looking up instructions that have been tagged as needing to access the cache, we can fetch around the instructions that will experience a cache miss. By using an ITAG,ICOUNT.2.8 scheme, the peak performance of an SMT processor can be boosted by 37%, and the relative instruction throughput compared to a superscalar processor increases from 1.8 to 2.5 times the throughput [7].
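Following the description above (rather than any particular hardware design), ITAG can be pictured as a filter in front of thread selection: a tag probe performed one cycle early tells the fetch unit which threads' next fetches would miss in the cache, and those threads are passed over for that cycle. A toy sketch with invented addresses and probe results:

```python
# Toy ITAG-style filter: probe cache tags one cycle ahead and skip threads whose
# next fetch would miss, leaving them to be considered again in a later cycle.
def itag_filter(candidate_threads, next_fetch_addr, tag_probe):
    """Keep only the threads whose upcoming fetch is predicted to hit in the cache."""
    return [t for t in candidate_threads if tag_probe(next_fetch_addr[t])]

next_fetch_addr = {"T0": 0x100, "T1": 0x200, "T2": 0x300}          # invented fetch addresses
hits = {0x100: True, 0x200: False, 0x300: True}                    # invented probe results
print(itag_filter(["T0", "T1", "T2"], next_fetch_addr, hits.get))  # ['T0', 'T2']
```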

5.2 Resource Sharing

The way that an SMT processor allocates its hardware resources amongst the different threads is actually inherent in how it fetches instructions. One of the main ideas of SMT is to ensure that shared resources are utilized as much as possible, and the fetch policies described in the previous section help to achieve this goal by increasing the instruction throughput. Policies such as ICOUNT help to determine the priority of different threads based on feedback about the instructions that are being executed in the pipeline. The thread that has priority is able to send its instructions to the instruction queue, where they wait for the resources they need to become free. The instructions in the queue can be issued out of order whenever the units that are needed are no longer in use. Priority within the instruction queue is given to the instructions that have been in the queue the longest, i.e. the OLDEST_FIRST policy. So, resources are allocated to the instructions that have priority in the instruction queue, and those instructions are executed when the resources they need become free.

6 SMT Performance

In the previous section, specific numbers and percentages were presented to quantify the upgrade in performance experienced by a simultaneous multithreaded processor. To expand on, and fully understand, these numbers it is necessary to explain the different benchmarks and tests that were used to categorize the performance of SMT.

Eggers, et al. [1], ran a series of performance tests to compare an SMT processor to a wide-issue superscalar; a fine-grained, multithreaded superscalar; and both a two- and four-processor multiprocessor whose processors are also superscalars. These tests were run using the SPEC95 and SPLASH2 benchmarks. Two different types of workloads were used: multiprogramming and parallel. The multiprogramming workload consisted of several single-threaded programs that were each assigned to a thread. The parallel workload, targeted for multiprocessors, consisted of parallel programs where each thread is assigned the same code to execute. The goal of the multiprogramming workload is to exercise all of the different parts of the processor, while the parallel workload's goal is to stress how the processor deals with sharing resources. Note that the multiprocessors were given the same total amount of execution resources as the SMT model, i.e. each processor of the dual-processor multiprocessor was given half of the SMT's resources. Figure 4 shows the results of the simulation tests.

Figure 4: Instruction throughput for multiprogramming and parallel workloads (in instructions per cycle) [1].

For the multiprogramming workloads, SMT outperformed both the superscalar and the fine-grained, multithreaded superscalar processors. The performance of the multithreaded superscalar actually peaked at four threads and fell with eight threads. This slow-down can be attributed to thread competition. The nature of the multithreaded superscalar's instruction fetch unit, i.e. it can only fetch instructions from one thread at a time and it cycles through the threads in a round-robin fashion, means that there will be more competition for shared resources. With more than four threads, the extra competition becomes detrimental to performance [1].

The results of the parallel workload simulations also favored SMT. With a single thread, SMT outperformed both multiprocessors but its performance was matched by the superscalar and multithreaded superscalar. As the number of threads increased, the performance of SMT began to pull away from all the other models. At their maximum thread capability, the two- and four-processor multiprocessors experienced instruction throughputs of 4.3 and 4.2 respectively, while SMT experienced a throughput of 6.1 instructions per cycle. The multiprocessors were not able to keep up with SMT because multiprocessors statically partition their resources amongst processors. SMT dynamically partitions its resources amongst its threads, giving it the ability to exploit both instruction-level and thread-level parallelism when there are more threads. When multiple threads are not available, i.e. there is poor thread-level parallelism, all of the processor's resources can be designated to a single thread. If there is poor instruction-level parallelism in a thread, then other instructions can be issued from other threads to compensate [1].

As with all experiments, there are certain things that need to be taken into account so that the findings can be trusted. Li, et al. [5], bring up concerns regarding how the performance of a processor should be tested. First, they bring up the fact that looking at an architecture's instruction throughput

can be deceiving unless each thread has the same number of instructions. If not, a high instruction throughput can be falsified by skewing the load balance. Also, there is an argument that SMT simulations should not be stopped when the first thread has finished. It is thought that stopping before all threads have completed benefits SMT because it is possible that periods where throughput is low will be missed. Eggers, et al. [1], do not tell us that all of the threads have the same number of instructions, and they stop the simulation when the first thread is complete. In this situation, stopping when the first thread is complete is acceptable because the benchmarks being run are designed to be stressful on all of the shared resources of the processor, theoretically creating periods of low throughput.

7 The Energy-Efficiency of SMT

To reiterate, SMT is a way of increasing the instruction throughput of a superscalar processor without dramatically increasing the area of the processor. Adding SMT capabilities to a superscalar only increases the chip area by 5%, but this relates to a gain in instructions per cycle of up to 37%. As you can imagine, adding hardware to a chip and increasing the number of instructions that are being issued in a given time directly correlates to more power. We can see this by breaking down the processor into its different subunits and looking at their individual power uplift. The instruction sequencing unit, which contains the register renaming logic and instruction queue, has been scaled up to support SMT. In this case, extra hardware means extra power. On the other hand, the fetch unit and the load/store unit experience a power uplift due to increased utilization [5].

To classify the energy-efficiency of SMT we must outline an appropriate metric. The energy efficiency of a processor should not be measured based on power alone because power is directly related to the clock frequency. If we reduce the clock speed, then less power is used but the processor performs poorly. It is also inappropriate to base the metric simply on energy, or Joules/instruction. The Joules/instruction can be reduced by decreasing the supply voltage, which in turn increases the delay of the circuits and hurts performance. The most appropriate way to measure energy-efficiency is to look at the metric outlined by Gonzalez and Horowitz [2], the energy-delay product. The energy-delay product is an effective metric for measuring power with respect to performance. The ED product measures the product of the total energy consumed during a specific execution and the total end-to-end execution latency. This metric ensures that the balance between energy and performance is maintained because, in order to improve the ED product,

either performance must be increased or energy must be reduced without affecting the other. Li, et al. [5], use a different metric in their study on the energy-efficiency of SMT. They use the energy-delay-squared (ED²) product, which expands on the ED product by placing more emphasis on performance. This means that more energy can be consumed in order to increase the performance of the processor. This metric is more appropriate than the energy-delay product because the main goal of SMT is to increase performance, and we know that this performance will come with increased power. Li, et al. [5], compared the energy-efficiency of an SMT-enabled POWER4-like architecture to a single-threaded model. Ten SPEC2000 integer benchmarks were used for the single-threaded model, and pairs of these single-threaded benchmarks were used to create dual-threaded benchmarks for SMT. They found that SMT can improve performance by 20% while only creating a power uplift of 24%. With respect to the ED² metric, SMT is actually a very energy-efficient design. We can plug these values into the ED² expression and see that the product is decreased relative to a single-threaded baseline of 1:

E × D²
E = 1.24 (an increase in power of 24%)
D = 0.8 (an increase in performance of 20%)
1.24 × (0.8)² = 0.7936
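The arithmetic can be written out as a short check. The 24% power uplift and 20% performance gain are the figures reported by Li, et al. [5]; everything is normalized to a single-threaded baseline of E = 1 and D = 1, and the 20% performance gain is taken, as above, as delay shrinking to 0.8.

```python
# Energy-delay and energy-delay^2 products, normalized to a single-threaded
# baseline of E = 1.0, D = 1.0 (figures from the Li, et al. comparison above).
def ed_product(energy, delay):
    return energy * delay

def ed2_product(energy, delay):
    return energy * delay ** 2

smt_energy = 1.24   # 24% more power/energy consumed
smt_delay  = 0.80   # 20% performance gain, taken here as delay shrinking to 0.8

print(ed_product(smt_energy, smt_delay))    # 0.992  -- roughly break-even on ED
print(ed2_product(smt_energy, smt_delay))   # 0.7936 -- a clear improvement on ED^2
```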

8 Intel Hyper-Threading

The Intel Pentium 4 with Hyper-Threading was the first general purpose microprocessor with SMT features to be commercially available. Intel's "Hyper-Threading" is a dual-threaded implementation of SMT on an out-of-order processor. I refer to the Pentium 4 with Hyper-Threading as having SMT features because it does not fully embrace the principles of an SMT processor. In other words, it utilizes certain aspects of SMT, but falls short by leaving out some key components.

The main difference between SMT and Hyper-Threading can be seen in how resources are shared. Recall that SMT dynamically shares all of its resources between the different threads and instructions. In Hyper-Threading some resources are shared while others are statically split in half and divided between the two threads when the processor is in multithreading mode. Specifically, the functional units, cache, and branch prediction logic remain shared resources, while the re-order buffer, load/store unit, and instruction queue are split. Partitioning these resources creates more thread competition [6]. Thread competition for shared resources is inherent in SMT,

but giving a thread half the resources creates even more competition and can be detrimental to performance as the number of threads increases. Hyper-Threading minimizes this effect by only supporting two threads. In fact, with only two threads, a partitioned instruction queue performs similarly to a dynamically shared queue, but it would not be efficient with more threads [6].

Tuck and Tullsen [6] provide an in-depth analysis of how the performance of the Pentium 4 with Hyper-Threading compares to the expectations set by the prior published research on SMT. They test a Hyper-Threading 2.5 GHz Pentium 4 processor running the SPEC CPU2000 benchmarks, the NAS parallel benchmarks, and an expanded version of the Stanford SPLASH2 benchmarks. The Hyper-Threading Pentium 4 is based on a dual-threaded, 3-wide superscalar, so it can issue up to six instructions per cycle. To summarize their findings, Hyper-Threading does fulfill the expectations associated with an SMT processor that handles two threads. On average, the Hyper-Threaded Pentium 4 achieves a 1.20 speedup on multithreaded workloads and a 1.24 speedup on parallel workloads. Although a performance gain is experienced, Hyper-Threading does not fully exploit the parallel and dynamic potential of simultaneous multithreading. The static partitioning of resources does not allow for more than two threads, and the processor can only issue six instructions per cycle, whereas SMT would allow eight threads and be capable of issuing eight instructions per cycle. These factors explain the difference between the 20% speedup of Hyper-Threading and the 37% speedup of SMT. Tuck and Tullsen claim to have run simulation tests on an SMT processor that was configured like the Pentium 4, but do not provide any specific performance numbers relating to instructions per cycle, so we are trusting that what they are reporting is correct.

9 Conclusion

Simultaneous Multithreading is an extension of hardware multithreading that increases parallelism in all forms. SMT combines the instruction-level parallelism experienced by pipelined, superscalar processors with the thread-level parallelism of multithreading. This allows the processor to issue multiple instructions from multiple threads in a single clock cycle, thus increasing the overall instruction throughput.

To fully exploit the parallelism potential of SMT, the processor must strategically fetch instructions that will provide better performance. The processor wants to avoid issuing memory-dependent instructions that may experience a cache miss or branch mispredictions that may send the processor down the wrong path. By choosing which threads and instructions

to issue, the processor can ensure that instructions are moving efficiently through the pipeline and that hardware resources are being fully utilized. This is achieved by providing the instruction fetch unit with more intelligent methods for selecting threads and fetching instructions. Tullsen, et al. [7], outline the most efficient SMT scheme as a design that is able to fetch up to eight instructions from either of two threads. The threads are selected using the ICOUNT policy, which takes feedback from the pipeline and gives priority to threads that have the fewest number of instructions in the instruction queue. This increases parallelism because while long-latency instructions are being executed, other threads are given priority and can issue instructions to keep the processor busy. The ICOUNT policy is coupled with the ITAG policy, which preemptively looks up memory accesses for instructions in a certain thread. This helps the fetch unit to avoid issuing instructions that will experience a cache miss and clog the queue. The combination of these three ideas, ITAG,ICOUNT.2.8, increases the peak performance of an SMT processor by 37% and raises the instruction throughput to 2.5 times that of a superscalar on the workloads studied.

SMT capabilities have yet to be fully utilized in a general purpose microprocessor. Intel's Pentium 4 with Hyper-Threading allows instructions to be issued simultaneously from two different threads, but it splits up certain hardware resources amongst the threads. Although a 1.20 speedup is achieved on multithreaded workloads, the partitioning of resources does not allow the design to be scaled to more than two threads. This means that the true dynamic and parallel potential of SMT cannot be achieved. Still, this proves that the theory behind simultaneous multithreading does have beneficial applications in the microprocessor industry and that SMT is a strong building block for future technologies.

References

[1] Eggers, S. J., Emer, J. S., Levy, H. M., Lo, J. L., Stamm, R. L., and Tullsen, D. M. Simultaneous multithreading: A platform for next-generation processors. IEEE Micro 17 (1997), 12–19.

[2] Gonzalez, R., and Horowitz, M. Energy dissipation in general purpose microprocessors, 1996.

[3] Harris, D. M., and Harris, S. L. Digital Design and Computer Architecture. Morgan Kaufmann, Amsterdam, 2007.

[4] Hennessy, J., Patterson, D., and Arpaci-Dusseau, A. Computer Architecture: A Quantitative Approach, 4th ed. Elsevier, San Francisco, 2007.

[5] Li, Y., Brooks, D., Hu, Z., Skadron, K., and Bose, P. Under- standing the energy efficiency of simultaneous multithreading. In Pro- ceedings of the 2004 international symposium on Low power electronics and design (New York, NY, USA, 2004), ISLPED ’04, ACM, pp. 44–49.

[6] Tuck, N., and Tullsen, D. M. Initial observations of the simultane- ous multithreading pentium 4 processor. In Proceedings of the 12th In- ternational Conference on Parallel Architectures and Compilation Tech- niques (Washington, DC, USA, 2003), PACT ’03, IEEE Computer Soci- ety, pp. 26–.

[7] Tullsen, D. M., Eggers, S. J., Emer, J. S., Levy, H. M., Lo, J. L., and Stamm, R. L. Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor. In Proceedings of the 23rd annual international symposium on Computer architecture (New York, NY, USA, 1996), ISCA ’96, ACM, pp. 191–202.

[8] Tullsen, D. M., Eggers, S. J., and Levy, H. M. Simultaneous multithreading: Maximizing on-chip parallelism, 1995.
