SIMULTANEOUS MULTITHREADING: A Platform for Next-Generation Processors

Susan J. Eggers, University of Washington
Joel S. Emer, Digital Equipment Corp.
Henry M. Levy, University of Washington
Jack L. Lo, University of Washington
Rebecca L. Stamm, Digital Equipment Corp.
Dean M. Tullsen, University of California, San Diego

Simultaneous multithreading exploits both instruction-level and thread-level parallelism by issuing instructions from different threads in the same cycle.

As the community prepares for a billion transistors on a chip, researchers continue to debate the most effective way to use them. One approach is to add more memory (either cache or primary memory) to the chip, but the performance gain from memory alone is limited. Another approach is to increase the level of systems integration, bringing support functions like graphics accelerators and I/O controllers on chip. Although integration lowers system costs and communication latency, the overall performance gain to applications is again marginal.

We believe the only way to significantly improve performance is to enhance the processor’s computational capabilities. In general, this means increasing parallelism—in all its available forms. At present only certain forms of parallelism are being exploited. Current superscalars, for example, can execute four or more instructions per cycle; in practice, however, they achieve only one or two, because current applications have low instruction-level parallelism. Placing multiple superscalar processors on a chip is also not an effective solution, because, in addition to the low instruction-level parallelism, performance suffers when there is little thread-level parallelism. A better solution is to design a processor that can exploit all types of parallelism well.

Simultaneous multithreading is a processor design that meets this goal, because it consumes both thread-level and instruction-level parallelism. In SMT processors, thread-level parallelism can come from either multithreaded, parallel programs or individual, independent programs in a multiprogramming workload. Instruction-level parallelism comes from each single program or thread. Because it successfully (and simultaneously) exploits both types of parallelism, SMT processors use resources more efficiently, and both instruction throughput and speedups are greater.

Simultaneous multithreading combines hardware features of wide-issue superscalars and multithreaded processors. From superscalars, it inherits the ability to issue multiple instructions each cycle; and like multithreaded processors it contains hardware state for several programs (or threads). The result is a processor that can issue multiple instructions from multiple threads each cycle, achieving better performance for a variety of workloads. For a mix of independent programs (multiprogramming), the overall throughput of the machine is improved. Similarly, programs that are parallelizable, either by a compiler or a programmer, reap the same throughput benefits, resulting in program speedup. Finally, a single-threaded program that must execute alone will have all machine resources available to it and will maintain roughly the same level of performance as when executing on a single-threaded, wide-issue processor.

Equal in importance to its performance benefits is the simplicity of SMT’s design. Simultaneous multithreading adds minimal hardware complexity to, and, in fact, is a straightforward extension of, conventional dynamically scheduled superscalars. Hardware designers can focus on building a fast, single-threaded superscalar, and add SMT’s multithread capability on top.

Given the enormous transistor budget in the next era, we believe simultaneous multithreading provides an efficient base technology that can be used in many ways to extract improved performance. For example, on a one billion transistor chip, 20 to 40 SMTs could be used side-by-side to achieve performance comparable to that of a much larger number of conventional superscalars.



With IRAM technology, SMT’s high execution rate, which currently doubles memory bandwidth requirements, can fully exploit the increased bandwidth capability. In both billion-transistor scenarios, the SMT processor we describe here could serve as the processor building block.

How SMT works
The difference between superscalar, multithreading, and simultaneous multithreading is pictured in Figure 1, which shows sample execution sequences for the three architectures. Each row represents the issue slots for a single execution cycle: a filled box indicates that the processor found an instruction to execute in that issue slot on that cycle; an empty box denotes an unused slot. We characterize the unused slots as horizontal or vertical waste. Horizontal waste occurs when some, but not all, of the issue slots in a cycle can be used. It typically occurs because of poor instruction-level parallelism. Vertical waste occurs when a cycle goes completely unused. This can be caused by a long-latency instruction (such as a memory access) that inhibits further instruction issue.

[Figure 1. How architectures partition issue slots (functional units): a superscalar (a), a fine-grained multithreaded superscalar (b), and a simultaneous multithreaded processor (c). The rows of squares represent issue slots; time, in processor cycles, runs down the figure, and the legend distinguishes threads 1 through 5. The processor either finds an instruction to execute (filled box) or the slot goes unused (empty box).]

Figure 1a shows a sequence from a conventional superscalar. As in all superscalars, it is executing a single program, or thread, from which it attempts to find multiple instructions to issue each cycle. When it cannot, the issue slots go unused, and it incurs both horizontal and vertical waste.

Figure 1b shows a sequence from a multithreaded architecture, such as the Tera.1 Multithreaded processors contain hardware state (a program counter and registers) for several threads. On any given cycle a processor executes instructions from one of the threads. On the next cycle, it switches to a different thread context and executes instructions from the new thread. As the figure shows, the primary advantage of multithreaded processors is that they better tolerate long-latency operations, effectively eliminating vertical waste. However, they cannot remove horizontal waste. Consequently, as instruction issue width continues to increase, multithreaded architectures will ultimately suffer the same fate as superscalars: they will be limited by the instruction-level parallelism in a single thread.

Figure 1c shows how each cycle an SMT processor selects instructions for execution from all threads. It exploits instruction-level parallelism by selecting instructions from any thread that can (potentially) issue. The processor then dynamically schedules machine resources among the instructions, providing the greatest chance for the highest hardware utilization. If one thread has high instruction-level parallelism, that parallelism can be satisfied; if multiple threads each have low instruction-level parallelism, they can be executed together to compensate. In this way, SMT can recover issue slots lost to both horizontal and vertical waste.

SMT model
We derived our SMT model from a high-performance, out-of-order, superscalar architecture whose dynamic scheduling core is similar to that of the Mips R10000. In each cycle the processor fetches eight instructions from the instruction cache. After instruction decoding, the register-renaming logic maps the architectural registers to the hardware renaming registers to remove false dependencies. Instructions are then fed to either the integer or floating-point dispatch queues. When their operands become available, instructions are issued from these queues to their corresponding functional units. To support out-of-order execution, the processor tracks instruction and operand dependencies so that it can determine which instructions it can issue and which must wait for previously issued instructions to finish. After instructions complete execution, the processor retires them in order and frees hardware registers that are no longer needed.

Our SMT model, which can simultaneously execute threads from up to eight hardware contexts, is a straightforward extension of this conventional superscalar. We replicated some superscalar resources to support simultaneous multithreading: state for the hardware contexts (registers and program counters) and per-thread mechanisms for pipeline flushing, instruction retirement, trapping, precise interrupts, and subroutine return. We also added per-thread (address-space) identifiers to the branch target buffer and translation look-aside buffer. Only two components, the instruction fetch unit and the processor pipeline, were redesigned to benefit from SMT’s multithread instruction issue.

Simultaneous multithreading needs no special hardware to schedule instructions from the different threads onto the functional units. Dynamic scheduling hardware in current out-of-order superscalars is already functionally capable of simultaneous multithreaded scheduling. Register renaming eliminates register name conflicts both within and between threads by mapping thread-specific architectural registers onto the hardware registers; the processor then issues instructions (after their operands have been calculated or loaded from memory) without regard to thread.
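The thread-blind issue that this renaming enables is easy to see in miniature. The following sketch is an illustration we added, not the authors' hardware; the sizes follow the implementation parameters listed later in the article. Each context has a private map from architectural to physical registers, so identically numbered registers in different threads never collide:

```c
/* Toy model of SMT register renaming (illustrative sketch, not from
 * the article). Per-context rename maps share one physical register
 * pool, so issued instructions need no thread information. */
#include <assert.h>

#define NUM_THREADS 8
#define ARCH_REGS   32
#define PHYS_REGS   (NUM_THREADS * ARCH_REGS + 100)  /* 356 */

static int rename_map[NUM_THREADS][ARCH_REGS];  /* arch reg -> phys reg */
static int free_list[PHYS_REGS];
static int free_count;

/* Allocate a fresh physical register for a destination operand.
 * (A real renamer would also free the old mapping at retirement.) */
static int rename_dest(int thread, int arch_reg)
{
    assert(free_count > 0);
    int phys = free_list[--free_count];
    rename_map[thread][arch_reg] = phys;
    return phys;
}

/* Source operands simply read the current mapping. */
static int rename_src(int thread, int arch_reg)
{
    return rename_map[thread][arch_reg];
}

int main(void)
{
    for (int p = 0; p < PHYS_REGS; p++)
        free_list[free_count++] = p;

    /* Two threads both write architectural register 5 ... */
    int t0_r5 = rename_dest(0, 5);
    int t1_r5 = rename_dest(1, 5);

    /* ... yet land in different physical registers: no conflict. */
    assert(t0_r5 != t1_r5);
    assert(rename_src(0, 5) == t0_r5 && rename_src(1, 5) == t1_r5);
    return 0;
}
```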




This minimal redesign has two important consequences. First, since most hardware resources are still available to a single thread executing alone, SMT provides good performance for programs that cannot be parallelized. Second, because the changes to enable simultaneous multithreading are minimal, the commercial transition from current superscalars to SMT processors should be fairly smooth.

However, should an SMT implementation negatively impact either the targeted processor cycle time or the time to design completion, designers could take several approaches to simplify it. Because most of SMT’s implementation complexity stems from its wide-issue superscalar underpinnings, many of the alternatives involve altering the superscalar. One solution is to partition the functional units across a duplicated register file (as in the Alpha 21264) to reduce ports on the register file and dispatch queues.2 Another is to build interleaved caches augmented with multiple, independently addressed banks, or phase-pipeline cache accesses to increase the number of simultaneous cache accesses. A third alternative would subdivide the dispatch queue to reduce instruction issue delays.3 As a last resort, a designer could reduce the number of register and cache ports by using a narrower issue width. This alternative would have the greatest impact on performance.

Instruction fetching. In a conventional processor, the fetch unit fetches instructions from a single thread into the instruction queue. Performance issues revolve around maximizing the number of useful instructions that can be fetched (by minimizing branch mispredictions, for example) and fetching independent instructions quickly enough to keep functional units busy. An SMT processor places additional stress on the fetch unit. The fetch unit must now fetch instructions more quickly to satisfy SMT’s more efficient dynamic scheduler, which issues more instructions each cycle (because it takes them from multiple threads). Thus, the fetch unit becomes SMT’s performance bottleneck.

On the other hand, an SMT fetch unit can take advantage of the interthread competition for instruction bandwidth to enhance performance. First, it can partition this bandwidth among the threads. This is an advantage, because branch instructions and cache-line boundaries often make it difficult to fill issue slots if the fetch unit can access only one thread at a time. We fetch from two threads each cycle to increase the probability of fetching only useful (nonspeculative) instructions. In addition, the fetch unit can be smart about which threads it fetches, fetching those that will provide the most immediate performance benefit.

2.8 fetching. The fetch unit we propose is the 2.8 scheme.4 The unit has eight program counters, one for each thread context. On each cycle, it selects two different threads (from those not already incurring instruction cache misses) and fetches eight instructions from each thread. To match the issue hardware’s lower instruction width, it then chooses a subset of these instructions for decoding. It takes instructions from the first thread until it encounters a branch instruction or the end of a cache line and then takes the remaining instructions from the second thread.

Because the instructions come from two different threads, there is a greater likelihood of fetching useful instructions—indeed, the 2.8 scheme performed 10% better than fetching from one thread at a time. Its hardware cost is only an additional port on the instruction cache and logic to locate the branch instruction—nothing extraordinary for near-future processors. A less hardware-intensive alternative is to fetch from multiple threads while limiting the fetch bandwidth to eight instructions. However, the performance gain here is lower: for example, the 2.8 scheme performed 5% better than fetching four instructions from each of two threads.

Icount feedback. Because it can fetch instructions from more than one thread, an SMT processor can be selective about which threads it fetches. Not all threads provide equally useful instructions in a particular cycle. If the processor can predict which threads will produce the fewest delays, performance should improve. Our thread selection hardware uses the Icount feedback technique,4 which gives highest priority to the threads with the fewest instructions in the decode, renaming, and queue pipeline stages.

Icount increases performance in several ways. First, it replenishes the dispatch queues with instructions from the fast-moving threads, avoiding those that will fill the queues with instructions that depend on and are consequently blocked behind long-latency instructions. Second and most important, it maintains in the queues a fairly even distribution of instructions among these fast-moving threads, thereby increasing interthread parallelism (and the ability to hide more latencies). Finally, it avoids thread starvation, because threads whose instructions are not executing will eventually have few instructions in the pipeline and will be chosen for fetching. Thus, even though eight threads are sharing and competing for slots in the dispatch queues, the percentage of cycles in which the queues are full is actually less than on a single-threaded superscalar (8% of cycles versus 21%).

This performance also comes with a very low hardware cost. Icount requires only a small amount of additional logic to increment (decrement) per-thread counters when instructions enter the decode stage (exit the dispatch queues) and to pick the two smallest counter values.

Icount feedback works because it addresses all causes of dispatch queue inefficiency. In our experiments,4 it outperformed alternative schemes that addressed a particular cause of dispatch queue inefficiency, such as those that minimize branch mispredictions (by giving priority to threads with the fewest outstanding branches) or minimize load delays (by giving priority to threads with the fewest outstanding on-chip cache misses).
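The logic these paragraphs describe is small enough to sketch. The toy version below is a reconstruction for illustration, not the authors' design: counters rise when a thread's instructions enter decode and fall when they leave the dispatch queues, and each cycle the two lowest-count threads not stalled on an instruction cache miss are chosen, matching the selection step of the 2.8 scheme.

```c
/* Toy model of Icount fetch selection (a reconstruction; the article
 * describes the policy, not an implementation). */
#include <stdbool.h>

#define NUM_THREADS 8

static int  icount[NUM_THREADS];      /* instrs in decode/rename/queues */
static bool icache_miss[NUM_THREADS]; /* thread waiting on the I-cache  */

void on_enter_decode(int thread) { icount[thread]++; }
void on_leave_queue(int thread)  { icount[thread]--; }

/* Pick the two distinct eligible threads with the smallest counters.
 * A starved thread soon holds the smallest counter, so it is
 * eventually chosen for fetching, as the text notes. */
void select_fetch_threads(int *first, int *second)
{
    *first = *second = -1;  /* -1 means no eligible thread found */
    for (int t = 0; t < NUM_THREADS; t++) {
        if (icache_miss[t])
            continue;
        if (*first < 0 || icount[t] < icount[*first]) {
            *second = *first;
            *first  = t;
        } else if (*second < 0 || icount[t] < icount[*second]) {
            *second = t;
        }
    }
}
```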



Register file and pipeline. In simultaneous multithreading (as in a superscalar processor), each thread can address 32 architectural integer (and floating-point) registers. The register-renaming mechanism maps these architectural registers onto a hardware register file whose size is determined by the number of architectural registers in all thread contexts, plus a set of additional renaming registers. The larger SMT register file requires a longer access time; to avoid increasing the processor cycle time, we extended the SMT pipeline two stages to allow two-cycle register reads and two-cycle writes.

The two-stage register access has several ramifications on the architecture. For example, the extra stage between instruction fetch and execute increases the branch misprediction penalty by one cycle. The two extra stages between register renaming and instruction commit increase the minimum time that an executing instruction can hold a hardware register; this, in turn, increases the pressure on the renaming registers. Finally, the extra stage needed to write back results to the register file requires an extra level of bypass logic.

Implementation parameters. SMT’s implementation parameters will, of course, change as technologies shrink. In our simulations, they targeted an implementation that should be realized approximately three years from now. For the CPU, the parameters are

• an eight-instruction fetch/decode width;
• six integer units, four of which can load (store) from (to) memory;
• four floating-point units;
• 32-entry integer and floating-point dispatch queues;
• hardware contexts for eight threads;
• 100 additional integer renaming registers;
• 100 additional floating-point renaming registers; and
• retirement of up to 12 instructions per cycle.

For the memory subsystem, the parameters are

• 128-Kbyte, two-way set-associative, L1 instruction and data caches; the D-cache has four dual-ported banks; the I-cache has eight single-ported banks; the access time per bank is two cycles;
• a 16-Mbyte, direct-mapped, unified L2 cache; the single bank has a transfer time of 12 cycles on a 256-bit bus;
• an 80-cycle memory latency on a 128-bit bus;
• 64-byte blocks on all caches;
• 16 outstanding cache misses;
• data and instruction TLBs that contain 128 entries each; and
• McFarling-style branch prediction hardware:5 a 256-entry, four-way set-associative branch target buffer with an additional thread identifier field, and a hybrid branch predictor that selects between global and local predictors. The global predictor has 13 history bits; the local predictor has a 2,048-entry local history table that indexes into a 4,096-entry prediction table.




Simulation environment
We compared SMT with its two ancestral processor architectures, wide-issue superscalars and fine-grained, multithreaded superscalars. Both are single-processor architectures, designed to improve instruction throughput. Superscalars do so by issuing and executing multiple instructions from a single thread, exploiting instruction-level parallelism. Multithreaded superscalars, in addition to heightening instruction-level parallelism, hide latencies of one thread by switching to and executing instructions from another thread, thereby exploiting thread-level parallelism.

To gauge SMT’s potential for executing parallel workloads, we also compared it to a third alternative for improving instruction throughput: small-scale, single-chip shared-memory multiprocessors, whose processors are also superscalars. We examined both two- and four-processor multiprocessors, partitioning their scheduling unit resources (the functional units—and therefore the issue width—dispatch queues, and renaming registers) differently for each case. In the two-processor multiprocessor (MP2), each processor received half of SMT’s execution resources, so that the total resources of the two architectures were comparable. Each processor of the four-processor multiprocessor (MP4) contains approximately one-fourth of SMT’s chip resources. For some experiments we increased these base configurations until each MP processor had the resource capability of a single SMT.

MP2 and MP4 represent an interesting trade-off between thread-level and instruction-level parallelism. MP2 can exploit more instruction-level parallelism, because each processor has more functional units than its MP4 counterpart. On the other hand, MP4 has two more processors to take advantage of more thread-level parallelism.

All processor simulators are execution-driven, cycle-level simulators; they model the processor pipelines and memory subsystems (including interthread contention for all structures in the memory hierarchy and the buses between them) in great detail. The simulators for the three alternative architectures reflect the SMT implementation model, but without the simultaneous multithreading extensions. Instead they use single-threaded fetching (per processor) and the shorter pipeline, without simultaneous multithreaded issue. The fine-grained multithreaded processor simulator context switches between threads each cycle in a round-robin fashion for instruction fetch, issue, and retirement. Table 1 summarizes the characteristics of each architecture.

Table 1. Comparison of processor architectures.*

Features                               SS    MP2   MP4   FGMT          SMT
CPUs                                   1     2     4     1             1
Functional units/CPU                   10    5     3     10            10
Architectural registers/CPU
  (integer or floating point)          32    32    32    256           256
                                                         (8 contexts)  (8 contexts)
Dispatch queue size
  (integer or floating point)          32    16    8     32            32
Renaming registers/CPU
  (integer or floating point)          100   50    25    100           100
Pipe stages                            7     7     7     9             9
Threads fetched/cycle                  1     1     1     1             2
Multithread fetch algorithm            n/a   n/a   n/a   Round-robin   Icount

*SS represents a wide-issue superscalar; MP2 and MP4 represent small-scale, single-chip, shared-memory multiprocessors whose processors (two and four, respectively) are also superscalars; and FGMT represents a fine-grained multithreaded superscalar.

We evaluated simultaneous multithreading on a multiprogramming workload consisting of several single-threaded programs and a group of parallel (multithreaded) applications. We used both types of workloads, because each exercises different parts of an SMT processor. The larger (workload-wide) working set of the multiprogramming workload should stress the shared structures in an SMT processor (for example, the caches, TLB, and branch prediction hardware) more than the largely identical threads of the parallel programs, which share both instructions and data.

We chose the programs in the multiprogramming workload from the Spec95 and Splash2 benchmark suites. Each program executed as a separate thread. To eliminate the effects of benchmark differences when simulating fewer than eight threads, each data point in Table 2 and Figure 2 comprises at least four simulation runs, where each run used a different combination of the benchmarks. The data represent continuous execution with a particular number of threads; that is, we stopped simulating when one of the threads completed.

The parallel workload consists of coarse-grained (parallel threads) and medium-grained (parallel loop iterations) parallel programs targeted for shared-memory multiprocessors. As such, it was a fair basis for evaluating MP2 and MP4. It also presents a different, but equally challenging, test of SMT. Unlike the multiprogramming workload, all threads in a parallel application execute the same code, and therefore have similar execution resource requirements (they may need the same functional units at the same time, for example). Consequently, there is potentially more contention for these resources.

We also selected the parallel applications from the Spec95 and Splash2 suites. Where feasible, we executed the entire parallel portions of the programs; for the long-running Spec95 programs, we simulated several iterations of the main loops, using their reference data sets. The Spec95 programs were parallelized with the SUIF compiler,6 using policies developed for shared-memory machines.

We compiled the multiprogramming workload with cc and the parallel benchmarks with a version of the Multiflow compiler7 that produces Alpha executables. For all programs we set compiler optimizations to maximize each program’s performance on the superscalar; however, we disabled trace scheduling, so that speculation could be guided by the branch prediction and out-of-order execution hardware.

Simulation results
As Table 2 shows, simultaneous multithreading achieved much higher throughput than the one to two instructions per cycle normally reported for current wide-issue superscalars. Throughput rose consistently with the number of threads; at eight threads, it reached 6.2 for the multiprogramming workload and 6.1 for the parallel applications. As Figure 2b shows, speedups for parallel applications at eight threads averaged 1.9 over the same processor with one thread, demonstrating SMT’s parallel processing capability.

Table 2. Instruction throughput (instructions per cycle) executing a multiprogramming workload and a parallel workload.

           Multiprogramming workload     Parallel workload
Threads    SS     FGMT   SMT             SS     MP2    MP4    FGMT   SMT
1          2.7    2.6    3.1             3.3    2.4    1.5    3.3    3.3
2          —      3.3    3.5             —      4.3    2.6    4.1    4.7
4          —      3.6    5.7             —      —      4.2    4.2    5.6
8          —      2.8    6.2             —      —      —      3.5    6.1

[Figure 2. Speedups with the multiprogramming workload (a) and parallel workload (b). Each panel plots speedup (0.0 to 3.0) against the number of threads (1 to 8) for SMT, MP2, MP4, and FGMT.]

As we explained earlier, SMT’s pipeline is two cycles longer than that of the superscalar and the processors of the single-chip multiprocessor. Therefore, a single thread executing on an SMT processor has additional latencies and should have lower performance. However, as Figure 2 shows, SMT’s single-thread performance was only one percent (parallel workload) and 1.5% (multiprogramming workload) worse than that of the single-threaded superscalar. Accurate branch prediction hardware and the shared pool of renaming registers prevented additional penalties from occurring frequently.

Resource contention on SMT is also a potential problem. Many of SMT’s hardware structures, such as the caches, TLBs, and branch prediction tables, are shared by all threads. The unified organization allows a more flexible, and therefore higher, utilization of these structures, as executing threads place different usage demands on them. It also makes the entire structures available when fewer than eight threads—most importantly, a single thread—are executing. On the downside, interthread use leads to competition for the shared resources, potentially driving up cache misses, TLB misses, and branch mispredictions.

We found that interthread interference was significant only for the L1 data cache and the branch prediction tables; the data sets of our workload fit comfortably into the off-chip L2 cache, and conflicts in the TLBs and the L1 instruction cache were minimal. L1 data cache misses rose by 68% (parallel workload) and 66% (multiprogramming workload) and branch mispredictions by 50% and 27%, as the number of threads went from one to eight.

SMT was able to absorb the additional conflicts. Many of the L1 misses were covered by the fully pipelined 16-Mbyte L2 cache, whose latency was only 10 cycles longer than that of the L1 cache. Consequently, L1 interthread conflict misses degraded performance by less than one percent.8 The most important factor, however, was SMT’s ability to simultaneously issue instructions from multiple threads. Thus, although simultaneous multithreading introduces additional conflicts for the shared hardware structures, it has a greater ability to hide them.

SMT vs. the superscalar. The single-threaded superscalar fell far short of SMT’s performance. As Table 2 shows, the superscalar’s instruction throughput averaged 2.7 instructions per cycle, out of a potential of eight, for the multiprogramming workload; the parallel workload had a slightly higher average throughput of 3.3.

Consequently, SMT executed the multiprogramming workload 2.3 times faster and the parallel workload 1.9 times faster (at eight threads). The superscalar’s inability to exploit more instruction-level parallelism and any thread-level parallelism (and consequently hide horizontal and vertical waste) contributed to its lower performance.
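The throughput and waste figures above come from cycle-level simulation. As a minimal illustration of the bookkeeping involved (our sketch; the article does not publish simulator code), the loop below classifies each cycle's unused issue slots as horizontal or vertical waste and computes instructions per cycle:

```c
/* Issue-slot accounting of the kind a cycle-level simulator performs
 * (illustrative only). Unused slots in a partially used cycle are
 * horizontal waste; completely idle cycles are vertical waste. */
#include <stdio.h>

#define ISSUE_WIDTH 8

int main(void)
{
    /* Hypothetical per-cycle issue counts from a short trace. */
    int issued_per_cycle[] = { 3, 0, 8, 1, 0, 6 };
    int cycles = (int)(sizeof issued_per_cycle / sizeof issued_per_cycle[0]);

    long issued = 0, horizontal = 0, vertical = 0;
    for (int c = 0; c < cycles; c++) {
        int n = issued_per_cycle[c];
        issued += n;
        if (n == 0)
            vertical += ISSUE_WIDTH;       /* whole cycle wasted  */
        else
            horizontal += ISSUE_WIDTH - n; /* leftover slots only */
    }

    printf("IPC = %.2f, horizontal waste = %ld slots, vertical waste = %ld slots\n",
           (double)issued / cycles, horizontal, vertical);
    return 0;
}
```

On this accounting, the eight-thread SMT's 6.2 instructions per cycle leaves fewer than two of the eight slots unused in an average cycle, whereas the superscalar's 2.7 leaves more than five.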



SMT vs. fine-grained multithreading. By eliminating vertical waste, the fine-grained multithreaded architecture provided speedups over the superscalar as high as 1.3 on both workloads. However, this maximum speedup occurred at only four threads; with additional threads, performance fell. Two factors contribute. First, fine-grained multithreading eliminates only vertical waste; given the latency-hiding capability of its out-of-order processor and lockup-free caches, four threads were sufficient to do that. Second, fine-grained multithreading cannot hide the additional conflicts from interthread competition for shared resources, because it can issue instructions from only one thread each cycle. Neither limitation applies to simultaneous multithreading. Consequently, SMT was able to get higher instruction throughput and greater program speedups than the fine-grained multithreaded processor.

SMT vs. the multiprocessors. SMT obtained better speedups than the multiprocessors (MP2 and MP4), not only when simulating the machines at their maximum thread capability (eight for SMT, four for MP4, and two for MP2), but also for a given number of threads. At maximum thread capability, SMT’s throughput reached 6.1 instructions per cycle, compared with 4.3 for MP2 and 4.2 for MP4.

Speedups on the multiprocessors were hindered by the fixed partitioning of their hardware resources across processors, which prevents them from responding well to changes in instruction- and thread-level parallelism. Processors were idle when thread-level parallelism was insufficient; and the multiprocessor’s narrower processors had trouble exploiting large amounts of instruction-level parallelism in the unrolled loops of individual threads. An SMT processor, on the other hand, dynamically partitions its resources among threads, and therefore can respond well to variations in both types of parallelism, exploiting them interchangeably. When only one thread is executing, (almost) all machine resources can be dedicated to it; and additional threads (more thread-level parallelism) can compensate for a lack of instruction-level parallelism in any single thread.

To understand how fixed partitioning can hurt multiprocessor performance, we measured the number of cycles in which one processor needed an additional hardware resource and the resource was idle in another processor. (In SMT, the idle resource would have been used.) Fixed partitioning of the integer units, for both arithmetic and memory operations, was responsible for most of MP2’s and MP4’s inefficient resource use. The floating-point units were also a bottleneck for MP4 on this largely floating-point-intensive workload. Selectively increasing the hardware resources of MP2 and MP4 to match those of SMT eliminated a particular bottleneck. However, it did not improve speedups, because the bottleneck simply shifted to a different resource. Only when we gave each processor within MP2 and MP4 all the hardware resources of SMT did the multiprocessors obtain greater speedups. However, this occurred only when the architectures executed the same number of threads; at maximum thread capability, SMT still did better.

The speedup results also affect the implementation of these machines. Because of their narrower issue width, the multiprocessors could very well be built with a shorter cycle time. The speedups indicate that a multiprocessor’s cycle time must be less than 70% of SMT’s before its performance is comparable.




Alternative approaches to exploiting parallelism

Other researchers have studied simultaneous multithreading designs, as well as several architectures that represent alternative approaches to exploiting parallelism. Hirata and colleagues1 presented an architecture for an SMT processor and simulated its performance on a ray-tracing application. Gulati and Bagherzadeh2 proposed an SMT processor with four-way issue. Yamamoto and Nemirovsky3 evaluated an SMT architecture with separate dispatch queues for up to four threads.

Thread-level parallelism is also essential to other next-generation architectures. Olukotun and colleagues4 investigated design trade-offs for a single-chip multiprocessor and compared the performance and estimated area of this architecture with that of superscalars. Rather than building wider superscalar processors, they advocate the use of multiple, simpler superscalars on the same chip.

Multithreaded architectures have also been widely investigated. The Tera5 is a fine-grained multithreaded processor that issues up to three operations each cycle. Keckler and Dally6 describe an architecture that dynamically interleaves operations from LIW instructions onto individual functional units. Their M-Machine can be viewed as a coarser grained, compiler-driven SMT processor.

Some architectures use threads in a speculative manner to exploit both thread-level and instruction-level parallelism. Multiscalar7 speculatively executes threads using dynamic branch prediction techniques and squashes threads if control (branches) or data (memory) speculation is incorrect. The superthreaded architecture8 also executes multiple threads concurrently, but does not speculate on data dependencies.

Although all of these architectures exploit multiple forms of parallelism, only simultaneous multithreading has the ability to dynamically share execution resources among all threads. In contrast, the others partition resources either in space or in time, thereby limiting their flexibility to adapt to available parallelism.

References
1. H. Hirata et al., “An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads,” Proc. Int’l Symp. Computer Architecture, ACM, New York, 1992, pp. 136-145.
2. M. Gulati and N. Bagherzadeh, “Performance Study of a Multithreaded Superscalar Microprocessor,” Proc. Int’l Symp. High-Performance Computer Architecture, ACM, 1996, pp. 291-301.
3. W. Yamamoto and M. Nemirovsky, “Increasing Superscalar Performance through Multistreaming,” Proc. Int’l Conf. Parallel Architectures and Compilation Techniques, IFIP, Laxenburg, Austria, 1995, pp. 49-58.
4. K. Olukotun et al., “The Case for a Single-Chip Multiprocessor,” Proc. Int’l Conf. Architectural Support for Programming Languages and Operating Systems, ACM, 1996, pp. 2-11.
5. R. Alverson et al., “The Tera Computer System,” Proc. Int’l Conf. Supercomputing, ACM, 1990, pp. 1-6.
6. S.W. Keckler and W.J. Dally, “Processor Coupling: Integrating Compile Time and Runtime Scheduling for Parallelism,” Proc. Int’l Symp. Computer Architecture, ACM, 1992, pp. 202-213.
7. G.S. Sohi, S.E. Breach, and T. Vijaykumar, “Multiscalar Processors,” Proc. Int’l Symp. Computer Architecture, ACM, 1995, pp. 414-425.
8. J.-Y. Tsai and P.-C. Yew, “The Superthreaded Architecture: Thread Pipelining with Run-time Data Dependence Checking and Control Speculation,” Proc. Int’l Conf. Parallel Architectures and Compilation Techniques, IEEE Computer Society Press, Los Alamitos, Calif., 1996, pp. 49-58.

SIMULTANEOUS MULTITHREADING is an evolutionary design that attacks multiple sources of waste in wide-issue processors. Without sacrificing single-thread performance, SMT uses instruction-level and thread-level parallelism to substantially increase effective processor utilization and to accelerate both multiprogramming and parallel workloads. Our measurements show that an SMT processor achieves performance superior to that of several competing designs, such as superscalar, traditional multithreaded, and on-chip multiprocessor architectures. We believe that the efficiency of a simultaneous multithreaded processor makes it an excellent building block for future technologies and machine designs.

Several issues remain whose resolution should improve SMT’s performance even more. Our future work includes compiler and operating systems support for optimizing programs targeted for SMT and combining SMT’s multithreading capability with multiprocessor architectures.

Acknowledgments
We thank John O’Donnell of Equator Technologies, Inc. and Tryggve Fossum of Digital Equipment Corp. for the source to the Alpha AXP version of the Multiflow compiler. We also thank Jennifer Anderson of DEC Western Research Laboratory for copies of the SpecFP95 benchmarks, parallelized by the most recent version of the SUIF compiler, and Sujay Parekh for comments on an earlier draft.

This research was supported by NSF grants MIP-9632977, CCR-9200832, and CCR-9632769, NSF PYI Award MIP-9058439, DEC Western Research Laboratory, and several fellowships (Intel, Microsoft, and the Computer Measurement Group).



References
1. R. Alverson et al., “The Tera Computer System,” Proc. Int’l Conf. Supercomputing, ACM, New York, 1990, pp. 1-6.
2. K. Farkas et al., “The Multicluster Architecture: Reducing Cycle Time Through Partitioning,” to appear in Proc. 30th Ann. IEEE/ACM Int’l Symp. Microarchitecture, IEEE Computer Society Press, Los Alamitos, Calif., Dec. 1997.
3. S. Palacharla, N.P. Jouppi, and J.E. Smith, “Complexity-Effective Superscalar Processors,” Proc. Int’l Symp. Computer Architecture, ACM, 1997, pp. 206-218.
4. D.M. Tullsen et al., “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,” Proc. Int’l Symp. Computer Architecture, ACM, 1996, pp. 191-202.
5. S. McFarling, “Combining Branch Predictors,” Tech. Report TN-36, Western Research Laboratory, Digital Equipment Corp., Palo Alto, Calif., June 1993.
6. M.W. Hall et al., “Maximizing Multiprocessor Performance with the SUIF Compiler,” Computer, Dec. 1996, pp. 84-89.
7. P.G. Lowney et al., “The Multiflow Trace Scheduling Compiler,” J. Supercomputing, May 1993, pp. 51-142.
8. J.L. Lo et al., “Converting Thread-level Parallelism to Instruction-level Parallelism via Simultaneous Multithreading,” ACM Trans. Computer Systems, Aug. 1997.

Susan J. Eggers is an associate professor of computer science and engineering at the University of Washington. Her research interests are computer architecture and back-end compilation, with an emphasis on experimental performance analysis. Her current focus is on issues in compiler optimization (dynamic compilation and compiling for multithreaded machines) and processor design (multithreaded architectures). Eggers received a PhD in computer science from the University of California at Berkeley. She is a recipient of an NSF Presidential Young Investigator award and a member of the IEEE and ACM.

Joel S. Emer is a senior consulting engineer at Digital Equipment Corp., where he has worked on processor performance analysis and performance modeling methodologies for a number of VAX and Alpha CPUs. His current research interests include multithreaded processor organizations, techniques for increased instruction-level parallelism, instruction and data cache organizations, branch-prediction schemes, and data-prefetch strategies for future Alpha processors. Emer received a PhD in electrical engineering from the University of Illinois. He is a member of the IEEE, ACM, Tau Beta Pi, and Eta Kappa Nu.

Henry M. Levy is professor of computer science and engineering at the University of Washington, where his research involves operating systems, distributed and parallel systems, and computer architecture. His current research interests include operating system support for simultaneous multithreaded processors and distributed resource management in local-area and wide-area networks. Levy received an MS in computer science from the University of Washington. He is a senior member of the IEEE, a fellow of the ACM, and a recipient of a Fulbright Research Scholar award.

Jack L. Lo is currently a PhD candidate in computer science at the University of Washington. His research interests include architectural and compiler issues for simultaneous multithreading, processor microarchitecture, instruction-level parallelism, and code scheduling. Lo received a BS and an MS in computer science from Stanford University. He is a member of the ACM.

Rebecca L. Stamm is a principal hardware engineer in Digital Semiconductor’s Advanced Development Group at Digital Equipment Corp. She has worked on the architecture, logic and circuit design, performance analysis, verification, and debugging of four generations of VAX and Alpha microprocessors and is currently doing research in hardware multithreading. She holds 10 US patents in computer architecture. Stamm received a BSEE from the Massachusetts Institute of Technology and a BA in history from Swarthmore College. She is a member of the IEEE and the ACM.

Dean M. Tullsen is a faculty member in computer science at the University of California, San Diego. His research interests include simultaneous multithreading, novel branch architectures, compiling for multithreaded and other high-performance processor architectures, and memory subsystem design. He received a 1997 Career Award from the National Science Foundation. Tullsen received a PhD in computer science from the University of Washington, with emphasis on simultaneous multithreading. He is a member of the ACM.

Direct questions concerning this article to Eggers at Dept. of Computer Science and Engineering, Box 352350, University of Washington, Seattle, WA 98115; [email protected].

