
A Survey of Processors with Explicit Multithreading

THEO UNGERER University of Augsburg

BORUT ROBIČ University of Ljubljana

AND

JURIJ ŠILC Jožef Stefan Institute

Hardware multithreading is becoming a generally applied technique in the next generation of microprocessors. Several multithreaded processors are announced by industry or already into production in the areas of high-performance microprocessors, media, and network processors. A multithreaded processor is able to pursue two or more threads of control in parallel within the processor pipeline. The contexts of two or more threads of control are often stored in separate on-chip register sets. Unused instruction slots, which arise from latencies during the pipelined execution of single-threaded programs by a contemporary microprocessor, are filled by instructions of other threads within a multithreaded processor. The execution units are multiplexed between the thread contexts that are loaded in the register sets. Underutilization of a superscalar processor due to missing instruction-level parallelism can be overcome by simultaneous multithreading, where a processor can issue multiple instructions from multiple threads each cycle. Simultaneous multithreaded processors combine the multithreading technique with a wide-issue superscalar processor to utilize a larger part of the issue bandwidth by issuing instructions from different threads simultaneously. Explicit multithreaded processors are multithreaded processors that apply processes or operating system threads in their hardware thread slots. These processors optimize the throughput of multiprogramming workloads rather than single-thread performance. We distinguish these processors from implicit multithreaded processors

Categories and Subject Descriptors: C.1 [Computer Systems Organization]: Processor Architectures; C.1.3 [Processor Architectures]: Other Architecture Styles—Pipeline processors

General Terms: Design, Performance

Additional Key Words and Phrases: Blocked multithreading, interleaved multithreading, simultaneous multithreading

Authors’ addresses: T. Ungerer, Institute of Computer Science, University of Augsburg, Eichleitnerstrasse 30, D-86135 Augsburg, Germany; email: [email protected]; B. Robič, Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia; email: [email protected]; J. Šilc, Computer Systems Department, Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia; email: [email protected]. Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. © 2003 ACM 0360-0300/03/0300-0029 $5.00

ACM Computing Surveys, Vol. 35, No. 1, March 2003, pp. 29–63.

that utilize thread-level speculation by speculatively executing compiler- or machine-generated threads of control that are part of a single sequential program. This survey paper explains and classifies the explicit multithreading techniques in research and in commercial microprocessors.

1. INTRODUCTION

A multithreaded processor is able to concurrently execute instructions of different threads of control within a single pipeline. The minimal requirement for such a processor is the ability to pursue two or more threads of control in parallel within the processor pipeline, that is, the processor must provide two or more independent program counters, an internal tagging mechanism to distinguish instructions of different threads within the pipeline, and a mechanism that triggers a thread switch. Thread-switch overhead must be very low, from zero to only a few cycles. A multithreaded processor often, though not always, features multiple register sets on the processor chip.

The current interest in hardware multithreading stems from three objectives:

—Latency reduction is an important issue when designing a microprocessor. Latencies arise from data dependencies between instructions within a single thread of control. Long latencies are caused by memory accesses that miss in the cache and by long running instructions. Short latencies may be bridged within a superscalar processor by executing succeeding, nondependent instructions of the same thread. Long latencies, however, stall the processor and lessen its performance.

—Shared-memory multiprocessors suffer from memory access latencies that are several times longer than in a single-processor system. When accessing a nonlocal memory module in a distributed-shared-memory system, the memory latency is enhanced by the transfer time through the communication network. Additional latencies arise in a shared-memory multiprocessor from thread synchronizations, which cause idle times for the waiting thread. One solution to fill these idle times is to switch to another thread. However, a thread switch on a conventional processor causes saving of all registers, loading the new register values, and several more administrative tasks that often require too much time for this method to be an efficient solution.

—Contemporary superscalar microprocessors [Šilc et al. 1999] are able to issue four to six instructions each clock cycle from a conventional linear instruction stream. VLSI technology will allow future microprocessors with an issue bandwidth of 8–32 instructions per cycle [Patt et al. 1997]. As the issue rate of future microprocessors increases, the compiler or the hardware will have to extract more instruction-level parallelism (ILP) from a sequential program. However, ILP found in a conventional instruction stream is limited. ILP studies that assume a single flow of control with branch speculation have reported parallelism of around 7 instructions per cycle (IPC) with infinite resources [Wall 1991; Lam and Wilson 1992] and around 4 IPC with large sets of resources (e.g., 8 to 16 execution units) [Butler et al. 1991]. Contemporary high-performance microprocessors therefore exploit speculative parallelism by dynamic branch prediction and speculative execution of the predicted branch path to increase single-thread performance. Control speculation may be enhanced by data dependence and value speculation techniques to increase performance of a single program thread [Lipasti et al. 1996; Lipasti and Shen 1997; Chrysos and Emer 1998; Rychlik et al. 1998].

Multithreading pursues a different set of solutions by utilizing coarse-grained parallelism [Iannucci et al. 1994]. The notion of a thread in the context of multithreaded processors differs from

the notion of threads in multithreaded operating systems. In the case of a multithreaded processor, a thread is always viewed as a hardware-supported thread which can be—depending on the specific form of multithreaded processor—a full program (single-threaded UNIX process), an operating system thread (a light-weighted process, e.g., a POSIX thread), a compiler-generated thread (subordinate microthread, microthread, nanothread, etc.), or even a hardware-generated thread.

1.1. Explicit and Implicit Multithreading

The solution surveyed in this paper is the utilization of coarser-grained parallelism by explicit multithreaded processors that interleave the execution of instructions of different user-defined threads (operating system threads or processes) in the same pipeline. Multiple program counters are available in the fetch unit and the multiple contexts are often stored in different register sets on the chip. The execution units are multiplexed between the thread contexts that are loaded in the register sets. The latencies that arise in the computation of a single instruction stream are filled by computations of another thread. Thread-switching is performed automatically by the processor due to a hardware-based thread-switching policy. This ability is in contrast to conventional processors or today's superscalar processors, which use a time-consuming, operating system-based thread switch.

Depending on the specific multithreaded processor design, either a single-issue instruction pipeline (as in scalar processors) is used, or multiple instructions from possibly different instruction streams are issued simultaneously. The latter are called simultaneous multithreaded (SMT) processors and combine the multithreading technique with a wide-issue superscalar processor such that the full issue bandwidth is more often utilized by potentially issuing instructions from different threads simultaneously.

A different approach increases the performance of sequential programs by applying thread-level speculation. A thread in such processor proposals refers to any contiguous region of the static or dynamic instruction sequence. We call such processors implicit multithreaded processors, which refers to any processor that can dynamically generate threads from single-threaded programs and execute such speculative threads concurrently with the lead thread. In case of misspeculation, all speculatively generated results must be squashed. Threads generated by implicit multithreaded processors are always executed speculatively, in contrast to the threads in explicit multithreaded processors.

Examples of implicit multithreaded processor proposals are the multiscalar processor [Franklin 1993; Sohi et al. 1995; Sohi 1997; Vijaykumar and Sohi 1998], the trace processor [Rotenberg et al. 1997; Smith and Vajapeyam 1997; Vajapeyam and Mitra 1997], the single-program speculative multithreading architecture [Dubey et al. 1995], the superthreaded architecture [Tsai and Yew 1996; Li et al. 1996], the dynamic multithreading processor [Akkary and Driscoll 1998], and the speculative multithreaded processor [Marcuello et al. 1998]. Some approaches, in particular the multiscalar scheme, use compiler support for thread generation. In these examples a multithreaded processor may be characterized by a single processing unit with a single or multiple-issue pipeline able to process instructions of different threads concurrently. As a result, some of these approaches may rather be viewed as very closely coupled chip multiprocessors, because multiple subordinate processing units execute different threads under control of a single sequencer unit.

1.2. Multithreading and Multiprocessors

The first multithreaded processor approaches in the 1970s and 1980s applied multithreading at user-thread level to solve the memory access latency problem. This problem arises for each memory access after a cache miss—in particular, when a processor of a shared-memory

multiprocessor accesses a shared-memory variable located in a remote-memory module. The latency becomes a problem if the processor spends a large fraction of its time sitting idle and waiting for remote accesses to complete [Culler et al. 1998]. Latencies that arise in a pipeline are defined with a wider scope—for example, covering also long-latency operations like div or latencies due to branch interlocking.

Older multithreaded processor approaches from the 1980s usually extend scalar RISC processors by a multithreading technique and focus on effectively bridging very long remote memory access latencies. Such processors will only be useful as processor nodes in distributed shared-memory multiprocessors. However, developing a processor that is specifically designed for such multiprocessors is commonly regarded as too expensive. Multiprocessors today comprise standard off-the-shelf microprocessors and almost never specifically designed processors (with the exception of the Cray MTA). Therefore, newer multithreaded processor approaches also strive for tolerating all latencies (even single-cycle latencies) that arise from primary cache misses that hit in secondary cache, from long-latency operations, or even from unpredictable branches.

1.3. Multithreading and Dataflow

Another root of multithreading comes from dataflow architectures. Viewed from a dataflow perspective, a single-threaded architecture is characterized by computation that conceptually moves forward one step at a time through a sequence of states, each step corresponding to the execution of one enabled instruction. According to Dennis and Gao [1994], a multithreaded architecture differs from a single-threaded architecture in that there may be several enabled instructions from different threads which all are candidates for execution. A thread is viewed as a sequentially ordered block of instructions with a grain-size greater than 1 (to distinguish multithreaded architectures from fine-grained dataflow architectures). Blocking and nonblocking threads are distinguished. A nonblocking thread is formed such that its evaluation proceeds without blocking the processor pipeline (for instance, by remote memory accesses, cache misses, or synchronization waits). Evaluation of a nonblocking thread starts as soon as all input operands are available, which is usually detected by some kind of dataflow principle. Thread switching is controlled by the compiler harnessing the idea of rescheduling, rather than blocking, when waiting for data. Access to remote data is organized in a split-phase manner by one thread sending the access request to memory and another thread activating when its data is available. Thus a program is compiled into many, very small threads activating each other when data become available. The same hardware mechanisms may also be used to synchronize interprocess communications to awaiting threads, thereby alleviating operating systems overhead. In contrast, a blocking thread might be blocked during execution by remote memory accesses, cache misses, or synchronization needs. The waiting time, during which the pipeline is blocked, is lost when using a von Neumann processor, but can be efficiently bridged by a fast context switch to another thread in a multithreaded processor. Switching to another thread in a single-threaded processor usually exhibits too much context switching overhead to mask the latency efficiently. The original thread can be resumed when the reason for blocking is removed.

Use of nonblocking threads typically leads to many small threads that are appropriate for execution by a hybrid dataflow computer [Šilc et al. 1998] or by a multithreaded architecture that is closely related to hybrid dataflow. Blocking threads may just be the threads (e.g., P(OSIX) threads or Solaris threads) or whole UNIX processes of a multithreaded UNIX-based operating system, but may also be even smaller microthreads


[Figure 1 shows the taxonomy of explicit multithreading:
  issuing from a single thread in a cycle — interleaved multithreading (IMT), blocked multithreading (BMT);
  issuing from multiple threads in a cycle — simultaneous multithreading (SMT).]
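The techniques in Figure 1 differ only in which thread may feed the pipeline in a given cycle. As a rough illustration (our own sketch, not part of the survey; thread counts and switch events are made up), the following model emits the index of the thread that owns each cycle under the IMT and BMT policies:

```python
# Hypothetical sketch: which thread feeds the pipeline each cycle.
from itertools import cycle

def imt_schedule(threads, cycles):
    """Interleaved MT: a different thread is selected every processor cycle."""
    rr = cycle(range(threads))
    return [next(rr) for _ in range(cycles)]

def bmt_schedule(events, threads, cycles):
    """Blocked MT: stay on one thread until a latency event triggers a
    context switch. `events` is the set of cycles at which the running
    thread blocks (e.g., a cache miss)."""
    schedule, current = [], 0
    for c in range(cycles):
        if c in events:
            current = (current + 1) % threads  # context switch
        schedule.append(current)
    return schedule

print(imt_schedule(4, 8))          # [0, 1, 2, 3, 0, 1, 2, 3]
print(bmt_schedule({3, 6}, 4, 8))  # [0, 0, 0, 1, 1, 1, 2, 2]
```

An SMT policy would instead pick instructions from several of these threads within the same cycle, which is an issue-slot decision rather than a cycle-ownership decision.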

Fig. 1. Explicit multithreading.

generated by a compiler to utilize the potentials of a multithreaded processor.

Note that we exclude in this survey hybrid dataflow architectures that are designed for the execution of nonblocking threads. Although these architectures are often called multithreaded, we have categorized them in a previous paper [Šilc et al. 1998] as threaded dataflow or large-grain dataflow because a dataflow principle is applied to the execution of nonblocking threads. Thus, multithreaded architectures (in the more narrow sense applied here) stem from the modification of scalar RISC, VLIW/EPIC, or superscalar processors.

2. CLASSIFICATION OF EXPLICIT MULTITHREADING TECHNIQUES

Explicit multithreaded processors fall into two categories depending on whether they issue instructions from only a single thread or from multiple threads in a given cycle.

If instructions can be issued only from a single thread in a given cycle, the following two principal techniques of explicit multithreading are used (see Figure 1):

—Interleaved multithreading (IMT): An instruction of another thread is fetched and fed into the execution pipeline at each processor cycle (see Section 3).

—Blocked multithreading (BMT): The instructions of a thread are executed successively until an event occurs that may cause latency. This event induces a context switch (see Section 4).

When instructions can be issued from multiple threads in a given cycle, the following technique can be used:

—Simultaneous multithreading (SMT): Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor. Thus, the wide superscalar instruction issue is combined with the multiple-context approach (see Section 6).

Before we present these techniques in detail, we briefly review the main architectural approaches that integrate instruction-level parallelism and thread-level parallelism.

A way to look at latencies that arise in a pipelined execution is the opportunity cost in terms of the instructions that might be processed while the pipeline is interlocked, for example, waiting for a remote reference to return. The opportunity cost of single-issue processors is the number of cycles lost by latencies. Multiple-issue processors (e.g., superscalar, VLIW, etc.) potentially execute more than 1 IPC, and thus the opportunity cost is the product of the latency


cycles and the issue bandwidth plus the number of issue slots not fully filled. We expect that future single-threaded processors will continue to exploit further superscalar or other multiple-issue techniques, and thus further increase the opportunity cost of remote-memory accesses.

Fig. 2. Different approaches possible with single-issue (scalar) processors: (a) single-threaded scalar, (b) interleaved multithreading scalar, (c) blocked multithreading scalar.

Figure 2 demonstrates the opportunity cost of different approaches possible with scalar processors, while the different approaches possible with multiple-issue processors are given in Figure 3. In all these cases instructions from only a single thread are issued in a given cycle. The opportunity cost in a single-threaded superscalar can be easily determined as the number of empty issue slots (see Figure 3(a)). It consists of horizontal losses (the number of empty places in a not fully filled issue slot) and the even more harmful vertical losses (cycles where no instructions can be issued). In VLIW processors, horizontal losses appear as no-op operations (not shown in Figure 3). The opportunity cost of single-threaded VLIW is about the same as that of single-threaded superscalar. An IMT superscalar (or IMT VLIW) processor is able to fill the vertical losses of the single-threaded models with instructions of other threads, but not the horizontal losses. Further design possibilities, such as BMT superscalar or BMT VLIW processors, would fill several succeeding cycles with instructions of the same thread before context switching. The switching event is more difficult to implement and a context-switching overhead of one to several cycles might arise.

Let us now examine processors that can issue instructions from multiple threads in a given cycle. In addition to the SMT processors, one can include in this category (by the widest of definitions) also the chip multiprocessor (CMP) approach. While SMT uses a monolithic design with resources shared among threads, CMP proposes a distributed design that uses a collection of independent processors with less resource sharing. For example, Figure 4(a) shows a four-threaded eight-issue SMT processor, and Figure 4(b) shows a CMP with four two-issue processors. The SMT processor exploits ILP by selecting instructions from any thread (four in this case) that can potentially issue. If one thread has high ILP, it may fill all horizontal slots depending on the issue strategy of the processor. If multiple threads each have low ILP, instructions of several threads can be issued and executed simultaneously. In the CMP processor with several multiple-issue processors (four two-issue processors in this case) on a single chip, each CPU is assigned a thread from which it can issue multiple instructions each cycle (up to two in this case). Thus, each CPU has the same opportunity cost as in a two-issue superscalar model. The CMP is not able to hide latencies by issuing instructions of other threads. However, because horizontal losses will be smaller for two-issue than for high-bandwidth superscalars, a CMP of four two-issue processors will reach a higher utilization than an eight-issue superscalar processor.

There are a number of tradeoffs to consider when choosing between SMT and CMP. The present report does not survey CMP design, so these tradeoffs are outside its scope (but see Section 6 and the conclusions).

3. INTERLEAVED MULTITHREADING

3.1. Principles

The interleaved multithreading (IMT) technique (also called fine-grain multithreading) means that the processor


Fig. 3. Different approaches possible with multiple-issue processors: (a) single-threaded superscalar, (b) interleaved multithreading superscalar, (c) blocked multithreading superscalar.
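The horizontal and vertical losses illustrated in Figure 3 can be counted directly from an issue-slot diagram. The sketch below is our own illustration (the 4-issue trace is made up, not data from the paper):

```python
# Hypothetical illustration: count horizontal and vertical losses in an
# issue-slot diagram. Each entry is one cycle of an `issue_width`-wide
# processor, giving how many instructions were actually issued that cycle.
def losses(issued_per_cycle, issue_width):
    # vertical losses: cycles in which nothing at all could be issued
    vertical = sum(1 for n in issued_per_cycle if n == 0)
    # horizontal losses: empty places in cycles that did issue something
    horizontal = sum(issue_width - n for n in issued_per_cycle if n > 0)
    return horizontal, vertical

trace = [4, 2, 0, 0, 3, 1]   # made-up single-threaded superscalar trace
h, v = losses(trace, 4)
print(h, v)                  # 6 horizontal slot losses, 2 vertical (lost) cycles
```

In this model an IMT superscalar could reclaim the two vertical cycles with instructions of other threads, while only SMT could also attack the six horizontal losses.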

Fig. 4. Issuing from multiple threads in a cycle: (a) simultaneous multithreading, (b) chip multiprocessor.

switches to a different thread after each instruction fetch. In principle, an instruction of a thread is fed into the pipeline after the retirement of the previous instruction of that thread. Since IMT eliminates control and data dependences between instructions in the pipeline, pipeline hazards cannot arise and the processor pipeline can be easily built without the necessity of complex forwarding paths. This leads to a very simple and therefore potentially very fast pipeline—no hardware interlocking or data forwarding is necessary. Moreover, the context-switching overhead is zero cycles. Memory latency is tolerated by not scheduling a thread until the memory transaction has completed. This model requires at least as many threads as pipeline stages in the processor. Interleaving the instructions from many threads limits the processing power accessible to a single thread, thereby degrading the single-thread performance. Two possibilities to overcome this deficiency are the following:

1. The dependence look-ahead technique adds several bits to each instruction format in the ISA. The additional opcode bits allow the compiler to state the


number of instructions directly following in program order that are not data- or control-dependent on the instruction being executed. This allows the instruction scheduler in the IMT processor to feed non-data- or control-dependent instructions of the same thread successively into the pipeline. The dependence look-ahead technique may be applied to speed up single-thread performance or to increase machine utilization in the case where a workload does not comprise enough threads.

2. The interleaving technique proposed by Laudon, Gupta, and Horowitz [1994] adds caching and full pipeline interlocks to the IMT technique. Contexts may be interleaved on a cycle-by-cycle basis, yet a single-thread context is also efficiently supported.

The most well-known examples of IMT processors are used in the Heterogeneous Element Processor (HEP), the Horizon, and the Cray Multi-Threaded Architecture (MTA) multiprocessors (see Section 5.1). The HEP system has up to 16 processors, while the other two consist of up to 256 processors. Each of these processors supports up to 128 threads. While HEP uses instruction look-ahead only if there is no other work, the Horizon and Cray MTA employ the explicit dependence look-ahead technique. Further IMT processor proposals include the Multilisp Architecture for Symbolic Applications (MASA), the MIT M-Machine, the MediaProcessor of MicroUnity, and the processor SB-PRAM/HPP (all covered in the next section). An example of an IMT network processor is the five-threaded Lextra LX4580 (see Section 5.2).

In principle, the IMT technique can also be combined with a superscalar instruction issue, but simulations confirm the intuition that SMT is the more efficient technique [Eggers et al. 1997].

3.2. Examples of Past Commercial Machines and of Research Prototypes

HEP. The Heterogeneous Element Processor system, designed by Burton Smith and developed by Denelcor Inc., in Denver, CO, between 1978 and 1985, was a pioneering example of a multithreaded machine [Smith 1981]. The HEP system [Smith 1985] was designed to have up to 16 processors (Figure 5) with up to 128 threads per processor. The 128 threads were supported by a large number of registers dynamically allocated to threads. The processor pipeline had eight stages, matching the minimum number of processor cycles necessary to fetch a data item from memory to a register. Consequently, up to eight threads were in execution concurrently within a single HEP processor. However, the pipeline did not allow more than one memory, branch, or divide instruction to be in the pipeline at a given time. Allowing only a single memory instruction in the pipeline at a time resulted in poor single-thread performance and required a very large number of threads. If the thread queues were all empty, the next instruction from the last thread dispatched was examined for independence from previous instructions and, if found to be so, the instruction was also issued.

In contrast to all other IMT processors, all threads within a HEP processor shared the same register set. Multiple processors and data memories were interconnected via a pipelined switch, and any register-memory or data-memory location could be used to synchronize two processes on a producer–consumer basis by a full/empty bit synchronization on a data memory word.

MASA. The Multilisp Architecture for Symbolic Applications [Halstead and Fujita 1988] was an IMT processor proposal for parallel symbolic computation with various features intended for effective Multilisp [Halstead 1985] program execution. MASA featured a tagged architecture, multiple contexts, fast trap handling, and a synchronization bit in every memory word. Its principal novelty was the use of multiple contexts both to support interleaved execution from separate instruction streams and to speed up procedure calls and trap handling (in the same manner as register windows).
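The relation between pipeline depth and the number of threads an IMT processor needs can be seen in a toy model (our own illustration, not Denelcor's design): if each thread may have at most one instruction in flight over an eight-stage pipeline, a lone thread can issue only once every eight cycles, while eight threads keep the pipeline full.

```python
# Toy model of HEP-style interleaving: round-robin thread selection over an
# 8-stage pipeline, with at most one instruction per thread in flight.
def utilization(num_threads, stages=8, cycles=1000):
    in_flight = [0] * num_threads  # cycle at which each thread's instruction retires
    issued = 0
    for c in range(cycles):
        t = c % num_threads                  # round-robin selection
        if in_flight[t] <= c:                # nothing of this thread in the pipeline
            in_flight[t] = c + stages        # occupy the pipeline for `stages` cycles
            issued += 1
    return issued / cycles                   # fraction of cycles with an issue

print(utilization(1))   # 0.125 -- a lone thread uses one cycle in eight
print(utilization(8))   # 1.0   -- eight threads saturate the eight-stage pipeline
```

This is exactly why the HEP's restriction to a single memory instruction in the pipeline demanded a very large number of threads.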


Fig. 5. The HEP processor.
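The full/empty-bit synchronization on memory words used by HEP (and, as noted below, by Horizon, the Cray MTA, and the M-Machine) can be sketched in software as follows; this is an illustrative model only, and the class and method names are our own invention:

```python
# Illustrative sketch of full/empty-bit synchronization: a memory word
# carries a presence bit; reading an "empty" word would suspend the
# consumer thread until a producer writes the word and sets it "full".
class TaggedWord:
    def __init__(self):
        self.full = False      # the full/empty (presence) bit
        self.value = None

    def write_full(self, value):
        """Producer: store the value and set the bit to full."""
        self.value, self.full = value, True

    def read_empty(self):
        """Consumer: succeed only if the bit is full, then take the value
        and reset the word to empty (a real machine would instead suspend
        the reading thread until the word becomes full)."""
        if not self.full:
            return None        # consumer would block here
        v, self.full = self.value, False
        return v

w = TaggedWord()
print(w.read_empty())   # None: word still empty, consumer would block
w.write_full(42)
print(w.read_empty())   # 42: producer-consumer handshake completed
```

In hardware the "block" branch triggers a thread switch rather than a busy wait, which is what makes this a cheap, word-granularity producer–consumer mechanism.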

M-Machine. The MIT M-Machine [Fillo et al. 1995] supports both public and private registers for each thread, and uses the IMT technique. Each processor supports four hardware-resident user V-threads, and each V-thread supports four resident H-threads. All the H-threads in a given V-thread share the same address space, and each H-thread instruction is a three-wide VLIW. Event and exception handling are each performed by a separate V-thread. Swapping a processor-resident V-thread with one stored in memory requires about 150 cycles (1.5 μs). The M-Machine (like HEP, Horizon, and Cray MTA) employs full/empty bits for efficient, low-level synchronization. Moreover, it supports message passing and guarded pointers with segmentation for access control and memory protection.

MicroUnity MediaProcessor. MicroUnity [Hansen 1996] proposed its MediaProcessor in 1996 as a processor specialized for video streaming applications. A superscalar processor kernel is enhanced by group instructions dedicated to video streaming applications and by five thread contexts. Instructions of the different threads are executed in the IMT fashion, issuing instructions with a proposed 1-GHz clock frequency and thus providing five independent 200-MHz threads. A very light-weight context switch upon synchronous exceptions and asynchronous events permits rapid handling of virtual memory system exceptions and I/O events.

SB-PRAM and HPP. The SB-PRAM (SB stands for Saarbrücken and PRAM for parallel random access machine) [Bach et al. 1997] or high-performance PRAM (HPP) [Formella et al. 1996] is a multiple instruction multiple data (MIMD) parallel computer with shared address space and uniform memory access time due to its motivation: building a multiprocessor that is


Fig. 6. Blocked multithreading.

as close as possible to the theoretical machine model CRCW-PRAM (CRCW stands for concurrent read, concurrent write). Processor and memory modules are connected by a pipelined combining butterfly network. Network latency is hidden by pipelining several so-called virtual processors on one physical processor node in IMT mode. Instructions of 32 virtual processors are interleaved round-robin in a single SB-PRAM processor, which is therefore classified as a 32-threaded IMT processor. The project is in progress at the University of Saarland, Saarbrücken, Germany. A first prototype was running with four processors, and recently a 64-processor model, which in total supports up to 2048 threads, has been completed. Hardware details and performance measurements are reported by Paul, Bach, Bosch, Fischer, Lichtenau, and Röhrig [2002].

4. BLOCKED MULTITHREADING

4.1. Principles

The blocked multithreading (BMT) technique (sometimes also called coarse-grain multithreading) means that a single thread is executed until it reaches a situation that triggers a context switch. Usually such a situation arises when the instruction execution reaches a long-latency operation or a situation where a latency may arise. Compared to the IMT technique, a smaller number of threads is needed and a single thread can execute at full speed until the next context switch. If only a single thread runs on a BMT processor, there are no context switches and the processor performs just like a standard processor without multithreading.

In the following we classify BMT processors by the event that triggers a context switch (Figure 6) [Kreuzinger and Ungerer 1999]:

1. Static models: A context switch occurs each time the same instruction is executed in the instruction stream. The context switch is encoded by the compiler. The main advantage of this technique is that context switching can be triggered already in the fetch stage of the pipeline. The context switching overhead is one cycle (if the fetched instruction triggers the context switch and is discarded), zero cycles (if the fetched instruction triggers a context switch but is still executed in the pipeline), or almost zero cycles (if a context switch buffer is applied; see the Rhamma processor below in this section). There are two main variations of the static BMT technique:

—The explicit-switch static model, where an explicit context switch instruction exists. This model is simple to implement. However, the additional instruction causes one-cycle context switching overhead.

—The implicit-switch model, where each instruction belongs to a specific instruction class and a context switch decision depends on the instruction class of the fetched instruction. Instruction classes that cause a context


switch include load, store, and branch instructions.

The switch-on-load static model switches after each load instruction to bridge memory access latency. However, assuming an on-chip D-cache, the thread switch occurs more often than necessary, which makes an extremely fast context switch necessary, preferably with zero-cycle context switch overhead.

The switch-on-store static model switches after store instructions. The model may be used to support the implementation of sequential consistency so that the next memory access instruction can only be performed after the store has completed in memory.

The switch-on-branch static model switches after branch instructions. The model can be applied to simplify processor design by renouncing branch prediction and speculative execution. The branch misspeculation penalty is avoided, but single-thread performance is decreased. However, it may be effective for programs with a high percentage of branches that are hard to predict or even are unpredictable.

2. Dynamic models: The context switch is triggered by a dynamic event. In general, all the instructions between the fetch stage and the stage that triggers the context switch are discarded, leading to a higher context switch overhead than in static context switch models. Several dynamic models can be defined:

—The switch-on-cache-miss dynamic model switches the context if a load or store misses in the cache. The idea is that only those loads that miss in the cache and those stores that cannot be buffered have long latencies and cause context switches. Such a context switch is detected in a late stage of the pipeline. A large number of subsequent instructions have already entered the pipeline and must be discarded. Thus the context switch overhead is considerably increased.

—The switch-on-signal dynamic model switches context on the occurrence of a specific signal, for example, signaling an interrupt, trap, or message arrival.

—The switch-on-use dynamic model switches when an instruction tries to use the still missing value from a load (which, for example, missed in the cache). For example, when a compiler schedules instructions so that a load from shared memory is issued several cycles before the value is used, the context switch should not occur until the actual use of the value and will not occur at all if the load completes before the register is referenced. To implement the switch-on-use model, a valid bit is added to each register (by a simple form of scoreboard). The bit is cleared when a load to the corresponding register is issued and set when the result returns from the network. A thread switches context if it needs a value from a register whose valid bit is still cleared. This model can also be seen as a lazy model that extends either the switch-on-load static model (called lazy-switch-on-load) or the switch-on-cache-miss dynamic model (called lazy-switch-on-cache-miss).

—The conditional-switch dynamic model couples an explicit switch instruction with a condition. The context is switched only when the condition is fulfilled; otherwise the context switch is ignored. A conditional-switch instruction may be used, for example, after a group of load/store instructions. The context switch is ignored if all load instructions (in the preceding group) hit the cache; otherwise, the context switch is performed. Moreover, a conditional-switch instruction could also be added between a group of loads and their subsequent use to realize a lazy context switch (instead of implementing the switch-on-use model).

The explicit-switch, conditional-switch, and switch-on-signal techniques enhance the ISA by additional instructions. The implicit-switch technique may favor a

specific ISA encoding to simplify instruction class detection. All other techniques are microarchitectural techniques without the necessity of ISA changes.
A previous classification [Boothe and Ranade 1992; Boothe 1993] concerns the BMT technique only in a shared-memory multiprocessor environment and is restricted to only a few of the models of the BMT technique described above. In particular, the switch-on-load in [Boothe and Ranade 1992] switches only on instructions that load data from remote memory, while storing data in remote memory does not cause context switching. Likewise, the switch-on-miss model is defined so that the context is only switched if a load from remote memory misses in the cache.
Several well-known processors use the BMT technique. The MIT Sparcle [Agarwal et al. 1993] and the MSparc processors [Mikschl and Damm 1996] use both switch-on-cache-miss and switch-on-signal dynamic models. The CHoPP 1 [Mankovic et al. 1987] uses the switch-on-cache-miss and switch-on-use dynamic models, while Rhamma [Gruenewald and Ungerer 1996, 1997] applies several static and dynamic models of BMT. The EVENTS mechanism and the Komodo proposal apply multithreading for real-time event handling. These, as well as some other processors using the BMT technique, are described in the next section, while the current commercial high-performance processors—that is, the two-threaded IBM RS64 IV, the Sun MAJC—and the network processors of the Intel IXP, IBM PowerNP, Vitesse IQ2x00, and AMCC nP families are covered in Section 5.

4.2. Examples of Past Commercial Machines and of Research Prototypes

CHoPP 1. The Columbia Homogeneous Parallel Processor 1 (CHoPP 1) [Mankovic et al. 1987] was designed by CHoPP Sullivan Computer Corporation (ANTs since 1999) in 1987. The system was a shared-memory MIMD with up to 16 powerful processors. High sequential performance is due to issuing multiple instructions on each clock cycle, zero-delay branch instructions, and fast execution of individual instructions. Each processor can support up to 64 threads and uses the switch-on-cache-miss dynamic interleaving model.

MDP in J-Machine. The MIT Jellybean Machine (J-Machine) [Dally et al. 1992] is so-called because it is to be built entirely of a large number of "jellybean" components. The initial version uses an 8 x 8 x 16 cube network, with possibilities of expanding to 64K nodes. The "jellybeans" are message-driven processor (MDP) chips, each of which has a 36-bit processor, a 4K word memory, and a router with communication ports for bidirectional transmissions in three dimensions. External memory of up to 1M words can be added per processor. The MDP creates a task for each arriving message. In the prototype, each MDP chip has four external memory chips that provide 256K memory words. However, access is through a 12-bit data bus, and with an error correcting cycle, the access time is four memory cycles per word. Each communication port has a 9-bit data channel. The routers provide support for automatic routing from source to destination. The latency of a transfer is 2 μs across the prototype machine, assuming no blocking. When a message arrives, a task is created automatically to handle it in 1 μs. Thus, it is possible to implement a shared-memory model using message passing, in which a message provides a fetch address and an automatic task sends a reply with the desired data.

MIT Sparcle. The MIT Sparcle processor [Agarwal et al. 1993] is derived from a SPARC RISC processor. The eight overlapping register windows of a SPARC processor are organized as four independent nonoverlapping thread contexts, each using two windows, one as a register set, the other as a context for trap and message handlers (Figure 7).
Context switches are used only to hide long memory latencies since small pipeline delays are assumed to be hidden by the proper ordering of instructions by


an optimizing compiler. The MIT Sparcle processor switches to another context in the case of a remote cache miss or a failed synchronization (switch-on-cache-miss and switch-on-signal strategies). Thread switching is triggered by external hardware, that is, by the cache/memory controller. Reloading of the pipeline and the software implementation of the context switch requires 14 processor cycles.

Fig. 7. Sparcle register usage.

The MIT Alewife distributed shared-memory multiprocessor [Agarwal et al. 1995] is based on the multithreaded MIT Sparcle processor. The multiprocessor has been operational since May 1994. A node in the Alewife multiprocessor comprises a Sparcle processor, an external floating-point unit, cache, and a directory-based cache controller that manages cache-coherence, a network router chip, and a memory module (Figure 8).

Fig. 8. An Alewife node.

The Alewife multiprocessor uses a low-dimensional direct interconnection network. Despite its distributed-memory architecture, Alewife allows efficient shared-memory programming through a multilayered approach to locality management. Communication latency and network bandwidth requirements are reduced by a directory-based cache-coherence scheme referred to as LimitLESS directories. Latencies still occur although communication locality is enhanced by run-time and compile-time partitioning and placement of data and processes.

Rhamma. The Rhamma processor [Gruenewald and Ungerer 1996, 1997] was developed at the University of Karlsruhe, Germany, as an experimental microprocessor that bridges all kinds of latencies by a very fast context switch. The execution unit (EU) and load/store unit (LSU) are decoupled and work concurrently on different threads. A number of register sets used by different threads are accessed by the LSU as well as the EU. Both units are connected by FIFO buffers for continuations, each denoting the thread tag and the instruction pointer

of a thread. The Rhamma processor combines several static and dynamic models of the BMT technique.
The EU of the Rhamma processor switches the thread whenever a load/store or control-flow instruction is fetched. Therefore, if the instruction format is chosen appropriately, a context switch can be recognized by predecoding the instructions already during the instruction fetch stage.
In the case of a load/store, the continuation is stored in the FIFO buffer of the LSU, a new continuation is fetched from the EU's FIFO buffer, and the context is switched to the register set given by the new thread tag and the new instruction pointer. The LSU loads and stores data of a different thread in a register set concurrently to the work of the EU. There is a loss of one processor cycle in the EU for each sequence of load/store instructions. Only the first load/store instruction forces a context switch in the EU; succeeding load/store instructions are executed in the LSU.
The loss of one processor cycle in the EU when fetching the first of a sequence of load/store instructions can be avoided by a so-called context switch buffer (CSB) whenever the sequence is executed the second time. The CSB is a hardware buffer, collecting the addresses of load/store instructions that have caused context switches. Before fetching an instruction from the I-cache, the instruction fetch stage checks whether the address can be found in the context switch buffer. In that case the thread is switched immediately and an instruction of another thread is fetched instead of the load/store instruction. No bubble occurs.

PL/PS-Machine. The Preload and Poststore Machine (or PL/PS-Machine) [Kavi et al. 1997] is most similar to the Rhamma processor. It also decouples memory accesses from thread execution by providing separate units. This decoupling eliminates thread stalls due to memory accesses and makes thread switches due to cache misses unnecessary. Threads are created when all data is preloaded into the register set holding the thread's context, and the results from an execution thread are poststored. Threads are nonblocking and each thread is enabled when the required inputs are available (i.e., data-driven at a coarse grain). The separate load/store/sync processor performs preloads and schedules ready threads on the pipeline. The pipeline processor executes the threads which will require no memory accesses. On completion the results from the thread are poststored by the load/store/sync processor.

The DanSoft nanothreading approach. The nanothreading and the microthreading approaches use multithreading but spare the hardware complexity of providing multiple register sets.
Nanothreading [Gwennap 1997], proposed for the DanSoft processor, dismisses full multithreading for a nanothread that executes in the same register set as the main thread. The DanSoft nanothread requires only a 9-bit program counter and some simple control logic. Whenever the processor stalls on the main thread, it automatically begins fetching instructions from the nanothread. Only one register set is available, so the two threads must share the register set. Typically the nanothread will focus on a simple task, such as prefetching data into a buffer, that can be done asynchronously to the main thread.
In the DanSoft processor nanothreading is used to implement a new branch strategy that fetches both sides of a branch. A static branch prediction scheme is used, where branch instructions include 3 bits to direct the instruction stream. The bits specify eight levels of branch direction. For the middle four cases, denoting low confidence on the branch prediction, the processor fetches from both the branch target and the fall-through path. If the branch is mispredicted in the main thread, the backup path executed in the nanothread generates a misprediction penalty of only one to two cycles.
The DanSoft processor proposal is a dual-processor CMP, each processor featuring a VLIW instruction set and the nanothreading technique. Each processor

contains an integer processor, but the two processor cores share a floating-point unit as well as the system interface. However, the nanothread technique might also be used to fill the instruction issue slots of a wide superscalar approach as in SMT.

Microthreading. The microthreading technique [Bolychevsky et al. 1996] is similar to nanothreading. All threads share the same register set and the same run-time stack. However, the number of threads is not restricted to two. When a context switch arises, the program counter is stored in a continuation queue. The PC represents the minimum possible context information for a given thread. Microthreading is proposed for a modified RISC processor.
Both techniques—nanothreading as well as microthreading—are proposed in the context of a BMT technique, but might also be used to fill the instruction issue slots of a wide superscalar approach as in SMT.
The drawback to nanothreading and microthreading is that the compiler has to schedule registers for all threads that may be active simultaneously, because all threads execute in the same register set. The solution to this problem has been recently described in Jesshope and Luo [2000] and Jesshope [2001]. These papers describe the dynamic allocation of registers using vector instruction sets and also the ease with which the architecture can be developed as a CMP.

MSparc and the EVENTS mechanism. An approach similar to the MIT Sparcle processor was taken at the University of Oldenburg, Germany, with the MSparc processor [Mikschl and Damm 1996]. MSparc supports up to four contexts on a chip and is compatible with standard SPARC processors. Switching is supported by hardware and can be achieved within one processor cycle. However, a four-cycle overhead is introduced due to pipeline refill. The multithreading policy is BMT with the switch-on-cache-miss policy as in the MIT Sparcle processor.
The rapid context-switching ability of the multithreading technique leads to approaches that apply multithreading in new areas, in particular embedded real-time systems, which are proposed by the EVENTS approach and by the Komodo project.
The EVENTS scheduler [Lüth et al. 1997; Metzner and Niehaus 2000] acts as a processor-external thread scheduler by supervising several MSparc processors. Context switches are triggered due to real-time scheduling techniques and the computation-intensive real-time threads are assigned to the different MSparc processors. The EVENTS mechanism is implemented as a field of field-programmable gate arrays (FPGAs).

Komodo microcontroller. The Komodo microcontroller [Brinkschulte et al. 1999a, 1999b] implements real-time scheduling on an instruction-by-instruction basis deeply embedded in the multithreaded processor core, in contrast to the processor-external EVENTS approach. The Komodo microcontroller is a multithreaded microcontroller aimed at embedded real-time systems with a hardware event handling mechanism that allows handling of simultaneous overlapping events with hard real-time requirements. The main purpose for the use of multithreading within the Komodo microcontroller is not latency utilization, but extremely fast reactions on real-time events.
Real-time Java threads are used as interrupt service threads (ISTs)—a new hardware event handling mechanism that replaces the common interrupt service routines (ISRs). The occurrence of an event activates an assigned IST instead of an ISR. The Komodo microcontroller supports the concurrent execution of multiple ISTs and zero-cycle overhead context switching, and triggers ISTs by a switch-on-signal context switching strategy.
As shown in Figure 9, the four-stage pipelined processor core consists of an instruction-fetch unit, a decode unit, an operand fetch unit, and an execution unit. Four register sets are provided on the processor chip. A Java bytecode instruction is decoded either to a single


micro-op or a sequence of micro-ops, or a trap routine is called. Each micro-op is propagated through the pipeline with its thread id. Micro-ops from multiple threads can be simultaneously present in the different pipeline stages.

Fig. 9. Block diagram of the Komodo microcontroller.

The instruction fetch unit holds four program counters (PCs) with dedicated status bits (e.g., thread active/suspended); each PC is assigned to a separate thread. Four portions are fetched over the memory interface and put in the appropriate instruction window (IW). Several instructions may be contained in the fetch portion because of the average Java bytecode length of 1.8 bytes. Instructions are fetched depending on the status bits and fill levels of the IWs.
The instruction decode unit contains the IWs, dedicated status bits (e.g., priority), and counters for the implementation of the proportional share scheme. A priority manager decides from which IW the next instruction will be decoded.
The real-time scheduling algorithms fixed priority preemptive, earliest deadline first, least laxity first, and guaranteed percentage (GP) scheduling are implemented in the priority manager for next instruction selection [Kreuzinger et al. 2000]. The GP scheduling scheme assigns portions of the processor execution power to each hardware thread and allows a strict isolation of the different real-time threads—an urgently demanded feature of real-time systems. GP scheduling is only efficient if implemented within a multithreaded processor [Brinkschulte et al. 2002].
The priority manager applies one of the implemented scheduling schemes for IW selection. However, latencies may result from branches or memory accesses. To avoid pipeline stalls, instructions from threads of less priority can be fed into the pipeline. The decode unit predicts the latency after such an instruction, and proceeds with instructions from other IWs.
External signals are delivered to the signal unit from the components of the microcontroller core as, for example, timer, counter, or serial interface. By the occurrence of such a signal, the corresponding IST will be activated. As soon as an IST activation ends, its assigned real-time thread is suspended and its status is stored. An external signal may activate the same thread again.
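The GP idea above can be sketched in a few lines. The selection rule (most under-served thread first) and the 100-cycle interval are illustrative assumptions, not the priority manager's actual hardware implementation:

```python
# Sketch of guaranteed percentage (GP) scheduling: each hardware thread
# is guaranteed a share of the processor's execution power within a
# fixed interval, chosen instruction by instruction. The selection rule
# here (largest remaining allocation first) is an illustrative stand-in
# for the priority manager.

def gp_schedule(shares, cycles=100):
    remaining = dict(shares)                   # e.g., {"A": 50, "B": 30, "C": 20}
    executed = {t: 0 for t in shares}
    for _ in range(cycles):
        t = max(remaining, key=remaining.get)  # most under-served thread next
        if remaining[t] > 0:
            remaining[t] -= 1
            executed[t] += 1
    return executed

out = gp_schedule({"A": 50, "B": 30, "C": 20})
assert out == {"A": 50, "B": 30, "C": 20}      # every thread got its percentage
```

Because the decision is made every cycle, a thread can never starve another beyond its guaranteed share, which is the strict isolation property the text mentions.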


Further investigations within the Komodo project [Brinkschulte et al. 2000] have covered real-time garbage collection on a multithreaded microcontroller [Fuhrmann et al. 2001], and a distributed real-time middleware called OSA+ [Brinkschulte et al. 2000].
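The switch-on-signal IST activation described for Komodo can be sketched as follows; the event names and the two-state thread model are illustrative assumptions, not Komodo's actual design:

```python
# Sketch of interrupt service threads (ISTs): an external signal
# activates the real-time thread bound to it (instead of invoking an
# ISR); when the IST activation ends, the thread is suspended again
# and can be reactivated by the next signal.

class ISTController:
    def __init__(self, bindings):
        self.bindings = bindings                  # event -> thread id
        self.state = {t: "suspended" for t in bindings.values()}

    def signal(self, event):
        t = self.bindings[event]                  # switch-on-signal activation
        self.state[t] = "active"
        return t

    def activation_ends(self, thread):
        self.state[thread] = "suspended"          # status stored, reactivatable

c = ISTController({"timer": "IST0", "serial": "IST1"})
assert c.signal("timer") == "IST0"
assert c.state["IST0"] == "active"
c.activation_ends("IST0")
assert c.state["IST0"] == "suspended"
assert c.signal("timer") == "IST0"                # same thread activated again
```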

5. INTERLEAVED AND BLOCKED MULTITHREADING IN CURRENT AND PROPOSED MICROPROCESSORS

5.1. High-Performance Processors

Cray MTA. The Cray Multi-Threaded Architecture (MTA) computer [Alverson et al. 1990] features a powerful VLIW instruction set and interleaved multithreading technique. It was designed by the Tera Computer Company, Seattle, WA, now Cray Corporation (Tera purchased Cray and adopted the name).
The MTA inherits much of the design philosophy of the HEP as well as a (paper) architecture called Horizon [Thistle and Smith 1988]. The latter was designed for up to 256 processors and up to 512 memory modules in a 16 x 16 x 6-node internal network. Horizon (like HEP) employed a global address space and memory-based synchronization through the use of full/empty bits at each location. Each processor supported 128 independent instruction streams by 128 register sets with context switches occurring at every clock cycle. Unlike HEP, Horizon allowed multiple memory operations from a thread to be in the pipeline simultaneously.
MTA systems are constructed from resource modules, each of which contains up to six resources. A resource can be a computational processor (CP), an I/O processor (IOP), an I/O cache (IOC) unit, and either two or four memory units (MUs) (Figure 10). Each resource is individually connected to a separate routing node in the depopulated three-dimensional (3-D) torus interconnection network [Almasi and Gottlieb 1994]. Each routing node has three or four communication ports and a resource port. There are several routing nodes per CP, rather than the several CPs per routing node. In particular, the number of routing nodes is at least p^(3/2), where p is the number of CPs. This allows the bisection bandwidth to scale linearly with p, while the network latency scales as p^(1/2). The communication network is capable of supporting data transfers to and from memory on each clock tick in both directions, as are all of the links between the routing nodes themselves.

Fig. 10. The Cray MTA computer system.

The Cray MTA custom chip CP (Figure 11) is a VLIW pipelined processor using the IMT technique. Each thread is associated with one 64-bit stream status word, thirty-two 64-bit general registers, and eight 64-bit target registers. The processor may switch context every cycle (3-ns cycle period) between as many as 128 distinct threads (called streams by the designers of the MTA), thereby hiding up to 128 cycles (384 ns) of memory latency. Since the context switching is so fast, the processor has no time to swap the processor state. Instead, it has 128 copies of everything, that is, 128 stream status words, 4096 general registers, and 1024 target registers. Dependences between instructions are explicitly encoded by the compiler using explicit dependence look-ahead. Each instruction contains a 3-bit look-ahead field that explicitly specifies how many instructions from this thread will be issued before encountering an instruction that depends on the current one. Since seven is the maximum possible look-ahead value, at most eight instructions (i.e., 24 operations) from each


thread can be concurrently executing in different stages of a processor's pipeline. In addition, each thread can issue as many as eight memory references without waiting for earlier ones to finish, further augmenting the memory latency tolerance of the processor. The CP has a load/store architecture with three addressing modes and 32 general-purpose 64-bit registers. The three-wide VLIW instructions are 64 bits. Three operations can be executed simultaneously per instruction: a memory reference operation (M-op), an arithmetic/logical operation (A-op), and a branch or simple arithmetic/logical operation (C-op). If more than one operation in an instruction specifies the same register or setting of condition codes, the priority is M-op > A-op > C-op.

Fig. 11. The MTA computational processor.

Every processor has a clock register that is synchronized with its counterparts in the other processors and counts up once per cycle. In addition, the processor counts the total number of unused instruction issue slots (measuring the degree of underutilization of the processor) and the time-integral of the number of instruction streams ready to issue (measuring the degree of overutilization of the processor). All three counters are user-readable in a single unprivileged operation. Eight counters are implemented in each of the protection domains of the processor. All are user-readable in a single unprivileged operation. Four of these counters accumulate the number of instructions issued, the number of memory references, the time-integral of the number of instruction streams, and the time-integral of the number of messages in the network. These counters are also used for accounting. The other four counters are configurable to accumulate events from any four of a large set of additional sources within the processor, including memory operations, jumps, traps, and so on.
Thus, the Cray MTA exploits parallelism at all levels, from fine-grained ILP within a single processor to parallel programming across processors, to multiprogramming among several applications simultaneously. Consequently, processor

Cray MTA-2 system was Single-threaded-program execution shipped in late December 2001. may be accelerated by speculative mul- tithreading with microthreads that are HTMT Architecture Project. In 1997, the dependent on nonspeculative threads Jet Propulsion Laboratory in collabo- [Tremblay et al. 2000]. Virtual channels ration with several other institutions communicate shared register values (The California Institute of Technology, between different microthreads with Princeton University, University of Notre produced-consumer synchronization. In Dame, University of Delaware, The State case of misspeculation, the speculative University of New York at Stonybrook, microthread is discarded. Tera Computer Company (now Cray), Argonne National Laboratories) initiated IBM RS64 IV. IBM developed a mul- the Hybrid Technology MultiThreaded tithreaded PowerPC processor, which is (HTMT) Architecture Project whose used in the IBM iSeries and pSeries com- is to design the first petaflops computer mercial servers. The processor—originally by 2005–2007. The computer will be based code-named SStar—is called RS64 IV and on radically new hardware became available for purchase in the such as superconductive processors, op- fourth quarter of 2000. Because of its tical holographic storages, and optical optimization for commercial work- packet switched networks [ 1997]. load (i.e., on-line transaction processing, There will be 4096 superconductive pro- resource planning, Web serv- cessors, called . A SPELL processor ing, and collaborative groupware) with consists of 16 multistream units, where typically large and function-rich applica- each multistream unit is a 64-bit, deeply tions, a two-threaded block-interleaving pipelined integer processor capable of ex- approach with a switch-on-cache-miss ecuting up to eight parallel threads. As a model was chosen. 
A so-called thread- result, each SPELL is capable of running switch buffer, which holds up to eight in- in parallel up to 128 threads, arranged in structions from the background thread, 16 groups of 8 threads [Dorojevets 2000]. reduces the cost of pipeline reload after

a thread switch. These instructions may be introduced into the pipeline immediately after a thread switch. A significant throughput increase by multithreading while adding less than 5% to the chip area was reported in Borkenhagen et al. [2000].

5.2. Network Processors

Currently multithreading proves successful in network processors. Network processors are expected to become the fundamental building blocks for networks in the same fashion that microprocessors are for today's personal computers [Glaskowsky 2002]. Network processors offer real-time processing of multiple data streams, enhanced security, and Internet protocol (IP) packet handling and forwarding capabilities.
Network processors apply multithreading to bridge latencies during memory accesses. Hard real-time events, that is, events whose deadline should never be missed, are requirements for network processors that need specific instruction scheduling during multithreaded execution. Multithreading is typically restricted to the internal architecture of the parallel working picoprocessors or microengines that perform the data traffic handling.

Intel IXP network processors. The Intel IXP1200 network processor [Intel Corporation 2002]—the first in the Intel Internet Exchange Architecture (Intel IXA) network processor family that uses multithreading—combines an Intel StrongARM processor core with six programmable multithreaded microengines. The IXP2400 network processor is based on an Intel XScale core with eight multithreaded microengines, and the IXP2800 contains even 16 multithreaded microengines [Intel Corporation 2002]. The microengines (four-threaded BMT) are used for packet forwarding and traffic management, whereas the XScale processor core is applied for control tasks. Data and event signals are shared among threads and microengines at virtual zero latency while maintaining coherence. The IXP2800 is claimed to be capable of enabling 10-Gb/s wire-speed processing (Gb = gigabits).

IBM PowerNP network processor. The IBM PowerNP NP4GS3 network processor [IBM Corporation 1999] integrates a switching engine, search engine, frame processors, and multiplexed medium access controllers (MACs). The NP4GS3 includes an embedded IBM PowerPC 405 processor core, 16 programmable multithreaded picoprocessors, 40 Fast Ethernet/4-Gb MACs, and hardware accelerators performing functions such as tree searches, frame forwarding, frame filtering, and frame alteration.
The multithreaded picoprocessors are used for packet manipulation while the PowerPC processor core handles control functions. Each picoprocessor is a 32-bit, 133-MHz RISC core that supports two threads in the hardware and includes nine hardwired function units for tasks such as string copying, bandwidth-policy checking, and check summing.
The aggregated bandwidth of the NP4GS3 is some 4 Gb/s, allowing it to manage a single OC-48 optical channel or up to 40 Fast Ethernet ports [Glaskowsky 2002].

Vitesse IQ2x00 network processors. The Vitesse IQ2x00 family consists of IQ2000 and IQ2200 network processors, whose architecture is based on multiple packet processors. Each packet processor is a five-threaded BMT that switches on cache miss [Glaskowsky 2002].

Lexra network processors. The Lexra network processor employs packet processors, which are used in parallel to implement IP layer processing in software. The current packet processor is the LX4580 [Gelinas et al. 2002]. It implements the MIPS32 architecture, including recent architectural enhancements, along with instructions for optimized packet processing. By contrast with the Intel IXP and Vitesse IQ2x00, the LX4580 uses the IMT approach (with five threads).

AMCC nP network processors. AMCC's nP network processor family is based on the


MMC Networks' nPcore multithreaded packet processor. Each processor in this family consists of one or several nPcore processing units that are connected to an external search engine and a CPU [Glaskowsky 2002]. In particular, the nP7250 network processor has two nPcores with 2.4 Gb/s aggregated bandwidth. The current top of AMCC's line is the nP7510, whose six nPcore processing units offer 10 Gb/s.

Other network processors. A number of other network processors, such as the C-5 and the Fast Pattern Processor (as part of Agere's PayloadPlus chip set), seem to include multithreading features. Unfortunately, it is often hard to find out, because multithreading is not the main feature of network processors and is often hardly mentioned. The market for network processors is extremely competitive, so the above-mentioned processors will soon be replaced by their successors.

6. SIMULTANEOUS MULTITHREADING

6.1. Principles

IMT and BMT are techniques which are most efficient when applied to scalar RISC or VLIW processors. Combining multithreading with the superscalar technique naturally leads to a technique where all hardware contexts are active simultaneously, competing each cycle for all available resources. This technique, called simultaneous multithreading (SMT), inherits from superscalars the ability to issue multiple instructions each cycle; and like multithreaded processors it contains hardware resources for multiple contexts. The result is a processor that can issue multiple instructions from multiple threads each cycle. Therefore, not only can unused cycles in the case of latencies be filled by instructions of alternative threads, but so can unused issue slots within one cycle.

Thread-level parallelism can come from either multithreaded, parallel programs or from multiple, independent programs in a multiprogramming workload, while ILP is utilized from the individual threads. Because an SMT processor simultaneously exploits coarse- and fine-grain parallelism, it uses its resources more efficiently and thus achieves better throughput and speedup than single-threaded superscalar processors for multithreaded (or multiprogramming) workloads. The tradeoff is a slightly more complex hardware organization.

If all (hardware-supported) threads of an SMT processor always execute threads of the same process, preferably in single-program multiple-data fashion, a unified (primary) I-cache may prove useful, since the code can be shared between the threads. The primary D-cache may be unified or separated between the threads depending on the access mechanism used.

The SMT technique combines a wide superscalar instruction issue with multithreading by providing several register sets on the processor and issuing instructions from several instruction queues simultaneously. Therefore, the issue slots of a wide-issue processor can be filled by operations of several threads. Latencies occurring in the execution of single threads are bridged by issuing operations of the remaining threads loaded on the processor. In principle, the full issue bandwidth can be utilized.

The resources of SMT processors can be organized in two ways:

1. Resource sharing: Instructions of different threads share all resources like the fetch buffer, the physical registers for renaming registers of different register sets, the instruction window, and the reorder buffer. Thus SMT adds minimal hardware complexity to conventional superscalars; hardware designers can focus on building a fast single-threaded superscalar and add multithread capability on top. The complexity added to superscalars by multithreading includes the thread tag for each internal instruction representation, multiple register sets, and the abilities of the fetch and the retire units to fetch/retire instructions of different threads.


2. Resource replication: The second organizational form replicates all internal buffers of a superscalar such that each buffer is bound to a specific thread. Instruction fetch, decode, rename, and retire units may be multiplexed between the threads or be duplicated themselves. The issue unit is able to issue instructions of different instruction windows simultaneously to the execution units. This form of organization adds more changes to the organization of superscalar processors but leads to a natural partitioning of the instruction window and simplifies the issue and retire stages.

The fetch unit of an SMT processor can take advantage of the interthread competition for instruction bandwidth in two ways. First, it can partition this bandwidth among the threads and fetch from several threads each cycle. In this way, it increases the probability of fetching only nonspeculative instructions. Second, the fetch unit can be selective about which threads it fetches. For example, it may fetch those that will provide the most immediate performance benefit.

The main drawback to SMT may be that it complicates the issue stage, which is always central to the multiple threads. A functional partitioning as required by the on-chip wire delays of future microprocessors is not easily achieved with an SMT processor due to the centralized instruction issue. A separation of the thread queues as in the SMT approaches with resource replication is a possible solution, although it does not remove the central instruction issue.

Closely related to SMT are architectural proposals that only slightly enhance a superscalar processor by the ability to pursue two or more threads only for a short time. In principle, predication is the first step in this direction. An enhanced form of predication is able to issue and execute predicated instructions even if the predicate is not yet resolved. A further step is dynamic predication [Klauser et al. 1998a] as applied for the Polypath architecture [Klauser et al. 1998b], which is a superscalar enhanced to handle multiple threads internally. Another step to multithreading is simultaneous subordinate microthreading [Chappell et al. 1999], which is a modification of superscalars to run threads at microprogram level concurrently. A new microthread is spawned either event-driven by hardware or by an explicit spawn instruction. The subordinate microthread could be used, for example, to improve branch prediction of the primary thread or to preload data.

A competing approach to SMT is represented by the chip multiprocessors (CMPs) [Ungerer et al. 2002]. A CMP integrates two or more complete processors on a single chip. Every unit of a processor is replicated and used independently of its copies on the chip. CMP is easier to implement, but only SMT has the ability to hide latencies. Examples of CMPs are the Texas Instruments TMS320C8x [Texas Instruments 1994], the Hydra [Hammond and Olukotun 1998], the Compaq Piranha [Barroso et al. 2000], and the IBM POWER4 chip [Tendler et al. 2002].

Projects simulating different configurations of SMT are discussed next, while SMT approaches that can be found in current microprocessors are given in Section 7.

6.2. Examples of Prototypes and Simulated SMT Processors

MARS-M. The MARS-M multithreaded computer system [Dorozhevets and Wolcott 1992] was developed and manufactured within the Russian Next-Generation Computer Systems program during 1982–1988. The MARS-M was the first system where the SMT technique was implemented in a real design. The system has a decoupled multiprocessor architecture with execution, address, control, memory, and peripheral multithreaded subsystems working in parallel and communicating via multiple register FIFO-queues. The execution and address subsystems are multiple-unit VLIW processors with SMT. Within each of these two subsystems up to four threads can run in parallel on their hardware contexts

allocated by the control subsystem, while sharing the subsystem's set of pipelined functional units and resolving resource conflicts on a cycle-by-cycle basis. The control processor uses IMT with a zero-overhead context switch upon issuing a memory load operation. In total, up to 12 threads can run simultaneously within MARS-M with a peak instruction issue rate of 26 instructions per cycle. Medium-scale integration (MSI) and large-scale integration emitter-coupled logic (LSI ECL) elements with a minimum delay of 2.5 ns are used to implement processor logic. There are 739 boards, each of which can contain up to 100 ICs mounted on both sides. The MARS-M hardware is water-cooled and occupies three cabinets. The initial clock period of 100 ns was increased to 108 ns at the debugging stage.

Matsushita Media Research Laboratory processor. The multithreaded processor of the Media Research Laboratory of Matsushita Electric Ind. (Japan) was another pioneering approach to SMT [Hirata et al. 1992]. Instructions of different threads are issued simultaneously to multiple execution units. Simulation results on a parallel ray-tracing application showed that using eight threads, a speedup of 3.22 in the case of one load/store unit, and of 5.79 in the case of two load/store units, can be achieved over a conventional RISC processor. However, caches or translation look-aside buffers are not simulated, nor is a branch prediction mechanism.

"Multistreamed superscalar" at the University of Santa Barbara. Serrano et al. [1994] and Yamamoto and Nemirovsky [1995] at the University of Santa Barbara, CA, extended the interleaved multithreading (then called multistreaming) technique to a general-purpose superscalar processor architecture and presented an analytical model of multithreaded superscalar performance, backed up by simulation.

SMT processor at the University of Washington. The SMT processor architecture, proposed by Tullsen, Eggers, and Levy [1995] at the University of Washington, Seattle, WA, surveys enhancements of the Alpha 21164 processor. Simulations were conducted to evaluate processor configurations of an up to eight-threaded and eight-issue superscalar. This maximum configuration showed a throughput of 6.64 IPC due to multithreading, using the SPEC92 benchmark suite and assuming a processor with 32 execution units (among them multiple load/store units). Descriptions and the results are based on the multiprogramming model.

The next approach was based on a hypothetical out-of-order instruction issue superscalar microprocessor that resembles the MIPS R10000 and HP PA-8000 [Tullsen et al. 1996; Eggers et al. 1997]. Both approaches followed the resource-sharing organization aiming at an only slight hardware enhancement of a superscalar processor. The latter approach evaluated more realistic processor configurations, and presented implementation issues and solutions to register file access and instruction scheduling for a minimal change to superscalar processor organization.

In the simulations of the latter architectural model (Figure 12), eight threads and an eight-issue superscalar organization are assumed. Eight instructions are decoded, renamed, and fed to either the integer or floating-point instruction window. Unified buffers are used. When operands become available, up to eight instructions are issued out of order per cycle, executed, and retired. Each thread can address 32 architectural integer (and floating-point) registers. These registers are renamed to a large physical register file of 356 physical registers. The larger register file requires a longer access time. To avoid increasing the processor cycle time, the pipeline is extended by two stages to allow two-cycle register reads and two-cycle writes. Renamed instructions are placed into one of two instruction windows. The 32-entry integer instruction window handles integer and all load/store instructions, while the 32-entry floating-point instruction window handles


Fig. 12. SMT processor architecture.

floating-point instructions. Three floating-point and six integer units are assumed. All execution units are fully pipelined, and four of the integer units also execute load/store instructions. The I- and D-caches are multiported and multibanked, but common to all threads.

The multithreaded workload consists of a program mix of SPEC92 benchmark programs that are executed simultaneously as different threads. The simulations evaluate different fetch and instruction issue strategies.

An RR.2.8 fetching scheme to access the multiported I-cache—that is, in each cycle, eight instructions are fetched in round-robin policy from each of two different threads—was superior to other schemes like RR.1.8, RR.4.2, and RR.2.4 with less fetching capacity. As a fetch policy, the ICOUNT feedback technique, which gives the highest fetch priority to the threads with the fewest instructions in the decode, renaming, and queue pipeline stages, proved superior to the BRCOUNT scheme, which gives the highest priority to those threads that are least likely to be on a wrong path, and the MISSCOUNT scheme, which gives priority to the threads that have the fewest outstanding D-cache misses. The IQPOSN policy, which gives the lowest priority to the oldest instructions by penalizing those threads with instructions closest to the head of either the integer or the floating-point queue, is nearly as good as ICOUNT and better than BRCOUNT and MISSCOUNT, which are all better than round-robin fetching. The ICOUNT.2.8 fetching strategy reached an IPC of about 5.4 (the RR.2.8 only reached about 4.2). Most interesting is that the best fetching strategy turned out to address neither mispredicted branches nor blocking due to cache misses alone, but a mix of both and perhaps some other effects.

In a single-threaded processor, choosing instructions for issue that are least likely to be on a wrong path is always achieved by selecting the oldest instructions, those deepest into the instruction window. For the SMT processor, several different issue strategies have been evaluated, like oldest instructions first, speculative instructions last, and branches first. Issue bandwidth is not a bottleneck on these simulated machines and all strategies seem to perform equally well, so the simplest mechanism is to be preferred. Also, doubling the size of instruction windows (but not the number of searchable instructions for issue) has no significant effect on the IPC. Even an infinite number of execution units increases throughput by only 0.5%.
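The ICOUNT.2.8 heuristic above can be made concrete with a small sketch: each cycle, the two threads with the fewest instructions in the pre-issue (decode, rename, and queue) stages get fetch priority, and up to eight instruction slots are filled from them in priority order. This is a minimal illustration under stated assumptions, not the simulator's exact bookkeeping; the function and parameter names are invented for the example.

```python
# Minimal sketch of ICOUNT.2.8 fetching: give fetch priority to the
# threads with the fewest instructions in the pre-issue pipeline
# stages, select two threads per cycle, and fill eight fetch slots
# from them in priority order. Names and tie-breaking are assumptions.

def icount_fetch(inflight, available, n_threads=2, width=8):
    """inflight: thread id -> instructions in decode/rename/queue stages.
    available: thread id -> instructions the thread can supply this cycle.
    Returns thread id -> instructions fetched this cycle."""
    # ICOUNT: least-represented threads in the pre-issue stages come first.
    order = sorted(inflight, key=lambda t: inflight[t])[:n_threads]
    fetched, slots = {}, width
    for t in order:
        take = min(slots, available[t])
        if take > 0:
            fetched[t] = take
            slots -= take
    return fetched

# Thread 1 has the fewest in-flight instructions, so it fetches first;
# thread 3 (next fewest) fills the remaining slots of the 8-wide fetch.
print(icount_fetch({0: 10, 1: 2, 2: 7, 3: 4}, {0: 8, 1: 5, 2: 8, 3: 8}))
# → {1: 5, 3: 3}
```

The same skeleton expresses the other policies by swapping the sort key, for example sorting by outstanding D-cache misses for MISSCOUNT.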


Further research looked at compiler techniques for SMT [Lo et al. 1997] and at extracting threads of a single program designed for multithreaded execution on an SMT [Tullsen et al. 1999]. The SMT processor has also been evaluated with database workloads [Lo et al. 1998], achieving roughly a threefold increase in instruction throughput with an eight-threaded SMT over a single-threaded superscalar with similar resources.

Wallace et al. [1998] presented the threaded multipath execution model, which exploits existing hardware on an SMT processor to execute simultaneously alternate paths of a conditional branch in a thread. Threaded Multiple Path Execution employs eager execution of branches in an SMT processor model. It extends the SMT processor by introducing additional hardware to check for unused processor resources (unused hardware threads), a confidence estimator, mechanisms for fast starting and finishing threads, priorities attached to the threads, support for speculatively executed memory access operations, and an additional bus for distributing the contents of the register mapping table (mapping synchronization bus, MSB). If the hardware detects that a number of processor threads are not processing useful instructions, the confidence estimator is used to decide whether only one continuation of a conditional branch should be followed (high confidence) or both continuations should be followed simultaneously (low confidence). The MSB is used to provide a thread that starts execution of one continuation speculatively with the valid register mapping. Such register mappings between different register sets incur an overhead of four to eight cycles. Such speculative execution increases single-program performance by 14–23%, depending on the misprediction penalty, for programs with a high branch misprediction rate. Wallace et al. [1999] explored instruction recycling on the proposed multipath SMT processor.

Irvine multithreaded superscalar. This multithreaded superscalar processor approach, developed at the University of California at Irvine, combines out-of-order execution within an instruction stream with the simultaneous execution of instructions of different instruction streams [Gulati and Bagherzadeh 1996; Loikkanen and Bagherzadeh 1996]. A particular superscalar processor called the Superscalar Digital Signal Processor (SDSP) is enhanced to run multiple threads. The enhancements are directed by the goal of minimal modification to the superscalar base processor. Therefore, most resources on the chip are shared by the threads, as for instance the register file, reorder buffer, instruction window, store buffer, and renaming hardware. Based on simulations, a performance gain of 20–55% due to multithreading was achieved across a range of benchmarks.

Pontius and Bagherzadeh [1999] evaluated a multithreaded superscalar processor model using several video decode, picture processing, and signal filter programs as workloads. The programs were parallelized at the source code level by partitioning the main loop and distributing the loop iterations to several threads. The relatively low speedup that was reported for multithreading resulted from algorithmic restrictions and from the already high IPC in the single-threaded model. The latter was only possible because multimedia instructions were not used. Otherwise a large part of the IPC in the single-threaded model would be hidden by the SIMD parallelism within the multimedia instructions.

SMV processor. The Simultaneous Multithreaded Vector (SMV) architecture [Espasa and Valero 1997], designed at the Polytechnic University of Catalunya (Barcelona, Spain), combines simultaneous multithreaded execution and out-of-order execution with an integrated vector unit and vector instructions. Figure 13 depicts the SMV architecture. The fetch engine selects one of eight threads and fetches four instructions on its behalf. The decoder renames the instructions, using a per-thread rename table, and then sends all instructions to several common execution queues. Inside the queues, the


Fig. 13. Simultaneous Multithreaded Vector architecture. FPU = floating-point unit; ALU = arithmetic logic unit; VFU = vector functional unit.

instructions of different threads are indistinguishable, and no thread information is kept except in the reorder buffer and memory queue. Register names preserve all dependences. Independent threads use independent rename tables, which prevents false dependences and conflicts from occurring. The vector unit has 128 vector registers, each holding 128 64-bit registers, and has four general-purpose independent execution units. The number of registers is the product of the number of threads and the number of physical registers required to sustain good performance on each thread.

Karlsruhe multithreaded superscalar processor. While the SMT processor proposed by Tullsen, Eggers, and Levy [1995] surveys enhancements of the Alpha 21164 processor, the multithreaded superscalar approach of Sigmund and Ungerer [1996a, 1996b] is based on a simplified PowerPC 604 processor, combining multiple pipelines that were dedicated to different threads with a simultaneous issue unit (resource replication).

Using an instruction mix with 20% load and store instructions, the performance results showed, for an eight-issue processor with four to eight threads and

a single load/store unit, that two instruction fetch units, two decode units, four integer units, 16 rename registers, four register ports, and completion queues with 12 slots are sufficient. The single load/store unit proved the principal bottleneck. The multithreaded superscalar processor (eight-threaded, eight-issue) is able to completely hide latencies caused by 4-2-2-2 burst cache refills.² It reaches the maximum throughput of 4.2 IPC that is possible with a single load/store unit.

SMT multimedia processor. Subsequent SMT research at the University of Karlsruhe has explored resource replication-based models for an SMT processor with multimedia enhancements [Oehring et al. 1999a, 1999b]. Related research by Pontius and Bagherzadeh [1999] and Wittenburg et al. [1999] did not use multimedia operations. Hand-coding was applied to transfer a commercial MPEG-2 video decoding algorithm to the SMT multimedia processor model and to program the algorithm in multithreaded fashion. The research processor model assumes a wide-issue superscalar processor, and enhances it by the SMT technique, by multimedia units, and by an additional on-chip RAM storage.

The most surprising finding was that smaller reservation stations for the thread unit and the global and the local load/store units, as well as smaller reorder buffers, increased the IPC value for the multithreaded models. Intuition suggests better performance with larger buffers. However, large reservation stations (and large reorder buffers) draw too many highly speculative instructions into the pipeline. Smaller buffers limit the speculation depth of fetched instructions, so that only nonspeculative instructions or instructions with low speculation depth are fetched, decoded, issued, and executed in an SMT processor. An abundance of speculative instructions may decrease the performance of an SMT processor. Another reason for the negative effect of large reorder buffers and of large reservation stations for the load/store and thread control units lies in the fact that those instructions have a long latency and typically have two to three integer instructions as consumers. The effect is that the consuming integer instructions eat up space in the integer reservation station, thus blocking instructions from other threads from entering it. This multiplication effect is made even worse by a nonspeculative execution of store and thread control instructions.

Subsequent research by Sigmund et al. [2000] investigated a cost/benefit analysis of various SMT multimedia processor models. Transistor count and chip space estimations (applying the tool described by Steinhaus et al. [2001]) showed that the additional hardware cost of a four-threaded, eight-issue SMT processor over a single-threaded processor is a 2% increase in transistor count and a 9% increase in chip space for a 288 million transistor chip, which yields a threefold speedup. The small-scale four-threaded, eight-issue SMT models with only 24 million transistors require a 9% increase in transistor count and a 27% increase in chip space, resulting in a 1.5-fold speedup over the superscalar base model.

Similar results were reached by Burns and Gaudiot [2002], who estimated the layout area for SMT. They identified which layout blocks are affected by SMT, determined the scaling of chip space requirements, and compared SMT versus single-threaded processor space requirements by scaling an R10000-based layout to 0.18-micron technology.

Simultaneous multithreading for signal processors. Wittenburg et al. [1999] looked at the application of the SMT technique to signal processors, using combining instructions that are applied to registers of several register sets simultaneously instead of multimedia operations. Simulations with a Hough transformation as workload showed a speedup of up to six compared to a single-threaded processor without multimedia extensions.

² 4-2-2-2 assumes that four 64-bit portions are transmitted over the memory bus, the first portion reaching the processor four cycles after the cache miss indication, the next portion two cycles later, etc.
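The 4-2-2-2 refill timing described in the footnote can be checked in a few lines; the cycle numbers follow directly from its description (first 64-bit portion four cycles after the miss indication, each further portion two cycles after the previous one). The function name is invented for the example.

```python
# Arrival cycles, counted from the cache-miss indication, of the four
# 64-bit portions in a 4-2-2-2 burst cache refill: 4, then +2, +2, +2.
def burst_arrivals(first=4, step=2, portions=4):
    arrivals, cycle = [], 0
    for i in range(portions):
        cycle += first if i == 0 else step
        arrivals.append(cycle)
    return arrivals

print(burst_arrivals())  # → [4, 6, 8, 10]
```

The last portion thus arrives 10 cycles after the miss, which is the latency the eight-threaded, eight-issue model is reported to hide completely.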


Simultaneous multithreading and power consumption. Simultaneous multithreading can also be applied for reduction of power consumption. In principle, the same power reduction techniques are applicable for SMT processors as for superscalars. When an SMT processor is simultaneously executing threads, the per-cycle use of the processor resources should noticeably increase, offering fewer opportunities for power reduction via traditional techniques such as clock gating. However, an SMT processor, specifically designed to execute multiple threads in a power-aware manner, provides additional options for power-aware design [Brooks et al. 2000].

Mispredictions cost energy because the speculatively executed instructions must be squashed. Contemporary superscalar processors discard approximately 60% of the fetched and 30% of the executed instructions due to misspeculations. In case of a power-aware design, the issue slots of an SMT processor could preferably be filled by less speculative instructions of other threads. Harnessing the extra parallelism provided by multiple threads allows the processor to rely much less on speculation. Fewer resources are utilized by speculative instructions and more parallelism is exploited. Moreover, a simpler branch unit design could be used. This inherently implies a greater power efficiency per thread. The simulations of Seng et al. [2000] showed that by using an appropriate scheduler up to 22% less power is consumed by an SMT processor than by a comparable superscalar.

7. SIMULTANEOUS MULTITHREADING IN CURRENT MICROPROCESSORS

Alpha 21464. Compaq unveiled its Alpha EV8 21464 proposal in 1999 [Emer 1999], a four-threaded eight-issue SMT processor that closely resembles the SMT processor proposed by Tullsen et al. [1999]. The processor proposal featured out-of-order execution, a large on-chip secondary cache, a direct RAMBUS interface, and an on-chip router for system interconnect of a directory-based, cache-coherent NUMA (nonuniform memory access) multiprocessor. This 250 million transistor chip was planned for year 2003. In the meantime, the project has been abandoned, as Compaq sold the Alpha processor technology to Intel.

Blue Gene. Simultaneous multithreading is mentioned as a processor technique for the building block of the IBM Blue Gene system—a 5-year effort to build a petaflops supercomputer started in December 1999 [Allen et al. 2001].

Sun UltraSPARC V. Also the Sun UltraSPARC V processor has been announced to exploit thread-level parallelism by symmetric multithreading—that is, SMT. The UltraSPARC V will be able to switch between two different modes depending on the type of work—one mode for heavy duty calculations and the other for business transactions such as database operations [Lawson and Vance 2002].

Hyper-Threading Technology in the Intel Xeon processor. Intel's Hyper-Threading Technology [Marr et al. 2002] proposes SMT for the Pentium 4-based Intel Xeon processor family to be used in dual and multiprocessor servers. Hyper-Threading Technology makes a single physical processor appear as two logical processors by applying a two-threaded SMT approach. Each logical processor maintains a complete set of the architecture state, which consists of the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. Logical processors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic, and buses. Each logical processor has its own APIC. Interrupts sent to a specific logical processor are handled only by that logical processor.

When one logical processor is stalled, the other logical processor can continue to make forward progress. A logical processor may be temporarily stalled for a variety of reasons, including servicing cache misses, handling branch mispredictions, or waiting for the results of previous instructions.
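The split between duplicated architecture state and shared physical resources can be sketched as follows; the class and field names are illustrative assumptions for the example, not Intel's definitions.

```python
# Sketch of the Hyper-Threading resource split: each logical processor
# duplicates the architecture state (registers, APIC), while caches and
# execution units belong to the single physical core and are shared.
# Class and field names are illustrative assumptions.

class ArchState:
    """State duplicated per logical processor."""
    def __init__(self):
        self.general_purpose = {}  # general-purpose registers
        self.control = {}          # control registers
        self.apic = {}             # each logical processor has its own APIC

class PhysicalProcessor:
    """One physical core exposing two logical processors."""
    def __init__(self):
        self.cache = {}                            # shared resource
        self.logical = [ArchState(), ArchState()]  # duplicated state

    def write_reg(self, lp, reg, value):
        # An architectural write touches only that logical processor.
        self.logical[lp].general_purpose[reg] = value

cpu = PhysicalProcessor()
cpu.write_reg(0, "eax", 7)    # visible only to logical processor 0
cpu.cache["line42"] = "data"  # a cache fill is visible to both
```

A scheduler stalling logical processor 0 leaves logical processor 1's state and progress untouched, which is the independence property the text describes.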


Independent forward progress was en- The branch prediction structures are ei- sured by managing buffering queues such ther duplicated or shared. The branch his- that no logical processor can use all the tory buffer used to look up the global his- entries when two active software threads tory array is also tracked independently were executing. for each logical processor. However, the The buffering queues that separate ma- large global array is a shared jor pipeline logic blocks are either parti- structure with entries that are tagged tioned or duplicated to ensure indepen- with a logical processor ID. The return dent forward progress through each logic stack buffer, which predicts the target of block. If only one active software thread return instructions, is duplicated. is active, the thread should run at the When both threads are decoding in- same speed on a processor with Hyper- structions simultaneously, the streaming Threading Technology as on a processor buffers alternate between threads so that without this capability. This means that both threads share the same decoder logic. partitioned resources should be recom- The decode logic has to keep two copies of bined when only one software thread is all the state needed to decode IA-32 in- active, thus applying a flexible resource- structions for the two logical processors sharing model. even though it only decodes instructions In the Xeon, as in the Pentium 4, in- for one logical processor at a time. In gen- structions generally come from the execu- eral, several instructions are decoded for tion trace cache (TC), which replaces the one logical processor before switching to primary I-cache. Two sets of instruction the other logical processor. pointers independently track the progress The op queue that decouples the front- of the two executing software threads. The end from the out-of-order execution engine TC entries are tagged with thread infor- is partitioned such that each logical pro- mation. 
The two logical processors arbi- cessor has half the entries. The allocator trate access to the TC every clock cycle. logic takes ops from the op queue and If both logical processors want access to allocates many of the key machine buffers the TC at the same time, access is granted needed to execute each op, including the to one then the other in alternating clock 126 reorder buffer entries, 128 integer and cycles. If one logical processor is stalled or 128 floating-point physical registers, and is unable to use the TC, the other logical 48 load and 24 store buffer entries. If there processor can use the full bandwidth of the are ops for both logical processors in the TC, every cycle. op queue, the allocator will alternate se- If there is a TC miss, instruction bytes lecting ops from the logical processors ev- need to be fetched from the secondary ery clock cycle to assign resources. Each cache and decoded into ops (microopera- logical processor can use up to a maxi- tions) to be placed in the TC. Each logical mum of 63 reorder buffer entries, 24 load processor has its own instruction transla- buffers, and 12 store buffer entries. tion look-aside buffer and its own set of in- Since each logical processor must main- struction pointers to track the progress of tain and track its own complete architec- instruction fetch for the two logical proces- ture state, there are two register ta- sors. The instruction fetch logic in charge bles for , one for each of sending requests to the secondary cache logical processor. The register renaming arbitrates on a first-come first-served ba- process is done in parallel to the alloca- sis, while always reserving at least one tor logic described above, so the register request slot for each logical processor. In rename logic works on the same ops to this way, both logical processors can have which the allocator is assigning resources. fetches pending simultaneously. 
Once μops have completed the allocation and register rename processes, they are placed into the memory instruction queue or the general instruction queue, respectively. The two sets of queues are also partitioned such that μops from each logical processor can use at most half the entries.
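This half-and-half static partitioning can be pictured with a minimal Python sketch; the class and its interface are ours, purely for illustration.

```python
class PartitionedQueue:
    """Toy model of a queue statically partitioned between two logical
    processors: each may occupy at most half of the entries."""

    def __init__(self, entries):
        self.limit = entries // 2      # each logical processor's share
        self.occupancy = [0, 0]
        self.items = []

    def push(self, lp, uop):
        if self.occupancy[lp] >= self.limit:
            return False               # this LP's half is full: it stalls,
        self.occupancy[lp] += 1        # but the other LP is unaffected
        self.items.append((lp, uop))
        return True

    def pop(self):
        lp, uop = self.items.pop(0)    # oldest entry leaves the queue
        self.occupancy[lp] -= 1
        return lp, uop
```

The point of the partitioning is visible in the sketch: one logical processor filling its half returns `False` (a stall) without consuming entries the other processor is entitled to.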

The memory instruction queue and the general instruction queue send μops to the five scheduler queues as fast as they can, alternating between μops for the two logical processors every clock cycle, as needed.

Each scheduler has its own scheduler queue of 8 to 12 entries from which it selects μops to send to the execution units. The schedulers choose μops regardless of whether they belong to one logical processor or the other; they are effectively oblivious to logical processor distinctions. The μops are simply evaluated based on dependent inputs and the availability of execution resources. For example, the schedulers could dispatch two μops from one logical processor and two μops from the other logical processor in the same clock cycle. To avoid deadlock and ensure fairness, there is a limit on the number of active entries that a logical processor can have in each scheduler's queue.
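Scheduler selection by readiness alone, ignoring the owning logical processor, can be sketched as follows. This is illustrative Python, not the hardware's actual wakeup/select network, and the function name is our own.

```python
def dispatch(scheduler_queue, ready):
    """Toy model of one scheduler queue: dispatch the oldest uop whose
    source operands are ready, regardless of which logical processor it
    belongs to.  scheduler_queue holds (lp, uop) pairs in age order;
    ready is the set of uops whose inputs are available."""
    for i, (lp, uop) in enumerate(scheduler_queue):
        if uop in ready:              # readiness, not thread identity, decides
            return scheduler_queue.pop(i)
    return None                       # nothing can be dispatched this cycle
```

Called repeatedly, the model happily dispatches μops from either logical processor in any mix, which is exactly the obliviousness the text describes.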
generate threads from single-threaded The execution core and memory hier- programs and execute such speculative archy are also largely oblivious to logical threads concurrently with the lead thread. processors. After execution, the ops are Superscalar and implicit multithreaded placed in the reorder buffer, which decou- processors aim at a low execution time ples the execution stage from the retire- of a single program, while explicit mul- ment stage. The reorder buffer is parti- tithreaded processors (and chip multipro- tioned such that each logical processor can cessors) aim at a low execution time of a use half the entries. multithreaded workload. The retirement logic tracks when ops Several explicit multithreaded proces- from the two logical processors are ready sors as well as chip multiprocessors have to be retired, then retires the ops in pro- currently been announced by industry or gram order for each logical processor by are already into production in the ar- alternating between the two logical pro- eas of high-performance microprocessors, cessors. Retirement logic will retire ops media, and network processors. The ba- for one logical processor, then the other, al- sic interleaved and blocked multithread- ternating back and forth. If one logical pro- ing techniques are applied in VLIW pro- cessor is not ready to retire any ops then cessors, in the network processors, and all retirement bandwidth is dedicated to even in superscalar processors. In partic- the other logical processor. ular the success of these techniques in The implementation of Hyper- network processors is stunning. In addi- Threading Technology in the Xeon tion, simultaneous multithreading proces- processor adds less than 5% to the relative sors and chip multiprocessors have been chip size and maximum power require- announced or are already in production ments, but can provide performance by IBM, Intel, and Sun. The additional benefits much greater than that. 
The implementation of Hyper-Threading Technology in the Xeon processor adds less than 5% to the relative chip size and maximum power requirements, but can provide performance benefits much greater than that. Initial benchmark tests show up to a 65% performance increase on high-end server applications when comparing the Xeon processor to the previous-generation Pentium III Xeon processor on four-way server platforms. A significant portion of those gains can be attributed to Hyper-Threading Technology [Marr et al. 2002].

8. CONCLUSIONS

Although the term multithreading is sometimes used to include all kinds of architectures that are able to concurrently execute multiple instruction streams on a single chip [Sohi 2001], that is, chip multiprocessors as well as explicit and implicit multithreaded architectures, we clearly distinguish these architectural solutions and focus this survey on explicit multithreaded processors.

Explicit multithreaded processors interleave the execution of instructions of different user-defined threads within the same pipeline, in contrast to implicit multithreaded processors, which dynamically generate threads from single-threaded programs and execute such speculative threads concurrently with the lead thread. Superscalar and implicit multithreaded processors aim at a low execution time of a single program, while explicit multithreaded processors (and chip multiprocessors) aim at a low execution time of a multithreaded workload.

Several explicit multithreaded processors as well as chip multiprocessors have been announced by industry or are already in production in the areas of high-performance microprocessors, media, and network processors. The basic interleaved and blocked multithreading techniques are applied in VLIW processors, in network processors, and even in superscalar processors. In particular, the success of these techniques in network processors is stunning. In addition, simultaneous multithreading processors and chip multiprocessors have been announced or are already in production by IBM, Intel, and Sun. The additional chip space requirement of a two-threaded processor is reported to be about 5% compared to a single-threaded processor (IBM RS64 IV and Intel Xeon processor), while a

significant throughput increase is reached by multithreading. We therefore expect explicit multithreading techniques, in particular the simultaneous multithreading techniques, to be commonly used in the next generation of high-performance microprocessors.

In contrast, implicit multithreaded processors and the related helper-thread approach remain hot research topics and must still prove their efficiency in real processors. We expect that future research will also focus on the deployment of multithreading techniques in the fields of signal processors and microcontrollers, in particular for real-time applications, and for power reduction. Moreover, instruction scheduling within a multithreaded pipeline as well as memory hierarchy architectures for multithreaded processors are still open research questions.

This survey should clarify the terminology and demonstrate the state of the art as well as the research results achieved for explicit multithreaded processors.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for many valuable comments.

REFERENCES

AGARWAL, A., BIANCHINI, R., CHAIKEN, D., JOHNSON, K. L., KRANZ, D., KUBIATOWICZ, J., LIM, B. H., MACKENZIE, K., AND YEUNG, D. 1995. The MIT Alewife machine: architecture and performance. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (Santa Margherita Ligure, Italy). 2–13.
AGARWAL, A., KUBIATOWICZ, J., KRANZ, D., LIM, B. H., YEOUNG, D., D'SOUZA, G., AND PARKIN, M. 1993. Sparcle: an evolutionary processor design for large-scale multiprocessors. IEEE Micro 13, 3, 48–61.
AKKARY, H. AND DRISCOLL, M. A. 1998. A dynamic multithreading processor. In Proceedings of the 31st Annual International Symposium on Microarchitecture (Dallas, TX). 226–236.
ALLEN, F., ALMASI, G., ANDREONI, W., BEECE, D., BERNE, B. J., BRIGHT, A., BRUNHEROTO, J., CASCAVAL, C., CASTANOS, J., COTEUS, P., CRUMLEY, P., CURIONI, A., DENNEAU, M., DONATH, W., ELEFTHERIOU, M., FITCH, B., FLEISCHER, B., GEORGIOU, C. J., GERMAIN, R., GIAMPAPA, M., GRESH, D., GUPTA, M., HARING, R., HO, H., HOCHSCHILD, P., HUMMEL, S., JONAS, T., LIEBER, D., MARTYNA, G., MATURU, ., MOREIRA, J., NEWNS, D., , M., PHILHOWER, R., PICUNKO, T., PITERA, J., PITMAN, M., RAND, R., ROYYURU, A., SALAPURA, V., SANOMIYA, A., SHAH, R., SHAM, Y., SINGH, S., SNIR, M., SUITS, F., SWETZ, R., SWOPE, W. C., VISHNUMURTHY, N., WARD, T. C. J., WARREN, H., AND ZHOU, R. 2001. Blue Gene: a vision for protein science using a petaflops supercomputer. IBM Syst. J. 40, 2, 310–326.
ALMASI, G. S. AND GOTTLIEB, A. 1994. Highly Parallel Computing, 2nd ed. Benjamin/Cummings, Menlo Park, CA.
ALVERSON, G., KAHAN, S., KORRY, R., MCCANN, C., AND SMITH, B. J. 1995. Scheduling on the Tera MTA. In Lecture Notes in Computer Science, vol. 949. Springer-Verlag, Heidelberg, Germany. 19–44.
ALVERSON, R., CALLAHAN, D., CUMMINGS, D., KOBLENZ, B., PORTERFIELD, A., AND SMITH, B. J. 1990. The Tera computer system. In Proceedings of the 4th International Conference on Supercomputing (Amsterdam, The Netherlands). 1–6.
BACH, P., BRAUN, M., FORMELLA, A., FRIEDRICH, J., GRÜN, T., AND LICHTENAU, C. 1997. Building the 4 processor SB-PRAM prototype. In Proceedings of the 30th Hawaii International Conference on System Science (Maui, HI). 5:14–23.
BARROSO, L. A., GHARACHORLOO, K., MCNAMARA, R., NOWATZYK, A., QADEER, S., SANO, B., SMITH, S., STETS, R., AND VERGHESE, B. 2000. Piranha: a scalable architecture based on single-chip multiprocessing. In Proceedings of the 27th Annual International Symposium on Computer Architecture (Vancouver, B.C., Canada). 282–293.
BOLYCHEVSKY, A., JESSHOPE, C. R., AND MUCHNIK, V. B. 1996. Dynamic scheduling in RISC architectures. IEE P. Comput. Dig. Tech. 143, 5, 309–317.
BOOTHE, R. F. 1993. Evaluation of multithreading and caching in large shared memory parallel computers. Tech. Rep. UCB/CSD-93-766. Computer Science Division, University of California, Berkeley, Berkeley, CA.
BOOTHE, R. F. AND RANADE, A. 1992. Improved multithreading techniques for hiding communication latency in multiprocessors. In Proceedings of the 19th International Symposium on Computer Architecture (Gold Coast, Australia). 214–223.
BORKENHAGEN, J. M., EICKEMEYER, R. J., KALLA, R. N., AND KUNKEL, S. R. 2000. A multithreaded PowerPC processor for commercial servers. IBM J. Res. Dev. 44, 6, 885–898.
BRINKSCHULTE, U., BECHINA, A., PICIOROAGA, F., SCHNEIDER, E., UNGERER, T., KREUZINGER, J., AND PFEFFER, M. 2000. A middleware architecture for distributed embedded real-time systems. In Proceedings of the 20th IEEE Symposium on Reliable Distributed Systems (New Orleans, LA). 218–226.
BRINKSCHULTE, U., KRAKOWSKI, C., KREUZINGER, J., AND UNGERER, T. 1999a. A multithreaded Java microcontroller for thread-oriented real-time event-handling. In Proceedings of the

International Conference on Parallel Architectures and Compilation Techniques (Newport Beach, CA). 34–39.
BRINKSCHULTE, U., KRAKOWSKI, C., MARSTON, R., KREUZINGER, J., AND UNGERER, T. 1999b. The Komodo project: thread-based event handling supported by a multithreaded Java microcontroller. In Proceedings of the 25th Euromicro Conference (Milan, Italy). 122–128.
BRINKSCHULTE, U., KREUZINGER, J., PFEFFER, M., AND UNGERER, T. 2002. A scheduling technique providing a strict isolation of real-time threads. In Proceedings of the 7th IEEE International Workshop on Object-Oriented Real-Time Dependable Systems (San Diego, CA). 169–172.
BROOKS, D. M., BOSE, P., SCHUSTER, S. E., JACOBSON, H., KUDVA, P. N., BUYUKTOSUNOGLU, A., WELLMAN, J. D., ZYUBAN, V., GUPTA, M., AND COOK, P. W. 2000. Power-aware microarchitecture: designing and modeling challenges for next-generation microprocessors. IEEE Micro 20, 6, 26–44.
BURNS, J. AND GAUDIOT, J. L. 2002. SMT layout overhead and scalability. IEEE T. Parall. Distr. Syst. 13, 2, 142–155.
BUTLER, M., YEH, T. Y., PATT, Y. N., ALSUP, M., SCALES, H., AND SHEBANOW, M. 1991. Single instruction stream parallelism is greater than two. In Proceedings of the 18th International Symposium on Computer Architecture (Toronto, Ont., Canada). 276–286.
CHAPPELL, R. S., STARK, J., KIM, S. P., REINHARDT, S. K., AND PATT, Y. N. 1999. Simultaneous subordinate microthreading (SSMT). In Proceedings of the 26th Annual International Symposium on Computer Architecture (Atlanta, GA). 186–195.
CHRYSOS, G. Z. AND EMER, J. S. 1998. Memory dependence prediction using store sets. In Proceedings of the 25th Annual International Symposium on Computer Architecture (Barcelona, Spain). 142–153.
CULLER, D. E., SINGH, J. P., AND GUPTA, A. 1998. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, San Francisco, CA.
DALLY, W. J., FISKE, J., KEEN, J., LETHIN, R., NOAKES, M., NUTH, P., DAVISON, R., AND FYLER, G. 1992. The message-driven processor: a multicomputer processing node with efficient mechanisms. IEEE Micro 12, 2, 23–39.
DENNIS, J. B. AND GAO, G. R. 1994. Multithreaded architectures: principles, projects, and issues. In Multithreaded Computer Architecture: A Summary of the State of the Art, R. A. Iannucci, G. R. Gao, R. Halstead, and B. J. Smith, Eds. Kluwer, Boston, MA, Dordrecht, The Netherlands, London, U.K. 1–74.
DOROJEVETS, M. 2000. COOL multithreading in HTMT SPELL-1 processors. Int. J. High Speed Electron. Sys. 10, 1, 247–253.
DOROZHEVETS, M. N. AND WOLCOTT, P. 1992. The El'brus-3 and MARS-M: recent advances in Russian high-performance computing. J. Supercomput. 6, 1, 5–48.
DUBEY, P. K., O'BRIEN, K., O'BRIEN, K. M., AND BARTON, C. 1995. Single-program speculative multithreading (SPSM) architecture: compiler-assisted fine-grain multithreading. Tech. Rep. 19928. IBM, Yorktown Heights, NY.
EGGERS, S. J., EMER, J. S., LEVY, H. M., LO, J. L., STAMM, R. M., AND TULLSEN, D. M. 1997. Simultaneous multithreading: a platform for next-generation processors. IEEE Micro 17, 5, 12–19.
EMER, J. S. 1999. Simultaneous multithreading: multiplying Alpha's performance. In Proceedings of the Microprocessor Forum (San Jose, CA).
ESPASA, R. AND VALERO, M. 1997. Exploiting instruction- and data-level parallelism. IEEE Micro 17, 5, 20–27.
FILLO, M., KECKLER, S. W., DALLY, W. J., CARTER, N. P., CHANG, A., AND GUREVICH, Y. 1995. The M-machine multicomputer. In Proceedings of the 28th Annual International Symposium on Microarchitecture (Ann Arbor, MI). 146–156.
FORMELLA, A., KELLER, J., AND WALLE, T. 1996. HPP: a high performance PRAM. In Lecture Notes in Computer Science, vol. 1123. Springer-Verlag, Heidelberg, Germany. 425–434.
FRANKLIN, M. 1993. The multiscalar architecture. Tech. Rep. 1196. Department of Computer Science, University of Wisconsin-Madison, Madison, WI.
FUHRMANN, S., PFEFFER, M., KREUZINGER, J., UNGERER, T., AND BRINKSCHULTE, U. 2001. Real-time garbage collection for a multithreaded Java microcontroller. In Proceedings of the 4th IEEE International Symposium on Object-Oriented Real-Time Distributed Computing (Magdeburg, Germany). 69–76.
GELINAS, B., HAYS, P., AND KATZMAN, S. 2002. Fine-grained hardware multi-threading: a CPU architecture for high-speed packet processing. Lexra Inc., Waltham, MA. White paper.
GLASKOWSKY, P. N. 2002. Network processors mature in 2001. Microproc. Report, February 19, 2002 (online journal).
GRÜNEWALD, W. AND UNGERER, T. 1996. Towards extremely fast context switching in a block-multithreaded processor. In Proceedings of the 22nd Euromicro Conference (Prague, Czech Republic). 592–599.
GRÜNEWALD, W. AND UNGERER, T. 1997. A multithreaded processor designed for distributed shared memory systems. In Proceedings of the International Conference on Advances in Parallel and Distributed Computing (Shanghai, China). 206–213.
GULATI, M. AND BAGHERZADEH, N. 1996. Performance study of a multithreaded superscalar microprocessor. In Proceedings of the 2nd International Symposium on High-Performance Computer Architecture (San Jose, CA). 291–301.

GWENNAP, L. 1997. DanSoft develops VLIW design. Microproc. Report 11, 2 (Feb. 17), 18–22.
HALSTEAD, R. H. 1985. MULTILISP: a language for concurrent symbolic computation. ACM Trans. Program. Lang. Syst. 7, 4, 501–538.
HALSTEAD, R. H. AND FUJITA, T. 1988. MASA: a multithreaded processor architecture for parallel symbolic computing. In Proceedings of the 15th International Symposium on Computer Architecture (Honolulu, HI). 443–451.
HAMMOND, L. AND OLUKOTUN, K. 1998. Considerations in the design of Hydra: a multiprocessor-on-chip microarchitecture. Tech. Rep. CSL-TR-98-749. Computer Systems Laboratory, Stanford University, Stanford, CA.
HANSEN, C. 1996. MicroUnity's MediaProcessor architecture. IEEE Micro 16, 4, 34–41.
HIRATA, H., KIMURA, K., NAGAMINE, S., MOCHIZUKI, Y., NISHIMURA, A., NAKASE, Y., AND NISHIZAWA, T. 1992. An elementary processor architecture with simultaneous instruction issuing from multiple threads. In Proceedings of the 19th International Symposium on Computer Architecture (Gold Coast, Australia). 136–145.
IANNUCCI, R. A., GAO, G. R., HALSTEAD, R., AND SMITH, B. J., Eds. 1994. Multithreaded Computer Architecture: A Summary of the State of the Art. Kluwer, Boston, MA, Dordrecht, The Netherlands, London, U.K.
IBM CORPORATION. 1999. IBM network processor. Product overview. IBM, Yorktown Heights, NY.
INTEL CORPORATION. 2002. Intel Internet Exchange Architecture network processors: flexible, wire-speed processing from the customer premises to the network core. White paper. Intel, Santa Clara, CA.
JESSHOPE, C. R. 2001. Implementing an efficient vector instruction set in a chip multi-processor using micro-threaded pipelines. Aust. Comput. Sci. Commun. 23, 4, 80–88.
JESSHOPE, C. R. AND LUO, B. 2000. Micro-threading: a new approach to future RISC. In Proceedings of the Australasian Computer Architecture Conference (Canberra, Australia). 34–41.
KAVI, K. M., LEVINE, D. L., AND HURSON, A. R. 1997. A non-blocking multithreaded architecture. In Proceedings of the 5th International Conference on Advanced Computing (Madras, India). 171–177.
KLAUSER, A., AUSTIN, T., GRUNWALD, D., AND CALDER, B. 1998a. Dynamic hammock predication for non-predicated instruction sets. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (Paris, France). 278–285.
KLAUSER, A., PAITHANKAR, A., AND GRUNWALD, D. 1998b. Selective eager execution on the PolyPath architecture. In Proceedings of the 25th Annual International Symposium on Computer Architecture (Barcelona, Spain). 250–259.
KREUZINGER, J., SCHULZ, A., PFEFFER, M., UNGERER, T., BRINKSCHULTE, U., AND KRAKOWSKI, C. 2000. Real-time scheduling on multithreaded processors. In Proceedings of the 7th International Conference on Real-Time Computer Systems and Applications (Cheju Island, South Korea). 155–159.
KREUZINGER, J. AND UNGERER, T. 1999. Context-switching techniques for decoupled multithreaded processors. In Proceedings of the 25th Euromicro Conference (Milan, Italy). 1:248–251.
LAM, M. S. AND WILSON, R. P. 1992. Limits of control flow on parallelism. In Proceedings of the 19th International Symposium on Computer Architecture (Gold Coast, Australia). 46–57.
LAUDON, J., GUPTA, A., AND HOROWITZ, M. 1994. Interleaving: a multithreading technique targeting multiprocessors and workstations. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems (San Jose, CA). 308–318.
LAWSON, S. AND VANCE, A. 2002. Sun hints at UltraSparc V and beyond. Available online at PCWorld.com.
LI, Z., TSAI, J. Y., WANG, X., YEW, P. C., AND ZHENG, B. 1996. Compiler techniques for concurrent multithreading with hardware speculation support. In Lecture Notes in Computer Science, vol. 1239. Springer-Verlag, Heidelberg, Germany. 175–191.
LIPASTI, M. H. AND SHEN, J. P. 1997. The performance potential of value and dependence prediction. In Lecture Notes in Computer Science, vol. 1300. Springer-Verlag, Heidelberg, Germany. 1043–1052.
LIPASTI, M. H., WILKERSON, C. B., AND SHEN, J. P. 1996. Value locality and load value prediction. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (Cambridge, MA). 138–147.
LO, J. L., BARROSO, L. A., EGGERS, S. J., GHARACHORLOO, K., LEVY, H. M., AND PAREKH, S. S. 1998. An analysis of database workload performance on simultaneous multithreaded processors. In Proceedings of the 25th Annual International Symposium on Computer Architecture (Barcelona, Spain). 39–50.
LO, J. L., EGGERS, S. J., EMER, J. S., LEVY, H. M., STAMM, R. L., AND TULLSEN, D. M. 1997. Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading. ACM Trans. Comput. Syst. 15, 3, 322–354.
LOIKKANEN, M. AND BAGHERZADEH, N. 1996. A fine-grain multithreading superscalar architecture. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (Boston, MA). 163–168.
LÜTH, K., METZNER, A., PIEKENKAMP, T., AND RISU, J. 1997. The events approach to rapid prototyping for embedded control systems. In Proceedings

of the Workshop Zielarchitekturen eingebetteter Systeme (Rostock, Germany). 45–54.
MANKOVIC, T. E., POPESCU, V., AND SULLIVAN, H. 1987. CHoPP principles of operations. In Proceedings of the 2nd International Supercomputer Conference (Mannheim, Germany). 2–10.
MARCUELLO, P., GONZALES, A., AND TUBELLA, J. 1998. Speculative multithreaded processors. In Proceedings of the 12th International Conference on Supercomputing (Melbourne, Australia). 77–84.
MARR, D. T., BINNS, F., HILL, D. L., HINTON, G., KOUFATY, D. A., MILLER, J. A., AND UPTON, M. 2002. Hyper-Threading technology architecture and microarchitecture: a hypertext history. Intel Technology J. 6, 1 (online journal).
METZNER, A. AND NIEHAUS, J. 2000. MSparc: multithreading in real-time architectures. J. Universal Comput. Sci. 6, 10, 1034–1051.
MIKSCHL, A. AND DAMM, W. 1996. Msparc: a multithreaded Sparc. In Lecture Notes in Computer Science, vol. 1123. Springer-Verlag, Heidelberg, Germany. 461–469.
OEHRING, H., SIGMUND, U., AND UNGERER, T. 1999a. MPEG-2 video decompression on simultaneous multithreaded multimedia processors. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (Newport Beach, CA). 11–16.
OEHRING, H., SIGMUND, U., AND UNGERER, T. 1999b. Simultaneous multithreading and multimedia. In Proceedings of the Workshop on Multithreaded Execution, Architecture and Compilation (Orlando, FL).
PATT, Y. N., PATEL, S. J., EVERS, M., FRIENDLY, D. H., AND STARK, J. 1997. One billion transistors, one uniprocessor, one chip. Computer 30, 9, 51–57.
PAUL, W. J., BACH, P., BOSCH, M., FISCHER, J., LICHTENAU, C., AND RÖHRIG, J. 2002. Real PRAM programming. In Lecture Notes in Computer Science, vol. 2400. Springer-Verlag, Heidelberg, Germany. 522–531.
PONTIUS, N. AND BAGHERZADEH, N. 1999. Multithreaded extensions enhance multimedia performance. In Proceedings of the Workshop on Multithreaded Execution, Architecture and Compilation (Orlando, FL).
ROTENBERG, E., JACOBSON, Q., SAZEIDES, Y., AND SMITH, J. E. 1997. Trace processors. In Proceedings of the 30th Annual International Symposium on Microarchitecture (Research Triangle Park, NC). 138–148.
RYCHLIK, B., FAISTL, J., KRUG, B., AND SHEN, J. P. 1998. Efficiency and performance impact of value prediction. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (Paris, France). 148–154.
SENG, J. S., TULLSEN, D. M., AND CAI, G. Z. N. 2000. Power-sensitive multithreaded architecture. In Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors (Austin, TX). 199–206.
SERRANO, M. J., YAMAMOTO, W., WOOD, R., AND NEMIROVSKY, M. D. 1994. Performance estimation in a multistreamed superscalar processor. In Lecture Notes in Computer Science, vol. 794. Springer-Verlag, Heidelberg, Germany. 213–230.
SIGMUND, U., STEINHAUS, M., AND UNGERER, T. 2000. Transistor count and chip space assessment of multimedia-enhanced simultaneous multithreaded processors. In Proceedings of the 4th Workshop on Multithreaded Execution, Architecture and Compilation (Monterrey, CA).
SIGMUND, U. AND UNGERER, T. 1996a. Evaluating a multithreaded superscalar microprocessor versus a multiprocessor chip. In Proceedings of the 4th PASA Workshop on Parallel Systems and Algorithms (Jülich, Germany). 147–159.
SIGMUND, U. AND UNGERER, T. 1996b. Identifying bottlenecks in multithreaded superscalar multiprocessors. In Lecture Notes in Computer Science, vol. 1123. Springer-Verlag, Heidelberg, Germany. 797–800.
ŠILC, J., ROBIČ, B., AND UNGERER, T. 1998. Asynchrony in parallel computing: from dataflow to multithreading. Parall. Distr. Comput. Practices 1, 1, 57–83.
ŠILC, J., ROBIČ, B., AND UNGERER, T. 1999. Processor Architecture: From Dataflow to Superscalar and Beyond. Springer-Verlag, Heidelberg and Berlin, Germany, and New York, NY.
SMITH, B. J. 1981. Architecture and applications of the HEP multiprocessor computer system. SPIE Real-Time Signal Processing IV 298, 241–248.
SMITH, B. J. 1985. The architecture of HEP. In Parallel MIMD Computation: HEP Supercomputer and Its Applications, J. S. Kowalik, Ed. MIT Press, Cambridge, MA. 41–55.
SMITH, J. E. AND VAJAPEYAM, S. 1997. Trace processors: moving to fourth-generation microarchitectures. Computer 30, 9, 68–74.
SOHI, G. S. 1997. Multiscalar: another fourth-generation processor. Computer 30, 9, 72.
SOHI, G. S. 2001. Microprocessors—10 years back, 10 years ahead. In Lecture Notes in Computer Science, vol. 2000. Springer-Verlag, Heidelberg, Germany. 208–218.
SOHI, G. S., BREACH, S. E., AND VIJAYKUMAR, T. N. 1995. Multiscalar processors. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (Santa Margherita Ligure, Italy). 414–425.
STEINHAUS, M., KOLLA, R., LARRIBA-PEY, J. L., UNGERER, T., AND VALERO, M. 2001. Transistor count and chip space estimation of SimpleScalar-based microprocessor models. In Proceedings of the Workshop on Complexity-Effective Design (Göteborg, Sweden).
STERLING, T. 1997. Beyond 100 teraflops through superconductors, holographic storage, and the

data vortex. In Proceedings of the International Symposium on Supercomputing (Tokyo, Japan).
TENDLER, J. M., DODSON, J. S., FIELDS, JR., J. S., LE, H., AND SINHAROY, B. 2002. POWER4 system microarchitecture. IBM J. Res. Dev. 46, 1, 5–26.
TEXAS INSTRUMENTS. 1994. TMS320C80 Technical Brief. Texas Instruments, Dallas, TX.
THISTLE, M. AND SMITH, B. J. 1988. A processor architecture for Horizon. In Proceedings of the Supercomputing Conference (Orlando, FL). 35–41.
TREMBLAY, M. 1999. A VLIW convergent multiprocessor system. In Proceedings of the Microprocessor Forum (San Jose, CA).
TREMBLAY, M., CHAN, J., CHAUDHRY, S., CONIGLIARO, A. W., AND TSE, S. S. 2000. The MAJC architecture: a synthesis of parallelism and scalability. IEEE Micro 20, 6, 12–25.
TSAI, J. Y. AND YEW, P. C. 1996. The superthreaded architecture: thread pipelining with run-time data dependence checking and control speculation. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (Boston, MA). 35–46.
TULLSEN, D. M., EGGERS, S. J., EMER, J. S., LEVY, H. M., LO, J. L., AND STAMM, R. L. 1996. Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor. In Proceedings of the 23rd Annual International Symposium on Computer Architecture (Philadelphia, PA). 191–202.
TULLSEN, D. M., EGGERS, S. J., AND LEVY, H. M. 1995. Simultaneous multithreading: maximizing on-chip parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (Santa Margherita Ligure, Italy). 392–403.
TULLSEN, D. M., LO, J. L., EGGERS, S. J., AND LEVY, H. M. 1999. Supporting fine-grained synchronization on a simultaneous multithreading processor. In Proceedings of the 5th International Symposium on High-Performance Computer Architecture (Orlando, FL). 54–58.
UNGERER, T., ROBIČ, B., AND ŠILC, J. 2002. Multithreaded processors. Computer J. 45, 3, 320–348.
VAJAPEYAM, S. AND MITRA, T. 1997. Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences. In Proceedings of the 24th Annual International Symposium on Computer Architecture (Denver, CO). 1–12.
VIJAYKUMAR, T. N. AND SOHI, G. S. 1998. Task selection for a multiscalar processor. In Proceedings of the 31st Annual International Symposium on Microarchitecture (Dallas, TX). 81–92.
WALL, D. W. 1991. Limits of instruction-level parallelism. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (Santa Clara, CA). 176–188.
WALLACE, S., CALDER, B., AND TULLSEN, D. M. 1998. Threaded multiple path execution. In Proceedings of the 25th Annual International Symposium on Computer Architecture (Barcelona, Spain). 238–249.
WALLACE, S., TULLSEN, D. M., AND CALDER, B. 1999. Instruction recycling on a multiple-path processor. In Proceedings of the 5th International Symposium on High-Performance Computer Architecture (Orlando, FL). 44–53.
WITTENBURG, J. P., MEYER, G., AND PIRSCH, P. 1999. Adapting and extending simultaneous multithreading for high performance video signal processing applications. In Proceedings of the Workshop on Multithreaded Execution, Architecture and Compilation (Orlando, FL).
YAMAMOTO, W. AND NEMIROVSKY, M. D. 1995. Increasing superscalar performance through multistreaming. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (Limassol, Cyprus). 49–58.

Received June 2001; revised September 2002; accepted November 2002

ACM Computing Surveys, Vol. 35, No. 1, March 2003.