
A Survey of Processors with Explicit Multithreading

THEO UNGERER University of Augsburg

BORUT ROBIČ University of Ljubljana

AND

JURIJ ŠILC Jožef Stefan Institute

Hardware multithreading is becoming a generally applied technique in the next generation of microprocessors. Several multithreaded processors are announced by industry or already into production in the areas of high-performance microprocessors, media, and network processors. A multithreaded processor is able to pursue two or more threads of control in parallel within the processor pipeline. The contexts of two or more threads of control are often stored in separate on-chip register sets. Unused instruction slots, which arise from latencies during the pipelined execution of single-threaded programs by a contemporary microprocessor, are filled by instructions of other threads within a multithreaded processor. The execution units are multiplexed between the thread contexts that are loaded in the register sets. Underutilization of a superscalar processor due to missing instruction-level parallelism can be overcome by simultaneous multithreading, where a processor can issue multiple instructions from multiple threads each cycle. Simultaneous multithreaded processors combine the multithreading technique with a wide-issue superscalar processor to utilize a larger part of the issue bandwidth by issuing instructions from different threads simultaneously. Explicit multithreaded processors are multithreaded processors that apply processes or operating system threads in their hardware thread slots. These processors optimize the throughput of multiprogramming workloads rather than single-thread performance. We distinguish these processors from implicit multithreaded processors

Categories and Subject Descriptors: C.1 [Computer Systems Organization]: Processor Architectures; C.1.3 [Processor Architectures]: Other Architecture Styles—Pipeline processors

General Terms: Design, Performance

Additional Key Words and Phrases: Blocked multithreading, interleaved multithreading, simultaneous multithreading

Authors’ addresses: T. Ungerer, Institute of Computer Science, University of Augsburg, Eichleitnerstrasse 30, D-86135 Augsburg, Germany; email: [email protected]; B. Robič, Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia; email: [email protected]; J. Šilc, Computer Systems Department, Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia; email: [email protected]. Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. © 2003 ACM 0360-0300/03/0300-0029 $5.00

ACM Computing Surveys, Vol. 35, No. 1, March 2003, pp. 29–63.

that utilize thread-level speculation by speculatively executing compiler- or machine-generated threads of control that are part of a single sequential program. This survey paper explains and classifies the explicit multithreading techniques in research and in commercial microprocessors.

1. INTRODUCTION

A multithreaded processor is able to concurrently execute instructions of different threads of control within a single pipeline. The minimal requirement for such a processor is the ability to pursue two or more threads of control in parallel within the processor pipeline, that is, the processor must provide two or more independent program counters, an internal tagging mechanism to distinguish instructions of different threads within the pipeline, and a mechanism that triggers a thread switch. Thread-switch overhead must be very low, from zero to only a few cycles. A multithreaded processor often, though not always, features multiple register sets on the processor chip.

The current interest in hardware multithreading stems from three objectives:

—Latency reduction is an important issue when designing a microprocessor. Latencies arise from data dependencies between instructions within a single thread of control. Long latencies are caused by memory accesses that miss in the cache and by long running instructions. Short latencies may be bridged within a superscalar processor by executing succeeding, nondependent instructions of the same thread. Long latencies, however, stall the processor and lessen its performance.

—Shared-memory multiprocessors suffer from memory access latencies that are several times longer than in a single-processor system. When accessing a nonlocal memory module in a distributed-shared-memory system, the memory latency is enhanced by the transfer time through the communication network. Additional latencies arise in a shared-memory multiprocessor from thread synchronizations, which cause idle times for the waiting thread. One solution to fill these idle times is to switch to another thread. However, a thread switch on a conventional processor causes saving of all registers, loading the new register values, and several more administrative tasks that often require too much time for this method to be an efficient solution.

—Contemporary superscalar microprocessors [Šilc et al. 1999] are able to issue four to six instructions each clock cycle from a conventional linear instruction stream. VLSI technology will allow future microprocessors with an issue bandwidth of 8–32 instructions per cycle [Patt et al. 1997]. As the issue rate of future microprocessors increases, the compiler or the hardware will have to extract more instruction-level parallelism (ILP) from a sequential program. However, ILP found in a conventional instruction stream is limited. ILP studies that assume a single flow of control with branch speculation have reported parallelism of around 7 instructions per cycle (IPC) with infinite resources [Wall 1991; Lam and Wilson 1992] and around 4 IPC with large sets of resources (e.g., 8 to 16 execution units) [Butler et al. 1991]. Contemporary high-performance microprocessors therefore exploit speculative parallelism by dynamic branch prediction and speculative execution of the predicted branch path to increase single-thread performance. Control speculation may be enhanced by data dependence and value speculation techniques to increase performance of a single program thread [Lipasti et al. 1996; Lipasti and Shen 1997; Chrysos and Emer 1998; Rychlik et al. 1998].

Multithreading pursues a different set of solutions by utilizing coarse-grained parallelism [Iannucci et al. 1994]. The notion of a thread in the context of multithreaded processors differs from

the notion of threads in multithreaded operating systems. In the case of a multithreaded processor, a thread is always viewed as a hardware-supported thread which can be—depending on the specific form of multithreaded processor—a full program (single-threaded UNIX process), an operating system thread (a light-weighted process, e.g., a POSIX thread), a compiler-generated thread (subordinate microthread, microthread, nanothread, etc.), or even a hardware-generated thread.

1.1. Explicit and Implicit Multithreading

The solution surveyed in this paper is the utilization of coarser-grained parallelism by explicit multithreaded processors that interleave the execution of instructions of different user-defined threads (operating system threads or processes) in the same pipeline. Multiple program counters are available in the fetch unit and the multiple contexts are often stored in different register sets on the chip. The execution units are multiplexed between the thread contexts that are loaded in the register sets. The latencies that arise in the computation of a single instruction stream are filled by computations of another thread. Thread-switching is performed automatically by the processor due to a hardware-based thread-switching policy. This ability is in contrast to conventional processors or today's superscalar processors, which use a time-consuming, operating system-based thread switch.

Depending on the specific multithreaded processor design, either a single-issue instruction pipeline (as in scalar processors) is used, or multiple instructions from possibly different instruction streams are issued simultaneously. The latter are called simultaneous multithreaded (SMT) processors and combine the multithreading technique with a wide-issue superscalar processor such that the full issue bandwidth is more often utilized by potentially issuing instructions from different threads simultaneously.

A different approach increases the performance of sequential programs by applying thread-level speculation. A thread in such processor proposals refers to any contiguous region of the static or dynamic instruction sequence. We call such processors implicit multithreaded processors, which refers to any processor that can dynamically generate threads from single-threaded programs and execute such speculative threads concurrently with the lead thread. In case of misspeculation, all speculatively generated results must be squashed. Threads generated by implicit multithreaded processors are always executed speculatively, in contrast to the threads in explicit multithreaded processors.

Examples of implicit multithreaded processor proposals are the multiscalar processor [Franklin 1993; Sohi et al. 1995; Sohi 1997; Vijaykumar and Sohi 1998], the trace processor [Rotenberg et al. 1997; Smith and Vajapeyam 1997; Vajapeyam and Mitra 1997], the single-program speculative multithreading architecture [Dubey et al. 1995], the superthreaded architecture [Tsai and Yew 1996; Li et al. 1996], the dynamic multithreading processor [Akkary and Driscoll 1998], and the speculative multithreaded processor [Marcuello et al. 1998]. Some approaches, in particular the multiscalar scheme, use compiler support for thread generation. In these examples a multithreaded processor may be characterized by a single processing unit with a single or multiple-issue pipeline able to process instructions of different threads concurrently. As a result, some of these approaches may rather be viewed as very closely coupled chip multiprocessors, because multiple subordinate processing units execute different threads under control of a single sequencer unit.

1.2. Multithreading and Multiprocessors

The first multithreaded processor approaches in the 1970s and 1980s applied multithreading at user-thread level to solve the memory access latency problem. This problem arises for each memory access after a cache miss—in particular, when a processor of a shared-memory

multiprocessor accesses a shared-memory variable located in a remote-memory module. The latency becomes a problem if the processor spends a large fraction of its time sitting idle and waiting for remote accesses to complete [Culler et al. 1998]. Latencies that arise in a pipeline are defined with a wider scope—for example, covering also long-latency operations like div or latencies due to branch interlocking.

Older multithreaded processor approaches from the 1980s usually extend scalar RISC processors by a multithreading technique and focus on effectively bridging very long remote memory access latencies. Such processors will only be useful as processor nodes in distributed shared-memory multiprocessors. However, developing a processor that is specifically designed for such multiprocessors is commonly regarded as too expensive. Multiprocessors today comprise standard off-the-shelf microprocessors and almost never specifically designed processors (with the exception of the Cray MTA). Therefore, newer multithreaded processor approaches also strive for tolerating all latencies (even single-cycle latencies) that arise from primary cache misses that hit in secondary cache, from long-latency operations, or even from unpredictable branches.

1.3. Multithreading and Dataflow

Another root of multithreading comes from dataflow architectures. Viewed from a dataflow perspective, a single-threaded architecture is characterized by computation that conceptually moves forward one step at a time through a sequence of states, each step corresponding to the execution of one enabled instruction. According to Dennis and Gao [1994], a multithreaded architecture differs from a single-threaded architecture in that there may be several enabled instructions from different threads which all are candidates for execution. A thread is viewed as a sequentially ordered block of instructions with a grain-size greater than 1 (to distinguish multithreaded architectures from fine-grained dataflow architectures). Blocking and nonblocking threads are distinguished. A nonblocking thread is formed such that its evaluation proceeds without blocking the processor pipeline (for instance, by remote memory accesses, cache misses, or synchronization waits). Evaluation of a nonblocking thread starts as soon as all input operands are available, which is usually detected by some kind of dataflow principle. Thread switching is controlled by the compiler harnessing the idea of rescheduling, rather than blocking, when waiting for data. Access to remote data is organized in a split-phase manner by one thread sending the access request to memory and another thread activating when its data is available. Thus a program is compiled into many, very small threads activating each other when data become available. The same hardware mechanisms may also be used to synchronize interprocess communications to awaiting threads, thereby alleviating operating systems overhead. In contrast, a blocking thread might be blocked during execution by remote memory accesses, cache misses, or synchronization needs. The waiting time, during which the pipeline is blocked, is lost when using a von Neumann processor, but can be efficiently bridged by a fast context switch to another thread in a multithreaded processor. Switching to another thread in a single-threaded processor usually exhibits too much context switching overhead to mask the latency efficiently. The original thread can be resumed when the reason for blocking is removed.

Use of nonblocking threads typically leads to many small threads that are appropriate for execution by a hybrid dataflow computer [Šilc et al. 1998] or by a multithreaded architecture that is closely related to hybrid dataflow. Blocking threads may just be the threads (e.g., P(OSIX) threads or Solaris threads) or whole UNIX processes of a multithreaded UNIX-based operating system, but may also be even smaller microthreads


[Figure 1 shows the taxonomy of explicit multithreading:
  issuing from a single thread in a cycle — interleaved multithreading (IMT), blocked multithreading (BMT);
  issuing from multiple threads in a cycle — simultaneous multithreading (SMT).]
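The techniques in Figure 1 differ only in which thread may feed the pipeline in a given cycle. As a rough illustration (our own sketch, not part of the survey; thread counts and switch events are made up), the following model emits the index of the thread that owns each cycle under the IMT and BMT policies:

```python
# Hypothetical sketch: which thread feeds the pipeline each cycle.
from itertools import cycle

def imt_schedule(threads, cycles):
    """Interleaved MT: a different thread is selected every processor cycle."""
    rr = cycle(range(threads))
    return [next(rr) for _ in range(cycles)]

def bmt_schedule(events, threads, cycles):
    """Blocked MT: stay on one thread until a latency event triggers a
    context switch. `events` is the set of cycles at which the running
    thread blocks (e.g., a cache miss)."""
    schedule, current = [], 0
    for c in range(cycles):
        if c in events:
            current = (current + 1) % threads  # context switch
        schedule.append(current)
    return schedule

print(imt_schedule(4, 8))          # [0, 1, 2, 3, 0, 1, 2, 3]
print(bmt_schedule({3, 6}, 4, 8))  # [0, 0, 0, 1, 1, 1, 2, 2]
```

An SMT policy would instead pick instructions from several of these threads within the same cycle, which is an issue-slot decision rather than a cycle-ownership decision.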

Fig. 1. Explicit multithreading.

generated by a compiler to utilize the potentials of a multithreaded processor.

Note that we exclude in this survey hybrid dataflow architectures that are designed for the execution of nonblocking threads. Although these architectures are often called multithreaded, we have categorized them in a previous paper [Šilc et al. 1998] as threaded dataflow or large-grain dataflow because a dataflow principle is applied to the execution of nonblocking threads. Thus, multithreaded architectures (in the more narrow sense applied here) stem from the modification of scalar RISC, VLIW/EPIC, or superscalar processors.

2. CLASSIFICATION OF EXPLICIT MULTITHREADING TECHNIQUES

Explicit multithreaded processors fall into two categories depending on whether they issue instructions from only a single thread or from multiple threads in a given cycle.

If instructions can be issued only from a single thread in a given cycle, the following two principal techniques of explicit multithreading are used (see Figure 1):

—Interleaved multithreading (IMT): An instruction of another thread is fetched and fed into the execution pipeline at each processor cycle (see Section 3).

—Blocked multithreading (BMT): The instructions of a thread are executed successively until an event occurs that may cause latency. This event induces a context switch (see Section 4).

When instructions can be issued from multiple threads in a given cycle, the following technique can be used:

—Simultaneous multithreading (SMT): Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor. Thus, the wide superscalar instruction issue is combined with the multiple-context approach (see Section 6).

Before we present these techniques in detail, we briefly review the main architectural approaches that integrate instruction-level parallelism and thread-level parallelism.

A way to look at latencies that arise in a pipelined execution is the opportunity cost in terms of the instructions that might be processed while the pipeline is interlocked, for example, waiting for a remote reference to return. The opportunity cost of single-issue processors is the number of cycles lost by latencies. Multiple-issue processors (e.g., superscalar, VLIW, etc.) potentially execute more than 1 IPC, and thus the opportunity cost is the product of the latency


cycles and the issue bandwidth plus the number of issue slots not fully filled. We expect that future single-threaded processors will continue to exploit further superscalar or other multiple-issue techniques, and thus further increase the opportunity cost of remote-memory accesses.

Fig. 2. Different approaches possible with single-issue (scalar) processors: (a) single-threaded scalar, (b) interleaved multithreading scalar, (c) blocked multithreading scalar.

Figure 2 demonstrates the opportunity cost of different approaches possible with scalar processors, while the different approaches possible with multiple-issue processors are given in Figure 3. In all these cases instructions from only a single thread are issued in a given cycle. The opportunity cost in a single-threaded superscalar can be easily determined as the number of empty issue slots (see Figure 3(a)). It consists of horizontal losses (the number of empty places in a not fully filled issue slot) and the even more harmful vertical losses (cycles where no instructions can be issued). In VLIW processors, horizontal losses appear as no-op operations (not shown in Figure 3). The opportunity cost of single-threaded VLIW is about the same as that of single-threaded superscalar. An IMT superscalar (or IMT VLIW) processor is able to fill the vertical losses of the single-threaded models with instructions of other threads, but not the horizontal losses. Further design possibilities, such as BMT superscalar or BMT VLIW processors, would fill several succeeding cycles with instructions of the same thread before context switching. The switching event is more difficult to implement and a context-switching overhead of one to several cycles might arise.

Let us now examine processors that can issue instructions from multiple threads in a given cycle. In addition to the SMT processors, one can include in this category (by the widest of definitions) also the chip multiprocessor (CMP) approach. While SMT uses a monolithic design with resources shared among threads, CMP proposes a distributed design that uses a collection of independent processors with less resource sharing. For example, Figure 4(a) shows a four-threaded eight-issue SMT processor, and Figure 4(b) shows a CMP with four two-issue processors. The SMT processor exploits ILP by selecting instructions from any thread (four in this case) that can potentially issue. If one thread has high ILP, it may fill all horizontal slots depending on the issue strategy of the processor. If multiple threads each have low ILP, instructions of several threads can be issued and executed simultaneously. In the CMP processor with several multiple-issue processors (four two-issue processors in this case) on a single chip, each CPU is assigned a thread from which it can issue multiple instructions each cycle (up to two in this case). Thus, each CPU has the same opportunity cost as in a two-issue superscalar model. The CMP is not able to hide latencies by issuing instructions of other threads. However, because horizontal losses will be smaller for two-issue than for high-bandwidth superscalars, a CMP of four two-issue processors will reach a higher utilization than an eight-issue superscalar processor.

There are a number of tradeoffs to consider when choosing between SMT and CMP. The present report does not survey CMP design, so these tradeoffs are outside its scope (but see Section 6 and the conclusions).

3. INTERLEAVED MULTITHREADING

3.1. Principles

The interleaved multithreading (IMT) technique (also called fine-grain multithreading) means that the processor


Fig. 3. Different approaches possible with multiple-issue processors: (a) single-threaded superscalar, (b) interleaved multithreading superscalar, (c) blocked multithreading superscalar.
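The horizontal and vertical losses illustrated in Figure 3 can be counted directly from an issue-slot diagram. The sketch below is our own illustration (the 4-issue trace is made up, not data from the paper):

```python
# Hypothetical illustration: count horizontal and vertical losses in an
# issue-slot diagram. Each entry is one cycle of an `issue_width`-wide
# processor, giving how many instructions were actually issued that cycle.
def losses(issued_per_cycle, issue_width):
    # vertical losses: cycles in which nothing at all could be issued
    vertical = sum(1 for n in issued_per_cycle if n == 0)
    # horizontal losses: empty places in cycles that did issue something
    horizontal = sum(issue_width - n for n in issued_per_cycle if n > 0)
    return horizontal, vertical

trace = [4, 2, 0, 0, 3, 1]   # made-up single-threaded superscalar trace
h, v = losses(trace, 4)
print(h, v)                  # 6 horizontal slot losses, 2 vertical (lost) cycles
```

In this model an IMT superscalar could reclaim the two vertical cycles with instructions of other threads, while only SMT could also attack the six horizontal losses.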

Fig. 4. Issuing from multiple threads in a cycle: (a) simultaneous multithreading, (b) chip multiprocessor.

switches to a different thread after each instruction fetch. In principle, an instruction of a thread is fed into the pipeline after the retirement of the previous instruction of that thread. Since IMT eliminates control and data dependences between instructions in the pipeline, pipeline hazards cannot arise and the processor pipeline can be easily built without the necessity of complex forwarding paths. This leads to a very simple and therefore potentially very fast pipeline—no hardware interlocking or data forwarding is necessary. Moreover, the context-switching overhead is zero cycles. Memory latency is tolerated by not scheduling a thread until the memory transaction has completed. This model requires at least as many threads as pipeline stages in the processor. Interleaving the instructions from many threads limits the processing power accessible to a single thread, thereby degrading the single-thread performance. Two possibilities to overcome this deficiency are the following:

1. The dependence look-ahead technique adds several bits to each instruction format in the ISA. The additional opcode bits allow the compiler to state the


number of instructions directly following in program order that are not data- or control-dependent on the instruction being executed. This allows the instruction scheduler in the IMT processor to feed non-data- or control-dependent instructions of the same thread successively into the pipeline. The dependence look-ahead technique may be applied to speed up single-thread performance or to increase machine utilization in the case where a workload does not comprise enough threads.

2. The interleaving technique proposed by Laudon, Gupta, and Horowitz [1994] adds caching and full pipeline interlocks to the IMT technique. Contexts may be interleaved on a cycle-by-cycle basis, yet a single-thread context is also efficiently supported.

The most well-known examples of IMT processors are used in the Heterogeneous Element Processor (HEP), the Horizon, and the Cray Multi-Threaded Architecture (MTA) multiprocessors (see Section 5.1). The HEP system has up to 16 processors, while the other two consist of up to 256 processors. Each of these processors supports up to 128 threads. While HEP uses instruction look-ahead only if there is no other work, the Horizon and Cray MTA employ the explicit dependence look-ahead technique. Further IMT processor proposals include the Multilisp Architecture for Symbolic Applications (MASA), the MIT M-Machine, the MediaProcessor of MicroUnity, and the processor SB-PRAM/HPP (all covered in the next section). An example of an IMT network processor is the five-threaded Lextra LX4580 (see Section 5.2).

In principle, the IMT technique can also be combined with a superscalar instruction issue, but simulations confirm the intuition that SMT is the more efficient technique [Eggers et al. 1997].

3.2. Examples of Past Commercial Machines and of Research Prototypes

HEP. The Heterogeneous Element Processor system, designed by Burton Smith and developed by Denelcor Inc., in Denver, CO, between 1978 and 1985, was a pioneering example of a multithreaded machine [Smith 1981]. The HEP system [Smith 1985] was designed to have up to 16 processors (Figure 5) with up to 128 threads per processor. The 128 threads were supported by a large number of registers dynamically allocated to threads. The processor pipeline had eight stages, matching the minimum number of processor cycles necessary to fetch a data item from memory to a register. Consequently, up to eight threads were in execution concurrently within a single HEP processor. However, the pipeline did not allow more than one memory, branch, or divide instruction to be in the pipeline at a given time. Allowing only a single memory instruction in the pipeline at a time resulted in poor single-thread performance and required a very large number of threads. If the thread queues were all empty, the next instruction from the last thread dispatched was examined for independence from previous instructions and, if found to be so, the instruction was also issued.

In contrast to all other IMT processors, all threads within a HEP processor shared the same register set. Multiple processors and data memories were interconnected via a pipelined switch, and any register-memory or data-memory location could be used to synchronize two processes on a producer–consumer basis by a full/empty bit synchronization on a data memory word.

MASA. The Multilisp Architecture for Symbolic Applications [Halstead and Fujita 1988] was an IMT processor proposal for parallel symbolic computation with various features intended for effective Multilisp [Halstead 1985] program execution. MASA featured a tagged architecture, multiple contexts, fast trap handling, and a synchronization bit in every memory word. Its principal novelty was the use of multiple contexts both to support interleaved execution from separate instruction streams and to speed up procedure calls and trap handling (in the same manner as register windows).
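The relation between pipeline depth and the number of threads an IMT processor needs can be seen in a toy model (our own illustration, not Denelcor's design): if each thread may have at most one instruction in flight over an eight-stage pipeline, a lone thread can issue only once every eight cycles, while eight threads keep the pipeline full.

```python
# Toy model of HEP-style interleaving: round-robin thread selection over an
# 8-stage pipeline, with at most one instruction per thread in flight.
def utilization(num_threads, stages=8, cycles=1000):
    in_flight = [0] * num_threads  # cycle at which each thread's instruction retires
    issued = 0
    for c in range(cycles):
        t = c % num_threads                  # round-robin selection
        if in_flight[t] <= c:                # nothing of this thread in the pipeline
            in_flight[t] = c + stages        # occupy the pipeline for `stages` cycles
            issued += 1
    return issued / cycles                   # fraction of cycles with an issue

print(utilization(1))   # 0.125 -- a lone thread uses one cycle in eight
print(utilization(8))   # 1.0   -- eight threads saturate the eight-stage pipeline
```

This is exactly why the HEP's restriction to a single memory instruction in the pipeline demanded a very large number of threads.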


Fig. 5. The HEP processor.
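The full/empty-bit synchronization on memory words used by HEP (and, as noted below, by Horizon, the Cray MTA, and the M-Machine) can be sketched in software as follows; this is an illustrative model only, and the class and method names are our own invention:

```python
# Illustrative sketch of full/empty-bit synchronization: a memory word
# carries a presence bit; reading an "empty" word would suspend the
# consumer thread until a producer writes the word and sets it "full".
class TaggedWord:
    def __init__(self):
        self.full = False      # the full/empty (presence) bit
        self.value = None

    def write_full(self, value):
        """Producer: store the value and set the bit to full."""
        self.value, self.full = value, True

    def read_empty(self):
        """Consumer: succeed only if the bit is full, then take the value
        and reset the word to empty (a real machine would instead suspend
        the reading thread until the word becomes full)."""
        if not self.full:
            return None        # consumer would block here
        v, self.full = self.value, False
        return v

w = TaggedWord()
print(w.read_empty())   # None: word still empty, consumer would block
w.write_full(42)
print(w.read_empty())   # 42: producer-consumer handshake completed
```

In hardware the "block" branch triggers a thread switch rather than a busy wait, which is what makes this a cheap, word-granularity producer–consumer mechanism.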

M-Machine. The MIT M-Machine [Fillo et al. 1995] supports both public and private registers for each thread, and uses the IMT technique. Each processor supports four hardware-resident user V-threads, and each V-thread supports four resident H-threads. All the H-threads in a given V-thread share the same address space, and each H-thread instruction is a three-wide VLIW. Event and exception handling are each performed by a separate V-thread. Swapping a processor-resident V-thread with one stored in memory requires about 150 cycles (1.5 μs). The M-Machine (like HEP, Horizon, and Cray MTA) employs full/empty bits for efficient, low-level synchronization. Moreover, it supports message passing and guarded pointers with segmentation for access control and memory protection.

MicroUnity MediaProcessor. MicroUnity [Hansen 1996] proposed its MediaProcessor in 1996 as a processor specialized for video streaming applications. A superscalar processor kernel is enhanced by group instructions dedicated to video streaming applications and by five thread contexts. Instructions of the different threads are executed in the IMT fashion, issuing instructions with a proposed 1-GHz clock frequency and thus providing five independent 200-MHz threads. A very light-weight context switch upon synchronous exceptions and asynchronous events permits rapid handling of virtual memory system exceptions and I/O events.

SB-PRAM and HPP. The SB-PRAM (SB stands for Saarbrücken and PRAM for parallel random access machine) [Bach et al. 1997] or high-performance PRAM (HPP) [Formella et al. 1996] is a multiple instruction multiple data (MIMD) parallel computer with shared address space and uniform memory access time due to its motivation: building a multiprocessor that is


Fig. 6. Blocked multithreading.

as close as possible to the theoretical machine model CRCW-PRAM (CRCW stands for concurrent read, concurrent write). Processor and memory modules are connected by a pipelined combining butterfly network. Network latency is hidden by pipelining several so-called virtual processors on one physical processor node in IMT mode. Instructions of 32 virtual processors are interleaved round-robin in a single SB-PRAM processor, which is therefore classified as a 32-threaded IMT processor. The project is in progress at the University of Saarland, Saarbrücken, Germany. A first prototype was running with four processors, and recently a 64-processor model, which in total supports up to 2048 threads, has been completed. Hardware details and performance measurements are reported by Paul, Bach, Bosch, Fischer, Lichtenau, and Röhrig [2002].

4. BLOCKED MULTITHREADING

4.1. Principles

The blocked multithreading (BMT) technique (sometimes also called coarse-grain multithreading) means that a single thread is executed until it reaches a situation that triggers a context switch. Usually such a situation arises when the instruction execution reaches a long-latency operation or a situation where a latency may arise. Compared to the IMT technique, a smaller number of threads is needed and a single thread can execute at full speed until the next context switch. If only a single thread runs on a BMT processor, there are no context switches and the processor performs just like a standard processor without multithreading.

In the following we classify BMT processors by the event that triggers a context switch (Figure 6) [Kreuzinger and Ungerer 1999]:

1. Static models: A context switch occurs each time the same instruction is executed in the instruction stream. The context switch is encoded by the compiler. The main advantage of this technique is that context switching can be triggered already in the fetch stage of the pipeline. The context switching overhead is one cycle (if the fetched instruction triggers the context switch and is discarded), zero cycles (if the fetched instruction triggers a context switch but is still executed in the pipeline), or almost zero cycles (if a context switch buffer is applied; see the Rhamma processor below in this section). There are two main variations of the static BMT technique:

—The explicit-switch static model, where an explicit context switch instruction exists. This model is simple to implement. However, the additional instruction causes one-cycle context switching overhead.

—The implicit-switch model, where each instruction belongs to a specific instruction class and a context switch decision depends on the instruction class of the fetched instruction. Instruction classes that cause a context


switch include load, store, and branch instructions.

The switch-on-load static model switches after each load instruction to bridge memory access latency. However, assuming an on-chip D-cache, the thread switch occurs more often than necessary, which makes an extremely fast context switch necessary, preferably with zero-cycle context switch overhead.

The switch-on-store static model switches after store instructions. The model may be used to support the implementation of sequential consistency so that the next memory access instruction can only be performed after the store has completed in memory.

The switch-on-branch static model switches after branch instructions. The model can be applied to simplify processor design by renouncing branch prediction and speculative execution. The branch misspeculation penalty is avoided, but single-thread performance is decreased. However, it may be effective for programs with a high percentage of branches that are hard to predict or even are unpredictable.

2. Dynamic models: The context switch is triggered by a dynamic event. In general, all the instructions between the fetch stage and the stage that triggers the context switch are discarded, leading to a higher context switch overhead than in static context switch models. Several dynamic models can be defined:

—The switch-on-cache-miss dynamic model switches the context if a load or store misses in the cache. The idea is that only those loads that miss in the cache and those stores that cannot be buffered have long latencies and cause context switches. Such a context switch is detected in a late stage of the pipeline. A large number of subsequent instructions have already entered the pipeline and must be discarded. Thus the context switch overhead is considerably increased.

—The switch-on-signal dynamic model switches context on the occurrence of a specific signal, for example, signaling an interrupt, trap, or message arrival.

—The switch-on-use dynamic model switches when an instruction tries to use the still missing value from a load (which, for example, missed in the cache). For example, when a compiler schedules instructions so that a load from shared memory is issued several cycles before the value is used, the context switch should not occur until the actual use of the value and will not occur at all if the load completes before the register is referenced. To implement the switch-on-use model, a valid bit is added to each register (by a simple form of scoreboard). The bit is cleared when a load to the corresponding register is issued and set when the result returns from the network. A thread switches context if it needs a value from a register whose valid bit is still cleared. This model can also be seen as a lazy model that extends either the switch-on-load static model (called lazy-switch-on-load) or the switch-on-cache-miss dynamic model (called lazy-switch-on-cache-miss).

—The conditional-switch dynamic model couples an explicit switch instruction with a condition. The context is switched only when the condition is fulfilled; otherwise the context switch is ignored. A conditional-switch instruction may be used, for example, after a group of load/store instructions. The context switch is ignored if all load instructions (in the preceding group) hit the cache; otherwise, the context switch is performed. Moreover, a conditional-switch instruction could also be added between a group of loads and their subsequent use to realize a lazy context switch (instead of implementing the switch-on-use model).

The explicit-switch, conditional-switch, and switch-on-signal techniques enhance the ISA by additional instructions. The implicit-switch technique may favor a

specific ISA encoding to simplify instruction class detection. All other techniques are microarchitectural techniques without the necessity of ISA changes.
A previous classification [Boothe and Ranade 1992; Boothe 1993] concerns the BMT technique only in a shared-memory multiprocessor environment and is restricted to only a few of the models of the BMT technique described above. In particular, the switch-on-load in [Boothe and Ranade 1992] switches only on instructions that load data from remote memory, while storing data in remote memory does not cause context switching. Likewise, the switch-on-miss model is defined so that the context is only switched if a load from remote memory misses in the cache.
Several well-known processors use the BMT technique. The MIT Sparcle [Agarwal et al. 1993] and the MSparc processors [Mikschl and Damm 1996] use both switch-on-cache-miss and switch-on-signal dynamic models. The CHoPP 1 [Mankovic et al. 1987] uses the switch-on-cache-miss and switch-on-use dynamic models, while Rhamma [Gruenewald and Ungerer 1996, 1997] applies several static and dynamic models of BMT. The EVENTS mechanism and the Komodo proposal apply multithreading for real-time event handling. These, as well as some other processors using the BMT technique, are described in the next section, while the current commercial high-performance processors—that is, the two-threaded IBM RS64 IV, the Sun MAJC—and the network processors of the Intel IXP, IBM PowerNP, Vitesse IQ2x00, and AMCC nP families are covered in Section 5.

4.2. Examples of Past Commercial Machines and of Research Prototypes

CHoPP 1. The Columbia Homogeneous Parallel Processor 1 (CHoPP 1) [Mankovic et al. 1987] was designed by CHoPP Sullivan Computer Corporation (ANTs since 1999) in 1987. The system was a shared-memory MIMD with up to 16 powerful processors. High sequential performance is due to issuing multiple instructions on each clock cycle, zero-delay branch instructions, and fast execution of individual instructions. Each processor can support up to 64 threads and uses the switch-on-cache-miss dynamic interleaving model.

MDP in J-Machine. The MIT Jellybean Machine (J-Machine) [Dally et al. 1992] is so-called because it is to be built entirely of a large number of "jellybean" components. The initial version uses an 8 x 8 x 16 cube network, with possibilities of expanding to 64K nodes. The "jellybeans" are message-driven processor (MDP) chips, each of which has a 36-bit processor, a 4K word memory, and a router with communication ports for bidirectional transmissions in three dimensions. External memory of up to 1M words can be added per processor. The MDP creates a task for each arriving message. In the prototype, each MDP chip has four external memory chips that provide 256K memory words. However, access is through a 12-bit data bus, and with an error correcting cycle, the access time is four memory cycles per word. Each communication port has a 9-bit data channel. The routers provide support for automatic routing from source to destination. The latency of a transfer is 2 μs across the prototype machine, assuming no blocking. When a message arrives, a task is created automatically to handle it in 1 μs. Thus, it is possible to implement a shared-memory model using message passing, in which a message provides a fetch address and an automatic task sends a reply with the desired data.

MIT Sparcle. The MIT Sparcle processor [Agarwal et al. 1993] is derived from a SPARC RISC processor. The eight overlapping register windows of a SPARC processor are organized as four independent nonoverlapping thread contexts, each using two windows, one as a register set, the other as a context for trap and message handlers (Figure 7).
Context switches are used only to hide long memory latencies since small pipeline delays are assumed to be hidden by the proper ordering of instructions by


an optimizing compiler. The MIT Sparcle processor switches to another context in the case of a remote cache miss or a failed synchronization (switch-on-cache-miss and switch-on-signal strategies). Thread switching is triggered by external hardware, that is, by the cache/memory controller. Reloading of the pipeline and the software implementation of the context switch requires 14 processor cycles.

Fig. 7. Sparcle register usage.

The MIT Alewife distributed shared-memory multiprocessor [Agarwal et al. 1995] is based on the multithreaded MIT Sparcle processor. The multiprocessor has been operational since May 1994. A node in the Alewife multiprocessor comprises a Sparcle processor, an external floating-point unit, cache, and a directory-based cache controller that manages cache-coherence, a network router chip, and a memory module (Figure 8).

Fig. 8. An Alewife node.

The Alewife multiprocessor uses a low-dimensional direct interconnection network. Despite its distributed-memory architecture, Alewife allows efficient shared-memory programming through a multilayered approach to locality management. Communication latency and network bandwidth requirements are reduced by a directory-based cache-coherence scheme referred to as LimitLESS directories. Latencies still occur although communication locality is enhanced by run-time and compile-time partitioning and placement of data and processes.

Rhamma. The Rhamma processor [Gruenewald and Ungerer 1996, 1997] was developed at the University of Karlsruhe, Germany, as an experimental microprocessor that bridges all kinds of latencies by a very fast context switch. The execution unit (EU) and load/store unit (LSU) are decoupled and work concurrently on different threads. A number of register sets used by different threads are accessed by the LSU as well as the EU. Both units are connected by FIFO buffers for continuations, each denoting the thread tag and the instruction pointer

of a thread. The Rhamma processor combines several static and dynamic models of the BMT technique.
The EU of the Rhamma processor switches the thread whenever a load/store or control-flow instruction is fetched. Therefore, if the instruction format is chosen appropriately, a context switch can be recognized by predecoding the instructions already during the instruction fetch stage.
In the case of a load/store, the continuation is stored in the FIFO buffer of the LSU, a new continuation is fetched from the EU's FIFO buffer, and the context is switched to the register set given by the new thread tag and the new instruction pointer. The LSU loads and stores data of a different thread in a register set concurrently to the work of the EU. There is a loss of one processor cycle in the EU for each sequence of load/store instructions. Only the first load/store instruction forces a context switch in the EU; succeeding load/store instructions are executed in the LSU.
The loss of one processor cycle in the EU when fetching the first of a sequence of load/store instructions can be avoided by a so-called context switch buffer (CSB) whenever the sequence is executed the second time. The CSB is a hardware buffer, collecting the addresses of load/store instructions that have caused context switches. Before fetching an instruction from the I-cache, the instruction fetch stage checks whether the address can be found in the context switch buffer. In that case the thread is switched immediately and an instruction of another thread is fetched instead of the load/store instruction. No bubble occurs.

PL/PS-Machine. The Preload and Poststore Machine (or PL/PS-Machine) [Kavi et al. 1997] is most similar to the Rhamma processor. It also decouples memory accesses from thread execution by providing separate units. This decoupling eliminates thread stalls due to memory accesses and makes thread switches due to cache misses unnecessary. Threads are created when all data is preloaded into the register set holding the thread's context, and the results from an execution thread are poststored. Threads are nonblocking and each thread is enabled when the required inputs are available (i.e., data-driven at a coarse grain). The separate load/store/sync processor performs preloads and schedules ready threads on the pipeline. The pipeline processor executes the threads which will require no memory accesses. On completion the results from the thread are poststored by the load/store/sync processor.

The DanSoft nanothreading approach. The nanothreading and the microthreading approaches use multithreading but spare the hardware complexity of providing multiple register sets.
Nanothreading [Gwennap 1997], proposed for the DanSoft processor, dismisses full multithreading for a nanothread that executes in the same register set as the main thread. The DanSoft nanothread requires only a 9-bit program counter and some simple control logic. Whenever the processor stalls on the main thread, it automatically begins fetching instructions from the nanothread. Only one register set is available, so the two threads must share the register set. Typically the nanothread will focus on a simple task, such as prefetching data into a buffer, that can be done asynchronously to the main thread.
In the DanSoft processor nanothreading is used to implement a new branch strategy that fetches both sides of a branch. A static branch prediction scheme is used, where branch instructions include 3 bits to direct the instruction stream. The bits specify eight levels of branch direction. For the middle four cases, denoting low confidence on the branch prediction, the processor fetches from both the branch target and the fall-through path. If the branch is mispredicted in the main thread, the backup path executed in the nanothread generates a misprediction penalty of only one to two cycles.
The DanSoft processor proposal is a dual-processor CMP, each processor featuring a VLIW instruction set and the nanothreading technique. Each processor

contains an integer processor, but the two processor cores share a floating-point unit as well as the system interface. However, the nanothread technique might also be used to fill the instruction issue slots of a wide superscalar approach as in SMT.

Microthreading. The microthreading technique [Bolychevsky et al. 1996] is similar to nanothreading. All threads share the same register set and the same run-time stack. However, the number of threads is not restricted to two. When a context switch arises, the program counter is stored in a continuation queue. The PC represents the minimum possible context information for a given thread. Microthreading is proposed for a modified RISC processor.
Both techniques—nanothreading as well as microthreading—are proposed in the context of a BMT technique, but might also be used to fill the instruction issue slots of a wide superscalar approach as in SMT.
The drawback to nanothreading and microthreading is that the compiler has to schedule registers for all threads that may be active simultaneously, because all threads execute in the same register set. The solution to this problem has been recently described in Jesshope and Luo [2000] and Jesshope [2001]. These papers describe the dynamic allocation of registers using vector instruction sets and also the ease with which the architecture can be developed as a CMP.

MSparc and the EVENTS mechanism. An approach similar to the MIT Sparcle processor was taken at the University of Oldenburg, Germany, with the MSparc processor [Mikschl and Damm 1996]. MSparc supports up to four contexts on a chip and is compatible with standard SPARC processors. Switching is supported by hardware and can be achieved within one processor cycle. However, a four-cycle overhead is introduced due to pipeline refill. The multithreading policy is BMT with the switch-on-cache-miss policy as in the MIT Sparcle processor.
The rapid context-switching ability of the multithreading technique leads to approaches that apply multithreading in new areas, in particular embedded real-time systems, which are proposed by the EVENTS approach and by the Komodo project.
The EVENTS scheduler [Lüth et al. 1997; Metzner and Niehaus 2000] acts as a processor-external thread scheduler by supervising several MSparc processors. Context switches are triggered due to real-time scheduling techniques and the computation-intensive real-time threads are assigned to the different MSparc processors. The EVENTS mechanism is implemented as a field of field-programmable gate arrays (FPGAs).

Komodo microcontroller. The Komodo microcontroller [Brinkschulte et al. 1999a, 1999b] implements real-time scheduling on an instruction-by-instruction basis deeply embedded in the multithreaded processor core, in contrast to the processor-external EVENTS approach. The Komodo microcontroller is a multithreaded microcontroller aimed at embedded real-time systems with a hardware event handling mechanism that allows handling of simultaneous overlapping events with hard real-time requirements. The main purpose for the use of multithreading within the Komodo microcontroller is not latency utilization, but extremely fast reactions on real-time events.
Real-time Java threads are used as interrupt service threads (ISTs)—a new hardware event handling mechanism that replaces the common interrupt service routines (ISRs). The occurrence of an event activates an assigned IST instead of an ISR. The Komodo microcontroller supports the concurrent execution of multiple ISTs and zero-cycle overhead context switching, and triggers ISTs by a switch-on-signal context switching strategy.
As shown in Figure 9, the four-stage pipelined processor core consists of an instruction-fetch unit, a decode unit, an operand fetch unit, and an execution unit. Four register sets are provided on the processor chip. A Java bytecode instruction is decoded either to a single


micro-op or a sequence of micro-ops, or a trap routine is called. Each micro-op is propagated through the pipeline with its thread id. Micro-ops from multiple threads can be simultaneously present in the different pipeline stages.

Fig. 9. Block diagram of the Komodo microcontroller.

The instruction fetch unit holds four program counters (PCs) with dedicated status bits (e.g., thread active/suspended); each PC is assigned to a separate thread. Four portions are fetched over the memory interface and put in the appropriate instruction window (IW). Several instructions may be contained in the fetch portion because of the average Java bytecode length of 1.8 bytes. Instructions are fetched depending on the status bits and fill levels of the IWs.
The instruction decode unit contains the IWs, dedicated status bits (e.g., priority), and counters for the implementation of the proportional share scheme. A priority manager decides from which IW the next instruction will be decoded.
The real-time scheduling algorithms fixed priority preemptive, earliest deadline first, least laxity first, and guaranteed percentage (GP) scheduling are implemented in the priority manager for next instruction selection [Kreuzinger et al. 2000]. The GP scheduling scheme assigns portions of the processor execution power to each hardware thread and allows a strict isolation of the different real-time threads—an urgently demanded feature of real-time systems. GP scheduling is only efficient if implemented within a multithreaded processor [Brinkschulte et al. 2002].
The priority manager applies one of the implemented scheduling schemes for IW selection. However, latencies may result from branches or memory accesses. To avoid pipeline stalls, instructions from threads of less priority can be fed into the pipeline. The decode unit predicts the latency after such an instruction, and proceeds with instructions from other IWs.
External signals are delivered to the signal unit from the components of the microcontroller core as, for example, timer, counter, or serial interface. By the occurrence of such a signal, the corresponding IST will be activated. As soon as an IST activation ends, its assigned real-time thread is suspended and its status is stored. An external signal may activate the same thread again.
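The GP idea above can be sketched in a few lines. The selection rule (most under-served thread first) and the 100-cycle interval are illustrative assumptions, not the priority manager's actual hardware implementation:

```python
# Sketch of guaranteed percentage (GP) scheduling: each hardware thread
# is guaranteed a share of the processor's execution power within a
# fixed interval, chosen instruction by instruction. The selection rule
# here (largest remaining allocation first) is an illustrative stand-in
# for the priority manager.

def gp_schedule(shares, cycles=100):
    remaining = dict(shares)                   # e.g., {"A": 50, "B": 30, "C": 20}
    executed = {t: 0 for t in shares}
    for _ in range(cycles):
        t = max(remaining, key=remaining.get)  # most under-served thread next
        if remaining[t] > 0:
            remaining[t] -= 1
            executed[t] += 1
    return executed

out = gp_schedule({"A": 50, "B": 30, "C": 20})
assert out == {"A": 50, "B": 30, "C": 20}      # every thread got its percentage
```

Because the decision is made every cycle, a thread can never starve another beyond its guaranteed share, which is the strict isolation property the text mentions.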


Further investigations within the Komodo project [Brinkschulte et al. 2000] have covered real-time garbage collection on a multithreaded microcontroller [Fuhrmann et al. 2001], and a distributed real-time middleware called OSA+ [Brinkschulte et al. 2000].
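The switch-on-signal IST activation described for Komodo can be sketched as follows; the event names and the two-state thread model are illustrative assumptions, not Komodo's actual design:

```python
# Sketch of interrupt service threads (ISTs): an external signal
# activates the real-time thread bound to it (instead of invoking an
# ISR); when the IST activation ends, the thread is suspended again
# and can be reactivated by the next signal.

class ISTController:
    def __init__(self, bindings):
        self.bindings = bindings                  # event -> thread id
        self.state = {t: "suspended" for t in bindings.values()}

    def signal(self, event):
        t = self.bindings[event]                  # switch-on-signal activation
        self.state[t] = "active"
        return t

    def activation_ends(self, thread):
        self.state[thread] = "suspended"          # status stored, reactivatable

c = ISTController({"timer": "IST0", "serial": "IST1"})
assert c.signal("timer") == "IST0"
assert c.state["IST0"] == "active"
c.activation_ends("IST0")
assert c.state["IST0"] == "suspended"
assert c.signal("timer") == "IST0"                # same thread activated again
```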

5. INTERLEAVED AND BLOCKED MULTITHREADING IN CURRENT AND PROPOSED MICROPROCESSORS

5.1. High-Performance Processors

Cray MTA. The Cray Multi-Threaded Architecture (MTA) computer [Alverson et al. 1990] features a powerful VLIW instruction set and interleaved multithreading technique. It was designed by the Tera Computer Company, Seattle, WA, now Cray Corporation (Tera purchased Cray and adopted the name).
The MTA inherits much of the design philosophy of the HEP as well as a (paper) architecture called Horizon [Thistle and Smith 1988]. The latter was designed for up to 256 processors and up to 512 memory modules in a 16 x 16 x 6-node internal network. Horizon (like HEP) employed a global address space and memory-based synchronization through the use of full/empty bits at each location. Each processor supported 128 independent instruction streams by 128 register sets with context switches occurring at every clock cycle. Unlike HEP, Horizon allowed multiple memory operations from a thread to be in the pipeline simultaneously.
MTA systems are constructed from resource modules, each of which contains up to six resources. A resource can be a computational processor (CP), an I/O processor (IOP), an I/O cache (IOC) unit, and either two or four memory units (MUs) (Figure 10). Each resource is individually connected to a separate routing node in the depopulated three-dimensional (3-D) torus interconnection network [Almasi and Gottlieb 1994]. Each routing node has three or four communication ports and a resource port. There are several routing nodes per CP, rather than the several CPs per routing node. In particular, the number of routing nodes is at least p^(3/2), where p is the number of CPs. This allows the bisection bandwidth to scale linearly with p, while the network latency scales as p^(1/2). The communication network is capable of supporting data transfers to and from memory on each clock tick in both directions, as are all of the links between the routing nodes themselves.

Fig. 10. The Cray MTA computer system.

The Cray MTA custom chip CP (Figure 11) is a VLIW pipelined processor using the IMT technique. Each thread is associated with one 64-bit stream status word, thirty-two 64-bit general registers, and eight 64-bit target registers. The processor may switch context every cycle (3-ns cycle period) between as many as 128 distinct threads (called streams by the designers of the MTA), thereby hiding up to 128 cycles (384 ns) of memory latency. Since the context switching is so fast, the processor has no time to swap the processor state. Instead, it has 128 copies of everything, that is, 128 stream status words, 4096 general registers, and 1024 target registers. Dependences between instructions are explicitly encoded by the compiler using explicit dependence look-ahead. Each instruction contains a 3-bit look-ahead field that explicitly specifies how many instructions from this thread will be issued before encountering an instruction that depends on the current one. Since seven is the maximum possible look-ahead value, at most eight instructions (i.e., 24 operations) from each


thread can be concurrently executing in different stages of a processor's pipeline. In addition, each thread can issue as many as eight memory references without waiting for earlier ones to finish, further augmenting the memory latency tolerance of the processor. The CP has a load/store architecture with three addressing modes and 32 general-purpose 64-bit registers. The three-wide VLIW instructions are 64 bits. Three operations can be executed simultaneously per instruction: a memory reference operation (M-op), an arithmetic/logical operation (A-op), and a branch or simple arithmetic/logical operation (C-op). If more than one operation in an instruction specifies the same register or setting of condition codes, the priority is M-op > A-op > C-op.

Fig. 11. The MTA computational processor.

Every processor has a clock register that is synchronized with its counterparts in the other processors and counts up once per cycle. In addition, the processor counts the total number of unused instruction issue slots (measuring the degree of underutilization of the processor) and the time-integral of the number of instruction streams ready to issue (measuring the degree of overutilization of the processor). All three counters are user-readable in a single unprivileged operation. Eight counters are implemented in each of the protection domains of the processor. All are user-readable in a single unprivileged operation. Four of these counters accumulate the number of instructions issued, the number of memory references, the time-integral of the number of instruction streams, and the time-integral of the number of messages in the network. These counters are also used for accounting. The other four counters are configurable to accumulate events from any four of a large set of additional sources within the processor, including memory operations, jumps, traps, and so on.
Thus, the Cray MTA exploits parallelism at all levels, from fine-grained ILP within a single processor to parallel programming across processors, to multiprogramming among several applications simultaneously. Consequently, processor

Cray MTA-2 system was Single-threaded-program execution shipped in late December 2001. may be accelerated by speculative mul- tithreading with microthreads that are HTMT Architecture Project. In 1997, the dependent on nonspeculative threads Jet Propulsion Laboratory in collabo- [Tremblay et al. 2000]. Virtual channels ration with several other institutions communicate shared register values (The California Institute of Technology, between different microthreads with Princeton University, University of Notre produced-consumer synchronization. In Dame, University of Delaware, The State case of misspeculation, the speculative University of New York at Stonybrook, microthread is discarded. Tera Computer Company (now Cray), Argonne National Laboratories) initiated IBM RS64 IV. IBM developed a mul- the Hybrid Technology MultiThreaded tithreaded PowerPC processor, which is (HTMT) Architecture Project whose used in the IBM iSeries and pSeries com- is to design the first petaflops computer mercial servers. The processor—originally by 2005–2007. The computer will be based code-named SStar—is called RS64 IV and on radically new hardware became available for purchase in the such as superconductive processors, op- fourth quarter of 2000. Because of its tical holographic storages, and optical optimization for commercial work- packet switched networks [ 1997]. load (i.e., on-line transaction processing, There will be 4096 superconductive pro- resource planning, Web serv- cessors, called . A SPELL processor ing, and collaborative groupware) with consists of 16 multistream units, where typically large and function-rich applica- each multistream unit is a 64-bit, deeply tions, a two-threaded block-interleaving pipelined integer processor capable of ex- approach with a switch-on-cache-miss ecuting up to eight parallel threads. As a model was chosen. 
A so-called thread- result, each SPELL is capable of running switch buffer, which holds up to eight in- in parallel up to 128 threads, arranged in structions from the background thread, 16 groups of 8 threads [Dorojevets 2000]. reduces the cost of pipeline reload after

a thread switch. These instructions may be introduced into the pipeline immediately after a thread switch. A significant throughput increase by multithreading while adding less than 5% to the chip area was reported in Borkenhagen et al. [2000].

5.2. Network Processors

Currently multithreading proves successful in network processors. Network processors are expected to become the fundamental building blocks for networks in the same fashion that microprocessors are for today's personal computers [Glaskowsky 2002]. Network processors offer real-time processing of multiple data streams, enhanced security, and Internet protocol (IP) packet handling and forwarding capabilities.
Network processors apply multithreading to bridge latencies during memory accesses. Hard real-time events, that is, events whose deadline should never be missed, are requirements for network processors that need specific instruction scheduling during multithreaded execution. Multithreading is typically restricted to the internal architecture of the parallel working picoprocessors or microengines that perform the data traffic handling.

Intel IXP network processors. The Intel IXP1200 network processor [Intel Corporation 2002]—the first in the Intel Internet Exchange Architecture (Intel IXA) network processor family that uses multithreading—combines an Intel StrongARM processor core with six programmable multithreaded microengines. The IXP2400 network processor is based on an Intel XScale core with eight multithreaded microengines, and the IXP2800 contains even 16 multithreaded microengines [Intel Corporation 2002]. The microengines (four-threaded BMT) are used for packet forwarding and traffic management, whereas the XScale processor core is applied for control tasks. Data and event signals are shared among threads and microengines at virtual zero latency while maintaining coherence. The IXP2800 is claimed to be capable of enabling 10-Gb/s wire-speed processing (Gb = gigabits).

IBM PowerNP network processor. The IBM PowerNP NP4GS3 network processor [IBM Corporation 1999] integrates a switching engine, search engine, frame processors, and multiplexed medium access controllers (MACs). The NP4GS3 includes an embedded IBM PowerPC 405 processor core, 16 programmable multithreaded picoprocessors, 40 Fast Ethernet/4-Gb MACs, and hardware accelerators performing functions such as tree searches, frame forwarding, frame filtering, and frame alteration.
The multithreaded picoprocessors are used for packet manipulation while the PowerPC processor core handles control functions. Each picoprocessor is a 32-bit, 133-MHz RISC core that supports two threads in the hardware and includes nine hardwired function units for tasks such as string copying, bandwidth-policy checking, and check summing.
The aggregated bandwidth of the NP4GS3 is some 4 Gb/s, allowing it to manage a single OC-48 optical channel or up to 40 Fast Ethernet ports [Glaskowsky 2002].

Vitesse IQ2x00 network processors. The Vitesse IQ2x00 family consists of IQ2000 and IQ2200 network processors, whose architecture is based on multiple packet processors. Each packet processor is a five-threaded BMT that switches on cache miss [Glaskowsky 2002].

Lexra network processors. The Lexra network processor employs packet processors, which are used in parallel to implement IP layer processing in software. The current packet processor is the LX4580 [Gelinas et al. 2002]. It implements the MIPS32 architecture, including recent architectural enhancements, along with instructions for optimized packet processing. By contrast with the Intel IXP and Vitesse IQ2x00, the LX4580 uses the IMT approach (with five threads).

AMCC nP network processors. AMCC's nP network processor family is based on the


MMC Networks' nPcore multithreaded packet processor. Each processor in this family consists of one or several nPcore processing units that are connected to an external search engine and a CPU [Glaskowsky 2002]. In particular, the nP7250 network processor has two nPcores with 2.4 Gb/s aggregated bandwidth. The current top of AMCC's line is the nP7510, whose six nPcore processing units offer 10 Gb/s.

Other network processors. A number of other network processors, such as the C-5 and the Fast Pattern Processor (as part of Agere's PayloadPlus chip set), seem to include multithreading features. Unfortunately, it is often hard to find out, because multithreading is not the main feature of network processors and is often hardly mentioned. The market for network processors is extremely competitive, so the above-mentioned processors will soon be replaced by their successors.

6. SIMULTANEOUS MULTITHREADING

6.1. Principles

IMT and BMT are techniques which are most efficient when applied to scalar RISC or VLIW processors. Combining multithreading with the superscalar technique naturally leads to a technique where all hardware contexts are active simultaneously, competing each cycle for all available resources. This technique, called simultaneous multithreading (SMT), inherits from superscalars the ability to issue multiple instructions each cycle; and like multithreaded processors it contains hardware resources for multiple contexts. The result is a processor that can issue multiple instructions from multiple threads each cycle. Therefore, not only can unused cycles in the case of latencies be filled by instructions of alternative threads, but so can unused issue slots within one cycle.

Thread-level parallelism can come from either multithreaded, parallel programs or from multiple, independent programs in a multiprogramming workload, while ILP is utilized from the individual threads. Because an SMT processor simultaneously exploits coarse- and fine-grain parallelism, it uses its resources more efficiently and thus achieves better throughput and speedup than single-threaded superscalar processors for multithreaded (or multiprogramming) workloads. The tradeoff is a slightly more complex hardware organization.

If all (hardware-supported) threads of an SMT processor always execute threads of the same process, preferably in single-program multiple-data fashion, a unified (primary) I-cache may prove useful, since the code can be shared between the threads. The primary D-cache may be unified or separated between the threads depending on the access mechanism used.

The SMT technique combines a wide superscalar instruction issue with multithreading by providing several register sets on the processor and issuing instructions from several instruction queues simultaneously. Therefore, the issue slots of a wide-issue processor can be filled by operations of several threads. Latencies occurring in the execution of single threads are bridged by issuing operations of the remaining threads loaded on the processor. In principle, the full issue bandwidth can be utilized.

The resources of SMT processors can be organized in two ways:

1. Resource sharing: Instructions of different threads share all resources like the fetch buffer, the physical registers for renaming registers of different register sets, the instruction window, and the reorder buffer. Thus SMT adds minimal hardware complexity to conventional superscalars; hardware designers can focus on building a fast single-threaded superscalar and add multithread capability on top. The complexity added to superscalars by multithreading includes the thread tag for each internal instruction representation, multiple register sets, and the abilities of the fetch and the retire units to fetch/retire instructions of different threads.


2. Resource replication: The second organizational form replicates all internal buffers of a superscalar such that each buffer is bound to a specific thread. Instruction fetch, decode, rename, and retire units may be multiplexed between the threads or be duplicated themselves. The issue unit is able to issue instructions of different instruction windows simultaneously to the execution units. This form of organization adds more changes to the organization of superscalar processors but leads to a natural partitioning of the instruction window and simplifies the issue and retire stages.

The fetch unit of an SMT processor can take advantage of the interthread competition for instruction bandwidth in two ways. First, it can partition this bandwidth among the threads and fetch from several threads each cycle. In this way, it increases the probability of fetching only nonspeculative instructions. Second, the fetch unit can be selective about which threads it fetches. For example, it may fetch those that will provide the most immediate performance benefit.

The main drawback to SMT may be that it complicates the issue stage, which is always central to the multiple threads. A functional partitioning as required by the on-chip wire delays of future microprocessors is not easily achieved with an SMT processor due to the centralized instruction issue. A separation of the thread queues as in the SMT approaches with resource replication is a possible solution, although it does not remove the central instruction issue.

Closely related to SMT are architectural proposals that only slightly enhance a superscalar processor by the ability to pursue two or more threads only for a short time. In principle, predication is the first step in this direction. An enhanced form of predication is able to issue and execute predicated instructions even if the predicate is not yet resolved. A further step is dynamic predication [Klauser et al. 1998a] as applied for the Polypath architecture [Klauser et al. 1998b], which is a superscalar enhanced to handle multiple threads internally. Another step to multithreading is simultaneous subordinate microthreading [Chappell et al. 1999], which is a modification of superscalars to run threads at microprogram level concurrently. A new microthread is spawned either event-driven by hardware or by an explicit spawn instruction. The subordinate microthread could be used, for example, to improve branch prediction of the primary thread or to preload data.

A competing approach to SMT is represented by the chip multiprocessors (CMPs) [Ungerer et al. 2002]. A CMP integrates two or more complete processors on a single chip. Every unit of a processor is replicated and used independently of its copies on the chip. CMP is easier to implement, but only SMT has the ability to hide latencies. Examples of CMPs are the Texas Instruments TMS320C8x [Texas Instruments 1994], the Hydra [Hammond and Olukotun 1998], the Compaq Piranha [Barroso et al. 2000], and the IBM POWER4 chip [Tendler et al. 2002].

Projects simulating different configurations of SMT are discussed next, while SMT approaches that can be found in current microprocessors are given in Section 7.

6.2. Examples of Prototypes and Simulated SMT Processors

MARS-M. The MARS-M multithreaded computer system [Dorozhevets and Wolcott 1992] was developed and manufactured within the Russian Next-Generation Computer Systems program during 1982–1988. The MARS-M was the first system where the SMT technique was implemented in a real design. The system has a decoupled multiprocessor architecture with execution, address, control, memory, and peripheral multithreaded subsystems working in parallel and communicating via multiple register FIFO-queues. The execution and address subsystems are multiple-unit VLIW processors with SMT. Within each of these two subsystems up to four threads can run in parallel on their hardware contexts

allocated by the control subsystem, while sharing the subsystem's set of pipelined functional units and resolving resource conflicts on a cycle-by-cycle basis. The control processor uses IMT with a zero-overhead context switch upon issuing a memory load operation. In total, up to 12 threads can run simultaneously within MARS-M with a peak instruction issue rate of 26 instructions per cycle. Medium-scale integration (MSI) and large-scale integration emitter-coupled logic (LSI ECL) elements with a minimum delay of 2.5 ns are used to implement processor logic. There are 739 boards, each of which can contain up to 100 ICs mounted on both sides. The MARS-M hardware is water-cooled and occupies three cabinets. The initial clock period of 100 ns was increased to 108 ns at the debugging stage.

Matsushita Media Research Laboratory processor. The multithreaded processor of the Media Research Laboratory of Matsushita Electric Ind. (Japan) was another pioneering approach to SMT [Hirata et al. 1992]. Instructions of different threads are issued simultaneously to multiple execution units. Simulation results on a parallel ray-tracing application showed that using eight threads, a speedup of 3.22 in the case of one load/store unit, and of 5.79 in the case of two load/store units, can be achieved over a conventional RISC processor. However, caches or translation look-aside buffers are not simulated, nor is a branch prediction mechanism.

"Multistreamed superscalar" at the University of Santa Barbara. Serrano et al. [1994] and Yamamoto and Nemirovsky [1995] at the University of Santa Barbara, CA, extended the interleaved multithreading (then called multistreaming) technique to a general-purpose superscalar processor architecture and presented an analytical model of multithreaded superscalar performance, backed up by simulation.

SMT processor at the University of Washington. The SMT processor architecture, proposed by Tullsen, Eggers, and Levy [1995] at the University of Washington, Seattle, WA, surveys enhancements of the Alpha 21164 processor. Simulations were conducted to evaluate processor configurations of an up to eight-threaded and eight-issue superscalar. This maximum configuration showed a throughput of 6.64 IPC due to multithreading, using the SPEC92 benchmark suite and assuming a processor with 32 execution units (among them multiple load/store units). Descriptions and the results are based on the multiprogramming model.

The next approach was based on a hypothetical out-of-order instruction issue superscalar microprocessor that resembles the MIPS R10000 and HP PA-8000 [Tullsen et al. 1996; Eggers et al. 1997]. Both approaches followed the resource-sharing organization aiming at an only slight hardware enhancement of a superscalar processor. The latter approach evaluated more realistic processor configurations, and presented implementation issues and solutions to register file access and instruction scheduling for a minimal change to superscalar processor organization.

In the simulations of the latter architectural model (Figure 12), eight threads and an eight-issue superscalar organization are assumed. Eight instructions are decoded, renamed, and fed to either the integer or floating-point instruction window. Unified buffers are used. When operands become available, up to eight instructions are issued out of order per cycle, executed, and retired. Each thread can address 32 architectural integer (and floating-point) registers. These registers are renamed to a large physical register file of 356 physical registers. The larger register file requires a longer access time. To avoid increasing the processor cycle time, the pipeline is extended by two stages to allow two-cycle register reads and two-cycle writes. Renamed instructions are placed into one of two instruction windows. The 32-entry integer instruction window handles integer and all load/store instructions, while the 32-entry floating-point instruction window handles


Fig. 12. SMT processor architecture.

floating-point instructions. Three floating-point and six integer units are assumed. All execution units are fully pipelined, and four of the integer units also execute load/store instructions. The I- and D-caches are multiported and multibanked, but common to all threads.

The multithreaded workload consists of a program mix of SPEC92 benchmark programs that are executed simultaneously as different threads. The simulations evaluate different fetch and instruction issue strategies.

An RR.2.8 fetching scheme to access the multiported I-cache—that is, in each cycle, eight instructions are fetched in round-robin policy from each of two different threads—was superior to other schemes like RR.1.8, RR.4.2, and RR.2.4 with less fetching capacity. As a fetch policy, the ICOUNT feedback technique, which gives the highest fetch priority to the threads with the fewest instructions in the decode, renaming, and queue pipeline stages, proved superior to the BRCOUNT scheme, which gives the highest priority to those threads that are least likely to be on a wrong path, and the MISSCOUNT scheme, which gives priority to the threads that have the fewest outstanding D-cache misses. The IQPOSN policy, which gives the lowest priority to the oldest instructions by penalizing those threads with instructions closest to the head of either the integer or the floating-point queue, is nearly as good as ICOUNT and better than BRCOUNT and MISSCOUNT, which are all better than round-robin fetching. The ICOUNT.2.8 fetching strategy reached an IPC of about 5.4 (the RR.2.8 only reached about 4.2). Most interesting is that the best fetching strategy turned out to address neither mispredicted branches nor blocking due to cache misses alone, but a mix of both and perhaps some other effects.

In a single-threaded processor, choosing instructions for issue that are least likely to be on a wrong path is always achieved by selecting the oldest instructions, those deepest into the instruction window. For the SMT processor, several different issue strategies have been evaluated, like oldest instructions first, speculative instructions last, and branches first. Issue bandwidth is not a bottleneck on these simulated machines and all strategies seem to perform equally well, so the simplest mechanism is to be preferred. Also, doubling the size of instruction windows (but not the number of searchable instructions for issue) has no significant effect on the IPC. Even an infinite number of execution units increases throughput by only 0.5%.
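The ICOUNT.2.8 heuristic above can be made concrete with a small sketch: each cycle, the two threads with the fewest instructions in the pre-issue (decode, rename, and queue) stages get fetch priority, and up to eight instruction slots are filled from them in priority order. This is a minimal illustration under stated assumptions, not the simulator's exact bookkeeping; the function and parameter names are invented for the example.

```python
# Minimal sketch of ICOUNT.2.8 fetching: give fetch priority to the
# threads with the fewest instructions in the pre-issue pipeline
# stages, select two threads per cycle, and fill eight fetch slots
# from them in priority order. Names and tie-breaking are assumptions.

def icount_fetch(inflight, available, n_threads=2, width=8):
    """inflight: thread id -> instructions in decode/rename/queue stages.
    available: thread id -> instructions the thread can supply this cycle.
    Returns thread id -> instructions fetched this cycle."""
    # ICOUNT: least-represented threads in the pre-issue stages come first.
    order = sorted(inflight, key=lambda t: inflight[t])[:n_threads]
    fetched, slots = {}, width
    for t in order:
        take = min(slots, available[t])
        if take > 0:
            fetched[t] = take
            slots -= take
    return fetched

# Thread 1 has the fewest in-flight instructions, so it fetches first;
# thread 3 (next fewest) fills the remaining slots of the 8-wide fetch.
print(icount_fetch({0: 10, 1: 2, 2: 7, 3: 4}, {0: 8, 1: 5, 2: 8, 3: 8}))
# → {1: 5, 3: 3}
```

The same skeleton expresses the other policies by swapping the sort key, for example sorting by outstanding D-cache misses for MISSCOUNT.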


Further research looked at compiler techniques for SMT [Lo et al. 1997] and at extracting threads of a single program designed for multithreaded execution on an SMT [Tullsen et al. 1999]. The SMT processor has also been evaluated with database workloads [Lo et al. 1998], achieving roughly a threefold increase in instruction throughput with an eight-threaded SMT over a single-threaded superscalar with similar resources.

Wallace et al. [1998] presented the threaded multipath execution model, which exploits existing hardware on an SMT processor to execute simultaneously alternate paths of a conditional branch in a thread. Threaded Multiple Path Execution employs eager execution of branches in an SMT processor model. It extends the SMT processor by introducing additional hardware to check for unused processor resources (unused hardware threads), a confidence estimator, mechanisms for fast starting and finishing threads, priorities attached to the threads, support for speculatively executed memory access operations, and an additional bus for distributing the contents of the register mapping table (mapping synchronization bus, MSB). If the hardware detects that a number of processor threads are not processing useful instructions, the confidence estimator is used to decide whether only one continuation of a conditional branch should be followed (high confidence) or both continuations should be followed simultaneously (low confidence). The MSB is used to provide a thread that starts execution of one continuation speculatively with the valid register mapping. Such register mappings between different register sets incur an overhead of four to eight cycles. Such speculative execution increases single-program performance by 14–23%, depending on the misprediction penalty, for programs with a high branch misprediction rate. Wallace et al. [1999] explored instruction recycling on the proposed multipath SMT processor.

Irvine multithreaded superscalar. This multithreaded superscalar processor approach, developed at the University of California at Irvine, combines out-of-order execution within an instruction stream with the simultaneous execution of instructions of different instruction streams [Gulati and Bagherzadeh 1996; Loikkanen and Bagherzadeh 1996]. A particular superscalar processor called the Superscalar Digital Signal Processor (SDSP) is enhanced to run multiple threads. The enhancements are directed by the goal of minimal modification to the superscalar base processor. Therefore, most resources on the chip are shared by the threads, as for instance the register file, reorder buffer, instruction window, store buffer, and renaming hardware. Based on simulations, a performance gain of 20–55% due to multithreading was achieved across a range of benchmarks.

Pontius and Bagherzadeh [1999] evaluated a multithreaded superscalar processor model using several video decode, picture processing, and signal filter programs as workloads. The programs were parallelized at the source code level by partitioning the main loop and distributing the loop iterations to several threads. The relatively low speedup that was reported for multithreading resulted from algorithmic restrictions and from the already high IPC in the single-threaded model. The latter was only possible because multimedia instructions were not used. Otherwise a large part of the IPC in the single-threaded model would be hidden by the SIMD parallelism within the multimedia instructions.

SMV processor. The Simultaneous Multithreaded Vector (SMV) architecture [Espasa and Valero 1997], designed at the Polytechnic University of Catalunya (Barcelona, Spain), combines simultaneous multithreaded execution and out-of-order execution with an integrated vector unit and vector instructions. Figure 13 depicts the SMV architecture. The fetch engine selects one of eight threads and fetches four instructions on its behalf. The decoder renames the instructions, using a per-thread rename table, and then sends all instructions to several common execution queues. Inside the queues, the


Fig. 13. Simultaneous Multithreaded Vector architecture. FPU = floating-point unit; ALU = arithmetic logic unit; VFU = vector functional unit.

instructions of different threads are indistinguishable, and no thread information is kept except in the reorder buffer and memory queue. Register names preserve all dependences. Independent threads use independent rename tables, which prevents false dependences and conflicts from occurring. The vector unit has 128 vector registers, each holding 128 64-bit registers, and has four general-purpose independent execution units. The number of registers is the product of the number of threads and the number of physical registers required to sustain good performance on each thread.

Karlsruhe multithreaded superscalar processor. While the SMT processor proposed by Tullsen, Eggers, and Levy [1995] surveys enhancements of the Alpha 21164 processor, the multithreaded superscalar approach of Sigmund and Ungerer [1996a, 1996b] is based on a simplified PowerPC 604 processor, combining multiple pipelines that were dedicated to different threads with a simultaneous issue unit (resource replication).

Using an instruction mix with 20% load and store instructions, the performance results showed, for an eight-issue processor with four to eight threads and

a single load/store unit, that two instruction fetch units, two decode units, four integer units, 16 rename registers, four register ports, and completion queues with 12 slots are sufficient. The single load/store unit proved the principal bottleneck. The multithreaded superscalar processor (eight-threaded, eight-issue) is able to completely hide latencies caused by 4-2-2-2 burst cache refills.² It reaches the maximum throughput of 4.2 IPC that is possible with a single load/store unit.

SMT multimedia processor. Subsequent SMT research at the University of Karlsruhe has explored resource replication-based models for an SMT processor with multimedia enhancements [Oehring et al. 1999a, 1999b]. Related research by Pontius and Bagherzadeh [1999] and Wittenburg et al. [1999] did not use multimedia operations. Hand-coding was applied to transfer a commercial MPEG-2 video decoding algorithm to the SMT multimedia processor model and to program the algorithm in multithreaded fashion. The research processor model assumes a wide-issue superscalar processor, and enhances it by the SMT technique, by multimedia units, and by an additional on-chip RAM storage.

The most surprising finding was that smaller reservation stations for the thread unit and the global and the local load/store units, as well as smaller reorder buffers, increased the IPC value for the multithreaded models. Intuition suggests better performance with larger buffers. However, large reservation stations (and large reorder buffers) draw too many highly speculative instructions into the pipeline. Smaller buffers limit the speculation depth of fetched instructions, so that only nonspeculative instructions or instructions with low speculation depth are fetched, decoded, issued, and executed in an SMT processor. An abundance of speculative instructions may decrease the performance of an SMT processor. Another reason for the negative effect of large reorder buffers and of large reservation stations for the load/store and thread control units lies in the fact that those instructions have a long latency and typically have two to three integer instructions as consumers. The effect is that the consuming integer instructions eat up space in the integer reservation station, thus blocking instructions from other threads from entering it. This multiplication effect is made even worse by a nonspeculative execution of store and thread control instructions.

Subsequent research by Sigmund et al. [2000] investigated a cost/benefit analysis of various SMT multimedia processor models. Transistor count and chip space estimations (applying the tool described by Steinhaus et al. [2001]) showed that the additional hardware cost of a four-threaded, eight-issue SMT processor over a single-threaded processor is a 2% increase in transistor count and a 9% increase in chip space for a 288 million transistor chip, which yields a threefold speedup. The small-scale four-threaded, eight-issue SMT models with only 24 million transistors require a 9% increase in transistor count and a 27% increase in chip space, resulting in a 1.5-fold speedup over the superscalar base model.

Similar results were reached by Burns and Gaudiot [2002], who estimated the layout area for SMT. They identified which layout blocks are affected by SMT, determined the scaling of chip space requirements, and compared SMT versus single-threaded processor space requirements by scaling an R10000-based layout to 0.18-micron technology.

Simultaneous multithreading for signal processors. Wittenburg et al. [1999] looked at the application of the SMT technique to signal processors, using combining instructions that are applied to registers of several register sets simultaneously instead of multimedia operations. Simulations with a Hough transformation as workload showed a speedup of up to six compared to a single-threaded processor without multimedia extensions.

² 4-2-2-2 assumes that four 64-bit portions are transmitted over the memory bus, the first portion reaching the processor four cycles after the cache miss indication, the next portion two cycles later, etc.
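The 4-2-2-2 refill timing described in the footnote can be checked in a few lines; the cycle numbers follow directly from its description (first 64-bit portion four cycles after the miss indication, each further portion two cycles after the previous one). The function name is invented for the example.

```python
# Arrival cycles, counted from the cache-miss indication, of the four
# 64-bit portions in a 4-2-2-2 burst cache refill: 4, then +2, +2, +2.
def burst_arrivals(first=4, step=2, portions=4):
    arrivals, cycle = [], 0
    for i in range(portions):
        cycle += first if i == 0 else step
        arrivals.append(cycle)
    return arrivals

print(burst_arrivals())  # → [4, 6, 8, 10]
```

The last portion thus arrives 10 cycles after the miss, which is the latency the eight-threaded, eight-issue model is reported to hide completely.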


Simultaneous multithreading and power consumption. Simultaneous multithreading can also be applied for reduction of power consumption. In principle, the same power reduction techniques are applicable for SMT processors as for superscalars. When an SMT processor is simultaneously executing threads, the per-cycle use of the processor resources should noticeably increase, offering fewer opportunities for power reduction via traditional techniques such as clock gating. However, an SMT processor, specifically designed to execute multiple threads in a power-aware manner, provides additional options for power-aware design [Brooks et al. 2000].

Mispredictions cost energy because the speculatively executed instructions must be squashed. Contemporary superscalar processors discard approximately 60% of the fetched and 30% of the executed instructions due to misspeculations. In case of a power-aware design, the issue slots of an SMT processor could preferably be filled by less speculative instructions of other threads. Harnessing the extra parallelism provided by multiple threads allows the processor to rely much less on speculation. Fewer resources are utilized by speculative instructions and more parallelism is exploited. Moreover, a simpler branch unit design could be used. This inherently implies a greater power efficiency per thread. The simulations of Seng et al. [2000] showed that by using an appropriate scheduler up to 22% less power is consumed by an SMT processor than by a comparable superscalar.

7. SIMULTANEOUS MULTITHREADING IN CURRENT MICROPROCESSORS

Alpha 21464. Compaq unveiled its Alpha EV8 21464 proposal in 1999 [Emer 1999], a four-threaded eight-issue SMT processor that closely resembles the SMT processor proposed by Tullsen et al. [1999]. The processor proposal featured out-of-order execution, a large on-chip secondary cache, a direct RAMBUS interface, and an on-chip router for system interconnect of a directory-based, cache-coherent NUMA (nonuniform memory access) multiprocessor. This 250 million transistor chip was planned for year 2003. In the meantime, the project has been abandoned, as Compaq sold the Alpha processor technology to Intel.

Blue Gene. Simultaneous multithreading is mentioned as a processor technique for the building block of the IBM Blue Gene system—a 5-year effort to build a petaflops supercomputer started in December 1999 [Allen et al. 2001].

Sun UltraSPARC V. Also the Sun UltraSPARC V processor has been announced to exploit thread-level parallelism by symmetric multithreading—that is, SMT. The UltraSPARC V will be able to switch between two different modes depending on the type of work—one mode for heavy duty calculations and the other for business transactions such as database operations [Lawson and Vance 2002].

Hyper-Threading Technology in the Intel Xeon processor. Intel's Hyper-Threading Technology [Marr et al. 2002] proposes SMT for the Pentium 4-based Intel Xeon processor family to be used in dual and multiprocessor servers. Hyper-Threading Technology makes a single physical processor appear as two logical processors by applying a two-threaded SMT approach. Each logical processor maintains a complete set of the architecture state, which consists of the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. Logical processors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic, and buses. Each logical processor has its own APIC. Interrupts sent to a specific logical processor are handled only by that logical processor.

When one logical processor is stalled, the other logical processor can continue to make forward progress. A logical processor may be temporarily stalled for a variety of reasons, including servicing cache misses, handling branch mispredictions, or waiting for the results of previous instructions.
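The split between duplicated architecture state and shared physical resources can be sketched as follows; the class and field names are illustrative assumptions for the example, not Intel's definitions.

```python
# Sketch of the Hyper-Threading resource split: each logical processor
# duplicates the architecture state (registers, APIC), while caches and
# execution units belong to the single physical core and are shared.
# Class and field names are illustrative assumptions.

class ArchState:
    """State duplicated per logical processor."""
    def __init__(self):
        self.general_purpose = {}  # general-purpose registers
        self.control = {}          # control registers
        self.apic = {}             # each logical processor has its own APIC

class PhysicalProcessor:
    """One physical core exposing two logical processors."""
    def __init__(self):
        self.cache = {}                            # shared resource
        self.logical = [ArchState(), ArchState()]  # duplicated state

    def write_reg(self, lp, reg, value):
        # An architectural write touches only that logical processor.
        self.logical[lp].general_purpose[reg] = value

cpu = PhysicalProcessor()
cpu.write_reg(0, "eax", 7)    # visible only to logical processor 0
cpu.cache["line42"] = "data"  # a cache fill is visible to both
```

A scheduler stalling logical processor 0 leaves logical processor 1's state and progress untouched, which is the independence property the text describes.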


Independent forward progress was en- The branch prediction structures are ei- sured by managing buffering queues such ther duplicated or shared. The branch his- that no logical processor can use all the tory buffer used to look up the global his- entries when two active software threads tory array is also tracked independently were executing. for each logical processor. However, the The buffering queues that separate ma- large global array is a shared jor pipeline logic blocks are either parti- structure with entries that are tagged tioned or duplicated to ensure indepen- with a logical processor ID. The return dent forward progress through each logic stack buffer, which predicts the target of block. If only one active software thread return instructions, is duplicated. is active, the thread should run at the When both threads are decoding in- same speed on a processor with Hyper- structions simultaneously, the streaming Threading Technology as on a processor buffers alternate between threads so that without this capability. This means that both threads share the same decoder logic. partitioned resources should be recom- The decode logic has to keep two copies of bined when only one software thread is all the state needed to decode IA-32 in- active, thus applying a flexible resource- structions for the two logical processors sharing model. even though it only decodes instructions In the Xeon, as in the Pentium 4, in- for one logical processor at a time. In gen- structions generally come from the execu- eral, several instructions are decoded for tion trace cache (TC), which replaces the one logical processor before switching to primary I-cache. Two sets of instruction the other logical processor. pointers independently track the progress The op queue that decouples the front- of the two executing software threads. The end from the out-of-order execution engine TC entries are tagged with thread infor- is partitioned such that each logical pro- mation. 
The two logical processors arbi- cessor has half the entries. The allocator trate access to the TC every clock cycle. logic takes ops from the op queue and If both logical processors want access to allocates many of the key machine buffers the TC at the same time, access is granted needed to execute each op, including the to one then the other in alternating clock 126 reorder buffer entries, 128 integer and cycles. If one logical processor is stalled or 128 floating-point physical registers, and is unable to use the TC, the other logical 48 load and 24 store buffer entries. If there processor can use the full bandwidth of the are ops for both logical processors in the TC, every cycle. op queue, the allocator will alternate se- If there is a TC miss, instruction bytes lecting ops from the logical processors ev- need to be fetched from the secondary ery clock cycle to assign resources. Each cache and decoded into ops (microopera- logical processor can use up to a maxi- tions) to be placed in the TC. Each logical mum of 63 reorder buffer entries, 24 load processor has its own instruction transla- buffers, and 12 store buffer entries. tion look-aside buffer and its own set of in- Since each logical processor must main- struction pointers to track the progress of tain and track its own complete architec- instruction fetch for the two logical proces- ture state, there are two register ta- sors. The instruction fetch logic in charge bles for , one for each of sending requests to the secondary cache logical processor. The register renaming arbitrates on a first-come first-served ba- process is done in parallel to the alloca- sis, while always reserving at least one tor logic described above, so the register request slot for each logical processor. In rename logic works on the same ops to this way, both logical processors can have which the allocator is assigning resources. fetches pending simultaneously. 
Once μops have completed the allocation and register rename processes, they are placed into the memory instruction queue or the general instruction queue, respectively. The two sets of queues are also partitioned such that μops from each logical processor can use at most half the entries.
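This half-and-half static partitioning can be pictured with a minimal Python sketch; the class and its interface are ours, purely for illustration.

```python
class PartitionedQueue:
    """Toy model of a queue statically partitioned between two logical
    processors: each may occupy at most half of the entries."""

    def __init__(self, entries):
        self.limit = entries // 2      # each logical processor's share
        self.occupancy = [0, 0]
        self.items = []

    def push(self, lp, uop):
        if self.occupancy[lp] >= self.limit:
            return False               # this LP's half is full: it stalls,
        self.occupancy[lp] += 1        # but the other LP is unaffected
        self.items.append((lp, uop))
        return True

    def pop(self):
        lp, uop = self.items.pop(0)    # oldest entry leaves the queue
        self.occupancy[lp] -= 1
        return lp, uop
```

The point of the partitioning is visible in the sketch: one logical processor filling its half returns `False` (a stall) without consuming entries the other processor is entitled to.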

The memory instruction queue and the general instruction queue send μops to the five scheduler queues as fast as they can, alternating between μops for the two logical processors every clock cycle, as needed.

Each scheduler has its own scheduler queue of 8 to 12 entries from which it selects μops to send to the execution units. The schedulers choose μops regardless of whether they belong to one logical processor or the other; they are effectively oblivious to logical processor distinctions. The μops are simply evaluated based on dependent inputs and the availability of execution resources. For example, the schedulers could dispatch two μops from one logical processor and two μops from the other logical processor in the same clock cycle. To avoid deadlock and ensure fairness, there is a limit on the number of active entries that a logical processor can have in each scheduler's queue.
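Scheduler selection by readiness alone, ignoring the owning logical processor, can be sketched as follows. This is illustrative Python, not the hardware's actual wakeup/select network, and the function name is our own.

```python
def dispatch(scheduler_queue, ready):
    """Toy model of one scheduler queue: dispatch the oldest uop whose
    source operands are ready, regardless of which logical processor it
    belongs to.  scheduler_queue holds (lp, uop) pairs in age order;
    ready is the set of uops whose inputs are available."""
    for i, (lp, uop) in enumerate(scheduler_queue):
        if uop in ready:              # readiness, not thread identity, decides
            return scheduler_queue.pop(i)
    return None                       # nothing can be dispatched this cycle
```

Called repeatedly, the model happily dispatches μops from either logical processor in any mix, which is exactly the obliviousness the text describes.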
generate threads from single-threaded The execution core and memory hier- programs and execute such speculative archy are also largely oblivious to logical threads concurrently with the lead thread. processors. After execution, the ops are Superscalar and implicit multithreaded placed in the reorder buffer, which decou- processors aim at a low execution time ples the execution stage from the retire- of a single program, while explicit mul- ment stage. The reorder buffer is parti- tithreaded processors (and chip multipro- tioned such that each logical processor can cessors) aim at a low execution time of a use half the entries. multithreaded workload. The retirement logic tracks when ops Several explicit multithreaded proces- from the two logical processors are ready sors as well as chip multiprocessors have to be retired, then retires the ops in pro- currently been announced by industry or gram order for each logical processor by are already into production in the ar- alternating between the two logical pro- eas of high-performance microprocessors, cessors. Retirement logic will retire ops media, and network processors. The ba- for one logical processor, then the other, al- sic interleaved and blocked multithread- ternating back and forth. If one logical pro- ing techniques are applied in VLIW pro- cessor is not ready to retire any ops then cessors, in the network processors, and all retirement bandwidth is dedicated to even in superscalar processors. In partic- the other logical processor. ular the success of these techniques in The implementation of Hyper- network processors is stunning. In addi- Threading Technology in the Xeon tion, simultaneous multithreading proces- processor adds less than 5% to the relative sors and chip multiprocessors have been chip size and maximum power require- announced or are already in production ments, but can provide performance by IBM, Intel, and Sun. The additional benefits much greater than that. 
The implementation of Hyper-Threading Technology in the Xeon processor adds less than 5% to the relative chip size and maximum power requirements, but can provide performance benefits much greater than that. Initial benchmark tests show up to a 65% performance increase on high-end server applications when comparing the Xeon processor to the previous-generation Pentium III Xeon processor on four-way server platforms. A significant portion of those gains can be attributed to Hyper-Threading Technology [Marr et al. 2002].

8. CONCLUSIONS

Although the term multithreading is sometimes used to include all kinds of architectures that are able to concurrently execute multiple instruction streams on a single chip [Sohi 2001], that is, chip multiprocessors as well as explicit and implicit multithreaded architectures, we clearly distinguish these architectural solutions and focus this survey on explicit multithreaded processors.

Explicit multithreaded processors interleave the execution of instructions of different user-defined threads within the same pipeline, in contrast to implicit multithreaded processors, which dynamically generate threads from single-threaded programs and execute such speculative threads concurrently with the lead thread. Superscalar and implicit multithreaded processors aim at a low execution time of a single program, while explicit multithreaded processors (and chip multiprocessors) aim at a low execution time of a multithreaded workload.

Several explicit multithreaded processors as well as chip multiprocessors have been announced by industry or are already in production in the areas of high-performance microprocessors, media, and network processors. The basic interleaved and blocked multithreading techniques are applied in VLIW processors, in network processors, and even in superscalar processors. In particular, the success of these techniques in network processors is stunning. In addition, simultaneous multithreading processors and chip multiprocessors have been announced or are already in production by IBM, Intel, and Sun. The additional chip space requirement of a two-threaded processor is reported to be about 5% compared to a single-threaded processor (IBM RS64 IV and Intel Xeon processor), while a

significant throughput increase is reached by multithreading. We therefore expect explicit multithreading techniques, in particular the simultaneous multithreading techniques, to be commonly used in the next generation of high-performance microprocessors.

In contrast, implicit multithreaded processors and the related helper-thread approach remain hot research topics and must still prove their efficiency in real processors. We expect that future research will also focus on the deployment of multithreading techniques in the fields of signal processors and microcontrollers, in particular for real-time applications, and for power reduction. Moreover, instruction scheduling within a multithreaded pipeline as well as memory hierarchy architectures for multithreaded processors are still open research questions.

This survey should clarify the terminology and demonstrate the state of the art as well as the research results achieved for explicit multithreaded processors.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for many valuable comments.

REFERENCES

AGARWAL, A., BIANCHINI, R., CHAIKEN, D., JOHNSON, K. L., KRANZ, D., KUBIATOWICZ, J., LIM, B. H., MACKENZIE, K., AND YEUNG, D. 1995. The MIT Alewife machine: architecture and performance. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (Santa Margherita Ligure, Italy). 2–13.
AGARWAL, A., KUBIATOWICZ, J., KRANZ, D., LIM, B. H., YEOUNG, D., D'SOUZA, G., AND PARKIN, M. 1993. Sparcle: an evolutionary processor design for large-scale multiprocessors. IEEE Micro 13, 3, 48–61.
AKKARY, H. AND DRISCOLL, M. A. 1998. A dynamic multithreading processor. In Proceedings of the 31st Annual International Symposium on Microarchitecture (Dallas, TX). 226–236.
ALLEN, F., ALMASI, G., ANDREONI, W., BEECE, D., BERNE, B. J., BRIGHT, A., BRUNHEROTO, J., CASCAVAL, C., CASTANOS, J., COTEUS, P., CRUMLEY, P., CURIONI, A., DENNEAU, M., DONATH, W., ELEFTHERIOU, M., FITCH, B., FLEISCHER, B., GEORGIOU, C. J., GERMAIN, R., GIAMPAPA, M., GRESH, D., GUPTA, M., HARING, R., HO, H., HOCHSCHILD, P., HUMMEL, S., JONAS, T., LIEBER, D., MARTYNA, G., MATURU, ., MOREIRA, J., NEWNS, D., , M., PHILHOWER, R., PICUNKO, T., PITERA, J., PITMAN, M., RAND, R., ROYYURU, A., SALAPURA, V., SANOMIYA, A., SHAH, R., SHAM, Y., SINGH, S., SNIR, M., SUITS, F., SWETZ, R., SWOPE, W. C., VISHNUMURTHY, N., WARD, T. C. J., WARREN, H., AND ZHOU, R. 2001. Blue Gene: a vision for protein science using a petaflops supercomputer. IBM Syst. J. 40, 2, 310–326.
ALMASI, G. S. AND GOTTLIEB, A. 1994. Highly Parallel Computing, 2nd ed. Benjamin/Cummings, Menlo Park, CA.
ALVERSON, G., KAHAN, S., KORRY, R., MCCANN, C., AND SMITH, B. J. 1995. Scheduling on the Tera MTA. In Lecture Notes in Computer Science, vol. 949. Springer-Verlag, Heidelberg, Germany. 19–44.
ALVERSON, R., CALLAHAN, D., CUMMINGS, D., KOBLENZ, B., PORTERFIELD, A., AND SMITH, B. J. 1990. The Tera computer system. In Proceedings of the 4th International Conference on Supercomputing (Amsterdam, The Netherlands). 1–6.
BACH, P., BRAUN, M., FORMELLA, A., FRIEDRICH, J., GRÜN, T., AND LICHTENAU, C. 1997. Building the 4 processor SB-PRAM prototype. In Proceedings of the 30th Hawaii International Conference on System Science (Maui, HI). 5:14–23.
BARROSO, L. A., GHARACHORLOO, K., MCNAMARA, R., NOWATZYK, A., QADEER, S., SANO, B., SMITH, S., STETS, R., AND VERGHESE, B. 2000. Piranha: a scalable architecture based on single-chip multiprocessing. In Proceedings of the 27th Annual International Symposium on Computer Architecture (Vancouver, B.C., Canada). 282–293.
BOLYCHEVSKY, A., JESSHOPE, C. R., AND MUCHNIK, V. B. 1996. Dynamic scheduling in RISC architectures. IEE P. Comput. Dig. Tech. 143, 5, 309–317.
BOOTHE, R. F. 1993. Evaluation of multithreading and caching in large shared memory parallel computers. Tech. Rep. UCB/CSD-93-766. Computer Science Division, University of California, Berkeley, Berkeley, CA.
BOOTHE, R. F. AND RANADE, A. 1992. Improved multithreading techniques for hiding communication latency in multiprocessors. In Proceedings of the 19th International Symposium on Computer Architecture (Gold Coast, Australia). 214–223.
BORKENHAGEN, J. M., EICKEMEYER, R. J., KALLA, R. N., AND KUNKEL, S. R. 2000. A multithreaded PowerPC processor for commercial servers. IBM J. Res. Dev. 44, 6, 885–898.
BRINKSCHULTE, U., BECHINA, A., PICIOROAGA, F., SCHNEIDER, E., UNGERER, T., KREUZINGER, J., AND PFEFFER, M. 2000. A middleware architecture for distributed embedded real-time systems. In Proceedings of the 20th IEEE Symposium on Reliable Distributed Systems (New Orleans, LA). 218–226.
BRINKSCHULTE, U., KRAKOWSKI, C., KREUZINGER, J., AND UNGERER, T. 1999a. A multithreaded Java microcontroller for thread-oriented real-time event-handling. In Proceedings of the

International Conference on Parallel Architectures and Compilation Techniques (Newport Beach, CA). 34–39.
BRINKSCHULTE, U., KRAKOWSKI, C., MARSTON, R., KREUZINGER, J., AND UNGERER, T. 1999b. The Komodo project: thread-based event handling supported by a multithreaded Java microcontroller. In Proceedings of the 25th Euromicro Conference (Milan, Italy). 122–128.
BRINKSCHULTE, U., KREUZINGER, J., PFEFFER, M., AND UNGERER, T. 2002. A scheduling technique providing a strict isolation of real-time threads. In Proceedings of the 7th IEEE International Workshop on Object-Oriented Real-Time Dependable Systems (San Diego, CA). 169–172.
BROOKS, D. M., BOSE, P., SCHUSTER, S. E., JACOBSON, H., KUDVA, P. N., BUYUKTOSUNOGLU, A., WELLMAN, J. D., ZYUBAN, V., GUPTA, M., AND COOK, P. W. 2000. Power-aware microarchitecture: designing and modeling challenges for next-generation microprocessors. IEEE Micro 20, 6, 26–44.
BURNS, J. AND GAUDIOT, J. L. 2002. SMT layout overhead and scalability. IEEE T. Parall. Distr. Syst. 13, 2, 142–155.
BUTLER, M., YEH, T. Y., PATT, Y. N., ALSUP, M., SCALES, H., AND SHEBANOW, M. 1991. Single instruction stream parallelism is greater than two. In Proceedings of the 18th International Symposium on Computer Architecture (Toronto, Ont., Canada). 276–286.
CHAPPELL, R. S., STARK, J., KIM, S. P., REINHARDT, S. K., AND PATT, Y. N. 1999. Simultaneous subordinate microthreading (SSMT). In Proceedings of the 26th Annual International Symposium on Computer Architecture (Atlanta, GA). 186–195.
CHRYSOS, G. Z. AND EMER, J. S. 1998. Memory dependence prediction using store sets. In Proceedings of the 25th Annual International Symposium on Computer Architecture (Barcelona, Spain). 142–153.
CULLER, D. E., SINGH, J. P., AND GUPTA, A. 1998. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, San Francisco, CA.
DALLY, W. J., FISKE, J., KEEN, J., LETHIN, R., NOAKES, M., NUTH, P., DAVISON, R., AND FYLER, G. 1992. The message-driven processor: a multicomputer processing node with efficient mechanisms. IEEE Micro 12, 2, 23–39.
DENNIS, J. B. AND GAO, G. R. 1994. Multithreaded architectures: principles, projects, and issues. In Multithreaded Computer Architecture: A Summary of the State of the Art, R. A. Iannucci, G. R. Gao, R. Halstead, and B. J. Smith, Eds. Kluwer, Boston, MA, Dordrecht, The Netherlands, London, U.K. 1–74.
DOROJEVETS, M. 2000. COOL multithreading in HTMT SPELL-1 processors. Int. J. High Speed Electron. Sys. 10, 1, 247–253.
DOROZHEVETS, M. N. AND WOLCOTT, P. 1992. The El'brus-3 and MARS-M: recent advances in Russian high-performance computing. J. Supercomput. 6, 1, 5–48.
DUBEY, P. K., O'BRIEN, K., O'BRIEN, K. M., AND BARTON, C. 1995. Single-program speculative multithreading (SPSM) architecture: compiler-assisted fine-grain multithreading. Tech. Rep. 19928. IBM, Yorktown Heights, NY.
EGGERS, S. J., EMER, J. S., LEVY, H. M., LO, J. L., STAMM, R. M., AND TULLSEN, D. M. 1997. Simultaneous multithreading: a platform for next-generation processors. IEEE Micro 17, 5, 12–19.
EMER, J. S. 1999. Simultaneous multithreading: multiplying Alpha's performance. In Proceedings of the Microprocessor Forum (San Jose, CA).
ESPASA, R. AND VALERO, M. 1997. Exploiting instruction- and data-level parallelism. IEEE Micro 17, 5, 20–27.
FILLO, M., KECKLER, S. W., DALLY, W. J., CARTER, N. P., CHANG, A., AND GUREVICH, Y. 1995. The M-machine multicomputer. In Proceedings of the 28th Annual International Symposium on Microarchitecture (Ann Arbor, MI). 146–156.
FORMELLA, A., KELLER, J., AND WALLE, T. 1996. HPP: a high performance PRAM. In Lecture Notes in Computer Science, vol. 1123. Springer-Verlag, Heidelberg, Germany. 425–434.
FRANKLIN, M. 1993. The multiscalar architecture. Tech. Rep. 1196. Department of Computer Science, University of Wisconsin-Madison, Madison, WI.
FUHRMANN, S., PFEFFER, M., KREUZINGER, J., UNGERER, T., AND BRINKSCHULTE, U. 2001. Real-time garbage collection for a multithreaded Java microcontroller. In Proceedings of the 4th IEEE International Symposium on Object-Oriented Real-Time Distributed Computing (Magdeburg, Germany). 69–76.
GELINAS, B., HAYS, P., AND KATZMAN, S. 2002. Fine-grained hardware multi-threading: a CPU architecture for high-speed packet processing. Lexra Inc., Waltham, MA. White paper.
GLASKOWSKY, P. N. 2002. Network processors mature in 2001. Microproc. Report, February 19, 2002 (online journal).
GRÜNEWALD, W. AND UNGERER, T. 1996. Towards extremely fast context switching in a block-multithreaded processor. In Proceedings of the 22nd Euromicro Conference (Prague, Czech Republic). 592–599.
GRÜNEWALD, W. AND UNGERER, T. 1997. A multithreaded processor designed for distributed shared memory systems. In Proceedings of the International Conference on Advances in Parallel and Distributed Computing (Shanghai, China). 206–213.
GULATI, M. AND BAGHERZADEH, N. 1996. Performance study of a multithreaded superscalar microprocessor. In Proceedings of the 2nd International Symposium on High-Performance Computer Architecture (San Jose, CA). 291–301.

GWENNAP, L. 1997. DanSoft develops VLIW design. Microproc. Report 11, 2 (Feb. 17), 18–22.
HALSTEAD, R. H. 1985. MULTILISP: a language for concurrent symbolic computation. ACM Trans. Program. Lang. Syst. 7, 4, 501–538.
HALSTEAD, R. H. AND FUJITA, T. 1988. MASA: a multithreaded processor architecture for parallel symbolic computing. In Proceedings of the 15th International Symposium on Computer Architecture (Honolulu, HI). 443–451.
HAMMOND, L. AND OLUKOTUN, K. 1998. Considerations in the design of Hydra: a multiprocessor-on-chip microarchitecture. Tech. Rep. CSL-TR-98-749. Computer Systems Laboratory, Stanford University, Stanford, CA.
HANSEN, C. 1996. MicroUnity's MediaProcessor architecture. IEEE Micro 16, 4, 34–41.
HIRATA, H., KIMURA, K., NAGAMINE, S., MOCHIZUKI, Y., NISHIMURA, A., NAKASE, Y., AND NISHIZAWA, T. 1992. An elementary processor architecture with simultaneous instruction issuing from multiple threads. In Proceedings of the 19th International Symposium on Computer Architecture (Gold Coast, Australia). 136–145.
IANNUCCI, R. A., GAO, G. R., HALSTEAD, R., AND SMITH, B. J., Eds. 1994. Multithreaded Computer Architecture: A Summary of the State of the Art. Kluwer, Boston, MA, Dordrecht, The Netherlands, London, U.K.
IBM CORPORATION. 1999. IBM network processor. Product overview. IBM, Yorktown Heights, NY.
INTEL CORPORATION. 2002. Intel Internet Exchange Architecture network processors: flexible, wire-speed processing from the customer premises to the network core. White paper. Intel, Santa Clara, CA.
JESSHOPE, C. R. 2001. Implementing an efficient vector instruction set in a chip multi-processor using micro-threaded pipelines. Aust. Comput. Sci. Commun. 23, 4, 80–88.
JESSHOPE, C. R. AND LUO, B. 2000. Micro-threading: a new approach to future RISC. In Proceedings of the Australasian Computer Architecture Conference (Canberra, Australia). 34–41.
KAVI, K. M., LEVINE, D. L., AND HURSON, A. R. 1997. A non-blocking multithreaded architecture. In Proceedings of the 5th International Conference on Advanced Computing (Madras, India). 171–177.
KLAUSER, A., AUSTIN, T., GRUNWALD, D., AND CALDER, B. 1998a. Dynamic hammock predication for non-predicated instruction sets. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (Paris, France). 278–285.
KLAUSER, A., PAITHANKAR, A., AND GRUNWALD, D. 1998b. Selective eager execution on the PolyPath architecture. In Proceedings of the 25th Annual International Symposium on Computer Architecture (Barcelona, Spain). 250–259.
KREUZINGER, J., SCHULZ, A., PFEFFER, M., UNGERER, T., BRINKSCHULTE, U., AND KRAKOWSKI, C. 2000. Real-time scheduling on multithreaded processors. In Proceedings of the 7th International Conference on Real-Time Computer Systems and Applications (Cheju Island, South Korea). 155–159.
KREUZINGER, J. AND UNGERER, T. 1999. Context-switching techniques for decoupled multithreaded processors. In Proceedings of the 25th Euromicro Conference (Milan, Italy). 1:248–251.
LAM, M. S. AND WILSON, R. P. 1992. Limits of control flow on parallelism. In Proceedings of the 19th International Symposium on Computer Architecture (Gold Coast, Australia). 46–57.
LAUDON, J., GUPTA, A., AND HOROWITZ, M. 1994. Interleaving: a multithreading technique targeting multiprocessors and workstations. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems (San Jose, CA). 308–318.
LAWSON, S. AND VANCE, A. 2002. Sun hints at UltraSparc V and beyond. Available online at PCWorld.com.
LI, Z., TSAI, J. Y., WANG, X., YEW, P. C., AND ZHENG, B. 1996. Compiler techniques for concurrent multithreading with hardware speculation support. In Lecture Notes in Computer Science, vol. 1239. Springer-Verlag, Heidelberg, Germany. 175–191.
LIPASTI, M. H. AND SHEN, J. P. 1997. The performance potential of value and dependence prediction. In Lecture Notes in Computer Science, vol. 1300. Springer-Verlag, Heidelberg, Germany. 1043–1052.
LIPASTI, M. H., WILKERSON, C. B., AND SHEN, J. P. 1996. Value locality and load value prediction. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (Cambridge, MA). 138–147.
LO, J. L., BARROSO, L. A., EGGERS, S. J., GHARACHORLOO, K., LEVY, H. M., AND PAREKH, S. S. 1998. An analysis of database workload performance on simultaneous multithreaded processors. In Proceedings of the 25th Annual International Symposium on Computer Architecture (Barcelona, Spain). 39–50.
LO, J. L., EGGERS, S. J., EMER, J. S., LEVY, H. M., STAMM, R. L., AND TULLSEN, D. M. 1997. Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading. ACM Trans. Comput. Syst. 15, 3, 322–354.
LOIKKANEN, M. AND BAGHERZADEH, N. 1996. A fine-grain multithreading superscalar architecture. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (Boston, MA). 163–168.
LÜTH, K., METZNER, A., PIEKENKAMP, T., AND RISU, J. 1997. The events approach to rapid prototyping for embedded control systems. In Proceedings

of the Workshop Zielarchitekturen eingebetteter Systeme (Rostock, Germany). 45–54.
MANKOVIC, T. E., POPESCU, V., AND SULLIVAN, H. 1987. CHoPP principles of operations. In Proceedings of the 2nd International Supercomputer Conference (Mannheim, Germany). 2–10.
MARCUELLO, P., GONZALES, A., AND TUBELLA, J. 1998. Speculative multithreaded processors. In Proceedings of the 12th International Conference on Supercomputing (Melbourne, Australia). 77–84.
MARR, D. T., BINNS, F., HILL, D. L., HINTON, G., KOUFATY, D. A., MILLER, J. A., AND UPTON, M. 2002. Hyper-Threading technology architecture and microarchitecture: a hypertext history. Intel Technology J. 6, 1 (online journal).
METZNER, A. AND NIEHAUS, J. 2000. MSparc: multithreading in real-time architectures. J. Universal Comput. Sci. 6, 10, 1034–1051.
MIKSCHL, A. AND DAMM, W. 1996. Msparc: a multithreaded Sparc. In Lecture Notes in Computer Science, vol. 1123. Springer-Verlag, Heidelberg, Germany. 461–469.
OEHRING, H., SIGMUND, U., AND UNGERER, T. 1999a. MPEG-2 video decompression on simultaneous multithreaded multimedia processors. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (Newport Beach, CA). 11–16.
OEHRING, H., SIGMUND, U., AND UNGERER, T. 1999b. Simultaneous multithreading and multimedia. In Proceedings of the Workshop on Multithreaded Execution, Architecture and Compilation (Orlando, FL).
PATT, Y. N., PATEL, S. J., EVERS, M., FRIENDLY, D. H., AND STARK, J. 1997. One billion transistors, one uniprocessor, one chip. Computer 30, 9, 51–57.
PAUL, W. J., BACH, P., BOSCH, M., FISCHER, J., LICHTENAU, C., AND RÖHRIG, J. 2002. Real PRAM programming. In Lecture Notes in Computer Science, vol. 2400. Springer-Verlag, Heidelberg, Germany. 522–531.
PONTIUS, N. AND BAGHERZADEH, N. 1999. Multithreaded extensions enhance multimedia performance. In Proceedings of the Workshop on Multithreaded Execution, Architecture and Compilation (Orlando, FL).
ROTENBERG, E., JACOBSON, Q., SAZEIDES, Y., AND SMITH, J. E. 1997. Trace processors. In Proceedings of the 30th Annual International Symposium on Microarchitecture (Research Triangle Park, NC). 138–148.
RYCHLIK, B., FAISTL, J., KRUG, B., AND SHEN, J. P. 1998. Efficiency and performance impact of value prediction. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (Paris, France). 148–154.
SENG, J. S., TULLSEN, D. M., AND CAI, G. Z. N. 2000. Power-sensitive multithreaded architecture. In Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors (Austin, TX). 199–206.
SERRANO, M. J., YAMAMOTO, W., WOOD, R., AND NEMIROVSKY, M. D. 1994. Performance estimation in a multistreamed superscalar processor. In Lecture Notes in Computer Science, vol. 794. Springer-Verlag, Heidelberg, Germany. 213–230.
SIGMUND, U., STEINHAUS, M., AND UNGERER, T. 2000. Transistor count and chip space assessment of multimedia-enhanced simultaneous multithreaded processors. In Proceedings of the 4th Workshop on Multithreaded Execution, Architecture and Compilation (Monterrey, CA).
SIGMUND, U. AND UNGERER, T. 1996a. Evaluating a multithreaded superscalar microprocessor versus a multiprocessor chip. In Proceedings of the 4th PASA Workshop on Parallel Systems and Algorithms (Jülich, Germany). 147–159.
SIGMUND, U. AND UNGERER, T. 1996b. Identifying bottlenecks in multithreaded superscalar multiprocessors. In Lecture Notes in Computer Science, vol. 1123. Springer-Verlag, Heidelberg, Germany. 797–800.
ŠILC, J., ROBIČ, B., AND UNGERER, T. 1998. Asynchrony in parallel computing: from dataflow to multithreading. Parall. Distr. Comput. Practices 1, 1, 57–83.
ŠILC, J., ROBIČ, B., AND UNGERER, T. 1999. Processor Architecture: From Dataflow to Superscalar and Beyond. Springer-Verlag, Heidelberg and Berlin, Germany, and New York, NY.
SMITH, B. J. 1981. Architecture and applications of the HEP multiprocessor computer system. SPIE Real-Time Signal Processing IV 298, 241–248.
SMITH, B. J. 1985. The architecture of HEP. In Parallel MIMD Computation: HEP Supercomputer and Its Applications, J. S. Kowalik, Ed. MIT Press, Cambridge, MA. 41–55.
SMITH, J. E. AND VAJAPEYAM, S. 1997. Trace processors: moving to fourth-generation microarchitectures. Computer 30, 9, 68–74.
SOHI, G. S. 1997. Multiscalar: another fourth-generation processor. Computer 30, 9, 72.
SOHI, G. S. 2001. Microprocessors—10 years back, 10 years ahead. In Lecture Notes in Computer Science, vol. 2000. Springer-Verlag, Heidelberg, Germany. 208–218.
SOHI, G. S., BREACH, S. E., AND VIJAYKUMAR, T. N. 1995. Multiscalar processors. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (Santa Margherita Ligure, Italy). 414–425.
STEINHAUS, M., KOLLA, R., LARRIBA-PEY, J. L., UNGERER, T., AND VALERO, M. 2001. Transistor count and chip space estimation of SimpleScalar-based microprocessor models. In Proceedings of the Workshop on Complexity-Effective Design (Göteborg, Sweden).
STERLING, T. 1997. Beyond 100 teraflops through superconductors, holographic storage, and the

data vortex. In Proceedings of the International Symposium on Supercomputing (Tokyo, Japan).
TENDLER, J. M., DODSON, J. S., FIELDS, JR., J. S., LE, H., AND SINHAROY, B. 2002. POWER4 system microarchitecture. IBM J. Res. Dev. 46, 1, 5–26.
TEXAS INSTRUMENTS. 1994. TMS320C80 Technical Brief. Texas Instruments, Dallas, TX.
THISTLE, M. AND SMITH, B. J. 1988. A processor architecture for Horizon. In Proceedings of the Supercomputing Conference (Orlando, FL). 35–41.
TREMBLAY, M. 1999. A VLIW convergent multiprocessor system. In Proceedings of the Microprocessor Forum (San Jose, CA).
TREMBLAY, M., CHAN, J., CHAUDHRY, S., CONIGLIARO, A. W., AND TSE, S. S. 2000. The MAJC architecture: a synthesis of parallelism and scalability. IEEE Micro 20, 6, 12–25.
TSAI, J. Y. AND YEW, P. C. 1996. The superthreaded architecture: thread pipelining with run-time data dependence checking and control speculation. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (Boston, MA). 35–46.
TULLSEN, D. M., EGGERS, S. J., EMER, J. S., LEVY, H. M., LO, J. L., AND STAMM, R. L. 1996. Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor. In Proceedings of the 23rd Annual International Symposium on Computer Architecture (Philadelphia, PA). 191–202.
TULLSEN, D. M., EGGERS, S. J., AND LEVY, H. M. 1995. Simultaneous multithreading: maximizing on-chip parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (Santa Margherita Ligure, Italy). 392–403.
TULLSEN, D. M., LO, J. L., EGGERS, S. J., AND LEVY, H. M. 1999. Supporting fine-grained synchronization on a simultaneous multithreading processor. In Proceedings of the 5th International Symposium on High-Performance Computer Architecture (Orlando, FL). 54–58.
UNGERER, T., ROBIČ, B., AND ŠILC, J. 2002. Multithreaded processors. Computer J. 45, 3, 320–348.
VAJAPEYAM, S. AND MITRA, T. 1997. Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences. In Proceedings of the 24th Annual International Symposium on Computer Architecture (Denver, CO). 1–12.
VIJAYKUMAR, T. N. AND SOHI, G. S. 1998. Task selection for a multiscalar processor. In Proceedings of the 31st Annual International Symposium on Microarchitecture (Dallas, TX). 81–92.
WALL, D. W. 1991. Limits of instruction-level parallelism. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (Santa Clara, CA). 176–188.
WALLACE, S., CALDER, B., AND TULLSEN, D. M. 1998. Threaded multiple path execution. In Proceedings of the 25th Annual International Symposium on Computer Architecture (Barcelona, Spain). 238–249.
WALLACE, S., TULLSEN, D. M., AND CALDER, B. 1999. Instruction recycling on a multiple-path processor. In Proceedings of the 5th International Symposium on High-Performance Computer Architecture (Orlando, FL). 44–53.
WITTENBURG, J. P., MEYER, G., AND PIRSCH, P. 1999. Adapting and extending simultaneous multithreading for high performance video signal processing applications. In Proceedings of the Workshop on Multithreaded Execution, Architecture and Compilation (Orlando, FL).
YAMAMOTO, W. AND NEMIROVSKY, M. D. 1995. Increasing superscalar performance through multistreaming. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (Limassol, Cyprus). 49–58.

Received June 2001; revised September 2002; accepted November 2002

ACM Computing Surveys, Vol. 35, No. 1, March 2003.