SIMULTANEOUS MULTITHREADING: A Platform for Next-Generation Processors

Susan J. Eggers, University of Washington
Joel S. Emer, Digital Equipment Corp.
Henry M. Levy, University of Washington
Jack L. Lo, University of Washington
Rebecca L. Stamm, Digital Equipment Corp.
Dean M. Tullsen, University of California, San Diego

Simultaneous multithreading exploits both instruction-level and thread-level parallelism by issuing instructions from different threads in the same cycle.

As the community prepares for a billion transistors on a chip, researchers continue to debate the most effective way to use them. One approach is to add more memory (either cache or primary memory) to the chip, but the performance gain from memory alone is limited. Another approach is to increase the level of systems integration, bringing support functions like graphics accelerators and I/O controllers on chip. Although integration lowers system costs and communication latency, the overall performance gain to applications is again marginal.

We believe the only way to significantly improve performance is to enhance the processor’s computational capabilities. In general, this means increasing parallelism—in all its available forms. At present only certain forms of parallelism are being exploited. Current superscalars, for example, can execute four or more instructions per cycle; in practice, however, they achieve only one or two, because current applications have low instruction-level parallelism. Placing multiple superscalar processors on a chip is also not an effective solution, because, in addition to the low instruction-level parallelism, performance suffers when there is little thread-level parallelism. A better solution is to design a processor that can exploit all types of parallelism well.

Simultaneous multithreading is a processor design that meets this goal, because it consumes both thread-level and instruction-level parallelism. In SMT processors, thread-level parallelism can come from either multithreaded, parallel programs or individual, independent programs in a multiprogramming workload. Instruction-level parallelism comes from each single program or thread. Because it successfully (and simultaneously) exploits both types of parallelism, SMT processors use resources more efficiently, and both instruction throughput and speedups are greater.

Simultaneous multithreading combines hardware features of wide-issue superscalars and multithreaded processors. From superscalars, it inherits the ability to issue multiple instructions each cycle; and like multithreaded processors it contains hardware state for several programs (or threads). The result is a processor that can issue multiple instructions from multiple threads each cycle, achieving better performance for a variety of workloads. For a mix of independent programs (multiprogramming), the overall throughput of the machine is improved. Similarly, programs that are parallelizable, either by a compiler or a programmer, reap the same throughput benefits, resulting in program speedup. Finally, a single-threaded program that must execute alone will have all machine resources available to it and will maintain roughly the same level of performance as when executing on a single-threaded, wide-issue processor.

Equal in importance to its performance benefits is the simplicity of SMT’s design. Simultaneous multithreading adds minimal hardware complexity to, and, in fact, is a straightforward extension of, conventional dynamically scheduled superscalars. Hardware designers can focus on building a fast, single-threaded superscalar, and add SMT’s multithread capability on top.

Given the enormous transistor budget in the next era, we believe simultaneous multithreading provides an efficient base technology that can be used in many ways to extract improved performance. For example, on a one billion transistor chip, 20 to 40 SMTs could be used side-by-side to achieve performance comparable to that of a much larger number of conventional superscalars.



With IRAM technology, SMT’s high execution rate, which currently doubles memory bandwidth requirements, can fully exploit the increased bandwidth capability. In both billion-transistor scenarios, the SMT processor we describe here could serve as the processor building block.

How SMT works
The difference between superscalar, multithreading, and simultaneous multithreading is pictured in Figure 1, which shows sample execution sequences for the three architectures. Each row represents the issue slots for a single execution cycle: a filled box indicates that the processor found an instruction to execute in that issue slot on that cycle; an empty box denotes an unused slot. We characterize the unused slots as horizontal or vertical waste. Horizontal waste occurs when some, but not all, of the issue slots in a cycle can be used. It typically occurs because of poor instruction-level parallelism. Vertical waste occurs when a cycle goes completely unused. This can be caused by a long-latency instruction (such as a memory access) that inhibits further instruction issue.

[Figure 1. How architectures partition issue slots (functional units): a superscalar (a), a fine-grained multithreaded superscalar (b), and a simultaneous multithreaded processor (c). The rows of squares represent issue slots; time, in processor cycles, runs down the figure, and the legend distinguishes threads 1 through 5. The processor either finds an instruction to execute (filled box) or the slot goes unused (empty box).]

Figure 1a shows a sequence from a conventional superscalar. As in all superscalars, it is executing a single program, or thread, from which it attempts to find multiple instructions to issue each cycle. When it cannot, the issue slots go unused, and it incurs both horizontal and vertical waste.

Figure 1b shows a sequence from a multithreaded architecture, such as the Tera.1 Multithreaded processors contain hardware state (a program counter and registers) for several threads. On any given cycle a processor executes instructions from one of the threads. On the next cycle, it switches to a different thread context and executes instructions from the new thread. As the figure shows, the primary advantage of multithreaded processors is that they better tolerate long-latency operations, effectively eliminating vertical waste. However, they cannot remove horizontal waste. Consequently, as instruction issue width continues to increase, multithreaded architectures will ultimately suffer the same fate as superscalars: they will be limited by the instruction-level parallelism in a single thread.

Figure 1c shows how each cycle an SMT processor selects instructions for execution from all threads. It exploits instruction-level parallelism by selecting instructions from any thread that can (potentially) issue. The processor then dynamically schedules machine resources among the instructions, providing the greatest chance for the highest hardware utilization. If one thread has high instruction-level parallelism, that parallelism can be satisfied; if multiple threads each have low instruction-level parallelism, they can be executed together to compensate. In this way, SMT can recover issue slots lost to both horizontal and vertical waste.

SMT model
We derived our SMT model from a high-performance, out-of-order, superscalar architecture whose dynamic scheduling core is similar to that of the Mips R10000. In each cycle the processor fetches eight instructions from the instruction cache. After instruction decoding, the register-renaming logic maps the architectural registers to the hardware renaming registers to remove false dependencies. Instructions are then fed to either the integer or floating-point dispatch queues. When their operands become available, instructions are issued from these queues to their corresponding functional units. To support out-of-order execution, the processor tracks instruction and operand dependencies so that it can determine which instructions it can issue and which must wait for previously issued instructions to finish. After instructions complete execution, the processor retires them in order and frees hardware registers that are no longer needed.

Our SMT model, which can simultaneously execute threads from up to eight hardware contexts, is a straightforward extension of this conventional superscalar. We replicated some superscalar resources to support simultaneous multithreading: state for the hardware contexts (registers and program counters) and per-thread mechanisms for pipeline flushing, instruction retirement, trapping, precise interrupts, and subroutine return. We also added per-thread (address-space) identifiers to the branch target buffer and translation look-aside buffer. Only two components, the instruction fetch unit and the processor pipeline, were redesigned to benefit from SMT’s multithread instruction issue.

Simultaneous multithreading needs no special hardware to schedule instructions from the different threads onto the functional units. Dynamic scheduling hardware in current out-of-order superscalars is already functionally capable of simultaneous multithreaded scheduling. Register renaming eliminates register name conflicts both within and between threads by mapping thread-specific architectural registers onto the hardware registers; the processor then issues instructions (after their operands have been calculated or loaded from memory) without regard to thread.
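The thread-blind issue that this renaming enables is easy to see in miniature. The following sketch is an illustration we added, not the authors' hardware; the sizes follow the implementation parameters listed later in the article. Each context has a private map from architectural to physical registers, so identically numbered registers in different threads never collide:

```c
/* Toy model of SMT register renaming (illustrative sketch, not from
 * the article). Per-context rename maps share one physical register
 * pool, so issued instructions need no thread information. */
#include <assert.h>

#define NUM_THREADS 8
#define ARCH_REGS   32
#define PHYS_REGS   (NUM_THREADS * ARCH_REGS + 100)  /* 356 */

static int rename_map[NUM_THREADS][ARCH_REGS];  /* arch reg -> phys reg */
static int free_list[PHYS_REGS];
static int free_count;

/* Allocate a fresh physical register for a destination operand.
 * (A real renamer would also free the old mapping at retirement.) */
static int rename_dest(int thread, int arch_reg)
{
    assert(free_count > 0);
    int phys = free_list[--free_count];
    rename_map[thread][arch_reg] = phys;
    return phys;
}

/* Source operands simply read the current mapping. */
static int rename_src(int thread, int arch_reg)
{
    return rename_map[thread][arch_reg];
}

int main(void)
{
    for (int p = 0; p < PHYS_REGS; p++)
        free_list[free_count++] = p;

    /* Two threads both write architectural register 5 ... */
    int t0_r5 = rename_dest(0, 5);
    int t1_r5 = rename_dest(1, 5);

    /* ... yet land in different physical registers: no conflict. */
    assert(t0_r5 != t1_r5);
    assert(rename_src(0, 5) == t0_r5 && rename_src(1, 5) == t1_r5);
    return 0;
}
```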




This minimal redesign has two important consequences. First, since most hardware resources are still available to a single thread executing alone, SMT provides good performance for programs that cannot be parallelized. Second, because the changes to enable simultaneous multithreading are minimal, the commercial transition from current superscalars to SMT processors should be fairly smooth.

However, should an SMT implementation negatively impact either the targeted processor cycle time or the time to design completion, designers could take several approaches to simplify it. Because most of SMT’s implementation complexity stems from its wide-issue superscalar underpinnings, many of the alternatives involve altering the superscalar. One solution is to partition the functional units across a duplicated register file (as in the Alpha 21264) to reduce ports on the register file and dispatch queues.2 Another is to build interleaved caches augmented with multiple, independently addressed banks, or phase-pipeline cache accesses to increase the number of simultaneous cache accesses. A third alternative would subdivide the dispatch queue to reduce instruction issue delays.3 As a last resort, a designer could reduce the number of register and cache ports by using a narrower issue width. This alternative would have the greatest impact on performance.

Instruction fetching. In a conventional processor, the fetch unit fetches instructions from a single thread into the instruction queue. Performance issues revolve around maximizing the number of useful instructions that can be fetched (by minimizing branch mispredictions, for example) and fetching independent instructions quickly enough to keep functional units busy. An SMT processor places additional stress on the fetch unit. The fetch unit must now fetch instructions more quickly to satisfy SMT’s more efficient dynamic scheduler, which issues more instructions each cycle (because it takes them from multiple threads). Thus, the fetch unit becomes SMT’s performance bottleneck.

On the other hand, an SMT fetch unit can take advantage of the interthread competition for instruction bandwidth to enhance performance. First, it can partition this bandwidth among the threads. This is an advantage, because branch instructions and cache-line boundaries often make it difficult to fill issue slots if the fetch unit can access only one thread at a time. We fetch from two threads each cycle to increase the probability of fetching only useful (nonspeculative) instructions. In addition, the fetch unit can be smart about which threads it fetches, fetching those that will provide the most immediate performance benefit.

2.8 fetching. The fetch unit we propose is the 2.8 scheme.4 The unit has eight program counters, one for each thread context. On each cycle, it selects two different threads (from those not already incurring instruction cache misses) and fetches eight instructions from each thread. To match the issue hardware’s lower instruction width, it then chooses a subset of these instructions for decoding. It takes instructions from the first thread until it encounters a branch instruction or the end of a cache line and then takes the remaining instructions from the second thread.

Because the instructions come from two different threads, there is a greater likelihood of fetching useful instructions—indeed, the 2.8 scheme performed 10% better than fetching from one thread at a time. Its hardware cost is only an additional port on the instruction cache and logic to locate the branch instruction—nothing extraordinary for near-future processors. A less hardware-intensive alternative is to fetch from multiple threads while limiting the fetch bandwidth to eight instructions. However, the performance gain here is lower: for example, the 2.8 scheme performed 5% better than fetching four instructions from each of two threads.

Icount feedback. Because it can fetch instructions from more than one thread, an SMT processor can be selective about which threads it fetches. Not all threads provide equally useful instructions in a particular cycle. If the processor can predict which threads will produce the fewest delays, performance should improve. Our thread selection hardware uses the Icount feedback technique,4 which gives highest priority to the threads with the fewest instructions in the decode, renaming, and queue pipeline stages.

Icount increases performance in several ways. First, it replenishes the dispatch queues with instructions from the fast-moving threads, avoiding those that will fill the queues with instructions that depend on and are consequently blocked behind long-latency instructions. Second and most important, it maintains in the queues a fairly even distribution of instructions among these fast-moving threads, thereby increasing interthread parallelism (and the ability to hide more latencies). Finally, it avoids thread starvation, because threads whose instructions are not executing will eventually have few instructions in the pipeline and will be chosen for fetching. Thus, even though eight threads are sharing and competing for slots in the dispatch queues, the percentage of cycles in which the queues are full is actually less than on a single-threaded superscalar (8% of cycles versus 21%).

This performance also comes with a very low hardware cost. Icount requires only a small amount of additional logic to increment (decrement) per-thread counters when instructions enter the decode stage (exit the dispatch queues) and to pick the two smallest counter values.

Icount feedback works because it addresses all causes of dispatch queue inefficiency. In our experiments,4 it outperformed alternative schemes that addressed a particular cause of dispatch queue inefficiency, such as those that minimize branch mispredictions (by giving priority to threads with the fewest outstanding branches) or minimize load delays (by giving priority to threads with the fewest outstanding on-chip cache misses).
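The logic these paragraphs describe is small enough to sketch. The toy version below is a reconstruction for illustration, not the authors' design: counters rise when a thread's instructions enter decode and fall when they leave the dispatch queues, and each cycle the two lowest-count threads not stalled on an instruction cache miss are chosen, matching the selection step of the 2.8 scheme.

```c
/* Toy model of Icount fetch selection (a reconstruction; the article
 * describes the policy, not an implementation). */
#include <stdbool.h>

#define NUM_THREADS 8

static int  icount[NUM_THREADS];      /* instrs in decode/rename/queues */
static bool icache_miss[NUM_THREADS]; /* thread waiting on the I-cache  */

void on_enter_decode(int thread) { icount[thread]++; }
void on_leave_queue(int thread)  { icount[thread]--; }

/* Pick the two distinct eligible threads with the smallest counters.
 * A starved thread soon holds the smallest counter, so it is
 * eventually chosen for fetching, as the text notes. */
void select_fetch_threads(int *first, int *second)
{
    *first = *second = -1;  /* -1 means no eligible thread found */
    for (int t = 0; t < NUM_THREADS; t++) {
        if (icache_miss[t])
            continue;
        if (*first < 0 || icount[t] < icount[*first]) {
            *second = *first;
            *first  = t;
        } else if (*second < 0 || icount[t] < icount[*second]) {
            *second = t;
        }
    }
}
```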



Register file and pipeline. In simultaneous multithreading (as in a superscalar processor), each thread can address 32 architectural integer (and floating-point) registers. The register-renaming mechanism maps these architectural registers onto a hardware register file whose size is determined by the number of architectural registers in all thread contexts, plus a set of additional renaming registers. The larger SMT register file requires a longer access time; to avoid increasing the processor cycle time, we extended the SMT pipeline two stages to allow two-cycle register reads and two-cycle writes.

The two-stage register access has several ramifications on the architecture. For example, the extra stage between instruction fetch and execute increases the branch misprediction penalty by one cycle. The two extra stages between register renaming and instruction commit increase the minimum time that an executing instruction can hold a hardware register; this, in turn, increases the pressure on the renaming registers. Finally, the extra stage needed to write back results to the register file requires an extra level of bypass logic.

Implementation parameters. SMT’s implementation parameters will, of course, change as technologies shrink. In our simulations, they targeted an implementation that should be realized approximately three years from now. For the CPU, the parameters are

• an eight-instruction fetch/decode width;
• six integer units, four of which can load (store) from (to) memory;
• four floating-point units;
• 32-entry integer and floating-point dispatch queues;
• hardware contexts for eight threads;
• 100 additional integer renaming registers;
• 100 additional floating-point renaming registers; and
• retirement of up to 12 instructions per cycle.

For the memory subsystem, the parameters are

• 128-Kbyte, two-way set-associative, L1 instruction and data caches; the D-cache has four dual-ported banks; the I-cache has eight single-ported banks; the access time per bank is two cycles;
• a 16-Mbyte, direct-mapped, unified L2 cache; the single bank has a transfer time of 12 cycles on a 256-bit bus;
• an 80-cycle memory latency on a 128-bit bus;
• 64-byte blocks on all caches;
• 16 outstanding cache misses;
• data and instruction TLBs that contain 128 entries each; and
• McFarling-style branch prediction hardware:5 a 256-entry, four-way set-associative branch target buffer with an additional thread identifier field, and a hybrid branch predictor that selects between global and local predictors. The global predictor has 13 history bits; the local predictor has a 2,048-entry local history table that indexes into a 4,096-entry prediction table.




Simulation environment
We compared SMT with its two ancestral processor architectures, wide-issue superscalars and fine-grained, multithreaded superscalars. Both are single-processor architectures, designed to improve instruction throughput. Superscalars do so by issuing and executing multiple instructions from a single thread, exploiting instruction-level parallelism. Multithreaded superscalars, in addition to heightening instruction-level parallelism, hide latencies of one thread by switching to and executing instructions from another thread, thereby exploiting thread-level parallelism.

To gauge SMT’s potential for executing parallel workloads, we also compared it to a third alternative for improving instruction throughput: small-scale, single-chip shared-memory multiprocessors, whose processors are also superscalars. We examined both two- and four-processor multiprocessors, partitioning their scheduling unit resources (the functional units—and therefore the issue width—dispatch queues, and renaming registers) differently for each case. In the two-processor multiprocessor (MP2), each processor received half of SMT’s execution resources, so that the total resources of the two architectures were comparable. Each processor of the four-processor multiprocessor (MP4) contains approximately one-fourth of SMT’s chip resources. For some experiments we increased these base configurations until each MP processor had the resource capability of a single SMT.

MP2 and MP4 represent an interesting trade-off between thread-level and instruction-level parallelism. MP2 can exploit more instruction-level parallelism, because each processor has more functional units than its MP4 counterpart. On the other hand, MP4 has two more processors to take advantage of more thread-level parallelism.

All processor simulators are execution-driven, cycle-level simulators; they model the processor pipelines and memory subsystems (including interthread contention for all structures in the memory hierarchy and the buses between them) in great detail. The simulators for the three alternative architectures reflect the SMT implementation model, but without the simultaneous multithreading extensions. Instead they use single-threaded fetching (per processor) and the shorter pipeline, without simultaneous multithreaded issue. The fine-grained multithreaded processor simulator context switches between threads each cycle in a round-robin fashion for instruction fetch, issue, and retirement. Table 1 summarizes the characteristics of each architecture.

Table 1. Comparison of processor architectures.*

Features                               SS    MP2   MP4   FGMT          SMT
CPUs                                   1     2     4     1             1
Functional units/CPU                   10    5     3     10            10
Architectural registers/CPU
  (integer or floating point)          32    32    32    256           256
                                                         (8 contexts)  (8 contexts)
Dispatch queue size
  (integer or floating point)          32    16    8     32            32
Renaming registers/CPU
  (integer or floating point)          100   50    25    100           100
Pipe stages                            7     7     7     9             9
Threads fetched/cycle                  1     1     1     1             2
Multithread fetch algorithm            n/a   n/a   n/a   Round-robin   Icount

*SS represents a wide-issue superscalar; MP2 and MP4 represent small-scale, single-chip, shared-memory multiprocessors whose processors (two and four, respectively) are also superscalars; and FGMT represents a fine-grained multithreaded superscalar.

We evaluated simultaneous multithreading on a multiprogramming workload consisting of several single-threaded programs and a group of parallel (multithreaded) applications. We used both types of workloads, because each exercises different parts of an SMT processor. The larger (workload-wide) working set of the multiprogramming workload should stress the shared structures in an SMT processor (for example, the caches, TLB, and branch prediction hardware) more than the largely identical threads of the parallel programs, which share both instructions and data.

We chose the programs in the multiprogramming workload from the Spec95 and Splash2 benchmark suites. Each program executed as a separate thread. To eliminate the effects of benchmark differences when simulating fewer than eight threads, each data point in Table 2 and Figure 2 comprises at least four simulation runs, where each run used a different combination of the benchmarks. The data represent continuous execution with a particular number of threads; that is, we stopped simulating when one of the threads completed.

The parallel workload consists of coarse-grained (parallel threads) and medium-grained (parallel loop iterations) parallel programs targeted for shared-memory multiprocessors. As such, it was a fair basis for evaluating MP2 and MP4. It also presents a different, but equally challenging, test of SMT. Unlike the multiprogramming workload, all threads in a parallel application execute the same code, and therefore have similar execution resource requirements (they may need the same functional units at the same time, for example). Consequently, there is potentially more contention for these resources.

We also selected the parallel applications from the Spec95 and Splash2 suites. Where feasible, we executed the entire parallel portions of the programs; for the long-running Spec95 programs, we simulated several iterations of the main loops, using their reference data sets. The Spec95 programs were parallelized with the SUIF compiler,6 using policies developed for shared-memory machines.

We compiled the multiprogramming workload with cc and the parallel benchmarks with a version of the Multiflow compiler7 that produces Alpha executables. For all programs we set compiler optimizations to maximize each program’s performance on the superscalar; however, we disabled trace scheduling, so that speculation could be guided by the branch prediction and out-of-order execution hardware.

Simulation results
As Table 2 shows, simultaneous multithreading achieved much higher throughput than the one to two instructions per cycle normally reported for current wide-issue superscalars. Throughput rose consistently with the number of threads; at eight threads, it reached 6.2 for the multiprogramming workload and 6.1 for the parallel applications. As Figure 2b shows, speedups for parallel applications at eight threads averaged 1.9 over the same processor with one thread, demonstrating SMT’s parallel processing capability.

Table 2. Instruction throughput (instructions per cycle) executing a multiprogramming workload and a parallel workload.

           Multiprogramming workload     Parallel workload
Threads    SS     FGMT   SMT             SS     MP2    MP4    FGMT   SMT
1          2.7    2.6    3.1             3.3    2.4    1.5    3.3    3.3
2          —      3.3    3.5             —      4.3    2.6    4.1    4.7
4          —      3.6    5.7             —      —      4.2    4.2    5.6
8          —      2.8    6.2             —      —      —      3.5    6.1

[Figure 2. Speedups with the multiprogramming workload (a) and parallel workload (b). Each panel plots speedup (0.0 to 3.0) against the number of threads (1 to 8) for SMT, MP2, MP4, and FGMT.]

As we explained earlier, SMT’s pipeline is two cycles longer than that of the superscalar and the processors of the single-chip multiprocessor. Therefore, a single thread executing on an SMT processor has additional latencies and should have lower performance. However, as Figure 2 shows, SMT’s single-thread performance was only one percent (parallel workload) and 1.5% (multiprogramming workload) worse than that of the single-threaded superscalar. Accurate branch prediction hardware and the shared pool of renaming registers prevented additional penalties from occurring frequently.

Resource contention on SMT is also a potential problem. Many of SMT’s hardware structures, such as the caches, TLBs, and branch prediction tables, are shared by all threads. The unified organization allows a more flexible, and therefore higher, utilization of these structures, as executing threads place different usage demands on them. It also makes the entire structures available when fewer than eight threads—most importantly, a single thread—are executing. On the downside, interthread use leads to competition for the shared resources, potentially driving up cache misses, TLB misses, and branch mispredictions.

We found that interthread interference was significant only for the L1 data cache and the branch prediction tables; the data sets of our workload fit comfortably into the off-chip L2 cache, and conflicts in the TLBs and the L1 instruction cache were minimal. L1 data cache misses rose by 68% (parallel workload) and 66% (multiprogramming workload) and branch mispredictions by 50% and 27%, as the number of threads went from one to eight.

SMT was able to absorb the additional conflicts. Many of the L1 misses were covered by the fully pipelined 16-Mbyte L2 cache, whose latency was only 10 cycles longer than that of the L1 cache. Consequently, L1 interthread conflict misses degraded performance by less than one percent.8 The most important factor, however, was SMT’s ability to simultaneously issue instructions from multiple threads. Thus, although simultaneous multithreading introduces additional conflicts for the shared hardware structures, it has a greater ability to hide them.

SMT vs. the superscalar. The single-threaded superscalar fell far short of SMT’s performance. As Table 2 shows, the superscalar’s instruction throughput averaged 2.7 instructions per cycle, out of a potential of eight, for the multiprogramming workload; the parallel workload had a slightly higher average throughput of 3.3.

Consequently, SMT executed the multiprogramming workload 2.3 times faster and the parallel workload 1.9 times faster (at eight threads). The superscalar’s inability to exploit more instruction-level parallelism and any thread-level parallelism (and consequently hide horizontal and vertical waste) contributed to its lower performance.
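The throughput and waste figures above come from cycle-level simulation. As a minimal illustration of the bookkeeping involved (our sketch; the article does not publish simulator code), the loop below classifies each cycle's unused issue slots as horizontal or vertical waste and computes instructions per cycle:

```c
/* Issue-slot accounting of the kind a cycle-level simulator performs
 * (illustrative only). Unused slots in a partially used cycle are
 * horizontal waste; completely idle cycles are vertical waste. */
#include <stdio.h>

#define ISSUE_WIDTH 8

int main(void)
{
    /* Hypothetical per-cycle issue counts from a short trace. */
    int issued_per_cycle[] = { 3, 0, 8, 1, 0, 6 };
    int cycles = (int)(sizeof issued_per_cycle / sizeof issued_per_cycle[0]);

    long issued = 0, horizontal = 0, vertical = 0;
    for (int c = 0; c < cycles; c++) {
        int n = issued_per_cycle[c];
        issued += n;
        if (n == 0)
            vertical += ISSUE_WIDTH;       /* whole cycle wasted  */
        else
            horizontal += ISSUE_WIDTH - n; /* leftover slots only */
    }

    printf("IPC = %.2f, horizontal waste = %ld slots, vertical waste = %ld slots\n",
           (double)issued / cycles, horizontal, vertical);
    return 0;
}
```

On this accounting, the eight-thread SMT's 6.2 instructions per cycle leaves fewer than two of the eight slots unused in an average cycle, whereas the superscalar's 2.7 leaves more than five.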



SMT vs. fine-grained multithreading. By eliminating vertical waste, the fine-grained multithreaded architecture provided speedups over the superscalar as high as 1.3 on both workloads. However, this maximum speedup occurred at only four threads; with additional threads, performance fell. Two factors contribute. First, fine-grained multithreading eliminates only vertical waste; given the latency-hiding capability of its out-of-order processor and lockup-free caches, four threads were sufficient to do that. Second, fine-grained multithreading cannot hide the additional conflicts from interthread competition for shared resources, because it can issue instructions from only one thread each cycle. Neither limitation applies to simultaneous multithreading. Consequently, SMT was able to get higher instruction throughput and greater program speedups than the fine-grained multithreaded processor.

SMT vs. the multiprocessors. SMT obtained better speedups than the multiprocessors (MP2 and MP4), not only when simulating the machines at their maximum thread capability (eight for SMT, four for MP4, and two for MP2), but also for a given number of threads. At maximum thread capability, SMT’s throughput reached 6.1 instructions per cycle, compared with 4.3 for MP2 and 4.2 for MP4.

Speedups on the multiprocessors were hindered by the fixed partitioning of their hardware resources across processors, which prevents them from responding well to changes in instruction- and thread-level parallelism. Processors were idle when thread-level parallelism was insufficient; and the multiprocessor’s narrower processors had trouble exploiting large amounts of instruction-level parallelism in the unrolled loops of individual threads. An SMT processor, on the other hand, dynamically partitions its resources among threads, and therefore can respond well to variations in both types of parallelism, exploiting them interchangeably. When only one thread is executing, (almost) all machine resources can be dedicated to it; and additional threads (more thread-level parallelism) can compensate for a lack of instruction-level parallelism in any single thread.

To understand how fixed partitioning can hurt multiprocessor performance, we measured the number of cycles in which one processor needed an additional hardware resource and the resource was idle in another processor. (In SMT, the idle resource would have been used.) Fixed partitioning of the integer units, for both arithmetic and memory operations, was responsible for most of MP2’s and MP4’s inefficient resource use. The floating-point units were also a bottleneck for MP4 on this largely floating-point-intensive workload. Selectively increasing the hardware resources of MP2 and MP4 to match those of SMT eliminated a particular bottleneck. However, it did not improve speedups, because the bottleneck simply shifted to a different resource. Only when we gave each processor within MP2 and MP4 all the hardware resources of SMT did the multiprocessors obtain greater speedups. However, this occurred only when the architectures executed the same number of threads; at maximum thread capability, SMT still did better.

The speedup results also affect the implementation of these machines. Because of their narrower issue width, the multiprocessors could very well be built with a shorter cycle time. The speedups indicate that a multiprocessor’s cycle time must be less than 70% of SMT’s before its performance is comparable.




Alternative approaches to exploiting parallelism

Other researchers have studied simultaneous multithreading designs, as well as several architectures that represent alternative approaches to exploiting parallelism. Hirata and colleagues1 presented an architecture for an SMT processor and simulated its performance on a ray-tracing application. Gulati and Bagherzadeh2 proposed an SMT processor with four-way issue. Yamamoto and Nemirovsky3 evaluated an SMT architecture with separate dispatch queues for up to four threads.

Thread-level parallelism is also essential to other next-generation architectures. Olukotun and colleagues4 investigated design trade-offs for a single-chip multiprocessor and compared the performance and estimated area of this architecture with that of superscalars. Rather than building wider superscalar processors, they advocate the use of multiple, simpler superscalars on the same chip.

Multithreaded architectures have also been widely investigated. The Tera5 is a fine-grained multithreaded processor that issues up to three operations each cycle. Keckler and Dally6 describe an architecture that dynamically interleaves operations from LIW instructions onto individual functional units. Their M-Machine can be viewed as a coarser grained, compiler-driven SMT processor.

Some architectures use threads in a speculative manner to exploit both thread-level and instruction-level parallelism. Multiscalar7 speculatively executes threads using dynamic branch prediction techniques and squashes threads if control (branches) or data (memory) speculation is incorrect. The superthreaded architecture8 also executes multiple threads concurrently, but does not speculate on data dependencies.

Although all of these architectures exploit multiple forms of parallelism, only simultaneous multithreading has the ability to dynamically share execution resources among all threads. In contrast, the others partition resources either in space or in time, thereby limiting their flexibility to adapt to available parallelism.

References
1. H. Hirata et al., “An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads,” Proc. Int’l Symp. Computer Architecture, ACM, New York, 1992, pp. 136-145.
2. M. Gulati and N. Bagherzadeh, “Performance Study of a Multithreaded Superscalar Microprocessor,” Proc. Int’l Symp. High-Performance Computer Architecture, ACM, 1996, pp. 291-301.
3. W. Yamamoto and M. Nemirovsky, “Increasing Superscalar Performance through Multistreaming,” Proc. Int’l Conf. Parallel Architectures and Compilation Techniques, IFIP, Laxenburg, Austria, 1995, pp. 49-58.
4. K. Olukotun et al., “The Case for a Single-Chip Multiprocessor,” Proc. Int’l Conf. Architectural Support for Programming Languages and Operating Systems, ACM, 1996, pp. 2-11.
5. R. Alverson et al., “The Tera Computer System,” Proc. Int’l Conf. Supercomputing, ACM, 1990, pp. 1-6.
6. S.W. Keckler and W.J. Dally, “Processor Coupling: Integrating Compile Time and Runtime Scheduling for Parallelism,” Proc. Int’l Symp. Computer Architecture, ACM, 1992, pp. 202-213.
7. G.S. Sohi, S.E. Breach, and T. Vijaykumar, “Multiscalar Processors,” Proc. Int’l Symp. Computer Architecture, ACM, 1995, pp. 414-425.
8. J.-Y. Tsai and P.-C. Yew, “The Superthreaded Architecture: Thread Pipelining with Run-time Data Dependence Checking and Control Speculation,” Proc. Int’l Conf. Parallel Architectures and Compilation Techniques, IEEE Computer Society Press, Los Alamitos, Calif., 1996, pp. 49-58.

SIMULTANEOUS MULTITHREADING is an evolutionary design that attacks multiple sources of waste in wide-issue processors. Without sacrificing single-thread performance, SMT uses instruction-level and thread-level parallelism to substantially increase effective processor utilization and to accelerate both multiprogramming and parallel workloads. Our measurements show that an SMT processor achieves performance superior to that of several competing designs, such as superscalar, traditional multithreaded, and on-chip multiprocessor architectures. We believe that the efficiency of a simultaneous multithreaded processor makes it an excellent building block for future technologies and machine designs.

Several issues remain whose resolution should improve SMT’s performance even more. Our future work includes compiler and operating systems support for optimizing programs targeted for SMT and combining SMT’s multithreading capability with multiprocessor architectures.

Acknowledgments
We thank John O’Donnell of Equator Technologies, Inc. and Tryggve Fossum of Digital Equipment Corp. for the source to the Alpha AXP version of the Multiflow compiler. We also thank Jennifer Anderson of DEC Western Research Laboratory for copies of the SpecFP95 benchmarks, parallelized by the most recent version of the SUIF compiler, and Sujay Parekh for comments on an earlier draft.

This research was supported by NSF grants MIP-9632977, CCR-9200832, and CCR-9632769, NSF PYI Award MIP-9058439, DEC Western Research Laboratory, and several fellowships (Intel, Microsoft, and the Computer Measurement Group).



References
1. R. Alverson et al., “The Tera Computer System,” Proc. Int’l Conf. Supercomputing, ACM, New York, 1990, pp. 1-6.
2. K. Farkas et al., “The Multicluster Architecture: Reducing Cycle Time Through Partitioning,” to appear in Proc. 30th Ann. IEEE/ACM Int’l Symp. Microarchitecture, IEEE Computer Society Press, Los Alamitos, Calif., Dec. 1997.
3. S. Palacharla, N.P. Jouppi, and J.E. Smith, “Complexity-Effective Superscalar Processors,” Proc. Int’l Symp. Computer Architecture, ACM, 1997, pp. 206-218.
4. D.M. Tullsen et al., “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,” Proc. Int’l Symp. Computer Architecture, ACM, 1996, pp. 191-202.
5. S. McFarling, “Combining Branch Predictors,” Tech. Report TN-36, Western Research Laboratory, Digital Equipment Corp., Palo Alto, Calif., June 1993.
6. M.W. Hall et al., “Maximizing Multiprocessor Performance with the SUIF Compiler,” Computer, Dec. 1996, pp. 84-89.
7. P.G. Lowney et al., “The Multiflow Trace Scheduling Compiler,” J. Supercomputing, May 1993, pp. 51-142.
8. J.L. Lo et al., “Converting Thread-level Parallelism to Instruction-level Parallelism via Simultaneous Multithreading,” ACM Trans. Computer Systems, Aug. 1997.

Susan J. Eggers is an associate professor of computer science and engineering at the University of Washington. Her research interests are computer architecture and back-end compilation, with an emphasis on experimental performance analysis. Her current focus is on issues in compiler optimization (dynamic compilation and compiling for multithreaded machines) and processor design (multithreaded architectures). Eggers received a PhD in computer science from the University of California at Berkeley. She is a recipient of an NSF Presidential Young Investigator award and a member of the IEEE and ACM.

Joel S. Emer is a senior consulting engineer at Digital Equipment Corp., where he has worked on processor performance analysis and performance modeling methodologies for a number of VAX and Alpha CPUs. His current research interests include multithreaded processor organizations, techniques for increased instruction-level parallelism, instruction and data cache organizations, branch-prediction schemes, and data-prefetch strategies for future Alpha processors. Emer received a PhD in electrical engineering from the University of Illinois. He is a member of the IEEE, ACM, Tau Beta Pi, and Eta Kappa Nu.

Henry M. Levy is professor of computer science and engineering at the University of Washington, where his research involves operating systems, distributed and parallel systems, and computer architecture. His current research interests include operating system support for simultaneous multithreaded processors and distributed resource management in local-area and wide-area networks. Levy received an MS in computer science from the University of Washington. He is a senior member of the IEEE, a fellow of the ACM, and a recipient of a Fulbright Research Scholar award.

Jack L. Lo is currently a PhD candidate in computer science at the University of Washington. His research interests include architectural and compiler issues for simultaneous multithreading, processor microarchitecture, instruction-level parallelism, and code scheduling. Lo received a BS and an MS in computer science from Stanford University. He is a member of the ACM.

Rebecca L. Stamm is a principal hardware engineer in Digital Semiconductor’s Advanced Development Group at Digital Equipment Corp. She has worked on the architecture, logic and circuit design, performance analysis, verification, and debugging of four generations of VAX and Alpha microprocessors and is currently doing research in hardware multithreading. She holds 10 US patents in computer architecture. Stamm received a BSEE from the Massachusetts Institute of Technology and a BA in history from Swarthmore College. She is a member of the IEEE and the ACM.

Dean M. Tullsen is a faculty member in computer science at the University of California, San Diego. His research interests include simultaneous multithreading, novel branch architectures, compiling for multithreaded and other high-performance processor architectures, and memory subsystem design. He received a 1997 Career Award from the National Science Foundation. Tullsen received a PhD in computer science from the University of Washington, with emphasis on simultaneous multithreading. He is a member of the ACM.

Direct questions concerning this article to Eggers at Dept. of Computer Science and Engineering, Box 352350, University of Washington, Seattle, WA 98115; [email protected].

