High-Performance Throughput Computing

Throughput computing, achieved through multithreading and multicore technology, can lead to performance improvements that are 10 to 30× those of conventional processors and systems. However, such systems should also offer good single-thread performance. Here, the authors show that hardware scouting increases the performance of an already robust core by up to 40 percent for commercial benchmarks.

Shailender Chaudhry
Paul Caprioli
Sherman Yip
Marc Tremblay
Sun Microsystems

This article was based on a keynote speech by Marc Tremblay at the 2004 International Symposium on Computer Architecture in Munich, with some updated information.

Microprocessors took over the entire computer industry in the 1970s and 1980s, leading to the phrase “the attack of the killer micros” as a popular topic on the newsgroup comp.arch for many years. We believe that the disruption offered by throughput computing is similar in scope. Over the next several years, we expect chip multithreading (CMT) processors to take over all computing segments, from laptops and desktops to servers and supercomputers, effectively creating “the attack of throughput computing.” Most microprocessor companies have now announced plans and/or products with multiple cores and/or multiple threads per core.

Sun has been developing CMT technology for almost a decade through the Microprocessor Architecture for Java Computing (MAJC, pronounced “magic”) program1,2 and a variety of SPARC processors.3-7 IBM, through the Power4 and Power5,8,9 and Intel, through hyperthreading, are shipping products in this space. AMD has announced plans to ship dual-core chips in late 2005. What is misunderstood, though, is how you can build a processor that has several multithreaded cores and still delivers high-end single-thread performance. The combination of high-end single-thread performance and a high degree of multicore, multithreading (for example, tens of threads) is what we call high-performance throughput computing, the topic of this article.

A variety of papers have shown how workloads running on servers10 and desktops11 can greatly benefit from thread-level parallelism. It is no surprise that large symmetric multiprocessors (SMPs) were so successful in the 1990s in delivering very high throughput in terms of transactions per second, for instance, while running large, scalable, multithreaded applications. Successive generations of SMPs delivered better individual processors, better interconnects (certainly in terms of bandwidth, not necessarily in terms of latency), and better scalability. Gradually, applications and operating systems have matched the hardware for scalability.


A decade later, architects designing CMT processors face a dilemma: Many applications and some operating systems already scale to tens of threads. This leads to the question, “Should you populate a processor die with many very simple cores or should you use fewer but more powerful cores?” We simulated (and built, or are building) two very different cores and two different threading models and discuss how they benefit from multithreading and from growing the number of cores on a chip. We show that if multithreading is the prime feature and the rest of the core (that is, the rest of the pipeline and the memory subsystem) is architected around it, as opposed to simply adding threading to an existing core, these two different cores benefit greatly from multithreading (a 1.8 to 3.2× increase in throughput) and from multicores (a 3.1 to 3.7× increase). Server workloads dictate that high throughput be the primary goal. On the other hand, the impact of critical sections, Amdahl’s law, response-time requirements, and pipeline efficiency force us to try to design a high-performance single-thread pipeline, although not at the expense of throughput.

Unfortunately, as we discuss in a later section, two big levers that industry traditionally used— and traditional out-of-order execution—no longer work very well in the context of an aggressive 65-nm process and memory latencies now ranging in the hundreds of cycles.

Clearly, the industry needs new techniques. This article describes hardware scouting, in which the processor launches a hardware thread (invisible to software) which runs in front of the head thread. The goal here is to bring all interesting data and instructions (and control state) into the on-chip caches. Scouting heavily leverages some of the existing threaded hardware to boost single-thread performance. The control speculation accuracy, the scout’s depth, the memory-level parallelism (MLP) impact, and the overall effect on performance and cache sizes form the core of this article. We will show that this microarchitecture technique will, for some configurations, improve performance by 40 percent on TPC-C, double the effective L2 cache size for SPECint2000, and make a 1-Mbyte L2 cache behave as a 64-Mbyte one for SPECfp2000.

Researchers have proposed several other techniques to reduce the impact of ever-increasing memory latencies, as the “Scaling the Memory Wall” sidebar explains.

Throughput computing
Systems designed for throughput computing emphasize the overall work performed over a fixed time period as opposed to focusing on a metric describing how fast a single core or a thread executes a benchmark. The work is the aggregate amount of computation performed by all functional units, all threads, all cores, all chips, all coprocessors (such as network and cryptography accelerators), and all the network interface cards in a system.

The scope of throughput computing is broad, especially if the microarchitecture and operating system collaborate to enable a thread to handle a full process. In this way, a 32-thread CMT can appear to be an SMP of 32 virtual cores, from the viewpoint of the operating system and applications. At one end of the spectrum, 32 different processes could run on a 32-way CMT. At the other end, a single application, scaling to 32 threads—like many large commercial applications running on today’s SMPs—can run on a CMT. Any point in between is also possible. For instance, you could run four application servers, each scaling to eight threads. Obviously, there are sweet spots, and it is the task of the load-balancing software and operating system to appropriately leverage the hardware.

Throughput computing relies on the fact that server processors run large data sets and/or a large number of distinct jobs, resulting in memory footprints that stress all levels of the memory hierarchy. This stress results in much higher miss rates for memory operations than those typical of SPECint2000, for instance. A compounding effect is the miss cost. In about 2007, servers will already be driving over 1 Tbyte of data provided by hundreds of dual inline memory modules. This makes the memory physically far from processors (requiring long board traces, heavy loading, backplane connectors, and so on), resulting in large cache-miss penalties.



Scaling the Memory Wall

Tolerating memory latency has of course been a goal for architects and researchers for a long time. Various designs have employed caches to tolerate memory latency by using the temporal and spatial locality that most programs exhibit. To improve the latency tolerance of caches, other designs have employed nonblocking caches that support load hits under a miss and multiple misses.

The use of software prefetching has proven effective when the compiler can statically predict which memory reference will cause a miss and disambiguate pointers.1,2 This is effective for a small subset of applications. In particular, however, it is not very effective in commercial database applications, a prime target for servers. Notice that software prefetching can also hurt performance if it

• generates extraneous load misses by speculatively hoisting multiple loads above branches based on static analysis and
• increases instruction bandwidth.

Hardware prefetching alleviates the instruction cache pressure, using dynamic information to predict misses and discover access patterns. Its drawback (and the reason why system vendors often turn it off) is that it is hard to map a single hardware algorithm to different applications, which would require a variety of algorithms. We have observed severe cache pollution when applying a generic hardware prefetch to codes with significantly different data access patterns.

Thread-based prefetching techniques on a CMT processor use idle threads or cores to prefetch data for the main computation.3-5

Finally, researchers have proposed several forms of run-ahead execution; these techniques have various levels of complexity.6-8 In general, these techniques run program instructions after long-latency instructions. They do not actually retire instructions but prefetch data into the caches. These prefetches are accurate because they are part of the instruction stream and are based on runtime information. Run-ahead execution can effectively cover L1 latency. Covering the longer latencies of remote caches and memory came at a hardware cost and some potential slowdown. New techniques such as out-of-order commit9 are starting to appear. It remains to be seen how they will work within the context of throughput computing, but it looks promising.

Hardware scouting also runs instructions after a load miss. It does so by using mostly the same structures needed to support multithreading in a processor core. An earlier article gives a preview of how to accomplish this in a modern processor.10 A novel aspect of scouting is that it executes instructions at a faster rate than normal execution to both ensure the timeliness of prefetches and to warm up the fetch prediction hardware and caches. Invoking and exiting out of scouting is a zero-cycle operation to avoid slowdowns from unsuccessful scouting events. This is an important implementation aspect that dictates the organization of important pipeline stages (this has been our experience). We conclude that even though architects can no longer mostly rely on faster transistors and faster wires to provide an easy performance scaling, certain ways of using the transistor budget compensate for the lack of scaling; these reach high levels of performance, especially in the context of throughput computing.

References
1. C.-K. Luk and T.C. Mowry, “Compiler-Based Prefetching for Recursive Data Structures,” Proc. 7th Int’l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 96), ACM Press, 1996, pp. 222-233.
2. T.C. Mowry, M.S. Lam, and A. Gupta, “Design and Evaluation of a Compiler Algorithm for Prefetching,” Proc. 5th Int’l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 92), ACM Press, 1992, pp. 62-73.
3. S. Chaudhry and M. Tremblay, “Method and Apparatus for Using an Assist Processor to Pre-fetch Data Values for a Primary Processor,” US Patent 6,415,356.
4. J.D. Collins et al., “Dynamic Speculative Precomputation,” Proc. 34th Ann. ACM/IEEE Int’l Symp. Microarchitecture (Micro-34), IEEE CS Press, 2001, pp. 306-317.
5. C. Zilles and G. Sohi, “Execution-Based Prediction Using Speculative Slices,” Proc. 28th Ann. Int’l Symp. Computer Architecture (ISCA 01), IEEE CS Press, 2001, pp. 2-13.
6. J. Dundas and T. Mudge, “Improving Data Cache Performance by Pre-Executing Instructions Under a Cache Miss,” Proc. 1997 Int’l Conf. Supercomputing (SC 97), IEEE CS Press, 1997, pp. 68-78.
7. R. Balasubramonian, S. Dwarkadas, and D.H. Albonesi, “Dynamically Allocating Processor Resources Between Nearby and Distant ILP,” Proc. 28th Ann. Int’l Symp. Computer Architecture (ISCA 01), IEEE CS Press, 2001, pp. 26-37.
8. O. Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” Proc. 9th Int’l Symp. High Performance Computer Architecture (HPCA 03), IEEE CS Press, 2003, pp. 129-140.
9. A. Cristal, D. Ortega, J. Llosa, and M. Valero, “Out-of-order Commit Processors,” Proc. 10th Int’l Symp. High Performance Computer Architecture (HPCA 04), IEEE CS Press, 2004, pp. 48-59.
10. J. Niccolai, “Sun Adds to its UltraSPARC Road Map,” Computerworld, 12 Feb. 2004; http://www.computerworld.com/hardwaretopics/hardware/story/0,10801,90145,00.html.


As a result, cores are often idle, waiting for data or instructions coming from distant memory locations. Depending on the processor’s latency tolerance, “distant memory” can be the first few levels of caches, external caches, other processors’ caches, and/or main memory. As we will describe later, even very aggressive out-of-order processors will be idle, waiting for memory data, for hundreds of cycles.

Figure 1. Systems with multithreaded cores hide stall cycles by switching alternate threads and processes into the pipeline.

CMT allows the processor to switch on-chip threads into the pipeline after it discovers events causing long stalls; Figure 1 shows this strategy. CMT uses the fact that some of the outer structures (those further from the core) are often architected for full pipelining and yet are idle most of the time. Examples of such structures include memory controllers, I/O controllers, and networking accelerators. CMT processors share these structures among cores and threads, resulting in much better utilization. Alternatively, we find that adding full pipelining to nonpipelined structures, such as memory controllers, cyclic redundancy checkers, or large MMUs, adds slightly to the area and yet allows full sharing at very little performance cost. Figure 2 shows how cores and threads overlap spatially and temporally to achieve a large amount of work (high throughput).

Figure 2. A 32-way CMT combines multithreaded cores into an on-chip SMP to deliver high throughput.

Throughput computing is as much about the consolidation of on-chip resources as it is about consolidating assets at the server granularity. This strategy accommodates data sharing among processors in a large SMP on-chip or across a few CMT chips with much shorter latencies. For horizontally scaled applications, this permits consolidation of multiple copies of the operating system and the applications into one copy of the operating system and a single physical copy of the application (with multiple virtual copies). This consolidation can result in large savings in terms of memory, often the dominant cost in these platforms.
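To make the switching policy of Figure 1 concrete, the sketch below models a hypothetical vertical-threaded core: one thread issues per cycle, and whenever the active thread starts a long-latency miss the core rotates to another ready thread. The miss latency, thread count, and instruction mix are invented for illustration; this is not a model of any actual Sun pipeline.

```cpp
// Toy model of Figure 1: a vertical-threaded core that issues from one thread
// per cycle and rotates to another ready thread when the current thread starts
// a long-latency miss. All parameters are illustrative assumptions.
#include <array>
#include <cstdio>
#include <vector>

constexpr long kMissLatency = 100;  // assumed memory-miss latency in cycles

struct Thread {
    std::vector<bool> ops;  // per-instruction flag: true = load that misses
    size_t pc = 0;          // next instruction to issue
    long ready_at = 0;      // cycle at which the thread may issue again
    bool done() const { return pc >= ops.size(); }
};

int main() {
    std::array<Thread, 4> threads;
    for (auto& t : threads)  // compute, miss, compute, compute, miss, compute
        t.ops = {false, true, false, false, true, false};

    long cycle = 0, issued = 0;
    size_t current = 0, remaining = threads.size();
    while (remaining > 0) {
        for (size_t k = 0; k < threads.size(); ++k) {
            size_t idx = (current + k) % threads.size();
            Thread& t = threads[idx];
            if (t.done() || t.ready_at > cycle) continue;  // stalled or finished
            bool miss = t.ops[t.pc++];
            ++issued;
            if (miss) t.ready_at = cycle + kMissLatency;   // this thread now stalls
            if (t.done()) --remaining;
            // Stay on this thread unless it just stalled or finished.
            current = (miss || t.done()) ? (idx + 1) % threads.size() : idx;
            break;
        }
        ++cycle;  // one cycle elapses whether or not anything issued
    }
    std::printf("total cycles = %ld, issue-slot utilization = %.2f\n",
                cycle, static_cast<double>(issued) / static_cast<double>(cycle));
    return 0;
}
```

With a single thread the same instruction stream would leave the pipeline idle for the full latency of every miss; with four threads most of those stall cycles overlap, which is the effect Figure 1 illustrates.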

Typical CMT

Figure 3. Block diagram of a typical CMT.



In Figure 3, we show a block diagram for a typical CMT processor, one that we have been using for simulation and analysis over the years. We will use it throughout this article. It is fairly generic and yet maps nicely onto the Niagara processor, for instance.6 In this generic processor, a few threads (four for Niagara) share a pipeline and form a core. For a T-threaded core, we assume T copies of the register file and T copies of the internal processor registers. We assume that other inner structures, such as the level-one (L1) instruction cache, data cache, and translation look-aside buffers, are shared within a core but not across cores.

This model views the memory subsystem as a single logical structure. In our diagram, a global switch connects cores and the shared logical memory. The on-chip L2 cache has banks for increased bandwidth. Figure 3 shows the same number of banks as the number of cores, but that need not be the case. Each bank of the L2 cache has attached memory controllers. We use this model to study the impact of threading and multiple cores on overall performance.

Multithreading and multicore
In trying to architect a core for a CMT chip, there is an inherent tension between the complexity of each core and the number of cores that can fit on a die. If you choose to implement a lightweight core, or a core that does not target high instruction-level parallelism (ILP) and high clock rate, single-thread performance will suffer, but high overall throughput is achievable through replicating many of them.

The Niagara core is a good example of a power-efficient multithreaded core. It uses a simple one-issue, in-order pipeline with little speculation, which translates into a small area, low power, and a fairly easy core to multithread. A complex core will require more effort to implement, likely result in a more expensive part, and likely draw more power. On the other hand, it can offer substantial performance gain.

Both simultaneous multithreading (SMT) and vertical threading (VT) enable a core to support multiple threads. Vertical threading is simpler because it has the restriction that, on a given cycle, the processor can only issue instructions from a single thread. Obviously, VT is less able to optimize pipeline resource utilization.

Lightweight cores become more attractive when coupled with multithreading. A core using SMT or VT can take advantage of otherwise idle resources while not significantly impacting other threads. Since lightweight cores have a relatively low resource utilization per thread, such cores can support many threads before per-thread performance starts to degrade. Supporting four threads can scale total performance by as much as 3.2× on a large database benchmark, as Figure 4 shows. The figure also shows that a lightweight core with four threads has greater performance than three single-threaded cores, a very good trade-off.

Figure 4. Impact of cores and threads on lightweight cores.

These large performance improvements might appear surprisingly high when compared with other commercial processors. The key to achieving them is to architect cores from the very beginning to support multithreading (as opposed to adding multithreading to an existing core not explicitly built for multithreading). As an example, in the context of throughput computing, there is little to gain from trying to access the L1 cache in two cycles (versus three). This relaxation permits the removal of complex and power-hungry hardware, which has the benefit of reducing the area and power of a core. However, it also lowers the initial single-thread performance, anticipating that the multithreaded performance will make up for this deficit.


A heavyweight core will improve single-thread performance, but generally reduces the number of cores that will fit onto a chip. Also, compared to a lightweight core, a heavier core will typically have higher power consumption. Nevertheless, the judicious addition of complexity can increase performance per square millimeter as well as performance per watt. Experiments show that an otherwise fairly lightweight core with hardware scouting capabilities (which we discuss later) can outperform a truly lightweight core by 4 to 10×.

Multithreading is also a profitable addition to a heavyweight core. Although a heavier core is better able to optimize per-thread pipeline utilization and is better at extracting ILP than a lighter core, there still remains significant opportunity for an additional thread to utilize otherwise idle resources.

We have observed that a two-thread core achieves a 1.4 to 1.5× improvement in throughput. This boost in performance costs only a relatively small increase in area (about 10 percent). The additional improvement from a four-thread core starts to diminish, yielding another 1.3 to 1.4× improvement. This incremental performance comes at an incremental cost of about 15 percent for the state of the additional two threads and the slightly bigger state machines.

Figure 5. Impact of cores and threads on heavyweight cores.

Multicore technology can help to extract thread-level parallelism in either lightweight or heavyweight core designs. By putting four lightweight cores on a die, we observed up to a 3.6× increase in performance (see Figure 4). Similarly, as Figure 5 shows, four heavyweight cores on a die showed improvements of up to 3.7×. Therefore, almost regardless of the type of core being used, throughput processors have much to gain from replicating cores and organizing them efficiently.

Furthermore, the gains achieved by replicating cores are largely independent of the gains obtained by multithreading. We observed substantial performance improvement by putting several multithreaded cores on a die. For example, a chip using four lightweight cores, each having four threads per core, delivered a combined 10× performance. A heavier multithreaded core was also able to yield a 6.9× improvement. Of course, this scaling is based on the already impressive 4 to 10× advantage of the heavier core.

We generated the data in Figures 4 and 5 to understand scaling, sharing, inflexion points, and so on. To isolate these effects, we assume that

• all threads on a core share the L1 caches,
• all threads on a chip share the TLBs,
• the number of threads or cores does not affect the access times to caches, and
• the processor connects directly to memory through multiple memory controllers (that is, there is no L3 cache).

Single-thread performance
CMT processors built from heavyweight cores can provide good single-thread performance. One choice to consider is a traditional out-of-order core. Such processors have been very popular and fairly successful in running commercial applications because they can execute past load misses and overlap multiple misses. This reduces the effective memory latency, resulting in a high number of instructions per cycle (IPC).

The ability to overlap misses is limited by the window size and the amount of speculation supported.



A small window limits the number of instructions that are potential candidates for overlap. For example, a 32-instruction window limits the search for the next load miss to 31 instructions following the first load miss. Unfortunately, as we discussed previously, the growing gap in the cycle time of dual inline memory modules and processors has resulted in memory accesses that can take hundreds of processor cycles to complete. Thus, to “tolerate” this long latency and overlap multiple load misses, an out-of-order processor needs a very large window and highly speculative execution. This increases the core area substantially; some structures grow quadratically with window size. Growing this window size to 64, 128, or even 256 instructions through traditional methods runs counter to the desire to build a core small enough to fit many of them on a single chip. The data on the UltraSPARC V core, a 128-instruction-deep out-of-order processor, showed that it was about nine times as big as the four-issue in-order core in UltraSPARC I. The power associated with the large associative structures (content-addressable memories) and broadcast networks also make traditional out-of-order cores unattractive for throughput computing.

The amount of speculation that fits on a chip limits the ability to overlap multiple load misses. For example, if a store address is unknown (because it depends on a load miss), and the processor does not support memory disambiguation for possible read-after-write hazards of subsequent loads, then these loads cannot be candidates for miss overlap. Also, atomic operations typically stall the decode stage of the pipeline until all prior instructions have completed. This too prevents subsequent loads from becoming candidates for miss overlap. Lastly, some programs do have independent instructions, but these lie farther along in the instruction stream than a limited hardware structure can handle. In such programs, a processor with even a reasonably large window size of as much as 128 instructions will not find other independent loads that it can overlap to memory.

Moreover, aggressive speculation only adds to the complexity and area. The design must control the aggressiveness using predictors and rewind mechanisms, and it must implement structures to correct any miss-speculation.

Figure 6. Database memory parallelism. The left bar of each set is for a moderately speculative, out-of-order core; the right bar is for a core with an aggressive issue policy.

Figure 7. Window termination conditions. The left bar of each set is for a moderately speculative, out-of-order core; the right bar is for a core with an aggressive issue policy.
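The window-size effect shown in Figure 6 can be approximated with a simple scan over a miss-marked instruction trace. The sketch below is our own rough calculation, not the simulator behind the figure: it counts how many further misses fall within a window of W instructions after each blocking miss, ignoring register dependences and issue restrictions, so it is an optimistic upper bound on the overlap a W-entry window could expose.

```cpp
// Optimistic bound on miss overlap for a W-instruction window: after each
// blocking miss, count further misses among the next W-1 trace entries.
// Illustrative only; real windows are also limited by dependences, stores,
// atomics, and speculation support, as discussed in the text.
#include <algorithm>
#include <cstdio>
#include <vector>

double avg_overlap(const std::vector<bool>& is_miss, size_t window) {
    size_t groups = 0, overlapped = 0, i = 0;
    while (i < is_miss.size()) {
        if (!is_miss[i]) { ++i; continue; }
        ++groups;                                    // a blocking miss starts here
        size_t end = std::min(is_miss.size(), i + window);
        for (size_t j = i + 1; j < end; ++j)         // misses visible in the window
            if (is_miss[j]) ++overlapped;
        i = end;                                     // resume once the miss returns
    }
    return groups ? static_cast<double>(overlapped) / groups : 0.0;
}

int main() {
    // Synthetic trace: a cache miss roughly every ninth instruction.
    std::vector<bool> trace(1000, false);
    for (size_t i = 0; i < trace.size(); i += 9) trace[i] = true;

    for (size_t w : {16u, 32u, 64u, 128u})
        std::printf("window %3zu: %.2f additional misses overlapped on average\n",
                    w, avg_overlap(trace, w));
    return 0;
}
```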


The following figures illustrate the degree of MLP obtained when running a database workload with different core architectures. In Figure 6, the y-axis shows the number of additional misses overlapped to memory from a 2-Mbyte unified L2 cache. On the x-axis, we show different window sizes and different degrees of speculation. The first bar in each pair represents a moderately speculative, out-of-order core; the second represents an aggressive issue policy. In particular, the aggressive policy allows stores to issue out of order with respect to other stores and allows branches to issue out of order with respect to other branches.

Figure 7 identifies the cause that stopped the processor from examining subsequent instructions for issue and thus prevented it from discovering more independent load misses for overlapping. Serialization caused by atomic instructions begins to dominate as the cause.

Hardware scout
To achieve high MLP and thus high single-thread performance, we must issue independent memory requests that can be found fairly far downstream from an initial load miss. Furthermore, our technique for doing this must avoid the area, complexity, and window size restrictions inherent in the traditional out-of-order approach.

A hardware scout processor achieves MLP by creating a resumable hardware checkpoint of program state on encountering a load miss. The core can then continue issuing instructions arbitrarily far along in the instruction stream until the memory subsystem returns the load data. When the data returns, the processor resumes execution from the checkpointed state. Naturally, memory operations that the processor reissues on the second pass have the advantage of having been previously prefetched into the primary caches.

The principal advantage to this approach is that the scouting distance is proportional to the number of cycles elapsed waiting for the load miss data. The size of on-chip structures—such as the size of an out-of-order issue window—does not limit scouting. Moreover, hardware scouting relies only on redeploying the existing pipeline resources, a more area-efficient approach than having an autonomous hardware prefetch engine.

Figure 8. Example of how a hardware-scouting processor handles a load miss.

In the example in Figure 8, we see an initial load to register 7 missing and becoming the cause of a checkpoint. The entire sequence of instructions in the gray area then issues speculatively. This approach uses a separate register file, called the shadow register file, to make speculative computational results available for further address computations while the main register file safely retains the checkpoint state.

Of course, a subsequent load might have an address that depends on the original load that missed. Or, for that matter, it might depend on a use of the original load miss, another load miss, a use of a use of a load that missed, or so on. Since bandwidth is precious, we do not want to issue memory operations having nonsensical addresses to the caches. To accomplish this, each architectural register has an associated not there (NT) bit. When a load misses, the NT bit is set for the destination register. Additionally, the NT bit is set for the destination register of any instruction that has a source register NT bit set. That is, the destination’s NT bit is assigned a logical OR of the source registers’ NT bits.

It is worth noting that the NT bits do not propagate indefinitely. Once a register is redefined and the processor reuses it in the program, it clears the register’s NT bit. This happens automatically because of the OR operation previously described. An instruction whose source registers are available (and is not itself a missing load) will clear the NT bit for the destination register regardless of whether it had been previously set.
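The not-there bookkeeping just described reduces to a one-bit OR per instruction. The following sketch is a simplified software illustration of that rule as we read it from the text, not the Rock hardware: registers are a flat array, the shadow register file is a copy made at the checkpoint, loads whose address registers are present are issued as prefetches, and a destination's NT bit is the OR of its source NT bits, also set if the issued load itself misses.

```cpp
// Sketch of "not there" (NT) bit propagation during hardware scouting, per the
// rule in the text: destination NT = OR of source NT bits; an issued load that
// misses also sets NT; redefining a register from present sources clears it.
// This is an illustrative simplification, not the actual Rock implementation.
#include <array>
#include <bitset>
#include <cstdio>
#include <vector>

constexpr int kRegs = 32;

struct Insn {
    int dst;
    std::vector<int> srcs;  // operand (or address) registers
    bool is_load;           // loads with known addresses are issued as prefetches
    bool misses;            // whether such an issued load misses the caches
};

int main() {
    std::array<long, kRegs> checkpoint{};         // architectural state, preserved
    std::array<long, kRegs> shadow = checkpoint;  // speculative results while scouting
    std::bitset<kRegs> nt;
    nt[7] = true;  // the load into r7 missed and triggered the checkpoint (Figure 8)

    const std::vector<Insn> scout_stream = {
        {8,  {7},      false, false},  // uses r7: NT[8] = NT[7] = 1
        {20, {1},      true,  true},   // address is present: prefetch issued, misses
        {21, {20, 20}, false, false},  // depends on the in-flight load: NT[21] = 1
        {22, {8},      true,  false},  // address register is NT: load not issued
        {8,  {2, 3},   false, false},  // r8 redefined from present sources: NT cleared
    };

    for (const Insn& in : scout_stream) {
        bool src_nt = false;
        for (int s : in.srcs)
            if (nt[s]) src_nt = true;
        if (in.is_load && !src_nt)
            std::printf("scout prefetch issued for load into r%d\n", in.dst);
        nt[in.dst] = src_nt || (in.is_load && in.misses);
        if (!nt[in.dst]) shadow[in.dst] = 1;  // placeholder speculative result
    }

    std::printf("NT bits set after scouting: %zu\n", nt.count());
    // When the data for the original miss returns, execution restarts from
    // `checkpoint`; everything computed into `shadow` is simply discarded.
    return 0;
}
```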



Besides load misses, hardware scouting ameliorates other conditions that are traditionally stalling in nature. For example, a full store buffer is a stall condition for both traditional in-order and out-of-order designs. This becomes, for us, just another reason to take a checkpoint and begin hardware scouting. Since the processor uses store addresses to prefetch cache lines, the scouting will help ensure that the store buffer can drain more quickly for stores that will be reached architecturally once execution resumes from the checkpoint.

Note that the computational results calculated while scouting never commit to architectural state. This permits the freedom to perform optimizations beneficial to the common cases without concern for the correctness of a rare situation. As an example, data from stores that fit in the store buffer bypass correctly to loads, while stores that overflowed the store buffer capacity do not. In effect, we are speculating that loads leading to address computation do not have read-after-write hazards with the overflowed stores.

The processor will encounter branches and other control transfer instructions while scouting. Many of these control dependencies are resolvable, that is, independent of outstanding load misses. Thus, regardless of whether the fetch unit predicted them correctly, the scout thread can proceed down the correct path. The mispredicted control transfer instructions result only in a redirection of the fetch unit, as is normally done.

Upon encountering a control dependency that is not resolvable (the NT bit is set on the condition code register), hardware scouting continues executing along the fetch-unit-predicted path. This usually results in following the correct path. Note that there are no correctness issues associated with executing down the wrong path since instructions only write to the shadow register file, and the processor will eventually restart execution from the checkpoint. In fact, our results show that the more difficult to predict branches are usually associated with small if-else programming constructs that quickly return scout execution to the correct path. The experimental simulation results described in this article do not include this effect because our trace-driven methodology does not contain the wrong-path instructions. Hence, we simply do not model scouting through these control dependencies.

Figure 9. Simulation results for the database workload, showing four classes of branch (a) and jump target (b) prediction.

Figure 10. Simulation results for SPECint2000, showing four classes of branch (a) and jump target (b) prediction.
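The branch policy above amounts to a small decision rule, sketched here in our own terms rather than as the actual fetch-unit logic: a branch whose condition codes are present is resolved outright and redirects fetch if it was mispredicted, while a branch whose condition codes carry the NT bit simply follows the prediction.

```cpp
// How a scouting core can treat a conditional branch, per the text: resolvable
// branches (condition codes present) override the predictor; unresolvable ones
// (condition codes "not there") follow the predicted path. Illustrative sketch
// of the stated policy, not Sun's implementation.
#include <cstdio>

struct BranchOutcome {
    bool taken;           // direction the scout thread follows
    bool redirect_fetch;  // true if the fetch unit must be steered
};

BranchOutcome scout_branch(bool cc_not_there, bool cc_taken, bool predicted_taken) {
    if (cc_not_there)
        return {predicted_taken, false};  // trust the branch predictor
    // Condition codes are available: resolve the branch; a wrong prediction
    // redirects the fetch unit exactly as it would during normal execution.
    return {cc_taken, cc_taken != predicted_taken};
}

int main() {
    BranchOutcome resolved   = scout_branch(false, true, false);  // mispredicted but resolvable
    BranchOutcome unresolved = scout_branch(true,  false, true);  // follows the prediction
    std::printf("resolvable branch:   taken=%d redirect=%d\n",
                resolved.taken, resolved.redirect_fetch);
    std::printf("unresolvable branch: taken=%d redirect=%d\n",
                unresolved.taken, unresolved.redirect_fetch);
    return 0;
}
```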


Atomic operations (lock acquisition) and other memory barriers are a significant performance inhibitor to traditional methods for extracting ILP. In contrast, a scout-capable processor can scout past these barriers and prefetch data close to the requesting processor. Because software locks are typically uncontested, it is valuable to request the lock-protected cache lines. This is especially true because it was likely a remote processor that last used the shared data. Thus, the loads shortly following the barrier are often fertile territory for discovering MLP.

Because control speculation is our primary possibility for miss-speculation, we have some simulation results showing how well the scouting hardware handles control dependencies. For these results, we collected an instruction trace of 1.2 billion instructions from a commercial database workload and 500 instruction trace samples containing 1 million instructions each for both SPECint2000 and SPECfp2000. Using simulations, we classified each branch or jump instruction into one of four categories: resolved, correctly predicted; unresolved, correctly predicted; resolved, mispredicted; and unresolved, mispredicted.

Branch prediction is very good overall, 87 to 98 percent correct. Furthermore, between 37 and 72 percent of branches are resolvable during speculative execution. Branch prediction for the unresolvable branches is 84 to 97 percent accurate. Overall then, the hardware can scout in the correct direction 90 percent of the time for the database workload, as Figure 9a shows; for the SPECint2000 and SPECfp2000 codes the prediction is even better, as Figures 10a and 11a show.

Figure 11. Simulation results for SPECfp2000, showing four classes of branch (a) and jump target (b) prediction.

The jump target predictions illustrated in Figures 9b, 10b, and 11b show that scouting is only 4 percent incorrect in the worst case. Of course, this is no accident. When architecting a hardware scout processor from a clean slate, you must consider the importance of accurate predictions, then size buffers accordingly. Moreover, you must pay attention to how predictions made while scouting interact with predictions made while running the main thread. As an example, it is advisable to make the return address stack part of the checkpoint to properly predict method returns after having scouted.

The histograms in Figure 12 show how deep into the instruction stream scouting can proceed. The height of each bar indicates how many scouting events resulted in executing the number of instructions shown on the x-axis. The scale on the x-axis is from 0 to 1,000 instructions. These histograms reveal three natural groupings in the modeled configuration; they correspond to the latencies of the L2 cache, the L3 cache, and memory. Clearly, the L2 cache satisfies the majority of cache misses, so we created the second graph to better show the clustering at the higher latencies. For the database workload, we see that the scouting depth is more smoothly distributed. Because this multithreaded program has more primary cache misses, queuing delays in the memory subsystem somewhat blur the latency clusterings. The second histogram for the database workload shows the y-axis scaled logarithmically. The last bar is artificially high because it includes all distances greater than 1,000 instructions.

Scouting can progress 500, 600, even 700 instructions or more. Naturally, this relates to the memory latency—the longer it takes to satisfy a load request, the farther execution can proceed. What is not immediately obvious is the fact that dependent instructions (instructions having one or more source registers NT) can issue faster than independent ones. Recall that the only operation required of a dependent instruction is the single-bit ORing of the source NT bits to set the destination NT bit. Thus, these instructions can group without regard to pipeline resources such as arithmetic logic units, memory pipes, and so on.



Figure 12. Histograms showing the incidence of scouting on the database workload, for all latencies plotted linearly (a) and plotted semi-logarithmically (b). Similar SPECint2000 histograms are for all latencies (c) and for just the higher latencies (d). These show the natural three latency groupings.

Additionally, the issue of dependent instructions can ignore all data dependencies. For example, if one source register is NT, there is no point in stalling to wait until a multicycle instruction produces the other source register. In fact, if one of a producer’s sources is NT, even producers and consumers can issue together in the same cycle by forwarding the NT result of the producer to the consumer. The value of relaxing the issue rules for dependent instructions is that the processor encounters load misses more quickly. This helps ensure that the memory system will have responded with the data in a timely fashion. If it requires N cycles to reach a missing load instruction during normal execution, the scouting hardware will reach that same instruction in fewer than N cycles. This allows more time for the data to return to the pipeline.

The key to high performance in modern server processors is the ability to overlap long cache misses. To quantify this effect, we consider the metric of MLP, which we define as the average number of outstanding memory requests over the time period when there is at least one request outstanding.12 Figure 13 shows the overall MLP for our benchmarks as well as the MLP when we consider only loads, stores, or prefetch instructions in isolation. These figures show the relative improvement in MLP as compared to a moderately aggressive, out-of-order processor.
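The MLP definition above translates directly into a short calculation. The sketch below uses an invented three-request trace purely for illustration; it averages the number of outstanding requests over only those cycles during which at least one request is in flight.

```cpp
// MLP as defined in the text: the average number of outstanding memory
// requests over the period when at least one request is outstanding.
// The request trace is invented for illustration.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Request {
    long issue;     // cycle the miss leaves the core
    long complete;  // cycle the data returns
};

double mlp(const std::vector<Request>& reqs) {
    long last = 0;
    for (const Request& r : reqs) last = std::max(last, r.complete);
    long busy_cycles = 0, outstanding_sum = 0;
    for (long c = 0; c < last; ++c) {
        int out = 0;
        for (const Request& r : reqs)
            if (r.issue <= c && c < r.complete) ++out;  // outstanding at cycle c
        if (out > 0) { ++busy_cycles; outstanding_sum += out; }
    }
    return busy_cycles ? static_cast<double>(outstanding_sum) / busy_cycles : 0.0;
}

int main() {
    // Two overlapped misses followed by an isolated one, each 100 cycles long.
    std::vector<Request> trace = {{0, 100}, {10, 110}, {300, 400}};
    std::printf("MLP = %.2f\n", mlp(trace));  // about 1.43 for this trace
    return 0;
}
```

A processor that overlapped nothing would score exactly 1 by this metric; scouting raises the value by issuing additional misses while the first one is still outstanding.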


Clearly, scouting’s effect on MLP is dramatic. Overall, it increases MLP for the database workload by more than 20 percent; for SPECint2000, 32 percent; and for SPECfp2000, 70 percent. Considering loads alone, scouting increases MLP by more than 49 percent. Loads are arguably the most critical to overall performance, but stores also matter because they do accumulate in a finite-size store buffer. Again, we see that scouting increases the MLP of stores by approximately 20 percent. The figures also show prefetch MLP, but the result for the database workload is anomalous in that scouting decreases MLP. Our analysis shows that scouting finds the prefetches early and also ends them early. Furthermore, the number of prefetch instructions in the database trace is low, so the effect is actually small in absolute terms.

Figure 13. Percentage MLP improvement from scouting for the database (a), SPECint2000 (b), and SPECfp2000 (c) workloads.

The cycles spent in scouting benefit the architectural thread in several ways. First, as mentioned, load misses are issued to the memory subsystem and so you can think of scouting as a sophisticated hardware prefetch engine that not only prefetches exactly what the processor will need, but also does so through the reuse of existing execution resources. The processor can use this “prefetch engine” precisely when it is most useful—when the pipelines would otherwise stall. Moreover, the address computation uses the same instructions and register values that would otherwise later result in a load miss.

Secondly, hardware scouting primes other on-chip structures. It warms up the instruction cache, for example, with instructions that the processor will need in the immediate future. Thus, future instruction cache misses overlap with the current data cache miss. Additionally, the branch prediction array and jump target buffer both learn how to correctly predict upcoming control transfers.

For the case of a classic two-bit branch predictor, a designer ought to account for whether the branch is resolved speculatively or architecturally when updating the counter. In particular, the strongly taken backward branch at the end of a loop should not be updated to a not-taken state upon architecturally exiting the loop. (A naïve implementation might update this end-of-loop branch twice, once during scouting and again when reaching it architecturally. Such a double update might first update strongly taken to weakly taken and then to not-taken, an undesirable outcome.)

Scouting also results in higher performance. Figure 14 illustrates the improvement in IPC for various cache sizes. For example, the database workload shows a 40 percent improvement in performance for a 512-Kbyte L2 cache. Or, comparing points having approximately equal performance, a 1-Mbyte cache offers the performance of an 8-Mbyte one. Or, a 4-Mbyte cache offers the performance of a 16-Mbyte cache.

For SPECint2000, scouting yields a 12 percent performance boost for a 512-Kbyte cache, an attractive gain considering the low miss rate of these benchmarks. Similarly, a 1-Mbyte cache in a scouting processor is as effective as a 2-Mbyte cache in a processor incapable of scouting. The SPECfp2000 workload shows some of the most dramatic gains. With a 512-Kbyte cache, scouting increases SPECfp2000 performance by 34 percent. With a 1-Mbyte cache, the scouting processor’s performance exceeds that of a processor without scouting that has a 64-Mbyte L2 cache.
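The double-update hazard for a classic two-bit predictor, described in the parenthetical above, can be made concrete with a few lines. The policy shown (training the counter only on architectural resolution) is one possible answer consistent with the text, not necessarily the one used in Rock; the point is simply that a loop-exit branch seen once while scouting and once architecturally should move the counter a single step.

```cpp
// Two-bit saturating branch predictor whose counter is trained only when a
// branch resolves architecturally, so outcomes seen while scouting do not
// double-update it. The policy is illustrative, not Sun's implementation.
#include <cstdio>

enum Counter { kStrongNotTaken = 0, kWeakNotTaken, kWeakTaken, kStrongTaken };

struct TwoBitPredictor {
    int state = kStrongTaken;  // a well-trained end-of-loop backward branch
    bool predict() const { return state >= kWeakTaken; }
    void update(bool taken, bool architectural) {
        if (!architectural) return;  // ignore outcomes observed during scouting
        if (taken  && state < kStrongTaken)    ++state;
        if (!taken && state > kStrongNotTaken) --state;
    }
};

int main() {
    TwoBitPredictor p;
    // The loop-exit branch is seen not-taken once by the scout thread and once
    // when the main thread actually exits the loop.
    p.update(false, /*architectural=*/false);  // scout pass: no effect
    p.update(false, /*architectural=*/true);   // strongly taken -> weakly taken
    std::printf("next prediction: %s (counter state = %d)\n",
                p.predict() ? "taken" : "not taken", p.state);
    return 0;
}
```

A predictor that trained on both passes would end at not-taken and mispredict the next entry into the loop, which is exactly the undesirable outcome the text warns about.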



CMT processors offer a way to significantly improve the performance of computer systems. The return on investment for multithreading is among the highest in computer microarchitectural techniques. If you design a core from scratch to support multithreading, gains as high as 3× are possible for just a 20 percent increase in area. Few microarchitectural techniques can offer this level of improvement for the same additional area. Likewise, the availability of hundreds of millions to a few billion transistors permits building a network of multithreaded cores that can also offer excellent scalability. Here again, the key is to design the network with CMT in mind. In other words, the many wires and high frequency available on chip—as opposed to on a board or backplane—should dictate what the network will look like. If architected properly, linear scalability for a large set of applications is possible.

Figure 14. Normalized IPC improvement from scouting for the database (a), SPECint2000 (b), and SPECfp2000 (c) workloads. IPC improvement varies with cache size.

Even with throughput performance as the main target, we have shown that the microarchitecture necessary to support threads on a CMT can also achieve high single-thread performance. Hardware scouting, which Sun is implementing on the Rock microprocessor, can increase the single-thread performance of applications by up to 40 percent. Alternatively, scouting is a technique that makes the on-chip caches appear much larger, performance-wise. Finally, it is a performance robustness technique, making up for code tailored for different on-chip cache sizes or even a different number and levels of caches.


We have attempted to show that to fully exploit the promises of multithreading and multicores, you must architect for it from the very beginning. In the future, we will show how going down the path of CMT and hardware scouting leads to a new execution model that does not fit into the normal classification of in-order or out-of-order processors. We will show how to attain even higher single-thread performance by building on top of hardware scouting and combining some of our earlier work on speculative multithreading.2 MICRO

References
1. M. Tremblay, “MAJC: An Architecture for the New Millennium,” Hot Chips 11, 1999; http://www.hotchips.org/archives/hc11/3_Tue/hc99.s8.2.Tremblay.pdf.
2. M. Tremblay et al., “The MAJC Architecture: A Synthesis of Parallelism and Scalability,” IEEE Micro, vol. 20, no. 6, Nov.-Dec. 2000, pp. 12-25.
3. S. Kapil, “Gemini: A Power-Efficient Chip Multi-Threaded UltraSPARC Processor,” Hot Chips 15, 2003; http://www.hotchips.org/archive/hc15/pdf/12.sun.pdf.
4. Q. Jacobson, “UltraSPARC IV Processors,” Microprocessor Forum, In-Stat, 2003.
5. D. Greenley, “Sun UltraSPARC IV+ Processor,” Microprocessor Forum, In-Stat, 2004.
6. P. Kongetira, K. Aingaran, and K. Olukotun, “Niagara: A 32-Way Multithreaded Sparc Processor,” IEEE Micro, vol. 25, no. 2, Mar.-Apr. 2005, pp. 21-29.
8. J. Tendler et al., “Power4 System Microarchitecture,” IBM J. Research and Development, vol. 46, no. 1, Jan. 2002, pp. 5-25.
9. J. Clabes et al., “Design and Implementation of the Power5 Microprocessor,” Proc. 41st Ann. Conf. Design Automation (DAC 04), ACM Press, 2004, pp. 670-672.
10. S. Kunkel et al., “A Performance Methodology for Commercial Servers,” IBM J. Research and Development, vol. 44, no. 6, Nov. 2000, pp. 851-872.
11. K. Flautner et al., “Thread Level Parallelism and Interactive Performance of Desktop Applications,” ACM SIGPLAN Notices, vol. 35, no. 11, Nov. 2003, pp. 129-138.
12. Y. Chou, B. Fahs, and S. Abraham, “Microarchitecture Optimizations for Exploiting Memory-Level Parallelism,” Proc. 31st Ann. Int’l Symp. Computer Architecture (ISCA 04), IEEE Press, 2004, pp. 76-89.

Shailender Chaudhry is a distinguished engineer at Sun Microsystems where he is chief architect of the Rock processor. His experience includes work on the MAJC architecture, picoJava II, microJava, and space-time computing. His research interests include computer systems architecture, algorithms, and protocols. Chaudhry has a BS and an MS in computer engineering from Syracuse University.

Paul Caprioli is a senior staff engineer at Sun Microsystems where he is the Rock core architect and leads the performance modeling group. His research interests include computer architectures and high-performance technical computing. Caprioli has an MS and a PhD in mathematics from Rensselaer Polytechnic Institute.

Sherman Yip is an engineer at Sun Microsystems where he is a member of the Rock architecture team and responsible for performance modeling. His research interests include processor simulation, visualization, and Java performance and design. Yip has a BS in computer science from the University of California at Davis.

Marc Tremblay is a Sun Fellow and distinguished engineer at Sun Microsystems, where he is chief architect of the Scalable Systems Group. His research interests include chip multiprocessing, chip multithreading, speculative multithreading, and assist threading. Tremblay has an MS and a PhD in computer science from UCLA and a BS in physics engineering from Laval University in Canada. He holds 97 US patents in various areas of computer architecture. He is a member of the IEEE and the ACM.

Direct questions and comments about this article to Marc Tremblay, Sun Microsystems Inc., 430 North Mary Ave., Sunnyvale, CA 94085; [email protected].

