
Superspeculative Microarchitecture for Beyond AD 2000

Theme Feature

Employing a broad spectrum of superspeculative techniques can achieve significant performance increases over today's top-of-the-line microprocessors. The experimental, superspeculative microarchitecture Superflow has a potential performance of 9.0 and realizable performance of 7.3 IPC for the SPEC95 integer suite, without requiring recompilation or changes to the instruction set architecture.

Mikko H. Lipasti
John Paul Shen
Carnegie Mellon University

In its brief lifetime of 26 years, the microprocessor has achieved a total performance growth of 10,000 times thanks to technology improvements and microarchitecture innovations. Transistor count and clock frequency have increased by an order of magnitude in each of the first two decades of microprocessors; transistor count increased from 10,000 to 100,000 in the 1970s and up to 1 million in the 1980s, while clock frequency increased from 200 KHz to 2 MHz in the 1970s and up to 20 MHz in the 1980s. This incredible technology trend has continued: Since 1990, both transistor count and clock frequency have already achieved an increase of 20 to 30 times. During the 1980s, sustained instructions per cycle also increased by almost an order of magnitude, from roughly 0.1 to 0.9. IPC is a measure of the instruction-level parallelism or instruction throughput achieved by the concurrent processing of multiple machine instructions. In the 1990s, IPC improvement is struggling and may not triple by 1999. New microarchitecture innovations are needed.

Current top-of-the-line microprocessors are four-instruction-wide superscalar machines; that is, they can fetch and complete up to four instructions in a single machine cycle. Such machines use pipelined functional units, aggressive branch prediction, dynamic register renaming, and out-of-order execution of instructions to maximize parallelism and tolerate memory latency. State-of-the-art processors include the Digital Equipment Alpha 21264, Silicon Graphics MIPS R10000, IBM/Motorola PowerPC 604, and Intel Pentium Pro. Even with such elaborate microarchitectures, against a potential 4 IPC, these machines typically achieve only about 0.5 to 1.5 sustained IPC for real-world programs.

Worse yet, most studies indicate that machine efficiency drops even lower as we extrapolate to wider machines. One recent study indicated that although a hypothetical 2-instruction-wide machine achieves IPC in the range of 0.65 to 1.40, a similar, hypothetical, 6-instruction-wide machine will achieve only 1.2 to 2.3 IPC.1 Such data imply that the current superscalar paradigm is running into rapidly diminishing returns on performance.

POTENTIAL NEW PARADIGMS
Future billion-transistor chips will inevitably implement machines that are much wider (issue more than four instructions at once) and deeper (have longer pipelines). The question is, how do we harvest additional parallelism proportional to increased machine resources? Several approaches have vocal advocates, each with valid reasons; they are

• reconfigurable engines;
• specialized, very long instruction word (VLIW) machines;
• wide, simultaneous multithreaded (SMT) uniprocessors;
• single-chip multiprocessors (CMP);
• memory-centric computing engines (such as IRAM);
• very wide conventional superscalars; and
• wide superspeculative processors.

Two overriding concerns will determine which possibility might have the most commercial impact: performance scalability and software migration complexity. The approach that succeeds will have to deliver enough performance improvement to justify the hardware resources expended. It must also effectively deal with the tremendous effort and expense required to create or migrate a critical mass of useful applications. Computing history's two most economically successful instruction set architectures—the IBM S/360 and the Intel x86—have reaped rewards for paying meticulous attention to software cost and code compatibility.

Reconfigurable computers
Researchers have proposed reconfigurable computers that employ large arrays of highly programmable build-

0018-9162/97/$10.00 © 1997 IEEE September 1997 59


ing blocks. Typical examples are complex programmable logic devices and field-programmable gate arrays. These devices rely on powerful software tools to map applications to the reconfigurable hardware. The intent is to construct powerful, specialized computing engines at a relatively low cost with very short turnaround time. This approach's performance scalability has yet to be demonstrated beyond a few specialized applications. Moreover, the huge investment required to leverage the latest and best fabrication technology may not be economically justified by the niche-market volume for reconfigurable systems. Finally, software tools for automating application-to-hardware mapping must incorporate the latest code compilation and optimization techniques as well as yet-to-be-developed hardware synthesis technologies. This is a very tall order for the software.

VLIW machines
Specialized VLIW machines already exist for multimedia applications. These statically controlled, wide machines contain numerous functional units with highly deterministic behavior, which permits tremendous computing throughput on specialized applications. They provide performance scalability by packing more operations into a single instruction. VLIW machines rely on powerful compilers to detect and resolve inter-instruction dependences in software. This keeps the hardware design clean and fast. Such machines usually require application recompilation, so retargetable compilers are essential. Such compilers have been long in coming and are still not widely available. Furthermore, the static nature of VLIWs makes them inherently incompatible with dynamic variations in parallelism or latency, both of which are caused by aggressive memory subsystems and speculative-execution techniques.

SMT uniprocessors
Simultaneous multithreaded processors are superscalar uniprocessors that support multiple machine contexts and execute multiple instruction streams simultaneously. They do so to increase a multiprogrammed workload's throughput or reduce a multithreaded program's latency. Performance scalability depends on finding enough parallelism, a task left to software. Developing multithreaded applications is challenging due to the extreme difficulty of debugging multithreaded programs and the lack of automatic thread-partitioning compilers. Therefore, we view SMT primarily as a technique for improving throughput in multiprogrammed workloads.

Single-chip multiprocessors
Our view of the single-chip multiprocessor approach is similar. CMPs will inevitably be used for improving throughput under multiprogrammed workloads. However, their utility in improving single-program performance has thus far been restricted to numerical applications that contain easily parallelized loops. Limited interconnect and synchronization overhead will throttle performance scalability, particularly for more generalized applications with interthread dependences. Significant development is required before parallelizing compilers will provide performance gains for generalized applications on CMPs. Unless such software technology becomes widely available, CMP, like SMT, is destined to remain a technique for improving throughput but not latency. Furthermore, without a significant multithreaded-application base there may not be a mainstream demand for single-chip SMT processor and CMP implementations. Without mainstream market demand, these implementations are not economically justifiable.

Memory-centric engines
Proposing a memory-centric view (as opposed to the traditional CPU-centric view) to computer system design has become quite popular. The potential integration of dense DRAM technology with fast logic technology on the same chip (intelligent RAM) is certainly of technological interest. However, it is unclear that this integration inspires any truly new architecture paradigms. We see it as more of a technology/implementation issue that enables shorter latency and more density and bandwidth at the upper levels of the memory hierarchy. There is also the issue of software. Compilation tools for array-structured, message-passing multicomputers have been in development for over a decade and are still not widely available.

Wide conventional superscalars
On the other hand, scaling up current superscalars to issue more instructions at a time, though attractive from a software cost perspective, doesn't seem promising either. Performance scalability is limited because current microarchitectural techniques for extracting instruction-level parallelism are returning significantly less performance improvement on wider machines. Incremental improvements will result in only marginal performance gains. Limit studies have shown that even when all control and structural hazards are removed, single-thread performance is still severely limited by true data dependences between instructions that produce results and instructions that use these results as their source operands. Due to such data dependences, the producer and consumer instructions must necessarily be serialized. Enforcing these serializations leads to the classical dataflow limit for program performance: Given unlimited machine resources, a program cannot execute any faster than the execution of the longest dependence chain induced by the program's true data dependences.
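The dataflow limit can be made concrete with a small, illustrative sketch: given only the true data dependences (and, for simplicity, unit-latency instructions), the minimum execution time with unlimited resources is the longest dependence chain. The instruction fragment below is hypothetical.

```python
# Illustrative only: the dataflow limit is the longest chain of true data
# dependences, assuming unit-latency instructions and unlimited resources.
def dataflow_limit(deps, n):
    """deps maps a consumer's index to the producers it reads; n = count."""
    depth = [0] * n
    for i in range(n):  # instructions visited in program order
        depth[i] = 1 + max((depth[p] for p in deps.get(i, [])), default=0)
    return max(depth)

# Hypothetical 4-instruction fragment:
#   i0: r1 = load A    i1: r2 = r1 + 1    i2: r3 = r2 * 2    i3: r4 = 5
deps = {1: [0], 2: [1]}
print(dataflow_limit(deps, 4))  # 3 -- chain i0 -> i1 -> i2 dominates
```

However wide the machine, this fragment cannot finish in fewer than three unit-latency steps unless some dependence is speculated past, which is precisely the opening that superspeculation exploits.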



[Figure 1 is a bar chart: sustained IPC (0 to 20) for the SPEC95 integer benchmarks go, m88ksim, gcc, compress, li, ijpeg, perl, and vortex, plus their harmonic mean, showing conventional superscalar baseline performance and the additional performance provided by superspeculation techniques.]
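The figure's summary bars are harmonic means. Since IPC is a rate, the harmonic mean is the appropriate average across benchmarks (assuming equal instruction counts); a quick sketch with made-up IPC values:

```python
# IPC is a rate, so per-benchmark IPC figures are averaged with the harmonic
# mean (equal instruction counts assumed). The values below are made up.
def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

ipcs = [4.0, 8.0, 8.0]                 # hypothetical per-benchmark IPC
print(round(harmonic_mean(ipcs), 2))   # 6.0
```

Note that the harmonic mean (6.0 here) sits below the arithmetic mean (6.67), so one slow benchmark drags the summary figure down, as it would drag down total run time.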

Figure 1. Superspeculation's performance potential. As we scale issue width from four to eight to 16 to 32, the conventional superscalar reaps diminishing performance returns, topping out at slightly over 4 IPC (the harmonic mean). In contrast, a superspeculative processor's IPC continues to increase with issue width, topping out at 19.0 IPC (for the vortex benchmark); the harmonic mean is 9.0 IPC.

Superspeculative processors
We believe it is due to the seeming inability to get beyond the dataflow limit to harvest more instruction-level parallelism that many researchers are advocating one of the approaches just discussed, but all these approaches require a dramatic, expensive paradigm shift from sequential programming to an explicitly parallel model. Superspeculative processors, on the other hand, overcome the dataflow limit without sacrificing code compatibility. They do so by aggressively speculating past true data dependences and harvesting additional parallelism in places where none was believed to exist. The basis for the superspeculative approach is that producer instructions generate many highly predictable values in real programs. Consumer instructions can thus frequently and successfully speculate on their source operand values and begin execution without results from the producer instructions. Consequently, a superspeculative processor can remove the serialization constraints between producer and consumer instructions, enabling program performance to potentially exceed the classical dataflow limit.

Figure 1 summarizes the performance obtainable by scaling a conventional superscalar design—one that employs all the latest techniques in branch prediction and out-of-order execution—from today's issue width of four instructions per cycle up to eight, 16, and 32 instructions per cycle. The sustained IPC attainable for the SPEC95 integer benchmarks shown levels off quickly around width eight or sixteen.

In contrast, the stacked bars show the additional performance attainable by a processor that employs superspeculative techniques. These techniques frequently more than double the attainable performance and provide ample justification for continuing to devote processor implementation resources to microarchitectures that are compatible with current ISAs and do not require a massive software investment.

Superspeculative microarchitectures have the potential of sustaining close to 10 IPC for nonnumerical programs without requiring advanced compilation support. Superspeculation aggressively continues the statistical approach that emerged during the 1980s: designers optimizing machine performance for statistically common cases instead of the less likely worst case. We generalize the current control speculation (in the form of branch prediction) to include various forms of data speculation.2 With this generalization, the processor can circumvent both the control flow and dataflow constraints of a program via aggressive speculation. Such speculation often pays off because of the predictable behavior of real programs. The theoretical basis for superspeculative microarchitectures rests on the weak dependence model,3 which defers the detection of and relaxes the enforcement of control flow and dataflow dependences between machine instructions.

Strong-dependence model. The implied total instruction ordering of a sequential program is an overspecification and need not be rigorously enforced to meet the requirements of semantically correct execution. The actual program semantics and inter-instruction dependences are specified by the control flow graph, which specifies the possible paths for traversing the basic blocks of instructions in the program, and the dataflow graph, which specifies the true data dependences between producer and consumer instructions. As long as the processor does not violate the serialization constraints imposed by these graphs, it can overlap and reorder the execution of instructions. This achieves better performance by avoiding the enforcement of implied but unnecessary precedences. However, the processor must still enforce true inter-instruction dependences. To date, most machines enforce such dependences in a rigorous fashion that adheres to two requirements:

• Dependences are determined in an absolute and exact way; that is, two instructions are identified as either dependent or independent, and when in doubt, dependences are pessimistically assumed to exist.



• Dependences are enforced throughout instruction execution; that is, the dependences are never allowed to be violated and are enforced during instruction processing.

Such a traditional and conservative approach is called the strong-dependence model for program execution. This traditional model is overly rigorous and unnecessarily restricts available parallelism.

Weak-dependence model. Instead, we propose the weak-dependence model for superspeculative processors. This model specifies that

• Dependences need not be determined exactly or assumed pessimistically, but can instead be optimistically approximated or even temporarily ignored.
• Dependences can be temporarily violated during instruction execution as long as recovery can be performed prior to affecting the permanent machine state.

The weak-dependence model's advantage is that the machine can process instructions without having the program semantics completely determined, as specified by control flow and dataflow graphs. Furthermore, the machine can now speculate aggressively and temporarily violate the dependences as long as corrective measures are in place to recover from misspeculation. If a significant percentage of speculations are correct, the machine can effectively exceed the performance limit imposed by the traditional, strong-dependence model.

Similar in concept to branch prediction's implementation in current processors, superspeculation uses two interacting engines. The front-end engine assumes the weak-dependence model and is highly speculative, predicting instructions to aggressively speculate past them. When predictions are correct, these speculative instructions will effectively have skipped over certain stages of instruction execution. The back-end engine still uses the strong-dependence model to validate the speculations, recover from misspeculation, and provide history and guidance information to the speculative engine. By taking this approach, processors can harvest an unprecedented level of instruction-level parallelism. The edges of the dataflow graph that represent inter-instruction dependences are now enforced and become part of the critical dataflow path only when misspeculations occur.

A superspeculative machine. As Figure 2 shows, instruction execution can be divided into four logical stages, each taking one or more machine cycles.

• Fetch. The processor retrieves instructions from the instruction cache or main memory.
• Decode. The processor decodes instructions, renames their operands, and detects inter-instruction dependences.
• Execute. Instructions wait until their operands are available, then allocate functional units, execute according to their prescribed semantics, and forward their results to subsequent dependent instructions.
• Commit. Instructions are allowed to write back their results into the architected registers in program order.

Figure 2 also identifies the three key parameters that a superspeculative microarchitecture must maximize:

• instruction flow, the rate at which useful instructions are fetched, decoded, and dispatched to the execution core;
• register dataflow, the rate at which results are produced and register values become available; and
• memory dataflow, the rate at which data values are stored and retrieved from data memory.

These three flows roughly correspond to the processing of branch, ALU, and load/store instructions. In a superspeculative microarchitecture, aggressive speculative techniques are employed to accelerate the processing of all these instruction types.

SUPERFLOW TECHNIQUES
Our proposed approach to superspeculative microarchitecture is called Superflow. We've implemented a prototype to collect the data reported here, which indicates its potential, but the detailed design effort is ongoing. Superflow employs a wide range of speculative techniques to improve the throughput of instruction flow, register dataflow, and memory dataflow beyond traditional limits. Hence the name Superflow.

Instruction flow techniques
With these techniques, the processor attempts to supply as many useful instructions as it can to the execution core. Instruction flow is a threefold problem: conditional-branch throughput, taken-branch throughput, and misprediction latency.

Conditional branches. To provide adequate conditional-branch throughput, the processor must accurately predict the outcomes and targets of multiple conditional branches in every cycle. We combined earlier work on predicting multiple branches in every cycle4,5 and predicting branches accurately6 in a two-phase branch predictor. This predictor employs a local (per static branch) history and a global branch history to predict multiple branches in each cycle. During the first stage (fetch), global knowledge is used to predict multiple branches and generate a fetch address for



[Figure 2 is a block diagram: the instruction-flow front end (a trace cache, instruction cache, and branch predictor feeding the fetch stage, then an instruction buffer and decode stage) dispatches into integer, floating-point, media, and memory reservation stations (RS) for the execute stage; a reorder buffer (ROB) governs register dataflow and the commit stage, while a store queue and data cache handle memory dataflow.]
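The two-phase branch predictor introduced above checks its fetch-stage predictions with a gshare-style predictor during decode. The sketch below illustrates only the global-history (gshare) half of such a scheme; the table size and training loop are invented for the example, not taken from the Superflow design.

```python
# Minimal gshare-style predictor sketch. This shows only the global-history
# half of a two-phase scheme; sizes and the training loop are invented.
class Gshare:
    def __init__(self, bits=4):
        self.mask = (1 << bits) - 1
        self.ghr = 0                     # global branch history register
        self.pht = [1] * (1 << bits)     # 2-bit saturating counters

    def index(self, pc):
        return (pc ^ self.ghr) & self.mask    # XOR fetch address with history

    def predict(self, pc):
        return self.pht[self.index(pc)] >= 2  # counter >= 2 means "taken"

    def update(self, pc, taken):
        i = self.index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask

bp = Gshare()
for _ in range(8):           # steady stream of taken outcomes for one branch
    bp.update(0x40, True)
print(bp.predict(0x40))      # True, once history and counters warm up
```

The second-phase predictor described in the text additionally folds in local (per static branch) history; that combination is omitted here for brevity.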

Figure 2. Superspeculative machine overview. A Superflow microarchitecture employs a broad spectrum of speculative techniques to maximize overall instruction throughput beyond traditional limits by maximizing the throughputs of instruction flow, register dataflow, and memory dataflow.

the next cycle.5 During the second stage (decode), the earlier predictions are checked via an advanced gshare predictor,6 which combines local and global knowledge to generate an accurate prediction. Hence, many mispredictions made in the first phase are corrected with a latency of only one cycle.

Taken branches. The fetch unit must be able to correctly process more than one taken branch per cycle, which involves predicting each branch's direction and target, and also fetching, merging, and aligning instructions from each branch target.

To reduce complexity, the Superflow fetch engine uses an interesting, recently proposed approach called a trace cache.5 A trace cache is a history-based fetch mechanism that stores dynamic-instruction traces in a cache indexed by fetch address and branch outcome. Whenever it finds a suitable trace, it dispatches instructions from the trace cache rather than sequential instructions from the instruction cache. Since a dynamic sequence of instructions in the trace cache can contain multiple taken branches but is stored sequentially, there is no need to fetch from multiple targets. This eliminates the need for a multiported instruction cache or complex merge/align logic in the critical path.

Misprediction latency. Finally, the processor must minimize the latency of correctly resolving mispredicted branches. To do so, we employ a set of aggressive data-speculative techniques in the execution core; the register and memory dataflow sections describe them. These techniques enable branches to execute earlier than they would otherwise and produce significant performance gains by substantially reducing misprediction penalties.

Register dataflow techniques
These techniques facilitate fast and efficient processing of ALU instructions. The execution core strives for two fundamental goals to increase instruction throughput. It must

• efficiently detect and resolve inter-instruction dependences and
• eliminate or bypass as many dependences as pos-



Table 1. Estimated transistor cost of Superflow.

Resource                     Description                                                         Transistor count (millions)
CPU core logic               (32 instructions issued per cycle / 4 instructions issued
                             per cycle)² × 2 million transistors in a PowerPC 604,
                             which issues four instructions per cycle                             128.0
Value prediction table       32 Kbytes × 8 bits/byte × 6 transistors/SRAM bit                       1.6
Classification table         8K entries × 2 bits/entry × 6 transistors/SRAM bit                     0.1
Dependence prediction table  8K entries × 7 bits/entry × 64 ports × 6 transistors/SRAM bit         22.0
Alias prediction table       8K entries × 7 bits/entry × 32 ports × 6 transistors/SRAM bit         11.0
Pattern history tables       64K entries × 2 bits/entry × 2 tables × 6 transistors/SRAM bit         1.6
Trace cache                  64 Kbytes (estimated size) × 8 bits/byte × 6 transistors/SRAM bit      3.1
Level 1 instruction cache    64 Kbytes × 8 bits/byte × 6 transistors/SRAM bit                       3.1
Level 1 data cache           64 Kbytes × 4 ports × 8 bits/byte × 6 transistors/SRAM bit            12.6
Processor core total                                                                              183.1
Level 2 unified cache        16 Mbytes × 8 bits/byte × 6 transistors/SRAM bit (approximately)     805.3
Grand total                                                                                       988.4
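Table 1's entries follow from straightforward arithmetic at 6 transistors per SRAM bit; a few of them, reproduced as a sketch:

```python
# Reproducing a few of Table 1's entries (6 transistors per SRAM bit).
def sram_millions(bits):
    return 6 * bits / 1e6

core     = (32 / 4) ** 2 * 2.0                  # scale the 604's 2M core logic
value_pt = sram_millions(32 * 1024 * 8)         # 32-Kbyte value prediction table
dep_pt   = sram_millions(8 * 1024 * 7 * 64)     # 8K entries x 7 bits x 64 ports
l2       = sram_millions(16 * 1024 * 1024 * 8)  # 16-Mbyte level-2 cache

print(round(core, 1), round(value_pt, 1), round(dep_pt, 1), round(l2, 1))
# 128.0 1.6 22.0 805.3
```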

sible to expose more parallelism between instructions.

Detecting control and data dependences among multiple active instructions is an inherently sequential task that becomes combinatorially expensive as the number of concurrently active instructions increases. Furthermore, multiple-instruction dispatch is difficult to implement and has an adverse impact on cycle time because all instructions in a dispatch group must be simultaneously cross-checked. To avoid impacting cycle time, dependence checking and dispatch can be pipelined into multiple stages. Unfortunately, a deeper pipeline, especially in the decode portion, results in a greater performance penalty. However, this penalty can be overcome with dependence prediction, a speculative technique that can frequently short-circuit pipelined multicycle decode. It does so by predicting the dependence relationships between instructions and speculatively allowing instructions that are predicted to be data ready to execute in parallel with exact dependence checking.7 We discovered that such dependence relationships between instructions are highly predictable in real programs. The Superflow execution core adopts this technique to help overcome the latency cost of pipelined decode/dispatch and facilitate the implementation of wide-dispatch processors.

Superflow employs conventional, well-understood techniques, such as register and memory renaming, to eliminate false dependences between instructions. However, it advances well beyond the limits imposed by true dependences by employing source operand value prediction7 to eliminate true data dependences between instructions. This technique uses dynamic-value history information, stored per static program instruction, to predict future values of that instruction's source operands. Superflow improves the accuracy of source operand value prediction beyond previously reported levels by extending it to include value stride prediction. In value stride prediction, a dynamic hardware mechanism detects constant, incremental increases in operand values (strides) and uses them to predict future values.

Memory dataflow techniques
These techniques minimize average memory latency and provide adequate memory bandwidth for supporting a high-performance superspeculative processor core. To reduce average memory latency, we incorporate the prediction of load values,8,9 addresses, and aliases into the execution core. For adequate load instruction throughput, we introduce load stream partitioning, a divide-and-conquer strategy for reducing the cost and complexity of a high-bandwidth memory system.

Memory latency is a severe bottleneck to processor performance, and three factors cause it:

• address generation interlocks, which delay the initiation of a fetch from memory because the address is unknown;
• the latency of accessing the storage device; and
• queuing delays caused by contention for shared resources in the memory subsystem.

Speculative prediction of load addresses can eliminate many address generation interlocks. Previous research on address prediction is largely subsumed by source operand value prediction, which incorporates stride prediction for detecting and predicting constant-stride memory addresses. Address generation interlocks also cause a secondary problem. When load addresses are unknown, it is impossible to detect the existence of an alias with an earlier outstanding store. To alleviate this problem, we propose alias prediction, a logical extension to the register dependence prediction technique described earlier.7

Rather than predict the dependence distance to a preceding register write, we predict the dependence distance to a preceding store. (Andreas Moshovos and his colleagues described a similar technique.10) The predicted distance is then used to obtain the load value from that offset in the processor's store queue, which holds outstanding stores. For this speculative forwarding to occur, neither the load nor the store need to have their addresses available yet, allowing the bypassing of address generation entirely.
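The stride prediction idea described above can be sketched as a small software model: remember each static instruction's last value and last delta, and predict last value plus stride. The table layout and program counters below are invented for illustration; the real mechanism is a hardware table, not software.

```python
# Software model of per-instruction stride value prediction: remember the
# last value and delta for each static instruction, predict last + stride.
# Table layout and PCs are invented for illustration only.
class StridePredictor:
    def __init__(self):
        self.table = {}   # static pc -> (last_value, stride)

    def predict(self, pc):
        if pc not in self.table:
            return None                  # no history yet
        last, stride = self.table[pc]
        return last + stride

    def update(self, pc, value):         # called when the real value resolves
        last, _ = self.table.get(pc, (value, 0))
        self.table[pc] = (value, value - last)

sp = StridePredictor()
for v in (100, 104, 108):    # e.g. an induction variable stepping by 4
    sp.update(0x1000, v)
print(sp.predict(0x1000))    # 112
```

A plain last-value predictor is the degenerate case with stride fixed at zero; the stride extension is what captures induction variables and constant-stride address streams.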



[Figure 3 is a bar chart: sustained IPC (0 to 18) for go, m88ksim, gcc, compress, li, ijpeg, perl, and vortex, plus their harmonic mean; each bar stacks superscalar baseline performance with the additional performance provided by instruction flow, memory dataflow, and register dataflow optimizations.]
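Load value prediction, introduced among the memory dataflow techniques above, works with a table kept coherent with memory: a store to a tracked address invalidates the entry, so any entry that survives is guaranteed correct and needs no verifying memory access. A toy model of that coherence idea follows; the structures, addresses, and values are invented, and real hardware would use tagged, finite tables.

```python
# Toy model of a load value prediction table kept coherent with memory:
# a store to a tracked address invalidates the entry, so a surviving entry
# needs no verifying memory access. All structures and values are invented.
class LVPTable:
    def __init__(self):
        self.entries = {}   # load pc -> (address, value, still_coherent)

    def record(self, pc, addr, value):
        self.entries[pc] = (addr, value, True)

    def on_store(self, addr):           # snoop every intervening store
        for pc, (a, v, ok) in list(self.entries.items()):
            if a == addr:
                self.entries[pc] = (a, v, False)

    def lookup(self, pc):
        e = self.entries.get(pc)
        return e[1] if e and e[2] else None

lvp = LVPTable()
lvp.record(pc=0x2000, addr=0x8000, value=7)
print(lvp.lookup(0x2000))   # 7 -- a "zero-cycle load", no cache port used
lvp.on_store(0x8000)
print(lvp.lookup(0x2000))   # None -- the load must access memory after all
```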

The storage device's latency can be folded away (effectively eliminated) by performing load value prediction, which uses a per-static-load value history to predict future values. This technique can frequently predict the value to be loaded when the load instruction is dispatched, in effect implementing a zero-cycle load.8

At first glance, providing adequate memory bandwidth to support Superflow's performance goals appears daunting. Since roughly 40 percent of the 10-instruction-per-cycle throughput consists of loads and stores, the memory subsystem must provide an average bandwidth of four references per cycle. Furthermore, the peak bandwidth required to prevent excessive queuing delays is even higher. However, several factors and techniques relieve this difficult problem.

First, recent discoveries indicate that the reuse distances of many stored values (measured in execution cycles) are very short.11 In fact, our simulations show that many stores do not even make it out of the store queue before their values are needed again. Hence, the loads retrieving these values do not require a cache port.

Second, a technique called constant promotion8 can be used to eliminate the actual memory accesses normally performed by load instructions. Constant promotion provides a hardware mechanism that guarantees that the value generated via load value prediction is correct and need not be checked against an actual memory reference. It does so by keeping the load value prediction table coherent with main memory: the hardware monitors all intervening stores between an update and a lookup of the load value prediction table, invalidating entries modified by such stores.

These two observations lead us to introduce the notion of load stream partitioning, which simply partitions loads into multiple streams based on their behavior and sends them to disjoint, specialized functional units for processing. This approach facilitates the implementation of high-bandwidth memory systems by eliminating the need for large, centrally located, and extremely multiported data caches. Experimental evidence suggests that a significant portion of the load stream can be diverted to these specialized units for processing.3,8

PERFORMANCE POTENTIAL
To evaluate superspeculation's performance potential, we simulated the performance of a Superflow processor with

• a fetch width of 32;
• a 128-entry reorder buffer;
• 64-Kbyte, 4-way, set-associative data and instruction caches with a 10-cycle miss delay to a perfect, pipelined second-level cache; and
• a 128-entry, fully associative store queue.

Figure 3. Superflow performance: sustained IPC for a fetch width of 32, a reorder buffer size of 128, and a 128-entry store queue for various memory configurations. The leftmost bar shows IPC for a perfect cache with unlimited ports. The second bar shows IPC for a 64-Kbyte data cache with unlimited ports. The third, fourth, and fifth bars show the previous configuration, but with eight, four, and two cache ports. Each stacked bar shows cumulative IPC attainable beyond the conventional superscalar with instruction flow, memory dataflow, and register dataflow superspeculative techniques.

Table 1 shows a crude estimate of the transistors needed for a naive implementation of such a processor. We estimate that CPU core logic will increase as the square of the issue width, consuming 128 million transistors relative to the two million core logic transistors in the PowerPC 604, which issues four instructions per cycle. To reduce design complexity, we expect that the functional units in the execution core will be clustered in some manner similar to earlier proposals.12 Relative to the CPU core, the superspeculative prediction structures, including those for branch prediction, consume roughly 36 million transistors, while level-1 caches consume about 19 million transistors. A superspeculative processor core (including the level-1 caches) requires less than 200 million transistors. Assuming a total budget of one billion transistors, the chip still has room for a fast, 16-Mbyte level-2 cache, which should be more than adequate to capture the working sets of most anticipated applications.
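The interaction between load value prediction and constant promotion described above can be sketched in software. The following Python model is purely illustrative, not the Superflow hardware: the PC-indexed table, the repeat-once confirmation rule, and all names are assumptions made for exposition.

```python
# Illustrative model of load value prediction with constant promotion.
# The table organization, PC indexing, and repeat-once confirmation rule
# are assumptions for exposition, not details of the Superflow design.

class LoadValuePredictor:
    def __init__(self):
        # Maps a static load's PC to (predicted address, last value, confirmed).
        self.table = {}

    def predict(self, load_pc):
        """At dispatch: return (value, promoted) or None if no entry exists.
        A promoted prediction is kept coherent with memory, so the load
        needs no verifying cache access -- in effect, a zero-cycle load."""
        entry = self.table.get(load_pc)
        if entry is None:
            return None
        _, value, confirmed = entry
        return value, confirmed

    def update(self, load_pc, addr, actual_value):
        """At completion: record the loaded value; confirm (promote) the
        entry only once the same address yields the same value again."""
        prev = self.table.get(load_pc)
        confirmed = prev is not None and prev[0] == addr and prev[1] == actual_value
        self.table[load_pc] = (addr, actual_value, confirmed)

    def observe_store(self, store_addr):
        """Constant promotion's coherence rule: an intervening store
        invalidates any prediction entry for the stored-to address."""
        for pc in [pc for pc, e in self.table.items() if e[0] == store_addr]:
            del self.table[pc]
```

A load whose entry is confirmed can complete with the predicted value without consuming a cache port; the store-monitoring hook is what makes skipping the verification safe.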

September 1997 65


If necessary, designers can reduce the size of the level-2 cache to make room for various I/O and network interfaces. These estimates are approximate.

Figure 3 on page 65 shows performance results for an unlimited number of cache ports, and for eight, four, and two cache ports. Results for a model with a perfect cache (one that incurs no cache misses) are also included for reference. Each stacked bar reflects the additional IPC harvested by adding superspeculative instruction flow techniques, memory dataflow techniques, and register dataflow techniques to the machine model. We have published additional detailed performance results.3

These results conclusively demonstrate not only that superspeculative techniques provide impressive performance, but also that without them, very wide superscalars (such as the baseline case shown) do not scale to significantly improved levels of performance.

Our results demonstrate Superflow's dramatic performance potential, placing superspeculative microarchitecture in the forefront of competing approaches for billion-transistor computing. We are currently exploring these techniques in greater depth with more extensive and accurate simulations. The next phase of our effort will be to move these ideas toward implementation and produce a Superflow prototype after extensive implementation trade-off studies. We plan to develop simulation models of the prototype that can provide performance accuracy down to the machine cycle level. Trial circuit designs for the timing-critical pieces of the prototype will help us estimate chip complexity and machine cycle time. ❖

Acknowledgments
ONR grants N00014-96-1-0928 and N00014-96-1-0347 and Intel Corp. supported this research. This research also benefited from discussions with other MIG members: Bryan Black, Yuan Chou, Andrew Huang, Chris Newburn, and Derek Noonburg. Chris Wilkerson coined the name Superflow.

References
1. K. Olukotun et al., "The Case for a Single-Chip Multiprocessor," Proc. Seventh Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, ACM Press, New York, 1996, pp. 2-11.
2. M.H. Lipasti and J.P. Shen, "Exceeding the Data-Flow Limit Via Value Prediction," Proc. 29th Ann. ACM/IEEE Int'l Symp. on Microarchitecture, IEEE CS Press, Los Alamitos, Calif., 1996, pp. 226-237.
3. M.H. Lipasti, Value Locality and , doctoral dissertation, Carnegie Mellon Univ., Dept. of Electrical and Computer Eng., May 1997.
4. T.M. Conte et al., "Optimization of Instruction Fetch Mechanisms for High Issue Rates," Proc. 22nd Int'l Symp. on Computer Architecture, IEEE CS Press, Los Alamitos, Calif., 1995, pp. 333-344.
5. E. Rotenberg, S. Bennett, and J. Smith, "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching," Proc. 29th Ann. ACM/IEEE Int'l Symp. on Microarchitecture, IEEE CS Press, Los Alamitos, Calif., 1996, pp. 24-34.
6. S. McFarling, Combining Branch Predictors, Tech. Report TN-36, Digital Equipment Corp., Maynard, Mass., 1993, http://www.research.digital.com/wrl/home.html.
7. M.H. Lipasti and J.P. Shen, "The Performance Potential of Value and Dependence Prediction," Proc. EURO-PAR '97, Springer-Verlag, Passau, Germany, 1997.
8. M.H. Lipasti, C.B. Wilkerson, and J.P. Shen, "Value Locality and Load Value Prediction," Proc. Seventh Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, ACM Press, New York, 1996, pp. 138-147.
9. L. Widigen, E. Sowadsky, and K. McGrath, "Eliminating Operand Read Latency," Computer Architecture News, Dec. 1996, pp. 18-22.
10. A. Moshovos et al., "Dynamic Speculation and Synchronization of Data Dependences," Proc. 24th Int'l Symp. on Computer Architecture, IEEE CS Press, Los Alamitos, Calif., 1997, pp. 181-193.
11. A.S. Huang and J.P. Shen, "The Intrinsic Bandwidth Requirements of Ordinary Programs," Proc. Seventh Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, ACM Press, New York, 1996, pp. 105-114.
12. S. Vajapeyam and T. Mitra, "Improving Superscalar Instruction Dispatch and Issue by Exploiting Dynamic Code Sequences," Proc. 24th Int'l Symp. on Computer Architecture, ACM Press, New York, 1997, pp. 1-12.

Mikko H. Lipasti is an advisory engineer with IBM. His research interests include superspeculative computer architecture. Lipasti has a BS from Valparaiso University and an MS and PhD in electrical and computer engineering from Carnegie Mellon. He is a member of the IEEE, ACM, and Tau Beta Pi.

John Paul Shen is a professor in CMU's Electrical and Computer Engineering Department and heads the Microarchitecture Innovation Group. Shen received a BS from the University of Michigan and an MS and PhD from the University of Southern California, all in electrical engineering. He is an IEEE fellow.

Contact Shen at the Microarchitecture Innovation Group, Dept. of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213; [email protected].

