
Superspeculative Microarchitecture for Beyond AD 2000

Theme Feature

Employing a broad spectrum of superspeculative techniques can achieve significant performance increases over today's top-of-the-line microprocessors. The experimental, superspeculative microarchitecture Superflow has a potential performance of 9.0 and realizable performance of 7.3 IPC for the SPEC95 integer suite, without requiring recompilation or changes to the instruction set architecture.

Mikko H. Lipasti
John Paul Shen
Carnegie Mellon University

In its brief lifetime of 26 years, the microprocessor has achieved a total performance growth of 10,000 times thanks to technology improvements and microarchitecture innovations. Transistor count and clock frequency have increased by an order of magnitude in each of the first two decades of microprocessors; transistor count increased from 10,000 to 100,000 in the 1970s and up to 1 million in the 1980s, while clock frequency increased from 200 KHz to 2 MHz in the 1970s and up to 20 MHz in the 1980s. This incredible technology trend has continued: Since 1990, both transistor count and clock frequency have already achieved an increase of 20 to 30 times. During the 1980s, sustained instructions per cycle also increased by almost an order of magnitude, from roughly 0.1 to 0.9. IPC is a measure of the instruction-level parallelism or instruction throughput achieved by the concurrent processing of multiple machine instructions. In the 1990s, IPC improvement is struggling and may not triple by 1999. New microarchitecture innovations are needed.

Current top-of-the-line microprocessors are four-instruction-wide superscalar machines; that is, they can fetch and complete up to four instructions in a single machine cycle. Such machines use pipelined functional units, aggressive branch prediction, dynamic register renaming, and out-of-order execution of instructions to maximize parallelism and tolerate memory latency. State-of-the-art processors include the Digital Equipment Alpha 21264, Silicon Graphics MIPS R10000, IBM/Motorola PowerPC 604, and Intel Pentium Pro. Even with such elaborate microarchitectures, against a potential 4 IPC, these machines typically achieve only about 0.5 to 1.5 sustained IPC for real-world programs.

Worse yet, most studies indicate that machine efficiency drops even lower as we extrapolate to wider machines. One recent study indicated that although a hypothetical 2-instruction-wide machine achieves IPC in the range of 0.65 to 1.40, a similar, hypothetical, 6-instruction-wide machine will achieve only 1.2 to 2.3 IPC.1 Such data imply that the current superscalar paradigm is running into rapidly diminishing returns on performance.

POTENTIAL NEW PARADIGMS
Future billion-transistor chips will inevitably implement machines that are much wider (issue more than four instructions at once) and deeper (have longer pipelines). The question is, how do we harvest additional parallelism proportional to increased machine resources? Several approaches have vocal advocates, each with valid reasons; they are

• reconfigurable engines;
• specialized, very long instruction word (VLIW) machines;
• wide, simultaneous multithreaded (SMT) uniprocessors;
• single-chip multiprocessors (CMP);
• memory-centric computing engines (such as IRAM);
• very wide conventional superscalars; and
• wide superspeculative processors.

Two overriding concerns will determine which possibility might have the most commercial impact: performance scalability and software migration complexity. The approach that succeeds will have to deliver enough performance improvement to justify the hardware resources expended. It must also effectively deal with the tremendous effort and expense required to create or migrate a critical mass of useful applications. Computing history's two most economically successful instruction set architectures—the IBM S/360 and the Intel x86—have reaped rewards for paying meticulous attention to software cost and code compatibility.

Reconfigurable computers
Researchers have proposed reconfigurable computers that employ large arrays of highly programmable build-

0018-9162/97/$10.00 © 1997 IEEE September 1997 59


ing blocks. Typical examples are complex programmable logic devices and field-programmable gate arrays. These devices rely on powerful software tools to map applications to the reconfigurable hardware. The intent is to construct powerful, specialized computing engines at a relatively low cost with very short turnaround time. This approach's performance scalability has yet to be demonstrated beyond a few specialized applications. Moreover, the huge investment required to leverage the latest and best fabrication technology may not be economically justified by the niche-market volume for reconfigurable systems. Finally, software tools for automating application-to-hardware mapping must incorporate the latest code compilation and optimization techniques as well as yet-to-be-developed hardware synthesis technologies. This is a very tall order for the software.

VLIW machines
Specialized VLIW machines already exist for multimedia applications. These statically controlled, wide machines contain numerous functional units with highly deterministic behavior, which permits tremendous computing throughput on specialized applications. They provide performance scalability by packing more operations into a single instruction. VLIW machines rely on powerful compilers to detect and resolve inter-instruction dependences in software. This keeps the hardware design clean and fast. Such machines usually require application recompilation, so retargetable compilers are essential. Such compilers have been long in coming and are still not widely available. Furthermore, the static nature of VLIWs makes them inherently incompatible with dynamic variations in parallelism or latency, both of which are caused by aggressive memory subsystems and speculative-execution techniques.

SMT uniprocessors
Simultaneous multithreaded processors are superscalar uniprocessors that support multiple machine contexts and execute multiple instruction streams simultaneously. They do so to increase a multiprogrammed workload's throughput or reduce a multithreaded program's latency. Performance scalability depends on finding enough parallelism, a task left to software. Developing multithreaded applications is challenging due to the extreme difficulty of debugging multithreaded programs and the lack of automatic thread-partitioning compilers. Therefore, we view SMT primarily as a technique for improving throughput in multiprogrammed workloads.

Single-chip multiprocessors
Our view of the single-chip multiprocessor approach is similar. CMPs will inevitably be used for improving throughput under multiprogrammed workloads. However, their utility in improving single-program performance has thus far been restricted to numerical applications that contain easily parallelized loops. Limited interconnect and synchronization overhead will throttle performance scalability, particularly for more generalized applications with interthread dependences. Significant development is required before parallelizing compilers will provide performance gains for generalized applications on CMPs. Unless such software technology becomes widely available, CMP, like SMT, is destined to remain a technique for improving throughput but not latency. Furthermore, without a significant multithreaded-application base there may not be a mainstream demand for single-chip SMT processor and CMP implementations. Without mainstream market demand, these implementations are not economically justifiable.

Memory-centric engines
Proposing a memory-centric view (as opposed to the traditional CPU-centric view) to computer system design has become quite popular. The potential integration of dense DRAM technology with fast logic technology on the same chip (intelligent RAM) is certainly of technological interest. However, it is unclear that this integration inspires any truly new architecture paradigms. We see it as more of a technology/implementation issue that enables shorter latency and more density and bandwidth at the upper levels of the memory hierarchy. There is also the issue of software. Compilation tools for array-structured, message-passing multicomputers have been in development for over a decade and are still not widely available.

Wide conventional superscalars
On the other hand, scaling up current superscalars to issue more instructions at a time, though attractive from a software cost perspective, doesn't seem promising either. Performance scalability is limited because current microarchitectural techniques for extracting instruction-level parallelism are returning significantly less performance improvement on wider machines. Incremental improvements will result in only marginal performance gains. Limit studies have shown that even when all control and structural hazards are removed, single-thread performance is still severely limited by true data dependences between instructions that produce results and instructions that use these results as their source operands. Due to such data dependences, the producer and consumer instructions must necessarily be serialized. Enforcing these serializations leads to the classical dataflow limit for program performance: Given unlimited machine resources, a program cannot execute any faster than the execution of the longest dependence chain induced by the program's true data dependences.
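The dataflow limit can be made concrete with a small, illustrative sketch: given only the true data dependences (and, for simplicity, unit-latency instructions), the minimum execution time with unlimited resources is the longest dependence chain. The instruction fragment below is hypothetical.

```python
# Illustrative only: the dataflow limit is the longest chain of true data
# dependences, assuming unit-latency instructions and unlimited resources.
def dataflow_limit(deps, n):
    """deps maps a consumer's index to the producers it reads; n = count."""
    depth = [0] * n
    for i in range(n):  # instructions visited in program order
        depth[i] = 1 + max((depth[p] for p in deps.get(i, [])), default=0)
    return max(depth)

# Hypothetical 4-instruction fragment:
#   i0: r1 = load A    i1: r2 = r1 + 1    i2: r3 = r2 * 2    i3: r4 = 5
deps = {1: [0], 2: [1]}
print(dataflow_limit(deps, 4))  # 3 -- chain i0 -> i1 -> i2 dominates
```

However wide the machine, this fragment cannot finish in fewer than three unit-latency steps unless some dependence is speculated past, which is precisely the opening that superspeculation exploits.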



[Figure 1 is a bar chart: sustained IPC (0 to 20) for the SPEC95 integer benchmarks go, m88ksim, gcc, compress, li, ijpeg, perl, and vortex, plus their harmonic mean, showing conventional superscalar baseline performance and the additional performance provided by superspeculation techniques.]
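The figure's summary bars are harmonic means. Since IPC is a rate, the harmonic mean is the appropriate average across benchmarks (assuming equal instruction counts); a quick sketch with made-up IPC values:

```python
# IPC is a rate, so per-benchmark IPC figures are averaged with the harmonic
# mean (equal instruction counts assumed). The values below are made up.
def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

ipcs = [4.0, 8.0, 8.0]                 # hypothetical per-benchmark IPC
print(round(harmonic_mean(ipcs), 2))   # 6.0
```

Note that the harmonic mean (6.0 here) sits below the arithmetic mean (6.67), so one slow benchmark drags the summary figure down, as it would drag down total run time.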

Figure 1. Superspeculation's performance potential. As we scale issue width from four to eight to 16 to 32, the conventional superscalar reaps diminishing performance returns, topping out at slightly over 4 IPC (the harmonic mean). In contrast, a superspeculative processor's IPC continues to increase with issue width, topping out at 19.0 IPC (for the vortex benchmark); the harmonic mean is 9.0 IPC.

Superspeculative processors
We believe it is due to the seeming inability to get beyond the dataflow limit to harvest more instruction-level parallelism that many researchers are advocating one of the approaches just discussed, but all these approaches require a dramatic, expensive paradigm shift from sequential programming to an explicitly parallel model. Superspeculative processors, on the other hand, overcome the dataflow limit without sacrificing code compatibility. They do so by aggressively speculating past true data dependences and harvesting additional parallelism in places where none was believed to exist. The basis for the superspeculative approach is that producer instructions generate many highly predictable values in real programs. Consumer instructions can thus frequently and successfully speculate on their source operand values and begin execution without results from the producer instructions. Consequently, a superspeculative processor can remove the serialization constraints between producer and consumer instructions, enabling program performance to potentially exceed the classical dataflow limit.

Figure 1 summarizes the performance obtainable by scaling a conventional superscalar design—one that employs all the latest techniques in branch prediction and out-of-order execution—from today's issue width of four instructions per cycle up to eight, 16, and 32 instructions per cycle. The sustained IPC attainable for the SPEC95 integer benchmarks shown levels off quickly around width eight or sixteen.

In contrast, the stacked bars show the additional performance attainable by a processor that employs superspeculative techniques. These techniques frequently more than double the attainable performance and provide ample justification for continuing to devote processor implementation resources to microarchitectures that are compatible with current ISAs and do not require a massive software investment.

Superspeculative microarchitectures have the potential of sustaining close to 10 IPC for nonnumerical programs without requiring advanced compilation support. Superspeculation aggressively continues the statistical approach that emerged during the 1980s: designers optimizing machine performance for statistically common cases instead of the less likely worst case. We generalize the current control speculation (in the form of branch prediction) to include various forms of data speculation.2 With this generalization, the processor can circumvent both the control flow and dataflow constraints of a program via aggressive speculation. Such speculation often pays off because of the predictable behavior of real programs. The theoretical basis for superspeculative microarchitectures rests on the weak dependence model,3 which defers the detection of and relaxes the enforcement of control flow and dataflow dependences between machine instructions.

Strong-dependence model. The implied total instruction ordering of a sequential program is an overspecification and need not be rigorously enforced to meet the requirements of semantically correct execution. The actual program semantics and inter-instruction dependences are specified by the control flow graph, which specifies the possible paths for traversing the basic blocks of instructions in the program, and the dataflow graph, which specifies the true data dependences between producer and consumer instructions. As long as the processor does not violate the serialization constraints imposed by these graphs, it can overlap and reorder the execution of instructions. This achieves better performance by avoiding the enforcement of implied but unnecessary precedences. However, the processor must still enforce true inter-instruction dependences. To date, most machines enforce such dependences in a rigorous fashion that adheres to two requirements:

• Dependences are determined in an absolute and exact way; that is, two instructions are identified as either dependent or independent, and when in doubt, dependences are pessimistically assumed to exist.



• Dependences are enforced throughout instruction execution; that is, the dependences are never allowed to be violated and are enforced during instruction processing.

Such a traditional and conservative approach is called the strong-dependence model for program execution. This traditional model is overly rigorous and unnecessarily restricts available parallelism.

Weak-dependence model. Instead, we propose the weak-dependence model for superspeculative processors. This model specifies that

• Dependences need not be determined exactly or assumed pessimistically, but can instead be optimistically approximated or even temporarily ignored.
• Dependences can be temporarily violated during instruction execution as long as recovery can be performed prior to affecting the permanent machine state.

The weak-dependence model's advantage is that the machine can process instructions without having the program semantics completely determined, as specified by control flow and dataflow graphs. Furthermore, the machine can now speculate aggressively and temporarily violate the dependences as long as corrective measures are in place to recover from misspeculation. If a significant percentage of speculations are correct, the machine can effectively exceed the performance limit imposed by the traditional, strong-dependence model.

Similar in concept to branch prediction's implementation in current processors, superspeculation uses two interacting engines. The front-end engine assumes the weak-dependence model and is highly speculative, predicting instructions to aggressively speculate past them. When predictions are correct, these speculative instructions will effectively have skipped over certain stages of instruction execution. The back-end engine still uses the strong-dependence model to validate the speculations, recover from misspeculation, and provide history and guidance information to the speculative engine. By taking this approach, processors can harvest an unprecedented level of instruction-level parallelism. The edges of the dataflow graph that represent inter-instruction dependences are now enforced and become part of the critical dataflow path only when misspeculations occur.

A superspeculative machine. As Figure 2 shows, instruction execution can be divided into four logical stages, each taking one or more machine cycles.

• Fetch. The processor retrieves instructions from the instruction cache or main memory.
• Decode. The processor decodes instructions, renames their operands, and detects inter-instruction dependences.
• Execute. Instructions wait until their operands are available, then allocate functional units, execute according to their prescribed semantics, and forward their results to subsequent dependent instructions.
• Commit. Instructions are allowed to write back their results into the architected registers in program order.

Figure 2 also identifies the three key parameters that a superspeculative microarchitecture must maximize:

• instruction flow, the rate at which useful instructions are fetched, decoded, and dispatched to the execution core;
• register dataflow, the rate at which results are produced and register values become available; and
• memory dataflow, the rate at which data values are stored and retrieved from data memory.

These three flows roughly correspond to the processing of branch, ALU, and load/store instructions. In a superspeculative microarchitecture, aggressive speculative techniques are employed to accelerate the processing of all these instruction types.

SUPERFLOW TECHNIQUES
Our proposed approach to superspeculative microarchitecture is called Superflow. We've implemented a prototype to collect the data reported here, which indicates its potential, but the detailed design effort is ongoing. Superflow employs a wide range of speculative techniques to improve the throughput of instruction flow, register dataflow, and memory dataflow beyond traditional limits. Hence the name Superflow.

Instruction flow techniques
With these techniques, the processor attempts to supply as many useful instructions as it can to the execution core. Instruction flow is a threefold problem: conditional-branch throughput, taken-branch throughput, and misprediction latency.

Conditional branches. To provide adequate conditional-branch throughput, the processor must accurately predict the outcomes and targets of multiple conditional branches in every cycle. We combined earlier work on predicting multiple branches in every cycle4,5 and predicting branches accurately6 in a two-phase branch predictor. This predictor employs a local (per static branch) history and a global branch history to predict multiple branches in each cycle. During the first stage (fetch), global knowledge is used to predict multiple branches and generate a fetch address for



[Figure 2 is a block diagram: the instruction-flow front end (a trace cache, instruction cache, and branch predictor feeding the fetch stage, then an instruction buffer and decode stage) dispatches into integer, floating-point, media, and memory reservation stations (RS) for the execute stage; a reorder buffer (ROB) governs register dataflow and the commit stage, while a store queue and data cache handle memory dataflow.]
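The two-phase branch predictor introduced above checks its fetch-stage predictions with a gshare-style predictor during decode. The sketch below illustrates only the global-history (gshare) half of such a scheme; the table size and training loop are invented for the example, not taken from the Superflow design.

```python
# Minimal gshare-style predictor sketch. This shows only the global-history
# half of a two-phase scheme; sizes and the training loop are invented.
class Gshare:
    def __init__(self, bits=4):
        self.mask = (1 << bits) - 1
        self.ghr = 0                     # global branch history register
        self.pht = [1] * (1 << bits)     # 2-bit saturating counters

    def index(self, pc):
        return (pc ^ self.ghr) & self.mask    # XOR fetch address with history

    def predict(self, pc):
        return self.pht[self.index(pc)] >= 2  # counter >= 2 means "taken"

    def update(self, pc, taken):
        i = self.index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask

bp = Gshare()
for _ in range(8):           # steady stream of taken outcomes for one branch
    bp.update(0x40, True)
print(bp.predict(0x40))      # True, once history and counters warm up
```

The second-phase predictor described in the text additionally folds in local (per static branch) history; that combination is omitted here for brevity.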

Figure 2. Superspeculative machine overview. A Superflow microarchitecture employs a broad spectrum of speculative techniques to maximize overall instruction throughput beyond traditional limits by maximizing the throughputs of instruction flow, register dataflow, and memory dataflow.

the next cycle.5 During the second stage (decode), the earlier predictions are checked via an advanced gshare predictor,6 which combines local and global knowledge to generate an accurate prediction. Hence, many mispredictions made in the first phase are corrected with a latency of only one cycle.

Taken branches. The fetch unit must be able to correctly process more than one taken branch per cycle, which involves predicting each branch's direction and target, and also fetching, merging, and aligning instructions from each branch target.

To reduce complexity, the Superflow fetch engine uses an interesting, recently proposed approach called a trace cache.5 A trace cache is a history-based fetch mechanism that stores dynamic-instruction traces in a cache indexed by fetch address and branch outcome. Whenever it finds a suitable trace, it dispatches instructions from the trace cache rather than sequential instructions from the instruction cache. Since a dynamic sequence of instructions in the trace cache can contain multiple taken branches but is stored sequentially, there is no need to fetch from multiple targets. This eliminates the need for a multiported instruction cache or complex merge/align logic in the critical path.

Misprediction latency. Finally, the processor must minimize the latency of correctly resolving mispredicted branches. To do so, we employ a set of aggressive data-speculative techniques in the execution core; the register and memory dataflow sections describe them. These techniques enable branches to execute earlier than they would otherwise and produce significant performance gains by substantially reducing misprediction penalties.

Register dataflow techniques
These techniques facilitate fast and efficient processing of ALU instructions. The execution core strives for two fundamental goals to increase instruction throughput. It must

• efficiently detect and resolve inter-instruction dependences and
• eliminate or bypass as many dependences as pos-



Table 1. Estimated transistor cost of Superflow.

Resource                     Description                                                         Transistor count (millions)
CPU core logic               (32 instructions issued per cycle / 4 instructions issued
                             per cycle)² × 2 million transistors in a PowerPC 604,
                             which issues four instructions per cycle                             128.0
Value prediction table       32 Kbytes × 8 bits/byte × 6 transistors/SRAM bit                       1.6
Classification table         8K entries × 2 bits/entry × 6 transistors/SRAM bit                     0.1
Dependence prediction table  8K entries × 7 bits/entry × 64 ports × 6 transistors/SRAM bit         22.0
Alias prediction table       8K entries × 7 bits/entry × 32 ports × 6 transistors/SRAM bit         11.0
Pattern history tables       64K entries × 2 bits/entry × 2 tables × 6 transistors/SRAM bit         1.6
Trace cache                  64 Kbytes (estimated size) × 8 bits/byte × 6 transistors/SRAM bit      3.1
Level 1 instruction cache    64 Kbytes × 8 bits/byte × 6 transistors/SRAM bit                       3.1
Level 1 data cache           64 Kbytes × 4 ports × 8 bits/byte × 6 transistors/SRAM bit            12.6
Processor core total                                                                              183.1
Level 2 unified cache        16 Mbytes × 8 bits/byte × 6 transistors/SRAM bit (approximately)     805.3
Grand total                                                                                       988.4
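Table 1's entries follow from straightforward arithmetic at 6 transistors per SRAM bit; a few of them, reproduced as a sketch:

```python
# Reproducing a few of Table 1's entries (6 transistors per SRAM bit).
def sram_millions(bits):
    return 6 * bits / 1e6

core     = (32 / 4) ** 2 * 2.0                  # scale the 604's 2M core logic
value_pt = sram_millions(32 * 1024 * 8)         # 32-Kbyte value prediction table
dep_pt   = sram_millions(8 * 1024 * 7 * 64)     # 8K entries x 7 bits x 64 ports
l2       = sram_millions(16 * 1024 * 1024 * 8)  # 16-Mbyte level-2 cache

print(round(core, 1), round(value_pt, 1), round(dep_pt, 1), round(l2, 1))
# 128.0 1.6 22.0 805.3
```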

sible to expose more parallelism between instructions.

Detecting control and data dependences among multiple active instructions is an inherently sequential task that becomes combinatorially expensive as the number of concurrently active instructions increases. Furthermore, multiple-instruction dispatch is difficult to implement and has an adverse impact on cycle time because all instructions in a dispatch group must be simultaneously cross-checked. To avoid impacting cycle time, dependence checking and dispatch can be pipelined into multiple stages. Unfortunately, a deeper pipeline, especially in the decode portion, results in a greater performance penalty. However, this penalty can be overcome with dependence prediction, a speculative technique that can frequently short-circuit pipelined multicycle decode. It does so by predicting the dependence relationships between instructions and speculatively allowing instructions that are predicted to be data ready to execute in parallel with exact dependence checking.7 We discovered that such dependence relationships between instructions are highly predictable in real programs. The Superflow execution core adopts this technique to help overcome the latency cost of pipelined decode/dispatch and facilitate the implementation of wide-dispatch processors.

Superflow employs conventional, well-understood techniques, such as register and memory renaming, to eliminate false dependences between instructions. However, it advances well beyond the limits imposed by true dependences by employing source operand value prediction7 to eliminate true data dependences between instructions. This technique uses dynamic-value history information, stored per static program instruction, to predict future values of that instruction's source operands. Superflow improves the accuracy of source operand value prediction beyond previously reported levels by extending it to include value stride prediction. In value stride prediction, a dynamic hardware mechanism detects constant, incremental increases in operand values (strides) and uses them to predict future values.

Memory dataflow techniques
These techniques minimize average memory latency and provide adequate memory bandwidth for supporting a high-performance superspeculative processor core. To reduce average memory latency, we incorporate the prediction of load values,8,9 addresses, and aliases into the execution core. For adequate load instruction throughput, we introduce load stream partitioning, a divide-and-conquer strategy for reducing the cost and complexity of a high-bandwidth memory system.

Memory latency is a severe bottleneck to processor performance, and three factors cause it:

• address generation interlocks, which delay the initiation of a fetch from memory because the address is unknown;
• the latency of accessing the storage device; and
• queuing delays caused by contention for shared resources in the memory subsystem.

Speculative prediction of load addresses can eliminate many address generation interlocks. Previous research on address prediction is largely subsumed by source operand value prediction, which incorporates stride prediction for detecting and predicting constant-stride memory addresses. Address generation interlocks also cause a secondary problem. When load addresses are unknown, it is impossible to detect the existence of an alias with an earlier outstanding store. To alleviate this problem, we propose alias prediction, a logical extension to the register dependence prediction technique described earlier.7

Rather than predict the dependence distance to a preceding register write, we predict the dependence distance to a preceding store. (Andreas Moshovos and his colleagues described a similar technique.10) The predicted distance is then used to obtain the load value from that offset in the processor's store queue, which holds outstanding stores. For this speculative forwarding to occur, neither the load nor the store need to have their addresses available yet, allowing the bypassing of address generation entirely.
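The stride prediction idea described above can be sketched as a small software model: remember each static instruction's last value and last delta, and predict last value plus stride. The table layout and program counters below are invented for illustration; the real mechanism is a hardware table, not software.

```python
# Software model of per-instruction stride value prediction: remember the
# last value and delta for each static instruction, predict last + stride.
# Table layout and PCs are invented for illustration only.
class StridePredictor:
    def __init__(self):
        self.table = {}   # static pc -> (last_value, stride)

    def predict(self, pc):
        if pc not in self.table:
            return None                  # no history yet
        last, stride = self.table[pc]
        return last + stride

    def update(self, pc, value):         # called when the real value resolves
        last, _ = self.table.get(pc, (value, 0))
        self.table[pc] = (value, value - last)

sp = StridePredictor()
for v in (100, 104, 108):    # e.g. an induction variable stepping by 4
    sp.update(0x1000, v)
print(sp.predict(0x1000))    # 112
```

A plain last-value predictor is the degenerate case with stride fixed at zero; the stride extension is what captures induction variables and constant-stride address streams.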



[Figure 3 is a bar chart: sustained IPC (0 to 18) for go, m88ksim, gcc, compress, li, ijpeg, perl, and vortex, plus their harmonic mean; each bar stacks superscalar baseline performance with the additional performance provided by instruction flow, memory dataflow, and register dataflow optimizations.]
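Load value prediction, introduced among the memory dataflow techniques above, works with a table kept coherent with memory: a store to a tracked address invalidates the entry, so any entry that survives is guaranteed correct and needs no verifying memory access. A toy model of that coherence idea follows; the structures, addresses, and values are invented, and real hardware would use tagged, finite tables.

```python
# Toy model of a load value prediction table kept coherent with memory:
# a store to a tracked address invalidates the entry, so a surviving entry
# needs no verifying memory access. All structures and values are invented.
class LVPTable:
    def __init__(self):
        self.entries = {}   # load pc -> (address, value, still_coherent)

    def record(self, pc, addr, value):
        self.entries[pc] = (addr, value, True)

    def on_store(self, addr):           # snoop every intervening store
        for pc, (a, v, ok) in list(self.entries.items()):
            if a == addr:
                self.entries[pc] = (a, v, False)

    def lookup(self, pc):
        e = self.entries.get(pc)
        return e[1] if e and e[2] else None

lvp = LVPTable()
lvp.record(pc=0x2000, addr=0x8000, value=7)
print(lvp.lookup(0x2000))   # 7 -- a "zero-cycle load", no cache port used
lvp.on_store(0x8000)
print(lvp.lookup(0x2000))   # None -- the load must access memory after all
```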

The storage device's latency can be folded away (effectively eliminated) by performing load value prediction, which uses a per-static-load value history to predict future values. This technique can frequently predict the value to be loaded when the load instruction is dispatched, in effect implementing a zero-cycle load.8

At first glance, providing adequate memory bandwidth to support Superflow's performance goals appears daunting. Since roughly 40 percent of the 10-instruction-per-cycle throughput consists of loads and stores, the memory subsystem must provide an average bandwidth of four references per cycle. Furthermore, the peak bandwidth required to prevent excessive queuing delays is even higher. However, several factors and techniques relieve this difficult problem.

First, recent discoveries indicate that the reuse distances of many stored values (measured in execution cycles) are very short.11 In fact, our simulations show that many stores do not even make it out of the store queue before their values are needed again. Hence, the loads retrieving these values do not require a cache port.

Second, a technique called constant promotion8 can be used to eliminate the actual memory accesses normally performed by load instructions. Constant promotion provides a hardware mechanism that guarantees that the value generated via load value prediction is correct and need not be checked against an actual memory reference. It does so by keeping the load value prediction table coherent with main memory: the hardware monitors all intervening stores between an update and a lookup of the load value prediction table, invalidating entries modified by such stores.

These two observations lead us to introduce the notion of load stream partitioning, which simply partitions loads into multiple streams based on their behavior and sends them to disjoint, specialized functional units for processing. This approach facilitates the implementation of high-bandwidth memory systems by eliminating the need for large, centrally located, and extremely multiported data caches. Experimental evidence suggests that a significant portion of the load stream can be diverted to these specialized units for processing.3,8

PERFORMANCE POTENTIAL
To evaluate superspeculation's performance potential, we simulated the performance of a Superflow processor with

• a fetch width of 32;
• a 128-entry reorder buffer;
• 64-Kbyte, 4-way, set-associative data and instruction caches with a 10-cycle miss delay to a perfect, pipelined second-level cache; and
• a 128-entry, fully associative store queue.

Figure 3. Superflow performance: sustained IPC for a fetch width of 32, a reorder buffer size of 128, and a 128-entry store queue for various memory configurations. The leftmost bar shows IPC for a perfect cache with unlimited ports. The second bar shows IPC for a 64-Kbyte data cache with unlimited ports. The third, fourth, and fifth bars show the previous configuration, but with eight, four, and two cache ports. Each stacked bar shows cumulative IPC attainable beyond the conventional superscalar with instruction flow, memory dataflow, and register dataflow superspeculative techniques.

Table 1 shows a crude estimate of the transistors needed for a naive implementation of such a processor. We estimate that CPU core logic will increase as the square of the issue width, consuming 128 million transistors relative to the two million core logic transistors in the PowerPC 604, which issues four instructions per cycle. To reduce design complexity, we expect that the functional units in the execution core will be clustered in some manner similar to earlier proposals.12 Relative to the CPU core, the superspeculative prediction structures, including those for branch prediction, consume roughly 36 million transistors, while level-1 caches consume about 19 million transistors. A superspeculative processor core (including the level-1 caches) requires less than 200 million transistors. Assuming a total budget of one billion transistors, the chip still has room for a fast, 16-Mbyte level-2 cache, which should be more than adequate to capture the working sets of most anticipated applications.
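The interaction between load value prediction and constant promotion described above can be sketched in software. The following Python model is purely illustrative, not the Superflow hardware: the PC-indexed table, the repeat-once confirmation rule, and all names are assumptions made for exposition.

```python
# Illustrative model of load value prediction with constant promotion.
# The table organization, PC indexing, and repeat-once confirmation rule
# are assumptions for exposition, not details of the Superflow design.

class LoadValuePredictor:
    def __init__(self):
        # Maps a static load's PC to (predicted address, last value, confirmed).
        self.table = {}

    def predict(self, load_pc):
        """At dispatch: return (value, promoted) or None if no entry exists.
        A promoted prediction is kept coherent with memory, so the load
        needs no verifying cache access -- in effect, a zero-cycle load."""
        entry = self.table.get(load_pc)
        if entry is None:
            return None
        _, value, confirmed = entry
        return value, confirmed

    def update(self, load_pc, addr, actual_value):
        """At completion: record the loaded value; confirm (promote) the
        entry only once the same address yields the same value again."""
        prev = self.table.get(load_pc)
        confirmed = prev is not None and prev[0] == addr and prev[1] == actual_value
        self.table[load_pc] = (addr, actual_value, confirmed)

    def observe_store(self, store_addr):
        """Constant promotion's coherence rule: an intervening store
        invalidates any prediction entry for the stored-to address."""
        for pc in [pc for pc, e in self.table.items() if e[0] == store_addr]:
            del self.table[pc]
```

A load whose entry is confirmed can complete with the predicted value without consuming a cache port; the store-monitoring hook is what makes skipping the verification safe.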

September 1997 65


If necessary, designers can reduce the size of the level-2 cache to make room for various I/O and network interfaces. These estimates are approximate.

Figure 3 on page 65 shows performance results for an unlimited number of cache ports, and for eight, four, and two cache ports. Results for a model with a perfect cache (one that incurs no cache misses) are also included for reference. Each stacked bar reflects the additional IPC harvested by adding superspeculative instruction flow techniques, memory dataflow techniques, and register dataflow techniques to the machine model. We have published additional detailed performance results.3

These results conclusively demonstrate not only that superspeculative techniques provide impressive performance, but also that without them, very wide superscalars (such as the baseline case shown) do not scale to significantly improved levels of performance.

Our results demonstrate Superflow's dramatic performance potential, placing superspeculative microarchitecture in the forefront of competing approaches for billion-transistor computing. We are currently exploring these techniques in greater depth with more extensive and accurate simulations. The next phase of our effort will be to move these ideas toward implementation and produce a Superflow prototype after extensive implementation trade-off studies. We plan to develop simulation models of the prototype that can provide performance accuracy down to the machine cycle level. Trial circuit designs for the timing-critical pieces of the prototype will help us estimate chip complexity and machine cycle time. ❖

Acknowledgments
ONR grants N00014-96-1-0928 and N00014-96-1-0347 and Intel Corp. supported this research. This research also benefited from discussions with other MIG members: Bryan Black, Yuan Chou, Andrew Huang, Chris Newburn, and Derek Noonburg. Chris Wilkerson coined the name Superflow.

References
1. K. Olukotun et al., "The Case for a Single-Chip Multiprocessor," Proc. Seventh Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, ACM Press, New York, 1996, pp. 2-11.
2. M.H. Lipasti and J.P. Shen, "Exceeding the Data-Flow Limit Via Value Prediction," Proc. 29th Ann. ACM/IEEE Int'l Symp. on Microarchitecture, IEEE CS Press, Los Alamitos, Calif., 1996, pp. 226-237.
3. M.H. Lipasti, Value Locality and , doctoral dissertation, Carnegie Mellon Univ., Dept. of Electrical and Computer Eng., May 1997.
4. T.M. Conte et al., "Optimization of Instruction Fetch Mechanisms for High Issue Rates," Proc. 22nd Int'l Symp. on Computer Architecture, IEEE CS Press, Los Alamitos, Calif., 1995, pp. 333-344.
5. E. Rotenberg, S. Bennett, and J. Smith, "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching," Proc. 29th Ann. ACM/IEEE Int'l Symp. on Microarchitecture, IEEE CS Press, Los Alamitos, Calif., 1996, pp. 24-34.
6. S. McFarling, Combining Branch Predictors, Tech. Report TN-36, Digital Equipment Corp., Maynard, Mass., 1993, http://www.research.digital.com/wrl/home.html.
7. M.H. Lipasti and J.P. Shen, "The Performance Potential of Value and Dependence Prediction," Proc. EURO-PAR '97, Springer-Verlag, Passau, Germany, 1997.
8. M.H. Lipasti, C.B. Wilkerson, and J.P. Shen, "Value Locality and Load Value Prediction," Proc. Seventh Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, ACM Press, New York, 1996, pp. 138-147.
9. L. Widigen, E. Sowadsky, and K. McGrath, "Eliminating Operand Read Latency," Computer Architecture News, Dec. 1996, pp. 18-22.
10. A. Moshovos et al., "Dynamic Speculation and Synchronization of Data Dependences," Proc. 24th Int'l Symp. on Computer Architecture, IEEE CS Press, Los Alamitos, Calif., 1997, pp. 181-193.
11. A.S. Huang and J.P. Shen, "The Intrinsic Bandwidth Requirements of Ordinary Programs," Proc. Seventh Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, ACM Press, New York, 1996, pp. 105-114.
12. S. Vajapeyam and T. Mitra, "Improving Superscalar Instruction Dispatch and Issue by Exploiting Dynamic Code Sequences," Proc. 24th Int'l Symp. on Computer Architecture, ACM Press, New York, 1997, pp. 1-12.

Mikko H. Lipasti is an advisory engineer with IBM. His research interests include superspeculative computer architecture. Lipasti has a BS from Valparaiso University and an MS and PhD in electrical and computer engineering from Carnegie Mellon. He is a member of the IEEE, ACM, and Tau Beta Pi.

John Paul Shen is a professor in CMU's Electrical and Computer Engineering Department and heads the Microarchitecture Innovation Group. Shen received a BS from the University of Michigan and an MS and PhD from the University of Southern California, all in electrical engineering. He is an IEEE fellow.

Contact Shen at the Microarchitecture Innovation Group, Dept. of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213; [email protected].

