
COVER FEATURE

Scaling to the End of Silicon with EDGE Architectures

The TRIPS architecture is the first instantiation of an EDGE instruction set, a new, post-RISC class of instruction set architectures intended to match semiconductor technology evolution over the next decade, scaling to new levels of power efficiency and high performance.

Doug Burger, Stephen W. Keckler, Kathryn S. McKinley, Mike Dahlin, Lizy K. John, Calvin Lin, Charles R. Moore, James Burrill, Robert G. McDonald, William Yoder, and the TRIPS Team
The University of Texas at Austin

Instruction set architectures have long lifetimes because introducing a new ISA is tremendously disruptive to all aspects of a system. However, slowly evolving ISAs eventually become a poor match to the rapidly changing underlying fabrication technology. When that gap eventually grows too large, the benefits gained by renormalizing the architecture to match the underlying technology make the pain of switching ISAs well worthwhile.

Microprocessor designs are on the verge of a post-RISC era in which companies must introduce new ISAs to address the challenges that modern CMOS technologies pose while also exploiting the massive levels of integration now possible. To meet these challenges, we have developed a new class of ISAs, called Explicit Data Graph Execution (EDGE), that will match the characteristics of semiconductor technology over the next decade.

TIME FOR A NEW ARCHITECTURAL MODEL?

The "Architecture Comparisons" sidebar provides a detailed view of how previous architectures evolved to match the driving technology at the time they were defined.

In the 1970s, memory was expensive, so CISC architectures minimized memory usage with dense instruction encoding, variable-length instructions, and small numbers of registers. In the 1980s, the number of devices that could fit on a single chip replaced memory as the key limiting resource. With a reduced number of instructions and addressing modes as well as simplified control logic, an entire RISC processor could fit on a single chip. By moving to a register-register architecture, RISC ISAs supported aggressive pipelining and, with careful compile-time scheduling, RISC processors attained high performance despite their reduction in complexity.

Since RISC architectures, including the x86 equivalent with µops, are explicitly designed to support pipelining, the unprecedented acceleration of clock rates—40 percent per year for well over a decade—has permitted performance scaling for the past 20 years with only small ISA changes. For example, Intel has scaled its x86 line from 33 MHz in 1990 to 3.4 GHz today—more than a 100-fold increase over 14 years. Approximately half of that increase has come from designing deeper pipelines. However, recent studies show that pipeline scaling is nearly exhausted,1 indicating that processor design now requires innovations beyond pipeline-centric ISAs. Intel's recent cancellation of its high-frequency Pentium 4 successors is further evidence of this imminent shift.

Future architectures must support four major emerging technology characteristics:

• Pipeline depth limits mean that architects must rely on other fine-grained concurrency mechanisms to improve performance.
• The extreme acceleration of clock speeds has hastened power limits; in each market, future architectures will be constrained to obtain as much performance as possible given a hard power ceiling; thus, they must support power-efficient performance.

44 Computer • Published by the IEEE Computer Society • 0018-9162/04/$20.00 © 2004 IEEE

Architecture Comparisons

Comparing EDGE with historic architectures illustrates that architectures are never designed in a vacuum—they borrow frequently from previous architectures and can have many similar attributes.

• VLIW: A TRIPS block resembles a 3D VLIW instruction, with instructions filling fixed slots in a rigid structure. However, the execution semantics differ markedly. A VLIW instruction is statically scheduled—the compiler guarantees when it will execute in relation to all other instructions.1 All instructions in a VLIW packet must be independent. The TRIPS processor is a static placement, dynamic issue (SPDI) architecture, whereas a VLIW machine is a static placement, static issue (SPSI) architecture. The static issue model makes VLIW architectures a poor match for highly communication-dominated future technologies. Intel's family of EPIC architectures is a VLIW variant with similar limitations.

• Superscalar: An out-of-order RISC or x86 superscalar processor and an EDGE processor traverse similar dataflow graphs. However, the superscalar graph traversal involves following renamed pointers in a centralized issue window, which the hardware constructs—adding instructions individually—at great cost to power and scalability. Despite attempts at partitioned variants,2 the scheduling scope of superscalar hardware is too constrained to place instructions dynamically and effectively. In our scheduling taxonomy, a superscalar processor is a dynamic placement, dynamic issue (DPDI) machine.

• CMP: Many researchers have remarked that a TRIPS-like architecture resembles a chip multiprocessor (CMP) with 16 lightweight processors. In an EDGE architecture, the global control maps irregular blocks of code to all 16 ALUs at once, and the instructions execute at will based on dataflow order. In a 16-tile CMP, a separate program runs at each tile, and tiles can communicate only through memory. EDGE architectures are finer-grained than both CMPs and proposed speculatively threaded processors.3

• RAW processors: At first glance, the TRIPS microarchitecture bears similarities to the RAW architecture. A RAW processor is effectively a 2D, tiled static machine, with shades of a highly partitioned VLIW architecture. The main difference between a RAW processor and the TRIPS architecture is that RAW uses compiler-determined issue order (including the inter-ALU routers), whereas the TRIPS architecture uses dynamic issue to tolerate variable latencies. However, since the RAW ISA supports direct communication from a producer instruction in one tile to a consuming instruction's ALU on another tile, it could be viewed as a statically scheduled variant of an EDGE architecture, whereas TRIPS is a dynamically scheduled EDGE ISA.

• Dataflow machines: Classic dataflow machines, researched heavily at MIT in the 1970s and 1980s,4 bear considerable resemblances to intrablock execution in EDGE architectures. Dataflow machines were originally targeted at functional programming languages, a suitable match because those languages have ample concurrency and side-effect free, "write-once" memory semantics. However, in the context of a contemporary imperative language, a dataflow architecture would need to store and execute many unnecessary instructions because of complex control-flow conditions. EDGE architectures use control flow between blocks and support conventional memory semantics within and across blocks, permitting them to run traditional imperative languages such as C or Java while gaining many of the benefits of more traditional dataflow architectures.

• Systolic processors: Systolic arrays are a special historical class of multidimensional array processors5 that continually process the inputs fed to them. In contrast, EDGE processors dynamically issue the set of instructions in mapped blocks, which makes them considerably more general purpose.

References
1. J.A. Fisher et al., "Parallel Processing: A Smart Compiler and a Dumb Machine," Proc. 1984 SIGPLAN Symp. Compiler Construction, ACM Press, 1984, pp. 37-47.
2. K.I. Farkas et al., "The Multicluster Architecture: Reducing Cycle Time Through Partitioning," Proc. 30th Ann. IEEE/ACM Int'l Symp. Microarchitecture (MICRO-30), IEEE CS Press, 1997, pp. 149-159.
3. G.S. Sohi, S.E. Breach, and T.N. Vijaykumar, "Multiscalar Processors," Proc. 22nd Int'l Symp. Computer Architecture (ISCA 95), IEEE CS Press, 1995, pp. 414-425.
4. Arvind, "Data Flow Languages and Architecture," Proc. 8th Int'l Symp. Computer Architecture (ISCA 81), IEEE CS Press, 1981, p. 1.
5. H.T. Kung, "Why Systolic Architectures?" Computer, Jan. 1982, pp. 37-46.

• Increasing resistive delays through global on-chip wires mean that future ISAs must be amenable to on-chip communication-dominated execution.
• Design and mask costs will make running many application types across a single design desirable; future ISAs should support polymorphism—the ability to use their execution and memory units in different ways and modes to run diverse applications.

EDGE ARCHITECTURES

The ISA in an EDGE architecture supports one main characteristic: direct instruction communication. Direct instruction communication means that the hardware delivers a producer instruction's output directly as an input to a consumer instruction, rather than writing it back to a shared namespace, such as a register file. Using this direct communication from producers to consumers, instructions execute in dataflow order, with each instruction firing when its inputs are available. In an EDGE architecture, a producer with multiple consumers specifies each of those consumers explicitly, rather than writing to a single register that multiple consumers read, as in a RISC architecture.

The advantages of EDGE architectures include higher exposed concurrency and more power-efficient execution. An EDGE ISA provides a richer interface between the compiler and the microarchitecture: The ISA directly expresses the dataflow graph that the compiler generates internally, instead of requiring the hardware to rediscover data dependences dynamically at runtime—an inefficient approach that out-of-order RISC and CISC architectures currently take.

Today's out-of-order issue RISC and CISC designs require many inefficient and power-hungry structures, such as per-instruction register renaming, associative issue window searches, complex dynamic schedulers, high-bandwidth branch predictors, large multiported register files, and complex bypass networks. Because an EDGE architecture conveys the compile-time dependence graph through the ISA, the hardware does not need to rebuild that graph at runtime, eliminating the need for most of those power-hungry structures. In addition, direct instruction communication eliminates the majority of a conventional processor's register writes, replacing them with more energy-efficient delivery directly from producing to consuming instructions.

TRIPS: AN EDGE ARCHITECTURE

Just as MIPS was an early example of a RISC ISA, the TRIPS architecture is an instantiation of an EDGE ISA. While other implementations of EDGE ISAs are certainly possible, the TRIPS architecture couples compiler-driven placement of instructions with hardware-determined issue order to obtain high performance with good power efficiency. At the University of Texas at Austin, we are building both a prototype processor and a compiler implementing the TRIPS architecture, to address the four technology-driven challenges above, as detailed below:

• To increase concurrency, the TRIPS ISA includes an array of concurrently executing arithmetic logic units (ALUs) that provide both scalable issue width and scalable instruction window size; for example, increasing the processor's out-of-order issue width from 16 to 32 is trivial.
• To attain power-efficient high performance, the architecture amortizes the overheads of sequential, von Neumann semantics over large, 100-plus instruction blocks.
• Since future architectures must be heavily partitioned, the TRIPS architecture uses compile-time instruction placement to mitigate communication delays, which minimizes the physical distance that operands for dependent instruction chains must travel across the chip and thus minimizes execution delay.
• Because its underlying dataflow execution model does not presuppose a given application computation pattern, TRIPS offers increased flexibility. The architecture includes configurable memory banks, which provide a general-purpose, highly programmable spatial computing substrate. The underlying dataflow-like execution model, in which instructions fire when their operands arrive, is fundamental to computation, efficiently supporting vectors, threads, dependence chains, or other computation patterns as long as the compiler can spatially map the pattern to the underlying execution substrate.

To support conventional languages such as C, C++, or Fortran, the TRIPS architecture uses block-atomic execution. In this model, the compiler groups instructions into blocks of instructions, each of which is fetched, executed, and committed atomically, similar to the conventional notion of transactions: A block may either be committed entirely or rolled back; a fraction of a block may not be committed. Each of these TRIPS blocks is a hyperblock2 that contains up to 128 instructions, which the compiler maps to an array of execution units. The TRIPS microarchitecture behaves like a conventional processor with sequential semantics at the block level, with each block behaving as a "megainstruction." Inside executing blocks, however, the hardware uses a fine-grained dataflow model with direct instruction communication to execute the instructions quickly and efficiently.

[Figure 1 diagram: block diagrams of the TRIPS prototype chip, processor core, and execution node, annotated with clock domains, port counts, and bank organizations—global control, register, I-cache, D-cache, and execution tiles, plus DDR memory, chip-to-chip, control processor, and JTAG interfaces.]

Figure 1. TRIPS prototype microarchitecture. (a) The prototype chip contains two processing cores, each of which is a 16-wide out-of-order issue processor that can support up to 1,024 instructions in flight. Each chip also contains 2 Mbytes of integrated L2 cache, organized as 32 banks connected with a lightweight routing network, as well as external interfaces. (b) The processor core is composed of 16 execution nodes connected by a lightweight network. The compiler builds 128-instruction blocks that are organized into groups of eight instructions per execution node. (c) Each execution node contains a fully functional ALU, 64 instruction buffers, and a router connecting to the lightweight inter-ALU network.

TRIPS instructions do not encode their source operands, as in a RISC or CISC architecture. Instead, they produce values and specify where the architecture must route them in the ALU array. For example, a RISC ADD instruction adds the values in R2 and R3, and places the result in R1:

ADD R1, R2, R3

The equivalent TRIPS instruction specifies only the targets of the add, not the source operands:

ADD T1, T2

where T1 and T2 are the physical locations (assigned by the compiler) of two instructions dependent on the result of the add, which would have read the result in R1 in the RISC example above. Each instruction thus sends its values to its consumers, and instructions fire as soon as all of their operands arrive.

The compiler statically determines the locations of all instructions, setting the consumer target location fields in each instruction appropriately. Dynamically, the processor microarchitecture fires each instruction as soon as it is ready. When fetching and mapping a block, the processor fetches the instructions in parallel and loads them into the instruction buffers at each ALU in the array.

This block mapping and execution model eliminates the need to go through any fully shared structures, including the register file, for any instruction unless instructions are communicating across distinct blocks. The only exceptions are loads and stores, which must access a bank of the data cache and the memory ordering hardware.

To demonstrate the potential of EDGE instruction sets, we are building a full prototype of the TRIPS architecture in silicon, a compiler that produces TRIPS binaries, and a limited runtime system.

Figure 1a is a diagram of the prototype chip, which will be manufactured in 2005 in a 130-nm ASIC process and is expected to run at 500 MHz. The chip contains 2 Mbytes of integrated L2 cache, organized as 32 banks connected with a lightweight routing network. Each of the chip's two processing cores is a 16-wide out-of-order issue processor that can support up to 1,024 instructions in flight, making TRIPS the first kiloinstruction processor to be specified or built.

As Figure 1b shows, each processing core is composed of a 4 × 4 array of execution nodes connected by a lightweight network. The nodes are not processors; they are ALUs associated with buffers for holding instructions. There are four register file banks along the top and four instruction and data cache banks along the right-hand side of the core, as well as four ports into the L2 cache network. The compiler builds 128-instruction blocks, organized into groups of eight instructions per node at each of the 16 execution nodes.

To fetch a block of instructions, the global control tile ("G" in Figure 1b) accesses its branch predictor, obtains the predicted block address, and accesses the I-cache tags in the G-tile. If the block's address hits in the I-cache, the G-tile broadcasts the block address to the I-cache banks. Each bank then streams the block's instructions for its respective row into the execution array and into the instruction buffers at each node, shown in Figure 1c. Since branch predictions need occur only once every eight cycles, the TRIPS architecture is effectively unconstrained by predictor or I-cache bandwidth limitations.

Each block specifies its register and memory inputs and outputs. Register read instructions inject the block inputs into the appropriate nodes in the execution array. Instructions in the block then execute in dataflow order; when the block completes, the control logic writes all register outputs and stores to the register tiles and memory, then removes the block from the array. Each block emits exactly one branch that produces the location of the subsequent block.

To expose more instruction-level parallelism, the TRIPS microarchitecture supports up to eight blocks executing concurrently. When the G-tile maps a block onto the array, it also predicts the next block, fetching and mapping it as well. In steady state, up to eight blocks can operate concurrently on the TRIPS processor. The renaming logic at the register file banks forwards register values that one block produces directly to consumers in another block to improve performance further.

With this execution model, TRIPS can achieve power-efficient out-of-order execution across an extremely large instruction window because it eliminates many of the power-hungry structures found in traditional RISC implementations. The architecture replaces the associative issue window with architecturally visible instruction buffers constructed from small RAMs. The partitioned register file requires fewer ports because the processor never writes temporary values produced and consumed within a block to the register file. As a result, the TRIPS processor reduces register file accesses and rename table lookups by 70 percent, on average.

The microarchitecture replaces the unscalable broadcast bypass network in superscalar processors with a point-to-point routing network. Although routing networks can be slower than broadcasting, the average distance that operands must travel along the network is short because the compiler places dependent instructions on the same node or on nearby nodes. The instruction cache can provide 16 instructions per cycle and only needs to predict a branch once every eight cycles, reducing the necessary predictor bandwidth. The TRIPS ISA thus amortizes the overheads of out-of-order execution over a 128-instruction block (and with eight blocks, over a 1,024-instruction window), rather than incurring them on every single instruction.

BLOCK COMPILATION

Figure 2 shows an example of how the TRIPS ISA encodes a small, simple compiled block. We assume that the C code snippet in Figure 2a allocates the input integer y to a register. Figure 2b is the same example in MIPS-like assembly code. The variable x is saved in register 2. The TRIPS compiler constructs the dataflow graph shown in Figure 2c. Two read instructions obtain the register values and forward them to their

consuming instructions mapped on the execution array. The compiler converts the original branch to a test instruction and uses the result to predicate the control-dependent instructions, shown as dotted lines in Figure 2c. By converting branches to predicates, the compiler creates larger regions with no control-flow changes for more effective spatial scheduling.

At runtime, the instructions fire in any order subject to dataflow constraints. The compiled code eventually forwards the block's outputs to the two write instructions and to any other block waiting for those values. Figure 2d shows the dataflow graph (DFG) with the block of instructions scheduled onto a 2 × 2 execution node array, each node with an ALU and up to two instructions buffered per node. Since the critical path is muli→tgti→addi→add, the compiler assigns the first two instructions to the same ALU so that they can execute with no internode routing delays. No implicit ordering governs the execution of instructions other than the dataflow arcs shown in Figure 2c, so there are no "program order" limitations to instruction issue. When both writes arrive at the register file, the control logic deallocates the block and replaces it with another.

Figure 2e shows the actual code that a TRIPS compiler would generate, which readers can decode using the map of instruction locations at the bottom of Figure 2d. Instructions do not contain their source operands—they contain only the physical locations of their dependent consumers. For example, when the control logic maps the block, the read instruction pulls the value out of register R4 and forwards it to the instruction located in slot 0 of ALU node [1,1]. As soon as that instruction receives the result of the test condition and the register value, it forwards the value back to register write instruction 0, which then places it back into the register file (R4). If the predicate has a value of false, the instruction doesn't fire.

While this block requires a few extra overhead instructions compared to a RISC ISA, it performs fewer register file accesses—two reads and two writes, as opposed to six reads and five writes. Larger blocks typically have fewer extra overhead instructions, more instruction-level parallelism, and a larger ratio of register file access savings.

Figure 2. TRIPS code example. (a) In the C code snippet, the compiler allocates the input integer y to a register. (b) In the MIPS-like assembly code, the variable x is saved in register 2. (c) The compiler converts the original branch to a test instruction and uses the result to predicate the control-dependent instructions, which appear as dotted lines in the dataflow graph. (d) Each node in the 2 × 2 execution node array holds up to two buffered instructions. (e) The compiler generates instructions in target form, which readers can decode using the map of instruction locations at the bottom of part (d). Panels (a) and (b) are reproduced below; the dataflow graph, placement, and instruction block renderings are omitted.

/* (a) C code snippet */
// y, z in registers
x = y * 2;
if (x > 7) {
    y += 7;
    z = 5;
}
x += y;
// x, z are live registers

/* (b) MIPS-like assembly */
// R0 contains 0, R1 contains y, R3 contains 7, R4 contains z
    muli R2, R1, 2     // x = y * 2
    ble  R2, R3, L1    // if (x > 7)
    addi R1, R1, #7    // y += 7
    addi R4, R0, #5    // z = 5
L1: add  R5, R2, R1    // x += y

COMPILING FOR TRIPS

Architectures work best when the subdivision of labor between the compiler and the microarchitecture matches the strengths and capabilities of each. For future technologies, current execution models strike the wrong balance: RISC relies too little on the compiler, while VLIW relies on it too much.

RISC ISAs require the hardware to discover instruction-level parallelism and data dependences dynamically.

[Figure 3 pipeline: front end (parsing, CFG generation, inlining) → high-level transformations (loop unrolling, loop peeling and flattening) → hyperblock formation (if-conversion, predication, back-end optimization) → code generation (block validation and cutting, assembly generation) → scheduler (move insertion, instruction placement).]

[Figure 3 panels: (a) Control-flow graph; (b) Unrolled CFG; (c) Hyperblocks; (d) Generated code; (e) Scheduled code.]

Figure 3. EDGE optimizations in the Scale compiler. (a) A front end parses the code and constructs a control-flow graph (CFG). (b) The CFG after the compiler has unrolled the loop three times. (c) All of the basic blocks from the unrolled, if-converted loop are placed inside a single hyperblock. (d) After inserting the predicates, the compiler generates the code and allocates registers. (e) The scheduler attempts to place independent instructions on different ALUs to increase concurrency and dependent instructions near one another to minimize routing distances and communication delays.

While the compiler could convey the dependences, the ISA cannot express them, forcing out-of-order superscalar architectures to waste energy reconstructing that information at runtime.

VLIW architectures, conversely, put too much of a load on the compiler. They require the compiler to resolve all latencies at compile time to fill instruction issue slots with independent instructions. Since unanticipated runtime latencies cause the machine to stall, the compiler's ability to find independent instructions within its scheduling window determines overall performance. Since branch directions, memory aliasing, and cache misses are unknown at compile time, the compiler cannot generate schedules that best exploit the available parallelism in the face of variable-latency instructions such as loads.

For current and future technologies, EDGE architectures and their ISAs provide a proper division between the compiler and the architecture, matching their responsibilities to their intrinsic capabilities and making the job of each simpler and more efficient. Rather than packing together independent instructions like a VLIW machine, which is difficult to scale to wider issue, an EDGE compiler simply expresses the data dependences through the ISA. The hardware's execution model handles dynamic events like variable memory latencies, conditional branches, and the issue order of instructions, without needing to reconstruct any compile-time information.

An EDGE compiler has two new responsibilities in addition to those of a classic optimizing RISC compiler. The first is forming large blocks (hyperblocks in the TRIPS architecture) that have no internal control flow, thus permitting them to be scheduled as a unit. The second is the spatial scheduling of those blocks, in which the compiler statically assigns instructions in a block to ALUs in the execution array, with the goal of reducing interinstruction communication distances and exposing parallelism.

To demonstrate the tractability of the compiler analyses needed for EDGE architectures, we retargeted the Scale research compiler3 to generate optimized TRIPS code. Scale is a compilation framework written in Java that was originally designed for extensibility and high performance with RISC architectures such as Alpha and Sparc. Scale provides classic scalar optimizations and analyses such as constant propagation, loop-invariant code motion, and dependence analysis, and higher-level transformations such as inlining, loop unrolling, and interchange. To generate high-quality TRIPS binaries, we added transformations to generate large predicated hyperblocks,2 a new back end to generate unscheduled TRIPS code, and a scheduler that maps instructions to ALUs and generates scheduled TRIPS assembly in which every instruction is assigned a location on the execution array.

Figure 3 shows the EDGE-specific transformations on the code fragment in Figure 4. Scale uses a front end to parse the code and construct the abstract syntax tree and control-flow graph (CFG), shown in Figure 3a. To form large initial regions, the compiler unrolls loops and inlines functions. Figure 3b shows the CFG after the compiler has unrolled the inner do-while loop three times. The compiler must take four additional TRIPS-specific steps to generate correct TRIPS binaries.

Large hyperblock formation. The compiler exposes parallelism by creating large predicated hyperblocks, which it can fetch en masse to fill the issue window quickly. Hyperblocks are formally defined as single-entry, multiple-exit regions of code. Because they have no internal control-flow transfers, hyperblocks are ideal for mapping onto a spatial substrate. Branches can only jump out of a hyperblock to the start of another hyperblock, never to another point in the same hyperblock or into the middle of another.

50 Computer

Figure 4. Modified code fragment from gzip (SPECINT2000):

    for (i = 0; i < loopcount; i++) {
        code = 0x1234;
        len = (i % 15) + 1;
        res = 0;
        do {
            res |= code & 1;
            if (res & 0xfddd) res <<= 1;
            code >>= 1;
        } while (--len > 0);
        result += res;
    }

Large hyperblocks are desirable because they provide a larger region for scheduling and expose more concurrency. For simplicity, the TRIPS prototype ISA supports fixed 128-instruction hyperblocks, which it fetches at runtime to provide eight instructions at each of the 16 execution nodes. A more complex implementation might permit variable-sized blocks, which map varying numbers of instructions to execution nodes.

To grow hyperblocks, Scale uses inlining, aggressive unrolling (including unrolled while loops), loop peeling/flattening, and if-conversion (a form of predication). Figure 3c shows all the basic blocks from the unrolled and if-converted loop, placed inside a single hyperblock.

At runtime, many hyperblocks will contain useless instructions from CFG nodes included in the hyperblock but not on the dynamically taken path, or from loop iterations unrolled past the end of a loop. To reduce the fraction of unneeded instructions in hyperblocks, Scale uses edge profiling to guide the high-level transformations by identifying both loop counts and infrequently executed basic blocks.

Predicated execution. In previous predication models, evaluating the predicate on every conditional instruction incurs no additional cost. The TRIPS architecture instead treats a predicate as a dataflow operand that it explicitly routes to waiting predicated consumers. A naive implementation would route a predicate to every instruction in a predicated basic block, creating a wide fan-out problem. Fortunately, given a dependence chain of instructions that execute on the same predicate, the compiler can choose to predicate only the first instruction in the chain, so the chain does not fire, or only the last instructions in the chain that write registers or memory values. Predicating the first instruction saves power when the predicate is false. Alternatively, predicating only the last, output-producing instructions in a dependence chain generally increases both power and performance by hiding the latency of the predicate computation.

Since the control logic detects block termination when it receives all outputs (stores, register writes, and a branch), a block in the TRIPS architecture must emit the same number of outputs no matter which predicated path is taken. Consequently, the TRIPS ISA supports the notion of null stores and null register writes that execute whenever a predicated store or register write does not fire. The TRIPS compiler inserts the null assignments optimally using the predicate flow graph, which it generates internally after leaving static single assignment (SSA) form. After inserting the predicates, the compiler generates the code and allocates registers, as shown in Figure 3d, where the hyperblock's dataflow graph is visible.

Generating legal hyperblocks. Scale must confirm that hyperblocks adhere to their legal restrictions before scheduling them onto the execution array. The TRIPS prototype places several restrictions on legal hyperblocks to simplify the hardware: A hyperblock can have no more than 128 instructions, 32 loads or stores along any predicated path, 32 register inputs to the block, and 32 register outputs from the block. These architectural restrictions simplify the hardware at the expense of some hyperblocks being less full to avoid violating those constraints. If the compiler discovers an illegal hyperblock, it splits the block into multiple blocks, allocating intrablock dataflow values and predicates to registers.

Physical placement. After the compiler generates the code, optimizes it, and validates the legality of the hyperblocks, it maps instructions to ALUs and converts instructions to target form, inserting copy operations when a producer must route a value to many consumers. To schedule the code, the compiler assigns instructions to the ALUs on the array, limiting each ALU to at most eight instructions from the block. The scheduler, shown at a high level in Figure 3e, attempts to balance two competing goals:

• placing independent instructions on different ALUs to increase concurrency, thereby reducing the probability of two instructions competing to issue on the same ALU in the same cycle; and
• placing instructions near one another to minimize routing distances and thus communication delays.

The scheduler additionally exploits its knowledge of the microarchitecture by placing instructions that use the registers near the register banks and by placing critical load instructions near the data cache banks.
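The two predication strategies described under Predicated execution — guarding the first instruction of a chain versus only its output-producing tail — can be sketched in C, with the predicate modeled as an ordinary value. The helper names are hypothetical; the guarded statement comes from Figure 4's inner loop.

```c
#include <assert.h>

/* If-conversion sketch (hypothetical C rendering, not TRIPS code).
 * The branch  "if (res & 0xfddd) res <<= 1;"  becomes a predicate that
 * is routed like any other dataflow operand.  For a chain guarded by
 * one predicate, the compiler may predicate only the first instruction
 * (the chain never fires when the predicate is false, saving power) or
 * only the last, output-producing instruction (the chain executes
 * speculatively, hiding the predicate's latency). */

static int step_branchy(int res)
{
    if (res & 0xfddd)
        res <<= 1;
    return res;
}

/* Predicate-first: the guarded work happens only when p is true. */
static int step_predicate_first(int res)
{
    int p = (res & 0xfddd) != 0;    /* predicate as a dataflow value */
    if (p)                          /* chain fires only if p         */
        res = res << 1;
    return res;
}

/* Predicate-last: compute speculatively; the predicate selects which
 * value becomes the block output. */
static int step_predicate_last(int res)
{
    int p = (res & 0xfddd) != 0;
    int shifted = res << 1;         /* executes regardless of p      */
    return p ? shifted : res;       /* only the output is predicated */
}
```

All three variants produce identical results; they differ only in whether the shift consumes power when the predicate is false and in whether the shift must wait for the predicate to resolve.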

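A minimal sketch of the scheduler's greedy placement loop follows, assuming the 4x4 ALU array with eight instruction slots per ALU that the text specifies. The cost function — Manhattan routing distance to already-placed producers plus a small load penalty — is an invented stand-in for Scale's actual heuristics.

```c
#include <assert.h>
#include <limits.h>

/* Toy spatial scheduler (illustrative heuristics, not Scale's).  It
 * greedily places each instruction of a block onto a 4x4 ALU array, at
 * most 8 instructions per ALU, choosing the free slot that minimizes
 * the total Manhattan routing distance to already-placed producers. */

enum { ROWS = 4, COLS = 4, SLOTS_PER_ALU = 8 };

typedef struct { int nprod; int prod[2]; } SchedInstr;  /* producer ids */

static int manhattan(int a, int b)
{
    int ar = a / COLS, ac = a % COLS, br = b / COLS, bc = b % COLS;
    int dr = ar > br ? ar - br : br - ar;
    int dc = ac > bc ? ac - bc : bc - ac;
    return dr + dc;
}

/* place[i] receives the ALU index (0..15) chosen for instruction i;
 * returns 0 on success, -1 if the block exceeds array capacity. */
int schedule_block(const SchedInstr *ins, int n, int *place)
{
    int load[ROWS * COLS] = {0};
    if (n > ROWS * COLS * SLOTS_PER_ALU)
        return -1;
    for (int i = 0; i < n; i++) {
        int best = -1, best_cost = INT_MAX;
        for (int alu = 0; alu < ROWS * COLS; alu++) {
            if (load[alu] >= SLOTS_PER_ALU)
                continue;                    /* ALU slots exhausted */
            int cost = 0;
            for (int p = 0; p < ins[i].nprod; p++)
                cost += manhattan(alu, place[ins[i].prod[p]]);
            /* Light contention penalty spreads independent work out. */
            cost += load[alu];
            if (cost < best_cost) { best_cost = cost; best = alu; }
        }
        place[i] = best;
        load[best]++;
    }
    return 0;
}

/* Demo: a three-instruction dependence chain lands on one ALU. */
int schedule_demo_chain(void)
{
    SchedInstr ins[3] = { {0, {0,0}}, {1, {0,0}}, {2, {0,1}} };
    int place[3];
    if (schedule_block(ins, 3, place) != 0) return -1;
    return (place[0] == place[1] && place[1] == place[2]) ? place[0] : -2;
}

/* Demo: two independent instructions land on different ALUs. */
int schedule_demo_independent(void)
{
    SchedInstr ins[2] = { {0, {0,0}}, {0, {0,0}} };
    int place[2];
    if (schedule_block(ins, 2, place) != 0) return -1;
    return place[0] != place[1];
}
```

In this toy model a dependent chain gravitates to one ALU (short routes) while independent instructions spread across ALUs (less issue contention) — exactly the two competing goals listed above.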
July 2004 51

The compiler uses a classic greedy technique to choose the order of instructions to place, with a few additional heuristics to minimize routing distances and ALU contention.

Although the past two years of compiler development have been labor intensive, the fact that we could design and implement this functionality in Scale with a small development team demonstrates the balance in the architecture. The division of responsibilities between the hardware and the compiler in the TRIPS architecture is well suited to the compiler's inherent capabilities. Scale can presently compile C and Fortran benchmarks into fully executable TRIPS binaries.

SUPPORTING PARALLELISM

Our work with the TRIPS architecture has shown that a single EDGE architecture can effectively support the three main parallelism classes—instruction, data, and thread—on the same hardware with the same ISA. It may be possible to leverage this inherent flexibility to merge previously distinct markets, such as signal processing and desktop computing, into a single family of architectures that has the same or similar instruction sets and execution models. Area, power, and clock speeds would differentiate implementations to target various power/performance points in distinct markets.

To be fully general and to exploit many types of parallelism, the TRIPS microarchitecture must support the common types of graphs and communication flows across instruction-parallel, data-parallel, and thread-parallel applications without requiring too much specialized hardware support. In previous work, we showed how the TRIPS architecture can harvest instruction-level parallelism from desktop-style codes with complex dependence patterns and, with only minimal additional hardware support, can also support loop-driven data-level parallel codes and threaded parallelism.4

Mapping data-level parallelism

Data-level parallel (DLP) applications range from high-end, scientific vector code to graphics processing to low-power, embedded signal processing code. These applications are characterized by frequent, high-iteration loops, large amounts of parallel arithmetic operations, and regular, high-bandwidth access to data sets. Even though many DLP codes are extremely regular, some such workloads, particularly in the graphics and embedded domains, are becoming less regular, with more complex control flow. EDGE architectures are well suited to both regular and irregular DLP codes because the compiler can map the processing pipelines for multiple data streams to groups of processing elements, using efficient local communication and dataflow-driven activation for execution.

Three additional mechanisms provide significant further benefits to DLP codes. First, if the compiler can fit a loop body into a single block, or across a small number of blocks, the processor only needs to fetch those instructions once from the I-cache; subsequent iterations can reuse prior mappings for highly power-efficient loop execution. Second, in its memory system, the TRIPS prototype has a lightweight network embedded in the cache to support high-bandwidth routing to each of its 32 L2 banks, and it supports dynamic bank configuration using a cache as an explicitly addressable scratchpad. Selectively configuring such memories as scratchpads enables explicit memory management of small on-chip RAMs, similar to signal processing chips such as those in the Texas Instruments C6x series. Third, our studies show that some DLP codes can benefit directly from higher local memory bandwidth, which developers can add to a TRIPS system by augmenting it with high-bandwidth memory access channels between the processing core and explicitly managed memory. While we will not implement these channels in the prototype, designers may include them in subsequent implementations.

In prior work, we found that adding a small number of additional "universal" hardware mechanisms allowed DLP codes to run effectively on a TRIPS-like processor across many domains, such as network processing and cryptography, signal processing, and scientific and graphics workloads.5 With this additional hardware, a TRIPS-like EDGE processor would, across a set of nine highly varied DLP applications, outperform the best-in-class specialized processor for four applications, come close in two, and fall short in three.

Mapping thread-level parallelism

Most future high-end processors will contain some form of multithreading support. Intel's Pentium 4 and IBM's Power 5 both support simultaneous multithreading (SMT).6 We believe that executing a single thread per EDGE processing core will often be more desirable than simultaneously executing multiple threads per core because the EDGE dataflow ISA exposes enough parallelism to effectively use the core's resources. However, some markets—particularly servers—have copious amounts of thread-level parallelism available, and an EDGE processor should exploit the thread-level parallelism for these applications.

The TRIPS prototype will support simultaneous execution of up to four threads per processing core. The TRIPS multithreading model assigns a subset of the in-flight instruction blocks to each thread. For example, while a single thread might have eight blocks of 128 instructions per block in flight, threads in a four-thread configuration will each have two blocks of 128 instructions in flight.

The processor's control logic moves into and out of multithreading mode by writing to control registers that allocate blocks to threads. In addition to this control functionality, each processor needs a separate copy of an architectural register file for each active thread, as well as some per-thread identifiers augmenting cache tags and other state information.

While the TRIPS prototype maps threads across blocks, thus permitting each thread to have access to all ALUs in the execution array, different EDGE implementations might support other mappings. For example, the hardware could allocate one thread to each column or row of ALUs, thus giving each thread private access to a register or cache bank. However, these alternative mappings would require more hardware support than the block-based approach used in the TRIPS prototype.

Software challenges

The hardware required to support TLP and DLP applications effectively on an EDGE architecture is but a small increment over the mechanisms already present to exploit ILP, making EDGE architectures a good match for exploiting broad classes of parallelism—with high power efficiency—on a single design. Exploiting an EDGE architecture's flexibility to run heterogeneous workloads comprising a mix of single-threaded, multithreaded, and DLP programs will require modest additional support from libraries and operating systems.

In the TRIPS prototype, memory banks can be configured as caches or as explicitly addressable scratchpads, which lack the process-ID tags that allow processes to share virtually addressed caches. Therefore, to context-switch out and in a process that has activated a scratchpad, the TRIPS runtime system must save and restore the contents of the scratchpad and reconfigure the memory banks to the appropriate mode of operation. Additionally, to run a heterogeneous mix of jobs on the TRIPS prototype efficiently, the scheduler software should configure processors between single-threaded mode (for most ILP and DLP processes) and multithreaded mode (for processes that expose parallelism via large numbers of threads). Further, thread scheduler policies should account for the differing demands of various types of processes.

Overall, because the underlying EDGE substrate provides an abstraction that maps well to a range of different parallelism models, the architecture provides a good balance between hardware and software complexity. Hardware extensions to support ILP, DLP, and TLP are modest, and mechanisms and policies for switching among modes of computation are tractable from a software perspective.

TO CMP OR NOT TO CMP?

The semiconductor process community has done its job all too well: Computer architects face daunting and interlocking challenges. Reductions in feature size have provided tremendous opportunities by increasing transistor counts, but these advances have introduced new problems of communication delay and power consumption. We believe that finding the solution to these fundamental issues will require a major architectural shift and that EDGE architectures are a good match for meeting these challenges.

Despite the advantages that EDGE architectures offer, major ISA changes are traumatic for industry, especially given the complexity that systems have accrued since the last major shift. However, since then many institutions and companies have developed the technology to incorporate such a new architecture under the hood. For example, Transmeta's code-morphing software dynamically translates x86 instructions into VLIW code for its processors. Dynamic translation to an EDGE architecture will likely be simpler than to VLIW, making this technology a promising candidate for solving ISA backward compatibility. We are beginning work on such an effort and have already built a simple PowerPC-to-TRIPS static binary translator.

The competing approach for future systems is explicitly parallel hardware backed by parallelizing software. IBM and Sun are both moving to chip multiprocessor (CMP)7 and chip multithreading models in which each chip contains many processing cores and thread slots that exploit explicitly parallelized threads. Other research efforts, such as Smart Memories8 and Imagine9 at

Stanford and RAW10 at MIT, support DLP workloads with the copious explicit parallelism that software can obtain. This camp argues that future workloads will inevitably shift to be highly parallel and that programmers or compilers will be able to map the parallelism in tomorrow's applications onto a simple, explicitly parallel hardware substrate. Although researchers have consistently made this argument over the past 30 years, the general-purpose market has instead, every time, voted in favor of larger, more powerful uniprocessor cores. The difficulty of scaling out-of-order RISC cores, coupled with IBM's thread-rich server target market, have together driven the emergence of CMPs such as Power 4.

EDGE architectures offer an opportunity to scale the single-processor model further, while still effectively exploiting DLP and TLP when the software can discover it. However, because of the difficulty of discovering and exploiting parallelism, we expect that software will make better use of smaller numbers of more powerful processors. For example, 64 simple processors are much less desirable than four processors, each of which is eight times more powerful than the simple processors. EDGE architectures appear to offer a progressively better solution as technology scales down to the end of silicon, with each generation providing a richer spatial substrate at the expense of increased global communication delays. EDGE ISAs may also be a good match for postsilicon devices, which will likely be communication-dominated as well.

Whether EDGE architectures prove to be a compelling alternative will depend on their performance and power consumption relative to current high-end devices. The prototype TRIPS processor and compiler will help to determine whether this is the case. We expect to have reliable simulation results using optimized compiled code in mid-2004, tape-out in the beginning of 2005, and full chips running in the lab by the end of 2005. ■

Acknowledgments

We thank the following student members of the TRIPS team for their contributions as coauthors of this article: Xia Chen, Rajagopalan Desikan, Saurabh Drolia, Jon Gibson, Madhu Saravana Sibi Govindan, Paul Gratz, Heather Hanson, Changkyu Kim, Sundeep Kumar Kushwaha, Haiming Liu, Ramadass Nagarajan, Nitya Ranganathan, Eric Reeber, Karthikeyan Sankaralingam, Simha Sethumadhavan, Premkishore Sivakumar, and Aaron Smith.

This research is supported by the Defense Advanced Research Projects Agency under contracts F33615-01-C-1892 and F33615-03-C-4106, NSF infrastructure grant EIA-0303609, NSF CAREER grants CCR-9985109 and CCR-9984336, two IBM University Partnership awards, an IBM Shared University Research grant, as well as grants from the Alfred P. Sloan Foundation and the Intel Research Council.

References

1. M.S. Hrishikesh et al., "The Optimal Logic Depth per Pipeline Stage Is 6 to 8 FO4 Inverter Delays," Proc. 29th Int'l Symp. Computer Architecture (ISCA 02), IEEE CS Press, 2002, pp. 14-24.
2. S.A. Mahlke et al., "Effective Compiler Support for Predicated Execution Using the Hyperblock," Proc. 25th Ann. IEEE/ACM Int'l Symp. Microarchitecture (MICRO-25), IEEE CS Press, 1992, pp. 45-54.
3. R.A. Chowdhury et al., "The Limits of Alias Analysis for Scalar Optimizations," Proc. ACM SIGPLAN 2004 Conf. Compiler Construction, ACM Press, 2004, pp. 24-38.
4. K. Sankaralingam et al., "Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture," Proc. 30th Int'l Symp. Computer Architecture (ISCA 03), IEEE CS Press, 2003, pp. 422-433.
5. K. Sankaralingam et al., "Universal Mechanisms for Data-Parallel Architectures," Proc. 36th Ann. IEEE/ACM Int'l Symp. Microarchitecture (MICRO-36), IEEE CS Press, 2003, pp. 303-314.
6. D.M. Tullsen, S.J. Eggers, and H.M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Proc. 22nd Int'l Symp. Computer Architecture (ISCA 95), IEEE CS Press, 1995, pp. 392-403.
7. K. Olukotun et al., "The Case for a Single-Chip Multiprocessor," Proc. 6th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 94), ACM Press, 1994, pp. 2-11.
8. K. Mai et al., "Smart Memories: A Modular Reconfigurable Architecture," Proc. 27th Int'l Symp. Computer Architecture (ISCA 00), IEEE CS Press, 2000, pp. 161-171.
9. S. Rixner et al., "A Bandwidth-Efficient Architecture for Media Processing," Proc. 31st Ann. IEEE/ACM Int'l Symp. Microarchitecture (MICRO-31), IEEE CS Press, 1998, pp. 3-13.
10. E. Waingold et al., "Baring It All to Software: RAW Machines," Computer, Sept. 1997, pp. 86-93.

Doug Burger, Stephen W. Keckler, Kathryn S. McKinley, and Mike Dahlin are associate professors in the Department of Computer Sciences at the University of Texas at Austin. They are members of the ACM and senior members of the IEEE. Contact them at {dburger, skeckler, mckinley, dahlin}@cs.utexas.edu.

Lizy K. John is an associate professor in the Department of Electrical and Computer Engineering at the University of Texas at Austin. She is a member of the ACM and a senior member of the IEEE. Contact her at [email protected].

Calvin Lin is an associate professor in the Department of Computer Sciences at the University of Texas at Austin and is a member of the ACM. Contact him at [email protected].

Charles R. Moore, a senior research fellow at the University of Texas at Austin from 2002 through 2004, is currently a Senior Fellow at Advanced Micro Devices. Contact him at [email protected].

James Burrill is a research fellow in the Computer Science Department at the University of Massachusetts at Amherst. Contact him at [email protected].

Robert G. McDonald is the chief engineer for the TRIPS prototype chip at the University of Texas at Austin. Contact him at [email protected].

William Yoder is a research programmer at the University of Texas at Austin and a member of the ACM and the IEEE. Contact him at [email protected].
