
Cybersquare

A New Direction for Computer Architecture Research

Christoforos E. Kozyrakis and David A. Patterson
University of California, Berkeley

Current computer architecture research continues to have a bias for the past in that it focuses on desktop and server applications. In our view, a different computing domain—personal mobile computing—will play a significant role in driving technology in the next decade. This domain will pose a different set of requirements for microprocessors and could redirect the emphasis of computer architecture research.

Advances in technology will soon allow the integration of one billion transistors on a single chip. This is an exciting opportunity for computer architects and designers; their challenge will be to propose microprocessor designs that use this huge transistor budget efficiently and meet the requirements of future applications.

The real question is, just what are these future applications? We contend that current computer architecture research continues to have a bias for the past in that it focuses on desktop and server applications. In our view, a different computing domain—personal mobile computing—will play a significant role in driving technology in the next decade. In this paradigm, the basic personal computing and communicating devices will be portable and battery operated, and will support multimedia tasks like speech recognition. These devices will pose a different set of requirements for microprocessors and could redirect the emphasis of computer architecture research.

BILLION-TRANSISTOR PROCESSORS

Computer recently produced a special issue on “Billion-Transistor Architectures.”1 The first three articles discussed problems and trends that will affect future processor design. Seven articles from academic research groups proposed microprocessor architectures and implementations for billion-transistor chips. These proposals covered a wide architecture space, ranging from out-of-order designs to reconfigurable systems. At about the same time, Intel and Hewlett-Packard presented the basic characteristics of their next-generation IA-64 architecture, which is expected to dominate the high-performance processor market within a few years.2

24 Computer 0018-9162/98/$10.00 © 1998 IEEE

It is no surprise that most of these proposals focus on the computing domains that have shaped processor architecture for the past decade:

• The uniprocessor desktop running technical and scientific applications, and
• the multiprocessor server used for transaction processing and file-system workloads.

Table 1 summarizes the basic features of the proposed architectures presented in the Computer special issue. We also include two other processor implementations not in the special issue, the Trace and IA-64. Developers of these processors have not presented implementation details. Hence we assume that they will have billion-transistor implementations and speculate on the number of transistors they devote to on-chip memory or caches. (For descriptions of these processors, see the “Using a Billion Transistors” sidebar.)

Table 1. Number of memory transistors in the billion-transistor microprocessors.

| Architecture | Key idea | No. of memory transistors (millions) |
|---|---|---|
| Advanced superscalar | Wide-issue superscalar processor with speculative execution; multilevel on-chip caches | 910 |
| Superspeculative | Wide-issue superscalar processor with aggressive data and control speculation; multilevel on-chip caches | 820 |
| Trace | Multiple distinct cores that speculatively execute program traces; multilevel on-chip caches | 600 |
| Simultaneous multithreading | Wide superscalar with support for aggressive sharing among multiple threads; multilevel on-chip caches | 810 |
| Chip multiprocessor | Symmetric multiprocessor system; shared second-level cache | 450 |
| IA-64 | VLIW architecture with support for predicated execution and long-instruction bundling | 600 |
| Raw | Multiple processing tiles with reconfigurable logic and memory, interconnected through a reconfigurable network | 670 |

Table 1 reports the number of transistors used for caches and main memory in each billion-transistor processor. The amount varies from almost half the transistor budget to 90 percent. Interestingly, only one design, Raw, uses that budget as part of the main system memory. The majority use 50 to 90 percent of their transistor budget on caches, which help mitigate the high latency and low bandwidth of external memory. In other words, the conventional vision of future computers spends most of the billion-transistor budget on redundant, local copies of data normally found elsewhere in the system. For applications of the future, is this really our best use of a half-billion transistors?

Using a Billion Transistors

The first two architectures in Computer’s survey—the advanced superscalar and superspeculative—have very similar characteristics. The basic idea is a wide superscalar organization with multiple execution units or functional cores. These architectures use multilevel caching and aggressive prediction of data, control, and even sequences of instructions (traces) to use all the available instruction level parallelism (ILP). Due to their similarity, we group them together and call them wide superscalar processors.

The trace processor consists of multiple superscalar processing cores, each executing a trace issued by a shared instruction issue unit. It also employs trace and data prediction, and shared caches.

The simultaneous multithreading (SMT) processor uses multithreading at the granularity of instruction issue slot to maximize the use of a wide-issue, out-of-order superscalar processor. It does so at the cost of additional complexity in the issue and control logic.

The chip multiprocessor (CMP) uses the transistor budget by placing a symmetric multiprocessor on a single die. There will be eight uniprocessors on the chip, all similar to current out-of-order processors. Each uniprocessor will have separate first-level caches but share a large second-level cache and the main memory interface.

IA-64 can be considered a recent commercial reincarnation of architectures based on the very long instruction word (VLIW), now renamed explicitly parallel instruction computing. Based on the information announced thus far, its major innovations are the instruction dependence information attached to each long instruction and the support for bundling multiple long instructions. These changes attack the problem of scaling and low code density that often accompanied older VLIW machines. IA-64 also includes hardware checks for hazards and interlocks, which helps to maintain binary compatibility across chip generations. Finally, it supports predicated execution through general-purpose predication registers, which reduces control hazards.

The Raw machine is probably the most revolutionary architecture proposed, as it incorporates reconfigurable logic for general-purpose computing. The processor consists of 128 tiles, each of which has a processing core, small first-level caches backed by a larger amount of dynamic memory (128 Kbytes) used as main memory, and a reconfigurable functional unit. The tiles interconnect in a matrix fashion via a reconfigurable network. This design emphasizes the software infrastructure, compiler, and dynamic event support. This infrastructure handles the partitioning and mapping of programs on the tiles, as well as the configuration selection, data routing, and scheduling.

November 1998 25

Table 2. Evaluation of billion-transistor processors for the desktop/server domain. “Wide superscalar” includes the advanced superscalar and superspeculative processors.

| Characteristic | Wide superscalar | Trace | Simultaneous multithreading | Chip multiprocessor | IA-64 | Raw |
|---|---|---|---|---|---|---|
| SPECint04 performance | + | + | + | 0 | +/0 | 0 |
| SPECfp04 performance | + | + | + | + | + | 0 |
| TPC-F performance | 0 | 0 | + | + | 0 | − |
| Software effort | + | + | 0 | 0 | 0 | − |
| Physical-design complexity | − | 0 | − | 0 | 0 | + |

THE DESKTOP/SERVER DOMAIN

In optimizing processors and computer systems for the desktop and server domain, architects often use the popular SPECint95, SPECfp95, and TPC-C/D benchmarks. Since this computing domain will likely remain significant as billion-transistor chips become available, architects will continue to use similar benchmark suites. We playfully call such future benchmark suites “SPECint04” and “SPECfp04” for technical/scientific applications, and “TPC-F” for database workloads.

Table 2 presents our prediction of how these processors will perform on these benchmarks. We use a grading system of + for strength, 0 for neutrality, and − for weakness.

Desktop

For the desktop environment, the wide superscalar, trace, and simultaneous multithreading processors should deliver the highest performance on SPECint04. These architectures use out-of-order and advanced prediction techniques to exploit most of the available instruction level parallelism (ILP) in a single sequential program. IA-64 will perform slightly worse because very long instruction word (VLIW) compilers are not mature enough to outperform the most advanced hardware ILP techniques—those which exploit runtime information. The chip multiprocessor (CMP) and Raw will have inferior performance since research has shown that desktop applications are not highly parallelizable. CMP will still benefit from the out-of-order features of its cores.

For floating-point applications, on the other hand, parallelism and high memory bandwidth are more important than out-of-order execution; hence the simultaneous multithreading (SMT) processor and CMP will have some additional advantage.

Server

In the server domain, CMP and SMT will provide the best performance, due to their ability to use coarse-grained parallelism even with a single chip. Wide superscalar, trace, or IA-64 systems will perform worse, because current evidence indicates that out-of-order execution provides only a small benefit to online transaction processing (OLTP) applications.3 For the Raw architecture, it is difficult to predict any potential success of its software to map the parallelism of databases on reconfigurable logic and software-controlled caches.

Software effort

For any new architecture to gain wide acceptance, it must run a significant body of software.4 Thus the effort needed to port existing software or develop new software is very important. In this regard, the wide superscalar, trace, and SMT processors have the edge, since they can run existing executables. The same holds for CMP, but this architecture can deliver the highest performance only if applications are rewritten in a multithreaded or parallel fashion. As the past decade has taught us, parallel programming for high performance is neither easy nor automated.

In the same vein, IA-64 will supposedly run existing executables, but significant performance increases will require enhanced VLIW compilers. The Raw machine relies on the most challenging software development. Apart from the requirements for sophisticated routing, mapping, and runtime-scheduling tools, the Raw processor will need new compilers or libraries to make this reconfigurable design usable.

Complexity

One last issue is physical design complexity, which includes the effort devoted to the design, verification, and testing of an integrated circuit. Currently, advanced microprocessor development takes almost four years and a few hundred engineers.1,5 Testing complexity as well as functional and electrical verification efforts have grown steadily, so that these tasks now account for the majority of the processor development effort.5 Wide superscalar and multithreading processors exacerbate both problems by using complex techniques—like aggressive data/control prediction, out-of-order execution, and multithreading—and nonmodular designs (that is, many individually designed blocks).


With the IA-64 architecture, the basic challenge is the design and verification of the forwarding logic among the multiple functional units on the chip.

Designs for the CMP, trace processor, and Raw machine alleviate the problems of physical-design complexity by being more modular. CMP includes multiple copies of the same uniprocessor. Yet, it carries on the complexity of current out-of-order designs with support for cache coherency and multiprocessor communication. The trace processor uses replicated processing elements to reduce complexity. Still, trace prediction and issue involve intratrace dependence checking, register remapping, and intraelement forwarding. Such features account for a significant portion of the complexity in wide superscalar designs. Similarly, the Raw design requires the design and replication of a single processing tile and network switch. Verification of a reconfigurable organization is trivial in terms of the circuits, but verification of the mapping software is also required, which is often not trivial.

The conclusion we can draw from Table 2 is that the proposed billion-transistor processors have been optimized for such a computing environment, and that most of them promise impressive performance. The only concern for the future is the design complexity of these architectures.

A NEW TARGET: PERSONAL MOBILE COMPUTING

In the past few years, technology drivers changed significantly. High-end systems alone used to direct the evolution of computing. Now, low-end systems drive technology, due to their large volume and attendant profits. Within this environment, two important trends have evolved that could change the shape of computing.

The first new trend is that of multimedia applications. Recent improvements in circuit technology and innovations in software development have enabled the use of real-time data types like video, speech, animation, and music. These dynamic data types greatly improve the usability, quality, productivity, and enjoyment of PCs.6 Functions like 3D graphics, video, and visual imaging are already included in the most popular applications, and it is common knowledge that their influence on computing will only increase:

• 90 percent of desktop cycles will be spent on “media” applications by 2000;7
• multimedia workloads will continue to increase in importance;1 and
• image, handwriting, and speech recognition will pose other major challenges.5

The second trend is the growing popularity of portable computing and communication devices. Inexpensive gadgets that are small enough to fit in a pocket—personal digital assistants (PDA), palmtop computers, Web phones, and digital cameras—are joining the ranks of notebook computers, cellular phones, pagers, and video games. Such devices now support a constantly expanding range of functions, and multiple devices are converging into a single unit. This naturally leads to greater demand for computing power, but at the same time, the size, weight, and power consumption of these devices must remain constant. For example, a typical PDA is five to eight inches by three inches, weighs six to twelve ounces, has two to eight Mbytes of memory (ROM/RAM), and is expected to run on the same set of batteries for a few days to a few weeks.8 Developers have provided a large software, operating system, and networking infrastructure (wireless modems, infrared communications, and so forth) for such devices. Windows CE and the PalmPilot development environment are prime examples.8

Figure 1. Future personal mobile-computing devices will incorporate the functionality of several current portable devices.

A new application domain

These two trends—multimedia applications and portable electronics—will lead to a new application domain and market in the near future.9 In the personal mobile-computing environment, there will be a single personal computation and communication device, small enough to carry all the time. This device will incorporate the functions of the pager, cellular phone, laptop computer, PDA, digital camera, video game, calculator, and remote control, as shown in Figure 1. Its most important feature will be the interface and interaction with the user: Voice and image input and


output (speech and pattern recognition) will be key functions. Consumers will use these devices to take notes, scan documents, and check the surroundings for specific objects.9 A wireless infrastructure for sporadic connectivity will support services like networking (to the Web and for e-mail), telephony, and global positioning system (GPS) information. The device will be fully functional even in the absence of network connectivity.

Potentially, such devices would be all a person needs to perform tasks ranging from note taking to making an online presentation, and from Web browsing to VCR programming. The numerous uses of such devices and the potentially large volume lead many to expect that this computing domain will soon become at least as significant as desktop computing is today.

A different type of microprocessor

The microprocessor needed for these personal mobile-computing devices is actually a merged general-purpose processor and digital-signal processor (DSP) with the power budget of the latter. Such microprocessors must meet four major requirements:

• high performance for multimedia functions,
• energy and power efficiency,
• small size, and
• low design complexity.

The basic characteristics of media-centric applications that a processor needs to support were specified by Keith Diefendorff and Pradeep Dubey:6

• Real-time response. Instead of maximum peak performance, processors must provide worst case guaranteed performance that is sufficient for real-time qualitative perception in applications like video.
• Continuous-media data types. Media functions typically involve processing a continuous stream of input (which is discarded once it is too old) and continuously sending the results to a display or speaker. Thus, temporal locality in data memory accesses—the assumption behind 15 years of innovation in conventional memory systems—no longer holds. Remarkably, data caches may well be an obstacle to high performance for continuous-media data types. This data is typically narrow (pixel images and sound samples are 8 to 16 bits wide, rather than the 32- or 64-bit data of desktop machines). The capability to perform multiple operations on such data types in a single wide data path is desirable.
• Fine-grained parallelism. Functions like image, voice, and signal processing require performing the same operation across sequences of data in a vector or SIMD (single-instruction, multiple-data) fashion.
• Coarse-grained parallelism. In many media applications, a pipeline of functions process a single stream of data to produce the end result.
• High instruction reference locality. Media functions usually have small kernels or loops that dominate the processing time and demonstrate high temporal and spatial locality for instructions.
• High memory bandwidth. Applications like 3D graphics require huge memory bandwidth for large data sets that have limited locality.
• High network bandwidth. Streaming data—like video or images from external sources—requires high network and I/O bandwidth.

With a budget of much less than two watts for the whole device, the processor must be designed with a power target of less than one watt. Yet, it must still provide high performance for functions like speech recognition. Power budgets close to those of current high-performance microprocessors (tens of watts) are unacceptable for portable, battery-operated devices. Such devices should be able to execute functions at the minimum possible energy cost.

After energy efficiency and multimedia support, the third main requirement for personal mobile computers is small size and weight. The desktop assumption of several chips for external cache and many more for main memory is infeasible for PDAs; integrated solutions that reduce chip count are highly desirable. A related matter is code size, because PDAs will have limited memory to minimize cost and size. The size of executables is therefore important.

A final concern is design complexity—a concern in the desktop domain as well—and scalability. An architecture should scale efficiently not only in terms of performance but also in terms of physical design. Many designers consider long interconnects for on-chip communication to be a limiting factor for future processors. Long interconnects should therefore be avoided, since only a small region of the chip will be accessible in a single clock cycle.10

Processor evaluation

Table 3 summarizes our evaluation of the billion-transistor architectures with respect to personal mobile computing. As the table shows, most of these architectures offer only limited support for personal mobile computing.

Real-time response. Because they use out-of-order techniques and caches, these processors deliver quite unpredictable performance, which makes it difficult to guarantee real-time response.


Table 3. Evaluation of billion-transistor processors for the personal mobile-computing domain. “Wide superscalar” includes the advanced superscalar and superspeculative processors.

| Characteristic | Wide superscalar | Trace | Simultaneous multithreading | Chip multiprocessor | IA-64 | Raw | Comments |
|---|---|---|---|---|---|---|---|
| Real-time response | − | − | 0 | 0 | 0 | 0 | Out-of-order execution, branch prediction, and/or caching techniques make execution unpredictable. |
| Continuous data types | 0 | 0 | 0 | 0 | 0 | 0 | Caches do not efficiently support data streams with little locality. |
| Fine-grained parallelism | 0 | 0 | 0 | 0 | 0 | + | MMX-like extensions are less efficient than full vector support. Reconfigurable logic can use fine-grained parallelism. |
| Coarse-grained parallelism | 0 | 0 | + | + | 0 | + | |
| Code size | 0 | 0 | 0 | 0 | − | 0 | For the first four architectures, the use of loop unrolling and software pipelining may increase code size. IA-64’s VLIW instructions and Raw’s hardware configuration bits lead to larger code sizes. |
| Memory bandwidth | 0 | 0 | 0 | 0 | 0 | 0 | Cache-based designs interfere with high memory bandwidth for streaming data. |
| Energy/power efficiency | − | − | − | 0 | 0 | − | Out-of-order execution schemes, complex issue logic, forwarding, and reconfigurable logic have power penalties. |
| Physical-design complexity | − | 0 | − | 0 | 0 | + | |
| Design scalability | − | 0 | − | 0 | 0 | 0 | Long wires—for forwarding data or for reconfigurable interconnects—limit scalability. |

Continuous-media data types. Hardware-controlled caches complicate support for continuous-media data types.

Parallelism. The billion-transistor architectures exploit fine-grained parallelism by using MMX-like multimedia extensions or reconfigurable execution units. Multimedia extensions expose data alignment issues to the software and restrict the number of vector or SIMD elements operated on by each instruction. These two factors limit the usability and scalability of multimedia extensions. Coarse-grained parallelism, on the other hand, is best on the simultaneous multithreading, CMP, and Raw architectures.

Code size. Instruction reference locality has traditionally been exploited through large instruction caches. Yet designers of portable systems would prefer reduced code size, as suggested by the 16-bit instruction sets of MIPS and ARM. Code size is a weakness for IA-64 and any other architecture that relies heavily on loop unrolling for performance, as such code will surely be larger than that of 32-bit RISC machines. Raw may also have code size problems, as programmers must “program” the reconfigurable portion of each data path. The code size penalty of the other designs will likely depend on how much they exploit loop unrolling and in-line procedures to achieve high performance.

Memory bandwidth. Cache-based architectures have limited memory bandwidth. Limited bandwidth becomes a more acute problem in the presence of multiple data sequences (with little locality) streaming through a system—exactly the case with most multimedia data. The potential use of streaming buffers and cache bypassing would help sequential bandwidth, but such techniques fail to address the bandwidth requirements of indexed or random accesses. In addition, it would be embarrassing to rely on cache bypassing for performance when a design dedicates 50 to 90 percent of the transistor budget to caches.

Energy/power efficiency. Despite the importance to both portable and desktop domains,1 most designs do not address the energy/power efficiency issue. Several characteristics of the billion-transistor processors increase the energy consumption of a single task and the power the processor requires. These characteristics include redundant computation for out-of-order models, complex issue and dependence analysis logic, fetching a large number of instructions for a single loop, forwarding across long wires, and the use of typically power-hungry reconfigurable logic.

Design scalability. As for physical design scalability, forwarding results across large chips or communication among multiple cores or tiles is the main problem of most billion-transistor designs. Such communication already requires multiple cycles in several high-performance, out-of-order designs. Simple pipelining of long interconnects is not a sufficient solution, as it exposes the timing of forwarding or communication to the scheduling logic or software. It also increases complexity.

The conclusion to draw from Table 3 is that the proposed processors fail to meet many of the requirements of the new computing model. Which begs the question, what design will?

VECTOR IRAM

Vector IRAM,1 the architecture proposed by our research group, is our first attempt to design an architecture and implementation that match the requirements of the mobile personal environment.

We base vector IRAM on two main ideas: vector processing and the integration of logic and DRAM on a single chip. The former addresses many demands of multimedia processing, and the latter addresses the energy efficiency, size, and weight demands of portable devices. Vector IRAM is not the last word on computer architecture research for mobile multimedia applications, but we hope it proves a promising initial step.

The vector IRAM processor consists of an in-order, dual-issue scalar core with first-level caches, tightly integrated with a vector execution unit that contains eight pipelines. Each pipeline can support parallel operations on multiple media types and DSP functions. The memory system consists of 96 Mbytes of DRAM used as main memory. It is organized in a hierarchical fashion with 16 banks and eight sub-banks per bank, connected to the scalar and vector unit through a crossbar. This memory system provides sufficient sequential and random bandwidth even for demanding applications.

External I/O is brought directly to the on-chip memory through high-speed serial lines (instead of parallel buses), operating in the Gbit/s range. From a programming point of view, vector IRAM is comparable to a vector or SIMD microprocessor.

Table 4. Evaluation of vector IRAM for two computing domains.

| Domain | Characteristic | Rating |
|---|---|---|
| Desktop/server computing | SPECint04 | − |
| | SPECfp04 | + |
| | TPC-F | 0 |
| | Software effort | 0 |
| | Physical-design complexity | 0 |
| Personal mobile computing | Real-time response | + |
| | Continuous data types | + |
| | Fine-grained parallelism | + |
| | Coarse-grained parallelism | 0 |
| | Code size | + |
| | Memory bandwidth | + |
| | Energy/power efficiency | + |
| | Design scalability | 0 |

Comparison with billion-transistor architectures

We asked reviewers—a group that included most of the architects of the billion-transistor architectures—to grade vector IRAM. Table 4 presents the median grades they gave vector IRAM for the two computing domains—desktop/server and mobile personal computing. We requested reviews, comments, and grades from all the architects of the processors in Computer’s September 1997 special issue, and although some were too busy, most were kind enough to respond.

Obviously, vector IRAM is not competitive within the desktop/server domain. Indeed, this weakness in the conventional computing domain is probably the main reason some are skeptical of the importance of merged logic-DRAM technology. As measured by SPECint04, we do not expect vector processing to benefit integer applications. Floating-point intensive applications, on the other hand, are highly vectorizable. All applications would still benefit from the low memory latency and high memory bandwidth.

In the server domain, we expect vector IRAM to perform poorly due to its limited on-chip memory. A potentially different evaluation for the server domain could arise if we examine workloads for decision support


instead of online transaction processing.11 In decision support, small code loops with highly data-parallel operations dominate execution time. Architectures like vector IRAM and Raw should thus perform significantly better on decision support than on OLTP workloads.

In terms of software effort, vectorizing compilers have been developed and used in commercial environments for decades. Additional work is required to tune such compilers for multimedia workloads and make DSP features and data types accessible through high-level languages. In this case, compiler-delivered MIPS/watt is the proper figure of merit.

As for design complexity, vector IRAM is a highly modular design. The necessary building blocks are the in-order scalar core, the vector pipeline (which is replicated eight times), and the basic memory array tile. The lack of dependencies within a vector instruction and the in-order paradigm should also reduce the verification effort for vector IRAM.

The open questions are what complications arise from merging high-speed logic with DRAM, and what is the resulting impact on cost, yield, and testing. Many DRAM companies are investing in merged logic-DRAM fabrication lines, and many companies are exploring products in this area. Also, our project sent a nine-million-transistor test chip to fabrication this fall. It will contain several key circuits of vector IRAM in a merged logic-DRAM process. We expect the answer to these questions to be clearer in the next year. Unlike the other billion-transistor proposals, vector IRAM's challenge is the implementation technology rather than the microarchitectural design.

Support for personal mobile computing
As mentioned earlier, vector IRAM is a good match to the personal mobile-computing model. The design is in-order and does not rely on data caches, making the delivered performance highly predictable—a key quality in supporting real-time applications. The vector model is superior to limited, MMX-like, SIMD extensions for several reasons:

• It provides explicit control of the number of elements each instruction operates on.
• It allows scaling of the number of elements each instruction operates on without changing the instruction set architecture.
• It does not expose data packing and alignment to software, thereby reducing code size and increasing performance.
• A single design database can produce chips of varying cost-performance ratios.

Since most media-processing functions are based on algorithms working on vectors of pixels or samples, it's not surprising that a vector unit can deliver the highest performance. The presence of short vectors in some applications does not pose a performance problem if the start-up time of each vector instruction is pipelined and chaining (the equivalent of forwarding in vector processors) is supported, as we intend.

The code size of programs written for vector processors is small compared to that of other architectures. This compactness is possible because a single vector instruction can specify whole loops.

Memory bandwidth, both sequential and random, is available from the on-chip hierarchical DRAM.

Vector IRAM includes a number of critical DSP features like high-speed multiply accumulates, which it achieves through instruction chaining. Vector IRAM also provides auto-increment addressing (a special addressing mode often found in DSP chips) through strided vector memory accesses.

We designed vector IRAM to be programmed in high-level languages, unlike most DSP architectures. We did this by avoiding features like circular or bit-reversed addressing and complex instructions that make DSP processors poor compiler targets.12 Like all general-purpose processors, vector IRAM provides virtual memory support, which DSP processors do not offer.

We expect vector IRAM to have high energy efficiency as well. Each vector instruction specifies a large number of independent operations. Hence, no energy needs to be wasted for fetching and decoding multiple instructions or checking dependencies and making various predictions. In addition, the execution model is strictly in order. Vector processors need only limited forwarding within each pipeline (for chaining) and do not require chaining to occur within a single clock cycle. Hence, designers can keep the control logic simple and power efficient, and eliminate most long wires for forwarding.

Performance comes from multiple vector pipelines working in parallel on the same vector operation, as well as from high-frequency operation. This allows the same performance at lower clock rates—and lower voltages—as long as we add more functional units. In CMOS logic, energy increases with the square of the voltage, so such trade-offs can dramatically improve energy efficiency. DRAM has been traditionally optimized for low power, and the hierarchical structure provides the ability to activate just the sub-banks containing the necessary data.

As for physical-design scalability, the processor-memory crossbar is the only place where vector IRAM uses long wires. Still, the vector model can tolerate latency if sufficient fine-grained parallelism is available. So deep pipelining is a viable solution without any hardware or software complications in this environment.
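The contrast with MMX-like extensions can be made concrete with a small strip-mining sketch in C. Everything here is illustrative: the constant `MVL` and the function `vadd` are our own stand-ins, not part of the vector IRAM instruction set; the outer loop models a stream of vector instructions whose length is set per instruction, as the first two bullets above describe.

```c
#include <stddef.h>

/* Hypothetical hardware maximum vector length. An implementation could
 * provide 8, 64, or 256 elements per instruction; the code below does
 * not change, only this hardware property does. */
#define MVL 64

/* Strip-mined vector add: each outer-loop iteration models one vector
 * instruction operating on up to MVL elements, and the inner loop
 * stands for the elementwise work that instruction does in hardware. */
void vadd(const short *a, const short *b, short *c, size_t n)
{
    size_t i = 0;
    while (i < n) {
        size_t vl = (n - i < MVL) ? (n - i) : MVL;  /* set vector length */
        for (size_t j = 0; j < vl; j++)
            c[i + j] = (short)(a[i + j] + b[i + j]);
        i += vl;
    }
}
```

Because `MVL` is a property of the implementation rather than of the code, a chip with longer vector registers simply executes fewer instructions for the same binary. An MMX-style ISA, which fixes the element count and packing in the opcode, would instead require the loop to be rewritten for each wider datapath.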

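The voltage argument admits a quick back-of-the-envelope check. The sketch below uses the standard first-order model of dynamic CMOS power (proportional to V^2 times frequency per functional unit); the 0.6 supply figure is our illustrative assumption, not a number from the article.

```c
/* First-order model of dynamic CMOS power: P ~ units * V^2 * f, with
 * the switched capacitance per unit folded into the proportionality
 * constant. All arguments are relative to a baseline of 1.0. */
double rel_power(double units, double voltage, double freq)
{
    return units * voltage * voltage * freq;
}

/* Baseline: one vector pipeline at nominal voltage and clock:
 *   rel_power(1.0, 1.0, 1.0) = 1.0
 * Doubling the pipelines and halving the clock keeps element
 * throughput constant; if the relaxed cycle time lets the supply drop
 * to 0.6 of nominal (an assumed figure), the datapath draws
 *   rel_power(2.0, 0.6, 0.5) = 0.36
 * of the baseline power at the same performance. */
```

Under this model, adding functional units and lowering voltage delivers equal performance at roughly a third of the power, which is the trade-off the text relies on for energy efficiency.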
November 1998 31

For almost two decades, architecture research has focused on desktop or server machines. As a result, today's microprocessors are 1,000 times faster for desktop applications. Nevertheless, in doing so, we are designing processors of the future with a heavy bias toward the past. To design successful processor architectures for the future, we first need to explore future applications and match their requirements in a scalable, cost-effective way.

Personal mobile computing offers a vision of the future with a much richer and more exciting set of architecture research challenges than extrapolations of the current desktop architectures and benchmarks. Vector IRAM is an initial approach in this direction. Put another way, which problem would you rather work on: improving performance of PCs running FPPPP—a 1982 Fortran benchmark used in SPECfp95—or making speech input practical for PDAs? It is time for some of us in the very successful computer architecture community to investigate architectures with a heavy bias for the future. ❖

Acknowledgments
The ideas and opinions presented here are from discussions within the IRAM group at UC Berkeley. In addition, we thank the following for their useful feedback, comments, and criticism on earlier drafts, as well as their grades for vector IRAM: Anant Agarwal, Jean-Loup Baer, Gordon Bell, Pradeep Dubey, Lance Hammond, Wen-Hann Wang, John Hennessy, Mark Hill, John Kubiatowicz, Corinna Lee, Henry Levy, Doug Matzke, Kunle Olukotun, Jim Smith, and Gurindar Sohi.

This research is supported by DARPA (DABT63-C-0056), the California State MICRO program, NSF (CDA-9401156), and by research grants from LG Semicon, Hitachi, Intel, Microsoft, Sandcraft, SGI/Cray, Sun Microsystems, and TSMC.

References
1. D. Burger and J. Goodman, "Billion-Transistor Architectures," Computer, Sept. 1997, pp. 46-47.
2. J. Crawford and J. Huck, "Motivations and Design Approach for the IA-64 64-bit Instruction Set Architecture," Microprocessor Forum, Micro Design Resources, 1997.
3. K. Keeton et al., "Performance Characterization of the Quad Pentium Pro SMP Using OLTP Workloads," Proc. 1998 Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., 1998, pp. 15-26.
4. J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann, San Francisco, 1996.
5. J. Wilson et al., "Challenges and Trends in Processor Design," Computer, Jan. 1998, pp. 39-50.
6. K. Diefendorff and P. Dubey, "How Multimedia Workloads Will Change Processor Design," Computer, Sept. 1997, pp. 43-45.
7. W. Dally, "Tomorrow's Computing Engines," keynote speech, Fourth Int'l Symp. High-Performance Computer Architecture, Feb. 1998.
8. T. Lewis, "Information Appliances: Gadget Netopia," Computer, Jan. 1998, pp. 59-66.
9. G. Bell and J. Gray, Beyond Calculation: The Next 50 Years of Computing, Springer-Verlag, Feb. 1997.
10. D. Matzke, "Will Physical Scalability Sabotage Performance Gains?" Computer, Sept. 1997, pp. 37-39.
11. P. Trancoso et al., "The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors," Proc. Third Int'l Symp. High-Performance Computer Architecture, IEEE CS Press, Los Alamitos, Calif., 1997, pp. 250-260.
12. J. Eyre and J. Bier, "DSP Processors Hit the Mainstream," Computer, Aug. 1998, pp. 51-59.

Christoforos E. Kozyrakis is currently pursuing a PhD degree in computer science at the University of California, Berkeley. Prior to that, he was with ICS-FORTH, Greece, working on the design of single-chip high-speed network routers. His research interests include microprocessor architecture and design, memory hierarchies, and digital VLSI systems. Kozyrakis received a BSc degree in computer science from the University of Crete, Greece. He is a member of the IEEE and the ACM.

David A. Patterson teaches computer architecture at the University of California, Berkeley, and holds the Pardee Chair of Computer Science. At Berkeley, he led the design and implementation of RISC I, likely the first VLSI reduced instruction set computer. He was also a leader of the Redundant Arrays of Inexpensive Disks (RAID) project, which led to the high-performance storage systems produced by many companies. Patterson received a PhD in computer science from the University of California, Los Angeles. He is a fellow of the IEEE Computer Society and the ACM, and is a member of the National Academy of Engineering.

Contact the authors at the Computer Science Division, University of California, Berkeley, CA 94720-1776; {kozyraki, patterson}@cs.berkeley.edu.