
Cybersquare

A New Direction for Computer Architecture Research

Christoforos E. Kozyrakis and David A. Patterson
University of California, Berkeley

Current computer architecture research continues to have a bias for the past in that it focuses on desktop and server applications. In our view, a different computing domain—personal mobile computing—will play a significant role in driving technology in the next decade. This domain will pose a different set of requirements for microprocessors and could redirect the emphasis of computer architecture research.

Advances in technology will soon allow the integration of one billion transistors on a single chip. This is an exciting opportunity for computer architects and designers; their challenge will be to propose microprocessor designs that use this huge transistor budget efficiently and meet the requirements of future applications.

The real question is, just what are these future applications? We contend that current computer architecture research continues to have a bias for the past in that it focuses on desktop and server applications. In our view, a different computing domain—personal mobile computing—will play a significant role in driving technology in the next decade. In this paradigm, the basic personal computing and communicating devices will be portable and battery operated, and will support multimedia tasks like speech recognition. These devices will pose a different set of requirements for microprocessors and could redirect the emphasis of computer architecture research.

BILLION-TRANSISTOR PROCESSORS

Computer recently produced a special issue on “Billion-Transistor Architectures.”1 The first three articles discussed problems and trends that will affect future processor design. Seven articles from academic research groups proposed microprocessor architectures and implementations for billion-transistor chips. These proposals covered a wide architecture space, ranging from out-of-order designs to reconfigurable systems. At about the same time, Intel and Hewlett-Packard presented the basic characteristics of their next-generation IA-64 architecture, which is expected to dominate the high-performance processor market within a few years.2

24 Computer 0018-9162/98/$10.00 © 1998 IEEE

It is no surprise that most of these proposals focus on the computing domains that have shaped processor architecture for the past decade:

• The uniprocessor desktop running technical and scientific applications, and
• the multiprocessor server used for transaction processing and file-system workloads.

Table 1 summarizes the basic features of the proposed architectures presented in the Computer special issue. We also include two other processor implementations not in the special issue, the Trace and IA-64. Developers of these processors have not presented implementation details. Hence we assume that they will have billion-transistor implementations and speculate on the number of transistors they devote to on-chip memory or caches. (For descriptions of these processors, see the “Using a Billion Transistors” sidebar.)

Table 1. Number of memory transistors in the billion-transistor microprocessors.

| Architecture | Key idea | No. of memory transistors (millions) |
|---|---|---|
| Advanced superscalar | Wide-issue superscalar processor with speculative execution; multilevel on-chip caches | 910 |
| Superspeculative | Wide-issue superscalar processor with aggressive data and control speculation; multilevel on-chip caches | 820 |
| Trace | Multiple distinct cores that speculatively execute program traces; multilevel on-chip caches | 600 |
| Simultaneous multithreading | Wide superscalar with support for aggressive sharing among multiple threads; multilevel on-chip caches | 810 |
| Chip multiprocessor | Symmetric multiprocessor system; shared second-level cache | 450 |
| IA-64 | VLIW architecture with support for predicated execution and long-instruction bundling | 600 |
| Raw | Multiple processing tiles with reconfigurable logic and memory, interconnected through a reconfigurable network | 670 |

Table 1 reports the number of transistors used for caches and main memory in each billion-transistor processor. The amount varies from almost half the transistor budget to 90 percent. Interestingly, only one design, Raw, uses that budget as part of the main system memory. The majority use 50 to 90 percent of their transistor budget on caches, which help mitigate the high latency and low bandwidth of external memory. In other words, the conventional vision of future computers spends most of the billion-transistor budget on redundant, local copies of data normally found elsewhere in the system. For applications of the future, is this really our best use of a half-billion transistors?

Using a Billion Transistors

The first two architectures in Computer’s survey—the advanced superscalar and superspeculative—have very similar characteristics. The basic idea is a wide superscalar organization with multiple execution units or functional cores. These architectures use multilevel caching and aggressive prediction of data, control, and even sequences of instructions (traces) to use all the available instruction level parallelism (ILP). Due to their similarity, we group them together and call them wide superscalar processors.

The trace processor consists of multiple superscalar processing cores, each executing a trace issued by a shared instruction issue unit. It also employs trace and data prediction, and shared caches.

The simultaneous multithreading (SMT) processor uses multithreading at the granularity of instruction issue slot to maximize the use of a wide-issue, out-of-order superscalar processor. It does so at the cost of additional complexity in the issue and control logic.

The chip multiprocessor (CMP) uses the transistor budget by placing a symmetric multiprocessor on a single die. There will be eight uniprocessors on the chip, all similar to current out-of-order processors. Each uniprocessor will have separate first-level caches but share a large second-level cache and the main memory interface.

IA-64 can be considered a recent commercial reincarnation of architectures based on the very long instruction word (VLIW), now renamed explicitly parallel instruction computing. Based on the information announced thus far, its major innovations are the instruction dependence information attached to each long instruction and the support for bundling multiple long instructions. These changes attack the problem of scaling and low code density that often accompanied older VLIW machines. IA-64 also includes hardware checks for hazards and interlocks, which helps to maintain binary compatibility across chip generations. Finally, it supports predicated execution through general-purpose predication registers, which reduces control hazards.

The Raw machine is probably the most revolutionary architecture proposed, as it incorporates reconfigurable logic for general-purpose computing. The processor consists of 128 tiles, each of which has a processing core, small first-level caches backed by a larger amount of dynamic memory (128 Kbytes) used as main memory, and a reconfigurable functional unit. The tiles interconnect in a matrix fashion via a reconfigurable network. This design emphasizes the software infrastructure, compiler, and dynamic event support. This infrastructure handles the partitioning and mapping of programs on the tiles, as well as the configuration selection, data routing, and scheduling.

November 1998 25

Table 2. Evaluation of billion-transistor processors for the desktop/server domain. “Wide superscalar” includes the advanced superscalar and superspeculative processors.

| Characteristic | Wide superscalar | Trace | Simultaneous multithreading | Chip multiprocessor | IA-64 | Raw |
|---|---|---|---|---|---|---|
| SPECint04 performance | + | + | + | 0 | +/0 | 0 |
| SPECfp04 performance | + | + | + | + | + | 0 |
| TPC-F performance | 0 | 0 | + | + | 0 | − |
| Software effort | + | + | 0 | 0 | 0 | − |
| Physical-design complexity | − | 0 | − | 0 | 0 | + |

THE DESKTOP/SERVER DOMAIN

In optimizing processors and computer systems for the desktop and server domain, architects often use the popular SPECint95, SPECfp95, and TPC-C/D benchmarks. Since this computing domain will likely remain significant as billion-transistor chips become available, architects will continue to use similar benchmark suites. We playfully call such future benchmark suites “SPECint04” and “SPECfp04” for technical/scientific applications, and “TPC-F” for database workloads.

Table 2 presents our prediction of how these processors will perform on these benchmarks. We use a grading system of + for strength, 0 for neutrality, and − for weakness.

Desktop

For the desktop environment, the wide superscalar, trace, and simultaneous multithreading processors should deliver the highest performance on SPECint04. These architectures use out-of-order and advanced prediction techniques to exploit most of the available instruction level parallelism (ILP) in a single sequential program. IA-64 will perform slightly worse because very long instruction word (VLIW) compilers are not mature enough to outperform the most advanced hardware ILP techniques—those which exploit runtime information. The chip multiprocessor (CMP) and Raw will have inferior performance since research has shown that desktop applications are not highly parallelizable. CMP will still benefit from the out-of-order features of its cores.

For floating-point applications, on the other hand, parallelism and high memory bandwidth are more important than out-of-order execution; hence the simultaneous multithreading (SMT) processor and CMP will have some additional advantage.

Server

In the server domain, CMP and SMT will provide the best performance, due to their ability to use coarse-grained parallelism even with a single chip. Wide superscalar, trace, or IA-64 systems will perform worse, because current evidence indicates that out-of-order execution provides only a small benefit to online transaction processing (OLTP) applications.3 For the Raw architecture, it is difficult to predict any potential success of its software to map the parallelism of databases on reconfigurable logic and software-controlled caches.

Software effort

For any new architecture to gain wide acceptance, it must run a significant body of software.4 Thus the effort needed to port existing software or develop new software is very important. In this regard, the wide superscalar, trace, and SMT processors have the edge, since they can run existing executables. The same holds for CMP, but this architecture can deliver the highest performance only if applications are rewritten in a multithreaded or parallel fashion. As the past decade has taught us, parallel programming for high performance is neither easy nor automated.

In the same vein, IA-64 will supposedly run existing executables, but significant performance increases will require enhanced VLIW compilers. The Raw machine relies on the most challenging software development. Apart from the requirements for sophisticated routing, mapping, and runtime-scheduling tools, the Raw processor will need new compilers or libraries to make this reconfigurable design usable.

Complexity

One last issue is physical design complexity, which includes the effort devoted to the design, verification, and testing of an integrated circuit. Currently, advanced microprocessor development takes almost four years and a few hundred engineers.1,5 Testing complexity as well as functional and electrical verification efforts have grown steadily, so that these tasks now account for the majority of the processor development effort.5 Wide superscalar and multithreading processors exacerbate both problems by using complex techniques—like aggressive data/control prediction, out-of-order execution, and multithreading—and nonmodular designs (that is, many individually designed blocks).


With the IA-64 architecture, the basic challenge is the design and verification of the forwarding logic among the multiple functional units on the chip.

Designs for the CMP, trace processor, and Raw machine alleviate the problems of physical-design complexity by being more modular. CMP includes multiple copies of the same uniprocessor. Yet, it carries on the complexity of current out-of-order designs with support for cache coherency and multiprocessor communication. The trace processor uses replicated processing elements to reduce complexity. Still, trace prediction and issue involve intratrace dependence checking, register remapping, and intraelement forwarding. Such features account for a significant portion of the complexity in wide superscalar designs. Similarly, the Raw design requires the design and replication of a single processing tile and network switch. Verification of a reconfigurable organization is trivial in terms of the circuits, but verification of the mapping software is also required, which is often not trivial.

The conclusion we can draw from Table 2 is that the proposed billion-transistor processors have been optimized for such a computing environment, and that most of them promise impressive performance. The only concern for the future is the design complexity of these architectures.

A NEW TARGET: PERSONAL MOBILE COMPUTING

In the past few years, technology drivers changed significantly. High-end systems alone used to direct the evolution of computing. Now, low-end systems drive technology, due to their large volume and attendant profits. Within this environment, two important trends have evolved that could change the shape of computing.

The first new trend is that of multimedia applications. Recent improvements in circuit technology and innovations in software development have enabled the use of real-time data types like video, speech, animation, and music. These dynamic data types greatly improve the usability, quality, productivity, and enjoyment of PCs.6 Functions like 3D graphics, video, and visual imaging are already included in the most popular applications, and it is common knowledge that their influence on computing will only increase:

• 90 percent of desktop cycles will be spent on “media” applications by 2000;7
• multimedia workloads will continue to increase in importance;1 and
• image, handwriting, and speech recognition will pose other major challenges.5

The second trend is the growing popularity of portable computing and communication devices. Inexpensive gadgets that are small enough to fit in a pocket—personal digital assistants (PDA), palmtop computers, Web phones, and digital cameras—are joining the ranks of notebook computers, cellular phones, pagers, and video games. Such devices now support a constantly expanding range of functions, and multiple devices are converging into a single unit. This naturally leads to greater demand for computing power, but at the same time, the size, weight, and power consumption of these devices must remain constant. For example, a typical PDA is five to eight inches by three inches, weighs six to twelve ounces, has two to eight Mbytes of memory (ROM/RAM), and is expected to run on the same set of batteries for a few days to a few weeks.8 Developers have provided a large software, operating system, and networking infrastructure (wireless modems, infrared communications, and so forth) for such devices. Windows CE and the PalmPilot development environment are prime examples.8

Figure 1. Future personal mobile-computing devices will incorporate the functionality of several current portable devices.

A new application domain

These two trends—multimedia applications and portable electronics—will lead to a new application domain and market in the near future.9 In the personal mobile-computing environment, there will be a single personal computation and communication device, small enough to carry all the time. This device will incorporate the functions of the pager, cellular phone, laptop computer, PDA, digital camera, video game, calculator, and remote control, as shown in Figure 1. Its most important feature will be the interface and interaction with the user: Voice and image input and


output (speech and pattern recognition) will be key functions. Consumers will use these devices to take notes, scan documents, and check the surroundings for specific objects.9 A wireless infrastructure for sporadic connectivity will support services like networking (to the Web and for e-mail), telephony, and global positioning system (GPS) information. The device will be fully functional even in the absence of network connectivity.

Potentially, such devices would be all a person needs to perform tasks ranging from note taking to making an online presentation, and from Web browsing to VCR programming. The numerous uses of such devices and the potentially large volume lead many to expect that this computing domain will soon become at least as significant as desktop computing is today.

A different type of microprocessor

The microprocessor needed for these personal mobile-computing devices is actually a merged general-purpose processor and digital-signal processor (DSP) with the power budget of the latter. Such microprocessors must meet four major requirements:

• high performance for multimedia functions,
• energy and power efficiency,
• small size, and
• low design complexity.

The basic characteristics of media-centric applications that a processor needs to support were specified by Keith Diefendorff and Pradeep Dubey:6

• Real-time response. Instead of maximum peak performance, processors must provide worst case guaranteed performance that is sufficient for real-time qualitative perception in applications like video.
• Continuous-media data types. Media functions typically involve processing a continuous stream of input (which is discarded once it is too old) and continuously sending the results to a display or speaker. Thus, temporal locality in data memory accesses—the assumption behind 15 years of innovation in conventional memory systems—no longer holds. Remarkably, data caches may well be an obstacle to high performance for continuous-media data types. This data is typically narrow (pixel images and sound samples are 8 to 16 bits wide, rather than the 32- or 64-bit data of desktop machines). The capability to perform multiple operations on such data types in a single wide data path is desirable.
• Fine-grained parallelism. Functions like image, voice, and signal processing require performing the same operation across sequences of data in a vector or SIMD (single-instruction, multiple-data) fashion.
• Coarse-grained parallelism. In many media applications, a pipeline of functions process a single stream of data to produce the end result.
• High instruction reference locality. Media functions usually have small kernels or loops that dominate the processing time and demonstrate high temporal and spatial locality for instructions.
• High memory bandwidth. Applications like 3D graphics require huge memory bandwidth for large data sets that have limited locality.
• High network bandwidth. Streaming data—like video or images from external sources—requires high network and I/O bandwidth.

With a budget of much less than two watts for the whole device, the processor must be designed with a power target of less than one watt. Yet, it must still provide high performance for functions like speech recognition. Power budgets close to those of current high-performance microprocessors (tens of watts) are unacceptable for portable, battery-operated devices. Such devices should be able to execute functions at the minimum possible energy cost.

After energy efficiency and multimedia support, the third main requirement for personal mobile computers is small size and weight. The desktop assumption of several chips for external cache and many more for main memory is infeasible for PDAs; integrated solutions that reduce chip count are highly desirable. A related matter is code size, because PDAs will have limited memory to minimize cost and size. The size of executables is therefore important.

A final concern is design complexity—a concern in the desktop domain as well—and scalability. An architecture should scale efficiently not only in terms of performance but also in terms of physical design. Many designers consider long interconnects for on-chip communication to be a limiting factor for future processors. Long interconnects should therefore be avoided, since only a small region of the chip will be accessible in a single clock cycle.10

Processor evaluation

Table 3 summarizes our evaluation of the billion-transistor architectures with respect to personal mobile computing. As the table shows, most of these architectures offer only limited support for personal mobile computing.

Real-time response. Because they use out-of-order techniques and caches, these processors deliver quite unpredictable performance, which makes it difficult to guarantee real-time response.


Table 3. Evaluation of billion-transistor processors for the personal mobile-computing domain. “Wide superscalar” includes the advanced superscalar and superspeculative processors.

| Characteristic | Wide superscalar | Trace | Simultaneous multithreading | Chip multiprocessor | IA-64 | Raw | Comments |
|---|---|---|---|---|---|---|---|
| Real-time response | − | − | 0 | 0 | 0 | 0 | Out-of-order execution, branch prediction, and/or caching techniques make execution unpredictable. |
| Continuous data types | 0 | 0 | 0 | 0 | 0 | 0 | Caches do not efficiently support data streams with little locality. |
| Fine-grained parallelism | 0 | 0 | 0 | 0 | 0 | + | MMX-like extensions are less efficient than full vector support. Reconfigurable logic can use fine-grained parallelism. |
| Coarse-grained parallelism | 0 | 0 | + | + | 0 | + | |
| Code size | 0 | 0 | 0 | 0 | − | 0 | For the first four architectures, the use of loop unrolling and software pipelining may increase code size. IA-64’s VLIW instructions and Raw’s hardware configuration bits lead to larger code sizes. |
| Memory bandwidth | 0 | 0 | 0 | 0 | 0 | 0 | Cache-based designs interfere with high memory bandwidth for streaming data. |
| Energy/power efficiency | − | − | − | 0 | 0 | − | Out-of-order execution schemes, complex issue logic, forwarding, and reconfigurable logic have power penalties. |
| Physical-design complexity | − | 0 | − | 0 | 0 | + | |
| Design scalability | − | 0 | − | 0 | 0 | 0 | Long wires—for forwarding data or for reconfigurable interconnects—limit scalability. |

Continuous-media data types. Hardware-controlled caches complicate support for continuous-media data types.

Parallelism. The billion-transistor architectures exploit fine-grained parallelism by using MMX-like multimedia extensions or reconfigurable execution units. Multimedia extensions expose data alignment issues to the software and restrict the number of vector or SIMD elements operated on by each instruction. These two factors limit the usability and scalability of multimedia extensions. Coarse-grained parallelism, on the other hand, is best on the simultaneous multithreading, CMP, and Raw architectures.

Code size. Instruction reference locality has traditionally been exploited through large instruction caches. Yet designers of portable systems would prefer reduced code size, as suggested by the 16-bit instruction sets of MIPS and ARM. Code size is a weakness for IA-64 and any other architecture that relies heavily on loop unrolling for performance, as such code will surely be larger than that of 32-bit RISC machines. Raw may also have code size problems, as programmers must “program” the reconfigurable portion of each data path. The code size penalty of the other designs will likely depend on how much they exploit loop unrolling and in-line procedures to achieve high performance.

Memory bandwidth. Cache-based architectures have limited memory bandwidth. Limited bandwidth becomes a more acute problem in the presence of multiple data sequences (with little locality) streaming through a system—exactly the case with most multimedia data. The potential use of streaming buffers and cache bypassing would help sequential bandwidth, but such techniques fail to address the bandwidth requirements of indexed or random accesses. In addition, it would be embarrassing to rely on cache bypassing for performance when a design dedicates 50 to 90 percent of the transistor budget to caches.

Energy/power efficiency. Despite the importance to both portable and desktop domains,1 most designs do not address the energy/power efficiency issue. Several characteristics of the billion-transistor processors increase the energy consumption of a single task and the power the processor requires. These characteristics include redundant computation for out-of-order models, complex issue and dependence analysis logic, fetching a large number of instructions for a single loop, forwarding across long wires, and the use of typically power-hungry reconfigurable logic.

Design scalability. As for physical design scalability, forwarding results across large chips or communication among multiple cores or tiles is the main problem of most billion-transistor designs. Such communication already requires multiple cycles in several high-performance, out-of-order designs. Simple pipelining of long interconnects is not a sufficient solution, as it exposes the timing of forwarding or communication to the scheduling logic or software. It also increases complexity.

The conclusion to draw from Table 3 is that the proposed processors fail to meet many of the requirements of the new computing model. Which begs the question, what design will?

VECTOR IRAM

Vector IRAM,1 the architecture proposed by our research group, is our first attempt to design an architecture and implementation that match the requirements of the mobile personal environment.

We base vector IRAM on two main ideas: vector processing and the integration of logic and DRAM on a single chip. The former addresses many demands of multimedia processing, and the latter addresses the energy efficiency, size, and weight demands of portable devices. Vector IRAM is not the last word on computer architecture research for mobile multimedia applications, but we hope it proves a promising initial step.

The vector IRAM processor consists of an in-order, dual-issue scalar core with first-level caches, tightly integrated with a vector execution unit that contains eight pipelines. Each pipeline can support parallel operations on multiple media types and DSP functions. The memory system consists of 96 Mbytes of DRAM used as main memory. It is organized in a hierarchical fashion with 16 banks and eight sub-banks per bank, connected to the scalar and vector unit through a crossbar. This memory system provides sufficient sequential and random bandwidth even for demanding applications.

External I/O is brought directly to the on-chip memory through high-speed serial lines (instead of parallel buses), operating in the Gbit/s range. From a programming point of view, vector IRAM is comparable to a vector or SIMD microprocessor.

Table 4. Evaluation of vector IRAM for two computing domains.

| Domain | Characteristic | Rating |
|---|---|---|
| Desktop/server computing | SPECint04 | − |
| | SPECfp04 | + |
| | TPC-F | 0 |
| | Software effort | 0 |
| | Physical-design complexity | 0 |
| Personal mobile computing | Real-time response | + |
| | Continuous data types | + |
| | Fine-grained parallelism | + |
| | Coarse-grained parallelism | 0 |
| | Code size | + |
| | Memory bandwidth | + |
| | Energy/power efficiency | + |
| | Design scalability | 0 |

Comparison with billion-transistor architectures

We asked reviewers—a group that included most of the architects of the billion-transistor architectures—to grade vector IRAM. Table 4 presents the median grades they gave vector IRAM for the two computing domains—desktop/server and mobile personal computing. We requested reviews, comments, and grades from all the architects of the processors in Computer’s September 1997 special issue, and although some were too busy, most were kind enough to respond.

Obviously, vector IRAM is not competitive within the desktop/server domain. Indeed, this weakness in the conventional computing domain is probably the main reason some are skeptical of the importance of merged logic-DRAM technology. As measured by SPECint04, we do not expect vector processing to benefit integer applications. Floating-point intensive applications, on the other hand, are highly vectorizable. All applications would still benefit from the low memory latency and high memory bandwidth.

In the server domain, we expect vector IRAM to perform poorly due to its limited on-chip memory. A potentially different evaluation for the server domain could arise if we examine workloads for decision support


instead of online transaction processing.11 In decision support, small code loops with highly data-parallel operations dominate execution time. Architectures like vector IRAM and Raw should thus perform significantly better on decision support than on OLTP workloads.

In terms of software effort, vectorizing compilers have been developed and used in commercial environments for decades. Additional work is required to tune such compilers for multimedia workloads and make DSP features and data types accessible through high-level languages. In this case, compiler-delivered MIPS/watt is the proper figure of merit.

As for design complexity, vector IRAM is a highly modular design. The necessary building blocks are the in-order scalar core, the vector pipeline (which is replicated eight times), and the basic memory array tile. The lack of dependencies within a vector instruction and the in-order paradigm should also reduce the verification effort for vector IRAM.

The open questions are what complications arise from merging high-speed logic with DRAM, and what is the resulting impact on cost, yield, and testing. Many DRAM companies are investing in merged logic-DRAM fabrication lines, and many companies are exploring products in this area. Also, our project sent a nine-million-transistor test chip to fabrication this fall. It will contain several key circuits of vector IRAM in a merged logic-DRAM process. We expect the answer to these questions to be clearer in the next year. Unlike the other billion-transistor proposals, vector IRAM's challenge is the implementation technology rather than the microarchitectural design.

Support for personal mobile computing
As mentioned earlier, vector IRAM is a good match to the personal mobile-computing model. The design is in-order and does not rely on data caches, making the delivered performance highly predictable—a key quality in supporting real-time applications. The vector model is superior to limited, MMX-like, SIMD extensions for several reasons:

• It provides explicit control of the number of elements each instruction operates on.
• It allows scaling of the number of elements each instruction operates on without changing the instruction set architecture.
• It does not expose data packing and alignment to software, thereby reducing code size and increasing performance.
• A single design database can produce chips of varying cost-performance ratios.

Since most media-processing functions are based on algorithms working on vectors of pixels or samples, it's not surprising that a vector unit can deliver the highest performance. The presence of short vectors in some applications does not pose a performance problem if the start-up time of each vector instruction is pipelined and chaining (the equivalent of forwarding in vector processors) is supported, as we intend.

The code size of programs written for vector processors is small compared to that of other architectures. This compactness is possible because a single vector instruction can specify whole loops.

Memory bandwidth, both sequential and random, is available from the on-chip hierarchical DRAM.

Vector IRAM includes a number of critical DSP features like high-speed multiply accumulates, which it achieves through instruction chaining. Vector IRAM also provides auto-increment addressing (a special addressing mode often found in DSP chips) through strided vector memory accesses.

We designed vector IRAM to be programmed in high-level languages, unlike most DSP architectures. We did this by avoiding features like circular or bit-reversed addressing and complex instructions that make DSP processors poor compiler targets.12 Like all general-purpose processors, vector IRAM provides virtual memory support, which DSP processors do not offer.

We expect vector IRAM to have high energy efficiency as well. Each vector instruction specifies a large number of independent operations. Hence, no energy needs to be wasted for fetching and decoding multiple instructions or checking dependencies and making various predictions. In addition, the execution model is strictly in order. Vector processors need only limited forwarding within each pipeline (for chaining) and do not require chaining to occur within a single clock cycle. Hence, designers can keep the control logic simple and power efficient, and eliminate most long wires for forwarding.

Performance comes from multiple vector pipelines working in parallel on the same vector operation, as well as from high-frequency operation. This allows the same performance at lower clock rates—and lower voltages—as long as we add more functional units. In CMOS logic, energy increases with the square of the voltage, so such trade-offs can dramatically improve energy efficiency. DRAM has been traditionally optimized for low power, and the hierarchical structure provides the ability to activate just the sub-banks containing the necessary data.

As for physical-design scalability, the processor-memory crossbar is the only place where vector IRAM uses long wires. Still, the vector model can tolerate latency if sufficient fine-grained parallelism is available. So deep pipelining is a viable solution without any hardware or software complications in this environment.
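The contrast with MMX-like extensions can be made concrete with a small strip-mining sketch in C. Everything here is illustrative: the constant `MVL` and the function `vadd` are our own stand-ins, not part of the vector IRAM instruction set; the outer loop models a stream of vector instructions whose length is set per instruction, as the first two bullets above describe.

```c
#include <stddef.h>

/* Hypothetical hardware maximum vector length. An implementation could
 * provide 8, 64, or 256 elements per instruction; the code below does
 * not change, only this hardware property does. */
#define MVL 64

/* Strip-mined vector add: each outer-loop iteration models one vector
 * instruction operating on up to MVL elements, and the inner loop
 * stands for the elementwise work that instruction does in hardware. */
void vadd(const short *a, const short *b, short *c, size_t n)
{
    size_t i = 0;
    while (i < n) {
        size_t vl = (n - i < MVL) ? (n - i) : MVL;  /* set vector length */
        for (size_t j = 0; j < vl; j++)
            c[i + j] = (short)(a[i + j] + b[i + j]);
        i += vl;
    }
}
```

Because `MVL` is a property of the implementation rather than of the code, a chip with longer vector registers simply executes fewer instructions for the same binary. An MMX-style ISA, which fixes the element count and packing in the opcode, would instead require the loop to be rewritten for each wider datapath.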

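The voltage argument admits a quick back-of-the-envelope check. The sketch below uses the standard first-order model of dynamic CMOS power (proportional to V^2 times frequency per functional unit); the 0.6 supply figure is our illustrative assumption, not a number from the article.

```c
/* First-order model of dynamic CMOS power: P ~ units * V^2 * f, with
 * the switched capacitance per unit folded into the proportionality
 * constant. All arguments are relative to a baseline of 1.0. */
double rel_power(double units, double voltage, double freq)
{
    return units * voltage * voltage * freq;
}

/* Baseline: one vector pipeline at nominal voltage and clock:
 *   rel_power(1.0, 1.0, 1.0) = 1.0
 * Doubling the pipelines and halving the clock keeps element
 * throughput constant; if the relaxed cycle time lets the supply drop
 * to 0.6 of nominal (an assumed figure), the datapath draws
 *   rel_power(2.0, 0.6, 0.5) = 0.36
 * of the baseline power at the same performance. */
```

Under this model, adding functional units and lowering voltage delivers equal performance at roughly a third of the power, which is the trade-off the text relies on for energy efficiency.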
November 1998 31

For almost two decades, architecture research has focused on desktop or server machines. As a result, today's microprocessors are 1,000 times faster for desktop applications. Nevertheless, in doing so, we are designing processors of the future with a heavy bias toward the past. To design successful processor architectures for the future, we first need to explore future applications and match their requirements in a scalable, cost-effective way.

Personal mobile computing offers a vision of the future with a much richer and more exciting set of architecture research challenges than extrapolations of the current desktop architectures and benchmarks. Vector IRAM is an initial approach in this direction. Put another way, which problem would you rather work on: improving performance of PCs running FPPPP—a 1982 Fortran benchmark used in SPECfp95—or making speech input practical for PDAs? It is time for some of us in the very successful computer architecture community to investigate architectures with a heavy bias for the future. ❖

Acknowledgments
The ideas and opinions presented here are from discussions within the IRAM group at UC Berkeley. In addition, we thank the following for their useful feedback, comments, and criticism on earlier drafts, as well as their grades for vector IRAM: Anant Agarwal, Jean-Loup Baer, Gordon Bell, Pradeep Dubey, Lance Hammond, Wen-Hann Wang, John Hennessy, Mark Hill, John Kubiatowicz, Corinna Lee, Henry Levy, Doug Matzke, Kunle Olukotun, Jim Smith, and Gurindar Sohi.

This research is supported by DARPA (DABT63-C-0056), the California State MICRO program, NSF (CDA-9401156), and by research grants from LG Semicon, Hitachi, Intel, Microsoft, Sandcraft, SGI/Cray, Sun Microsystems, and TSMC.

References
1. D. Burger and J. Goodman, "Billion-Transistor Architectures," Computer, Sept. 1997, pp. 46-47.
2. J. Crawford and J. Huck, "Motivations and Design Approach for the IA-64 64-bit Instruction Set Architecture," Microprocessor Forum, Micro Design Resources, 1997.
3. K. Keeton et al., "Performance Characterization of the Quad Pentium Pro SMP Using OLTP Workloads," Proc. 1998 Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., 1998, pp. 15-26.
4. J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann, San Francisco, 1996.
5. J. Wilson et al., "Challenges and Trends in Processor Design," Computer, Jan. 1998, pp. 39-50.
6. K. Diefendorff and P. Dubey, "How Multimedia Workloads Will Change Processor Design," Computer, Sept. 1997, pp. 43-45.
7. W. Dally, "Tomorrow's Computing Engines," keynote speech, Fourth Int'l Symp. High-Performance Computer Architecture, Feb. 1998.
8. T. Lewis, "Information Appliances: Gadget Netopia," Computer, Jan. 1998, pp. 59-66.
9. G. Bell and J. Gray, Beyond Calculation: The Next 50 Years of Computing, Springer-Verlag, Feb. 1997.
10. D. Matzke, "Will Physical Scalability Sabotage Performance Gains?" Computer, Sept. 1997, pp. 37-39.
11. P. Trancoso et al., "The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors," Proc. Third Int'l Symp. High-Performance Computer Architecture, IEEE CS Press, Los Alamitos, Calif., 1997, pp. 250-260.
12. J. Eyre and J. Bier, "DSP Processors Hit the Mainstream," Computer, Aug. 1998, pp. 51-59.

Christoforos E. Kozyrakis is currently pursuing a PhD degree in computer science at the University of California, Berkeley. Prior to that, he was with ICS-FORTH, Greece, working on the design of single-chip high-speed network routers. His research interests include microprocessor architecture and design, memory hierarchies, and digital VLSI systems. Kozyrakis received a BSc degree in computer science from the University of Crete, Greece. He is a member of the IEEE and the ACM.

David A. Patterson teaches computer architecture at the University of California, Berkeley, and holds the Pardee Chair of Computer Science. At Berkeley, he led the design and implementation of RISC I, likely the first VLSI reduced instruction set computer. He was also a leader of the Redundant Arrays of Inexpensive Disks (RAID) project, which led to the high-performance storage systems produced by many companies. Patterson received a PhD in computer science from the University of California, Los Angeles. He is a fellow of the IEEE Computer Society and the ACM, and is a member of the National Academy of Engineering.

Contact the authors at the Computer Science Division, University of California, Berkeley, CA 94720-1776; {kozyraki, patterson}@cs.berkeley.edu.