Awards ...... Common Bonds: MIPS, HPS, Two-Level Branch Prediction, and Compressed Code RISC

ONUR MUTLU ETH Zurich

RICH BELGARD

...... We are continuing our series of of performance optimization on the com- sions made in the MIPS project that retrospectives for the 10 papers that piler and keep the hardware design sim- passed the test of time. The retrospective received the first set of MICRO Test of ple. The is responsible for touches on the design tradeoffs made to Time (“ToT”) Awards in December generating and scheduling simple instruc- couple the hardware and the software, 2014.1,2 This issue features four retro- tions, which require little translation in the MIPS project’s effect on the later spectives written for six of the award- hardware to generate control signals to development of “fabless” semiconductor winning papers. We briefly introduce control the datapath components, which companies, and the use of benchmarks these papers and retrospectives and in turn keeps the hardware design simple. as a method for evaluating end-to-end hope that you will enjoy reading them as Thus, the instructions and hardware both performance of a system as, among much as we have. If anything ties these remain simple, whereas the compiler others, contributions of the MIPS project works together, it is the innovation they becomes much more important (and that have stood the test of time. delivered by taking a strong position in likely complex) because it must schedule the RISC/CISC debates of their decade. instructions well to ensure correct and High-Performance Systems We hope the IEEE Micro audience, espe- high-performance use of a simple pipe- The second retrospective addresses three cially younger generations, will find the line. Many modern processors continue related papers that received MICRO ToT historical perspective provided by these to take advantage of some of the design Awards in 2014. These papers introduce retrospectives invaluable. principles outlined in this paper, either in the High Performance Substrate (HPS) their ISA (the MIPS ISA still has consider- microarchitecture, examine its critical MIPS able use) or in their internal execution of design issues, and discuss the use of large The first retrospective is for the oldest instructions (many complex instructions hardware-supported atomic units to paper being discussed in this issue: are broken into simple RISC-style microin- enhance performance in HPS-style, dynam- “MIPS: A Architecture.”3 structions in modern processors). The ically scheduled microarchitectures. This work introduced MIPS (Microproces- MIPSISAanddesignisalsousedinmany The first paper, “HPS, A New Micro- sor without Interlocked Pipeline Stages) modern courses architecture: Rationale and Introduction,”4 and its design philosophy, principles, hard- because of its simplicity. introduced the HPS microarchitecture, its ware implementation, related systems, Thomas Gross, Norman Jouppi, John design principles, and a prototype imple- and software issues. It is a concise over- Hennessy, Steven Przybylski, and Chris mentation. 
It also introduced the concept view of the MIPS design, which is one of Rowen provide in their retrospective a of restricted dataflow and its use in imple- the early reduced, or simple, instruction historical perspective on the development menting out-of-order instruction schedul- sets that have significantly impacted the of MIPS, the context in which they arrived ing and execution while maintaining microprocessor industry as well as com- at the design principles, and the develop- sequential execution and precise excep- puter architecture research and education ments that occurred after the paper. They tions from the ’s perspective for decades (and probably counting!). The also provide their reflections on what they (that is, without exposing the out-of-order- key design principle is to push the burden see as the design and evaluation deci- ness of dataflow execution to software)......

70 Published by the IEEE Computer Society 0272-1732/16/$33.00 2016 IEEE This concept of out-of-order execution memory disambiguation problem, has on the realization that adding another level using a restricted dataflow engine that since been the subject of many works. of history to predict a branch’s direction supports precise exceptions has been the The third paper, “Hardware Support can greatly improve branch prediction hallmark of high-performance processor for Large Atomic Units in Dynamically accuracy, in addition to the single level designs since the success of the Scheduled ,”7 likely provides that tracks which direction the same Pentium Pro,5 which implemented the the first treatment of large units of work static branch went the last N times it was concept. The paper describes how com- as atomic units that can enable higher executed (which had been established plexinstructionscanbetranslatedinhard- performance in an HPS-like dynamically practice since Jim Smith’s seminal ISCA ware to simple micro-operations and how scheduled processor. It discusses the 1981 paper11). In other words, the direc- dependent micro-operations are linked in benefits of enlarged blocks in an out-of- tion the branch took the last N times the hardware such that they are executed in order microarchitecture and describes same “history of branch directions” was a dataflow manner, in which a micro- how enlarged blocks can be formed at dif- encountered can provide a much better operation “fires” when its input operands ferent layers: architectural, compiler, and prediction for where the branch might go are ready. The mechanisms described in hardware. It introduces the notion of a fill the next time the same “history of branch this work paved the way for many future unit, which is a hardware structure that directions” is encountered. and current processor designs with com- enables the formation of large blocks of This insight formed the basis of many plex instruction sets to execute programs micro-operations that can be executed future designs used in using out-of-order execution (and exploit atomically. This work has opened up new high-performance processors, most nota- some of the RISC principles) without dimensions and is considered a direct pre- bly starting with the Pentium Pro. Yeh requiring changes to the instruction set cursor to notions such as the trace cache, and Patt’s MICRO 1991 paper discusses architecture or the software. The retro- which is implemented in the Intel Pen- various design choices for this predictor spective by Yale Patt and his former stu- tium 4,8 and the block-structured ISA.9 and provides a detailed evaluation of their dents Wen-mei Hwu, Steve Melvin, and In their retrospective, Yale Patt and impact on prediction accuracy. For exam- Mike Shebanow beautifully describes the colleagues provide a historical perspec- ple, they examine the effect of keeping six major contributions of the original HPS tive of HPS, the environment and the global or local branch history registers paper. works that influenced the design deci- (many modern hybrid predictors use a The second paper, “Critical Issues sions behind it, its reception by the com- combination) and the effect of how the Regarding HPS, A High Performance munity and the industry, and what pattern history tables are organized. The Microarchitecture,”6 discusses what the happened afterward. 
We hope you enjoy retrospective provides the historical per- authors deem as critical issues in the reading this retrospective that describes spective as to how two-level branch pre- design of the HPS microarchitecture. The the inspirations and development of a diction came about, along with the paper is forward looking, discussing some research project that has heavily influ- contemporary works that followed it. of the problems that the authors thought enced almost all modern high- “would have to be solved to make HPS performance microprocessor designs, Compressed Code RISC viable,” as the retrospective puts it. The CISC or RISC. The authors also provide a Processor paper can be seen as a “problem book” glimpse of some of the future research The final retrospective in this issue is for that anticipates the many issues to be that has been enabled and influenced by “Executing Compressed Programs on an solved in designing a modern out-of-order the HPS project, of which the next paper Embedded RISC Architecture,”12 one of execution engine that translates a com- we discuss is a prime example. the first papers to tackle the problem of plex instruction set to a simple set of code compression to save physical mem- micro-operations internally. For example, Two-Level Branch Prediction ory space in cost-sensitive embedded it introduces the notion of a node cache, The third retrospective is for “Two-Level processors. It is also one of the first to an early precursor to a modern trace Adaptive Training Branch Prediction”10 by examine cost-related architectural issues cache, and discusses issues related to Tse-Yu Yeh and Yale Patt. This paper in a RISC-based system on chip (SoC), as control flow, design of the dynamic addresses a critical problem in the HPS Andy Wolfe’s retrospective nicely instruction scheduler, the memory units, microarchitecture, and in pipelined and describes. The basic idea is to store pro- and state repair mechanisms. It also intro- out-of-order microarchitectures in gen- grams in compressed form in main mem- duces the unknown address problem— eral: how to feed the pipeline with useful ory to save memory space, and decom- the problem of whether a younger load instructions in the presence of conditional press each cache block right before it is instruction can execute in the presence of branches. The authors introduced a break- fetched into the processor’s instruction an older store instruction whose address through design for conditional branch pre- cache. Wolfe’s retrospective shows clearly has not yet been computed. This prob- diction, which they dubbed the two-level why RISC made sense for embedded SoC lem, now more commonly known as the branch predictor. This predictor is based designs, owing to such designs’ cost ...... JULY/AUGUST 2016 71 ...... AWARDS

constraints and real-time requirements. He evaluation have been published today if 5. D.B. Papworth, “Tuning the Pentium discusses how he came to work on they were submitted to our top com- Pro Microarchitecture,” IEEE Micro, designing such embedded RISC process- puter architecture conferences?” Apr. 1996, pp. 14–15. ors, the challenges associated with such We hope our community opens a dia- 6. Y.N. Patt et al., “Critical Issues designs, and the series of alternatives he logue to seriously consider this question Regarding HPS, A High Performance examined to reduce the cost of an when evaluating papers for acceptance. Microarchitecture,” Proc. 18th Ann. embedded RISC processor. He also pro- Contrarian ideas that change many things Workshop Microprogramming, 1988, vides a nice perspective of the require- in a system might not easily be quantita- pp. 109–116. ments, history, and development of the tively evaluated in a rigorous manner 7. S.W. Melvin, M.C. Shebanow, and embedded computing market, which we (and they perhaps should not be), but Y.N. Patt, “Hardware Support for hope you will greatly enjoy reading. that does not mean that they cannot Large Atomic Units in Dynamically change the world or that the entire scien- Scheduled Machines,” Proc. 21st s we conclude our introduction of tific community cannot greatly benefit Ann. Workshop Microprogramming this set of six MICRO ToT Award A from their publication and presentation at and Microarchitecture, 1988, pp. winners, we make a few observations. our top conferences. As the computer 60–63. First, as the retrospectives make clear, all architecture field evolves and diversifies 8. G. Hinton et al., “The Microarchitec- these papers have had widespread in a period where there does not seem to ture of the Intel Pentium 4 Process- impact on both modern processor be a dominant architectural design para- or,” Intel Technology J., vol. 1, 2001, designs as well as academic research, digm, it is time to take a step back and pp. 1–12. which has made them worthy of the Test enable the unleashing of bigger ideas in of Time Award. Second, each of these our conferences by relaxing the per- 9. S.W. Melvin and Y.N. Patt, papers vigorously challenged the status ceived requirements for quantitative “Enhancing Instruction Scheduling with a Block-Structured ISA,” Int’l J. quo and took contrarian positions to the evaluation. We hope the MICRO ToT Parallel Programming, vol. 23, no. 3, dominant design paradigms or popular Award winners and their retrospectives 1995, pp. 221–243. approaches of their times, which is how provide inspiration and motivation to this they achieved their impact. MIPS (RISC) end, in addition to inspiring young mem- 10. T.-Y. Yeh and Y.N. Patt, “Two-Level was proposed as a compelling alterna- bers of our community to develop ideas Adaptive Training Branch Prediction,” tive to CISC, which was the dominant that challenge the status quo in major Proc. 24th Ann. Int’l Symp. Microarch- ISA design paradigm at the time; HPS ways. itecture, 1991, pp. 51–61. was proposed as a way to enable high- 11. J.E. Smith, “A Study of Branch Predic- performance CISC against the strong ...... tion Strategies,” Proc. 8th Ann. advocacy for RISC at the time; two- References Symp. Computer Architecture, 1981, level branch prediction came as a 1. O. Mutlu and R. Belgard, “Introducing pp. 135–148. breakthrough when Jim Smith’s dec- the MICRO Test of Time Awards: 12. A. Wolfe and A. 
Chanin, “Executing ade-old two-bit counters were thought Concept, Process, 2014 Winners, and Compressed Programs on an to be the best predictor available; and the Future,” IEEE Micro, vol. 35, no. Embedded RISC Architecture,” Proc. cost-sensitive embedded RISC pro- 2, 2015, pp. 85–87. cessors were proposed as an alterna- 25th Ann. Int’l Symp. Microarchitec- 2. O. Mutlu and R. Belgard, “The 2014 tive to simple low-performance micro- ture, 1992, pp. 81–91. MICRO Test of Time Award Winners: controllers in the embedded computing From 1978 to 1992,” IEEE Micro, vol. market. Third, and perhaps most impor- 36, no. 1, 2016, pp. 60–C3. tantly, in all these papers, the focus on Onur Mutlu is a full professor of computer quantitative evaluation was much less 3. J. Hennessy et al., “MIPS: A Micro- science at ETH Zurich and a faculty mem- than a typical paper we see published processor Architecture,” Proc. 15th ber at Carnegie Mellon University. Con- today in top computer architecture con- Ann. Workshop Microprogramming, tact him at [email protected]. ferences. In fact, the three seminal 1982, pp. 17–22. HPSpapershavenoevaluationatall! 4. Y.N. Patt, W.M. Hwu, and M. Sheba- Rich Belgard is an independent consultant As we go through these award-winning now, “HPS, A New Microarchitec- for computer manufacturers, software papers, we cannot help but ask the ture: Rationale and Introduction,” companies, and investor groups and an question, “Could these award-winning Proc. 18th Ann. Workshop Micropro- expert and consultant to law firms. Con- papers with little or no quantitative gramming, 1985, pp. 103–108. tact him at [email protected].

...... 72 IEEE MICRO ...... A Retrospective on “MIPS: A Microprocessor Architecture”

THOMAS R. GROSS ETH Zurich

NORMAN P. J OUPPI Google

JOHN L. HENNESSY

STEVEN PRZYBYLSKI Verdande Group

CHRIS ROWEN Cadence Design Systems

...... The MIPS project started in early to pipeline and did not take advantage of The quarter before the MIPS project 1981. At that time, mainstream architec- compiler capabilities (such as register allo- started, Stanford University had offered tures such as the VAX, IBM 370, Intel cation) that could provide more efficient for the second time a (graduate) class on 8086, and 68000 were fairly execution. VLSI design, based on the approach complex, and their operation was con- One area of architecture research in developed by Carver Mead and Lynn trolled by . Designers of high- the early 1980s was to create even more Conway.1 One key idea was that the performance implementations discovered complex CISC architectures. Notable design rules could be specified without that pipelining of complex-instruction-set examples of these include high-level lan- close coupling to the details of a specific computer (CISC) machines, especially the guage machines such as Lisp machines, process. Most of the MIPS project team VAX, was hard; the VAX 11/780 required and later the Intel 432. However, the members had taken this class. Another around 10 processor cycles to execute a approach taken in the MIPS project as difference was that in VLSI, tradeoffs dif- single instruction, on average. Tradeoffs well as the RISC project at the University fer from those made in multiboard that had been made when there were of California, Berkeley, was to design machines built from hundreds of small-, few resources available for implementa- simpler instruction sets that did not medium-, and large-scale integration tion (for example, dense instruction sets require execution of microcode. These parts like the VAX 11/780. In a multiboard sequentially decoded over many cycles by simpler instructions were a better fit for , gate transistors were expen- microcode) ended up creating unneces- an optimizing compiler’s output—more sive (for example, only four 2-input gates sary constraints when more resources complex operations could be synthe- per transistor-transistor logic package), became available. For example, as transis- sized from several simple operations, but whereas ROM transistors were relatively tors were replacing the core, the cost of in the common case where only a simple cheap (for example, kilobits of ROM per memory dropped much faster than operation was needed, the more com- package). Much of this was driven by logic—favoring a slightly less dense pro- plex aspects of a CISC instruction could pinout and other limitations of DIP pack- gram encoding over logic-intensive inter- be effectively optimized away. In the aging. This naturally favored the design pretation. Also, the design of instruction case of the MIPS project, we empha- of microcoded machines, in which con- sets did not assume the use of an optimiz- sized ease of pipelining and sophisti- trol and sequencing signals could be effi- ing compiler. CISC instruction features cated register allocation, whereas the ciently stored as many bits per package such as arithmetic operations with oper- Berkeley RISC project included support instead of being computed by low- ands in memory were especially difficult for register windows in hardware. density gates requiring many packages...... JULY/AUGUST 2016 73 ...... AWARDS

that was heavily influenced by the Stan- ford MIPS design and was sold as the MIPS R2000. A final evaluation of the Stanford MIPS architecture was pub- lished in 1988.4

Reflections Several papers described the implemen- tation and testing of the MIPS pro- cessor5 and discussed the design decisions that turned out to be right and those that deserve to be reconsidered. More than 30 years later, at a time when contain multiple inde- pendent processing units (cores) and more than 5 billion transistors, a few of Figure 1. Die photo of MIPS processor. these have passed the test of time.

However, in VLSI, a transistor has ative to other processor designs of this MIPS Design approximately the same cost no matter era, a board powered by a MIPS pro- As part of the MIPS project, we devel- what logical function it is used for, which cessor could deliver the same perform- oped various CAD tools, such as tools for favors the adoption of direct control as in ance as a cabinet-sized computer built programmable logic array synthesis and early RISC architectures instead of with lower levels of integration. At the timing verification.6 These tools were microcode. Another difference with time we wrote our paper in the fall of driven by the designer’s requirements widespread adoption of VLSI was that 1982,2 the architecture and microarchi- and allowed a small team of students the topology of the wiring was important. tecture were defined, for C and faculty to complete the processor In 1981, we had only a single level of and Pascal were written, and several design in a timely manner. metal, so long wires had to be largely pla- test chips covering different parts of the We also invested heavily in compilers nar. This also favored simple and regular design had been sent out for fabrication. beyond the immediate needs of the designs. Finally, in a VLSI design it was MIPS processor development. Fellow clearly zero-sum: we had a maximum student Fred Chow designed and imple- chip size available, so any feature put in Developments After mented a machine-independent global would have to justify its value and would Publication optimizer, and fellow student Peter displace a less valuable feature. These The MIPS design was completed in the Steenkiste developed a Lisp compiler to constraints brought a lot of clarity of spring of 1983 and sent out for fabrica- investigate dynamic type checking on a focus to the design process. tion.3 At about the time the design was processor that did not provide any dedi- Given this context, we decided to sent for fabrication, the Center for Inte- cated hardware support for this task. build a single-chip high-performance grated Systems at Stanford University Personal were a novelty microprocessor that could be realized developed a 3-lm VLSI fabrication pro- at that time, and the MIPS project mem- with the fabrication technology available cess and used the MIPS design (shrunk bers were the first to enjoy workstations to university teams—a 4-lmsingle-level optically) to tune its process. In early in their offices. These workstations let us metal nMOS process that put severe lim- 1984, we received working 3-lmMIPS support the implementation with novel itations on the design complexity and processors from MOSIS that operated at tools and interactively experiment with size. (The DARPA MOSIS program bro- the speed we had expected for the different compiler and hardware optimi- kered access to silicon foundries so that design at 4 lm. Figure 1 shows the die zations. RISC designs emphasize effi- students and researchers in universities photo. These processors were run on a cient resource usage and encourage could obtain real silicon parts.) High per- test board developed by our fellow grad designers to employ resources where formance at that time meant that the pro- student, Anant Agarwal, and on 20 Feb- they can contribute the most; this RISC cessor could execute more than 1 million ruary 1984, the first program (8queens) strategy drove many of the major design (meaningful) instructions per second was run to completion. 
In mid-1984, decisions (for example, to abandon (MIPS), and the design target was 2 MIPS Computer Systems was founded. microcode in favor of simple instructions, MIPS (that is, a of 4 MHz). Rel- It designed a completely new processor or to use precious on-chip transistors for ...... 74 IEEE MICRO frequent operations and leave other tradeoff between design complexity, crit- conferences, benchmark results started operations to be dealt with by software). ical path length, and software develop- to become important, and the MIPS ment costs. In addition to clever circuit paper (as well as papers by other design Hardware/Software Coupling design and novel tools, this ability to groups, such as the UC Berkeley RISC The MIPS project invested in software make hardware/software tradeoffs was project8) presented empirical evidence. early on; the compiler tool chain worked a key factor for success. Stanford University published the set of before the design was finalized. We The team wanted to pick a name for benchmarks used by the MIPS project made several key microarchitecture deci- the project that emphasized performance. (the Stanford Benchmark Suite), and sions (including the structure of the pipe- About nine months earlier, the RISC proj- despite their limitations, they were in line) after running benchmark programs ect at UC Berkeley had started, so we use (at least) 25 years later to explore and assessing the impact of proposed needed a catchy acronym. “Million instruc- array indexing and recursive function design changes. tions per second” (MIPS) sounded right, calls. About five years later came the The MIPS project also emphasized given the project’s goals, but this metric founding of the SPEC consortium, which tradeoffs across layers as defined by was also known as the “meaningless indi- eventually produced a large body of real- then-prevalent industry practice. Two cator of processor speed.” So, we settled istic benchmarks. In 2011, about 30 sample design decisions illustrate this on “microprocessor without interlocked years later, the first ACM conferences aspect: the MIPS processor uses word pipeline stages.” initiated a process that lets authors of addresses, not byte addresses. The rea- The Mead/Conway approach to VLSI accepted papers submit an artifact col- soning was that byte addressing would design emphasized a decoupling of archi- lection (the benchmarks and tools used complicate the memory interface and tecture and fabrication. No longer was a to generate any empirical evidence pre- slow down the processor, that bytes chip design tied to a specific (often pro- sented in a paper).9 The MIPS project aren’t accessed that often, and that an prietary, in-house) process. The MIPS cannot claim credit for these develop- optimizing compiler could handle any project demonstrated that a high- ments, but it emphasized early on that issues. This deci- performance design could be realized in end-to-end performance, from source sion was probably correct for research this framework and supported the view program to executing machine instruc- processors (and saved us from debating that VLSI design could be done without tions, is the metric that matters. if the processor should be big-endian or close coupling to a proprietary process. little-endian) and illustrates the freedom Vertical integration—that is, design and he Stanford MIPS project was an afforded by looking beyond a single layer. fabrication in one company—might offer T important evolutionary step. 
Later But all subsequent descendants sup- benefits, but so does the separation of RISC architectures such as MIPS Inc. and ported byte addressing (of course, by fabrication and design. The MOSIS ser- DEC Alpha were able to learn from both that time, VLSI had advanced to CMOS vice was an early (nonprofit) experiment; our mistakes and successes, producing with at least two levels of metal, so byte a few years after the end of the MIPS cleaner and more widely applicable archi- selection logic was easier to implement). project, commercial silicon foundries tectures, including integrated floating- However, word alignment of word started to offer fabrication services and point and system support features such accesses as in MIPS was still retained, allowed the creation of “fabless” semi- as TLBs. All new architectures that unlike in minicomputers. conductor companies. appeared afterward (since 1985) incorpo- Sometimes the absence of interlocks rated ideas from the RISC designs, and on MIPS is seen as a defining feature of Benchmarks as the concern for resource and power this project. However, it was a tradeoff, The MIPS project used a quantitative efficiency continues to be important, we based on the capabilities of the imple- approach to decide on various features expect RISC ideas to remain relevant for mentation technology of the time. To of the processor and therefore needed a processor designs. And, finally, this MIPS meet the design goal of a 4-MHz clock, collection of benchmark programs to col- paper was also an important step in the the designers had to streamline the pro- lect the needed for decision making. transition of the SIGMICRO Annual Work- cessor, and still most of the critical paths In retrospect, the benchmarks we used shop on Microprogramming to the IEEE/ involved the processor’s control compo- were tiny and did not include any signifi- ACM International Symposium on nent.7 For the MIPS processor, it made cant operating systems code. Conse- Microarchitecture. sense to simplify the design as much as quently, the designers focused on possible; later descendants, with access producing a design that delivered per- to better VLSI processes, made different formance for compiled programs but References decisions. The absence of hardware paid less attention to the operating sys- 1. C. Mead and L. Conway, Introduction interlocks (to delay an instruction if one tem interface and the need to connect to VLSI Systems, Addison-Wesley, of the operands wasn’t ready) was a the processor to a memory hierarchy. At 1980...... JULY/AUGUST 2016 75 ...... AWARDS

2. J. Hennessy et al., “MIPS: A Micro- 6. N. Jouppi, “TV: An NMOS Timing ETH Zurich. Contact him at thomas. processor Architecture,” Proc. 15th Analyzer,” Proc. 3rd CalTech Conf. [email protected]. Ann. Microprogramming Workshop, VLSI, 1983, pp. 71–85. 1982, pp. 17–22. 7. S. Przybylski et al., “Organization and Norman P. Jouppi is a distinguished hard- 3. C. Rowen et al., “MIPS: A High Per- VLSI Implementation of MIPS,” J. ware engineer at Google. Contact him formance 32-bit NMOS Microproc- VLSI and Computer Systems, vol. 1, at [email protected]. no. 2, 1984, pp. 170–208. essor,” Proc. Int’l Solid-State Circuits John L. Hennessy is president emeritus 8. D.A. Patterson and C.H. Sequin, Conf., 1984, pp. 180–181. of Stanford University. Contact him at “RISC-I: A Reduced Instruction Set 4. T.R. Gross et al., “Measurement and [email protected]. VLSI Computer,” Proc. 8th Ann. Evaluation of the MIPS Architecture Symp. Computer Architecture, 1981, Steven Przybylski is the president and and Processor,” ACM Trans. Com- pp. 443–457. principal consultant of the Verdande puter Systems, Aug. 1988, pp. 229– 9. S. Krishnamurthi, “Artifact Evaluation Group. Contact him at sp@verdande. 258. for Software Conferences,” SIGPLAN com. 5. S. Przybylski, “The Design Verification Notices, vol. 48, no. 4S, 2013, pp. 17–21. and Testing of MIPS,” Proc. Conf. Chris Rowen is the CTO of the IP Group at Advanced Research in VLSI, 1984, pp. Thomas R. Gross is a faculty member in Cadence Design Systems. Contact him at 100–109. the Computer Science Department at [email protected].

...... HPS Papers: A Retrospective

YALE N. PATT University of Texas at Austin

WEN-MEI W. H WU University of Illinois at Urbana–Champaign

STEPHEN W. M ELVIN

MICHAEL C. SHEBANOW Samsung

...... HPS happened at a time (1984) ing its bets, pursuing the development of scripted variables—say I—and checked when the computer architecture com- the i860 along with continued activity on the upper and lower bounds; if they both munity was being inundated with the . passed, it used the size information of promises of RISC technology. Dave Pat- The RISC phenomenon rejected the each element in the array and the array’s terson’s RISC at Berkeley and John Hen- VAX and x86 architectures as far too com- dimensions to compute part of the loca- nessy’s MIPS at Stanford were visible plex. Both architectures had variable- tion of A[I,J,K]. More than a half-dozen university research projects. Hewlett length instruction sets, often with multi- operations were performed in carrying Packard had abandoned its previous ple operations in each instruction. The out the work of this instruction. Intel’s instruction set architecture (ISA) in favor VAX Index instruction, for example, x86’s variable-length instruction had pre- of the HP Precision Architecture (HP/PA) required six operands to assist in comput- fixes to override an instruction’s normal and had attracted a number of people ing the memory location of a desired ele- activity, and several bytes when neces- from IBM to join their workforce. Sun ment in a multidimensional subscripted sary to locate the location of an operand. Microsystems had moved from the array. If you wanted the location of RISC advocates argued that with sim- to the Sparc ISA. Motor- A[I,J,K], you could obtain it with three pler instructions requiring in general a ola itself was focusing on their 78K, later instantiations of the Index instruction. single operation, wherein the signals renumbered 88K. Even Intel was hedg- The Index instruction took one of the sub- needed to control the datapath were ...... 76 IEEE MICRO contained within the instruction itself, Several pieces of previous work pro- be near-impossible for a global scheduler one would require no microcode and vided insights. Twenty years earlier, in to see the available parallelism in an irreg- could make the hardware much simpler, the mid-1960s, the IBM 360/91 intro- ularly parallel program, it did not matter. resulting in a program executing in less duced the Tomasulo algorithm, a It only mattered that the operations time. Not everyone followed the mantra dynamic scheduling (aka out-of-order themselves knew when they were ready exactly; for example, Hennessy’s MIPS execution) mechanism, in their FPU.1 to execute. By allowing the operations used a pipeline reorganizer to package Instructions were allowed to schedule themselves to determine when they two operations in a single instruction. themselves out of program order when were ready to execute, we could exploit But for the most part, RISC meant sin- all flow dependencies (RAW hazards) irregular parallelism. Allowing the nodes gle-cycle execution of individual instruc- were satisfied. Unfortunately, instruc- to determine when they can fire is the tions, statically scheduled and fetched tions were allowed to retire as soon as hallmark of dataflow. one at a time. The instructions were they completed execution, breaking the Third, we recognized that, unlike essentially micro-ops. ISA and preventing precise exceptions. Tomasulo, the mechanism need not be We marched to a different drum. 
In fact, the 360/91 design team had to restricted to the FPU—that indeed other “We” were Yale Patt, a visiting professor get special permission from IBM to pro- operations, and in the fifth year of his nine-year visit at UC duce the 360/91 because it did not obey more importantly, loads and stores, Berkeley, and Wen-mei Hwu, Michael the requirements of the System 360 could also be allowed to execute in paral- Shebanow, and Steve Melvin, all PhD stu- ISA.2 IBM did not produce a follow-on lel. This let us build the dataflow graph dents at Berkeley who completed their product embracing the ideas of the 360/ from all instructions in the program, not PhDs there with Yale over the next sev- 91 FPU. just the floating-point instructions. eral years. We dubbed our project the Ten years earlier, in 1975, Arvind and Fourth, we recognized that we did High Performance Substrate (HPS). We Kim Gostelow3 and Jack Dennis and not have to change the ISA and in fact were all part of the Aquarius project, led David Misunas4 introduced the important not doing so provided two benefits: by Al Despain. Al concentrated on design- notion of representing programs as data- Our HPS microarchitecture could ing a engine, and we focused on flow graphs, and Robert Keller5 intro- be used with any ISA. That is, high-performance microarchitecture. duced the notion of a window of instructions in any ISA could be Berkeley was a special place to do instructions—that is, instructions that decoded (that is, converted) to a research in the 1980s. Among the work had been fetched and decoded but not dataflow graph and merged into going on, Velvel Kahan invented and yet executed. When we studied these the dataflow graph of the previ- refined his specification of IEEE floating- papers in 1984, several things became ously decoded instructions. The point arithmetic; Michael Stonebraker clear to us. dataflow graph’s elements were invented Ingres; Domenico Ferrari pro- First, with respect to the Tomasulo single operations: micro-ops. duced distributed ; Manuel Blum, algorithm, we would have to do some- Decoding could produce multiple Dick Karp, and Mike Harrison pursued thing about the lack of precise excep- operations each cycle, that is, exciting theoretical issues; Lotfi Zadeh tions. We correctly decided that this wide issue. refined his fuzziness concepts; and Sue problem was the deal breaker that pre- Importantly, we could maintain Graham did important work in compiler vented IBM from producing follow-on the ISA’s inviolability. technology. designs. We introduced the Result Buf- Unlike the RISC projects going on at fer, which stored the results of instruc- Fifth, we recognized that to make this the time, we felt it was unnecessary to tions executed out of order, allowing work, we needed to be able to continu- meddle with the ISA; rather, we would them to be retired in order. Although we ously supply a lot of micro-operations require the microarchitecture to break did not know it at the time, Jim Smith into the execution core. We incorporated instructions into micro-operations (AMD and Andy Pleszkun were concurrently a post-decode window of instructions, later called them Rops, for RISC ops) and investigating in-order retirement of out- but knew we could not, as Keller advo- allow the micro-ops to be scheduled to of-order execution machines, and came cated, stop supplying micro-ops when the function units dynamically (that is, at up with the name Reorder Buffer (ROB), we encountered an unresolved condi- runtime). 
We insisted on operating below which is a much better descriptor for our tional branch. For this we would need an the ISA’s level, allowing the ISA to remain Result Buffer.6 aggressive branch predictor, and beyond inviolable and thereby retaining all the Second, we recognized that Dennis that, speculative execution. We did not work of the previous 50 years. We were and Arvind’s dataflow graph concept have a suitable branch predictor at the sensitive to not introduce features that could allow huge gains in performance time, but we knew one would be neces- might seem attractive at the moment but by allowing many instructions to execute sary for HPS to be viable. It would also could turn out to be costly downstream. concurrently—and that although it would require a fast mechanism for recovering ...... JULY/AUGUST 2016 77 ...... AWARDS

the state of the machine if a branch pre- We started publishing the HPS micro- finished his PhD in 1987 and took a fac- diction took us down the wrong path. architecture with two papers at MICRO ulty position at the University of Illinois at Finally, we recognized that unlike previ- 18. The first paper provided an introduc- Urbana–Champaign. Steve Melvin com- ously constructed dataflow machines, tion to the HPS microarchitecture.7 We pleted his PhD in 1990 and has been a wherein the compiler constructed a gener- described the rationale for a restricted successful independent consultant in the ally unwieldly dataflow graph, our dataflow dataflow machine that schedules opera- computer architecture industry for more graph would be constructed dynamically. tions out of order within an active win- than 25 years. Michael Shebanow com- This allowed nodes of the dataflow graph dow of the dynamic instruction stream. pleted his PhD and went on to become a to be created in the front end of the pipe- We showed the need for an aggressive lead architect on many microprocessors line when the instructions were fetched, branch predictor that allowed speculative over the years, including the M88120 at decoded, and renamed, and removed execution, and we provided a mecha- Motorola, the R1 and Denali at HAL, the from the dataflow graph at the back end nism for quickly repairing the state of the M3 at Cyrix, and the Fermi GPU shader of the pipeline when the instructions were machine when a branch misprediction core complex at Nvidia. He is now a vice retired. The result: at any point in time, occurs. We introduced the notion of president at Samsung, leading a team only a small subset of the instruction decoding complex instructions into focused on advanced graphics IP. stream, the instructions in flight, are in the micro-ops, and the use of the result buf- dataflow graph. We dubbed this concept fer to retire instructions in program order. “restricted dataflow.” We saw that The second paper at MICRO 18, the References although pure dataflow was problematic “Critical Issues” paper, pointed out 1. R.M. Tomasulo, “An Efficient Algo- for many reasons (such as saving state on some of the problems that would have rithm for Exploiting Multiple Arithmetic an or exception, debugging pro- to be solved to make HPS viable.8 We Units,” IBM J. Research and Develop- grams, and verifying hardware), restricted discussed various issues, including cach- ment, vol. 11, no. 1, 1967, pp. 25–33. dataflow was viable. ing and control flow, and discussed what 2. R.M. Tomasulo, lecture at University These revelations did not come all at we called the “unknown address prob- of Michigan, Ann Arbor, Jan. 30, once. We had the luxury at the time of lem.” We presented an algorithm for 2008; www.youtube.com/watch?v meeting almost daily in Yale’s office as determining whether a memory read ¼ S6weTM1tNzQ. we argued and struggled over the con- operation is ready to be executed in the 3. Arvind and K.P. Gostelow, A New cepts. Sometimes one or more of us presence of older memory writes, includ- Interpreter for Dataflow and Its Impli- were ready to give up, but we didn’t. ing those with unknown addresses. At the time, many of our colleagues We continued the work over the next cations for Computer Architecture, were skeptical of our work and made it several years, publishing results as we tech. report 72, Dept. of Information clear they thought we were barking up went along. In one of those subsequent and Computer Science, Univ. 
of Cali- the wrong tree. We were advocating papers, published at MICRO 21, we fornia, Irvine, 1975. future chips that would use the HPS introduced the concept of a fill unit and 4. J.B. Dennis and D.P. Misunas, “A Pre- microarchitecture to fetch and decode the treatment of large units of work as liminary Architecture for a Basic Data four instructions each cycle. Our critics atomic units.9 In all, we published more Flow Processor,” Proc. 2nd Int’l argued that there was not enough paral- than a dozen papers extending the ideas Symp. Computer Architecture, 1975, lel work available to support the notion of first put forward in 1985. More impor- pp. 126–132. four-wide issue, and besides, there were tantly, starting with Intel’s Pentium Pro, 5. R.M. Keller, “Look Ahead Process- not enough transistors on the chip to announced in 1995, the microprocessor ors,” ACM Computing Surveys, vol. handle the job. To the first criticism, we industry embraced the fundamental con- 7, no. 4, 1975, pp. 177–195. pointed out that although we could not cepts of HPS: aggressive branch predic- 6. J.E. Smith and A.R. Pleszkun, see the parallelism, it did not matter. The tion leading to speculative execution; “Implementation of Precise parallelism was there, and the dataflow wide-issue, dynamic scheduling of in Pipelined Processors,” Proc. 12th nodes would know when all dependen- micro-ops via a restricted dataflow Ann. IEEE/ACM Int’l Symp. Computer cies had been satisfied and they were graph; and in-order retirement enabling Architecture, 1985, pp. 36–44. ready to fire. To the second argument, precise exceptions. 7. Y.N. Patt, W.M. Hwu, and M. Sheba- we begged for patience. Moore’s law now, “HPS, A New Microarchitec- was alive and well, and although at the ale Patt continued as a visiting pro- ture: Rationale and Introduction,” time there were fewer than 1 million Y fessor at Berkeley for four more Proc. 18th Ann. Workshop Micropro- transistors on each chip, we had faith years, then moved on to the University gramming, 1985, pp. 103–108. that Moore’s law would satisfy that one of Michigan and eventually to the Univer- 8. Y.N. Patt et al., “Critical Issues for us. sity of Texas at Austin. Wen-mei Hwu Regarding HPS, A High Performance ...... 78 IEEE MICRO Microarchitecture,” Proc. 18th Ann. Yale N. Patt is the Ernest Cockrell, Jr. Stephen W. Melvin is an independent con- Workshop Microprogramming, 1985, Centennial Chair in Engineering at the sultant in the computer architecture pp. 109–116. University of Texas at Austin. Contact industry. Contact him at melvin@zytek. 9. S.W. Melvin, M.C. Shebanow, and him at [email protected]. com. Y.N. Patt, “Hardware Support for Large Atomic Units in Dynamically Wen-mei W. Hwu is the AMD Jerry Sand- Michael C. Shebanow is a vice president Scheduled Machines,” Proc. 21st ers Endowed Chair in Electrical Engi- at Samsung. Contact him at shebanow Ann. Workshop Microprogramming neering at the University of Illinois at @gmail.com. and Microarchitecture, 1988, pp. Urbana–Champaign. Contact him at 60–63. [email protected]...... Retrospective on “Two-Level Adaptive Training Branch Prediction”

TSE-YU YEH Apple

YALE N. PATT University of Texas at Austin

...... Looking back at the world of of concurrent execution. This meant a microarchitecture tradeoffs that summer, computer architecture in 1991, when we powerful and effective branch predictor and often the discussions would gravitate wrote our MICRO paper,1 we believe our that could fetch, decode, and execute to branch prediction. When Tse-Yu motivation for addressing branch prediction instructions speculatively, and only very returned to Ann Arbor for the fall semes- is best illustrated by our earlier 1991 ISCA infrequently have to throw away the ter, he was excited about the possibility paper,2 which showed that instructions per speculative work because of a branch of a branch predictor that could support cycle can be greater than two. Conven- misprediction. The computer architecture the HPS microarchitecture. tional wisdom at the time had argued other- community had pretty well given up on After many after-midnight meetings wise. We demonstrated in that paper that branch prediction providing any major over several months in Yale’s office, there was enough instruction-level parallel- benefits. Most accepted as fact that Jim combined with a whole lot of simula- ism in nonscientific workloads to support Smith’s 2-bit saturating counter was as tion, the two-level adaptive branch pre- our contention. We further showed that good as it was going to get.3 His branch dictor was born. We knew the branch the HPS microarchitecture (a superscalar predictor was published in ISCA in 1981 predictor would have to be dynamic; out-of-order execution engine) could and was still the best-performing branch that was a no-brainer. The branch pre- improve the execution rate greatly by predictor 10 years later. It was the branch dictor would have to use historical infor- exploiting instruction-level parallelism, pro- predictor in Pentium, Intel’s “brainiac” mation based on the input data that the vided the penalty caused by incorrect microprocessor, the top of the x86 line at program was processing, and it would branch predictions was minimized. the time, released in 1991. have to accommodate phase changes. In fact, the lack of an aggressive, Tse-Yu spent the summer of 1990 at But that was not enough. What histori- highly accurate branch predictor was a Motorola, working for Michael Sheba- cal information would yield good branch major stumbling block in enabling our now, one of the original inventors of the prediction? Somehow we stumbled on high-performance microarchitecture to HPS microarchitecture. Michael was the answer: it was not the direct record achieve its potential. The HPS microarchi- interested in branch prediction, and in fact of a branch’s behavior that was needed, tecture required the execution core to be almost did his PhD dissertation at Berke- but rather whether the branch was kept fully supplied with micro-ops, organ- ley on branch prediction. Tse-Yu and taken at previous instances of time hav- ized as a dataflow graph that allowed lots Michael had many conversations about ing the same history. That is, if the ...... JULY/AUGUST 2016 79 ...... AWARDS

branch’s history was 011000101 (0 accurate than the previous best case. joined Sibyte, which was acquired by means not taken, 1 means taken), there Our predictor showed that maintaining Broadcom; later he moved to PA Semi is little one can infer. But, if the last time two levels of branch history would pro- and eventually to Apple. He has been the history was 011000101 the branch vide much greater accuracy than was responsible for many microprocessor was taken, and the time before that possible with the simpler branch predic- designs over the years. He is currently an when the history was 011000101 the tion schemes. Other researchers soon engineering director at Apple, where he branch was also taken, it was a good agreed that the benefit from branch pre- is responsible for CPU verification. Yale bet that the branch would be taken at diction could be substantial, and that Patt retired from Michigan in 1999 to this time. maintaining two levels of branch history become the Ernest Cockrell, Jr. Centen- With that insight, we set out to was the correct mechanism. The result nial Chair in Engineering at the University design the two-level predictor. Several was a wave of papers on dynamic branch of Texas at Austin. He continues to thrive issues had to be resolved. prediction, successively improving on on directing PhD students in microarchi- First, how should the direct record of our basic two-level predictor design. We tecture research and teaching both fresh- branches be stored? Three choices pre- followed our first paper with a more com- men and graduate students. sented themselves: a global history regis- prehensive paper describing alternative ter to record the behavior of each dynamic implementations of the two-level branch; a separate history register for scheme in ISCA 1992,4 and Kimming So each static branch; and a half-way meas- and colleagues at IBM published their References ure, in which we partitioned branches into correlation predictor in ASPLOS later that 1 T.-Y. Yeh and Y.N. Patt, “Two-Level equivalence classes and allocated one his- same year.5 Soon after, Scott McFarling Adaptive Training Branch Prediction,” tory register for each class. We dubbed introduced gshare,6 a variation on the Proc. 24th Ann. Int’l Symp. Microarchi- the three schemes G, P, and S, respec- GAs two-level predictor and hybrid tecture, 1991, pp. 51–61. tively. The obvious tradeoffs were inter- branch prediction. 2. M. Butler et al., “Single Instruction ference versus correlation and the cost of At our industrial affiliates meeting Stream Parallelism Is Greater Than the history registers. the next year, Tse-Yu presented our Two,” Proc.18thAnn.Int’lSymp.Com- Second, how should the information branch predictor. Intel engineers in the puter Architecture, 1991, pp. 276–286. be stored that records what happens as a audience were excited about its possi- 3. J.E. Smith, “A Study of Branch Predic- result of a previous instance of the current bilities. They had decided to increase tion Strategies,” Proc. 8th Ann. history? We decided to store results as the pipeline of their next processor Pen- Symp. Computer Architecture, 1981, entries in a table, indexed by the bits of his- tium Pro from 5 stages to 13 to cut pp. 135–148. tory in the history registers. The same down the cycle time to be more com- 4. T.-Y. Yeh and Y.N. Patt, “Alternative three choices applied here—a global pat- petitive with all the emerging RISC Implementations of Two-Level Adap- tern table, a per-static-branch pattern table, designs. 
More pipeline stages meant a tive Branch Prediction,” Proc. 19th and the in-between scheme. We dubbed greater branch-misprediction penalty, Ann. Int’l Symp. Computer Architec- the three choices g, p, and s. Thus, we so Intel badly needed a better branch ture, 1992, pp. 124–134. had nine choices to investigate: GAg, GAs, predictor. Intel’s interest sparked sub- 5.S.-T.Pan,K.So,andJ.T.Rahmeh, GAp, SAg, SAs, SAp, PAg, PAs, and PAp. stantial interaction between us, result- “Improving the Accuracy of Dynamic Third, what decision structure should ing in Tse-Yu spending the summer of Branch Prediction Using Branch we use given the entries in the pattern 1992 at Intel in Hillsboro, Oregon, Correlation,” Proc. 5th Int’l Conf. tables? We examined several state where the Pentium Pro was being Architectural Support for Program- machines, finally deciding on the 2-bit designed. Intel decided to opt for the ming Languages and Operating counterschemeofJimSmith’s1981 PAs version of the two-level branch pre- Systems, 1992, pp. 76–84. ISCA paper. dictor and did all the hard engineering to 6. S. McFarling, “Combining Branch Pre- Finally, how many bits of history should make it implementable in the Pentium dictors,” tech. note TN-36, Western be kept in the history registers? Using n Pro. With the announcement of that Research Library, 1993. bits of history meant a pattern table of size product, the computer architecture 2n. Adding one bit of history meant dou- landscape had changed forever in favor Tse-Yu Yeh is an engineering director at bling the size of each pattern table. of aggressive branch prediction. Apple. Contact him at [email protected]. We introduced the Two-Level Adap- tive Branch Predictor at MICRO in 1991,1 Yale N. Patt is the Ernest Cockrell, Jr. demonstrating that a lot more perform- se-Yu Yeh completed his PhD at Centennial Chair in Engineering at the ance could be obtained with a branch T Michigan in 1993 and took a job University of Texas at Austin. Contact predictor that was significantly more with Intel in Santa Clara. Afterward, he him at [email protected]...... 80 IEEE MICRO ...... Retrospective on Code Compression and a Fresh Approach to Embedded Systems

ANDY WOLFE Santa Clara University

...... In late fall of 1991, I was a first-year assistant professor at Princeton University. I was teaching an upper-level course on computer architecture based on quantitative analysis, a relatively new approach that grew out of research at Stanford and Berkeley in the prior decade. I was, quite literally, teaching the class “by the book” using the recently released text Computer Architecture: A Quantitative Approach.1 I had been able to sit in on John Hennessy’s similar course a few years earlier where he used the draft manuscript for the text, and although I was unable to replicate his gravitas, I was able to convey the primary message to my students: Computer architecture research had changed, and it was incumbent to be able to model, measure, and evaluate everything we did. Alex Chanin was a student in that class, on leave from AT&T in order to get a master’s degree. He approached me mid-semester to discuss ideas for a thesis topic related to computer architecture. He made the primary requirements for a topic very clear. It had to be important enough that it would be accepted for graduation, but equally as important, it had to be finished on time at the end of the next semester so he could graduate, get married, and go back to his job on time. Working within those constraints, we developed the technology discussed in the paper, “Executing Compressed Programs on an Embedded RISC Architecture.”2

I had a background in embedded computing, primarily using microcontrollers to design peripherals such as touchpads and touchscreens for PCs. I had decided to focus on this area when I joined Princeton, working with Marilyn Wolf and Sharad Malik on hardware/software codesign and on the emerging opportunities in digital audio and video. Alex had worked on telecommunications-based computing at AT&T. He repeatedly would remind me that the questions I was posing in class were not relevant to “real engineers with real jobs.” A change in cache configuration that would improve average performance by a couple percent was not important where he worked. Engineering decisions were dominated by hard and soft limits on space, power requirements, cooling capacity, and cost. We started looking at the current state of embedded systems to find an interesting problem to solve. Could we improve cost, size, or power as compared to conventional processors? Could we show that we had made such an improvement with quantitative measures? And, of course, could we find a problem that could be “solved” in the next six months using the tools and resources we had on hand? With those goals in mind, we decided to address the question of whether we could make programs for reduced-instruction-set computing (RISC) processors smaller so that the cost of program memory could be reduced.

The system we developed was called the compressed code RISC processor (CCRP). The basic idea was that programs were compressed as the last step in program development using a block-based compression code. The compressed blocks were stored in the program ROM, using substantially less space than the original programs. In practice, this meant that we could fit more features and functionality into a limited-size program memory. At runtime, the compressed blocks are decompressed when read into the instruction cache. Neither the processor core nor the instruction cache itself need be aware that the programs were stored in compressed form. A line address table with its own cache provided pointers for locating the compressed blocks. Although the decompression time could add to the cache-line access time, in many configurations the reduced number of program memory accesses required to fetch the compressed data would compensate for the decompression time. In any case, decompression was required only for a cache miss, so any performance impact would be small.
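To make the fetch path just described concrete, here is a minimal sketch in C. The names are invented and the decompressor is stubbed out with a plain copy, so this is an illustration of the idea rather than the actual CCRP refill engine; the small cache in front of the line address table mentioned above is also omitted.

/* Sketch of a CCRP-style refill path: only the cache-refill engine knows that
 * the program ROM holds compressed blocks; the core and the cache see ordinary
 * uncompressed instructions. Names and sizes are illustrative assumptions. */
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 32                /* one cache line == one compressed block */
#define NUM_LINES  1024              /* blocks covered by the program image    */

static const uint8_t  rom[32 * 1024];                /* compressed program ROM */
static const uint32_t line_address_table[NUM_LINES]; /* ROM offset per block   */

/* Placeholder decompressor: a real refill engine would decode the compressed
 * block here; the copy simply keeps the sketch self-contained and runnable. */
static void decompress_block(const uint8_t *src, uint8_t dst[LINE_BYTES]) {
    memcpy(dst, src, LINE_BYTES);
}

/* Invoked on an instruction-cache miss; assumes the image is based at address 0. */
void refill_line(uint32_t pc, uint8_t cache_line[LINE_BYTES]) {
    uint32_t line_index = (pc / LINE_BYTES) % NUM_LINES;   /* which block      */
    uint32_t rom_offset = line_address_table[line_index];  /* where it landed  */
    decompress_block(&rom[rom_offset], cache_line);        /* expand the line  */
}

On a cache hit nothing changes: the core fetches ordinary instructions from the cache, which is why neither the core nor the cache itself needs to know that the ROM holds compressed code.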

In looking back at this research, I find it most interesting to look at the factors that led us to think about this problem and about the proposed solution. It is interesting to review not only what we published, but the other alternatives we rejected, because at least one of those has become successful. Most of all, however, I now find the specific problem that we addressed to be less interesting than the constraints that we put on the problem. On the basis of our limited resources and our desire to make the solution as palatable to industry as possible, we focused on using mainstream RISC CPU architectures in cost-sensitive embedded systems. This was a different way of thinking about embedded CPU research at the time, but it reflected some important changes in the CPU industry and the way many researchers were starting to think about embedded computing. This may have been the most important impact of our paper, providing a concrete example of success to researchers who were starting to think about embedded computing in a new way.

The State of the Art

In today’s world it seems obvious that a RISC-based system on chip (SoC) is a good solution for an embedded system, but when we started working on this research, such devices basically did not exist. Moreover, the trends in RISC processor development and mainstream CPU core development in general were moving further away from suitability in embedded systems.

As an academic and a computer architect, I had expected RISC and complex-instruction-set computing (CISC) implementations to begin to converge. During my time as a graduate student at Carnegie Mellon University, Bob Colwell had authored a well-known paper comparing fundamental aspects of RISC and CISC architecture.3 I was highly persuaded by its primary conclusion, that at the most fundamental level RISC and CISC differ only in the way they represent and decode instructions. All other tenets of RISC architecture, such as large, general-purpose register files, pipelining, on-chip caches, and reliance on compiler optimization, were general advances in processor implementation tied to advances in VLSI technology. They were equally applicable to RISC and CISC architectures. By 1991, we were starting to see this convergence at the high end. Intel processors like the 486 and the upcoming Pentium had RISC-like microarchitectures and could match RISC processors in integer performance; however, in terms of the embedded processor landscape, RISC and CISC implementations were starting to diverge in disturbing ways.

Embedded systems developers focused on low cost, low power, high levels of integration, and absolute predictability at runtime. The heart of the embedded systems market was the 8-bit microcontroller. These chips, such as Intel’s 8051 and Motorola’s 68HC11 and their derivatives, included on-board program and data memory, integrated analog and digital I/O peripherals, on-chip clock generators, serial ports, timers, interrupt controllers, and low-power sleep modes. They typically used a fraction of a watt at peak performance and were provided in low-cost plastic packages with 40 to 80 pins. The processors were cheap. Systems built with them were cheap. They were small and required little power. Moreover, they were highly predictable. Developers would “cycle count” the critical paths in programs to determine performance, because the execution time of each instruction in the path was known and fixed. This allowed them to be used for real-time control. At the higher end of the embedded market, there were newer 16-bit microcontrollers that resembled the 8-bit parts at a higher cost and simple single-board computers based on older 8086 or 68000 family processors. These were smaller than PCs and lower power but also very predictable in operation. Although these solutions were suitable for many applications, it was clear that there was interest in higher performance and more capabilities. General-purpose computers were doubling in performance every year, and the embedded world was being left behind.

Almost all of the activity in CPU development focused on high-performance RISC processors for workstations and servers and RISC-like microarchitectures for CISC chips that could compete with them. In the past, CPU implementations would often start off as desktop processors, and then, as they became cheaper, they would be used for high-end embedded systems and then reused in integrated microcontrollers. This pathway seemed to be at an end. Performance was skyrocketing, but the very features that were contributing to these performance increases made them increasingly unsuitable for embedded use.

The most prominent of these features was the heavy dependence on on-chip caches to support high clock speeds and the high instruction bandwidth needed to provide one or more 32-bit instructions per clock. The problem for the embedded world was that nobody understood how to use a processor with caches in a real-time environment. We thought of caches as having probabilistic performance, and anyone could imagine a worst-case scenario in which a processor with caches might be 10 or more times slower than the typical case. This led to lots of ranting and raving in the embedded-systems community about missiles that wouldn’t fire, car engines that would explode, and airplanes that would drop out of the sky. It’s hard to believe in retrospect, but many smart people in 1991 thought that real-time embedded systems simply could not use processors with caches, and that there would be very limited further improvement in embedded processor performance.

Additionally, the caches that could fit on a chip at the time were never large enough to keep up with the processor core’s performance, and thus designers would try to fill up as much of the chip as possible with cache. This meant no room to integrate any other features, such as I/O, main memory, clock generators, or communications. Unlike the early days of the microprocessor, when Intel, Zilog, and Motorola would create an ecosystem of peripheral chips to support each new processor bus, these new RISC microprocessors were targeted toward large computer manufacturers that would develop proprietary support chips for each computer system. Building a complete system using the new RISC microprocessors required a much larger investment. Other factors made the cost of these microprocessors unsuitable for embedded use. As RISC architectures, they relied on high clock speeds and lots of instruction and data bandwidth. Although caches helped, these processors were in large ceramic packages with lots of I/O pins and power and ground pins. The SuperSPARC processor had about 300 pins. The R4000SC had about 450 pins. Even the low-end RISC processors had close to 200 pins. This meant that these were expensive parts that used lots of I/O power for off-chip communications. Designers also focused on clock rate at the expense of power. Many of these RISC CPUs used dynamic logic and thus (at the time) did not have a low-speed or sleep mode. Many designs (such as Pentium and SuperSPARC) had moved to Bi-CMOS circuits. Unlike today, where power is a major constraint, power had become a tool for increasing clock speed.

Along with all of these CPU implementation issues, embedded system designers were deterred by one fundamental aspect of RISC architectures—that they used fixed-length, 32-bit instructions. This, along with the fact that RISC compilers were all tuned for performance, meant that in practice RISC object code was much larger than CISC code. Some measurements showed that, for example, Sparc code could be three times as large as VAX code for the same program. This difference was not as pronounced for other CISC microprocessors, but there still seemed to be about a 50 percent code-storage penalty for using a RISC architecture as compared to CISC. In a system with a limited dollar or space budget for program ROM, this meant that a RISC system could include only two-thirds as much functionality as a CISC system. This seemed like a major disadvantage. Remember, at the time, a typical midrange embedded system might only have 32 Kbytes of program memory.
Given all these factors, as well as the fact that most RISC products were being promoted as high-end solutions, designers did not seriously consider RISC to be practical for embedded computing. Some earlier efforts had attempted to introduce embedded RISC architectures, most notably the AMD 29000 and the Intel i960, but these were low-integration, high-pin-count devices, mostly targeted for laser printers. Both teams were known to be working on more expensive superscalar versions of these chips rather than bringing them downmarket. In fact, as late as 1992, Intel still included the i960 in its Multimedia and Supercomputing databook, along with the i860 and i750, as opposed to in its two-volume Embedded Microcontrollers and Processors databook. ARM was deep into the development of the ARM610 for the Apple Newton, but the idea of an embedded ARM core as part of a third-party SoC was not well-known until several years later with the introduction of the ARM 7 core.

The embedded systems community was stuck in a rut. It was clear that the semiconductor industry was investing all of its money and effort into RISC and yet it was not at all clear how embedded systems would ever benefit from this investment. This was the problem we were trying to solve when we developed the CCRP.

Developing an Approach

By 1991, I had been struggling with this problem for some time. I had been looking for some unique architectural approach that would break past existing performance constraints. One thought was that if we brought the instruction set and datapath closer to the application level and replaced runtime interpretation with compile-time optimization, we could reduce code storage space and bandwidth and devote more hardware to calculation. As one of those projects, in 1988 I developed a computer based on the new Xilinx field-programmable gate array (FPGA) technology, with reconfigurable instruction decoding and datapaths. Early results were promising. Application-specific instruction sets could be very dense, and increases in instruction-level parallelism (ILP) of two to four times over conventional processors were easy to develop. I packaged my initial results into a paper to be presented at MICRO 21.

For the 1988–1989 academic year, I was invited to be a visiting graduate student in Ed McCluskey’s lab at Stanford. I presented the MICRO 21 paper to some of the Stanford faculty and students in the Computer Systems Lab. After that presentation, Mark Horowitz commented that an architectural improvement of two to four times was immaterial compared to a modern RISC processor. He explained that general-purpose RISC processors benefitted from enormous engineering efforts, including critical-path circuit tuning and access to the newest manufacturing processes. He claimed that this design effort, which was only affordable in a high-volume, general-purpose processor, would provide 10 to 100 times improvement in cycle time compared to my FPGA approach and that alternative application-specific approaches made sense only if they could exceed that. I basically disregarded the comments at the time as simply a defense of the status quo, but as I continued my research over the next couple of years, it became clear that he had observed a fundamental change in the computer industry. Because processors were becoming exponentially more complex, the ability to invest large amounts of money in their development was becoming as important as the core architectural ideas.

During the summer of 1989, my advisor, John Shen, arranged an internship at ESL/TRW to design a proof-of-concept system to track incoming missiles as part of the Strategic Defense Initiative. I started working on a VLIW-style board-level processor based on high-performance floating-point units (FPUs) from Weitek and microcoded instructions. Our objective was to get maximum performance within a dorm-fridge-sized 9U cabinet with a development budget that I think was a few million dollars, including application software. Sometime mid-summer, I read an IEEE Micro article describing the upcoming Intel i860XR 64-bit microprocessor4 and then quickly obtained the databook and programmer’s manual from Intel. The i860XR was expected to ship by September with the ability to do 80 Mflops, double-precision, at 40 MHz with a promise of a 50 MHz part soon behind. Dr. Horowitz’s prediction had already come true. The floating-point engines we could buy topped out at 20 MHz and did not have enough pin bandwidth to sustain double-precision calculations at that rate. If we built our own chips, we estimated that we could sustain 20 MHz, but even with substantial ILP, it would be difficult to reach 80 Mflops, and chip development would cost millions of dollars. Intel simply had been able to out-invest anything we could duplicate. They used a bleeding-edge 1,000-nm CMOS process (no giggling from millennials) and tuned the circuits using the best designers. They customized the memory technology in the caches, included DRAM support, and developed compilers tuned to the architecture. They raised the bar on everyone’s expectations, and did it in a processor that used 3 W. We did not know how to duplicate the performance per unit area, but at the same time we did not know how to build a reliable real-time system using a processor with such complex pipelining and heavy reliance on caches. I decided to prove to my team that I could “tame” a modern RISC microprocessor by using restrictive Fortran-style programming, modeling the cache misses and DRAM page misses, and limiting interrupts. We ended up building a four-processor-per-card message-passing architecture that provided up to 320 MFlops per board. I had “tamed” a leading-edge processor and used it as a practical digital signal processor in an embedded system. Were there other ways that this approach could be used to leverage the investments in high-performance RISC processors while maintaining the characteristics of embedded systems?

Developing a Solution

When Alex and I started to look for an embedded system problem to solve, I thought back to the Colwell paper about RISC and CISC. One of the problems in using RISC processors in embedded systems was that the code was so much bigger. Even in a higher-end system like a laser printer or an advanced car engine controller, which might benefit from the added performance, there often was no room and no budget to add 50 percent more ROM chips. The RISC code was simply too big; but the question was why? At a fundamental level, RISC code and CISC code were specifying the same amount of information. They were describing the same amount of computing to be done. There should be some way to express this in the same number of bits. If we were willing to accept a performance impact with RISC, could we make the code as small as CISC code? How big would that performance impact be?

We started to explore various ideas. One idea was to write CISC code for an existing architecture, then decode it into RISC code on instruction fetch. This was inspired in part by the decoded instruction cache that Dave Ditzel and Alan Berenbaum had developed in CRISP.5 We could use an existing CISC compiler and retain most of the RISC core—everything after the decode stage. However, we quickly lost enthusiasm when we looked at some details. There would be no way to specify and efficiently use all of the registers in the RISC. Also, at the time, we believed that much of the magic in RISC architecture was the ability to compile for one specific pipeline. We would lose that ability. We moved on to other ideas.

I was enamored with the idea of using a 16-bit instruction set that used a limited set of processor resources, used only the most common opcodes, supported only short immediates, and specified two registers instead of three. A mode-switch opcode would switch between short-instruction and long-instruction mode. I thought that we could get close to a 2:1 improvement in code density. Alex prototyped the system, but we couldn’t get it to work to our satisfaction. Performance dropped 30 to 50 percent, and the extra instructions meant that the code was not much smaller. I suspected that the real problem was that we were just not doing a good job of rewriting the program with the new instructions, and that a new compiler was needed. I had lots of ideas I wanted to try, but none would fit into Alex’s “graduate this semester” timetable, so we put it on hold and decided to try our remaining approach. Within two days, Alex was already getting promising results. Of course, ARM successfully implemented an alternative 16-bit instruction set with their Thumb instruction set that was announced around 1995.

The final approach was to just compress the program and then decompress it at runtime. We ran some programs through “compress” on a Unix system and got 40 to 50 percent size improvements. Of course, one problem we faced in executing compressed code is that we no longer knew where the code for any branch target had ended up. If we tried to patch branch targets in the compressed code with the new locations, the compression would change and the locations would move.

Our first inclination was that we would decompress the programs into RAM and run them uncompressed, but upon reflection that made little sense. The extra RAM required to hold the program would offset any savings in ROM. At that point, we came back to the idea of the decoded instruction cache. If the decompression happened on cache refill, then the cache could be our decompressed program RAM. Every RISC core would have an instruction cache. Since the cache would hold only uncompressed code, the CPU core would not even need to know about the compression and would need no modification. In fact, the cache itself would need no modification other than the refill engine, which would decompress the instructions. This was a great idea, but it quickly led to the next problem. Active cache lines were dispersed throughout the program. You could not decompress a “compress” file (which used the Lempel-Ziv-Welch algorithm) unless you started from the beginning and proceeded sequentially. The solution was to select a moderately large cache line size and to compress each correspondingly sized program block separately. That way we could decompress a single cache line at a time. For such small datasets, a preselected Huffman code worked well. We looked at various alternatives but decided that for the MIPS R2000 code in our experiments, treating each byte as a character (rather than each instruction) worked best.
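The build-time half of that flow can be sketched in the same spirit. The C below uses invented names, and the byte-oriented Huffman coder is stubbed out with a plain copy so the sketch stays self-contained and runnable; it shows only the structural part the paragraph above describes: cutting the program into cache-line-sized blocks, compressing each block independently, and emitting the table of block locations that the refill hardware consults.

/* Build-time sketch of CCRP-style block compression (illustrative only). */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define LINE_BYTES 32

/* Placeholder for the preselected, byte-oriented Huffman coder described
 * above; here it just copies the block. Returns the compressed size. */
static size_t encode_block(const uint8_t *src, size_t n, uint8_t *dst) {
    memcpy(dst, src, n);
    return n;
}

/* Compress `image` (program_size bytes) into `rom_out`, recording the starting
 * ROM offset of every compressed block in the line address table `lat`.
 * Returns the total compressed size. */
size_t build_ccrp_image(const uint8_t *image, size_t program_size,
                        uint8_t *rom_out, uint32_t *lat) {
    size_t rom_offset = 0;
    size_t num_lines  = (program_size + LINE_BYTES - 1) / LINE_BYTES;
    for (size_t i = 0; i < num_lines; i++) {
        size_t n = program_size - i * LINE_BYTES;    /* last block may be short */
        if (n > LINE_BYTES)
            n = LINE_BYTES;
        lat[i] = (uint32_t)rom_offset;               /* record block location   */
        rom_offset += encode_block(image + i * LINE_BYTES, n, rom_out + rom_offset);
    }
    return rom_offset;
}

Because every block is coded independently, any single cache line can be expanded on a miss, which is exactly what a whole-program LZW stream could not offer; the table of block offsets is what the next paragraph turns into the line address table and its look-aside buffer.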
The remaining problem was how to find a cache replacement line in the compressed code. Each block was no longer at its original address in memory because of the compression. We decided to incorporate a table of pointers to each cache line, similar to a page table that locates pages on a disk. The table would be stored in ROM with the code and could be referenced to find any block of the program. We included a cache for the table, a cache line address look-aside buffer (like a translation look-aside buffer for a page table), that would eliminate the extra memory read. Even with all the overhead of block coding and the tables, we could still substantially reduce the code size.

We had all the pieces for a workable system. We built a simulation, measured compression rates, and simulated performance. We wrote up the results into a paper, and Alex prepared his master’s thesis. All we needed were the final two keys to publishing a computer architecture paper in 1992: we needed to get RISC into the title, and we needed a four-letter acronym for our project.

Aftermath

The paper was well-received and inspired many researchers to improve on our methods. There were more than 100 follow-on papers that improved coding efficiency and system performance. At least two teams implemented the technique in commercial processors. Josh Fisher and Paolo Faraboschi included code compression in their LX VLIW processor at Hewlett Packard. IBM incorporated it into the CodePack technology in certain PowerPC processors. However, in looking back at the effort and its impact, what I find most interesting is that it represented a change in thinking about how to do embedded systems architecture research. Many researchers with expertise in high-performance architecture began to think about how the investment in high-performance processors could be leveraged into the embedded space. Researchers who saw this work presented at MICRO and other presentations, as well as other researchers who simply noted the same technology shifts through their own experiences, began to propose other problems that could be solved using a similar approach. This reinvigorated embedded systems as a research topic and led to widespread advancement in the following years.

In retrospect, this change in how we thought about embedded computing was inevitable. The changes that I observed were rapidly impacting the entire industry as the cost and complexity of developing processors and compilers escalated. The development of the ARM7 and eventually the synthesizable ARM7 TDMI had likely started. This would lead to the development of other licensable peripheral cores and buses that would be used with ARM cores to build RISC microcontrollers and SoCs for embedded systems. Kurt Keutzer at Berkeley was also already working on code compression techniques for RISC. Other research groups began exploring other ways to leverage mainstream RISC architectures for embedded systems applications and thus began developing system-wide approaches to managing power and predictability. The availability of open source tools such as compilers, simulators, and operating systems increased the opportunities for embedded systems research in both academia and industry. Interestingly enough, we are now reaching a point where the development model is beginning to switch. The embedded market—including general-purpose devices such as mobile phones that have severe cost, power, and packaging constraints and real-time requirements—now dominates the processor market. Embedded designs are evolving into servers and infrastructure in which cost, power, and packaging are important. Now we need to learn how to leverage the investment in embedded computing to support the general-purpose computing markets.

References
1. J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 1991.
2. A. Wolfe and A. Chanin, “Executing Compressed Programs on an Embedded RISC Architecture,” Proc. 25th Ann. Int’l Symp. Microarchitecture, 1992, pp. 81–91.
3. R.P. Colwell et al., “Instruction Sets and Beyond: Computers, Complexity, and Controversy,” Computer, vol. 18, no. 9, 1985, pp. 8–19.
4. L. Kohn and N. Margulis, “Introducing the Intel i860 64-Bit Microprocessor,” IEEE Micro, vol. 9, no. 4, 1989, pp. 15–30.
5. D.R. Ditzel, H.R. McLellan, and A.D. Berenbaum, “The Hardware Architecture of the CRISP Microprocessor,” Proc. 14th Ann. Int’l Symp. Computer Architecture, 1987, pp. 309–319.

Andy Wolfe is a consultant and an adjunct professor at Santa Clara University. Contact him at [email protected].
