
Instruction Sets and Beyond: Computers, Complexity, and Controversy

Robert P. Colwell, Charles Y. Hitchcock III, E. Douglas Jensen, H. M. Brinkley Sprunt, and Charles P. Kollar
Carnegie-Mellon University

0018-9162/85/0900-0008$01.00 1985 IEEE. COMPUTER, September 1985.

The avalanche of publicity received [...] issues. RISC [...] have guided [...] years. A study of [...] yield a deeper understanding of hardware/software [...], computer performance, the influence of VLSI on design, and many other topics. Articles on RISC research, however, often fail to explore these topics properly and can be misleading. Further, the few papers that present comparisons with complex instruction set design often do not address the same issues. As a result, even careful study of the literature is likely to give a distorted view of this area of research. This article offers a useful perspective of RISC/Complex Instruction Set Computer research, one that is supported by recent work at Carnegie-Mellon University.

Much RISC literature is devoted to discussions of the size and complexity of computer instruction sets. These discussions are extremely misleading. Instruction set design is important, but it should not be driven solely by adherence to convictions about design style, RISC or CISC. The focus of discussion should be on the more general question of the assignment of system functionality to implementation levels within an architecture. This point of view encompasses the instruction set (CISCs tend to install functionality at lower system levels than RISCs) but also takes into account other design features such as register sets, coprocessors, and caches.

While the implications of RISC research extend beyond the instruction set, even within the instruction set domain, there are limitations that have not been identified. Typical RISC papers give few clues about where the RISC approach might break down. Claims are made for faster machines that are cheaper and easier to design and that "map" particularly well onto VLSI technology. It has been said, however, that "Every complex problem has a simple solution... and it is wrong."

RISC ideas are not "wrong," but a simple-minded view of them would be. RISC theory has many implications that are not obvious. Research in this area has helped focus attention on some important issues in computer architecture whose resolutions have too often been determined by defaults; yet RISC proponents often fail to discuss the application, architecture, and implementation contexts in which their assertions seem justified.

While RISC advocates have been vocal concerning their design methods and theories, CISC advocates have been disturbingly mute. This is not a healthy state of affairs. Without substantive, reported CISC research, many RISC arguments are left uncountered and, hence, out of perspective. The lack of such reports is due partially to the proprietary nature of most commercial CISC designs and partially to the fact that industry designers do not generally publish as much as academics. Also, the CISC design style has no coherent statement of design principles, and CISC designers do not appear to be actively working on one. This lack of a manifesto differentiates the CISC and RISC design styles and is the result of their different historical developments.

Towards defining a RISC

Since the earliest digital electronic computers, instruction sets have tended to grow larger and more complex. The 1948 MARK-1 had only seven instructions of minimal complexity, such as adds and simple jumps, but a contemporary machine like the VAX has hundreds of instructions. Furthermore, its instructions can be rather complicated, like atomically inserting an element into a doubly linked list or evaluating a floating point polynomial of arbitrary degree. Any high performance implementation of the VAX, as a result, has to rely on complex implementation techniques such as pipelining, prefetching, and multi-cycle instruction execution.

This progression from small and simple to large and complex instruction sets is striking in the development of single-chip processors within the past decade. Motorola's 68020, for example, carries 11 more addressing modes than the 6800, more than twice as many instructions, and support for an instruction cache and coprocessors. Again, not only has the number of addressing modes and instructions increased, but so has their complexity.

This general trend toward CISC machines was fueled by many things, including the following:

* New models are often required to be upward-compatible with existing models in the same computer family, resulting in the supersetting and proliferation of features.
* Many computer designers tried to reduce the "semantic gap" between programs and computer instruction sets. By adding instructions semantically closer to those used by programmers, these designers hoped to reduce software costs by creating a more easily programmed machine. Such instructions tend to be more complex because of their higher semantic level. (It is often the case, however, that instructions with high semantic content do not exactly match those required for the language at hand.)
* In striving to develop faster machines, designers constantly moved functions from software to microcode and from microcode to hardware, often without concern for the adverse effects that an added architectural feature can have on an implementation. For example, addition of an instruction requiring an extra level of decoding logic can slow a machine's entire instruction set. (This is called the "n + 1" phenomenon.¹)
* Tools and methodologies aid designers in handling the inherent complexity of large architectures. Current CAD tools and microcoding support programs are examples.

Microcode is an interesting example of a technique that encourages complex designs in two ways. First, it provides a structured means of effectively creating and altering the algorithms that control execution of numerous operations and complex instructions in a computer. Second, the proliferation of CISC features is encouraged by the quantum nature of microcode memories; it is relatively easy to add another addressing mode or obscure instruction to a machine which has not yet used all of its microcode space.

Instruction traces from CISC machines consistently show that few of the available instructions are used in most computing environments. This situation led IBM's John Cocke, in the early 70's, to contemplate a departure from traditional computer styles. The result was a research project based on an ECL machine that used a very advanced compiler, creatively named "801" for the research group's building number. Little has been published about that project, but what has been released speaks for a principled and coherent research effort.

The 801's instruction set was based on three design principles. According to Radin,² the instruction set was to be that set of run-time operations that

* could not be moved to compile time,
* could not be more efficiently executed by object code produced by a compiler that understood the high-level intent of the program, and
* could be implemented in random logic more effectively than the equivalent sequence of software instructions.

The machine relied on a compiler that used many optimization strategies for much of its effectiveness, including a powerful scheme of register allocation. The hardware implementation was guided by a desire for leanness and featured hardwired control and single-cycle instruction execution. The architecture was a 32-bit load/store machine (only load and store instructions accessed memory) with 32 registers and single-cycle instructions. It had separate instruction and data caches to allow simultaneous access to code and operands.

Some of the basic ideas from the 801 research reached the West Coast in the mid 70's. At the University of California at Berkeley, these ideas grew into a series of graduate courses that produced the RISC I* (followed later by the RISC II) and the numerous CAD tools that facilitated its design. These courses laid the foundation for related research efforts in performance evaluation, computer-aided design, and computer implementation.

*Please note that the term "RISC" is used throughout this article to refer to all research efforts concerning Reduced Instruction Set Computers, while the term "RISC I" refers specifically to the Berkeley research project.

The RISC I processor,³ like the 801, is a load/store machine that executes most of its instructions in a single cycle. It has only 31 instructions, each of which fits in a single 32-bit word and uses practically the same encoding format. A special feature of the RISC I is its large number of registers, well over a hundred, which are used to form a series of overlapping register sets. This feature makes procedure calls on the RISC I less expensive in terms of processor-memory bus traffic.

Soon after the first RISC I project at Berkeley, a processor named MIPS (Microprocessor without Interlocked Pipe Stages) took shape at Stanford. MIPS is a pipelined, single-chip processor that relies on innovative software to ensure that its pipeline resources are properly managed. (In machines such as the IBM System/360 Model 91, pipeline interstage interlocking is performed at run-time by special hardware.) By trading hardware for compile-time software, the Stanford researchers were able to expose and use the inherent internal parallelism of their fast computing engine.

These three machines, the 801, RISC I, and MIPS, form the core of RISC research machines, and share a set of common features. We propose the following elements as a working definition of a RISC:

(1) Single-cycle operation facilitates the rapid execution of simple functions that dominate a computer's instruction stream and promotes a low interpretive overhead.
(2) Load/store design follows from a desire for single-cycle operation.
(3) Hardwired control provides for the fastest possible single-cycle operation. Microcode leads to slower control paths and adds to interpretive overhead.
(4) Relatively few instructions and addressing modes facilitate a fast, simple interpretation by the control engine.
(5) Fixed instruction format with consistent use eases the hardwired decoding of instructions, which again speeds control paths.
(6) More compile-time effort offers an opportunity to explicitly move static run-time complexity into the compiler. A good example of this is the software pipeline reorganizer used by MIPS.

A consideration of the two companies that claim to have created the first commercial "RISC" computer, Ridge Computers and Pyramid Technology, illustrates why a definition is needed. Machines of each firm have restricted instruction formats, a feature they share with RISC machines. Pyramid's machine is not a load/store computer, however, and both Ridge and Pyramid machines have variable length instructions and use multiple-cycle interpretation and microcoded control engines. Further, while their instruction counts might seem reduced when compared to a VAX, the Pyramid has almost 90 instructions and the Ridge has over 100. The use of microcoding in these machines is for price and performance reasons. The Pyramid machine also has a system of multiple register sets derived from the Berkeley RISC I, but this feature is orthogonal to RISC theory. These may be successful machines, from both technological and marketing standpoints, but they are not RISCs.

The six RISC features enumerated above can be used to weed out misleading claims and provide a springboard for points of debate. Although some aspects of this list may be arguable, it is useful as a working definition.

Points of attention and contention

There are two prevalent misconceptions about RISC and CISC. The first is due to the RISC and CISC acronyms, which seem to imply that the domain for discussion should be restricted to selecting candidates for a machine's instruction set. Although specification format and number of instructions are the primary issues in most RISC literature, the best generalization of RISC theory goes well beyond them. It connotes a willingness to make design tradeoffs freely and consciously across architecture/implementation, hardware/software, and compile-time/run-time boundaries in order to maximize performance as measured in some specific context.

The RISC and CISC acronyms also seem to imply that any machine can be classified as one or the other and that
the primary task confronting an architect is to choose the most appropriate design style for a particular application. But the classification is not a dichotomy. RISCs and CISCs are at different corners of a continuous multidimensional design space. The need is not for an algorithm by which one can be chosen: rather, the goal should be the formulation of a set of techniques, drawn from CISC experiences and RISC tenets, which can be used by a designer in creating new systems.⁴⁻⁶

One consequence of the us-or-them attitude evinced by most RISC publications is that the reported performance of a particular machine (e.g., RISC I) can be hard to interpret if the contributions made by the various design decisions are not presented individually. A designer faced with a large array of choices needs guidance more specific than a monolithic, all-or-nothing performance measurement.

An example of how the issue of scope can be confused is found in a recent article.⁷ By creating a machine with only one instruction, its authors claim to have delimited the RISC design space to their machine at one end of the space and the RISC I (with 31 instructions) at the other end. This model is far too simplistic to be useful; an absolute number of instructions cannot be the sole criterion for categorizing an architecture as RISC or CISC. It ignores aspects of addressing modes and their associated complexity, fails to deal with compiler/architecture coupling, and provides no way to evaluate the implementation of other non-instruction set design decisions such as register files, caches, memory management, floating point operations, and co-processors.

Another fallacy is that the total system is composed of hardware, software, and application code. This leaves out the operating system, and the overhead and the needs of the operating system cannot be ignored in most systems. This area has received far too little attention from RISC research efforts, in contrast to the CISC efforts focused on this area.⁸,⁹

An early argument in favor of RISC design was that simpler designs could be realized more quickly, giving them a performance advantage over complex machines. In addition to the economic advantages of getting to market first, the simple design was supposed to avoid the performance disadvantages of introducing a new machine based on relatively old implementation technology. In light of these arguments, DEC's MicroVAX-32¹⁰ is especially interesting.

The VAX easily qualifies as a CISC. According to published reports, the MicroVAX-32, a VLSI implementation of the preponderance of the VAX instruction set, was designed, realized, and tested in a period of several months. One might speculate that this very short gestation period was made possible in large part by DEC's considerable expertise in implementing the VAX architecture (existing products included the 11/780, 11/750, 11/730, and VLSI-VAX). This shortened design time would not have been possible had DEC not first created a standard instruction set. Standardization at this level, however, is precisely what RISC theory argues against. Such standards constrain the unconventional RISC hardware/software tradeoffs. From a commercial standpoint, it is significant that the MicroVAX-32 was born into a world where compatible assemblers, compilers, and operating systems abound, something that would certainly not be the case for a RISC design.

Such problems with RISC system designs may encourage commercial RISC designers to define a new level of standardization in order to achieve some of the advantages of multiple implementations supporting one standard interface. A possible choice for such an interface would be to define an intermediate language as the target for all compilation. The intermediate language would then be translated into optimal machine code for each implementation. This translation process would simply be performing resource scheduling at a very low level (e.g., pipeline management and register allocation).

It should be noted that the MicroVAX-32 does not directly implement all VAX architecture. The suggestion has been made that this implementation somehow supports the RISC inclination toward emulating complex functions in software. In a recent publication, David Patterson observed:

"Although I doubt DEC is calling them RISCs, I certainly found it interesting that DEC's single chip [VAXes] do not implement the whole VAX instruction set. A MicroVAX traps when it tries to execute some infrequent but complicated operations, and invokes transparent software routines that simulate those complicated instructions."¹¹

The insinuation that the MicroVAX-32 follows in a RISC tradition is unreasonable. It does not come close to fitting our definition of a RISC; it violates all six RISC criteria. To begin with, any VAX by definition has a variable-length instruction format and is not a load/store machine. Further, the MicroVAX-32 has multicycle instruction execution, relies on a microcoded control engine, and interprets the whole array of VAX addressing modes. Finally, the MicroVAX-32 executes 175 instructions on-chip, hardly a reduced number.

A better perspective on the MicroVAX-32 shows that there are indeed cost/performance ranges where microcoded implementation of certain functions is inappropriate and software emulation is better. The importance of carefully making this assignment of function to implementation level (software, microcode, or hardware) has been amply demonstrated in many RISC papers. Yet this basic concern is also evidenced in many CISC machines. In the case of the MicroVAX-32, floating point instructions are migrated either to a coprocessor chip or to software emulation routines. The numerous floating-point chips currently available attest to the market reception for this partitioning. Also migrated to emulation are the console, decimal, and string instructions. Since many of these instructions are infrequent, not time-critical, or not generated by many compilers, it would be difficult to fault this approach to the design of an inexpensive VAX. The MicroVAX-32 also shows that it is still possible for intelligent, competent computer designers who understand the notion of correct function-to-level mapping to find microcoding a valuable technique. Published RISC work, however, does not accommodate this possibility.

The application environment is also of crucial importance in system design. The RISC I instruction set was designed specifically to run the C language efficiently, and it appears reasonably successful. The RISC I researchers have also investigated the Smalltalk-80 computing environment.¹² Rather than evaluate RISC I as a Smalltalk engine, however, the RISC I researchers designed a new RISC and report encouraging performance results from simulations. Still, designing a processor to run a single language well is different from creating a single machine such as the VAX that must exhibit at least acceptable performance for a wide range of languages. While RISC research offers valuable insights on a per-language basis, more emphasis on cross-language anomalies, commonalities, and tradeoffs is badly needed.

Especially misleading are RISC claims concerning the amount of design time saved by creating a simple machine instead of a complex one. Such claims sound reasonable. Nevertheless, there are substantial differences in the design environments for an academic one-of-a-kind project (such as MIPS or RISC I) and a machine with lifetime measured in years that will require substantial software and support investments. As was pointed out in a recent Electronics Week article, R. D. Lowry, market development manager for Denelcor,
noted that "commercial-product development teams generally start off a project by weighing the profit and loss impacts of design decisions."¹³ Lowry is quoted as saying, "A university doesn't have to worry about that, so there are often many built-in deadends in projects. This is not to say the value of their research is diminished. It does, however, make it very difficult for someone to reinvent the system to make it a commercial product." For a product to remain viable, a great deal of documentation, user training, coordination with fabrication or production facilities, and future upgrades must all be provided. It is not known how these factors might skew a design-time comparison, so all such comparisons should be viewed with suspicion.

Even performance claims, perhaps the most interesting of all RISC assertions, are ambiguous. Performance as measured by narrowly compute-bound, low-level benchmarks that have been used by RISC researchers (e.g., calculating a Fibonacci series recursively) is not the only metric in a computer system. In some, it is not even one of the most interesting. For many current computers, the only useful performance index is the number of transactions per second, which has no direct or simple correlation to the time it takes to calculate Ackermann's function. While millions of instructions per second might be a meaningful metric in some computing environments, reliability, availability, and response time are of much more concern in others, such as space and aviation computing. The extensive error checking incorporated into these machines at every level may slow the basic clock time and substantially diminish performance. Reduced performance is tolerable, but downtime may not be. In the extreme, naive application of the RISC rules for designing an instruction set might result in a missile guidance computer optimized for running its most common task: diagnostics. In terms of instruction frequencies, of course, flight control applications constitute a trivial special case and would not be given much attention. It is worth emphasizing that in efforts to quantify performance and apply those measurements to system design, one must pay attention not just to instruction execution frequencies, but also to cycles consumed per instruction execution. Levy and Clark make this point regarding the VAX instruction set,¹⁴ but it has yet to appear in any papers on RISC.

When performance, such as throughput or transactions per second, is a first-order concern, one is faced with the task of quantifying it. The Berkeley RISC I efforts to establish the machine's throughput are laudable, but
before sweeping conclusions are drawn one must carefully examine the benchmark programs used. As Patterson noted:

"The performance predictions for [RISC I and RISC II] were based on small programs. This small size was dictated by the reliability of the simulator and compiler, the available simulation time, and the inability of the first simulators to handle UNIX system calls."¹¹

Some of these "small" programs actually execute millions of instructions, yet they are very narrow programs in terms of the scope of function. For example, the Towers of Hanoi program, when executing on the 68000, spends over 90 percent of its memory accesses in procedure calls and returns. The RISC I and II researchers recently reported results from a large benchmark,¹¹ but the importance of large, heterogeneous benchmarks in performance measurement is still lost on many commercial and academic computer evaluators who have succumbed to the misconception that "micro-benchmarks" represent a useful measurement in isolation.

Multiple register sets

Probably the most publicized RISC-style processor is the Berkeley RISC I. The best-known feature of this chip is its large register file, organized as a series of overlapping register sets. This is ironic, since the register file is a performance feature independent of any RISC (as defined earlier) aspect of the processor. Multiple register sets could be included in any general-purpose register machine.

It is easy to believe that MRSs can yield performance benefits, since procedure-based, high-level languages typically use registers for information specific to a procedure. When a procedure call is performed, the information must be saved, usually on a memory stack, and restored on a procedure return. These operations are typically very time consuming due to the intrinsic data transfer requirements. RISC I uses its multiple register sets to reduce the frequency of this register saving and restoring. It also takes advantage of an overlap between register sets for parameter passing, reducing even further the memory reads and writes necessary.¹⁵

RISC I has a register file of 138 32-bit registers organized into eight overlapping "windows." In each window, six registers overlap the next window (for outgoing parameters and incoming results). During any procedure, only one of these windows is actually accessible. A procedure call changes the current window to the next
window by incrementing a pointer, and the six outgoing parameter registers become the incoming parameters of the called procedure. Similarly, a procedure return changes the current window to the previous window, and the outgoing result registers become the incoming result registers of the calling procedure. If we assume that six 32-bit registers are enough to contain the parameters, a procedure call involves no actual movement of information (only the window pointer is adjusted). The finite on-chip resources limit the actual savings due to register window overflows and underflows.³

It has been claimed that the small control area needed to implement the simple instruction set of a VLSI RISC leaves enough chip area for the large register file.³ The relatively small amount of control logic used by a RISC does free resources for other uses, but a large register file is not the only way to use them, nor even necessarily the best. For example, designers of the 801 and MIPS chose other ways to use their available hardware; these RISCs have only a single, conventionally sized register set. Caches, floating-point hardware, and interprocess communication support are a few of the many possible uses for those resources "freed" by a RISC's simple instruction set. Moreover, as chip technology improves, the tradeoffs between instruction set complexity and architecture/implementation features become less constrained. Computer designers will always have to decide how to best use available resources and, in doing so, should realize which relations are intrinsic and which are not.

The Berkeley papers describing the RISC I and RISC II processors claimed their resource decisions produced large performance improvements, two to four times over CISC machines like the VAX and the 68000.³,¹¹ There are many problems with these results and the methods used to obtain them. Foremost, the performance effects of the reduced instruction set were not decoupled from those of the overlapped register windows. Consequently, these reports shed little light on the RISC-related performance of the machine, as shown below.

Some performance comparisons between different machines, especially early ones, were based on simulated benchmark execution times. While absolute speed is always interesting, other metrics less implementation-dependent can provide design information more useful to computer architects, such as data concerning the processor-memory traffic necessary to execute a series of benchmarks. It is difficult to draw firm conclusions from comparisons of vastly different machines unless some effort has been
September 1985 15 made to factor out implementation- dependent features not being com- pared (e.g., caches and floating point accelerators). Experiments structured to accom- modate these reservations were con- ducted at CMU to test the hypothesis that the effects of multiple register sets are orthogonal to instruction set com- plexity. 16 Specifically, the goal was to see if the performance effects ofMRSs were comparable for RISCs and CISCs. Simulators were written for two CISCs (the VAX and the 68000) Figure 1. Total processor-memory traffic for benchmarks on the standard without MRSs, with non-overlapping VAX and two modified VAX computers, one with multiple register sets and MRSs and with overlapping MRSs. one with overlapped multiple register sets. Simulators were also written for the RISC I, RISC I with non-overlapping register sets, and RISC I with only a single register set. In each of the simulators, care was taken not to change the initial architectures any more than absolutely necessary to add or remove MRSs. Instead of simulat- ing execution time, the total amount of processor-memory traffic (bytes read and written) for each benchmark was recorded for comparison. To use this data fairly, only different register set versions of the same architecture were compared so the ambiguities that arise from comparing different architec- Figure 2. Total processor-memory traffic for benchmarks on the standard 68000 and two modified 68000s, one with multiple register sets and one with tures like the RISC I and the VAX were overlapped multiple register sets. avoided. The benchmarks used were the same ones originally used to evalu- ate RISC I. A summary of the experi- ments and their results are presented by Hitchcock and Sprunt. 17 As expected, the results show a sub- stantial difference in processor-mem- ory traffic for an architecture with and without MRSs. 
The MRS versions of both the VAX and 68000 show marked decreases in processor-memory traffic for procedure-intensive benchmarks, shown in Figures 1 and 2. Similarly, the single register set version of RISC I requires many more memory reads and writes than RISC I with overlapped register sets (Figure 3). This result is due in part to the method used for handling register set overflow and underflow, which was kept the same for all three variations. With a more intelligent scheme, the single register set RISC I actually required fewer bytes of memory traffic on Ackermann's function than its multiple register set counterparts. For benchmarks with very few procedure calls (e.g., the sieve of Eratosthenes), the single register set version has the same amount of processor-memory traffic as the MRS version of the same architecture.17

Figure 3. Total processor-memory traffic for benchmarks on the standard RISC I and two modified RISC I's, one with no overlap between register sets and one with only one register set.

Clearly, MRSs can affect the amount of processor-memory traffic necessary to execute a program. A significant amount of the performance of RISC I for procedure-intensive environments has been shown to be attributable to its scheme of overlapped register sets, a feature independent of instruction-set complexity. Thus, any performance claims for reduced instruction set computers that do not remove effects due to multiple register sets are inconclusive, at best.

These CMU experiments used benchmarks drawn from other RISC research efforts for the sake of continuity and consistency. Some of the benchmarks, such as Ackermann, Fibonacci, and Hanoi, actually spend most of their time performing procedure calls. The percentage of the total processor-memory traffic due to "C" procedure calls for these three benchmarks on the single register set version of the 68000 ranges from 66 to 92 percent. As was expected, RISC I, with its overlapped register structure that allows procedure calls to be almost free in terms of processor-memory bus traffic, did extremely well on these highly recursive benchmarks when compared to machines with only a single register set. It has not been established, however, that these benchmarks are representative of any computing environment.

The 432

The Intel 432 is a classic example of a CISC. It is an object-oriented VLSI microprocessor chip-set designed expressly to provide a productive Ada programming environment for large-scale, multiple-process, multiple-processor systems. Its architecture supports object orientation such that every object is protected uniformly without regard to traditional distinctions such as "supervisor/user mode" or "system/user data structures." The 432 has a very complex instruction set. Its instructions are bit-encoded and range in length from six to 321 bits. The 432 incorporates a significant degree of functional migration from software to on-chip microcode. The interprocess communication SEND primitive is a 432 machine instruction, for instance.

Published studies of the performance of the Intel 432 on low-level benchmarks (e.g., towers of Hanoi18) show that it is very slow, taking 10 to 20 times as long as the VAX 11/780. Such a design, then, invites scrutiny in the RISC/CISC controversy.

One is tempted to blame the machine's object-oriented runtime environment for imposing too much overhead. Every memory reference is checked to ensure that it lies within the boundaries of the referenced object, and the read/write protocols of the executing context are verified. RISC proponents argue that the complexity of the 432 architecture, and the additional decoding required for a bit-encoded instruction stream contribute to its poor performance. To address these and other issues, a detailed study of the 432 was undertaken to evaluate the effectiveness of the architectural mechanisms provided in support of its intended runtime environment. The study concentrated on one of the central differences in the RISC and CISC design styles: RISC designs avoid hardware/microcode structures intended to support the runtime environment, attempting instead to place equivalent functionality into the compiler or software. This is contrary to the mainstream of instruction set design, which reflects a steady migration of such functionality from higher levels (software) to lower ones (microcode or hardware) in the expectation of improved performance.

This investigation should include an analysis of the 432's efficiency in executing large-system code, since executing such code well was the primary design goal of the 432. Investigators used the Intel 432 microsimulator, which yields cycle-by-cycle traces of the machine's execution. While this microsimulator is well-suited to simulating small programs, it is quite unwieldy for large ones. As a result, the concentration here is on the low-level benchmarks that first pointed out the poor 432 performance.

Simulations of these benchmarks revealed several performance problems with the 432 and its compiler:

(1) The 432's Ada compiler performs almost no optimization. The machine is frequently forced to make unnecessary changes to its complex addressing environment, and it often recomputes costly, redundant subexpressions. This recomputation seriously skews many results from benchmark comparisons. Such benchmarks reflect the performance of the present version of the 432 but show very little about the efficacy of the architectural tradeoffs made in that machine.

(2) The bandwidth of 432 memory is limited by several factors. The 432 has no on-chip data caching, no instruction stream literals, and no local data registers. Consequently, it makes far more memory references than it would otherwise have to. These reference requirements also make the code size much larger, since many more bits are required to reference data within an object than within a local register.

And because of pin limitations, the 432 must multiplex both data and address information over only 16 pins. Also, the standard Intel 432/600 development system, which supports shared-memory multiprocessing, uses a slow asynchronous bus that was designed more for reliability than throughput. These implementation factors combine to make wait states consume 25 to 40 percent of the processor's time on the benchmarks.

(3) On highly recursive benchmarks, the object-oriented overhead in the 432 does indeed appear in the form of a slow procedure call. Even here, though, the performance problems should not be attributed to object orientation or to the machine's intrinsic complexity. Designers of the 432 made a decision to provide a new, protected context for every procedure call; the user has no option in this respect. If an unprotected call mechanism were used where appropriate, the Dhrystone benchmark19 would run 20 percent faster.

(4) Instructions are bit-aligned, so the 432 must almost of necessity decode the various fields of an instruction sequentially. Since such decoding often overlaps with instruction execution, the 432 stalls three percent of the time while waiting for the instruction decoder. This percentage will get worse, however, once other problems above are eliminated.

Colwell provides a detailed treatment of this experiment and its results.20

This 432 experiment is evidence that RISC's renewed emphasis on the importance of fast instruction decoding and fast local storage (such as caches or registers) is substantiated, at least for low-level compute-bound benchmarks. Still, the 432 does not provide compelling evidence that large-scale migration of function to microcode and hardware is ineffective. On the contrary, Cox et al.21 demonstrated that the 432 microcode implementation of interprocess communication is much faster than an equivalent software version. On these low-level benchmarks, the 432 could have much higher performance with only a better compiler and minor changes to its implementation. Thus, it is wrong to conclude that the 432 supports the general RISC point of view.

In spite of, and sometimes because of, the wide publicity given to current RISC and CISC research, it is not easy to gain a thorough appreciation of the important issues. Articles on RISC research are often oversimplified, overstated, and misleading, and papers on CISC design offer no coherent design principles for comparison. RISC/CISC issues are best considered in light of their function-to-implementation level assignment. Strictly limiting the focus to instruction counts or other oversimplifications can be misleading or meaningless.

Some of the more subtle issues have not been brought out in current literature. Many of these are design considerations that do not lend themselves to the benchmark level analysis used in RISC research. Nor are they always properly evaluated by CISC designers, guided so frequently by tradition and corporate economics.

RISC/CISC research has a great deal to offer computer designers. These contributions must not be lost due to an illusory and artificial dichotomy. Lessons learned studying RISC machines are not incompatible with or mutually exclusive of the rich tradition of computer design that preceded them. Treating RISC ideas as perspectives and techniques rather than dogma and understanding their domains of applicability can add important new tools to a computer designer's repertoire.

Acknowledgements

We would like to thank the innumerable individuals, from industry and academia, who have shared their thoughts on this matter with us and stimulated many of our ideas. In particular, we are grateful to George Cox and Konrad Lai of Intel for their help with the 432 microsimulator.

This research was sponsored in part by the Department of the Army under contract DAA B07-82-C-J164.

References

1. J. Hennessy et al., "Hardware/Software Tradeoffs for Increased Performance," Proc. Symp. Architectural Support for Programming Languages and Operating Systems, 1982, pp. 2-11.
2. G. Radin, "The 801 Minicomputer," Proc. Symp. Architectural Support for Programming Languages and Operating Systems, 1982, pp. 39-47.
3. D. A. Patterson and C. H. Sequin, "A VLSI RISC," Computer, Vol. 15, No. 9, Sept. 1982, pp. 8-21.
4. R. P. Colwell, C. Y. Hitchcock III, and E. D. Jensen, "A Perspective on the Processor Complexity Controversy," Proc. Int. Conf. Computer Design: VLSI in Computers, 1983, pp. 613-616.
5. D. Hammerstrom, "Tutorial: The Migration of Function into Silicon," 10th Ann. Int'l Symp. Computer Architecture, 1983.
6. J. C. Browne, "Understanding Execution Behavior of Software Systems," Computer, Vol. 17, No. 7, July 1984, pp. 83-87.
7. H. Azaria and D. Tabak, "The MODHEL Microcomputer for RISCs Study," Microprocessing and Microprogramming, Vol. 12, No. 3-4, Oct.-Nov. 1983, pp. 199-206.
8. G. C. Barton, "Sentry: A Novel Hardware Implementation of Classic Operating System Mechanisms," Proc. Ninth Ann. Int'l Symp. Computer Architecture, 1982, pp. 140-147.
9. A. D. Berenbaum, M. W. Condry, and P. M. Lu, "The Operating System and Language Support Features of the BELLMAC-32 Microprocessor," Proc. Symp. Architectural Support for Programming Languages and Operating Systems, 1982, pp. 30-38.
10. J. Hennessy, "VLSI Processor Architecture," IEEE Transactions on Computers, Vol. C-33, No. 12, Dec. 1984, pp. 1221-1246.
11. D. Patterson, "RISC Watch," Computer Architecture News, Vol. 12, No. 1, Mar. 1984, pp. 11-19.
12. David Ungar et al., "Architecture of SOAR: Smalltalk on a RISC," 11th Ann. Int'l Symp. Computer Architecture, 1984, pp. 188-197.
13. W. R. Iversen, "Money Starting to Flow As Parallel Processing Gets Hot," Electronics Week, Apr. 22, 1985, pp. 36-38.
14. H. M. Levy and D. W. Clark, "On the Use of Benchmarks for Measuring System Performance," Computer Architecture News, Vol. 10, No. 6, 1982, pp. 5-8.
15. D. C. Halbert and P. B. Kessler, "Windows of Overlapping Register Frames," CS292R Final Reports, University of California, Berkeley, June 9, 1980.
16. R. P. Colwell, C. Y. Hitchcock III, and E. D. Jensen, "Peering Through the RISC/CISC Fog: An Outline of Research," Computer Architecture News, Vol. 11, No. 1, Mar. 1983, pp. 44-50.
17. C. Y. Hitchcock III and H. M. B. Sprunt, "Analyzing Multiple Register Sets," 12th Ann. Int'l Symp. Computer Architecture, 1985, in press.
18. P. M. Hansen et al., "A Performance Evaluation of the Intel iAPX 432," Computer Architecture News, Vol. 10, No. 4, June 1982, pp. 17-27.
19. R. P. Weicker, "Dhrystone: A Synthetic Systems Programming Benchmark," Comm. ACM, Vol. 27, No. 10, Oct. 1984, pp. 1013-1030.
20. R. P. Colwell, "The Performance Effects of Functional Migration and Architectural Complexity in Object-Oriented Systems," PhD thesis, Carnegie-Mellon University, Pittsburgh, PA. Expected completion in June, 1985.
21. G. W. Cox et al., "Interprocess Communication and Processor Dispatching on the Intel 432," ACM Trans. Computer Systems, Vol. 1, No. 1, Feb. 1983, pp. 45-66.

Robert P. Colwell recently completed his doctoral dissertation on the performance effects of migrating functions into silicon, using the Intel 432 as a case study. His industrial experience includes design of a color graphics workstation for Perq Systems, and work on Bell Labs' microprocessors. He received the PhD and MSEE degrees from Carnegie-Mellon University in 1985 and 1978, and the BSEE degree from the University of Pittsburgh in 1977. He is a member of the IEEE and ACM.

Charles Y. Hitchcock III is a doctoral candidate in Carnegie-Mellon University's Department of Electrical and Computer Engineering. He is currently pursuing research in computer architecture and is a member of the IEEE and ACM. He graduated with honors in 1981 from Princeton University with a BSE in electrical engineering and computer science. His MSEE from CMU in 1983 followed research he did in design automation.

E. Douglas Jensen has been on the faculties of both the Computer Science and Electrical and Computer Engineering Departments of Carnegie-Mellon University for six years. For the previous 14 years he performed industrial R/D on computer systems, hardware, and software. He consults and lectures extensively throughout the world and has participated widely in professional society activities.

H. M. Brinkley Sprunt is a doctoral candidate in the Department of Electrical and Computer Engineering of Carnegie-Mellon University. He received a BSEE degree in electrical engineering from Rice University in 1983. His research interests include computer architecture evaluation and design. He is a member of the IEEE and ACM.

Charles P. Kollar is a senior research staff member in Carnegie-Mellon University's Computer Science Department. He is currently pursuing research in decentralized asynchronous computing systems. He has been associated with the MCF and NEBULA project at Carnegie-Mellon University since 1978. Previous research has been in the area of computer architecture validation and computer architecture description languages. He holds a BS in computer science from the University of Pittsburgh.

Questions about this article can be directed to Colwell at the Computer Science Department, Carnegie-Mellon University, Pittsburgh, PA 15213.

September 1985