
The 2014 MICRO Test of Time Award Winners: from 1978 to 1992


ONUR MUTLU Carnegie Mellon University

RICH BELGARD

As you may know, the International Symposium on Microarchitecture (MICRO), the flagship microarchitecture conference and a premier computer architecture conference for nearly five decades, selected 10 papers as recipients of the first set of MICRO Test of Time (ToT) Awards in December 2014. We announced the winning papers and described the selection process in the March/April 2015 issue of IEEE Micro.1 The authors of these 10 distinguished papers were invited to write short retrospectives to reflect on their work, which was done at least 20 years ago. This issue features retrospectives written by the original coauthors of two of the award-winning papers. We briefly introduce these papers and retrospectives, and we hope that you will enjoy reading them as much as we have.

The first retrospective is for the oldest paper that won the 2014 MICRO ToT Award. “Microprogrammed Implementation of a Single Chip Microprocessor” by Skip Stritter and Nick Tredennick was published in MICRO 1978.2 It introduced the idea of a two-level control store (textbook material in computer architecture today) with the goal of minimizing the chip real estate dedicated to the control logic used in microprogrammed processor designs, and in particular the memory used to store the microinstructions. The two-level control store is essentially two carefully codesigned programmable logic arrays (PLAs) that together comprise more compact storage for microinstructions than a single monolithic control store. It was born from the necessity to “maximize the contribution of every transistor spent” (to quote Tredennick’s retrospective) in the design of the Motorola MC68000 microprocessor. The paper also described in detail the microprogrammed control logic implementation of a single-chip microprocessor architecture, based on the MC68000 experience. In his retrospective, Tredennick describes his experience at Motorola that led to this paper and discusses his subsequent experiences in industry, which were partially shaped by his involvement with the MC68000. He also muses about the connection between design and design automation processes, which makes the retrospective a fun historical perspective and a delightful read for the IEEE Micro audience.

The second retrospective is for one of the youngest papers that won the 2014 MICRO ToT Award. “Code Generation Schema for Modulo Scheduled Loops,” authored by Bob Ramakrishna Rau, Michael S. Schlansker, and P.P. Tirumalai, was published in MICRO 1992.3 The paper provides a “recipe book” (to quote Schlansker) that discusses and enumerates code-generation choices for correctly and efficiently optimizing instruction schedules of loops for various architectures, including very long instruction word (VLIW) and superscalar. The paper covers architectures incorporating a varying set of features for loop schedule optimization, using the notions of software pipelining and modulo scheduling. The work is based on the authors’ extensive (about a decade long) experience in hardware/software codesign for realizing Cydrome’s Cydra 5 processor. Yet, the paper’s loop code scheduling strategies apply far beyond the extensive architectural support provided by the Cydra 5 for loop scheduling purposes, as Schlansker’s retrospective beautifully describes.

As we conclude, we would like to take the opportunity to pay tribute to the extremely valuable impact that Bob Rau has had in our field, especially in the development of compiler technology and VLIW processors, as well as hardware/software cooperation in instruction-level parallelism. It has been 13 years since Bob died, but his impact is wonderfully felt in the compiler technology commonly in use today, along with the many clearly articulated technical articles he contributed to academic literature. His works are taught in many modern compiler and computer architecture classes today. Bob was one of the most prominent contributors to MICRO for decades, and our selection of the 1992 article, for which he was the primary driver (according to Schlansker’s retrospective), as part of the first set of MICRO ToT Awards points to the technical excellence and value of insight he upheld as a leading member of our community. We hope these two key values continue to thrive as microarchitecture/architecture and hardware/software codesign become even more important with fundamental challenges threatening the large improvements obtained from the scaling of the underlying circuit and device technologies.

References
1. O. Mutlu and R. Belgard, “Introducing the MICRO Test of Time Awards: Concept, Process, 2014 Winners, and the Future,” IEEE Micro, vol. 35, no. 2, 2015, pp. 85–87.
2. S. Stritter and N. Tredennick, “Microprogrammed Implementation of a Single Chip Microprocessor,” Proc. 11th Ann. Workshop Microprogramming, 1978, pp. 8–16.
3. B.R. Rau, M.S. Schlansker, and P.P. Tirumalai, “Code Generation Schema for Modulo Scheduled Loops,” Proc. 25th Ann. Int’l Symp. Microarchitecture, 1992, pp. 158–169.

Onur Mutlu is the Strecker Early Career Professor at Carnegie Mellon University. Contact him at [email protected].

Rich Belgard is an independent consultant for computer manufacturers, software companies, and investor groups and an expert and consultant to law firms. Contact him at [email protected].

Published by the IEEE Computer Society. 0272-1732/16/$33.00 © 2016 IEEE

Evolution of Microprocessor Logic Design

NICK TREDENNICK Jonetix

In the summer of 1977, I was teaching as an assistant professor at the University of Texas in Austin when Tom Gunter walked into my office and introduced himself. He asked if I’d like to work for Motorola on a microprocessor design project. My areas of expertise were computers and logic design, and I had a little experience with microprocessor applications, but no experience with microprocessor design or semiconductor design. Nevertheless, there was mutual interest and I took a job with Motorola beginning in September 1977. The project was a next-generation microprocessor design called MACS (Motorola Advanced Computer System). Motorola’s previous microprocessor designs had been 8-bit accumulator-based designs suitable for embedded applications; MACS was to be a 16-/32-bit design more suitable for computer applications. Tom said he eventually wanted me to work on the design of the on-chip […], but that he first needed me to begin work on the microprocessor’s logic design “until we find a competent logic designer.” Of course, that never happened, and I spent my time doing the logic design for what became the MC68000.

I began looking for books and articles on microprocessor logic design. I was unable to find documentation for any microprocessor logic design methods. That seemed odd, given that the Design Automation Conference was already 14 years old in 1977. Just what processes were all those software engineers automating? Well, OK, I’d have to make up the design process as I went along.

At the time, the biggest differences between computer design and microprocessor design were in the constraints placed on the microprocessor’s designer. The microprocessor’s entire design had to fit on a single power-, pin-, and transistor-constrained, size-limited silicon chip. All of the microprocessor’s computing resources (data registers, address registers, program counter, and arithmetic units), interrupt logic, interface logic (pin and external control), and control logic had to fit inside the transistor, area, and power budget. Since we began by doubling or more than doubling the width of the data and address registers and the arithmetic units, as well as substantially increasing the number of data and address registers compared to an 8-bit accumulator-based design, we quickly ate into the transistor-budget increases provided by our move to the next most advanced semiconductor process. The consequence of these decisions was that, in the implementation of the control logic, we had to maximize the contribution of every transistor spent.

In 1977, moving to the next semiconductor process meant designers had somewhat more than twice the number of transistors enabled by the previous-generation semiconductor process. With each semiconductor process generation, which came along about every 18 months, transistor area shrank by half, which doubled the number of available transistors in a fixed area. Additional available transistors came from improvements in transistor layout and from lithography advances that enabled production of larger chips.

As transistor size decreased, power per transistor fell, so that chip power rose only slowly. Leakage currents, even for new process generations, were negligible, so we worried only about active power. In addition, the smaller transistors were faster, so clock speeds rose and performance increased with each semiconductor process generation. The 18-month cycle for new semiconductor process generations also drove the development cycle. Project delay could mean that your competitors benefitted from twice as many of the newer, faster transistors in their designs.

Design process

The MC68000 was probably among the last of the pencil-and-paper microprocessor designs. The project did not have the benefit of either computer-aided design entry or computer-based logic simulation. I drew pencil-and-paper diagrams of the execution units, decoders, logic units, and interconnections. I used modified Karnaugh maps (of up to 16 variables) for logic minimization. I wrote register transfer sequences in cycle-by-cycle flowcharts for each instruction in pencil on large sheets of paper. I used these methods both to place and to assign the instructions’ op codes (for efficient instruction decoding and uniform access to register fields) and to optimize the programmable logic arrays (PLAs) that decoded the instructions and provided instruction execution sequencing.

For control efficiency, instructions shared operand address calculation sequences. One instruction decoder pointed to the operand address calculation sequence and a second pointed to the required operation sequence. The address calculation sequence computed the operand address and sent a request to memory for the operand before transferring control to the second instruction decoder, which performed the operation and stored the result. This led to a problem with the clear memory instruction, which read the location to be cleared before writing a zero to that location. It was necessary to enable universal sharing of the address calculation functions, but some users didn’t expect a read to accompany a clear memory instruction.

Similar to the sharing of address calculation sequences, instruction operations shared sequences. Add, subtract, AND, OR, and XOR, for example, could all share a common two-operand arithmetic sequence through the use of an arithmetic logic unit and condition-code control table. For that table, the instruction decoder selected a row and the common two-operand arithmetic sequence chose a column; that way, the sequences could be common and the operations different.

The controller for the MC68000 microprocessor looked like a two-level control store with vertical (compact) microcode for sequencing and horizontal (mostly decoded) microcode for control points. The structure is described in the paper “Microprogrammed Implementation of a Single Chip Microprocessor,” which Skip Stritter and I wrote. What we called vertical and horizontal microcode are nothing more than the outputs of two highly optimized PLAs operating in parallel. The execution unit control PLA was optimized to eliminate duplicate states that would have occurred if the control points had been included in the sequencing PLA.

Most of the decisions in the design of the […] focused on transistor efficiency.

Logical conclusions

At the end of the MC68000 project, in late 1979, I resolved to do two things. First, I decided to make a list of the problematic design decisions that I had made during the project, so that I could avoid those errors in the next design. Second, because available design tools did not support the design method I had been using, I decided to create a design description that could act as the basis for implementing design aids.

Make better decisions

At the end of the MC68000 project, I made a list of the design decisions that I felt led either to inefficiencies that could have been avoided or to increased difficulty in completing the design. I don’t recall the contents of the list, but I believe there were about 10 items. The character of the items on the list was something like “instead of X as a method for register decoding and control, Y is probably to be preferred.” I resolved to use this error-correction sheet at the beginning of my next microprocessor design.

About a year later, I made use of that list when I began the Micro/370 microprocessor design while working at IBM Research in Yorktown Heights. And here’s why I don’t recall the contents of that list: at the end of the Micro/370 design project, when I made a list of the design decisions that I felt led either to inefficiencies that could have been avoided or to increased difficulty in completing the design, it turned out to be essentially the inverse of the list that I had made at the end of the previous design. My lesson from this was that it’s probably not fruitful to try to judge your design decisions in retrospect, because it is impossible to forecast the consequences of the alternatives in the absence of actually doing the detailed design work to implement them.

Design process, design automation

In 1979, design automation was a popular and growing business, but there seemed to be little correlation between what was being automated and the actual logic design process that I had been using. It looked more like design engineers were modifying their design processes to conform to the available tools. That seemed like an inefficient approach to me, so I resolved to document the microprocessor design process that I had used so that if there was any interest in automating an actual design process, there would be at least one template for doing so. About a year after I went to work for IBM, I began documenting the process I would use to design a microprocessor. Since I couldn’t use the information from the MC68000 design (many of the design details were confidential), I began a new IBM 360-based microprocessor design that was eventually named Micro/370.

It began as an example design, but it grew into a team building a real microprocessor when we decided we had to actually build the microprocessor for the design process to have credibility. The project produced a Micro/370 microprocessor that was functional and even booted IBM’s VM operating system, but was not successful in the market. The project also resulted in Microprocessor Logic Design (Digital Press, 1987), a computer engineering textbook that described the design process. A number of universities adopted the text for courses.

Constraint evolution

So, there we were in 1987: 10 years after I had gone to work at Motorola and had been unable to find a microprocessor design process, there now existed a documented microprocessor design process. Would the design automation engineers finally see the light and begin automating a process from a design engineer’s template?

In a word: no.

It didn’t happen and it shouldn’t have happened. I was disappointed, but I shouldn’t have been. Just as I had done with the “design errors” list at the end of the MC68000 project, I was taking the same myopic view of the design process. We all took for granted that transistors shrank and got faster with each generation. My mistake was that I took as an unstated assumption that the design process didn’t evolve with the march of semiconductor progress. That was unwise. The design process changed dramatically as transistors shrank because the constraints changed.

The design process I used was suited to a single individual controlling an entire logic design encompassing fewer than 100,000 transistors. In computer architecture terms, the processors of 1977 were primitive even compared to the mainframe designs of the time. Microprocessors didn’t have to invent architectural features; we were still copying features from the more advanced mainframe implementations of the time.

In the time it took me to initiate and complete another design and write up the process, microprocessor designs were already incorporating millions of transistors. Computational path widths would at most double and then level off, so execution units became a smaller portion of the design; there were plenty of transistors to spend on control logic, easing constraints on the efficiency of controller design. On-chip caches, floating-point capabilities, and multiprocessor designs debuted to consume excess transistors.

Microprocessor development accelerated past mainframe and minicomputer design as the leading edge of computer architecture, encouraging innovators to enter the field. Newer design projects led to specialization in design expertise in areas such as cache replacement strategies, branch prediction, and floating-point implementation. This further specialization led to the growth and fragmentation of design teams, which called for more coordination and for the standardization of methods and design documents.

Progress in semiconductor process conspired to change design constraints. As transistors shrank below 90 nm, leakage currents became significant, changing the design emphasis in power management. Logic speed diverged further from memory speed as semiconductor process engineers emphasized speed for logic and density for memory, changing constraints in cache design and shifting emphasis to consistency issues in the […]. Relative differences in logic speed and propagation delay forced tradeoffs in pipeline depth versus per-stage logic processing and in on-chip location of functional units. Multiprocessor designs opened whole new vistas of requirements for innovation and development.

I thought there should be a core method for microprocessor logic design and that should be the basis for the automation of design aids. In an environment evolving rapidly with progress in semiconductor process, that assumption was invalid. I built an ad-hoc design method that was suited to the constraints present at the time I began the design. Maximum chip size could accommodate fewer than 100,000 transistors, making transistors the scarce resource, so design efficiency was paramount. Today’s chips easily contain billions of transistors; transistors are abundant, so the scarce resource is designers or design management, verification, or time. Newer microprocessor design projects require large teams and emphasize specialization and design fragmentation.

In the 1970s, microprocessor design was primitive. Its logic design, incorporating a few tens of thousands of transistors, could be managed by a single person or a small team, and in terms of computer architecture, it was a trailing-edge implementation incorporating features that had been pioneered in mainframes, minicomputers, and workstations. In contrast, today’s microprocessor design is perhaps the most sophisticated of all engineering design. Its logic design, incorporating billions of transistors, requires a large team of experts in a wide range of specialties, and in terms of computer architecture, these advanced microprocessors are blazing the trail in innovation.

Nick Tredennick is a VP and engineer at Jonetix. He is a life fellow of IEEE. Contact him at [email protected].
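
The shared two-operand sequence described under “Design process” can be illustrated with a small sketch. It is not the MC68000 implementation: the 16-bit width, the table layout, the flag rules, and every name below are assumptions chosen only to show how one common sequence can serve several instructions when the instruction picks a table row and the sequence step picks a column.

```python
import operator

WORD_MASK = 0xFFFF  # assumption: a 16-bit data path for this sketch

# Hypothetical ALU and condition-code control table.
# Row: chosen by the decoded instruction.
# Column: chosen by the step of the shared sequence ("alu" or "carry").
ALU_CC_TABLE = {
    "ADD": {"alu": operator.add,  "carry": True},
    "SUB": {"alu": operator.sub,  "carry": True},
    "AND": {"alu": operator.and_, "carry": False},
    "OR":  {"alu": operator.or_,  "carry": False},
    "XOR": {"alu": operator.xor,  "carry": False},
}


def two_operand_sequence(opcode, src, dst):
    """Run the one shared sequence; only the table row differs per opcode."""
    row = ALU_CC_TABLE[opcode]            # row selected by the instruction
    raw = row["alu"](dst, src)            # "alu" column: the compute step
    result = raw & WORD_MASK              # truncate to the data-path width
    flags = {"Z": result == 0, "N": bool(result & 0x8000)}
    if row["carry"]:                      # "carry" column: flag-update step
        flags["C"] = raw != result        # carry or borrow out of 16 bits
    return result, flags
```

For instance, `two_operand_sequence("ADD", 1, 0xFFFF)` wraps around to a result of 0 with the zero and carry flags set, while the logical operations leave the carry flag out of the update entirely. The sequence code is identical in every case; only the row lookup changes.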

IEEE Micro, January/February 2016

Efficient Code Generation Schema for Modulo Loops

MICHAEL SCHLANSKER Hewlett Packard Enterprise Labs

This retrospective is dedicated to Bob Rau, who sadly cannot participate in receiving this honor. It was Bob’s passionate interest in computer architecture, instruction-level parallelism, and compilers that enabled our original paper, “Code Generation Schema for Modulo Scheduled Loops,” with coauthor Partha Tirumalai.1 This paper was published in 1992 at MICRO 25. It addressed a basic question: how can we efficiently execute loops on a processor with instruction-level parallelism while preserving processor simplicity? The paper condensed a body of work that resulted from Bob’s passionate effort over an extended period of time into a summary of known techniques for loop acceleration. It distilled contributions by Bob, his close collaborators, and a broader parallel processing community into an architecture for, and a manual of facts about, synchronous parallel loop execution. The goal of this retrospective is to share a few highlights in our progress toward developing this body of work.

In addition to Partha Tirumalai and myself, other collaborators who worked directly with Bob and influenced this work included Bob’s ESL collaborator Chris Glaeser; Bob’s Cydrome collaborators, including Joe Bratt, Peter Donovan, Peter Hsu, Ross Towle, and Art Sorkin; and Bob’s Hewlett-Packard collaborators, including Vinod Kathail, Meng Lee, and Scott Mahlke.

The Code Generation Schema paper resulted from Bob’s lasting interest in instruction-level parallelism. An important launching pad for our work came from earlier work on single assignment languages and dataflow, such as MIT’s tagged-token dataflow architecture,2 which formalized the renaming of variable instances within a sequence of loop iterations in order to allow concurrent processing. Dataflow strengthened our understanding of loop-level parallelism, recurrences, and loop-carried dependences and helped us identify a rigorous means to express loop parallelism within our compiler’s intermediate form.

But dataflow was not the target of our research. We wanted to accelerate innermost loops with synchronous hardware to exploit large amounts of parallelism without the complexities of hardware-based dynamic scheduling. Much as in earlier horizontal microcode, or very long instruction words (VLIWs),3 our goal was to remove complexity from the processor and instead use a compiler to produce highly optimized static code schedules that are executed by simple hardware. Bob’s early work toward this goal resulted in the Polycyclic architecture,4 which combined wide synchronous execution with innovative shift-register hardware. The architecture demonstrated cyclic code schedules, with a period called the initiation interval (II), which controlled the steady state execution pattern for a loop. The innermost loop was scheduled as a cyclic […] of overlapped loop iterations. Code running these loops came to be known as software pipelines, and the compile-time scheduler used to generate code for such loops was called a modulo scheduler. The highest performance was achieved by identifying a minimal loop II, which accommodated throughput limitations that were dictated either by fully utilized computational resources or by the latencies of operations around a cycle of carried dependence.

Because of Bob’s interest in developing efficient hardware for a new processor product (the Axiom processor was later renamed as the Cydrome processor), he strongly desired to replace the Polycyclic’s shift registers with more efficient static RAM. Cydrome’s Cydra 5 processor incorporated rotating register hardware to implement register renaming. For each operation, register addresses could be dynamically computed by adding a register offset that was specified by the operation to an iteration control pointer (ICP). The ICP was incremented by the execution of a loop branch to relocate register references within innermost loops in a manner similar to laying out a new context frame in dataflow, or incrementing a vector register pointer. This allowed code from parallel loop iterations to access values stored in a nonshifting RAM without replicating loop code. For machines without register renaming, a compiler uses Modulo Variable Expansion, which combines compile-time register renaming with code replication to enable parallel loop execution.

In addition to rotating registers, the Cydra 5 provided predicates to support conditional execution. A compare operation computed a predicate that could conditionally nullify operations that were dependent on that predicate. Predicated execution supported the if-conversion of conditionals in the body of loops, which allowed the parallel execution of conditionals without code replication.

Another complex problem is that of controlling the software pipeline fill and drain process. As execution ramps up, reaches a steady state, and then ramps down, a different set of operations is either needed or unused in the software processing pipeline. Again, the Cydra 5 solved this problem without code replication. After a compiler determined an II and generated a cyclic code schedule, operations within a loop could be separated (according to their scheduled time) into stages, each lasting II cycles. The total number of stages, or stage count, indicates the maximum number of loop iterations that are simultaneously in process when a software pipeline reaches fully busy execution. The Cydra 5’s loop branch operation computes a sequence of predicate values that are applied to code in each stage. The actions of the branch operation, which computed a stage predicate and advanced the ICP, caused code in each stage to be correctly executed or nullified according to the software pipeline’s fill and drain progress. This was called “kernel-only code” and eliminated the […] needed in other approaches.

Another important high-performance loop feature is speculative execution, which allows operations to be executed in scheduled code before it is known whether they would have executed in the original sequential code. Speculative operations produce errors that should not be reported until it is known whether they execute in the original code; they should be reported after that time. Speculative execution was used by Multiflow5 and was incorporated into the IMPACT architecture6 at the University of Illinois. Speculative execution is particularly important in while loops, in which load operations must be moved above one or more conditional exit branches to achieve good performance.

This discussion sets the stage for the Code Generation Schema paper’s primary goal of developing efficient code generation schema for loops that execute on a range of hardware architectures with some or all of the loop features described earlier. We knew that many processors would not have rotating registers or predicated execution and that code replication could be used in their absence. For a statically scheduled VLIW with none of these features, achieving the highest performance requires substantial code replication. For complex-instruction-set computers, reduced-instruction-set computers, and superscalar machines that do not have advanced loop features or cannot perform sufficient dynamic scheduling in hardware, their compilers can still improve performance by exploiting schema for code rescheduling, register renaming, and replication. However, the amount of code grows with the amount of parallelism, and a large expansion in code size can degrade instruction processing performance.

Each hardware architecture choice presents complex compile-time code-generation tradeoffs between the amount of code replication and the achieved performance as a function of the loop’s trip count. For example, loop preconditioning could be used to sequentially execute a residual number of iterations modulo p and then retire parallel groups of exactly p unrolled loop iterations to complete loop execution. However, this approach leads to poor performance with small loop trip count. Our goal was to develop a set of precisely defined code generation choices for a variety of important hardware architectural alternatives and compare these loop control schema for a specific amount of instruction-level parallelism as expressed with example pipelined multifunction […].

In summary, our goal for this paper was to provide a recipe book for optimizing the schedule for loop codes for various architectures. The paper uses advanced Cydra 5 loop control features that eliminate a need for code replication. It defines code generation schema that can be used for various VLIW and conventional processors. The paper measures key attributes such as achieved performance versus loop trip count and required code size for varying degrees of hardware parallelism. Finally, it shows that without adequate hardware support, compile-time loop scheduling causes significant code growth unless performance is sacrificed for short trip-count loops.

I thank those who recognized this work as a lasting contribution to a large body of exciting research in instruction-level parallelism. Again, I would like to express my sadness that Bob Rau cannot join us to celebrate this honor. Bob was the primary driver for a body of work that was developed over much of a decade and culminated in the Code Generation Schema publication. We miss his unwavering pursuit of technical excellence, which was directed toward developing next-generation computer architectures to exploit instruction-level parallelism. He cultivated a positive approach to technology development based on enthusiasm, creativity, deep intellectual discourse, and perseverance that many of his coworkers remember fondly.

References
1. B.R. Rau, M.S. Schlansker, and P.P. Tirumalai, “Code Generation Schema for Modulo Scheduled Loops,” Proc. 25th Ann. Int’l Symp. Microarchitecture, 1992, pp. 158–169.
2. Arvind and V. Kathail, “A Multiple Processor Data Flow Machine that Supports Generalized Procedures,” Proc. 8th Ann. Int’l Symp. Computer Architecture, 1981, pp. 291–302.
3. J.A. Fisher, “Trace Scheduling: A Technique for Global Microcode Compaction,” IEEE Trans. Computers, July 1981, pp. 478–490.
4. B.R. Rau and C.D. Glaeser, “Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance Scientific Computing,” Proc. 14th Ann. Workshop Microprogramming, 1981, pp. 183–198.
5. R.P. Colwell et al., “A VLIW for a Trace Scheduling Compiler,” IEEE Trans. Computers, vol. 37, no. 8, 1988, pp. 967–979.
6. S. Mahlke et al., “Sentinel Scheduling for VLIW and Superscalar Processors,” Proc. 5th Int’l Conf. Architectural Support for Programming Languages and Operating Systems, 1992, pp. 238–247.

Michael Schlansker is a Distinguished Technologist at the Hewlett Packard Enterprise Labs. Contact him at mike_[email protected].
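
The rotating-register and stage-predicate mechanisms described in this retrospective can be illustrated with a toy simulation. It is a minimal sketch, not Cydra 5 code: the three-stage loop body (load, multiply, add-and-store), an II of one cycle, the register offsets, the file size, and all names are assumptions made for illustration. It follows the retrospective in having the loop branch increment the ICP, so a later stage reaches an earlier stage’s value with a smaller offset.

```python
def run_kernel_only(a, scale, num_regs=8):
    """Compute [x * scale + 1 for x in a] with one predicated kernel.

    Assumed schedule: 3 stages, II = 1. The same kernel body runs on every
    pass; stage predicates switch stages on during fill and off during drain.
    """
    n, stages = len(a), 3
    rot = [0] * num_regs        # rotating data registers: plain, nonshifting RAM
    pred = [False] * num_regs   # rotating stage predicates
    out = []
    icp = 0                     # iteration control pointer

    def phys(offset):
        # Physical register index = (operation's offset + ICP) mod file size.
        return (icp + offset) % num_regs

    for k in range(n + stages - 1):           # fill + steady state + drain
        # Loop-branch side effect: write a new stage predicate, true only
        # while there are still iterations left to start.
        pred[phys(0)] = k < n
        if pred[phys(0)]:                     # stage 0 of iteration k: load
            rot[phys(0)] = a[k]
        if pred[phys(-1)]:                    # stage 1 of iteration k-1: multiply
            rot[phys(3)] = rot[phys(-1)] * scale
        if pred[phys(-2)]:                    # stage 2 of iteration k-2: add, store
            out.append(rot[phys(2)] + 1)
        icp += 1                              # loop branch advances the ICP
    return out
```

Calling `run_kernel_only([1, 2, 3, 4], 10)` takes six kernel passes (four iterations plus two extra passes to drain) and no separate ramp-up or ramp-down code. On a machine without these features, the equivalent schedule would need the predicated stages written out explicitly for the partially full passes, which is the code growth the retrospective describes.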