Architectural Trade-Offs in a Latency Tolerant Gallium Arsenide Microprocessor

Architectural Trade-offs in a Latency Tolerant Gallium Arsenide Microprocessor by Michael D. Upton A dissertation submitted in partial fulﬁllment of the requirements for the degree of Doctor of Philosophy (Electrical Engineering) in The University of Michigan 1996 Doctoral Committee: Associate Professor Richard B. Brown, CoChairperson Professor Trevor N. Mudge, CoChairperson Associate Professor Myron Campbell Professor Edward S. Davidson Professor Yale N. Patt © Michael D. Upton 1996 All Rights Reserved DEDICATION To Kelly, Without whose support this work may not have been started, would not have been enjoyed, and could not have been completed. Thank you for your continual support and encouragement. ii ACKNOWLEDGEMENTS Many people, both at Michigan and elsewhere, were instrumental in the completion of this work. I would like to thank my co-chairs, Richard Brown and Trevor Mudge, ﬁrst for attracting me to Michigan, and then for allowing our group the freedom to explore many different ideas in architecture and circuit design. Their guidance and motivation combined to make this a truly memorable experience. I am also grateful to each of my other dissertation committee members: Ed Davidson, Yale Patt, and Myron Campbell. The support and encouragement of the other faculty on the project, Karem Sakallah and Ron Lomax, is also gratefully acknowledged. My friends and former colleagues Mark Rossman, Steve Sugiyama, Ray Farbarik, Tom Rossman and Kendall Russell were always willing to lend their assistance. Richard Oettel continually reminded me of the valuable support of friends and family, and the importance of having fun in your work. Our corporate sponsors: Cascade Design Automation, Chronologic, Cadence, and Metasoft, provided software and support that made this work possible. A design of this complexity would not have been possible using university-developed tools. Significant development efforts were required in all areas of microprocessor design and computer architecture to bring this project to fruition. This work was not performed in iso- lation, and would not have been possible without the help of many others. Aurora I: Dave Johnson wrote an initial Verilog to Cascade netlist translator. Rich Uh- lig designed the Verilog RTL model. Ajay Chandna, Tom Huff, Tom Hoy and I designed the modules and layout floorplan. Aurora II: Phil Barker and PJ Sherhart designed the cell layouts used for the Aurora II iii chip. Taly Budescu assisted in all the horrible tasks no-one else wanted. Tim Stanley designed much of the bus interface and a behavioral MMU to allow the Aurora II model to run real code. Bob McVay modified GCC to produce code without byte operations. Jim Dundas tested the scan chains of the final fabrication of the Aurora II chip, and Mark Roberts and Marie Powell tested the yield of the register files. Aurora III: PJ Sherhart designed the bus interface unit of the Aurora III. David Kibler did the much improved cell layouts for the Aurora III chip. Dave Putti and Sara Domonkos designed an initial version of the Load-Store Unit. Bob McVay designed an initial version of the Instruction Fetch Unit. Tim Stanley reprised his role as behavioral MMU designer, providing crucial input into the design decision of the Aurora III bus interface unit. A special thanks is reserved for Dave Putti, who single-handedly completed the final version of the Load Store unit for the Aurora III. Without his help this work would not have been completed. iv TABLE OF CONTENTS DEDICATION . ii ACKNOWLEDGEMENTS . iii LIST OF FIGURES . viii LIST OF TABLES . xi CHAPTER 1 Introduction . 1 1.1 Technology Requirements for High Performance Processors . 1 1.2 Microprocessor Density Increase Over Time . 3 1.3 Clock Speed Increase Over Time . 6 1.4 Research Project Goals . 7 CHAPTER 2 The Importance of Latency Tolerance for High Clock-Rate Processors . 10 2.1 CPU-Memory Performance Discrepancy . 10 2.2 Cache Miss Characterization . 18 2.3 Memory system enhancements to maintain performance . 21 2.3.1 Prefetching . 21 2.3.2 Nonblocking memory operations . 23 2.4 Elements of Current Microprocessors . 24 2.4.1 R4400 . 25 2.4.2 R8000 . 25 2.4.3 R10000 . 25 2.4.4 SuperSPARC . 26 2.4.5 UltraSPARC . 27 2.4.6 Intel Pentium Processor . 27 2.4.7 AMD K5 . 28 2.4.8 Motorola 88110 . 28 2.4.9 IBM/Motorola PowerPC 604 . 29 2.4.10 DEC Alpha 21064 . 30 2.4.11 DEC Alpha 21164 . 30 2.5 Computational Efficiency and Delivered Performance . 31 2.6 Summary . 33 CHAPTER 3 Gallium Arsenide Microprocessor Design Studies . 35 3.1 Gallium Arsenide DCFL Logic . 36 3.2 The Aurora I Processor . 39 v 3.2.1 CAD Tool Development . 41 3.2.2 Aurora I Test Results . 42 3.3 The Aurora II Processor . 44 3.3.1 Timing and Optimization . 48 3.3.2 Layout Optimization . 50 3.3.3 I/O Pad Design . 52 3.4 Aurora II Results . 53 3.4.1 Error Summary . 54 3.4.2 Clocking Issues . 56 3.4.3 Exception Overhead . 56 3.4.4 GaAs Technical Difficulties . 57 CHAPTER 4 Process Modeling Studies . 61 4.1 SUSPENS system performance model . 61 4.2 GaAs SUSPENS Model . 62 4.2.1 Bakoglu Equations . 66 4.3 Model Sensitivity . 68 4.4 Aurora III Architectural Directions . 69 4.5 Aurora III Model Predictions . 70 4.6 Aurora III Model Floorplan . 71 4.7 Conclusion . 72 CHAPTER 5 Impact of GaAs Technology on Architecture . 74 5.1 Path Length Reduction . 74 5.2 Interconnect Parasitics . 74 5.3 Functional Decomposition . 75 5.4 Circuit Design Techniques for Reduced Path Length . 76 5.4.1 Ling Adder . 79 5.4.2 Pipelined Ling Adder . 82 5.5 Summary . 83 CHAPTER 6 Aurora III Microprocessor System Architecture and Design . 84 6.1 System Overview . 84 6.2 Processor Organization . 88 6.3 Instruction Fetch Unit . 91 6.4 Execution Unit . 93 6.5 Load/Store Unit . 98 6.6 Prefetch Unit . 100 6.7 Bus Interface Unit . 101 6.8 Architectural Evaluation . 104 6.9 Study Results . 105 6.10 Summary . 113 vi CHAPTER 7 The Design Process and Verification . ..

Architectural Trade-Offs in a Latency Tolerant Gallium Arsenide Microprocessor

Memory Hierarchy Memory Hierarchy

Petaflops for the People

Make the Most out of Last Level Cache in Intel Processors In: Proceedings of the Fourteenth Eurosys Conference (Eurosys'19), Dresden, Germany, 25-28 March 2019

Migration from IBM 750FX to MPC7447A by Douglas Hamilton European Applications Engineering Networking and Computing Systems Group Freescale Semiconductor, Inc

Caches & Memory

Stealing the Shared Cache for Fun and Profit

IBM Power Systems Performance Report Apr 13, 2021

A Bibliography of Publications in IEEE Micro

The Bus Interface for 32-Bit Microprocessors That

Ushering in a New Era: Argonne National Laboratory & Aurora

TOP500 Supercomputer Sites

MPI on Aurora