A Highly Productive Implementation of an Out-of-Order Processor Generator

Christopher Celio

Electrical Engineering and Computer Sciences
University of California at Berkeley

Technical Report No. UCB/EECS-2018-151 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-151.html

December 1, 2018

Copyright © 2018, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

A Highly Productive Implementation of an Out-of-Order Processor Generator

by

Christopher Patrick Celio

A dissertation submitted in partial satisfaction of the

requirements for the degree of

Doctor of Philosophy

in

Computer Science

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor Emeritus David A. Patterson, Chair
Professor Krste Asanović, Co-chair
Assistant Professor Zachary Pardos

Fall 2017

A Highly Productive Implementation of an Out-of-Order Processor Generator

Copyright 2017 by Christopher Patrick Celio

Abstract

A Highly Productive Implementation of an Out-of-Order Processor Generator

by

Christopher Patrick Celio

Doctor of Philosophy in Computer Science

University of California, Berkeley

Professor Emeritus David A. Patterson, Chair
Professor Krste Asanović, Co-chair

General-purpose serial-thread performance gains have become more difficult for industry to realize due to the slowing down of process technology improvements. In this new regime of poor process scaling, continued performance improvement relies on a number of small-scale micro-architectural enhancements. However, simulator-based models, which research has largely relied upon, may not be well-suited for evaluating ideas at the necessary fidelity. To facilitate architecture research during this fallow period of Moore's Law, we propose using processor simulators built from synthesizable processor designs.

This thesis describes the design of a synthesizable, industry-competitive processor built on recent advancements in open-source hardware: we leverage the new open-source RISC-V instruction set architecture, the new Chisel hardware construction language, and the Rocket-chip processor generator. Our processor generator is called BOOM, and it is designed for use in education, research, and industry. Like most contemporary high-performance cores, BOOM is superscalar (able to execute multiple instructions per cycle) and out-of-order (able to execute instructions as their dependencies are resolved and not restricted to their program order). The BOOM generator was implemented using the Chisel hardware construction language, allowing for the rapid implementation of parameterized designs. The Chisel description generates synthesizable implementations of BOOM that can target both FPGAs and ASIC tool-flows. The BOOM effort culminated in a test chip that was fabricated in the TSMC 28 nm HPM (high performance mobile) process using the foundry-provided standard-cell library and memory compiler. This thesis highlights two aspects of the BOOM design: its industry-competitive branch prediction and its configurable execution datapath. The remainder of the thesis discusses the BOOM tape-out, which was performed by two graduate students and demonstrated the ability to quickly adapt the design to the physical design issues that arose.

To my parents, for the many opportunities that they gave me.

Contents

Contents

List of Figures

List of Tables

1 Introduction
  1.1 Leveraging New Infrastructure
  1.2 Contributions
  1.3 Thesis Outline

2 Background
  2.1 Motivation
  2.2 Technology and Research Trends
    2.2.1 Impact of Performance Slowdown on Architecture Research
  2.3 Computer Architecture Research Trends
    2.3.1 Architectural Simulators
    2.3.2 RTL Implementations
    2.3.3 Energy Modeling Methodologies
  2.4 Processor Microarchitecture
  2.5 Out-of-order Processor Microarchitectures
    2.5.1 The Data-in-ROB Design (Implicit Renaming)
    2.5.2 The Physical Register File Design (Explicit Renaming)
    2.5.3 The Differences Between the Two Styles
  2.6 History of Out-of-order Processors
  2.7 The Value of Out-of-order Execution
  2.8 Conclusion

3 BOOM Overview
  3.1 The RISC-V Instruction Set Architecture
  3.2 The BOOM Microarchitecture
    3.2.1 The BOOM Pipeline

    3.2.2 Instruction Fetch and Branch Prediction
      3.2.2.1 Branch Target Buffer (BTB)
      3.2.2.2 Return Address Stack (RAS)
      3.2.2.3 Conditional Branch Predictor (BPD)
    3.2.3 The Decode Stage and Resource Allocation
    3.2.4 The Register Rename Stage
    3.2.5 The Reorder Buffer (ROB) and Exception Handling
    3.2.6 The Issue Unit
    3.2.7 The Register File and Bypass Network
    3.2.8 The Execution Pipeline
    3.2.9 The Load/Store Unit (LSU)
    3.2.10 The Memory System
  3.3 Design Methodology
    3.3.1 RTL Design
    3.3.2 RTL Verification
    3.3.3 The Chisel Hardware Construction Language
    3.3.4 The Rocket-chip System-on-a-Chip Generator
  3.4 FPGA Implementation
  3.5 ASIC Implementation
    3.5.1 SRAM Generation
    3.5.2 Custom Bit Array Register File
    3.5.3 Timing Analysis
  3.6 Parameters
  3.7 Performance Evaluation

4 Branch Prediction
  4.1 Background
    4.1.1 Research Methodology
    4.1.2 Deficiencies of Trace-based, Unpipelined Models
    4.1.3 The State-of-the-art TAGE Predictor
  4.2 The BOOM RTL Implementation
    4.2.1 The Frontend Organization
    4.2.2 Providing a Branch Predictor Framework
    4.2.3 Managing the Global History Register
    4.2.4 The Two-bit Tables
    4.2.5 Superscalar Predictions
    4.2.6 The BOOM GShare Predictor
    4.2.7 The BOOM TAGE Predictor
      4.2.7.1 TAGE Global History and the Circular Shift Registers (CSRs)
      4.2.7.2 Usefulness Counters (u-bits)
      4.2.7.3 TAGE Snapshot State
  4.3 Addressing the Gaps Between RTL and Models

    4.3.1 Superscalar Prediction
    4.3.2 Delayed History Update
    4.3.3 Delayed Predictor Update
    4.3.4 Accurate Cost Measurements
    4.3.5 Implementation Realism
  4.4 Proposed Improvements for Software Model Evaluations
  4.5 Conclusion

5 Describing an Out-of-order Execution Pipeline Generator
  5.1 Goals and Challenges
  5.2 The BOOM Execution Pipeline
    5.2.1 Branch Speculation
    5.2.2 Execution Units
    5.2.3 Functional Units
      5.2.3.1 Pipelined Functional Units
      5.2.3.2 Iterative Functional Units
    5.2.4 The Load/Store Unit
  5.3 Case Study: Adding Floating-point Support
    5.3.1 Register File and Register Renaming
    5.3.2 Issue Window
    5.3.3 Floating-point Control and Status Register (fcsr)
    5.3.4 Hardfloat and Low-level Instantiations
    5.3.5 Pipelined Functional Unit Wrapper
    5.3.6 Adding the FPU to an Execution Unit
    5.3.7 Results
  5.4 Case Study: Adding a Binary Manipulation Instruction
    5.4.1 Decode, Rename, and Instruction Steering
    5.4.2 The Popcount Unit Implementation
  5.5 Limitations
  5.6 Conclusion

6 VLSI Implementation Effort
  6.1 Background
  6.2 BOOMv1
  6.3 BOOMv2
    6.3.1 Frontend (Instruction Fetch)
    6.3.2 Distributed Issue Windows
    6.3.3 Register File Design
  6.4 Tapeout Methodology
  6.5 Lessons Learned
  6.6 What Does It Take To Go Really Fast?
  6.7 Conclusion

7 Conclusion
  7.1 Contributions
  7.2 Future Directions
  7.3 Final Remarks

A A Selection of Encountered Bugs

Bibliography

List of Figures

2.1 Cell phone subscriber counts as provided by the United Nations
2.2 45 years of microprocessor and technology trends
2.3 40 years of processor performance
2.4 A processor takes a stream of instructions and performs computations as specified by each instruction
2.5 A common general-purpose processor architecture is the register-register load/store architecture
2.6 An ARM Cortex-A15 pipeline
2.7 A physical register file design and a data-in-ROB design

3.1 A conceptual outline of the BOOM pipeline
3.2 The instruction fetch frontend to BOOM
3.3 Block diagram of the BROOM test chip
3.4 BROOM place-and-routed chip plot
3.5 BROOM place-and-routed chip plot with annotations
3.6 Photograph of a BROOM chip wire-bonded into a PCB
3.7 Instruction-per-cycle comparison running SPECint2006
3.8 Performance ratio relative to the SPEC reference machine (a 296 MHz UltraSPARC II)

4.1 The TAGE predictor
4.2 The BOOM Fetch Unit
4.3 The branch prediction framework
4.4 A gshare predictor uses the global history hashed with the fetch address to index into a table of 2-bit counters
4.5 Two-bit counter state machine
4.6 The gshare predictor pipeline

5.1 An example pipeline for a dual-issue BOOM
5.2 An example Execution Unit
5.3 The abstract Pipelined Functional Unit class

5.4 The functional unit abstraction allows for the easy encapsulation of expert-written functional unit logic
5.5 The Functional Unit class hierarchy
5.6 Support for the RISC-V single- ("F") and double- ("D") precision extensions was implemented over a two-week period

6.1 A comparison of a three-issue BOOMv1 and four-issue BOOMv2 pipeline
6.2 The datapath changes between BOOMv1 and BOOMv2
6.3 The frontend pipelines for BOOMv1 and BOOMv2
6.4 A Register File Bit manually crafted out of foundry-provided standard cells
6.5 All VLSI builds by date
6.6 VLSI builds using LVT cells and arranged by build number

List of Tables

2.1 The scaling factors of a single transistor under a Dennard Scaling regime (1965-2005)
2.2 The scaling factors of a single transistor in a post-Dennard Scaling regime (post-2005)
2.3 A sample of simulators
2.4 A sample of academic out-of-order processors and the open-source UltraSPARC T2
2.5 The differences between the data-in-ROB design and the physical register file design

3.1 The compile and simulation times for the Verilator and VCS RTL simulators when built using a 12-core Xeon E5-2643 v2 (3.5 GHz)
3.2 The set of bare-metal benchmarks provided by the riscv-tests repository
3.3 The configurations used for each of the SRAMs used in a BOOM core
3.4 The parameters chosen for the tapeout of BOOM
3.5 CoreMark, Area, and Frequency Comparisons of Industry Processors
3.6 Instruction-per-cycle comparison running SPECint2006
3.7 Performance ratio relative to the SPEC reference machine (a 296 MHz UltraSPARC II)

5.1 A survey of industry execution pipelines and the different functional units available
5.2 The hierarchy from an abstract FunctionalUnit to an expert-written fused multiply-add block

6.1 The parameters chosen for analysis of BOOM
6.2 The critical path length for each of the VLSI builds from the BROOM tapeout

Acknowledgments

A thesis is an enormous task, but thankfully, it is not something that is done alone. Many friends, family, and mentors contributed to this work in both tangible and intangible ways.

First, I would like to thank my advisors, Krste Asanović and Dave Patterson, for allowing me to pursue a topic that excited me, no matter how foolish I was being. Krste was always eager to dive into the details (like designing the freelist restore scheme), while Dave helped me focus on the big picture. I would also like to thank my other committee member, Zachary Pardos, for his guidance, and Jonathan Bachrach for his participation in my qualification exam. I also appreciated Jonathan's foundational work on Chisel and his sincere enthusiasm for my project.

I owe my early interest in computer architecture to Chris Terman and to Anant Agarwal of MIT. Chris's excellent 6.004 lectures sparked my fascination with processors, and I am very thankful for his mentoring of my 6.111 project where I built my first FPGA processor. I have been hooked ever since. As a senior without a clear plan for the future, Anant took me into his lab and guided me to pursue graduate work in computer architecture.

Going back even further, I would like to thank Jeffrey Friedman of Rockefeller University for taking a crazy chance on a high school student. I am forever grateful for the patient mentoring by researchers in his molecular genetics lab, and in particular Esra Asilmaz and Gulya Fayzikhodjaeva. From them I learned that there are many important problems science can solve, and I wanted to be a part of that. I would also like to thank Geoff Crowley of Southwest Research Institute for his mentorship that helped further guide my path into research.

I would like to thank the ASPIRE Lab (and before that, the Par Lab) for providing a comfortable home for me and my work throughout grad school. Our corner cubicle, the Cosmic Cube (Scott Beamer, Sarah Bird, Henry Cook, and Andrew Waterman), helped make me feel at home, especially during my early years. I'd also like to thank Andrew for always answering my constant stream of questions, while he was busy building the stuff my research depended on.

I would like to give a big thank you to Donggyu Kim for bravely using BOOM as part of his own research, for collecting data, and finding many of the harder-to-find bugs. I would also like to thank Pi-Feng Chiu for her absolutely heroic effort in bringing BOOM to life in silicon form, and Professor Borivoje Nikolić for championing the tape-out.

Thanks to everyone who has sacrificed their time and sanity making Chisel a great language that is now used by real companies shipping real products. Many people have contributed, including Jonathan Bachrach, Huy Vo, Andrew Waterman, Jim Lawson, Chick Markley, Richard Lin, Jack Koenig, Adam Izraelevitz, Albert Magyar, Stephen Twigg, and Donggyu Kim.

Thanks to everyone who developed and improved Rocket-chip, and whose tape-out experiences helped Pi-Feng and me start our own tape-out standing on second base. Many people have contributed — the upstream open-source Rocket-chip has over 50 contributors and it is now used in many RISC-V companies' products — but in particular I would like to thank

Rimas Avižienis, Jonathan Bachrach, Scott Beamer, David Biancolin, Henry Cook, Palmer Dabbelt, John Hauser, Adam Izraelevitz, Sagar Karandikar, Benjamin Keller, Donggyu Kim, John Koenig, Yunsup Lee, Eric Love, Martin Maas, Albert Magyar, Howard Mao, Miquel Moreto, Albert Ou, Brian Richards, Colin Schmidt, Stephen Twigg, Huy Vo, and Andrew Waterman.

The ASPIRE Lab and Par Lab were unbelievably lucky in the quality and dedication of their staff, including Roxana Infante, Tamille Chouteau, Ria Briggs, Kostadin Ilov, and Jon Kuroda. In particular, I would like to thank system administrator Kostadin for working miracles and doing everything in his power to help every student. Even when all of Lake Tahoe got snowed out for the Winter Retreat in 2017, Kostadin still managed to make it there and set everything up anyways.

Speaking of which, I would also like to thank all of the industry sponsors and attendees of our lab's biannual retreats. It is one of the things that has made my graduate school experience unique, and I appreciate the enthusiasm and support provided to me and my crazy venture. Specifically, I would like to thank Mark Rowland for his advice, support, and entertaining war stories that helped me feel that what I was doing mattered.

Triple Rock, Jupiters, and Beta Lounge also deserve recognition for providing a constant and comfortable collaboration space in which to grow friendships and share ideas over the years. Often I found that I was too busy during the working hours to get any work done, and it was only afterwards when the real research discussions could commence, away from computers and email.

I would like to thank my best friend and wife Kimberlee, for her support throughout grad school (and especially while I lingered in the finishing up stage), and for being a sounding board for many of my ideas. You have been my biggest cheerleader, and that has meant the world to me.

Finally, to my family. My brother Andrew Celio, for being my first and longest friend. And to my parents, for guiding me, looking out for me, and always being engaged. I appreciate that I got my first taste of "research" building all manner of contraptions in our garage together.

Chapter 1

Introduction

Computer architecture research centers on the study and evaluation of computer systems, their organization, their interfaces, and their implementations. And yet, due to the complexity of modern computer processors, architecture researchers only rarely design and implement entire systems.

There are many valid reasons to eschew the arduous process of implementing full systems. The man-power and the time required to take a project from conception to completion are high. The level of re-use between chip-building research projects is also poor, exacerbating the amount of effort required by each new team to construct a complete system. And certainly, many ideas can be adequately evaluated using software simulators, where researchers can add their one sliver of innovation to an existing research platform.

But some ideas are best evaluated in the context of a full system, and with enough detail to provide performance, power, and area numbers. This thesis aims to provide a piece of the puzzle by implementing — at the register-transfer level (RTL) — a complete out-of-order processor. We focus not just on the design, but on the manner in which it has been productively produced using a new, more modern hardware construction language. We also discuss how we implemented not just a single processor instance, but rather, a hardware generator that can construct an entire family of out-of-order cores. We call our processor generator the Berkeley Out-of-Order Machine, or BOOM.

1.1 Leveraging New Infrastructure

The feasibility of BOOM is in large part due to the available open-source infrastructure that has been developed in parallel at UC Berkeley.

BOOM implements the open-source RISC-V Instruction Set Architecture (ISA), which was designed from the ground up to enable VLSI-driven computer architecture research. The RISC-V ISA is clean, realistic, and highly extensible. Available software includes the GCC and LLVM compilers and a port of the Linux operating system. The clean and simple design of RISC-V allowed us to focus on our processor generator's design without getting weighed

down with awkward instructions that demand undue attention. And the advantage of an open-source, community-driven ISA allowed us to leverage an existing tool-chain and software base without spending time away from processor implementation to manage software porting efforts.

BOOM is written in Chisel, an open-source hardware construction language developed to enable advanced hardware design. Chisel allows designers to utilize concepts such as object orientation, functional programming, parameterized types, and type inference, which make it easier to implement highly parameterized hardware generators. From a single Chisel source, Chisel can generate a cycle-exact software simulator, Verilog targeting FPGA designs, and Verilog targeting ASIC tool-flows. One of Chisel's strengths is its focus on generating well-formed, synthesizable Verilog, which decreased design risk. Chisel also brings software-development-level productivity to RTL coding. Other hardware description languages feel unwieldy, with many common design patterns being awkward or verbose to describe. But Chisel lets us focus more time on the ideas we wanted to express, and less time on figuring out how to express them.

UC Berkeley also provides the open-source Rocket-chip System-on-a-Chip (SoC) generator, which has been successfully taped out over a dozen times in multiple different, modern technologies by multiple groups [4, 57, 107]. BOOM makes significant use of Rocket-chip as a library – the caches, the uncore, and the functional units all derive from Rocket. In total, over 11,500 lines of code are instantiated by BOOM from the Rocket-chip repository.
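To give a flavor of the kind of parameterized hardware description Chisel enables, the short module below is a minimal, hypothetical sketch: it is not taken from the BOOM or Rocket-chip source, and the DelayLine name and its parameters are invented for illustration. It shows how ordinary Scala values and functional constructs (an integer parameter and a foldLeft) elaborate into parameterized hardware:

    import chisel3._

    // A width- and depth-parameterized delay line: `depth` back-to-back registers.
    class DelayLine(width: Int, depth: Int) extends Module {
      val io = IO(new Bundle {
        val in  = Input(UInt(width.W))
        val out = Output(UInt(width.W))
      })
      // foldLeft chains `depth` RegNext stages at elaboration time; the emitted
      // Verilog contains only the resulting flip-flops, with no loop construct.
      io.out := (0 until depth).foldLeft(io.in) { (sig, _) => RegNext(sig) }
    }

Elaborating, say, a DelayLine(64, 4) versus a DelayLine(8, 1) yields structurally different Verilog from the same source, which is the property BOOM relies on to generate a whole family of cores rather than a single design point.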

1.2 Contributions

This thesis makes the following contributions:

• A complete implementation of a superscalar, out-of-order processor generator — We built a superscalar, out-of-order processor generator. BOOM implements the entire RISC-V RV64G ISA and the page-based Sv39 Privileged ISA such that we can boot the Linux operating system and run user-level applications.

• A competitive implementation — BOOM achieves comparable (or better) branch prediction accuracy and instructions-per-cycle performance relative to similarly sized industry out-of-order processors.

• A productive implementation — We demonstrated our productive processor generator design by implementing it using only 16k lines of code. We were able to accomplish this task by leveraging many new artifacts in the nascent open-source hardware ecosystem. Specifically, we were able to leverage the open-source RISC-V ISA and its accompanying tool-chain and testing infrastructure, the open-source Chisel hardware construction language, and the open-source Rocket-chip SoC generator.

• Demonstrated productivity with an Agile tape-out — We further demonstrated our productivity and agility by making significant micro-architectural design changes as part of a two-person tape-out performed over four months.

1.3 Thesis Outline

In Chapter 2 we discuss the state of the industry and of research, and of the technology trends that are motivating changes in both. In particular, there is a growing requirement for high-fidelity RTL-based simulations that can provide greater system depth and a higher degree of confidence in performance and power estimations. In Chapter 3 we provide an overview of our superscalar, out-of-order processor generator called BOOM. Chapter 4 discusses in detail our implementation of the branch prediction and instruction fetch pipelines, and how we were able to implement complex branch predictors that can be easily modified and changed to explore new ideas. In Chapter 5 we discuss in more detail how we leveraged the Chisel language to productively describe a superscalar, out-of-order datapath generator; we also discuss how we leveraged expert-written functional units to increase our productivity in describing a full processor system capable of executing floating-point applications. And finally, in Chapter 6 we demonstrate our productivity through a case study in which we taped out an instantiation of BOOM as part of a test-chip fabricated using the TSMC 28 nm HPM process.

Chapter 2

Background

This chapter provides the background and motivation for implementing the open-source Berkeley Out-of-Order Machine (BOOM). Section 2.1 discusses the importance of computers and provides motivation for studying computer architecture. Section 2.2 discusses the trends in technology regarding Moore's Law and Dennard Scaling and their effect on processors and on computer architecture research. Section 2.3 discusses the current state-of-the-art in computer architecture research regarding processor simulators as well as related work regarding similar register-transfer-level (RTL) implementation efforts. Section 2.4 describes the organization, or microarchitecture, of a typical, modern processor and introduces the out-of-order microarchitecture. Section 2.5 provides a taxonomy of out-of-order processors while Section 2.6 discusses the history of out-of-order processors. Section 2.7 elaborates on the performance advantages that out-of-order processors have over other processor designs, and Section 2.8 concludes.

2.1 Motivation

For the past half-century, computers have been a key enabler in technological and societal change. The number of processors worldwide has continued to proliferate with little sign of slowing despite the industry upheaval caused by the current slowing of Moore's Law and the end of Dennard Scaling in 2005. Computer usage has been growing on many fronts. For cell phones, the numbers are absolutely staggering. From the United Nations' World Telecommunication/ICT Indicators Database, over 6 billion people own a cell phone [116] (see Figure 2.1). As a point of comparison, only 4.5 billion people have access to toilets or latrines [115]. For many, cell phones have been an enabler for economic development and societal progress [1, 28, 96]. Cell phones are not the only beneficiaries of computer architectural innovation. The growth of Software as a Service (SaaS) has led to the rise of a new class of computer – the Warehouse-Scale Computer (WSC). There are many demanding applications running on WSCs: search, media delivery, gaming, shopping, social networking, and machine learning.

[Figure 2.1 plot: "Mobile-cellular Telephone Subscriptions Worldwide"; y-axis: subscribers (in billions); x-axis: year, 1990-2015.]

Figure 2.1: Cell phone subscriber counts as provided by the United Nations [116].

WSCs are typically built in locations where the price of power is cheap, as they require megawatts to power (and cool) tens of thousands of compute nodes [77]. Over a ten-year period, 30% of the system cost of a WSC may go to energy, power distribution, and thermal management [77]. However, using wimpier but more power-efficient cores can dramatically degrade service response times, making the trade-off between performance and power-efficiency a non-trivial design problem [45].

At the opposite end of the compute spectrum, the Internet of Things (IoT) promises to create a world of billions of "smart" sensors, dwarfing the current number of deployed processors [9]. However, each device will likely have to cost less than $1 and be able to sustain itself on a tiny power budget – perhaps sipping the necessary power from the environment itself [80].

While the number of processors will continue to grow into the near future at all parts of the compute spectrum, from small microwatt processors (IoT) to megawatt computers (WSC), there are a number of challenges that designers face, such as cost, reliability, service response time, thermal constraints, and energy-efficiency [45].

2.2 Technology and Research Trends

Changes in semiconductor technology have provided computer architects with many new tools with which to pursue higher performance processors in a variety of computing environments. Moore's Law has provided architects with more (and faster) transistors with which to implement more complex designs while Dennard Scaling has allowed the power requirements to stay manageable. However, the end of Dennard Scaling has made power a first-order design constraint and seriously limited full-frequency transistor utilization.

In 1965, Gordon Moore released a report discussing the trends in transistor technology.

[Figure 2.2 plot: "45 years of Microprocessor Trends"; log-scale series for transistors (thousands), single-thread performance (SPECInt x 10^3), frequency (MHz), typical power (watts), and number of logical cores, versus year, 1970-2020.]

Data available at http://www.karlrupp.net/2015/06/40-years-of-microprocessor-trend-data. Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.

Figure 2.2: Although signs of slowing have begun to show, transistor counts have otherwise continued to scale as predicted by Moore's Law for over 45 years. However, around 2005 processors began to be restricted by power and thermal budgets at around 100 watts (the "power wall") as Dennard Scaling came to an end. This new power constraint prevented frequency from scaling any further, which in turn adversely affected the scaling of single-core performance. Unable to improve scalar performance at the same rates as before, transistor budgets have been redirected to focus on other techniques for delivering value to the user. One technique has been to focus on increasing parallel performance by increasing the number of logical cores.

He noted that electronics became cheaper as more transistors could be placed onto a single circuit chip, reducing the total number of chips needed. However, as the number of transistors increased, so did the probability of having a bad transistor, lowering the manufacturing yield and increasing the cost of each chip. These two economic forces worked against each other. But as manufacturing yield improves, cost-efficiency motivates smaller and smaller transistors. In short, transistors would continue to become smaller very quickly, incentivized by economics [72]. Moore noticed that the number of transistors per chip was doubling every year and — from only five data points — predicted that this trend would continue for "at least 10 years" into the future. Indeed, "Moore's Law" — that the most cost-effective transistor size would continue to shrink, leading to a doubling of transistor counts roughly every 18 to 24 months — has been maintained for over 50 years.

Three years after publishing his report, Gordon Moore, along with Robert Noyce, founded Intel (originally called NM Electronics). Their first microprocessor, the Intel 4004, used 2,300 transistors in 10 µm technology in 1971. 45 years later, the Intel Broadwell-EP Xeon E5-2600 V4 shipped with 7.2 billion transistors using 14 nm technology [79]. Although the number of foundries who can afford the rising costs of pursuing smaller technology nodes has dwindled, some foundries have nonetheless released details of their intentions for delivering 7 nm technology [51].

Moore's Law has provided computer architects an abundance of transistors with which to implement new ideas and techniques for improving processor performance. However, while Moore's Law may end soon [70], the end of Dennard Scaling ten years ago signaled the beginning of a real crisis in the industry. Table 2.1 shows the transistor scaling factors under a Dennard Scaling regime. By scaling voltage, doping, and transistor size together, the signal delay through the transistor could be decreased by the same factor while maintaining the same power density as before. In summary, under Dennard Scaling, as a transistor's length and width were halved, new processors could have four times as many transistors, be clocked twice as fast, and use the same amount of power as the previous generation. This led to a golden era of computer architecture in the 1980s and 1990s as transistor counts doubled every 18 months and processor clock frequencies doubled every 36 months. Software became faster by virtue of Dennard Scaling, and computer architects could explore more complex and transistor-hungry designs to exploit the growing transistor budgets while still hitting their power budgets. Technology improvements working in concert with microarchitectural innovations have led to a 50,000x increase in single-core processor integer performance over the last 45 years, as shown in Figure 2.3.

Although the scaling had never been perfect, around 2005 technology scaling fully diverged from the ideal scaling rates described by Dennard, as the voltage could no longer be scaled even as transistors continued to become smaller [104]. As the gate oxide thickness (tox) grew smaller (which directly sets the threshold voltage Vth), the current leakage across the gate grew, leading to rising static power and a reduction in transistor reliability [46]. This static power is now significant enough to be a considerable factor in a processor's power budget.

Table 2.1: The scaling factors of a single transistor under a Dennard Scaling regime (1965-2005) [27]. By scaling the voltage and doping along with the size of the transistor each by κ, each transistor can now be clocked κ times faster while maintaining the same power density as the previous generation transistor.

Device or Circuit Parameter           Dennard Scaling Factor
Device dimension (tox, L, W)          1/κ
Device quantity (A)                   κ²
Doping concentration (Na)             κ
Voltage (V)                           1/κ
Current (I)                           1/κ
Capacitance (A/t)                     1/κ
Delay time / circuit (VC/I)           1/κ
Power dissipation / circuit (VI)      1/κ²
Power density (VI/A)                  1
Utilization (1/Power)                 1
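As a quick sanity check on how these factors compose (an illustrative aside, not part of the original table [27]): per-circuit power scales as 1/κ² while device quantity scales as κ², so

\[
\underbrace{\tfrac{1}{\kappa^{2}}}_{\text{power per circuit}} \times \underbrace{\kappa^{2}}_{\text{device quantity}} = 1,
\]

meaning total chip power stays constant even though a new generation with κ = 2 packs 4x as many transistors and, because delay shrinks by 1/κ, clocks them 2x faster. This is precisely the "four times the transistors, twice the clock, same power" behavior summarized above.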

Table 2.2: The scaling factors of a single transistor in a post-Dennard Scaling regime (post-2005). As voltage can no longer be scaled down with transistors, each transistor now dissipates the same power as before, leading to an increase in power density equal to the increase in transistor density. Constrained by power and thermal requirements, successive generations of processors can use only a shrinking fraction of their transistors at full frequency, falling at a rate of 1/κ². This inability to utilize all of the chip's transistors at full frequency simultaneously is known as Dark Silicon [104].

Device or Circuit Parameter           Post-Dennard Scaling Factor
Device dimension (tox, L, W)          1/κ
Device quantity (A)                   κ²
Voltage (V)                           1
Capacitance (A/t)                     1/κ
Delay time / circuit (VC/I)           1/κ
Power dissipation / circuit (VI)      1/κ²
Power density (VI/A)                  κ²
Utilization (1/Power)                 1/κ²

With the threshold voltage unable to scale any further, supply voltage (Vdd) scaling has also stopped. The supply voltage Vdd has a quadratic relationship to the processor's dynamic power [24].

\[
Power_{cpu} = Power_{static} + Power_{dynamic} \qquad (2.1)
\]

Static power is largely a function of the leakage current across the gate, which worsens as the threshold voltage is lowered. Dynamic power is dissipated on every transistor state transition, and is thus dependent on how often and how many transistors flip from off to on (and vice versa). The α term denotes what fraction of the transistors transition every cycle and is dependent on the workload, f denotes the clock frequency, and C is the load capacitance of the processor¹,²:

\[
Power_{dynamic} = \alpha \cdot C \cdot V_{dd}^{2} \cdot f \qquad (2.2)
\]
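As a purely illustrative plug-in of Equation 2.2 (the numbers below are hypothetical and do not describe any processor discussed in this thesis), suppose a chip has 50 nF of switched load capacitance, of which a fraction α = 0.2 toggles each cycle, running at Vdd = 1.0 V and f = 3 GHz:

\[
Power_{dynamic} = 0.2 \times 50\,\mathrm{nF} \times (1.0\,\mathrm{V})^{2} \times 3\,\mathrm{GHz} = 30\,\mathrm{W},
\]

a figure in the range of a desktop-class power budget, and one that makes clear that once Vdd stops scaling, the remaining levers are α (e.g., clock gating), C, and f.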

As frequency f is typically coupled to Vdd, this provides the following relationship to dynamic power:

\[
Power_{dynamic} \propto V_{dd}^{3} \qquad (2.3)
\]

With the supply voltage staying constant in a post-Dennard Scaling regime, each transistor now dissipates the same amount of power as before, leading to growing power density as the number of transistors on a chip continues to increase. As Moore's Law has continued past the end of Dennard Scaling, this has meant that processors can no longer operate every transistor at full frequency without exceeding the power or thermal limitations of the chip. Instead, computer architects have had to rely on a number of techniques to manage the new "power wall" limitations [104]. Table 2.2 describes this new scaling regime.

One change was a move to multicore processors. Instead of a single, monolithic, high-frequency core, processors moved to a modest number of processor cores that were clocked at a slightly slower frequency than the previous generation. As shown by Equations 2.2 and 2.3, a slight reduction in frequency allows for a slight reduction in voltage, which leads to a significant reduction in power. However, the move to multicore processors has placed a significant burden on software developers. One problem with the shift to multicore processors is the fact that most applications have a fundamental serialization limit, codified by Amdahl's Law. This serialization bottleneck restricts the maximum potential speedup from parallelizing a particular algorithm [2].
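To make this frequency-voltage trade-off concrete, consider a purely illustrative calculation (the 15% figure below is hypothetical and is used only to exercise Equation 2.3, under the common assumption that frequency scales roughly linearly with supply voltage):

\[
\frac{P_{new}}{P_{old}} \approx \left(\frac{V_{new}}{V_{old}}\right)^{3} = (0.85)^{3} \approx 0.61
\]

That is, lowering frequency and voltage together by 15% cuts dynamic power per core by roughly 39%, which illustrates why the multicore designs described above clock each core slightly slower than its single-core predecessor.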

¹ These equations assume the voltage swing is equal to the supply voltage [99].
² Particularly for high-frequency processors, there is an added "short-circuit" power dissipation term that occurs when both transistors in a CMOS gate are on simultaneously, connecting the power source to ground; it is a function of the activity factor α and the frequency f [109].

Another approach to managing the power wall has been to rely on specialization. A particular algorithm may only utilize a small, but highly specialized, fraction of the chip. As the program goes through different phases and utilizes different algorithms for each phase, different parts can be turned on or off as needed. Specialized circuits, known as accelerators, have been used for cryptography, compression, graphics, and machine learning. Accelerators can provide performance and energy-efficiency improvements in the range of 100-1000x over a general-purpose processor core [104].

2.2.1 Impact of Performance Slowdown on Architecture Research

The end of Dennard Scaling has led to significant challenges in the field of computer architecture. Processor frequency gains and single-core performance have stalled. As shown in Figure 2.3, it is becoming more and more challenging to realize additional gains in single-core performance. For this reason, we believe it is becoming more important to simulate and evaluate microarchitectural ideas at a higher level of fidelity than was needed in the past.

There are many different levels of fidelity for describing a processor. High-level, cycle-approximate models are used to simulate a processor to provide a high-level view of how the performance might change. The register-transfer level (RTL) is a level in which the processor description encompasses the behavior and movement of data between hardware registers. Although RTL was originally used as a relatively high-level simulation abstraction to provide a cycle-accurate simulation of a design, VLSI tools have since evolved to map RTL descriptions directly to physical realizations.

While high-level, cycle-approximate models provide the flexibility to quickly iterate and explore the design space, they have significant blind spots. RTL models provide a number of advantages. First, coupled with the appropriate CAD tools, RTL implementations can provide area, power, and timing information that is not available in higher-level models. Second, RTL implementations are more likely to be grounded in reality. This grounding has a number of advantages: it provides a higher degree of confidence, it can prevent correctness errors from changing research conclusions, and it can motivate new design ideas that must manage realistic constraints.

Despite industry's best efforts, the gains in single-thread performance have continued to slow. From the data in [53] and [78] (Figures 2.2 and 2.3), SPECint performance has increased by only around 9% per year in the last six years. The most recent Intel processor, the 2017 Kaby Lake, uses the same technology process and the same microarchitecture as its 2015 predecessor Skylake. Kaby Lake represents an "optimization" stepping in which Intel utilized low-level optimizations to provide a 7% improvement in the peak clock frequency. However, this peak frequency can only be realized by a small fraction of the processor at any one point in time. Intel calls this "Turbo Boosting", in which a single core may be clocked at a higher frequency while keeping the other cores in a lower energy state so long as the total processor's power and thermal requirements are maintained.

To realize these continued gains often requires a few new microarchitectural tricks, each providing only a few percent gain in performance.

[Figure 2.3 plot: "40 years of Processor Performance"; SPECint performance relative to a VAX-11/780 (log scale) versus year, 1980-2015, with per-era growth annotations of 22%/year, 52%/year, 23%/year, 9.3%/year, and 3.5%/year, for machines ranging from the VAX-11/780 and VAX 8700 through MIPS, Sun, IBM, HP, and Digital Alpha systems to recent Intel Core i7 and Xeon processors (up to the Core i7-7700k at 4.2 GHz).]

Figure 2.3: Microarchitectural ideas for improving performance are becoming harder and harder to come by. Data is shown comparing SPECint (base) performance relative to a VAX-11/780. Early performance gains were largely technology driven. In the 1980s and 1990s, computer architectural innovations – such as the ability to recognize and exploit instruction-level parallelism (ILP) – worked together with continued technological improvements to deliver even greater performance gains. However, the end of Dennard Scaling in 2005 caused processor frequencies to stall. Around the same time, the remaining ILP had become more difficult to extract. Both of these factors have conspired to slow single-core performance gains. Performance once again appears to be limited to technology improvements and circuit-level techniques that provide more efficient usage of power. As SPEC is updated over time, performance is estimated by use of a scaling factor to normalize across different versions of SPECint (SPEC89, SPEC92, SPEC95, SPEC2000, SPEC2006). The scaling factor is empirically determined by running multiple versions of SPEC on a subset of the provided machines. The SPEC performance of the just-released Kaby Lake i7-7700k is synthetically created by comparing the relative Geekbench results [37] and the clock frequency improvement to the Skylake i7-6700k, with which it shares the same micro-architecture and process technology. Data is provided by [78].

Thus, to evaluate a new, industry-worthy idea requires a significant degree of fidelity to provide the necessary confidence. Current methodologies are geared towards finding integer-factor improvements, and are not well suited for exploring small wins on the order of 2-3%. These small gains suggest a requirement for higher-fidelity methodologies that include utilizing RTL implementations.

2.3 Computer Architecture Research Trends

In a perfect world, computer architecture researchers would design, implement, and then fabricate their research designs to fully analyze and evaluate their ideas. Although there are a few examples of such ambitious research projects [88, 105, 57, 100], this methodology is largely infeasible for a number of reasons: the feedback loop from generating an idea to measuring a single implementation takes many years; the man-power and fabrication costs can be prohibitive; and some research ideas attempt to explore future designs that presume technological capabilities that are not yet available.

For these reasons, most computer architecture research relies on simulators to model processors. Simulators can allow a very small team of researchers to quickly implement and explore a design space before providing recommendations. Even industry first relies on insights gained by simulators before spending resources implementing a physical design. Ideal simulators attempt to achieve the following goals:

1. The fidelity of the simulator – does it accurately model the intended target?

2. The simulation speed of the simulator – how many design points (or workloads) can we explore in a fixed period of time?

3. The scope of the simulator – does it provide area, power, and timing insight into the design?

4. The flexibility of the simulator – can new ideas be quickly and easily implemented and evaluated?

Unfortunately, these traits are in competition; there is no perfect simulator.

2.3.1 Architectural Simulators

Computer architecture research is largely carried out using simulators to model future hypothetical processor designs. Unfortunately, most research has focused on ease of implementing new ideas at the cost of fidelity, speed, and accuracy. Attempts to improve these metrics have largely been zero-sum with the ease of implementing new ideas.

SimpleScalar, released in 1996, models an out-of-order processor which implements a simplified derivative of the MIPS IV ISA [17]. However, SimpleScalar can only execute user-level code, making it difficult to explore interactions between user and privilege-level code like operating systems [16]. Users would also have to compile their benchmarks specifically for SimpleScalar and could not study commercial or off-the-shelf software.

Simics is a functional architectural simulator released in 2002. It executes unmodified software programs and is able to simulate a full system including supervisor instructions, and provides interfaces to many different I/O devices. However, it only provides functional simulation and provides no microarchitectural or timing information [61]. Later efforts from other groups have added their own timing models on top of Simics [62].

The SESC ("Super ESCalar") simulator (2004) splits the execution model from the timing model. One advantage of this separation of concerns is that it allows small errors in the timing model to not affect the correctness of the simulation. In a few cases, SESC developers purposefully eschew rare corner-cases that could affect the simulation speed or complexity of the simulator. SESC is not a full-system simulator and must trap all application system calls and emulate them [75]. Although it targets a MIPS ISA, it requires a custom toolchain to build applications to run on SESC. SESC is around 120k lines of code spread across nearly 400 files [52].

The QEMU emulator (2005) is a very fast functional full-system emulator. It uses dynamic binary translation to translate instructions from the target architecture into a sequence of host instructions. QEMU then caches, or remembers, these translations for future iterations through the program. This optimization allows QEMU to run close to native host processor speeds, simulating over 1 billion instructions per second (1 GIPs) [12]. QEMU, by itself, provides no timing modeling.

The gem5 simulation infrastructure (2011) provides both microarchitectural timing models and a full-system simulator able to simulate many different commercial ISAs. The gem5 system provides a flexible, modular framework that can utilize a number of processor models, memory system models, and network models. The range of models of varying complexity and accuracy allows users to trade off simulation speed for fidelity. Gem5 also provides a full-system simulation mode which simulates user-level and privilege-level software while also providing models of device drivers for the OS to interface with. Checkpoint saving and restoring support is available, allowing researchers to avoid painful start-up overheads by loading the same architectural state from a particular workload onto different microarchitectures under study [14].

MARSS (Micro Architectural and System Simulator) is a simulator (2011) built on top of the QEMU emulator [12]. When executing in a low-level, cycle-approximate detailed simulation mode, it is able to simulate a multicore processor at a simulation speed of 200 to 400 kilo instructions per second (KIPs) [76].
An updated version of SESC, called ESESC ("Enhanced SESC"), also uses QEMU for its execution model, allowing it to simulate the ARM instruction set. ESESC approaches simulating multi-threaded applications across multiple cores using Time-Based Sampling (TBS). The ESESC developers found the average error of the time-based sampling method to be within 4.99% compared to full simulation for a number of target configurations and applications [5]. ESESC integrates with the McPAT energy modeling tool to provide models for power and temperature. By utilizing sampling, ESESC can achieve up to 9 MIPs. The full ESESC repository contains over 14 million lines of C and C++ code, which includes the tool chains, the QEMU simulator, and the McPAT energy model. The ESESC-specific timing models are more manageable at 100k lines of code, and the thermal and power code is 300k lines of code [32].

SimFlex is another simulator that relies on statistical sampling to provide wider coverage across long-running workloads. Flexus is a full-system simulation infrastructure built on top of Simics. SimFlex is a methodology built on top of Flexus which uses a mixture of restoring workload checkpoints, fast microarchitectural warming, and detailed simulation of small simpoints to approximate the performance of a full workload [121].

Simulating processor cores at a high level of fidelity naturally leads to very slow simulations — the simulation model (which runs on the order of 100s of kilohertz) is typically on the order of 10,000 times slower than a real processor (running in the range of a gigahertz). As processors move to using multiple cores, the simulation slow-down is exacerbated. Tan [103] reported that the number of simulated cycles per benchmark in ISCA in 1998 was the same in 2008, but that the simulated core counts had increased from 1 to 16, diluting the number of simulated cycles per core.

The Sniper Multi-core Simulator tackles this problem by simulating each core at a higher level of abstraction using "interval simulation" [38, 43]. Instead of simulating the core at a cycle-approximate level, Sniper builds an analytical model as it executes a program. It begins by breaking up programs into "intervals" based on "miss events", which are expensive, slower operations such as cache misses or branch mispredictions. Sniper then analyzes how these expensive miss events interact and provides an estimation for how much time each interval adds to the simulation time. This higher level of abstraction allows Sniper to run at "around 1 MIPS for an 8-core simulation" [5]. Sniper was validated against multi-socket Intel Core2 and Nehalem systems and provides average performance prediction errors within 25% while achieving a speed of up to several MIPS [43].

Simulating even higher core counts has proven difficult, as the additional cores further degrade the simulation speed. Ideally, each simulated ("target") core could be simulated on a single "host" core. However, to provide a cycle-accurate model, the host cores would have to synchronize after simulating every target cycle. Also, any communication between the target cores would have to be handled between the host cores before the next cycle of target simulation could begin. The goal of Graphite is to enable thousand-core processor simulations. Graphite accomplishes this by being a distributed, parallel multi-core simulator that relies on relaxing the strict ordering of events between the simulated cores.
Each target core is broken into two pieces — a core model and a memory model — and each piece is simulated via its own host thread. The threads are then scheduled across a cluster. When target cores send data to one another, they also synchronize their clocks, allowing core clocks to skew within a bounded amount of time. Running 1024 cores on eight 8-core processors exhibited a 41x slow-down from native execution [69]. Graphite is built on top of Pin to provide dynamic binary translation [59]. Pin allows off-the-shelf ISA applications to be functionally simulated while also providing event handlers to hook into the Graphite timing model.

ZSim is another thousand-core simulator that utilizes dynamic binary translation techniques to create "instruction-driven timing models", as opposed to "event-driven models". It also introduces a technique called "bound-weave", a two-phase technique in which cores may skew events with respect to each other for a bounded period of time in the first phase, and in the second phase replay the trace of events to determine the actual latencies [85].

A major downside to all of these software simulators is the roughly 10,000x simulation penalty to model any level of microarchitectural timing. One solution is to run the simulator on a Field Programmable Gate Array (FPGA). FPGAs are silicon chips that are "programmable" – RTL hardware designs may be synthesized and "simulated" on the FPGA. As FPGAs run on the order of 10s to 100s of MHz, FPGA-based simulators can provide orders-of-magnitude increases in simulation speeds [23, 102].

Ramp Gold is one example of an FPGA-based simulator [103, 102]. It simulates 64 SPARC in-order cores on a single $750 FPGA board. It accomplishes this by a) decoupling the FPGA's clock from the simulated target clock (many FPGA clock cycles are required to simulate a single target cycle), b) abstracting some of the uncore timing models to allow for higher-level descriptions, and c) multi-threading the single, physical processor pipeline and sharing it across 64 simulated target cores. RAMP Gold's FPGA clock ran at 50 MHz, or 0.0156 MHz target frequency, for a speedup of roughly 250x relative to a detailed Simics+GEMS simulation of a similar target platform. The Ramp Gold functional model is roughly 35,000 lines of SystemVerilog while the timing model is only 1,000 lines of SystemVerilog [103]. However, Ramp Gold's major downside is the difficulty in modifying Ramp Gold to explore other processor design points. The timing model and the functional model are fairly wedded to simulating homogeneous, in-order cores.

Many of the simulators discussed suffer from similar problems: 1) many do not support full systems, 2) they suffer from slow simulation speeds (generally less than 200 kHz for software simulators [85]), 3) they cannot produce area, power, or timing numbers, and 4) they produce hard-to-trust or hard-to-verify results when exploring new designs that look significantly different from current industry designs. Unfortunately, the challenges imposed by the power constraints brought on by the end of Dennard Scaling make it vitally important that microarchitectural ideas are evaluated not just by their effect on performance but also by their effect on area and power. Perhaps more importantly, the significant slow-down in single-core performance improvements necessitates that any ideas we explore be evaluated with high fidelity.

2.3.2 RTL Implementations

Due to the limitations of simulators described in the previous section, there have been a few academic efforts to implement cores at the register-transfer level (RTL). We will focus on out-of-order cores, which attempt to capture the highest level of general-purpose single-core performance.

The Illinois Verilog Model (IVM) is a "latch-accurate" model of a 4-issue, out-of-order core designed to study transient faults [119]. Written in Verilog-95, IVM originally relied on some unsynthesizable constructs (such as while loops that do not terminate on constant synthesis-time values), but it has since been patched to be fully synthesizable [24].

Table 2.3: A sample of simulators. Many modern simulators use dynamic binary translation (DBT) to accelerate execution, typically leveraging existing functional emulators QEMU [12] or Pin [59]. Data compiled with help from [85].

Simulator          Year  Engine          Parallelization  Detailed Uarch  Full System  Unmodified binaries
SimpleScalar [16]  1996  emulation       sequential       OOO             no           no
Simics [61]        2002  emulation       sequential       no              yes          yes
SESC [75]          2004  emulation       sequential       OOO             no           no
QEMU [12]          2005  DBT             sequential       no              yes          yes
Graphite [69]      2010  DBT (Pin)       skew             approx-IO       no           yes
Ramp Gold [103]    2010  FPGA            sequential       IO              no           yes
gem5 [14]          2011  emulation (m5)  sequential       OOO             yes          yes
MARSS [76]         2011  DBT (QEMU)      sequential       OOO             yes          yes
Sniper [43]        2012  DBT (Pin)       skew             approx-OOO      no           yes
ESESC [5]          2013  DBT (QEMU)      sampling         OOO             yes          yes
ZSim [85]          2013  DBT (Pin)       bound-weave      DBT-OOO         no           yes

Table 2.4: A sample of academic out-of-order processors and the open-source UltraSPARC T2. The UltraSPARC T1, also open-source, was modified by Princeton to create the OpenPiton many-core processor [11].

                    IVM              SCOORE   FabScalar        Sharing          BOOM            UltraSPARC T2
                    [119]            [10]     [29, 84]         [126]                            [74]
industry            √
fully synthesizable √ √ √ √ √
FPGA                √ √ √ √
parameterized       √ √
floating point      √ √ √
atomic support      √ √
L1 cache            √ √ √ √ √ √
L2 cache            √ √ √ √
virtual memory      √ √
boots Linux         √ √
multi-core          √ √ √
ISA                 Alpha (sub-set)  SPARCv8  PISA (sub-set)†  Alpha (sub-set)  RISC-V          SPARCv9
lines of code       30,000           ?        75,000†          31,900           9,000 + 11,500  1,900,000

†Information was gathered from publicly available code at [33].

Written in Verilog-95, IVM originally relied on some unsynthesizable constructs (such as while loops that do not terminate on constant, synthesis-time values), but it has since been patched to be fully synthesizable [24].

The Santa Cruz Out-of-Order RISC Engine (SCOORE) was designed to efficiently target both ASIC (1 GHz at 90 nm) and FPGA (200 MHz) implementation [67]. Unfortunately, SCOORE lacks a synthesizable fetch unit and was never completed.

FabScalar is a tool for composing synthesizable out-of-order cores. It searches through a library of parameterized components of varying pipeline width and depth to find an optimal core design for a given benchmark set, guided by performance constraints given by the designer [24]. FabScalar has been demonstrated on an FPGA [29]; however, as FabScalar did not implement caches, all memory operations were treated as cache hits. Later work incorporated the OpenSPARC T2 caches in a tape-out of FabScalar [84].

The Sharing Architecture is composed of a two-wide out-of-order core (or "slice") that can be combined with other slices to form a single, larger out-of-order core. By implementing a slice in RTL, the authors were able to accurately demonstrate the area costs associated with reconfigurable, virtual cores [126].

Unfortunately, there are currently no open-source industry implementations of an out-of-order core. However, Sun released two of its SPARC v9 server-class processors, the UltraSPARC T1 and UltraSPARC T2, under a GPL license in 2006 and 2007 [74]. The T1 and T2 are in-order, multi-threaded, eight-core processors, supporting four and eight threads per core respectively. The goal of the UltraSPARC design was not to have beefy, complex, and expensive cores, but rather to have a larger number of simpler cores, each supporting multiple threads, to provide higher aggregate throughput. The T1 processor was produced in 2005 at 1.2 GHz in 90 nm. The whole chip was 279 M transistors in 378 mm² and used roughly 70 watts [66].

OpenPiton is an academic research processor built using the UltraSPARC T1 source code. Although the OpenPiton team utilized the T1 cores (with some modifications), they implemented their own uncore and network interconnects to build a many-core processor that could scale to hundreds, if not thousands, of cores. They taped out a 25-core version in an IBM 32 nm SOI process on a 36 mm² die with a target frequency of 1 GHz [11].

Although our discussion has so far centered on academic research, industry has its own need for RTL-level emulation of hardware designs. Engineers use "hardware emulation" platforms to test and verify new chip implementations before sending the design out for manufacturing. Even for industry, full-scale product prototypes are typically too expensive to rely on (and are often too opaque to debug as problems arise). Unfortunately, FPGAs are typically capacity-constrained and are unable to fully contain a processor design that is on the order of hundreds of millions to billions of transistors.
Intel demonstrated emulating a complex Intel Nehalem out-of-order core3, but it required five Virtex-5 FPGAs, required changes to 5% of the RTL code, and could only run at a target frequency of 520 kHz. Splitting a design to fit across multiple FPGAs is a manual process that includes implementing the infrastructure to communicate across the FPGA boundaries.

3An 8-core Xeon Nehalem-EX is 2.3 billion transistors.

The other challenge is in translating the design to be synthesizable for FPGAs. Some constructs, such as double-phase latch RAMs and other aggressive circuit-level techniques, have no direct FPGA analogue. Finally, after performing the remapping to a different implementation technology, engineers then have to re-verify that the design is still the same [86].

One solution used by industry is the Palladium hardware emulation platform from Cadence Design Systems. Palladium allows engineers to directly simulate their designs without requiring RTL code changes. And to handle capacity issues, Palladium tools can seamlessly partition a single design across an entire Palladium cluster. In a blog post from 2011, NVIDIA's Emulation Lab discussed their use of multi-million dollar Palladium XP systems to debug and verify their graphics processor (GPU) designs. Although they used multiple Palladium clusters to simulate their GPUs, the entire lab could only simulate 4 billion transistors in total [60]. For comparison, a single multi-million dollar 16-chassis Palladium emulator was used by NVIDIA to debug their Fermi microarchitecture, a 3 billion transistor GPU in 40 nm [95]. The more recent Palladium XP II platform supports up to 2.3 billion ASIC gates of design capacity and can emulate designs at speeds up to 4 MHz. Palladium also provides some power analysis modeling, including the ability to identify power peaks and to model low-power modes [19, 87, 20].

For established companies shipping large and complex designs, Palladium provides a productive solution for verifying designs before manufacturing. However, Palladium is not a good match for smaller teams that cannot afford the multi-million dollar price tag, for engineers who need closer-to-real-time performance for running longer software workloads, or for early-stage design exploration where some aspects of the project may be abstracted to allow flexible modification of the system under test.

2.3.3 Energy Modeling Methodologies

After the end of Dennard Scaling, power consumption and energy efficiency have become first-order design constraints. This "power wall" is further exacerbated for mobile processors, which must deliver desktop-level user experiences in a roughly 1 watt envelope. Power usage is important in all environments, though for different reasons. For mobile processors, poor energy efficiency directly limits the length of time between required battery charges, and high power usage can create thermal issues and even physically burn the user. For desktop processors, more power usage requires louder air-cooling solutions. For servers, there is a fundamental limit to how much power can be brought physically into the building to power the chips and the cooling systems that manage the heat given off by the processors.

Unfortunately, analyzing the power and energy usage of designs has proven to be very difficult, and the most accurate measurements can only be made after a design has been taken through the floorplanning phase (or better yet, measured from a physical chip). Instead, a common technique is to couple analytical power models with micro-architectural simulators.

Wattch is an architecture-level power framework that connects to the SimpleScalar simulator, allowing early culling of the design space via high-level power estimation. The functional simulator generates microarchitectural event counts, such as instruction fetches or cache accesses, which are then fed into an abstract power model. However, Wattch only provides dynamic power modeling; it does not provide area, timing information, or static power modeling. As transistors have become smaller, static power usage due to increased leakage current has become a significant fraction of total system power [15].

A similar energy estimation framework is SimplePower. SimplePower plugs into SimpleScalar and models a 5-stage pipeline, L1 caches, and off-chip memory. It uses simple analytical models for estimating memory power usage. For the busses between different levels in the memory hierarchy, SimplePower uses a transition-sensitive approach, taking into account both the switching activity and the interconnect capacitance on the lines. It also models transition-sensitive functional units to capture datapath power usage. However, it only provides energy estimation for an integer RISC ISA using a 5-stage pipeline [118].

McPAT is the currently popular tool that provides timing, area, and power estimations for multi-core and many-core processors. McPAT supports both in-order and out-of-order cores. McPAT can be integrated into any performance simulator, such as ESESC or gem5, by taking in the dynamic activity event counts and feeding them into its power models. McPAT also has the ability to model different processor energy states, and even feed that information back into the architectural simulation to model power and thermal events during the lifetime of the workload [58].

However, energy models like McPAT and Wattch can be difficult to trust. The accuracy of these tools is constrained by the fidelity of their models and by the limited validation performed against existing systems. Component models can be too high-level, incomplete, or make incorrect assumptions about the underlying microarchitecture [123]. Regarding validation, McPAT was verified against a 90 nm Niagara, a 65 nm Niagara 2, a 65 nm Xeon Tulsa, and a 180 nm Alpha 21364 by comparing peak power numbers based on maximum switching activities.
McPAT's verification saw differences of between 11% and 23% between published peak powers and McPAT's estimations [58]. Although these methodologies can be useful for exploring designs similar in technology and microarchitecture to previously validated designs, the issues above are exacerbated as researchers pursue more exotic processor architectures in newer technology nodes for which no good validation target exists.

Another issue with energy models, like the microarchitectural simulators that often feed them, is their limited simulation speed. This problem worsens as the model becomes more detailed. However, there have been a number of techniques to accelerate energy models. One approach has been to implement the power model in a synthesizable RTL description that can be executed more quickly on an FPGA [25, 40]. Other techniques rely on an RTL-level processor description which can be executed on an FPGA to achieve high simulation speeds (10-100 MHz). However, choosing which microarchitectural events to track within the FPGA simulation is a challenge and typically requires designer intuition [13, 101].

Strober is a tool which automatically instruments an RTL-level processor design and then synthesizes and executes the design on an FPGA. Strober periodically snapshots the entire microarchitectural state of the simulation and then loads that state into a gate-level simulation to obtain very detailed power estimations. Designer intuition is not needed, as the entire state is snapshotted. These gate-level simulations are quite accurate, as they simulate a floorplanned design at the gate and wire level. However, they are also incredibly slow – only a few hertz. Thus, Strober builds an average power estimate by sampling points in long-running workloads. Strober can also provide power estimations for any arbitrary RTL, not just designs that look like processors [56].
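To make the event-count approach concrete, the following is a small sketch of the kind of estimate Wattch- and McPAT-style frameworks produce (the symbols are illustrative, not the actual equations of either tool): the simulator reports how many times each microarchitectural event occurred, and each event type is weighted by a per-event energy derived from circuit-level models.

$$E_{\text{dyn}} \approx \sum_{i} n_i E_i, \qquad P_{\text{avg}} \approx \frac{E_{\text{dyn}}}{T_{\text{exec}}} + P_{\text{static}}$$

Here $n_i$ is the count of event type $i$ (e.g., cache accesses or register-file reads), $E_i$ is its per-event energy (roughly proportional to the switched capacitance $C_i$ times $V_{dd}^2$), $T_{\text{exec}}$ is the simulated execution time, and $P_{\text{static}}$ accounts for leakage, which Wattch notably omits.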

2.4 Processor Microarchitecture

A processor takes in a stream of instructions and performs computations as specified by each instruction. The architecture is the description of a processing system from the point of view of the software and the programmer. Nearly universally, general-purpose processor architectures have consolidated on the following agreement with the programmer:

• a software program is broken down and described as a sequence of instructions
• each instruction is executed in the order specified by the program ("in-order")
• each instruction is executed one at a time before moving to the next instruction

The microarchitecture is an implementation of the architecture. There can be many different microarchitectures that all faithfully implement the same architecture. This distinction between the architecture and the microarchitecture was pioneered by IBM with the IBM System/360 [35]. Before the IBM S/360, new processor implementations often required significant efforts to port old software to the new machines, as program binaries were typically incompatible from one machine to the next.4 Over time, architectures began to organically grow around different product lines (e.g., business, scientific computing, and defense) as customers began to demand software compatibility as a requirement for buying new machines. This fracturing of architectures across product categories meant that processor companies like IBM were dividing their development efforts. IBM astutely realized that the software was both more expensive and more valuable than the hardware, and that there was a benefit to unifying its disparate product lines under a single architecture.

IBM originally shipped four different microarchitectures – each at a different performance and price point – that all implemented the IBM S/360 architecture. Software written and executed on one machine could be trivially executed on a different microarchitecture. Users that wanted better performance for their software needed only buy a more powerful machine. Fifty years later, IBM is still shipping ever faster IBM S/360-compatible processors. The z13 microarchitecture is a 5 GHz six-core processor implemented using 4.0 billion transistors in 22 nm [120]. The S/360 Model 91 – one of the original microarchitectures, released in 1966 – was only 120 thousand gates implemented using emitter-coupled logic circuits [35]. The Model 91's central CPU was composed of four "frames", each a 6x6x1-foot cabinet holding 20 motherboards.

4One solution to binary incompatibility was the use of software simulators. However, this came with a performance overhead. Some early machines came with hardware support to speed up emulation of predecessor architectures [112].


Figure 2.4: A processor takes a stream of instructions and performs computations as specified by each instruction. The hardware provides the illusion to the programmer that instructions are executed one at a time in the order specified by the program.


Figure 2.5: A common general-purpose processor architecture is the register-register load/store architecture. Working-set data is stored in a small number of registers. Instructions can perform logical, arithmetic, and control operations that operate on the values in the register file. Separate load and store instructions move data into and out of the data memory.


Figure 2.6: An ARM Cortex-A15 pipeline (figure adapted from [41]). The A15 illustrates the complexity of modern superscalar, out-of-order processors. Although the A15 targets mobile applications, its design is indicative of modern general-purpose application processors. See Section 3.2.1 for more details on the different stages of an out-of-order pipeline.

2.5 Out-of-order Processor Microarchitectures

A significant bottleneck in processor performance arises from a long-latency instruction blocking the queue of instructions behind it. For example, a divide instruction may take many dozens of cycles to execute and produce a result. In a fully in-order processor pipeline, no newer instructions may be issued to the execute stage until the divide instruction finishes. A more common scenario is a load instruction that misses in the data cache: it may take many dozens or even hundreds of cycles before the load data returns from main memory and the load instruction can be retired.

A solution to this "head-of-queue" blocking is to allow newer instructions to proceed if they are independent of the blocked instructions. For example, an add instruction whose operands do not depend on the result of the long-latency divide instruction may proceed. The add instruction, taking only a few cycles to execute, will finish first and write its result out-of-order with respect to the older divide instruction.

An even higher-performance solution is to allow any instruction to be issued so long as its operands are available (i.e., the instruction does not depend on a still-busy instruction). Instructions that must wait on previous instructions "go to sleep" in an Issue Window. Newer instructions that are not dependent on any still-busy instructions may execute immediately, "out-of-order" with respect to the older, sleeping instructions. Sleeping instructions are woken up as their operands become available. In this manner, instructions are issued out of the Issue Window in the order that their dependences are resolved (a form of limited dataflow) and not in the order specified by the program.

Allowing instructions to be issued to the execution units out-of-order provides a significant gain in performance. The main insight is the realization that, although a program stream is inherently serial, there is actually a significant amount of instruction-level parallelism within it. Thus, many independent instructions can be executed out-of-order with respect to one another while still maintaining the illusion of respecting the in-order instruction stream.
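To make the wakeup and select behavior concrete, below is a minimal behavioral sketch in plain Scala (it is not BOOM's Chisel RTL, and the names MicroOp, dispatch, wakeup, and select are purely illustrative): instructions sleep in the window holding the tags of their unresolved source operands, completing instructions broadcast their destination tags, and any fully ready instruction may then be selected for issue regardless of program order.

```scala
// Behavioral sketch of limited-dataflow issue (illustrative, not BOOM's RTL).
case class MicroOp(robTag: Int, srcTags: Set[Int])

class IssueWindowModel {
  private var window    = List.empty[MicroOp]                  // sleeping instructions
  private val readyTags = scala.collection.mutable.Set[Int]()  // operands produced so far

  def dispatch(uop: MicroOp): Unit = window = uop :: window

  // Wakeup: a completing instruction broadcasts its destination tag.
  def wakeup(tag: Int): Unit = readyTags += tag

  // Select: issue any one instruction whose operands are all ready,
  // regardless of the order in which it was dispatched.
  def select(): Option[MicroOp] = {
    val pick = window.find(u => u.srcTags.forall(readyTags.contains))
    pick.foreach(u => window = window.filterNot(_.robTag == u.robTag))
    pick
  }
}
```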

2.5.1 The Data-in-ROB Design (Implicit Renaming)

There are a number of ways to implement a processor with out-of-order issue. One microarchitecture is called "implicit register renaming", or "data-in-ROB". After each instruction has been fetched and decoded, it is placed in a "reorder buffer" (ROB), where all in-flight instructions are tracked in order. Each instruction then marks its destination register as "busy" in the scoreboard, recording its ROB entry tag to denote which instruction is responsible for the busy register (the "ROB tag").

During the decode stage, each instruction reads the scoreboard to see if its operands are busy. If all of the instruction's operands are ready, the instruction reads its operands out of the Architectural Register File (ARF), and the instruction and its operands are placed in the ROB. The instruction is then ready to be issued to the execution units at any time, even before older instructions have been issued.


Figure 2.7: A PRF design (left) and a data-in-ROB design (right).

If an operand is marked as busy in the scoreboard, the instruction is unable to read its operands from the ARF. Instead, it reads from the scoreboard the ROB tags of the instructions that will generate the data values it depends on. The instruction and its operands' tags are placed in the Issue Window. The instruction then listens to the data bus: when other instructions finish executing, the functional units broadcast the resulting values and the corresponding ROB tags across the data bus to the ROB and to any dependent instructions waiting in the Issue Window. Once the instruction has all of its operands, it may request to be issued to the execution units.

In this design, the Architectural Register File (ARF) only holds the committed register state while the ROB holds the speculative write-back data. On commit, the ROB transfers the speculative data to the ARF. The Pentium 4 is an example of an implicit renaming design.

2.5.2 The Physical Register File Design (Explicit Renaming)

Another style of out-of-order execution, which we will label the Physical Register File (PRF) design, uses "explicit register renaming". A physical register file, containing many more registers than the ISA dictates, holds both the committed architectural register state and the speculative register state. This is in contrast to the data-in-ROB design, in which committed architectural state resides in the ARF and speculative data resides in the ROB.

During the rename stage, the renamer maps the ISA (or logical) register specifiers of each instruction to physical register specifiers. A map table tracks the mappings from logical registers to physical registers. When an instruction needs to write a register, it is allocated a new physical register. If the instruction is misspeculated, the previous mapping is restored. As instructions execute, they write back their values immediately to the PRF. Newer instructions can then read their operands out of the PRF. If an exception occurs, the rename map table contains the information needed to recover the committed state.

An advantage of this style is that all data is written into and read from a single PRF – there is no broadcasting of all write data on a common data bus to all issue windows. The MIPS R10k [124], the Alpha 21264 [55], the Intel Sandy Bridge, and the ARM Cortex-A15 cores are all examples of explicit-renaming out-of-order cores.
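The map-table mechanics described above can be sketched behaviorally as follows (plain Scala with invented names, not BOOM's Chisel RTL; a real rename stage also stalls when the free list is empty and returns physical registers to the free list at commit):

```scala
// Behavioral sketch of explicit register renaming (illustrative, not BOOM's RTL).
class RenameMapModel(numLogical: Int, numPhysical: Int) {
  private var mapTable  = Array.tabulate(numLogical)(identity)            // logical -> physical
  private var freeList  = (numLogical until numPhysical).toList           // unallocated physical regs
  private val snapshots = scala.collection.mutable.Map[Int, Array[Int]]() // per-branch copies

  // Rename one instruction: look up its sources, then allocate a fresh destination register.
  def rename(srcs: Seq[Int], dst: Int): (Seq[Int], Int) = {
    val physSrcs = srcs.map(s => mapTable(s))
    val physDst  = freeList.head            // a real design stalls if the free list is empty
    freeList = freeList.tail
    mapTable(dst) = physDst
    (physSrcs, physDst)
  }

  // Copy the map table when a branch is renamed...
  def snapshot(branchTag: Int): Unit = snapshots(branchTag) = mapTable.clone()

  // ...and restore it instantly if that branch turns out to be mispredicted.
  def rollback(branchTag: Int): Unit = mapTable = snapshots(branchTag).clone()
}
```

The same snapshot-and-restore idea reappears in BOOM's rename stage (Section 3.2.4), where a copy of the map table is taken as each branch passes through Rename.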

2.5.3 The Differences Between the Two Styles

Figure 2.7 and Table 2.5 show a side-by-side comparison of the two styles of out-of-order processors. Each style makes different tradeoffs.

The data-in-ROB design uses an architectural register file (ARF), storing only the committed data values for each ISA register. This is fewer registers than the PRF design, whose register file must store both the committed architectural state and the speculative architectural state. However, the ARF must provide enough register file read ports to supply operands to instructions as they are decoded.

Table 2.5: The differences between the data-in-ROB design and the physical register file design.

                  Data-in-ROB Design             Physical Register File Design
committed data    Architectural Register File    Physical Register File
speculative data  ROB                            Physical Register File
operand read      Decode Stage                   Register File Read Stage (after Issue Stage)
write-back        writing data into ROB          writing data into PRF
wakeup            broadcasting data              broadcasting register pointers
                  to Issue Window                to Issue Window
committing        writing data into ARF          updating rename table

For a 4-wide decode stage, a machine that supports three-operand instructions must provide up to 12 read ports. A worst-case scenario would be heavily computational code that features a significant density of fused multiply-add (FMA) instructions. One solution is to limit the maximum number of reads per cycle, but this directly limits instruction bandwidth and would hurt the performance of densely packed arithmetic code. Meanwhile, the write port count is tied to the machine's commit width.

The PRF design does not suffer the same read port limitation. Instead, it can match the read port requirements to the number of functional units supported. It can even rely on dynamic read port scheduling and speculate that most instructions do not require more than one operand [110]. A hypothetical 4-issue processor with two integer units, one floating-point FMA unit, and one load/store unit (and assuming a unified integer/floating-point register file) would need at most 9 read ports to fully satisfy all of its functional units.5 However, by recognizing that many instructions only use one operand and by using dynamic read port scheduling, this hypothetical processor could use even fewer read ports. The number of write ports used in a PRF design is proportional to the number of functional units.
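The read-port counts above follow directly from the two organizations; as a small worked check using the figures in the text (and assuming the load/store unit reads an address register plus store data, which is an illustrative breakdown rather than one given in the text):

$$\text{ARF read ports} \ge w_{\text{decode}} \times \text{operands}_{\max} = 4 \times 3 = 12$$

$$\text{PRF read ports} \ge \sum_{\text{FU}} \text{operands}_{\text{FU}} = \underbrace{2 \times 2}_{\text{integer}} + \underbrace{3}_{\text{FMA}} + \underbrace{2}_{\text{load/store}} = 9$$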

2.6 History of Out-of-order Processors

The CDC 6600 (1964) was the first machine to use an out-of-order issue pipeline. In the central processor, the core maintained a scoreboard of busy registers and functional units. If an instruction did not depend on a busy register and its functional unit was available, the instruction could be issued out-of-order with respect to older, stalled instructions. The processor stalled an instruction:

1. if its operand was busy (a read-after-write (RAW) hazard, or true dependence).

2. if no functional unit was available.

3. if the instruction would overwrite a busy register (a write-after-write (WAW) hazard, or output dependence).

As the CDC 6600 had ten different functional units and 24 registers, it was able to exploit a modest amount of instruction-level parallelism by executing up to ten independent instructions simultaneously [108].

In 1966, the IBM System/360 Model 91 introduced the data-in-ROB design for its floating-point unit. The data-in-ROB design allowed for a limited amount of "implicit register renaming"; the architecture only provided four floating-point registers, but speculative results could be stored in the ROB and forwarded to newer instructions. Register renaming allowed the Model 91 to execute past write-after-read (WAR) and WAW hazards. These hazards are a common obstacle to executing multiple loop iterations simultaneously, as the same register specifiers are encountered in each loop iteration.

5For a processor supporting an ISA like RISC-V, only the floating point unit has 3-read instructions.

However, out-of-order execution came with a number of challenges that increased complexity and limited potential performance. One of the early challenges was speculating past branches.6 Without the ability to accurately speculate branches, the processors were unable to fetch enough instructions to keep the pipeline fully occupied. Supporting restartable exceptions that maintained the programmer's illusion of in-order program execution was another source of complexity. Therefore, out-of-order designs receded until 1990, when IBM released the first out-of-order microprocessor, the POWER1. Its floating-point unit utilized register renaming to help overlap floating-point arithmetic operations from the current loop iteration with floating-point memory operations from the next loop iteration [42].

Since the POWER1, out-of-order processors have become the popular microarchitectural choice for general-purpose computing needs. Single-threaded performance of general-purpose applications is dominated by out-of-order processors [98]. In the datacenter (which services many parallel requests), out-of-order processors are chosen over more energy-efficient in-order cores due to their ability to better meet service-level agreement guarantees [45]. Cell phones have also adopted out-of-order processors; the iPhone 4s, released in 2011, uses an 800 MHz ARM Cortex-A9.

2.7 The Value of Out-of-order Execution

Although there are a number of different styles of architectures and microarchitectures, out-of-order processors are the popular choice for providing the best performance for general-purpose applications.

The first advantage of out-of-order (OOO) microarchitectures, coupled with aggressive speculation mechanisms, is their ability to construct superior instruction schedules. OOO performance is largely constrained only by the true dependences between instructions and by functional unit availability. By leveraging sophisticated speculation techniques, OOO cores can speculate past control hazards and memory-ordering hazards to develop an aggressive schedule of instructions that is unavailable to in-order cores.

As one example, OOO cores rely on register renaming to break the anti- and output-dependences between loop iterations. This allows OOO cores to dynamically unroll loops and execute multiple loop iterations in parallel. In-order (IO) cores are largely unable to overlap successive loop iterations, as they must wait for the registers used in one iteration to free up before they can be reused in the next iteration. IO cores could adopt register renaming to provide each loop iteration with a new pool of physical registers, but once the complexity and hardware costs of register renaming have been paid, the marginal increase in complexity of adding out-of-order issue is relatively small. Another possible strategy for IO cores is to rely on the compiler to statically unroll the loop. However, this comes with its own costs, including increased code size, and can exhibit poor performance portability across different IO designs.

6The S/360 Model 91 speculated the branch was taken if the branch target was one of the last 8 instructions executed.

Another example of superior code scheduling is the OOO core's ability to handle unbiased branches. OOO cores can speculate past branches and effectively "hoist" instructions out of a basic block and begin execution before the branch is resolved. Although an optimizing compiler can mimic the same behavior, it must statically choose which path through the program to optimize for. However, branches are typically far more predictable than they are biased. For example, a completely unbiased branch may alternate between taken and not-taken paths in an easily predicted manner; an OOO design can perfectly schedule this code sequence, while an optimizing compiler attempting to statically schedule code for an IO core will be utterly helpless.

The second advantage of OOO microarchitectures is that the same instructions in a program can be executed using a different schedule from one iteration to the next, in response to dynamic events such as cache misses or branch mispredictions. This advantage is known as dynamism, and it provides OOO designs with the flexibility to maintain a full pipeline despite suboptimal static code schedules or unpredictable, long-latency events such as cache misses or branch mispredictions. Meanwhile, IO designs are more susceptible to performance cliffs created by suboptimal instruction scheduling. One example of dynamism at play is the better branch recovery exhibited by OOO designs: OOO designs can overlap the branch recovery with the critical path of instructions from before the branch. An IO design is at the mercy of the critical-path latency assumptions built into the code schedule, but an OOO design executes strictly based on instruction readiness [64].

By dynamically finding and exploiting instruction-level parallelism within a program — executing instructions in the order that the data is available, rather than sticking to an overly strict, sequential ordering of instructions — OOO designs relieve compilers and programmers of the pressure to provide good code schedules. However, some of these advantages are not entirely restricted to OOO cores. A number of attempts have been made to provide at least some of the advantages of OOO microarchitectures to simpler, more energy-efficient designs. However, despite efforts to improve IO performance — adding run-ahead execution, data prefetching, and register checkpointing, among others — IO cores have still been unable to fully match the performance of OOO microarchitectures [64].

2.8 Conclusion

Due to the trends in technology, single-threaded performance is becoming more difficult to improve, and power has become a first-order design constraint. But despite the technology trends and the slowing of single-threaded performance gains, workloads continue to evolve. Unfortunately, software simulators struggle to evaluate long-running and complex workloads and are often unable to provide the fidelity to confidently evaluate ideas that provide important, but small, gains. Single-thread performance is still important, and it remains a bottleneck even for large, multi-threaded systems. Despite their complexity, out-of-order processors are the best microarchitecture for dynamically exploiting instruction-level parallelism and providing excellent single-thread performance. For these reasons, this thesis will explore implementing an industry-competitive, high-performance out-of-order processor.

Chapter 3

BOOM Overview

The goal of this chapter is to provide an overview of the design and implementation of the Berkeley Out-of-Order Machine (BOOM). Out-of-order processors are the microarchitecture du jour, used to provide high-performance general-purpose computing across a wide spectrum of platforms, from mobile processors up to the main compute nodes in warehouse-scale computers. The goal of BOOM is to provide a synthesizable, prototypical baseline processor for research, education, and industry.

As part of that mission, we fabricated an SRAM resiliency chip using BOOM as its processing core. Although simpler cores have previously been used for such SRAM resiliency studies [54, 127], BOOM may provide a more accurate analog to modern processors. In particular, the out-of-order scheduling of BOOM allows for better tolerance of the longer average memory latencies brought about by the dynamically reduced cache capacity when running in a lower-power mode.

Section 3.2 provides an overview of the microarchitecture of BOOM. Section 3.3 discusses the design methodology and how using a new hardware language called Chisel provided us with a significant productivity win. Section 3.4 touches on the FPGA simulation methodology that we were able to leverage due to the synthesizability of BOOM. Section 3.5 goes into greater detail on the VLSI implementation effort to tape out a processor using BOOM with a two-graduate-student team. Although BOOM is a parameterizable generator, Section 3.6 discusses the specific configuration we chose for a tape-out of BOOM, which is representative of the parameters we focused on optimizing. We then conclude our overview of BOOM with Section 3.7, which briefly compares BOOM's performance to a selection of contemporary industry processors.

3.1 The RISC-V Instruction Set Architecture

BOOM implements the RV64G variant of the RISC-V Instruction Set Architecture (ISA). RV64G denotes the 64-bit address variant of the "general-purpose" subset. This subset includes the standard MAFD extensions, which cover integer multiply/divide, atomic memory operations, load-reserve/store-conditional, and single- and double-precision IEEE 754-2008 floating point. BOOM also implements the Privileged ISA specification v1.9 and the External Debug Support specification v0.11.

RISC-V is a free and open ISA developed at UC Berkeley to provide a clean, simple, and high-performance architecture for use in research, education, and industry. Although inspired by previous MIPS-based research derivatives, RISC-V is a "clean break" implementation. RISC-V is now managed by the RISC-V Foundation, a non-profit trade group composed of more than sixty member companies. For many companies, RISC-V provides an opportunity to avoid vendor lock-in to any particular processor company. It also provides companies the freedom to build their own RISC-V processors without requiring licensing agreements from the controlling company of the ISA [9]. For researchers, RISC-V's simplicity and lack of historical baggage allow them to focus on their innovative ideas without distraction from a complex and bloated architecture.

RISC-V's G subset provides the following features that make it easy to target with high-performance designs:

• Accrued floating-point exception flags: The FP status register does not need to be renamed, nor can FP instructions throw exceptions themselves.
• No integer side-effects: All integer ALU operations exhibit no side-effects, save the writing of the destination register. This prevents the need to rename additional condition state.
• No cmov or predication: Although predication can lower the branch predictor complexity of small designs, it greatly complicates OOO pipelines, including requiring the addition of a third read port for integer operations.
• No implicit register specifiers: Even JAL (jump-and-link) requires specifying an explicit destination register rd. This simplifies the rename logic: otherwise, the rename stage would either need to decode the instruction before accessing the rename tables, or add more ports to keep instruction decode off the critical path.
• Register specifiers rs1, rs2, rs3, and rd are always in the same place: This regularity allows decode and rename to proceed largely in parallel.
• Relaxed memory model: This decision greatly simplifies the Load/Store Unit, which does not need to have loads snoop other loads, nor does coherence traffic need to snoop the LSU, as required by stricter memory models such as total store order or sequential consistency.1

1A RISC-V Foundation Working Group is analyzing proposals to strengthen the relaxed memory model that will likely require ordering loads to the same address.

3.2 The BOOM Microarchitecture

3.2.1 The BOOM Pipeline

Conceptually, BOOM is broken up into 10 stages: Fetch, Decode, Register Rename, Dispatch, Issue, Register Read, Execute, Memory, Writeback, and Commit. The Commit stage occurs asynchronously with respect to the rest of the pipeline. Figure 3.1 shows a simplified high-level outline of the BOOM pipeline.

Fetch: Instructions are fetched from the Instruction Memory and pushed into a FIFO queue, known as the fetch buffer.
Decode: Decode pulls instructions out of the fetch buffer and generates the appropriate "micro-ops" to place into the pipeline.
Rename: The ISA, or "logical", register specifiers are renamed into "physical" register specifiers.
Dispatch: The micro-op is dispatched, or written, into the Issue Window.
Issue: Micro-ops sitting in the Issue Window wait until all of their operands are ready and are then issued. This stage is the beginning of the out-of-order part of the pipeline.
RF Read: Issued micro-ops first read their operands from the unified physical register file (or from the bypass network)...
Execute: ... and then enter the Execute stage, where the functional units reside. Issued memory operations perform their address calculations in the Execute stage and then store the calculated addresses in the Load/Store Unit, which resides in the Memory stage.
Memory: The Load/Store Unit consists of three queues: a Load Address Queue (LAQ), a Store Address Queue (SAQ), and a Store Data Queue (SDQ). Loads are fired to memory when their address is present in the LAQ. Stores are fired to memory at Commit time (stores cannot be committed until both their address and data have been placed in the SAQ and SDQ).
Writeback: ALU operation and load results are written back to the physical register file.
Commit: The Reorder Buffer (ROB) tracks the status of each instruction in the pipeline. When the head of the ROB is not busy, the ROB commits the instruction. For stores, the ROB signals to the store at the head of the Store Queue that it can now write its data to memory.

Some of these stages are combined when possible, while other stages are further sub-divided to achieve reasonable clock frequencies.


Figure 3.1: A conceptual outline of the BOOM pipeline.

3.2.2 Instruction Fetch and Branch Prediction

The purpose of the frontend is to fetch instructions for execution in the backend. Processor performance is best when the frontend provides an uninterrupted stream of instructions. To maintain that stream, the frontend must utilize branch prediction techniques to predict which path it believes the instruction stream will take, long before the branch's direction can be properly resolved. Any mispredictions in the frontend will not be discovered until the branch (or jump-register) instruction is executed later in the backend. In the event of a misprediction, all instructions after the branch must be flushed from the processor and the frontend must be restarted on the correct instruction path.

The frontend relies on a number of different branch prediction techniques to predict the instruction stream, each trading off accuracy, area, critical path, and pipeline penalty when making a prediction.

3.2.2.1 Branch Target Buffer (BTB)

The BTB maintains a set of tables mapping from instruction addresses (PCs) to branch targets. When a lookup is performed, the look-up address indexes into the BTB and is checked for any tag matches. If there is a tag hit, the BTB will make a prediction and may redirect the frontend based on its predicted target address. Some hysteresis bits are used to help guide the taken/not-taken decision of the BTB in the case of a tag hit.
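The lookup and update behavior just described can be sketched behaviorally as follows (plain Scala with invented field names and a direct-mapped organization chosen for brevity; this is not BOOM's Chisel RTL, which uses set-indexed tables and partial tags):

```scala
// Behavioral sketch of a branch target buffer (illustrative, not BOOM's RTL).
case class BtbEntry(tag: Long, target: Long, hysteresis: Int)

class BtbModel(numEntries: Int) {
  private val entries = Array.fill[Option[BtbEntry]](numEntries)(None)

  private def index(pc: Long): Int  = ((pc >>> 2) % numEntries).toInt
  private def tagOf(pc: Long): Long = pc >>> 2      // a real BTB stores only partial tags

  // Returns Some(target) if the BTB predicts a taken branch at this PC.
  def lookup(pc: Long): Option[Long] =
    entries(index(pc))
      .filter(e => e.tag == tagOf(pc) && e.hysteresis >= 2)
      .map(_.target)

  // Update on a resolved branch: install the target and nudge the hysteresis counter.
  def update(pc: Long, target: Long, taken: Boolean): Unit = {
    val idx = index(pc)
    val h   = entries(idx).filter(_.tag == tagOf(pc)).map(_.hysteresis).getOrElse(1)
    val hys = if (taken) math.min(h + 1, 3) else math.max(h - 1, 0)
    entries(idx) = Some(BtbEntry(tagOf(pc), target, hys))
  }
}
```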

3.2.2.2 Return Address Stack (RAS)

The RAS predicts function returns. Jump-register instructions are otherwise quite difficult to predict, as their target depends on a register value. However, functions are typically entered using a Function Call instruction at address A and return from the function using a Return instruction to address A+1.2 The RAS can detect the call, compute and then store the expected return address, and then later provide that predicted target when the Return is encountered. To support multiple nested function calls, the underlying RAS storage structure is a stack.

3.2.2.3 Conditional Branch Predictor (BPD)

BOOM implements a global history predictor to predict the direction of conditional branches. Global history predictors work by tracking the outcome of the last N branches in the program (providing a "global" view) and hashing this history with the look-up address to compute a look-up index into the BPD prediction tables. The BPD only makes taken/not-taken predictions; it therefore relies on some other agent to provide information on which instructions are branches and what their targets are.

2Actually, it will be A+4 as the size of the call instruction to jump over is 4 bytes in RV64G.

The BPD can either use the BTB for this information, or it can wait and decode the instructions itself once they have been fetched from the instruction cache. Because the BPD does not store the expensive branch targets, it can be much denser and thus make more accurate predictions on branch directions than the BTB; whereas each BTB entry may be 60 to 128 bits, the BPD may use as few as one or two bits per branch.
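A gshare-style global-history predictor of the kind described above can be sketched behaviorally as follows (plain Scala; the XOR hash, table size, and counter initialization are illustrative choices, not a description of BOOM's actual predictor):

```scala
// Behavioral sketch of a global-history (gshare-style) direction predictor.
class GlobalHistoryPredictorModel(historyBits: Int, tableSizeLog2: Int) {
  private var ghistory = 0L                                 // outcomes of the last N branches
  private val counters = Array.fill(1 << tableSizeLog2)(1)  // 2-bit counters, start weakly not-taken
  private val histMask = (1L << historyBits) - 1

  // Hash the global history with the fetch PC to form the table index.
  private def index(pc: Long): Int =
    (((pc >>> 2) ^ ghistory) & ((1L << tableSizeLog2) - 1)).toInt

  def predict(pc: Long): Boolean = counters(index(pc)) >= 2  // taken if counter is in the upper half

  def update(pc: Long, taken: Boolean): Unit = {
    val idx = index(pc)
    counters(idx) = if (taken) math.min(counters(idx) + 1, 3) else math.max(counters(idx) - 1, 0)
    ghistory = ((ghistory << 1) | (if (taken) 1L else 0L)) & histMask  // shift in the new outcome
  }
}
```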

3.2.3 The Decode Stage and Resource Allocation

The decode stage takes instructions from the fetch buffer, decodes them, and allocates the resources required by each instruction (for example, loads are allocated an entry in the Load Address Queue). The decode stage stalls as needed if not all resources are available.

3.2.4 The Register Rename Stage

Renaming is a technique that renames the ISA (or logical) register specifiers in an instruction by mapping them to a new space of physical registers. The goal of register renaming is to break the output- (write-after-write) and anti-dependences (write-after-read) between instructions, leaving only the true dependences (read-after-write). This technique is critical to allowing out-of-order processors to speculatively execute far ahead of the commit point. For example, output- and anti-dependence hazards are encountered when executing loops. With renaming, processors can "unroll" the loop in hardware and execute multiple loop iterations simultaneously, bounded only by the number of true dependences across loop iterations and the issue width of the processor.

BOOM is a physical register file out-of-order core design. A physical register file, containing many more registers than the ISA dictates, holds both the committed architectural register state and the speculative register state. The rename map tables contain the information needed to recover the committed state. As instructions are renamed, their register specifiers are explicitly updated to point to physical registers located in the physical register file.

When a branch passes through the Rename stage, a copy of the map table is made. On a branch mispredict, the map table can be reset instantly from the mispredicting branch's copy of the map table.

3.2.5 The Reorder Buffer (ROB) and Exception Handling

The ROB tracks the state of all in-flight instructions in the pipeline. The role of the ROB is to provide the illusion to the programmer that their program executes in order. If a misspeculation, exception, or interrupt occurs, the ROB is tasked with helping reset the microarchitectural state.

After instructions are decoded and renamed, they are dispatched to the ROB and the issue window and marked busy. As instructions finish execution, they inform the ROB and are marked not-busy.


Figure 3.2: The instruction fetch frontend to BOOM. The BTB predictions are performed in a single cycle, while the more complex conditional branch predictor takes three cycles. As the BTB predictions are made using partial tags, a checker module is required to verify that the predicted target is valid. If no prediction is asserted, the instruction cache will continue to fetch the next 8 bytes (for a two-instruction-wide frontend).

Once the head of the ROB is no longer busy, the instruction is committed, and its architectural state becomes visible. If an exception occurs and the excepting instruction is at the head of the ROB, the pipeline is flushed and no architectural changes that occurred after the excepting instruction are made visible. The ROB then redirects the frontend to begin fetching instructions from the appropriate exception handler. During an exception, the rename map tables are restored by rolling back the ROB over a period of many cycles.

3.2.6 The Issue Unit

The issue window holds dispatched micro-ops that have not yet executed. When all of the operands for a micro-op are ready, the issue slot sets its "request" bit high. The issue select logic then chooses to issue a slot that is asserting its request signal. Once a micro-op is issued, it is removed from the issue window to make room for more dispatched instructions. BOOM uses three separate issue windows, partitioned by micro-op type (integer, floating point, memory).

3.2.7 The Register File and Bypass Network

BOOM uses a unified, physical register file design: the register file holds both the committed and speculative state. Separate register files are used for storing integer and floating-point register values. The register file statically provisions all of the register read ports required to satisfy all issued instructions. ALU operations can be issued back-to-back by having the write-back values forwarded through the bypass network. Bypassing occurs at the end of the Register Read stage.

3.2.8 The Execution Pipeline

The Execution Pipeline contains many different functional units to handle different operation types: arithmetic operations, divides, address calculations, and floating-point manipulations. Most of these units are fully pipelined, like the fused multiply-add unit, while others, like the divider, are iterative. The issue select ports on the issue windows must intelligently schedule micro-ops onto the functional units while honoring their occupancy and hazard requirements. Chapter 5 describes in greater detail how the Execution Pipeline is described in a generalizable manner to allow for a good deal of parameterization and reduced developer burden.

3.2.9 The Load/Store Unit (LSU)

The Load/Store Unit is responsible for deciding when to fire memory operations to the memory system. There are three queues: the Load Address Queue (LAQ), the Store Address Queue (SAQ), and the Store Data Queue (SDQ).

Once a store instruction is committed, the corresponding entry in the Store Queue is marked as committed. The store is then free to be fired to the memory system at its convenience. Stores are fired to memory in program order.

Loads are optimistically fired to memory on arrival at the LSU. Simultaneously, the load instruction compares its address against all of the store addresses that it depends on. If there is a match, the memory request in the cache is killed. If the corresponding store data is present, then the store data is forwarded to the load and the load marks itself as having succeeded. If the store data is not present, then the load goes to sleep; loads that have been put to sleep are retried at a later time. If a load incorrectly executes ahead of an older store to the same address, it will read stale data; in this scenario, the entire pipeline must be flushed and the load refetched once it reaches the commit point. A behavioral sketch of this store-to-load check follows the list below.

BOOM implements the release consistency memory model [39] described by the RISC-V User-Level ISA v2.2, in which stores and loads may be freely re-ordered but synchronization primitives with release and acquire semantics receive stronger ordering. More explicitly, BOOM exhibits the following behavior:

• Write→Read constraint is relaxed (newer loads may execute before older stores).
• Read→Read constraint is relaxed (loads to the same address may be reordered).
• A thread can read its own writes early.

Naturally, a newer load may not execute ahead of an older store to the same address.
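The check a load performs against older stores can be sketched behaviorally as follows (plain Scala with invented names, not BOOM's Chisel RTL; a real LSU matches at byte granularity and must handle partially overlapping accesses):

```scala
// Behavioral sketch of the store-to-load check performed when a load fires.
object LsuModel {
  case class StoreEntry(age: Long, addr: Long, data: Option[Long])

  sealed trait LoadOutcome
  case object UseCache            extends LoadOutcome // no older store to the same address
  case class  Forward(data: Long) extends LoadOutcome // bypass data directly from the store queue
  case object Sleep               extends LoadOutcome // matching store's data is not yet available

  def checkLoad(loadAge: Long, loadAddr: Long, storeQueue: Seq[StoreEntry]): LoadOutcome = {
    // Consider only stores older than the load; the youngest such store wins.
    val olderMatches = storeQueue.filter(s => s.age < loadAge && s.addr == loadAddr)
    olderMatches.sortBy(s => -s.age).headOption match {
      case None                                => UseCache
      case Some(StoreEntry(_, _, Some(bytes))) => Forward(bytes)
      case Some(_)                             => Sleep
    }
  }
}
```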

3.2.10 The Memory System

BOOM uses a non-blocking data cache. The cache has a three-stage pipeline and can accept a new request every cycle:

• S0: Send request address
• S1: Access SRAM
• S2: Perform way-select and format response data

The data cache is also cache-coherent, which is helpful even in uniprocessor configurations, as it allows a host machine or debugger to read BOOM's memory.

3.3 Design Methodology

3.3.1 RTL Design

BOOM is implemented at the register-transfer level (RTL) in the Chisel hardware construction language (see Section 3.3.3). Chisel generates a Verilog representation of BOOM which can then be fed into standard VLSI tools for synthesis, place, and route. We repeatedly used results gathered from synthesis, and less often data from place-and-route, to drive design decisions and to guide small changes, which were then implemented in the Chisel RTL. This continuous feedback allowed us to quickly adapt to issues that showed up in the VLSI reports. We did not "freeze" the RTL before the tape-out, but instead had multiple parallel synthesis and place-and-route jobs running with fresh changes to the RTL. The use of Chisel does not preclude pre-placing and floorplanning of any blocks.

We also relied on register retiming within the VLSI synthesis tools to facilitate a more flexible Chisel RTL description. Instead of having to painfully place registers at the optimal point in the design, the synthesis tools could be trusted to move registers to wherever they provided the best quality of result. Synthesis took roughly two to three hours, while place-and-route took roughly 14 hours for heavily congested designs, dropping to 9 hours once we had improved BOOM's critical path and congestion problems.

3.3.2 RTL Verification

Verification of BOOM's RTL was performed using full-system testing by executing programs on cycle-exact simulators of BOOM. Verilator [117], a fast and open-source Verilog simulator, was used to transform the Chisel-generated Verilog into a cycle-exact simulator. We also used the proprietary Synopsys Verilog Compiler Simulator (VCS) for RTL and gate-level simulations. Although VCS compiles noticeably faster than Verilator when full waveform debug mode is enabled (4 minutes versus 21 minutes), VCS simulates BOOM at roughly 1/20 the speed of Verilator. When using the parameters shown in Table 3.4, Verilator simulates BOOM at 9,980 Hz and VCS simulates BOOM at 490 Hz (as measured running Hello World). Table 3.1 shows a more complete breakdown of the compile and simulation speeds of BOOM. All development work was done on a 12-core Intel Xeon E5-2643 v2 running at 3.50 GHz.

The basic set of functional tests executed on BOOM is the riscv-tests suite [81], which contains 232 hand-written assembly programs that test instruction functionality with limited white-box testing, plus 11 bare-metal C benchmarks (see Table 3.2). These test simulations take roughly five minutes to run using Verilator across 12 Intel Xeon cores. We also run the CoreMark benchmark and Hello World on the RISC-V proxy kernel (riscv-pk) to test running user-level programs on top of a privileged architecture. To provide more confidence of correctness (though with no debug visibility), we used an FPGA instantiation of BOOM running at 50 MHz to test the ability to boot Linux and run applications for minutes to hours of target time.

We use riscv-torture [82] to perform adversarial torture testing. The riscv-torture program generates a random sequence of assembly instructions, pulled from a library of assembly snippets and then interleaved. Although torture was able to find a number of bugs early in BOOM's development, it has some significant blind spots that made it less useful for finding more difficult-to-exercise bugs. For one, it does not exercise any privileged architecture.

Table 3.1: The compile and simulation times for the Verilator and VCS RTL simulators when built using a 12-core Intel Xeon E5-2643 v2 (3.5 GHz) with parallelized make enabled. The Chisel RTL must first be compiled to the FIRRTL intermediate representation language, and then from FIRRTL the RTL can be lowered into Verilog. Verilator and VCS then consume the Verilog RTL description to create a cycle-accurate simulator of BOOM.

Compilation Step             Compile Time  Simulation Speed
Chisel → FIRRTL              0 m 12 s      n/a
FIRRTL → Verilog             1 m 4 s       n/a
Verilog → Verilator          3 m 3 s       9,980 Hz
Verilog → Verilator (debug)  20 m 13 s     6,310 Hz
Verilog → VCS                1 m 59 s      490 Hz
Verilog → VCS (debug)        1 m 59 s      370 Hz

Second, all branches are only executed once; torture does not generate loops. Many of BOOM's more challenging bugs rely on the building up and killing of significant speculative state, which is most often exercised when running multiple loop iterations simultaneously and then mispredicting the loop exit.

To find these more challenging bugs that slip past riscv-tests and riscv-torture, we rely on two additional techniques. The first is the use of assertions – code statements that must always be true. Written in Chisel, these assertions stop simulation with a failure code if a bad processor state has been encountered such that the assertion is no longer true. An example is found in BOOM's re-order buffer (ROB): when an instruction finishes executing and informs the ROB that it is no longer busy, an assertion checks that the unbusied ROB entry contains a valid instruction and that the ROB entry was previously busy. If the instruction is trying to unbusy an invalid entry or an already not-busy entry, then an invalid state has been encountered and the simulation is stopped with a failure code. One downside of these assertions is that they are unsynthesizable constructs and thus are only available in simulation.

The second technique for finding bugs, which has demonstrated significant success, is comparing BOOM's instruction commit log against spike's instruction commit log. These commit logs contain a sequence of the committed instructions' PC addresses, instruction encodings, register writeback addresses, and register writeback data. However, there are some significant challenges with this type of co-simulation debugging. The first is that there are valid differences between spike and BOOM that must be addressed: store-conditionals, timer interrupts, and micro-architectural counters. The second challenge is that spike and BOOM must present the exact same platform model; for example, differences in memory sizes or available drivers could cause deviations in the resulting simulation. Like assertions, the commit-log capability is only available in simulation, which reduces the types of workloads that can be tested. However, ongoing research is exploring making both the assertions and the commit-log tracing synthesizable constructs for FPGA simulations, which would allow both to be used in a high-speed environment for verifying longer-running programs at roughly 50 to 100 MHz [8].
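The ROB assertion described above might look roughly like the following in Chisel (a hypothetical sketch with invented signal names and a simplified busy table, not BOOM's actual source):

```scala
import chisel3._
import chisel3.util._

// Hypothetical sketch: a tiny ROB-like busy table that uses a simulation-only
// assertion to check that a writeback ("unbusy") targets a valid, still-busy entry.
class RobBusyTable(numEntries: Int) extends Module {
  val io = IO(new Bundle {
    val dispatch_valid = Input(Bool())
    val dispatch_idx   = Input(UInt(log2Ceil(numEntries).W))
    val unbusy_valid   = Input(Bool())
    val unbusy_idx     = Input(UInt(log2Ceil(numEntries).W))
  })

  val valids = RegInit(VecInit(Seq.fill(numEntries)(false.B)))
  val busies = RegInit(VecInit(Seq.fill(numEntries)(false.B)))

  when (io.dispatch_valid) {
    valids(io.dispatch_idx) := true.B
    busies(io.dispatch_idx) := true.B
  }
  when (io.unbusy_valid) {
    busies(io.unbusy_idx) := false.B
  }

  // Unsynthesizable, simulation-only check: stop with a failure code if an
  // instruction tries to unbusy an invalid or already not-busy ROB entry.
  assert(!io.unbusy_valid || (valids(io.unbusy_idx) && busies(io.unbusy_idx)),
    "ROB: tried to unbusy an invalid or already not-busy entry")
}
```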

3.3.3 The Chisel Hardware Construction Language

BOOM is implemented in the Chisel hardware construction language (HCL). Chisel was developed by UC Berkeley to enable advanced hardware design using highly parameterized generators. Chisel allows designers to utilize concepts such as object orientation, functional programming, parameterized types, and type inference. UC Berkeley implemented Chisel as an open-source HCL built on top of the Scala language. However, Chisel is not a "high-level synthesis" language; Chisel users still reason about registers, wires, and memories. The BOOM source-code repository is 16,000 lines of Chisel code.

During elaboration time, Chisel source code is compiled into an intermediate representation known as FIRRTL (Flexible Intermediate Representation for RTL). The FIRRTL compiler then gradually lowers the Chisel-generated FIRRTL code into Verilog code through a series of compiler transformations. This explicit split between the Chisel frontend and FIRRTL backend allows the hardware designer access to the compiler internals in such a way as to make it easier for designers to add their own transformation passes.

Table 3.2: The set of bare-metal benchmarks provided by the riscv-tests repository. Each is on the order of 100k instructions.

dhrystone   A synthetic embedded integer benchmark.
median      Performs a 1D three element median filter.
mm          Performs a floating-point matrix multiply.
mt-matmul   A multi-threaded implementation of mm.
mt-vvadd    A multi-threaded implementation of vvadd.
multiply    A software implementation of multiply.
qsort       Sorts an array of integers using quick sort.
rsort       Sorts an array of integers using radix sort.
spmv        Double-precision sparse matrix-vector multiplication.
towers      Solves the Towers of Hanoi puzzle recursively.
vvadd       Sums two arrays and writes into a third array.

For example, ASIC tool flows and FPGA tool flows both consume Verilog, but the treatment of some primitives such as memories may differ significantly. FIRRTL provides the options to instantiate behavioral models of memories — useful for FPGA backends — or it can automatically blackbox memories to facilitate designers providing their own memory blocks (common in ASIC methodologies).

3.3.4 The Rocket-chip System-on-a-Chip Generator

BOOM's source code only describes a core; the rest of the memory system, uncore, and remaining SoC infrastructure must be provided. As such, BOOM was developed to use the open-source Rocket-chip SoC generator [7]. The Rocket-chip generator can instantiate a wide range of SoC designs, including cache-coherent multi-tile designs, cores with and without accelerators, and chips with or without a last-level shared cache.

Rocket-chip is built around the Rocket in-order core. Rocket is a single-issue 5-stage in-order core that implements the RV64GC ISA and page-based virtual memory. The original design purpose of the Rocket core was to enable architectural research into vector co-processors by serving as the scalar Control Processor. Rocket-chip, using Rocket as its core, has been taped out at least fourteen times in three different commercial processes, and has been successfully demonstrated to reach over 1.65 GHz in IBM 45 nm SOI [100].

From BOOM's point of view, Rocket-chip can also be thought of as a "library of processor components." There are a number of modules created for Rocket that are also used by BOOM: the functional units, the caches, the translation look-aside buffers, the page table walker, the debug transport module, the privileged architecture state, and more. Rocket-chip is implemented in 38,500 lines of Chisel code. By leveraging the existing silicon-proven Rocket-chip infrastructure, we have been able to maintain focus on the BOOM core itself.

3.4 FPGA Implementation

Field-programmable Gate Arrays (FPGAs) are specialized integrated circuits that can be programmed to emulate arbitrary hardware functions. The basic building block of an FPGA is the configurable logic block (CLB). Each logic block is built out of look-up tables (LUTs), memory state elements, and simple logic operations such as full-adders and selectors. Hundreds of thousands of logic blocks are then tiled across the FPGA. Both the CLBs and the wiring connecting CLBs together are configurable by writing to configuration memories to govern their behavior. In this manner, arbitrary hardware functions can be executed on an FPGA. FPGAs also provide more fixed-function blocks, such as floating-point arithmetic units, to more efficiently target particular applications. The hardware functions being emulated by modern FPGAs can be reconfigured repeatedly.

As FPGAs have grown rapidly in size, they have increasingly become attractive options for emulating processor designs before committing to a particular design to send to the foundry for fabrication. We use the Xilinx ZC706 FPGA to emulate instances of BOOM. Synthesis, place, and route take approximately 45 to 60 minutes on an Intel Xeon Ivybridge. BOOM runs at 40 to 50 MHz, a considerable speedup of roughly 5,000x - 100,000x over software simulation of the same BOOM design using Verilator (10 kHz) or VCS (500 Hz). By using Xilinx's Memory Interface Generator (MIG), we are able to synthesize a DDR3 memory controller to communicate with the 8 GB of DDR3 SODIMM memory that is on the ZC706 board. We run BOOM in a tethered simulation with a hardened ARM Cortex-A9 running a front-end server to manage the host/target communications, which include binary loading and console I/O.

By using an FPGA to simulate BOOM, running the full reference input sizes of benchmarks becomes feasible. For example, one trillion cycles takes roughly 6 hours of wall-clock time. As SPECint2006 is roughly 20 trillion instructions spread across 35 workloads, a design able to achieve one instruction per cycle can finish SPEC with reference inputs in five days. With a small cluster of FPGA boards, this can be sped up to about 18 hours, the length of time of the longest-running workload.
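As a back-of-the-envelope check on these estimates (assuming the 50 MHz FPGA clock and an IPC of 1.0 mentioned above):

1 x 10^12 cycles / (50 x 10^6 cycles/s) = 2 x 10^4 s, or roughly 5.6 hours
20 x 10^12 cycles / (50 x 10^6 cycles/s) = 4 x 10^5 s, or roughly 4.6 days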

3.5 ASIC Implementation

On Aug 15th, 2017, we taped out an ASIC implementation of an integrated chip using BOOM as its processing core in TSMC 28 nm HPM (high performance mobile). We call the chip BROOM, an acronym for the Berkeley Resilient Out-of-Order Machine. Figure 3.3 shows the block diagram of the BROOM processor. BROOM consists of a single BOOM core and a 1 MB L2 cache, each in their own clock (and voltage) domains.

The main focus of the research chip is to study SRAM resiliency techniques. The techniques being explored allow the chip to lower the voltage on its SRAMs to improve the energy efficiency of the processor. This work builds on prior resiliency efforts [54, 127].

BROOM was implemented using LVT-based standard cells and a foundry-provided memory compiler. The entire chip measures 2 mm by 3 mm and is composed of 72 million transistors. The chip is composed of 417k standard cells and 73 SRAM macros; the core and L1 caches make up 310k cells and 20 SRAM macros. The final sign-off in the slow-slow corner was at 1.55 ns. Figures 3.4 and 3.5 show the place-and-routed chip plot (with and without annotations).

BROOM uses 124 external IO pads and is wire-bonded to a PCB using chip-on-board packaging, as shown in Figure 3.6. A serial interface provides communications to a Xilinx ZedBoard Zynq-7000 FPGA which provides the glue logic between BROOM, a host computer, and BROOM's off-chip DRAM (mounted on the FPGA). We successfully demonstrated booting Linux 4.6.2 on BROOM silicon on Dec 12, 2017.

Figure 3.3: Block diagram of the BROOM test chip. The diagram shows the BOOM core domain (0.5 V to 1 V), containing the core, branch prediction structures, register files, load/store unit, and 16 KB L1 instruction and data caches built from foundry SRAM macros, connected through asynchronous FIFOs and level shifters to the digital uncore domain, which contains a 1 MB L2 cache bank, an MMIO router, a configurable memory traffic switcher, and a bidirectional TileLink SERDES interface to the FPGA.

Figure 3.4: BROOM place-and-routed chip plot.

Figure 3.5: BROOM place-and-routed chip plot with annotations.

Figure 3.6: Photograph of a BROOM chip wire-bonded into a PCB via chip-on-board packaging.

3.5.1 SRAM Generation

SRAM memories are used to implement the relatively dense data arrays for the L1 instruction cache, the L1 and L2 data caches, the conditional branch predictor, and the branch target buffer. For general logic, synthesizable Verilog RTL is generated from the Chisel RTL description and run through the synthesis, place, and route tools. For targeting SRAM memories, Chisel provides a SeqMem construct. The SeqMem construct provides a memory array with synchronous read semantics and undefined behavior for reads that occur simultaneously with a write. Chisel can either provide its own Verilog behavioral model for SeqMems for RTL simulation or flip-flop-based synthesis, or it can blackbox any SeqMem instantiations that use one or two ports. For every SeqMem that is blackboxed by Chisel, an entry is added to a Chisel-generated configuration file that lists the parameters of each SeqMem, including the number and type of ports, the width and depth of the memory, and the bit-mask behavior of the write port. The designer can then provide an SRAM instance in place of the blackboxed SeqMem description that matches its parameters.

Table 3.3 shows the configurations for each of the SRAMs used within a BOOM core as chosen for the BROOM tapeout. Some of the memories require masked writes to allow for partial updates to an entry. The L1 data cache is the only SRAM in the core that requires two ports. Two ports allow error-correction schemes to read, modify, and write (correct) the tags or data arrays while maintaining full request throughput.

For the BROOM tapeout, the conditional branch predictor (BPD) used is a gshare predictor. BOOM's gshare predictor tracks the last 13 branch outcomes and hashes this "global history" with the lookup address to index into an array of two-bit counters. As a particular branch is taken, the counter saturates upwards; and if not taken, the counter saturates downwards. To fit the two-bit counters into single-port SRAM, the two-bit counters are physically separated into a prediction (p) table and a hysteresis (h) table. The p-table is further banked to allow for reads (predictions) and writes (updates) to occur simultaneously. A single h-bit is then shared across two p-bit entries. The resulting gshare structure is three equal-size single-port SRAMs. Section 4.2 discusses the BPD in more detail.

We used a TSMC-provided memory compiler to generate the SRAM memories. A Python script generated the SRAM instantiation code by analyzing the Chisel-generated SRAM configuration file and choosing the appropriate memory from a list of available foundry-provided memories.
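As a rough illustration of the SeqMem construct (renamed SyncReadMem in later Chisel releases), the following sketch shows a single-ported, masked-write memory of the kind Chisel can either emit as behavioral Verilog or blackbox for replacement with a foundry SRAM macro. The parameter and port names are illustrative, not BOOM's.

import chisel3._
import chisel3.util._

// A minimal sketch of a synchronous-read memory with a write mask.
// Reads return data one cycle after the address; a read that races a
// write to the same address returns undefined data.
class MaskedSram(depth: Int, numWords: Int, wordBits: Int) extends Module {
  val io = IO(new Bundle {
    val addr  = Input(UInt(log2Ceil(depth).W))
    val wen   = Input(Bool())
    val wdata = Input(Vec(numWords, UInt(wordBits.W)))
    val wmask = Input(Vec(numWords, Bool()))     // mask granularity = wordBits
    val rdata = Output(Vec(numWords, UInt(wordBits.W)))
  })

  val mem = SyncReadMem(depth, Vec(numWords, UInt(wordBits.W)))

  io.rdata := mem.read(io.addr, !io.wen)
  when (io.wen) {
    mem.write(io.addr, io.wdata, io.wmask)       // only masked words are updated
  }
}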

3.5.2 Custom Bit Array Register File

The BOOM integer register file used in the BROOM tapeout is a 6-read, 3-write register file with seventy 64-bit registers. We chose to implement the register file by manually crafting a register file bit out of foundry-provided standard cells. We then pre-placed each register bit in an array and let the placer automatically route the wires to and from each bit.

Table 3.3: The configurations used for each of the SRAMs used in a BOOM core. The number of uses, or instances, is also shown. For example, the L1 data cache is set-associative with four ways.

Use            Depth      Width    Size     Ports   Mask          Instances
               (entries)  (bits)   (bytes)          Granularity
L1 D$ tags     64         88       704      w+r     22            4
L1 D$ data     256        128      4096     w+r     64            4
L1 I$ tags     64         80       640      rw      20            4
L1 I$ data     256        128      4096     rw      -             4
BTB tags       64         20       160      rw      -             2
BTB data       64         44       352      rw      -             2
BPD counters   128        64       1024     rw      1             3

We further divided up the bits into clusters for hierarchical read lines; tri-states drive the read ports inside of each cluster and muxes select the read data across clusters.

The bits were described in structural Verilog that manually instantiated the foundry-provided standard cells. We implemented the decode logic in Chisel and then blackboxed the register file. We implemented a behavioral model of the custom array in Chisel to verify the decode logic through RTL simulation and then performed additional verification of the custom bit-array register file in gate-level simulation. Aside from the integer register file and the SRAMs, no other logic in Chisel was implemented via blackboxes.
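The blackboxing flow described above can be sketched in Chisel as follows; the module name and ports are hypothetical stand-ins for the actual 6-read, 3-write array, and during RTL simulation a Chisel behavioral model with the same ports takes the place of this blackbox.

import chisel3._
import chisel3.util._

// A hedged sketch of wrapping a hand-crafted register-file array behind a
// Chisel BlackBox. Names and widths are illustrative only.
class RegFileArrayBB(numRegs: Int, numRead: Int, numWrite: Int, xlen: Int)
    extends BlackBox {
  val io = IO(new Bundle {
    val clock = Input(Clock())                               // BlackBoxes have no implicit clock
    val raddr = Input(Vec(numRead,  UInt(log2Ceil(numRegs).W)))
    val rdata = Output(Vec(numRead, UInt(xlen.W)))
    val wen   = Input(Vec(numWrite, Bool()))
    val waddr = Input(Vec(numWrite, UInt(log2Ceil(numRegs).W)))
    val wdata = Input(Vec(numWrite, UInt(xlen.W)))
  })
}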

3.5.3 Timing Analysis

Timing analysis was performed by analyzing the timing reports generated from Synopsys Design Compiler (synthesis) and Synopsys IC Compiler (place-and-route) while targeting the slow-slow corner. Most critical paths required a small amount of refactoring to move late-arriving signals off the critical path. Other critical paths could be quickly solved by moving logic that is "out-of-band from the main instruction pipeline" to an extra stage. One example is the updating of the Branch Target Buffer (BTB) from a mispredicted branch as detected by the Branch Resolution Unit (BRU) in the backend. Although the instruction fetch redirect should happen as soon as possible, updating the BTB to correctly predict the branch on its next execution is less critical.

Another helpful tool for addressing critical paths is register retiming. Register retiming allows the tools to move the exact location of a register to improve timing results without changing the functionality of the circuit. The ability of the VLSI tools to utilize register retiming is repeatedly leveraged throughout the BOOM and Rocket-chip code base to reduce the code complexity and improve quality-of-result. We directed the tools to use register retiming on the following modules:

• Arithmetic Logic Unit (ALUUnit)

• Pipelined Integer Multiplier (Imul)

• Fused-Multiply-Add Pipeline (FPUFMAPipe)

• Integer-to-FP Conversion Unit (IntToFP)

• FP-to-FP Conversion Unit (FPToFP)

• Rename Stage (RenameStage – flattened and retimed)

• Branch Prediction Pipeline (BranchPredictionStage)

• Fetch Unit (FetchUnit)

3.6 Parameters

Table 3.4 shows the parameters chosen for the BOOM tapeout. Our overall strategy was to begin with a modestly sized processor, find and improve its critical paths, and then increase the size of data structures (such as the branch predictor tables) to match the available area and timing slack. Unfortunately, we found ourselves mostly limited by area. In response, we reduced the L1 caches from 32 kB to 16 kB.

3.7 Performance Evaluation

Table 3.5 shows results of area, frequency, and performance comparisons of BOOM against other industry processors using the CoreMark EEMBC benchmark. Our aim is to be competitive in both performance and area against low-power, embedded out-of-order cores. Bolded is the BOOM configuration taped out as part of the BROOM project using the parameters listed in Table 3.4, labeled BOOMv2. Two older variants of BOOM, labeled BOOMv1 two-wide and BOOMv1 four-wide, demonstrate the high-water mark of BOOM when fetching two and four instructions per cycle respectively.

BOOMv1 enjoys higher performance over BOOMv2 largely for two reasons. First, BOOMv1 uses the more complex, but more accurate TAGE branch predictor [89]. Second, BOOMv2 adds an extra cycle of latency to the load-use delay. This performance loss is not fundamental to the design of BOOMv2. We chose to minimize design risk of the BROOM tapeout by using a simpler gshare-based branch predictor. The load-use delay can be addressed by adding a load-to-use bypass, cutting the load-use delay from 5 cycles to 3 cycles. BOOMv1 was designed with only limited access to educational technology libraries and no access to a memory compiler, and as such offers a shorter pipeline with more complexity per stage. In contrast, BOOMv2 was designed with feedback provided by the TSMC 28 nm HPM libraries and a foundry-provided memory compiler. Chapter 6 covers the design decisions made in progressing from BOOMv1 to BOOMv2.

Figures 3.7 and 3.8 show performance comparisons while running 8 benchmarks from the SPECint2006 benchmark suite. All data was taken by running each benchmark to completion — roughly 12 trillion instructions per processor. Figure 3.7 shows the performance data normalized to clock frequency. Rocket and BOOM data was collected on an FPGA using a memory model that accurately simulates a 1 MB L2 cache with a 20-cycle access latency and off-chip memory with a fixed 80-cycle access latency. Figure 3.8 shows the full run-time performance by normalizing the target run time (in seconds) to the SPEC 2006 reference machine, a 296 MHz UltraSPARC II. BOOM's overall simulated performance is on par with the out-of-order ARM Cortex-A9. As the BOOM and Rocket performances are captured using an FPGA, the frequency used to calculate target run time in seconds was chosen to match the frequency of the Cortex-A9 and is not based on any particular silicon chip. Neither BROOM nor the other available silicon Rocket research chips include a DDR PHY, making it unrealistic to run SPEC workloads. Without a DDR PHY, the off-chip serial interface is the weakest link in running SPEC 2006 due to its 2 GB memory requirement.

Table 3.4: The parameters chosen for the tapeout of BOOM. Although BOOM is a parameterizable generator, we must commit to a single configuration for a silicon instantiation.

Fetch Width                    2 instructions
Issue Width                    4 micro-ops (2 int, 1 mem, 1 fp)
Issue entries (int)            16
Issue entries (mem)            20
Issue entries (fp)             10
ROB entries                    48
Load/Store Unit                16 loads / 16 stores
FP Div/Sqrt                    disabled
Branch Target Buffer           64 sets x 2 ways
Return Address Stack           8 entries
Conditional Branch Predictor   gshare, 13 bits of global history
L1 Cache (instruction)         64 sets x 4 ways (16 kB)
L1 Cache (data)                64 sets x 4 ways (16 kB)
Perf Counters                  6 counters
Perf Events                    37 events
Regfile                        6r3w (int), 3r2w (fp)
Exe Units                      iALU+iMul+iDiv, iALU, FMA+fDiv, Load/Store
FPU latency                    4 cycles (SFMA, DFMA)

Table 3.5: CoreMark, area, and frequency comparisons of industry processors. The BOOMv1 designs enjoy higher performance relative to the BOOMv2 chip due to their more sophisticated TAGE branch predictor and their shorter load-use delay. Idealistic, perfect scaling is assumed to provide a normalized area measurement across technologies.

Processor                        Core Area         Scaled Area   CoreMark/   Freq       CoreMark/   IPC
                                 (core+L1s)        (to 28 nm)    MHz/Core    (MHz)      Core
Intel Xeon E5 2687W (Sandy)†     ≈18 mm2 @ 32nm    14 mm2        7.36        3,400      25,007      -
Intel Xeon E5 2667 (Ivy)*        ≈12 mm2 @ 22nm    19 mm2        5.60        3,300      18,474      1.96
RV64 BOOMv1 four-wide            1.4 mm2 @ 45nm    0.54 mm2      5.18        n/a **     n/a **      1.50
RV64 BOOMv1 two-wide             1.1 mm2 @ 45nm    0.43 mm2      4.87        n/a **     n/a **      1.25
ARM Cortex-A15*                  2.8 mm2 @ 28nm    2.8 mm2       4.72        2,116      9,977       1.50
RV64 BOOMv2 (BROOM chip)         0.52 mm2 @ 28nm   0.52 mm2      3.77        1,250 §    4,713 §     1.11
ARM Cortex-A9 (Kayla Tegra 3)*   ≈2.5 mm2 @ 40nm   1.2 mm2       3.71        1,400      5,189       1.19
MIPS 74K‡                        2.5 mm2 @ 65nm    0.46 mm2      2.50        1,600      4,000       -
RV64 Rocket*                     0.5 mm2 @ 45nm    0.19 mm2      2.32        1,500 **   3,480 **    0.76
ARM Cortex-A5‡                   0.5 mm2 @ 40nm    0.25 mm2      2.13        1,000      2,125       -

Results collected from *the author (using gcc51 -O3 and perf), †[26], or ‡[6]. The Intel core areas include the L1 and L2 caches. ** Results collected from RTL simulations. Area data provided via Cacti [122] and educational VLSI libraries. Frequency data not available as designs were not fully pushed through to a DRC-clean tape-in. § Frequency based off of place-and-route data.

Figure 3.7: Instruction-per-cycle comparison running SPECint2006. BOOM and Rocket performance is measured by running each benchmark to full completion using reference input sizes on an FPGA with a simulated memory system. The values plotted here are also shown in Table 3.6.

Figure 3.8: Performance ratio relative to the SPEC reference machine (a 296 MHz UltraSPARC II). Compared are Intel Ivybridge @ 4.0 GHz, Cortex-A57 @ 1.7 GHz, Cortex-A15 @ 1.7 GHz, BOOM @ 1.4 GHz, Cortex-A9 @ 1.4 GHz, and Rocket @ 1.4 GHz. BOOM and Rocket performance is measured by running each benchmark to full completion using reference input sizes on an FPGA with a simulated memory system. The frequency of BOOM and Rocket was chosen to match the frequency of the Cortex-A9. The values plotted here are also shown in Table 3.7.

Table 3.6: Instruction-per-cycle comparison running SPECint2006. BOOM and Rocket perfor- mance is measured by running each benchmark to full completion using reference input sizes on an FPGA with a simulated memory system.

benchmark        Intel Ivybridge   Cortex-A57   Cortex-A15   BOOM   Cortex-A9   Rocket
400.perlbench    2.02              1.34         1.23         0.88   0.79        0.61
401.bzip         1.58              1.07         0.81         0.97   0.49        0.58
403.gcc          1.15              0.63         0.76         0.52   0.38        0.40
429.mcf          0.43              0.13         0.16         0.11   0.10        0.08
458.sjeng        1.48              1.08         0.97         1.05   0.73        0.69
464.h264ref      2.30              1.40         1.15         1.12   1.41        0.74
471.omnetpp      0.61              0.31         0.34         0.31   0.21        0.22
483.xalancbmk    1.41              0.67         0.69         0.45   0.35        0.33
geomean          1.21              0.66         0.65         0.55   0.43        0.38

Table 3.7: Performance ratio relative to the SPEC reference machine (a 296 MHz UltraSPARC II). BOOM and Rocket performance is measured by running each benchmark to full completion using reference input sizes on an FPGA with a simulated memory system. The frequency of BOOM and Rocket was chosen to match the frequency of the Cortex-A9.

benchmark        Intel Ivybridge   Cortex-A57   Cortex-A15   BOOM   Cortex-A9   Rocket
400.perlbench    37.1              9.7          8.4          4.9    4.9         3.4
401.bzip         26.4              7.6          5.6          4.2    3.0         2.5
403.gcc          37.2              9.2          8.3          4.3    4.1         3.3
429.mcf          53.5              6.9          7.0          5.0    3.5         3.5
458.sjeng        30.8              8.8          7.1          6.1    4.2         4.0
464.h264ref      54.8              15.8         12.3         6.8    7.5         4.5
471.omnetpp      26.2              5.8          5.8          4.3    3.3         3.1
483.xalancbmk    44.4              8.7          7.9          3.9    3.8         2.8
geomean          37.4              8.7          7.6          4.9    4.1         3.3

Chapter 4

Branch Prediction

Branch prediction is a crucial component of the performance of advanced processors. Normally, branch instructions interrupt the stream of instructions, as a branch that is taken redirects the frontend to begin fetching instructions from a new memory location. As the branch direction cannot be resolved until much later in the processor's execution pipeline, an unacceptable delay would otherwise be encountered between resolving the branch and fetching the instructions that come after it. A common solution to this problem is to use speculation — predict the direction of the branch, make forward progress under the assumption the prediction was correct, and roll back any changes if a misprediction occurred. By predicting the direction of branch instructions, the processor can continue making forward progress with no expensive delays. Instead, a penalty is paid only when the branch prediction is incorrect. Thus, processor performance is best when the instruction fetch unit (the frontend) is able to provide an uninterrupted stream of instructions to the execution backend.

Any branch mispredictions in the frontend will not be discovered until the branch instruction is executed later in the backend, typically 10s to 100s of cycles later. In the event of a misprediction, all instructions after the branch must be flushed from the processor and the frontend must be restarted using the correct instruction path. As modern processors tend to support 100-200 instructions inflight, a branch misprediction can be a significant performance hit. For example, if one in ten instructions is a branch, and a branch predictor is 90% accurate, then 1% of instructions will be a mispredicted branch. If the average misprediction penalty is 20 cycles, then for every 100 instructions the pipeline suffers a 20 cycle penalty. For a workload that is otherwise enjoying an instruction-per-cycle (IPC) throughput of 1.0, the 90% branch prediction accuracy translates to a 20% performance slowdown (this arithmetic is worked out below). Even if accuracy is increased to 95%, the pipeline still suffers a 10% slowdown.

This chapter discusses the challenges in mapping branch prediction strategies studied using software simulators to synthesizable hardware implementations. First, Section 4.1 briefly discusses the state-of-the-art in branch prediction research and the challenges in mapping these research software implementations to synthesizable hardware implementations. Section 4.2 then describes BOOM's implementation of a branch predictor in Chisel and how we translated the implementation described by [89] (and related papers) to an RTL description for use by BOOM. Section 4.3 discusses how we explicitly addressed the gaps between RTL and software models. This chapter concludes in Section 4.4 with some proposed improvements to the existing software model frameworks for better capturing the complexity and costs of branch predictors.
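Written out, the slowdown arithmetic in the example above is (assuming one branch per ten instructions, 90% accuracy, a 20-cycle penalty, and a base IPC of 1.0):

100 instructions x (1/10) branches per instruction x (1 - 0.90) = 1 mispredicted branch per 100 instructions
100 cycles + 1 x 20 cycles = 120 cycles per 100 instructions, i.e. 20% more cycles than the misprediction-free baseline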

4.1 Background

This section provides some background and motivation to better understand the context of BOOM’s branch prediction framework. First, this section discusses the research methodology used to evaluate the current state-of-the-art predictor designs. Next, this section explores the gaps between the current research methodology and the issues that arise in designing RTL implementations of branch predictors. Finally, this section provides a high-level overview of the state-of-the-art TAGE branch predictor [89].

4.1.1 Research Methodology

Branch prediction has been a popular focus of research for over three decades [97, 65]. Branch prediction is critically important to a processor's performance, but part of this longevity is due to the ease with which branch prediction strategies can be evaluated and explored. Prediction strategies can be evaluated by feeding predictor models a sequence of branch instructions from an instruction trace one at a time. At each branch, the predictor model is given the fetch address. The predictor model then provides its prediction — taken or not-taken. Once the prediction has been performed, the simulation framework tells the branch predictor model the correct outcome and the predictor can update itself as appropriate.

These un-pipelined, one-at-a-time models are simple and usually much less than 1,000 lines of C++ code. No other part of the processor needs to be implemented. The branch predictor implementations need only provide predictions, and any correctness bugs in the predictor have no impact on the functional behavior of the program traces. Also, the simplicity of these trace-based simulators allows branch prediction models to simulate at a rate of roughly 10 million instructions per second (MIPS), far faster than other, full-featured processor simulators.

One example framework of this method is the Championship Branch Prediction contest (CBP) [21]. The CBP is a contest held roughly every two years and co-located with the International Symposium on Computer Architecture (ISCA) conference, whose organizers provide the CBP simulator infrastructure and simulation traces. In the last CBP contest, all implementations were perceptron-based designs [50] or TAGE-based designs [89] (with one design being a combination of both [49]). Both predictor styles have been reported to be used by industry [18, 125] although specific implementation details or design deviations from the academic versions have not been disclosed.
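The contract between the trace-driven framework and a predictor model is small. A rough Scala sketch of the one-at-a-time loop is shown below; this is illustrative only, not the contest's actual C++ harness, and all of the names are invented for the example.

// A sketch of the unpipelined, one-branch-at-a-time evaluation loop used by
// trace-based frameworks such as CBP.
object TraceDrivenSim {
  case class BranchRecord(pc: Long, taken: Boolean)

  trait PredictorModel {
    def predict(pc: Long): Boolean               // provide a taken/not-taken guess
    def update(pc: Long, taken: Boolean): Unit   // immediately told the true outcome
  }

  def run(trace: Seq[BranchRecord], model: PredictorModel): Double = {
    var correct = 0
    for (b <- trace) {
      if (model.predict(b.pc) == b.taken) correct += 1
      model.update(b.pc, b.taken)                // update before the next branch is predicted
    }
    correct.toDouble / trace.size                // fraction of branches predicted correctly
  }
}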

4.1.2 Deficiencies of Trace-based, Unpipelined Models

Trace-based simulation has proven very useful for quickly prototyping branch prediction strategies and discovering an upper-bound on the accuracy for a particular implementation. However, it fails to capture some of the complexity that makes it challenging to translate predictor models to RTL implementations [31]. These gaps arise both from the trace-based nature of the model and from the sequential one-at-a-time, unpipelined aspect of the model. Only committed instructions are executed, only one instruction is predicted at a time, and that one instruction is then immediately updated before predicting the next instruction.

The first gap between models and RTL implementations is that the models only predict on a single branch instruction at a time. In comparison, high performance processors may fetch around 4-8 instructions (or more) a cycle. The more instructions fetched in a cycle, the more likely the processor will fetch multiple branches in the same fetch packet. There are a few ways to deal with superscalar prediction — one could perform 8 predictions in parallel and choose the first taken branch [93], or one could perform only one prediction and track the branch instruction responsible for the prediction for a particular fetch address. Unfortunately, trace-based simulation provides no guidance in evaluating the tradeoffs between the two approaches. There is also the issue of global history — does it track the outcome of all branches in the program or does it track history only on the granularity of a fetch cycle ("was any branch taken within this fetch packet?").

Second, there is a delay between initiating a branch prediction and changing the flow of instructions based on the prediction. For BOOM this delay is 3 cycles: 1) compute the table index by hashing the fetch address and the global history, 2) read the prediction tables, and 3) make the final decision and redirect the fetch address. This delay means that the global history used to hash into the predictor cannot completely contain all branches that have occurred before it in the program — there may be some branches hiding in this prediction shadow.

Third, there is a delay between performing the prediction in the Fetch stages and updating the predictor with the resolved outcome in the Execute or Commit stages. Some predictors may want to bypass inflight updates to new predictions, but the complexity of this bypass structure is not captured in the simpler un-pipelined, one-at-a-time models. These simple un-pipelined models also do not properly capture the complexity of storing the prediction state required to update the predictor in the Execute or Commit stages.

Fourth, designs are typically judged by the number of state bits required to make predictions. This does not account for the additional state required to reset the predictor on a misprediction or exception, which for some designs can become very expensive. Some proposed designs track over a thousand bits of history, which must be snapshotted and reset on a misprediction. The winning entry in the CBP-4 4 kB space was TAGE-SC-L [91] with 10 tagged tables and a history length of 359 bits. Some of the implementation suggestions for TAGE recommend using three circular shift registers (CSR) of roughly 10-13 bits each per table to track indexing and tag computations.
Each CSR must be snapshotted per prediction — for the winning design, this comes to roughly 250 bits of CSR state per fetch cycle. In order to manage 40 cycles-worth of CSR state storage, the predictor would have to add at least 30% to its total state storage on top of its existing 4 kB of state used for the prediction tables. Also, not all bits are created equal. Designs that can cram the prediction tables into single-port SRAM can enjoy greater density, whereas more complex predictors may require the use of multi-port flip-flop-based arrays which are much more expensive in area and power.

Finally, trace-based models do not capture the implementation complexity and realism. For example, the 2bc-gskew predictor used in the (cancelled) Alpha EV8 processor purposefully avoids a local history predictor due to the number of read ports required by the additional level of indirection provided by the local histories [93]. Instead, trace-based modeling requires care from the model builder to stay within the realm of possibility [90].

4.1.3 The State-of-the-art TAGE Predictor

TAGE predictors are regarded as the most accurate of the academic implementations, as judged by the CBP contests, and can be found in processor simulators such as ESESC [5]. TAGE is composed of a small number of tables. As shown in Figure 4.1, each table entry consists of a tag, an n-bit saturating prediction counter, and a saturating usefulness counter.

TAGE begins making its prediction by hashing the fetch address and the global history to generate an index and a tag for each table. Each table uses geometrically more history than the previous table as part of its hashing function. For example, the first table may use only 5 bits, the second table 17 bits, the third table 44 bits, and the fourth table 130 bits. If the accessed entry has a tag hit, the table asserts a valid prediction and uses the prediction counter to provide the taken/not-taken prediction. The usefulness information can be used to decide which table should provide the prediction. Otherwise, the table with the longest history that experienced a tag hit makes the prediction.

When TAGE is updated on a misprediction, a new entry is allocated in one of the TAGE tables. If Table i made a prediction, then a new entry is allocated in some Table j such that j > i, but only if the current entry in Table j has a usefulness value of 0. If an allocation fails because an available entry cannot be found, the usefulness counter is decremented for each candidate entry in every table, making room for when the next allocation occurs. A number of different techniques have been explored for degrading the usefulness counters to prevent the tables from filling up with rarely used entries that prevent new allocations from succeeding. These include clearing the bits after a fixed period of time or tracking how many allocation failures occur before degrading all of the usefulness counters.

There are four main insights that make TAGE particularly accurate compared to previous branch predictor designs.

First, TAGE adds tags to each entry. The heart of the TAGE predictor is the two-bit (or three-bit) saturating prediction counter. But attached to each counter is a tag to provide high confidence that a prediction from this counter truly matches the fetch address and history. These tags are roughly 10-13 bits, roughly four or five times more expensive than the prediction counters.
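A single TAGE table entry can be described with a small Chisel Bundle; the sketch below uses illustrative default widths, not necessarily the widths used by any particular published design.

import chisel3._

// A sketch of one TAGE table entry: a tag, an n-bit saturating prediction
// counter, and a usefulness counter.
class TageEntry(tagBits: Int = 12, cntrBits: Int = 3, ubitBits: Int = 2) extends Bundle {
  val tag  = UInt(tagBits.W)   // hashed from the fetch address and folded global history
  val cntr = UInt(cntrBits.W)  // MSB provides the taken/not-taken prediction
  val ubit = UInt(ubitBits.W)  // usefulness: an entry may only be replaced when this is zero
}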


Figure 4.1: The TAGE predictor. The requesting address and the global history are fed into each table's index hash and tag hash functions. Each table provides its own prediction (or no prediction) and the table with the longest history wins. Each table has geometrically more history bits than the previous table.

Although intuition may suggest that a predictor with four or five times as many counters would address most aliasing concerns, the use of tags nearly completely eliminates aliasing and protects long-lived entries from being overwritten by sporadic branches.

Second, TAGE is able to exploit very long histories — hundreds of bits — by performing an XOR folding of the history to generate the index and tag hashes for each table. This provides much more reach than predictors that are limited to n bits of history for a predictor table of size 2^n.

Third, different branches require different lengths of history to provide the most accurate prediction. Some branches need very long histories to predict their direction while other branches need relatively short histories to prevent pollution from branches in unrelated program phases. Shorter histories also have the benefit of being faster to learn. Instead of having a single history length that must balance the needs of both types of branches, TAGE provides multiple prediction tables, each using a different history length. As each table uses a geometrically increasing amount of history to make its predictions, TAGE is able to accommodate both very long histories and very short histories.

Fourth, the usefulness counter provides a level of confidence behind each prediction — each table (and in turn TAGE itself) only provides a prediction when it is confident that the prediction is correct.

4.2 The BOOM RTL Implementation

This section discusses how BOOM predicts branches and then resolves these predictions, covering many of the implementation details of BOOM’s branch prediction pipeline.

4.2.1 The Frontend Organization

The BOOM frontend covers instruction fetch and branch prediction as shown in Figure 4.2. Instruction fetch begins by sending a fetch address to the instruction cache. An entire cycle is devoted to accessing the SRAM within the instruction cache. In parallel with the instruction cache access, the Branch Target Buffer (BTB) is accessed. The BTB uses the fetch address to make a prediction on what the next cycle's fetch address should be. Each BTB entry contains a tag and a target. If the fetch address matches against a tag, then the target provides the prediction. The BTB is a very expensive structure, roughly 60 bits per entry, and it must make a prediction within a single cycle. Therefore, there is little room for clever algorithms. BOOM's BTB uses a bimodal predictor indexed by the fetch address to predict taken/not-taken on a tag hit. The BTB also tracks whether the stored control-flow instruction is a return or a call, which interacts with a Return Address Stack predictor to predict function returns. Although the BTB is crucial to good processor performance, it is not the focus of this chapter. Instead, we focus on the larger conditional branch predictor, which relies on more sophisticated algorithms to accurately predict the taken/not-taken direction of conditional branches.


Figure 4.2: The BOOM Fetch Unit.

The conditional branch predictor is accessed via the fetch address and makes a set of taken/not-taken predictions a few cycles later. As BOOM's branch predictor takes 3 cycles to make a prediction, on a taken prediction BOOM's branch predictor squashes two stages of instruction fetch. As the conditional branch predictor only predicts taken/not-taken, it can store its predictions in very dense arrays, potentially spending as little as 1 or 2 bits per branch. For branch target information, the conditional branch predictor uses the instructions freshly fetched from the instruction cache to provide the branch targets.

In the Fetch1 stage, while the BTB and instruction cache are being accessed, the conditional predictor first computes a hash from the fetch address and the global history. Conceptually, the global history (ghistory) tracks the outcome of the last n branches in the program. On the next cycle, Fetch2, this hash can then be used to index a particular set of entries within the branch predictor. In parallel with the prediction table access, the instructions returning from the instruction cache are searched for branches and the branch targets are computed. Lastly, in Fetch3, the predictor's predictions are matched against any branches that have been fetched, a final prediction is made, and the fetch address is redirected if the branch predictor disagrees with the current path set by the BTB. If a disagreement has occurred between the branch predictor and the BTB, 2 cycles of instruction fetch are squashed and the frontend starts fetching along a new branch path.

4.2.2 Providing a Branch Predictor Framework

Although there is a wide range of branch prediction algorithms, many fall into the same category of "global history predictors" in which the fetch address and the global history are used to make a prediction. Thus, it is possible to provide a branch predictor framework that facilitates implementing new designs by abstracting away most of the complexity that is shared across designs. In this manner, we can more easily explore different types of conditional branch predictors.

At a high level, BOOM provides an abstract BrPredictor class. This class provides a standard interface into the branch predictor, it provides the control logic for managing the global history register, and it contains the branch reorder buffer (BROB), which handles the inflight branch prediction checkpoints. This abstract class can be found in Figure 4.3 labeled "predictor (base)".
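The flavor of such a framework can be sketched as follows. This is a simplified, hypothetical rendering of the idea, not BOOM's actual BrPredictor API: the class, port, and field names are all illustrative. The base class owns the standard request/response/update interface and the speculatively-updated global history; a concrete predictor supplies only its hash functions and prediction tables.

import chisel3._
import chisel3.util._

// Per-fetch-packet prediction response.
class PredictorResp(fetchWidth: Int) extends Bundle {
  val takens = Vec(fetchWidth, Bool())   // one taken/not-taken bit per slot in the fetch packet
}

// Information returned at commit/branch-resolution time to train the predictor.
class PredictorUpdate(fetchWidth: Int, vaddrBits: Int, histBits: Int) extends Bundle {
  val fetchPc    = UInt(vaddrBits.W)     // fetch address the prediction was made with
  val takens     = Vec(fetchWidth, Bool())
  val mispredict = Bool()
  val history    = UInt(histBits.W)      // history snapshot used at prediction time
}

abstract class AbstractPredictor(fetchWidth: Int, vaddrBits: Int, histBits: Int) extends Module {
  val io = IO(new Bundle {
    val req    = Input(Valid(UInt(vaddrBits.W)))                                    // Fetch0: next fetch address
    val resp   = Output(Valid(new PredictorResp(fetchWidth)))                       // Fetch3: predictions
    val update = Input(Valid(new PredictorUpdate(fetchWidth, vaddrBits, histBits))) // commit/resolution
  })

  // Speculative global history: the whole fetch packet is OR-reduced to one bit.
  val ghistory = RegInit(0.U(histBits.W))
  when (io.resp.valid) {
    ghistory := Cat(ghistory(histBits - 2, 0), io.resp.bits.takens.reduce(_ || _))
  }
}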

4.2.3 Managing the Global History Register

The global history register, or ghistory, is an important piece of many branch predictor designs. It contains the outcomes of the previous N branches (where N is the size of the global history register). When fetching branch i, it is important that the directions of the previous N branches (i − 1 down to i − N) are available so an accurate prediction can be made. Waiting until the Commit stage to update the global history register would be too late, as dozens of branches would be inflight and not reflected.


Figure 4.3: The branch prediction framework. The frontend sends the next fetch address to the branch prediction pipeline (Fetch0 stage). A hash of the fetch address and the global history is used to index the predictor's prediction tables. The prediction tables are accessed in parallel with the instruction cache (Fetch1 stage). The conditional branch predictor then returns a prediction for each instruction in the fetch packet. The instructions returning from the instruction cache are quickly decoded; any branches that are predicted as taken will redirect the frontend from the Fetch3 stage. Prediction snapshots and metadata are stored in the branch rename snapshots (which are later used for fixing the predictor after mispredictions) and the Branch Re-order Buffer (B-ROB) (which is used to update the predictor in the Commit stage). The predictor (base) provides a generic, abstract implementation of a global history branch predictor. A new, concrete branch predictor implementation only needs to provide its hashing function(s) and prediction tables.

Therefore, the global history register must be updated speculatively, once the branch is fetched and predicted. Managing the global history faces three main challenges:

1. superscalar fetch and prediction

2. delayed update between fetch-initiation and redirection-due-to-prediction

3. resetting the global history on a misprediction or exception

To address superscalar fetch and prediction, the global history does not track individual branch outcomes, but rather compresses, via an OR-reduction, the directions of all conditional branches within the fetch packet into a single bit. As the superscalar fetch width increases, the odds of a taken branch increase. One solution is to hash in path history [73] to increase the randomness of the global history register [93].

If a misprediction occurs, the global history register must be reset and updated to reflect the actual history. This means that each branch (more accurately, each fetch packet) must snapshot the global history register in case of a misprediction.

As previously mentioned, there is a delay between beginning to make a prediction in the Fetch1 stage (when the global history is read) and redirecting the front-end in the Fetch3 stage (when the global history is updated). This results in a "shadow" in which a branch beginning to make a prediction in Fetch1 will not see the branches (or their outcomes) that came a cycle or two earlier in the program. These "shadow branches" must be reflected in the global history snapshot.

There is one final wrinkle - exceptional pipeline behavior. While each branch contains a snapshot of the global history register, any instruction can potentially throw an exception that will cause a front-end redirect. Such an event will cause the global history register to become corrupted. For exceptions, this may seem acceptable - exceptions should be rare and the trap handlers will cause a pollution of the global history register anyways (from the point of view of the user code). However, some exceptional events include "pipeline replays" — events where an instruction causes a pipeline flush and the instruction is refetched and re-executed. For this reason, a commit copy of the global history register is also maintained by the abstract branch predictor class and is used for resetting the history on a pipeline flush event.

Some branch predictors, such as TAGE, require access to incredibly long histories – potentially over a thousand bits. Global history is speculatively updated after each prediction and must be snapshotted and reset if a misprediction was made. Snapshotting a thousand bits is untenable. Instead, the Very Long Global History Register (VLHR) is implemented as a circular buffer with a speculative head pointer and a commit head pointer. As a prediction is made, the prediction is written down at VLHR[spec_head] and the speculative head pointer is incremented and snapshotted. When a branch mispredicts, the head pointer is reset to snapshot + 1 and the correct direction is written to VLHR[snapshot]. In this manner, each snapshot is on the order of 10 bits, not 1000 bits. BOOM provides two different implementations of ghistory — a shift register for short histories and the circular buffer for very long histories.
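A minimal Chisel sketch of the VLHR idea is shown below. The names are illustrative, the history length is assumed to be a power of two so the head pointer wraps naturally, and the commit-copy and shadow-history repair are elided.

import chisel3._
import chisel3.util._

// A sketch of the Very Long Global History Register: a circular buffer of
// branch outcomes with a speculative head pointer. Each prediction snapshots
// only the small head pointer, not the thousand-bit history itself.
class Vlhr(histLength: Int) extends Module {
  private val ptrBits = log2Ceil(histLength)
  val io = IO(new Bundle {
    val predValid  = Input(Bool())             // a new prediction is made this cycle
    val predTaken  = Input(Bool())             // its speculative outcome
    val snapshot   = Output(UInt(ptrBits.W))   // saved alongside the branch for recovery

    val mispredict = Input(Bool())             // a branch resolved as mispredicted
    val mispSnap   = Input(UInt(ptrBits.W))    // that branch's saved head pointer
    val mispTaken  = Input(Bool())             // the true outcome of that branch
  })

  val buffer   = RegInit(0.U(histLength.W))
  val specHead = RegInit(0.U(ptrBits.W))

  io.snapshot := specHead

  when (io.mispredict) {
    // Rewind: write the correct direction at the snapshot and resume just past it.
    buffer   := buffer.bitSet(io.mispSnap, io.mispTaken)
    specHead := io.mispSnap + 1.U
  } .elsewhen (io.predValid) {
    buffer   := buffer.bitSet(specHead, io.predTaken)
    specHead := specHead + 1.U
  }
}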

4.2.4 The Two-bit Counter Tables

The basic building block of most branch predictors is the "two-bit counter table" (2bc). As a particular branch is repeatedly taken, the counter saturates upwards to the max value 3 (0b11), or strongly taken. Likewise, repeatedly not-taken branches saturate towards zero (0b00). The high-order bit specifies the prediction and the low-order bit specifies the hysteresis (how strongly biased the prediction is).


Figure 4.4: A gshare predictor uses the global history hashed with the fetch address (PC) to index into a table of 2-bit counters. The high-order bit makes the prediction.

These two-bit counters are aggregated into a table. Ideally, a good branch predictor knows which counter to index to make the best prediction. However, to fit these two-bit counters into dense SRAM, a change is made to the 2bc finite state machine – mispredictions made in the weakly not-taken state move the 2bc into the strongly taken state (and vice versa for weakly taken being mispredicted). The FSM behavior is shown in Figure 4.5. Although it's no longer strictly a "counter", this change allows us to separate out the read and write requirements on the prediction and hysteresis bits and place them in separate sequential memory tables. In hardware, the 2bc table can be implemented as follows:

The P-bit:

• read - every cycle to make a prediction

• write - only when a misprediction occurred (the value of the h-bit).

The H-bit:

• read - only when a misprediction occurred.

• write - when a branch is resolved (write the direction the branch took).

Figure 4.5: Two-bit counter state machine (states: 11 strongly taken, 10 weakly taken, 01 weakly not-taken, 00 strongly not-taken; taken branches move the counter up, not-taken branches move it down).

By breaking the high-order p-bit and the low-order h-bit apart, we can place each in a two-ported (1r1w) SRAM. By making a few assumptions we can reduce this to a single-ported (1rw) SRAM. We can assume that mispredictions are rare and branch resolutions are not necessarily occurring on every cycle. Also, writes can be delayed or even dropped altogether. Therefore, the h-table can be implemented using a single 1rw-ported SRAM by queueing writes up and draining them when a read is not being performed. Likewise, the p-table can be implemented using a 1rw-ported SRAM by banking the SRAM, buffering writes into a queue, and draining the queue when there is not a read conflict for that particular bank.

We also have to be aware that SRAMs have a particular ratio of depth to width where they provide the best physical characteristics. The "tall and skinny" aspect ratio that the 2bc tables require is a particularly bad design point. However, the solution is simple – tall and skinny can be trivially transformed into the preferred rectangular memory structure. The high-order bits of the index can correspond to the SRAM row and the low-order bits can be used to mux out the specific bits from within the row. We have created a memory utility class in Chisel that automatically transforms our single-ported memories into rectangular structures.
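A hedged sketch of the split arrangement is shown below: the p-bit is read every cycle and written only on a misprediction, the h-bit is written on every branch resolution, and the rare p-table writes sit in a small queue that drains when the single-ported array is not being read (and are simply dropped if the queue is full). The names and sizes are illustrative, not BOOM's code, and banking and the h-table read path are elided.

import chisel3._
import chisel3.util._

// A sketch of the split two-bit-counter storage.
class SplitTwoBitTable(numEntries: Int) extends Module {
  private val idxBits = log2Ceil(numEntries)
  val io = IO(new Bundle {
    val readIdx    = Input(Valid(UInt(idxBits.W)))  // prediction lookup (every fetch)
    val readPred   = Output(Bool())                 // synchronous read result (next cycle)
    val resolveIdx = Input(Valid(UInt(idxBits.W)))  // branch resolved this cycle
    val resolveDir = Input(Bool())                  // outcome written into the h-table
    val mispredIdx = Input(Valid(UInt(idxBits.W)))  // misprediction: copy the h-bit into the p-table
    val mispredH   = Input(Bool())                  // h-bit value previously read out for this branch
  })

  val pTable = SyncReadMem(numEntries, Bool())
  val hTable = SyncReadMem(numEntries, Bool())

  // Rare p-table writes wait in a queue until no prediction read is in progress.
  class PWrite extends Bundle { val idx = UInt(idxBits.W); val data = Bool() }
  val pWrites = Module(new Queue(new PWrite, entries = 4))
  pWrites.io.enq.valid     := io.mispredIdx.valid   // silently dropped if the queue is full
  pWrites.io.enq.bits.idx  := io.mispredIdx.bits
  pWrites.io.enq.bits.data := io.mispredH
  pWrites.io.deq.ready     := !io.readIdx.valid

  io.readPred := pTable.read(io.readIdx.bits, io.readIdx.valid)
  when (pWrites.io.deq.valid && pWrites.io.deq.ready) {
    pTable.write(pWrites.io.deq.bits.idx, pWrites.io.deq.bits.data)
  }

  // The h-table is written on every branch resolution; its (rare) read on a
  // misprediction is handled analogously and elided here.
  when (io.resolveIdx.valid) {
    hTable.write(io.resolveIdx.bits, io.resolveDir)
  }
}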

4.2.5 Superscalar Predictions

A superscalar fetch unit fetches multiple instructions every cycle. Thus, the BTB and conditional branch predictor must predict, across the entire fetch packet, which of the many possible branches will be the dominating branch that redirects the fetch address. This raises a challenge for structures like the BTB, which use tag lookups to facilitate their predictions — exactly what do we store as the tag?

To address this issue, when the BTB makes a prediction, it performs a tag match against the predicted branch's fetch address, and not the address of the branch itself. That is to say, when a branch is mispredicted, it informs the branch prediction structures of both its own address (so we know which instruction is a branch) but also the address that was used by the frontend to fetch the branch. In this manner, on future executions through the program, the fetch address of the branch is used for the tag match, and the predictor can store an offset to mark which instruction in the fetch packet is the branch. Using this design, each BTB entry corresponds to a single fetch address, but it is helping to predict across the entire fetch packet. The BTB entry can only store meta-data and target-data on a single control-flow instruction. While there are certainly pathological cases that can harm performance with this design, the assumption is that there is a correlation between which branch in a fetch packet is the dominating branch relative to the fetch address, and evaluations of this design using a trace-based superscalar fetch unit simulator have shown this design is very complexity-friendly with no noticeable loss in performance. A different design could instead choose to provide a whole bank of BTBs for each possible instruction in the fetch packet, which would increase the number of entries accessed and add to the critical path to choose the winning taken branch.

4.2.6 The BOOM GShare Predictor

Gshare is a simple but very effective branch predictor. Predictions are made by hashing the instruction address and the global history (typically a simple XOR) and then indexing into a table of two-bit counters. Figure 4.4 shows the logical architecture and Figure 4.6 shows the physical implementation and structure of the gshare predictor. Note that the prediction begins in the Fetch1 stage when the requesting address is sent to the predictor but that the prediction is made later in the Fetch3 stage once the instructions have returned from the instruction cache and the prediction state has been read out of the gshare's p-table. The heart of the gshare predictor is a two-bit counter table implemented as described in Section 4.2.4.
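The index computation itself is a one-line hash. A sketch with illustrative parameter names is shown below (BOOM's actual hash may differ in detail, for example to account for banking):

import chisel3._
import chisel3.util._

// Gshare index hash: XOR the aligned fetch address with the global history.
object GshareHash {
  def apply(fetchPc: UInt, ghistory: UInt, log2FetchBytes: Int, idxBits: Int): UInt = {
    val pcBits = (fetchPc >> log2FetchBytes)(idxBits - 1, 0)  // drop the offset within the fetch packet
    pcBits ^ ghistory(idxBits - 1, 0)
  }
}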

4.2.7 The BOOM TAGE Predictor

BOOM also implements the TAGE conditional branch predictor. TAGE is a highly-parameterizable, state-of-the-art global history predictor [89, 90, 92]. The design is able to maintain a high degree of accuracy while scaling from very small predictor sizes to very large predictor sizes. It is fast to learn short histories while also able to learn very, very long histories (over a thousand branches of history).

TAGE (TAgged GEometric) is implemented as a collection of predictor tables. Each table entry contains a prediction counter, a usefulness counter, and a tag. The prediction counter provides the prediction. The usefulness counter tracks how useful the particular entry has been in the past for providing correct predictions. The tag allows the table to only make a prediction if there is a tag match for the particular requesting instruction address and global history. More information about the TAGE design can be found in Section 4.1.3.


Figure 4.6: The gshare predictor pipeline. The two-bit counter table is split into a separate p-table (the high-order bit) and h-table (the low-order bit). Predictions only read the p-table. On every branch resolution, the Branch Unit writes the branch outcome to the h-table. If a misprediction occurs, the h-table is read and its value is written to the p-table.

4.2.7.1 TAGE Global History and the Circular Shift Registers (CSRs)

Each TAGE table has associated with it its own global history, and each table has geometrically more history than the previous table. As the histories become incredibly long (and thus too expensive to snapshot directly), TAGE uses the Very Long Global History Register (VLHR) as described in Section 4.2.3. The histories contain many more bits of history than can be used to index a TAGE table; therefore, the history must be "folded" to fit. A table with 1024 entries uses 10 bits to index the table. Therefore, if the table uses 20 bits of global history, the top 10 bits of history are XOR'ed against the bottom 10 bits of history. Instead of attempting to dynamically fold a very long history register every cycle, each TAGE table stores its history in a circular shift register (CSR) that holds the already-folded value. Only the newest history bit and the oldest history bit need to be provided from the master copy of the VLHR to perform an update. Code 4.1 shows an example of how a CSR works.

Example:
A 12 bit value (0b_0111_1001_1111) folded onto a 5 bit CSR becomes
(0b_0_0010), which can be found by:

              /-- history[12] (evict bit)
              |
 c[4], c[3], c[2], c[1], c[0]
  |                        ^
  |                        |
  \________________________/ \--- history[0] (newly taken bit)

 (c[4] ^ h[ 0] generates the new c[0]).
 (c[1] ^ h[12] generates the new c[2]).

Code 4.1: The circular shift register. When a new branch outcome is added, the register is shifted (and wrapped around). The new outcome is added and the oldest bit in the history is "evicted".

Each TAGE table must maintain three CSRs. The first CSR is used for computing the index hash and has a size n = log(num table entries). As a CSR contains the folded history, any periodic history pattern matching the length of the CSR will XOR to all zeroes. This can potentially be a quite common occurrence. To deal with this issue, there are two CSRs for computing the tag hash, one of width n and the other of width n − 1. For every prediction, all three CSRs (for every table) must be snapshotted and reset if a branch misprediction occurs. Another three commit copies of these CSRs must be maintained to handle pipeline flushes.
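The CSR update itself is small. A hedged Chisel sketch of one such register, following the rule illustrated in Code 4.1, is shown below; the names are illustrative.

import chisel3._
import chisel3.util._

// One folded-history CSR: each update rotates the register left by one, XORs
// the newest history bit into position 0, and XORs the evicted history bit
// into position (histLength mod csrLength).
class FoldedHistoryCsr(histLength: Int, csrLength: Int) extends Module {
  val io = IO(new Bundle {
    val enable   = Input(Bool())
    val newBit   = Input(Bool())    // the newly inserted (youngest) history bit
    val evictBit = Input(Bool())    // the oldest history bit, leaving the full history
    val value    = Output(UInt(csrLength.W))
  })

  val csr = RegInit(0.U(csrLength.W))
  io.value := csr

  when (io.enable) {
    val rotated = Cat(csr(csrLength - 2, 0), csr(csrLength - 1))   // rotate left by one
    csr := rotated ^ io.newBit.asUInt ^
           (io.evictBit.asUInt << (histLength % csrLength)).asUInt
  }
}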

4.2.7.2 Usefulness Counters (u-bits)

The "usefulness" of an entry is stored in the u-bit counters. Roughly speaking, if an entry provides a correct prediction, the u-bit counter is incremented. If an entry provides an incorrect prediction, the u-bit counter is decremented. When a misprediction occurs, TAGE attempts to allocate a new entry. To prevent overwriting a useful entry, it will only allocate an entry if the existing entry has a usefulness of zero. However, if an entry allocation fails because all of the potential entries are useful, then all of the potential entries are decremented to potentially make room for an allocation in the future.

To prevent TAGE from filling up with only useful but rarely-used entries, TAGE uses a scheme for "degrading" the u-bits over time. A number of schemes have been proposed. One option is a timer that periodically degrades the u-bit counters. Another option is to track the number of failed allocations (incrementing on a failed allocation and decrementing on a successful allocation). Once the counter has saturated, all u-bits are degraded. BOOM implements the u-bits as a 1-bit counter stored in a flip-flop array, as they are otherwise too small to justify an SRAM and are easier to clear.

4.2.7.3 TAGE Snapshot State

For every prediction, all three CSRs (for every table) must be snapshotted and reset if a branch misprediction occurs. TAGE must also remember the index of each table that was checked for a prediction so the correct entry for each table can be updated later. TAGE must remember the tag computed for each table – the tags will be needed later if a new entry is to be allocated. Finally, commit copies of all CSRs are also maintained in the event of a pipeline exception.

4.3 Addressing the Gaps Between RTL and Models

In this section we discuss how the BOOM RTL implementation of a branch predictor has addressed the previously discussed deficiencies between the trace-based, unpipelined models and RTL.

4.3.1 Superscalar Prediction

Superscalar fetch brings a number of challenges to branch prediction. Although unpipelined simulators treat branch instructions as the basic quantum unit of execution, from the point of view of the RTL, the fetch packet is the basis on which predictions are made. Given the current fetch address, the branch predictor must provide answers to the following questions:

1. What should the fetch address be for the next cycle?

2. If we predict taken, which branch in the fetch packet should take credit?

BOOM's BTB and conditional branch predictor take two different approaches to this problem.

The BTB performs a tag match using the fetch address. If there is a tag hit, the BTB provides an offset to denote which instruction in the fetch packet takes credit for the taken/not-taken prediction. This can be problematic if many branches exist within a fetch packet that alternate being taken/not-taken.

The conditional branch predictor instead uses a solution described by the EV8 processor [93]. A single lookup is performed based on an aligned fetch address which returns a bit vector of predictions. Each bit corresponds to a taken/not-taken prediction for each instruction in the fetch packet. This prediction vector can then be cross-referenced against the decoded fetch packet and the earliest predicted-taken branch can be taken. This method involves tracking the prediction counter state for all branches. In BOOM, we tracked the outcomes for all branches within each fetch packet and updated the branch predictor once the entire fetch packet had been committed. A write-mask allows BOOM to only update entries where an actual branch resides, to reduce aliasing.

This method of tracking the prediction state of individual branches works well for global history predictors that do not use tags. However, for a given fetch address and global history, there is likely only a single branch that is taken. Therefore, for more complex predictors that add tags to each prediction entry, a more profitable solution is to track only the likely-taken branch and its offset within the fetch packet.

Superscalar fetch also adds difficulty to the management of the global history register. Despite potentially fetching and predicting multiple branches every cycle, we reduce these outcomes to a single bit of history per cycle. To do this, we take inspiration from the EV8 [93] and reduce the taken/not-taken outcomes across the entire fetch packet via an OR-reduction.
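The reduction itself is simple. In the sketch below (Chisel 2 style, with hypothetical module and signal names, not BOOM's actual code), the per-slot taken predictions are masked by the decoded branch positions and OR-reduced into the single bit that is speculatively shifted into the global history register.

import Chisel._

// Sketch: collapse a fetch packet's branch outcomes into one history bit.
class GHistUpdate(fetch_width: Int, ghist_len: Int) extends Module
{
   val io = new Bundle {
      val fetch_valid = Bool(INPUT)                           // a packet was predicted this cycle
      val takens      = Vec.fill(fetch_width) { Bool(INPUT) } // taken prediction per slot
      val is_br       = Vec.fill(fetch_width) { Bool(INPUT) } // slot actually holds a branch
      val ghistory    = UInt(OUTPUT, ghist_len)
   }

   // A packet contributes "taken" only if some valid branch within it is taken.
   val packet_taken = (io.takens zip io.is_br).map { case (t, b) => t && b }.reduce(_ || _)

   // Speculatively shift the reduced outcome into the global history register.
   val ghist = Reg(init = UInt(0, ghist_len))
   when (io.fetch_valid) {
      ghist := Cat(ghist(ghist_len-2, 0), packet_taken)
   }

   io.ghistory := ghist
}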

4.3.2 Delayed History Update

The global history must reflect the outcome of all previous branches to the best of its ability. This means we must update the global history speculatively, as we make predictions. However, as there is a delay between initiating a branch prediction and resolving the prediction, the global history has a "shadow" in which branches are unable to see the direction of the branches in the most recent cycles.

For processors such as the Alpha EV8 with its 8-wide fetch, this delay is accepted as a "fact of life." However, the Alpha EV8 designers mention that the difference between their unpipelined simulator and their detailed Alpha EV8 micro-architectural simulator was "insignificant" and the 3-cycle delayed history only "slightly degrades" accuracy with a "limited" impact [93]. They found that adding in some path history recovers most of the performance degradation. And for narrower-width processors, the odds of a branch being in the shadow are lower.

Part of the challenge in dealing with the delayed history update was in resetting the history on a misprediction. On a misprediction on branch i, the global history must be reset for use by branch i + 1 and updated to reflect the true outcome of branch i. This updated history must include the branches that were in the shadow (e.g., branches i − 1 and i − 2) so that branch i + 1 sees the history it would normally see had branch i been correctly predicted.

This snapshot restoration of the global history gets even more difficult when a fetch packet contains multiple branches. If branch i at address a is mispredicted and its resolved direction is not-taken, the front-end is restarted at address a + 4.¹ If address a + 4 falls within the same fetch packet as address a, there may be another branch i + 1. In order to provide an accurate prediction of branch i + 1, it must be provided a global history that does not include branch i. This is to prevent double counting of the fetch packet that includes branches i and i + 1, since, as discussed in the previous section, the entire outcome of the fetch packet is reduced to a single bit of history.

¹ This discussion presupposes the ISA is RV64G, in which all instructions are 4 bytes in size.

4.3.3 Delayed Predictor Update

BOOM performs its updates of the conditional branch predictor after commit. This can lead to some inaccuracy, as a tight loop may have multiple iterations inflight and thus may continue to mispredict the branches within the loop before the first iteration can update the predictor. One solution is to bypass uncommitted updates to the fetch unit in a manner similar to a store buffer bypassing data to dependent loads [90].

4.3.4 Accurate Cost Measurements

Unpipelined models judge branch predictors on two metrics: the number of branch mispredictions and the number of bits used to track the prediction state. However, not all branches have an equal effect on performance. One mispredicted branch may be completely hidden behind a cache miss while another branch misprediction may depend on a long-latency operation and cause a large amount of wasted work to be discarded. Although not explored in this thesis, a detailed processor and branch predictor simulation may allow for more interesting ideas to be explored, such as detecting which branches are "critical" and allocating different resources to critical branches [34].

The other aspect more accurately captured by RTL is the area and energy cost of the branch predictor, as opposed to the raw memory size of the predictor itself. First, some prediction structures more readily map to size-efficient single-ported SRAMs. Second, RTL implementations must be able to handle snapshot and recovery of the branch prediction state. Some predictor designs require a significant amount of snapshot state that is not tracked as a measurement in some branch prediction contests. For example, the gshare predictor only needs to track ≈15 bits of history for each prediction. However, one of the submitted TAGE predictor entries to the "4kB limited size" branch predictor contest uses 1347 bits of history and 15 tables of mostly 1024 entries per table [91]. Although the global history could be implemented as a circular buffer, which requires only saving and restoring a head pointer, each table requires three circular shift registers of 7 to 16 bits each as recommended by [68]. These snapshot costs are sizable relative to the rest of the predictor. A realistic RTL implementation of TAGE is incentivized to either pursue other techniques of checkpointing TAGE state or to pursue many fewer prediction tables.

4.3.5 Implementation Realism

Even if accurate costs of prediction and checkpoint state are taken into account, the design may still fail a realizability test. For example, one branch prediction technique that has been used previously is a local history predictor [55]. This requires tracking history on a per-branch-instruction basis, as opposed to the global history tracking branch outcomes on a per-program basis. This local history is then used to index into a second-level table to make a prediction. However, for a processor fetching four instructions a cycle, this design could require a second-level table with 4 read ports. For this reason, the Alpha EV8 design team discarded the use of a local history predictor, which had been used in the previous Alpha 21264 design [93].

Branch predictors must also be capable of achieving high clock frequencies and low access latencies, setting further constraints on the organization and algorithmic complexity of the design. The Alpha EV8 team spent considerable effort mapping four separate branch predictors (with a tournament function to choose a winner amongst them) into a design using single-ported, four-way interleaved banks. This required the designers to choose incredibly clever hashing functions that caused each of the four branch predictors to index into the same physical bank and wordline index, while still maintaining enough randomness in the hashed offset to provide the appearance of four separate, uncorrelated branch predictors. Their index hash functions also took into account the criticality of each bit. The bank selector is most critical, so those two bits came straight from the fetch address and were unhashed. The wordline index selection was less critical, and was performed using one logic level of XOR hashing with the global history. Finally, the bit offset index was least critical, and used a larger XOR tree to compute.

4.4 Proposed Improvements for Software Model Evaluations

Although we have strived to make implementing new branch predictor designs in BOOM an approachable project, we recognize the incredible usefulness of C++-based software models: their ability to explore a vast design space, the ease with which new ideas can be implemented, and a simulation speed that is competitive with FPGAs without suffering the FPGA synthesis overhead. To this end, we have a few proposals that may help increase the fidelity of software-based simulation models without significantly hampering designer productivity.

Not all state bits are equal in cost. Designs should strive to account for the particular implementation technology of their memory arrays. Are they using single-ported or two-ported SRAM? Are these bits stored in flip-flops?

Models can queue up prediction updates and perform them at some later time to mimic the delay in updating the predictor (a sketch of such a wrapper appears at the end of this section). This delay can be changed to show the sensitivity of a particular design to the update latency.

State checkpointing costs should also be measured. Although traces only provide a path of committed instructions, the simulator could occasionally inject synthetic false paths that then require state restoration when resolving the mispredicted branch. If the designer fails to properly handle checkpoints and restores, the performance of the branch predictor may suffer as it becomes polluted by wrong information.

The simulation traces can be augmented with information on which branches may be more critical relative to other branches. For example, a cache model could be used to correlate which load-dependent branches are likely to be more expensive to mispredict due to a cache miss from the load.

Although these proposed changes are not meant to replace the existing "Level 0" unpipelined simulators, they can provide a supplementary "Level 1" model that attempts to provide a more accurate gauge of a branch predictor's effect on performance, power, and area. Valuable future work would be to find exactly where these Level 0 and Level 1 models can lead to inconsistent qualitative results in providing relative rankings of different designs.
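As a sketch of the delayed-update proposal mentioned above, a trace-driven model can be wrapped so that resolved branches are applied to the underlying predictor only after a configurable number of later resolutions. The code below is plain Scala; the BranchPredictor trait and every name in it are our own illustration, not an existing simulator's API.

// Minimal illustration of delaying predictor updates in a software model.
trait BranchPredictor {
  def predict(pc: Long, history: Long): Boolean
  def update(pc: Long, taken: Boolean, history: Long): Unit
}

class DelayedUpdatePredictor(inner: BranchPredictor, delay: Int) extends BranchPredictor {
  private val pending = scala.collection.mutable.Queue[(Long, Boolean, Long)]()

  def predict(pc: Long, history: Long): Boolean = inner.predict(pc, history)

  // Called when a branch resolves; the real update is deferred so that the
  // next `delay` resolutions still observe the old predictor state.
  def update(pc: Long, taken: Boolean, history: Long): Unit = {
    pending.enqueue((pc, taken, history))
    while (pending.size > delay) {
      val (p, t, h) = pending.dequeue()
      inner.update(p, t, h)
    }
  }
}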

4.5 Conclusion

Branch prediction research has been an important and well-studied field for over 30 years. Trace-based, one-at-a-time, unpipelined simulations have allowed researchers to focus on quickly exploring the effectiveness of high-level algorithms to find the upper-bound performance. However, RTL implementers must bring their own creativity to the table in mapping well-performing models to RTL. This chapter discusses some of the gaps between models and RTL and how BOOM's implementations of two different branch predictors address these challenges.

There were a number of challenges in implementing a branch predictor at the RTL level. Superscalar prediction and the disconnect between instructions and fetch packets, mixed with the delayed updating of the global history register, meant that multiple executions of the same code path might not see consistent behavior from run to run due to icache misses, TLB misses, backend stalls, and branch mispredictions. The delayed history register behavior was particularly difficult to handle. A "shadow" existed where branches making predictions used a global history that did not reflect the direction of the branch one or two cycles in front of it. When a branch mispredicted, the global history has to be reset and corrected to reflect the true program's global history, which must accurately capture this "shadow." Path-based prediction [73] may prove to be a more effective means of tracking program behavior.

Updating the predictor speculatively, maintaining predictor snapshots, and resetting the predictor on mispredictions or pipeline exceptions is a complexity not explored in unpipelined simulators. This complexity changes the cost models of different predictor configurations.

Branch predictors have the nice property that correctness is not required, but the downside is that they are more difficult to test for performance degradation. A set of simple functional tests was used to look for performance regressions, but more complex interactions are difficult to verify.

Despite these challenges, a synthesizable branch predictor has been successfully implemented for BOOM. Using a geometric mean across 9 SPECint benchmarks, BOOM using a 15-bit gshare predictor has a branch misprediction rate of 8.9%, less than half the 19.0% misprediction rate of the Cortex-A9. A 13-bit gshare predictor, implemented using three 2kB single-port SRAMs, was taped out as part of the BROOM test chip.

Chapter 5

Describing an Out-of-order Execution Pipeline Generator

The execution pipeline resides in the core's "backend" and is where operations, as described by the instructions that have been fetched by the frontend, are carried out. For a modern out-of-order core, these operations are executed speculatively (the frontend may have fetched the wrong instructions) and in the order of their dependences (and not in the order they were fetched).

The execution pipeline contains many different functional units, each supporting a different set of operations such as arithmetic computations, memory system requests, or changes to the control flow of the instruction stream. The execution datapath can vary widely between processors in both the number and the mix of the provided functional units. A Cortex-A15 can issue up to eight micro-ops per cycle onto eight different functional units — two integer arithmetic/logic units (ALUs), an integer multiplier, two floating-point/NEON vector units, a branch unit, a memory load unit, and a memory store unit. The pipeline depth of each functional unit of the A15 varies from three cycles for an integer ALU to twelve cycles for a floating-point unit [113]. Similarly, the Samsung M1 is able to issue up to nine micro-ops per cycle — it adds an additional issue scheduler for store data generation [18]. Table 5.1 shows a comparison of different industry processor datapaths.

The goal of this chapter is to provide insight into how BOOM's execution pipeline is mapped to Chisel RTL and how leveraging good software engineering practices has allowed us to describe a parameterizable pipeline generator that can be extended with additional functionality relatively quickly. More importantly, we have been able to utilize expert-written hardware units designed for use in a different context by encapsulating them in a manner that allows them to operate within a parameterized and highly speculative out-of-order pipeline. By implementing a pipeline generator for BOOM's out-of-order execution pipeline and utilizing expert-written hardware, we were able to quickly implement RISC instructions in BOOM as well as pursue more radical design changes such as those discussed in Chapter 6.

This chapter first details the design goals and challenges behind the implementation of BOOM's execution datapath in Section 5.1, followed by a description of the implementation in Section 5.2. Section 5.3 then presents a small case study describing how adding support for single- and double-precision floating point was aided by BOOM's generator design. Section 5.4 walks through an example of adding a new instruction to BOOM's pipeline. Section 5.5 discusses some of the limitations of the generator design and Section 5.6 concludes.

5.1 Goals and Challenges

As a processor generator, BOOM needs the flexibility to provide the designer a range of execution datapath options. If BOOM can generate a variety of datapath designs, the designer can then explore the design space via simulation to find the optimal functional unit mix that provides the best performance for a particular set of workloads. In order to achieve this goal, the execution datapath must be parameterizable. However, this flexibility comes with challenges. The first challenge is the wide reach of the execution pipeline. It interacts with many disparate parts of the processor core and making the execution pipeline flexible requires making all of the other units it interacts with flexible as well. Adding additional functional units typically requires adding more issue ports to the issue windows and more read and write ports to the register file. Another challenge is that functional units are incredibly complex units that are carefully crafted to provide high performance within a given area constraint. Therefore, it is highly desirable to reuse expert-written functional units that were likely implemented with different micro-architectures in mind. BOOM makes use of the berkeley-hardfloat repository [114], which provides a set of expert-written floating-point units designed for the in-order Rocket processor and Rocket-compatible vector units. However, the hardfloat units are designed under the assumption that all operations the units receive are committed and will modify the architectural state of the processor. This assumption is at odds with the highly speculative, run-ahead nature of an out-of-order processor like BOOM. In summary, the two main design challenges are:

1. Providing flexibility to change the number and mix of functional units in the execution datapath.

2. Being able to leverage expert-written functional units that may not perfectly match an out-of-order pipeline.

To address these challenges, BOOM has a notion of an abstract FunctionalUnit class that provides a generic interface to the rest of the execution datapath. Expert-written functional units can then be encapsulated within the FunctionalUnit wrapper. The wrapper handles the meta-data needed to track the speculation state of each micro-op occupying the functional unit's pipeline and suppresses write-back as needed once each micro-op finishes execution.

Table 5.1: A survey of industry execution datapaths and the different functional units available. There is no consensus on the exact number and mix of functional units provided. More functional units allow for more simultaneous operations, which improves performance, but more units require more issue ports and register file ports, which are expensive commodities.

                         MIPS R10K    Cortex-A9    Cortex-A15   Samsung M1
                         [124]        [44]         [113]        [18]
  Year                   1996         2007         2010         2016
  Max Issue Width        5            4            8            9
  Dedicated Branch Unit  0            0            1            1
  Simple Integer Unit    1            1            2            2
                         ALU+Br       ALU+Br       ALU          ALU
  Complex Integer Unit   1            1            1            1
                         iMul/iDiv    iMul         iMul         ALUC/iMul/iDiv
  Floating Point Unit    1+1          1            2            1+1
                         FAdd+FMul    FP/NEON      FP/NEON      FMul/SIMD+FAdd/SIMD
  Memory Unit            1            1            1+1          1+1+1
                         Load/Store   Load/Store   Load+Store   Load+Store AddrGen
                                                                 +Store DataGen

This functional unit abstraction can be further extended; an abstract ExecutionUnit can encapsulate many functional units and provide a generic interface to the processor's issue ports, register file ports, and load/store unit ports. The next section, Section 5.2, goes into more detail on the implementation of BOOM's execution datapath and how it addresses the challenges raised here.

5.2 The BOOM Execution Pipeline

The execution pipeline covers the execution and write-back of micro-ops. Although the micro-ops will travel down the pipeline one after the other (in the order they have been issued), the micro-ops themselves are likely to have been issued to the execution pipeline out-of-order. Figure 5.1 shows an example execution pipeline for a dual-issue BOOM.

5.2.1 Branch Speculation

All micro-ops that are "inflight" in BOOM are given a branch mask, where each bit in the branch mask corresponds to an un-executed, inflight branch that the micro-op is speculated under. Each branch in Decode is allocated a branch tag, and all following micro-ops will have the corresponding bit in the branch mask set (until the branch is resolved by the Branch Resolution Unit).

If the branches (or jumps) have been correctly speculated by the front-end, then the Branch Resolution Unit's only action is to broadcast the corresponding branch tag to all inflight micro-ops to signal that the branch has been resolved correctly. Each micro-op can then clear the corresponding bit in its branch mask, and that branch tag can then be allocated to a new branch in the Decode stage.

If a branch (or jump) is misspeculated, the Branch Resolution Unit must redirect the PC to the correct target, clear the front-end and fetch buffer, and broadcast the misspeculated branch tag so that all dependent, inflight micro-ops may be killed. These branch resolution signals are broadcast to all functional units, which must track the speculation status of each micro-op under their stewardship and suppress the write-back of any killed micro-ops.
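This bookkeeping reduces to a few lines per micro-op. The sketch below is written in the Chisel 2 style used throughout this thesis; the module and signal names are illustrative assumptions rather than BOOM's actual interface.

import Chisel._

// Sketch of the per-micro-op branch-mask bookkeeping described above.
// max_br_count is the number of branch tags that may be in flight at once.
class BranchMaskLogic(max_br_count: Int) extends Module
{
   val io = new Bundle {
      val br_resolve_valid = Bool(INPUT)                  // a branch resolved this cycle
      val br_mispredict    = Bool(INPUT)                  // ...and it was mispredicted
      val br_resolve_mask  = UInt(INPUT, max_br_count)    // one-hot tag of the resolved branch
      val uop_br_mask      = UInt(INPUT, max_br_count)    // this micro-op's speculation mask
      val next_br_mask     = UInt(OUTPUT, max_br_count)
      val kill_uop         = Bool(OUTPUT)
   }

   // Kill the micro-op if it is speculated under a branch that just mispredicted.
   io.kill_uop := io.br_resolve_valid && io.br_mispredict &&
                  (io.br_resolve_mask & io.uop_br_mask).orR

   // Otherwise, clear the resolved branch's bit so its tag can be re-allocated.
   io.next_br_mask := Mux(io.br_resolve_valid,
                          io.uop_br_mask & ~io.br_resolve_mask,
                          io.uop_br_mask)
}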

5.2.2 Execution Units

An Execution Unit is a module onto which a single issue port schedules micro-ops and which contains some mix of functional units. Phrased another way, each issue port from the issue window talks to one and only one Execution Unit. An Execution Unit may contain just a single simple integer ALU, or it could contain a full complement of floating-point units, an integer ALU, and an integer multiply unit.

An Execution Unit provides a bit-vector of the functional units it has available to the issue scheduler. The issue scheduler will only schedule micro-ops that the Execution Unit supports.

Figure 5.1: An example pipeline for a dual-issue BOOM. The first issue port schedules micro-ops onto Execute Unit #0, which can accept ALU operations, FPU operations, and integer multiply operations. The second issue port schedules ALU operations, integer divide instructions (unpipelined), and load/store operations. The ALU operations can bypass to dependent instructions. Note that the ALU in EU#0 is padded with pipeline registers to match latencies with the FPU and iMul units to make scheduling for the write-port trivial. Each Execution Unit has a single issue-port dedicated to it but contains within it a number of lower-level Functional Units.

val exe_units = ArrayBuffer[ExecutionUnit]()

exe_units += Module(new ALUExeUnit(is_branch_unit = true,
                                   has_fpu = true,
                                   has_mul = true,
                                   shares_csr_wport = true
                                   ))
exe_units += Module(new ALUExeUnit(has_div = true
                                   ))
exe_units += Module(new MemExeUnit())

Code 5.1: Describing the pipeline in Chisel code. Additional Execution Units may be added to exe_units and the connections to the register file and issue window will be handled automatically.

Figure 5.2: An example Execution Unit. This particular example shows an integer ALU (that can bypass results to dependent instructions) and an unpipelined divider that becomes busy during operation. Both functional units share a single write-port. The Execution Unit accepts both kill signals and branch resolution signals and passes them to the internal functional units as required.

For functional units that may not always be ready (e.g., an iterative divider), the appropriate bit in the bit-vector will be disabled (see Fig 5.2).

The abstract Execution Unit module provides a flexible abstraction which gives the architect a lot of control over what kinds of Execution Units they can add to their pipeline. The rest of the processor pipeline – the issue ports, the register file ports, and the load/store unit ports – only needs to interface directly with the abstract Execution Unit input/output interface.

Code 5.1 shows the top-level instantiation of the execution pipeline. An ArrayBuffer of ExecutionUnits is constructed. The register file ports and issue select ports are auto-generated to interface with the exe_units array. Constructor arguments to each Execution Unit enable the desired functional units within each Execution Unit.

5.2.3 Functional Units

Functional units are the muscle of the CPU, computing the necessary operations as required by the instructions. Functional units typically require a knowledgeable domain expert to implement them correctly and efficiently. For this reason, BOOM uses the expert-written, low-level functional units from the Rocket and Berkeley hardfloat repositories [83, 114]. However, the expert-written functional units created for the Rocket in-order processor make assumptions about in-order issue and commit points – namely, that once an instruction has been dispatched to them it will never need to be killed. These assumptions break down for out-of-order processors that derive much of their performance from speculatively running ahead and executing instructions that may need to be killed.

Instead of attempting to re-write the functional units to understand the semantics of a speculative and out-of-order pipeline, BOOM provides an abstract FunctionalUnit class (see Fig 5.3) that "wraps" the lower-level functional units with the parameterized, auto-generated support code needed to make them work within BOOM. The request and response ports are abstracted, allowing functional units to provide a unified, interchangeable interface. Each functional unit also provides a set of configurable parameters allowing the wrapper to match the underlying characteristics of the low-level functional unit.

Figure 5.5 shows the full hierarchy of functional units and their abstract helper modules. From the abstract FunctionalUnit class, functional units are further broken down into Pipelined and Iterative units.

5.2.3.1 Pipelined Functional Units

A pipelined functional unit can accept a new micro-op every cycle, and each micro-op will take a known, fixed latency. Speculation support is provided by auto-generating a pipeline that passes down the micro-op meta-data and branch mask in parallel with the micro-op within the expert-written functional unit. If a micro-op is misspeculated, its response is de-asserted as it exits the functional unit. Fig 5.3 shows an example pipelined functional unit.
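The following is a minimal sketch of the kind of support logic such a wrapper can auto-generate, in the Chisel 2 style used elsewhere in this chapter. The module and signal names are illustrative assumptions, not BOOM's actual code: a shift pipeline of valid bits and branch masks advances in lockstep with the expert-written unit, each stage clears the tags of correctly resolved branches, and the response of a micro-op speculated under a mispredicted branch is suppressed as it exits.

import Chisel._

// Sketch of an auto-generated speculation pipeline that travels alongside an
// expert-written unit of fixed latency num_stages. Each stage holds a valid
// bit and the branch mask of the micro-op occupying it.
class SpeculationPipe(num_stages: Int, max_br_count: Int) extends Module
{
   val io = new Bundle {
      val req_valid     = Bool(INPUT)
      val req_br_mask   = UInt(INPUT, max_br_count)
      val br_resolve    = Bool(INPUT)                 // a branch resolved this cycle
      val br_mispredict = Bool(INPUT)                 // ...and it was mispredicted
      val br_mask       = UInt(INPUT, max_br_count)   // one-hot tag of the resolved branch
      val flush         = Bool(INPUT)                 // pipeline flush (e.g., exception)
      val resp_valid    = Bool(OUTPUT)
   }

   def killed(mask: UInt)  = io.br_resolve && io.br_mispredict && (io.br_mask & mask).orR
   def cleared(mask: UInt) = Mux(io.br_resolve, mask & ~io.br_mask, mask)

   val valids = Reg(init = Vec.fill(num_stages) { Bool(false) })
   val masks  = Vec.fill(num_stages) { Reg(UInt(width = max_br_count)) }

   // Stage 0 accepts the incoming micro-op (unless it is killed on arrival).
   valids(0) := io.req_valid && !killed(io.req_br_mask) && !io.flush
   masks(0)  := cleared(io.req_br_mask)

   // Later stages advance, clearing resolved branch bits and dropping
   // micro-ops that were speculated under a mispredicted branch.
   for (i <- 1 until num_stages) {
      valids(i) := valids(i-1) && !killed(masks(i-1)) && !io.flush
      masks(i)  := cleared(masks(i-1))
   }

   // Suppress the expert-written unit's response if its micro-op was killed.
   io.resp_valid := valids(num_stages-1) && !killed(masks(num_stages-1)) && !io.flush
}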

Figure 5.3: The abstract Pipelined Functional Unit class. An expert-written, low-level functional unit is instantiated within the Functional Unit. The request and response ports are abstracted and bypass and branch speculation support is provided. Micro-ops are individually killed by gating off their response as they exit the low-level functional unit.

Figure 5.4: The functional unit abstraction allows for the easy encapsulation of expert-written functional unit logic. The expert-written logic is instantiated by a PipelinedUnit, which adds auto-generated speculation support logic. The encompassing FunctionalUnit abstract class provides a generic I/O interface into the unit that allows the rest of the BOOM pipeline to interface with any functional unit.

5.2.3.2 Iterative Functional Units

Iterative functional units take a variable (and unknown in advance) number of cycles to complete a single operation. Once occupied, they de-assert their ready signal and no additional micro-ops may be scheduled to them. An example iterative unit is an integer divider. Speculation support is provided by tracking the branch mask of the micro-op in the functional unit. An expert-written iterative functional unit should provide a kill signal to quickly remove misspeculated micro-ops. BOOM currently only supports single-occupancy in its iterative functional units.
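A matching sketch for the iterative case is shown below (again, the module and signal names are our assumptions, not BOOM's code): the wrapper records the branch mask of the single occupying micro-op, keeps it in sync with branch resolutions, and raises a kill signal to the expert-written unit if that micro-op turns out to be misspeculated.

import Chisel._

// Sketch: speculation tracking for a single-occupancy iterative unit such as
// a divider. The expert-written unit is expected to accept a kill signal to
// abandon the in-flight operation.
class IterativeSpecTracker(max_br_count: Int) extends Module
{
   val io = new Bundle {
      val req_fire      = Bool(INPUT)                 // a new micro-op enters the unit
      val req_br_mask   = UInt(INPUT, max_br_count)
      val br_resolve    = Bool(INPUT)
      val br_mispredict = Bool(INPUT)
      val br_mask       = UInt(INPUT, max_br_count)
      val done          = Bool(INPUT)                 // the unit finished the operation
      val kill_unit     = Bool(OUTPUT)                // tell the expert unit to abort
      val busy          = Bool(OUTPUT)
   }

   val busy_reg = Reg(init = Bool(false))
   val br_mask  = Reg(UInt(width = max_br_count))

   val killed = io.br_resolve && io.br_mispredict && (io.br_mask & br_mask).orR

   when (io.req_fire) {
      busy_reg := Bool(true)
      br_mask  := io.req_br_mask
   }
   .elsewhen (io.done || (busy_reg && killed)) {
      busy_reg := Bool(false)
   }
   .elsewhen (io.br_resolve && !io.br_mispredict) {
      // Keep the stored mask in sync with correctly resolved branches.
      br_mask := br_mask & ~io.br_mask
   }

   io.kill_unit := busy_reg && killed
   io.busy      := busy_reg
}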

5.2.4 The Load/Store Unit

The Load/Store Unit (LSU) is a challenging unit to encapsulate. A pipelined MemAddrCalc unit is exposed as an Execution Unit to the issue ports for scheduling load and store micro-ops. However, the LSU differs from other pipelined functional units in a number of ways. First, it needs a connection to the L1 data cache, which can invoke architecturally-visible changes. Second, load instructions exhibit a variable execution latency due to cache misses and structural hazards in the data cache.

5.3 Case Study: Adding Floating-point Support

This section highlights the advantages of the implementation strategy behind BOOM's execution datapath by discussing the success in adding floating-point instruction support to BOOM over a period of twelve days and in 1092 lines of code.

The RISC-V ISA supports single-precision (F) and double-precision (D) floating-point extensions whose instructions act on a new set of 32 floating-point registers. The F and D extensions also add a new floating-point control and status register (fcsr), which stores the current dynamic rounding mode and the accrued exception state (fflags). Data can be moved into and out of the floating-point register file via load and store instructions or by moving values to and from the integer register file.

The RV64FD ISA supports the IEEE 754-2008 floating-point standard and includes support for fused multiply-adds, divisions, and square roots. IEEE 754 floating-point comes with a number of challenges not faced by integer-only pipelines, including:

1. support for a separate floating-point register file

2. new special-case values such as infinities, NaNs (not-a-number), and subnormals

3. accrued exceptions to track the occurrence of overflow, underflow, division by zero, invalid operation, and inexact result

4. multiple rounding modes such as round-to-nearest-ties-to-even (RNE) and round-towards-zero (RTZ).

Figure 5.5: The Functional Unit class hierarchy. The dashed ovals are the low-level functional units written by experts, the squares are concrete classes that instantiate the low-level functional units, and the octagons are abstract classes that provide generic speculation support and interfacing with the BOOM pipeline.

The berkeley-hardfloat repository [114] provides a library of parameterizable hardware floating-point units written in Chisel. The included units provide support for fused multiply-add, divide, square root, conversion between integer and floating-point values, and conversion between floating-point values of differing precision. The implementations of the hardware units total roughly 3,800 lines of code. The widths of the exponent and significand are parameterizable. The floating-point units intended to be pipelined are implemented as combinational logic; the instantiating module must add registers behind the floating-point logic and use the synthesis tool's register retiming to achieve the desired pipeline latency.

5.3.1 Register File and Register Renaming

In adding floating-point support, we decided to use the existing physical register file to hold both integer and floating-point values. We augmented the register renaming stage to treat the 32 floating-point registers as an extension of the existing 32 integer registers by mapping f0 through f31 to logical registers 32 through 63. This technique allows the busy table, rename map tables, free list register allocator, and register wakeup logic to come "for free". None of these structures has any knowledge of integer versus floating-point; they simply see the integer registers x0 through x31 as logical registers 0 through 31 and the floating-point registers f0 through f31 as logical registers 32 through 63. All physical registers were expanded to 65 bits, as the berkeley-hardfloat units are designed to handle an internal 65-bit format in which the exponent has an additional bit to handle subnormal numbers more efficiently. Chapter 6 discusses later efforts to separate the floating-point registers from the integer registers to aid in physical design.
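In code, this mapping amounts to prefixing the 5-bit architectural specifier with a "floating-point" bit. The helper below is a sketch rather than a direct excerpt from BOOM:

import Chisel._

// Sketch: fold integer and floating-point architectural registers onto one
// 64-entry logical register namespace so the rename structures are unaware
// of operand type. x0-x31 map to 0-31 and f0-f31 map to 32-63.
object LogicalRegister
{
   def apply(specifier: UInt, is_fp: Bool): UInt = Cat(is_fp, specifier(4, 0))
}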

5.3.2 Issue Window

We also continued using a single unified issue window to store all instructions. The pipelined floating-point unit (FPU), which handles fused multiply-add operations, conversions, and moves, was placed into the same Execution Units as the other integer functional units. As such, FMA operations use the same issue select port and register file ports as the pipelined integer ALU and integer multiplier units (see Figure 5.1). To make the scheduling of the write ports easier, all functional units within an Execution Unit are padded to the same latency. For example, the integer ALU now matches the latency of the 3-cycle double-precision FMA unit. To prevent any performance degradation, the ALU is able to bypass its value to any dependent operations until it is written back to the register file.
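The latency padding itself can be generated by the wrapper with a short shift pipeline. The sketch below (hypothetical module and signal names, not BOOM's actual code) delays a single-cycle ALU result so that it shares a write port with a longer-latency FMA unit, while exposing the younger copies to the bypass network:

import Chisel._

// Sketch: pad a single-cycle ALU result to num_stages cycles for the shared
// write port, while exposing the intermediate stages as bypass sources.
class PaddedALUWriteback(num_stages: Int, data_width: Int) extends Module
{
   val io = new Bundle {
      val valid    = Bool(INPUT)
      val alu_out  = UInt(INPUT, data_width)
      val wb_valid = Bool(OUTPUT)
      val wb_data  = UInt(OUTPUT, data_width)
      val bypass   = Vec.fill(num_stages) { Valid(UInt(width = data_width)) }
   }

   val valids = Reg(init = Vec.fill(num_stages) { Bool(false) })
   val datas  = Vec.fill(num_stages) { Reg(UInt(width = data_width)) }

   valids(0) := io.valid
   datas(0)  := io.alu_out
   for (i <- 1 until num_stages) {
      valids(i) := valids(i-1)
      datas(i)  := datas(i-1)
   }

   // Every stage is a potential bypass source while the result waits.
   for (i <- 0 until num_stages) {
      io.bypass(i).valid := valids(i)
      io.bypass(i).bits  := datas(i)
   }

   io.wb_valid := valids(num_stages-1)
   io.wb_data  := datas(num_stages-1)
}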

5.3.3 Floating-point Control and Status Register (fcsr)

The RISC-V F and D extensions add accrued exception state. Many floating-point operations may set an exception condition code upon completion. However, no exception is taken. Instead, the programmer must query the floating-point status register to view the current state of the exception condition codes. Each FPU has a port into the re-order buffer (ROB) to track its exception state. Upon commit, the exception state is accrued into the floating-point status register. When a CSR read instruction for the floating-point status register is encountered, the pipeline is serialized.

Both static and dynamic rounding modes are supported. Static rounding modes are embedded in the instruction and are passed down the pipeline as part of the micro-op meta-data. The dynamic rounding mode is set as part of the floating-point control/status register. Changes to fcsr serialize the pipeline.

5.3.4 Hardfloat and Low-level Instantiations

The berkeley-hardfloat repository provides a library of low-level building blocks encompassing 3,800 lines of Chisel code. The Rocket-chip processor library instantiates these blocks and provides the control logic and pipeline registers to build a fully functioning floating-point pipeline. Code 5.2 shows the instantiation of the FMA pipeline. Notice that this FMA pipeline has no knowledge of speculation or out-of-order execution. Any request sent to the FMA pipeline will return a fixed number of cycles later and cannot be killed.

BOOM then bundles the FMA pipeline and other floating-point pipelines to implement a full floating-point unit (FPU). The bundled units include a single-precision FMA pipeline, a double-precision FMA pipeline, an Int-to-FP conversion unit, an FP-to-Int conversion unit, and an FP-to-FP conversion unit. This FPU, including all of the control logic, is 203 lines of code. To make scheduling the write port easier, all operations use the same latency. This unit also has no knowledge of speculation.

Table 5.2: The hierarchy from an abstract FunctionalUnit to an expert-written fused multiply-add block. The lower-level units originate from other repositories.

  Repository   Unit                                               Lines of code
  BOOM         FunctionalUnit (abstract)                          20
  BOOM         PipelinedFunctionalUnit (abstract)                 69
  BOOM         FPUUnit (speculation wrapper)                      10
  BOOM         FPU (no speculation support)                       203
  Rocket       FP pipelines (e.g., DFMA Pipeline)                 274
  hardfloat    hardfloat blocks (e.g., mulAddSubRecodedFloatN)    2,061

5.3.5 Pipelined Functional Unit Wrapper

As the FMA pipeline and the FPU that instantiates it have no knowledge of speculation, we must encapsulate them in a PipelinedFunctionalUnit to provide the necessary speculation support.

 1 class FPUFMAPipe(val latency: Int, sigWidth: Int, expWidth: Int) extends Module
 2 {
 3    val io = new Bundle {
 4       val in = Valid(new FPInput).flip
 5       val out = Valid(new FPResult)
 6    }
 7
 8    val width = sigWidth + expWidth
 9    val one = UInt(1) << (width-1)
10    val zero = (io.in.bits.in1(width) ^ io.in.bits.in2(width)) << width
11
12    val valid = Reg(next=io.in.valid)
13    val in = Reg(new FPInput)
14    when (io.in.valid) {
15       in := io.in.bits
16       val cmd_fma = io.in.bits.ren3
17       val cmd_addsub = io.in.bits.swap23
18       in.cmd := Cat(io.in.bits.cmd(1) & (cmd_fma || cmd_addsub), io.in.bits.cmd(0))
19       when (cmd_addsub) { in.in2 := one }
20       unless (cmd_fma || cmd_addsub) { in.in3 := zero }
21    }
22
23    val fma = Module(new hardfloat.mulAddSubRecodedFloatN(sigWidth, expWidth))
24    fma.io.op := in.cmd
25    fma.io.roundingMode := in.rm
26    fma.io.a := in.in1
27    fma.io.b := in.in2
28    fma.io.c := in.in3
29
30    val res = new FPResult
31    res.data := fma.io.out
32    res.exc := fma.io.exceptionFlags
33    io.out := Pipe(valid, res, latency-1) // <<--- register re-timing
34 }

Code 5.2: The code for the FMA pipeline, which instantiates a berkeley-hardfloat mulAddSubRecodedFloatN module to describe a fused multiply-add unit. Notice that line 33 pads the latency of the unit and uses register retiming to implement a parameterized-latency, pipelined FMA unit.

class FPUUnit(num_stages: Int) extends PipelinedFunctionalUnit(
   num_stages = num_stages,
   num_bypass_stages = 0,
   earliest_bypass_stage = 0,
   data_width = 65)
   with BOOMCoreParameters
{
   val fpu = Module(new FPU())
   fpu.io.req <> io.req
   fpu.io.req.bits.fcsr_rm := io.fcsr_rm
   io.resp <> fpu.io.resp
   io.resp.bits.fflags.bits.uop := io.resp.bits.uop
}

Code 5.3: Wrapping the FPU in the pipelined functional unit wrapper.

This encapsulation is straightforward, as the abstract class is doing all of the heavy work of auto-generating the supporting speculation hardware and tracking the inflight micro-ops and branch masks. As shown in Code 5.3, this takes 10 lines of code. Now, micro-ops may be executed on this unit out-of-order and any micro-ops killed by misspeculation will have their writes suppressed while leaving the other micro-ops in the pipeline untouched. Table 5.2 shows the entire hierarchy from berkeley-hardfloat to BOOM's abstract class FunctionalUnit, complete with a breakdown of the lines of code for each level in the hierarchy.

5.3.6 Adding the FPU to an Execution Unit

The final step of adding an expert-written unit to BOOM is to take the PipelinedFunctionalUnit and instantiate it in an Execution Unit. The Execution Unit is a collection of functional units that share a single issue select port and a set of register file read and write ports. To make adding floating-point support to BOOM easy, we added the FPUs to existing integer Execution Units. This approach was made possible by placing the floating-point instructions into the same issue window and using the same physical register file as the integer instructions. Code 5.4 shows the necessary code added to an existing Execution Unit to instantiate an FPU and have it share the existing request port and response port with the other functional units.

5.3.7 Results

Figure 5.6 shows how quickly floating-point support was added, based on the git commit history. At the end of two weeks, we could describe somewhat arbitrary superscalar, out-of-order datapaths with floating-point units. We could parameterize the latency of the FPUs and the number of FPUs. In changing these configurations, the issue ports, register file ports, rename logic, wakeup logic, and more are auto-generated to interface with the new configuration.

// The list of all functional units in this ExecutionUnit
val fu_units = ArrayBuffer[FunctionalUnit]()
...

// FPU Unit ------
var fpu: FPUUnit = null
if (has_fpu)
{
   fpu = Module(new FPUUnit())
   fpu.io.req.valid := io.req.valid && io.req.bits.uop.fu_code_is(FU_FPU)
   fpu.io.req.bits.uop := io.req.bits.uop
   fpu.io.req.bits.rs1_data := io.req.bits.rs1_data
   fpu.io.req.bits.rs2_data := io.req.bits.rs2_data
   fpu.io.req.bits.rs3_data := io.req.bits.rs3_data
   fpu.io.req.bits.kill := io.req.bits.kill
   fpu.io.fcsr_rm := io.fcsr_rm
   fpu.io.brinfo <> io.brinfo

   fu_units += fpu
}
...

io.resp(0).valid :=
   fu_units.map(_.io.resp.valid).reduce(_|_)
io.resp(0).bits.uop :=
   new MicroOp().fromBits(
      PriorityMux(fu_units.map(f => (f.io.resp.valid, f.io.resp.bits.uop.toBits))))
io.resp(0).bits.data :=
   PriorityMux(fu_units.map(f => (f.io.resp.valid, f.io.resp.bits.data.toBits))).toBits

Code 5.4: Instantiating the FPU within an Execution Unit.

[Figure 5.6 is a plot titled "Adding Floating-point Support to BOOM": lines of code added (0 to 1200) versus git commit timestamp (Feb 11 to Feb 25, 2015), annotated in order with the milestones: added FP unit, support for fmovs, fsgnj instructions; added FP ld/st instructions; passing FP memory tests; added FP exceptions; added fcmp, fcvt, fmin, fadd, fclass; added FMAs, full support of RV64FD; added superscalar FP support.]

Figure 5.6: Support for the RISC-V single-precision ("F") and double-precision ("D") extensions was implemented over a two-week period.

This effort was made easier by having a good abstraction framework developed and having an open-source repository of expert-written functional units to utilize. This two-week case study of adding floating-point support was only an initial effort. Chapter 6 will discuss a later effort to improve synthesis quality-of-result by splitting the integer and floating-point registers into separate register files and splitting the issue window into three separate issue windows for integer, memory, and floating-point operations.

5.4 Case Study: Adding a Binary Manipulation Instruction

Given the framework discussed in the previous sections, we can quickly add new instructions to BOOM that do not deviate too far from standard RISC-like micro-ops. A proposed extension to the RISC-V ISA is the binary manipulation extension "B". This extension may include rotates, byte swaps, counting leading zeroes, and population count.

The popcnt instruction returns the number of 1s in a register. This instruction was added to the x86 ISA alongside the SSE4.2 extensions and was measured to take 3 cycles on an Intel Skylake core with a throughput of 1 instruction per cycle [36]. A potential implementation of popcnt can be described using fully combinational logic and then padded with two registers and retimed to provide three cycles to complete the computation while also allowing a pipeline throughput of one operation per cycle.

5.4.1 Decode, Rename, and Instruction Steering

First, our instruction must be added to the decode table in the Decode stage (Code 5.5). In particular, we must specify the type of registers used by the instruction and which functional unit the instruction should be steered towards. The register types help guide register renaming and ensure we write back to the correct register file. Instructions that want to add additional operations to existing functional units can continue to use an existing functional unit's code. For example, saturating arithmetic will want to use the existing arithmetic logic unit (ALU) path.

5.4.2 The Popcount Unit Implementation

We then need to implement the popcount functional unit. This unit will perform a popcount operation with the specified latency and throughput but otherwise has no support for speculation. Code 5.6 shows an implementation of a popcount unit.

Next, we encapsulate the low-level popcount unit with the PipelinedFunctionalUnit abstract class to provide the speculation support necessary for speculative, out-of-order execution, as shown in Code 5.7.

Next, we need to instantiate this PipelinedPopcntUnit functional unit within an Execution Unit, as shown in Code 5.8. We can place the popcount functional unit within the existing ALUExeUnit and provide a new constructor parameter to enable the popcount unit as desired.

// frs3_en
// is val inst?                                                 |  imm sel
// |  is fp inst?                                               |  |
// |  |  is single-prec?                 rs1 regtype            |  |
// |  |  |  micro-code                   |        rs2 type      |  |
// |  |  |  |         iq-type  func unit |        |             |  |
// |  |  |  |         |        |         |        |             |  |
// |  |  |  |         |        |     dst          |             |  |
// |  |  |  |         |        |     regtype      |             |  |
// |  |  |  |         |        |     |            |             |  |
val table: Array[(BitPat, List[BitPat])] = Array(

   POPC -> List(Y, N, X, uopPOPC , IQT_INT, FU_POPC, RT_FIX, RT_FIX, RT_X , N, IS_X,

// bypassable (aka, known/fixed latency)
// is_load                       |       br/jmp
// |  is_store                   |       |  is jal
// |  |  is_amo                  |       |  |  allocate_brtag
// |  |  |  is_fence             |       |  |  |
// |  |  |  |  is_fencei         |       |  |  |
// |  |  |  |  |  mem   mem      |       |  |  |  is unique? (clear pipeline first)
// |  |  |  |  |  cmd   msk      |       |  |  |  |  flush on commit
// |  |  |  |  |  |     |        |       |  |  |  |  |  csr cmd
// |  |  |  |  |  |     |        |       |  |  |  |  |  |

      N, N, N, N, N, M_X , MSK_X , N, N, N, N, N, N, CSR.N),
   ...

Code 5.5: The decode table entry for popcount. The most relevant part is marking the register types and the functional unit type (FU_POPC).

class PopCntUnit(num_stages: Int) extends Module
{
   val io = new Bundle {
      val valid = Bool(INPUT)
      val dw    = Bool(INPUT)
      val in0   = UInt(INPUT, 64)
      val out   = UInt(OUTPUT, 64)
   }

   val result =
      Mux(io.dw,
          PopCount(io.in0),
          PopCount(io.in0(31,0)))

   io.out := Pipe(io.valid, result, num_stages).bits
}

Code 5.6: The "expert-written" popcount functional unit. It has no knowledge of speculation or out-of-order execution and provides no ability to kill any particular micro-op currently occupying the popcount unit. The PopCount construct in Chisel combinationally sums the bits in a signal. The latency is padded to num_stages by using the Pipe construct, which adds registers to effect the requested latency. Synthesis tools can later use register retiming to balance the popcount logic as needed.

class PipelinedPopcntUnit(implicit p: Parameters) extends PipelinedFunctionalUnit(
   num_stages = p(tile.TileKey).core.popc.latency,
   num_bypass_stages = 0,
   earliest_bypass_stage = 0,
   data_width = 64)(p)
{
   // The latency is drawn from the same core parameter used above.
   val pcu = Module(new PopCntUnit(p(tile.TileKey).core.popc.latency))
   pcu.io.valid := io.req.valid
   pcu.io.in0   := io.req.bits.rs1_data
   pcu.io.dw    := io.req.bits.uop.ctrl.fcn_dw

   io.resp.bits.data := pcu.io.out
}

Code 5.7: The "expert-written" popcount functional unit is then encapsulated by the PipelinedFunctionalUnit abstract class that provides the speculation support and interfacing with the rest of the execution pipeline.

class ALUExeUnit(
   ...
   has_popc : Boolean = false)
   (implicit p: Parameters)
   extends ExecutionUnit(
   ...
   has_popc = has_popc)(p)
{
   ...

   // signal to the issue select port via the fu_types bit vector
   // that we support popcnt operations.
   io.fu_types := FU_ALU |
                  ...
                  Mux(Bool(has_popc), FU_POPC, 0.U)

   // The list of all functional units in this ExecutionUnit
   val fu_units = ArrayBuffer[FunctionalUnit]()
   ...

   // POPCNT Unit ------
   var pcu: PipelinedPopcntUnit = null
   if (has_popc)
   {
      pcu = Module(new PipelinedPopcntUnit())
      pcu.io.req.valid := io.req.valid && io.req.bits.uop.fu_code_is(FU_POPC)
      pcu.io.req.bits.uop := io.req.bits.uop
      pcu.io.req.bits.rs1_data := io.req.bits.rs1_data
      pcu.io.req.bits.rs2_data := io.req.bits.rs2_data
      pcu.io.req.bits.kill := io.req.bits.kill
      pcu.io.brinfo <> io.brinfo

      fu_units += pcu
   }
   ...

Code 5.8: The popcount unit with speculation support is then instantiated by an Execution Unit, which provides the interfacing to the rest of the execution pipeline. In this example, we have added the popcount unit to the existing ALUExeUnit and then signaled our support of popcount instructions via the fu_types bit vector, which provides the issue select port information on what operations this Execution Unit supports.

In Code 5.9 we show instantiating the popcount unit in an Execution Unit alongside an integer multiplier and an integer ALU.

5.5 Limitations

The organization of BOOM's execution pipeline — focused on supporting RISC instructions — has allowed us to quickly and easily add support for the entire RV64G ISA by reusing expert-written functional units. However, some assumptions have been made and not all

val exe_units = ArrayBuffer[ExecutionUnit]()

exe_units += Module(new ALUExeUnit(is_branch_unit = true,
                                   has_fpu = true,
                                   shares_csr_wport = true
                                   ))
exe_units += Module(new ALUExeUnit(has_div = true,
                                   has_mul = true,
                                   has_popc = true
                                   ))
exe_units += Module(new MemExeUnit())

Code 5.9: Describing the pipeline with a popcount unit added to the second ALUExeUnit.

types of instruction extensions map well to the current organization.

First, BOOM only supports resolving one branch instruction per cycle. Although this is not an insurmountable restriction, we did not find it necessary to handle a higher throughput of branches.

The data cache that BOOM interfaces with only provides a single 64-bit request port; therefore BOOM only supports instantiating one memory execution unit that can handle one load address or store address computation per cycle. Many high-performance cores are capable of handling two loads per cycle, but that capability comes at a greater complexity and power cost. The modifications to BOOM's execution pipeline to support two loads per cycle would be relatively modest — the main difficulty would be implementing a new data cache that could support two loads simultaneously.

The execution pipeline auto-generates the register file read port connections. However, the read ports on the register file are statically scheduled (each Execution Unit has its own private read ports). For wider processors, this becomes an over-provisioning of a very expensive resource. Many instructions only use one operand, and for instructions that do need operands, [111] notes that 60% of operands are available through the bypass network. However, as BOOM statically schedules the read ports, this sets an upper limit on BOOM's issue width, as the number of required register read ports will grow more quickly than if dynamic scheduling techniques were used.

Finally, there are some limitations that make BOOM's execution pipeline organization ill-suited for more CISC-like instructions, in particular, instructions that require interactions with the memory system or introduce stateful side-effects that make out-of-order and speculative run-ahead more difficult. For example, many proposed RISC-V extensions blend the boundaries between processor instructions and domain-specific accelerators (DSAs). One challenging example is a crypto engine — a DSA that allows the programmer to implement a number of different cipher algorithms [63]. The crypto engine may access memory, which adds both the complication of interfacing with the memory system and the challenges of throwing page fault exceptions. Crypto engines are also stateful. The engine must be configured, instructions may induce state transitions, and the operating system may need to interact with it differently based on the current privilege mode. Some instructions will write to the engine's own internal registers and other instructions may write back to the integer register file. The engine may also handle some instructions with a fixed latency while other operations may have a variable latency. Finally, crypto engines may have their own resources that need to be allocated. All of these properties make it a challenge to integrate into BOOM's RISC execution pipeline.

5.6 Conclusion

This chapter described the organization of BOOM's superscalar execution pipeline and how its organization was motivated by a desire to build a parameterizable generator and a need to utilize expert-written functional units. This organization has worked well for implementing instructions that map well to RISC pipelines. More complex instruction extensions that require their own memory port accesses or provide more stateful execution will require a more careful evaluation of how best to implement them within the BOOM processor.

The next chapter, Chapter 6: VLSI Implementation Effort, will discuss the changes made to BOOM based on a design exploration performed through synthesis, place, and route using a foundry-provided standard-cell library and memory compiler. The speed with which these changes were made, which included splitting apart the register files and adding new functional units to manage moving data between register files, was greatly aided by the design methodology discussed in this chapter.

Chapter 6

VLSI Implementation Effort

On Aug 15, 2017, we taped out an SRAM resiliency test-chip using a BOOM core as the central processing component. This chip was called BROOM, for the Berkeley Resilient Out of Order Machine. The focus of the BROOM chip was on resiliency techniques that would allow SRAM to both prevent and tolerate higher bit-cell failure rates while running at lower voltages to save power. A number of techniques were explored, including line recycling, dynamic column redundancy, and tag protection via flip-flop-based bit bypassing.

While Section 3.5 discusses some of the methodology we followed to tape out an out-of-order core written in Chisel using standard cells, this chapter will focus on the design changes made to BOOM in light of new synthesis, place and route data provided by a foundry-provided standard-cell library and memory compiler. For this tapeout and the associated analysis, we used the TSMC 28 nm HPM process (high performance mobile).

We labeled the culminating design "BOOMv2", an update in which the design effort has been informed by analysis of BOOM in a contemporary industrial tool flow. In contrast, the only synthesis data available during the design evaluation of BOOMv1 was provided through educational libraries that lacked a memory compiler and provided unrealistic timings. We also had access to standard single- and dual-ported memory compilers provided by the foundry, allowing us to explore design trade-offs using different SRAM memories and comparing against synthesized flip-flop arrays. The main distinguishing features of BOOMv2 over BOOMv1 include an updated 3-stage front-end design with a bigger, set-associative Branch Target Buffer (BTB); a pipelined register rename stage; separate floating-point and integer register files; a dedicated floating-point pipeline; separate issue windows for floating-point, integer, and memory micro-operations; and separate stages for issue-select and register read.

Managing the complexity of the register file was the largest obstacle to improving BOOM's clock frequency and manufacturability. We spent considerable effort on placing and routing a semi-custom 9-port register file to explore the potential improvements over a fully synthesized design, in conjunction with microarchitectural techniques to reduce the size and port count of the register file.

BOOMv2 has a 37 fanout-of-four (FO4) inverter delay after synthesis and 50 FO4 after place-and-route, a 24% reduction from BOOMv1's 65 FO4 after place-and-route. Unfortunately, instruction per cycle (IPC) performance drops by up to 20%, mostly due to the extra latency between load instructions and dependent instructions. However, the new BOOMv2 physical design paves the way for IPC recovery later.

6.1 Background

BOOM was inspired initially by the MIPS R10K and Alpha 21264 processors from the 1990s, whose design teams provided relatively detailed insight into their processors' microarchitectures [124, 71, 55]. However, both processors relied on custom, dynamic logic, which allowed them to achieve very high clock frequencies despite their very short pipelines1 — the Alpha 21264 has 15 fanout-of-four (FO4)2 inverter delays [22]. As a comparison, the synthesizable3 Tensilica Xtensa processor, fabricated in a 0.25 micron ASIC process and contemporary with the Alpha 21264, was estimated to have roughly 44 FO4 delays [22]. As BOOM is a synthesizable processor, we must rely on microarchitecture-level techniques to address critical paths, adding more pipeline stages to trade off instructions per cycle (IPC), cycle time (frequency), and design complexity. We also lack the manpower to carry out many of the custom design techniques needed to reach the low FO4 of past industry designs. However, as process nodes have become smaller, transistor variability has increased and power efficiency has become more constraining, making many of the more aggressive custom techniques more difficult and expensive to apply [3]. Modern high-performance processors have largely limited their custom design efforts to more regular structures such as memories and register files.

6.2 BOOMv1

BOOMv1 follows the 6-stage pipeline structure of the MIPS R10K — fetch, decode/rename, issue/register-read, execute, memory, and writeback. During decode, instructions are mapped to micro-operations (uops) and during rename all logical register specifiers are mapped to physical register specifiers. For design simplicity, all uops are placed into a single unified issue window. Likewise, all physical registers (both integer and floating-point registers) are located in a single unified physical register file. Execution Units can contain a mix of integer units and floating-point units. This greatly simplifies floating-point memory instructions and floating-point–integer conversion instructions as they can read their mix of integer and floating-point operands from the same physical register file. BOOMv1 also utilized a short

1The R10K has five stages from instruction fetch to integer instruction write-back and the Alpha 21264 has seven stages for the same path. Load instructions take an additional cycle for both processors.
2An inverter driving four times its input capacitance. FO4 is a useful, relatively technology-agnostic measurement of a circuit path length.
3A "synthesizable" design is one whose gate netlist, routing, and placement is generated nearly exclusively by CAD tools. For comparison, "custom" design is human-created logic design and placement.

Figure 6.1: A comparison of a three-issue (3i) BOOMv1 and four-issue (4i) BOOMv2 pipeline, both of which can fetch two instructions every cycle (2f). Both show two integer ALU units, one memory unit, and one floating-point unit. Note that BOOMv2 uses a distributed issue window to reduce the issue port count for each separate issue window. In BOOMv1 the floating-point unit shares an issue port and register access ports with an integer ALU. Also note that BOOMv1 and BOOMv2 are parameterizable, allowing for wider issue widths than shown here.

Table 6.1: The parameters chosen for analysis of BOOM. Although BOOM is a parameterizable generator, for simplicity of discussion, we have limited our analysis to these two instantiations.

                  BOOMv1                    BOOMv2
BTB entries       40 (fully-associative)    64 x 4 (set-associative)
Fetch Width       2 insts                   2 insts
Issue Width       3 micro-ops               4 micro-ops
Issue Entries     20                        16/16/16
Regfile           7r3w                      6r3w (int), 3r2w (fp)
Exe Units         iALU+iMul+FMA             iALU+iMul+iDiv
                  iALU+fDiv                 iALU
                  Load/Store                FMA+fDiv
                                            Load/Store

2-stage front-end pipeline design. Conditional branch prediction occurs after the branches have been decoded. The design of BOOMv1 was partly informed by using educational technology libraries in conjunction with synthesis tools. While using educational libraries was useful for finding egregious mistakes in control logic signals, it was less useful in informing the organization of the datapaths. Most critically, we lacked access to a memory compiler. Although tools such as Cacti [122] can be used to analytically model the characteristics of memories, Cacti works best for reproducing memories that it has been tuned against such as single-port, cache-sized SRAMs. However, BOOM makes use of a multitude of smaller, irregular SRAMs for modules such as branch predictor tables, prediction snapshots, and address target buffers. Upon analysis of the timing of BOOMv1 using TSMC 28 nm HPM, the following critical paths were identified:

1. issue window select

2. register rename busy-table read

3. conditional branch predictor redirect

4. register file read

The last path (register-read) only showed up as critical during post–place-and-route analysis.

Figure 6.2: The datapath changes between BOOMv1 and BOOMv2. Most notably, the issue window and physical register file have been distributed and an additional cycle has been added to the Fetch and Rename stages.

6.3 BOOMv2

BOOMv2 is an update to BOOMv1 based on information collected through synthesis, place, and route using a commercial TSMC 28 nm process. We performed the design space exploration by using standard single- and dual-ported memory compilers provided by the foundry, and by hand-crafting a standard-cell-based multi-ported register file. Figures 6.1 and 6.2 show the organization of the BOOMv1 and BOOMv2 cores.

Work on BOOMv2 took place from April 9th through Aug 9th and included 4,948 added and 2,377 deleted lines of code (LOC) out of the total 16k LOC code base. The following sections describe some of the major changes that comprise the BOOMv2 design.

6.3.1 Frontend (Instruction Fetch)

The purpose of the frontend is to fetch instructions for execution in the backend. Processor performance is best when the frontend provides an uninterrupted stream of instructions. This requires the frontend to utilize branch prediction techniques to predict which path it believes the instruction stream will take long before the branch can be properly resolved. Any mispredictions in the frontend will not be discovered until the branch (or jump-register) instruction is executed later in the backend. In the event of a misprediction, all instructions after the branch must be flushed from the processor and the frontend must be restarted using the correct instruction path. The frontend relies on a number of different branch prediction techniques to predict the instruction stream, each trading off accuracy, area, critical path cost, and pipeline penalty when making a prediction.

Branch Target Buffer (BTB) The BTB maintains a set of tables mapping from instruction addresses (PCs) to branch targets. When a lookup is performed, the look-up address indexes into the BTB and looks for any tag matches. If there is a tag hit, the BTB will make a prediction and may redirect the frontend based on its predicted target address. Some hysteresis bits are used to help guide the taken/not-taken decision of the BTB in the case of a tag hit. The BTB is a very expensive structure – for each BTB entry it must store the tag (anywhere from a partial tag of ≈20 bits to a full 64-bit tag4) and the target (a full 64-bit address5).

Return Address Stack (RAS) The RAS predicts function returns. Jump-register instructions are otherwise quite difficult to predict, as their target depends on a register value. However, functions are typically entered using a Function Call instruction at address A and return from the function using a Return instruction to address A+1.6 – the RAS can detect the call, compute and then store the expected return address, and then later provide that

4Actually less than 64 bits, since few 64-bit machines will support using all 64 bits as part of the virtual address.
5An offset from the look-up address could be used instead, but that adds an adder to the critical path.
6Actually, it will be A+4, as the size of the call instruction to jump over is 4 bytes in RV64G.

Figure 6.3: The frontend pipelines for BOOMv1 and BOOMv2. To address critical path issues, BOOMv2 adds an extra stage. The branch predictor (BPD) index hashing function is moved to its own stage (F1), pushing back the BPD predictor table accesses a cycle as well (F2). BOOMv2 also utilizes a partially-tagged, set-associative BTB (BOOMv1 uses a fully-tagged, fully-associative BTB). This requires adding a checker module which verifies that the predictions from the BTB match the instructions being fetched.

predicted target when the Return is encountered. To support multiple nested function calls, the underlying RAS storage structure is a stack.

Conditional Branch Predictor (BPD). The BPD maintains a set of prediction and hysteresis tables to make taken/not-taken predictions based on a look-up address. The BPD only makes taken/not-taken predictions – it therefore relies on some other agent to provide information on what instructions are branches and what their targets are. The BPD can either use the BTB for this information or it can wait and decode the instructions themselves once they have been fetched from the instruction cache. Because the BPD does not store the expensive branch targets, it can be much denser and thus make more accurate predictions on the branch directions than the BTB – whereas each BTB entry may be 60 to 128 bits, the BPD may be as few as one or two bits per branch.7

A common archetype of BPD is the global history predictor. Global history predictors work by tracking the outcome of the last N branches in the program ("global") and hashing this history with the look-up address to compute a look-up index into the BPD prediction tables. For sophisticated BPDs, this hashing function can become quite complex. BOOM's predictor tables are placed into single-ported SRAMs. Although many prediction tables are conceptually "tall and skinny" matrices (thousands of 2- or 4-bit entries), a generator written in Chisel transforms the predictor tables into a square memory structure to best match the SRAMs provided by a memory compiler.

Figure 6.3 shows the pipeline organization of the frontend. We found a critical path in BOOMv1 to be the conditional branch predictor (BPD) making a prediction and redirecting the fetch instruction address in the F2 stage, as the BPD must first decode the newly fetched instructions and compute potential branch targets. For BOOMv2, we provide a full cycle to decode the instructions returning from the instruction cache and to compute branch targets (F2), and we perform the redirection in the F3 stage. We also provide a full cycle for the hash indexing function, which removes the hashing from the critical path of Next-PC selection. We have added the option for the BPD to continue to make predictions in F2 by using the BTB to provide the branch decode and target information. However, we found this path of accessing the prediction tables and redirecting the instruction stream in the same cycle to be too slow using the foundry-provided memory compilers.

Another critical path in the frontend was through the fully-associative, flip-flop-based BTB. We found roughly 40 entries to be the limit for a fully-associative BTB. We rewrote the BTB to be set-associative and designed it to target single-ported memory. We experimented with placing the tags in flip-flop-based memories and in SRAM; the SRAM synthesized at a slower design point but place-and-routed better.

7We are ignoring the fact that predictors such as TAGE [89] actually do store partial tags which can allow them to predict which instructions are branches.
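To make the global-history indexing described above concrete, the following Chisel sketch computes a gshare-style look-up index by folding a speculative global history register against bits of the look-up PC. The module name, widths, and the simple XOR hash are illustrative assumptions for exposition; they are not BOOM's actual predictor code.

import chisel3._
import chisel3.util._

// A minimal gshare-style index hash: the global history of recent branch
// outcomes is XORed with PC bits to form an index into the counter tables.
class GshareIndex(val idxBits: Int = 10) extends Module {
  val io = IO(new Bundle {
    val pc     = Input(UInt(64.W))    // look-up address of the fetch packet
    val update = Input(Bool())        // a branch outcome is being recorded
    val taken  = Input(Bool())        // the recorded branch direction
    val index  = Output(UInt(idxBits.W))
  })

  // Global history of the last idxBits branch outcomes.
  val history = RegInit(0.U(idxBits.W))
  when (io.update) {
    history := Cat(history(idxBits - 2, 0), io.taken)
  }

  // Drop the low two bits of the PC (instructions are 4-byte aligned in RV64G)
  // and fold the remaining bits against the history.
  io.index := history ^ io.pc(idxBits + 1, 2)
}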

6.3.2 Distributed Issue Windows

The issue window holds all inflight and un-executed micro-ops (uops). Each issue port selects from one of the available ready uops to be issued. Some processors, such as Intel's Sandy Bridge processor, use a "unified reservation station" where all uops are placed in a single issue window. Other processors provide each functional unit its own issue window with a single issue select port. Each has its benefits and its challenges.

The size of the issue window denotes the number of in-flight, un-executed instructions that can be selected for out-of-order execution. The larger the window, the more instructions the scheduler can attempt to re-order. For BOOM, the issue window is implemented as a collapsing queue to allow the oldest instructions to be compressed towards the top. For issue-select, a cascading priority encoder selects the oldest instruction that is ready to issue. This path is exacerbated either by increasing the number of entries to search across or by increasing the number of issue ports. For BOOMv1, our synthesizable implementation of a 20 entry issue window with three issue ports was found to be too aggressive, so we switched to three distributed issue windows with 16 entries each (separate windows for integer, memory, and floating-point operations). This removes issue-select from the critical path while also increasing the total number of instructions that can be scheduled. However, to maintain performance of executing two integer ALU instructions and one memory instruction per cycle, a common configuration of BOOM will use two issue-select ports on the integer issue window.
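As a rough illustration of the issue-select logic described above, the Chisel sketch below grants the oldest ready micro-op in an age-ordered (collapsed) window using a priority encoder; the entry interface and window size are simplified assumptions rather than BOOM's actual issue-slot implementation.

import chisel3._
import chisel3.util._

// Issue-select over an age-ordered window where entry 0 is the oldest.
// A priority encoder over the request vector therefore grants the oldest
// ready entry, mirroring the cascading priority encoder described above.
class IssueSelect(val numEntries: Int = 16) extends Module {
  val io = IO(new Bundle {
    val valid    = Input(Vec(numEntries, Bool()))  // entry holds a live uop
    val ready    = Input(Vec(numEntries, Bool()))  // uop's operands are ready
    val grant    = Output(Vec(numEntries, Bool())) // one-hot issue grant
    val anyGrant = Output(Bool())
  })

  // Entries that are both valid and ready request to be issued.
  val requests = VecInit((io.valid zip io.ready).map { case (v, r) => v && r }).asUInt

  // One-hot grant of the lowest-index (oldest) requester.
  val grantOH = PriorityEncoderOH(requests)

  io.grant    := VecInit(grantOH.asBools)
  io.anyGrant := requests.orR
}

A two-ported integer issue window, as mentioned above, would instantiate a second selector whose request vector masks out the first selector's grant.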

6.3.3 Register File Design

One of the critical components of an out-of-order processor, and one of the most resistant to synthesis efforts, is the multi-ported register file. As memory is expensive and time-consuming to access, modern processor architectures use registers to temporarily store their working set of data. These registers are aggregated into a register file. Instructions directly access the register file and send the data read out of the registers to the processor's functional units, and the resulting data is then written back to the register file. A modest processor that supports issuing simultaneously to two integer arithmetic units and a memory load/store unit requires 6 read ports and 3 write ports.

The register file in BOOMv1 provided many challenges – reading data out of the register file was a critical path, routing read data to functional units was a challenge for the routing tools, and the register file itself could not be synthesized without violating the foundry design rules. Both the number of registers and the number of ports further exacerbate the challenges of synthesizing the register file.

We took two different approaches to improving the register file. The first was purely microarchitectural. We split apart issue-select and register-read into two separate stages — issue-select is now given a full cycle to select and issue uops, and then another full cycle is given to read the operand data out of the register file. We lowered the register count by splitting up the unified physical register file into separate floating-point and integer register

files. This split also allowed us to reduce the read-port count by moving the three-operand fused multiply-add floating-point unit to the smaller floating-point register file.

The second path to improving the register file involved physical design. A significant problem in placing and routing a register file is the issue of shorts – a geometry violation in which two metal wires that should remain separate are physically attached. These shorts are caused by attempting to route too many wires to a relatively dense regfile array. BOOMv2's 70-entry integer register file of 6 read ports and 3 write ports comes to 4,480 bits, each needing 18 wires routed into and out of it. There is a mismatch between the synthesized array and the area needed to route all required wires, resulting in shorts. Although we were able to safely synthesize a 9-port register file in isolation with some guidance on the area placement and density targets, the routing tools were unable to physically place the synthesized register file without shorting wires when synthesizing the register file in place with the rest of the processor.

Instead, we opted to blackbox the Chisel register file and manually craft a register file bit out of foundry-provided standard cells. We then laid out each register bit in an array and let the placer automatically route the wires to and from each bit. While this fixed the wire shorting problem, the tri-state buffers struggled to drive each read wire across all 70 registers. We therefore implemented hierarchical bitlines; the bits are divided into clusters, tri-states drive the read ports inside of each cluster, and muxes select the read data across clusters. As a counterpoint, the smaller floating-point register file (three read ports, two write ports) is fully synthesized with no placement guidance.
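The sketch below shows, under assumed port names and widths, how such a hand-crafted register file array can be wrapped as a Chisel BlackBox so the rest of the generator instantiates it in place of the synthesized array; BOOM's actual blackboxed register file differs in its exact interface, and the matching implementation must be supplied to the VLSI flow as a separate Verilog module.

import chisel3._
import chisel3.util._

// Illustrative BlackBox wrapper for a hand-crafted register file array.
// BlackBoxes receive no implicit clock or reset, so the clock is explicit.
// A Verilog module with a matching name and ports is provided to the tools.
class RegFileArrayBB(numRegs: Int = 70, numRead: Int = 6, numWrite: Int = 3)
    extends BlackBox {
  val io = IO(new Bundle {
    val clock = Input(Clock())
    val raddr = Input(Vec(numRead, UInt(log2Ceil(numRegs).W)))   // read addresses
    val rdata = Output(Vec(numRead, UInt(64.W)))                 // read data
    val wen   = Input(Vec(numWrite, Bool()))                     // write enables
    val waddr = Input(Vec(numWrite, UInt(log2Ceil(numRegs).W)))  // write addresses
    val wdata = Input(Vec(numWrite, UInt(64.W)))                 // write data
  })
}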

6.4 Tapeout Methodology

Figure 6.5 shows all VLSI builds and their critical path lengths performed over a four-month period as part of the BROOM tapeout effort. The reported clock frequency is calculated from the critical path, which is the summation of the target clock period and the negative clock skew. Data from post-synthesis ("syn") and post-place-and-route ("par") are shown, and include builds performed at both the slow-slow ("SS") and typical-typical ("TT") corners. Early builds were only of a BOOM core plus an L2 cache, while later builds add in the resiliency ("res") hardware central to the main thesis of the BROOM chip. One should be careful about drawing conclusions from this figure; most builds resulted in LVS and DRC violations and many changes were made between each build. For example, early builds explored shrinking structure sizes to find the most fundamental critical paths, while later builds sought to find the upper limits of structure sizing before the post-place-and-route critical path noticeably worsened. Table 6.2 provides a brief description of each VLSI build that is shown in Figure 6.5.

The BROOM tapeout effort started with a preliminary analysis of BOOM's quality-of-results (QoR). This effort was performed using RVT-based cells and targeting the TT corner. By changing BOOM's configurations, we could build an intuition of what critical paths were

Figure 6.4: A Register File Bit manually crafted out of foundry-provided standard cells. Each read port provides a read-enable bit to signal a tri-state buffer to drive its port's read data line. The register file bits are laid out in an array for placement with guidance to the place tools. The tools are then allowed to automatically route the 18 wires into and out of each bit block.

Figure 6.5: All VLSI builds are shown by date. Both slow-slow (SS) and typical-typical (TT) corners are shown. RVT cells were used initially but replaced with LVT cells starting in July. In the last month of the implementation effort, we added in the resiliency hardware ("res") central to the research thesis of the chip, which added to the critical path. While our design efforts slowly improved the post-synthesis critical paths, post-place-and-route reports showed the clock frequency was less amenable to our efforts.

Figure 6.6: The slow-slow VLSI builds using LVT cells from July onwards are shown again and arranged by build number. Although the goal was a chip using resiliency techniques, we continued to build non-resilient designs to continue exploring critical paths in the BOOM core. Not shown is the impact of our design efforts on removing any LVS and DRC errors. Thus, many of the builds do not represent a manufacturable design.

truly critical and arrive at a plan of action for addressing these paths with a mixture of micro-architectural changes and physical design effort. For example, by removing an execution unit or shrinking the issue window size, we could better understand the benefits of design changes that would provide fewer issue ports per issue window. At this stage, we had concluded that four critical paths needed to be managed. As previously mentioned in Section 6.2, these critical paths were:

1. issue window select

2. register rename busy-table read

3. conditional branch predictor redirect

4. register file read

The micro-architectural changes to address the first two items together took one month. We also quickly prototyped a new frontend design that approximated a critical path fix for item three but was otherwise functionally incorrect. This frontend prototype helped justify the necessary design work before we committed to a full redesign of the frontend. We began testing these new changes in mid May and labeled the new design "BOOMv2". Figure 6.5 shows the clusters of activity that correspond to the BOOMv1 and early BOOMv2 analysis. After the initial BOOMv2 analysis was performed, another month of design effort went into BOOM to finish implementing the new frontend design and to apply changes based on the initial VLSI feedback.

Starting in late June, as the BOOMv2 RTL effort finished up, the implementation focus switched to physical design. Figure 6.6 shows the VLSI builds from July onwards that were performed using LVT-based cells at the slow-slow corner. Parameters in BOOM, for example the ROB size or the branch predictor sizing, were reduced to get a better feel for the fundamental critical paths that still required work and to find which modules had the greatest effect on DRC and LVS errors. At this stage, the clock frequency improved as the BOOM parameters were changed to instantiate a smaller BOOM core.

Once we were relatively happy with the BOOM micro-architecture, we added the resiliency hardware to the design. Many of these resiliency structures are on the critical paths of SRAM accesses. Thus, any VLSI builds with resiliency hardware enabled may generate analysis reports that hide critical paths that still need attention in the BOOM RTL. To allow improvements to both the resiliency structures and to the BOOM core to occur in parallel, we continued to perform VLSI builds with and without the resiliency hardware enabled (labeled "res" in Figures 6.5 and 6.6).

As our attention shifted to physical design issues, we began to explore a number of strategies for implementing a 6-read, 3-write register file. Although we attempted to use placement hints and floorplan scripts to help guide a synthesized register file design, we eventually settled on the semi-custom arrayed design discussed in Section 6.3.3 after coming to the conclusion that the place and route tools could not manage a synthesized 9-port register file without committing geometry errors.

For the final stage of the implementation effort, we focused on fixing LVS and DRC errors while continuing to make small improvements to the critical paths that showed up in the place and route reports. We also began to increase structure sizes in BOOM that were no longer on the critical path in the post–place-and-route reports. For example, we quadrupled the size of the branch predictor.

Finally, on August 14th, we taped out an LVS-clean, DRC-sane design based on the submission deadline provided by TSMC. Up to (and past) that deadline, we continued making changes to the RTL to improve the QoR. Each additional build continued to provide us new critical paths to address. The final critical path of the placed-and-routed design was through the resiliency error logging code. More time would have allowed us to continue to improve both the clock frequency and the IPC performance of BROOM. With enough time available, we could have focused on fixing more systemic critical paths, particularly the write-back path for data returning from the data cache.
Fixing this path would go hand-in-hand with fixing the IPC performance problem introduced by the longer load-to-use latency. However, the rush to produce an LVS-clean, DRC-sane design that would be relatively free of RTL logic errors prevented us from committing to riskier and more invasive fixes. Future VLSI implementation efforts can start from a known, good design point and can avoid the early exploratory builds that were needed for the BROOM tapeout.

6.5 Lessons Learned

The process of taking a register-transfer-level (RTL) design all the way through a modern VLSI tool flow has proven to be a very valuable experience.

High-port-count memories and highly congested wire routing are likely to require microarchitectural solutions. Dealing with the critical paths created by memories required microarchitectural changes that likely hurt IPC, which in turn motivated further microarchitectural changes. Lacking access to faithful memory models and layout early in the design process was a serious handicap. A manually-crafted cell approach is useful for exploring the design space before committing to a more custom design.

Memory timings are sensitive to their aspect ratios; tall, skinny memories do not work. We wrote Chisel generators to automatically translate large-aspect-ratio memories into rectangular structures by changing the index hashing functions and utilizing bit-masking of reads and writes.

Chasing down and fixing all critical paths can be a fool's errand. The most dominating critical path was the register file read, as measured from post-place-and-route analysis. Fixing critical paths discovered from post-synthesis analysis may have only served to worsen IPC for little discernible gain in the final chip.
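As a rough sketch of the aspect-ratio rewriting described above (and not BOOM's actual generator), the Chisel fragment below folds a tall, skinny table of 2-bit counters into a shallower, wider SyncReadMem: the low bits of the logical index select a column within a wide row, and writes use a per-column mask so only that counter is updated.

import chisel3._
import chisel3.util._

// A logically tall, skinny table stored in a wider, shallower SRAM.
class SquarifiedTable(numEntries: Int = 4096, entryBits: Int = 2, cols: Int = 8)
    extends Module {
  require(isPow2(cols) && numEntries % cols == 0)
  val rows = numEntries / cols

  val io = IO(new Bundle {
    val ridx  = Input(UInt(log2Ceil(numEntries).W))
    val rdata = Output(UInt(entryBits.W))
    val wen   = Input(Bool())
    val widx  = Input(UInt(log2Ceil(numEntries).W))
    val wdata = Input(UInt(entryBits.W))
  })

  // One wide row holds `cols` logical entries.
  val mem = SyncReadMem(rows, Vec(cols, UInt(entryBits.W)))

  // Split a logical index into a physical (row, column) pair.
  def row(idx: UInt) = idx >> log2Ceil(cols)
  def col(idx: UInt) = idx(log2Ceil(cols) - 1, 0)

  // Masked write touches only the selected column of the row.
  val wmask = UIntToOH(col(io.widx), cols).asBools
  when (io.wen) {
    mem.write(row(io.widx), VecInit(Seq.fill(cols)(io.wdata)), wmask)
  }

  // The synchronous read returns the whole row one cycle later, so the
  // column-select must be registered to line up with the read data.
  val rrow = mem.read(row(io.ridx))
  io.rdata := rrow(RegNext(col(io.ridx)))
}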

Table 6.2: The critical path length is reported from each of the VLSI builds during the BROOM tapeout. The critical path is the summation of the target clock period and the negative clock skew. Many changes were made between each run; only some of the major points are documented here. Run #90 is the final design that was submitted to TSMC for fabrication. Most of the other builds have violations.

                                          Post-synthesis    Post-place-and-route
Run  Corner  Resiliency  Description      Cycle Time (ns)   Cycle Time (ns)
1 TT - first attempt using BOOMv1 0.904
2 TT - retime idiv unit 0.923
3 TT - retime idiv unit 0.892
4 TT - reduce BTB entry count 0.860
5 TT - disable BPD & fdivsqrt units 0.879
6 TT - reduce issue window to 4 entries 0.966
7 TT - begin using early prototype of BOOMv2 1.245
8 TT - 1.018
9 TT - issue-width=1 (timing met) 1.000
10 TT - reduce cycle time 0.989
11 TT - fetch latency=3 0.889
12 TT - fdivsqrt disabled 0.892
13 TT - issue-width=2 0.893
14 TT - integer issue window entries=10 0.853
15 TT - decrease integer & fp issue window sizes 0.850
16 TT - issue-width=2 0.908
17 TT - 1ns clock (syn timing met) 1.000
18 TT - 1ns clock (syn timing met) 1.000 1.855
19 TT - pipeline register read 0.983
20 TT - add retiming to rename stage 0.908
21 TT - retime & ungroup rename stage 0.894
22 TT - rearrange constraint file 0.933
23 TT - add compile ultra -retime flag 0.900
24 SS - move to SS corner 1.376 1.792
25 SS - Add resiliency 1.470 2.190
26 SS - fixes to floorplan 1.830
27 SS - multi-voltage domain 1.400 1.840
28 SS - loose cycle time 1.378 1.840
29 SS - loose cycle time with imul retiming 1.416
30 SS - change int RF size to 70 registers 1.377
31 SS - move to LVT std cells 1.206 1.540
32 SS - feed in floorplan to DC 1.228 1.459
33 SS - add -congestion in clock opt psyn flag 1.228 1.460
34 SS - change int RF size to 100 registers 1.224 1.448
35 SS - fixed top connection (passes RTL simulation) 1.206 1.520
36 SS - feed in new floorplan (2col4row for dcache) 1.720 1.460
37 SS - new set-associative BTB 1.300
38 SS - 1.413
39 SS - retimed BPD pipeline and fetch unit w/o fp 1.256
40 SS - with fp 1.450
41 SS - use flip-flops for BTB 1.188 1.550
42 SS - use LVT SRAM for BTB 1.265 1.470
43 SS yes add in resiliency with BOOMv1 1.495 1.890
44 SS yes add in resiliency with BOOMv2 1.610
45 SS - updated BTB w/ SRAM-based gshare BPD 1.367

                                          Post-synthesis    Post-place-and-route
Run  Corner  Resiliency  Description      Cycle Time (ns)   Cycle Time (ns)
46 SS - p table using FF; (BTB tag is also FF) 1.204 1.650
47 SS - try to remove a dependence on bpd request 1.290
48 SS - RAS bypass calls = false 1.234
49 SS - 1.2 ns clock 1.290
50 SS - 1.1 ns clock 1.263
51 SS yes 1.540
52 SS - 1.0 ns clock 1.269
53 SS - remove internal w→r bypass 1.260
54 SS - with fp 1.155
55 SS - RAS entries=0 1.225
56 SS - RAS entries=0 1.200
57 SS - RAS entries=8; with fp 1.200 1.500
58 SS yes remove I$ s1 disparity dependency 1.449
59 SS yes with fp 1.369
60 SS yes D$ bypass ecc (decode.uncorrected) 1.360
61 SS yes I$ s1 tag disparity 1.328
62 SS yes Set caches to ways=4; disable ECC 1.232
63 SS yes use earlier commit 1.420 1.754
64 SS yes caches to ways=4; use SRAM for BPD and BTB 1.209 1.730
65 TT - TT corner 0.783
66 SS - SS corner with new RTL commits 1.134 1.582
67 SS yes use new custom RF array; (logic error; coupled iss+rrd stages) 1.266 2.282
68 SS - with new RTL commits 1.174
69 SS yes regfile array without muxing (with correct raddr register) 1.266 1.799
70 SS yes use hierarchical bitlines 1.230 1.640
71 SS yes use hierarchical bitlines; rerun previous 1.230 1.629
72 SS - add BPD prediction in F3 1.207
73 SS - gshare history length increased to 13 bits 1.247
74 SS - remove BPD F2 redirect 1.210
75 SS yes add BPD prediction in F3; retime alu 1.437 1.632
76 SS - Fix correctness bug in BPD 1.175
77 SS yes use flip-flops for BTB tags 1.213
78 SS yes retime LSU; set history to 13b; BTB set to 2 ways 1.345 1.744
79 SS yes 1.641
80 SS yes remove retiming on LSU; separate bb array; ECC logging 1.110 1.554
81 TT yes 0.913
82 SS yes halve gshare's htable; error logging changes 1.179 1.649
83 SS yes fix LSU 1.137 1.640
84 SS yes ECC logging fixes 1.156 1.705
85 SS yes fix critical path on valid bit 1.169
86 SS yes add L2 miss counter 1.113
87 SS yes BTB update 1.133
88 SS yes delay write to fp regfile 1.103
89 SS yes fix L1 I$ bb array connection 1.196 1.779
90 SS yes final tapeout version 1.174 1.680
91 TT yes after tapeout fixes; remove timing analysis for ECC logging; fix placement in L2 tags 1.223 1.633
92 SS - delay write to fp RF; use flip-flops for BTB tags; use synthesized int regfile 1.072 1.634
93 SS - use SRAM for BTB tags; use custom int regfile 1.085
94 SS - set dont touch to the regfile 1.096 1.485
95 TT - TT corner 0.733
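For reference, the clock frequencies plotted in Figures 6.5 and 6.6 follow directly from these critical-path numbers. The small helper below (a hypothetical convenience for this discussion, not part of the BOOM code base) performs the conversion; for example, the final tapeout build (run #90) at 1.680 ns post-place-and-route corresponds to roughly 595 MHz at the slow-slow corner.

// Convert a critical path (target clock period plus negative clock skew,
// in nanoseconds) into a clock frequency in MHz.
object CriticalPath {
  def frequencyMHz(criticalPathNs: Double): Double = 1000.0 / criticalPathNs

  def main(args: Array[String]): Unit = {
    // Run #90, post-place-and-route, SS corner (see Table 6.2).
    println(f"${frequencyMHz(1.680)}%.0f MHz")
  }
}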

Describing hardware using generators proved to be a very useful technique; multiple design points could be generated and evaluated, and the final design choices could be committed to later in the design cycle. We could also increase our confidence that particular critical paths were worth pursuing; by removing functional units and register read ports, we could estimate the improvement from microarchitectural techniques that would reduce port counts on the issue windows and register file.

Chisel is a wonderfully expressive language. With proper software engineering of the code base, radical changes to the datapaths can be made very quickly. Splitting up the register files and issue windows was a one-week effort, and pipelining the register-rename stage was another week. However, physical design is a stumbling block to agile hardware development. Small changes could be reasoned about and executed swiftly, but larger changes could change the physical layout of the chip and dramatically affect critical paths and the associated costs of the new design point.

6.6 What Does It Take To Go Really Fast?

A number of challenges exist to push BOOM below 35 FO4. First, the L1 instruction and data caches would need to be redesigned. Both caches return data after a single cycle (they can maintain a throughput of one request per cycle to any address that hits in the cache). This path is roughly 35 FO4. A few techniques exist that increase clock frequency but also increase the latency of cache accesses. For this analysis, we used regular threshold voltage (RVT)-based SRAM. However, the BTB is a crucial performance structure typically designed to be accessed and used to make a prediction within a single cycle, and it is thus a prime suspect for additional custom effort. Another solution is to increase the latency of BTB predictions and accept a bubble inserted into the fetch pipeline on taken branch predictions.

There are many other structures that are often the focus of manual attention: functional units; content-addressable memories (CAMs), which are crucial for many structures in out-of-order processors such as the load/store unit or translation lookaside buffers (TLBs) [47]; and the issue-select logic, which can dictate how large an issue window can be deployed and ultimately how many instructions can be inflight in an out-of-order processor. However, any techniques to increase BOOM's clock frequency will have to be balanced against decreasing the IPC performance. For example, BOOM's new front-end suffers from additional bubbles even on correct branch predictions. Additional strategies will need to be employed to remove these bubbles when predictors are predicting correctly [94].

6.7 Conclusion

Modern out-of-order processors rely on a number of memory macros and arrays of different shapes and sizes, and many of them appear in the critical path. The impact on the actual critical path is hard to assess by using flip-flop-based arrays and academic/educational modeling tools, because they may either yield physically unimplementable designs or generate designs with poor performance and power characteristics. Re-architecting the design by relying on a hand-crafted, yet synthesizable, register file array and leveraging hardware generators written in Chisel helped us isolate real critical paths from false ones. This methodology narrows down the range of arrays that would eventually have to be handcrafted for a serious production-quality implementation.

As BOOMv2 has largely been an effort in improving the critical path of BOOM, there has been an expected drop in instructions-per-cycle (IPC) performance. Using the CoreMark benchmark, we witnessed up to a 20% drop in IPC based on the parameters listed in Table 6.1. Over half of this performance degradation is due to the increased latency between load instructions and any dependent instructions. There are a number of available techniques to address this that BOOM does not currently employ. However, BOOMv2 adds a number of parameter options that allow the designer to configure the pipeline depth of the register renaming and register-read components, allowing BOOMv2 to recapture most of the lost IPC in exchange for an increased clock period.

Chapter 7

Conclusion

Details matter. Continued innovation in the processor space must come from performance improvements that are power-neutral, or power improvements that increase the headroom to allow for more aggressive performance techniques. Building real systems not only provides a stable basis from which to perform detailed performance and power analyses, but also provides deeper insight and intuition that is molded by exposure to both low-level details and cross-cutting issues. As one example, in BOOM the interplay between the issue queue select ports, the register file read ports, and large-scale memory array process technology drove us to re-evaluate and re-implement multiple parts of BOOM once we had access to a real commercial tool-flow. Mimicking designs from older generations — in which different implementation methodologies and logic styles were used — left us under-informed compared to the knowledge we gained in analyzing our own designs using a modern tool-flow.

Some of the most exciting research has been built upon the works of others. As recent examples, the Rocket-chip SoC has shown up in the first silicon-photonics-enabled microprocessor (MIT, CU Boulder, Berkeley) [100] and in the 511-core Celerity processor (UCSD, Cornell, Michigan, and UCLA) [4]. Similarly, the open-source UltraSPARC T1 processor from Sun has been used to help build the 25-core OpenPiton processor (Princeton) [11], and the UltraSPARC T2 was used as part of the heterogeneous Fabscalar chip (NCSU) [84]. Likewise, this work has been made possible in part by leveraging existing infrastructure newly available in the open-source hardware ecosystem; namely, the RISC-V Instruction Set Architecture, the Chisel hardware construction language, and the Rocket-chip SoC generator.

Hopefully, BOOM can serve as a platform for the next set of implementers and researchers. By providing a detailed implementation of a high-performance, general-purpose, open-source processor, we hope to enable researchers to pursue new avenues of study that require full systems that can provide a higher fidelity of results. And hopefully, they too can learn from it as much as I have.

7.1 Contributions

This thesis includes the following contributions:

• A complete implementation of a superscalar, out-of-order processor generator — We built a superscalar, out-of-order processor generator called BOOM that is capable of booting the Linux operating system and running user-level applications. Many of the details in implementing a full system kept us honest. The floating-point support is often a missing piece in academic projects, but it served to motivate many of the challenges and learning experiences in building a parameterizable generator capable of running general-purpose codes. Many corners can be cut in integer-only cores, but floating-point instructions begin to stretch the boundaries of "RISC" ISAs. These extra challenges include three register operands, longer instruction latencies, implicit condition codes, condition flag side-effects, and register moves between a new set of architectural registers.

• A competitive implementation — BOOM achieves comparable (or better) branch prediction accuracy and instructions-per-cycle performance relative to similarly-sized industry out-of-order processors. The pursuit of a competitive processor led us to focus on new areas we would have otherwise avoided. We needed to support real benchmarks, which forced us to support floating-point instructions and atomics. We also spent considerable time on branch prediction — without a competitive branch predictor, no other improvements would have mattered. And unfortunately, if a research processor significantly lags in performance behind commercial offerings, many studies using it become less informative.

• A productive implementation — We demonstrated our productive processor generator design by implementing it using only 16k lines of code. In order for others to build off of our work, our platform must be easy to understand, easy to modify, and easy to extend.

• Demonstrated productivity with an agile tape-out — We further demonstrated our productivity and agility by making significant micro-architectural design changes as part of a two-person tape-out performed over four months. We demonstrated that it is possible to take the BOOM Chisel design to silicon, and that Chisel does not preclude lower-level physical design optimizations. We showed both what kinds of changes can be made quickly using Chisel and what physical design hurdles still remain. Hopefully, this effort paves the way for others to take BOOM to silicon too.

7.2 Future Directions

As BOOM is an open-source processor generator, there is an opportunity for others to take BOOM in new directions, to use it to augment their own research, or to simply improve and contribute back to BOOM. This section lists some of the opportunities and interesting questions that BOOM can help explore.

Implementing new RISC-V instruction extensions BOOM is a processor generator that implements the "G" general-purpose subset of the RISC-V ISA. However, the RISC-V Foundation is actively developing new extensions for RISC-V, some of which may radically alter the micro-architecture of future "general-purpose" processors. For example, all contemporary high-performance, general-purpose commercial processors implement some form of data-parallel acceleration, typically in the form of SIMD (Single Instruction Multiple Data), which allows one instruction to operate on multiple elements in parallel. Even for general-purpose codes, SIMD instructions can be used to accelerate the zeroing and copying of large arrays in memory. The RISC-V Foundation is developing a Vector Extension that will provide the same benefits enjoyed by other ISAs such as the x86 SSE and AVX SIMD extensions and ARM's NEON extension. However, the current RISC-V Vector Extension proposal provides new opportunities and complexity challenges beyond SSE, AVX, and NEON. Some of these challenges include its reconfigurable vector length, its polymorphic encoding, its usage of a vector data register as an implicit predicate register, and its reconfigurable scalar, vector, and matrix register shapes. How best to address these challenges and map a competitive design to an out-of-order pipeline is an interesting challenge. Other extensions have been proposed, some of which may map poorly to an out-of-order pipeline. Some of these proposed instructions include loop-count instructions, overflow arithmetic, and integer-packed SIMD to help support applications such as specialized digital signal processing (DSP), managed run-time languages, and low-power energy-efficient computing.

Co-processor and accelerator interfaces The Rocket in-order processor has found significant success in part by providing a co-processor/accelerator interface called RoCC. Example accelerators that have used the RoCC interface include the Hwacha vector-thread processor [106], the DANA neural network accelerator [30], and the Celerity binarized neural network (BNN) specialized accelerator [4]. Implementing the RoCC interface in BOOM would provide researchers a higher-performance control processor for their research. It would also open up new and interesting research questions regarding the interactions between out-of-order cores and their co-processors. For example, what would be the performance benefits of implementing the RISC-V Vector Extension as a RoCC co-processor (in which all RoCC commands are sent at instruction commit) versus a more tightly integrated vector unit within BOOM that exercises register renaming and speculative out-of-order vector execution?

Design verification Chisel has facilitated rapid and agile RTL implementation efforts. However, design verification is still a significant obstacle, and the best approaches to reducing the verification load are still an open question. As a complex IP block, BOOM can serve as a platform of study for future verification efforts. As an example of such early-stage efforts, the most recent bug found in BOOM manifested at over 400 billion cycles into simulation. Finding this bug required a sophisticated FPGA-accelerated simulation framework, built upon the Strober infrastructure [56], that used synthesizable assertions to find the error and a state snapshot-and-replay capability to extract a waveform from the failing scenario [8].
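As a small illustration of the synthesizable assertions mentioned above (a generic property, not the actual check that exposed the 400-billion-cycle bug), a Chisel design can embed run-time checks that an FPGA-accelerated framework can monitor and use to trigger state capture:

import chisel3._
import chisel3.util._

// A generic synthesizable assertion: a grant vector that is supposed to be
// one-hot must never have more than one bit set. A violation during
// simulation (or FPGA-accelerated emulation) stops and reports the failure.
class OneHotChecker(n: Int = 4) extends Module {
  val io = IO(new Bundle { val grant = Input(UInt(n.W)) })
  assert(PopCount(io.grant) <= 1.U, "grant vector must be one-hot or empty")
}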

Agile physical design Similar to design verification, physical design is also a significant hurdle to cheaper, more agile hardware development. Although Chisel allows for RTL changes to be made quickly, some RTL changes can have significant effects on the physical design. For example, an added memory block can take up too much area and cause routing congestion. Human intervention may be needed to guide changes to the floor-plan to alleviate the new routing congestion, or worse, a micro-architectural change at the RTL level may be necessary to reduce timing pressure. Large-scale memory arrays pose another problem for agile physical design. There remains significant value in crafting custom memory arrays over foundry-provided memory compilers, which unfortunately can require designers to make binding decisions earlier in the design process than they would normally prefer.

Design space exploration (DSE) A challenge with hardware generators is how the designer chooses a specific parameter set to instantiate in silicon. A design like BOOM can provide enough configuration options to create a design space of millions of unique designs. As each instance requires hours to evaluate, and CAD tool-flows are too expensive to allow for significant parallelism of this analysis, care must be taken to decide which design points are worth pursuing. Ideally, an intelligent DSE search algorithm would only choose points that were likely to appear near the Pareto-optimal frontier.

Performance improvements With five graduate-student-years of design effort, BOOM achieves a modest level of performance that is on par with some of the older out-of-order designs that are in ARM's current IP catalogue. However, there is significant headroom in improving upon BOOM's performance that can be explored. A few areas for performance improvement include data prefetching, larger issue queue designs, memory disambiguation predictors, accelerated atomic and fence instruction execution, a wider superscalar width, and a more aggressive instruction fetch unit.

7.3 Final Remarks

The end of Moore’s Law and Dennard Scaling brings us into a period that is both scary and exciting. Manufacturing process technology alone can no longer give us better chips. Instead, the onus is on the architects and the micro-architects to continue taking progress to new heights.

Architects have a lot to be thankful for. Silicon transistors are incredibly small, fast, plentiful, and cheap. Commodity chips routinely ship with hundreds of millions, even billions, of transistors running at a few gigahertz with air-cooling. New fields are growing into billion-dollar industries: search, mobile, automotive, and internet-of-things demand increased focus on the topics of machine learning, inference, sensing, and processing. These new applications are motivating new innovations in architecture and micro-architecture, demanding novel and highly-specialized accelerators to extract the best performance for a given power envelope.

One of the challenging aspects of a more specialized future is that chip companies must cope with concurrently shipping more and increasingly specialized designs, while also shipping fewer units of any particular design. This diminishing ability to amortize the design and verification costs requires design teams to achieve more with fewer resources. This suggests a future of hardware design that will likely focus on enabling implementers to be more agile and more able to cheaply explore new designs. However, implementation is not the only challenging part of chip design: physical design and verification will need to undergo similar revolutions in agile methodology. FPGAs have gotten small enough to provide a relatively cheap option for companies and hobbyists to prototype hardware designs, reducing some of the risk in taking designs to silicon. This option has also opened up the door for creators to share and use hardware IP without the expenses of silicon fabrication, ushering in a new period of open-source hardware.

The slowing down of Moore's Law and the longer process node periods also provide another hidden benefit: slow and steady can win the race. Previously, process iterations were so quick that designs had to begin conception well in advance of that particular process becoming commoditized. The period between design conception and product delivery had to be short enough to avoid targeting a process that became obsolete by a product's launch [48]. With more time between process nodes, more knowledge can be built up on each node. Designs also improve more slowly, as they must rely on micro-architectural innovation to realize continued performance improvements. It is now more possible than ever for academics and hobbyists to catch up to industry designs.

The out-of-order core is just one piece of the future computing landscape. While well-suited for general-purpose applications, out-of-order cores will need to interface with specialized accelerators and devices to deliver the best use of the limited transistor and power budgets. Hopefully, BOOM can play a small, but important, role in this future.

Appendix A

A Selection of Encountered Bugs

Possibly the biggest gamble that was made in the course of this thesis was choosing a project that had to actually function correctly! In most micro-architectural software simulator implementations, the functional model and the timing model are separated such that any errors in the timing model — where the research ideas are implemented — do not affect the correctness of the workloads being analyzed. But as BOOM is a processor, there is no such functional/timing model split. Thus, if the implementation of a research idea affects the technical correctness of BOOM, then the benchmarks will likely crash. This is not conducive to productive research!

In that spirit, this section shares just a few of the bugs that were encountered during the development of BOOM. At a minimum, we hope that this section can encourage and inspire work in validation and testing. The following bugs cover a range of categories and are ordered roughly in the chronological order in which they were found. Some of these bugs were found quickly after implementing new features, while other bugs lingered and took a fine-tooth comb to suss out. Some were the fault of typographical or copy/paste mistakes. Some were bugs in other libraries — a few were found in the berkeley-hardfloat units and one was encountered in Chisel 1.0's simulation library. One of the more common classes of bug in BOOM involved the failure to properly kill a misspeculated instruction, which was often due to failing to update an instruction's branch-mask properly or failing to check if the instruction should be killed. These supposed-to-be-killed instructions would often bounce around the processor and wreak havoc on the micro-architectural state. Another common class of errors was caused by changing interfaces, in which signals could be left dangling or in which contracts between modules changed, particularly with interfaces to external projects like Rocket-chip.1

Finding many of these bugs has been slow and tedious. Some bugs were found using the riscv-torture fuzzer [82] while others required using assertions to catch them. Performance bugs were especially difficult to track down, but thankfully are less important than correctness bugs. The worst bugs to solve are those that only arise when executing long-running applications that simply hang! A number of techniques were used to find the

1The latest Chisel 3.0.0 provides stronger protections against dangling I/O signals.

source of these bugs. Assertions were the favored method, but the comparison of commit logs, waveforms, and even printfs were heavily utilized. Having an existing processor in Rocket to compare against often proved invaluable. The spike ISA simulator was another valuable tool; however, it proved to be surprisingly difficult to get spike and BOOM to agree exactly on the trace of committed instructions. Many valid differences made it difficult to compare BOOM to spike, including the load-reserve/store-conditional instructions, timer interrupts, external stimuli, and platform deviations. Recent research has been exploring FPGA-accelerated validation as a new tool for validating Chisel-based designs executing very long workloads [8].

Surprisingly poor performance exhibited — An early implementation of a two-wide superscalar issue design was causing any given ALU operation to be issued to both ALU functional units. This mistake turned a dual-issue machine into a very expensive single-issue machine. The bug was caused by the first issue selector not properly informing the second issue selector that it had chosen that particular instruction already.

Multiple branches sharing the same branch tag — A full branch mask was not properly stalling the Decode stage until more branch snapshots could be allocated. This bug caused new branches to be allocated old (but still in-use) branch tags and chaos to break loose when one of the newer branches misspeculated and killed instructions older than itself. This error was caused by a typo: dis_mask (dispatch-stage valid mask) was being used instead of dec_mask (decode-stage valid mask), which corresponds to the valid bits in a different stage in the processor.

Failing simple arithmetic tests — Not having enough bits specified in the width of the ALU control signals caused the ListLookup decode table to generate incorrect logic.

Fetching incorrect instruction bits — The instruction cache improperly latched a request when suffering from a cache miss. This error was only exercised under two conditions: 1) when the data cache also suffered a miss, causing back-pressure to the instruction cache as only one request is allowed to go off-tile per cycle, and 2) when a branch misspeculation occurred on the next cycle, causing a new and different instruction to be requested. The original instruction bits eventually returned but were given the new instruction's tag.

Some instructions got skipped and were never executed after an exception — The ROB can commit multiple instructions per cycle. In this case, the ROB sees an exception in the commit bundle and takes it, but fails to notice the busy instructions that are older than the excepting instruction that must be allowed to finish first. These inflight instructions would get lost as the exception is taken too early.

The exception pc was being set improperly — On an exception, the epc was set to the aligned pc, and not the pc of the excepting instruction. This bug was hard to notice because 1) often the aligned pc matched the excepting pc, and 2) the instructions starting at the aligned pc were typically idempotent.

The wrong store data could get bypassed to the dependent load — If two stores at the same address existed in the Store Queue, it was possible for the load to get data from the wrong store. This error was typographic: the sdq_val signal (“is the store data valid?”) was used to help generate the address-match bit-vector instead of the saq_val signal (“is the store address valid?”).

Load misspeculation error — A store checks the Load Address Queue (LAQ) for any younger loads that have incorrectly executed ahead of the store. If multiple loads fail, only the oldest load needs to be tracked since its pipeline flush will dominate the younger load. However, in this error, a fixed priority encoder was used instead of an age-based priority encoder, causing the wrong load to be marked as having failed. The older load would then proceed with having incorrect data and the younger load would be replayed. A commit log comparison against an ISA simulator allowed us to find the specific load that was receiving stale data.

Iterative muldiv unit writing bad data to the register file — If the iterative muldiv unit received a request on the same cycle that the request was killed, the request incorrectly continued executing and would eventually write back to a physical register. However, as the iterative unit can take tens of cycles to resolve, it was likely that that particular physical register would be added to the freelist and then reallocated to a new instruction. This new instruction would then receive a garbage value.
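
A minimal sketch of the control fix in Chisel (the signal names and the single busy bit are illustrative; the real unit tracks more state): the kill condition must be honored on the very cycle the request arrives, not only while the unit is busy.

    import chisel3._

    // Sketch: request acceptance for a multi-cycle (iterative) unit.
    class IterativeUnitCtrl extends Module {
      val io = IO(new Bundle {
        val reqValid  = Input(Bool())
        val reqKilled = Input(Bool())  // kill arriving in the same cycle as the request
        val killBusy  = Input(Bool())  // kill targeting an op already in flight
        val done      = Input(Bool())
        val busy      = Output(Bool())
      })
      val busyReg = RegInit(false.B)
      // The bug: accepting io.reqValid without checking io.reqKilled, so a dead
      // op ran for tens of cycles and later wrote back to a reallocated register.
      val accept = io.reqValid && !io.reqKilled && !busyReg
      when(accept)                 { busyReg := true.B  }
      when(io.done || io.killBusy) { busyReg := false.B }
      io.busy := busyReg
    }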

Incorrect value from muldiv unit — When dividing by zero, the integer muldiv unit incorrectly returned the absolute value of x instead of x.

An ECALL instruction crashes the C++ simulator process — A fetch packet contains a (branch, ecall) pair. The Decode stage does not properly stall the ECALL instruction until the branch direction is resolved. Although the branch is supposed to direct the instruction stream around the ECALL instruction, the ECALL instruction executes with a bad syscall number, an invalid address access is attempted in the host memory, and the simulator process crashes.

401.bzip2 suffers a hung pipeline 33 million cycles in — A branch is misspeculated, but a load on this misspeculated path still writes back data to the register file and clears its ROB busy-bit. This problematic load had been sleeping in the LAQ and retried on the same cycle as the branch was resolved. The load fails to get killed and instead returns its data from a matching (and non-speculative) store; had the load tried to get its data from the data cache, it would have been properly killed. The error was caused by reading the wrong stage's branch mask: the correct signal was l_uop.br_mask, not r_mem_uop.br_mask.
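
The kill decision itself is a one-line mask intersection; the sketch below (with invented names) only illustrates that the mask must belong to the load being written back, not to a micro-op in some other pipeline stage.

    import chisel3._

    // Sketch: is the load writing back this cycle on a misspeculated path?
    class WritebackKill(maxBrCount: Int) extends Module {
      val io = IO(new Bundle {
        val loadBrMask     = Input(UInt(maxBrCount.W)) // unresolved branches this load depends on
        val mispredictMask = Input(UInt(maxBrCount.W)) // branches resolving as mispredicted now
        val kill           = Output(Bool())
      })
      // Reading another stage's branch mask here (the original typo) lets a
      // dead load slip through and clear its ROB busy bit.
      io.kill := (io.loadBrMask & io.mispredictMask).orR
    }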

Rocket BTB bug leads to incorrect instruction PC trace — The BTB was predicting a taken branch as incorrectly jumping to PC+4 instead of its proper target. In actuality, the BTB should have either predicted not-taken, or predicted taken with the correct target. The Rocket in-order core only checks that the expected target matches the actual target, so it was satisfied with this outcome of a “taken” branch to PC+4. BOOM checked only the branch direction, saw taken, and decided the branch had been correctly speculated.

Trying to run Linux for the first time; throws misaligned fetch exception — A JALR instruction jumps to the wrong address. Linux was the first program BOOM ran that is placed at “negative” addresses. The error was a lack of sign extension of the PC when used as an operand for the AUIPC instruction.
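
A tiny Chisel sketch of the fix (widths and names are illustrative): a PC stored in fewer bits than XLEN must be sign-extended, not zero-extended, before it is used as the AUIPC operand.

    import chisel3._

    // Sketch: forming the AUIPC operand from a narrower stored PC.
    class AuipcOperand(pcBits: Int = 40, xlen: Int = 64) extends Module {
      val io = IO(new Bundle {
        val pc     = Input(UInt(pcBits.W))
        val imm    = Input(SInt(xlen.W))   // U-type immediate, already shifted left by 12
        val result = Output(UInt(xlen.W))
      })
      val pcExt = Wire(SInt(xlen.W))
      pcExt := io.pc.asSInt                // sign-extends from pcBits up to xlen
      io.result := (pcExt + io.imm).asUInt // zero-extending io.pc here was the bug
    }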

Linux hangs; spinning on an exception to an unknown instruction — A timer interrupt incorrectly jumps to a faulting page when attempting to execute the interrupt handler. The PC for evec (exception vector) was not being properly sign-extended.

The dhrystone benchmark was printing out “%” instead of carriage returns — The BTB predicted not-taken for a fetch packet containing a (branch, jump) pair. The frontend failed to redirect the unconditional jump, as it decided not to override the BTB's prediction of the branch that preceded it.

Increasing the BTB beyond 64 entries causes target program crashes — The BTB's associative search generates a one-hot encoded bit vector. A bug in early Chisel 1.0's C++ multiword simulation library mishandled one-hot encodings that were larger than the native register width of the host machine (64 bits in this case).

BOOM hangs; AMO never succeeds — The Inflight Load Queue (IFLQ) was leaking entries and eventually filled up completely, preventing forward progress. The problem was that when the data cache nacked an AMO, the AMO’s entry in the IFLQ was not cleared as it was not a “load”.

BOOM believes that +0.0 and −0.0 are different — Multiplying a negative 0.0 could cause the raw bits in the 65-bit recoded floating-point format to mismatch against an unscaled 0.0. This was a bug in the berkeley-hardfloat comparison units. Similarly, infinities could also be scaled and erroneously mismatch. Although these errors were found using torture, which generates a small test program when a failure is detected, this error was annoyingly difficult to reproduce because smaller snippets from the failing torture test failed to reproduce the error. To reproduce it, the test program needed to multiply using a number like 3 that would change the mantissa.

Unreasonably bad performance when load misspeculations occurred — When running large programs with a non-zero number of store-load ordering misspeculations, performance was significantly worse than expected. It was noticed that branch prediction accuracy correlated strongly with the frequency of store-load ordering misspeculations. Each ordering failure causes a pipeline flush and retry. The eventual culprit was that the global history of the branch predictor was not being properly rolled back when taking these “mini-exceptions.” The fix was to maintain a commit checkpoint copy of the global history.
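
A condensed Chisel sketch of the fix (a real predictor also repairs history on branch mispredictions and manages more state; the names here are illustrative): alongside the speculative history register, keep a copy updated only at commit, and restore from it on any such pipeline flush.

    import chisel3._
    import chisel3.util._

    // Sketch: speculative global history plus a commit-time checkpoint copy.
    class GlobalHistory(histLen: Int) extends Module {
      val io = IO(new Bundle {
        val predValid   = Input(Bool())  // a branch is predicted this cycle
        val predTaken   = Input(Bool())
        val commitValid = Input(Bool())  // a branch commits this cycle
        val commitTaken = Input(Bool())
        val flush       = Input(Bool())  // e.g. a store-load ordering failure
        val history     = Output(UInt(histLen.W))
      })
      val specHist   = RegInit(0.U(histLen.W))
      val commitHist = RegInit(0.U(histLen.W))
      when(io.commitValid) { commitHist := Cat(commitHist(histLen - 2, 0), io.commitTaken) }
      when(io.flush) {
        specHist := commitHist          // the fix: roll back to the committed state
      }.elsewhen(io.predValid) {
        specHist := Cat(specHist(histLen - 2, 0), io.predTaken)
      }
      io.history := specHist
    }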

Poor branch prediction performance in a hand-crafted test — A small “branch obstacle course” returned unexpectedly poor performance. An indexing math error caused the most significant bit of the global branch history to be ignored.

The berkeley-hardfloat floating-point square-root unit returned the wrong answer for the input value 171 — This was found with overnight torture testing.

BOOM ran fine in Verilator but hung on the FPGA — A new signal was added to a top-level I/O in the Rocket-chip tile that drives the reset vector (which address do we start executing from when the core comes out of reset?). In Verilator simulation, the Debug Transport Module immediately forces BOOM to begin loading the test binary by executing a program out of the Program Buffer. However, the FPGA platform used a faster Tethered Serial Interface (TSI) to load the program while BOOM is held in reset. Once finished, the TSI removes BOOM from reset and the TSI-specific boot-loader has BOOM jump to the application code. But since BOOM incorrectly began at a random address after reset instead of in the boot-loader, the FPGA emulation hung.

BOOM hangs 50% of the time running complex applications after updating Rocket-chip — A refactoring of the I/O interfaces left the invalidate-load-reservation signal dangling. Thus, for half of the simulation executions the signal was randomized to 1, or “always-invalidate-load-reservation”, which prevents forward progress for any binary that uses load-reserve instructions.

The 445.gobmk benchmark returns an assertion error at 14.9 billion cycles — Using an FPGA simulation with synthesized assertions, we found a bug in which a misspeculated FP-to-Int move instruction incorrectly writes back to an invalid ROB entry. The FP-to-Int instruction suffers a structural hazard in accessing the write-back port on the register file due to a load instruction. During this same cycle, a branch misprediction kills all of the entries in the FP-to-Int queue. A copy/paste bug from a previous flow-through queue implementation allows enq.valid to set deq.valid to true, even though enq.bits is reading from a misspeculated entry before the dequeue pointer has been updated. Without FPGA acceleration, this bug would have required 431 days of Verilator runtime.
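
The valid-signal mistake is easiest to see against a correct flow-through buffer. The one-entry Chisel sketch below uses invented names and is far simpler than the actual FP-to-Int queue; the key point is that the enqueue may only "flow through" to the dequeue side when the buffer holds nothing older.

    import chisel3._
    import chisel3.util._

    // Sketch: a one-entry flow-through buffer with explicit pass-through logic.
    class FlowBuffer[T <: Data](gen: T) extends Module {
      val io = IO(new Bundle {
        val enq = Flipped(Decoupled(gen))
        val deq = Decoupled(gen)
      })
      val full  = RegInit(false.B)
      val data  = Reg(gen)
      val empty = !full
      io.enq.ready := empty
      // Correct flow-through: enq may drive deq.valid ONLY when the buffer is
      // empty. The bug let enq.valid assert deq.valid while an older
      // (misspeculated) entry was still sitting at the dequeue pointer.
      io.deq.valid := full || (empty && io.enq.valid)
      io.deq.bits  := Mux(full, data, io.enq.bits)
      val doEnq = io.enq.valid && io.enq.ready
      when(doEnq && !(empty && io.deq.ready)) {
        data := io.enq.bits
        full := true.B
      }
      when(full && io.deq.ready) { full := false.B }
    }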

The 401.bzip2 benchmark returns an assertion error at 500 billion cycles — A JAL instruction jumps to the wrong target. This is caused by improper signed arithmetic in the frontend's handling of the unconditional jump target. This was found using FPGA-accelerated validation; otherwise, it would have required 39 years of Verilator simulation to find.
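
As with the AUIPC bug above, the arithmetic must be signed; a Chisel sketch with illustrative widths and names:

    import chisel3._

    // Sketch: computing a JAL target in the frontend. The 21-bit J-immediate
    // is signed, so it must be sign-extended before being added to the PC.
    class JalTarget(pcBits: Int = 40) extends Module {
      val io = IO(new Bundle {
        val pc     = Input(UInt(pcBits.W))
        val jImm21 = Input(UInt(21.W))   // raw J-type immediate (bit 0 is zero)
        val target = Output(UInt(pcBits.W))
      })
      val immExt = Wire(SInt(pcBits.W))
      immExt := io.jImm21.asSInt          // sign-extend; unsigned math was the bug
      val sum = (io.pc.asSInt + immExt).asUInt
      io.target := sum(pcBits - 1, 0)
    }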

A kernel panic occurs on an illegal instruction when reading the rdtime CSR — The latest RISC-V Privileged Specification v1.10 adds a feature to support faster instruction emulation by writing the illegal instruction bits into the control and status register (CSR) mtval (previously mbadaddr). BOOM instead wrote the excepting PC into mtval, continuing with the previous v1.9 behavior for mbadaddr. A read of the rdtime CSR is supposed to throw an illegal instruction exception on the Rocket-chip platform so that machine mode can emulate the CSR read by accessing the memory-mapped real-time clock and returning that value instead. However, because the PC was written into mtval, the emulation routine was confused and treated the trap as a genuinely illegal instruction rather than as an emulatable read of the rdtime CSR. This bug required a comparison of Rocket's commit log and waveform against BOOM's commit log and waveform to track the correct instruction path through the different privilege levels. Part of the fix involved patching the upstream RISC-V conformance tests to check for proper illegal instruction handling.

Bibliography

[1] Jenny C Aker and Isaac M Mbiti. “Mobile phones and economic development in Africa”. The Journal of Economic Perspectives 24.3 (2010), pp. 207–232.
[2] Gene M Amdahl. “Validity of the single processor approach to achieving large scale computing capabilities”. Proceedings of the April 18-20, 1967, spring joint computer conference. ACM. 1967, pp. 483–485.
[3] Mark Anderson. “A more cerebral cortex”. IEEE Spectrum 47.1 (2010).
[4] Tutu Ajayi, Khalid Al-Hawaj, Aporva Amarnath, Steve Dai, Scott Davidson, Paul Gao, Gai Liu, Anuj Rao, Austin Rovinski, Ningxiao Sun, Christopher Torng, Luis Vega, Bandhav Veluri, Shaolin Xie, Chun Zhao, and Ritchie Zhao. “Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC in TSMC 16nm” (2017).
[5] Ehsan K Ardestani and Jose Renau. “ESESC: A fast multicore simulator using time-based sampling”. IEEE 19th International Symposium on High Performance Computer Architecture (HPCA2013). IEEE. 2013, pp. 448–459.
[6] ARM Outmuscles Atom on Benchmark. http://parisbocek.typepad.com/blog/2011/04/arm-outmuscles-atom-on-benchmark-1.html/.
[7] Krste Asanović, Rimas Avižienis, Jonathan Bachrach, Scott Beamer, David Biancolin, Christopher Celio, Henry Cook, Daniel Dabbelt, John Hauser, Adam Izraelevitz, Sagar Karandikar, Benjamin Keller, Donggyu Kim, John Koenig, Yunsup Lee, Eric Love, Martin Maas, Albert Magyar, Howard Mao, Miquel Moreto, Albert Ou, David Patterson, Brian Richards, Colin Schmidt, Stephen Twigg, Huy Vo, and Andrew Waterman. “The Rocket Chip Generator”. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2016-17 (2016).
[8] Krste Asanović, Jonathan Bachrach, David Biancolin, Christopher Celio, Sagar Karandikar, and Donggyu Kim. “Deterministic FPGA-Accelerated RTL Validation with Replayable Errors”. [In submission].
[9] Krste Asanović and David A. Patterson. Instruction Sets Should Be Free: The Case For RISC-V. Tech. rep. UCB/EECS-2014-146. EECS Department, University of California, Berkeley, Aug. 2014.

[10] W. Ashmawi, J. Burr, A. Sharma, and J. Renau. “Implementation of a Power Efficient High Performance FPU for SCOORE”. Workshop on Architectural Research Prototyping (WARP), held in conjunction with ISCA-35. 2008.
[11] Jonathan Balkind, Michael McKeown, Yaosheng Fu, Tri Nguyen, Yanqi Zhou, Alexey Lavrov, Mohammad Shahrad, Adi Fuchs, Samuel Payne, Xiaohua Liang, et al. “OpenPiton: An open source manycore research framework”. ACM SIGOPS Operating Systems Review 50.2 (2016), pp. 217–232.
[12] Fabrice Bellard. “QEMU, a fast and portable dynamic translator.” USENIX Annual Technical Conference, FREENIX Track. 2005, pp. 41–46.
[13] Abhishek Bhattacharjee, Gilberto Contreras, and Margaret Martonosi. “Full-system chip multiprocessor power evaluations using FPGA-based emulation”. Low Power Electronics and Design (ISLPED), 2008 ACM/IEEE International Symposium on. IEEE. 2008, pp. 335–340.
[14] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R Hower, Tushar Krishna, Somayeh Sardashti, et al. “The GEM5 Simulator”. ACM SIGARCH Computer Architecture News 39.2 (2011), pp. 1–7.
[15] David Brooks, Vivek Tiwari, and Margaret Martonosi. Wattch: a framework for architectural-level power analysis and optimizations. Vol. 28. 2. ACM, 2000.
[16] Doug Burger and Todd M Austin. “The SimpleScalar tool set, version 2.0”. ACM SIGARCH Computer Architecture News 25.3 (1997), pp. 13–25.
[17] Doug Burger, Todd M Austin, and Steve Bennett. Evaluating future microprocessors: The SimpleScalar tool set. University of Wisconsin-Madison, Computer Sciences Department, 1996.
[18] Brad Burgess. “Samsung Exynos M1 Processor”. Hot Chips. Samsung Austin R&D Center – SARC. 2016.
[19] Cadence Design Systems. Palladium Accelerator/Emulator. http://www.cadence.com/products/functional_ver/palladium/.
[20] Cadence Palladium XP II Verification Computing Platform. Tech. rep. Cadence Design Systems, 2013.
[21] Championship Branch Prediction (CBP-5). https://www.jilp.org/cbp2016/.
[22] David G Chinnery and Kurt Keutzer. “Closing the power gap between ASIC and custom: an ASIC perspective”. Proceedings of the 42nd annual Design Automation Conference. ACM. 2005, pp. 275–280.
[23] Derek Chiou, Huzefa Sunjeliwala, Dam Sunwoo, John Xu, and Nikhil Patil. “FPGA-based fast, cycle-accurate, full-system simulators”. Proceedings of the second Workshop on Architecture Research using FPGA Platforms, held in conjunction with HPCA-12, Austin, TX. 2006.

[24] Niket K Choudhary. FabScalar: Automating the design of superscalar processors. PhD thesis. North Carolina State University, 2012.
[25] Joel Coburn, Srivaths Ravi, and Anand Raghunathan. “Power emulation: a new paradigm for power estimation”. Proceedings of the 42nd annual Design Automation Conference. ACM. 2005, pp. 700–705.
[26] Coremark EEMBC Benchmark. https://www.eembc.org/coremark/.
[27] Robert H Dennard, Fritz H Gaensslen, V Leo Rideout, Ernest Bassous, and Andre R LeBlanc. “Design of ion-implanted MOSFET’s with very small physical dimensions”. IEEE Journal of Solid-State Circuits 9.5 (1974), pp. 256–268.
[28] Jonathan Donner. “The use of mobile phones by microentrepreneurs in Kigali, Rwanda: Changes to social and business networks”. Information Technologies & International Development 3.2 (2006), pp–3.
[29] Brandon H Dwiel, Niket Kumar Choudhary, and Eric Rotenberg. “FPGA modeling of diverse superscalar processors”. Performance Analysis of Systems and Software (ISPASS), 2012 IEEE International Symposium on. IEEE. 2012, pp. 188–199.
[30] Schuyler Eldridge, Amos Waterland, Margo Seltzer, Jonathan Appavoo, and Ajay Joshi. “Towards General-Purpose Neural Network Computing”. Parallel Architecture and Compilation (PACT), 2015 International Conference on. IEEE. 2015, pp. 99–112.
[31] Phil Emma. Branch Prediction: Caveats and Second-Order Effects. https://www.jilp.org/cbp/Phil-slides.PDF. Dec. 2004.
[32] ESESC: A Fast Multicore Simulator. https://github.com/masc-ucsc/esesc/tree/aafe01b48b41d79748f299ca94cfaaaf60e7f606.
[33] FabScalar pre-release tools. http://people.engr.ncsu.edu/ericro/research/fabscalar/pre-release.htm.
[34] Brian A Fields, Rastislav Bodík, Mark D Hill, and Chris J Newburn. “Using interaction costs for microarchitectural bottleneck analysis”. MICRO-36. Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE. 2003, pp. 228–239.
[35] Michael J Flynn. “Some Reflections on Computer Engineering: 30 Years after the IBM System 360 Model 91”. The International Symposium on Microarchitecture. Durham, NC, Dec. 1997.
[36] Agner Fog. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs. http://www.agner.org/optimize/instruction_tables.pdf. Technical University of Denmark, 2017.
[37] Geekbench: Processor Benchmarks. https://browser.primatelabs.com/processor-benchmarks.

[38] Davy Genbrugge, Stijn Eyerman, and Lieven Eeckhout. “Interval Simulation: Raising the Level of Abstraction in Architectural Simulation”. Proceedings of the 16th IEEE International Symposium on High-Performance Computer Architecture (HPCA). Feb. 2010, pp. 307–318.
[39] Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy. “Memory Consistency and Event Ordering in Scalable Shared-memory Multiprocessors”. SIGARCH Comput. Archit. News 18.2SI (May 1990), pp. 15–26. issn: 0163-5964. doi: 10.1145/325096.325102.
[40] Mohammad Ali Ghodrat, Kanishka Lahiri, and Anand Raghunathan. “Accelerating system-on-chip power analysis using hybrid power estimation”. Design Automation Conference, 2007. DAC’07. 44th ACM/IEEE. IEEE. 2007, pp. 883–886.
[41] Peter Greenhalgh. Big.LITTLE processing with ARM Cortex-A15 & Cortex-A7. https://www.eetimes.com/document.asp?doc_id=1279167. Oct. 2011.
[42] Gregory F. Grohoski. “Machine organization of the IBM RISC System/6000 processor”. IBM Journal of Research and Development 34.1 (1990), pp. 37–58.
[43] Wim Heirman, Trevor Carlson, and Lieven Eeckhout. “Sniper: scalable and accurate parallel multi-core simulation”. 8th International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES-2012). High-Performance, Embedded Architecture, and Compilation Network of Excellence (HiPEAC). 2012, pp. 91–94.
[44] ARM Holdings. “Cortex-A9: Technical Reference Manual (revision: r4p1)” (2012).
[45] Urs Hölzle. “Brawny cores still beat wimpy cores, most of the time”. IEEE Micro 30.4 (2010).
[46] Mark Horowitz, Elad Alon, Dinesh Patil, Samuel Naffziger, Rajesh Kumar, and Kerry Bernstein. “Scaling, power, and the future of CMOS”. Electron Devices Meeting, 2005. IEDM Technical Digest. IEEE International. IEEE. 2005, 7–pp.
[47] Wei-Wu Hu, Ji-Ye Zhao, Shi-Qiang Zhong, Xu Yang, Elio Guidetti, and Chris Wu. “Implementing a 1GHz four-issue out-of-order execution microprocessor in a standard cell ASIC methodology”. Journal of Computer Science and Technology 22.1 (2007), pp. 1–14.
[48] Bunnie Huang. Impedance Matching Expectations Between RISC-V and the Open Hardware Community. https://riscv.org/wp-content/uploads/2017/05/Wed1100-impedancematch-huang.pdf. May 2017.
[49] Daniel A Jiménez. “Multiperspective Perceptron Predictor with TAGE”. 5th JILP Workshop on Computer Architecture Competitions (2016).
[50] Daniel A Jiménez and Calvin Lin. “Dynamic branch prediction with perceptrons”. High-Performance Computer Architecture, 2001. HPCA. The Seventh International Symposium on. IEEE. 2001, pp. 197–206.

[51] John Morris. Chipmakers announce 7nm technology. http://www.zdnet.com/article/chipmakers-announce-7nm-technology/. ZDNet.
[52] Jose Renau. ESESC: A Fast Multicore Simulator. http://masc.soe.ucsc.edu/esesc/resources/tutorial13.pdf.
[53] Karl Rupp. 40 Years of Microprocessor Trend Data. http://www.karlrupp.net/2015/06/40-years-of-microprocessor-trend-data/.
[54] Ben Keller, Martin Cochet, Brian Zimmer, Jaehwa Kwak, Alberto Puggelli, Yunsup Lee, Milovan Blagojević, Stevo Bailey, Pi-Feng Chiu, Palmer Dabbelt, et al. “A RISC-V Processor SoC With Integrated Power Management at Submicrosecond Timescales in 28 nm FD-SOI”. IEEE Journal of Solid-State Circuits (2017).
[55] R.E. Kessler. “The Alpha 21264 Microprocessor”. IEEE Micro 19.2 (1999), pp. 24–36. issn: 0272-1732. doi: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=755465.
[56] Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, and Krste Asanović. “Strober: fast and accurate sample-based energy simulation for arbitrary RTL”. Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press. 2016, pp. 128–139.
[57] Yunsup Lee, Andrew Waterman, Rimas Avizienis, Henry Cook, Chen Sun, Vladimir Stojanovic, and Krste Asanović. “A 45nm 1.3 GHz 16.7 double-precision GFLOPS/W RISC-V processor with vector accelerators”. European Solid State Circuits Conference (ESSCIRC), ESSCIRC 2014 - 40th. IEEE. 2014, pp. 199–202.
[58] Sheng Li, Jung Ho Ahn, Richard D Strong, Jay B Brockman, Dean M Tullsen, and Norman P Jouppi. “McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures”. Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM. 2009, pp. 469–480.
[59] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. “Pin: building customized program analysis tools with dynamic instrumentation”. ACM SIGPLAN Notices. Vol. 40. 6. ACM. 2005, pp. 190–200.
[60] Heather Mackey. Sneak peek: inside NVIDIA’s Emulation Lab. https://blogs.nvidia.com/blog/2011/05/16/sneak-peak-inside-nvidia-emulation-lab/. May 2011.
[61] Peter S Magnusson, Magnus Christensson, Jesper Eskilson, Daniel Forsgren, Gustav Hallberg, Johan Hogberg, Fredrik Larsson, Andreas Moestedt, and Bengt Werner. “Simics: A full system simulation platform”. Computer 35.2 (2002), pp. 50–58.

[62] Milo MK Martin, Daniel J Sorin, Bradford M Beckmann, Michael R Marty, Min Xu, Alaa R Alameldeen, Kevin E Moore, Mark D Hill, and David A Wood. “Multifacet’s general execution-driven multiprocessor simulator (GEMS) toolset”. ACM SIGARCH Computer Architecture News 33.4 (2005), pp. 92–99.
[63] Eric McCorkle. RISC-V (Crypto) Engines Extension. https://ericmccorkleblog.wordpress.com/2017/02/10/risc-v-crypto-engines-extension/. Eric McCorkle’s Blog, Feb. 2017.
[64] Daniel S McFarlin, Charles Tucker, and Craig Zilles. “Discerning the dominant out-of-order performance advantage: Is it speculation or dynamism?” ACM SIGPLAN Notices. Vol. 48. 4. ACM. 2013, pp. 241–252.
[65] Scott McFarling. Combining branch predictors. Tech. rep. Technical Report TN-36, Digital Western Research Laboratory, 1993.
[66] Harlan McGhan. “Niagara 2 opens the floodgates”. Microprocessor Report 20.11 (2006), pp. 1–12.
[67] Mesa-Martinez, Francisco J, et al. “SCOORE santa cruz out-of-order RISC engine, FPGA design issues”. Workshop on Architectural Research Prototyping (WARP), held in conjunction with ISCA-33. 2006, pp. 61–70.
[68] Pierre Michaud. “A PPM-like, tag-based branch predictor”. Journal of Instruction Level Parallelism 7.1 (2005), pp. 1–10.
[69] Jason E Miller, Harshad Kasture, George Kurian, Charles Gruenwald, Nathan Beckmann, Christopher Celio, Jonathan Eastep, and Anant Agarwal. “Graphite: A distributed parallel simulator for multicores”. High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on. IEEE. 2010, pp. 1–12.
[70] Christopher Mims. “The High Cost of Upholding Moore’s Law”. MIT Technology Review (Apr. 2010).
[71] MIPS Technologies, Inc. MIPS Microprocessor Users Manual. Mountain View, CA, 1996.
[72] Gordon E Moore. “Cramming more components onto integrated circuits”. Electronics (Apr. 1965), pp. 114–117.
[73] Ravi Nair. “Dynamic path-based branch correlation”. Proceedings of the 28th annual international symposium on Microarchitecture. IEEE Computer Society Press. 1995, pp. 15–23.
[74] OpenSPARC T2. http://www.oracle.com/technetwork/systems/opensparc/opensparc-t2-page-1446157.html.
[75] Pablo Montesinos Ortego and Paul Sack. “SESC: SuperESCalar simulator”. 17th Euromicro Conference on Real Time Systems (ECRTS05). 2004, pp. 1–4.

[76] Avadh Patel, Furat Afram, Shunfei Chen, and Kanad Ghose. “MARSS: a full system simulator for multicore x86 CPUs”. Proceedings of the 48th Design Automation Conference. ACM. 2011, pp. 1050–1055.
[77] David A Patterson and John L Hennessy. Computer Architecture: a Quantitative Approach (Fifth Edition). Elsevier, 2012.
[78] David A Patterson and John L Hennessy. Computer Architecture: a Quantitative Approach (Sixth Edition). Elsevier, 2017.
[79] Paul Alcorn. Intel Xeon E5-2600 v4 Broadwell-EP Review. http://www.tomshardware.com/reviews/intel-xeon-e5-2600-v4-broadwell-ep,4514-2.html. Tom’s Hardware, Mar. 2016.
[80] Yogesh Ramadass. “Powering the Internet of Things”. Hot Chips 26 Symposium (HCS), 2014 IEEE. IEEE. 2014, pp. 1–50.
[81] RISC-V Tests. https://github.com/riscv/riscv-tests. RISC-V Foundation.
[82] RISC-V Torture. https://github.com/ucb-bar/riscv-torture. UCB-BAR.
[83] Rocket Microarchitectural Implementation of RISC-V ISA. https://github.com/ucb-bar/rocket. 2016.
[84] E. Rotenberg, B.H. Dwiel, E. Forbes, Zhenqian Zhang, R. Widialaksono, R. Basu Roy Chowdhury, N. Tshibangu, S. Lipa, W.R. Davis, and P.D. Franzon. “Rationale for a 3D heterogeneous multi-core processor”. Computer Design (ICCD), 2013 IEEE 31st International Conference on. Oct. 2013, pp. 154–168. doi: 10.1109/ICCD.2013.6657038.
[85] Daniel Sanchez and Christos Kozyrakis. “ZSim: fast and accurate microarchitectural simulation of thousand-core systems”. ACM SIGARCH Computer Architecture News 41.3 (2013), pp. 475–486.
[86] Graham Schelle, Jamison Collins, Ethan Schuchman, Perry Wang, Xiang Zou, Gautham Chinya, Ralf Plate, Thorsten Mattner, Franz Olbrich, Per Hammarlund, Ronak Singhal, Jim Brayton, Sebastian Steibl, and Hong Wang. “Intel Nehalem Processor Core Made FPGA Synthesizable”. Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays. FPGA ’10. Monterey, California, USA: ACM, 2010, pp. 3–12. isbn: 978-1-60558-911-4. doi: 10.1145/1723112.1723116.
[87] Frank Schirrmeister. Productivity, Predictability, and Use-Model Versatility: The Three Key “Care-Abouts” of Choosing Hardware-Assisted Verification. Tech. rep. Cadence Design Systems, 2013.
[88] Carlo H Séquin and David A Patterson. Design and Implementation of RISC I. Tech. rep. Computer Science Division, University of California, 1982.

[89] André Seznec. “A 256 kbits l-tage branch predictor”. Journal of Instruction-Level Parallelism (JILP) Special Issue: The Second Championship Branch Prediction Competition (CBP-2) (2007).
[90] André Seznec. “A new case for the TAGE branch predictor”. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. ACM. 2011, pp. 117–127.
[91] André Seznec. “Tage-sc-l branch predictors”. JILP-Championship Branch Prediction. 2014.
[92] André Seznec. “Tage-sc-l branch predictors again”. 5th JILP Workshop on Computer Architecture Competitions (JWAC-5): Championship Branch Prediction (CBP-5). 2016.
[93] André Seznec, Stephen Felix, Venkata Krishnan, and Yiannakis Sazeides. “Design tradeoffs for the Alpha EV8 conditional branch predictor”. Computer Architecture, 2002. Proceedings. 29th Annual International Symposium on. IEEE. 2002, pp. 295–306.
[94] André Seznec, Stéphan Jourdan, Pascal Sainrat, and Pierre Michaud. Multiple-block ahead branch predictors. Vol. 31. 9. ACM, 1996.
[95] Anand Lal Shimpi. NVIDIA’s Fermi: Architected for Tesla, 3 Billion Transistors in 2010. http://www.anandtech.com/show/2849. Sept. 2009.
[96] Clay Shirky. “The political power of social media: Technology, the public sphere, and political change”. Foreign Affairs (2011), pp. 28–41.
[97] James E Smith. “A study of branch prediction strategies”. Proceedings of the 8th annual symposium on Computer Architecture. IEEE Computer Society Press. 1981, pp. 135–148.
[98] SPEC CPU2006: dynamic instruction counts. http://boegel.kejo.be/ELIS/spec_cpu2006/spec_cpu2006_dyn_instr_counts.html.
[99] Jayanth Srinivasan. An overview of static power dissipation. Tech. rep.
[100] Chen Sun, Mark T Wade, Yunsup Lee, Jason S Orcutt, Luca Alloatti, Michael S Georgas, Andrew S Waterman, Jeffrey M Shainline, Rimas R Avizienis, Sen Lin, et al. “Single-chip microprocessor that communicates directly using light”. Nature 528.7583 (2015), pp. 534–538.
[101] Dam Sunwoo, Gene Y Wu, Nikhil A Patil, and Derek Chiou. “PrEsto: An FPGA-accelerated power estimation methodology for complex systems”. Field Programmable Logic and Applications (FPL), 2010 International Conference on. IEEE. 2010, pp. 310–317.

[102] Zhangxi Tan, Andrew Waterman, Rimas Avizienis, Yunsup Lee, Henry Cook, David Patterson, and Krste Asanović. “RAMP Gold: an FPGA-based architecture simulator for multiprocessors”. Design Automation Conference (DAC), 2010 47th ACM/IEEE. IEEE. 2010, pp. 463–468.
[103] Zhangxi Tan, Andrew Waterman, Henry Cook, Sarah Bird, Krste Asanović, and David Patterson. “A case for FAME: FPGA architecture model execution”. ACM SIGARCH Computer Architecture News 38.3 (2010), pp. 290–301.
[104] Michael B Taylor. “Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse”. Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE. IEEE. 2012, pp. 1131–1136.
[105] Michael Bedford Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, et al. “Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams”. ACM SIGARCH Computer Architecture News 32.2 (2004), p. 2.
[106] The Hwacha Project. http://hwacha.org. 2015.
[107] The RISC-V Instruction Set Architecture. http://riscv.org/.
[108] James E Thornton. “Parallel operation in the Control Data 6600”. Proceedings of the October 27-29, 1964, fall joint computer conference, part II: very high speed computer systems. ACM. 1964, pp. 33–40.
[109] Matthew Travers. “CPU Power Consumption Experiments and Results Analysis of Intel i7-4820K” (2015).
[110] Jessica H Tseng and Krste Asanović. “Energy-efficient register access”. Integrated Circuits and Systems Design, 2000. Proceedings. 13th Symposium on. IEEE. 2000, pp. 377–382.
[111] Jessica Hui-Chun Tseng. Banked Microarchitectures for Complexity-Effective Superscalar Microprocessors. PhD thesis. Massachusetts Institute of Technology, 2006.
[112] Stuart G Tucker. “Emulation of large systems”. Communications of the ACM 8.12 (1965), pp. 753–761.
[113] Jim Turley. “Cortex-A15 “Eagle” flies the coop”. Microprocessor Report 24.11 (2010), pp. 1–11.
[114] UCB-BAR: Berkeley Hardfloat. https://github.com/ucb-bar/berkeley-hardfloat.
[115] UN News Centre. Deputy UN chief calls for urgent action to tackle global sanitation crisis. http://www.un.org/apps/news/story.asp?NewsID=44452.
[116] UNdata. Mobile-cellular telephone subscriptions. http://data.un.org/Data.aspx?d=ITU&f=ind1Code%3aI271.
[117] Verilator. https://www.veripool.org/wiki/verilator.

[118] Narayanan Vijaykrishnan, Mahmut Kandemir, Mary Jane Irwin, Hyun Suk Kim, and Wu Ye. “Energy-driven integrated hardware-software optimizations using SimplePower”. ACM SIGARCH Computer Architecture News 28.2 (2000), pp. 95–106.
[119] Nicholas J Wang, Justin Quek, Todd M Rafacz, and Sanjay J Patel. “Characterizing the effects of transient faults on a high-performance processor pipeline”. Dependable Systems and Networks, 2004 International Conference on. IEEE. 2004, pp. 61–70.
[120] James Warnock, Brian Curran, John Badar, Gregory Fredeman, Donald Plass, Yuen Chan, Sean Carey, Gerard Salem, Friedrich Schroeder, Frank Malgioglio, et al. “4.1 22nm next-generation IBM System Z microprocessor”. Solid-State Circuits Conference (ISSCC), 2015 IEEE International. IEEE. 2015, pp. 1–3.
[121] Thomas F Wenisch, Roland E Wunderlich, Michael Ferdman, Anastassia Ailamaki, Babak Falsafi, and James C Hoe. “SimFlex: statistical sampling of computer system simulation”. IEEE Micro 26.4 (2006), pp. 18–31.
[122] Steven JE Wilton and Norman P Jouppi. “CACTI: An enhanced cache access and cycle time model”. IEEE Journal of Solid-State Circuits 31.5 (1996), pp. 677–688.
[123] Sam Likun Xi, Hans Jacobson, Pradip Bose, Gu-Yeon Wei, and David Brooks. “Quantifying sources of error in McPAT and potential impacts on architectural studies”. High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on. IEEE. 2015, pp. 577–589.
[124] K.C. Yeager. “The MIPS R10000 Superscalar Microprocessor”. IEEE Micro 16.2 (1996), pp. 28–41. issn: 0272-1732. doi: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=491460.
[125] Charles Zhang. “Mars: A 64-Core ARMv8 Processor”. Hot Chips. Phytium. 2015.
[126] Yanqi Zhou and David Wentzlaff. “The sharing architecture: sub-core configurability for IaaS clouds”. Proceedings of the 19th international conference on Architectural support for programming languages and operating systems. ACM. 2014, pp. 559–574.
[127] Brian Zimmer, Pi-Feng Chiu, Borivoje Nikolić, and Krste Asanović. “Reprogrammable Redundancy for SRAM-Based Cache Vmin Reduction in a 28-nm RISC-V Processor”. IEEE Journal of Solid-State Circuits (2017).