Computer Architecture: A Constructive Approach

Using Executable and Synthesizable Specifications

Arvind 1, Rishiyur S. Nikhil 2, Joel S. Emer 3, and Murali Vijayaraghavan 1

1 MIT 2 Bluespec, Inc. 3 Intel and MIT

with contributions from

Prof. Li-Shiuan Peh, Abhinav Agarwal, Elliott Fleming, Sang Woo Jun, Asif Khan, Myron King (MIT); Prof. Derek Chiou (University of Texas, Austin); and Prof. Jihong Kim (Seoul National University)

© 2012-2013 Arvind, R.S.Nikhil, J.Emer and M.Vijayaraghavan

Revision: December 31, 2012

Acknowledgements

We would like to thank the staff and students of various recent offerings of this course at MIT, Seoul National University and Technion for their feedback and support.

Contents

1 Introduction 1-1

2 Combinational circuits 2-1

2.1 A simple “ripple-carry” adder...... 2-1

2.1.1 A 2-bit Ripple-Carry Adder...... 2-3

2.2 Static Elaboration and Static Values...... 2-6

2.3 Integer types, conversion, extension and truncation...... 2-7

2.4 Arithmetic-Logic Units (ALUs)...... 2-9

2.4.1 Shift operations...... 2-9

2.4.2 Enumerated types for expressing ALU opcodes...... 2-12

2.4.3 Combinational ALUs...... 2-13

2.4.4 Multiplication...... 2-14

2.5 Summary, and a word about efficient ALUs...... 2-16

3 Sequential (Stateful) Circuits and Modules 3-1

3.1 Registers...... 3-1

3.1.1 Space and time...... 3-1

3.1.2 D flip-flops...... 3-2

3.1.3 Registers...... 3-3

3.2 Sequential loops with registers...... 3-4

3.3 Sequential version of the multiply operator...... 3-6

3.4 Modules and Interfaces...... 3-7

3.4.1 Polymorphic multiply module...... 3-11

3.5 Register files...... 3-12

3.6 Memories and BRAMs...... 3-14


4 Pipelining Complex Combinational Circuits 4-1
4.1 Introduction...... 4-1
4.2 Pipeline registers and Inelastic Pipelines...... 4-2
4.2.1 Inelastic Pipelines...... 4-3
4.2.2 Stalling and Bubbles...... 4-4
4.2.3 Expressing data validity using the Maybe type...... 4-5
4.3 Elastic Pipelines with FIFOs between stages...... 4-8
4.4 Final comments on Inelastic and Elastic Pipelines...... 4-10

5 Introduction to SMIPS: a basic implementation without pipelining 5-1
5.1 Introduction to SMIPS...... 5-1
5.1.1 Instruction Set Architectures, Architecturally Visible State, and Implementation State...... 5-2
5.1.2 SMIPS architectural state...... 5-2
5.1.3 SMIPS processor instruction formats...... 5-3
5.2 Uniform interface for our processor implementations...... 5-4
5.3 A simple single-cycle implementation of SMIPS v1...... 5-5
5.4 Expressing our single-cycle CPU with BSV, versus prior methodologies...... 5-13
5.5 Separating the Fetch and Execute actions...... 5-15
5.5.1 Analysis...... 5-17

6 SMIPS: Pipelined 6-1
6.1 Hazards...... 6-1
6.1.1 Modern processors are distributed systems...... 6-2
6.2 Two-stage pipelined SMIPS...... 6-2
6.2.1 A first step: a single rule, inelastic solution...... 6-3
6.2.2 A second step: a two-rule, somewhat elastic solution...... 6-4
6.2.3 A third step: a two-rule, elastic solution using epochs...... 6-5
6.2.4 A final step: a two-rule, elastic, fully-decoupled, distributed solution...... 6-7
6.3 Conclusion...... 6-9

7 Rule Semantics and Event Ordering 7-1
7.1 Introduction...... 7-1
7.2 Parallelism: semantics of a rule in isolation...... 7-1
7.2.1 What actions can be combined (simultaneous) in a rule?...... 7-2
7.2.2 Execution semantics of a single rule...... 7-5
7.3 Logical semantics vs. implementation: sequential rule execution...... 7-7
7.4 Concurrent rule execution, and scheduling rules into clocks...... 7-7

7.4.1 Schedules, and compilation of schedules...... 7-9
7.4.2 Examples...... 7-9
7.4.3 Nuances due to conditionals...... 7-12
7.5 Hardware intuition for concurrent scheduling...... 7-12
7.6 Designing performance-critical circuits using EHRs...... 7-14
7.6.1 Intra-clock concurrency and semantics...... 7-15
7.6.2 EHRs...... 7-16
7.6.3 Implementing the with EHRs...... 7-17
7.7 Concurrent FIFOs...... 7-18
7.7.1 Multi-element concurrent FIFOs...... 7-18
7.7.2 Semantics of single element concurrent FIFOs...... 7-20
7.7.3 Implementing single element concurrent FIFOs using EHRs...... 7-22

8 Data Hazards (Read-after-Write Hazards) 8-1
8.1 Read-after-write (RAW) Hazards and Scoreboards...... 8-1
8.2 Concurrency issues in the pipeline with register file and scoreboard...... 8-6
8.3 Write-after-Write Hazards...... 8-7
8.4 Deeper pipelines...... 8-7
8.5 Conclusion...... 8-8

9 Branch Prediction 9-1
9.1 Introduction...... 9-1
9.2 Static Branch Prediction...... 9-2
9.3 Dynamic Branch Prediction...... 9-3
9.4 A first attempt at a better Next-Address Predictor (NAP)...... 9-5
9.5 An improved BTB-based Next-Address Predictor...... 9-6
9.5.1 Implementing the Next-Address Predictor...... 9-6
9.6 Incorporating the BTB-based predictor in the 2-stage pipeline...... 9-8
9.7 Direction predictors...... 9-9
9.8 Incorporating multiple predictors into the pipeline...... 9-10
9.8.1 Extending our BSV pipeline code with multiple predictors...... 9-12
9.9 Conclusion...... 9-15

10 Exceptions 10-1
10.1 Introduction...... 10-1
10.2 Asynchronous Interrupts...... 10-2
10.2.1 Interrupt Handlers...... 10-2
10.3 Synchronous Interrupts...... 10-3
10.3.1 Using synchronous exceptions to handle complex and infrequent instructions...... 10-3
10.3.2 Incorporating exception handling into our single-cycle processor...... 10-3
10.4 Incorporating exception handling into our pipelined processor...... 10-5
10.4.1 BSV code for pipeline with exception handling...... 10-8

11 Caches 11-1

11.1 Introduction...... 11-1

11.2 Cache organizations...... 11-4

11.2.1 Replacement policies...... 11-6

11.2.2 Blocking and Non-blocking caches...... 11-6

11.3 A Blocking Cache Design...... 11-7

11.4 Integrating caches into the processor pipeline...... 11-10

11.5 A Non-blocking cache for the Instruction Memory (Read-Only)...... 11-11

11.5.1 Completion Buffers...... 11-13

11.6 Conclusion...... 11-15

12 Virtual Memory 12-1

12.1 Introduction...... 12-1

12.2 Different kinds of addresses...... 12-2

12.3 Paged Memory Systems...... 12-2

12.4 Page Tables...... 12-4

12.5 Address translation and protection using TLBs...... 12-5

12.6 Variable-sized pages...... 12-7

12.7 Handling TLB misses...... 12-9

12.8 Handling Page Faults...... 12-10

12.9 Recursive Page Faults...... 12-11

12.10 Integrating Virtual Memory mechanisms into the processor pipeline...... 12-12

12.11 Conclusion...... 12-14

13 Future Topics 13-1

13.1 Asynchronous Exceptions and Interrupts...... 13-1

13.2 Out-of-order pipelines...... 13-1

13.3 Protection and System Issues...... 13-1

13.4 I and D ...... 13-1

13.5 Multicore and Multicore cache coherence...... 13-1

13.6 Simultaneous Multithreading...... 13-2

13.7 Energy efficiency...... 13-2

13.8 Hardware accelerators...... 13-2

A SMIPS Reference SMIPS-1
A.1 Basic Architecture...... SMIPS-2
A.2 System Control (CP0)...... SMIPS-3
A.2.1 Test Communication Registers...... SMIPS-3
A.2.2 Counter/Timer Registers...... SMIPS-3
A.2.3 Exception Processing Registers...... SMIPS-4
A.3 Addressing and Memory Protection...... SMIPS-6
A.4 Reset, Interrupt, and Exception Processing...... SMIPS-7
A.4.1 Reset...... SMIPS-7
A.4.2 Interrupts...... SMIPS-7
A.4.3 Synchronous Exceptions...... SMIPS-8
A.5 Instruction Semantics and Encodings...... SMIPS-9
A.5.1 Instruction Formats...... SMIPS-9
A.5.2 Instruction Categories...... SMIPS-10
A.5.3 SMIPS Variants...... SMIPS-15
A.5.4 Unimplemented Instructions...... SMIPS-15
A.6 Instruction listing for SMIPS...... SMIPS-16

Bibliography BIB-1

Chapter 1

Introduction

This book is intended as an introductory course in Computer Architecture (or Computer Organization, or Computer Engineering) for undergraduate students who have had a basic introduction to circuits and digital design. This book employs a constructive approach, i.e., all concepts are explained with machine descriptions that transparently describe architecture and that can be synthesized to real hardware (for example, they can actually be run on FPGAs). Further, these descriptions will be very modular, enabling experimentation with alternatives.

Computer architecture has traditionally been taught with schematic diagrams and explanatory text. Diagrams describe general structures, such as ALUs, pipelines, caches, virtual memory mechanisms, and interconnects, or specific examples of historic machines, such as the CDC 6600 or Cray-1 (recent machines are usually proprietary, meaning the details of their structure may not be publicly available). These diagrams are accompanied by lectures and texts explaining principles of operation. Small quantitative exercises can be done by hand, such as measuring cache hit rates for various cache organizations on small synthetic instruction streams.

In 1991, with the publication of the classic Computer Architecture: A Quantitative Approach by Hennessy and Patterson [4] (the Fifth Edition was published in 2011), the pedagogy changed from such almost anecdotal descriptions to serious scientific, quantitative evaluation. They firmly established the idea that architectural proposals cannot be evaluated in the abstract, or on toy examples; they must be measured running real programs (applications, operating systems, databases, web servers, etc.) in order properly to evaluate the engineering trade-offs (cost, performance, power, and so on). Since it has typically not been feasible (due to lack of time, funds and skills) for students and researchers actually to build the hardware to test an architectural proposal, this evaluation has typically taken the route of simulation, i.e., writing a program (say in C or C++) that simulates the architecture in question. Most of the papers in leading computer architecture conferences are supported by data gathered this way.

Unfortunately, there are several problems with simulators written in traditional programming languages like C and C++. First, it is very hard to write an accurate model of complex hardware. Computer system hardware is massively parallel (at the level of registers, pipeline stages, etc.); the parallelism is very fine-grain (at the level of individual bits and clock cycles); and the parallelism is quite heterogeneous (thousands of dissimilar

activities). These features are not easy to express in traditional programming languages, and simulators that try to model these features end up as very complex programs. Further, in a simulator it is too easy, without realizing it, to code actions that are unrealistic or infeasible in real hardware, such as instantaneous, simultaneous access to a global piece of state from distributed parts of the architecture. Finally, these simulators are very far removed from representing any kind of formal specification of an architecture, which would be useful for both manual and automated reasoning about correctness, performance, power, equivalence, and so on. A formal semantics of the interfaces and behavior of architectural components would also benefit constructive experimentation, where one could more easily subtract, replace and add architectural components in a model in order to measure their effectiveness.

Of course, to give confidence in hardware feasibility, one could write a simulator in the synthesizable subset of a Hardware Description Language (HDL) such as Verilog, VHDL, or the RTL level of SystemC. Unfortunately, these are very low level languages compared to modern programming languages, requiring orders of magnitude more effort to develop, evolve and maintain simulators; they are certainly not good vehicles for experimentation. And, these languages also lack any useful notion of formal semantics.

In this book we pursue a recently available alternative. The BSV language is a high-level, fully synthesizable hardware description language with a strong formal semantic basis [1,2,7]. It is very suitable for describing architectures precisely and succinctly, and has all the conveniences of modern advanced programming languages such as expressive user-defined types, strong type checking, polymorphism, object orientation and even higher order functions during static elaboration. BSV’s behavioral model is based on Guarded Atomic Actions (or atomic transactional “rules”).

Computer architectures are full of very subtle issues of concurrency and ordering, such as dealing correctly with data hazards or multiple branch predictors in a processor pipeline, or distributed cache coherence in scalable multiprocessors. BSV’s formal behavioral model is one of the best vehicles with which to study and understand such topics (it seems also to be the computational model of choice in several other languages and tools for formal specification and analysis of complex hardware systems).

Modern hardware systems-on-a-chip (SoCs) have so much hardware on a single chip that it is useful to conceptualize them and analyze them as distributed systems rather than as globally synchronous systems (the traditional view), i.e., where architectural components are loosely coupled and communicate with messages, instead of attempting instantaneous access to global state. Again, BSV’s formal semantics are well suited to this flavor of models. The ability to describe hardware module interfaces formally in BSV facilitates creating reusable architectural components that enable quick experimentation with alternative structures, reinforcing the “constructive approach” mentioned in this book’s title.

Architectural models written in BSV are fully executable. They can be simulated in the Bluesim™ simulator; they can be synthesized to Verilog and simulated on a Verilog simulator; and they can be further synthesized to run on FPGAs, as illustrated in Fig. 1.1.
This last capability is not only excellent for validating hardware feasibility of the models, but it also enables running much bigger programs on the models, since FPGA-based execution can be 3 to 4 orders of magnitude faster than simulation. Students are also very excited to see their designs actually running on real FPGA hardware.

Figure 1.1: Tool flow for executing the models in this book

In this book, intended as a first course in Computer Architecture for undergraduates, we will go through a series of traditional topics: combinational circuits, pipelines, unpipelined processors, processor pipelines of increasing depth, control speculation (branch prediction), data hazards, exceptions, caches, virtual memory, and so on. We will be writing BSV code for every topic explored. At every stage the student is encouraged to run the models at least in simulation, but preferably also on FPGAs. By the end of the course students will design six or more different machines of increasing complexity and performance, and they will quantitatively evaluate the performance of their C programs compiled and running on these machines. Code will be written in a modular way so students can experiment by easily removing, replacing or adding components or features. Along the way, we will delve deeply into complex issues of concurrency and ordering so that the student has a more rigorous framework in which to think about these questions and create new solutions.

Chapter 2

Combinational circuits

Combinational circuits are just acyclic interconnections of gates such as AND, OR, NOT, XOR, NAND, and NOR. In this chapter, we will describe the design of some Arithmetic-Logic Units (ALUs) built out of pure combinational circuits. These are the core “processing” circuits in a processor. Along the way, we will also introduce some BSV notation, types and type checking.

2.1 A simple “ripple-carry” adder

Our goal is to build an adder that takes two w-bit input integers and produces a w-bit output integer plus a “carry” bit (typically, w = 32 or 64 in modern processors). First we need a building block called a “full adder” whose inputs are the two jth bits of the two inputs, along with the “carry” bit from the addition of the lower-order bits (up to the (j−1)th bit). Its outputs are the jth bit of the result, along with the carry bit up to this point. Its functionality is specified by a classical “truth table”:

Inputs             Outputs
a   b   c_in       c_out   s
0   0   0          0       0
0   1   0          0       1
1   0   0          0       1
1   1   0          1       0
0   0   1          0       1
0   1   1          1       0
1   0   1          1       0
1   1   1          1       1

We can implement this in a circuit, expressed in BSV (almost) as follows:

Full Adder (not quite proper BSV code)
1 function fa (a, b, c_in);
2    s = (a ^ b) ^ c_in;
3    c_out = (a & b) | (c_in & (a ^ b));
4    return {c_out, s};
5 endfunction


Figure 2.1: Full adder circuit

In BSV, as in C, the &, | and ^ symbols stand for the AND, OR, and XOR (exclusive OR) functions on bits. The circuit it describes is shown in Fig. 2.1, which is in some sense just a pictorial depiction of the code.

A note about common sub-expressions: the code has two instances of the expression (a ^ b) (on lines 2 and 3), but the circuit shows just one such gate (top-left of the circuit), whose output is fed to two places. Conversely, the circuit could have been drawn with two instances of the gate, or the expression-sharing could have been suggested explicitly in the BSV code:

Common sub-expressions
1 ...
2 tmp = (a ^ b);
3 s = tmp ^ c_in;
4 c_out = (a & b) | (c_in & tmp);
5 ...

From a functional point of view, these differences are irrelevant, since they all implement the same truth table. From an implementation point of view it is quite relevant, since a shared gate occupies less silicon but drives a heavier load. However, we do not worry about this issue at all when writing BSV code because the Bluespec compiler bsc performs extensive and powerful identification and unification of common sub-expressions, no matter how you write it. Such sharing is almost always good for implementation. Very occasionally, considerations of high fan-out and long wire-lengths may dictate otherwise; BSV has ways of specifying this, if necessary. (end of note)

The code above is not quite proper BSV, because we have not yet declared the types of the arguments, results and intermediate variables. Here is a proper BSV version of the code:

Full Adder (with types)
1 function Bit#(2) fa (Bit#(1) a, Bit#(1) b, Bit#(1) c_in);
2    Bit#(1) s = (a ^ b) ^ c_in;
3    Bit#(1) c_out = (a & b) | (c_in & (a ^ b));
4    return {c_out,s};
5 endfunction

BSV is a strongly typed language, following modern practice for robust, scalable programming. All expressions (including variables, functions, modules and interfaces) have

unique types, and the compiler does extensive type checking to ensure that all operations have meaningful types. In the above code, we’ve declared the types of all arguments and the two intermediate variables as Bit#(1), i.e., a 1-bit value. The result type of the function has been declared Bit#(2), a 2-bit value. The expression {c_out,s} represents a concatenation of bits. The type checker ensures that all operations are meaningful, i.e., that the operand and result types are proper for the operator (for example, it would be meaningless to apply the square root function to an Ethernet header packet!). BSV also has extensive type deduction or type inference, by which it can fill in types omitted by the user. Thus, the above code could also have been written as follows:

Full Adder (with some omitted types)
1 function Bit#(2) fa (Bit#(1) a, Bit#(1) b, Bit#(1) c_in);
2    let s = (a ^ b) ^ c_in;
3    let c_out = (a & b) | (c_in & (a ^ b));
4    return {c_out,s};
5 endfunction

The keyword let indicates that we would like the compiler to work out the types for us. For example, knowing the types of a, b and c_in, and knowing that the ^ operator takes two w-bit arguments to return a w-bit result, the compiler can deduce that s has type Bit#(1).

Type-checking in BSV is much stronger than in languages like C and C++. For example, one can write an assignment statement in C or C++ that adds a char, a short, a long and a long long and assigns the result to a short variable. During the additions, the values are silently “promoted” to longer values and, during the assignment, the longer value is silently truncated to fit in the shorter container. The promotion may be done using zero-extension or sign-extension, depending on the types of the operands. This kind of silent type “casting” (with no visible type checking error) is the source of many subtle bugs; even C/C++ experts are often surprised by it. In hardware design, these errors are magnified by the fact that, unlike C/C++, we are not just working with a few fixed sizes of values (1, 2, 4 and 8 bytes) but with arbitrary bit widths in all kinds of combinations. Further, to minimize hardware, these bit sizes are often chosen to be just large enough for the job, and so the chances of silent overflow and truncation are greater. In BSV, type casting is never silent; it must be explicitly stated by the user.
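As a small illustration of this last point, here is a sketch (ours, not from the text) of a width mismatch that BSV rejects, and the explicit conversion that fixes it, using the zeroExtend function described in Sec. 2.3:

Explicit width conversion (illustrative sketch)
1 Bit#(16) small = 16'h00FF;
2 Bit#(32) big = 32'h0000_0100;
3
4 // Bit#(32) sum = small + big; // rejected: the operands have different widths
5 Bit#(32) sum = zeroExtend (small) + big; // accepted: the extension is explicit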

2.1.1 A 2-bit Ripple-Carry Adder

Figure 2.2: 2-bit Ripple Carry Adder circuit

We now use our Full Adder as a black box for our next stepping stone towards a w-bit ripple-carry adder: a 2-bit ripple-carry adder. The circuit is shown in Fig. 2.2, and it is described by the following BSV code.

2-bit Ripple Carry Adder
1 function Bit#(3) add(Bit#(2) x, Bit#(2) y, Bit#(1) c0);
2    Bit#(2) s;
3    Bit#(3) c = {?,c0};
4
5    let cs0 = fa(x[0], y[0], c[0]);
6    c[1] = cs0[1]; s[0] = cs0[0];
7
8    let cs1 = fa(x[1], y[1], c[1]);
9    c[2] = cs1[1]; s[1] = cs1[0];
10
11   return {c[2],s};
12 endfunction

Here, x and y represent the two 2-bit inputs, and c0 the input carry bit. The notation x[j] represents its jth bit, with the common convention that j = 0 is the least significant bit. We declare a 2-bit value s and a 3-bit value c to hold some intermediate values. The latter is initialized to the value of the expression {?,c0}. This is a bit-concatenation of two values, one that is left unspecified, and c0. The compiler can deduce that the unspecified value must be of type Bit#(2), since the whole expression must have type Bit#(3) and c0 has type Bit#(1). In BSV, the expression ? represents an unspecified or “don’t care” value, leaving it up to the compiler to choose. Here we leave it unspecified since we’re going to set the values in the subsequent code. Similarly, the initial value of s is also unspecified, by not even providing an initializer.

A full adder instance is applied to the two lower bits of x and y and the input carry bit, producing 2 bits, cs0. We assign its bits into c[1] and s[0]. A second full adder instance is applied to the two upper bits of x and y and the carry bit of cs0 to produce 2 bits, cs1. Again, we capture those bits in c[2] and s[1]. Finally, we construct the 3-bit result from c[2] and s. The type checker will be happy with this bit-concatenation since the widths add up correctly.

The types Bit#(1), Bit#(2), Bit#(3) etc. are instances of a more general type Bit#(n), where n is a “type variable”. Such types are variously called parameterized types, polymorphic types, or generic types, because they represent a class of types, corresponding to each possible concrete instantiation of the type variable.

We are now ready to generalize our 2-bit ripple-carry adder to a w-bit ripple-carry adder. The circuit is shown in Fig. 2.3, and the BSV code is shown below.

Figure 2.3: w-bit Ripple Carry Adder circuit

w-bit Ripple Carry Adder (almost correct)
1 function Bit#(w+1) addN (Bit#(w) x, Bit#(w) y, Bit#(1) c0);
2    Bit#(w) s;
3    Bit#(w+1) c = {?, c0};
4    for(Integer i=0; i<w; i=i+1) begin
5       let cs = fa (x[i],y[i],c[i]);
6       c[i+1] = cs[1]; s[i] = cs[0];
7    end
8    return {c[w],s};
9 endfunction

The previous 2-bit adder can be seen as an explicit unrolling of this loop with w = 2. Now this generic w-bit adder can be instantiated as adders of specific sizes. For example:

w-bit Ripple Carry Adder (instantiations)
1 function Bit#(33) add32 (Bit#(32) x, Bit#(32) y, Bit#(1) c0)
2    = addN (x,y,c0);
3
4 function Bit#(4) add3 (Bit#(3) x, Bit#(3) y, Bit#(1) c0)
5    = addN (x,y,c0);

This defines add32 and add3 on 32-bit and 3-bit values, respectively. In each case the specific function just calls addN, but type deduction will fix the value of w for that instance, and type checking will verify that the output width is 1 more than the input width.

We will now correct a small notational issue with the addN function given above, having to do with numeric type notations. The BSV compiler, like most compilers, is composed of several distinct phases such as parsing, type checking, static elaboration, analysis and optimization, and code generation. Type checking must analyze, verify and deduce numerical relationships, such as the bit widths in the example above. However, we cannot have arbitrary arithmetic in this activity since we want type-checking to be feasible and efficient (arbitrary arithmetic would make it undecidable). Thus, the numeric calculations performed by the type checker are in a separate, limited universe, and should not be confused with ordinary arithmetic on values. In the type expressions Bit#(1), Bit#(3), Bit#(33), the literals 1, 3, 33, are in the type universe, not the ordinary value universe, even though we use the same syntax as ordinary numeric values. The context will always make this distinction clear—numeric literal types only occur in type expressions, and numeric literal values only occur in ordinary value expressions. Similarly, it should be clear that a type variable like w in a type expression Bit#(w) can only stand for a numeric type, and never a numeric value.

Since type checking (including type deduction) occurs before anything else, type values are known to the compiler before it analyses any piece of code. Thus, it is possible to take

numeric type values from the types universe into the ordinary value universe, and use them there (but not vice versa). The bridge is a built-in pseudo-function called valueOf(n). Here, n is a numeric type expression, and the value of the function is an ordinary Integer value equivalent to the number represented by that type. Numeric type expressions can also be manipulated by numeric type operators like TAdd#(t1,t2) and TMul#(t1,t2), corresponding to addition and multiplication, respectively (other available operators include min, max, exponentiation, base-2 log, and so on). With this in mind, we can fix up our w-bit ripple carry adder code:

w-bit Ripple Carry Adder (corrected)
1 function Bit#(TAdd#(w,1)) addN (Bit#(w) x, Bit#(w) y, Bit#(1) c0);
2    Bit#(w) s;
3    Bit#(TAdd#(w,1)) c = {?, c0};
4    for(Integer i=0; i< valueOf(w); i=i+1) begin
5       let cs = fa (x[i],y[i],c[i]);
6       c[i+1] = cs[1]; s[i] = cs[0];
7    end
8    return {c[w],s};
9 endfunction

The only differences from the earlier, almost correct version are that we have used TAdd#(w,1) instead of w+1 in lines 1 and 3, and we have inserted valueOf() in line 4.

2.2 Static Elaboration and Static Values

The w-bit ripple carry adder in the previous section illustrated how we can use syntactic for-loops to represent repetitive circuit structures. Conceptually, the compiler “unfolds” such loops into an acyclic graph representing a (possibly large) combinational circuit. This process in the compiler is called static elaboration. In fact, in BSV you can even write recursive functions that will be statically unfolded to represent, for example, a tree-shaped circuit (a small sketch appears at the end of this section). Of course, this unfolding needs to terminate statically (during compilation), i.e., the conditions that control the unfolding cannot depend on a dynamic, or run-time value (a value that is only known when the circuit itself runs in hardware, or is simulated in a simulator). For example, if we wrote a for-loop whose termination index was such a run-time value, or a recursive function whose base case test depended on such a run-time value, such a loop/function could not be unfolded statically. Thus, static elaboration of loops and recursion must only depend on static values (values known during compilation).

This distinction of static elaboration vs. dynamic execution is not something that software programmers typically think about, but it is an important topic in hardware design. In software, a recursive function is typically implemented by “unfolding” it into a stack of frames, and this stack grows and shrinks dynamically; in addition, data is communicated between frames, and computation is done on this data. In a corresponding hardware implementation, on the other hand, the function may be statically unfolded by the compiler into a tree of modules (corresponding to pre-elaborating the software stack of frames), and only

the data communication and computation happens dynamically. Similarly, the for-loop in the w-bit adder, if written in C, would typically actually execute as a sequential loop, dynamically, whereas what we have seen is that our loop is statically expanded into repetitive hardware structures, and only the bits flow through it dynamically. Of course, it is equally possible in hardware as well to implement recursive structures and loops dynamically (the former by pushing and popping frames or contexts in memory, the latter with FSMs), mimicking exactly what software does. In fact, later we shall see the BSV “FSM” sub-language where we can express dynamic sequential loops with dynamic bounds. But this discussion is intended to highlight the fact that hardware designers usually think much more carefully about what structures should/will be statically elaborated vs. what remains to execute dynamically.

To emphasize this point a little further, let us take a slight variation of our w-bit adder, in which we do not declare c as a Bit#(TAdd#(w,1)) bit vector:

w-bit Ripple Carry Adder (variation)
1 function Bit#(w+1) addN (Bit#(w) x, Bit#(w) y, Bit#(1) c);
2    Bit#(w) s;
3    for(Integer i=0; i<w; i=i+1) begin
4       let cs = fa (x[i], y[i], c);
5       c = cs[1]; s[i] = cs[0];
6    end
7    return {c,s};
8 endfunction

Note that c is declared as an input parameter; it is “repeatedly updated” in the loop, and it is returned in the final result. The traditional software view of this is that c refers to a location in memory which is repeatedly updated as the loop is traversed sequentially, and whatever value finally remains in that location is returned in the result. In BSV, c is just a name for a value (there is no memory involved here, let alone any concept of c being a name for a location in memory that can be updated). The loop is statically elaborated, and the “update” of c is just a notational device to say that this is what c now means for the rest of the elaboration. In fact this code, and the previous version where c was declared as Bit#(TAdd#(w,1)), are both statically elaborated into identical hardware. Over time, it becomes second nature to the BSV programmer to think of variables in this way, i.e., not the traditional software view as an assignable location, but the purer view of being simply a name for a value during static elaboration (ultimately, a name for a set of wires).1

1 For compiler aficionados: this is a pure functional view, or Static Single Assignment (SSA) view of variables.
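To make the earlier remark about statically unfolded recursion concrete, here is a small illustrative sketch (ours, not from the text): an OR-reduction over the bits of a Bit#(w) value, written recursively. Because lo and hi are static Integer values, the compiler evaluates the conditional and unfolds the recursion during static elaboration, producing a tree of OR gates.

Recursive OR-reduction (illustrative sketch)
1 function Bit#(1) orReduce (Bit#(w) x, Integer lo, Integer hi);
2    Integer mid = (lo + hi) / 2;
3    return ((lo == hi) ? x[lo]
4                       : (orReduce (x, lo, mid) | orReduce (x, mid+1, hi)));
5 endfunction
6
7 // Example use, reducing all w bits of some Bit#(w) value v:
8 //    Bit#(1) any_set = orReduce (v, 0, valueOf(w)-1);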

2.3 Integer types, conversion, extension and truncation

One particular value type in BSV, Integer, is only available as a static type for static variables. Semantically, these are true mathematical unbounded integers, not limited to


any arbitrary bit width like 32, 64, or 64K (of course, they’re ultimately limited by the memory of the system your compiler runs on). However, for dynamic integer values in hardware we typically limit them to a certain bit width, such as:

Various BSV fixed-width signed and unsigned integer types
1 int, Int #(32) // 32-bit signed integers
2 Int #(23) // 23-bit signed integers
3 Int #(w) // signed integer of polymorphic width w
4 UInt #(48) // 48-bit unsigned integers
5 ...

One frequently sees Integer used in statically elaborated loops; the use of this type is a further signal to remind the reader that this is a statically elaborated loop. In keeping with BSV’s philosophy of strong type checking, the compiler never performs automatic (silent) conversion between these various integer types; the user needs to signal a desired conversion explicitly. To convert from Integer to a fixed-width integer type, one applies the fromInteger function. Examples:

fromInteger
1 for(Integer j=0; j< 100; j=j+1) begin
2    Int #(32) x = fromInteger (j /4);
3    UInt #(17) y = fromInteger (valueOf (w) + j);
4    ...
5 end

For truncation of one fixed-size integer type to a shorter one, use the truncate function. For extension to a longer type, use zeroExtend, signExtend and extend. The latter function will zero-extend for unsigned types and sign-extend for signed types. Examples:

extend and truncate
1 Int #(32) x = ...;
2 Int #(64) y = signExtend (x);
3 x = truncate (y);
4 y = extend (x+2); // will sign extend

Finally, wherever you really need to, you can always convert one type to another. For example you might declare a memory address to be of type UInt#(32), but in some cache you may want to treat bits [8:6] as a signed integer bank address. You’d use notation like the following:

pack and unpack
1 typedef UInt #(32) Addr;
2 typedef Int #(3) BankAddr;
3
4 Addr a = ...
5 BankAddr ba = unpack (pack (a)[8:6]);

The first two lines define some type synonyms, i.e., Addr and BankAddr can now be regarded as more readable synonyms for UInt#(32) and Int#(3). The pack function converts the UInt #(32) value a into a Bit#(32) value; we then select bits [8:6] of this, yielding a Bit#(3) value; the unpack function then converts this into an Int#(3) value.

This last discussion is not a compromise on strong static type checking. Type conversions will always be necessary because one always plays application-specific representational tricks (how abstract values and structures are coded in bits) that the compiler cannot possibly know about, particularly in hardware designs. However, by making type conversions explicit in the source code, we eliminate most of the accidental and obscure errors that creep in either due to weak type checking or due to silent implicit conversions.

2.4 Arithmetic-Logic Units (ALUs)

At the heart of any processor is an ALU (perhaps more than one) that performs all the additions, subtractions, multiplications, ANDs, ORs, comparisons, shifts, and so on, the operations on which computations are built. In the previous sections we had a glimpse of building one such operation—an adder. In this section we look at a few more operations, which we then combine into an ALU.

Figure 2.4: ALU schematic symbol

Fig. 2.4 shows the symbol commonly used to represent ALUs in circuit schematics. It has data inputs A and B, and an Op input by which we select the function performed by the ALU. It has a Result data output, and perhaps other outputs like Comp? for the outputs of comparisons, for carry and overflow flags, and so on.

2.4.1 Shift operations

We will build up in stages to a general circuit for shifting a w-bit value right by n places.

Logical Shift Right by a Fixed Amount

Fig. 2.5 shows a circuit for shifting a 4-bit value right by 2 (i.e., towards the least-significant bit), filling in zeroes in the just-vacated most-significant bits (in general the input and output could be w bits). As you can see, there’s really no logic to it (:-)), it’s just a wiring diagram. BSV has a built-in operator for this, inherited from Verilog and SystemVerilog:

Figure 2.5: Logical shift right by 2

Shift operators
1 abcd >> 2 // right shift by 2
2 abcd << 2 // left shift by 2

Even if it were not built-in, it is very easy to write a BSV function for this:

Shift function
1 function Bit #(w) logical_shift_right (Integer n, Bit #(w) arg);
2    Bit #(w) result = 0;
3    for (Integer j=0; j<(valueOf(w)-n); j=j+1)
4       result [j] = arg [j+n];
5    return result;
6 endfunction

This is statically elaborated to produce exactly the same circuit. Other kinds of shifts—arithmetic shifts, rotations, etc.—are similar.
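For instance, an arithmetic shift right by a fixed amount, which fills the vacated high bits with copies of the sign bit instead of zeroes, can be written in the same style. The following is just an illustrative sketch (ours, not from the text), assuming 0 ≤ n < w:

Arithmetic shift function (illustrative sketch)
1 function Bit #(w) arithmetic_shift_right (Integer n, Bit #(w) arg);
2    Integer msb = valueOf(w) - 1;
3    Bit #(w) result = 0;
4    // bits that still come from the argument
5    for (Integer j=0; j<(valueOf(w)-n); j=j+1)
6       result [j] = arg [j+n];
7    // the vacated high bits are filled with copies of the sign bit
8    for (Integer j=valueOf(w)-n; j<=msb; j=j+1)
9       result [j] = arg [msb];
10   return result;
11 endfunction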

Multiplexers

Figure 2.6: 2-way multiplexer

A 2-way multiplexer (“mux”) is a circuit that takes two inputs of type t and forwards one of them into the output of type t. The choice is made with a boolean control input. The left side of Fig. 2.6 shows an abstract symbol for a mux: one of the two inputs A or B is forwarded to the output depending on the value of the control input S. The right side of the figure shows an implementation in terms of AND and OR gates, for a 1-bit wide, 2-way mux (a w-bit wide mux would just replicate this w times, and tie all the control inputs together). Larger muxes that select 1 out of n inputs (n > 2) can in principle be created merely by cascading 2-way muxes. For example, Fig. 2.7 shows a 4-way mux created using 2-way

Figure 2.7: 4-way multiplexer

muxes. Of course, the previous boolean control line now generalizes to Bit#(ln) where ln is the base-2 logarithm of n (rounded up to the next integer), i.e., we need a Bit#(ln) control input to identify one of the n inputs to forward.

In BSV code we don’t directly talk about muxes, but use classical programming language conditional expressions, if-then-else statements, or case statements:

Multiplexers
1 result = ( s ? a : b );
2
3 if (s)
4    result = a;
5 else
6    result = b;
7
8 case (s2)
9    0: result = a;
10   1: result = b;
11   2: result = c;
12   3: result = d;
13 endcase

Of course, unlike conditional expressions, if-then-else and case statements can be more complex because you can have multiple statements in each of the arms. The differences in style are mostly a matter of readability and personal preference, since the compiler will produce the same hardware circuits anyway. Both bsc and downstream tools perform a lot of optimization on muxes, so there is generally no hardware advantage to writing it one way or another.

Notice that we said that the inputs and output of the mux are of some type t; we did not say Bit#(w). Indeed, Bit#(w) is just a special case. In general, it is good style to preserve abstract types (Address, EthernetPacket, CRC, IPAddress, ...) rather than descending into bits all the time, in order to preserve readability, and the robustness that comes with strong type checking.

Figure 2.8: Shift by s = 0,1,2,3

Logical Shift Right by n, a Dynamic Amount

We can now see how to implement a circuit that shifts a w-bit input value by s, where s is a dynamic value. The general scheme is illustrated in Fig. 2.8. Suppose the shift-amount is s, of type Bit#(n). Then, we cascade n stages as follows:
• Stage 0: if s[0] is 0, pass through, else shift by 1 (2^0)
• Stage 1: if s[1] is 0, pass through, else shift by 2 (2^1)
• Stage 2: if s[2] is 0, pass through, else shift by 4 (2^2)
• ...
• Stage j: if s[j] is 0, pass through, else shift by 2^j
• ...
You can see that the total amount shifted is exactly what is specified by the value of s! Each stage is a simple application of constant shifting and multiplexing controlled by one bit of s. And the sequence of stages can be written as a for-loop. You will be writing this BSV program as an exercise in the laboratory.
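Purely to make the staging idea concrete (and not as the official solution to the lab exercise), here is one rough sketch of ours for a 32-bit value with a 5-bit shift amount, reusing the logical_shift_right function defined earlier:

Dynamic logical shift right (illustrative sketch)
1 function Bit #(32) shift_right_dyn (Bit #(32) x, Bit #(5) s);
2    Bit #(32) r = x;
3    Integer amt = 1; // 2^j for stage j
4    for (Integer j=0; j<5; j=j+1) begin
5       // stage j: pass through, or shift by 2^j, selected by bit j of s
6       r = (s[j] == 0) ? r : logical_shift_right (amt, r);
7       amt = amt * 2;
8    end
9    return r;
10 endfunction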

2.4.2 Enumerated types for expressing ALU opcodes

An ALU is a multi-purpose box. It typically has two input ports and an output port, and an additional control input port, the opcode, which specifies what function we want the ALU to perform on the inputs: addition, subtraction, multiplication, shifting, testing for zero, ... and so on. Of course the control input is represented in hardware as some number of bits, but following our preference to use modern programming language practice for better readability and robustness due to type checking, we define these as enumerated types with symbolic names.

Let’s start with a simple, non-ALU example of enumerated types. Suppose we had a variable to identify colors on a screen—it takes one of three values: Red, Blue and Green. Of course, in hardware these will need some encoding, perhaps 00, 01 and 10, respectively. But in BSV we typically just say:

Enumerated types
1 typedef enum {Red, Blue, Green} Color
2    deriving (Bits, Eq);

Line 1 is an enumerated type just like in C or C++, defining a new type called Color which has three constants called Red, Blue and Green. Line 2 is a BSV incantation that tells the compiler to pick a canonical bit representation for these constants, and to define the == and the != operators for this type, so we can compare two expressions of type Color for equality and inequality. In fact, in this case bsc will actually choose the representations 00, 01 and 10, respectively, for these constants, but (a) that is an internal detail that need not concern us, and (b) BSV has mechanisms to choose alternate encodings if we so wish.

The real payoff comes in strong type checking. Suppose, elsewhere, we have another type representing traffic light colors:

Enumerated types
1 typedef enum {Green, Yellow, Red} TrafficLightColor
2    deriving (Bits, Eq);

These, too, will be represented using 00, 01 and 10, respectively, but strong typechecking will ensure that we never accidentally mix up the Color values with TrafficLightColor values, even though both are represented using 2 bits. Any attempt to compare values across these types, or to pass an argument of one of these types when the other is expected, will be caught as a static type checking error by bsc.

Processors execute instructions, which nowadays are often encoded in 32 bits. A few fields in a 32-bit instruction usually refer to ALU opcodes of a few classes. We define them using enumerated types like this:

Enumerated types for ALU opcodes
1 typedef enum {Eq, Neq, Le, Lt, Ge, Gt, AT, NT} BrFunc
2    deriving (Bits, Eq);
3
4 typedef enum {Add, Sub, And, Or, Xor, Nor,
5               Slt, Sltu, LShift, RShift, Sra} AluFunc
6    deriving (Bits, Eq);

The first definition is for comparison operators, and the second one is for arithmetic, logic and shift operators.

2.4.3 Combinational ALUs

A combinational ALU can be regarded just as a composition of the various types and expressions we have seen thus far. The circuit is shown pictorially in Fig. 2.9. In BSV code, parts of this circuit are expressed using functions like the following, which can then be combined into the full ALU:

Figure 2.9: A combinational ALU

ALU partial function for Arith, Logic, Shift ops
1 function Data alu (Data a, Data b, AluFunc func);
2    Data res = case(func)
3       Add : (a + b);
4       Sub : (a - b);
5       And : (a & b);
6       Or : (a | b);
7       Xor : (a ^ b);
8       Nor : ~(a | b);
9       Slt : zeroExtend( pack( signedLT(a, b) ) );
10      Sltu : zeroExtend( pack( a < b ) );
11      LShift: (a << b[4:0]);
12      RShift: (a >> b[4:0]);
13      Sra : signedShiftRight(a, b[4:0]);
14    endcase;
15    return res;
16 endfunction

ALU partial function for Comparison
1 function Bool aluBr (Data a, Data b, BrFunc brFunc);
2    Bool brTaken = case(brFunc)
3       Eq : (a == b);
4       Neq : (a != b);
5       Le : signedLE(a, 0);
6       Lt : signedLT(a, 0);
7       Ge : signedGE(a, 0);
8       Gt : signedGT(a, 0);
9       AT : True;
10      NT : False;
11    endcase;
12    return brTaken;
13 endfunction

2.4.4 Multiplication

We close this section with another small exercise, one of the arithmetic functions we may want in the ALU: multiplying an m-bit number and an n-bit number to produce an (m+n)-bit result. Note that some processors do not have a multiplier! They just implement a multiplication-by-repeated-addition algorithm in software. However, in most situations, where multiplication performance is important, one would want a hardware implementation.

Suppose we were multiplying two 4-bit values together. Our high-school multiplication algorithm (translated from decimal to binary) looks like this:

Multiplication by repeated addition
1          1 1 0 1    // Multiplicand, b
2          1 0 1 1    // Multiplier, a
3    ---------------
4          1 1 0 1    // b x a[0] (== b), shifted by 0
5        1 1 0 1      // b x a[1] (== b), shifted by 1
6      0 0 0 0        // b x a[2] (== 0), shifted by 2
7    1 1 0 1          // b x a[3] (== b), shifted by 3
8    ---------------
9    1 0 0 0 1 1 1 1

Thus, the jth partial sum is (a[j]==0 ? 0 : b) << j, and the overall sum is just a for-loop to add these sums. Fig. 2.10 illustrates the circuit, and here is the BSV code to express this:

BSV code for combinational multiplication
1 function Bit#(64) mul32(Bit#(32) a, Bit#(32) b);
2    Bit#(32) prod = 0;
3    Bit#(32) tp = 0;
4    for (Integer i=0; i<32; i=i+1) begin
5       Bit#(32) m = (a[i]==0)? 0 : b;
6       Bit#(33) sum = add32(m,tp,0);
7       prod[i] = sum[0];
8       tp = truncateLSB(sum);
9    end
10   return {tp,prod};
11 endfunction

Figure 2.10: A combinational multiplier

2.5 Summary, and a word about efficient ALUs

Figure 2.11: Various combinational circuits

Fig. 2.11 shows symbols for various combinational circuits. We have already discussed multiplexers and ALUs. A demultiplexer (“demux”) transmits its input value to one of n output ports identified by the Sel input (the other output ports are typically forced to 0). A decoder takes an input A of log n bits carrying some value 0 ≤ j ≤ n − 1. It has n outputs such that the jth output is 1 and all the other outputs are 0. We also say that the outputs represent a “one-hot” bit vector, because exactly one of the outputs is 1. Thus, a decoder is equivalent to a demux where the demux’s Sel input is the decoder’s A, and the demux’s A input is the constant 1-bit value 1.

The simple examples in this chapter of a ripple-carry adder and a simple multiplier are not meant to stand as examples of efficient design. They are merely tutorial examples to demystify ALUs for the new student of computer architecture. Both our combinational ripple-carry adder and our combinational repeated-addition multiplier have very long chains of gates, and wider inputs and outputs make this worse. Long combinational paths, in turn, restrict the speed of the clocks of the circuits in which we embed these ALUs, and this ultimately affects processor speeds. We may have to work, instead, with pipelined multipliers that take multiple clock cycles to compute a result; we will discuss clocks, pipelines and so on in subsequent chapters. In addition, modern processors also have hardware implementations for floating point operations, fixed point operations, transcendental functions, and more.

A further concern for the modern processor designer is power consumption. Circuit design choices for the ALU can affect the power consumed per operation, and other controls may be able to turn off power to portions of the ALU that are not currently active. In fact, the topic of efficient computer arithmetic and ALUs has a deep and long history; people have devoted their careers to it, and whole journals and conferences are devoted to it; here, we have barely scratched the surface.

Chapter 3

Sequential (Stateful) Circuits and Modules

3.1 Registers

3.1.1 Space and time

Recall the combinational “multiply” function from Sec. 2.4.4:

BSV code for combinational multiplication
1 function Bit#(64) mul32(Bit#(32) a, Bit#(32) b);
2    Bit#(32) prod = 0;
3    Bit#(32) tp = 0;
4    for (Integer i=0; i<32; i=i+1) begin
5       Bit#(32) m = (a[i]==0)? 0 : b;
6       Bit#(33) sum = add32(m,tp,0);
7       prod[i] = sum[0];
8       tp = truncateLSB(sum);
9    end
10   return {tp,prod};
11 endfunction

Static elaboration will unfold the loop into a combinational circuit that contains 32 instances of the add32 circuit. The circuit is elaborated in space, i.e., it is eventually laid out in silicon on your ASIC or FPGA.

An alternative is to elaborate it in time: have only a single instance of the add32 circuit which is used repeatedly 32 times in a temporal sequence of steps. To do this, we need a device called a “register” that will hold the partial results from one step so that they can be used as inputs to add32 in the next step. We also use the words “state element” and “sequential element” for a register, because it is in different states over time.


Figure 3.1: An edge-triggered D flip-flop and an example input-output waveform

3.1.2 D flip-flops

The left side of Fig. 3.1 shows the symbol for the most commonly used state element, an edge-triggered D flip-flop (DFF). Although the term “register” is more general, we typically use the word to refer to such a DFF. It has two inputs, D (data) and C (clock), and an output Q. The clock input C is typically a regular, square-wave oscillation. On each rising edge of C, the register samples the current value of D, and this value appears on the Q output “shortly” (needs to propagate from D to Q through internal circuits which typically have a small delay compared to the time period of C). In the waveforms on the right side of Fig. 3.1, at the first rising edge of C, D has the value 1; the value of Q shortly becomes 1. At the next rising edge, D has value 0, and Q shortly becomes 0. The third rising edge illustrates a bad situation, where D is still changing at the rising edge. Here, it is possible for the DFF to enter the so-called “meta-stable” state where, like Hamlet, it vacillates over declaring the sampled value to be 0 or not 0 (i.e., 1). During this time, the Q output may not even be a valid digital value—its voltage may be in the “forbidden” range between the thresholds that we define as “0” and “1”. As you can imagine, a downstream circuit does not get a clear, unambiguous 0 or 1, and so the problem can propagate through several levels of circuits. Worse, meta-stability may persist for an arbitrarily long time before it finally settles into the 0 or 1 state. For this reason, digital circuits are usually carefully designed to avoid this situation—clock periods are set long enough so that D inputs are always stable at 0 or 1 at the rising clock edge.

Figure 3.2: A D flip-flop with Write-Enable

D flip-flops often have an additional “Write Enable” input, illustrated in Fig. 3.2. Again, C is a continuous square-wave oscillation. However, the D input is sampled only when the EN input is true. Look at the example waveform on the right of the figure. Assume that Q is initially 0 (from some previous sampling of D). At the first rising clock edge, D is 1. c 2012-2013 Arvind, R.S.Nikhil, J.Emer and M.Vijayaraghavan; All Rights Reserved 3-3

However, since EN is 0, it is not sampled, and Q retains its previous value of 0. At the second clock edge, EN is 1, and so the value of D (1) is sampled, and arrives at Q. At the third clock edge, EN is 0, so D (0) is not sampled, and Q remains at 1.

Figure 3.3: Implementing a D flip-flop with Write-Enable

Fig. 3.3 shows two attempts at implementing the Write-Enable feature using an ordinary D flip-flop. The attempt on the left may at first seem reasonable: just disable the clock when EN is 0, using an AND gate. But this is very dangerous, because the EN signal itself may be generated by some upstream circuit using the same clock C, and so it may change at roughly the same time as the clock edge. This reopens the door not only to metastability, but also to so-called “glitches”, where a slight mismatch in the timing of the edges of EN and C can cause a momentary pulse to appear at the output of the AND gate, which may cause the DFF to sample the current D value. In general, in digital circuits, it is very dangerous to mess around with clocks as if they were ordinary logic signals. In large, complex, high-speed digital circuits, the clock signals are very carefully engineered by experienced specialists, and are kept entirely separate from the data paths of the circuit. The right side of Fig. 3.3 shows a safer way to implement the EN feature, where we do not tinker with the clock. In BSV, clocks have a distinct data type Clock, and strong type checking ensures that we can never accidentally use clocks in ordinary computation expressions. D flip-flops may also have a “Reset” signal which re-initializes the flip-flop to a fixed value, say 0. Reset signals may be synchronous (they take effect on the next rising clock edge) or asynchronous (they take effect immediately, irrespective of clock edges).

3.1.3 Registers

Figure 3.4: An n-bit register built with D flip-flops

Fig. 3.4 shows an n-bit “register” built with D flip-flops. All the clocks are tied together, as are all the EN signals, and we can think of the register as sampling an n-bit input value on the rising clock edge. In BSV, one declares and instantiates a register like this:

Declaring and Instantiating Registers
1 Reg #(Bit #(32)) s <- mkRegU;
2 Reg #(UInt #(32)) i <- mkReg (17);

The first line declares s to be a register that contains 32-bit values, with unspecified initial value (“U” for Uninitialized). The second line declares i to be a register containing a 32-bit unsigned integer with initial value 17 (when the circuit is reset, the register will contain 17). Registers are assigned inside rules and methods (we’ll describe these shortly) using a special assignment syntax (the same as Verilog’s “delayed assignment” syntax):

Register assignment statements
1 s <= s & 32'h000F000F;
2 i <= i + 10;

In the first line, at a clock edge, the old value of register s (before the clock edge) is bitwise ANDed with a 32-bit hexadecimal constant value, and stored back into the register s (visible at the next clock edge). In the second line, i is incremented by 10. In BSV, all registers are strongly typed. Thus, even though s and i both contain 32-bit values, the following assignment will result in a type checking error, because Bit#(32) and UInt#(32) are different types.

Type checking error on strongly typed registers
1 s <= i;

BSV registers can hold any type, including data structures. Example:

Registers for non-numeric types
1 Reg #(EthernetHeader) ehdr <- mkRegU;
2 FIFOF #(EthernetPacket) input_queue <- mkFIFOF;
3 ...
4 rule rl_foo;
5    ehdr <= input_queue.first.header;
6 endrule

3.2 Sequential loops with registers

Consider the following C code that repeatedly applies some function f to an initial value s0, 32 times:

A small iteration example (C code)
1 int s = s0;
2 for (int i=0; i<32; i++)
3    s = f (s);

Figure 3.5: Circuit implementation of C iteration example

Fig. 3.5 shows a circuit that performs this function. It contains two registers holding i and s. When start, an external input signal pulse, is received, the muxes select the initial values 0 and s0, respectively; the EN signals are asserted, and the initial values are loaded into the registers. At this point, the notDone signal is true since i is indeed < 32. On subsequent clocks, the i register is incremented, and the s register is updated by f(s). When i reaches 32, the notDone signal becomes false, and so the EN signals are no longer true, and there is no further activity in the circuit (until the next start pulse arrives). The following BSV code describes this circuit:

A small iteration example (BSV code)
1 Reg#(Bit#(32)) s <- mkRegU();
2 Reg#(Bit#(6)) i <- mkReg(32);
3
4 rule step if (i<32);
5    s <= f(s);
6    i <= i+1;
7 endrule

The rule on lines 4-7 has a name (step), a rule condition or “guard” (i<32), and a rule body which is an “Action”. The action itself consists of two sub-actions, s <= f(s) and i <= i+1. It is best to think of a rule as an instantaneous action (zero time). A rule only executes if its condition is true. When it executes, all its actions happen simultaneously and instantaneously. Any value that is read from a register (such as s and i on the right-hand sides of the register assignments) will be the value prior to that instant. After the instant, the registers contain the values they were assigned. Since all the actions in a rule are simultaneous, the textual ordering between the above two actions is irrelevant; we could just as well have written:

Actions happen simultaneously (no textual order)
1 i <= i+1;
2 s <= f(s);
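A classic way to see that all reads in a rule observe the register values from just before the rule fires is the swap below. This small example is ours, not from the text, but it uses only the constructs just described: two registers are exchanged in a single rule, with no temporary needed.

Swapping two registers in one rule (illustrative sketch)
1 Reg#(Bit#(32)) x <- mkReg (1);
2 Reg#(Bit#(32)) y <- mkReg (2);
3
4 rule swap;
5    x <= y; // both right-hand sides read the values from before this instant,
6    y <= x; // so after the rule fires, x and y have been exchanged
7 endrule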


Thus, the step rule acts as a sequential loop, repeatedly updating registers i and s until (i>=32). When synthesized by bsc, we get the circuit of Fig. 3.5.

A small BSV nuance regarding types

Why did we declare i to have the type Bit#(6) instead of Bit#(5)? It’s because in the expression (i<32), the literal value 32 needs 6 bits, and the comparison operator < expects both its operands to have the same type. If we had declared i to have type Bit#(5) we would have got a type checking error in the expression (i<32). Suppose we try changing the condition to (i<=31), since the literal 31 only needs 5 bits?

Types subtlety
1 ...
2 Reg#(Bit#(5)) i <- ...
3
4 rule step if (i<=31);
5    ...
6    i <= i+1;
7 endrule

This program will type check and run, but it will never terminate! The reason is that when i has the value 31 and we compute i+1, it simply wraps around to the value 0. Thus, the condition i<=31 is always true, and so the rule will fire forever. Note that the same issue could occur in a C program as well, but since we usually use 32-bit arithmetic in C programs, we rarely encounter it. We need to be much more sensitive to this in hardware designs, because we usually try to use the minimum number of bits adequate for the job.
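If we really wanted to keep i at 5 bits, one workaround (a sketch of ours, not from the text; s is declared as before) is to track completion in a separate Boolean register, so that the wrap-around of i no longer matters:

Keeping a 5-bit counter (sketch)
Reg#(Bit#(5)) i    <- mkReg(0);
Reg#(Bool)    done <- mkReg(True);   // True = quiescent; something must clear it to start

rule step if (!done);
   s <= f(s);
   done <= (i == 31);   // stop after the step for i = 31
   i <= i + 1;          // wraps to 0 here, which is now harmless
endrule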

3.3 Sequential version of the multiply operator

Looking at the combinational multiply operator code at the beginning of this chapter, we see that the values that change during the loop are i, prod and tp; we will need to hold these in registers when writing a sequential version of the loop:

BSV code for sequential multiplication
 1 Reg #(Bit#(32)) a <- mkRegU;
 2 Reg #(Bit#(32)) b <- mkRegU;
 3 Reg #(Bit#(32)) prod <- mkRegU;
 4 Reg #(Bit#(32)) tp <- mkRegU;
 5 Reg #(Bit#(6)) i <- mkReg (32);

 6
 7 rule mulStep if (i<32);
 8    Bit#(32) m = (a[i]==0)? 0 : b;
 9    Bit#(33) sum = add32(m,tp,0);
10    prod[i] <= sum[0]; // (not quite kosher BSV)
11    tp <= truncateLSB(sum);
12    i <= i + 1;
13 endrule

The register i is initialized to 32, which is the quiescent state of the circuit (no activity). To start the computation, something has to load a and b with the values to be multiplied and load prod, tp and i with their initial value of 0 (we will see this initialization later). Then, the rule fires repeatedly as long as (i<32). The first four lines of the body of the rule are identical to the body of the for-loop in the combinational function at the start of this chapter, except that the assignments to prod[i] and tp have been changed to register assignments.

Although functionally OK, we can improve it to a more efficient circuit by eliminating dynamic indexing. Consider the expression a[i]. This is a 32-way mux: 32 1-bit input wires connected to the bits of a, with i as the mux selector. However, since i is a simple incrementing series, we could instead repeatedly shift a right by 1, and always look at the least significant bit (LSB), effectively looking at a[0], a[1], a[2], ... Shifting by 1 requires no gates (it’s just wires!), and so we eliminate a mux. Similarly, when we assign to prod[i], we need a decoder to route the value into the ith bit of prod. Instead, we could repeatedly shift prod right by 1, and insert the new bit at the most significant bit (MSB). This eliminates the decoder. Here is the fixed-up code:

BSV code for sequential multiplication
 1 Reg #(Bit#(32)) a <- mkRegU;
 2 Reg #(Bit#(32)) b <- mkRegU;
 3 Reg #(Bit#(32)) prod <- mkRegU;
 4 Reg #(Bit#(32)) tp <- mkRegU;
 5 Reg #(Bit#(6)) i <- mkReg (32);

 6
 7 rule mulStep if (i<32);
 8    Bit#(32) m = (a[0]==0)? 0 : b;   // only look at LSB of a
 9    a <= a >> 1;                     // shift a by 1
10    Bit#(33) sum = add32(m,tp,0);
11    prod <= { sum[0], prod [31:1] }; // shift prod by 1, insert at MSB
12    tp <= truncateLSB(sum);
13    i <= i + 1;
14 endrule

Fig. 3.6 shows the circuit described by the BSV code. Compared to the combinational version:

• The number of instances of the add32 operator circuit has been reduced from 32 to 1, but we have added some registers and muxes.
• The longest combinational path has been reduced from 32 cascaded add32s to one add32 plus a few muxes.
• To complete each multiplication, the combinational version has a combinational delay corresponding to its longest path. The sequential version takes 32 clock cycles.

Figure 3.6: Sequential multiplication circuit

3.4 Modules and Interfaces

Just as, in object-oriented programming (OOP), we organize programs into classes/objects that interact via method calls, similarly in BSV we organize our designs into modules that interact using methods. And in BSV, just like OOP, we separate the concept of the interface of a module—“What methods does it implement and what are the argument and result types?”—from the module itself—“How does it implement the methods?”. For example, if we want to package our multiplier into a module that can be instantiated (reused) multiple times, we first think about the interface it should present to the external world. Here is a proposed interface:

Multiplier module interface
1 interface Multiply32;
2    method Action start (Bit#(32) a, Bit#(32) b);
3    method Bit#(64) result;
4 endinterface

A module with this interface offers two methods that can be invoked from outside. The start method takes two arguments a and b, and has Action type, namely it just performs some action (this method does not return any result). In this case, it just kicks off the internal computation to multiply a and b, which may take many steps. The result method has no arguments, and returns a value of type Bit#(64). So far, this looks just like a conventional object-oriented language (with perhaps different syntax details). However, in BSV, a method is always invoked from a rule, either directly or indirectly from another method which in turn is invoked by a rule, and so on. In other words, all method invocations originate in a rule. Just like a rule has a rule condition, a method may also have a method condition which limits when it may be invoked. This rule-based semantics is the key difference from traditional object-oriented languages.

These rule-based semantics also have a direct hardware interpretation. Fig. 3.7 illustrates how BSV methods are mapped into hardware. Every method is a bundle of input and output signals. Every method has one output signal called the “ready” signal (RDY) corresponding to the method condition. A rule in the external environment is allowed to invoke the method only when RDY is true. Method arguments become input signals (like a and b), and method results become output signals. Finally, Action methods like start have an input “enable” signal (EN) which, when asserted by the external rule, causes the method’s action inside the module to happen.

Figure 3.7: Hardware interpretation of interface methods

We can now see how our multiplier is packaged into a module.

Multiplier module
 1 module mkMultiply32 (Multiply32);
 2    Reg #(Bit#(32)) a <- mkRegU;
 3    Reg #(Bit#(32)) b <- mkRegU;
 4    Reg #(Bit#(32)) prod <- mkRegU;
 5    Reg #(Bit#(32)) tp <- mkRegU;
 6    Reg #(Bit#(6)) i <- mkReg (32);

 7
 8    rule mulStep if (i<32);
 9       Bit#(32) m = ( (a[0]==0)? 0 : b );
10       a <= a >> 1;
11       Bit#(33) sum = add32(m,tp,0);
12       prod <= { sum[0], prod [31:1] };
13       tp <= truncateLSB(sum);
14       i <= i + 1;
15    endrule

16
17    method Action start(Bit#(32) aIn, Bit#(32) bIn) if (i==32);
18       a <= aIn;
19       b <= bIn;
20       i <= 0;
21       tp <= 0;
22       prod <= 0;
23    endmethod

24
25    method Bit#(64) result if (i == 32);
26       return {tp,prod};
27    endmethod
28 endmodule

In line 1, we declare the module mkMultiply32 offering the Multiply32 interface. Lines 2-15 are identical to what we developed in the last section. Lines 17-23 implement the start method. The method condition, or guard, is (i==32). When invoked, it initializes all the registers. Lines 25-27 implement the result method. Its method condition or guard is also (i==32). When invoked it returns {tp,prod} as its result. BSV modules typically follow this organization: internal state (here, lines 2-6), followed by internal behavior (here, lines 8-15), followed by interface definitions (here, lines 17-27). As a programming convention, we typically write module names as mk..., and pronounce the first syllable as “make”, reflecting the fact that a module can be instantiated multiple times.
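As an illustration of how such a module is used from the outside (a sketch of ours; the client-side names m, busy, x and y are hypothetical), note that the RDY signals of start and result become implicit conditions on the client’s rules, so the client never has to test explicitly whether the multiplier is free or finished:

Using the multiplier from client rules (sketch)
Multiply32 m <- mkMultiply32;
Reg#(Bool) busy <- mkReg(False);

rule rl_start (!busy);
   m.start(x, y);       // can only fire when the multiplier is idle (i==32)
   busy <= True;
endrule

rule rl_finish (busy);
   let p = m.result;    // can only fire when the product is ready (i==32 again)
   // ... use the 64-bit product p ...
   busy <= False;
endrule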

Figure 3.8: Multiplier module circuit

Fig. 3.8 shows the overall circuit of the module described by the BSV code. It pulls together all the individual pieces we have discussed so far. To emphasize the distinction between an interface and a module that offers that interface, here is another module that implements exactly the same interface but uses our earlier combinational function.

Multiplier module, alternative
 1 module mkMultiply32 (Multiply32);
 2    Reg #(Bit#(64)) rg_result <- mkRegU;
 3    Reg #(Bool) done <- mkReg (False);

 4
 5    function Bit#(64) mul32(Bit#(32) a, Bit#(32) b);
 6       Bit#(32) prod = 0;
 7       Bit#(32) tp = 0;
 8       for (Integer i=0; i<32; i=i+1) begin
 9          Bit#(32) m = ( (a[i]==0)? 0 : b );
10          Bit#(33) sum = add32(m,tp,0);
11          prod[i] = sum[0];
12          tp = truncateLSB(sum);
13       end
14       return {tp,prod};
15    endfunction

16
17    method Action start(Bit#(32) aIn, Bit#(32) bIn) if (! done);
18       rg_result <= mul32 (aIn, bIn);
19       done <= True;
20    endmethod

21
22    method Bit#(64) result if (done);
23       return rg_result;
24    endmethod
25 endmodule

Lines 5-15 are the same combinational function we saw earlier. The start method simply invokes the combinational function on its arguments and stores the output in the rg_result register. The result method simply returns the value in the register, when the computation is done. Similarly, one can have many other implementations of the same multiplier interface, with different internal algorithms that offer various efficiency tradeoffs (area, power, latency, throughput):

Multiplier module, more alternatives
1 module mkBlockMultiply (Multiply);
2 module mkBoothMultiply (Multiply);

3.4.1 Polymorphic multiply module

Let us now generalize our multiplier circuit so that it doesn’t just work with 32-bit arguments, producing a 64-bit result, but works with n-bit arguments and produces a 2n-bit result. Thus, we could use the same module in other environments which may require, say, a 13-bit multiplier or a 24-bit multiplier. The interface declaration changes to the following:

Polymorphic multiplier module interface
1 interface Multiply #(numeric type tn);
2    method Action start (Bit#(tn) a, Bit#(tn) b);
3    method Bit#(TAdd#(tn,tn)) result;
4 endinterface

And the module definition changes to the following:

Polymorphic multiplier module
 1 module mkMultiply (Multiply #(tn));
 2    Integer n = valueOf (tn);

 3
 4    Reg #(Bit#(tn)) a <- mkRegU;
 5    Reg #(Bit#(tn)) b <- mkRegU;
 6    Reg #(Bit#(tn)) prod <- mkRegU;
 7    Reg #(Bit#(tn)) tp <- mkRegU;
 8    Reg #(Bit#(TAdd#(1,TLog#(tn)))) i <- mkReg (fromInteger(n));

 9
10    rule mulStep if (i < fromInteger(n));
11       Bit#(tn) m = (a[0]==0)? 0 : b;
12       a <= a >> 1;
13       Bit#(TAdd#(tn,1)) sum = addN(m,tp,0);
14       prod <= { sum[0], prod [n-1:1] };
15       tp <= truncateLSB(sum);
16       i <= i + 1;
17    endrule

18
19    method Action start(Bit#(tn) aIn, Bit#(tn) bIn) if (i==fromInteger(n));
20       a <= aIn;
21       b <= bIn;
22       i <= 0;
23       tp <= 0;
24       prod <= 0;
25    endmethod

26
27    method Bit#(TAdd#(tn,tn)) result if (i==fromInteger(n));
28       return {tp,prod};
29    endmethod
30 endmodule

Note the use of the pseudo-function valueOf to convert from an integer in the type domain to an integer in the value domain; the use of the function fromInteger to convert from type Integer to type Bit#(...), and the use of type-domain operators like TAdd, TLog and so on to derive related numeric types.
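For example (a sketch of ours, with hypothetical instance names), the same module can now be instantiated at whatever width a particular design needs; the width is fixed by the type at the instantiation site:

Instantiating the polymorphic multiplier (sketch)
Multiply #(13) m13 <- mkMultiply;   // 13-bit arguments, Bit#(26) result
Multiply #(24) m24 <- mkMultiply;   // 24-bit arguments, Bit#(48) result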

3.5 Register files

Figure 3.9: A register file with four 8-bit registers

Register files are arrays of registers, with a way to selectively read and write an individual register identified by an index. Fig. 3.9 illustrates such a circuit component. The ReadSel signal is an index that identifies one of the n registers in the register file. The data in that register is presented on the ReadData output. This is usually a combinational function, just like reading a single register. The WriteSel signal identifies one of the registers, and if the WE enable signal is true, then WriteData is stored into that selected register.

Figure 3.10: Register File implementation

Fig. 3.10 illustrates how a register file can be implemented. The read and write circuits are entirely separate. The read circuit is just a mux choosing the Q outputs of one of the registers. In the write circuit, Write Data is fed to the D inputs of all the registers. The Write Enable signal is fed to only one selected register, which captures the Write Data.

Figure 3.11: Multi-port Register File

The read and write interfaces of the register file are called “ports”. Many register files have more than one read and/or write port. Fig. 3.11 illustrates a register file with two read ports and one write port. It is easy to see how to extend the implementation of Fig. 3.10 for this—simply replicate the mux that implements the read circuit. When implementing multiple write ports, one has to define what it means if both Write Selects are the same, i.e., both ports want to write to the same register, possibly with different values. We may declare this illegal, or a no-op, or fix a priority so that one of them is ignored. All these schemes are easy to implement. Why are multi-port register files interesting? In a CPU, the architectural register set may be implemented directly as a register file. Many instructions (such as ADD) read two registers, perform an op, and write the result to a register. If we wish to execute one such instruction on every clock cycle, then clearly we need a 2-read port, 1-write port register file. If we wish to execute two such instructions in every clock (as happens in modern “superscalar” designs), we’ll need a 4-read port, 2-write port register file.

Register files are declared and instantiated in BSV like this:

Declaring and Instantiating Register Files
1 import RegFile :: *;
2 ...
3 module ...
4    ...
5    RegFile #(Bit#(5), Bit#(32)) rf1 <- mkRegFileFull;
6    RegFile #(Bit#(5), Bit#(32)) rf2 <- mkRegFile (1,31);
7    ...
8 endmodule

The first line is typically found at the top of a source file, and imports the BSV library package for register files. The RegFile declaration lines (found inside a module in the file) declare rf1 to be a register file that is indexed by a 5-bit value (and so can contain 2^5 = 32 registers), and whose registers contain 32-bit values. The mkRegFileFull module creates a full complement of registers for the 5-bit index, i.e., 32 registers. The next line is similar, except that we only allocate 31 registers, indexed from 1 to 31 (for example, in many architectures, register 0 is defined as a “constant 0” register, and so we may choose not to have a physical register for it). Register files are accessed using the sel and upd methods. The following example shows reading from registers 2 and 5, adding them, and storing the result back in register 13.

Using a register file
1 rule rl_add_instruction;
2    let arg1 = rf2.sel (2);
3    let arg2 = rf2.sel (5);
4    rf2.upd (13, arg1 + arg2);
5 endrule

3.6 Memories and BRAMs

A register file is essentially a small “memory”. The Read Select or Write Select input is an address identifying one location in the memory, and that location is read and/or written. However, building large memories out of register files is expensive in silicon area and power consumption. It is rare to see register files with more than a few dozen locations. Typically, larger memories are built with special primitives that have been carefully engineered to be very dense (low area) and to require much less power. The next step up in size from register files is on-chip SRAMs (Static RAMs), typically holding from tens of kilobytes to a few megabytes. They typically do not have more than one or two ports. The BSV library contains a number of SRAM modules which you get when you import the BRAM package:

The BRAM package
1 import BRAM :: *;

When you synthesize your BSV code for FPGAs, these will typically be “inferred” by downstream FPGA synthesis tools as “Block RAMs”, which are built-in high-density SRAM structures on FPGAs. When you synthesize for ASICs, these will typically be “inferred” as SRAMs by downstream ASIC synthesis tools. For even larger memories, you typically have to go to off-chip SRAMs (multi-megabytes) and DRAMs (Dynamic RAMs) in the gigabyte range. On chip, you will typically have a “memory interface” circuit that communicates with the off-chip memories. Unlike register files, SRAMs and DRAMs usually do not have combinational reads, i.e., data is only available 1 or more clock cycles after the address has been presented. And they typically do not have multiple ports for reading or writing.
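The following sketch (ours, not from the text) illustrates this split request/response behavior, assuming the mkBRAM1Server module and BRAMRequest structure found in the BSV BRAM library; the address is presented in one rule, and the data is collected by another rule at least one cycle later.

Reading a BRAM (sketch)
import BRAM :: *;
import DefaultValue :: *;

module mkBramExample (Empty);
   BRAM_Configure cfg = defaultValue;
   BRAM1Port #(Bit#(10), Bit#(32)) bram <- mkBRAM1Server (cfg);
   Reg #(Bool) sent <- mkReg (False);

   rule rl_req (! sent);
      // present the address in one cycle ...
      bram.portA.request.put (BRAMRequest { write: False,
                                            responseOnWrite: False,
                                            address: 10'h3,
                                            datain: ? });
      sent <= True;
   endrule

   rule rl_resp;
      // ... the data arrives one or more cycles later
      let d <- bram.portA.response.get ();
      $display ("read data: %h", d);
   endrule
endmodule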

Chapter 4

Pipelining Complex Combinational Circuits

4.1 Introduction

In the last chapter we introduced sequential elements (registers), motivated by “folding” a large combinational circuit from an elaboration in space to an elaboration in time, in order to reduce area. Let us now consider the throughput or repetition rate of such circuits. Suppose the circuit computes some function f() (e.g., multiply), and we wish repeatedly to compute f() on a series of input datasets as fast as possible. The throughput of a circuit is measured by how often we can feed successive sets of inputs (which, in steady state, will be the same as how often we can extract successive sets of outputs). Throughput is often expressed in units such as samples/second, etc., or by its inverse, the repetition rate, such as nanoseconds/sample.

In the pure combinational version (elaborated in space), we can feed the next set of inputs to the circuit only after the current set has made it safely through the circuit, i.e., after the longest combinational delay through the circuit. In practice what this means is that the combinational circuit is likely to have registers at either end, and we will only clock these registers at a rate that ensures that signals have propagated safely through the combinational circuit.

In the folded version (elaborated in time), after supplying one set of inputs we must clock the circuit n times where n is the degree of folding, producing the final answer, before we supply the next set of inputs. The folded version can likely be clocked at much higher speeds than the pure combinational version, since the largest combinational delay is likely to be much smaller. However, the maximum throughput is likely to be somewhat less than the pure combinational version because, in both versions, each sample must flow through the same total amount of logic before the next sample is accepted, and in the folded version there is likely to be some extra timing margin for safe registering in each circulation of data.

Note that in both the above versions, at any given time, the entire circuit is only involved in computation for one set of inputs, i.e., the output must be delivered before we can accept the next set of inputs.
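As a concrete, purely hypothetical worked example of this comparison (numbers are ours, chosen only for illustration): suppose the fully combinational circuit has a 32 ns critical path; then it can accept a new input set roughly every 32 ns, i.e., about 31 million samples/second. A 32-way folded version might shorten the critical path to about 1.25 ns, so it can be clocked at about 800 MHz, but it needs 32 clock cycles per input set, i.e., one input set every 40 ns: a much faster clock, but slightly lower throughput, for the reasons just given.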


In this chapter, we look at using registers purely for improving throughput via pipelining. We will insert registers in the combinational version, and hence area will increase; however throughput will also increase because earlier stages of the circuit can safely start on the next set of inputs while later stages are still working on previous sets of inputs.

4.2 Pipeline registers and Inelastic Pipelines

Figure 4.1: Combinational Inverse Fast Fourier Transform (IFFT)

Fig. 4.1 is a sketch of a combinational circuit for the Inverse Fast Fourier Transform function which is widely used in signal processing circuits, including the ubiquitous WiFi wireless networks. For the purposes of the current discussion the internal details of the blocks shown are not important but, briefly, the input is a set of 64 complex numbers; each “butterfly 4” block (Bfly4) performs some complex arithmetic function on 4 complex numbers, producing 4 output complex numbers; and each “Permute” block permutes its 64 input complex numbers into its output 64 complex numbers. Note that the longest combinational path will go through 3 Bfly4 blocks and 3 Permute blocks. Following the principles discussed in the last chapter, we could also create a folded version of this circuit which contains one column of Bfly4 blocks and a Permute block, in which we circulate each dataset three times.

Figure 4.2: Pipelined Inverse Fast Fourier Transform (IFFT)

Fig. 4.2 shows the same circuit after the insertion of “pipeline registers” after the first

two Permute blocks. We say that the circuit has been pipelined into three “stages”. On the first clock edge we present the first set of inputs, which propagates through the Bfly4s and Permute block during the subsequent clock cycle. On the second clock edge, the first register safely captures this intermediate result, and we can immediately present the second set of inputs. During the subsequent clock cycle, the first stage works on the second set of inputs while the second stage works on the first set of inputs. In steady state, as shown by the labels at the top of the diagram, the three stages are working on input sets i + 1, i and i − 1, respectively. In particular, whereas a combinational or folded circuit works on one input set at a time, this pipelined circuit works in parallel on three sets of inputs.

4.2.1 Inelastic Pipelines

Figure 4.3: Inelastic pipeline for IFFT

Fig. 4.3 shows an abstraction of the IFFT circuit where each fj represents a stage computation (a column of Bfly4 boxes and a Permute box). InQ and OutQ represent input and output queues or FIFOs (we will discuss FIFOs in more detail in Sec. 4.3). The following code expresses the behavior of the pipeline:

Pipeline behavior
1 rule sync_pipeline;
2    sReg1 <= f0 (inQ.first()); inQ.deq();
3    sReg2 <= f1 (sReg1);
4    outQ.enq (f2 (sReg2));
5 endrule

In line 2, we apply f0 to the head of the input queue and capture it in register sReg1; and we also remove the element from the queue. In line 3, we apply f1 to the previous value sitting in sReg1 and capture its result in sReg2. In line 4, we apply f2 to the previous value sitting in sReg2 and enqueue its result on the output queue. When can this rule fire, and actually perform these actions? Our FIFOs will typically have conditions on their methods such that first and deq cannot be invoked on empty FIFOs, and enq cannot be invoked on full FIFOs. Thus, the rule will only fire, and move data forward through the pipe, when inQ is not empty and outQ is not full. The rule could have been written with the conditions made explicit (redundantly) as follows:

Pipeline behavior with conditions made explicit
1 rule sync_pipeline (inQ.notEmpty && outQ.notFull);
2    sReg1 <= f0 (inQ.first()); inQ.deq();
3    sReg2 <= f1 (sReg1);
4    outQ.enq (f2 (sReg2));
5 endrule

Recall that, semantically, all the actions in a rule occur simultaneously and instantaneously. We also say that a rule is an atomic action or transaction. The code within a rule should never be read as a sequential program. For example, there is no chance that, after dequeueing a dataset from inQ, we somehow then find that we are unable to enqueue into outQ. Either the rule fires and everything in the rule happens, or it stalls and nothing happens. As will become evident through the rest of the book, this is a key and powerful property for reasoning about the correctness of circuits. It is what fundamentally distinguishes BSV from other hardware design languages.

4.2.2 Stalling and Bubbles

Suppose outQ is not full, but inQ is empty. The rule cannot fire, as just discussed, and we say that the pipeline is stalled. Unfortunately, also, any data sitting in sReg1 and sReg2 will be stuck there until fresh input data arrive allowing the pipeline to move. We’ll now modify the rule to permit such data to continue moving. Once we do this, of course, we have to distinguish the situation when a pipeline register actually contains data vs. it is “empty”. We do this by adding a flag bit to each pipeline register to indicate whether it contains valid data or not, as illustrated in Fig. 4.4.

Figure 4.4: Valid bits on pipeline registers

When a pipeline register is empty, we also call this a bubble in the pipeline. The modified code is shown below.

Adding Valid bits
 1 rule sync_pipeline;
 2    if (inQ.notEmpty) begin
 3       sReg1 <= f0 (inQ.first); inQ.deq;
 4       sReg1f <= Valid;
 5    end
 6    else
 7       sReg1f <= Invalid;

 8
 9    sReg2 <= f1 (sReg1);
10    sReg2f <= sReg1f;

11
12    if (sReg2f == Valid)
13       outQ.enq (f2 (sReg2));
14 endrule

Lines 2-7 are the first stage: if inQ has data, we move the data as usual into sReg1, and also set sReg1f to the constant Valid. If inQ has no data, we set sReg1f to Invalid.

Lines 9-10 are the second stage: we move the data through f1, and copy the Valid state from sReg1f to sReg2f. Note that if sReg1f was Invalid, then line 9 is reading invalid data from sReg1 and therefore storing invalid data in sReg2, but it doesn’t matter, since the Invalid tag accompanies it anyway. Finally, lines 12-13 are the third stage; we perform f2 and enqueue an output only if sReg2f is Valid. Under what conditions will this rule fire? Recall that inQ.first and inQ.deq are enabled only if inQ has data. However, bsc will recognize that this only matters if inQ.notEmpty is true. Thus, the empty/full state of inQ does not prevent the rule from firing: if empty, we don’t attempt .first or .deq and only execute the else clause of the condition; if not empty, we do .first and .deq, etc. Similarly, recall that outQ.enq is enabled only if outQ is not full. However, we only attempt this if sReg2f is Valid. Thus, the rule will only stall if we actually have data in sReg2 and outQ is full. The following table shows all the possible combinations of the two valid bits, inQ being empty or not, and outQ being full or not.

inQ       sReg1f    sReg2f    outQ      Rule fires?
notEmpty  Valid     Valid     notFull   fire
notEmpty  Valid     Valid     full      stall
notEmpty  Valid     Invalid   notFull   fire
notEmpty  Valid     Invalid   full      fire
notEmpty  Invalid   Valid     notFull   fire
notEmpty  Invalid   Valid     full      stall
notEmpty  Invalid   Invalid   notFull   fire
notEmpty  Invalid   Invalid   full      fire
empty     Valid     Valid     notFull   fire
empty     Valid     Valid     full      stall
empty     Valid     Invalid   notFull   fire
empty     Valid     Invalid   full      fire
empty     Invalid   Valid     notFull   fire
empty     Invalid   Valid     full      stall
empty     Invalid   Invalid   notFull   fire
empty     Invalid   Invalid   full      fire

4.2.3 Expressing data validity using the Maybe type

In this section we take a small excursion into modern language technology. There are no new hardware structures here; we just show how BSV exploits modern strong type checking for safe expression of hardware structures. In the last code fragment, recall that the action sReg2<=f1(sReg1) may be computing with invalid data. Fortunately in this case it does not matter, since we are also carrying across the accompanying bit sReg2f; if sReg1 was invalid, then the invalid data in sReg2 is also marked invalid. There are many situations where we might not be so lucky, for example if the f1 computation itself involved changing some state or raising an exception, such as a divide-by-zero. To avoid such problems, we should always guard computations on potentially invalid data with a test for validity.

In BSV we can rely on strong static type checking to enforce this discipline. The standard idiom is the Maybe type which is declared as follows:

The Maybe type
1 typedef union tagged {
2    void   Invalid;
3    data_t Valid;
4 } Maybe #(type data_t)
5 deriving (Bits, Eq);

A BSV tagged union type, similar to a union type in C/C++, is an “either-or”: either it is Invalid, in which case there is no data associated with it (void), or it is Valid, in which case it has some data of type data_t associated with it. As in the examples in Sec. 2.4.2, the last line is a standard BSV incantation to tell bsc to choose a canonical representation in bits, and to define the == and != operators for this type.

Figure 4.5: Representation in bits of values of Maybe type

Fig. 4.5 shows how, if a value of type t is represented in n bits, then a value of Maybe#(t) can be represented in n+1 bits. The extra bit represents the Valid/Invalid tag, and is placed at the MSB end (most significant bit). The representation always takes n + 1 bits, whether Valid or Invalid; in the latter case, the lower n bits are unpredictable (effectively garbage). Note that we show this representation here only to build intuition; in practice we never think in terms of representations, just in terms of abstract Maybe types. This insulates us from having to re-examine and change a lot of distributed source code if, for example, we decide on a change in representation. We construct a Maybe-type value using expressions like this:

Maybe value construction
1 tagged Invalid    // To construct an Invalid value
2 tagged Valid x    // To construct a Valid value with data x

We can examine Maybe-type values using the following functions:

BSV code
1 isValid (mv)       // To test if a Maybe value is Valid or not
2 fromMaybe (d, mv)  // To extract a data value from a Maybe value

The isValid() function is applied to a Maybe value and returns boolean True if it is Valid and False if Invalid. The fromMaybe() function is applied to a default value d of type t and a Maybe#(t) value mv. If mv is Valid, the associated data is returned; if Invalid, then d is returned. Notice that in no circumstances do we ever access the garbage data in an Invalid Maybe value. This is precisely what makes tagged unions in BSV (and in languages like Haskell, ML and so on) fundamentally different from ordinary unions in C/C++ which are not type safe. Let us now rewrite our sync_pipeline rule using Maybe types:

Using Maybe types
 1 module ...
 2    ...
 3    Reg #(Maybe #(t)) sReg1 <- mkRegU;
 4    Reg #(Maybe #(t)) sReg2 <- mkRegU;

 5
 6    rule sync_pipeline;
 7       if (inQ.notEmpty) begin
 8          sReg1 <= tagged Valid (f0 (inQ.first)); inQ.deq;
 9       end else
10          sReg1 <= tagged Invalid;

11
12       sReg2 <= ( isValid (sReg1) ? tagged Valid f1 (fromMaybe (?, sReg1))
13                                  : tagged Invalid );

14
15       if (isValid (sReg2))
16          outQ.enq (f2 (fromMaybe (?, sReg2)));
17    endrule
18    ...
19 endmodule

In lines 3-4 we declare sReg1 and sReg2 now to contain Maybe values. In lines 8 and 10 we construct Valid and Invalid values, respectively. In lines 12-13 we test sReg1 for validity, and we construct Valid and Invalid values for sReg2, respectively. We supply a “don’t care” argument (?) to fromMaybe because we have guarded this with a test for validity. Finally, in lines 15-16 we test sReg2 for validity, and use fromMaybe again to extract its value. Instead of using the functions isValid and fromMaybe, tagged unions are more often examined using the much more readable “pattern matching” notation:

Using Maybe types
 1 rule sync_pipeline;
 2    if (inQ.notEmpty) begin
 3       sReg1 <= tagged Valid (f0 (inQ.first)); inQ.deq;
 4    end else
 5       sReg1 <= tagged Invalid;

 6
 7    case (sReg1) matches
 8       tagged Invalid : sReg2 <= tagged Invalid;
 9       tagged Valid .x: sReg2 <= tagged Valid f1 (x);
10    endcase

11
12    case (sReg2) matches
13       tagged Valid .x: outQ.enq (f2 (x));
14    endcase
15 endrule

Notice the keyword matches in the case statements. In each clause of the case statements, the left-hand side (before the “:”) is a pattern to match against the Maybe value. In

line 9, the pattern tagged Valid .x not only checks if sReg1 is a valid value, but it also binds the new variable x to the contained data; this is used in the right-hand side as an argument to f1. Note the dot “.” in front of the x in the pattern: this signals that x is a new pattern variable that is effectively declared at this point, and must be bound to whatever is in the value in that position (its type is obvious, and inferred, based on its position in the pattern). The scope of x is the right-hand side of that case clause.

4.3 Elastic Pipelines with FIFOs between stages

In inelastic pipelines (previous section) each stage is separated by an ordinary register, and the overall behavior is expressed in a rule that combines the behavior of each stage into a single atomic (and therefore synchronous) transaction. In elastic pipelines, each stage is separated by a FIFO, and each stage’s behavior is expressed as a rule that may fire independently (asynchronously) with respect to the other stages.

Figure 4.6: An elastic pipeline

Fig. 4.6 shows a sketch of an elastic pipeline. The BSV code structure is shown below:

Elastic pipeline
 1 import FIFOF :: *;

 2
 3 module ...
 4    ...
 5    FIFOF #(t) inQ   <- mkFIFOF;
 6    FIFOF #(t) fifo1 <- mkFIFOF;
 7    FIFOF #(t) fifo2 <- mkFIFOF;
 8    FIFOF #(t) outQ  <- mkFIFOF;

 9
10    rule stage1;
11       fifo1.enq (f1 (inQ.first)); inQ.deq;
12    endrule

13
14    rule stage2;
15       fifo2.enq (f2 (fifo1.first)); fifo1.deq();
16    endrule

17
18    rule stage3;
19       outQ.enq (f3 (fifo2.first)); fifo2.deq;
20    endrule
21    ...
22 endmodule

Line 1 shows how we import the BSV FIFOF package to make FIFOs available. Lines 5-8 show typical instantiations of FIFO modules. This is followed by three rules, each of which dequeues a value from an upstream FIFO, performs a computation on the value, and enqueues the result into a downstream FIFO. Each rule can fire whenever its upstream FIFO is not empty and its downstream FIFO is not full. The following table shows some of the possible situations, where NE and NF stand for “not empty” and “not full”, respectively. Note that any FIFO that has capacity > 1 can be simultaneously not empty and not full.

inQ   fifo1    fifo2    outQ   stage1   stage2   stage3
NE    NE,NF    NE,NF    NF     Yes      Yes      Yes
NE    NE,NF    NE,NF    F      Yes      Yes      No
NE    NE,NF    NE,F     NF     Yes      No       Yes
NE    NE,NF    NE,F     F      Yes      No       No
...   ...      ...      ...    ...      ...      ...

Let us re-visit the question of bubbles and stalls in this context. Suppose, initially all FIFOs are empty. Then, none of the rules can fire, and the circuit lies quiescent. Now suppose a value arrives in inQ. Now, stage1 can fire, advancing the value into fifo1; then, stage2 can fire, advancing it to fifo2; finally, stage3 can fire, advancing it to outQ. Thus, the independent firing of rules allows values to flow through the pipeline whenever the FIFO ahead is not full; values never get stuck in the pipeline. Each stage can stall individually. Also note that we only ever compute and move data when an actual value is available, so there is no need for Maybe types and values here. Looking deeper, the Valid bits have effectively become full/empty bits inside the FIFOs.

What happens if, initially, inQ contains value x followed by value y? As before, stage1 can fire, advancing f1(x) into fifo1, and then stage2 can advance f2(f1(x)) into fifo2. But, during this second step, is it possible for stage1 concurrently to advance f1(y) from inQ into fifo1? For this to happen, fifo1 must permit its enq method to be invoked concurrently with its first and deq methods. If not, stage1 and stage2 could not fire concurrently; and stage2 and stage3 could not fire concurrently. Adjacent stages could at best fire in alternate steps and we would hardly get what we think of as pipelined behavior! Note that it would still produce functionally correct results (f3(f2(f1(x)))) for each input x; but it would not meet our performance goals (throughput).

The topic of concurrent method invocation is a deep one because one needs to define the semantics carefully. If method m1 is invoked before method m2 for a module M, it is easy to define what this means: m1 performs a transformation of M’s state, and that final state is the initial state for m2, and so on. But when m1 and m2 are invoked “concurrently”, there are often choices about what this means. For example, if enq and deq are invoked concurrently on a FIFO with 1 element, are we enqueuing into a FIFO with 1 element or into an empty FIFO (because of the deq)? Neither choice is inherently right or wrong; it is a genuine option in semantic definition. BSV semantics provide a very precise vocabulary in which to specify our choices, and BSV is expressive enough to permit implementing any such choice. We shall study this topic in detail in the next chapter.

4.4 Final comments on Inelastic and Elastic Pipelines

Inelastic pipeline stages are typically separated by ordinary registers, and behavior (data movement) is expressed as a single rule that specifies the actions of all the stages. All the actions happen as a single, atomic transaction. We saw that, in order to handle pipeline bubbles and keep data moving, we had to introduce a lot of “flow control” logic: valid bits to accompany data, and many conditionals in the rule to manage data movement. It was not trivial to reason about correctness, i.e., that our circuit did not accidentally drop a valid sample or overwrite one with another. The problem with writing pipelines in this style (which is the natural style in Verilog and VHDL) is that it does not evolve or scale easily; it gets more and more complex with each change, and as we go to larger pipelines. For scalability one would like a compositional way to define pipelines, where we define stages individually and then compose them into pipelines.

Elastic pipeline stages are typically separated by FIFOs, and each stage’s behavior is expressed as an individual, local rule. Flow control is naturally encapsulated into method conditions preventing us from enqueuing into a full FIFO and from dequeuing from an empty FIFO. Correctness is more obvious in such pipelines, and they are easier to write in a compositional (modular) way. However, when we compose elastic descriptions into larger systems, we must ensure that rules can fire concurrently when they have to, so that we achieve our desired throughput.

Chapter 5

Introduction to SMIPS: a basic implementation without pipelining

5.1 Introduction to SMIPS

We now have enough foundational material to turn to the central focus of this book, namely, processor architectures. An immediate choice is: which instruction set? We can invent a new one, or reuse an existing instruction set. The advantages of the latter include validation and credibility (it has been tested in the field and in commercial products). More importantly, we can reuse existing tool chains (compilers, linkers, debuggers) and system software (operating systems, device drivers) in order quickly to produce realistic programs to run on our system. Real system software and applications are essential for taking meaningful and credible measurements regarding efficiency and performance of hardware implementations.

In this book, we will be implementing an instruction set architecture (ISA) called SMIPS (“Simple MIPS”), which is a subset of the full MIPS32 ISA. MIPS was one of the first commercial RISC (Reduced Instruction Set Computer) processors and continues to be used in a wide range of commercial products such as Casio PDAs, the Sony Playstation and Cisco internet routers (see Appendix A for some more history). MIPS implementations probably span the widest range for any commercial ISA, from simple single-issue in-order pipelines to quad-issue out-of-order superscalar processors.

Our subset, SMIPS, does not include floating point instructions, trap instructions, misaligned load/stores, branch and link instructions, or branch likely instructions. There are three SMIPS variants which are discussed in more detail in Appendix A. SMIPSv1 has only five instructions and it is mainly used as a toy ISA for instructional purposes. SMIPSv2 includes the basic integer, memory, and control instructions. It excludes multiply instructions, divide instructions, byte/halfword loads/stores, and instructions which cause arithmetic overflows. Neither SMIPSv1 nor SMIPSv2 supports exceptions, interrupts, or most of the system coprocessor. SMIPSv3 is the full SMIPS ISA and includes all the instructions in our MIPS subset.

Appendix A has a detailed technical reference for SMIPS. We now sketch a high-level overview.


5.1.1 Instruction Set Architectures, Architecturally Visible State, and Implementation State

In computer architecture it is very important to keep a clear distinction between an Instruction Set Architecture (ISA) and its many possible implementations. An ISA is just a definition or specification of an instruction set: What is the repertoire of instructions? What state elements are manipulated by instructions (“architecturally visible state”)? How are instructions encoded into bits in memory? What is the meaning of the execution of each instruction (i.e., how does it manipulate the architectural state)? What is the meaning of sequential execution of instructions?

Note that an ISA typically says nothing about clocks or pipelining, nor about superscalarity (issuing and executing multiple instructions per clock), nor out-of-order execution (execution in a different order from the semantic ISA specification), branch prediction, load-value prediction, nor any of a thousand other techniques an implementation may use to maximize performance, or efficiency, or both. Each of these implementation techniques will typically introduce more state elements into the processor (pipeline registers, replicated and shadow copies of registers for faster access, and more), but none of these are visible in the ISA, i.e., these state elements are purely internal implementation artefacts.

We thus make a distinction between “architecturally visible” state—state that can be named and manipulated in the ISA—versus “implementation state” or “micro-architectural state” which includes all the implementation-motivated state in the processor.

An ISA typically plays the role of a stable “contract” between software developers and processor implementers. Software developers—whether they directly write machine code or assembly code, or whether they write compilers that produce machine code, or whether they write application codes using compilers developed by others—use the ISA as an unambiguous specification of what their machine code programs will do. They do not have to react and revise their codes every time there is a new implementation of the given ISA. Processor implementers, on the other hand, are freed up to explore new micro-architectural ideas to improve the performance or efficiency of their next implementation of the ISA.

This is not to say that ISAs never change. ISAs do get extended in response to new market needs. Typical examples are the new opcodes that have found their way into most modern instruction sets for graphics, high-performance computation (SIMD), encryption, and so on. But this change in ISAs is usually a relatively slow, deliberative, evolutionary process (measured in years and decades), so that implementers still have a stable target to implement within typical implementation timeframes.

5.1.2 SMIPS processor architectural state

Fig. 5.1 shows the basic architecturally visible state in an SMIPS processor. There are 32 general-purpose, 32-bit registers (GPRs), called r0 through r31. When read, r0 always returns a zero, and so is useful for instructions needing the constant zero. The program counter, or PC, is a separate 32-bit register. And, there are a few other special-purpose registers, not illustrated here.

Figure 5.1: Basic SMIPS processor architectural state

The “native” data types manipulated in the ISA are 8-bit bytes, 16-bit half-words and 32-bit words. All instructions are encoded as 32-bit values. The ISA is a so-called “Load-Store” ISA, which is typical of RISC ISAs, namely that the opcodes are partitioned into one set of instructions that only move data between memory and registers (Load/Store), and another set of instructions that only perform ALU or control operations on data in registers. The Load/Store instructions have immediate and indexed addressing modes. For branch instructions, branch targets can be relative to the PC or indirect via an address in a register.

5.1.3 SMIPS processor instruction formats

Figure 5.2: SMIPS processor instruction formats

There are only 3 kinds of instruction encodings in SMIPS, shown in Fig. 5.2. However, the fields are used differently by different types of instructions.

Figure 5.3: SMIPS processor Computational and Load/Store instruction formats

Fig. 5.3 shows computational and load/store instructions. The first form of computational instruction is a classical “3-operand” instruction, with two source registers and one destination register, and the opcode specified in the func field. The second form is a 2-operand instruction where the 3rd operand is an immediate value (taken from bits in the instruction itself). In the load/store instructions, the load/store address is computed by adding rs and the displacement. For loads, rt is the destination register receiving the loaded value. For stores, rt is the source register containing the value to be stored.
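As a small worked example (ours, not from the text): if rs holds 0x10000000 and the 16-bit displacement field is 0xFFFC (which sign-extends to -4), the load/store accesses address 0x0FFFFFFC. In the BSV of the following sections this computation has the shape:

Load/store address calculation (sketch)
let addr = rVal1 + signExtend (imm);   // rs value plus sign-extended displacement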

Figure 5.4: SMIPS processor control (branch) instruction formats

Fig. 5.4 shows control (branch) instructions. The first form tests register rs and conditionally branches to a target that is an offset from the current PC. The second form is an unconditional jump based on a target address in register rs. One of its uses is for returning from a function call, where rs represents the return address. The third form is an unconditional jump based on an absolute target address. The JAL (Jump And Link) opcode also saves PC+4 into r31 (also known as the Link register); this can be used for function calls, where PC+4 represents the return address.

5.2 Uniform interface for our processor implementations

Over the next few chapters we will be describing a series of increasingly sophisticated computer systems, each containing a processor implementation and instruction and data memories. All of them will have a uniform interface with the test environment, i.e., a uniform way by which:

• a “memory image” (including the machine code for a program) is initially loaded into the computer’s memory;
• the testbench “starts” the processor, allowing it to execute the program, and
• the testbench waits for the processor to complete execution of the program (or aborts the program after some time limit).

Fig. 5.5 illustrates the uniform test setup. The “top level” of each system we describe will be a BSV module called mkProc. In early versions of the code, for simplicity, we will instantiate models of instruction and data memory inside mkProc using the module constructors mkIMemory and mkDMemory. When simulation begins, these modules automatically load a file called memory.vmh which is a text file describing the contents of “virtual memory” as a series of 32-bit numbers expressed in hexadecimal format. The I- and D-memories both load a copy of the same file. The lab exercises provide these files for a number of initial memory

Figure 5.5: Uniform interface for running our implementations

loads. These initial-load files are typically created by cross-compilers, e.g., we compile a C application program into SMIPS binary which is placed in such a file which gets loaded into instruction memory. The mkProc module presents the following interface to the testbench for starting and stopping the processor, and querying its status:

Processor Interface (in ProcTypes.bsv)
1 interface Proc;
2    method Action hostToCpu (Addr startpc);
3    method ActionValue #(Tuple2#(RIndx, Data)) cpuToHost;
4 endinterface

The hostToCpu method allows the testbench to start the processor’s execution, providing it a starting PC address. The cpuToHost method returns the final status of the processor after it has halted.
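For example, a testbench (a sketch of ours; the module name mkTb and the starting PC 'h1000 are hypothetical) might drive these two methods as follows:

Testbench skeleton using the Proc interface (sketch)
module mkTb (Empty);
   Proc proc <- mkProc;
   Reg #(Bool) started <- mkReg (False);

   rule rl_start (! started);
      proc.hostToCpu ('h1000);        // start execution at a (hypothetical) PC
      started <= True;
   endrule

   rule rl_finish;
      let status <- proc.cpuToHost;   // fires only when the processor has a status to report
      $display ("halted: reg %d, value %d", tpl_1 (status), tpl_2 (status));
      $finish;
   endrule
endmodule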

5.3 A simple single-cycle implementation of SMIPS v1

We begin our series of increasingly sophisticated processor implementations with the simplest one—where all the functionality is expressed within a single BSV rule. It is useful as a warm-up exercise, of course, but it is also useful as a “reference implementation” against which to compare future, more complex implementations. In other words, it provides a reference specification for what the final architecturally visible state of the processor ought to be after executing a particular program starting with a particular initial state. This one-rule implementation is a BSV counterpart to a traditional “Instruction Set Simulator” (ISS), often written in C/C++, that serves as a simple reference implementation of the ISA semantics (with no complications due to implementation considerations).

Fig. 5.6 illustrates the structure of our first implementation. It instantiates a PC register, a register file rf, and instruction and data memories iMem and dMem respectively, i.e., nothing more than the ISA-specified architecturally-visible state.

Figure 5.6: SMIPS processor 1-rule implementation

To execute a program, the processor must repeatedly do the following steps:

• Instruction Fetch: read the 32-bit word in Instruction Memory at address PC.
• Instruction Decode: separate it into its constituent fields.
• Register Read: read the source registers mentioned in the instruction, from the register file. Since many instructions have more than one source register, this requires the register file to have more than one port, for simultaneous access.
• Execute: perform any ALU op required by the instruction. This may be specified by an ALU opcode, a comparison opcode, or an address calculation for a load/store, etc.
• Memory operation: if it is a load/store instruction, perform the memory operation.
• Write back: for instructions that have destination registers, write the output value to the specified register in the register file.
• Update the PC: increment the PC (normally) or replace it with a new target value (branch instructions).

Why do we separate instruction and data memories into two separate memories? Although in principle all modern processors have the functionality of Turing Machines in that conceptually there is just a single memory and they can read and write any location in memory, in modern practice we rarely do this. Specifically, if we were to write into the part of memory holding instructions (“self-modifying code”), it would greatly complicate implementation because it would add a potential interaction between the “Instruction Fetch” and “Memory Operation” stages in the list above (the next instruction could not be fetched until the memory operation stage makes its modification). Modern practice avoids self-modifying code completely, and the part of memory that holds instructions is often protected (using virtual memory protection mechanisms) to prevent writes. In this view, therefore, instruction and data accesses can be performed independently, and hence we model them as separate memories. For reasons going back to influential early research computers developed at Harvard and Princeton Universities in the 1940s, separated-memory and unified-memory architectures are called Harvard and Princeton architectures, respectively. In this book we shall only be discussing Harvard architectures, following modern practice.

Fig. 5.7 illustrates the datapaths that will emerge for our 1-rule implementation. Of course, these datapaths are automatically created by the bsc compiler when compiling our BSV code. Here is the BSV code for our 1-rule implementation:

Figure 5.7: SMIPS processor 1-rule implementation datapaths

1 cycle implementation (in 1cyc.bsv) 1 module mkProc(Proc); 2 Reg#(Addr) pc <- mkRegU; 3 RFile rf <- mkRFile; 4 IMemory iMem <- mkIMemory; 5 DMemory dMem <- mkDMemory; 6 Cop cop <- mkCop;

7

8 rule doProc(cop.started); 9 let inst = iMem.req(pc);

10

11 // decode 12 let dInst = decode(inst);

13

14 // trace - print the instruction 15 $display("pc: %h inst: (%h) expanded: ", pc, inst, showInst(inst));

16

17 // read register values 18 let rVal1 = rf.rd1(validRegValue(dInst.src1)); 19 let rVal2 = rf.rd2(validRegValue(dInst.src2));

20

21 // Co-processor read for debugging 22 let copVal = cop.rd(validRegValue(dInst.src1));

23

24 // execute 25 // The fifth argument is the predicted pc, to detect if it was 26 // mispredicted (future). Since there is no branch prediction yet, 27 // this field is sent with an unspecified value. 28 let eInst = exec(dInst, rVal1, rVal2, pc, ?, copVal);

29

30 // Executing unsupported instruction. Exiting 31 if(eInst.iType == Unsupported) 32 begin 33 $fwrite(stderr, "Unsupported instruction at pc: %x. Exiting\n", pc);

34 $finish; 35 end

36

37 // memory 38 if(eInst.iType == Ld) begin 39 eInst.data <- dMem.req(MemReq{op: Ld, addr: eInst.addr, data: ?}); 40 end 41 else if(eInst.iType == St) begin 42 let d <- dMem.req(MemReq{op: St, addr: eInst.addr, data: eInst.data}); 43 end

44

45 // write back 46 if(isValid(eInst.dst) && validValue(eInst.dst).regType == Normal) 47 rf.wr(validRegValue(eInst.dst), eInst.data);

48

49 // update the pc depending on whether the branch is taken or not 50 pc <= eInst.brTaken ? eInst.addr : pc + 4;

51

52 // Co-processor write for debugging and stats 53 cop.wr(eInst.dst, eInst.data); 54 endrule

55

56 method ActionValue#(Tuple2#(RIndx, Data)) cpuToHost; 57 let ret <- cop.cpuToHost; 58 $display("sending %d %d", tpl_1(ret), tpl_2(ret)); 59 return ret; 60 endmethod

61

62 method Action hostToCpu(Bit#(32) startpc) if (!cop.started); 63 cop.start; 64 pc <= startpc; 65 endmethod 66 endmodule

Line 1 declares the mkProc module with the previously described Proc interface. Lines 2-6 instantiate the sub-modules: the PC register, a register file, the instruction and data memories, and cop, a standard co-processor. The co-processor is the administrative controller of the processor, allowing it to start and stop, keeping a record of time (cycle count), the number of instructions executed, and so on. Lines 8-54 specify the single rule that does everything, and it is only enabled when cop.started is true. It more-or-less follows the itemized list described earlier: line 9 is the Instruction Fetch; line 12 is Instruction Decode. Lines 18-19 perform the two source register reads (the validRegValue functions are needed because not all instructions have two source registers). Ignore line 22 for now; it’s using the co-processor for debugging. Line 28 is Execute. Lines 30-35 detect bad instructions (not supported in this ISA). Lines 37-43 do the memory operation, if any. Lines 46-47 do the Write Back, if any, and line 50 updates the PC. Lines 56-65 implement the Proc interface methods.

Note that all the functions and methods used in this rule, including iMem.req(), decode(), rf.rd1(), rf.rd2(), exec(), dMem.req(), and rf.wr(), are instantaneous (purely combinational). In fact they must be, since they are in a single rule. Let us dig a little deeper into the BSV code. We start with some type definitions:

Some type definitions (in ProcTypes.bsv)
1 typedef Bit#(5) RIndx;

2

3 typedef enum {Unsupported, Alu, Ld, St, J, Jr, Br, Mfc0, Mtc0} IType 4 deriving(Bits, Eq); 5 typedef enum {Eq, Neq, Le, Lt, Ge, Gt, AT, NT} BrFunc 6 deriving(Bits, Eq); 7 typedef enum {Add, Sub, And, Or, Xor, Nor, 8 Slt, Sltu, LShift, RShift, Sra} AluFunc 9 deriving(Bits, Eq);

Line 1 defines a register index to be a 5-bit value (since we have 32 registers in SMIPS). The next few lines describe symbolic names for various opcode functions. The following code describes a decoded instruction: Decoded instructions (in ProcTypes.bsv) 1 typedef struct { 2 IType iType; 3 AluFunc aluFunc; 4 BrFunc brFunc; 5 Maybe#(FullIndx) dst; 6 Maybe#(FullIndx) src1; 7 Maybe#(FullIndx) src2; 8 Maybe#(Data) imm; 9 } DecodedInst deriving(Bits, Eq);

The decode() function is in the file Decode.bsv and has the following outline: Decode function outline (in Decode.bsv) 1 function DecodedInst decode(Data inst); 2 DecodedInst dInst = ?; 3 let opcode = inst[ 31 : 26 ]; 4 let rs = inst[ 25 : 21 ]; 5 let rt = inst[ 20 : 16 ]; 6 let rd = inst[ 15 : 11 ]; 7 let shamt = inst[ 10 : 6 ]; 8 let funct = inst[ 5 : 0 ]; 9 let imm = inst[ 15 : 0 ]; 10 let target = inst[ 25 : 0 ];

11

12 case (opcode) 13 ... fill out the dInst structure accordingly ... 14 endcase

15 return dInst; 16 endfunction

We show below one of the case arms, corresponding to I-Type ALU instructions: Decode function; I-type section (in Decode.bsv) 1 ... 2 case (opcode) 3 opADDIU, opSLTI, opSLTIU, opANDI, opORI, opXORI, opLUI: 4 begin 5 dInst.iType = Alu; 6 dInst.aluFunc = case (opcode) 7 opADDIU, opLUI: Add; 8 opSLTI: Slt; 9 opSLTIU: Sltu; 10 opANDI: And; 11 opORI: Or; 12 opXORI: Xor; 13 endcase; 14 dInst.dst = validReg(rt); 15 dInst.src1 = validReg(rs); 16 dInst.src2 = Invalid; 17 dInst.imm = Valid(case (opcode) 18 opADDIU, opSLTI, opSLTIU: signExtend(imm); 19 opLUI: {imm, 16’b0}; 20 default: zeroExtend(imm); 21 endcase); 22 dInst.brFunc = NT; 23 end 24 ... other case arms ... 25 endcase

Please refer to the file Decode.bsv to see the full function. At the bottom of the function is a case-arm “catch all” that handles bad instructions, i.e., 32-bit encodings that do not conform to any of the specified instructions: Decoding unsupported instructions (in Decode.bsv) 1 ... 2 case 3 ... other case arms

4

5 default: 6 begin 7 dInst.iType = Unsupported; 8 dInst.dst = Invalid; 9 dInst.src1 = Invalid; 10 dInst.src2 = Invalid; 11 dInst.imm = Invalid;

12 dInst.brFunc = NT; 13 end 14 endcase

15

16 if (dInst.dst matches tagged Valid .dst 17 &&& dst.regType == Normal 18 &&& dst.idx == 0) 19 dInst.dst = tagged Invalid;

20

21 return dInst; 22 endfunction

In lines 16-19 we handle the case where the destination register has been specified as r0 in the instruction; we convert this into an “invalid” destination (which will cause the write back to ignore this).

Figure 5.8: SMIPS processor Execute function

Fig. 5.8 illustrates the exec() function (purely combinational). The output of the function is a structure like this:

Executed Instructions (in ProcTypes.bsv) 1 typedef struct { 2 IType iType; 3 Maybe#(FullIndx) dst; 4 Data data; 5 Addr addr; 6 Bool mispredict; 7 Bool brTaken; 8 } ExecInst deriving(Bits, Eq);

The code below shows the top-level of the exec function:

Execute Function (in Exec.bsv) 1 function ExecInst exec(DecodedInst dInst, Data rVal1, Data rVal2, Addr pc, 2 Addr ppc, Data copVal); 3 ExecInst eInst = ?; 4 Data aluVal2 = isValid(dInst.imm) ? validValue(dInst.imm) : rVal2;

5

6 let aluRes = alu(rVal1, aluVal2, dInst.aluFunc);

7

8 eInst.iType = dInst.iType;

9

10 eInst.data = dInst.iType == Mfc0? 11 copVal : 12 dInst.iType == Mtc0? 13 rVal1 : 14 dInst.iType==St? 15 rVal2 : 16 (dInst.iType==J || dInst.iType==Jr) ? 17 (pc+4) : 18 aluRes;

19

20 let brTaken = aluBr(rVal1, rVal2, dInst.brFunc); 21 let brAddr = brAddrCalc(pc, rVal1, dInst.iType, 22 validValue(dInst.imm), brTaken); 23 eInst.mispredict = brAddr != ppc;

24

25 eInst.brTaken = brTaken; 26 eInst.addr = (dInst.iType == Ld || dInst.iType == St) ? aluRes : brAddr;

27

28 eInst.dst = dInst.dst;

29

30 return eInst; 31 endfunction

In lines 4-6 we use the alu() function to compute an ALU result, assuming it is an ALU-type instruction. In lines 10-18, we set the output eInst.data field, which could be from the co-processor, rVal1 or rVal2, PC+4 for J and Jr instructions, or the ALU output aluRes. In lines 20-25, we compute the branch operation, assuming it’s a branch instruction. Finally we set eInst.addr in line 26, which is the aluRes for Ld and St instructions, else the branch address. For readers with a purely software background, the above code may appear a little strange in that we seem to be performing both the ALU and the branch computations, whatever the opcode. But remember that this is how hardware is typically structured: both computations always exist (they are both statically elaborated and laid out in hardware) and they both do their work, and the final result is just selected with a multiplexer. The alu() and branch functions used above are shown next: ALU and Branch parts of Execute Function (in Exec.bsv) 1 function Data alu(Data a, Data b, AluFunc func);

2 Data res = case(func) 3 Add : (a + b); 4 Sub : (a - b); 5 And : (a & b); 6 Or : (a | b); 7 Xor : (a ^ b); 8 Nor : ~(a | b); 9 Slt : zeroExtend( pack( signedLT(a, b) ) ); 10 Sltu : zeroExtend( pack( a < b ) ); 11 LShift: (a << b[4:0]); 12 RShift: (a >> b[4:0]); 13 Sra : signedShiftRight(a, b[4:0]); 14 endcase; 15 return res; 16 endfunction

17

18 function Bool aluBr(Data a, Data b, BrFunc brFunc); 19 Bool brTaken = case(brFunc) 20 Eq : (a == b); 21 Neq : (a != b); 22 Le : signedLE(a, 0); 23 Lt : signedLT(a, 0); 24 Ge : signedGE(a, 0); 25 Gt : signedGT(a, 0); 26 AT : True; 27 NT : False; 28 endcase; 29 return brTaken; 30 endfunction

31

32 function Addr brAddrCalc(Addr pc, Data val, IType iType, Data imm, Bool taken); 33 Addr pcPlus4 = pc + 4; 34 Addr targetAddr = case (iType) 35 J : {pcPlus4[31:28], imm[27:0]}; 36 Jr : val; 37 Br : (taken? pcPlus4 + imm : pcPlus4); 38 Alu, Ld, St, Mfc0, Mtc0, Unsupported: pcPlus4; 39 endcase; 40 return targetAddr; 41 endfunction

5.4 Expressing our single-cycle CPU with BSV, versus prior methodologies

In the old days of CPU design, a designer may have begun the process by first drawing a schematic of the single-cycle CPU, as illustrated in Fig. 5.9.

Figure 5.9: Datapath schematic for 1-cycle implementation

The red vertical arrows represent control signals: by asserting/deasserting them appropriately, one can steer data from the right sources into the right functions and into the right destination state elements.

Figure 5.10: Control table for Datapath Schematic for 1-cycle implementation

Next, the designer may create a “control table” as shown in Fig. 5.10, showing the values to which the control signals should be set under each possible set of inputs coming from the circuit (signals like OpCode, zero?, etc.). This control table can be implemented, for example, in a ROM, with the inputs representing the address and the outputs representing the control signals (or, the ROM could be further optimized into custom logic). You can imagine the sheer labor involved in creating such a datapath schematic and control table. And, you can imagine the labor involved in accommodating change: to fix a bug in the logic, or to add new functionality.

In the 1990s people started changing this methodology to describe these circuits in RTL languages instead (Verilog and VHDL). Although this greatly reduced the labor of creating and maintaining schematics and control tables, RTL is not much of a behavioral abstraction above schematics in that it has the same clock-based, globally synchronous view of the world. Similar timing errors and race conditions can occur almost as easily with RTL as with schematics. The description of these circuits in BSV is a radical shift in how we think about and specify circuits. In addition to powerful types, object-oriented module structure and static elaboration, BSV’s rules are a dramatically different way of thinking of circuit behavior in terms of instantaneous atomic transactions. Datapaths and control logic are automatically synthesized from such specifications by bsc.

5.5 Separating the Fetch and Execute actions

Let us turn our attention back to our BSV description, and think about the performance of our 1-rule design. Essentially the steps itemized in the list in Sec. 5.3 (Instruction Fetch, Decode, Register Read, Execute, Mem, Write Back) form a long sequential chain of logic, since each step depends on the previous one. Thus, the total combinational delay represents a limit on how fast we can clock the state elements; we must wait for the full combinational delay to allow signals to propagate all the way through:

t_clock > t_mem + t_fetch + t_dec + t_regread + t_exec + t_mem + t_wb

Figure 5.11: Splitting our implementation into 2 rules

Suppose we split our rule into two parts, as illustrated in Fig. 5.11, i.e., separate the Fetch part into its own rule. Then our timing equation changes to:

t_clock > max(t_mem, t_fetch + t_dec + t_regread + t_exec + t_mem + t_wb)
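As a purely illustrative calculation (the delays below are invented for this example, not measured): if each memory access (t_mem) takes 2 ns and each of the other steps takes 1 ns, the single-rule design needs t_clock > 2 + 1 + 1 + 1 + 1 + 2 + 1 = 9 ns, whereas the split design only needs t_clock > max(2, 1 + 1 + 1 + 1 + 2 + 1) = 7 ns.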

i.e., it would allow us potentially to run the overall circuit at a higher clock speed. In order to accomplish this, we introduce a register ir to hold the output of the Fetch stage (the 32-bit value from iMem), as shown in Fig. 5.12. We also introduce another register

Figure 5.12: Extra state for splitting our implementation into 2 rules

state to keep track of whether we are executing the Fetch stage or one of the later stages. These two, ir and state, are our first examples of non-architectural state, i.e., state that is not mentioned in the ISA but is introduced for implementation purposes. Below we show the new version of mkProc, which is minimally changed from the original 1-rule version: 2-rule processor (in 2cyc_harvard.bsv) 1 typedef enum {Fetch, Execute} State deriving (Bits, Eq);

2

3 module mkProc(Proc); 4 ... same sub-modules as before ...

5

6 Reg#(State) state <- mkReg(Fetch); // NEW 7 Reg#(Data) ir <- mkRegU; // NEW

8

9 rule doFetch(cop.started && state == Fetch); 10 let inst = iMem.req(pc);

11

12 $display("pc: %h inst: (%h) expanded: ", pc, inst, showInst(inst));

13

14 // store the instruction in a register 15 ir <= inst;

16

17 // switch to execute state 18 state <= Execute; 19 endrule

20

21 rule doExecute(cop.started && state == Execute); 22 let inst = ir;

23

24 let dInst = decode(inst); 25 ... rest of this rule is same as original doProc rule in 1-cycle version ... 26 ...

27

28 // switch back to fetch 29 state <= Fetch;

30 endrule

31

32 ... 33 endmodule

Line 1 declares a type for the state of the 2-rule system: it is either in the Fetch rule, or it is in the Execute rule. Inside the module, lines 6-7 instantiate the two new registers state and ir. The original doProc rule is now split into two rules, doFetch and doExecute; we add state==Fetch and state==Execute to the rule conditions. At the end of doFetch, we store the retrieved instruction in the ir register and change state to Execute. At the start of the doExecute rule we retrieve the just-stored instruction from ir, and proceed as before. At the end of the rule we set state to Fetch to re-enable the doFetch rule. Thus, the doFetch and doExecute rules will alternate performing the fetch and execute functions.

5.5.1 Analysis

Have we really gained anything by splitting into two rules and potentially running them at a faster clock speed? In fact we have not: since the two rules alternate, the total time to execute an instruction becomes two clock cycles, and unless we were able to more than double the clock speed (unlikely!) we will be running no faster than before. And, we’re now using more state (and therefore likely more hardware area and power). However, this exercise has demonstrated one key BSV principle: refinement. In BSV, one often refines a design by splitting rules into smaller rules (typically with a goal of increasing clock speed). The strength of rule semantics is that this can often be done independently in different parts of a design, i.e., refining a rule typically does not break other parts of the design because we typically do not have tight timing assumptions (as is common in RTL languages). This allows us incrementally to refine a design towards meeting particular performance goals. This exercise has also set the stage for the real payoff which we shall start exploring in the next chapter, when we start pipelining the two (and more) rules, so that, while the Execute rule is doing its thing, the Fetch rule is busy fetching the next instruction concurrently within the same clock.

Chapter 6

SMIPS: Pipelined

We ended the last chapter with a 2-rule version of the basic MIPS processor, illustrated in Fig. 5.12. Other than demonstrating how to refine a BSV program by breaking larger rules into smaller rules, it did not buy us anything either in terms of overall performance or area. Performance was not improved since the two rules alternate, based on the state register: the total time per instruction is the same, it just happens in two rules one after the other, instead of in one rule. And area likely increased due to the introduction of the non-architectural state registers state and ir. In this chapter we will explore how to execute the two rules concurrently (in the same clock), so that they operate as a pipeline. This should substantially improve the performance (throughput) compared to the original 1-rule version, since the smaller rules should allow us to increase clock speed, and we should be able to complete one instruction on every clock.

6.1 Hazards

When creating a processor pipeline, we typically encounter three kinds of hazards:

• Control Hazards: We do not really know the next instruction to fetch until we have at least decoded the current instruction. If the decoded instruction is not a branch instruction, the next instruction is likely to be at PC+4. If it is a branch instruction, we may not know the new PC until the end of the Execute stage. Even if it is not a branch instruction, it may raise a trap/exception later in the pipeline (such as divide-by-zero, or page fault), requiring the next instruction to be from a trap/exception handler.

• Structural Hazards: Two instructions in the pipeline may require the same resource at the same time. For example, in Princeton architectures, both the Fetch and the Execute stages need to access the same memory.

• Data Hazards: For example, if one instruction writes into register 13, a later instruction that reads register 13 must wait until the data has been computed and is available. In general, when different stages of the pipeline read and write common state (the register file being just one example), we must be careful about the order of reads and writes.


These are general issues encountered in many kinds of pipelines, not just processors. Simple, pure “feed-forward” pipelines such as the IFFT pipeline of Chapter 4, or an arithmetic multiplier pipeline, typically do not encounter any of these hazards, because there is no dependence between successive samples flowing through the pipeline. However, it is in the very nature of instructions in an ISA that they depend on each other, and so we encounter all these kinds of hazards in their full glory. This leads to significantly more complex mechanisms in order to maintain pipeline flow. In fact, dealing with these issues is the essence of creative computer architecture design. We will focus on control hazards in this chapter, and address the other types of hazards in later chapters.

6.1.1 Modern processors are distributed systems

Historically, the Computer Science topic of Distributed Systems grew in the context of Local Area Networks and Wide Area Networks based on the recognition that, in such systems, communication is not free. In particular,

1. There is no instantly visible or updatable global state. Entities only have local state, and they can only send and receive messages to other entities to query or update remote state.

2. Messages cannot be delivered instantaneously; there is a discernable latency, and this latency may not be predictable.

3. Even between the same two entities, messages may get re-ordered, i.e., they may arrive in an order different from the order in which they were sent.

4. Messages may get dropped, spuriously generated, duplicated, or corrupted along the way.

At first glance it may seem startling to refer to a single chip that occupies a few square millimeters of silicon (a modern processor or System-on-a-chip) as a “distributed system”, but remember that space and time are relative. Modern processors run at multi-gigahertz speeds, and communicating across a chip has a discernable delay, and is best addressed by message-passing between relatively independent entities instead of in terms of any globally visible state. Modern chip designers also talk about “GALS” methodology (Globally Asynchronous, Locally Synchronous), and this is just another manifestation of the same observation. Modern system-level communication protocols, such as PCI Express and Intel’s Quick Path Interconnect (QPI), even deal with the complexities of unreliable message delivery (items 3 and 4 above). In summary, it is fruitful to apply the insights and solutions developed in the field of Distributed Systems to the design of modern processor architectures. This insight pervades our approach. In this chapter, the evolution from a globally updated PC to an “epoch” based solution follows this trajectory.

6.2 Two-stage pipelined SMIPS

Fig. 6.1 illustrates a two-stage pipelined version of our SMIPS processor. Let us compare this with Fig. 5.12. First, we have replaced the “PC+4” function in the Fetch stage with

Figure 6.1: Two-stage pipelined SMIPS

a “pred” (for “prediction”) function. The Fetch stage is speculating or predicting what the next PC will be, and PC+4 is a good guess, since the majority of instructions are not likely to be branch instructions affecting the control flow. But as we go along we will make more sophisticated predictors that use the current PC to predict the next PC, learning from instruction executions. For example, a particular branch instruction (at a particular PC) may be executed multiple times (due to loops or repeated calls to the same function), and past behavior may be a good predictor of the next behavior. But what if we predict wrong, e.g., we predict (in the Fetch stage) that a conditional branch is taken, but we later discover (in the Execute stage) that it is not taken? Then we would have fetched one or more “wrong path” instructions and injected them into the pipeline. We need machinery to kill these instructions (in particular, they should have no effect on any architecturally-visible state), and to redirect the Fetch stage to start fetching from the correct PC. In Fig. 6.1 this is depicted as a “kill” function that clears the intermediate f2d register, which is the only thing in this design that can hold wrong-path instructions.

6.2.1 A first step: a single rule, inelastic solution

As a first step, we can encode prediction, redirection and kill into a single rule: 1-rule pipelined implementation 1 rule doPipeline; 2 // ---- Fetch 3 let inst = iMem.req(pc); 4 let ppc = nextAddr(pc); // predicted pc 5 let newPc = ppc; 6 let newIr = tagged Valid (Fetch2Decode{pc:pc,ppc:ppc,inst:inst});

7

8 // ---- Execute 9 if (f2d matches tagged Valid .x) begin 10 let inpc = x.pc; 11 let ppc = x.ppc; 12 let inst = x.inst; 13 let dInst = decode(inst);

14 ... register fetch ...; 15 let eInst = exec(dInst, rVal1, rVal2, inpc, ppc); 16 ...memory operation ... 17 ...register file update ... 18 if (eInst.mispredict) begin 19 newIr = tagged Invalid; 20 newPc = eInst.addr; 21 end 22 end 23 pc <= newPc; f2d <= newIr; 24 endrule

Prediction is encoded in the nextAddr(pc) function, which can be as simple as pc+4. Redirection happens because we initially bind newPc to ppc, but if we encounter a misprediction we rebind newPc to the corrected PC (eInst.addr), and we update pc to the final binding of newPc. Killing a wrong-path instruction happens because we initially bind newIr to the newly fetched (predicted-path) instruction, and if we discover a misprediction (eInst.mispredict is True) we rebind newIr to tagged Invalid, and this final binding is stored in the f2d register. Pipelining happens because Fetch and Execute both execute in each rule execution. This solution works, but is quite complex in lumping all the logic into one big rule. It will be very difficult to extend this to handle more complex features such as multi-cycle memory accesses or functional units, variable latencies due to cache misses and data-dependent functional unit latencies, multiple branch predictors and resolvers, etc.

6.2.2 A second step: a two-rule, somewhat elastic solution

A more modular solution would be to evolve towards an elastic pipeline with FIFO coupling, as discussed in Chapter 4. Let us now think of f2d not as a register with a Valid bit, but as a FIFO with enq, first, deq and clear methods. The new code will look like this: 2-rule pipelined implementation 1 rule doFetch; 2 let inst = iMem.req(pc); 3 let ppc = nextAddr(pc); // predicted pc 4 pc <= ppc; 5 f2d.enq (Fetch2Decode{pc:pc,ppc:ppc,inst:inst}); 6 endrule

7

8 rule doExecute; 9 let x = f2d.first; 10 let inpc = x.pc; 11 let ppc = x.ppc; 12 let inst = x.inst; 13 let dInst = decode(inst); 14 ... register fetch ...; 15 let eInst = exec(dInst, rVal1, rVal2, inpc, ppc);

16 ...memory operation ... 17 ...register file update ... 18 if (eInst.mispredict) begin 19 pc <= eInst.addr; 20 f2d.clear; 21 end 22 else 23 f2d.deq; 24 endrule

The doFetch rule now immediately updates pc to the predicted value. In the doExecute rule, when we encounter a misprediction (eInst.mispredict is True), we write the correct value into the pc register and simultaneously clear the f2d FIFO. Can these rules execute concurrently, assuming the FIFO f2d allows concurrent invocations of its methods enq, first, deq and clear? The answer is yes, but there are many subtleties about what we mean by “concurrent”, which are explored in detail in Chapter 7. The key point to note here is that atomicity of rules guarantees correct operation. Rule doFetch contains three activities, F1: read pc, F2: write pc, and F3: update f2d. Rule doExecute contains two activities, E1: write pc and E2: update f2d. Anyone who has done any parallel programming will immediately recognize this as a potentially dangerous situation. If F1, F2, and F3 can interleave with E1 and E2, bad things can happen: For example, the interleaving F1,E1,F2 can leave a mispredicted value in pc. The interleaving F1,F2,E1,E2,F3 can leave a mispredicted instruction in f2d. Such interleavings are ruled out by atomicity of rules: either E1,E2 happen before F1,F2,F3, or vice versa; either way, we get correct operation. This 2-rule solution is more elastic and decoupled than the previous 1-rule solution, but it still has a number of features which will not scale well to more complex pipelines. First, the Fetch and Execute sections (each of which will themselves evolve into multi-stage pipelines later) both access a common piece of global state, the pc, and we are relying on an instantaneous update to manage redirection on misprediction. Second, f2d was the only piece of state holding speculative instructions, which could be killed with a simple FIFO clear operation. As we evolve to deeper pipelines, there will be multiple state elements holding speculative instructions, and we may be nested several levels deep in branch predictions so that when a branch resolves we must kill only some, and not all of the speculative instructions. It will be hard to perform all this activity in a single, instantaneous, atomic action.

6.2.3 A third step: a two-rule, elastic solution using epochs

We first relax one aspect of our tight coupling, where we instantly kill all speculative instructions as soon as we discover a misprediction. We add an “epoch” register to the processor state in the Fetch stage, as illustrated in Fig. 6.2. Let’s assume its value is initially 0. The Fetch stage attaches the value in this register to every instruction it sends into f2d. In the Execute stage, we do two new things:

Figure 6.2: Adding an epoch register to the Fetch stage

• When we discover a misprediction, we increment the epoch register simultaneously with updating pc with the redirected value.

• We discard (i.e., kill) each instruction that we receive from f2d whose attached epoch number is lower than the latest value of the epoch register.

Thus, only wrong-path instructions will be accompanied by a wrong epoch number, and these will be killed, one by one, as they arrive in the Execute stage. We have eliminated the need instantly to clear the f2d FIFO. This will have a big payoff as we go to deeper pipelines. Here is some code to express the idea.

Adding an epoch register 1 rule doFetch ; 2 let inst=iMem.req(pc); 3 let ppc=nextAddr(pc); 4 pc <= ppc; 5 f2d.enq (Fetch2Decode{pc:pc,ppc:ppc,epoch:epoch,inst:inst}); 6 endrule

7

8 rule doExecute; 9 let x=f2d.first; 10 let inpc=x.pc; 11 let inEp=x.epoch; 12 let ppc = x.ppc; 13 let inst = x.inst; 14 if (inEp == epoch) begin 15 let dInst = decode(inst); 16 ... register fetch ...; 17 let eInst = exec(dInst, rVal1, rVal2, inpc, ppc); 18 ...memory operation ... 19 ...rf update ... 20 if (eInst.mispredict) begin 21 pc <= eInst.addr; 22 epoch <= inEp + 1; 23 end 24 end 25 f2d.deq; 26 endrule

In lines 5 and 11, we have assumed an extension to the Fetch2Decode type to include an epoch field. The if-structure on lines 14-24 ignores instructions with the wrong epoch number. On line 22, on realizing a misprediction (detected on line 20), we increment the epoch number. We have described the epoch as a continuously incrementing number. So, how large should we size the epoch register (how many bits)? Remember that processors may execute trillions of instructions between reboots. Fortunately, a little thought should persuade us that, for the above 2-stage pipeline, just 1 bit is enough. Across the whole system, at any given time, we are only dealing with two epochs: the latest epoch, and possibly one previous epoch if there are still wrong-path instructions in the pipeline. Thus, updating the epoch register is a matter of inverting 1 bit.

6.2.4 A final step: a two-rule, elastic, fully-decoupled, distributed solu- tion

We can refine our solution towards a truly decoupled one using the following observation. Suppose the Execute stage, instead of instantly updating the pc and epoch registers, sends a message with the new values to the Fetch stage, which is delivered after some delay. Functionally, nothing would go wrong. It just means that the Fetch stage may fetch a few more wrong-path instructions, but they will all be discarded anyway in the Execute stage because they’ll have the wrong epoch. But the Execute stage now also needs a local copy of the correct epoch which it uses in line 14 of the code above to recognize wrong-path instructions.

Figure 6.3: Decoupled structure

This leads to the truly decoupled, distributed structure shown in Fig. 6.3. The two FIFOs represent message channels between the two entities, the Fetch and Execute modules. The exact delay in these channels will not affect functional correctness (they will affect performance because of delays in killing wrong-path instructions). The final 2-stage pipeline solution is shown in Fig. 6.4. Each stage has its own epoch register, fEpoch and eEpoch. As before, f2d is a forward-direction FIFO, and now redirect is a reverse-direction FIFO. The Fetch stage updates pc and fEpoch based on the values it receives through redirect. Here is the code for this solution: Decoupled solution 1 module mkProc(Proc); 2 Fifo#(Fetch2Execute) f2d <- mkFifo; 3 Fifo#(Addr) execRedirect <- mkFifo; 4 Reg#(Bool) fEpoch <- mkReg(False); 5 Reg#(Bool) eEpoch <- mkReg(False);

6

Figure 6.4: Decoupled solution

7 rule doFetch; 8 let inst = iMem.req(pc); 9 if (execRedirect.notEmpty) begin 10 fEpoch <= !fEpoch; pc <= execRedirect.first; 11 execRedirect.deq; 12 end 13 else begin 14 let ppc = nextAddrPredictor (pc); 15 pc <= ppc; 16 f2d.enq (Fetch2Execute{pc:pc, ppc:ppc, inst:inst, epoch:fEpoch}); 17 end 18 endrule

19

20 rule doExecute; 21 let inst = f2d.first.inst; 22 let inpc = f2d.first.pc; 23 let ppc = f2d.first.ppc; 24 let inEp = f2d.first.epoch; 25 if (inEp == eEpoch) begin 26 let dInst = decode(inst); 27 ... register fetch ...; 28 let eInst = exec(dInst, rVal1, rVal2, inpc, ppc); 29 if (eInst.mispredict) begin 30 execRedirect.enq (eInst.addr); 31 eEpoch <= !inEp; 32 end 33 end 34 f2d.deq; 35 endrule 36 endmodule

Since we need only 1 bit to represent an epoch, and since epoch numbers always toggle between 0 and 1, we don’t actually have to carry any epoch number on a redirect message. The very arrival of the message is a signal to toggle it. Thus, the redirect queue (execRedirect) only carries PC values. In the Fetch rule, if there is any message in the redirect queue, we update pc and fEpoch (toggle it). Otherwise, we perform the normal instruction fetch and send it into f2d. In the Execute rule, on line 25, if the instruction’s epoch is not equal to the actual current epoch eEpoch, we ignore the instruction. If we resolve a misprediction we send eInst.addr back to the Fetch stage via the execRedirect queue, and immediately update the local, actual epoch number eEpoch. If we revert back for a moment to the abstract idea that the epoch is a continuously incrementing number, notice that fEpoch is an approximation of the real epoch, eEpoch. Further, it is always true that fEpoch ≤ eEpoch, i.e., it always approximates it from below, giving us safe behavior. In later chapters, as we refine to more pipeline stages and have more sophisticated speculation, each stage will have its own copy of the epoch number, and this property will continue to hold, i.e., each stage’s local epoch register will approximate from below the epoch registers of downstream stages.

6.3 Conclusion

In this chapter we developed a decoupled 2-stage SMIPS pipeline implementation, illustrating the principle of treating the design as a distributed system of independent entities communicating only with messages. Decoupling simplifies reasoning about correctness, and simplifies the independent refinement of entities (stages) for higher performance. We have still been somewhat informal about what we mean by “concurrent” execution of rules (which is necessary for actually getting pipelined behavior); this will be addressed in the chapter on rule semantics.

Chapter 7

Rule Semantics and Event Ordering

Parallelism within rules, and Concurrent rules within clocks

7.1 Introduction

We now examine in more detail the rule semantics that we have so far discussed only informally. In particular, we will examine two key semantic questions:

• What actions can be combined into an individual rule, and what does it mean to execute a rule?
• What rules can execute within a clock cycle; in what order, and what does it mean?

We will see that in both cases, it comes down to a fairly simple consideration on ordering constraints on pairs of methods. We use the words “simultaneous” and “parallel” to refer to the instantaneous execution of all actions within a rule. We use the word “concurrent” to refer to the execution of multiple rules within a clock. Even though multiple rules may execute in a single clock period, in BSV semantics they will still have a logical sequential order. We have also used the word “action” informally so far to refer to the actions in a rule. In fact, BSV has a type Action (and a more general type called ActionValue); the body of a rule has Action type, and many methods have Action and ActionValue type. Being formal types in BSV, one can even write expressions and functions of type Action and ActionValue.

7.2 Parallelism: semantics of a rule in isolation

A rule can be executed only if its rule condition is true. By this we mean the net, overall condition of the rule, combining any explicit user-supplied predicate and all the method


conditions of the methods invoked in the rule, nuanced by any predicates of conditionals (if, case, ...) surrounding those method invocations. Assuming this condition is true, when we execute the rule, all the Actions in the rule (which are all typically method invocations) are performed simultaneously, in the same instant. We will provide more insight into this in the following sections.

7.2.1 What actions can be combined (simultaneous) in a rule?

A rule consists of a set of actions. In fact, every action is actually an invocation of an Action method in the interface of some sub-module. For methods like enq and deq for FIFOs, the method invocation is obvious in the BSV syntax, but even ordinary register writes are in fact Action method invocations. The syntax: Special syntax for register reads and writes 1 rule ... 2 ... 3 r <= s + 1; 4 ... 5 endrule

is in fact just a special syntax for method invocations: Register access is just method invocation 1 rule ... 2 ... 3 r._write (s._read + 1); 4 ... 5 endrule

BSV’s standard register interface is declared as follows: Register interface 1 interface Reg #(type t); 2 method Action _write (t data); 3 method t _read; 4 endinterface

All primitive BSV modules come with a specification of whether two invocations of their methods (which could be two invocations of the same method) can be placed in the same rule, and what this means. These specifications are just “given” (or axiomatic), and from the point of view of BSV semantics that’s all that matters, although in practice these “given” properties are of course inspired by hardware realization. For example, the primitive register modules specify that a single rule cannot contain two _writes on the same register. This makes intuitive hardware sense: if two _writes on a register occur at the same instant, which value should it now hold? In general, the primitive register modules specify that a single rule can contain zero or more

_reads and/or a single _write on any particular register. It specifies the meaning of each _read as returning the value that was in the register just before the rule execution instant, and the meaning of a _write Action as leaving its argument value in the register just after the rule execution instant.

Similarly, FIFO modules will specify that you cannot have two enqs on the same FIFO in the same rule (if they are simultaneous, in what order are they enqueued?).

Although these constraints and meanings are “given”, or axiomatic, for primitive modules, they logically imply similar constraints on user-defined modules (which, ultimately, must be built with primitive modules either directly or indirectly). For example, if a module has two methods m1 and m2 which both write into a common register, the constraints and meaning of the underlying _writes carry upward to m1 and m2. Constraints on methods of primitive or user-defined modules may also arise due to hardware resource limitations. For example, a method m3(x), when implemented in hardware, may have a single input bus for x; this bus can only be driven with one value at a time; thus, it would not make sense to have two invocations of m3 within a single rule. In summary: we can place any actions into a rule for simultaneous execution, provided we respect such (mostly obvious) constraints. The bsc compiler will flag any violations as a static, compile-time error.
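As a small sketch of how such constraints propagate upward to user-defined modules (this module is our own illustration, not from the processor code): both methods below write the same register r, so, just as two _writes to r cannot appear in one rule, m1 and m2 cannot both be invoked from the same rule.

interface TwoSetters;
   method Action m1;
   method Action m2;
endinterface

module mkTwoSetters (TwoSetters);
   Reg#(Bit#(8)) r <- mkReg (0);
   method Action m1; r <= 1; endmethod   // writes r
   method Action m2; r <= 2; endmethod   // also writes r, so m1 and m2 conflict within a rule
endmodule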

Examples of legal and illegal combinations of actions into rules

Example 1:

Parallel methods in a rule 1 rule ra if (z>10); 2 x <= x+1; 3 endrule

This rule is fine: a read and a write on register x are ok, as is a read from register z.

Example 2:

Parallel methods in a rule 1 rule rb; 2 x <= x+1; 3 if (p) x <= 7; 4 endrule

This rule is not ok, since it potentially performs two writes on register x. bsc will accept it only in the trivial case where it can statically prove that p is false.

Example 3:

Parallel methods in a rule 1 rule rb; 2 if (q) 3 x <= x+1; 4 else 5 x <= 7; 6 endrule

This rule is ok. Even though there are two syntactic occurrences of a write to x, they are in the context of a conditional which ensures that only one of them will be performed. It is equivalent to a more explicit muxed expression:

Parallel methods in a rule 1 rule rb; 2 x <= (q ? x+1 : 7); 3 endrule

Example 4:

Parallel methods in a rule 1 rule rc; 2 x <= y+1; 3 y <= x+2; 4 endrule

This rule is ok, since it contains one read and one write for each of x and y. According to the given (axiomatic) semantics of the _read and _write methods, both reads of x and y read the value just before the rule execution instant, and both writes of x and y will update the registers with values visible just after the execution instant. Thus, this rule does a kind of “exchange and increment” of x and y, incrementing by 2 and 1, respectively.
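For example, if x holds 10 and y holds 20 just before the rule executes, then just after the rule x holds 21 (the old y plus 1) and y holds 12 (the old x plus 2).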

Example 5:

Parallel methods in a rule 1 rule rd; 2 let t1 = f (t2); 3 let t2 = g (t1); 4 x <= t1; 5 endrule

This rule will not be ok because it likely represents a combinational cycle (note that the output of each function is fed into the input of the other). bsc checks for combinational cycles and raises a compile time error if found.

Example 6: Parallel methods in a rule 1 rule re; 2 let s2 = f (s1); 3 let s3 = g (s2); 4 x <= s3; 5 endrule

This rule will be ok if it is ok to call f and g simultaneously. For example, if f and g are ordinary arithmetic functions, there is no problem. Even if f and g are the same arithmetic function, the logic will simply be replicated. However, suppose f and g contain calls to the same method of some hardware module (such as reading a single port of a register file, but with different indexes); then that method will have a constraint that it cannot be invoked twice simultaneously, and the rule will not be ok. This constraint is likely driven by the hardware fact that the method’s argument (here carrying s1 or s2) is a single input bus into the module, which can only carry one value in any particular instant.
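As a concrete sketch of the register-file case (rd1 and rd2 are the two read ports of the register file used in the processor code; the rule and the register total are our own illustration, with total assumed to be declared elsewhere): reading through two different ports in one rule is fine, whereas invoking the same port twice with different indexes would run into the single-bus constraint just described.

rule addTwoRegs;
   let a = rf.rd1 (5);   // one read port
   let b = rf.rd2 (6);   // a different read port; a second rf.rd1 call here would not be allowed
   total <= a + b;
endrule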

7.2.2 Execution semantics of a single rule

A rule can be seen as a specification of an acyclic combinational circuit whose inputs and outputs are connected to the interface methods of sub-modules invoked in the rule. Please refer to Figs. 3.7 and 3.8 to refresh your memory about how interface methods map to input and output wires of a module. Also recall that for any Action or ActionValue method, there is an input “enable” wire (“EN”); when this wire is asserted the method’s action is performed inside the module. For example, consider this rule: Simple example for 1-rule semantics 1 rule rf (mod1.f > 3); 2 x <= x+1; 3 if (x != 0) y <= x >> 7; 4 endrule

Figure 7.1: Combinational circuit for a rule

Fig. 7.1 illustrates the combinational circuit, or the “data flow graph” for this rule. The method mod1.f can only be invoked when it is enabled (its RDY signal is true). This is handled by the AND gate on the left. The CAN FIRE signal is a boolean indicating whether

the rule’s condition is true. Let us for the moment also call this the WILL FIRE signal (later, in Sec. 7.5, we will see that the dashed line in the figure actually is a “scheduler” circuit, having to do with the relationship of this rule to other rules). The WILL FIRE signal activates the EN input of register x directly. However, since the update of y is inside a conditional, you can see that the WILL FIRE is further ANDed with the predicate of the conditional. In general, the data flow graph of a rule may weave in and out of module methods, for example if an invoked method is itself a combinational function, or an invoked method is an ActionValue method. Further, for a user-defined sub-module, its methods will, in turn, invoke methods of deeper sub-modules. The data flow graph of a rule is the entire data flow graph, crossing user-defined module boundaries, eventually originating and terminating at methods of primitive modules. The data flow graph is implemented by hardware combinational circuits and of course signals take time to propagate through such circuits. However, in BSV semantics we abstract away from this detail: we assume the values are instantly known on all the edges of the data flow graph, and that all the EN signals are simultaneously enabled. Let us look at one more example to consolidate our understanding of individual rule semantics: Another example for 1-rule semantics 1 rule rg; 2 let x = fifo1.first; 3 fifo1.deq; 4 for (Integer j = 1; j <= 4; j = j + 1) 5 x = x + j; 6 fifo2.enq (x); 7 endrule

Figure 7.2: Combinational circuit for a rule: another example

Fig. 7.2 shows the data flow graph for this rule. The CAN FIRE and WILL FIRE signals are just the AND of the RDY signals for the three methods invoked in the rule. The for-loop has been statically elaborated into a chain of four adders, incrementing by 1, 2, 3 and 4 respectively. The input to the chain comes from fifo1.first and the output goes into fifo2.enq. The enq and deq happen in one simultaneous, instantaneous action.

Another concept reinforced in Fig. 7.2 is that ordinary variables like x never refer to storage locations, as they do in C or C++. They are merely names for intermediate values (in the hardware, wires). A repeated “assignment” of x in a statically elaborated loop, as in line 5, does not update any storage, as in C or C++: it is just conceptually naming a new x for another set of wires. In the figure we have disambiguated them by calling them x’, x’’, x’’’ and x’’’’.
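To make this concrete, here is a hand-unrolled version of rule rg above (a sketch; the primed names from the figure are written here as x0 through x4):

rule rg_unrolled;
   let x0 = fifo1.first;
   fifo1.deq;
   let x1 = x0 + 1;
   let x2 = x1 + 2;
   let x3 = x2 + 3;
   let x4 = x3 + 4;
   fifo2.enq (x4);
endrule

Each let simply names a new bundle of wires; nothing is stored until the deq and enq actions take effect at the rule’s execution instant.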

7.3 Logical semantics vs. implementation: sequential rule execution

We will see, in the next section, that even when a set of rules r1, ..., rn execute within a clock cycle (“concurrently”), we always explain the overall behavior in terms of a logical sequence of individual rules. This is just a programmer’s view. In the actual hardware implementation generated by bsc there may be nothing recognizable as a sequence of rules, but the hardware will always behave equivalently to a sequential execution of rules. The clean separation of abstract, logical semantics from what happens in any particular implementation is ubiquitous in Computer Science. For example, the semantic view of a processor instruction set (such as the MIPS instruction set) is one instruction at a time. The semantics of each opcode is described in terms of what it reads and writes in registers and memory, and the semantics of a program is explained as a sequence of instructions executed one at a time. Programmers writing assembly or machine code, and compilers generating such code, rely on this semantic model. A particular processor implementation may (and nowadays usually will) execute instructions in parallel (e.g., pipelining and superscalar execution) or out-of-order, but the hardware will always behave equivalently to the sequential semantics. This separation allows clients (programmers and compilers) to create programs independently of implementors (processor designers); each can create new artifacts at their own developmental pace. The BSV execution model is simple, almost trivial:

BSV execution semantics:

Forever:
1. Select a rule to execute
2. Execute the rule according to the previous section

This execution model of rules is a key strength of BSV compared to other hardware description languages like Verilog and VHDL. It is a simple model, and makes it easy to reason about correctness of programs.

7.4 Concurrent rule execution, and scheduling rules into clocks

As illustrated in Fig. 7.3, concurrent execution of rules in clocks is, semantically, just a grouping of the logical sequence of rule executions into clock periods. Note that the grouping may not be (indeed is not likely to be) uniform: some clocks may contain just 1 rule, other clocks may contain dozens or hundreds of rules (in fact this number may vary dynamically).

Figure 7.3: Clocks are just a grouping of logically sequential rule executions

Thus, we see that there is no change in the functional semantics, i.e., reasoning about how the system state changes as a result of execution of rules. But, once we map rules into clocks as suggested in the figure, we can now also reason about performance, i.e., how long does it take for the rules to compute something. For maximum performance, we may wish to execute as many rules as possible concur- rently (within each clock cycle). However, hardware properties will always limit what we can reasonably do within a clock cycle. In BSV, we abstract out those hardware-specific constraints into a simple semantic model of scheduling constraints on pairs of methods. Just as we saw, in Sec. 7.2, that methods have constraints about placement into a single rule, we also have constraints on methods in different rules within a clock.

Figure 7.4: Two rules in a clock, and their methods

Fig. 7.4 shows two rules ruleA and ruleB in the same clock, in a particular order. They contain method invocations mA1, mA2, mA3, mB1, and mB2. Of course, each rule can execute only if its guard condition is true. In addition, both rules can execute only if, for each pair of methods mAi and mBj, there is no constraint forbidding those two methods from being invoked in that order. These constraints typically are specified like this:

Conflict free: the methods can be invoked by rules in the same clock, in either order
mAi < mBj: the methods can be invoked by rules in the same clock, in that order

Of course, if the constraint was mBj < mAi, these two rules will not both be allowed to fire in the same clock cycle (each can fire individually on various clocks, but they can never both fire in the same clock). Where do these inter-rule constraints come from? Once again, like the intra-rule constraints we saw earlier, they are “given” (or axiomatic) for methods of primitive modules and these, in turn, may imply certain constraints for user-defined modules. Although, from a semantic point of view, we just take the constraints on primitive modules as “given”, they are of course, in practice, inspired by hardware implementation constraints. Consider again the register interface. We have already seen the constraints for combining _read and _write methods within a rule. For concurrent execution, we have the constraint:

_read < _write

We can easily see the hardware intuitions behind this. Within a clock, the value we read from a register is the one from the previous clock edge, and any value we write will only be visible at the next clock edge. The above constraint allows BSV register _reads and _writes to map naturally onto ordinary reads and writes. If we had allowed a rule with a _write to precede another rule with a _read within the same clock, then the value written in the first rule would have to be read in the second rule, all within a clock, and this does not map naturally and directly to ordinary hardware registers. We also use the word “conflict” to describe the situation where the method constraints imply contradictory rule orderings, i.e., some method constraints require ruleA to precede ruleB while other method constraints require ruleB to precede ruleA.

7.4.1 Schedules, and compilation of schedules

A schedule for a clock cycle is any linear list of rules to be attempted in that clock cycle. It can contain a subset or superset of the rules in a design, including duplicates. Executing a schedule is simple:

Forever (for each clock):
For each rule in the schedule (in order):
1. If it is enabled, and it has no intra-rule conflicts, and it has no inter-rule conflicts with rules already executed in this clock,
2. then perform the actions of the rule body.

Different schedules can result in different performance. If two rules Ri and Rj are placed in a schedule such that a method ordering constraint is violated, then they can never fire in the same clock. Reversing their order in the schedule may allow them to fire in the same clock. The logical semantics leaves it open to the implementation to choose good schedules. The bsc compiler performs extensive static analysis of the intra-rule and inter-rule constraints of a BSV program to create a good schedule, one that maximizes the number of rules that can execute within each clock. The user can place “attributes” in the source code to fix certain choices in the schedule. Compilation outputs allow the user to view the schedule chosen, both graphically and textually. bsc generates hardware that implements the rules with this schedule.
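For example (a sketch, reusing the registers x, y and z that appear in the examples of the next subsection; whether this particular preference is what you want is a design decision), the descending_urgency attribute tells bsc which of two conflicting rules should be preferred when both are enabled in the same clock:

(* descending_urgency = "ra, rb" *)
rule ra if (z>10);
   x <= y+1;
endrule

rule rb if (z>20);
   y <= x+2;
endrule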

7.4.2 Examples

Let us now firm up this simple idea by looking at a series of examples.

Example 1

Conflict free example 1 rule ra if (z>10); 2 x <= x+1; 3 endrule

4

5 rule rb if (z>20); 6 y <= y+2; 7 endrule

The methods involved are: z._read, x._read, x._write, y._read, and y._write. There are no constraints between x’s methods and y’s methods and z’s methods. There are no constraints between z._read in different rules. Thus, there are no constraints between the rules, and bsc will compile hardware that permits these rules to fire concurrently (of course, only whenever both their rule conditions are true).

Example 2

Conflict example 1 rule ra if (z>10); 2 x <= y+1; 3 endrule

4

5 rule rb if (z>20); 6 y <= x+2; 7 endrule

Here, y._read < y._write requires ra to precede rb. But, x._read < x._write requires rb to precede ra. This is a conflict, and bsc will ensure that these rules do not fire concurrently.

Example 3

Conflict-free example 1 rule ra if (z>10); 2 x <= y+1; 3 endrule

4

5 rule rb if (z>20); 6 y <= y+2; 7 endrule

Here, y._read < y._write requires ra to precede rb, and that is the only inter-rule constraint. bsc will allow these rules to fire concurrently, in the logical order: ra before rb.

Example 4

Conflict example 1 rule ra; 2 x <= y+1; 3 u <= u+2; 4 endrule

5

6 rule rb; 7 y <= y+2; 8 v <= u+1; 9 endrule

Here, y._read < y._write requires ra to precede rb. u._read < u._write requires rb to precede ra. Once again, we have a conflict, and bsc will not allow these rules to fire concurrently.

Example 5

Conflict-free example 1 rule ra (z>10); 2 x <= x+1; 3 endrule

4

5 rule rb if (z>20); 6 x <= 23; 7 endrule

Here, x._read < x._write requires ra to precede rb. The primitive register modules actually place no inter-rule scheduling constraint between x._write and x._write because:

• between rules there is always a clear logical sequential order;
• semantically, the final value in the register is what was written by the last rule in this order, and
• since the order is known statically to bsc, it is trivial to generate hardware to only perform the write from the last rule.

Thus, there is no conflict in this example, and bsc will allow the rules to execute concurrently with the logical ordering of ra before rb. Note that since the rules have different firing conditions, it is possible for rule ra to fire alone, incrementing x; it is only when they fire together that the increment is discarded and x is set to 23. This example is sometimes surprising to hardware designers because it might appear as if we were doing two writes to the same register in the same clock. But this is confusing logical semantics with implementation. Here, the logical semantics is indeed doing two writes to the register; the implementation does just one. This happens all the time, not just in BSV! For example, if you wrote a C program with the statements:

C program optimization 1 x = f(a,b,c); 2 x = 23;

The logical semantics of C implies there are two assignments to x, whereas most C compilers will happily “optimize” the program by discarding the first assignment to x (it may still call f in case it has side-effects that must be performed).

7.4.3 Nuances due to conditionals

Consider the following rules:

Conflict-freeness due to conditionals 1 rule ra; 2 if (p) fifo.enq (8); 3 u <= u+2; 4 endrule

5

6 rule rb; 7 if (! p) fifo.enq (9); 8 y <= y+23; 9 endrule

The enq method on most FIFOs will have a concurrency constraint that it conflicts with itself, arising out of the hardware implementation constraint that we can only handle one enqueue per clock cycle. In the above code, the p and !p conditions ensure that, even if the rules fire concurrently, only one enq will happen. bsc takes this into account when analyzing rules for conflicts, and will in fact deem these rules to be conflict free. However, there is a caveat: the above conclusion depends on realizing (proving) that the conditions p and !p are mutually exclusive. When those expressions get more complicated, it may not be possible for bsc to prove mutual exclusivity. In such cases, bsc is conservative and will declare a conflict due to a potential concurrent invocation of the method.

7.5 Hardware intuition for concurrent scheduling

Fig. 7.5 sketches the compilation of a single rule in isolation into hardware. The left side of the figure represents all the current state of the system. The values of the current state are fed into the π and δ boxes. The former is a combinational circuit computing the rule condition; the latter is a combinational circuit computing the next-state values of the state elements that are updated by the rule. Conceptually, the δ outputs feed the D inputs of D flip-flops, and the π output feeds the EN (enable) input of the D flip-flops. When we have multiple rules, each state element may be updated in multiple rules. Fig. 7.6 shows a first attempt at handling this, using classical AND-OR muxes and an OR gate. However, what if multiple rules fire, as in Example 5 above?

Figure 7.5: Compiling a single rule

Figure 7.6: Combining updates from multiple rules

Figure 7.7: Scheduler for updating a state element

Fig. 7.7 shows how we first feed the πs through a scheduler, itself a combinational circuit, which in turn suppresses some of the πs. Specifically, based on the concurrency conflict analysis we’ve discussed above, it breaks conflicts by suppressing certain rules so that just the right amount of concurrency is allowed. An extreme version of a scheduler circuit would be a single-stepper, i.e., allowing only one enabled π to go through. Designing good scheduling circuits is a deep topic; bsc generates a sophisticated scheduler circuit that works well in practice.

Figure 7.8: Scheduler for rules

Fig. 7.8 shows a generic schematic overview of the circuits produced by bsc for a BSV program, and also introduces some terminology commonly used in discussing these circuits. A rule’s net condition, combining any explicit user-supplied predicate, and all the method conditions of the methods invoked in the rule, nuanced by any predicates of conditionals (if, case, ...) surrounding those method invocations, is called the CAN FIRE signal of the rule. I.e., if there were no other rules in the system, this would be the rule’s firing condition. The corresponding output of the scheduler circuit is called its WILL FIRE condition. Of course, if CAN FIRE is false, then WILL FIRE must be false. However, if CAN FIRE is true, then WILL FIRE may be false if rule scheduling has determined that a conflict requires this rule to be suppressed when certain other rules WILL FIRE.

7.6 Designing performance-critical circuits using EHRs

Suppose we want to construct a module that is a two-port, 4-bit signed integer, saturating up-down counter, with the following interface:

Up-down saturating counter interface 1 interface UpDownSatCounter; 2 method ActionValue #(Int #(4)) incr1 (Int #(4) delta); 3 method ActionValue #(Int #(4)) incr2 (Int #(4) delta); 4 endinterface

Let’s assume that a module with this interface has some internal state holding the Int#(4) counter value (-8..7), initially 0. When either incr method is called, the internal value should change by delta (which may be negative), and the previous value is returned. The counter is “saturating”, i.e., any attempt to increment it beyond 7 leaves the value at 7. Similarly, any attempt to decrement it below -8 leaves the value at -8. Here is a module implementing this: Up-down saturating counter module 1 module mkUpDownCounter (UpDownSatCounter);

2

3 Reg #(Int #(4)) ctr <- mkReg (0);

4

5 function ActionValue #(Int #(4)) fn_incr (Int #(4) delta); 6 actionvalue 7 if (delta > (15 - ctr)) ctr <= 15; 8 else if (delta < (-8 - ctr)) ctr <= -8; 9 else ctr <= ctr + delta; 10 return ctr; // note: returns old value 11 endactionvalue 12 endfunction

13

14 method ActionValue #(Int #(4)) incr1 (Int#(4) delta) = fn_incr (delta);

15

16 method ActionValue #(Int #(4)) incr2 (Int#(4) delta) = fn_incr (delta); 17 endmodule

We define an internal function fn_incr that performs the desired saturating increment and returns the old value, and then we define the two methods to just invoke this function. In the function, note that even though the register updates are textually before the return ctr statement, we return the old value because an Action or ActionValue in BSV is semantically an instantaneous event, and all “reads” return the previous value.
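This instantaneous-event semantics is also what makes the classic register-swap rule work; a minimal illustrative example:

Swapping two registers in one rule (illustrative example)
module mkSwap (Empty);
   Reg #(Bit #(8)) x <- mkReg (1);
   Reg #(Bit #(8)) y <- mkReg (2);

   rule swap;
      x <= y;     // reads the old value of y
      y <= x;     // reads the old value of x, not the value just assigned above
   endrule
endmodule

Both reads see the pre-rule values, so after the rule fires the two values have been exchanged.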

7.6.1 Intra-clock concurrency and semantics

The module above may be functionally correct, but in hardware design we are usually interested in more than that; we also care about performance. The performance of a system that uses this up-down counter may boil down to this question: can incr1 and incr2 be operated in the same clock cycle, i.e., concurrently? If you compile the code as written with bsc you will discover that the two methods cannot be operated in the same clock cycle. This follows from our discussion of conflicts in Sec. 7.4. Both methods, in turn, invoke methods ctr._read and ctr._write; the method constraints require that incr1 precedes incr2 and vice versa, a classic case of a conflict, thereby preventing concurrent execution. Before we try solving this, we must first examine a crucial semantic issue: what does it mean for these methods to be operated in the same cycle? Suppose the previous counter value was 3 and we invoked incr1(5) and incr2(-6). Should we consider the counter to saturate at 7 because of +5, and then decrement to 1 due to −6? Or should we consider it to decrement to -3 due to −6 and then increment to 2 due to +5? And, what should be returned as the “previous value” in each method? We are not here recommending one semantics over another (there is no “right” answer); but we must pick one, i.e., define a precise desired semantics. This is where our logical sequential semantics of rules gives us the tools to be precise. We simply specify a desired logical order:

incr1 < incr2 (A) or incr2 < incr1 (B)

With either choice, we can now give a precise answer to the question previously posed. For example, suppose we chose (A). Then,

Previous value is 3; Logically, incr1(5) saturates the counter at 7, and the value returned is 3; Logically, incr2(-6) then takes the counter from 7 down to 1, and the value returned (the value just before incr2’s own update) is 7.

7.6.2 EHRs

As we saw in the previous section, using ordinary registers in the implementation does not permit the desired concurrency. For this, we use a somewhat richer primitive called an EHR (for Ephemeral History Register) which will have n logically sequenced pairs of ports for reads and writes. In fact, the conventional ordinary register is just a special case of an EHR with one read and one write port.

Figure 7.9: An EHR implementation

An EHR is conceptually a register with an array of n read and write ports, indexed by 0 ≤ j ≤ n − 1, which can all be read and written concurrently (in the same clock). Fig. 7.9 illustrates a possible EHR implementation. Suppose we perform a read from port j, and suppose zero or more writes were performed on ports < j. Then the value returned on read port j is the “most recent” write, i.e., the one on the highest port < j amongst the writes. If there are no writes into any ports < j, the value returned is the value originally in the register. Finally, the register is updated with the value written at the highest port number, if any; if there are no writes, the original value is preserved. Thus, increasing port numbers correspond to an intra-clock, concurrent, logical sequence. The concurrency scheduling constraints on an EHR’s ports are shown below:

ports[0]._read < ports[0]._write < ports[1]._read < ports[1]._write ... < ports[n-1]._read < ports[n-1]._write
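As an illustrative sketch (assuming the EHR #(n, t) interface used in this chapter, where ports[j] behaves like a Reg #(t)), the following two rules can be scheduled in the same clock, and the read on port 1 sees the write made on port 0 earlier in that clock’s logical sequence:

EHR port ordering (illustrative sketch)
module mkEhrOrderingExample (Empty);
   EHR #(2, Bit #(8)) x <- mkEHR (0);

   rule wr_port0;                                           // logically first within the clock
      x.ports[0] <= x.ports[0] + 1;
   endrule

   rule rd_port1;                                           // logically second within the clock
      $display ("value after wr_port0: %0d", x.ports[1]);   // sees the value written by wr_port0
   endrule
endmodule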

7.6.3 Implementing the counter with EHRs

Using EHRs, it is easy to see how to implement our up-down saturating counter.

Up-down saturating counter with EHRs
1 module mkUpDownCounter (UpDownSatCounter);
2
3    EHR #(2, Int #(4)) ctr <- mkEHR (0);   // 2 ports
4
5    function ActionValue #(Int #(4)) fn_incr (Integer port, Int #(4) delta);
6       actionvalue
7          Int #(5) next = signExtend (ctr.ports[port]) + signExtend (delta);  // widen to avoid overflow
8          if      (next > 7)  ctr.ports[port] <= 7;               // saturate high
9          else if (next < -8) ctr.ports[port] <= -8;              // saturate low
10         else                ctr.ports[port] <= truncate (next);
11         return ctr.ports[port];                                 // returns the value seen at this port
12      endactionvalue
13   endfunction
14
15   method ActionValue #(Int#(4)) incr1 (Int#(4) delta) = fn_incr(0,delta);
16
17   method ActionValue #(Int#(4)) incr2 (Int#(4) delta) = fn_incr(1,delta);
18 endmodule

In line 3, we instantiate an EHR with two ports holding an Int#(4) value. The function fn_incr is parametrized by a port number. The method incr1 uses port 0, while the method incr2 uses port 1. This implements semantic choice (A), i.e., incr1 < incr2.
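As a usage sketch (the testbench module and rule names here are illustrative), the two methods can now be invoked from two different rules that fire in the same clock; incr1 is logically first and incr2 sees its update:

Exercising both ports in the same clock (illustrative sketch)
module mkTbCounter (Empty);
   UpDownSatCounter c <- mkUpDownCounter;

   rule r1;
      let old1 <- c.incr1 (5);      // returns the value at the start of this clock
      $display ("incr1 returned %0d", old1);
   endrule

   rule r2;
      let old2 <- c.incr2 (-6);     // returns the value after incr1's logical update
      $display ("incr2 returned %0d", old2);
   endrule
endmodule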

Exercise: Add a third method which just sets the counter to a known value n:

Additional set method
1 method Action set (Int #(4) n);

First, define a semantics for this, i.e., provide a specification of what should happen when this method is called in the same clock as either or both of the other two methods (there is no “right” answer to this: you should choose a semantics). Then, extend the module implementation so that it implements this new method with these semantics.

7.7 Concurrent FIFOs

In Sec. 4.3 we introduced elastic pipelines where the stages are separated by FIFOs. For true pipeline behavior, all the individual rules (one for each stage) must be capable of firing concurrently, so that the whole pipeline moves together. More specifically, the enq method in an upstream rule must fire concurrently with the first and deq methods in a downstream rule. In this section we will introduce various FIFOs with this behavior, but with different semantics. Most of the FIFOs we present will have the following standard interface from the BSV library, which is obtained by importing the FIFOF package:

FIFOF interface
1 interface FIFOF #(type t);
2    method Bool notEmpty;
3    method Bool notFull;
4
5    method Action enq (t x);
6
7    method t first;
8    method Action deq;
9
10   method Action clear;
11 endinterface

The notEmpty and notFull methods are always enabled, and can be used to test for emptiness and fullness. The enq method will only be enabled in the notFull state. The first and deq methods will only be enabled in the notEmpty state. The clear method is always enabled, and can be used to set the FIFO to the empty state.
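These implicit method conditions act as rule guards. For example, in the following sketch (using the library’s standard mkFIFOF, a 2-element FIFOF), neither rule needs an explicit test: produce simply cannot fire while the FIFO is full, and consume cannot fire while it is empty.

Implicit FIFOF conditions as rule guards (illustrative sketch)
import FIFOF :: *;

module mkFifofUser (Empty);
   FIFOF #(Bit #(8)) f <- mkFIFOF;
   Reg #(Bit #(8)) n <- mkReg (0);

   rule produce;                      // blocked when f is full (enq's implicit condition)
      f.enq (n);
      n <= n + 1;
   endrule

   rule consume;                      // blocked when f is empty (first/deq's implicit conditions)
      $display ("consumed %0d", f.first);
      f.deq;
   endrule
endmodule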

7.7.1 Multi-element concurrent FIFOs

It is subtle, but not hard, to create a concurrent multi-element FIFO using ordinary registers. Here is a first attempt at creating a two-element FIFO (note that arithmetic on Bit#(n) types is unsigned, and will wrap around):

Two element FIFO, first attempt
1 module mkFIFOF2 (FIFOF #(t));
2    Reg #(t)         rg_0     <- mkRegU;      // data storage
3    Reg #(t)         rg_1     <- mkRegU;      // data storage
4    Reg #(Bit #(1))  rg_tl    <- mkReg (0);   // index of tail (0 or 1)
5    Reg #(Bit #(2))  rg_count <- mkReg (0);   // # of items in FIFO
6
7    Bit #(2) hd = extend (rg_tl) - rg_count;  // index of head
8
9    method Bool notEmpty = (rg_count > 0);
10   method Bool notFull  = (rg_count < 2);
11
12   method Action enq (t x) if (rg_count < 2);
13      if (rg_tl == 0)
14         rg_0 <= x;
15      else
16         rg_1 <= x;
17      rg_tl    <= rg_tl + 1;
18      rg_count <= rg_count + 1;
19   endmethod
20
21   method t first () if (rg_count > 0);
22      return ((hd[0] == 0) ? rg_0 : rg_1);
23   endmethod
24
25   method Action deq () if (rg_count > 0);
26      rg_count <= rg_count - 1;
27   endmethod
28
29   method Action clear;
30      rg_count <= 0;
31   endmethod
32 endmodule

Although functionally OK, enq and deq cannot run concurrently, because they both read and write the rg_count register, and so any concurrent execution would violate a “_read < _write” constraint. We can modify the code to avoid this, as follows:

Concurrent two element FIFO
1 module mkFIFOF2 (FIFOF #(t));
2    Reg #(t)        rg_0  <- mkRegU;      // data storage
3    Reg #(t)        rg_1  <- mkRegU;      // data storage
4    Reg #(Bit #(2)) rg_hd <- mkReg (0);   // index of head
5    Reg #(Bit #(2)) rg_tl <- mkReg (0);   // index of tail
6
7    // Indices run mod 4 while the FIFO holds at most 2 items, so the unsigned,
8    // wrapping subtraction below always yields the correct count (0, 1 or 2).
9    Bit #(2) count = rg_tl - rg_hd;
10
11   method Bool notEmpty = (count > 0);
12   method Bool notFull  = (count < 2);
13
14   method Action enq (t x) if (count < 2);
15      if (rg_tl[0] == 0)
16         rg_0 <= x;
17      else
18         rg_1 <= x;
19      rg_tl <= rg_tl + 1;
20   endmethod
21
22   method t first () if (count > 0);
23      return ((rg_hd[0] == 0) ? rg_0 : rg_1);
24   endmethod
25
26   method Action deq () if (count > 0);
27      rg_hd <= rg_hd + 1;
28   endmethod
29
30   method Action clear;
31      rg_hd <= 0;
32      rg_tl <= 0;
33   endmethod
34 endmodule

In this version of the code, enq and deq both read the rg_hd and rg_tl registers. The former only writes rg_tl, and the latter only writes rg_hd. Thus, the methods can run concurrently. Note that first reads rg_0 and rg_1, while enq writes them; therefore, because of the “_read < _write” constraint, first must precede enq in any concurrent schedule.

Larger FIFOs In the two-element FIFOs just shown, we explicitly instantiated two registers to hold the FIFO data. This would get quite tedious for larger FIFOs, where n > 2. We could, instead, use BSV’s Vector library, like this:

Concurrent n-element FIFO
1 import Vector :: *;
2 ...
3 module mkFIFOF_n #(numeric type n)
4                   (FIFOF #(t));
5    Vector #(n, Reg #(t)) vrg <- replicateM (mkRegU);
6    ...
7    method t first () if (count > 0);
8       return vrg [rg_hd];
9    endmethod
10   ...
11 endmodule

The BSV library provides the mkSizedFIFOF family of modules to create such FIFOs. When n is very large, even such implementations are not efficient, since they require large muxes to select a particular value from the vector. For very large FIFOs, the BSV library provides various memory-based FIFO modules in the BRAMFIFO package.
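For example (a sketch; the instance names below are illustrative), a modest FIFO can use mkSizedFIFOF, while a very deep one can use a BRAM-backed FIFO from the BRAMFIFO package:

Instantiating larger library FIFOs (illustrative sketch)
import FIFOF    :: *;
import BRAMFIFO :: *;

module mkQueues (Empty);
   FIFOF #(Bit #(32)) smallQ <- mkSizedFIFOF (8);         // 8 elements, register/mux based
   FIFOF #(Bit #(32)) bigQ   <- mkSizedBRAMFIFOF (1024);  // 1024 elements, stored in BRAM
endmodule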

7.7.2 Semantics of single element concurrent FIFOs

If we attempt to define a single element concurrent FIFO using ordinary registers, we will face a problem. Here is a first attempt (a specialization of the two element FIFO):

One element FIFO, first attempt
1 module mkFIFOF1 (FIFOF #(t));
2    Reg #(t)        rg       <- mkRegU;     // data storage
3    Reg #(Bit #(1)) rg_count <- mkReg (0);  // # of items in FIFO (0 or 1)
4
5    method Bool notEmpty = (rg_count == 1);
6    method Bool notFull  = (rg_count == 0);
7
8    method Action enq (t x) if (rg_count == 0);
9       rg <= x;
10      rg_count <= 1;
11   endmethod
12
13   method t first () if (rg_count == 1);
14      return rg;
15   endmethod
16
17   method Action deq () if (rg_count == 1);
18      rg_count <= 0;
19   endmethod
20
21   method Action clear;
22      rg_count <= 0;
23   endmethod
24 endmodule

The problem is that, since a one element FIFO is either empty or full (equivalently, either notFull or notEmpty), either enq or deq will be enabled, but not both. Thus, we could never concurrently enq and deq into this FIFO. Before looking at implementations, let us first step back and clarify what we mean by concurrency on a one element FIFO, i.e., define the semantics. There are two obvious choices:

• In a Pipeline FIFO, we can enq and deq concurrently even when it already contains an element, with the following logical ordering:

    first < enq
    deq < enq

This ordering makes it clear that the element returned by first is the element already in the FIFO; that deq logically empties the FIFO; and that enq then inserts a new element into the FIFO. When the FIFO is empty, first and deq are not enabled, and there is no concurrency.

• In a Bypass FIFO, we can enq and deq concurrently even when it contains no elements, with the following logical ordering:

    enq < first
    enq < deq

This ordering makes it clear that enq inserts a new element into the empty FIFO, and this is the element returned by first and emptied by deq. Thus, the enqueued element can “fly through” the FIFO within the same clock, which is why it is called a Bypass FIFO. When the FIFO is full, enq is not enabled, and there is no concurrency.

Figure 7.10: Typical Pipeline and Bypass FIFO usage

As we shall see in subsequent chapters, both kinds of FIFOs are used heavily in processor pipelines. Typically, Pipeline FIFOs are used in the forward direction of a pipeline, and Bypass FIFOs are used for feedback paths that go upstream against the pipeline data flow. Fig. 7.10 illustrates a typical structure. For concurrent operation we typically want rule rb to be earlier in the schedule than rule ra, because when the forward FIFO aTob has data we want, conceptually, to dequeue it first in rb and then concurrently refill it from ra. Thus, for a consistent schedule, you can see that:

    aTob.first and aTob.deq < aTob.enq
    bToa.enq < bToa.first and bToa.deq

and thus aTob is a Pipeline FIFO and bToa is a Bypass FIFO. There are other possibilities, namely that one or both of the FIFOs is a “conflict-free” FIFO with no ordering constraint between first/deq and enq. A nuance: although single element FIFOs make the concurrency question stark, the same issues can arise even in larger FIFOs (n > 1). In those FIFOs the issue is often masked because the FIFO has additional states where it is neither empty nor full, and in those states concurrency is easy. But even in those FIFOs, we may want to enrich the corner cases so that the FIFO has Pipeline FIFO behavior when it is actually full, or Bypass FIFO behavior when it is actually empty.
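Here is a small sketch of the structure in Fig. 7.10, using the mkPipelineFIFOF and mkBypassFIFOF modules defined in the next section (the rule bodies are illustrative). With a PipelineFIFO in the forward direction and a BypassFIFO in the feedback direction, both rules can fire in the same clock, with rb logically first.

Pipeline FIFO forward, Bypass FIFO backward (illustrative sketch)
module mkTwoStageLoop (Empty);
   FIFOF #(Bit #(8)) aTob <- mkPipelineFIFOF;   // forward: first, deq < enq
   FIFOF #(Bit #(8)) bToa <- mkBypassFIFOF;     // feedback: enq < first, deq
   Reg #(Bit #(8)) state_a <- mkReg (0);

   rule ra;                                     // upstream stage
      if (bToa.notEmpty) begin                  // consume feedback when present
         state_a <= bToa.first;
         bToa.deq;
      end
      aTob.enq (state_a);                       // refills the slot rb (logically earlier) just emptied
   endrule

   rule rb;                                     // downstream stage
      let x = aTob.first;
      aTob.deq;
      bToa.enq (x + 1);                         // feedback flies through the BypassFIFO within this clock
   endrule
endmodule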

7.7.3 Implementing single element concurrent FIFOs using EHRs

With EHRs, it is easy to implement single element Pipeline and Bypass FIFOs. Here is an implementation of a Pipeline FIFO:

Pipeline FIFO with EHRs
1 module mkPipelineFIFOF (FIFOF #(t));
2    EHR #(3, t)         ehr       <- mkEHRU;     // data storage
3    EHR #(3, Bit #(1))  ehr_count <- mkEHR (0);  // # of items in FIFO
4
5    method Bool notEmpty = (ehr_count.ports[0] == 1);
6    method Bool notFull  = (ehr_count.ports[1] == 0);
7
8    method Action enq (t x) if (ehr_count.ports[1] == 0);
9       ehr.ports[1]       <= x;
10      ehr_count.ports[1] <= 1;
11   endmethod
12
13   method t first () if (ehr_count.ports[0] == 1);
14      return ehr.ports[0];
15   endmethod
16
17   method Action deq () if (ehr_count.ports[0] == 1);
18      ehr_count.ports[0] <= 0;
19   endmethod
20
21   method Action clear;
22      ehr_count.ports[2] <= 0;
23   endmethod
24 endmodule

We’ve generalized the data storage and item count from ordinary registers to EHRs. The notEmpty, first and deq methods use port 0 of the EHRs, and are thus earliest in the schedule. The notFull and enq methods use port 1 of the EHRs, and are thus next in the schedule. Finally, the clear method uses port 2, and is thus last in the schedule. Similarly, here is a Bypass FIFO:

Bypass FIFO with EHRs
1 module mkBypassFIFOF (FIFOF #(t));
2    EHR #(3, t)         ehr       <- mkEHRU;     // data storage
3    EHR #(3, Bit #(1))  ehr_count <- mkEHR (0);  // # of items in FIFO
4
5    method Bool notEmpty = (ehr_count.ports[1] == 1);
6    method Bool notFull  = (ehr_count.ports[0] == 0);
7
8    method Action enq (t x) if (ehr_count.ports[0] == 0);
9       ehr.ports[0]       <= x;
10      ehr_count.ports[0] <= 1;
11   endmethod
12
13   method t first () if (ehr_count.ports[1] == 1);
14      return ehr.ports[1];
15   endmethod
16
17   method Action deq () if (ehr_count.ports[1] == 1);
18      ehr_count.ports[1] <= 0;
19   endmethod
20
21   method Action clear;
22      ehr_count.ports[2] <= 0;
23   endmethod
24 endmodule

Here we have just reversed the port assignments compared to the Pipeline FIFO: notFull and enq use port 0 and are thus earliest in the schedule; notEmpty, first and deq use port 1 and are next in the schedule; and clear, as before, uses port 2 and is last in the schedule.

Chapter 8

Data Hazards (Read-after-Write Hazards)

8.1 Read-after-write (RAW) Hazards and Scoreboards

Figure 8.1: Alternate two-stage pipelining of SMIPS

Consider an alternative partitioning of our 2-stage pipeline from Chapter 6. Let us move the stage boundary from its current position, between the fetch and decode functions, to a new position, between the decode and execute functions, as illustrated in Fig. 8.1. The information carried from the Decode to the Execute stage has the following data type:

Decode to Execute information
1 typedef struct {
2    Addr        pc;
3    Addr        ppc;
4    Bool        epoch;
5    DecodedInst dInst;
6    Data        rVal1;   // value from source register 1
7    Data        rVal2;   // value from source register 2
8 } Decode2Execute
9 deriving (Bits, Eq);

The code for the processor looks like this:

Processor pipelined at Decode-to-Execute
1 module mkProc(Proc);
2    Reg#(Addr) pc   <- mkRegU;
3    RFile      rf   <- mkRFile;
4    IMemory    iMem <- mkIMemory;
5    DMemory    dMem <- mkDMemory;
6
7    FIFO #(Decode2Execute) d2e <- mkFIFO;
8
9    Reg#(Bool)  fEpoch       <- mkReg(False);
10   Reg#(Bool)  eEpoch       <- mkReg(False);
11   FIFO#(Addr) execRedirect <- mkFIFO;
12
13   rule doFetch;
14      let inst = iMem.req(pc);
15      if (execRedirect.notEmpty) begin
16         fEpoch <= ! fEpoch;
17         pc <= execRedirect.first;
18         execRedirect.deq;
19      end
20      else begin
21         let ppc = nextAddrPredictor(pc);
22         pc <= ppc;
23         // ---- The next three lines used to be in the Execute rule
24         let dInst = decode(inst);
25         let rVal1 = rf.rd1 (validRegValue(dInst.src1));
26         let rVal2 = rf.rd2 (validRegValue(dInst.src2));
27         d2e.enq (Decode2Execute {pc: pc, ppc: ppc,
28                                  dInst: dInst, epoch: fEpoch,
29                                  rVal1: rVal1, rVal2: rVal2});
30      end
31   endrule
32
33   rule doExecute;
34      let x     = d2e.first;
35      let dInst = x.dInst;
36      let pc    = x.pc;
37      let ppc   = x.ppc;
38      let epoch = x.epoch;
39      let rVal1 = x.rVal1;
40      let rVal2 = x.rVal2;
41      if (epoch == eEpoch) begin
42         ... same as before ...
43
44         ... rf.wr (validRegValue (eInst.dst), eInst.data);
45         if (eInst.mispredict) begin
46            ... same as before ...
47         end
48      end
49      d2e.deq;
50   endrule
51 endmodule

Figure 8.2: Classical pipeline schedule

This code is not quite correct, since it has a data hazard. The Fetch rule reads from the register file (rf.rd1 and rf.rd2), and the Execute rule writes to the register file (rf.wr). Unfortunately the Fetch rule may be reading stale values from the register file. Fig. 8.2 is a classical illustration of a pipeline schedule. At time t1, the Fetch-Decode stage is occupied by the first instruction, depicted by FD1. At time t2, this instruction now occupies the Execute stage, depicted by EX1, while the next instruction now occupies the Fetch-Decode stage, depicted by FD2, and so on. Suppose these two instructions were:

I1:  Add (R1, R2, R3)
I2:  Add (R4, R1, R2)

When I2 reads register R1, it should get the value computed by I1. But in the schedule above, FD2 and EX1 access the register file at the same time (at time t2), and so I2 is likely to get the old value in R1 instead of the newly computed value. This is called a Read-after-Write (or RAW) data hazard. What we need to do is to delay, or stall, I2 until I1 has written its value into the register file. This is illustrated in Fig. 8.3.

Figure 8.3: Pipeline schedule with a stall

To deal with RAW hazards, we need to:

• Keep a pending list of registers which will be written by instructions that are still in the pipeline. When an instruction enters the Execute stage we enter its destination in this list; when the instruction completes, we remove this entry.

• Stall the Decode stage when it has an instruction with a RAW hazard, i.e., whose source registers are in the above pending list.

In the Computer Architecture literature, this pending list is called a Scoreboard. In BSV, we express its interface as follows:

Scoreboard interface
1 interface Scoreboard;
2    method Action insert (Maybe #(RIndx) dst);
3    method Bool   search1 (Maybe #(RIndx) src1);
4    method Bool   search2 (Maybe #(RIndx) src2);
5    method Action remove;
6 endinterface

The insert method adds a new destination register name into the scoreboard. The search1 method searches to see if src1 exists in the scoreboard. The search2 method has the same functionality; we need two such methods because we need to test both source registers of an instruction in parallel. The remove method discards the oldest entry in the scoreboard (the one that was inserted first). Conceptually, the scoreboard is a FIFO, with insert corresponding to the enqueue operation and remove to the dequeue operation.
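The implementation of mkScoreboard is not shown here; the following is a minimal sketch of the size-1 case used below, built from ordinary registers. It ignores the question of invoking insert and remove in the same clock; Sec. 8.2 discusses the concurrency ordering (remove < insert) we will actually want, which can be obtained with EHRs just as for the PipelineFIFO of Sec. 7.7.3.

One-entry scoreboard (illustrative sketch)
module mkScoreboard1 (Scoreboard);
   Reg #(Maybe #(RIndx)) slot <- mkReg (tagged Invalid);  // destination of the in-flight instruction, if any
   Reg #(Bool)           full <- mkReg (False);           // is there an instruction in flight?

   function Bool hazard (Maybe #(RIndx) r);
      return (full && isValid (r) && isValid (slot)
              && (fromMaybe (?, r) == fromMaybe (?, slot)));
   endfunction

   method Action insert (Maybe #(RIndx) dst) if (! full);
      slot <= dst;
      full <= True;
   endmethod

   method Bool search1 (Maybe #(RIndx) src1) = hazard (src1);
   method Bool search2 (Maybe #(RIndx) src2) = hazard (src2);

   method Action remove if (full);
      full <= False;
   endmethod
endmodule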

Figure 8.4: Pipeline with scoreboard

Fig. 8.4 illustrates the pipeline along with the scoreboard. We modify the code as follows.

Processor with scoreboard
1
2 module mkProc(Proc);
3    .. same state elements as before ...
4
5    Scoreboard sb <- mkScoreboard (1);
6
7    rule doFetch;
8       let inst = iMem.req(pc);
9       if (execRedirect.notEmpty) begin
10         ... same as before ...
11      end
12      else begin
13         let ppc   = nextAddrPredictor (pc);
14         let dInst = decode(inst);
15         let stall = sb.search1 (dInst.src1) || sb.search2 (dInst.src2);
16         if (! stall) begin
17            let rVal1 = rf.rd1 (validRegValue(dInst.src1));
18            let rVal2 = rf.rd2 (validRegValue(dInst.src2));
19            d2e.enq(Decode2Execute {pc: pc, ppc: ppc,
20                                    dInst: dInst, epoch: fEpoch,
21                                    rVal1: rVal1, rVal2: rVal2});
22            sb.insert(dInst.rDst);
23            pc <= ppc;
24         end
25      end
26   endrule
27
28   rule doExecute;
29      let x = d2e.first;
30      ... same as before ...
31      if (epoch == eEpoch) begin
32         ... same execution as before ...
33      end
34      d2e.deq;
35      sb.remove;
36   endrule
37 endmodule

In line 5 we instantiate a scoreboard with size 1 (holds one entry). This is because, in this pipeline, there is at most one pending instruction in the Execute stage. In line 15 we search the scoreboard for data hazards; we perform the register reads and forward the instruction in the following lines only if we do not stall. In line 22 we insert the destination register of the current instruction into the scoreboard. Note that line 23 updates the PC only if we do not stall; if we stall, this rule will be repeated at the same PC. In the Execute rule, the only difference is that, in line 35, we remove the completed instruction’s destination register from the scoreboard.

We invoke search1, search2 and insert in a single rule (doFetch), i.e., at a single instant. Clearly, we expect the single-rule semantics of these methods to be such that the searches check the existing state of the scoreboard (due to previous instructions only), and insert creates the next state of the scoreboard. There is no concurrency issue here since they are invoked in the same rule.

8.2 Concurrency issues in the pipeline with register file and scoreboard

Figure 8.5: Abstraction of the pipeline for concurrency analysis

Fig. 8.5 is an abstraction of the key elements of the pipeline so that we can analyze its concurrency. In general, we think of the concurrent schedule of pipelines as going from right to left (from downstream to upstream). Please refer back to Fig. 4.6, which depicts an elastic pipeline with a number of stages. In full-throughput steady state, it is possible that every pipeline register (FIFO) is full, i.e., contains data. In this situation, the only rule that can fire is the right-most (most downstream) rule, because it is the only one that is not stalled by a full downstream FIFO. Once it fires, the pipeline register to its left is empty, and this permits the rule to its left to fire. Its firing, in turn, permits the rule to its left to fire, and so on. Of course, these are concurrent firings, i.e., they all fire in the same clock with the right-to-left logical order, thus advancing data through the entire pipeline on each clock. This idea was also discussed around Fig. 7.10, where we saw that we typically deploy PipelineFIFOs in the forward direction (with first and deq < enq) and BypassFIFOs in the reverse direction (with enq < first and deq). Referring again to Fig. 8.5, we expect a schedule in which rule doExecute precedes rule doFetch, d2e is a PipelineFIFO, and the redirect FIFO (execRedirect) is a BypassFIFO. Consistent with this schedule:

• We expect Scoreboard to have remove < insert. This is just like a PipelineFIFO, whose implementation we have already seen in Sec. 7.7.3.

• We expect Register File to have wr < rd1 and rd2. This read-after-write ordering is like a BypassFIFO, and is the opposite order of a conventional register (where _read < _write).

A register file with the given method ordering can easily be implemented using EHRs (described in Sec. 7.6.2):

Register File with R-A-W
1 module mkBypassRFile(RFile);
2    Vector #(32, EHR #(2, Data)) rfile <- replicateM (mkEHR (0));
3
4    method Action wr (RIndx rindx, Data data);
5       if (rindx != 0)
6          rfile[rindx].ports[0] <= data;   // register 0 is hardwired to 0
7    endmethod
8
9    method Data rd1 (RIndx rindx) = rfile[rindx].ports[1];
10
11   method Data rd2 (RIndx rindx) = rfile[rindx].ports[1];
12 endmodule

8.3 Write-after-Write Hazards

What happens if a fetched instruction has no RAW hazards, but writes to the same register as an instruction that is still in the pipeline? This is called a Write-after-Write (WAW) hazard. Note that there can be no instruction between the two writing instructions that reads that register, since any such instruction would itself have stalled in the Fetch stage. There are two ways we could solve this problem:

1. If we implement our scoreboard as a FIFO, there is no problem. The current instruction will be enqueued into the scoreboard, which may now contain two identical entries. Any later instruction that reads this register will be stalled until both corresponding writing instructions have completed and these entries have been removed.

2. We could implement the scoreboard as a simple bit vector with one bit per register in the register file. The insert method sets a bit, and the remove method resets it. In this case, since we cannot keep track of multiple instructions writing to the same register, we must stall any instruction that has a WAW hazard. This requires another method, search3, in the Decode logic, to test the destination register of an instruction in addition to its sources. A sketch of this second approach follows.
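Here is an illustrative sketch of the second approach (all names here are hypothetical, and intra-clock concurrency of insert and remove is again ignored). Because remove takes no argument, a small side FIFO remembers the in-flight destinations in order, so that remove knows which busy bit to clear; the extended interface adds the search3 method mentioned above.

Bit-vector scoreboard with WAW check (illustrative sketch)
import Vector :: *;
import FIFOF  :: *;

interface ScoreboardWAW;
   method Action insert  (Maybe #(RIndx) dst);
   method Bool   search1 (Maybe #(RIndx) src1);
   method Bool   search2 (Maybe #(RIndx) src2);
   method Bool   search3 (Maybe #(RIndx) dst);   // WAW check on the destination
   method Action remove;
endinterface

module mkBitVectorScoreboard (ScoreboardWAW);
   Vector #(32, Reg #(Bool)) busy     <- replicateM (mkReg (False));  // one bit per register
   FIFOF #(Maybe #(RIndx))   inflight <- mkSizedFIFOF (2);            // destinations, oldest first

   function Bool hazard (Maybe #(RIndx) r);
      case (r) matches
         tagged Valid .idx : return busy[idx];
         tagged Invalid    : return False;
      endcase
   endfunction

   method Action insert (Maybe #(RIndx) dst);
      if (dst matches tagged Valid .idx)
         busy[idx] <= True;
      inflight.enq (dst);
   endmethod

   method Bool search1 (Maybe #(RIndx) src1) = hazard (src1);
   method Bool search2 (Maybe #(RIndx) src2) = hazard (src2);
   method Bool search3 (Maybe #(RIndx) dst)  = hazard (dst);

   method Action remove;
      if (inflight.first matches tagged Valid .idx)
         busy[idx] <= False;
      inflight.deq;
   endmethod
endmodule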

8.4 Deeper pipelines

Fig. 8.6 illustrates a version of our pipeline that is partitioned into 3 stages. The previous Execute stage has been refined into two parts, one that performs the ALU and memory operations and another, the “Commit” stage, that performs the writeback of the output value to the register file (if the instruction has an output). Since there can now be two instructions “in flight” ahead of the Decode stage, we need to increase the size of the scoreboard to hold two destination register names. In the Execute stage, when we kill a wrong-path instruction, we can pass an “Invalid” destination to the Commit stage, like so:

Figure 8.6: 3-stage pipeline

Execute stage in 3-stage pipeline
1 rule doExecute;
2    ... same as before ...
3    if (epoch == eEpoch) begin
4       ... same as before, but without rf.wr() ...
5       e2c.enq (Exec2Commit {dst:eInst.dst, data:eInst.data});
6    end
7    else
8       e2c.enq (Exec2Commit {dst: tagged Invalid, data:?});
9    d2e.deq;
10 endrule
11
12 rule doCommit;
13    let dst  = e2c.first.dst;
14    let data = e2c.first.data;
15    if (dst matches tagged Valid .rdst)
16       rf.wr (rdst, data);
17    e2c.deq;
18    sb.remove;   // was in doExecute
19 endrule

The new doCommit rule performs the writeback, if the destination is valid, and also removes the destination from the scoreboard. Fig. 8.7 illustrates a version of our pipeline that is partitioned into 6 stages. The previous Fetch stage has been split into three stages: Fetch, Decode and RegRead (which performs the register reads). The previous Execute stage has been split into two stages: Execute (ALU ops and initiate memory access) and Complete Memory Access.

8.5 Conclusion

The systematic application of pipelining requires a deep understanding of hazards, especially in the presence of concurrent operations. Performance issues are usually more subtle than correctness issues.

Figure 8.7: 6-stage pipeline

In the next chapter we turn our attention back to control hazards, and we will look at the implementation of more sophisticated branch predictors that improve as execution proceeds by learning from the past branching of the program.

Chapter 9

Branch Prediction

9.1 Introduction

In this chapter, we turn our attention back to control hazards and study some increasingly sophisticated schemes for branch prediction.

Figure 9.1: Loop penalty without branch prediction

How important is this issue? Modern processor pipelines are very deep; they may have ten or more stages. Consider the pipeline shown in Fig. 9.1. It is not until the end of the Execute stage, when a branch has been fully resolved, that we know definitively what the next PC should be. If the Fetch stage were always to wait until this moment to fetch the next instruction, we would effectively have no pipeline behavior at all. The time taken to execute a program loop containing N instructions would be N× pipeline-depth. Of course, such a penalty might be acceptable if it were very rare. For example, if we only paid this penalty on actual branch instructions, and branch instructions were few and

far between, then the penalty would be amortized over the high pipelining performance we would get on the other instructions. Unfortunately this is not so in real programs. The SPEC benchmarks are a collection of programs intended to be representative of real application programs, and people have measured the statistics of instructions in these programs. The following table shows the dynamic instruction mix from SPEC, i.e., we run the programs and count each kind of instruction actually executed.

Instruction type   SPECint92   SPECfp92
ALU                39%         13%
FPU Add            —           20%
FPU Mult           —           13%
load               26%         23%
store              9%          9%
branch             16%         8%
other              10%         12%

SPECint92 programs: compress, eqntott, espresso, gcc, li
SPECfp92 programs: doduc, ear, hydro2d, mdljdp2, su2cor

In particular, the average run length between branch instructions is only about 6 to 12 instructions, i.e., branch instructions are quite frequent. Our simple pipelined architecture so far performs a very simple next-address prediction: it always predicts PC+4. This means that it always predicts that branches are not taken. This is not bad, since most instructions are not control-transfer instructions (this policy has been measured on real codes to have about 70% accuracy). In this chapter we will study improvements to this simple prediction scheme. Our improvements will be dynamic, i.e., the prediction schemes “learn”, or “self-tune”, based on past program behavior to improve future prediction.

9.2 Static Branch Prediction

Figure 9.2: Branch probabilities for forward and backward branches

It turns out that in real codes, the probability that a branch is taken can correlate with the branch direction, i.e., whether the branch is forward (towards a higher PC) or backward (towards a lower PC). Fig. 9.2 illustrates the situation.1

1 Of course, these probabilities also depend on the compiler’s code generation choices.

Based on this observation, an ISA designer can pro-actively attach preferred direction semantics to particular opcodes, so that compilers can take advantage of this. For example, in the Motorola MC88110, the bne0 opcode (branch if not equal to zero) is a preferred taken opcode, i.e., it is used in situations where the branch is more often taken. Conversely, the beq0 opcode (branch if equal to zero) is a preferred not taken opcode, and is used in situations where the fall-through is more frequent. The Hewlett Packard PA-RISC and the Intel IA-64 ISAs went further, allowing an arbitrary (static) choice of predicted direction. Measurements indicate that this achieves an 80% accuracy.

9.3 Dynamic Branch Prediction

Dynamic branch predictors have a “learning” or “training” component. As a program is executed, the predictor records some information about how each branch instruction actually resolves (not taken, or taken and, if taken, to which new PC). It uses this information to predict future executions of these branch instructions. If branch behavior were random, this would be useless. Fortunately, branch behaviors are often correlated:

• Temporal correlation: the way a branch resolves may be a good predictor of how it will resolve on the next execution. For example, the loop-exit branch for a loop with 1000 iterations will resolve the same way for the first 999 iterations. The return address of a function called in the loop may be the same for every iteration.

• Spatial correlation: Several branches may resolve in a highly correlated manner (e.g., to follow a preferred path of execution).

There is a very large number of branch prediction schemes described in the literature and used in actual products. Their importance grows with the depth of the processor pipeline, since the misprediction penalty is proportional to this depth. Even a single processor implementation nowadays may use multiple branch prediction schemes. Each scheme is typically easy to understand in the abstract and in isolation. But there is considerable subtlety in how each scheme is integrated into the pipeline, and how multiple schemes might interact with each other. Fortunately, the atomic rule ordering semantics of BSV gives us a clean semantic framework in which to analyze and design correct implementations of such schemes.

Figure 9.3: Dynamic prediction is a feedback control system

As illustrated in Fig. 9.3, dynamic prediction can be viewed as a dynamic feedback control system with the main mechanism sitting in the Fetch stage. It has two basic activities:

• Predict: In the forward direction, the Fetch logic needs an immediate prediction of the next PC to be fetched.

• Update: It receives feedback from downstream stages about the accuracy of its prediction (“truth”), which it uses to update its data structures to improve future predictions.

The key pieces of information of interest are: Is the current instruction a branch? Is it taken or not taken (direction)? If taken, what is the new target PC? These pieces of information may become known at different stages in the pipeline. For example:

Instruction    Direction known after   Target known after
J              After Decode            After Decode
JR             After Decode            After Register Read
BEQZ, BNEZ     After Execute           After Decode

Figure 9.4: Overview of control flow prediction

A predictor can redirect the PC only after it has the relevant information. Fig. 9.4 shows an overview of branch (control flow) prediction. In the Fetch stage there is a tight loop that requires a next-PC prediction for every instruction fetched. At this point the fetched instruction is just an opaque 32-bit value. After Decode, we know the type of instruction (including opcodes). If it is an unconditional absolute or fixed PC-relative branch, we also know the target. After the Register Read stage, we know targets that are in registers, and we may also know some simple conditions. Finally, in the Execute stage, we have complete information. Of course, once we have recognized a misprediction, we must kill (filter out) all mispredicted instructions without letting them have any effect on architectural state. Given a PC and its predicted PC (ppc), a misprediction can be corrected (used to redirect the pc) as soon as it is detected. In fact, pc can be redirected as soon as we have a “better” prediction. However, for forward progress it is important that a correct PC should never be redirected. For example, after the Decode stage, once we know that an instruction is a branch, we can use the past history of that instruction to immediately check whether the direction prediction was correct.

9.4 A first attempt at a better Next-Address Predictor (NAP)

Find another branch? There’s a NAP for that!

Figure 9.5: A first attempt at a better Next-Address Predictor

Fig. 9.5 illustrates a proposed Next-Address Predictor (NAP). Conceptually we could have a table mapping each PC in a program to another PC, which is its next-PC prediction. This table is called the Branch Target Buffer (BTB). Unfortunately, this table would be very large. So, we approximate it with a table that has just 2^k entries. For prediction, we simply fetch the next-PC from this table using k bits of the PC. Later, when we recognize a misprediction, we update this table with the corrected information.

Figure 9.6: Address collisions in the simple BTB

Of course, the problem is that many instructions (where those k bits of their PCs are identical) will map into the same BTB location. This is called an address collision, or aliasing (a similar issue arises in caches, hash tables, etc.). Fig. 9.6 illustrates the problem. Instructions at PC 132 and 1028 have the same lower 7 bits, and so they will map to the same location in a 128-entry BTB. If we first encounter the Jump instruction at PC 132, the entry in the table will be 236 (132 + 4 + 100). Later, when we encounter the Add instruction at PC 1028, we will mistakenly predict a next-PC of 236, instead of 1032. Unfortunately, with such a simple BTB, this would be quite a common occurrence. Fortunately, we can improve our BTB to ameliorate this situation, which we discuss in the next section.

9.5 An improved BTB-based Next-Address Predictor

Figure 9.7: An improved BTB

This particular problem of address collisions is a familiar problem—we also encounter it in hash tables and caches. The solution is standard: if k bits are used for the table index, store the remaining 32 − k bits of the PC in the table as a tag that disambiguates multiple PCs that might map to the same location. This is illustrated in Fig. 9.7. Given a PC, we use k bits to index the table, and then we check the remaining 32 − k bits against the tag to verify that this entry is indeed for this PC and not for one of its aliases. If the entry does not match, we use a default prediction of PC+4. Secondly, we only update the table with entries for actual branch instructions, relying on the default PC+4 prediction for the more common case where the instruction is not a branch. The “valid” bits in the table are initially False. When we update an entry we set its valid bit to True. Note that this prediction is based only on the PC value of an instruction, i.e., before we have decoded it. Measurements have shown that even very small BTBs, which now hold entries only for branch instructions, are very effective in practice.

9.5.1 Implementing the Next-Address Predictor

As usual, we first define the interface of our predictor:

Interface for next-address predictor
1 interface AddrPred;
2    method Addr predPc (Addr pc);
3    method Action update (Redirect rd);
4 endinterface
5
6 typedef struct {
7    Bool  taken;        // branch was taken
8    Addr  pc;
9    Addr  nextPc;
10   Bool  mispredict;   // fields used by the pipeline in Sec. 9.6
11   IType brType;
12 } Redirect
13 deriving (Bits);

The predPc method is used in Fetch’s tight loop to predict the next PC. The update method carries a Redirect struct that contains information about what really happened, to train the predictor. A simple PC+4 predictor module (which we have been using so far) implementing this interface is shown below:

Simple PC+4 predictor module
1 module mkPcPlus4 (AddrPred);
2    method Addr   predPc (Addr pc)     = pc + 4;
3    method Action update (Redirect rd) = noAction;
4 endmodule

Next, we show the code for a BTB-based predictor. We use a register file for the actual table, which is fine for small tables, but we can easily substitute it with an SRAM for larger tables.

BTB-based predictor module
1 typedef struct {
2    Addr   ppc;
3    BtbTag tag;
4 } BTBEntry
5 deriving (Bits);
6
7 module mkBTB (AddrPred);
8    RegFile #(BtbIndex, Maybe #(BTBEntry)) btb <- mkRegFileFull;
9
10   function BtbIndex indexOf (Addr pc) = truncate (pc >> 2);
11   function BtbTag   tagOf   (Addr pc) = truncateLSB (pc);
12
13   method Addr predPc (Addr pc);
14      BtbIndex index = indexOf (pc);
15      BtbTag   tag   = tagOf (pc);
16      if (btb.sub (index) matches tagged Valid .entry &&& entry.tag == tag)
17         return entry.ppc;
18      else
19         return (pc + 4);
20   endmethod
21
22   method Action update (Redirect redirect);
23      if (redirect.taken) begin
24         let index = indexOf (redirect.pc);
25         let tag   = tagOf (redirect.pc);
26         btb.upd (index,
27                  tagged Valid (BTBEntry {ppc: redirect.nextPc, tag: tag}));
28      end
29   endmethod
30 endmodule

9.6 Incorporating the BTB-based predictor in the 2-stage pipeline

Below, we show our 2-stage pipeline (including the scoreboard for data hazard management), modified to incorporate the BTB-based predictor. The changes are marked with NEW comments.

2-stage pipeline with BTB
1 module mkProc(Proc);
2    Reg#(Addr) pc   <- mkRegU;
3    RFile      rf   <- mkRFile;
4    IMemory    iMem <- mkIMemory;
5    DMemory    dMem <- mkDMemory;
6    Fifo#(Decode2Execute) d2e <- mkFifo;
7    Reg#(Bool) fEpoch <- mkReg(False);
8    Reg#(Bool) eEpoch <- mkReg(False);
9    Fifo#(Redirect) execRedirect <- mkFifo;
10   AddrPred btb <- mkBTB;                                 // NEW
11
12   Scoreboard#(1) sb <- mkScoreboard;
13
14   rule doFetch;
15      let inst = iMem.req(pc);
16      if (execRedirect.notEmpty) begin
17         if (execRedirect.first.mispredict) begin         // NEW
18            fEpoch <= !fEpoch;
19            pc <= execRedirect.first.nextPc;
20         end
21         btb.update (execRedirect.first);                 // NEW
22         execRedirect.deq;
23      end
24      else begin
25         let ppc   = btb.predPc (pc);                     // NEW
26         let dInst = decode(inst);
27         let stall = sb.search1(dInst.src1) || sb.search2(dInst.src2);
28         if (! stall) begin
29            let rVal1 = rf.rd1 (validRegValue (dInst.src1));
30            let rVal2 = rf.rd2 (validRegValue (dInst.src2));
31            d2e.enq (Decode2Execute {pc: pc,
32                                     ppc: ppc,
33                                     dInst: dInst,
34                                     epoch: fEpoch,
35                                     rVal1: rVal1,
36                                     rVal2: rVal2});
37            sb.insert(dInst.rDst); pc <= ppc;
38         end
39      end
40   endrule
41
42   rule doExecute;
43      let x = d2e.first;
44      let dInst = x.dInst;  let pc    = x.pc;
45      let ppc   = x.ppc;    let epoch = x.epoch;
46      let rVal1 = x.rVal1;  let rVal2 = x.rVal2;
47      if (epoch == eEpoch) begin
48         let eInst = exec(dInst, rVal1, rVal2, pc, ppc);
49         if (eInst.iType == Ld)
50            eInst.data <- dMem.req (MemReq{op:Ld,
51                                           addr:eInst.addr, data:?});
52         else if (eInst.iType == St) begin
53            let d <- dMem.req (MemReq{op:St,
54                                      addr:eInst.addr, data:eInst.data});
55         end
56         if (isValid (eInst.dst))
57            rf.wr (validRegValue(eInst.dst), eInst.data);
58         // ---- NEW begin
59         if (eInst.iType == J || eInst.iType == Jr || eInst.iType == Br)
60            execRedirect.enq (Redirect{pc: pc,
61                                       nextPc: eInst.addr,
62                                       taken: eInst.brTaken,
63                                       mispredict: eInst.mispredict,
64                                       brType: eInst.iType});
65         // ---- NEW end
66         if (eInst.mispredict) eEpoch <= !eEpoch;
67      end
68      d2e.deq;
69      sb.remove;
70   endrule
71 endmodule

In line 10 we instantiate the BTB. In the Fetch rule, in lines 17-20, we change the epoch and the PC only if execRedirect says it’s describing a misprediction. In line 21 we update the BTB. In line 25 we use the BTB to perform our next-address prediction. In the Execute rule, in lines 58-65, if it is a branch-type instruction, we send information back to the BTB about what really happened.

9.7 Direction predictors

Conceptually, a direction predictor for a branch instruction is just a boolean (1 bit) that says whether the branch was taken or not. Unfortunately, using just 1 bit is fragile. Consider the branch for a loop-exit in a loop with 1000 iterations. For 999 iterations, it branches one way (stays in the loop), and the last iteration branches the other way (exits the loop). If we remember just 1 bit of information, we will only remember the exit direction, even though the probability is 0.999 in the other direction. This can be rectified by incorporating a little hysteresis into the control system. Suppose we remember 2 bits of information, treating it as a 2-bit saturating counter. Every time we branch one way, we increment the counter, but we saturate (don’t wrap around) at 3. When we branch the other way, we decrement the counter, but we saturate (don’t wrap around) at 0. This is illustrated in Fig. 9.8. Now, it takes 2 mispredictions to change the direction prediction. Of course, this idea could be generalized to more bits, increasing the hysteresis.

Figure 9.8: 2 bits to remember branch direction

Fig. 9.9 shows a Branch History Table to implement this idea. Note that this cannot be used any earlier than the Decode stage, because we consult the table only on Branch opcodes (which become known only in Decode). Again, we use k bits of the PC to index the table, and again we could use tags (the remaining 32 − k bits of the PC) to disambiguate due to aliasing. In practice, measurements show that even a 4K-entry BHT, with 2 bits per entry, can produce 80%-90% correct predictions.

Figure 9.9: The Branch History Table (BHT)

Where does this information for updating the BHT come from? Once again, it comes from the Execute stage. In the next section, we discuss how multiple predictors, such as the BTB and BHT, are systematically incorporated into the pipeline.
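Here is an illustrative sketch of such a BHT, in the spirit of Fig. 9.9 (no tags, for brevity). The interface shown here (predTaken/update) is deliberately simpler than the DirPred interface used in Sec. 9.8.1, which also folds in target-address computation; the table of 2-bit saturating counters is the point.

BHT of 2-bit saturating counters (illustrative sketch)
import RegFile :: *;

typedef Bit #(10) BhtIndex;        // 1024-entry BHT: index = 10 bits of the (word-aligned) PC

interface Bht;
   method Bool   predTaken (Addr pc);               // True: predict taken
   method Action update    (Addr pc, Bool taken);   // train with the actual branch outcome
endinterface

module mkBht (Bht);
   RegFile #(BhtIndex, Bit #(2)) counters <- mkRegFileFull;   // 2-bit saturating counters

   function BhtIndex indexOf (Addr pc) = truncate (pc >> 2);

   method Bool predTaken (Addr pc);
      let cnt = counters.sub (indexOf (pc));
      return (cnt[1] == 1);                          // high bit gives the predicted direction
   endmethod

   method Action update (Addr pc, Bool taken);
      let idx = indexOf (pc);
      let cnt = counters.sub (idx);
      if (taken)
         counters.upd (idx, (cnt == 3) ? 3 : cnt + 1);   // saturate at 3
      else
         counters.upd (idx, (cnt == 0) ? 0 : cnt - 1);   // saturate at 0
   endmethod
endmodule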

9.8 Incorporating multiple predictors into the pipeline

Fig. 9.10 shows a sketch of an N-stage pipeline with just one prediction mechanism, the BTB (the figure only shows N = 3 stages, but the same structure holds for N > 3).

Figure 9.10: N-stage pipeline, with one predictor (BTB)

In Execute, if there is an epoch mismatch, we mark the instruction as “poisoned” and send it on. Semantically, we are “dropping” or “discarding” the instruction, but in the implementation only the “poisoned” attribute tells downstream stages to use it only for book-keeping and cleanup, such as removing it from the scoreboard, and not to let it affect any architectural state. If there is no epoch mismatch but the instruction is a branch that was mispredicted, then Execute changes eEpoch. If it is a branch, it also sends the true information back to the Fetch stage. In Fetch: when it receives a redirect message from Execute, it uses it to train the BTB, and to change the epoch and redirect the PC if necessary.

Fig. 9.11 shows how we incorporate 2 predictors into the pipeline, for example adding the BHT into the Decode stage. Since this stage, too, can now detect mispredictions, we provide it with its own epoch register dEpoch, which is changed when this happens. The Fetch stage now has two registers, feEpoch and fdEpoch, which are approximations of Execute’s eEpoch and Decode’s dEpoch, respectively. Finally, Decode gets a deEpoch register that is an approximation of Execute’s eEpoch register. In summary: each mispredict-detecting stage has its own epoch register, and each stage has a register that approximates each downstream epoch register. In all this, remember that Execute’s redirect information (the truth) should never be overridden.

Figure 9.11: N-stage pipeline, with two predictors

Fig. 9.12 illustrates how redirection works. The Fetch stage now attaches two epoch numbers to each instruction it sends downstream, the values from both fdEpoch and feEpoch. We refer to these values carried with the instruction as idEp and ieEp, respectively.

Figure 9.12: N-stage pipeline, with two predictors and redirection logic

The Execute stage behaves as before, comparing the ieEp (on the instruction) with its eEpoch register. In Decode, if ieEp (on the instruction) is different from its deEpoch register, then it now learns that the Execute stage has redirected the PC. In this case, it updates both its epoch registers dEpoch and deEpoch from the values on the instruction, idEp and ieEp, respectively. Otherwise, if idEp (on the instruction) is different from its dEpoch register, then this is a wrong-path instruction with respect to one of Decode’s own redirections, and so it drops the instruction. For non-dropped instructions, if the predicted PC on the instruction differs from its own prediction (using the BHT), it updates its dEpoch register and sends the new information to the Fetch stage.

The Fetch stage can now receive redirect messages from both the Execute and Decode stages. Messages from Execute are treated as before. For a message from Decode, if the eEp value carried in the message differs from feEpoch, then this is a message about a wrong-path instruction, so we ignore it. Otherwise, we respond to the message by redirecting the PC and updating fdEpoch.

9.8.1 Extending our BSV pipeline code with multiple predictors

Now that we understand the architectural ideas behind incorporating multiple predictors in the pipeline, let us modify our BSV code for the pipeline accordingly. We shall work with a 4-stage pipeline, F (Fetch), D&R (Decode and Register Read), E&M (Execute and Memory), and W (Writeback). This version will have no predictor training, so messages are sent only for redirection.

4-stage pipeline with BTB and BHT
1 module mkProc (Proc);
2    Reg#(Addr) pc   <- mkRegU;
3    RFile      rf   <- mkBypassRFile;
4    IMemory    iMem <- mkIMemory;
5    DMemory    dMem <- mkDMemory;
6    Fifo#(1, Decode2Execute) d2e <- mkPipelineFifo;
7    Fifo#(1, Exec2Commit) e2c <- mkPipelineFifo;  Fifo#(1, Fetch2Decode) f2d <- mkPipelineFifo;
8    Scoreboard#(2) sb <- mkPipelineScoreboard;
9    // ---- NEW section begin
10   Reg#(Bool) feEp <- mkReg(False);
11   Reg#(Bool) fdEp <- mkReg(False);
12   Reg#(Bool) dEp  <- mkReg(False);
13   Reg#(Bool) deEp <- mkReg(False);
14   Reg#(Bool) eEp  <- mkReg(False);
15   Fifo#(ExecRedirect) execRedirect <- mkBypassFifo;
16   Fifo#(DecRedirect)  decRedirect  <- mkBypassFifo;
17   AddrPred#(16)  addrPred <- mkBTB;
18   DirPred#(1024) dirPred  <- mkBHT;
19   // ---- NEW section end
20
21   rule doFetch;
22      let inst = iMem.req(pc);
23      if (execRedirect.notEmpty) begin
24         feEp <= !feEp;
25         pc <= execRedirect.first.newPc;
26         execRedirect.deq;
27      end
28      else if (decRedirect.notEmpty) begin
29         if (decRedirect.first.eEp == feEp) begin
30            fdEp <= ! fdEp;
31            pc <= decRedirect.first.newPc;
32         end
33         decRedirect.deq;
34      end
35      else begin
36         let ppc = addrPred.predPc (pc);  pc <= ppc;          // NEW
37         f2d.enq (Fetch2Decode {pc: pc, ppc: ppc, inst: inst,
38                                eEp: feEp, dEp: fdEp});        // NEW
39      end
40   endrule
41
42   function Action decAndRegRead (DInst dInst, Addr pc, Addr ppc, Bool eEp);
43      action
44         let stall = sb.search1(dInst.src1) || sb.search2(dInst.src2)
45                     || sb.search3(dInst.dst);    // WAW hazard check
46         if (!stall) begin
47            let rVal1 = rf.rd1 (validRegValue (dInst.src1));
48            let rVal2 = rf.rd2 (validRegValue (dInst.src2));
49            d2e.enq (Decode2Execute {pc: pc, ppc: ppc,
50                                     dInst: dInst, epoch: eEp,
51                                     rVal1: rVal1, rVal2: rVal2});
52            sb.insert (dInst.rDst);
53         end
54      endaction
55   endfunction
56
57   rule doDecode;
58      let x = f2d.first;  let inst = x.inst;  let pc = x.pc;
59      let ppc = x.ppc;    let idEp = x.dEp;   let ieEp = x.eEp;
60      let dInst = decode(inst);
61      let newPc = dirPred.predAddr(pc, dInst);
62      if (ieEp != deEp) begin    // change Decode’s epochs and
63                                 // continue normal instruction execution
64         deEp <= ieEp;  let newdEp = idEp;
65         decAndRegRead (dInst, pc, newPc, ieEp);
66         if (ppc != newPc) begin
67            newdEp = !newdEp;
68            decRedirect.enq (DecRedirect {pc: pc,
69                                          newPc: newPc, eEp: ieEp});
70         end
71         dEp <= newdEp;
72      end
73      else if (idEp == dEp) begin
74         decAndRegRead (dInst, pc, newPc, ieEp);
75         if (ppc != newPc) begin
76            dEp <= !dEp;
77            decRedirect.enq (DecRedirect {pc: pc,
78                                          newPc: newPc, eEp: ieEp});
79         end
80      end
81      f2d.deq;
82   endrule
83
84   rule doExecute;
85      let x = d2e.first;
86      let dInst = x.dInst;  let pc    = x.pc;
87      let ppc   = x.ppc;    let epoch = x.epoch;
88      let rVal1 = x.rVal1;  let rVal2 = x.rVal2;
89      if (epoch == eEp) begin
90         let eInst = exec (dInst, rVal1, rVal2, pc, ppc);
91         if (eInst.iType == Ld)
92            eInst.data <- dMem.req (MemReq {op:Ld, addr:eInst.addr, data:?});
93         else if (eInst.iType == St) begin
94            let d <- dMem.req (MemReq {op:St, addr:eInst.addr, data:eInst.data});
95         end
96         e2c.enq (Exec2Commit {dst:eInst.dst, data:eInst.data});
97         if (eInst.mispredict) begin
98            execRedirect.enq (ExecRedirect {newPc: eInst.addr});
99            eEp <= !eEp;
100        end
101     end
102     else
103        e2c.enq (Exec2Commit {dst: tagged Invalid, data:?});
104     d2e.deq;
105  endrule
106
107  rule doCommit;
108     let dst  = e2c.first.dst;
109     let data = e2c.first.data;
110     if (isValid (dst))
111        rf.wr (validValue(dst), data);
112     e2c.deq;
113     sb.remove;
114  endrule
115 endmodule

Lines 10-19 instantiate all the new modules and state needed for the predictors. Line 38, in the Fetch rule, attaches two epoch values to each instruction sent downstream. Lines 62-80 in the Decode rule perform the decode-stage prediction that we just described. They use the function shown on lines 42-55. The Execute and Commit rules are completely unchanged from before.

9.9 Conclusion

This chapter has again shown the value of thinking about processor pipelines as distributed systems (logically independent units communicating with messages) instead of as a globally synchronous state machine. Trying to manage the complex concurrency issues around multiple branch predictors by thinking about the pipeline as a globally synchronous machine can be very, very difficult and error-prone. The concept of epochs, distributed approximations of the epoch, and messages whose exact latency does not affect correctness, allow us to design a solution that is clean and modular, and whose correctness is easier to reason about.

Chapter 10

Exceptions

10.1 Introduction

Interrupts, traps and exceptions are “infrequent” events that occur unpredictably during normal instruction execution. These can be synchronous or asynchronous. Synchronous exceptions occur due to some problem during a particular instruction’s execution: undefined opcodes, privileged instructions attempted in user mode, arithmetic overflows, divide by zero, misaligned memory accesses, page faults, TLB misses, memory access protection violations, traps into the OS kernel, and so on. Asynchronous exceptions occur due to some external event requiring the processor’s attention: timer expiry, a signal for an event of interest in an I/O device, hardware failures, power failures, and so on.

Figure 10.1: Instruction flow on an interrupt

When such an interrupt occurs, the normal flow of instructions is temporarily suspended, and the processor executes code from a handler. This is illustrated in Fig. 10.1. When the handler completes, the processor resumes the interrupted normal flow. Of course, we must take care that handler execution does not disturb the architectural state of the normal flow, so that it can be resumed cleanly.


10.2 Asynchronous Interrupts

An asynchronous interrupt is typically initiated by an external device (such as a timer or I/O device) by asserting one of the prioritized interrupt request lines (IRQs) to get the processor’s attention. The processor may not respond immediately, since it may be in the middle of a critical task (perhaps even in the middle of responding to a previous interrupt). Most ISAs have a concept of precise interrupts, i.e., even though the processor may have a deep pipeline with many instructions simultaneously in flight, even though it may be executing instructions in superscalar mode or out of order, there is always a well-defined notion of the interrupt occurring precisely between two instructions Ij−1 and Ij such that it is as if Ij−1 has completed and updated its architectural state, and Ij has not even started.

When the interrupt occurs, the processor saves the PC of instruction Ij in a special register, epc. It disables further interrupts, and jumps to a designated interrupt handler and starts running in kernel mode. Disabling is typically done by writing to an interrupt mask register which has a bit corresponding to each kind of interrupt or each external interrupt line; when a bit here is 0, the processor ignores interrupts from that source. Kernel mode (also sometimes called supervisor mode or privileged mode) is a special mode allowing the processor full access to all resources on the machine; this is in contrast to user mode, which is the mode in which normal application code is executed, where such accesses are prohibited (by trapping if they are attempted). Most ISAs have this kernel/user mode distinction for protection purposes, i.e., it prevents ordinary application programs from accessing or damaging system devices or other application programs that may be time multiplexed concurrently by the operating system. At the speeds at which modern processors operate, interrupts are a major disruption to pipelined execution, and so raw speed of interrupt handling is typically not of paramount concern.

10.2.1 Interrupt Handlers

In effect, the processor is redirected to a handler when an interrupt occurs. What is the PC target for this redirection? In some machines it is a known address fixed by the ISA. Other machines have an interrupt vector at a known address, which is an array of redirection PCs. When an interrupt occurs, the kind and source of the interrupt indicate an index into this array, from which the processor picks up the redirection PC.

Once the PC has been redirected and the handler starts executing, it saves the epc (the PC of the instruction to resume on handler completion). For this, the ISA has an instruction to move the epc into a general purpose register. At least this much work must be done before the handler re-enables interrupts in order to handle nested interrupts. The ISA typically also has a cause register that indicates the cause or source of the interrupt, so that the handler can respond accordingly. On handler completion, it executes a special indirect jump instruction ERET (return from exception) which:

• enables interrupts,

• restores the processor to user mode from kernel mode, and

• restores the hardware status and control state so that execution resumes where it left off, at the interrupted instruction.

10.3 Synchronous Interrupts

A synchronous interrupt is caused by a particular instruction, and behaves like a control hazard, i.e., the PC has to be redirected, and instructions that follow in the normal flow and are already in the pipeline have to be dropped, just like wrong-path instructions after a misprediction. As mentioned before, the ISA defines a special register epc in which the processor stores the PC+4 of the interrupted instruction, and a special ERET instruction to return to normal operation after an exception handler. Synchronous interrupts come in two flavors:

• Exceptions: The instruction cannot be completed (e.g., due to a page fault) and needs to be restarted after the exception has been handled.

• Faults or Traps: These are like system calls, i.e., deliberate calls into kernel mode routines, and the instruction is regarded as completed.

10.3.1 Using synchronous exceptions to handle complex and infrequent instructions

Synchronous exceptions offer an alternative implementation for complex or infrequent instructions. For example, some processors do not implement floating point operations in hardware, because of their hardware complexity. In such an implementation, an instruction like:

mult ra, rb

causes an exception, and the handler implements a floating point multiplication algorithm in software. When the instruction is encountered, PC+4 is stored in epc, the handler performs the multiplication and returns using ERET, and execution continues at PC+4 as usual.

10.3.2 Incorporating exception handling into our single-cycle processor

We need to extend many of our interfaces to express exception-related information wherever necessary, in addition to the normal information. For example, a memory request will now return a 2-tuple, {memory response, memory exception}. Our instruction definitions are extended for instructions like eret and mult, and so on.

New instructions for exceptions
typedef enum {Unsupported, Alu, Ld, St, J, Jr, Br, Mult, ERet}
        IType
   deriving (Bits, Eq);
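For concreteness, the 2-tuple returned by a memory request could be packaged as a struct along the following lines. This is only a sketch of one possible encoding (the struct name and field names are our assumption), using the Cause value None to indicate “no exception”:

Memory response with exception information (sketch)
typedef struct {
   Data   data;    // the normal memory response
   Cause  cause;   // None, or the memory exception raised by this access
} MemRespEx deriving (Bits, Eq);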

Decode has to be extended to handle the new opcodes:

Decode
Bit#(6) fcMULT = 6'b011000; // mult opcode
Bit#(5) rsERET = 5'b10000;  // eret opcode

function DecodedInst decode (Data inst);
   DecodedInst dInst = ?;
   ...
   case (opcode)
      ...
      opFUNC: begin
         case (funct)
            ...
            fcMULT: begin
               dInst.iType = Mult;
               dInst.brFunc = AT;
               dInst.rDst  = Invalid;
               dInst.rSrc1 = validReg (rs);
               dInst.rSrc2 = validReg (rt);
            end
         endcase
      end
      opRS: begin
         ...
         if (rs == rsERET) begin
            dInst.iType = ERet;
            dInst.brFunc = AT;
            dInst.rDst  = Invalid;
            dInst.rSrc1 = Invalid;
            dInst.rSrc2 = Invalid;
         end
      end
   endcase
   return dInst;
endfunction

We modify the code for branch address calculation to handle ERet and to redirect Mult to its software handler:

Branch address calculation
function Addr brAddrCalc (Addr pc, Data val,
                          IType iType, Data imm, Bool taken, Addr epc);
   Addr pcPlus4 = pc + 4;
   Addr targetAddr =
      case (iType)
         J   : {pcPlus4[31:28], imm[27:0]};
         Jr  : val;
         Br  : (taken ? pcPlus4 + imm : pcPlus4);
         Mult: 'h1010;     // Address of multiplication handler
         ERet: epc;
         Alu, Ld, St, Unsupported: pcPlus4;
      endcase;
   return targetAddr;
endfunction

In the code for instruction execution, we pass epc as an additional argument to the branch address calculator:

Instruction execution
function ExecInst exec (DecodedInst dInst, Data rVal1,
                        Data rVal2, Addr pc, Addr epc);
   ...
   let brAddr = brAddrCalc (pc, rVal1, dInst.iType,
                            validValue (dInst.imm), brTaken, epc);
   ...
   eInst.brAddr = ... brAddr ...;
   ...
   return eInst;
endfunction

In the Execute rule, we load epc if the instruction is a mult:

Execute rule
rule doExecute;
   let inst  = iMem.req (pc);
   let dInst = decode (inst);
   let rVal1 = rf.rd1 (validRegValue (dInst.src1));
   let rVal2 = rf.rd2 (validRegValue (dInst.src2));
   let eInst = exec (dInst, rVal1, rVal2, pc, epc);
   if (eInst.iType == Ld)
      eInst.data <- dMem.req (MemReq {op: Ld, addr: eInst.addr, data: ?});
   else if (eInst.iType == St) begin
      let d <- dMem.req (MemReq {op: St, addr: eInst.addr, data: eInst.data});
   end
   if (isValid (eInst.dst))
      rf.wr (validRegValue (eInst.dst), eInst.data);
   pc <= eInst.brTaken ? eInst.addr : pc + 4;
   if (eInst.iType == Mult) epc <= pc + 4;   // resume after the mult on ERET
endrule

With these changes, our single-cycle implementation handles exceptions.

10.4 Incorporating exception handling into our pipelined processor

Fig. 10.2 shows that synchronous exceptions can be raised at various points in the pipeline. In Fetch, the PC address may be invalid or we may encounter a page fault.

Figure 10.2: Synchronous interrupts raised at various points in the pipeline

In Decode, the opcode may be illegal or unimplemented. In Execute, there may be an overflow, or a divide by zero. In Memory, the data address may be invalid or we may encounter a page fault. In fact, a single instruction can raise multiple exceptions as it passes through the pipeline. If so, which one(s) should we handle? This is specified by the ISA (which makes no mention of the implementation question of pipelining). In an ISA, an instruction’s functionality is often described in pseudocode, and this provides a precise specification of which potential exceptions should be handled, and in what order. Normally, this order will correspond to the upstream-to-downstream order in implementation pipelines. In the pipeline, if multiple instructions raise exceptions, which one(s) should we process? This is an easier question, since the ISA is specified with one-instruction-at-a-time execution: clearly we should handle the exception from the oldest instruction in the pipeline.

Figure 10.3: Exception handling in the processor pipeline

Fig. 10.3 illustrates that, in addition to the normal values carried down the pipeline (top of diagram), we must also carry along exception information and the PC of the instruction causing the exception (bottom of diagram). Further, we must have the ability to kill younger instructions. In order to implement the idea of precise exceptions, we do not “immediately” respond to an exception; we carry the information along with the instruction until the final Commit stage, where instruction ordering is finally resolved. Remember also that the exception may be raised by an instruction that is later discovered to be a wrong-path instruction, in which case we should ignore it as part of killing that instruction; these questions are resolved when we reach the final Commit stage.

Similarly, we can also inject external interrupts in the Commit stage, where we have precise knowledge of actual instruction ordering; in this case, we kill all instructions that logically follow the injection point.

Figure 10.4: Redirection logic in the processor pipeline

Other than these exception-specific features, an exception raised in stage j is just like discovering a misprediction in stage j: we use the same epoch mechanisms and kill/poison mechanisms that we saw in dealing with control hazards in Chapter 6 to deal with redirection and instruction-dropping. This is sketched in Fig. 10.4. Every instruction carries along three pieces of information (amongst other things):

• epoch: the global epoch number with which it entered the pipeline

• poisoned: a boolean indicating whether a previous stage has already declared that this instruction should be dropped

• cause: if a previous stage raised an exception, this contains the cause

Each stage contains a local epoch register stageEp.

Each stage pseudocode
if (poisoned || epoch < stageEp)
   pass through
else
   if (cause == None)
      local_cause <- do stage work
      pass
   else
      local_cause = None    // just pass the earlier exception along
      pass
   stageEp <= ((local_cause != None) ? (epoch + 1) : epoch);


If an exception makes it successfully into the Commit stage, then we store the information in the cause and epc registers, and redirect Fetch to the exception handler.

Commit stage pseudocode
if (! poisoned)
   if ((cause != None) || mispredict)
      redirect.enq (redirected PC)

Let us revisit the question of the size (bit-width) of the epoch values. Comparing an instruction’s epoch with a stage’s local epoch, we need to distinguish 3 cases:

• <: the instruction is a wrong-path instruction

• =: the instruction is a normal instruction

• >: the instruction is a normal-path instruction, but a later stage has already caused a redirect

Thus, 3 values are enough, so 2 bits are enough.
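The pipeline code below uses two helper functions on epochs, next and lessThan, without defining them. One plausible encoding, consistent with the three-value argument above (and purely our own sketch, not the text’s definition), is a 2-bit counter that wraps around after the value 2:

Epoch helpers (sketch)
typedef Bit #(2) Epoch;   // only the values 0, 1 and 2 are used

// advance to the next epoch, wrapping 2 -> 0
function Epoch next (Epoch e);
   return (e == 2) ? 0 : e + 1;
endfunction

// a is "older than" b iff b is exactly one step ahead of a
function Bool lessThan (Epoch a, Epoch b);
   return (next (a) == b);
endfunction

With at most one outstanding step of difference between an instruction’s epoch and a stage’s local epoch, this wrap-around comparison distinguishes exactly the three cases above.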

10.4.1 BSV code for pipeline with exception handling

Figure 10.5: 4 stage pipeline in which we add exception handling

In the rest of this chapter we present code for a pipeline with exception handling. As a baseline, we take the 4-stage pipeline illustrated in Fig. 10.5: Fetch (F), Decode and Register Read (D&R), Execute and Memory (E&M), and Writeback (W). It has a single branch predictor and no training, i.e., messages are only sent for redirection. We also integrate Instruction and Data memories. Because (for the moment) we take these to have zero-latency (combinational) reads, we balance the corresponding part of the pipeline with BypassFIFOs as shown in the figure.

Pipeline with exception handling
module mkProc (Proc);
   Reg #(Addr) pc  <- mkRegU;
   Reg #(Addr) epc <- mkRegU;                      // exception PC, written in Commit
   Rfile rf <- mkBypassRFile;
   Cache #(ICacheSize) iCache <- mkCache;
   Cache #(DCacheSize) dCache <- mkCache;

   FIFO #(1, Fetch2Decode) f12f2 <- mkBypassFifo;
   FIFO #(1, Fetch2Decode) f2d   <- mkPipelineFifo; // Fetch2 to Decode
   FIFO #(1, EInst) e2m <- mkBypassFifo;
   FIFO #(1, Decode2Execute) d2e <- mkPipelineFifo;
   FIFO #(1, Exec2Commit) e2c <- mkPipelineFifo;

   Scoreboard #(2) sb <- mkPipelineScoreboard;

   Reg #(Epoch) globalEp <- mkReg (0);
   FIFO #(ExecRedirect) execRedirect <- mkBypassFifo;
   AddrPred #(16) addrPred <- mkBTB;

   Reg #(Epoch) fEp <- mkReg (0);   // Present in 2nd fetch stage
   Reg #(Epoch) dEp <- mkReg (0);
   Reg #(Epoch) eEp <- mkReg (0);
   Reg #(Epoch) mEp <- mkReg (0);   // Present in 2nd execute stage

   rule doFetch1;
      if (execRedirect.notEmpty) begin
         globalEp <= next (globalEp);
         pc <= execRedirect.first.newPc;
         execRedirect.deq;
      end
      else begin
         iCache.req (MemReq {op: Ld, addr: pc, data: ?});
         let ppc = addrPred.predPc (pc);
         f12f2.enq (Fetch2Decode {pc: pc, ppc: ppc, inst: ?,
                                  cause: None, epoch: globalEp});
      end
   endrule

   rule doFetch2;
      match {.inst, .mCause} <- iCache.resp;
      f12f2.deq;
      let f2dVal = f12f2.first;
      if (lessThan (f2dVal.epoch, fEp)) begin
         /* discard */
      end
      else begin
         f2dVal.inst = inst;
         f2dVal.cause = mCause;
         f2d.enq (f2dVal);
         if (mCause != None)
            fEp <= next (fEp);
      end
   endrule

   rule doDecode;
      let x = f2d.first;
      let inst = x.inst;
      let cause = x.cause;
      let pc = x.pc;
      let ppc = x.ppc;
      let epoch = x.epoch;
      if (lessThan (epoch, dEp)) begin
         /* discard */
         f2d.deq;
      end
      else
         if (cause != None) begin
            d2e.enq (Decode2Execute {pc: pc, ppc: ppc,
                                     dInst: ?, cause: cause, epoch: epoch,
                                     rVal1: ?, rVal2: ?});
            dEp <= epoch;
            f2d.deq;
         end
         else begin
            // Not poisoned and no exception in the previous stage
            match {.dInst, .dCause} = decode (inst);
            let stall = sb.search1 (dInst.src1) || sb.search2 (dInst.src2)
                        || sb.search3 (dInst.dst);
            if (dCause != None) begin
               d2e.enq (Decode2Execute {pc: pc, ppc: ppc,
                                        dInst: ?, cause: dCause,
                                        epoch: epoch,
                                        rVal1: ?, rVal2: ?});
               dEp <= next (epoch);
               f2d.deq;
            end
            else if (!stall) begin
               let rVal1 = rf.rd1 (validRegValue (dInst.src1));
               let rVal2 = rf.rd2 (validRegValue (dInst.src2));
               d2e.enq (Decode2Execute {pc: pc, ppc: ppc,
                                        dInst: dInst, cause: dCause,
                                        epoch: epoch,
                                        rVal1: rVal1, rVal2: rVal2});
               dEp <= epoch;
               sb.insert (dInst.rDst);
               f2d.deq;   // on a stall we do not dequeue, so the instruction is retried
            end
         end
   endrule

   // pc redirect has been moved to the Commit stage, where exceptions are resolved
   rule doExecute1;
      let x = d2e.first;
      let dInst = x.dInst;
      let pc = x.pc;
      let ppc = x.ppc;
      let epoch = x.epoch;
      let rVal1 = x.rVal1;
      let rVal2 = x.rVal2;
      let cause = x.cause;
      if (lessThan (epoch, eEp))
         e2m.enq (EInst {pc: pc, poisoned: True});
      else begin
         if (cause != None) begin
            eEp <= epoch;
            e2m.enq (EInst {pc: pc, poisoned: False, cause: cause});
         end
         else begin
            let eInst = exec (dInst, rVal1, rVal2, pc, ppc);
            e2m.enq (eInst);
            if (eInst.cause == None) begin
               // Memory operations should be done only for non-exception
               // and non-poisoned instructions
               if (eInst.iType == Ld)
                  dCache.req (MemReq {op: Ld, addr: eInst.addr, data: ?});
               else if (eInst.iType == St)
                  dCache.req (MemReq {op: St, addr: eInst.addr,
                                      data: eInst.data});

               if (eInst.mispredict)
                  eEp <= next (epoch);
               else
                  eEp <= epoch;
            end
            else
               eEp <= next (epoch);
         end
      end
      d2e.deq;
   endrule

   // Data Memory response stage
   rule doExecute2;
      let x = e2m.first;
      let eInst = x.eInst;
      let poisoned = x.poisoned;
      let cause = x.cause;
      let pc = x.pc;
      let epoch = x.epoch;
      e2m.deq;
      if (poisoned || lessThan (epoch, mEp)) begin
         eInst.poisoned = True;
         e2c.enq (Exec2Commit {eInst: eInst, cause: ?, pc: ?, epoch: ?});
      end
      else begin
         if (cause != None) begin
            mEp <= epoch;
            e2c.enq (Exec2Commit {eInst: eInst, cause: cause,
                                  pc: pc, epoch: epoch});
         end
         else begin
            if (eInst.iType == Ld || eInst.iType == St) begin
               match {.data, .mCause} <- dCache.resp; eInst.data = data;
               if (mCause == None)
                  mEp <= epoch;
               else
                  mEp <= next (epoch);
               e2c.enq (Exec2Commit {eInst: eInst, cause: mCause, pc: pc, epoch: epoch});
            end
            else begin
               mEp <= epoch;
               e2c.enq (Exec2Commit {eInst: eInst, cause: None, pc: pc, epoch: epoch});
            end
         end
      end
   endrule

   rule doCommit;
      let x = e2c.first;
      let cause = x.cause;
      let pc = x.pc;
      e2c.deq;
      if (! (iMemExcep (cause) || iDecodeExcep (cause)))
         sb.remove;
      if (! x.eInst.poisoned) begin
         if (cause == None) begin
            let y = x.eInst;
            let dst = y.dst;
            let data = y.data;
            if (isValid (dst))
               rf.wr (validRegValue (dst), data);
            if (y.mispredict)
               execRedirect.enq (ExecRedirect {newPc: y.addr});
         end
         else begin
            let newpc = excepHandler (cause);
            epc <= pc;
            execRedirect.enq (ExecRedirect {newPc: newpc});
         end
      end
   endrule
endmodule

Chapter 11

Caches

11.1 Introduction

Figure 11.1: 4 stage pipeline

Fig. 11.1 recaps our 4-stage pipeline structure from previous chapters. So far, we have just assumed the existence of an Instruction Memory and a Data Memory which we can access with zero delay (combinationally). Such “magic” memories were useful for getting started, but they are not realistic.

Figure 11.2: Simple memory model


Fig. 11.2 shows a model of one of these simple memories. When an address is presented to the module, the data from the corresponding location is available immediately on the ReadData bus. If WriteEnable is asserted, then the WriteData value is immediately written to the corresponding address. In real systems, with memories of realistic sizes and a realistic “distance” from the processor, reads and writes can take many cycles (even 100s to 1000s of cycles) to complete.

Figure 11.3: A memory hierarchy

Unfortunately it is infeasible to build a memory that is simultaneously large (in capacity), fast (in access time), and low in power consumption. As a result, memories are typically organized into hierarchies, as illustrated in Fig. 11.3. The memory “near” the processor is typically an SRAM (Static Random Access Memory), with limited capacity and high speed. This is backed up by a much larger (and slower) memory, usually in DRAM (Dynamic RAM) technology. The following table gives a general comparison of their properties (where << means “much less than”):

size (capacity in bytes):          RegFile << SRAM << DRAM
density (bits/mm2):                RegFile << SRAM << DRAM
latency (read access time):        RegFile << SRAM << DRAM
bandwidth (bytes per second):      RegFile >> SRAM >> DRAM
power consumption (watts per bit): RegFile >> SRAM >> DRAM

This is because, as we move from RegFiles to SRAMs to DRAMs, we use more and more highly specialized circuits for storage, and to multiplex data in and out of the storage arrays. The idea of a cache is that we use fast (low latency), nearby memory as a temporary container for a subset of the memory locations of interest. When we want to read an address, if it is in the cache (a “hit”), we get the data quickly. If it is not in the cache (a “miss”), we must first invoke a process that brings that address from the backing memory into the cache, which may take many cycles, before we can service the original request. To do this we may also need to throw out or write back something else from the cache to the backing memory (which raises the question: which current item in the cache should we choose to replace with the new one?). If, in a series of n accesses, most of them hit in the cache (we call this a high “hit rate”), then the overhead of processing a few misses is amortized over the many hits. Fortunately, most programs exhibit both spatial locality and temporal locality, which leads to high hit rates. Spatial locality means that if a program accesses an address A, it is likely to access nearby addresses (think of locations in an array, or fields in a struct, or sequences of instructions in instruction memory). Temporal locality means that much data is accessed more than once within a short time window.

When we want to write an address, we are again faced with various choices. If that location is currently cached (a “write hit”), we should of course update it in the cache, but we could also immediately send a message to the backing memory to update its original copy; this policy is called Write-through. Alternatively, we could postpone updating the original location in backing memory until this address must be ejected from the cache (due to the need to use the space it occupies for some other cache line); this policy is called Writeback. When writing, if the location is not currently cached, we could first bring the location into the cache and update it locally; this policy is called Write-allocate. Alternatively, we could just send the write to backing memory; this policy is called Write-no-allocate or Write-around.

Figure 11.4: Sketch of the information in a cache

Fig. 11.4 illustrates the key information held in a cache. Each entry (each row shown) has an address tag, identifying to which (main memory) address A this entry corresponds. Along with this is a cache line, which is a vector of bytes corresponding to addresses A, A + 1, A + 2, ..., A + k − 1 for some small fixed value of k. For example, a modern machine may have cache lines of 64 bytes. There are several reasons for holding data in cache-line-sized chunks. First, spatial and temporal locality suggest that when one word of the cache line is accessed, the other words are likely to be accessed soon. Second, the cost of fetching a line from main memory is so high that it is useful to amortize it over larger cache lines, and the buses and interconnects to memory are often more efficient in transporting larger chunks of data. Finally, the larger the cache line, the fewer things to search for in the cache (resulting in smaller tags, for example).

Conceptually, a cache is an associative or content-addressable memory, i.e., to retrieve an entry with main memory address A we must search the cache for the entry with the address tag of A. A common software example of such a structure is a hash table. This poses an engineering challenge: it is not easy to build large and fast associative memories

(we typically want a response in just 1 cycle if it is a cache hit). Solutions to this problem are the subject of this chapter.

Cache misses are usually classified into three types:

• Compulsory misses: The first time a cache line is accessed, it must be a miss, and it must be brought from memory into the cache. On large, long-running programs (billions of instructions), the penalty due to compulsory misses becomes insignificant.

• Capacity misses: This is a miss on a cache line that used to be in the cache but was ejected to make room for some other line, and so had to be retrieved again from backing memory.

• Conflict misses: Because large, fast, purely associative memories are difficult to build, most cache organizations have some degree of direct addressing using some of the bits of the candidate address. In this case, two addresses A1 and A2, where those bits are the same, may map into the same “bucket” in the cache. If A1 is in the cache and we then want to read A2, we may have to eject A1, which may then be required again later; such a miss is called a conflict miss.

11.2 Cache organizations

Large, fast, purely associative memories are difficult to build. Thus, most caches compromise by incorporating some degree of direct addressing, using some of the bits of the target address as a direct address of the cache memories. Direct addressing and associative lookup can be mixed to varying degrees, leading to the following terminology:

• Direct-Mapped caches only use direct addressing. Some subset of the bits of the original address (typically the lower-order bits) are used directly as an address into the cache memory. Different addresses where the subset of bits have the same value will collide in the cache; only one of them can be held in the cache.

• N-way Set-Associative caches first, like direct-mapped caches, use some of the address bits directly to access the cache memory. However, the addressed location is wide enough to hold N entries, which are searched associatively. For different addresses where the subset of bits have the same value, N of them can be held in the cache without a collision.

Fig. 11.5 illustrates the organization of a direct-mapped cache. We use k bits (called the Index field) of the original address directly to address the 2^k locations of the cache memory. In that entry, we check the Valid bit (V); if invalid, the entry is empty, and we have a miss. If valid, we compare the t bits of the Tag field of the address with the t bits of the Tag field of the entry, to see if it is for the same address or for some other address that happens to map to the same bucket due to having the same Index bits. If it matches, then we have a hit, and we use the b bits of the Offset field of the original address to extract the relevant byte(s) of the cache line.

Fig. 11.6 illustrates how we could have swapped the choice of t and k bits for the Index and Tag from the original address. In fact, any partitioning of the top t + k bits into t and k bits would work. But it is hard to say which choice will result in a higher hit rate, since it depends on how programs and data are laid out in the address space.
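As a concrete (and purely illustrative) instance of this address decomposition, suppose 32-bit addresses, 64-byte cache lines (b = 6) and 256 cache entries (k = 8); the remaining t = 18 bits form the tag. The type names and widths below are our own choices for the example; the bit-slicing is then just:

Direct-mapped address fields (sketch)
typedef Bit #(6)  LineOffset;   // b = 6: byte offset within a 64-byte line
typedef Bit #(8)  DMIndex;      // k = 8: 256 entries
typedef Bit #(18) DMTag;        // t = 32 - 8 - 6

function LineOffset getOffset (Bit #(32) addr) = addr [5:0];
function DMIndex    getIndex  (Bit #(32) addr) = addr [13:6];
function DMTag      getTag    (Bit #(32) addr) = addr [31:14];

Note that the blocking cache later in this chapter uses single-word lines, so its own idxOf/tagOf functions slice the address slightly differently.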

Figure 11.5: Direct-mapped cache organization

Figure 11.6: Choice of which bits to use in a direct mapped cache

Figure 11.7: 2-way set-associative cache organization

Fig. 11.7 illustrates a 2-way set-associative cache. From a lookup point of view, it is conceptually like two direct-mapped caches in parallel, and the target may be found on either side. This can, of course, easily be generalized in principle to wider (4-way, 8-way, ...) associativity. But note that it costs more hardware: in particular, more comparators to test all k candidates in parallel. The k entries at each location in the cache truly form a set, i.e., we will be careful not to place duplicates in these entries, so that they represent distinct addresses, and do not waste cache resources. Thus, set-associativity is likely to improve the hit rate for the program, and thereby improve average memory access time which, in turn, should improve program performance. The average access time is given by the formula:

average access time = hit probability × hit time + miss probability × miss time
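For example, assuming a 1-cycle hit time, a 100-cycle miss time, and a 95% hit rate, the average access time would be 0.95 × 1 + 0.05 × 100 = 5.95 cycles; improving the hit rate to 99% brings this down to 0.99 × 1 + 0.01 × 100 = 1.99 cycles, which is why even small improvements in hit rate matter.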

11.2.1 Replacement policies

When a new cache line is to be brought in, usually it is necessary to eject another to make room for it. Which one should we eject? For direct-mapped caches, there is no choice: there is precisely one candidate entry. For k-way set-associative caches, we must choose one of the k candidates. There are many policies used in practice:

• Prefer lines that are “clean”, i.e., lines that have not been updated in the cache, and so are identical to the copy in backing store, and so can simply be thrown away. The complementary “dirty” lines must incur the overhead of writing them back to backing memory. Note that in Harvard architectures (where we do not dynamically modify code), cache lines in the instruction cache are always clean.

• A “fair” choice policy that does not always favor some members of the k-set over others. Such policies include LRU (Least Recently Used), MRU (Most Recently Used), Random, and Round-Robin.

There is no “best” policy, since the effectiveness of each policy depends on how it interacts with the access patterns of a particular program. Only actual measurements on actual programs can determine what is a good policy choice (for that set of programs).
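As a small sketch of how such a policy might be expressed (our own illustration for a 2-way set, not code from the text), a victim chooser can prefer a clean way and fall back to a round-robin pointer when both ways are clean or both are dirty:

Victim selection for a 2-way set (sketch)
// Returns the way (0 or 1) to evict from a 2-way set.
function Bit #(1) chooseVictim (Bool dirty0, Bool dirty1, Bit #(1) rrPtr);
   if (dirty0 && !dirty1)      return 1;      // way 1 is clean: evict it
   else if (!dirty0 && dirty1) return 0;      // way 0 is clean: evict it
   else                        return rrPtr;  // tie: fall back to round-robin
endfunction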

11.2.2 Blocking and Non-blocking caches

In some pipelined processors, when a Ld (load) instruction misses in the cache, it is possible that another Ld instruction following it closely in the pipeline also issues a request to the cache. What should the cache’s behavior be in this situation? Blocking caches only service one memory request at a time, i.e., they do not accept another request until they have completed the response to the previous one. In the case of a cache miss, this could be a delay of hundreds of cycles while the cache sends a request and waits for a response from backing memory. Non-blocking caches continue to accept requests while waiting to handle a miss on a previous request. These subsequent requests may actually hit in the cache, and the cache may be able to send their responses immediately, out of order. Non-blocking caches typically have some bound on the number of requests that can simultaneously be in flight.

11.3 A Blocking Cache Design

In this section we will examine a design for a blocking cache with single-word cache lines, and with writeback and no-write-miss-allocate policies.

Figure 11.8: A cache interface

Fig. 11.8 shows the interface to a cache. In BSV:

Cache interface
interface Cache;
   method Action req (MemReq r);
   method ActionValue #(Data) resp;

   method ActionValue #(MemReq) memReq;
   method Action memResp (Line r);
endinterface

On the processor side of the cache, the processor sends a memory request (of type MemReq) to the cache using the req method. It retrieves a response (of type Data) from the cache using the resp method. On the memory side of the cache, the memory (or the bus) receives a memory request from the cache by using the memReq method (usually due to a cache miss), and returns a response to the cache using the memResp method. All methods are guarded; for example, the req method will block until the cache has delivered the response to the previous request via the resp method, and the memReq method will block until the cache is ready to issue a request to memory. The req-to-resp path essentially behaves like a 1-element FIFO: until a response is delivered (“dequeued”), the next request cannot be accepted (“enqueued”).

Figure 11.9: State elements in the cache

Fig. 11.9 shows the state elements of the cache, which are specified in the BSV code for the cache module, below.

Cache module
 1 typedef enum { Ready, StartMiss, SendFillReq, WaitFillResp } CacheStatus
 2    deriving (Eq, Bits);
 3
 4 module mkCache (Cache);
 5    // Controls cache state machine
 6    Reg #(CacheStatus) status <- mkReg (Ready);
 7
 8    // The cache memory
 9    RegFile #(CacheIndex, Line) dataArray <- mkRegFileFull;
10    RegFile #(CacheIndex, Maybe #(CacheTag)) tagArray <- mkRegFileFull;
11    RegFile #(CacheIndex, Bool) dirtyArray <- mkRegFileFull;
12
13    FIFO #(Data) hitQ <- mkBypassFIFO; // Allows 0-cycle hit responses
14    Reg #(MemReq) missReq <- mkRegU;
15
16    FIFO #(MemReq) memReqQ <- mkCFFIFO;
17    FIFO #(Line) memRespQ <- mkCFFIFO;
18
19    function CacheIndex idxOf (Addr addr) = truncate (addr >> 2);
20    function CacheTag tagOf (Addr addr) = truncateLSB (addr);
21
22    rule startMiss (status == StartMiss);
23       let idx = idxOf (missReq.addr);
24       let tag = tagArray.sub (idx);
25       let dirty = dirtyArray.sub (idx);
26       if (isValid (tag) && dirty) begin // writeback
27          let addr = {validValue (tag), idx, 2'b0};
28          let data = dataArray.sub (idx);
29          memReqQ.enq (MemReq {op: St, addr: addr, data: data});
30       end
31       status <= SendFillReq;
32    endrule
33
34    rule sendFillReq (status == SendFillReq);
35       memReqQ.enq (missReq);
36       status <= WaitFillResp;
37    endrule
38
39    rule waitFillResp (status == WaitFillResp);
40       let idx = idxOf (missReq.addr);
41       let tag = tagOf (missReq.addr);
42       let data = memRespQ.first;
43       dataArray.upd (idx, data);
44       tagArray.upd (idx, tagged Valid tag);
45       dirtyArray.upd (idx, False);
46       hitQ.enq (data);
47       memRespQ.deq;
48       status <= Ready;
49    endrule
50
51    method Action req (MemReq r) if (status == Ready);
52       let idx = idxOf (r.addr);
53       let hit = False;
54       if (tagArray.sub (idx) matches tagged Valid .currTag
55           &&& currTag == tagOf (r.addr))
56          hit = True;
57       if (r.op == Ld) begin
58          if (hit)
59             hitQ.enq (dataArray.sub (idx));
60          else begin
61             missReq <= r;
62             status <= StartMiss;
63          end
64       end
65       else begin // It is a store request
66          if (hit) begin
67             dataArray.upd (idx, r.data);
68             dirtyArray.upd (idx, True); // for a 1-word cache line
69          end
70          else
71             memReqQ.enq (r); // write-miss no allocate
72       end
73    endmethod
74
75    method ActionValue#(Data) resp;
76       hitQ.deq;
77       return hitQ.first;
78    endmethod
79
80    method ActionValue#(MemReq) memReq;
81       memReqQ.deq;
82       return memReqQ.first;
83    endmethod
84
85    method Action memResp (Line r);
86       memRespQ.enq (r);
87    endmethod
88 endmodule

The cache behavior is a state machine that normally rests in the Ready state. On a miss, it goes into the StartMiss state, sending a fill request (and possibly a writeback) to memory. It waits for the response in the WaitFillResp state. When the response arrives, it performs final actions, including sending a response to the processor, and returns to the Ready state. The type for this state is defined on lines 1 and 2.

The cache memory itself is implemented on lines 9-11 using mkRegFileFull for pedagogical convenience; more realistically, it would be implemented with an SRAM module. The hitQ on line 13 is a BypassFIFO to enable a 0-cycle (combinational) response on hits.

More realistic caches typically do not give a 0-cycle response, in which case this can become an ordinary FIFO or a PipelineFIFO. The memory request and response FIFOs on lines 16-17 are conflict-free FIFOs because they totally isolate the cache from the external memory system, and they are in any case not performance-critical. The two functions on lines 19-20 are just convenience functions to extract the index and tag bits of an address.

The req method only accepts requests in the Ready state. It tests for a hit on lines 54-55 by indexing the tag array, checking that it is valid, and checking that the entry’s tag matches the tag of the address. Line 59 handles a hit for a Load request; the response is entered in hitQ. Lines 61-62 start the miss-handling state machine. Lines 67-68 handle Store requests that hit. The data is stored in the cache entry and the entry is marked dirty (remember, we are implementing a writeback policy, and so this write is not communicated to memory until the line is ejected). This example code assumes one Data word per cache line; for longer cache lines, we’d have to selectively update only one word in the line. In the write-miss case, we merely send the request on to memory, i.e., we do not allocate any cache line for it. The remaining methods are just interfaces into some of the internal FIFOs. The processor receives its response from hitQ.

The startMiss rule on line 22 fires when the req method encounters a miss and sets state to StartMiss. It checks if the current cache entry is dirty and, if so, writes it back to memory. Finally, it enters the SendFillReq state, which is handled by the sendFillReq rule, which sends the request for the missing line to memory. It waits for the response in the WaitFillResp state. When the response arrives, the waitFillResp rule stores the memory response in the cache entry and resets its dirty bit, enters the response into hitQ, and returns to the Ready state.

Note that in the case of a Load miss, if the cache line is not dirty, then the startMiss rule does nothing (line 29 is not executed). In general, a few idle cycles like this don’t seriously affect the miss penalty (if it is important, it is possible to code it to avoid such idle cycles).

11.4 Integrating caches into the processor pipeline

Integrating the Instruction and Data Caches into the processor pipeline is quite straightforward. The one issue that might bear thinking about is that the caches have a variable latency. Each cache behaves like a FIFO for Loads: we “enqueue” a Load request, and we later “dequeue” its response. In the case of a hit, the response may come back immediately; whereas in the case of a miss, the response may come back after many (perhaps 100s of) cycles. But this is not a problem, given our methodology from the beginning of structuring our pipelines as elastic pipelines that work correctly no matter what the delay of any particular stage.

Fig. 11.10 shows our 4-stage pipeline. The orange boxes are FIFOs that carry the rest of the information associated with an instruction (instruction, pc, ppc, epoch, poison, ...) while a request goes through the corresponding cache “FIFO”. In other words, we are conceptually forking the information; one part goes through our in-pipeline FIFO and the other part goes through the cache.

Figure 11.10: Integrating caches into the pipeline

These pieces rejoin on the other side of the FIFOs. Because of the elastic, flow-controlled nature of all the components, all of this just works. The one nuance we might consider is that, if the caches can return responses in 0 cycles (combinationally, on a hit), then the accompanying FIFOs in the pipeline should be BypassFIFOs, to match that latency.

11.5 A Non-blocking cache for the Instruction Memory (Read-Only)

In this section we show a non-blocking cache for the Instruction Memory. Remember that in Harvard architectures, Instruction Memory is never modified, so there are no Store requests to this cache, nor are there ever any dirty lines. Non-blocking caches for Data Memory (Read-Write) are significantly more complicated, and we discuss them briefly in the Conclusion section.

Fig. 11.11 shows the structure of the non-blocking cache. The key new kind of element is the “Completion Buffer” (CB), on the top right of the diagram. A CB is like a FIFO. The reserve method is like an “early” enqueue operation, in that we reserve an ordered place in the FIFO, with a promise to deliver the data later using the complete method. When we reserve the space, the CB gives us a “token” with our reservation, which is essentially a placeholder in the FIFO. When we later complete the enqueue operation, we provide the data and the token, so that the CB knows exactly where to place it in the FIFO. The drain method is just like dequeuing the FIFO: it drains it in order, and will block until the actual data has arrived for the first reserved space in the FIFO. The Completion Buffer is a standard package in the BSV library.

Below, we show the BSV code for the non-blocking read-only cache. The completion buffer is instantiated in line 6. The parameter 16 specifies the capacity of the completion buffer, i.e., it allows the non-blocking cache to have up to 16 requests in flight. The req method first reserves a space in the cb fifo. Then, if it is a hit, it completes the operation immediately in the cb fifo. If it is a miss, it sends the request on to memory, and enqueues the token and the request in the fillQ, in lines 35-36. Note that the req method is immediately available to service the next request, which may be a hit or a miss.

Figure 11.11: Non-blocking cache for Instruction Memory

Independently, when a fill response comes back from memory, the fill rule collects it and the information in the fillQ, updates the cache information, and then invokes the complete method on the completion buffer. Meanwhile, the req method could have accepted several more requests, some of which may be hits that are already in the completion buffer. The resp method returns the responses in the proper order by using the drain method of the completion buffer.

Non-blocking Read-Only Cache module
 1 module mkNBCache (Cache);
 2    // The cache memory
 3    RegFile #(CacheIndex, Line) dataArray <- mkRegFileFull;
 4    RegFile #(CacheIndex, Maybe #(CacheTag)) tagArray <- mkRegFileFull;
 5
 6    CompletionBuffer #(16, Data) cb <- mkCompletionBuffer;
 7    FIFO #(Tuple2 #(Token, MemReq)) fillQ <- mkFIFO;
 8
 9    FIFO #(MemReq) memReqQ <- mkCFFIFO;
10    FIFO #(Line) memRespQ <- mkCFFIFO;
11
12    function CacheIndex idxOf (Addr addr) = truncate (addr >> 2);
13    function CacheTag tagOf (Addr addr) = truncateLSB (addr);
14
15    rule fill;
16       match {.token, .req} = fillQ.first; fillQ.deq;
17       let data = memRespQ.first; memRespQ.deq;
18
19       let idx = idxOf (req.addr);
20       let tag = tagOf (req.addr);
21       dataArray.upd (idx, data);
22       tagArray.upd (idx, tagged Valid tag);
23
24       cb.complete.put (tuple2 (token, data));
25    endrule
26
27    method Action req (MemReq r);
28       let token <- cb.reserve.get;
29       if (tagArray.sub (idxOf (r.addr)) matches tagged Valid .currTag
30           &&& currTag == tagOf (r.addr))
31          // hit
32          cb.complete.put (tuple2 (token, dataArray.sub (idxOf (r.addr))));
33       else begin
34          // miss
35          fillQ.enq (tuple2 (token, r));
36          memReqQ.enq (r); // send request to memory
37       end
38    endmethod
39
40    method ActionValue#(Data) resp;
41       let d <- cb.drain.get;
42       return d;
43    endmethod
44
45    method ActionValue#(MemReq) memReq;
46       memReqQ.deq;
47       return memReqQ.first;
48    endmethod
49
50    method Action memResp (Line r);
51       memRespQ.enq (r);
52    endmethod
53 endmodule

How big should the fillQ be? Suppose it has size 2. Then, after the req method has handled 2 misses by sending 2 requests to memory and enqueuing 2 items into fillQ, fillQ is full until the first response from memory returns and rule fill dequeues an item from fillQ. Method req will then be stuck on the next miss, since it will be blocked from enqueueing into fillQ. Thus, the capacity of fillQ is a measure of how many requests to memory can be outstanding, i.e., still in flight without a response yet. This choice is a design decision.

11.5.1 Completion Buffers

Here we show the code for a Completion Buffer, which is quite easy to implement using EHRs. First, here are some type definitions for completion buffer tokens (FIFO placeholders) and the interface. The interface is parameterized by n, the desired capacity, and t, the type of data stored in the buffer.

Completion Buffer token and interface types
typedef struct {
   UInt #(TLog #(n)) ix;
} CBToken2 #(numeric type n)
   deriving (Bits);

interface CompletionBuffer2 #(numeric type n, type t);
   interface Get #(CBToken2 #(n)) reserve;
   interface Put #(Tuple2 #(CBToken2 #(n), t)) complete;
   interface Get #(t) drain;
endinterface

Completion Buffer module
module mkCompletionBuffer2 (CompletionBuffer2 #(n, t))
   provisos (Bits #(t, tsz),
             Log #(n, logn));

   // The following FIFOs are just used to register inputs and outputs
   // before they fan-out to/fan-in from the vr_data array
   FIFOF #(Tuple2 #(CBToken2 #(n), t)) f_inputs <- mkFIFOF;
   FIFOF #(t) f_outputs <- mkFIFOF;

   // This is the reorder buffer
   Vector #(n, Reg #(Maybe #(t))) vr_data <- replicateM (mkReg (tagged Invalid));

   // Metadata for the reorder buffer
   EHR #(3, Tuple3 #(UInt #(logn),
                     UInt #(logn),
                     Bool)) ehr_head_next_full
      <- mkEHR (tuple3 (0, 0, False));

   match {.head0, .next0, .full0} = ehr_head_next_full.ports [0];
   match {.head1, .next1, .full1} = ehr_head_next_full.ports [1];

   function UInt #(logn) modulo_incr (UInt #(logn) j);
      return ((j == fromInteger (valueOf (n) - 1)) ? 0 : j + 1);
   endfunction

   rule rl_move_inputs;
      let t2 = f_inputs.first; f_inputs.deq;
      match {.tok, .x} = t2;
      vr_data [tok.ix] <= tagged Valid x;
   endrule

   rule rl_move_outputs (vr_data [head0] matches tagged Valid .v
                         &&& ((head0 != next0) || full0));
      vr_data [head0] <= tagged Invalid;
      ehr_head_next_full.ports [0] <= tuple3 (modulo_incr (head0),
                                              next0, False);
      f_outputs.enq (v);
   endrule

   // ----------------
   // Interface

   interface Get reserve;
      method ActionValue #(CBToken2 #(n)) get () if (! full1);
         let next1_prime = modulo_incr (next1);
         ehr_head_next_full.ports [1] <= tuple3 (head1,
                                                 next1_prime,
                                                 (head1 == next1_prime));
         return CBToken2 {ix: next1};
      endmethod
   endinterface

   interface complete = toPut (f_inputs);
   interface drain = toGet (f_outputs);
endmodule

11.6 Conclusion

In the previous section, even though the non-blocking cache could accept and service multiple requests, it returned responses in standard FIFO order. More sophisticated processor designs will expose the out-of-order responses to the processor, i.e., the cache would return responses immediately, and the processor may go ahead and execute those instructions if it is safe to do so. In this case, the processor itself will associate a token with each request, and the cache would return this token with the corresponding response; this allows the processor to know which request each response corresponds to.

Non-blocking caches for the Data Memory are more complex, because we can then encounter RAW (Read-After-Write) hazards on memory locations, analogous to the RAW hazards we saw in Chapter 8 that concerned register file locations. For example, suppose the cache gets a read request Q1 which misses; it initiates the request to memory, and is waiting for the response. Suppose now it gets a write request Q2 for the same cache line. Clearly it must wait for the pending memory read response; it must respond to the processor for Q1 based on the data returned from memory; and only then can it process Q2 by updating the cache line. If we allow several such interactions to be in progress across different cache lines, the design becomes that much more complex.

In this chapter we have barely scratched the surface on the subject of caches. High-performance memory systems, which revolve around clever cache design, are one of the major subjects in Computer Architecture, since the memory system is such a major determinant of program and system performance.

Chapter 12

Virtual Memory

12.1 Introduction

Almost all modern general-purpose processors implement a virtual memory system. Virtual memory systems, sketched in Fig. 12.1, simultaneously address many requirements.

Figure 12.1: Virtual Memory by Paging between physical memory and secondary store

Large address spaces: many applications need large address spaces for the volume of data to which they need random access. Further, modern computer systems often multiplex amongst several such applications. Virtual memory systems permit these larger “memories” actually to reside on large, cheap secondary stores (primarily magnetic disk drives), using a much smaller, but more expensive, DRAM memory as a cache for the disk. For historical reasons, this DRAM memory is also called core memory1 or physical memory. Just as an SRAM cache temporarily holds cache lines from DRAM, in virtual memory systems DRAM temporarily holds pages from secondary store. Whereas cache line sizes are typically a few tens of bytes (e.g., 64 bytes), page sizes typically range from 512 bytes

1Since the days (principally the 1960s and 1970s) when memories were implemented using magnetic cores.


(small) to 4 or 8 kilobytes (more typical). The disk is also known as a swapping store, since pages are swapped between physical memory and disk.

Protection and Privacy: Even in a single-user system, the user’s application should not be allowed, inadvertently or maliciously, to damage the operating system’s code and the data structures that it uses to manage all the devices in the system. In a multi-user system, one user’s application program should not be allowed to read or damage another user’s application code or data.

Relocatability, dynamic libraries and shared dynamic libraries: We should be able to generate machine code for a program independently of where it is loaded in physical memory; specifically, the PC target information in the branch instructions should not have to be changed for each loading. For many years now, we have also dynamically loaded program libraries on demand during program execution. And, more recently, dynamic libraries have been shared across multiple applications running at the same time, and they may not be loadable at the same address in different applications.

Paging systems for virtual memory enable a solution for these and many similar requirements.

12.2 Different kinds of addresses

Fig. 12.2 shows various kinds of addresses in a system with virtual memory.

Figure 12.2: Various kinds of addresses in a Virtual Memory system

A machine language address is the address specified in machine code. A virtual address (VA), sometimes also called an effective address, is derived from the machine language address as specified by the ISA. This might involve relocation from a base address, or adding process identifier bits, segment bits, etc. A physical address (PA) is an address in the physical DRAM memory. The virtual address is translated into a physical address through a data structure called a Translation Lookaside Buffer (TLB). The actual translation is always done in hardware, but the operating system is responsible for loading the TLB.

12.3 Paged Memory Systems

Fig. 12.3 shows how VAs are translated to PAs using a page table. At the top of the figure we see how a virtual address can be interpreted as a virtual page number (VPN) and an offset within the identified page. For example, if our page sizes are 4 KB, then the lower 12 bits specify the offset and the remaining bits specify the page number. Physical memory is also divided into page-sized chunks. The VPN is used as an index into the page table,

Figure 12.3: Mapping VAs to PAs using a page table

where each entry is the physical address of a page in physical memory, i.e., a Physical Page Number (PPN). The specified location is then accessed at the given offset within that page of physical memory. As suggested by the criss-crossing lines in the diagram, this level of indirection through the page table also allows us to store virtual pages non-contiguously in physical memory.
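For instance, with 32-bit virtual addresses and 4 KB pages, the split described above is a 20-bit VPN and a 12-bit offset, and forming the physical address is simply a concatenation of the PPN with the unchanged offset. A minimal sketch (types and names are our own, and we assume 32-bit physical addresses as well, for simplicity):

Address splitting and translation (sketch)
typedef Bit #(20) VPN;         // virtual page number (4 KB pages)
typedef Bit #(20) PPN;         // physical page number
typedef Bit #(12) PageOffset;  // offset within a 4 KB page

function VPN        vpnOf      (Bit #(32) va) = va [31:12];
function PageOffset pgOffsetOf (Bit #(32) va) = va [11:0];

// physical address = PPN from the page table, concatenated with the offset
function Bit #(32) toPA (PPN ppn, Bit #(32) va) = {ppn, va [11:0]};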

Figure 12.4: Multiple address spaces using multiple page tables

The page table indirection also enables us to share physical memory between the Operating System and multiple users. Each such process has its own page table that maps its addresses to its pages in physical memory. In fact the mapping does not have to be disjoint: if many users are running the same application (e.g., the same web browser), we could have a single copy of the code for the web browser in physical memory, and in each page table the part of it that holds the web browser program code could point to the same, shared physical pages.

12.4 Page Tables

If we have 32-bit virtual addresses, we would need 4 GB of disk space to hold an entire address space. Disks of this size are now feasible (they were not, a decade and more ago), but of course many modern ISAs use 64-bit virtual addresses, for which it is simply infeasible actually to hold the entire virtual address space on disk. But programs don’t actually use that much data, so page tables are usually sparse, i.e., many entries are simply “empty”.

Figure 12.5: A simple linear page table

Fig. 12.5 shows a simple linear page table. Each entry in the table, called a Page Table Entry (PTE), typically contains at least the following information:

• A bit that indicates whether the entry is empty or not (valid bit).

• If valid and currently mapped to physical memory, the Physical Page Number (PPN) identifying the physical memory page where it is mapped.

• A Disk Page Number identifying the “home” page on disk where the page resides when it is not mapped into physical memory.

• Additional status and protection bits. For example, pages that contain program code may be marked “read-only”; if there is an attempt to write such a page, then an exception is raised during the page table lookup.

The VPN of the virtual address is used as an index into the page table to retrieve a PTE, which is then used to access the desired word at the offset in the virtual address. Since page tables are set up by the operating system, it makes sense for page tables themselves to reside in main memory, instead of in a separate, special memory. The PT Base Register shown in the diagram is used to allow the OS to place the page table at a convenient address.

Page tables can be large. For example, with a 32-bit address space and 4 KB page size (= 12-bit offset), the page number is 20 bits. If each PTE takes 4 bytes, the page table will be 2^20 × 4 bytes = 4 MB. If we have multiple users and multiple processes running on the machine, their combined page tables themselves may run into multiple Gigabytes. The only place such large arrays will fit is in physical memory itself, as illustrated in Fig. 12.6.

Figure 12.6: Page table for multiple address spaces in physical memory

Of course, if we used larger pages, the page number would use fewer bits, and this would reduce the size of the page table. But larger pages have two problems. First, they may result in so-called internal fragmentation, where large parts of a page are unused (for example if we place each subroutine in a shared library in its own page). Second, they increase the cost of swapping a page between disk and physical memory (called the page fault penalty).

The problem gets worse in ISAs with 64-bit addresses. Even a 1 MB page size results in 44-bit page numbers and 20-bit offsets. With 4-byte PTEs, the page table size is 2^46 bytes (about 70 Terabytes)! It is clearly infeasible actually to lay out such a large page table. Fortunately, a 64-bit address space is very sparsely used, i.e., large tracts of the page table would contain “empty” PTEs. A hierarchical (tree-like) page table is a more efficient data structure for sparse page tables, and this is illustrated in Fig. 12.7. Entire subtrees representing large, contiguous, unused regions of the address space are simply “pruned” from the tree.
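Gathering the PTE fields listed in Section 12.4 into one place, a PTE might be sketched as the following struct. The field names and widths are our own assumptions; real PTE formats pack these bits much more carefully.

Page Table Entry (sketch)
typedef struct {
   Bool       valid;        // is this page currently mapped in physical memory?
   PPN        ppn;          // physical page number, meaningful only if valid
   Bit #(24)  diskPageNum;  // "home" page on the swapping store
   Bool       readOnly;     // protection: writes to this page raise an exception
   Bool       kernelOnly;   // protection: user-mode accesses raise an exception
} PTE deriving (Bits, Eq);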

12.5 Address translation and protection using TLBs

Fig. 12.8 shows the activities involved in translating a virtual address into a physical address. The inputs are: a virtual address, whether it is a read or a write access, and whether the request was issued by a program in kernel mode or user mode (more generally, some indication of the “capability” of the requestor to access this particular page). The VPN is looked up in the page table, as we have discussed. The read/write and kernel/user attributes are checked against the protection bits in the PTE, and if there is a violation, an exception is raised in the processor pipeline instead of performing the memory operation. Note that these activities are performed on every memory access! They have to be efficient if we are to maintain program performance.

Figure 12.7: Hierarchical page tables for large, sparse address spaces

Figure 12.8: Address translation and protection

For example, it is infeasible to emulate this in software; it must be done in hardware. Even in hardware, with a linear page table, each logical memory reference becomes two physical memory references: one to access the PTE and then one to access the target word. The solution, once again, is to cache PTEs in small, fast, processor-local memory. These caches are called Translation Lookaside Buffers (TLBs).

Figure 12.9: A Translation Lookaside Buffer

Fig. 12.9 shows a TLB, which is typically a very small local memory (a few tens to a few hundreds of PTEs). It is associatively searched for the current VPN. If the search succeeds (a hit), it immediately yields the corresponding PPN (and the protection checks can be done as well). Otherwise (a miss), we need to suspend operations, load the corresponding PTE, and then complete the original translation. This is called a Page Table Walk, since modern page tables are hierarchical and finding the PTE involves traversing the tree, starting from the root. TLBs typically have from 32 to 128 entries, and are fully associative. Sometimes larger TLBs (256-512 entries) are 4- or 8-way set-associative. The TLB Reach is a measure of the largest chunk of virtual address space that is accessible from entries in the TLB. For example, with 64 TLB entries and 4 KB pages, the program can access 64 × 4 KB = 256 KB of the address space (which may not be contiguous). On a miss, which entry should we eject? A random or LRU policy is common. Address mappings are typically fixed for each address space, so PTEs are not updated in the TLB, and there is no worry about dirty entries having to be written back.
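A TLB entry, then, is essentially a cached PTE together with the VPN tag against which the lookup is matched associatively. A hedged sketch, reusing the illustrative types introduced earlier (again, our own field names, not a real TLB format):

TLB entry (sketch)
typedef struct {
   Bool  valid;
   VPN   vpn;          // matched associatively against the incoming VPN
   PPN   ppn;          // the cached translation
   Bool  readOnly;     // protection bits checked on every access
   Bool  kernelOnly;
} TLBEntry deriving (Bits, Eq);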

12.6 Variable-sized pages

We mentioned earlier that a uniformly large page size, though it can shrink the page table and increase the TLB hit rate, can lead to internal fragmentation and large page fault penalties. However, there are plenty of programs that can exploit large, densely occupied, contiguous regions of memory. Thus, one strategy is to support multiple page sizes. This is illustrated in Fig. 12.10. In the diagram, a 32-bit Virtual Page Address is interpreted either as having a 20-bit VPN for a 4 KB page as before, or a 10-bit VPN for a 4 MB page (variously called “huge pages”, “large pages” or “superpages”). In the translation process, we use 10 bits (p1) to access the root of the page table tree, the Level 1 page table. Information in the PTE tells us whether it directly points at a large page, or whether it points to a Level 2 page table where the remaining 10 bits (p2) will be used to locate a normal page. The TLB has to change accordingly, and this is illustrated in Fig. 12.11. Each entry in the TLB has an extra bit indicating whether it is for a normal page or for a large page.

Figure 12.10: Variable-sized pages

Figure 12.11: TLB for variable-sized pages

If the entry is for a large page, then only the first 10 bits are matched; the TLB yields a short PPN, and the remaining 10 address bits are passed through and concatenated into the offset. If the entry is for a normal page, then all 20 bits are matched, and the TLB yields the longer PPN for the normal page. [2]
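The two interpretations of a virtual address are just two different bit splits. A small BSV sketch (the function names are ours; the widths follow the 4 KB / 4 MB example above):

   function Tuple2#(Bit#(20), Bit#(12)) splitNormal (Bit#(32) va);   // 4 KB page: 20-bit VPN, 12-bit offset
      return tuple2 (va[31:12], va[11:0]);
   endfunction

   function Tuple2#(Bit#(10), Bit#(22)) splitLarge (Bit#(32) va);    // 4 MB page: 10-bit VPN, 22-bit offset
      return tuple2 (va[31:22], va[21:0]);
   endfunction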

12.7 Handling TLB misses

The page table walk may be done in hardware or in software. For example, in the DEC Alpha AXP ISA (in the 1990s) and in the MIPS ISA, a TLB miss exception traps into a software handler which fixes up the TLB before retrying the provoking instruction. The handler executes in a privileged, “untranslated” addressing mode to perform this function (since we don’t want the handler itself to encounter page faults).

Figure 12.12: Hierarchical Page Table Walk in SPARC v8

In the SPARC v8, x86, PowerPC and ARM ISAs, the TLB walk is performed in a hardware unit called the Memory Management Unit (MMU). The hardware for a tree walk is a relatively simple state machine. If a missing data or page table page is encountered by the hardware, then the MMU abandons the attempt and raises a page-fault exception for the original memory access instruction. Fig. 12.12 shows the hierarchical page table walk in the MMU in the SPARC v8 architecture. It is essentially a sequential descent into the page table tree, using various bits of the virtual address to index into a table at each level to retrieve the address of a sub-tree table. Fig. 12.13 summarizes the full address translation activities. We start with a TLB lookup. If it is a hit, we do the protection check and, if permitted, we have the physical address and can perform the memory access. If there is a protection violation, we raise a “segfault” (segment fault) exception. If the TLB lookup was a miss, we do the Page Table Walk (in hardware or in software). If the page is in physical memory, we update the TLB and complete the memory operation (protection check, etc.).

[2] Note: large and normal pages will be aligned to 22-bit and 12-bit boundaries, respectively, so their PPNs will have correspondingly different sizes.

Figure 12.13: Putting it all together

If the page is not in physical memory, we raise a Page Fault exception, which traps to the operating system to swap the page in from disk, after which the original memory access can be retried.
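Returning to the walk itself: at each level the walker simply slices a field out of the VPN and adds it (scaled by the PTE size) to the base address of the current table, then reads that PTE from memory. A minimal BSV sketch of the address arithmetic, assuming the two-level 10+10+12-bit split of Section 12.6 and 4-byte PTEs (both assumptions, and all names, are ours):

   function Bit#(32) l1PteAddr (Bit#(32) l1Base, Bit#(32) va);
      Bit#(10) p1 = va[31:22];                    // index into the Level 1 table
      return l1Base + zeroExtend ({ p1, 2'b0 });  // each PTE occupies 4 bytes
   endfunction

   function Bit#(32) l2PteAddr (Bit#(32) l2Base, Bit#(32) va);
      Bit#(10) p2 = va[21:12];                    // index into the Level 2 table
      return l2Base + zeroExtend ({ p2, 2'b0 });
   endfunction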

12.8 Handling Page Faults

When we encounter a page fault, i.e., a memory request for a word in a virtual page that currently resides only on disk and is not currently swapped into a physical page, the handler must perform a swap operation. The missing page must be located (or allocated) on disk. It must be copied into physical memory; the PTE for this page in the page table must be updated; and the TLB must be loaded with this entry. We call it a “swap” because, as is generally true in cacheing, this may require another page that is currently in physical memory to be written back to disk in order to make room for this one. We have many of the same considerations as in cacheing:

• Only dirty pages need to be written back to disk, i.e., pages whose copies in physical memory have been updated (written) since they were brought in from disk.

• We need a policy for selecting the “victim” page that will be overwritten (and, if dirty, first written back to disk). Again, possible policies include: favoring clean pages over dirty pages, random, LRU, MRU, round-robin, and so on.

Page fault handling is always done in software. Moving pages between physical memory and disk can take milliseconds, which is extremely slow compared to modern processor speeds; and well-behaved programs don’t page fault too often, so doing it in software has acceptable overheads and performance. The actual movement of page contents between disk and physical memory is usually performed by a separate hardware component called a Direct Memory Access (DMA) engine: the processor merely initializes the DMA engine with information about the size of the transfer and its locations in memory and disk, and the DMA engine then independently performs the copy, usually generating an interrupt to the processor when it has completed. While the DMA engine is doing its work, the processor can be switched to some other process to do some useful work in the interval, instead of waiting for this operation.

12.9 Recursive Page Faults

Figure 12.14: Recursive problem of TLB misses and pages faults during PT walk

During a page table walk, the MMU (or a software handler) is itself accessing memory (for the page tables). If the page tables are themselves in the operating system’s virtual address space, these accesses may, in turn, miss in the TLB, which itself needs a page table walk to refill. During this walk, we may again encounter a page fault. This is a kind of recursive page fault problem. The situation (and the solution) is illustrated in Fig. 12.14. As usual in recursive situations, we need a “base case” to terminate the recursion. We do this by placing the page table for the system in a region of the physical address space that requires no translation. The TLB miss handler, and the page fault handler, for this page table run in a special privileged mode that works directly on physical addresses. Of course, the region of physical memory used for these “pinned” data structures and code is a special reserved region of physical memory that is not part of the pool of page frames where we swap virtual pages. The partitioning of available physical memory (which varies across computer systems) into these regions is typically done by the operating system at boot time.

Figure 12.15: Pages of a page table

One issue about which page fault handlers must take care is illustrated in Fig. 12.15. Consider a page P_PT of a page table that itself resides in virtual memory. When it is in physical memory, some of its PTEs will contain physical addresses of the pages P1, P2, ... for which it is responsible, i.e., those final address-space pages that have also currently been swapped in. We do not want to swap out P_PT while P1, P2, ... are still swapped in; that would mean we would encounter a page fault for accesses to those pages even though they are resident in physical memory. Thus, we swap out P_PT only after P1, P2, ... have themselves been swapped out. A corollary to this is that when P_PT resides on disk, all its PTEs contain only disk addresses, never physical memory addresses.

12.10 Integrating Virtual Memory mechanisms into the processor pipeline

Figure 12.16: Virtual Memory operations in the processor pipeline

Fig. 12.16 shows virtual memory operations in the processor pipeline. In two places, when fetching an instruction from memory and when accessing memory for a Ld or St instruction, we may encounter TLB misses, protection violations and page faults. Each of these raises an exception and is handled using the mechanisms we studied in Ch. 10 on exceptions. These exceptions are restartable exceptions, i.e., the operation (instruction fetch, Ld/St instruction execution) is retried after the exception has been handled (unlike, say, a divide-by-zero exception, where we do not retry the instruction). In the TLB boxes in the diagram, how do we deal with the additional latency needed for the TLB lookup? We can slow down the clock, giving more time for the TLB lookup operation. We can pipeline the TLB and cache accesses (if done in that order, the caches work on physical addresses). Or we can do the TLB access and cache access in parallel (in which case the cache operates on virtual addresses).

Figure 12.17: Caches based on Virtual or Physical Addresses

Fig. 12.17 shows two different organizations of TLB lookup and cache lookup. In the first case the cache works on the physical addresses that come out of TLB translation. In the second case the cache works immediately on the virtual address from the processor pipeline. The advantage of the latter is speed: the cache lookup is initiated earlier, and in the case of a hit, the processor pipeline can proceed immediately. But it has some disadvantages as well. First, since different processes (contexts) use the same virtual addresses for different data, we must flush the cache (empty it out completely) whenever there is a context switch. Otherwise, if address 42 from process A was in the cache, then we would accidentally have a false hit when process B tries to access its address 42, which in general has nothing to do with A’s address 42. An alternative is to attach some process identifier bits to each cache tag, so that tags will never match across contexts.

Figure 12.18: Aliasing in virtually-addressed caches

Second, we may have an aliasing problem due to sharing of pages. This is illustrated in Fig. 12.18: if two different PTEs map to the same physical page, then a particular word in physical memory has two different virtual addresses, and therefore may have two different entries in a virtually-addressed cache. This is problematic because, for example, if the program writes via one of the virtual addresses, the other entry may not be marked dirty. Worse, it violates one of the normal properties and assumptions of memory, i.e., that every memory location is independent, and that writes to one location do not magically change some other location. The general solution to the aliasing problem is to forbid it, i.e., to never allow aliases to coexist in the cache (at least for addresses in writable pages). In early SPARCs, which had direct-mapped caches, this was enforced by making sure that the virtual addresses of shared pages agreed on the cache index bits. This ensured that aliases always collided in the direct-mapped cache, so that only one could be in the cache at a time.

Figure 12.19: Concurrent access to the TLB and cache

Fig. 12.19 shows details of concurrent lookup in the TLB and cache. We use L bits outside the VPN bits to index the cache, since these are available before TLB lookup and are invariant under the TLB translation. The cache entries contain tags based on physical addresses, which are compared with the output of the TLB translation. Note that if the L field extends into the VPN field of the address, the cache becomes partially virtually addressed, which has to be dealt with as discussed earlier (e.g., disambiguating with process ids).
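For example, with 4 KB pages the low 12 address bits are untranslated; if the cache's set-index and block-offset fields together fit within those 12 bits, the index is TLB-invariant and the cache can be probed in parallel with the TLB. A BSV sketch under that assumption (64 sets of 64-byte lines; all names are ours):

   // Bits [5:0] select the byte within a 64-byte line; bits [11:6] select one of 64 sets.
   // Both fields lie within the 12-bit page offset, so they are unaffected by translation.
   typedef Bit#(6) CacheIndex;
   function CacheIndex cacheIndexOf (Bit#(32) va);
      return va[11:6];
   endfunction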

Figure 12.20: A cache with virtual index and physical tag

Fig. 12.20 shows a cache with a virtual index and physical tag. Virtual address bits are used for indexing the cache, but the cache entries contain physical addresses, which are compared against the post-TLB PPN for a hit. Of course, this can be done on direct-mapped as well as W-way set-associative caches.

12.11 Conclusion

Figure 12.21: System memory hierarchy

Fig. 12.21 reprises the major components of the overall system memory hierarchy. If we were to add more detail, we would also include the register file in the processor at the top of the hierarchy, and possibly multiple levels of cache, called Level 1 (L1), Level 2 (L2), Level 3 (L3) and so on. Modern processors often have 3 levels of cacheing or more. Many of the issues of cacheing are in principle the same at all levels of the memory hierarchy: associativity and lookup algorithms, clean vs. dirty, replacement policies, inclusiveness, aliasing, and so on. But the solutions are engineered differently because of their different scale (size), the size of units being cached, access patterns, and miss penalties. The table below summarizes some of the differences:

                          Cacheing                           Virtual Memory (Demand Paging)
Unit                      cache line (a.k.a. cache block)    page
Typical size              32 bytes                           4 Kbytes
Container                 cache entry                        page frame
Typical miss rates        1% to 20%                          < 0.001%
Typical hit latencies     1 clock cycle                      100 clock cycles
Typical miss latencies    100 clock cycles                   5 M clock cycles
Typical miss handling     in hardware                        in software

Note for current draft of the book: In this chapter we have not shown any BSV code for virtual memory components. This is only because we do not yet have the corresponding software tools needed to exercise such solutions: operating system code using privileged instructions, TLB walkers, page fault handlers, etc. We expect all these to be available with a future edition of the book.

Chapter 13

Future Topics

This chapter lists a number of potential topics for future editions of this book.

13.1 Asynchronous Exceptions and Interrupts

13.2 Out-of-order pipelines

Register renaming, data flow. Precise exceptions.

13.3 Protection and System Issues

Protection issues and solutions. Virtualization. Capabilities.

13.4 I and D Cache Coherence

The coherence problem between Instruction and Data memories when Instruction memory is modifiable. Princeton architecture. Self-modifying code. On-the-fly optimization. JIT compiling. Binary translation.

13.5 Multicore and Multicore cache coherence

Multicore, SMPs. Memory models and memory ordering semantics. Multicore and directory-based cache coherence. Synchronization primitives. Transactional memory.


13.6 Simultaneous Multithreading

SMT

13.7 Energy efficiency

Power and .

13.8 Hardware accelerators

Extending an ISA with HW-accelerated op codes, implemented in FPGAs or ASICs.

Appendix A

SMIPS Reference

SMIPS (“Simple MIPS”) is a subset of the full MIPS instruction set architecture (ISA). MIPS was one of the first commercial RISC (Reduced Instruction Set Computer) processors, and grew out of the earlier MIPS research project at Stanford University. MIPS originally stood for “Microprocessor without Interlocking Pipeline Stages” and the goal was to simplify the machine pipeline by requiring the compiler to schedule around pipeline hazards, including a branch delay slot and a load delay slot (although those particular architectural choices have since been abandoned). Today, MIPS CPUs are used in a wide range of devices: Casio builds handheld PDAs using MIPS CPUs, Sony uses two MIPS CPUs in the Playstation 2, many Cisco internet routers contain MIPS CPUs, and Silicon Graphics makes Origin supercomputers containing up to 512 MIPS processors sharing a common memory. MIPS implementations probably span the widest range for any commercial ISA, from simple single-issue in-order pipelines to quad-issue out-of-order superscalar processors. There are several variants of the MIPS ISA. The ISA has evolved from the original 32-bit MIPS-I architecture used in the MIPS R2000 processor, which appeared in 1986. The MIPS-II architecture added a few more instructions while retaining a 32-bit address space. The MIPS-II architecture also added hardware interlocks for the load delay slot. In practice, compilers couldn’t fill enough of the load delay slots with useful work, and the NOPs in the load delay slots wasted instruction cache space. (Removing the branch delay slots might also have been a good idea, but would have required a second set of branch instruction encodings to remain backwards compatible.) The MIPS-III architecture debuted with the MIPS R4000 processor, and this extended the address space to 64 bits while leaving the original 32-bit architecture as a proper subset. The MIPS-IV architecture was developed by Silicon Graphics to add many enhancements for floating-point computations and appeared first in the MIPS R8000 and later in the MIPS R10000. Over time, the MIPS architecture has been widely extended, occasionally in non-compatible ways, by different processor implementors. MIPS Technologies, the current owners of the architecture, are trying to rationalize the architecture into two broad groupings: MIPS32 is the 32-bit address space version, and MIPS64 is the 64-bit address space version. There is also MIPS16, which is a compact encoding of MIPS32 that uses only 16 bits for each instruction. You can find a complete description of the MIPS instruction set at the MIPS Technologies web site [5] or in the book by Kane and Heinrich [6]. The book by Sweetman also explains MIPS programming [8]. Another source of MIPS details and implementation ideas is “Computer Organization and Design: The Hardware/Software Interface” [3].


Our subset, SMIPS, implements a subset of the MIPS32 ISA. It does not include floating point instructions, trap instructions, misaligned loads/stores, branch and link instructions, or branch likely instructions. There are three SMIPS variants, which are discussed in more detail in Appendix A. SMIPSv1 has only five instructions and is mainly used as a toy ISA for instructional purposes. SMIPSv2 includes the basic integer, memory, and control instructions. It excludes multiply instructions, divide instructions, byte/halfword loads/stores, and instructions which cause arithmetic overflows. Neither SMIPSv1 nor SMIPSv2 supports exceptions, interrupts, or most of the system coprocessor. SMIPSv3 is the full SMIPS ISA and includes all the instructions in our MIPS subset.

A.1 Basic Architecture

Figure A.1: SMIPS CPU Registers

Fig. A.1 shows the programmer-visible state in the CPU. There are 31 general purpose 32-bit registers r1-r31. Register r0 is hardwired to the constant 0. There are three special registers defined in the architecture: two registers hi and lo are used to hold the results of integer multiplies and divides, and the program counter pc holds the address of the instruction to be executed next. These special registers are used or modified implicitly by certain instructions. SMIPS differs significantly from the MIPS32 ISA in one very important respect: SMIPS does not have a programmer-visible branch delay slot. Although this slightly complicates the control logic required in simple SMIPS pipelines, it greatly simplifies the design of more sophisticated out-of-order and superscalar processors. As in MIPS32, loads are fully interlocked and thus there is no programmer-visible load delay slot. Multiply instructions perform 32-bit × 32-bit → 64-bit signed or unsigned integer multiplies, placing the result in the hi and lo registers. Divide instructions perform a 32-bit/32-bit signed or unsigned divide, returning both a 32-bit integer quotient and a 32-bit remainder. Integer multiplies and divides can proceed in parallel with other instructions provided the hi and lo registers are not read. The SMIPS CPU has two operating modes: user mode and kernel mode. The current operating mode is stored in the KUc bit in the system coprocessor (COP0) status register. The CPU normally operates in user mode until an exception forces a switch into kernel mode. The CPU will then normally execute an exception handler in kernel mode before executing a Return From Exception (ERET) instruction to return to user mode.
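For readers who want to relate this to the BSV designs of earlier chapters, the architecturally visible state could be summarized in a struct like the following sketch (this is not the book's processor code; the names are illustrative):

   import Vector::*;

   typedef Bit#(32) Word;

   typedef struct {
      Vector#(32, Word) rf;   // general-purpose registers; reads of index 0 must return 0
      Word              pc;   // address of the next instruction to execute
      Word              hi;   // upper half of multiply results / divide remainder
      Word              lo;   // lower half of multiply results / divide quotient
   } ArchState deriving (Bits, Eq);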

A.2 System Control Coprocessor (CP0)

The SMIPS system control coprocessor contains a number of registers used for exception handling, communication with a test rig, and the counter/timer. These registers are read and written using the MIPS standard MFC0 and MTC0 instructions respectively. User mode can access the system control coprocessor only if the cu[0] bit is set in the status register. Kernel mode can always access CP0, regardless of the setting of the cu[0] bit. CP0 control registers are listed in Table A.1.

Number   Register   Description
0-7      unused
8        badvaddr   Bad virtual address.
9        count      Counter/timer register.
10       unused
11       compare    Timer compare register.
12       status     Status register.
13       cause      Cause of last exception.
14       epc        Exception program counter.
15-19    unused
20       fromhost   Test input register.
21       tohost     Test output register.
22-31    unused

Table A.1: CP0 control registers

A.2.1 Test Communication Registers

Figure A.2: Fromhost and Tohost Register Formats

There are two registers used for communicating and synchronizing with an external host test system. Typically, these will be accessed over a scan chain. The fromhost register is an 8-bit read only register that contains a value written by the host system. The tohost register is an 8-bit read/write register that contains a value that can be read back by the host system. The tohost register is cleared by reset to simplify synchronization with the host test rig. Their format is shown in Fig. A.2.

A.2.2 Counter/Timer Registers

SMIPS includes a counter/timer facility provided by the two coprocessor 0 registers count and compare. Both registers are 32 bits wide and are readable and writeable. Their format is shown in Fig. A.3.

Figure A.3: Count and Compare Registers.

The count register contains a value that increments once every clock cycle. The count register is normally only written for initialization and test purposes. A timer interrupt is flagged in ip7 in the cause register when the count register reaches the same value as the compare register. The interrupt will only be taken if both im7 and iec in the status register are set. The timer interrupt flag in ip7 can only be cleared by writing the compare register. The compare register is usually only read for test purposes.
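The core behavior is a register that increments every cycle plus an equality check against compare. A minimal BSV sketch (the module and register names are ours; it omits the MTC0 write paths and the clearing of ip7 on a compare write):

   module mkCounterTimer (Empty);
      Reg#(Bit#(32)) count    <- mkReg(0);
      Reg#(Bit#(32)) compareR <- mkReg(0);
      Reg#(Bool)     ip7      <- mkReg(False);   // timer-interrupt pending flag (cause.ip7)

      rule tick;                                  // fires on every clock cycle
         count <= count + 1;
         if (count == compareR)                   // flag a timer interrupt when count reaches compare
            ip7 <= True;
      endrule
   endmodule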

A.2.3 Exception Processing Registers

A number of CP0 registers are used for exception processing.

Status Register

Figure A.4: Status Register Format

The status register is a 32-bit read/write register formatted as shown in Fig. A.4. The status register keeps track of the processor’s current operating state. The CU field has a single bit for each coprocessor indicating if that coprocessor is usable. Bits 29-31, corresponding to coprocessors 1, 2, and 3, are permanently wired to 0, as these coprocessors are not available in SMIPS. Coprocessor 0 is always accessible in kernel mode regardless of the setting of bit 28 of the status register. The IM field contains interrupt mask bits. Timer interrupts are disabled by clearing im7 in bit 15. External interrupts are disabled by clearing im6 in bit 14. The other bits within the IM field are not used on SMIPS and should be written with zeros. Table A.4 includes a listing of interrupt bit positions and descriptions. The KUc/IEc/KUp/IEp/KUo/IEo bits form a three-level stack holding the operating mode (kernel=0/user=1) and global interrupt enable (disabled=0/enabled=1) for the current state and for the states before the two previous exceptions. When an exception is taken, the stack is shifted left 2 bits and zero is written into KUc and IEc. When a Restore From Exception (RFE) instruction is executed, the stack is shifted right 2 bits, and the values in KUo/IEo are unchanged.
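The stack manipulation is plain bit shuffling. A BSV sketch, assuming (as in MIPS-I) that the low six bits of status are laid out as {KUo, IEo, KUp, IEp, KUc, IEc}; that layout and the function names are our assumptions, not taken from Fig. A.4:

   function Bit#(6) pushKUIE (Bit#(6) s);   // on an exception: shift left by 2; KUc = IEc = 0
      return { s[3:0], 2'b00 };
   endfunction

   function Bit#(6) popKUIE (Bit#(6) s);    // on RFE: shift right by 2; KUo/IEo are left unchanged
      return { s[5:4], s[5:2] };
   endfunction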

Cause Register

Figure A.5: Cause Register Format

The cause register is a 32-bit register formatted as shown in Fig. A.5. The cause register contains information about the type of the last exception and is read only. The ExcCode field contains an exception type code. The values for ExcCode are listed in Table A.2. The ExcCode field will typically be masked off and used to index into a table of software exception handlers.

ExcCode   Mnemonic   Description
0         Hint       External interrupt
2         Tint       Timer interrupt
4         AdEL       Address or misalignment error on load
5         AdES       Address or misalignment error on store
6         AdEF       Address or misalignment error on fetch
8         Sys        Syscall exception
9         Bp         Breakpoint exception
10        RI         Reserved instruction exception
12        Ov         Arithmetic overflow

Table A.2: Exception Types

If the Branch Delay bit (BD) is set, the instruction that caused the exception was executing in a branch delay slot and epc points to the immediately preceding branch instruction; otherwise epc points to the faulting instruction itself. The IP field indicates which interrupts are pending. Field ip7 in bit 15 flags a timer interrupt. Field ip6 in bit 14 flags an external interrupt from the host test setup. The other IP bits are unused in SMIPS and should be ignored when read. Table A.4 includes a listing of interrupt bit positions and descriptions.

Exception Program Counter

Figure A.6: EPC Register

Epc is a 32-bit read only register formatted as shown in Fig. A.6. When an exception occurs, epc is written with the virtual address of the instruction that caused the exception.

Bad Virtual Address

Figure A.7: Badvaddr Register

Badvaddr is a 32-bit read only register formatted as shown in Fig. A.7. When a memory address error generates an AdEL or AdES exception, badvaddr is written with the faulting virtual address. The value in badvaddr is undefined for other exceptions.

A.3 Addressing and Memory Protection

SMIPS has a full 32-bit virtual address space with a full 32-bit physical address space. Sub-word data addressing is big-endian on SMIPS.

The virtual address space is split into two 2 GB segments, a kernel only segment (kseg) from 0x0000 0000 to 0x7fff ffff, and a kernel and user segment (kuseg) from 0x8000 0000 to 0xffff ffff. The segments are shown in Fig. A.8.

Figure A.8: SMIPS virtual address space

In kernel mode, the processor can access any address in the entire 4 GB virtual address space. In user mode, instruction fetches or scalar data accesses to the kseg segment are illegal and cause a synchronous exception. The AdEF exception is generated for an illegal instruction fetch, and AdEL and AdES exceptions are generated for illegal loads and stores respectively. For faulting stores, no data memory will be written at the faulting address.

There is no memory translation hardware on SMIPS. Virtual addresses are directly used as physical addresses in the external memory system. The memory system simply ignores unused high-order address bits, in which case each physical memory address will be shadowed multiple times in the virtual address space.

A.4 Reset, Interrupt, and Exception Processing

There are three possible sources of disruption to normal program flow: reset, interrupts (asynchronous exceptions), and synchronous exceptions. Reset and interrupts occur asynchronously to the executing program and can be considered to occur between instructions. Synchronous exceptions occur during execution of a particular instruction. If more than one of these classes of event occurs on a given cycle, reset has highest priority, and all interrupts have priority over all synchronous exceptions. The tables below show the priorities of different types of interrupt and synchronous exception. The flow of control is transferred to one of two locations as shown in Table A.3. Reset has a separate vector from all other exceptions and interrupts.

Vector Address   Cause
0x0000_1000      Reset
0x0000_1100      Exceptions and internal interrupts

Table A.3: SMIPS Reset, Exception, and Interrupt Vectors.

A.4.1 Reset

When the external reset is deasserted, the PC is reset to 0x0000_1000, with kuc set to 0, and iec set to 0. The effect is to start execution at the reset vector in kernel mode with interrupts disabled. The tohost register is also set to zero to allow synchronization with the host system. All other state is undefined. A typical reset sequence is shown in Fig. A.9.

reset_vector:
    mtc0 zero, $9     # Initialize counter.
    mtc0 zero, $11    # Clear any timer interrupt in compare.

    # Initialize status with desired CU, IM, and KU/IE fields.
    li   k0, (CU_VAL|IM_VAL|KUIE_VAL)
    mtc0 k0, $12      # Write to status register.

    j    kernel_init  # Initialize kernel software.

Figure A.9: A typical SMIPS reset sequence

A.4.2 Interrupts

The two interrupts possible on SMIPS are listed in Table A.4 in order of decreasing priority.

Vector        ExcCode   Mnemonic   IM/IP Index   Description
                                                 Highest Priority
0x0000_1100   0         Hint       6             Tester interrupt.
0x0000_1100   2         Tint       7             Timer interrupt.
                                                 Lowest Priority

Table A.4: SMIPS Interrupts.

All SMIPS interrupts are level triggered. For each interrupt there is an IP flag in the cause register that is set if that interrupt is pending, and an IM flag in the status register that enables the interrupt when set. In addition there is a single global interrupt enable bit, iec, that disables all interrupts if cleared. A particular interrupt can only occur if both IP and IM for that interrupt are set and iec is set, and there are no higher priority interrupts. The host external interrupt flag IP6 can be written by the host test system over a scan interface. Usually a protocol over the host scan interface informs the host that it can clear down the interrupt flag. The timer interrupt flag IP7 is set when the value in the count register matches the value in the compare register. The flag can only be cleared as a side-effect of a MTC0 write to the compare register. When an interrupt is taken, the PC is set to the interrupt vector, and the KU/IE stack in the status register is pushed two bits to the left, with KUc and IEc both cleared to 0. This starts the interrupt handler running in kernel mode with further interrupts disabled. The exccode field in the cause register is set to indicate the type of interrupt. The epc register is loaded with a restart address. The epc address can be used to restart execution after servicing the interrupt.
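The take-an-interrupt condition is a simple combinational predicate. A BSV sketch, assuming 8-bit IP and IM fields (the names are ours; priority selection among pending interrupts is not shown):

   function Bool takeInterrupt (Bit#(8) ip, Bit#(8) im, Bool iec);
      return iec && ((ip & im) != 0);
   endfunction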

A.4.3 Synchronous Exceptions

Synchronous exceptions are listed in Table A.5 in order of decreasing priority.

ExcCode   Mnemonic   Description
                     Highest Priority
6         AdEF       Address or misalignment error on fetch.
10        RI         Reserved instruction exception.
8         Sys        Syscall exception.
9         Bp         Breakpoint exception.
12        Ov         Arithmetic Overflow.
4         AdEL       Address or misalignment error on load.
5         AdES       Address or misalignment error on store.
                     Lowest Priority

Table A.5: SMIPS Synchronous Exceptions.

After a synchronous exception, the PC is set to 0x0000_1100. The stack of kernel/user and interrupt enable bits held in the status register is pushed left two bits, and both kuc and iec are set to 0. The epc register is set to point to the instruction that caused the exception. The exccode field in the cause register is set to indicate the type of exception. If the exception was a coprocessor unusable exception (CpU), the ce field in the cause register is set to the coprocessor number that caused the error. This field is undefined for other exceptions.

The overflow exception (Ov) can only occur for ADDI, ADD, and SUB instructions. If the exception was an address error on a load or store (AdEL/AdES), the badvaddr register is set to the faulting address. The value in badvaddr is undefined for other exceptions. All unimplemented and illegal instructions should cause a reserved instruction exception (RI).

A.5 Instruction Semantics and Encodings

SMIPS uses the standard MIPS instruction set.

A.5.1 Instruction Formats

There are three basic instruction formats, R-type, I-type, and J-type. These are a fixed 32 bits in length, and must be aligned on a four-byte boundary in memory. An address error exception (AdEF) is generated if the PC is misaligned on an instruction fetch.

R-Type

R-type instructions specify two source registers (rs and rt) and a destination register (rd). The 5-bit shamt field is used to specify shift immediate amounts and the 6-bit funct code is a second opcode field.

I-Type

I-type instructions specify one source register (rs) and a destination register (rt). The second source operand is a sign or zero-extended 16-bit immediate. Logical immediate operations use a zero-extended immediate, while all others use a sign-extended immediate.
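The choice of extension can be expressed as a one-line selection. A BSV sketch (the function name and the Boolean flag are ours):

   function Bit#(32) immOperand (Bool isLogicalOp, Bit#(16) imm);
      return isLogicalOp ? zeroExtend (imm) : signExtend (imm);   // ANDI/ORI/XORI zero-extend; others sign-extend
   endfunction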

J-Type

J-type instructions encode a 26-bit jump target address. This value is shifted left two bits to give a byte address then combined with the top four bits of the current program counter.
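Since the three formats share the same field positions where fields overlap, decoding is just bit slicing. A BSV sketch of field extraction (the bit positions follow the encoding summary in Section A.6; the function names are ours):

   function Bit#(6)  opcodeOf (Bit#(32) inst) = inst[31:26];
   function Bit#(5)  rsOf     (Bit#(32) inst) = inst[25:21];
   function Bit#(5)  rtOf     (Bit#(32) inst) = inst[20:16];
   function Bit#(5)  rdOf     (Bit#(32) inst) = inst[15:11];
   function Bit#(5)  shamtOf  (Bit#(32) inst) = inst[10:6];
   function Bit#(6)  functOf  (Bit#(32) inst) = inst[5:0];
   function Bit#(16) immOf    (Bit#(32) inst) = inst[15:0];    // I-type immediate
   function Bit#(26) targetOf (Bit#(32) inst) = inst[25:0];    // J-type target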

A.5.2 Instruction Categories

MIPS instructions can be grouped into several basic categories: loads and stores, computa- tion instructions, branch and jump instructions, and coprocessor instructions.

Load and Store Instructions

Load and store instructions transfer a value between the registers and memory and are encoded with the I-type format. The effective address is obtained by adding register rs to the sign-extended immediate. Loads place a value in register rt. Stores write the value in register rt to memory. The LW and SW instructions load and store 32-bit register values respectively. The LH instruction loads a 16-bit value from memory and sign extends this to 32 bits before storing into register rt. The LHU instruction zero-extends the 16-bit memory value. Similarly, LB and LBU load sign- and zero-extended 8-bit values into register rt respectively. The SH instruction writes the low-order 16 bits of register rt to memory, while SB writes the low-order 8 bits. The effective address must be naturally aligned for each data type (i.e., on a four-byte boundary for 32-bit loads/stores and a two-byte boundary for 16-bit loads/stores). If not, an address exception (AdEL/AdES) is generated. The load linked (LL) and store conditional (SC) instructions are used as primitives to implement atomic read-modify-write operations for multiprocessor synchronization. The LL instruction performs a standard load from the effective address (base+offset), but as a side effect the instruction should set a programmer-invisible link address register. If for any reason atomicity is violated, then the link address register will be cleared. When the processor executes the SC instruction, it first verifies that the link address register is still valid. If the link address register is valid, then the SC executes as a standard SW instruction except that the src register is overwritten with a one to indicate success. If the link address register is invalid, then the SC does not write memory, and it overwrites the src register with a zero to indicate failure. There are several reasons why atomicity might be violated. If the processor takes an exception after an LL instruction but before the corresponding SC instruction is executed, then the link address register will be cleared. In a multi-processor system, if a different processor uses a SC instruction to write the same location, then the link address register will also be cleared.

Computational Instructions

Computational instructions are either encoded as register-immediate operations using the I-type format or as register-register operations using the R-type format. The destination is register rt for register-immediate instructions and rd for register-register instructions. There are only eight register-immediate computational instructions.

ADDI and ADDIU add the sign-extended 16-bit immediate to register rs. The only difference between ADDI and ADDIU is that ADDI generates an arithmetic overflow exception if the signed result would overflow 32 bits. SLTI (set less than immediate) places a 1 in register rt if register rs is strictly less than the sign-extended immediate when both are treated as signed 32-bit numbers, else a 0 is written to rt. SLTIU is similar but compares the values as unsigned 32-bit numbers. [ NOTE: Both ADDIU and SLTIU sign-extend the immediate, even though they operate on unsigned numbers. ] ANDI, ORI, XORI are logical operations that perform bit-wise AND, OR, and XOR on register rs and the zero-extended 16-bit immediate and place the result in rt. LUI (load upper immediate) is used to build 32-bit immediates. It shifts the 16-bit immediate into the high-order 16 bits, shifting in 16 zeros in the low-order bits, then places the result in register rt. The rs field must be zero. [ NOTE: Shifts by immediate values are encoded in the R-type format using the shamt field. ] Arithmetic R-type operations are encoded with a zero value (SPECIAL) in the major opcode. All operations read the rs and rt registers as source operands and write the result into register rd. The 6-bit funct field selects the operation type from ADD, ADDU, SUB, SUBU, SLT, SLTU, AND, OR, XOR, and NOR.

ADD and SUB perform add and subtract respectively, but signal an arithmetic overflow if the result would overflow the signed 32-bit destination. ADDU and SUBU are identical to ADD/SUB except no trap is created on overflow. SLT and SLTU perform signed and unsigned compares respectively, writing 1 to rd if rs < rt, 0 otherwise. AND, OR, XOR, and NOR perform bitwise logical operations. [ NOTE: NOR rd, rx, rx performs a logical inversion (NOT) of register rx. ] Shift instructions are also encoded using R-type instructions with the SPECIAL major opcode. The operand that is shifted is always register rt.

Shifts by constant values (SLL/SRL/SRA) have the shift amount encoded in the shamt field. Shifts by variable values (SLLV/SRLV/SRAV) take the shift amount from the bottom five bits of register rs. SLL/SLLV are logical left shifts, with zeros shifted into the least significant bits. SRL/SRLV are logical right shifts with zeros shifted into the most significant bits. SRA/SRAV are arithmetic right shifts which shift in copies of the original sign bit into the most significant bits.

Multiply and divide instructions target the hi and lo registers and are encoded as R-type instructions under the SPECIAL major opcode. These instructions are fully interlocked in hardware. Multiply instructions take two 32-bit operands in registers rs and rt and store their 64-bit product in registers hi and lo. MULT performs a signed multiplication while MULTU performs an unsigned multiplication. DIV and DIVU perform signed and unsigned divides of register rs by register rt placing the quotient in lo and the remainder in hi. Divides by zero do not cause a trap. A software check can be inserted if required.

The values calculated by a multiply or divide instruction are retrieved from the hi and lo registers using the MFHI (move from hi) and MFLO (move from lo) instructions, which write register rd. MTHI (move to hi) and MTLO (move to lo) instructions are also provided to allow the multiply registers to be written with the value in register rs (these instructions are used to restore user state after a context swap).

Jump and Branch Instructions

Jumps and branches can change the control flow of a program. Unlike the MIPS32 ISA, the SMIPS ISA does not have a programmer-visible branch delay slot. Absolute jump (J) and jump and link (JAL) instructions use the J-type format. The 26-bit jump target is shifted left two bits and concatenated with the high-order four bits of PC+4 (the address of the instruction following the jump) to form the jump target address (using Verilog notation, the target address is {pc_plus4[31:28],target[25:0],2'b0}). JAL stores the address of the instruction following the jump (PC+4) into register r31.
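The same computation, written as a BSV function (the name is ours):

   function Bit#(32) jumpTarget (Bit#(32) pcPlus4, Bit#(26) target);
      return { pcPlus4[31:28], target, 2'b0 };
   endfunction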

The indirect jump instructions, JR (jump register) and JALR (jump and link register), use the R-type encoding under a SPECIAL major opcode and jump to the address contained in register rs. JALR writes the link address into register rd.

All branch instructions use the I-type encoding. The 16-bit immediate is sign-extended, shifted left two bits, then added to PC+4 (the address of the instruction following the branch) to give the branch target address.
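Again as a small BSV sketch (the name is ours):

   function Bit#(32) branchTarget (Bit#(32) pcPlus4, Bit#(16) offset);
      Bit#(32) disp = signExtend (offset) << 2;   // sign-extend, then scale to a byte displacement
      return pcPlus4 + disp;
   endfunction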

BEQ and BNE compare two registers and take the branch if they are equal or unequal respectively. BLEZ and BGTZ compare one register against zero, and branch if it is less than or equal to zero, or greater than zero, respectively. BLTZ and BGEZ examine the sign bit of register rs and branch if it is negative or non-negative respectively.

System Coprocessor Instructions

The MTC0 and MFC0 instructions access the control registers in coprocessor 0, transferring a value from/to the coprocessor register specified in the rd field to/from the CPU register specified in the rt field. It is important to note that the coprocessor register is always in the rd field and the CPU register is always in the rt field, regardless of which register is the source and which is the destination.

The restore from exception instruction, ERET, returns to the interrupted instruction at the completion of interrupt or exception processing. An ERET instruction should pop the top value of the interrupt and kernel/user status register stack, restoring the previous values.

Coprocessor 2 Instructions

Coprocessor 2 is reserved for an implementation defined hardware unit. The MTC2 and MFC2 instructions access the registers in coprocessor 2, transferring a value from/to the coprocessor register specified in the rd field to/from the CPU register specified in the rt field. The CTC2 and CFC2 instructions serve a similar purpose. The Coprocessor 2 implementation is free to handle the coprocessor register specifiers in MTC2/MFC2 and CTC2/CFC2 in any way it wishes.

The LWC2 and SWC2 instructions transfer values between memory and the coprocessor registers. Note that although cop2dest and cop2src fields are coprocessor register specifiers, the ISA does not define how these correspond to the coprocessor register specifiers in other Coprocessor 2 instructions.

The COP2 instruction is the primary mechanism by which a programmer can specify instruction bits to control Coprocessor 2. The 25-bit copfunc field is completely implementation dependent.

Special Instructions

The SYSCALL and BREAK instructions are useful for implementing operating systems and debuggers. The SYNC instruction can be necessary to guarantee strong load/store ordering in modern multi-processors with sophisticated memory systems. The SYSCALL and BREAK instructions cause an immediate syscall or break exception. To access the code field for use as a software parameter, the exception handler must load the memory location containing the syscall instruction.

The SYNC instruction is used to order loads and stores. All loads and stores before the SYNC instruction must be visible to all other processors in the system before any of the loads and stores following the SYNC instruction are visible.

A.5.3 SMIPS Variants

The SMIPS specification defines three SMIPS subsets: SMIPSv1, SMIPSv2, and SMIPSv3. SMIPSv1 includes the following five instructions: ADDIU, BNE, LW, SW, and MTC0. The tohost register is the only implemented system coprocessor register. SMIPSv2 includes all of the simple arithmetic instructions except for those which throw overflow exceptions. It does not include multiply or divide instructions. SMIPSv2 only supports word loads and stores. All jumps and branches are supported. Neither SMIPSv1 nor SMIPSv2 supports exceptions, interrupts, or most of the system coprocessor. SMIPSv3 is the full SMIPS ISA and includes everything described in this document except for Coprocessor 2 instructions. Table A.7 notes which instructions are supported in each variant.

A.5.4 Unimplemented Instructions

Several instructions in the MIPS32 instruction set are not supported by SMIPS. These instructions should cause a reserved instruction exception (RI) and can be emulated in software by an exception handler. The misaligned load/store instructions, Load Word Left (LWL), Load Word Right (LWR), Store Word Left (SWL), and Store Word Right (SWR), are not implemented. A trap handler can emulate the misaligned access. Compilers for SMIPS should avoid generating these instructions, and should instead generate code to perform the misaligned access using multiple aligned accesses. The MIPS32 trap instructions, TGE, TGEU, TLT, TLTU, TEQ, TNE, TGEI, TGEIU, TLTI, TLTIU, TEQI, TNEI, are not implemented. The illegal instruction trap handler can perform the comparison and, if the condition is met, jump to the appropriate exception routine; otherwise it resumes user mode execution after the trap instruction. Alternatively, these instructions may be synthesized by the assembler, or simply avoided by the compiler. The floating point coprocessor (COP1) is not supported. All MIPS32 coprocessor 1 instructions are trapped to allow emulation of floating-point. For higher performance, compilers for SMIPS could directly generate calls to software floating point code libraries rather than emit coprocessor instructions that will cause traps, though this will require modifying the standard MIPS calling convention. Branch likely and branch and link instructions are not implemented and cannot be emulated, so they should be avoided by compilers for SMIPS.

A.6 Instruction listing for SMIPS

 31      26 25    21 20    16 15    11 10     6 5       0
+----------+--------+--------+--------+--------+---------+
|  opcode  |   rs   |   rt   |   rd   | shamt  |  funct  |   R-type
+----------+--------+--------+--------+--------+---------+
|  opcode  |   rs   |   rt   |          immediate        |   I-type
+----------+--------+--------+---------------------------+
|  opcode  |                  target                     |   J-type
+----------+---------------------------------------------+

Load and Store Instructions
v3   100000   base   dest   signed offset       LB    rt, offset(rs)
v3   100001   base   dest   signed offset       LH    rt, offset(rs)
v1   100011   base   dest   signed offset       LW    rt, offset(rs)
v3   100100   base   dest   signed offset       LBU   rt, offset(rs)
v3   100101   base   dest   signed offset       LHU   rt, offset(rs)
v3   110000   base   dest   signed offset       LL    rt, offset(rs)
v3   101000   base   src    signed offset       SB    rt, offset(rs)
v3   101001   base   src    signed offset       SH    rt, offset(rs)
v1   101011   base   src    signed offset       SW    rt, offset(rs)
v3   111000   base   src    signed offset       SC    rt, offset(rs)

I-Type Computational Instructions
v3   001000   src     dest   signed immediate      ADDI  rt, rs, signed-imm.
v1   001001   src     dest   signed immediate      ADDIU rt, rs, signed-imm.
v2   001010   src     dest   signed immediate      SLTI  rt, rs, signed-imm.
v2   001011   src     dest   signed immediate      SLTIU rt, rs, signed-imm.
v2   001100   src     dest   zero-ext. immediate   ANDI  rt, rs, zero-ext-imm.
v2   001101   src     dest   zero-ext. immediate   ORI   rt, rs, zero-ext-imm.
v2   001110   src     dest   zero-ext. immediate   XORI  rt, rs, zero-ext-imm.
v2   001111   00000   dest   zero-ext. immediate   LUI   rt, zero-ext-imm.

R-Type Computational Instructions
v2   000000   00000    src    dest   shamt   000000   SLL   rd, rt, shamt
v2   000000   00000    src    dest   shamt   000010   SRL   rd, rt, shamt
v2   000000   00000    src    dest   shamt   000011   SRA   rd, rt, shamt
v2   000000   rshamt   src    dest   00000   000100   SLLV  rd, rt, rs
v2   000000   rshamt   src    dest   00000   000110   SRLV  rd, rt, rs
v2   000000   rshamt   src    dest   00000   000111   SRAV  rd, rt, rs
v3   000000   src1     src2   dest   00000   100000   ADD   rd, rs, rt
v2   000000   src1     src2   dest   00000   100001   ADDU  rd, rs, rt
v3   000000   src1     src2   dest   00000   100010   SUB   rd, rs, rt
v2   000000   src1     src2   dest   00000   100011   SUBU  rd, rs, rt
v2   000000   src1     src2   dest   00000   100100   AND   rd, rs, rt
v2   000000   src1     src2   dest   00000   100101   OR    rd, rs, rt
v2   000000   src1     src2   dest   00000   100110   XOR   rd, rs, rt
v2   000000   src1     src2   dest   00000   100111   NOR   rd, rs, rt
v2   000000   src1     src2   dest   00000   101010   SLT   rd, rs, rt
v2   000000   src1     src2   dest   00000   101011   SLTU  rd, rs, rt

Multiply/Divide Instructions
v3   000000   00000   00000   dest    00000   010000   MFHI  rd
v3   000000   rs      00000   00000   00000   010001   MTHI  rs
v3   000000   00000   00000   dest    00000   010010   MFLO  rd
v3   000000   rs      00000   00000   00000   010011   MTLO  rs
v3   000000   src1    src2    00000   00000   011000   MULT  rs, rt
v3   000000   src1    src2    00000   00000   011001   MULTU rs, rt
v3   000000   src1    src2    00000   00000   011010   DIV   rs, rt
v3   000000   src1    src2    00000   00000   011011   DIVU  rs, rt

Jump and Branch Instructions
v2   000010   target                                     J     target
v2   000011   target                                     JAL   target
v2   000000   src     00000   00000   00000   001000     JR    rs
v2   000000   src     00000   dest    00000   001001     JALR  rd, rs
v2   000100   src1    src2    signed offset              BEQ   rs, rt, offset
v2   000101   src1    src2    signed offset              BNE   rs, rt, offset
v2   000110   src     00000   signed offset              BLEZ  rs, offset
v2   000111   src     00000   signed offset              BGTZ  rs, offset
v2   000001   src     00000   signed offset              BLTZ  rs, offset
v2   000001   src     00001   signed offset              BGEZ  rs, offset

System Coprocessor (COP0) Instructions
v2   010000   00000   dest    cop0src    00000   000000   MFC0  rt, rd
v2   010000   00100   src     cop0dest   00000   000000   MTC0  rt, rd
v3   010000   10000   00000   00000      00000   011000   ERET

Coprocessor 2 (COP2) Instructions
–    010010   00000   dest    cop2src    00000   000000   MFC2  rt, rd
–    010010   00100   src     cop2dest   00000   000000   MTC2  rt, rd
–    010010   00010   dest    cop2src    00000   000000   CFC2  rt, rd
–    010010   00110   src     cop2dest   00000   000000   CTC2  rt, rd
–    110010   base    cop2dest   signed offset            LWC2  rt, offset(rs)
–    111010   base    cop2src    signed offset            SWC2  rt, offset(rs)
–    010010   1       copfunc                             COP2  copfunc

Special Instructions
v3   000000   00000   00000   00000   00000   001100   SYSCALL
v3   000000   00000   00000   00000   00000   001101   BREAK
v3   000000   00000   00000   00000   00000   001111   SYNC

Bibliography

[1] Bluespec, Inc. Bluespec™ SystemVerilog Version Reference Guide, 2010.

[2] Bluespec, Inc. Bluespec™ SystemVerilog Version User Guide, 2010.

[3] J. L. Hennessy and D. A. Patterson. Computer Organization and Design: The Hardware/Software Interface (Second Edition). Morgan Kaufmann, February 1997. ISBN 1558604286.

[4] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach (Fifth Edition). Morgan Kaufmann, September 2011.

[5] MIPS Technologies, Inc. MIPS32 Architecture for Programmers. 2002. http://www.mips.com/publications/processor architecture.html.

[6] G. Kane and J. Heinrich. MIPS RISC Architecture (2nd edition). Prentice Hall, September 1991. ISBN 0135904722.

[7] R. S. Nikhil and K. R. Czeck. BSV by Example. CreateSpace, December 2010. ISBN-10: 1456418467; ISBN-13: 978-1456418465; available from Amazon.com.

[8] D. Sweetman. See MIPS Run. Morgan Kaufmann, April 1999. ISBN 1558604103.
