
ECE 4750, Fall 2021: T02 Fundamental Processor Microarchitecture

School of Electrical and Computer Engineering, Cornell University

revision: 2021-09-12-02-56

1  Processor Microarchitectural Design Patterns
   1.1. Transactions and Steps
   1.2. Microarchitecture: Control/Datapath Split

2  TinyRV1 Single-Cycle Processor
   2.1. High-Level Idea for Single-Cycle Processors
   2.2. Single-Cycle Processor Datapath
   2.3. Single-Cycle Processor Control Unit
   2.4. Analyzing Performance

3  TinyRV1 FSM Processor
   3.1. High-Level Idea for FSM Processors
   3.2. FSM Processor Datapath
   3.3. FSM Processor Control Unit
   3.4. Analyzing Performance

4  TinyRV1 Pipelined Processor
   4.1. High-Level Idea for Pipelined Processors
   4.2. Pipelined Processor Datapath and Control Unit

5  Pipeline Hazards: RAW Data Hazards
   5.1. Expose in Instruction Set Architecture
   5.2. Hardware Stalling
   5.3. Hardware Bypassing/Forwarding
   5.4. RAW Data Hazards Through Memory

6  Pipeline Hazards: Control Hazards
   6.1. Expose in Instruction Set Architecture
   6.2. Hardware Speculation
   6.3. Interrupts and Exceptions

7  Pipeline Hazards: Structural Hazards
   7.1. Expose in Instruction Set Architecture
   7.2. Hardware Stalling
   7.3. Hardware Duplication

8  Pipeline Hazards: WAW and WAR Name Hazards
   8.1. Software Renaming
   8.2. Hardware Stalling

9  Summary of Processor Performance

10 Case Study: Transition from CISC to RISC
   10.1. Example CISC: IBM 360/M30
   10.2. Example RISC: MIPS R2K

Copyright © 2016 Christopher Batten. All rights reserved. This handout was originally prepared by Prof. Christopher Batten at Cornell University for ECE 4750 / CS 4420. It has been updated in 2017-2021 by Prof. Christina Delimitrou. Download and use of this handout is permitted for individual educational non-commercial purposes only. Redistribution either in part or in whole via both commercial or non-commercial means requires written permission.


1. Processor Microarchitectural Design Patterns

Time/Program = (Instructions/Program) × (Avg Cycles/Instruction) × (Time/Cycle)

• Instructions / program depends on source code, compiler, ISA
• Avg cycles / instruction (CPI) depends on ISA, microarchitecture
• Time / cycle depends upon microarchitecture and implementation

Microarchitecture          CPI    Cycle Time
Single-Cycle Processor      1     long
FSM Processor              >1     short
Pipelined Processor        ≈1     short

1.1. Transactions and Steps

• We can think of each instruction as a transaction
• Executing a transaction involves a sequence of steps

                      add  addi  mul  lw   sw   jal  jr   bne
Fetch Instruction      ✓    ✓     ✓    ✓    ✓    ✓    ✓    ✓
Decode Instruction     ✓    ✓     ✓    ✓    ✓    ✓    ✓    ✓
Read Registers         ✓    ✓     ✓    ✓    ✓         ✓    ✓
Register Arithmetic    ✓    ✓     ✓    ✓    ✓              ✓
Read Memory                             ✓
Write Memory                                 ✓
Write Registers        ✓    ✓     ✓    ✓         ✓
Update PC              ✓    ✓     ✓    ✓    ✓    ✓    ✓    ✓


1.2. Microarchitecture: Control/Datapath Split

[Figure: microarchitecture split into a control unit and a datapath; the control unit drives control signals into the datapath and receives status signals back, and the datapath connects to memory through imem and dmem request/response ports]


2. TinyRV1 Single-Cycle Processor

Time/Program = (Instructions/Program) × (Avg Cycles/Instruction) × (Time/Cycle)

• Instructions / program depends on source code, compiler, ISA
• Avg cycles / instruction (CPI) depends on ISA, microarchitecture
• Time / cycle depends upon microarchitecture and implementation

Microarchitecture          CPI    Cycle Time
Single-Cycle Processor      1     long
FSM Processor              >1     short
Pipelined Processor        ≈1     short

Technology Constraints

• Assume technology where logic is not too expensive, so we do not need to overly minimize the number of registers
• Assume a multi-ported register file with a reasonable number of ports is feasible
• Assume a dual-ported combinational memory (responds combinationally in less than one cycle)

[Figure: control unit / datapath / regfile organization with separate imem and dmem request/response ports to the combinational memory]


2.1. High-Level Idea for Single-Cycle Processors

                      add  addi  mul  lw   sw   jal  jr   bne
Fetch Instruction      ✓    ✓     ✓    ✓    ✓    ✓    ✓    ✓
Decode Instruction     ✓    ✓     ✓    ✓    ✓    ✓    ✓    ✓
Read Registers         ✓    ✓     ✓    ✓    ✓         ✓    ✓
Register Arithmetic    ✓    ✓     ✓    ✓    ✓              ✓
Read Memory                             ✓
Write Memory                                 ✓
Write Registers        ✓    ✓     ✓    ✓         ✓
Update PC              ✓    ✓     ✓    ✓    ✓    ✓    ✓    ✓

[Figure: single-cycle execution timeline; each instruction (e.g., lw, add, jal) performs all of its steps (fetch, decode, register read, arithmetic, memory access, register write, update PC) within one long clock cycle]


2.2. Single-Cycle Processor Datapath

ADD (add rd, rs1, rs2)
  inst[31:25] = 0000000, inst[24:20] = rs2, inst[19:15] = rs1, inst[14:12] = 000, inst[11:7] = rd, inst[6:0] = 0110011
  R[rd] ← R[rs1] + R[rs2]
  PC ← PC + 4

[Datapath sketch: pc, regfile (read), regfile (write)]

ADDI (addi rd, rs1, imm)
  inst[31:20] = imm, inst[19:15] = rs1, inst[14:12] = 000, inst[11:7] = rd, inst[6:0] = 0010011
  R[rd] ← R[rs1] + sext(imm)
  PC ← PC + 4

[Datapath sketch: pc, regfile (read), regfile (write)]


Implementing ADD and ADDI Instructions

[Datapath diagram for ADD/ADDI: pc register with +4 unit, imemreq.addr/imemresp.data, regfile read ports addressed by ir[19:15] and ir[24:20], immgen from ir[31:7], op2_sel mux choosing between the rs2 read port and the immediate, alu, and regfile write port addressed by ir[11:7]; instruction fields also go to the control unit]

MUL (mul rd, rs1, rs2)
  inst[31:25] = 0000001, inst[24:20] = rs2, inst[19:15] = rs1, inst[14:12] = 000, inst[11:7] = rd, inst[6:0] = 0110011
  R[rd] ← R[rs1] × R[rs2]
  PC ← PC + 4

[Datapath diagram: adds a mul unit alongside the alu and a wb_sel mux to choose the writeback value]


LW (lw rd, imm(rs1))
  inst[31:20] = imm, inst[19:15] = rs1, inst[14:12] = 010, inst[11:7] = rd, inst[6:0] = 0000011
  R[rd] ← M[ R[rs1] + sext(imm) ]
  PC ← PC + 4

[Datapath diagram: adds the data memory interface; the alu output drives dmemreq.addr and dmemresp.data feeds the wb_sel mux]

SW (sw rs2, imm(rs1))
  inst[31:25] = imm, inst[24:20] = rs2, inst[19:15] = rs1, inst[14:12] = 010, inst[11:7] = imm, inst[6:0] = 0100011
  imm = { inst[31:25], inst[11:7] }
  M[ R[rs1] + sext(imm) ] ← R[rs2]
  PC ← PC + 4

[Datapath diagram: adds dmemreq.data driven from the rs2 read port, plus the rf_wen and imm_type control signals]


JAL (jal rd, imm)
  inst[31:12] = imm, inst[11:7] = rd, inst[6:0] = 1101111
  imm = { inst[31], inst[19:12], inst[20], inst[30:21], 0 }
  R[rd] ← PC + 4
  PC ← PC + sext(imm)

[Datapath diagram: adds a jalbr_targ adder (PC + sext(imm)), a pc_sel mux, the alu_func control signal, and a path writing pc_plus4 to the register file]

JR (jr rs1)
  inst[31:20] = 000000000000, inst[19:15] = rs1, inst[14:12] = 000, inst[11:7] = 00000, inst[6:0] = 1100111
  PC ← R[rs1]

[Datapath diagram: adds a jr_targ path from the rs1 read port to the pc_sel mux]


BNE (bne rs1, rs2, imm)
  inst[31:25] = imm, inst[24:20] = rs2, inst[19:15] = rs1, inst[14:12] = 001, inst[11:7] = imm, inst[6:0] = 1100011
  imm = { inst[31], inst[7], inst[30:25], inst[11:8], 0 }
  if ( R[rs1] != R[rs2] ) PC ← PC + sext(imm)
  else PC ← PC + 4

[Datapath diagram: adds an eq status signal from the alu back to the control unit, which factors into pc_sel]


Adding a New Auto-Incrementing Load Instruction

Draw on the datapath diagram what paths we need to use as well as any new paths we will need to add in order to implement the following auto-incrementing load instruction.

LW.AI (lw.ai rd, imm(rs1))
  inst[31:20] = imm, inst[19:15] = rs1, inst[14:12] = 000, inst[11:7] = rd, inst[6:0] = 0001011
  R[rd] ← M[ R[rs1] + sext(imm) ]
  R[rs1] ← R[rs1] + 4
  PC ← PC + 4

[Full single-cycle datapath diagram repeated for marking up the lw.ai paths]


2.3. Single-Cycle Processor Control Unit

inst   pc_sel  imm_type  op2_sel  alu_func  wb_sel  rf_wen  imemreq.val  dmemreq.val

add    pc+4    –         rf       +         alu     1       1            0
addi
mul    pc+4    –         –        –         mul     1       1            0
lw     pc+4    i         imm      +         mem     1       1            1
sw
jal
jr     jr      i         –        –         –       0       1            0
bne

Need to factor eq status signal into pc_sel signal for BNE!

2.4. Analyzing Performance

Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)

• Instructions / program depends on source code, compiler, ISA
• Cycles / instruction (CPI) depends on ISA, microarchitecture
• Time / cycle depends upon microarchitecture and implementation


Estimating cycle time

There are many paths through the design that start at a state element and end at a state element. The "critical path" is the longest of all of these paths. We can usually use a simple first-order static timing analysis to estimate the cycle time (i.e., the clock period and thus also the clock frequency); a worked path is sketched after the component delays below.

[Full single-cycle TinyRV1 datapath diagram repeated for estimating the cycle time]

• register read = 1τ
• register write = 1τ
• regfile read = 10τ
• regfile write = 10τ
• memory read = 20τ
• memory write = 20τ
• +4 unit = 4τ
• immgen = 2τ
• mux = 3τ
• multiplier = 20τ
• alu = 10τ
• adder = 8τ
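As a rough sketch of how these delays combine (assuming the longest path in the final single-cycle datapath runs through a load): pc register read (1τ) + instruction memory read (20τ) + regfile read (10τ) + alu (10τ) + data memory read (20τ) + writeback mux (3τ) + regfile write (10τ) ≈ 74τ. The exact path depends on how the final datapath is wired; the first-order estimate is simply the maximum such sum over all register-to-register paths.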


Estimating execution time

Using our first-order equation for processor performance, how long in nanoseconds will it take to execute the vector-vector add example assuming n is 64?

loop: lw   x5, 0(x13)
      lw   x6, 0(x14)
      add  x7, x5, x6
      sw   x7, 0(x12)
      addi x13, x12, 4
      addi x14, x14, 4
      addi x12, x12, 4
      addi x15, x15, -1
      bne  x15, x0, loop
      jr   x1

Using our first-order equation for processor performance, how long in nanoseconds will it take to execute the mystery program assuming n is 64 and that we find a match on the last element?

      addi x5, x0, 0
loop: lw   x6, 0(x12)
      bne  x6, x14, foo
      addi x10, x5, 0
      jr   x1
foo:  addi x12, x12, 4
      addi x5, x5, 1
      bne  x5, x13, loop
      addi x10, x0, -1
      jr   x1
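A sketch of the setup for the first (vector-vector add) question: the loop body is 9 instructions, so n = 64 iterations execute roughly 9 × 64 = 576 instructions. With CPI = 1 and a cycle time of about 74τ (from the critical-path estimate above), the execution time is roughly 576 × 1 × 74τ ≈ 43 kτ; converting to nanoseconds requires the technology's value of τ.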


3. TinyRV1 FSM Processor

Time/Program = (Instructions/Program) × (Avg Cycles/Instruction) × (Time/Cycle)

• Instructions / program depends on source code, compiler, ISA
• Avg cycles / instruction (CPI) depends on ISA, microarchitecture
• Time / cycle depends upon microarchitecture and implementation

Microarchitecture          CPI    Cycle Time
Single-Cycle Processor      1     long
FSM Processor              >1     short
Pipelined Processor        ≈1     short

Technology Constraints

• Assume legacy technology where logic is expensive, so we want to minimize the number of registers and combinational logic
• Assume an (unrealistic) combinational memory (responds combinationally in less than one cycle)
• Assume multi-ported register files and memories are too expensive; these structures can only have a single read/write port

[Figure: control unit / datapath / regfile organization with a single mem request/response port to the combinational memory]


3.1. High-Level Idea for FSM Processors

                      add  addi  mul  lw   sw   jal  jr   bne
Fetch Instruction      ✓    ✓     ✓    ✓    ✓    ✓    ✓    ✓
Decode Instruction     ✓    ✓     ✓    ✓    ✓    ✓    ✓    ✓
Read Registers         ✓    ✓     ✓    ✓    ✓         ✓    ✓
Register Arithmetic    ✓    ✓     ✓    ✓    ✓              ✓
Read Memory                             ✓
Write Memory                                 ✓
Write Registers        ✓    ✓     ✓    ✓         ✓
Update PC              ✓    ✓     ✓    ✓    ✓    ✓    ✓    ✓

[Figure: execution timelines comparing the single-cycle processor (all steps of lw, add, and jal within one long cycle each) with the FSM processor (each instruction broken into a sequence of short cycles, one step per cycle)]

3.2. FSM Processor Datapath

Implementing an FSM datapath requires thinking about the required FSM states, but we will defer discussion of how to implement the control logic to the next section.


Implementing Fetch Sequence

[Datapath diagram for the fetch sequence: PC, IR, and A registers with pc_en, ir_en, and a_en enables; a +4 unit; tri-state bus drivers pc_bus_en, alu_bus_en, and rd_bus_en; memreq.addr driven from the bus and memresp.data returned as RD]

F0 (pseudo-control-signal syntax)

F1

F2


Implementing ADD Instruction

[Datapath diagram for ADD: adds the B register (b_en), the register file with rf_addr_sel and rf_wen, an rf_bus_en driver, and alu functions A + 4 and A + B]

add rd, rs1, rs2 (pseudo-control-signal syntax)

F0

F1

F2

A0

A1

A2


Full Datapath for TinyRV1 FSM Processor

[Full TinyRV1 FSM datapath: PC, IR, A, B, C, and WD registers around a shared bus; b_sel and c_sel muxes with 1-bit shifters (for the iterative multiplier); register file with rf_addr_sel (rs1, rs2, rd, x0) and rf_wen; immgen with imm_type i, s, b, j; alu functions +4 (A + 4), + (A + B), +? (conditional add for mul), cmp (A == B), and −4 (A − 4); an eq status signal; memreq.addr and memreq.data driven from the bus and WD, with memresp.data returned as RD]

ADDI Pseudo-Control-Signal Fragment

addi rd, rs1, imm

[FSM state transition diagram: fetch states F0, F1, F2 dispatch to per-instruction state sequences A0–A2 (add), AI0–AI2 (addi), M0–M35 (mul), L0–L3 (lw), S0–S3 (sw), JA0–JA2 (jal), JR0 (jr), and B0–B5 (bne), each returning to F0]


MUL Instruction (mul rd, rs1, rs2)
  M0:  A ← RF[x0]
  M1:  B ← RF[rs1]
  M2:  C ← RF[rs2]
  M3:  A ← A +? B; B ← B << 1; C ← C >> 1
  M4:  A ← A +? B; B ← B << 1; C ← C >> 1
  ...
  M35: RF[rd] ← A +? B; goto F0

JAL Instruction (jal rd, imm)
  JA0: RF[rd] ← PC
  JA1: B ← sext(imm_j)
  JA2: PC ← A + B; goto F0

JR Instruction (jr rs1)
  JR0: PC ← RF[rs1]; goto F0

BNE Instruction (bne rs1, rs2, imm)
  B0: A ← RF[rs1]
  B1: B ← RF[rs2]
  B2: B ← sext(imm_b); if A == B goto F0
  B3: A ← PC
  B4: A ← A − 4
  B5: PC ← A + B; goto F0

LW Instruction (lw rd, imm(rs1))
  L0: A ← RF[rs1]
  L1: B ← sext(imm_i)
  L2: memreq.addr ← A + B
  L3: RF[rd] ← RD; goto F0

SW Instruction (sw rs2, imm(rs1))
  S0: WD ← RF[rs2]
  S1: A ← RF[rs1]
  S2: B ← sext(imm_s)
  S3: memreq.addr ← A + B; goto F0


Adding a Complex Instruction

FSM processors simplify adding complex instructions. New instructions usually do not require datapath modifications, only additional states.

add.mm rd, rs1, rs2
  M[ R[rd] ] ← M[ R[rs1] ] + M[ R[rs2] ]
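As an illustration only (not the reference solution), here is one possible state sequence in the same pseudo-control-signal syntax. It assumes the memory request address can be driven straight from the bus (as in fetch state F0), that the memory read data RD can be driven back onto the bus (as in L3), and that WD supplies the store data (as in the SW sequence):

add.mm rd, rs1, rs2
  AM0: memreq.addr ← RF[rs1]            (read M[ R[rs1] ])
  AM1: A ← RD
  AM2: memreq.addr ← RF[rs2]            (read M[ R[rs2] ])
  AM3: B ← RD
  AM4: WD ← A + B
  AM5: memreq.addr ← RF[rd]; goto F0    (write WD to M[ R[rd] ])

Under these assumptions no datapath changes are needed; the new instruction only adds states.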

[Full TinyRV1 FSM datapath diagram repeated for marking up the add.mm state sequence]


Adding a New Auto-Incrementing Load Instruction

Implement the following auto-incrementing load instruction using pseudo-control-signal syntax. Modify the datapath if necessary.

lw.ai rd, imm(rs1)
  R[rd] ← M[ R[rs1] + sext(imm_i) ]
  R[rs1] ← R[rs1] + 4

[Full TinyRV1 FSM datapath diagram repeated for marking up the lw.ai state sequence]


3.3. FSM Processor Control Unit

We will study three techniques for implementing FSM control units:

• Hardwired control units are high-performance, but inflexible
• Horizontal µcoding increases flexibility, but requires a large control store
• Vertical µcoding is an intermediate design point

[FSM state transition diagram repeated: fetch states F0–F2 dispatching to the per-instruction state sequences]

Hardwired FSM

[Figure: hardwired FSM control unit; a state register feeds control signal logic (producing the 22 control signals) and state transition logic (consuming the 1 status signal, eq)]


Control signal output table for hardwired control unit

[Full TinyRV1 FSM datapath diagram repeated for reference]

F0: memreq.addr ← PC; A ← PC
F1: IR ← RD
F2: PC ← A + 4; goto inst

A0: A ← RF[rs1]
A1: B ← RF[rs2]
A2: RF[rd] ← A + B; goto F0

        Bus Enables       Register Enables     Mux        Func   RF        MReq
state   pc ig alu rf rd   pc ir a  b  c  wd    b  c  ig   alu    sel wen   val op

F0      1  0  0   0  0    0  0  1  0  0  0     –  –  –    –      –   0     1   r
F1      0  0  0   0  1    0  1  0  0  0  0     –  –  –    –      –   0     0   –
F2      0  0  1   0  0    1  0  0  0  0  0     –  –  –    +4     –   0     0   –
A0
A1
A2


Vertically Microcoded FSM

[Figure: vertically microcoded FSM control unit; a µPC indexes the control store, an opcode decoder supports dispatch, and each µinstruction supplies the 22 control signals (bus enables, mux selects, etc.) plus a next-state field; the eq status signal is the single status input]

Next State Encoding:
  n : increment µPC by one
  d : dispatch based on opcode
  f : goto state F0
  b : goto state F0 if A == B

• Use memory array (called the control store) instead of random logic to encode both the control signal logic and the state transition logic
• Enables a more systematic approach to implementing complex multi-cycle instructions
• Microcoding can produce good performance if accessing the control store is much faster than accessing main memory
• Read-only control stores might be replaceable, enabling in-field updates, while read-write control stores can simplify diagnostics and patches


Control signal store for microcoded control unit

[Full TinyRV1 FSM datapath diagram repeated for reference]

B0: A ← RF[rs1]
B1: B ← RF[rs2]
B2: B ← sext(imm_b); if A == B goto F0
B3: A ← PC
B4: A ← A − 4
B5: PC ← A + B; goto F0

        Bus Enables       Register Enables     Mux        Func   RF        MReq
state   pc ig alu rf rd   pc ir a  b  c  wd    b  c  ig   alu    sel wen   val op   next

B0      0  0  0   1  0    0  0  1  0  0  0     –  –  –    –      rs1 0     0   –    n
B1      0  0  0   1  0    0  0  0  1  0  0     b  –  –    –      rs2 0     0   –    n
B2      0  1  0   0  0    0  0  0  1  0  0     b  –  b    cmp    –   0     0   –    b
B3      1  0  0   0  0    0  0  1  0  0  0     –  –  –    –      –   0     0   –    n
B4      0  0  1   0  0    0  0  1  0  0  0     –  –  –    −4     –   0     0   –    n
B5      0  0  1   0  0    1  0  0  0  0  0     –  –  –    +      –   0     0   –    f


3.4. Analyzing Performance

Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)

Estimating cycle time

[Full TinyRV1 FSM datapath diagram repeated for estimating the cycle time]

• register read/write = 1τ
• regfile read/write = 10τ
• mem read/write = 20τ
• immgen = 2τ
• mux = 3τ
• alu = 10τ
• 1b shifter = 1τ
• tri-state buf = 1τ


Estimating execution time

Using our first-order equation for processor performance, how long in units of τ will it take to execute the vector-vector add example assuming n is 64?

loop: lw   x5, 0(x13)
      lw   x6, 0(x14)
      add  x7, x5, x6
      sw   x7, 0(x12)
      addi x13, x12, 4
      addi x14, x14, 4
      addi x12, x12, 4
      addi x15, x15, -1
      bne  x15, x0, loop
      jr   x1

Using our first-order equation for processor performance, how long in units of τ will it take to execute the mystery program assuming n is 64 and that we find a match on the last element?

      addi x5, x0, 0
loop: lw   x6, 0(x12)
      bne  x6, x14, foo
      addi x10, x5, 0
      jr   x1
foo:  addi x12, x12, 4
      addi x5, x5, 1
      bne  x5, x13, loop
      addi x10, x0, -1
      jr   x1
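A sketch of the setup for the first (vector-vector add) question: counting states from the sequences above, lw and sw each take 3 fetch cycles plus 4 instruction cycles (7 cycles), add and addi each take 6 cycles, and a taken bne takes 9 cycles. One loop iteration is therefore about 7 + 7 + 6 + 7 + 4 × 6 + 9 = 60 cycles, giving CPI ≈ 60 / 9 ≈ 6.7; over 64 iterations that is roughly 3840 cycles, and at a cycle time of roughly 40τ (see the summary in Section 9) the execution time is roughly 154 kτ.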


4. TinyRV1 Pipelined Processor

Time/Program = (Instructions/Program) × (Avg Cycles/Instruction) × (Time/Cycle)

• Instructions / program depends on source code, compiler, ISA
• Avg cycles / instruction (CPI) depends on ISA, microarchitecture
• Time / cycle depends upon microarchitecture and implementation

Microarchitecture          CPI    Cycle Time
Single-Cycle Processor      1     long
FSM Processor              >1     short
Pipelined Processor        ≈1     short

Technology Constraints

• Assume modern technology where logic is cheap and fast (e.g., fast integer ALU)
• Assume multi-ported register files with a reasonable number of ports are feasible
• Assume a small amount of very fast memory (caches) backed by large, slower memory

[Figure: control unit / datapath / regfile organization with imem and dmem request/response ports to the combinational memory]


4.1. High-Level Idea for Pipelined Processors

• Anne, Brian, Cathy, and Dave each have one load of clothes
• Washing, drying, folding, and storing each take 30 minutes

Fixed Time-Slot Laundry

[Figure: fixed time-slot laundry from 7pm to 3am; each person's load uses the washer, dryer, folding table, and storage in turn, one load at a time]

Pipelined Laundry / Pipelined Laundry with Slow Dryers

[Figure: pipelined laundry overlaps the four loads and finishes by about 10pm; with slow dryers the dryer stage limits throughput and the loads finish later, around 12am]

Pipelining lessons

• Multiple transactions operate simultaneously using different resources
• Pipelining does not help the transaction latency
• Pipelining does help the transaction throughput
• Potential speedup is proportional to the number of pipeline stages
• Potential speedup is limited by the slowest pipeline stage
• Potential speedup is reduced by time to fill the pipeline (see the short calculation below)
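A quick way to see the last three lessons together: with k balanced stages of time t each, N transactions take N × k × t unpipelined but only (k + N − 1) × t pipelined, so the speedup is N × k / (k + N − 1). This approaches k for large N, is set by the slowest stage when stages are unbalanced, and is pulled below k by the k − 1 cycles spent filling the pipeline. For the laundry example (k = 4, N = 4) the speedup is 16 / 7 ≈ 2.3 rather than 4.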


Applying pipelining to processors

                      add  addi  mul  lw   sw   jal  jr   bne
Fetch Instruction      ✓    ✓     ✓    ✓    ✓    ✓    ✓    ✓
Decode Instruction     ✓    ✓     ✓    ✓    ✓    ✓    ✓    ✓
Read Registers         ✓    ✓     ✓    ✓    ✓         ✓    ✓
Register Arithmetic    ✓    ✓     ✓    ✓    ✓              ✓
Read Memory                             ✓
Write Memory                                 ✓
Write Registers        ✓    ✓     ✓    ✓         ✓
Update PC              ✓    ✓     ✓    ✓    ✓    ✓    ✓    ✓

[Figure: execution timelines for the single-cycle, FSM, and pipelined processors; in the pipelined processor the steps of lw, add, and jal overlap, with each instruction occupying a different pipeline stage in each cycle]


4.2. Pipelined Processor Datapath and Control Unit

• Incrementally develop an unpipelined datapath
• Keep data flowing from left to right
• Position control signal table early in the diagram
• Divide datapath/control into stages by inserting pipeline registers
• Keep the pipeline stages roughly balanced
• Forward arrows should avoid "skipping" pipeline registers
• Backward arrows will need careful consideration

[Figure: unpipelined TinyRV1 datapath with the control signal table positioned early in the diagram; pc_sel_F mux, +4 unit, branch/jump target adder, regfile read and write ports, ALU, mul unit, wb_sel mux, and imem/dmem request/response ports]

addi x1, x2, 1   F D X M W
addi x3, x4, 1     F D X M W
addi x5, x6, 1       F D X M W


Adding a new auto-incrementing load instruction

Draw on the above datapath diagram what paths we need to use as well as any new paths we will need to add in order to implement the following auto-incrementing load instruction.

lw.ai rd, imm(rs1)
  R[rd] ← M[ R[rs1] + sext(imm) ]
  R[rs1] ← R[rs1] + 4

[Figure: pipelined TinyRV1 datapath and control with F, D, X, M, W stages; valid bits (val_FD, val_DX, val_XM, val_MW) and control signal bundles (cs_DX, cs_XM, cs_MW) travel with each transaction; the D stage reads the regfile and generates immediates, the X stage has the ALU, mul unit, and branch resolution (eq_X), the M stage accesses data memory, and the W stage writes the regfile]


Pipeline diagrams

addi x1, x2, 1
addi x3, x4, 1
addi x5, x6, 1

What would be the total execution time if these three instructions were repeated 10 times?
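As a sketch: 30 independent instructions flowing through the five-stage pipeline with no stalls finish in roughly 5 + 30 − 1 = 34 cycles (the first instruction takes 5 cycles, then one completes per cycle), times the pipelined cycle time.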

Hazards occur when instructions interact with each other in the pipeline.

• RAW Data Hazards: An instruction depends on a data value produced by an earlier instruction
• Control Hazards: Whether or not an instruction should be executed depends on a control decision made by an earlier instruction
• Structural Hazards: An instruction in the pipeline needs a resource being used by another instruction in the pipeline
• WAW and WAR Name Hazards: An instruction in the pipeline is writing a register that an earlier instruction in the pipeline is either writing or reading

Stalling and squashing instructions

• Stalling: An instruction originates a stall due to a hazard, causing all instructions earlier in the pipeline to also stall. When the hazard is resolved, the instruction no longer needs to stall and the pipeline starts flowing again.
• Squashing: An instruction originates a squash due to a hazard, and squashes all previous instructions in the pipeline (but not itself). We restart the pipeline to begin executing a new instruction sequence.


Control logic with no stalling and no squashing

[Figure: three pipeline stages A, B, C, each with control logic above datapath logic; the val bit travels with the transaction from stage to stage]

always_ff @( posedge clk )
  if ( reset )
    val_B <= 0
  else
    val_B <= next_val_A

next_val_B = val_B

Control logic with stalling and no squashing

[Figure: same three-stage organization with control and ostall signals exchanged between stages; the val_B register now has an enable reg_en_B]

reg_en_B = !stall_B

always_ff @( posedge clk )
  if ( reset )
    val_B <= 0
  else if ( reg_en_B )
    val_B <= next_val_A

ostall_B = val_B && ( ostall_hazard1_B || ostall_hazard2_B )

stall_B = val_B && ( ostall_B || ostall_C || ... )

next_val_B = val_B && !stall_B

ostall_B: Originating stall due to hazards detected in B stage.
stall_B: Should we actually stall B stage? Factors in ostalls due to hazards and ostalls from later pipeline stages.
next_val_B: Only send transaction to next stage if transaction in B stage is valid and we are not stalling B stage.


Control logic with stalling and squashing

[Figure: same three-stage organization with control, ostall, and osquash signals exchanged between stages]

reg_en_B = !stall_B || squash_B

always_ff @( posedge clk )
  if ( reset )
    val_B <= 0
  else if ( reg_en_B )
    val_B <= next_val_A

ostall_B = val_B && ( ostall_hazard1_B || ostall_hazard2_B )

stall_B = val_B && ( ostall_B || ostall_C || ... )

osquash_B = val_B && !stall_B && ( osquash_hazard1_B || ... )

squash_B = val_B && ( osquash_C || ... )

next_val_B = val_B && !stall_B && !squash_B

ostall_B: Originating stall due to hazards detected in B stage.
stall_B: Should we actually stall B stage? Factors in ostalls due to hazards and ostalls from later pipeline stages.
osquash_B: Originating squash due to hazards detected in B stage. If this stage is stalling, do not originate a squash.
squash_B: Should we squash B stage? Factors in the originating squashes from later pipeline stages. An originating squash from B stage means to squash all stages earlier than B, so osquash_B is not factored into squash_B.
next_val_B: Only send transaction to next stage if transaction in B stage is valid and we are not stalling or squashing B stage.
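The pseudocode above can be collected into a single per-stage valid-bit controller. The following is a minimal SystemVerilog sketch, not the course's reference implementation; the hazard inputs (ostall_hazard1_B, ostall_hazard2_B) and the downstream ostall_C / osquash_C signals are assumed to be computed elsewhere.

module stage_b_val_ctrl
(
  input  logic clk,
  input  logic reset,
  input  logic next_val_A,        // valid bit offered by stage A
  input  logic ostall_hazard1_B,  // hazard conditions detected in stage B
  input  logic ostall_hazard2_B,
  input  logic ostall_C,          // originating stall from the next stage
  input  logic osquash_C,         // originating squash from the next stage
  output logic val_B,             // is the transaction in stage B valid?
  output logic stall_B,
  output logic reg_en_B,          // enable for stage B's pipeline registers
  output logic next_val_B         // valid bit offered to stage C
);

  logic ostall_B, squash_B;

  // Originating stall due to hazards detected in the B stage
  assign ostall_B   = val_B && ( ostall_hazard1_B || ostall_hazard2_B );

  // Stall B if it originates a stall or a later stage is stalling
  assign stall_B    = val_B && ( ostall_B || ostall_C );

  // Squash B if a later stage originates a squash
  assign squash_B   = val_B && osquash_C;

  // Capture a new transaction when not stalling, or when being squashed
  assign reg_en_B   = !stall_B || squash_B;

  // Pass the transaction on only if it is valid and neither stalled nor squashed
  assign next_val_B = val_B && !stall_B && !squash_B;

  // Valid-bit pipeline register
  always_ff @( posedge clk ) begin
    if ( reset )
      val_B <= 1'b0;
    else if ( reg_en_B )
      val_B <= next_val_A;
  end

endmodule

Each stage instantiates the same pattern, chaining next_val, ostall, and osquash signals between neighboring stages.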


5. Pipeline Hazards: RAW Data Hazards

RAW data hazards occur when one instruction depends on a data value produced by a preceding instruction still in the pipeline. We use architectural dependency arrows to illustrate RAW dependencies in assembly code sequences.

addi x1, x2, 1

addi x3, x1, 1

addi x4, x3, 1

Using pipeline diagrams to illustrate RAW hazards

We use microarchitectural dependency arrows to illustrate RAW hazards on pipeline diagrams.

addi x1, x2, 1   F D X M W
addi x3, x1, 1     F D X M W
addi x4, x3, 1       F D X M W

addi x1, x2, 1
addi x3, x1, 1
addi x4, x3, 1


Approaches to resolving data hazards

• Expose in Instruction Set Architecture: Expose data hazards in ISA forcing compiler to explicitly avoid scheduling instructions that would create hazards (i.e., software scheduling for correctness)
• Hardware Scheduling: Hardware dynamically schedules instructions to avoid RAW hazards, potentially allowing instructions to execute out of order
• Hardware Stalling: Hardware includes control logic that freezes later instructions until earlier instruction has finished producing data value; software scheduling can still be used to avoid stalling (i.e., software scheduling for performance)
• Hardware Bypassing/Forwarding: Hardware allows values to be sent from an earlier instruction to a later instruction before the earlier instruction has left the pipeline (sometimes called forwarding)
• Hardware Speculation: Hardware guesses that there is no hazard and allows later instructions to potentially read invalid data; detects when there is a problem, squashes and then re-executes instructions that operated on invalid data


5.1. Expose in Instruction Set Architecture

One option is to insert nops to delay the read of the earlier write; these nops count as real instructions, increasing instructions per program. Another option is to insert independent instructions to delay the read of the earlier write, and only use nops if there is not enough useful work.

With nops only:

  addi x1, x2, 1
  nop
  nop
  nop
  addi x3, x1, 1
  nop
  nop
  nop
  addi x4, x3, 1

With independent instructions where possible:

  addi x1, x2, 1
  addi x6, x7, 1
  addi x8, x9, 1
  nop
  addi x3, x1, 1
  nop
  nop
  nop
  addi x4, x3, 1

Pipeline diagram showing exposing RAW data hazards in the ISA

addi x1, x2, 1
addi x6, x7, 1
addi x8, x9, 1
nop
addi x3, x1, 1
nop
nop
nop
addi x4, x3, 1

Note: If hazard is exposed in ISA, software scheduling is required for correctness! A scheduling mistake can cause undefined behavior.


5.2. Hardware Stalling

Hardware includes control logic that freezes later instructions (in front of pipeline) until earlier instruction (in back of pipeline) has finished producing data value.

Pipeline diagram showing hardware stalling for RAW data hazards

addi x1, x2, 1
addi x3, x1, 1
addi x4, x3, 1

Note: Software scheduling is not required for correctness, but can improve performance! Programmer or compiler schedules independent instructions to reduce the number of cycles spent stalling.

Modifications to datapath/control to support hardware stalling

[Figure: pipelined TinyRV1 datapath and control modified for hardware stalling; a stall logic block is added to the D-stage control, and reg_en_F and reg_en_D enables on the F and D pipeline registers hold the PC and the instruction in place while stalling]

Deriving the stall signal

         add   addi   mul   lw   sw   jal   jr   bne
rs1_en
rs2_en
rf_wen

ostall_waddr_X_rs1_D = val_D && rs1_en_D && val_X && rf_wen_X && (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0)

ostall_waddr_M_rs1_D = val_D && rs1_en_D && val_M && rf_wen_M && (inst_rs1_D == rf_waddr_M) && (rf_waddr_M != 0)

ostall_waddr_W_rs1_D = val_D && rs1_en_D && val_W && rf_wen_W && (inst_rs1_D == rf_waddr_W) && (rf_waddr_W != 0)

... similar for ostall signals for rs2 source register ...

ostall_D = val_D && ( ostall_waddr_X_rs1_D || ostall_waddr_X_rs2_D || ostall_waddr_M_rs1_D || ostall_waddr_M_rs2_D || ostall_waddr_W_rs1_D || ostall_waddr_W_rs2_D )

5.3. Hardware Bypassing/Forwarding

Hardware allows values to be sent from an earlier instruction (in back of pipeline) to a later instruction (in front of pipeline) before the earlier instruction has left the pipeline. Sometimes called “forwarding”.


Pipeline diagram showing hardware bypassing for RAW data hazards

addi x1, x2, 1
addi x3, x1, 1
addi x4, x3, 1

Adding single bypass path to support limited hardware bypassing

[Figure: pipelined datapath with a single bypass path added from the end of the X stage back to an op1_byp_sel mux in the D stage; the D-stage stall logic becomes stall-and-bypass logic]

Deriving the bypass and stall signals

ostall_waddr_X_rs1_D = 0

bypass_waddr_X_rs1_D = val_D && rs1_en_D && val_X && rf_wen_X
                    && (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0)


Pipeline diagram showing multiple hardware bypass paths

addi x2, x10, 1
addi x2, x11, 1
addi x1, x2, 1
addi x3, x4, 1
addi x5, x3, 1
add  x6, x1, x3
sw   x5, 0(x1)
jr   x6

Adding all bypass paths to support full hardware bypassing

[Figure: pipelined datapath with full bypassing; bypass paths from the ends of the X, M, and W stages (bypass_from_X, bypass_from_M, bypass_from_W) feed op1_byp_sel and op2_byp_sel muxes in the D stage]


Handling load-use RAW dependencies

ALU-use latency is only one cycle, but load-use latency is two cycles.

lw   x1, 0(x2)
addi x3, x1, 1

lw   x1, 0(x2)
addi x3, x1, 1

ostall_load_use_X_rs1_D = val_D && rs1_en_D && val_X && rf_wen_X && (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0) && (op_X == lw)

ostall_load_use_X_rs2_D = val_D && rs2_en_D && val_X && rf_wen_X && (inst_rs2_D == rf_waddr_X) && (rf_waddr_X != 0) && (op_X == lw)

ostall_D = val_D && ( ostall_load_use_X_rs1_D || ostall_load_use_X_rs2_D )

bypass_waddr_X_rs1_D = val_D && rs1_en_D && val_X && rf_wen_X && (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0) && (op_X != lw)

bypass_waddr_X_rs2_D = val_D && rs2_en_D && val_X && rf_wen_X && (inst_rs2_D == rf_waddr_X) && (rf_waddr_X != 0) && (op_X != lw)
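As a sketch of how these signals play out, a load-use sequence stalls for one cycle in D and then bypasses from the M stage, while an ALU-use sequence bypasses from the X stage with no stall:

lw   x1, 0(x2)    F D X M W
addi x3, x1, 1      F D D X M W    (one stall cycle; x1 bypassed from M)

addi x1, x2, 1    F D X M W
addi x3, x1, 1      F D X M W      (x1 bypassed from X; no stall)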


Pipeline diagram for simple assembly sequence

Draw a pipeline diagram illustrating how the following assembly sequence would execute on a fully bypassed pipelined TinyRV1 processor. Include microarchitectural dependency arrows to illustrate how data is transferred along various bypass paths.

lw   x1, 0(x2)
lw   x3, 0(x4)
add  x5, x1, x3
sw   x5, 0(x6)
addi x2, x2, 4
addi x4, x4, 4
addi x6, x6, 4
addi x7, x7, -1
bne  x7, x0, loop

5.4. RAW Data Hazards Through Memory

So far we have only studied RAW data hazards through registers, but we must also carefully consider RAW data hazards through memory.

sw x1, 0(x2)
lw x3, 0(x4)   # RAW dependency occurs if R[x2] == R[x4]

sw x1, 0(x2)
lw x3, 0(x4)


6. Pipeline Hazards: Control Hazards

Control hazards occur when whether or not an instruction should be executed depends on a control decision made by an earlier instruction. We use architectural dependency arrows to illustrate control dependencies in assembly code sequences.

Static Instr Sequence          Dynamic Instr Sequence

     addi x1, x0, 1            addi x1, x0, 1
     jal  x0, foo              jal  x0, foo
     opA                       addi x2, x3, 1
     opB                       bne  x0, x1, bar
foo: addi x2, x3, 1            addi x4, x5, 1
     bne  x0, x1, bar
     opC
     opD
     opE
bar: addi x4, x5, 1

Using pipeline diagrams to illustrate control hazards

We use microarchitectural dependency arrows to illustrate control hazards on pipeline diagrams.

addi x1, x0, 1
jal  x0, foo
addi x2, x3, 1
bne  x0, x1, bar
addi x4, x5, 1


The jump resolution latency and branch resolution latency are the number of cycles we need to delay the fetch of the next instruction in order to avoid any kind of control hazard. Jump resolution latency is two cycles, and branch resolution latency is three cycles.

addi x1, x0, 1
jal  x0, foo
addi x2, x3, 1
bne  x0, x1, bar
addi x4, x5, 1

Approaches to resolving control hazards

• Expose in Instruction Set Architecture: Expose control hazards in ISA forcing compiler to explicitly avoid scheduling instructions that would create hazards (i.e., software scheduling for correctness)
• Software Predication: Programmer or compiler converts control flow into data flow by using instructions that conditionally execute based on a data value
• Hardware Speculation: Hardware guesses which way the control flow will go and potentially fetches incorrect instructions; detects when there is a problem and re-executes instructions that are along the correct control flow
• Software Hints: Programmer or compiler provides hints about whether a conditional branch will be taken or not taken, and hardware can use these hints for more efficient hardware speculation


6.1. Expose in Instruction Set Architecture

Expose branch delay slots as part of the instruction set. Branch delay slots are instructions that follow a jump or branch and are always executed regardless of whether the jump or branch is taken or not taken. The compiler tries to insert useful instructions, otherwise it inserts nops.

Assume we modify the TinyRV1 instruction set to specify that JAL and JR instructions have a single-instruction branch delay slot (i.e., one instruction after a JAL or JR is always executed) and the BNE instruction has a two-instruction branch delay slot (i.e., two instructions after a BNE are always executed).

     addi x1, x0, 1
     jal  x0, foo
     nop
     opA
     opB
foo: addi x2, x3, 1
     bne  x0, x1, bar
     nop
     nop
     opC
     opD
     opE
bar: addi x4, x5, 1

Pipeline diagram showing using branch delay slots for control hazards

addi x1, x0, 1
jal  x0, foo
nop
addi x2, x3, 1
bne  x0, x1, bar
nop
nop
addi x4, x5, 1


6.2. Hardware Speculation

Hardware guesses which way the control flow will go and potentially fetches incorrect instructions; it detects when there is a problem and re-executes the instructions that are along the correct control flow. For now, we will only consider a simple branch prediction scheme where the hardware always predicts not taken.

Pipeline diagram when branch is not taken

addi x1, x0, 1
jal  x0, foo
opA
addi x2, x3, 1
bne  x0, x1, bar
opC
opD

Pipeline diagram when branch is taken

addi x1, x0, 1
jal  x0, foo
opA
addi x2, x3, 1
bne  x0, x1, bar
opC
opD
addi x4, x5, 1
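One way the taken case plays out (a sketch, assuming jumps resolve in D, branches resolve in X, and the front end always predicts not taken):

     addi x1, x0, 1    F D X M W
     jal  x0, foo        F D X M W
     opA                   F -              (squashed when jal resolves in D)
foo: addi x2, x3, 1          F D X M W
     bne  x0, x1, bar          F D X M W
     opC                         F D -      (squashed when bne resolves in X)
     opD                           F -      (squashed)
bar: addi x4, x5, 1                  F D X M W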


Modifications to datapath/control to support hardware speculation

[Figure: pipelined datapath and control modified for hardware speculation; the D-stage control logic now handles stall, bypass, and squash, jump and branch targets redirect pc_sel_F, and squashed stages have their valid bits cleared]

Deriving the squash signals

osquash_j_D  = (op_D == jal) || (op_D == jr)
osquash_br_X = (op_X == bne) && !eq_X

Our generic stall/squash scheme gives priority to squashes over stalls.

osquash_D = val_D && !stall_D && osquash_j_D
squash_D  = val_D && osquash_X

osquash_X = val_X && !stall_X && osquash_br_X
squash_X  = 0

Important: PC select logic must give priority to older instructions (i.e., prioritize branches over jumps)! Good quiz question?


Pipeline diagram for simple assembly sequence

Draw a pipeline diagram illustrating how the following assembly sequence would execute on a fully bypassed pipelined TinyRV1 processor that uses hardware speculation which always predicts not-taken. Unlike the "standard" TinyRV1 processor, you should also assume that we add a single-instruction branch delay slot to the instruction set. So this processor will partially expose the control hazard in the instruction set, but also use hardware speculation. Include microarchitectural dependency arrows to illustrate both data and control flow.

     addi x1, x2, 1
     bne  x0, x3, foo   # assume R[rs] != 0
     addi x4, x5, 1     # instruction is in branch delay slot
     addi x6, x7, 1
     ...
foo: add  x8, x1, x4
     addi x9, x1, 1


6.3. Interrupts and Exceptions

Interrupts and exceptions alter the normal control flow of the program. They are caused by an external or internal event that needs to be processed by the system, and these events are usually unexpected or rare from the program’s point of view.

• Asynchronous Interrupts
  – Input/output device needs to be serviced
  – Timer has expired
  – Power disruption or hardware failure
• Synchronous Exceptions
  – Undefined opcode, privileged instruction
  – Arithmetic overflow, floating-point exception
  – Misaligned memory access for instruction fetch or data access
  – Memory protection violations, page faults
  – System calls (traps) to jump into the operating system kernel

Interrupts and Exception Semantics

• Interrupts are asynchronous with respect to the program, so the microarchitecture can decide when to service the interrupt
• Exceptions are synchronous with respect to the program, so they must be handled immediately


• To handle an interrupt or exception the hardware/software must:
  – Stop program at current instruction (I), ensure previous insts finished
  – Save cause of interrupt or exception in privileged arch state
  – Save the PC of the instruction I in a special register (EPC)
  – Switch to privileged mode
  – Set the PC to the address of either the interrupt or the exception handler
  – Disable interrupts
  – Save the user architectural state
  – Check the type of interrupt or exception
  – Handle the interrupt or exception
  – Enable interrupts
  – Switch to user mode
  – Set the PC to EPC if I should be restarted
  – Potentially set PC to EPC+4 if we should skip I

Handling a misaligned data address and syscall exceptions

Static Instr Sequence                 Dynamic Instr Sequence

addi x1, x0, 0x2001                   addi x1, x0, 0x2001
lw   x2, 0(x1)                        lw   x2, 0(x1) (excep)
syscall                               opD
opB                                   opE
opC                                   opF
...                                   opG
exception_handler:                    opH
opD   # disable interrupts            addi EPC, EPC, 4
opE   # save user registers           eret
opF   # check exception type          syscall (excep)
opG   # handle exception              opD
opH   # enable interrupts             opE
addi EPC, EPC, 4                      opF
eret


Interrupts and Exceptions in a RISC-V Pipelined Processor

[Figure: the five pipeline stages F, D, X, M, W annotated with the exceptions each stage can generate: instruction address exceptions in F, illegal instruction in D, arithmetic overflow in X, and data address exceptions in M]

• How should we handle a single instruction which generates multiple exceptions in different stages as it goes down the pipeline?
  – Exceptions in earlier pipeline stages override later exceptions for a given instruction
• How should we handle multiple instructions generating exceptions in different stages at the same or different times?
  – We always want the execution to appear as if we have completely executed one instruction before going onto the next instruction
  – So we want to take the exception corresponding to the earliest instruction in program order first
  – Hold exception flags in pipeline until commit point
  – Commit point is after all exceptions could be generated but before any architectural state has been updated
  – To handle an exception at the commit point: update cause and EPC, squash all stages before the commit point, and set PC to exception handler
• How and where to handle external asynchronous interrupts?
  – Inject asynchronous interrupts at the commit point
  – Asynchronous interrupts will then naturally override exceptions caused by instructions earlier in the pipeline


Modifications to datapath/control to support exceptions

[Figure: pipelined datapath and control modified for exceptions; exception status (xcept_FD, xcept_DX, xcept_XM) flows down the pipeline into an exception handling logic block, and pc_sel_F can redirect fetch to the exception handler address]

Deriving the squash signals

osquash_j_D     = (op_D == jal) || (op_D == jr)
osquash_br_X    = (op_X == bne) && !eq_X
osquash_xcept_M = exception_M

Control logic needs to redirect the front end of the pipeline just like for a jump or branch. Again, squashes take priority over stalls, and PC select logic must give priority to older instructions (i.e., prioritize exceptions over branches over jumps)!


Pipeline diagram of exception handling

addi x1, x0, 0x2001
lw   x2, 0(x1)   # assume causes misaligned address exception
syscall          # causes a syscall exception
opB
opC
...
exception_handler:
opD   # disable interrupts
opE   # save user registers
opF   # check exception type
opG   # handle exception
opH   # enable interrupts
addi EPC, EPC, 4
eret


7. Pipeline Hazards: Structural Hazards

Structural hazards occur when an instruction in the pipeline needs a resource being used by another instruction in the pipeline. The TinyRV1 processor pipeline is specifically designed to avoid structural hazards.

Let's introduce a structural hazard by allowing instructions that do not do any real work in the M stage (i.e., non-memory instructions) to effectively skip that stage. This would require adding an extra path which "skips over" the pipeline register between the X and M stages and connects directly to the writeback mux at the end of the M stage. For non-memory instructions we set wb_sel_M to choose the value from the end of the X stage, while for memory instructions we set wb_sel_M to choose the value coming back from memory.

[Figure: pipelined datapath with an extra path that skips the X/M pipeline register and feeds the wb_sel_M mux directly, letting non-memory instructions bypass the M stage]


Using pipeline diagrams to illustrate structural hazards

We use structural dependency arrows to illustrate structural hazards.

addi x1, x2, 1
addi x3, x4, 1
lw   x5, 0(x6)
addi x7, x8, 1

Note that the key shared resources that are causing the structural hazard are the pipeline registers at the end of the M stage. We cannot write these pipeline registers with the transaction that is in the X stage and also the transaction that is in the M stage at the same time.

Approaches to resolving structural hazards

• Expose in Instruction Set Architecture: Expose structural hazards in ISA forcing compiler to explicitly avoid scheduling instructions that would create hazards (i.e., software scheduling for correctness)
• Hardware Stalling: Hardware includes control logic that freezes later instructions until earlier instruction has finished using the shared resource; software scheduling can still be used to avoid stalling (i.e., software scheduling for performance)
• Hardware Duplication: Add more hardware so that each instruction can access separate resources at the same time

7.1. Expose in Instruction Set Architecture

Insert independent instructions or nops to delay non-memory instructions if they follow a LW or SW instruction.


Pipeline diagram showing exposing structural hazards in the ISA

addi x1, x2, 1
addi x3, x4, 1
lw   x5, 0(x6)
nop
addi x7, x8, 1

7.2. Hardware Stalling

Hardware includes control logic that stalls a non-memory instruction if it follows a LW or SW instruction.

Pipeline diagram showing hardware stalling for structural hazards

addi x1, x2, 1
addi x3, x4, 1
lw   x5, 0(x6)
addi x7, x8, 1

Deriving the stall signal

ostall_wport_hazard_D = val_D && !mem_inst_D && val_X && mem_inst_X

where mem_inst is true for a LW or SW instruction and false otherwise. We stall far before the structural hazard actually occurs, because we know exactly how instructions move down the pipeline. It is also possible to use dynamic arbitration in the back-end of the pipeline.


7.3. Hardware Duplication

Add more pipeline registers at the end of M stage and a second write port so that non-memory and memory instructions can writeback to the register file at the same time.

[Figure: pipelined datapath with an extra set of pipeline registers at the end of the M stage (non_mem_result_MW alongside mem_result_MW) and a second register file write port (rf_wen1_W, rf_waddr1_W), so memory and non-memory instructions can write back in the same cycle]

Does allowing early writeback help performance in the first place?

addi x1, x2, 1
addi x3, x1, 1
addi x4, x3, 1
addi x5, x4, 1
addi x6, x5, 1
addi x7, x6, 1


8. Pipeline Hazards: WAW and WAR Name Hazards

WAW dependencies occur when an instruction overwrites a register that an earlier instruction has already written. WAR dependencies occur when an instruction writes a register that an earlier instruction needs to read. We use architectural dependency arrows to illustrate WAW and WAR dependencies in assembly code sequences.

mul x1, x2, x3

addi x4, x1, 1

addi x1, x5, 1

WAW name hazards occur when an instruction in the pipeline writes a register before an earlier instruction (in back of the pipeline) has had a chance to write that same register. WAR name hazards occur when an instruction in the pipeline writes a register before an earlier instruction (in back of the pipeline) has had a chance to read that same register.

The TinyRV1 processor pipeline is specifically designed to avoid any WAW or WAR name hazards. Instructions always write the register file in order in the same stage, and instructions always read registers in the front of the pipeline and write registers in the back of the pipeline. Let's introduce a WAW name hazard by using an iterative variable-latency multiplier, and allowing other instructions to continue executing while the multiplier is working.


Using pipeline diagrams to illustrate WAW name hazards

We use microarchitectural dependency arrows to illustrate WAW hazards on pipeline diagrams.

mul  x1, x2, x3
addi x4, x6, 1
addi x1, x5, 1

Approaches to resolving WAW and WAR name hazards

• Software Renaming: Programmer or compiler changes the register names to avoid creating name hazards
• Hardware Renaming: Hardware dynamically changes the register names to avoid creating name hazards
• Hardware Stalling: Hardware includes control logic that freezes later instructions until earlier instruction has finished either writing or reading the problematic register name

8.1. Software Renaming

As long as we have enough architectural registers, renaming registers in software is easy. WAW and WAR dependencies occur because we have a finite number of architectural registers.

mul  x1, x2, x3
addi x4, x1, 1
addi x7, x5, 1


8.2. Hardware Stalling

The simplest approach is to add stall logic in the decode stage similar to the approach used to resolve other hazards.

mul  x1, x2, x3
addi x4, x6, 1
addi x1, x5, 1

Deriving the stall signal

ostall_struct_hazard_D = val_D && (op_D == MUL) && !imul_rdy_D

ostall_waw_hazard_D = val_D && rf_wen_D && val_Z && rf_wen_Z && (rf_waddr_D == rf_waddr_Z) && (rf_waddr_Z != 0)

9. Summary of Processor Performance

Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)

Results for vector-vector add example

Microarchitecture        Inst   CPI   Cycle Time   Exec Time

Single-Cycle Processor    576   1.0      74 τ        43 kτ
FSM Processor             576   6.7      40 τ       154 kτ
Pipelined Processor       576


Estimating cycle time for pipelined processor

[Figure: fully bypassed pipelined TinyRV1 datapath and control repeated for estimating the cycle time]

• register read = 1τ
• register write = 1τ
• regfile read = 10τ
• regfile write = 10τ
• memory read = 20τ
• memory write = 20τ
• +4 unit = 4τ
• immgen = 2τ
• mux = 3τ
• multiplier = 20τ
• alu = 10τ
• adder = 8τ
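
One way to turn these component latencies into a pipelined cycle-time estimate is sketched below in Python: the cycle time is set by the slowest stage, i.e., the maximum over each stage's critical path. The stage paths shown are placeholders to be filled in from the datapath diagram, not the answer to the exercise.

# Component latencies from the list above, in units of tau.
latency = {
    "reg_read": 1, "reg_write": 1, "rf_read": 10, "rf_write": 10,
    "mem_read": 20, "mem_write": 20, "plus4": 4, "immgen": 2,
    "mux": 3, "mul": 20, "alu": 10, "adder": 8,
}

# Placeholder critical paths per stage; fill these in by tracing the
# longest register-to-register path through each stage of the datapath.
stage_paths = {
    "F": ["reg_read", "mem_read", "reg_write"],  # example path only
    "D": [],
    "X": [],
    "M": [],
    "W": [],
}

def cycle_time_tau(paths):
    # The clock period must cover the slowest stage's critical path.
    return max((sum(latency[c] for c in p) for p in paths.values() if p),
               default=0)

print(cycle_time_tau(stage_paths))  # 22 tau with only the example F path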


Estimating execution time

Using our first-order equation for processor performance, how long in τ will it take to execute the vvadd example assuming n is 64?

loop:
  lw   x5, 0(x13)
  lw   x6, 0(x14)
  add  x7, x5, x6
  sw   x7, 0(x12)
  addi x13, x13, 4
  addi x14, x14, 4
  addi x12, x12, 4
  addi x15, x15, -1
  bne  x15, x0, loop
  jr   x1

[Pipeline diagram scaffold for one loop iteration (lw, lw, add, sw, addi, addi, addi, addi, bne, and the next iteration's lw), to be filled in.]


Using our first-order equation for processor performance, how long in τ will it take to execute the mystery program assuming n is 64 and that we find a match on the last element?

  addi x5, x0, 0
loop:
  lw   x6, 0(x12)
  bne  x6, x14, foo
  addi x10, x5, 0
  jr   x1
foo:
  addi x12, x12, 4
  addi x5, x5, 1
  bne  x5, x13, loop
  addi x10, x0, -1
  jr   x1


10. Case Study: Transition from CISC to RISC

• Microcoding thrived in the 1970s
  – ROMs significantly faster than DRAMs
  – For complex instruction sets, microcode was cheaper and simpler
  – New instructions supported without modifying the datapath
  – Fixing bugs in the controller is easier
  – ISA compatibility across models relatively straightforward



10.1. Example CISC: IBM 360/M30

                           M30     M40     M50     M65
Datapath width (bits)        8      16      32      64
µinst width (bits)          50      52      85      87
µcode size (1K µinsts)       4       4     2.75    2.75
µstore technology         CCROS    TROS   BCROS   BCROS
µstore cycle (ns)          750     625     500     200
Memory cycle (ns)         1500    2500    2000     750
Rental fee ($K/month)        4       7      15      35

CCROS = card capacitor read-only storage (metal punch cards, replaceable in the field)
TROS = transformer read-only storage (magnetic storage)
BCROS = balanced capacitor read-only storage (capacitive storage)
Only the fastest models (75, 95) were hardwired.

IBM 360/M30 microprogram for register-register logical OR

[Figure: flowchart with steps — instruction fetch; fetch first byte of operands; OR; writeback result; prepare for next byte.]


IBM 360/M30 microprogram for register-register binary ADD

Analyzing Microcoded Machines

• John Cocke and group at IBM

  – Working on a simple pipelined processor, the 801, and advanced compilers
  – Ported the experimental PL8 compiler to the IBM 370 and used only simple register-register and load/store instructions, similar to the 801
  – Code ran faster than with other existing compilers that used all the 370 instructions! (up to 6 MIPS, whereas 2 MIPS was considered good before)
• Joel Emer and Douglas Clark at DEC
  – Measured the VAX-11/780 using external hardware
  – Found it was actually a 0.5 MIPS machine, not a 1 MIPS machine
  – 20% of VAX instructions account for 60% of the µcode, but only 0.2% of the dynamic execution
• VAX 8800, a high-end VAX in 1984
  – Control store: 16K × 147b RAM; unified cache: 64K × 8b RAM
  – 4.5× more microstore RAM than cache RAM!


From CISC to RISC

• Key changes in technology constraints
  – Logic, RAM, and ROM all implemented with MOS transistors
  – RAM ≈ same speed as ROM
• Use fast RAM to build a fast instruction cache of user-visible instructions, not fixed hardware microfragments
  – Change the contents of the fast instruction memory to fit what the application needs
• Use a simple ISA to enable a hardwired pipelined implementation
  – Most compiled code used only a few of the CISC instructions
  – Simpler encoding allowed pipelined implementations
  – Load/store register-register ISA as opposed to memory-memory ISA
• Further benefit with integration
  – By the early 1980s, a 32-bit datapath and small caches fit on a single chip
  – No chip crossing in the common case allows faster operation

[Figure: a vertical µcode controller compared with a RISC controller — µPC vs. user PC; ROM for µinstructions vs. RAM for an instruction cache; small decoder vs. "larger" decoder.]


10.2. Example RISC: MIPS R2K

• MIPS R2K is one of the first popular pipelined RISC processors
• MIPS R2K implements the MIPS I instruction set (the MIPS ISA family later grew to include MIPS II, MIPS III, MIPS IV, MIPS32, MIPS64, and MIPS16)
• MIPS = Microprocessor without Interlocked Pipeline Stages
• MIPS I used software scheduling to avoid some RAW hazards by including a single-instruction load-use delay slot
• MIPS I used software scheduling to avoid some control hazards by including a single-instruction branch delay slot

One-Instr Branch Delay Slot:

    addiu r1, r2, 1
    j     foo
    addiu r3, r4, 1    # BDS
    ...
foo:
    addiu r5, r6, 1
    bne   r7, r8, bar
    addiu r9, r10, 1   # BDS
    ...
bar:

Present in all MIPS instruction sets; not possible to deprecate and still enable legacy code to execute on new microarchitectures.

One-Instr Load-Use Delay Slot:

    lw    r1, 0(r2)
    lw    r3, 0(r4)
    addiu r2, r2, 4    # LDS
    addu  r5, r1, r3

Deprecated in the MIPS II instruction set; legacy code can still execute on new microarchitectures, but code using the MIPS II instruction set can rely on hardware stalling.


MIPS R2K Microarchitecture

The pipelined datapath and control unit were located on a single die. Cache control was also integrated on-die, but the actual tag and data storage for the caches was located off-chip.

[Figure: MIPS R2K block diagram — on-chip control unit and datapath with I$ and D$ controllers; off-chip I$ (4-32KB) and D$ (4-32KB).]

Used two-phase clocking to enable five pipeline stages to fit into four clock cycles. This avoided the need for explicit bypassing from the W stage to the end of the D stage.

[Figure: textbook excerpt showing the MIPS five-stage pipeline (IF, RD, ALU, MEM, WB) and the pipeline's branch delay — the branch target cannot be fetched until the branch resolves, leaving a one-instruction branch delay slot.]

Two-phase clocking also enabled a single-cycle branch resolution latency, since register read, branch target address generation, and the branch comparison can all fit in a single cycle.

MIPS R2K VLSI Design

Process: 2 µm, two metal layers
Clock frequency: 8-15 MHz
Size: 110K transistors, 80 mm²


[Figure: ratio of MIPS to VAX for instructions executed and CPI across benchmarks (spice, matrix, fpppp, doduc, nasa7, eqntott, tomcatv, espresso, li): MIPS executes roughly 2× more instructions but has roughly 6× lower CPI, giving 2-4× higher performance. From H&P, Appendix J, based on Bhandarkar and Clark, 1991.]

CISC/RISC Convergence

• The MIPS R10K uses a sophisticated out-of-order engine; the branch delay slot is no longer useful (Gwennap, Microprocessor Report, 1994)
• The Intel Nehalem front end breaks CISC instructions into smaller RISC-like µops; a µcode engine handles rarely used complex instructions (Kanter, Real World Technologies, 2009)


Microprogramming Today

• Microprogramming is far from extinct
• Played a crucial role in processors of the 1980s (DEC VAX, 68K series, 386/486)
• Microprogramming plays an assisting role in many modern processors (AMD Phenom, Intel Nehalem, Intel , IBM Z196)
  – 761 Z196 instructions executed with hardwired control
  – 219 Z196 "complex" instructions always executed with microcode
  – 24 Z196 instructions conditionally executed with microcode
• Patchable microcode is common for post-fabrication bug fixes (Intel processors load µcode patches at bootup)
