Advanced RISC-V Architectures

PULP PLATFORM Open Source Hardware, the way it should be! Working with RISC-V Part 2 of 4 : Advanced RISC-V Architectures Luca Benini <[email protected]> Frank K. Gürkaynak <[email protected]> http://pulp-platform.org @pulp_platform https://www.youtube.com/pulp_platform Working with RISC-V Summary ▪ Part 1 – Introduction to RISC-V ISA ▪ Part 2 – Advanced RISC-V Architectures ▪ Going 64 bit ▪ Bottlenecks ▪ Safety/Security ▪ Vector units ▪ Part 3 – PULP concepts ▪ Part 4 – PULP based chips | ACACES 2020 - July 2020 Working with RISC-V PULP RISC-V Cores from Tiny to App-level 32 bit 64 bit Low Cost DSP Streaming Linux Core Enhanced Compute capable Core Core Core ▪ Zero-riscy ▪ RI5CY ▪ Snitch ▪ Ariane ▪ RV32-ICM ▪ RV32-ICMFX ▪ RV32- ▪ RV64-IC(MA) ▪ Micro-riscy ▪ SIMD ICMDFX ▪ Full ▪ HW loops ▪ RV32-CE privileged ▪ Bit specification manipulation ▪ Fixed point ARM Cortex-M0+ ARM Cortex-M4 ARM Cortex-A5 | ACACES 2020 - July 2020 Working with RISC-V From IoT to HPC ▪ For the first 4 years of PULP, we used only 32bit cores ▪ Most IoT near-sensor applications work well with 32bit cores. ▪ 64bit memory space is not affordable in an MCU-class device ▪ But times change: ▪ Large datasets, high-precision numerical calculations (e.g. double precision FP) at the IoT edge (gateways) and cloud ▪ Software infrastructure (OS – typically linux) with virtual memory assumes 64bit ▪ High-performance computing, being hot again, requires 64bit ▪ Research question – pJ/OP on 64bit data+address space is possible? How? | ACACES 2020 - July 2020 Working with RISC-V An application class processor ▪ Virtual Memory ▪ Larger address space (64-bit) ▪ Multi-program environment ▪ Requires more hardware support ▪ Efficient sharing and protection ▪ MMU (TLBs, PTW) ▪ Operating System ▪ Privilege Levels ▪ Highly sequential code ▪ More Exceptions (page fault, illegal access) ▪ Increase frequency to gain performance →Ariane an application class processor ▪ Large software infrastructure ▪ Drivers for hardware (PCIe, ethernet) ▪ Application SW (e.g.: Tensorflow, …) NOT an ARM Cortex-A killer! “Controller” core with must-have features for 64bit OSes | ACACES 2020 - July 2020 Working with RISC-V ARIANE: Linux Capable 64-bit core ▪ Application class processor ▪ 6-stage pipeline ▪ Linux Capable ▪ In-order issue ▪ Out-of-order write-back ▪ M, S and U privilege modes ▪ In-order commit ▪ TLB ▪ Tightly integrated D$ and I$ ▪ Branch-prediction ▪ Hardware PTW ▪ RAS ▪ Optimized for 1+GHz clock speed ▪ Branch Target Buffer ▪ Branch History Table ▪ Frequency: 1+ GHz (22 FDX) ▪ Area: 100s kGE (200-400) ▪ Scoreboarding ▪ Critical path: ~ 25-30 logic levels ▪ Designed for extendibility | ACACES 2020 - July 2020 Working with RISC-V Absolute minimum necessary to boot Linux? ▪ Hardware ▪ Software ▪ 64 or 32 bit Integer Extension ▪ Zero Stage Bootloader ▪ Atomic Extension ▪ Device Tree Specification (DTS) ▪ Privilege levels U, S and M ▪ RAM preparation (zeroing) ▪ MMU ▪ Second stage bootloader ▪ FD Extension or out-of-tree Kernel patch ▪ BBL ▪ 16 MB RAM ▪ Uboot ▪ Interrupts ▪ … ▪ Core local interrupts (CLINT) like timer and ▪ Linux Kernel inter processor interrupts ▪ User-space applications (e.g.: Busybox) ▪ Serial or distro | ACACES 2020 - July 2020 Ariane μARC 2 3 1 4 5 6 1. PC Gen • Select PC 2. Instr. Fetch • TLB • Query I$ 3. Instr. Decode • Re-align • De-compress • Decode 4. Issue • Select FU • Issue 5. Execute 6. Commit 1. Write state | Working with RISC-V Frequency-IPC trade-off ▪ Frequency: ▪ Increased bubbles due to: ▪ Increase frequency through pipelining ▪ Data Hazards ➔ Forwarding ▪ Modern Intel CPUs have around 10 - ▪ Structural Hazards ➔ Scoreboard 20 pipeline stages ▪ Control Hazards ➔ Branch Prediction ▪ Adds significant complexity on the cache interfaces | ACACES 2020 - July 2020 Working with RISC-V Data Hazards - Forwarding | ACACES 2020 - July 2020 Working with RISC-V Scoreboarding ▪ Hide latency of multi-cycle instructions ▪ Clean and modular interface to functional units ➔ scalability (FPU) ▪ Add issue port: Dual-Issue implementation ▪ Split execution into four steps: ▪ Issue: Relatively complex issue logic (extra pipeline-stage) ▪ Read Operands: From register file or forwarded ▪ Execute ▪ Write Back: Mitigate structural hazards on write-back path ▪ Mitigate structural hazards on write-back port ▪ Implemented as a circular buffer | ACACES 2020 - July 2020 Working with RISC-V Scoreboard Issue V FU OP rd src 1 src 2 result exception Commit Decode: 0 ld x1 ← x2(0) 0 mul x2 ← x2, x2 0 addi x3 ← x0, 16 0 addi x2 ← x3, x1 ld x1 ← x2(128) Issue Commit ALU LSU MUL Regfile | ACACES 2020 - July 2020 Working with RISC-V Scoreboard – 3 cycle instruction (LD) Issue V FU OP rd src 1 src 2 result exception Commit Decode: 0 LSU LD x1 x2 x0 0 No ld x1 ← x2(0) 0 mul x2 ← x2, x2 0 addi x3 ← x0, 16 0 addi x2 ← x3, x1 ld x1 ← x2(128) Issue Commit ALU LSU MUL Regfile | ACACES 2020 - July 2020 Working with RISC-V Scoreboard – 2 cycle instruction (MUL) V FU OP rd src 1 src 2 result exception Commit Decode: Issue 0 LSU LD x1 x2 x0 0 No ld x1 ← x2(0) 0 MUL MUL x2 x2 x2 0 No mul x2 ← x2, x2 0 addi x3 ← x0, 16 0 addi x2 ← x3, x1 ld x1 ← x2(128) Issue Commit ALU LSU MUL Regfile | ACACES 2020 - July 2020 Working with RISC-V Scoreboard - single cycle instruction (ALU) V FU OP rd src 1 src 2 result exception Commit Decode: 0 LSU LD x1 x2 x0 0 No ld x1 ← x2(0) Issue 0 MUL MUL x2 x2 x2 0 No mul x2 ← x2, x2 0 ALU ADD x3 x0 x0 16 No addi x3 ← x0, 16 0 addi x2 ← x3, x1 ld x1 ← x2(128) Issue Commit ALU LSU MUL Regfile | ACACES 2020 - July 2020 Working with RISC-V Scoreboard – Multiple Write Back V FU OP rd src 1 src 2 result exception Commit Decode: 1 LSU LD x1 x2 x0 0x5555 No ld x1 ← x2(0) 1 MUL MUL x2 x2 x2 0xAAAA No mul x2 ← x2, x2 Issue 1 ALU ADD x3 x0 x0 16 No addi x3 ← x0, 16 0 ALU ADD x2 x3 x1 0 No addi x2 ← x3, x1 forward ld x1 ← x2(128) Issue Commit ALU LSU MUL Regfile | ACACES 2020 - July 2020 Working with RISC-V Scoreboard - Commit Issue V FU OP rd src 1 src 2 result exception Decode: 0 LSU LD x1 x2 x0 128 No Commit ld x1 ← x2(0) 1 MUL MUL x2 x2 x2 0xAAAA No mul x2 ← x2, x2 1 ALU ADD x3 x0 x0 16 No addi x3 ← x0, 16 1 ALU ADD x2 x3 x1 0xAABA No addi x2 ← x3, x1 ld x1 ← x2(128) Issue Commit ALU LSU MUL Regfile | ACACES 2020 - July 2020 Working with RISC-V Scoreboard – Exception Issue V FU OP rd src 1 src 2 result exception Decode: 1 LSU LD x1 x2 x0 128 Yes ld x1 ← x2(0) 0 Commit mul x2 ← x2, x2 1 ALU ADD x3 x0 x0 16 No addi x3 ← x0, 16 1 ALU ADD x2 x3 x1 0xAABA No addi x2 ← x3, x1 ld x1 ← x2(128) Issue Commit ALU LSU MUL Regfile | ACACES 2020 - July 2020 Working with RISC-V Scoreboard - Commit Issue V FU OP rd src 1 src 2 result exception Decode: 1 LSU LD x1 x2 x0 128 Yes ld x1 ← x2(0) 0 mul x2 ← x2, x2 0 Commit addi x3 ← x0, 16 1 ALU ADD x2 x3 x1 0xAABA No addi x2 ← x3, x1 ld x1 ← x2(128) Issue Commit ALU LSU MUL Regfile | ACACES 2020 - July 2020 Working with RISC-V Scoreboard - Commit Issue V FU OP rd src 1 src 2 result exception Commit Decode: 1 LSU LD x1 x2 x0 128 Yes ld x1 ← x2(0) 0 mul x2 ← x2, x2 0 addi x3 ← x0, 16 0 addi x2 ← x3, x1 ld x1 ← x2(128) Issue Commit ALU LSU MUL Regfile | ACACES 2020 - July 2020 Working with RISC-V Scoreboard - Commit Issue V FU OP rd src 1 src 2 result exception Commit Decode: 0 ld x1 ← x2(0) 0 mul x2 ← x2, x2 0 addi x3 ← x0, 16 0 addi x2 ← x3, x1 ld x1 ← x2(128) Issue Commit ALU LSU MUL Regfile | ACACES 2020 - July 2020 Branch Prediction ▪ Branch Target Buffer ▪ Branch History Table ▪ 2-bit saturation counter ▪ Anti-aliasing bits | Working with RISC-V Caches ▪ Caches are a necessity for larger systems ▪ SRAMs (cache memories) are slow compared to regular ▪ Private L1 caches logic ▪ I$ (16 kByte, 32 entries, 16 byte cache line, 4 way) - 1,41% MR (Linux Boot) ▪ Virtually indexed, physically tagged data cache ▪ D$ (32 kByte, 32 entries, 16 byte cache line, 8 way) - 3,17% MR (Linux Boot) ▪ L2 cache (outside core domain) | ACACES 2020 - July 2020 Working with RISC-V Memory Interfaces ▪ Load and stores are very common in RISC architectures ▪ Caches add (costly) tag-comparison ▪ Address translation adds to this already critical path ▪ A fast CPU design needs to account for these effects as much as possible ▪ Virtually indexed, physically tagged caches ▪ De-skewing | ACACES 2020 - July 2020 Working with RISC-V Memory Management Unit (MMU) ▪ Essential for supporting Linux Virtual Address ▪ Ariane implements 39-bit page based address translation (SV39) ▪ SV39 supports three levels of page tables MMU ▪ 1st level: 1 gigabit-pages ▪ 2nd level: 2 megabit-pages TLB MISS ▪ 3rd level: (regular) 4 kilobit-pages TLB HIT ▪ Configurable number of TLB entries Physical Address ▪ Hardware page table walker allows for PTW efficient TLB miss management | ACACES 2020 - July 2020 Hardware Page Table Walker (HPTW) Virtual Page Number 2 Virtual Page Number 1 Virtual Page Number 0 Physical Page Number Base (CSR) V R W X A D G PPN 1 0 0 0 0 0 0 0xDEADBEEF Pointer to next level PTE ⠇ ⠇ ⠇ ⠇ ⠇ ⠇ ⠇ ⠇ 1 1 0 0 0 0 0 0xCOFFEEBABE Leaf Node (1 Gigapage) | Second Level Page Table Virtual Page Number 2 Virtual Page Number 1 Virtual Page Number 0 V R W X A D G PPN 1 0 0 0 0 0 0 0xDEADBEEF Analogous with third ⠇ ⠇ ⠇ ⠇ ⠇ ⠇ ⠇ ⠇ 1 1 0 0 0 0 0 0xCOFFEEBABE and fourth level V R W X A D G PPN 1 0 0 0 0 0 0 0xAAAAAAAAA Pointer to next level PTE ⠇ ⠇ ⠇ ⠇ ⠇ ⠇ ⠇ ⠇ 1 1 0 0 0 0 0 0xBBBBBBBBBB Leaf Node (2 Megapage) | Working with RISC-V Full Debug support ▪ JTAG interface ▪ OpenOCD support ▪ Debug Bridge to communicate with hardware ▪ Allows for: ▪ run-control ▪ single-step ▪ inspection ▪ 16 performance counters

Advanced RISC-V Architectures

Pipelining 2

EECS 570 Lecture 5 GPU Winter 2021 Prof

14 Superscalar Processors

Concepts Introduced in Chapter 3 Instruction Level Parallelism (ILP)

Intro Evolution of Superscalar Processor

Simty: Generalized SIMT Execution on RISC-V Caroline Collange

History Scoreboarding Overview Machine Correctness Four Stages

Overview of 15-740

5 10 15 20 25 30 35 Class 712 Electrical Computers And

Modern Computer Architectures Lecture-1: Introduction

Pipelining: Basic and Intermediate Concepts

Lecture 6: Scoreboarding and Tomasulo Algorithm