Energy-Efficient RISC-V Processors in 28nm FDSOI
Borivoje Nikolić
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
26 September 2017

Our 28FDSOI Adventures
[Figure: chip timeline, 2011 to 2017: Raven-1, Raven-2, Raven-3, Raven-4, LDPC, BTLE + SNOW testchip (Leti)]
Ten chips designed in 28nm FDSOI
9 tested and functional, 6 published
1 in fab

Berkeley RISC-V ISA
www.riscv.org
A new, completely open ISA: free to use and extend
Has complete software support (GCC, Linux, LLVM, simulators…)
RV32, RV64, and RV128 variants defined for 32b, 64b, and 128b address spaces
Base ISA has only 40 integer instructions, but supports compiler, linker, OS, etc.
Extensions provide a full general-purpose ISA, including IEEE-754/2008 floating-point
Comparable ISA-level metrics to other RISCs
Designed for extension, customization

RISC-V Foundation Members (60+)
[Figure: member logos in two groups, "Platinum" and "Gold, Silver, Auditors"; the only name recoverable as text is Rumble Development]

"Rocket Chip" SoC Generator
[Figure: example output]

"Rocket Chip" SoC Specialization
1. Change Parameters
2. Develop New Accelerators
3. Develop Own RISC-V Core
4. Develop Own Device

28nm FDSOI RISC-V Processor SoCs
RISC-V with vector accelerator and integrated DC-DC converters: 34 GFLOPS/W
RISC-V with vector accelerator and integrated DC-DC converters, back-bias, power management: 54 GFLOPS/W

Raven-3 Processor
[Die photo labels: Vector RF, VI$, DC-DC, Rocket/Hwacha Tile, D$, I$, BIST, Uncore]
Process: ST 28nm FDSOI
Runs Linux
0.45V-1V, including cache
34 GFLOPS/W running DP matrix-matrix multiplication
B. Zimmer, VLSI'15; B. Zimmer, JSSC 4/15

Raven-3 Processor
28nm FDSOI offers high energy efficiency
4 DC-DC converter modes cover wide operating range

Raven-4 RISC-V Processor SoC
[Block diagram. Labels recoverable from the figure:
  Voltage and clock generation (0.4 mm2): 48 switched-capacitor DC-DC unit cells with 1.8V and 1.0V inputs and Vout output; DC-DC controller (comparator against Vref, FSM, DCDC toggle); adaptive clock generator driving the core clk
  Power management (0.1 mm2): Z-scale PMU with 8KB scratchpad; back-bias generator (NWELL, PWELL); "set body bias" and "set DC-DC" controls
  SRAM BIST
  Integrated measurement: toggle counter, clock counter, programmable current mirror load (Iload), Vout waveform reconstruction, outputs to scope
  Core (1.07 mm2): Rocket core (branch prediction, scalar RF, FPU, int); vector accelerator (vector issue unit, 16KB vector RF using eight custom 8T SRAM macros, crossbar, functional units with 64-bit int. mul. and SP/DP FMA, vector memory unit); 16KB scalar inst. cache, 32KB shared data cache, and 8KB vector inst. cache, all built from custom 8T SRAM macros; arbiter
  Uncore (1.0V): async. FIFO/level shifters between domains; digital IO pads to wire-bonded chip-on-board; to/from off-chip FPGA FSB and DRAM]

[Raven-4 block diagram repeated, annotated with its main features: power management, back bias, integrated voltage regulation, RISC-V Rocket processor, adaptive clock generation]

[Raven-4 block diagram repeated, with integrated voltage regulation highlighted]

Simultaneous Switching DC-DCs
Traditional interleaving: charge sharing losses
Simultaneous switching: no charge sharing!
Clock frequency adapts to track the voltage ripple
Zimmer et al, JSSC'16

Reconfigurable SC Converters
Four operating modes supply 0.5–1V core voltage
[Figure labels: Vout, Vref, 2GHz comparator, FSM, toggle]
Zimmer et al, JSSC'16
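
The comparator-and-toggle regulation shown on the switched-capacitor converter slides can be illustrated with a toy discrete-time model. This is only a sketch under stated assumptions, not the controller from Zimmer et al: it assumes the comparator samples Vout once per clock period and requests one toggle whenever Vout is below Vref, and that each toggle delivers a fixed charge packet. The 2 GHz comparator rate comes from the slides; the function name, output capacitance, charge per toggle, Vref, and load currents are all hypothetical values chosen for readability.

```python
def simulate_lower_bound_control(i_load_a, v_ref=0.9, steps=20_000,
                                 f_cmp_hz=2e9, c_out_f=10e-9, q_toggle_c=1e-9):
    """Toy model of comparator-triggered regulation of a rippling supply.

    i_load_a    constant load current in amperes (hypothetical)
    v_ref       reference voltage in volts (hypothetical)
    f_cmp_hz    comparator clock rate; 2 GHz is the figure on the slide
    c_out_f     output capacitance in farads (hypothetical)
    q_toggle_c  charge delivered per DC-DC toggle in coulombs (hypothetical)
    Returns (toggle count, minimum Vout, maximum Vout).
    """
    dt = 1.0 / f_cmp_hz
    v = v_ref
    toggles = 0
    v_min, v_max = v, v
    for _ in range(steps):
        v -= i_load_a * dt / c_out_f       # the load discharges the output node
        v_min = min(v_min, v)
        if v < v_ref:                      # comparator decision on this clock edge
            v += q_toggle_c / c_out_f      # one toggle adds a fixed charge packet
            toggles += 1
        v_max = max(v_max, v)
    return toggles, v_min, v_max


if __name__ == "__main__":
    for i_load in (0.1, 0.2):
        n, lo, hi = simulate_lower_bound_control(i_load)
        print(f"Iload={i_load:.1f} A: {n} toggles, Vout between {lo:.3f} V and {hi:.3f} V")
```

In this model the output ripples between roughly Vref and Vref plus one charge packet, and doubling the load current doubles the toggle count over the same window. That proportionality is the property a toggle counter can exploit to estimate core power.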
[Raven-4 block diagram repeated, with the RISC-V Rocket processor highlighted]

RISC-V Rocket Processor
Five-stage in-order RISC-V core
Similar in performance to ARM Cortex-A5
Single/double precision floating point unit
Memory-management unit allows full operating system support
http://www.riscv.org

Vector Coprocessor
Energy-efficient acceleration of common kernels
[Figure: Decoupled Vector Accelerator. Labels: Rocket Control Processor; queues VCMDQ, VRCMDQ, FPREQQ, FPRESPQ; Scalar Unit with Scalar Execution Unit (SXU) and Scalar Memory Unit (SMU); Vector Runahead Unit (VRU); Master Sequencer; Vector Lanes 0 to N, each with a Vector Execution Unit (VXU), Sequencer/Expander, and Vector Memory Unit (VMU); 4 KB L1 VI$; L1-to-L2 TileLink Crossbar]
Y. Lee et al, "The Hwacha Microarchitecture Manual, Version 3.8.1," 2015

Custom 8T Cell SRAM Macro
8T cell for 1R1W ports, low voltage operation with single P-Well
4KB macro, 2:1 physical interleaving
512x72 bits (4KB+ECC)
1V: 380ps (C->Q), 7pJ/read; 0.6V: 1.37ns (C->Q), 2.3pJ/read
Single-ended read increases speed 30% and decreases energy 30%
SRAM in 28nm FDSOI functional from 1V to 0.45V
Thomas et al, SOI12'16; Thomas et al, IEDM'14; Zimmer, Ph.D. Dis, 2015; Keller et al, JSSC'17
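
The read figures quoted for the custom 8T SRAM macro (1 V: 380 ps and 7 pJ/read; 0.6 V: 1.37 ns and 2.3 pJ/read) imply a concrete energy versus speed trade-off. The short calculation below works it out. The four numbers are taken from the slide; the dictionary layout, function name, and the choice to also report an energy-delay product are illustrative additions, not something the deck presents.

```python
# Read figures quoted on the slide for the 4KB custom 8T macro.
SRAM_READ = {
    1.0: {"clk_to_q_s": 380e-12, "energy_j": 7.0e-12},
    0.6: {"clk_to_q_s": 1.37e-9, "energy_j": 2.3e-12},
}


def read_tradeoff(v_high=1.0, v_low=0.6):
    """Compare read energy and clock-to-Q delay at two supply voltages."""
    hi, lo = SRAM_READ[v_high], SRAM_READ[v_low]
    return {
        # How many times less energy a read costs at the lower supply.
        "energy_saving_x": hi["energy_j"] / lo["energy_j"],
        # How many times longer the clock-to-Q delay is at the lower supply.
        "slowdown_x": lo["clk_to_q_s"] / hi["clk_to_q_s"],
        # Energy-delay product at the low supply relative to the high supply.
        "edp_ratio": (lo["energy_j"] * lo["clk_to_q_s"])
                     / (hi["energy_j"] * hi["clk_to_q_s"]),
    }


if __name__ == "__main__":
    for name, value in read_tradeoff().items():
        print(f"{name}: {value:.2f}")
```

Dropping from 1 V to 0.6 V cuts read energy by about 3.0x while lengthening clock-to-Q by about 3.6x, so the energy-delay product is slightly worse at the lower supply even though energy per read is much lower.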
[Raven-4 block diagram repeated, with power management highlighted]

Measuring Power
Rippling voltage supply makes it easy to measure power consumption of the core
Each DC-DC toggle is a fixed amount of energy
Counters measure the DC-DC toggle clock frequency to determine core energy
[Figure: rippling supply voltage above Vref under low and high power consumption, shown with the DC-DC toggle clock]
Cochet et al, A-SSCC'16

Power Management Unit (PMU)
Tiny 32-bit RISC-V core
Fully programmable; can access all control registers

[Raven-4 block diagram repeated, with back bias highlighted]
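
The toggle-counting measurement described in the Measuring Power slides reduces to simple arithmetic: if each DC-DC toggle delivers a fixed amount of energy, the energy used in a window is the toggle count times that energy, and average power is that energy divided by the window length. The sketch below shows the calculation. The function name, the energy-per-toggle value, and the counter readings are hypothetical; the deck does not give these numbers.

```python
def average_core_power(toggle_count, window_s, energy_per_toggle_j):
    """Estimate average core power from a DC-DC toggle counter reading.

    toggle_count         toggles counted during the measurement window
    window_s             window length in seconds
    energy_per_toggle_j  energy delivered per toggle in joules (assumed fixed)
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    if toggle_count < 0:
        raise ValueError("toggle_count must be non-negative")
    return toggle_count * energy_per_toggle_j / window_s


if __name__ == "__main__":
    E_TOGGLE_J = 50e-12   # hypothetical energy per toggle
    WINDOW_S = 1e-3       # hypothetical 1 ms counting window
    for label, count in (("low activity", 20_000), ("high activity", 400_000)):
        p = average_core_power(count, WINDOW_S, E_TOGGLE_J)
        print(f"{label}: {p * 1e3:.1f} mW")
```

Equivalently, power is the toggle clock frequency times the energy per toggle, which is why a frequency counter on the toggle clock is enough to track core power.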