Ram K. Krishnamurthy Senior Principal Engineer

High Performance Energy Efficient Near Threshold Circuits: Challenges and Opportunities
2012 MICRO Near Threshold Computing Workshop Keynote, December 2012
Ram K. Krishnamurthy, Senior Principal Engineer
Circuits Research Lab, Circuits & Systems Research, Intel Labs
Intel Corporation, Hillsboro, OR 97124, USA
[email protected]
Acknowledgements: Intel Circuits Research Lab, Vivek De, Rick Forand, Wen-Hann Wang, Shekhar Borkar, Greg Taylor, IPR Bangalore Design Lab, Stefan Rusu, Jim Held

Era of Tera-scale Computing
Teraflops of performance operating on Terabytes of data
[Figure: performance (KIPS, MIPS, GIPS, TIPS) vs. dataset size (Kilobytes, Megabytes, Gigabytes, Terabytes). Labels: Single-core, Multi-core, Terascale; Text, Multi-Media, 3D & Video, Personal Media Creation and Management, Mining, Synthesis, Recognition, Models, Model-based Apps, Financial Analytics, Health, Entertainment, learning and virtual travel]

Tera-scale Platform Vision
[Figure: platform block diagram. Labels: Special Purpose Engines, Integrated IO devices, Cache, Scalable On-die Interconnect Fabric, Last Level Cache, Integrated Memory Controllers, Off Die interconnect, Socket Inter-Connect, High Bandwidth Memory, IO]

Silicon Process Technology Innovation
Node: 65nm 45nm 32nm 22nm 14nm 10nm 7nm
Year: 2005 2007 2009 2011 2013* 2015* 2017* 2019+
Phases: MANUFACTURING, DEVELOPMENT, RESEARCH
Innovations marked: Hi-K, Tri-Gate
*projected
Process innovation leads to energy efficient performance and predictable 2-year technology cycles

22nm Performance and Energy Scaling
M. Bohr, Intel Developer Forum 2012

Silicon Integration Providing Greater End-User Value
• More transistors/area: enables substantial system-on-chip integration opportunities

Extreme Scale (Exa-Scale) Computing Research
2W – 100 GigaFLOPS
20MW - ExaFLOPS
10 year goal: ~300X improvement in energy efficiency
Equal to 20 pJ/FLOP at the system level
J. Rattner, ISCA 2012 Keynote

Ultra Low Power Graphics/Video & Security Circuits
10-100X higher performance/watt vs. GP cores
Intel ISSCC, VLSI 2008-2012
[Figure: flexibility vs. energy-efficiency (GOPS/W). Groups: Microprocessors, DSPs, Dedicated HW; annotations: 10x, 100x, "More flexible…", "More efficient…". Data point labels: P4, x86, PPC, MUD, Alpha, Sparc, Sparc1, Sparc2, MPEG2, Itanium, 802.11a, PPC770, PPC970, Encrypt, SA-DSP, Fuj-DSP, Cell-SPE, Fuj-Multi, Video ME, Video, NEC-DSP, PPC1-SOI, PPC2-SOI, KAIST-DSP, Hitachi-DSP. Source: ISSCC]
[Figure labels: SIMD Vector, AES Encryption, SIMD Permutation]
DSP functions highly throughput-oriented: amenable for parallelism/pipelining
⇒ Better power-performance optimization
⇒ Optimal partitioning of tasks between GP processor and dedicated engines

Specialized HW Accelerators for Exa-Scale
General purpose cores, special-purpose accelerators, interconnect fabric
Efficient, adaptive, reconfigurable, resilient
Low-power general-purpose core
SP HW accelerators
Fixed function vs. limited programmability
Operation over wide supply voltage range (near-threshold to nominal)

NTV Operation & Energy Efficiency
[Figure: 65nm CMOS, 50°C, two panels vs. supply voltage (0.2 V to 1.4 V). Panel 1: maximum frequency (MHz) and total power (mW). Panel 2: energy efficiency (GOPS/Watt) and active leakage power (mW). Annotations: 9.6X, 320mV, subthreshold region]
H. Kaul, R. Krishnamurthy et al, ISSCC 2008
• Frequency reduces almost linearly first, then exponentially
• Total power reduces by three to four orders of magnitude
• Leakage power reduces by two to three orders of magnitude
• Energy efficiency improves by one order of magnitude at NTV
• Energy efficiency reduces in subthreshold operation

NTV Across Technology Generations
H. Kaul, et al., ISSCC 2009
S. K. Hsu, et al., ISSCC 2012
A. Agarwal, et al., ISSCC 2010
[Figure: three panels of energy efficiency and active leakage power (mW) vs. supply voltage (V), each with a sub-threshold region marked. Panel titles: "45nm CMOS, 50°C" (32b Multiply, 16b SIMD Multiply, 72b Add; normalized energy efficiency; Vhi/Vlo axes); "22nm CMOS, 50°C" (Register File, Permute Crossbar; energy efficiency in GOPS/W); "Reconfigurable Fabric, 32nm CMOS, 50°C" (energy efficiency in TOPS/W). Annotations: 8X, 9x, 5.7x, 300mV, 340mV, 1.1V, 0.8mW]
NTV operation improves energy efficiency across 45nm-22nm CMOS

NTV Opportunities for Wide Dynamic Range
T. Thakkar, Intel Developer Forum 2012

ATOM™ 32nm SOC V/F Islands
SoC integration of many unrelated functions in their own power 'islands'.
• On-die voltage regulation leading to power 'islands' that can have different voltage levels.
• Power management that shuts functional units off.
• Voltage-Frequency pairs; CPUs can be run in several operating points where the power supply is adjusted to reduce power while keeping various functional blocks at constant voltage:
– lowest frequency: 100 - 600MHz
– medium frequency: 700 - 1500MHz
– burst frequency: 1600 – 2500MHz
• Off-chip drivers have to support various voltage levels whereas the controller logic is powered by a lower voltage:
– LPDDR: 1.25V
– MIPI-display: 1.25V
– HDMI-display: 3.3V
– SD cards: 2.85V
– GPIO: 1.25V, 1.80V
[Figure: SoC floorplan. Labels: DDR, GPIO, south complex, audio, security, CPU, EMMC, NC, 2D/3D graphics, PLLs, clocks, video, Image Signal Processor, display, HDMI, MIPI]
T. Thakkar, Medfield, Intel Developer Forum 2012

NTV Opportunities for Converged Core
T. Piazza, Intel Developer Forum 2012

Impact of Variation on NTV
[Figure: relative frequency and frequency spread (0% to 60%) vs. relative Vdd for +/- 5% variation in Vdd or Vt; circuit vulnerability to 5% noise as Vdd scales towards threshold]
$$\text{frequency} \propto \frac{(V_{dd} - V_t)^{\alpha}}{V_{dd}}$$
5% variation in Vt or Vdd results in up to 50% variation in circuit performance

Variation Modeling & Measurements
[Figure, panel 1: Monte-Carlo simulations, 65nm CMOS, 50°C; normalized distribution vs. normalized frequency; frequency variation across fast-slow dies: ±18% at 1.2V, ±2X at 320mV]
[Figure, panel 2: 65nm CMOS typical die measurements; maximum frequency (MHz) vs. supply voltage (V); frequency variation across 0-110°C: ±5% at 1.2V, ±2X at 320mV]
H. Kaul, R. Krishnamurthy et al, ISSCC 2008
Monte-Carlo simulations:
• 18% nominal frequency spread
• 2X spread at NTV
65nm CMOS measurements:
• 5% nominal spread due to temperature
• 2X spread at NTV

Using Vdd to Compensate for Variation
[Figure, panel 1: frequency (MHz) vs. temperature (0, 50, 110 C); 65nm CMOS, 320mV, typical die; 23MHz reference]
[Figure, panel 2: frequency (MHz) vs. process skew (Slow, Typical, Fast); 65nm CMOS, 320mV, 50C; 23MHz reference]
• Adaptive Voltage Compensation for variation tolerance
• Adjust supply voltage to maintain constant performance
• ±50mV adjustment about 320mV: nominal 23MHz performance sustained across 0-110°C and fast-slow skews

Subthreshold Leakage at NTV
[Figure: SD leakage power (0% to 60%) across 45nm, 32nm, 22nm, 14nm, 10nm, 7nm, 5nm for supplies of 40% Vdd, 50% Vdd, 75% Vdd, 100% Vdd; annotation: increasing variations]
NTV operation reduces total power, improves energy efficiency
Subthreshold leakage power is substantial portion of the total

Low Voltage SRAM and Register File
6T SRAM suffers stability and yield at NTV
6T SRAM cell with larger transistors
8T/10T SRAM for improved stability and yield
Variation tolerant register file for NTV
[Figure: conventional dual-ended (DE) write cell (write failure due to strong P and weak N) and dual-ended transmission gate (DETG) write cell; signals: wrbl, wrbl#, rdbl]
S. Hsu, R. Krishnamurthy et al, ISSCC 2012

Low Voltage Latches and Flip-flops
Designing flip-flops for NTV
Averaging with vector flip-flops
[Figure: flip-flop schematics with D, Q, Ck. Labels: Upsized, Shared min-sized clock drivers, Non-minimum Channel Length]
Vmin improves by 175 mV
Hold time margin by 7 to 30%

Low Voltage Logic: Multiplexers & Gates
Designing multiplexers for NTV
Issue: large off-current paths, weak on-current paths
[Figure: One-hot 4:1 and Encoded 4:1 multiplexers]
Up to 3X reduction in worst case static droop
Transmission gates, logic gates
Body effect
Avoid series connected transmission gates
Logic fan in limited to 3 stack

Low Voltage Level Converters
CVSL Level Converter
[Figure: CVSL level converter between a Low Voltage Circuit Block and a High Voltage Circuit Block]
Significant energy consumed in contention currents
[Figure: level shifter schematics. Labels: VCC LOW, VCC MID, VCC HIGH, CVSL Stage, IN, OUT]
H. Kaul, R. Krishnamurthy et al, ISSCC 2009
Two-stage cascaded split-output level shifter:
• CVSL split into two stages to reduce contention current
• Decoupled output for smaller CVSL devices
• 20% energy reduction
Ultra-low voltage split-output level shifter:
• Decoupled output from CVSL
• Interrupts contention
• Vmin improved by 125 mV

Soft Errors and Reliability
[Figure, panel 1: relative n-SER/cell (sea-level) vs. voltage (V) for 250nm, 180nm, 130nm, 90nm, 65nm]
[Figure, panel 2: Latch and Memory soft error, relative to 130nm, vs. technology (180, 130, 90, 65, 45, 32 nm), assuming 2X bit/latch count increase per generation]
Soft error/bit reduces each generation
Soft error at the system level will continue to increase
Impact of NTV on soft error rate
Positive impact of NTV on reliability
Low V ⇒ lower E fields, low power ⇒ lower temperature
Device aging effects mitigated
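
As a rough illustration of the frequency relation on the "Impact of Variation on NTV" slide and the supply adjustment described under "Using Vdd to Compensate for Variation", the following Python sketch evaluates frequency ∝ (Vdd − Vt)^α / Vdd and searches for the supply voltage that restores a target frequency after a threshold shift. The threshold voltage (0.30 V), α = 1.5, the ±5% Vt shift, and all function names are assumptions chosen for illustration, not values or methods taken from the slides. The relation only applies above threshold, so the sketch says nothing about subthreshold operation.

```python
def relative_frequency(vdd, vt, alpha=1.5):
    """Frequency in arbitrary units from f ∝ (Vdd - Vt)^alpha / Vdd."""
    if vdd <= vt:
        raise ValueError("relation only covers operation above threshold (vdd > vt)")
    return (vdd - vt) ** alpha / vdd


def frequency_spread(vdd, vt, vt_variation=0.05, alpha=1.5):
    """Fractional spread (fast - slow) / nominal for a +/- vt_variation shift in Vt."""
    nominal = relative_frequency(vdd, vt, alpha)
    fast = relative_frequency(vdd, vt * (1 - vt_variation), alpha)
    slow = relative_frequency(vdd, vt * (1 + vt_variation), alpha)
    return (fast - slow) / nominal


def compensating_vdd(target, vt, alpha=1.5, hi=2.0, tol=1e-6):
    """Bisection search for the Vdd at which the relation reaches `target`.

    Relies on the frequency expression increasing monotonically with Vdd
    above threshold (true for alpha >= 1).
    """
    lo = vt + 1e-9
    if relative_frequency(hi, vt, alpha) < target:
        raise ValueError("target frequency not reachable at or below `hi`")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if relative_frequency(mid, vt, alpha) < target:
            lo = mid
        else:
            hi = mid
    return hi


VT = 0.30  # assumed threshold voltage in volts (illustrative only)

# Spread grows as the supply approaches the threshold.
for vdd in (1.2, 0.9, 0.6, 0.45, 0.40):
    print(f"Vdd = {vdd:.2f} V  spread = {frequency_spread(vdd, VT):.1%}")

# Supply needed to hold the nominal 0.40 V frequency on a die whose Vt is 5% high.
target = relative_frequency(0.40, VT)
print(f"compensated Vdd = {compensating_vdd(target, VT * 1.05):.3f} V")
```

With these assumed parameters the same ±5% threshold shift produces a much larger frequency spread near threshold than at nominal supply, and a modest supply increase restores the target frequency on the slow die, which is the qualitative behaviour the two slides describe.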
