Management & Monitoring
Dr. Andrea Bartolini

Dynamic Power

• Linear ↓ with ↓ C_effective
• Linear ↓ with ↓ f
• Quadratic ↓ with ↓ Vdd
• Cubic ↓ with ↓ both Vdd and f
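The scaling rules above can be checked numerically. The sketch below assumes the usual switching-power expression P_dyn = a × C × Vdd² × f, with a hypothetical activity factor `a` and illustrative values for capacitance, voltage and frequency. The "cubic" case assumes Vdd is lowered in proportion to f.

```python
def dynamic_power(activity, c_eff, vdd, freq):
    """Dynamic power in watts: P_dyn = a * C_effective * Vdd^2 * f."""
    return activity * c_eff * vdd ** 2 * freq


# Illustrative operating point (assumed values, not from the slides).
A, C_EFF, VDD, F = 0.2, 1e-9, 1.0, 2e9

base = dynamic_power(A, C_EFF, VDD, F)
ratio_half_f = dynamic_power(A, C_EFF, VDD, F / 2) / base            # linear in f
ratio_half_v = dynamic_power(A, C_EFF, VDD / 2, F) / base            # quadratic in Vdd
ratio_half_both = dynamic_power(A, C_EFF, VDD / 2, F / 2) / base     # cubic in both

print(ratio_half_f, ratio_half_v, ratio_half_both)
```

Halving f alone halves the power, halving Vdd alone quarters it, and halving both leaves one eighth.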

David H. Albonesi, ACACES10

Sub-threshold Leakage Current

• Proportional to area (~Area)
• Exponential ↓ with ↓ Vgs (~Vdd)
• Exponential ↓ with ↑ VTH
• Exponential ↓ with ↓ T
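The three exponential trends can be illustrated with a simplified textbook expression for sub-threshold current, I ∝ exp((Vgs − Vth) / (n × kT/q)). This is a sketch under stated assumptions, not the model used in the slides: the prefactor `i0` and slope factor `n` are hypothetical, and the temperature dependence of the prefactor and of Vth is ignored.

```python
import math

K_B_OVER_Q = 8.617333262e-5  # Boltzmann constant over electron charge, in V/K


def subthreshold_leakage(vgs, vth, temp_k, i0=1.0, n=1.5):
    """Relative sub-threshold current for a gate voltage below Vth.

    i0 (prefactor) and n (slope factor) are illustrative assumptions.
    """
    v_thermal = K_B_OVER_Q * temp_k  # thermal voltage kT/q, about 26 mV at 300 K
    return i0 * math.exp((vgs - vth) / (n * v_thermal))


reference = subthreshold_leakage(vgs=0.1, vth=0.3, temp_k=350.0)
lower_vgs = subthreshold_leakage(vgs=0.0, vth=0.3, temp_k=350.0)
higher_vth = subthreshold_leakage(vgs=0.1, vth=0.4, temp_k=350.0)
lower_temp = subthreshold_leakage(vgs=0.1, vth=0.3, temp_k=300.0)

# Each change reduces the leakage relative to the reference point.
print(lower_vgs < reference, higher_vth < reference, lower_temp < reference)
```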

Is it the same across the whole die?

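• Variability

Alpha-Power Thermal Model

Delay:

$$D_p \propto \frac{C_{out} V_{dd}}{I_{ON}} \propto \frac{C_{out} V_{dd}}{\mu(T)\,[V_{dd} - V_{th}(T)]^{\alpha}}$$

Carrier Mobility:

$$\mu(T) = \mu(T_0) \left( \frac{T_0}{T} \right)^{m}$$

Threshold Voltage:

$$V_{th}(T) = V_{th}(T_0) - k\,(T - T_0)$$

T ↑ ⇒ μ ↓, Vth ↓

Delay Trend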

• For wires, the resistivity depends linearly on T
  • Delay increases as T increases
• For Low VT (LVT) designs (Vdd >> Vth)
  • μ dominates over Vth
  • Delay increases as T increases
• For High VT (HVT) designs (Vdd ≈ Vth)
  • Vth dominates over μ
  • Delay decreases as T increases: Inverted Temperature Dependence (ITD)

Thermal Behavior of CMOS gates
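The two delay trends (LVT slows down when hot, HVT speeds up) follow from the alpha-power model with temperature-dependent mobility and threshold voltage. The sketch below is illustrative only: the exponents `m` and `alpha`, the coefficient `k`, and the supply and threshold voltages are assumed values, and delays are in arbitrary units.

```python
def mobility(temp_k, mu0=1.0, t0=300.0, m=1.5):
    """Carrier mobility: mu(T) = mu(T0) * (T0 / T)^m."""
    return mu0 * (t0 / temp_k) ** m


def threshold_voltage(temp_k, vth0, t0=300.0, k=1e-3):
    """Threshold voltage: Vth(T) = Vth(T0) - k * (T - T0), k in V/K."""
    return vth0 - k * (temp_k - t0)


def gate_delay(temp_k, vdd, vth0, c_out=1.0, alpha=1.3):
    """Alpha-power delay: Cout * Vdd / (mu(T) * (Vdd - Vth(T))^alpha)."""
    vth = threshold_voltage(temp_k, vth0)
    if vdd <= vth:
        raise ValueError("model only valid for Vdd > Vth")
    return c_out * vdd / (mobility(temp_k) * (vdd - vth) ** alpha)


# LVT-like case (Vdd >> Vth): the mobility loss dominates.
lvt_cold = gate_delay(300.0, vdd=1.2, vth0=0.30)
lvt_hot = gate_delay(400.0, vdd=1.2, vth0=0.30)

# HVT-like case (Vdd close to Vth): the Vth drop dominates.
hvt_cold = gate_delay(300.0, vdd=0.5, vth0=0.45)
hvt_hot = gate_delay(400.0, vdd=0.5, vth0=0.45)

print("LVT slower when hot:", lvt_hot > lvt_cold)
print("HVT faster when hot:", hvt_hot < hvt_cold)
```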

[Figure: ITD]

Attacking Dynamic Power

• Dynamic Voltage and Frequency Scaling (DVFS)
  • Reduce voltage, frequency, or both
  • Exploit slack in application execution
  • Cubic dynamic power savings
• Reduce effective switching capacitance
  • Exploit idle or underutilized hardware resources
  • Match hardware resources to application behavior
  • Linear dynamic power savings
  • Complementary to DVFS

Reduce Static Power

• Dynamic Voltage Scaling (DVS)
  • Reduce Voltage
  • Increase Threshold Voltage
  • Exponential static power savings
• Dynamic Thermal Management (DTM)
  • Reduce Temperature
  • Exponential static power savings
• Reducing the leaking component
  • Exploit idle or underutilized hardware resources
  • Match hardware resources to application behavior
  • Linear static power savings

Power Metrics

• Energy

$$E = P \times \text{Time} = P \times \text{CPU\_Time} = P \times \frac{\text{Clock\_Cycles}}{\text{Clock\_Rate}}$$

$$E = (P_{dyn} + P_{sta}) \times \frac{\text{Clock\_Cycles}}{\text{Clock\_Rate}}$$

$$E = P_{dyn} \times \frac{\text{Clock\_Cycles}}{\text{Clock\_Rate}} + P_{sta} \times \text{CPU\_Time}$$

Since $P_{dyn} = a \, C_L V_{DD}^2 \times \text{Clock\_Rate}$:

$$E = (a \, C_L V_{DD}^2) \times \text{Clock\_Cycles} + P_{sta} \times \text{CPU\_Time}$$

• Power
  • Rate of energy dissipation
• Power density
  • Power per unit area
• Temperature
  • T ~ P
• Energy per instruction (EPI)
• Energy per task
• Energy-delay product (EDP) [inverse of MIPS²/W]
• Energy-delay² (ED²) [inverse of MIPS³/W]
• Performance given a max power or thermal constraint
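The metrics listed above are simple to compute once power, cycle count and clock rate are known. The following is a minimal sketch with hypothetical function names and made-up input values. It also shows the last form of the energy equation, where dynamic energy depends on the cycle count rather than on the clock rate.

```python
def energy(p_dyn, p_static, clock_cycles, clock_rate):
    """E = (Pdyn + Psta) * Clock_Cycles / Clock_Rate, in joules."""
    cpu_time = clock_cycles / clock_rate
    return (p_dyn + p_static) * cpu_time


def energy_from_switching(activity, c_load, vdd, clock_cycles, p_static, cpu_time):
    """E = (a * CL * Vdd^2) * Clock_Cycles + Psta * CPU_Time."""
    return activity * c_load * vdd ** 2 * clock_cycles + p_static * cpu_time


def energy_per_instruction(energy_j, instruction_count):
    """EPI in joules per instruction."""
    return energy_j / instruction_count


def energy_delay_product(energy_j, delay_s, exponent=1):
    """EDP for exponent=1, ED^2 for exponent=2."""
    return energy_j * delay_s ** exponent


# Assumed example: 2 W dynamic, 0.5 W static, 1e9 cycles at 1 GHz, 5e8 instructions.
e_total = energy(p_dyn=2.0, p_static=0.5, clock_cycles=1e9, clock_rate=1e9)
delay = 1e9 / 1e9
print("Energy [J]:", e_total)
print("EPI [J/instr]:", energy_per_instruction(e_total, 5e8))
print("EDP [J*s]:", energy_delay_product(e_total, delay))
print("ED^2 [J*s^2]:", energy_delay_product(e_total, delay, exponent=2))
```

EDP and ED² weight delay more heavily than plain energy, so they penalize configurations that save energy only by running much slower.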

Strategies

• Dynamic Voltage and Frequency Scaling – DVFS

• Run Fast and Stop – Clock Gating, Power Gating, Turbo Mode

DVFS – with deadline or “on-demand governor”

Key idea: Exploit slack by scaling V & f to run evenly across a time quantum

Power Gating vs Clock Gating

• Clock Gating reduces dynamic energy by gating the clock of each flip-flop (FF) when no transition is detected.
  • Instantaneous transition, saves dynamic power
• Power Gating disconnects the circuit logic from Vdd through a «power gating» transistor. This brings the power consumption of the gated logic portion to zero.
  • No state retention: the content of the internal registers is lost when the logic is power gated. Transitions «in» and «out» of the power-gated state are therefore costly in time and energy, because the internal state must be saved and restored.

Run Fast and Stop (RFTS) vs DVFS

• Run Fast Then Stop (RFTS) is a technique where the core runs at the highest frequency until the job is finished, then stops.
• DVFS runs “low and slow” to reduce dynamic power, which scales with V^2.
• Active Power
• RFTS:
  1. Clock Gating – core continues to leak.
  2. Power Gating – core is powered off and doesn’t leak.

Run Fast and Stop (RFTS) vs DVFS - II

What is best?
1. DVFS
2. RFTS: clock gate
3. RFTS: power gate

UNCORE Power: The uncore logic accounts for all the internal CPU components which need to be active when the core is active. These components include the PLL, Memory Controllers, and Peripherals (I2C, SPI, USB, ...). The dynamic power of these components when the system is idle can be avoided by using Clock Gating. The leakage power of these components, as well as that of the Core logic, when the system is idle can be avoided by using Power Gating.

Departement Informationstechnologie und Elektrotechnik

Formula – Energy no Power Management

• Total Power

$$P_{TOT@MAX} = P_{DYN@MAX} + P_S = C \times F_{MAX} \times V_{DD}^2 + P_{UNCORE} + P_{LEAK}$$

• Application Execution Time: $\text{CPU Time}_{MAX}$

• Time in IDLE: $\text{Time}_{IDLE}$

• Total Energy:

$$E_{TOT@MAX} = P_{DYN@MAX} \times \text{CPU Time}_{MAX} + (P_{UNCORE} + P_{LEAK}) \times (\text{Time}_{IDLE} + \text{CPU Time}_{MAX})$$

$$E_{TOT@MAX} = (P_{UNCORE} + P_{LEAK} + P_{DYN@MAX}) \times \text{CPU Time}_{MAX} + (P_{UNCORE} + P_{LEAK}) \times \text{Time}_{IDLE}$$

Formula – Clock Gating

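• Total Power

$$P_{TOT@MAX} = P_{DYN@MAX} + P_S = C \times F_{MAX} \times V_{DD}^2 + P_{UNCORE} + P_{LEAK}$$

• Application Execution Time: $\text{CPU Time}_{MAX}$

• Time in IDLE: $\text{Time}_{IDLE}$

• Total Energy:

$$E_{TOT@MAX+CG} = P_{DYN@MAX} \times \text{CPU Time}_{MAX} + P_{UNCORE} \times \text{CPU Time}_{MAX} + P_{LEAK} \times (\text{Time}_{IDLE} + \text{CPU Time}_{MAX})$$

$$E_{TOT@MAX+CG} = (P_{UNCORE} + P_{LEAK} + P_{DYN@MAX}) \times \text{CPU Time}_{MAX} + P_{LEAK} \times \text{Time}_{IDLE}$$

Formula – Power Gating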

• Total Power

$$P_{TOT@MAX} = P_{DYN@MAX} + P_S = C \times F_{MAX} \times V_{DD}^2 + P_{UNCORE} + P_{LEAK}$$

• Application Execution Time: $\text{CPU Time}_{MAX}$

• Time in IDLE: $\text{Time}_{IDLE}$

• Total Energy:

$$E_{TOT@MAX+PG} = P_{DYN@MAX} \times \text{CPU Time}_{MAX} + (P_{UNCORE} + P_{LEAK}) \times \text{CPU Time}_{MAX} + E_{SAVE} + E_{RESTORE}$$

$$E_{TOT@MAX+PG} = (P_{UNCORE} + P_{LEAK} + P_{DYN@MAX}) \times \text{CPU Time}_{MAX} + E_{SAVE} + E_{RESTORE}$$

Formula – DVFS

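F = Clock Rate

Current Frequency: $F_{DVFS} < F_{MAX} \Rightarrow sl = F_{MAX} / F_{DVFS}$

• Total Power

$$P_{TOT@DVFS} = P_{DYN@DVFS} + P_S = C \times F_{DVFS} \times V_{DD@DVFS}^2 + P_{UNCORE} + P_{LEAK}$$

• Cubic power saving of dynamic power w.r.t. nominal frequency

$$P_{DYN@DVFS} = P_{DYN@MAX} \times \left(\frac{1}{sl}\right)^3$$

• $F_{DVFS}$ chosen to remove time in IDLE:

$$\text{CPU Time}_{DVFS} = \text{CPU Time}_{MAX} + \text{Time}_{IDLE@MAX}$$

$$\Rightarrow F_{DVFS} = \frac{\text{CPU Time}_{MAX} \times F_{MAX}}{\text{CPU Time}_{MAX} + \text{Time}_{IDLE@MAX}}$$

• Application Execution Time:

$$\text{CPU Time}_{DVFS} = \text{CPU Time}_{MAX} \times sl = \text{CPU Time}_{MAX} + \text{Time}_{IDLE@MAX}$$

• Total Energy:

$$E_{TOT@DVFS} = P_{DYN@MAX} \times \left(\frac{1}{sl}\right)^3 \times \text{CPU Time}_{DVFS} + (P_{UNCORE} + P_{LEAK}) \times \text{CPU Time}_{DVFS}$$

$$E_{TOT@DVFS} = P_{DYN@MAX} \times \left(\frac{1}{sl}\right)^3 \times (\text{CPU Time}_{MAX} \times sl) + (P_{UNCORE} + P_{LEAK}) \times (\text{CPU Time}_{MAX} + \text{Time}_{IDLE@MAX})$$

$$E_{TOT@DVFS} = P_{DYN@MAX} \times \left(\frac{1}{sl}\right)^2 \times \text{CPU Time}_{MAX} + (P_{UNCORE} + P_{LEAK}) \times (\text{CPU Time}_{MAX} + \text{Time}_{IDLE@MAX})$$

$$E_{TOT@DVFS} = \left(P_{UNCORE} + P_{LEAK} + P_{DYN@MAX} \times \left(\frac{1}{sl}\right)^2\right) \times \text{CPU Time}_{MAX} + (P_{UNCORE} + P_{LEAK}) \times \text{Time}_{IDLE@MAX}$$

Run Fast and Stop vs DVFS - III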

• “Break-even time” is defined as the time that the core needs to be powered off to compensate for save and restore energy.
• In high leakage situations, the power gating benefit is realized in a shorter time.

[Figure: break-even time, panels “Low Leakage” and “High Leakage”]
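The four energy formulas (no power management, clock gating, power gating, DVFS) and the break-even time can be compared with a few lines of code. This is a sketch with hypothetical function names and made-up power and energy values. The DVFS case assumes, as in the formula slide, that F_DVFS is chosen to fill the idle time and that dynamic power drops with the cube of the slowdown. The break-even expression comes from equating the clock-gating and power-gating energies, which gives Time_IDLE = (E_SAVE + E_RESTORE) / P_LEAK.

```python
def energy_no_pm(p_dyn, p_uncore, p_leak, cpu_time, idle_time):
    """Run at F_MAX, then idle with everything still powered and clocked."""
    return (p_dyn + p_uncore + p_leak) * cpu_time + (p_uncore + p_leak) * idle_time


def energy_clock_gating(p_dyn, p_uncore, p_leak, cpu_time, idle_time):
    """Run at F_MAX, then clock gate: only leakage remains while idle."""
    return (p_dyn + p_uncore + p_leak) * cpu_time + p_leak * idle_time


def energy_power_gating(p_dyn, p_uncore, p_leak, cpu_time, e_save, e_restore):
    """Run at F_MAX, then power gate: idle costs only save/restore energy."""
    return (p_dyn + p_uncore + p_leak) * cpu_time + e_save + e_restore


def energy_dvfs(p_dyn, p_uncore, p_leak, cpu_time, idle_time):
    """Slow down so that the job fills the whole interval (no idle time)."""
    sl = (cpu_time + idle_time) / cpu_time  # slowdown F_MAX / F_DVFS
    return (p_uncore + p_leak + p_dyn / sl ** 2) * cpu_time + (p_uncore + p_leak) * idle_time


def break_even_idle_time(p_leak, e_save, e_restore):
    """Idle time above which power gating beats clock gating."""
    return (e_save + e_restore) / p_leak


# Assumed values: watts, seconds and joules chosen only for illustration.
params = dict(p_dyn=2.0, p_uncore=0.5, p_leak=0.3, cpu_time=1.0)
idle, e_save, e_restore = 1.0, 0.05, 0.05

print("no PM       :", energy_no_pm(idle_time=idle, **params))
print("clock gating:", energy_clock_gating(idle_time=idle, **params))
print("power gating:", energy_power_gating(e_save=e_save, e_restore=e_restore, **params))
print("DVFS        :", energy_dvfs(idle_time=idle, **params))
print("break-even, low leakage :", break_even_idle_time(0.1, e_save, e_restore))
print("break-even, high leakage:", break_even_idle_time(0.3, e_save, e_restore))
```

Which strategy wins depends on the balance between dynamic power, static power and the save/restore cost, so the ranking printed here holds only for these assumed numbers.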

DVFS and Memory Slack

Dynamic Voltage and Frequency Scaling saves power by reducing the clock frequency. However, the clock frequency only affects the speed of the Core logic; it does not reduce the speed of the memory subsystem.

CPU Bound Applications are composed of a large set of ALU instructions and/or characterized by high data locality => High L1/L2 Cache Hit Rate

Memory Bound (MEM Bound) Applications are characterized by large data sets or complex data access patterns, with fewer ALU operations per data item accessed, and by low data locality => Low L1/L2 Cache Hit Rate

DVFS – Memory slack

[Figure: cpu, cache and dram activity over time at High Frequency vs Low Frequency]

CPU BOUND APP – High IPC
• Performance Loss
• Power reduction
• Energy Efficiency Loss!

MEMORY BOUND APP – Low IPC
• Same Performance
• Power reduction
• Energy Efficiency Gain!

DVFS + Memory Slack
• Power Saving
• No performance Loss
• Higher Energy Efficiency

Formula – DVFS – with CPI (I)

$$\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles Per Instruction}$$

$$\text{CPU Time} = \text{Instruction Count} \times \frac{CPI}{\text{Clock Rate}}$$

$$\text{Cycles Per Instruction} = \text{Compute Cycles} + \text{Data Access Cycles}$$

Compute Cycles scale with the Clock Rate (Frequency).

Data Access Cycles depend on memory speed and usually do not scale with the Clock Rate (Frequency).

$$\text{ALU Time} = \frac{\text{Compute Cycles}}{\text{Clock Rate}_{CURRENT}}$$

$$\text{MEM Time} = \frac{\text{Data Access Cycles}}{\text{Clock Rate}_{MAX}}$$

$$\text{CPU Time} = \text{ALU Time} + \text{MEM Time}$$

Formula – DVFS – with CPI (II)

Assuming all the instructions are executed in 1 cycle (in general not true for MUL, DIV), and since

$$\text{Cycles Per Instruction} = \text{Compute Cycles} + \text{Data Access Cycles}$$

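$$\text{Compute Cycles} = 1 \Rightarrow \text{ALU Time} = \frac{1}{\text{Clock Rate}_{CURRENT}}$$

$$\text{Data Access Cycles} = CPI - 1 \Rightarrow \text{MEM Time} = \frac{CPI - 1}{\text{Clock Rate}_{MAX}}$$

- Execution Time without DVFS:

$$\text{CPU Time}_{MAX} = \text{Instruction Count} \times \frac{CPI}{\text{Clock Rate}_{MAX}}$$

- Execution Time with DVFS:

$$\text{CPU Time}_{DVFS} = \text{Instruction Count} \times \left[ \frac{1}{\text{Clock Rate}_{DVFS}} + \frac{CPI - 1}{\text{Clock Rate}_{MAX}} \right]$$

- If MEM Bound Application (CPI >> 1):

$$\text{CPU Time}_{DVFS} \cong \text{Instruction Count} \times \frac{CPI - 1}{\text{Clock Rate}_{MAX}}$$

$$\text{CPU Time}_{DVFS} \cong \text{CPU Time}_{MAX}$$

Formula – DVFS – with Memory Bound App

F = Clock Rate

Current Frequency: $F_{DVFS} < F_{MAX} \Rightarrow sl = F_{MAX} / F_{DVFS}$

• Total Power

$$P_{TOT@DVFS} = P_{DYN@DVFS} + P_S = C \times F_{DVFS} \times V_{DD@DVFS}^2 + P_{PLL} + P_{LEAK}$$

• Cubic power saving of dynamic power w.r.t. nominal frequency

$$P_{DYN@DVFS} = P_{DYN@MAX} \times \left(\frac{1}{sl}\right)^3$$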

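• $F_{DVFS}$ chosen as the minimum available => Maximum power saving

• Application Execution Time: $\text{CPU Time}_{DVFS} = \text{CPU Time}_{MAX}$

• Total Energy:

$$E_{TOT@DVFS} = P_{DYN@MAX} \times \left(\frac{1}{sl}\right)^3 \times \text{CPU Time}_{MAX} + (P_{PLL} + P_{LEAK}) \times (\text{CPU Time}_{MAX} + T_{IDLE})$$

• With clock gating:

$$E_{TOT@DVFS+CG} = P_{DYN@MAX} \times \left(\frac{1}{sl}\right)^3 \times \text{CPU Time}_{MAX} + P_{PLL} \times \text{CPU Time}_{MAX} + P_{LEAK} \times (\text{CPU Time}_{MAX} + T_{IDLE})$$

• With power gating:

$$E_{TOT@DVFS+PG} = P_{DYN@MAX} \times \left(\frac{1}{sl}\right)^3 \times \text{CPU Time}_{MAX} + (P_{PLL} + P_{LEAK}) \times \text{CPU Time}_{MAX} + E_{SAVE} + E_{RESTORE}$$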
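The CPI-based timing model and the memory-bound energy estimate can be tried out with concrete numbers. The sketch below uses hypothetical function names and illustrative instruction counts, CPI values and power figures. It keeps the slide's simplifications: one compute cycle per instruction, memory time independent of the core clock, and dynamic power falling with the cube of the slowdown. Only the total energy without gating is modelled, with `p_static` standing for the always-on uncore/PLL power plus leakage.

```python
def cpu_time(instruction_count, cpi, f_current, f_max):
    """CPU time when only the single compute cycle scales with f_current.

    The remaining (CPI - 1) data access cycles are timed at f_max, because
    memory speed is assumed not to follow the core clock.
    """
    return instruction_count * (1.0 / f_current + (cpi - 1.0) / f_max)


def energy_dvfs_memory_bound(p_dyn_max, p_static, cpu_time_max, idle_time, f_dvfs, f_max):
    """Total energy with DVFS when execution time stays close to CPU Time_MAX."""
    sl = f_max / f_dvfs
    return p_dyn_max / sl ** 3 * cpu_time_max + p_static * (cpu_time_max + idle_time)


IC, F_MAX, F_DVFS = 1e9, 2e9, 1e9  # assumed workload size and frequencies

for label, cpi in (("CPU bound (CPI = 1)", 1.0), ("MEM bound (CPI = 20)", 20.0)):
    t_max = cpu_time(IC, cpi, F_MAX, F_MAX)
    t_dvfs = cpu_time(IC, cpi, F_DVFS, F_MAX)
    print(f"{label}: slowdown = {t_dvfs / t_max:.2f}")

t_max_mem = cpu_time(IC, 20.0, F_MAX, F_MAX)
e_at_max = energy_dvfs_memory_bound(2.0, 0.8, t_max_mem, 0.0, F_MAX, F_MAX)
e_at_dvfs = energy_dvfs_memory_bound(2.0, 0.8, t_max_mem, 0.0, F_DVFS, F_MAX)
print("memory-bound energy at F_MAX :", e_at_max)
print("memory-bound energy at F_DVFS:", e_at_dvfs)
```

With these assumed numbers, halving the frequency doubles the run time of the CPU-bound case but lengthens the memory-bound case by only a few percent, which is the memory slack DVFS can exploit.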