Memory Power Management via Dynamic Voltage/Frequency Scaling

Howard David (Intel), Eugene Gorbatov (Intel), Chris Fallin (CMU), Ulf R. Hanebutte (Intel), Onur Mutlu (CMU)

Memory Power is Significant
- Power consumption is a primary concern in modern servers
- Many works: CPU, whole-system, or cluster-level approaches
- But memory power is largely unaddressed
- Our server system*: memory is 19% of system power (avg)
  - Some work notes up to 40% of total system power
- Goal: Can we reduce this figure?

[Figure: System power vs. memory power (W) across SPEC CPU2006 benchmarks]

* Dual 4-core Intel Xeon, 48 GB DDR3 (12 DIMMs), SPEC CPU2006, all cores active.
  Measured AC power; analytically modeled memory power.

Existing Solution: Memory Sleep States?
- Most memory energy-efficiency work uses sleep states

  - Shut down DRAM devices when no memory requests are active
- But even low-memory-bandwidth workloads keep memory awake
  - Idle periods between requests diminish in multicore workloads

  - CPU-bound workloads/phases are rarely completely cache-resident

[Figure: Time spent in sleep states (%) across SPEC CPU2006 benchmarks (y-axis 0-8%)]

Memory Bandwidth Varies Widely
- Workload memory bandwidth requirements vary widely

[Figure: Memory bandwidth for SPEC CPU2006, in bandwidth/channel (GB/s)]

- Memory system is provisioned for peak capacity → often underutilized

Memory Power can be Scaled Down
- DDR can operate at multiple frequencies → reduce power

  - Lower frequency directly reduces switching power
  - Lower frequency allows for lower voltage
  - Comparable to CPU DVFS:

        CPU voltage/freq.  ↓ 15%  →  system power ↓ 9.9%
        Memory freq.       ↓ 40%  →  system power ↓ 7.6%

- Frequency scaling increases latency → reduces performance
  - Memory storage array is asynchronous
  - But bus transfer depends on frequency
  - When bus bandwidth is the bottleneck, performance suffers

Observations So Far
- Memory power is a significant portion of total power

  - 19% (avg) in our system; up to 40% noted in other works
- Sleep state residency is low in many workloads
  - Multicore workloads reduce idle periods
  - CPU-bound applications send requests frequently enough to keep memory devices awake
- Memory bandwidth demand is very low in some workloads
- Memory power is reduced by frequency scaling
  - And voltage scaling can give further reductions

DVFS for Memory
- Key Idea: observe memory bandwidth utilization, then adjust memory frequency/voltage to reduce power with minimal performance loss

  → Dynamic Voltage/Frequency Scaling (DVFS) for memory
- Goal in this work:
  - Implement DVFS in the memory system, by:
  - Developing a simple control algorithm to exploit the opportunity for reduced memory frequency/voltage by observing memory bandwidth behavior
  - Evaluating the proposed algorithm on a real system

Outline
- Motivation

- Background and Characterization
  - DRAM Operation
  - DRAM Power
  - Frequency and Voltage Scaling
- Performance Effects of Frequency Scaling
- Frequency Control Algorithm
- Evaluation and Conclusions

Outline (recap): next, Background and Characterization

DRAM Operation
- Main memory consists of DIMMs of DRAM devices
- Each DIMM is attached to a memory bus (channel)
- Multiple DIMMs can connect to one channel

[Diagram: a DIMM of eight x8 DRAM devices on a 64-bit memory bus (channel)]

Inside a DRAM Device

[Diagram: a DRAM device, showing banks (row decoder, sense amps, column decoder), I/O circuitry (registers, receivers, write FIFO, drivers), and on-die termination (ODT)]

- Banks
  - Independent arrays
  - Asynchronous: independent of memory bus speed
- I/O Circuitry
  - Runs at bus speed
  - Clock sync/distribution
  - Bus drivers and receivers
  - Buffering/queueing
- On-Die Termination (ODT)
  - Required by bus electrical characteristics for reliable operation
  - Resistive element that dissipates power when the bus is active

Effect of Frequency Scaling on Power
- Reduced memory bus frequency:
  - Does not affect bank power:
    - Constant energy per operation
    - Depends only on utilized memory bandwidth
  - Decreases I/O power:
    - Dynamic power in the bus interface and clock circuitry reduces due to less frequent switching
  - Increases termination power:
    - The same data takes longer to transfer
    - Hence, bus utilization increases
- Tradeoff between I/O and termination results in a net power reduction at lower frequencies

Effects of Voltage Scaling on Power
- Voltage scaling further reduces power because all parts of the memory device draw less current at a lower voltage
- Voltage reduction is possible because stable operation requires a lower voltage at a lower frequency:

[Figure: Minimum stable voltage for 8 DIMMs in a real system, and the Vdd values assumed by the power model; DIMM voltage (V) at 1333 MHz, 1066 MHz, and 800 MHz]
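To make the combined frequency and voltage effects concrete, here is a rough, illustrative per-DIMM power model in Python. It is not the paper's calibrated analytical model; every coefficient and the voltage values are hypothetical placeholders chosen only to show the trends (bandwidth-dependent bank power, V^2·f-dependent I/O power, utilization-dependent termination power). The peak channel bandwidths are approximate DDR3 values.

    # Toy per-DIMM power model: illustrates the trends above, NOT the paper's
    # calibrated analytical model. All coefficients are hypothetical.
    PEAK_GBPS = {800: 6.4, 1066: 8.5, 1333: 10.7}       # approx. 64-bit channel peak bandwidth
    VDD = {800: 1.35, 1066: 1.425, 1333: 1.5}           # hypothetical stable voltages (V)

    def dimm_power(freq_mhz, bw_gbps):
        util = min(bw_gbps / PEAK_GBPS[freq_mhz], 1.0)  # utilization rises as frequency drops
        bank = 0.8 * bw_gbps                            # bank/array power: bandwidth only
        io = 1.5 * VDD[freq_mhz] ** 2 * freq_mhz / 1333 # I/O + clock: switching power ~ V^2 * f
        term = 1.2 * util                               # termination power ~ bus utilization
        return bank + io + term + 0.5                   # plus a constant background term

    for f in (1333, 1066, 800):                         # at low bandwidth, lower frequency wins
        print(f, round(dimm_power(f, 1.0), 2))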

Outline (recap): next, Performance Effects of Frequency Scaling

How Much Memory Bandwidth is Needed?

[Figure: Memory bandwidth for SPEC CPU2006, in bandwidth/channel (GB/s)]

Performance Impact of Static Frequency Scaling
- Performance impact is proportional to bandwidth demand
- Many workloads tolerate lower frequency with minimal performance drop

[Figure: Performance drop (%) under static frequency scaling, 1333→800 and 1333→1066, across SPEC CPU2006 benchmarks]

[Figure: the same static-frequency-scaling performance-drop data with the y-axis zoomed to 0-8%]

Outline (recap): next, Frequency Control Algorithm

Memory Latency Under Load
- At low load, most time is in array access and bus transfer → small constant offset between the latency curves at different bus frequencies
- As load increases, queueing delay begins to dominate → bus frequency significantly affects latency

[Figure: Memory latency (ns) as a function of utilized channel bandwidth (MB/s) at 800, 1066, and 1333 MHz]
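The shape of these curves can be reproduced with a toy queueing picture (illustrative only, not the paper's measurements): a bus-speed-independent array access, a transfer time inversely proportional to bus frequency, and an M/M/1-style queueing term that dominates as the channel approaches saturation.

    # Toy latency model: constant array access + frequency-dependent transfer
    # + M/M/1-style queueing delay. Illustrative only; the constants are made up.
    def latency_ns(bw_mbps, freq_mhz):
        peak_mbps = freq_mhz * 8                 # 64-bit bus: 8 bytes per transfer
        array = 45.0                             # array access, independent of bus speed
        transfer = 64 * 1000.0 / peak_mbps       # 64 B cache-line transfer time (ns)
        rho = min(bw_mbps / peak_mbps, 0.99)     # channel utilization
        return array + transfer + transfer * rho / (1 - rho)

    for f in (800, 1066, 1333):                  # small offset at low load, divergence at high load
        print(f, [round(latency_ns(bw, f), 1) for bw in (1000, 4000, 6000)])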

Control Algorithm: Demand-Based Switching

[Figure: the latency-vs-bandwidth curves again, with the switching thresholds T800 and T1066 marked on the bandwidth axis]

After each epoch of length T_epoch:
    Measure per-channel bandwidth BW
    if BW < T800:        switch to 800 MHz
    else if BW < T1066:  switch to 1066 MHz
    else:                switch to 1333 MHz
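A minimal, self-contained sketch of this epoch-based policy follows. The two platform hooks (read_channel_bandwidth, set_memory_frequency) are hypothetical stand-ins for the memory-controller/BIOS mechanism described on the next slide; the thresholds shown are the BW(0.5, 1) setting used later in the evaluation.

    import time

    T_EPOCH_S = 0.010                      # 10 ms epoch (the setting used in the evaluation)
    T800, T1066 = 0.5e9, 1.0e9             # thresholds in bytes/s: the BW(0.5, 1) variant

    def pick_frequency(bw_bytes_per_s):
        if bw_bytes_per_s < T800:
            return 800
        if bw_bytes_per_s < T1066:
            return 1066
        return 1333

    def control_loop(read_channel_bandwidth, set_memory_frequency):
        # Both arguments are hypothetical platform hooks, not a real driver API.
        current = 1333
        while True:
            time.sleep(T_EPOCH_S)
            bw = read_channel_bandwidth()      # average bytes/s over the last epoch
            target = pick_frequency(bw)
            if target != current:              # each switch costs a ~20 us V/F transition
                set_memory_frequency(target)
                current = target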

Implementing V/F Switching
- Halt Memory Operations
  - Pause requests
  - Put DRAM in Self-Refresh
  - Stop the DIMM clock
- Transition Voltage/Frequency
  - Begin voltage ramp
  - Relock memory controller PLL at new frequency
  - Restart DIMM clock
  - Wait for DIMM PLLs to relock
- Begin Memory Operations
  - Take DRAM out of Self-Refresh
  - Resume requests

Implementing V/F Switching: notes
- Memory frequency is already adjustable statically
- Voltage regulators for CPU DVFS can work for memory DVFS
- A full transition takes ~20 µs
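As a rough sanity check, assuming the 10 ms epoch used later in the evaluation and at most one transition per epoch: a ~20 µs transition every 10 ms stalls memory traffic for at most 20 µs / 10 ms = 0.2% of the time, consistent with the small performance impact reported in the evaluation.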

Outline (recap): next, Evaluation and Conclusions

Evaluation Methodology
- Real-system evaluation
  - Dual 4-core Intel Xeon®, 3 memory channels/socket
  - 48 GB of DDR3 (12 DIMMs, 4 GB dual-rank, 1333 MHz)
- Emulating memory frequency for performance
  - Altered memory controller timing registers (tRC, tB2BCAS)
  - Gives performance equivalent to slower memory frequencies
- Modeling power reduction
  - Measure baseline system (AC power meter, 1 s samples)
  - Compute reductions with an analytical model (see paper)

Evaluation Methodology (cont.)
- Workloads
  - SPEC CPU2006: CPU-intensive workloads
  - All cores run a copy of the benchmark
- Parameters
  - T_epoch = 10 ms
  - Two variants of the algorithm with different switching thresholds:
    - BW(0.5, 1): T800 = 0.5 GB/s, T1066 = 1 GB/s
    - BW(0.5, 2): T800 = 0.5 GB/s, T1066 = 2 GB/s → more aggressive frequency/voltage scaling
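In terms of the controller sketched earlier, the two variants only change the threshold constants: BW(0.5, 1) corresponds to T800, T1066 = 0.5e9, 1.0e9 and BW(0.5, 2) to T800, T1066 = 0.5e9, 2.0e9 (bytes/s); raising T1066 sends more epochs to the two lower frequencies.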

Performance Impact of Memory DVFS
- Minimal performance degradation: 0.2% (avg), 1.7% (max)

- Experimental error is ~1%

[Figure: Performance degradation (%) for BW(0.5,1) and BW(0.5,2) across SPEC CPU2006 benchmarks and their average]

Memory Frequency Distribution
- Frequency distribution shifts toward higher memory frequencies for more memory-intensive benchmarks

[Figure: fraction of time (%) spent at 800, 1066, and 1333 MHz for each SPEC CPU2006 benchmark]

Memory Power Reduction
- Memory power reduces by 10.4% (avg), 20.5% (max)

[Figure: Memory power reduction (%) for BW(0.5,1) and BW(0.5,2) across SPEC CPU2006 benchmarks and their average]

System Power Reduction
- As a result, system power reduces by 1.9% (avg), 3.5% (max)

[Figure: System power reduction (%) for BW(0.5,1) and BW(0.5,2) across SPEC CPU2006 benchmarks and their average]

System Energy Reduction
- System energy reduces by 2.4% (avg), 5.1% (max)

[Figure: System energy reduction (%) for BW(0.5,1) and BW(0.5,2) across SPEC CPU2006 benchmarks and their average]

Related Work
- MemScale [Deng11], concurrent work (ASPLOS 2011)
  - Also proposes memory DVFS
  - Uses an application performance impact model to decide voltage and frequency: requires specific modeling for a given system; our bandwidth-based approach avoids this complexity
  - Simulation-based evaluation; our work is a real-system proof of concept
- Memory sleep states (creating opportunity with data placement [Lebeck00, Pandey06], OS scheduling [Delaluz02], VM subsystem [Huang05]; making better decisions with better models [Hur08, Fan01])
- Power limiting/shifting (RAPL [David10] uses memory throttling for thermal limits; CPU throttling for memory traffic [Lin07, Lin08]; power shifting across the system [Felter05])

Conclusions
- Memory power is a significant component of system power
  - 19% average in our evaluation system, up to 40% in other work
- Workloads often keep memory active but underutilized
  - Channel bandwidth demands are highly variable
  - Use of memory sleep states is often limited
- Scaling memory frequency/voltage can reduce memory power with minimal system performance impact
  - 10.4% average memory power reduction
  - Yields 2.4% average system energy reduction
- Greater reductions are possible with a wider frequency/voltage range and better control algorithms

Memory Power Management via Dynamic Voltage/Frequency Scaling
Howard David (Intel), Eugene Gorbatov (Intel), Chris Fallin (CMU), Ulf R. Hanebutte (Intel), Onur Mutlu (CMU)

Why Real-System Evaluation?
- Advantages:
  - Capture all effects of altered memory performance
    - System/kernel code, interactions with I/O, etc.
  - Able to run full-length benchmarks (SPEC CPU2006) rather than short instruction traces
  - No concerns about architectural simulation fidelity
- Disadvantages:
  - More limited room for novel algorithms and detailed measurements
  - Inherent experimental error due to background-task noise, real power measurements, nondeterministic timing effects
- For a proof of concept, we chose to run on a real system in order to have results that capture all potential side effects of altering memory frequency

CPU-Bound Applications in a DRAM-Rich System
- We evaluate CPU-bound workloads with 12 DIMMs: what about smaller memory, or IO-bound workloads?
- 12 DIMMs (48 GB): are we magnifying the problem?
  - Large servers can have this much memory, especially for database or enterprise applications
  - Memory can be up to 40% of system power [1,2], and reducing its power in general is an academically interesting problem
- CPU-bound workloads: will it matter in real life?
  - Many workloads have CPU-bound phases (e.g., database scan or business logic in server workloads)
  - Focusing on CPU-bound workloads isolates the problem of varying memory bandwidth demand while memory cannot enter sleep states, and our solution applies to any compute phase of a workload

[1] L. A. Barroso and U. Hölzle. "The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines." Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2009.
[2] C. Lefurgy et al. "Energy Management for Commercial Servers." IEEE Computer, pp. 39-48, December 2003.

Combining Memory & CPU DVFS?
- Our evaluation did not incorporate CPU DVFS:
  - Need to understand the effect of a single knob (memory DVFS) first
  - Combining with CPU DVFS might produce second-order effects that would need to be accounted for
- Nevertheless, memory DVFS is effective by itself, and mostly orthogonal to CPU DVFS:
  - Each knob reduces power in a different component
  - Our memory DVFS algorithm has negligible performance impact → negligible impact on CPU DVFS
  - CPU DVFS will only further reduce bandwidth demands relative to our evaluations → no negative impact on memory DVFS

Why is this Autonomic Computing?
- Power management in general is autonomic: a system observes its own needs and adjusts its behavior accordingly
  → Much previous work comes from the architecture community, but crossover in ideas and approaches could be beneficial
- This work exposes a new knob for control algorithms to turn, provides a simple model for the power/energy effects of that knob, and observes the opportunity to apply it in a simple way
- Exposes future work on:
  - More advanced control algorithms
  - Coordinated energy efficiency across the rest of the system
  - Coordinated energy efficiency across a cluster/datacenter, integrated with memory DVFS, CPU DVFS, etc.