Memory Power Management via Dynamic Voltage/Frequency Scaling
Howard David (Intel), Eugene Gorbatov (Intel), Chris Fallin (CMU), Ulf R. Hanebutte (Intel), Onur Mutlu (CMU)

Memory Power is Significant
- Power consumption is a primary concern in modern servers
- Many prior works take a CPU, whole-system, or cluster-level approach
- But memory power is largely unaddressed
- In our server system*, memory is 19% of system power (avg)
  - Some work notes up to 40% of total system power
- Goal: Can we reduce this figure?
[Chart: System power vs. memory power (W) across the SPEC CPU2006 benchmarks]
*Dual 4-core Intel Xeon®, 48GB DDR3 (12 DIMMs), SPEC CPU2006, all cores active. Measured AC power, analytically modeled memory power.

Existing Solution: Memory Sleep States?
- Most memory energy-efficiency work uses sleep states
  - Shut down DRAM devices when no memory requests are active
- But even low-memory-bandwidth workloads keep memory awake
  - Idle periods between requests diminish in multicore workloads
  - CPU-bound workloads/phases are rarely completely cache-resident
[Chart: Time spent in sleep states (0-8%) across the SPEC CPU2006 benchmarks]
Memory Bandwidth Varies Widely
- Workload memory bandwidth requirements vary widely
[Chart: Memory bandwidth for SPEC CPU2006, bandwidth/channel (GB/s), roughly 0-8 across benchmarks]
- Memory system is provisioned for peak capacity → often underutilized
Memory Power Can Be Scaled Down
- DDR can operate at multiple frequencies → reduced power
  - Lower frequency directly reduces switching power
  - Lower frequency allows for lower voltage
  - Comparable to CPU DVFS:

      CPU voltage/freq. ↓ 15%  →  system power ↓ 9.9%
      Memory freq.      ↓ 40%  →  system power ↓ 7.6%

- Frequency scaling increases latency → reduced performance
  - Memory storage array is asynchronous
  - But bus transfer depends on frequency
  - When bus bandwidth is the bottleneck, performance suffers
Observations So Far
- Memory power is a significant portion of total power
  - 19% (avg) in our system, up to 40% noted in other works
- Sleep-state residency is low in many workloads
  - Multicore workloads reduce idle periods
  - CPU-bound applications send requests frequently enough to keep memory devices awake
- Memory bandwidth demand is very low in some workloads
- Memory power is reduced by frequency scaling
  - And voltage scaling can give further reductions
DVFS for Memory
- Key idea: observe memory bandwidth utilization, then adjust memory frequency/voltage to reduce power with minimal performance loss
  → Dynamic Voltage/Frequency Scaling (DVFS) for memory
- Goal in this work: implement DVFS in the memory system, by:
  - Developing a simple control algorithm that exploits opportunities for reduced memory frequency/voltage by observing bandwidth demand
  - Evaluating the proposed algorithm on a real system
Outline
- Motivation
- Background and Characterization
  - DRAM Operation
  - DRAM Power
  - Frequency and Voltage Scaling
- Performance Effects of Frequency Scaling
- Frequency Control Algorithm
- Evaluation and Conclusions
DRAM Operation
- Main memory consists of DIMMs of DRAM devices
- Each DIMM is attached to a memory bus (channel)
- Multiple DIMMs can connect to one channel
[Diagram: eight x8 DRAM devices on a DIMM feeding a 64-bit memory bus]
Inside a DRAM Device
- Banks
  - Independent storage arrays (row decoder, sense amps, column decoder)
  - Asynchronous: independent of memory bus speed
- I/O circuitry
  - Runs at bus speed
  - Clock sync/distribution
  - Bus drivers and receivers
  - Buffering/queueing (registers, write FIFO)
- On-die termination (ODT)
  - Required by bus electrical characteristics for reliable operation
  - Resistive element that dissipates power when the bus is active
[Diagram: DRAM device with a bank (row decoder, sense amps, column decoder), I/O circuitry (registers, receivers, write FIFO, drivers), and ODT]
Effect of Frequency Scaling on Power
Reduced memory bus frequency:
- Does not affect bank power
  - Constant energy per operation
  - Depends only on utilized memory bandwidth
- Decreases I/O power
  - Dynamic power in the bus interface and clock circuitry drops due to less frequent switching
- Increases termination power
  - Same data takes longer to transfer
  - Hence, bus utilization increases
- The tradeoff between I/O and termination power results in a net power reduction at lower frequencies
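The shape of this tradeoff can be sketched numerically. The coefficients below are purely illustrative assumptions, not values from the paper's analytical power model; the point is only the structure: bank power tracks bandwidth, I/O power tracks frequency, and termination power tracks bus utilization.

```python
# Toy model of the frequency/power tradeoff. All coefficients are
# illustrative assumptions, not the paper's analytical power model.
def memory_power_w(freq_mhz, util_bw_gbs,
                   bank_w_per_gbs=0.5,    # bank: energy per op -> scales with BW
                   io_w_per_mhz=0.004,    # I/O: switching -> scales with freq
                   term_w=2.0):           # ODT: dissipates while bus is active
    peak_bw_gbs = freq_mhz * 8 / 1000     # 64-bit (8-byte) channel
    utilization = min(util_bw_gbs / peak_bw_gbs, 1.0)
    bank = bank_w_per_gbs * util_bw_gbs   # unchanged by frequency scaling
    io = io_w_per_mhz * freq_mhz          # falls at lower frequency
    term = term_w * utilization           # rises: same data, busier bus
    return bank + io + term
```

For a fixed low bandwidth demand, this toy model gives lower total power at 800MHz than at 1333MHz even though the termination term grows, mirroring the net-reduction observation.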
Effects of Voltage Scaling on Power
- Voltage scaling further reduces power because all parts of the memory devices draw less current at lower voltage
- Voltage reduction is possible because stable operation requires a lower voltage at a lower frequency
[Chart: Minimum stable voltage for 8 DIMMs in a real system at 800, 1066, and 1333MHz; DIMM voltage axis 1.0-1.6V, with the Vdd used in the power model marked]
How Much Memory Bandwidth is Needed?
[Chart: Memory bandwidth for SPEC CPU2006, bandwidth/channel (GB/s), 0-7, across all benchmarks]
Performance Impact of Static Frequency Scaling
- Performance impact is proportional to bandwidth demand
- Many workloads tolerate lower frequency with minimal performance drop
[Chart: Performance drop (%) for static scaling 1333→800 and 1333→1066 across SPEC CPU2006, shown at full scale (up to ~80%) and zoomed in (up to ~8%)]
Memory Latency Under Load
- At low load, most time is spent in array access and bus transfer → a small constant offset between the latency curves for different bus frequencies
- As load increases, queueing delay begins to dominate → bus frequency significantly affects latency
[Chart: Memory latency (ns) vs. utilized channel bandwidth (MB/s) at 800, 1067, and 1333MHz]
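The two regimes can be illustrated with a simple queueing approximation. This is a sketch under an assumed M/M/1-style model with made-up service parameters, not the measured curves from the chart:

```python
# Illustrative latency model: constant array access + bus transfer + queueing.
# The 45ns array latency and 64-byte transfer size are assumptions.
def latency_ns(util_bw_mbs, bus_bw_mbs, array_ns=45.0, line_bytes=64):
    xfer_ns = line_bytes / (bus_bw_mbs / 1000.0)  # MB/s -> bytes/ns
    rho = min(util_bw_mbs / bus_bw_mbs, 0.99)     # bus utilization
    queue_ns = xfer_ns * rho / (1.0 - rho)        # M/M/1-style waiting time
    return array_ns + xfer_ns + queue_ns
```

At light load (say 1000MB/s) the 800MHz and 1333MHz curves in this model differ by only a few nanoseconds of transfer time; near 6000MB/s the queueing term dominates and the lower-frequency latency grows far faster.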
Control Algorithm: Demand-Based Switching
[Chart: Latency vs. bandwidth curves as above, with the switching thresholds T800 and T1066 marked]
After each epoch of length T_epoch:
  Measure per-channel bandwidth BW
  if BW < T800:        switch to 800MHz
  else if BW < T1066:  switch to 1066MHz
  else:                switch to 1333MHz
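The epoch loop maps directly to code. A minimal sketch, with threshold defaults taken from the BW(0.5, 1) variant evaluated later; how per-channel bandwidth is measured is platform-specific and left out:

```python
# Demand-based switching policy from the slide. Threshold defaults follow the
# BW(0.5, 1) configuration; T_epoch = 10ms as in the evaluation.
T_EPOCH_S = 0.010

def choose_frequency_mhz(bw_gbs, t800=0.5, t1066=1.0):
    """Pick a DDR3 frequency from measured per-channel bandwidth (GB/s)."""
    if bw_gbs < t800:
        return 800
    elif bw_gbs < t1066:
        return 1066
    else:
        return 1333
```

The BW(0.5, 2) variant is the same function with t1066=2.0.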
Implementing V/F Switching
- Halt memory operations
  - Pause requests
  - Put DRAM in self-refresh
  - Stop the DIMM clock
- Transition voltage/frequency
  - Begin voltage ramp
  - Relock the memory controller PLL at the new frequency
  - Restart the DIMM clock
  - Wait for the DIMM PLLs to relock
- Resume memory operations
  - Take DRAM out of self-refresh
  - Resume requests
- Notes:
  - Memory frequency is already adjustable statically
  - Voltage regulators for CPU DVFS can work for memory DVFS
  - The full transition takes ~20µs
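The halt/transition/resume sequence can be sketched as ordered steps. Every hardware hook below (pause_requests, enter_self_refresh, and so on) is a hypothetical placeholder for a platform-specific memory-controller or voltage-regulator operation, not a real driver API:

```python
# Sketch of the V/F switch sequence; `ctrl` is an assumed hardware-abstraction
# object whose methods are hypothetical placeholders.
def switch_memory_vf(ctrl, new_freq_mhz, new_vdd):
    # 1. Halt memory operations
    ctrl.pause_requests()
    ctrl.enter_self_refresh()       # DRAM retains data without the bus clock
    ctrl.stop_dimm_clock()
    # 2. Transition voltage/frequency
    ctrl.ramp_voltage(new_vdd)
    ctrl.relock_controller_pll(new_freq_mhz)
    ctrl.start_dimm_clock()
    ctrl.wait_dimm_plls_locked()
    # 3. Resume memory operations (~20us total on the evaluated system)
    ctrl.exit_self_refresh()
    ctrl.resume_requests()
```

Per the slide's notes, these hooks are feasible because memory frequency is already statically adjustable and existing CPU-DVFS voltage regulators can serve the memory rail.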
Evaluation Methodology
- Real-system evaluation
  - Dual 4-core Intel Xeon®, 3 memory channels/socket
  - 48 GB of DDR3 (12 DIMMs, 4GB dual-rank, 1333MHz)
- Emulating memory frequency for performance
  - Altered memory controller timing registers (tRC, tB2BCAS)
  - Gives performance equivalent to slower memory frequencies
- Modeling power reduction
  - Measure baseline system (AC power meter, 1s samples)
  - Compute reductions with an analytical model (see paper)
Evaluation Methodology
- Workloads
  - SPEC CPU2006: CPU-intensive workloads
  - All cores run a copy of the benchmark
- Parameters
  - T_epoch = 10ms
  - Two variants of the algorithm with different switching thresholds:
    - BW(0.5, 1): T800 = 0.5GB/s, T1066 = 1GB/s
    - BW(0.5, 2): T800 = 0.5GB/s, T1066 = 2GB/s → more aggressive frequency/voltage scaling
Performance Impact of Memory DVFS
- Minimal performance degradation: 0.2% (avg), 1.7% (max)
- Experimental error is ~1%
[Chart: Performance degradation (%) for BW(0.5,1) and BW(0.5,2) across SPEC CPU2006 and the average]
Memory Frequency Distribution
- The frequency distribution shifts toward higher memory frequencies for more memory-intensive benchmarks
[Chart: Fraction of time spent at 800, 1066, and 1333MHz for each SPEC CPU2006 benchmark]
Memory Power Reduction
- Memory power is reduced by 10.4% (avg), 20.5% (max)
[Chart: Memory power reduction (%) for BW(0.5,1) and BW(0.5,2) across SPEC CPU2006 and the average]
System Power Reduction
- As a result, system power is reduced by 1.9% (avg), 3.5% (max)
[Chart: System power reduction (%) for BW(0.5,1) and BW(0.5,2) across SPEC CPU2006 and the average]
System Energy Reduction
- System energy is reduced by 2.4% (avg), 5.1% (max)
[Chart: System energy reduction (%) for BW(0.5,1) and BW(0.5,2) across SPEC CPU2006 and the average]
Related Work
- MemScale [Deng11], concurrent work (ASPLOS 2011)
  - Also proposes memory DVFS
  - Uses an application performance-impact model to decide voltage and frequency: requires specific modeling for a given system; our bandwidth-based approach avoids this complexity
  - Simulation-based evaluation; our work is a real-system proof of concept
- Memory sleep states
  - Creating opportunity with data placement [Lebeck00, Pandey06], OS scheduling [Delaluz02], VM subsystem [Huang05]
  - Making better decisions with better models [Hur08, Fan01]
- Power limiting/shifting
  - RAPL [David10] uses memory throttling for thermal limits
  - CPU throttling for memory traffic [Lin07, Lin08]
  - Power shifting across the system [Felter05]
Conclusions
- Memory power is a significant component of system power
  - 19% average in our evaluation system, up to 40% in other work
- Workloads often keep memory active but underutilized
  - Channel bandwidth demands are highly variable
  - Use of memory sleep states is often limited
- Scaling memory frequency/voltage can reduce memory power with minimal system performance impact
  - 10.4% average memory power reduction
  - Yields 2.4% average system energy reduction
- Greater reductions are possible with a wider frequency/voltage range and better control algorithms
Why Real-System Evaluation?
- Advantages:
  - Captures all effects of altered memory performance (system/kernel code, interactions with IO and peripherals, etc.)
  - Able to run full-length benchmarks (SPEC CPU2006) rather than short instruction traces
  - No concerns about architectural simulation fidelity
- Disadvantages:
  - More limited room for novel algorithms and detailed measurements
  - Inherent experimental error due to background-task noise, real power measurements, and nondeterministic timing effects
- For a proof of concept, we chose to run on a real system in order to capture all potential side effects of altering memory frequency
CPU-Bound Applications in a DRAM-Rich System
- We evaluate CPU-bound workloads with 12 DIMMs: what about smaller memory, or IO-bound workloads?
- 12 DIMMs (48GB): are we magnifying the problem?
  - Large servers can have this much memory, especially for database or enterprise applications
  - Memory can be up to 40% of system power [1,2], and reducing its power in general is an academically interesting problem
- CPU-bound workloads: will it matter in real life?
  - Many workloads have CPU-bound phases (e.g., database scan or business logic in server workloads)
  - Focusing on CPU-bound workloads isolates the problem of varying memory bandwidth demand while memory cannot enter sleep states, and our solution applies to any compute phase of a workload

[1] L. A. Barroso and U. Hölzle. "The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines." Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2009.
[2] C. Lefurgy et al. "Energy Management for Commercial Servers." IEEE Computer, pp. 39–48, December 2003.
Combining Memory & CPU DVFS?
- Our evaluation did not incorporate CPU DVFS:
  - We need to understand the effect of a single knob (memory DVFS) first
  - Combining with CPU DVFS might produce second-order effects that would need to be accounted for
- Nevertheless, memory DVFS is effective by itself, and mostly orthogonal to CPU DVFS:
  - Each knob reduces power in a different component
  - Our memory DVFS algorithm has negligible performance impact → negligible impact on CPU DVFS
  - CPU DVFS will only further reduce bandwidth demands relative to our evaluations → no negative impact on memory DVFS
Why is this Autonomic Computing?
- Power management in general is autonomic: a system observes its own needs and adjusts its behavior accordingly
  → Much previous work comes from the architecture community, but crossover in ideas and approaches could be beneficial
- This work exposes a new knob for control algorithms to turn, gives a simple model for the power/energy effects of that knob, and observes the opportunity to apply it in a simple way
- It exposes future work on:
  - More advanced control algorithms
  - Coordinated energy efficiency across the rest of the system
  - Coordinated energy efficiency across a cluster/datacenter, integrating memory DVFS, CPU DVFS, etc.