Overview of CPU Power Consumption and Management in Smartphones
Prof. Sasu Tarkoma
University of Helsinki, Aalto University, Helsinki Institute for Information Technology
Contents
• Modern smartphone SoC and CPUs – The CPU: power states – Power management basics • Smartphone solutions – Linux CPU Frequency subsystem – Power models • Intra-device task offloading – Sensor hub – Heterogeneous multiprocessing • Computation offloading Smartphones
• Smartphones have become hubs for applications and connecting with the Internet • Cloud has emerged as a backend for mobile applications • Mobile data and WiFi are the dominant protocols for connecting with Internet resources • The next generation solutions are addressing limitations of the current smartphones – Coordination of resource usage – Offloading in its many forms – Heterogeneous environment and the emergence of IoT / M2M / wearables Observations
• Smartphone and mobile device hardware and software evolve rapidly • Multiple wireless protocols • Heterogeneous computing over multiple cores – Dedicated subsystems (sensor hubs) – Increasing number of sensing subsystems – Always-on sensing • Battery technology has not kept pace with the development • Software is not, in many cases, optimized • Difficult to balance between local versus distributed processing • Difficult to control traffic across interfaces Mobile Evolution
1995 2000 2005 2010 2015 Processor Single Single Single Dual-core Quad-core and beyond, auxiliary processors, sensor hubs Cellular 2G 2.5-3G 3.5G Transition 4G generation toward 4G Standard GSM GPRS HSPA HSPA, LTE LTE, LTE-A
Downlink (Mb/ 0.01 0.1 1 10 100 s) Display pixels 4 16 64 256 1024 (x1000) Communicatio - - WiFi, WiFi, WiFi, Bluetooth LE, ns modules Bluetooth Bluetooth RFID Battery 1 2 3 4 5 capacity (Wh) Software (MB) 0.1 1 10 100 1000 Example Smartphone SoC
Modem Subsystem Multicore Subsystem Multimedia Subsystem
LTE Adreno World KRAIT CPU KRAIT CPU GPU Modem Audio, GPS,Wi-Fi, Video HW, BT,FM Accelerator L1 Cache L1 Cache s DSP DSP DSP L2 Cache Multim. Proc.
Dual channel memory
Snapdragon Adaptive Power Technologies Android Smartphone Power Profile (mW) 800 700 600 500 400 300 200 100 0 28 Energy and Power Primer
2.3 Power Computation
Energy can be saved on multiple levels in the hardware and software architecture. On circuit and transistor levels energy can be saved by changing the voltage and frequency of the circuit. The total power of CMOS logic circuits are determined by the clock rate, the supply voltage, and the capacitances of the transistors. The larger the capacitance, the more the transistors need to work to change the output charge. The higher the source voltage, the more the output has to change. The higher the clock frequency, the more power is wasted during switching. The 28 Energy and Power Primer capacitances are determined during chip design; however, the clock rate and voltage can be varied at runtime. By varying clock rate and supply voltage, it is possible to obtain linear and quadratic improvements, respectively. 2.3 PowerThe overall Computation power consumption is given by
Energy can be saved on multipleP = levelsPswitch in the+ P hardwarestatic, and software architecture.(2.6) On circuit and transistor levels energy can be saved by changing the voltage and where P is the dynamic switching power and P is the static leakage frequencyswitch of the circuit. The total power of CMOS logicstatic circuits are determined power. The formula does not include the lost power due to short circuit due to by the clock rate, the supply voltage, and the capacitances of the transistors. a transistor switching that is a relatively small term in the dynamic switching The larger the capacitance, the more the transistors need to work to change the power [4]. output charge. The higherCPU the source Total voltage, Power the more the output has to change. The CPU consists of transistors that can switch states every clock cycle. Each The higher the clock frequency, the more power is wasted during switching. The switch of states between a complementary pair of transistors results in wasted capacitances are determined during chip design; however, the clock rate and power. Typically, most of the gates in the CPU switch states every clock cycle. voltage can• The be total varied power at runtime. of CMOS By varyinglogic circuits clock rateare anddetermined supply voltage, it is Thus the higher the clock rate, the higher the amount of wasted energy. In possible toby obtain the clock linear rate, and the quadratic supply improvements, voltage, and the respectively. addition tocapacitances the clock rate, of also the thetransistors voltage a↵ects the power consumption of the CPU.The overall power consumption is given by – Switching power and static power: The dynamic switching loss is the power used by switching of the state of the P = P + P , (2.6) transistors. This loss can be determinedswitch by usingstatic the formula • Dynamic switching loss is given by: where Pswitch is the dynamic switching power2 and Pstatic is the static leakage Pswitch = ↵ C V f, (2.7) power. The formula does not include the⇥ lost⇥ power⇥ due to short circuit due to By varying clock rate and supply voltage, it is possible a transistor• switching that is a relatively small term in the dynamic switching where ↵ denoteto obtain the linear switching and activity, quadraticC isimprovements the average capacitance of the tran- powersistors, [4].V is the supply voltage and f is the clock frequency [5]. The power consumptionThe CPU consists is proportional of transistors to the that product can switch of the states square every of the clock supply cycle. voltage Each switchand the of frequency. states between The power a complementary consumption pair decreases of transistors quadratically results with in thewasted re- power.duction Typically, of the voltage. most The of the formula gates in is not the exactCPU switch for modern states chips, every because clock cycle. also Thusmemory the circuits higher and the static clock leakage rate, the current higher need the to amount be taken of into wasted account. energy. In additionThe clock to the rate clock and rate, the supply also the voltage voltage are a parameters↵ects the power for the consumption energy optimiza- of the CPU.tion of the CPU [6, 7]. Major energy savings can be gained by adjusting the supplyThe dynamic voltage and switching the clock loss rate is the either power one used or both by switching of them atof the the state same of time. the transistors.Indeed, dynamic This loss voltage can scaling be determined is widely by used using to the improve formula power consumption of battery powered devices. Both low voltage and lowered clock frequencies are 2 used to minimize power consumptionPswitch = ↵ of componentsC V f, such as CPUs and DSPs.(2.7) ⇥ ⇥ ⇥ Operating frequency has an approximately linear relationship with the oper- where ↵ denote the switching activity, C is the average capacitance of the tran- sistors, V is the supply voltage and f is the clock frequency [5]. The power consumption is proportional to the product of the square of the supply voltage and the frequency. The power consumption decreases quadratically with the re- duction of the voltage. The formula is not exact for modern chips, because also memory circuits and static leakage current need to be taken into account. The clock rate and the supply voltage are parameters for the energy optimiza- tion of the CPU [6, 7]. Major energy savings can be gained by adjusting the supply voltage and the clock rate either one or both of them at the same time. Indeed, dynamic voltage scaling is widely used to improve power consumption of battery powered devices. Both low voltage and lowered clock frequencies are used to minimize power consumption of components such as CPUs and DSPs. Operating frequency has an approximately linear relationship with the oper- 2.3 Power2.3 Computation Power Computation29 29
Power#(W)#Power#(W)#
Process#running#with#Process#running#with# frequency#f#frequency#f#
Process#with#running#frequency#f/2# Process#with#running#frequency#f/2# Time# Time#
Figure 2.2 Example of frequency, time and energy consumption Figure 2.2 Example of frequency, time and energy consumption
ating voltage given by the following equation [4]: ating voltage given by the following equation [4]: V = + f , (2.8) CPU Power Savingnorm 1 2 ⇥ norm Vnorm = 1 + 2 fnorm, (2.8) where = V /V and =1⇥ . V is the threshold voltage at which a 1 th max 2 1 th where = Vtransistor/V and conducts =1 and begins . V tois switch. the threshold voltage at which a 1 th max 2 1 th • Runningtransistor a task conducts atFigure a and slower 2.2 begins illustrates speed to switch. the relationships saves energy; of the operating frequency, process com- however,Figure it will 2.2 illustrates pletionlake time,longer the and relationships energyto execute consumption. of the operatingthe The task process frequency, thus is executed process on com- two di↵erent pletion time, andsystems energy at di consumption.↵erent voltage The and process frequency is executed levels. The on top two process di↵erent runs at high affectingsystems the at performance di↵frequencyerent voltage and is and completed frequency fast. levels. The lower The top process process runs runs at a at halved high processor •Dynamicfrequency voltage andfrequency is and completed frequency and supply fast. The voltage scaling lower and process takes favor much runs longer at a halved to complete. processor The lower fre- frequency andquency supply and voltage supply and voltage takes lead much to longersmaller to energy complete. consumption The lower [7]. fre-Suppose that parallelismquency andand supplythe having new voltage frequency multiple lead istof/ smaller2, voltage/frequency then energy from Equation consumption 2.7 we [7]. obtain Suppose the newthat switching scaledthe cores new frequency executingpower is f/2, tasks. then from Equation 2.7 we obtain the new switching V 2 f 1 power P 0 = ↵ C ( ) = P . (2.9) switch ⇥ ⇥ 2 ⇥ 2 8 switch Power (W) V 2 f 1 P 0 = ↵ C ( ) = P . (2.9) Now,switch the completion⇥ time⇥ 2T of⇥ the2 process8 switch is doubled for the lower frequency: Now, the completion time T of the process is1 doubled for the lower1 frequency: E0 = P 0 T 2= P T 2= P T, (2.10) switch switch ⇥ ⇥ 8 switch ⇥ ⇥ 4 switch ⇥ Process running 1 1 with frequencyE0 f = P 0 T 2= P T 2= P T, (2.10) switch givingswitch an improvement⇥ ⇥ 8 ofswitch 1/4 with⇥ ⇥ the lower4 switch frequency⇥ while significantly in- creasing the execution¼ energy time of savings the task. with To the compensate for the performance loss, Processgiving with anrunning improvement frequency f/2 of 1/4 with the lower frequency while significantly in- we can divide thedoubling task into of two the subtasksexecution and time run these in parallel using two Time creasing the executioncores. The time dynamic of the power task. can To decrease compensate by a for factor the two performance compared loss,to the original we can dividecase. the task into two subtasks and run these in parallel using two cores. The dynamic power can decrease by a factor two compared to the original case. Scaling SoCs • Static power leakage is affecting the feasibility of voltage scaling. – A linear decrease of voltage will result in a quadratic decrease of the switching loss; however, it will only result in a linear decrease of the static leakage. – Static leakage can become dominant with low voltage integrated circuits with high density of transistors • To address leakage, SoCs use power gating to allow portions of the system to be switched off when necessary. • Low power requirements are also driving the trend for multicore systems with many voltage and frequency controlled cores.
• Dynamic voltage and frequency scaling favor parallelism and having multiple voltage/frequency scaled cores executing tasks. CPU Frequency and Relative Power Consumption
Clock frequency Core voltage Relative power (MHz (V) consumption 250 1.075 100%
500 1.2 249%
550 1.275 309%
600 1.35 378%
Source: J. Kurtto, “Mapping and improving the energy efficiency of the Nokia N900”. M.Sc. Thesis. University of Helsinki, Department of Computer Science. Power Optimization Levels
• Smartphone and mobile device power optimization happens on multiple levels: – Silicon-level, in which the transistor capacitance and the chip design affect the energy efficiency – SoC-level, in which multiple power/voltage/clock domains can be used to support granular power management with the help of software and DVFS – Software-level, in which various power managers monitor and control the energy and power settings.
• A high-level framework is needed to perform system- wide tuning and optimization. APM and ACPI
• Advanced Power Management (APM) – Firmware and BIOS level – Apps and drivers -> APM Driver-> BIOS APM -> hardware – Power management via device all and automatic based on device activity – Power management events • Advanced Configuration and Power Interface (ACPI) – APM is replaced by ACPI – OS has the control – Power states: global, device, processor, performance Smartphone CPU and SoC
• The five popular SoCs used by smartphones today are: – Qualcomm’s Snapdragon consists of four versions from S1 to S4. – Texas Instrument’s OMAP (Open Media Applications Platform). – Samsung’s Exynos SoCs used in their smartphones and tablets. The latest version of Exynos is 5 Octa and it features a quad-core Cortex-A15 and a quad-core Cortex-A7, ARM Mali GPU, and auxiliary processors. – Nvidia’s Tegra SoCs are multi-core and based on ARM cores with an ulta low- power (ULP) GeForce GPU. The latest version, Tegra 4, supports ARM- A15 cores in quad- or octa-core configurations. – Apple’s CPUs and SoCs are designed by Apple based on the ARM architecture. 7.1 CPU and SoC 107 Mobile CPU Performance
5000#
4500#
4000#
3500#
3000#
2500#
2000#
1500# Geekbench2%performance%
1000#
500#
0# Tegra#3#(Nexus# A6X#(4th#iPad)# APQ8064#1.5# Exynos5250# Tegra#4#1.9#GHz# CoreKi5#1.7GHz# 7)#1.3GHz# 1.4#GHz# GHz#(Nexus#4)# 1.7GHz#(Nexus# (Ivy#Bridge)# 10)#
Figure 7.2 Comparison of mobile CPU and SoC performance
to reduce the number of o↵-chip memory accesses. Caches store most frequently used data in on-chip memory supporting fast access. Typically, each core has its own instruction and data caches (L1 cache). The cores can also share a common larger cache (L2 cache). In addition to multiple CPU cores, an additional low power CPU core has been proposed to address low-intensity background tasks [5]. Nvidias Tegra 4 features such a Battery Saver core that is optimized for low power. When the quad-core main CPU is not needed, the system switches to the Battery Saver core and completely switches o↵ the main CPU. The Battery Saver core has its own L2 cache and the L2 cache used by the other cores can be power gated.
The ARM based SoCs use dedicated controllers for realizing di↵erent func- tions making them very di↵erent from the desktop PCs that rely on the a single CPU for most of the work. The benefit of dedicated controllers is their e ciency compared to software based implementations. The dedicated chips and circuits can carry out the tasks more e ciently and with fewer cycles than a software implementation. Indeed, it is not uncommon to first have software based implementation of a process or algorithm, such as audio processing or graphics generation, that is then implemented in hardware for increasing e ciency. Dynamic Power Management
• Dynamic Power Management (DPM) is a design methodology for energy and power management of dynamically reconfiguring systems. The goal for a DPM system is to provide the requested services and performance with a minimum power consumption. • DVFS (Dynamic Voltage and Frequency Scaling) – Minimizes power needs by trying to reduce the operating frequency and the core voltage to levels sufficient to execute current tasks but with no excess resources left. • DPS (Dynamic Power Switching) – DPS detects the true need of resource utilization in a CPU and, if there are no current computational tasks, forces the CPU into minimal power state. Example of Dynamic Power Management
DVFS DPS DVFS DPS DVFS
Work Idle Work Idle Work
Time P and C States
• At processor level we can manage power consumption with two different strategies: – Control of the CPU performance states using frequency and core voltage (P-states). – Use of processor operating states (C-states). • The Advanced Configuration and Power Interface (ACPI) specification is the industry standard solution for power management used by the desktop and laptop industry. – Current smartphones use these states as well, but they are not based on ACPI. Overview of P and C States
Highest power mode Higher voltage/ P states frequency
Lower voltage/ frequency System is active mode
System is in idle mode C0 C1 C states C2 C3 C4 … Lowest power mode Power Management Components
• A subsystem for connecting the low-level technologies and policies at higher level. • A system of in-kernel policy governors. They are essentially pre-configured power schemes with the ability to modify the clock frequency according to required needs. Typically these governors use the P- states to to change frequencies for lower power consumption. They will switch between clock frequencies, on the basis of current CPU utilization level trying to save power while not unduly losing performance. The governors are tunable allowing some customizing of the frequency scaling. • Drivers implementing the technology in a CPU-specific way. P and C States: Example
• The difference between C-states and P-states is one of operational level and can be summarized as follows: – In C-states starting from C1 the processor is idle but partially in use. P-state is an operational concept and defined solely by the clock frequency and core voltage.
State Idle Power (mW)
C0 433 The Nexus 4 smartphone power C1 390 states. Nexus 4 is based on the C2 330 Snapdragon S4 SoC. C3 200 Without idle states 1060 Linux CPU Frequency Subsystem
• The Linux CPU frequency subsystem that has supported dynamic processor frequencies since the 2.6.0 Linux kernel. • The CPUfreq subsystem uses governors and daemons for implementing a static or dynamic power management policy.
User space governors and daemons
powersaved cpuspeed cpufreqd
LINUX KERNEL (in-kernel governors)
performance userspace powersave ondemand
Cpufreq module (/proc and /sys interface) Governors
• Performance governor that gives the highest CPU frequency and performance. This governor statically sets the highest frequency value and allows the tuning of this highest value. • Powersave governor that sets the lowest CPU frequency and system speed. • Userspace governor that allows the CPU frequency to be set manually. The component can be used to implement custom power policies. • Ondemand governor is an in-kernel governor to dynamically set the CPU frequency based on CPU utilization. • Conservative governor is similar to the on demand governor, but allows a more gradual increase of the power consumption. Example: Governors
Frequency
Performance governor Highest frequency
Ondemand governor
Powersave governor Lowest frequency
Time Time resolution
• The default governors use millisecond resolution for decisions, the thresholds are specified in microseconds
• Userspace governor can be implemented with finer- grained granularity (microseconds)
• Intel Speedstep Technology can switch frequency with latency of 10 microseconds
• Faster sampling results in quicker response time for changes in the workload Ondemand Governor
• For each CPU: – Every X milliseconds: • Get Utilization since last check • If utilization > UP_THRESHOLD – Increase frequency to MAX – Every Y milliseconds: • Get utilization • If utilization < DOWN_THRESHOLD – Decrease frequency 20% • Conservative is similar but more gradual increase • Used on many Android devices, for example Samsung S4 • V. Pallipadi and A. Starikovskiy, “The Ondemand Governor,” The Linux Symposium, 2006. Linux cpufreq
• cpufreq-info – Info on the current governor and settings • cpufreq-set allows setting of: – d minimum frequency, – u maximum frequency, – f specific frequency (userspace governor must be set first) and – g governor on a – c specific CPU • Cpufreq allows activating governors on every available CPU (and core) • Can access /sys/devices/system/cpu/cpu0/cpufreq and /proc/cpuinfo on Android (as root) Energy-aware scheduling
• Traditional schedulers do not take the multicore topology or energy issues into account
• Energy-aware aware schedulers – Distribute tasks across CPUs and cores to save energy – Underlying topology affects the strategy – For example: cluster tasks on a high performance CPU, then remaining CPUs can enter idle states
• Examples: ARM’s big.LITTLE, NVIDIA battery saver core, … ARM big.LITTLE
• The ARM big.LITTLE is a computing architecture that Interrupt Control combines slow and low-
power processor cores with Cortex-A15 Cortex-A7 faster and more power CP CP U U demanding cores CPU CPU L2 Cache This architecture is used, for • L2 Cache Low performance tasks, example, by Samsung always-on always Galaxy Note 3 and S4 High performance tasks connected smartphones (Exynos 5 Seamless migration
Octa). Interconnect • Extends DVFS with CPU migration. big.LITTLE Models
• Clustered model, in which the OS scheduler observes one of the two processor clusters, and the scheduler transitions between the clusters based on the observed load. • In-kernel switcher pairs a more powerful core with a less powerful core with the option of having many identical pairs on a chip. The active core is selected based on the load. • Heterogeneous multi-processing (MP) enables the use of all physical cores simultaneously. High priority or computationally demanding threads are run by the powerful cores whereas low priority or less demanding threads are are run on the less powerful cores. Big.LITTLE Experiments
• The framework allows seamless migration of tasks across the processor cores. The Cortex- A15-Cortex-A7 system is designed to migrate tasks between the processor clusters in less than 20 microseconds with 1GHz processors. • ARM’s energy efficiency comparison of Cortex-A15 and Cortex-A7 indicates significant power savings with a variety of benchmarks. • For example, the Dhrystone benchmark gives energy efficiency benefit of 3.5x for A7 with performance benefit of 1.9x for A15. This motivates the use of the slower processor for lightweight tasks. GPUs
• In modern smartphone designs graphics processing is typically offloaded to the GPU that has a high performance graphics processing pipeline. • The four well-known mobile GPUs are: – Adreno GPU in the Qualcomm’s Snapdragon line of SoCs. – PowerVR GPU used in TI’s OMAP line of SoCs. – Mali in the ARM architecture. – GeForce ULP (ultra low-power) in the Tegra line of SoCs from Nvidia. • GPU can be used for generic processing. For example, a Gabor face feature extraction algorithm was implemented with the Tegra GPU and OpenGL ES and the shader language. – The resulting GPU based algorithm achieved a 4.25 times speedup compared to the CPU based version Modelling the CPU
• We outline the development of simple power models for the smartphone SoC and CPU based on utilization and linear regression. The development of our simple model proceeds in the following phases: – Design of training phase with different CPU loads. The loads should be realistic and reflect the real-life workloads on the device. – Power measurement of the training loads on the smartphone. An external power monitor tool is typically used for high accuracy measurements. – Creation of a power model for the CPU energy consumption. • The training can model construction can happen offline, online or online with offline support. 120 Smartphone Subsystems 120 Smartphone Subsystems
with total power measurements with an external power monitor to obtain a fine- withgrained total view power of energy measurements consumption with of anthe external CPU. A similar power approach monitor canto obtain be used a fine- grainedto model view GPUs of energy [20]. The consumption monitored CPU of the components CPU. A include similar the approach bus control, can L1 be used toand model L2 caches, GPUs [20]. bu↵ers, The integer monitored execution, CPU floating components point execution,include the and bus queues. control, L1 andThe L2Processor performance caches, bu counter↵ers, integer metricsPower execution, were mappedModel floating to the pointbased CPU components.execution, on and queues. TheThe performance model collects counter architectural metrics statistics were mapped such as to cycles the per CPU instruction, components. mem- oryThe references, model collects cache architectural misses,Counters and on-chip statistics communication such as cycles details per together instruction, with mem- orythe references, application-level cache details misses, to optimize and on-chip performance communication and meet the details desired together energy with and• Isci temperature and Martanosi goals. Given have a CPU proposed with N components, a power model the ith for component the application-levelCPUs based detailson performance to optimize performance counters. and meet the desired energy denoted by Ci, the model is based on component access rates given by a perfor- and temperature goals. Given a CPU with N components, the ith component mance– counter They or correlate a combination hardware of performance performance counters. counters For CPU component and denoted by Ci, the model is based on component access rates given by a perfor- Ci, the maximumsystem power, log withMaxPower total (powerCi) and ArchitecturalScalingmeasurements (withCi) are an heuris- mance counter or a combination of performance counters. For CPU component tics and estimatedexternal empirically. power monitor to obtain a fine-grained view Ci, the maximum power, MaxPower(Ci) and ArchitecturalScaling(Ci) are heuris- tics and estimatedof energy empirically. consumption of the CPU. A similar P (C )=AccessRate(C ) ArchitecturalScaling(C ) MaxPower(C )+(7.2) i approach cani be⇥ used to model GPUsi ⇥ i NonGatedClockPower(Ci).(7.3) P (Ci)=AccessRate(Ci) ArchitecturalScaling(Ci) MaxPower(Ci)+(7.2) The total power of the CPU⇥ is given by the following equation:⇥ NonGatedClockPower(Ci).(7.3) N The total power of thePtotal CPU= is givenP (Ci)+ byIdle the powerfollowing. equation: (7.4) i=1 X N C. Isci and M. Martonosi, “RuntimeP powertotal monitoring= in Phigh-end(Ci)+ processors:Idle Methodology power. and empirical data,” in (7.4) Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. 7.1.12 Single Core Regression Model i=1 X Assuming a linear relationship between power consumption and CPU load, the linear regression model would be the following: 7.1.12 Single Core Regression Model P = a U + b, (7.5) Assuming a linear relationshipcpu between⇥ powercpu consumption and CPU load, the
linearwhere regressiona and b are model constants, would and beU thecpu is following: CPU utilization. The latter term indi- cates the power consumption of the CPU. This simple model does not take the P = a U + b, (7.5) DVFS into account, but it can becpu extended⇥ cpu to take the voltage and frequency scaling into account. whereThea powerand b are model constants, can then and be usedUcpu inis CPUpower utilization. estimation in The predicting latter term the indi- catespower the consumption power consumption without power of the measurements. CPU. This Following simple model our example, does not given take the DVFSthat CPU into utilization account, butis 20 it % can for a be give extended duration toT , takewe can the determine voltage the and energy frequency scalingconsumption into account. with Etotal =(a20+b)T . The energy consumption of an application thatThe uses power the CPU model in can a dynamic then be manner used can in be power determined estimation with in predicting the power consumption without power measurements. Following our example, given E = P i T, (7.6) total cpu ⇥ that CPU utilization is 20 % for a givei duration T , we can determine the energy X consumption with Etotal =(a20+b)T . The energy consumption of an application i thatwhere usesPcpu theis CPU the i in-th a measurement dynamic manner of CPU can utilization be determined and withT is the time interval of the measurement. E = P i T, (7.6) total cpu ⇥ i X i where Pcpu is the i-th measurement of CPU utilization and T is the time interval of the measurement. 120 Smartphone Subsystems
120 Smartphone Subsystems with total power measurements with an external power monitor to obtain a fine- grained view of energy consumption of the CPU. A similar approach can be used towith model total GPUs power measurements [20]. The monitored with an external CPU power components monitor to obtain include a fine- the bus control, L1 andgrained L2 view caches, of energy bu↵ consumptioners, integer of the execution, CPU. A similar floating approach point can be execution, used and queues. to model GPUs [20]. The monitored CPU components include the bus control, L1 Theand performance L2 caches, bu↵ers, counter integer execution, metrics floating were mapped point execution, to the and CPU queues. components. TheThe performance model collects counter metricsarchitectural were mapped statistics to the CPU such components. as cycles per instruction, mem- oryThe references, model collects cache architectural misses, statistics and on-chip such as cycles communication per instruction, mem- details together with theory application-level references, cache misses, details and toon-chip optimize communication performance details together and meet with the desired energy the application-level details to optimize performance and meet the desired energy andand temperature temperature goals. goals. Given Given a CPU a with CPUN components, with N components, the ith component the ith component denoteddenoted by byCiC, thei, the model model is based is on based component on component access rates given access by a rates perfor- given by a perfor- mancemance counter counter or ora combination a combination of performance of performance counters. For counters. CPU component For CPU component Ci, the maximum power, MaxPower(Ci) and ArchitecturalScaling(Ci) are heuris- Cticsi, the and maximum estimated empirically. power, MaxPower(Ci) and ArchitecturalScaling(Ci) are heuris- tics and estimated empirically.
P (C )=AccessRate(C ) ArchitecturalScaling(C ) MaxPower(C )+(7.2) i i ⇥ i ⇥ i NonGatedClockPower(Ci).(7.3) P (Ci)=AccessRate(Ci) ArchitecturalScaling(Ci) MaxPower(Ci)+(7.2) The total power of the CPU is given⇥ by the following equation: ⇥ NonGatedClockPower(Ci).(7.3) N The total power ofPtotal the= CPUP (C isi)+ givenIdle power by the. following equation:(7.4) i=1 X N 7.1.12 Single Core Regression ModelPtotal = P (Ci)+Idle power. (7.4) i=1 Assuming a linear relationship between powerX consumption and CPU load, the linear regression model would be the following:
P = a U + b, (7.5) 7.1.12 Single CoreSingle Regression Core Modelcpu Regression⇥ cpu Model
where a and b are constants, and Ucpu is CPU utilization. The latter term indi- Assumingcates the power a linear consumption relationship of the CPU. between This simple power model consumption does not take the and CPU load, the linearDVFS• regressionAssuming into account, a model but linear it can would relationship be extended be the tobetween following: take the power voltage and frequency scalingconsumption into account. and CPU load, the linear regression model Thewould power modelbe the can following: then be used P in= powera estimationU + b, in predicting the (7.5) cpu ⇥ cpu powerConstants consumption withouta and b power are measurements.determined Followingwith regression our example, given that• CPU utilization is 20 % for a give duration T , we can determine the energy where a and b are constants, and Ucpu is CPU utilization. The latter term indi- consumption• The energy with Etotal consumption=(a20+b)T . Theof an energy application consumption that of anuses application the catesthat uses theCPU the power in CPU a dynamic in consumption a dynamic manner manner of thecan can be CPU.be determined determined This with simple with model does not take the DVFS into account, but it can be extended to take the voltage and frequency E = P i T, (7.6) total cpu ⇥ scaling into account. i X The poweri model can then be used in power estimation in predicting the where• GivenPcpu is that the i -thCPU measurement utilization of is CPU 20 % utilization for a give and duration T is the T time , powerintervalwe consumption of thecan measurement. determine without the energy power measurements.consumption with Following Etotal = our example, given that CPU(a20 utilization + b)T. is 20 % for a give duration T , we can determine the energy consumption with Etotal =(a20+b)T . The energy consumption of an application that uses the CPU in a dynamic manner can be determined with
E = P i T, (7.6) total cpu ⇥ i X i where Pcpu is the i-th measurement of CPU utilization and T is the time interval of the measurement. 7.2 Display 121
Single Core Regression Model with 7.1.13 Single Core Regression Model withDVFS DVFS DVFS can significantly improve energy e ciency of the CPU. The obtained benefit depends on the idle power consumption of the CPU and the execution • DVFS can significantly improve energy efficiency. time of the tasks with di↵erent CPU voltages and frequency levels. The e↵ect of The effect of DVFS can be examined with the following DVFS can be• examined with the following simple equation [23]: simple equation: E = P t + P (t t), (7.7) ⇥ idle ⇥ max where E gives• where the total E gives energy the total of the energy workload, of theP workload,is the average P is the power over average power over the workload, t is the execution the workload, t is the execution time of the workload, Pidle is the idle power time of the workload, P is the idle power of the CPU, of the CPU, and t is the maximumidle running time of the workload over all and t max is the maximum running time of the workload frequencies. max over all frequencies.
7.1.14 Multicore RegressionA. Carroll and G. ModelHeiser, “An analysis of power consumption in a smartphone,” in Proceedings of the 2010 USENIX conference. Yifan Zhang et al. studied the power modeling of multicore smartphone CPUs and they have identified that the traditional frequency and utilization based regression techniques are prone to errors in the multicore setting [15]. Based on this observation they developed a new regression based power model for multicore CPUs based on the time spent in C states. This model builds on a model of a single core working at frequency f that is the given by the following equation [15]: P = WED + U + c, (7.8) core Ci ⇥ Ci U ⇥ i X where WEDCi is the weighted average entry duration for idle state Ci, Ci and