Overview of CPU Power Consumption and Management in Smartphones

Prof. Sasu Tarkoma

University of Helsinki, Aalto University, Helsinki Institute for Information Technology

Contents

• Modern smartphone SoC and CPUs – The CPU: power states – Power management basics • Smartphone solutions – Linux CPU Frequency subsystem – Power models • Intra-device task offloading – – Heterogeneous multiprocessing • Computation offloading Smartphones

• Smartphones have become hubs for applications and connecting with the Internet • Cloud has emerged as a backend for mobile applications • Mobile data and WiFi are the dominant protocols for connecting with Internet resources • The next generation solutions are addressing limitations of the current smartphones – Coordination of resource usage – Offloading in its many forms – Heterogeneous environment and the emergence of IoT / M2M / wearables Observations

• Smartphone and mobile device hardware and software evolve rapidly • Multiple wireless protocols • over multiple cores – Dedicated subsystems (sensor hubs) – Increasing number of sensing subsystems – Always-on sensing • Battery technology has not kept pace with the development • Software is not, in many cases, optimized • Difficult to balance between local versus distributed processing • Difficult to control traffic across interfaces Mobile Evolution

1995 2000 2005 2010 2015 Processor Single Single Single Dual-core Quad-core and beyond, auxiliary processors, sensor hubs Cellular 2G 2.5-3G 3.5G Transition 4G generation toward 4G Standard GSM GPRS HSPA HSPA, LTE LTE, LTE-A

Downlink (Mb/ 0.01 0.1 1 10 100 s) Display pixels 4 16 64 256 1024 (x1000) Communicatio - - WiFi, WiFi, WiFi, Bluetooth LE, ns modules Bluetooth Bluetooth RFID Battery 1 2 3 4 5 capacity (Wh) Software (MB) 0.1 1 10 100 1000 Example Smartphone SoC

Modem Subsystem Multicore Subsystem Multimedia Subsystem

LTE World KRAIT CPU KRAIT CPU GPU Modem Audio, GPS,Wi-Fi, Video HW, BT,FM Accelerator L1 Cache L1 Cache s DSP DSP DSP L2 Cache Multim. Proc.

Dual channel memory

Snapdragon Adaptive Power Technologies Android Smartphone Power Profile (mW) 800 700 600 500 400 300 200 100 0 28 Energy and Power Primer

2.3 Power Computation

Energy can be saved on multiple levels in the hardware and software architecture. On circuit and transistor levels energy can be saved by changing the voltage and frequency of the circuit. The total power of CMOS logic circuits are determined by the clock rate, the supply voltage, and the capacitances of the transistors. The larger the capacitance, the more the transistors need to work to change the output charge. The higher the source voltage, the more the output has to change. The higher the clock frequency, the more power is wasted during switching. The 28 Energy and Power Primer capacitances are determined during design; however, the clock rate and voltage can be varied at runtime. By varying clock rate and supply voltage, it is possible to obtain linear and quadratic improvements, respectively. 2.3 PowerThe overall Computation power consumption is given by

Energy can be saved on multipleP = levelsPswitch in the+ P hardwarestatic, and software architecture.(2.6) On circuit and transistor levels energy can be saved by changing the voltage and where P is the dynamic switching power and P is the static leakage frequencyswitch of the circuit. The total power of CMOS logicstatic circuits are determined power. The formula does not include the lost power due to short circuit due to by the clock rate, the supply voltage, and the capacitances of the transistors. a transistor switching that is a relatively small term in the dynamic switching The larger the capacitance, the more the transistors need to work to change the power [4]. output charge. The higherCPU the source Total voltage, Power the more the output has to change. The CPU consists of transistors that can switch states every clock cycle. Each The higher the clock frequency, the more power is wasted during switching. The switch of states between a complementary pair of transistors results in wasted capacitances are determined during chip design; however, the clock rate and power. Typically, most of the gates in the CPU switch states every clock cycle. voltage can• The be total varied power at runtime. of CMOS By varyinglogic circuits clock rateare anddetermined supply voltage, it is Thus the higher the clock rate, the higher the amount of wasted energy. In possible toby obtain the clock linear rate, and the quadratic supply improvements, voltage, and the respectively. addition tocapacitances the clock rate, of also the thetransistors voltage a↵ects the power consumption of the CPU.The overall power consumption is given by – Switching power and static power: The dynamic switching loss is the power used by switching of the state of the P = P + P , (2.6) transistors. This loss can be determinedswitch by usingstatic the formula • Dynamic switching loss is given by: where Pswitch is the dynamic switching power2 and Pstatic is the static leakage Pswitch = ↵ C V f, (2.7) power. The formula does not include the⇥ lost⇥ power⇥ due to short circuit due to By varying clock rate and supply voltage, it is possible a transistor• switching that is a relatively small term in the dynamic switching where ↵ denoteto obtain the linear switching and activity, quadraticC isimprovements the average capacitance of the tran- powersistors, [4].V is the supply voltage and f is the clock frequency [5]. The power consumptionThe CPU consists is proportional of transistors to the that product can switch of the states square every of the clock supply cycle. voltage Each switchand the of frequency. states between The power a complementary consumption pair decreases of transistors quadratically results with in thewasted re- power.duction Typically, of the voltage. most The of the formula gates in is not the exactCPU switch for modern states chips, every because clock cycle. also Thusmemory the circuits higher and the static clock leakage rate, the current higher need the to amount be taken of into wasted account. energy. In additionThe clock to the rate clock and rate, the supply also the voltage voltage are a parameters↵ects the power for the consumption energy optimiza- of the CPU.tion of the CPU [6, 7]. Major energy savings can be gained by adjusting the supplyThe dynamic voltage and switching the clock loss rate is the either power one used or both by switching of them atof the the state same of time. the transistors.Indeed, dynamic This loss voltage can scaling be determined is widely by used using to the improve formula power consumption of battery powered devices. Both low voltage and lowered clock frequencies are 2 used to minimize power consumptionPswitch = ↵ of componentsC V f, such as CPUs and DSPs.(2.7) ⇥ ⇥ ⇥ Operating frequency has an approximately linear relationship with the oper- where ↵ denote the switching activity, C is the average capacitance of the tran- sistors, V is the supply voltage and f is the clock frequency [5]. The power consumption is proportional to the product of the square of the supply voltage and the frequency. The power consumption decreases quadratically with the re- duction of the voltage. The formula is not exact for modern chips, because also memory circuits and static leakage current need to be taken into account. The clock rate and the supply voltage are parameters for the energy optimiza- tion of the CPU [6, 7]. Major energy savings can be gained by adjusting the supply voltage and the clock rate either one or both of them at the same time. Indeed, dynamic voltage scaling is widely used to improve power consumption of battery powered devices. Both low voltage and lowered clock frequencies are used to minimize power consumption of components such as CPUs and DSPs. Operating frequency has an approximately linear relationship with the oper- 2.3 Power2.3 Computation Power Computation29 29

Power#(W)#Power#(W)#

Process#running#with#Process#running#with# frequency#f#frequency#f#

Process#with#running#frequency#f/2# Process#with#running#frequency#f/2# Time# Time#

Figure 2.2 Example of frequency, time and energy consumption Figure 2.2 Example of frequency, time and energy consumption

ating voltage given by the following equation [4]: ating voltage given by the following equation [4]: V = + f , (2.8) CPU Power Savingnorm 1 2 ⇥ norm Vnorm = 1 + 2 fnorm, (2.8) where = V /V and =1⇥ . V is the threshold voltage at which a 1 th max 2 1 th where = Vtransistor/V and conducts =1 and begins . V tois switch. the threshold voltage at which a 1 th max 2 1 th • Runningtransistor a task conducts atFigure a and slower 2.2 begins illustrates speed to switch. the relationships saves energy; of the operating frequency, process com- however,Figure it will 2.2 illustrates pletionlake time,longer the and relationships energyto execute consumption. of the operatingthe The task process frequency, thus is executed process on com- two di↵erent pletion time, andsystems energy at di consumption.↵erent voltage The and process frequency is executed levels. The on top two process di↵erent runs at high affectingsystems the at performance di↵frequencyerent voltage and is and completed frequency fast. levels. The lower The top process process runs runs at a at halved high processor •Dynamicfrequency voltage andfrequency is and completed frequency and supply fast. The voltage scaling lower and process takes favor much runs longer at a halved to complete. processor The lower fre- frequency andquency supply and voltage supply and voltage takes lead much to longersmaller to energy complete. consumption The lower [7]. fre-Suppose that parallelismquency andand supplythe having new voltage frequency multiple lead istof/ smaller2, voltage/frequency then energy from Equation consumption 2.7 we [7]. obtain Suppose the newthat switching scaledthe cores new frequency executingpower is f/2, tasks. then from Equation 2.7 we obtain the new switching V 2 f 1 power P 0 = ↵ C ( ) = P . (2.9) switch ⇥ ⇥ 2 ⇥ 2 8 switch Power (W) V 2 f 1 P 0 = ↵ C ( ) = P . (2.9) Now,switch the completion⇥ time⇥ 2T of⇥ the2 process8 switch is doubled for the lower frequency: Now, the completion time T of the process is1 doubled for the lower1 frequency: E0 = P 0 T 2= P T 2= P T, (2.10) switch switch ⇥ ⇥ 8 switch ⇥ ⇥ 4 switch ⇥ Process running 1 1 with frequencyE0 f = P 0 T 2= P T 2= P T, (2.10) switch givingswitch an improvement⇥ ⇥ 8 ofswitch 1/4 with⇥ ⇥ the lower4 switch frequency⇥ while significantly in- creasing the execution¼ energy time of savings the task. with To the compensate for the performance loss, Processgiving with anrunning improvement frequency f/2 of 1/4 with the lower frequency while significantly in- we can divide thedoubling task into of two the subtasksexecution and time run these in parallel using two Time creasing the executioncores. The time dynamic of the power task. can To decrease compensate by a for factor the two performance compared loss,to the original we can dividecase. the task into two subtasks and run these in parallel using two cores. The dynamic power can decrease by a factor two compared to the original case. Scaling SoCs • Static power leakage is affecting the feasibility of voltage scaling. – A linear decrease of voltage will result in a quadratic decrease of the switching loss; however, it will only result in a linear decrease of the static leakage. – Static leakage can become dominant with low voltage integrated circuits with high density of transistors • To address leakage, SoCs use power gating to allow portions of the system to be switched off when necessary. • Low power requirements are also driving the trend for multicore systems with many voltage and frequency controlled cores.

• Dynamic voltage and frequency scaling favor parallelism and having multiple voltage/frequency scaled cores executing tasks. CPU Frequency and Relative Power Consumption

Clock frequency Core voltage Relative power (MHz (V) consumption 250 1.075 100%

500 1.2 249%

550 1.275 309%

600 1.35 378%

Source: J. Kurtto, “Mapping and improving the energy efficiency of the Nokia N900”. M.Sc. Thesis. University of Helsinki, Department of Computer Science. Power Optimization Levels

• Smartphone and mobile device power optimization happens on multiple levels: – Silicon-level, in which the transistor capacitance and the chip design affect the energy efficiency – SoC-level, in which multiple power/voltage/clock domains can be used to support granular power management with the help of software and DVFS – Software-level, in which various power managers monitor and control the energy and power settings.

• A high-level framework is needed to perform system- wide tuning and optimization. APM and ACPI

• Advanced Power Management (APM) – Firmware and BIOS level – Apps and drivers -> APM Driver-> BIOS APM -> hardware – Power management via device all and automatic based on device activity – Power management events • Advanced Configuration and Power Interface (ACPI) – APM is replaced by ACPI – OS has the control – Power states: global, device, processor, performance Smartphone CPU and SoC

• The five popular SoCs used by smartphones today are: – ’s Snapdragon consists of four versions from S1 to S4. – Texas Instrument’s OMAP (Open Media Applications Platform). – Samsung’s SoCs used in their smartphones and tablets. The latest version of Exynos is 5 Octa and it features a quad-core Cortex-A15 and a quad-core Cortex-A7, ARM Mali GPU, and auxiliary processors. – ’s SoCs are multi-core and based on ARM cores with an ulta low- power (ULP) GeForce GPU. The latest version, Tegra 4, supports ARM- A15 cores in quad- or octa-core configurations. – Apple’s CPUs and SoCs are designed by Apple based on the ARM architecture. 7.1 CPU and SoC 107 Mobile CPU Performance

5000#

4500#

4000#

3500#

3000#

2500#

2000#

1500# Geekbench2%performance%

1000#

500#

0# Tegra#3#(Nexus# A6X#(4th#iPad)# APQ8064#1.5# Exynos5250# Tegra#4#1.9#GHz# CoreKi5#1.7GHz# 7)#1.3GHz# 1.4#GHz# GHz#(Nexus#4)# 1.7GHz#(Nexus# (Ivy#Bridge)# 10)#

Figure 7.2 Comparison of mobile CPU and SoC performance

to reduce the number of o↵-chip memory accesses. Caches store most frequently used data in on-chip memory supporting fast access. Typically, each core has its own instruction and data caches (L1 cache). The cores can also share a common larger cache (L2 cache). In addition to multiple CPU cores, an additional low power CPU core has been proposed to address low-intensity background tasks [5]. Nvidias Tegra 4 features such a Battery Saver core that is optimized for low power. When the quad-core main CPU is not needed, the system switches to the Battery Saver core and completely switches o↵ the main CPU. The Battery Saver core has its own L2 cache and the L2 cache used by the other cores can be power gated.

The ARM based SoCs use dedicated controllers for realizing di↵erent func- tions making them very di↵erent from the desktop PCs that rely on the a single CPU for most of the work. The benefit of dedicated controllers is their eciency compared to software based implementations. The dedicated chips and circuits can carry out the tasks more eciently and with fewer cycles than a software implementation. Indeed, it is not uncommon to first have software based implementation of a process or algorithm, such as audio processing or graphics generation, that is then implemented in hardware for increasing eciency. Dynamic Power Management

• Dynamic Power Management (DPM) is a design methodology for energy and power management of dynamically reconfiguring systems. The goal for a DPM system is to provide the requested services and performance with a minimum power consumption. • DVFS (Dynamic Voltage and Frequency Scaling) – Minimizes power needs by trying to reduce the operating frequency and the core voltage to levels sufficient to execute current tasks but with no excess resources left. • DPS (Dynamic Power Switching) – DPS detects the true need of resource utilization in a CPU and, if there are no current computational tasks, forces the CPU into minimal power state. Example of Dynamic Power Management

DVFS DPS DVFS DPS DVFS

Work Idle Work Idle Work

Time P and C States

• At processor level we can manage power consumption with two different strategies: – Control of the CPU performance states using frequency and core voltage (P-states). – Use of processor operating states (C-states). • The Advanced Configuration and Power Interface (ACPI) specification is the industry standard solution for power management used by the desktop and industry. – Current smartphones use these states as well, but they are not based on ACPI. Overview of P and C States

Highest power mode Higher voltage/ P states frequency

Lower voltage/ frequency System is active mode

System is in idle mode C0 C1 C states C2 C3 C4 … Lowest power mode Power Management Components

• A subsystem for connecting the low-level technologies and policies at higher level. • A system of in-kernel policy governors. They are essentially pre-configured power schemes with the ability to modify the clock frequency according to required needs. Typically these governors use the P- states to to change frequencies for lower power consumption. They will switch between clock frequencies, on the basis of current CPU utilization level trying to save power while not unduly losing performance. The governors are tunable allowing some customizing of the frequency scaling. • Drivers implementing the technology in a CPU-specific way. P and C States: Example

• The difference between C-states and P-states is one of operational level and can be summarized as follows: – In C-states starting from C1 the processor is idle but partially in use. P-state is an operational concept and defined solely by the clock frequency and core voltage.

State Idle Power (mW)

C0 433 The smartphone power C1 390 states. Nexus 4 is based on the C2 330 Snapdragon S4 SoC. C3 200 Without idle states 1060 Linux CPU Frequency Subsystem

• The Linux CPU frequency subsystem that has supported dynamic processor frequencies since the 2.6.0 Linux kernel. • The CPUfreq subsystem uses governors and daemons for implementing a static or dynamic power management policy.

User space governors and daemons

powersaved cpuspeed cpufreqd

LINUX KERNEL (in-kernel governors)

performance userspace powersave ondemand

Cpufreq module (/proc and /sys interface) Governors

• Performance governor that gives the highest CPU frequency and performance. This governor statically sets the highest frequency value and allows the tuning of this highest value. • Powersave governor that sets the lowest CPU frequency and system speed. • Userspace governor that allows the CPU frequency to be set manually. The component can be used to implement custom power policies. • Ondemand governor is an in-kernel governor to dynamically set the CPU frequency based on CPU utilization. • Conservative governor is similar to the on demand governor, but allows a more gradual increase of the power consumption. Example: Governors

Frequency

Performance governor Highest frequency

Ondemand governor

Powersave governor Lowest frequency

Time Time resolution

• The default governors use millisecond resolution for decisions, the thresholds are specified in microseconds

• Userspace governor can be implemented with finer- grained granularity (microseconds)

• Intel Speedstep Technology can switch frequency with latency of 10 microseconds

• Faster sampling results in quicker response time for changes in the workload Ondemand Governor

• For each CPU: – Every X milliseconds: • Get Utilization since last check • If utilization > UP_THRESHOLD – Increase frequency to MAX – Every Y milliseconds: • Get utilization • If utilization < DOWN_THRESHOLD – Decrease frequency 20% • Conservative is similar but more gradual increase • Used on many Android devices, for example Samsung S4 • V. Pallipadi and A. Starikovskiy, “The Ondemand Governor,” The Linux Symposium, 2006. Linux cpufreq

• cpufreq-info – Info on the current governor and settings • cpufreq-set allows setting of: – d minimum frequency, – u maximum frequency, – f specific frequency (userspace governor must be set first) and – g governor on a – c specific CPU • Cpufreq allows activating governors on every available CPU (and core) • Can access /sys/devices/system/cpu/cpu0/cpufreq and /proc/cpuinfo on Android (as root) Energy-aware scheduling

• Traditional schedulers do not take the multicore topology or energy issues into account

• Energy-aware aware schedulers – Distribute tasks across CPUs and cores to save energy – Underlying topology affects the strategy – For example: cluster tasks on a high performance CPU, then remaining CPUs can enter idle states

• Examples: ARM’s big.LITTLE, NVIDIA battery saver core, … ARM big.LITTLE

• The ARM big.LITTLE is a computing architecture that Interrupt Control combines slow and low-

power processor cores with Cortex-A15 Cortex-A7 faster and more power CP CP U U demanding cores CPU CPU L2 Cache This architecture is used, for • L2 Cache Low performance tasks, example, by Samsung always-on always Galaxy Note 3 and S4 High performance tasks connected smartphones (Exynos 5 Seamless migration

Octa). Interconnect • Extends DVFS with CPU migration. big.LITTLE Models

• Clustered model, in which the OS scheduler observes one of the two processor clusters, and the scheduler transitions between the clusters based on the observed load. • In-kernel switcher pairs a more powerful core with a less powerful core with the option of having many identical pairs on a chip. The active core is selected based on the load. • Heterogeneous multi-processing (MP) enables the use of all physical cores simultaneously. High priority or computationally demanding threads are run by the powerful cores whereas low priority or less demanding threads are are run on the less powerful cores. Big.LITTLE Experiments

• The framework allows seamless migration of tasks across the processor cores. The Cortex- A15-Cortex-A7 system is designed to migrate tasks between the processor clusters in less than 20 microseconds with 1GHz processors. • ARM’s energy efficiency comparison of Cortex-A15 and Cortex-A7 indicates significant power savings with a variety of benchmarks. • For example, the Dhrystone benchmark gives energy efficiency benefit of 3.5x for A7 with performance benefit of 1.9x for A15. This motivates the use of the slower processor for lightweight tasks. GPUs

• In modern smartphone designs graphics processing is typically offloaded to the GPU that has a high performance graphics processing pipeline. • The four well-known mobile GPUs are: – Adreno GPU in the Qualcomm’s Snapdragon line of SoCs. – PowerVR GPU used in TI’s OMAP line of SoCs. – Mali in the ARM architecture. – GeForce ULP (ultra low-power) in the Tegra line of SoCs from Nvidia. • GPU can be used for generic processing. For example, a Gabor face feature extraction algorithm was implemented with the Tegra GPU and OpenGL ES and the shader language. – The resulting GPU based algorithm achieved a 4.25 times speedup compared to the CPU based version Modelling the CPU

• We outline the development of simple power models for the smartphone SoC and CPU based on utilization and linear regression. The development of our simple model proceeds in the following phases: – Design of training phase with different CPU loads. The loads should be realistic and reflect the real-life workloads on the device. – Power measurement of the training loads on the smartphone. An external power monitor tool is typically used for high accuracy measurements. – Creation of a power model for the CPU energy consumption. • The training can model construction can happen offline, online or online with offline support. 120 Smartphone Subsystems 120 Smartphone Subsystems

with total power measurements with an external power monitor to obtain a fine- withgrained total view power of energy measurements consumption with of anthe external CPU. A similar power approach monitor canto obtain be used a fine- grainedto model view GPUs of energy [20]. The consumption monitored CPU of the components CPU. A include similar the approach bus control, can L1 be used toand model L2 caches, GPUs [20]. bu↵ers, The integer monitored execution, CPU floating components point execution,include the and bus queues. control, L1 andThe L2Processor performance caches, bu counter↵ers, integer metricsPower execution, were mappedModel floating to the pointbased CPU components.execution, on and queues. TheThe performance model collects counter architectural metrics statistics were mapped such as to cycles the per CPU instruction, components. mem- oryThe references, model collects cache architectural misses,Counters and on-chip statistics communication such as cycles details per together instruction, with mem- orythe references, application-level cache details misses, to optimize and on-chip performance communication and meet the details desired together energy with and• Isci temperature and Martanosi goals. Given have a CPU proposed with N components, a power model the ith for component the application-levelCPUs based detailson performance to optimize performance counters. and meet the desired energy denoted by Ci, the model is based on component access rates given by a perfor- and temperature goals. Given a CPU with N components, the ith component mance– counter They or correlate a combination hardware of performance performance counters. counters For CPU component and denoted by Ci, the model is based on component access rates given by a perfor- Ci, the maximumsystem power, log withMaxPower total (powerCi) and ArchitecturalScalingmeasurements (withCi) are an heuris- mance counter or a combination of performance counters. For CPU component tics and estimatedexternal empirically. power monitor to obtain a fine-grained view Ci, the maximum power, MaxPower(Ci) and ArchitecturalScaling(Ci) are heuris- tics and estimatedof energy empirically. consumption of the CPU. A similar P (C )=AccessRate(C ) ArchitecturalScaling(C ) MaxPower(C )+(7.2) i approach cani be⇥ used to model GPUsi ⇥ i NonGatedClockPower(Ci).(7.3) P (Ci)=AccessRate(Ci) ArchitecturalScaling(Ci) MaxPower(Ci)+(7.2) The total power of the CPU⇥ is given by the following equation:⇥ NonGatedClockPower(Ci).(7.3) N The total power of thePtotal CPU= is givenP (Ci)+ byIdle the powerfollowing. equation: (7.4) i=1 X N C. Isci and M. Martonosi, “RuntimeP powertotal monitoring= in Phigh-end(Ci)+ processors:Idle Methodology power. and empirical data,” in (7.4) Proceedings of the 36th Annual IEEE/ACM International Symposium on , 2003. 7.1.12 Single Core Regression Model i=1 X Assuming a linear relationship between power consumption and CPU load, the linear regression model would be the following: 7.1.12 Single Core Regression Model P = a U + b, (7.5) Assuming a linear relationshipcpu between⇥ powercpu consumption and CPU load, the

linearwhere regressiona and b are model constants, would and beU thecpu is following: CPU utilization. The latter term indi- cates the power consumption of the CPU. This simple model does not take the P = a U + b, (7.5) DVFS into account, but it can becpu extended⇥ cpu to take the voltage and frequency scaling into account. whereThea powerand b are model constants, can then and be usedUcpu inis CPUpower utilization. estimation in The predicting latter term the indi- catespower the consumption power consumption without power of the measurements. CPU. This Following simple model our example, does not given take the DVFSthat CPU into utilization account, butis 20 it % can for a be give extended duration toT , takewe can the determine voltage the and energy frequency scalingconsumption into account. with Etotal =(a20+b)T . The energy consumption of an application thatThe uses power the CPU model in can a dynamic then be manner used can in be power determined estimation with in predicting the power consumption without power measurements. Following our example, given E = P i T, (7.6) total cpu ⇥ that CPU utilization is 20 % for a givei duration T , we can determine the energy X consumption with Etotal =(a20+b)T . The energy consumption of an application i thatwhere usesPcpu theis CPU the i in-th a measurement dynamic manner of CPU can utilization be determined and withT is the time interval of the measurement. E = P i T, (7.6) total cpu ⇥ i X i where Pcpu is the i-th measurement of CPU utilization and T is the time interval of the measurement. 120 Smartphone Subsystems

120 Smartphone Subsystems with total power measurements with an external power monitor to obtain a fine- grained view of energy consumption of the CPU. A similar approach can be used towith model total GPUs power measurements [20]. The monitored with an external CPU power components monitor to obtain include a fine- the bus control, L1 andgrained L2 view caches, of energy bu↵ consumptioners, integer of the execution, CPU. A similar floating approach point can be execution, used and queues. to model GPUs [20]. The monitored CPU components include the bus control, L1 Theand performance L2 caches, bu↵ers, counter integer execution, metrics floating were mapped point execution, to the and CPU queues. components. TheThe performance model collects counter metricsarchitectural were mapped statistics to the CPU such components. as cycles per instruction, mem- oryThe references, model collects cache architectural misses, statistics and on-chip such as cycles communication per instruction, mem- details together with theory application-level references, cache misses, details and toon-chip optimize communication performance details together and meet with the desired energy the application-level details to optimize performance and meet the desired energy andand temperature temperature goals. goals. Given Given a CPU a with CPUN components, with N components, the ith component the ith component denoteddenoted by byCiC, thei, the model model is based is on based component on component access rates given access by a rates perfor- given by a perfor- mancemance counter counter or ora combination a combination of performance of performance counters. For counters. CPU component For CPU component Ci, the maximum power, MaxPower(Ci) and ArchitecturalScaling(Ci) are heuris- Cticsi, the and maximum estimated empirically. power, MaxPower(Ci) and ArchitecturalScaling(Ci) are heuris- tics and estimated empirically.

P (C )=AccessRate(C ) ArchitecturalScaling(C ) MaxPower(C )+(7.2) i i ⇥ i ⇥ i NonGatedClockPower(Ci).(7.3) P (Ci)=AccessRate(Ci) ArchitecturalScaling(Ci) MaxPower(Ci)+(7.2) The total power of the CPU is given⇥ by the following equation: ⇥ NonGatedClockPower(Ci).(7.3) N The total power ofPtotal the= CPUP (C isi)+ givenIdle power by the. following equation:(7.4) i=1 X N 7.1.12 Single Core Regression ModelPtotal = P (Ci)+Idle power. (7.4) i=1 Assuming a linear relationship between powerX consumption and CPU load, the linear regression model would be the following:

P = a U + b, (7.5) 7.1.12 Single CoreSingle Regression Core Modelcpu Regression⇥ cpu Model

where a and b are constants, and Ucpu is CPU utilization. The latter term indi- Assumingcates the power a linear consumption relationship of the CPU. between This simple power model consumption does not take the and CPU load, the linearDVFS• regressionAssuming into account, a model but linear it can would relationship be extended be the tobetween following: take the power voltage and frequency scalingconsumption into account. and CPU load, the linear regression model Thewould power modelbe the can following: then be used P in= powera estimationU + b, in predicting the (7.5) cpu ⇥ cpu powerConstants consumption withouta and b power are measurements.determined Followingwith regression our example, given that• CPU utilization is 20 % for a give duration T , we can determine the energy where a and b are constants, and Ucpu is CPU utilization. The latter term indi- consumption• The energy with Etotal consumption=(a20+b)T . Theof an energy application consumption that of anuses application the catesthat uses theCPU the power in CPU a dynamic in consumption a dynamic manner manner of thecan can be CPU.be determined determined This with simple with model does not take the DVFS into account, but it can be extended to take the voltage and frequency E = P i T, (7.6) total cpu ⇥ scaling into account. i X The poweri model can then be used in power estimation in predicting the where• GivenPcpu is that the i -thCPU measurement utilization of is CPU 20 % utilization for a give and durationT is the T time , powerintervalwe consumption of thecan measurement. determine without the energy power measurements.consumption with Following Etotal = our example, given that CPU(a20 utilization + b)T. is 20 % for a give duration T , we can determine the energy consumption with Etotal =(a20+b)T . The energy consumption of an application that uses the CPU in a dynamic manner can be determined with

E = P i T, (7.6) total cpu ⇥ i X i where Pcpu is the i-th measurement of CPU utilization and T is the time interval of the measurement. 7.2 Display 121

Single Core Regression Model with 7.1.13 Single Core Regression Model withDVFS DVFS DVFS can significantly improve energy eciency of the CPU. The obtained benefit depends on the idle power consumption of the CPU and the execution • DVFS can significantly improve energy efficiency. time of the tasks with di↵erent CPU voltages and frequency levels. The e↵ect of The effect of DVFS can be examined with the following DVFS can be• examined with the following simple equation [23]: simple equation: E = P t + P (t t), (7.7) ⇥ idle ⇥ max where E gives• where the total E gives energy the total of the energy workload, of theP workload,is the average P is the power over average power over the workload, t is the execution the workload, t is the execution time of the workload, Pidle is the idle power time of the workload, P is the idle power of the CPU, of the CPU, and t is the maximumidle running time of the workload over all and t max is the maximum running time of the workload frequencies. max over all frequencies.

7.1.14 Multicore RegressionA. Carroll and G. ModelHeiser, “An analysis of power consumption in a smartphone,” in Proceedings of the 2010 USENIX conference. Yifan Zhang et al. studied the power modeling of multicore smartphone CPUs and they have identified that the traditional frequency and utilization based regression techniques are prone to errors in the multicore setting [15]. Based on this observation they developed a new regression based power model for multicore CPUs based on the time spent in C states. This model builds on a model of a single core working at frequency f that is the given by the following equation [15]: P = WED + U + c, (7.8) core Ci ⇥ Ci U ⇥ i X where WEDCi is the weighted average entry duration for idle state Ci, Ci and

U are coecients of WEDCi , U is the utilization rate, and c is a constant. A power model is created by obtaining the coecients by linear regression on the training data with WEDCi and U, and the associated Pcore. The multicore model extends the above single core model and it is given by [15]:

NC

PCPU = PBL,NC + P,core,Ui,fi , (7.9) i X where NC is the number of cores, PBL,NC is the baseline CPU power with NC cores, and P,core,Ui,fi is the power increment due to core i when running at frequency fi with utilization of Ui.ThetermP,core,Ui,fi can be predicted using the single core model (NC = 1) and with the measurement of the constant

PBL,NC .

7.2 Display

Display is definitely one of the most energy-consuming components in a mod- ern mobile phone, especially now, when smartphones have started using touch- Multicore Regression Model

• Yifan Zhang et al. studied the power modeling of multicore smartphone CPUs and they have identified that the traditional frequency and utilization based regression techniques are prone to errors in the multicore setting.

• Based on this observation they developed a new regression based power model for multicore CPUs based on the time spent in C states.

Y. Zhang, X. Wang, X. Liu, Y. Liu, L. Zhuang, and F. Zhao, “Towards better cpu power management on multicore smartphones,” in Proceedings of the Workshop on Power-Aware Computing and Systems, ser. HotPower ’13. New York, NY, USA: ACM, 2013. 7.4 Sensors 143

#Temperature# Other#sensors#

#AcceleraAon# #GPS# Sensing#applicaAons#

#Bus#

Sensor#interface#(OS#and# middleware)# Sensor#hub#coKprocessor#

Host#CPU#(maximize#sleep)## Sensor#hub#(alwaysKon)#

Figure 7.19 Overview of a sensor hub

Assuming that anBalancing application consists Between of N stages andCores that we have more than one processor for running the stages, the central question is how to place the • Given that we have a high performance core and a stages acrosslower the processors.performance The core placement for sensing decision tasks, needs it is to clear take that into account the energy characteristicscomputationally of theheavy processors operations as well should as the be processor run on the wakeup and stage/task schedulinghigh performance cost [50]. core. Given that we have a high performance core and a lower performance core for •The low performance core, on the other hand, would be sensing tasks, it is clear that computationally heavy operations should be run suitable for reading sensors and then handing over the on the highdata performance to the high core. performance The low performance core for intensive core, on the other hand, would be suitableprocessing, for reading such sensors as speech and thenrecognition. handing over the data to the high performance core for intensive processing, such as speech recognition. Assuming that the low-cost processor is the most Assuming• that the low-cost processor is the most suitable for a specific com- suitable for a specific computation stage i, we have the putation stage i, we have the following bound for the slow-down of the stage following bound for the slow-down of the stage: [50]: P M P M Etrans/P L active sleep active si < L + M , (7.10) Pactive Ti M M.-R. Ra, B. PriyanthaM , A. Kansal, and J. Liu, “Improving energy efficiency personal sensing applications with where Pactiveheterogeneousand Pmulti-processors,”sleep are in the Proceedings power of the 2012 consumptions ACM Conference on Ubiquitous of the Computing. high performance main core when in active and sleep states, respectively. In a similar fashion, L trans Pactive is the power consumption of the low performance core, E is the M transition cost for stage placement on the main core, and Ti is the execution time of the stage on the main core. The slow-down factor is dependent on the hardware and on the computation. For example, memory size of the processor, data parallel instruction sets, floating Supporting Continuous Sensing

• Continuous sensing involves the constant monitoring of onboard sensors, such as acceleration, microphone or camera. Thus continuous sensing burdens the processor and uses a lot of energy.

• Galaxy S4 smartphone that has a hardware chip for aggregating and optimizing sensor data gathering and processing. Sensorhub

Other Temperature sensors

Acceleration GPS Sensing applications

Bus

Sensor interface (OS and Sensor hub co- middleware) processor

Host CPU Sensor hub (always- (maximize sleep) on) Smartphone Platform Overview Android iOS 8 Firefox OS Symbian Linux Series 60

Low-level Linux Power Management iOS kernel Windows NT Linux Power Kernel-side power Management (Gonk) framework with power management API (Power Manager), peripheral power on/ off High-level Java class PowerManager, I/O Run-time power Gaia (OS Shell) Applications use power JNI binding to OS. Key Framework management Gecko runtime: domain manager that management methods: goToSleep(long), framework Power Management follows system-wide newWakeLock(…), Web API is non- power-state policies. userActivity(long…) standard and reserved for pre- Nokia Energy Profiler BatteryStats monitors installed applications API energy consumption and uses device specific subsystem models.

Energy Wake lock (partial, full) is Coding Multitasking API Asynchronous events Active object, wakeup conservation used to ensure that device patterns, (tasks and push (system messages) events, resource and patterns stays on. multitasking notification), domain manager Methods: Create, acquire, API (since asynchronous Resource lock in release, sensor batching iOS 4), push events API, Power Management API coalesced updates (since iOS7) Policies Wake lock specific flags Internal, Internal Gonk and Gecko Domain manager for and policies, system-wide multitasking level system-wide and power setting API (since domain-wide policy. iOS4) Domain-specific policies are possible Battery The BatteryManager class iOS 3 and Battery class W3C Battery Status Battery API (charge information contains strings and later: provides the battery API level, external power). constants for different UIDevice level, remaining Nokia Energy Profiler battery related notifications Class allows operating time, and that applications can to query/ an event when subscribe to, includes: subscribe battery is below 1%. battery level, temperature, battery info voltage Power Profilers Name / Authors Year Purpose

PowerScope 1999 Energy profiling of device and processes

Joule Watcher 2000 Fine-grained thread-level profiling

Nokia Energy 2006-2007 On-device standalone profiler Profiler Shye et al. 2009 Energy profiling of device and components with a logger application

PowerTutor 2009 Hybrid profiler based on PowerBooter

PowerBooter 2009-2010 Short-term power model for components

BattOr 2011 Portable power monitor

Sesame 2011 Self-constructive on-device power model for device and components

PowerProf 2011 Self-constructive API-level power profiler

MobiBug 2011 Automatic diagnosis of application crashes

Carat 2012-2013 Application energy profiling and debugging eProf 2012 Fine-grained power model for device, components and applications

DevScope 2012 Self-constructive power model for device and components

AppScope 2012 Fine-grained energy profiler for applications based on DevScope eDoctor 2012 Automatic diagnosis of battery drain problems

V-Edge 2013 Self-constructive power model for device and components PowerTutor Computation Offloading

• Based on material contributed by Matti Siekkinen, Eemil Lagerspetz and Juhani Toivonen Environment

2+3"*0-4#.'

Web services

&1-2*./'

&01+%$"*./'

0-01*./' (#)*"+'!"#$%' #,#-%*./'

!"#$%&' Access network What is offloading?

• Consider apps designed and implemented to be run on standalone mobile OS • Execute part of the application code in a remote machine

Offloading Offloading framework app framework return result app app callMethod() app app app app app app Mobile OS Mobile OS Virtualization Smartphone Remote server (Cloud) Offloading work to save energy

• Main objective is to save energy – Tradeoff: less computing with some extra communication – Transfer state back and forth between smartphone and cloud

Energy for communication

Energy for computation

• Often involves dynamic decision making because the tradeoff is not constant • May also improve other performance metrics (response time) – High performance computing in cloud Offloading

• The offloading of a mobile computing task is a trade-off between the energy used for local processing and the energy required for offloading the task, uploading its data, and downloading the result, if necessary.

• One can express the offloading energy trade-off as follows: – Etrade = Elocal − Edelegate > 0, where Elocal is the energy used for complete local execution, and Edelegate is the energy used if the task is offloaded from the perspective of the mobile device.

• If Etrade is greater than zero, then there is an energy benefit for delegating the task to the cloud. Offloading frameworks • Most rely on having source code available – MAUI at Mobisys’10 • Cuervo et al. from Duke, UMass, UCLA, MSR – Cuckoo at MobiCASE’10 • Kemp et al. from Vrije Universiteit – ThinkAir at Infocom’12 • Kosta et al. from DT Labs, Cambridge, Nottingham, Huawei • One modifies the underlying system (VM) – CloneCloud at EuroSys’11 • Chun et al. from Intel Labs, Princeton • Testing with computationally intensive apps delivers impressive results – 45% energy savings for chess AI [MAUI] – 20x speedup and energy savings for a large image search [CloneCloud] What can be offloaded?

• Content processing and transformations – Example: Javascript processing in OperaMini • Completion notification and mobile push • Application execution – Google docs, Windows Office • Connection management – BitTorrent! – Large downloads • Speech recognition – Siri • Positioning (A-GPS) • NVIDIA cloud enhanced 3D graphics

Example of Offloading: Indexing

• Implemented with Dessy: mobile desktop search • To offload indexing – Transmit entire file to the cloud service – Wait for response – Receive file summary • High energy savings can be obtained when offloading CPU-intensive tasks • With N900 and WLAN 700kB/s: 96.5% savings! – 200 000 words, 1 MB file – With WLAN 100kB/s this is reduced to 83.7% Dessy Offloading

Fig. 1. The offloading destinations from the point of view of the mobile device. 53 scheme. This includes offloading to the Dessy Cloud service via UMTS/GPRS mobile Internet or via IEE 802.11 WLAN, as well as using Scavenger to offload computation to surrogates present in the WLAN. In the absence of Internet connectivity in the local WLAN, represented by a dotted line in the Figure, the local surrogates may still be used for Scavenger-based offloading. In the absence of any surrogates, if the WLAN is connected to the Internet, cloud offloading is possible. As we will see below, the choice of the optimal offloading method and communication technology depends mostly on the upload rate of the N900 and the energy cost of transmitting. From the mobile device’s perspective, offloading indexing consists of uploading a document for indexing, waiting for the result to become ready, and downloading the result. In the experiments, the energy consumed by the entire transaction is measured. Fig. 2. The Power Monitor, powering the N900 through a fake battery. In the experimental setup, we used the Nokia N900 mobile device, with Bluetooth and FM radio disabled. The energy use was measured using a Monsoon Power Monitor device2. The omitting U from the formula above. To obtain mW h values, Power Monitor allows recording of (time, current, voltage)- one can just multiply the mAh value by 4. Note that the Power tuples at a sample rate of 5 kHz. Internally, the device is able Monitor records current in mA. to measure at 50 kHz, and the recorded samples are averages of 10 measurements each. Every experiment run proceeded in the following manner: The N900 device was powered through the Power Mon- 1) The experiment script was started and the N900 screen itor using a fake battery. No USB cable was connected to is turned off the N900 during the experiments. Figure 2 shows the Monsoon 2) Measurement was started in Monsoon PowerTool Power Monitor and the N900 along with the fake battery. The 3) Several seconds of idle time elapsed input voltage was set to 4 V. The formula to obtain mW h 4) The N900 started an upload of the chosen file to the values from the current, voltage and time tuples is target server using the chosen communication method 5) The N900 finished the upload and started waiting for E = I U t/3600, (1) ⇥ ⇥ the result to be ready 6) The N900 downloaded the result where E is the energy in mW h, I is the current in mA,U 7) The experiment script finished is the voltage in V , and t is the time in seconds. The Power 8) Several seconds of idle time elapsed Monitor gave a steady voltage of 4V in all of the experiments. 9) Measurement was stopped in Monsoon PowerTool We will therefore use mAh values in the rest of this paper, The idle time was used to find the beginning and end 2http://msoon.com/LabEquipment/PowerMonitor/ of the active experiment time in the measured data. When

Cloudlets

Cloudlet architecture from CMU consists of customized ephemeral Cloudlets on a network virtual machines with soft state, and a platform for running them

Deploy applications near the users to avoid latency and bandwidth problems

Facilitates elastic and mobile execution of network components in base stations

Network support for computation Constructed from clip art from pixabay.com offloading?