electronics

Article Phase-Based Accurate Power Modeling for Mobile Application Processors

Kitak Lee 1,2 , Seung-Ryeol Ohk 1, Seong-Geun Lim 1 and Young-Jin Kim 1,*

1 Department of Electrical and Computer Engineering, Ajou University, Suwon 16499, Korea; [email protected] (K.L.); [email protected] (S.-R.O.); [email protected] (S.-G.L.) 2 System LSI Business, Samsung Electronics Co., Ltd., Hwaseong 18448, Korea * Correspondence: [email protected]

Abstract: Modern mobile application processors are required to execute heavier workloads while the battery capacity is rarely increased. This trend leads to the need for a power model that can analyze the power consumed by CPU and GPU at run-time, which are the key components of the application processor in terms of power savings. We propose novel CPU and GPU power models based on the phases using performance monitoring counters for . Our phase-based power models employ combined per-phase power modeling methods to achieve more accurate power consumption estimations, unlike existing power models. The proposed CPU power model shows estimation errors of 2.51% for ARM Cortex A-53 and 1.97% for Samsung M1 on average, and the proposed GPU power model shows an average error of 8.92% for the Mali-T880. In addition, we integrate proposed CPU and GPU models with the latest display power model into a holistic power model. Our holistic power model can estimate the smartphone0s total power consumption with an error of 6.36% on average while running nine 3D game benchmarks, improving the error rate by about 56% compared   with the latest prior model.

Citation: Lee, K.; Ohk, S.-R.; Lim, Keywords: power model; power estimation; application processors; phase classification; smartphones S.-G.; Kim, Y.-J. Phase-Based Accurate Power Modeling for Mobile Application Processors. Electronics 2021, 10, 1197. https://doi.org/ 1. Introduction 10.3390/electronics10101197 According to YouGov, a UK international market research firm, the most respondents

Academic Editor: Akash Kumar answered “Longer battery life” to the question “Which one of the following features would you most like your current to have?” [1]. Therefore, power management on a Received: 14 March 2021 smartphone is very important, and we need to carefully analyze the power consumed by Accepted: 11 May 2021 each component of the smartphone. Published: 17 May 2021 Figure1 shows the power breakdowns of a Galaxy S3 smartphone [ 2] and a Galaxy S7 smartphone, respectively. Figure1b shows the result estimated by the power models Publisher’s Note: MDPI stays neutral proposed in this paper while an application processor executes twenty 3D game benchmark with regard to jurisdictional claims in applications. As shown in Figure1, since the power consumption ratio of CPU+GPU over published maps and institutional affil- the total system power has increased from 57% to 85%, it is very important to analyze CPU iations. and GPU power accurately for power management on smartphones. This is because bigger power consumption may give rise to more power estimation error. This trend makes low-power techniques more important such as dynamic voltage frequency scaling (DVFS). DVFS is a technique that adjusts frequency and voltage to Copyright: © 2021 by the authors. achieve the optimal trade-off point between power and performance. Many DVFS policies, Licensee MDPI, Basel, Switzerland. which are implemented in software at the Linux kernel level, employ information like This article is an open access article utilization to know the workload of a processor. If DVFS software can get information distributed under the terms and about power consumption of processors, there will more chances for DVFS policies to conditions of the Creative Commons make bigger energy efficiency. However, it is very difficult to estimate accurately the power Attribution (CC BY) license (https:// consumption of each processor at run-time without a built-in hardware sensor. creativecommons.org/licenses/by/ 4.0/).

Electronics 2021, 10, 1197. https://doi.org/10.3390/electronics10101197 https://www.mdpi.com/journal/electronics ElectronicsElectronics 20212021, 10,, 119710, 1197 2 of 19 2 of 19

CPU GPU Display Base CPU GPU Display Base 4%

11% 25% 29% 40%

18% 56% 17%

(a) (b) FigureFigure 1. 1.SmartphoneSmartphone power power breakdowns breakdowns while while running running game applications onon ((aa)) GalaxyGalaxy S3, S3, (b) Galaxy(b) Galaxy S7. S7.

In this paper, we propose new phase-based power models that can estimate the power In this paper, we propose new phase-based power models that can estimate the consumption of CPU and GPU accurately, which are the core components of a mobile powerapplication consumption processor of (AP), CPU using and theGPU performance accurately monitoring, which are counter the core (PMC) components at run-time. of a mo- bileSince application our proposed processor power (AP), models using can the provide performance the power monitoring consumption counter information (PMC) of at run- time.each Since component our proposed every time powe interval,r models the power can provide management the power drivers consumption can implement information more ofpower-efficient each component management every time policies interval, using th suche power information. management In addition, drivers a can program implement moreprofiler power-efficient can provide software management engineers policies with powerusing consumptionsuch information. information In addition, of their a pro- gramprograms. profiler For can example, provide it couldsoftware be embeddedengineers inwith a system power analyzer consumption program information such as of theirARM programs. DS5-streamline For example, [3] to show it could developers be embedded the run-time in a system power consumptionanalyzer program of each such as ARMcomponent DS5-streamline while applications [3] to show are running. developers the run-time power consumption of each In detail, we propose a novel power modeling method which combines phase clas- componentsification and while PMC-based applications power are modeling. running. In addition, we integrate the proposed CPU andIn GPU detail, power wemodels propose with a novel an existing power state-of-the-art modeling method display which power combines model to derive phase a classi- ficationholistic and power PMC-based model that power can estimate modeling. the total Inpower addition, consumption we integrate of a smartphone. the proposed In CPU andextensive GPU power experiments, models we with show an how existing accurately state- ourof-the-art holistic power display model power can model estimate to the derive a holisticsystem power0s power model consumption that can while estimate executing the varioustotal power 3D game consumption benchmarks of on a a smartphone. Galaxy S7 In extensivesmartphone experiments, with an Android we show 7.1 platform. how accura The contributionstely our holistic of our power work are model summarized can estimate theas system follows.′s power consumption while executing various 3D game benchmarks on a Gal- axy• S7The smartphone proposed with power an model Android achieves 7.1 platform. the accuracy The improvementcontributions over of our the work existing are sum- marizedpower as follows. models by using multiple PMC-based per-phase power models and proposed phase classification and phase prediction methods for both CPU and GPU. • The proposed power model achieves the accuracy improvement over the existing • For the optimal GPU power model, optimal PMC events for linear regression predic- powertors are models searched by by using the geneticmultiple algorithm PMC-base [4]. d per-phase power models and proposed • phaseA holistic classification power model and isphase completed, prediction which methods estimates for the both smartphone CPU and total GPU. power • Forconsumption the optimal accurately, GPU power including model, the optimal power consumption PMC events of CPU,for linear GPU, regression display, and predic- torscomponents are searched connected by the to genetic them. algorithm [4]. • • AAll holistic the power power models model have is beencompleted, proven towhich be practical estimates because the allsmartphone implementations total power consumptionand verifications accurately, are conducted including on recent the smartphonespower consumption with an Android of CPU, platform. GPU, display, andThe components remainder of connected this paper isto organizedthem. as follows. Section2 introduces previous • studiesAll the about power CPU andmodels GPU have power been models. proven Section to3 be describes practical the because motivation all ofimplementations our work. Then,and Section verifications4 describes are an conducted overview of on our recent CPU andsmartphones GPU power with models, an Android and Sections platform.5 and6 explain our novel CPU and GPU power modeling approaches in detail. Sections7 andThe8 explain remainder the experimental of this paper environment is organized and experimental as follows. results, Section respectively. 2 introduces Finally, previous studiesSection about9 concludes CPU thisand paper. GPU power models. Section 3 describes the motivation of our work. TheThen, terminologies Section 4 describes used in the an remaining overview parts of ofour this CPU paper and are GPU described power as follows.models, and SectionsFirst, an 5 eventand 6 type explain means our the novel type ofCPU available and GPU events power that a modeling PMC can count.approaches The PMC in detail. Sectionsneeds to 7 beand set 8 up explain about whatthe experimental type of events en tovironment count before and it starts. experimental Second, an results, event set respec- tively.means Finally, the set ofSection six event 9 concludes types. Finally, this an paper. interval vector refers to a vector that consists of countedThe terminologies values for six preset used eventin the types remaining and for parts a clock of counter this paper during are a described time interval. as follows. First, an event type means the type of available events that a PMC can count. The PMC needs to be set up about what type of events to count before it starts. Second, an event set means the set of six event types. Finally, an interval vector refers to a vector that consists of counted values for six preset event types and for a clock counter during a time interval.

Electronics 2021, 10, 1197 3 of 19

2. Related Work 2.1. Utilization-Based Power Model Utilization is the ratio of a non-idle period to a time interval and means the load of a processor. A utilization-based power model builds a power model using a linear regression method and uses utilization as a regression predictor. Yoon et al. [5] expressed a CPU power model as the sum of power consumed by each core and an uncore (Uncore means the resource of a microprocessor that is not in the core). They used the utilization of each core as predictors and uncore power as a constant to build a CPU power model. Zhang et al. [6] developed a utilization-based model that is compensated for the transition overhead incurred when the CPU0s idle state changes from deep to shallow. Kim et al. [7] built a utilization-based GPU power model. They used GPU utilization as a regression predictor and get regression coefficients and constants at each frequency step. The utilization-based model is so basic that it can be implemented quickly on any device, but it cannot predict power consumption of processors accurately. This is because the power consumption varies greatly depending on the type of instruction being executed, even with the same utilization. Table1 summarizes the existing power models including utilization-based power models.

Table 1. Comparisons of the prior power models for mobile devices.

Power Model Device Type Device Name CPU GPU Base of Method Yoon [5] Smartphone O O Utilization-based Zhang [6] Smartphone O X Utilization-based Kim [7] Smartphone 4 X O Utilization-based Pricopi [8] Embedded Board ARM development platform O X PMC-based Jin [9] Smartphone Samsung Galaxy S4 X O PMC-based Walker [10] Embedded board Hardkernel Odroid-XU3 O X PMC-based Hardkernel Odroid-U3 & Kim [11] Embedded board O X Phase-based Hardkernel Odroid-XU3 Zheng [12] Embedded board Hardkernel Odroid-XU3 O X Phase-based

2.2. PMC-Based Power Model PMCs are a set of special-purpose registers built in modern microprocessors to store the counts of hardware-related activities such as cache accesses or the number of executed instructions. PMC events can be used for power estimation, since PMC events covers overall hardware operations in a microprocessor. In addition, a PMC-based power model can be implemented on smartphones because most of their micro-architectures support PMCs. Pricopi et al. [8] developed a linear regression-based power model. They used PMC events such as cache accesses, memory accesses, clock cycles, and number of instructions, as predictors of linear regression. Jin et al. [9] analyzed which events represent utilization of each GPU pipeline unit. They built the GPU power model by using these PMC events as predictors. Walker et al. [10] have done a high-level study by almost fully automating event selection, which considers what events are suitable as predictors. However, their study has a limitation, as they consistently use only six events whatever the running workload is. These events may have good performance on average on 60 workloads, but they are not always best for each workload, since six is insufficient to cover different characteristics of execution behaviors in various workloads.

2.3. Phase Classification and Its Application A phase is defined to be a set of intervals that have similar execution behavior, where an interval means a time section of continuous execution. Thus, all intervals in the same phase have similar execution behaviors. In order to classify each interval into a phase, Electronics 2021, 10, 1197 4 of 19

Sherwood et al. [11] used a basic block vector (BBV) which keeps track of the execution frequencies of basic blocks. Then, they computed the similarity of the intervals using a vector distance and classified them into distinct phases using K-means clustering [12]. However, per-phase patterns of their work are the program patterns because basic blocks are generated by a compiler. In order to use hardware operation patterns rather than patterns of BBV,Kim et al. [13] and Zheng et al. [14] used PMC events for phase classification and devised PMC-based phase classification for power estimation. Kim et al. [13] used phase classification for accurate cross-platform power/performance estimation. They created the cumulative distribution function (CDF) graphs of each event for different phases and different platforms. They observed that different phases for each platform could present very different trends, while the events in the same phase behave very similarly across platforms. They called it as per-phase pattern trained the neural network model to capture pattern within a phase. Zheng et al. [14] also utilized a program phase based on dynamic basic blocks for cross-platform power/performance estimation. They applied simplex-constrained quadratic programming (SCQP) to obtain a linear predictive model specific for each program phase because they were motivated by program’s homogeneous d behavior across platforms at individual phase level. In their work, θi ∈ R denotes unknowns at phase i and are determined by SCQP. Thus, they built a power model for each phase and each of completed models is only used for a corresponding phase.

3. Motivational Studies For an ARM CPU, the maximum number of events that can be collected simultaneously is six, since its core has 6 PMCs. Thus, Walker et al. [10] proposed a variable selection method that selects proper six event types to be used as regression predictors of a power model. Then, by eliminating multicollinearity between the six selected events, they tried to build an accurate power model. However, they just used only one event set, which consists of six event types, whatever a running program is. To improve this, we investigated which set of events can produce better results on a specific phase. To this end, we first ran 75 workloads on the Cortex-A53 CPU and calculated the LDST_inst_rate value for each workload. LDST_inst_rate is defined as (1).

# of load store instructions LDST_inst = (1) rate # of total instructions

We classified three workloads with the three lowest LDST_inst_rate value as Group A (i.e., CPU bound workloads) and three workloads with the three highest LDST_inst_rate value as Group B (i.e., memory bound workloads). Table2 shows the details of the three event sets. The event type related to the CPU pipeline among all events, called CPU_EV, are highlighted in bold, and the event type related to the cache and memory systems, called MEM_EV, are not highlighted in Table2. Event set 2 is the same event set used by Walker et al. [10] that consists of three CPU_EVs and three MEM_EVs. Unlike Event set 2, Event set 1 consists of four CPU_EVs and two MEM_EVs and is designed to minimize multicollinearity as much as possible. Event set 3 consists of two CPU_EVs and four MEM_EVs and is also designed to be minimize multicollinearity as much as possible. Figure2 shows the mean average percentage errors (MAPEs) of the power models built with each event set on each workload group (i.e., Group A and Group B) over real power measurement. First, the model built with Event set 2 shows acceptable errors for both groups, but the results are not excellent. On the other hand, the model built with Event set 1, which consists of four CPU_EVs and 2 MEM_EVs, shows excellent results in Group A (i.e., CPU bound workloads). In contrast, the model built with Event set 3 shows excellent results in Group B. These results suggest that a specific event set can produce better results on a specific phase. Therefore, by selecting an optimal event type corresponding to a specific phase or workload at run-time, it will be possible to estimate power consumption more accurately. Electronics 2021, 10, 1197 5 of 19

Table 2. Three event sets used in motivational studies.

Event Set 1 Event Set 2 Event Set 3 0 × 1B INST_SPEC 0 × 1B INST_SPEC 0 × 1B INST_SPEC 0 × 72 LDST_SPEC 0 × 50 L2D_CACHE_LD 0 × 72 LDST_SPEC 0 × 75 VFP_SPEC 0 × 6A UNALIGNED_LDST_SPEC 0 × 04 L1D_CACHE 0 × 73 DP_SPEC 0 × 73 DP_SPEC 0 × 16 L2D_CACHE 0 × 03 L1I_CACHE 0 × 14 L1I_CACHE 0 × 61 BUS_ACCESS_ST 0 × 17 L2D_REFILL 0 × 19 BUS_ACCESS 0 × 17 L2D_REFILL No. of CPU_EV: 4 No. of CPU_EV: 3 No. of CPU_EV: 2 No. of MEM_EV: 2 No. of MEM_EV: 3 No. of MEM_EV: 4

Electronics 2021, 10, 1197 5 of 19 Figure 2 shows the mean average percentage errors (MAPEs) of the power models built with each event set on each workload group (i.e., Group A and Group B) over real

powerTable measurement. 2. Three event sets usedFirst, in motivationalthe model studies. built with Event set 2 shows acceptable errors for both groups, but the results are not excellent. On the other hand, the model built with Event Set 1 Event Set 2 Event Set 3 Event set 1, which consists of four CPU_EVs and 2 MEM_EVs, shows excellent results in 0 × 1B INST_SPEC 0 × 1B INST_SPEC 0 × 1B INST_SPEC 0 × 72 LDST_SPECGroup A0 (i.e.,× 50 CPU bound L2D_CACHE_LD workloads). In cont0 × 72rast, the model LDST_SPEC built with Event set 3 shows 0 × 75 VFP_SPECexcellent 0 ×results6A in UNALIGNED_LDST_SPEC Group B. These results0 ×su04ggest that L1D_CACHE a specific event set can produce 0 × 73 DP_SPEC 0 × 73 DP_SPEC 0 × 16 L2D_CACHE 0 × 03 L1I_CACHEbetter results0 × 14 on a specific L1I_CACHE phase. Therefore,0 × 61 by selecting BUS_ACCESS_ST an optimal event type corre- 0 × 17 L2D_REFILLsponding 0 ×to19 a specific phase BUS_ACCESS or workload at 0 ×run-time,17 it will L2D_REFILL be possible to estimate power No. of CPU_EV: 4 No. of CPU_EV: 3 No. of CPU_EV: 2 No. of MEM_EV: 2consumption more No. of MEM_EV:accurately. 3 No. of MEM_EV: 4

Group A(CPU workloads) Groups B(MEM workloads) 50 44.79 45 40 35 30 25 20 16.15 16.69 15 12.73 10 3.49 4.815 5 0 Event set 1 Event set 2 Event set 3 (4+2) (3+3) (2+4)

Figure 2. The mean average percentage error (MAPE) of power models built with two workload Figure 2. The mean average percentage error (MAPE) of power models built with two workload groups with the three event sets (unit: %). groups with the three event sets (unit: %). Kim et al. [13] classified phases using PMC events and trained a neural network to capture per-phase patterns. However, they used only a single power model for different phasesKim rather et thanal. [13] different classified power models phases for using different PMC phases. events Zheng and et al. trained [14] built a neural network to captureper-phase per-phase power models patterns. and selectively Howe usedver, them they according used to only the phase. a single However, power they model for different phasesused the rather same event than set different for all per-phase power power models models for rather different than different phases. event Zheng sets et al. [14] built per- for different per-phase power models. Therefore, to the best of our knowledge, there has phasebeen no power study of usingmodels different and event selectively sets for different used phases. them We according devise per-phase to the power phase. However, they usedmodels the based same on regression event set and for employ all per-phase them in an optimally power selective models manner. rather than different event sets for different per-phase power models. Therefore, to the best of our knowledge, there has 4. Overview of the Proposed Method beenFigure no study3 shows of the overallusing processdifferent of the event proposed sets CPU for power different model. The phases. process inWe devise per-phase powerFigure3 ismodels repeated based for every on time regression interval. Our and CPU employ power modelthem proceedsin an optimally as follows. selective manner. First, it predicts a phase using a proposed phase prediction method. After the phase 4.is Overview predicted, the of second the Proposed is the event Method counting step. In this step, PMCs count events corresponding to the predicted phase and an interval vector is made. There is a suitable eventFigure set for each 3 shows phase andthe how overall to determine process an of appropriate the proposed event set CPU for each power phase model. The process inis describedFigure 3 inis Sectionrepeated 5.3. Thefor thirdevery step time is phase interval. classification. Our CPU Our power method classifiesmodel proceeds as follows. phases using the interval vector. Thus, it is not until event counting is finished that the First,phase isit classified.predicts a phase using a proposed phase prediction method. After the phase is predicted,After classifying the second the phase is the of the event time interval, counting we compare step. In the this classified step, phase PMCs and count events corre- spondingthe predicted to phase the topredicted determine whichphase power and modelan interval to use in vector estimating is themade. processor’s There is a suitable event power consumption. If the prediction is correct when comparing the predicted phase and the classified phase, power consumption is estimated using the accurate per-phase power model. However, if the prediction is incorrect, the per-phase power model cannot be used because the event types used the in per-phase power model and the types of collected

Electronics 2021, 10, 1197 6 of 19

set for each phase and how to determine an appropriate event set for each phase is de- scribed in Section 5.3. The third step is phase classification. Our method classifies phases using the interval vector. Thus, it is not until event counting is finished that the phase is classified. After classifying the phase of the time interval, we compare the classified phase and the predicted phase to determine which power model to use in estimating the processor’s power consumption. If the prediction is correct when comparing the predicted phase and the classified phase, power consumption is estimated using the accurate per-phase power model. However, if the prediction is incorrect, the per-phase power model cannot be used because the event types used the in per-phase power model and the types of collected events do not match. Even if the phase prediction fails, the power consumption during the time interval must be estimated. Thus, we employ a general but relatively inaccurate power model, which can be used even if the phase prediction fails. Meanwhile, the phase prediction step is needed to use more than one per-phase power model because different per-phase power models use different event types. Therefore, it is necessary to predict the phase before event counting and count the events of the phase to be expected. In Section 5, we describe in detail the phase prediction step, phase classification step, and how to build per-phase power and general power models. Electronics 2021, 10, 1197 Figure 4 shows the overall process of the proposed GPU power6 model. of 19 There is no phase prediction step because Mali Midgard GPU can collect all event types simultane- ously. Thus, all event types of all per-phase power models can be covered by simply eventscounting do not all match. available Even ifevents. the phase Therefore, prediction th fails,e GPU the power power consumption model consists during of three steps. theFirst, time all interval events must are be counted estimated. by Thus, PMCs, we employand an a generalinterval but vector relatively is made. inaccurate Second, the phase powerof the model, time whichinterval can is be classified used even ifby the the phase interval prediction vector. fails. Then, Meanwhile, the per-phase the phase power model prediction step is needed to use more than one per-phase power model because different per-phasecorresponding power models to the use classified different eventphase types. is used Therefore, to estimate it is necessary power to consumption predict the of GPU. In phaseSection before 6, eventwe describe counting in and detail count the events phase of classification the phase to be expected.step for InGPU Section and5, how to build a weper-phase describe in GPU detail power the phase model. prediction step, phase classification step, and how to build per-phase power and general power models.

[1] [2] [3] Phase Counting Phase prediction event values classification

Classified Predicted same? phase phase

Yes No

[4-1] [4-2] Per-phase power A general power models model (7 events per model) (4 events)

Figure 3. Overall process of the proposed CPU power model. FigureFigure 3. Overall4 shows process the overall of the process proposed of the CPU proposed power GPU model. power model. There is no phase prediction step because Mali Midgard GPU can collect all event types simultaneously. Thus, all event types of all per-phase power models can be covered by simply counting all available events. Therefore, the GPU power model consists of three steps. First, all events are counted by PMCs, and an interval vector is made. Second, the phase of the time Electronics 2021, 10, 1197 7 of 19 interval is classified by the interval vector. Then, the per-phase power model corresponding to the classified phase is used to estimate power consumption of GPU. In Section6, we describe in detail the phase classification step for GPU and how to build a per-phase GPU power model.

[1] [2] [3] Counting Phase Per-phase Event values classification power models

Figure 4. Overall process of the proposed GPU power model. Figure 4. Overall process of the proposed GPU power model. 5. CPU Power Modeling 5.1. Phase Classification 5.1.1.5. CPU Motivation Power Modeling 5.1. WePhase made Classification two workload groups and performed a simple experiment on a big core. Firstly,5.1.1. Motivation workloads of the first group consist of CPU related instructions. We measured We made two workload groups and performed a simple experiment on a big core. Firstly, workloads of the first group consist of CPU related instructions. We measured the average power consumption of a big core while it executes these workloads. The mini- mum and maximum power consumptions of the workloads were 2588 mW and 3069 mW, respectively. Secondly, workloads of the second group consist of memory related instruc- tions. The workloads in the second group have different working set sizes (i.e., the domain size of memory addresses) and stride. When we measured each average power consump- tion of these workloads, the minimum and maximum power consumption of the work- loads were 2933 mW and 4415 mW, respectively. These results suggest that when we clas- sify phases, we have to consider how CPU uses memory rather than what CPU instruc- tions are. Therefore, we based our phase classification on how frequently CPU uses memory and what memory device CPU mainly uses. Figure 5 shows a schematic figure of our CPU phase classification. Our method classifies the behavior of CPU during an interval into five phases using three decision values (𝑥 and 𝑥 for LDST_inst_rate, 𝑦 for Cache_miss_rate).

CPU + DRAM DRAM phase phase CPU 𝑦 phase CPU + CACHE CACHE Cache_miss_rate phase phase

𝑥 LDST_inst_rate 𝑥

Figure 5. Visualization of CPU phase classification.

5.1.2. Phase Classification LDST_inst_rate and Cache_miss_rate are used for phase classification. They are shown in Figure 5 on the x and y axes. Since the LDST_inst_rate represents the ratio of memory- related instructions to total instructions, LDST_inst_rate can decide how frequently CPU uses memory. Similarly, Cache_miss_rate decides what memory device CPU mainly uses. For example, if the value of LDST_inst_rate is greater than 𝑥 and the Cache_miss_rate is less than 𝑦, our method determines that the current CPU’s phase is the CACHE phase. # 𝑜𝑓 𝑐𝑎𝑐ℎ𝑒 𝑚𝑖𝑠𝑠𝑒𝑠 𝐶𝑎𝑐ℎ𝑒_𝑚𝑖𝑠𝑠_𝑟𝑎𝑡𝑒 = (2) # 𝑜𝑓 𝑙𝑜𝑎𝑑 𝑠𝑡𝑜𝑟𝑒 𝑖𝑛𝑠𝑡𝑟𝑢𝑐𝑡𝑖𝑜𝑛𝑠 Equations (1) and (2) represent LDST_inst_rate and Cache_miss_rate, and we can get them using counted event values. Table 3 shows Event A, Event B, and Event used when calculating LDST_inst_rate and Cache_miss_rate. These events represent the number of

Electronics 2021, 10, 1197 7 of 19

[1] [2] [3] Counting Phase Per-phase Event values classification power models

Figure 4. Overall process of the proposed GPU power model.

5. CPU Power Modeling 5.1. Phase Classification 5.1.1. Motivation We made two workload groups and performed a simple experiment on a big core. Firstly, workloads of the first group consist of CPU related instructions. We measured the average power consumption of a big core while it executes these workloads. The mini- Electronics 2021, 10, 1197 mum and maximum power consumptions of the workloads were 25887 of 19 mW and 3069 mW, respectively. Secondly, workloads of the second group consist of memory related instruc- tions. The workloads in the second group have different working set sizes (i.e., the domain the average power consumption of a big core while it executes these workloads. The sizeminimum of memory and maximum addresses) power and consumptions stride. When of the we workloads measured were each 2588 mW average and power consump- tion3069 of mW, these respectively. workloads, Secondly, the workloads minimum of the and second maximum group consist powe of memoryr consumption related of the work- loadsinstructions. were 2933 The workloads mW and in 4415 the second mW, group respectively. have different These working results set sizes suggest (i.e., the that when we clas- domain size of memory addresses) and stride. When we measured each average power sifyconsumption phases, ofwe these have workloads, to consider the minimum how andCPU maximum uses memory power consumption rather than of the what CPU instruc- tionsworkloads are. wereTherefore, 2933 mW we and based 4415 mW, our respectively. phase cl Theseassification results suggest on how that when frequently CPU uses memorywe classify and phases, what we memory have to consider device how CPU CPU mainly uses memory uses. rather Figure than 5 what shows CPU a schematic figure instructions are. Therefore, we based our phase classification on how frequently CPU uses ofmemory our CPU and whatphase memory classification. device CPU mainlyOur method uses. Figure classifies5 shows a schematicthe behavior figure of CPU during an intervalof our CPU into phase five classification. phases using Our methodthree classifiesdecision the values behavior (𝑥 of CPUand during 𝑥 for an LDST_inst_rate, 𝑦 forinterval Cache_miss_rate into five phases). using three decision values (x1 and x2 for LDST_inst_rate, y1 for Cache_miss_rate).

CPU + DRAM DRAM phase phase CPU 𝑦 phase CPU + CACHE CACHE Cache_miss_rate phase phase

𝑥 LDST_inst_rate 𝑥

Figure 5. Visualization of CPU phase classification. Figure 5. Visualization of CPU phase classification. 5.1.2. Phase Classification 5.1.2. LDST_inst_ratePhase Classificationand Cache_miss_rate are used for phase classification. They are shown in Figure5 on the x and y axes. Since the LDST_inst_rate represents the ratio of memory- relatedLDST_inst_rate instructions to total and instructions, Cache_miss_rateLDST_inst_rate are usedcan decide for phase how frequently classification. CPU They are shown inuses Figure memory. 5 on Similarly, the x andCache_miss_rate y axes. Sincedecides the what LDST_inst_rate memory device CPU represents mainly uses. the ratio of memory- For example, if the value of LDST_inst_rate is greater than x and the Cache_miss_rate is less related instructions to total instructions, LDST_inst_rate2 can decide how frequently CPU than y2, our method determines that the current CPU’s phase is the CACHE phase. uses memory. Similarly, Cache_miss_rate decides what memory device CPU mainly uses. # of cache misses For example, if theCache value_miss of_rate LDST_inst_rate= is greater than 𝑥 and the(2) Cache_miss_rate is # of load store instructions less than 𝑦, our method determines that the current CPU’s phase is the CACHE phase. Equations (1) and (2) represent LDST_inst_rate and Cache_miss_rate, and we can get them using counted event values. Table3 shows Event A,# Event 𝑜𝑓 B,𝑐𝑎𝑐ℎ𝑒 and Event C 𝑚𝑖𝑠𝑠𝑒𝑠 used 𝐶𝑎𝑐ℎ𝑒_𝑚𝑖𝑠𝑠_𝑟𝑎𝑡𝑒 = (2) when calculating LDST_inst_rate and Cache_miss_rate#. These𝑜𝑓 events𝑙𝑜𝑎𝑑 represent 𝑠𝑡𝑜𝑟𝑒 the number 𝑖𝑛𝑠𝑡𝑟𝑢𝑐𝑡𝑖𝑜𝑛𝑠 of load store instructions, the number of any instructions, and the number of cache misses, respectively.Equations However, (1) and they can(2) berepresent replaced with LDST_inst_rate different event types and because Cache_miss_rate some event , and we can get types are not available on a specific microarchitecture. For ARM cortex-A53, we used themBUS_ACCESS using counted (0x19) as event Event C values. instead ofTableL2D_CACHE_REFILL 3 shows Event (0x17) A, Event. This is B, because and Event C used when calculatingBUS_ACCESS LDST_inst_rate (0x19) indirectly represents and Cache_miss_rate the number of cache. These misses sinceevents once represent a cache the number of miss occurs, a memory bus access also occurs. Table3 also shows the x1, x2, and y1 used when deciding the phase. They were set through extensive experiments because proper values for them vary greatly depending on the unit of Event A, Event B, and Event C.

Electronics 2021, 10, 1197 8 of 19

Table 3. Parameters for CPU phase classification.

Microarchitecture Parameters ARM Cortex-A53 Samsung M1 Event A MEM_ACCESS(0x13) LDST_SPEC(0x72) Event B INST_RETIRED(0x08) INST_SPEC(0x1B) Event C BUS_ACCESS(0x19) L2D_CACHE_REFILL(0x17) x1 0.1 0.1 x2 0.4 0.4 y1 0.1 0.01

5.2. Per-Phase Power Model Equation (3) represents the per-phase power model.

6 CPU Pdynamic = β0EC + ∑ βnEn (3) n=1

In (3), EC is the value obtained from the clock counter divided by sampling time, and En is the value obtained from the n-th event counter divided by sampling time. In (3), the first term means constant dynamic power [10]. It has a constant property in that it is consumed constantly when CPU is in the running state (i.e., P-state), regardless of how many and what activities occur. The second term means dynamic power that depends on how many and what activities occur in CPU [10]. Each per-phase power model can be built by determining the unknown coefficients β0 and βn using linear regression. We execute all workloads to collect interval vectors used for linear regression, and we classified these interval vectors into five phases using the proposed phase classification method shown in Section 5.1. Finally, each per-phase power model is built with only interval vectors for the corresponding phase.

5.3. Variable Selection As is shown in Section3, it is important to select proper event types because the performance of the per-phase model depends on which event types are used as predictors. Therefore, each per-phase power model should use proper event types as predictors by a variable selection method for better performance. An event set consists of 6 event types. The aim of this step is to select the three event types for each event set among 6 event types. This is because the preset three event types should always be included in every event set for phase classification shown in Section 5.1. Table3 shows these three event types for LDST_inst_rate and Cache_miss_rate. The three remaining event types, which are not included in event sets yet, must be selected. Depending on how we choose them, there is a room for improving the performance of each per-phase model. Therefore, we aimed to test all combinations of three event types to select an optimal combination. We needed to test 52C3 = 22,100 cases for a big core and 37C3 = 7770 cases for a little core because the number of event types for a big core is 52 and that for a little core is 37. However, all workloads must be executed once for each case. It is time consuming to executing all workloads once per each case. Thus, we make a data matrix to save experiment time. The data matrix is a matrix that contains approximate event values for all event types at each time interval while all workloads are running. Once we get it, we can test every case without executing every workload per case. Algorithm 1 shows how we make it. It generates interval vectors with 8 elements (6 event data, 1 clock counter data, 1 power consumption data) to execute all workloads once. In other words, an n by 8 matrix with n interval vectors is made. However, it should be an n by 54 matrix (52 event data, 1 clock counter data, 1 power consumption data) since the data matrix must contain every event value at each time interval. We use the algorithm shown in Algorithm 1 to make this n by 54 data matrix. Electronics 2021, 10, 1197 9 of 19

First, the algorithm in Algorithm 1 measures the average execution time of k-th workload (lines 4–6). The number of collected event values is NPMC when a workload is executed. Thus, each workload must be executed repeatedly NSUBSET times to collect all event types (line 7). The algorithm prepares to collect event types corresponding to the subset ESi (lines 8–9). The algorithm repeats to execute the workload until the execution time is within a certain range of error τ. If the error exceeds the threshold τ, the interval vectors is invalid. It iterates to run the workload and collect event values for a subset ESi until a valid interval vector is made (lines 10–18). After all interval vectors for all workloads are stored in the matrix M, the algorithm divides the columns of the clock counter and power consumption by NSUBSET to find the average and returns M (lines 23–24).

Algorithm 1 building data matrix Input: CPU workloads τ: Tolerance threshold n: The number of workloads NEV: The number of available event types NPMC: The number of PMCs E: A set of all event types ESi: i-th Subset of E (ES0 ∪ ES1 ... ∪ ESn = E, ES0 ∩ ES1 ... ∩ ESn = ∅,) 1 NSUBSET = NEV / NPMC + 1; sumvnum = 0; 2 Nmax = NSUBSET × NPMC; 3 For (k = 0; kτ DO 11 execute workload k; 12 start PMC event counting and measure power consumption; 13 Ttemp = execution time;  14 err = Tk − Ttemp / 50ms; 15 vnum = the number of interval vectors; 16 M[sumvnum : sumvnum + vnum, c : c + NPMC] = the no. of events per second; 17 M[sumvnum : sumvnum + vnum, Nmax : Nmax + 2]+ = power measurement and clock counter value per second; 18 END WHILE 19 c+ = NPMC; 20 END FOR 21 sumvsum+ = vnum; 22 END FOR 23 M[0 : sumvnum, Nmax : Nmax + 2]/= NSUBSET 24 Return M

We can quickly test every case using this data matrix. However, the best case may not be best because of inaccuracy of the data matrix. The data matrix just shows an approximate trend, representing approximate values when PMCs collects the event types simultaneously. This is because a building process of data matrix allows a certain error in the program execution time. Thus, we select the top five cases as candidates and retest these cases using the newly counted event values that is collected simultaneously. This allows to evaluate the candidates accurately. Finally, a candidate showing the top results is used to build the per-phase model. In this way, we can find a suboptimal combination of event types quickly. Electronics 2021, 10, 1197 10 of 19

We can quickly test every case using this data matrix. However, the best case may not be best because of inaccuracy of the data matrix. The data matrix just shows an approximate trend, representing approximate values when PMCs collects the event types simultane-

Electronics 2021, 10, 1197 ously. This is because a building process of data matrix allows a certain error in10 ofthe 19 program execution time. Thus, we select the top five cases as candidates and retest these cases using the newly counted event values that is collected simultaneously. This allows to evaluate the candidates accurately. Finally, a candidate showing the top results is used to build the per- 5.4. Phase Prediction phase model. In this way, we can find a suboptimal combination of event types quickly. In this paper, we predict the CPU phase using the following two methods. The first is the past method. In this method, an upcoming phase is predicted to be the same as the 5.4. Phase Prediction previous classified phase as shown in (4). In this paper, we predict the CPU phase using the following two methods. The first ˆ CPU CPU is the past method. In this method,Phaset an upcomi= Phasengt−1 phase is predicted to be the same(4) as the previous classified phase as shown in (4). In (4), Phaseˆ CPU represents the predicted phase at time interval t and PhaseCPU repre- t t−1 sents the classified phase at time interval𝑃ℎ𝑎𝑠𝑒t − 1. =𝑃ℎ𝑎𝑠𝑒 (4) The second is the deep-past method. In order to predict an upcoming phase, this In (4), 𝑃ℎ𝑎𝑠𝑒 represents the predictedCPU phase atCPU time interval CPUt and 𝑃ℎ𝑎𝑠𝑒 rep- method analyzes the recent 7 phases (i.e., Phaset−1 , Phaset−2 , ... , Phaset−7 ). In this method,resents eachthe classified phase has phase a weight at wtime as showninterval in (5).t − 1. The second is the deep-past method. In order to predict an upcoming phase, this method analyzes the recent 7 phaseswt−i =(i.e.,8 − 𝑃ℎ𝑎𝑠𝑒i , 𝑃ℎ𝑎𝑠𝑒 , …, 𝑃ℎ𝑎𝑠𝑒(5)). In this

method, each phase has a weight w as shown in (5). CPU In this method, bucketp is the sum of the weights as shown in Figure6. Phaset−i determines which bucket the weight is added𝑤 to. For=8−𝑖 example, the weight of time index t-1 (5) is added to the third bucket because phase is 3. After the 7 weights have been added to the 5 buckets,In this we method, can determine 𝑏𝑢𝑐𝑘𝑒𝑡 the upcoming is the sum phase. of the weights as shown in Figure 6. 𝑃ℎ𝑎𝑠𝑒 determines which bucket the weight is added to. For example, the weight of time index t- ˆ CPU  1 is added to the third bucketPhase becauset = argmax phasep isbucket 3. Afterp the 7 weights have been(6) added to the 5 buckets, we can determine the upcoming phase.

Time Phase Weight Time Phase weight index index

t-1 3 1 t-2 1 2 p ……

Time Phase Weight Time Phase weight 11 index index 23 Time Phase weight Time Phase weight p t-3 2 3 t-4 3 4 index index 3 1 + 4 t-1 3 1 t-4 3 4 Time Phase weight Time Phase weight 46 index index 57 t-5 5 5 t-6 4 6

Time Phase weight index t-7 5 7 FigureFigure 6. 6.Example Example of phases,of phases, weights, weights, and bucketsand buckets in the in deep-past the deep-past method. method.

Equation (6) shows our deep-past method. It decides the upcoming phase as the phase of the bucket with the largest value.𝑃ℎ𝑎𝑠𝑒 =𝑎𝑟𝑔𝑚𝑎𝑥(𝑏𝑢𝑐𝑘𝑒𝑡) (6) There is a need to set the initial phase since both methods are history-based methods. We setEquation phase 2 (CPU (6) shows + cache) our as thedeep-past initial phase method. because It phasedecides 2 was the the upcoming most common phase as the phasephase when of the we bucket executed with 75 the workloads largest andvalue. classified each interval vectors in Section 5.3. In theThere case of is the a deep-pastneed to set method, the initial the method phase since cannot both predict methods until the are 7th history-based phase decision methods. isWe done, set sincephase it considers2 (CPU + thecache) recent as 7 the phases initia forl phase prediction. because Therefore, phase we 2 was set the the initial most 7 common phasesphase aswhen phase we 2 inexecuted the case of75 theworkloads deep-past and method. classified each interval vectors in Section 5.3. 5.5.In the General case Power of the Model deep-past method, the method cannot predict until the 7th phase deci- sion is done, since it considers the recent 7 phases for prediction. Therefore, we set the ˆ CPU CPU initialIf Phase 7 phasest and as phasePhaset 2 inare the not case same, of thethe prediction deep-past is method. failed. In this case, we cannot use one of the per-phase power models because the collected event types are not matched to En used in the power model.

3 CPU Pdynamic = β0EC + ∑ βnEn (7) n=1

Equation (7) represents the general power model for the prediction miss case. The general power model is built with EC and three En, which are values of three events shown

Electronics 2021, 10, 1197 11 of 19

5.5. General Power Model If 𝑃ℎ𝑎𝑠𝑒 and 𝑃ℎ𝑎𝑠𝑒 are not same, the prediction is failed. In this case, we cannot use one of the per-phase power models because the collected event types are not matched to 𝐸 used in the power model. 𝑃 = 𝛽𝐸 + 𝛽𝐸 (7) Equation (7) represents the general power model for the prediction miss case. The general power model is built with E and three 𝐸, which are values of three events shown in Table 3 divided by sampling time. These three events are always included in every event sets to calculate LDST_inst_rate and Cache_miss_rate. Even though a prediction fails, we can use this general power model because the clock counter value and the three events values are always collected regardless of which phase is.

6. GPU Power Modeling 6.1. Phase Classification Unlike the CPU’s phase classification, which focused on how CPU uses external memory in connection with its internal behavior, this section focuses more on GPU inter- nal behavior because the power consumption of GPU is much larger than that of external memory. Jin et al. [9] and Nah et al. [15] analyzed the GPU pipeline and used utilizations of its unit for their research. Jin et al. [9] analyzed what hardware units exist on the GPU microarchitecture and used the utilizations of each unit as a predictor of their linear re- gression power model. Nah et al. [15] used the utilizations of each unit to characterize various kinds GPU workloads. Prior research suggest that the utilizations of each unit can be used to build a phase-based power model since the usage of hardware resource varies Electronics 2021, 10, 1197 from workload to workload. Therefore, 11in of 19this paper, we analyze what units are in the Mali T-880 microarchitecture and employ the utilizations of each unit to classify the GPU in Table3 divided by sampling time. These three events are always included in every event setsphase. to calculate LDST_inst_rate and Cache_miss_rate. Even though a prediction fails, we can use this general power model because the clock counter value and the three events values are alwaysOur collected phase regardless of classification which phase is. method for GPU classifies interval vectors into four phases. 6.They GPU Power are Modeling called Fixed function unit phase, Tiler phase, Tripipe phase, and Memory phase, 6.1. Phase Classification respectively.Unlike the CPU’s phase The classification, name which of focused each on howphase CPU uses means external which unit is used most at a time interval. memory in connection with its internal behavior, this section focuses more on GPU internal behaviorAdditionally, because the power the consumption reasons of GPU isfor much choosing larger than that ofthese external 4 units are as follows. Figure 7 shows an ab- memory. Jin et al. [9] and Nah et al. [15] analyzed the GPU pipeline and used utilizations ofstracted its unit for their block research. diagram Jin et al. [9] analyzed of the what hardwareT-880 units microarchite exist on the cture. As is shown in Figure 7, there are GPU microarchitecture and used the utilizations of each unit as a predictor of their linear regressionthree power types model. of Nah units et al. [15] used(i.e., the utilizations tiler, L2 of each cache, unit to characterize shader core). Tiler divides the screen into tiles of various kinds GPU workloads. Prior research suggest that the utilizations of each unit can be16 used × to build16 apixels phase-based and power modelthe since L2 the cache usage of hardware connects resource varies with external memory systems. Meanwhile, a from workload to workload. Therefore, in this paper, we analyze what units are in the Mali T-880shader microarchitecture core and can employ be the utilizationsused ofas each both unit to classifya vertex the GPU phase. shader and a fragment shader. Figure 8 shows Our phase classification method for GPU classifies interval vectors into four phases. Theythe are internal called Fixed function structure unit phase, Tiler of phase, a shader Tripipe phase, core. and Memory Fi phase,xed function units (FFUs) surround a program- respectively. The name of each phase means which unit is used most at a time interval. Additionally,mable thetripipe reasons for unit choosing that these 4 unitsruns are asa follows.shader Figure 7program, shows an and they are responsible for the prepara- abstracted block diagram of the T-880 microarchitecture. As is shown in Figure7, there are threetion types of of unitsshader (i.e., tiler, L2operations cache, shader core). such Tiler divides as therasterizing screen into tiles of and depth testing, or for the post-processing 16 × 16 pixels and the L2 cache connects with external memory systems. Meanwhile, a shaderof shader core can be used operations as both a vertex shader [16]. and a fragmentTherefore, shader. Figure a8 shadershows the core needs to be divided into two parts: the internal structure of a shader core. Fixed function units (FFUs) surround a programmable tripipefixed unit thatfunction runs a shader units program, and they a are tripipe responsible forunit. the preparation Since of GPU consists of four units: FFU, tiler, tripipe, shader operations such as rasterizing and depth testing, or for the post-processing of shader operationsand memory [16]. Therefore, asystem, shader core needs we to beclassify divided into interval two parts: the fixedvectors into four phases corresponding 4 units: function units and a tripipe unit. Since GPU consists of four units: FFU, tiler, tripipe, and memoryFFU system, phase, we classify Tiler interval phase, vectors into fourTripip phases correspondinge phase, 4 units:and FFU Memory phase. phase, Tiler phase, Tripipe phase, and Memory phase.

Shader Shader Core Core

Tiler Shader ... Core

Queue L2 Cache

Control Bus Data Bus Figure 7. An abstract block diagram of Mali Midgard GPU architecture [16].

FigureWe use utilizations 7. An ofabstract each unit for block phase classification. diagram The of utilization Mali can Midgard be cal- GPU architecture [16]. culated using counted event values. In this paper, we suggest two methods for phase classification described in the following subsections.

ElectronicsElectronics 2021, 102021, 1197, 10, 1197 12 of 19 12 of 19

Shader Core Front-end Fixed Function blcoks

Vertex Fragment Front-end Front-end External Shared MemoryShared External

Tripipe

Fragment Front-end

Back-end Fixed Function blcoks

Figure 8. 8.An An abstract abstract inside inside view of aview shader of core a shader [16]. core [16]. 6.1.1. Maximum Selection Method InWe order use to decideutilizations the phase of for each an interval unit vector,for phase this method classification. selects the phase The withutilization can be calcu- latedhighest using utilization counted among event four units. values. For instance, In this if pape the utilizationr, we suggest of FFU unit two is methods larger for phase classi- GPU than the others, FFU phase is selected for Phaset . Otherwise, if the utilization of Tier fication described in the following subsections.GPU unit is larger than the others, Tiler phase is selected for Phaset . 6.1.1.6.1.2. K-Means Maximum Clustering Selection Method Method This method is based on K-means clustering [12]. We can cluster all interval vectors in the dataIn matrix orderusing to decide K-means the clustering. phase for New an vectors interval are used vector, for K-means this method clustering selects the phase with highestinstead of theutilization interval vectors. among Each four of these units. new vectors For instance, corresponds if to the each utilization interval vector of FFU unit is larger and has a utilization value of four units. For the initial centroid of K-means clustering, the than the others, FFU phase is selected for 𝑃ℎ𝑎𝑠𝑒 . Otherwise, if the utilization of Tier vector with the highest unit utilization value is selected for each unit, and a total of four unit is larger than the others, Tiler phase is selectedFFU for 𝑃ℎ𝑎𝑠𝑒 .Tiler interval vectors are selected. (i.e., one with the highest ut , one with the highest ut , Tripipe Memory one with the highest ut , one with the highest ut ). The entire interval vectors are 6.1.2.divided K-Means into 4 clusters Clustering by K-means Method clustering, and each cluster means a corresponding phase. For example, the phase of a vector belonging to a cluster formed by the initial centroidThis with method the highest is FFUbased utilization on K-means becomes theclustering FFU phase. [12]. We can cluster all interval vectors in the data matrix using K-means clustering. New vectors are used for K-means clustering instead6.2. Per-Phase of Powerthe interval Model vectors. Each of these new vectors corresponds to each interval vectorEquation and has (8) represents a utilization the per-phase value of power four model units. of GPU.For the In (8), initial some centroid variables of K-means cluster- are used as predictors after log conversion (ln(Em))), and the others are used as predictors ing,without the log vector conversion with (E nthe). highest unit utilization value is selected for each unit, and a total of four interval vectors are selected. (i.e., one with the highest 𝑢, one with the highest GPU = M ( ) + N Pdynamic Σm=0βmln Em Σn=M+1βnEn (8) 𝑢 , one with the highest 𝑢 , one with the highest 𝑢 ). The entire interval vectorsThe reasonare divided for log conversion into 4 clusters is the nonlinear by K-means characteristic clustering, found by Isci and et al. each [17]. cluster It means a corre- is that, when the number of events per second is relatively small, its small increase causes spondingto a large increase phase. in power For consumption.example, the On thephase contrary, of a when vector the value belonging is large, its to small a cluster formed by the initialincrease centroid causes to a with small increasethe highest in power FFU consumption. utilization Hong becomes [18] solved the this FFU problem phase. by converting the values using a log function. Thus, we decide to use log-transformed 6.2.values Per-Phase as predictors Power for linear Model regression. However, since not all variables have nonlinear characteristics, the log transformation was only applied to some events. Equation (8) represents the per-phase power model of GPU. In (8), some variables

are used as predictors after log conversion (ln(𝐸))), and the others are used as predictors without log conversion (𝐸). 𝑃 = ∑ 𝛽 𝑙𝑛(𝐸) + ∑ 𝛽𝐸 (8) The reason for log conversion is the nonlinear characteristic found by Isci et al. [17]. It is that, when the number of events per second is relatively small, its small increase causes to a large increase in power consumption. On the contrary, when the value is large, its small increase causes to a small increase in power consumption. Hong [18] solved this problem by converting the values using a log function. Thus, we decide to use log-trans- formed values as predictors for linear regression. However, since not all variables have nonlinear characteristics, the log transformation was only applied to some events.

Electronics 2021, 10, 1197 13 of 19

6.3. Variable Selection The PMC, implemented in the ARM Mali-T880 GPU, can collect all 185 events types simultaneously. Thus, it is possible to know all 185 events values at any time interval without the data matrix shown in Algorithm 1. However, it is undesirable to use all 185 event types as predictors, because it adversely affects the regression performance. Thus, we need to decide which events to use and which events not to use. Considering the number of cases for 185 event types, we need to explore 3185 cases since there are 3 cases for every event type (i.e., one for Em, one for En, one for not used). Therefore, in this paper, we employ the genetic algorithm [4] to search the optimal case for regression. The gene is represented by an integer sequence with 185 elements, and an element has three cases: 0 if the event type is not used, 1 if the event type is used for Em, and 2 if the event type is used for En. We used parameters for the genetic algorithm as follows. An initial population of 500, a random integer gene initialization method, a uniform crossover, a uniform integer gene mutation method (indpb = 0.04), and a k best individuals selection method were used. In order to collect the event values for Mali GPU, a target agent called gator [19] is needed. The gator driver is responsible for delivering counted event values to the ARM Streamline [20] host program. Originally, the gator driver should be used with the gator daemon and the ARM Streamline [20], but we modified the gator driver to deliver PMC event values to our power estimation program.

7. Experimental Setup 7.1. Android OS Settings for Experiment Due to the limited battery capacity in a smartphone, many low-power techniques are applied to manage its power consumption. Such techniques can act as an impediment to identify the effectiveness of the phase-based power model. Therefore, we turned them off for accurate experiments. First, the governor for the dynamic voltage frequency scaling (DVFS) was set as the userspace governor, and the frequencies of the little core and big core were set to 1.1 GHz and 2.14 GHz, respectively. The frequency of GPU was set to 533 MHz. Second, we turned on only one core for each cluster and prevented the other cores from turning on during the experiment. Finally, ARM provides the intelligent power allocation (IPA) technology for smartphone thermal management. By turning this technique off, the frequency is not changed while the workload is running.

7.2. CPU and GPU Power Measument 7.2.1. CPU Power Measurement In this experiment, we use the monsoon power monitor that measures the total power consumption of the smartphone. The android debug bridge (ADB) shell commands are used to execute the CPU workload, but the CPU operates only when the smartphone display is on. We set the android wake_lock for accurate measurement so that the CPU can operate even when the display is off. By doing so, we can exclude the power consumption of display. After setting the wake lock, the measured power without executing any workload is expressed as Pbase in (9).

CPU Smartphone Pdynamic = Pmeasured − Pbase (9)

The measured power, while the CPU executes workload and the display is off, is Smartphone represented by Pmeasured in (9). Therefore, the dynamic power consumption of the CPU CPU is represented by Pdynamic as shown in (9).

7.2.2. GPU Power Measurement Since not all GPU workloads support offscreen rendering, display power cannot be excluded. We used fast and accurate display power model [21] to estimate display power display consumption, which is expressed as Pestimated as shown in (10). Therefore, the dynamic Electronics 2021, 10, 1197 14 of 19

GPU CPU power consumption of the GPU, Pdynamic, is calculated as shown in (10). In (10), Pestimated is the estimated power consumption by our proposed CPU power model.

GPU Smartphone display CPU Pdynamic = Pmeasured − Pbase − Pestimated − Pestimated (10)

8. Experimental Results 8.1. Building CPU and GPU Power Models 8.1.1. Workloads for CPU and GPU Power Models We used 75 workloads and 15 workloads to build the CPU and GPU power models, re- spectively. 75 workloads for CPU are composed of 37 from roy longbottom benchmark [22], 16 from c algorithm benchmark [23], 1 from parsec benchmark [24], 6 from cortex suite benchmark [25], 13 from mibench benchmark [26], and 2 from ARM Linux Benchmark [27]. Meanwhile, 15 workloads for GPU are composed of 3DMark [28], Madagascar 3D bench- mark [29], Kassja benchmark [30], Renderscript benchmark [31], Nena benchmark [32], V1 GPU benchmark [33], Seascape benchmark [34], Basemark [35], 3D benchmark [36], Relative benchmark [37], and L-bench [15]. We installed them on the Galaxy S7 smartphone for experiment.

8.1.2. Results of Per-Phase CPU Power Models Table4 shows the MAPEs of per-phase power models. The predictors of each per- phase model were selected by the variable selection method shown in Section 5.3. All 54-size interval vectors from the data matrix were classified into five phases using the phase classification method shown in Section 5.1. Then, each per-phase model has been built with only the interval vectors of the corresponding phase. Eventually, the MAPEs of per-phase models shown in Table4 are on average 2.51% on ARM cortex-A53 CPU and averagely 1.97% on Samsung M1 CPU, respectively.

Table 4. Results of phase-specific CPU power models (unit: MAPE (%)).

CPU ARM Cortex-A53 Samsung M1 Phase Phase 1 (CPU) 0.63 0.66 Phase 2 (CPU + Cache) 4.62 2.98 Phase 3 (CPU + DRAM) 1.38 1.92 Phase 4 (Cache) 4.82 1.9 Phase 5 (DRAM) 1.12 2.4 Avg. 2.51 1.97

8.1.3. Comparison with Existing CPU Power Models Table5 shows the results of the proposed CPU power models, the existing utilization- based model [5] and the PMC based model. In the case of the PMC based model, we used our variable selection method to select six global event types instead of the method proposed by Walker et at [10]. The first and second rows of Table5 show the results of implementing those models on the Galaxy S7 smartphone. The proposed per-phase models show better results compared to both the utilization-based model and the PMC- based model. However, the general power model described in Section 5.5 shows worse MAPE results than that of the PMC-based model, but shows better results than that of the utilization-based model. Actually, a phase prediction rate is important. This is because, if prediction fails, we cannot help but use the general power model. Meanwhile, CPU workloads are not suitable to test the phase prediction rate because they have monotonous characteristics. Thus, in Section 8.2.2, we will test the prediction rate for 3D rendering and game benchmarks that show dynamic characteristics. Electronics 2021, 10, 1197 15 of 19

Table 5. Comparisons with existing CPU power models (unit: MAPE (%)).

CPU ARM Cortex-A53 Samsung M1 Power Model Utilization-based model [5] 25.61 11.99 PMC-based model [10] 9.85 6.64 Avg. of per-phase power models 2.51 1.97 General power model 15.16 7.89

8.1.4. Results of Per-Phase GPU Power Models Table6 shows the MAPE of our per-phase GPU power models. We used the maximum selection method, described in Section 6.1, to classify all interval vectors to four phases. The genetic algorithm was performed about 220 h to select the semi-optimal event sets and finally we built the per-phase power models using selected event sets. As shown in Table6, the per-phase power models showed 5.12–11.23% MAPEs and 8.92% MAPEs on average.

Table 6. Results of GPU per-phase power models (unit: MAPE (%)).

Phase 1 Phase 1 Phase 3 Phase 4 Phase Avg. MAPE (Tiler) (FFU) (Tripipe) (Memory) MAPE 8.54 11.23 5.12 10.77 8.92

8.1.5. Comparison with Existing GPU Power Models Table7 shows the results of the proposed GPU power model, the existing utilization- based model [7] and the PMC-based model [9]. The first and second rows of Table7 show the results of implementing those models on the Galaxy S7 smartphone. There is no general power model for GPU, since there is no phase prediction step in our GPU power estimation process. As shown in Table7, the proposed per-phase models were better than the others.

Table 7. Comparisons with existing GPU power models (unit: MAPE (%)).

Power Model MAPE Utilization-based model [7] 17.71 PMC-based model [9] 15.23 Avg. of per-phase power models 8.92

8.2. CPU + GPU Power Model Test 8.2.1. Workloads for CPU + GPU Power Model We installed nine additional workloads, which are different from those of Section 8.1, to test our holistic power model that combines the proposed CPU and GPU power mod- els. The nine workloads are composed of Antutu benchmark GPU [38], Performance test mobile [39], Orbital flight benchmark [40], Unreal benchmark [41], OESK benchmark [42], Snow forest [43], KZ FPS2017 [44], and bunny benchmark [45]. In the case of the perfor- mance test mobile [39]. Each section was treated as different workloads.

8.2.2. CPU Phase Prediction Rate In case of phase prediction, two methods are proposed in Section 5.4. We evaluated these methods with the 9 workloads. The phase prediction rate is calculated as in (11).

# of prediction corrected interval vectors prediction rate(%) = (11) # of total interval vectors

Table8 shows the prediction rates of the past and deep past methods. The average prediction rates of the past and deep past methods were 67.65% and 81.85%, respectively, when we experimented on nine workloads. Therefore, if the past method is used, prediction Electronics 2021, 10, 1197 16 of 19

Table 8 shows the prediction rates of the past and deep past methods. The average prediction rates of the past and deep past methods were 67.65% and 81.85%, respectively, when we experimented on nine workloads. Therefore, if the past method is used, predic- tion misses may occur so frequently that the general power model is more frequently used. On the contrary, the deep-past method can improve the performance of our phase- Electronics 2021, 10, 1197 16 of 19 based power model.

Table 8. Phase prediction rate for each method (unit: prediction rate (%)). misses may occur so frequently that the general power model is more frequently used. On the contrary, the deep-pastCPU method can improve the performance of our phase-based ARM Cortex-A53 Samsung M1 Avg. Methodpower model.

Table 8. PhasePast prediction rate for each method 70.32 (unit: prediction rate (%)).64.99 67.65 Deep past 76.31 86.80 81.55 CPU ARM Cortex-A53 Samsung M1 Avg. Method 8.2.3. Results of GPU Phase Classification Methods Past 70.32 64.99 67.65 The holisticDeep past power model should 76.31 estimate the overall 86.80 power 81.55 consumption of a smartphone. To this end, we combined the proposed CPU and GPU power models and an8.2.3. existing Results state-of of GPU Phasethe-art Classification display power Methods model. The holistic power model should estimate the overall power consumption of a smart- phone. To this end,𝑃 we combined= the 𝑃 proposed+ CPU 𝑃 and GPU+𝑃 power models+𝑃 and an (12) existing state-of the-art display power model. As shown in (12), 𝑃 represents the holistic power model and is the sum of Smartphone CPU GPU display each estimated Ppower.estimated 𝑃=Pestimated, +𝑃Pestimated +, Pandestimated 𝑃+ Pbase represent (12)the estimated power values of the CPU, GPU, and display power models, respectively. 𝑃 is the same Smartphone valueAs with shown that in (12),of (9).Pestimated The powerrepresents consumption the holistic of power the display model and is isestimated the sum of by the high- CPU GPU display accurateeach estimated display power. powerPestimated model, P estimated[21]. , and Pestimated represent the estimated power valuesFigure of the 9 CPU, shows GPU, the and holistic display power power models models, respectively.with four optionalPbase is the combinations. same value In Figure with that of (9). The power consumption of the display is estimated by the high-accurate 9,display CPU(past) power means model [ 21a ].CPU power model built with the past phase prediction method de- scribedFigure in 9Section shows the 5.4, holistic and GPU(max) power models means with four a GPU optional power combinations. model built In Figurewith the9 , maximum selectionCPU(past) method means a described CPU power in model Section built 6.1. with CPU(deep-past) the past phase prediction and GPU(K-means) method de- mean the CPUscribed and in SectionGPU power 5.4, and models GPU(max) with means the adeep-pas GPU powert method model built and with K-means the maximum clustering, respec- tively.selection In method Figure described 9, x-axis in represents Section 6.1 .the CPU(deep-past) running workloads and GPU(K-means) and y-axis mean represents the MAPEs,the CPU andwhich GPU are power calculated models by with comparing the deep-past the method estimated and K-meanspower with clustering, the actual meas- respectively. In Figure9, x-axis represents the running workloads and y-axis represents the uredMAPEs, power. which When are calculated we compare by comparing the average the estimated result powerof model with 1 the with actual that measured of model 2, we can seepower. that When the Maximum-based we compare the average method result is bette of modelr than 1 with the thatK-means of model method. 2, we can Likewise, the performancesee that the Maximum-based of model 3 is method better is than better that than of the model K-means 4. method.Thus, the Likewise, Maximum the selection methodperformance is better of model than 3the is betterK-means than method that of model since 4.the Thus, holistic the Maximumpower model selection with GPU(max) showsmethod better is better MAPEs than the than K-means that methodwith GPU(K-means). since the holistic power model with GPU(max) shows better MAPEs than that with GPU(K-means).

25

20

15 MAPE 10

5

0 123456789Avg. Workloads Model 1: Model 2: CPU(past) + GPU(max) CPU(past) + GPU(K-means)

Model 3 Model 4 CPU(deep-past) + GPU(max) CPU(deep-past) + GPU(K-means)

Figure 9. Test results of four power models built with combinations of proposed methods. Figure 9. Test results of four power models built with combinations of proposed methods.

Electronics 2021, 10, 1197 17 of 19

Electronics 2021, 10, 1197 17 of 19

8.2.4. Holistic Power Model Comparison 8.2.4.Based Holistic on the Power results Model of the Comparison previous section, we selected model 3 (deep-past + maxi- mum Basedselection) on the for resultsthe holistic of the power previous model. section, Figure we 10 selected shows model the results 3 (deep-past of comparison + maxi- oursmum with selection) existing for state-of-the-art the holistic power power model. models. Figure Figure 10 10 shows shows the the results MAPEs of of comparison four com- binationsours with of existing the proposed state-of-the-art and existing power power models. models Figure on CPU 10 shows and GPU the MAPEsfor the holistic of four powercombinations model of of the the smartphone. proposed and The existing average power MAPE models was on reduced CPU and from GPU 14.51% for the to holistic9.99% whenpower we model compare of the model smartphone. 5 with model The average 6. Similar MAPE results was are reduced shown from when 14.51% we compare to 9.99% modelwhen 5 we with compare model model7. The 5average with model MAPE 6. was Similar reduced results from are 14.51% shown to when 10.28%. we compareAs a re- sult,model model 5 with 3, modelwhich 7.is Thethe averagecombination MAPE of was the reducedproposed from CPU 14.51% and GPU to 10.28%. power As models, a result, showedmodel 3, a which6.36% isaverage the combination MAPE, showing of the proposed much improved CPU and results GPU powercompared models, to model showed 5, whicha 6.36% is the average combination MAPE, showingof the existing much state-of-the-art improved results PMC-based compared CPU to model and GPU 5, which power is models.the combination of the existing state-of-the-art PMC-based CPU and GPU power models.

30

25

20

15 MAPE 10

5

0 123456789Avg. Workloads Model 5: Model 6: Model 7: Model 3: CPU(PMC) + CPU(proposed) + CPU(PMC) + CPU(proposed) + GPU(PMC) GPU(PMC) GPU(proposed) GPU(proposed) FigureFigure 10. 10. TestTest results results of of four four holistic holistic power power models models bu builtilt with with combinations combinations of of proposed proposed methods methods andand existing existing methods. methods.

9.9. Conclusions Conclusions InIn this this paper, paper, new and accurateaccurate phase-basedphase-based CPU CPU and and GPU GPU power power modeling modeling methods meth- odsfor mobilefor mobile application application processors processors are proposed.are proposed. We We classified classified hardware hardware operations operations into intoseveral several phases phases and and built built per-phase per-phase power powe models.r models. Additionally, Additionally, then, then, we combined we combined them theminto aninto accurate an accurate holistic holistic power power model. model. Based Based on an on observation, an observation, a specific a specific event event set canset canproduce produce better better results results on a on specific a specific phase phase in the in aspect the aspect of the of accuracy the accuracy of a power of a power model, model,we proposed we proposed efficient efficient variable variable selection select methodsion methods to find to thefind optimal the optimal event event set in set each in phase. We also proposed phase classification and prediction methods to use an appropriate each phase. We also proposed phase classification and prediction methods to use an ap- per-phase power model according to the hardware state. In the case of CPU, a phase propriate per-phase power model according to the hardware state. In the case of CPU, a classification method, a variable selection method, and two phase prediction methods phase classification method, a variable selection method, and two phase prediction meth- were proposed, and in the case of GPU, two phase classification methods and the genetic ods were proposed, and in the case of GPU, two phase classification methods and the algorithm were proposed. genetic algorithm were proposed. We tested the holistic power models using four different optional combinations. The We tested the holistic power models using four different optional combinations. The results showed that the model built with the deep-past method for CPU and maximum- results showed that the model built with the deep-past method for CPU and maximum- based method for GPU was best. The results also showed that the proposed power based method for GPU was best. The results also showed that the proposed power model model can estimate power consumption more accurately than the existing PMC-based can estimate power consumption more accurately than the existing PMC-based power power model. When we compared our best holistic power models with models built with model. When we compared our best holistic power models with models built with exist- existing methods, our model showed an average MAPE of 6.36%, which resulted in 56% ingimprovement methods, our over model the latest showed prior an model. average MAPE of 6.36%, which resulted in 56% im- provementSince weover conducted the latest allprior experiments model. on the Galaxy S7 smartphone, our model shows the possibilitySince we conducted of accurately all experiments estimating the on powerthe Galaxy consumption S7 smartphone, of a recent our mobilemodel shows device thewithout possibility additional of accurately sensors estimating for power the measurement. power consumption As future of work, a recent more mobile research device on withoutthe improved additional phase sensors prediction for power method measuremen is required,t. As because future if work, the hardware more research state ison more the improvedcorrectly predictedphase prediction by the improvedmethod is phaserequired prediction, because method, if the hardware a per-phase state power is more model cor- corresponding to the state will be used more frequently, and the accuracy of the power model estimation will be raised higher.

Electronics 2021, 10, 1197 18 of 19

Author Contributions: Conceptualization, Y.-J.K.; Methodology, K.L., S.-R.O., Y.-J.K.; Project admin- istration, Y.-J.K.; Resources, Y.-J.K.; Software, K.L., S.-R.O.; Validation, K.L., S.-G.L.; Visualization, S.-R.O., S.-G.L.; Writing—original draft, K.L.; Writing—review & editing, S.-G.L., Y.-J.K. All authors have read and agreed to the published version of the manuscript. Funding: This research was supported by the Basic Science Research Program through the NRF, Korea funded by the Ministry of Education (2018R1A2B6005466) and the Ajou University research fund. Conflicts of Interest: The authors declare no conflict of interest.

References 1. Heibert, P. Smartphone Users Still Want Long-Lasting Batteries More Than Shatterproof Screens. Available online: https://today. yougov.com/topics/technology/articles-reports/2018/02/20/smartphone-users-still-want-longer-battery-life (accessed on 16 May 2021). 2. Chen, X.; Chen, Y.; Ma, Z.; Fernandes, F.C.A. How is energy consumed in smartphone display applications? In Proceedings of the 14th Workshop on Mobile Computing Systems and Applications (HotMobile 2013), Jekyll Island, GA, USA, 26–27 February 2013. 3. ARM. Arm DS-5. Available online: https://developer.arm.com/tools-and-software/embedded/legacy-tools/ds-5-development- studio (accessed on 16 May 2021). 4. Whitley, D. A genetic algorithm tutorial. Stat. Comput. 1994, 4, 65–85. [CrossRef] 5. Yoon, C.; Lee, S.; Choi, Y.; Ha, R.; Cha, H. Accurate power modeling of modern mobile application processors. J. Syst. Arch. 2017, 81, 17–31. [CrossRef] 6. Zhang, Y.; Liu, Y.; Zhuang, L.; Liu, X.; Zhao, F.; Li, Q. Accurate CPU Power Modeling for Multicore Smartphones; MSR-TR-2015-9; Microsoft Reasearch Asia: Beijing, China, 2015. 7. Kim, Y.G.; Kim, M.; Kim, J.M.; Sung, M.; Chung, S.W. A Novel GPU Power Model for Accurate Smartphone Power Breakdown. ETRI J. 2015, 37, 157–164. [CrossRef] 8. Pricopi, M.; Muthukaruppan, T.S.; Venkataramani, V.; Mitra, T.; Vishin, S. Power-performance modeling on asymmetric multi- cores. In Proceedings of the 2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), Montreal, QC, Canada, 29 September–4 October 2013; pp. 1–10. 9. Jin, T.; He, S.; Liu, Y. Towards Accurate GPU Power Modeling for Smartphones. In Proceedings of the 13th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys 2015), Florence, Italy, 19–22 May 2015; pp. 7–11. 10. Walker, M.J.; Diestelhorst, S.; Hansson, A.; Das, A.K.; Yang, S.; Al-Hashimi, B.M.; Merrett, G.V. Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs. IEEE Trans. Comput. Des. Integr. Circuits Syst. 2017, 36, 106–119. [CrossRef] 11. Sherwood, T.; Perelman, E.; Hamerly, G.; Sair, S.; Calder, B. Discovering and exploiting program phases. IEEE Micro 2003, 23, 84–93. [CrossRef] 12. Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A K-Means Clustering Algorithm. J. R. Stat. Soc. Ser. C Appl. Stat. 1979, 28, 100. [CrossRef] 13. Kim, Y.; Mercati, P.; More, A.; Shriver, E.; Rosing, T. P4: Phase-based power/performance prediction of heterogeneous systems via neural networks. In Proceedings of the 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Irvine, CA, USA, 13–16 November 2017; pp. 683–690. 14. Zheng, X.; John, L.K.; Gerstlauer, A. Accurate phase-level cross-platform power and performance estimation. In Proceedings of the Proceedings of the 53rd Annual Design Automation Conference, Austin, TX, USA, 5–9 June 2016; pp. 1–6. 15. Nah, J.-H.; Suh, Y.; Lim, Y. L-Bench: An Android benchmark set for low-power mobile GPUs. Comput. Graph. 2016, 61, 40–49. [CrossRef] 16. Peter Harris. The Mali GPU: An Abstract Machine, Part 3-The Midgard Shader Core. Available online: https://community.arm. com/developer/tools-software/graphics/b/blog/posts/the-mali-gpu-an-abstract-machine-part-3---the-midgard-shader- core?CommentId=b7b18d27-4368-4e96-bce8-4f4a947af00c (accessed on 16 May 2021). 17. Isci, C.; Martonosi, M. Runtime power monitoring in high-end processors: Methodology and empirical data. In Proceedings of the 36th Annual International Symposium on Microarchitecture, San Diego, CA, USA, 3–5 December 2003. 18. Hong, S.; Kim, H. An integrated GPU power and performance model. ACM SIGARCH Comput. Arch. News 2010, 38, 280–289. [CrossRef] 19. ARM-Software. Gator Daemon, Driver and Related Tools. Available online: https://github.com/ARM-software/gator (accessed on 16 May 2021). 20. ARM Developer. Streamline Performance Analyzer. Available online: https://developer.arm.com/tools-and-software/ embedded/arm-development-studio/components/streamline-performance-analyzer (accessed on 16 May 2021). 21. Kim, C.; Hong, S.; Lee, K.; Kim, Y.J. High-Accurate and Fast Power Model Based on Channel Dependency for Mobile AMOLED Displays. IEEE Access 2018, 6, 73380–73394. [CrossRef] 22. R. Longbottom. Roy Longbottom’s pc Benchmark Collection. Available online: http://www.roylongbottom.org.uk (accessed on 16 May 2021). 23. fragglet. C Algorithms. Available online: https://fragglet.github.io/c-algorithms/ (accessed on 16 May 2021). Electronics 2021, 10, 1197 19 of 19

24. Bienia, C.; Li, K. PARSEC 2.0: A New Benchmark Suite for Chip-Multiprocessors. In Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and Simulation, New Orleans, LA, USA, 21 June 2009; Volume 2011. 25. Thomas, S.; Gohkale, C.; Tanuwidjaja, E.; Chong, T.; Lau, D.; Garcia, S.; Taylor, B.M. CortexSuite: A synthetic brain benchmark suite. In Proceedings of the 2014 IEEE International Symposium on Workload Characterization (IISWC), Raleigh, NC, USA, 26–28 October 2014; pp. 76–79. 26. Guthaus, M.R.; Ringenberg, J.S.; Ernst, D.; Austin, T.M.; Mudge, T.; Brown, R.B. MiBench: A free, commercially representative embedded benchmark suite. In Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization, Austin, TX, USA, 2 December 2001; pp. 3–14. 27. TonyHo. ARM Linux BenchMark. Available online: http://tonyho.github.io/ARM%20Linux%20BenchMark.html (accessed on 16 May 2021). 28. 3DMark. UL LLC. Available online: https://play.google.com/store/apps/details?id=com.futuremark.dmandroid.application (accessed on 16 May 2021). 29. Madagascar 3D Benchmark. ext2dev4me. Available online: https://play.google.com/store/apps/details?id=app.ext2dev4me. m3db (accessed on 16 May 2021). 30. Benchmark 3D Kassja graphics. MKDesignMobile. Available online: https://play.google.com/store/apps/details?id=com. mkdesignmobile.KassjaBenchmark (accessed on 16 May 2021). 31. Renderscript Benchmark. Binary Group SK s.r.o. Available online: https://play.google.com/store/apps/details?id=name. duzenko.farfaraway (accessed on 16 May 2021). 32. NenaMark2. Nena Innovation AB. Available online: https://play.google.com/store/apps/details?id=se.nena.nenamark2 (accessed on 16 May 2021). 33. V1-GPU Benchmark Pro. Invented Games. Available online: https://m.apkpure.com/kr/v1-gpu-benchmark-pro/com. InventedGames.V1Benchmark (accessed on 16 May 2021). 34. Seascape Benchmark-Test Your Device Performance. NatureApps. Available online: https://play.google.com/store/apps/ details?id=com.nature.seascape (accessed on 16 May 2021). 35. Basemark GPU. Basemark. Available online: https://play.google.com/store/apps/details?id=com.basemark.basemarkgpu.free (accessed on 16 May 2021). 36. 3D Benchmark. 42 Game Studio. Available online: https://apkpure.com/kr/3d-benchmark-android-gamers/com.cabodidev. threedbenchmark (accessed on 16 May 2021). 37. Relative Benchmark. RelativeGames. Available online: https://play.google.com/store/apps/details?id=com.re3.benchmark (accessed on 16 May 2021). 38. Antutu Benchmark. Beijing Antutu Technology Company Limited. Available online: http://www.antutu.com/en/index.htm (accessed on 16 May 2021). 39. PassMark PerformanceTest. PassMark Software. Available online: https://play.google.com/store/apps/details?id=com. passmark.pt_mobile (accessed on 16 May 2021). 40. Orbital Flight Benchmark-Asteroids Benchmarks. Virtual Arts Ltd. Available online: https://apkpure.com/kr/orbital-flight- benchmark-asteroids-benchmarks/virtualarts.benchmarks.orbitalflight (accessed on 16 May 2021). 41. Unreal System Benchmark. PlatinumBladeGames. Available online: https://play.google.com/store/apps/details?id=com. PlatinumBladeGames.UnrealSystemBenchmark (accessed on 16 May 2021). 42. OESK Benchmark-3D 2D Game. NTIMobile. Available online: https://apkpure.com/kr/oesk-benchmark-3d-2d-game/pl. ntimobile.oeskbenchmarktwo2 (accessed on 16 May 2021). 43. Snow Forest Benchmark-Asteroids Benchmarks. Virtual Arts Ltd. Available online: https://apkpure.com/kr/snow-forest- benchmark-asteroids-benchmarks/virtualarts.benchmarks.snowforest (accessed on 16 May 2021). 44. KZ FPS BENCHMARK 2017. KrzysztofZuk. Available online: https://play.google.com/store/apps/details?id=com. KrzysztofZuk.KZ3DBenchmark2017&hl=ko (accessed on 16 May 2021). 45. Bunny Benchmark. Alexander Fomin. Available online: https://play.google.com/store/apps/details?id=ru.alexanderfomin. bunnybenchmark&hl=ko (accessed on 16 May 2021).