
Energy Modeling and Management for Services in Multi-Tier Mobile Cloud Architectures

Dissertation

Presented in Partial Fulfillment of the Requirements for the Doctor of Philosophy Degree in the Graduate School of The Ohio State University

By

Zichen Xu, B.S., M.S.

Graduate Program in Electrical and Computer Engineering

The Ohio State University

2016

Dissertation Committee:

Prof. Xiaorui Wang, Advisor
Prof. Füsun Özgüner
Prof. Christopher C. Stewart

© Copyright by

Zichen Xu

2016

Abstract

Researchers’ prediction about the emergence of very small and very large computing devices is coming true. Computer users create personal content on their mobile devices, and this content is processed and stored on remote servers. This mobile architecture contains millions of devices as the edge and high-end servers as the cloud, in order to provide data services worldwide. Unlike data services in traditional architectures, data services in the mobile computing architecture are greatly constrained by energy consumption. Data services running in the cloud consume a large amount of electricity, accounting for 4% of global energy use. Data processing and transmission on mobile devices, such as smartphones, quickly drain their batteries.

Therefore, energy is one of the most important criteria in the design of these systems. To address this problem, we need to build an energy modeling and management framework to profile, estimate, and manage the energy consumption of data processing in the mobile cloud architecture.

We first start with energy profiling of data processing on a single server. The study discovers that there exist opportunities to find energy-efficient execution plans beyond merely fast plans. Based on the profile, we propose our online estimation tools for

modeling and estimating the energy consumption of relational data operations. Using this tool, we provide power/performance control for data processing. The control framework provides service-level-agreement guarantees while reducing the power consumption of the server on a best-effort basis. The control-theoretic design provides system stability when facing unpredictable workloads.

Using the same modeling process, we expand our research to optimize energy-related objectives, such as carbon footprint and cloud expense, on multiple nodes. We carefully study the processing of data on multiple nodes, and find that the processing pattern (i.e., read/write) significantly affects these objectives when replicating data objects across multiple nodes. In this way, we transform the optimization problem on time-varying load balancing into a semi-static decision problem of data replication. By solving this problem, we build two systems, CADRE and BOSS, to reduce the carbon footprint of serving data and the cloud expense of processing in-memory data, respectively.

The modeling and management process can also be applied to edge devices, such as smartphones. We first build an energy estimation tool for specific applications on smartphones using performance counters. Unlike traditional energy modeling work that uses system utilization, using performance counters can provide energy estimation for finer-grained executions and isolate the target energy profile. With an understanding of the energy profile of data applications on smartphones, we further model the battery usage of the device. Based on the energy model and the battery model, we propose a dual-battery management system for battery-powered devices. Certain batteries are favored by specific

workloads with particular energy demand patterns. Alternating the power supply between the two batteries can significantly improve the service time of the device, compared to a device powered with the same total battery capacity. Combining all of the energy modeling and management system designs above, we are able to significantly improve the energy efficiency of data services in each tier of the mobile cloud architecture.

To my parents, my wife, and my mentors

Acknowledgments

I have enjoyed my long journey pursuing my Ph.D. degree in the United States. I could never have made it this far without everyone’s help. It was never an easy decision for my parents to encourage and support their only child to travel across continents and study abroad for so many years. I feel guilty for not being by their side for such a long time, and I deeply appreciate their devoted love. I am also especially thankful for my wife Jiangyue (Jane)

Li. Marrying her is the most wonderful thing that has happened in my life. She is always by my side and adores my work. I am giving this thesis to her as my gratitude for her relentless support.

I am thankful and grateful to have worked with my mentor Dr. Xiaorui Wang, who taught me the right research attitude and the philosophy of doing research, provided insightful guidance, and offered much advice. His advice allowed me to rethink the essence of research, find the real key problem, and dig deeper to explore effective solutions to real-life challenges. His encouragement carried me through the most desperate period of my research. He is the model that I want to follow for the rest of my academic career. Along the way, I also want to thank Dr. Füsun Özgüner for serving on my committee.

I also want to show my gratitude to my dear friends and life advisors, Dr. Christopher Stewart and Dr. Yicheng Tu. I have learned so much from their advice on life. Talking to them, even within a short coffee break, can unchain me from a narrow view of a problem in research or life and let me move on to solve the key issue.

It is my pleasure and my life’s treasure to have had so many friends supporting me along this career path. I want to thank them for their help and support: Siwen Sun, Deng Nan, Kai Ma, Xiaodong Wang, Zhezhe Chen, Mai Zheng, Li Li, Marco Brocanelli, Kuangyu Zheng, Wenli Zheng, and Ziqi Huang.

Vita

2007 ...... B.S. Computer Technology

2009 ...... M.S. Computer Science

2011-present ...... Graduate Research Associate, The Ohio State University.

Publications

Research Publications

Zichen Xu, Yi-cheng Tu, and Xiaorui Wang, “Online Energy Estimation of Relational Operations in Database Systems”. IEEE Transactions on Computers, 64(11): 3223-3236, November 2015.

Yi-cheng Tu, Xiaorui Wang, Bo Zeng, and Zichen Xu, “A System for Energy-Efficient Data Management”. ACM SIGMOD Record, 43(1): 21-26, March 2014.

Zichen Xu, Nan Deng, Christopher Stewart, and Xiaorui Wang, “Blending On-Demand and Spot Instances to Lower Costs for In-Memory Storage”. In Proceedings of the 35th IEEE International Conference on Computer Communications, April 2016.

Nan Deng, Zichen Xu, Christopher Stewart, and Xiaorui Wang, “Tell-Tale Tails: Decomposing Response Times for Live Internet Services”. In Proceedings of the 6th International Green and Sustainable Computing Conference, December 2015.

Zichen Xu, Nan Deng, Christopher Stewart, and Xiaorui Wang, “CADRE: Carbon-Aware Data Replication for Geo-Diverse Services”. In Proceedings of the 12th IEEE International Conference on Autonomic Computing, July 2015.

Zhang Xu, Haining Wang, Zichen Xu, and Xiaorui Wang, “Power Attack: An Increasing Threat to Data Centers”. In Proceedings of the 21st Network and Distributed System Security Symposium, February 2014.

Zichen Xu, Yi-cheng Tu, and Xiaorui Wang, “Dynamic Energy Estimation of Query Plans in Database Systems”. In Proceedings of the 33rd International Conference on Distributed Computing Systems, July 2013.

Zichen Xu, Xiaorui Wang, and Yi-cheng Tu, “Power-Aware Throughput Control for Database Management Systems”. In Proceedings of the 10th International Conference on Autonomic Computing, June 2013.

Zichen Xu, Xiaorui Wang, and Yi-cheng Tu, “PET: Reducing Database Energy Cost via Query Optimization”. In Proceedings of the 38th International Conference on Very Large Data Bases, September 2012.

Fields of Study

Major Field: Electrical and Computer Engineering

Table of Contents


Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapters:

1. Introduction
   1.1 Thesis Statement
   1.2 Contributions
   1.3 Organization

2. Energy Modeling and Management for Data Services on a Single Node
   2.1 Dynamic Energy Estimation for Data Processing
       2.1.1 Energy Profiling for Relational Operations in Modern Servers
       2.1.2 Observations on DBMS Workload Characteristics
       2.1.3 Energy Modeling for Relational Operations
       2.1.4 Online Estimation Scheme
       2.1.5 Evaluation
   2.2 Power-Aware Throughput Control for Database Operations
       2.2.1 Power Performance Optimization for Database Operations
       2.2.2 Power/Performance Controller Design for Database Operations
       2.2.3 Experimental Results
   2.3 Discussion

3. Two Applications for Optimizing Data Services on Multiple Nodes
   3.1 CADRE: Carbon-Aware Data Replication for Geo-Diverse Services
       3.1.1 Redistributing Data Replications for Reducing Carbon Footprints
       3.1.2 CADRE Design: Data Replication for Carbon Reduction in Geo-Diverse Data Services
       3.1.3 Evaluation
   3.2 BOSS: Blending On-Demand and Spot Instances to Lower Costs for In-Memory Storage
       3.2.1 Reducing Cloud Prices for Renting Data Storage
       3.2.2 Blending On-Demand and Spot Instances to Lower Costs for In-Memory Storage
       3.2.3 Evaluation
   3.3 Discussion

4. Energy Modeling and Management on Edge Devices
   4.1 Energy Profiling of Battery-Powered Devices
       4.1.1 Performance Counter-based Energy Modeling
       4.1.2 Energy Modeling
       4.1.3 Evaluation
   4.2 Dual Battery Management
       4.2.1 Battery Characteristics
       4.2.2 Evaluation
   4.3 Discussion

5. Related Work
   5.1 Energy Modeling in Data Management Systems
   5.2 Distributed Data Replication
   5.3 Energy Modeling in Smartphones

6. Conclusion

Bibliography

List of Tables

2.1 The maximum power consumption of major hardware components in our database server. Similar results are also reported in other work, cited in the rightmost column.
2.2 Key quantities in power estimation models.
2.3 Calibrated energy cost coefficients of relational operators.
2.4 Energy cost functions for relational operators.
2.5 Notations and symbols.
3.1 Symbols and notations. Bold indicates vectors.
3.2 On-demand, cache, and spot instance profiles.

List of Figures

1.1 A simple architecture overview of the mobile cloud.
1.2 Contributions in the diagram of the mobile cloud architecture.
2.1 The composition of a query execution plan in a DBMS (Q5 in the TPC-H benchmark [105]).
2.2 (a) CPU, memory, and hard disk power consumption under different levels of workload intensity. (b) Hard disk power consumption under sequential read and random read workloads.
2.3 Active energy consumption under different access patterns.
2.4 Active energy consumption of join algorithms.
2.5 Estimation accuracy of three models in nine workloads with different data sharing patterns.
2.6 Estimation accuracy of three models in nine workloads with different data sharing patterns and different data access patterns.
2.7 Model behavior at runtime under competing CPU-intensive tasks.
2.8 Model behavior at runtime under competing I/O-intensive tasks.
2.9 Comparison of static and RLS with other energy estimation models.
2.10 Studies on low power states of memory and CPU. All data are normalized to the first histogram (active case in memory and DVFS100 in CPU). The performance histograms are quantified on the left y-axis and the power histograms on the right y-axis. Since the DBMS has no throughput under the standby/powerdown states in memory and the DVFS0 state in CPU, there is no performance data under those power states.
2.11 The impacts of CPU frequency (i.e., DVFS level) on active power consumption (a) and DBMS throughput (b). The five different workloads in (b) contain different ratios of I/O-intensive queries. All data in (b) are normalized to the largest throughput, obtained with the workload containing 5% I/O-intensive queries at the maximum CPU frequency.
2.12 The relationship between a workload’s frequency-to-throughput sensitivity and the percentage of I/O-intensive queries in the workload (λ).
2.13 The power-aware throughput control architecture. The names in parentheses are given using control terminology.
2.14 A sinusoidal throughput set point and the controlled result.
2.15 A snapshot of normalized database throughput and active power consumption over 50 control periods in three different system settings.
2.16 The total energy savings of Normal, SpeedStep, and PAT. All data are normalized to the data of Normal.
2.17 A snapshot of relative throughput information and DVFS level over 50 control periods.
2.18 The energy savings of the five control techniques. Data are normalized to the PAT case.
3.1 (a) Querying data replication over geo-diverse sites. (b) Consistent hashing migrates data too often under time-varying heterogeneity.
3.2 (a) Carbon emission of two different data objects (lineitem and order are data tables from TPC-H) with different replicas. (b) Carbon emission of the same data object spawned at different times. t0 = 12AM, 05/09/2011, ∆T = 6 hours. The vertical lines indicate the number of replicas with the lowest carbon emissions.
3.3 Data paths for queries in CADRE. Boxes represent the system that runs at all sites. Shaded boxes reflect CADRE components with novel design. CADRE assumes create queries precede read and write queries.
3.4 CDF of modeling performance across five sites. *: best performance of the per-request model on one site.
3.5 The convex footprint-replication curves.
3.6 Carbon footprints for a 1-week snapshot of the Google trace. Emission and workload rates are provided hourly.
3.7 Performance comparison using the WorldCup trace (first row) and the Google trace (second row).
3.8 (a) Carbon footprints and (b) average latency comparison between replication strategies and routing policies.
3.9 Data replication simulations under (a) decreasing spare capacity and (b) scale-out.
3.10 Observations on blended Amazon stores.
3.11 Prices for instances leased on demand and spot markets in IaaS clouds.
3.12 The BOSS framework.
3.13 Inter-site replication in BOSS.
3.14 The intra-site configuration design of BOSS. Dashed circles represent spot instances while solid circles are on-demand instances.
3.15 A 24-hour throughput of four sites under local price variation. Data are recorded at the same time, but the x-axis is adjusted to each site’s local time. QPS is queries per second.
3.16 Performance snapshots between BOSS and Amazon baselines in US West.
3.17 One-month performance comparison using database (first row) and word-count (second row).
3.18 The total number of instances leased using database. The number above each bar is the average instance utilization.
3.19 The impact of tuning the weight coefficients α and β using database. Note the x-axis is in log scale.
3.20 Impact of (a) profiling period and (b) risk on BOSS. (a) is normalized to Oracle, and (b) is normalized to Default.
3.21 Scale-out performance of deploying BOSS on Amazon’s and Google’s cloud platforms. Normalized to Default.
4.1 The critical path of accessing web contents. The dashed arrow indicates that JavaScript may trigger re-execution of previous phases.
4.2 The one-second performance constraint for rendering web activities, which leaves only a 200ms performance budget for client-side energy optimization.
4.3 The architecture of REEWA. Shaded boxes are the major components. HW is hardware and HWC is hardware counter.
4.4 Normalized (i.e., to their maximum value detected) power consumption, non-halt CPU cycles, and CPU utilization during the network phase on Nexus 4. At the millisecond level, CPU utilization cannot reflect the power change immediately (around 50ms latency), while CPU cycle counts capture the runtime power changes with negligible delay (less than 1ms).
4.5 Normalized power consumption, package rates, and hardware state changes during the network phase on Nexus 4. The package count captures the ever-changing power faster than reading the hardware state from the system file. Meanwhile, the package count has a much smaller delay than the hardware state change reflected in the system file (e.g., 50-100ms).
4.6 Normalized estimation accuracy of REEWA, compared with three other baselines, on the top 25 websites and different browsing phases (from the top 2,500 webpages). The results are normalized based on the measurement.
4.7 REEWA has high energy estimation accuracy (close to the measured per-phase average), while the baselines are unaware of different phases, resulting in degraded estimation performance, during browsing Wikipedia.com on Nexus 4.
4.8 Estimation accuracy comparison between REEWA and other baselines while accessing static HTML (first row), full JavaScript HTML5 (second row), and full Flash (third row) using a browser (i.e., Chrome), a hybrid app (i.e., HackerNews), and a Web App. All experiments are done on Nexus 5.
4.9 (a) Total service times using different Lithium-ion batteries. (b) Average task finish times in different discharge periods of the LiCoO2 battery.
4.11 The system diagram of the online phase. The shaded boxes are the components we added to the system, the grey arrows indicate the message passing for battery modeling, and the red arrows are the original message communication in the smartphone system.
4.12 The system diagram of the “offline” phase. All notations are the same as in Figure 4.11.
4.13 (a) Battery estimation when self-charging happens; (b) battery estimation at the tails of the discharge curve. The GroundTruth baseline is the monitored data from the battery meter. Default is the original BattStat read from the Android system and converted into real capacity (mAh). BattTrack is recently published work on modeling remaining battery capacity.
4.14 Estimation accuracy comparison in benchmarks.
4.15 Overhead comparison in benchmarks.
4.16 Service time comparison using mixed batteries on Nexus 5.
4.17 Runtime battery assignment under workload change.
4.18 Self-charging gain under various discharge amounts and resting times.
4.19 Service time comparison for different round-robin periods. For example, RR(2) has a 2-minute switching period.
4.20 Service time comparison with several baselines on Nexus 5.
4.21 Service time comparison using different benchmarks on Nexus 5.
4.22 The impact of aging/temperature on performance gain. Figure 4.22(b) shows performance powered by four different batteries, ascending by their age.

Chapter 1: Introduction

Computing systems are much closer to users than ever before. The mobile cloud architecture combines cloud computing, mobile computing, and the network in between. The infrastructure brings rich computational resources to end users via smartphones, called the edge, and harvests large quantities of data for data service providers, called the cloud, as shown in Figure 1.1. For example, a mobile user creates a post and submits it from his phone. This new data object will be synchronized to the backend data servers in the cloud and duplicated to a network of geo-diverse sites. As a result, friends in different locations can receive the update within a short period of time and view the new content. This computing infrastructure enables execution of rich mobile applications on a plethora of mobile devices. However, the performance of the computing infrastructure is constrained by many factors, such as connectivity, availability, and, most importantly, the energy consumption of the computing devices at both ends of the architecture. Smartphones are usually powered by batteries; traditional energy modeling and management techniques are too complex and heavyweight to be directly applied to these devices because of the battery limitation. On the data-server side, servers in sites (e.g., data centers) are known

for their enormous energy consumption. To solve this problem, there have recently been many efforts on energy management in data centers [16, 25, 20, 21, 52, 30, 53, 113]. However, most solutions focus mainly on the operating system (OS) level. As a result, they cannot be directly applied to energy management for data processing, due to the lack of sufficient knowledge of data processing behaviors. Therefore, it is important to design data-specific energy modeling and management mechanisms for the mobile cloud infrastructure.

Figure 1.1: A simple architecture overview of the mobile cloud.

1.1 Thesis Statement

To address the energy concern in the mobile cloud, we aim to build a general energy modeling and management framework. To be specific, given a data object, we

aim to understand and, where possible, reduce the energy consumption of data services in the mobile cloud, with respect to performance objectives and system constraints. The desired framework shall possess the following features:

- Approximate optimal performance: the framework shall at least provide approximate energy-saving performance within an upper bound of the possible optimal result, with respect to system constraints;

- Accuracy: the framework shall provide accurate profiles of data processing;

- Adaptivity: the framework shall be adaptable to any device in the infrastructure;

- Light-weight: the framework shall have very small performance and energy overheads.

Among the four, performance and accuracy are the key requirements, and thus the main metrics for evaluating our models in this dissertation.

1.2 Contributions

Our work mainly focuses on energy modeling and management for the computing devices in the mobile cloud infrastructure, as shown in Figure 1.2. We start by understanding the energy profile of data processing in a modern server and discover the possibility of finding energy-efficient execution plans rather than focusing on performance only. With such a profile, we build a modeling tool that provides energy estimation for storing and processing data (e.g., relational operations). Based on the estimation, we are able to

manage the power/performance of data processing on a single node and provide the tradeoff in between, as the first layer on the cloud side in Figure 1.2. Further, we expand the work to optimize data services on multiple nodes for objectives such as carbon footprint and cloud expense. We propose two systems that manage data replication based on our modeling process for the energy consumption of data processing and storage, as the second layer on the cloud side in Figure 1.2. At this point, we have an overall picture of energy modeling and management for data services at the cloud side. The rest of the dissertation provides energy modeling and management for edge devices, such as smartphones. We build energy estimation tools based on hardware counters, and battery models to implement our dual-battery system design to extend the service time of battery-powered devices. Next, we elaborate our contributions in each of the categories discussed above.


Figure 1.2: Contributions in the diagram of mobile cloud architecture.

Dynamic Energy Estimation of Relational Operations in Database Systems: Previous studies on database energy management have focused on either high-level ideas [61] or energy profiling of the hardware used in data processing [59]. We propose a comprehensive mechanism to estimate the energy cost of query processing at the DBMS level and develop a general energy model based on an extensive system identification study. We develop a series of physical models for evaluating the energy cost of individual relational operators in a static environment. Such models form the foundation of the robust online model. We design an online model estimation scheme to automatically adjust the parameters of the static model in response to the dynamic environment. We implement our model in the kernel of a real DBMS, and evaluate it on a physical testbed with a comprehensive set of workloads generated from TPC benchmarks and scientific database traces.

Power/Performance Throughput Control for Database Management Systems: As one of the first attempts to introduce classic control theory into energy management in database systems, we design a control framework, PAT, to maintain the throughput of the DBMS while minimizing active power consumption. The control-theoretic design guarantees system stability. We study the complex relationships among query statistics,

DBMS throughput, hardware power states, and the overall power via extensive experiments. Our results show that (1) there exist great energy savings when tuning DVFS with

I/O-intensive workloads; (2) the relationship between DBMS throughput and CPU frequency can be approximated by a linear model when the incoming DBMS workload statistics are steady;

(3) the ratio of I/O-intensive queries in the workload plays a major role in the impact of workload statistics on the control framework.

We implement PAT on a physical testbed and evaluate it with extensive workloads generated from standard database benchmarks. The results show that our solution achieves significantly (51.3%) more energy savings and the fewest control errors compared with other control solutions.

Reducing Carbon Footprint for Data Services in Geo-diverse Sites: Prior work [69, 99] dynamically dispatches queries to the sites with the lowest emission rates. However, data replication introduces several unique features of data processing: a read query can be processed at any site and is dispatched to the site with the lowest emissions, so replication decreases its footprint; a write query updates all sites that host the data, so replication increases its footprint.

We propose an approach for data replication that extends consistent hashing and exploits time-varying emissions. The framework includes (1) a modeling approach that reduces a wide range of replication policies to functions that map replication factors to carbon footprints; and (2) an online algorithm that reduces carbon footprints for heavily accessed objects while adhering to capacity and availability constraints. We provide a thorough evaluation with realistic workloads and emission traces that reveals up to 70% carbon savings.
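As a minimal sketch of the decision this modeling approach enables (under our own assumptions, not CADRE's actual code), the snippet below picks a replication factor given a function mapping replication factors to carbon footprints; the function and parameter names are hypothetical stand-ins for the availability and capacity constraints:

```python
def pick_replication_factor(footprint, k_min, k_max):
    """Pick the replication factor with the lowest modeled carbon footprint.

    footprint: function k -> carbon footprint (convex, per the
    footprint-replication curves); k_min enforces availability,
    k_max enforces capacity.
    """
    return min(range(k_min, k_max + 1), key=footprint)
```

Because the footprint-replication curves are convex, a search that stops at the first uptick would also suffice; the linear scan keeps the sketch simple.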

Reducing Cost for Data Services in Cloud In-memory Storage: In-memory storage is vital for cloud services that respond to queries interactively, especially data services spread all over the world. To reduce the cost of renting cloud in-memory storage, we

present a novel framework for in-memory storage that blends reliable on-demand instances and cheap spot instances. We show that our inter-site replication can mitigate the effects of spot instance failures, and we prove that our online replication algorithm is at least O(1 + (ω/|k_d|)F(·))-competitive. Our design achieves high throughput and handles spot instance failures by dispatching read queries to spot instances and state-change queries to on-demand instances. Using an efficiency frontier of cost savings and response time variations, we can manage the risk/saving tradeoff inside one site. We evaluate our system using real spot instances from

Amazon and Google, and we show that our framework can significantly reduce costs, by 84%, compared to other managed in-memory storage platforms.
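The dispatch rule described above can be summarized in a few lines; this is a toy sketch with names of our own, not BOSS's actual API:

```python
import random

class BlendedStore:
    """Toy query dispatcher over a blended pool of instances."""

    def __init__(self, on_demand, spot):
        self.on_demand = on_demand  # reliable instances: state-change queries
        self.spot = spot            # cheap, revocable instances: read queries

    def dispatch(self, query):
        # State-changing queries must land on reliable on-demand instances.
        if query["type"] in ("write", "update", "delete"):
            return random.choice(self.on_demand)
        # Reads go to spot instances; fall back if all have been revoked.
        return random.choice(self.spot or self.on_demand)
```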

Performance Counter-based Energy Modeling for Smartphones: We argue the necessity of energy estimation for web activities on smartphones, and show that traditional energy estimation methods are inefficient for web activities. To solve this problem, we propose a framework, REEWA, to provide runtime energy estimation for web activities on smartphones, featuring performance counter-based energy models. We show that our counter selection process and customized implementation for web activities can significantly mitigate the large overhead of using performance counters. Thus, REEWA achieves high accuracy, low overhead, and fine granularity. We prototype REEWA in Android. Empirical results show that our framework can significantly improve estimation accuracy, by up to 33% for web activities, with negligible overhead (1%), compared

to traditional energy estimation methods. We apply REEWA to support energy optimizations for web activities, which reduces energy consumption by an extra 23% on average.

Dual Battery Management System for Smartphones: Recognizing that the battery discharge pattern is critical to the device service time, we conduct a series of studies on today’s smartphone batteries. We build a system-level battery model and use it for online battery estimation. We find that certain batteries are favored by specific workload types, and that smartly exploiting self-charging in Li-ion batteries can significantly improve the overall battery life and reduce the completion time of the target workload. Based on these insights, we propose a dual-battery management framework, called BEAST, to extend the service time of battery-powered devices. We integrate our battery models inside the current Android system, and design the battery assignment strategies for BEAST. Our prototype is implemented on two smartphones and evaluated with a dozen batteries. Our framework can significantly improve battery life, by up to 42%, compared to the default battery management system.

1.3 Organization

The rest of the thesis is organized as follows: Chapter 2 introduces our energy modeling and management work on a single node. Chapter 3 presents our optimization frameworks for data services on multiple nodes. Chapter 4 illustrates our energy modeling and management solutions for edge devices. Chapter 5 discusses the details of prior work, including energy profiling, modeling, and conservation solutions for data processing on smartphones and servers. Chapter 6 concludes the dissertation.

Chapter 2: Energy Modeling and Management for Data Services on a Single Node

In this chapter, we discuss our efforts on energy modeling and management for data services on a single node. More specifically, our work focuses on a major application: relational database management systems (RDBMSs). The relational database is the classic data service model for storing and processing data objects in computers. Our work proposes one of the first energy models for relational operations in data management, based on energy profiles of each single relational operation executed on an enterprise-level server. The energy model enables RDBMSs to choose a more energy-efficient execution plan rather than the performance-only plan. Meanwhile, the energy estimation allows additional throughput control while reducing the power consumption of a single node. We describe the detailed modeling process, our dynamic estimation scheme, and empirical results in Section 2.1.

Then, we provide a power/performance control tool to reduce the power consumption of data processing on a single node in Section 2.2. We discuss the impact of our contributions to data services on a single-node server in Section 2.3.

2.1 Dynamic Energy Estimation for Data Processing

Data centers (DCs) are known to be the “SUVs of the tech world” for their enormous energy consumption. Triggered by this problem, there have recently been many efforts on energy management in data centers [16, 25, 20, 21, 52, 30, 53, 113]. However, those solutions focus mainly on the operating system (OS) level. As a result, they cannot be directly applied to application-level energy management, due to the lack of sufficient knowledge of application behavior. Therefore, it is important to design application-specific energy estimation and management mechanisms. In this section, we target a very important type of DC application – database management systems (DBMSs).

Energy management is a relatively new topic in the database research field. The theme in such research is to design DBMSs with energy consumption as a first-class performance goal, as advocated by the Claremont report [15]. Current work in energy-aware DBMSs has focused on energy-aware query optimization that considers both time performance and energy usage as the target [61, 120], and on power management policies in distributed database systems [21, 86]. Unlike other studies that focus on the implementation of energy-aware

DBMSs, this section reports our work on a key issue that has so far received little attention – modeling the energy cost of database systems.

Energy cost estimation in databases carries high technical significance in energy-aware database design. In database systems, the query optimizer evaluates different computational paths (named plans) by explicitly labeling their resource consumption. This knowledge is indispensable in finding a good query execution plan with high energy efficiency

[63]. For example, recent studies [61, 120] have shown that in a typical database there are many query plans that require much less power while suffering little performance degradation. Therefore, energy conservation can be achieved by identifying such query plans. Note that the information needed for making such decisions is hidden inside the database system and thus cannot be captured at the OS or hardware level [61, 120]. Therefore, to

find query execution plans with a low energy cost and capture these power-saving opportunities, a practical approach is to provide accurate energy estimation in the query optimization process. We believe that such work is important for energy-aware data processing, and it also builds the foundation for power-aware workload management at the data center scale. In this section, we report the results of our study of energy cost estimation in DBMSs during query optimization, with a focus on quantifying the energy consumption of query plans.

Energy cost estimation in a DBMS serves two purposes in energy-aware DBMS design. The first is that, like the traditional cost estimation in a DBMS, which helps to select faster query plans, energy cost estimation enables the selection of query plans with a lower energy cost. Second, the quantification of the estimated energy cost of individual query plans enables accurate energy cost estimation of the entire workload. In this way, the model can provide valuable insights for other energy management policies, such as energy consolidation and projection in the DC [20]. Our static model based on offline analysis can partially achieve this goal, but it is essential that our model be robust under system and workload variations. Thus, we propose an online estimation solution based on the static model.


Figure 2.1: The composition of a query execution plan in a DBMS (Q5 in the TPC-H benchmark [105]).

We build a dynamic energy cost estimator for accurate, robust, and fast estimation of the energy cost in a DBMS, whose basic form is:

$\hat{E} = W_{cpu} N_{tuples} + W_{I/O} N_{pages}$   (2.1)

Specifically, we design and evaluate a two-level framework to fulfill the above design goals. In a DBMS, each query plan is a unique path that executes a series of relational operators, each consisting of a set of basic operations, as shown in Figure 2.1. We first introduce our study of the power breakdown of the basic operations of relational operators. Based on that,

Table 2.1: The maximum power consumption of major hardware components in our database server. Similar results are also reported in other work, cited in the rightmost column.

Component                    Power (Watt)   Citation
CPU: Xeon E5645              88.9           [107]
Memory: 32 GB                20.5           [73]
HDD: Seagate 2TB 7200RPM     0.42           [83]
Other parts                  0.23           N/A
Total                        111

we build a static model that describes the energy consumption of relational operators according to their resource needs. The statistics of relational operators are provided by a modified DBMS kernel, and their energy cost coefficients are derived from a training query set using classic regression tools. Such models show a high accuracy in predicting energy consumption in a static environment. However, the values of the energy cost coefficients (e.g., the number of Joules needed to process an indexed tuple) depend on system states (e.g., CPU utilization) and workload statistics (e.g., table cardinality, query arrival rate, etc.). To further improve the static model by making it adaptable to environmental and workload dynamics, we propose an online model estimation scheme that uses a recursive least squares (RLS) estimator to periodically update the model parameters.
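To make the calibration step concrete, the following is a minimal sketch of how such coefficients could be fit, assuming a linearized model in which each training query contributes a row of per-operation counts (from the optimizer) and a measured active energy; numpy's least-squares routine stands in for the "classic regression tools" mentioned above, and all names are ours.

```python
import numpy as np

def calibrate_static_model(op_counts, measured_energy):
    """Least-squares fit of per-operation energy cost coefficients.

    op_counts: (queries x operations) matrix of tuple/page counts reported
    by the query optimizer for each training query.
    measured_energy: active energy measured per training query.
    Returns one energy cost coefficient per basic operation.
    """
    coeffs, residuals, rank, _ = np.linalg.lstsq(op_counts, measured_energy,
                                                 rcond=None)
    return coeffs
```

In practice, the fitted coefficients would then seed the online estimator described in Section 2.1.4.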

2.1.1 Energy Profiling for Relational Operations in Modern Servers

The system identification study begins with the roles that different hardware components play in the energy consumption of data processing. For this purpose, we measure the active power consumption of the major hardware components (shown in Table 2.1). The results show that the CPU and memory contribute the most to the active power (about 99%). The active power consumption of other components (e.g., the hard disk) is negligible.

To further reveal the power consumption patterns of different hardware components, we also record the power cost under different DBMS workload intensities. Figure 2.2a shows:

(1) the CPU power cost is positively correlated with the workload intensity, (2) memory rarely stays idle, and (3) the power consumption of the memory and hard drive remains steady as the workload intensity changes. Also, as shown in Figure 2.2b, the energy use pattern of disks is not affected by data access patterns – both sequential and random access lead to the same power cost. This is due to an important physical feature of storage hardware – their leakage power costs always dominate. The above findings are supported by results from other work on system studies, listed in the last column of Table 2.1.

To estimate the energy cost of a query plan, we are essentially interested in its marginal energy consumption (namely, active energy), if we assume the baseline power is always the same as the leakage power. The above findings in Figure 2.2 confirm our intuition that the marginal energy consumption of a query plan is positively related to its size (i.e., the number of operations N_tuples). Furthermore, the power consumption patterns indicate that

CPU energy increases superlinearly with N_tuples, while the energy cost of the storage system increases linearly with N_tuples. Thus, Equation 2.1 can be transformed into:

$\hat{E} = f(W_{cpu}, N_{tuples}) \times N_{tuples} + \overline{W}_{I/O} \times N_{pages}$   (2.2)

$f(W_{cpu}, N_{tuples}) = W_{cpu} \times N_{tuples}^{m}$   (2.3)


Figure 2.2: (a) CPU, memory, and hard disk power consumption under different levels of workload intensity. (b) Hard disk power consumption under sequential read and random read workloads.

where $\overline{W}_{I/O}$ is the static energy cost coefficient of I/O operations and m is a model coefficient for CPU energy consumption. Based on the regression curve obtained from the system identification experiment, m = 0.5 on our platform. Note that the above model is significantly different from what is used in a traditional query optimizer with processing time (or throughput) as the optimization goal [57, 92]. In the latter, the I/O cost is the dominating factor and often overshadows the CPU cost.

2.1.2 Observations on DBMS Workload Characteristics

We are interested in verifying whether the above models hold under different query processing patterns. Therefore, we extend the identification experiment in a fine-grained manner.

The results of our extensive experiments using typical database workloads, shown in Figure 2.3, however, reveal that the active CPU energy consumption does not always increase in a non-linear way with the number of processed tuples as in Equation (2.2). In other words, power does not always increase – it levels out beyond a certain value of N_tuples. In these experiments, we run the same query repeatedly on databases of different sizes to avoid the impact of resource sharing. Specifically, by changing the range of the search predicates or the size of the underlying database tables, the number of tuples accessed by the query processing algorithms changes between runs. Figure 2.3 shows the active CPU energy consumption of two types of queries: one with a sequential table scan and the other with an indexed table scan. We can observe that for both queries, the CPU energy consumption first exhibits a quadratic growth with the total number of tuples accessed until reaching its “hockey point”, after which the relationship between energy cost and query size becomes linear (the overall shape is like a hockey stick): the CPU only gets to process a constant number of tuples per time unit because of the I/O bandwidth constraint. By looking deeper into the low-level operations performed by the computer to execute such workloads, we believe the reasons for the above observations are as follows: when the workload size is small (before the knee point), the active energy consumption is dominated by the CPU energy consumption, which grows quadratically with the workload size; when the system is fully utilized, the CPU energy coefficient (i.e., power) is almost constant according to the curve in Figure 2.2, and the energy consumption increases linearly with the workload size.

Also, as we observed from Figure 2.3, the curve and the knee point differ across scan methods; therefore, we need to consider models for each individual relational operation. For example, to represent the piecewise curve in Figure 2.3, we remodel the energy cost of database operations as:

$\hat{E} = \begin{cases} W_{cpu}\,N_{tuples}^{\,m+1} + \overline{W}_{I/O}\,N_{pages} & \text{for } N_{tuples} \le N \\ W_{cpu}\,N_{tuples} + \overline{W}_{I/O}\,N_{pages} & \text{for } N_{tuples} > N \end{cases}$   (2.4)

where N is the smooth point of the energy cost curve shown in Figure 2.2. Note that the slopes of the regression lines in Figure 2.3 are good indicators of the energy cost coefficients of the relational operations. We use them as the initial values for model calibration.
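As a minimal illustration of Equation (2.4), the sketch below evaluates the piecewise "hockey stick" model; the function and argument names are ours, with m = 0.5 as identified on our platform.

```python
def active_energy(n_tuples, n_pages, w_cpu, w_io, smooth_point, m=0.5):
    """Piecewise active-energy model of Equation (2.4).

    Below the smooth point, the CPU term grows superlinearly
    (w_cpu * n_tuples**(m + 1)); past it, the system is I/O-bound and
    the CPU term grows linearly with n_tuples.
    """
    io_term = w_io * n_pages
    if n_tuples <= smooth_point:
        return w_cpu * n_tuples ** (m + 1) + io_term
    return w_cpu * n_tuples + io_term
```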

We also run such experiments for one-join queries (only one join operation between two tables). After eliminating the energy cost of the table scan operations (a join always happens after a scan of the two input tables), we found that the energy cost of the pure join operation

Figure 2.3: Active energy consumption under different access patterns.

is in a linear relationship with the size of the join operation, as illustrated in Figure 2.4. Furthermore, Figure 2.4 also shows that different join algorithms carry different energy costs.

For example, hash joins always consume much more energy than the other two joins under the same workload size, and the difference between them grows with the size of the operation. For join operations, we do not need to find the smooth point, but it is still necessary to explore the different slopes of the energy-size curve for different join algorithms after isolating their energy consumption from that of the lower-level scan operations.

Based on the findings of the system identification experiment and the refined model in Equation 2.2, it is necessary to quantify the number of operations N_tuples, the model parameters (i.e., the unit power costs $W_{cpu}$ and $\overline{W}_{I/O}$), and the smooth point N for each individual relational operation.

Figure 2.4: Active energy consumption of join algorithms.

The $\overline{W}_{I/O}$ is constant, according to our previous study. The N_tuples is readily available from the existing query optimizer through its knowledge of the data table and its data histogram/cardinality. Note that the original resource estimator in the DBMS can be highly unreliable; we calibrate the coefficients in this estimator for better resource estimation accuracy. The key remaining problem is to find $W_{cpu}$ and N. Next, we present our efforts in finding and calibrating energy models for each individual relational operator.

2.1.3 Energy Modeling for Relational Operations

For each operator, the model in Equation (2.4) shall be modified based on its processing behavior. In the remainder of this section, we introduce energy models for a set of popular relational operators. Readers interested in more detailed work on the model calibration can refer to [117]. A summary of the operator energy models can be found in Table 2.4, with all the symbols introduced in Table 2.2.

Table 2.2: Key quantities in power estimation models.

Symbol   Definition
n        The number of tuples retrieved, for CPU usage
p        The number of pages retrieved, for memory usage
R        The sorting algorithm coefficient
x        The indicator of the chosen relational operator
w_x      The CPU unit-energy cost of relational operator x
N_x      The smooth point of relational operator x

Table 2.3: Calibrated energy cost coefficients of relational operators.

Var   Seq      Idx      Sort     Bmap     Nest    Merge   Hash
w_x   0.0078   0.0093   0.1098   0.0193   0.153   0.165   0.189
N_x   1153     2109     N/A      2654     N/A     N/A     N/A

Note that all the variables in Table 2.2 except w_x and N_x can be obtained from the DBMS optimizer.

For single-table relational operators (i.e., selection and projection), we only consider two file organizations – heap files and index files – and their corresponding scanning algorithms – sequential scan and index-based scan, respectively. In addition, we consider a special type of index scan – the bitmap scan implemented in PostgreSQL. Sorting is a very important step in processing many relational operators, so we also study the energy cost of sorting (although it is not a relational operator per se).

Table 2.4: Energy cost functions for relational operators.

Method             Cost function
Sequential Scan    w_s n^{3/2} + wp for n ≤ N_s;  w_s n + wp for n > N_s
Index Scan         w_i n^{3/2} + wp for n ≤ N_i;  w_i n + wp for n > N_i
Sorting            w_t nR
Bitmap Scan        w_b n^{3/2} + w_t nR + wp for n ≤ N_b;  w_b n + w_t nR + wp for n > N_b
Nested Loop Join   w_l n + w_t nR + wp
Sort Merge Join    w_m n + w_t nR + wp
Hash Join          w_h n + w_t nR + wp

Sequential Scan: a sequential scan searches each row of the heap file (data table) and emits the relevant columns according to the predicate. In an equality search (the predicate equals a value), the anticipated search size is m/2, where m is the size of the table that contains p pages. Thus, the anticipated energy cost, according to Equation 2.4, is w_s (m/2)^{3/2} + wp before reaching the smooth point. After the system is fully utilized, the estimated energy cost is w_s (m/2) + wp. Note that the anticipated search size is different for a range search on the same table. In the implementation, the anticipated search size is already available because the database optimizer has knowledge of the data distribution and cardinality in each table. Thus, we use n = m/2 as the number of anticipated basic operations in the model throughout this section.

Index Scan: an index scan is similar to a sequential scan except that it uses a (tree-based or hash) index to reduce the number of tuples accessed. The estimated energy cost for an index scan is w_i n^{3/2} + wp for searching the n anticipated tuples from p pages before the smooth point. When the system is fully utilized, the estimated energy cost is w_i n + wp. Note that the unit energy cost of accessing an indexed tuple (w_i) is different from that of a tuple in a sequential scan (w_s).

Sorting: sorting is a CPU-hungry operation due to the need to sort the list in multiple runs. The sorting size and the specific sorting algorithm are the key factors in estimating the sorting energy cost. For the merge sort algorithm implemented in PostgreSQL (other database systems may differ), the energy cost of sorting is w_t nR, where n is the number of tuples fetched to be sorted and R is a sorting-algorithm-related coefficient.

Bitmap Scan: a bitmap scan searches the index file using its bitmap index, which is based on bit arrays (commonly called bitmaps) of columns. The scanned result is then sorted by the bitmap index. Thus, its energy cost is w_b n^{3/2} + w_t nR + wp before the smooth point and w_b n + w_t nR + wp for larger query sizes.

Joins: for any two-table join (on original or temporary tables), the energy consumption depends on the join algorithm used. According to the results in Figure 2.4, the energy consumption is linear in the join size after eliminating the input table scan costs. Thus, we apply a similar linear model to the nested loop, sort-merge, and hash joins, adding the sort cost incurred by the final sort operation after each join. Due to limited space, we omit the detailed explanation of the algorithms here; interested readers can refer to [117]. Values of the calibrated energy coefficients used in the static model on our platform are listed in Table 2.3.
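The per-operator cost functions of Table 2.4, together with the calibrated coefficients of Table 2.3, can be folded into a small dispatcher; the sketch below is ours, with w_page standing in for the unit I/O cost per page written as wp in the table.

```python
# Calibrated coefficients from Table 2.3: unit energy costs w_x and
# smooth points N_x (sorting and joins have no smooth point).
W = {"seq": 0.0078, "idx": 0.0093, "sort": 0.1098, "bmap": 0.0193,
     "nest": 0.153, "merge": 0.165, "hash": 0.189}
SMOOTH = {"seq": 1153, "idx": 2109, "bmap": 2654}

def operator_energy(op, n, p, w_page, R=1.0):
    """Energy cost functions of Table 2.4.

    op: operator key as in W; n: tuples retrieved; p: pages retrieved;
    w_page: unit I/O energy cost per page; R: sorting algorithm coefficient.
    """
    io = w_page * p
    if op in ("seq", "idx", "bmap"):
        # Scans grow superlinearly below the smooth point, linearly beyond.
        cpu = W[op] * (n ** 1.5 if n <= SMOOTH[op] else n)
        sort = W["sort"] * n * R if op == "bmap" else 0.0  # bitmap sorts its result
        return cpu + sort + io
    if op == "sort":
        return W["sort"] * n * R
    if op in ("nest", "merge", "hash"):
        # Joins are linear in join size, plus the final sort and I/O terms.
        return W[op] * n + W["sort"] * n * R + io
    raise ValueError(f"unknown operator: {op}")
```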

2.1.4 Online Estimation Scheme

The main idea of the online model estimation approach is that we keep the structure of the previous physical model, treat the database system as a black box, and take the cost parameters as variables that reflect the combined effects of all possible system/environmental noises. We then use a feedback-control-style mechanism to periodically update those parameters using real-time energy measurements. As a result, errors generated by different sources are compensated. In each period of length T_s, we have an operation vector

$\vec{n} = \{n_1, n_2, \cdots, n_m, p_1, p_2, \cdots, p_m\}$ holding the quantities of all operations of each query currently being processed in the system. Recall that the values of $\vec{n}$ are provided by the query optimizer. In general, the RLS scheme builds a coefficient vector $\vec{w} = \{w_1, w_2, \cdots, w_n\}$ to hold all to-be-updated parameters and a variable $k = \sum_{j=1}^{n} w_j$. In our case, we have $\vec{w} = \{w_s, w_i, \cdots, w_h, N_s, N_i, \cdots\}$. Following the routine of constructing the coefficient vector in the RLS scheme, let us denote $k_1 = \sum w_x$ and $k_2 = \sum N_x$. The total system

coefficient vector is $\vec{w}' = \{\vec{w}_x, \cdots, \vec{N}_x, k_1, k_2\}$. For period j, $\vec{w}'$ is denoted as $\vec{W}(j)$. Similarly, for the associated parameter vector $\vec{n}$, we have another associated parameter vector

$\vec{n}' = \{n_1, n_2, \cdots, n_m, p_1, p_2, \cdots, p_m, 1, 1\}$, and we denote the $\vec{n}'$ at period j as $\vec{N}(j)$. At each period, the active energy consumption of the server E is measured. The RLS model generates a quantity e(j) as the baseline power from the measurements of the last j − 1 periods and the current energy cost E as follows:

$e(j) = \frac{(j-1)\,e(j-1) + E}{j}$   (2.5)

We set the initial energy consumption as e(0) = 0. The next step is to use this estimator to find the values of the energy cost parameters. The coefficient vector $\vec{W}(j)$ is updated as follows:

$\vec{W}(j) = \vec{W}(j-1) + \frac{\varepsilon(j)\,\vec{N}^T(j)\,M(j-1)}{\lambda + \vec{N}(j)\,M(j-1)\,\vec{N}^T(j)}$   (2.6)

where $\varepsilon(j) = e(j) - \vec{N}^T(j)\,\vec{W}(j)$ is the estimation error and $\vec{N}^T(j)$ is the transpose of $\vec{N}(j)$.

$M(j-1)$ is the covariance matrix of the vector $\vec{N}(j)$, and $\lambda$ is a constant forgetting factor within [0, 1] – a smaller $\lambda$ enables the estimator to forget history faster. The RLS estimator adapts itself so that $\varepsilon(j)$ is minimized in the mean-square sense. When the two variables, $\vec{N}$ and e(j), are jointly stationary, this algorithm converges to a set of tap weights which, on average, are equal to the Wiener-Hopf solution [64]. The following routines are invoked at the beginning of every model-updating period j:

(1) Recording the workload statistics $\vec{N}(j)$ and the previous energy cost e(j);

(2) Computing $\vec{W}(j)$ based on the recorded data and Eq. (2.6).

As a recursive algorithm, the RLS estimator has a low computational overhead (tens of microseconds, as recorded in our experiments). It is also robust under different workload dynamics and system contentions. Based on results obtained from the static models upon running composite workloads (see [117] for details), the initial values of the energy parameters in our testbed are $\vec{W}(0) = \{0.00768, \cdots, 1153, 2109, 2654\}$.
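For concreteness, the following sketch shows the update loop of Equations (2.5)-(2.6); the class and variable names are ours, and the covariance update is the standard RLS form, which the text above does not spell out.

```python
import numpy as np

class RLSEstimator:
    """Recursive least-squares update of the energy cost coefficients,
    following Eqs. (2.5)-(2.6); a sketch, not the DBMS-kernel code."""

    def __init__(self, w0, lam=0.98, delta=1e3):
        self.W = np.asarray(w0, dtype=float)   # coefficient vector W(0)
        self.M = delta * np.eye(len(self.W))   # covariance matrix M(0)
        self.lam = lam                         # forgetting factor in [0, 1]
        self.e = 0.0                           # baseline estimate e(j)
        self.j = 0                             # period counter

    def update(self, N, E):
        """N: operation vector N(j) from the optimizer; E: measured energy."""
        N = np.asarray(N, dtype=float)
        self.j += 1
        # Eq. (2.5): running estimate of the baseline energy.
        self.e = ((self.j - 1) * self.e + E) / self.j
        # Estimation error eps(j), evaluated with the previous coefficients
        # as in standard RLS.
        eps = self.e - N @ self.W
        # Eq. (2.6): gain-weighted coefficient update.
        MN = self.M @ N
        denom = self.lam + N @ MN
        self.W = self.W + (eps / denom) * MN
        # Standard RLS covariance update (assumed; not stated in the text).
        self.M = (self.M - np.outer(MN, MN) / denom) / self.lam
        return self.W
```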

The length of T_s implicitly affects the accuracy of the RLS model. It is related to the frequency of incoming query requests. If the query arrival rate is high, T_s needs to be set smaller to sample sufficient variance; otherwise, we can make it longer to avoid possible disturbance and excessive computational overhead. In our experiments, we set it to 1/9 seconds, the same sampling frequency used in the energy consumption measurement.

2.1.5 Evaluation

Our testbed consists of a 2U Dell R710 server (3.0 GHz 12-core Xeon E5645 CPU, 32GB of DDR3 memory, and a 2TB 7200RPM HDD as local storage, as shown in Table 2.1) and various workloads generated from three SDSS batch workloads [96] and the 22 standard TPC-H queries [106]. Another machine is used to produce database workloads and collect experimental data, including query statistics and energy consumption. The latter is calculated from discrete power readings measured by power meters (i.e., a 34410A

digital multimeter [9] and a Watts up power meter [11]). To measure the CPU power, we use the digital multimeter with a current clamp attached to its supporting power wire.

The Watts up power meter connects the server to the power outlet to measure the total power of the server. The data server runs PostgreSQL 8.3.14 on Red Hat

5 (kernel 2.6.9). The DBMS’s kernel is modified to provide runtime information such as the estimated energy cost, the data histogram/cardinality, and the plan selection. We design synthetic test cases that simulate such noises. In all such experiments, we set the MPL (multiprogramming level)

∈ [10,2000] to create a more realistic database runtime environment in which multiple queries are processed concurrently.

We simulate the impact of different error sources using the following three types of test cases. Type I case: to test the accuracy of the RLS model under workload and system noises, we define workloads with different resource-sharing patterns. Specifically, we have share-everything (SE) and share-nothing (SN) workloads. The SE workload is generated by queries with small computation and a considerable amount of data shared with other queries. The SN workload consists of queries with long processing times and little data shared with other queries.

Type II case: poor estimation of the data distribution (in the form of data histograms) in database tables is the main source of estimation error in operation quantities [32].

The case study contains: (i) a deterministic access (DA) workload that searches a similar data region in the table from time to time; and (ii) a random access (RA) workload that randomly touches all spectra of the data domain. After running the DA workload, the query optimizer will quickly learn the data distribution, and resource estimation for incoming queries will be increasingly accurate. For the same reason, after running the RA queries, the data histogram information in the DBMS optimizer will be updated frequently, which leads to inaccurate resource estimation. In short, the purpose of this case study is to test the model when facing resource estimation error, more specifically, error in selectivity estimation.

Type III case: in a real-world database server, other applications run concurrently with database processes and may cause variations in the system’s resource availability. We design this case study to emulate those noises in two scenarios on

a data server. The first is to inject the system with a Fibonacci program; being computation-intensive, it competes with the DBMS for CPU resources. The second is to change the I/O availability at runtime via an RW program; RW is an I/O-intensive program that frequently reads/writes large files, thus competing with the DBMS for I/O bandwidth. Both cases introduce significant changes in system capacity and are used to verify the model’s performance under system error.

We compare the performance of the RLS model with two baselines: the static model and the ad hoc model. Note that, since we now have multiple queries running in the system, the average EER that we use to evaluate the system's performance is redefined as the arithmetic mean of the EERs of all involved queries.
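Written out, with Q denoting the set of queries in the workload and EER_q the estimation error of query q, this is simply:

average EER = (1/|Q|) · Σ_{q∈Q} EER_q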


Results of Type I case. In these experiments, we create nine workloads of different sizes to test the three models – the RLS model, the ad hoc model and the static model. As shown in Figure 2.5, the EERs generated by the RLS model are significantly smaller than those of the static model for both SE and SN workloads, with all-round average EERs of 8.89% and 6.93%, respectively. Clearly, the RLS model can effectively hide the error arising from the correlation among queries. The ad hoc solution, which can partially capture the interactions among queries, performs better than the static model. However, it is not comparable to the full-fledged RLS model – its EER is often triple (the average EERs are 33.54% and 34.78% for the two workload types). Another observation is that the SN workload usually causes more errors than the SE workload in the RLS model, but fewer errors in the static model. Our explanation is that the SN workload produces more fluctuations in energy consumption during execution than the SE workload does. Thus, depending on the value of the forgetting factor λ, the parameter estimator generates more errors than for SE workloads. The static model favors the SN workload since it has no knowledge of resource sharing among queries, and the estimation for the SN workload only needs to add up the energy consumption of each individual query.
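For reference, the adaptive estimator under test follows the standard recursive-least-squares update with a forgetting factor λ. A minimal sketch of that update is below; the class name, feature layout and default constants are ours, not the exact implementation used in the experiments.

import numpy as np

class RLSEstimator:
    """Recursive least squares with forgetting factor lam (0 < lam <= 1)."""
    def __init__(self, n_features, lam=0.98, delta=100.0):
        self.lam = lam
        self.theta = np.zeros(n_features)    # energy model parameters
        self.P = delta * np.eye(n_features)  # inverse correlation matrix
    def update(self, x, y):
        """Fold in one sample: feature vector x, measured energy y."""
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        gain = Px / (self.lam + x @ Px)      # how strongly this sample corrects
        err = y - self.theta @ x             # prediction error before the update
        self.theta += gain * err
        self.P = (self.P - np.outer(gain, Px)) / self.lam  # discount old samples
        return err

A smaller λ discounts history faster, tracking the fluctuating SN workloads more aggressively but also admitting more noise, consistent with the error pattern described above.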

Results of Type II case. As shown in Figure 2.6, the RLS model beats the other two models in accuracy when handling the DA workload – the average EER is 7.13%, with the highest being 11.9%. For the DA workload, queries always visit the same part of the table, which leads to a very high hit rate.


Figure 2.5: Estimation accuracy of three models in nine workloads with different data sharing patterns.

That is likely the reason why the estimation errors of the static model show a (roughly) linear increase in Figure 2.5. For the RA workload, although the query optimizer can produce large errors in resource estimation, our dynamic model captures the trend of such errors and compensates for them. The EERs are lower than 10% in most cases – the average EER is 7.26%, with the highest EER being 11.07%. For both workloads, the performance of the RLS model is similar because the relevant data statistics are kept updated in the online scheme of the RLS model.

Results of Type III case. In Figure 2.7, the system starts with 20 Fibo processes. This number rises to 54 at the seventh second, drops to 10 at the 24th second, and increases to 40 at the 33rd second. The change in CPU contention results in high resource estimation error.


Figure 2.6: Estimation accuracy of three models in nine workloads with different data sharing patterns and different data access patterns.

By comparing the number of running Fibo threads (red line) with the energy estimation error (green dashed line), we see that the RLS model captures the trend of such changes and reacts within a short period of time (i.e., less than three seconds). It keeps its average performance at 90% accuracy of the energy cost estimation.

When it comes to I/O resource competition, the measured energy consumption increases unpredictably with the number of RW applications in Figure 2.8. The reason is that the performance bottleneck of data processing is still the I/O bandwidth. When this critical resource is "stolen", most queries halt to wait for I/O, which results in a large amount of wasted energy. As a result, the estimation performance is greatly affected by applications that compete with the DBMS for the I/O resource, as at the changes at the 7th and 15th seconds in Figure 2.8.


Figure 2.7: Model behavior at runtime under competing CPU-intensive tasks.

When the system status becomes steady, such as in the time periods of 15 – 19 and 34 – 40 seconds, the RLS model recovers its performance within a few periods (slightly longer than in the Fibo case). In all, the RLS model can handle the noise from I/O bandwidth changes to some degree.

Comparison with other estimation methods. To highlight the benefit of the RLS model, we compare it with energy estimation models deployed at the OS level: System, a system-level energy estimation model [48], and SysAdap, an enhanced ad hoc OS-level energy estimation model [112]. We use these models to estimate the energy consumption of the workloads used in the above cases. The results are shown in Figure 2.9. It is no surprise that System has poor estimation performance, since it counts the energy cost of all hardware operations, including those not caused by DBMS operations. Compared with System, SysAdap shows relatively better performance in these cases because it periodically corrects itself according to the detected DBMS energy cost.


Figure 2.8: Model behavior at runtime under competing I/O-intensive tasks.

However, the performance of SysAdap is only comparable to that of the static model. The RLS model gives highly accurate estimations in all cases. Clearly, for database energy cost estimation, we need to build the estimation model inside the DBMS.

2.2 Power-Aware Throughput Control for Database Operations

The rapid growth of energy-related research in DBMSs is driven by the fact that data centers are energy starving. The increasing operating expenses of data centers (e.g., the electricity bill) quickly deplete the revenue earned from database services due to their accumulating energy demand. The power-performance tradeoff has now become a new key challenge in general-purpose database system design [118].


Power (Watt) 60% 40% 20% 5 10 15 20 25 30 35 40 45 50 Control Period Figure 2.9: Comparison of static and RLS with other energy estimation models.

Redesigning DBMSs for high energy efficiency has been discussed in the database community. Poess et al. [84] examine the power saving opportunities from different hardware for data services. Lang et al. [62] report large energy savings from using DVFS techniques in CPUs. However, harvesting those opportunities in data processing while maintaining the desired performance is not a trivial task. The performance of the DBMS can be very sensitive to hardware power mode changes. As shown in our characterization study above, tuning one step (25%) in CPU frequency can result in 50% performance degradation for CPU-intensive queries, and using the low-power modes in memory is a bad idea due to 95% performance degradation for both I/O- and CPU-intensive queries. Therefore, we cannot directly apply hardware power management solutions for energy conservation in DBMSs.

It is also difficult to provide performance guarantees in a DBMS due to one of its salient features – the sensitivity of the DBMS performance to different hardware power modes depends on workload changes and environment dynamics. We need an adaptation architecture that can promptly monitor such a feature of the DBMS and determine whether an adaptation should be performed. Attempting to solve the problem, some studies employ simple, traditional hill-climbing strategies to make important adaptation decisions, such as in [62]. This is an ad hoc solution that tunes the CPU frequency up when performance is below the SLA and down when it is above. However, there remain questions about how much to tune and how long tuning is needed, which are the classic controller problems known in control theory as steady-state error and settling time. Though there are many state-of-the-art control works implemented at the OS level for similar problems, such as [114, 111, 79], they are not feasible here due to their lack of knowledge of data processing.

To remedy the aforementioned problems, we first need to understand the nature of the DBMS's response to changes of different hardware power modes (the inputs). Specifically, we need a quantitative system model that describes how to control the DBMS performance based on the inputs. Also, the framework needs to be implemented without affecting the DBMS process. Finally, the control algorithm shall be robust and respond fast, such that it can tolerate the errors from model estimation in the DBMS optimizer and from input changes.

We present Power-Aware Throughput control (PAT), an online feedback control framework for energy conservation at the DBMS level, to address the above challenges. Our solution takes advantage of proven techniques from the field of control theory, which specializes in dealing with systems that are subject to unpredictable dynamics [39]. In this solution, we cast the energy conservation problem in the DBMS as a feedback control problem and tackle it with a proportional-integral (PI) controller derived from a dynamic DBMS system model from our characterization study.

2.2.1 Power Performance Optimization for Database Operations

In our study, we focus on throughput as the main performance metric for three reasons. First, throughput is the reciprocal of the average response time and is an important performance metric. Keeping the DBMS throughput within a desired level is essential to avoid situations such as overloading. Second, throughput can be measured at any time, which enables PAT to respond accurately and quickly based on runtime feedback in seconds. The response time, in contrast, is a delayed signal that has a negative impact on the performance of the controller when the system changes during the delay, especially for OLAP executions, which can take hours to finish. Last but not least, controlling the response time of each individual query would require a performance sniffer on each of them, which undoubtedly violates the goal of a light-weight design. PAT is a control framework that can be adapted with any suitable controller.

Table 2.5: Notations and symbols.

Symbol      Definition                        Z-domain
dCPU        demanding CPU                     -
dI/O        demanding I/O                     -
uCPU        avrg. CPU usage                   -
uI/O        avrg. I/O usage                   -
R_j         the jth fuzzy rule                -
p_j         implication of the rule R_j       -
t_j         membership confidence             -
β           I/O util. threshold               -
M, N        membership coeff.                 -
T           control period                    -
i           period index                      -
Rs          set point                         -
λ           ratio of I/O-intensive queries    -
f           CPU frequency                     F(z)
r           DBMS throughput                   R(z)
A, B        system model coeff.               -
kI, kP      controller coeff.                 KI, KP
C(z)        controller transfer function      C(z)
G(z)        system transfer function          G(z)

Notations used throughout this section are listed in Table 2.5. In this table, the variables can be divided by their use: the system model, the fuzzy controller and the control model. We introduce them in detail when we discuss the models. Note that the column labeled Z-domain gives the frequency-domain representation of each variable after the Z-transform. The transformation is a necessary step in the control-theoretic analysis to find values of the control variables that ensure the stability and robustness of the controller. Readers who are interested in the Z-transform or standard control system design may refer to [49].

The impact of hardware power modes under extreme DBMS workloads. To further understand the impact of low-power modes in different hardware components on power savings and performance, we use the clock-gating technique in memory to build the five power states described in [35], the four discrete CPU DVFS levels described in [114] and one CPU C-state [74] labeled "DVFS0". DVFS100 is the system running at full CPU frequency/voltage capacity. To avoid possible bias from the DBMS workload and to reveal more insights, we repeat each experiment under different CPU-intensive and I/O-intensive workloads for several trials, demonstrated in the sub-figures of the first and second rows of Figure 2.10, respectively.

The figures in the left column of Figure 2.10 show the DBMS performance and active power data for different power states in memory. One state transition in memory, such as from the active state to the active-standby state, can contribute almost 20% of active power savings from memory. However, it comes with a severe performance penalty: 95% loss for CPU workloads and 98% loss for I/O workloads after the transition, not to mention the limited power savings and magnitude-larger performance loss in the active-powerdown state. Thus, though [35] claims high effectiveness for memory-level DVFS, it is not a feasible power saving technique for data processing. As a result, we find that any power management technique which increases the per-tuple I/O performance cost may have a huge negative impact on the throughput, which eventually increases the energy cost.

The figures in the right column of Figure 2.10 illustrate the results of different DVFS states in the CPU. The active power cost of the server decreases monotonically with the DVFS setting (CPU frequency) in both workload cases. On the other hand, the DBMS throughput loss increases linearly with the decreasing DVFS level, to a different degree per workload. This shows a salient feature of the DBMS performance model under the power control knobs. Nevertheless, the sensitivities of CPU-intensive and I/O-intensive workloads to the DVFS change are different. One can tune the DVFS setting further to gain more power savings without hurting the performance in a DBMS with purely I/O-intensive workloads.

Figure 2.10 also shows the system reaction under the CPU C-state (DVFS0) in terms of power and performance. When the CPU is set to the C-state, the whole system is in a halt state. We did not observe any DBMS throughput, though the active power consumption is low. On the other hand, the performance delay due to the transition from the CPU C-state to the active state is unacceptably large. Therefore, we do not implement the C-state in the real testbed but leave it as simulation work using our experimental data and the assumptions in [72] as the profile.


Figure 2.10: Studies on low-power states of memory and CPU. All data are normalized to the first histogram (active case in memory and DVFS100 in CPU). The performance histograms are quantified on the left y-axis and the power histograms on the right y-axis. Since the DBMS has no throughput under the standby/powerdown states in memory and the DVFS0 state in the CPU, there is no performance data for those power states.

These experiments show that the CPU P-state (the DVFS technique) is a good candidate for the control actuator. Next we extend the experiment to see how the DBMS and the active power react under different CPU frequencies and mixed workloads with in-between intensity (neither purely CPU-intensive nor purely I/O-intensive).

Active power, CPU power state and DBMS throughput. We extend the experiments of the previous study to reveal more insights on the tradeoff between throughput and power.

Figure 2.11a demonstrates the fact that the power consumption is linearly correlated with the relative DVFS level, as supported by [72].

The power and performance data in Figure 2.11b are recorded from experiments running under many different workload intensities (λ, the ratio of I/O-intensive queries in a workload, as in Table 2.5). An important observation from Figure 2.11b is that there exists a linear relationship between throughput and CPU frequency for all DBMS workloads with a fixed λ value.


Figure 2.11: The impacts of CPU frequency (i.e., DVFS level) on active power consump- tion (a) and DBMS throughput (b). The five different workloads in (b) contain different ratio of I/O-intensive queries. All data in (b) are normalized to the largest throughput with the workload contains 5% I/O intensive queries in the system with maximum CPU frequency.

For all data in Figure 2.11b, the goodness-of-fit of a linear model, R² = 1 − variance(linear throughput prediction) / variance(real throughput), is greater than 95%. Therefore, we can use the following linear model as a starting point of our system identification process:

r = Aλ f + B (2.7)

where r is the DBMS throughput, f is the CPU frequency, λ is the ratio of the I/O-intensive workload and A, B are the system model coefficients, as listed in Table 2.5.
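As an illustration, A and B can be identified by ordinary least squares over profiled (λ, f, r) samples; a minimal sketch follows (the function name and the sample values below are made up for demonstration):

import numpy as np

def identify_model(lam, f, r):
    """Fit r = A*lam*f + B from profiling runs (1-D arrays)."""
    X = np.column_stack([lam * f, np.ones_like(r)])  # regressors: [lam*f, 1]
    (A, B), *_ = np.linalg.lstsq(X, r, rcond=None)
    return A, B

# Hypothetical profiled samples (fractions, not percentages).
lam = np.array([0.05, 0.05, 0.20, 0.20, 0.40, 0.40])
f   = np.array([0.50, 1.00, 0.50, 1.00, 0.50, 1.00])
r   = np.array([0.30, 0.55, 0.35, 0.80, 0.45, 0.95])
A, B = identify_model(lam, f, r)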


Figure 2.12: The relationship between workload’s frequency-to-throughput sensitivity and the percentage of I/O-intensive queries in the workload (λ).

Among all the workload characteristics, we found that the ratio of I/O-intensive queries λ in the workload is the major factor that affects the system's sensitivity to the hardware power modes, as shown in Figure 2.11b. Our explanation is that Linux uses Round-Robin as the process scheduling algorithm. The more queries are bound by I/O, the larger the chance that those processes skip their CPU time slices, which keeps the CPU in the idle state. This (1) makes the system less sensitive to power mode changes and (2) makes the system prefer I/O-intensive queries for energy savings. We call this preference the sensitivity of the DBMS, measured as the slopes of the throughput-DVFS lines in Figure 2.11b.

Figure 2.12 shows that λ affects the database's sensitivity to the power mode changes as a linear function (with a goodness-of-fit R² = 95%). Since the value of λ is essential in our throughput control, it is necessary to update λ to identify the workload at runtime.

Note that when λ increases from 20% to 30% in Figure 2.11b, the throughput of the DBMS drops heavily at the same DVFS level. There is a value between 20% and 30%, called β in our system model, that defines the inflection point. When λ > β, the system enters an I/O busy-waiting state. This state is a "Limbo" that we try to avoid in our experiments since it causes a lot of false and noisy data. Fortunately, β is a relatively static number for any given system that can be found during the system identification process. Based on our hardware and system configuration, β = 32% (this number needs to be re-calibrated when PAT is applied to a different system environment).

Having found these interesting behaviors of data processing under the impact of hardware low-power modes, we proceed to build the control framework for energy conservation.

2.2.2 Power/Performance Controller Design for Database Operations

The overall control framework of PAT is illustrated in Figure 2.13. The main components within the framework form a feedback control loop (indicated by the red arrow in Figure 2.13), which includes the PI controller (Controller), the system throughput monitor (Plant) and the CPU power state modulator (Actuator). The runtime goal of this loop is to maintain the DBMS throughput at the set point Rs. Specifically, the following steps are invoked at the beginning of each control period:

1. The throughput monitor measures the system throughput r(i−1) in the last period. The control error is computed as ∆r(i) = Rs − r(i−1).

2. The controller receives the control error ∆r and the workload statistic factor λ. Based on their values, it computes the control signal ∆f(i).

3. The CPU power state modulator receives the control output ∆f, calculates the new DVFS level and applies it to the CPU.

4. Exception 1: when λ is larger than the safety threshold β, or the DVFS level is already the highest but the detected throughput still fails to meet the set point Rs, the CPU modulator sets the DVFS level to the lowest active state to save power.

5. Exception 2 (optional): when the DVFS level is set to the lowest active level but the throughput is still larger than the set point Rs, the CPU modulator puts the CPU in sleep mode for a short period of time t.

Note that the duration t of the CPU in the C-state shall be smaller than T, and we assume the transition time between states is at least one order of magnitude less than t. The two additional exceptions are rules generated to handle under/overload scenarios caused by either unpredicted incoming workload flows or the capacity limit of the DVFS technique. Exception 2 is not implemented physically in our testbed because the assumption of a low delay between the "sleep/wake up" modes of the CPU cannot be achieved in the current hardware environment. However, we believe this is an important option for energy conservation, so we leave an interface here for future updates. We do run a simulation of this "CPU napping" under the same assumptions made by [72] and combine the result with other observations. Note that though PAT depends on the workload statistics and DBMS throughput


Figure 2.13: The power-aware throughput control architecture. The names in parentheses are given using control terminology.

from the existing DBMS optimizer, it does not disrupt the normal operation of a given DBMS. All components are implemented as light-weight daemon programs.
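In daemon form, the five steps above reduce to a short periodic loop. The sketch below is ours: monitor, classifier and cpu stand in for the platform-specific daemons, and pi_controller is the PI law derived in the next subsection.

import time

T = 1.0                    # control period (seconds)
BETA = 0.32                # I/O-intensity safety threshold (system identification)
F_MIN, F_MAX = 0.40, 1.00  # relative DVFS bounds (40% floor for safety)

def pat_loop(monitor, classifier, cpu, pi_controller, r_set):
    f = F_MAX
    while True:
        time.sleep(T)
        r_prev = monitor.throughput()       # step 1: measure r(i-1)
        err = r_set - r_prev                # control error dr(i)
        lam = classifier.io_ratio()         # workload statistic lambda
        df = pi_controller(err, lam)        # step 2: control signal df(i)
        f = min(max(f + df, F_MIN), F_MAX)  # step 3: new DVFS level
        if lam > BETA or (f >= F_MAX and r_prev < r_set):
            f = F_MIN                       # exception 1: give up and save power
        cpu.set_dvfs(f)
        # Exception 2 (optional, simulation only): briefly enter the C-state
        # when even F_MIN leaves the throughput above the set point.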

System Modeling. Building an accurate mathematical model of the system is of great importance to the whole controller design. In PAT, we are interested in how to connect the throughput r and the run-time DVFS level f. In our study, we found a linear relationship between the throughput and the DVFS level. Let us denote the control period as T and the throughput within this period as r(i). Then, given the current relative throughput data r(i), our control goal is to guarantee that r(i) converges to the set point Rs after a finite number of control periods (the settling time). Note that, to better establish the model, we scale these two values into percentages; thus, r and f are the relative throughput and DVFS level, respectively. For example, f = 100% means that the CPU is running at its highest frequency. For safety, we set the minimum active CPU DVFS frequency to 40%.

Here we update the original system model in Equation (2.7) as

∆r(i) = λA∆f(i) (2.8)

where i stands for the ith period, and λ is the ratio of I/O-intensive queries in the current workload. (The constant B cancels in the difference form.) Equation (2.8) can be viewed in the z-transform as:

R(z) = λAF(z) (2.9)


Figure 2.14: A sinusoidal throughput set point and controlled result.

where R(z) and F(z) are the z-transforms of the signals r(i) and f(i), respectively. Thus, the system transfer function of the DBMS with regard to the frequency change in Figure 2.11 is:

G(z) = R(z)/F(z) = λA (2.10)

We also test the system with sinusoidal inputs in Figure 2.14. In this set of experiments, the throughput changes sinusoidally within the range [0, 100]. Small, periodic modeling errors can be seen, which means there are probably unknown dynamics that our model fails to capture. This is not surprising given the errors from the resource estimator and the workload classifier. As we shall see later, the feedback control used in PAT has the power to reduce the effects of modeling errors, especially those that impose small errors such as the ones observed here.

Controller Design. The goal of the controller design is to meet the following criteria:

• Stability: the throughput shall settle into a bounded range in response to a bounded reference input.

• Zero steady-state error: when the system enters the steady state (every property of the system is unchanging in time), the throughput shall settle to the set point, which is the performance budget in our case.

• Short settling time: the system shall settle to the set point by the specified deadline.

Based on control theory, we design a proportional-integral (PI) controller, which has been widely adopted in industrial control systems. We select the PI controller for its properties of zero steady-state error and fast response [39, 49]. Thus, PI control can provide robust control performance despite modeling errors and input/output disturbances. The PI controller has the following form in the discrete time domain:

∆f(i) = kP · ∆r(i) + kI · Σ_{j=1}^{i} ∆r(j) (2.11)

where ∆r(i) = Rs − r(i) is the control error at the ith period, ∆f(i) is the frequency offset used to drive the throughput to the set point Rs, and kI and kP are control parameters. These parameters can be chosen analytically to guarantee system stability and zero steady-state error. From Equation (2.11), we have the controller transfer function in the z-transform as:

C(z) = ((kI + kP)z − kP) / (z − 1) (2.12)

The throughput controller is implemented to adjust the CPU frequency based on the throughput. From the system identification, the parameters in Equation (2.7) are A = 4.329 and B = 24.329; λ is provided at runtime by the workload classifier (FWC). Overall, the standard closed-loop transfer function F(z) is

F(z) = G(z)C(z) / (1 + G(z)C(z))
     = (λAkP(z − 1) + λAkIz) / ((1 + λA(kI + kP))z − (λAkP + 1)) (2.13)

We tune the controller with the Root-Locus method [39, 49] to guarantee stability and zero steady-state error. The poles of our closed-loop system are −0.26 ± 0.8i. As both poles are inside the unit circle, the closed-loop system in our experiments is stable [39]. The final controller is Equation (2.11) with control parameters kI = −0.5 and kP = 1.06. Readers interested in the details of the control analysis can find them in [50].
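A direct transcription of Equation (2.11) with these parameters, usable as the pi_controller hook in the loop sketched earlier (λ is passed through for optional gain scheduling against the plant gain λA, although the printed control law does not use it):

def make_pi_controller(kp=1.06, ki=-0.5):
    """PI law of Equation (2.11): df(i) = kp*dr(i) + ki * sum_{j<=i} dr(j)."""
    integral = 0.0
    def pi_controller(err, lam=None):
        nonlocal integral
        integral += err                  # running sum of control errors
        return kp * err + ki * integral
    return pi_controller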

2.2.3 Experimental Results

To evaluate the performance of our control framework, we build a database server with PostgreSQL (version 8.3.18) on Redhat 5 (kernel version 3.0.0). The data server is a DELL PowerEdge R710 equipped with a Xeon E5645 CPU (12 cores with a 1 – 3GHz frequency range). The client creates database workloads from a pool of 2,000 query samples from the TPC tools [106] and SDSS traces [96]. The power consumption of the server is measured by a WattsUpPro power meter (with ±1.5% error) at a fixed sampling frequency of 1 Hz [11].

We have designed several baselines for the performance comparison and evaluation of PAT. (1) Normal and Tradition. Normal sets the system to the highest DVFS level, and Tradition sets the DVFS level based on the result of an offline analysis of the set point and system model. (2) SpeedStep, Heuristic and SCTRL. SpeedStep is a power management policy based on system utilization at the operating system level (the default power management option in the BIOS settings). Heuristic is an ad hoc control solution built on SpeedStep with performance constraints. SCTRL is a system-level feedback control technique. Compared with PAT, it contains all the control components in the framework except the DBMS-specific ones (such as the workload classifier), and its system model is based on system-level identification.

To study the impact of the PAT framework on performance and energy savings, we have designed three scenarios to simulate normal DBMS operations. (1) Ideal environment. In the ideal environment, the database processes are the only users of all the computational resources. (2) Competing environment. In the competing environment, a set of purely CPU-intensive programs (a Fibonacci computation program) is introduced to the system to compete with the database processes for CPU resources. (3) Preemptive environment. In the preemptive environment, a set of high-priority (OS-level) processes frequently interrupts the database processes for CPU resources.

Here we provide a snapshot of 50 control periods in these three environments in Figure 2.15. For better illustration, the control setpoint is 40% relative throughput of the system capacity. The rows of Figure 2.15 show, from top to bottom, the runtime throughput, the DVFS level and the related runtime CPU power. All the throughput data are normalized to the maximum capacity of the system and all the power data are normalized to the Normal baseline.


Figure 2.15: A snapshot of normalized database throughput and active power consumption in 50 control periods in three different system settings.


In Figure 2.15a, though the performance curves of Normal and SpeedStep are similar to each other, SpeedStep shows its advantage in power savings. Without violating the SLA (40%), PAT controls the throughput strictly to the setpoint, which results in much larger power savings than SpeedStep.

In the competing scenario, the throughput of Normal, SpeedStep and PAT, compared with the ideal environment, is greatly affected by the injected non-DBMS processes. The predictions of available CPU resources in hardware and of resource demands from the DBMS optimizer can be unreliable. As shown in Figure 2.15b, the power consumption of SpeedStep in this scenario is much higher than in the ideal case because it controls the power according to the utilization of the whole operating system, not the DBMS throughput.


Figure 2.16: The total energy saving of Normal, SpeedStep and PAT. All the data are normalized to the data of Normal.

However, PAT can distinguish the noise from non-DBMS processes and control the throughput to the setpoint while tolerating the errors from prediction and modeling.

Figure 2.15c demonstrates the results in the preemptive scenario. The behavior of preemptive system processes leads to lower DBMS throughput since resources are occupied by high-priority processes. This scenario triggers the aforementioned control exception 1. When PAT finds that it cannot tune up the performance even with the maximum DVFS level, i.e., it has reached the limit of the DVFS technique, it tunes down the DVFS level to save energy. This tuning happens at the 6th, 13th, 18th, etc. periods in the power consumption results of Figure 2.15c.

Overall, we conduct a performance and energy saving evaluation of the above experimental data, as shown in Figure 2.16. PAT obtains up to 51% energy savings (15% more than SpeedStep) compared with the energy cost of Normal in the ideal scenario. The energy savings decrease in the competing environment because of the resource competition, and increase in the preemptive environment because control exception 1 in PAT sets DVFS to the lowest level.

In all, PAT achieves much larger energy savings with little performance penalty compared with Normal and SpeedStep. Readers may regard this as an unfair comparison, since the 40% setpoint is only known to PAT while the other two baselines preserve maximum performance. We will discuss the differences between PAT and the traditional control and manual tuning techniques, and outline the advantages of PAT, in the next section.

Figure 2.17 is a snapshot of the runtime throughput, DVFS level and power consumption of the five controllers. Tradition controls the throughput to the set point to the maximum degree most of the time. However, there are several problems with this control technique. First, finding a good static DVFS level for one workload in one system scenario requires extensive experimental work and complex learning processes. Second, it can easily fail when the system dynamics change (such as in the competing scenario) or the workload changes.

Heuristic gives relatively better throughput control to the set point compared with SpeedStep. However, when facing an ever-changing workload, the Heuristic technique fails to commit to a steady state in an acceptable time. For example, the data in control periods 20 to 30 in Figure 2.17a show how Heuristic handles the "M"-shaped workload pattern.


Power (Watt) 60% 40% 20% 5 10 15 20 25 30 35 40 45 50 Control Period Figure 2.17: A snapshot of relative throughput information and DVFS level in 50 control periods.


Figure 2.18: The energy saving of the five control techniques. Data are normalized to the PAT case.

While SCTRL and PAT can both commit to the setpoint within 4 periods, the tuning of Heuristic oscillates over many more steps, which results in less energy savings and lower profit. Solving this problem of Heuristic would eventually lead to the same feedback controller design as PAT.

The SCTRL technique treats the DBMS processing as a black box. It settles to the setpoint faster than Heuristic. However, since the DVFS technique (the actuator of the controller) has its own limitations when all DBMS processes are in I/O busy waiting (exception 1 in the PAT control procedure), SCTRL will uselessly set the DVFS level to the maximum, which unnecessarily wastes energy. Thus, to control the throughput of the DBMS in an energy-efficient manner, it is essential to involve DBMS runtime information in the control modeling.

We conduct an overall performance evaluation of the five control techniques in Figure 2.18. While PAT obtains 20% more energy savings than the theoretically optimal static case in the Tradition experiment, shown in the first histogram in Figure 2.18, Heuristic and SCTRL only obtain 56% and 74% of the energy savings achieved by PAT, because of the failure to commit to a steady state (Heuristic) and uselessly setting the DVFS level to the highest (SCTRL).

2.3 Discussion

In this chapter, we have demonstrated several well-designed frameworks for single-node energy modeling and management. Specifically, we have shown the necessity of such frameworks and some of their key features:

• Building energy-aware data management systems requires specific system identification processes to explore the essential components of possible energy models.

• A static model is only effective for comparing and selecting a better execution plan, but fails to provide sufficient management information for the execution. An adaptive scheme is a must to improve the online estimation performance so that it sustains large errors from system dynamics and workload variation.

• The low-power modes of hardware provide opportunities for power savings in data processing with predictable performance degradation. Maximizing energy savings under a user-specified performance bound in database systems is plausible when there is a fixed performance agreement.

• However, providing service-level agreement guarantees is non-trivial due to the dynamics in database workloads and environments. The proposed control scheme is sufficient to solve this problem, unlike ad hoc or other heuristic methods, as the control framework design provides performance guarantees and system stability.

Chapter 3: Two Applications for Optimizing Data Services on Multiple Nodes

In this chapter, we discuss two well-designed data replication management systems based on the previous single-node modeling method. Beyond energy consumption, the two proposed systems focus on the carbon footprint of consuming energy, and the expense of serving data in a cloud with geo-diverse data centers (i.e., sites). To minimize the target objectives, such as the carbon footprint or the expense, we formulate each problem as a decision problem of data replication. In this way, we transform the classic time-varying load balancing problem into a dynamic-programming knapsack problem. Proving that solving such a problem is NP-Complete, our proposed frameworks provide solutions and algorithms that make decisions with approximate performance. We provide detailed proofs of the performance bounds of our methods and evaluate the prototypes on real-world systems. We discuss the details of replication for reducing carbon footprints in Section 3.1, and data replication in mixed cloud instance management for reducing expense in Section 3.2. Finally, we conclude with the insights learned from the two applications in Section 3.3.
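To make the formulation concrete, the offline core of such a decision problem is a multiple-choice knapsack: each data object contributes one option per candidate replication factor, with a storage cost and an objective saving, under a global capacity budget. The toy dynamic-programming sketch below illustrates the structure only; it is not the online, distributed algorithm the two systems actually use, and all names and inputs are illustrative.

def choose_replication(objects, capacity):
    """objects: per object, a list of (storage_cost, saving) options,
    one option per candidate replication factor (including the default).
    Returns the best total saving within the storage budget."""
    NEG = float("-inf")
    best = [0.0] + [NEG] * capacity       # best[c]: max saving at total cost c
    for options in objects:
        nxt = [NEG] * (capacity + 1)
        for c, value in enumerate(best):
            if value == NEG:
                continue
            for cost, saving in options:  # exactly one option per object
                if c + cost <= capacity:
                    nxt[c + cost] = max(nxt[c + cost], value + saving)
        best = nxt
    return max(best)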

3.1 CADRE: Carbon-Aware Data Replication for Geo-Diverse Services

Over the next 5 years, the number of Internet users will grow by 60% and Internet services will host 3X more data [110]. As demand increases, carbon emissions from those services will grow. Already, the global IT sector emits more carbon dioxide than most countries [110]. Carbon emissions worldwide must decrease by 72% to prevent climate change [89]. The enormity of this challenge has prompted many IT companies to voluntarily reduce their emissions. If climate change continues unabated, all IT companies could be forced to reduce emissions, whether by governments imposing carbon taxes or by users boycotting services with high emissions [13]. Carbon-aware approaches to manage data are needed.

Data replication is a widely used approach to manage data for Internet services. Replicating data across geo-diverse sites improves availability and reduces response times. Collectively, these sites are powered by many regional power authorities. Each authority manages the mix of resources used to produce energy and thus controls the local carbon emission rate (i.e., equivalent CO2 released per joule). The energy mix varies from site to site because some regions may have a higher renewable energy supply. The energy mix also varies over time because renewable resources produce energy intermittently [65]. As a result, the carbon emission rate of such an energy mix fluctuates with time and location.

Prior work [69, 99] dynamically dispatches queries to the sites with the lowest emission rates. This approach reduces a service's carbon footprint (i.e., its total carbon emissions), provided 1) the requested data is replicated to a low-emission site and 2) the query can be processed at any site [65, 69]. Figure 3.1(a) plots carbon footprints for a read-only query and a write query accessing the same data. The read query can be processed at any site and is dispatched to the site with the lowest emissions; replication decreases its footprint. The write query updates all sites that host the data; replication increases its footprint. Finding a data replication policy (e.g., how many replicas and where to replicate) with low carbon footprints is challenging because:

1. The read-write mix, the replication factor (i.e., number of replicas) and per-site emissions affect footprints. Replicating to too many or too few sites increases footprints.

2. Emission rates change over time. Replication policies based on outdated snapshots increase carbon footprints.

3. Data replication reduces effective storage capacity. Many services cannot afford to replicate data to every site.

We present CADRE, a carbon-aware data replication approach for geo-diverse services. CADRE chooses the replication factor and replication sites for each data object. It supports time-varying workloads and emission rates and retains many features of consistent hashing, a widely used approach [54]. CADRE also provides high availability, respects system capacity limits and balances data across all sites.

CADRE builds upon prior work to forecast data energy needs and per-site emissions [123].

We validate these approaches with traces from a Google cluster [7], the Yahoo! distributed file system, Yahoo! Answers [121], and the World Cup [1]. CADRE combines these forecasts

Figure 3.1: (a) Querying data replication over Google's geo-diverse sites. (b) Consistent hashing migrates data too often under time-varying heterogeneity.

to generate footprint-replication curves, i.e., a model of each data object's carbon footprint as a function of its replication factor. The footprint-replication curve exposes the best replication policy for each object. However, greedy per-object replication algorithms may violate global constraints on storage capacity or availability. CADRE uses an online, distributed algorithm to decide the replication policy based on the footprint-replication curve.

Based on the multiple-choice secretary algorithm [56], our algorithm tries to replicate the objects that benefit the most from carbon-aware replication. We prototype CADRE to support frontend PostgreSQL applications [87] and evaluate it with emission data from the known locations of Google's data centers. Tests are conducted on physical machines (with direct energy measurements). We also use simulations for scale-out tests. CADRE provides a tradeoff between carbon emission savings and data availability; however, it does not necessarily lead to bad performance (i.e., latency). We empirically measure the carbon footprint and the per-query latency of using CADRE. Compared to consistent hashing, CADRE reduces carbon footprints by 70% while increasing the average latency by less than 5%. The performance difference between CADRE and the offline optimal algorithm is within 10%.

CADRE reduces carbon footprints when sites have spare storage capacity. Many services are unable to use all their capacity for many reasons, such as mixed workloads. Our empirical results show that with 30% spare capacity, CADRE produces 33% lower carbon footprints than consistent hashing. When there is no spare capacity, CADRE performs as well as consistent hashing. CADRE also interacts well with latency- and cost-aware query dispatchers [88, 91], reducing footprints by 73% and 64%, respectively.

3.1.1 Redistributing Data Replications for Reducing Carbon Footprints

For a given data object, adding more replicas may help us seize opportunities to read from the data center with the lowest cost; however, it also increases the write cost, since more replicas need to be updated. Thus, the characteristics of data replication for reducing cost create an interesting dilemma: if we replicate a data object in more data centers, a read request is more likely to seize cost reduction opportunities by being routed to the data center with the lowest cost rate at any given time; however, if there exist many replicas in multiple data centers, the cost of a write request increases with the number of replicas.

In traditional cost-aware workload scheduling and load balancing work [69, 68, 42], the energy consumption, the carbon footprint or the overall cost is a summation over the holding times of the computational resources considered for workload processing, including active machines, the network, and the waiting queue (accounting for queuing delay). However, depending on the data-dependent workload characteristics, all workloads that contain update operations introduce additional maintenance operations, e.g., keeping consistency. In this case, workloads can be divided into read requests and write requests. For a read request, a green-aware request dispatcher routes it to the data center with the lowest estimated carbon footprint and/or electricity cost, if the requested data objects have replicas in that location. For a write request, instead of updating the content of the requested data object in one location, all related replicas need to be updated to keep strong consistency among data objects.

According to our analysis, each data object has an optimal number of replicas; moreover, depending on changes in read/write frequency and variations of the unit price, different data objects may have different optimal numbers of replicas.

To validate the existence of such optimal points, we explore the optimal number of replicas for the data tables lineitem and order from the industry database benchmark TPC-H [104] against a real trace of renewable energy production. The results are shown in Figures 3.2(a) and 3.2(b).


Figure 3.2: (a) Carbon emission of two different data objects (lineitem and order are data tables from TPC-H) with different numbers of replicas. (b) Carbon emission of the same data object spawned at different times. t0 = 12AM, 05/09/2011, ∆T = 6 hours. The vertical lines indicate the number of replicas with the lowest carbon emissions.

In Figure 3.2(a), the carbon emission of the workloads on both data objects decreases at first, until reaching the minimum point, where the carbon reduction from read requests equals the additional carbon footprint from keeping the consistency of write requests. After this point, the consistency cost dominates the total carbon footprint and the curve is monotonically increasing. Clearly, such optimal points of different data objects can be exploited to reduce carbon footprints.

However, finding such an optimal point online for each data object is difficult. First, each data object has its own optimal point based on the requests associated with that object. As shown in Figure 3.2(a), the data tables lineitem and order have different optimal points even with the same workload. Second, the point varies with time-varying signals, such as the carbon emission rate and the electricity price of each data center. Figure 3.2(b) demonstrates this variation by considering two carbon emission curves at different times. The data replication replicates the same data object using a curve from trace data of 12AM, 05/09/2011 (specifically, the carbon emission traces of four different locations) and another curve from 6 hours later. The carbon footprint is different for the same data object and the optimal point varies. Thus, an offline analysis may not be sufficient to solve a problem with such online, time-varying parameters.

3.1.2 CADRE Design: Data Replication for Carbon Reduction in Geo-Diverse Data Services

Consistent hashing controls which sites are allowed to host an object. Given a maximum replication factor, only sites that would be selected by consistent hashing are allowed. This design provides a default write order for each object. Specifically, the order chosen by consistent hashing is also the order in which sites process data updates. The order is preserved when carbon-aware policies choose a subset of sites for replication. The first site that hosts an object is called the head site. The last is called the tail site.

Create and write queries follow the chain replication protocol [102]. For each object, writes traverse the geo-diverse sites in the same order. The head site updates first and sends the write to the next site. The write is complete when the tail site updates. Like CRAQ, each site keeps the most recent version of each object [102]. Recall that virtual sites randomly permute the order of physical sites across partitions. This mechanism increases throughput for durable writes that return after completion at the first few, randomly ordered sites [17].
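The write path can be summarized in a few lines: apply at the head, forward along the fixed order, acknowledge from the tail. The sketch below is a simplification of the protocol described above; the Site class and in-memory store stand in for real geo-diverse replicas and networking.

class Site:
    def __init__(self, name):
        self.name = name
        self.store = {}                 # object id -> (version, value)
    def apply_write(self, oid, version, value):
        self.store[oid] = (version, value)

def chain_write(sites, oid, version, value):
    """sites is the write order fixed by consistent hashing (head first)."""
    for site in sites:
        site.apply_write(oid, version, value)  # forward down the chain
    return sites[-1].name                      # write completes at the tail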

CADRE manages two editions of each object. The first edition means that the object has been replicated according to its staging policy (i.e., the default replication policy from consistent hashing) and its workload is being profiled.

64 Figure 3.3: Data paths for queries in CADRE. Boxes represent the system software that runs at all sites. Shaded boxes reflect CADRE components with novel design. CADRE assumes create queries precede read and write queries.

The second edition begins after the staging phase completes, i.e., when the object has been replicated according to its carbon-aware policy.

Symbol(s)       Meaning
Ω, Ω_j          Set of all sites and the subset allowed for object j
O               Set of objects created
Q               Set of query dispatchers
e_r, e_w        Average energy per read and per write
r_j, w_j        Read/write queries for object j
z_{i,q}         Energy to message site i from dispatcher q
m_i^{(t)}       Emission rate at site i at time t
C_j(D_j)        Total carbon footprints for object j
K, k_j          Default replication factor and assigned factor for object j
D_j^*(k)        Sites that host k replicas for object j with min. footprints
C_j^*(k)        Minimum carbon footprints for object j with k replicas
k_j^o           Best replication factor for object j
k^o             Best data replication policy
C(k)            Minimum footprint of data replication policy k

Table 3.1: Symbols and notations. Bold indicates vectors.

For correctness, CADRE must ensure that writes and reads are properly directed to the right edition. When the staged edition is active, an object is replicated to the first K sites provided by consistent hashing, where K is the default replication factor. CADRE deactivates the staged edition before it activates the carbon-aware edition to ensure that outstanding writes are not lost. Specifically, it overwrites the staged edition with a forwarding message. New write queries are queued until the carbon-aware edition is activated. Read queries either return stale staged-edition data or they are queued.

In addition to each object's data, CADRE also stores the replication policy for each object. Object lookup retrieves this policy. On a lookup, the dispatcher first checks its cache for the replication policy. A cache miss causes the dispatcher to look up the object using the hash function of consistent hashing, i.e., it indicates that the staged edition is active. When the dispatcher receives a signal that the staged edition has been deactivated, it fetches and caches the carbon-aware policy.

CADRE allows virtual sites to leave and return. Power cycles and network issues affecting individual servers cause outages. Virtual sites recover lost objects when they receive a read or write query. On a read, the site contacts nearby virtual sites to restore lost data. On a write, the virtual site fetches the queries after the last confirmed write from its predecessor in the write ordering. CADRE assumes physical sites are stable and virtual sites are live over long periods.

Support for Eventual and FIFO consistency: Query dispatchers route queries between sites. To lower carbon footprints, for example, dispatchers should route queries to sites with low emissions. Chain replication allows routing of read-only queries; writes must update all replication sites. However, as discussed in prior work, routing policies affect consistency [102, 17]. A query dispatcher that randomly routes reads to any site can assure clients that they will eventually (after many reads) retrieve the current version. Clients should compare the version numbers of their recent reads to make sure a recent read has not returned an older value. As a result, this achieves eventual consistency. A query dispatcher that routes queries to the same (possibly randomly selected) site assures clients that they will only read newer updates. Each client's writes are seen in order, i.e., FIFO consistency. CADRE supports either of the two access models. Although not required, we assume that each replication site hosts only one replica in the remainder of this section.

By default, FIFO consistency forces dispatchers to route queries for an object to one site. However, dispatchers may need to change their routing policies when emission rates change. Fortunately, the policies change infrequently relative to read access rates, e.g., at hourly versus sub-second time scales. To ensure FIFO consistency, a dispatcher caches and returns the current versions of objects hosted at the old site. Periodically, it checks the version at the tail site. When the versions match, the dispatcher routes subsequent queries to any site. CADRE implements this method to maintain FIFO consistency while dispatching queries to low-emission sites.
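From the dispatcher's side, that check is a small state machine. The sketch below captures it under a hypothetical site interface of version(oid), get(oid) and emission_rate(); the real dispatcher caches the values themselves, as stated above.

class FifoDispatcher:
    """Keep FIFO reads for one object while its routing policy changes."""
    def __init__(self, old_site, tail_site, sites):
        self.old_site = old_site    # site all reads went to before the change
        self.tail_site = tail_site  # last site in the write chain
        self.sites = sites          # replica sites under the new policy
        self.migrating = True
    def read(self, oid):
        if self.migrating:
            # Serve from the old site until the tail has caught up, so a
            # client never sees versions move backwards.
            if self.tail_site.version(oid) >= self.old_site.version(oid):
                self.migrating = False
            else:
                return self.old_site.get(oid)
        # Safe to chase low emissions: route to the cleanest replica.
        return min(self.sites, key=lambda s: s.emission_rate()).get(oid)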

Modeling. The carbon footprint of an object is the product of two variables: the energy used to access the object and the emission rate during those accesses. CADRE uses an analytic model to describe how the energy coefficient changes across workloads and replication policies within one time frame [T1, T2]. The model is shown in Equations (3.1), (3.2), and (3.3).

C_{j,r}(D_j) ≜ Σ_{t=T1..T2} Σ_{q∈Q} r_{j,q}^{(t)} · min_{i∈D_j} ((e_r + z_{i,q}) · m_i^{(t)}) (3.1)

C_{j,w}(D_j) ≜ e_w · Σ_{t=T1..T2} w_j^{(t)} · Σ_{i∈D_j} m_i^{(t)} (3.2)

C_j(D_j) ≜ C_{j,r}(D_j) + C_{j,w}(D_j) (3.3)

Note that $C_{j,r}(D_j)$ and $C_{j,w}(D_j)$ are the total carbon footprints of reads and writes for object j, respectively, and $z_{i,q}$ accounts for the lowest energy consumption of routing the read. All notations are defined in Table 3.1. The models use average emission and access rates over discrete intervals. The energy used at a single site for read queries differs from that for write queries. Write queries often access more resources than reads, e.g., hard disks; however, read-only queries often involve complex joins and scans [119]. For read-only queries, we also model the communication cost between dispatchers and host sites. However, prior work has shown that the energy footprints of communication are second-order effects compared to processing footprints [58].
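To make the model concrete, here is a minimal Python sketch of Equations (3.1)-(3.3) under a simplifying assumption: each read is routed to the replica with the lowest current emission rate, while writes touch every replica. The argument names mirror the notation in Table 3.1, but the data structures are illustrative.

    def carbon_footprint(D_j, reads, writes, emissions, e_r, e_w, z):
        """Carbon footprint of object j replicated to sites D_j.

        reads[t][q]     : read rate of query class q in interval t   (r_jq^(t))
        writes[t]       : write rate in interval t                   (w_j^(t))
        emissions[t][i] : emission rate of site i in interval t      (m_i^(t))
        z[i][q]         : routing energy of class q to site i        (z_iq)
        """
        C_r = 0.0
        for t in range(len(reads)):
            # Route reads to the replica with the lowest emission rate.
            best = min(D_j, key=lambda i: emissions[t][i])
            for q, rate in reads[t].items():
                C_r += (e_r + z[best][q]) * rate * emissions[t][best]
        # Writes touch every replica, so emissions sum over D_j.    (3.2)
        C_w = sum(e_w * writes[t] * sum(emissions[t][i] for i in D_j)
                  for t in range(len(writes)))
        return C_r + C_w                                            # (3.3)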

To capture energy needs at a single site, we run benchmark workloads on a small cluster. The benchmarks execute representative R/W queries in succession, achieving high utilization. We measure the energy consumption and use it to estimate R/W energy needs.

We use these R/W energy estimates to model the total energy needs of TPC-H workloads in Postgres.

In Figure 3.4, we compare our approach to 1) Per-request, a per-request cost model that does not distinguish reads from writes [69], and 2) Query-type, a more advanced approach that further distinguishes read types, e.g., join, select and scan [119]. Results show that our approach provides a good tradeoff between simplicity and accuracy, achieving a 35% reduction in average error compared to the first approach while yielding only 10% more error variation than the second approach.

Footprint-replication curves: Replication helps dispatchers reduce footprints on read accesses, but it increases the footprints of write accesses. Using Equation (3.3), we explore an object's total carbon footprint under all possible replication factors. The resulting graph, where each point on the y-axis is the smallest carbon footprint achievable under the corresponding replication factor on the x-axis, is a footprint-replication curve $R_j$. More precisely, $R_{j,k} = C_j^*(k)$.


Figure 3.4: CDF of modeling performance across five sites. *: best performance of the per-request model on one site.

$D_j^*(k) \triangleq \operatorname{argmin}_{D \subseteq \Omega_j,\, |D| = k} C_j(D)$  (3.4)

$D_j^*(k-1) \subset D_j^*(k)$  (3.5)

$k_j^o = \operatorname{argmin}_{k = 1, \ldots, |\Omega|} C_j\left(D_j^*(k)\right)$  (3.6)

$C_j^*(k) \triangleq C_j\left(D_j^*(k)\right)$  (3.7)

Equations (3.4)-(3.7) present the optimization model for the footprint-replication curve. Equation (3.5) ensures that if the replication factor decreases, e.g., when a virtual site fails, the remaining virtual sites are unaffected. It requires that the best replication policies build upon each other. Without this constraint, a decrease in the replication factor could force CADRE to choose a combination of sites that leads to a higher carbon footprint.

Consistent hashing has this feature, called smoothness [54]. Given the inputs to our model, we construct the best replication policy with a factor of k by incrementally adding a site to the best k − 1 policy.

Figure 3.5: The convex footprint-replication curves.

Footprint-replication curves can be divided into two parts. In the first part, carbon footprints decrease monotonically because replicating to more sites provides more cost-saving potential for reads. In the second part, carbon footprints increase because routing provides little additional savings and writes dominate. The boundary between the two parts is the best replication factor, because the footprint-replication curve is convex [116].
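The smoothness property suggests a simple way to trace the whole curve: grow the best policy one site at a time and record the footprint at each factor. The sketch below assumes a black-box cost function standing in for $C_j(D)$ from Equation (3.3); it is illustrative, not CADRE's implementation.

    def footprint_curve(sites, cost):
        """Trace C_j^*(k) for k = 1..|sites|; `cost` stands in for C_j(D)."""
        policy, curve, remaining = [], [], set(sites)
        while remaining:
            # Smoothness: the best k-policy extends the best (k-1)-policy.
            best = min(remaining, key=lambda s: cost(policy + [s]))
            policy.append(best)
            remaining.remove(best)
            curve.append(cost(policy))                 # C_j^*(k), Eq. (3.7)
        k_opt = min(range(len(curve)), key=curve.__getitem__) + 1  # Eq. (3.6)
        return curve, k_opt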

Figure 3.5 gives an example of two footprint-replication curves (for the lineitem and order data tables in TPC-H workloads [105]). The order table is a frequently accessed small table (100 QPS, 25K rows) and the lineitem table is a less popular, larger table (1 QPS, 1M rows). Figure 3.5 shows that the curves decrease as more replicas are added for read flexibility. However, each curve reaches a minimum point, where the savings from reads are offset by the penalties of keeping writes consistent. After that point, write costs dominate the total carbon footprint and the curve increases.

Footprint-replication curves provide the best carbon-aware policy for each object. However, applying the best policy to every object may violate global system constraints. CADRE allows system managers to set the following constraints:

1. Storage Capacity: We target fast but costly in-memory stores that are widely used to provide low response times and high-quality results. As shown in Equation (3.8), managers can set the provisioning factor f to force partial replication because full replication is often too costly.

$\sum_{j=1}^{|O|} k_j \leq |O|\,|\Omega|\,f$  (3.8)

2. Availability: Replication to geo-diverse sites ensures that objects are durable and available during earthquakes, fires and other regional outages. Equation (3.9) ensures that every object is replicated to at least K sites but no more than $|\Omega_j|$.

$K \leq k_j \leq |\Omega_j|, \quad \forall j \leq |O|$  (3.9)

3. Load Balancing: Consistent hashing uses virtual sites to handle heterogeneity. CADRE overrides these settings to account for time-varying emission rates. Managers can devalue emission rates by setting per-site weights, as shown in Equation (3.10). The weight vector can be changed on the fly.

$m_i^{(t)} \triangleq m_i^{*(t)} w_i$  (3.10)

CADRE finds good replication policies that 1) respect these constraints and 2) reduce carbon footprints.

Greedy online algorithm: The greedy algorithm proceeds as follows. When a data object is created, we compute its optimal replication factor using its footprint-replication curve. At the start, we always replicate the object to its best replication factor unless the availability constraint would be violated. However, if the best replication factor would cause the spare capacity to fall below the minimum capacity required for the remaining objects, we only replicate the object to K sites (i.e., the minimum number of replicas for availability). At the end of this greedy algorithm, all objects are replicated under either their optimal policy or the default policy.
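A minimal sketch of this admission rule follows, assuming each arriving object carries a precomputed optimal factor and that the number of objects still expected is known; the field names are hypothetical.

    def greedy_place(objects, K, capacity):
        """objects arrive in order, each with a precomputed optimal factor
        obj["k_o"] from its footprint-replication curve."""
        placements, used = {}, 0
        for n, obj in enumerate(objects):
            remaining = len(objects) - n - 1
            k_o = max(obj["k_o"], K)       # never below the availability floor
            # Use the optimal factor only if the leftover capacity still
            # covers the K-site minimum for every object yet to arrive.
            if capacity - used - k_o >= K * remaining:
                placements[obj["id"]], used = k_o, used + k_o
            else:
                placements[obj["id"]], used = K, used + K
        return placements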

If spare capacity always exists for data replication, the greedy algorithm is already optimal [116]. However, in the worst case, all late-arriving objects with heavy workloads could be assigned the default replication policy. We propose Fill-and-Swap, an offline algorithm that finds an optimal solution.

Fill-and-Swap algorithm: Given the result from the greedy algorithm, we use gradient search to find a local optimum. First, we Fill the unused storage capacity by increasing the replication factors of objects that are not replicated to their $k_j^o$ sites. Specifically, we increase the replication factor of one object at a time, choosing the object that reduces the global carbon footprint the most. Once the storage capacity is full, we further reduce footprints by Swap. In the Swap, we find the object i that reduces the carbon footprint the most if its replication factor is increased by one, and the object j that increases the carbon footprint the least if its replication factor is decreased by one. If the Swap reduces the global carbon footprint, we perform it and keep looking for such object pairs (i, j); otherwise, the algorithm terminates.
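The following minimal sketch implements Fill-and-Swap over precomputed footprint-replication curves, exploiting convexity so that marginal gains and losses are well behaved; the dictionary fields are illustrative stand-ins for CADRE's structures.

    def fill_and_swap(objects, spare):
        """objects: dicts with keys 'k' (current factor) and 'curve', where
        curve[k] is the footprint C*_o(k) at factor k (index 0 unused)."""
        def gain(o):   # footprint saved by adding one replica
            k = o["k"]
            return (o["curve"][k] - o["curve"][k + 1]
                    if k + 1 < len(o["curve"]) else float("-inf"))

        def loss(o):   # footprint added by removing one replica
            k = o["k"]
            return o["curve"][k - 1] - o["curve"][k] if k > 1 else float("inf")

        # Fill: spend leftover replica slots on the largest marginal savings.
        while spare > 0 and max(map(gain, objects)) > 0:
            max(objects, key=gain)["k"] += 1
            spare -= 1

        # Swap: move a slot from the cheapest loser to the biggest gainer.
        while True:
            i, j = max(objects, key=gain), min(objects, key=loss)
            if i is j or gain(i) <= loss(j):
                return objects
            i["k"] += 1
            j["k"] -= 1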

This gradient search approach yields the optimal solution because footprint-replication curves are convex. However, it only works offline, by scouring the footprint-replication curves of all objects. Note that replication factors where $K > k_j^o$ indicate that the minimum number of copies for high availability is already above the optimum; K is already the best setting for these objects, so they are safely ignored in this discussion.

Fill-and-Swap resolves two problems with the greedy approach. First, spare capacity is exhausted too quickly, so late-arriving objects are forced to accept the default policy. Second, some objects can have small total benefits but relatively large gradients at small replication factors.

Multiple-choice Secretary algorithm: Empirically, it is likely that late-arriving but frequently accessed objects account for most of the carbon difference between the greedy and offline algorithms. This result makes sense given that data access patterns often follow top-heavy Pareto or exponential distributions [123]. The challenge is to judiciously select the objects with the largest carbon footprint savings when they arrive, using only knowledge of previous objects.

CADRE uses the multiple-choice secretary algorithm (MCSA), shown in Algorithm 1, to select the $l_1$ objects with the largest savings [56]. If $l_1 = 1$, the algorithm observes the first o objects, records the largest observed savings, and then selects the next object that exceeds the observed maximum (Lines 14-22). If $l_1 > 1$, the algorithm randomly samples a binomial distribution $y = B(l_1; \frac{1}{2})$ (Lines 9-10).

Algorithm 1 MCSA for Carbon-Aware Replication

     1: Function MCSA($l_1$, $O_l$, $O_u$)
     2:   $l_1$ := min( $\frac{|O|(|\Omega| f - K)}{|\Omega| - K}$, $l_1$ )
     3:   $O_u$ := min( $|O|$, $O_u$ )
     4:   $O_l$ := max( 0, $O_l$ )
     5:   if $l_1 \leq (O_u - O_l)$ then
     6:     $k_j := k_j^o$
     7:   end if
     8:   if $l_1 > 1$ then
     9:     return MCSA( $B(l_1; \frac{1}{2})$, $O_l$, $O_u$ := floor($\frac{O_u - O_l}{2}$) )
    10:     return MCSA( $B(l_1; \frac{1}{2})$, $O_l$ := ceil($\frac{O_u - O_l}{2}$), $O_u$ )
    11:   end if
    12:   Largest := 0, Observed := 0
    13:   while $O_l \leq j \leq O_u$ do
    14:     $Savings_j := C_j^*(K) - C_j^*(k_j^o)$
    15:     if $k_j^o \leq K$ or Observed $\leq \frac{l_1}{e}$ then
    16:       $k_j := K$
    17:       Largest := max(Largest, $Savings_j$)
    18:       Observed := Observed + 1
    19:     else if $Savings_j >$ Largest then
    20:       $k_j := k_j^o$
    21:     else
    22:       $k_j := K$
    23:     end if
    24:   end while

It then recursively selects y objects from one half of the remaining objects. After $l_1$ objects are selected, or when only $l_1$ objects remain, the algorithm terminates.
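For intuition, here is a compact recursive rendering of the multiple-choice secretary selection in Python. It follows the classical 1/e observation rule in the base case and the binomial split in the recursive case; it is a sketch of the idea, not a line-for-line port of Algorithm 1.

    import math
    import random

    def mcsa(savings, lo, hi, slots):
        """Pick up to `slots` indices with large savings from stream [lo, hi)."""
        n = hi - lo
        if slots <= 0 or n <= 0:
            return []
        if slots >= n:
            return list(range(lo, hi))        # enough slots: take everything
        if slots == 1:
            observe = max(1, int(n / math.e)) # classical rule: watch n/e arrivals
            threshold = max(savings[lo:lo + observe])
            for j in range(lo + observe, hi):
                if savings[j] > threshold:    # first arrival beating the max
                    return [j]
            return [hi - 1]                   # fall back to the last arrival
        # Split the slots binomially, B(slots, 1/2), and recurse on each half.
        y = sum(random.random() < 0.5 for _ in range(slots))
        mid = (lo + hi) // 2
        return mcsa(savings, lo, mid, y) + mcsa(savings, mid, hi, slots - y)

    # Example: choose 3 of 100 streamed savings values.
    chosen = mcsa([random.random() for _ in range(100)], 0, 100, 3)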

3.1.3 Evaluation

Prototype: Our prototype has three parts. First, the core software implements the functions of consistent hashing, the carbon footprint models and the replication algorithms; these packages run at each site. Second, a front-end PostgreSQL database engine handles workloads at a specific site. The engine is also modified to include a background process that periodically computes local emission rates. Finally, the query dispatcher routes queries between sites.

Note that, in our current prototype, we ignore the network cost between sites and end users, and consider it as future work in the implementation.

We study our prototype in two environments. First, we have a real testbed comprising four clusters, which represent four locations (i.e., NC, SC, IA, and OR) in the United States. Each cluster has a location-based carbon emission profile and a server profile similar to the machines used at Google sites. There are four DELL R710 servers per cluster. Each server has 12 2.4 GHz cores, 16 GB RAM and 1 TB of disk-based storage. DVFS and other power-saving mechanisms are turned off to measure energy in the worst case. This provides references for $e_r$ and $e_w$ in Equations (3.1) and (3.2). To study CADRE at scale, we also set up simulations. Starting from the site profiles of our physical testbed, we keep adding new sites based on the known profiles of Google and Amazon sites [76, 43].


Figure 3.6: Carbon footprints for a 1-week snapshot of the Google trace. Emission and workload rates are provided hourly.

Traces and Workloads: For each data object, we consider three types of queries, namely create, read and write. In the test, we treat a data table as one data object. We evaluated CADRE with data access traces from WorldCup and Google.

WorldCup is the three-month webpage access trace from World Cup 1998, covering over 1 billion queries [1]. This trace contains only two types of queries, create and read. It is representative of in-memory data accesses with simple, heterogeneous transactions having skewed access patterns. The Google trace is a one-month data access trace on a Google cluster of over 11k machines in May 2011 [7]. This trace contains data access information such as access type (i.e., create/read/write), start/end time, etc. Using this trace, we can evaluate the estimation performance of the staging phase in CADRE. In our test, we select one week of data on 12 machines from the entire trace.

Figure 3.7: Performance comparison using the WorldCup trace (first row) and the Google trace (second row): (a) carbon footprint, (b) carbon emission rate, (c) latency, and (d) utilization.

The minimum hourly R/W ratio in the selected trace is 1.2:1. Note that both the WorldCup and Google traces represent real-world data access patterns but contain no real data. To fill queries with real data, we use the TPC-H, TPC-W and TPC-C benchmarks [105]. In our testbed, we have over 2,000 objects per site, corresponding to 600 GB of data.

Data services have much smaller carbon footprints under carbon-aware data replication: We start by showing a 1-week carbon footprint snapshot under consistent hashing and carbon-aware data replication strategies in Figure 3.6. All carbon footprint curves have diurnal patterns due to the diurnal workload patterns of the Google trace. Across the whole trace, CADRE achieves lower and more predictable carbon footprints: its 25th and 75th percentiles differ by 25%, whereas these percentiles differ by 63% under Hash. Meanwhile, CADRE achieves performance similar to Oracle; for over 90% of the data points, the carbon footprint difference between CADRE and Oracle is within 10%.

Performance comparison of different replication strategies: To examine the performance of CADRE, we compare it with other baselines using WorldCup and Google (the first and second rows in Figure 3.7, respectively). In the experiment, we measured the total carbon footprint, the statistics of the carbon emission rate, the latency, and the per-site capacity utilization. Note that the latency for writes is measured as the time until all chain replicas are updated after the write.

Figure 3.7(a) shows that CADRE outperforms Hash, beats the other baselines, and performs close to Oracle. The carbon footprint difference between CADRE and Hash on Google is 2.5X larger than on WorldCup. This is because Google is a mixed read/write workload and CADRE considers both reads and writes, so it provides a better tradeoff under this kind of workload. CADRE beats other carbon-aware replication strategies, such as Greedy, by saving up to 56% in carbon footprint. In contrast to those baseline strategies, CADRE focuses on the set of data objects that contribute the most to carbon savings, and respects both the capacity and availability constraints. Excluding Oracle, CADRE has the lowest average carbon emission rate across all replication sites. When we examine the 25th-percentile carbon emission rates, CADRE always picks the cleanest site, as Oracle does, whereas Blind and Weight fail (in the Google trace) because of misprediction and the strict load-balance constraint. For the 75th-percentile data, we observe that Greedy picks the dirtiest site, as Hash does, which significantly reduces its carbon savings.

Using carbon-aware data replication does not necessarily hurt the performance of serving data. As illustrated in Figure 3.7(c), the average latency in CADRE is only 5% higher than in Hash. Introducing writes increases the latency because chain replication requires queries to traverse every replication site. However, CADRE uses its model to choose a good combination of sites. We observe the latency decrease from Hash to Blind (no workload model) to Weight (constrained model) to CADRE. Note that Greedy exhausts spare capacity quickly and assigns late-arriving objects their minimum replication factor; this reduces the latency for writes but increases carbon footprints.

Finally, we examine the per-site capacity usage in Figure 3.7(d). Weight has a variance in capacity utilization (10%) similar to Hash, as both strategies are designed for fair load balancing. Compared to Hash and CADRE, Weight serves as a tradeoff between the load-balancing requirement and carbon footprint reduction. Greedy has the largest variance (46%) for both workloads, due to its aggressive replication behavior. In all results, CADRE performs closest to Oracle.

CADRE interacts well with many routing policies: Here we examine how CADRE interacts with routing policies for carbon reduction, in Figure 3.8. Although Rand, LW, and CW are not carbon-aware routing policies, compared to Hash, CADRE can still reduce their carbon footprints by up to 70%, 73%, and 64%, respectively (Figure 3.8(a)). Using the replication from CADRE, LW further reduces latency by 38% compared to Hash+Rand (Figure 3.8(b)).


Figure 3.8: (a) Carbon footprints and (b) average latency comparison between replication strategies and routing policies.

Using CADRE+GLB saves an additional 21% in carbon footprint with 17% lower average latency than Hash+GLB.

CADRE is designed to exploit time-varying emission rates. However, other important costs also vary over time, e.g., electricity prices. Often, these metrics change less frequently than emission rates, meaning that consistent hashing may perform better relative to CADRE. We conducted a study based on the electricity price analysis from [65]; the results show that, by simply placing data in sync with the sites that have low electricity prices, CADRE can help save 43% more in electricity expenses, although the relative carbon footprint increases. In the future, carbon emissions will better align with prices, making CADRE an attractive approach for both metrics.


Figure 3.9: Data replication simulations under (a) decreasing spare capacity and (b) scale-out.

Spare storage capacity limits carbon footprint savings: We also evaluate the performance of Hash, CADRE and Oracle under bounded available capacities. As the available spare capacity is gradually reduced from 100% to 0%, the carbon footprints of all data replication approaches increase, as shown in Figure 3.9(a). The carbon footprint of Hash increases because more data migrations occur due to the limited space. The performance of CADRE and Oracle converges to Hash when there is no spare capacity. With 30% spare capacity, CADRE can still save up to 33.7% in carbon footprint.

CADRE keeps carbon footprints low as more geo-diverse sites are added: We also study the scale-out performance of CADRE via simulations from 5 sites to 16 sites, using duplicated profiles of Google and Amazon sites, as shown in Figure 3.9(b). Adding more sites with time-varying carbon emissions to the system, CADRE further reduces the carbon footprint of serving data (about 200 tons per site). Meanwhile, with an increasing number of sites, the performance of CADRE and Oracle converges.

3.2 BOSS: Blending On-Demand and Spot Instances to Lower Costs for In-Memory Storage

In-memory storage is vital for cloud services that respond to queries interactively, especially services with users spread all over the world. By 2018, the market for in-memory storage will exceed $13B, reflecting sustained growth of 43% annually [45]. Memcached and Cassandra are among the software packages commonly used to store data in memory [125]. Increasingly, clients prefer managed platforms that expose a networked API for creating, reading and writing data while handling replication and availability on behalf of their clients. Amazon ElastiCache [4], MemCachier [8], and MemCached Cloud [78] are examples of managed in-memory storage platforms.

Managed platforms can lease bundles of memory and CPU, called instances, from infrastructure-as-a-service (IAAS) clouds. For example, Memcachier leases from Amazon, Google, Rackspace and Azure [8]. The most common type of lease is for on-demand instances, where workloads get exclusive access to an instance for a set time. In-memory storage can lease on-demand instances for a long time to retain data stored in main memory. However, using on-demand instances is expensive for two reasons. First, on-demand prices include a surcharge for exclusive access, because IAAS clouds cannot use instances leased on demand to execute other (perhaps more profitable) workloads. Second, large in-memory storage capacity often leads to idle processing capacity. For example, in 2010, TripAdvisor spent 8% of its equipment budget on 350 GB of in-memory storage [41]. These servers could have processed 5B lookups per day but actually processed only 1B; 80% of the processing capacity was unused [41].

Spot instances provide another option for leasing resources from IAAS clouds. With spot instances, workloads get exclusive access to an instance only until a competing workload outbids the lease price. There is no surcharge for exclusive access because the IAAS cloud can revoke the lease at any time. As a result, Amazon EC2 spot instances can cost 90% less than on-demand instances [3]. However, to use spot instances, workloads must be able to stop abruptly at any time. Workloads must also be able to delay their execution when prices for spot instances rise. Prior research has addressed these challenges for data processing workloads that tolerate slow execution times [98]. These workloads, e.g., map-reduce jobs, include recovery techniques that can resume execution whenever they regain access to spot instances. If IAAS clouds provide a short warning, workloads can use virtual machine migration to pause and resume their work [93, 100, 47]. This research could help in-memory storage avoid losing data when spot instances stop abruptly. However, beyond storing data, spot instances are also used for query processing. It is challenging to process incoming queries when spot instances are unavailable without relying on pricey on-demand instances.

If in-memory storage platforms cannot procure spot instances after a failure, there are a few choices: (1) accept slower response times and possible data loss, (2) replace spot instances with on-demand instances, negating the savings, or (3) move data and workload to other IAAS clouds where prices are cheaper. Spot prices on Amazon differ by up to 132% between its US East, US West, European and Asian sites [2] (we use the word site to refer to an IAAS cloud). Further, spot market prices change over time at each site. Savvy workloads can reduce costs by constantly shifting data between sites to use the cheapest spot instances. However, too much data migration increases costly inter-site network bandwidth.

This section presents BOSS, a Blended On-demand and Spot Storage framework for in-memory storage. In BOSS, leased on-demand instances process state-changing queries, e.g., object creation, while spot instances process read-only queries. This design reduces costs for in-memory storage serving mostly read-only queries and avoids data hazards (e.g., incomplete writes). The challenge for BOSS is response time: when spot instances stop abruptly, re-dispatched read-only queries could overwhelm the surviving instances. BOSS uses data replication between sites and within sites to anticipate the effects of variable leasing times. First, when BOSS creates objects, it places them at sites where spot instances are available. This shifts queries away from sites without spot instances. BOSS also balances query rates between sites by replicating heavily accessed objects to more sites than lightly accessed objects. This reduces the impact when the spot instances at one site fail. However, placing objects at too many sites wastes in-memory storage on redundant data. We devised a novel online replication approach that considers (1) the expected cost saving from replicating an object to more sites and (2) the spare capacity available for additional replication. Our approach profiles the query rates of newly created objects and historical spot prices to model cost. If the statistical distribution of cost savings is exponential, our approach is $O(1 + \frac{\omega}{|k_d|})$-competitive, where $\omega$ is the reserved storage ratio and $|k_d|$ is the default replication factor. Our competitive ratio improves for any distribution more skewed than exponential.

Within a site, BOSS fully replicates data over spot and on-demand instances. BOSS exploits fast internal cloud networks that do not charge for intra-site bandwidth. When spot instances fail, their queries can be served by any other instance. BOSS allows users to trade response time for cost savings, much like portfolio management: users can use on-demand instances to mitigate risk, and BOSS provides an efficiency frontier that helps users manage cost and risk.

We prototyped BOSS in Cassandra [60] and evaluated its costs on the Amazon and Google clouds. In the prototype, $|k_d| = 2$; thus, in the worst case, the performance of BOSS's inter-site replication is O(1.5)-competitive. On Amazon, our evaluation observes BOSS over 17 weeks under real spot instance prices. We compared BOSS to (1) ElastiCache, Amazon's managed in-memory store, (2) an auto-scaled in-memory storage implemented atop Amazon EC2, and (3) a recent research proposal based on virtual machine migration [47]. We deployed up to 78 Amazon EC2 instances across 8 sites. Results show that BOSS costs 85% less than ElastiCache while achieving comparable 95th-percentile response times (within 13%). BOSS costs 84% less than auto-scaled in-memory storage using on-demand instances. Further, BOSS increases the average utilization per leased instance by 60% compared to ElastiCache. Over 17 weeks on Amazon, BOSS never lost data. Google Cloud reduces savings by charging a fixed price for spot instances; we used BOSS on Google Cloud, and costs were 56% lower than using only on-demand instances.

The contributions of this section are:

- We present a novel framework for in-memory storage that blends reliable on-demand instances and cheap spot instances.

- We show that inter-site replication can mitigate the effects of spot instance failures.

- We prove that our online replication algorithm is at least $O(1 + F(\frac{\omega}{|k_d|}))$-competitive.

- We present a design that achieves high throughput and handles spot instance failures by dispatching read queries to spot instances and state-changing queries to on-demand instances.

- We show that an efficiency frontier of cost savings and response time variations can manage the risk/saving tradeoff.

- Using real spot instances from Amazon and Google, we show that our framework can significantly reduce costs compared to other managed in-memory storage platforms.

3.2.1 Reducing Cloud Prices for Renting Data Storage

Static optimization is ineffective for cost reduction when hosting data services on blended stores. From the data objects' perspective, most objects (i.e., more than 90%) have few overall accesses, usually fewer than 100. In our observations, these rarely accessed data objects quickly expire from memory (e.g., within 6 to 7 days) and are probably backed up in lower-level storage. This is not a surprise, since most in-memory data management systems enforce a lifetime threshold (i.e., time-to-live, or TTL) on the keys of managed data. For example, memcached can set the largest TTL to 2,592,000 seconds, or 30 days. Thus, any optimized placement decision could yield sub-optimal results after data expiry. We use the expiry time of the 95th-percentile data (i.e., 7 days) as the TTL for all data objects in our study. If any data lives longer than this time, we treat it as a new data object entering the system and place it online using our algorithm.

Figure 3.10: Observations on blended Amazon stores: (a) price volatility and (b) the efficiency frontier.

From the expense perspective, we observe that on-demand instances have a fixed price, that the price difference between on-demand instances at different sites can be larger than 20%, and that, on average, spot instances are 64% cheaper than on-demand instances with the same computing resource capacities, as shown in Figure 3.10(a). The price volatility (i.e., the frequency of price changes larger than a threshold, e.g., 5%) is very high. When blended together, the blended store incurs much higher price volatility than either type of instance individually. As a result, any decision from an optimization based on static prediction could lead to suboptimal results. Increasing the granularity of such optimization could help solve the problem, but the cost and the overhead also increase.

The relationship between cost savings and performance delays is non-linear and monotonic. In our 6-month observation, on-demand instances failed only once, while spot instances can fail on an hourly basis. It is much safer to host data on on-demand instances and risky to host data on spot instances. To quantify this risk, we measure the performance delays of blended instances and compare them to the performance of using all on-demand instances with the same specification. As shown in Figure 3.10(b), we draw an efficiency frontier based on the normalized performance delays and cost savings from the experimental results. The results show that more cost savings may incur more performance delays. Accordingly, we propose to build this efficiency frontier to pick the most suitable instance configuration at the intra-site level.

3.2.2 Blending On-Demand and Spot Instances to Lower Costs for In-Memory Storage

Overview. BOSS provides Internet services with managed in-memory storage. It supports queries that create, read and write data objects, but controls the mapping of data objects to instances (i.e., it controls data replication). BOSS leases on-demand and spot instances from infrastructure-as-a-service (IAAS) clouds. Normally, these IAAS clouds are located at geographically diverse sites where regional factors, e.g., electricity prices, affect instance prices.

Figure 3.11: Prices for instances leased on demand and on spot markets in Amazon IAAS clouds.

BOSS assumes that each site supports:

1. High-speed internal networks that allow instances hosted in the same site to communicate without cost.

2. Instances can be leased on demand at any time.

3. Spot instances require winning bids. Bidders can lose many times before winning a lease. Further, spot instances can be stopped at any time.

Figure 3.11 plots the average lease prices for on-demand and spot instances at 5 Amazon EC2 sites [2]. We show snapshots for two days (Jan. 30 and Feb. 30, 2015). On-demand prices differed by 42% between US East and Asia Singapore, but within a single site, prices were stable. In contrast, US East provided the cheapest spot instances on Jan. 30, while Asia SGP had the cheapest spot instances on Feb. 30.

BOSS targets in-memory storage with these features:

1. Data objects are accessed frequently enough to exceed the CPU capacity of a single instance.

2. Objects are created and retired frequently. The total storage capacity needed for all objects is stable.

3. Read-only queries are more frequent than queries to create and update objects.

In-memory storage is usually reserved for data objects accessed very frequently. The most popular objects have peak query rates 12X larger than the average object [125]. BOSS assumes that objects with low access rates should be evicted to cheaper storage, e.g., SSD. Newly created objects are normally accessed more often than older objects, so LRU replacement policies can manage eviction and maintain stable storage capacity [125]. In addition to LRU, in-memory storage enforces time-to-live (TTL) thresholds. In the Google cluster, more than 95% of objects are retired or evicted within 7 days [7].

Figure 3.12 highlights the software that manages inter-site and intra-site replication in BOSS. Within a site, a query dispatch framework sends incoming queries that create or update objects to on-demand instances. This ensures high data availability. On-demand instances use chain replication [108], which provides scalable write throughput. Read-only queries are dispatched to spot instances. BOSS uses an efficiency frontier to help users manage the ratio of spot instances to on-demand instances. Spot instances use fast internal networks to fully replicate data stored on on-demand instances. Across sites, BOSS uses a cost model to find the best replication policy for each object. The cost model penalizes sites where spot instances are unavailable. Inter-site replication also uses an online algorithm to determine which objects deserve extra copies (for cost savings). The subsequent sections discuss inter-site and intra-site replication in detail.

Figure 3.12: The BOSS framework.

Inter-site Data Replication. Figure 3.13 shows the life cycle of a data object in BOSS. When a query that creates an object arrives, BOSS applies Amazon's consistent hashing [33] to replicate the object to multiple sites. Each object must be copied to at least $|k_d|$ sites for availability; users can set $|k_d|$ or follow the default system configuration. BOSS profiles each object's query rate during this initial placement and fits a decay function. Note that we use placement and replication interchangeably throughout this section. Query rates are normally skewed, e.g., following exponential models. The profiled query rates and the spot prices from the last 24 hours are fed into a cost model and an online replication algorithm. This produces our final data placement. Objects stay at their final sites until their TTL times out.
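A minimal sketch of the two placement stages follows, assuming string site names, a default factor $k_d$, and a black-box cost function standing in for Equation (3.15); the exhaustive enumeration mirrors the note later in this section that the lowest-cost policy can be found by computing the cost of every policy. The helper names are hypothetical.

    import hashlib
    from itertools import combinations

    def hash_placement(obj_id, sites, k_d):
        """Initial placement: the first k_d sites in hash order."""
        ring = sorted(sites, key=lambda s: hashlib.md5(
            f"{obj_id}:{s}".encode()).hexdigest())
        return set(ring[:k_d])

    def final_placement(sites, k_d, cost):
        """Final placement after profiling: the policy minimizing the
        Equation (3.15) cost, enumerated over all policies with >= k_d sites."""
        candidates = (set(c) for r in range(k_d, len(sites) + 1)
                      for c in combinations(sites, r))
        return min(candidates, key=cost)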

BOSS avoids moving data between sites to reduce wide-area bandwidth. After online replication, objects are placed at their final sites. BOSS can incur increased costs if the final sites are chosen poorly; the only mechanism in BOSS to mitigate bad decisions is to recreate the object, e.g., at TTL timeout. We compare this design to a recent approach that moves objects when spot instances fail [47]. The following focuses on our (1) cost model and (2) replication algorithm.

Figure 3.13: Inter-site replication in BOSS.

Cost Model. Replication policies can increase storage costs by assigning objects to sites with highly priced instances. Policies that increase the frequency of inter-site communication also increase storage costs. We model the total cost as a weighted function of (1) the expense to store the object at its replication sites and (2) the latency between the sites. Equations (3.11)-(3.15) show the model:

$P_i \triangleq P_{i,w} + P_{i,r} + G_i$  (3.11)

$P_{i,w} \triangleq \frac{\sum_t^T w_i^t}{T \times \bar{I}_{OD}} \times \frac{\sum_{j \in k} p_j(OD)}{|k|}$  (3.12)

$P_{i,r} \triangleq \sum_t^T \frac{r_i^t}{\bar{I}_{SP}} \min_{j \in k_i} p_j^t(SP)$  (3.13)

$L_i \triangleq (1 - \tau) \times \frac{\sum_{t \in T} w_i^t}{T \times \bar{I}_{OD}}\, l_w + \tau l_r + l_{routing}$  (3.14)

$C_i(k) \triangleq \alpha P_i + \beta L_i$  (3.15)

In these equations, i indexes the i-th data object and j indexes the j-th site. $C_i(k)$ is the total cost of serving object i under a replication policy k, where k is a set of host sites (e.g., {US East, US West, Europe, etc.}). The size of the set, $|k|$, is the replication factor. The total cost is a linear combination of financial cost ($P_i$) and inter-site latency ($L_i$) [69]. Parameters $\alpha$ and $\beta$ are weight coefficients. The parameters $p_j^t(X)$ capture the price of instance type X at time t; the two instance types are SP and OD. Time t is the t-th interval in the TTL T. We use $p_j(X)$ to denote the average price over time T.

$P_i$ captures the expense of leasing on-demand instances for writes ($P_{i,w}$), spot instances for reads ($P_{i,r}$), and the network bandwidth expense $G_i$ of migrating data between sites. Let $w_i^t$ be the query arrival rate of updates (writes) to object i, and let $\bar{I}_{OD}$ be the query processing rate per unit time of on-demand instances. Thus, $P_{i,w}$ reflects the number of on-demand instances provisioned at each site to handle write queries. To be sure, BOSS fully replicates data across on-demand instances. $G_i$ is the product of the allocated storage, the cost of network bandwidth, and the frequency of migration. In BOSS, all objects are allocated the same size container (4KB). Each object is migrated at most $|k_d| + |k|$ times.

The cost of read-only queries is $P_{i,r}$. Unlike write queries, read-only queries are processed by only one spot instance across all sites. Further, BOSS can scale the capacity of spot instances. Thus, the cost to serve the object is the integral of the product of the number of spot instances needed for reads, $r_i^t / \bar{I}_{SP}$, and the cheapest spot instance price at time t. BOSS uses spot prices from 24 hours prior to estimate $p_j^t(X)$.

The lowest possible latency of one data operation, $L_i$, combines the average write latency $l_w$ (weighted, since writes are applied to all replicas), the average read latency $l_r$, and the routing delay $l_{routing}$. Note that we calculate the routing delay $l_{routing}$ based on the average distance among all selected sites. Also, $\tau$ is the percentage of read-only queries.
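For concreteness, the sketch below evaluates Equations (3.11)-(3.15) for one object under stated simplifications: hourly intervals over the TTL, the migration term $G_i$ folded into a constant, and day-old spot prices as the estimate of $p_j^t(SP)$. Parameter names follow the text; the function is illustrative.

    def total_cost(policy, w, r, p_od, p_sp, I_od, I_sp, tau,
                   l_w, l_r, l_route, G_i, alpha, beta):
        """Cost of one object under `policy` (a set of sites).

        w[t], r[t]   : write/read rates per interval        (w_i^t, r_i^t)
        p_od[j]      : on-demand price at site j            (p_j(OD))
        p_sp[t][j]   : spot price at site j in interval t   (p_j^t(SP))
        I_od, I_sp   : per-instance query processing rates
        """
        T = len(w)
        # (3.12) on-demand instances absorb writes at the mean OD price.
        P_w = (sum(w) / (T * I_od)) * (sum(p_od[j] for j in policy) / len(policy))
        # (3.13) reads go to the cheapest spot site in each interval.
        P_r = sum((r[t] / I_sp) * min(p_sp[t][j] for j in policy)
                  for t in range(T))
        P = P_w + P_r + G_i                                   # (3.11)
        # (3.14) writes traverse the chain; reads hit a single replica.
        L = (1 - tau) * (sum(w) / (T * I_od)) * l_w + tau * l_r + l_route
        return alpha * P + beta * L                           # (3.15)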

Online Replication Algorithm. Each BOSS site receives a stream of newly created objects. After a profiling period, our online algorithm tries to reduce costs by carefully assigning objects to sites. The key insight is that, for many objects, over-replication to more than $|k_d|$ sites can reduce costs by serving read-only queries more cheaply, i.e., $\exists\, k \in \mathcal{K} : C(k_d) > C(k)$, where $\mathcal{K}$ is the set of all replication policies. However, BOSS has limited capacity to over-replicate, i.e., $|k_{full}| \times I > \Omega$, where $k_{full}$ replicates data objects to all sites, I is the total number of objects, and $\Omega$ is the total in-memory storage capacity.

The challenge is to maximize cost savings within the capacity budget. Bin packing can solve this problem at a single site. However, with create queries streaming to all sites, bin packing algorithms would require BOSS to update the spare capacity between sites each time an object was updated, using costly inter-site communication. We preferred a distributed solution: BOSS allows each site to operate independently by over-allocating space for over-replication. When a site chooses to over-replicate, it loses a replication slot at all sites. Each site is tasked with finding the objects with the largest cost savings, even though some spare capacity is stranded. (Later, we show competitive guarantees despite this internal fragmentation.) Let $k^o$ be the replication policy with minimum cost. The cost savings for object i are $C_i(k_d) - C_i(k^o)$. The online version of this problem tries to predict the largest objects in a stream [56]. Specifically, for each object, we decide to either (1) replicate to its lowest-cost placement $k^o$ or (2) stick to the default replication $k_d$. Note that we can obtain $k^o$ by exhaustively computing Equation (3.15) for all replication policies $k \in \mathcal{K}$.

BOSS uses the multiple-choice secretary algorithm (MCSA) [56] to estimate the $n_l$ objects with the largest cost savings among n arrivals. First, BOSS prepares for n incoming objects by recursively selecting markers $m_i$ from the binomial distribution B(n, 1/2). In each recursion between markers, BOSS expects to select $\frac{n_l}{2}$ objects for over-replication. When $n_l$ within a marker region is 1, BOSS observes $\frac{m_{i+1} - m_i}{e}$ objects, records the largest savings, and then over-replicates the next object with greater savings. If the algorithm reaches the last marker with fewer than $n_l$ objects, it chooses any remaining object with savings larger than the median of the selected objects.

Performance Analysis: $O(1 + \frac{\omega}{|k_d|})$-Competitiveness. BOSS uses MCSA to pick the top $n_l$ data objects with the largest cost savings, and over-replicates them across multiple sites.

In this section, we prove that if the data access distribution follows an exponential distribution, the performance bound of the algorithm is $O(1 + \frac{\omega}{|k_d|})$, depending on $\omega$ (i.e., the reserved storage factor) and $|k_d|$ (i.e., the default replication factor). First, we need to find the distribution of the cost.

Lemma 3.2.1. Assume the data access distribution (denoted by x) follows an exponential distribution $x \sim Exp(\lambda)$, and the spot price history holds over the replication policy k. Then the cost distribution of the replication policy, C(k), follows an exponential distribution $C(k) \sim Exp(\mu)$, where $\mu$ is a linear function of $\lambda$.

The spot price is hard to predict; BOSS can only know which sites have had very low spot prices. To estimate the cost of one replication policy, we assume the price history holds during the replication period. In this case, given a fixed replication policy, the cost of the policy depends on the incoming reads and writes. We can estimate the read-to-write ratio from profiling. Thus, the total cost in Equation (3.15) can be converted into a linear function of the data accesses x. If the data access distribution x is exponential, then the cost distribution C(k) is also exponential (if $x \sim Exp(\lambda)$, then $ax \sim Exp(\lambda / a)$). More importantly, if the cost distribution of any replication policy C(k) is exponential, the achievable cost savings $C(k_d) - C(k^o)$ are also exponential. This claim can be proved by calculating the joint density function of two independent exponential distributions, which is also the density function of an exponential distribution. We define the competitiveness of BOSS, denoted by f, via $C(BOSS) = (1 + f)\,C(OPT)$, where C(BOSS) and C(OPT) represent the total cost of using the decisions from BOSS and the offline optimal solution, respectively. To find f, there are two steps. In the first step, BOSS uses MCSA to pick the likely top $n_l$ objects with the largest cost savings without looking at future objects. We need to find the cost distance between the objects selected by MCSA and the actual top $n_l$ objects.

Lemma 3.2.2. Given any set of non-negative real numbers ordered by value, let $C(\gamma)$ equal the sum of the largest $n_l$ elements in the set. The expected value of the elements selected by MCSA is at least $(1 - \Theta(\sqrt{1/n_l}))\,C(\gamma)$.

This is the competitiveness of the MCSA online algorithm; due to limited space, we refer readers to the proof in prior research [56]. Lemma 3.2.2 provides the performance competitiveness between BOSS's decision and the decision (denoted by $\gamma$) of picking the actual top $n_l$ objects. However, this decision may not be the optimal one. Due to misprediction, $\gamma$ may not select the right set of $n_l$ objects, which leads to wasted storage and thus fewer savings. Therefore, our second step is to find the competitiveness between $\gamma$ and the offline optimal decision.

Theorem 3.2.3. If we only optimally replicate the top $n_l$ objects, as in decision $\gamma$, the competitiveness of $\gamma$ is $1 + F(n_r)$, where $F(\cdot)$ is the CDF of the normalized cost saving distribution and $n_r$ is the ratio of objects that are not selected by $\gamma$.

The cost savings missed by decision $\gamma$ come from optimally replicating the data objects in the set of size $n_r I$. In the worst case, $\gamma$ may select a subset of objects even though we have enough spare space to store all data objects optimally. Thus, we can calculate the competitiveness by counting how much more normalized saving could be achieved if all data were replicated at their own $k^o$. Using Lemma 3.2.1, we can easily prove that the normalized cost saving distribution $F(\cdot)$ is exponential. The rate parameter (denoted by v) of $F(\cdot)$ is in (0, 1], and the ratio of unselected objects is $n_r = \omega / |k_d|$. Now, we have the competitiveness of BOSS in Theorem 3.2.4.

Theorem 3.2.4. If the data access distribution is exponential, the competitiveness of BOSS is $O(1 + \frac{\omega}{|k_d|})$.

Proof. Using Lemmas 3.2.1 and 3.2.2, and Theorem 3.2.3, we have

$C(BOSS) = (1 + F(n_r))\left(1 - \Theta\left(\sqrt{1/n_l}\right)\right) C(OPT)$

$= \left(2 - e^{-v \frac{\omega}{|k_d|}}\right)\left(1 - \Theta\left(\sqrt{1/n_l}\right)\right) C(OPT)$

$= \left(2 - e^{-v \frac{\omega}{|k_d|}} - \Theta\left(\sqrt{1/n_l}\right) + e^{-v \frac{\omega}{|k_d|}} \times \Theta\left(\sqrt{1/n_l}\right)\right) C(OPT)$ (apply Taylor series expansion)

$\leq \left(2 - 1 + v \frac{\omega}{|k_d|} - \frac{v^2}{2} \frac{\omega^2}{|k_d|^2} - \Theta\left(\sqrt{1/n_l}\right)\right) C(OPT), \quad (v \leq 1)$

$\leq \left(1 + \frac{\omega}{|k_d|}\right) C(OPT)$


Figure 3.14: The intra-site configuration design of BOSS. Dashed circles represent spot instances, while solid circles are on-demand instances.

In the worst case, if the spare capacity runs out while many data objects are created, then $\omega \to 1$. In our setup, we follow the replication configuration in Cassandra, in which $|k_d| = 2$. As a result, the theoretical competitive ratio of our prototyped BOSS is O(1.5) in the worst case. This competitive ratio improves for any distribution more skewed than exponential: higher skewness means a higher ratio between the cost savings from optimally replicating the top $n_l$ objects and those from optimally replicating all data objects.

Intra-site Replication. As shown in Figure 3.14, BOSS lays out on-demand instances for high write throughput and spot instances for read throughput. Each instance runs a local in-memory store that independently supports read, write and create operations. The BOSS intra-site replication protocol describes the propagation of queries across instances. First, we describe the support for write queries. Clients route write queries to any on-demand instance. A hash function parameterized by the query payload determines a random order over the on-demand instances. BOSS then processes the query at each on-demand instance in that order. The client can observe her write only after it is processed by the last on-demand instance [108]. Only when all on-demand instances within a site store the update does the client observe a successful write, indicated by a per-object version number. The last on-demand instance in the write chain broadcasts the query to (1) spot instances within the site and (2) other sites hosting replicas. We use a gossip protocol to update (possibly failed) spot instances slowly over time [60]. At the beginning of each period, say 1 minute, all nodes randomly pair with neighbors and share the latest "gossip" on up-to-date data versions. This communication protocol helps the system scale out and makes it easy to configure new instances.
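A minimal sketch of this write path follows, with the object key standing in for the query payload in the ordering hash and the gossip-based repair elided; the Node class and helper names are illustrative.

    import hashlib

    class Node:
        """Toy instance holding an in-memory store of obj -> (version, value)."""
        def __init__(self, name):
            self.name, self.store = name, {}

    def chain_write(obj, value, version, on_demand, spots):
        # The hash fixes a deterministic chain order over on-demand instances.
        chain = sorted(on_demand, key=lambda n: hashlib.md5(
            f"{obj}:{n.name}".encode()).hexdigest())
        for node in chain:                      # apply at each link in order
            node.store[obj] = (version, value)
        # The tail broadcasts to spot replicas over the free intra-site network.
        for node in spots:
            node.store[obj] = (version, value)
        return version                          # client now observes the write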

Read-only queries are distributed across many spot instances. Clients can query any on-demand instance to learn the set of active spot and on-demand instances. Read-only queries should be routed to available spot instances first. However, if the response time of queries routed to spot instances exceeds 2X the service delay, clients can also route queries to on-demand instances. Clients are encouraged to choose randomly between the available instances for reads.

Risk Management. How many on-demand instances should BOSS lease? By design, it must lease enough on-demand instances to handle the write throughput. However, when spot instances stop and the response time increases, clients can direct read-only queries to active spot and on-demand instances. Over 6 months, we observed only one on-demand instance stop abruptly. Additional on-demand instances mitigate the effect of losing spot instances; however, they also negate cost savings. To quantify this risk, we measure the performance delays of blended instances with different ratios of spot to on-demand instances. To be precise, we run offline experiments that capture the average response time and cost savings during a short period for each spot-to-on-demand ratio. A mean-to-variation efficiency frontier plots the standard deviation of response time against the mean cost savings. Any point on the frontier is an efficient operating configuration; users can select the specific configuration that fits their desired risk. Later, we show an example of this frontier over multiple intra-site instance configurations. Our approach is analogous to portfolio management in the stock market: one can purchase low-risk bonds or high-risk stocks, and for a predefined risk (i.e., response time variation), there exists one instance mix that provides the highest expected return (i.e., cost savings). Note that, based on this profiling, BOSS bids for spot instances every hour.
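A minimal sketch of constructing the frontier and selecting a mix from offline profiles, assuming each candidate mix has already been profiled into a (risk, saving) pair; the names are illustrative.

    def efficiency_frontier(mixes):
        """mixes: (label, risk, saving) triples from offline profiling, where
        risk is the response time std. dev. and saving is the mean cost saving."""
        frontier, best = [], float("-inf")
        for label, risk, saving in sorted(mixes, key=lambda m: m[1]):
            if saving > best:              # dominates everything at lower risk
                frontier.append((label, risk, saving))
                best = saving
        return frontier

    def pick_mix(frontier, risk_budget):
        """Highest-saving mix whose risk fits the user's predefined budget."""
        feasible = [m for m in frontier if m[1] <= risk_budget]
        return max(feasible, key=lambda m: m[2]) if feasible else frontier[0]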

3.2.3 Evaluation

Prototype: We prototyped BOSS as a Cassandra (v2.1.8) node image and deployed it to serve data processing workloads [60]. We chose Cassandra to implement BOSS because of its rich support for in-memory workloads and its easy configuration for starting spot instances. Cassandra supports the gossip protocol to identify and explore connected instances. The default replication factor $|k_d|$ and the TTL in Cassandra are 2 and 7 days, respectively. The reserve ratio $\omega$ is reset at the beginning of each TTL to ensure we have enough space to store all data under the default replication. We have modified the nodetool to look up IP addresses at other sites for inter-site replication.

Average Unit Price ($/H)

Name              Type        US East   US West   EU IRE    EU FFM
m3.medium         On-demand   0.067     0.077     0.073     0.079
cache.m3.medium   Cache       0.09      0.094     0.095     0.103
m3.medium         Spot        0.0072    0.0085    0.0102    0.0119

Name              Type        AP SGP    AP TYO    AP SYD    SA SP
m3.medium         On-demand   0.098     0.096     0.093     0.095
cache.m3.medium   Cache       0.12      0.12      0.12      0.115
m3.medium         Spot        0.014     0.0127    0.0103    0.011

Table 3.2: On-demand, cache and spot instance profiles.


Figure 3.15: A 24-hour throughput snapshot of four sites ((a) US East, (b) US West, (c) EU IRE, (d) AP SGP) under local price variation. Data are recorded at the same time, but the x-axis is adjusted to each site's local time. QPS is queries per second.

The query dispatcher is implemented in the Cassandra load balancer. To profile data objects and monitor runtime performance, we use Amazon CloudWatch and the nodetool status command to capture instance utilization, network delay, and workload statistics. The price meter API from Amazon provides the runtime unit prices, variations and total expenses of BOSS. BOSS uses the bid price that would have provided 90% uptime in the previous month. However, a better bidding price model could improve the performance of BOSS; we consider this an extra layer on top of BOSS and leave it for future work, noting prior work [98, 93].

Results. We evaluated BOSS on IAAS clouds offered by Amazon and Google that lease both spot and on-demand instances. Real clouds allow us to study BOSS as spot prices change in the real world. (To be sure, we cannot control price changes in our evaluation.) Table 3.2 provides details about the CPU, memory and prices of the instances used. For spot instances, the average winning bid was $0.0096/H across all sites. For Amazon, we used 8 sites. Spot instances were 86-92% cheaper than on-demand instances at the same site. The average lifetime of a spot instance was 36 minutes, and prices between sites were not correlated. For Google, we used 8 global sites; we focus on Google pricing later in this section.

We compared BOSS to other systems to show that (1) BOSS costs less than widely used alternatives and (2) both inter-site and intra-site replication contribute to cost savings. Below, we describe the baseline systems in each category. For the remainder of this section, we refer to each baseline using italicized words.

State of the Practice:

Default uses only on-demand instances for all queries. Consistent hashing is used to replicate data across sites. We enable autoscaling in Amazon EC2. When an instance is idle, Amazon checkpoints its data to EBS and stops charging for the on-demand lease.

ElastiCache is Amazon's managed in-memory storage. It is used by Internet services like Netflix. We added consistent hashing to support replication across multiple sites. Amazon autoscaling also reduces costs for ElastiCache.

Migration mimics a recent approach that migrates virtual machines (VMs) across sites when receiving a termination warning [47]. This approach stops autoscaling by regularly sending state changes to a central server. Our version checkpoints internal Cassandra state (not the whole VM).

Alternative BOSS Designs:

InterOnly disables intra-site replication to spot instances. This approach shows the cost savings of inter-site replication using only on-demand instances.

IntraOnly disables inter-site replication for low cost. It uses consistent hashing to place objects across sites. Like BOSS, IntraOnly uses on-demand instances to cover for failed spot instances.

Oracle discards the data profiling and prediction in BOSS, using future knowledge of the trace to estimate the effect of perfect workload prediction.

We feed each site workloads based on Google's cluster cloud traces [7]. We implemented a latency-aware load balancer to route queries to the nearest site that hosts the data [40]. It is important to note that nearest-site load balancing inflates the cost of serving read-only queries by using spot instances at more expensive sites; our evaluation shows this cost does not negate the benefits of BOSS. The in-memory workloads used in our evaluation are:

WordCount is a MapReduce (MR) benchmark. It contains a wordcount MR program and a 10 GB truncated Wikipedia data dump. This workload tests the performance of a large quantity of random key-value pair reads.

Database contains TPC-W and simple update queries. It represents sequential reads on large database files. The update queries write randomly to popular data objects.

24-hour Performance Analysis. In Figure 3.15, we show the throughput and average price per instance at 4 sites over 24 hours. The x-axis is adjusted to each site's local time zone. When it was daytime at US East (i.e., 8AM to 3PM), spot instance prices were high (e.g., the mixed price was 0.03 $/H). Data created during this period was replicated to other sites instead of US East. As a result, the throughput at US East decreased from 4300 QPS (queries per second) to 2000 QPS, while the throughput at EU Ireland and Asia Singapore increased 26% and 30%, respectively. We further illustrate the cost reduction impact of BOSS within a single site. Figure 3.16 illustrates the throughput and the expense of running BOSS in US West for 24 hours. To highlight the performance of BOSS, we compare it to Default and ElastiCache. From Figure 3.16(a), we observe that BOSS shifts the query rate significantly over time. Sometimes, BOSS has more throughput than Default and ElastiCache (e.g., around hour 4), due to the performance boost from serving reads on more spot instances. However, if spot instances fail, reads are routed to the backup on-demand instances, causing a sharp throughput drop (e.g., around hours 8 to 10). By carefully managing the variable lease times of spot instances, BOSS significantly reduces expenses. As illustrated in Figure 3.16(b), BOSS cuts costs by 83% on average during this period.

Figure 3.16: Performance snapshots of BOSS and the Amazon baselines in US West: (a) throughput and (b) average expense per instance.

The average per-hour expense of BOSS is $0.17/H, while Default and ElastiCache cost about $0.43/H and $0.60/H, respectively. ElastiCache is more expensive than Default because its instance price is around 10% higher, as shown in Table 3.2.

Cost/Performance Comparison. Figure 3.17 shows one month performance and cost comparison between BOSS and all baselines. Shown in Figure 3.17(a), BOSS is 84% lower than Default, 66% lower than InterOnly, 85% lower than ElastiCache, and 50% lower than IntraOnly under the database workload. The cost savings increase under the read-only wordcount workload. In both cases, we observe that workload misprediction (an effect captured by Oracle) causes very small increase in cost.

On the performance side, we examine the 95th-percentile response time of all queries as the latency metric. Figure 3.17(b) shows that BOSS has the highest response time on both workloads, but the extra delay is within 13% and 10% of Default and ElastiCache, respectively. Although spot instances can drop frequently, BOSS's two-layer replication shields it from massive response time increases.

Figure 3.18 shows the number of instances leased and the average utilization of these instances on the different in-memory storage platforms. Default and ElastiCache are usually over-provisioned with more instances and thus have much lower utilization. InterOnly helps reduce the number of on-demand instances. IntraOnly boosts per-instance utilization by offloading reads to spot instances, but the total number of on-demand instances is still over-provisioned. BOSS consolidates on-demand and spot instances that

Figure 3.17: One-month performance comparison using database (first row) and wordcount (second row): (a) expense; (b) response time.

Figure 3.18: The total number of instances leased using database. The number above each bar is the average instance utilization.

lead to high utilization (above 85%). However, we observe that workload misprediction can noticeably degrade the performance of BOSS, as the comparison between BOSS and Oracle shows. Better workload prediction algorithms could lead to increased resource utilization. It is worth noting that over 17 weeks of running BOSS on Amazon EC2, we never experienced data loss.

Impact of weight coefficients. We have already shown that BOSS is good at reducing the cost of hosting data services using blended instances. There are two weight coefficients (i.e., α and β) in the cost model (Equation 3.15) that affect the decisions of picking sites and allocating instances. Instead of evaluating each one independently, we show the performance of BOSS when tuning α and β to weight the two metrics (i.e., expense and latency), as shown in Figure 3.19. When we gradually increase α and decrease β, BOSS focuses more on cost reduction and less on response time. As a result, the expense decreases monotonically; looking at tail performance, the 95th-percentile latency increases, but only slowly. The queuing delay due to the single vCPU does not affect tail performance.

Figure 3.19: The impact of tuning weight coefficients α and β using database. Note that the x-axis is in log scale.
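To make the tradeoff concrete, the following minimal sketch assumes the weighted-sum form suggested by the relative weight αP/βL on the x-axis of Figure 3.19; the candidate (expense, latency) pairs are invented for illustration and are not from our experiments.

# As the relative weight alpha/beta grows, cheaper (higher-latency)
# configurations win the weighted objective.

def weighted_cost(expense: float, latency: float, alpha: float, beta: float) -> float:
    return alpha * expense + beta * latency

candidates = [(10.0, 250.0), (14.0, 180.0), (20.0, 150.0)]  # ($/H, ms)

for alpha, beta in [(0.001, 1.0), (1.0, 1.0), (1000.0, 1.0)]:
    best = min(candidates, key=lambda c: weighted_cost(c[0], c[1], alpha, beta))
    print(f"alpha/beta={alpha/beta:g}: pick expense={best[0]}, latency={best[1]}")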

Impact of Profiling Period and Risk Management. During the profiling phase, a concern is how long BOSS should profile to achieve reasonable accuracy while data is placed at sub-optimal sites. The longer we spend on profiling, the better we can predict the access pattern and workload statistics for a given data object. We empirically set the profiling time to 10%, 20%, 30%, 40% and 50% of the TTL for the database workload. As demonstrated in Figure 3.20(a), the lowest cost was achieved at 30%. Normalized expense at each setting was 1.23, 1.15, 1.09, 1.19, 1.34 and 1.51, respectively. Long profiling periods were costly, reflecting the importance of inter-site replication.

Figure 3.20: Impact of (a) profiling period and (b) risk on BOSS. (a) is normalized to Oracle, and (b) is normalized to Default.

Figure 3.20(b) plots the efficiency frontier for the database workload. We created the frontier offline with repeated experiments. Each point represents 20 experiments conducted with the same ratio of spot instances to on-demand nodes. The x-axis captures the variation of response time across the experiments; the y-axis captures mean cost savings. Points along the frontier maximize the mean-to-variation ratio and thus are all good candidates for the instance configuration.

Scale-out and Computational Overhead. To test the scale-out performance of BOSS, we use virtual sites, which are duplicates based on the real profiles of 8 distinct physical sites of Amazon EC2 and Google Cloud around the world. Each virtual site hosts 10 on-demand instances and up to 15 spot instances. The average prices for an on-demand instance are 0.085 $/H and 0.05 $/H, and for a spot instance 0.0089 $/H and 0.02 $/H, on Amazon and Google, respectively. We plot the total expense of using BOSS to serve database workloads from 10 to 50 virtual sites. Compared to Amazon's spot instances, Google's show a relatively slower expense increase when scaling out, as shown in Figure 3.21. All results show that BOSS can ensure high data availability by using on-demand instances while reducing costs by opportunistically using spot instances. As a framework, BOSS can be applied to increase the utilization of in-memory storage without understanding and defining workload types.

Figure 3.21: Scale-out performance of deploying BOSS on Amazon's and Google's cloud platforms, normalized to Default.

Another concern is the computational overhead of using the cost model (Equation 3.15) when scaling out. In our current prototype, we use exhaustive search to find the lowest-cost site combination among the distinct site profiles. The computation time of searching the entire space of 8 physical sites is about 8 minutes on a dual-core desktop computer. Note that although we have 50 virtual sites in this experiment, their profiles come from 8 distinct physical sites, so the computation time is the same. The computational overhead of exhaustive search grows exponentially with the number of physical sites. However, hosting 8 physical global sites already exceeds the capacity of 95% of cloud vendors in the world [6]. In future work, we can use search heuristics [69] to keep the computational overhead linear in the number of sites considered.
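As an illustration of how a heuristic can replace the exhaustive scan, the sketch below contrasts the two; cost_of is a stand-in for the cost model of Equation 3.15 and is an assumption, not the code used in BOSS. The greedy variant evaluates the cost model O(kn) times instead of O(C(n,k)).

# Illustrative greedy alternative to the exhaustive site search, in the
# spirit of the heuristics cited as [69].
from itertools import combinations

def exhaustive_search(sites, k, cost_of):
    """O(C(n,k)) scan of all k-site combinations -- exponential in n."""
    return min(combinations(sites, k), key=cost_of)

def greedy_search(sites, k, cost_of):
    """Linear-in-n heuristic: grow the placement one best site at a time."""
    chosen = []
    remaining = list(sites)
    for _ in range(k):
        best = min(remaining, key=lambda s: cost_of(tuple(chosen + [s])))
        chosen.append(best)
        remaining.remove(best)
    return tuple(chosen)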

3.3 Discussion

In this chapter, we have shown two system designs for optimizing time-varying metrics in data services across multiple nodes. Our key findings include:

• Traditional data replication methods inflate important time-varying system metrics, such as carbon footprint and price.

• Frequently moving data away from sites with higher saving potential incurs high overhead.

• Objects with frequent reads are often under-replicated, forcing dispatchers to serve data from sites with less savings.

• Objects with frequent writes are often over-replicated, because writes must eventually propagate to all sites.

• Data replication transforms the optimization of time-varying variables from a dynamic problem into a semi-static decision problem.

• Our approach can also be adapted to many other time-varying metrics, such as loads and electricity prices.

Chapter 4: Energy Modeling and Management on Edge Devices

As mentioned in Section 1, our current work shifts to battery-powered devices, i.e., smartphones. In this chapter, we start from our understanding of the energy profiles of current smartphones and the energy consumption breakdown of all the major energy-consuming components. Then, we discuss the design of REEWA, a hardware performance counter-based energy model, in Section 4.1. Using this model, we can accurately trace and estimate the energy consumption of any data-related operation in the smartphone. Based on the energy model, we further explore the energy storage, the battery, and provide a dual battery management framework to extend service time for data services in smartphones. The detailed design is presented in Section 4.2. Last, we share design insights gained during system design and implementation in Section 4.3.

4.1 Energy Profiling of Battery-Powered Devices

The increasing popularity of mobile devices, e.g., smartphones, makes the idea of accessing the Web from any place and at any time a reality. These web accesses are provided by a rich body of smartphone applications (apps). Depending on the implementation, these apps can be divided into native apps (e.g., Chrome), hybrid apps (e.g., HackerNews), and web apps (e.g., Facebook and Amazon). Each web access in these apps is considered an activity (i.e., a single, focused thing that a user can do, as defined in the Android SDK [5]). Such activities rely on key functions from the web engine to connect to the Internet, load web contents, and render these contents on the screen. These activities are usually called web activities [12]. Based on recent statistics [71], average consumers spend more than 80% of their smartphone time on web activities. Such a long time on web activities can lead to undesirably large amounts of energy consumption on smartphones. To reduce this energy consumption, we need to understand and estimate the energy consumption of web activities.

There are major challenges in understanding the energy consumption of web activities. First, although most smartphones have built-in voltage sensors, they are usually not equipped with current sensors. As a result, the energy consumption cannot be precisely measured. External power meters are accurate but inconvenient for consumers to use at runtime; runtime measurement is not a practical solution. Second, the energy consumption of web activities is greatly affected by many dynamic factors, such as hardware power states, network conditions, etc. Worse still, this impact varies during different execution stages of web activities. For example, once the web content is loaded, the network condition no longer affects the energy consumption of rendering the content to the screen. Thus, runtime estimation must capture the impact of the dynamic factors and focus on different hardware components at different execution stages. Unfortunately, the state-of-the-art modeling work on smartphones [95] focuses on the entire phone instead of individual activities, not to mention specific stages. Recent modeling techniques, such as AppScope [122], provide tools to account for the energy consumed by a thread. However, an activity may involve many cycles of execution in multiple threads. Zhu [129] studied the critical execution path of web browsing on smartphones and proposed to use statistics of HTML tags to estimate its energy consumption. However, this model only works for the rendering stage. The network part of web activities, which dominates the energy consumption of the entire activity, is missing from this modeling technique. Meanwhile, it is highly preferable that the proposed estimation methodology does not depend on specific knowledge of existing web components, so that the model can work with future new components. Thus, it is preferable to have an estimation framework specially designed for web activities.

This section presents a framework, REEWA, to provide fine-grained energy estimation for web activities on smartphones. In sharp contrast to existing modeling work that relies on system utilization for energy estimation, REEWA features a dynamic methodology that uses hardware counters to capture runtime hardware events and to account for the energy consumption from a single function up to the whole process of a web activity. REEWA features four major attributes in its design: (1) Accuracy. REEWA uses hardware counters to capture the exact hardware events that consume energy for web activities. (2) Low overhead. REEWA carefully selects the desired hardware events and accounts for the energy consumption, providing the best tradeoff between accuracy and overhead. (3) Low cost. The implementation of REEWA utilizes free resources available in the system, and thus has minimal memory and energy footprints on smartphones. (4) Transparency. REEWA has no dependence on the source code of the apps, nor on the structure of the web content. The implementation of REEWA does not change any code in the application, nor the loaded web contents.

Hardware counters can be set for a specific process during the monitoring time, e.g., the Chrome process while rendering a web page. As such, they can be used for energy estimation without interference from other processes. Moreover, an energy model based on hardware events does not need knowledge of existing web components and so can be applied to esoteric web components. In addition, system dynamics, such as hardware power state changes due to bad network conditions, result in a sequence of hardware events or count changes. REEWA detects these changes and adapts itself with updated estimation coefficients and models. Despite the many benefits of using hardware counters, dynamic energy estimation based on them is non-trivial. First, accessing hardware counters requires multiple system calls. Increasing the access frequency and counting all hardware events can improve the modeling accuracy but also leads to high overheads, in terms of both time and energy. Hence, we study the correlation between the energy consumption and relevant hardware events to minimize the set of monitored events. Second, although we are able to access hardware counters from almost all hardware components inside a smartphone, not all hardware components are relevant in each phase. For example, network devices, such as WiFi/radio chips, are usually not in use while the web activity is updating the screen, i.e., painting. Thus, we associate the relevant components and selected counters with each phase of the web activity. More importantly, the model may change with the network conditions and the system states; we build specific models for these conditions. After addressing these challenges, we highlight the design of REEWA with two cases of energy optimization for web activities. We show that with the accurate runtime energy estimation from REEWA, the power manager in smartphones can make better choices about saving energy and prolong the battery life of the device.

We implemented REEWA as a service on a modified Android 5.0 system, deployed onto two smartphones, Nexus 4 and Nexus 5. REEWA spawns light-weight monitor threads, triggered by WebKit function references, to capture the hardware event counts during specific execution phases of a web activity. These counts are obtained at runtime, so they can be used to estimate the runtime energy consumption and provide accurate energy prediction while considering the impact of user interaction patterns and system dynamics. Our evaluation uses multiple web browsers and web/hybrid applications, including Chrome, Firefox, Hacker News, etc., to access thousands of webpages from the top 5000 websites. Compared with several state-of-the-art energy estimation approaches, REEWA achieves a very high estimation accuracy (e.g., 99% for 90th-percentile workloads) on both devices with a negligible overhead (e.g., 0.71% worst-case time). We have applied REEWA as a support component for the aforementioned two energy optimization techniques. Empirical results show that with better runtime energy estimation and prediction, both optimizations achieve more energy savings (i.e., 13% and 30%, respectively).

Figure 4.1: The critical path of accessing web contents: Network (NE), Parsing (PS), RenderTree (RT), Layout (LA), and Painting (PT). The dashed arrow indicates that JavaScript may trigger re-execution of previous phases.

Critical Path of Web Activities. The procedure of a web activity can be separated into five interdependent phases: Networking (NE), Parsing (PS), RenderTree (RT), Layout (LA) and Painting (PT), illustrated in Figure 4.1. The last four phases combined are usually called the rendering process. The Network phase establishes services from the low-level network stack to connect to the remote server and wait for its response. The performance of this phase is identified as the bottleneck for web activities [101], and it is hard to model at the application level. The Parsing phase interprets the retrieved web content into structured syntax data. Although this is an important pre-processing phase for webpage rendering, our measurement shows that it usually accounts for less than 1% of the total energy consumption. Similar results are also reported in [129]. In the RenderTree phase, the web app builds the rendering tree based on the syntax data (styling) and processes web scripts, e.g., JavaScript (scripting). The Layout phase computes the web layout attributes, such as width/height and position of the web contents, for the later painting phase. The Painting phase employs the styled internal representation to generate the final graphical representation of the webpage. WebView functions in all these phases can be concurrently executed and their scheduling is nondeterministic, which makes precise energy estimation more challenging.

Tight Performance Constraint for Web Activities. The most important constraint for web activities is the rendering delay. Google [10] recommends that a page be rendered within one second on a mobile network. Research [28] has shown that any delay longer than a second will interrupt the user's flow of thought, creating a poor experience regardless of device or type of network. If we look at a typical sequence of communication between a smartphone and a server, 600 ms of that time has already been used up by network overhead: a DNS lookup to resolve the hostname (e.g., google.com) to an IP address, a network roundtrip to perform the TCP handshake, and finally a network roundtrip to send the HTTP request, as shown in Figure 4.2. Together with the server response time, this leaves only 200 ms for client-side optimization. As a result, we need the response from energy estimation/prediction at the millisecond or even microsecond level.

Figure 4.2: The one-second performance constraint for rendering web activities: roughly 800 ms is consumed by network and server overhead (DNS lookup, TCP connection, HTTP request and response), leaving only a 200 ms budget for client-side energy optimization.

Energy Optimization for Web Activities. Beyond energy optimization for the entire device, researchers have proposed two state-of-the-art energy optimization techniques for accessing web content. The first is energy-aware execution scheduling [127]. The concurrent client-side rendering process is usually scheduled in the web engine, where the current scheduling method follows a First-Come-First-Serve manner. The authors propose to consolidate the networking phases of concurrent web activities to reduce the network energy consumption, and to schedule the rendering process using Smallest-Energy-First without violating the page rendering deadline. Similarly, Zhao et al. [128] propose to reschedule the execution to reduce the energy consumption. The second technique requires a heterogeneous multi-core architecture: without violating certain performance constraints, some rendering processes can be offloaded to the little core instead of the big core, in order to save active energy consumption. Both methods require highly accurate energy profiles for these web activities. More importantly, any scheduling decision requires a fast response from the estimation in order to meet the performance constraint.

4.1.1 Performance Counter-based Energy Modeling

In previous sections, we have discussed the necessity and the unique features of building energy models for web activities. In this section, we introduce the design of REEWA, as highlighted in Figure 4.3. REEWA starts monitoring and estimation as soon as a new web activity is initialized by the smartphone user. For each newly started web application (e.g., web browser, web app, hybrid app, etc.), a monitor thread is initialized with a hardware counter interface to all hardware counters (HWC) available on the platform. The execution of the web activity, as aforementioned, references a list of functions built into the WebKit engine. These function references serve as triggers in REEWA to identify each phase of execution and to initialize/record hardware event counts. Before entering the next phase, the previous counters are stopped and their counts are saved for energy accounting. Afterwards, REEWA picks another set of counters and hardware devices specific to the next target phase, based on our system identification study. All recorded counts, combined with our hardware counter-based energy models, are used to estimate the energy consumption of the recorded phase and to predict the future energy consumption of the same application. In this section, we first discuss our hardware counter-based energy model for each hardware component involved in a web activity's execution. Then, we focus on the hardware selection and event selection design of the modeling process. With a complete model, we show our prediction based on counter history and two cases of energy optimization using the prediction. A conceptual sketch of this phase-triggered collection loop follows.
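The sketch is Python pseudocode for readability; the actual implementation is a modified Android service hooked into WebKit, and the counter interface (hwc), event names, and per-phase selections shown here are hypothetical.

PHASE_EVENTS = {            # per-phase counter selection (illustrative)
    "NE": ["net_packets", "net_bits", "cpu_cycles"],
    "PS": ["cpu_cycles", "l2_misses"],
    "RT": ["cpu_cycles", "l2_misses"],
    "LA": ["cpu_cycles", "l2_misses"],
    "PT": ["gpu_draw_calls", "tex_cache_miss"],
}

class PhaseMonitor:
    def __init__(self, hwc):
        self.hwc = hwc          # hypothetical hardware-counter interface
        self.active = None

    def on_phase_enter(self, phase):
        """WebKit function references trigger this at each phase boundary."""
        if self.active:                      # stop and save previous counters
            counts = self.hwc.stop(self.active)
            self.save_for_accounting(self.active, counts)
        self.hwc.start(PHASE_EVENTS[phase])  # arm counters for the new phase
        self.active = phase

    def save_for_accounting(self, phase, counts):
        pass  # feed counts into the per-phase energy model (Section 4.1.2)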

Figure 4.3: The architecture of REEWA. Shaded boxes are the major components. HW stands for hardware and HWC for hardware counter.

4.1.2 Energy Modeling

Network. Modeling the 3G and WiFi components is the most difficult part of our modeling, because both have multiple running states and their state-transition power cannot be ignored. To model the power consumption of these devices, we use piecewise models, one for each running state. The main counters we use for these devices are the number of bits, the size of packages, memory hits/misses, and the network bandwidth. We use the network bandwidth counter to identify the running state of the device, and use the other counters to quantify the possible energy usage. A general form of the model is shown in Equation 4.1:

P^j = C_i^{M_{band}} M_i^j + C_{hit}^{M_{band}} M_{hit} + C_{miss}^{M_{band}} M_{miss}, \quad M_{band} \in (a, b]    (4.1)

where P^j is the power consumption of device j, and M_i^j is the related hardware event i with coefficient C_i^{M_{band}}. M_{hit} and M_{miss} are the cache hit/miss events, with their coefficients C_{hit}^{M_{band}} and C_{miss}^{M_{band}}, respectively. M_{band} is the bandwidth counter that identifies the running state (the interval (a, b]), and thus selects the values of the coefficients.
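A minimal sketch of Equation (4.1), assuming two running states with placeholder breakpoints and coefficients (real values come from offline calibration):

# The bandwidth counter M_band selects which coefficient set applies.
STATE_COEFFS = [
    # (band_low, band_high], C_i, C_hit, C_miss -- illustrative numbers
    (0.0, 1e6, 0.8e-9, 2.0e-9, 5.0e-9),   # low-power state
    (1e6, 1e8, 1.5e-9, 2.5e-9, 6.0e-9),   # high-power state
]

def network_power(m_i: float, m_hit: float, m_miss: float, m_band: float) -> float:
    for low, high, c_i, c_hit, c_miss in STATE_COEFFS:
        if low < m_band <= high:
            return c_i * m_i + c_hit * m_hit + c_miss * m_miss
    raise ValueError("bandwidth outside calibrated range")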

CPU. We start by introducing a server-level power model from [94], in which power is a linear combination of all relevant hardware counters (M). Unlike previous work [19, 51], this model considers the impact of sharing the maintenance power of the processor. This maintenance power maintains shared multicore resources, such as clocking circuitry and voltage regulators, which consume some active power whenever a core is active. Combining the power estimated from hardware events with the maintenance power, the total power P^j of core j is:

P^j = \sum_{i=1}^{m} C_i M_i^j + C_{share} \frac{U_{core}^j}{1 + \sum_{i=1, i \neq j}^{n} U_{core}^i}    (4.3)

where m and n are the total number of relevant hardware events and the total number of cores, respectively. M_i^j is the counter value of the relevant hardware event i on core j and C_i is the coefficient of this event. U_{core}^j is the utilization of core j (U_{core}^j ∈ [0,1]). C_{share} is the coefficient of the maintenance power of the processor (these coefficients can be calibrated offline for each target platform). In Equation (4.3), the first term represents the core's dynamic power estimated from hardware events, such as CPU, cache, and memory, while the second term represents the core's share of the processor maintenance power.

We now model the power consumed only by the web activity. This power model should include the dynamic power estimated from the hardware events related only to the browsing process, as well as the process's share of the maintenance power. For simplicity, when considering the maintenance power, we assume that all web activities are scheduled to run on the same core, which is consistent with the default process scheduling policy in Linux [26]; this assumption can be relaxed in an extended model. Hence, a browsing process k's maintenance power can be estimated based on its utilization of the core. Thus, we extend Equation (4.3) to estimate the power consumption of browsing process k as:

P^{k,j} = \sum_{i=1}^{m} C_i M_i^{k,j} + C_{app} \frac{U_{app}^{k,j}}{1 + \sum_{i=1, i \neq j}^{n} U_{core}^i}    (4.4)

where M_i^{k,j} is the hardware event i related only to process k on core j and C_i is its coefficient. U_{app}^{k,j} is the utilization of the web activity process k on core j (U_{app}^{k,j} ∈ [0, U_{core}^j]). C_{app} is a coefficient that can be obtained from profiling results on a given platform. The remaining symbols have the same meanings as in Equation (4.3). Note that all coefficients are related only to the runtime subsystem states and are independent of web activities. Based on Equation (4.4), energy can be calculated as the integral of the modeled power over time.
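The following sketch transcribes Equation (4.4) directly; the coefficient values are placeholders for the offline-calibrated constants, and the discrete energy sum stands in for the integral of power over time.

def process_power(counts, coeffs, c_app, u_app, u_cores, j):
    """counts/coeffs: per-event values M_i^{k,j} and C_i for process k on core j;
    u_app: process utilization on core j; u_cores: utilization of every core."""
    dynamic = sum(c * m for c, m in zip(coeffs, counts))
    other = sum(u for i, u in enumerate(u_cores) if i != j)
    maintenance = c_app * u_app / (1.0 + other)
    return dynamic + maintenance

def energy(power_samples, dt):
    """Approximate the power integral as a sum over sampling periods dt."""
    return sum(p * dt for p in power_samples)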

GPU. REEWA can capture data from the hardware counters in the GPU and output the results to the monitor thread. Two groups of counters are considered for GPU energy consumption. The first group is the graphics performance counters, including OpenGL draw calls, bandwidth, shader instructions, and processor frame rate; these account for most of the energy consumed when performing 3D rendering. The other group relates to the screen buffer and includes the texture cache hit and miss counts, which we use to estimate the energy consumption of the local memory buffer. Overall, the energy consumption can be estimated using Equation 4.5:

P^j = \sum_{i=1}^{m} C_i M_i^j + C_{hit} M_{hit} + C_{miss} M_{miss}    (4.5)

where P^j is the power consumption of core j in the GPU, and M_i^j is the counted hardware event i with coefficient C_i. M_{hit} and M_{miss} are the texture cache hit/miss events, with coefficients C_{hit} and C_{miss}, respectively.

Cache and Memory. Cache and memory models depend on core-level events. Both the CPU and the GPU provide cache hit information to estimate the energy consumption of the different cache levels. However, cache misses cannot be used directly to estimate the memory energy consumption, because a cache miss may also result in a memory miss. For the memory supporting the CPU, we use the counts from previous history to decide whether an access is a memory hit or a miss; the hit/miss ratio updates itself after each incorrect prediction. To this end, the overall models are shown in Equations 4.6 and 4.7:

P^{cache} = C_{hit} M_{hit} + C_{miss} M_{miss}    (4.6)

P^{memory} = C_{hit} M_{hit} + C_{miss} M_{miss}    (4.7)

where P^* is the power consumption of the related hardware, and M_{hit} and M_{miss} are the storage hit/miss events with coefficients C_{hit} and C_{miss}, respectively. Note that although most smartphones have very fast I/O to local non-volatile flash, we do not consider it, as most web contents are stored in memory. This is a limitation of our current prototype that we will address in future work. It is also worth noting that all the above components are calibrated together, because they are integrated into the same system-on-chip (SoC) with the same power supply circuit; we cannot unpack the socket without breaking it, and thus we model them as a whole.
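The history-based attribution of cache misses to memory hits or misses can be sketched as a self-correcting ratio; the update rule below is illustrative, not the exact rule in our prototype.

class MemHitPredictor:
    def __init__(self, init_ratio=0.5):
        self.hit_ratio = init_ratio   # estimated P(memory hit | cache miss)

    def predict(self) -> bool:
        """Classify the next cache miss as a memory hit (True) or miss."""
        return self.hit_ratio >= 0.5

    def update(self, was_hit: bool, lr=0.1):
        """After an incorrect prediction, nudge the ratio toward the outcome."""
        target = 1.0 if was_hit else 0.0
        self.hit_ratio += lr * (target - self.hit_ratio)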

Screen. Unfortunately, the screen exposes no hardware counters. We could use GPU and CPU counters to monitor the runtime rendering of the screen, but this would be an overuse of the performance counters. As both of our prototype smartphones use LCD screens, we use the same model as [95, 28, 55, 81], which is based on the brightness level and the basic leakage power consumed by the screen. Note that more and more new smartphones now use AMOLED screens; in that case, we can adopt a hybrid power model based on Chameleon [38], which includes the original RGB color, the brightness level, and the frame rate counts obtained from the GPU. In this section, we can always obtain the screen's energy data from its configuration beforehand, and thus it is treated as a constant during estimation.

4.1.3 Evaluation

Baselines. Below, we describe baseline systems for our evaluation. For the remainder of this section, we will refer to each baseline using the italicized words.

• Content [129] models the power consumption based on attributes of parsed web objects, such as size and color. We modified the Firefox code to extract this information for this baseline.

• Sesame [37] uses hardware utilization as the predicates in a linear model to estimate the energy consumption of mobile devices. We read all predicates presented in [37] from Android/Linux to support this methodology.

• AppScope [122] traces runtime system calls, together with the hardware power state information, to estimate the runtime energy consumption of a target mobile process. To support this baseline, all required information, including the power states of all hardware components, is extracted from system files.

Note that no available estimation system can identify the different execution phases of web activities. To make a fair comparison, we use time stamps to separate the per-phase estimation data for the above baselines. All baselines are modified to capture energy consumption for mobile apps as in their original designs.

Workloads. To measure the performance of web activities, we use web contents from the top 5000 popular websites. We randomly separate these webpages into two sets: the first set for model training and the second for evaluation. For better evaluation, we pick webpages with rich content from those websites. For example, for Google.com, instead of using its simple homepage, we choose the result page of searching hot keywords from Google Trends. To simulate user interactions, we use two access patterns: random and a prerecorded trace. The random pattern randomly triggers interactive web components to start a web activity. The trace is a one-week user interaction trace from 10 users on 4 apps: Chrome, Hacker News, Facebook, and Walmart.

Figure 4.4: Normalized (to their maximum detected value) power consumption, non-halt CPU cycles, and CPU utilization during the network phase on Nexus 4. At the millisecond level, the CPU utilization cannot reflect the power change immediately (around 50 ms latency), while CPU cycle counts capture the runtime power changes with negligible delay (less than 1 ms).

Energy accounting for one web activity phase. To evaluate the performance of REEWA, we conduct a series of experiments on Nexus 4 and Nexus 5 for each phase of a web activity. First, we show a snapshot of the runtime measurement and our estimation during the networking phase. To test the performance, we modified the Firefox web browser to emit the right time stamps at the application level, in order to test whether capturing system calls is a viable approach. We show the measured energy consumption and our estimation data in Figure 4.5.

Figure 4.5: Normalized power consumption, package rates, and hardware state change during the network phase on Nexus 4. The package count captures the ever-changing power faster than reading the hardware state from the system file; the hardware state change is reflected to the system file with a much larger delay (e.g., 50-100 ms).

Figure 4.5 gives a visual illustration of how the hardware events correlate with the energy consumption. However, estimating the runtime energy cost depends not only on the most correlated hardware events, but also on other triggers that may modify the correlation between the measured energy consumption and the recorded counts. For example, after t=150 ms in Figure 4.5, the device enters a lower power mode due to the reduced number of transmissions, and the power mode of the network device changes. Such a change cannot be fully captured by the number of packages, but it triggers the power-mode-change event. When capturing this event, we switch to another suite of coefficients to calibrate the estimation model.

Figure 4.6: Normalized estimation accuracy of REEWA compared with the other three baselines, for the top 25 websites (first row) and for the different browsing phases over the top 2,500 webpages (second row), on (a) Nexus 4 and (b) Nexus 5. The results are normalized based on the measurement.

Estimation accuracy under multiple workloads. To avoid noise in measurement, we disabled the power management service for the cores as well as the cache. All web contents are loaded from the same local server to ensure the same network delay for all tests. The browser (Firefox) is instrumented to provide time stamps for capturing the per-phase energy estimation data of the baselines. Our test runs in multiple trials. Each trial starts on an idle device (Firefox is the only running app in user space), reads one URL from the script, and starts loading the webpage. For the ground truth of the actual energy consumed by browsing each webpage, instead of reading the power data from the embedded tool in the phone, we directly measure the energy consumption of the CPU, memory and network components during the browsing time. Based on the time stamps, we can tell the active

Figure 4.7: REEWA has a high energy estimation accuracy (close to the measured per-phase average), while the baselines are unaware of the different phases, resulting in degraded estimation performance, while browsing Wikipedia.com on Nexus 4.

power/energy data of each browsing phase. Reading power data from the smartphone itself has a very low sampling frequency (usually 1 to 5 minutes), and running the power app may interfere with the browsing process. We repeat each trial at least 10 times and report the average sampled power data and the energy estimation for each phase.

Figures 4.6(a) and (b) (first row) illustrate the estimation performance of REEWA when browsing 25 different websites on our two platforms, Nexus 4 and Nexus 5, respectively. Across all workloads, the content-based approach consistently underestimates the energy consumption by half compared with the measured energy consumption, because the content-based method fails to capture the energy data of the Network and Parse phases. In contrast, Sesame significantly improves the estimation performance (close to 75-88%) by considering the utilization of related hardware (e.g., CPU utilization) as model predictors.

Figure 4.8: Estimation accuracy comparison between REEWA and the other baselines while accessing static HTML (first row), full-JavaScript HTML5 (second row), and full Flash (third row) using (a) a browser (Chrome), (b) a hybrid app (HackerNews), and (c) a web app (Facebook). All experiments are done on Nexus 5.

By capturing the power states of all related hardware, AppScope achieves performance close to that of REEWA. In a stable environment where the web activity is the only active process running in the system, AppScope is likely to perform similarly. REEWA achieves higher estimation performance by providing a finer-grained energy model based on hardware counters, which further improves the worst-case estimation performance by 11% (less than 16 mJ off the real measurement).

Figures 4.6(a) and 4.6(b) (second row) show the estimation errors of the five browsing phases. Both Sesame and AppScope have high energy estimation errors for each browsing phase, because Sesame only monitors the entire phone and AppScope focuses on the process without being aware of the different browsing phases, as also illustrated in Figure 4.7. As a result, they cannot capture the fine-grained energy consumption data that is essential for an estimation solution to adapt to network power state changes.

4.2 Dual Battery Management

Despite their rich functionality, the utility of mobile devices is usually limited by their battery capabilities. Moore's law tells us that integrated circuit performance has doubled every eighteen months, yet this trend does not coincide with the development of the battery industry. The service time of mobile devices, e.g., smartphones, is mainly constrained by their batteries. Previous work has shown that different battery characteristics grant different performance for different computing workloads, so there exist opportunities to extend the device service time. For example, any smartphone today has about 10X better performance than the best Nokia 3310 from the 90s, but the average longest battery life (i.e., the length of a battery discharge cycle) has decreased to 1/8 of that phone generation [126].

Other than battery life, the battery industry uses many other properties to distinguish good batteries from bad ones, for example, volumetric and gravimetric energy density. However, these properties are often at odds with each other: batteries with higher volumetric and gravimetric energy densities usually have lower power densities, and thus shorter battery life. These factors affect the computing efficiency of the mobile device, but so far we can only hope that the battery industry produces a battery design that balances all performance characteristics for computing systems. System researchers' hands are tied when it comes to managing the battery.

Such trade-offs also exist within a single battery. For example, when running a constant workload, the first 50% of the battery capacity can be drained in 3 hours while the remaining 50% is used in 2 hours. This is because the resistance losses inside a battery are proportional to the square of the current, and higher currents speed up the creation of fissures in the electrodes that reduce the amount of energy a battery can store; thus, the battery voltage drops at an accelerating rate as energy drains. In our current system, this voltage is used to estimate the remaining battery capacity, which results in the non-proportional discharge pattern in the example. Much power modeling work in mobile systems has noticed the same pattern yet offers no solution [126, 36].

Figure 4.9: (a) Total service times using different lithium-ion batteries. (b) Average task finish times during different discharge periods of the LiCoO2 battery.

Providing a system-level battery model is challenging, because the battery discharge curve is time-varying and depends on several main factors: 1) User behavior. The user behavior in the last discharge cycle affects the next one; the two major factors are the full capacity and the battery discharge curve. 2) Temperature. Even when running the same workload, the battery discharge curve differs under different temperatures; for example, the battery lifetime can differ by up to 3X for the same phone working at 60F versus 44F. 3) The stage of the battery discharge curve. During different stages of the discharge curve, depending on the output voltage, the input voltage of different devices varies; as a result, a static power/energy coefficient may no longer work in the same scenario. 4) The working status of the device. Most power-consuming devices in a smartphone have multiple voltage/frequency states, and those states affect the energy coefficients. For example, when the signal is bad, the voltage requested by the 3G device can be 1V higher than the normal input voltage even when the device is in the same hardware state (i.e., the busy state).

We further extend the battery study to focus on performance-related battery characteristics and discover that (1) a battery shows different performance during different discharge stages, as shown in Figure 4.9(a); and (2) resting after heavy draining can improve the duration of battery discharge cycles, as illustrated in Figure 4.9(b). Based on this study, we propose a novel battery management framework, BEAST, which manages dual batteries in a smartphone to extend the service time. BEAST contains (1) a novel battery model that provides accurate battery statistics, including the remaining battery capacity and the current battery discharge rate; and (2) a battery management facility that chooses which battery should support the current computing workload.

To the best of our knowledge, this is the first work that focuses on a comprehensive study of disproportionate battery use and proposes a dual battery design for smartphones. We evaluate BEAST using two popular Android smartphones with 12 different batteries. Our results show that simply using two batteries can slightly improve the service time, by 5%, compared to using a single battery with the same capacity as the two combined. BEAST can significantly improve the service time of devices powered by dual batteries (e.g., by 23% on average). For some network-intensive workloads, the improvement can be up to 45%, compared to the default battery management. BEAST's dual battery mechanism also has 25% lower costs than the traditional method that deploys a single large battery of one type.

Figure 4.10: (a) Aging effect: battery capacity appears to diminish at an accelerated pace after each charge/discharge cycle. (b) Digital memory effect: the smartphone appears to be fully charged long before it is truly fully charged.

4.2.1 Battery Characteristics

Non-linear Battery Discharge Curve. Modeling the behavior of batteries is complex because of non-linear effects during discharge. In the ideal case, the voltage stays constant during discharge, with an instantaneous drop to zero when the battery is empty; the ideal capacity would be constant for all discharge currents, and all energy stored in the battery would be used. For a real battery, however, the voltage slowly drops during discharge and the effective capacity is lower at high discharge currents. For constant loads, we can easily calculate the ideal battery lifetime (L) by dividing the capacity (C) by the discharge current (I): L = C/I. However, due to the rate-capacity and recovery effects, this relation does not hold for real batteries. Many models have been developed to predict real battery lifetimes under a given load, but most of them do not represent the non-linear pattern in the discharge curve, as shown in Figure 4.9.
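For intuition, a common first-order correction for the rate-capacity effect is Peukert's law; this is not the model used in this chapter, only a well-known illustration of why L = C/I overestimates lifetime at high currents (the exponent k and rated current below are placeholders):

def ideal_lifetime(capacity_mah: float, current_ma: float) -> float:
    return capacity_mah / current_ma            # hours, L = C/I

def peukert_lifetime(capacity_mah, current_ma, k=1.1, rated_ma=200.0):
    """Effective lifetime shrinks super-linearly at high discharge currents."""
    return (capacity_mah / rated_ma) * (rated_ma / current_ma) ** k

print(ideal_lifetime(2000, 400))              # 5.0 h
print(round(peukert_lifetime(2000, 400), 2))  # ~4.67 h at k = 1.1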

Recovery Effect. Other factors also affect the battery behavior. The battery lifetime mainly depends on the rate of energy consumption of the device. However, lowering the average consumption rate is not the only way to increase battery lifetime: due to nonlinear physical effects in the battery, the lifetime also depends on the usage pattern. During periods of high energy consumption, the effective battery capacity degrades and the lifetime is shortened, whereas during periods without energy consumption the battery can recover some of its capacity and the lifetime is lengthened. One illustration of this recovery effect is shown in Figure 4.9. However, this non-linear behavior is not captured by the current state-of-the-practice model, as the voltage is usually the only input for estimating the battery capacity.

Aging Effect. Aging is a concern with all lithium-ion battery chemistries. As the cell "ages", two things appear to occur from the user's standpoint: battery capacity appears to diminish, reducing the time the battery can supply current to the device; and the voltage level, while in use, appears to drop off faster, applying an ever-decreasing voltage to the circuit, as shown in Figure 4.10a. This is more noticeable for higher-current applications, as the internal resistance magnifies this voltage drop. Both phenomena cause the battery to appear to the user to have less capacity as its cells age. Battery capacity diminishes with age as the internal plates and electrolytes undergo irreversible damage caused by corrosion and other irreversible chemical processes.

Figure 4.11: The system diagram of the online phase. The shaded boxes are the components we added to the system, the grey arrows indicate the message passing for battery modeling, and the red arrows are the original message communication in the smartphone system.

Two factors contribute to classic aging of the internal battery components. First, the plates of the cells corrode, similar to steel rusting. As the plates corrode, their surface area diminishes, and the electrolyte undergoes chemical changes, causing both to become less chemically reactive. This change reduces the volume of reactive components in the cell, reducing its charge capacity. It also increases the internal resistance of the cell, as the corrosion products inhibit the free flow of electrons through the plates and electrolytes, and the diluted reaction agents make the chemical reactions less efficient. This deterioration occurs whether or not the battery is being used, and it accelerates under certain environmental conditions, discussed further below. Second, each charge/discharge cycle of the battery has a similar effect, but at an accelerated pace.

Digital Memory Effect. The digital memory effect is a failure mode that results in improper calibrations of the battery's fuel gauge being transmitted to the device. First, we must distinguish between the memory effect and the digital memory effect. The memory effect is a concept derived from cyclic memory, the notion that a battery could "remember" how much energy was used up on previous discharges. However, cyclic memory only affects nickel-cadmium batteries.

Inaccurate measurement of capacity is the only similarity between the memory effect and the digital memory effect, since the digital memory effect has nothing to do with molecular chemical change. Instead, the digital memory effect is improper calibration and reading by the device and the smartphone's battery fuel gauge, as shown in Figure 4.10b. As alluded to above, the fuel gauge integrated circuitry calculates the remaining battery capacity and transmits that calculation to the device operating system through the SMBus connectors. The fuel gauge also stores the present cell capacity characteristics and application parameters in its on-chip EEPROM (electrically erasable programmable read-only memory). The calculated capacity register holds a conservative estimate of the amount of charge that can be removed given the current temperature, discharge rate, stored charge and application parameters. The capacity estimate is then reported to the device as remaining capacity and percentage of full charge. Sometimes, however, the reported information is incorrect: the incorrect report of remaining capacity and percentage is caused by the fuel gauge not recalibrating its circuitry automatically. The digital memory effect is thus a false reading of the maximum capacity and results in a lower run time of the battery.

Figure 4.12: The system diagram of the "offline" phase. All notations are the same as in Figure 4.11.

When we look at the problem of managing the battery in smartphones, the problem does not rest solely on battery physics, the voltage sensors, the power gauge chips, the battery drivers, the power management system file in the Linux kernel, and/or the battery management unit in the core library. The problem lies along the whole path starting from the battery itself and ending at the software applications. Our framework targets fixing the problem across the related components in the hierarchy of the smartphone system, each of which addresses one or two sources of the battery deficiency problem.

Battery Modeling Process. The battery modeling design contains two stages: offline and online. In the offline stage, we calibrate the remaining capacity (RC) estimation model, build the battery capacity curve, decide whether to re-initialize the battery, and estimate the full capacity after the charging process is done. In the online stage, we collect runtime system calls and device utilization information to estimate the runtime current demand. Combining the estimated current information with the detected voltage information, we obtain the past energy usage and the current state of the battery. Using this information, we can tune the coefficients in the battery model at runtime and better estimate the remaining capacity of the current battery.

Figure 4.11 shows the system diagram of our online stage. We place two message detectors, in the application framework and the operating system respectively, to collect application activities and system call information. Mostly, the collected information is used to model the current of the devices that are not utilization-based, such as the screen, the user interaction detector, GPRS, etc. Another detector in the system kernel periodically fetches the system utilization information from components such as the CPU, GPU, etc. Combining these two, we obtain a linear model that represents the system's demand for power, in terms of current. On the other hand, we also collect information from the local power sensors, which are normally voltage sensors. Based on this combined information, we can estimate the runtime power usage of the current system.
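A minimal sketch of this bookkeeping follows, with hypothetical device names and coefficients (the real coefficients come from calibration, and the current model mixes utilization-based and event-based terms as described above):

def estimate_current_ma(utilization: dict, activity: dict, coeffs: dict) -> float:
    """Linear demand model over utilization-based and event-based components."""
    i = sum(coeffs[dev] * u for dev, u in utilization.items())   # CPU, GPU, ...
    i += sum(coeffs[dev] * n for dev, n in activity.items())     # screen, radio, ...
    return i

def update_remaining_mah(rc_mah: float, current_ma: float, dt_h: float) -> float:
    """Coulomb counting over one sampling period of dt_h hours."""
    return max(0.0, rc_mah - current_ma * dt_h)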

The offline stage, shown in Figure 4.12, is simpler than the online stage but equally important. We use this stage to detect the true fully-charged state of the battery. Then, based on the usage history and the current full capacity, the model learns and recalibrates the battery coefficients for the next discharge cycle. Therefore, we can estimate the battery lifetime and decide whether we need to reset the system status file and the power gauge IC to remove the memory effect.

However, there are still many important pieces left to implement. Our current plan is to build a system that meets the following design: 1) when the battery is in the online stage, the system starts to build the battery charge curve and decides which discharge state the battery is in; 2) during the charging process, the implementation calculates the average usage across different apps as a recursive learning process, updating the battery coefficients for the next discharge cycle based on the usage from the last one.

Dual Battery Management. As aforementioned, the two key factors in choosing the right battery for the current workload are the current energy demand, in terms of output current, and the duration of the resting period for self-recharge. The decision process can be formulated as follows:

B_{1,2} = \begin{cases} B_1, & \text{if } C(B_1) > C(B_2) \\ B_2, & \text{otherwise} \end{cases}    (4.8)

where B_{1,2} is the decision of choosing battery B_1 or battery B_2, and C(B_1), C(B_2) are the battery saving potentials of batteries B_1 and B_2, respectively. The battery saving potential of choosing a battery is a weighted linear combination of its output current and its current resting time:

C(B) = \alpha \, Current(B) + \beta \, T(B)    (4.9)

where Current(B) is the output current of battery B, and T(B) is the self-charging potential from the elapsed resting time of battery B if it was not selected in the previous decision period. Note that T(B) is a non-linear function of the elapsed resting time; we discuss the resting time further in the evaluation. In Equation 4.9, α and β are cost coefficients that depend on the battery chemistry.
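The decision rule of Equations (4.8) and (4.9) can be transcribed directly; the saturating form of T(B) below is only an illustrative stand-in for the non-linear self-recharge term, and all numbers are placeholders:

import math

def saving_potential(out_current_ma, rest_s, alpha, beta):
    rest_gain = 1.0 - math.exp(-rest_s / 300.0)   # illustrative saturating T(B)
    return alpha * out_current_ma + beta * rest_gain

def pick_battery(b1, b2, alpha, beta):
    """Each battery is a (deliverable_output_current_mA, rest_time_s) tuple."""
    c1 = saving_potential(*b1, alpha, beta)
    c2 = saving_potential(*b2, alpha, beta)
    return "B1" if c1 > c2 else "B2"

# A well-rested battery can beat one with a higher deliverable current:
print(pick_battery((450.0, 0.0), (300.0, 600.0), alpha=0.002, beta=1.0))  # -> "B2"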

4.2.2 Evaluation

In this subsection, we discuss our experimental setup and empirical results for battery estimation and for using different batteries. All experiments run on two popular Android smartphones, Nexus 4 and Nexus 5. We use 12 batteries of different ages, from brand new to 2 years old, with capacities ranging from 1800 mAh to 3400 mAh. These batteries can also be divided by their chemical compound; we choose popular lithium-ion compounds suitable for smartphone devices, such as LiCoO2, LiMnO3, and LiFePO4. We use a multimeter and a battery meter to provide the ground truth of the smartphone's energy demand and battery use at runtime. To examine the performance of BEAST, we use the following setup:

Baselines: To understand the impact of BEAST on solving battery deficiency, we implement the following battery management baselines:

• Default implements the default battery assignment: no switching until one battery is used up.

• BEAST implements our full solution, including the online/offline stages and the application-level estimation.

• RR (round-robin) is a dual-battery management baseline that switches between batteries at a fixed frequency.

• Oracle is our offline analysis based on collected system data.

Benchmarks: To evaluate the performance of our battery models, we run well-known standard benchmark tools to measure estimation performance and overhead.

• AnTuTu is a performance benchmark covering all hardware devices on the smartphone.

• GeekBench is a CPU/memory performance benchmark. It runs both CPU- and memory-intensive jobs to test single-core performance and memory throughput.

• PhoneCall is a tool from the Google Android test suite for testing the radio device.

• WebBrowse is a tool from the Google Android test suite for testing the network device, i.e., WiFi.

Estimation Performance Comparison. Figure 4.13 illustrates estimation samples at runtime. BEAST captures accurate battery information because it considers two

important battery characteristics: self-charging, as shown in Figure 4.13(a), and the non-linear discharge behavior at the beginning and the end of the discharge curve, as shown in

Figure 4.13(b). The red curve in Figure 4.13 demonstrates the deficiency of using smartphone power modeling to estimate battery statistics. Although a power model can estimate the real energy use, it fails to capture the self-charging effect: the battery discharge curve is not monotonically decreasing when the battery has chances to rest. As a result, the PowerTutor baseline is inaccurate and also has a large overhead, since it requires a fine-grained sampling frequency. BEAST significantly outperforms Default and BattTracker because it both has the advantage of a temperature-aware battery model, as in BattTracker, and captures runtime changes in battery statistics, which neither baseline does.

As a result, BEAST provides accurate battery statistics estimation. Figure 4.14 gives an overall estimation accuracy comparison among the baselines using four different workloads. BEAST has the highest accuracy and the smallest variation in all four workloads. BattTracker can provide an accurate battery profile, but its variation is high for the reasons mentioned above. PowerTutor achieves good performance on CPU-intensive workloads, which are favored by its power models; this accuracy decreases when more network-intensive jobs are involved, as in PhoneCall and WebBrowse.

It is necessary to build battery models directly on battery information instead of relying on power modeling. As shown in Figure 4.15, because PowerTutor has to examine the system states and per-component utilization, the overhead of such estimation is 3X to 4X larger than that of battery models.

[Figure 4.13 appears here: two panels plotting remaining battery capacity (mAh) over time (periods) for GroundTruth, BattTracker, PowerTutor, Beast, and Default.]

Figure 4.13: (a) Battery estimation when self-charging happens; (b) battery estimation at the tails of the discharge curve. The GroundTruth baseline is the monitored data from the battery meter. Default is the original BattStat read from the Android system and converted into real volume (mAh). BattTracker is recently published work on modeling remaining battery capacity.

Battery Use. As mentioned above, using mixed batteries can significantly improve the battery life of mobile devices. Our online battery assignment depends on the current battery status and the workload demand. Figure 4.16 shows how BEAST reacts to battery state changes and workload changes. We use two methods to manage the two batteries in the system. The first method is Default: use one battery until it is drained, then switch to the next, as shown by the blue curve in Figure 4.16. The other method is based on

[Figure 4.14 appears here: battery estimation accuracy (%) of BEAST, BattTracker, Default, and PowerTutor on AnTuTu, GeekBench, PhoneCall, and WebBrowse.]

Figure 4.14: Estimation accuracy comparison across benchmarks.

the decisions from BEAST. Clearly, BEAST outperforms Default, extending the service time of the target device by approximately 48%, from 813 minutes to 1195 minutes.

The battery assignment decision follows Equations 4.8 and 4.9, as discussed earlier. The energy demand (e.g., in terms of current) and the variation of the historical energy demand are the two variables affecting the saving potential of choosing a battery.

Figure 4.17 shows how the decision changes with the workload. In Figure 4.17(a), before the demand increases to 500mA, B1 is mostly used by BEAST; BEAST only switches to B2 when B2 has rested so long that the leakage effect begins to dominate the self-charging effect.

The variation of the energy demand affects BEAST's switching frequency, as demonstrated in Figure 4.17(b). Before the 43-minute mark, BEAST switches batteries every 2 to 3 minutes on average. After that, it switches at a lower frequency as the variation decreases.

[Figure 4.15 appears here: modeling overhead (%) of BEAST, BattTracker, Default, and PowerTutor on AnTuTu, GeekBench, PhoneCall, and WebBrowse.]

Figure 4.15: Overhead comparison across benchmarks.

To pick the resting time for a battery, BEAST considers the energy drained from the battery in the last active period and the duration of resting, so as to avoid the negative impact of leakage. The resting time varies as these two variables change. Figure 4.18 presents the resulting resting-time curves. When the battery has been discharged by 100mAh in the last active period, it can recover 20mAh in 5 minutes, after which the gain decreases as leakage sets in. When the battery has been discharged by 500mAh, resting 10 minutes is the better option, as shown in Figure 4.18.
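The exact functional form of this tradeoff is not given above, but a saturating recovery term minus a linear leakage term is one plausible fit to the behavior in Figure 4.18; the constants below are assumed for illustration only.

import math

def self_charge_gain_mah(drained_mah, rest_min,
                         recovery_frac=0.2, tau_min=3.0, leak_mah_per_min=1.0):
    # Net capacity recovered after resting rest_min minutes. Recovery scales
    # with the charge drained in the last active period and saturates with
    # time constant tau_min; leakage grows linearly with resting time.
    recovery = recovery_frac * drained_mah * (1.0 - math.exp(-rest_min / tau_min))
    return recovery - leak_mah_per_min * rest_min

# Under these assumed constants, the net gain for a 100mAh drain peaks after
# about 6 minutes, while a 500mAh drain keeps gaining for about 11 minutes,
# qualitatively matching the shape of Figure 4.18.
for drained in (100.0, 500.0):
    best = max(range(1, 21), key=lambda t: self_charge_gain_mah(drained, t))
    print(drained, best, round(self_charge_gain_mah(drained, best), 1))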

The duration of resting time is a key factor in the battery assignment decision. With a simple switching policy such as round-robin (RR), if we could find the optimal switching frequency, we could maximize the gain from self-charging. However, finding the optimal frequency is difficult, and a static switching frequency is not an option. Figure 4.19 shows the

[Figure 4.16 appears here: remaining battery capacity (mAh) over time (minutes) for B1 and B2 under Beast versus the single battery under Default.]

Figure 4.16: Service time comparison of using mixed batteries on Nexus 5.

service time under different round-robin switching frequencies. When the workload changes, the optimal switching frequency changes. For example, when running AnTuTu, a 2-minute round-robin period is best, while for the PhoneCall workload a 10-minute switching period is best.
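For contrast with BEAST's demand-aware decision, the RR(k) baseline can be stated in a few lines; the helper below is hypothetical and only illustrates why a static period falls short.

def rr_battery(elapsed_min, period_min):
    # Index (0 or 1) of the battery active at time elapsed_min under RR(period).
    return (elapsed_min // period_min) % 2

# RR(2) alternates every 2 minutes, RR(10) every 10; neither adapts when the
# workload (and thus the best period) changes.
print([rr_battery(t, 2) for t in range(0, 12, 2)])    # [0, 1, 0, 1, 0, 1]
print([rr_battery(t, 10) for t in range(0, 60, 10)])  # [0, 1, 0, 1, 0, 1]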

Service Time. Figure 4.20 shows discharge curves under different battery management policies. Default is the worst policy, since it pays no attention to which battery to use. RR, with a 10-minute switching period, performs slightly better than Default.

BEAST significantly outperforms the previous two baselines and is close to the offline optimum, Oracle. In this experiment, BEAST extends the service time of the device by

43% and 26%, compared to Default and RR.

Finally, we show the overall results of using BEAST to extend the service time of the Nexus 5 under different benchmarks in Figure 4.21. BEAST achieves the performance closest to Oracle in all four workloads. In some workloads, such as GeekBench, BEAST performs similarly to RoundRobin; this is because the workload variation is small and the switching period picked by RR happens to be the best for this scenario. On average, BEAST achieves 27% and 13% longer service time than Default and RR, respectively. In network-intensive workloads with unpredictable user behavior, such as PhoneCall and WebBrowse, BEAST achieves up to 40% and 28% more service time than Default and RR, respectively.

Impact of Battery Characteristics. Two more factors are important in modeling and managing batteries: age and temperature. Figure 4.22(a) shows the impact of battery aging on the two variables in our decision model, maximum current output and the self-charging effect. The output current decreases slowly as the battery ages, but aging greatly reduces the benefit from self-charging: when the battery is 30 months old, the self-charging effect is almost zero. Figure 4.22(b) presents the impact of ambient temperature on service time; the results are normalized to the extended service time in the best case, and the curves represent different batteries. Clearly, either too hot or too cold is bad for extending service time. The best temperature for dual-battery management is around 75 degrees Fahrenheit, and the gain diminishes with age.

154 4.3 Discussion

Compared to high-end computing systems, such as data servers, edge devices are more constrained by their energy storage and thus require specific energy modeling/management frameworks for their data services. Our approach provides runtime energy modeling and battery modeling based on information collectible from the devices' own systems. This system design has shown the following features:

• Traditional modeling methods based on system statistics and/or application-specific information are limited in providing accurate runtime energy profiles.

• Our model features a meter-free dynamic methodology that uses performance counters to capture relevant hardware events.

• Battery modeling requires a unique understanding of battery characteristics, yet can be simple and low-overhead.

• Dual-battery design can significantly increase the service time of the device because (1) a battery performs differently in different discharge stages, and (2) resting after heavy draining can improve the duration of battery discharge cycles. Exploiting these two facts can extend the service time of the device by up to 48%.

[Figure 4.17 appears here: two panels showing remaining battery capacity (mAh) of B1 and B2 under Beast together with the workload's energy demand (mAh), over time (minutes).]

Figure 4.17: Runtime battery assignment under workload change.

[Figure 4.18 appears here: self-charging gain (mAh) over resting time (minutes) for discharge amounts of 100mAh through 500mAh.]

Figure 4.18: Self-charging gain under various discharge amount and resting time.

[Figure 4.19 appears here: total service time (minutes) of RR(2), RR(5), RR(10), and RR(20) on AnTuTu, GeekBench, PhoneCall, and WebBrowse.]

Figure 4.19: Service time comparison for different round-robin periods. For example, RR(2) has a 2-minute switching period.

[Figure 4.20 appears here: remaining battery capacity (mAh) over time (minutes) for B1 and B2 under RR, BEAST, and Oracle, and for Default.]

Figure 4.20: Service time comparison with several baselines on Nexus 5.

[Figure 4.21 appears here: service time (minutes) of BEAST, Default, RoundRobin, and Oracle on AnTuTu, GeekBench, PhoneCall, and WebBrowse.]

Figure 4.21: Service time comparison using different benchmarks on Nexus 5.

[Figure 4.22 appears here: (a) normalized service time gain (%) versus battery age (months) for max current and self-charging; (b) normalized service time gain (%) versus temperature (Fahrenheit).]

Figure 4.22: The impact of aging and temperature on the performance gain on the Nexus 5. Figure 4.22(b) shows performance powered by four different batteries, in ascending order of age.

Chapter 5: Related Work

Our proposed energy modeling framework shall consider both the edge and the cloud.

For the cloud, the framework shall reduce the energy consumption of processing any data objects, such as structured data stored in a database, on any given server. Further, the framework shall provide flexible data storage across multiple sites without incurring the additional costs of data migration and full replication, contributing to the body of knowledge in data replication systems. For the smartphone, the framework can model the energy profile of data processing in a battery-powered system. Below, we discuss related work in each of these areas.

5.1 Energy Modeling in Data Management Systems


Energy modeling in operating systems: many works propose to treat energy as a first-class resource in the operating system, such as [31, 124, 23]. The first formal analytic energy model in the operating system was proposed by Heath et al. [48]. Since then, many other publications have provided various models for energy optimization in the operating system, such as [73, 109, 112, 29, 14]. Compared with our work, all such solutions focus mainly on the operating system level. As a result, none of these models can be directly applied in the DBMS, due to the lack of knowledge of the DBMS's resource needs and data processing patterns.

Energy management in database systems: work in energy-efficient database systems can be traced back to the early 1990s. In [18], query optimization with energy as the performance criterion is proposed within the context of mobile databases; in this dissertation, we are interested in the energy consumption of servers connected to the power grid. Motivated by the increasing energy-related cost of database servers, the database community has only recently identified building energy-efficient database systems as an important direction of exploration [15]. Two articles [44, 46] emphasize query optimization with energy as the target, which implicitly argues for a mechanism for estimating the energy cost of a query plan. Supported by initial experimental results, [63] presents two specific techniques to save energy in databases: tuning CPU frequency and rescheduling user queries. Our previous work [120] reveals the existence of many energy-efficient query plans that carry

minimum performance penalty. By showing that some plans of high energy efficiency coincide with high performance, a subsequent report [107] stirred up discussion on whether energy-aware query optimization is a worthwhile approach to green databases. Our opinion is that, when the search space is sufficiently big and power/performance estimations are accurate enough, we will find energy-efficient plans that would most likely be ignored by existing query optimizers. This standpoint is supported by more recent evidence provided by [61] and [59], and verified by our experimental results. Other related research in green databases diverges in several directions. The Transaction Performance Council (TPC) officially announced TPC-Energy [86] in 2007. Poess and Nambiar

[82, 83, 85] report extensive experimental results on power consumption patterns in typical commercial database servers.

Modeling power in databases: it is worth noting that power modeling has been addressed in some of the work mentioned above. As a position paper, [61] proposes a general formula for quantifying the power cost of a query plan. [59] delivers more comprehensive results in modeling the peak power of database operations. As power and energy are very different concepts, the modeling processes (and consequently the models) are also different.

Our work focuses on building physical models of energy consumption, based on systematic studies with extensive experiments. It significantly improves the static model idea presented in [120] and makes it a practical framework for database energy estimation. Aiming at a robust solution that delivers high accuracy in realistic database

environments, we use a dynamic modeling approach that continuously updates the key parameters of our model so that it adapts quickly to dynamics in the system and workload.

5.2 Distributed Data Replication

Internet data services strive to provide high availability and low response times. Replicating data to multiple sites improves availability; fairly distributing workload across sites lowers response times. We discuss related work focusing on two major features: consistency and cost optimization.

Consistency: Many recent works use consistent hashing for geo-diverse services. Paiva et al. add locality-aware placement for objects accessed together, which achieves a 6X speedup [80]. ChainReaction improves concurrency by allowing write operations to complete before all replicas are updated [17]. Transactional consistency is convenient for application programmers to use but challenging for database programmers to scale. Eventual consistency is easier to scale but inconvenient to use [34]. Causal consistency falls between these extremes, allowing programmers to reason about the order in which updates are applied while only passively exchanging data [70]. Our framework shall support all of these consistency settings through chain replication in the system design.

Cost Optimization: Distributed data stores form a network of sites that maintain replicas of information. Some storage systems expose rich query abilities, such as Google's

BigTable, while most systems support key-value store semantics (e.g., Amazon's Dynamo and Microsoft's Azure storage [34, 75]). This classic infrastructure ensures the availability of serving data across multiple geo-diverse regions, but sacrifices the performance of high-

volume read/write accesses under consistency constraints (e.g., FIFO/eventual consistency). As proved by the CAP theorem [27], the tradeoff between availability and consistency always exists in such distributed systems. It is non-trivial to build a low-energy-cost distributed data store under those traditional requirements. For example, SPANStore replicates data to sites with low prices for storage and network transfer [115], and migrates data when prices change significantly. Like our framework, it models the cost of replication and finds optimal policies. However, our work considers energy consumption, which is more closely tied to the processing behavior.

5.3 Energy Modeling in Smartphones

Energy Modeling for Web Activities. Zhao et al. [127] first identified the web browser as an energy-heavy application on mobile phones and proposed an offload mechanism to reduce this energy usage on the client side. Later, they proposed an estimation model of web loading time to fully utilize the power states of the 3G interface for energy savings [128]. Thiagarajan et al. [103] measured the power consumed by loading different web components using external power meters. Zhu et al. [129] estimated the energy of processing parsed web components and proposed big/small-core scheduling to reduce the processing energy while meeting the display deadline. In contrast, REEWA features a meter-free dynamic methodology that uses runtime hardware events to precisely estimate browsing energy. As a result, REEWA can estimate the energy consumption of the entire execution path of web browsing, as shown in Figure 4.1.

Hardware Counter-based Energy Estimation. Hardware counters have traditionally been used to monitor the performance of hardware [19, 22], operating systems [24, 67, 97,

94], and workloads [51, 66]. Bellosa [19] proposed a power estimation model as a linear combination of hardware counters. Other studies extended the model to specific hardware components, such as Intel Sandy Bridge [90]. The most recent work, Power Containers [94], extended this model to estimate and contain the power consumption of a process on a multi-core server. However, those models cannot be directly used and implemented on battery-powered devices. On such a device, some hardware counters readable on a regular server may not be available, such as the memory reference counters on the Nexus 4

(discussed earlier). Further, this energy-stressed platform requires a more restricted, lightweight design in both energy and time for practical deployment. Unlike servers, where CPU and memory usually dominate the active energy consumption [97], the network and screen contribute almost as much energy consumption as the CPU and memory on battery-powered devices like smartphones. Thus, previous work that does not consider those parts fails to give precise energy estimates on battery-powered devices. By addressing those design concerns, REEWA presents one of the first studies that use hardware counters for energy estimation on battery-powered systems.

Chapter 6: Conclusion

In this thesis, we describe our initial design of an energy conservation framework for computing devices in the mobile cloud architecture. We discuss our previous work on optimizing data services on the server side, from a single node to multiple nodes, and present our energy modeling and management work on the smartphone side. We hope to build an agile and flexible framework that can accurately account for runtime energy consumption and upon which efficient energy conservation techniques can be built. By adopting our techniques, we can safely say that we reduce the energy consumption of serving data by up to 50% on the service side, and extend device service time at the edge by 48%. Combining the gains along the different tiers of the mobile cloud architecture, we can significantly improve the energy efficiency of data services, and thus avoid a dramatic increase in the electricity consumption of data centers and edge devices in the future.

Bibliography

[1] 1998 World Cup Workload. http://goo.gl/cDaXQc.

[2] Amazon app. https://play.google.com/store/apps/details?id=com.amazon.android&hl=en.

[3] Amazon EC2 spot instances. http://aws.amazon.com/ec2/purchasing-options/spot-instances/ (Last visited in July 2015).

[4] Amazon ElastiCache. http://aws.amazon.com/elasticache/ (Last visited in July 2015).

[5] Android defined activity. http://developer.android.com/reference/android/app/Activity.html.

[6] Data Center Efficiency Assessment. https://www.nrdc.org/energy/data-center-efficiency-assessment-IP.pdf.

[7] Google Cluster Service Traces. code.google.com/p/googleclusterdata.

[8] MemCachier Status. http://status.memcachier.com/.

[9] Monsoon power monitor. http://www.msoon.com.

[10] PageSpeed Insights. https://developers.google.com/speed/docs/insights/mobile?hl=en.

[11] Watts up power meter. www.wattsupmeters.com.

[12] Web activities. https://developer.mozilla.org/en-US/docs/Web/API/Web_Activities.

168 [13] Amazon’s ’dirty cloud’ criticised in greenpeace report. http://www.bbc.com/ news/technology-26867362, 2014.

[14] Zahra Abbasi, Georgios Varsamopoulos, and Sandeep K. S. Gupta. Tacoma: Server and workload management in internet data centers considering cooling-computing power trade-off and energy proportionality. ACM Trans. Archit. Code Optim., 9(2):11:1–11:37, June 2012.

[15] Rakesh Agrawal, Anastasia Ailamaki, and et al. The Claremont Report on Database Research. Communications of ACM, 52:56–65, June 2009.

[16] Faraz Ahmad and T. N. Vijaykumar. Joint optimization of idle and cooling power in data centers while maintaining response time. SIGARCH Comput. Archit. News, 38(1):243–256, 2010.

[17] Sérgio Almeida, João Leitão, and Luís Rodrigues. ChainReaction: A causal+ consistent datastore based on chain replication. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pages 85–98, New York, NY, USA, 2013. ACM.

[18] Rafael Alonso and Sumit Ganguly. Energy Efficient Query Optimization. Technical report, Matsushita Info Tech Lab, 1992.

[19] Frank Bellosa. The benefits of event-driven energy accounting in power-sensitive systems. In Proc. of the ACM SIGOPS European Workshop '00, pages 37–42.

[20] Luca Benini, Alessandro Bogliolo, and Giovanni De Micheli. A survey of design techniques for system-level dynamic power management. IEEE Transactions on VLSI Systems, 8:299–316, 2000.

[21] Andreas Berl, Erol Gelenbe, Marco Di Girolamo, Giovanni Giuliani, Hermann de Meer, Minh Quan Dang, and Kostas Pentikousis. Energy-efficient cloud computing. Comput. J., 53(7):1045–1051, 2010.

[22] Ramon Bertran, Marc Gonzalez, Xavier Martorell, Nacho Navarro, and Eduard Ayguade. Decomposable and responsive power models for multicore processors using performance counters. In Proc. of ICS’10, pages 147–158.

[23] Ricardo Bianchini and Ram Rajamony. Power and energy management for server systems. Computer, 37(11):68–74, November 2004.

[24] W.L. Bircher and L.K. John. Complete system power estimation using processor performance events. IEEE Transactions on Computers, 61(4):563–577, 2012.

[25] Pat Bohrer, Elmootazbellah N. Elnozahy, and et al. Power Aware Computing, chapter The case for power management in web servers, pages 261–289. Kluwer Academic Publishers, Norwell, MA, USA, 2002.

[26] Daniel Bovet and Marco Cesati. Understanding the Linux Kernel. 2005.

[27] E. Brewer. CAP twelve years later: How the "rules" have changed. Computer, 45(2):23–29, Feb 2012.

[28] Aaron Carroll and Gernot Heiser. An analysis of power consumption in a smartphone. In Proc. of USENIX ATC'10, pages 21–21.

[29] John B. Carter and Karthick Rajamani. Designing energy-efficient servers and data centers. IEEE Computer, 43(7):76–78, 2010.

[30] Bharat Chandramouli, Wilson C. Hsieh, John B. Carter, and Sally A. McKee. A cost model for integrated restructuring optimizations. J. Instruction-Level Parallelism, 5, 2003.

[31] Jeffrey S. Chase, Darrell C. Anderson, Prachi N. Thakar, Amin M. Vahdat, and Ronald P. Doyle. Managing energy and server resources in hosting centers. SIGOPS Oper. Syst. Rev., 35(5):103–116, October 2001.

[32] Surajit Chaudhuri. An overview of query optimization in relational systems. In PODS, pages 34–43, 1998.

[33] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. SIGOPS Oper. Syst. Rev., 2007.

[34] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. In ACM SIGOPS Operating Systems Review, volume 41, pages 205–220. ACM, 2007.

[35] Qingyuan Deng, David Meisner, Luiz Ramos, Thomas F. Wenisch, and Ricardo Bianchini. MemScale: active low-power modes for main memory. SIGPLAN Not., 46(3):225–238, March 2011.

[36] Ning Ding, Daniel Wagner, Xiaomeng Chen, Abhinav Pathak, Y. Charlie Hu, and Andrew Rice. Characterizing and modeling the impact of wireless signal strength on smartphone battery drain. In Proc. of SIGMETRICS '13, pages 29–40.

[37] Mian Dong and Lin Zhong. Self-constructive high-rate system energy modeling for battery-powered mobile systems. In Proc. of MobiSys'11, pages 335–348.

[38] Mian Dong and Lin Zhong. Chameleon: A color-adaptive web browser for mobile oled displays. IEEE Trans. Mob. Comput., 11(5), 2012.

[39] G. F. Franklin, J. D. Powell, and M. L. Workman. Digital Control of Dynamic Systems. Addison-Wesley, 1990.

[40] Ayalvadi Ganesh, Sarah Lilienthal, D. Manjunath, Alexandre Proutiere, and Florian Simatos. Load balancing via random local search in closed and open systems. SIGMETRICS Perform. Eval. Rev., 2010.

[41] A. Gelfond. Tripadvisor architecture - 40m visitors, 200m dynamic page views, 30tb data. http://highscalability.com, June 2011.

[42] Iñigo Goiri, Kien Le, Thu D. Nguyen, Jordi Guitart, Jordi Torres, and Ricardo Bianchini. GreenHadoop: leveraging green energy in data-processing frameworks. In Proc. of EuroSys, pages 57–70, 2012.

[43] Google. Green datacenters at Google. http://www.google.com/green/energy/.

[44] Goetz Graefe. Database servers tailored to improve energy efficiency. In Proc. of EDBT Workshop, SETMDM ’08, pages 24–28, 2008.

[45] Grand View Research, Inc. In-memory computing (IMC) market analysis, market size, application analysis, regional outlook, competitive strategies and forecasts, 2014.

[46] Stavros Harizopoulos, Mehul A. Shah, Justin Meza, and Parthasarathy Ranganathan. Energy efficiency: The new holy grail of data management systems research. In CIDR, 2009.

[47] Xin He, Prashant Shenoy, Ramesh Sitaraman, and David Irwin. Cutting the cost of hosting online services using cloud spot markets. In Proc. of HPDC, 2015.

[48] Taliver Heath, Bruno Diniz, Enrique V. Carrera, Wagner Meira, Jr., and Ricardo Bianchini. Energy conservation in heterogeneous server clusters. In Proceedings of PPoPP'05, pages 186–195, New York, NY, USA, 2005. ACM.

[49] Joseph L. Hellerstein, Yixin Diao, Sujay Parekh, and Dawn M. Tilbury. Feedback Control of Computing Systems. John Wiley & Sons, 2004.

[50] Hidden for blind review. Model evaluation of PAT, a comprehensive study. Technical report.

[51] C. Isci and M. Martonosi. Phase characterization for power: evaluating control-flow-based and event-counter-based techniques. In Proc. of ISCA'06, pages 121–132.

[52] Canturk Isci and Margaret Martonosi. Runtime power monitoring in high-end processors: Methodology and empirical data. In Proc. of the MICRO, pages 93–105, 2003.

[53] Aman Kansal, Feng Zhao, Jie Liu, Nupur Kothari, and Arka A. Bhattacharya. Virtual machine power metering and provisioning. In SoCC, pages 39–50, 2010.

[54] David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, and Daniel Lewin. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing, 1997.

[55] Dongwon Kim, Wonwoo Jung, and Hojung Cha. Runtime power estimation of mobile AMOLED displays. In Proc. of DATE’13.

[56] Robert Kleinberg. A multiple-choice secretary algorithm with applications to online auctions. In Proc. of SODA. Society for Industrial and Applied Mathematics, 2005.

[57] Robert Kooi. The Optimization of Queries in Relational Databases. PhD thesis, 1980.

[58] Jonathan G Koomey. Worldwide electricity used in data centers. Environmental Research Letters, 3(3), 2008.

[59] Mayuresh Kunjir, Puneet Birwa, and Jayant Haritsa. Peak Power Plays in Database Engines. In Proc. of the EDBT, 2012.

[60] Avinash Lakshman and Prashant Malik. Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev., 2010.

[61] Willis Lang, Ramakrishnan Kandhan, and Jignesh M. Patel. Rethinking Query Processing for Energy Efficiency: Slowing Down to Win the Race. IEEE Data Engineering Bulletin, 34(1):12–23, 2011.

[62] Willis Lang, Ramakrishnan Kandhan, and Jignesh M. Patel. Rethinking query processing for energy efficiency: Slowing down to win the race. IEEE Data Eng. Bull., 34(1):12–23, 2011.

[63] Willis Lang and Jignesh M. Patel. Towards eco-friendly database management systems. In CIDR, 2009.

[64] J. B. Lawrie. A brief historical perspective of the Wiener–Hopf technique, 2007.

[65] Kien Le, Ricardo Bianchini, Thu D. Nguyen, Ozlem Bilgir, and Margaret Martonosi. Capping the brown energy consumption of internet services at low cost. In Proc. of IGCC, pages 3–14, 2010.

[66] Adam Lewis, Soumik Ghosh, and N.-F. Tzeng. Run-time energy consumption estimation based on workload in server systems. In Proc. of HotPower'08, pages 4–4.

[67] Tao Li and Lizy Kurian John. Run-time modeling and estimation of operating system power consumption. In Proc. of SIGMETRICS'03, pages 160–171.

[68] Minghong Lin, Zhenhua Liu, Adam Wierman, and Lachlan L. H. Andrew. Online algorithms for geographical load balancing. In Proceedings of the 2012 International Green Computing Conference (IGCC), IGCC '12, pages 1–10, Washington, DC, USA, 2012. IEEE Computer Society.

[69] Zhenhua Liu, Minghong Lin, Adam Wierman, Steven H. Low, and Lachlan L.H. Andrew. Greening geographical load balancing. ACM SIGMETRICS, 2011.

[70] Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, and David G. Andersen. Don’t settle for eventual: Scalable causal consistency for wide-area storage with cops. In Symposium on Operating Systems Principles (SOSP 11), 2011.

[71] Mary Meeker and Liang Wu. Internet trends. http://www.kpcb.com/insights/2012-internet-trends.

[72] David Meisner, Brian T. Gold, and Thomas F. Wenisch. PowerNap: eliminating server idle power. In Proc. of the ASPLOS'09, pages 205–216, New York, NY, USA, 2009. ACM.

[73] David Meisner, Christopher M. Sadler, Luiz André Barroso, Wolf-Dietrich Weber, and Thomas F. Wenisch. Power management of online data-intensive services. In Proc. of the ISCA, pages 319–330, New York, NY, USA, 2011. ACM.

[74] David Meisner and Thomas F. Wenisch. Dreamweaver: architectural support for deep sleep. In ASPLOS, pages 313–324, 2012.

[75] Microsoft. Microsoft’s azure data storage for mobile apps. http://azure. microsoft.com/en-us/services/storage/.

[76] R. Miller. Where Amazon's data centers are located, 2008. http://www.datacenterknowledge.com.

[77] Monsoon. Monsoon power monitor. http://www.msoon.com.

[78] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, and Venkateshwaran Venkataramani. Scaling Memcache at Facebook, 2013.

[79] Pradeep Padala, Kai-Yuan Hou, Kang G. Shin, Xiaoyun Zhu, Mustafa Uysal, Zhikui Wang, Sharad Singhal, and Arif Merchant. Automated control of multiple virtualized resources. In EuroSys, pages 13–26, 2009.

[80] João Paiva, Pedro Ruivo, Paolo Romano, and Luís Rodrigues. Autoplacer: Scalable self-tuning data placement in distributed key-value stores. In USENIX ICAC, 2013.

[81] Abhinav Pathak, Y. Charlie Hu, Ming Zhang, Paramvir Bahl, and Yi-Min Wang. Fine-grained power modeling for smartphones using system call tracing. In Proc. of EuroSys'11, pages 153–168.

[82] Meikel Poess and Raghunath Othayoth Nambiar. Energy cost, the key challenge of today’s data centers: a power consumption analysis of TPC-C results. PVLDB, 1(2):1229–1240, 2008.

[83] Meikel Poess and Raghunath Othayoth Nambiar. Tuning servers, storage and database for energy efficient data warehouses. In ICDE, 2010.

[84] Meikel Poess and Raghunath Othayoth Nambiar. Tuning servers, storage and database for energy efficient data warehouses. In ICDE, pages 1006–1017, 2010.

[85] Meikel Poess and Raghunath Othayoth Nambiar. Power Based Performance and Capacity Estimation Models for Enterprise Information Systems. IEEE Data Engineering Bulletin, 34(1):34–49, 2011.

[86] Meikel Poess, Raghunath Othayoth Nambiar, Kushagra Vaid, John M. Stephens, Karl Huppler, and Evan Haines. Energy benchmarks: a detailed analysis. In e-Energy, pages 131–140, 2010.

[87] PostgreSQL. http://www.postgresql.org/.

[88] Asfandyar Qureshi, Rick Weber, Hari Balakrishnan, John Guttag, and Bruce Maggs. Cutting the electric bill for internet-scale systems. SIGCOMM Computer Communication Review, 39(4):123–134, 2009.

[89] Richard B. Alley, Terje Berntsen, Nathaniel L. Bindoff, et al. Summary for policymakers: Climate change 2014, mitigation of climate change. IPCC, 2014.

[90] E. Rotem, A. Naveh, D. Rajwan, A. Ananthakrishnan, and E. Weissmann. Power-management architecture of the Intel microarchitecture code-named Sandy Bridge. IEEE Micro, 32(2):20–27, 2012.

[91] D. Sciascia and F. Pedone. Geo-replicated storage with scalable deferred update replication. In Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2013, pages 1–12, June 2013.

[92] Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, and Thomas G. Price. Access path selection in a relational database management system. In SIGMOD Conference, pages 23–34, 1979.

[93] Prateek Sharma, Stephen Lee, Tian Guo, David Irwin, and Prashant Shenoy. SpotCheck: Designing a derivative IaaS cloud on the spot market. In Proc. of EuroSys, 2015.

[94] Kai Shen, Arrvindh Shriraman, Sandhya Dwarkadas, Xiao Zhang, and Zhuan Chen. Power containers: an os facility for fine-grained power and energy management on multicore servers. In Proc. of ASPLOS’13, pages 65–76.

[95] Alex Shye, Benjamin Scholbrock, and Gokhan Memik. Into the wild: studying real user activity patterns to guide power optimizations for mobile architectures. In Proc. of MICRO'09, pages 168–178.

[96] Sloan Digital Sky Survey. SDSS database. http://cas.sdss.org/dr7/en/.

[97] David C. Snowdon, Etienne Le Sueur, Stefan M. Petters, and Gernot Heiser. Koala: A platform for OS-level power management. In Proc. of EuroSys'09, pages 289–302.

[98] Yang Song, M. Zafer, and Kang-Won Lee. Optimal bidding in spot instance market. In IEEE INFOCOM, 2012.

[99] Christopher Stewart and Kai Shen. Some joules are more precious than others: Managing renewable energy in the datacenter. In Proc. of HotPower, 2009.

[100] Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, and Prashant Shenoy. SpotOn: A batch computing service for the spot market. In Proc. of SOCC, 2015.

[101] Srikanth Sundaresan, Nazanin Magharei, Nick Feamster, Renata Teixeira, and Sam Crawford. Web performance bottlenecks in broadband access networks. SIGMETRICS Perform. Eval. Rev., 41(1):383–384, 2013.

[102] Jeff Terrace and Michael J. Freedman. Object storage on CRAQ: High-throughput chain replication for read-mostly workloads. In Proceedings of USENIX Annual Technical Conference, USENIX, 2009.

[103] Narendran Thiagarajan, Gaurav Aggarwal, Angela Nicoara, Dan Boneh, and Jatinder Pal Singh. Who killed my battery?: analyzing mobile browser energy consumption. In Proc. of WWW'12, pages 41–50. ACM.

[104] TPC council. http://www.tpc.org/.

[105] Transaction Processing Performance Council. http://www.tpc.org.

[106] Transaction Processing Performance Council. http://www.tpc.org.

[107] Dimitris Tsirogiannis, Stavros Harizopoulos, and Mehul A. Shah. Analyzing the energy efficiency of a database server. In Proc. of the international conf. on management of data, SIGMOD '10, pages 231–242. ACM, 2010.

[108] Robbert van Renesse and Fred B. Schneider. Chain replication for supporting high throughput and availability. In USENIX OSDI, 2004.

[109] Akshat Verma, Gargi Dasgupta, Tapan Kumar Nayak, Pradipta De, and Ravi Kothari. Server workload analysis for power minimization using consolidation. In Proceedings of the USENIX’09, pages 28–28, 2009.

[110] B. Walsh. Your data is dirty: The carbon price of cloud computing. www.time. com, 2014.

[111] Xiaorui Wang, Ming Chen, Charles Lefurgy, and Tom W. Keller. SHIP: Scalable hierarchical power control for large-scale data centers. In PACT, pages 91–100, 2009.

[112] Xiaorui Wang, Kai Ma, and Yefu Wang. Adaptive power control with online model estimation for chip multiprocessors. IEEE Trans. Parallel Distrib. Syst., 22(10):1681–1696, October 2011.

[113] Yefu Wang, Xiaorui Wang, Ming Chen, and Xiaoyun Zhu. Power-efficient response time guarantees for virtualized enterprise servers. In IEEE Real-Time Systems Symposium, pages 303–312, 2008.

[114] Yefu Wang, Xiaorui Wang, Ming Chen, and Xiaoyun Zhu. PARTIC: Power-aware response time control for virtualized web servers. IEEE Trans. Parallel Distrib. Syst., pages 323–336, 2011.

[115] Zhe Wu, Michael Butkiewicz, Dorian Perkins, Ethan Katz-Bassett, and Harsha V Madhyastha. SPANStore: Cost-effective geo-replicated storage spanning multiple cloud services. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 292–308. ACM, 2013.

[116] Zichen Xu. Exploiting heavy-tailed distributions for cost-aware data replication. Tech Report OSU-ECE-15-2, http://pacs.ece.ohio-state.edu/techreports/TechReport-15-02.pdf, February 2015.

[117] Zichen Xu. Power cost estimation in DBMS, a comprehensive study. Technical Report OSU-ECE-12-002, Department of Electrical and Computer Engineering, The Ohio State University, June 2012.

[118] Zichen Xu, Yi-Cheng Tu, and Xiaorui Wang. Exploring power-performance trade- offs in database systems. In ICDE, pages 485–496, 2010.

[119] Zichen Xu, Yi-Cheng Tu, and Xiaorui Wang. Dynamic energy estimation of query plans in database systems. In Proc. of the International Conference on Distributed Computing Systems, 2013.

[120] Zichen Xu, Yicheng Tu, and Xiaorui Wang. Exploring power-performance trade-offs in database systems. In Proc. of ICDE, 2010.

[121] Yahoo! Yahoo! labs webscope, 2010. webscope.sandbox.yahoo.com.

[122] Chanmin Yoon, Dongwon Kim, Wonwoo Jung, Chulkoo Kang, and Hojung Cha. AppScope: Application energy metering framework for android smartphones using kernel activity monitoring. In Proc. of USENIX ATC'12, pages 36–36.

[123] Hongliang Yu, Dongdong Zheng, Ben Y. Zhao, and Weimin Zheng. Understanding user behavior in large-scale video-on-demand systems. In Proceedings of the 1st ACM EuroSys European Conference on Computer Systems 2006, EuroSys '06, pages 333–344, 2006.

[124] Heng Zeng, Carla S. Ellis, Alvin R. Lebeck, and Amin Vahdat. ECOSystem: managing energy as a first class operating system resource. SIGOPS Oper. Syst. Rev., 36(5):123–132, October 2002.

[125] Hao Zhang, Gang Chen, Beng Chin Ooi, Kian-Lee Tan, and Meihui Zhang. In-memory management and processing: A survey. IEEE Transactions on Knowledge and Data Engineering, 27, July 2015.

[126] Lide Zhang, Birjodh Tiwana, Zhiyun Qian, Zhaoguang Wang, Robert P. Dick, Zhuoqing Morley Mao, and Lei Yang. Accurate online power estimation and automatic battery behavior based power model generation for smartphones. In Proc. of CODES+ISSS'10, pages 105–114.

[127] Bo Zhao, Byung Chul Tak, and Guohong Cao. Reducing the delay and power consumption of web browsing on smartphones in 3G networks. In Proc. of ICDCS'11, pages 413–422.

[128] Bo Zhao, Qiang Zheng, and Guohong Cao. Energy-aware web browsing in 3G based smartphones. In Proc. of ICDCS’13.

[129] Yuhao Zhu and Vijay Janapa Reddi. High-performance and energy-efficient mobile web browsing on big/little systems. In Proc. of HPCA’13, pages 13–24.
