Towards Network-Level Efficiency for Cloud Storage Services
Total Page:16
File Type:pdf, Size:1020Kb
Towards Network-level Efficiency for Cloud Storage Services Zhenhua Li Cheng Jin Tianyin Xu Tsinghua University University of Minnesota University of California Peking University Twin Cities San Diego [email protected] [email protected] [email protected] Christo Wilson Yao Liu Linsong Cheng Northeastern University State University of New York Tsinghua University Boston, MA, US Binghamton University Beijing, China [email protected] [email protected] [email protected] Yunhao Liu Yafei Dai Zhi-Li Zhang Tsinghua University Peking University University of Minnesota Beijing, China Beijing, China Twin Cities [email protected] [email protected] [email protected] ABSTRACT General Terms Cloud storage services such as Dropbox, Google Drive, and Mi- Design, Experimentation, Measurement, Performance crosoft OneDrive provide users with a convenient and reliable way to store and share data from anywhere, on any device, and at any Keywords time. The cornerstone of these services is the data synchronization (sync) operation which automatically maps the changes in users’ Cloud storage service; network-level efficiency; data synchroniza- local filesystems to the cloud via a series of network communica- tion; traffic usage efficiency tions in a timely manner. If not designed properly, however, the tremendous amount of data sync traffic can potentially cause (fi- 1. INTRODUCTION nancial) pains to both service providers and users. Cloud storage services such as Dropbox, Google Drive, and Mi- This paper addresses a simple yet critical question: Is the cur- crosoft OneDrive (renamed from SkyDrive since Feb. 2014) pro- rent data sync traffic of cloud storage services efficiently used? vide users with a convenient and reliable way to store and share We first define a novel metric named TUE to quantify the Traffic data from anywhere, on any device, and at any time. The users’ da- Usage Efficiency of data synchronization. Based on both real-world ta (e.g., documents, photos, and music) stored in cloud storage are traces and comprehensive experiments, we study and characterize automatically synchronized across all the designated devices (e.g., the TUE of six widely used cloud storage services. Our results PCs, tablets, and smartphones) connected to the cloud in a timely demonstrate that a considerable portion of the data sync traffic is manner. With multiplicity of devices – especially mobile devices in a sense wasteful, and can be effectively avoided or significant- – that users possess today, such “anywhere, anytime” features sig- ly reduced via carefully designed data sync mechanisms. All in nificantly simplify data management and consistency maintenance, all, our study of TUE of cloud storage services not only provides and thus provide an ideal tool for data sharing and collaboration. guidance for service providers to develop more efficient, traffic- In a few short years, cloud storage services have reached phe- economic services, but also helps users pick appropriate services nomenal levels of success, with the user base growing rapidly. For that best fit their needs and budgets. example, Microsoft OneDrive claims that over 200 million cus- tomers have stored more than 14 PB of data using their service [9], while Dropbox has claimed more than 100 million users who store Categories and Subject Descriptors or update 1 billion files every day [6]. Despite the late entry into this market (in Apr. 2012), Google Drive obtained 10 million users C.2.4 [Computer-communication Networks]: Distributed Sys- just in its first two months [7]. tems—Distributed applications; D.4.3 [Operating Systems]: File The key operation of cloud storage services is data synchroniza- Systems Management—Distributed file systems tion (sync) which automatically maps the changes in users’ local filesystems to the cloud via a series of network communications. Figure 1 demonstrates the general data sync principle. In a cloud Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed storage service, the user usually needs to assign a designated lo- for profit or commercial advantage and that copies bear this notice and the full cita- cal folder (called a “sync folder”) in which every file operation is tion on the first page. Copyrights for components of this work owned by others than noticed and synchronized to the cloud by the client software de- ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- veloped by the service provider. Synchronizing a file involves a publish, to post on servers or to redistribute to lists, requires prior specific permission sequence of data sync events, such as transferring the data index, and/or a fee. Request permissions from [email protected]. data content, sync notification, sync status/statistics, and sync ac- IMC’14, November 5–7, 2014, Vancouver, BC, Canada. Copyright 2014 ACM 978-1-4503-3213-2/14/11 ...$15.00. knowledgement. Naturally, each data sync event incurs network http://dx.doi.org/10.1145/2663716.2663747. traffic. In this paper, this traffic is referred to as data sync traffic. 115 To answer the question thoroughly, we first define a novel metric Cloud named TUE to quantify the Traffic Usage Efficiency of data syn- chronization. Borrowing a term similar to PUE (i.e., the Power Us- age Effectiveness = Total facility power [14], a widely adopted metric Data Sync Event IT equipment power (data index, data content, for evaluating the cloud computing energy efficiency), we define Client sync notification, ...) Total data sync traffic TUE = : (1) File Operation Data update size (file creation, file deletion, file modification, ...) When a file is updated (e.g., created, modified, or deleted) at the us- User er side, the data update size denotes the size of altered bits relative to the cloud-stored file 2. From the users’ point of view, the data up- date size is an intuitive and natural signifier about how much traffic Figure 1: Data synchronization principle. should be consumed. Compared with the absolute value of sync traffic (used in previous studies), TUE better reveals the essential traffic harnessing capability of cloud storage services. If not designed properly, the amount of data sync traffic can In order to gain a practical and in-depth understanding of TUE, potentially cause (financial) pains to both providers and users of we collect a real-world user trace and conduct comprehensive bench- cloud storage services. From the providers’ perspective, the aggre- mark experiments of six widely used cloud storage services, in- gate sync traffic from all users is enormous (given the huge number cluding Google Drive, OneDrive, Dropbox, Box, Ubuntu One, and of files uploaded and modified each day!). This imposes a heavy SugarSync. We examine key impact factors and design choices that burden in terms of infrastructure support and monetary costs (e.g., are common across all of these services. Impact factors include file as payments to ISPs or cloud infrastructure providers). To get a size, file operation, data update size, network environment, hard- quantitative understanding, we analyze a recent large-scale Drop- ware configuration, access method, and so on. Here the “access box trace [12] collected at the ISP level [25]. The analysis reveals: method” refers to PC client software, web browsers, and mobile (1) The sync traffic contributes to more than 90% of the total ser- apps. Design choices (of data sync mechanisms) include data sync vice traffic. Note that the total service traffic is equivalent to one granularity, data compression level, data deduplication granularity, third of the traffic consumed by YouTube [25]; (2) Data synchro- and sync deferment (for improved batching). nization of a file (sometimes a batch of files) generates 2.8 MB of By analyzing these factors and choices, we are able to thoroughly inbound (client to cloud) traffic and 5.18 MB of outbound (cloud unravel the TUE related characteristics, design tradeoffs, and opti- to client) traffic on average. According to the Amazon S3 pricing mization opportunities of these state-of-the-art cloud storage ser- policy [1] (Dropbox stores all the data content in S3 and S3 only vices. The major findings in this paper and their implications are charges for outbound traffic), the Dropbox traffic would consume summarized as follows: nearly $0.05/GB × 5.18 MB × 1 billion = $260,000 every day1. These costs grow even further when we consider that all cloud stor- • The majority (77%) of files in our collected trace are smal- age service providers must bear similar costs, not just Dropbox [4]. l in size (less than 100 KB). Nearly two-thirds (66%) of Data sync traffic can also bring considerable (and unexpected) fi- these small files can be logically combined into larger files nancial costs to end users, despite that basic cloud storage services for batched data sync (BDS) in order to reduce sync traffic. are generally free. News media has reported about user complaints However, only Dropbox and Ubuntu One have partially im- of unexpected, additional charges from ISPs, typically from mobile plemented BDS so far. users with limited data usage caps [8, 2]. As a consequence, some • users have warned: “Keep a close eye on your data usage if you The majority (84%) of files are modified by users at least have a mobile cloud storage app.” In addition, some cloud stor- once. Unfortunately, most of today’s cloud storage services age applications (e.g., large data backup [3]) are also impaired by are built on top of RESTful infrastructure (e.g., Amazon S3, the bandwidth constraints between the user clients and the cloud. Microsoft Azure, and OpenStack Swift) that typically only full-file This limitation is regarded as the “dirty secrets” of cloud storage support data access operations at the level [26, 17].