Benchmarking Personal Cloud Storage
Idilio Drago (University of Twente), Enrico Bocchi (Politecnico di Torino), Marco Mellia (Politecnico di Torino), Herman Slatman (University of Twente), Aiko Pras (University of Twente)
[email protected], [email protected], [email protected], [email protected], [email protected]

ABSTRACT

Personal cloud storage services are data-intensive applications already producing a significant share of Internet traffic. Several solutions offered by different companies attract more and more people. However, little is known about each service's capabilities, architecture and, most of all, the performance implications of design choices. This paper presents a methodology to study cloud storage services. We apply our methodology to compare 5 popular offers, revealing different system architectures and capabilities. The implications of different designs on performance are assessed by executing a series of benchmarks. Our results show no clear winner, with all services suffering from some limitations or having potential for improvement. In some scenarios, the upload of the same file set can take seven times longer, wasting twice as much capacity. Our methodology and results are thus useful as both a benchmark and a guideline for system design.(1)

Categories and Subject Descriptors
C.2 [Computer-Communication Networks]: Miscellaneous; C.4 [Performance of Systems]: Measurement Techniques

Keywords
Measurements; Performance; Comparison; Dropbox

1. INTRODUCTION

Personal cloud storage services allow users to synchronize local folders with servers in the cloud. They have gained popularity, with companies offering significant amounts of remote storage for free or at reduced prices. More and more people are being attracted by these offers, saving personal files, synchronizing devices and sharing content with great simplicity. This high public interest has pushed various providers to enter the cloud storage market. Services like Dropbox, SkyDrive and Google Drive are becoming pervasive in people's routines. Such applications are data-intensive, and their increasing usage already produces a significant share of Internet traffic [3].

Previous results about Dropbox [3] indicate that design and architectural choices strongly influence service performance and network usage. However, very little is known about how other providers implement their services and about the implications of different designs. This understanding is valuable as a guideline for building well-performing services that use network resources wisely.

The goal of this paper is twofold. Firstly, we investigate how different providers tackle the problem of synchronizing people's files. To answer this question, we develop a methodology that helps to understand both system architecture and client capabilities. We apply our methodology to compare 5 services, revealing differences in client software, synchronization protocols and data center placement. Secondly, we investigate the consequences of such designs on performance. We answer this question by defining a series of benchmarks. Taking the perspective of users connected from a single location in Europe, we benchmark each selected service under the same conditions, highlighting differences manifested in various usage scenarios and emphasizing the relevance of design choices for both users and the Internet.

Our results extend [3], where Dropbox usage is analyzed from passive measurements. In contrast to the previous work and to [12, 16], which focus on a specific service, this paper compares several solutions using active measurements. The results in [3] are used to guide the definition of our benchmarks. The authors of [11] benchmark cloud providers, but focus only on the server infrastructure. Similarly to our goal, [9] evaluates Dropbox, Mozy, Carbonite and CrashPlan. Motivated by the extensive list of providers, we first propose a methodology to automate the benchmarking. Then, we analyze several synchronization scenarios and providers, shedding light on the impact of design choices on performance.

Our results reveal interesting insights, such as unexpected drops in performance in common scenarios, caused both by the lack of client capabilities and by architectural differences among the services. Overall, the lessons learned are useful as guidelines to improve personal cloud storage services.

(1) Our benchmarking tool and the measurements used in our analyses are available at the SimpleWeb trace repository: http://www.simpleweb.org/wiki/cloud_benchmarks

IMC'13, October 23–25, 2013, Barcelona, Spain. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-1953-9/13/10. http://dx.doi.org/10.1145/2504730.2504762

Figure 1: Testbed to study personal storage services. (The figure shows the testing application driving the test computer: step 0, benchmark parameters; step 1, files sent to the test computer via an FTP server; step 2, the app-under-test uploads files to the provider; step 3, statistics collection.)

Table 1: Benchmarks to assess client performance.

          Random bytes             Plain text
    Set   Files   Size       Set   Files   Size
     1      1     100 kB      5      1     100 kB
     2      1     1 MB        6      1     1 MB
     3     10     100 kB      7     10     100 kB
     4    100     10 kB       8    100     10 kB
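As an illustration, the file batches of Table 1 can be generated programmatically. The sketch below is not the authors' released tool: the file-naming scheme and the word generator are invented for the example, and it only reproduces the set sizes and the random-bytes vs. plain-text distinction described in the paper.

```python
import os
import random
import string

# The 8 benchmark sets of Table 1: set id -> (files, bytes per file, content).
# Sets 1-4 carry random (incompressible) bytes; sets 5-8 carry plain text.
BENCHMARK_SETS = {
    1: (1, 100_000, "random"), 2: (1, 1_000_000, "random"),
    3: (10, 100_000, "random"), 4: (100, 10_000, "random"),
    5: (1, 100_000, "text"),    6: (1, 1_000_000, "text"),
    7: (10, 100_000, "text"),   8: (100, 10_000, "text"),
}

def make_payload(size, kind, rng):
    """Return exactly `size` bytes of random binary data or word-like text."""
    if kind == "random":
        return rng.randbytes(size)
    words, total = [], 0
    while total <= size:  # overshoot slightly, then trim to `size`
        w = "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 10)))
        words.append(w)
        total += len(w) + 1  # +1 for the separating space
    return (" ".join(words))[:size].encode("ascii")

def generate_set(set_id, out_dir, seed=0):
    """Write one benchmark set into `out_dir`; return the created paths."""
    count, size, kind = BENCHMARK_SETS[set_id]
    rng = random.Random(seed)  # seeded, so a run is reproducible
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i in range(count):
        path = os.path.join(out_dir, f"set{set_id}_file{i:03d}.bin")
        with open(path, "wb") as f:
            f.write(make_payload(size, kind, rng))
        paths.append(path)
    return paths
```

Such a generator would be dropped into the synchronized folder at step 1 of Fig. 1, leaving the app-under-test to propagate the batch to the cloud.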
2. METHODOLOGY AND SERVICES

This section describes the methodology we follow to design benchmarks that check the capabilities and performance of personal storage services. We use active measurements relying on a testbed (Fig. 1) composed of two parts: (i) a test computer that runs the application-under-test in the desired operating system; and (ii) our testing application. The complete environment can run either on a single machine or on separate machines, provided that the testing application can intercept traffic from the test computer.

We build the testbed on a single Linux server for our experiments. The Linux server both controls the experiments and hosts a virtual machine that runs the test computer (Windows 7 Enterprise). Our testbed is connected to a 1 Gb/s Ethernet network at the University of Twente, in which Internet connectivity is not a bottleneck.

Our testing application receives as input benchmarking parameters (step 0 in Fig. 1) describing the sequence of operations to be performed. The testing application acts remotely on the test computer, generating specific workloads in the form of file batches, which are manipulated using an FTP client (step 1). Files of different types are created or modified at run-time, e.g., text files composed of random words from a dictionary, images with random pixels, or random binary files. Generated files are synchronized to the cloud by the application-under-test (step 2), and the exchanged traffic is monitored to compute performance metrics (step 3). These include the amount of traffic seen during the experiments, the time before actual synchronization starts, and the time to complete synchronization.

2.1 Architecture and Data Centers

The architecture used, the data center locations and the data center owner are important aspects of personal cloud storage, having both legal and performance implications. To identify how the analyzed services operate, we observe the DNS names of contacted servers (i) when starting the application; (ii) immediately after files are manipulated; and (iii) when the application is in an idle state. For each service, a list of contacted DNS names is compiled. Providers distribute the workload via DNS, returning different IP addresses according to the originating DNS resolver [2]. The owners of the IP addresses are identified using the whois service.

For each IP address, we look for the geographic location of the server. Since popular geolocation databases are known to have serious limitations regarding cloud providers [14], we rely on a hybrid methodology that makes use of: (i) informative strings (i.e., International Airport Codes) revealed by reverse DNS lookups; (ii) the shortest Round Trip Time (RTT) to PlanetLab nodes [15]; and (iii) active traceroutes to spot the closest well-known location of a router. Previous works [2, 5] indicate that these methodologies provide an estimate with a precision of about a hundred kilometers, which is sufficient for our goals.

2.2 Checking Capabilities

Previous work [3] shows that personal storage applications can implement several capabilities to optimize storage usage and to speed up transfers. These capabilities include the adoption of chunking (i.e., splitting content into data units of a maximum size), bundling (i.e., the transmission of multiple small files as a single object), deduplication (i.e., avoiding the re-transmission of content already available on servers), delta encoding (i.e., the transmission of only the modified portions of a file) and compression.

For each capability, a specific test has been designed to observe whether it is implemented. We describe each test directly in Sect. 4. In summary, our testing application produces specific batches of files that would benefit from a capability. The exchanged traffic is then analyzed to determine how the service operates.

2.3 Benchmarking Performance

Knowing how the services are designed in terms of both data center locations and system capabilities, we check how such choices influence synchronization performance and the amount of overhead traffic. Results in [3] show that up to 90% of Dropbox users' uploads carry less than 1 MB. While 50% of the batches carry only a single file, a significant portion (around 10%) involves at least 100 files. Based on these results, we design 8 benchmarks varying (i) the number of files; (ii) the file sizes; and (iii) the file types, therefore covering a variety of synchronization scenarios. Tab. 1 lists our benchmark sets.
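The hybrid geolocation of Sect. 2.1 can be sketched in a few lines. The airport-code table and hostname patterns below are illustrative (real reverse-DNS naming varies per provider, and a real tool would carry a full IATA table and live RTT probes); the sketch only shows how the rDNS hint takes precedence over the minimum-RTT fallback.

```python
import re

# Illustrative IATA airport codes -> city; a real tool loads a full table.
AIRPORT_CODES = {"ams": "Amsterdam", "fra": "Frankfurt", "iad": "Washington",
                 "sjc": "San Jose", "mxp": "Milan"}

def location_from_rdns(hostname):
    """Heuristic (i): an airport code embedded in the reverse-DNS name,
    e.g. 'ams1.provider.example' -> 'Amsterdam'."""
    for label in hostname.lower().split("."):
        for code, city in AIRPORT_CODES.items():
            if re.match(rf"{code}\d*$", label) or f"-{code}" in label:
                return city
    return None

def location_from_rtt(rtt_ms_by_landmark):
    """Heuristic (ii): attribute the server to the landmark (e.g. a
    PlanetLab node) with the shortest measured round-trip time."""
    return min(rtt_ms_by_landmark, key=rtt_ms_by_landmark.get)

def geolocate(hostname, rtt_ms_by_landmark):
    """Prefer the rDNS hint; fall back to the minimum-RTT landmark."""
    return location_from_rdns(hostname) or location_from_rtt(rtt_ms_by_landmark)
```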
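The capability checks of Sect. 2.2 reduce to comparing the payload handed to the client against the traffic it actually generates. The decision rule below is a simplified sketch, not the authors' tests (described in their Sect. 4): the test names and thresholds are invented for illustration.

```python
def infer_capabilities(trials):
    """Flag which client capabilities appear implemented, given measured
    upload traffic for crafted batches. `trials` maps a test name to
    (payload_bytes, observed_traffic_bytes). Thresholds are illustrative."""
    caps = {}
    # Compression: highly compressible text should travel in far fewer bytes.
    payload, observed = trials["compressible_text"]
    caps["compression"] = observed < 0.8 * payload
    # Deduplication: re-uploading identical content should be nearly free.
    payload, observed = trials["duplicate_file"]
    caps["deduplication"] = observed < 0.1 * payload
    # Delta encoding: a small append should not resend the whole file.
    payload, observed = trials["small_append"]  # payload = full file size
    caps["delta_encoding"] = observed < 0.5 * payload
    return caps
```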
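Finally, the step-3 metrics (traffic volume, start-up delay, completion time) can be computed from a captured trace. A minimal sketch, assuming the capture has already been filtered down to storage-bound traffic and reduced to (timestamp, bytes) records:

```python
def sync_metrics(events, file_ready_ts):
    """Compute the three step-3 metrics from storage-bound traffic.
    `events` is a list of (timestamp_s, bytes) records; `file_ready_ts`
    is when the file batch was placed in the synchronized folder."""
    if not events:
        return None  # no synchronization traffic observed
    start = min(t for t, _ in events)
    end = max(t for t, _ in events)
    return {
        "traffic_bytes": sum(b for _, b in events),
        "startup_delay_s": start - file_ready_ts,    # time before sync starts
        "completion_time_s": end - file_ready_ts,    # time to complete sync
    }
```

Running each Table 1 set through such a function under identical conditions is what allows the per-provider comparison the paper reports.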