
Linköping University | Department of Computer and Information Science
Master thesis, 30 ECTS | Software Engineering
2018 | LIU-IDA/LITH-EX-A--18/016--SE

The Cost of Confidentiality in Cloud Storage

Eric Henziger

Supervisor: Niklas Carlsson
Examiner: Niklas Carlsson


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Eric Henziger

Abstract

Cloud storage services allow users to store and access data in a secure and flexible manner. In recent years, these services have seen rapid growth in popularity as well as in technological progress, and hundreds of millions of users use them to store thousands of petabytes of data. Additionally, the synchronization of data that is essential for these types of services accounts for a significant share of total Internet traffic. In this thesis, seven cloud storage applications were tested under controlled experiments during the synchronization process to determine feature support and measure performance metrics. Special focus was put on comparing applications that perform client-side encryption of user data to applications that do not. The results show a great variation in feature support and performance between the different applications, and that client-side encryption introduces some limitations to other features but does not necessarily impact performance negatively. The results provide insights and enhance the understanding of the advantages and disadvantages that come with certain design choices of cloud storage applications. These insights will help future technological development of cloud storage services.

Acknowledgments

Even though I am the sole author of this thesis, my journey has been far from lonely and I have many people to thank for reaching its completion. First and foremost, thanks to Associate Professor Niklas Carlsson for his work as examiner and supervisor. Niklas has been generous in sharing his vast knowledge and helped me get back on track when I was lost and things felt hopeless.

Thanks to my dear friend Erik Areström, who I also had the pleasure to have as my opponent for this thesis. Erik’s warmth and positive attitude have been a source of motivation and I’m happy to get to share this final challenge as a Linköping University student with you. Thanks to my fellow thesis students with whom I’ve spent numerous lunches, fika breaks, and foosball games: Cristian Torrusio, Edward Nsolo, Jonatan Pålsson and Sara Bergman. You guys have turned even the dullest of work days into days of joy with interesting discussions and many laughs. Special thanks to my good friend Tomas Öhberg who, in addition to participating in the previously mentioned activities, has been the greatest of sounding boards when discussing our theses as well as life in general. Thanks to Natanael Log and Victor Tranell for their valuable feedback on early drafts of this thesis. I wish you all good fortune in your future endeavors and I hope that our paths may cross again sometime.

This thesis concludes my five years at Linköping University. It has been an adventurous time during which I have learned immensely and had the privilege to get to know many great people. Thanks to all my fellow course mates, especially Henrik Adolfsson, Simon Delvert and Raymond Leow, for being with me through tough and challenging exams, laboratory work and projects. Thanks to all examiners at the university departments IDA, MAI and ISY for pushing me to learn things that I would not have been disciplined enough to learn on my own. I would also like to thank my colleagues at Westermo R&D for being great role models in the software industry and for inspiring and motivating me for what’s to come in my professional life.

Thanks to my awesome friends back in Hallstahammar; I don’t have space to thank you all, but the three families Brandt, Joannisson and Tejnung include the very strong core part. While time spent with you has been limited during these years, it has always been of the highest quality. Finally, my warmest thanks to my mom and dad, Aina and Bosse, and my sister, Annelie, for your endless support and for raising me to who I am. Great work! ♡

This thesis was written using LaTeX together with PGFPlots for plot generation. The support from random strangers across the internet has been of great use in making this thesis into what it is.

Contents

Abstract iii

Acknowledgments iv

Contents v

List of Figures vii

List of Tables viii

List of Code Listings ix

1 Introduction 1
1.1 Aim ...... 2
1.2 Research Questions ...... 2
1.3 Contributions ...... 2
1.4 Delimitations ...... 3

2 Theory 4
2.1 Cloud Infrastructure and Cloud Storage ...... 4
2.2 File Encryption ...... 5
2.3 Cloud Storage User Behavior ...... 6
2.4 Cloud Storage Features ...... 7
2.5 Personal Cloud Storage Applications ...... 11
2.6 Related Work ...... 13

3 Method 15
3.1 Test Environment ...... 15
3.2 Testing Personal Cloud Storage Capabilities ...... 17
3.3 Advanced Delta Encoding Tests ...... 18
3.4 CPU Measurements ...... 21
3.5 Disk Utilization ...... 24
3.6 Memory Measurements ...... 24
3.7 Security in Transit ...... 25
3.8 Cloud Storage Traffic Identification ...... 26

4 Results 28
4.1 Compression ...... 28
4.2 Deduplication ...... 29
4.3 Delta Encoding ...... 30
4.4 CPU Utilization ...... 32
4.5 Disk Utilization ...... 38
4.6 Memory Utilization ...... 39
4.7 Security in Transit ...... 41

5 Discussion 43
5.1 Results ...... 43
5.2 Method ...... 46
5.3 The Work in a Wider Context ...... 47

6 Conclusion 49
6.1 Future Work ...... 50

Bibliography 51

A Appendices 57
A.1 Cloud Storage Application Changelogs ...... 57
A.2 Packet Size Distributions ...... 61
A.3 CPU Utilization ...... 62
A.4 Disk Utilization ...... 64

List of Figures

2.1 Two files sharing the same cloud storage space for two chunks ...... 8
2.2 Attack scenario in a cross-user deduplicated cloud ...... 9

3.1 The testbed setup used for the cloud storage measurements ...... 16
3.2 Visualization of the update patterns used in the delta encoding tests ...... 19
3.3 The different phases and their transitions during the sync process ...... 22
3.4 Screenshot of MEGAsync preferences with HTTP disabled ...... 24
3.5 Screenshot of Wireshark during TLS analysis ...... 25

4.1 Compression test results for the different PCS applications ...... 29
4.2 Data uploaded with sprinkled updates over a 10 MB file for Dropbox and SpiderOak ...... 31
4.3 CPU utilization during idle and cooldown phases ...... 33
4.4 CPU utilization during pre-processing and transfer phases ...... 34
4.5 Phase durations for the pre-processing and transfer phases ...... 34
4.6 CPU volumes for the pre-processing and transfer phases ...... 35
4.7 CPU utilization for the tested PCS applications during a single file upload ...... 36
4.8 CPU volumes during equalized network conditions ...... 37
4.9 CPU utilization for Mega with and without TLS ...... 38
4.10 Average amount of bytes written to disk during a 300 MB file upload ...... 39
4.11 Memory utilization for the tested PCS applications during five consecutive file uploads ...... 40
4.12 Mega warning dialog boxes when trusting foreign TLS certificates ...... 42

A.1 Packet size distributions for the tested PCS applications during a 10 MB file upload of highly compressible data ...... 61

List of Tables

3.1 Tested PCS Applications ...... 16

4.1 Summary of the tested PCS applications ...... 28
4.2 Deduplication test results ...... 30
4.3 Mean Memory Utilization (%) ...... 41
4.4 Certificate Authorities used by the PCS applications ...... 41

A.1 CPU utilization during idle and cooldown phases ...... 62
A.2 CPU utilization during pre-process and transfer phases ...... 62
A.3 Phase durations in seconds ...... 63
A.4 CPU Volumes ...... 63
A.5 Average amount of bytes written to disk during a 300 MB file upload ...... 64

List of Code Listings

3.1 Code for delta file modifications ...... 20
3.2 Code used for CPU and memory measurements ...... 22
3.3 Categorization of traffic flows ...... 27

Glossary

AES Advanced Encryption Standard
CA Certificate Authority
CFB Cipher Feedback
CPU Central Processing Unit
CSE Client-Side Encryption
CSP Cloud Service Provider
EULA End-User License Agreement
GCM Galois Counter Mode
GDPR General Data Protection Regulation
IoT Internet of Things
IP Internet Protocol
MDP Markov Decision Process
MitM Man-in-the-Middle
MTU Maximum Transmission Unit
PBKDF2 Password-Based Key Derivation Function 2
PCI SSC Payment Card Industry Security Standards Council
PCS Personal Cloud Storage
PKP Public Key Pinning
RSA Rivest–Shamir–Adleman
RTT Round Trip Time
TGDH Tree-based Group Diffie-Hellman
TLS Transport Layer Security
TOS Terms of Service
URL Uniform Resource Locator

1 Introduction

Cloud storage services and file synchronization applications have changed how people store their important data, such as documents and image files. These applications allow us to access files on all our devices, regardless of our geographical location. They also give us a sense of security as our data is backed up.

Personal Cloud Storage (PCS) applications have had rapid growth since their entry into the market. One of the most popular actors on the market, Dropbox, reportedly had 500 million users in March 2016 [1]. Similarly, Sundar Pichai reported in his keynote speech at Google I/O 2017 that Google Drive had over 800 million active users [2]. In a white paper by Cisco [3], it was estimated that 2.3 billion users, or 59 percent of the Internet consumer population, will be using PCS by 2020. Further, Cisco forecasted that global consumer cloud storage traffic will grow to 48 exabytes per year by 2020, compared to 8 exabytes in 2015.

Factors that become relevant when we choose to keep our data in a company’s cloud storage solution are privacy and integrity. For instance, by accepting the terms in Dropbox’s and Google’s End-User License Agreements (EULAs) you give them some rights to your stored content. With Dropbox, you give them, including their affiliates and trusted third parties, the right to access, store and scan “Your Stuff” [4]. Similarly, agreeing to Google’s Terms of Service (TOS) [5] gives them “a worldwide license to use, host, store, reproduce, modify, [...], publicly display and distribute such content”, where “such content” refers to the user’s stored content. Granting such rights might not be acceptable for some users or for certain content. Moreover, with software bugs such as the one that allowed logging in to Dropbox accounts without the correct password [6], or implementations of government surveillance backdoors such as the NSA Prism program [7], a need for stronger protection of the end user’s privacy may arise. A common solution to achieve confidentiality is to

use Client-Side Encryption (CSE), where the user’s data is encrypted before being transmitted to the cloud storage provider. Alongside CSE, Cloud Service Providers (CSPs) have developed other features to improve their products and to make the synchronization process as efficient as possible. For instance, files that are added to the cloud storage may be compressed, deduplicated or chunked into smaller pieces.

In this thesis, the tradeoffs that come with CSE are highlighted. The nature of encrypted data can put limitations on the efficiency of improvements for data synchronization such as compression and deduplication. Further, performing the encryption on the client increases the Central Processing Unit (CPU) utilization on those clients, which may have limited energy resources or computational power, for instance in Internet of Things (IoT) devices. For this thesis, seven different PCS providers, four of them supporting CSE, were tested through controlled experiments. Tests for features like compression, deduplication and delta encoding were conducted, as well as measurements of CPU, memory and disk utilization.

Other papers have studied the capabilities of PCS services in detail and some of the key findings are presented in Chapter 2. This thesis puts focus on the differences between services that support CSE and those that do not, and on whether CSE affects the efficiency of the syncing process. The method used in this thesis is similar to the one used by Bocchi et al. [8] but has been tailored towards testing the relevant metrics for when CSE is a factor. The method is thoroughly described in Chapter 3. The results from the experiments are presented in Chapter 4 and discussed in Chapter 5.

1.1 Aim

The purpose of this thesis project is to evaluate PCS providers that offer client-side encryption and compare these solutions to other popular actors in the market with regards to metrics such as network throughput and CPU utilization.

1.2 Research Questions

This thesis attempts to answer the following questions:

1. How can performance metrics such as CPU, memory and network utilization be measured in PCS applications?

2. How does client-side encryption affect the performance of cloud storage applications? Here, the concerned performance metrics are CPU utilization and network throughput.

3. How does sharing of data between multiple devices affect the synchronization process in a client-side encrypted cloud storage service?

1.3 Contributions

The work in this thesis builds upon established methods used in previous academic papers to test cloud storage applications. As an extension of previous work, a novel metric for fair comparison of CPU utilization between the different applications, as well as new test methods for getting insights

into the performance of cloud storage features such as delta encoding, are presented. To the best of the author’s knowledge, this is the first work in academia that specifically focuses on the difference between CSE-supporting and non-supporting services. Additionally, applications that have had little exposure in these types of studies (e.g. Sync.com) are included in this thesis and the tests are performed on a relatively untested platform (i.e. macOS).

1.4 Delimitations

The usage patterns for file synchronization come in many shapes. This study evaluates certain properties of file sync applications based on a set of predefined scenarios. These scenarios include tests of different file sizes and file activities such as creation, modification and deletion, with the intent of resembling the typical use of file synchronization. However, running an exhaustive suite of test scenarios is impossible due to limited resources, and for that reason there may be instances where the tested applications perform differently compared to the results presented later in this thesis.

It is assumed that those PCS applications that claim to support CSE do so properly. Considering that the tested applications are proprietary with non-disclosed source code, there exists a theoretical possibility that the services are more knowledgeable about their users’ encryption keys than is claimed by the providers. However, verifying the authenticity of each CSP is out of scope for this thesis.

2 Theory

This chapter gives a theoretical background to relevant topics about cloud storage and file syn- chronization. Further, it describes the current state of the art for improving file sync performance and security. Then, some PCS alternatives, including some which support CSE, are introduced. Finally, previous works related to this thesis are presented.

2.1 Cloud Infrastructure and Cloud Storage

Mell and Grance [9] define a cloud infrastructure as “a collection of hardware and software that enables five essential characteristics of cloud computing”. The five characteristics are “on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service”. In essence, a cloud service is highly accessible and flexible from a consumer’s point of view. Wheeler and Winburn [10] say that “cloud storage [...] allows providers to store your data on multiple servers in different locations in a way transparent to you”. A PCS application typically offers some storage space in the cloud using a folder on the user’s device that synchronizes file changes in that folder with the cloud. As such, the end user stores his or her files in that folder exactly as would be done in a regular folder, but the cloud storage client makes sure that those files are uploaded to the cloud servers.

An important distinction between cloud implementations is the notion of public and private clouds. A private cloud is kept within a single organization. A public cloud, however, uses the internet and is accessible by multiple parties. Typically, a public cloud provider offers its services to multiple customers. Therefore, integrity and confidentiality can be compromised in a public cloud. This thesis focuses exclusively on public cloud services.


2.2 File Encryption

In the simplest sense, encryption is a method that converts readable data, usually called the plaintext, into unreadable data, called the ciphertext. Conversely, decryption is the method for turning the ciphertext back into plaintext, and through these methods confidentiality can be achieved. Encryption methods are typically classified as either symmetric or asymmetric. For both methods, keys are used to encrypt and decrypt the file or message. In symmetric encryption, the same key is used for encryption and decryption. A common symmetric-key algorithm is Advanced Encryption Standard (AES). AES is a block cipher which uses a block size of 128 bits and a key size of either 128, 192 or 256 bits. With AES, all key sizes are considered safe against brute-force attacks within any reasonable time frame, but the longer key sizes are more computationally expensive. Previous studies [11], [12] showed that a smaller key size gives faster encryption times and lower CPU utilization.

A block cipher divides the plaintext into smaller blocks and applies the encryption to those blocks. However, to prevent identical plaintext blocks from returning identical ciphertexts, different methods called cipher block modes of operation have been developed, some of which provide authenticity in addition to confidentiality. One such mode is the Cipher Feedback (CFB) mode, where a plaintext block is encrypted together with the previous ciphertext block through an XOR operation. That way, even if two plaintext blocks are identical they will have different ciphertexts. For encrypting the first plaintext block, a randomized array of bits called an Initialization Vector is used. Another mode of operation is Galois Counter Mode (GCM), which was designed by McGrew and Viega, who published a paper [13] on the security and performance of GCM. The authors claim that GCM was designed to support authenticated encryption with high performance. Their study showed that GCM gives good performance compared to other modes such as OCB.

In asymmetric encryption, two different keys are used, often called the public key and the private key. The public key is used to encrypt the data but cannot be used for decryption, and may therefore be publicly shared with anyone who wants to encrypt data. The private key is the only way to decrypt data that has been encrypted by the public key and is therefore preferably kept secret, known only to those who are allowed to access the unencrypted data. A common method for asymmetric-key encryption is Rivest–Shamir–Adleman (RSA). RSA uses large keys of typically 2048 bits or larger and is significantly slower at performing encryption compared to AES, as shown in previous studies [14], [15]. Due to the different characteristics of the two encryption methods, they can be applied together to achieve additional layers of security. For instance, Al Hasib and Haque [16] suggested that AES should be used for encrypting large data blocks while RSA is used for key encryption and key management.
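To make the symmetric case concrete, the following minimal Python sketch shows authenticated encryption with AES in GCM mode using the third-party cryptography library. It is an illustration only and not code from any of the tested clients; the key size, the example plaintext and the use of a random 96-bit nonce are assumptions made for the example.

from os import urandom
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # symmetric key shared by sender and receiver
nonce = urandom(12)                        # GCM needs a unique nonce for every encryption
plaintext = b"file contents to protect"

aesgcm = AESGCM(key)
ciphertext = aesgcm.encrypt(nonce, plaintext, None)  # output includes the authentication tag
assert aesgcm.decrypt(nonce, ciphertext, None) == plaintext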

Convergent Encryption

The encryption key is a parameter on which the output of the encryption algorithm is based. Therefore, if the key is randomly generated, the resulting ciphertext can potentially take on any possible form. From a CSP’s perspective, a way to get a more deterministic behavior of encryption is to apply a method called convergent encryption. With convergent encryption, the hash value of the data that is to be encrypted is used as the encryption key. This way, identical plaintext

messages will produce identical ciphertexts. Convergent encryption can be an alluring feature for CSPs since duplication detection between users becomes trivial even when the cloud storage content is encrypted. In fact, former CSP Bitcasa was able to offer unlimited storage space by using convergent encryption [17]. However, convergent encryption has some weaknesses in comparison to traditional, private key, encryption. In a sense, a collection of convergent encrypted data becomes a rainbow table where one can do lookups to see if certain content is stored. For instance, imagine a CSP that employs convergent encryption. An investigative authority can then encrypt data that they suspect is being stored in the same way as the provider, i.e. using the file hash as the encryption key, and ask the provider if they store a copy of that encrypted data. So even though the data is encrypted, the unencrypted data can still be inferred from it by having knowledge of the encryption method.
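As an illustration of the idea (no claim is made that any particular provider uses exactly this construction), a convergent encryption sketch in Python can derive both the key and the nonce deterministically from the plaintext, so that identical inputs always yield identical ciphertexts:

import hashlib
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def convergent_encrypt(plaintext):
    key = hashlib.sha256(plaintext).digest()              # content-derived key
    nonce = hashlib.sha256(b"nonce" + key).digest()[:16]  # deterministic nonce, derived from the key
    encryptor = Cipher(algorithms.AES(key), modes.CTR(nonce),
                       backend=default_backend()).encryptor()
    return encryptor.update(plaintext) + encryptor.finalize()

# Two users encrypting the same document produce the same ciphertext, which is
# what makes cross-user deduplication (and the lookup attack described above) possible.
assert convergent_encrypt(b"same document") == convergent_encrypt(b"same document")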

2.3 Cloud Storage User Behavior

Previous studies have classified PCS user behavior. Drago et al. [18] studied two datasets of Dropbox user data and found four different classes of user behavior: occasional, download-only, upload-only, and heavy users. Occasional users were those who did install and run Dropbox without necessarily syncing any files. These users represented about 30% of all users in the datasets. From the two analyzed datasets, 37% in the first dataset and 33% in the second dataset were classified as heavy users that both stored and retrieved data of nearly equal amount. The users pertaining to the download-only or upload-only classes, whose majority of transmissions were either retrieval or store operations, represented 7% and 27%, respectively. Another finding was that a high percentage (between 40 and 80%) of cloud storage flows consisted of less than 100 kB. This was attributed to two factors: that the synchronization protocol sends file changes as deltas as soon as they are detected, and that Dropbox is primarily used for small files that are changed often rather than large files.

A study by Li et al. [19] analyzed trace files from 153 PCS users in the U.S. and China using Dropbox, Google Drive and OneDrive among others. The traces were collected in a time period between July 2013 and March 2014, tracing over 200,000 files in total. They found that 77% of files stored in cloud storage had a size of 100 kB or smaller. Further, 84% of the files were modified at least once, 52% could be efficiently compressed and 18% deduplicated.

Sharing of Client-Side Encrypted Data

Sharing files and data is an important feature for cloud storage. As a user you might want to share a document with your colleagues or a photo album with family and friends. With CSE, challenges to sharing are introduced. Since encrypted data can only be decrypted by those who hold the decryption key, there has to be a way for users who wish to share files with each other to securely share keys. One cloud storage provider, SpiderOak, revokes their “No Knowledge” policy for files that are shared through a so called “ShareRoom” [20]. The founders of Tresorit, István Lám and Szilveszter Szebeni, have proposed and patented solutions for sharing data in dynamic groups over an untrusted cloud storage service [21]–[23]. Their solutions are based on the Tree-based Group Diffie-Hellman (TGDH) protocol.


Wilson and Ateniese [24] gave an overview of CSPs with CSE and how these providers handled sharing of data, and uncovered some weaknesses when sharing is enabled. Their work focused on the issuing of certificates and Certificate Authorities (CAs) and highlighted the problem when the CSP is acting as a CA for itself. The services they tested were shown to be issuing certificates themselves for cryptographic operations, introducing a potential risk where the CSP may issue counterfeit certificates to the users. The authors also propose solutions to mitigate the risk by either letting a trusted third party handle certificate issuing or allowing users to use their own certificates, for instance by using PGP.

2.4 Cloud Storage Features

This section presents features that can be implemented in cloud storage applications to enhance either performance or security.

Performance Features

The features presented below are implemented to give better performance of some sort for the cloud storage application. The features usually involve a trade-off between CPU utilization on the one hand and network utilization or storage space on the other.

Compression

Compression is a technique in which data is encoded in a more compact format than the original. The effect is that fewer bytes are needed to store the information. This is useful in cloud storage as data can be compressed before transmission, which decreases network utilization, or before being stored in the cloud, which saves storage space. So, in exchange for CPU computation you get a smaller payload for the network transfer. Compression can be loss-less, which means that no information is lost after compression, or lossy, where information is lost but the compressed version of the data might still convey meaningful information. Examples of lossy compression formats are MP3 and JPEG, which are used to create smaller audio and image files while still, hopefully, preserving sound and visual quality to an acceptable degree.

The efficiency of loss-less compression is highly dependent on the format of the original data. The data may be highly compressible, in the case of plain text material, or nearly incompressible, in the case of a JPEG image or encrypted data. A common metric used to measure compression efficiency is the compression ratio, which is defined as

\text{compression ratio} = \frac{\text{size before compression}}{\text{size after compression}}. \qquad (2.1)
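As a small illustration of Formula 2.1, the ratio can be computed with Python's standard zlib module; zlib is used here purely as an example and is not claimed to be the algorithm used by any of the tested applications.

import os
import zlib

def compression_ratio(data):
    # Formula 2.1: size before compression divided by size after compression.
    return len(data) / float(len(zlib.compress(data)))

print(compression_ratio(b"plain text " * 100000))  # repetitive text: ratio well above 1
print(compression_ratio(os.urandom(1000000)))      # random (or encrypted) data: ratio close to 1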

Different compression algorithms perform differently with regards to compression ratio as well as compression time, i.e. the time needed for the algorithm to run. Schmidhuber and Heil [25] did a comparison study on different compression algorithms, for instance Lempel-Ziv and Huffman, and their compression performance on text data. The algorithms were able to reach compression ratios from 1.7 up to 2, resulting in the compressed files having about half the size of the original files.


Figure 2.1: Two files sharing the same cloud storage space for two chunks.

Data Chunking

During the process of data chunking, the file that is to be synced is split into smaller parts, called chunks, before transmission. Data chunking is beneficial if the sync process is interrupted and has to be resumed at a later time. When resuming, instead of having to resync the full file contents, the syncing may start at the first chunk that was not successfully synced. An important variable for data chunking is the size, in bytes, of the chunks. A smaller chunk size may be advantageous in case of network interruptions as less data needs to be re-transmitted. However, each chunk introduces acknowledgment overhead and therefore a larger chunk size (or no chunking at all) may be desirable. To be able to efficiently manage small chunks, bundling may be implemented. With bundling, small chunks are bundled and acknowledged together.
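A minimal sketch of fixed-size chunking is shown below; the 4 MB chunk size is an arbitrary example, as the tested applications use their own (undisclosed) values.

def read_chunks(path, chunk_size=4 * 1024 * 1024):
    # Yield fixed-size chunks of a file; the last chunk may be smaller.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# Each chunk can then be uploaded and acknowledged independently, so an
# interrupted sync can resume at the first chunk that has not yet been confirmed.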

Deduplication

To reduce the amount of redundancy in cloud storage, deduplication may be implemented. By checking if the data to be stored is already in the cloud, even if that data belongs to another file, the two files may share storage by having their blocks pointing to the same source blocks, as shown in Figure 2.1.

Deduplication checking can be implemented in various ways. The effectiveness of deduplication is related to if and how data chunking is used. If chunking is not enabled, the deduplication check is performed on the whole file. However, two files might not be fully identical while smaller chunks of those files may very well be identical and can therefore share the same storage. As such, the usefulness of deduplication relates to the chunking size. Smaller chunk sizes can increase deduplication efficiency but also introduce overhead. Meyer and Bolosky [26] made a study in which they compared the effectiveness of different deduplication strategies, such as whole-file deduplication and various block-based strategies with fixed size blocks and Rabin Fingerprinting [27], which varies the block size based on the content of the file. Their results showed that a block-based approach was able to give bigger savings compared to a whole-file approach, and the savings grew as chunk size was decreased. However, whole-file deduplication had a much lower cost with regards to performance and complexity.


Figure 2.2: Attack scenario in a cross-user deduplicated cloud.

Other, more sophisticated, methods for data deduplication have been proposed. For instance, Widodo et al. [28] have studied Content-Defined Chunking, which chunks the data based on file contents, as opposed to more traditional fixed-size chunking.

The check for deduplication may occur on the client or on the server. When deduplication is checked on the client side, a hash of the file (or a chunk of the file) to be stored is sent to the server. The server then checks if the hash already exists and if so the client does not need to upload the actual file contents. This gives better network utilization compared to when the deduplication occurs server-side and the client always uploads the file contents.

Another property of deduplication is whether it is single-user or multi-user. In the latter case, deduplication is performed on an inter-user level so that if user A and user B store the same file in their respective cloud account, it may still only consume storage space equal to one copy of that file. This type of deduplication may be much more efficient than if deduplication is restricted to work on a per-user basis, especially for popular files. However, inter-user deduplication introduces some privacy issues. Harnik et al. [29] described various attacks when multi-user deduplication is used in conjunction with deduplication checking on the client side. An example of such an attack is presented in Figure 2.2. In it, the attacker Charlie knows that Alice is storing a document with sensitive data, such as a PIN code. Further, he knows the general format of the document and generates copies of such a document, each with a different PIN code. By uploading those documents and seeing which one is deduplicated, i.e. the document that is not uploaded to the cloud, he can infer Alice’s PIN code.
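The client-side check can be sketched as follows; the server object and its methods are hypothetical placeholders and do not correspond to the API of any of the tested providers.

import hashlib

def upload_with_dedup(chunk, server):
    digest = hashlib.sha256(chunk).hexdigest()
    if server.has_chunk(digest):           # only the hash crosses the network
        server.add_reference(digest)       # the existing copy is reused
    else:
        server.store_chunk(digest, chunk)  # the full chunk must be uploaded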

Delta Encoding

A popular technique for optimizing file synchronization is to implement delta encoding. In the event of file changes, delta encoding calculates the difference of the changed file, the delta, and only sends the change set instead of sending the file in its entirety.

Even with delta encoding, a scenario with frequent file updates may lead to significant overhead traffic. One way to manage frequent updates is to aggregate multiple updates before propagating changes. This comes with a price since it decreases file consistency. To be able to balance overhead traffic and file consistency, Lee, Ko and Pack [30] proposed a solution for an efficient delta encoding algorithm. The algorithm would find an optimal policy for determining whether an update should be aggregated or synchronized immediately. To achieve this the problem was formulated as a


Markov Decision Process (MDP) where the synchronization state is changed by calculating the best next action according to a reward function. In their example, the two possible actions were to either aggregate or synchronize the update. The authors showed some theoretical evidence that their algorithm could be useful. However, no actual implementation was made, for which the authors provided the following reason: “the source codes for both the client and the cloud server in cloud storage applications (e.g., Dropbox) are not open and available are not appropriate to EDS”, with EDS being the name of the authors’ proposed algorithm. Their findings lay a foundation and give an opportunity for testing the algorithm practically.

Delta encoding has in itself certain shortcomings. One of the drawbacks is the inability to efficiently handle large compressed files. Al-Sayyed et al. [31] bring the issue to light and propose a method for handling compressed files. Their idea is based on finding the changes in the subfiles of a zip file and only updating those altered subfiles, if any. The paper fails to explain how their approach differs from conventional delta encoding, but they were able to show improvements over the default behavior of Dropbox.
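To illustrate the general principle of delta encoding, the sketch below compares two versions of a file at a fixed block granularity and keeps only the blocks that differ. It is a simplification under assumed parameters, not the (undisclosed) algorithm of any tested client; production implementations such as rsync additionally use rolling checksums so that insertions that shift block boundaries can still be matched.

import hashlib

BLOCK = 64 * 1024  # illustrative block size

def block_hashes(data):
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

def delta(old, new):
    # Return (block index, block data) pairs for blocks that differ between versions.
    old_hashes = block_hashes(old)
    changed = []
    for i, h in enumerate(block_hashes(new)):
        if i >= len(old_hashes) or h != old_hashes[i]:
            changed.append((i, new[i * BLOCK:(i + 1) * BLOCK]))
    return changed  # only these blocks need to be transmitted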

Security Features

The security features presented below aim to protect the user’s privacy when using a cloud storage application.

Security during Transit

The transportation of data from the client to the cloud is performed using different networking protocols. Typically, HTTP or HTTPS is used. With HTTPS, the transmitted data is encrypted using Transport Layer Security (TLS). That way, even if network traffic is intercepted by an intermediary network device, the payload of the data packets cannot be read as it is encrypted. To achieve a secure data channel, TLS introduces a handshake process during which certificates and encryption keys are exchanged between client and server. Thus, the handshake process is affected by the Round Trip Time (RTT) as these credentials need to be exchanged before the actual payload can be transferred. Naylor et al. [32] showed that HTTPS with TLS introduces some additional overhead compared to HTTP. The TLS handshake showed a significant delay in page load times, especially for services with server locations in the U.S. with a high RTT. Further, they showed that the increased energy consumption from the cryptographic operations in HTTPS was negligible.

Certificate Pinning

In conventional TLS connection establishment the server certificate is authenticated by the client by checking if the certificate has been signed by any of the CAs that the client trusts. A technique called Certificate Pinning can be used by the client to restrict which server certificates to trust. In a white paper from Symantec [33], two key behaviors for certificate pinning methods are presented:

• “The client is pre-configured to know what server certificate it should expect.”

• “If the server certificate does not match the pre-configured server certificate then the client will prevent the session from taking place.”


From an application perspective, the client will not accept certificates other than those that it has pinned, even if the certificates are configured to be trusted on an operating system or organization level. This makes it more difficult for Man-in-the-Middle (MitM) attacks to succeed as counterfeit certificates would be rejected by the client. A drawback of using certificate pinning is that if the server changes its certificate, which is typically required as certificates are only valid for a limited amount of time, the client application will need to be updated as well with the new certificate pinned.
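The check itself can be sketched with Python's standard ssl module; the host name and the pinned fingerprint below are placeholders, and a real client would ship the pinned certificate or public key inside the application rather than keep it in a plain variable.

import hashlib
import ssl

PINNED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"  # placeholder

def certificate_matches_pin(host, port=443):
    pem = ssl.get_server_certificate((host, port))
    der = ssl.PEM_cert_to_DER_cert(pem)
    fingerprint = hashlib.sha256(der).hexdigest()
    return fingerprint == PINNED_SHA256  # if False, the client refuses to establish the session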

Security at Rest

To protect user data from being exploited in case of a security breach, CSPs may encrypt the data when it is stored in the cloud. A significant factor here is where the encryption step takes place. The data can be encrypted either by the client before it is sent to the cloud or at the server side. With the encryption being performed by the client, and under the assumption that the CSP doesn’t know the encryption key, only the user is able to decrypt his/her data. So even if the service provider would want to read the data that is stored on their servers, for instance in the case of a law enforcement inquiry, they would not be able to do so. With server-side encryption, the user’s data is only private to that user for as long as the service provider wishes it to be.

2.5 Personal Cloud Storage Applications

The applications used for this thesis were selected based on their prevalence in published academic articles and to some extent recommendations in online reviews [34]. In the following sections each application is presented and some key features are highlighted.

Major Cloud Storage Providers

The three largest CSPs were included in the experiments for this thesis for multiple reasons. These services have been included in many previous studies of cloud storage solutions and as such the results from this thesis can be compared against the results from those studies. Further, since none of these services provide CSE they serve as a counterpart to the study of CSE-supporting providers.

Dropbox

Dropbox began as a startup in 2007 and is today one of the major actors in the market. Dropbox’s data centers are located across the United States [35]. Dropbox is primarily written in Python, while parts of its infrastructure have been converted to Go [36].

Google Drive

Google Drive launched in 2012. In July 2017, Google launched their new synchronization client “Backup and Sync”, which replaced the existing Google Drive desktop application. However, for this thesis the client will interchangeably be referred to as “Google Drive”. Google has data centers around the world, with the closest to Sweden being located in Finland and the Netherlands [37].


Microsoft OneDrive

OneDrive is the current PCS solution from Microsoft. OneDrive was formerly called SkyDrive and has been in service since 2007. Microsoft has data centers globally, but states that data stored in the EU is maintained within that region to meet regulation requirements [38].

PCS Applications with Client-Side Encryption

Several alternatives for PCS with CSE exist. While none of the services below are as popular as, for instance, Dropbox or Google Drive, the large number of services that offer CSE indicates that there is a demand for that type of service on the market. Additional services that could have been included in this thesis but had to be excluded due to limited resources were pCloud and Sugarsync, to name a few.

Mega

Mega is the successor to Megaupload and is developed by the New Zealand-based company Mega Limited. Their client for the desktop platform, MEGAsync, is written in C++. Mega provides access to their source code repositories on GitHub for review purposes [39].

For data transactions to and from the cloud, Mega uses HTTP rather than HTTPS. Even so, the payload of the requests, which holds the user’s data, is encrypted, which prevents unauthorized access. However, in the client’s preferences settings there is an option called “Don’t use HTTP” which has the effect of enabling TLS for file transmissions. Along with the setting is a statement that says “Enable this option only if your transfers don’t start. In normal circumstances HTTP is satisfactory as all transfers are already encrypted.”, implying that the option is offered for improving stability or availability rather than security. For encryption, the official documentation from Mega states that “All symmetric cryptographic operations are based on AES-128” [40], and in their TOS it is stated that cross-user deduplication may occur [41].

SpiderOak

SpiderOak is a U.S. company founded in 2007. SpiderOak uses AES-256-CFB to encrypt user data [42]. They also claim that they perform compression as well as deduplication in order to reduce network utilization. Further, every file and folder is encrypted with a unique key [43]. Different versions of a single file are encrypted with different keys, which allows SpiderOak to support versioned retrieval of files. The collection of encryption keys is secured by the user’s password, which is hashed and salted using Password-Based Key Derivation Function 2 (PBKDF2). During file backup, SpiderOak makes an encrypted copy of the file and temporarily writes the encrypted copy to the local hard drive [44]. This puts additional requirements on the client regarding free disk space compared to an approach where the encrypted data is kept in memory. According to their online documentation, SpiderOak’s datacenters are located in the midwestern United States [45].


Sync.com

A Canadian-based company, Sync.com, offers free CSE storage. Their storage servers are located in Canada, namely in the city of Toronto (primary) and the city of Markham (backup) [46]. Sync.com describes their zero-knowledge policy and encryption methods in their privacy white paper [47]. To summarize, Sync.com uses asymmetric encryption where a 2048-bit RSA private encryption key is generated for each user and is used to encrypt the user’s AES encryption keys, which in turn are used to encrypt file data. The private key is itself encrypted with 256-bit AES-GCM, which in turn is locked using the user’s password. The user’s password is in turn stretched with PBKDF2 to reduce the risk of data breach due to a weak password.
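The password-stretching step can be illustrated with Python's standard library; the iteration count, salt length and example password are arbitrary values chosen for the illustration and are not the parameters used by Sync.com.

import hashlib
import os

password = b"correct horse battery staple"  # example password
salt = os.urandom(16)
# PBKDF2 makes guessing the password far more expensive; the derived key could
# then be used to unlock (wrap) the user's actual encryption keys.
derived_key = hashlib.pbkdf2_hmac("sha256", password, salt, 100000, dklen=32)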

Tresorit

Tresorit [48] is a Hungarian and Swiss based company that launched their CSE service in April 2014. Their servers are located in the EU, or more specifically in Ireland and the Netherlands, using Microsoft Azure data centers. Tresorit uses AES-256 in CFB mode to encrypt user data.

Cloud Encryption Gateways

There exist solutions for CSE that work on top of an already existing cloud storage service. That is, the program encrypts the data and then puts the encrypted data in the storage folder of the cloud service. One such solution is BoxCryptor, which is typically used in conjunction with Dropbox. The effect of having BoxCryptor encrypt the data before putting it in the sync folder of Dropbox is not only advantageous, since features of Dropbox such as compression are rendered almost useless on the encrypted data. A better flow of operations would be to compress the data before encryption and then start the transfer to the cloud without attempting to apply compression at that point. This is easy to achieve in theory and when you have full control of the whole storage process, but can become an issue when you mix two independent services, such as BoxCryptor and Dropbox.

2.6 Related Work

Gracia-Tinedo et al. [49] compared three CSPs (Dropbox, Box and Sugarsync) and found variations in transfer speeds. Among their findings, they showed that upload and download speeds were higher for clients located in the USA and Canada compared to clients located in Europe. A suggested and intuitive explanation for this behavior was the locations of the providers’ data centers. Further, they showed differences in upload speeds for Dropbox and Box that depended on the hour of the day. For instance, Dropbox had a 15-35% increase in upload speed during night hours.

Mager et al. [50] studied the now discontinued PCS service Wuala, which had many properties in common with Tresorit, for instance end-to-end encryption. They found that during the file synchronization process, Wuala would encrypt the file and store it locally before syncing the encrypted contents to the cloud, similar to how SpiderOak describes their sync process. If and how the syncing process was affected by syncing large files while having limited free disk space, since the encrypted copy would double the amount of disk space required, was not uncovered by the study. Since their discontinuation, Wuala recommends Tresorit [51] to their former users as their successor.


Cui et al. [52] present common methods for optimizing the performance of file sync applications. The mentioned methods are chunking, bundling, deduplication and delta encoding. Through experiments, the authors determine if these methods are active in popular file sync applications such as Dropbox, Google Drive, OneDrive and Box. Their results showed that the applications implement file chunking with different chunk sizes. In the report, it was shown that none of the four previously mentioned applications had bundling capabilities activated. Regarding deduplication, only Dropbox had that capability implemented, with the additional support for also checking duplication against deleted files.

Drago et al. [53] performed similar tests as Cui et al. but received different results. For instance, they found different chunking sizes for Google Drive (4 MB compared to 260 kB according to Cui et al.) and that Dropbox did indeed implement bundling. These differences may be because the versions of the applications were different or due to the fact that Drago used a Windows 7 machine as test client while Cui used an Android device. The latter theory is supported by another paper by Cui et al. [54] as well as a study by Luo et al. [55] in which testing was performed on both PC and Android devices. These, more thorough, studies showed that capabilities not only differ between applications but may also depend on which client platform the application is run on. Typically, the PC clients had more capabilities activated compared to their Android counterparts.

Compared to these previous studies, this thesis is the first to use macOS as the client platform. There is some overlap in tested applications with regards to the most popular services (Dropbox, Google Drive, OneDrive), but this thesis complements them with CSE-supporting services that have had little or no exposure in these types of studies. Finally, while delta encoding has previously been tested only to the extent of whether it is supported or not, this thesis makes more granular tests to get a better understanding of the efficiency of the delta encoding mechanisms that the services implement.

CPU and Memory Utilization in PCS Applications

Compared to just transmitting the data directly, compression and encryption require CPU-intensive computations before the network transfer can even start. Further, the compressed and/or encrypted data must be temporarily stored on the client while the sync process is active. As part of their lessons learned when studying the behavior of four cloud storage providers, Hu et al. wrote “Cloud storage providers should perform pre-processing tasks like encryption, compression, [...] incrementally and in parallel with bulk transfer of data over the network to avoid delays in network transfer and to avoid storing large amounts of temporary data” [56].

Li et al. [57] implemented a middleware solution that was used in conjunction with Dropbox to improve the synchronization process. In their study they measured the CPU utilization of vanilla Dropbox during the upload of a file that was appended with 2 kB of random data every 0.2 seconds until it reached a total size of 10 MB. They found that the Dropbox application was single-threaded and had a mean CPU utilization of 54% during the upload, and that the utilization grew significantly as the file size reached certain thresholds at 4 and 8 MB.

3 Method

This chapter explains the methods and choices made for the conducted experiments. The tested applications and their respective versions are presented in Table 3.1. The experiments were conducted running the latest version of each client. Because new versions of the software were released during the testing period, and most clients, with Mega and Sync.com being two exceptions, update automatically without any way for the user to disable these updates, a range of versions is given in the table. The earliest version is from when the experiments began and the latest version is from when the experiments ended. Changelogs for the applications during the testing period are presented in Appendix A.1.

The experiments were conducted by adding files to the cloud services’ sync folders and taking measurements during the sync process, i.e. from the time that the file is added to the folder until it has been uploaded and the local folder is synchronized with the cloud storage servers. The methodology for the experiments was based on the one described in the paper by Bocchi et al. [8] and used the benchmarking scripts1 that the authors provided from that study. The script files were extended and modified to suit the test environment used for this thesis. In cases where major additions or modifications to the test scripts were required, special care has been taken to present and explain the changes in the relevant sections in this chapter.

3.1 Test Environment

The experiments were performed at Linköping University. A Macbook Air laptop was used to execute the tests and run the different sync clients. The laptop ran macOS High Sierra version 10.13.3 and had a 1.3 GHz Intel Core i5 CPU with two physical cores, 8 GB of RAM and a 128 GB SSD.

1 https://www.simpleweb.org/wiki/index.php/Cloud_benchmarks


Table 3.1: Tested PCS Applications

Application                    Versions
Dropbox                        43.4.50 – 49.4.69
Backup and Sync from Google    3.39 – 3.41.9267.0638
OneDrive                       17.005.0107 (0010) – 18.044.0301
MEGAsync                       3.6.0 (b72f46) – 3.6.6 (99a46c)
SpiderOakONE                   7.0.1 – 7.1.0
Sync.com                       1.1.19 – 1.1.20
Tresorit                       3.1.1235.751 – 3.1.1265.764


Figure 3.1: The testbed setup used for the cloud storage measurements.

The laptop was connected to the university network through a 10 Gb/s Thunderbolt to Ethernet adapter. An illustration of the testbed setup can be seen in Figure 3.1.

To the greatest extent possible, the clients were running with default settings unless they required configuration to be able to be tested under the automated test cases. One exception to this was SpiderOakONE, which was configured to launch minimized and have its LAN-Sync feature disabled as it would otherwise interfere with the firewall settings on the test laptop. LAN-Sync allows files to be downloaded from other computers on the same local network for increased download speeds. As the files used in the tests were unique and no other computers on the local network had SpiderOak running (at least not with the same user account), the disabling of LAN-Sync had theoretically no impact on the test results.

Benchmarking Scripts

The benchmarking scripts were written in Python and executed with the Python 2.7.14 interpreter. Compared to the originals, the scripts were modified and extended to suit the testbed setup. The original study ran the PCS applications on virtual machines while running the scripts on the host machine of those virtual machines, and for that reason the original scripts used an FTP server to move files to and from the sync folders of the tested applications. For this thesis, since the scripts were run on the same machine as the PCS clients, the files were simply copied using the function shutil.copy2(), which is included in the Python standard library.


As files were copied to the sync folders of the application under test, network traffic was recorded during the whole duration of the sync process using the Python modules netifaces and pcapy, among others. The packet capture was executed as a separate thread to allow concurrency between the packet capture process and the main test procedure. The method for measuring CPU utilization is described in greater detail in Section 3.4.
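The core of a test iteration can be sketched as below. The sync folder path and file name are placeholders, and the actual benchmarking scripts additionally start the packet capture thread and poll the client to decide when the upload has finished; the sketch only shows the copy-and-time skeleton.

import shutil
import time

SYNC_FOLDER = "/Users/test/Dropbox"  # placeholder path for the client under test

def run_upload_test(test_file):
    start = time.time()
    shutil.copy2(test_file, SYNC_FOLDER)  # adding the file triggers the client's sync process
    # ... wait here until the client reports the file as fully synchronized ...
    return time.time() - start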

3.2 Testing Personal Cloud Storage Capabilities

To get a better understanding of the different PCS applications, several tests were performed to see if the services supported popular features related to cloud storage. The tested capabilities in this thesis were compression, deduplication and delta encoding. As these features require some additional computational resources from the client, the outcome of these tests would supplement the test results from the CPU and memory utilization tests.

Compression

To test if and how compression was implemented in the different PCS applications, files containing plain text were added to the sync folders and uploaded to the cloud servers. With the content being plain text, the potential for efficient compression was high. To verify if compression was indeed used, the amount of uploaded bytes was compared to the actual file size. If the number of uploaded bytes was lower than the actual file size then that behavior would be attributed to compression. The file size of the original files ranged from 10 MB up to 28 MB. The test was repeated 15 times and the mean value of the uploads for each file size was calculated. Additionally, for each upload the compression ratio was calculated. With most PCS clients, the compressed file content was kept in memory before it was uploaded to the cloud. Therefore, the calculation of compression ratio used uploaded bytes, including network traffic overhead, as the denominator instead of the file size after compression as specified in Formula 2.1.

Deduplication

The test for client-side deduplication was divided into four sub-scenarios, listed below.

(i) Different name

(ii) Different folder

(iii) Different folder and name

(iv) Delete and re-upload

In every test scenario, a 20 MB file made up of random bytes, referred to as the original file for the rest of this section, was placed in the sync folder of the application under test. Then, a second file with identical content except for some metadata differences was uploaded. For both uploads, the amount of bytes transferred was measured. If the upload of the second file required as much data to be transferred as the original file, then that would show that the application doesn't employ deduplication. On the other hand, if very small amounts of data were

transferred for the second file, then that would indicate that deduplication was used. After placing the second file in the sync folder, unless significant network traffic was identified within 60 seconds (90 seconds for SpiderOak) the test script determined that no data transfer would take place.

In the first scenario, a second file identical to the original file, except with a different filename, was placed in the sync folder. Test cases (ii) and (iii) included putting a copy of the original file in a different folder, with either the same or a different name as the original file. The fourth scenario included deletion of the original file and then re-uploading an exact copy of that file after a short while (1-2 minutes). The purpose of the fourth test case was to show if the cloud storage keeps deleted files that can be “un-deleted” from the cloud. The deduplication test suite was initially run 15 times for each PCS application. After those test runs, only OneDrive showed inconclusive results, for which an additional 25 test runs were executed.
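Generating the 20 MB original file of random bytes can be sketched as follows; the file name is illustrative, and random data is used so that the file is both unique per test run and effectively incompressible.

import os

def make_random_file(path, size=20 * 1024 * 1024):
    with open(path, "wb") as f:
        f.write(os.urandom(size))  # random bytes: unique and incompressible

make_random_file("original_20MB.bin")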

Basic Delta Encoding Tests Test scenarios for delta encoding were conducted to see how the different PCS clients managed file modifications with regards to their content. All clients underwent three basic tests to see if delta encoding was enabled at all. Then, for those clients that did perform delta encoding, more advanced tests were conducted to determine how and to what extent delta encoding was performed. Three different tests were performed to determine if delta encoding was supported:

• Append

• Prepend

• Insert at random position

The test scenarios would insert random bytes at the end, at the beginning or at a random position of a file in increments of 5 MB, starting at 5 MB and going up to 25 MB. During each modification, the network packets for the file transmission were captured and the amount of uploaded bytes was inferred by analyzing the packet trace files. If a client used delta encoding techniques, the amount of uploaded bytes would equal the size of the update, in this case 5 MB. On the other hand, if a client did not take advantage of delta encoding techniques, the amount of uploaded bytes would equal the total file size after each modification, i.e. 5, 10, 15, 20 and 25 MB.
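A simplified sketch of these three basic modifications is shown below; the function name is hypothetical and the routine is only a stand-in for the corresponding logic in the test scripts:

    import os
    import random

    def grow_file(fname, pattern, increment=5 * 1024 * 1024):
        # Insert @increment random bytes at the end, the beginning or a
        # random position of the file, mirroring the three basic tests.
        new_bytes = os.urandom(increment)
        with open(fname, 'rb') as f:
            content = f.read()
        if pattern == 'append':
            pos = len(content)
        elif pattern == 'prepend':
            pos = 0
        else:  # insert at a random position
            pos = random.randrange(len(content) + 1)
        with open(fname, 'wb') as f:
            f.write(content[:pos] + new_bytes + content[pos:])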

3.3 Advanced Delta Encoding Tests

For those PCS applications whose behavior implied delta encoding, additional test cases were executed. The purpose of these tests was to measure the efficiency of the delta encoding mechanisms in the different PCS applications and to give a better understanding of the implementations of delta encoding. Initially, the following four test cases were conducted.

• Continuous, non-overlapping

• Continuous, overlapping

• Gapped

• Sprinkled

Figure 3.2: Visualization of the update patterns used in the delta encoding tests: (a) continuous, non-overlapping updates; (b) continuous, overlapping updates; (c) gapped updates; (d) sprinkled updates.

For these test cases, a 10 MB file consisting of random data was uploaded. Then, file modifications were applied to the file where data was overwritten within the file, such that the file content was updated but the file size remained at 10 MB. The different ways of updating the files are presented in Figure 3.2. The update patterns described in Figures 3.2a and 3.2b would apply updates in a sequential but continuous manner, until eventually all the contents of the original file had been updated. The test with gapped updates would update parts of the file but leave some of it unchanged. Finally, the sprinkle test case acts as a kind of stress test for the delta encoding algorithm, where random bytes corresponding to p percent of the whole file would be sampled and updated with new values. The effect of this was that individual bytes at random positions of the file were changed, typically scattered from each other. Figure 3.2d gives an example where a 500 byte file was updated with p = 0.02, resulting in 10 bytes updated at various places in the file. For the conducted tests, the value of p was varied to find out how large it could become before the delta encoding mechanisms of the PCS applications became useless. The test scripts provided by Drago et al. already had test cases for the basic tests. To support the advanced test cases, the test script file was extended with the additional update patterns, which are presented in Listing 3.1. The code performs the update of a file in either a chunked or sprinkled pattern. In the code, the open() function opens a file in the ’r+’ mode, which means that the file is opened for both reading and writing.


import random

def insert_random_bytes(fname, updatesize, pattern, offset, p):

    # Overwrite @updatesize bytes at position @offset
    if pattern == CHUNK:
        with open(fname, 'r+') as f:
            f.seek(offset)
            rand_bytes = bytearray(random.getrandbits(8) for
                                   i in range(updatesize))
            f.write(rand_bytes)

    # Sprinkle some random bytes over a file
    # i.e. change a few bytes here and there
    elif pattern == SPRINKLE:
        with open(fname, 'r+') as f:
            fbytes = bytearray(f.read())

            # Get n random positions based on @p and the size of the
            # original file.
            changes = random.sample(xrange(len(fbytes)),
                                    int(len(fbytes)*p))

            # Update the bytes at the randomly generated positions
            for i in changes:
                fbytes[i] = random.getrandbits(8)

            # Overwrite the file with the updated content
            f.seek(0)
            f.write(fbytes)

Listing 3.1: Code for delta file modifications

During the update of file content in the sprinkle pattern test case, there is a 1/256 probability that fbytes[i] = random.getrandbits(8) writes the same value as the one it currently holds, i.e. the value is not updated. However, it was determined that this flaw was tolerable for the conducted experiments.

Block Size for Delta Updates Early test results indicated that the delta encoding mechanism used by SpiderOak was applied block-wise on a file. This meant that if the changes were spread over several blocks, there would be additional overhead compared to if the changes had pertained to one block only. To find out the size of the blocks, a variation of the gapped updates delta encoding test was performed. For this test, only two changes of one byte each were made to the file, and the distance between the changes was varied between tests. If the two changes were within the same block, the network traffic would correspond to a single block update.

If the changes were in different blocks, the amount of network traffic would be increased (theoretically doubled) compared to that of a single block update. Through binary search, the threshold for how large the distance between two changes could be before they end up located in two different blocks could be found. Assuming that the first block had the same size as every other block, if a change at byte position 0 and position x yielded twice as much network traffic as a change at position 0 and position x − 1, that would indicate a block size of x bytes.
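A sketch of this binary search is shown below; measure_upload_for_distance is a hypothetical stand-in for running the two-change test at a given distance and returning the measured upload in bytes:

    def find_block_size(measure_upload_for_distance, low=1, high=1024 * 1024):
        # A change confined to one block uploads roughly one block of data,
        # while a change spanning two blocks roughly doubles the upload.
        # The search narrows in on the smallest distance where that doubling occurs.
        single_block = measure_upload_for_distance(low)
        while low < high:
            mid = (low + high) // 2
            if measure_upload_for_distance(mid) > 1.5 * single_block:
                high = mid       # two blocks were touched: boundary is at mid or before
            else:
                low = mid + 1    # still one block: boundary is further away
        return low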

3.4 CPU Measurements

The CPU measurements were conducted by uploading a 10 MB file containing random data and taking measurements before, during and after the synchronization process. The experiment was repeated 25 times for each PCS application. To measure CPU, memory and network utilization, the Python module psutil was used. The module allowed for measurements of per-process CPU and memory utilization. The measurements of CPU and memory utilization were executed in a dedicated thread. The code that was run while the thread was actively measuring is presented in Listing 3.2. The method begins with a for loop (line 2) where every running process on the host machine is matched by its name against predefined process names for the different PCS applications, i.e. “Dropbox”, “MEGAclient” and “sync-worker.exe” for the services Dropbox, Mega and Sync.com, respectively. Some services ran multiple processes, e.g. Dropbox ran three processes all named “Dropbox” while Tresorit ran one process named “Tresorit” and another named “TresoritExtension”. The test scripts measured the processes for each service collectively. After the processes for the application under test had been found, a while loop (line 6) ran until the end of the test. During an iteration of the loop, utilization values for CPU and memory from the different processes were added together into their respective variables. These values were saved together with a timestamp and then the thread would sleep for 40 ms before starting the next iteration of the loop. In the code listing, error handling has been omitted for brevity. Tools for measuring per-process network utilization were considered but deemed too imprecise for the experiments; instead, the total network utilization was measured using psutil. Relying on the total network utilization is of course not as specific as per-process network utilization. However, minimizing the number of running processes on the test device by closing all programs except the sync client under test was enough to make the background/noise traffic on the network insignificant compared to the actual sync traffic.

Synchronization Phases To be able to compare CPU utilization between clients, the sync process was categorized into different phases called idle, pre-processing, transfer and cooldown. The idle phase consisted of CPU measurements when the sync client was up-to-date with the cloud storage, i.e. not actively syncing. The pre-processing phase began when a file was copied into the sync folder and continued up until the client started to upload data, which is where the transfer phase took over. The transfer phase lasted as long as data was uploaded to the cloud and was followed by a 5 second cooldown phase before returning to the idle phase. The duration of the cooldown phase, 5 seconds, was chosen arbitrarily and deemed suitable during initial testing. The phases and their transitions are described in a state diagram in Figure 3.3.


 1 def run(self):
 2     for proc in psutil.process_iter():
 3         if proc.name() in self.get_proc_names():
 4             self.procs.append(proc)
 5
 6     while not self.stopit:
 7         memory = 0.0
 8         cpu = 0.0
 9
10         for proc in self.procs:
11             cpu += proc.cpu_percent(None)
12             memory += proc.memory_percent()
13
14         self.measurements.append((time.time(),
15                                   cpu,
16                                   memory))
17         time.sleep(0.04)

Listing 3.2: Code used for CPU and memory measurements

Figure 3.3: The different phases and their transitions during the sync process.

Every measurement of the sync clients’ CPU utilization pertained to one, and only one, of these phases. Further, the duration of each phase, except the idle phase, was measured by taking a timestamp for each measurement.
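As a sketch of how each sample can be mapped to a phase from these timestamps (not the actual test code; the function and parameter names are illustrative), assuming the file insertion time and the first and last upload timestamps are known:

    COOLDOWN_SECONDS = 5

    def classify_sample(t, insert_time, first_upload, last_upload):
        # Assign a single CPU/memory sample taken at time t to a sync phase.
        if t < insert_time:
            return 'idle'
        if t < first_upload:
            return 'pre-processing'
        if t <= last_upload:
            return 'transfer'
        if t <= last_upload + COOLDOWN_SECONDS:
            return 'cooldown'
        return 'idle'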

CPU Volume The samples of CPU utilization combined with their respective timestamps enabled a calculation of the CPU integral, or the CPU volume, which would give comparable values between different clients. For instance, if the transfer phase for one client had a mean CPU utilization of 50% and a duration of 2 seconds and another client had a 100% CPU utilization for 1 second, they would have the same CPU volume. Three methods for calculating the CPU volume were considered for this thesis. The chosen one was also the simplest one, as it multiplied the mean value of the CPU measurements with the duration of the phase.

The other two methods used two integration methods in the SciPy Python module, namely Simpson’s rule and the trapezoidal rule. During initial experiments it was determined that the simple method of multiplying mean and duration had sufficient precision for the experiments, although it would give a slight overestimation compared to the other two methods. When calculating the CPU volumes, the mean value of the CPU utilization during the idle state was subtracted from the mean value of the CPU utilization during the transfer and pre-processing states, respectively. This was done to show how much the operations of pre-processing and transferring files to the cloud affect CPU utilization specifically, without regarding other operations that the PCS application might perform. The calculation of the CPU volume during the transfer state, Vtransfer, is presented as

$$V_{\text{transfer}} = \int_{\text{transferStart}}^{\text{transferEnd}} cpu(t)\,dt \approx \left(\text{mean}(cpu_{\text{transfer}}) - \text{mean}(cpu_{\text{idle}})\right) \cdot \text{transferDuration}.$$

The value for the CPU volume during the pre-processing state was calculated correspondingly.
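A minimal sketch of this calculation over the (timestamp, cpu) samples collected by the measurement thread is shown below; the variable names are illustrative:

    def cpu_volume(samples, idle_mean):
        # samples: list of (timestamp, cpu_percent) tuples from one phase.
        if len(samples) < 2:
            return 0.0
        duration = samples[-1][0] - samples[0][0]
        phase_mean = sum(cpu for _, cpu in samples) / len(samples)
        return (phase_mean - idle_mean) * duration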

CPU Volume Under Equal Network Conditions While the test environment was equivalent for all the tested applications, the actual results were affected by the data centers’ geographical locations. For instance, SpiderOak’s and Sync.com’s data centers are located in North America while the other tested services had data centers located in Europe. The distance to the servers has an impact on the RTTs, and the additional time to reach SpiderOak’s and Sync.com’s servers could give lower upload rates to those services. To mitigate this discrepancy and decrease the impact of geographical location, the network link simulator tool Network Link Conditioner2 was used to level the playing field. The tool can set bandwidth, packet loss and latency for the network interface. Because SpiderOak was the service that had the highest RTT and the lowest throughput, the tool was configured in such a way that the bandwidth and delay matched the conditions of SpiderOak for every other service. For instance, the RTT to SpiderOak’s servers was 145 ms and the RTT to Dropbox’s servers was 20 ms. Therefore, the network link was configured to add 62 ms of delay in each direction of the link when testing Dropbox. Additionally, the network throughput was throttled to 10 Mbps in both directions. The CPU volumes were calculated in the same way as in the original CPU volume tests. However, due to time constraints, these tests were only repeated five times.
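The added delay follows from halving the RTT difference to the slowest service; a small illustrative calculation (not part of the test scripts) is shown below:

    def extra_one_way_delay_ms(rtt_slowest_ms, rtt_service_ms):
        # Half of the RTT difference is added in each direction of the link.
        return (rtt_slowest_ms - rtt_service_ms) / 2.0

    print(extra_one_way_delay_ms(145, 20))  # 62.5, configured as 62 ms per direction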

HTTP vs HTTPS Comparison To measure the CPU utilization impact of using TLS, the MEGAsync client was tested running with default settings (HTTP) as well as with the “Don’t use HTTP” setting enabled (see screenshot in Figure 3.4). The HTTPS setting was tested in the same manner as the CPU utilization tests of Mega with default settings. Each test was repeated 50 times.

2http://nshipster.com/network-link-conditioner/


Figure 3.4: Screenshot of MEGAsync preferences with HTTP disabled.

3.5 Disk Utilization

Because the SpiderOak client creates encrypted local copies of files while they are synced to the cloud servers, measurements of disk utilization, more specifically the number of bytes written to disk, were conducted. The Python module used for the other measurements, psutil, had support for reading disk I/O statistics. Unfortunately, support for per-process I/O counters was not available on macOS, so the measurements had to be performed at the OS level rather than specifically measuring the PCS application processes. The test case was conducted by copying a 300 MB file into the sync folder of the PCS application under test. After the copying operation had completed, the number of bytes written to disk was measured and the sync operation of the application was performed. After the file had been synced to the cloud, the number of bytes written to disk was once again measured and the difference between the start and end value was calculated. The test was repeated 40 times and the mean value was calculated from those test results.
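A sketch of this OS-level measurement using psutil's disk I/O counters is given below; the actual test script differs in structure and wait_for_sync_to_finish is a hypothetical placeholder for the synchronization step:

    import psutil

    def bytes_written_during_sync(wait_for_sync_to_finish):
        # System-wide counter, since per-process I/O counters are not
        # available through psutil on macOS.
        before = psutil.disk_io_counters().write_bytes
        wait_for_sync_to_finish()
        after = psutil.disk_io_counters().write_bytes
        return after - before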

3.6 Memory Measurements

Testing memory utilization was performed by uploading large files of random data. Each file had a size of 300 MB in an attempt to trigger a significant increase in the memory utilization of the PCS clients. The files contained random data to minimize the risk of deduplication. The test case would upload five files consecutively. The motivation for this was to see if memory utilization was increased additively for consecutive uploads, or if it was freed after finishing an upload.


The memory utilization was measured before any upload had occurred, while the client was idling, as well as during each transfer phase of the five file uploads. Measurements were taken every 50 ms, and mean values were calculated from those measurements. The test was repeated 12 times and a mean of means was calculated from those test instances.

3.7 Security in Transit

The network packet analysis tool Wireshark3 was used to determine the security protocol used by the different PCS applications for network transfers. By starting a sync operation for the tested application, e.g. by placing a file in the application’s sync folder, while capturing network traffic with Wireshark, the network traffic could be analyzed to determine which TLS version was used. A screenshot of this process can be seen in Figure 3.5, with lines in red added by the author for emphasis. A similar method was used to determine which CAs were used, by looking at the certificates that the server provided during the TLS handshake at the start of the synchronization process.

Figure 3.5: Screenshot of Wireshark during TLS analysis.

3https://www.wireshark.org/


TLS Interception The PCS applications were tested against a simple MitM attack to see if the TLS encrypted traffic could be intercepted. For this, the tool mitmproxy4 was used, which provides HTTP proxy functionality with the addition of analyzing the traffic that it proxies at the application level. The mitmproxy software was running on a 17.10 machine listening on port 8080 on the same local network as the macOS laptop running the PCS clients. On the PCS client machine, the mitmproxy certificate was added to the macOS Keychain as a trusted root CA. All the tested PCS applications had native support for HTTP proxy configuration and were set up to send application traffic to the proxy’s IP. The network traffic sent while the proxy setup was active, specifically the TLS connection negotiation, was analyzed using Wireshark. This method is not the most sophisticated kind of MitM attack as it requires significant configuration on the client machine. However, the results from this test give a clue as to how the different PCS applications manage foreign TLS certificates and whether they employ techniques like certificate pinning.

3.8 Cloud Storage Traffic Identification

The test scripts captured network traffic during the sync processes. The network capture files were then analyzed in a post-processing step where every captured packet was inspected and either categorized (if the packet belonged to the tested service) or discarded (if the packet belonged to a process that was not relevant to the tested service). To identify if a packet belonged to the tested service, a collection of known Uniform Resource Locator (URL) domains for each service was kept during the period in which the experiments were conducted. Every known URL domain was queried to get the Internet Protocol (IP) addresses pertaining to that URL. The IP addresses were then used to identify which packets from the experiments’ packet captures belonged to the tested service.
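A sketch of this post-processing idea is given below: resolve the known domains to IP addresses and use the resulting set to keep or discard packets. The domain list and the packet attribute are illustrative, not the actual lists used in the experiments:

    import socket

    known_domains = ['dropbox.com', 'dl.dropboxusercontent.com']  # illustrative list

    def resolve_ips(domains):
        ips = set()
        for domain in domains:
            for info in socket.getaddrinfo(domain, 443, 0, socket.SOCK_STREAM):
                ips.add(info[4][0])   # sockaddr tuple, first element is the IP address
        return ips

    def belongs_to_service(packet_dst_ip, service_ips):
        return packet_dst_ip in service_ips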

Classification of Storage and Control Traffic Since most of the tested applications use HTTPS for transferring data, it can be tricky to determine whether traffic is of type storage, i.e. the file to be stored is transferred, or control traffic, i.e. overhead traffic that manages the sync process. A method that was used in the original scripts, although only for the application Wuala, was to see if the distribution of packet sizes for a certain flow had a large portion of big packets. The underlying reasoning is that storage traffic has a lot of big packets since (1) the Maximum Transmission Unit (MTU) is around 1500 bytes and the file to be stored is typically much larger and (2) the application tries to transfer the file in as few packets as possible. The original formula to categorize the traffic was used for this thesis, but with tweaked values for the constants. The code is presented in Listing 3.3. Interpreted, the code says that if more than 40% of the packets in a flow have a size larger than 1300 bytes, then that flow is classified as storage traffic. Histograms of packet size distributions for each service are presented in Appendix A.2. For most services, the storage flows consisted overwhelmingly of packets with a size of 1400 bytes or larger, with two exceptions. First, a storage flow from Dropbox (see Figure A.1a) consisted of about 40% packets with a size of 200 bytes and about 60% packets with a size of 1400 bytes.

4https://mitmproxy.org/


FULL = 1300
THRESHOLD = 0.40

for flow in flows:
    if (1.0 - flow.cdf_packet_size(FULL)) > THRESHOLD:
        flow.traffic_type = STORAGE
    else:
        flow.traffic_type = CONTROL

Listing 3.3: Categorization of traffic flows

Second, the majority of the packets in the storage flow from Google Drive (see Figure A.1b) had a size of 1368 bytes. If not for these two services, the values of FULL and THRESHOLD could probably have been set to something less permissive, like 1400 and 0.80, respectively. It should be noted though that control traffic would typically account for a very small part (<0.5%) of the total traffic upload, so even if the classification were to erroneously classify control traffic as storage traffic, it would have little impact on the test results.

4 Results

This chapter presents the results from the experiments. A summary of the tested PCS applications is presented in Table 4.1. One can see that Dropbox and SpiderOak are among the most feature-rich services, based on the conducted tests, while other services like Google Drive and OneDrive support at most one of the tested features, and only under certain circumstances. In the following sections, key findings from the feature and performance tests are presented and discussed.

4.1 Compression

The results from the test with compressible files are shown in Figure 4.1. The results show three different behaviors: For OneDrive, Mega and Sync.com, the amount of uploaded bytes and the file size had a near 1:1 ratio, indicating no compression. Dropbox, SpiderOak and Tresorit consistently uploaded less data than the original file size, indicating compression. Finally, Google Drive seemed to perform compression up until the file size became larger than 16 MB. Through subsequent testing, it was determined that the threshold for Google Drive’s compression was at 2^24 bytes (about 16.77 MB). Mean compression ratios for the PCS clients that perform compression are presented in Figure 4.1b, denoted with 99.99% confidence intervals. The compression ratio for Google Drive included only data from file sizes equal to or less than 16 MB, i.e. cases where compression is performed. The compression ratio was slightly higher for Google Drive and Tresorit compared to Dropbox and SpiderOak. Still, the compression ratios for these four clients were close to 2, meaning the amount of uploaded bytes accounts for about half the size of the uncompressed file contents.

Table 4.1: Summary of the tested PCS applications

                Performance Features                     Security Features
Service         Compression  Deduplication  Delta Sync   TLS   CSE  Source Code Disclosed
Dropbox         Yes          Yes            Yes          v1.2  No   No
Google Drive    Conditional  No             No           v1.2  No   No
OneDrive        No           Sometimes      No           v1.2  No   No
Mega            No           Yes            No           v1.0  Yes  Yes
SpiderOak       Yes          Yes            Yes          v1.0  Yes  No
Sync.com        No           Yes            No           v1.2  Yes  No
Tresorit        Yes          No             No           v1.2  Yes  No

Figure 4.1: Compression test results for the different PCS applications. (a) Bytes transferred during file uploads with highly compressible files. (b) Mean compression ratios for services that perform compression.

4.2 Deduplication

The results from the deduplication experiment are presented in Table 4.2. Dropbox, Mega, SpiderOak and Sync.com all seem to perform client-side deduplication, as none of these have to upload an identical file that is already stored in the cloud, albeit in another folder and/or with another name. As for Mega, this behavior confirms the information in their TOS about performing deduplication. Google Drive, OneDrive and Tresorit do re-upload identical files. OneDrive showed inconsistent results for the deletion and re-upload scenario, where in 30% of the 40 test runs it would not re-upload the file but presumably “undelete” the file on the cloud storage servers. This was however the only inconsistency found during the deduplication tests. For every other client, deduplication would either occur for all four of the test scenarios or not at all.


Table 4.2: Deduplication test results

Service        Different Name  Different Folder  Different Name and Folder  Deleted
Dropbox        Yes             Yes               Yes                        Yes
Google Drive   No              No                No                         No
OneDrive       No              No                No                         Sometimes
Mega           Yes             Yes               Yes                        Yes
SpiderOak      Yes             Yes               Yes                        Yes
Sync.com       Yes             Yes               Yes                        Yes
Tresorit       No              No                No                         No

4.3 Delta Encoding

The results from the basic delta encoding experiments showed that two of the tested PCS applications, Dropbox and SpiderOak, used delta encoding for file updates. These two services were both able to perform efficient delta uploading when the delta changes were applied in continuous chunks. However, when downloading a delta updated file to a second client, SpiderOak showed worse performance compared to Dropbox. The download using the Dropbox client would amount to the file size, while the download with the SpiderOak client would have significant overhead corresponding to the file size plus the size of the accumulated delta encoding updates. This overhead was present when parts of the original file were kept after the delta updates had been applied, i.e. in the gapped delta updates case (see Figure 3.2c), but not present when the file had been completely overwritten with delta updates, as was the case for the test cases described in Figures 3.2a and 3.2b.

Block Size for Delta Updates and Sprinkle Test The results from the sprinkle tests are shown in Figure 4.2. The method used to find the supposed block size for delta updates was incapable of determining a block size for Dropbox. However, the shape of the curve for Dropbox’s sprinkled updates is similar to the shape of SpiderOak’s sprinkled updates, with the difference that Dropbox managed much larger amounts of sprinkled data compared to SpiderOak. The test results from the block size tests for SpiderOak showed that they used a block size of 2^18 bytes (about 262 kB). This meant that even for a 1 byte change, the whole block would be updated and uploaded to the cloud servers. The expected upload in worst case scenarios, i.e. where every change is placed in a different block, would follow the formula given in Equation 4.1. That meant that if a file consisting of M blocks were updated with M single bytes evenly distributed across every block, the upload would equal the size of the whole file.

$$\text{Expected upload} = \text{BLOCK\_SIZE} \cdot n, \qquad n = \text{number of bytes changed} \tag{4.1}$$


Figure 4.2: Bytes uploaded with sprinkled updates over a 10 MB file for Dropbox and SpiderOak. (a) Bytes uploaded with updates placed in different blocks (worst case) as well as with sprinkled updates over a 10 MB file for SpiderOak; Equation 4.1 is plotted as a solid line and Equation 4.2 as a dashed line. (b) Bytes uploaded with sprinkled updates over a 10 MB file for Dropbox. Notice the different ranges on the x axes for the two services.

For randomly placed updates, such as in the delta sprinkle test, the expected amount of uploaded data is related to the probability that an update is placed in a block which already contains an update. A formula for this is given in Equation 4.2.

$$\text{Expected upload} = \sum_{k=0}^{n-1} \text{BLOCK\_SIZE} \cdot (1 - p)^k, \qquad p = \frac{1}{M}, \tag{4.2}$$

where M is the number of blocks in the file and n is the number of bytes changed.
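A small sketch evaluating Equations 4.1 and 4.2 for SpiderOak's observed block size is given below; the constant names are illustrative and the 10 MB file matches the test setup:

    BLOCK_SIZE = 2 ** 18          # about 262 kB, as measured for SpiderOak
    FILE_SIZE = 10 * 1024 * 1024  # 10 MB test file
    M = FILE_SIZE // BLOCK_SIZE   # number of blocks in the file (40)

    def expected_upload_worst_case(n):
        # Equation 4.1: every changed byte hits a distinct block.
        return BLOCK_SIZE * n

    def expected_upload_random(n, p=1.0 / M):
        # Equation 4.2: the (k+1)-th change lands in a not-yet-updated
        # block, and thus costs a new block, with probability (1 - p)^k.
        return sum(BLOCK_SIZE * (1 - p) ** k for k in range(n))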

The values for the expected uploads as well as the test results from the experiments are shown in Figure 4.2a. For the worst case scenario, the measured values followed the expected values closely. For the random placement experiments, the actual upload amount was slightly above the values given by Equation 4.2. Knowing that SpiderOak performed block-based delta encoding updates with a specific size, a few additional observations were made. First, the blocking behavior was only observable if the total file size was at least twice the block size. Second, delta updates did not match the block size if the file was prepended or appended, i.e. bytes were inserted before the first byte or after the last byte in the file. Instead, the delta update would match the size of the change, even if it was smaller than the block size. Finally, the first modification of a file that was not an append or prepend of data would require different amounts of uploading depending on where in the file the modification was made. A change near the beginning of the file required more data to be uploaded compared to a change near the end of the file.

A possible explanation for this was that a modification in a block affected the succeeding blocks in that file. However, subsequent modifications would only require uploading corresponding to the number of affected blocks.

4.4 CPU Utilization

For each PCS application, the mean CPU utilizations and 95% confidence intervals for the idle and cooldown phases are presented in Figure 4.3. Corresponding results for the pre-processing and transfer phases are presented in Figure 4.4. Additionally, plots from a single test case for each service are shown in Figure 4.7. For results with higher precision, tables with CPU utilization values, phase durations and CPU volumes can be found in Appendix A.3. In the following sections, significant results from the tests for the different phases as well as phase durations and CPU volumes are presented and interpreted.

Idle The CPU utilization when the client was idling was less than 2.5% for every client. Mega and OneDrive were the clients with the lowest CPU utilization while SpiderOak and Tresorit had the highest utilization at about 1.9% and 2.4%, respectively.

Pre-processing Dropbox had a significantly higher CPU utilization during the pre-processing phase compared to the other clients. This seems reasonable as Dropbox supports all three of the tested capabilities for improved synchronization. However, SpiderOak, which supports the same capabilities as Dropbox and also CSE, had a much lower pre-processing CPU utilization. An explanation for this may be that Dropbox was much faster at detecting the file change, and would start processing the file almost immediately. The difference is apparent when looking at the file insertion points in Figures 4.7a and 4.7e. Because of this, the collected pre-processing CPU utilization values for Dropbox would reach high workloads much sooner compared to SpiderOak, which took a long time to detect the file change and would therefore collect many values similar to when the client was in the idle phase.

Transfer The mean CPU utilization values for Dropbox and Tresorit were significantly higher than for the other clients. The fact that they were above 100% indicates not only that the clients are multi-threaded, but that at least two threads of the application were heavily utilizing each CPU core simultaneously. SpiderOak was the only client with a lower transfer CPU utilization compared to its pre-processing utilization. This might be due to the fact that SpiderOak creates a temporary copy of the file that is about to be synced to the cloud. The file has to be compressed, checked for duplication and encrypted before uploading can begin, which could possibly be what led to the high pre-processing utilization compared to the transfer phase utilization. In contrast, the transfer phase utilization for Mega was significantly higher than its pre-processing utilization. This indicates that Mega alternated between applying encryption and uploading data to the cloud.



Figure 4.3: CPU utilization during idle and cooldown phases.

Cooldown All clients showed an increase in CPU utilization immediately after finishing the sync upload compared to their respective idle utilization values. Similar to the transfer phase results, Dropbox and Tresorit had the highest CPU utilization values. SpiderOak was the client with the lowest increase during the cooldown phase compared to its idle value.

Pre-processing and Transfer Duration The mean durations for the pre-processing and transfer phases are presented in Figure 4.5. The duration of the pre-processing phase spanned from just above two seconds for Dropbox up to about 30 seconds for SpiderOak. As for the transfer phase, the ordering between the clients changed. The duration of the transfer phase was directly correlated to the upload speeds of the clients. During testing, Google Drive and Mega showed upload speeds of over 20 MB/s and therefore had the shortest transfer durations. SpiderOak had the slowest upload speeds at about 2 MB/s and consequently got longer transfer durations. The other clients would usually upload files at around 5 MB/s. A possible explanation as to why SpiderOak and Sync.com had the longest transfer durations could be that their data centers are located in the U.S. and Canada, respectively, while the other services have data centers located in Europe.


Figure 4.4: CPU utilization during pre-processing and transfer phases. (a) Linear y-axis. (b) Logarithmic y-axis.


Figure 4.5: Phase durations for the pre-processing and transfer phases.


Figure 4.6: CPU volumes for the pre-processing and transfer phases. (a) Linear y-axis. (b) Logarithmic y-axis.

CPU Volumes The mean CPU volumes, denoted with 95% confidence intervals, for the pre-processing and transfer phases are presented in Figure 4.6. The CPU volumes were generally higher during the transfer phase compared to the pre-processing phase, with two exceptions. Google Drive had a slightly lower transfer CPU volume than pre-processing CPU volume. SpiderOak had a very high CPU volume for the pre-processing phase. Looking at Figure 4.7e, one can see that it takes a very long time for SpiderOak to recognize the file insertion, but after doing so it peaks at about 100% for about 5 seconds before transitioning to the transfer phase. The low CPU volume of Mega has multiple possible explanations. For example, the client doesn’t perform compression and it uses HTTP rather than HTTPS for transmitting data to the cloud servers, which is discussed further in the upcoming section. Finally, among the clients that perform CSE, Mega is the only service that uses the shorter key length, AES-128 rather than AES-256, for encryption purposes.

CPU Volumes Under Equal Network Conditions The results from the CPU utilization test where the PCS applications had similar network conditions with regards to network throughput and RTT are presented in Figure 4.8. Comparing the results with the CPU volumes during non-equal network conditions that were presented in Figure 4.6a, one can see that the CPU volumes during the transfer phase have increased for each client. An explanation for this is the throttled throughput and increased RTT, which increase the duration of the transfer phase. Still, the relative performance between the different clients during the transfer phase has not changed. Dropbox still has the highest CPU volume during transfer, and Google Drive, OneDrive and Mega have the lowest. One exception to this was SpiderOak, which surpassed Sync.com in CPU volume for the transfer phase. During the tests, it seemed that the throughput for SpiderOak was affected by the network tool, even though it was the only service without any delay added to the network link. This affected the transfer phase duration and could explain why it scaled differently compared to the other services.


Figure 4.7: CPU utilization for the tested PCS applications during a single file upload. The left y axis shows the CPU utilization in percent and the right y axis shows the amount of uploaded bytes. (a) Dropbox. (b) Google Drive. (c) Mega. (d) OneDrive. (e) SpiderOak. (f) Sync.com. (g) Tresorit. Notice the different ranges for the x axes in the Mega and SpiderOak plots.


Figure 4.8: CPU volumes during equalized network conditions.

Mega HTTP vs HTTPS Comparison The results from the CPU utilization comparison of Mega with and without TLS are presented in Figure 4.9. The plots show mean values from the 50 test runs, denoted with 95% confidence intervals. For results with higher precision, tables with CPU utilization values, phase durations and CPU volumes for the comparison can be found in Appendix A.3. The difference in CPU utilization is insignificant for the idle and cooldown phases, which is understandable as no or very little network traffic is transferred during these phases. However, during the pre-processing and transfer phases the CPU utilization was clearly increased when HTTPS was used.


Figure 4.9: CPU utilization for Mega with and without TLS. (a) CPU utilization during Idle. (b) CPU utilization during Pre-processing. (c) CPU utilization during Transfer. (d) CPU utilization during Cooldown. (e) CPU volume during Pre-processing. (f) CPU volume during Transfer. (g) Phase durations for the pre-processing and transfer phases.

4.5 Disk Utilization

The average amount of bytes written to disk, denoted with 95% confidence intervals, during a 300 MB file upload for each PCS client is presented in Figure 4.10. In Appendix A.4, exact values can be found in Table A.5. Despite a fairly high variance for some of the clients (Dropbox, OneDrive, Mega), only SpiderOak reached disk writes correlating to the size of the uploaded file. Because the measurements of disk writes were made at the OS level, each measurement included some overhead from other processes that wrote to disk during the file synchronization process. This, in combination with the fact that different services in general had different upload times, with Google Drive being the fastest, SpiderOak the slowest and the other clients somewhere in between, can explain why Google Drive had the lowest amount of bytes written to disk and SpiderOak had a relatively high amount of bytes written to disk even when disregarding the 300 MB written for the temporary local file.



Figure 4.10: Average amount of bytes written to disk during a 300 MB file upload.

4.6 Memory Utilization

The mean values of memory utilization during idle and transfer are shown in Table 4.3. The rightmost column shows the largest difference between the idle memory utilization and any of the file uploads. All values are presented as percentages of the available memory on the test laptop. With the host machine having 8 GB of RAM, 1% equals about 80 MB. The memory utilization for each PCS client from a single test run is shown in Figure 4.11. Dropbox, Google Drive and SpiderOak are the clients with the highest base memory utilization while Mega, OneDrive and Sync.com have the lowest memory utilization. During file uploads, Dropbox and Tresorit have the highest increase in memory utilization, about 0.65%, which corresponds to about 50 MB of memory on the test laptop, compared to their base values. Mega and OneDrive have a relatively small increase at just over 0.1%. By looking at the memory utilization graphs for each client in Figure 4.11, some properties can be seen for the different clients. For instance, Sync.com seems to have a rapid fluctuation of memory utilization during file uploads. Dropbox, Mega and Tresorit had a stable memory utilization throughout the whole test procedure and the same can be said for Google Drive and OneDrive, with the addendum that they showed an occasional drop and re-acquisition of memory during the test. On a side note, by looking at the slope of the uploaded bytes plot it can be seen in Figure 4.11c that the upload speed for Mega occasionally decreases after a certain amount of continuous uploading.


Figure 4.11: Memory utilization for the tested PCS applications during five consecutive file uploads. The left y axis shows the memory utilization in percent and the right y axis shows the amount of uploaded bytes. (a) Dropbox. (b) Google Drive. (c) Mega. (d) OneDrive. (e) SpiderOak. (f) Sync.com. (g) Tresorit. Notice the different ranges for the x axes.

Table 4.3: Mean Memory Utilization (%)

                        Uploads                                Max
Service        Idle     1st    2nd    3rd    4th    5th     Increase
Dropbox        2.22     2.86   2.89   2.86   2.88   2.89    0.68
Google Drive   2.17     2.44   2.59   2.55   2.64   2.72    0.55
OneDrive       0.62     0.72   0.72   0.72   0.73   0.73    0.11
Mega           0.57     0.64   0.65   0.67   0.68   0.68    0.11
SpiderOak      1.76     1.99   1.99   2.00   2.00   1.99    0.24
Sync.com       0.69     0.98   1.05   1.15   1.22   1.27    0.58
Tresorit       1.24     1.80   1.82   1.84   1.82   1.83    0.60

Table 4.4: Certificate Authorities used by the PCS applications

Dropbox: DigiCert SHA2 High Assurance Server CA → DigiCert SHA2 Extended Validation Server CA
Google Drive: Google Internet Authority G3 → GlobalSign Root CA - R2
OneDrive: Microsoft IT TLS CA 4 → Baltimore CyberTrust Root
Mega (TLS Enabled): COMODO RSA Organization Validation Secure Server CA → COMODO RSA Certification Authority → AddTrust External CA Root
SpiderOak: ssl@.com
Sync.com: RapidSSL RSA CA 2018
Tresorit: Microsoft IT TLS CA 1 → Baltimore CyberTrust Root

4.7 Security in Transit

All clients support TLS for encrypted network transmissions. It should be noted though that Mega doesn’t have TLS enabled by default but rather provides it as an optional feature (see Figure 3.4). Further, most clients run the latest version (v1.2) of TLS, except for Mega and SpiderOak which run the earlier 1.0 version of TLS. The certificate chains and the CAs for the respective services are presented in Table 4.4. Chains denoted with arrows (→) are presented bottom-up, i.e. from the intermediate certificate used to sign the end-user certificate (the application server’s) up to the root CA. From the table, we see that OneDrive and Tresorit use Microsoft certificates. This is expected as Tresorit is built upon the Azure cloud solution owned by Microsoft. The Microsoft certificates are in turn signed by the Baltimore CyberTrust Root certificate, which is managed by DigiCert. Dropbox uses certificates signed by DigiCert, and Sync.com is using a RapidSSL certificate which is in turn also signed by DigiCert. When Mega is configured to use TLS, it uses certificates issued by Comodo. Google and SpiderOak do not use certificates from third party CAs but instead provide certificates signed by themselves.


Figure 4.12: Mega warning dialog boxes when trusting foreign TLS certificates. (a) The first warning; clicking “I don’t care” leads to the second warning. (b) The second warning informs about the risks as well as the limitations of Mega’s end-to-end encryption.

TLS Interception The MitM test with mitmproxy showed that most of the PCS applications would prevent the TLS connection negotiation from succeeding when a foreign TLS certificate was used. Several of the clients, Dropbox, Google Drive and Tresorit, sent a TLS alert message with code 48 (Unknown CA) to indicate that the certificate sent from the mitmproxy server was untrusted. OneDrive sent another alert message with code 86 (Inappropriate Fallback). SpiderOak and Sync.com would not complete the TLS negotiation and did not send any TLS alert messages either. For Dropbox and SpiderOak, which openly claim that they use certificate pinning [58], [59], this behavior was expected, and it seems that the other abovementioned clients use the same or similar techniques to prevent MitM attacks. Mega was the only client which allowed connection establishment with an unknown certificate. However, for the connection to be established, two dialog boxes with warnings (see Figure 4.12), informing the user of the situation and the potential risks of trusting the mitmproxy certificate, had to be bypassed.

5 Discussion

This chapter is divided into three sections. In the first section, significant findings from the experiments presented in Chapter 4 are interpreted, examined and/or discussed. Then, the method chosen for this thesis is discussed, scrutinized and, where applicable, criticized. Finally, the thesis is discussed in a wider perspective with regards to societal and ethical aspects.

5.1 Results

This section discusses particularly interesting results from the CPU and memory measurement tests, as well as the results from PCS capability tests for compression, deduplication and delta encoding.

Features in Cloud Storage Applications The results show large variations in which and how many features the different PCS applications support. Google Drive and OneDrive do in some cases support none of the tested features, while Dropbox and SpiderOak support all of them. Of course, it is possible that Google Drive and OneDrive apply compression and deduplication on the server side to be able to provide a leaner and less feature-packed client application. The usage of CSE is not a deal-breaker with regard to whether the features can be implemented, as SpiderOak supports the same set of tested features as Dropbox. However, CSE does have implications for some of these features, which will be discussed later in this section.


Delta Encoding The results from the delta encoding tests showed multiple interesting differences between the two services that use delta encoding. While Dropbox was able to keep update sizes close to the optimum, except during the sprinkle tests, SpiderOak would in certain cases require unexpectedly large updates. Why the first update requires extra upload traffic could not be determined, but it is possible that it is related to the block design that was implied to be used by the SpiderOak client. Hypothetically, SpiderOak might not divide a file into blocks during the initial upload of the file and instead have to apply the blocking when the first file modification occurs, which could be an explanation for the additional upload traffic during that first update. It is probable that SpiderOak would be able to perform the delta encoding more efficiently if it wasn’t for the use of CSE. Since the Dropbox service has full knowledge of the data that is stored, it can apply delta updates directly. On the other hand, SpiderOak is not able to apply the delta encodings on the server side but must forward that operation to the clients. This constraint raises the question of whether implementing delta encoding in CSE clients is worth the effort and additional CPU utilization.

CPU Utilization The results from the CPU utilization tests showed great variation between the clients, and the differences cannot be attributed to single features (like CSE) alone but rather to a combination of multiple factors regarding both features and networking conditions.

Pre-processing and Transfer Duration One of the factors used for the calculation of CPU volume was the duration spent in the pre-processing and transfer phases. This was an important factor to be able to compare clients that applied different operations on the files, for instance compression and encryption, at different speeds. However, two key properties that were unrelated to these file operations but significant for the actual durations of each phase were the time to recognize file changes and the upload speed to the cloud servers. In Figure 4.7a it can be seen how the CPU utilization for the Dropbox application is immediately increased as soon as the file is inserted into the sync folder. Other clients like SpiderOak and Tresorit have significantly higher durations, but looking at Figures 4.7e and 4.7g it seems that they are generally slow at recognizing file changes in their respective sync folders rather than being slow at performing pre-processing tasks such as compression and encryption. What makes a PCS application fast at identifying file changes in the sync folder is dependent on the implementation of the application. However, it was a bit surprising that the two clients with the highest CPU utilization in the idle phase, SpiderOak and Tresorit, were the slowest at reacting to file changes. The upload speed for the different PCS applications can be inferred by looking at the slope of the uploaded bytes graphs in Figures 4.7 and 4.11. It can be seen that Google Drive had fast uploads for both the 10 MB files used for the CPU utilization test and the 300 MB files used for the memory utilization test, compared to most of the other clients. A contributing factor for this is Google’s geographically nearby servers with good performance, which enabled upload speeds usually around 20-30 MB/s.


CPU Volumes The CPU volumes seem to correlate with the number of supported features of the tested PCS applications. For instance, SpiderOak and Dropbox had the highest CPU volumes in the pre-processing and transfer phases, respectively. Likewise, the CSE applications Sync.com and Tresorit had higher CPU volumes for the transfer phase compared to the non-CSE supporting applications Google Drive and OneDrive. On the other hand, Mega showed low CPU utilization and volume values despite performing both CSE and deduplication checking. However, this can be explained by Mega’s decision to prioritize good performance over “redundant security”, as they are the only CSE-supporting provider that uses AES-128 instead of AES-256 as well as running HTTP without TLS. As was seen in Figure 4.9f from Mega’s HTTP vs HTTPS comparison, the CPU volume during the transfer phase was nearly doubled when Mega was set to use HTTPS instead of HTTP.

Disk and Memory Utilization The results from the disk utilization measurements indicate that SpiderOak is the only client that saves a temporary copy of the data to disk. For users that store large files and have little free space on their hard drives, the necessity of temporarily doubling the space required for large files may be unsatisfactory, and that is perhaps why this seems to be a relatively uncommon behavior. The low disk utilization in combination with the relatively low memory utilization suggests that the other PCS applications perform pre-processing tasks incrementally and in parallel with network transfers, as was suggested by Hu et al. [56].

Security in Transit Most services employ the industry standard version of TLS (1.2). However, Mega’s and SpiderOak’s use of TLS 1.0 may raise some questions. The Payment Card Industry Security Standards Council (PCI SSC) has deemed TLS 1.0 insecure, with a migration completion date set to 30 June 2018 [60], due to known exploits such as POODLE [61] and Heartbleed [62]. Still, the necessity of TLS for already encrypted data, as is the case for those services that support CSE, can be argued. Applying additional layers of encryption can be seen as redundant, and Mega’s claim that “HTTP is satisfactory as all transfers are already encrypted” (see Figure 3.4) and Sync.com’s claim that “Sync [...] does not rely on SSL for any meaningful security, as SSL on it’s [sic] own cannot be trusted” [47] are both valid from that point of view. Further, the significant increase in CPU utilization when enabling TLS on the Mega client acts as a counterargument, where it might be preferable to use HTTP without TLS, which would give lower energy consumption.

TLS Interception Certificate pinning was shown to be an effective technique to mitigate the MitM attack performed with mitmproxy. However, certificate pinning, which is built on Public Key Pinning (PKP), might be facing deprecation in the near future. At the time of writing, engineers are in the process of deprecating PKP support in the Chromium project [63] in favor of another technique called Certificate Transparency [64]. Nevertheless, protection against MitM attacks using counterfeit certificates will most likely continue to be implemented in PCS applications in the future, and therefore other, more sophisticated, methods must be used if one wishes to successfully intercept TLS traffic.

The warning that Mega presents to the user when their SSL key can’t be verified states that “[Attackers can] blindly move your files and folders in MEGA, delete them or deny you access to your account.” This raises the question of whether Mega is susceptible to a MitM attack, especially since these warnings can be suppressed. Theoretically, if an attacker were to get one minute to configure Mega to use a proxy and suppress future warnings, the user would have to enter the preferences settings to detect the infringement, something that the average user probably doesn’t do in the daily use of the service. However, to what extent an attacker can corrupt user data through this type of attack is out of scope for this thesis and left for future studies.

5.2 Method

This section brings up potential flaws and drawbacks in the methodology chosen for this thesis. While the methodology was applied in such a way that the validity and reliability would be as high as possible, there were still limitations due to time and resource constraints.

Test Environment and Tools The experiments were performed on macOS. The platform has a market share of about 8% on the desktop and laptop market, which is significantly smaller than that of the predominant Windows platform, whose market share is at nearly 90% [65]. The decision to use macOS as the platform for testing was mainly based on personal preferences of the author of this thesis. However, previous studies of PCS applications have used either Windows or Android, and as such this thesis may fill a gap by also including macOS as a testing platform. The method used for identification of cloud storage traffic had some potential weaknesses. By relying on pre-made lists of known IP addresses, which in turn were based on lists of known hostnames, it is possible that some traffic was unrecognized even though it belonged to the cloud service. The hostnames and IP addresses were collected under careful consideration and, where applicable, in accordance with the clients’ documentation. However, nothing prevents a client from switching to another host at any given time, and without knowing the behavior of the client (by studying the source code or similar) it is impossible to know when such an event may occur. The experiments conducted for this thesis are heavily based on the test scripts that Drago et al. [53] used for their benchmarking study of PCS applications. Those scripts were then modified to fit the test environment for this thesis. Apart from being proven in use and to some extent being audited by the author of this thesis, the test scripts have not gone through any deeper analysis to verify their correctness and aptitude. To limit the potential risk of interference between the PCS applications and the test scripts, which were run simultaneously on the same machine during testing, one could isolate the application under test and the test scripts to different CPU cores. On Linux, this can be achieved with the taskset command. However, for this method to be applicable, the tests should preferably be performed on a machine with at least four CPU cores, so that the PCS application can get two dedicated cores and take advantage of potential multi-core features.

46 5.3. The Work in a Wider Context a machine with at least four CPU cores so that the PCS application can get two dedicated cores to be able to take advantage of potential multi-core features. The experiments were repeated multiple times for all test cases to avoid coincidental test results and increase validity. However, the tests were conducted exclusively during office hours, especially during lunch hour (12:00 to 13:00). To mitigate the effect that upload speed changes depending on the hour of the day [49], the experiments could have been conducted at different times although that approach could instead have increased the deviation of the results.
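As a sketch of the core-isolation approach mentioned above, the snippet below starts the application under test on two dedicated cores via taskset and pins the measurement script to the remaining cores. The application path, script name and core numbers are hypothetical, and both taskset and os.sched_setaffinity are Linux-specific, so this would not apply unchanged to the macOS environment actually used in the experiments.

```python
import os
import subprocess

# Hypothetical paths and core assignments, for illustration only.
PCS_APP = ["/opt/example-pcs/pcs-client"]    # application under test
TEST_SCRIPT = ["python3", "run_benchmark.py"]

# Give the PCS application two dedicated cores (0 and 1) via taskset.
app_proc = subprocess.Popen(["taskset", "--cpu-list", "0,1"] + PCS_APP)

# Pin the current process (and the test script it spawns, which inherits
# the affinity mask) to the remaining cores (2 and 3).
os.sched_setaffinity(0, {2, 3})
subprocess.run(TEST_SCRIPT, check=True)

app_proc.terminate()
```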

File Formats in Cloud Storage

The files used for the conducted experiments consisted of either pure text (collections of random words) or randomized bytes. The use of random data was preferable to avoid coincidental deduplication (a small sketch of such file generation is given at the end of this subsection). However, the general use of cloud storage is naturally not for storing random bytes. Liu et al. [66] analyzed a snapshot dataset containing millions of files from a cloud storage solution used at Tsinghua University by hundreds of users (mostly college students). They found that, in terms of frequency, the most popular file types among the users were JPEG and GIF, both image formats. Considering that certain file types are more popular than others, CSPs may implement features designed specifically for such types. For instance, Dropbox has open sourced Lepton, a streaming image compression format used for lossless compression of JPEG images. Horn et al. [67] showed how Lepton achieved a 23% compression of JPEG images on average when deployed on the Dropbox production file system. In their paper, they mention that JPEG files account for 35% of Dropbox’s file system and that another 40% is comprised of video files using the compression standard H.264. With file type specific management of cloud storage, the outcome of the experiments conducted for this thesis could possibly have been different if other file types had been considered.
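The sketch below shows how such random test files can be generated; the file name and size are arbitrary examples rather than the exact files used in the experiments. Because the content comes from os.urandom, it is effectively incompressible and will not coincide with previously stored chunks, which avoids accidental deduplication.

```python
import os

def write_random_file(path: str, size_bytes: int) -> None:
    """Write size_bytes of random data to path.

    Random content is effectively incompressible and will not match any
    previously uploaded chunk, so coincidental deduplication is avoided.
    """
    chunk = 1024 * 1024  # write in 1 MB pieces to keep memory use low
    with open(path, "wb") as f:
        remaining = size_bytes
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(os.urandom(n))
            remaining -= n

# Example: a 10 MB file, similar in size to the smaller test uploads.
write_random_file("random_10MB.bin", 10 * 1024 * 1024)
```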

5.3 The Work in a Wider Context

Sharing of personal data with companies, and the responsibilities of those companies when handling the data, have been hot topics in recent times. Earlier in 2018, the so-called Facebook-Cambridge Analytica data scandal was unraveled to the public. The scandal revolves around events where the data mining and analysis firm Cambridge Analytica was able to collect data about Facebook users through a personality quiz application [68]. While sharing of personal data is to a large degree what makes Facebook usable and enjoyable, it can be hard for the average user to determine how the data is being used. As users, we must decide if the benefits from sharing personal data are worth the risks associated with the sharing. This holds true for both social media platforms and personal cloud storage applications. Cloud storage solutions have enabled people to store their most precious memories in a secure, flexible and convenient manner. Since companies are at times unable to protect their users’ integrity, it is up to the user to decide if some of the convenience should be yielded for additional protection such as CSE. As users of PCS applications, we have the privilege of being able to choose from a wide variety of CSPs. Whom we choose to entrust with handling our data is a hard decision, and perhaps we should choose different actors depending on the content of our data.

During the writing of this thesis, the General Data Protection Regulation (GDPR) came into force on the 25th of May, 2018. The GDPR regulates how companies should design and manage their services with regard to data privacy. The regulation covers every individual within the EU and as such it affects CSPs operating in that region. Most of the services included in this thesis have made public announcements [69]–[73] on their work to become GDPR compliant in time for the date of enforcement. The GDPR does not specify particular algorithms or methods for protecting personal data, and both CSE and non-CSE supporting providers can comply with the GDPR. Still, the CSE supporting providers like to paint a picture of how their privacy policies can be made GDPR compliant with little effort, if not compliant already, thanks to CSE. Nevertheless, stronger regulation on data privacy from state authorities puts pressure on CSPs to not only provide feature rich and usable services to their users, but also protect the users’ privacy and integrity with great care.

Trust in Cloud Storage

This thesis puts special focus on cloud storage services that offer CSE as a means of protecting the users’ integrity. While the mathematical methods for performing encryption are well suited for the task at hand, i.e. preventing unauthorized access to data, the context in which these methods are applied is important if one wishes to understand where the trust is put for storing the data securely. In the preface to his book Secrets and Lies [74], Bruce Schneier discusses the relation between the mathematics of cryptography and the context in which it is applied:

“Mathematics is perfect; reality is subjective. Mathematics is defined; computers are ornery. Mathematics is logical; people are erratic, capricious, and barely comprehensible.”

The CSE supporting services tested in this thesis all use AES for encrypting user data. However, users of these services do not only trust the protection of AES, but also that the service providers implement the encryption correctly. Considering the orneriness of computers and the erratic nature of humans, to use Schneier’s words, these applications should perhaps not earn our trust unquestioningly. Grothe et al. [75] performed a security analysis of Tresorit and of the claim that usage of Tresorit required trust in either Tresorit or Microsoft Azure, but not both. This type of trust split can be seen as a security mechanism, as it requires more parties to breach their policies for the security model to be broken. However, the authors showed that this claim was wrong and that the usage of Tresorit required trust in Tresorit regardless of whether Azure is trusted or not. In their conclusion, they make a broader statement that “In case a cloud provider does not reveal its protocols or their client software source code, users are bound to their claims.” Hopefully, secure, open and perhaps even trustless models for cloud storage will become available in the future and, most certainly, there will at least be a demand for those types of solutions.

6 Conclusion

This thesis gives insight into how personal cloud storage applications work. Through controlled experiments, seven different applications were tested to uncover which features they support and how they utilize memory and CPU resources. The survey of which features are supported in which applications gives a better understanding of why the tested applications perform differently from each other with regard to CPU utilization.

The conducted experiments measure and analyze the behavior of the PCS applications during the synchronization process of local files to the cloud servers. Network traffic and performance metrics such as CPU, memory and hard drive utilization were measured simultaneously to give precise data during the different stages of the synchronization process. This thesis uses a novel metric, called CPU volume in this work, where CPU utilization is measured over time. This enabled us to get comparable values between measurements of the applications’ synchronization processes, despite different network conditions between the applications.

The thesis puts focus on differentiating between applications that support client-side encryption (CSE) of user data and applications that do not. Results show that while applications with CSE support tend to be more CPU intensive than applications without CSE support, exceptions do exist. Alongside CSE, many other features correlate with increased CPU utilization. For instance, compression, client-side deduplication and delta encoding are all features that require additional computational effort on the cloud storage clients. Therefore, it is hard to determine exactly how great of an impact CSE has on the CPU utilization.

Further, our results show that the efficiency of delta encoding can be limited by CSE when client data is shared among multiple devices or users. As a CSE provider is unaware of the content it stores, it cannot apply delta encoding updates of a file directly on the server side and instead has to relay that work to the clients. This limitation increased network utilization during sharing of data between devices when compared to a cloud storage provider that has knowledge about the data it stores.
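For the CPU volume metric referred to above, a minimal sketch is given below. It assumes that the metric is obtained by accumulating CPU utilization samples, taken at a fixed interval, over the duration of a synchronization phase; the sample values and the one-second interval are illustrative only.

```python
def cpu_volume(samples, interval_seconds=1.0):
    """Accumulate CPU utilization (in percent) over time.

    samples is a sequence of utilization readings taken at a fixed
    interval; the result is expressed in percent-seconds, so a process
    holding 50% CPU for 10 seconds yields the same volume as one
    holding 100% CPU for 5 seconds.
    """
    return sum(samples) * interval_seconds

# Illustrative readings during a hypothetical transfer phase.
transfer_samples = [12.0, 85.5, 91.0, 88.2, 40.3, 5.1]
print(f"CPU volume: {cpu_volume(transfer_samples):.1f} percent-seconds")
```

Because the metric accumulates utilization over the whole phase, a slow transfer with modest CPU use can end up with a volume comparable to a fast transfer with high CPU use, which is what makes the values comparable across different network conditions.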

6.1 Future Work

This thesis covers PCS applications in general, with specific focus on those clients that support CSE. Given more time, adding more clients to the study could have provided additional insights into how PCS applications operate and how they differ from each other. The experiments conducted in this thesis test the performance and feature support on the client side of the services. It would be interesting to test server-side, API and network protocol behavior of the services as well. However, that would require additional tests and significant changes to the methodology. The CSE supporting PCS applications could also be audited in ways to determine if and how the encryption is performed. Similar to the study by Wilson and Ateniese [24] as well as a study by Kholia and Węgrzyn [76], decompilation tools can be used to reverse engineer applications and uncover specifics of the implementation. Kholia and Węgrzyn were also able to monkey patch the SSL code used by the Dropbox application and could therefore intercept the encrypted network traffic. This would be particularly interesting for the CSE clients that do not publish their source code to the public, i.e. all except Mega in this thesis, to see how they do encryption key management and similar tasks.

Bibliography

[1] Dropbox Inc. (2016-03-07). Celebrating half a billion users | Dropbox Blog, [Online]. Available: https://blogs.dropbox.com/dropbox/2016/03/500-million/ (visited on 2017-11-14).
[2] Ben Popper, Inc. (2017-05-17). Google announces over 2 billion monthly active devices on Android - The Verge, [Online]. Available: https://www.theverge.com/2017/5/17/15654454/android-reaches-2-billion-monthly-active-users (visited on 2018-01-30).
[3] Cisco Public, “Cisco Global Cloud Index: Forecast and Methodology, 2015–2020,” Tech. Rep., 2016. [Online]. Available: https://www.cisco.com/c/dam/en/us/solutions/collateral/service-provider/global-cloud-index-gci/white-paper-c11-738085.pdf.
[4] Dropbox Inc. (2016-12-08). Dropbox Terms of Service, [Online]. Available: https://www.dropbox.com/terms (visited on 2018-01-30).
[5] Google LLC. (2017-09-25). Google Terms of Service, [Online]. Available: https://www.google.com/intl/en/policies/terms/ (visited on 2018-01-30).
[6] Arash Ferdowsi - Dropbox Inc. (2011-06-20). Yesterday’s Authentication Bug, [Online]. Available: https://blogs.dropbox.com/dropbox/2011/06/yesterdays-authentication-bug/ (visited on 2018-01-30).
[7] G. Greenwald, E. MacAskill, L. Poitras, S. Ackerman, and D. Rushe, “Microsoft handed the NSA access to encrypted messages,” The Guardian, 2013-07-12. [Online]. Available: https://www.theguardian.com/world/2013/jul/11/microsoft-nsa-collaboration-user-data (visited on 2018-01-30).
[8] E. Bocchi, I. Drago, and M. Mellia, “Personal Cloud Storage Benchmarks and Comparison,” IEEE Transactions on Cloud Computing, vol. 5, no. 4, pp. 751–764, 2017.


[9] P. Mell, T. Grance, et al., “The NIST Definition of Cloud Computing,” 2011.
[10] A. Wheeler and M. Winburn, Cloud storage security: A practical guide. Elsevier, 2015.
[11] N. Singhal and J. Raina, “Comparative Analysis of AES and RC4 Algorithms for Better Utilization,” International Journal of Computer Trends and Technology, vol. 2, no. 6, pp. 177–181, 2011.
[12] D. S. A. Elminaam, H. M. Abdual-Kader, and M. M. Hadhoud, “Evaluating the Performance of Symmetric Encryption Algorithms,” International Journal of Network Security, vol. 10, no. 3, pp. 216–222, 2010.
[13] D. A. McGrew and J. Viega, “The Security and Performance of the Galois/Counter Mode (GCM) of Operation,” in Proceedings of the International Conference on Cryptology in India, Springer, 2004, pp. 343–355.
[14] P. Mahajan and A. Sachdeva, “A Study of Encryption Algorithms AES, DES and RSA for Security,” Global Journal of Computer Science and Technology, vol. 13, no. 15, 2013, issn: 0975-4172.
[15] P. Prajapati, N. Patel, R. Macwan, N. Kachhiya, and P. Shah, “Comparative Analysis of DES, AES, RSA Encryption Algorithms,” International Journal of Engineering and Management Research, vol. 4, no. 1, pp. 292–294, 2014.
[16] A. Al Hasib and A. A. M. M. Haque, “A Comparative Study of the Performance and Security Issues of AES and RSA Cryptography,” in Proceedings of the International Conference on Convergence and Hybrid Information Technology, vol. 2, 2008, pp. 505–510.
[17] R. Whitwam. (2011-09-20). How convergent encryption makes Bitcasa’s infinite storage possible, [Online]. Available: http://www.extremetech.com/computing/96693-how-convergent-encryption-makes-bitcasas-infinite-storage-possible (visited on 2018-02-06).
[18] I. Drago, M. Mellia, M. M. Munafò, A. Sperotto, R. Sadre, and A. Pras, “Inside Dropbox: Understanding Personal Cloud Storage Services,” in Proceedings of the ACM SIGCOMM Conference on Internet Measurement, 2012, pp. 481–494.
[19] Z. Li, Y. Dai, G. Chen, and Y. Liu, “Towards Network-level Efficiency for Cloud Storage Services,” in Content Distribution for Mobile Internet: A Cloud-based Approach, Springer, 2016, pp. 167–196.
[20] A. Tervort. (2017-11-22). ShareRooms and No Knowledge - SpiderOak Support, [Online]. Available: https://support.spideroak.com/hc/en-us/articles/115001854223-ShareRooms-and-No-Knowledge (visited on 2018-02-05).
[21] I. Lám, S. Szebeni, and L. Buttyán, “Invitation-oriented TGDH: Key management for dynamic groups in an asynchronous communication model,” in Proceedings of the IEEE International Conference on Parallel Processing Workshops, 2012, pp. 269–276.
[22] ——, “Tresorium: Cryptographic file system for dynamic groups over untrusted cloud storage,” in Proceedings of the IEEE International Conference on Parallel Processing Workshops, 2012, pp. 296–303.
[23] I. Lám, S. Szebeni, and T. Koczka, Client-side encryption with DRM, US Patent 9,129,095, 2015-09-08.


[24] D. C. Wilson and G. Ateniese, ““To Share or not to Share” in Client-Side Encrypted Clouds,” in Proceedings of the International Conference on Information Security, Springer, 2014, pp. 401–412.
[25] J. Schmidhuber and S. Heil, “Sequential neural text compression,” IEEE Transactions on Neural Networks, vol. 7, no. 1, pp. 142–146, 1996.
[26] D. T. Meyer and W. J. Bolosky, “A Study of Practical Deduplication,” ACM Transactions on Storage, vol. 7, no. 4, p. 14, 2012.
[27] M. O. Rabin et al., Fingerprinting by random polynomials. Center for Research in Computing Techn., Aiken Computation Laboratory, Univ., 1981.
[28] R. N. Widodo, H. Lim, and M. Atiquzzaman, “A new content-defined chunking algorithm for data deduplication in cloud storage,” Future Generation Computer Systems, vol. 71, pp. 145–156, 2017.
[29] D. Harnik, B. Pinkas, and A. Shulman-Peleg, “Side Channels in Cloud Services: Deduplication in Cloud Storage,” IEEE Security & Privacy, vol. 8, no. 6, pp. 40–47, 2010.
[30] G. Lee, H. Ko, and S. Pack, “An Efficient Delta Synchronization Algorithm for Applications,” IEEE Transactions on Services Computing, vol. 10, no. 3, pp. 341–351, 2017.
[31] R. M. Al-Sayyed, F. F. Namous, A. H. Alkhalafat, B. Al-Shboul, and S. Al-Saqqa, “New Synchronization Algorithm Based on Delta Synchronization for Compressed Files in the Mobile Cloud Environment,” International Journal of Communications, Network and System Sciences, vol. 10, no. 04, p. 59, 2017.
[32] D. Naylor, A. Finamore, I. Leontiadis, Y. Grunenberger, M. Mellia, M. Munafò, K. Papagiannaki, and P. Steenkiste, “The Cost of the S in HTTPS,” in Proceedings of the ACM International Conference on emerging Networking Experiments and Technologies, 2014, pp. 133–140.
[33] Symantec Corporation, “Certificate Pinning,” Tech. Rep., 2017. [Online]. Available: https://www.symantec.com/content/dam/symantec/docs/white-papers/certificate-pinning-en.pdf (visited on 2018-05-10).
[34] Cloudwards.net. (2018-02-08). Best Cloud Storage Providers of 2018, [Online]. Available: https://www.cloudwards.net/comparison/ (visited on 2018-04-10).
[35] Dropbox Inc. (). Where does Dropbox store my data? – Dropbox, [Online]. Available: https://www.dropbox.com/help/security/physical-location-data-storage (visited on 2018-04-13).
[36] P. Lee. (2014-07-01). Open Sourcing Our Go Libraries | Dropbox Tech Blog, [Online]. Available: https://blogs.dropbox.com/tech/2014/07/open-sourcing-our-go-libraries/ (visited on 2018-04-13).
[37] Google LLC. (2018). locations – Google, [Online]. Available: https://www.google.com/about/datacenters/inside/locations/index.html (visited on 2018-04-13).
[38] Microsoft. (2018). Microsoft Trust Center | Where your data is located, [Online]. Available: https://www.microsoft.com/en-us/trustcenter/privacy/where-your-data-is-located (visited on 2018-04-13).


[39] MEGA. (2018). Mega Limited, [Online]. Available: https://github.com/meganz (visited on 2018-02-21).
[40] ——, (2018). MEGA - Developers Documentation, [Online]. Available: https://mega.nz/doc (visited on 2018-02-21).
[41] MEGA Limited. (2016-01-20). Terms of Service - MEGA, [Online]. Available: https://mega.nz/terms (visited on 2018-03-01).
[42] SpiderOak Inc. (2018). Encryption White Paper | SpiderOak, [Online]. Available: https://spideroak.com/resources/encryption-white-paper (visited on 2018-04-18).
[43] ——, (2018). No Knowledge, Secure-by-Default Products | SpiderOak, [Online]. Available: https://spideroak.com/no-knowledge/ (visited on 2018-04-18).
[44] A. Tervort. (2017-12-30). Disk Space Use During File Backup - SpiderOak Support, [Online]. Available: https://support.spideroak.com/hc/en-us/articles/115001891163-Disk-Space-Use-During-File-Backup (visited on 2018-04-18).
[45] ——, (2017-12-13). Datacenter Locations and Certifications - SpiderOak Support, [Online]. Available: https://support.spideroak.com/hc/en-us/articles/115001932526-Datacenter-Locations-and-Certifications (visited on 2018-03-15).
[46] Sync.com Inc. (2018). Where are your servers located? [Online]. Available: https://www.sync.com/help/where-are-your-servers-located/ (visited on 2018-02-06).
[47] ——, “Privacy White Paper,” Tech. Rep., 2015. [Online]. Available: https://www.sync.com/pdf/sync-privacy.pdf (visited on 2018-02-06).
[48] Tresorit. (2018). Cloud + Encryption | End-to-End Encrypted Cloud Storage, [Online]. Available: https://tresorit.com/security/encryption (visited on 2018-03-20).
[49] R. Gracia-Tinedo, M. S. Artigas, A. Moreno-Martinez, C. Cotes, and P. G. Lopez, “Actively Measuring Personal Cloud Storage,” in Proceedings of the IEEE International Conference on Cloud Computing, 2013, pp. 301–308.
[50] T. Mager, E. Biersack, and P. Michiardi, “A Measurement Study of the Wuala On-line Storage Service,” in Proceedings of the IEEE International Conference on Peer-to-Peer Computing, 2012, pp. 237–248.
[51] Wuala. (2018). Wuala cloud storage was shut down, offers Tresorit as a secure alternative, [Online]. Available: https://wuala.com/ (visited on 2018-03-08).
[52] Y. Cui, Z. Lai, and N. Dai, “A First Look At Mobile Cloud Storage Services: Architecture, Experimentation, and Challenges,” IEEE Network, vol. 30, no. 4, pp. 16–21, 2016.
[53] I. Drago, E. Bocchi, M. Mellia, H. Slatman, and A. Pras, “Benchmarking Personal Cloud Storage,” in Proceedings of the ACM SIGCOMM Conference on Internet Measurement, 2013, pp. 205–212.
[54] Y. Cui, Z. Lai, X. Wang, and N. Dai, “Quicksync: Improving Synchronization Efficiency for Mobile Cloud Storage Services,” IEEE Transactions on Mobile Computing, vol. 16, no. 12, pp. 3513–3526, 2017.
[55] X. Luo, H. Zhou, L. Yu, L. Xue, and Y. Xie, “Characterizing mobile*-box applications,” Computer Networks, vol. 103, pp. 228–239, 2016.


[56] W. Hu, T. Yang, and J. N. Matthews, “The Good, the Bad and the Ugly of Consumer Cloud Storage,” ACM SIGOPS Operating Systems Review, vol. 44, no. 3, pp. 110–115, 2010.
[57] Z. Li, C. Wilson, Z. Jiang, Y. Liu, B. Y. Zhao, C. Jin, Z.-L. Zhang, and Y. Dai, “Efficient Batched Synchronization in Dropbox-Like Cloud Storage Services,” in Proceedings of the ACM/IFIP/USENIX International Conference on Distributed Systems Platforms and Open Distributed Processing, 2013, pp. 307–327.
[58] Dropbox. (2018). Under the hood: Architecture overview, [Online]. Available: https://www.dropbox.com/business/trust/security/architecture (visited on 2018-05-14).
[59] SpiderOak Inc. (2018). No Knowledge, Secure-by-Default Products, [Online]. Available: https://spideroak.com/no-knowledge/ (visited on 2018-05-14).
[60] L. K. Gray. (2015-12-18). Date Change for Migrating from SSL and Early TLS, [Online]. Available: https://blog.pcisecuritystandards.org/migrating-from-ssl-and-early-tls (visited on 2018-04-10).
[61] B. Möller, T. Duong, and K. Kotowicz, “This POODLE Bites: Exploiting the SSL 3.0 Fallback,” Security Advisory, 2014.
[62] Z. Durumeric, J. Kasten, D. Adrian, J. A. Halderman, M. Bailey, F. Li, N. Weaver, J. Amann, J. Beekman, M. Payer, et al., “The Matter of Heartbleed,” in Proceedings of the ACM Conference on Internet Measurement Conference, 2014, pp. 475–488.
[63] C. Palmer. (2017-10-27). Intent To Deprecate And Remove: Public Key Pinning, [Online]. Available: https://groups.google.com/a/chromium.org/forum/#!msg/blink-dev/he9tr7p3rZ8/eNMwKPmUBAAJ (visited on 2018-05-14).
[64] B. Laurie, A. Langley, and E. Kasper, “RFC 6962 – Certificate Transparency,” Tech. Rep., 2013.
[65] Netapplications.com. (2018). Operating system market share, [Online]. Available: https://netmarketshare.com/operating-system-market-share.aspx (visited on 2018-02-21).
[66] S. Liu, X. Huang, H. Fu, and G. Yang, “Understanding Data Characteristics and Access Patterns in a Cloud Storage System,” in Proceedings of the IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2013, pp. 327–334.
[67] D. R. Horn, K. Elkabany, C. Lesniewski-Laas, and K. Winstein, “The Design, Implementation, and Deployment of a System to Transparently Compress Hundreds of Petabytes of Image Files for a File-Storage Service,” in Proceedings of the USENIX Conference on Networked Systems Design and Implementation, 2017, pp. 1–15.
[68] K. Granville. (2018-03-19). Facebook and Cambridge Analytica: What You Need to Know as Fallout Widens, [Online]. Available: https://www.nytimes.com/2018/03/19/technology/facebook-cambridge-analytica-explained.html (visited on 2018-04-18).
[69] Dropbox Inc. (2018). General Data Protection Regulation (GDPR) Guidance Center - Dropbox, [Online]. Available: https://www.dropbox.com/security/GDPR (visited on 2018-04-19).
[70] Google Cloud. (2018). Google Cloud & the General Data Protection Regulation (GDPR), [Online]. Available: https://services.google.com/fh/files/misc/google_cloud_and_the_gdpr_english.pdf (visited on 2018-04-19).


[71] Tresorit. (2018). Getting ready for the GDPR with end-to-end encryption, [Online]. Available: https://tresorit.com/files/tresorit-gdpr-whitepaper.pdf (visited on 2018-04-19).
[72] W. Post. (2018-03-27). GDPR Compliance - SpiderOak Support, [Online]. Available: https://support.spideroak.com/hc/en-us/articles/360002173891-GDPR-Compliance (visited on 2018-04-19).
[73] Jacob. (2017-11-23). The GDPR: What it is and what it means for your business, [Online]. Available: https://www.sync.com/blog/the-gdpr-what-it-is-and-what-it-means-for-your-business/ (visited on 2018-04-19).
[74] B. Schneier, Secrets and lies: digital security in a networked world. John Wiley & Sons, 2011.
[75] M. Grothe, C. Mainka, P. Rösler, J. Jupke, J. Kaiser, and J. Schwenk, “Your Cloud in my Company: Modern Rights Management Services Revisited,” in Proceedings of the IEEE International Conference on Availability, Reliability and Security, 2016, pp. 217–222.
[76] D. Kholia and P. Węgrzyn, “Looking Inside the (Drop) Box,” in Proceedings of the USENIX Workshop on Offensive Technologies, 2013.

Appendices

A.1 Cloud Storage Application Changelogs

These changelogs are extracted from official and unofficial sources and present the versions and changes that were released during the time that the tests for this thesis were conducted.

Dropbox

At the time of writing this thesis, Dropbox did not have an official channel for presenting changelogs for new versions of their client. These, incomplete, changelogs are taken from Dropbox’s support forum.

Changes in v49¹
• Smart Sync thumbnails now show in the original aspect ratio instead of square.
• Fixes localization for some warning dialogs.
• Fixes issue where preferences pane might not properly refresh.
• Fixes issue where certain files on APFS would be detected as unicode encoding conflicts.
• Other minor fixes.

Changes in v48²
• Smart Sync thumbnails now show in the original aspect ratio instead of square.
• Fixed a bug that previously caused some users to incorrectly see an install dialog for a component they had already installed.
• Some users on Basic plans will see a Smart Sync option when they right-click an item in their Dropbox, which will take them to a website where they can learn more about the benefits of Smart Sync.
• HTTP proxy authentication should work in more cases now.

¹ https://dropboxforum.com/t5/Desktop-client-builds/Stable-Build-49-4-69/m-p/276265 (visited 2018-05-23)
² https://dropboxforum.com/t5/Desktop-client-builds/Stable-Build-48-4-58/m-p/273741 (visited 2018-05-23)


Google Drive

The release notes for Google Drive are published by Google through their support pages.³

April 17, 2018 — HEIF and HEIC file support (3.41.9267.0638)
• Fixed an issue where HEIF and HEIC files (High Efficiency Image Format files) were not visible in Google Photos. Users who previously synced all file types, however, still won’t see their existing HEIF and HEIC files in Google Photos.
• Fixed an issue where some users were not able to add additional folders to sync.
• Users can now choose to sync subfolders of /Library on macOS.
• Fixed an issue where some native applications did not appear in the list of available applications to open files with when browsing Drive on the web.
• Fixed an issue on macOS High Sierra where users experienced a delay in seeing recently taken screenshots on their desktop.
• Increased responsiveness when deleting photos that could not be converted to High Quality.
• Fixed an issue causing sync status icons to not appear for some users on Windows.

March 19, 2018 — Bug fix release (3.40.8921.5350)
• Fixed an issue where Backup and Sync crashed with error 22E46FB1.

March 14, 2018 — Windows 64-bit support (3.40.8839.2105)
• Backup and Sync now distributes as a 64-bit executable on Windows. If your operating system is 64-bit, Backup and Sync will upgrade to 64-bit seamlessly. Backup and Sync will continue to run as a 32-bit process on 32-bit systems.
• When copying a link to your clipboard, Backup and Sync now displays a dialog with the current link sharing setting.
• Fixed an issue on macOS where Backup and Sync crashed when a synced file had emojis in the file name.
• Fixed issues where Backup and Sync crashed with error EF31B9F9 or B3FD5B23.
• Additional bug fixes and performance improvements.

OneDrive

The release notes for OneDrive are published by Microsoft through their support pages for the Office suite.⁴

Version 18.044.0301.0005 (Released March 27, 2018)
• Bug fixes to improve reliability and performance of the client.
• Standalone Mac sync client now migrates user settings when installed for users with a configured Mac App Store sync client.
• Users who are opted into the Insiders ring in Office for Mac will also have the setting applied to the standalone Mac sync client.
• Icon overlays to indicate folders that have been shared.

³ https://support.google.com/a/answer/7573023?hl=en (visited 2018-05-23)
⁴ https://support.office.com/article/new-onedrive-sync-client-release-notes-845dcf18-f921-435e-bf28-4e24b95e5fc0 (visited 2018-05-23)


• Ability to redirect screenshots into OneDrive.
• Clicking the OneDrive cloud now opens the context menu within the activity center.
• Locally synced OneDrive files from open directly in the cloud with Office, allowing users to AutoSave, share, and collaborate easily, all the while being more performant than ever before.

Version 18.025.0204.0008 (Released March 19, 2018)
• Bug fix to address version numbering issue when publishing the release to the Mac App Store.

Version 18.025.0204.0006 (Released February 28, 2018)
• Bug fixes to improve reliability and performance of the client.
• Users are notified when a large volume of files are deleted on their personal OneDrive.

Mega

Mega does not have an official channel for their changelogs. However, they present the current version’s changes in the client application’s “about window” and this information is presented online by third parties.⁵

MEGAsync 3.6.6 (99a46c)
• General Data Protection Regulation (GDPR) compliance.
• Updated translations.
• Bug fixes and other minor adjustments.

MEGAsync 3.6.5 (99a46c)
• Allow to change the password.
• Improvements for the sync engine.
• Updated third-party libraries.
• Bug fixes and other minor adjustments.

SpiderOak

One and Groups Backup 7.1.0 Release Notes (March 19, 2018)⁶
• Command line help improved/updated
• Manpage improved/updated
• Backup file/directory deselection behavior improved
• Update Hive icon branding
• Restart no longer required when changing size/age file restrictions.
• Prevent unicode related sync issues
• Prevent preferences from freezing under certain conditions

⁵ https://biblprog.com/en/mega/historychanges/ (visited 2018-05-23)
⁶ https://spideroak.com/articles/one-groups-backup-7-1-0-release-notes-march-19-2018/ (visited 2018-05-23)


Sync.com

Version 1.1.20 (March 5, 2018)⁷
• NEW: 64 bit support for macOS
• FIXED: Windows high DPI display issues (eg. Surface tablets)
• FIXED: Multiple preference pane crashing issues
• FIXED: Misc. reported bugs that may cause downloading errors
• FIXED: Windows XP home legacy installer updates
• IMPROVED: Handle empty Sync folder with caution
• IMPROVED: Allow empty Sync folder recovery
• IMPROVED: Networking code on Windows installer
• IMPROVED: Reduce background network checks
• IMPROVED: Smarter selection of network routes based on performance

Tresorit

At the time of writing this thesis, the changelog for Tresorit had not been updated with the latest version.⁸

⁷ https://www.sync.com/blog/sync-1-1-20-desktop-apps-available/ (visited 2018-05-23)
⁸ https://support.tresorit.com/hc/en-us/articles/216468567-Tresorit-for-Mac (visited 2018-05-23)


A.2 Packet Size Distributions

[Figure A.1 consists of seven histograms of packet size (x-axis, 0 to 1,400 bytes) versus number of packets (y-axis), one panel per application: (a) Dropbox, (b) Google Drive, (c) Mega, (d) OneDrive, (e) SpiderOak, (f) Sync.com, and (g) Tresorit.]

Figure A.1: Packet size distributions for the tested PCS applications during a 10MB file upload of highly compressible data.


A.3 CPU Utilization

Table A.1 shows the mean CPU utilization for the idle and cooldown phases along with 95% confidence intervals. Corresponding values for the pre-processing and transfer phases are shown in Table A.2.
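As an illustration of how such intervals can be obtained from repeated runs, the sketch below computes a mean and a two-sided Student-t 95% confidence interval. This is a common construction shown only for reference; it is not a restatement of the exact procedure used for the tables below, and the sample values are made up.

```python
from statistics import mean, stdev
from scipy.stats import t

def mean_with_ci(samples, confidence=0.95):
    """Return (mean, lower, upper) for a Student-t confidence interval."""
    n = len(samples)
    m = mean(samples)
    # Standard error of the mean and the two-sided t critical value.
    sem = stdev(samples) / n ** 0.5
    margin = t.ppf((1 + confidence) / 2, df=n - 1) * sem
    return m, m - margin, m + margin

# Made-up CPU utilization readings (percent) from repeated test runs.
runs = [0.52, 0.61, 0.49, 0.58, 0.55, 0.63, 0.50, 0.57]
m, low, high = mean_with_ci(runs)
print(f"{m:.2f} [{low:.2f}, {high:.2f}]")
```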

Table A.1: CPU utilization during idle and cooldown phases

Service          Idle: Mean [95% CI]       Cooldown: Mean [95% CI]
Dropbox          0.56 [0.49, 0.64]         9.21 [8.30, 10.13]
Google Drive     1.49 [1.40, 1.57]         6.31 [5.75, 6.86]
OneDrive         0.25 [0.18, 0.31]         2.08 [1.67, 2.49]
Mega (HTTP)      0.10 [0.08, 0.12]         0.77 [0.70, 0.85]
Mega (HTTPS)     0.10 [0.08, 0.12]         0.90 [0.81, 0.99]
SpiderOak        1.91 [1.74, 2.09]         2.12 [1.80, 2.44]
Sync.com         0.88 [0.73, 1.03]         4.38 [4.03, 4.72]
Tresorit         2.40 [1.98, 2.82]         7.02 [6.47, 7.58]

Table A.2: CPU utilization during pre-process and transfer phases

Service          Pre-process: Mean [95% CI]     Transfer: Mean [95% CI]
Dropbox          81.03 [78.22, 83.84]           155.41 [145.59, 165.22]
Google Drive     8.63 [8.37, 8.89]              47.21 [43.76, 50.67]
OneDrive         1.63 [1.48, 1.79]              20.07 [18.19, 21.95]
Mega (HTTP)      1.72 [1.67, 1.78]              42.91 [40.36, 45.47]
Mega (HTTPS)     2.48 [2.40, 2.56]              63.41 [61.69, 65.12]
SpiderOak        21.69 [17.24, 26.14]           10.83 [8.85, 12.81]
Sync.com         23.13 [22.40, 23.86]           28.38 [24.83, 31.93]
Tresorit         12.80 [11.74, 13.86]           163.61 [158.04, 169.18]

Table A.3 shows the mean duration for the pre-processing and transfer phases along with 95% confidence intervals. Corresponding values for the CPU volumes are presented in Table A.4.


Table A.3: Phase durations in seconds

Service          Pre-process: Mean [95% CI]     Transfer: Mean [95% CI]
Dropbox          2.06 [1.96, 2.17]              6.46 [6.00, 6.91]
Google Drive     7.17 [7.04, 7.29]              1.09 [0.85, 1.32]
OneDrive         5.78 [5.23, 6.33]              4.72 [4.24, 5.20]
Mega (HTTP)      2.24 [2.17, 2.31]              1.44 [1.29, 1.58]
Mega (HTTPS)     2.42 [2.33, 2.51]              1.73 [1.66, 1.79]
SpiderOak        29.62 [26.26, 32.97]           44.33 [23.59, 65.07]
Sync.com         4.74 [4.68, 4.81]              12.60 [10.39, 14.81]
Tresorit         12.63 [11.69, 13.57]           2.43 [2.31, 2.56]

Table A.4: CPU Volumes

Service          Pre-process: Mean [95% CI]     Transfer: Mean [95% CI]
Dropbox          165.43 [157.99, 172.87]        980.18 [949.98, 1010.38]
Google Drive     51.14 [49.44, 52.83]           46.15 [44.36, 47.94]
OneDrive         7.78 [6.89, 8.67]              89.44 [86.83, 92.06]
Mega (HTTP)      3.61 [3.50, 3.73]              58.70 [54.74, 62.66]
Mega (HTTPS)     5.71 [5.50, 5.91]              107.98 [106.77, 109.20]
SpiderOak        520.14 [505.51, 534.77]        221.10 [188.99, 253.21]
Sync.com         105.59 [101.63, 109.54]        311.00 [300.36, 321.64]
Tresorit         129.32 [121.05, 137.59]        389.67 [377.24, 402.09]


A.4 Disk Utilization

Table A.5 shows the average amount of bytes written to disk along with 95% confidence intervals during a 300 MB file upload.

Table A.5: Average amount of bytes written to disk during a 300 MB file upload.

Service          Mean      95% CI
Dropbox          81.75     [46.25, 117.25]
Google Drive     29.18     [25.42, 32.93]
OneDrive         74.68     [20.27, 129.10]
Mega             63.83     [24.67, 102.98]
SpiderOak        376.02    [364.12, 387.93]
Sync.com         62.30     [54.38, 70.22]
Tresorit         44.10     [13.88, 74.32]
