DeltaCFS: Boosting Delta Sync for Cloud Storage Services by Learning from NFS

Quanlu Zhang∗, Zhenhua Li†, Zhi Yang∗, Shenglong Li∗, Shouyang Li∗, Yangze Guo∗, and Yafei Dai∗‡
∗Peking University  †Tsinghua University  ‡Institute of Big Data Technologies, Shenzhen Key Lab for Cloud Computing Technology & Applications
{zql, yangzhi, lishenglong, lishouyang, guoyangze, [email protected], [email protected]

Abstract—Cloud storage services, such as Dropbox, iCloud Drive, Google Drive, and Microsoft OneDrive, have greatly facilitated users' synchronizing files across heterogeneous devices. Among them, Dropbox-like services are particularly beneficial owing to the delta sync functionality that strives towards greater network-level efficiency. However, when delta sync trades computation overhead for network-traffic saving, the tradeoff could be highly unfavorable under some typical workloads. We refer to this problem as the abuse of delta sync. To address this problem, we propose DeltaCFS, a novel file sync framework for cloud storage services designed by learning from conventional NFS (Network File System). Specifically, we combine delta sync with NFS-like file RPC in an adaptive manner, thus significantly cutting computation overhead on both the client and server sides while preserving network-level efficiency. DeltaCFS also enables a neat design for guaranteeing causal consistency and fine-grained version control of files. In our FUSE-based prototype system (which is open-source), DeltaCFS outperforms Dropbox by generating up to 11× less data transfer and up to 100× less computation overhead under the workloads of concern.

I. INTRODUCTION

Cloud storage services, such as Dropbox [1], iCloud Drive, Google Drive, and Microsoft OneDrive, have greatly facilitated users' synchronizing files across heterogeneous devices.
Lying at the heart of these services is the data synchronization (sync) operation, which automatically maps the updates to users' local file systems onto the back-end cloud servers in a timely manner [2].

Among today's cloud storage services, Dropbox-like services (e.g., iCloud Drive, SugarSync, and Seafile [3]) are particularly favorable owing to their delta sync functionality that strives towards greater network-level efficiency [4], [5]. Specifically, for synchronizing a file's new version on the client, only the incremental parts relative to the base version on the server are extracted and uploaded. This can reduce considerable network data transfer compared with simply uploading the full content of the new-version file (as done by Google Drive, OneDrive, etc.). Thus, it is especially useful and economical for operations through wide area networks [2].

However, with the increasing prevalence of cloud storage services, the upper-layer applications are getting more complex and comprehensive, and delta sync can become not only ineffective but also cumbersome. Such applications include SoundHound [6], 1Password [7], Continuity [8], [9], and so forth [10], [11]. By carefully examining these applications, we find a common root cause that undermines the effect of delta sync: these applications heavily rely on structured data stored in tabular files (e.g., SQLite [12] files), rather than simple textual files (e.g., .txt, .tex, and .log files). For example, today's smartphones (like the iPhone) regularly synchronize local data to the cloud storage (like iCloud [13]), where a considerable part of the synchronized data is tabular [14].

Confronted with the abovementioned situation, delta sync exhibits poor performance, because it often generates intolerably high computation overhead when dealing with structured data. In detail, executing delta sync on a file requires scanning, chunking, and fingerprinting the file, thus consuming substantial CPU resources [4], [15]. As a result, blindly applying delta sync to all types of file updates falls into a one-size-fits-all trap, and we refer to this problem as the abuse of delta sync in cloud storage services.

To address the problem, in this paper we design DeltaCFS, a novel file sync framework for modern cloud storage services, by learning from the seemingly antique yet still widely-used NFS (i.e., Network File System). DeltaCFS stems from a practical insight that for certain types of file updates (e.g., appending bytes to a file, flipping bytes in a file, and removing bytes at the end of a file), delta sync is largely unnecessary. In these cases, directly intercepting file operations (which we refer to as NFS-like file RPC) turns out to be the most efficient way to precisely obtain the modified parts. Hence, by leveraging the hints of typical file update patterns, we adaptively apply two distinct incremental file sync approaches, i.e., delta sync and NFS-like file RPC. The key novelty lies in an elegant combination of the two approaches at run time, using a customized relation table that extracts the invariants of file update patterns. For example, in a transactional update (see §II-B), both the old-version and new-version files are always kept until the update is done. Our efforts can significantly cut the computation overhead on both the client and server sides; meanwhile, the network-level (traffic usage) efficiency is either preserved or optimized.
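To make this adaptive combination concrete, the following minimal Python sketch is our own illustration rather than code from the DeltaCFS prototype: the names (WriteOp, classify_update, sync_file) and the toy pattern rules are hypothetical simplifications of the relation-table mechanism described above. Intercepted write operations on a file are classified; simple patterns (appends, in-place flips, truncations) are forwarded directly as NFS-like file RPC, and anything else falls back to delta sync.

```python
# Illustrative sketch only: adaptively choosing between NFS-like file RPC
# (forwarding intercepted write operations verbatim) and rsync-style delta
# sync, keyed on the observed update pattern. Class and function names are
# hypothetical; DeltaCFS's actual decision logic (its relation table over
# file update patterns) is more involved than this toy classifier.

from dataclasses import dataclass
from typing import List

@dataclass
class WriteOp:
    offset: int   # byte offset of the intercepted write within the file
    data: bytes   # bytes written at that offset

def classify_update(old_size: int, ops: List[WriteOp], new_size: int) -> str:
    """Rough detector for the simple update types named in the text."""
    if ops and all(op.offset >= old_size for op in ops) and new_size > old_size:
        return "append"        # bytes only added past the old end of file
    if ops and all(op.offset + len(op.data) <= old_size for op in ops) and new_size == old_size:
        return "in-place"      # existing bytes flipped, length unchanged
    if not ops and new_size < old_size:
        return "truncate"      # bytes removed at the end of the file
    return "complex"           # anything else (rewrites, reorganizations, ...)

def sync_file(path: str, old_size: int, ops: List[WriteOp], new_size: int) -> None:
    pattern = classify_update(old_size, ops, new_size)
    if pattern in ("append", "in-place", "truncate"):
        # NFS-like file RPC: ship the intercepted operations directly, with no
        # scanning, chunking, or fingerprinting of the whole file.
        for op in ops:
            print(f"RPC write {path} @ {op.offset} ({len(op.data)} bytes)")
        if new_size < old_size:
            print(f"RPC truncate {path} to {new_size} bytes")
    else:
        # Fall back to delta sync: compare against the server's base version
        # and upload only the unmatched regions (placeholder here).
        print(f"delta sync {path}")

if __name__ == "__main__":
    # An append-only update to a large tabular file: forwarding one write
    # avoids re-fingerprinting all 130MB of the file on every save.
    sync_file("chat.db", old_size=130 * 2**20,
              ops=[WriteOp(offset=130 * 2**20, data=b"x" * 4096)],
              new_size=130 * 2**20 + 4096)
```

The point of the dispatch is that, for the common simple patterns, the client never scans, chunks, or fingerprints the whole file; delta sync is reserved for updates whose shape cannot be captured cheaply from the intercepted operations.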
DeltaCFS also enables a neat design for guaranteeing causal consistency and fine-grained version control of files. File sync should offer basic guarantees of data integrity and consistency, while version control is a plus that appeals to users. However, complicated data sync operations among the cloud and clients can make data integrity and consistency fairly vulnerable, since corrupted or inconsistent data on any client is likely to be quickly propagated to all other relevant clients [16]. Based on DeltaCFS, we provide causal consistency for data uploading and guarantee data integrity and consistency with lightweight mechanisms.

We have built a prototype system of DeltaCFS based on FUSE and comprehensively evaluated its performance on both PCs and mobile phones. Compared to Dropbox, up to 11× less data are transmitted by DeltaCFS under representative workloads, and the savings of computation resources on the client side range from 91% to 99%. On the server side, DeltaCFS also incurs low overhead, 4× to 30× lower than Seafile, an open-source state-of-the-art cloud storage service. Finally, it is worth noting that although DeltaCFS is currently implemented in user space, it could also be implemented in the kernel, e.g., as an enhanced mid-layer working on top of the virtual file system (VFS). Then, the computation overhead of DeltaCFS could be further reduced.

Our contribution can be summarized as follows:
• We pinpoint the "abuse of delta sync" problem in Dropbox-like cloud storage services and dig out its root cause (§II).
• We design a novel file sync framework which efficiently synchronizes various types of file updates by combining delta sync and NFS-like file RPC (§III). In particular, we leverage a customized relation table to identify different file update patterns in order to conduct data sync in an adaptive and run-time manner. Also, we utilize lightweight mechanisms to guarantee data integrity/consistency and facilitate version control.
• We build a prototype system of DeltaCFS and compare its performance with representative systems including Dropbox, Seafile, and NFS.

[Fig. 1: four panels plotting CPU utilization (%) and network traffic (MB) for upload/download against time (s); panels (a)(b) Dropbox, (c)(d) Seafile.]
Fig. 1. Client resource consumption. (a)(c) Synchronizing a Microsoft Word file (12MB); this file is saved 23 times. (b)(d) Synchronizing a SQLite file (130MB) which stores chat history; the file is from a popular chat application. During the test, this file is modified 4 times (composed of 85 write operations) with 688KB changed in total.

[Fig. 2: CPU utilization (%), upload traffic, and TUE plotted against time (minutes).]
Fig. 2. Synchronizing WeChat's data through Dropsync on a Samsung Galaxy Note 3. WeChat [23] is a popular chat application. Dropsync [24] is an auto-sync tool of Dropbox on mobiles. We set WeChat's data folder as Dropsync's sync folder. TUE (Traffic Usage Efficiency) is total data sync traffic divided by data update size [2].

rsync divides files into fixed-size blocks and computes rolling/strong checksums to identify the same blocks. Notice that file updates can shift the data in each fixed-size chunk, so rsync has to redo the checksum computation each time a file is modified, consuming considerable CPU resources. CDC uses content-defined chunking to detect block boundaries, and thus only needs to compute the checksums of changed blocks; but its network efficiency is not as good as rsync's, because it uses a large chunk size to keep the overhead of maintaining chunk checksums low.

We examine the efficiency of the above methods in cloud-based data sync systems by investigating two popular instances, i.e., Dropbox and Seafile, which employ rsync and CDC, respectively [21], [22]. As shown in Figure 1(c)(d), Seafile shows moderate CPU consumption, but it uploads a large amount of data given the large chunk size (default value is 1MB). In contrast, Dropbox in Figure 1(a)(b) has much
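As a rough illustration of the two chunking strategies compared above, the Python sketch below is our own simplification, not code from rsync, Dropbox, or Seafile. It fingerprints a file once with fixed-size blocks, as rsync-style schemes do for the base version, and once with content-defined chunking, where a boundary is declared wherever a hash of the trailing byte window matches a target pattern; a real CDC implementation would use an efficient rolling hash, and all parameters here are arbitrary. A single byte inserted at the head of the file shifts every fixed-size block, which is why rsync must slide a rolling checksum across every offset of the modified file, whereas most content-defined chunks keep their boundaries and fingerprints.

```python
# Illustrative sketch of fixed-size vs. content-defined chunking.
# Not the actual rsync, Dropbox, or Seafile code; the window hash and all
# parameters are chosen for readability, not fidelity.

import hashlib
import random

def fixed_size_fingerprints(data: bytes, block_size: int = 4096):
    """Fingerprint fixed-size blocks, as done for the base version in an
    rsync-style scheme; the modified file must then be scanned at every
    byte offset with a rolling checksum to locate these blocks."""
    return [hashlib.md5(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def content_defined_chunks(data: bytes, window: int = 48,
                           mask: int = 0x1FFF, min_chunk: int = 2048):
    """Cut a chunk wherever the hash of the trailing `window` bytes has its
    low bits all zero, so chunk boundaries move together with the content."""
    chunks, start = [], 0
    for i in range(len(data)):
        if i - start < min_chunk or i + 1 < window:
            continue
        h = int.from_bytes(hashlib.md5(data[i + 1 - window:i + 1]).digest()[:4], "big")
        if (h & mask) == 0:
            chunks.append(hashlib.md5(data[start:i + 1]).hexdigest())
            start = i + 1
    if start < len(data):
        chunks.append(hashlib.md5(data[start:]).hexdigest())
    return chunks

if __name__ == "__main__":
    random.seed(0)
    base = bytes(random.randrange(256) for _ in range(128 * 1024))
    updated = b"!" + base   # a single byte inserted at the very front

    fixed_reusable = len(set(fixed_size_fingerprints(base)) &
                         set(fixed_size_fingerprints(updated)))
    cdc_reusable = len(set(content_defined_chunks(base)) &
                       set(content_defined_chunks(updated)))
    print(f"fixed-size blocks with unchanged fingerprints: {fixed_reusable}")
    print(f"content-defined chunks with unchanged fingerprints: {cdc_reusable}")
```

In this toy setup only the first content-defined chunk changes after the insertion, while no fixed-size block fingerprint survives; recovering the reuse in the fixed-size case is exactly the per-offset rolling-checksum scan that makes rsync-style delta sync CPU-intensive on every modification.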