2017 Asia Modelling Symposium

Non-Internet Synchronization for Distributed Storages

Anuradha Wickramarachchi, Dulaj Atapattu, Pamoda Wimalasiri, Ravidu Mallawa Arachchi, Gihan Dias
Department of Computer Science and Engineering
University of Moratuwa, Sri Lanka
{anuradha.13, dulaj.atapattu.13, pamoda.13, ravidu_lashan.13, gihan}@cse.mrt.ac.lk

2376-1172/17 $31.00 © 2017 IEEE, DOI 10.1109/AMS.2017.13

Abstract - Synchronization of files has become a major use case in the context of cloud storage. Yet Internet bandwidth and the cost of additional cloud usage have become the major bottlenecks of such services. Non-Internet synchronization is the process of performing file synchronization using only moving (nomadic) devices to carry the content to be synchronized, i.e. without using the Internet as the communication medium. This enables synchronization to be performed at very low cost and with very little Internet bandwidth, yet with reasonably the same user experience. Furthermore, synchronization of large media over the Internet is often limited by lower upload speeds, which makes the entire process slower. This work presents the use of devices such as computers, mobile phones and other network-attached IoT devices that move between synchronization storages to transfer synchronization data and synchronize the storages. The distributed storages run a kernel to perform the synchronization algorithms, which are discussed in the paper. The results demonstrate competitive performance with existing synchronization mechanisms, provided the user does not need real-time synchronization but nomadic access prevails. Non-Internet Synchronization (NIS) synchronizes storages at minimum cost while utilizing redundant storage that moves between storage locations frequently.

Keywords - Non-Internet synchronization, nomadic computing, personal cloud, reliability

I. INTRODUCTION

Nowadays, synchronization between devices is essential for many applications such as P2P, client-server and cloud-based systems. There are many synchronization implementations such as Resilio Sync [1] and Rsync [2], and other synchronization protocols such as Pydio [3] for enterprise-level entities. Existing synchronization protocols concentrate on the synchronization of devices using the Internet or local networks as the communication medium. This consumes a large amount of network traffic and bandwidth and may take a long period of time. It also increases the cost of Internet usage for those who are on metered connections. Furthermore, the use of Internet bandwidth for synchronization of files keeps the connections congested for a longer period of time, reducing the quality of service for other users of the network as well. However, the requirement is often not the real-time availability of data at another locale, but the replication of content for eventual availability. Thus, the use of the network for such synchronization would often consume time, while causing partial synchronization of content, depletion of network quota and excessive network bandwidth utilization.

This paper presents a novel methodology for synchronization that utilizes moving devices as the carriers of changes for synchronizing storages. This enables the devices that often move between storage devices to act as the synchronization medium. For example, consider the synchronization of a home device and an office device (end storage devices) using a device that moves between the two locations (carrier device), such as a mobile phone or a personal computer. This paper presents the implementation of the synchronization mechanism, called Non-Internet Synchronization (NIS), to address the aforementioned problems. Section II outlines the related work in the field. Section III describes the architecture of the implemented system. Section IV explains the implementation carried out and Section V presents the experimental results and their evaluation. Finally, Section VI concludes the paper with the inferences obtained from the results and future work, and emphasizes the importance of the research.

II. RELATED WORK

Many studies have been carried out to present varying methods of synchronization of content. Most of the methods focus on using a cloud storage to store content in order to synchronize the clients (storage devices). MetaSync [4] presents a secure file synchronization mechanism which intends to provide integrity and confidentiality for data using untrusted storages spread over multiple cloud services. This approach leverages resources from multiple cloud service providers. Although this approach increases reliability through redundant resources, the communication overhead remains high. The overall process presented in the work is expensive due to the communication with many cloud services. Also, varying network conditions add overheads in terms of maintaining consistency.

Younghwan Go et al. [5] present Simba, a data sync service which abstracts data storage and synchronization. The work presents a data model and an API that utilize a unified storage for synchronization. It adopts a tabular storage mechanism which uses row consistency to ensure consistency of data. The proposed solution embraces a client-server architecture to perform synchronization. The solution targets data-centric mobile applications and its scope is limited to client-server synchronization. Furthermore, the system resolves conflicts by creating a conflicted copy of the files that have contradicting changes. However, the platform provides synchronization similar to systems such as Dropbox and Google Drive. Therefore, issues with the privacy of the unified storage exist. Usage of a unified storage hosted in a cloud environment requires the Internet to transfer data for synchronization.

Work by Ajay Tanpure et al. [6] presents the use of Rsync over secure HTTP (HTTPS). The work intends to perform client-server synchronization over HTTPS, which enables remote synchronization. This is not available in conventional Rsync. The boundary shift problem exists in systems using Rsync, even though Rsync is considered a powerful protocol for chunk-based synchronization [2]. Because of this problem, a simple insertion or deletion of content can change all the chunks that need to be transmitted. This has given rise to the concept of content-defined chunking [7], which eliminates the boundary shift problem. Furthermore, even if Rsync performs better in locally connected networks, the communication overhead will be high when the Internet is used, due to the boundary shift problem, which is common.

Two of the most commonly used cloud storages are Google Drive [8] and Dropbox [9]. These services provide free cloud storage and a synchronization service to their users. The services include file synchronization, storage backup and public file sharing using sharable links. They also provide conflict resolution and real-time collaboration. Yet, the privacy and confidentiality of data are not guaranteed, since Google performs analytics on consumer data [10]. The content is scanned and Machine Learning algorithms are executed to make shopping suggestions and for spam detection. Dropbox in fact uses personal information such as physical addresses to improve quality of service for users. Also, the content that resides in the cloud is encrypted, yet certain data is retained in order to adhere to certain security regulations [11] and may be disclosed under certain legal demands.

Work by Jorge Blasco et al. [12] presents a deduplication scheme using Bloom filters to identify overlapping content. The scheme also provides ownership management of content to track the original owners of the content. The scheme drastically reduces the amount of central storage consumed for synchronization as the central storage grows with user content. The work presents an optimized solution for content duplication over existing schemes. Yet the deduplication scheme requires the personal content of other users to be retained in order to improve the user experience for other users. This is a major drawback in terms of confidentiality in systems such as Dropbox [13] which use deduplication.

Work performed by Andri Lareida et al. [14] presents Box2Box, a peer-to-peer file sharing and synchronization application. The system keeps polling for connections with peers in order to connect and synchronize. The approach utilizes a conflict reporting mechanism where the user decides which version to keep. Furthermore, the application supports Super Peers so that the content of all the peers can be synchronized to a highly available peer. However, the approach consumes as much Internet bandwidth as any other synchronization mechanism would. Furthermore, the system is prone to inconsistencies, and more conflicts occur when the peers are offline and the content changes frequently.

Due to its increased security and scalability, P2P synchronization has become popular lately. Resilio Sync [1] is such a synchronization platform, which uses a variant of the BitTorrent protocol for synchronization. Work performed by Zhiyuan Peng et al. [15] evaluates the performance of peer-to-peer synchronization taking Resilio Sync as a case study. The work concludes that, even though greater speeds are visible in torrent downloads with a larger number of seeders, no such gains can be anticipated in peer-to-peer synchronization. Thus, the performance is limited by the bandwidth of the peer with the slowest uplink speed. Therefore, the synchronization process is inherently slower than that of a client-server model and takes a longer period of time. A 50MB file took close to 75 seconds [15], giving an average speed of close to 700 KBytes/s.

As discussed, much research has been conducted on file synchronization using different techniques. The majority of the research focuses on efficient file updating, resource utilization for reliability and security aspects of the process. Yet there has been limited research on how to save bandwidth and Internet usage for file synchronization. Furthermore, research regarding the use of nomadic computing devices for synchronization of content is even scarcer.

III. SYSTEM ARCHITECTURE

In our study, Non-Internet Synchronization (NIS) is designed to achieve efficient content synchronization of storages by using carrier devices. The carrier devices are the devices which physically move between the end storage devices that need to be in sync. A carrier device could be any device which has access to a storage device within the local network. The carrier device implementation can be extended over a wide range of devices, from personal computers to other nomadic computing devices such as tablets, mobile phones and even handheld IoT devices.

A. Communication Stack

The NIS implementation consists of a communication stack with 4 layers. The system is implemented with the necessary networking APIs, which are provided as wrappers for native C++ implementations.

[Fig. 1: Communication Stack]

Fig. 1 demonstrates the communication stack of the NIS. The transport layer communication takes place at the bottom layer. The functionality is abstracted and provided to the upper layers in the form of an API implemented in JavaScript. TCP sockets are used for the communication of synchronization content and for message passing. The content is sent in the form of streams over the TCP sockets. The Synchronizer component runs on top of the Communicator layer. The Communicator layer abstracts the connection initiation process and the handling of communication sockets.

B. Synchronization Design

The overall design of NIS consists of four components: Storage, Metadata Database, Sync Engine and the Communicator.

[Fig. 2: Synchronization Design]

Fig. 2 demonstrates the arrangement of the key components of the synchronization design. The Storage is the main component that holds the data to be synchronized. In our experimental setup, the hard disk drive of a computer which resides at a single location is considered as the storage. The Metadata Database holds the metadata of the files to be synchronized. The database is updated at the start of the synchronizer as well as when the operating system kernel emits file system events for file changes. The Sync Engine checks for changes in the metadata database and is triggered by changes in the database. The Communication Stack component is responsible for connecting with the carrier device, which carries the data required for the synchronization of the end storage devices. Once the Communicator of a carrier device is connected with that of a storage device, it initiates the Synchronizer and starts the synchronization mechanism.

IV. IMPLEMENTATION

The components of the system that perform NIS were built in the layered architecture presented in Section III. This section presents the implementation details of each of the components and key concerns of a synchronization process such as conflict detection.

A. Device Discovery

Device discovery is an important step in the NIS, because the carrier devices should connect with the storage devices automatically in order to perform the NIS. The NIS implementation utilizes SSDP (Simple Service Discovery Protocol) [16] in order to identify and connect with the storage device. The storage device runs an SSDP server and broadcasts its USN (Unique Service Name), service endpoint and a customized URL. The port and the host IP address of the storage device are broadcast, enabling carrier devices to detect and connect. Fig. 3 demonstrates the high-level view of the discovery process. Once the carrier device connects to the local network, it searches on the broadcast address and the port as demonstrated in the figure. The broadcast message includes the IP address and the port of the storage device, which are used by the carrier device to connect with the storage device.

[Fig. 3: Device Discovery]

B. Synchronization

The implementation utilizes the synchronization algorithm presented under Algorithm 1. The synchronization technique utilizes checksum-based comparison to perform the synchronization. Since the synchronization happens in a non-real-time manner, the carrier device is used as an event broker to carry events between the two end devices.
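The discovery step of Section IV-A can be illustrated with a minimal Python sketch of what a carrier device might do: send an SSDP search request and read the storage device's host and port from the reply. This is a sketch under stated assumptions, not the authors' implementation. The multicast group 239.255.255.250:1900 and the M-SEARCH header names are standard SSDP; the service type string `urn:example:service:nis-storage:1` and the use of the LOCATION header to carry the endpoint are hypothetical choices made for illustration.

```python
import socket
from urllib.parse import urlsplit

SSDP_ADDR = ("239.255.255.250", 1900)  # standard SSDP multicast group and port
# Hypothetical service type; the paper does not state the one it uses.
NIS_SERVICE_TYPE = "urn:example:service:nis-storage:1"


def build_msearch(service_type: str = NIS_SERVICE_TYPE, mx: int = 2) -> bytes:
    """Build the SSDP search request a carrier device would send."""
    lines = [
        "M-SEARCH * HTTP/1.1",
        f"HOST: {SSDP_ADDR[0]}:{SSDP_ADDR[1]}",
        'MAN: "ssdp:discover"',
        f"MX: {mx}",            # maximum seconds a device may wait before replying
        f"ST: {service_type}",  # search target
        "", "",                 # blank line terminates the header block
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response(datagram: bytes) -> dict:
    """Parse the header lines of an SSDP reply into an upper-cased dict."""
    text = datagram.decode("utf-8", errors="replace")
    headers = {}
    for line in text.split("\r\n")[1:]:  # skip the status line
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().upper()] = value.strip()
    return headers


def storage_endpoint(headers: dict) -> tuple:
    """Extract (host, port) of the storage device from the LOCATION URL."""
    parts = urlsplit(headers["LOCATION"])
    return parts.hostname, parts.port


def discover(timeout: float = 3.0) -> list:
    """Send one search and collect endpoints until the timeout expires."""
    found = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_msearch(), SSDP_ADDR)
        try:
            while True:
                data, _ = sock.recvfrom(65507)
                headers = parse_response(data)
                if "LOCATION" in headers:
                    found.append(storage_endpoint(headers))
        except socket.timeout:
            pass  # no more replies within the timeout
    return found
```

Calling `discover()` on a network with no matching storage device simply returns an empty list after the timeout; the returned host and port pairs are what the carrier device would then use to open its TCP connection.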

Algorithm 1: Synchronization algorithm

    changes = getChangesList()
    getChangesFromDevice(changes)
    otherDeviceChanges = readDB()
    conflicts = detect(changes, otherDeviceChanges)
    for each event in otherDeviceChanges:
        if event not in conflicts:
            updateStorageDevice(event)
            cleanEvent(event)
    end for
    createConflictedCopies()

The events of files within the storage device are recorded along with the metadata within the metadata database. Once the carrier device connects with the storage device, the events stored in the storage device (metadata database) are sent to the carrier device. The carrier device contains similar events from the other storage device with which the connected device is synchronized (or none, in the first run). Now the events from the other storage device can be pushed into the connected storage device. The change set of the currently connected storage device should be taken to the other storage device to perform the same set of operations.

[Fig. 4: Synchronization Overview]

There are four types of events. NEW: newly created files/folders; MODIFY: edits performed on files; RENAME: renaming of files/folders; and DELETE: the removal of files/folders. When the events are pushed to the carrier device, the modified and newly added files/folders are also copied to the carrier device. This enables the carrier device to perform the same set of events at the other storage device. All the events are executed in the order in which they were recorded.

NIS utilizes chunk-based synchronization, which enables the synchronization of content by sharing the file diff. Content-defined chunking is used [17] in order to eliminate the boundary shift problem [18] in fixed-size partitioning [7]. Fig. 4 elaborates on Algorithm 1. The carrier device contains the events of each of the storage devices that need to be synchronized. For simplicity, the figure demonstrates the scenario where two storage devices are synchronized by a single carrier device. Once the carrier device connects with the storage device, it pulls events from the storage device to itself and pushes the modifications of the other storage device. The carrier device can send operations such as DELETE and RENAME. Similarly, the storage device pushes its events to the carrier device. The modifications are merged into the storage device by observing the fingerprints of chunks. Rabin fingerprints [17] are used for chunking; a fingerprint is obtained per chunk after content-defined chunking is performed. Old fingerprints are then compared with those of the new files and the modified chunks are obtained. The difference (the modified set of chunks) is transmitted to the storage device from the carrier device by means of the offsets of the new and old chunks. The new changes can then be merged at the storage device.

The communication of data is performed through stream buffers, which enable reliable streaming of content with better congestion control [19] [18] using TCP sockets. TCP is utilized due to its congestion control and fairness of bandwidth utilization in wired and wireless environments [18] [19]. It also provides more reliable transmission of content, removing the need to handle retransmission in the application. The use of TCP sockets allowed other applications to use the Internet connection smoothly, as demonstrated in the results section, Section V of this paper. Furthermore, the events and metadata are stored in a NeDB [20] database, which is a lightweight JavaScript database implementation with the Mongo API. This ensures lightweight execution of storage tasks with in-memory efficiency, without having to maintain a database server.

C. Conflict Detection

Since the proposed Non-Internet synchronization targets eventual consistency of the synchronized end storages, there is a considerable probability of conflicting changes between the synchronized files. Fig. 5 demonstrates how the conflicts are identified by observing the file metadata during the synchronization process.

The events in the metadata database are scanned for conflicts before events from the carrier device are pushed to the storage device. Conflicts are detected by checking whether events have been performed on the same file paths and lead to different final checksums. Different checksums imply that conflicting changes were performed on the files before they were brought into sync. If there are events on each of the devices with the same file path, yet they lead to the same checksum at the end, such files are considered to have no conflicting changes. In such scenarios the events can be ignored and removed from the database without further communication.
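To make the boundary shift argument of Section IV-B concrete, the following Python sketch compares fixed-size partitioning with content-defined chunking after a single byte is inserted at the front of a file. It is a minimal illustration under stated assumptions: a simple polynomial rolling hash stands in for the Rabin fingerprints used by NIS, and the window size, mask and function names (`cdc_chunks`, `fixed_chunks`) are hypothetical.

```python
import hashlib

BASE = 257
MOD = (1 << 61) - 1  # large prime modulus for the polynomial rolling hash


def fixed_chunks(data: bytes, size: int = 256) -> list:
    """Fixed-size partitioning: boundaries depend only on byte offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, window: int = 16, mask: int = 0xFF) -> list:
    """Content-defined chunking: cut wherever the hash of the last
    `window` bytes has its low bits equal to zero."""
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # add the newest byte
        if i + 1 >= window and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def unchanged(old: list, new: list) -> int:
    """Number of distinct chunks that need not be transmitted again."""
    return len(set(old) & set(new))


# Deterministic pseudo-random test data (64 KiB) and an edited copy
# with one byte inserted at the very beginning.
original = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest()
                    for i in range(2048))
edited = b"X" + original

fixed_kept = unchanged(fixed_chunks(original), fixed_chunks(edited))
cdc_kept = unchanged(cdc_chunks(original), cdc_chunks(edited))
print("fixed-size chunks reusable:", fixed_kept)
print("content-defined chunks reusable:", cdc_kept)
```

Because every cut point in `cdc_chunks` depends only on the bytes inside the sliding window, all chunks after the first one survive the insertion unchanged, whereas the fixed-size boundaries all shift by one byte. Only the changed chunks would then have to be carried to the other storage device.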

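The conflict rule of Section IV-C, combined with the filtering loop of Algorithm 1, can be sketched in Python as follows. This is an illustrative reading of the text rather than the authors' code: the `Event` record, the `classify` function and the simplification that each event touches a single path (so a RENAME is not tracked across its old and new names) are assumptions made here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    path: str
    kind: str                 # "NEW", "MODIFY", "RENAME" or "DELETE"
    checksum: Optional[str]   # checksum after the event, None for DELETE


def final_checksums(events: list) -> dict:
    """Map each path to the checksum left by its last recorded event."""
    final = {}
    for event in events:  # events are kept in recorded order, so later wins
        final[event.path] = event.checksum
    return final


def classify(local_events: list, carried_events: list) -> tuple:
    """Split the carried events into those to apply, conflicts and no-ops.

    A path is in conflict when both sides changed it and the final
    checksums differ; when the final checksums match, the events are
    redundant and can be dropped without any transfer.
    """
    local = final_checksums(local_events)
    carried = final_checksums(carried_events)
    both = local.keys() & carried.keys()
    conflicts = {p for p in both if local[p] != carried[p]}
    identical = both - conflicts
    to_apply = [e for e in carried_events
                if e.path not in conflicts and e.path not in identical]
    return to_apply, conflicts, identical


local = [Event("a.txt", "MODIFY", "c1"),
         Event("c.txt", "MODIFY", "m1"),
         Event("c.txt", "MODIFY", "same")]
carried = [Event("a.txt", "MODIFY", "c2"),
           Event("c.txt", "MODIFY", "same"),
           Event("d.txt", "NEW", "n")]
to_apply, conflicts, identical = classify(local, carried)
print(sorted(conflicts), sorted(identical), [e.path for e in to_apply])
```

In this example `a.txt` would receive a conflicted copy, `c.txt` needs no action because both sides converge to the same checksum, and only the creation of `d.txt` is applied to the connected storage device.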

[Fig. 5: Conflict Detection]

V. RESULTS AND DISCUSSION

Tests were performed in order to determine the performance of different steps of the synchronization. The performance of the calculation of Rabin fingerprints is demonstrated in Fig. 6. The time taken by the algorithm grew linearly, and the observed times for a file as large as 4GB remained below 40 seconds, which is acceptable. This ensures that the calculation times of the Rabin fingerprints are within an acceptable range for the synchronization. Furthermore, synchronization data transferred during a normal traffic period through the University (anonymized) network produced the results shown in Fig. 7.

[Fig. 6: Times to calculate Rabin fingerprints]

[Figure: Storage utilization of the carrier device]

Fig. 8 demonstrates the utilization of storage space within the carrier device. For the synchronization of two storage devices, 1GB of data was copied to the carrier device initially and the entire set of data was transmitted to the other storage device. Eventual changes happen in smaller files of sizes less than 200MB (for average user-editable documents). Thus, the utilization of storage in the carrier device became minimal after a few round trips between the end storage devices. Furthermore, the times required to update the carrier device decreased with more round trips due to the decrease in content size, as demonstrated in Fig. 8. This made the transmission of the change set much faster and reduced the network utilization further. Due to this improvement over time, the synchronization process becomes faster as the users keep using the synchronization mechanism.

[Figure: Times to update the carrier device]

Table 1 demonstrates a comparison of NIS with some other popular methods of synchronization. The transmission of a file of size 4GB took around 25 minutes using wireless network connectivity of 56Mbps, out of which ~25Mbps was used while there were other devices using the network. This demonstrates that even a short period of connection is sufficient to transmit the files required to perform the synchronization. This makes the methodology more power efficient, since the transfer occupies the device for shorter periods. Furthermore, unlike in existing synchronization mechanisms, none of the traffic is transferred over the Internet, so no Internet bandwidth or quota is consumed.
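As a rough cross-check of the figures quoted in this paper, the short Python calculation below converts between file size, link rate and transfer time. It assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bit/s) and a steady rate with no protocol overhead; neither assumption is stated in the paper, so the numbers are only indicative.

```python
def transfer_seconds(size_bytes: float, rate_bits_per_s: float) -> float:
    """Ideal time to move size_bytes over a link of the given bit rate."""
    return size_bytes * 8 / rate_bits_per_s


def rate_kbytes_per_s(size_bytes: float, seconds: float) -> float:
    """Average throughput in KBytes/s for a completed transfer."""
    return size_bytes / seconds / 1000


# NIS measurement: 4GB file at roughly 25Mbps of usable wireless bandwidth.
nis_seconds = transfer_seconds(4e9, 25e6)
nis_minutes = nis_seconds / 60

# P2P figure from the related work: 50MB in close to 75 seconds.
p2p_rate = rate_kbytes_per_s(50e6, 75)

print(f"4GB at 25Mbps: {nis_seconds:.0f} s, about {nis_minutes:.1f} min")
print(f"50MB in 75 s: about {p2p_rate:.0f} KBytes/s")
```

The ideal time of 1280 seconds (about 21 minutes) is of the same order as the reported 25 minutes, and the P2P rate of about 667 KBytes/s matches the "close to 700 KBytes/s" quoted in Section II.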

TABLE I. COMPARISON WITH SIMILAR SYSTEMS

| Feature | Client-Server (Dropbox/Google Drive) | P2P | NIS |
|---|---|---|---|
| Cloud storage service is used | Yes | No | No |
| Utilize Internet bandwidth | Yes | Yes | No |
| Realtime availability of latest files | Yes | No (until peers are online, files are stale) | No |
| Speed limitation | Uplink speed | Slowest peer | Local network speed (> 50Mbps) |
| Cost for storage | 5GB (Dropbox), 15GB (Google Drive) [11] (paid purchases available) | No | No (only limited by the size of the carrier device) |

The columns Client-Server, P2P and NIS are the file synchronization methods being compared.

As the comparison in Table 1 demonstrates, NIS clearly outperforms the other existing synchronization mechanisms when eventual consistency of content is acceptable.

VI. CONCLUSIONS AND FUTURE WORK

In conclusion, we were able to perform Non-Internet Synchronization successfully with a competitive level of efficiency compared to existing synchronization mechanisms. Furthermore, the test results demonstrated the feasibility of implementing the system on a larger scale where moving devices with lower processing power, but with redundant storage, are available. The proposed NIS methodology improves the utility of redundant resources while minimizing the cost of using the Internet in the long run, providing a better user experience.

In future, the work intends to extend its scope to utilize Distributed Hash Tables (DHT) so that generic nomadic devices, e.g. mobile phones and smart watches, can connect with storage devices automatically and perform NIS. Furthermore, Machine Learning methodologies are expected to be used to improve the usage of carrier device storage. The work is planned to be extended to intelligently choose and synchronize content with carrier devices in order to synchronize storage devices at several locations.

REFERENCES

[1] “Forums - Sync Forums,” 2017. [Online]. Available: https://forum.resilio.com/. [Accessed 04 April 2017].

[2] “How Rsync Works,” Rsync, [Online]. Available: https://rsync.samba.org/how-rsync-works.html. [Accessed 18 April 2017].

[3] “Pydio,” Pydio, 2016. [Online]. Available: https://pydio.com. [Accessed 18 April 2017].

[4] S. Han, H. Shen, T. Kim, A. Krishnamurthy, T. Anderson and D. Wetherall, “MetaSync: Coordinating Storage across Multiple File Synchronization Services,” IEEE Internet Computing, vol. 20, no. 3, pp. 36-44, 2016.

[5] Y. Go, N. Agrawal, A. Aranya and C. Ungureanu, “Reliable, Consistent, and Efficient Data Sync for Mobile Apps,” in 13th USENIX Conference File and Storage Technologies (FAST ’15), Santa Clara, CA, USA, 2015.

[6] A. Tanpure, A. Patil, A. Bansod and A. Ku, “RSYNC over HTTPS for Linux and Windows with,” International Journal of Scientific and Research Publications, vol. 5, no. 11, pp. 582-585, 2015.

[7] J. Ma, C. Bi, Y. Bai and L. Zhang, “CDC: Unlimited Content-Defined Chunking, A File-Differing Method Apply to File-Synchronization among Multiple Hosts,” in 12th International Conference on Semantics, Knowledge and Grids (SKG), 2016.

[8] “Google Drive,” [Online]. Available: https://drive.google.com. [Accessed 5 10 2017].

[9] “Dropbox,” [Online]. Available: https://www.dropbox.com.

[10] “Google Drive Terms of Service,” Google, 16 February 2017. [Online]. Available: https://www.google.com/drive/terms-of-service. [Accessed 10 April 2017].

[11] “Dropbox - Terms,” Dropbox, 08 December 2016. [Online]. Available: https://www.dropbox.com/terms. [Accessed 10 April 2017].

[12] J. Blasco, R. D. Pietro, A. Orfila and A. Sorniotti, “A tunable proof of ownership scheme for deduplication using Bloom filters,” in 2014 IEEE Conference on Communications and Network Security (CNS), San Francisco, CA, USA, 2014.

[13] Drew and Arash, “Dropbox changes to policy,” 7 2011. [Online]. Available: https://blogs.dropbox.com/dropbox/2011/07/changes-to-our-policies. [Accessed 1 10 2017].

[14] A. Lareida, T. Bocek, S. Golaszewski, C. Luthold and M. Weber, “Box2Box - A P2P-based file-sharing and synchronization application,” in 2013 IEEE Thirteenth International Conference on Peer-to-Peer Computing (P2P), Trento, Italy, 2013.

[15] Z. Peng, R. R. Pallelra and H. Wang, “On the measurement of P2P file synchronization: Resilio Sync as a case study,” in 2017 IEEE/ACM 25th International Symposium on Quality of Service (IWQoS), Vilanova i la Geltru, Spain, 2017.

[16] UPnP Forum, UPnP™ Device Architecture 1.1, 2008.

[17] H. Tang and K. Eshghi, “A Framework for Analyzing and Improving Content-Based Chunking Algorithms,” Hewlett-Packard Labs Technical Report TR 30, Palo Alto, CA, 2005.

[18] R. Rajaboina, P. Reddy, R. A. Kumar and N. Venkatramana, “Performance comparison of TCP, UDP and TFRC in static wireless environment,” in Electronics and Communication Systems (ICECS), 2015 2nd International Conference on, Coimbatore, India, 2015.

[19] D. Madhuri and P. Reddy, “Performance comparison of TCP, UDP and SCTP in a wired network,” in Communication and Electronics Systems (ICCES), International Conference, Coimbatore, India, 2016.

[20] S. Robinson, “NeDB: A Lightweight JavaScript Database,” 29 April 2016. [Online]. Available: http://stackabuse.com/nedb-a-lightweight-javascript-database. [Accessed 10 May 2017].
