Algorithms for Low-Latency Remote File Synchronization†

Hao Yan*, CIS Department, Polytechnic University, Brooklyn, NY 11201
Utku Irmak, Yahoo! Inc., 2821 Mission College Blvd, Santa Clara, CA 95054
Torsten Suel, CIS Department, Polytechnic University, Brooklyn, NY 11201
[email protected] [email protected] [email protected]

Abstract

The remote file synchronization problem is how to update an outdated version of a file located on one machine to the current version located on another machine with a minimal amount of network communication. It arises in many scenarios including web site mirroring, file system backup and replication, or web access over slow links. A widely used open-source tool called rsync uses a single round of messages to solve this problem (plus an initial round for exchanging meta information). While research has shown that significant additional savings in bandwidth are possible by using multiple rounds, such approaches are often not desirable due to network latencies, increased pro-

tions of data within their distributed architectures. In all of these applications, there are usually significant similarities between successive versions of the files, and we would like to exploit these in order to decrease the amount of data that is transmitted. If the new files differ only very slightly from their previous versions, then the cost should be very low, while for more significant changes, more data may have to be transmitted.

Remote file synchronization has been studied extensively over the last ten years, and existing approaches can be divided into single-round and multi-round protocols. Single-round protocols [32, 12] are preferable in scenarios involving small files and large network latencies (e.g., web access over slow links).