In-Place Reconstruction of Delta Compressed Files
Randal C. Burns
IBM Almaden Research Center
650 Harry Rd., San Jose, CA 95120
[email protected]

Darrell D. E. Long†
Department of Computer Science
University of California, Santa Cruz, CA 95064
[email protected]

† The work of this author was performed while a Visiting Scientist at the IBM Almaden Research Center.

Abstract

We present an algorithm for modifying delta compressed files so that the compressed versions may be reconstructed without scratch space. This allows network clients with limited resources to efficiently update software by retrieving delta compressed versions over a network.

Delta compression for binary files, compactly encoding a version of data with only the changed bytes from a previous version, may be used to efficiently distribute software over low bandwidth channels, such as the Internet. Traditional methods for rebuilding these delta files require memory or storage space on the target machine for both the old and new version of the file to be reconstructed. With the advent of network computing and Internet-enabled devices, many of these network attached target machines have limited additional scratch space. We present an algorithm for modifying a delta compressed version file so that it may rebuild the new version in the space that the current version occupies.

1 Introduction

Recent developments in portable computing and computing appliances have resulted in a proliferation of small network attached computing devices. These include personal digital assistants (PDAs), Internet set-top boxes, network computers, control devices with analog sensors, and cellular devices. The software and operating systems of these devices may be updated by transmitting the new version of a program over a network. However, low bandwidth channels to network devices often make the time to perform software updates prohibitive. In particular, heavy Internet traffic results in high latency and low bandwidth to web-enabled clients and prevents the timely delivery of software.

Differential or delta compression [5, 11], compactly encoding a new version of a file using only the changed bytes from a previous version, can be used to reduce the size of the file to be transmitted and consequently the time to perform a software update. Currently, decompressing delta encoded files requires scratch space: additional disk or memory storage used to hold a required second copy of the file. Two copies of the file must be concurrently available, as the delta file contains directives to read data from the old file version while the new file version is being materialized in another region of storage. This presents a problem. Network attached devices often have limited memory resources and no disks and therefore are not capable of storing two file versions at the same time. Furthermore, adding storage to network attached devices is not viable, as keeping these devices simple limits their production costs.

We address this problem by post-processing delta encoded files so that they are suitable for reconstructing the new version of the file in-place, materializing the new version in the same memory or storage space that the previous version occupies. A delta file can be considered a set of instructions to a computer to materialize a new file version in the presence of a reference version, the old version of the file. When rebuilding a version encoded by a delta file, data are both copied from the reference file to the new version and added explicitly when portions of the new version do not appear in the reference version.

If we attempt to reconstruct an arbitrary delta file in-place, the resulting output can often be corrupt. This occurs when the delta encoding instructs the computer to copy data from a file region where new file data have already been written. The data the command attempts to read have already been altered and the rebuilt file is not correct.
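To make this failure mode concrete, here is a small sketch (ours, not the paper's code) in which delta commands are modeled as `("copy", f, t, l)` tuples, copying `l` bytes from offset `f` in the reference file to offset `t` in the version file, and `("add", t, data)` tuples writing literal bytes at offset `t`; these encodings are assumptions made for illustration. A delta that swaps the two halves of a file reconstructs correctly with scratch space but corrupts the file when replayed in-place:

```python
def apply_with_scratch(reference, commands, new_size):
    """Conventional reconstruction: the new version is built in
    separate scratch space, so every copy reads unmodified data."""
    out = bytearray(new_size)
    for cmd in commands:
        if cmd[0] == "copy":
            _, f, t, l = cmd
            out[t:t+l] = reference[f:f+l]
        else:
            _, t, data = cmd
            out[t:t+len(data)] = data
    return bytes(out)

def apply_in_place_naive(buf, commands):
    """In-place reconstruction: buf initially holds the reference
    version, so a copy may read a region already overwritten."""
    for cmd in commands:
        if cmd[0] == "copy":
            _, f, t, l = cmd
            buf[t:t+l] = buf[f:f+l]
        else:
            _, t, data = cmd
            buf[t:t+len(data)] = data
    return bytes(buf)

old = b"ABCDEFGH"
# Swap the file's two halves: each copy's source overlaps the
# other's destination.
delta = [("copy", 4, 0, 4), ("copy", 0, 4, 4)]

apply_with_scratch(old, delta, 8)             # b"EFGHABCD" (correct)
apply_in_place_naive(bytearray(old), delta)   # b"EFGHEFGH" (corrupt: the
# first copy overwrote the data the second copy needed to read)
```

Note that reordering the two copies does not help here; applying them in the opposite order yields `b"ABCDABCD"`, which is also wrong. The two commands conflict in both orders, which is why an in-place algorithm must sometimes do more than permute commands.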
By detecting and avoiding such conflicts, our method allows us to rebuild versions with no scratch space.

We present a graph-theoretic algorithm for post-processing delta files that detects situations where a delta file would attempt to read from an already written region and permutes the order that the commands in a delta file are applied to reduce the occurrence of these conflicts. The algorithm eliminates the remaining data conflicts by removing commands that copy data and explicitly adding these data to the delta file. Eliminating data copied between versions increases the size of the delta encoding but allows the algorithm to always output an in-place reconstructible delta file.

Experimental results verify that modifying delta files for in-place reconstruction is a viable and efficient technology. Our findings indicate that a small fraction of compression is lost in exchange for in-place reconstructibility. Also, in-place reconstructible files can be generated efficiently; creating a delta file takes less time than modifying it to be in-place reconstructible.

In §2, we summarize the preceding work in the field of delta compression. We describe how delta files are encoded in §3. In §4, we present an algorithm that modifies delta encoded files to be in-place reconstructible. In §5, we further examine the exchange of run-time and compression performance. In §6, we present limits on the size of the digraphs our algorithm generates. Section 7 presents experimental results for the execution time and compression performance of our algorithm, and we present our conclusions in §8.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PODC 98, Puerto Vallarta, Mexico. Copyright ACM 1998 0-89791-977-7/98/6...$5.00.

2 Related Work

Encoding versions of data compactly by detecting altered regions of data is a well known problem. The first applications of delta compression found changed lines in text data for analyzing the recent modifications to files [6]. Considering data as lines of text fails to encode minimum sized delta files, as it does not examine data at a minimum granularity and only finds matching data that are aligned at the beginning of a new line.

The problem of compactly representing the changes between versions of data was formalized as string-to-string correction with block move [14]: detecting maximally matching regions of a file at an arbitrarily fine granularity without alignment. Even though the general problem of detecting and encoding version to version modifications was well defined, delta compression applications continued to rely on the alignment of data, as in database records [13], and the grouping of data into block or line granularity, as in source code control systems [12, 15], to simplify the combinatorial task of finding the common and different strings between files.

Efforts to generalize delta compression to data that are not aligned and to minimize the granularity of the smallest change resulted in algorithms for compressing data at the granularity of a byte. Early algorithms were based upon either dynamic programming [9] or the greedy method [11] and performed this task using time quadratic in the length of the input files. More recent algorithms trade an experimentally verified small amount of compression in order to run using time linear in the length of the input files. The improved algorithms allow large files without known structure to be efficiently differenced and permit the application of delta compression to backup and restore [4], file system replication, and software distribution.

Recently, applications distributing HTTP objects using delta files have emerged [10, 2]. This permits web servers to both reduce the amount of data to be transmitted to a client and reduce the latency associated with loading web pages. Efforts to standardize delta files as part of the HTTP protocol, and the trend towards making small network devices, for example hand-held organizers, HTTP compliant, indicate the need to efficiently distribute data to network devices.

3 Encoding Delta Files

Differencing algorithms compactly encode the changes between file versions by finding strings in the new file that may be copied from the prior version of the same file. Differencing algorithms perform this task by partitioning the data in the file into strings that may be encoded using copies and strings that do not appear in the prior version and must be explicitly added to the new file. Having partitioned the file to be compressed, the algorithm outputs a delta file that encodes this version compactly. This delta file consists of an ordered sequence of copy commands and add commands.

An add command is an ordered pair, (t, l), where t (to) encodes the string offset in the file version and l (length) encodes the length of the string. This pair is followed by the l bytes of data to be added (Figure 1).

The encoding of a copy command is an ordered triple, (f, t, l), where f (from) encodes the offset in the reference file from which data are copied, t encodes the offset in the new file where the data are to be written, and l encodes the length of the data to be copied. The copy command is a directive that copies the string data in the interval [f, f+l-1] in the reference file to the interval [t, t+l-1] in the version file.

In the presence of the reference file, a delta file rebuilds the version file with add and copy directives. The intervals in the version file encoded by these directives are disjoint. Therefore, any permutation of the order in which these commands are applied to the reference file materializes the same output version file.

4 An In-Place Reconstruction Algorithm

This algorithm modifies an existing delta file so that it can correctly reconstruct a new file version in the space the current version occupies.
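As outlined in §1, the post-processing is graph-theoretic: copy commands must be ordered so that each reads its source region before any other command overwrites it, and copies trapped in a cycle of such conflicts are converted into add commands carrying their data explicitly. The sketch below is our own minimal illustration of that idea, not the authors' implementation; the `("copy", f, t, l)` / `("add", t, data)` tuples and the pick-the-lowest-index cycle-breaking heuristic are assumptions (the paper chooses which copies to eliminate far more carefully).

```python
def overlaps(a_start, a_len, b_start, b_len):
    # True when the byte intervals [a_start, a_start+a_len) and
    # [b_start, b_start+b_len) intersect.
    return a_start < b_start + b_len and b_start < a_start + a_len

def make_in_place(reference, copies):
    """Order copy commands so each reads before its source is
    overwritten; convert copies caught in cycles into adds."""
    n = len(copies)
    # Edge u -> v means u must run before v: v's write region
    # overlaps u's read region, so running v first corrupts u.
    succ = [set() for _ in range(n)]
    indeg = [0] * n
    for u in range(n):
        _, fu, _, lu = copies[u]
        for v in range(n):
            if u == v:
                continue
            _, _, tv, lv = copies[v]
            if overlaps(fu, lu, tv, lv):
                succ[u].add(v)
                indeg[v] += 1
    ordered, adds = [], []
    remaining = set(range(n))
    ready = [u for u in range(n) if indeg[u] == 0]
    while remaining:
        while ready:  # topological order over the conflict digraph
            u = ready.pop()
            if u not in remaining:
                continue
            remaining.discard(u)
            ordered.append(copies[u])
            for v in succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
        if remaining:
            # Cycle: no safe order exists. Break it by turning one copy
            # into an add that carries the bytes it would have copied.
            u = min(remaining)  # simple stand-in for the paper's choice
            remaining.discard(u)
            _, f, t, l = copies[u]
            adds.append(("add", t, bytes(reference[f:f+l])))
            for v in succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    # Adds read nothing from the file, so they are safe to run last;
    # the delta's original add commands would be appended here too.
    return ordered + adds
```

For example, two copies that swap the halves of a file conflict in both orders; the cycle forces one of them to become an add, after which replaying the returned command list directly on the buffer holding the old version yields the correct new version.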