In-Place Reconstruction of Delta Compressed Files

Randal C. Burns
IBM Almaden Research Center
650 Harry Rd., San Jose, CA 95120
[email protected]

Darrell D. E. Long†
Department of Computer Science
University of California, Santa Cruz, CA 95064
[email protected]

† The work of this author was performed while a Visiting Scientist at the IBM Almaden Research Center.

Abstract

We present an algorithm for modifying delta compressed files so that the compressed versions may be reconstructed without scratch space. This allows network clients with limited resources to efficiently update software by retrieving delta compressed versions over a network.

Delta compression for binary files, compactly encoding a version of data with only the changed bytes from a previous version, may be used to efficiently distribute software over low bandwidth channels, such as the Internet. Traditional methods for rebuilding these delta files require memory or storage space on the target machine for both the old and new version of the file to be reconstructed. With the advent of network computing and Internet-enabled devices, many of these network attached target machines have limited additional scratch space. We present an algorithm for modifying a delta compressed version file so that it may rebuild the new version in the space that the current version occupies.

1 Introduction

Recent developments in portable computing and computing appliances have resulted in a proliferation of small network attached computing devices. These include personal digital assistants (PDAs), Internet set-top boxes, network computers, control devices with analog sensors, and cellular devices. The software and operating systems of these devices may be updated by transmitting the new version of a program over a network. However, low bandwidth channels to network devices often make the time to perform software updates prohibitive. In particular, heavy Internet traffic results in high latency and low bandwidth to web-enabled clients and prevents the timely delivery of software.

Differential or delta compression [5, 11], compactly encoding a new version of a file using only the changed bytes from a previous version, can be used to reduce the size of the file to be transmitted and consequently the time to perform software update. Currently, decompressing delta encoded files requires scratch space, additional disk or memory storage, used to hold a required second copy of the file. Two copies of the compressed file must be concurrently available, as the delta file contains directives to read data from the old file version while the new file version is being materialized in another region of storage. This presents a problem. Network attached devices often have limited memory resources and no disks and therefore are not capable of storing two file versions at the same time. Furthermore, adding storage to network attached devices is not viable, as keeping these devices simple limits their production costs.

We address this problem by post-processing delta encoded files so that they are suitable for reconstructing the new version of the file in-place, materializing the new version in the same memory or storage space that the previous version occupies. A delta file can be considered a set of instructions to a computer to materialize a new file version in the presence of a reference version, the old version of the file. When rebuilding a version encoded by a delta file, data are both copied from the reference file to the new version and added explicitly when portions of the new version do not appear in the reference version.

If we attempt to reconstruct an arbitrary delta file in-place, the resulting output can often be corrupt. This occurs when the delta encoding instructs the computer to copy data from a file region where new file data has already been written. The data the command attempts to read have already been altered and the rebuilt file is not correct. By detecting and avoiding such conflicts, our method allows us to rebuild versions with no scratch space.

We present a graph-theoretic algorithm for post-processing delta files that detects situations where a delta file would attempt to read from an already written region and permutes the order that the commands in a delta file are applied to reduce the occurrence of these conflicts.

The algorithm eliminates the remaining data conflicts by removing commands that copy data and explicitly adding these data to the delta file. Eliminating data copied between versions increases the size of the delta encoding but allows the algorithm to always output an in-place reconstructible delta file.

Experimental results verify that modifying delta files for in-place reconstruction is a viable and efficient technology. Our findings indicate that a small fraction of compression is lost in exchange for in-place reconstructibility. Also, in-place reconstructible files can be generated efficiently; modifying a delta file to be in-place reconstructible takes less time than creating the delta file in the first place.

In §2, we summarize the preceding work in the field of delta compression. We describe how delta files are encoded in §3. In §4, we present an algorithm that modifies delta encoded files to be in-place reconstructible. In §5, we further examine the exchange of run-time and compression performance. In §6, we present limits on the size of the digraphs our algorithm generates. Section 7 presents experimental results for the execution time and compression performance of our algorithm, and we present our conclusions in §8.

2 Related Work

Encoding versions of data compactly by detecting altered regions of data is a well known problem. The first applications of delta compression found changed lines in text data for analyzing the recent modifications to files [6]. Considering data as lines of text fails to encode minimum sized delta files, as it does not examine data at a minimum granularity and only finds matching data that are aligned at the beginning of a new line.

The problem of compactly representing the changes between versions of data was formalized as string-to-string correction with block move [14] - detecting maximally matching regions of a file at an arbitrarily fine granularity without alignment. Even though the general problem of detecting and encoding version to version modifications was well defined, delta compression applications continued to rely on the alignment of data, as in database records [13], and the grouping of data into block or line granularity, as in source code control systems [12, 15], to simplify the combinatorial task of finding the common and different strings between files.

Efforts to generalize delta compression to data that are not aligned and to minimize the granularity of the smallest change resulted in algorithms for compressing data at the granularity of a byte. Early algorithms were based upon either dynamic programming [9] or the greedy method [11] and performed this task using time quadratic in the length of the input files. Recent advances in differencing algorithms have produced efficient algorithms that detect matching strings between versions at an arbitrarily fine granularity without alignment restrictions [1, 5]. These differencing algorithms trade an experimentally verified small amount of compression in order to run using time linear in the length of the input files. The improved algorithms allow large files without known structure to be efficiently differenced and permit the application of delta compression to backup and restore [4], file system replication, and software distribution.

Recently, applications distributing HTTP objects using delta files have emerged [10, 2]. This permits web servers to both reduce the amount of data to be transmitted to a client and reduce the latency associated with loading web pages. Efforts to standardize delta files as part of the HTTP protocol, and the trend towards making small network devices, for example hand-held organizers, HTTP compliant, indicate the need to efficiently distribute data to network devices.

3 Encoding Delta Files

Differencing algorithms compactly encode the changes between file versions by finding strings in the new file that may be copied from the prior version of the same file. Differencing algorithms perform this task by partitioning the data in the file into strings that may be encoded using copies and strings that do not appear in the prior version and must be explicitly added to the new file. Having partitioned the file to be compressed, the algorithm outputs a delta file that encodes this version compactly. This delta file consists of an ordered sequence of copy commands and add commands. An add command is an ordered pair, (t, l), where t (to) encodes the string offset in the file version and l (length) encodes the length of the string. This pair is followed by the l bytes of data to be added (Figure 1).

The encoding of a copy command is an ordered triple, (f, t, l), where f (from) encodes the offset in the reference file from which data are copied, t encodes the offset in the new file where the data are to be written, and l encodes the length of the data to be copied. The copy command is a directive that copies the string data in the interval [f, f + l - 1] in the reference file to the interval [t, t + l - 1] in the version file.

In the presence of the reference file, a delta file rebuilds the version file with add and copy directives. The intervals in the version file encoded by these directives are disjoint. Therefore, any permutation of the order these commands are applied to the reference file materializes the same output version file.
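To make these command semantics concrete, the following sketch applies a delta file in the traditional manner, materializing the version in scratch space that is separate from the reference file. The Python representation below (the Copy and Add classes and the apply_delta function) is our own illustration, not the codeword format of any particular differencing tool; because the write intervals are disjoint, the loop may visit the commands in any order.

    from dataclasses import dataclass

    @dataclass
    class Copy:
        f: int   # offset read in the reference file
        t: int   # offset written in the version file
        l: int   # length of the copied string

    @dataclass
    class Add:
        t: int       # offset written in the version file
        data: bytes  # the l bytes of new data that follow the command

    def apply_delta(reference: bytes, commands, version_length: int) -> bytes:
        # Traditional reconstruction: reads touch only the separate reference
        # copy, so command order is irrelevant.
        version = bytearray(version_length)
        for c in commands:
            if isinstance(c, Copy):
                version[c.t:c.t + c.l] = reference[c.f:c.f + c.l]
            else:
                version[c.t:c.t + len(c.data)] = c.data
        return bytes(version)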

Figure 1: Encoding delta files. Common strings are encoded as copy commands (f, t, l) and new strings in the new file are encoded as add commands (t, l) followed by the string of length l that the command adds.

4 An In-Place Reconstruction Algorithm

This algorithm modifies an existing delta file so that it can correctly reconstruct a new file version in the space the current version occupies. At a high level, our technique examines the input delta file to find copy commands that read from the write interval of other copy commands. These potential data conflicts are encoded into a digraph. This digraph is then topologically sorted to produce an ordering on these copy commands that reduces data conflicts. We eliminate the remaining conflicts by converting copy commands to add commands and output the permuted and converted commands as an in-place reconstructible delta file.

While our algorithm can most easily be described as a post-processing step on an existing delta file, as done in this work, it also integrates easily into a compression algorithm so that an in-place reconstructible file may be output directly.

4.1 Conflict Detection

Since we are reconstructing files in-place, we need concern ourselves with the order that the copy commands are applied. Assume that we attempt to read from an offset that has already been written. This will result in an incorrect reconstruction since the reference file data we are attempting to read have been overwritten. This is termed a write before read (WR) conflict [3].

For copy commands (f_i, t_i, l_i) and (f_j, t_j, l_j), with i < j, there is a WR conflict when

    [t_i, t_i + l_i - 1] ∩ [f_j, f_j + l_j - 1] ≠ ∅.    (1)

In other words, copy command i and copy command j conflict if i writes to the interval from which copy command j reads data. By convention, we consider copy commands that are numbered and ordered so that, if i < j, copy command i is read and written before copy command j. By enforcing this constraint on copy commands i and j, we limit ourselves to applying the delta file commands serially, which is appropriate for limited capability network devices.

This definition only considers WR conflicts between copy commands and neglects add commands. Add commands only write data to the new file; they do not read data from the reference file. Consequently, all potential WR conflicts associated with adding data may be avoided by placing add commands at the end of a delta file. This way, all reads associated with copy commands are completed before the first add command is processed.

Additionally, we define WR conflicts so that a copy command cannot conflict with itself. Yet, a single copy command's read and write intervals may intersect and would seem to cause a conflict. Copy commands whose read and write intervals overlap can be dealt with by performing the copy in either a left-to-right or right-to-left manner. For copy command (f, t, l), if f ≥ t, then when the file is being reconstructed, copy the string byte by byte starting at the left-hand side. Since the f (from) offset is no less than the t (to) offset in the new file, a left-to-right copy never reads a byte that has been overwritten by a previous byte in the string. When f < t, a symmetric argument shows that we should start our copy at the right-hand edge of the string and work backwards. For this example, we performed the copies in a byte-wise fashion. However, the notion of a left-to-right or right-to-left copy applies to moving a read/write buffer of any size.
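The direction rule for a self-overlapping copy fits in a few lines over a single mutable buffer; it is the same discipline that memmove-style routines follow. The sketch below is illustrative only, assuming the file being rebuilt already resides in a bytearray; a buffered variant would move a block at a time rather than a byte, as noted above.

    def copy_in_place(buf: bytearray, f: int, t: int, l: int) -> None:
        # Perform copy command (f, t, l) within one buffer without corruption.
        if f >= t:
            # Left-to-right: each source byte is read before the write
            # cursor reaches it.
            for i in range(l):
                buf[t + i] = buf[f + i]
        else:
            # f < t: copy right-to-left so the overlapping tail is read
            # before it is overwritten.
            for i in reversed(range(l)):
                buf[t + i] = buf[f + i]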

To avoid WR conflicts and achieve the in-place reconstruction of delta files, we employ the following three techniques.

1. Place all add commands at the end of the delta file to avoid data conflicts with copy commands.

2. Permute the order of application of the copy commands to reduce the number of write before read conflicts.

3. For remaining WR conflicts, remove the conflicting operation by converting a copy command to an add command and place it at the end of the delta file.

Any delta file may be post-processed using these methods to create a delta file that can be reconstructed in-place. For many delta files, a permutation that eliminates all WR conflicts is unattainable. Consequently, we require the conversion of copy commands to add commands to create correct in-place reconstructible files for all inputs.

Having post-processed a delta file for in-place reconstruction, the permuted and modified delta file obeys the property

    ∀j: [f_j, f_j + l_j - 1] ∩ ( ⋃_{i=1}^{j-1} [t_i, t_i + l_i - 1] ) = ∅,    (2)

indicating the absence of any WR conflict. Equivalently, it guarantees that the data a copy command reads and transfers are always data from the original file.

4.2 Generating Conflict Free Permutations

In order to find a permutation that minimizes WR conflicts, we encode potential conflicts between the copy commands in a digraph and topologically sort this digraph. A topological sort on digraph G = (V, E) produces a linear order on all vertices so that if G contains edge (u, v), then vertex u precedes vertex v in topological order. The digraph contains a vertex for each copy command and an edge from one copy command to another when the first command reads data from the interval to which the second writes. A topological ordering of this digraph applies each copy command before its read interval is overwritten. When the digraph contains cycles, no conflict free ordering exists, and some copy command on each cycle must be converted to an add command to break the cycle. An optimal algorithm can minimize the amount of compression lost by selecting the optimal set of copy commands for converting the input digraph into an acyclic digraph. However, this natural optimization problem is NP-hard. For our implementation of in-place conversion, we examine two policies for breaking cycles. The constant time policy picks the easiest vertex to remove, based on the execution order of the topological sort, and deletes this node. This policy performs no extra work when breaking cycles. The locally minimum policy detects a cycle and loops through all nodes in the cycle to determine and then delete the minimum cost node, i.e. the node that encodes the smallest copied string. The locally minimum policy may perform as much additional work as the total length of cycles found by the algorithm. We further examine the issues of efficiently breaking cycles in §5.

Our algorithm for converting delta files into in-place reconstructible delta files takes the following steps to find and eliminate WR conflicts between a reference file and the new version to be materialized.

Algorithm

1. Given an input delta file, the first step is to partition the commands in the file into a set C of copy commands and a set A of add commands.

2. Sort the copy commands by increasing write offset, C_sorted = {c_1, c_2, ..., c_n}. For c_i and c_j, this set obeys: i < j ⟺ t_i < t_j.

3. Construct the digraph from the sorted copy commands: a vertex v_i for each command c_i and an edge from v_i to v_j whenever the read interval [f_i, f_i + l_i - 1] of c_i intersects the write interval [t_j, t_j + l_j - 1] of c_j.

4. Topologically sort the digraph, breaking any cycle the sort detects by converting a copy command on the cycle, chosen according to the cycle breaking policy, into an add command.

5. Output the remaining copy commands in topologically sorted order, followed by all add commands, which retain the order that the commands appeared in the original delta file.
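A sketch of steps 2 and 3: once the copies are sorted by write offset, each command's read interval can be matched against the disjoint write intervals by binary search, which yields the O(|C| log |C| + |E|) edge construction described in §4.3. The function and variable names are our own; copies is assumed to be a list of (f, t, l) triples.

    import bisect

    def build_conflict_digraph(copies):
        # Returns (sorted_copies, edges); an edge i -> j means copy i reads
        # from the interval that copy j writes, so i must be applied first.
        cs = sorted(copies, key=lambda c: c[1])   # step 2: sort by write offset t
        starts = [t for (f, t, l) in cs]          # write intervals are disjoint
        edges = {i: [] for i in range(len(cs))}
        for i, (f, t, l) in enumerate(cs):
            # Rightmost write interval starting at or before the read end.
            k = bisect.bisect_right(starts, f + l - 1) - 1
            # Walk left while write intervals still intersect [f, f + l - 1].
            while k >= 0 and cs[k][1] + cs[k][2] - 1 >= f:
                if k != i:
                    edges[i].append(k)            # step 3: a potential WR conflict
                k -= 1
        return cs, edges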

4.3 Algorithmic Performance

Given the set of copy commands C with |C| members, the presented algorithm uses time O(|C| log |C|) for both sorting the copy commands by write order and for finding conflicting commands, using binary search on the sorted write intervals for the |V| nodes in V - recall that |V| = |C|. Additionally, the algorithm separates and outputs add commands using time O(|A|) and builds the edge relation using time O(|E|). The total execution time is O(|C| log |C| + |E| + |A|). This result relies upon a topological sort algorithm that runs in time O(|V| + |E|) [8]. Additionally, we have assumed that cycles found in the digraph will be broken with the constant time policy (O(1)). The algorithm uses space O(|E| + |C| + |A|) to store the data associated with each set.

The algorithm accepts as input a delta file of length n. It generates as many nodes as copy commands, and the number of copy commands can grow linearly in the size of the input delta file, |V| = |C| = O(n). The same is true of add commands, |A| = O(n). However, we have no bound for the number of edges, excepting the theoretical bound on general digraphs, O(|V|²). In §6, we demonstrate by example that the subclass of digraphs generated by our algorithm can realize this bound. On the other hand, we also show that the number of edges in the class of digraphs our algorithm generates is linear in the length of the version file V that the delta file encodes (Lemma 1). We denote the length of V by L_V.

Substituting these bounds into the performance expression, for an input delta file of length n encoding a version file of length L_V, our algorithm runs in time O(n log n + L_V) and space O(n + L_V).

5 Strategies for Breaking Cycles

The topological sort algorithm described above generates an ordering on the nodes of a digraph and detects cycles that are in conflict with the ordering. Having found the cycles in a digraph, we are faced with the problem of choosing a method to break all of these cycles. We previously mentioned a constant time policy that breaks individual cycles using time O(1). This policy can easily be implemented in a topological sort by selecting the last node in sort order before the cycle was found. However, this method does not guarantee to break cycles with a minimum compression cost. Yet, other techniques for breaking cycles turn out to be either intractable or offer no compression advantage in the worst case. In fact, determining the minimum cost, in terms of lost compression, set of nodes in a digraph to be removed turns out to be NP-hard.

We note that the digraphs the algorithm generates are a subclass of general digraphs. We denote this subclass conflicting read write interval (CRWI) digraphs. To the best of our knowledge, this class has not previously been studied. While we know little about its structure, it is clear that the class of CRWI digraphs is smaller than that of general digraphs. For example, the CRWI class does not include any complete digraphs with more than two vertices.

We define the amount of compression lost upon deleting a node the cost of deletion. Based on this cost function, we formulate the optimization problem of finding the minimum cost set of nodes to delete to make a digraph acyclic. A copy command is an ordered triple of constant size, c = (f, t, l). An add command is an ordered double a = (t, l) followed by the l bytes of data to be added to the new version of the file. Replacing a copy command with an add command increases the delta file size by exactly l - |f|, where |f| is the size of the encoding of f.

Given a cost function labeling each node v_i with cost l_i - |f_i|, we can express the elimination of cycles as an optimization problem:

Input: CRWI digraph G = (V, E) with function Cost(v_i) assigning a positive integer to every node v_i ∈ V.

Optimization Problem: Minimize Σ_{s_i ∈ S} Cost(s_i) over S ⊆ V so that the digraph resulting from G by eliminating all the vertices in S and their adjacent edges is acyclic.

This problem is NP-hard. The related decision problem can be shown NP-complete by reduction from Karp's well known problem [7] of determining if there exists a set of vertices and their adjacent edges to remove in a general digraph that makes that digraph acyclic and the set has fewer members than some fixed target. The fundamental concept of the reduction is a construction that encodes the input general digraph for Karp's problem into a digraph with membership in class CRWI. We will not present the reduction in this work.

Having established the intractability of the globally optimal solution, let us consider local solutions, minimizing the increase in file size when breaking each cycle individually. When encountering a cycle, a locally minimum solution loops through that cycle determining the minimum cost vertex and removes it.

An adversarial example establishes the incapability of the locally minimum solution to produce a reasonable global solution. In the digraph of Figure 2, with membership in CRWI, the locally minimum policy for breaking cycles looks at all cycles (v_0, ..., v_i, v_0) for i ∈ 1, 2, ..., k. For each cycle, it chooses to delete the minimum cost node, at cost C. As a result, the algorithm deletes nodes v_1, v_2, ..., v_k. However, deleting node v_0 is the globally optimal solution.
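The sketch below walks through both cycle breaking policies in one routine. For brevity it uses a quadratic Kahn-style sort and lets the constant time policy share the cycle walk, whereas the policy described above selects its victim without any walk; cost is a function giving each node its l_i - |f_i| deletion cost. This is our own illustration of the policies, not the paper's implementation.

    def order_for_in_place(n, edges, cost, policy="local_min"):
        # edges[i] lists j with an edge i -> j (i must be applied before j).
        # Returns (application order, nodes deleted to break cycles); the
        # deleted copies are the ones converted to add commands.
        preds = {j: set() for j in range(n)}
        for i in edges:
            for j in edges[i]:
                preds[j].add(i)
        remaining, order, deleted = set(range(n)), [], set()
        while remaining:
            ready = [u for u in remaining if not (preds[u] & remaining)]
            if ready:
                u = min(ready)            # any ready copy may be applied next
                order.append(u)
                remaining.remove(u)
                continue
            # Every remaining node has a remaining predecessor: trace one
            # cycle by walking predecessor links until a node repeats.
            u, seen, path = next(iter(remaining)), {}, []
            while u not in seen:
                seen[u] = len(path)
                path.append(u)
                u = next(p for p in preds[u] if p in remaining)
            cycle = path[seen[u]:]
            victim = cycle[0] if policy == "constant" else min(cycle, key=cost)
            deleted.add(victim)           # this copy becomes an add command
            remaining.remove(victim)
        return order, deleted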

Figure 2: A CRWI digraph constructed from a binary tree by adding a directed edge from each leaf to the root node. The locally minimum cycle breaking policy performs poorly on this CRWI digraph, removing each leaf vertex instead of the root vertex.

In this example, the size of the delta associated with the locally minimum solution grows arbitrarily larger than that of the globally optimal solution as k increases.

The locally minimum solution loops through the cycles it finds and uses total execution time proportional to the length of these cycles. For the example in Figure 2, the algorithm does log |V| work to break each cycle and takes total time O(|V| log |V|). The locally minimum solution on this example does not degrade the asymptotic performance of the in-place algorithm. However, a worst case bound on the length of the cycles examined by the locally minimum solution on CRWI digraphs remains an open problem and could be as large as O(|V|²).

Breaking cycles in constant time, arbitrarily selecting a node in the cycle to delete, also fails to approximate the globally optimal solution to within a constant. However, by choosing a node in the cycle in constant time, the asymptotic run-time of the algorithm is preserved.

The merit of the locally minimum solution, as compared to breaking cycles in constant time, is difficult to determine. On delta files whose digraphs have sparse edge relations, cycles are infrequent and looping through cycles saves compression at little cost. However, worst case analysis indicates no preference for the locally minimum solution when compared to the constant time policy, even though our intuition may indicate otherwise. This motivates a performance investigation of the run-time and compression associated with these two policies (§7).

6 Bounding the Size of the Digraph

The performance of digraph construction, topological sorting and cycle breaking depends upon the number of edges in the digraphs our algorithm constructs. We previously asserted (§4.3) that the number of edges in a CRWI digraph can grow quadratically with the number of copy commands and is also bounded by the length of the version file. We present analysis to verify these assertions.

No digraph may have more than O(|V|²) edges. We now show an example of two file versions whose CRWI digraph realizes this bound to establish that it is tight. Consider a file of length L that is broken up into blocks of size √L (Figure 3). There are √L such blocks, b_1, b_2, ..., b_√L. Assume that all blocks excluding the first block in the new file, b_2, b_3, ..., b_√L, are copies of the first block in the reference file. Also, the first block in the new file consists of √L copies of length 1 from any location in the reference file.

Since each length 1 command writes into each length √L command's read interval, there is an edge between every length √L node and every length 1 node. This digraph has √L - 1 nodes each with out-degree √L, for total edges in Ω(L) = Ω(|C|²). While the length 1 copies seem improbable, more reasonable variations of this example with larger constant size copies and more sparse edge relations may be constructed in a similar fashion.

The Ω(L) bound also turns out to be the maximum number of possible edges.
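The quadratic construction is easy to reproduce. The snippet below builds the command set of Figure 3, choosing (arbitrarily, as the text allows) read offset i for the i-th length 1 copy; feeding it to the digraph builder sketched in §4 yields (√L - 1) · √L edges.

    import math

    def adversarial_copies(L):
        # L must be a perfect square. Blocks b_2 .. b_sqrt(L) of the version
        # copy reference block b_1; block b_1 of the version is sqrt(L)
        # copies of length 1.
        b = math.isqrt(L)
        long_copies = [(0, t, b) for t in range(b, L, b)]   # each reads [0, b-1]
        short_copies = [(i, i, 1) for i in range(b)]        # write into [0, b-1]
        return long_copies + short_copies

    # Each of the b - 1 long copies reads the interval that every length 1
    # copy writes, giving (b - 1) * b edges: Θ(L) = Θ(|C|²), since |C| = 2b - 1.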

Figure 3: Reference and version file that have O(|C|²) conflicts.

Lemma 1 For an input delta file encoding a version V of length L_V, the number of edges in the digraph generated to encode potential WR conflicts is less than or equal to L_V.

Proof: There is an edge encoding a potential WR conflict from copy command i to copy command j when

    [f_i, f_i + l_i - 1] ∩ [t_j, t_j + l_j - 1] ≠ ∅.    (3)

Copy command i has a read interval of length l_i. Recalling that the write intervals of all copy commands are disjoint, command i conflicts with a maximum of l_i other copy commands - this occurs when the region [f_i, f_i + l_i - 1] in the new version is encoded by l_i copy commands of length 1. We also know that, for any delta encoding, the sum of the lengths of all read intervals is less than or equal to L_V, as no encoding reads more symbols than it writes. As all read intervals sum to less than the length of the version file and no read interval may generate more edges than its length, the number of edges obeys |E| ≤ Σ_i l_i ≤ L_V, and so the digraph from a delta file encoding V has at most L_V edges. ∎

By bounding the number of edges in CRWI digraphs, we verify the performance bounds presented in §4.3.

7 Experimental Results

As we are interested in using in-place reconstruction methods to distribute software over the Internet, we extracted a large body of Internet available software and examined the performance of our algorithm on these files. They include multiple versions of the GNU tools and the BSD operating system distributions, among other data, with both binary and source files being compressed and permuted for in-place reconstruction. These data were examined with the goals of:

- determining the compression lost due to making delta files suitable for in-place reconstruction;

- comparing the relative compression performance of the constant time and locally minimum policies for breaking cycles; and

- showing the in-place conversion algorithms to be efficient when compared to delta compression algorithms on the same data.

Previous studies have indicated that delta compression algorithms can compact data for the transmission of software [1]. Delta compression algorithms compatible with in-place reconstruction compress a large body of distributed software by a factor of 4 to 10 and reduce the amount of time required to transmit these files over low bandwidth channels accordingly.

Over our sample software distributions, we found that the delta algorithm we used compressed data, on average, to 15.3% its original size and that after running our algorithm to make these files suitable for in-place reconstruction, we lost only 2.4% compression when breaking cycles using the locally minimum policy and 5.9% compression when breaking cycles in constant time.

Almost all of the lost compression can be attributed to an encoding inefficiency inherent to in-place reconstruction. We have described add commands (t, l) and copy commands (f, t, l), where both commands explicitly encode the t or write offset in the version file. However, delta algorithms that reconstruct data in order need not explicitly encode a write offset - an add command can simply be (l) and a copy command (f, l). Since commands are applied in write order, the write offset is implicitly defined by the end offset of the previous command. By adding write offsets, a delta compression algorithm loses 1.9% of its compression before modifying the delta file for in-place reconstruction.

                      Delta Compress      Delta Compress   In-Place Algorithm   In-Place Algorithm
                      (No Write Offsets)  (Write Offsets)  (Locally Minimum)    (Constant Time)
    Compression       15.3%               17.2%            17.7%                21.2%
    Encoding Loss     --                  1.9%             1.9%                 1.9%
    Loss from Cycles  --                  --               0.5%                 4.0%
    Total Loss        --                  1.9%             2.4%                 5.9%

Table 1: Compression performance of delta compression algorithms and in-place conversion algorithms. For in-place conversion, lost compression is partitioned into loss from encoding inefficiencies and loss from breaking cycles for in-place reconstructibility.

We see in Table 1 the difference between the delta compression algorithm run without write offsets, compressing input files to 15.3% their original size, and the same algorithm with write offsets, compressing input files to 17.2% their original size. Except for the codewords the algorithms use, they are identical, finding the same matching strings in the input files.

We adopt the format of the add and copy codewords directly from a standard differential compression algorithm [11, 1]. While this decision eases implementation, the codewords are poorly suited to in-place reconstruction. The encoding scheme uses only a single byte to encode the length of add commands and therefore generates many short add commands. For in-place reconstructibility, the add command includes a write offset and becomes much more expensive. The many small add commands produced by the delta compression algorithm create an unnecessary encoding overhead. A redesign of the delta compression codewords for in-place reconstructibility would further reduce lost compression.

Subtracting the encoding inefficiency, we observe that the algorithm loses an additional 4.0% compression when breaking cycles in constant time and 0.5% compression for the locally minimum policy (Table 1). The locally minimum policy reclaims nearly all lost compression with respect to the constant time policy and we will later see that, in practice, it incurs no additional run-time expense. While we cannot compare the compression performance of the locally minimum policy to a solution to the NP-hard global optimization problem, 0.5% bounds the amount of possible improvement on these files. Since this improvement is significantly less than the encoding overhead for write offsets (1.9%), we feel that the locally minimum policy performs extremely well in practice.

The delta compression algorithm used to create delta files [1] before permutation operates using time O(L_V + L_R) and space O(1) for reference file R of size L_R and version file V of size L_V. Our in-place conversion algorithm (O(n log n + L_V)) does not guarantee as small an asymptotic bound, since the delta file length n is O(L_V). Despite worst case analysis, we can claim in-place conversion to be efficient experimentally, with data indicating that generating an in-place reconstructible file takes significantly less time than generating the input delta file.

We compare the amount of time required to generate a delta file from an input reference and version file to the time used to create an in-place permutation of that delta file. Over all inputs, the in-place conversion algorithm completed in 56% the amount of total time used by the delta compression algorithm. The run-time of the in-place conversion algorithm only exceeded the delta compression run-time on 0.1% of all inputs and never took more than twice as much time.

The performance of the in-place conversion algorithm depends on the number of commands in the delta file. In practice, this value is significantly smaller than either the size of the reference and version file or the delta file itself.

Surprisingly, breaking cycles with the locally minimum policy has no apparent impact on the run-time performance of the algorithm. The time differences between these policies are insignificant as compared to the variations of system performance over many trials. While the performance difference is on average negligible, on some inputs the locally minimum policy runs noticeably slower. Infrequently, an input will contain many long cycles, and the locally minimum policy will create a slow down of up to 25% when compared to the constant time policy. However, the locally minimum policy recoups lost time on other inputs by encoding more compact delta files. Since the locally minimum policy converts smaller commands from copies to adds, it performs fewer and smaller I/O requests.

We determine experimentally that the locally minimum cycle breaking policy recovers nearly all the lost compression from breaking cycles that occurs with the constant time policy. Additionally, it does not increase the time required for the algorithm to complete. Therefore, locally minimum cycle breaking is the superior policy for every performance metric we have considered.

8 Conclusions

We have presented an algorithm that modifies delta files so that the encoded version may be reconstructed in the absence of scratch memory or storage space.

Such an algorithm facilitates the distribution of software to network attached devices over low bandwidth channels. Delta compression lessens the time required to transmit files over a network by compactly encoding the data to be transmitted. In-place reconstruction allows devices that do not have additional disk or memory storage resources to take advantage of delta compression technology.

The graph-theoretic algorithm to modify a delta file to be in-place reconstructible, rebuilt in the same storage the previous version occupies, rearranges the order of application of the commands in a delta file to create the new delta version. By rearranging the order of commands, data conflicts, where the delta file attempts to read from a region that it has already written, are avoided. Often, the algorithm finds an ordering of commands with no conflicts. Sometimes, the algorithm must convert data that the delta file encoded as a copy to an add command, placing that data wholly in the delta file. By converting copy commands to add commands, the algorithm trades a small degree of compression in order to achieve in-place reconstructibility.

Experimental results indicate that converting a delta file into an in-place reconstructible delta file has limited impact on compression, less than 2.5%, and that experimentally the algorithm to do so requires less time than the algorithm to generate the delta file in the first place.

In-place reconstructible delta file compression provides the benefits of delta compression for software distribution to a special class of applications - devices with limited storage and memory. In the current network computing environment, with the proliferation of network attached devices, this technology greatly decreases the time to distribute software without increasing the development cost or complexity of the receiving devices.

Acknowledgments

The authors wish to thank L. Stockmeyer for his contribution to the analysis and presentation of this algorithm, including the development of examples and the NP-hardness results concerning policies for breaking cycles. We also wish to thank R. Fagin and our anonymous reviewers who helped to focus the content of our presentation.

References

[1] M. Ajtai, R. Burns, R. Fagin, D. Long, and L. Stockmeyer. Compactly encoding arbitrary inputs with differential compression. IBM Research: In Preparation, 1998.

[2] G. Banga, F. Douglis, and M. Rabinovich. Optimistic deltas for WWW latency reduction. In Proceedings of the 1998 Usenix Technical Conference, 1998.

[3] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley Publishing Co., 1987.

[4] R. C. Burns and D. D. E. Long. Efficient distributed backup with delta compression. In Proceedings of the 1997 I/O in Parallel and Distributed Systems Workshop (IOPADS '97), San Jose, CA, USA, November 1997.

[5] R. C. Burns and D. D. E. Long. A linear time, constant space differencing algorithm. In Proceedings of the 1997 International Performance, Computing and Communications Conference (IPCCC '97), Tempe/Phoenix, Arizona, USA, February 1997.

[6] S. P. De Jong. Combining of changes to a source file. IBM Technical Disclosure Bulletin, 15(4):1186-1188, September 1972.

[7] R. M. Karp. Reducibility among combinatorial problems. In R. E. Miller and J. W. Thatcher, editors, Complexity of Computer Computations, pages 85-104. Plenum Press, 1972.

[8] D. E. Knuth. Fundamental Algorithms, volume 1 of The Art of Computer Programming. Addison-Wesley Publishing Co., 1968.

[9] W. Miller and E. W. Myers. A file comparison program. Software - Practice and Experience, 15(11):1025-1040, November 1985.

[10] J. C. Mogul, F. Douglis, A. Feldman, and B. Krishnamurthy. Potential benefits of delta encoding and data compression for HTTP. In Proceedings of ACM SIGCOMM '97, September 1997.

[11] C. Reichenberger. Delta storage for arbitrary non-text files. In Proceedings of the 3rd International Workshop on Software Configuration Management, Trondheim, Norway, 12-14 June 1991, pages 144-152. ACM, June 1991.

[12] M. J. Rochkind. The source code control system. IEEE Transactions on Software Engineering, SE-1(4):364-370, December 1975.

[13] D. G. Severance and G. M. Lohman. Differential files: Their application to the maintenance of large databases. ACM Transactions on Database Systems, 1(2):256-267, September 1976.

[14] W. F. Tichy. The string-to-string correction problem with block move. ACM Transactions on Computer Systems, 2(4), November 1984.

[15] W. F. Tichy. RCS - A system for version control. Software - Practice and Experience, 15(7):637-654, July 1985.
