Data Compression

DEBRA A. LELEWER and DANIEL S. HIRSCHBERG Department of Information and Computer Science, University of California, Irvine, California 92717

This paper surveys a variety of methods spanning almost 40 years of research, from the work of Shannon, Fano, and Huffman in the late 1940s to a technique developed in 1986. The aim of data compression is to reduce redundancy in stored or communicated data, thus increasing effective data density. Data compression has important application in the areas of file storage and distributed systems. Concepts from as they relate to the goals and evaluation of data compression methods are discussed briefly. A framework for evaluation and comparison of methods is constructed and applied to the algorithms presented. Comparisons of both theoretical and empirical natures are reported, and possibilities for future research are suggested

Categories and Subject Descriptors: E.4 [Data]: Coding and Information Theory-data compaction and compression General Terms: Algorithms, Theory Additional Key Words and Phrases: , adaptive Huffman , coding, coding theory, tile compression, Huffman codes, minimum-redundancy codes, optimal codes, prefix codes, text compression

INTRODUCTION mation but whose length is as small as possible. Data compression has important Data compression is often referred to as application in the areas of data transmis- coding, where coding is a general term en- sion and data storage. Many data process- compassing any special representation of ing applications require storage of large data that satisfies a given need. Informa- volumes of data, and the number of such tion theory is defined as the study of applications is constantly increasing as the efficient coding and its consequences in use of computers extends to new disci- the form of speed of transmission and plines. At the same time, the proliferation probability of error [ Ingels 19711.Data com- of computer communication networks is pression may be viewed as a branch of resulting in massive transfer of data over information theory in which the primary communication links. Compressing data to objective is to minimize the amount of data be stored or transmitted reduces storage to be transmitted. The purpose of this pa- and/or communication costs. When the per is to present and analyze a variety of amount of data to be transmitted is re- data compression algorithms. duced, the effect is that of increasing the A simple characterization of data capacity of the communication channel. compression is that it involves transform- Similarly, compressing a file to half of its ing a string of characters in some represen- original size is equivalent to doubling the tation (such as ASCII) into a new string capacity of the storage medium. It may then (e.g., of ) that contains the same infor- become feasible to store the data at a

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. 0 1966 ACM 0360-0300/87/0900-0261$1.50

ACM Computing SUWSYS,Vol. 19, No. 3, September 1987 262 . D. A. Lelewer and D. S. Hirschberg CONTENTS been reported to reduce a file to anywhere from 12.1 to 73.5% of its original size [Wit- ten et al. 19871. Cormack reports that data compression programs based on Huffman INTRODUCTION coding (Section 3.2) reduced the size of a 1. FUNDAMENTAL CONCEPTS large student-record database by 42.1% 1.1 Definitions when only some of the information was 1.2 Classification of Methods 1.3 A Data Compression Model compressed. As a consequence of this size 1.4 Motivation reduction, the number of disk operations 2. SEMANTIC DEPENDENT METHODS required to load the database was reduced 3. STATIC DEFINED-WORD SCHEMES by 32.7% [Cormack 19851. Data com- 3.1 Shannon-Fan0 pression routines developed with specific 3.2 Static 3.3 Universal Codes and Representations of the applications in mind have achieved com- Integers pression factors as high as 98% [Severance 3.4 19831. 4. Although coding for purposes of data se- 4.1 Algorithm FGK 4.2 Algorithm V curity (cryptography) and codes that guar- 5. OTHER ADAPTIVE METHODS antee a certain level of data integrity (error 5.1 Lempel-Ziv Codes detection/correction) are topics worthy of 5.2 Algorithm BSTW attention, they do not fall under the 6. EMPIRICAL RESULTS umbrella of data compression. With the 7. SUSCEPTIBILITY TO ERROR 7.1 Static Codes exception of a brief discussion of the sus- 7.2 Adaptive Codes ceptibility to error of the methods surveyed 8. NEW DIRECTIONS (Section 7), a discrete noiseless channel is 9. SUMMARY assumed. That is, we assume a system in REFERENCES which a sequence of symbols chosen from a finite alphabet can be transmitted from one point to another without the possibility of error. Of course, the coding schemes higher, thus faster, level of the storage hi- described here may be combined with data erarchy and reduce the load on the input/ security or error-correcting codes. output channels of the computer system. Much of the available literature on data Many of the methods discussed in this compression approaches the topic from the paper are implemented in production point of view of data transmission. As noted systems. The UNIX1 utilities compact earlier, data compression is of value in data and compress are based on methods dis- storage as well. Although this discussion is cussed in Sections 4 and 5, respectively framed in the terminology of data trans- [UNIX 19841.Popular file archival systems mission, compression and decompression such as ARC and PKARC use techniques of data files are essentially the same tasks presented in Sections 3 and 5 [ARC 1986; as sending and receiving data over a com- PKARC 19871. The savings achieved by munication channel. The focus of this data compression can be dramatic; reduc- paper is on algorithms for data compres- tion as high as 80% is not uncommon sion; it does not deal with hardware aspects [Reghbati 19811. Typical values of com- of data transmission. The reader is referred pression provided by compact are text to Cappellini [1985] for a discussion of (38%), Pascal source (43%), C source techniques with natural hardware imple- (36%), and binary (19%). Compress gener- mentation. ally achieves better compression (50-60% Background concepts in the form of ter- for text such as source code and English) minology and a model for the study of data and takes less time to compute [UNIX compression are provided in Section 1. Ap- 19841. 
Arithmetic coding (Section 3.4) has plications of data compression are also dis- cussed in Section 1 to provide motivation 1 UNIX is a trademark of AT&T Bell Laboratories. for the material that follows.

ACM Computing Surveys, Vol. 19, No. 3, September 1987 Data Compression l 263

Although the primary focus of this survey Source message Codeword is data compression methods of general a 000 utility, Section 2 includes examples from b 001 the literature in which ingenuity applied to ii 011010 domain-specific problems has yielded inter- esting coding techniques. These techniques are referred to as semantic dependent since 7 100101 they are designed to exploit the context g 110 and semantics of the data to achieve re- space 111 dundancy reduction. Semantic-dependent Figure 1. A block-block code for EXAMPLE. techniques include the use of quadtrees, run-length encoding, or difference mapping for storage and transmission of image data Source message Codeword [Gonzalez and Wintz 1977; Samet 19841. General-purpose techniques, which as- ;ib 01 sume no knowledge of the information cccc 10 content of the data, are described in ddddd 11 Sections 3-5. These descriptions are suffi- eeeeee 100 ciently detailed to provide an understand- fffffff 101 ing of the techniques. The reader will need &%%Nmz! 110 space 111 to consult the references for implementa- tion details. In most cases only worst-case Figure 2. A variable-variable code for EXAMPLE. analyses of the methods are feasible. To provide a more realistic picture of their effectiveness, empirical data are presented be (0,l). Codes can be categorized as block- in Section 6. The susceptibility to error of block, block-variable, variable-block, or the algorithms surveyed is discussed in Sec- variable-variable, where block-block in- tion 7, and possible directions for future dicates that the source messages and research are considered in Section 8. codewords are of fixed length and variable- variable codes map variable-length source messages into variable-length codewords. A 1. FUNDAMENTAL CONCEPTS block-block code for EXAMPLE is shown A brief introduction to information theory in Figure 1, and a variable-variable code is is provided in this section. The definitions given in Figure 2. If the string EXAMPLE and assumptions necessary to a compre- were coded using the Figure 1 code, the hensive discussion and evaluation of data length of the coded message would be 120; compression methods are discussed. The using Figure 2 the length would be 30. following string of characters is used to The oldest and most widely used codes, illustrate the concepts defined: EXAMPLE ASCII and EBCDIC, are examples of = “au bbb cccc ddddd eeeeeefffffffgggggggg”. block-block codes, mapping an alphabet of 64 (or 256) single characters onto 6- (or 8-bit) codewords. These are not discussed, 1.1 Definitions since they do not provide compression. A code is a mapping of source messages The codes featured in this survey are of (words from the source alphabet a) into the block-variable, variable-variable, and codewords (words of the code alphabet 0). variable-block types. The source messages are the basic units When source messages of variable length into which the string to be represented is are allowed, the question of how a mes- partitioned. These basic units may be single sage ensemble (sequence of messages) is symbols from the source alphabet, or they parsed into individual messages arises. may be strings of symbols. For string Many of the algorithms described here are EXAMPLE, cy = (a, b, c, d, e, f, g, space). defined-word schemes. That is, the set of For purposes of explanation, /3 is taken to source messages is determined before the

ACM Computing Surveys, Vol. 19, No. 3, September 1987 264 . D. A. Lelewer and D. S. Hirschberg invocation of the coding scheme. For u in ,B. The set of codewords (00, 01, 10) is example, in text file processing, each an example of a that is not character may constitute a message, or minimal. The fact that 1 is a proper prefix messages may be defined to consist of the codeword 10 requires that 11 be of alphanumeric and nonalphanumeric either a codeword or a proper prefix of a strings. In Pascal source code, each token codeword, and it is neither. Intuitively, the may represent a message. All codes involv- minimality constraint prevents the use of ing fixed-length source messages are, by codewords that are longer than necessary. default, defined-word codes. In free-parse In the above example the codeword 10 could methods, the coding algorithm itself parses be replaced by the codeword 1, yielding a the ensemble into variable-length se- minimal prefix code with shorter code- quences of symbols. Most of the known words. The codes discussed in this paper data compression methods are defined- are all minimal prefix codes. word schemes; the free-parse model differs In this section a code has been defined in a fundamental way from the classical to be a mapping from a source alphabet coding paradigm. to a code alphabet; we now define related A code is distinct if each codeword is terms. The process of transforming a source distinguishable from every other (i.e., the ensemble into a coded message is coding or mapping from source messages to code- encoding. The encoded message may be re- words is one to one). A distinct code is ferred to as an encoding of the source en- uniquely decodable if every codeword is semble. The algorithm that constructs the identifiable when immersed in a sequence mapping and uses it to transform the source of codewords. Clearly, each of these fea- ensemble is called the encoder. The decoder tures is desirable. The codes of Figures 1 performs the inverse operation, restoring and 2 are both distinct, but the code of the coded message to its original form. Figure 2 is not uniquely decodable. For example, the coded message 11 could be 1.2 Classification of Methods decoded as either “ddddd” or “bbbbbb”. A uniquely decodable code is a prefix code (or Not only are data compression schemes prefix-free code) if it has the prefix prop- categorized with respect to message and erty, which requires that no codeword be a codeword lengths, but they are also classi- proper prefix of any other codeword. All fied as either static or dynamic. A static uniquely decodable block-block and vari- method is one in which the mapping from able-block codes are prefix codes. The code the set of messages to the set of codewords with codewords (1, 100000, 00) is an ex- is fixed before transmission begins, so that ample of a code that is uniquely decodable a given message is represented by the same but that does not have the prefix property. codeword every time it appears in the mes- Prefix codes are instantaneously decodable; sage ensemble. The classic static defined- that is, they have the desirable property word scheme is Huffman coding [Huffman that the coded message can be parsed into 19521. In Huffman coding, the assignment codewords without the need for lookahead. of codewords to source messages is based In order to decode a message encoded using on the probabilities with which the source the codeword set (1, 100000, 001, lookahead messages appear in the message ensemble. is required. 
For example, the first codeword Messages that appear frequently are rep- of the message 1000000001 is 1, but this resented by short codewords; messages with cannot be determined until the last (tenth) smaller probabilities map to longer code- symbol of the message is read (if the string words. These probabilities are determined of zeros had been of odd length, the first before transmission begins. A Huffman codeword would have been 100000). code for the ensemble EXAMPLE is given A minimal prefix code is a prefix code in Figure 3. If EXAMPLE were coded using such that, if x is a proper prefix of some this Huffman mapping, the length of the codeword, then xu is either a codeword or coded message would be 117. Static Huff- a proper prefix of a codeword for each letter man coding is discussed in Section 3.2;

ACM Computing Surveys, Vol. 19, No. 3, September 1987 Data Compression l 265

Source message Probability Codeword Source message Probability Codeword 1001 a 216 10 ; 21403140 1000 b 3/6 0 011 space l/6 11 i 41405/40 010 e 6140 111 Figure 4. A dynamic Huffman code table for the f 7/40 110 prefix “an bbb” of message EXAMPLE. g a/40 00 space 5140 101 common in a wide variety of text types, for Figure 3. A Huffman code for the message EXAM- a particular word to occur frequently for PLE (code length = 117). short periods of time and then fall into disuse for long periods. All of the adaptive methods are one-pass other static schemes are discussed in Sec- methods; only one scan of the ensemble is tions 2 and 3. required. Static Huffman coding requires A code is dynamic if the mapping from two passes: one pass to compute probabili- the set of messages to the set of codewords ties and determine the mapping, and a sec- changes over time. For example, dynamic ond pass for transmission. Thus, as long as Huffman coding involves computing an ap- the encoding and decoding times of an proximation to the probabilities of occur- adaptive method are not substantially rence “on the fly,” as the ensemble is being greater than those of a static method, the transmitted. The assignment of codewords fact that an initial scan is not needed im- to messages is based on the values of the plies a speed improvement in the adaptive relative frequencies of occurrence at each case. In addition, the mapping determined point in time. A message x may be repre- in ,the first pass of a static coding scheme sented by a short codeword early in the must be transmitted by the encoder to the transmission because it occurs frequently decoder. The mapping may preface each at the beginning of the ensemble, even transmission (i.e., each file sent), or a single though its probability of occurrence over mapping may be agreed upon and used for the total ensemble is low. Later, when the multiple transmissions. In one-pass meth- more probable messages begin to occur with ods the encoder defines and redefines the higher frequency, the short codeword will mapping dynamically during transmission. be mapped to one of the higher probability The decoder must define and redefine the messages, and x will be mapped to a longer mapping in sympathy, in essence “learn- codeword. As an illustration, Figure 4 ing” the mapping as codewords are re- presents a dynamic Huffman code table ceived. Adaptive methods are discussed in corresponding to the prefix “aa bbb ” of Sections 4 and 5. EXAMPLE. Although the frequency of An algorithm may also be a hybrid, space over the entire message is greater neither completely static nor completely than that of b, at this point b has higher dynamic. In a simple hybrid scheme, sender frequency and therefore is mapped to the and receiver maintain identical codebooks shorter codeword. containing lz static codes. For each trans- Dynamic codes are also referred to in the mission, the sender must choose one of the literature as adaptive, in that they adapt to k previously agreed upon codes and inform changes in ensemble characteristics over the receiver of the choice (by transmitting time, The term adaptive is used for the first the “name” or number of the chosen remainder of this paper; the fact that these code). Hybrid methods are discussed fur- codes adapt to changing characteristics is ther in Sections 2 and 3.2. the source of their appeal. 
Some adaptive methods adapt to changing patterns in the 1.3 A Data Compression Model source [Welch 19841, whereas others ex- ploit locality of reference [Bentley et al. In order to discuss the relative merits of 19861. Locality of reference is the tendency, data compression techniques, a framework

ACM Computing Surveys, Vol. 19, No. 3, September 1987 266 . D. A. Lelewer and D. S. Hirschberg for comparison must be established. There code symbols are of different durations. are two dimensions along which each of The assumption is also important, since the schemes discussed here may be mea- the problem of constructing optimal sured: algorithm complexity and amount of codes over unequal code letter costs is compression. When data compression is a significantly different and more diffi- used in a data transmission application, the cult problem. Perl et al. [1975] and Varn goal is speed. Speed of transmission de- [1971] have developed algorithms for min- pends on the number of bits sent, the time imum-redundancy prefix coding in the case required for the encoder to generate the of arbitrary symbol cost and equal code- coded message, and the time required for word probability. The assumption of equal the decoder to recover the original ensem- probabilities mitigates the difficulty pre- ble. In a data storage application, although sented by the variable symbol cost. For the the degree of compression is the primary more general unequal letter costs and un- concern, it is nonetheless necessary that equal probabilities model, Karp [1961] has the algorithm be efficient in order for the proposed an integer linear programming scheme to be practical. For a static scheme, approach. There have been several approx- there are three algorithms to analyze: the imation algorithms proposed for this more map construction algorithm, the encoding difficult problem [Krause 1962; Cot 1977; algorithm, and the decoding algorithm. For Mehlhorn 19801. a dynamic scheme, there are just two algo- When data are compressed, the goal is rithms: the encoding algorithm and the to reduce redundancy, leaving only the decoding algorithm. informational content. The measure of Several common measures of compres- information of a source message oi (in sion have been suggested: redundancy bits) is -log p(ai), where log denotes the [Shannon and Weaver 19491, average mes- base 2 logarithm and p(ai) denotes the sage length [Huffman 19521, and compres- probability of occurrence of the message sion ratio [Rubin 1976; Ruth and Kreutzer oi.2 This definition has intuitive appeal; in 19721. These measures are defined below. the case in which p(ai) = 1, it is clear that Related to each of these measures are ai is not at all informative since it had to assumptions about the characteristics occur. Similarly, the smaller the value of of the source. It is generally assumed in p(ai), the more unlikely ai is to appear, and information theory that all statistical hence the larger its information content. parameters of a message source are known The reader is referred to Abramson [1963, with perfect accuracy [Gilbert 19711. The pp. 6-131 for a longer, more elegant discus- most common model is that of a discrete sion of the legitimacy of this technical memoryless source; a source whose output definition of the concept of information. is a sequence of letters (or messages), each The average information content over the letter being a selection from some fixed source alphabet can be computed by alphabet al, - - - , a,,. The letters are taken weighting the information content of each to be random, statistically independent se- source letter by its probability of occur- lections from the alphabet, the selection rence, yielding the expression xi”=1 being made according to some fixed prob- [-p(oi)log p(ai)]. 
This quantity is referred ability assignment p(al), . . ., p(a,) [Gal- to as the entropy of a source letter or the lager 19681. Without loss of generality, the entropy of the source and is denoted by H. code alphabet is assumed to be (0, 1) Since the length of a codeword for message throughout this paper. The modifications oi must be sufficient to carry the informa- necessary for larger code alphabets are tion content of oi, entropy imposes a lower straightforward. bound on the number of bits required for We assume that any cost associated the coded message. The total number of with the code letters is uniform. This is a reasonable assumption, although it omits * Note that throughout this paper all logarithms are applications like telegraphy where the to the base 2.

ACM Computing Surveys, Vol. 19, No. 3, September 1987 Data Compression l 267 bits must be at least as large as the product methods for dealing with local redundancy of H and the length of the source ensemble. are discussed in Sections 2 and 6.2. Since the value of H is generally not an Huffman uses average message length, integer, variable-length codewords must be 2 p(ai)Zi, as a measure of the efficiency of used if the lower bound is to be achieved. a code. Clearly, the meaning of this term is Given that message EXAMPLE is to be the average length of a coded message. We encoded one letter at a time, the entropy of use the term average codeword length to its source can be calculated using the prob- represent this quantity. Since redundancy abilities given in Figure 3: H = 2.894, so is defined to be average codeword length that the minimum number of bits con- minus entropy and entropy is constant tained in an encoding of EXAMPLE is 116. for a given probability distribution, mini- The Huffman code given in Section 1.2 does mizing average codeword length minimizes not quite achieve the theoretical minimum redundancy. in this case. A code is asymptotically optimal if it Both of these definitions of information has the property that for a given probability content are due to Shannon [ 19491. A der- distribution, the ratio of average codeword ivation of the concept of entropy as it re- length to entropy approaches 1 as entropy lates to information theory is presented by tends to infinity. That is, asymptotic opti- Shannon [1949]. A simpler, more intuitive mality guarantees that average codeword explanation of entropy is offered by Ash length approaches the theoretical mini- [1965]. mum (entropy represents information The most common notion of a “good” content, which imposes a lower bound on code is one that is optimal in the sense of codeword length). having minimum redundancy. Redundancy The amount of compression yielded by a can be defined as c p(ai)Zi - 2 [-p(ai)log coding scheme can be measured by a p(oi)], where li is the length of the codeword compression ratio. Compression ratio has representing message oi. The expression been defined in several ways. The defini- z p(ai)Zi represents the lengths of the code- tion C = (average message length)/(average words weighted by their probabilities of codeword length) captures the common occurrence, that is, the average codeword meaning, which is a comparison of the length. The expression z [-p(ai)log p(oi)] length of the coded message to the length is entropy H. Thus, redundancy is a mea- of the original ensemble [Cappellini 19851. sure of the difference between average If we think of the characters of the ensem- codeword length and average information ble EXAMPLE as 6-bit ASCII characters, content. If a code has minimum average then the average message length is 6 bits. codeword length for a given discrete prob- The Huffman code of Section 1.2 repre- ability distribution, it is said to be a mini- sents EXAMPLE in 117 bits, or 2.9 bits mum redundancy code. per character. This yields a compression We define the term local redundancy to ratio of 612.9, representing compression by capture the notion of redundancy caused a factor of more than 2. Alternatively, we by local properties of a message ensemble, may say that Huffman encoding produces rather than its global characteristics. 
Al- a file whose size is 49% of the original though the model used for analyzing ASCII file, or that 49% compression has general-purpose coding techniques assumes been achieved. a random distribution of the source mes- A somewhat different definition of sages, this may not actually be the case. In compression ratio by Rubin [1976], C = particular applications the tendency for (S - 0 - OR)/S, includes the representa- messages to cluster in predictable patterns tion of the code itself in the transmission may be known. The existence of predictable cost. In this definition, S represents the patterns may be exploited to minimize local length of the source ensemble, 0 the length redundancy. Examples of applications in of the output (coded message), and OR the which local redundancy is common and size of the “output representation” (e.g., the

ACM Computing Surveys, Vol. 19, No. 3, September 1987 260 l D. A. Lelewer and D. S. Hirschberg number of bits required for the encoder to complex. For these reasons, one could ex- transmit the code mapping to the decoder). pect to see even greater use of variable- The quantity OR constitutes a “charge* to length coding in the future. an algorithm for transmission of informa- It is interesting to note that the Huffman tion about the coding scheme. The inten- coding algorithm, originally developed for tion is to measure the total size of the the efficient transmission of data, also has transmission (or file to be stored). a wide variety of applications outside the sphere of data compression. These include construction of optimal search trees [Zim- 1.4 Motivation merman 1959; Hu and Tucker 1971; Itai As discussed in the Introduction, data 19761, list merging [Brent and Kung 19781, compression has wide application in terms and generating optimal evaluation trees in of information storage, including represen- the compilation of expressions [Parker tation of the abstract data type string 19801. Additional applications involve [Standish 19801 and file compression. search for jumps in a monotone function of Huffman coding is used for compression in a single variable, sources of pollution along several file archival systems [ARC 1986; a river, and leaks in a pipeline [Glassey and PKARC 19871, as is Lempel-Ziv coding, Karp 19761.The fact that this elegant com- one of the adaptive schemes to be discussed binatorial algorithm has influenced so in Section 5. An adaptive Huffman coding many diverse areas underscores its impor- technique is the basis for the compact com- tance. mand of the UNIX operating system, and the UNIX compress utility employs the 2. SEMANTIC-DEPENDENT METHODS Lempel-Ziv approach [UNIX 19841. In the area of data transmission, Huff- Semantic-dependent data compression man coding has been passed over for years techniques are designed to respond to spe- in favor of block-block codes, notably cific types of local redundancy occurring in ASCII. The advantage of Huffman coding certain applications. One area in which is in the average number of bits per char- data compression is of great importance acter transmitted, which may be much is image representation and processing. smaller than the log n bits per character There are two major reasons for this. The (where n is the source alphabet size) of a first is that digitized images contain a large block-block system. The primary difficulty amount of local redundancy. An image is associated with variable-length codewords usually captured in the form of an array of is that the rate at which bits are presented pixels, whereas methods that exploit the to the transmission channel will fluctuate, tendency for pixels of like color or intensity depending on the relative frequencies of the to cluster together may-be more efficient. source messages. This fluctuation requires The second reason for the abundance of buffering between the source and the chan- research in this area is volume. Digital im- nel. Advances in technology have both ages usually require a very large number of overcome this difficulty and contributed to bits, and many uses of digital images in- the appeal of variable-length codes. Cur- volve large collections of images. rent data networks allocate communication One technique used for compression of resources to sources on the basis of need image data is run-length encoding. In a and provide buffering as part of the system. 
common version of run-length encoding, These systems require significant amounts the sequence of image elements along a of protocol, and fixed-length codes are quite scan line (row) x1, x2, . . . , xn is mapped inefficient for applications such as packet into a sequence of pairs (cl, II), (cp, f$), . . . headers. In addition, communication costs (ck, lh), where ci represents an intensity or are beginning to dominate storage and color and li the length of the ith run (se- processing costs, so that variable-length quence of pixels of equal intensity). For coding schemes that reduce communication pictures such as weather maps, run-length costs are attractive even if they are more encoding can save a significant number of

ACM Computing Surveys, Vol. 19, No. 3, September 1987 Data Compression 9 269 bits over the image element sequence [Gon- with database files. The method, which is zalez and Wintz 19771. part of IBM’s Information Management Another data compression technique System (IMS), compresses individual rec- specific to the area of image data is differ- ords and is invoked each time a record is ence mapping, in which the image is rep- stored in the database tile; expansion is resented as an array of differences in performed each time a record is retrieved. brightness (or color) between adjacent Since records may be retrieved in any order, pixels rather than the brightness values context information used by the compres- themselves. Difference mapping was used sion routine is limited to a single record. In to encode the pictures of Uranus transmit- order for the routine to be applicable to any ted by Voyager 2. The 8 bits per pixel database, it must be able to adapt to the needed to represent 256 brightness levels format of the record. The fact that database was reduced to an average of 3 bits per pixel records are usually heterogeneous collec- when difference values were transmitted tions of small fields indicates that the local [Laeser et al. 19861. In spacecraft applica- properties of the data are more impor- tions, image fidelity is a major concern tant than their global characteristics. The because of the effect of the distance from compression routine in IMS is a hybrid the spacecraft to Earth on transmission method that attacks this local redundancy reliability. Difference mapping was com- by using different coding schemes for dif- bined with error-correcting codes to pro- ferent types of fields. The identified field vide both compression and data integrity types in IMS are letters of the alphabet, in the Voyager 2 project. Another method numeric digits, packed digit pairs, that takes advantage of the tendency for blank, and other. When compression be- images to contain large areas of constant gins, a default code is used to encode the intensity is the use of the quadtree data fust character of the record For each structure [Samet 19841. Additional exam- subsequent character, the type of the pre- ples of coding techniques used in image vious character determines the code to processing can be found in Wilkins and be used. For example, if the record Wintz [1971] and in Cappellini [1985]. “01870bABCD bb LMN” were encoded with Data compression is of interest in busi- the letter code as default, the leading zero ness data processing, because of both the would be coded using the letter code; the 1, cost savings it offers and the large volume 8, 7, 0 and the first blank (b) would be of data manipulated in many business ap- coded by the numeric code. The A would be plications. The types of local redundancy coded by the blank code; B, C, D, and the present in business data files include runs next blank by the letter code; the next blank of zeros in numeric fields, sequences of and the L by the blank code; and the M and blanks in alphanumeric fields, and fields N by the letter code. Clearly, each code that are present in some records and null must define a codeword for every character; in others. Run-length encoding can be used the letter code would assign the shortest to compress sequences of zeros or blanks. codewords to letters, the numeric code Null suppression can be accomplished would favor the digits, and so on. 
In the through the use of presence bits [Ruth and system Cormack describes, the types of the Kreutzer 19721. Another class of methods characters are stored in the encode/decode exploits cases in which only a limited set of data structures. When a character c is re- attribute values exist. Dictionary substitu- ceived, the decoder checks type(c) to detect tion entails replacing alphanumeric repre- which code table will be used in transmit- sentations of information such as bank ting the next character. The compression account type, insurance policy type, sex, algorithm might be more efficient if a spe- and month by the few bits necessary to cial bit string were used to alert the receiver represent the limited number of possible to a change in code table. Particularly attribute values [Reghbati 19811. if fields were reasonably long, decoding Cormack [1985] describes a data would be more rapid and the extra bits in compression system that is designed for use the transmission would not be excessive.

ACM Computing Surveys, Vol. 19, No. 3, September 1987 270 l D. A. Ldewer and D. S. Hirschberg

Cormack reports that the performance of believe that Huffman coding cannot be im- the IMS compression routines is very good; proved upon; that is, that it is guaranteed at least 50 sites are currently using the to achieve the best possible compression system. He cites a case of a database con- ratio. This is only true, however, under the taining student records whose size was constraints that each source message is reduced by 42.1%, and as a side effect mapped to a unique codeword and that the the number of disk operations required to compressed text is the concatenation of the load the database was reduced by 32.7% codewords for the source messages. An ear- [ Cormack 19851. lier algorithm, due independently to Shan- A variety of approaches to data compres- non and Fano [Shannon and Weaver 1949; sion designed with text tiles in mind in- Fano 19491, is not guaranteed to provide cludes use of a dictionary representing optimal codes but approaches optimal be- either all of the words in the file so that the havior as the number of messages ap- file itself is coded as a list of pointers to the proaches infinity. The Huffman algorithm dictionary [Hahn 19741, or common words is also of importance because it has pro- and word endings so that the file consists vided a foundation upon which other data of pointers to the dictionary and encodings compression techniques have been built of the less common words [Tropper 19821. and a benchmark to which they may be Hand selection of common phrases [Wag- compared. We classify the codes generated ner 19731, programmed selection of prefixes by the Huffman and Shannon-Fan0 algo- and suffixes [Fraenkel et al. 19831, and rithms as variable-variable and note that programmed selection of common charac- they include block-variable codes as a spe- ter pairs [Cortesi 1982; Snyderman and cial case, depending on how the source mes- Hunt 19701 have also been investigated. sages are defined. Shannon-Fan0 coding is This discussion of semantic-dependent discussed in Section 3.1; Huffman coding data compression techniques represents a in Section 3.2. limited sample of a very large body of re- In Section 3.3 codes that map the inte- search. These methods and others of a like gers onto binary codewords are discussed. nature are interesting and of great value in Since any finite alphabet may be enumer- their intended domains. Their obvious ated, this type of code has general-purpose drawback lies in their limited utility. It utility. However, a more common use of should be noted, however, that much of the these codes (called universal codes) is in efficiency gained through the use of seman- conjunction with an adaptive scheme. This tic-dependent techniques can be achieved connection is discussed in Section 5.2. through more general methods, albeit to a Arithmetic coding, presented in Sec- lesser degree. For example, the dictionary tion 3.4, takes a significantly different ap- approaches can be implemented through proach to data compression from that of either Huffman coding (Sections 3.2 and 4) the other static methods. It does not con- or Lempel-Ziv codes (Section 5.1). Cor- struct a code, in the sense of a mapping mack’s database scheme is a special case of from source messages to codewords. In- the codebook approach (Section 3.2), and stead, arithmetic coding replaces the source run-length encoding is one of the effects of ensemble by a code string that, unlike all Lempel-Ziv codes. 
the other codes discussed here, is not the concatenation of codewords corresponding to individual source messages. Arithmetic 3. STATIC DEFINED-WORDS SCHEMES coding is capable of achieving compression results that are arbitrarily close to the en- The classic defined-word scheme was tropy of the source. developed over 30 years ago in Huff- man’s well-known paper on minimum- 3.1 Shannon-Fan0 Coding redundancy coding [ Huffman 19521. Huffman’s algorithm provided the first The Shannon-Fan0 technique has as an solution to the problem of constructing advantage its simplicity. The code is con- minimum-redundancy codes. Many people structed as follows: The source messages ai

ACM Computing Surveys, Vol. 19, No. 3, September 1987 Data Compression l 271

al 112 0 de” 1 is the same as that achieved by the Huff- man code (see Figure 3). That the Shan- non-Fan0 algorithm is not guaranteed to a3 118 110 .&.“l produce an optimal code is demonstrated l/16 1110 step1 by the following set of probabilities: (.35, .17, .17, .16, .15,). The Shannon-Fan0 code a6 l/32 11110 “ten5 for this distribution is compared with the aG l/32 11111 Huffman code in Section 3.2.

Figure 5. A Shannon-Fan0 code. 3.2 Static Huffman Coding Huffman’s algorithm, expressed graphi- g a/40 00 step2 cally, takes as input a list of nonnegative f 7/40 010 h-p3 weights (wi, . . . , w,J and constructs a full binary tree3 whose leaves are labeled with e 6140 011 step1 the weights. When the Huffman algorithm d 5140 100 Sal6 is used to construct a code, the weights space 5140 101 represent the probabilities associated with .tep1 the source letters. Initially, there is a set of c 4140 110 .tsnR singleton trees, one for each weight in the b 3140 1110 list. At each step in the algorithm the trees step7 corresponding to the two smallest weights, a 2/40 1111 wi and wj, are merged into a new tree whose Figure6. A Shannon-Fan0 code for EXAMPLE weight is wi + Wj and whose root has two (code length = 117). children that are the subtrees represented by wi and Wj. The weights Wi and Wj are removed from the list, and wi + wj is in- and their probabilities p(ai) are listed in serted into the list. This process continues order of nonincreasing probability. This list until the weight list contains a single value. is then divided in such a way as to form If, at any time, there is more than one way two groups of as nearly equal total proba- to choose a smallest pair of weights, any bilities as possible. Each message in the such pair may be chosen. In Huffman’s first group receives 0 as the first digit of its paper the process begins with a nonincreas- codeword; the messages in the second half ing list of weights. This detail is not impor- have codewords beginning with 1. Each of tant to the correctness of the algorithm, these groups is then divided according to but it does provide a more efficient imple- the same criterion, and additional code dig- mentation [Huffman 19521. The Huffman its are appended. The process is continued algorithm is demonstrated in Figure 7. until each subset contains only one mes- The Huffman algorithm determines the sage. Clearly, the Shannon-Fan0 algorithm lengths of the codewords to be mapped to yields a minimal prefix code. each of the source letters oi. There are Figure 5 shows the application of the many alternatives for specifying the actual method to a particularly simple probability digits; it is necessary only that the code distribution. The length of each codeword have the prefix property. The usual assign- is equal to -log p(ai). This is true as long ment entails labeling the edge from each as it is possible to divide the list into parent to its left child with the digit 0 and subgroups of exactly equal probability. the edge to the right child with 1. The When this is not possible, some code- codeword for each source letter is the se- words may be of length -logp(ai) + 1. The quence of labels along the path from the Shannon-Fan0 algorithm yields an average root to the leaf node representing that codeword length S that satisfies H 5 S 5 letter. The codewords for the source of H + 1. In Figure 6, the Shannon-Fan0 code for ensemble EXAMPLE is given. As is 3 A binary tree is full if every node has either zero or often the case, the average codeword length two children.

ACM Computing Surveys, Vol. 19, No. 3, September 1987 272 8 D. A. Ldewer and D. S. Hirschberg a1 .25 .25 .25 .33 A2 S-F Huffman az .20 .20 .22 .25 .33 :3’.” 0.35 00 1 a3 .15 .18 .20 .22 .25 a1 0.17 01 011 a .12 .15 .18 .20 az 0.17 10 010 as .lO .12 .15JJ a3 0.16 110 001 as .lO .lO a4 a7 .080 a5 0.15 111 000 (a) Average codeword length 2.31 2.30

Figure 8. Comparison of Shannon-Fan0 and Huff- man codes.

3) both yield the same average codeword length for a source with probabilities (.4, -2, .2, -1, .l). Schwartz [1964] defines a varia- tion of the Huffman algorithm that per- forms “bottom merging”, that is, that orders a new parent node above existing nodes of the same weight and always merges the last two weights in the list. The code constructed is the Huffman code with minimum values of maximum codeword length (max( &1) and total codeword length (b) (c li). Schwartz and Kallick [1964] describe Figure 7. The Huffman process: (a) The list; (b) the an implementation of Huffman’s algorithm tree. with bottom merging. The Schwartz- Kallick algorithm and a later algorithm by Connell [ 19731use Huffman’s procedure Figure 7, in order of decreasing probability, to determine the lengths of the codewords, are (01, 11, 001, 100, 101, 0000, 0001). and actual digits are assigned so that the Clearly, this process yields a minimal prefix code has the numerical sequence property; code. Further, the algorithm is guaranteed that is, codewords of equal length form a to produce an optimal (minimum redun- consecutive sequence of binary numbers. dancy) code [Huffman 19521. Gallager Shannon-Fan0 codes also have the numer- [1978] has proved an upper bound on the ical sequence property. This property can redundancy of a Huffman code of pn + be exploited to achieve a compact represen- log[(2 log e)/e] = p,, f 0.086, where pn is tation of the code and rapid encoding and the probability of the least likely source decoding. message. In a recent paper, Capocelli et al. Both the Huffman and the Shannon- [1986] have provided new bounds that are Fano mappings can be generated in O(n) tighter than those of Gallager for some time, where n is the number of messages in probability distributions. Figure 8 shows a the source ensemble (assuming that the distribution for which the Huffman code is weights have been presorted). Each of optimal while the Shannon-Fan0 code is these algorithms maps a source message ai not. with probability p to a codeword of length In addition to the fact that there are 1 (-log p I 1 I - log p + 1). Encoding and many ways of forming codewords of appro- decoding times depend on the representa- priate lengths, there are cases in which the tion of the mapping. If the mapping is Huffman algorithm does not uniquely stored as a binary tree, then decoding the determine these lengths owing to the codeword for ai involves following a path of arbitrary choice among equal minimum length 1 in the tree. A table indexed by the weights. As an example, codes with code- source messagescould be used for encoding; word lengths of (1,2, 3,4,41 and ]2,2, 2, 3, the code for ai would be stored in position

ACM Computing Surveys, Vol. 19, No. 3, September 1987 Data Compression l 273 i of the table and encoding time would be he or she is using. This requires only log k O(Z). Connell’s algorithm makes use of the bits of overhead. Assuming that classes of index of the Huffman code, a representa- transmission with relatively stable charac- tion of the distribution of codeword lengths, teristics could be identified, this hybrid to encode and decode in O(c) time, where c approach would greatly reduce the redun- is the number of different codeword dancy due to overhead without significantly lengths. Tanaka [1987] presents an imple- increasing expected codeword length. In ad- mentation of Huffman coding based on dition, the cost of computing the mapping finite-state machines that can be realized would be amortized over all files of a given efficiently in either hardware or software. class. That is, the mapping would be com- As noted earlier, the redundancy bound puted once on a statistically significant for Shannon-Fan0 codes is 1 and the bound sample and then used on a great number of for the Huffman method is p,, + 0.086, files for which the sample is representative. where p,, is the probability of the least likely There is clearly a substantial risk associ- source message (so p,, is less than or equal ated with assumptions about file character- to 5, and generally much less). It is impor- istics, and great care would be necessary in tant to note that in defining redundancy to choosing both the sample from which the be average codeword length minus entropy, mapping is to be derived and the categories the cost of transmitting the code mapping into which to partition transmissions. An computed by these algorithms is ignored. extreme example of the risk associated The overhead cost for any method in which with the codebook approach is provided by the source alphabet has not been estab- Ernest V. Wright who wrote the novel lished before transmission includes n log n Gadsby (1939) containing no occurrences of bits for sending the n source letters. For a the letter E. Since E is the most commonly Shannon-Fan0 code, a list of codewords used letter in the English language, an en- ordered so as to correspond to the source coding based on a sample from Gadsby letters could be transmitted. The additional would be disastrous if used with “normal” time required is then c Zi, where the Zi are examples of English text. Similarly, the the lengths of the codewords. For Huffman “normal” encoding would provide poor coding, an encoding of the shape of the compression of Gadsby. code tree might be transmitted. Since any McIntyre and Pechura [1985] describe full binary tree may be a legal Huffman an experiment in which the codebook ap- code tree, encoding tree shape may require proach is compared to static Huffman cod- as many as log 4” = 2n bits. In most cases ing. The sample used for comparison is a the message ensemble is very large, so that collection of 530 source programs in four the number of bits of overhead is minute languages. The codebook contains a Pascal by comparison to the total length of the code tree, a FORTRAN code tree, a encoded transmission. However, it is im- COBOL code tree, a PL/I code tree, and an prudent to ignore this cost. ALL code tree. 
The Pascal code tree is the If a less-than-optimal code is acceptable, result of applying the static Huffman al- the overhead costs can be avoided through gorithm to the combined character frequen- a prior agreement by sender and receiver cies of all of the Pascal programs in the as to the code mapping. Instead of a Huff- sample. The ALL code tree is based on the man code based on the characteristics of combined character frequencies for all of the current message ensemble, the code the programs. The experiment involves en- used could be based on statistics for a class coding each of the programs using the five of transmissions to which the current en- codes in the codebook and the static Huff- semble is assumed to belong. That is, both man algorithm. The data reported for each sender and receiver could have access to a of the 530 programs consist of the size of codebook with k mappings in it: one for the coded program for each of the five Pascal source, one for English text, and so predetermined codes and the size of the on. The sender would then simply alert the coded program plus the size of the mapping receiver as to which of the common codes (in table form) for the static Huffman

ACM Computing Surveys, Vol. 19, No. 3, September 1987 274 . D. A. Lekwer and D. S. Hirschberg

method. In every case the code tree for the constants cl and c2. We recall the definition language class to which the program be- of an asymptotically optimal code as one longs generates the most compact encoding. for which average codeword length ap- Although using the Huffman algorithm on proaches entropy and remark that a uni- the program itself yields an optimal map- versal code with cl = 1 is asymptotically ping, the overhead cost is greater than the optimal. added redundancy incurred by the less- An advantage of universal codes over than-optimal code. In many cases, the ALL Huffman codes is that it is not necessary code tree also generates a more compact to know the exact probabilities with which encoding than the static Huffman algo- the source messages appear. Whereas Huff- rithm. In the worst case an encoding con- man coding is not applicable unless the structed from the codebook is only 6.6% probabilities are known, with universal larger than that constructed by the Huff- coding it is sufficient to know the probabil- man algorithm. These results suggest that, ity distribution only to the extent that the for files of source code, the codebook ap- source messages can be ranked in probabil- proach may be appropriate. ity order. By mapping messages in order of Gilbert [1971] discusses the construction decreasing probability to codewords in or- of Huffman codes based on inaccurate der of increasing length, universality can source probabilities. A simple solution to be achieved. Another advantage to univer- the problem of incomplete knowledge of the sal codes is that the codeword sets are fixed. source is to avoid long codewords, thereby It is not necessary to compute a codeword minimizing the error of badly underesti- set based on the statistics of an ensemble; mating the probability of a message. The any universal codeword set will suffice as problem becomes one of constructing the long as the source messages are ranked. optimal binary tree subject to a height re- The encoding and decoding processes are striction (see [Knuth 1971; Hu and Tan thus simplified. Although universal codes 1972; Garey 19741). Another approach in- can be used instead of Huffman codes as volves collecting statistics for several general-purpose static schemes, the more sources and then constructing a code based common application is as an adjunct to a on some combined criterion. This approach dynamic scheme. This type of application could be applied to the problem of designing is demonstrated in Section 5. a single code for use with English, French, Since the ranking of source messages is German, and so on, sources. To accomplish the essential parameter in universal coding, this, Huffman’s algorithm could be used to we may think of a universal code as repre- minimize either the average codeword senting an enumeration of the source length for the combined source probabili- messages or as representing the integers, ties or the average codeword length for which provide an enumeration. Elias [ 19751 English, subject to constraints on average defines a sequence of universal coding codeword lengths for the other sources. schemes that maps the set of positive in- tegers onto the set of binary codewords. The first Elias code is one that is simple 3.3 Universal Codes and Representations of but not optimal. This code, y, maps an the Integers integer x onto the binary value of x prefaced A code is universal if it maps source mes- by Llog x 1 zeros. 
The binary value of n is sages to codewords so that the resulting expressed in as few bits as possible and average codeword length is bounded by therefore begins with a 1, which serves to clH + c2. That is, given an arbitrary delimit the prefix. The result is an instan- source with nonzero entropy, a universal taneously decodable code since the total code achieves average codeword length that length of a codeword is exactly one greater is at most a constant times the optimal than twice the number of zeros in the possible for that source. The potential prefix; therefore, as soon as the first 1 compression offered by a universal code of a codeword is encountered, its length clearly depends on the magnitudes of the is known. The code is not a minimum

ACM Computing Surveys, Vol. 19, No. 3, September 1987 Data Compression l 275

Source Frequency Rank Codeword 1 1 message 2 010 a(1) = 1 3 011 S(2) = 0100 4 00100 e 6(3) = 0101 5 00101 d 6(4) = 01100 6 00110 space 6(5) = 01101 7 00111 a(6) = 01110 8 0001000 s 6(7) = 01111 16 000010000 a s(8) = 00100000 17 000010001 32 00000100000 Figure 10. An Elias code for EXAMPLE (code length = 161). Figure 9. Elias codes. redundancy code since the ratio of expected on the Fibonacci numbers of order m 2 2, codeword length to entropy goes to 2 as where the Fibonacci numbers of order 2 are entropy approaches infinity. The second the standard Fibonacci numbers: 1, 1, 2, 3, code, 6, maps an integer x to a codeword 5,8,13,. . . . In general, the Fibonacci num- consisting of r(Llog xl + 1) followed by bers of order m are defined by the recur- the binary value of x with the leading 1 de- rence: Fibonacci numbers F-,+1 through F,, leted. The resulting codeword has length are equal to 1; the kth number for iz L 1 is Llog XJ + 2Llog(l + Llog xJ)J + 1. This con- the sum of the preceding m numbers. We cept can be applied recursively to shorten describe only the order-2 Fibonacci code; the codeword lengths, but the benefits the extension to higher orders is straight- decrease rapidly. The code 6 is asymptoti- forward. cally optimal since the limit of the ratio of Every nonnegative integer N has pre- expected codeword length to entropy is 1. cisely one binary representation of the form Figure 9 lists the values of y and 6 for a R(N) = Cf=o diFi (where di E (0, 11, k 5 N, sampling of the integers. Figure 10 shows and the Fi are the order-2 Fibonacci num- an Elias code for string EXAMPLE. The bers as defined above) such that there are number of bits transmitted using this map- no adjacent ones in the representation. The ping would be 161, which does not compare Fibonacci representations for a small well with the 117 bits transmitted by the sampling of the integers are shown in Huffman code of Figure 3. Huffman coding Figure 11, using the standard bit sequence is optimal under the static mapping model. from high order to low. The bottom row of Even an asymptotically optimal universal the figure gives the values of the bit posi- code cannot compare with static Huffman tions. It is immediately obvious that this coding on a source for which the probabil- Fibonacci representation does not consti- ities of the messages are known. tute a prefix code. The order-2 Fibonacci A second sequence of universal coding code for N is defined as F(N) = D 1, where schemes, based on the Fibonacci numbers, D = d,,d,dz . -. dk (the di defined above). is defined by Apostolico and Fraenkel That is, the Fibonacci representation is [1985]. Although the Fibonacci codes are reversed and 1 is appended. The Fibonacci not asymptotically optimal, they compare code values for a small subset of the inte- well to the Elias codes as long as the num- gers are given in Figure 11. These binary ber of source messages is not too large. codewords form a prefix code since every Fibonacci codes have the additional attrib- codeword now terminates in two consecu- ute of robustness, which manifests itself by tive ones, which cannot appear anywhere the local containment of errors. This aspect else in a codeword. of Fibonacci codes is discussed further in Fraenkel and Klein [ 19851 prove that the Section 7. Fibonacci code of order 2 is universal, with The sequence of Fibonacci codes de- cl = 2 and c2 = 3. It is not asymptotically scribed by Apostolico and Fraenkel is based optimal since cl > 1. Fraenkel and Klein

ACM Computing Surveys, Vol. 19, No. 3, September 1987 276 l D. A. Lelewer and D. S. Hirschberg

N R(N) WV Source Cumulative Probability message probability Raze 1 1 11 2 10 011 A .2 .2 w, 2) 3 10 0 0011 B .4 .6 1.2,.6) 4 1 0 1 1011 c .l .7 L.6, .7) 5 1 0 0 0 00011 D .2 .9 f.7, .9) 6 10 0 1 10011 # .l 1.0 [.9, 1.0) 7 10 1 0 01011 8 10 0 0 0 000011 Figure 13. The arithmetic coding model. 16 100100 0010011 32 1010100 00101011 3.4 Arithmetic Coding 21 13 8 5 3 2 1 The method of arithmetic coding was sug- Figure 11. Fibonacci representations and Fibonacci gested by Elias and presented by Abramson codes. [1963] in his text on information theory. Implementations of Elias’ technique were developed by Risssanen [1976], Pasco Source Frequency Rank Codeword [1976], Rubin [1979], and, most recently, message Witten et al. [ 19871. We present the con- cept of arithmetic coding first and follow l? 8 1 F(1) = 11 with a discussion of implementation details f 7 2 F(2) = 011 and performance. In arithmetic coding a source ensemble s 65 34 F(3)F(4) = 00111011 space 5 5 F(5) = 00011 is represented by an interval between 0 and 1 on the real number line. Each symbol ; 43 67 F(6)F(7) = 1001101011 of the ensemble narrows this interval. a 2 8 F(8) = 000011 As the interval becomes smaller, the num- ber of bits needed to specify it grows. Figure 12. A Fibonacci code for EXAMPLE (code Arithmetic coding assumes an explicit length = 153). probabilistic model of the source. It is a defined-word scheme that uses the probabilities of the source messages to also show that Fibonacci codes of higher successively narrow the interval used to order compress better than the order-2 code represent the ensemble. A high-probability if the source language is large enough message narrows the interval less than a (i.e., the number of distinct source mes- low-probability message, so that high- sages is large) and the probability distri- probability messages contribute fewer bits bution is nearly uniform. However, no to the coded message. The method begins Fibonacci code is asymptotically optimal. with an unordered list of source messages The Elias codeword 6 (N) is asymptotically and their probabilities. The number line is shorter than any Fibonacci codeword for partitioned into subintervals on the basis N, but the integers in a very large initial of cumulative probabilities. range have shorter Fibonacci codewords. A small example is used to illustrate the For m = 2, for example, the transition point idea of arithmetic coding. Given source is N = 514,228 [Apostolico and Fraenkel messages (A, B, C, D, #) with probabilities 19851. Thus, a Fibonacci code provides bet- (.2, .4, .l, .2, .l], Figure 13 demonstrates the ter compression than the Elias code until initial partitioning of the number line. The the size of the source language becomes symbol A corresponds to the first 5 of the very large. Figure 12 shows a Fibonacci interval [0, 1); B the next g; D the subin- code for string EXAMPLE. The number of terval of size 5, which begins 70% of the bits transmitted using this mapping would way from the left endpoint to the right. be 153, which is an improvement over the When encoding begins, the source ensem- Elias code of Figure 10, but still compares ble is represented by the entire interval poorly with the Huffman code of Figure 3. 10, 1). For the ensemble AADB#, the first

ACM Computing Surveys, Vol. 19, No. 3, September 198’7 Data Compression l 277 A reduces the interval to [0, .2) and the In order to recover the original ensemble, second A to [0, .04) (the first 2 of the the decoder must know the model of the previous interval). The D further narrows source used by the encoder (e.g., the source the interval to [.028, .036)(; of the previous messages and associated ranges) and a sin- size, beginning 70% of the distance from gle number within the interval determined left to right). The B narrows the interval by the encoder. Decoding consists of a se- to [.0296, .0328), and the # yields a final ries of comparisons of the number i to the interval of [.03248, .0328). The interval, or ranges representing the source messages. alternatively any number i within the in- For this example, i might be .0325 (.03248, terval, may now be used to represent the .0326, or .0327 would all do just as well). source ensemble. The decoder uses i to simulate the actions Two equations may be used to define the of the encoder. Since i lies between 0 and narrowing process described above: .2, the decoder deduces that the first letter was A (since the range [0, .2) corresponds newleft = prevleft + msgleft * prevsize (1) to source message A). This narrows the newsize = prevsize * msgsize (2) interval to [0, .2). The decoder can now deduce that the next message will further Equation (1) states that the left endpoint narrow the interval in one of the following of the new interval is calculated from the ways: to [0, .04) for A, to [.04, .12) for 23, to previous interval and the current source [.l2, .l4) for C, to [.14, .18) for D, or to message. The left endpoint of the range [.18, .2) for #. Since i falls into the interval associated with the current message speci- [0, .04), the decoder knows that the second fies what percent of the previous interval message is again A. This process continues to remove from the left in order to form the until the entire ensemble has been re- new interval. For D in the above example, covered. the new left endpoint is moved over by Several difliculties become evident when .7 X .04 (70% of the size of the previous implementation of arithmetic coding is interval). Equation (2) computes the size of attempted. The first is that the decoder the new interval from the previous interval needs some way of knowing when to stop. size and the probability of the current mes- As evidence of this, the number 0 could sage (which is equivalent to the size of its represent any of the source ensembles A, associated range). Thus, the size of the AA, AAA, and so forth. Two solutions to interval determined by D is .04 x .2, and this problem have been suggested. One is the right endpoint is .028 + .008 = .036 that the encoder transmit the size of the (left endpoint + size). ensemble as part of the description of The size of the final subinterval deter- the model. Another is that a special symbol mines the number of bits needed to specify be included in the model for the purpose of a number in that range. The number of bits signaling end of message. The # in the needed to specify a subinterval of [0, 1) of above example serves this purpose. The size s is -log a. Since the size of the final second alternative is preferable for several subinterval is the product of the probabili- reasons. 
First, sending the size of the en- ties of the source messages in the ensemble semble requires a two-pass process and (i.e., s = nZ1 p(source message i), where precludes the use of arithmetic coding as N is the length of the ensemble), we have part of a hybrid codebook scheme (see -log s = - z Zi log p (source message i) = Sections 1.2 and 3.2). Second, adaptive - xr==, p(aJlog p(oJ, where n is the number methods of arithmetic coding are easily of unique source messages al, a2, . . . , a,. developed, and a first pass to determine Thus, the number of bits generated by the ensemble size is inappropriate in an on- arithmetic coding technique is exactly line adaptive scheme. equal to entropy H. This demonstrates A second issue left unresolved by the the fact that arithmetic coding achieves fundamental concept of arithmetic coding compression that is almost exactly that is that of incremental transmission and predicted by the entropy of the source. reception. It appears from the above

ACM Computing Surveys, Vol. 19, No. 3, September 1987 278 . D. A. Lelewer and D. S. Hirschberg

Source Cumulative discussion that the encoding algorithm Probability probabi,ity transmits nothing until the final interval is message Raw determined. However, this delay is not nec- .05 .05 PI .05) essary. As the interval narrows, the leading i .075 .125 [.05, .125) bits of the left and right endpoints become .l .225 [.125, .225) the same. Any leading bits that are the Tl .125 .35 [.225, .35) same may be transmitted immediately, .15 .5 [.35, .5) since they will not be affected by further ; .175 .675 [.5, .675) narrowing. A third issue is that of precision. g .2 .a75 [.675, .875) space .125 1.0 [.875, 1.0) From the description of arithmetic coding it appears that the precision required grows Figure 14. The arithmetic coding model of EXAM- without bound as the length of the ensem- PLE. ble grows. Witten et al. [1987] and Rubin [1979] address this issue. Fixed precision any static or adaptive method for comput- registers may be used as long as underflow ing the probabilities (or frequencies) of the and overflow are detected and managed. source messages. Witten et al. [1987] im- The degree of compression achieved by an plement an adaptive model similar to the implementation of arithmetic coding is not techniques described in Section 4. The exactly H, as implied by the concept of performance of this implementation is arithmetic coding. Both the use of a mes- discussed in Section 6. sage terminator and the use of fixed-length arithmetic reduce coding effectiveness. 4. ADAPTIVE HUFFMAN CODING However, it is clear that an end-of-message symbol will not have a significant effect on Adaptive Huffman coding was first con- a large source ensemble. Witten et al. ceived independently by Faller [1973] and [1987] approximate the overhead due to Gallager [ 19781. Knuth [ 19851 contributed the use of fixed precision at lo-* bits per improvements to the original algorithm, source message, which is also negligible. and the resulting algorithm is referred to The arithmetic coding model for ensem- as algorithm FGK. A more recent version ble EXAMPLE is given in Figure 14. The of adaptive Huffman coding is described by final interval size is p(a)” X p(b)3 X p(c)” X Vitter [ 19871. All of these methods are pk05 X p(eY X p(f17 X p(g)’ X pbpd5. defined-word schemes that determine the The number of bits needed to specify mapping from source messages to code- a value in the interval is -log(1.44 x words on the basis of a running estimate of 10-35) = 115.7. So excluding overhead, ar- the source message probabilities. The code ithmethic coding transmits EXAMPLE in is adaptive, changing so as to remain opti- 116 bits, one less bit than static Huffman mal for the current estimates. In this way, coding. the adaptive Huffman codes respond to Witten et al. [1987] provide an imple- locality. In essence, the encoder is “learn- mentation of arithmetic coding, written in ing” the characteristics of the source. The C, which separates the model of the source decoder must learn along with the encoder from the coding process (where the coding by continually updating the Huffman tree process is defined by eqs. (1) and (2)). The so as to stay in synchronization with the model is in a separate program module and encoder. is consulted by the encoder and the decoder Another advantage of these systems is at every step in the processing. The fact that they require only one pass over the that the model can be separated so easily data. 
Of course, one-pass methods are not renders the classification static/adaptive very interesting if the number of bits they irrelevant for this technique. Indeed, the transmit is significantly greater than that fact that the coding method provides of the two-pass scheme. Interestingly, the compression efficiency nearly equal to the performance of these methods, in terms of entropy of the source under any model al- number of bits transmitted, can be better lows arithmetic coding to be coupled with than that of static Huffman coding. This

ACM Computing Surveys, Vol. 19, No. 3, September 1987 Data Compression 8 279 does not contradict the optimality of the that is the new O-node. Again, the tree is static method, since the static method is recomputed. In this case, the code for the optimal only over all methods that assume O-node is sent; in addition, the receiver a time-invariant mapping. The perform- must be told which of the n - k unused ance of the adaptive methods can also be messageshas appeared. In Figure 15, a sim- worse than that of the static method. Upper ple example is given. At each node a count bounds on the redundancy of these meth- of occurrences of the corresponding mes- ods are presented in this section. As dis- sage is stored. Nodes are numbered indi- cussed in the Introduction, the adaptive cating their position in the sibling property method of Faller [1973], Gallager [1978], ordering. The updating of the tree can be and Knuth [ 19851is the basis for the UNIX done in a single traversal from the a,+, node utility compact. The performance of com- to the root. This traversal must increment pact is quite good, providing typical the count for the a,+l node and for each of compression factors of 30-40%. its ancestors. Nodes may be exchanged to maintain the sibling property, but all of 4.1 Algorithm FGK these exchanges involve a node on the path from a,+l to the root. Figure 16 shows the The basis for algorithm FGK is the sibling final code tree formed by this process on property, defined by Gallager [1978]: A the ensemble EXAMPLE. binary code tree has the sibling property Disregarding overhead, the number of if each node (except the root) has a sibling bits transmitted by algorithm FGK for the and if the nodes can be listed in order of EXAMPLE is 129. The static Huffman nonincreasing weight with each node adja- algorithm would transmit 117 bits in pro- cent to its sibling. Gallager proves that a cessing the same data. The overhead asso- binary prefix code is a Huffman code if and ciated with the adaptive method is actually only if the code tree has the sibling prop- less than that of the static algorithm. In erty. In algorithm FGK, both sender and the adaptive case the only overhead is the receiver maintain dynamically changing n log n bits needed to represent each of the Huffman code trees. The leaves of the n different source messageswhen they ap- code tree represent the source messages, pear for the first time. (This is in fact and the weights of the leaves represent conservative; instead of transmitting a frequency counts for the messages. At any unique code for each of the n source mes- particular time, k of the n possible source pages, the sender could transmit the messages have occurred in the message message’s position in the list of remaining ensemble. messagesand save a few bits in the average Initially, the code tree consists of a single case.) In the static case, the source mes- leaf node, called the O-node. The O-node is sages need to be sent as does the shape of a special node used to represent the n - k the code tree. As discussed in Section 3.2, unused messages. For each message trans- an efficient representation of the tree shape mitted, both parties must increment the requires 2n bits. Algorithm FGK com- corresponding weight and recompute the pares well with static Huffman coding on code tree to maintain the sibling property. this ensemble when overhead is taken At the point in time when t messageshave into account. 
Figure 17 illustrates an been transmitted, k of them distinct, and example on which algorithm FGK per- k < n, the tree is a legal Huffman code forms better than static Huffman coding tree with k + 1 leaves, one for each of even without taking overhead into account. the k messages and one for the O-node. If Algorithm FGK transmits 47 bits for this the (t + 1)st message is one of the k already ensemble, whereas the static Huffman code seen, the algorithm transmits at+l’s current requires 53. code, increments the appropriate counter, Vitter [1987] has proved that the total and recomputes the tree. If an unused mes- number of bits transmitted by algorithm sage occurs, the O-node is split to create a FGK for a message ensemble of length t pair of leaves, one for a,+l, and a sibling containing n distinct messages is bounded

ACM Computing Surveys, Vol. 19, No. 3, September 1987 280 D. A. Lelewer and D. S. Hirschberg

space space (4 (b)

(4 Figure 15. Algorithm FGK processing the ensemble EXAMPLE--(a) Tree after processing LLaa bb”; 11 will be transmitted for the next b. (b) After encoding the third b; 101 will be transmitted for the next spuce; the form of the tree will not change-only the frequency counts will be updated; 100 will be transmitted for the first c. (c) Tree after update following first c.

below by S - n + 1, where S is the perform- 4.2 Algorithm V ance of the static method, and bounded above by 2s + t - 4n + 2. So the perform- The adaptive Huffman algorithm of Vitter ance of algorithm FGK is never much worse (algorithm V) incorporates two improve- than twice optimal. Knuth [1985] provides ments over algorithm FGK. First, the num- a complete implementation of algorithm ber of interchanges in which a node is FGK and a proof that the time required for moved upward in the tree during a recom- each encoding or decoding operation is putation is limited to one. This number is O(l), where 1 is the current length of the bounded in algorithm FGK only by l/2, codeword. It should be noted that since the where 1 is the length of the codeword for mapping is defined dynamically, (during a,+l when the recomputation begins. Sec- transmission) the encoding and decoding ond, Vitter’s method minimizes the values algorithms stand alone; there is no addi- of z li and max(li] subject to the require- tional algorithm to determine the mapping ment of minimizing 2 wili. The intuitive as in static methods. explanation of algorithm V’s advantage

ACM Computing Surveys, Vol. 19, No. 3, September 1987 Data Compression l 281

Figure 16. Tree formed by algorithm FGK for ensemble EXAMPLE.

a

Figure 17. Tree formed by algorithm FGK for ensemble “e eae de eabe eae dcf”.

f over algorithm FGK is as follows: As in rately represent the symbol probabilities algorithm FGK, the code tree constructed over the entire message. Therefore, the fact by algorithm V is the Huffman code tree that algorithm V guarantees a tree of min- for the prefix of the ensemble seen so far. imum height (height = max(Zi)) and mini- The adaptive methods do not assume that mum external path length (Z li) implies the relative frequencies of a prefix accu- that it is better suited for coding the next

ACM Computing Surveys, Vol. 19, No. 3, September 1987 282 . D. A. Lelewer and D. S. Hirschberg

Figure 18. FGK tree with nonlevel order numbering.

message of the ensemble, given that any of the leaves of the tree may represent that next message. These improvements are accomplished through the use of a new system for num- bering nodes. The numbering, called an implicit numbering, corresponds to a level ordering of the nodes (from bottom to top and left to right). Figure 18 illustrates that the numbering of algorithm FGK is not 1 always a level ordering. The following in- variant is maintained in Vitter’s algorithm: c For each weight w, all leaves of weight w precede (in the implicit numbering) all in- Figure 19. Algorithm V processing the ensemble “UQ ternal nodes of weight w. Vitter [1987] bbb c”. proves that this invariant enforces the de- sired bound on node promotions. The in- variant also implements bottom merging, by increasing weight with the convention as discussed in Section 3.2, to minimize that a leaf block always precedes an inter- z li and max(Zi ). The difference between nal block of the same weight. When an Vitter’s method and algorithm FGK is in exchange of nodes is required to maintain the way the tree is updated between the sibling property, algorithm V requires transmissions. In order to understand the that the node being promoted be moved to revised update operation, the following the position currently occupied by the high- definition of a block of nodes is necessary: est numbered node in the target block. Blocks are equivalence classes of nodes de- In Figure 19, the Vitter tree correspond- fined by u = v iff weight(u) = weight(v), ing to Figure 15~ is shown. This is the and u and v are either both leaves or both first point in EXAMPLE at which algo- internal nodes. The leader of a block is the rithm FGK and algorithm V differ signifi- highest numbered (in the implicit number- cantly. At this point, the Vitter tree has ing) node in the block. Blocks are ordered height 3 and external path length 12,

ACM Computing Surveys, Vol. 19, No. 3, September 1987 Data Compression l 283

Figure 20. Tree formed by algorithm V for the ensemble of Figure 17. e

c b whereas the FGK tree has height 4 and nodes with equal weights could be deter- external path length 14. Algorithm V trans- mined on the basis of recency. Another mits codeword 001 for the second c; FGK reasonable assumption about adaptive cod- transmits 1101. This demonstrates the in- ing is that the weights in the current tree tuition given earlier that algorithm V is correspond closely to the probabilities as- better suited for coding the next message. sociated with the source. This assumption The Vitter tree corresponding to Figure 16, becomes more reasonable as the length of representing the final tree produced in the ensemble increases. Under this assump- processing EXAMPLE, is only different tion, the expected cost of transmitting the from Figure 16 in that the internal node of next letter is C pi& z x wili, SO that neither weight 5 is to the right of both leaf nodes algorithm FGK nor algorithm V has any of weight 5. Algorithm V transmits 124 bits advantage. in processing EXAMPLE, as compared Vitter [1987] proves that the perform- with the 129 bits of algorithm FGK and ance of his algorithm is bounded by S - n 117 bits of static Huffman coding. It should + 1 from below and S + t - 2n + 1 from be noted that these figures do not include above. At worst then, Vitter’s adaptive overhead and, as a result, disadvantage the method may transmit one more bit per adaptive methods. codeword than the static Huffman method. Figure 20 illustrates the tree built by The improvements made by Vitter do not Vitter’s method for the ensemble of Fig- change the complexity of the algorithm; ure 17. Both C li and maxI& 1 are smaller in algorithm V encodes and decodes in 0 (1) the tree of Figure 20. The number of bits time, as does algorithm FGK. transmitted during the processing of the sequence is 47, the same used by algorithm 5. OTHER ADAPTIVE METHODS FGK. However, if the transmission contin- ues with d, b, c, f, or an unused letter, the Two more adaptive data compression cost of algorithm V will be less than that methods, algorithm BSTW and Lempel- of algorithm FGK. This again illustrates Ziv coding, are discussed in this section. the benefit of minimizing the external path Like the adaptive Huffman coding tech- length (C li) and the height (max(Zi)). niques, these methods do not require a first It should be noted again that the strategy pass to analyze the characteristics of the of minimizing external path length and source. Thus they provide coding and height is optimal under the assumption transmission in real time. However, these that any source letter is equally likely to schemes diverge from the fundamental occur next. Other reasonable strategies in- Huffman coding approach to a greater clude one that assumes locality. To take degree than the methods discussed in Sec- advantage of locality, the ordering of tree tion 4. Algorithm BSTW is a defined-word

ACM Computing Surveys, Vol. 19, No. 3, September 1987 204 . D. A. Lelewer and D. S. Hirschberg scheme that attempts to exploit locality. Symbol string Code Lempel-Ziv coding is a free-parse method; A 1 that is, the words of the source alphabet T 2 are defined dynamically as the encoding is AN 3 performed. Lempel-Ziv coding is the basis TH 4 for the UNIX utility compress. Algorithm THE 5 BSTW is a variable-variable scheme, AND 6 whereas Lempel-Ziv coding is variable- AD 7 block. b 8 bb 9 bbb 10 5.1 Lempel-Ziv Codes 0 11 00 12 Lempel-Ziv coding represents a departure 000 13 from the classic view of a code as a mapping 0000 14 from a fixed set of source messages (letters, Z 15 symbols, or words) to a fixed set of code- words. We coin the term free-parse to char- ### 4095 acterize this type of code, in which the set Figure 21. A Lempel-Ziv code table. of source messages and the codewords to which they are mapped are defined as the algorithm executes. Whereas all adaptive in long strings. Effective compression is methods create a set of codewords dynam- achieved when a long string is replaced by ically, defined-word schemes have a fixed a single 12-bit code. set of source messages, defined by context The Lempel-Ziv method is an incremen- (e.g., in text file processing the source mes- tal parsing strategy in which the coding sages might be single letters; in Pascal process is interlaced with a learning process source file processing the source messages for varying source characteristics [Ziv and might be tokens). Lempel-Ziv coding de- Lempel19771. In Figure 21, run-length en- fines the set of source messages as it parses coding of zeros and blanks is being learned. the ensemble. The Lempel-Ziv algorithm parses the The Lempel-Ziv algorithm consists of a source ensemble into a collection of seg- rule for parsing strings of symbols from a ments of gradually increasing length. At finite alphabet into substrings or words each encoding step, the longest prefix of whose lengths do not exceed a prescribed the remaining source ensemble that integer L, and a coding scheme that maps matches an existing table entry ((Y) is these substrings sequentially into uniquely parsed off, along with the character (c) decipherable codewords of fixed length Lz following this prefix in the ensemble. The [Ziv and Lempel 19771. The strings are new source message, (YC, is added to the selected so that they have very nearly equal code table. The new table entry is coded as probability of occurrence. As a result, fre- (i, c), where i is the codeword for the exist- quently occurring symbols are grouped ing table entry and c is the appended into longer strings, whereas infrequent character. For example, the ensemble symbols appear in short strings. This strat- 010100010 is parsed into (0, 1, 01, 00, 010) egy is effective at exploiting redundancy and is coded as ((0, 0), (0, l), (1, l), (1, 0), due to symbol frequency, character repeti- (3, 0)). The table built for the message tion, and high-usage patterns. Figure 21 ensemble EXAMPLE is shown in Fig- shows a small Lempel-Ziv code table. Low- ure 22. The coded ensemble has the form frequency letters such as Z are assigned ((0, a), (1, space), (0, b), (3, b), to, space), individually to fixed-length codewords (in (0, 4, (6, 4, (6, space), (0, d), (9, d), this case, 12-bit binary numbers repre- (10, w=e), to, 4, (12, 4, (13, 4, (5, f), sented in base 10 for readability). (0, f), (1% f), (17, f), (0, g), (1% g), (20, g), Frequently occurring symbols, such as (20)). 
The string table is represented in a blank (represented by b) and zero, appear more efficient manner than in Figure 21;

ACM Computing Surveys, Vol. 19, No. 3, September 1987 Data Compression l 285 Message Codeword good performance briefly, and fail to make a any gains once the table is full and mes- lspace 2 sages can no longer be added. If the ensem- b 3 ble’s characteristics vary over time, the 3b 4 method may be “stuck with” the behavior space 5 it has learned and may be unable to con- C 6 tinue to adapt. 6c 7 Lempel-Ziv coding is asymptotically 8 optimal, meaning that the redundancy ap- d- 9 9d 10 proaches zero as the length of the source lOspace 11 ensemble tends to infinity. However, for e 12 particular finite sequences, the compres- 12e 13 sion achieved may be far from optimal 13e 14 [Storer and Szymanski 19821. When the 5f 15 method begins, each source symbol is coded 16 individually. In the case of 6- or B-bit source i6f 17 symbols and 12-bit codewords, the method 17f 18 yields as much as 50% expansion during g 19 initial encoding. This initial inefficiency 20 1% can be mitigated somewhat by initializing 2ol2 21 the string table to contain all of the source Figure 22. Lempel-Ziv table for the message ensem- characters. Implementation issues are par- ble EXAMPLE (code length = 173). ticularly important in Lempel-Ziv meth- ods. A straightforward implementation takes O(n*) time to process a string of n the string is represented by its prefix code- symbols; for each encoding operation, the word followed by the extension character, existing table must be scanned for the long- so that the table entries have fixed length. est message occurring as a prefix of the The Lempel-Ziv strategy is simple, but remaining ensemble. Rodeh et al. [1981] greedy. It simply parses off the longest rec- address the issue of computational com- ognized string each time instead of search- plexity by defining a linear implementation ing for the best way to parse the ensemble. of Lempel-Ziv coding based on suffix trees. The Lempel-Ziv method specifies fixed- The Rodeh et al. scheme is asymptotically length codewords. The size of the table and optimal, but an input must be very long in the maximum source message length are order to allow efficient compression, and determined by the length of the codewords. the memory requirements of the scheme It should be clear from the definition of the are large, O(n) where n is the length of the algorithm that Lempel-Ziv codes tend to source ensemble. It should also be men- be quite inefficient during the initial por- tioned that the method of Rodeh et al. tion of the message ensemble. For example, constructs a variable-variable code; the even if we assume 3-bit codewords for char- pair (i, c) is coded using a representation acters a-g and space and &bit codewords of the integers, such as the Elias codes, for for table indexes, the Lempel-Ziv algo- i and for c (a letter c can always be coded rithm transmits 173 bits for ensemble as the kth member of the source alphabet EXAMPLE. This compares poorly with for some k). the other methods discussed in this survey. The other major implementation consid- The ensemble must be sufficiently long for eration involves the way in which the string the procedure to build up enough symbol table is stored and accessed. Welch [1984] frequency experience to achieve good suggests that the table be indexed by the compression over the full ensemble. codewords (integers 1 . . . 
2L, where L is the If the codeword length is not sufficiently maximum codeword length) and that the large, Lempel-Ziv codes may also rise table entries be fixed-length, (codeword- slowly to reasonable efficiency, maintain extension) character pairs. Hashing is

ACM Computing Surveys, Vol. 19, No. 3, September 1987 286 . D. A. Lelewer and D. S. Hirschberg proposed to assist in encoding. Decoding and Wei [1986]. This method, algorithm becomes a recursive operation in which the BSTW, possessesthe advantage that it re- codeword yields the final character of the quires only one pass over the data to be substring and another codeword. The de- transmitted and yet has performance that coder must continue to consult the table compares well to that of the static two-pass until the retrieved codeword is 0. Unfortu- method along the dimension of number of nately, this strategy peels off extension bits per word transmitted. This number of characters in reverse order, and some type bits is never much larger than the number of stack operation must be used to reorder of bits transmitted by static Huffman cod- the source. ing (in fact, it is usually quite close) and Storer and Szymanski [1982] present a can be significantly better. Algorithm general model for data compression that BSTW incorporates the additional benefit encompasses Lempel-Ziv coding. Their of taking advantage of locality of reference, broad theoretical work compares classes of the tendency for words to occur frequently macro schemes, where macro schemes in- for short periods of time and then fall into clude all methods that factor out duplicate long periods of disuse. The algorithm uses occurrences of data and replace them by a self-organizing list as an auxiliary data references either to the source ensemble structure and shorter encodings for words or to a code table. They also contribute near the front of this list. There are many a linear-time Lempel-Ziv-like algorithm strategies for maintaining self-organizing with better performance than the standard lists [Hester and Hirschberg 19851; algo- Lempel-Ziv method. rithm BSTW uses move to front. Rissanen [ 19831 extends the Lempel-Ziv A simple example serves to outline the incremental parsing approach. Abandoning method of algorithm BSTW. As in other the requirement that the substrings parti- adaptive schemes, sender and receiver tion the ensemble, the Rissanen method maintain identical representations of the gathers “contexts” in which each symbol of code, in this case message lists that are the string occurs. The contexts are sub- updated at each transmission using the strings of the previously encoded string (as move-to-front heuristic. These lists are in- in Lempel-Ziv), have varying size, and are itially empty. When messagea, is transmit- in general overlapping. The Rissanen ted, if a, is on the sender’s list, the sender method hinges on the identification of a transmits its current position. The sender design parameter capturing the concept of then updates his or her list by moving at to “relevant” contexts. The problem of finding position 1 and shifting each of the other the best parameter is undecidable, and Ris- messages down one position. The receiver sanen suggests estimating the parameter similarly alters his or her word list. If a, is experimentally, being transmitted for the first time, then k As mentioned earlier, Lempel-Ziv coding + 1 is the “position” transmitted, where k is the basis for the UNIX utility compress is the number of distinct messages trans- and is one of the methods commonly used mitted so far. Some representation of the in file archival programs. The archival sys- message itself must be transmitted as well, tem PKARC uses Welch’s implementation, but just this first time. Again, a, is moved as does compress. 
The compression pro- to position 1 by both sender and receiver vided by compress is generally much better subsequent to its transmission. For the en- than that achieved by compact (the UNIX semble “abcadeabfd “, the transmission utility based on algorithm FGK) and takes wouldbela2b3c34d5e356f5(for less time to compute [UNIX 19841. Typical ease of presentation, list positions are rep- compression values attained by compress resented in base ten). are in the range of 50-60%. As the example shows, algorithm BSTW transmits each source message once; the 5.2 Algorithm BSTW rest of its transmission consists of encod- ings of list positions. Therefore, an The most recent of the algorithms surveyed essential feature of algorithm BSTW is a here is due to Bentley, Sleator, Tarjan, reasonable scheme for representation of

ACM Computing Surveys, Vol. 19, No. 3, September 1987 Data Compression l 287 the integers. The methods discussed by An implementation of algorithm BSTW Bentley et al. are the Elias codes presented is described in great detail in Bentley et al. in Section 3.3. The simple scheme, code y, [ 19861. In this implementation, encoding involves prefixing the binary representa- an integer consists of a table lookup; the tion of the integer i with Llog iJ zeros. This codewords for the integers from 1 to n + 1 yields a prefix code with the length of the are stored in an array indexed from 1 to codeword for i equal to 2Llog i J + 1. Greater n + 1. A binary trie is used to store the compression can be gained through use of inverse mapping from codewords to inte- the more sophisticated scheme 6, which gers. Decoding an Elias codeword to find encodes an integer i in 1 + Llog i J + 21 (log(1 the corresponding integer involves follow- + Llog il)l bits. ing a path in the trie. Two interlinked data A message ensemble on which algorithm structures, a binary trie and a binary tree, BSTW is particularly efficient, described are used to maintain the word list. The trie by Bentley et al., is formed by repeating is based on the binary encodings of the each of n messages n times; for example, source words. Mapping a source message oi In2n3” . . . nne Disregarding overhead, a to its list position p involves following a static Huffman code uses n’log n bits (or path in the trie, following a link to the tree, log n bits per message), whereas algorithm and then computing the symmetric order BSTW uses n2 + 2 CL Llog iJ (which is position of the tree node. Finding the less than or equal to nz + 2n log n or O(1) source message Ui in position p is accom- bits per message). The overhead for algo- plished by finding the symmetric order po- rithm BSTW consists of just the n log n sition p in the tree and returning the word bits needed to transmit each source letter stored there. Using this implementation, once. As discussed in Section 3.2, the over- the work done by sender and receiver is head for static Huffman coding includes an O(kngth(ai) + length(w)), where oi is the additional 2n bits. The locality present in messagebeing transmitted and w the code- ensemble EXAMPLE is similar to that in word representing oi’s position in the list. the above example. The transmission ef- If the source alphabet consists of single fected by algorithm BSTW is 1 a 1 2 space characters, then the complexity of algo- 3b1124c11125d111126ellll rithm BSTW is just O(length(w)). 127f1111118glllllll.Using3 The “move-to-front” scheme of Bentley bits for each source letter (a-g and space) et al. was independently developed by Elias and the Elias code 6 for list positions, the in his paper on interval encoding and re- number of bits used is 81, which is a great cency rank encoding [Elias 19871. Recency improvement over all of the other methods rank encoding is equivalent to algorithm discussed (only 69% of the length used by BSTW. The name emphasizes the fact, static Huffman coding). This could be im- mentioned above, that the codeword for a proved further by the use of Fibonacci source message represents the number of codes for list positions. distinct messages that have occurred since In Bentley et al. [1986] a proof is given its most recent occurrence. 
Interval encod- that with the simple scheme for encoding ing represents a source messageby the total integers, the performance of algorithm number of messages that have occurred BSTW is bounded above by 2S + 1, where since its last occurrence (equivalently, the S is the cost of the static Huffman coding length of the interval since the last previous scheme. Using the more sophisticated occurrence of the current message). It is integer encoding scheme, the bound is obvious that the length of the interval since 1 + S + 2 log( 1 + S). A key idea in the the last occurrence of a message a, is at proofs given by Bentley et al. is the fact least as great as its recency rank, so that that, using the move-to-front heuristic, the recency rank encoding never uses more, integer transmitted for a message a, will be and generally uses fewer, symbols per mes- one more than the number of different sage than interval encoding. The advantage words transmitted since the last occurrence to interval encoding is that it has a very of a,. Bentley et al. also prove that algo- simple implementation and can encode and rithm BSTW is asymptotically optimal. decode selections from a very large alphabet

ACM Computing Surveys, Vol. 19, No. 3, September 1987 288 l D. A. Lelewer and D. 5’. Hirschberg (a million letters, for example) at a micro- Knuth [ 19851 describes algorithm FGK’s second rate [Elias 19871. The use of inter- performance on three types of data: a file val encoding might be justified in a data containing the text of Grimm’s first 10 fairy transmission setting, where speed is the tales, text of a technical book, and a file of essential factor. graphical data. For the first two files, the Ryabko [1987] also comments that the source messages are individual characters work of Bentley et al. coincides with many and the alphabet size is 128. The same data of the results in a paper in which he con- are coded using pairs of characters so that siders data compression by means of a the alphabet size is 1968. For the graphical “book stack” (the books represent the data, the number of source messagesis 343. source messages, and as a “book” occurs it In the case of the fairy tales the perform- is taken from the stack and placed on top). ance of FGK is very close to optimum, Horspool and Cormack [1987] have consid- although performance degrades with in- ered “move-to-front,” as well as several creasing file size. Performance on the tech- other list organization heuristics, in con- nical book is not as good, but it is still nection with data compression. respectable. The graphical data prove harder yet to compress, but again FGK performs reasonably well. In the latter two 6. EMPIRICAL RESULTS cases the trend of performance degradation Empirical tests of the efficiencies of the with file size continues. Defining source algorithms presented here are reported messages to consist of character pairs re- in Bentley et al. [1986], Knuth [1985], sults in slightly better compression, but the Schwartz and Kallick [ 19641,Vitter [ 19871, difference would not appear to justify the and Welch [ 19841. These experiments com- increased memory requirement imposed by pare the number of bits per word required; the larger alphabet. processing time is not reported. Although Vitter [1987] tests the performance of theoretical considerations bound the per- algorithms V and FGK against that of formance of the various algorithms, exper- static Huffman coding. Each method is run imental data are invaluable in providing on data that include Pascal source code, the additional insight. It is clear that the per- TEX source of the author’s thesis, and elec- formance of each of these methods is de- tronic mail files. Figure 23 summarizes the pendent on the characteristics of the source results of the experiment for a small file of ensemble. text. The performance of each algorithm is Schwartz and Kallick [1964] test an im- measured by the number of bits in the plementation of static Huffman coding in coded ensemble, and overhead costs are not which bottom merging is used to determine included. Compression achieved by each codeword lengths and all codewords of a algorithm is represented by the size of the given length are sequential binary numbers. file it creates, given as a percentage of the The source alphabet in the experiment original file size. Figure 24 presents data consists of 5114 frequently used English for Pascal source code. For the TEX source, words, 27 geographical names, 10 numerals, the alphabet consists of 128 individual 14 symbols, and 43 suffixes. The entropy of characters; for the other two file types, no the document is 8.884 binary digits per more than 97 characters appear. 
For each message, and the average codeword con- experiment, when the overhead costs are structed has length 8.920. The same docu- taken into account, algorithm V outper- ment is also coded one character at a time. forms static Huffman coding as long as the In this case the entropy of the source is size of the message ensemble (number of 4.03, and the coded ensemble contains an characters) is no more than 104. Algorithm average of 4.09 bits per letter. The redun- FGK displays slightly higher costs, but dancy is low in both cases. However, the never more than 100.4% of the static algo- relative redundancy (i.e., redundancy/ rithm. entropy) is lower when the document is Witten et al. [1987] compare adaptive encoded by words. arithmetic coding with adaptive Huffman

ACM Computing Surveys, Vol. 19, No. 3, September 1987 Data Compression l 289 n k Static Algorithm V Algorithm FGK for experiments that compare the perform- 100 96 83.0 71.1 82.4 ance of algorithm BSTW to static Huffman 500 96 83.0 80.8 83.5 coding. Here the defined words consist of 961 97 83.5 82.3 83.7 two disjoint classes, sequence of alpha- numeric characters and sequences of non- Figure23. Simulation results for a small text file alphanumeric characters. The performance [Vitter 19871: n, file size in 8-bit bytes; k, number of distinct messages. of algorithm BSTW is very close to that of static Huffman coding in all cases. The experiments reported by Bentley et al. are n k Static Algorithm V Algorithm FGK of particular interest in that they incorpo- 100 32 57.4 56.2 58.9 rate another dimension, the possibility that 500 49 61.5 62.2 63.0 in the move-to-front scheme one might 1000 57 61.3 61.8 62.4 want to limit the size of the data structure 10000 73 59.8 59.9 60.0 containing the codes to include only the m 12067 78 59.6 59.8 59.9 most recent words, for some m. The tests consider cache sizes of 8, 16, 32, 64, 128, Figure 24. Simulation results for Pascal source code [Vitter 19871: n, file size in bytes; k, number of distinct and 256. Although performance tends to messages. increase with cache size, the increase is erratic, with some documents exhibiting nonmonotonicity (performance that in- coding. The version of arithmetic coding creases with cache size to a point and then tested employs single-character adaptive decreases when cache size is further in- frequencies and is a mildly optimized C creased). implementation. Witten et al. compare the Welch [ 19841 reports simulation results results provided by this version of arith- for Lempel-Ziv codes in terms of compres- metic coding with the results achieved by sion ratios. His definition of compression the UNIX compact program (compact is ratio is the one given in Section 1.3, C = based on algorithm FGK). On three large (average message length)/(average code- files that typify data compression applica- word length). The ratios reported are 1.8 tions, compression achieved by arithmetic for English text, 2-6 for COBOL data files, coding is better than that provided by com- 1.0 for floating-point arrays, 2.1 for for- pact, but only slightly better (average file matted scientific data, 2.6 for system log size is 98% of the compacted size). A file data, 2.3 for source code, and 1.5 for object over a three-character alphabet, with very code. The tests involving English text files skewed symbol probabilities, is encoded by showed that long individual documents did arithmetic coding in less than 1 bit per not compress better than groups of short character; the resulting file size is 74% of documents. This observation is somewhat the size of the file generated by compact. surprising in that it seems to refute the Witten et al. also report encoding and de- intuition that redundancy is due at least in coding times. The encoding time of arith- part to correlation in content. For purposes metic coding is generally half the time of comparison, Welch cites results of required by the adaptive Huffman coding Pechura and Rubin. Pechura [1982] method. Decode time averages 65% of the achieved a 1.5 compression ratio using time required by compact. Only in the static Huffman coding on files of English case of the skewed file are the time statis- text. Rubin reports a 2.4 ratio for Eng- tics quite different. 
Arithmetic coding again lish text when using a complex technique achieves faster encoding, 67% of the time for choosing the source messages to which required by compact. However, compact de- Huffman coding is applied [Rubin 19761. codes more quickly, using only 78% of the These results provide only a very weak time of the arithmetic method. basis for comparison, since the character- Bentley et al. [1986] use C and Pascal istics of the files used by the three authors source files, TROFF source files, and a are unknown. It is very likely that a single terminal session transcript of several hours algorithm may produce compression ratios

ACM Computing Surveys, Vol. 19, No. 3, September 1987 290 l D. A. Lelewer and D. S. Hirschberg

ranging from 1.5 to 2.4, depending on the The effect of amplitude errors is dem- source to which it is applied. onstrated in Figure 26. The format of the illustration is the same as that in Fig- 7. SUSCEPTIBILITY TO ERROR ure 25. This time bits 1, 2, and 4 are inverted rather than lost. Again synchro- The discrete noiseless channel is, unfortu- nization is regained almost immediately. nately, not a very realistic model of a When bit 1 or bit 2 is changed, only the communication system. Actual data trans- first three bits (the first character of the mission systems are prone to two types of ensemble) are disturbed. Inversion of bit 4 error: phase error, in which a code symbol causes loss of synchronization through the is lost or gained, and amplitude error, in ninth bit. A very simple explanation of the which a code symbol is corrupted [Neu- self-synchronization present in these ex- mann 19621. The degree to which channel amples can be given. Since many of the errors degrade transmission is an impor- codewords end in the same sequence of tant parameter in the choice of a data digits, the decoder is likely to reach a leaf compression method. The susceptibility to of the Huffman code tree at one of the error of a coding algorithm depends heavily codeword boundaries of the original coded on whether the method is static or adaptive. ensemble. When this happens, the decoder is back in synchronization with the en- coder. 7.1 Static Codes So that self-synchronization may be dis- It is generally known that Huffman codes cussed more carefully, the following defi- tend to be self-correcting [Standish 19801; nitions are presented. (It should be noted that is, a transmission error tends not to that these definitions hold for arbitrary propagate too far. The codeword in which prefix codes so that the discussion includes the error occurs is incorrectly received, and all of the codes described in Section 3.) Ifs it is likely that several subsequent code- is a suffix of some codeword and there exist words are misinterpreted, but before too sequences of codewords r and A such that long the receiver is back in synchronization s I? = A, then I’ is said to be a synchronizing with the sender. In a static code, synchro- sequence for s. For example, in the Huffman nization means simply that both sender and code used above, 1 is a synchronizing se- receiver identify the beginnings of the code- quence for the suffix 01, whereas both words in the same way. In Figure 25 an 000001 and 011 are synchronizing se- example is used to illustrate the ability of quences for the suffix 10. If every suffix (of a Huffman code to recover from phase er- every codeword) has a synchronizing se- rors. The message ensemble “BCDAEB ” is quence, then the code is completely self- encoded using the Huffman code of Figure synchronizing. If some or none of the proper 8 where the source letters al . . . a5 repre- suffixes have synchronizing sequences, sent A . . . E, respectively, yielding the then the code is, respectively, partially or coded ensemble “0110100011000011”. Fig- never self-synchronizing. Finally, if there ure 25 demonstrates the impact of loss of exists a sequence l? that is a synchronizing the first bit, the second bit, or the fourth sequence for every suffix, r is defined to be bit. The dots show the way in which each a universal synchronizing sequence. The line is parsed into codewords. 
The loss of code used in the examples above is com- the first bit results in resynchronization pletely self-synchronizing and has univer- after the third bit so that only the first sal synchronizing sequence 00000011000. source message (B) is lost (replaced by AA). Gilbert and Moore [1959] prove that the When the second bit is lost, the first eight existence of a universal synchronizing se- bits of the coded ensemble are misinter- quence is a necessary as well as a sufficient preted and synchronization is regained by condition for a code to be completely self- bit 9. Dropping the fourth bit causes the synchronizing. They also state that any same degree of disturbance as dropping the prefix code that is completely self- second. synchronizing will synchronize itself with

ACM Computing Surveys, Vol. 19, No. 3, September 1987 Data Compression l 291

011.010.001.1.000.011. Coded ensemble BCDAEB 1.1.010.001.1.000.011. Bit 1 is lost, interpreted as AACDAEB 010.1.000.1.1.000.011. Bit 2 is lost, interpreted as CAEAAEB 011.1.000.1.1.000.011. Bit 4 is lost, interpreted as BAEAAEB

Figure 25. Recovery from phase errors.

0 1 1.0 1 0.00 1.1.000.011 Coded ensemble BCDAEB 1.1.1.0 10.00 1.1.000.011 Bit 1 is inverted, interpreted as DCDAEB 0 0 1.0 10.00 1.1.000.011 Bit 2 is inverted, interpreted as AAACDAEB 0 11.1.1.0 00.1.1.000.011 Bit 4 is inverted, interpreted as BAAEAAEB

Figure 26. Recovery from amplitude errors. probability 1 if the source ensemble con- self-synchronize, this is not guaranteed, sists of successive messages independently and when self-synchronization is assured, chosen with any given set of probabilities. there is no bound on the propagation of the This is true since the probability of occur- error. An additional difficulty is that self- rence of the universal synchronizing se- synchronization provides no indication quence at any given time is positive. that an error has occurred. It is important to realize that the fact The problem of error detection and cor- that a completely self-synchronizing code rection in connection with Huffman codes will resynchronize with probability 1 does has not received a great deal of attention. not guarantee recovery from error with Several ideas on the subject are reported bounded delay. In fact, for every completely here. Rudner [1971] states that synchroniz- self-synchronizing prefix code with more ing sequences should be as short as possible than two codewords, there are errors within to minimize resynchronization delay. In ad- one codeword that cause unbounded error dition, if a synchronizing sequence is used propagation [Neumann 19621. In addition, as the codeword for a high probability mes- prefix codes are not always completely self- sage, then resynchronization will be more synchronizing. Bobrow and Hakimi [ 19691 frequent. A method for constructing a min- state a necessary condition for a prefix code imum-redundancy code having the shortest with codeword lengths l1 . . . 1, to be com- possible synchronizing sequence is de- pletely self-synchronizing: the greatest scribed by Rudner [ 19711. Neumann [ 19621 common divisor of the Zi must be equal to suggests purposely adding some redun- 1. The Huffman code (00, 01, 10, dancy to Huffman codes in order to permit 1100, 1101, 1110, 1111) is not completely detection of certain types of errors. Clearly self-synchronizing but is partially self- this has to be done carefully so as not to synchronizing since suffixes 00, 01, and 10 negate the redundancy reduction provided are synchronized by any codeword. The by Huffman coding. McIntyre and Pechura Huffman code (000, 0010, 0011, 01, 100, [1985] cite data integrity as an advantage 1010, 1011, 100, 111) is never self-synchro- of the codebook approach discussed in Sec- nizing. Examples of never self-synchro- tion 3.2. When the code is stored separately nizing Huffman codes are difficult to from the coded data, the code may be construct, and the example above is the backed up to protect it from perturbation. only one with fewer than 16 source mes- However, when the code is stored or trans- sages. Stiffler [1971] proves that a code is mitted with the data, it is susceptible to never self-synchronizing if and only if none errors. An error in the code representation of the proper suffixes of the codewords are constitutes a drastic loss, and therefore ex- themselves codewords. treme measures for protecting this part of The conclusions that may be drawn from the transmission are justified. the above discussion are as follows: Al- The Elias codes of Section 3.3 are not at though it is common for Huffman codes to all robust. Each of the codes y and 6 can be

ACM Computing Surveys, Vol. 19, No. 3, September 1987 292 l D. A. Lelewer and D. S. Hirschberg

001 1 0.00 1 00.0 00100 0. Coded integers 6,4,8 0 1 1.0 00 1 00 0.00100.0 Bit 2 is lost, interpreted as 3, 8, 2, etc. 011.1.0 00 1 00 0.00100.0 Bit 2 is inverted, interpreted as 3, 1,8,4, etc. 000 1 0 00.1.00 0 00100 0 Bit 3 is inverted, interpreted as 8, 1, etc.

Figure 27. Effects of errors in Elias codes.

000011.000011.00011.010 11.01011. Coded ensemble “au bb” 00 011.000011.00011.010 11.01011. Bit 3 is lost, interpreted as “u bb” 001011.000011.00011.010 11.01011. Bit 3 is inverted, interpreted as “?a bb” 00001 000011.00011.010 11.01011. Bit 6 is lost, interpreted as “? bb” 000010 000011.00011.010 11.01011. Bit 6 is inverted, interpreted as “? bb” 000011.000011.00011.011.11.01011. Bit 20 is inverted, interpreted as “aa fgb”

Figure 28. Effects of errors in Fibonacci codes.

The Elias codes of Section 3.3 are not at all robust. Each of the codes γ and δ can be thought of as generating codewords that consist of a number of substrings such that each substring encodes the length of the subsequent substring. For code γ we may think of each codeword γ(x) as the concatenation of z, a string of n zeros, and b, a string of length n + 1 (n = ⌊log x⌋). If one of the zeros in substring z is lost, synchronization will be lost since the last symbol of b will be pushed into the next codeword. Since the 1 at the front of substring b delimits the end of z, if a zero in z is changed to a 1, synchronization will be lost as symbols from b are pushed into the following codeword. Similarly, if ones at the front of b are inverted to zeros, synchronization will be lost since the codeword γ(x) consumes symbols from the following codeword. Once synchronization is lost, it cannot normally be recovered.

In Figure 27, codewords γ(6), γ(4), and γ(8) are used to illustrate the above ideas. In each case, synchronization is lost and never recovered.

The Elias code δ may be thought of as a three-part ramp, where δ(x) = zmb with z a string of n zeros, m a string of length n + 1 with binary value u, and b a string of length u - 1. For example, in δ(16) = 00.101.0000, n = 2, u = 5, and the final substring is the binary value of 16 with the leading 1 removed so that it has length u - 1 = 4. Again, the fact that each substring determines the length of the subsequent substring means that an error in one of the first two substrings is disastrous, changing the way in which the rest of the codeword is to be interpreted. And, like code γ, code δ has no properties that aid in regaining synchronization once it has been lost.

The Fibonacci codes of Section 3.3, on the other hand, are quite robust. This robustness is due to the fact that every codeword ends in the substring 11 and that substring can appear nowhere else in a codeword. If an error occurs anywhere other than in the 11 substring, the error is contained within that one codeword. It is possible that one codeword will become two (see the sixth line of Figure 28), but no other codewords will be disturbed. If the last symbol of a codeword is lost or changed, the current codeword will be fused with its successor so that two codewords are lost. When the penultimate bit is disturbed, up to three codewords can be lost. For example, the coded message 011.11.011 becomes 0011.1011 if bit 2 is inverted. The maximum disturbance resulting from either an amplitude error or a phase error is the disturbance of three codewords.

In Figure 28, some illustrations based on the Fibonacci coding of ensemble EXAMPLE as shown in Figure 12 are given. When bit 3 (which is not part of an 11 substring) is lost or changed, only a single codeword is degraded. When bit 6 (the final bit of the first codeword) is lost or changed, the first two codewords are incorrectly decoded. When bit 20 is changed, the first b is incorrectly decoded as fg.
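The sensitivity of code γ can be reproduced directly from the definition above, since γ(x) is simply n zeros followed by the (n + 1)-bit binary value of x. The Python sketch below is our own illustration; it encodes the integers 6, 4, and 8 and shows how the loss of bit 2 derails the remainder of the decoding, as in Figure 27.

    def gamma_encode(x):
        # gamma(x): n zeros followed by the (n + 1)-bit binary value of x, n = floor(log2 x)
        b = bin(x)[2:]
        return "0" * (len(b) - 1) + b

    def gamma_decode(bits):
        values, i = [], 0
        while i < len(bits):
            n = 0
            while i < len(bits) and bits[i] == "0":   # the run of zeros announces the length
                n += 1
                i += 1
            if i + n + 1 > len(bits):                 # incomplete codeword at the end of the stream
                break
            values.append(int(bits[i:i + n + 1], 2))  # read the (n + 1)-bit binary value
            i += n + 1
        return values

    stream = "".join(gamma_encode(x) for x in (6, 4, 8))  # 00110 00100 0001000
    print(gamma_decode(stream))                   # [6, 4, 8]
    print(gamma_decode(stream[:1] + stream[2:]))  # bit 2 lost: [3, 8, 4], as in Figure 27

The containment property of the Fibonacci codes can be illustrated in the same spirit. The sketch below is again our own and relies only on the terminator property described above: because every codeword ends in 11 and 11 occurs nowhere else inside a codeword, a decoder can simply cut the stream after each occurrence of 11, and an error that does not touch a terminator stays inside one codeword.

    def split_codewords(bits):
        # Cut the coded stream after each codeword terminator "11"; every Fibonacci
        # codeword ends in 11, and 11 appears nowhere else within a codeword.
        words, current = [], ""
        for bit in bits:
            current += bit
            if current.endswith("11"):
                words.append(current)
                current = ""
        return words

    coded = "000011" + "000011" + "00011" + "01011" + "01011"   # the coded ensemble of Figure 28
    print(split_codewords(coded))
    # ['000011', '000011', '00011', '01011', '01011']
    print(split_codewords("001011" + coded[6:]))
    # bit 3 inverted: only the first codeword changes; the other four are untouched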


7.2 Adaptive Codes

Adaptive codes are far more adversely affected by transmission errors than static codes. For example, in the case of an adaptive Huffman code, even though the receiver may resynchronize with the sender in terms of correctly locating the beginning of a codeword, the information lost represents more than a few bits or a few characters of the source ensemble. The fact that sender and receiver are dynamically redefining the code indicates that by the time synchronization is regained, they may have radically different representations of the code. Synchronization as defined in Section 7.1 refers to synchronization of the bit stream, which is not sufficient for adaptive methods. What is needed here is code synchronization, that is, synchronization of both the bit stream and the dynamic data structure representing the current code mapping.

There is no evidence that adaptive methods are self-synchronizing. Bentley et al. [1986] note that, in algorithm BSTW, loss of synchronization can be catastrophic, whereas this is not true with static Huffman coding. Ziv and Lempel [1977] recognize that the major drawback of their algorithm is its susceptibility to error propagation. Welch [1984] also considers the problem of error tolerance of Lempel-Ziv codes and suggests that the entire ensemble be embedded in an error-detecting code. Neither static nor adaptive arithmetic coding has the ability to tolerate errors.

8. NEW DIRECTIONS

Data compression is still very much an active research area. This section suggests possibilities for further study.

The discussion of Section 7 illustrates the susceptibility to error of the codes presented in this survey. Strategies for increasing the reliability of these codes while incurring only a moderate loss of efficiency would be of great value. This area appears to be largely unexplored. Possible approaches include embedding the entire ensemble in an error-correcting code or reserving one or more codewords to act as error flags. For adaptive methods it may be necessary for receiver and sender to verify the current code mapping periodically.

For adaptive Huffman coding, Gallager [1978] suggests an "aging" scheme, whereby recent occurrences of a character contribute more to its frequency count than do earlier occurrences. This strategy introduces the notion of locality into the adaptive Huffman scheme. Cormack and Horspool [1984] describe an algorithm for approximating exponential aging. However, the effectiveness of this algorithm has not been established.

Both Knuth [1985] and Bentley et al. [1986] suggest the possibility of using the "cache" concept to exploit locality and minimize the effect of anomalous source messages. Preliminary empirical results indicate that this may be helpful. A problem related to the use of a cache is the overhead time required for deletion. Strategies for reducing the cost of a deletion could be considered. Another possible extension to algorithm BSTW is to investigate other locality heuristics. Bentley et al. [1986] prove that intermittent move-to-front (move-to-front after every k occurrences) is as effective as move-to-front. It should be noted that there are many other self-organizing methods yet to be considered. Horspool and Cormack [1987] describe experimental results that imply that the transpose heuristic performs as well as move-to-front and suggest that it is also easier to implement.
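For concreteness, the list-update heuristics mentioned above can be sketched as follows. This is a minimal illustration of the update rules only, not of any complete compression scheme, and the function names are ours.

    def move_to_front(lst, item):
        # Promote the accessed item to the head of the list.
        lst.remove(item)
        lst.insert(0, item)

    def transpose(lst, item):
        # Exchange the accessed item with its immediate predecessor.
        i = lst.index(item)
        if i > 0:
            lst[i - 1], lst[i] = lst[i], lst[i - 1]

    def intermittent_move_to_front(lst, item, counts, k):
        # Move to front only on every k-th occurrence of the item.
        counts[item] = counts.get(item, 0) + 1
        if counts[item] % k == 0:
            move_to_front(lst, item)

    symbols = ["a", "b", "c", "d"]
    move_to_front(symbols, "c")   # ["c", "a", "b", "d"]
    transpose(symbols, "d")       # ["c", "a", "d", "b"]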
Several aspects of free-parse methods merit further attention. Lempel-Ziv codes appear to be promising, although the absence of a worst-case bound on the redundancy of an individual finite source ensemble is a drawback. The variable-block type Lempel-Ziv codes have been implemented with some success [ARC 1986], and the construction of a variable-variable Lempel-Ziv code has been sketched [Ziv and Lempel 1978]. The efficiency of the variable-variable model should be investigated. In addition, an implementation of Lempel-Ziv coding that combines the time efficiency of the Rodeh et al. [1981] method with more efficient use of space is worthy of consideration.
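As a point of reference for the variable-block model, the greedy parse underlying codes of the Ziv and Lempel [1978] type can be sketched as follows. This is a simplified illustration of the parsing step only; the coding of the resulting pairs and all implementation detail are omitted, and the function name is ours.

    def lz78_parse(s):
        # Each phrase is the longest previously seen phrase extended by one new symbol,
        # emitted as a (phrase index, symbol) pair; index 0 denotes the empty phrase.
        dictionary = {"": 0}
        pairs, phrase = [], ""
        for ch in s:
            if phrase + ch in dictionary:
                phrase += ch                      # keep extending the current match
            else:
                pairs.append((dictionary[phrase], ch))
                dictionary[phrase + ch] = len(dictionary)
                phrase = ""
        if phrase:                                # flush a trailing, already-seen phrase
            pairs.append((dictionary[phrase[:-1]], phrase[-1]))
        return pairs

    print(lz78_parse("aababcabcd"))
    # [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]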


Another important research topic is the development of theoretical models for data compression that address the problem of local redundancy. Models based on Markov chains can be exploited to take advantage of interaction between groups of symbols. Entropy tends to be overestimated when symbol interaction is not considered. Models that exploit relationships between source messages may achieve better compression than predicted by an entropy calculation based only on symbol probabilities. The use of Markov modeling is considered by Llewellyn [1987] and by Langdon and Rissanen [1983].
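The overestimate that results from ignoring symbol interaction can be seen in a small calculation. The sketch below is an illustration added for concreteness (the function names are ours); it compares the entropy computed from individual symbol probabilities with the entropy of each symbol conditioned on its predecessor, in bits per symbol.

    import math
    from collections import Counter

    def order0_entropy(s):
        # Entropy computed from individual symbol probabilities only.
        counts, n = Counter(s), len(s)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def order1_entropy(s):
        # Entropy of each symbol conditioned on the symbol that precedes it.
        pairs = Counter(zip(s, s[1:]))
        contexts = Counter(s[:-1])
        n = len(s) - 1
        return -sum(c / n * math.log2(c / contexts[a]) for (a, b), c in pairs.items())

    text = "ababababab"
    print(order0_entropy(text))  # 1.0 bit per symbol
    print(order1_entropy(text))  # 0.0: conditioning on the previous symbol removes the apparent randomness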
9. SUMMARY

Data compression is a topic of much importance and many applications. Methods of data compression have been studied for almost four decades. This paper has provided an overview of data compression methods of general utility. The algorithms have been evaluated in terms of the amount of compression they provide, algorithm efficiency, and susceptibility to error. Although algorithm efficiency and susceptibility to error are relatively independent of the characteristics of the source ensemble, the amount of compression achieved depends on the characteristics of the source to a great extent.

Semantic dependent data compression techniques, as discussed in Section 2, are special-purpose methods designed to exploit local redundancy or context information. A semantic dependent scheme can usually be viewed as a special case of one or more general-purpose algorithms. It should also be noted that algorithm BSTW is a general-purpose technique that exploits locality of reference, a type of local redundancy.

Susceptibility to error is the main drawback of each of the algorithms presented here. Although channel errors are more devastating to adaptive algorithms than to static ones, it is possible for an error to propagate without limit even in the static case. Methods of limiting the effect of an error on the effectiveness of a data compression algorithm should be investigated.

REFERENCES

ABRAMSON, N. 1963. Information Theory and Coding. McGraw-Hill, New York.
APOSTOLICO, A., AND FRAENKEL, A. S. 1985. Robust transmission of unbounded strings using Fibonacci representations. Tech. Rep. CS85-14, Dept. of Applied Mathematics, The Weizmann Institute of Science, Rehovot, Israel.
ARC 1986. ARC File Archive Utility, Version 5.1. System Enhancement Associates, Wayne, N.J.
ASH, R. B. 1965. Information Theory. Interscience, New York.
BENTLEY, J. L., SLEATOR, D. D., TARJAN, R. E., AND WEI, V. K. 1986. A locally adaptive data compression scheme. Commun. ACM 29, 4 (Apr.), 320-330.
BOBROW, L. S., AND HAKIMI, S. L. 1969. Graph theoretic prefix codes and their synchronizing properties. Inf. Contr. 15, 1 (July), 70-94.
BRENT, R. P., AND KUNG, H. T. 1978. Fast algorithms for manipulating formal power series. J. ACM 25, 4 (Oct.), 581-595.
CAPOCELLI, R. M., GIANCARLO, R., AND TANEJA, I. J. 1986. Bounds on the redundancy of Huffman codes. IEEE Trans. Inf. Theory 32, 6 (Nov.), 854-857.
CAPPELLINI, V., ED. 1985. Data Compression and Error Control Techniques with Applications. Academic Press, London.
CONNELL, J. B. 1973. A Huffman-Shannon-Fano code. Proc. IEEE 61, 7 (July), 1046-1047.
CORMACK, G. V. 1985. Data compression on a database system. Commun. ACM 28, 12 (Dec.), 1336-1342.
CORMACK, G. V., AND HORSPOOL, R. N. 1984. Algorithms for adaptive Huffman codes. Inf. Process. Lett. 18, 3 (Mar.), 159-165.
CORTESI, D. 1982. An effective text-compression algorithm. BYTE 7, 1 (Jan.), 397-403.
COT, N. 1977. Characterization and design of optimal prefix codes. Ph.D. dissertation, Computer Science Dept., Stanford Univ., Stanford, Calif.
ELIAS, P. 1975. Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory 21, 2 (Mar.), 194-203.
ELIAS, P. 1987. Interval and recency rank source coding: Two on-line adaptive variable-length schemes. IEEE Trans. Inf. Theory 33, 1 (Jan.), 3-10.
FALLER, N. 1973. An adaptive system for data compression. In Record of the 7th Asilomar Conference on Circuits, Systems and Computers (Pacific Grove, Calif., Nov.). Naval Postgraduate School, Monterey, Calif., pp. 593-597.
FANO, R. M. 1949. Transmission of Information. M.I.T. Press, Cambridge, Mass.


FRAENKEL, A. S., AND KLEIN, S. T. 1985. Robust universal complete codes as alternatives to Huffman codes. Tech. Rep. CS85-16, Dept. of Applied Mathematics, The Weizmann Institute of Science, Rehovot, Israel.
FRAENKEL, A. S., MOR, M., AND PERL, Y. 1983. Is text compression by prefixes and suffixes practical? Acta Inf. 20, 4 (Dec.), 371-375.
GALLAGER, R. G. 1968. Information Theory and Reliable Communication. Wiley, New York.
GALLAGER, R. G. 1978. Variations on a theme by Huffman. IEEE Trans. Inf. Theory 24, 6 (Nov.), 668-674.
GAREY, M. R. 1974. Optimal binary search trees with restricted maximal depth. SIAM J. Comput. 3, 2 (June), 101-110.
GILBERT, E. N. 1971. Codes based on inaccurate source probabilities. IEEE Trans. Inf. Theory 17, 3 (May), 304-314.
GILBERT, E. N., AND MOORE, E. F. 1959. Variable-length binary encodings. Bell Syst. Tech. J. 38, 4 (July), 933-967.
GLASSEY, C. R., AND KARP, R. M. 1976. On the optimality of Huffman trees. SIAM J. Appl. Math. 31, 2 (Sept.), 368-378.
GONZALEZ, R. C., AND WINTZ, P. 1977. Digital Image Processing. Addison-Wesley, Reading, Mass.
HAHN, B. 1974. A new technique for compression and storage of data. Commun. ACM 17, 8 (Aug.), 434-436.
HESTER, J. H., AND HIRSCHBERG, D. S. 1985. Self-organizing linear search. ACM Comput. Surv. 17, 3 (Sept.), 295-311.
HORSPOOL, R. N., AND CORMACK, G. V. 1987. A locally adaptive data compression scheme. Commun. ACM 30, 9 (Sept.), 792-794.
HU, T. C., AND TAN, K. C. 1972. Path length of binary search trees. SIAM J. Appl. Math. 22, 2 (Mar.), 225-234.
HU, T. C., AND TUCKER, A. C. 1971. Optimal computer search trees and variable-length alphabetic codes. SIAM J. Appl. Math. 21, 4 (Dec.), 514-532.
HUFFMAN, D. A. 1952. A method for the construction of minimum-redundancy codes. Proc. IRE 40, 9 (Sept.), 1098-1101.
INGELS, F. M. 1971. Information and Coding Theory. Intext, Scranton, Penn.
ITAI, A. 1976. Optimal alphabetic trees. SIAM J. Comput. 5, 1 (Mar.), 9-18.
KARP, R. M. 1961. Minimum redundancy coding for the discrete noiseless channel. IRE Trans. Inf. Theory 7, 1 (Jan.), 27-38.
KNUTH, D. E. 1971. Optimum binary search trees. Acta Inf. 1, 1 (Jan.), 14-25.
KNUTH, D. E. 1985. Dynamic Huffman coding. J. Algorithms 6, 2 (June), 163-180.
KRAUSE, R. M. 1962. Channels which transmit letters of unequal duration. Inf. Control 5, 1 (Mar.), 13-24.
LAESER, R. P., MCLAUGHLIN, W. I., AND WOLFF, D. M. 1986. Engineering Voyager 2's encounter with Uranus. Sci. Am. 255, 5 (Nov.), 36-45.
LANGDON, G. G., AND RISSANEN, J. J. 1983. A double-adaptive file compression algorithm. IEEE Trans. Comm. 31, 11 (Nov.), 1253-1255.
LLEWELLYN, J. A. 1987. Data compression for a source with Markov characteristics. Comput. J. 30, 2, 149-156.
MCINTYRE, D. R., AND PECHURA, M. A. 1985. Data compression using static Huffman code-decode tables. Commun. ACM 28, 6 (June), 612-616.
MEHLHORN, K. 1980. An efficient algorithm for constructing nearly optimal prefix codes. IEEE Trans. Inf. Theory 26, 5 (Sept.), 513-517.
NEUMANN, P. G. 1962. Efficient error-limiting variable-length codes. IRE Trans. Inf. Theory 8, 4 (July), 292-304.
PARKER, D. S. 1980. Conditions for the optimality of the Huffman algorithm. SIAM J. Comput. 9, 3 (Aug.), 470-489.
PASCO, R. 1976. Source coding algorithms for fast data compression. Ph.D. dissertation, Dept. of Electrical Engineering, Stanford Univ., Stanford, Calif.
PECHURA, M. 1982. File archival techniques using data compression. Commun. ACM 25, 9 (Sept.), 605-609.
PERL, Y., GAREY, M. R., AND EVEN, S. 1975. Efficient generation of optimal prefix code: Equiprobable words using unequal cost letters. J. ACM 22, 2 (Apr.), 202-214.
PKARC FAST! 1987. PKARC FAST! File Archival Utility, Version 3.5. PKWARE, Inc., Glendale, Wis.
REGHBATI, H. K. 1981. An overview of data compression techniques. Computer 14, 4 (Apr.), 71-75.
RISSANEN, J. J. 1976. Generalized Kraft inequality and arithmetic coding. IBM J. Res. Dev. 20 (May), 198-203.
RISSANEN, J. J. 1983. A universal data compression system. IEEE Trans. Inf. Theory 29, 5 (Sept.), 656-664.
RODEH, M., PRATT, V. R., AND EVEN, S. 1981. Linear algorithm for data compression via string matching. J. ACM 28, 1 (Jan.), 16-24.
RUBIN, F. 1976. Experiments in text file compression. Commun. ACM 19, 11 (Nov.), 617-623.
RUBIN, F. 1979. Arithmetic stream coding using fixed precision registers. IEEE Trans. Inf. Theory 25, 6 (Nov.), 672-675.
RUDNER, B. 1971. Construction of minimum-redundancy codes with optimal synchronizing property. IEEE Trans. Inf. Theory 17, 4 (July), 478-487.
RUTH, S. S., AND KREUTZER, P. J. 1972. Data compression for large business files. Datamation 18, 9 (Sept.), 62-66.


RYABKO, B. Y. 1987. A locally adaptive data compression scheme. Commun. ACM 30, 9 (Sept.), 792.
SAMET, H. 1984. The Quadtree and related hierarchical data structures. ACM Comput. Surv. 16, 2 (June), 187-260.
SCHWARTZ, E. S. 1964. An optimum encoding with minimum longest code and total number of digits. Inf. Control 7, 1 (Mar.), 37-44.
SCHWARTZ, E. S., AND KALLICK, B. 1964. Generating a canonical prefix encoding. Commun. ACM 7, 3 (Mar.), 166-169.
SEVERANCE, D. G. 1983. A practitioner's guide to data base compression. Inf. Syst. 8, 1, 51-62.
SHANNON, C. E., AND WEAVER, W. 1949. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill.
SNYDERMAN, M., AND HUNT, B. 1970. The myriad virtues of text compaction. Datamation 16, 12 (Dec.), 36-40.
STANDISH, T. A. 1980. Data Structure Techniques. Addison-Wesley, Reading, Mass.
STIFFLER, J. J. 1971. Theory of Synchronous Communications. Prentice-Hall, Englewood Cliffs, N.J.
STORER, J. A., AND SZYMANSKI, T. G. 1982. Data compression via textual substitution. J. ACM 29, 4 (Oct.), 928-951.
TANAKA, H. 1987. Data structure of Huffman codes and its application to efficient encoding and decoding. IEEE Trans. Inf. Theory 33, 1 (Jan.), 154-156.
TROPPER, R. 1982. Binary-coded text, a text-compression method. BYTE 7, 4 (Apr.), 398-413.
UNIX. 1984. UNIX User's Manual, Version 4.2. Berkeley Software Distribution, Virtual VAX-11 Version, Univ. of California, Berkeley, Calif.
VARN, B. 1971. Optimal variable length codes (arbitrary symbol cost and equal code word probability). Inf. Control 19, 4 (Nov.), 289-301.
VITTER, J. S. 1987. Design and analysis of dynamic Huffman codes. J. ACM 34, 4 (Oct.), 825-845.
WAGNER, R. A. 1973. Common phrases and minimum-space text storage. Commun. ACM 16, 3 (Mar.), 148-152.
WELCH, T. A. 1984. A technique for high-performance data compression. Computer 17, 6 (June), 8-19.
WILKINS, L. C., AND WINTZ, P. A. 1971. Bibliography on data compression, picture properties and picture coding. IEEE Trans. Inf. Theory 17, 2, 180-197.
WITTEN, I. H., NEAL, R. M., AND CLEARY, J. G. 1987. Arithmetic coding for data compression. Commun. ACM 30, 6 (June), 520-540.
ZIMMERMAN, S. 1959. An optimal search procedure. Am. Math. Monthly 66 (Oct.), 690-693.
ZIV, J., AND LEMPEL, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 3 (May), 337-343.
ZIV, J., AND LEMPEL, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24, 5 (Sept.), 530-536.

Received May 1987; final revision accepted January 1988.
