Arithmetic Coding for Data Compression

COMPUTING PRACTICES
Edgar H. Sibley, Panel Editor

The state of the art in data compression is arithmetic coding, not the better-known Huffman method. Arithmetic coding gives greater compression, is faster for adaptive models, and clearly separates the model from the channel encoding.

ARITHMETIC CODING FOR DATA COMPRESSION

IAN H. WITTEN, RADFORD M. NEAL, and JOHN G. CLEARY

Financial support for this work has been provided by the Natural Sciences and Engineering Research Council of Canada. UNIX is a registered trademark of AT&T Bell Laboratories.

Arithmetic coding is superior in most respects to the better-known Huffman [10] method. It represents information at least as compactly, and sometimes considerably more so. Its performance is optimal without the need for blocking of input data. It encourages a clear separation between the model for representing data and the encoding of information with respect to that model. It accommodates adaptive models easily and is computationally efficient. Yet many authors and practitioners seem unaware of the technique. Indeed there is a widespread belief that Huffman coding cannot be improved upon.

We aim to rectify this situation by presenting an accessible implementation of arithmetic coding and by detailing its performance characteristics. We start by briefly reviewing basic concepts of data compression and introducing the model-based approach that underlies most modern techniques. We then outline the idea of arithmetic coding using a simple example, before presenting programs for both encoding and decoding. In these programs the model occupies a separate module so that different models can easily be used. Next we discuss the construction of fixed and adaptive models and detail the compression efficiency and execution time of the programs, including the effect of different arithmetic word lengths on compression efficiency. Finally, we outline a few applications where arithmetic coding is appropriate.

DATA COMPRESSION

To many, data compression conjures up an assortment of ad hoc techniques such as conversion of spaces in text to tabs, creation of special codes for common words, or run-length coding of picture data (e.g., see [8]). This contrasts with the more modern model-based paradigm for coding, where, from an input string of symbols and a model, an encoded string is produced that is (usually) a compressed version of the input. The decoder, which must have access to the same model, regenerates the exact input string from the encoded string. Input symbols are drawn from some well-defined set such as the ASCII or binary alphabets; the encoded string is a plain sequence of bits. The model is a way of calculating, in any given context, the distribution of probabilities for the next input symbol. It must be possible for the decoder to produce exactly the same probability distribution in the same context. Compression is achieved by transmitting the more probable symbols in fewer bits than the less probable ones.

For example, the model may assign a predetermined probability to each symbol in the ASCII alphabet. No context is involved. These probabilities can be determined by counting frequencies in representative samples of text to be transmitted. Such a fixed model is communicated in advance to both encoder and decoder, after which it is used for many messages.
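To make the notion of a fixed model concrete, the following is a minimal C sketch, not taken from the article's programs, of how such predetermined probabilities might be estimated by counting symbol frequencies in a representative sample; the sample string and all names are invented for illustration.

    #include <stdio.h>

    /* Illustrative sketch only (not the article's code): estimating a fixed
     * model by counting symbol frequencies in a representative sample.
     * The resulting table would be agreed on by encoder and decoder in
     * advance and then reused for many messages.
     */
    int main(void)
    {
        const char *sample = "a representative sample of the text to be transmitted";
        long count[256] = {0};
        long total = 0;

        for (const unsigned char *p = (const unsigned char *)sample; *p; p++) {
            count[*p]++;
            total++;
        }

        for (int c = 0; c < 256; c++)
            if (count[c] > 0)
                printf("'%c': probability %.3f\n", c, (double)count[c] / total);

        return 0;
    }

An adaptive model, described next, would instead update such counts as each symbol of the actual message is processed.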
Alternatively, the probabilities that an adaptive model assigns may change as each symbol is transmitted, based on the symbol frequencies seen so far in the message. There is no need for a representative sample of text, because each message is treated as an independent unit, starting from scratch. The encoder's model changes with each symbol transmitted, and the decoder's changes with each symbol received, in sympathy.

More complex models can provide more accurate probabilistic predictions and hence achieve greater compression. For example, several characters of previous context could condition the next-symbol probability. Such methods have enabled mixed-case English text to be encoded in around 2.2 bits/character with two quite different kinds of model [4, 6]. Techniques that do not separate modeling from coding so distinctly, like that of Ziv and Lempel [23], do not seem to show such great potential for compression, although they may be appropriate when the aim is raw speed rather than compression performance [22].

The effectiveness of any model can be measured by the entropy of the message with respect to it, usually expressed in bits/symbol. Shannon's fundamental theorem of coding states that, given messages randomly generated from a model, it is impossible to encode them into fewer bits (on average) than the entropy of that model [21].
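As a small worked illustration of this entropy measure (again not code from the article), the C fragment below computes the sum of -p * log2(p) over a probability distribution; the example values happen to be those of the fixed model in Table I below.

    #include <math.h>
    #include <stdio.h>

    /* Illustrative sketch (not the article's code): the entropy of a model,
     * H = -sum over symbols of p * log2(p), in bits/symbol.  No coder can,
     * on average, do better than this for messages drawn from the model.
     */
    static double entropy_bits(const double *p, int n)
    {
        double h = 0.0;
        for (int i = 0; i < n; i++)
            if (p[i] > 0.0)
                h -= p[i] * log2(p[i]);
        return h;
    }

    int main(void)
    {
        /* Probabilities of the six-symbol fixed model of Table I below. */
        const double p[] = {0.2, 0.3, 0.1, 0.2, 0.1, 0.1};
        printf("entropy = %.3f bits/symbol\n", entropy_bits(p, 6));
        return 0;
    }

For this particular distribution the entropy works out to roughly 2.45 bits/symbol.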
A message can be coded with respect to a model using either Huffman or arithmetic coding. The former method is frequently advocated as the best possible technique for reducing the encoded data rate. It is not. Given that each symbol in the alphabet must translate into an integral number of bits in the encoding, Huffman coding indeed achieves "minimum redundancy." In other words, it performs optimally if all symbol probabilities are integral powers of 1/2. But this is not normally the case in practice; indeed, Huffman coding can take up to one extra bit per symbol. The worst case is realized by a source in which one symbol has probability approaching unity. Symbols emanating from such a source convey negligible information on average, but require at least one bit to transmit [7]. Arithmetic coding dispenses with the restriction that each symbol must translate into an integral number of bits, thereby coding more efficiently. It actually achieves the theoretical entropy bound to compression efficiency for any source, including the one just mentioned.

In general, sophisticated models expose the deficiencies of Huffman coding more starkly than simple ones. This is because they more often predict symbols with probabilities close to one, the worst case for Huffman coding. For example, the techniques mentioned above that code English text in 2.2 bits/character both use arithmetic coding as the final step, and performance would be impacted severely if Huffman coding were substituted. Nevertheless, since our topic is coding and not modeling, the illustrations in this article all employ simple models. Even so, as we shall see, Huffman coding is inferior to arithmetic coding.

The basic concept of arithmetic coding can be traced back to Elias in the early 1960s (see [1, pp. 61-62]). Practical techniques were first introduced by Rissanen [16] and Pasco [15], and developed further by Rissanen [17]. Details of the implementation presented here have not appeared in the literature before; Rubin [20] is closest to our approach. The reader interested in the broader class of arithmetic codes is referred to [18]; a tutorial is available in [13]. Despite these publications, the method is not widely known. A number of recent books and papers on data compression mention it only in passing, or not at all.

THE IDEA OF ARITHMETIC CODING

In arithmetic coding, a message is represented by an interval of real numbers between 0 and 1. As the message becomes longer, the interval needed to represent it becomes smaller, and the number of bits needed to specify that interval grows. Successive symbols of the message reduce the size of the interval in accordance with the symbol probabilities generated by the model. The more likely symbols reduce the range by less than the unlikely symbols and hence add fewer bits to the message.

Before anything is transmitted, the range for the message is the entire interval [0, 1), denoting the half-open interval 0 <= x < 1. As each symbol is processed, the range is narrowed to that portion of it allocated to the symbol. For example, suppose the alphabet is {a, e, i, o, u, !}, and a fixed model is used with the probabilities shown in Table I.

TABLE I. Example Fixed Model for Alphabet {a, e, i, o, u, !}

    Symbol    Probability    Range
    a         .2             [0, 0.2)
    e         .3             [0.2, 0.5)
    i         .1             [0.5, 0.6)
    o         .2             [0.6, 0.8)
    u         .1             [0.8, 0.9)
    !         .1             [0.9, 1.0)

Imagine transmitting the message eaii!. Initially, both encoder and decoder know that the range is [0, 1). After seeing the first symbol, e, the encoder narrows it to [0.2, 0.5), the range the model allocates to this symbol. The second symbol, a, will narrow this new range to the first one-fifth of it, since a has been allocated [0, 0.2). This produces [0.2, 0.26), since the previous range was 0.3 units long and one-fifth of that is 0.06. The next symbol, i, is allocated [0.5, 0.6), which when applied to [0.2, 0.26) gives the smaller range [0.23, 0.236). Proceeding in this way, the encoded message builds up as follows:

    Initially          [0, 1)
    After seeing e     [0.2, 0.5)
                 a     [0.2, 0.26)
                 i     [0.23, 0.236)
                 i     [0.233, 0.2336)
                 !     [0.23354, 0.2336)

The successive narrowing is depicted graphically in Figure 1a, where the first symbol has reduced the range to [0.2, 0.5). The second symbol scales it again into the range [0.2, 0.26). But the picture cannot be continued in this way without a magnifying glass! Consequently, Figure 1b shows the ranges expanded to full height at every stage and marked with a scale that gives the endpoints as numbers.

Suppose all the decoder knows about the message is the final range, [0.23354, 0.2336). It can immediately deduce that the first character was e, since the range lies entirely within the space the model of Table I allocates for e.
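The narrowing steps above, and the decoder's deduction from the final range, can be simulated directly in a few lines of C. The sketch below is only a floating-point illustration; it is not the incremental, integer-arithmetic implementation the article presents later, and a real coder must avoid the loss of precision this naive version would suffer on long messages.

    #include <stdio.h>

    /* Illustrative floating-point simulation of the example in the text
     * (message "eaii!" under the fixed model of Table I).  Not the
     * article's implementation: a practical coder works incrementally
     * with integer arithmetic to avoid running out of precision.
     */
    #define NSYM 6

    static const char   sym[NSYM]   = { 'a', 'e', 'i', 'o', 'u', '!' };
    static const double lo_of[NSYM] = { 0.0, 0.2, 0.5, 0.6, 0.8, 0.9 };
    static const double hi_of[NSYM] = { 0.2, 0.5, 0.6, 0.8, 0.9, 1.0 };

    static int index_of(char c)
    {
        for (int i = 0; i < NSYM; i++)
            if (sym[i] == c)
                return i;
        return -1;
    }

    int main(void)
    {
        const char *message = "eaii!";
        double low = 0.0, high = 1.0;

        printf("Initially        [%g, %g)\n", low, high);
        for (const char *p = message; *p != '\0'; p++) {
            int s = index_of(*p);
            if (s < 0)
                continue;                   /* symbol not in the model */
            double range = high - low;
            high = low + range * hi_of[s];  /* 'low' still holds the old value */
            low  = low + range * lo_of[s];
            printf("After seeing %c   [%g, %g)\n", *p, low, high);
        }

        /* The decoder's first step: the final interval lies entirely inside
         * the range Table I allocates to 'e', so the first symbol was 'e'.
         */
        for (int i = 0; i < NSYM; i++)
            if (lo_of[i] <= low && high <= hi_of[i]) {
                printf("first symbol must have been %c\n", sym[i]);
                break;
            }
        return 0;
    }

Apart from floating-point rounding, the successive intervals it prints reproduce the listing above, ending with [0.23354, 0.2336).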
