COMPUTING PRACTICES

Edgar H. Sibley, Panel Editor

The state of the art in data compression is arithmetic coding, not the better-known Huffman method. Arithmetic coding gives greater compression, is faster for adaptive models, and clearly separates the model from the channel encoding.

ARITHMETIC CODING FOR DATA COMPRESSION

IAN H. WITTEN, RADFORD M. NEAL, and JOHN G. CLEARY

Arithmetic coding is superior in most respects to the better-known Huffman [10] method. It represents information at least as compactly, sometimes considerably more so. Its performance is optimal without the need for blocking of input data. It encourages a clear separation between the model for representing data and the encoding of information with respect to that model. It accommodates adaptive models easily and is computationally efficient. Yet many authors and practitioners seem unaware of the technique. Indeed there is a widespread belief that Huffman coding cannot be improved upon.

We aim to rectify this situation by presenting an accessible implementation of arithmetic coding and by detailing its performance characteristics. We start by briefly reviewing basic concepts of data compression and introducing the model-based approach that underlies most modern techniques. We then outline the idea of arithmetic coding using a simple example, before presenting programs for both encoding and decoding. In these programs the model occupies a separate module so that different models can easily be used. Next we discuss the construction of fixed and adaptive models and detail the compression efficiency and execution time of the programs, including the effect of different arithmetic word lengths on compression efficiency. Finally, we outline a few applications where arithmetic coding is appropriate.

DATA COMPRESSION
To many, data compression conjures up an assortment of ad hoc techniques such as conversion of spaces in text to tabs, creation of special codes for common words, or run-length coding of picture data (e.g., see [8]). This contrasts with the more modern model-based paradigm for coding, where, from an input string of symbols and a model, an encoded string is produced that is (usually) a compressed version of the input. The decoder, which must have access to the same model, regenerates the exact input string from the encoded string. Input symbols are drawn from some well-defined set such as the ASCII or binary alphabets; the encoded string is a plain sequence of bits. The model is a way of calculating, in any given context, the distribution of probabilities for the next input symbol. It must be possible for the decoder to produce exactly the same probability distribution in the same context. Compression is achieved by transmitting the more probable symbols in fewer bits than the less probable ones.

For example, the model may assign a predetermined probability to each symbol in the ASCII alphabet. No context is involved. These probabilities can be determined by counting frequencies in representative samples of text to be transmitted. Such a fixed model is communicated in advance to both encoder and decoder, after which it is used for many messages.

Alternatively, the probabilities that an adaptive model assigns may change as each symbol is transmitted, based on the symbol frequencies seen so far in the message.

Financial support for this work has been provided by the Natural Sciences and Engineering Research Council of Canada.
UNIX is a registered trademark of AT&T Bell Laboratories.
© 1987 ACM 0001-0782/87/0600-0520 75¢

There is no need for a representative sample of text, because each message is treated as an independent unit, starting from scratch. The encoder's model changes with each symbol transmitted, and the decoder's changes with each symbol received, in sympathy.

More complex models can provide more accurate probabilistic predictions and hence achieve greater compression. For example, several characters of previous context could condition the next-symbol probability. Such methods have enabled mixed-case English text to be encoded in around 2.2 bits/character with two quite different kinds of model [4, 6]. Techniques that do not separate modeling from coding so distinctly, like that of Ziv and Lempel [23], do not seem to show such great potential for compression, although they may be appropriate when the aim is raw speed rather than compression performance [22].

The effectiveness of any model can be measured by the entropy of the message with respect to it, usually expressed in bits/symbol. Shannon's fundamental theorem of coding states that, given messages randomly generated from a model, it is impossible to encode them into less bits (on average) than the entropy of that model [21].

A message can be coded with respect to a model using either Huffman or arithmetic coding. The former method is frequently advocated as the best possible technique for reducing the encoded data rate. It is not. Given that each symbol in the alphabet must translate into an integral number of bits in the encoding, Huffman coding indeed achieves "minimum redundancy." In other words, it performs optimally if all symbol probabilities are integral powers of ½. But this is not normally the case in practice; indeed, Huffman coding can take up to one extra bit per symbol. The worst case is realized by a source in which one symbol has probability approaching unity. Symbols emanating from such a source convey negligible information on average, but require at least one bit to transmit [7]. Arithmetic coding dispenses with the restriction that each symbol must translate into an integral number of bits, thereby coding more efficiently. It actually achieves the theoretical entropy bound to compression efficiency for any source, including the one just mentioned.

In general, sophisticated models expose the deficiencies of Huffman coding more starkly than simple ones. This is because they more often predict symbols with probabilities close to one, the worst case for Huffman coding. For example, the techniques mentioned above that code English text in 2.2 bits/character both use arithmetic coding as the final step, and performance would be impacted severely if Huffman coding were substituted. Nevertheless, since our topic is coding and not modeling, the illustrations in this article all employ simple models. Even so, as we shall see, Huffman coding is inferior to arithmetic coding.

The basic concept of arithmetic coding can be traced back to Elias in the early 1960s (see [1, pp. 61-62]). Practical techniques were first introduced by Rissanen [16] and Pasco [15], and developed further by Rissanen [17]. Details of the implementation presented here have not appeared in the literature before; Rubin [20] is closest to our approach. The reader interested in the broader class of arithmetic codes is referred to [18]; a tutorial is available in [13]. Despite these publications, the method is not widely known. A number of recent books and papers on data compression mention it only in passing, or not at all.

THE IDEA OF ARITHMETIC CODING
In arithmetic coding, a message is represented by an interval of real numbers between 0 and 1. As the message becomes longer, the interval needed to represent it becomes smaller, and the number of bits needed to specify that interval grows. Successive symbols of the message reduce the size of the interval in accordance with the symbol probabilities generated by the model. The more likely symbols reduce the range by less than the unlikely symbols and hence add fewer bits to the message.

Before anything is transmitted, the range for the message is the entire interval [0, 1), denoting the half-open interval 0 ≤ x < 1. As each symbol is processed, the range is narrowed to that portion of it allocated to the symbol. For example, suppose the alphabet is {a, e, i, o, u, !}, and a fixed model is used with the probabilities shown in Table I.

TABLE I. Example Fixed Model for Alphabet {a, e, i, o, u, !}

Symbol    Probability    Range
a         .2             [0, 0.2)
e         .3             [0.2, 0.5)
i         .1             [0.5, 0.6)
o         .2             [0.6, 0.8)
u         .1             [0.8, 0.9)
!         .1             [0.9, 1.0)

Imagine transmitting the message eaii!. Initially, both encoder and decoder know that the range is [0, 1). After seeing the first symbol, e, the encoder narrows it to [0.2, 0.5), the range the model allocates to this symbol. The second symbol, a, will narrow this new range to the first one-fifth of it, since a has been allocated [0, 0.2).


This produces [0.2, 0.26), since the previous range was 0.3 units long and one-fifth of that is 0.06. The next symbol, i, is allocated [0.5, 0.6), which when applied to [0.2, 0.26) gives the smaller range [0.23, 0.236). Proceeding in this way, the encoded message builds up as follows:

    Initially          [0, 1)
    After seeing e     [0.2, 0.5)
                 a     [0.2, 0.26)
                 i     [0.23, 0.236)
                 i     [0.233, 0.2336)
                 !     [0.23354, 0.2336)

Figure 1 shows another representation of the encoding process. The vertical bars with ticks represent the symbol probabilities stipulated by the model. After the first symbol has been processed, the model is scaled into the range [0.2, 0.5), as shown in Figure 1a. The second symbol scales it again into the range [0.2, 0.26). But the picture cannot be continued in this way without a magnifying glass! Consequently, Figure 1b shows the ranges expanded to full height at every stage and marked with a scale that gives the endpoints as numbers.

Suppose all the decoder knows about the message is the final range, [0.23354, 0.2336). It can immediately deduce that the first character was e, since the range lies entirely within the space the model of Table I allocates for e. Now it can simulate the operation of the encoder:

    Initially          [0, 1)
    After seeing e     [0.2, 0.5)

This makes it clear that the second character is a, since this will produce the range

    After seeing a     [0.2, 0.26),

which entirely encloses the given range [0.23354, 0.2336). Proceeding like this, the decoder can identify the whole message.

It is not really necessary for the decoder to know both ends of the range produced by the encoder. Instead, a single number within the range, for example 0.23355, will suffice. (Other numbers, like 0.23354, 0.23357, or even 0.23354321, would do just as well.) However, the decoder will face the problem of detecting the end of the message, to determine when to stop decoding. After all, the single number 0.0 could represent any of a, aa, aaa, aaaa, .... To resolve the ambiguity, we ensure that each message ends with a special EOF symbol known to both encoder and decoder. For the alphabet of Table I, ! will be used to terminate messages, and only to terminate messages. When the decoder sees this symbol, it stops decoding.

FIGURE 1a. Representation of the Arithmetic Coding Process

FIGURE 1b. Representation of the Arithmetic Coding Process with the Interval Scaled Up at Each Stage
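To make the narrowing concrete in code, here is a minimal floating-point sketch (our own illustration, separate from the programs presented later; the arrays and variable names are ours) that reproduces the ranges above for the message eaii! under the model of Table I:

#include <stdio.h>
#include <string.h>

/* Fixed model of Table I: cumulative bounds for each symbol. */
static const char   *syms      = "aeiou!";
static const double  cum_lo[6] = {0.0, 0.2, 0.5, 0.6, 0.8, 0.9};
static const double  cum_hi[6] = {0.2, 0.5, 0.6, 0.8, 0.9, 1.0};

int main(void)
{
    const char *message = "eaii!";
    const char *p;
    double low = 0.0, high = 1.0;           /* Initial range [0, 1).   */
    for (p = message; *p; p++) {
        int s = strchr(syms, *p) - syms;    /* Index of this symbol.   */
        double range = high - low;
        high = low + range * cum_hi[s];     /* Narrow the range to the */
        low  = low + range * cum_lo[s];     /* portion allotted to *p. */
        printf("after %c: [%g, %g)\n", *p, low, high);
    }
    return 0;                               /* Ends at [0.23354, 0.2336). */
}

Run on the message eaii!, this prints exactly the sequence of ranges tabulated above.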


/* ARITHMETIC ENCODING ALGORITHM. */

/* Call encode_symbol repeatedly for each symbol in the message.         */
/* Ensure that a distinguished "terminator" symbol is encoded last, then */
/* transmit any value in the range [low, high).                          */

encode_symbol(symbol, cum_freq)
    range = high - low
    high  = low + range*cum_freq[symbol-1]
    low   = low + range*cum_freq[symbol]

/* ARITHMETIC DECODING ALGORITHM. */

/* "Value" is the number that has been received.                            */
/* Continue calling decode_symbol until the terminator symbol is returned. */

decode_symbol(cum_freq)
    find symbol such that
        cum_freq[symbol] <= (value-low)/(high-low) < cum_freq[symbol-1]
    /* This ensures that value lies within the new  */
    /* [low, high) range that will be calculated by */
    /* the following lines of code.                 */
    range = high - low
    high  = low + range*cum_freq[symbol-1]
    low   = low + range*cum_freq[symbol]
    return symbol

FIGURE 2. Pseudocode for the Encoding and Decoding Procedures
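As a concrete companion to Figure 2, the following runnable C sketch (our own illustration; exact floating-point arithmetic is assumed, which only works for short messages) decodes the single number 0.23355 against the model of Table I:

#include <stdio.h>

/* Fixed model of Table I: cumulative bounds for each symbol. */
static const char   *syms      = "aeiou!";
static const double  cum_lo[6] = {0.0, 0.2, 0.5, 0.6, 0.8, 0.9};
static const double  cum_hi[6] = {0.2, 0.5, 0.6, 0.8, 0.9, 1.0};

int main(void)
{
    double value = 0.23355, low = 0.0, high = 1.0, range;
    int s;
    for (;;) {
        double f = (value - low) / (high - low); /* Where value falls. */
        for (s = 0; f >= cum_hi[s]; s++)         /* Find that symbol.  */
            ;
        putchar(syms[s]);
        if (syms[s] == '!') break;               /* Terminator: stop.  */
        range = high - low;                      /* Narrow the range,  */
        high = low + range * cum_hi[s];          /* just as the        */
        low  = low + range * cum_lo[s];          /* encoder did.       */
    }
    putchar('\n');                               /* Prints "eaii!".    */
    return 0;
}

The loop finds the symbol whose allotted subinterval contains value, emits it, and narrows the range exactly as the encoder did, stopping at the terminator.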

Relative to the fixed model of Table I, the entropy of the five-symbol message eaii! is

    -log 0.3 - log 0.2 - log 0.1 - log 0.1 - log 0.1 = -log 0.00006 = 4.22

(using base 10, since the above encoding was performed in decimal). This explains why it takes five decimal digits to encode the message. In fact, the size of the final range is 0.2336 - 0.23354 = 0.00006, and the entropy is the negative logarithm of this figure. Of course, we normally work in binary, transmitting binary digits and measuring entropy in bits.

Five decimal digits seems a lot to encode a message comprising four vowels! It is perhaps unfortunate that our example ended up by expanding rather than compressing. Needless to say, however, different models will give different entropies. The best single-character model of the message eaii! is the set of symbol frequencies {e(0.2), a(0.2), i(0.4), !(0.2)}, which gives an entropy of 2.89 decimal digits. Using this model the encoding would be only three digits long. Moreover, as noted earlier, more sophisticated models give much better performance in general.

A PROGRAM FOR ARITHMETIC CODING
Figure 2 shows a pseudocode fragment that summarizes the encoding and decoding procedures developed in the last section. Symbols are numbered 1, 2, 3, .... The frequency range for the ith symbol is from cum_freq[i] to cum_freq[i - 1]. As i decreases, cum_freq[i] increases, and cum_freq[0] = 1. (The reason for this "backwards" convention is that cum_freq[0] will later contain a normalizing factor, and it will be convenient to have it begin the array.) The "current interval" is [low, high), and for both encoding and decoding, this should be initialized to [0, 1).

Unfortunately, Figure 2 is overly simplistic. In practice, there are several factors that complicate both encoding and decoding:

Incremental transmission and reception. The encode algorithm as described does not transmit anything until the entire message has been encoded; neither does the decode algorithm begin decoding until it has received the complete transmission. In most applications an incremental mode of operation is necessary.

The desire to use integer arithmetic. The precision required to represent the [low, high) interval grows with the length of the message. Incremental operation will help overcome this, but the potential for overflow and underflow must still be examined carefully.


Representing the model so that it can be consulted efficiently. The representation used for the model should minimize the time required for the decode algorithm to identify the next symbol. Moreover, an adaptive model should be organized to minimize the time-consuming task of maintaining cumulative frequencies.

Figure 3 shows working code, in C, for arithmetic encoding and decoding. It is considerably more detailed than the bare-bones sketch of Figure 2! Implementations of two different models are given in Figure 4; the Figure 3 code can use either one.

The remainder of this section examines the code of Figure 3 more closely, and includes a proof that decoding is still correct in the integer implementation and a review of constraints on word lengths in the program.

arithmetic_coding.h

 1 /* DECLARATIONS USED FOR ARITHMETIC ENCODING AND DECODING */
 2
 3
 4 /* SIZE OF ARITHMETIC CODE VALUES. */
 5
 6 #define Code_value_bits 16              /* Number of bits in a code value   */
 7 typedef long code_value;                /* Type of an arithmetic code value */
 8
 9 #define Top_value (((long)1<<Code_value_bits)-1)  /* Largest code value     */
10
11
12 /* HALF AND QUARTER POINTS IN THE CODE VALUE RANGE. */
13
14 #define First_qtr (Top_value/4+1)       /* Point after first quarter        */
15 #define Half      (2*First_qtr)         /* Point after first half           */
16 #define Third_qtr (3*First_qtr)         /* Point after third quarter        */

model.h

17 /* INTERFACE TO THE MODEL. */
18
19
20 /* THE SET OF SYMBOLS THAT MAY BE ENCODED. */
21
22 #define No_of_chars 256                 /* Number of character symbols      */
23 #define EOF_symbol (No_of_chars+1)      /* Index of EOF symbol              */
24
25 #define No_of_symbols (No_of_chars+1)   /* Total number of symbols          */
26
27
28 /* TRANSLATION TABLES BETWEEN CHARACTERS AND SYMBOL INDEXES. */
29
30 int char_to_index[No_of_chars];         /* To index from character          */
31 unsigned char index_to_char[No_of_symbols+1]; /* To character from index    */
32
33
34 /* CUMULATIVE FREQUENCY TABLE. */
35
36 #define Max_frequency 16383             /* Maximum allowed frequency count  */
37                                         /*   2^14 - 1                       */
38 int cum_freq[No_of_symbols+1];          /* Cumulative symbol frequencies    */

FIGURE 3. C Implementation of Arithmetic Encoding and Decoding


encode.c

39 /* MAIN PROGRAM FOR ENCODING. */
40
41 #include <stdio.h>
42 #include "model.h"
43
44 main()
45 {   start_model();                      /* Set up other modules.    */
46     start_outputing_bits();
47     start_encoding();
48     for (;;) {                          /* Loop through characters. */
49         int ch; int symbol;
50         ch = getc(stdin);               /* Read the next character. */
51         if (ch==EOF) break;             /* Exit loop on end-of-file.*/
52         symbol = char_to_index[ch];     /* Translate to an index.   */
53         encode_symbol(symbol,cum_freq); /* Encode that symbol.      */
54         update_model(symbol);           /* Update the model.        */
55     }
56     encode_symbol(EOF_symbol,cum_freq); /* Encode the EOF symbol.   */
57     done_encoding();                    /* Send the last few bits.  */
58     done_outputing_bits();
59     exit(0);
60 }

arithmetic_encode.c

61 /* ARITHMETIC ENCODING ALGORITHM. */
62
63 #include "arithmetic_coding.h"
64
65 static void bit_plus_follow();  /* Routine that follows             */
66
67
68 /* CURRENT STATE OF THE ENCODING. */
69
70 static code_value low, high;    /* Ends of the current code region          */
71 static long bits_to_follow;     /* Number of opposite bits to output after  */
72                                 /* the next bit.                            */
73
74
75 /* START ENCODING A STREAM OF SYMBOLS. */
76
77 start_encoding()
78 {   low = 0;                            /* Full code range.         */
79     high = Top_value;
80     bits_to_follow = 0;                 /* No bits to follow next.  */
81 }
82
83
84 /* ENCODE A SYMBOL. */
85
86 encode_symbol(symbol,cum_freq)
87     int symbol;                 /* Symbol to encode                 */
88     int cum_freq[];             /* Cumulative symbol frequencies    */
89 {   long range;                 /* Size of the current code region  */
90     range = (long)(high-low)+1;
91     high = low +                                /* Narrow the code region   */
92         (range*cum_freq[symbol-1])/cum_freq[0]-1; /* to that allotted to this */
93     low = low +                                 /* symbol.                  */
94         (range*cum_freq[symbol])/cum_freq[0];

FIGURE 3. C Implementation of Arithmetic Encoding and Decoding (continued)

 95     for (;;) {                          /* Loop to output bits.     */
 96         if (high<Half) {
 97             bit_plus_follow(0);         /* Output 0 if in low half. */
 98         }
 99         else if (low>=Half) {           /* Output 1 if in high half.*/
100             bit_plus_follow(1);
101             low -= Half;
102             high -= Half;               /* Subtract offset to top.  */
103         }
104         else if (low>=First_qtr         /* Output an opposite bit   */
105                 && high<Third_qtr) {    /* later if in middle half. */
106             bits_to_follow += 1;        /* Remember middle half.    */
107             low -= First_qtr;           /* Subtract offset to middle*/
108             high -= First_qtr;
109         }
110         else break;                     /* Otherwise exit loop.     */
111         low = 2*low;
112         high = 2*high+1;                /* Scale up code range.     */
113     }
114 }
115
116
117 /* FINISH ENCODING THE STREAM. */
118
119 done_encoding()
120 {   bits_to_follow += 1;                    /* Output two bits that     */
121     if (low<First_qtr) bit_plus_follow(0); /* select the quarter that  */
122     else bit_plus_follow(1);                /* the current code range   */
123 }                                           /* contains.                */
124
125
126 /* OUTPUT BITS PLUS FOLLOWING OPPOSITE BITS. */
127
128 static void bit_plus_follow(bit)
129     int bit;
130 {   output_bit(bit);                    /* Output the bit.          */
131     while (bits_to_follow>0) {
132         output_bit(!bit);               /* Output bits_to_follow    */
133         bits_to_follow -= 1;            /* opposite bits. Set       */
134     }                                   /* bits_to_follow to zero.  */
135 }

decode.c

136 /* MAIN PROGRAM FOR DECODING. */
137
138 #include <stdio.h>
139 #include "model.h"
140
141 main()
142 {   start_model();                      /* Set up other modules.     */
143     start_inputing_bits();
144     start_decoding();
145     for (;;) {                          /* Loop through characters.  */
146         int ch; int symbol;
147         symbol = decode_symbol(cum_freq); /* Decode next symbol.     */
148         if (symbol==EOF_symbol) break;  /* Exit loop if EOF symbol.  */
149         ch = index_to_char[symbol];     /* Translate to a character. */
150         putc(ch,stdout);                /* Write that character.     */
151         update_model(symbol);           /* Update the model.         */
152     }
153     exit(0);
154 }

FIGURE 3. C Implementation of Arithmetic Encoding and Decoding (continued)


arithmetic_decode.c

155 /* ARITHMETIC DECODING ALGORITHM. */
156
157 #include "arithmetic_coding.h"
158
159
160 /* CURRENT STATE OF THE DECODING. */
161
162 static code_value value;        /* Currently-seen code value        */
163 static code_value low, high;    /* Ends of current code region      */
164
165
166 /* START DECODING A STREAM OF SYMBOLS. */
167
168 start_decoding()
169 {   int i;
170     value = 0;                          /* Input bits to fill the   */
171     for (i = 1; i<=Code_value_bits; i++) {  /* code value.          */
172         value = 2*value+input_bit();
173     }
174     low = 0;                            /* Full code range.         */
175     high = Top_value;
176 }
177
178
179 /* DECODE THE NEXT SYMBOL. */
180
181 int decode_symbol(cum_freq)
182     int cum_freq[];             /* Cumulative symbol frequencies    */
183 {   long range;                 /* Size of current code region      */
184     int cum;                    /* Cumulative frequency calculated  */
185     int symbol;                 /* Symbol decoded                   */
186     range = (long)(high-low)+1;
187     cum =                               /* Find cum freq for value. */
188         (((long)(value-low)+1)*cum_freq[0]-1)/range;
189     for (symbol = 1; cum_freq[symbol]>cum; symbol++) ; /* Then find symbol. */
190     high = low +                                /* Narrow the code region   */
191         (range*cum_freq[symbol-1])/cum_freq[0]-1; /* to that allotted to this */
192     low = low +                                 /* symbol.                  */
193         (range*cum_freq[symbol])/cum_freq[0];
194     for (;;) {                          /* Loop to get rid of bits. */
195         if (high<Half) {
196             /* nothing */               /* Expand low half.         */
197         }
198         else if (low>=Half) {           /* Expand high half.        */
199             value -= Half;
200             low -= Half;                /* Subtract offset to top.  */
201             high -= Half;
202         }
203         else if (low>=First_qtr         /* Expand middle half.      */
204                 && high<Third_qtr) {
205             value -= First_qtr;
206             low -= First_qtr;           /* Subtract offset to middle*/
207             high -= First_qtr;
208         }
209         else break;                     /* Otherwise exit loop.     */
210         low = 2*low;
211         high = 2*high+1;                /* Scale up code range.     */
212         value = 2*value+input_bit();    /* Move in next input bit.  */
213     }
214     return symbol;
215 }

FIGURE 3. C Implementation of Arithmetic Encoding and Decoding (continued)

bit_input.c

216 /* BIT INPUT ROUTINES. */
217
218 #include <stdio.h>
219 #include "arithmetic_coding.h"
220
221
222 /* THE BIT BUFFER. */
223
224 static int buffer;              /* Bits waiting to be input         */
225 static int bits_to_go;          /* Number of bits still in buffer   */
226 static int garbage_bits;        /* Number of bits past end-of-file  */
227
228
229 /* INITIALIZE BIT INPUT. */
230
231 start_inputing_bits()
232 {   bits_to_go = 0;                     /* Buffer starts out with   */
233     garbage_bits = 0;                   /* no bits in it.           */
234 }
235
236
237 /* INPUT A BIT. */
238
239 int input_bit()
240 {   int t;
241     if (bits_to_go==0) {                /* Read the next byte if no */
242         buffer = getc(stdin);           /* bits are left in buffer. */
243         if (buffer==EOF) {
244             garbage_bits += 1;                  /* Return arbitrary bits*/
245             if (garbage_bits>Code_value_bits-2) { /* after eof, but check */
246                 fprintf(stderr,"Bad input file\n"); /* for too many such. */
247                 exit(-1);
248             }
249         }
250         bits_to_go = 8;
251     }
252     t = buffer&1;                       /* Return the next bit from */
253     buffer >>= 1;                       /* the bottom of the byte.  */
254     bits_to_go -= 1;
255     return t;
256 }

FIGURE 3. C Implementation of Arithmetic Encoding and Decoding (continued)

bit_output.c

257 /* BIT OUTPUT ROUTINES. */
258
259 #include <stdio.h>
260
261
262 /* THE BIT BUFFER. */
263
264 static int buffer;              /* Bits buffered for output         */
265 static int bits_to_go;          /* Number of bits free in buffer    */
266
267
268 /* INITIALIZE FOR BIT OUTPUT. */
269
270 start_outputing_bits()
271 {   buffer = 0;                         /* Buffer is empty to start */
272     bits_to_go = 8;                     /* with.                    */
273 }
274
275
276 /* OUTPUT A BIT. */
277
278 output_bit(bit)
279     int bit;
280 {   buffer >>= 1;                       /* Put bit in top of buffer.*/
281     if (bit) buffer |= 0x80;
282     bits_to_go -= 1;
283     if (bits_to_go==0) {                /* Output buffer if it is   */
284         putc(buffer,stdout);            /* now full.                */
285         bits_to_go = 8;
286     }
287 }
288
289
290 /* FLUSH OUT THE LAST BITS. */
291
292 done_outputing_bits()
293 {   putc(buffer>>bits_to_go,stdout);
294 }

FIGURE 3. C Implementation of Arithmetic Encoding and Decoding (continued)


fixed_model.c

 1 /* THE FIXED SOURCE MODEL */
 2
 3 #include "model.h"
 4
 5 int freq[No_of_symbols+1] = {
 6     0,
 7     1,  1,  1,  1,  1,  1,  1,  1,  1,  1,124,  1,  1,  1,  1,  1,
 8     1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
 9
10 /*       !    "    #    $    %    &    '    (    )    *    +    ,    -    .    / */
11   1236,  1,  21,   9,   3,   1,  25,  15,   2,   2,   2,   1,  79,  19,  60,   1,
12
13 /*  0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ? */
14    15,  15,   8,   5,   4,   7,   5,   4,   4,   6,   3,   2,   1,   1,   1,   1,
15
16 /*  @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O */
17     1,  24,  15,  22,  12,  15,  10,   9,  16,  16,   8,   6,  12,  23,  13,  11,
18
19 /*  P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _ */
20    14,   1,  14,  26,  29,   6,   3,  11,   1,   3,   1,   1,   1,   1,   1,   5,
21
22 /*  `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o */
23     1, 491,  85, 173, 232, 744, 127, 110, 293, 418,   6,  39, 250, 139, 429, 446,
24
25 /*  p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~      */
26   111,   5, 388, 375, 531, 159,  57,  91,  12, 101,   5,   2,   1,   2,   3,   1,
27
28     1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
29     1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
30     1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
31     1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
32     1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
33     1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
34     1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
35     1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
36
37     1
38 };
39
40 /* INITIALIZE THE MODEL. */
41
42 start_model()
43 {   int i;
44     for (i = 0; i<No_of_chars; i++) {   /* Set up tables that       */
45         char_to_index[i] = i+1;         /* translate between symbol */
46         index_to_char[i+1] = i;         /* indexes and characters.  */
47     }
48     cum_freq[No_of_symbols] = 0;
49     for (i = No_of_symbols; i>0; i--) { /* Set up cumulative        */
50         cum_freq[i-1] = cum_freq[i] + freq[i]; /* frequency counts. */
51     }
52     if (cum_freq[0] > Max_frequency) abort(); /* Check counts within limit */
53 }
54
55
56 /* UPDATE THE MODEL TO ACCOUNT FOR A NEW SYMBOL. */
57
58 update_model(symbol)
59     int symbol;
60 {                                       /* Do nothing.              */
61 }

FIGURE 4. Fixed and Adaptive Models for Use with Figure 3


adaptive_model.c

 1 /* THE ADAPTIVE SOURCE MODEL */
 2
 3 #include "model.h"
 4
 5 int freq[No_of_symbols+1];      /* Symbol frequencies               */
 6
 7
 8 /* INITIALIZE THE MODEL. */
 9
10 start_model()
11 {   int i;
12     for (i = 0; i<No_of_chars; i++) {   /* Set up tables that       */
13         char_to_index[i] = i+1;         /* translate between symbol */
14         index_to_char[i+1] = i;         /* indexes and characters.  */
15     }
16     for (i = 0; i<=No_of_symbols; i++) { /* Set up initial frequency*/
17         freq[i] = 1;                    /* counts to be one for all */
18         cum_freq[i] = No_of_symbols-i;  /* symbols.                 */
19     }
20     freq[0] = 0;                        /* Freq[0] must not be the  */
21 }                                       /* same as freq[1].         */
22
23
24 /* UPDATE THE MODEL TO ACCOUNT FOR A NEW SYMBOL. */
25
26 update_model(symbol)
27     int symbol;                 /* Index of new symbol              */
28 {   int i;                      /* New index for symbol             */
29     if (cum_freq[0]==Max_frequency) {   /* See if frequency counts  */
30         int cum;                        /* are at their maximum.    */
31         cum = 0;
32         for (i = No_of_symbols; i>=0; i--) { /* If so, halve all the*/
33             freq[i] = (freq[i]+1)/2;    /* counts (keeping them     */
34             cum_freq[i] = cum;          /* non-zero).               */
35             cum += freq[i];
36         }
37     }
38     for (i = symbol; freq[i]==freq[i-1]; i--) ; /* Find symbol's new index. */
39     if (i<symbol) {
40         int ch_i, ch_symbol;
41         ch_i = index_to_char[i];           /* Update the translation   */
42         ch_symbol = index_to_char[symbol]; /* tables if the symbol     */
43         index_to_char[i] = ch_symbol;      /* has moved.               */
44         index_to_char[symbol] = ch_i;
45         char_to_index[ch_i] = symbol;
46         char_to_index[ch_symbol] = i;
47     }
48     freq[i] += 1;                       /* Increment the frequency  */
49     while (i>0) {                       /* count for the symbol and */
50         i -= 1;                         /* update the cumulative    */
51         cum_freq[i] += 1;               /* frequencies.             */
52     }
53 }

FIGURE 4. Fixed and Adaptive Models for Use with Figure 3 (continued)


Representing the Model
Implementations of models are discussed in the next section; here we are concerned only with the interface to the model (lines 20-38). In C, a byte is represented as an integer between 0 and 255 (a char). Internally, we represent a byte as an integer between 1 and 257 inclusive (an index), EOF being treated as a 257th symbol. It is advantageous to sort the model into frequency order, so as to minimize the number of executions of the decoding loop (line 189). To permit such reordering, the char/index translation is implemented as a pair of tables, index_to_char[] and char_to_index[]. In one of our models, these tables simply form the index by adding 1 to the char, but another implements a more complex translation that assigns small indexes to frequently used symbols.

The probabilities in the model are represented as integer frequency counts, and cumulative counts are stored in the array cum_freq[]. As previously, this array is "backwards," and the total frequency count, which is used to normalize all frequencies, appears in cum_freq[0]. Cumulative counts must not exceed a predetermined maximum, Max_frequency, and the model implementation must prevent overflow by scaling appropriately. It must also ensure that neighboring values in the cum_freq[] array differ by at least 1; otherwise the affected symbol cannot be transmitted.

Incremental Transmission and Reception
Unlike Figure 2, the program in Figure 3 represents low and high as integers. A special data type, code_value, is defined for these quantities, together with some useful constants: Top_value, representing the largest possible code_value, and First_qtr, Half, and Third_qtr, representing parts of the range (lines 6-16). Whereas in Figure 2 the current interval is represented by [low, high), in Figure 3 it is [low, high]; that is, the range now includes the value of high. Actually, it is more accurate (though more confusing) to say that, in the program in Figure 3, the interval represented is [low, high + 0.1111...). This is because when the bounds are scaled up to increase the precision, zeros are shifted into the low-order bits of low, but ones are shifted into high. Although it is possible to write the program to use a different convention, this one has some advantages in simplifying the code.

As the code range narrows, the top bits of low and high become the same. Any bits that are the same can be transmitted immediately, since they cannot be affected by future narrowing. For encoding, since we know that low ≤ high, this requires code like

    for (;;) {
        if (high < Half) {
            output_bit(0);
            low = 2*low;
            high = 2*high+1;
        }
        else if (low >= Half) {
            output_bit(1);
            low = 2*(low-Half);
            high = 2*(high-Half)+1;
        }
        else break;
    }

which ensures that, upon completion, low < Half ≤ high. This can be found in lines 95-113 of encode_symbol(), although there are some extra complications caused by underflow possibilities (see the next subsection). Care is taken to shift ones in at the bottom when high is scaled, as noted above.

Incremental reception is done using a number called value as in Figure 2, in which processed bits flow out the top (high-significance) end and newly received ones flow in the bottom. Initially, start_decoding() (lines 168-176) fills value with received bits. Once decode_symbol() has identified the next input symbol, it shifts out now-useless high-order bits that are the same in low and high, shifting value by the same amount (and replacing lost bits by fresh input bits at the bottom end):

    for (;;) {
        if (high < Half) {
            value = 2*value+input_bit();
            low = 2*low;
            high = 2*high+1;
        }
        else if (low >= Half) {
            value = 2*(value-Half)+input_bit();
            low = 2*(low-Half);
            high = 2*(high-Half)+1;
        }
        else break;
    }

(see lines 194-213, again complicated by precautions against underflow, as discussed below).
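The effect of this loop is easy to see on a toy trace. The following sketch (our own illustration, using 8-bit code values so that Half = 128) starts from the range [96, 111]; both bounds share the binary prefix 0110, and exactly those four bits are emitted while the range grows back to full size:

#include <stdio.h>

#define Half 128                        /* 8-bit code values: Top_value = 255 */

int main(void)
{
    int low = 96, high = 111;           /* 01100000 .. 01101111 in binary */
    for (;;) {
        if (high < Half) {
            printf("output 0\n");       /* Shared top bit is 0.        */
        } else if (low >= Half) {
            printf("output 1\n");       /* Shared top bit is 1.        */
            low -= Half;
            high -= Half;
        } else
            break;                      /* Top bits differ: stop.      */
        low = 2*low;                    /* Scale up, shifting a 0      */
        high = 2*high + 1;              /* into low and a 1 into high. */
    }
    printf("range now [%d, %d]\n", low, high);  /* Prints [0, 255].    */
    return 0;
}

It prints the bits 0, 1, 1, 0 (the common prefix of the two bounds) and leaves the range at the full [0, 255].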

Proof of Decoding Correctness
At this point it is worth checking that identification of the next symbol by decode_symbol() works properly. Recall from Figure 2 that decode_symbol() must use value to find the symbol that, when encoded, reduces the range to one that still includes value. Lines 186-188 in decode_symbol() identify the symbol for which

    cum_freq[symbol] ≤ ⌊((value - low + 1) × cum_freq[0] - 1) / (high - low + 1)⌋ < cum_freq[symbol - 1],

where ⌊ ⌋ denotes the "integer part of" function that comes from integer division with truncation. It is shown in the Appendix that this implies

    low + ⌊(high - low + 1) × cum_freq[symbol] / cum_freq[0]⌋
        ≤ value
        ≤ low + ⌊(high - low + 1) × cum_freq[symbol - 1] / cum_freq[0]⌋ - 1,

so that value lies within the new interval that decode_symbol() calculates in lines 190-193. This is sufficient to guarantee that the decoding operation identifies each symbol correctly.

Underflow
As Figure 1 shows, arithmetic coding works by scaling the cumulative probabilities given by the model into the interval [low, high] for each character transmitted. Suppose low and high are very close together, so close that this scaling operation maps some different symbols of the model onto the same integer in the [low, high] interval. This would be disastrous, because if such a symbol actually occurred it would not be possible to continue encoding. Consequently, the encoder must guarantee that the interval [low, high] is always large enough to prevent this. The simplest way to do this is to ensure that this interval is at least as large as Max_frequency, the maximum allowed cumulative frequency count (line 36).

How could this condition be violated? The bit-shifting operation explained above ensures that low and high can only become close together when they straddle Half. Suppose in fact they become as close as

    First_qtr ≤ low < Half ≤ high < Third_qtr.

Then the next two bits sent will have opposite polarity, either 01 or 10. For example, if the next bit turns out to be zero (i.e., high descends below Half and [0, Half] is expanded to the full interval), the bit after that will be one, since the range has to be above the midpoint of the expanded interval. Conversely, if the next bit happens to be one, the one after that will be zero. Therefore the interval can safely be expanded right now, if only we remember that, whatever bit actually comes next, its opposite must be transmitted afterwards as well. Thus lines 104-109 expand [First_qtr, Third_qtr] into the whole interval, remembering in bits_to_follow that the bit that is output next must be followed by an opposite bit. This explains why all output is done via bit_plus_follow() (lines 128-135), instead of directly with output_bit().

But what if, after this operation, it is still true that

    First_qtr ≤ low < Half ≤ high < Third_qtr?

Figure 5 illustrates this situation, where the current [low, high] range (shown as a thick line) has been expanded a total of three times. Suppose the next bit turns out to be zero, as indicated by the arrow in Figure 5a being below the halfway point. Then the next three bits will be ones, since the arrow is not only in the top half of the bottom half of the original range, but in the top quarter, and moreover the top eighth, of that half; this is why the expansion can occur three times. Similarly, as Figure 5b shows, if the next bit turns out to be a one, it will be followed by three zeros. Consequently, we need only count the number of expansions and follow the next bit by that number of opposites (lines 106 and 131-134).

Using this technique the encoder can guarantee that, after the shifting operations, either

    low < First_qtr < Half ≤ high    (1a)

or

    low < Half < Third_qtr ≤ high.    (1b)

Therefore, as long as the integer range spanned by the cumulative frequencies fits into a quarter of that provided by code_values, the underflow problem cannot occur. This corresponds to the condition

    Max_frequency ≤ (Top_value + 1)/4 + 1,

which is satisfied by Figure 3, since Max_frequency = 2^14 - 1 and Top_value = 2^16 - 1 (lines 36, 9). More than 14 bits cannot be used to represent cumulative frequency counts without increasing the number of bits allocated to code_values.

We have discussed underflow in the encoder only. Since the decoder's job, once each symbol has been decoded, is to track the operation of the encoder, underflow will be avoided if it performs the same expansion operation under the same conditions.
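The expansion count is easy to see numerically. A toy sketch (our own illustration, again with 8-bit code values, so First_qtr = 64, Half = 128, Third_qtr = 192) shows a range straddling Half being expanded three times:

#include <stdio.h>

#define First_qtr 64
#define Half      128
#define Third_qtr 192

int main(void)
{
    int low = 120, high = 136;          /* Straddles Half = 128.      */
    int bits_to_follow = 0;
    while (low >= First_qtr && high < Third_qtr) {
        bits_to_follow += 1;            /* Remember the middle half.  */
        low  = 2*(low - First_qtr);     /* Expand [First_qtr,         */
        high = 2*(high - First_qtr)+1;  /* Third_qtr) to full range.  */
    }
    printf("low=%d high=%d bits_to_follow=%d\n",
           low, high, bits_to_follow);  /* low=64 high=199, 3 bits.   */
    return 0;
}

Starting from [120, 136], three expansions leave the range at [64, 199]; whichever bit is eventually output must then be followed by three opposite bits.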

FIGURE 5. Scaling the Interval to Prevent Underflow (panels (a) and (b) show the expanded [low, high] range against a scale marked First_qtr, Half, Third_qtr, and Top_value)

Overflow
Now consider the possibility of overflow in the integer multiplications corresponding to those of Figure 2, which occur in lines 91-94 and 190-193 of Figure 3. Overflow cannot occur provided the product

    range × Max_frequency

fits within the integer word length available, since cumulative frequencies cannot exceed Max_frequency. Range might be as large as Top_value + 1, so the largest possible product in Figure 3 is 2^16 × (2^14 - 1), which is less than 2^30. Long declarations are used for code_value (line 7) and range (lines 89, 183) to ensure that arithmetic is done to 32-bit precision.

Constraints on the Implementation
The constraints on word length imposed by underflow and overflow can be simplified by assuming that frequency counts are represented in f bits, and code_values in c bits. The implementation will work correctly provided

    f ≤ c - 2,
    f + c ≤ p,

where p is the precision to which arithmetic is performed.
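These conditions can even be checked at compile time. A small sketch (our own addition, not part of the original program; C of this era has no static assertions, so a negative array size is used to force a compile error on violation):

/* Compile-time check of f <= c - 2 and f + c <= p for the values      */
/* used in Figure 3 (our own addition).                                */
#define f 14    /* bits for frequency counts: Max_frequency = 2^14 - 1 */
#define c 16    /* bits in a code value:      Top_value = 2^16 - 1     */
#define p 31    /* precision of signed long arithmetic                 */

typedef char check_underflow[(f <= c - 2) ? 1 : -1];
typedef char check_overflow [(f + c <= p) ? 1 : -1];

If either condition fails, the typedef declares an array of negative size and compilation stops.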

In most C implementations, p = 31 if long integers are used, and p = 32 if they are unsigned long. In Figure 3, f = 14 and c = 16. With appropriately modified declarations, unsigned long arithmetic with f = 15 and c = 17 could be used. In assembly language, c = 16 is a natural choice because it expedites some comparisons and bit manipulations (e.g., those of lines 65-113 and 164-213).

If p is restricted to 16 bits, the best values possible are c = 9 and f = 7, making it impossible to encode a full alphabet of 256 symbols, as each symbol must have a count of at least one. A smaller alphabet (e.g., the 26 letters, or 4-bit nibbles) could still be handled.

Termination
To finish the transmission, it is necessary to send a unique terminating symbol (EOF_symbol, line 56) and then follow it by enough bits to ensure that the encoded string falls within the final range. Since done_encoding() (lines 119-123) can be sure that low and high are constrained by either Eq. (1a) or (1b) above, it need only transmit 01 in the first case or 10 in the second to remove the remaining ambiguity. It is convenient to do this using the bit_plus_follow() procedure discussed earlier. The input_bit() procedure will actually read a few more bits than were sent by output_bit(), as it needs to keep the low end of the buffer full. It does not matter what value these bits have, since EOF is uniquely determined by the last two bits actually transmitted.

MODELS FOR ARITHMETIC CODING
The program in Figure 3 must be used with a model that provides a pair of translation tables index_to_char[] and char_to_index[], and a cumulative frequency array cum_freq[]. The requirements on the latter are that

• cum_freq[i - 1] ≥ cum_freq[i];
• an attempt is never made to encode a symbol i for which cum_freq[i - 1] = cum_freq[i]; and
• cum_freq[0] ≤ Max_frequency.

Provided these conditions are satisfied, the values in the array need bear no relationship to the actual cumulative symbol frequencies in messages. Encoding and decoding will still work correctly, although encodings will occupy less space if the frequencies are accurate. (Recall our successfully encoding eaii! according to the model of Table I, which does not actually reflect the frequencies in the message.)

Fixed Models
The simplest kind of model is one in which symbol frequencies are fixed. The first model in Figure 4 has symbol frequencies that approximate those of English (taken from a part of the Brown Corpus [12]). However, bytes that did not occur in that sample have been given frequency counts of one in case they do occur in messages to be encoded (so this model will still work for binary files in which all 256 bytes occur). Frequencies have been normalized to total 8000. The initialization procedure start_model() simply computes a cumulative version of these frequencies (lines 48-51), having first initialized the translation tables (lines 44-47). Execution speed would be improved if these tables were used to reorder symbols and frequencies so that the most frequent came first in the cum_freq[] array. Since the model is fixed, the procedure update_model(), which is called from both encode.c and decode.c, is null.

An exact model is one where the symbol frequencies in the message are exactly as prescribed by the model. For example, the fixed model of Figure 4 is close to an exact model for the particular excerpt of the Brown Corpus from which it was taken. To be truly exact, however, symbols that did not occur in the excerpt would be assigned counts of zero, rather than one (sacrificing the capability of transmitting messages containing those symbols). Moreover, the frequency counts would not be scaled to a predetermined cumulative frequency, as they have been in Figure 4. The exact model can be calculated and transmitted before the message is sent. It is shown by Cleary and Witten [3] that, under quite general conditions, this will not give better overall compression than adaptive coding (which is described next).

Adaptive Models
An adaptive model represents the changing symbol frequencies seen so far in a message. Initially all counts might be the same (reflecting no initial information), but they are updated, as each symbol is seen, to approximate the observed frequencies. Provided both encoder and decoder use the same initial values (e.g., equal counts) and the same updating algorithm, their models will remain in step. The encoder receives the next symbol, encodes it, and updates its model. The decoder identifies it according to its current model and then updates its model.

The second half of Figure 4 shows such an adaptive model. This is the type of model recommended for use with Figure 3, for in practice it will outperform a fixed model in terms of compression efficiency. Initialization is the same as for the fixed model, except that all frequencies are set to one. The procedure update_model(symbol) is called by both encode_symbol() and decode_symbol() (Figure 3, lines 54 and 151) after each symbol is processed.
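The essence of the update, setting aside the halving of counts and the frequency-order maintenance that Figure 4 performs, is just an increment plus a repair of the running totals. A stripped-down sketch (our own simplification, not the code of Figure 4):

/* Stripped-down model update (our own simplification of Figure 4's   */
/* update_model(): the halving at Max_frequency and the reordering of */
/* symbols by frequency are omitted for clarity).                     */
void simple_update_model(int freq[], int cum_freq[], int symbol)
{
    int i;
    freq[symbol] += 1;          /* Count one more occurrence.          */
    for (i = symbol - 1; i >= 0; i--)
        cum_freq[i] += 1;       /* In the "backwards" convention, all  */
}                               /* totals below the symbol grow by 1.  */

Because cum_freq[] is stored backwards, incrementing a symbol's count raises cum_freq[i] for every i below it, which is what makes updates expensive when frequently used symbols have large indexes; this is exactly why Figure 4 keeps the array in frequency order.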


Updating the model is quite expensive because of the need to maintain cumulative totals. In the code of Figure 4, frequency counts, which must be maintained anyway, are used to optimize access by keeping the array in frequency order, an effective kind of self-organizing linear search [9]. Update_model() first checks to see if the new model will exceed the cumulative-frequency limit, and if so scales all frequencies down by a factor of two (taking care to ensure that no count scales to zero) and recomputes cumulative values (Figure 4, lines 29-37). Then, if necessary, update_model() reorders the symbols to place the current one in its correct rank in the frequency ordering, altering the translation tables to reflect the change. Finally, it increments the appropriate frequency count and adjusts cumulative frequencies accordingly.

PERFORMANCE
Now consider the performance of the algorithm of Figure 3, both in compression efficiency and execution time.

Compression Efficiency
In principle, when a message is coded using arithmetic coding, the number of bits in the encoded string is the same as the entropy of that message with respect to the model used for coding. Three factors cause performance to be worse than this in practice:

(1) message termination overhead;
(2) the use of fixed-length rather than infinite-precision arithmetic; and
(3) scaling of counts so that their total is at most Max_frequency.

None of these effects is significant, as we now show. In order to isolate the effect of arithmetic coding, the model will be considered to be exact (as defined above).

Arithmetic coding must send extra bits at the end of each message, causing a message termination overhead. Two bits are needed, sent by done_encoding() (Figure 3, lines 119-123), in order to disambiguate the final symbol. In cases where a bit stream must be blocked into 8-bit characters before encoding, it will be necessary to round out to the end of a block. Combining these, an extra 9 bits may be required.

The overhead of using fixed-length arithmetic occurs because remainders are truncated on division. It can be assessed by comparing the algorithm's performance with the figure obtained from a theoretical entropy calculation that derives its frequencies from counts scaled exactly as for coding. It is completely negligible, on the order of 10^-4 bits/symbol.

The penalty paid by scaling counts is somewhat larger, but still very small. For short messages (less than 2^14 bytes), no scaling need be done. Even with messages of 10^5-10^6 bytes, the overhead was found experimentally to be less than 0.25 percent of the encoded string.

The adaptive model in Figure 4 scales down all counts whenever the total threatens to exceed Max_frequency. This has the effect of weighting recent events more heavily than events from earlier in the message. The statistics thus tend to track changes in the input sequence, which can be very beneficial. (We have encountered cases where limiting counts to 6 or 7 bits gives better results than working to higher precision.) Of course, this depends on the source being modeled. Bentley et al. [2] consider other, more explicit, ways of incorporating a recency effect.

Execution Time
The program in Figure 3 has been written for clarity rather than for execution speed. In fact, with the adaptive model in Figure 4, it takes about 420 μs per input byte on a VAX-11/780 to encode a text file, and about the same for decoding. However, easily avoidable overheads such as procedure calls account for much of this, and some simple optimizations increase speed by a factor of two. The following alterations were made to the C version shown:

(1) The procedures input_bit(), output_bit(), and bit_plus_follow() were converted to macros to eliminate procedure-call overhead (one way of doing this is sketched below).
(2) Frequently used quantities were put in register variables.
(3) Multiplies by two were replaced by additions (C "+=").
(4) Array indexing was replaced by pointer manipulation in the loops at line 189 in Figure 3 and lines 49-52 of the adaptive model in Figure 4.

This mildly optimized C implementation has an execution time of 214 μs/262 μs per input byte, for encoding/decoding 100,000 bytes of English text on a VAX-11/780, as shown in Table II. Also given are the corresponding figures for the same program on an Apple Macintosh and a SUN-3/75. As can be seen, coding a C source program of the same length took slightly longer in all cases, and a binary object program longer still. The reason for this will be discussed shortly. Two artificial test files were included to allow readers to replicate the results. "Alphabet" consists of enough copies of the 26-letter alphabet to fill out 100,000 characters (ending with a partially completed alphabet). "Skew statistics" contains 10,000 copies of the string aaaabaaaac; it demonstrates that files may be encoded into less than one bit per character (output size of 12,092 bytes = 96,736 bits). All results quoted used the adaptive model of Figure 4.
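As an illustration of alteration (1), output_bit() can be recast as a macro along these lines (a sketch of the kind of change described, not the authors' actual optimized code):

/* output_bit() as a macro, eliminating the procedure call (a sketch  */
/* of alteration (1) above, not the authors' actual optimized code).  */
#define OUTPUT_BIT(bit)                                                \
    do {                                                               \
        buffer >>= 1;                   /* Put bit in top of buffer. */ \
        if (bit) buffer |= 0x80;                                       \
        if (--bits_to_go == 0) {        /* Flush a full byte.        */ \
            putc(buffer, stdout);                                      \
            bits_to_go = 8;                                            \
        }                                                              \
    } while (0)

The do ... while (0) wrapper lets the macro be used wherever a statement is expected, exactly like the procedure it replaces.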


TABLE II. Results for Encoding and Decoding 100,000-Byte Files

                              Output     VAX-11/780        Macintosh 512K    SUN-3/75
                              size       Encode  Decode    Encode  Decode    Encode  Decode
                              (bytes)    (μs)    (μs)      (μs)    (μs)      (μs)    (μs)

Mildly optimized C implementation
Text file                     57,718      214     262       687               98     121
C program                     62,991      230     288       729              105     131
VAX object program            73,501      313     406       950              145     190
Alphabet                      59,292      223     277       719              105     130
Skew statistics               12,092      143     170       507               70      85

Carefully optimized assembly-language implementation
Text file                     57,718      104     135       194     243       46      58
C program                     62,991      109     151       208     266       51      65
VAX object program            73,501      158     241       280     402       75     107
Alphabet                      59,292      105     145       204     264       51      65
Skew statistics               12,092       63      81       126     160       28      36

Notes: Times are measured in microseconds per byte of uncompressed data. The VAX-11/780 had a floating-point accelerator, which reduces integer multiply and divide times. The Macintosh uses an 8-MHz MC68000 with some memory wait states. The SUN-3/75 uses a 16.67-MHz MC68020. All times exclude I/O and operating-system overhead in support of I/O. VAX and SUN figures give user time from the UNIX time command; on the Macintosh, I/O was explicitly directed to an array. The 4.2BSD C compiler was used for VAX and SUN; Aztec C 1.06g for Macintosh.

A further factor of two can be gained by reprogramming in assembly language. A carefully optimized version of Figures 3 and 4 (adaptive model) was written in both VAX and M68000 assembly languages. Full use was made of registers, and advantage taken of the 16-bit code_value to expedite some crucial comparisons and make subtractions of Half trivial. The performance of these implementations on the test files is also shown in Table II in order to give the reader some idea of typical execution speeds.

The VAX-11/780 assembly-language timings are broken down in Table III. These figures were obtained with the UNIX profile facility and are accurate only to within perhaps 10 percent. (This mechanism constructs a histogram of program counter values at real-time clock interrupts and suffers from statistical variation as well as some systematic errors.) "Bounds calculation" refers to the initial parts of encode_symbol() and decode_symbol() (Figure 3, lines 90-94 and 190-193), which contain multiply and divide operations. "Bit shifting" is the major loop in both the encode and decode routines (lines 95-113 and 194-213). The cum calculation in decode_symbol(), which requires a multiply/divide, and the following loop to identify the next symbol (lines 187-189), is "Symbol decode." Finally, "Model update" refers to the adaptive update_model() procedure of Figure 4 (lines 26-53).

TABLE III. Breakdown of Timings for the VAX-11/780 Assembly-Language Version

                          Encode time (μs)    Decode time (μs)
Text file
  Bounds calculation            32                  31
  Bit shifting                  39                  30
  Model update                  29                  29
  Symbol decode                  -                  45
  Other                          4                   0
                               104                 135
C program
  Bounds calculation            30                  28
  Bit shifting                  42                  35
  Model update                  33                  36
  Symbol decode                  -                  51
  Other                          4                   1
                               109                 151
VAX object program
  Bounds calculation            34                  31
  Bit shifting                  46                  40
  Model update                  75                  75
  Symbol decode                  -                  94
  Other                          3                   1
                               158                 241

As expected, the bounds calculation and model update take the same time for both encoding and decoding, within experimental error. Bit shifting was quicker for the text file than for the C program and object file because compression performance was better.



The extra time for decoding over encoding is due entirely to the symbol decode step. This takes longer in the C program and object file tests because the loop of line 189 was executed more often (on average 9 times, 13 times, and 35 times, respectively). This also affects the model update time because it is the number of cumulative counts that must be incremented in Figure 4, lines 49-52. In the worst case, when the symbol frequencies are uniformly distributed, these loops are executed an average of 128 times. Worst-case performance would be improved by using a more complex tree representation for frequencies, but this would likely be slower for text files.

SOME APPLICATIONS
Applications of arithmetic coding are legion. By liberating coding with respect to a model from the modeling required for prediction, it encourages a whole new view of data compression [19]. This separation of function costs nothing in compression performance, since arithmetic coding is (practically) optimal with respect to the entropy of the model. Here we intend to do no more than suggest the scope of this view by briefly considering

(1) adaptive text compression,
(2) nonadaptive coding,
(3) compressing black/white images, and
(4) coding arbitrarily distributed integers.

Of course, as noted earlier, greater coding efficiencies could easily be achieved with more sophisticated models. Modeling, however, is an extensive topic in its own right and is beyond the scope of this article.

Adaptive text compression using single-character adaptive frequencies shows off arithmetic coding to good effect. The results obtained using the program in Figures 3 and 4 vary from 4.8-5.3 bits/character for short English text files (10^3-10^4 bytes) to 4.5-4.7 bits/character for long ones (10^5-10^6 bytes). Although adaptive Huffman techniques do exist (e.g., [5, 7]), they lack the conceptual simplicity of arithmetic coding. Although competitive in compression efficiency for many files, they are slower. For example, Table IV compares the performance of the mildly optimized C implementation of arithmetic coding with that of the UNIX compact program that implements adaptive Huffman coding using a similar model. (Compact's model is essentially the same for long files, like those of Table IV, but is better for short files than the model used as an example in this article.) Casual examination of compact indicates that the care taken in optimization is roughly comparable for both systems, yet arithmetic coding halves execution time. Compression performance is somewhat better with arithmetic coding on all the example files. The difference would be accentuated with more sophisticated models that predict symbols with probabilities approaching one under certain circumstances (e.g., the letter u following q).

Nonadaptive coding can be performed arithmetically using fixed, prespecified models like that in the first part of Figure 4. Compression performance will be better than Huffman coding. In order to minimize execution time, the total frequency count, cum_freq[0], should be chosen as a power of two so the divisions in the bounds calculations (Figure 3, lines 91-94 and 190-193) can be done as shifts, as sketched below. Encode/decode times of around 60 μs/90 μs should then be possible for an assembly-language implementation on a VAX-11/780. A carefully written implementation of Huffman coding, using table lookup for encoding and decoding, would be a bit faster in this application.
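The following small demonstration (our own illustration; the variable names echo those of Figure 3) shows the divide-by-total reducing to a right shift when the total is a power of two:

#include <assert.h>
#include <stdio.h>

/* When the total frequency count is a power of two (here 2^14), the  */
/* division by cum_freq[0] in the bounds calculations reduces to a    */
/* right shift (a sketch of the optimization described above).        */
#define Cum_total_bits 14
#define Cum_total      (1L << Cum_total_bits)   /* cum_freq[0] = 16384 */

int main(void)
{
    long range = 65536;                 /* Top_value + 1 at its largest */
    long cum   = 12345;                 /* Some cumulative count        */
    long by_divide = (range * cum) / Cum_total;
    long by_shift  = (range * cum) >> Cum_total_bits;
    assert(by_divide == by_shift);      /* Shift replaces the divide.   */
    printf("%ld == %ld\n", by_divide, by_shift);
    return 0;
}

For nonnegative operands the shift and the division agree exactly, so the change is safe wherever cum_freq[0] is constrained to a power of two.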
The results obtained using the pro- coding has been investigated by Langdon and gram in Figures 3 and 4 vary from 4.8-5.3 bits/char- Rissanen [14], who achieved excellent results using

TABLE IV. Comparison of Arithmetic and Adaptive Huffman Coding

                              Arithmetic coding               Adaptive Huffman coding
                        Output     Encode    Decode     Output     Encode    Decode
                        size       time      time       size       time      time
                        (bytes)    (μs)      (μs)       (bytes)    (μs)      (μs)

Text file               57,718       214       262      57,781       550       414
C program               62,991       230       288      63,731       596       441
VAX object program      73,546       313       406      76,950       822       606
Alphabet                59,292       223       277      60,127       598       411
Skew statistics         12,092       143       170      16,257       215       132

Notes: The mildly optimized C implementation was used for arithmetic coding. UNIX compact was used for adaptive Huffman coding. Times are for a VAX-11/780 and exclude I/O and operating-system overhead in support of I/O.

Compressing black/white images using arithmetic coding has been investigated by Langdon and Rissanen [14], who achieved excellent results using a model that conditioned the probability of a pixel's being black on a template of pixels surrounding it. The template contained a total of 10 pixels, selected from those above and to the left of the current one so that they precede it in the raster scan. This creates 1024 different possible contexts, and for each the probability of the pixel being black was estimated adaptively as the picture was transmitted. Each pixel's polarity was then coded arithmetically according to this probability. A 20-30 percent improvement in compression was attained over earlier methods. To increase coding speed, Langdon and Rissanen used an approximate method of arithmetic coding that avoided multiplication by representing probabilities as integer powers of ½.

Huffman coding cannot be directly used in this application, as it never compresses with a two-symbol alphabet. Run-length coding, a popular method for use with two-valued alphabets, provides another opportunity for arithmetic coding. The model reduces the data to a sequence of lengths of runs of the same symbol (e.g., for picture coding, run-lengths of black followed by white followed by black followed by white ...). The sequence of lengths must be transmitted. The CCITT facsimile coding standard [11] bases a Huffman code on the frequencies with which black and white runs of different lengths occur in sample documents. A fixed arithmetic code using these same frequencies would give better performance; adapting the frequencies to each particular document would be better still.

Coding arbitrarily distributed integers is often called for in use with more sophisticated models of text, image, or other data. Consider, for instance, the locally adaptive data compression scheme of Bentley et al. [2], in which the encoder and decoder cache the last N different words seen. A word present in the cache is transmitted by sending the integer cache index. Words not in the cache are transmitted by sending a new-word marker followed by the characters of the word. This is an excellent model for text in which words are used frequently over short intervals and then fall into long periods of disuse. Their paper discusses several variable-length codings for the integers used as cache indexes. Arithmetic coding allows any probability distribution to be used as the basis for a variable-length encoding, including, among countless others, the ones implied by the particular codes discussed there. It also permits use of an adaptive model for cache indexes, which is desirable if the distribution of cache hits is difficult to predict in advance. Furthermore, with arithmetic coding, the code space allotted to the cache indexes can be scaled down to accommodate any desired probability for the new-word marker.

APPENDIX. Proof of Decoding Inequality
Using one-letter abbreviations for cum_freq, symbol, low, high, and value, suppose

    c[s] ≤ ⌊((v - l + 1) × c[0] - 1) / (h - l + 1)⌋ < c[s - 1];

in other words,

    c[s] ≤ ((v - l + 1) × c[0] - 1)/r - ε ≤ c[s - 1] - 1,    (1)

where

    r = h - l + 1,    0 ≤ ε ≤ (r - 1)/r.

(The last inequality of Eq. (1) derives from the fact that c[s - 1] must be an integer.) Then we wish to show that l' ≤ v ≤ h', where l' and h' are the updated values for low and high as defined below.

(a) l' ≡ l + ⌊r × c[s]/c[0]⌋
       ≤ l + (r/c[0]) × [((v - l + 1) × c[0] - 1)/r - ε]    from Eq. (1)
       = v + 1 - (1 + εr)/c[0]
       < v + 1,

so l' ≤ v since both v and l' are integers and c[0] > 0.

(b) h' ≡ l + ⌊r × c[s - 1]/c[0]⌋ - 1
       > l + r × c[s - 1]/c[0] - 2
       ≥ l + (r/c[0]) × [((v - l + 1) × c[0] - 1)/r - ε + 1] - 2    from Eq. (1)
       = v - 1 + (r(1 - ε) - 1)/c[0]
       ≥ v - 1,

since ε ≤ (r - 1)/r implies r(1 - ε) ≥ 1; so h' ≥ v because both v and h' are integers.
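The inequality can also be checked by brute force. A small test harness (our own addition) exercises all small values of r, c[0], and v and confirms that l' ≤ v ≤ h' whenever the decoder's selection condition holds:

#include <assert.h>
#include <stdio.h>

/* Brute-force check of the decoding inequality (our own addition):   */
/* whenever cs <= ((v-l+1)*c0 - 1)/r < cs1 (integer division), the    */
/* updated bounds l' and h' must satisfy l' <= v <= h'.               */
int main(void)
{
    long l = 0, r, c0, cs, cs1, v;
    for (r = 1; r <= 32; r++)                   /* r = h - l + 1       */
    for (c0 = 1; c0 <= 16; c0++)                /* total frequency     */
    for (cs = 0; cs <= c0; cs++)                /* c[s]                */
    for (cs1 = cs + 1; cs1 <= c0; cs1++)        /* c[s-1] > c[s]       */
    for (v = l; v <= l + r - 1; v++) {          /* any value in range  */
        long cum = ((v - l + 1)*c0 - 1) / r;
        if (cs <= cum && cum < cs1) {           /* decoder picks s     */
            long lp = l + (r*cs)/c0;            /* new low             */
            long hp = l + (r*cs1)/c0 - 1;       /* new high            */
            assert(lp <= v && v <= hp);
        }
    }
    printf("decoding inequality verified\n");
    return 0;
}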


REFERENCES
1. Abramson, N. Information Theory and Coding. McGraw-Hill, New York, 1963. This textbook contains the first reference to what was to become the method of arithmetic coding (pp. 61-62).
2. Bentley, J.L., Sleator, D.D., Tarjan, R.E., and Wei, V.K. A locally adaptive data compression scheme. Commun. ACM 29, 4 (Apr. 1986), 320-330. Shows how recency effects can be incorporated explicitly into a text compression system.
3. Cleary, J.G., and Witten, I.H. A comparison of enumerative and adaptive codes. IEEE Trans. Inf. Theory IT-30, 2 (Mar. 1984), 306-315. Demonstrates under quite general conditions that adaptive coding outperforms the method of calculating and transmitting an exact model of the message first.
4. Cleary, J.G., and Witten, I.H. Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. COM-32, 4 (Apr. 1984), 396-402. Presents an adaptive modeling method that reduces a large sample of mixed-case English text to around 2.2 bits/character when arithmetically coded.
5. Cormack, G.V., and Horspool, R.N. Algorithms for adaptive Huffman codes. Inf. Process. Lett. 18, 3 (Mar. 1984), 159-166. Describes how adaptive Huffman coding can be implemented efficiently.
6. Cormack, G.V., and Horspool, R.N. Data compression using dynamic Markov modeling. Res. Rep., Computer Science Dept., Univ. of Waterloo, Ontario, Apr. 1985. Also to be published in Comput. J. Presents an adaptive state-modeling technique that, in conjunction with arithmetic coding, produces results competitive with those of [4].
7. Gallager, R.G. Variations on a theme by Huffman. IEEE Trans. Inf. Theory IT-24, 6 (Nov. 1978), 668-674. Presents an adaptive Huffman coding algorithm, and derives new bounds on the redundancy of Huffman codes.
8. Held, G. Data Compression: Techniques and Applications. Wiley, New York, 1984. Explains a number of ad hoc techniques for compressing text.
9. Hester, J.H., and Hirschberg, D.S. Self-organizing linear search. ACM Comput. Surv. 17, 3 (Sept. 1985), 295-311. A general analysis of the technique used in the present article to expedite access to an array of dynamically changing frequency counts.
10. Huffman, D.A. A method for the construction of minimum-redundancy codes. Proc. Inst. Electr. Radio Eng. 40, 9 (Sept. 1952), 1098-1101. The classic paper in which Huffman introduced his famous coding method.
11. Hunter, R., and Robinson, A.H. International digital facsimile coding standards. Proc. Inst. Electr. Electron. Eng. 68, 7 (July 1980), 854-867. Describes the use of Huffman coding to compress run lengths in black/white images.
12. Kucera, H., and Francis, W.N. Computational Analysis of Present-Day American English. Brown University Press, Providence, R.I., 1967. This large corpus of English is often used for computer experiments with text, including some of the evaluation reported in the present article.
13. Langdon, G.G. An introduction to arithmetic coding. IBM J. Res. Dev. 28, 2 (Mar. 1984), 135-149. Introduction to arithmetic coding from the point of view of hardware implementation.
14. Langdon, G.G., and Rissanen, J. Compression of black-white images with arithmetic coding. IEEE Trans. Commun. COM-29, 6 (June 1981), 858-867. Uses a modeling method specially tailored to black/white pictures, in conjunction with arithmetic coding, to achieve excellent compression results.
15. Pasco, R. Source coding algorithms for fast data compression. Ph.D. thesis, Dept. of Electrical Engineering, Stanford Univ., Stanford, Calif., 1976. An early exposition of the idea of arithmetic coding, but lacking the idea of incremental operation.
16. Rissanen, J.J. Generalized Kraft inequality and arithmetic coding. IBM J. Res. Dev. 20 (May 1976), 198-203. Another early exposition of the idea of arithmetic coding.
17. Rissanen, J.J. Arithmetic codings as number representations. Acta Polytech. Scand. Math. 31 (Dec. 1979), 44-51. Further develops arithmetic coding as a practical technique for data representation.
18. Rissanen, J., and Langdon, G.G. Arithmetic coding. IBM J. Res. Dev. 23, 2 (Mar. 1979), 149-162. Describes a broad class of arithmetic codes.
19. Rissanen, J., and Langdon, G.G. Universal modeling and coding. IEEE Trans. Inf. Theory IT-27, 1 (Jan. 1981), 12-23. Shows how data compression can be separated into modeling for prediction and coding with respect to a model.
20. Rubin, F. Arithmetic stream coding using fixed precision registers. IEEE Trans. Inf. Theory IT-25, 6 (Nov. 1979), 672-675. One of the first papers to present all the essential elements of practical arithmetic coding, including fixed-point computation and incremental operation.
21. Shannon, C.E., and Weaver, W. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949. A classic book that develops communication theory from the ground up.
22. Welch, T.A. A technique for high-performance data compression. Computer 17, 6 (June 1984), 8-19. A very fast coding technique based on the method of [23], but whose compression performance is poor by the standards of [4] and [6]. An improved implementation of this method is widely used in UNIX systems under the name compress.
23. Ziv, J., and Lempel, A. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory IT-24, 5 (Sept. 1978), 530-536. Describes a method of text compression that works by replacing a substring with a pointer to an earlier occurrence of the same substring. Although it performs quite well, it does not provide a clear separation between modeling and coding.

CR Categories and Subject Descriptors: E.4 [Data]: Coding and Information Theory - data compaction and compression; H.1.1 [Models and Principles]: Systems and Information Theory - information theory
General Terms: Algorithms, Performance
Additional Key Words and Phrases: Adaptive modeling, arithmetic coding, Huffman coding

Received 8/86; accepted 12/86

Authors' Present Address: Ian H. Witten, Radford M. Neal, and John G. Cleary, Dept. of Computer Science, The University of Calgary, 2500 University Drive NW, Calgary, Canada T2N 1N4.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

In response to membership requests . . .

CURRlCUl.A RECOMMENDATIONS FOR COMPUTING;

Volume I: Curricula Recommendations for Computer Science Volume II: Curricula Recommendations for Information Systems Volume III: Curricula Recommendations for kelated Computer Science Programs in Vocational- Technical Schools, Community and Junior Colleges and Health Computing Information available from DeborahCotton-Single Copy Sales(212) 869-7440 ext. 309
