Analysis of Arithmetic Coding for Data Compression
Paul G. Howard and Jeffrey Scott Vitter
Brown University, Department of Computer Science
Technical Report No. CS-92-17
Revised version, April 1992
Formerly Technical Report No. CS-91-03

Appears in Information Processing and Management, Volume 28, Number 6, November 1992, pages 749-763. A shortened version appears in the proceedings of the IEEE Computer Society/NASA/CESDIS Data Compression Conference, Snowbird, Utah, April 8-11, 1991, pages 3-12.

Analysis of Arithmetic Coding for Data Compression^1

Paul G. Howard^2    Jeffrey Scott Vitter^3
Department of Computer Science
Brown University
Providence, R.I. 02912-1910

Abstract

Arithmetic coding, in conjunction with a suitable probabilistic model, can provide nearly optimal data compression. In this article we analyze the effect that the model and the particular implementation of arithmetic coding have on the code length obtained. Periodic scaling is often used in arithmetic coding implementations to reduce time and storage requirements; it also introduces a recency effect which can further affect compression. Our main contribution is introducing the concept of weighted entropy and using it to characterize in an elegant way the effect that periodic scaling has on the code length. We explain why and by how much scaling increases the code length for files with a homogeneous distribution of symbols, and we characterize the reduction in code length due to scaling for files exhibiting locality of reference. We also give a rigorous proof that the coding effects of rounding scaled weights, using integer arithmetic, and encoding end-of-file are negligible.

Index terms: Data compression, arithmetic coding, analysis of algorithms, adaptive modeling.

^1 A similar version of this paper appears in Information Processing and Management, 28:6, November 1992, 749-763.
A shortened version of this paper appears in the proceedings of the IEEE Computer Society/NASA/CESDIS Data Compression Conference, Snowbird, Utah, April 8-11, 1991, 3-12.

^2 Support was provided in part by NASA Graduate Student Researchers Program grant NGT-50420 and by an NSF Presidential Young Investigators Award with matching funds from IBM. Additional support was provided by a Universities Space Research Association/CESDIS appointment.

^3 Support was provided in part by a National Science Foundation Presidential Young Investigator Award grant with matching funds from IBM, by NSF grant CCR-9007851, by Army Research Office grant DAAL03-91-G-0035, and by the Office of Naval Research and the Defense Advanced Research Projects Agency under contract N00014-83-J-4052, ARPA order 8225. Additional support was provided by a Universities Space Research Association/CESDIS associate membership.

1 Introduction

We analyze the amount of compression possible when arithmetic coding is used for text compression in conjunction with various input models. Arithmetic coding is a technique for statistical lossless encoding. It can be thought of as a generalization of Huffman coding [14] in which probabilities are not constrained to be integral powers of 2, and code lengths need not be integers.

The basic algorithm for encoding using arithmetic coding works as follows:

1. We begin with a "current interval" initialized to [0..1].

2. For each symbol of the file, we do two steps:

   (a) We subdivide the current interval into subintervals, one for each possible alphabet symbol. The size of a symbol's subinterval is proportional to the estimated probability that the symbol will be the next symbol in the file, according to the input model.

   (b) We select the subinterval corresponding to the symbol that actually occurs next in the file, and make it the new current interval.

3. We transmit enough bits to distinguish the final current interval from all other possible final intervals.

The length of the final subinterval is clearly equal to the product of the probabilities of the individual symbols, which is the probability p of the particular sequence of symbols in the file. The final step uses almost exactly -lg p bits to distinguish the file from all other possible files. For detailed descriptions of arithmetic coding, see [17] and especially [34].

We use the following notation throughout this article:

   t    = length of the file, in bytes;
   n    = number of symbols in the input alphabet;
   k    = number of different alphabet symbols that occur in the file;
   c_i  = number of occurrences of the ith alphabet symbol in the file;
   lg x = log_2 x;
   A^{\overline{B}} = A(A+1)...(A+B-1), the rising factorial function.

Results for a "typical file" refer to a file with t = 100,000, n = 256, and k = 100. Code lengths are expressed in bits. We assume 8-bit bytes.

1.1 Modeling effects

The coder in an arithmetic coding system must work in conjunction with a model that produces probability information, describing a file as a sequence of decisions; at each decision point it estimates the probability of each possible choice. The coder then uses the set of probabilities to encode the actual decision.

To ensure decodability, the model may use only information known by both the encoder and the decoder. Otherwise there are no restrictions on the model; in particular, it can change as the file is being encoded. In this subsection we describe several typical models for context-independent text compression. The length of the encoded file depends on the model and how it is used.

Most models for text compression involve estimating the probability p of a symbol by

   p = (weight of symbol) / (total weight of all symbols),

which we can then encode in -lg p bits using exact arithmetic coding.
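The interval-narrowing procedure of steps 1-3 can be sketched with exact rational arithmetic. This is only an illustrative toy, not the paper's implementation: the fixed three-symbol model, the alphabetical subinterval ordering, and all names are our own assumptions.

```python
import math
from fractions import Fraction

def encode_interval(symbols, model):
    """Narrow [0, 1] to the subinterval identified by the symbol sequence.

    `model` maps each symbol to its estimated probability; the
    probabilities are held fixed here, though in general the model
    may change as the file is encoded.
    """
    low, width = Fraction(0), Fraction(1)
    for s in symbols:
        # Subdivide the current interval in a fixed (alphabetical) order.
        cum = Fraction(0)
        for sym in sorted(model):
            if sym == s:
                break
            cum += model[sym]
        low += width * cum    # left edge of the chosen subinterval
        width *= model[s]     # its size is proportional to P(symbol)
    return low, width

# The final width equals the product of the symbol probabilities, i.e.
# the probability p of the whole sequence, so about -lg p bits suffice.
model = {'a': Fraction(1, 2), 'b': Fraction(1, 4), 'c': Fraction(1, 4)}
low, width = encode_interval("aab", model)
print(width)                         # 1/16
print(math.ceil(-math.log2(width)))  # 4
```

Exact `Fraction` arithmetic keeps the example faithful to the "exact arithmetic" idealization used in the analysis; practical coders use scaled integer arithmetic instead, as discussed later in the paper.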
The probability estimation can be done adaptively (dynamically estimating the probability of each symbol based on all symbols that precede it), semi-adaptively (using a preliminary pass of the input file to gather statistics), or non-adaptively (using fixed probabilities for all files). Non-adaptive models are not very interesting, since their effectiveness depends only on how well their probabilities happen to match the statistics of the file being encoded; Bell, Cleary, and Witten show that the match can be arbitrarily bad [2].

Static and decrementing semi-adaptive codes. Semi-adaptive codes are conceptually simple, and useful when real-time operation is not required; their main drawback is that they require knowledge of the file statistics prior to encoding. The statistics collected during the first pass are normally used in a static code; that is, the probabilities used to encode the file remain the same during the entire encoding. It is possible to obtain better compression by using a decrementing code, dynamically adjusting the probabilities to reflect the statistics of just that part of the file not yet coded.

Assuming that encoding uses exact arithmetic, and that there are no computational artifacts, we can use the file statistics to form a static probabilistic model. Not including the cost of transmitting the model, the code length L_SS for a static semi-adaptive code is

   L_SS = -lg prod_{i=1}^{n} (c_i / t)^{c_i}
        = t lg t - sum_{i=1}^{n} c_i lg c_i,

the information content of the file. Dividing by the file length gives the self-entropy of the file, and forms the basis for claims that arithmetic coding encodes asymptotically close to the entropy. What these claims really mean is that arithmetic coding can encode close to the entropy of a probabilistic model whose symbol probabilities are the same as those of the file, because computational effects (discussed in Section 3) are insignificant.
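The formula for L_SS can be evaluated directly from the symbol counts. A minimal sketch; the example string is ours, chosen purely for illustration:

```python
import math
from collections import Counter

def static_code_length(data):
    """L_SS = t*lg t - sum_i c_i*lg c_i, the code length of a static
    semi-adaptive code, excluding the cost of transmitting the model."""
    counts = Counter(data)
    t = len(data)
    return t * math.log2(t) - sum(c * math.log2(c) for c in counts.values())

# A file of six 'a's and two 'b's: 8*lg 8 - 6*lg 6 - 2*lg 2, about 6.49
# bits in total, i.e. roughly 0.81 bits per symbol (the self-entropy).
print(round(static_code_length("aaaaaabb"), 2))  # 6.49
```

Note that a file consisting of a single repeated letter has c_1 = t, giving L_SS = 0: with the model known in advance, such a file carries no information.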
If we know a file's exact statistics ahead of time, we can get improved compression by using a decrementing code. We modify the symbol weights dynamically, decrementing each symbol's weight each time it is encoded, to reflect symbol frequencies for just that part of the file not yet encoded; hence for each symbol we always use the best available estimates of the next-symbol probabilities. In a sense, it is a misnomer to call this a "semi-adaptive" model, since the model adapts throughout the second pass, but we apply the term here since the symbol probability estimates are based primarily on the first pass. The decrementing count idea appears in the analysis of enumerative codes by Cleary and Witten [7]. The resulting code length for a semi-adaptive decrementing code is

   L_SD = -lg ( ( prod_{i=1}^{k} c_i! ) / t! ).    (1)

It is straightforward to show that for all input files, the code length of a semi-adaptive decrementing code is at most that of a semi-adaptive static code, equality holding only when the file consists of repeated occurrences of a single letter.
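Equation (1) and the inequality L_SD <= L_SS can be checked numerically. A sketch under illustrative counts of our own choosing (not data from the paper):

```python
import math
from collections import Counter

def lg_factorial(m):
    """lg(m!) via the log-gamma function, avoiding huge integers."""
    return math.lgamma(m + 1) / math.log(2)

def decrementing_code_length(data):
    """L_SD = lg( t! / (c_1! * ... * c_k!) ), per equation (1)."""
    counts = Counter(data)
    return lg_factorial(len(data)) - sum(lg_factorial(c) for c in counts.values())

def static_code_length(data):
    """L_SS = t*lg t - sum_i c_i*lg c_i, for comparison."""
    counts = Counter(data)
    t = len(data)
    return t * math.log2(t) - sum(c * math.log2(c) for c in counts.values())

# Six 'a's and two 'b's: L_SD = lg(8!/(6!*2!)) = lg 28, about 4.81 bits,
# versus L_SS of about 6.49 bits; the decrementing code is never longer.
print(round(decrementing_code_length("aaaaaabb"), 2))  # 4.81
print(round(static_code_length("aaaaaabb"), 2))        # 6.49
# Equality holds for a file of a single repeated letter:
print(decrementing_code_length("aaaa"), static_code_length("aaaa"))  # 0.0 0.0
```

The multinomial form makes the intuition visible: t!/(c_1!...c_k!) counts the files with these exact symbol counts, so the decrementing code spends just enough bits to single out one of them.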