Analysis of Arithmetic Coding
for Data Compression
Paul G. Howard and Jeffrey Scott Vitter
Brown University
Department of Computer Science
Technical Report No. CS-92-17
Revised version, April 1992
Formerly Technical Report No. CS-91-03
Appears in Information Processing and Management,
Volume 28, Number 6, November 1992, pages 749-763.
A shortened version appears in the proceedings of the
IEEE Computer Society/NASA/CESDIS Data Compression Conference,
Snowbird, Utah, April 8-11, 1991, pages 3-12.
Analysis of Arithmetic Coding
for Data Compression^1
Paul G. Howard^2 and Jeffrey Scott Vitter^3
Department of Computer Science
Brown University
Providence, R.I. 02912-1910
Abstract

Arithmetic coding, in conjunction with a suitable probabilistic model, can provide nearly optimal data compression. In this article we analyze the effect that the model and the particular implementation of arithmetic coding have on the code length obtained. Periodic scaling is often used in arithmetic coding implementations to reduce time and storage requirements; it also introduces a recency effect which can further affect compression. Our main contribution is introducing the concept of weighted entropy and using it to characterize in an elegant way the effect that periodic scaling has on the code length. We explain why and by how much scaling increases the code length for files with a homogeneous distribution of symbols, and we characterize the reduction in code length due to scaling for files exhibiting locality of reference. We also give a rigorous proof that the coding effects of rounding scaled weights, using integer arithmetic, and encoding end-of-file are negligible.

Index terms: Data compression, arithmetic coding, analysis of algorithms, adaptive modeling.
^1 A similar version of this paper appears in Information Processing and Management, 28:6, November 1992, 749-763. A shortened version of this paper appears in the proceedings of the IEEE Computer Society/NASA/CESDIS Data Compression Conference, Snowbird, Utah, April 8-11, 1991, 3-12.

^2 Support was provided in part by NASA Graduate Student Researchers Program grant NGT-50420 and by an NSF Presidential Young Investigators Award with matching funds from IBM. Additional support was provided by a Universities Space Research Association/CESDIS appointment.

^3 Support was provided in part by a National Science Foundation Presidential Young Investigator Award grant with matching funds from IBM, by NSF grant CCR-9007851, by Army Research Office grant DAAL03-91-G-0035, and by the Office of Naval Research and the Defense Advanced Research Projects Agency under contract N00014-83-J-4052, ARPA order 8225. Additional support was provided by a Universities Space Research Association/CESDIS associate membership.
1 Introduction

We analyze the amount of compression possible when arithmetic coding is used for text compression in conjunction with various input models. Arithmetic coding is a technique for statistical lossless encoding. It can be thought of as a generalization of Huffman coding [14] in which probabilities are not constrained to be integral powers of 2, and code lengths need not be integers.

The basic algorithm for encoding using arithmetic coding works as follows:

1. We begin with a "current interval" initialized to [0, 1).

2. For each symbol of the file, we do two steps:

   (a) We subdivide the current interval into subintervals, one for each possible alphabet symbol. The size of a symbol's subinterval is proportional to the estimated probability that the symbol will be the next symbol in the file, according to the input model.

   (b) We select the subinterval corresponding to the symbol that actually occurs next in the file, and make it the new current interval.

3. We transmit enough bits to distinguish the final current interval from all other possible final intervals.

The length of the final subinterval is clearly equal to the product of the probabilities of the individual symbols, which is the probability p of the particular sequence of symbols in the file. The final step uses almost exactly -lg p bits to distinguish the file from all other possible files. For detailed descriptions of arithmetic coding, see [17] and especially [34].
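The three steps above can be sketched directly with exact rational arithmetic. The following is a minimal illustration, not the paper's implementation; the function name and the use of a static weight table are our own choices, and the sketch returns the final interval instead of emitting bits.

```python
from fractions import Fraction

def encode_interval(symbols, alphabet, weights):
    """Exact arithmetic coding: shrink the current interval once per symbol.

    `weights` maps each alphabet symbol to a (static) integer weight;
    a symbol's subinterval is proportional to its weight.  Returns the
    final interval [low, high) rather than transmitting bits.
    """
    low, width = Fraction(0), Fraction(1)
    for s in symbols:
        total = sum(weights[a] for a in alphabet)
        cum = 0
        for a in alphabet:           # step (a): subdivide the current interval
            if a == s:               # step (b): keep the occurring symbol's part
                low += width * Fraction(cum, total)
                width *= Fraction(weights[a], total)
                break
            cum += weights[a]
    return low, low + width

low, high = encode_interval("aab", "ab", {"a": 2, "b": 1})
# the interval width equals the product of the probabilities: (2/3)(2/3)(1/3) = 4/27
```

Because the width is the product of the symbol probabilities p, distinguishing the final interval takes almost exactly -lg p bits, as stated above.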
We use the following notation throughout this article:

    t = length of the file, in bytes;
    n = number of symbols in the input alphabet;
    k = number of different alphabet symbols that occur in the file;
    c_i = number of occurrences of the ith alphabet symbol in the file;
    lg x = log_2 x;
    A^(B) = A(A+1)...(A+B-1), the rising factorial function.

Results for a "typical file" refer to a file with t = 100,000, n = 256, and k = 100. Code lengths are expressed in bits. We assume 8-bit bytes.
1.1 Modeling effects

The coder in an arithmetic coding system must work in conjunction with a model that produces probability information, describing a file as a sequence of decisions; at each decision point it estimates the probability of each possible choice. The coder then uses the set of probabilities to encode the actual decision.

To ensure decodability, the model may use only information known by both the encoder and the decoder. Otherwise there are no restrictions on the model; in particular, it can change as the file is being encoded. In this subsection we describe several typical models for context-independent text compression.

The length of the encoded file depends on the model and how it is used. Most models for text compression involve estimating the probability p of a symbol by

    p = (weight of symbol) / (total weight of all symbols),

which we can then encode in -lg p bits using exact arithmetic coding. The probability estimation can be done adaptively (dynamically estimating the probability of each symbol based on all symbols that precede it), semi-adaptively (using a preliminary pass of the input file to gather statistics), or non-adaptively (using fixed probabilities for all files). Non-adaptive models are not very interesting, since their effectiveness depends only on how well their probabilities happen to match the statistics of the file being encoded; Bell, Cleary, and Witten show that the match can be arbitrarily bad [2].
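The adaptive variant of this weight-based estimate can be sketched in a few lines. This is an illustration under our own naming, using the initial weight of 1 per symbol that Section 1 adopts later; it accumulates the ideal code length -lg p per symbol.

```python
from math import log2

def adaptive_code_length(text, alphabet):
    """Ideal adaptive code length in bits: sum of -lg p over the file,
    with p = weight(symbol) / total weight, and every alphabet symbol
    starting with weight 1 (the zero-frequency convention used below)."""
    w = {a: 1 for a in alphabet}
    total, bits = len(alphabet), 0.0
    for s in text:
        bits += -log2(w[s] / total)  # -lg p bits for this symbol
        w[s] += 1                    # adapt: count the occurrence
        total += 1
    return bits
```

For the 8-symbol file "abacbcba" of Example 1 (without scaling), the successive probabilities are 1/3, 1/4, 2/5, 1/6, 2/7, 2/8, 3/9, 3/10, so the total is lg 25200, about 14.62 bits.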
Static and decrementing semi-adaptive codes. Semi-adaptive codes are conceptually simple, and useful when real-time operation is not required; their main drawback is that they require knowledge of the file statistics prior to encoding. The statistics collected during the first pass are normally used in a static code; that is, the probabilities used to encode the file remain the same during the entire encoding. It is possible to obtain better compression by using a decrementing code, dynamically adjusting the probabilities to reflect the statistics of just that part of the file not yet coded.

Assuming that encoding uses exact arithmetic, and that there are no computational artifacts, we can use the file statistics to form a static probabilistic model. Not including the cost of transmitting the model, the code length L_SS for a static semi-adaptive code is

    L_SS = -lg prod_{i=1}^{n} (c_i/t)^{c_i}
         = t lg t - sum_{i=1}^{n} c_i lg c_i,

the information content of the file. Dividing by the file length gives the self-entropy of the file, and forms the basis for claims that arithmetic coding encodes asymptotically close to the entropy. What these claims really mean is that arithmetic coding can encode close to the entropy of a probabilistic model whose symbol probabilities are the same as those of the file, because the computational effects discussed in Section 3 are insignificant.
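The closed form for L_SS is easy to evaluate from a count vector. A minimal sketch (the function name is ours):

```python
from math import log2

def static_code_length(counts):
    """L_SS = t lg t - sum(c_i lg c_i): the ideal code length, in bits,
    of a static semi-adaptive code (model transmission cost excluded)."""
    t = sum(counts)
    return t * log2(t) - sum(c * log2(c) for c in counts if c > 0)
```

For example, a 4-symbol file with counts (2, 2) has L_SS = 4 lg 4 - 2 lg 2 - 2 lg 2 = 4 bits, i.e. 1 bit per symbol, matching the self-entropy of a uniform two-symbol source.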
If we know a file's exact statistics ahead of time, we can get improved compression by using a decrementing code. We modify the symbol weights dynamically, decrementing each symbol's weight each time it is encoded, to reflect symbol frequencies for just that part of the file not yet encoded; hence for each symbol we always use the best available estimates of the next-symbol probabilities. In a sense it is a misnomer to call this a "semi-adaptive" model, since the model adapts throughout the second pass, but we apply the term here since the symbol probability estimates are based primarily on the first pass. The decrementing count idea appears in the analysis of enumerative codes by Cleary and Witten [7]. The resulting code length for a semi-adaptive decrementing code is

    L_SD = -lg ( (prod_{i=1}^{k} c_i!) / t! ).     (1)

It is straightforward to show that for all input files, the code length of a semi-adaptive decrementing code is at most that of a semi-adaptive static code, equality holding only when the file consists of repeated occurrences of a single letter. This does not contradict Shannon's theorem [30]; he discusses only the best static code.

Static semi-adaptive codes have been widely used in conjunction with Huffman coding, where they are appropriate since changing weights often requires changing the structure of the coding tree.
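Equation (1) is just the logarithm of a multinomial coefficient, so it can be checked exactly with integer arithmetic. The sketch below (our own naming) also illustrates the claim that L_SD is at most L_SS, with equality only for a single-letter file:

```python
from math import factorial, log2

def decrementing_code_length(counts):
    """L_SD = lg( t! / prod(c_i!) ), Equation (1): the ideal code
    length of a semi-adaptive decrementing code, in bits."""
    t = sum(counts)
    denom = 1
    for c in counts:
        denom *= factorial(c)
    return log2(factorial(t) // denom)  # t!/prod(c_i!) is always an integer

# Comparison with the static code length L_SS on a hypothetical count vector:
counts = [6, 3, 1]
L_SD = decrementing_code_length(counts)
t = sum(counts)
L_SS = t * log2(t) - sum(c * log2(c) for c in counts)
# here L_SD < L_SS; they coincide only when the file repeats a single letter
```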
Encoding the model. If we assume that all symbol distributions are equally likely for a given file length t, the cost of transmitting the exact model statistics is

    L_M = lg (number of possible distributions)
        = lg C(t+n-1, n-1)                         (2)
        ≈ n lg(et/n).

A similar result appears in [2] and [7]. For a typical file, L_M is only about 2560 bits, or 320 bytes. The assumption of equally likely distributions is not very good for text files; in practice we can reduce the cost of encoding the model by 50 percent or more by encoding each of the counts using a suitable encoding of the integers, such as Fibonacci coding [12].

Strictly speaking, we must also encode the file length t before encoding the model; the cost is insignificant, between lg t and 2 lg t bits using an appropriate encoding of integers [11,31,32].
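Both the exact binomial form and the n lg(et/n) approximation of Equation (2) are easy to evaluate for the "typical file" of Section 1. A quick check (the variable names are ours; Python's `math.log2` accepts arbitrarily large integers):

```python
from math import comb, e, log2

# Model-transmission cost for the "typical file": t = 100,000, n = 256.
t, n = 100_000, 256
L_M = log2(comb(t + n - 1, n - 1))   # lg(number of possible count distributions)
approx = n * log2(e * t / n)         # the n lg(et/n) approximation
# L_M comes out near 2560 bits (about 320 bytes), with the closed-form
# approximation within a few tens of bits of the exact value
```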
Adaptive codes. Adaptive codes use a continually changing model, in which we use the frequency of each symbol up to a given point in the file as an estimate of its probability at that point.

We can encode a symbol using arithmetic coding only if its probability is nonzero, but in an adaptive code we have no way of estimating the probability of a symbol before it has occurred for the first time. This is the zero-frequency problem, discussed at length in [1,2,33]. For large files with small alphabets and simple models, all solutions to this problem give roughly the same compression. In this section we adopt the solution used in [34], simply assigning an initial weight of 1 to all alphabet symbols. The code length using this adaptive code is

    L_A = -lg ( (prod_{i=1}^{k} c_i!) / n^(t) ).     (3)

By combining Equations (1), (2), and (3) and noting that C(t+n-1, n-1) = n^(t)/t!, we can show that for all input files, the adaptive code with initial 1-weights gives the same code length as the semi-adaptive decrementing code in which the input model is encoded based on the assumption that all symbol distributions are equally likely. In other words, L_A = L_SD + L_M. This result is found in Rissanen [23,24]. Cleary and Witten [7] and Bell, Cleary, and Witten [2] present a similar result in a more general setting, showing approximate equality between enumerative codes (which are similar to arithmetic codes) and adaptive codes. Intuitively, the reason for the equality is that in the adaptive code, the cost of "learning" the model is not avoided, but merely spread over the entire file [25].
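The identity L_A = L_SD + L_M can be verified numerically on a small example. The sketch below uses a hypothetical file of 7 symbols over a 5-letter alphabet; the helper names are ours:

```python
from math import comb, factorial, log2

def rising(a, b):
    """Rising factorial a^(b) = a(a+1)...(a+b-1)."""
    r = 1
    for j in range(b):
        r *= a + j
    return r

counts, n = [4, 2, 1], 5             # hypothetical counts; t = 7 symbols
t = sum(counts)
prod_fact = 1
for c in counts:
    prod_fact *= factorial(c)

L_A  = log2(rising(n, t)) - log2(prod_fact)   # Equation (3)
L_SD = log2(factorial(t) // prod_fact)        # Equation (1)
L_M  = log2(comb(t + n - 1, n - 1))           # Equation (2), exact form
# Since C(t+n-1, n-1) = n^(t)/t!, we get L_A = L_SD + L_M exactly.
```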
Organization of this article. Section 2 contains our main result, which precisely and provably characterizes the code length of a file dynamically coded with periodic count-scaling. We express the code length in terms of "weighted entropies" of the model, which are the entropies implied by the model at the scaling times. Our result shows both the advantage to be gained by scaling (because of a locality-of-reference effect) and the excess code length incurred by the overhead of the scaling process. For example, scaling has the most negative effect when the alphabet symbols are distributed homogeneously throughout the file, and our result shows explicitly the small amount by which scaling can cause the code length to increase. However, when the distribution of the alphabet symbols varies, as is often the case in files displaying locality of reference, our result characterizes precisely the benefits of scaling on code length. In Section 2.2 we extend this analysis to higher-order models based on the partial string matching algorithm of Cleary and Witten.

Through the years, practical adjustments have been made to arithmetic coding [21,26,28,34] to allow the use of integer rather than rational or floating point arithmetic, and to transmit output bits almost in real time instead of all at the end. In Section 3 we prove that the loss of coding efficiency caused by practical coding requirements is negligible, thus demonstrating rigorously the empirical claims made in [2,34].
2 Scaling

Scaling is the process in which we periodically reduce the weights of all symbols. It allows us to use lower-precision arithmetic at the expense of making the model more approximate, which can hurt compression when the distribution of symbols in the file is fairly homogeneous. Scaling also introduces a locality of reference (or recency) effect, which often improves compression when the distribution of symbols is variable. In this section we give a precise characterization of the effect of scaling on the code length produced by an adaptive model. We express the code length in terms of the weighted entropies of the model. The weighted entropy is the Shannon entropy, computed using probabilities weighted according to the scaling process; the term "weighted entropy" is a notational convenience. Our characterization explains why and by how much scaling can hurt or help compression.

In most text files we find that most of the occurrences of at least some words are clustered in one part of the file. We can take advantage of this locality by assigning more weight to recent occurrences of a symbol in a dynamic model. In practice there are several ways to do this:

- Periodically restarting the model. This often discards too much information to be effective, although Cormack and Horspool find that it gives good results when growing large dynamic Markov models [8].

- Using a sliding window on the text [15]. This requires excessive computational resources.

- Recency rank coding [4,10,29]. This is computationally simple but corresponds to a rather coarse model of recency.

- Exponential aging (giving exponentially increasing weights to successive symbols) [9,20]. This is moderately difficult to implement because of the changing weight increments.

- Periodic scaling [34]. This is simple to implement, fast and effective in operation, and amenable to analysis. It also has the computationally desirable property of keeping the symbol weights small. In effect, scaling is a practical version of exponential aging. This is the method that we analyze.
In our discussion of scaling, we assume that a file is divided into blocks of length B. Our model assumes that at the end of each block, the weights for all symbols are multiplied by a scaling factor f, usually 1/2. Within a block we update the symbol weights by adding 1 for each occurrence.

Example 1: We illustrate an adaptive code with scaling, encoding the 8-symbol file "abacbcba". In this example, we start with counts c_a = 1, c_b = 1, and c_c = 1, and set the scaling threshold at 10, so we scale only once, just before the last symbol in the file. To retain integer weights without allowing any weight to fall to 0, we round all fractional weights obtained during scaling to the next higher integer.
    Symbol    Current Interval                 p_a    p_b    p_c
    Start     [0.000000000, 1.000000000)       1/3    1/3    1/3
    a         [0.000000000, 0.333333333)       2/4    1/4    1/4
    b         [0.166666667, 0.250000000)       2/5    2/5    1/5
    a         [0.166666667, 0.200000000)       3/6    2/6    1/6
    c         [0.194444444, 0.200000000)       3/7    2/7    2/7
    b         [0.196825397, 0.198412698)       3/8    3/8    2/8
    c         [0.198015873, 0.198412698)       3/9    3/9    3/9
    b         [0.198148148, 0.198280423)       3/10   4/10   3/10
    Scaling                                    2/6    2/6    2/6
    a         [0.198148148, 0.198192240)       3/7    2/7    2/7
The final interval is [0.00110 01010 11100 11101, 0.00110 01010 11110 01011] in binary. The theoretical code length is -lg 0.000044092 = 14.469 bits. The actual code length is 15 bits, since the final subinterval can be determined from the output 00110 01010 11101.
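Example 1 can be replayed exactly with rational arithmetic. The following is our own sketch, not the paper's code; it uses the same start counts of 1, scaling threshold of 10, and round-up rule as the example:

```python
from fractions import Fraction
from math import ceil, log2

def encode_with_scaling(text, alphabet, threshold=10):
    """Replay of Example 1: adaptive arithmetic coding with count scaling.

    All counts start at 1.  When the total weight reaches `threshold`,
    every weight is halved, rounding fractional weights up so that no
    weight falls to 0.  Returns the exact final interval [low, high).
    """
    w = {a: 1 for a in alphabet}
    low, width = Fraction(0), Fraction(1)
    for s in text:
        if sum(w.values()) >= threshold:             # periodic scaling step
            w = {a: ceil(Fraction(c, 2)) for a, c in w.items()}
        total = sum(w.values())
        cum = 0
        for a in alphabet:                           # subintervals in order a, b, c
            if a == s:
                low += width * Fraction(cum, total)
                width *= Fraction(w[a], total)
                break
            cum += w[a]
        w[s] += 1                                    # adaptive count update
    return low, low + width

low, high = encode_with_scaling("abacbcba", "abc")
# the final width is exactly 1/22680, so the theoretical code length
# is lg 22680 = 14.469 bits, matching the table above
```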
In this section we assume an adaptive model and exact rational arithmetic. We introduce the following additional notation for the analysis of scaling:

    a_i = the ith alphabet symbol that occurs in the block;
    s_i = weight of symbol a_i at the start of the block;
    c_i = number of occurrences of symbol a_i in the block;
    A = sum_{i=1}^{n} s_i (the total weight at the start of the block);
    B = sum_{i=1}^{n} c_i (the size of the block);
    C = A + B (the total weight at the end of the block);
    f = A/C (the scaling factor);
    q_i = s_i/A (probability of symbol a_i at the start of the block);
    r_i = (s_i + c_i)/C (probability of symbol a_i at the end of the block);
    b = number of blocks resulting from scaling;
    m = the minimum weight allowed for any symbol, usually 1.

When f = 1/2, we scale by halving all weights every B symbols, so A = B and b ≈ t/B. In a typical implementation, A = B = 8192.
2.1 Coding Theorem for Scaling

In our scaling model each symbol's counts are multiplied by f whenever scaling takes place. Thus scaling has the effect of giving more weight to more recent symbols. If we denote the number of occurrences of symbol a_i in block m by c_{i,m}, the weight w_{i,m} of symbol a_i after m blocks is given by

    w_{i,m} = c_{i,m} + f w_{i,m-1}
            = c_{i,m} + f c_{i,m-1} + f^2 c_{i,m-2} + ....     (4)

The weighted probability p_{i,m} of symbol a_i at the end of block m is then

    p_{i,m} = w_{i,m} / sum_{i=1}^{n} w_{i,m}.     (5)
We now define a weighted entropy in terms of the weighted probabilities:

Definition 1  The weighted entropy of a file at the end of the mth block, denoted by H_m, is the entropy implied by the probability distribution at that time, computed according to the scaling model. That is,

    H_m = - sum_{i=1}^{n} p_{i,m} lg p_{i,m},

where the p_{i,m} are given by Equation (5).
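Equations (4) and (5) and Definition 1 translate into a short computation. This is an illustrative sketch with our own naming; it takes a list of per-block count vectors (hypothetical data, with weights taken as 0 before the first block) and returns H_1, ..., H_b:

```python
from math import log2

def weighted_entropies(block_counts, f=0.5):
    """H_1..H_b: entropies of the weighted probabilities, where the aged
    weight is w_{i,m} = c_{i,m} + f * w_{i,m-1}, per Equations (4)-(5)."""
    w = [0.0] * len(block_counts[0])
    hs = []
    for counts in block_counts:
        w = [c + f * wi for c, wi in zip(counts, w)]    # Equation (4)
        total = sum(w)
        hs.append(-sum((wi / total) * log2(wi / total)  # Equation (5) + Def. 1
                       for wi in w if wi > 0))
    return hs
```

A uniform block such as counts (1, 1) gives H = 1 bit; a block where only one symbol occurs gives H = 0; and a skewed block following a uniform one gives an intermediate weighted entropy, since the older counts still carry weight f.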
We find that l_m, the average code length per symbol for block m, is related to the starting weighted entropy H_{m-1} and the ending weighted entropy H_m in a particularly simple way:

    l_m ≈ (1/(1-f)) H_m - (f/(1-f)) H_{m-1}
        = H_m + (f/(1-f)) (H_m - H_{m-1}).

Letting f = 1/2, we obtain

    l_m ≈ 2 H_m - H_{m-1}.

When we multiply by the block length and sum over all blocks, we obtain the following precise and elegant characterization of the code length in terms of weighted entropies:
Theorem 1  Let L be the code length of a file compressed with arithmetic coding using an adaptive model with scaling. Then

    L ≈ B ( sum_{m=1}^{b} H_m + H_b - H_0 ),

with lower-order error terms involving t, k/B, and ...