Analysis of Arithmetic Coding for Data Compression

Paul G. Howard and Jeffrey Scott Vitter

Brown University
Department of Computer Science

Technical Report No. CS-92-17
Revised version, April 1992
Formerly Technical Report No. CS-91-03

Appears in Information Processing and Management,
Volume 28, Number 6, November 1992, pages 749-763.

A shortened version appears in the proceedings of the
IEEE Computer Society/NASA/CESDIS Data Compression Conference,
Snowbird, Utah, April 8-11, 1991, pages 3-12.

Analysis of Arithmetic Coding for Data Compression^1

Paul G. Howard^2    Jeffrey Scott Vitter^3

Department of Computer Science
Brown University
Providence, R.I. 02912-1910

Abstract

Arithmetic coding, in conjunction with a suitable probabilistic model, can
provide nearly optimal data compression. In this article we analyze the effect
that the model and the particular implementation of arithmetic coding have on
the code length obtained. Periodic scaling is often used in arithmetic coding
implementations to reduce time and storage requirements; it also introduces a
recency effect which can further affect compression. Our main contribution is
introducing the concept of weighted entropy and using it to characterize in an
elegant way the effect that periodic scaling has on the code length. We explain
why and by how much scaling increases the code length for files with a
homogeneous distribution of symbols, and we characterize the reduction in code
length due to scaling for files exhibiting locality of reference. We also give a
rigorous proof that the coding effects of rounding scaled weights, using integer
arithmetic, and encoding end-of-file are negligible.

Index terms: Data compression, arithmetic coding, analysis of algorithms,
adaptive modeling.

^1 A similar version of this paper appears in Information Processing and Management, 28:6,
November 1992, 749-763. A shortened version of this paper appears in the proceedings of the
IEEE Computer Society/NASA/CESDIS Data Compression Conference, Snowbird, Utah,
April 8-11, 1991, 3-12.

^2 Support was provided in part by NASA Graduate Student Researchers Program grant
NGT-50420 and by an NSF Presidential Young Investigators Award with matching funds from
IBM. Additional support was provided by a Universities Space Research Association/CESDIS
appointment.

^3 Support was provided in part by a National Science Foundation Presidential Young
Investigator Award grant with matching funds from IBM, by NSF grant CCR-9007851, by Army
Research Office grant DAAL03-91-G-0035, and by the Office of Naval Research and the Defense
Advanced Research Projects Agency under contract N00014-83-J-4052, ARPA order 8225.
Additional support was provided by a Universities Space Research Association/CESDIS
associate membership.


1 Introduction

We analyze the amount of compression possible when arithmetic coding is used for
text compression in conjunction with various input models. Arithmetic coding is a
technique for statistical lossless encoding. It can be thought of as a generalization of
Huffman coding [14] in which probabilities are not constrained to be integral powers
of 2, and code lengths need not be integers.

The basic algorithm for encoding using arithmetic coding works as follows:

1. We begin with a "current interval" initialized to [0, 1).

2. For each symbol of the file, we do two steps:

   (a) We subdivide the current interval into subintervals, one for each possible
       alphabet symbol. The size of a symbol's subinterval is proportional to the
       estimated probability that the symbol will be the next symbol in the file,
       according to the input model.

   (b) We select the subinterval corresponding to the symbol that actually occurs
       next in the file, and make it the new current interval.

3. We transmit enough bits to distinguish the final current interval from all other
   possible final intervals.

The length of the final subinterval is clearly equal to the product of the probabilities
of the individual symbols, which is the probability p of the particular sequence of
symbols in the file. The final step uses almost exactly -lg p bits to distinguish the
file from all other possible files. For detailed descriptions of arithmetic coding, see
[17] and especially [34].
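To make the three steps concrete, here is a small sketch in Python (our own
illustration, not the implementation of [17] or [34]) that performs the subdivision
and selection steps with exact rational arithmetic for a fixed model; the function
name and the example model are assumptions made for the illustration.

    from fractions import Fraction
    from math import ceil, log2

    def encode_exact(symbols, prob):
        """Exact arithmetic encoding with a fixed probabilistic model.
        prob maps each alphabet symbol to its (rational) probability.
        Returns the final current interval as (low, width)."""
        low, width = Fraction(0), Fraction(1)
        for s in symbols:
            # Subdivide the current interval: each symbol gets a
            # subinterval proportional to its estimated probability.
            for a in prob:
                if a == s:
                    width *= prob[a]        # select s's subinterval
                    break
                low += width * prob[a]      # skip earlier subintervals
        return low, width

    model = {'a': Fraction(1, 2), 'b': Fraction(1, 4), 'c': Fraction(1, 4)}
    low, width = encode_exact('abca', model)
    # The final width is the product of the symbol probabilities, p;
    # transmitting the interval takes roughly -lg p bits.
    print(float(low), float(width), ceil(-log2(width)))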

We use the following notation throughout this article:

    t    = length of the file, in bytes;
    n    = number of symbols in the input alphabet;
    k    = number of different alphabet symbols that occur in the file;
    c_i  = number of occurrences of the i-th alphabet symbol in the file;
    lg x = log_2 x;
    A^{\overline{B}} = A(A+1)...(A+B-1)  (the rising factorial function).

Results for a "typical file" refer to a file with t = 100,000, n = 256, and k = 100.
Code lengths are expressed in bits. We assume 8-bit bytes.

1.1 Modeling effects

The coder in an arithmetic coding system must work in conjunction with a model
that produces probability information, describing a file as a sequence of decisions; at
each decision point it estimates the probability of each possible choice. The coder
then uses the set of probabilities to encode the actual decision.

To ensure decodability, the model may use only information known by both the
encoder and the decoder. Otherwise there are no restrictions on the model; in
particular, it can change as the file is being encoded. In this subsection we describe
several typical models for context-independent text compression.

The length of the encoded file depends on the model and how it is used. Most
models for text compression involve estimating the probability p of a symbol by

    p = (weight of symbol) / (total weight of all symbols),

which we can then encode in -lg p bits using exact arithmetic coding. The probability
estimation can be done adaptively (dynamically estimating the probability of each
symbol based on all symbols that precede it), semi-adaptively (using a preliminary
pass of the input file to gather statistics), or non-adaptively (using fixed probabilities
for all files). Non-adaptive models are not very interesting, since their effectiveness
depends only on how well their probabilities happen to match the statistics of the file
being encoded; Bell, Cleary, and Witten show that the match can be arbitrarily bad
[2].

Static and decrementing semi-adaptive codes. Semi-adaptive codes are
conceptually simple, and useful when real-time operation is not required; their main
drawback is that they require knowledge of the file statistics prior to encoding. The
statistics collected during the first pass are normally used in a static code; that is,
the probabilities used to encode the file remain the same during the entire encoding.
It is possible to obtain better compression by using a decrementing code, dynamically
adjusting the probabilities to reflect the statistics of just that part of the file not yet
coded.

Assuming that encoding uses exact arithmetic, and that there are no computational
artifacts, we can use the file statistics to form a static probabilistic model. Not
including the cost of transmitting the model, the code length L_{SS} for a static
semi-adaptive code is

    L_{SS} = -\lg \prod_{i=1}^{n} (c_i/t)^{c_i}
           = t \lg t - \sum_{i=1}^{n} c_i \lg c_i,

the information content of the file. Dividing by the file length gives the self-entropy of
the file, and forms the basis for claims that arithmetic coding encodes asymptotically
close to the entropy. What these claims really mean is that arithmetic coding can
encode close to the entropy of a probabilistic model whose symbol probabilities are
the same as those of the file, because computational effects (discussed in Section 3)
are insignificant.
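As an illustration (our own helper, not part of the original presentation), L_{SS}
can be computed directly from the symbol counts:

    from math import log2

    def static_code_length(counts):
        """L_SS = t lg t - sum c_i lg c_i, the code length of a static
        semi-adaptive code, given the file's symbol counts."""
        t = sum(counts)
        return t * log2(t) - sum(c * log2(c) for c in counts if c > 0)

    # The 8-symbol file "abacbcba": 3 a's, 3 b's, 2 c's.
    print(static_code_length([3, 3, 2]))   # about 12.49 bits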

If we know a file's exact statistics ahead of time, we can get improved compression
by using a decrementing code. We modify the symbol weights dynamically by
decrementing each symbol's weight each time it is encoded to reflect symbol
frequencies for just that part of the file not yet encoded; hence for each symbol we
always use the best available estimates of the next-symbol probabilities. In a sense,
it is a misnomer to call this a "semi-adaptive" model since the model adapts
throughout the second pass, but we apply the term here since the symbol probability
estimates are based primarily on the first pass. The decrementing count idea appears
in the analysis of enumerative codes by Cleary and Witten [7]. The resulting code
length for a semi-adaptive decrementing code is

    L_{SD} = -\lg\left( \prod_{i=1}^{k} c_i! \Big/ t! \right).               (1)

It is straightforward to show that for all input files, the code length of a
semi-adaptive decrementing code is at most that of a semi-adaptive static code,
equality holding only when the file consists of repeated occurrences of a single letter.
This does not contradict Shannon's theorem [30]; he discusses only the best static
code.

Static semi-adaptive codes have been widely used in conjunction with Huffman
coding, where they are appropriate since changing weights often requires changing
the structure of the coding tree.
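Equation 1 can be evaluated without forming large factorials by working with the
log-gamma function; the following sketch (ours) does so:

    from math import lgamma, log

    def lg_factorial(n):
        """lg(n!) via the log-gamma function."""
        return lgamma(n + 1) / log(2)

    def decrementing_code_length(counts):
        """L_SD = -lg(prod(c_i!) / t!), Equation 1."""
        t = sum(counts)
        return lg_factorial(t) - sum(lg_factorial(c) for c in counts)

    # For "abacbcba" this gives lg 560, about 9.13 bits, which is indeed
    # at most the static code length of about 12.49 bits.
    print(decrementing_code_length([3, 3, 2]))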

Encoding the model. If we assume that all symbol distributions are equally likely
for a given file length t, the cost of transmitting the exact model statistics is

    L_M = \lg (number of possible distributions)
        = \lg \binom{t+n-1}{n-1}                                             (2)
        \approx n \lg(et/n).

A similar result appears in [2] and [7]. For a typical file L_M is only about 2560
bits, or 320 bytes. The assumption of equally-likely distributions is not very good for
text files; in practice we can reduce the cost of encoding the model by 50 percent or
more by encoding each of the counts using a suitable encoding of the integers, such
as Fibonacci coding [12].

Strictly speaking we must also encode the file length t before encoding the model;
the cost is insignificant, between lg t and 2 lg t bits using an appropriate encoding of
integers [11,31,32].

Adaptive codes. Adaptive codes use a continually changing model, in which we
use the frequency of each symbol up to a given point in the file as an estimate of its
probability at that point.

We can encode a symbol using arithmetic coding only if its probability is nonzero,
but in an adaptive code we have no way of estimating the probability of a symbol
before it has occurred for the first time. This is the zero-frequency problem, discussed
at length in [1,2,33]. For large files with small alphabets and simple models, all
solutions to this problem give roughly the same compression. In this section we
adopt the solution used in [34], simply assigning an initial weight of 1 to all alphabet
symbols. The code length using this adaptive code is

    L_A = -\lg\left( \prod_{i=1}^{k} c_i! \Big/ n^{\overline{t}} \right).    (3)

By combining Equations 1, 2, and 3 and noting that \binom{t+n-1}{n-1} = n^{\overline{t}} / t!, we
can show that for all input files, the adaptive code with initial 1-weights gives the
same code length as the semi-adaptive decrementing code in which the input model
is encoded based on the assumption that all symbol distributions are equally likely.
In other words, L_A = L_{SD} + L_M. This result is found in Rissanen [23,24]. Cleary and
Witten [7] and Bell, Cleary, and Witten [2] present a similar result in a more general
setting, showing approximate equality between enumerative codes (which are similar
to arithmetic codes) and adaptive codes. Intuitively, the reason for the equality is
that in the adaptive code, the cost of "learning" the model is not avoided, but merely
spread over the entire file [25].
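The identity L_A = L_{SD} + L_M can be checked numerically with a short sketch
(our own code; the helper names are illustrative):

    from math import lgamma, log

    def lg_fact(n):
        return lgamma(n + 1) / log(2)            # lg(n!)

    def lg_rising(a, b):
        """lg of the rising factorial a(a+1)...(a+b-1)."""
        return (lgamma(a + b) - lgamma(a)) / log(2)

    def adaptive_code_length(counts, n):
        """L_A = -lg(prod(c_i!) / n^(rising t)), Equation 3."""
        t = sum(counts)
        return lg_rising(n, t) - sum(lg_fact(c) for c in counts)

    def model_cost(t, n):
        """L_M = lg binomial(t+n-1, n-1), Equation 2."""
        return lg_fact(t + n - 1) - lg_fact(n - 1) - lg_fact(t)

    counts, n = [3, 3, 2], 3
    t = sum(counts)
    L_SD = lg_fact(t) - sum(lg_fact(c) for c in counts)
    print(adaptive_code_length(counts, n))       # about 14.62 bits
    print(L_SD + model_cost(t, n))               # the same value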

Organization of this article. Section 2 contains our main result, which precisely
and provably characterizes the code length of a file dynamically coded with periodic
count-scaling. We express the code length in terms of "weighted entropies" of the
model, which are the entropies implied by the model at the scaling times. Our result
shows both the advantage to be gained by scaling because of a locality-of-reference
effect and the excess code length incurred by the overhead of the scaling process.
For example, scaling has the most negative effect when the alphabet symbols are
distributed homogeneously throughout the file, and our result shows explicitly the
small amount that scaling can cause the code length to increase. However, when the
distribution of the alphabet symbols varies, as is often the case in files displaying
locality of reference, our result characterizes precisely the benefits of scaling on code
length. In Section 2.2 we extend this analysis to higher-order models based on the
partial string matching algorithm of Cleary and Witten.

Through the years, practical adjustments have been made to arithmetic coding [21,
26,28,34] to allow the use of integer rather than rational or floating point arithmetic
and to transmit output bits almost in real time instead of all at the end. In Section 3
we prove that the loss of coding efficiency caused by practical coding requirements is
negligible, thus demonstrating rigorously the empirical claims made in [2,34].


2 Scaling

Scaling is the process in which we periodically reduce the weights of all symbols. It
allows us to use lower precision arithmetic at the expense of making the model more
approximate, which can hurt compression when the distribution of symbols in the
file is fairly homogeneous. Scaling also introduces a locality of reference (or recency)
effect, which often improves compression when the distribution of symbols is variable.
In this section we give a precise characterization of the effect of scaling on code length
produced by an adaptive model. We express the code length in terms of the weighted
entropies of the model. The weighted entropy is the Shannon entropy, computed using
probabilities weighted according to the scaling process; the term "weighted entropy"
is a notational convenience. Our characterization explains why and by how much
scaling can hurt or help compression.

In most text files we find that most of the occurrences of at least some words are
clustered in one part of the file. We can take advantage of this locality by assigning
more weight to recent occurrences of a symbol in a dynamic model. In practice there
are several ways to do this:

- Periodically restarting the model. This often discards too much information
  to be effective, although Cormack and Horspool find that it gives good results
  when growing large dynamic Markov models [8].

- Using a sliding window on the text [15]. This requires excessive computational
  resources.

- Recency rank coding [4,10,29]. This is computationally simple but corresponds
  to a rather coarse model of recency.

- Exponential aging (giving exponentially increasing weights to successive
  symbols) [9,20]. This is moderately difficult to implement because of the
  changing weight increments.

- Periodic scaling [34]. This is simple to implement, fast and effective in operation,
  and amenable to analysis. It also has the computationally desirable property
  of keeping the symbol weights small. In effect, scaling is a practical version of
  exponential aging. This is the method that we analyze.

In our discussion of scaling, we assume that a file is divided into blocks of length B.
Our model assumes that at the end of each block, the weights for all symbols are
multiplied by a scaling factor f, usually 1/2. Within a block we update the symbol
weights by adding 1 for each occurrence.

Example 1: We illustrate an adaptive code with scaling, encoding the 8-symbol file
"abacbcba". In this example, we start with counts c_a = 1, c_b = 1, and c_c = 1, and set
the scaling threshold at 10, so we scale only once, just before the last symbol in the
file. To retain integer weights without allowing any weight to fall to 0, we round all
fractional weights obtained during scaling to the next higher integer.

    Symbol    Current interval             p_a     p_b     p_c
    Start     0.000000000  1.000000000     1/3     1/3     1/3
    a         0.000000000  0.333333333     2/4     1/4     1/4
    b         0.166666667  0.250000000     2/5     2/5     1/5
    a         0.166666667  0.200000000     3/6     2/6     1/6
    c         0.194444444  0.200000000     3/7     2/7     2/7
    b         0.196825397  0.198412698     3/8     3/8     2/8
    c         0.198015873  0.198412698     3/9     3/9     3/9
    b         0.198148148  0.198280423     3/10    4/10    3/10
    Scaling                                2/6     2/6     2/6
    a         0.198148148  0.198192240     3/7     2/7     2/7

The final interval is [0.00110 01010 11100 11101, 0.00110 01010 11110 01011] in
binary. The theoretical code length is -lg 0.000044092 = 14.469 bits. The actual
code length is 15 bits, since the final subinterval can be determined from the output
00110 01010 11101. □
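A minimal sketch (ours) reproduces Example 1, assuming the scaling rule described
above: when the total weight reaches the threshold, halve all weights, rounding
fractions up so that no weight falls to 0.

    from fractions import Fraction

    def encode_with_scaling(symbols, alphabet, threshold=10):
        """Adaptive arithmetic coding with count-scaling as in Example 1:
        initial weights 1, halving (rounded up) at the threshold."""
        w = {a: 1 for a in alphabet}
        low, width = Fraction(0), Fraction(1)
        for s in symbols:
            if sum(w.values()) >= threshold:
                w = {a: (c + 1) // 2 for a, c in w.items()}   # scale
            total = sum(w.values())
            for a in alphabet:
                if a == s:
                    width *= Fraction(w[a], total)
                    break
                low += width * Fraction(w[a], total)
            w[s] += 1                                         # adapt
        return low, width

    low, width = encode_with_scaling('abacbcba', 'abc')
    print(float(low), float(low + width))  # 0.198148148..., 0.198192240...
    print(float(width))                    # 0.000044092..., so -lg p = 14.469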

In this section we assume an adaptive model and exact rational arithmetic. We
introduce the following additional notation for the analysis of scaling:

    a_i = the i-th alphabet symbol that occurs in the block;
    s_i = weight of symbol a_i at the start of the block;
    c_i = number of occurrences of symbol a_i in the block;
    A   = \sum_{i=1}^{n} s_i  (the total weight at the start of the block);
    B   = \sum_{i=1}^{n} c_i  (the size of the block);
    C   = A + B  (the total weight at the end of the block);
    f   = A/C  (the scaling factor);
    q_i = s_i/A  (probability of symbol a_i at the start of the block);
    r_i = (s_i + c_i)/C  (probability of symbol a_i at the end of the block);
    b   = number of blocks resulting from scaling;
    m   = the minimum weight allowed for any symbol, usually 1.

When f = 1/2, we scale by halving all weights every B symbols, so A = B and
b \approx t/B. In a typical implementation, A = B = 8192.


2.1 Coding Theorem for Scaling

In our scaling model each symbol's counts are multiplied by f whenever scaling takes
place. Thus scaling has the effect of giving more weight to more recent symbols. If
we denote the number of occurrences of symbol a_i in block m by c_{i,m}, the weight w_{i,m}
of symbol a_i after m blocks is given by

    w_{i,m} = c_{i,m} + f w_{i,m-1}
            = c_{i,m} + f c_{i,m-1} + f^2 c_{i,m-2} + ....                   (4)

The weighted probability p_{i,m} of symbol a_i at the end of block m is then

    p_{i,m} = w_{i,m} \Big/ \sum_{i=1}^{n} w_{i,m}.                          (5)

We now define a weighted entropy in terms of the weighted probabilities:

Definition 1  The weighted entropy of a file at the end of the m-th block, denoted
by H_m, is the entropy implied by the probability distribution at that time, computed
according to the scaling model. That is,

    H_m = -\sum_{i=1}^{n} p_{i,m} \lg p_{i,m},

where the p_{i,m} are given by Equation 5.

i;m

We nd that l , the average co de length p er symb ol for blo ck m, is related to the

m

starting weighted entropy H and the ending weighted entropy H in a particularly

m1 m

simple way:

1 f

l  H H

m m m1

1 f 1 f

f

= H + H H :

m m m1

1 f

Letting f =1=2, we obtain

l  2H H :

m m m1

When wemultiply by the blo ck length and sum over all blo cks, we obtain the following

precise and elegantcharacterization of the co de length in terms of weighted entropies:

Theorem 1  Let L be the code length of a file compressed with arithmetic coding using
an adaptive model with scaling. Then

    B\left( \sum_{m=1}^{b} H_m + H_b - H_0 \right) - \frac{tk}{B}
      \le L \le
    B\left( \sum_{m=1}^{b} H_m + H_b - H_0 \right)
      + \frac{tk}{B} \lg\frac{B}{k_{\min}} + O\!\left(\frac{tk^2}{B^2}\right),

where H_0 = lg n is the entropy of the initial model, H_m is the weighted entropy
implied by the scaling model's probability distribution at the end of block m, and k_{\min}
is the smallest number of different symbols that occur in any block.

This theorem enables us to compute the code length of a file coded with periodic
scaling. To do the calculation we need to know only the symbol counts within each
scaling block; we can then use Equations 4 and 5 and Definition 1. The occurrence
of entropy-like terms in the code length is to be expected; the main contribution of
Theorem 1 is to show the precise and simple form that the entropy expressions take.
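A sketch of such a calculation (our own code; the initial model is taken to be the
uniform one given by 1-weights, so that H_0 = lg n):

    from math import log2

    def weighted_entropies(block_counts, n, f=0.5):
        """H_0, H_1, ..., H_b of Definition 1. block_counts[m-1][i] is
        c_{i,m}, the count of symbol i in block m; weights follow
        Equation 4 and probabilities follow Equation 5."""
        w = [1.0] * n
        H = [log2(n)]                                     # H_0
        for counts in block_counts:
            w = [c + f * wi for c, wi in zip(counts, w)]  # Equation 4
            total = sum(w)
            p = [wi / total for wi in w]                  # Equation 5
            H.append(-sum(x * log2(x) for x in p if x > 0))
        return H

    def code_length_estimate(block_counts, n, B):
        """Leading term B(sum_{m=1}^{b} H_m + H_b - H_0) of Theorem 1."""
        H = weighted_entropies(block_counts, n)
        return B * (sum(H[1:]) + H[-1] - H[0])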

2.1.1 Proof of the upper bound

We prove the upper bound of Theorem 1 by showing first that the code length of a
block depends only on the beginning weights and block counts of the symbols, and
not on the order of symbols within the block. Then we show that there is an order
such that for each symbol certain equalities and inequalities hold, which we use to
compute a value and an upper bound for the code length of all occurrences of a single
symbol in one block. Finally we sum over all symbols to obtain the worst case code
length of a block, and over all blocks to obtain the code length of the file. The proof
of the lower bound is similar.

The first lemma enables us to choose any convenient symbol order within a block.

Lemma 1  The code length of a block depends only on the beginning weights and block
counts of the symbols, and not on the order of symbols within the block.

Proof: The exact code length L of a block,

    L = -\lg\left( \prod_{i=1}^{k} s_i^{\overline{c_i}} \Big/ A^{\overline{B}} \right),

has no dependence on the order of symbols. For each symbol a_i, the adaptive model
ensures that the numerators in the set of probabilities used for coding the symbol in
the block always form the sequence ⟨s_i, s_i+1, ..., s_i+c_i-1⟩, and that for the block
as a whole the denominators always form the sequence ⟨A, A+1, ..., A+B-1⟩. □

In the next lemma we prove the almost obvious fact that there is some order
of symbols such that the occurrences of each symbol are roughly evenly distributed
through the block. This order will enable us to use a smoothly varying function to
estimate the probabilities used for coding the occurrences.

Definition 2  A block of length B containing c_1, c_2, ..., c_k occurrences of symbols
a_1, a_2, ..., a_k, respectively, has the evenly-distributed (or ED) property if for each
symbol a_i and for all m, 1 ≤ m ≤ c_i, the symbol occurs for the m-th time not after
position mB/c_i.


Lemma 2  Every distribution of symbol counts has an order with the ED property.

Proof: Let r_i(j) be the number of occurrences of symbol a_i required up through the
j-th position in any ED order. Such an order exists if and only if \sum_{i=1}^{k} r_i(j) ≤ j for
1 ≤ j ≤ B. We find that r_i(j) = ⌈c_i(j+1)/B⌉ - 1, so

    \sum_{i=1}^{k} r_i(j) < \sum_{i=1}^{k} c_i(j+1)/B = j+1,

and hence \sum_{i=1}^{k} r_i(j) ≤ j for any j, so an ED order exists. □

Next we define a number of related sums and integrals approximating l, the code
length of all occurrences of one symbol in one block. For a given symbol with
probability p = c/B within the block, we can divide the block into c subblocks, each of
length B/c = 1/p. Then p_B(k) and p_E(k) give the probability that the dynamic
model would give to the k-th occurrence of the symbol if it occurred precisely at the
beginning and end of the k-th subblock, respectively:

    p_B(k) = \frac{qA + k}{A + k/p},        p_E(k) = \frac{qA + k}{A + (k+1)/p}.

The expressions S_L and S_U are the lower and upper bounds of the symbol's code
length, based on its occurrences being as early or as late as possible in the subblocks:

    S_L = -\sum_{j=0}^{c-1} \lg p_B(j),     S_U = -\sum_{j=0}^{c-1} \lg p_E(j).

The integrals I_L and I_U approximate S_L and S_U:

    I_L = -\int_0^c \lg p_B(x)\,dx,         I_U = -\int_0^c \lg p_E(x)\,dx.

We define \delta_I to be I_U - I_L. After a considerable amount of algebra, we get

    \delta_I = \frac{1}{1-f} \big( (c+1-f) \lg(c+1-f) - c \lg c
               - (fc+1-f) \lg(fc+1-f) + fc \lg fc \big).

The expressions \epsilon_U and \epsilon_L are used to bound the error in approximating S_U by I_U
and S_L by I_L, respectively:

    \epsilon_U = \lg(1 + c/(qA)) - \lg(1 + c/(pA+1))    if p > q - 1/A,
    \epsilon_U = 0                                      if p ≤ q - 1/A;

    \epsilon_L = \lg(1 + c/(pA)) - \lg(1 + c/(qA))      if p < q,
    \epsilon_L = 0                                      if p ≥ q.

We need a simple lemma from integral calculus.


Lemma 3  If g(x) is monotone increasing, then

    \sum_{j=0}^{c-1} g(j) < \int_0^c g(x)\,dx.

If g(x) is monotone decreasing, then

    \sum_{j=0}^{c-1} g(j) < \int_0^c g(x)\,dx - g(c) + g(0).

Proof: The increasing case is obvious. In the decreasing case, the integral over any
unit interval is greater than the value at the right end of the interval:
\int_j^{j+1} g(x)\,dx > g(j+1). We obtain the lemma by adding g(j) - g(j+1) to
both sides and summing over j. □

Lemma 4  The code length l for all occurrences of a single symbol in a block in ED
order is less than I_L + \delta_I + \epsilon_U.

Proof: We show that l < S_U ≤ I_U + \epsilon_U = I_L + \delta_I + \epsilon_U. Since S_U represents the
code length for the symbol if all occurrences of the symbol come as late as possible in
an ED order, we have l < S_U. If p < q - 1/A, then -\lg p_E(x) is monotone increasing,
so by Lemma 3 we have S_U < I_U. If p > q - 1/A, then -\lg p_E(x) is monotone
decreasing, so by Lemma 3 we have S_U < I_U + \lg p_E(c) - \lg p_E(0) = I_U + \epsilon_U. If
p = q - 1/A, then S_U = I_U. In all three cases, S_U ≤ I_U + \epsilon_U. From the definition
of \delta_I, I_U = I_L + \delta_I. □

We now relate I_L to the entropy of the beginning and ending probability
distributions. We write I_L(i) to differentiate the values of I_L evaluated for different
symbols a_i. We define H(R) and H(Q) to be the entropies associated with probability
distributions R = ⟨r_1, r_2, ..., r_n⟩ and Q = ⟨q_1, q_2, ..., q_n⟩, respectively; that is,

    H(R) = -\sum_{i=1}^{n} r_i \lg r_i,
    H(Q) = -\sum_{i=1}^{n} q_i \lg q_i.

Lemma 5  Let L_H = B\left( \frac{1}{1-f} H(R) - \frac{f}{1-f} H(Q) \right). Then L_H = \sum_{i=1}^{k} I_L(i).

Proof: By appropriate substitutions, we have

    \sum_{i=1}^{k} I_L(i) = B\left( -\frac{1}{1-f} \sum_{i=1}^{k} r_i \lg r_i
                            + \frac{f}{1-f} \sum_{i=1}^{k} q_i \lg q_i \right)
                            + \frac{Bf}{(1-f)^2} \sum_{i=1}^{k} (q_i - r_i).

The last sum is 0 because Q and R are probability distributions, so \sum_{i=1}^{k} q_i =
\sum_{i=1}^{k} r_i = 1. The lemma follows from the definition of H. □

Finally we bound the per-block error.

Lemma 6  Let L_j be the compressed length of block j. Then

    L_j \le B\left( \frac{1}{1-f} H(R) - \frac{f}{1-f} H(Q) \right)
            + k \lg\frac{B}{k} + O\!\left(\frac{k^2}{B}\right).

Proof: From Lemmas 4 and 5, we have L_j ≤ L_H + \sum_{i=1}^{k} (\delta_I + \epsilon_U). A scaling factor f
of 1/2 implies that A = B. Asymptotics under this condition give

    \delta_I + \epsilon_U = 1 + O(1/c)             if p ≤ q - 1/A;
    \delta_I + \epsilon_U = \lg(c/s) + O(s/c)      if p > q - 1/A.

The sum \sum_{i=1}^{k} (\delta_I + \epsilon_U) is maximized when as many symbols as possible have large c
and small s; the sum's largest possible value cannot exceed its value when s = m and
c = B/k for all k symbols, in which case \sum (\delta_I + \epsilon_U) = k \lg(B/(km)) + O(k^2 m/B). We
obtain the result by setting m = 1 and summing over the symbols. □

The proof of the upper bound in Theorem 1 follows from Lemma 6 by summing
over all blocks, noting that b = t/B. (We are neglecting any special effects of a longer
first block or shorter last block.) There is much cancellation because H(R) of one
block is H(Q) of the next.

2.1.2 Proof of the lower bound

The proof of the lower bound of Theorem 1 is similar to that of the upper bound. In
this proof of the lower bound we append a prime to the label of each definition and
lemma to show the correspondence with the definition and lemmas used in the proof
of the upper bound.

Definition 2'  A block of length B containing c_1, c_2, ..., c_k occurrences of symbols
a_1, a_2, ..., a_k, respectively, has the reverse ED property if for each symbol a_i and for
all m, 1 ≤ m ≤ c_i, the symbol occurs for the m-th time not before position (m-1)B/c_i.

Lemma 2'  Every distribution of symbol counts has an order with the reverse ED
property.

Proof: By Lemma 2 there is always an order with the ED property. Such an order,
when reversed, has the reverse ED property. □

Lemma 3'  If g(x) is monotone decreasing, then

    \sum_{j=0}^{c-1} g(j) > \int_0^c g(x)\,dx.

If g(x) is monotone increasing, then

    \sum_{j=0}^{c-1} g(j) > \int_0^c g(x)\,dx - g(c) + g(0).

Proof: Similar to that of Lemma 3. □

Lemma 4'  The code length l for all occurrences of a single symbol in a block in
reverse ED order is greater than I_L - \epsilon_L.

Proof: We show that l > S_L ≥ I_L - \epsilon_L. Since S_L represents the code length for
the symbol if all occurrences of the symbol come as early as possible in a reverse
ED order, we have l > S_L. If p > q, then -\lg p_B(x) is monotone decreasing, so by
Lemma 3' we have S_L > I_L. If p < q, then -\lg p_B(x) is monotone increasing, so by
Lemma 3' we have S_L > I_L + \lg p_B(c) - \lg p_B(0) = I_L - \epsilon_L. If p = q, then S_L = I_L.
In all three cases, we have S_L ≥ I_L - \epsilon_L. □

Lemma 6'  Let L_j be the compressed length of block j. Then

    L_j \ge B\left( \frac{1}{1-f} H(R) - \frac{f}{1-f} H(Q) \right) - k.

Proof: From Lemmas 4' and 5, L_j ≥ L_H - \sum_{i=1}^{k} \epsilon_L(i). Asymptotics when f = 1/2
give \epsilon_L = 1 - O(c/s), which has maximum value 1, so the sum is at most k. □

The proof of the lower bound in Theorem 1 follows from Lemma 6' by summing
over all blocks, noting that b = t/B.

2.1.3 Non-scaling corollary

By letting B = t, f = n/(t+n), and m = 1 in Lemmas 6 and 6', we obtain the code
length without scaling:

Corollary 1  When we do not scale at all, the code length L_NS satisfies

    t H_{final} + n (H_{final} - H_0) - k \le L_{NS}
      \le t H_{final} + n (H_{final} - H_0) + k \lg(t/k) + O(k^2/t).

We can get important insights by contrasting upper bounds in this corollary and
Theorem 1. Scaling will bring about a shorter encoding by tracking the block-by-block
entropies rather than matching a single entropy for the entire file, but when we
forgo scaling the overhead is less, proportional to lg t instead of to t. Scaling will do
worst on a homogeneously distributed file, but even then the overhead will increase
the code length by only about (k/B) lg(B/k) bits per input symbol, less than 0.1
bit per symbol for a typical file. We conclude that the benefits of scaling usually
outweigh the minor inefficiencies it sometimes introduces.


2.2 Application to Higher Order Models

We now extend our results to higher order models. Cleary and Witten [6] present a
practical adaptive method called prediction by partial matching (or PPM) in which
they maintain models of various orders. At each point they use the highest-order
model in which the symbol has occurred in the current context, with a special escape
symbol to indicate the need to drop to a lower order. Because most contexts occur
only a few times with only a few different symbols, assigning an initial weight of 1
to each alphabet symbol as we did in Section 1.1 is an unsatisfactory solution to the
zero-frequency problem in higher-order models. Doing so would give too much weight
to symbols that never occur. See [2] or [3] for a detailed description of the PPM
method. Witten, Cleary, Moffat, and Bell have proposed at least five methods for
estimating the probability of the escape symbol [6,19,33], and Arps et al. [1] give two
more. All of the methods give approximately the same compression; PPMB [6] is the
most readily analyzed.

In PPMB, in each context the escape event is treated as a separate symbol with its
own weight and probability; the first occurrence of an ordinary symbol is not counted
and the first two occurrences are coded as escapes. Treating the escape event as a
normal symbol, we can apply the results of Section 2 if we make adjustments for the
first two occurrences of each symbol, since in PPMB the code length is independent
of the order of the symbols.

In the block in which a given symbol occurs for the first time, we can take the
occurrences to be evenly distributed in the sense of Definition 2, with symbol weights
(numerators) running from 1 to c and occurrence positions (denominators) running
from A + B/c to A + B. If this were coded in the normal way, the code length would be
bounded above by Lemmas 4 and 5. Since the mechanism of PPMB excludes the last
two numerators, c - 1 and c, and the first two denominators, A + B/c and A + 2B/c,
the code length from the approximation must be adjusted by adding L_adjustment:

    L_{adjustment} = L_{actual} - L_{estimated}
                   = \lg \frac{(c-1)\,c}{(A + B/c)(A + 2B/c)}
                   = 2 \lg c - 2 \lg A + \lg\left(1 - \frac{1}{c}\right)
                     - \lg\left(1 + \frac{1/f - 1}{c}\right)
                     - \lg\left(1 + 2\,\frac{1/f - 1}{c}\right).

We must also adjust the entropy: the actual value of the initial probability q is 0
instead of 1/A, and the actual value of the final probability r is (c-1)/(A+B) instead of
(c+1)/(A+B). For convenience we denote the entropy term B\left( \frac{1}{1-f} H(R) - \frac{f}{1-f} H(Q) \right)
by h. We define and compute the adjustment:

    h_{adjustment} = h_{actual} - h_{estimated}
                   = -c \lg\left(1 + \frac{2}{c-1}\right) + 2 \lg c - \lg A
                     + 2 \lg f + \lg\left(1 - \frac{1}{c^2}\right).

The code length is then given by

    L_{actual} = h_{actual} + (L_{adjustment} - h_{adjustment}) + small terms
               = h_{actual} + c \lg\left(1 + \frac{2}{c-1}\right) - \lg A - 2 \lg f
                 - \lg\left(1 + \frac{1}{c}\right)
                 - \lg\left(1 + 2\,\frac{1/f - 1}{c}\right)
                 - \lg\left(1 + \frac{1/f - 1}{c}\right)
                 + small terms.

The net adjustment is always negative if f = 1/2. (We can let f = 1/2 if we
neglect the effect of the time before the first scaling in each context.)

Now we can extend Theorem 1 to the PPMB model with scaling, using X subscripts
in the natural way to restrict quantities to context X:

Theorem 2  When we use PPMB with scaling, the code length L is bounded by

    L < \sum_{contexts\ X} \left( B_X \left( \sum_{m=1}^{b_X} H_{X,m}
        + H_{X,b_X} - H_{X,0} \right)
        + O\!\left( \frac{k_X t_X \lg B_X}{B_X} \right) \right).

Proof: The proof follows from the discussion above. □

This theorem does not readily estimate the code length of a file in a direct way.
However, it does show that the code length of a file coded using a high-order Markov
model with scaling can be expressed using the weighted entropy formulation
introduced in Section 2. In particular, the code length for each context is expressed
directly in terms of the weighted entropies for that context.

3 Coding Effects

In this section we prove analytically that coding effects (as distinguished from
modeling effects) are insignificant, and hence that our assumption of exact coding is
appropriate. Empirical evidence that the coding effects are negligible appears in
[2,34].

3.1 Rounding counts to integers

In Section 2 we analyzed the modeling effect of periodic scaling; here we analyze the
coding effect. Witten, Neal, and Cleary scale the counts to avoid register overflow,
and to prevent any count from falling to 0, they round fractional counts to the next
higher integer. This gives more weight to symbols whose count happens to be odd
when scaling occurs.


Theorem 3  Rounding counts up to the next higher integer increases the code length
for the file by no more than n/2B bits per input symbol.

Proof: Each symbol whose weight is rounded up causes the denominators of all
probabilities in the next block to be too large, by 1/2. If r is the number of symbols
subject to roundup, r/2 of the denominators in computing the coding interval will be
approximately T instead of T/2, each adding one bit to the code length of the block,
so the block's code length will be r/2 bits longer. The effect for the entire file (t/B
blocks) is rt/2B bits, or, since r ≤ n, at most n/2B bits per input symbol. □

This effect is typically 0.02 bit or less per input symbol.

3.2 Using integer arithmetic

Witten, Neal, and Cleary use integers from a large fixed range, typically [0, 8B],
instead of using exact rational arithmetic, and they transmit encoded bits as soon as
they know them instead of waiting until the entire string has been encoded, scaling
up the range to represent only that half of the original range whose identity has not
yet been transmitted. The result is nearly the same compression efficiency as with
exact rational arithmetic.

As is apparent from the following description of the coding section of the
Witten-Neal-Cleary algorithm, scaling up the range is only approximate:

1. We select a subrange of the current interval [low, high] whose length within
   [low, high] is proportional to p. The new integer values of low and high are
   obtained by truncating the results of the exact calculation.

2. We repeat the following steps as many times as possible:

   (a) If the new subrange is not entirely in the lower, middle, or upper half of
       the full range of values, we return.

   (b) If the new subrange is in the lower or upper half, we output 0 or 1
       respectively, plus any bits left over from previous symbols. If the subrange
       is entirely in the middle half, we keep track of this fact for future output.

   (c) We scale the subrange up:

       i. We shift the subrange to ignore the part of the full range in which the
          subrange is known not to lie.

       ii. We double both low and high and add 1 to high.

In this algorithm, any roundoff error in selecting the first subrange will propagate
through the entire scaling up process. In the worst case, a symbol with a count of 1
could result in a subrange of length 1, even though the unrounded subrange size
might be just below 2. In effect, this would assign a symbol probability of only half
the correct value, resulting in a code length one bit too long. In practice, however, the
code length error in one symbol is seldom anywhere near that large, and because the
errors can be of either sign and have an approximately symmetrical distribution with
mean 0, the average error is usually very small. Witten, Neal, and Cleary empirically
estimate it at 10^{-4} bit per input symbol.
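The scaling-up loop of step 2 can be sketched as follows (our own simplified
rendering in the style of [34]; the register width TOP and the bit-output interface
are assumptions of the sketch):

    def renormalize(low, high, pending, output, TOP=1 << 16):
        """Scale [low, high] up within [0, TOP), emitting bits as soon as
        they are known; `pending` counts deferred middle-half bits."""
        HALF, Q1, Q3 = TOP // 2, TOP // 4, 3 * TOP // 4
        while True:
            if high < HALF:                     # entirely in lower half
                output.append(0)
                output.extend([1] * pending); pending = 0
            elif low >= HALF:                   # entirely in upper half
                output.append(1)
                output.extend([0] * pending); pending = 0
                low -= HALF; high -= HALF
            elif low >= Q1 and high < Q3:       # entirely in middle half
                pending += 1                    # defer for future output
                low -= Q1; high -= Q1
            else:                               # straddles a boundary
                return low, high, pending
            low = 2 * low                       # double low and high,
            high = 2 * high + 1                 # adding 1 to high

    bits = []
    low, high, pending = renormalize(1000, 9000, 0, bits)
    print(bits, low, high, pending)             # [0, 0] 4000 36003 0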

In order to get a rigorous bound on the compression loss, we analyze a new
algorithm that maintains full precision when scaling up the range. Instead of adding 1
to high at each step, we add either 0 or 1 to low, and independently we add 1 or 2 to
high, the choice in each case being based on the fractional bits of the exact results of
the initial subrange selection. The resulting code length may be longer than that of
exact arithmetic coding, but by a tiny amount, as shown in the following theorem:

Theorem 4  When we use the high precision algorithm for scaling up the subrange,
the code length does not exceed the ideal code length -lg p by more than 1/(2B ln 2)
bits per input symbol.

Proof: After the scaling up process, the smallest possible subinterval size is 2B. As
a result of rounding while scaling up, low can never be too high, and high can be too
low by as much as 1, so the subinterval can be too short by as much as 1. (It can
also be too long by as much as 1.) An interval too short by 1 can be too short by
a factor of as much as (2B-1)/2B, which corresponds to a code length increase of
-\lg \frac{2B-1}{2B} \approx 1/(2B \ln 2). □

A loss of 1/(2B ln 2) bits per input symbol is negligible, about 10^{-4} bit per symbol
for typical parameters. In practice the high precision algorithm gives exactly the same
compression as the algorithm of Witten, Neal, and Cleary, but its compression loss is
provably small.

3.3 Encoding end-of-file

The end of the file must be explicitly indicated when we use arithmetic coding. The
extra code length required is typically very small and is often ignored; for
completeness, we provide a brief analysis of the end-of-file effect.

Witten, Neal, and Cleary introduce a special low-weight symbol in addition to the
normal alphabet symbols; it is encoded once, at the end of the file. In the following
theorem we bound the cost of identifying end-of-file by this method:

Theorem 5  The use of a special end-of-file symbol results in additional code length
of less than t/(B ln 2) + lg B + 10 bits.

Proof: The cost has four components:

- at most lg B + 1 bits to encode the end-of-file symbol (since its probability must
  be at least as large as the smallest possible probability, 1/2B);

- fewer than t/(B ln 2) bits in wasted code space to allow end-of-file at any point
  (each probability can be reduced by a factor of between (2B-1)/2B and
  (B-1)/B, resulting in a loss of between -\lg((2B-1)/2B) \approx 1/(2B \ln 2)
  and -\lg((B-1)/B) \approx 1/(B \ln 2) bits per symbol);

- two disambiguating bits after the end-of-file symbol;

- up to seven bits to fill the last byte. □

An alternative, transmitting the length of the original file before its encoding,
reduces the cost to between lg t and 2 lg t bits by using an appropriate encoding of
integers [11,31,32], but requires the file length to be known before encoding can begin.

The end-of-file cost using either of these methods is negligible for a typical file,
less than 0.01 bit per input symbol.

4 Conclusion

Using our notion of weighted entropy, we have precisely characterized the tradeoff
between the overhead associated with scaling and the saving that it can realize by
exploiting locality of reference. The largest code length savings come from more
sophisticated higher order models such as PPMB, and our scaling analysis extends
accordingly. We have also proven that the computational effects on the code length
in practical arithmetic coding implementations are small, so we can treat practical
arithmetic coders as though they were exact coders.

Another important consideration in making arithmetic coding practical, which
we do not address in this article, is the speed at which the current interval can be
updated. In the basic algorithm outlined in Section 1 and in the work of Witten, Neal,
and Cleary, up to two multiplications and one division are needed for each symbol
encoded. Work by Rissanen, Langdon, Mohiuddin, and others at IBM [5,16,18,22,27]
eliminates the division altogether and focuses on approximating the multiplication by
combinations of additions and shifts. In [13] we present an alternative approach in
which we approximate an arithmetic coder by a finite state automaton with a small
number of states. Since the arithmetic computations are effectively stored in the state
tables, coding can proceed quickly using only table lookups.

Acknowledgement. We wish to thank Prof. Martin Cohn for helping us to uncover
a small mistake in the analysis of the effect of using integer arithmetic.


References

[1] R. B. Arps, G. G. Langdon & J. J. Rissanen, "Method for Adaptively Initializing
    a Source Model for Symbol Encoding," IBM Technical Disclosure Bulletin 26
    (May 1984), 6292-6294.

[2] T. C. Bell, J. G. Cleary & I. H. Witten, Text Compression, Prentice-Hall,
    Englewood Cliffs, NJ, 1990.

[3] T. C. Bell, I. H. Witten & J. G. Cleary, "Modeling for Text Compression,"
    Comput. Surveys 21 (Dec. 1989), 557-591.

[4] J. L. Bentley, D. D. Sleator, R. E. Tarjan & V. K. Wei, "A Locally Adaptive Data
    Compression Scheme," Comm. ACM 29 (Apr. 1986), 320-330.

[5] D. Chevion, E. D. Karnin & E. Walach, "High Efficiency, Multiplication Free
    Approximation of Arithmetic Coding," in Proc. Data Compression Conference,
    J. A. Storer & J. H. Reif, eds., Snowbird, Utah, Apr. 8-11, 1991, 43-52.

[6] J. G. Cleary & I. H. Witten, "Data Compression Using Adaptive Coding and
    Partial String Matching," IEEE Trans. Comm. COM-32 (Apr. 1984), 396-402.

[7] J. G. Cleary & I. H. Witten, "A Comparison of Enumerative and Adaptive
    Codes," IEEE Trans. Inform. Theory IT-30 (Mar. 1984), 306-315.

[8] G. V. Cormack & R. N. Horspool, "Data Compression Using Dynamic Markov
    Modelling," Computer Journal 30 (Dec. 1987), 541-550.

[9] G. V. Cormack & R. N. Horspool, "Algorithms for Adaptive Huffman Codes,"
    Inform. Process. Lett. 18 (Mar. 1984), 159-165.

[10] P. Elias, "Interval and Recency Rank Source Coding: Two On-line Adaptive
    Variable Length Schemes," IEEE Trans. Inform. Theory IT-33 (Jan. 1987), 3-10.

[11] P. Elias, "Universal Codeword Sets and Representations of Integers," IEEE
    Trans. Inform. Theory IT-21 (Mar. 1975), 194-203.

[12] A. S. Fraenkel & S. T. Klein, "Robust Universal Complete Codes as Alternatives
    to Huffman Codes," Dept. of Applied Mathematics, The Weizmann Institute of
    Science, Technical Report, Rehovot, Israel, 1985.

[13] P. G. Howard & J. S. Vitter, "Practical Implementations of Arithmetic Coding,"
    in Image and Text Compression, J. A. Storer, ed., Kluwer Academic Publishers,
    Norwell, MA, 1992, 85-112.

[14] D. A. Huffman, "A Method for the Construction of Minimum Redundancy
    Codes," Proceedings of the Institute of Radio Engineers 40 (1952), 1098-1101.

[15] D. E. Knuth, "Dynamic Huffman Coding," J. Algorithms 6 (June 1985), 163-180.

[16] G. G. Langdon, "Probabilistic and Q-Coder Algorithms for Binary Source
    Adaptation," in Proc. Data Compression Conference, J. A. Storer & J. H. Reif,
    eds., Snowbird, Utah, Apr. 8-11, 1991, 13-22.

[17] G. G. Langdon, "An Introduction to Arithmetic Coding," IBM J. Res. Develop.
    28 (Mar. 1984), 135-149.

[18] G. G. Langdon & J. Rissanen, "Compression of Black-White Images with
    Arithmetic Coding," IEEE Trans. Comm. COM-29 (1981), 858-867.

[19] A. M. Moffat, "Implementing the PPM Data Compression Scheme," IEEE Trans.
    Comm. COM-38 (Nov. 1990), 1917-1921.

[20] K. Mohiuddin, J. J. Rissanen & M. Wax, "Adaptive Model for Nonstationary
    Sources," IBM Technical Disclosure Bulletin 28 (Apr. 1986), 4798-4800.

[21] R. Pasco, "Source Coding Algorithms for Fast Data Compression," Stanford
    Univ., Ph.D. Thesis, 1976.

[22] W. B. Pennebaker, J. L. Mitchell, G. G. Langdon & R. B. Arps, "An Overview of
    the Basic Principles of the Q-Coder Adaptive Binary Arithmetic Coder," IBM J.
    Res. Develop. 32 (Nov. 1988), 717-726.

[23] J. Rissanen, "Stochastic Complexity and Modeling," Ann. Statist. 14 (1986),
    1080-1100.

[24] J. Rissanen, "Stochastic Complexity," J. Roy. Statist. Soc. Ser. B 49 (1987),
    223-239, 253-265.

[25] J. Rissanen, "Universal Coding, Information, Prediction, and Estimation," IEEE
    Trans. Inform. Theory IT-30 (July 1984), 629-636.

[26] J. J. Rissanen, "Generalized Kraft Inequality and Arithmetic Coding," IBM J.
    Res. Develop. 20 (May 1976), 198-203.

[27] J. J. Rissanen & K. M. Mohiuddin, "A Multiplication-Free Multialphabet
    Arithmetic Code," IEEE Trans. Comm. 37 (Feb. 1989), 93-98.

[28] F. Rubin, "Arithmetic Stream Coding Using Fixed Precision Registers," IEEE
    Trans. Inform. Theory IT-25 (Nov. 1979), 672-675.

[29] B. Y. Ryabko, "Data Compression by Means of a Book Stack," Problemy
    Peredachi Informatsii 16 (1980).

[30] C. E. Shannon, "A Mathematical Theory of Communication," Bell Syst. Tech. J.
    27 (July 1948), 398-403.

[31] R. G. Stone, "On Encoding of Commas Between Strings," Comm. ACM 22 (May
    1979), 310-311.

[32] M. Wang, "Almost Asymptotically Optimal Flag Encoding of the Integers," IEEE
    Trans. Inform. Theory IT-34 (Mar. 1988), 324-326.

[33] I. H. Witten & T. C. Bell, "The Zero Frequency Problem: Estimating the
    Probabilities of Novel Events in Adaptive Text Compression," IEEE Trans.
    Inform. Theory IT-37 (July 1991), 1085-1094.

[34] I. H. Witten, R. M. Neal & J. G. Cleary, "Arithmetic Coding for Data
    Compression," Comm. ACM 30 (June 1987), 520-540.