Practical Implementations of Arithmetic Coding

Paul G. Howard and Jeffrey Scott Vitter

Brown University
Department of Computer Science

Technical Report No. 92-18
Revised version, April 1992
Formerly Technical Report No. CS-91-45

Appears in Image and Text Compression, James A. Storer, ed., Kluwer Academic Publishers, Norwell, MA, 1992, pages 85-112. A shortened version appears in the proceedings of the International Conference on Advances in Communication and Control (COMCON 3), Victoria, British Columbia, Canada, October 16-18, 1991.

Practical Implementations of Arithmetic Coding [1]

Paul G. Howard [2] and Jeffrey Scott Vitter [3]

Department of Computer Science
Brown University
Providence, R.I. 02912-1910

Abstract

We provide a tutorial on arithmetic coding, showing how it provides nearly optimal data compression and how it can be matched with almost any probabilistic model. We indicate the main disadvantage of arithmetic coding, its slowness, and give the basis of a fast, space-efficient, approximate arithmetic coder with only minimal loss of compression efficiency. Our coder is based on the replacement of arithmetic by table lookups coupled with a new deterministic probability estimation scheme.

Index terms: Data compression, arithmetic coding, adaptive modeling, analysis of algorithms, data structures, low precision arithmetic.

1. A similar version of this paper appears in Image and Text Compression, James A. Storer, ed., Kluwer Academic Publishers, Norwell, MA, 1992, 85-112. A shortened version of this paper appears in the proceedings of the International Conference on Advances in Communication and Control (COMCON 3), Victoria, British Columbia, Canada, October 16-18, 1991.

2. Support was provided in part by NASA Graduate Student Researchers Program grant NGT-50420 and by a National Science Foundation Presidential Young Investigators Award grant with matching funds from IBM. Additional support was provided by a Universities Space Research Association/CESDIS associate membership.

3. Support was provided in part by National Science Foundation Presidential Young Investigator Award CCR-9047466 with matching funds from IBM, by NSF research grant CCR-9007851, by Army Research Office grant DAAL03-91-G-0035, and by the Office of Naval Research and the Defense Advanced Research Projects Agency under contract N00014-91-J-4052, ARPA Order No. 8225. Additional support was provided by a Universities Space Research Association/CESDIS associate membership.


1 Data Compression and Arithmetic Coding

Data can be compressed whenever some data symbols are more likely than others. Shannon [54] showed that for the best possible compression code (in the sense of minimum average code length), the output length contains a contribution of −lg p from the encoding of each symbol whose probability of occurrence is p. If we can provide an accurate model for the probability of occurrence of each possible symbol at every point in a file, we can use arithmetic coding to encode the symbols that actually occur; the number of bits used by arithmetic coding to encode a symbol with probability p is very nearly −lg p, so the encoding is very nearly optimal for the given probability estimates.

In this paper we show by theorems and examples how arithmetic coding achieves its performance. We also point out some of the drawbacks of arithmetic coding in practice, and propose a unified compression system for overcoming them. We begin by attempting to clear up some of the false impressions commonly held about arithmetic coding; it offers some genuine benefits, but it is not the solution to all data compression problems.

The most important advantage of arithmetic coding is its flexibility: it can be used in conjunction with any model that can provide a sequence of event probabilities. This advantage is significant because large compression gains can be obtained only through the use of sophisticated models of the input data. Models used for arithmetic coding may be adaptive, and in fact a number of independent models may be used in succession in coding a single file. This great flexibility results from the sharp separation of the coder from the modeling process [47]. There is a cost associated with this flexibility: the interface between the model and the coder, while simple, places considerable time and space demands on the model's data structures, especially in the case of a multi-symbol input alphabet.

The other important advantage of arithmetic coding is its optimality. Arithmetic coding is optimal in theory and very nearly optimal in practice, in the sense of encoding using minimal average code length. This optimality is often less important than it might seem, since Huffman coding [25] is also very nearly optimal in most cases [8,9,18,39]. When the probability of some single symbol is close to 1, however, arithmetic coding does give considerably better compression than other methods. The case of highly unbalanced probabilities occurs naturally in bilevel (black and white) image coding, and it can also arise in the decomposition of a multi-symbol alphabet into a sequence of binary choices.

The main disadvantage of arithmetic coding is that it tends to be slow. We shall see that the full precision form of arithmetic coding requires at least one multiplication per event and in some implementations up to two multiplications and two divisions per event. In addition, the model lookup and update operations are slow because of the input requirements of the coder. Both Huffman coding and Ziv-Lempel [59,60] coding are faster because the model is represented directly in the data structures used for coding. (This reduces the coding efficiency of those methods by narrowing the range of possible models.) Much of the current research in arithmetic coding concerns finding approximations that increase coding speed without compromising compression efficiency. The most common method is to use an approximation to the multiplication operation [10,27,29,43]; in this paper we present an alternative approach using table lookups and approximate probability estimation.

Another disadvantage of arithmetic coding is that it does not in general produce a prefix code. This precludes parallel coding with multiple processors. In addition, the potentially unbounded output delay makes real-time coding problematical in critical applications, but in practice the delay seldom exceeds a few symbols, so this is not a major problem. A minor disadvantage is the need to indicate the end of the file.

One final minor problem is that arithmetic codes have poor error resistance, especially when used with adaptive models [5]. A single error in the encoded file causes the decoder's internal state to be in error, making the remainder of the decoded file wrong. In fact this is a drawback of all adaptive codes, including Ziv-Lempel codes and adaptive Huffman codes [12,15,18,26,55,56]. In practice, the poor error resistance of adaptive coding is unimportant, since we can simply apply appropriate error correction coding to the encoded file. More complicated solutions appear in [5,20], in which errors are made easy to detect, and upon detection of an error, bits are changed until no errors are detected.

Overview of this paper. In Section 2 we give a tutorial on arithmetic coding. We include an introduction to modeling for text compression. We also restate several important theorems from [22] relating to the optimality of arithmetic coding in theory and in practice.

In Section 3 we present some of our current research into practical ways of improving the speed of arithmetic coding without sacrificing much compression efficiency. The center of this research is a reduced-precision arithmetic coder, supported by efficient data structures for text modeling.

2 Tutorial on Arithmetic Coding

In this section we explain how arithmetic coding works and give implementation details; our treatment is based on that of Witten, Neal, and Cleary [58]. We point out the usefulness of binary arithmetic coding (that is, coding with a 2-symbol alphabet), and discuss the modeling issue, particularly high-order Markov modeling for text compression. Our focus is on encoding, but the decoding process is similar.

2.1 Arithmetic coding and its implementation

Basic algorithm. The algorithm for encoding a file using arithmetic coding works conceptually as follows:


Figure 1: Subdivision of the current interval based on the probability of the input symbol a_i that occurs next. (The figure shows the old interval [L, H) within [0, 1), its decomposition according to the probability of a_i, and the resulting new interval.)

1. We begin with a "current interval" [L, H) initialized to [0, 1).

2. For each symbol of the file, we perform two steps (see Figure 1):

   (a) We subdivide the current interval into subintervals, one for each possible alphabet symbol. The size of a symbol's subinterval is proportional to the estimated probability that the symbol will be the next symbol in the file, according to the model of the input.

   (b) We select the subinterval corresponding to the symbol that actually occurs next in the file, and make it the new current interval.

3. We output enough bits to distinguish the final current interval from all other possible final intervals.

The length of the final subinterval is clearly equal to the product of the probabilities of the individual symbols, which is the probability p of the particular sequence of symbols in the file. The final step uses almost exactly −lg p bits to distinguish the file from all other possible files. We need some mechanism to indicate the end of the file, either a special end-of-file symbol coded just once, or some external indication of the file's length.

In step 2, we need to compute only the subinterval corresponding to the symbol a_i that actually occurs. To do this we need two cumulative probabilities, P_C = Σ_{k=1}^{i−1} p_k and P_N = Σ_{k=1}^{i} p_k. The new subinterval is [L + P_C(H − L), L + P_N(H − L)). The need to maintain and supply cumulative probabilities requires the model to have a complicated data structure; Moffat [35] investigates this problem, and concludes for a multi-symbol alphabet that binary search trees are about twice as fast as move-to-front lists.
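To make the conceptual algorithm concrete, here is a minimal sketch in Python (our illustration, not code from the paper or its references), using exact rational arithmetic. The probability table anticipates Example 1 below; the final loop emits bits of the interval midpoint until the dyadic interval they determine lies inside the final current interval.

    from fractions import Fraction

    def arithmetic_encode(symbols, probs):
        """probs: list of (symbol, probability) pairs, fixing subinterval order."""
        low, high = Fraction(0), Fraction(1)
        for s in symbols:
            width = high - low
            cum = Fraction(0)
            for sym, p in probs:                  # locate the symbol's subinterval
                if sym == s:
                    low, high = low + cum * width, low + (cum + p) * width
                    break
                cum += p
        # Step 3: output enough bits to distinguish [low, high).
        mid, lo, width, bits = (low + high) / 2, Fraction(0), Fraction(1), ""
        while not (low <= lo and lo + width <= high):
            width /= 2
            if mid >= lo + width:                 # descend into the half holding mid
                lo += width
                bits += "1"
            else:
                bits += "0"
        return bits

    probs = [("a", Fraction(2, 5)), ("b", Fraction(1, 2)), ("EOF", Fraction(1, 10))]
    print(arithmetic_encode(["b", "b", "b", "EOF"], probs))   # prints 1101000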

Example 1: We illustrate a non-adaptive code, encoding the file containing the symbols bbb using arbitrary fixed probability estimates p_a = 0.4, p_b = 0.5, and p_EOF = 0.1. Encoding proceeds as follows:


    Current                        Subintervals
    interval        Action         a                b                EOF              Input
    [0.000, 1.000)  Subdivide      [0.000, 0.400)   [0.400, 0.900)   [0.900, 1.000)   b
    [0.400, 0.900)  Subdivide      [0.400, 0.600)   [0.600, 0.850)   [0.850, 0.900)   b
    [0.600, 0.850)  Subdivide      [0.600, 0.700)   [0.700, 0.825)   [0.825, 0.850)   b
    [0.700, 0.825)  Subdivide      [0.700, 0.750)   [0.750, 0.812)   [0.812, 0.825)   EOF
    [0.812, 0.825)

The final interval (without rounding) is [0.8125, 0.825), which in binary is approximately [0.11010 00000, 0.11010 01100). We can uniquely identify this interval by outputting 1101000. According to the fixed model, the probability p of this particular file is (0.5)³ × 0.1 = 0.0125 (exactly the size of the final interval) and the code length in bits should be −lg p = 6.322. In practice we have to output 7 bits.

The idea of arithmetic coding originated with Shannon in his seminal 1948 paper on information theory [54]. It was rediscovered by Elias about 15 years later, as briefly mentioned in [1].

Implementation details. The basic implementation of arithmetic coding described above has two major difficulties: the shrinking current interval requires the use of high precision arithmetic, and no output is produced until the entire file has been read. The most straightforward solution to both of these problems is to output each leading bit as soon as it is known, and then to double the length of the current interval so that it reflects only the unknown part of the final interval. Witten, Neal, and Cleary [58] add a clever mechanism for preventing the current interval from shrinking too much when the endpoints are close to 1/2 but straddle 1/2. In that case we do not yet know the next output bit, but we do know that whatever it is, the following bit will have the opposite value; we merely keep track of that fact, and expand the current interval symmetrically about 1/2. This follow-on procedure may be repeated any number of times, so the current interval size is always longer than 1/4.

Mechanisms for incremental transmission and fixed precision arithmetic have been developed through the years by Pasco [40], Rissanen [48], Rubin [52], Rissanen and Langdon [49], Guazzo [19], and Witten, Neal, and Cleary [58]. The bit-stuffing idea of Langdon and others at IBM, which limits the propagation of carries in the additions, is roughly equivalent to the follow-on procedure described above.

We now describe in detail how the coding and interval expansion work. This process takes place immediately after the selection of the subinterval corresponding to an input symbol. We repeat the following steps (illustrated schematically in Figure 2) as many times as possible:

a. If the new subinterval is not entirely within one of the intervals [0, 1/2), [1/4, 3/4), or [1/2, 1), we stop iterating and return.


Figure 2: Interval expansion process. (a) No expansion. (b) Interval in [0, 1/2). (c) Interval in [1/2, 1). (d) Interval in [1/4, 3/4) (follow-on case).

b. If the new subinterval lies entirely within [0, 1/2), we output 0 and any 1s left over from previous symbols; then we double the size of the interval [0, 1/2), expanding toward the right.

c. If the new subinterval lies entirely within [1/2, 1), we output 1 and any 0s left over from previous symbols; then we double the size of the interval [1/2, 1), expanding toward the left.

d. If the new subinterval lies entirely within [1/4, 3/4), we keep track of this fact for future output; then we double the size of the interval [1/4, 3/4), expanding in both directions away from the midpoint.
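A sketch of these expansion rules in Python (again our illustration, with rational endpoints for clarity; a real implementation uses integers, as described below). The variable pending counts follow-on bits awaiting their opposite.

    from fractions import Fraction

    def expand(low, high, out, pending):
        """Apply rules (a)-(d) repeatedly; returns the expanded interval."""
        half, quarter = Fraction(1, 2), Fraction(1, 4)
        while True:
            if high <= half:                              # rule (b)
                out.append("0" + "1" * pending); pending = 0
                low, high = 2 * low, 2 * high
            elif low >= half:                             # rule (c)
                out.append("1" + "0" * pending); pending = 0
                low, high = 2 * low - 1, 2 * high - 1
            elif quarter <= low and high <= 3 * quarter:  # rule (d): follow-on
                pending += 1
                low, high = 2 * low - half, 2 * high - half
            else:                                         # rule (a): no expansion
                return low, high, pending

    out = []
    print(expand(Fraction(3, 5), Fraction(17, 20), out, 0), out)
    # [0.60, 0.85) outputs '1' and becomes [0.20, 0.70); cf. Example 2 below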

Example 2: We show the details of encoding the same file as in Example 1.


    Current                          Subintervals
    interval        Action           a              b              EOF            Input
    [0.00, 1.00)    Subdivide        [0.00, 0.40)   [0.40, 0.90)   [0.90, 1.00)   b
    [0.40, 0.90)    Subdivide        [0.40, 0.60)   [0.60, 0.85)   [0.85, 0.90)   b
    [0.60, 0.85)    Output 1
                    Expand [1/2, 1)
    [0.20, 0.70)    Subdivide        [0.20, 0.40)   [0.40, 0.65)   [0.65, 0.70)   b
    [0.40, 0.65)    follow
                    Expand [1/4, 3/4)
    [0.30, 0.80)    Subdivide        [0.30, 0.50)   [0.50, 0.75)   [0.75, 0.80)   EOF
    [0.75, 0.80)    Output 10
                    Expand [1/2, 1)
    [0.50, 0.60)    Output 1
                    Expand [1/2, 1)
    [0.00, 0.20)    Output 0
                    Expand [0, 1/2)
    [0.00, 0.40)    Output 0
                    Expand [0, 1/2)
    [0.00, 0.80)    Output 0

The "follow" output in the sixth line indicates the follow-on procedure: we keep track of our knowledge that the next output bit will be followed by its opposite; this "opposite" bit is the 0 output in the ninth line. The encoded file is 1101000, as before.

Clearly the current interval contains some information about the preceding inputs; this information has not yet been output, so we can think of it as the coder's state. If a is the length of the current interval, the state holds −lg a bits not yet output. In the basic method illustrated by Example 1 the state contains all the information about the output, since nothing is output until the end. In the implementation illustrated by Example 2, the state always contains fewer than two bits of output information, since the length of the current interval is always more than 1/4. The final state in Example 2 is [0, 0.8), which contains −lg 0.8 ≈ 0.322 bits of information.

Use of integer arithmetic. In practice, the arithmetic can be done by storing the current interval in sufficiently long integers rather than in floating point or exact rational numbers. (We can think of Example 2 as using the integer interval [0, 100) by omitting all the decimal points.) We also use integers for the frequency counts used to estimate symbol probabilities. The subdivision process involves selecting non-overlapping intervals (of length at least 1) with lengths approximately proportional to the counts. To encode symbol a_i we need two cumulative counts, C = Σ_{k=1}^{i−1} c_k and N = Σ_{k=1}^{i} c_k, and the sum T of all counts, T = Σ_{k=1}^{n} c_k. (Here and elsewhere we denote the alphabet size by n.) The new subinterval is

    [L + ⌊C(H − L)/T⌋, L + ⌊N(H − L)/T⌋).


In this discussion we continue to use half-open intervals as in the real arithmetic case. In implementations [58] it is more convenient to subtract 1 from the right endpoints and use closed intervals. Moffat [36] considers the calculation of cumulative frequency counts for large alphabets.

Example 3: Suppose that at a certain point in the encoding we have symbol counts c_a = 4, c_b = 5, and c_EOF = 1 and current interval [25, 89) from the full interval [0, 128). Let the next input symbol be b. The cumulative counts for b are C = 4 and N = 9, and T = 10, so the new interval is [25 + ⌊4(89 − 25)/10⌋, 25 + ⌊9(89 − 25)/10⌋) = [50, 82); we then increment the follow-on count and expand the interval once about the midpoint 64, giving [36, 100). It is possible to maintain higher precision, truncating and adjusting to avoid overlapping subintervals only when the expansion process is complete; this makes it possible to prove a tight analytical bound on the lost compression caused by the use of integer arithmetic, as we do in [22], restated as Theorem 1 below. In practice this refinement makes the coding more difficult without improving compression.
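The integer subdivision is a one-liner; this sketch (ours) reproduces Example 3's numbers.

    def subdivide(low, high, C, N, T):
        """New integer subinterval for a symbol with cumulative counts
        C (symbols below it) and N (through it), out of total count T."""
        width = high - low
        return low + C * width // T, low + N * width // T

    print(subdivide(25, 89, 4, 9, 10))    # prints (50, 82), as in Example 3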

Analysis. In [22] we prove a number of theorems about the code lengths of files coded with arithmetic coding. Most of the results involve the use of arithmetic coding in conjunction with various models of the input; these will be discussed in Section 2.3. Here we note two results that apply to implementations of the arithmetic coder. The first shows that using integer arithmetic has negligible effect on code length.

Theorem 1 If we use integers from the range [0, N) and use the high precision algorithm for scaling up the subrange, the code length is provably bounded by 4/(N ln 2) bits per input symbol more than the ideal code length for the file.

For a typical value N = 65,536, the excess code length is less than 10^−4 bit per input symbol.

The second result shows that if we indicate end-of-file by encoding a special symbol just once for the entire file, the additional code length is negligible.

Theorem 2 The use of a special end-of-file symbol when coding a file of length t using integers from the range [0, N) results in additional code length of less than 8t/(N ln 2) + lg N + 7 bits.

Again the extra code length is negligible, less than 0.01 bit per input symbol for a typical 100,000-byte file.

Since we seldom know the exact probabilities of the process that generated an input file, we would like to know how errors in the estimated probabilities affect the code length. We can estimate the extra code length by a straightforward asymptotic analysis. The average code length L for symbols produced by a given model in a given state is given by

    L = − Σ_{i=1}^{n} p_i lg q_i,

where p_i is the actual probability of the ith alphabet symbol and q_i is its estimated probability. The optimal average code length for symbols in the state is the entropy of the state, given by

    H = − Σ_{i=1}^{n} p_i lg p_i.

The excess code length is E = L − H; if we let d_i = q_i − p_i and expand asymptotically in d_i, we obtain

    E = Σ_{i=1}^{n} [ (1/(2 ln 2)) d_i²/p_i + O(d_i³/p_i²) ].    (1)

This corrects a similar derivation in [5], in which the factor of 1/ln 2 is omitted. The vanishing of the linear terms means that small errors in the probabilities used by the coder lead to very small increases in code length. Because of this property, any coding method that uses approximately correct probabilities will achieve a code length close to the entropy of the underlying source. We use this fact in Section 3.1 to design a class of fast approximate arithmetic coders with small compression loss.
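For instance, the following check (our illustration) compares the exact excess code length for a binary alphabet with the leading term of Equation (1); with p = 0.8 estimated as q = 0.75, both are about 0.01 bit.

    from math import log, log2

    def excess(p, q):                 # exact E = L - H for a binary event
        L = -(p * log2(q) + (1 - p) * log2(1 - q))
        H = -(p * log2(p) + (1 - p) * log2(1 - p))
        return L - H

    def leading_term(p, q):           # (1/(2 ln 2)) * sum of d_i**2 / p_i
        d = q - p
        return (d * d / p + d * d / (1 - p)) / (2 * log(2))

    print(excess(0.8, 0.75), leading_term(0.8, 0.75))   # ~0.0101 vs ~0.0113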

2.2 Binary arithmetic coding

The preceding discussion and analysis has focused on coding with a multi-symbol alphabet, although in principle it applies to a binary alphabet as well. It is useful to distinguish the two cases since both the coder and the interface to the model are simpler for a binary alphabet. The coding of bilevel images, an important problem with a natural two-symbol alphabet, often produces probabilities close to 1, indicating the use of arithmetic coding to obtain good compression. Historically, much of the arithmetic coding research by Rissanen, Langdon, and others at IBM has focused on bilevel images [29]. The Q-Coder [2,27,33,41,42,43] is a binary arithmetic coder; work by Rissanen and Mohiuddin [50] and Chevion et al. [10] extends some of the Q-Coder ideas to multi-symbol alphabets.

In most other text and data compression applications, a multi-symbol alphabet is more natural, but even then we can map the possible symbols to the leaves of a binary tree, and encode an event by traversing the tree and encoding a decision at each internal node. If we do this, the model no longer has to maintain and produce cumulative probabilities; a single probability suffices to encode each decision. Calculating the new current interval is also simplified, since just one endpoint changes after each decision. On the other hand, we now usually have to encode more than one event for each input symbol, and we have a new data structure problem, maintaining the coding trees efficiently without using excessive space. The smallest average number of events coded per input symbol occurs when the tree is a Huffman tree, since such trees have minimum average weighted path length; however, maintaining such trees dynamically is complicated and slow [12,26,55,56]. In Section 3.3 we present a new data structure, the compressed tree, suitable for binary encoding of multi-symbol alphabets.


2.3 Modeling for text compression

Arithmetic coding allows us to compress a file as well as possible for a given model of the process that generated the file. To obtain maximum compression of a file, we need both a good model and an efficient way of representing (or learning) the model. Rissanen calls this principle the minimum description length principle; he has investigated it thoroughly from a theoretical point of view [44,45,46]. If we allow two passes over the file, we can identify a suitable model during the first pass, encode it, and use it for optimal coding during the second pass. An alternative approach is to allow the model to adapt to the characteristics of the file during a single pass, in effect learning the model. The adaptive approach has advantages in practice: there is no coding delay and no need to encode the model, since the decoder can maintain the same model as the encoder in a synchronized fashion.

In the following theorem from [22] we compare context-free coding using a two-pass method and a one-pass adaptive method. In the two-pass method, the exact symbol counts are encoded after the first pass; during the second pass each symbol's count is decremented whenever it occurs, so at each point the relative counts reflect the correct symbol probabilities for the remainder of the file, as in [34]. In the one-pass adaptive method, all symbols are given initial counts of 1; we add 1 to a symbol's count whenever it occurs.

Theorem 3 For all input files, the adaptive code with initial 1-weights gives exactly the same code length as the semi-adaptive decrementing code in which the input model is encoded based on the assumption that all symbol distributions are equally likely.

Hence we see that use of an adaptive code does not incur any extra overhead, but it does not eliminate the cost of describing the model.

Adaptive models. The simplest adaptive models do not rely on contexts for conditioning probabilities; a symbol's probability is just its relative frequency in the part of the file already coded. We need a mechanism for encoding a symbol for the first time, when its frequency is 0; the easiest way [58] is to start all symbol counts at 1 instead of 0. The average code length per input symbol of a file encoded using such a 0-order adaptive model is very close to the 0-order entropy of the file. We shall see that adaptive compression can be improved by taking advantage of locality of reference and especially by using higher order models.

Scaling. One problem with maintaining symbol counts is that the counts can become arbitrarily large, requiring increased precision arithmetic in the coder and more memory to store the counts themselves. By periodically reducing all symbols' counts by the same factor, we can keep the relative frequencies approximately the same while using only a fixed amount of storage for each count. This process is called scaling. It allows us to use lower precision arithmetic, possibly hurting compression because of the reduced accuracy of the model. On the other hand, it introduces a locality of reference (recency) effect, which often improves compression. We now discuss and quantify the locality effect.

In most text files we find that most of the occurrences of at least some words are clustered in one part of the file. We can take advantage of this locality by assigning more weight to recent occurrences of a symbol in an adaptive model. In practice there are several ways to do this:

- Periodically restarting the model. This often discards too much information to be effective, although Cormack and Horspool find that it gives good results when growing large dynamic Markov models [11].

- Using a sliding window on the text [26]. This requires excessive computational resources.

- Recency rank coding [7,13,53]. This is simple but corresponds to a rather coarse model of recency.

- Exponential aging (giving exponentially increasing weights to successive symbols) [12,38]. This is moderately difficult to implement because of the changing weight increments, although our probability estimation method in Section 3.4 uses an approximate form of this technique.

- Periodic scaling [58]. This is simple to implement, fast and effective in operation, and amenable to analysis. It also has the computationally desirable property of keeping the symbol weights small. In effect, scaling is a practical version of exponential aging.

Analysis of scaling. In [22] we give a precise characterization of the effect of scaling on code length, in terms of an elegant notion we introduce called weighted entropy. The weighted entropy of a file at the end of the mth block, denoted by H_m, is the entropy implied by the probability distribution at that time, computed according to the scaling model described above.

We prove the following theorem for a file compressed using arithmetic coding and a zero-order adaptive model with scaling. All counts are halved and rounded up when the sum of the counts reaches 2B; in effect, we divide the file into b blocks of length B.

Theorem 4 Let L be the compressed length of the file. Then we have

    B Σ_{m=1}^{b} H_m + (H_b − H_0) k  ≤  L  ≤  B Σ_{m=1}^{b} H_m + (H_b − H_0) k + t k²/(2 B k_min),

where H_0 = lg n is the entropy of the initial model, H_m is the weighted entropy implied by the scaling model's probability distribution at the end of block m, k is the number of different alphabet symbols that appear in the file, and k_min is the smallest number of different symbols that occur in any block.

When scaling is done, we must ensure that no symbol's count becomes 0; an easy way to do this is to round fractional counts up to the next higher integer. We show in the following theorem from [22] that this roundup effect is negligible.

Theorem 5 Rounding counts up to the next higher integer increases the code length for the file by no more than n/(2B) bits per input symbol.

When we compare code lengths with and without scaling, we find that the differences are small, both theoretically and in practice.

High order models. The only way to obtain substantial improvements in compression is to use more sophisticated models. For text files, the increased sophistication invariably takes the form of conditioning the symbol probabilities on contexts consisting of one or more symbols of preceding text. Langdon [28] and Bell, Witten, Cleary, and Moffat [3,4,5] have proven that both Ziv-Lempel coding and the dynamic Markov coding method of Cormack and Horspool [11] can be reduced to finite context models, despite superficial indications to the contrary.

One significant difficulty with using high-order models is that many contexts do not occur often enough to provide reliable symbol probability estimates. Cleary and Witten deal with this problem with a technique called Prediction by Partial Matching (PPM). In the PPM methods we maintain models of various context lengths, or orders. At each point we use the highest order model in which the symbol has occurred in the current context, with a special escape symbol indicating the need to drop to a lower order. Cleary and Witten specify two ad hoc methods, called PPMA and PPMB, for computing the probability of the escape symbol. Moffat [37] implements the algorithm and proposes a third method, PPMC, for computing the escape probability: he treats the escape event as a separate symbol; when a symbol occurs for the first time he adds 1 to both the escape count and the new symbol's count. In practice, PPMC compresses better than PPMA and PPMB. PPMP and PPMX appear in [57]; they are based on the assumption that the appearance of symbols for the first time in a file is approximately a Poisson process. See Table 1 for formulas for the probabilities used by the different methods, and see [5] or [6] for a detailed description of the PPM method. In Section 3.5 we indicate two methods that provide improved estimation of the escape probability.

Table 1: PPM escape probabilities (p_esc) and symbol probabilities (p_i). The number of symbols that have occurred j times is denoted by n_j.

              PPMA          PPMB          PPMC          PPMP                    PPMX
    p_esc     1/(t+1)       k/t           k/(t+k)       n_1/t − n_2/t² + ···    n_1/t
    p_i       c_i/(t+1)     (c_i − 1)/t   c_i/(t+k)
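In code, the Table 1 formulas are straightforward; this sketch (ours) computes the escape probability for each method, truncating PPMP's Poisson-based series to its first two terms.

    from fractions import Fraction

    def p_escape(method, t, k, n1, n2=0):
        """t: context occurrences; k: distinct symbols seen;
        n1, n2: symbols seen exactly once / twice (for PPMP and PPMX)."""
        if method == "PPMA": return Fraction(1, t + 1)
        if method == "PPMB": return Fraction(k, t)
        if method == "PPMC": return Fraction(k, t + k)
        if method == "PPMP": return Fraction(n1, t) - Fraction(n2, t * t)
        if method == "PPMX": return Fraction(n1, t)
        raise ValueError(method)

    # e.g. a context seen t=4 times with k=3 distinct symbols, n1=2, n2=1:
    for m in ("PPMA", "PPMB", "PPMC", "PPMP", "PPMX"):
        print(m, p_escape(m, 4, 3, 2, 1))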

2.4 Other applications of arithmetic coding

Because of its nearly optimal compression performance, arithmetic coding has been proposed as an enhancement to other compression methods and to activities related to compression. The output values produced by Ziv-Lempel coding are not uniformly distributed, leading several researchers [21,32,51] to suggest using arithmetic coding to further compress the output. Compression is indeed improved, but at the cost of slowing down the algorithm and increasing its complexity.

Lossless image compression is often performed using predictive coding, and it is often found that the prediction errors follow a Laplace distribution. In [23] we present methods that use tables of the Laplace distribution, precomputed for arithmetic coding, to obtain excellent compression ratios for grayscale images. The distributions are chosen to guarantee that, for a given variance estimate, the resulting code length exceeds the ideal for the estimate by only a small fixed amount.

Especially when encoding model parameters, it is often necessary to encode arbitrarily large non-negative integers. Witten et al. [58] note that arithmetic coding can encode integers according to any given distribution. In the examples in Section 3.1 we show how some encodings of integers found in the literature can be derived as low-precision arithmetic codes.

We point out here that arithmetic coding can also be used to generate random variables from any desired distribution, as well as to produce nearly random bits from the output of any random process. In particular, it is easy to convert random numbers from one base to another, and to convert random bits with an unknown but fixed probability to bits with a probability of 1/2.

3 Fast Arithmetic Coding

In this section we present some of our current research into several aspects of arithmetic coding. We show the construction of a fast, reduced-precision binary arithmetic coder, and indicate a theoretical construct, called the ε-partition, that can assist in choosing a representative set of probabilities to be used by the coder. We introduce a data structure that we call the compressed tree for efficiently representing a multi-symbol alphabet as a binary tree. We give a deterministic algorithm for estimating probabilities of binary events and storing them in 8-bit locations. We give two improved ways of handling the zero-frequency problem (symbols occurring in a context for the first time). Finally we show that we can use hashing to obtain fast access of contexts with only a small loss of compression efficiency. All these components can be combined into a fast, space-efficient text coder.

3.1 Reduced-precision arithmetic coding

We have noted earlier that the primary disadvantage of arithmetic coding is its slowness. We have also seen that small errors in probability estimates cause very small increases in code length, so we can expect that by introducing approximations into the arithmetic coding process in a controlled way we can improve coding speed without significantly degrading compression performance. In the Q-Coder work at IBM, the time-consuming multiplications are replaced by additions and shifts, and low-order bits are ignored.

In this section, we take a different approach to approximate arithmetic coding: recalling that the fractional bits characteristic of arithmetic coding are stored as state information in the coder, we reduce the number of possible states, and replace arithmetic operations by table lookups. Here we present a fast, reduced-precision binary arithmetic coder (which we refer to as quasi-arithmetic coding in a companion paper [24]) and develop it through a series of examples. It should be noted that the compression is still completely reversible; using reduced precision merely affects the average code length.

The number of possible states (after applying the interval expansion procedure) of an arithmetic coder using the integer interval [0, N) is 3N²/16. If we can reduce the number of states to a more manageable level, we can precompute all state transitions and outputs and substitute table lookups for arithmetic in the coder. The obvious way to reduce the number of states is to reduce N. The value of N must be even; for computational convenience we prefer that it be a multiple of 4.

Example 4: The simplest non-trivial coders have N = 4, and have only three states. By applying the arithmetic coding algorithm in a straightforward way, we obtain the following coding table. A "follow" output indicates application of the follow-on procedure described in Section 2.1.

                                   0 input               1 input
    State    Prob{0}               Output   Next state   Output   Next state
    [0,4)    1/2 ≤ p ≤ α           0        [0,4)        1        [0,4)
             α ≤ p < 1             -        [0,3)        11       [0,4)
    [0,3)    1/2 ≤ p < 1           0        [0,4)        10       [0,4)
    [1,4)    1/2 ≤ p < 1           follow   [0,4)        11       [0,4)


The value of the cutoff probability α in state [0,4) is clearly between 1/2 and 3/4. If this were an exact coder, the subintervals of length 3 would correspond to −lg(3/4) ≈ 0.415 bits of output information stored in the state, and we would choose α = 1/lg 3 ≈ 0.631 to minimize the extra code length. But because of the approximate arithmetic, the optimal value of α depends on the distribution of Prob{0}; if Prob{0} is uniformly distributed on (0,1), we find analytically that the excess code length is minimized when α = (15 − √97)/8 ≈ 0.644. Fortunately, the amount of excess code length is not very sensitive to the value of α; in the uniform distribution case any value from about 0.55 to 0.73 gives less than one percent extra code length.

Arithmetic coding does not mandate any particular assignment of subintervals to input symbols; all that is required is that subinterval lengths be proportional to symbol probabilities and that the decoder make the same assignment as the encoder. In Example 4 we uniformly assigned the left subinterval to symbol 0. By preventing the longer subinterval from straddling the midpoint whenever possible, we can sometimes obtain a simpler coder that never requires the follow-on procedure; it may also use fewer states.

Example 5: This coder assigns the right subinterval to 0 in the second and fourth rows of Example 4's table, eliminating the need for using the follow-on procedure; otherwise it is the same as Example 4.

                                   0 input               1 input
    State    Prob{0}               Output   Next state   Output   Next state
    [0,4)    1/2 ≤ p ≤ α           0        [0,4)        1        [0,4)
             α ≤ p < 1             -        [1,4)        00       [0,4)
    [0,3)    1/2 ≤ p < 1           0        [0,4)        10       [0,4)
    [1,4)    1/2 ≤ p < 1           1        [0,4)        01       [0,4)

Langdon and Rissanen [29] suggest identifying the symbols as the more probable symbol (MPS) and less probable symbol (LPS) rather than as 1 and 0. By doing this we can often combine transitions and eliminate states.

Example 6: We modify Example 5 to use the MPS/LPS idea. We are able to reduce the coder to just two states.


                                   LPS input             MPS input
    State    Prob{MPS}             Output   Next state   Output   Next state
    [0,4)    1/2 ≤ p ≤ α           0        [0,4)        1        [0,4)
             α ≤ p < 1             00       [0,4)        -        [1,4)
    [1,4)    1/2 ≤ p < 1           01       [0,4)        1        [0,4)

Another way of simplifying an arithmetic coder is to allow only a subset of the possible interval subdivisions. Using integer arithmetic has the effect of making the symbol probabilities approximate, especially as the integer range is made smaller; limiting the number of subdivisions simply makes them even less precise. Since the main benefit of arithmetic coding is its ability to code efficiently when probabilities are close to 1, we usually want to allow at least some pairs of unequal probabilities.

Example 7: If we know that one symbol occurs considerably more often than the other, we can eliminate the transitions in Example 6 for approximately equal probabilities. This makes it unnecessary for the coder to decide which transition pair to use in the [0,4) state, and gives a very simple reduced-precision arithmetic coder.

             LPS input             MPS input
    State    Output   Next state   Output   Next state
    [0,4)    00       [0,4)        -        [1,4)
    [1,4)    01       [0,4)        1        [0,4)

This simple code is quite useful, providing almost a 50 percent improvement on the unary code for representing non-negative integers. To encode n in unary, we output n 1s and a 0. Using the code just derived, we re-encode the unary coding, treating 1 as the MPS. The resulting code consists of ⌊n/2⌋ 1s, followed by 00 if n is even and 01 if n is odd. We can do even better with slightly more complex codes, as we shall see in examples that follow.

We now introduce the maximally unbalanced subdivision and show how it can be used to obtain excellent compression when Prob{MPS} ≈ 1. Suppose the current interval is [L, H). If Prob{MPS} is very high we can subdivide the interval at L + 1 or H − 1, indicating Prob{LPS} = 1/(H − L) and Prob{MPS} = 1 − 1/(H − L). Since the length of the current interval H − L is always more than N/4, such a subdivision always indicates a Prob{MPS} of more than 1 − 4/N. By choosing a large value of N and always including the maximally unbalanced subdivision in our coder, we ensure that very likely symbols can always be given an appropriately high probability.

Example 8: Let N = 8 and let the MPS always be 1. We obtain the following four-state code if we allow only the maximally unbalanced subdivision in each state.


             0 (LPS) input         1 (MPS) input
    State    Output   Next state   Output   Next state
    [0,8)    000      [0,8)        -        [1,8)
    [1,8)    001      [0,8)        -        [2,8)
    [2,8)    010      [0,8)        -        [3,8)
    [3,8)    011      [0,8)        1        [0,8)

We can use this code to re-encode unary-coded non-negative integers with ⌊n/4⌋ + 3 bits. In effect, we represent n in the form 4a + b; we encode a in unary, then use two bits to encode b in binary.
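A sketch (ours) of this re-encoding: n = 4a + b becomes a 1s (one per pass through state [3,8)) followed by the three-bit LPS output of state [b,8).

    def encode_ex8(n):
        a, b = divmod(n, 4)
        return "1" * a + "0" + format(b, "02b")   # LPS output in state [b,8)

    print([encode_ex8(n) for n in (0, 1, 4, 5, 11)])
    # ['000', '001', '1000', '1001', '11011'], always floor(n/4) + 3 bits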

Whenever the current interval coincides with the full interval, we can switch to a different code.

Example 9: We can derive the Elias code for the positive integers [14] by using the maximally unbalanced subdivision technique of Example 8 and by doubling the full integer range whenever we see enough 1s to output a bit and expand the current interval so that it coincides with the full range. This coder has an infinite number of states; no state is visited more than once. We use the notation [L, H)/M to indicate the subinterval [L, H) selected from the range [0, M).

               0 (LPS) input         1 (MPS) input
    State      Output   Next state   Output   Next state
    [0,2)/2    0        STOP         1        [0,4)/4
    [0,4)/4    00       STOP         -        [1,4)/4
    [1,4)/4    01       STOP         1        [0,8)/8
    [0,8)/8    000      STOP         -        [1,8)/8
    [1,8)/8    001      STOP         -        [2,8)/8
    ...        ...      ...          ...      ...

This code corresponds to encoding positive integers as follows:

    n    Code
    1    0
    2    100
    3    101
    4    11000
    5    11001
    ...  ...


In effect we represent n in the form 2^a + b; we encode a in unary, then use a bits to encode b in binary. This is essentially the Elias code; it requires 2⌊lg n⌋ + 1 bits to encode n.
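The same decomposition gives a direct encoder (our sketch): with n = 2^a + b, output a in unary and then b in a bits.

    def elias_encode(n):                    # n >= 1
        a = n.bit_length() - 1              # n = 2**a + b with 0 <= b < 2**a
        b = format(n - (1 << a), "b").zfill(a) if a else ""
        return "1" * a + "0" + b

    print([elias_encode(n) for n in (1, 2, 3, 4, 5)])
    # ['0', '100', '101', '11000', '11001'], i.e. 2*floor(lg n) + 1 bits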

If we design a coder with more states, we obtain a more fine-grained set of probabilities.

Example 10: We show a six-state coder, obtained by letting N = 8 and allowing all possible subdivisions. We indicate only the center probability for each range; in practice any reasonable division will give good results. Output symbol f indicates application of the follow-on procedure.

             Approximate    LPS input             MPS input
    State    Prob{MPS}      Output   Next state   Output   Next state
    [0,8)    1/2            1        [0,8)        0        [0,8)
             5/8            1        [2,8)        -        [0,5)
             3/4            11       [0,8)        -        [0,6)
             7/8            111      [0,8)        -        [0,7)
    [0,7)    4/7            1        [0,6)        0        [0,8)
             5/7            1f       [0,8)        -        [0,5)
             6/7            110      [0,8)        -        [0,6)
    [0,6)    1/2            f        [2,8)        0        [0,6)
             2/3            10       [0,8)        0        [0,8)
             5/6            101      [0,8)        -        [0,5)
    [2,8)    1/2            f        [0,6)        1        [2,8)
             2/3            01       [0,8)        1        [0,8)
             5/6            010      [0,8)        -        [3,8)
    [0,5)    3/5            ff       [0,8)        0        [0,6)
             4/5            100      [0,8)        0        [0,8)
    [3,8)    3/5            ff       [0,8)        1        [2,8)
             4/5            011      [0,8)        1        [0,8)

This coder is easily programmed and extremely fast. Its only shortcoming is that on average high-probability symbols require 1/4 bit (corresponding to Prob{MPS} = 2^{−1/4} ≈ 0.841) no matter how high the actual probability is.
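Because every transition is precomputed, the coder needs no arithmetic at all. The sketch below (ours, not the authors' implementation) drives the Example 10 table directly in Python: each event is coded by picking the row whose center probability is closest to the model's estimate, and the letter f in an output string invokes the follow-on procedure.

    # Rows: (center Prob{MPS}, LPS output, LPS next, MPS output, MPS next).
    TABLE = {
        (0, 8): [(1/2, "1", (0, 8), "0", (0, 8)), (5/8, "1", (2, 8), "", (0, 5)),
                 (3/4, "11", (0, 8), "", (0, 6)), (7/8, "111", (0, 8), "", (0, 7))],
        (0, 7): [(4/7, "1", (0, 6), "0", (0, 8)), (5/7, "1f", (0, 8), "", (0, 5)),
                 (6/7, "110", (0, 8), "", (0, 6))],
        (0, 6): [(1/2, "f", (2, 8), "0", (0, 6)), (2/3, "10", (0, 8), "0", (0, 8)),
                 (5/6, "101", (0, 8), "", (0, 5))],
        (2, 8): [(1/2, "f", (0, 6), "1", (2, 8)), (2/3, "01", (0, 8), "1", (0, 8)),
                 (5/6, "010", (0, 8), "", (3, 8))],
        (0, 5): [(3/5, "ff", (0, 8), "0", (0, 6)), (4/5, "100", (0, 8), "0", (0, 8))],
        (3, 8): [(3/5, "ff", (0, 8), "1", (2, 8)), (4/5, "011", (0, 8), "1", (0, 8))],
    }

    def encode(events):
        """events: (is_mps, estimated Prob{MPS}) pairs. Returns output bits."""
        state, out, pending = (0, 8), [], 0
        for is_mps, p in events:
            row = min(TABLE[state], key=lambda r: abs(r[0] - p))
            bits, state = (row[3], row[4]) if is_mps else (row[1], row[2])
            for ch in bits:
                if ch == "f":                 # follow-on: defer an opposite bit
                    pending += 1
                else:
                    out.append(ch + ("0" if ch == "1" else "1") * pending)
                    pending = 0
        return "".join(out)

    # e.g. six MPS events then one LPS, all with Prob{MPS} estimated at 0.9:
    print(encode([(True, 0.9)] * 6 + [(False, 0.9)]))   # -> 0101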

Design of a class of reduced-precision coders. We now present a very flexible yet simple coder design incorporating most of the features just discussed. We choose N to be any power of 2. All states in the coder are of the form [k, N), so the number of states is only N/2. (Intervals with k ≥ N/2 will produce output, and the interval will be expanded.) In every state [k, N) we include the maximally unbalanced subdivision at k + 1, which corresponds to values of Prob{MPS} between (N − 2)/N and (N − 1)/N. We include a nearly balanced subdivision so that we will not lose efficiency when Prob{MPS} ≈ 1/2. In addition, we locate other subdivision points such that the subinterval expansion that follows each input symbol leaves the coder in a state of the form [k, N), and we choose one or more of them to correspond to intermediate values of Prob{MPS}. For simplicity we denote state [k, N) by k.

We always allow the interval [k, N) to be divided at k + 1; if the LPS occurs we output the lg N bits of k and move to state 0, while if the MPS occurs we simply move to state k + 1 (then if the new state is N/2 we output a 1 and move to state 0). The other permitted subdivisions are given in the following table. In some cases additional output and expansion may be possible. It may not be necessary to include all subdivisions in the coder.

    Range of        Subdivision                  LPS input               MPS input
    states k        LPS           MPS            Output   Next state     Output   Next state
    [0, N/2)        [k, N/2)      [N/2, N)       0        2k             1        0
    [0, N/4)        [k, N/4)      [N/4, N)       00       4k             -        N/4
    [N/8, N/4)      [k, 3N/8)     [3N/8, N)      0f       4k − N/2       -        3N/8
    [N/4, 3N/8)     [k, 3N/8)     [3N/8, N)      010      8k − 2N        -        3N/8
    [3N/8, N/2)     [k, 5N/8)     [5N/8, N)      ff       4k − 3N/2      1        N/4
    [7N/16, N/2)    [k, 9N/16)    [9N/16, N)     fff      8k − 7N/2      1        N/8
    [N/4, N/2)      [3N/4, N)     [k, 3N/4)      11       0              f        2k − N/2

For example, the fifth line indicates that for all states k for which 3N/8 ≤ k < N/2 we may subdivide the interval at 5N/8. If the LPS occurs, we perform the follow-on procedure twice, which leaves us with the interval [4k − 3N/2, N); otherwise we output a 1 and expand the interval to [N/4, N).

A coder constructed using this procedure will have a small number of states, but in every state it will allow us to use estimates of Prob{MPS} near 1, near 1/2, and in between. Thus we can choose a large N so that highly probable events require negligible code length, while keeping the number of states small enough to allow table lookups rather than arithmetic.

3.2 ε-partitions and δ-partitions

In Section 3.1 we have shown that it is possible to design a binary arithmetic coder that admits only a small number of possible probabilities. In this section we give a theoretical basis for selecting the probabilities. Often there are practical considerations limiting our choices, but we can show that it is reasonable to expect that choosing only a few probabilities will give close to optimal compression.

For a binary alphabet, we can use Equation (1) to compute E(p, q), the extra code length resulting from using estimates q and 1 − q for actual probabilities p and 1 − p, respectively. For any desired maximum excess code length ε, we can partition the space of possible probabilities to guarantee that the use of approximate probabilities will never add more than ε to the code length of any event. We select partitioning probabilities P_0, P_1, ... and estimated probabilities Q_0, Q_1, .... Each probability Q_i is used to encode all events whose probability p is in the range P_i ≤ p < P_{i+1}. We compute the partition, which we call an ε-partition, as follows:

1. Set i := 0 and Q_0 := 1/2.

2. Find the value of P_{i+1} (greater than Q_i) such that E(P_{i+1}, Q_i) = ε. We will use Q_i as the estimated probability for all probabilities p such that Q_i ≤ p < P_{i+1}.

3. Find the value of Q_{i+1} (greater than P_{i+1}) such that E(P_{i+1}, Q_{i+1}) = ε. After we compute P_{i+2} in step 2 of the next iteration, we will use Q_{i+1} as the estimate for all probabilities p such that P_{i+1} ≤ p < P_{i+2}.

We increment i and repeat steps 2 and 3 until P_{i+1} or Q_{i+1} reaches 1. The values for p < 1/2 are symmetrical with those for p > 1/2.
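The computation is easy to mechanize. This sketch (ours) finds the partition boundaries by bisection, using the fact that the excess E(p, q) grows monotonically as p moves away from q; with ε = 0.05 it reproduces the upper half of the table in Example 11 below.

    from math import log2

    def excess(p, q):          # E(p, q): excess code length per binary event
        return p * log2(p / q) + (1 - p) * log2((1 - p) / (1 - q))

    def solve(f, lo, hi):      # bisection for f increasing with f(lo) < 0 < f(hi)
        for _ in range(100):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
        return lo

    def epsilon_partition(eps, top=1 - 1e-12):
        q = 0.5
        while True:
            if excess(top, q) <= eps:       # no boundary below 1 remains
                yield 1.0, q
                return
            p = solve(lambda x: excess(x, q) - eps, q, top)
            yield p, q
            q = solve(lambda x: excess(p, x) - eps, p, top)

    for p_next, q in epsilon_partition(0.05):   # lower half is symmetric
        print(f"use {q:.4f} for probabilities up to {p_next:.4f}")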

Example 11: We show the ε-partition for ε = 0.05 bit per binary input symbol.

    Range of actual probabilities    Probability to use
    [0.0000, 0.0130)                 0.0003
    [0.0130, 0.1427)                 0.0676
    [0.1427, 0.3691)                 0.2501
    [0.3691, 0.6309)                 0.5000
    [0.6309, 0.8579)                 0.7499
    [0.8579, 0.9870)                 0.9324
    [0.9870, 1.0000]                 0.9997

Thus by using only 7 probabilities we can guarantee that the excess code length does not exceed 0.05 bit for each binary decision coded.

We might wish instead to limit the relative error, so that the code length can never exceed the optimal by more than a factor of 1 + δ. We can begin to compute these δ-partitions using a procedure similar to that for ε-partitions, but unfortunately the process does not terminate, since δ-partitions are not finite. As P approaches 1, the optimal average code length grows very small, so to obtain a small relative loss Q must be very close to P. Nevertheless, we can obtain a partial δ-partition.


Example 12: We show part of the δ-partition for δ = 0.05; the maximum relative error is 5 percent.

    Range of actual probabilities    Probability to use
    ...                              ...
    [0.0033, 0.0154)                 0.0069
    [0.0154, 0.0573)                 0.0291
    [0.0573, 0.1670)                 0.0982
    [0.1670, 0.3722)                 0.2555
    [0.3722, 0.6278)                 0.5000
    [0.6278, 0.8330)                 0.7445
    [0.8330, 0.9427)                 0.9018
    [0.9427, 0.9846)                 0.9709
    [0.9846, 0.9967)                 0.9931
    ...                              ...

In practice we will use an approximation to an ε-partition or a δ-partition for values of Prob{MPS} up to the maximum probability representable by our coder.

3.3 Compressed trees

To use the reduced-precision arithmetic coder described in Section 3.1 for an n-symbol alphabet, we need an efficient data structure to map each of n symbols to a sequence of binary choices. We might consider Huffman trees, since they minimize the average number of binary events encoded per input symbol; however, a great deal of effort is required to keep the probabilities on all branches near 1/2. For arithmetic coding maintaining this balance condition is unnecessary and wastes time.

In this section we present the compressed tree, a space-efficient data structure based on the complete binary tree. Because arithmetic coding allows us to obtain nearly optimal compression of binary events even when the two probabilities are unequal, we are free to represent the probability distribution of an n-symbol alphabet by a complete binary tree with a probability at each internal node. The tree can be flattened (linearized) by breadth-first traversal, and we can save space by storing only one probability at each internal node, say, the probability of taking the left branch. This probability can be stored to sufficient precision in just one byte, as we shall see in Section 3.4.

In high-order text models, many longer contexts occur only a few times, and only a few different alphabet symbols occur in each context. In such cases even the linear representation is wasteful of space, requiring n − 1 nodes regardless of the number of alphabet symbols that actually occur. Including pointers in the nodes would at least double their size. In the compressed tree we collapse the breadth-first linear representation of the complete binary tree by omitting nodes with zero probability. If k different symbols have non-zero probability, the compressed tree representation requires at most k(lg(2n/k) + 1) nodes.

Figure 3: Steps in the development of a compressed tree. (a) Complete binary tree; each internal node holds the probability of taking the left branch, expressed as a multiple of 0.01, with "-" marking a node of zero probability: level 1: 38; level 2: 0, 20; level 3: -, 33, 100, 25; leaves a through h. (b) Linear representation: 38 0 20 - 33 100 25. (c) Compressed tree: 38 0 20 33 100 25.

Example 13: Suppose we have the following probability distribution for an 8-symbol alphabet:

    Symbol         a    b    c     d     e     f    g     h
    Probability    0    0    1/8   1/4   1/8   0    1/8   3/8

We can represent this distribution by the tree in Figure 3(a), rounding probabilities and expressing them as multiples of 0.01. We show the linear representation in Figure 3(b) and the compressed tree representation in Figure 3(c).
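The following sketch (ours) computes the left-branch probabilities of Figure 3 from the distribution of Example 13, giving both the linear and the compressed representations.

    from fractions import Fraction

    def left_branch_probs(leaf_probs):
        """Breadth-first left-branch probabilities of the complete binary tree;
        None marks an internal node with zero probability."""
        n = len(leaf_probs)                       # must be a power of two
        w = [None] * n + list(leaf_probs)         # heap layout, leaves at n..2n-1
        for i in range(n - 1, 0, -1):
            w[i] = w[2 * i] + w[2 * i + 1]
        return [None if w[i] == 0 else w[2 * i] / w[i] for i in range(1, n)]

    probs = [Fraction(x, 8) for x in (0, 0, 1, 2, 1, 0, 1, 3)]    # a..h, Example 13
    linear = left_branch_probs(probs)
    print(["-" if p is None else round(float(p) * 100) for p in linear])
    # Figure 3(b): [38, 0, 20, '-', 33, 100, 25]
    print([round(float(p) * 100) for p in linear if p is not None])
    # Figure 3(c): [38, 0, 20, 33, 100, 25]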

Traversing the compressed tree is mainly a matter of keeping track of omitted nodes. We do not have to process each node of the tree: for the first lg n − 2 levels we have to process each node; but when we reach the desired node in the next-to-lowest level we have enough information to directly index the desired node of the lowest level. The operations are very simple, involving only one test and one or two increment operations at each node, plus a few extra operations at each level. Including the capability of adding new symbols to the tree makes the algorithm only slightly more complicated.

3.4 Representing and estimating probabilities

In our binary coded representation of each context we wish to use only one byte for each probability, and we need the probability only to limited precision. Therefore, we represent the probability at a node as a state in a finite state automaton with about 256 states. Each state indicates a probability, and some of the states also indicate the size of the sample used to estimate the probability.

We need a method for estimating the probability at each node of the binary tree. Leighton and Rivest [30] and Pennebaker and Mitchell [41] describe probabilistic methods. Their estimators are also finite state automata, with each state corresponding to a probability. When a new symbol occurs, a transition to another state may occur, the probability of the transition depending on the current state and the new symbol. Generally, the transition probability is higher when the LPS occurs. In [30] transitions occur only between adjacent states. In [41] the LPS always causes a transition, possibly to a non-adjacent state; a transition after the MPS, when one occurs, is always to an adjacent state.

We give a deterministic estimator based on the same idea. In our estimator each input symbol causes a transition (unless the MPS occurs when the estimated probability is already at its maximum value). The probabilities represented by the states are so close together that transitions often occur between non-adjacent states. The transitions are selected so that we compute the new probability p_new of the left branch by

    p_new = f·p_old + (1 − f)   if the left branch was taken,
    p_new = f·p_old             if the right branch was taken,

where f is a smoothing factor. This corresponds to exponential aging; hence the probability estimate can track changing probabilities and benefit from locality of reference, as discussed in Section 2.3.
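In code, the update rule is one line; in the coder itself the resulting estimate is quantized to one of the roughly 256 states. A sketch (ours), with f = 0.96 as chosen at the end of this section:

    F = 0.96                                  # smoothing factor

    def update(p_left, went_left):
        """Exponential-aging update of the left-branch probability."""
        return F * p_left + (1 - F) if went_left else F * p_left

    p = 0.5
    for went_left in (True, True, False, True):
        p = update(p, went_left)
    print(round(p, 4))                        # 0.5369 after three lefts, one right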

In designing a probability estimator of this type we must choose both the scaling factor f and the set of probabilities represented by the states. We should be guided by the requirements of the coder and by our lack of a priori knowledge of the process generating the sequence of branches.

First we note that when the number of occurrences is small, our estimates cannot be very accurate. Laplace's law of succession, which gives the estimate

    p = (c + 1)/(t + 2)    (2)

after c successes in t trials, offers a good balance between using all available information and allowing for random variation in the data; in effect, it gives the Bayesian estimate assuming a uniform a priori distribution for the true underlying probability P.

We recall that for values of P near 1/2 we do not require a very accurate estimate, since any value will give about the same code length; hence we do not need many states in this probability region. When P is closer to 1, we would like our estimate to be more accurate, to allow the arithmetic coder to give near-optimal compression, so we assign states more densely for larger P. Unfortunately, in this case estimation by any means is difficult, because occurrences of the LPS are so infrequent. We also note that the underlying probability of any branch in the coding tree may change at any time, and we would like our estimate to adapt accordingly.

To handle the small-sample cases, we reserve a number of states simply to count occurrences when t is small, using Equation (2) to estimate the probabilities. We do the same for larger values of t when c is 0, 1, t − 1, or t, to provide fast convergence to extreme values of P.

We can show that if the underlying probability P does not change, the expected value of the estimate p_k after k events is given by

    E(p_k) = P + (p_0 − P) f^k,

which converges to P for all f, 0 ≤ f < 1. The rapid convergence of E(p_k) when f = 0 is misleading, since in that case the estimate is always 0 or 1, depending only on the preceding event. The expected value is clearly P, but the estimator is useless. A value of f near 1 provides resistance to random fluctuations in the input, but the estimate converges slowly, both initially and when the underlying P changes. A careful choice of f would depend on a detailed analysis like that performed by Flajolet for the related problem of approximate counting [16,17]. We make a more pragmatic decision. We know that periodic scaling is an approximation to exponential aging, and we can show that a scaling factor of f corresponds to a scaling block size B of approximately f ln 2/(1 − f). Since B = 16 works well for scaling [58], we choose f = 0.96.

3.5 Improved modeling for text compression

To obtain good, fast text compression, we wish to use the multi-symbol extension of the reduced-precision arithmetic coder in conjunction with a good model. The PPM idea described in Section 2.3 has proven effective, but the ad hoc nature of the escape probability calculation is somewhat annoying. In this section we present yet another ad hoc method, which we call PPMD, and also a more complicated but more principled approach to the problem.

PPMD. Moffat's PPMC method [37] is widely considered to be the best method of estimating escape probabilities. In PPMC, each symbol's weight in a context is taken to be the number of times it has occurred so far in the context. The escape "event," that is, the occurrence of a symbol for the first time in the context, is also treated as a "symbol," with its own count. When a letter occurs for the first time, its weight becomes 1; the escape count is incremented by 1, so the total weight increases by 2. At all other times the total weight increases by 1.

We have developed a new method, which we call PPMD, which is similar to PPMC except that it makes the treatment of new symbols more consistent by adding 1/2 instead of 1 to both the escape count and the new symbol's count when a new symbol occurs; hence the total weight always increases by 1. We have compared PPMC and PPMD on the Bell-Cleary-Witten corpus [5], including the four papers not described in the book. Table 2 shows that for text files PPMD compresses consistently about 0.02 bit per character better than PPMC. The compression results for PPMC differ from those reported in [5] because of implementation differences; we used versions of PPMC and PPMD that were identical except for the escape probability calculations. PPMD has the added advantage of making analysis more tractable by making the code length independent of the appearance order of symbols in the context.

Table 2: Comparison of PPMC and PPMD. Compression figures are in bits per input symbol.

    File     Text?   PPMC    PPMD    Improvement using PPMD
    bib      Yes     2.11    2.09     0.02
    book1    Yes     2.65    2.63     0.02
    book2    Yes     2.37    2.35     0.02
    news     Yes     2.91    2.90     0.01
    paper1   Yes     2.48    2.46     0.02
    paper2   Yes     2.45    2.42     0.03
    paper3   Yes     2.70    2.68     0.02
    paper4   Yes     2.93    2.91     0.02
    paper5   Yes     3.01    3.00     0.01
    paper6   Yes     2.52    2.50     0.02
    progc    Yes     2.48    2.47     0.01
    progl    Yes     1.87    1.85     0.02
    progp    Yes     1.82    1.80     0.02
    geo      No      5.11    5.10     0.01
    obj1     No      3.68    3.70    -0.02
    obj2     No      2.61    2.61     0.00
    pic      No      0.95    0.94     0.01
    trans    No      1.74    1.72     0.02
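The difference between the two escape rules fits in a few lines. The following sketch shows the PPMD update, with the PPMC behavior noted in a comment; the dictionary representation, the escape sentinel, and the names are our own conventions.

    ESC = '<escape>'   # sentinel for the escape "symbol" (our convention)

    def ppmd_update(counts, symbol):
        # PPMD: a novel symbol adds 1/2 to its own count and 1/2 to the
        # escape count, so the total weight always grows by exactly 1.
        # (PPMC would add 1 to each, growing the total by 2 on novel symbols.)
        if symbol in counts:
            counts[symbol] += 1.0
        else:
            counts[symbol] = 0.5
            counts[ESC] = counts.get(ESC, 0.0) + 0.5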

Indirect probability estimation. Often we are faced with a situation where we have no theoretical basis for estimating the probability of an event, but where we know the factors that affect the probability. In such cases a logical and effective approach is to create conditioning classes based on the values of the factors, and to estimate the probability adaptively for each class. In the PPM method, we know that the number of occurrences of a state (t) and the number of different alphabet symbols that have occurred (k) are the factors affecting p_esc. We have done experiments, using all combinations of t and k as the conditioning classes, except that we group together all values of t greater than 48 and all values of k greater than 18. In our experiments we use a third-order model; when a symbol has not occurred previously in its context of length 3, we simply use 8 bits to indicate the ASCII value of the symbol. The idea of skipping some shorter contexts for speed, space, and simplicity appears also in [31]. Even with this simplistic way of dropping to shorter contexts, the improved estimation of p_esc gives slightly better overall compression than PPMC for book1, the longest file in the Bell-Cleary-Witten corpus. We expect that using indirect probability estimation in conjunction with the full multi-order PPM mechanism will yield substantially improved compression.
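A sketch of the conditioning-class bookkeeping appears below. The class structure (capping t at 48 and k at 18) follows the text, but the within-class estimator (the Laplace rule of Equation 2) and all of the names are our assumptions for illustration.

    class EscapeEstimator:
        # Indirect estimation of p_esc: condition on (t, k), the number
        # of occurrences of the state and the number of distinct symbols
        # seen, and adapt a separate estimate for each class.
        def __init__(self, t_cap=48, k_cap=18):
            self.t_cap, self.k_cap = t_cap, k_cap
            self.stats = {}          # (t, k) class -> (escapes, events)

        def _cls(self, t, k):
            # Group together all t > t_cap and all k > k_cap.
            return (min(t, self.t_cap + 1), min(k, self.k_cap + 1))

        def prob(self, t, k):
            esc, n = self.stats.get(self._cls(t, k), (0, 0))
            return (esc + 1) / (n + 2)   # Laplace estimate within the class

        def update(self, t, k, escaped):
            esc, n = self.stats.get(self._cls(t, k), (0, 0))
            self.stats[self._cls(t, k)] = (esc + int(escaped), n + 1)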

3.6 Hashed high-order Markov models

For finding contexts in the PPM method, Moffat [37] and Bell et al. [5] give complicated data structures called backward trees and vine pointers. For fast access and minimal memory usage we propose single hashing without collision resolution. One might expect that using the same bucket for accumulating statistics from unrelated contexts would significantly degrade compression performance, but we can show that often this is not the case.
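As a sketch (the table size, hash choice, and names are ours), single hashing without collision resolution reduces context lookup to one hash computation, at the price of unrelated contexts sharing a bucket.

    import zlib

    NUM_BUCKETS = 1 << 16                           # illustrative table size
    table = [dict() for _ in range(NUM_BUCKETS)]    # bucket -> symbol counts

    def bucket_for(context: bytes):
        # No collision resolution: contexts that hash together simply
        # pool their statistics in one bucket.  A stable hash (CRC-32
        # here) keeps encoder and decoder in agreement across runs.
        return table[zlib.crc32(context) % NUM_BUCKETS]

    def record(context: bytes, symbol: int):
        counts = bucket_for(context)
        counts[symbol] = counts.get(symbol, 0) + 1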


Even in the worst case, when the symbols from the k colliding contexts in bucket b are mutually disjoint, the additional code length is only $H_b = H(p_1, p_2, p_3, \ldots, p_k)$, the entropy of the ensemble of probabilities of occurrence of the contexts. We show this by conceptually dividing the bucket into disjoint subtrees corresponding to the various contexts, and noting that the cost of identifying an individual symbol is just $L_C = -\lg p_i$, the cost of identifying the context that occurred, plus $L_S$, the cost of identifying the symbol in its own context. Hence the extra cost is just $L_C$, and the average extra cost is $-\sum_{i=1}^{k} p_i \lg p_i = H_b$. The maximum value of $H_b$ is $\lg k$, so in buckets that contain data from only two contexts, the extra code length is at most 1 bit per input symbol.
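The bound is easy to evaluate; in this sketch (names ours) H_b is computed from the occurrence probabilities of the colliding contexts.

    import math

    def collision_overhead(context_probs):
        # H_b = -sum p_i lg p_i: expected extra bits per symbol when the
        # listed contexts, assumed mutually disjoint, share one bucket.
        return -sum(p * math.log2(p) for p in context_probs if p > 0)

    # Two equally likely colliding contexts cost at most
    # collision_overhead([0.5, 0.5]) = 1.0 bit per input symbol,
    # the lg k bound for k = 2.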

In fact, when the number of colliding contexts in a bucket is large enough that $H_b$ is significant, the symbols in the bucket, representing a combination of a number of contexts, will be a microcosm of the entire file; the bucket's average code length will approximately equal the 0-order entropy of the file. Lelewer and Hirschberg [31] apply hashing with collision resolution in a similar high-order scheme.

4 Conclusion

We have shown the details of an implementation of arithmetic coding and have pointed out its advantages (flexibility and near-optimality) and its main disadvantage (slowness). We have developed a fast coder, based on reduced-precision arithmetic coding, which gives only minimal loss of compression efficiency; we can use the concept of ε-partitions to find the probabilities to include in the coder to keep the compression loss small. In a companion paper [24], in which we refer to this fast coding method as quasi-arithmetic coding, we give implementation details and performance analysis for both binary and multi-symbol alphabets. We prove analytically that the loss in compression efficiency compared with exact arithmetic coding is negligible.

We introduce the compressed tree, a new data structure for efficiently representing a multi-symbol alphabet by a series of binary choices. Our new deterministic probability estimation scheme allows fast updating of the model stored in the compressed tree using only one byte for each node; the model can provide the reduced-precision coder with the probabilities it needs. Choosing one of our two new methods for computing the escape probability enables us to use the highly effective PPM algorithm, and use of a hashed Markov model keeps space and time requirements manageable even for a high-order model.

References

[1] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, NY, 1963.

[2] R. B. Arps, T. K. Truong, D. J. Lu, R. C. Pasco & T. D. Friedman, "A Multi-Purpose VLSI Chip for Adaptive Data Compression of Bilevel Images," IBM J. Res. Develop. 32 (Nov. 1988), 775–795.

[3] T. Bell, "A Unifying Theory and Improvements for Existing Approaches to Text Compression," Univ. of Canterbury, Ph.D. Thesis, 1986.

[4] T. Bell & A. M. Moffat, "A Note on the DMC Data Compression Scheme," Computer Journal 32 (1989), 16–20.

[5] T. C. Bell, J. G. Cleary & I. H. Witten, Text Compression, Prentice-Hall, Englewood Cliffs, NJ, 1990.

[6] T. C. Bell, I. H. Witten & J. G. Cleary, "Modeling for Text Compression," Comput. Surveys 21 (Dec. 1989), 557–591.

[7] J. L. Bentley, D. D. Sleator, R. E. Tarjan & V. K. Wei, "A Locally Adaptive Data Compression Scheme," Comm. ACM 29 (Apr. 1986), 320–330.

[8] A. C. Blumer & R. J. McEliece, "The Rényi Redundancy of Generalized Huffman Codes," IEEE Trans. Inform. Theory IT-34 (Sept. 1988), 1242–1249.

[9] R. M. Capocelli, R. Giancarlo & I. J. Taneja, "Bounds on the Redundancy of Huffman Codes," IEEE Trans. Inform. Theory IT-32 (Nov. 1986), 854–857.

[10] D. Chevion, E. D. Karnin & E. Walach, "High Efficiency, Multiplication Free Approximation of Arithmetic Coding," in Proc. Data Compression Conference, J. A. Storer & J. H. Reif, eds., Snowbird, Utah, Apr. 8–11, 1991, 43–52.

[11] G. V. Cormack & R. N. Horspool, "Data Compression Using Dynamic Markov Modelling," Computer Journal 30 (Dec. 1987), 541–550.

[12] G. V. Cormack & R. N. Horspool, "Algorithms for Adaptive Huffman Codes," Inform. Process. Lett. 18 (Mar. 1984), 159–165.

[13] P. Elias, "Interval and Recency Rank Source Coding: Two On-line Adaptive Variable Length Schemes," IEEE Trans. Inform. Theory IT-33 (Jan. 1987), 3–10.

[14] P. Elias, "Universal Codeword Sets and Representations of Integers," IEEE Trans. Inform. Theory IT-21 (Mar. 1975), 194–203.

[15] N. Faller, "An Adaptive System for Data Compression," Record of the 7th Asilomar Conference on Circuits, Systems, and Computers, 1973.

[16] Ph. Flajolet, "Approximate Counting: a Detailed Analysis," BIT 25 (1985), 113–134.

[17] Ph. Flajolet & G. N. N. Martin, "Probabilistic Counting Algorithms for Data Base Applications," INRIA, Rapport de Recherche No. 313, June 1984.

[18] R. G. Gallager, "Variations on a Theme by Huffman," IEEE Trans. Inform. Theory IT-24 (Nov. 1978), 668–674.

[19] M. Guazzo, "A General Minimum-Redundancy Source-Coding Algorithm," IEEE Trans. Inform. Theory IT-26 (Jan. 1980), 15–25.

[20] M. E. Hellman, "Joint Source and Channel Encoding," Proc. Seventh Hawaii International Conf. System Sci., 1974.

[21] R. N. Horspool, "Improving LZW," in Proc. Data Compression Conference, J. A. Storer & J. H. Reif, eds., Snowbird, Utah, Apr. 8–11, 1991, 332–341.

[22] P. G. Howard & J. S. Vitter, "Analysis of Arithmetic Coding for Data Compression," Information Processing and Management 28 (1992), 749–763.

[23] P. G. Howard & J. S. Vitter, "New Methods for Lossless Image Compression Using Arithmetic Coding," Information Processing and Management 28 (1992), 765–779.

[24] P. G. Howard & J. S. Vitter, "Design and Analysis of Fast Text Compression Based on Quasi-Arithmetic Coding," in Proc. Data Compression Conference, J. A. Storer & M. Cohn, eds., Snowbird, Utah, Mar. 30–Apr. 1, 1993, 98–107.

[25] D. A. Huffman, "A Method for the Construction of Minimum Redundancy Codes," Proceedings of the Institute of Radio Engineers 40 (1952), 1098–1101.

[26] D. E. Knuth, "Dynamic Huffman Coding," J. Algorithms 6 (June 1985), 163–180.

[27] G. G. Langdon, "Probabilistic and Q-Coder Algorithms for Binary Source Adaptation," in Proc. Data Compression Conference, J. A. Storer & J. H. Reif, eds., Snowbird, Utah, Apr. 8–11, 1991, 13–22.

[28] G. G. Langdon, "A Note on the Ziv-Lempel Model for Compressing Individual Sequences," IEEE Trans. Inform. Theory IT-29 (Mar. 1983), 284–287.

[29] G. G. Langdon & J. Rissanen, "Compression of Black-White Images with Arithmetic Coding," IEEE Trans. Comm. COM-29 (1981), 858–867.

[30] F. T. Leighton & R. L. Rivest, "Estimating a Probability Using Finite Memory," IEEE Trans. Inform. Theory IT-32 (Nov. 1986), 733–742.

[31] D. A. Lelewer & D. S. Hirschberg, "Streamlining Context Models for Data Compression," in Proc. Data Compression Conference, J. A. Storer & J. H. Reif, eds., Snowbird, Utah, Apr. 8–11, 1991, 313–322.

[32] V. S. Miller & M. N. Wegman, "Variations on a Theme by Ziv and Lempel," in Combinatorial Algorithms on Words, A. Apostolico & Z. Galil, eds., NATO ASI Series F12, Springer-Verlag, Berlin, 1984, 131–140.

[33] J. L. Mitchell & W. B. Pennebaker, "Optimal Hardware and Software Arithmetic Coding Procedures for the Q-Coder," IBM J. Res. Develop. 32 (Nov. 1988), 727–736.

[34] A. M. Moffat, "Predictive Text Compression Based upon the Future Rather than the Past," Australian Computer Science Communications 9 (1987), 254–261.

[35] A. M. Moffat, "Word-Based Text Compression," Software–Practice and Experience 19 (Feb. 1989), 185–198.

[36] A. M. Moffat, "Linear Time Adaptive Arithmetic Coding," IEEE Trans. Inform. Theory IT-36 (Mar. 1990), 401–406.

[37] A. M. Moffat, "Implementing the PPM Data Compression Scheme," IEEE Trans. Comm. COM-38 (Nov. 1990), 1917–1921.

[38] K. Mohiuddin, J. J. Rissanen & M. Wax, "Adaptive Model for Nonstationary Sources," IBM Technical Disclosure Bulletin 28 (Apr. 1986), 4798–4800.

[39] D. S. Parker, "Conditions for the Optimality of the Huffman Algorithm," SIAM J. Comput. 9 (Aug. 1980), 470–489.

[40] R. Pasco, "Source Coding Algorithms for Fast Data Compression," Stanford Univ., Ph.D. Thesis, 1976.

[41] W. B. Pennebaker & J. L. Mitchell, "Probability Estimation for the Q-Coder," IBM J. Res. Develop. 32 (Nov. 1988), 737–752.

[42] W. B. Pennebaker & J. L. Mitchell, "Software Implementations of the Q-Coder," IBM J. Res. Develop. 32 (Nov. 1988), 753–774.

[43] W. B. Pennebaker, J. L. Mitchell, G. G. Langdon & R. B. Arps, "An Overview of the Basic Principles of the Q-Coder Adaptive Binary Arithmetic Coder," IBM J. Res. Develop. 32 (Nov. 1988), 717–726.

[44] J. Rissanen, "Modeling by Shortest Data Description," Automatica 14 (1978), 465–471.

[45] J. Rissanen, "A Universal Prior for Integers and Estimation by Minimum Description Length," Ann. Statist. 11 (1983), 416–432.

[46] J. Rissanen, "Universal Coding, Information, Prediction, and Estimation," IEEE Trans. Inform. Theory IT-30 (July 1984), 629–636.

[47] J. Rissanen & G. G. Langdon, "Universal Modeling and Coding," IEEE Trans. Inform. Theory IT-27 (Jan. 1981), 12–23.

[48] J. J. Rissanen, "Generalized Kraft Inequality and Arithmetic Coding," IBM J. Res. Develop. 20 (May 1976), 198–203.

[49] J. J. Rissanen & G. G. Langdon, "Arithmetic Coding," IBM J. Res. Develop. 23 (Mar. 1979), 146–162.

[50] J. J. Rissanen & K. M. Mohiuddin, "A Multiplication-Free Multialphabet Arithmetic Code," IEEE Trans. Comm. 37 (Feb. 1989), 93–98.

[51] C. Rogers & C. D. Thomborson, "Enhancements to Ziv-Lempel Data Compression," Dept. of Computer Science, Univ. of Minnesota, Technical Report TR 89-2, Duluth, Minnesota, Jan. 1989.

[52] F. Rubin, "Arithmetic Stream Coding Using Fixed Precision Registers," IEEE Trans. Inform. Theory IT-25 (Nov. 1979), 672–675.

[53] B. Y. Ryabko, "Data Compression by Means of a Book Stack," Problemy Peredachi Informatsii 16 (1980).

[54] C. E. Shannon, "A Mathematical Theory of Communication," Bell Syst. Tech. J. 27 (July 1948), 379–423.

[55] J. S. Vitter, "Dynamic Huffman Coding," ACM Trans. Math. Software 15 (June 1989), 158–167; also appears as Algorithm 673, Collected Algorithms of ACM, 1989.

[56] J. S. Vitter, "Design and Analysis of Dynamic Huffman Codes," Journal of the ACM 34 (Oct. 1987), 825–845.

[57] I. H. Witten & T. C. Bell, "The Zero Frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression," IEEE Trans. Inform. Theory IT-37 (July 1991), 1085–1094.

[58] I. H. Witten, R. M. Neal & J. G. Cleary, "Arithmetic Coding for Data Compression," Comm. ACM 30 (June 1987), 520–540.

[59] J. Ziv & A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Trans. Inform. Theory IT-23 (May 1977), 337–343.

[60] J. Ziv & A. Lempel, "Compression of Individual Sequences via Variable Rate Coding," IEEE Trans. Inform. Theory IT-24 (Sept. 1978), 530–536.