Arithmetic Coding for Data Compression

PAUL G. HOWARD and JEFFREY SCOTT VITTER, Fellow, IEEE

Arithmetic coding provides an effective mechanism for removing redundancy in the encoding of data. We show how arithmetic coding works and describe an efficient implementation that uses table lookup as a fast alternative to arithmetic operations. The reduced-precision arithmetic has a provably negligible effect on the amount of compression achieved. We can speed up the implementation further by use of parallel processing. We discuss the role of probability models and how they provide probability information to the arithmetic coder. We conclude with perspectives on the comparative advantages and disadvantages of arithmetic coding.

Index terms: Data compression, arithmetic coding, lossless compression, text modeling, image compression, text compression, adaptive, semi-adaptive.

Manuscript received Mmm DD, 1993. Some of this work was performed while both authors were at Brown University and while the first author was at Duke University. Support was provided in part by NASA Graduate Student Researchers Program grant NGT-50420, by a National Science Foundation Presidential Young Investigator Award with matching funds from IBM, and by Air Force Office of Scientific Research grant number F49620-92-J-0515. Additional support was provided by Universities Space Research Association/CESDIS associate memberships. P. G. Howard is with AT&T Bell Laboratories, Visual Communications Research, Room 4C-516, 101 Crawfords Corner Road, Holmdel, NJ 07733-3030. J. S. Vitter is with the Department of Computer Science, Duke University, Box 90129, Durham, NC 27708-0129. IEEE Log Number 0000000.

I. Arithmetic coding

The fundamental problem of lossless compression is to decompose a data set (for example, a text file or an image) into a sequence of events, then to encode the events using as few bits as possible. The idea is to assign short codewords to more probable events and longer codewords to less probable events. Data can be compressed whenever some events are more likely than others. Statistical coding techniques use estimates of the probabilities of the events to assign the codewords. Given a set of mutually distinct events e_1, e_2, e_3, ..., e_n, and an accurate assessment of the probability distribution P of the events, Shannon [1] proved that the smallest possible expected number of bits needed to encode an event is the entropy of P, denoted by

    H(P) = \sum_{k=1}^{n} -p\{e_k\} \log_2 p\{e_k\},

where p{e_k} is the probability that event e_k occurs. An optimal code outputs -log_2 p bits to encode an event whose probability of occurrence is p. Pure arithmetic coding supplied with accurate probabilities provides optimal compression. The older and better-known Huffman codes [2] are optimal only among instantaneous codes, that is, those in which the encoding of one event can be decoded before encoding has begun for the next event.
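To make these quantities concrete (an illustration of ours, not part of the original text), the following Python fragment computes the ideal code lengths -log_2 p and the entropy for a small example distribution:

```python
import math

def entropy(probs):
    """H(P) = sum of -p * log2(p): the smallest possible expected
    number of bits per event under distribution P."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Ideal code lengths -log2(p) for a skewed three-event distribution.
P = [0.7, 0.2, 0.1]
print([round(-math.log2(p), 3) for p in P])  # [0.515, 2.322, 3.322]
print(round(entropy(P), 3))                  # 1.157 bits per event
```

A Huffman code for this distribution must use whole-bit codewords (1, 2, and 2 bits), averaging 1.3 bits per event; an arithmetic coder can approach the 1.157-bit entropy.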

In theory, arithmetic codes assign one "codeword" to each possible data set. The codewords consist of half-open subintervals of the half-open unit interval [0, 1), and are expressed by specifying enough bits to distinguish the subinterval corresponding to the actual data set from all other possible subintervals. Shorter codes correspond to larger subintervals and thus more probable input data sets. In practice, the subinterval is refined incrementally using the probabilities of the individual events, with bits being output as soon as they are known. Arithmetic codes almost always give better compression than prefix codes, but they lack the direct correspondence between the events in the input data set and bits or groups of bits in the coded output file.

A statistical coder must work in conjunction with a modeler that estimates the probability of each possible event at each point in the coding. The probability model need not describe the process that generates the data; it merely has to provide a probability distribution for the data items. The probabilities do not even have to be particularly accurate, but the more accurate they are, the better the compression will be. If the probabilities are wildly inaccurate, the file may even be expanded rather than compressed, but the original data can still be recovered. To obtain maximum compression of a file, we need both a good probability model and an efficient way of representing or learning the probability model.

To ensure decodability, the encoder is limited to the use of model information that is available to the decoder. There are no other restrictions on the model; in particular, it can change as the file is being encoded. The models can be adaptive (dynamically estimating the probability of each event based on all events that precede it), semi-adaptive (using a preliminary pass of the input file to gather statistics), or non-adaptive (using fixed probabilities for all files). Non-adaptive models can perform arbitrarily poorly [3]. Adaptive codes allow one-pass coding but require a more complicated data structure. Semi-adaptive codes require two passes and transmission of model data as side information; if the model data is transmitted efficiently they can provide slightly better compression than adaptive codes, but in general the cost of transmitting the model is about the same as the "learning" cost in the adaptive case [4].

To get good compression we need models that go beyond global event counts and take into account the structure of the data. For images this usually means using the numeric intensity values of nearby pixels to predict the intensity of each new pixel and using a suitable probability distribution for the residual error to allow for noise and variation between regions within the image. For text, the previous letters form a context, in the manner of a Markov process.

In Section II, we provide a detailed description of pure arithmetic coding, along with an example to illustrate the process. We also show enhancements that allow incremental transmission and fixed-precision arithmetic. In Section III we extend the fixed-precision idea to low precision, and show how we can speed up arithmetic coding with little degradation of compression performance by doing all the arithmetic ahead of time and storing the results in lookup tables. We call the resulting procedure quasi-arithmetic coding. In Section IV we briefly explore the possibility of parallel coding using quasi-arithmetic coding. In Section V we discuss the modeling process, separating it into structural and probability estimation components. Each component can be adaptive, semi-adaptive, or static; there are two approaches to the probability estimation problem. Finally, Section VI provides a discussion of the advantages and disadvantages of arithmetic coding and suggestions of alternative methods.

II. How arithmetic coding works

In this section we explain how arithmetic coding works and give operational details; our treatment is based on that of Witten, Neal, and Cleary [5]. Our focus is on encoding, but the decoding process is similar.

A. Basic algorithm for arithmetic coding

The algorithm for encoding a file using arithmetic coding works conceptually as follows:

1. We begin with a "current interval" [L, H) initialized to [0, 1).
2. For each event in the file, we perform two steps.
   (a) We subdivide the current interval into subintervals, one for each possible event. The size of an event's subinterval is proportional to the estimated probability that the event will be the next event in the file, according to the model of the input.
   (b) We select the subinterval corresponding to the event that actually occurs next, and make it the new current interval.
3. We output enough bits to distinguish the final current interval from all other possible final intervals.

The length of the final subinterval is clearly equal to the product of the probabilities of the individual events, which is the probability p of the particular sequence of events in the file. The final step uses at most \lfloor -\log_2 p \rfloor + 2 bits to distinguish the file from all other possible files. We need some mechanism to indicate the end of the file, either a special end-of-file event coded just once, or some external indication of the file's length. Either method adds only a small amount to the code length.

In step 2, we need to compute only the subinterval corresponding to the event a_i that actually occurs. To do this it is convenient to use two "cumulative" probabilities: the cumulative probability P_C = \sum_{k=1}^{i-1} p_k and the next-cumulative probability P_N = P_C + p_i = \sum_{k=1}^{i} p_k. The new subinterval is [L + P_C(H - L), L + P_N(H - L)). The need to maintain and supply cumulative probabilities requires the model to have a complicated data structure, especially when many more than two events are possible.
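The update in step 2 transcribes directly from these definitions; here is a small sketch of ours using exact rationals (the three-event model below is hypothetical):

```python
from fractions import Fraction

def select_subinterval(L, H, probs, i):
    """New current interval for event i: [L + P_C(H-L), L + P_N(H-L)),
    where P_C and P_N are the cumulative probabilities just below and
    just above event i in the model's event ordering."""
    P_C = sum(probs[:i], Fraction(0))    # cumulative probability
    P_N = P_C + probs[i]                 # next-cumulative probability
    return L + P_C * (H - L), L + P_N * (H - L)

# Hypothetical three-event model; the middle event occurs.
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
print(select_subinterval(Fraction(0), Fraction(1), probs, 1))
# (Fraction(1, 2), Fraction(5, 6))
```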

We now provide an example, repeated a number of times to illustrate different steps in the development of arithmetic coding. For simplicity we choose between just two events at each step, although the principles apply to the multi-event case as well. We assume that we know a priori that we have a file consisting of three events (or three letters in the case of text compression); the first event is either a_1 with probability p{a_1} = 2/3 or b_1 with probability p{b_1} = 1/3; the second event is a_2 with probability p{a_2} = 1/2 or b_2 with probability p{b_2} = 1/2; and the third event is a_3 with probability p{a_3} = 3/5 or b_3 with probability p{b_3} = 2/5. The actual file to be encoded is the sequence b_1 a_2 b_3.

The steps involved in pure arithmetic coding are illustrated in Table 1 and Fig. 1. In this example the final interval corresponding to the actual file b_1 a_2 b_3 is [23/30, 5/6). The length of the interval is 1/15, which is the probability of b_1 a_2 b_3, computed by multiplying the probabilities of the three events: p{b_1} p{a_2} p{b_3} = (1/3)(1/2)(2/5) = 1/15. In binary, the final interval is [0.110001..._2, 0.110101..._2). Since all binary numbers that begin with 0.11001 are entirely within this interval, outputting 11001 suffices to uniquely identify the interval.

Table 1. Example of pure arithmetic coding

    Action                                   Subintervals
    Start                                    [0, 1)
    Subdivide with left prob. p{a_1} = 2/3   [0, 2/3); [2/3, 1)
    Input b_1, select right subinterval      [2/3, 1)
    Subdivide with left prob. p{a_2} = 1/2   [2/3, 5/6); [5/6, 1)
    Input a_2, select left subinterval       [2/3, 5/6)
    Subdivide with left prob. p{a_3} = 3/5   [2/3, 23/30); [23/30, 5/6)
    Input b_3, select right subinterval      [23/30, 5/6)
                                             = [0.110001..._2, 0.110101..._2)
    Output 11001                             (0.11001_2 is the shortest binary fraction
                                             that lies within [23/30, 5/6))

Fig. 1. Pure arithmetic coding graphically illustrated. (The figure shows the subdivide/select steps of Table 1 on the unit interval, ending with the output 11001 and the interval labels 0.11001_2 and 0.11010_2.)
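Putting the whole basic algorithm together, the following minimal encoder (our sketch, not code from the paper) reproduces this example with exact rational arithmetic:

```python
from fractions import Fraction
import math

def encode_pure(events, left_probs):
    """Pure arithmetic coding: refine the current interval [L, H) once
    per event (step 2), then output the shortest bit string b1..bk such
    that every number beginning 0.b1..bk lies in the final interval
    (step 3)."""
    L, H = Fraction(0), Fraction(1)
    for event, p in zip(events, left_probs):
        mid = L + p * (H - L)                  # subdivide [L, H)
        L, H = (L, mid) if event == 'a' else (mid, H)
    k = 1
    while True:
        m = math.ceil(L * 2**k)                # smallest multiple of 2^-k >= L
        if Fraction(m + 1, 2**k) <= H:         # [m/2^k, (m+1)/2^k) fits in [L, H)
            return format(m, '0%db' % k)
        k += 1

# The example file b1 a2 b3, with p{a_1} = 2/3, p{a_2} = 1/2, p{a_3} = 3/5:
print(encode_pure('bab', [Fraction(2, 3), Fraction(1, 2), Fraction(3, 5)]))  # 11001
```

The final loop implements step 3: it searches for the shortest dyadic interval contained in the final interval, which for this file is found at k = 5, giving 11001 as in Table 1.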

B. Incremental output

The basic implementation of arithmetic coding described above has two major difficulties: the shrinking current interval requires the use of high precision arithmetic, and no output is produced until the entire file has been read. The most straightforward solution to both of these problems is to output each leading bit as soon as it is known, and then to double the length of the current interval so that it reflects only the unknown part of the final interval. Witten, Neal, and Cleary [5] add a clever mechanism for preventing the current interval from shrinking too much when the endpoints are close to 1/2 but straddle 1/2. In that case we do not yet know the next output bit, but we do know that whatever it is, the following bit will have the opposite value; we merely keep track of that fact, and expand the current interval symmetrically about 1/2. This follow-on procedure may be repeated any number of times, so the current interval size is always strictly longer than 1/4.

Mechanisms for incremental transmission and fixed precision arithmetic have been developed through the years by Pasco [6], Rissanen [7], Rubin [8], Rissanen and Langdon [9], Guazzo [10], and Witten, Neal, and Cleary [5]. The bit-stuffing idea of Langdon and others at IBM, which limits the propagation of carries in the additions, serves a function similar to that of the follow-on procedure described above.

We now describe in detail how the incremental output and interval expansion work. We add the following step immediately after the selection of the subinterval corresponding to an input event (step 2(b) in the basic algorithm above):

2(c) We repeatedly execute the following steps in sequence until the loop is explicitly halted:
   1. If the new subinterval is not entirely within one of the intervals [0, 1/2), [1/4, 3/4), or [1/2, 1), we exit the loop and return.
   2. If the new subinterval lies entirely within [0, 1/2), we output 0 and any following 1s left over from previous events; then we double the size of the subinterval by linearly expanding [0, 1/2) to [0, 1).
   3. If the new subinterval lies entirely within [1/2, 1), we output 1 and any following 0s left over from previous events; then we double the size of the subinterval by linearly expanding [1/2, 1) to [0, 1).
   4. If the new subinterval lies entirely within [1/4, 3/4), we keep track of this fact for future output by incrementing the follow count; then we double the size of the subinterval by linearly expanding [1/4, 3/4) to [0, 1).
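These four cases transcribe almost line for line into code. The sketch below is ours and uses exact fractions for clarity; a real coder applies the same logic to the integer intervals of Section II-C:

```python
from fractions import Fraction

HALF, QUARTER = Fraction(1, 2), Fraction(1, 4)

def expand(L, H, bits, follow):
    """Step 2(c): repeatedly output known leading bits (plus any deferred
    follow-on bits) and double the current interval [L, H).  Returns the
    expanded interval, the output so far, and the follow count."""
    while True:
        if H <= HALF:                              # case 2: entirely in [0, 1/2)
            bits += '0' + '1' * follow; follow = 0
            L, H = 2 * L, 2 * H
        elif L >= HALF:                            # case 3: entirely in [1/2, 1)
            bits += '1' + '0' * follow; follow = 0
            L, H = 2 * L - 1, 2 * H - 1
        elif QUARTER <= L and H <= 3 * QUARTER:    # case 4: straddles 1/2
            follow += 1                            # next bit unknown; defer it
            L, H = 2 * L - HALF, 2 * H - HALF
        else:                                      # case 1: cannot expand further
            return L, H, bits, follow

# Reproduce the first expansion of Table 2 below: input b1 gives [2/3, 1).
print(expand(Fraction(2, 3), Fraction(1), '', 0))
# (Fraction(1, 3), Fraction(1, 1), '1', 0)
```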

Table 2 and Fig. 2 illustrate this process. In the example, interval expansion occurs exactly once for each input event, but in other cases it may occur more than once or not at all. The follow-on procedure is applied when processing the second input event a_2. The 1 output after processing the third event b_3 is therefore followed by its complement 0. The final interval is [2/15, 2/3). Since all binary numbers that start with 0.01 are within this range, outputting 01 suffices to uniquely identify the range. The encoded file is 11001, as before. This is no coincidence: the computations are essentially the same. The final interval is eight times as long as in the previous example because of the three doublings of the current interval.

Table 2. Example of pure arithmetic coding with incremental transmission and interval expansion

    Action                                    Subintervals
    Start                                     [0, 1)
    Subdivide with left prob. p{a_1} = 2/3    [0, 2/3); [2/3, 1)
    Input b_1, select right subinterval       [2/3, 1)
    Output 1, expand                          [1/3, 1)
    Subdivide with left prob. p{a_2} = 1/2    [1/3, 2/3); [2/3, 1)
    Input a_2, select left subinterval        [1/3, 2/3)
    Increment follow count, expand            [1/6, 5/6)
    Subdivide with left prob. p{a_3} = 3/5    [1/6, 17/30); [17/30, 5/6)
    Input b_3, select right subinterval       [17/30, 5/6)
    Output 1, output 0 (follow bit), expand   [2/15, 2/3)
    Output 01                                 ([1/4, 1/2) is entirely within [2/15, 2/3))

Fig. 2. Pure arithmetic coding with incremental transmission and interval expansion, graphically illustrated. (The figure repeats the steps of Table 2, showing each output/expand and follow/expand operation, ending with the output 01.)

Clearly the current interval contains some information about the preceding inputs; this information has not yet been output, so we can think of it as the coder's state. If a is the length of the current interval, the state holds -log_2 a bits not yet output. In the basic method the state contains all the information about the output, since nothing is output until the end. In the incremental implementation, the state always contains fewer than two bits of output information, since the length of the current interval is always more than 1/4. The final state in the incremental example is [2/15, 2/3), which contains log_2(15/8) ≈ 0.907 bits of information; the final two output bits are needed to unambiguously transmit this information.

C. Use of integer arithmetic

In practice, the arithmetic can be done by storing the endpoints of the current interval as sufficiently large integers rather than in floating point or exact rational numbers. We also use integers for the frequency counts used to estimate event probabilities. The subdivision process involves selecting non-overlapping intervals (of length at least 1) with lengths approximately proportional to the counts. Table 3 illustrates the use of integer arithmetic using a full interval of [0, N) = [0, 1024). The graphical version of Table 3 is essentially the same as Fig. 2 and is not included. The length of the current interval is always at least N/4 + 2 (258 in this case), so we can always use probabilities precise to at least 1/258; often the precision will be near 1/1024. In practice we use even larger integers; the interval [0, 65536) is a common choice, and gives a practically continuous choice of probabilities at each step. The subdivisions in this example are not quite the same as those in Table 2 because the resulting intervals are rounded to integers. The encoded file is 11001 as before, but for a longer input file the encodings would eventually diverge.
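One plausible way to perform the rounded subdivision (our illustration; implementations differ in rounding details and in how they guarantee the minimum length of 1) reproduces the first step of Table 3:

```python
def subdivide_int(L, H, weight_left, weight_total):
    """Split the integer interval [L, H) at a rounded point, giving each
    event a subinterval of length at least 1 with length approximately
    proportional to its weight."""
    width = H - L
    split = L + max(1, min(width - 1, round(width * weight_left / weight_total)))
    return (L, split), (split, H)

# p{a_1} = 2/3 on [0, 1024): the split lands at round(1024 * 2/3) = 683,
# an effective left probability of 683/1024 ~ 0.66699, as in Table 3.
print(subdivide_int(0, 1024, 2, 3))   # ((0, 683), (683, 1024))
```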

Table 3. Example of arithmetic coding with incremental transmission, interval expansion, and integer arithmetic. Full interval is [0, 1024), so in effect subinterval endpoints are constrained to be multiples of 1/1024.

    Action                                            Subintervals
    Start                                             [0, 1024)
    p{a_1} = 2/3; subdivide with
      left probability = 683/1024 ≈ 0.66699           [0, 683); [683, 1024)
    Input b_1, select right subinterval               [683, 1024)
    Output 1, expand                                  [342, 1024)
    Subdivide with left prob. p{a_2} = 1/2            [342, 683); [683, 1024)
    Input a_2, select left subinterval                [342, 683)
    Increment follow count, expand                    [172, 854)
    p{a_3} = 3/5; subdivide with
      left probability = 409/682 ≈ 0.59971            [172, 581); [581, 854)
    Input b_3, select right subinterval               [581, 854)
    Output 1, output 0 (follow bit), expand           [138, 654)
    Output 01                                         ([256, 512) is entirely within [138, 654))

III. Limited-precision arithmetic coding

Arithmetic coding as it is usually implemented is slow because of the multiplications (and in some implementations, divisions) required in subdividing the current interval according to the probability information. Since small errors in probability estimates cause very small increases in code length, we expect that by introducing approximations into the arithmetic coding process in a controlled way we can improve coding speed without significantly degrading compression performance. In the Q-Coder work at IBM [11], the time-consuming multiplications are replaced by additions and shifts, and low-order bits are ignored. In [12] we describe a different approach to approximate arithmetic coding. Recalling that the fractional bits characteristic of arithmetic coding are stored as state information in the coder, we reduce the number of possible states, and replace arithmetic operations by table lookups; the lookup tables can be precomputed. Here we review this reduced-precision binary arithmetic coder, which we call a quasi-arithmetic coder. It should be noted that the compression is still completely reversible; using reduced precision merely affects the average code length.

A. Development of binary quasi-arithmetic coding

We have seen that doing arithmetic coding with large integers instead of real or rational numbers hardly degrades compression performance at all. In Table 4 we show the encoding of the same file using small integers: the full interval [0, N) is only [0, 8).

Table 4. Example of arithmetic coding with incremental transmission, interval expansion, and small integer arithmetic. Full interval is [0, 8), so in effect subinterval endpoints are constrained to be multiples of 1/8.

    Action                                            Subintervals
    Start                                             [0, 8)
    p{a_1} = 2/3; subdivide with left prob. = 5/8     [0, 5); [5, 8)
    Input b_1, select right subinterval               [5, 8)
    Output 1, expand                                  [2, 8)
    Subdivide with left prob. p{a_2} = 1/2            [2, 5); [5, 8)
    Input a_2, select left subinterval                [2, 5)
    Increment follow count, expand                    [0, 6)
    p{a_3} = 3/5; subdivide with left prob. = 2/3     [0, 4); [4, 6)
    Input b_3, select right subinterval               [4, 6)
    Output 1, output 0 (follow bit), expand           [0, 4)
    Output 0, expand                                  [0, 8)

The number of possible states (after applying the interval expansion procedure) of an arithmetic coder using the integer interval [0, N) is 3N^2/16. If we can reduce the number of states to a more manageable level, we can precompute all state transitions and outputs and substitute table lookups for arithmetic in the coder. The obvious way to reduce the number of states is to reduce N. The value of N should be a power of 2; its value must be at least 4. If we choose N = 8 and apply the arithmetic coding algorithm in a straightforward way, we obtain a table with 12 states, each state offering a choice of between 3 and 7 probability ranges. Portions of the table are shown in Table 5.

Table 5. Excerpts from the quasi-arithmetic coding table, N = 8. Only the three states needed for the example are shown; there are nine more states. An "f" output indicates application of the follow-on procedure described in the text.

    Start   Probability      Left (a) input       Right (b) input
    state   of left input    Output  Next state   Output  Next state
    [0,8)   0.000 - 0.182    000     [0,8)        -       [1,8)
            0.182 - 0.310    00      [0,8)        -       [2,8)
            0.310 - 0.437    0       [0,6)        -       [3,8)
            0.437 - 0.563    0       [0,8)        1       [0,8)
            0.563 - 0.690    -       [0,5)        1       [2,8)
            0.690 - 0.818    -       [0,6)        11      [0,8)
            0.818 - 1.000    -       [0,7)        111     [0,8)
    [0,6)   0.000 - 0.244    000     [0,8)        -       [1,6)
            0.244 - 0.415    00      [0,8)        f       [0,8)
            0.415 - 0.585    0       [0,6)        f       [2,8)
            0.585 - 0.756    0       [0,8)        10      [0,8)
            0.756 - 1.000    -       [0,5)        101     [0,8)
    [2,8)   0.000 - 0.244    010     [0,8)        -       [3,8)
            0.244 - 0.415    01      [0,8)        1       [0,8)
            0.415 - 0.585    f       [0,6)        1       [2,8)
            0.585 - 0.756    f       [0,8)        11      [0,8)
            0.756 - 1.000    -       [2,7)        111     [0,8)
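The 12 states for N = 8 are easy to enumerate: a state is any integer subinterval of [0, N) that the expansion procedure of Section II-B cannot expand further. A small sketch of ours:

```python
def quasi_states(N):
    """Intervals [i, j) of [0, N) that survive interval expansion: not
    contained in [0, N/2), [N/2, N), or [N/4, 3N/4)."""
    half, quarter = N // 2, N // 4
    return [(i, j) for i in range(N) for j in range(i + 1, N + 1)
            if j > half and i < half                       # straddles the midpoint
            and not (quarter <= i and j <= 3 * quarter)]   # not inside the middle half

print(len(quasi_states(8)), quasi_states(8))
# 12 [(0, 5), (0, 6), (0, 7), (0, 8), (1, 5), (1, 6), (1, 7), (1, 8),
#     (2, 7), (2, 8), (3, 7), (3, 8)]
```

The count matches 3N^2/16 = 12, and the three states of Table 5 appear in the list.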

Table 6 shows how Table 5 is used to encode our sample file. Before coding each input event the coder is in a certain current state, corresponding to the current subinterval. For each state there are a number of probability ranges; we choose the one that includes the estimated probability for the next event. Then we simply select the input event that actually occurs and perform the indicated actions: outputting bits, incrementing the follow count, and changing to a new current state. In the example the encoded output file is 1100. Because we were using such low precision, the subdivision probabilities became distorted, leading to a lower probability for the file (1/16), but one which ends in the full interval [0, 8), requiring no disambiguating bits. We usually use a somewhat larger value of N; in practice the compression inefficiency of a binary quasi-arithmetic coder (neglecting very large and very small probabilities) is less than one percent for N = 32 and about 0.2 percent for N = 128.

Table 6. Example of operation of quasi-arithmetic coding

    Start in state [0, 8).
    Prob{a_1} = 2/3, so choose range 0.563 - 0.690 in [0, 8).
    First event is b_1, so choose the right (b) input.
    Output 1. Next state is [2, 8).
    Prob{a_2} = 1/2, so choose range 0.415 - 0.585 in [2, 8).
    Second event is a_2, so choose the left (a) input.
    f means increment the follow count. Next state is [0, 6).
    Prob{a_3} = 3/5, so choose range 0.585 - 0.756 in [0, 6).
    Third event is b_3, so choose the right (b) input.
    Indicated output is 10. Output 1; output 0 to account for the follow bit; output 0. Next state is [0, 8).

In implementations, the coding tables can be stripped down so that the numerical values of the interval endpoints and probability ranges do not appear. Full details and examples appear in [13]. The next section explains the computation of the probability ranges. Decoding uses essentially the same tables, and in fact is easier than encoding.

B. Analysis of binary quasi-arithmetic coding

We now prove that using binary quasi-arithmetic coding causes an insignificant increase in the code length compared with pure arithmetic coding. We mathematically analyze several cases.

In this analysis we exclude very large and very small probabilities, namely those that are less than 1/W or greater than (W - 1)/W, where W is the width of the current interval. For these probabilities the relative excess code length can be large.

First we assume that we know the probability p of the left branch of each event, and we show both how to minimize the average excess code length and how small the excess is. In arithmetic coding we divide the current interval of width W into subintervals of length L and R; this gives an effective coding probability q = L/W, since the resulting code length is -\log_2 q for the left branch and -\log_2(1 - q) for the right. When we encode a sequence of binary events with probabilities p and 1 - p using effective coding probabilities q and 1 - q, the average code length L(p, q) is given by

    L(p, q) = -p \log_2 q - (1 - p) \log_2 (1 - q).

If we use pure arithmetic coding, we can subdivide the interval into lengths pW and (1 - p)W, thus making q = p and giving an average code length equal to the entropy, H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p); this is optimal.

Consider two probabilities p_1 and p_2 that are adjacent based on the subdivision of an interval of width W; in other words, p_1 = W_1/W, p_2 = W_2/W, and W_2 = W_1 + 1. For any probability p between p_1 and p_2, either p_1 or p_2 should be chosen, whichever gives a shorter average code length. There is a cutoff probability p* for which p_1 and p_2 give the same average code length. We can compute p* by solving the equation L(p*, p_1) = L(p*, p_2), giving

    p^* = \frac{1}{1 + \log\!\left(\frac{p_2}{p_1}\right) / \log\!\left(\frac{1 - p_1}{1 - p_2}\right)}.    (1)

We can use Equation (1) to compute the probability ranges in the coding tables. As an example, we compute the cutoff probability used in deciding whether to subdivide interval [0, 6) as {[0, 3); [3, 6)} or {[0, 4); [4, 6)}; this is the number 0.585 that appears in Table 5. In this case p_1 = 1/2 and p_2 = 2/3. We compute p* = \log(3/2)/\log 2 ≈ 0.585. Hence when we encode the third event in the example with p{a_3} = 3/5, we use the {[0, 4); [4, 6)} subdivision.
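Equation (1), and the rational approximation to p* derived later in this subsection, are easy to check numerically. The following sketch of ours reproduces the constants quoted in the text:

```python
import math
from fractions import Fraction

def cutoff(p1, p2):
    """Equation (1): the probability at which subdividing at p1 and at
    p2 give the same average code length L(p, q)."""
    return 1 / (1 + math.log(p2 / p1) / math.log((1 - p1) / (1 - p2)))

def cutoff_approx(p1, p2, W):
    """Rational approximation to p*, from the asymptotic expansion in W,
    with p the midpoint of p1 and p2."""
    p = (p1 + p2) / 2
    return p + (p - Fraction(1, 2)) / (6 * W**2 * p * (1 - p))

# The [0, 6) example: p1 = 1/2, p2 = 2/3, W = 6.
print(cutoff(1 / 2, 2 / 3))                              # 0.5849625...
print(cutoff_approx(Fraction(1, 2), Fraction(2, 3), 6))  # 737/1260 ~ 0.58492
```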

Probability p* is the probability between p_1 and p_2 with the worst average quasi-arithmetic coding performance, both in excess bits per event and in excess bits relative to optimal compression. This can be shown by monotonicity arguments.

Theorem 1: If we construct a quasi-arithmetic coder based on full interval [0, N), and use correct probability estimates, the number of bits per input event by which the average code length obtained by the quasi-arithmetic coder exceeds that of an exact arithmetic coder is at most

    \frac{4}{\ln 2} \log_2\!\left(\frac{2}{e \ln 2}\right) \frac{1}{N} + O\!\left(\frac{1}{N^2}\right) \approx \frac{0.497}{N} + O\!\left(\frac{1}{N^2}\right),

and the fraction by which the average code length obtained by the quasi-arithmetic coder exceeds that of an exact arithmetic coder is at most

    \log_2\!\left(\frac{2}{e \ln 2}\right) \frac{1}{\log_2 N} + O\!\left(\frac{1}{(\log_2 N)^2}\right) \approx \frac{0.0861}{\log_2 N} + O\!\left(\frac{1}{(\log_2 N)^2}\right).

Proof: For a quasi-arithmetic coder with full interval [0, N), the shortest terminal state intervals have size W = N/4 + 2. The worst average error occurs for the smallest W and the most extreme probabilities, that is, for W = N/4 + 2, p_1 = (W - 2)/W, and p_2 = (W - 1)/W (or symmetrically, p_1 = 1/W and p_2 = 2/W). For these probabilities, we find the cutoff probability p*. Then for the first part of the theorem we take the asymptotic expansion of L(p*, p_2) - H(p_2), and for the second part we take the asymptotic expansion of (L(p*, p_2) - H(p_2))/H(p_2). □

If we let p = (p_1 + p_2)/2, we can expand Equation (1) asymptotically in W and obtain an excellent rational approximation \tilde{p}* for p*:

    \tilde{p}^* = p + \frac{p - 1/2}{6 W^2 p (1 - p)}.

The compression loss introduced by using \tilde{p}* instead of p* is completely negligible, never more than 0.06 percent. In the example above with p_1 = 1/2, p_2 = 2/3, and W = 6, we find that p* = \log(3/2)/\log 2 ≈ 0.58496 and \tilde{p}* = 737/1260 ≈ 0.58492. As we expect, only for small values of W (the short intervals that occur when using very low precision arithmetic) do we need to be careful about roundoff when subdividing intervals; for larger values of W we can practically round the interval endpoints to the nearest integer.

 1. For any probability p between p and p , either

1 1 2

We can prove a similar theorem for a more general case, in

p or p should b e chosen, whichever gives a shorter average

1 2

whichwe compare quasi-arithmeti c co ding with arithmetic



co de length. There is a cuto probability p for which p and

1

co ding for a single worst-case event. We assume that b oth



p give the same average co de length. We can compute p

2

co ders use the same estimated probability, but that the esti-

 

by solving the equation Lp ;p = Lp ;p , giving

1 2

mate may b e arbitrarily bad. The pro of is omitted to save

space.

1



p = 1

p

2

Theorem 2 If we construct a quasi-arithmetic coder based

log

p

1

on ful l interval [0;N, and use arbitrary probability esti-

1+

1 p

1

mates, the number of bits per input event by which the code

log

1 p

2

length obtained by the quasi-arithmeticcoder exceeds that of

an exact arithmetic coder in the worst case is at most

We can use Equation 1 to compute the probability ranges

4 5:771 N +8

in the co ding tables. As an example, we compute the cuto

  : log

2

N +4 N ln 2 N

probability used in deciding whether to sub divide interval

[0; 6 as f[0; 3; [3; 6g or f[0; 4; [4; 6g; this is the number 5

IV. Parallel coding

In [14] we present general-purpose algorithms for encoding and decoding using both Huffman and quasi-arithmetic coding. We are considering the case where we wish to do both encoding and decoding in parallel, so the location of bits in the encoded file must be computable by the decoding processors before they begin decoding. The main requirement for our parallel coder is that the model for each event is known ahead of time; it is not necessary for each event to have the same model. We use the PRAM model of parallel computation; the number of processors is p. We allow concurrent reading of the parallel memory, and limited concurrent writing, to a single one-bit register, which we call the completion register. It is initialized to 0 at the beginning of each time step; during the time step any number of processors (or none at all) may write 1 to it. We assume that n events are to be coded. The main issue is in dealing with the differences in code lengths of different events. For simplicity we do not consider input data routing.

We describe the algorithm for the Huffman coding case, since prefix codes are easier to handle, and then extend it to quasi-arithmetic coding. First the n events are distributed as evenly as possible among the p processors. Then each processor begins encoding its assigned events in a deterministic sequence, outputting one bit in each time step; the bits output in each time step are contiguous in the output stream. As soon as one or more processors complete all of their assigned events, they indicate this fact by writing 1 to the completion register. When the completion register is 1, the output process is interrupted. Events whose processing has not begun are distributed evenly among the processors, using a prefix operation; events whose processing has begun but not finished remain assigned to their current processors. Processing of the reallocated events then continues until the next time that one or more processors finishes all their allocated events. Toward the end of processing, the number of remaining events will become less than the number of available processors. At this time we deactivate processors, making the number of active processors equal to the number of remaining events; we must redirect the output bits from the remaining processors. No further event reallocations are needed, but we still have to deactivate processors whenever any of them finish.

We analyze the time required by the parallel algorithm. Assuming that in one time unit each processor can output one bit, we can easily show that the time required for bit output is between \lceil nH/p \rceil and \lceil nH/p \rceil + L. The more interesting analysis concerns the number of prefix operations required. We define a phase to be the period between prefix operations. Early phases are those which take place while the number of events is greater than the number of processors; each is followed by an event reallocation. Late phases are those needed to code the final p or fewer events; each late phase requires a prefix operation to redirect output bits. We bound the number of prefix operations needed in the following theorem.

Theorem 3: When coding n events using p processors, the number of prefix operations needed by the reallocation coding algorithm is at most L \log_2(2n/p) in the worst case.

Proof: We define a superphase to be the time needed to halve the number of remaining events. Consider the first superphase. The number of events assigned to each processor ranges from n/p in the first phase down to n/2p in the last. At least one processor completes all its events in each phase; such a processor must output at least one bit per event, since all code words in a Huffman code have at least one bit. This processor (and hence all processors) thus outputs at least n/2p bits in each phase, making a total of at least n/2 bits output in each phase. The total number of bits that must be output in the first superphase is at most nL/2, so the number of phases in the first superphase is at most L. The same reasoning holds for all superphases. The number of superphases needed to reduce the number of remaining events from n to p is \log_2(n/p), so the number of phases needed is just L \log_2(n/p). Once p or fewer events remain, at most L late phases are required, so the total number of phases needed is at most L \log_2(2n/p). □

The extension of the parallel Huffman algorithm to quasi-arithmetic coding is fairly straightforward. The only complication arises when the last event of a processor's allocation leaves the processor in a state other than the starting state. We deal with this by outputting the smallest possible number of bits (0, 1, or 2) needed to identify a subinterval that lies within the final interval; this is the end-of-file disambiguation problem that we have seen in Section II.

V. Modeling

The goal of modeling for statistical data compression is to provide probability information to the coder. The modeling process consists of structural and probability estimation components; each may be adaptive, semi-adaptive, or static. In addition there are two strategies for probability estimation. The first is to estimate each event's probability individually based on its frequency within the data set. The other strategy is to estimate the probabilities collectively, assuming a probability distribution of a particular form and estimating the parameters of the distribution, either directly or indirectly. For direct estimation, we simply compute an estimate of the parameter (the variance, for instance) from the data. For indirect estimation [15], we start with a small number of possible distributions, and compute the code length that we would obtain with each; then we select the one with the smallest code length. This method is very general and can be used even for distributions from different families, without common parameters.
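Indirect estimation reduces to a one-line search once code length is defined. A small sketch of ours, with two hypothetical candidate distributions:

```python
import math

def code_length(data, dist):
    """Bits needed to code `data` with event probabilities dist(x); an
    arithmetic coder achieves this total to within about 2 bits."""
    return sum(-math.log2(dist(x)) for x in data)

def indirect_estimate(data, candidates):
    """Select the candidate distribution giving the shortest code."""
    return min(candidates, key=lambda d: code_length(data, d))

skewed  = lambda x: 0.9 if x == 0 else 0.1   # hypothetical candidates
uniform = lambda x: 0.5
data = [0, 0, 0, 1, 0, 0, 0, 0]
print(indirect_estimate(data, [skewed, uniform]) is skewed)   # True
```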

For lossless image compression, the structural component is usually fixed, since for most images pixel intensities are close in value to the intensities of nearby pixels. We code pixels in a predetermined sequence, predicting each pixel's intensity using a fixed linear combination of a fixed constellation of nearby pixels, then coding the prediction error. Typically the prediction errors have a symmetrical exponential-like distribution with zero mean, so the probability estimation component consists of estimating the variance and possibly some other parameters of the distribution, either directly or indirectly. A collection of only a few dozen distributions is sufficient to code most images with minimal compression loss.

For text, on the other hand, the best methods involve constantly changing structures. In text compression the events to be coded are just the letters in the file. Typically we begin by assuming only that some unspecified sequences of letters will occur frequently. We may also specify a maximum length for the frequently occurring sequences. As encoding proceeds we determine which sequences are most frequent. In the Ziv-Lempel dictionary-based methods [16, 17], the sequences are placed directly in the dictionary. The advantage of Ziv-Lempel methods is their speed, obtained by coding directly from the dictionary data structure, bypassing the explicit probability estimation and statistical coding stages. The PPM method [18] obtains better compression by constructing a Markov model of moderate order, proceeding to higher orders as more of the file is encoded. Complicated data structures are used to build and update the context information.

Most models for text compression involve estimating the letter probabilities individually, since there is no obvious mathematical relationship between the probabilities of different letters. Numerical proximity of ASCII representations does not imply anything about probabilities. We usually estimate the probability p of a given letter by

    p = (weight of letter) / (total weight of all letters).

The weight of a letter is usually based on the number of occurrences of the letter in a particular conditioning context.

Since we are often dealing with contexts that have occurred only a few times, we have to deal with letters that have never occurred. We cannot give a weight of 0 to such letters because that would lead to probability estimates of 0, which arithmetic coding cannot handle. This is the zero-frequency problem, thoroughly investigated by Witten and Bell [19]. It is possible to assign an initial weight of 1 to all possible letters, but a better strategy is to assign initial 0 weights and to include a special escape event indicating "some letter that has never occurred before in this context"; this event has its own weight, and must be encoded like an ordinary letter whenever a new letter occurs. There are a number of methods of dealing with the zero-frequency problem, differing mainly in the computation of the escape probability.

A second issue that arises in text compression is locality of reference: strings tend to occur in clusters within a text file. One way to take advantage of locality is to scale the counts periodically, typically by halving all weights when the total weight reaches a specified number. This effectively gives more weight to more recent occurrences of each letter, and has the additional benefit of keeping the counts small enough to fit into small registers. Analysis [4] shows that scaling often helps compression, and can never hurt it by very much.
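A minimal per-context model combining the escape mechanism and periodic count halving might look like the following sketch of ours; the constant escape weight of 1 is a placeholder, not one of the tuned escape-probability methods surveyed in [19]:

```python
class ContextModel:
    """Letter weights for one conditioning context, with an escape event
    for never-seen letters and periodic halving of the counts."""

    def __init__(self, max_total=256):
        self.weights = {}        # letters seen so far in this context
        self.escape = 1          # weight of the escape event (placeholder choice)
        self.max_total = max_total

    def prob(self, letter):
        total = sum(self.weights.values()) + self.escape
        if letter in self.weights:
            return self.weights[letter] / total
        return self.escape / total   # coded as an escape; letter identity follows

    def update(self, letter):
        self.weights[letter] = self.weights.get(letter, 0) + 1
        if sum(self.weights.values()) + self.escape > self.max_total:
            # Halve all counts (locality of reference): recent occurrences
            # now dominate, and counts stay small enough for small registers.
            self.weights = {c: max(1, w // 2) for c, w in self.weights.items()}
```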

alternatives to arithmetic co ding are Hu man co ding and

One technique for probabili ty estimation when only two

co ding based on splay trees [25]. Mo at et al. [26] compare

events are p ossible is to use small scaled counts, considering

arithmetic co ding with these metho ds. Hu man co ding is

each count pair to b e a probability state. We can then pre-

very e ective in conjunction with a semi-adaptive mo del;

compute b oth the corresp onding probabili ties and the state

the probability information can b e built into the co ding ta-

transitions caused by the o ccurrence of additional events.

bles, leading to fast execution. Splay trees are even faster,

The states can b e identi ed by index numb ers, whichin

since the data structure for co ding is the \statistical" mo del;

turn can b e used to index into the probability range lists

compression eciency su ers somewhat. Golomb and Rice

in quasi-arithmetic co de tables like those in Table 5.

co ding can b e used when manyevents are p ossible, main-

taining lists of the events in approximate probability order

VI. Conclusion

and co ding the p ositions of the events within the lists. This
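To illustrate how simple these codes are (a sketch of ours; the adaptive part is choosing the parameter m), a Golomb encoder sends the quotient of the run length by m in unary and the remainder in truncated binary, reducing to Rice coding when m is a power of two:

```python
def golomb(r, m):
    """Golomb codeword for run length r >= 0 with parameter m >= 2."""
    q, rem = divmod(r, m)
    b = (m - 1).bit_length()               # ceil(log2 m)
    t = (1 << b) - m                       # number of shorter codewords
    tail = (format(rem, '0%db' % (b - 1)) if rem < t
            else format(rem + t, '0%db' % b))
    return '1' * q + '0' + tail            # unary quotient, then remainder

print([golomb(r, 4) for r in range(6)])
# ['000', '001', '010', '011', '1000', '1001']  (m = 4: Rice coding)
```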

When more than two input events are possible, the main alternatives to arithmetic coding are Huffman coding and coding based on splay trees [25]. Moffat et al. [26] compare arithmetic coding with these methods. Huffman coding is very effective in conjunction with a semi-adaptive model; the probability information can be built into the coding tables, leading to fast execution. Splay trees are even faster, since the data structure for coding is the "statistical" model; compression efficiency suffers somewhat. Golomb and Rice coding can be used when many events are possible, maintaining lists of the events in approximate probability order and coding the positions of the events within the lists. This is an especially useful technique for very large alphabets [27].

The main usefulness of arithmetic coding is in obtaining maximum compression in conjunction with an adaptive model, or when the probability of one event is much larger than 1/2. Arithmetic coding gives optimal compression, but its slow execution can be problematical. Approximate versions of arithmetic coding give almost optimal compression at improved speeds. Probabilities can be estimated approximately too, again leading to only slight degradation of compression performance.

Acknowledgments

The authors would like to acknowledge helpful suggestions made by Marty Cohn.

References

[1] C. E. Shannon, "A Mathematical Theory of Communication," Bell Syst. Tech. J., 27, pp. 398-403, July 1948.
[2] D. A. Huffman, "A Method for the Construction of Minimum Redundancy Codes," Proceedings of the Institute of Radio Engineers, 40, pp. 1098-1101, 1952.
[3] T. C. Bell, J. G. Cleary and I. H. Witten, Text Compression. Englewood Cliffs, NJ: Prentice-Hall, 1990.
[4] P. G. Howard and J. S. Vitter, "Analysis of Arithmetic Coding for Data Compression," Information Processing and Management, 28, no. 6, pp. 749-763, 1992.
[5] I. H. Witten, R. M. Neal and J. G. Cleary, "Arithmetic Coding for Data Compression," Comm. ACM, 30, no. 6, pp. 520-540, June 1987.
[6] R. Pasco, "Source Coding Algorithms for Fast Data Compression," Ph.D. thesis, Stanford Univ., 1976.
[7] J. J. Rissanen, "Generalized Kraft Inequality and Arithmetic Coding," IBM J. Res. Develop., 20, no. 3, pp. 198-203, May 1976.
[8] F. Rubin, "Arithmetic Stream Coding Using Fixed Precision Registers," IEEE Trans. Inform. Theory, IT-25, no. 6, pp. 672-675, Nov. 1979.
[9] J. J. Rissanen and G. G. Langdon, "Arithmetic Coding," IBM J. Res. Develop., 23, no. 2, pp. 146-162, Mar. 1979.
[10] M. Guazzo, "A General Minimum-Redundancy Source-Coding Algorithm," IEEE Trans. Inform. Theory, IT-26, no. 1, pp. 15-25, Jan. 1980.
[11] W. B. Pennebaker, J. L. Mitchell, G. G. Langdon and R. B. Arps, "An Overview of the Basic Principles of the Q-Coder Adaptive Binary Arithmetic Coder," IBM J. Res. Develop., 32, no. 6, pp. 717-726, Nov. 1988.
[12] P. G. Howard and J. S. Vitter, "Practical Implementations of Arithmetic Coding," in Image and Text Compression, J. A. Storer, Ed. Norwell, MA: Kluwer Academic Publishers, pp. 85-112, 1992.
[13] P. G. Howard and J. S. Vitter, "Design and Analysis of Fast Text Compression Based on Quasi-Arithmetic Coding," in Proc. Data Compression Conference, J. A. Storer and M. Cohn, Eds. Snowbird, Utah: pp. 98-107, Mar. 30-Apr. 1, 1993.
[14] P. G. Howard and J. S. Vitter, "Parallel Lossless Image Compression Using Huffman and Arithmetic Coding," in Proc. Data Compression Conference, J. A. Storer and M. Cohn, Eds. Snowbird, Utah: pp. 299-308, Mar. 24-26, 1992.
[15] P. G. Howard and J. S. Vitter, "Fast and Efficient Lossless Image Compression," in Proc. Data Compression Conference, J. A. Storer and M. Cohn, Eds. Snowbird, Utah: pp. 351-360, Mar. 30-Apr. 1, 1993.
[16] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Trans. Inform. Theory, IT-23, no. 3, pp. 337-343, May 1977.
[17] J. Ziv and A. Lempel, "Compression of Individual Sequences via Variable Rate Coding," IEEE Trans. Inform. Theory, IT-24, no. 5, pp. 530-536, Sept. 1978.
[18] J. G. Cleary and I. H. Witten, "Data Compression Using Adaptive Coding and Partial String Matching," IEEE Trans. Comm., COM-32, no. 4, pp. 396-402, Apr. 1984.
[19] I. H. Witten and T. C. Bell, "The Zero Frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression," IEEE Trans. Inform. Theory, IT-37, no. 4, pp. 1085-1094, July 1991.
[20] T. C. Hu and A. C. Tucker, "Optimal Computer-Search Trees and Variable-Length Alphabetic Codes," SIAM J. Appl. Math., 21, no. 4, pp. 514-532, 1971.
[21] R. M. Neal, "Fast Arithmetic Coding Using Low-Precision Division," unpublished manuscript, 1987.
[22] S. W. Golomb, "Run-Length Encodings," IEEE Trans. Inform. Theory, IT-12, no. 4, pp. 399-401, July 1966.
[23] R. F. Rice, "Some Practical Universal Noiseless Coding Techniques," Jet Propulsion Laboratory, Pasadena, California, JPL Publication 79-22, Mar. 1979.
[24] P. Elias, "Universal Codeword Sets and Representations of Integers," IEEE Trans. Inform. Theory, IT-21, no. 2, pp. 194-203, Mar. 1975.
[25] D. W. Jones, "Application of Splay Trees to Data Compression," Comm. ACM, 31, no. 8, pp. 996-1007, Aug. 1988.
[26] A. M. Moffat, N. Sharman, I. H. Witten and T. C. Bell, "An Empirical Evaluation of Coding Methods for Multi-Symbol Alphabets," in Proc. Data Compression Conference, J. A. Storer and M. Cohn, Eds. Snowbird, Utah: pp. 108-117, Mar. 30-Apr. 1, 1993.
[27] A. M. Moffat, "Economical Inversion of Large Text Files," presented at Computing Systems, 1992.

Paul G. Howard is a Member of Technical Staff in the Visual Communications Research Department of AT&T Bell Laboratories. He received the B.S. degree in computer science from M.I.T. in 1977 and the M.S. and Ph.D. degrees in computer science from Brown University in 1989 and 1993 respectively. His research interests are in data compression, including coding, image modeling, and text modeling. He is a member of Sigma Xi and an associate member of the Center of Excellence in Space Data and Information Sciences. He was briefly a Research Associate at Duke University before joining AT&T in 1993.

Jeffrey Scott Vitter (Fellow, IEEE) is the Gilbert, Louis, and Edward Lehrman Professor of Computer Science and the Chair of the Department of Computer Science at Duke University. Previously he was Professor at Brown University, where he had been on the faculty since 1980. He received the B.S. degree in mathematics with highest honors from the University of Notre Dame in 1977 and the Ph.D. degree in computer science from Stanford University in 1980. He is a Guggenheim Fellow, an NSF Presidential Young Investigator, and an IBM Faculty Development Awardee. He is coauthor of the book Design and Analysis of Coalesced Hashing (Oxford University Press, 1987) and is coholder of a patent in the area of external sorting. He has written numerous articles and has consulted often in the areas of combinatorial algorithms, I/O efficiency, parallel and incremental computation, computational geometry, and machine learning. He serves on several editorial boards and is a frequent editor of special issues and member of program committees. He has served ACM SIGACT as Member-at-Large from 1987 to 1991 and as Vice Chair since 1991. He is currently an associate member of the Center of Excellence in Space Data and Information Sciences.