Arithmetic Coding for Data Compression
PAUL G. HOWARD and JEFFREY SCOTT VITTER, Fellow, IEEE
Arithmetic coding provides an effective mechanism for removing redundancy in the encoding of data. We show how arithmetic coding works and describe an efficient implementation that uses table lookup as a fast alternative to arithmetic operations. The reduced-precision arithmetic has a provably negligible effect on the amount of compression achieved. We can speed up the implementation further by use of parallel processing. We discuss the role of probability models and how they provide probability information to the arithmetic coder. We conclude with perspectives on the comparative advantages and disadvantages of arithmetic coding.

Index terms: Data compression, arithmetic coding, lossless compression, text modeling, image compression, text compression, adaptive, semi-adaptive.

Manuscript received Mmm DD, 1993. Some of this work was performed while both authors were at Brown University and while the first author was at Duke University. Support was provided in part by NASA Graduate Student Researchers Program grant NGT-50420, by a National Science Foundation Presidential Young Investigator Award with matching funds from IBM, and by Air Force Office of Scientific Research grant number F49620-92-J-0515. Additional support was provided by Universities Space Research Association/CESDIS associate memberships. IEEE Log Number 0000000.

P. G. Howard is with AT&T Bell Laboratories, Visual Communications Research, Room 4C-516, 101 Crawfords Corner Road, Holmdel, NJ 07733-3030.

J. S. Vitter is with the Department of Computer Science, Duke University, Box 90129, Durham, NC 27708-0129.

I. Arithmetic coding

The fundamental problem of lossless compression is to decompose a data set (for example, a text file or an image) into a sequence of events, then to encode the events using as few bits as possible. The idea is to assign short codewords to more probable events and longer codewords to less probable events. Data can be compressed whenever some events are more likely than others. Statistical coding techniques use estimates of the probabilities of the events to assign the codewords. Given a set of mutually distinct events e1, e2, e3, ..., en, and an accurate assessment of the probability distribution P of the events, Shannon [1] proved that the smallest possible expected number of bits needed to encode an event is the entropy of P, denoted by

    H(P) = - sum_{k=1}^{n} p{e_k} log2 p{e_k},

where p{e_k} is the probability that event e_k occurs. An optimal code outputs -log2 p bits to encode an event whose probability of occurrence is p. Pure arithmetic codes supplied with accurate probabilities provide optimal compression. The older and better-known Huffman codes [2] are optimal only among instantaneous codes, that is, those in which the encoding of one event can be decoded before encoding has begun for the next event.

In theory, arithmetic codes assign one "codeword" to each possible data set. The codewords consist of half-open subintervals of the half-open unit interval [0, 1), and are expressed by specifying enough bits to distinguish the subinterval corresponding to the actual data set from all other possible subintervals. Shorter codes correspond to larger subintervals and thus more probable input data sets. In practice, the subinterval is refined incrementally using the probabilities of the individual events, with bits being output as soon as they are known. Arithmetic codes almost always give better compression than prefix codes, but they lack the direct correspondence between the events in the input data set and bits or groups of bits in the coded output file.

A statistical coder must work in conjunction with a modeler that estimates the probability of each possible event at each point in the coding. The probability model need not describe the process that generates the data; it merely has to provide a probability distribution for the data items. The probabilities do not even have to be particularly accurate, but the more accurate they are, the better the compression will be. If the probabilities are wildly inaccurate, the file may even be expanded rather than compressed, but the original data can still be recovered. To obtain maximum compression of a file, we need both a good probability model and an efficient way of representing (or learning) the probability model.

To ensure decodability, the encoder is limited to the use of model information that is available to the decoder. There are no other restrictions on the model; in particular, it can change as the file is being encoded. The models can be adaptive (dynamically estimating the probability of each event based on all events that precede it), semi-adaptive (using a preliminary pass of the input file to gather statistics), or non-adaptive (using fixed probabilities for all files). Non-adaptive models can perform arbitrarily poorly [3]. Adaptive codes allow one-pass coding but require a more complicated data structure. Semi-adaptive codes require two passes and transmission of model data as side information; if the model data is transmitted efficiently they can provide slightly better compression than adaptive codes, but in general the cost of transmitting the model is about the same as the "learning" cost in the adaptive case [4].

To get good compression we need models that go beyond global event counts and take into account the structure of the data. For images this usually means using the numeric intensity values of nearby pixels to predict the intensity of each new pixel, and using a suitable probability distribution for the residual error to allow for noise and variation between regions within the image. For text, the previous letters form a context, in the manner of a Markov process.

In Section II, we provide a detailed description of pure arithmetic coding, along with an example to illustrate the process. We also show enhancements that allow incremental transmission and fixed-precision arithmetic. In Section III we extend the fixed-precision idea to low precision, and show how we can speed up arithmetic coding with little degradation of compression performance by doing all the arithmetic ahead of time and storing the results in lookup tables. We call the resulting procedure quasi-arithmetic coding. In Section IV we briefly explore the possibility of parallel coding using quasi-arithmetic coding. In Section V we discuss the modeling process, separating it into structural and probability estimation components. Each component can be adaptive, semi-adaptive, or static; there are two approaches to the probability estimation problem. Finally, Section VI provides a discussion of the advantages and disadvantages of arithmetic coding and suggestions of alternative methods.
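The entropy formula above is easy to check numerically; the following minimal sketch (the function name is ours) computes H(P) for a given distribution:

```python
import math

def entropy(probs):
    """H(P) = -sum of p_k * log2(p_k): Shannon's lower bound on the
    expected number of bits needed to encode one event."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An optimal code spends -log2(p) bits on an event of probability p,
# so skewed distributions compress better than near-uniform ones.
print(entropy([0.5, 0.5]))   # 1.0: no compression possible
print(entropy([0.9, 0.1]))   # ~0.469 bits per event
```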
II. How arithmetic coding works

In this section we explain how arithmetic coding works and give operational details; our treatment is based on that of Witten, Neal, and Cleary [5]. Our focus is on encoding, but the decoding process is similar.

A. Basic algorithm for arithmetic coding

The algorithm for encoding a file using arithmetic coding works conceptually as follows:

1. We begin with a "current interval" [L, H) initialized to [0, 1).
2. For each event in the file, we perform two steps.
   (a) We subdivide the current interval into subintervals, one for each possible event. The size of an event's subinterval is proportional to the estimated probability that the event will be the next event in the file, according to the model of the input.
   (b) We select the subinterval corresponding to the event that actually occurs next, and make it the new current interval.
3. We output enough bits to distinguish the final current interval from all other possible final intervals.

The length of the final subinterval is clearly equal to the product of the probabilities of the individual events, which is the probability p of the particular sequence of events in the file. The final step uses at most floor(-log2 p) + 2 bits to distinguish the file from all other possible files. We need some mechanism to indicate the end of the file, either a special end-of-file event coded just once, or some external indication of the file's length. Either method adds only a small amount to the code length.

In step 2, we need to compute only the subinterval corresponding to the event a_i that actually occurs. To do this it is convenient to use two "cumulative" probabilities: the cumulative probability P_C = sum_{k=1}^{i-1} p_k and the next-cumulative probability P_N = P_C + p_i = sum_{k=1}^{i} p_k. The new subinterval is [L + P_C(H - L), L + P_N(H - L)). The need to maintain and supply cumulative probabilities requires the model to have a complicated data structure, especially when many more than two events are possible.

We now provide an example, repeated a number of times to illustrate different steps in the development of arithmetic coding. For simplicity we choose between just two events at each step, although the principles apply to the multi-event case as well. We assume that we know a priori that we have a file consisting of three events (or three letters, in the case of text compression); the first event is either a1 with probability p{a1} = 2/3 or b1 with probability p{b1} = 1/3; the second event is a2 with probability p{a2} = 1/2 or b2 with probability p{b2} = 1/2; and the third event is a3 with probability p{a3} = 3/5 or b3 with probability p{b3} = 2/5. The actual file to be encoded is the sequence b1 a2 b3.

The steps involved in pure arithmetic coding are illustrated in Table 1 and Fig. 1. In this example the final interval corresponding to the actual file b1 a2 b3 is [23/30, 5/6). The length of the interval is 1/15, which is the probability of b1 a2 b3, computed by multiplying the probabilities of the three events: p{b1} p{a2} p{b3} = (1/3)(1/2)(2/5) = 1/15. In binary, the final interval is [0.110001..., 0.110101...). Since all binary numbers that begin with 0.11001 are entirely within this interval, outputting 11001 suffices to uniquely identify the interval.

Table 1. Example of pure arithmetic coding

    Action                                   Subintervals
    Start                                    [0, 1)
    Subdivide with left prob. p{a1} = 2/3    [0, 2/3); [2/3, 1)
    Input b1, select right subinterval       [2/3, 1)
    Subdivide with left prob. p{a2} = 1/2    [2/3, 5/6); [5/6, 1)
    Input a2, select left subinterval        [2/3, 5/6)
    Subdivide with left prob. p{a3} = 3/5    [2/3, 23/30); [23/30, 5/6)
    Input b3, select right subinterval       [23/30, 5/6) = [0.110001..., 0.110101...) in binary
    Output 11001                             0.11001 is the shortest binary fraction that lies within [23/30, 5/6)

[Fig. 1. Pure arithmetic coding graphically illustrated.]

B. Incremental output

The basic implementation of arithmetic coding described above has two major difficulties: the shrinking current interval requires the use of high-precision arithmetic, and no output is produced until the entire file has been read. The most straightforward solution to both of these problems is to output each leading bit as soon as it is known, and then to double the length of the current interval so that it reflects only the unknown part of the final interval. Witten, Neal, and Cleary [5] add a clever mechanism for preventing the current interval from shrinking too much when the endpoints are close to 1/2 but straddle 1/2. In that case we do not yet know the next output bit, but we do know that whatever it is, the following bit will have the opposite value; we merely keep track of that fact, and expand the current interval symmetrically about 1/2. This follow-on procedure may be repeated any number of times, so the current interval size is always strictly longer than 1/4.

Mechanisms for incremental transmission and fixed-precision arithmetic have been developed through the years by Pasco [6], Rissanen [7], Rubin [8], Rissanen and Langdon [9], Guazzo [10], and Witten, Neal, and Cleary [5]. The bit-stuffing idea of Langdon and others at IBM, which limits the propagation of carries in the additions, serves a function similar to that of the follow-on procedure described above.
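The basic algorithm above, applied to the example file b1 a2 b3, can be reproduced with a short sketch in exact rational arithmetic (the function names are ours; a practical coder uses the integer methods described later in this section):

```python
from fractions import Fraction as F

def encode(events, left_probs):
    """Pure arithmetic coding for binary events: narrow the current
    interval [low, high) once per event ('a' takes the left
    subinterval, 'b' the right)."""
    low, high = F(0), F(1)
    for ev, p in zip(events, left_probs):
        mid = low + p * (high - low)          # split point for this event
        low, high = (low, mid) if ev == 'a' else (mid, high)
    return low, high

def shortest_code(low, high):
    """Fewest bits b1..bk such that every number beginning 0.b1..bk
    lies entirely within [low, high)."""
    k = 1
    while True:
        m = -((-low * 2**k) // 1)             # ceil(low * 2^k)
        if m + 1 <= high * 2**k:              # [m/2^k, (m+1)/2^k) fits
            return format(int(m), '0%db' % k)
        k += 1

# The example of Table 1: file b1 a2 b3 with p{a1}=2/3, p{a2}=1/2, p{a3}=3/5.
low, high = encode('bab', [F(2, 3), F(1, 2), F(3, 5)])
print(low, high)                 # 23/30 5/6
print(shortest_code(low, high))  # 11001
```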
We now describe in detail how the incremental output and interval expansion work. We add the following step immediately after the selection of the subinterval corresponding to an input event (step 2(b) in the basic algorithm above).

2. (c) We repeatedly execute the following steps in sequence until the loop is explicitly halted:
   1. If the new subinterval is not entirely within one of the intervals [0, 1/2), [1/4, 3/4), or [1/2, 1), we exit the loop and return.
   2. If the new subinterval lies entirely within [0, 1/2), we output 0 and any following 1s left over from previous events; then we double the size of the subinterval by linearly expanding [0, 1/2) to [0, 1).
   3. If the new subinterval lies entirely within [1/2, 1), we output 1 and any following 0s left over from previous events; then we double the size of the subinterval by linearly expanding [1/2, 1) to [0, 1).
   4. If the new subinterval lies entirely within [1/4, 3/4), we keep track of this fact for future output by incrementing the follow count; then we double the size of the subinterval by linearly expanding [1/4, 3/4) to [0, 1).

Table 2. Example of pure arithmetic coding with incremental transmission and interval expansion

    Action                                   Subintervals
    Start                                    [0, 1)
    Subdivide with left prob. p{a1} = 2/3    [0, 2/3); [2/3, 1)
    Input b1, select right subinterval       [2/3, 1)
    Output 1, expand                         [1/3, 1)
    Subdivide with left prob. p{a2} = 1/2    [1/3, 2/3); [2/3, 1)
    Input a2, select left subinterval        [1/3, 2/3)
    Increment follow count, expand           [1/6, 5/6)
    Subdivide with left prob. p{a3} = 3/5    [1/6, 17/30); [17/30, 5/6)
    Input b3, select right subinterval       [17/30, 5/6)
    Output 1, output 0 follow bit, expand    [2/15, 2/3)
    Output 01                                [1/4, 1/2) is entirely within [2/15, 2/3)

[Fig. 2. Pure arithmetic coding with incremental transmission and interval expansion, graphically illustrated.]

Table 2 and Fig. 2 illustrate this process. In the example, interval expansion occurs exactly once for each input event, but in other cases it may occur more than once or not at all. The follow-on procedure is applied when processing the second input event a2. The 1 output after processing the third event b3 is therefore followed by its complement 0. The final interval is [2/15, 2/3). Since all binary numbers that start with 0.01 are within this range, outputting 01 suffices to uniquely identify the range. The encoded file is 11001, as before. This is no coincidence: the computations are essentially the same. The final interval is eight times as long as in the previous example because of the three doublings of the current interval.

Clearly the current interval contains some information about the preceding inputs; this information has not yet been output, so we can think of it as the coder's state. If a is the length of the current interval, the state holds -log2 a bits not yet output. In the basic method the state contains all the information about the output, since nothing is output until the end. In the incremental implementation, the state always contains fewer than two bits of output information, since the length of the current interval is always more than 1/4. The final state in the incremental example is [2/15, 2/3), which contains -log2(8/15), about 0.907, bits of information; the final two output bits are needed to unambiguously transmit this information.

C. Use of integer arithmetic

In practice, the arithmetic can be done by storing the endpoints of the current interval as sufficiently large integers rather than in floating point or exact rational numbers. We also use integers for the frequency counts used to estimate event probabilities. The subdivision process involves selecting non-overlapping intervals (of length at least 1) with lengths approximately proportional to the counts. Table 3 illustrates the use of integer arithmetic, using a full interval of [0, N) = [0, 1024). (The graphical version of Table 3 is essentially the same as Fig. 2 and is not included.) The length of the current interval is always at least N/4 + 2, 258 in this case, so we can always use probabilities precise to at least 1/258; often the precision will be near 1/1024. In practice we use even larger integers; the interval [0, 65536) is a common choice, and gives a practically continuous choice of probabilities at each step. The subdivisions in this example are not quite the same as those in Table 2 because the resulting intervals are rounded to integers. The encoded file is 11001 as before, but for a longer input file the encodings would eventually diverge.
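The expansion loop of step 2(c), the follow-on procedure, and the end-of-file step can be sketched together as follows (exact rationals are used for clarity, and the function name is ours). On the file b1 a2 b3 it reproduces the trace of Table 2, emitting 11001:

```python
from fractions import Fraction as F

def encode_incremental(events, left_probs):
    """Arithmetic coding with incremental output: emit each leading bit
    as soon as it is known and double [low, high) so that it covers only
    the still-undetermined part; 'follow' counts deferred opposite bits."""
    low, high, follow, out = F(0), F(1), 0, []

    def emit(bit):
        nonlocal follow
        out.append(bit)
        out.extend([1 - bit] * follow)   # flush bits deferred by case 4
        follow = 0

    for ev, p in zip(events, left_probs):
        mid = low + p * (high - low)
        low, high = (low, mid) if ev == 'a' else (mid, high)
        while True:                                   # step 2(c)
            if high <= F(1, 2):                       # case 2: in [0, 1/2)
                emit(0); low, high = 2 * low, 2 * high
            elif low >= F(1, 2):                      # case 3: in [1/2, 1)
                emit(1); low, high = 2 * low - 1, 2 * high - 1
            elif low >= F(1, 4) and high <= F(3, 4):  # case 4: middle half
                follow += 1
                low, high = 2 * low - F(1, 2), 2 * high - F(1, 2)
            else:                                     # case 1: exit the loop
                break

    # End of file: one more bit plus any deferred follow bits identifies
    # a subinterval that lies within the final interval.
    follow += 1
    emit(0 if low < F(1, 4) else 1)
    return ''.join(map(str, out))

print(encode_incremental('bab', [F(2, 3), F(1, 2), F(3, 5)]))  # 11001
```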
Table 3. Example of arithmetic coding with incremental transmission, interval expansion, and integer arithmetic. Full interval is [0, 1024), so in effect subinterval endpoints are constrained to be multiples of 1/1024.

    Action                                                           Subintervals
    Start                                                            [0, 1024)
    p{a1} = 2/3; subdivide with left probability 683/1024 = 0.66699  [0, 683); [683, 1024)
    Input b1, select right subinterval                               [683, 1024)
    Output 1, expand                                                 [342, 1024)
    Subdivide with left prob. p{a2} = 1/2                            [342, 683); [683, 1024)
    Input a2, select left subinterval                                [342, 683)
    Increment follow count, expand                                   [172, 854)
    p{a3} = 3/5; subdivide with left probability 409/682 = 0.59971   [172, 581); [581, 854)
    Input b3, select right subinterval                               [581, 854)
    Output 1, output 0 follow bit, expand                            [138, 684)
    Output 01                                                        [256, 512) is entirely within [138, 684)

Table 4. Example of arithmetic coding with incremental transmission, interval expansion, and small integer arithmetic. Full interval is [0, 8), so in effect subinterval endpoints are constrained to be multiples of 1/8.

    Action                                          Subintervals
    Start                                           [0, 8)
    p{a1} = 2/3; subdivide with left prob. = 5/8    [0, 5); [5, 8)
    Input b1, select right subinterval              [5, 8)
    Output 1, expand                                [2, 8)
    Subdivide with left prob. p{a2} = 1/2           [2, 5); [5, 8)
    Input a2, select left subinterval               [2, 5)
    Increment follow count, expand                  [0, 6)
    p{a3} = 3/5; subdivide with left prob. = 2/3    [0, 4); [4, 6)
    Input b3, select right subinterval              [4, 6)
    Output 1, output 0 follow bit, expand           [0, 4)
    Output 0, expand                                [0, 8)

III. Limited-precision arithmetic coding

Arithmetic coding as it is usually implemented is slow because of the multiplications (and in some implementations, divisions) required in subdividing the current interval according to the probability information. Since small errors in probability estimates cause very small increases in code length, we expect that by introducing approximations into the arithmetic coding process in a controlled way we can improve coding speed without significantly degrading compression performance. In the Q-Coder work at IBM [11], the time-consuming multiplications are replaced by additions and shifts, and low-order bits are ignored. In [12] we describe a different approach to approximate arithmetic coding. Recalling that the fractional bits characteristic of arithmetic coding are stored as state information in the coder, we reduce the number of possible states, and replace arithmetic operations by table lookups; the lookup tables can be precomputed. Here we review this reduced-precision binary arithmetic coder, which we call a quasi-arithmetic coder. It should be noted that the compression is still completely reversible; using reduced precision merely affects the average code length.

Table 5. Excerpts from the quasi-arithmetic coding table, N = 8. Only the three states needed for the example are shown; there are nine more states. An "f" output indicates application of the follow-on procedure described in the text.

    Start    Probability      Left (a) input       Right (b) input
    state    of left input    Output  Next state   Output  Next state
    [0, 8)   0.000 - 0.182    000     [0, 8)       -       [1, 8)
    [0, 8)   0.182 - 0.310    00      [0, 8)       -       [2, 8)
    [0, 8)   0.310 - 0.437    0       [0, 6)       -       [3, 8)
    [0, 8)   0.437 - 0.563    0       [0, 8)       1       [0, 8)
    [0, 8)   0.563 - 0.690    -       [0, 5)       1       [2, 8)
    [0, 8)   0.690 - 0.818    -       [0, 6)       11      [0, 8)
    [0, 8)   0.818 - 1.000    -       [0, 7)       111     [0, 8)
    [0, 6)   0.000 - 0.244    000     [0, 8)       -       [1, 6)
    [0, 6)   0.244 - 0.415    00      [0, 8)       f       [0, 8)
    [0, 6)   0.415 - 0.585    0       [0, 6)       f       [2, 8)
    [0, 6)   0.585 - 0.756    0       [0, 8)       10      [0, 8)
    [0, 6)   0.756 - 1.000    -       [0, 5)       101     [0, 8)
    [2, 8)   0.000 - 0.244    010     [0, 8)       -       [3, 8)
    [2, 8)   0.244 - 0.415    01      [0, 8)       1       [0, 8)
    [2, 8)   0.415 - 0.585    f       [0, 6)       1       [2, 8)
    [2, 8)   0.585 - 0.756    f       [0, 8)       11      [0, 8)
    [2, 8)   0.756 - 1.000    -       [2, 7)       111     [0, 8)

Table 6. Example of operation of quasi-arithmetic coding

    Start in state [0, 8).
    Prob{a1} = 2/3, so choose the range 0.563 - 0.690 in state [0, 8).
    First event is b1, so choose the right (b) input.
    Output 1. Next state is [2, 8).
    Prob{a2} = 1/2, so choose the range 0.415 - 0.585 in state [2, 8).
    Second event is a2, so choose the left (a) input.
    "f" means increment the follow count. Next state is [0, 6).
    Prob{a3} = 3/5, so choose the range 0.585 - 0.756 in state [0, 6).
    Third event is b3, so choose the right (b) input.
    Indicated output is 10. Output 1; output 0 to account for the follow bit; output 0. Next state is [0, 8).

A. Development of binary quasi-arithmetic coding

We have seen that doing arithmetic coding with large integers instead of real or rational numbers hardly degrades compression performance at all. In Table 4 we show the encoding of the same file using small integers: the full interval [0, N) is only [0, 8).

The number of possible states (after applying the interval expansion procedure) of an arithmetic coder using the integer interval [0, N) is 3N^2/16. If we can reduce the number of states to a more manageable level, we can precompute all state transitions and outputs and substitute table lookups for arithmetic in the coder. The obvious way to reduce the number of states is to reduce N. The value of N should be a power of 2; its value must be at least 4. If we choose N = 8 and apply the arithmetic coding algorithm in a straightforward way, we obtain a table with 12 states, each state offering a choice of between 3 and 7 probability ranges. Portions of the table are shown in Table 5.

Table 6 shows how Table 5 is used to encode our sample file. Before coding each input event the coder is in a certain current state, corresponding to the current subinterval.
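The behavior traced in Tables 4 and 6 can be simulated directly. The sketch below keeps the current interval as small-integer endpoints inside [0, N); because only a few state-and-range combinations exist, every transition it computes could be precomputed into a lookup table. One assumption to note: it picks the integer split point whose left fraction is closest to the estimated probability, rather than using the cutoff ranges whose computation is described in the next section, although the two rules agree on this example.

```python
from fractions import Fraction as F

def encode_quasi(events, left_probs, N=8):
    """Quasi-arithmetic coding sketch with integer endpoints in [0, N).
    Assumption: the split point is the integer whose left fraction best
    approximates the left probability (a heuristic stand-in for the
    table's cutoff ranges; both agree on this example)."""
    low, high, follow, out = 0, N, 0, []

    def emit(bit):
        nonlocal follow
        out.append(bit)
        out.extend([1 - bit] * follow)   # deferred follow-on bits
        follow = 0

    for ev, p in zip(events, left_probs):
        s = min(range(low + 1, high),
                key=lambda x: abs(F(x - low, high - low) - p))
        low, high = (low, s) if ev == 'a' else (s, high)
        while True:                      # integer interval expansion
            if high <= N // 2:           # within [0, N/2): output 0
                emit(0); low, high = 2 * low, 2 * high
            elif low >= N // 2:          # within [N/2, N): output 1
                emit(1); low, high = 2 * low - N, 2 * high - N
            elif low >= N // 4 and high <= 3 * N // 4:   # middle half
                follow += 1
                low, high = 2 * low - N // 2, 2 * high - N // 2
            else:
                break

    if (low, high) != (0, N):            # end-of-file disambiguation
        follow += 1
        emit(0 if low < N // 4 else 1)
    return ''.join(map(str, out))

# The file b1 a2 b3 with N = 8 reproduces the trace of Tables 4 and 6:
print(encode_quasi('bab', [F(2, 3), F(1, 2), F(3, 5)]))  # 1100
```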
For each state there are a number of probability ranges; we choose the one that includes the estimated probability for the next event. Then we simply select the input event that actually occurs and perform the indicated actions: outputting bits, incrementing the follow count, and changing to a new current state. In the example the encoded output file is 1100. Because we were using such low precision, the subdivision probabilities became distorted, leading to a lower probability for the file (1/16), but one which ends in the full interval [0, 8), requiring no disambiguating bits. We usually use a somewhat larger value of N; in practice the compression inefficiency of a binary quasi-arithmetic coder (neglecting very large and very small probabilities) is less than one percent for N = 32 and about 0.2 percent for N = 128.

In implementations, the coding tables can be stripped down so that the numerical values of the interval endpoints and probability ranges do not appear. Full details and examples appear in [13]. The next section explains the computation of the probability ranges. Decoding uses essentially the same tables, and in fact is easier than encoding.

B. Analysis of binary quasi-arithmetic coding

We now prove that using binary quasi-arithmetic coding causes an insignificant increase in the code length compared with pure arithmetic coding. We mathematically analyze several cases.

In this analysis we exclude very large and very small probabilities, namely those that are less than 1/W or greater than (W - 1)/W, where W is the width of the current interval. For these probabilities the relative excess code length can be large.

First we assume that we know the probability p of the left branch of each event, and we show both how to minimize the average excess code length and how small the excess is. In arithmetic coding we divide the current interval of width W into subintervals of length L and R; this gives an effective coding probability q = L/W, since the resulting code length is -log2 q for the left branch and -log2(1 - q) for the right. When we encode a sequence of binary events with probabilities p and 1 - p using effective coding probabilities q and 1 - q, the average code length L(p, q) is given by

    L(p, q) = -p log2 q - (1 - p) log2(1 - q).

If we use pure arithmetic coding, we can subdivide the interval into lengths pW and (1 - p)W, thus making q = p and giving an average code length equal to the entropy, H(p) = -p log2 p - (1 - p) log2(1 - p); this is optimal.

Consider two probabilities p1 and p2 that are adjacent based on the subdivision of an interval of width W; in other words, p1 = W1/W, p2 = W2/W, and W2 - W1 = 1. For any probability p between p1 and p2, either p1 or p2 should be chosen, whichever gives a shorter average code length. There is a cutoff probability p* for which p1 and p2 give the same average code length. We can compute p* by solving the equation L(p*, p1) = L(p*, p2), giving

    p* = 1 / (1 + log(p2/p1) / log((1 - p1)/(1 - p2))).        (1)

We can use Equation (1) to compute the probability ranges in the coding tables. As an example, we compute the cutoff probability used in deciding whether to subdivide interval [0, 6) as {[0, 3); [3, 6)} or {[0, 4); [4, 6)}; this is the number 0.585 that appears in Table 5. In this case p1 = 1/2 and p2 = 2/3. We compute p* = log(3/2) / log 2 ≈ 0.585. Hence when we encode the third event in the example with p{a3} = 3/5, we use the {[0, 4); [4, 6)} subdivision.

Probability p* is the probability between p1 and p2 with the worst average quasi-arithmetic coding performance, both in excess bits per event and in excess bits relative to optimal compression. This can be shown by monotonicity arguments.

Theorem 1: If we construct a quasi-arithmetic coder based on full interval [0, N), and use correct probability estimates, the number of bits per input event by which the average code length obtained by the quasi-arithmetic coder exceeds that of an exact arithmetic coder is at most

    (4 / ln 2) log2(2 / (e ln 2)) (1/N) + O(1/N^2) ≈ 0.497/N + O(1/N^2),

and the fraction by which the average code length obtained by the quasi-arithmetic coder exceeds that of an exact arithmetic coder is at most

    log2(2 / (e ln 2)) (1 / log2 N) + O(1 / (log2 N)^2) ≈ 0.0861 / log2 N + O(1 / (log2 N)^2).

Proof: For a quasi-arithmetic coder with full interval [0, N), the shortest terminal state intervals have size W = N/4 + 2. The worst average error occurs for the smallest W and the most extreme probabilities, that is, for W = N/4 + 2, p1 = (W - 2)/W, and p2 = (W - 1)/W (or, symmetrically, p1 = 1/W and p2 = 2/W). For these probabilities, we find the cutoff probability p*. Then for the first part of the theorem we take the asymptotic expansion of L(p*, p2) - H(p*), and for the second part we take the asymptotic expansion of (L(p*, p2) - H(p*)) / H(p*). □

If we let p̄ = (p1 + p2)/2, we can expand Equation (1) asymptotically in W and obtain an excellent rational approximation p_e for p*:

    p_e = p̄ + (p̄ - 1/2) / (6 W^2 p̄ (1 - p̄)).

The compression loss introduced by using p_e instead of p* is completely negligible, never more than 0.06 percent. In the example above with p1 = 1/2, p2 = 2/3, and W = 6, we find that p* = log(3/2) / log 2 ≈ 0.58496 and p_e = 737/1260 ≈ 0.58492. As we expect, only for small values of W (the short intervals that occur when using very low precision arithmetic) do we need to be careful about roundoff when subdividing intervals; for larger values of W we can practically round the interval endpoints to the nearest integer.

We can prove a similar theorem for a more general case, in which we compare quasi-arithmetic coding with arithmetic coding for a single worst-case event. We assume that both coders use the same estimated probability, but that the estimate may be arbitrarily bad. The proof is omitted to save space.

Theorem 2: If we construct a quasi-arithmetic coder based on full interval [0, N), and use arbitrary probability estimates, the number of bits per input event by which the code length obtained by the quasi-arithmetic coder exceeds that of an exact arithmetic coder in the worst case is at most

    log2((N + 8)/(N + 4)) = 4 / ((N + 4) ln 2) + O(1/N^2) ≈ 5.771/N.
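Equation (1) and the rational approximation p_e can be checked numerically; a minimal sketch (the function names are ours):

```python
import math
from fractions import Fraction as F

def cutoff(p1, p2):
    """Equation (1): the probability at which coding with p1 or with p2
    yields the same average code length."""
    return 1 / (1 + math.log(p2 / p1) / math.log((1 - p1) / (1 - p2)))

def cutoff_approx(p1, p2, W):
    """The rational approximation p_e obtained from the asymptotic
    expansion of Equation (1) in the interval width W."""
    pbar = (p1 + p2) / 2
    return pbar + (pbar - F(1, 2)) / (6 * W**2 * pbar * (1 - pbar))

# Subdividing [0, 6): choose {[0,3); [3,6)} vs. {[0,4); [4,6)}.
print(cutoff(1/2, 2/3))                    # ~0.58496, i.e. log(3/2)/log 2
print(cutoff_approx(F(1, 2), F(2, 3), 6))  # 737/1260, i.e. ~0.58492
```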
IV. Parallel coding

In [14] we present general-purpose algorithms for encoding and decoding using both Huffman and quasi-arithmetic coding. We are considering the case where we wish to do both encoding and decoding in parallel, so the location of bits in the encoded file must be computable by the decoding processors before they begin decoding. The main requirement for our parallel coder is that the model for each event is known ahead of time; it is not necessary for each event to have the same model. We use the PRAM model of parallel computation; the number of processors is p. We allow concurrent reading of the parallel memory, and limited concurrent writing, to a single one-bit register, which we call the completion register. It is initialized to 0 at the beginning of each time step; during the time step any number of processors (or none at all) may write 1 to it. We assume that n events are to be coded. The main issue is in dealing with the differences in code lengths of different events. For simplicity we do not consider input data routing.

We describe the algorithm for the Huffman coding case, since prefix codes are easier to handle, and then extend it to quasi-arithmetic coding. First the n events are distributed as evenly as possible among the p processors. Then each processor begins encoding its assigned events in a deterministic sequence, outputting one bit in each time step; the bits output in each time step are contiguous in the output stream. As soon as one or more processors complete all of their assigned events, they indicate this fact by writing 1 to the completion register. When the completion register is 1, the output process is interrupted. Events whose processing has not begun are distributed evenly among the processors, using a prefix operation; events whose processing has begun but not finished remain assigned to their current processors. Processing of the reallocated events then continues until the next time that one or more processors finishes all their allocated events. Toward the end of processing, the number of remaining events will become less than the number of available processors. At this time we deactivate processors, making the number of active processors equal to the number of remaining events; we must redirect the output bits from the remaining processors. No further event reallocations are needed, but we still have to deactivate processors whenever any of them finish.

We analyze the time required by the parallel algorithm. Assuming that in one time unit each processor can output one bit, we can easily show that the time required for bit

At least one processor completes all its events in each phase; such a processor must output at least one bit per event, since all code words in a Huffman code have at least one bit. This processor (and hence all processors) thus output at least n/(2p) bits in each phase, making a total of at least n/2 bits output in each phase. The total number of bits that must be output in the first superphase is at most nL/2, so the number of phases in the first superphase is at most L. The same reasoning holds for all superphases. The number of superphases needed to reduce the number of remaining events from n to p is log2(n/p), so the number of phases needed is just L log2(n/p). Once p or fewer events remain, at most L late phases are required, so the total number of phases needed is at most L log2(2n/p). □

The extension of the parallel Huffman algorithm to quasi-arithmetic coding is fairly straightforward. The only complication arises when the last event of a processor's allocation leaves the processor in a state other than the starting state. We deal with this by outputting the smallest possible number of bits (0, 1, or 2) needed to identify a subinterval that lies within the final interval; this is the end-of-file disambiguation problem that we have seen in Section II.

V. Modeling

The goal of modeling for statistical data compression is to provide probability information to the coder. The modeling process consists of structural and probability estimation components; each may be adaptive, semi-adaptive, or static. In addition there are two strategies for probability estimation. The first is to estimate each event's probability individually, based on its frequency within the data set. The other strategy is to estimate the probabilities collectively, assuming a probability distribution of a particular form and estimating the parameters of the distribution, either directly or indirectly. For direct estimation, we simply compute an estimate of the parameter (the variance, for instance) from the data. For indirect estimation [15], we start with a small number of possible distributions, and compute the code length that we would obtain with each; then we select the one with the smallest code length. This method is very general and can be used even for distributions from different families, without common parameters.

For lossless image compression, the structural component is usually fixed, since for most images pixel intensities are close in value to the intensities of nearby pixels. We code
pixels in a predetermined sequence, predicting each pixel's
output is b etween dnH =pe and dnH =pe + L. The more in-
intensity using a xed linear combination of a xed constella-
teresting analysis concerns the numb er of pre x op erations
tion of nearby pixels, then co ding the prediction error. Typ-
required. We de ne a phase to b e the p erio d b etween pre x
ically the prediction errors have a symmetrical exp onential-
op erations. Early phases are those which take place while
like distribution with zero mean, so the probability estima-
the number of events is greater than the numb er of pro ces-
tion comp onent consists of estimating the variance and p ossi-
sors; each is followed byanevent reallo cation. Late phases
bly some other parameters of the distributio n, either directly
are those needed to co de the nal p or fewer events; each late
or indirectly. A collection of only a few dozen distributio ns
phase requires a pre x op eration to redirect output bits. We
is sucient to co de most images with minimal compression
b ound the numb er of pre x op erations needed in the follow-
loss.
ing theorem.
For text, on the other hand, the b est metho ds involve con-
stantly changing structures. In text compression the events
Theorem 3 When coding n events using p processors, the
to b e co ded are just the letters in the le. Typically we
number of pre x operations needed by the real location coding
b egin by assuming only that some unsp eci ed sequences of
algorithm is at most L log 2n=p in the worst case.
2
letters will o ccur frequently.Wemay also sp ecify a max-
imum length for the frequently o ccurring sequences. As Proof :We de ne a superphase to b e the time needed to
enco ding pro ceeds we determine which sequences are most halve the numb er of remaining events. Consider the rst
frequent. In the Ziv-Lemp el dictionary-based metho ds [16, sup erphase. The number of events assigned to each pro cessor
17], the sequences are placed directly in the dictionary. The ranges from n=p in the rst phase down to n=2p in the last. 6
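As a minimal sketch of this dictionary-based approach, the parse below enters each newly seen sequence directly into the dictionary as encoding proceeds. This is an LZ78-style illustration only; the function name and details are ours, and [16, 17] describe the actual variants.

```python
# Minimal LZ78-style parse: frequently occurring sequences are entered
# directly into the dictionary as they are encountered (illustrative sketch).
def lz78_parse(text):
    dictionary = {"": 0}          # phrase -> index; the empty phrase is index 0
    output = []                   # list of (phrase index, extension letter) pairs
    phrase = ""
    for letter in text:
        if phrase + letter in dictionary:
            phrase += letter      # extend the current phrase
        else:
            output.append((dictionary[phrase], letter))
            dictionary[phrase + letter] = len(dictionary)
            phrase = ""
    if phrase:                    # flush any unfinished phrase
        output.append((dictionary[phrase], ""))
    return output

print(lz78_parse("abababa"))      # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a')]
```

Each output pair codes a previously seen phrase plus one extension letter; the dictionary grows by one phrase per pair, so frequent sequences come to be represented by single indices, with no explicit probability estimation or statistical coding stage.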
The advantage of Ziv-Lempel methods is their speed, obtained by coding directly from the dictionary data structure, bypassing the explicit probability estimation and statistical coding stages. The PPM method [18] obtains better compression by constructing a Markov model of moderate order, proceeding to higher orders as more of the file is encoded. Complicated data structures are used to build and update the context information.

Most models for text compression involve estimating the letter probabilities individually, since there is no obvious mathematical relationship between the probabilities of different letters. Numerical proximity of ASCII representations does not imply anything about probabilities. We usually estimate the probability p of a given letter by

    p = (weight of letter) / (total weight of all letters).

The weight of a letter is usually based on the number of occurrences of the letter in a particular conditioning context.

Since we are often dealing with contexts that have occurred only a few times, we have to deal with letters that have never occurred. We cannot give a weight of 0 to such letters because that would lead to probability estimates of 0, which arithmetic coding cannot handle. This is the zero-frequency problem, thoroughly investigated by Witten and Bell [19]. It is possible to assign an initial weight of 1 to all possible letters, but a better strategy is to assign initial weights of 0 and to include a special escape event indicating "some letter that has never occurred before in this context"; this event has its own weight, and must be encoded like an ordinary letter whenever a new letter occurs. There are a number of methods of dealing with the zero-frequency problem, differing mainly in the computation of the escape probability.

A second issue that arises in text compression is locality of reference: strings tend to occur in clusters within a text file. One way to take advantage of locality is to scale the counts periodically, typically by halving all weights when the total weight reaches a specified number. This effectively gives more weight to more recent occurrences of each letter, and has the additional benefit of keeping the counts small enough to fit into small registers. Analysis [4] shows that scaling often helps compression, and can never hurt it by very much.

One technique for probability estimation when only two events are possible is to use small scaled counts, considering each count pair to be a probability state. We can then precompute both the corresponding probabilities and the state transitions caused by the occurrence of additional events. The states can be identified by index numbers, which in turn can be used to index into the probability range lists in quasi-arithmetic code tables like those in Table 5.

The separation of coding and modeling is important because it permits any degree of complexity in the modeler without requiring any change to the coder. In particular, the model structure and probability estimates can change adaptively. In Huffman coding, by contrast, the probability information must be built into the coding tables, making adaptive coding difficult. Even totally disjoint data streams can be intertwined; the only requirement is that the decoder must be able to track the model structure of the encoder.

One occasionally useful advantage of arithmetic coding is that it is easy to maintain lexicographic order without any loss in compression efficiency, so that encoded strings can have the same order as the original unencoded data; maintaining lexicographic order with a prefix code requires a more complicated algorithm (the Hu-Tucker algorithm [20]) and entails some sacrifice in efficiency.

The disadvantages of arithmetic coding are that it runs slowly, it is fairly complicated to implement, and it does not produce prefix codes. It can be speeded up, with only a slight loss in compression efficiency, by using approximations like quasi-arithmetic coding, the patented IBM Q-Coder, or Neal's low-precision coder [21]. The implementation difficulties are not insurmountable, but arithmetic coding will not be available as an off-the-shelf package until a fast, efficient, patent-free method is agreed upon as a standard. The non-prefix-code property leads to some technical difficulties: error resistance can be a serious problem, especially with adaptive models, and the output delay can be unbounded in the Witten-Neal-Cleary version described here, although not in the Q-Coder.

For binary alphabets, various forms of run-length encoding can be used instead of arithmetic coding, even if the probabilities are highly skewed. Golomb coding [22] and the related Rice coding [23] are based on exponentially distributed run lengths. The Golomb and Rice methods each consist of a family of codes parameterized by a single parameter; the parameter can be estimated adaptively [15], giving good compression efficiency. These codes are extremely fast prefix codes and are easily implemented in software or hardware. Other non-parameterized run-length codes like Elias codes [24] are less flexible and hence less useful.

When more than two input events are possible, the main alternatives to arithmetic coding are Huffman coding and coding based on splay trees [25]. Moffat et al. [26] compare arithmetic coding with these methods. Huffman coding is very effective in conjunction with a semi-adaptive model; the probability information can be built into the coding tables, leading to fast execution. Splay trees are even faster, since the data structure for coding is the "statistical" model; compression efficiency suffers somewhat. Golomb and Rice coding can be used when many events are possible, maintaining lists of the events in approximate probability order and coding the positions of the events within the lists. This is an especially useful technique for very large alphabets [27].

The main usefulness of arithmetic coding is in obtaining maximum compression in conjunction with an adaptive model, or when the probability of one event is much larger than 1/2. Arithmetic coding gives optimal compression, but its slow execution can be problematical. Approximate versions of arithmetic coding give almost optimal compression at improved speeds. Probabilities can be estimated approximately too, again leading to only slight degradation of compression performance.

VI. Conclusion

The main advantages of arithmetic coding for statistical data compression are its optimality and its inherent separation of coding and modeling. Pure arithmetic coding, as described in Section II, is strictly optimal for a stochastic data source whose probabilities are accurately known, but it relies on relatively slow arithmetic operations like multiplication and division. The quasi-arithmetic coding method described in Section III, which uses table lookup as a low-precision alternative to full-precision arithmetic, is nearly optimal and fairly fast, even when the probabilities are close to 1 or 0.
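To make the table-driven, probability-state technique of Section V concrete, the sketch below precomputes, for every pair of small scaled counts, the probability of the first event and the state transition caused by each new event, so that coding needs only table lookups. The bound MAX_TOTAL and the halving rule are illustrative assumptions of ours, not the parameters behind the paper's Table 5.

```python
from fractions import Fraction

# Two-event probability states: each pair of small scaled counts (c0, c1)
# is a state; probabilities and transitions are precomputed once.
MAX_TOTAL = 8                                # illustrative bound on c0 + c1

def next_state(c0, c1, event):
    """Update the count pair for one event, rescaling to keep counts small."""
    c0, c1 = (c0 + 1, c1) if event == 0 else (c0, c1 + 1)
    if c0 + c1 > MAX_TOTAL:                  # halve the counts, keeping each >= 1
        c0, c1 = max(1, c0 // 2), max(1, c1 // 2)
    return c0, c1

# Enumerate the states and precompute the lookup tables.
states = [(c0, c1) for c0 in range(1, MAX_TOTAL) for c1 in range(1, MAX_TOTAL)
          if c0 + c1 <= MAX_TOTAL]
prob0 = {s: Fraction(s[0], s[0] + s[1]) for s in states}         # P(event 0)
trans = {(s, e): next_state(*s, e) for s in states for e in (0, 1)}

print(prob0[(1, 1)], trans[((1, 1), 0)])     # 1/2 (2, 1)
```

Because every transition lands back in the enumerated state set, the states can be numbered and the tables indexed by those numbers, exactly the arrangement that lets a quasi-arithmetic coder avoid arithmetic at coding time.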
Acknowledgments

The authors would like to acknowledge helpful suggestions made by Marty Cohn.

References

[1] C. E. Shannon, "A Mathematical Theory of Communication," Bell Syst. Tech. J., 27, pp. 379-423, July 1948.

[2] D. A. Huffman, "A Method for the Construction of Minimum Redundancy Codes," Proceedings of the Institute of Radio Engineers, 40, pp. 1098-1101, 1952.

[3] T. C. Bell, J. G. Cleary and I. H. Witten, Text Compression. Englewood Cliffs, NJ: Prentice-Hall, 1990.

[4] P. G. Howard and J. S. Vitter, "Analysis of Arithmetic Coding for Data Compression," Information Processing and Management, 28, no. 6, pp. 749-763, 1992.

[5] I. H. Witten, R. M. Neal and J. G. Cleary, "Arithmetic Coding for Data Compression," Comm. ACM, 30, no. 6, pp. 520-540, June 1987.

[6] R. Pasco, "Source Coding Algorithms for Fast Data Compression," Stanford Univ., Ph.D. Thesis, 1976.

[7] J. J. Rissanen, "Generalized Kraft Inequality and Arithmetic Coding," IBM J. Res. Develop., 20, no. 3, pp. 198-203, May 1976.

[8] F. Rubin, "Arithmetic Stream Coding Using Fixed Precision Registers," IEEE Trans. Inform. Theory, IT-25, no. 6, pp. 672-675, Nov. 1979.

[9] J. J. Rissanen and G. G. Langdon, "Arithmetic Coding," IBM J. Res. Develop., 23, no. 2, pp. 146-162, Mar. 1979.

[10] M. Guazzo, "A General Minimum-Redundancy Source-Coding Algorithm," IEEE Trans. Inform. Theory, IT-26, no. 1, pp. 15-25, Jan. 1980.

[11] W. B. Pennebaker, J. L. Mitchell, G. G. Langdon and R. B. Arps, "An Overview of the Basic Principles of the Q-Coder Adaptive Binary Arithmetic Coder," IBM J. Res. Develop., 32, no. 6, pp. 717-726, Nov. 1988.

[12] P. G. Howard and J. S. Vitter, "Practical Implementations of Arithmetic Coding," in Image and Text Compression, J. A. Storer, Ed. Norwell, MA: Kluwer Academic Publishers, pp. 85-112, 1992.

[13] P. G. Howard and J. S. Vitter, "Design and Analysis of Fast Text Compression Based on Quasi-Arithmetic Coding," in Proc. Data Compression Conference, J. A. Storer and M. Cohn, Eds. Snowbird, Utah: pp. 98-107, Mar. 30-Apr. 1, 1993.

[14] P. G. Howard and J. S. Vitter, "Parallel Lossless Image Compression Using Huffman and Arithmetic Coding," in Proc. Data Compression Conference, J. A. Storer and M. Cohn, Eds. Snowbird, Utah: pp. 299-308, Mar. 24-26, 1992.

[15] P. G. Howard and J. S. Vitter, "Fast and Efficient Lossless Image Compression," in Proc. Data Compression Conference, J. A. Storer and M. Cohn, Eds. Snowbird, Utah: pp. 351-360, Mar. 30-Apr. 1, 1993.

[16] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Trans. Inform. Theory, IT-23, no. 3, pp. 337-343, May 1977.

[17] J. Ziv and A. Lempel, "Compression of Individual Sequences via Variable Rate Coding," IEEE Trans. Inform. Theory, IT-24, no. 5, pp. 530-536, Sept. 1978.

[18] J. G. Cleary and I. H. Witten, "Data Compression Using Adaptive Coding and Partial String Matching," IEEE Trans. Comm., COM-32, no. 4, pp. 396-402, Apr. 1984.

[19] I. H. Witten and T. C. Bell, "The Zero Frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression," IEEE Trans. Inform. Theory, IT-37, no. 4, pp. 1085-1094, July 1991.

[20] T. C. Hu and A. C. Tucker, "Optimal Computer-Search Trees and Variable-Length Alphabetic Codes," SIAM J. Appl. Math., 21, no. 4, pp. 514-532, 1971.

[21] R. M. Neal, "Fast Arithmetic Coding Using Low-Precision Division," unpublished manuscript, 1987.

[22] S. W. Golomb, "Run-Length Encodings," IEEE Trans. Inform. Theory, IT-12, no. 4, pp. 399-401, July 1966.

[23] R. F. Rice, "Some Practical Universal Noiseless Coding Techniques," Jet Propulsion Laboratory, Pasadena, California, JPL Publication 79-22, Mar. 1979.

[24] P. Elias, "Universal Codeword Sets and Representations of Integers," IEEE Trans. Inform. Theory, IT-21, no. 2, pp. 194-203, Mar. 1975.

[25] D. W. Jones, "Application of Splay Trees to Data Compression," Comm. ACM, 31, no. 8, pp. 996-1007, Aug. 1988.

[26] A. M. Moffat, N. Sharman, I. H. Witten and T. C. Bell, "An Empirical Evaluation of Coding Methods for Multi-Symbol Alphabets," in Proc. Data Compression Conference, J. A. Storer and M. Cohn, Eds. Snowbird, Utah: pp. 108-117, Mar. 30-Apr. 1, 1993.

[27] A. M. Moffat, "Economical Inversion of Large Text Files," presented at Computing Systems, 1992.

Paul G. Howard is a Member of Technical Staff in the Visual Communications Research Department of AT&T Bell Laboratories. He received the B.S. degree in computer science from M.I.T. in 1977 and the M.S. and Ph.D. degrees in computer science from Brown University in 1989 and 1993, respectively. His research interests are in data compression, including coding, image modeling, and text modeling. He is a member of Sigma Xi and an associate member of the Center of Excellence in Space Data and Information Sciences. He was briefly a Research Associate at Duke University before joining AT&T in 1993.

Jeffrey Scott Vitter (Fellow, IEEE) is the Gilbert, Louis, and Edward Lehrman Professor of Computer Science and the Chair of the Department of Computer Science at Duke University. Previously he was Professor at Brown University, where he was on the faculty since 1980. He received the B.S. degree in mathematics with highest honors from the University of Notre Dame in 1977 and the Ph.D. degree in computer science from Stanford University in 1980. He is a Guggenheim Fellow, an NSF Presidential Young Investigator, and an IBM Faculty Development Awardee. He is coauthor of the book Design and Analysis of Coalesced Hashing (Oxford University Press, 1987) and is coholder of a patent in the area of external sorting. He has written numerous articles and has consulted often in the areas of combinatorial algorithms, I/O efficiency, parallel and incremental computation, computational geometry, and machine learning. He serves on several editorial boards and is a frequent editor of special issues and member of program committees. He has served ACM SIGACT as Member-at-Large from 1987 to 1991 and as Vice Chair since 1991. He is currently an associate member of the Center of Excellence in Space Data and Information Sciences.