PPM Performance with BWT Complexity:
A New Method for Lossless Data Compression
Michelle Effros
California Institute of Technology
Abstract
This work combines a new fast context-search algorithm with the lossless source coding models of PPM to achieve a lossless data compression algorithm with the linear context-search complexity and memory of BWT and Ziv-Lempel codes and the compression performance of PPM-based algorithms. Both sequential and nonsequential encoding are considered. The proposed algorithm yields an average rate of 2.27 bits per character (bpc) on the Calgary corpus, comparing favorably to the 2.33 and 2.34 bpc of PPM5 and PPM* and the 2.43 bpc of BW94, but not matching the 2.12 bpc of PPMZ9, which, at the time of this publication, gives the greatest compression of all algorithms reported on the Calgary corpus results page. The proposed algorithm gives an average rate of 2.14 bpc on the Canterbury corpus. The Canterbury corpus web page gives average rates of 1.99 bpc for PPMZ9, 2.11 bpc for PPM5, 2.15 bpc for PPM7, and 2.23 bpc for BZIP2 (a BWT-based code) on the same data set.
I Introduction
The Burrows Wheeler Transform (BWT) [1] is a reversible sequence transformation that is becoming increasingly popular for lossless data compression. The BWT rearranges the symbols of a data sequence in order to group together all symbols that share the same unbounded history or "context." Intuitively, this operation is achieved by forming a table in which each row is a distinct cyclic shift of the original data string. The rows are then ordered lexicographically. The BWT outputs the last character of each row, which is the character that precedes the first character of the given cyclic shift in the original data string. For a finite memory source, this ordering groups together all symbols with the same conditional distribution and leads to a family of low complexity universal lossless source codes [2, 3].
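To make this concrete, here is a minimal Python sketch of the forward transform exactly as described above: it materializes the full table of cyclic shifts, which is far less efficient than the suffix-based implementations discussed later, but mirrors the intuitive construction. The end-of-string marker and example string are illustrative choices, not part of the original description.

    def bwt(s, eos="$"):
        """Naive Burrows Wheeler Transform following the table-of-cyclic-shifts
        description: O(n^2 log n) time, for intuition only."""
        s = s + eos  # append a unique end-of-string character
        # Each row of the table is a distinct cyclic shift of the input.
        table = sorted(s[i:] + s[:i] for i in range(len(s)))
        # Output the last character of each lexicographically ordered row,
        # i.e., the character preceding the first character of each shift.
        return "".join(row[-1] for row in table)

    print(bwt("banana"))  # -> "annb$aa": symbols with shared contexts are grouped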
While the best of the universal codes described in [2, 3] converges to the optimal performance at a rate within a constant factor of the optimum, the performance of these codes on finite sequences from sources such as text fails to meet the performance of some competing algorithms [4]. This failure stems in part from the fact that in grouping together all symbols with the same context, the BWT makes the context information inaccessible to the decoder.

(Footnote: M. Effros is with the Department of Electrical Engineering (MC 136-93) at the California Institute of Technology, Pasadena, CA 91125. This material is based upon work partially supported by NSF Award No. CCR-9909026.)
Originally, the work described in this paper set out to modify the BWT-based codes of [2, 3, 4] in order to make the context information accessible to the decoder. What evolved, however, was effectively a variation on the Prediction by Partial Mappings (PPM) algorithms of [5, 6]. Thus the work presented here can be viewed either as a modification of the BWT to include context information or as a modification of PPM to allow for the computational efficiency of the BWT. The description that follows takes the latter viewpoint.
The PPM algorithm and its descendants are extremely effective techniques for sequential lossless source coding. In tests on data sets such as the Calgary and Canterbury corpora, the performance of PPM codes consistently meets or exceeds the performance of a wide array of algorithms, including techniques based on the BWT and Ziv-Lempel style algorithms such as LZ77 [7], LZ78 [8], and their descendants. Yet despite their excellent performance, PPM codes are far less commonly applied than algorithms like LZ77 and LZ78. The Ziv-Lempel codes are favored over PPM-based codes for their relative efficiencies in memory and computational complexity [9]. Straightforward implementations of some PPM algorithms require $O(n^2)$ computational complexity and memory to code a data sequence of length $n$. While implementations requiring only $O(n)$ memory have been proposed in the literature [6, 10], the high computational complexity and encoding time of PPM algorithms remain an impediment to their more widespread use.
This paper introduces a new context-search algorithm. While the proposed algorithm could also be employed in Ziv-Lempel and BWT-based codes, its real distinction is its applicability within PPM-based codes. The PPM code used is a minor variation on an existing PPM algorithm [10]. Our code achieves average rates of 2.27 and 2.14 bits per character (bpc), respectively, on the Calgary and Canterbury corpora and is efficient in both space and computation. The algorithm uses $O(n)$ memory and achieves $O(n)$ complexity in its search for the longest matching context for each symbol of a length-$n$ data sequence. The $O(n)$ complexity is a significant improvement over the worst-case $O(n^2)$ complexity of a direct context search. Several variations on the given approach are presented. The variations include both sequential and non-sequential encoding techniques and allow the user to trade off encoder memory, delay, and computational complexity.
The remainder of the paper is organized as follows. Section II contains a review of PPM algorithms. The review focuses on PPM* [6] and its exclusion-based variation from [10]. A short introduction to suffix trees and McCreight's suffix tree construction algorithm [11] follows in Section III. McCreight's algorithm is used in implementations of Ziv-Lempel [12] and BWT [1] codes. The algorithm description is followed by a brief discussion of the difficulties inherent in applying McCreight's algorithm in PPM-based codes. Section IV describes a new method for combining PPM data compression with a new suffix-tree algorithm. Section V gives experimental results and conclusions.
II PPM Algorithms
The lossless compression algorithms in the PPM family of codes combine sequential arithmetic source coding (see, for example, [13], [14], or texts such as [9]) with adaptive Markov-style data models. Given a probability model $p(x^n)$ for symbols $x_1, \ldots, x_n$ in some finite source alphabet $\mathcal{X}$, arithmetic coding guarantees a description length $\ell(x^n)$ such that $\ell(x^n) < -\log p(x^n) + 2$ for all possible $x^n = (x_1, \ldots, x_n) \in \mathcal{X}^n$ [13]. Thus, given a good source model, arithmetic codes yield excellent lossless source coding performance.
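As a small numerical illustration of this guarantee (a sketch only; the i.i.d. model and example string are invented for demonstration and are far simpler than the adaptive models PPM uses), the following computes the ideal codelength $-\log_2 p(x^n)$ and the corresponding bound:

    import math

    # Toy i.i.d. probability model over the alphabet {a, b, c}.
    model = {"a": 0.5, "b": 0.3, "c": 0.2}

    x = "abacaba"
    # Under an i.i.d. model, log p(x^n) is the sum of per-symbol log probabilities.
    log2_p = sum(math.log2(model[c]) for c in x)

    ideal = -log2_p       # ideal codelength in bits
    bound = ideal + 2     # arithmetic coding guarantee: l(x^n) < -log p(x^n) + 2
    print(f"ideal = {ideal:.2f} bits, guaranteed l(x^n) < {bound:.2f} bits")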
A simple approach to source modeling is the Markov model approach. For any finite integer $k$, a Markov model of order $k$ conditions the probability of the next symbol on the previous $k$ symbols. Thus we code symbol $x_n$ using the probability $p(x_n|x^{n-1}) = p(x_n|x_{n-k}^{n-1})$, where the string $x_{n-k}^{n-1} = (x_{n-k}, x_{n-k+1}, \ldots, x_{n-1})$ describes the "context" of past information on which our estimation of likely future events is conditioned.
In an adaptive code, the probability estimates are built using information about the previously coded data stream. Thus the encoder may update its probability estimates $p_n(a|s)$ for all $a \in \mathcal{X}$ and all $s \in \mathcal{X}^k$ at each time step $n$ in order to better reflect its current knowledge of the source. The subscript $n$ on $p_n(a|s)$ here makes explicit the adaptive nature of the probability estimate; $p_n(a|s)$ is the estimate of probability $p(a|s)$ at time $n$, just before the $n$th symbol is coded. The estimate $p_n(a|s)$ is based on the $n-1$ previously coded symbols in the data stream. Let $N_n(a|s)$ denote the number of times that symbol $a$ has followed sequence $s \in \mathcal{X}^k$ in the previous data stream, where $N_n(a|s) = \sum_{i=k+1}^{n-1} 1(x_{i-k}^{i} = sa)$ for each $a \in \mathcal{X}$. If the probability model $p_n(a|s)$ relies only on the conditional symbol counts $\{N_n(a|s) : a \in \mathcal{X}, s \in \mathcal{X}^k\}$ and if the decoder can sequentially decipher the information sent to it, then the decoder can track the changing probability model $p_n$ by keeping a tally of symbol counts identical to the one used at the encoder.
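The following sketch shows this encoder/decoder symmetry for an order-$k$ adaptive model. The alphabet, the add-one smoothing, and the function names are assumptions made for illustration; the paper's PPM estimator, described next, combines models of several orders instead.

    from collections import defaultdict

    ALPHABET = list("ab")      # toy alphabet X
    K = 2                      # Markov order k

    counts = defaultdict(int)  # counts[(s, a)] tallies N_n(a|s)

    def estimate(s, a):
        """Adaptive estimate p_n(a|s) from the current counts
        (add-one smoothing is an assumption of this sketch)."""
        total = sum(counts[(s, b)] for b in ALPHABET)
        return (counts[(s, a)] + 1) / (total + len(ALPHABET))

    def update(s, a):
        """After symbol a is coded in context s, encoder and decoder
        perform the identical count update, keeping p_n synchronized."""
        counts[(s, a)] += 1

    x = "abaabbaa"
    for n in range(K, len(x)):
        s, a = x[n - K:n], x[n]    # context x_{n-k}^{n-1} and symbol x_n
        p = estimate(s, a)         # probability used to code x_n
        update(s, a)               # tally updated only after coding
        print(f"p({a}|{s}) = {p:.3f}")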
PPM source models generalize adaptive Markov source models by replacing the single Markov model of fixed order $k$ by a collection of Markov models of varying orders. For example, given some finite memory constraint $M$, the original PPM algorithm uses Markov models of all orders $k \in \{-1, 0, \ldots, M\}$, where the order $k = -1$ model refers to a fixed uniform distribution on all symbols $x \in \mathcal{X}$. Typical values of $M$ for text compression satisfy $M \leq 6$ [5, 15]. PPM uses "escape" events to combine the prediction probabilities of its $M + 2$ Markov models. The escape
mechanism is employed on symbols that have not previously been seen in a particular context. Let Esc denote the escape character, and use $\mathcal{Y}$ to denote the modified alphabet $\mathcal{X} \cup \{\mathrm{Esc}\}$. Use $N_n(\mathrm{Esc}|s)$ to denote the number of distinct symbols that have followed context $s$, which equals the number of times an escape was used in the given context. PPM builds its probability estimate for symbol $a$ at time $n$ recursively. It begins by finding the longest context of the given symbol that has previously appeared in the data stream. It then uses the escape character to back off to lower order models if necessary to find a model in which symbol $a$ is not novel in the given context.
More precisely, for each $a \in \mathcal{Y}$ and each $k \in \{0, \ldots, M\}$ let
$$p(n, k, a) = \frac{N_n(a|x_{n-k}^{n-1})}{\sum_{b \in \mathcal{Y}} N_n(b|x_{n-k}^{n-1})},$$
and define
$$p(n, -1, a) = \frac{1}{|\mathcal{X}|},$$
where $|\mathcal{X}|$ is the size of alphabet $\mathcal{X}$. Define $K(n)$ to be the length of the longest context for symbol $n$ that has previously appeared in the given data stream. For each $a \in \mathcal{X}$ let $k(n, a)$ be the largest $k \in \{0, \ldots, K(n)\}$ such that $N_n(a|x_{n-k}^{n-1}) > 0$. The initial context index $K(n)$ equals zero if the context of symbol $n$ has not previously appeared in the data stream; the final context index $k(n, a)$ is set to $-1$ if symbol $a$ has not previously appeared in the given data string. PPM uses probability model
$$p_n(a|x^{n-1}) = p_n(a|x_{n-K(n)}^{n-1}) = p(n, k(n, a), a) \prod_{k = k(n, a) + 1}^{K(n)} p(n, k, \mathrm{Esc}).$$
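The following Python sketch transcribes these formulas directly from a table of counts. The helper names, the toy alphabet, and the driver at the end are illustrative assumptions; exclusions are ignored here and added in the refinement below.

    ESC = "<esc>"
    ALPHABET = list("ab")         # toy alphabet X
    Y = ALPHABET + [ESC]          # modified alphabet Y = X + {Esc}

    def build_counts(x, M):
        """Tally N(a|s) for all contexts s of length <= M seen in x, with
        N(Esc|s) equal to the number of distinct symbols that followed s."""
        counts, contexts = {}, set()
        for i in range(len(x)):
            for k in range(0, min(M, i) + 1):
                s = x[i - k:i]
                contexts.add(s)
                if counts.get((s, x[i]), 0) == 0:          # x[i] is novel in s
                    counts[(s, ESC)] = counts.get((s, ESC), 0) + 1
                counts[(s, x[i])] = counts.get((s, x[i]), 0) + 1
        return counts, contexts

    def p_nka(counts, x, n, k, a):
        """p(n, k, a) as displayed above."""
        if k == -1:
            return 1.0 / len(ALPHABET)                     # uniform order -1 model
        s = x[n - k:n]                                     # context x_{n-k}^{n-1}
        total = sum(counts.get((s, b), 0) for b in Y)
        return counts.get((s, a), 0) / total

    def ppm_probability(counts, contexts, x, n, a, M):
        """PPM estimate p_n(a | x^{n-1}) via the escape product."""
        # K(n): longest previously seen context of symbol n (0 if none).
        K = max([k for k in range(0, min(M, n) + 1)
                 if x[n - k:n] in contexts] or [0])
        # k(n, a): largest k <= K(n) with N(a | x_{n-k}^{n-1}) > 0, else -1.
        ka = max([k for k in range(0, K + 1)
                  if counts.get((x[n - k:n], a), 0) > 0] or [-1])
        p = p_nka(counts, x, n, ka, a)
        for k in range(ka + 1, K + 1):        # escape at orders k(n,a)+1, ..., K(n)
            p *= p_nka(counts, x, n, k, ESC)
        return p

    x = "abaabab"
    counts, contexts = build_counts(x[:-1], M=2)   # counts over x_1, ..., x_{n-1}
    print(ppm_probability(counts, contexts, x, len(x) - 1, x[-1], M=2))  # 0.2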
This model is inefficient in its normalization of $p(n, k, a)$ for $k(n, a) \leq k < K(n)$. By describing an escape character in model $k$, the encoder tells the decoder that the observed symbol satisfies the equation $N_n(a|s) = 0$. As a result, both the encoder and the decoder may remove from their low order probability estimates all symbols $a \in \mathcal{X}$ such that $N_n(a|s) > 0$. Modifying the normalizing constants accordingly improves system performance, and thus this modification is assumed in the work that follows.
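Continuing the sketch above (and reusing its definitions), exclusions can be folded in by carrying down the set of symbols ruled out by each escape; again this is an illustrative transcription rather than the paper's implementation.

    def p_nka_excl(counts, x, n, k, a, excluded):
        """p(n, k, a) with symbols excluded by higher-order escapes
        removed from the normalizing sum."""
        if k == -1:
            return 1.0 / len([b for b in ALPHABET if b not in excluded])
        s = x[n - k:n]
        total = sum(counts.get((s, b), 0) for b in Y if b not in excluded)
        return counts.get((s, a), 0) / total

    def ppm_probability_excl(counts, contexts, x, n, a, M):
        """PPM estimate with exclusions: each escape tells the decoder that
        every symbol with N(b|s) > 0 in the escaping context is ruled out."""
        K = max([k for k in range(0, min(M, n) + 1)
                 if x[n - k:n] in contexts] or [0])
        ka = max([k for k in range(0, K + 1)
                  if counts.get((x[n - k:n], a), 0) > 0] or [-1])
        p, excluded = 1.0, set()
        for k in range(K, ka, -1):             # escapes, highest order first
            s = x[n - k:n]
            p *= p_nka_excl(counts, x, n, k, ESC, excluded)
            excluded |= {b for b in ALPHABET if counts.get((s, b), 0) > 0}
        return p * p_nka_excl(counts, x, n, ka, a, excluded)

    # 0.25 versus 0.20 without exclusions: the ruled-out symbol no longer
    # dilutes the lower-order normalizing constant.
    print(ppm_probability_excl(counts, contexts, x, len(x) - 1, x[-1], M=2))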
PPM* [6] modifies PPM by removing the a priori memory constraint $M$ to allow for unbounded contexts. Models are kept for all contexts of all lengths that have previously appeared in the data stream. In choosing its initial context, PPM* uses the shortest matching deterministic context, where a context is deterministic if the number of symbols that have previously appeared in that context is exactly equal to one. If no such matching deterministic context exists, then PPM* uses the longest matching context instead.
In [10], Bunton considers a collection of variations on PPM*. One of those variations uses the "update exclusion" mechanism of [15]. The update exclusion method replaces count $N_n(a|s)$ in the calculation of $p_n(a|s)$ with a modified count $N'_n(a|s)$. While $N$ increments by one each time symbol $a$ is seen in context $s \in \mathcal{X}^*$, $N'$ increments by one only if symbol $a$ is either novel in context $s$ or $s$ is a suffix of some longer context $t \in \mathcal{X}^*$ such that $|t| = |s| + 1$ and $a$ is novel in context $t$.
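A minimal sketch of this counting rule follows (with bounded context lengths for simplicity; PPM*'s unbounded contexts and the surrounding probability machinery are omitted, and all names are illustrative):

    def update_with_exclusion(counts, counts_ue, x, i, max_order):
        """Observe symbol a = x[i]: N increments in every context, while the
        update-excluded count N' increments only where a is novel, or in the
        context one symbol shorter than a context in which a is novel."""
        a = x[i]
        K = min(max_order, i)
        # Novelty of a in each context, evaluated before any updates.
        novel = {k: counts.get((x[i - k:i], a), 0) == 0 for k in range(K + 1)}
        for k in range(K + 1):
            s = x[i - k:i]
            counts[(s, a)] = counts.get((s, a), 0) + 1     # N always increments
            # s is the suffix with |t| = |s| + 1 of context t = x[i-k-1:i].
            if novel[k] or (k + 1 <= K and novel[k + 1]):
                counts_ue[(s, a)] = counts_ue.get((s, a), 0) + 1

    counts, counts_ue = {}, {}
    x = "abab"
    for i in range(len(x)):
        update_with_exclusion(counts, counts_ue, x, i, max_order=2)
    # N("b"|"") = 2 but N'("b"|"") = 1: the second 'b' was predicted by a
    # longer context, so it is not propagated down to order 0.
    print(counts[("", "b")], counts_ue[("", "b")])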
III Suffix Trees
In the original PPM* algorithm [6], the contexts are stored in a "context trie." In [10], the contexts are stored in a suffix tree. This paper follows an approach closer to that of the latter work. A suffix tree for data sequence $x^n = (x_1, x_2, \ldots, x_n)$ describes all suffixes of $x^n$. We denote those suffixes by $\{s_i\}_{i=1}^{n}$, where $s_i = x_i^n = (x_i, \ldots, x_n)$.
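As a small illustration of these definitions, the following naive sketch enumerates the suffixes $s_i$ and inserts them into an uncompressed suffix trie; the end-of-string marker and the quadratic construction are illustrative only, in contrast to the linear-time compressed construction of McCreight's algorithm discussed below.

    def suffixes(x, eos="$"):
        """The suffixes s_i = x_i^n of x, each terminated by a unique
        end-of-string character."""
        x = x + eos
        return [x[i:] for i in range(len(x))]

    def suffix_trie(x, eos="$"):
        """Naive uncompressed suffix trie built by repeated insertion:
        O(n^2) nodes in the worst case, versus O(n) for a suffix tree."""
        root = {}
        for s in suffixes(x, eos):
            node = root
            for c in s:
                node = node.setdefault(c, {})
        return root

    print(suffixes("aab"))  # ['aab$', 'ab$', 'b$', '$']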
Using "$" to denote a unique end-of-string character, the suffix tree for a fixed string