PPM Performance with BWT Complexity: A New Method for Lossless Data Compression

Michelle Effros
California Institute of Technology
[email protected]

M. Effros is with the Department of Electrical Engineering (MC 136-93) at the California Institute of Technology, Pasadena, CA 91125. This material is based upon work partially supported by NSF Award No. CCR-9909026.

Abstract

This work combines a new fast context-search algorithm with the lossless source coding models of PPM to achieve a lossless data compression algorithm with the linear context-search complexity and memory of BWT and Ziv-Lempel codes and the compression performance of PPM-based algorithms. Both sequential and nonsequential encoding are considered. The proposed algorithm yields an average rate of 2.27 bits per character (bpc) on the Calgary corpus, comparing favorably to the 2.33 and 2.34 bpc of PPM5 and PPM* and the 2.43 bpc of BW94, but not matching the 2.12 bpc of PPMZ9, which, at the time of this publication, gives the greatest compression of all algorithms reported on the Calgary corpus results page. The proposed algorithm gives an average rate of 2.14 bpc on the Canterbury corpus. The Canterbury corpus web page gives average rates of 1.99 bpc for PPMZ9, 2.11 bpc for PPM5, 2.15 bpc for PPM7, and 2.23 bpc for BZIP2 (a BWT-based code) on the same data set.

I Introduction

The Burrows-Wheeler Transform (BWT) [1] is a reversible sequence transformation that is becoming increasingly popular for lossless data compression. The BWT rearranges the symbols of a data sequence in order to group together all symbols that share the same unbounded history or "context." Intuitively, this operation is achieved by forming a table in which each row is a distinct cyclic shift of the original data string. The rows are then ordered lexicographically. The BWT outputs the last character of each row, which is the character that precedes the first character of the given cyclic shift in the original data string. For a finite memory source, this ordering groups together all symbols with the same conditional distribution and leads to a family of low complexity universal lossless source codes [2, 3].

While the best of the universal codes described in [2, 3] converges to the optimal performance at a rate within a constant factor of the optimum, the performance of these codes on finite sequences from sources such as text fails to meet the performance of some competing algorithms [4]. This failure stems in part from the fact that in grouping together all symbols with the same context, the BWT makes the context information inaccessible to the decoder.

Originally, the work described in this paper set out to modify the BWT-based codes of [2, 3, 4] in order to make the context information accessible to the decoder. What evolved, however, was effectively a variation on the Prediction by Partial Mappings (PPM) algorithms of [5, 6]. Thus the work presented here can be viewed either as a modification of the BWT to include context information or as a modification of PPM to allow for the computational efficiency of the BWT. The description that follows takes the latter viewpoint.

The PPM algorithm and its descendants are extremely effective techniques for sequential lossless source coding.
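To make the cyclic-shift construction above concrete before the discussion of PPM continues, the following is a minimal sketch of the forward transform. Python is used here purely for illustration; it is not the implementation language of the paper, and the quadratic-time table construction stands in for the linear-time methods discussed later.

    def bwt(s: str, sentinel: str = "$") -> str:
        """Naive Burrows-Wheeler Transform: form every cyclic shift of
        the input, sort the shifts lexicographically, and output the
        last column. A sentinel, assumed smaller than every other
        symbol, marks the end of the string so the transform is
        invertible."""
        s = s + sentinel
        # Each row of the table is a distinct cyclic shift of s.
        table = sorted(s[i:] + s[:i] for i in range(len(s)))
        # The last character of each row precedes the first character
        # of that cyclic shift in the original data string.
        return "".join(row[-1] for row in table)

    print(bwt("banana"))  # prints 'annb$aa': like contexts are grouped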
In tests on data sets such as the Calgary and Canterbury corpora, the performance of PPM codes consistently meets or exceeds the performance of a wide array of algorithms, including techniques based on the BWT and Ziv-Lempel style algorithms such as LZ77 [7], LZ78 [8], and their descendants.

Yet despite their excellent performance, PPM codes are far less commonly applied than algorithms like LZ77 and LZ78. The Ziv-Lempel codes are favored over PPM-based codes for their relative efficiencies in memory and computational complexity [9]. Straightforward implementations of some PPM algorithms require O(n^2) computational complexity and memory to code a data sequence of length n. While implementations requiring only O(n) memory have been proposed in the literature [6, 10], the high computational complexity and encoding time of PPM algorithms remains an impediment to their more widespread use.

This paper introduces a new context-search algorithm. While the proposed algorithm could also be employed in Ziv-Lempel and BWT-based codes, its real distinction is its applicability within PPM-based codes. The PPM code used is a minor variation on an existing PPM algorithm [10]. Our code achieves average rates of 2.27 and 2.14 bits per character (bpc), respectively, on the Calgary and Canterbury corpora and is efficient in both space and computation. The algorithm uses O(n) memory and achieves O(n) complexity in its search for the longest matching context for each symbol of a length-n data sequence. The O(n) complexity is a significant improvement over the worst-case O(n^2) complexity of a direct context search. Several variations on the given approach are presented. The variations include both sequential and non-sequential encoding techniques and allow the user to trade off encoder memory, delay, and computational complexity.

The remainder of the paper is organized as follows. Section II contains a review of PPM algorithms. The review focuses on PPM* [6] and its exclusion-based variation from [10]. A short introduction to suffix trees and McCreight's suffix tree construction algorithm [11] follows in Section III. McCreight's algorithm is used in implementations of Ziv-Lempel [12] and BWT [1] codes. The algorithm description is followed by a brief discussion of the difficulties inherent in applying McCreight's algorithm in PPM-based codes. Section IV describes a new method for combining PPM data compression with a new suffix-tree algorithm. Section V gives experimental results and conclusions.

II PPM Algorithms

The lossless compression algorithms in the PPM family of codes combine sequential arithmetic source coding (see, for example, [13], [14], or texts such as [9]) with adaptive Markov-style data models. Given a probability model $p(x^n)$ for symbols $x_1, \ldots, x_n$ in some finite source alphabet $\mathcal{X}$, arithmetic coding guarantees a description length $\ell(x^n)$ such that $\ell(x^n) < -\log p(x^n) + 2$ for all possible $x^n = (x_1, \ldots, x_n) \in \mathcal{X}^n$ [13]. Thus, given a good source model, arithmetic codes yield excellent lossless source coding performance.

A simple approach to source modeling is the Markov model approach. For any finite integer k, a Markov model of order k conditions the probability of the next symbol on the previous k symbols. Thus we code symbol $x_n$ using the probability $p(x_n | x^{n-1}) = p(x_n | x_{n-k}^{n-1})$, where the string $x_{n-k}^{n-1} = (x_{n-k}, x_{n-k+1}, \ldots, x_{n-1})$ describes the "context" of past information on which our estimation of likely future events is conditioned.
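The description-length guarantee is easy to check numerically. The following toy sketch (the model and data are hypothetical, not from the paper) computes the ideal code length $-\log_2 p(x^n)$ under a simple fixed order-1 Markov model and the resulting arithmetic-coding bound:

    import math

    def ideal_code_length(sequence, model):
        """Return -log2 p(x^n), the ideal description length in bits,
        where model(context, symbol) gives the per-symbol conditional
        probability p(x_i | x^{i-1})."""
        return sum(-math.log2(model(sequence[:i], x))
                   for i, x in enumerate(sequence))

    def order1(context, symbol):
        """Hypothetical order-1 model on {a, b}: repeat the previous
        symbol with probability 0.9; uniform on the first symbol."""
        if not context:
            return 0.5
        return 0.9 if symbol == context[-1] else 0.1

    x = "aaaabbbb"
    L = ideal_code_length(x, order1)
    print(f"-log2 p(x^n) = {L:.2f} bits")                # about 5.23 bits
    print(f"arithmetic code length < {L + 2:.2f} bits")  # bound from [13]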
In an adaptive code, the probability estimates are built using information about the previously coded data stream. Thus the encoder may update its probability estimates $p_n(a|s)$ for all $a \in \mathcal{X}$ and all $s \in \mathcal{X}^k$ at each time step n in order to better reflect its current knowledge of the source. The subscript n on $p_n(a|s)$ here makes explicit the adaptive nature of the probability estimate; $p_n(a|s)$ is the estimate of probability $p(a|s)$ at time n, just before the nth symbol is coded. The estimate $p_n(a|s)$ is based on the $n-1$ previously coded symbols in the data stream. Let $N_n(a|s)$ denote the number of times that symbol a has followed sequence $s \in \mathcal{X}^k$ in the previous data stream, where $N_n(a|s) = \sum_{i=k+1}^{n-1} 1(x_{i-k}^{i} = sa)$ for each $a \in \mathcal{X}$. If the probability model $p_n(a|s)$ relies only on the conditional symbol counts $\{N_n(a|s) : a \in \mathcal{X}, s \in \mathcal{X}^k\}$ and if the decoder can sequentially decipher the information sent to it, then the decoder can track the changing probability model $p_n$ by keeping a tally of symbol counts identical to the one used at the encoder.

PPM source models generalize adaptive Markov source models by replacing the single Markov model of fixed order k by a collection of Markov models of varying orders. For example, given some finite memory constraint M, the original PPM algorithm uses Markov models of all orders $k \in \{-1, 0, \ldots, M\}$, where the order $k = -1$ model refers to a fixed uniform distribution on all symbols $x \in \mathcal{X}$. Typical values of M for text compression satisfy $M \leq 6$ [5, 15]. PPM uses "escape" events to combine the prediction probabilities of its M + 2 Markov models. The escape mechanism is employed on symbols that have not previously been seen in a particular context.
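The precise escape estimator differs across PPM variants; the sketch below uses one common choice (a "method C" style escape, in which the escape count equals the number of distinct symbols seen in a context). It is illustrative only, not necessarily the variant used in this paper, and it omits the exclusion mechanism mentioned above. The class name, byte alphabet, and parameter values are assumptions for the example. It shows how prediction falls back from the longest available context toward the fixed uniform order -1 model, multiplying in an escape probability at each step.

    from collections import defaultdict

    ALPHABET_SIZE = 256  # byte alphabet, an assumption for illustration

    class ToyPPM:
        """Illustrative PPM-style predictor with escape events: keeps
        counts N(a|s) for contexts s of orders 0..M; order -1 is a
        fixed uniform distribution."""

        def __init__(self, M=3):
            self.M = M
            self.counts = defaultdict(lambda: defaultdict(int))

        def update(self, history, symbol):
            # Tally the symbol in every context of order 0..M.
            for k in range(min(self.M, len(history)) + 1):
                s = history[len(history) - k:]
                self.counts[s][symbol] += 1

        def prob(self, history, symbol):
            """Estimate p(symbol | history): walk from the longest
            context down, multiplying in an escape probability each
            time the symbol has not been seen, until some context
            predicts it or the order -1 model is reached."""
            p = 1.0
            for k in range(min(self.M, len(history)), -1, -1):
                ctx = self.counts.get(history[len(history) - k:])
                if not ctx:
                    continue            # context never observed: no escape cost
                total = sum(ctx.values())
                distinct = len(ctx)     # "method C" escape count
                n = ctx.get(symbol, 0)
                if n > 0:
                    return p * n / (total + distinct)
                p *= distinct / (total + distinct)  # escape to a shorter context
            return p / ALPHABET_SIZE    # order -1: fixed uniform distribution

    # Sequential use: predict, then update, one symbol at a time.
    model, history = ToyPPM(M=2), ""
    for ch in "abracadabra":
        p = model.prob(history, ch)  # probability the arithmetic coder would use
        model.update(history, ch)
        history += ch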