Compression Algorithms


École Polytechnique Fédérale de Lausanne
Master Semester Project

Compression Algorithms

Author: Ludovic Favre
Supervisor: Ghid Maatouk
Professor: Amin Shokrollahi

June 11, 2010

Contents

1 Theory for Data Compression
  1.1 Model
  1.2 Entropy
  1.3 Source Coding
    1.3.1 Bound on the optimal code length
    1.3.2 Other properties
2 Source Coding Algorithms
  2.1 Huffman coding
    2.1.1 History
    2.1.2 Description
    2.1.3 Optimality
  2.2 Arithmetic coding
3 Adaptive Dictionary techniques: Lempel-Ziv
  3.1 History
  3.2 LZ77
    3.2.1 LZ77 encoding and decoding
    3.2.2 Performance discussion
  3.3 LZ78
    3.3.1 LZ78 encoding and decoding
    3.3.2 Optimality
  3.4 Improvements for LZ77 and LZ78
4 Burrows-Wheeler Transform
  4.1 History
  4.2 Description
    4.2.1 Encoding
    4.2.2 Decoding
    4.2.3 Why it compresses well
  4.3 Algorithms used in combination with BWT
    4.3.1 Run-length encoding
    4.3.2 Move-to-front encoding
5 Implementation
  5.1 LZ78
    5.1.1 Code details
  5.2 Burrows-Wheeler Transform
  5.3 Huffman coding
    5.3.1 Binary input and output
    5.3.2 Huffman implementation
  5.4 Move-to-front
  5.5 Run-length encoding
  5.6 Overview of source files
6 Practical Results
  6.1 Benchmark files
    6.1.1 Notions used for comparison
    6.1.2 Other remarks
  6.2 Lempel-Ziv 78
    6.2.1 Lempel-Ziv 78 with dictionary reset
    6.2.2 Comparison between with and without dictionary reset version
    6.2.3 Comparison between my LZ78 implementation and GZIP
  6.3 Burrows-Wheeler Transform
    6.3.1 Comparison of BWT schemes to LZ78
    6.3.2 Influence of the block size
    6.3.3 Comparison between my optimal BWT method and BZIP2
  6.4 Global comparison
7 Supplementary Material
  7.1 Using the program
    7.1.1 License
    7.1.2 Building the program
  7.2 Collected data
    7.2.1 Scripts
    7.2.2 Spreadsheets
    7.2.3 Repository
8 Conclusion

Introduction

With the increasing amount of data traveling by various means, such as wireless network links from mobile phones to servers, lossless data compression has become an important factor in optimizing spectrum utilization. As a computer science student with an interest in domains like computational biology, it was important for me to understand how to process the large amounts of data coming from high-throughput sequencing technologies.
Since I am interested in both algorithms and concrete data processing, getting in touch with data compression techniques immediately appealed to me. For this semester project, it was decided to focus on two lossless compression algorithms so that they could be compared. The algorithms chosen for implementation were Lempel-Ziv 78 and the more recent Burrows-Wheeler Transform, which enables well-known techniques such as run-length encoding, move-to-front and Huffman coding to easily outperform the more complicated Lempel-Ziv-based techniques in most situations. These two techniques take very different approaches to compressing data and are used, for example, in the GZip¹ and BZip2² software.

In this report, I will first introduce some information theory material and present the theoretical part of the project, in which I learnt how popular compression techniques attempt to reduce the size required for heterogeneous types of data. The subsequent chapters will then detail my practical work during the semester and present the implementation I have done in C/C++. Finally, the last two chapters consist of the results obtained on the famous Calgary Corpus benchmark files, where I will highlight the differences in performance and explain the choices made in actual compression software.

¹ http://www.gnu.org/software/gzip (May 31, 2010)
² http://www.bzip.org (May 31, 2010)

Chapter 1
Theory for Data Compression

1.1 Model

The general model used for the theoretical part is the first-order model [1]: in this model, the symbols are independent of one another, and the probability distribution of the symbols is determined by the source X. We will consider X as a discrete random variable with alphabet A¹. We will also assume that there is a probability mass function p(x) over A, and we denote a finite sequence of length n by X^n.

1.2 Entropy

In information theory, the concept of entropy is due to Claude Shannon (1948)².
It is used to quantify the minimal average number of bits required to encode a source X.

Definition 1. The entropy H(X) of a discrete random variable X is defined as

    H(X) = -\sum_{x \in A} p(x) \log p(x)

where p(x) is the probability mass function, i.e. the probability that x \in A is encountered.

¹ The alphabet A of X is the set of all possible symbols X can output.
² http://en.wikipedia.org/wiki/Entropy_(information_theory)

1.3 Source Coding

1.3.1 Bound on the optimal code length

Before giving a bound involving the entropy, we have to introduce some definitions. The first introduces the notions of codeword and binary code.

Definition 2. A binary code C for the random variable X is a mapping from A to the set of finite binary strings. We denote by C(x) the codeword mapped to x \in A and by l(x) the length of C(x).

Moreover, a property that is often wanted for a binary code is to be instantaneous:

Definition 3. A code is said to be instantaneous (or prefix-free) if no codeword is a prefix of any other codeword.

The instantaneous property is quite interesting, since it permits transmitting the codewords for multiple input symbols x_1, x_2, x_3, ... by simply concatenating them as C(x_1)C(x_2)C(x_3)..., while still being able to decode x_i instantly after C(x_i) has been received.

Another definition required for the entropy bound theorem concerns the expected length of a binary code C.

Definition 4. Given a binary code C, the expected length of C is given by

    L(C) = \sum_{x \in A} p(x) l(x)

Finally, the Kraft inequality connects the instantaneous property of a code to its codeword lengths.

The Kraft inequality

The theorem formalizing the Kraft inequality is given below:
Theorem 1 (Kraft inequality [7, p.107]). For any instantaneous (prefix) code over an alphabet of size D, the codeword lengths l(x_1), l(x_2), ..., l(x_m) must satisfy the inequality

    \sum_i D^{-l(x_i)} \le 1

Conversely, given a set of codeword lengths that satisfies this inequality, there exists an instantaneous code with these word lengths.

The Kraft inequality will not be proven here; the proof can be found in [7, pp.107-109]. We are now able to state the theorem for the entropy bound on the expected length of a binary code C.

Theorem 2. The expected length of any code C satisfies the following double inequality:

    H(X) \le L(C) \le H(X) + 1

Proof. The proof takes place in two phases.

1. We first prove the upper bound, that is, L(C) \le H(X) + 1. We choose an integer word-length assignment for the word x_i:

    l(x_i) = \left\lceil \log_D \frac{1}{p(x_i)} \right\rceil

These lengths satisfy the Kraft inequality because

    \sum_i D^{-\lceil \log_D \frac{1}{p(x_i)} \rceil} \le \sum_i D^{-\log_D \frac{1}{p(x_i)}} = \sum_i p(x_i) = 1,

hence, by Theorem 1, there exists an instantaneous code with these word lengths. The upper bound is then obtained as follows:

    L(C) = \sum_{x \in A} p(x) l(x)
         = \sum_{x \in A} p(x) \left\lceil \log_D \frac{1}{p(x)} \right\rceil
         \le \sum_{x \in A} p(x) \left( \log_D \frac{1}{p(x)} + 1 \right)
         = H(X) + 1,

which proves the upper bound.

2. The lower bound is obtained as follows. By our word-length assignment,

    \log_D \frac{1}{p(x_i)} \le l(x_i),

and therefore

    L(C) = \sum_{x \in A} p(x) l(x) \ge \sum_{x \in A} p(x) \log_D \frac{1}{p(x)} = H(X).

(Strictly speaking, this argument bounds the expected length of the code constructed above; the lower bound for an arbitrary instantaneous code follows from the Kraft inequality, see [7].)

This proves the two inequality parts of Theorem 2.

1.3.2 Other properties

The entropy can also be used to characterize multiple sources (random variables). For such cases, we use the joint entropy.

Definition 5. The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as [7, p.16]

    H(X, Y) = -\sum_{(x, y)} p(x, y) \log p(x, y)

It is also possible to use the conditional entropy:

Definition 6. The conditional entropy H(Y|X) is defined as

    H(Y|X) = \sum_x p(x) H(Y|X = x)
           = -\sum_x p(x) \sum_y p(y|x) \log p(y|x)
           = -\sum_x \sum_y p(x, y) \log p(y|x)
Theorem 3. From the previous definitions, we obtain the following theorem:

    H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)

Proof. Using the previously seen definitions and properties, proving Theorem 3 is simply a matter of expanding the formulas:

    H(X, Y) = -\sum_x \sum_y p(x, y) \log p(x, y)
            = -\sum_x \sum_y p(x, y) \log \bigl( p(x) p(y|x) \bigr)
            = -\sum_x \sum_y p(x, y) \log p(x) - \sum_x \sum_y p(x, y) \log p(y|x)
            = -\sum_x p(x) \log p(x) - \sum_x \sum_y p(x, y) \log p(y|x)
            = H(X) + H(Y|X)

The proof is similar for the second part of the equality.
