École Polytechnique Fédérale de Lausanne

Master Semester Project

Compression

Author: Ludovic Favre
Supervisor: Ghid Maatouk
Professor: Amin Shokrollahi

June 11, 2010

Contents

1 Theory for Data Compression
  1.1 Model
  1.2 Entropy
  1.3 Source Coding
    1.3.1 Bound on the optimal code length
    1.3.2 Other properties

2 Source Coding Algorithms
  2.1 Huffman coding
    2.1.1 History
    2.1.2 Description
    2.1.3 Optimality
  2.2 Arithmetic coding

3 Adaptive Dictionary techniques: Lempel-Ziv
  3.1 History
  3.2 LZ77
    3.2.1 LZ77 encoding and decoding
    3.2.2 Performance discussion
  3.3 LZ78
    3.3.1 LZ78 encoding and decoding
    3.3.2 Optimality
  3.4 Improvements for LZ77 and LZ78

4 Burrows-Wheeler Transform
  4.1 History
  4.2 Description
    4.2.1 Encoding
    4.2.2 Decoding
    4.2.3 Why it compresses well
  4.3 Algorithms used in combination with BWT
    4.3.1 Run-length encoding
    4.3.2 Move-to-front encoding

5 Implementation
  5.1 LZ78
    5.1.1 Code details
  5.2 Burrows Wheeler Transform
  5.3 Huffman coding
    5.3.1 Binary input and output
    5.3.2 Huffman implementation
  5.4 Move-to-front
  5.5 Run-length encoding
  5.6 Overview of source files

6 Practical Results
  6.1 Benchmark files
    6.1.1 Notions used for comparison
    6.1.2 Other remarks
  6.2 Lempel-Ziv 78
    6.2.1 Lempel-Ziv 78 with dictionary reset
    6.2.2 Comparison between with and without dictionary reset version
    6.2.3 Comparison between my LZ78 implementation and GZIP
  6.3 Burrows-Wheeler Transform
    6.3.1 Comparison of BWT schemes to LZ78
    6.3.2 Influence of the block size
    6.3.3 Comparison between my optimal BWT method and BZIP2
  6.4 Global comparison

7 Supplementary Material
  7.1 Using the program
    7.1.1 License
    7.1.2 Building the program
  7.2 Collected data
    7.2.1 Scripts
    7.2.2 Spreadsheets
    7.2.3 Repository

8 Conclusion

Introduction

With the increasing amount of data traveling over various channels, such as the wireless network links from mobile phones to servers, lossless compression has become an important factor in optimizing spectrum utilization.

As a student with an interest in domains like computational biology, understanding how to handle the large amounts of data coming from high-throughput sequencing technologies was important to me. Since I am interested in both algorithms and concrete data processing, getting in touch with data compression techniques immediately appealed to me.

During this semester project, it was decided to focus on two algorithms to allow some comparison. The algorithms chosen for implementation were Lempel-Ziv 78 and the more recent Burrows-Wheeler Transform, which enables well-known techniques such as run-length encoding, move-to-front and Huffman coding to easily outperform, in most situations, the more complicated Lempel-Ziv-based techniques. These two compression techniques take very different approaches to compressing data and are used, for example, in the GZip1 and BZip22 software. In this report, I will first introduce some material and present the theoretical part of the project, in which I learnt how popular compression techniques attempt to reduce the size required for heterogeneous types of data. The subsequent chapters will then detail my practical work during the semester and present the implementation I have done in C/C++. Finally, the last two chapters consist of the results obtained on the famous benchmark files, where I will highlight the differences in performance and explain the choices made in actual compression software.

1 http://www.gnu.org/software/gzip (May 31, 2010)
2 http://www.bzip.org (May 31, 2010)

Chapter 1

Theory for Data Compression

1.1 Model

The general model used for the theoretical part is the First-Order Model [1]: in this model, the symbols are independent of one another, and the probability distribution of the symbols is determined by the source X. We will consider X as a discrete random variable with alphabet A1. We will also assume that there is a probability mass function p(x) over A. Let us also denote by X^n a finite sequence of length n.

1.2 Entropy

In information theory, the concept of entropy is due to Claude Shannon (1948)2. It is used to quantify the minimal average number of bits required to encode a source X.

Definition 1. The entropy H(X) of a discrete random variable X is defined as

H(X) = -\sum_{x \in A} p(x) \log p(x)

where p(x) is the probability mass function for x ∈ A.
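As a quick numerical illustration (my own example, not taken from the references), consider a source with alphabet A = {a, b, c, d} and probabilities (1/2, 1/4, 1/8, 1/8). Its entropy is

H(X) = \frac{1}{2}\log 2 + \frac{1}{4}\log 4 + \frac{1}{8}\log 8 + \frac{1}{8}\log 8 = 0.5 + 0.5 + 0.375 + 0.375 = 1.75 \text{ bits}

so no lossless code for this source can use fewer than 1.75 bits per symbol on average.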

1.3 Source Coding

1.3.1 Bound on the optimal code length

Before giving a bound on the code length in terms of the entropy, we have to introduce some definitions. The first one introduces the notions of codeword and binary code.

Definition 2. A binary code C for the random variable X is a mapping from A to the set of finite binary strings. Let us denote by C(x) the codeword mapped to x ∈ A, and let l(x) be the length of C(x).

Moreover, a property that is often desired for a binary code is that it be instantaneous:

1 The alphabet A of X is the set of all possible symbols X can output.
2 http://en.wikipedia.org/wiki/Entropy_(information_theory)


Definition 3. A code is said to be instantaneous (or prefix-free) if no codeword is a prefix of any other codeword.

The instantaneous property is quite interesting since it permits transmitting the codewords for multiple input symbols x_1, x_2, x_3, \cdots by simply concatenating C(x_1)C(x_2)C(x_3) \cdots, while still being able to decode x_i instantly after C(x_i) has been received. Another definition required for the entropy bound theorem concerns the expected length of a binary code C.

Definition 4. Given a binary code C, the expected length of C is given by

L(C) = \sum_{x \in A} p(x) l(x)

Finally, the Kraft inequality connects the instantaneous property of a code to its codeword lengths.

The Kraft inequality

The theorem formalizing the Kraft inequality is given below:

Theorem 1 (Kraft inequality [7, p.107]). For any instantaneous code (prefix code) over an alphabet of size D, the codeword lengths l(x_1), l(x_2), \cdots, l(x_m) must satisfy the inequality

\sum_i D^{-l(x_i)} \leq 1

Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these word lengths.

The Kraft inequality theorem will not be proven here. The proof can be found in [7, pp.107-109].
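As a small sanity check (again my own example), the prefix-free code {0, 10, 110, 111} has lengths (1, 2, 3, 3) and satisfies the Kraft inequality with equality:

2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{8} = 1

These lengths match the distribution (1/2, 1/4, 1/8, 1/8) of the entropy example in Section 1.2, for which this code therefore achieves L(C) = H(X) = 1.75 bits.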

We are now able to give the theorem for the entropy bound on the expected length of a binary code C.

Theorem 2. The expected length L(C) of an optimal code C satisfies the following double inequality:

H(X) \leq L(C) \leq H(X) + 1

Proof. The proof proceeds in two steps:

1. We first prove the upper bound, that is, L(C) \leq H(X) + 1. We choose an integer word-length assignment for the word x_i:

l(x_i) = \left\lceil \log_D \frac{1}{p(x_i)} \right\rceil

These lengths satisfy the Kraft inequality because

\sum_i D^{-\lceil \log_D \frac{1}{p(x_i)} \rceil} \leq \sum_i D^{-\log_D \frac{1}{p(x_i)}} = \sum_i p(x_i) = 1

hence there exists a code with these word lengths. The upper bound is then obtained as follows, using Theorem 1:

\sum_{x \in A} p(x) l(x) = \sum_{x \in A} p(x) \left\lceil \log_D \frac{1}{p(x)} \right\rceil \leq \sum_{x \in A} p(x) \left( \log_D \frac{1}{p(x)} + 1 \right) = H(X) + 1

which proves the upper bound.

2. The lower bound is obtained as follows. By our word-length assignment, we can deduce that

\log_D \frac{1}{p(x_i)} \leq l(x_i)

and therefore we obtain

L(C) = \sum_{x \in A} p(x) l(x) \geq \sum_{x \in A} p(x) \log_D \frac{1}{p(x)} = H(X)

which proves the two inequalities of Theorem 2.

1.3.2 Other properties

The entropy can also be used to characterize multiple sources (random variables). For such cases, we use the joint entropy.

Definition 5. The joint entropy H(X,Y) of a pair of discrete random variables (X,Y) with a joint distribution p(x,y) is defined as [7, p.16]

H(X,Y) = -\sum_{(x,y)} p(x,y) \log p(x,y)

It is also possible to use the conditional entropy:

Definition 6. The conditional entropy H(Y|X) is defined as

H(Y|X) = \sum_{x} p(x) H(Y|X=x)
       = -\sum_{x} p(x) \sum_{y} p(y|x) \log p(y|x)
       = -\sum_{x} \sum_{y} p(x,y) \log p(y|x)

From the previous definitions, we obtain the following theorem:

Theorem 3 (chain rule).

H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)

Proof. Using the previously seen definitions and properties, proving Theorem 3 is simply a matter of expanding formulas:

H(X,Y) = -\sum_{x} \sum_{y} p(x,y) \log p(x,y)
       = -\sum_{x} \sum_{y} p(x,y) \log \big( p(x)\, p(y|x) \big)
       = -\sum_{x} \sum_{y} p(x,y) \log p(x) - \sum_{x} \sum_{y} p(x,y) \log p(y|x)
       = -\sum_{x} p(x) \log p(x) - \sum_{x} \sum_{y} p(x,y) \log p(y|x)
       = H(X) + H(Y|X)

The proof is similar for the second part of the equality.

In the next chapters, the presented notions will be used to introduce the optimality of the Huffman and Lempel-Ziv algorithms.

Chapter 2

Source Coding Algorithms

2.1 Huffman coding

2.1.1 History

Huffman coding was developed by David A. Huffman while he was a Ph.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes"1. The idea behind his technique is to represent symbols with higher probability using fewer bits than symbols with lower probability.

2.1.2 Description

The entropy coding problem can be defined as follows:

Definition 7 (Entropy coding problem). Given a set of symbols and their corresponding probabilities, find a prefix-free binary code such that the expected codeword length is minimized.

Formally, we have A = \{x_1, x_2, \cdots, x_n\}, the alphabet of size n, and P = \{p_1, p_2, \cdots, p_n\}, the corresponding symbol probabilities, with \sum_{i=1}^{n} p_i = 1, and we want to find

C = \{c_1, c_2, \cdots, c_n\} such that L(C) = \sum_{i=1}^{n} l(c_i)\, p_i is minimized.

As most readers are probably familiar with Huffman coding, I will just present a quick review of it in Algorithm 1:

1 http://en.wikipedia.org/wiki/Huffman_coding (April 1, 2010)


Algorithm 1 Huffman coding
1: Q ← priority queue of nodes, where the node with the lowest probability has the highest priority
2: Create a node for each symbol and add it to Q
3: while Q.size > 1 do
4:   n1 ← Q.head
5:   Q.removeHead
6:   n2 ← Q.head
7:   Q.removeHead
8:   Create a new internal node n3 with probability equal to the sum of n1's and n2's probabilities, and with n1, n2 as left and right child respectively
9:   Q.add(n3)
10: end while
11: Traverse the tree from the root, assigning different bits to the left and right children and propagating to the leaves

Figure 2.1 illustrates Algorithm 1 graphically.

Figure 2.1: Illustration of a Huffman coding tree
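To make Algorithm 1 concrete, here is a compact C++ sketch of the tree construction and code assignment (my own illustration, not the project's Huffman class described in Chapter 5; the names Node, ByProb and huffman are hypothetical):

#include <map>
#include <memory>
#include <queue>
#include <string>
#include <vector>

// One node of the Huffman tree (a leaf when symbol >= 0).
struct Node {
    double prob;
    int symbol;                       // -1 for internal nodes
    Node* left = nullptr;
    Node* right = nullptr;
};

// Comparator so that the node with the lowest probability has the highest priority.
struct ByProb {
    bool operator()(const Node* a, const Node* b) const { return a->prob > b->prob; }
};

// Walk the finished tree, assigning '0' to left edges and '1' to right edges.
static void assign(const Node* n, const std::string& code,
                   std::map<int, std::string>& table) {
    if (n == nullptr) return;
    if (n->symbol >= 0) { table[n->symbol] = code.empty() ? "0" : code; return; }
    assign(n->left, code + '0', table);
    assign(n->right, code + '1', table);
}

std::map<int, std::string> huffman(const std::map<int, double>& probs) {
    std::vector<std::unique_ptr<Node>> pool;   // owns every node we create
    std::priority_queue<Node*, std::vector<Node*>, ByProb> q;
    for (const auto& kv : probs) {
        pool.push_back(std::unique_ptr<Node>(new Node{kv.second, kv.first}));
        q.push(pool.back().get());
    }
    while (q.size() > 1) {                     // merge the two least likely nodes
        Node* n1 = q.top(); q.pop();
        Node* n2 = q.top(); q.pop();
        pool.push_back(std::unique_ptr<Node>(new Node{n1->prob + n2->prob, -1, n1, n2}));
        q.push(pool.back().get());
    }
    std::map<int, std::string> table;
    if (!q.empty()) assign(q.top(), "", table);
    return table;
}

For the probabilities (1/2, 1/4, 1/8, 1/8) of the earlier example, this construction yields codeword lengths (1, 2, 3, 3), matching the entropy H(X) = 1.75 bits computed in Chapter 1.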

2.1.3 Optimality

The optimality of Huffman codes can be proved by induction [7, pp.123-127]. It has to be kept in mind that there exist many optimal codes; inverting all the bits of a code is just one way of obtaining another optimal code from a given one. We know that a code is optimal if \sum_{i=1}^{n} p_i\, l(c_i) is minimal.

Lemma 1. For any distribution, there is an optimal instantaneous code (with minimum expected length) that satisfies the following properties:

1. The lengths are ordered inversely with the probabilities: p_j > p_k \Rightarrow l(c_j) \leq l(c_k).

2. The two longest codewords have the same length.

3. Two of the longest codewords differ only in the last bit and correspond to the two least likely symbols.

Proof. I will not give the complete proof but rather the insight behind the formal one, which is based on proof by contradiction.

1. If this is not the case, then swapping the two codewords c_j and c_k decreases the sum \sum_{i=1}^{n} p_i\, l(c_i), giving a better code.

2. If the two longest codewords do not have the same length, then, using the fact that the code is prefix-free, we can delete the last bit of the longest codeword and achieve a better expected codeword length.

3. If a maximal-length codeword had no sibling, removing its last bit would preserve the prefix-free property and yield a better code, contradicting the optimality hypothesis; therefore every maximal-length codeword in an optimal code has a sibling. Moreover, the two longest codewords that differ only in the last bit must correspond to the two least likely symbols; otherwise, swapping codewords would again create a better code, contradicting the optimality hypothesis.

Proof of the optimality

A code satisfying Lemma 1 is called a canonical code. To prove the optimality of the Huffman code, we can observe that Lemma 1 is satisfied at each step of the Huffman algorithm. Using induction and Lemma 1, we can show that Huffman coding is optimal (for the formal proof, refer to [7, pp.125-127]).

2.2 Arithmetic coding

With Huffman coding, we have seen a popular method for producing variable-length codes. The main advantage of Huffman codes is the simplicity with which they can be produced, since a tree is used to build the codes.

Arithmetic coding takes a completely different approach to producing variable-length codes. It is especially useful when dealing with sources with a small alphabet, such as binary sources, and with alphabets with highly skewed probabilities [10, p.81]. An important fact about arithmetic coding is that the number of patents covering it influenced BZip and the JPEG file format to use Huffman coding instead2. Since I did not use arithmetic coding in the implementation part of the project, I will not present the details of this technique.

2 http://en.wikipedia.org/wiki/Arithmetic_coding#US_patents_on_arithmetic_coding (June 9, 2010)

Chapter 3

Adaptive Dictionary techniques: Lempel-Ziv

3.1 History

The Lempel-Ziv compression algorithms are due to Abraham Lempel and Jacob Ziv, who published two lossless data compression algorithms in papers from 1977 and 1978 (LZ77 and LZ78 are the respective abbreviations for these two algorithms) [13]. In those two papers, two different approaches to building adaptive dictionaries are provided. I will briefly present the LZ77 algorithm and then detail more precisely the LZ78 method, which I implemented for my project.

3.2 LZ77

In LZ77, the dictionary consists of a portion of the previously encoded sequence. The algorithm uses a sliding window made of two parts [10, p.123]: the search buffer and the look-ahead buffer. These two buffers are illustrated in Figure 3.1.

Figure 3.1: Representation of the search buffer and look-ahead buffer with the match pointer

The search buffer is used as a dictionary and contains part of the previously encoded sequence. The look-ahead buffer contains the sequence to encode. The offset is the distance from the look-ahead buffer to the match pointer. The length of the match is the number of consecutive symbols in the search buffer that match the same consecutive symbols in the look-ahead buffer.


3.2.1 LZ77 encoding and decoding

We now look at the way LZ77 encodes (and compresses) a sequence. Suppose it has a look-ahead buffer of size six and a search buffer of size seven. At each slide of the window, one of the following situations can be encountered by the LZ77 algorithm (with the corresponding output):

• no match is found : < 0, 0, c >

• a match is found : < o, l, c >

• the match extends inside the look-ahead buffer: < o, l, c >

Here o is the offset to the match pointer, l is the length of the match, and c is the code of the symbol following the matched string in the look-ahead buffer. Note that the third case is just a special case of the second.

For example, if we look at Figure 3.2, we can see that o = 7, l = 4 and c = code(r). Therefore, the transmitted triple is < 7, 4, code(r) >.

Figure 3.2: A step of the encoding
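To make the encoder loop concrete, here is a brute-force C++ sketch that emits such triples (my own illustration under simplifying assumptions, not GZip's hash-based search; the name lz77_encode is hypothetical):

#include <string>
#include <tuple>
#include <vector>

// Emit one <offset, length, next-symbol> triple per step; a miss is <0, 0, c>.
std::vector<std::tuple<int, int, char>>
lz77_encode(const std::string& in, int searchSize, int lookSize) {
    std::vector<std::tuple<int, int, char>> out;
    std::size_t pos = 0;
    while (pos < in.size()) {
        int bestOff = 0, bestLen = 0;
        std::size_t start = pos > (std::size_t)searchSize ? pos - searchSize : 0;
        for (std::size_t s = start; s < pos; ++s) {   // each candidate match pointer
            int len = 0;
            // The match may extend past pos into the look-ahead buffer.
            while (len < lookSize - 1 && pos + len + 1 < in.size() &&
                   in[s + len] == in[pos + len])
                ++len;
            if (len > bestLen) { bestLen = len; bestOff = (int)(pos - s); }
        }
        out.emplace_back(bestOff, bestLen, in[pos + bestLen]); // symbol after the match
        pos += bestLen + 1;
    }
    return out;
}

On the example of Figure 3.2, such a search would find the match of length 4 at offset 7 and emit < 7, 4, code(r) >.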

The decoding of the triples follows logically from the encoding procedure: the output sequence is reconstructed step by step, based on elements previously written to it. An illustration based on the previous triple is given for the decoder in Figure 3.3.

Figure 3.3: Decoding the next elements of the sequence with the received triple < 7, 4, code(r) >

For more details about LZ77, you can have a look at [10, pp.121-125].

3.2.2 Performance discussion

The most costly part of this method is finding the longest match. It has been proven by A. D. Wyner and J. Ziv in [12] that LZ77 is asymptotically optimal. LZ77 is used, for example, within GZip1. The implicit assumption of the LZ77 approach is that exact pattern copies occur close to each other. This assumption can be a drawback if two pattern copies are not located within the same sliding window. However, the sliding window is usually big and such problems will usually not occur often enough to have a significant impact on the quality of the compression. I will now present how the LZ78 approach tries to solve the problem of patterns not being located within the same window.

3.3 LZ78

The LZ78 approach solves the above LZ77 problem by dropping the concept of local search window and using an incrementally built dictionary.

3.3.1 LZ78 encoding and decoding The way the dictionary is built during the encoding is expressed in Algorithm2.

1 http://www.gnu.org/software/gzip (May 31, 2010)

Algorithm 2 LZ78
1: B ← empty buffer of symbols
2: D ← empty dictionary
3: while not End-Of-File do
4:   S ← the next symbol from the input stream
5:   I ← index of B in D
6:   B.append(S)
7:   if B is not in D then
8:     Add B to the end of the dictionary
9:     Output the two symbols < I, S >
10:    B ← empty buffer
11:  end if
12: end while
13: if B is not empty then
14:   I ← index of B without its last symbol in D
15:   Output the two symbols < I, S >
16: end if

Note that the buffer has to be handled carefully (this is important during the implementation) so that we do not forget symbols at the end of the file (see line 13 of Algorithm 2).
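A minimal C++ sketch of Algorithm 2 follows (an illustration only: it returns the < I, S > pairs in memory and ignores the binary output format and the dictionary flush handled by the real implementation; the name lz78_encode is hypothetical). As in the implementation described in Chapter 5, phrases are std::vector and the dictionary is an STL map:

#include <map>
#include <utility>
#include <vector>

std::vector<std::pair<int, unsigned char>>
lz78_encode(const std::vector<unsigned char>& input) {
    std::map<std::vector<unsigned char>, int> dict;  // phrase -> index (1-based)
    std::vector<std::pair<int, unsigned char>> out;
    std::vector<unsigned char> buf;                  // current phrase B
    int idx = 0;                                     // index of B in dict, 0 if B empty
    for (unsigned char s : input) {
        int prev = idx;                              // index of B before appending S
        buf.push_back(s);
        auto it = dict.find(buf);
        if (it == dict.end()) {                      // B is new: emit <I, S>, store B
            dict.emplace(buf, (int)dict.size() + 1);
            out.emplace_back(prev, s);
            buf.clear();
            idx = 0;
        } else {
            idx = it->second;                        // keep extending the match
        }
    }
    if (!buf.empty()) {                              // end of file inside a known phrase
        unsigned char last = buf.back();
        buf.pop_back();                              // emit B without its last symbol
        out.emplace_back(buf.empty() ? 0 : dict.at(buf), last);
    }
    return out;
}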

Figure 3.4: Encoding with LZ78. The dictionary state shown is the one after encoding the given input

You can see an example of LZ78 encoding in Figure 3.4. The decoding process is quite similar and should not be too difficult to reconstruct after having seen how the encoding works. If you want more details about the decoding phase, you may have a look at [10, pp.125-127].

Note that the pure algorithmic description does not cover implementation details such as the unbounded growth of the dictionary. This is usually where problems are encountered or decisions have to be made; for more details, you can refer to the implementation chapter. This algorithm is particularly simple to understand and, because of its speed and efficiency, it has become one of the standard algorithms for file compression in current compression software, just as LZ77 has in GZip. We will now have a look at the optimality of the LZ78 algorithm.

3.3.2 Optimality

The complete optimality proof can be found in [7, pp.448-455]. I will not go into the complete proof because of its size (it would take at least ten pages of this report); I will rather highlight its main ideas.

We start with a definition [7, p.448] of a parsing of a string.

Definition 8. A parsing S of a binary string x1x2 ··· xn is a division of the string into phrases, separated by commas. A distinct parsing is a parsing such that no two phrases are identical. For example, 0, 111, 1 is a distinct parsing of 01111, while 0, 11, 11 is not.

We now denote by c(n) the number of phrases in the LZ78 parsing of a sequence X^n of length n. After applying the Lempel-Ziv algorithm, the compressed sequence consists of c(n) pairs < p, b > of numbers. Each pair is made of a pointer p to the previous occurrence of the phrase prefix and the last bit b of the phrase. Each pointer p requires \log c(n) bits. Thus the total length of the compressed sequence is c(n)(\log c(n) + 1) bits.

The goal of the proof is to show that \frac{c(n)(\log c(n)+1)}{n} \to H(X), i.e. LZ78 is bounded by the entropy rate with probability 1 for large n (asymptotic optimality).

The first part of the proof highlights the fact that the number of phrases in a distinct parsing of a sequence of length n is less than n / \log n, arguing that there are not enough distinct short phrases. Then, the second idea is to bound the probability of a sequence based on the number of distinct phrases. As an example, consider an independent and identically distributed sequence of four random variables taking four possible values. The probability of a sequence containing four distinct values is maximized (at 1/256) when the probabilities are equal (each with value 1/4). On the other hand, if the sequence consists of two values appearing twice each, its probability is maximized (at 1/16) when the two values have probability 1/2 each. This illustrates the following point: sequences with a large number of distinct symbols or phrases cannot have a large probability. This idea is used in Ziv's inequality [7, p.452]. Finally, since the description length of a sequence after the parsing grows as c \log c, sequences that have very few distinct phrases can be compressed efficiently and correspond to strings that may have a high probability. On the other hand, by Ziv's inequality, the probability of sequences that have a large number of distinct phrases (and do not compress well) cannot be too large. Starting with Ziv's inequality, it is possible, by connecting the logarithm of the probability of the sequence with the number of phrases in its parsing, to prove the theorem stating that LZ78 is asymptotically optimal.

3.4 Improvements for LZ77 and LZ78

There are a number of ways to modify the LZ77/78 algorithms, and the most obvious modifications one may imagine to improve them have probably already been tried. A well-known modification of LZ78 is the one proposed by Terry Welch [10, p.127], known as LZW [11].

LZW

The idea in LZW is to remove the need for the second element in the pair < I, S >. The main idea is to start with a dictionary that is pre-filled with all possible symbols and to only send the dictionary index to the output. I will not go further into the description of LZW since it is not used in the implementation of my project.

The UNIX compress utility, for example, uses the LZW approach [10, p.133]. It has an adaptive dictionary size, starting at 512 entries and doubling every time the dictionary is full, enabling reasonable bit usage: during the earlier part of the coding process, when the strings are still short, the codewords used to encode them are also short. The compress utility also monitors the compression ratio: when it falls below a threshold, the dictionary is flushed and the dictionary building process is restarted, permitting the dictionary to reflect the local characteristics of the source.

Other improvements

Other kinds of improvements are still actively used in current LZ77/78 implementations. Most of them use clever methods to pre-fill the dictionary2 or allow the algorithm to monitor its current performance during the running compression and take an appropriate action3. Interest in LZ77/78-based compression techniques is still present; most of those improvements are protected by patents, and this may require you to pay the author of the method to use it in your software. Therefore, using an efficient LZ77/78-based implementation nowadays may require you to either invent and patent a new technique or to use an existing modification and pay fees. Another possibility is to use a method whose patent has expired, which will likely be the case for LZ77/78 because of the time elapsed since the introduction of these algorithms.

2 http://www.freepatentsonline.com/5951623.html
3 http://www.freepatentsonline.com/5243341.html

Chapter 4

Burrows-Wheeler Transform

4.1 History

The Burrows-Wheeler transform (denoted BWT, also called block-sorting compression) was invented by Michael Burrows and David Wheeler in 1994 [6] while they were working at the DEC Systems Research Center in Palo Alto, California. It is based on a previously unpublished transformation discovered by Wheeler in 19831.

4.2 Description

The description of the encoding and decoding phases is taken from the original paper by Michael Burrows and David Wheeler [6].

4.2.1 Encoding

The most convenient way to describe the Burrows-Wheeler transform is through an example. Suppose we want to perform the BWT on the string mississippi$. The first step is to generate all the possible circular permutations of this string and place them in a matrix, as shown in Figure 4.1.

Figure 4.1: The matrix of circular permutations for the string mississippi$

The main step of BWT then consists of sorting the rows of the matrix in alphabetical order, as in Figure 4.2(a). Once sorted, you can see that the original string is now located at the 6th row of the matrix.

1 http://en.wikipedia.org/wiki/Burrows-Wheeler_transform (April 20, 2010)



Figure 4.2: Sorting of the rows, and output illustration for BWT

The sorting phase is in fact the most expensive phase of BWT. After that, the output of BWT is simply the last column of the matrix and the position of the original string, as illustrated in Figure 4.2(b): the string ipssm$pissii together with the index 6.
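For illustration, the encoding of this small example can be reproduced with the following naive C++ sketch (quadratic space, so only suitable for short strings; the block-based production version of Chapter 5 sorts pointers instead, and the name bwt_encode is hypothetical):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Naive BWT: materialize all circular permutations, sort them, and output the
// last column together with the row index of the original string.
std::pair<std::string, std::size_t> bwt_encode(const std::string& s) {
    std::vector<std::string> rot;
    for (std::size_t i = 0; i < s.size(); ++i)
        rot.push_back(s.substr(i) + s.substr(0, i));   // circular permutation i
    std::sort(rot.begin(), rot.end());
    std::string last;
    std::size_t row = 0;
    for (std::size_t i = 0; i < rot.size(); ++i) {
        last += rot[i].back();                         // last column L
        if (rot[i] == s) row = i;                      // position of the original string
    }
    return {last, row};
}
// bwt_encode("mississippi$") returns {"ipssm$pissii", 5}, i.e. row 6 when counting from 1.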

4.2.2 Decoding

Decoding the string L = ipssm$pissii (with index 6) may seem almost impossible at first sight. However, the technique to decode it is very easy to understand and faster than the one used for encoding. We initialize a matrix whose last column is equal to L. The first column F of the matrix can also easily be computed by sorting the symbols of L in alphabetical order, as shown in Figure 4.3.

Figure 4.3: Initialization of the first and last column of the matrix and retrieval of the first two symbols

After that, the original string (located at row 6) is reconstructed using a simple iterative method. We first start at the row of the original string, which was provided with the output. Then we proceed as follows: if F[i] is the kth instance of symbol c in F (where i is the current row), then the row j to consider at the next step is the one where L[j] is the kth instance of c in L, and the next symbol to add to the original string is F[j]. The retrieval of the first two symbols 'i' and 's' is shown in the second part of Figure 4.3. The same procedure is repeated until all remaining symbols have been found, as in Figure 4.4.

Figure 4.4: Retrieval of the complete original sequence

Finally, we can recover the original string from the given position in L. Giovanni Manzini, professor of computer science at the University of Piemonte Orientale, shows in [8] how to bound the compression ratio of two BWT-based algorithms in terms of the k-th order empirical entropy of the input string, meaning that those algorithms are able to make use of all the regularities present in the input string. For reasons of space, I will not go into the proof; it can be found in [8].
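The iterative method just described can be written compactly by precomputing, with a single stable sort, the pairing between the kth instance of each symbol in F and its kth instance in L (again my own sketch, not the project's code; bwt_decode is a hypothetical name):

#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Inverse BWT: idx[i] is the position in L of the symbol that sorts to F[i],
// so following idx from the original row rebuilds the string symbol by symbol.
std::string bwt_decode(const std::string& L, std::size_t row) {
    std::vector<std::size_t> idx(L.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::stable_sort(idx.begin(), idx.end(),       // stability pairs kth with kth
                     [&](std::size_t a, std::size_t b) { return L[a] < L[b]; });
    std::string out;
    for (std::size_t k = 0; k < L.size(); ++k) {
        row = idx[row];
        out += L[row];
    }
    return out;
}
// bwt_decode("ipssm$pissii", 5) returns "mississippi$".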

4.2.3 Why it compresses well

A BWT string in fact has a lot of similarity with the Lempel-Ziv approaches in the idea of capturing similarities. Consider, as an example, a common word in the English language2, like "the ", reportedly the most frequently used word. Based on what we have seen before, L (like F) will contain a large number of 't' characters, intermingled with other characters that can precede the "he " suffix, such as 's' (i.e., the word "she "). The same argument can be applied to all other words, like " of " with " if".

As a result, some regions of L are likely to contain long runs drawn from few distinct characters. This is exactly the kind of string for which the move-to-front coder will output a majority of low numbers, highly repeated (see Figure 4.6 for a concrete example on a simple string).

2 http://www.world-english.org/english500.htm lists some of them, for example

Then, the move-to-front output can be efficiently encoded using Huffman.

In the results, I will show that this is empirically the best compression scheme in most situations.

4.3 Algorithms used in combination with BWT

4.3.1 Run-length encoding

Run-length encoding is a simple method to encode long runs of the same symbol. An example is provided in Figure 4.5.

Figure 4.5: Run-length encoding on the string abbacccabaabbb

This encoding technique has been used with BWT in a compression scheme during the project.
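A minimal C++ sketch of such a run-length encoder is shown below (my own illustration; the exact byte-level output format used in the project may differ, and rle_encode is a hypothetical name):

#include <string>
#include <utility>
#include <vector>

// Every run, even of length 1, becomes a <symbol, count> pair; this is why an
// isolated symbol doubles in size, as discussed in the implementation chapter.
std::vector<std::pair<char, int>> rle_encode(const std::string& in) {
    std::vector<std::pair<char, int>> out;
    for (char c : in) {
        if (!out.empty() && out.back().first == c)
            ++out.back().second;          // extend the current run
        else
            out.emplace_back(c, 1);       // start a new run
    }
    return out;
}
// rle_encode("abbacccabaabbb") yields (a,1)(b,2)(a,1)(c,3)(a,1)(b,1)(a,2)(b,3).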

4.3.2 Move-to-front encoding

Move-to-front encoding is another simple but useful method to improve the performance of entropy coders like Huffman. The main idea is to start with an initial stack of recently used symbols. Then, every time a symbol is met, its index in the stack is written to the output. At the same time, the symbol is moved to the top of the stack if it is not already there. Move-to-front is illustrated in Figure 4.6.

Figure 4.6: Move-to-front encoding on the string abbacccabaabbb
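A minimal C++ sketch of the move-to-front encoder follows (my own illustration; the example output assumes the initial stack {a, b, c}, which may differ from the convention used in Figure 4.6, and mtf_encode is a hypothetical name):

#include <algorithm>
#include <string>
#include <vector>

// Output the current stack index of each symbol, then move that symbol to the
// top of the stack. Assumes every input symbol appears in the initial stack.
std::vector<int> mtf_encode(const std::string& in, std::vector<char> stack) {
    std::vector<int> out;
    for (char c : in) {
        auto it = std::find(stack.begin(), stack.end(), c);
        out.push_back((int)(it - stack.begin()));
        stack.erase(it);                  // move c to the top of the stack
        stack.insert(stack.begin(), c);
    }
    return out;
}
// With the initial stack {a, b, c}, "abbacccabaabbb" encodes to
// 0 1 0 1 2 0 0 1 2 1 0 1 0 0: runs of a symbol become runs of zeros.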

We will see in the results that it enables a very good compression ratio in combination with BWT and Huffman.

Chapter 5

Implementation

In this chapter, details about the algorithm implementations will be presented. The implementation part of the project (with benchmarking) took the majority of the total time dedicated to the project. The implementations were done on two GNU/Linux x86_64 machines running Gentoo1. The code has been compiled with GCC2 versions 4.3.4, 4.4.3 and 4.5. Compilation on Linux or UNIX systems should therefore work but is not guaranteed. Under Windows, it will probably need to be done under Cygwin3. For more details about the compilation process, see the supplementary material.

All of these algorithms were implemented using a mix of C++ and C. C++4 permits easily readable object-oriented code, while C ensures low-level access to files. This low-level access is fundamental when dealing with file encoding, especially when byte-level or even bit-level access is required.

5.1 LZ78

Being the first algorithm implemented in my project, LZ78 also introduced me to C/C++ difficulties: I unfortunately started with the std::string container for phrases, which led me to use tricks for string termination, like manually adding the '\0' symbol for end-of-string. Moreover, the 0x7F byte code, used as the EOF constant on GNU/Linux, terminated the stream input from a file before the real end of the file. All these constraints made the compressor incompatible with binary files, and consequently with most of the benchmark files [9]. Fortunately, having a little more time during the last weeks of the semester, I decided to re-implement this algorithm using std::vector as the container, allowing compression of binary files and benchmarking on the complete Calgary Corpus.

5.1.1 Code details As previously said, std::vector are used as container in the dictionary. The dictionary itself is a STL-map, mapping the vectors to the dictionary index. These indexes are bound between 0 and 127 (for a dictionary of size 1 byte) since the first bit has to be reserved to

1Gentoo Linux distribution : http://www.gentoo.org 2http://gcc.gnu.org/ 3www.cygwin.com/ 4http://www.cplusplus.com/reference/clibrary/ (April 20. 2010)

signify a match or a miss in the decoding phase. Some of these definitions are illustrated in Listing 5.1 (the complete code can be found in LZ78.hpp).

Listing 5.1: Representation of match, miss and masks for LZ78, for 1- to 3-byte dictionary indices

#define DICT_NUMBYTE 2

#if DICT_NUMBYTE == 3
#define MAX_DICSIZE 8388607
#elif DICT_NUMBYTE == 2
#define MAX_DICSIZE 32767
#else
#define MAX_DICSIZE 127
#endif

#if DICT_NUMBYTE == 1
#define ZERO       0x00
#define MASK       0x80
#define MATCH      0x80
#define MISS       0x0F
#define DICT_FULL  0x08
#define EOFILE     0x7F  // 01111111b
#define UNSET_MASK 0x7F
#endif

// ... similar for other sizes of dictionary

5.2 Burrows Wheeler Transform

The Burrows Wheeler Transform (BWT) implementation was probably the most challenging one in my project. In fact, the performance of BWT is critical for the rest of the compression pipeline since it is the first component of that pipeline.

The main idea of the implementation is to read the input file block by block. But as soon as a decision concerning the way to sort the permutations has to be taken, many options and techniques currently under research have to be considered. Even in current implementations of algorithms using BWT, optimizing the permutation phase is still highly sought after.

The first version I implemented simply stored the complete set of permutations in an ordered list of std::vector. This implementation was space consuming: for a block of size n, it uses O(n^2) space in total, since it stores n permutations of n symbols each. Knowing that blocks of 500'000 bytes are common, we would need more than 250'000'000'000 bytes of RAM. This is clearly too expensive to run even on current computers.

A possibility to reduce the consumed space and still be able to produce the desired output is to perform suffix sorting [5, pp.75-94]. It reduces the space required to store the permutations to O(n) but introduces some complications in the sorting: we have to be really careful when sorting blocks where one suffix is a prefix of another suffix. If we use sets where elements are unique, a way to distinguish them has to be found. Many suffix sorting algorithms are proposed in [5], since this is currently an active domain of research to optimize the running time of BWT.

My final version uses pointers and is shown in Listing 5.2. It does not store the suffix itself but rather a pointer to its position in the original block. The reason for this is that memcmp will perform a bytewise comparison of the two sequences referred to by the pointers until it finds a difference. Note that with this method, poor performance occurs especially with repeated sequences of similar suffixes, which causes some fluctuations in the time required for sorting the permutations.

Listing 5.2: The sorting routine used in the final BWT implementation

class RuntimeCmp {
public:
    // Compare the rotations starting at p1 and p2 by comparing curr_length
    // bytes inside the doubled buffer (buffer, doubleBuffer and curr_length
    // are defined elsewhere in the BWT implementation).
    int operator()(const unsigned char* p1, const unsigned char* p2) const {
        int result = memcmp(doubleBuffer - (buffer - p1),
                            doubleBuffer - (buffer - p2), curr_length);
        if (result < 0) return 1;
        if (result > 0) return 0;
        return p1 > p2;
    }
};
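For illustration, the same idea can be packaged self-contained as follows (my own sketch with hypothetical names; the report's RuntimeCmp relies instead on the implementation's buffer variables). The block is duplicated so that a single memcmp compares two full rotations without any wrap-around logic, and only the rotation start offsets get sorted:

#include <algorithm>
#include <cstring>
#include <vector>

std::vector<std::size_t> sort_rotations(const std::vector<unsigned char>& block) {
    const std::size_t n = block.size();
    std::vector<unsigned char> doubled(block);                   // block concatenated
    doubled.insert(doubled.end(), block.begin(), block.end());   // with itself
    std::vector<std::size_t> order(n);
    for (std::size_t i = 0; i < n; ++i) order[i] = i;            // one offset per rotation
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        int r = std::memcmp(&doubled[a], &doubled[b], n);        // compare whole rotations
        return r != 0 ? r < 0 : a < b;                           // deterministic tie-break
    });
    return order;                                                // rotation starts, sorted
}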

5.3 Huffman coding

For Huffman, even bit-level access was needed. This is quite natural, since Huffman's goal is to use as few bits as possible to represent the most frequent symbols (or words, if the encoding is at a higher level).

5.3.1 Binary input and output

In actual computers, the lowest possible unit of access to a file is the byte. Therefore, to perform per-bit reads and writes, a caching technique has been used. I will not go into details, as the implementation of binary input and output is very simple. The main idea is to store the last read byte in a buffer and shift by one bit every time a bit is read. Once 8 bits have been read, a new byte is fetched from the file. The output works in the same manner, and both input and output have an automatic flush mechanism so that the user of the class does not have to check at every bit whether the buffer has to be flushed or refilled. The binary input and output class is accessible in the BinaryIO.cpp/hpp files (see Table 5.1 for the exact location).
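As a sketch of the caching idea on the writing side (not the project's BinaryIO class; the names here are hypothetical), a bit-writer can accumulate bits in a one-byte cache and flush it to the file whenever 8 bits have been collected:

#include <cstdio>

class BitWriter {
    std::FILE* f;
    unsigned char cache = 0;   // partially filled output byte
    int nbits = 0;             // number of bits currently in the cache
public:
    explicit BitWriter(std::FILE* file) : f(file) {}
    void writeBit(int bit) {
        cache = (unsigned char)((cache << 1) | (bit & 1));
        if (++nbits == 8) {    // automatic flush once a full byte is cached
            std::fputc(cache, f);
            cache = 0;
            nbits = 0;
        }
    }
    ~BitWriter() {             // pad the last byte with zeros, as a final flush
        if (nbits > 0) std::fputc((unsigned char)(cache << (8 - nbits)), f);
    }
};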

5.3.2 Huffman implementation

Since my Huffman implementation codes at the byte level, the goal is to code a symbol represented by a byte with fewer than 8 bits. I have just presented the BinaryIO class, which gives me bit-by-bit access to a file. The implementation of Huffman coding follows the main idea of Algorithm 1 and Figure 2.1. I have created a node class (see HNode in Table 5.1) to represent the nodes that the algorithm merges together up to the root. The only difference in my implementation is that line 11 of Algorithm 1 is performed during the node merging step, so that a new traversal of the tree is not required. Once the root is the only node remaining, I simply retrieve the references to the leaves to get the incrementally created code and use it as a coding table. To store the dictionary in the file, I write it into a header. Because codes of 8 bits or less are likely to all be taken, the number of bytes to be read by the decoder is also written in the header. This avoids reserving a special pattern for when the encoded file does not terminate exactly on a full byte.

5.4 Move-to-front

The implementation of move-to-front is really trivial. It mostly consists of keeping the stack up to date by moving the last seen entry to the front. This was probably the most impressive result of the project, since it enables Huffman coding to perform outstanding entropy coding.

5.5 Run-length encoding

The run-length encoding implementation is as simple as move-to-front, but the benefit from it is very limited. In fact, because we always have to specify the length of the run we have encoded, an isolated occurrence of one byte will use two bytes in the output. This is why run-length encoding permits only a very limited gain in compression ratio and even performs worse than using just move-to-front in combination with Huffman.

5.6 Overview of source files

Using the notion of classes, all the required algorithms were implemented in separate files with their corresponding headers. A small UML diagram is shown in Figure 5.1 to give an overview of those classes. Those classes are the core of the compression program, but some other files are required to make it run. Table 5.1 lists all the source code files and other required files.

Table 5.1: List of the important source files required to build the program and use all the implemented algorithms

src:
  CMakeLists.txt      File used by CMake to generate the Makefile
  defines.hpp         Some utility functions and shared defines
  License.txt         GPLv3 license
  main.cpp            CLI code, launches the required coding scheme

src/:
  BWT.cpp/hpp         The new BWT implementation, using pointers for block sorting
  BWTOld.cpp/hpp      The old BWT implementation, too expensive in terms of running time
  Huffman.cpp/hpp     Huffman encoding class
  ICodec.hpp          Codec interface
  IEncoder.hpp        Encoder interface
  LZ78.cpp/hpp        Second version of LZ78, flushing the dictionary when full
  LZ78NR.cpp/hpp      First version of LZ78, without dictionary flush
  MTF.cpp/hpp         Move-to-front encoding class
  RLE.cpp/hpp         Run-length encoding class

src/codecs/huffman:
  BinaryIO.cpp        Utility class, allowing per-bit access to files
  HNode.cpp           Class representing a node in the Huffman coding step

src/codeschemes:
  HuffBwt.cpp/hpp        Performs BWT→Huffman
  HuffMtfBwt.cpp/hpp     Performs BWT→MTF→Huffman
  HuffMtfRleBwt.cpp/hpp  Performs BWT→RLE→MTF→Huffman
  HuffRleBwt.cpp/hpp     Performs BWT→RLE→Huffman
  HuffRleMtfBwt.cpp/hpp  Performs BWT→MTF→RLE→Huffman
  Lz.cpp/hpp             Performs LZ78 encoding (LZ78.cpp)
  LzNoReset.cpp/hpp      Performs LZ78 encoding (LZ78NR.cpp)

src/test:
  Makefile            Used to build the unit tests
  run.sh              Builds and runs the unit tests
  TestBinIO.cpp       Tests the BinaryIO class (outdated test!)
  TestBWT.cpp         Tests BWT
  TestBWT2.cpp        Tests BWT
  TestHNode.cpp       Tests HNode
  TestHuffBWT.cpp     Tests BWT followed by Huffman
  TestHuffman.cpp     Tests Huffman coding
  TestMTF.cpp         Tests MTF
  TestRLE.cpp         Tests RLE

Figure 5.1: UML diagram of the most important classes. Note that the coding scheme classes and the main class are not drawn. For simplicity, the methods and members of the codec and encoder classes have been omitted.

Chapter 6

Practical Results

6.1 Benchmark files

To evaluate the performance of the implemented algorithms, I used the Calgary Corpus [9] benchmark files, specially designed for comparing compression methods. There are a total of fourteen files in the large version of this corpus. Note that there also exists an improved version of this corpus, called the Canterbury Corpus1, chosen in 1997 because the results of existing compression algorithms on it are "typical", so that it is hoped this will also be true for new methods. However, since LZ78 is rather old and there was no clear advantage in benchmarking my implementation on this newer corpus, I decided to stay with my initial choice. I will present an overview of the Calgary Corpus files; the following table details what each file is supposed to contain or represent:

File     Category                         Size (in bytes)
bib      Bibliography (refer format)      111261
book1    Fiction book                     768771
book2    Non-fiction book (troff format)  610856
geo      Geophysical data                 102400
news     USENET batch file                377109
obj1     Object code for VAX              21504
obj2     Object code for Apple Mac        246814
paper1   Technical paper                  53161
paper2   Technical paper                  82199
paper3   Technical paper                  46256
paper4   Technical paper                  13286
paper5   Technical paper                  11954
paper6   Technical paper                  38105
pic      Black and white fax picture      513216
progc    Source code in "C"               39611
progl    Source code in LISP              71646
progp    Source code in PASCAL            49379
trans    Transcript of terminal session   93695

1 http://corpus.canterbury.ac.nz/descriptions/#cantrbry


While writing this report, I also learnt from the corpus website2 that paper3 to paper6 were no longer used because they did not add anything to the evaluation. However, I decided to keep the results for these four files.

6.1.1 Notions used for comparison

The data compression ratio, also known as compression power, is used to quantify the reduction in data-representation size produced by a data compression algorithm. The formula used to compute this ratio is:

Compression ratio = Compressed size / Uncompressed size

Note that I will simply refer to it as the compression ratio from now on.

Running time

The running time reported in the results is taken from the UNIX time3 command. I used the real time (i.e. the total running time) as the comparison criterion.

6.1.2 Other remarks

Before going into the results, I want to highlight the fact that I used line charts instead of scatter plots with only markers because they make the evolution of performance across algorithms easier to follow, even if the line between two markers has no real meaning: it is only a visual aid. See the supplementary material for the complete set of data.

6.2 Lempel-Ziv 78

6.2.1 Lempel-Ziv 78 with dictionary reset

In this subsection, I present the results obtained when comparing the three possible dictionary index sizes for LZ78: from 1 byte to 3 bytes. In general, for big files, the 1-byte version has to perform a lot of dictionary flushes, while the 3-byte dictionary only has to be cleared occasionally. Due to the constantly growing behavior of the dictionary, the obtained results were very similar across files, and differences in performance are mostly due to the file size or to the very specific structure of a file. Figure 6.1 provides a graphical view of the compression results.

2 http://corpus.canterbury.ac.nz/descriptions/#large
3 see man time

(a) The complete result for the Calgary Corpus (b) Same as 6.1(a) but with a smaller file set

Figure 6.1: Obtained compressed file size comparison for LZ78 implementation with dic- tionary sizes from 1 to 3 bytes

As can be seen in Figure 6.1(a), the optimal dictionary index size for files like the books is 2 bytes. The most impressive result of the benchmark is obtained on the picture file, for which the algorithm reaches a very good compression ratio. For the smaller files, Figure 6.1(b) confirms the previous observations: the two-byte version of LZ78 is the optimal one for this corpus.

6.2.2 Comparison between with and without dictionary reset version During the project, I also decided to compare the first implemented algorithm where no dictionary flush was performed as with the conventional implementation where the dictio- nary is cleared every time it is full. Figure 6.2(a) shows the different file sizes obtained with the dictionary reset for 1 to 3 bytes, compared to the best version using 2 bytes with dictionary flush as seen in the previous subsection. As we can see, in Figure 6.2, the benefit from not resetting the dictionary is almost inex- istent. This will only be useful in situations with local similarities spread all over the file were the first part of the file will permit LZ78 to build a good dictionary which capture every pattern in the file. See supplementary material for the complete set of obtained data. Pre-filtered data are available as text file of MS-Excel format.

6.2.3 Comparison between my LZ78 implementation and GZIP

The comparison performed in this subsection is mostly for informational purposes; my implementation should not be considered a real competitor to GZip.

(a) The complete result for the Calgary Corpus (b) The only benchmark files where the LZ78 implementation without dictionary reset performs (slightly) better than the common LZ78 technique

Figure 6.2: Ratio comparison between LZ78 with 2 bytes dictionary reset and LZ78 im- plementation without reset from 1 to 3 bytes

The following has to be kept in mind for these results:

1. I am using the LZ78 compression technique, while GZIP4 uses an LZ77-based algorithm.

2. GZIP was first introduced in 1992; there is therefore no hope of beating the performance of this software in only a few weeks of work with myself as the only contributor to the code.

However, it is interesting to see in Figure 6.3 that I can achieve a pretty good ratio on the picture and geo files compared to GZIP, despite the poor performance my implementation obtains on the other files. The running time of the compression process is also an important parameter to take into account when evaluating compression algorithms. In Figure 6.4, we can see that my implementation fluctuates a lot when dealing with big files or with the picture, while GZIP seems to depend mostly on the file size. These results show that LZ77 can still perform better than LZ78 using some tweaks and optimizations, especially with large search and look-ahead buffers. The running time difference observed in Figure 6.4 has at least two explanations:

1. LZ77 (used in GZip) only searches the limited search buffer for a match, while my LZ78 implementation has to search a dictionary of 32767 entries (when using 2 bytes).

2. The running time is particularly high for large files like the books, news, obj2 and pic. These files tend to fill the dictionary quickly, making the search expensive.

In the next section, I will present the results obtained with the BWT techniques.

4 see the GZIP man pages

Figure 6.3: Compression ratio comparison between my LZ78 implementation and GZIP

Figure 6.4: Compression time comparison between my LZ78 implementation using two bytes for the dictionary and GZIP

6.3 Burrows-Wheeler Transform

The benchmarks for BWT involved a lot more work than the Lempel-Ziv part, since many compression schemes have to be considered. In [5, p.92], some schemes to use with BWT are suggested. I decided to use some of them, which I introduce in the table below:

Short name  Coding scheme                   Description
BH          Huff(BWT(Input))                Huffman coding on the input transformed by BWT
BMH         Huff(MTF(BWT(Input)))           BWT followed by move-to-front and Huffman coding
BRH         Huff(RLE(BWT(Input)))           BWT followed by run-length encoding and Huffman coding
BMRH        Huff(MTF(RLE(BWT(Input))))      BWT followed by run-length encoding, move-to-front and Huffman coding
BRMH        Huff(RLE(MTF(BWT(Input))))      BWT followed by move-to-front, run-length encoding and Huffman coding

The BH scheme immediately showed in the first benchmarks to be very inefficient; this is not a surprise, since performing Huffman coding directly on BWT(Input) is very similar (in terms of performance) to just applying Huffman coding directly to the source. Such a coding scheme can only be efficient on source files where the most probable symbols are very limited in number and have a significantly higher probability than the less probable ones, allowing the average code length to be of at most 8 bits (this is the case for my implementation, since I use per-byte symbol coding in binary for Huffman). In fact, the BH scheme could probably be interesting only if per-block coding of the source were applied, since BWT tends to group similar blocks together; but the difficulty arising from this would be to detect such blocks. Due to its poor results, the BH scheme will not be mentioned anymore in the upcoming results of this chapter, but it could obviously be a path of research for a future algorithm using just Huffman with BWT.

I will start the next subsection by comparing the above schemes to the previously seen LZ78 algorithm.

6.3.1 Comparison of BWT schemes to LZ78

The first results, shown in Figure 6.6 and also visible as a chart in Figure 6.5, involve a compression ratio comparison between LZ78 and the presented schemes. What can immediately be seen from Figure 6.6 is that the only valuable schemes based on BWT are BMH and BMRH. The poor results of BRH and BRMH can easily be explained by two observations:

1. Run-length encoding always outputs two symbols, even when a symbol is repeated a single time (see the implementation chapter).

2. Run-length encoding breaks the blocks of similar symbols, introducing a new byte for the length of the repetition and causing move-to-front to move the length symbol corresponding to this new byte to the front. Move-to-front alone tends to break such blocks less, and also enables a small set of bytes to have a high probability.

6.3.2 Influence of the block size

According to the BZip2 documentation,

(a) Ratios for all files (b) Second chart with bad ratios removed from 6.5(a)

Figure 6.5: Comparison chart between LZ78 with two bytes for the dictionary and various BWT schemes

Figure 6.6: Comparison between LZ78 with two bytes for the dictionary and various BWT schemes

“Larger block sizes give rapidly diminishing marginal returns. Most of the compression comes from the first two or three hundred k of block size, a fact worth bearing in mind when using bzip2 on small machines. It is also important to appreciate that the decompression memory requirement is set at compression time by the choice of block size.”5

Thus, I decided to verify whether this is also the case with BMH, the most efficient scheme in terms of ratio. I used various block sizes, from 500 to 9'000'000 bytes, to verify the above statement. In Figure 6.7, many irregularities in the ratio are visible for block sizes below 128k. For the geo file, the ratio is even very bad, growing the file to five times its original size. According to these results, we can observe that the block size is a critical parameter for an efficient compression.

(a) The complete result for the Calgary Corpus (b) Highest ratios removed from 6.7(a) (c) Isolated ratios not quite visible in 6.7(b)

Figure 6.7: Ratios comparison for different block sizes using the BMH scheme

The values of the ratios are visible in the table in Figure 6.8.

Figure 6.8: Ratio table for BMH containing the data used for the charts in Figure 6.7, with per-row coloring of ratios: dark green marks the best ratio and red the worst.

For the next subsection, all comparisons were done using BMH with a block size of 128k. I will now compare the performance of my implementation with BZip2, a well-known compression software.

5 see man bzip2

6.3.3 Comparison between my optimal BWT method and BZIP2

In this subsection, the same remarks as with LZ78 apply concerning the difference in development resources between the two implementations. The results for the running time can be observed in Figure 6.9. As expected, my version almost always has a higher running time than BZip2, but the absolute time is still good, staying in the order of one second. Moreover, the pic file is not part of the chart: it has been removed because my BMH implementation took approximately 50 seconds to compress it, while BZip2 stayed in the order of a second. Such poor performance on the pic file can easily be explained by the implementation of the permutation sorting algorithm. While BZip2 certainly uses suffix sorting with an imaginary end-of-block symbol, my version will sometimes have to extend the suffix comparison to complete it. In the case of the picture, where blocks of hundreds of thousands of zero-valued bytes are common, my sorting step will require 1280000^2 comparisons for each suffix, which clearly affects the running time.

Figure 6.9: Compression time comparison between BZip2 and BMH

Therefore, for a decent and stable running time, suffix sorting is mandatory in BWT. In Figure 6.10, I present the compression ratio comparison between BMH and BZip2. It is quite obvious that the two algorithms are very close, given the shape of the ratio "curves", except for the obj1 file, which tends to penalize the 128k version of BMH (you can observe in Figure 6.8 that the block size 500 interestingly gives better results than higher block sizes for this file). According to Figure 6.10, I can therefore highlight the fact that my algorithm performs very well compared to BZip2.

Figure 6.10: Compression ratio comparison between BZip2 and BMH

6.4 Global comparison

After having compared LZ78 and BMH separately, I will now synthesize the results by comparing the best versions of the two algorithms with BZip2 and GZip. In Figure 6.11, we can observe that BZip2 outperforms all the other algorithms (except on the obj1 file).

Figure 6.11: Comparison between all tested algorithms / software

This means that, within a few years of research, BWT has enabled efficient compressors like BZip2 to become available, and it is worth using when online compression is not needed.

Chapter 7

Supplementary Material

In this chapter, you will find all the material and information required to verify the data provided in the Implementation and Practical Results chapters.

7.1 Using the program

7.1.1 License

I have decided to put my C/C++ code under the GPLv31 license to allow my contribution to be preserved without disallowing future work based on mine. If there is a need for another type of license, it can be discussed by contacting me. All the details about this license can be found in the source files.

7.1.2 Building the program

To build the program, you will need CMake2 to generate the build files and a GNU-compatible system (it has only been tested on Linux). Once installed, run the following commands:

1. cd Compress/

2. mkdir build && cd build

3. cmake ../ (there should be no error if your system is compatible)

4. make (builds the program; the binary will be located in bin/ and named compress)

Calling ./bin/compress -h will provide the information required to run a specific algorithm and will show the possible options.

1 http://gplv3.fsf.org/
2 http://www.cmake.org/


Unit testing files

To build the unit testing files (located in the src/test subdirectory), you will need the Boost C++3 library compiled with test support. CMake is not required, since a standard Makefile is provided. Note that the unit test for binary I/O is no longer up to date and is not supposed to succeed.

7.2 Collected data

7.2.1 Scripts The results of the benchmarks were collected using simple shell scripts located in each benchmark directory. The common scripts located in these directories are:

• run.sh: The main script used to run the compression program and to collect the file size with running time.

• run-dec.sh: Script used to verify the integrity of the compressed files, i.e. whether the decompressed file is identical to the original one. This is useful to check that the algorithm has not been broken by a trivial change in the code. It may also be used to check the compatibility of the algorithm on other platforms.

7.2.2 Spreadsheets

If you do not want to perform the complete data collection phase from scratch, you may want to have the data directly filtered and formatted in spreadsheets. There are three different spreadsheet files (.xlsx format) located in the filtered-benchmarks subdirectory of the repository, and also provided on the CD-ROM:

1. BWT.xlsx : Burrows-Wheeler Transform and BZip2 related data

2. Lz78.xlsx : Lempel-Ziv 78 and GZip related data

3. ResultsMixed.xlsx : Old (outdated) data for all tested software and implemented algorithms

7.2.3 Repository

The complete project files are also accessible on polysvn4 via svn using:

svn co https://polysvn.ch/repos/Compression/tags/submitted-version

If you need an access to it and do not have one, you can send me an email.

3 http://www.boost.org/
4 http://www.polysvn.ch/

Chapter 8

Conclusion

Contributions on the personal and professional level

Doing this project about compression algorithms was a great opportunity for me to get more familiar with information theory, briefly seen in various courses during my first year at EPFL. Moreover, the compression domain was very attractive, since compression software is part of everyday computer usage. Having the possibility to look at and implement compression algorithms in a sense demystified the impressive ability to gain a huge amount of space on a disk or on a wireless transmission. I particularly appreciated the Burrows-Wheeler approach, which is capable of boosting classical compression techniques such as move-to-front and run-length encoding.

Another benefit I got from this project is an improvement of my C/C++ skills, especially for low-level file processing.

Finally, I appreciated the strong link between the compression domain and computational biology (the DNA domain), since I was following a course on this topic during the same semester. Many techniques used in DNA data handling are inspired by the compression domain.

General conclusion

In this project, two very different lossless compression techniques were analyzed and tested against the Calgary Corpus. Remarkably, the Burrows-Wheeler transform allows efficient compression of data with very little effort. The most noticeable drawback of this approach is that it is not online like Lempel-Ziv. But since caching is often used in both wired and wireless communications, for example in streaming, this could be a path to follow because of the impressive compression ratios it can achieve.

Finally, BWT sorting remains an active research area, since the performance in terms of time is clearly what has to be optimized.


Acknowledgments

I would like to take this opportunity to thank all the people who have contributed in some way to the progression of this project, particularly Ghid Maatouk, who proposed the project and supervised my work during the whole semester. Her advice was really useful and appreciated. My thanks also go to Masoud Alipour, who gave me some hints about the implementation part of the project, and to all the members of the ALGO and LMA labs1 whom I met during this project. Finally, I would like to thank Prof. Amin Shokrollahi (head of ALGO and LMA) for his suggestions about the project orientation.

1 http://algo.epfl.ch/en/group/members

Bibliography

[1] http://data-compression.com/theory.html (01.02.2010).

[2] http://en.wikipedia.org/wiki/LZ77_and_LZ78 (March 1, 2010).

[3] http://en.wikipedia.org/wiki/Suffix_array (April 15, 2010).

[4] Amir Said. Introduction to arithmetic coding - theory and practice, 2004.

[5] Donald Adjeroh, Timothy Bell, and Amar Mukherjee. The Burrows-Wheeler Trans- form: Data Compression, Suffix Arrays, and Pattern Matching. Springer, 1 edition, July 2008.

[6] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical report, 1994.

[7] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley-Interscience, New York, NY, USA, 1991.

[8] Giovanni Manzini. The Burrows-Wheeler transform: theory and practice. In Lecture Notes in Computer Science, pages 34–47. Springer, 1999.

[9] Matt Powell. http://corpus.canterbury.ac.nz/descriptions/#calgary.

[10] Khalid Sayood. Introduction to Data Compression, third edition. Morgan Kaufmann, December 2005.

[11] T.A. Welch. A technique for high-performance data compression. Computer, 17:8–19, 1984.

[12] Aaron D. Wyner and Jacob Ziv. The sliding-window Lempel-Ziv algorithm is asymptotically optimal. Proceedings of the IEEE, 1994.

[13] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977.
