Algorithms for Re-Pair Compression


Philip B. Ørum, s092932
Nicolai C. Christensen, s092956

Kongens Lyngby 2015

Technical University of Denmark
Department of Applied Mathematics and Computer Science
Richard Petersens Plads, building 324, 2800 Kongens Lyngby, Denmark
Phone +45 4525 3031
[email protected]
www.compute.dtu.dk

Abstract

We have studied the Re-Pair compression algorithm to determine whether it can be improved. We have implemented a basic prototype program, and from that created and tested several alternative versions, each aiming to improve some part of the algorithm. We have achieved good results: our best version approximately cuts the original running time in half, while losing almost nothing in compression effectiveness and even lowering memory use significantly.

Preface

This thesis was prepared at DTU Compute in fulfilment of the requirements for acquiring an M.Sc. in Computer Science and Engineering. The thesis deals with improvements to the Re-Pair compression algorithm. It consists of a number of chapters describing the different versions of the Re-Pair algorithm that we have developed.

Lyngby, 19-June-2015

Philip B. Ørum, s092932
Nicolai C. Christensen, s092956

Acknowledgements

We would like to thank our supervisors Inge Li Gørtz and Philip Bille.

Contents

Abstract  i
Preface  iii
Acknowledgements  v
1 Introduction  1
2 External libraries  3
  2.1 Google dense hash  3
  2.2 Boost library project  3
3 Theory  5
  3.1 The Re-Pair compression algorithm  5
  3.2 Canonical Huffman encoding  8
  3.3 Gamma codes  10
4 Re-Pair basic version  11
  4.1 Design  12
  4.2 Implementation  17
  4.3 Time and memory analysis  22
  4.4 Results  33
5 Alternative dictionary  37
  5.1 Implementation  38
  5.2 Results  38
6 FIFO priority queue  41
  6.1 Results  42
7 Extended priority queue  45
  7.1 Results  46
8 Earlier cutoff  49
  8.1 Results  50
9 Automatic cutoff  55
10 Merge symbols  59
  10.1 Design  60
  10.2 Implementation  61
  10.3 Results  62
  10.4 Multi-merge  65
11 Merge with cutoff  67
  11.1 Results  68
12 Results comparison  71
  12.1 Compression effectiveness  72
  12.2 Running times  72
  12.3 Memory use  73
  12.4 Discussion  73
13 Conclusion  75
  13.1 Future work  76
Bibliography  79

Chapter 1  Introduction

As the amount of digital information handled in the modern world increases, so does the need for good compression algorithms. In this report we document our work on further developing the Re-Pair compression algorithm described by Larsson and Moffat in [1]. Their version of the algorithm is very focused on keeping the memory requirements during execution as low as possible. We take a closer look at their algorithm, and because memory is so readily available nowadays, we aim to trade memory consumption for faster compression time without losing compression effectiveness.

Our work consists of implementing a working prototype of the algorithm, and then branching out from that basic implementation to create several different versions of Re-Pair. We look into what trade-offs can be made between speed, memory use, and compression effectiveness, but our focus is primarily on improving the running time of the algorithm. The focus of each individual version is explained in its respective section. We have implemented decompression of files as described in [1], and use this for testing the correctness of our compression, but in this project we are mainly interested in studying ways of improving the compression part of Re-Pair.
The diagram below shows the various program versions we have made, with dashed outlines indicating branches that were never fully implemented due to showing poor results in early testing.

Chapter 2  External libraries

2.1 Google dense hash

To improve program speed we switched from the basic STL hash table implementation to the dense hash table, which is part of the Google Sparse Hash project described at [2]. We used the benchmark on the site [3] to verify that the dense hash table would be an improvement.

2.2 Boost library project

Boost is a collection of libraries for C++ development, which is slowly being integrated into the C++ standard library collection. We use Boost to gain access to its Chrono library, which we need to measure the execution time of our code down to nanosecond precision. More information about Boost can be found on its homepage at [4].

Chapter 3  Theory

In this chapter we introduce some of the most important concepts used in the project.

3.1 The Re-Pair compression algorithm

In the following we explain the Re-Pair algorithm as it is described by Larsson and Moffat in [1]. The idea behind the Re-Pair algorithm is to recursively replace pairs of symbols in a text with single new symbols, thus shortening the original text. The approach is to replace the most frequently occurring pairs first, and for each pair to add a dictionary entry mapping the new symbol to the replaced pair.

There are four main data structures used by the Re-Pair algorithm, which can be seen in figure 3.1.

Figure 3.1: Re-Pair data structures used during phrase derivation. This image is taken from [1].

The first is the sequence array, an array structure where each entry consists of a symbol value and two pointers. The symbol values are either original symbols from the input text, new symbols introduced by replacing a pair, or empty symbols. The pointers are used to create doubly linked lists between occurrences of identical pairs, which are needed to find the next instance in constant time when a pair is selected for replacement. They are also used to point from empty symbol records to records that are still in use, so that the symbols next to the pair currently being replaced can be found without walking sequentially through empty records.

The second data structure is the active pairs table. This is a hash table mapping a pair of symbols to a pair record, which is a collection of information about that specific pair. Only pairs occurring with a frequency of at least 2 in the text are considered active. A pair record holds the exact number of times the pair occurs in the text, a pointer to the first occurrence of the pair in the sequence array, and two pointers to other pair records. These pointers are used in the third data structure, the priority queue.

The priority queue is an array of size ⌈√n⌉ where entry i contains a doubly linked list of pairs with frequency i + 2. The last entry also contains all pairs with frequencies greater than √n, and these appear in no particular order.

The fourth structure is the phrase table. It is used to store the mapping from new symbols to the pairs they have replaced, and does so with minimal memory use. The term phrase refers to the symbols introduced by the Re-Pair algorithm to replace each pair, since each of those symbols corresponds to a sequence of two or more symbols from the original input. The details of the phrase table will be explained in chapter 4.
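To make the layout of these four structures concrete, the following C++ sketch shows one possible set of declarations. The names and field choices are ours for illustration only and are not taken from [1] or from our implementation; in particular, std::unordered_map stands in for the hash table here to keep the sketch self-contained, even though our program uses Google's dense hash table (chapter 2).

    #include <cstdint>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // One entry of the sequence array: a symbol plus two links. For a symbol
    // that begins an occurrence of an active pair, the links thread together
    // all occurrences of that pair; for emptied entries they skip over the
    // gap to the nearest records still in use.
    struct SequenceEntry {
        uint32_t symbol;  // original input symbol, new phrase symbol, or EMPTY
        int32_t prev;     // previous occurrence of the same pair (-1 if none)
        int32_t next;     // next occurrence of the same pair (-1 if none)
    };

    // Record kept in the active pairs table for every pair with frequency >= 2.
    struct PairRecord {
        uint32_t count;          // exact number of occurrences in the text
        int32_t firstOccurrence; // first occurrence in the sequence array
        PairRecord* prevInQueue; // doubly linked list within one queue slot
        PairRecord* nextInQueue;
    };

    // The two adjacent symbols packed into one hash key.
    using PairKey = uint64_t; // (uint64_t(left) << 32) | right

    struct RePairState {
        std::vector<SequenceEntry> sequence;             // the working text
        std::unordered_map<PairKey, PairRecord*> active; // active pairs table
        std::vector<PairRecord*> queue;  // slot i: pairs of frequency i + 2;
                                         // last slot: all pairs with
                                         // frequency greater than sqrt(n)
        std::vector<std::pair<uint32_t, uint32_t>> phrases; // new symbol ->
                                                            // replaced pair
    };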
The Re-Pair algorithm starts with an initialization phase in which the number of active pairs in the input is counted. A flag is used to indicate that a pair has been seen once, and if the pair is encountered again a pair record is created. A second pass through the text is needed to set the pointers linking occurrences of pairs together in the sequence array, and to insert pairs into the priority queue based on their frequencies. This is because the index of the first occurrence of a pair is not tracked during the first pass, so occurrences cannot be linked together in a single pass.

Compression now begins in what is called the phrase derivation phase. First the pair with the greatest frequency is found by searching through the last list of the priority queue. All occurrences of this pair are replaced in the sequence array, and the pair that now has the greatest frequency is located. This continues until the list at the last index of the priority queue is empty. Then the priority queue is walked from the second-to-last index down to the first, compressing all pairs along the way.

When a pair is selected for replacement we choose a unique new symbol A, which will replace every occurrence of the pair in the text. The corresponding pair record is used to find the first occurrence of the pair in the sequence array. From the index of the first occurrence, the symbols surrounding it, called the pair's context, are determined. If the pair to be replaced is ab, and it has a symbol x on its left and a y on its right, then its context is xaby. The first thing that happens is that the counts of the surrounding pairs are decremented, since they are about to lose an occurrence; in our example these are the pairs xa and by. The records of these pairs in the priority queue are updated if necessary, and if the count of a pair falls below 2 its record is deleted. Now the pair in question is replaced by the new symbol A. The context becomes xA_y, where _ denotes an emptied record, and the entry mapping A to ab is added to the phrase table.
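The data structures above exist only to make this replacement loop fast. As a semantic reference, the sketch below performs the same phrase derivation naively on a plain vector, recounting all pairs in every round instead of maintaining the occurrence lists and the priority queue. The name rePairNaive is our own hypothetical choice, and the simple counting does not treat overlapping runs such as aaa as carefully as [1] does; it is far slower than the real algorithm but produces a phrase table of the same form.

    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    using Symbol = uint32_t;
    using SymbolPair = std::pair<Symbol, Symbol>;

    // Naive phrase derivation: repeatedly replace the most frequent adjacent
    // pair with a fresh symbol until no pair occurs at least twice. Returns
    // the phrase table mapping each new symbol, in order of creation, to the
    // pair it replaced. Quadratic in the worst case, unlike the real algorithm.
    std::vector<SymbolPair> rePairNaive(std::vector<Symbol>& text,
                                        Symbol nextSymbol) {
        std::vector<SymbolPair> phrases;
        for (;;) {
            // Count every adjacent pair (overlapping runs counted simply).
            std::map<SymbolPair, std::size_t> counts;
            for (std::size_t i = 0; i + 1 < text.size(); ++i)
                ++counts[{text[i], text[i + 1]}];

            // Find the most frequent pair; stop when none occurs twice.
            SymbolPair best{};
            std::size_t bestCount = 0;
            for (const auto& [p, c] : counts)
                if (c > bestCount) { best = p; bestCount = c; }
            if (bestCount < 2) break;

            // Replace non-overlapping occurrences from left to right.
            std::vector<Symbol> out;
            out.reserve(text.size());
            for (std::size_t i = 0; i < text.size();) {
                if (i + 1 < text.size() &&
                    text[i] == best.first && text[i + 1] == best.second) {
                    out.push_back(nextSymbol);
                    i += 2;
                } else {
                    out.push_back(text[i]);
                    ++i;
                }
            }
            text.swap(out);
            phrases.push_back(best); // this new symbol expands to 'best'
            ++nextSymbol;
        }
        return phrases;
    }

Running this on the byte values of an input, with nextSymbol set to the first value above the input alphabet (for example 256 for byte data), yields the shortened sequence plus the phrase table; decompression is then a matter of expanding phrase table entries in reverse order of creation, as in the scheme of [1].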