Universidade Da Coru˜Na New Compression Codes for Text
Total Page:16
File Type:pdf, Size:1020Kb
Universidade da Coru˜na Departamento de Computaci´on New Compression Codes for Text Databases Tese Doutoral Doctorando: Antonio Fari˜naMart´ınez Directores: Nieves R. Brisaboa e Gonzalo Navarro A Coru˜na,Outubro de 2004 i ii Ph. D. Thesis supervised by Tese doutoral dirixida por Nieves Rodr´ıguezBrisaboa Departamento de Computaci´on Facultade de Inform´atica Universidade da Coru˜na 15071 A Coru˜na(Espa˜na) Tel: +34 981 167000 ext. 1243 Fax: +34 981 167160 [email protected] Gonzalo Navarro Centro de Investigaci´onde la Web Departamento de Ciencias de la Computaci´on Universidad de Chile Blanco Encalada 2120 Santiago (Chile) Tel: +56-2-6892736 Fax: +56-2-6895531 [email protected] iii iv Abstract This page will contain a small Abstract of the contents of the thesis v Resumo Neste p´axinavai ir un pequeno resumo dos contidos da tese en galego. vi Acknowledgements I owe a great part of this thesis to my directors for their tireless support and their help during the whole work. They always told me the way to follow. This thesis is dedicated to my family, who was always there whenever I need them. Above all, I want to dedicate it to the most combative and strong people I have ever met. Thank you mum and dad. Do neither I want myself to forget my brother and sisters: Arosa, Manu and ’Mar´ıa’,nor the latest ones: Marina e Lexo. CAMBIAR !!! CAMBIAR !!! I cannot omit my LBD colleague (so, I am not going to give names). They did easy the difficult moments and brought the light of our friendship. Thanks also for you, Luisa, for the last months we share and the following that I hope will come. And all the others I did not name, but you know you have your part in this work. This is all for you. vii Agradecementos Grande parte desta tese d´ebollela aos meus directores de tese polo seu apoio incansable, e porque sempre me serviron de apoio e me marcaron o cami˜noque deb´ıade seguir. A tese vai adicada ´ami˜nafamilia, que sempre estivo al´ıcando eu os necesitaba, pero sobre todo qu´erollelaadicar ´asd´uaspersoas m´aisloitadoras e fortes que co˜nezodesde hai tanto tempo. Gracias mam´ae pap´a. Tampouco me quero olvidar dos meus irm´ansde toda a vida: Arosa, Manu e ’Mar´ıa’,nin dos m´aisnovos: Marina e Lexo. CAMBIAR !!! CAMBIAR !!! Non me quero olvidar de ning´undos meus compa˜neirosdo LBD (as´ıque non vou dar nomes). Eles fixeron m´aisdoados os momentos dif´ıcilese deron luz propia ´anosa amizade. Gracias tam´ena ti, Luisa, polos ´ultimosmeses que compartimos, e polos que espero vir´an. E a t´odolosdemais que non citei, pero sabedes que tam´entedes a vosa parte neste traballo. Vai por v´os. viii Aos meus pais e irm´ans Aos meus sobri˜nos:Mar´ıa,Lexo e Paula ix x Contents Contents 1 Introduction 1 1.1 Text Compression ........................ 1 1.1.1 Compression to space saving .............. 1 1.1.2 Compression to file transmission ............ 4 1.2 Objectives and contributions of the thesis ........... 6 1.3 Outline .............................. 7 2 Basic concepts 9 2.1 Chapter overview ......................... 9 2.2 Concepts on Information Theory ................ 9 2.2.1 The Kraft Inequality ................... 11 2.3 Redundancy and compression .................. 12 2.4 Entrophy in Context dependent messages ........... 13 2.5 Classification of Text Compression Techniques ......... 14 2.6 Measuring the efficiency of compression techniques ...... 18 xi Contents I Semi-static Compression 21 3 Compressed Text Databases 23 3.1 Chapter overview ......................... 23 3.2 Motivation ............................ 23 3.3 Inverted indexes ......................... 24 3.4 Compression schemes for Text Databases ........... 26 3.4.1 Direct access ....................... 26 3.4.2 Direct search ....................... 27 3.5 String matching .......................... 27 3.5.1 Boyer-Moore algorithm ................. 28 3.5.2 Shift-Or algorithm .................... 30 3.6 Summary ............................. 32 4 Semi-static text compression techniques 35 4.1 Chapter overview ......................... 35 4.2 Classic Huffman Code ...................... 36 4.2.1 Building a Huffman Tree ................ 36 4.2.2 Canonical Huffman tree ................. 39 4.3 Word-Based Huffman Compression ............... 40 4.3.1 Plain Huffman and Tagged Huffman Codes ...... 42 4.4 Searching Huffman Compressed Text .............. 44 4.4.1 Searching Plain Huffman Code ............. 46 4.4.2 Searching Tagged Huffman Code ............ 47 xii Contents 4.5 Other techniques ......................... 49 4.5.1 Byte Pair Encoding ................... 49 4.5.2 Burrows-Wheeler Transform ............... 51 4.6 Summary ............................. 55 5 End-Tagged Dense Code 57 5.1 Chapter overview ......................... 57 5.2 Introduction ............................ 57 5.3 End-Tagged Dense Code ..................... 58 5.4 Encoding and Decoding algorithms ............... 61 5.4.1 Encoding algorithm ................... 61 5.4.2 Decoding algorithm ................... 63 5.5 Using ETDC to bound Plain Huffman ............. 64 5.6 Empirical Results ......................... 65 5.7 Conclusions ............................ 66 6 The new technique: (s; c)-Dense Code 69 6.1 Chapter overview ......................... 69 6.2 Introduction ............................ 69 6.3 (s; c)-Dense Code ......................... 72 6.4 Optimal s and c values ...................... 76 6.4.1 Algorithm to get the optimal s and c values ...... 79 6.5 Encoding and Decoding algorithms ............... 83 6.5.1 Encoding algorithm ................... 83 xiii Contents 6.5.2 Decoding algorithm ................... 84 6.6 Empirical Results ......................... 85 6.6.1 Compression Ratio .................... 85 6.6.2 Encoding Time ...................... 86 6.6.3 Decompression Time ................... 89 6.7 Searching (s; c)-Dense Code ................... 90 6.8 Bounding Plain Huffman with (s; c)-Dense Code ....... 91 6.9 Conclusions ............................ 91 II Adaptive Compression 93 7 Dynamic Text Compression Techniques 95 7.1 Overview ............................. 95 7.2 Introduction ............................ 95 7.3 Statistical Dynamic Codes .................... 97 7.3.1 Dynamic Huffman Codes ................ 98 7.3.2 Arithmetic Codes ..................... 100 7.4 Prediction by Partial Matching ................. 103 7.5 Dictionary techniques ...................... 106 7.5.1 LZ77 ............................ 106 7.5.2 LZ78 ............................ 108 7.6 Summary ............................. 111 8 Dynamic Byte-oriented word-based Huffman code 113 xiv Contents 8.1 Chapter overview ......................... 113 8.2 Introduction ............................ 114 8.3 Word-based Dynamic Huffman Codes ............. 114 8.4 Method Overview ......................... 116 8.5 Data Structures .......................... 119 8.5.1 Definition of the tree data structures .......... 119 8.5.2 List of blocks ....................... 122 8.6 Huffman tree update algorithm ................. 125 8.7 Empirical Results: Character- versus word-oriented Huffman 130 8.8 Conclusions ............................ 130 9 Dynamic End-Tagged Dense Code 133 9.1 Chapter overview ......................... 133 9.2 Introduction ............................ 133 9.3 Method overview ......................... 134 9.4 Implementation Data structures ................ 137 9.4.1 Sender’s Data Structures ................ 137 9.4.2 Receiver’s Data Structures ............... 138 9.5 Sender’s and Receiver’s pseudocode .............. 139 9.6 Empirical Results ......................... 143 9.7 Conclusions ............................ 144 10 Dynamic (s; c)-Dense Code SIN TERMINAR !! 145 10.1 Chapter overview ......................... 146 xv Contents 10.2 Introduction ............................ 146 10.3 Dynamic (s; c)-Dense Codes ................... 147 10.3.1 Maintaining the s and c parameters optimal ...... 147 10.4 Pseudocode for parameters check and change ......... 149 10.5 Empirical Results ......................... 149 10.5.1 Semi-static Vs dynamic approach: Compression ratio 151 10.5.2 Comparative of Dynamic PH, Dynamic ETDC and Dynamic SCDC: compression speed and compression ratio ............................ 152 10.5.3 Comparative against gzip, bzip2 and arithmetic compressors ........................ 153 10.6 Conclusions ............................ 156 11 Conclusions and Future Work 157 A Publications and Other Research Results Related to the Thesis 159 A.1 !!!FALTA !!! ............................ 159 A.2 Publications ............................ 159 A.2.1 International Conferences ................ 159 A.2.2 National Conferences ................... 159 A.2.3 Journals and Book Chapters .............. 159 A.3 Research Stays .......................... 159 B Descripci´ondel Trabajo Realizado 161 B.1 !!!FALTA !!! ............................ 161 xvi Contents B.2 Introducci´on ............................ 161 B.3 Metodolog´ıaUtilizada ...................... 161 B.4 Conclusiones y Trabajo Futuro ................. 161 Bibliography 162 xvii Contents xviii List of Tables List of Tables 4.1 Codewords assigned to each symbol in Example 4.2.1 ..... 38 4.2 Codes for an uniform distribution. ............... 45 4.3 Codes for an exponential distribution. ............. 45 5.1 Format of codeword in both Tagged Huffman and End-Tagged Dense Code ............................ 59 5.2 Code assignment in End-Tagged Dense Code ......... 60 5.3 Codes for an uniform distribution. ............... 62 5.4 Codes for an exponential distribution. ............. 62 5.5 Comparison of compression