Universidade da Coruña — Departamento de Computación

New Compression Codes for Text Databases

Tese Doutoral

Doctorando: Antonio Fariña Martínez
Directores: Nieves R. Brisaboa e Gonzalo Navarro

A Coruña, Outubro de 2004

Ph. D. Thesis supervised by / Tese doutoral dirixida por

Nieves Rodríguez Brisaboa
Departamento de Computación, Facultade de Informática
Universidade da Coruña, 15071 A Coruña (España)
Tel: +34 981 167000 ext. 1243, Fax: +34 981 167160
[email protected]

Gonzalo Navarro
Centro de Investigación de la Web, Departamento de Ciencias de la Computación
Universidad de Chile, Blanco Encalada 2120, Santiago (Chile)
Tel: +56-2-6892736, Fax: +56-2-6895531
[email protected]

Abstract

This page will contain a small Abstract of the contents of the thesis

Resumo

Nesta páxina vai ir un pequeno resumo dos contidos da tese en galego.

Acknowledgements

I owe a great part of this thesis to my supervisors, for their tireless support and their help throughout the whole work. They always showed me the way to follow.

This thesis is dedicated to my family, who were always there whenever I needed them. Above all, I want to dedicate it to the strongest and most tenacious people I have ever met. Thank you, mum and dad.

Nor do I want to forget my brother and sisters: Arosa, Manu and 'María', or the newest arrivals: Marina and Lexo.

I cannot omit my LBD colleagues (so I am not going to give names). They made the difficult moments easier and brought the light of our friendship.

Thanks also to you, Luisa, for the last months we have shared and for those that I hope will come.

And all the others I did not name, but you know you have your part in this work.

This is all for you.

Agradecementos

Grande parte desta tese débollela aos meus directores de tese polo seu apoio incansable, e porque sempre me serviron de apoio e me marcaron o camiño que debía de seguir.

A tese vai adicada á miña familia, que sempre estivo alí cando eu os necesitaba, pero sobre todo quérollela adicar ás dúas persoas máis loitadoras e fortes que coñezo desde hai tanto tempo. Gracias mamá e papá.

Tampouco me quero olvidar dos meus irmáns de toda a vida: Arosa, Manu e 'María', nin dos máis novos: Marina e Lexo.

Non me quero olvidar de ningún dos meus compañeiros do LBD (así que non vou dar nomes). Eles fixeron máis doados os momentos difíciles e deron luz propia á nosa amizade.

Gracias tamén a ti, Luisa, polos últimos meses que compartimos, e polos que espero virán.

E a tódolos demais que non citei, pero sabedes que tamén tedes a vosa parte neste traballo.

Vai por vós.

Aos meus pais e irmáns
Aos meus sobriños: María, Lexo e Paula


Contents

1 Introduction 1

1.1 Text Compression ...... 1

1.1.1 Compression for space saving ...... 1

1.1.2 Compression for file transmission ...... 4

1.2 Objectives and contributions of the thesis ...... 6

1.3 Outline ...... 7

2 Basic concepts 9

2.1 Chapter overview ...... 9

2.2 Concepts on Information Theory ...... 9

2.2.1 The Kraft Inequality ...... 11

2.3 Redundancy and compression ...... 12

2.4 Entropy in context-dependent messages ...... 13

2.5 Classification of Text Compression Techniques ...... 14

2.6 Measuring the efficiency of compression techniques ...... 18


I Semi-static Compression 21

3 Compressed Text Databases 23

3.1 Chapter overview ...... 23

3.2 Motivation ...... 23

3.3 Inverted indexes ...... 24

3.4 Compression schemes for Text Databases ...... 26

3.4.1 Direct access ...... 26

3.4.2 Direct search ...... 27

3.5 String matching ...... 27

3.5.1 Boyer-Moore algorithm ...... 28

3.5.2 Shift-Or algorithm ...... 30

3.6 Summary ...... 32

4 Semi-static text compression techniques 35

4.1 Chapter overview ...... 35

4.2 Classic Huffman Code ...... 36

4.2.1 Building a Huffman Tree ...... 36

4.2.2 Canonical Huffman tree ...... 39

4.3 Word-Based Huffman Compression ...... 40

4.3.1 Plain Huffman and Tagged Huffman Codes ...... 42

4.4 Searching Huffman Compressed Text ...... 44

4.4.1 Searching Plain Huffman Code ...... 46

4.4.2 Searching Tagged Huffman Code ...... 47

4.5 Other techniques ...... 49

4.5.1 Byte Pair Encoding ...... 49

4.5.2 Burrows-Wheeler Transform ...... 51

4.6 Summary ...... 55

5 End-Tagged Dense Code 57

5.1 Chapter overview ...... 57

5.2 Introduction ...... 57

5.3 End-Tagged Dense Code ...... 58

5.4 Encoding and Decoding algorithms ...... 61

5.4.1 Encoding algorithm ...... 61

5.4.2 Decoding algorithm ...... 63

5.5 Using ETDC to bound Plain Huffman ...... 64

5.6 Empirical Results ...... 65

5.7 Conclusions ...... 66

6 The new technique: (s, c)-Dense Code 69

6.1 Chapter overview ...... 69

6.2 Introduction ...... 69

6.3 (s, c)-Dense Code ...... 72

6.4 Optimal s and c values ...... 76

6.4.1 Algorithm to get the optimal s and c values ...... 79

6.5 Encoding and Decoding algorithms ...... 83

6.5.1 Encoding algorithm ...... 83


6.5.2 Decoding algorithm ...... 84

6.6 Empirical Results ...... 85

6.6.1 Compression Ratio ...... 85

6.6.2 Encoding Time ...... 86

6.6.3 Decompression Time ...... 89

6.7 Searching (s, c)-Dense Code ...... 90

6.8 Bounding Plain Huffman with (s, c)-Dense Code ...... 91

6.9 Conclusions ...... 91

II Adaptive Compression 93

7 Dynamic Text Compression Techniques 95

7.1 Overview ...... 95

7.2 Introduction ...... 95

7.3 Statistical Dynamic Codes ...... 97

7.3.1 Dynamic Huffman Codes ...... 98

7.3.2 Arithmetic Codes ...... 100

7.4 Prediction by Partial Matching ...... 103

7.5 Dictionary techniques ...... 106

7.5.1 LZ77 ...... 106

7.5.2 LZ78 ...... 108

7.6 Summary ...... 111

8 Dynamic Byte-oriented word-based Huffman code 113

8.1 Chapter overview ...... 113

8.2 Introduction ...... 114

8.3 Word-based Dynamic Huffman Codes ...... 114

8.4 Method Overview ...... 116

8.5 Data Structures ...... 119

8.5.1 Definition of the tree data structures ...... 119

8.5.2 List of blocks ...... 122

8.6 Huffman tree update algorithm ...... 125

8.7 Empirical Results: Character- versus word-oriented Huffman 130

8.8 Conclusions ...... 130

9 Dynamic End-Tagged Dense Code 133

9.1 Chapter overview ...... 133

9.2 Introduction ...... 133

9.3 Method overview ...... 134

9.4 Implementation Data structures ...... 137

9.4.1 Sender’s Data Structures ...... 137

9.4.2 Receiver’s Data Structures ...... 138

9.5 Sender’s and Receiver’s pseudocode ...... 139

9.6 Empirical Results ...... 143

9.7 Conclusions ...... 144

10 Dynamic (s, c)-Dense Code 145

10.1 Chapter overview ...... 146


10.2 Introduction ...... 146

10.3 Dynamic (s, c)-Dense Codes ...... 147

10.3.1 Maintaining the s and c parameters optimal ...... 147

10.4 Pseudocode for parameters check and change ...... 149

10.5 Empirical Results ...... 149

10.5.1 Semi-static vs dynamic approach: Compression ratio 151

10.5.2 Comparative of Dynamic PH, Dynamic ETDC and Dynamic SCDC: compression speed and compression ratio ...... 152

10.5.3 Comparative against gzip, bzip2 and arithmetic compressors ...... 153

10.6 Conclusions ...... 156

11 Conclusions and Future Work 157

A Publications and Other Research Results Related to the Thesis 159

A.1 !!!FALTA !!! ...... 159

A.2 Publications ...... 159

A.2.1 International Conferences ...... 159

A.2.2 National Conferences ...... 159

A.2.3 Journals and Book Chapters ...... 159

A.3 Research Stays ...... 159

B Descripción del Trabajo Realizado 161

B.1 !!!FALTA !!! ...... 161

B.2 Introducción ...... 161

B.3 Metodología Utilizada ...... 161

B.4 Conclusiones y Trabajo Futuro ...... 161

Bibliography 162



List of Tables

4.1 Codewords assigned to each symbol in Example 4.2.1 ..... 38

4.2 Codes for a uniform distribution...... 45

4.3 Codes for an exponential distribution...... 45

5.1 Format of codeword in both Tagged Huffman and End-Tagged Dense Code ...... 59

5.2 Code assignment in End-Tagged Dense Code ...... 60

5.3 Codes for a uniform distribution...... 62

5.4 Codes for an exponential distribution...... 62

5.5 Comparison of compression ratios...... 66

6.1 Code assignment in (s, c)-Dense Code ...... 74

6.2 Comparative example among compression methods, for b=3 . 76

6.3 Comparison of compression ratios...... 86

6.4 Code generation time comparison ...... 88

6.5 Decompression speed comparison ...... 90

7.1 Compression of “abbabcabbbbc”, (P={a, b, c}), using LZW 110


8.1 Comparison of word-based and character-based dynamic approaches ...... 130

9.1 Compression ratios of dynamic versus semi-static techniques . 143

10.1 Compression ratio of dynamic versus semi-static techniques . 151

10.2 Comparison of dynamic ETDC, dynamic SCDC and dynamic PH ...... 153

10.3 Comparison of compression ratio against gzip, bzip2 and arithmetic compression ...... 154

10.4 Comparison of compression speed against gzip, bzip2 and arithmetic compression ...... 155

10.5 Comparison of decompression time against gzip, bzip2 and arithmetic technique ...... 155


List of Figures

2.1 Definitions of distinct types of codes ...... 11

2.2 JPEG as an example of a lossy compressor ...... 15

3.1 Structure of an inverted index ...... 25

3.2 Boyer-Moore elements description ...... 28

3.3 Example of Boyer-Moore searching ...... 29

3.4 Example of Shift-Or searching ...... 32

4.1 Building a classic Huffman tree ...... 38

4.2 Example of canonical Huffman tree ...... 40

4.3 Example of False Matchings in Plain Huffman ...... 43

4.4 Plain and Tagged Huffman trees for a uniform distribution 44

4.5 Plain and Tagged Huffman trees for an exponential distribution ...... 44

4.6 Searching Plain Huffman compressed text for pattern "red hot" ...... 47

4.7 Compression process in Byte Pair Encoding ...... 50

4.8 Direct Burrows-Wheeler Transform ...... 52


4.9 Whole compression using BWT, MTF and RLE-0 ...... 55

5.1 Comparison of Tagged Huffman and End-Tagged Dense Code 65

6.1 128 versus 230 stoppers with a vocabulary of 5, 000 words .. 71

6.2 Compressed text sizes and compression ratios for different s values...... 77

6.3 Size of the compressed text for different s values ...... 78

6.4 Vocabulary extraction and encoding phases ...... 87

6.5 Searching (5, 3)-Dense Code ...... 91

6.6 Compression ratio, compression speed and search speed comparison ...... 91

7.1 Sender and receiver processes in statistical dynamic text compression...... 99

7.2 Operation of an arithmetic compressor ...... 102

7.3 Compression using LZ77 ...... 107

7.4 Compression of the text “abbabcabbbbc” using LZ78 ..... 109

8.1 Dynamic process to maintain a well-formed 4-ary Huffman tree 118

8.2 Use of the data structure to represent a Huffman tree .... 122

8.3 Increasing of frequency of word e ...... 123

8.4 Distinct situations of increasing the frequency of a node ... 124

9.1 Transmission of message "the rose rose is beautiful beautiful" ...... 135

9.2 Transmission of words C, C, D and D having transmitted ABABB earlier...... 139

9.3 Reception of c3, c3, c4D# and c4 having received c1A#c2B#c1c2c2c3C# previously...... 140

9.4 Dynamic ETDC sender pseudocode ...... 141

9.5 Dynamic ETDC receiver pseudocode ...... 142

10.1 Algorithm to change s and c parameters if needed ...... 150


1

Introduction

1.1 Text Compression

Text compression techniques exploit redundancies in the text to represent it using less space [9]. The number and size of text collections have grown rapidly in recent years, mainly due to the widespread use of digital libraries, documental databases, office automation systems and the Web. Current text databases contain hundreds of gigabytes and the Web is measured in terabytes. Although the capacity of new storage devices grows fast and their cost decreases, the size of text collections increases faster. Moreover, cpu speed grows much faster than that of secondary memory devices and networks, so storing data in compressed form reduces i/o time, which is increasingly convenient even at the expense of some extra cpu time.

For these reasons, compression techniques have become attractive methods to save both space and transmission time. These two purposes correspond to two distinct compression scenarios, which are described next.

1.1.1 Compression for space saving

Decreasing the space needed to store data is important. However, if the compression scheme does not allow searching the compressed text directly, text retrieval over the compressed documents will be less efficient, due to the need to decompress them before searching. Even if the search is done via an index, and especially with compressed indexes, some text scanning is needed in the search process [31, 40]. Basically, compression techniques are well suited to Text Retrieval systems if they achieve good compression ratios while maintaining good search capabilities.

Classic compression techniques, like the well-known algorithms of Ziv and Lempel [60, 61] or the classic Huffman code [27], permit searching for words directly in the compressed text [41, 32], and empirical results showed that searching such text can take half the time of decompressing it and then searching. However, this compressed search is twice as slow as simply searching the uncompressed version of the text. Moreover, classic Huffman yields a poor compression ratio (over 60%). Other techniques, such as Byte Pair Encoding [21], obtain competitive search performance [51, 46] but still poor compression on natural language texts.

Classic Huffman techniques are character-based statistical two-pass techniques. They use a semi-static model. A first pass over the text to compress gathers symbols and their frequencies, which are used to compress the text in a second pass. Then the compressed text is stored along with a header where the correspondence between the source symbols and codewords is represented. This header will be needed at decompression time.

An excellent idea for compressing text was given by Moffat [34], who suggested that words, rather than characters, should be the source symbols to compress. A compression scheme using a semi-static word-based model and Huffman coding achieves very good compression (compression ratios around 25-30%). This improvement in compression ratio is due to the more biased distribution of word frequencies compared to that of characters. Moreover, since words are the atoms of search in Information Retrieval (IR), these compression schemes are particularly suitable for IR.

In [49], Moura et al. presented Plain Huffman Code, a word-based byte-oriented optimal prefix code. They also showed how to search a text compressed with a word-based Huffman code for a word or a phrase without decompressing it, in such a way that the search can be up to eight times faster than searching the plain uncompressed text. One of the keys to this efficiency is that the codewords are sequences of bytes rather than bits.

The fastest and simplest search algorithms presented in [49] work over a coding variant called Tagged Huffman Code, where a bit of each byte is used to signal the beginning of a codeword. Hence, only 7 bits of each byte are used for the Huffman code. Note that the use of a Huffman code over the remaining 7 bits is mandatory, as the flag is not useful by itself to make the code a prefix code.

Over Tagged Huffman, direct searching [49] is possible, that is, the pattern can be compressed and then searched for with any classical string matching algorithm. In Plain Huffman this is not possible, as the compressed pattern could occur in the compressed text without corresponding to a real occurrence of the searched word: the concatenation of two codewords may contain the codeword of another source symbol. This cannot happen in Tagged Huffman Code because of the bit that distinguishes the first byte of each codeword. For this reason, searching with Plain Huffman requires inspecting all the bytes of the compressed text, while Boyer-Moore type searching [12] (that is, skipping bytes) is possible over Tagged Huffman Code.

Tagged Huffman Code has a price in terms of compression performance: the compressed file grows by approximately 11%. Although Huffman is the optimal prefix code, Tagged Huffman Code largely underutilizes the representation. In [13] it is shown that, by signaling the last byte of each codeword instead of the first, the remaining 7 bits can be used in all their 128 combinations and the code is still a prefix code. Hence there are 128^i codewords of length i. The resulting code, called End-Tagged Dense Code, gets closer to the compression ratio obtained by Plain Huffman Code (about 1 percentage point of overhead). This code not only retains the ability to be searched with any string matching algorithm, but it is also extremely simple to build (using a sequential assignment of codewords) and permits a more compact representation of the vocabulary (nothing needs to be stored except the vocabulary words ordered by frequency). Thus, the advantages over Tagged Huffman Code are (i) better compression ratios, (ii) the same searching possibilities, (iii) simpler and faster coding and (iv) a simpler and smaller vocabulary representation.
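To illustrate how simple the sequential codeword assignment is, the following sketch maps the 0-based rank of a word in the frequency-sorted vocabulary to an End-Tagged Dense codeword, using the high bit of the last byte as the end tag. It is an illustrative reconstruction, not the implementation used in this thesis (Chapter 5 gives the precise definition).

```python
def etdc_encode(rank):
    """Sketch of End-Tagged Dense coding: encode the 0-based rank of a
    word (vocabulary sorted by decreasing frequency) as a sequence of
    bytes where only the last byte has its high bit set."""
    out = bytearray()
    out.append(128 + rank % 128)      # terminal byte: values 128..255
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)        # non-terminal bytes: values 0..127
        rank //= 128
    out.reverse()
    return bytes(out)

# The 128 most frequent words get 1-byte codewords, the next 128^2 words
# get 2-byte codewords, and so on.
print(etdc_encode(0).hex(), etdc_encode(127).hex(), etdc_encode(128).hex())
# -> 80 ff 0080
```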


However, it is possible to improve the compression ratio of End-Tagged Dense Code even further while maintaining all its good search features. In this work, (s, c)-Dense Code, a generalization of End-Tagged Dense Code, is presented. It improves its compression ratio by adapting the number of terminal and non-terminal symbols to the frequency distribution of the words in the corpus to be compressed. As a result, (s, c)-Dense Code compresses strictly better than End-Tagged Dense Code and Tagged Huffman Code, reaching compression ratios directly comparable to those of Plain Huffman Code (only 0.3 percentage points worse). At the same time, (s, c)-Dense Code retains all the simplicity and direct-search capabilities of End-Tagged Dense Code and Tagged Huffman Code.

1.1.2 Compression for file transmission

File transmission is another interesting scenario where compression techniques are very suitable.

In general, transmission of compressed data is usually composed of four processes: compression, transmission, reception, and decompression. The first two are carried out by a sender process and the last two by a receiver.

There are several interesting real-time transmission scenarios, however, where compression and transmission, as well as reception and decompression processes should take place concurrently. That is, the sender should be able to start the transmission of compressed data without preprocessing the whole text, and simultaneously the receiver should start reception and decompress the text as it arrives.

Real-time transmission is handled with so-called dynamic or adaptive compression techniques. Such techniques perform a single pass over the text (so they are also called one-pass) and begin compression and transmission as they read the data. This is not possible with two-pass techniques, since compression cannot start until the first pass over the whole text has been completed, which makes two-pass codes unsuitable for real-time transmission.

In the case of dynamic codes, searching capabilities are not as crucial as they are for the semi-static methods used in IR systems. Instead, dynamic techniques should achieve good compression ratios and good compression and decompression times.

The first interesting adaptive techniques, consisting of dynamic Huffman codes, were presented by Faller and Gallager in [19, 22], and were later improved in [29, 52]. Since they are one-pass techniques, the symbol frequencies and the codeword assignment are computed and updated on the fly by both sender and receiver during the whole transmission process. However, those methods were character- rather than word-oriented, and thus their compression ratios on natural language were poor (around 60%).

Currently, the most widely used adaptive compression techniques belong to the Ziv-Lempel family [9]. They are fast at compression and decompression; however, when applied to natural language text, the compression ratios they achieve are not that good (around 40%).

Developing an adaptive compression technique with a good compression ratio for natural language texts is a relevant problem. This work shows how to extend both End-Tagged Dense Code and (s, c)-Dense Code to build two new adaptive techniques, and evaluates their compression efficiency and processing cost. It is shown that the loss of compression is negligible with respect to the semi-static versions (compression ratios are about 30-34%) and that compression speed is good, which makes them an excellent alternative for adaptive natural language text compression. A dynamic word-based Huffman method was also built to compare against both Dynamic End-Tagged Dense Code and Dynamic (s, c)-Dense Code. This dynamic word-based Huffman technique is also described in detail, because it turned out to be an interesting contribution in itself. Being a Huffman method, it achieves slightly better compression than Dynamic (s, c)-Dense Code (by 0.3%) and Dynamic End-Tagged Dense Code (by 1.0%), but it is much slower: about 20% slower in compression and more than 35% slower in decompression.


1.2 Objectives and contributions of the thesis

The first objective of this work was to improve End-Tagged Dense Code by taking advantage of the frequency distribution of the text words. This objective was accomplished by developing the (s, c)-Dense Code.

The second objective was to adapt End-Tagged Dense Code, as well as our improvement of it, the (s, c)-Dense Code, to real-time transmission scenarios. Therefore the contributions of this work are:

• The development of the (s, c)-Dense Code, a powerful generalization of End-Tagged Dense Code. It reduces the compressed text size by 2 percentage points with respect to End-Tagged Dense Code, while maintaining its good features: i) easy and fast decompression of any portion of the compressed text, and ii) direct searching of the compressed text for any kind of pattern with a Boyer-Moore approach. Empirical results comparing (s, c)-Dense Code with other well-known and powerful codes, such as Tagged Huffman and Plain Huffman, are also presented.

• The adaptation of End-Tagged Dense Code to real-time transmission by developing the Dynamic End-Tagged Dense Code. It has only a 0.1% compression ratio overhead with respect to the semi-static End-Tagged Dense Code.

• The adaptation of (s, c)-Dense Code to real-time transmission by developing the Dynamic (s, c)-Dense Code. It has at most a 0.04% overhead in compression ratio with respect to the semi-static version of (s, c)-Dense Code.

• The empirical proof of the efficiency of all these methods, comparing them against well-known compression techniques such as gzip and bzip2. Specifically, to have a good dynamic statistical method to compare with, we developed a word-based byte-oriented dynamic Huffman method which, to the best of our knowledge, had never been implemented before. It is also a powerful and fast dynamic compression alternative, and therefore a minor contribution of this work.


1.3 Outline

First, in Chapter 2, some basic concepts about compression, as well as a taxonomy of compression techniques, are presented. After that, following the classification of compression techniques into those well suited to Text Retrieval and those well suited to transmission, the remainder of this thesis is organized into two parts.

Part one focuses on semi-static or two-pass compression techniques. Chapter 3 presents Compressed Text Databases, as well as the Text Retrieval systems that allow recovering documents from a Text Database, and shows how compression can be integrated into those systems.

In Chapter 4, a review of some classical text compression techniques is presented. In particular, the operation of the classic Huffman code [27] is shown. Then Plain Huffman and Tagged Huffman codes [49] are examined in detail, since these codes are the main competitors of End-Tagged Dense Code and (s, c)-Dense Code. Next, the way Huffman-compressed text can be searched is described. Finally, other techniques such as Byte Pair Encoding and the Burrows-Wheeler Transform are shown.

In Chapter 5, End-Tagged Dense Code [13], the basis of (s, c)-Dense Code, is fully described. Chapter 6 presents (s, c)-Dense Code, the main contribution of this thesis. Empirical results regarding compression ratio as well as encoding and decompression times are presented, and (s, c)-Dense Code is compared with both End-Tagged Dense Code and Huffman-based techniques.

Part two, once the word-based semi-static techniques have been explained, considers dynamic compression codes. An introduction to classic dynamic techniques, paying special attention to dynamic Huffman codes, is given in Chapter 7. That Chapter also includes a review of arithmetic codes, dictionary-based techniques and a predictive approach, PPM.

Chapter 8 describes the dynamic word-based byte-oriented Huffman code. Empirical results comparing this code with a character-based dynamic Huffman code are also shown.

In Chapter 9, the dynamic version of End-Tagged Dense Code is considered. Its compression/decompression processes are described and compared with the Huffman based ones.

Chapter 10 considers dynamic (s, c)-Dense Code. The operation of this new method, and the way the parameters s and c are adapted, are described. Empirical results of systematic experiments over real corpora, comparing all the presented techniques against well-known and widely used compression methods such as gzip, bzip2 and an arithmetic compressor, are also presented.

Finally, Chapter 11, presenting the conclusions and future lines of work, completes this thesis.

2

Basic concepts

2.1 Chapter overview

This Chapter presents some basic concepts that are needed for a better understanding of this thesis. A brief description of several concepts related to Information Theory is given first. Then a taxonomy of compression techniques is provided in Section 2.5. Finally, some measures that can be used to compare compression techniques are presented in Section 2.6.

2.2 Concepts on Information Theory

Text compression techniques divide the source text into small portions that are then represented using less space. The basic units into which the text to be compressed is partitioned are called source symbols, and the set of all the distinct source symbols that appear in the text is called the vocabulary. The size of the vocabulary is denoted by n.

An encoding scheme defines how a source symbol is encoded, that is, how it is mapped to a codeword. This codeword is composed of one or more target symbols from a target alphabet β. The number of elements of the target alphabet is commonly b = 2 (binary code, β = {0, 1}). Compression consists of substituting every source symbol that appears in the input text by the codeword associated with it by the encoding scheme. The process of recovering the source symbol that corresponds to a given codeword is called decoding.

Depending on the size of the codewords they generate, compression techniques can be classified into two big groups: fixed-length and variable-length methods. The most common variable-length codes are the statistical compression techniques, which assign shorter codewords to the most frequent source symbols (e.g., Huffman [27] and arithmetic codes [1, 57]). Dictionary techniques, however, usually generate fixed-length codewords [60, 61, 54].

A code is a distinct code if each codeword is distinguishable from every other. A code is said to be uniquely decodable if every codeword is identifiable when immersed in a sequence of codewords. Let us consider a vocabulary of three symbols A, B, C and an encoding scheme that maps A → 0, B → 1, C → 11. It is a distinct code, since the mapping from source messages to codewords is one to one, but it is not uniquely decodable, because the sequence 11 can be decoded as BB or as C. The mapping A → 1, B → 10, C → 100, in contrast, is uniquely decodable. However, a lookahead is needed during decoding: a bit 1 is decoded as A if it is followed by another 1; the sequence 10 is decoded as B if it is followed by another 1; finally, the sequence 100 is always decoded as C.

A uniquely decodable code is called a prefix code (or prefix-free code) if no codeword is a proper prefix of any other codeword. Given a vocabulary with source symbols A, B, C, the mapping A → 0, B → 10, C → 110 produces a prefix code.

Prefix codes are instantaneously decodable. That is, an encoded message can be partitioned into codewords without any lookahead. This property is really important, since it permits decoding a codeword without having to inspect the following codewords in the encoded message, and therefore it improves decompression speed. For example, the encoded message 010110010010 is unambiguously decoded as ABCABAB.
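A minimal sketch of this instantaneous decoding, using the example prefix code above (the names and structure are illustrative only):

```python
# Prefix code from the example: A -> 0, B -> 10, C -> 110.
CODE = {"0": "A", "10": "B", "110": "C"}

def decode(bits):
    """Decode a bit string; with a prefix code a codeword can be emitted
    as soon as it is recognized, without any lookahead."""
    output, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in CODE:
            output.append(CODE[buffer])
            buffer = ""
    return "".join(output)

print(decode("010110010010"))   # -> ABCABAB
```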

A prefix code is said to be a minimal prefix code if, whenever x is a proper prefix of some codeword, then xα is either a codeword or a proper prefix of a codeword, for every target symbol α of the target alphabet β. For example, the code that maps A → 0, B → 10 and C → 110 is not a minimal prefix code: 11 is a proper prefix of the codeword 110, but 111 is neither a codeword nor a proper prefix of one. If the mapping C → 110 is replaced by C → 11, the code becomes a minimal prefix code. The minimality property avoids using codewords longer than needed.

Figure 2.1: Definitions of distinct types of codes — the code trees of a distinct code, a uniquely decodable code, a prefix code and a minimal prefix code.

Figure 2.1 illustrates the distinct types of codes just described.

2.2.1 The Kraft Inequality

When trying to find an optimal prefix code, it is important to know under which conditions such a code exists. Kraft's theorem [30], presented in 1949, gives that information.

Theorem 2.2.1 There exists a binary prefix code with codewords {c_1, c_2, ..., c_n} and corresponding codeword lengths {l_1, l_2, ..., l_n} iff Σ_{i=1}^{n} 2^{−l_i} ≤ 1.

That is, Kraft's theorem guarantees that, given a set of codeword lengths satisfying the inequality, it is possible to find a prefix code with those lengths. If the inequality does not hold, no prefix code with those codeword lengths exists, and some of them will have to be increased until the inequality is satisfied.


Note that when the inequality holds with equality, the codeword lengths are minimal, and therefore a minimal prefix code can be obtained.

Kraft's theorem says that, for some given codeword lengths satisfying the inequality, a prefix code can be obtained. Note, however, that a non-prefix code can also be built with those same codeword lengths. Therefore, the fact that a code satisfies Kraft's inequality does not imply that the code is a prefix code.
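The inequality is straightforward to check computationally; a small illustrative sketch:

```python
def kraft_sum(lengths, b=2):
    """Sum of b^(-l_i) over the given codeword lengths. By Kraft's
    theorem, a prefix code with these lengths exists iff the sum <= 1."""
    return sum(b ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))   # 1.0  -> lengths of a minimal prefix code
print(kraft_sum([1, 1, 2]))      # 1.25 -> no prefix code with these lengths
```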

2.3 Redundancy and compression

Compression techniques are based on reducing the redundancy of the source messages while maintaining the source information. In [2], the information content of a message x_i was defined as I(x_i) = −log_b p(x_i), where b is the number of symbols of the target alphabet (b = 2 if a bit-oriented technique is used). This definition assumes that the probability of occurrence of a message x_i does not depend on the messages that appeared previously. From the definition of I(x_i), it can be seen that:

• If p(x_i) is high (p(x_i) → 1), then the information content of x_i is almost zero, since the occurrence of x_i gives very little information.

• If p(x_i) is low (p(x_i) → 0), then x_i is a source message that rarely appears, and its occurrence carries maximal information content.

The average information content of the source vocabulary can be computed by weighting the information content of each source message x_i by its probability of occurrence p(x_i), which yields Σ_{i=1}^{n} −p(x_i) log_2 p(x_i). This expression is referred to as the entropy of the source [45] and is denoted by H. That is,

H = − Σ_{i=1}^{n} p(x_i) log_2 p(x_i)

Since the length of the codeword associated with the message x_i has to be large enough to represent I(x_i), the entropy gives a lower bound on the number of bits required to encode the whole source text.

As shown, compression techniques try to reduce the redundancy of the source messages. Let l(x_i) be the length of the codeword assigned to symbol x_i during the encoding phase; redundancy can then be defined as follows:

R = Σ_{i=1}^{n} p(x_i) l(x_i) − H = Σ_{i=1}^{n} p(x_i) l(x_i) + Σ_{i=1}^{n} p(x_i) log_b p(x_i)

Therefore, redundancy is a measure of the difference between the average codeword length and the entropy. Since the entropy takes a fixed value for a given probability distribution, a good code has to reduce the average codeword length. A code is said to be a minimum redundancy code if it has minimum average codeword length.
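A small sketch of these two definitions (the probabilities and codeword lengths below are hypothetical):

```python
import math

def entropy(probs, b=2):
    """H = -sum p(x_i) log_b p(x_i)."""
    return -sum(p * math.log(p, b) for p in probs if p > 0)

def redundancy(probs, lengths, b=2):
    """Average codeword length minus the entropy."""
    avg_length = sum(p * l for p, l in zip(probs, lengths))
    return avg_length - entropy(probs, b)

# Three symbols with probabilities 0.5, 0.25, 0.25 encoded with the
# prefix code 0, 10, 11 (lengths 1, 2, 2).
print(entropy([0.5, 0.25, 0.25]))                 # 1.5 bits per symbol
print(redundancy([0.5, 0.25, 0.25], [1, 2, 2]))   # 0.0 -> minimum redundancy
```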

2.4 Entropy in context-dependent messages

The definitions in the previous section treat source messages as if their occurrences were independent. However, it is usually possible to model the probability of the next source symbol x_i more precisely by using the source symbols that appeared before x_i.

The context of a source symbol x_i is defined as a fixed-length sequence of source symbols that precedes x_i.

Depending on the length of the context used, different models of the source text can be built. When the context is formed by m symbols, an m-order model is said to be used.

In general, the probability of a source symbol x_i is obtained from its number of occurrences (a zero-order model). When an m-order model is used instead, the estimated probability is usually more accurate than with a lower-order model.

Several distinct k-order models can be combined to estimate the probability of the next input symbol. In this situation, a method describing how the frequency estimation is carried out must be chosen. In [18, 33, 9], a technique called Prediction by Partial Matching (PPM) is presented, which combines several finite-context models of order k, with k ranging from 0 to m (m being the maximum context length). For each model, it keeps track of all the k-length sequences S_i that have appeared previously. For a complete description of PPM see Section 7.4.

Depending on the order of the model, the entropy expression varies:

• Base-order models. All the source symbols are considered independent and uniformly distributed. Then H = log_2 n.

• Zero-order models. All the source symbols are independent and their probabilities are estimated from their number of occurrences. Therefore, H = − Σ_{i=1}^{n} p(x_i) log_2 p(x_i).

• First-order models. The probability of occurrence of the symbol x_j conditioned on the previous occurrence of the symbol x_i is denoted by P_{x_j|x_i}, and the entropy is computed as H = − Σ_{i=1}^{n} p(x_i) Σ_{j=1}^{n} P_{x_j|x_i} log_2 P_{x_j|x_i}.

• Second-order models. The probability of occurrence of the symbol x_k conditioned on the occurrence of the sequence x_i x_j is denoted by P_{x_k|x_j,x_i}, and the entropy is computed as H = − Σ_{i=1}^{n} p(x_i) Σ_{j=1}^{n} P_{x_j|x_i} Σ_{k=1}^{n} P_{x_k|x_j,x_i} log_2 P_{x_k|x_j,x_i}.

• Higher-order models follow the same idea.

Given H, the redundancy can be calculated as R = 1 − H / log_2 n.
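As an illustration of the difference between a zero-order and a first-order model, the following sketch estimates both entropies from the raw counts of a sample string (a maximum-likelihood estimate; the sample text is made up):

```python
from collections import Counter
import math

def order0_entropy(text):
    """Zero-order entropy: symbols treated as independent."""
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())

def order1_entropy(text):
    """Entropy of each symbol conditioned on the preceding one,
    estimated from bigram counts (a first-order model)."""
    n = len(text) - 1
    bigrams = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    h = 0.0
    for (a, b), c in bigrams.items():
        p_ab = c / n                    # p(a, b)
        p_b_given_a = c / contexts[a]   # p(b | a)
        h -= p_ab * math.log2(p_b_given_a)
    return h

sample = "abracadabra abracadabra abracadabra"
print(order0_entropy(sample))   # zero-order estimate (bits per symbol)
print(order1_entropy(sample))   # first-order estimate, lower on this sample
```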

2.5 Classification of Text Compression Techniques

Text compression is based on processing the source data so as to represent it in a way that decreases space requirements. As a result, the source information is maintained, but its redundancy is reduced. Decompressors act over the compressed data and permit recovering the original version of the data.


Figure 2.2: JPEG as an example of a lossy compressor — an original image of 7,100 bytes compressed with qualities 25%, 8% and 4%, yielding 1,340, 1,028 and 841 bytes (compression ratios of 18.8%, 14.4% and 11.8%, respectively).

In some scenarios, some loss of source information can be allowed during compression. Depending on the possibility of recovering the whole source information during the decompression process, compression techniques can be classified into two large groups:

• Lossy compressors: In general, they are used to represent analog data in digital format. Compression of images (jpeg, mpeg) or sound (mp3) are typical scenarios. In these situations, some loss of the original analog data can be permitted, because human visual and auditory perception cannot appreciate the differences between the original and the decompressed data. Figure 2.2 shows how the jpeg image format can be used as a lossy compressor: the compression ratio can be reduced to 14.4% (third image from the left) without much loss of information. Since they do not have to recover exactly the original data, but only a version with negligible differences, the compression ratios obtained are clearly better than those of the following group of compressors.

• Lossless methods: With this kind of technique it is mandatory to recover exactly the original data after decompression. Typical scenarios where lossless techniques are required are text compression, filesystem compression, database records, etc.

Compression techniques model the source data to obtain an internal representation of the input symbols (which is used, for example, to obtain the probability of any source symbol) and then apply a compression scheme that encodes those symbols. The correspondence between symbols and codewords has to be known by the decompressor so that it is able to recover the original source data. Depending on the model used, compression techniques can be classified as using:

• Static or non-adaptive models. The assignment of frequencies to each source symbol is fixed: these techniques use previously computed probability tables (usually based on experience) during the encoding process. Being fixed, those probabilities may match the source data badly in general, so these techniques are usually suitable only for specific scenarios. Examples of application are the JPEG image standard and the Morse code.

• Semi-static models. They are also commonly called probabilistic or two-pass techniques. These methods perform a first pass over the source text to obtain the probabilities of all the distinct source symbols that compose the vocabulary. Those probabilities then remain fixed and are used during the second pass, where the source text is processed again and each source symbol is assigned a code whose length depends on its probability. These techniques present two main disadvantages. The first one is that the source text has to be processed twice, and therefore encoding cannot start before the whole first pass has been completed; this makes it impossible to apply semi-static techniques to compress text streams. The second one is the need to provide the decompressor with the probabilities obtained by the compressor during the first pass. Although this is not a problem when large texts are compressed (as Heaps' Law [23] shows, in that case the size of the vocabulary is negligible compared to the compressed text size), it causes an important loss of compression ratio when semi-static techniques are applied to small texts. Huffman-based codes [27] are the main representatives of compressors that use a two-pass model.

• Dynamic models. They are also known as one-pass or adaptive techniques, because they do not perform an initial pass to obtain the probabilities of the source symbols. These techniques commonly start with an initially empty vocabulary. The source text is then read one symbol at a time; each time a symbol is read, it is encoded using the current frequencies and its number of occurrences is increased. Moreover, each time a new symbol is input, it is appended to the vocabulary. Note that the compression process adapts the frequency of each symbol as compression progresses.

Other advantages of one-pass techniques are their ability to compress streams of text, and the fact that the decompressor adapts the mapping between symbols and codewords in the same way the compressor does. This adaptation can be done by just taking into account the sequence of symbols already decoded; therefore, there is no need to explicitly include that mapping as an extra part of the compressed data.

Techniques such as the Ziv-Lempel family [60, 61, 54] and arithmetic encoding¹ [1, 57, 38] are well-known dynamic compression techniques.

Another classification of compression techniques can be done depending on how the encoding process takes place. Two families are defined: statistical and dictionary based techniques.

• Statistical methods assign to each source symbol a code whose length depends on the probability of the source symbol. Shorter codes are assigned to the more frequent symbols. Some typical statistical techniques are the Huffman based codes and the arithmetic methods.

• Dictionary techniques build a dictionary during compression, in which recently seen phrases of the text are stored. Encoding is performed by substituting those phrases with short pointers to their position in the dictionary. Compression methods from the Ziv-Lempel family are the most commonly used dictionary techniques.

¹Arithmetic encoding is typically used along with a dynamic model, even though static and semi-static models are also applicable.


2.6 Measuring the efficiency of compression techniques

In order to measure the efficiency of a compression technique, two basic concerns have to be taken into account: the complexity (both temporal and spatial) of the algorithms involved, and the compression achieved.

The complexity gives an idea of how the technique will behave, but it is also necessary to obtain empirical results that permit directly comparing the technique against other methods in real scenarios.

Two common measures to compare the efficiency of compression methods are compression and decompression time, and compression and decompression speed.

• Compression and decompression time are usually measured in seconds or milliseconds.

• Compression and decompression speed measure the throughput achieved. Common speed units are kilobytes per second and megabytes per second.

Assuming that i is the size in bytes of the compressed text, that the source text occupies o bytes, and that s is the number of bits used to represent a symbol in the source text, there are several ways to express the compression achieved by a compression technique. The most usual measures, illustrated by the sketch below, are:

• Compression ratio. It represents the percentage that the compressed text occupies with respect to the original text size. It is computed as: compression ratio = (i / o) × 100.

• Compression rate. It indicates the decrease in space needed by the compressed text with respect to the source text. It is computed as: compression rate = 100 − compression ratio = ((o − i) / o) × 100.

• Bits per symbol (bps). It compares the number of bits used to represent a source symbol in the original text against the average number of bits used to represent it in the compressed text. It is computed as: bps = (i × s) / o, where s is the number of bits needed to represent a symbol of the source alphabet. Since the source alphabet typically contains characters, it often holds s = 8.
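A minimal sketch of these measures (the sizes used in the example are hypothetical):

```python
def compression_ratio(i, o):
    """Percentage of the original size o occupied by the compressed size i."""
    return i / o * 100

def compression_rate(i, o):
    """Space saved, as a percentage of the original size."""
    return 100 - compression_ratio(i, o)

def bits_per_symbol(i, o, s=8):
    """Average number of bits per source symbol in the compressed text."""
    return i * s / o

# A 1,000,000-byte text compressed down to 300,000 bytes:
print(compression_ratio(300_000, 1_000_000))   # 30.0 (%)
print(compression_rate(300_000, 1_000_000))    # 70.0 (%)
print(bits_per_symbol(300_000, 1_000_000))     # 2.4 bits per character
```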


Part I

Semi-static Compression


3

Compressed Text Databases

3.1 Chapter overview

This Chapter presents the existence and diffusion of large Text Databases, and the importance of efficient Text Retrieval Systems to recover the relevant data stored in them. The main advantages that compression techniques provide to Text Retrieval Systems, as well as the way those techniques can be integrated into such systems and into their commonly used inverted indexes, are also shown.

Finally, the problem of text searching is introduced, and some commonly used pattern matching algorithms are reviewed.

3.2 Motivation

A Document Database can be seen as a very large collection of documents and a set of tools that permit managing it and of course, searching inside it efficiently.

Libraries can be used as an analogy of a Document Database. They store lots of books, and a user can go there to find information related to some issue of interest. However, users do not usually have enough time to review all the books in the library in order to find that information. Therefore, it is necessary to give them some kind of tool (e.g., a catalog) that enables them to rapidly find the most relevant books.

A Digital Library, the set of articles published by a digital newspaper, or the Web in general, are examples of Document Databases. Since the number of documents stored is really large, the amount of space needed to store them is also enormous (and usually increases over time). The use of compression techniques reduces the size of the documents, and consequently the amount of space needed to store them, at the expense of some cpu time needed for compression and, especially, decompression.

However, saving storage space is not the only concern. A Text Retrieval System working against a Text Database has to provide efficient searching and retrieval of documents. Indexes are the most commonly used retrieval structures for this purpose.

3.3 Inverted indexes

An inverted index is a data structure that permits fast retrieval of all the positions in a Text Database where a searched term appears.

Basically, an inverted index maintains a vector of terms or vocabulary in which all terms (usually words) are stored. For each of those terms t_i, a list of occurrences keeping all the positions where t_i appears is also stored.

Note that, depending on the granularity used, the length of the lists of occurrences can vary. If the level of granularity is coarse (a document), the lists of occurrences are much smaller than when it is fine (a document and an offset inside it): a term can appear many times in the same document, and with fine granularity each occurrence has to be stored in the inverted index. Therefore, using coarser granularity reduces the size of the whole index, which speeds up some searches and increases the chances of keeping the whole index in main memory.

Figure 3.1: Structure of an inverted index — the sorted list of terms (vocabulary) and, for each term, its list of occurrences pointing into the documents of the collection.

For example, in Figure 3.1 the list of occurrences of the term 'compression' stores 4 pointers. Each pointer references the exact location (document and offset inside it) of the term. If the level of granularity used for the lists of occurrences had been the document, only 3 pointers would be needed, since 'compression' appears in only three documents.

A search in the inverted index starts by finding the searched term in the vocabulary. If the list of terms is kept sorted, this takes O(log n) time, since a binary search is possible. Then the list of occurrences is used to access the documents.
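A toy sketch of a word-addressing inverted index along these lines (the documents and names are illustrative only; real systems also compress the occurrence lists, handle stopwords, etc.):

```python
from bisect import bisect_left
from collections import defaultdict

def build_index(docs):
    """For each term, store the (document, offset) pairs where it occurs;
    keep the vocabulary sorted so that binary search is possible."""
    occurrences = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for offset, word in enumerate(text.lower().split()):
            occurrences[word].append((doc_id, offset))
    return sorted(occurrences), occurrences

def search(vocabulary, occurrences, term):
    i = bisect_left(vocabulary, term)          # O(log n) vocabulary lookup
    if i < len(vocabulary) and vocabulary[i] == term:
        return occurrences[term]               # word-level granularity
    return []

docs = ["compression saves space",
        "dense codes allow direct search",
        "text compression for text databases"]
vocabulary, occurrences = build_index(docs)
print(search(vocabulary, occurrences, "compression"))   # [(0, 0), (2, 1)]
```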

The list of occurrences is used in a distinct way depending on the granularity of the index:

1. Level of word (word addressing index): In this case, all the terms that match the searched pattern are located in the vocabulary. Then the lists of occurrences indicate not only the documents, but also the offsets inside the documents where those terms appear.

2. Level of document (document index): The lists of occurrences point directly to documents. However, to find the exact position where a term appears in a document, it is necessary to search for it sequentially in all the pointed documents.

3. Level of block (block addressing indexes) [31, 40]: In this case the lists of occurrences point to blocks, and a block can hold several documents. Hence all searches require inspecting all the bytes of the pointed blocks in order to know in which documents the searched pattern appears. Block addressing indexes take special advantage of compression: since a compressed document requires less space, more documents can be held in the same block, which considerably reduces the size of the inverted index.

In the case of block addressing indexes (and also sometimes in document addressing indexes) a sequential pattern matching algorithm has to be used in order to sequentially search for a term through all documents in a block (or document). In Section 3.5 several string matching techniques are explained.

Compression has been used along with inverted indexes with good results [40, 63]. Text compression techniques have to possess some characteristics for this symbiosis to be productive. These characteristics are shown in Section 3.4.

3.4 Compression schemes for Text Databases

Not all compression techniques are suitable for use along with inverted indexes. Compression schemes have to permit two main operations: direct access and direct search in the compressed text.

3.4.1 Direct access

The direct access property deals with the possibility of decompressing a random snippet of a compressed text without having to process it from the beginning.

In some situations we are interested in retrieving only a small context of the searched term, not the whole document. For example, a web search engine presents to the user a list of relevant links and a small snippet of each document as the context where the searched term appears.


If the compression scheme does not permit direct access, once a document has been located via the index, it is necessary to start decompressing the document from the beginning, and proceed until the searched term is found. Finally a small part (snippet) of the document is retrieved around the searched term.

However, if the compression scheme permits direct access, it is possible to have an index over the compressed text. In that case, the index returns the position inside the compressed document, and only a small portion of the compressed text around the searched term needs to be decompressed.

3.4.2 Direct search

Clearly, it is not desirable to have to decompress the text before searching it.

When a compression scheme supports direct search, a given pattern can be compressed and then searched for directly in the compressed text with any sequential string matching algorithm [12, 28, 50]. Direct search, as shown by Navarro et al. [48, 49], clearly improves searching performance: it is up to 8 times faster for approximate searches than searching the uncompressed text.

The use of a compression scheme allowing direct search, permits adding compression to block addressing indexes, as it is shown in [40].

Therefore, using compression techniques that permit direct search and direct access is very useful: it enables keeping the text in compressed form all the time (which saves space), while also improving the efficiency of Text Retrieval techniques.

3.5 String matching

The string matching problem basically consists of finding all the occurrences of a given pattern p in a larger text t [12, 28, 50].

Figure 3.2: Boyer-Moore elements — the text t, the pattern p of length m aligned below it, the matched suffix µ, the mismatching characters σ (in the text) and α (in the pattern), and the shift of the pattern to the right.

Patterns can be classified as follows: i) words (car, dog); ii) prefixes and suffixes (under-, -ation); iii) substrings (-ompressi-); iv) alphabetical ranges ([a, b) returns words starting with 'a'); v) patterns allowing errors (color with edit distance 1 returns words whose edit distance to color is at most 1, e.g., colour); and vi) regular expressions ((car|bike)(0|1)* returns several terms, e.g., car, car0, car1, car01, bike, ...).

In this Section some commonly used pattern matching techniques are presented. For more information about other pattern matching techniques and about how to search for several patterns (multiple string matching) see [42], where a good survey is presented.

3.5.1 Boyer-Moore algorithm

This algorithm uses a search window of the same length as the searched pattern, which is moved along the text. The algorithm starts by aligning the pattern p (we define m as the size of p) with the leftmost part of the text t, and then compares the pattern against the search window right to left until either a mismatch is found or the pattern matches the whole search window. In each step, the longest possible safe shift of the pattern to the right is performed.

Let us suppose that µ is a suffix of the search window that is also a suffix of the pattern, and that the character σ of the text does not match the character α of the pattern. Figure 3.2 shows those elements. Then three shift functions are computed:

δ1: If the suffix µ appears in another position of p, then δ1 is the distance from α to the next occurrence of µ, backwards in the pattern, that is not preceded by α.

δ2: If the suffix µ does not appear in any other position of the pattern, then the suffixes ν of µ are taken into account, since they could also be a prefix of the pattern in the next step. In that case δ2 is assigned the length of the longest prefix ν of p that is also a suffix of µ.

δ3: It is the distance from the rightmost occurrence of σ in the pattern to the end of the pattern. If σ does not appear in p, then δ3 = m.

Having computed the δi values, the algorithm computes M = max(δ1, δ3) and then min(M, m − δ2). The latter value is used to slide the search window before the next iteration. However, note that when a full match of the pattern is found in the text, only the δ2 function is used.

Figure 3.3: Example of Boyer-Moore searching — the pattern 'EXAMPLE' aligned against the text 'HERE IS A SIMPLE EXAMPLE' at the successive window positions (steps 0 to 4) of Example 3.5.1.

An example of the way Boyer-Moore searching works is given in Example 3.5.1 and can also be followed in Figure 3.3.

Example 3.5.1 Let us search for the pattern 'EXAMPLE' inside the text 'HERE IS A SIMPLE EXAMPLE'. In step 0, the search window contains 'HERE IS'. Since 'E' ≠ 'S', we have σ = 'S', α = 'E' and µ = ∅. We obtain δ1 = δ2 = 0 and δ3 = m = 7, hence the pattern is shifted 7 positions to the right.

In step 1, the search window contains ' A SIMP'. Since 'E' ≠ 'P', we have σ = 'P', α = 'E' and µ = ∅. We obtain δ1 = δ2 = 0 and δ3 = 2 ('P' appears 2 positions from the end of the pattern), hence the pattern is shifted 2 positions to the right.


In step 2, the search window contains ' SIMPLE'. Then σ = 'I', α = 'A' and µ = 'MPLE'. We obtain δ1 = 0, δ2 = 1 (since the longest prefix of the pattern that is a suffix of µ is 'E') and δ3 = 7 (since 'I' does not appear in the pattern). Therefore the pattern is shifted min(max(0, 7), 7 − 1) = 6 positions to the right.

In step 3, the search window contains 'E EXAMP'. Since 'E' ≠ 'P', we have σ = 'P', α = 'E' and µ = ∅. We obtain δ1 = δ2 = 0 and δ3 = 2 ('P' appears 2 positions from the end of the pattern), hence the pattern is shifted 2 positions to the right.

Finally, the pattern and the search window match in step 4. The pattern is then shifted m = 7 positions to the right and the process ends.

The Boyer-Moore search is O(mn) in the worst case. On average, its complexity is sublinear.

Variations of the original Boyer-Moore algorithm, such as the Horspool algorithm and the Sunday variant, are described in [25, 50, 42].
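As an illustration, here is a minimal sketch of the Horspool simplification, which keeps only a bad-character shift (similar in spirit to δ3 above); the function name and structure are illustrative, not the exact variants analyzed in the cited works:

```python
def horspool_search(pattern, text):
    """Boyer-Moore-Horspool: slide the window using only the character
    aligned with the last pattern position."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Distance from the rightmost occurrence of each character (the last
    # pattern position excluded) to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    matches, pos = [], 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:        # compared right to left in practice
            matches.append(pos)
        pos += shift.get(text[pos + m - 1], m)  # unseen characters shift by m
    return matches

print(horspool_search("EXAMPLE", "HERE IS A SIMPLE EXAMPLE"))   # [17]
```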

3.5.2 Shift-Or algorithm

The Shift-Or algorithm [58, 4] uses bitwise techniques. The algorithm simulates a nondeterministic automaton that searches for the pattern in the text. Its key idea is to maintain the set of all prefixes of the pattern p that match a suffix of the text read so far. This set is represented with a bit mask D = dm dm−1 . . . d1 (where di denotes the i-th least significant bit of D). If the searched pattern is not larger than the computer word size w (current architectures have 32 or 64 bits), then the bit mask fits completely in a cpu register and the Shift-Or algorithm becomes really efficient.

Let m be the size of the pattern and σ the number of characters of the alphabet Σ. The algorithm starts by building a table B of size σ × m as follows: columns are labelled, from right to left, with the letters of the pattern, and rows are labelled with all the symbols of Σ. Then each table entry is filled in as:

    B[i, j] = 0   if rowLabel[i] = columnLabel[j]
    B[i, j] = 1   otherwise


Example 3.5.2 Filling the B table when the pattern ‘bcab’ is searched for in the text ‘abcabx’.

    pattern →   b   a   c   b
            a   1   0   1   1
    Σ       b   0   1   1   0
            c   1   1   0   1
            x   1   1   1   1

To fill in the table B, only the pattern needs to be examined: a 0 is set in those positions where the row and the column are labelled with the same letter. For example, the letter ‘a’ is the third letter of the pattern, so the third column (columns are numbered from right to left) is labelled ‘a’ and the entry B(‘a’, 3) is set to 0. After placing the zeros in the appropriate rows and columns, the remaining entries of the table B are set to 1.

Once the B table has been filled, the searching phase proceeds. A search state register D of m bits is initialized with all ones; a 0 in a given bit will represent that the corresponding prefix of the pattern matches the text read so far. Then the letters of the text are read one at a time. Each time a letter t is read, D is updated as D ← (D << 1) | B[t], where ‘|’ denotes a bitwise or. A match is detected when dm = 0. See the example in Figure 3.4 for a better understanding of the whole process.

Note that, if a letter t that does not belong to the pattern is read, then the row B[t] contains only ones, so D becomes 11 . . . 1 again. Hence the algorithm returns to its initial state.

Figure 3.4 shows how Shift-Or algorithm searches for the pattern ’example’ inside the text ’this-is-an-example’.



Figure 3.4: Example of Shift-Or searching

Assuming that the pattern length is no longer than the memory-word size of the machine, the space and time complexity of the preprocessing phase (building B table) is O(m + σ).

The time complexity of the searching phase is O(n). Therefore, it is independent of both the alphabet size and the pattern length.
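The following is a minimal Python sketch of the Shift-Or search as described above (the names and the dictionary-based representation of B are ours):

    def shift_or_search(pattern, text):
        """Shift-Or search: bit i of D is 0 iff the prefix of length i + 1 of the
        pattern matches a suffix of the text read so far."""
        m = len(pattern)
        assert 0 < m <= 64, "the pattern must fit in a machine word"
        all_ones = (1 << m) - 1
        # B[c]: bit i is 0 iff pattern[i] == c.
        B = {c: all_ones for c in set(text) | set(pattern)}
        for i, c in enumerate(pattern):
            B[c] &= ~(1 << i)
        D = all_ones
        matches = []
        for pos, c in enumerate(text):
            D = ((D << 1) | B.get(c, all_ones)) & all_ones
            if D & (1 << (m - 1)) == 0:        # bit dm is 0: the whole pattern matched
                matches.append(pos - m + 1)    # starting position of the occurrence
        return matches

For instance, shift_or_search('example', 'this-is-an-example') returns [11].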

3.6 Summary

In this Chapter, the necessity of Text Retrieval Systems that permit efficient access to the words inside the documents of a Text Database was shown. The advantages that compression techniques offer to those systems, as well as the characteristics they have to possess in order to be really suitable for such Text Retrieval Systems, were also presented.

Since the use of compression usually requires searching the compressed text, we also presented the string matching problem. Finally, the Boyer-Moore and Shift-Or algorithms, two commonly used and fast pattern matching techniques, were described.


4 Semi-static text compression techniques

4.1 Chapter overview

This Chapter presents a classic compression technique, the character-oriented Huffman code [27], and the variations of it that have appeared in recent years.

The Chapter starts by presenting the classic Huffman algorithm, the basis of the new generation of word-based codes that appeared in the nineties. Section 4.3 gives a brief description of some of the most interesting natural language text compression codes used nowadays. Special attention is paid to the techniques called Plain Huffman and Tagged Huffman. Section 4.4 shows how searches can be performed directly on the compressed text when it is compressed with Plain Huffman or Tagged Huffman code. Finally, the Chapter ends with the description of two further techniques: Byte Pair Encoding (BPE) and the Burrows-Wheeler Transform (BWT). BPE is a compression technique that enables efficient searches and decoding. BWT is an algorithm that transforms a text into another, more compressible, one and can be used along with a Huffman-based or an arithmetic encoder [57], improving their compression ratio.


4.2 Classic Huffman Code

The definition of the classic Huffman code appeared in 1952 [27]. It is a character-based, bit-oriented, statistical, semi-static technique, and the codes it generates are prefix codes. This method was an important breakthrough in the field of compression techniques, and several Huffman-based compression techniques have been developed since then [22, 29, 52, 48].

As a semi-static technique, it performs two passes over the input text. In the first pass, the number of occurrences of each distinct symbol is computed. Once the frequencies of all input symbols are known, the Huffman algorithm builds a tree that is the basis of the encoding scheme, as explained next. In the second pass, each input symbol is encoded and output. Using the Huffman tree, shorter codes are assigned to the more frequent input symbols.

4.2.1 Building a Huffman Tree

A classic Huffman tree is a binary tree built as follows. Symbols are assigned to the leaf nodes of the tree, and their position in the tree (their level) depends on the probability of each symbol. Moreover, one condition must always hold: the frequency of a node can never be smaller than the frequency of a node placed deeper in the tree.

A Huffman tree is built through the following process:

1. A list of leaf nodes is created, one node for each distinct input symbol. Each node stores a symbol and its frequency. This list is then sorted by frequency.

2. The b least frequent nodes are picked from the list. (b = 2 in a binary tree)

3. A new internal node is added to the tree. This internal node is assigned the sum of the frequencies of the nodes picked in the previous step. Then the b nodes are removed from the list of nodes (they are now in the tree), being substituted by the internal node just created.


4. Steps 2) and 3) are executed again while more than b nodes remain in the list. When b or fewer nodes remain in the list, those nodes are joined under a common parent. This new internal node is the root of the Huffman tree (and its frequency is the sum of the occurrences of all the input symbols).

To understand the whole process, see Example 4.2.1 and Figure 4.1.

Example 4.2.1 Building a Huffman tree from 5 input symbols {a, b, c, d, e} with frequencies 0.40, 0.20, 0.20, 0.15 and 0.05, respectively.

Figure 4.1 shows the process step by step. In the first step, a list of nodes associated to the input symbols {a, b, c, d, e} is created. In step two, the two least frequent nodes are chosen and joined into a new internal node whose frequency is 0.20, that is, the sum of the frequencies of both d and e. In step three, the least frequent nodes among b, c and the internal node just created could be taken. In this case, we choose the internal node and c, and join them into a new internal node of frequency 0.40, which is added to the tree. Note that if b had been chosen, a different Huffman tree would have been generated (usually more than one valid Huffman tree exists). Next, b and the previous internal node are joined into a new internal node whose frequency is 0.60. In the last step, only two nodes remain to be chosen; they are set as the children of the root node.

The cost of building a character-oriented Huffman tree is O(n), where n is the number of symbols in the tree, once the list of symbols has been sorted by frequency. In general, the number of merging iterations needed to build a Huffman tree can be bounded by ⌈n/(b − 1)⌉. Once the Huffman tree has been built, the compression process can begin. All branches of the tree are labelled as follows: the left branch of a node is assigned a 0 and the right branch a 1. The path from the root of the tree to the leaf node where a symbol appears gives the code of that symbol.
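As an illustration of the procedure above, the following is a minimal Python sketch that builds a binary Huffman code (b = 2) with a priority queue instead of the sorted list; the function names are ours, and the tie-breaking (and therefore the exact tree obtained) may differ from the one of Figure 4.1, although the resulting code is equally optimal:

    import heapq
    from collections import Counter

    def huffman_codes(frequencies):
        """Build a binary Huffman code: returns {symbol: bit string}.
        frequencies: dict mapping each symbol to its number of occurrences."""
        if len(frequencies) == 1:                     # degenerate case: a single symbol
            return {next(iter(frequencies)): "0"}
        # Heap of (frequency, tie-breaker, tree); a tree is a symbol or a pair of subtrees.
        heap = [(f, i, sym) for i, (sym, f) in enumerate(frequencies.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:                          # b = 2: join the two least frequent nodes
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, counter, (left, right)))
            counter += 1
        codes = {}
        def assign(tree, prefix):                     # left branch -> 0, right branch -> 1
            if isinstance(tree, tuple):
                assign(tree[0], prefix + "0")
                assign(tree[1], prefix + "1")
            else:
                codes[tree] = prefix
        assign(heap[0][2], "")
        return codes

    # e.g. huffman_codes(Counter("abracadabra")) assigns the shortest codeword to 'a'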

Codes assigned to the symbols, using the tree in Example 4.2.1 as encoding schema, are shown in Table 4.1.



Figure 4.1: Building a classic Huffman tree

    input symbol → code assigned
    a → 0
    b → 11
    c → 100
    d → 1010
    e → 1011

Table 4.1: Codewords assigned to each symbol in Example 4.2.1


To decompress a compressed text, the shape of the Huffman tree used during compression has to be known. The compressed stream is then read one bit at a time; the value of each bit determines whether the left or the right branch of an internal node is followed. When a leaf is reached, a symbol has been recognized and is output. Then the decompressor goes back to the root of the tree and continues the process.

4.2.2 Canonical Huffman tree

In 1964, Schwartz and Kallick [44] defined the concept of a canonical Huffman code. In essence, the idea is that Huffman's algorithm is only needed to compute the length of the codeword that will be mapped to each symbol of the dictionary. Once those lengths are known, the codewords can be assigned in several ways; the only condition to take into account is that the resulting code must be a prefix code.

Intuitively a canonical code builds the prefix code tree from left to right in increasing order of depth. At each level, leaves are placed in the first position available (from left to right). Figure 4.2 shows the canonical Huffman tree from Example 4.2.1.

The canonical code possesses some mathematical properties.

• Codewords are assigned in increasing order of length, following the lengths computed with Huffman's algorithm.

• The codewords of a given length are consecutive binary numbers.

• The first codeword cl of length l is related to the last codeword cl−1 of length l − 1 by the equation cl = 2(cl−1 + 1).

Given Figure 4.2, where codeword lengths are respectively 1,2,3,4 and 4, the codewords obtained are: 0, 10, 110, 1110 and 1111.
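A minimal Python sketch of this assignment rule (the function name is ours):

    def canonical_codes(lengths):
        """Assign canonical codewords, as bit strings, given the non-decreasing
        list of codeword lengths produced by Huffman's algorithm."""
        codes, code, prev_len = [], 0, lengths[0]
        for i, l in enumerate(lengths):
            if i > 0:
                code = (code + 1) << (l - prev_len)   # next consecutive code, padded to length l
                prev_len = l
            codes.append(format(code, "0{}b".format(l)))
        return codes

    # canonical_codes([1, 2, 3, 4, 4]) returns ['0', '10', '110', '1110', '1111']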

The main advantage of the canonical representation is that the Huffman tree can be represented by using only the lengths of the codewords. Therefore, the vocabulary needed for decompression only requires storing: i) the list of symbols of the vocabulary, ii) the minimum and maximum codeword lengths, and iii) the number of codewords of each length. For example, the shape of the Huffman tree in Figure 4.2 can be saved as <1, 4><1, 1, 1, 2>.

Figure 4.2: Example of canonical Huffman tree

As a result, keeping the shape of the canonical Huffman tree of n words takes O(log2 n) space.

Moreover, canonical codes also reduce the memory requirements during compression and decompression, since the Huffman tree does not have to be kept in memory.

Implementation details of a canonical Huffman code can be found in [24]. Even though the canonical representation was defined for the bit-oriented Huffman approach, it can also be defined for a byte-oriented approach. More details about how a byte-oriented canonical Huffman code can be built appear in [35, 39].

4.3 Word-Based Huffman Compression

The traditional implementations of the Huffman code are character based, that is, they use the characters as the symbols of the alphabet. Compression is then poor, because the distribution of character frequencies in natural language texts is not very biased. In [34], Moffat proposed using the words of the text as the symbols to be compressed. This idea joins the requirements of compression algorithms and of IR systems, as words are the basic atoms for most IR systems. The basic point is that a text is more compressible when regarded as a sequence of words rather than characters.

In [49, 62], two compression schemes that use this strategy combined with a Huffman code are presented. From a compression viewpoint, character-based Huffman methods are able to reduce English texts to approximately 60% of their original size, while word-based Huffman methods are able to reduce them to 25%-30% of their original size, because the distribution of words is much more biased than the distribution of characters.

The compression schemes presented in [49, 62] use a semi-static model, that is, the encoder makes a first pass over the text to obtain the frequency of all the words in the text and then the text is coded in the second pass. During the coding phase, original symbols (words) are replaced by codewords. For each word in the text there is a unique codeword, whose length varies depending on the frequency of the word in the text. Using the Huffman algorithm, shorter codewords are assigned to more frequent words.

The basic method proposed by Huffman is mostly used as a binary code as shown in Section 4.2, that is, each word in the original text is coded as a sequence of bits. In [49] they modified the code assignment such that a sequence of bytes instead of bits is associated with each word in the text.

The construction of a 256-ary Huffman tree is similar to that of a binary tree. The main difference is the starting point of the process, namely the number of nodes R that have to be grouped in the first iteration:

    R = 1 + ((n − 2^b) mod (2^b − 1))    if (n − 2^b) mod (2^b − 1) > 0
    R = n                                if (n − 2^b) mod (2^b − 1) = 0

Then, in the following iterations of the process, 256 nodes are taken and set as children of a new internal node, until only 256 available nodes remain to be chosen.

Experimental results have shown that, on natural language, there is no significant degradation in the compression ratio (less than 5%) by using bytes instead of bits. In addition, decompression and searching are faster with byte-oriented Huffman codes because no bit manipulations are necessary.

4.3.1 Plain Huffman and Tagged Huffman Codes

In [49] two Huffman codes following a word-based approach are presented. These codes are called Plain Huffman and Tagged Huffman.

Both Plain Huffman and Tagged Huffman codes allow searching for a word in the compressed text without decompressing it, in such a way that the search can be up to eight times faster for certain queries [49]. The key idea of this work (and others, like that of Moffat and Turpin [39]) is to consider the text words as the symbols that compose the text (and therefore the symbols to be compressed). Since in Information Retrieval (IR) text words are the atoms of searches, these compression schemes are particularly suitable for IR. This idea has been carried further, up to a full integration between inverted indexes and word-based compression schemes, opening the door to a brand new family of low-overhead indexing methods for natural language texts [56, 40, 62].

In [49], the code we have already described, that is, a word-based byte-oriented Huffman code, is called Plain Huffman Code. Plain Huffman obtains compression ratios of about 30%-35% when applied to English texts.

Plain Huffman Code does not permit searching the compressed text directly by simply compressing the pattern and then using any classical string matching algorithm. This does not work because the compressed pattern could occur in the compressed text without corresponding to an actual codeword: the concatenation of parts of two codewords may form the codeword of another vocabulary word.

The second code proposed in [49] is called Tagged Huffman Code, which avoids that problem in searches. This technique is just like the previous one, differing only in that the first bit of each byte is reserved to flag whether the byte is the first byte of a codeword. Hence, only 7 bits of each byte are used for the Huffman code. Note that the use of a Huffman code over the remaining 7 bits is mandatory, as the flag bit is not useful by itself to make the code a prefix code.

Figure 4.3: Example of False Matchings in Plain Huffman

Therefore, due to the use of the flag bit in each byte which determines if the byte is the first byte of a codeword or not, no spurious matchings can happen in Tagged Huffman Code. See Figure 4.3 for an example of how false matchings can happen in Plain Huffman Code.

For this reason, searching with Plain Huffman requires inspecting all the bytes of the compressed text from the beginning, while Boyer-Moore type searching [12, 50] (that is, skipping bytes) is possible over Tagged Huffman Code.

Tagged Huffman Code has a price in terms of compression performance: full bytes are stored, but only 7 bits are used for coding. Hence, the compressed file grows approximately by 11%. As compensation, Tagged Huffman searches compressed text much faster than Plain Huffman.

The differences between the codes generated by Plain Huffman Code and Tagged Huffman Code are shown in the following example.

Example 4.3.1 Assume that the vocabulary has 17 words, with a uniform distribution (pi = 1/17) in Table 4.2 and with an exponential distribution (pi = 1/2^i) in Table 4.3. The resulting canonical Plain Huffman and Tagged Huffman trees are shown in Figures 4.4 and 4.5.

For the sake of simplicity, in this example we consider that the “bytes” used are formed by only three bits. Hence, Tagged Huffman Code uses one bit for the flag and two for the code (which makes it look worse than it actually is). The flag bits are the first bit of each byte.

Figure 4.4: Plain and Tagged Huffman trees for a uniform distribution

Figure 4.5: Plain and Tagged Huffman trees for an exponential distribution

4.4 Searching Huffman Compressed Text

Direct text searching is particularly important in block addressing indexes, a new family of low-overhead indexing methods for natural language texts [56, 40, 62] which appeared in the last years. Basically, in order to reduce index space, the index does not point to exact word positions but to text blocks (which can be documents or logical blocks independent of the documents). A space-time tradeoff is obtained by varying the block size. The price is that searches in the index may have to be complemented with sequential scanning. If blocks do not match documents, even single word searches have to be complemented with a sequential scan of the candidate blocks. It is also possible to perform phrase queries; in this case, the index can point to the blocks where all the words appear, but only a sequential search can tell whether the phrase actually occurs. As a result, in the case of block addressing indexes, it is essential to be able to keep the text blocks in compressed form and to search them without decompressing.

    Word   Probab.   Plain Huffman   Tagged Huffman
    A      1/17      000             100 000
    B      1/17      001             100 001
    C      1/17      010             100 010
    D      1/17      011             100 011
    E      1/17      100             101 000
    F      1/17      101             101 001
    G      1/17      110 000         101 010
    H      1/17      110 001         101 011
    I      1/17      110 010         110 000
    J      1/17      110 011         110 001
    K      1/17      110 100         110 010
    L      1/17      110 101         110 011
    M      1/17      110 110         111 000
    N      1/17      110 111         111 001
    O      1/17      111 000         111 010
    P      1/17      111 001         111 011 000
    Q      1/17      111 010         111 011 001

Table 4.2: Codes for a uniform distribution.

    Word   Probab.    Plain Huffman   Tagged Huffman
    A      1/2        000             100
    B      1/4        001             101
    C      1/8        010             110
    D      1/16       011             111 000
    E      1/32       100             111 001
    F      1/64       101             111 010
    G      1/128      110             111 011 000
    H      1/256      111 000         111 011 001
    I      1/512      111 001         111 011 010
    J      1/1024     111 010         111 011 011 000
    K      1/2048     111 011         111 011 011 001
    L      1/4096     111 100         111 011 011 010
    M      1/8192     111 101         111 011 011 011 000
    N      1/16384    111 110         111 011 011 011 001
    O      1/32768    111 111 000     111 011 011 011 010
    P      1/65536    111 111 001     111 011 011 011 011 000
    Q      1/65536    111 111 010     111 011 011 011 011 001

Table 4.3: Codes for an exponential distribution.

As introduced in the previous Section, both the Plain Huffman and the Tagged Huffman techniques enable searching the compressed text without decompressing it. However, as shown, no false matchings can occur in Tagged Huffman compressed text, while they can take place with Plain Huffman; therefore, searches over Tagged Huffman codes are simpler and faster than those over Plain Huffman.

4.4.1 Searching Plain Huffman Code

Two basic search methods were proposed in [49].

The first technique is known as plain filterless. It handles a Huffman tree whose leaves are the words of the vocabulary. A preprocessing phase starts by marking in the vocabulary those words that are being searched for (exact and complex searches are treated in the same way). In order to handle phrase patterns, a bit mask is also associated to each marked word. This bit mask indicates which element of the pattern matches that word (that is, the position of the word inside the pattern), and permits building a non-deterministic automaton which will recognize the pattern.

After the preprocessing phase, the compressed text is explored one byte at a time. Each byte enables traversing the Huffman tree downwards. When a marked leaf is reached, its bit mask is sent to the automaton, which will move from one state to another depending on the bit mask received. Figure 4.6 shows how the pattern “red hot” can be found (it is supposed that the codeword of “red” is [255], and the codeword of “hot” is [127][201][0]).



Figure 4.6: Searching Plain Huffman compressed text for pattern "red hot"

This is quite efficient, but not as much as the second choice, named plain filter, which compresses the pattern and uses any classical string matching strategy, such as Boyer-Moore [42]. For this second, faster, choice to be of use, one has to ensure that no spurious occurrences are reported, so the filterless algorithm is applied in the region where a possible match was found. To avoid processing the text from the beginning, the text is divided into small blocks during compression, in such a way that no codeword crosses a block boundary. Therefore, the filterless algorithm can start the validation of a match from the beginning of the block where that match takes place (blocks act as synchronization marks). This algorithm also supports complex patterns, which are searched by using a multi-pattern matching technique. However, its efficiency may be reduced because a large number of matches may have to be checked.

4.4.2 Searching Tagged Huffman Code

The algorithm to search for a pattern (a word, a phrase, etc.) under Tagged Huffman Code consists basically of two phases: compressing the pattern and then searching for it in the compressed text.

The first phase finds in the vocabulary all the words that compose the searched pattern. Then the compressed codeword(s) for the pattern are created. If the pattern is not found in the vocabulary, then it cannot appear in the compressed text.

In the case of approximate or extended searches, each element of the pattern (if there is more than one word) can be associated with several codewords from the vocabulary. To find all the words of the vocabulary that match the pattern, a sequential search is performed, and a list of codewords is kept for each of the elements of the pattern.

The searching phase also depends on the type of search being performed. For an exact search, the codeword of the pattern is searched for in the compressed text using any classical string matching algorithm with no modifications (e.g. the Boyer-Moore algorithm presented in Section 3.5.1). This technique is called direct searching [49, 62]. In the case of approximate or extended searches of one pattern, the problem of finding the codewords in the compressed text can be treated with a multi-pattern matching technique (as Baeza-Yates and Navarro describe in [3]).

In the case of a phrase pattern, two situations are possible. The simplest case consists of searching for a phrase of simple words: their codewords are obtained and concatenated, and this long concatenated codeword is then searched for as if it were a single codeword. The second situation takes place when the elements of the phrase pattern are not single words, so more than one codeword can be associated to each element of the phrase pattern. The algorithm creates, for each of the elements of the phrase pattern, a list with all their codewords. Then the compressed text is searched for the elements of one of the lists. As a heuristic, the list Li associated to element i of the pattern is chosen such that the minimum length of the codewords in Li is maximized [6]. Each time a match occurs, the rest of the lists are used to validate whether that match belongs to an occurrence of the entire pattern. Note that the heuristic is based on the idea of choosing the list whose codewords are longest; those codewords correspond to low-frequency words, so the number of validations needed will be small.

Note that the efficiency of the previous technique gets worse when searching for frequent words, because in that situation the number of matches is high and hence many validations against the other lists have to be performed. In this case, Plain Huffman searching methods can turn out to be more suitable and faster.

In [49] they compare searching over Tagged Huffman and Plain Huffman compressed text against searching the uncompressed text. Results show that searching a compressed text can be up to eight times faster than searching the uncompressed text for certain queries. Hence compression not only obtains space savings, but also improves the search speed over text collections. As a result, keeping documents in compressed form and decompressing them only during the presentation process seems really interesting.

4.5 Other techniques

In this Section, two more techniques are presented. The first one is a simple compression technique called Byte Pair Encoding, which offers competitive decompression speed (at the expense of a poor compression ratio). The second one is the Burrows-Wheeler Transform, an algorithm that transforms an original text (or string) into another that is easier to compress.

4.5.1 Byte Pair Encoding

Byte Pair Encoding (BPE) [21] is a simple text compression technique based on pattern substitution. It consists of finding the most frequently occurring pair of adjacent letters in the text and replacing it with a unique character that does not appear in the text. This operation is repeated until: i) all the 256 possible values of a byte are used (hence no more substitutions can be made), or ii) no adjacent pair occurs more frequently than the rest.

Each time a pair is substituted, a hash table (also known as the substitution table) is modified, either to add the new substitution pair or to update its number of occurrences. The substitution table is necessary during decompression, hence the compressed file is composed of two parts: the substitution table and the packed data. An example of the compression process is shown in Figure 4.7.
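A minimal Python sketch of this process follows (it is only an illustration of the idea: the names are ours, and the second stopping condition is simplified to “no pair occurs more than once”):

    from collections import Counter

    def bpe_compress(data: bytes):
        """Repeatedly replace the most frequent pair of adjacent byte values with a
        byte value unused in the original text; returns the packed data and the
        substitution table needed for decompression."""
        data = bytearray(data)
        free = sorted(set(range(256)) - set(data))   # byte values never used in the text
        table = {}                                   # new byte value -> replaced pair
        while free:                                  # i) stop when no free byte values remain
            pairs = Counter(zip(data, data[1:]))
            if not pairs:
                break
            pair, freq = pairs.most_common(1)[0]
            if freq < 2:                             # ii) stop when no pair repeats
                break
            new_byte = free.pop()
            table[new_byte] = pair
            out, i = bytearray(), 0
            while i < len(data):                     # replace every occurrence of the pair
                if i + 1 < len(data) and (data[i], data[i + 1]) == pair:
                    out.append(new_byte)
                    i += 2
                else:
                    out.append(data[i])
                    i += 1
            data = out
        return bytes(data), table

    def bpe_decompress(packed: bytes, table):
        """Expand substituted byte values until none of them remains."""
        data = bytearray(packed)
        while any(b in table for b in data):
            out = bytearray()
            for b in data:
                if b in table:
                    out.extend(table[b])
                else:
                    out.append(b)
            data = out
        return bytes(data)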

This algorithm can be considered as multi-pass, since the number of passes over the text is not fixed, and depends on the two previous conditions.



Figure 4.7: Compression process in Byte Pair Encoding

It requires that all the text being compressed be kept in memory during compression (so it can present memory problems with large texts). This technique can also present problems (a bad compression ratio) when compressing large binary files. This is due to the fact that, in large binary files, the number of unused byte values can be small, and therefore the substitution process finishes prematurely.

Those two problems (memory usage and the lack of free byte values) can be avoided by partitioning the text into blocks and then compressing each block separately. This has the advantage that the substitution table can be locally adapted to the text of the block. However, for each compressed block, a substitution table has to be included along with the packed data, which worsens the compression ratio.

The three main advantages of BPE are the high decompression speed reached (competitive with respect to gzip), the possibility of partial decompression (which only requires knowing the substitutions made during compression) and the fact that it is byte-oriented. These three properties make BPE very interesting when performing compressed pattern matching. In [47], two different approaches to search BPE compressed text are presented: a brute force one and a Shift-Or based technique.

The main disadvantages of BPE are: i) very slow compression, and ii) a bad compression ratio (worse than those of compress and gzip).


To improve compression speed, modifications of the initial algorithm [21] were proposed in [47], achieving better compression speed than gzip and compress at the expense of a small loss in compression ratio. Basically, they compute the substitution table for the first block of the text and then reuse it for the remaining blocks.

4.5.2 Burrows-Wheeler Transform

The Burrows-Wheeler Transform (BWT) was discovered by Wheeler in 1983, though it was not published until 1994 in [15]. It is not a compression technique, but an algorithm to transform an original text (or string) into another one that is more compressible by a compression technique.

The algorithm is reversible. Given a string S, BWT transforms it into a new string TS that contains the same data as S but in a different order; from TS (and some extra information), an inverse BWT permits recovering the original source string S.

Computing BWT

Let S be a string of length N. The algorithm starts by building an N × N matrix M such that S appears in the first row, the second row contains S >> 1 (S circularly shifted one position to the right), and so on.

The second step consists of sorting all the rows of the matrix alphabetically, keeping track of the original position of each row. Note that one of the rows of the sorted matrix corresponds to the initial string S; let us call I the index of the row containing S.

After the sorting, the first column of M is labelled F and the last one is labelled L. Two properties arise: 1) F contains all the characters of S, now sorted alphabetically, and 2) the character at position j of L is the character that (cyclically) precedes, in S, the string contained in row j.

The result of the BWT consists of the string composed of all the characters of column L of M, plus the value I. That is, BWT(S) → (L, I). An example of applying BWT to the string ‘abraca’ is shown in Figure 4.8.
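A minimal Python sketch of the forward transform as just described (the function name is ours; rotations are generated directly instead of materializing the whole matrix):

    def bwt(s: str):
        """Forward Burrows-Wheeler Transform: returns (L, I), where L is the last
        column of the sorted matrix of rotations and I is the row holding S itself."""
        n = len(s)
        rows = sorted(range(n), key=lambda i: s[i:] + s[:i])  # rotations of S, sorted
        L = "".join(s[(i - 1) % n] for i in rows)             # last character of each sorted row
        I = rows.index(0)                                     # row equal to the original string
        return L, I

    # bwt("abraca") returns ("caraab", 1), as in Figure 4.8.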



Figure 4.8: Direct Burrows-Wheeler Transform

Computing Inverse BWT

The inverse BWT (IBWT) algorithm uses the output (L, I) of the BWT algorithm to reconstruct its input, that is, the string S of length N. IBWT consists of three phases.

In the first phase, the column F of the matrix M is rebuilt. This is done by sorting the string L alphabetically. In the example shown in Figure 4.8, L = ‘caraab’; hence, after sorting L we obtain F = ‘aaabcr’.

In the second phase, the strings F and L are used to compute a vector T that indicates the correspondence between the characters of the two strings: if L[j] is the k-th occurrence of the character ‘c’ in L, then T[j] = i, where F[i] is the k-th occurrence of ‘c’ in F. Therefore, the vector T represents a correspondence between the elements of F and the elements of L.

Example 4.5.1 Note that the first occurrence of ‘c’ = L[0] in F happens at position 4, hence T[0] = 4. The first occurrence of ‘a’ = L[1] is F[0], therefore T[1] = 0, and so on.

    Position   0 1 2 3 4 5
    L =        c a r a a b
    F =        a a a b c r
    T =        4 0 5 1 2 3


From the definition of T , it can also be seen that F [T [j]] = L[j]. This property is interesting because it will enable recovering the source string S.

In the last phase, the source text S is obtained. Using the index I, as well as the vectors L and T, the process that recovers the text S starts by performing:

p ← I; i ← 0;

Then N iterations are performed in order to recover the N elements of S as follows:

S[N − i − 1] ← L[p];  p ← T[p];  i ← i + 1;

Example 4.5.2 Let L = ‘caraab’, T = [4, 0, 5, 1, 2, 3] and I = 1. The process starts with p ← 1 and i ← 0. Hence the first iteration makes S[5] ← L[1] = ‘a’; p ← T[1] = 0; i ← 1. The second iteration makes S[4] ← L[0] = ‘c’; p ← T[0] = 4; i ← 2. Thus the string S is built right to left.
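A minimal Python sketch of the three phases of the inverse transform (names are ours):

    def ibwt(L: str, I: int):
        """Inverse Burrows-Wheeler Transform from the pair (L, I)."""
        n = len(L)
        F = sorted(L)                        # phase 1: rebuild column F
        # Phase 2: T[j] = position in F of the k-th occurrence of the character L[j].
        seen = {}
        T = []
        for c in L:
            k = seen.get(c, 0)
            T.append(F.index(c) + k)         # occurrences of c are contiguous in F
            seen[c] = k + 1
        # Phase 3: rebuild S right to left.
        S = [""] * n
        p = I
        for i in range(n):
            S[n - i - 1] = L[p]
            p = T[p]
        return "".join(S)

    # ibwt("caraab", 1) returns "abraca".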

Using BWT in text compression

Once BWT has been applied over a source text S, and the transformed string L and the I value have been obtained, it is possible to compress L just applying a compression technique such as arithmetic encoding [57] or Huffman [27].

In decompression, the process starts decompressing the compressed data (using the same technique applied in compression) and then using the IBWT process to obtain the plain text.

However, even though it is possible to compress L with either an arithmetic or a Huffman-based technique, results are better if a more specific encoder is used. In [15], Burrows and Wheeler also showed how a “Move-to-Front” (MTF) encoder [11] can be used along with BWT to achieve better compression ratios. The idea consists of transforming the pair (L, I) into another pair (R, I), where R is a vector of numbers.

To see why this might lead to effective compression, let us consider the letter ‘t’ in the word ‘the’, and assume an input English text S containing many instances of ‘the’. When the rows of M are sorted, all the rows starting with ‘he ’ will appear together, and a large proportion of them are likely to end in ‘t’. Hence, one region of the string L will contain a very large number of ‘t’ characters, along with other characters that can precede ‘he’ in English, such as space, ‘s’, ‘T’, and ‘S’. The same argument can be applied to all characters in all words, so any localized region of the string L will contain a large number of a few distinct characters.

The overall effect is that the probability that a character ch occurs at a given point of L is very high if ch occurs near that point of L, and low otherwise. This property is exactly the one needed for effective compression by an MTF encoder [11], which encodes an instance of a character ch by the count of distinct characters seen since the nearest previous occurrence of ch (see the footnote below).
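A minimal Python sketch of a Move-to-Front encoder and decoder over a given alphabet (names are ours):

    def mtf_encode(text: str, alphabet: str):
        """Output, for each character, its current position in the list Z,
        then move that character to the front of Z."""
        Z = list(alphabet)
        ranks = []
        for ch in text:
            j = Z.index(ch)
            ranks.append(j)
            Z.insert(0, Z.pop(j))            # move ch to the front of Z
        return ranks

    def mtf_decode(ranks, alphabet: str):
        Z = list(alphabet)
        out = []
        for j in ranks:
            ch = Z.pop(j)
            out.append(ch)
            Z.insert(0, ch)
        return "".join(out)

    # mtf_encode("caraab", "abcr") returns [2, 1, 3, 1, 0, 3]: note how the run
    # of 'a' characters produces small numbers and a zero.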

Summarizing, consecutive repetitions of a character produce consecutive zeros in the vector R, and consecutive repetitions of a small set of characters produce an output dominated by low numbers. For some sequences, the percentage of zeros in R may reach 90%; on average, the sequence contains about 60% zeros [10].

As a result, the R sequence will be more compressible than the initial text S. It is usually compressed using either a Huffman-based or an arithmetic encoder. However, since zero is the dominant symbol in R, the sequence contains many long runs of zeros, so-called zero-runs. This led Wheeler to propose another transform, called the Zero-Run Transform and also known as RLE-0, to treat zero-runs in a special way. It was not published by Wheeler but reported by Fenwick in [20]. Experimental results [7] indicate that the application of the RLE-0 transform indeed improves the compression ratio.

1 Basically, an MTF encoder keeps a list Z with all the characters of the alphabet used. Each time a character ch is processed, ch is searched for in Z, and the position j such that Zj = ch is output. Then Zj is moved to the front of Z (by shifting the characters Zi, i = 0..j − 1, to positions 1..j and setting Z0 = ch).

Figure 4.9: Whole compression process using BWT, MTF and RLE-0

Figure 4.9 summarizes the whole compression process of a text S when applying BWT, MTF and RLE-0 transforms.

4.6 Summary

In this Chapter, an introduction to the classic Huffman technique was given, along with a description of the mechanism used to build a Huffman tree and of its representation via a canonical tree. Special emphasis was put on the description of the two word-based Huffman techniques that appeared in [48], since those techniques use the same approach as the codes we develop in this thesis, which are presented in Chapters 5 and 6. Section 4.4 described how Plain Huffman and Tagged Huffman compressed text can be searched directly.

Finally, two more techniques were presented: Byte Pair Encoding, a compression scheme based on pattern substitution, and the Burrows-Wheeler Transform, a technique that transforms a text into another, more compressible, one.


5 End-Tagged Dense Code

5.1 Chapter overview

This Chapter explains in detail the End-Tagged Dense Code, the precursor of our work. This technique was developed in the Database Laboratory of the University of A Coruña, in collaboration with Gonzalo Navarro, in 2003. Even though it is not a contribution of this thesis, it is reviewed here in depth because it is the basis of the technique presented in Chapter 6, the (s, c)-Dense Code, which constitutes the first real contribution of this thesis.

In Section 5.2, the motivation of the End-Tagged Dense Code is presented. Next, the way End-Tagged Dense Code works is described in Section 5.3. Later, encoding and decoding mechanisms are explained and pseudocode is given in Section 5.4. Finally analytical and empirical results, as well as some conclusions about End-Tagged Dense Code, are shown.

5.2 Introduction

End-Tagged Dense Code was first presented in [13]. This technique uses, as Plain and Tagged Huffman do, a semi-static word-based model, but it is not based on Huffman at all. It maintains the good features of Tagged Huffman code:

• It is a prefix code1.

• As Tagged Huffman code, it enables fast decompression of arbitrary portions of text, by using a flag bit in all the bytes that compose a codeword.

• It permits using Boyer-Moore type searching algorithms [12] directly on the compressed text1.

End-Tagged Dense Code improves Tagged Huffman compression ratio. Besides, encoding and decoding processes are simpler and faster.

The remainder of this Chapter is organized as follows. First, End-Tagged Dense Code is defined. Next, the encoding and decoding processes are explained. Finally, empirical results comparing End-Tagged Dense Code against Plain Huffman and Tagged Huffman are shown.

5.3 End-Tagged Dense Code

End-Tagged Dense Code [13] starts with a seemingly dull change to Tagged Huffman Code. Instead of using the flag bit to signal the beginning of a codeword, the flag bit is used to signal the end of a codeword. That is, the flag bit is 0 in every byte of a codeword except the last one, which has a 1 in its most significant bit.

Table 5.1 describes the differences of codewords in End-Tagged Dense Code and Tagged Huffman Code.

This change has surprising consequences. Now the flag bit is enough to ensure that the code is a prefix code, regardless of the contents of the other 7 bits of each byte. To see this, consider two codewords X and Y, with X shorter than Y (|X| < |Y|). X cannot be a prefix of Y because the last byte of X has its flag bit set to 1, while the |X|-th byte of Y has its flag bit set to 0. This fact can easily be seen in Table 5.1, since in End-Tagged Dense Code a two-byte codeword cannot start with a 1xxxxxxx byte.

    #Bytes   Tagged Huffman Code                 End-Tagged Dense Code
    1        1xxxxxxx                            1xxxxxxx
    2        1xxxxxxx 0xxxxxxx                   0xxxxxxx 1xxxxxxx
    3        1xxxxxxx 0xxxxxxx 0xxxxxxx          0xxxxxxx 0xxxxxxx 1xxxxxxx
    ...      ...                                 ...
    n        1xxxxxxx 0xxxxxxx ... 0xxxxxxx      0xxxxxxx ... 0xxxxxxx 1xxxxxxx

Table 5.1: Format of codewords in Tagged Huffman and End-Tagged Dense Code

1 Even when it is a prefix code, it is not a suffix code as Tagged Huffman is. Therefore, even though it permits Boyer-Moore type searching algorithms, when a match is found, a check over the previous byte is needed. A full explanation is presented in Page 90.

At this point, there is no need at all to use Huffman coding over the remaining 7 bits to get a prefix code. Therefore it is possible to use all the possible combinations of 7 bits in all the bytes, as long as the flag bit is used to mark the last byte of the codeword.

The encoding process is simpler and faster than Huffman's, since no tree has to be built. Notice also that the code is not restricted to using symbols of 8 bits to form the codewords: it is possible to use symbols of b bits. In this way, End-Tagged Dense Code is defined as follows:

Definition 5.3.1 Given source symbols with decreasing probabilities {pi}, 0 ≤ i < n, the End-Tagged Dense Code assigns to the i-th symbol a codeword formed by a sequence of b-bit symbols, where every symbol except the last one takes a value in [0, 2^(b−1) − 1] and the last one takes a value in [2^(b−1), 2^b − 1]; codewords are assigned sequentially, giving shorter codewords to more frequent source symbols.

That is, using symbols of 8 bits, the encoding process can be described as follows:

• Words in the vocabulary are decreasingly ranked by number of occurrences.


    Word rank                       Codeword assigned             # Bytes   # words
    0                               10000000                      1
    1                               10000001                      1
    2                               10000010                      1         2^7
    ...                             ...                           ...
    2^7 − 1 = 127                   11111111                      1
    2^7 = 128                       00000000:10000000             2
    129                             00000000:10000001             2
    130                             00000000:10000010             2
    ...                             ...                           ...
    255                             00000000:11111111             2         2^7 · 2^7
    256                             00000001:10000000             2
    257                             00000001:10000001             2
    258                             00000001:10000010             2
    ...                             ...                           ...
    2^7 · 2^7 + 2^7 − 1 = 16511     01111111:11111111             2
    2^7 · 2^7 + 2^7 = 16512         00000000:00000000:10000000    3
    16513                           00000000:00000000:10000001    3
    16514                           00000000:00000000:10000010    3         (2^7)^3
    ...                             ...                           ...
    (2^7)^3 + (2^7)^2 + 2^7 − 1     01111111:01111111:11111111    3
    ...                             ...                           ...

Table 5.2: Code assignment in End-Tagged Dense Code

• Codewords 128 to 255 (10000000 to 11111111) are given to the first 128 words in the vocabulary.

• Words ranked from 128 to 16511 are assigned sequentially to two-byte codewords. The first byte of each codeword has a value in the range [0, 127] and the second in range [128, 255].

• Word 16512 is assigned a three-byte codeword, and so on, just as if we had a 21-bit number.

As can be seen in Table 5.2, the computation of the codes is extremely simple: it is only necessary to sort the vocabulary words by frequency and then assign the codewords sequentially. Hence the coding phase is faster than with Huffman, because obtaining the codes is simpler.

What is perhaps less obvious is that the code depends on the rank of the words, not on their actual frequency. That is, if we have four words A, B, C, D (ranked i . . . i + 3) with frequencies 0.24, 0.22, 0.22 and 0.20, respectively, then the code will be the same as if their frequencies were 0.9, 0.09, 0.009 and 0.001.

In Example 5.3.1, the differences among the codes generated by Plain Huffman Code, Tagged Huffman Code and End-Tagged Dense Code are shown. For the sake of simplicity, we consider that the “bytes” used are formed by only three bits (b = 3).

Example 5.3.1 A vocabulary of 17 words is used. Table 5.3 shows the codeword assigned to each word, assuming that the frequencies of those words in the text follow a uniform distribution (pi = 1/17). In Table 5.4, an exponential distribution (pi = 1/2^i) is assumed.

Assuming “bytes” of three bits, Tagged Huffman and End-Tagged Dense Code use one bit for the flag and two for the code (which makes them look worse than they are), while Plain Huffman uses the whole three bits for the code. The flag bits are the first bit of each byte.

Note that, in the case of End-Tagged Dense Code, the code assignment is the same regardless of the distribution of frequencies of the words in the vocabulary.

5.4 Encoding and Decoding algorithms

5.4.1 Encoding algorithm

As seen, the encoding process is really simple. In fact, it is not necessary to physically store the results of these computations: with a few operations we can obtain on the fly, given a word rank i, its ℓ-byte codeword in O(ℓ) = O(log i) time. However, in practice it is faster to sequentially compute the codewords of all the words of the vocabulary (in O(n) time), in such a way that the word-codeword assignment is precomputed before the second pass of the compression starts.


    Word   Probab.   Plain Huffman   Tagged Huffman   End-Tagged Dense Code
    A      1/17      000             100 000          100
    B      1/17      001             100 001          101
    C      1/17      010             100 010          110
    D      1/17      011             100 011          111
    E      1/17      100             101 000          000 100
    F      1/17      101             101 001          000 101
    G      1/17      110 000         101 010          000 110
    H      1/17      110 001         101 011          000 111
    I      1/17      110 010         110 000          001 100
    J      1/17      110 011         110 001          001 101
    K      1/17      110 100         110 010          001 110
    L      1/17      110 101         110 011          001 111
    M      1/17      110 110         111 000          010 100
    N      1/17      110 111         111 001          010 101
    O      1/17      111 000         111 010          010 110
    P      1/17      111 001         111 011 000      010 111
    Q      1/17      111 010         111 011 001      011 100

Table 5.3: Codes for a uniform distribution.

    Word   Probab.    Plain Huffman   Tagged Huffman            End-Tagged Dense Code
    A      1/2        000             100                       100
    B      1/4        001             101                       101
    C      1/8        010             110                       110
    D      1/16       011             111 000                   111
    E      1/32       100             111 001                   000 100
    F      1/64       101             111 010                   000 101
    G      1/128      110             111 011 000               000 110
    H      1/256      111 000         111 011 001               000 111
    I      1/512      111 001         111 011 010               001 100
    J      1/1024     111 010         111 011 011 000           001 101
    K      1/2048     111 011         111 011 011 001           001 110
    L      1/4096     111 100         111 011 011 010           001 111
    M      1/8192     111 101         111 011 011 011 000       010 100
    N      1/16384    111 110         111 011 011 011 001       010 101
    O      1/32768    111 111 000     111 011 011 011 010       010 110
    P      1/65536    111 111 001     111 011 011 011 011 000   010 111
    Q      1/65536    111 111 010     111 011 011 011 011 001   011 100

Table 5.4: Codes for an exponential distribution.


Notice that there is no need to store the codewords (in any form such as a tree) nor the frequencies in the compressed file. It is enough to store the plain words sorted by frequency. Therefore, the vocabulary will be slightly smaller than in the case of the Huffman code, where some information about the shape of the tree must be stored (even when a canonical Huffman tree is used).

Even though the codes are assigned sequentially once the words have been sorted by frequency in the vocabulary, an on-the-fly encoding algorithm can also be used. The pseudocode for the on-the-fly encoding algorithm is shown next. The algorithm outputs the bytes of each codeword one at a time, from right to left; that is, it outputs the least significant bytes first.

    Encode (i)
        // input i: rank of the word being encoded, 0 ≤ i ≤ n − 1
        output ((i mod 128) + 128);    // rightmost byte
        i ← i div 128;
        while i > 0                    // remaining bytes
            i ← i − 1;
            output (i mod 128);
            i ← i div 128;

5.4.2 Decoding algorithm

The first step to decompress a compressed text is to store the words that compose the vocabulary in a vector. Since these words were stored ranked by frequency along with the compressed text during compression, the vocabulary of the decompressor is already recovered ranked by frequency. Once the sorted vocabulary is obtained the decoding of codewords can begin.

In order to obtain the word wi which corresponds to a given codeword ci, the decoder can run a simple computation to obtain the rank i of the word from the codeword ci. Then, using the value i, the corresponding word is obtained from vocabulary[i]. A codeword ci of ℓ bytes can be decoded in O(ℓ) = O(log i) time.

The algorithm inputs a codeword x and iterates over its bytes. The last byte of the codeword has its tag bit set to 1, so its value is greater than 127; this permits detecting the end of the codeword. After the iteration, a position i is returned, and the decoded word is obtained from vocabulary[i]. Note that each non-final byte must also undo the decrement performed by Encode, so decoding a four-byte codeword x = x0 x1 x2 x3 basically computes:

    i = (((x0 + 1) × 128 + (x1 + 1)) × 128 + (x2 + 1)) × 128 + (x3 − 128)

    Decode (x)
        // input x: the codeword to be decoded
        // output i: the position of the decoded word in the ranked vocabulary
        i ← 0;
        k ← 0;                            // current byte of the codeword
        while x[k] < 128
            i ← i × 128 + x[k] + 1;       // the +1 undoes the i ← i − 1 step of Encode
            k ← k + 1;
        i ← i × 128 + x[k] − 128;
        return i;
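For illustration, a direct Python transcription of both routines (including the +1 correction discussed above), which can be used to check that they are inverses and that they reproduce the assignment of Table 5.2:

    def encode(i):
        """ETDC codeword for word rank i, as a list of byte values (first byte first)."""
        codeword = [(i % 128) + 128]        # rightmost byte, carrying the flag
        i //= 128
        while i > 0:
            i -= 1
            codeword.insert(0, i % 128)
            i //= 128
        return codeword

    def decode(x):
        """Rank of the word whose codeword is the byte list x."""
        i, k = 0, 0
        while x[k] < 128:
            i = i * 128 + x[k] + 1
            k += 1
        return i * 128 + x[k] - 128

    assert encode(0) == [128] and encode(128) == [0, 128] and encode(16512) == [0, 0, 128]
    assert all(decode(encode(r)) == r for r in range(100000))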

5.5 Using ETDC to bound Plain Huffman

An interesting property of this code is that it can be used as a bound for the compression obtainable with a Huffman code. It is clear that the End-Tagged Dense Code uses all the possible combinations of all the bits except the first one, which is used as a flag, as in the Tagged Huffman Code.

Therefore, calling Db the compressed text length obtained by an End-Tagged Dense Code that uses symbols of b bits, and Hb and Tb the lengths obtained by Plain Huffman and Tagged Huffman respectively, we have

Db+1 ≤ Hb ≤ Db ≤ Tb ≤ Db−1

Figure 5.1 shows the shape of a Tagged Huffman tree, together with the tree that would be represented by the codewords generated by End-Tagged Dense Code. As can be seen there, End-Tagged Dense Code achieves a smaller compressed text size than Tagged Huffman, since in End-Tagged Dense Code an internal node has 256 children: 128 internal nodes and 128 leaf nodes. In Tagged Huffman, however, an internal node cannot have more than 128 children; some number k of them will be leaves and the other (128 − k) will be internal nodes. Therefore, at each level l of the tree, End-Tagged Dense Code has at least as many codewords of length l.

Figure 5.1: Comparison of Tagged Huffman and End-Tagged Dense Code

In [13], a more precise comparison of these three codes is presented. There, it is shown how the End-Tagged Dense Code provides lower and upper bounds to the compression that can be obtained by Huffman on natural language texts, where Zipf's Law is assumed [59]. We present a generalization of these bounds in the next Chapter; using our compression method, the (s, c)-Dense Code, even tighter bounds are obtained.

5.6 Empirical Results

Some large text collections from trec-2 (AP Newswire 1988 and Ziff Data 1989-1990) and from trec-4 (Congressional Record 1993, Financial Times 1991, 1992, 1993 and 1994) were used in the tests. The corpora were compressed using Plain Huffman, End-Tagged Dense Code and Tagged Huffman. The spaceless word model [48] was used to create the vocabulary; that is, if a word was followed by a space, we just encoded the word, otherwise both the word and the separator were encoded.

The size of the compressed vocabulary was excluded from the results (this size is negligible and similar in all cases, although a bit smaller for End-Tagged Dense Code because only the ranking of the words is needed).

Table 5.5 shows the compression ratio obtained by the codes mentioned. The second column contains the original size of the processed corpus and the following columns indicate the number of words in the vocabulary, the θ parameter of the Zipf’s Law [59, 6] and the compression ratio for each method.

    Corpus                  Original Size   Voc. Words   θ          P.H.    ETDC    T.H.
    AP Newswire 1988        250,994,525     268,896      1.852045   31.18   32.00   34.57
    Ziff Data 1989-1990     185,417,980     237,607      1.744346   31.71   32.60   35.13
    Congress Record         51,085,545      117,713      1.634076   27.28   28.11   30.29
    Financial Times 1991    14,749,355      75,687       1.449878   30.19   31.06   33.44
    Financial Times 1992    175,449,248     284,878      1.630996   30.49   31.31   33.88
    Financial Times 1993    197,586,334     291,404      1.647456   30.60   31.48   34.10
    Financial Times 1994    203,783,923     294,490      1.649428   30.57   31.46   34.08

Table 5.5: Comparison of compression ratios.

As can be seen, Plain Huffman obtains the best compression ratio (as expected, since it is the optimal prefix code), and End-Tagged Dense Code always obtains better results than Tagged Huffman, with an improvement of up to 2.5 percentage points. In fact, it is worse than the optimal Plain Huffman by less than 1 percentage point on average.

5.7 Conclusions

End-Tagged Dense Code, a simple and fast technique for compressing natural language text databases, was presented. Its compression scheme is not based on Huffman at all; instead, it generates prefix codes by using a tag bit that indicates whether a byte is the last byte of its codeword or not. Being a tagged code, direct search of the compressed text, as well as direct decompression, is possible.


Empirical results (in terms of compression ratio) show that the compression achieved is good: it suffers less than one percentage point of compression ratio overhead with respect to Plain Huffman, and it improves on Tagged Huffman by about 2.5 percentage points. Moreover, the compression and decompression processes are faster, because the encoding and decoding algorithms are simpler. Empirical results about efficiency are provided in the next Chapter, using our generalized version of ETDC.


6

The new technique: (s, c)-Dense Code

6.1 Chapter overview

The first contribution of this thesis is presented in this Chapter. It consists of a new word-based, byte-oriented, two-pass technique called (s, c)-Dense Code, which generalizes the End-Tagged Dense Code technique described in the previous Chapter.

This Chapter is structured as follows: First, the key idea and the motivation of the (s, c)-Dense Code are presented. In Section 6.3 the new technique is defined and described. Next, procedures to obtain the optimal s and c parameters are presented, and the encoding and decoding processes are explained via pseudocode. In Section 6.6, empirical results comparing (s, c)-Dense Code against End-Tagged Dense Code, Plain Huffman and Tagged Huffman are also shown. Finally, some conclusions end this Chapter.

6.2 Introduction

End-Tagged Dense Code [13] has been described in Chapter 5. As shown, this technique also uses a semi-static word-based model, like the compression

techniques presented in [48, 49], but it is not based on Huffman at all.

It maintains the good features of Tagged Huffman code:

• It is a prefix code1.

• It enables fast decompression of arbitrary portions of text.

• It permits the use of Boyer-Moore type searching algorithms [42], that is, skipping bytes, directly on the compressed text¹.

In Section 5.6 it is shown how End-Tagged Dense Code improves the Tagged Huffman compression ratio by more than 2.5 percentage points. However, its difference with respect to Plain Huffman is about 1 percentage point.

As shown in Chapter 5, End-Tagged Dense Code uses 2^{b-1} digits, from 0 to 2^{b-1} − 1, for the bytes at the beginning of a codeword, and it uses the other 2^{b-1} digits, from 2^{b-1} to 2^b − 1, for the last byte of the codeword. The question that arises now is whether that proportion between the number of non-terminal and terminal digits is the optimal one; that is, for a given corpus with a specific distribution of word frequencies, it might be that a different number of non-terminal digits (continuers) and terminal digits (stoppers) compresses better than just using 2^{b-1} of each. This idea was previously pointed out in [43], and it is the basic difference between End-Tagged Dense Code and (s, c)-Dense Code, as will be presented later.

Example 6.2.1 introduces the advantages of using a variable -rather than a fixed- number of stoppers and continuers.

Example 6.2.1 Let us suppose that 5,000 distinct words compose the vocabulary of the text to compress. Suppose also that bytes are used to form the codewords, so b = 8.

If End-Tagged Dense Code is used, that is, if the number of stoppers and continuers is 2^7 = 128, the maximum codeword length will be 2, since

¹Even though it is a prefix code, it is not a suffix code, as Tagged Huffman is. Therefore, even though it permits Boyer-Moore type searching algorithms, when a match is found a check of the previous byte is needed. A full explanation is presented in Page 90.


Figure 6.1: 128 versus 230 stoppers with a vocabulary of 5,000 words (128 + 128×128 = 16,512 codewords of one or two bytes versus 230 + 230×26 = 6,210)

128 + 128^2 = 16,512 is the number of words that can be encoded with codewords of one or two bytes. Therefore, there are 16,512 − 5,000 = 11,512 unused two-byte codewords. In this situation, all the 128 one-byte codewords and 5,000 − 128 = 4,872 two-byte codewords are used.

If the number of stoppers chosen is 230 (so the number of continuers is 256 − 230 = 26), then 230 + 230 × 26 = 6,210 words can be encoded with codewords of only one or two bytes. The first 230 words can be encoded with one-byte codewords, and the remaining 230 × 26 = 5,980 words with two-byte codewords. Therefore all 5,000 words can be assigned codewords of 1 or 2 bytes in the following way: the 230 most frequent words are assigned one-byte codewords and the remaining 5,000 − 230 = 4,770 words are assigned two-byte codewords.

It can be seen that words ranked from 0 to 128 and words ranked from 231 to 5,000 are assigned codewords of the same length in both schemes. However, words from 129 to 230 are assigned shorter codewords when using 230 stoppers instead of only 128. Figure 6.1 describes this scenario. □

As a result, it seems appropriate to adapt the number of stoppers and continuers to:


• The size of the vocabulary (n). It is possible to maximize the available number of short codewords depending on n.

• The distribution of frequencies of the words of the vocabulary. Intuitively, if that distribution presents a very steep slope (that is, if it is very biased), it can be desirable to increase the number of words that can be encoded with short codewords (so a large number of stoppers should be chosen). This implies that the less frequent words are encoded with longer codewords, which does not matter since the gain in the most frequent words compensates the loss of compression in the least frequent ones. When the distribution is rather flat, it is preferable to reduce the largest codeword size, in order not to lose compression in the least frequent words.

6.3 (s, c)-Dense Code

We define (s, c) stop-cont codes as follows.

Definition 6.3.1 Given source symbols with probabilities {p_i}_{0≤i<n}, a (s, c) stop-cont code (where s and c are positive integers) assigns to each source symbol a unique target code formed by a sequence of digits, all of them in the range [s, s + c − 1] (continuers) except the last one, which is in the range [0, s − 1] (a stopper).

It should be clear that a stop-cont coding is just a base-c numerical representation, but adding s to each digit, with the exception that the last digit is between 0 and s − 1. Digits between s and s + c − 1 are called “continuers” and those between 0 and s − 1 are called “stoppers” . The next property clearly follows.

Property 6.3.1 Any (s, c) stop-cont code is a prefix code.

Proof 6.3.1 Suppose one codeword were a prefix of another. The final digit of the shorter codeword is a stopper, with value lower than s, but it would then appear as

an intermediate digit of the longer codeword, whose intermediate digits are continuers with value at least s. This is a contradiction. □

Among all the possible (s, c) stop-cont codes for a given probability distribution, the dense code is the one that minimizes the average codeword length. This is because a dense code uses all the possible codewords of each length; that is, codewords can be assigned sequentially to the ranked symbols.

Definition 6.3.2 Given source symbols with decreasing probabilities {p_i}_{0≤i<n}, the corresponding (s, c)-Dense Code ((s, c)-DC) is the (s, c) stop-cont code that assigns the codewords, shortest first and in increasing numerical order, sequentially to the source symbols ranked by decreasing probability. Hence the codeword of the i-th symbol has k digits, where k is the minimum integer such that i < s(c^k − 1)/(c − 1).

Thus, the codeword corresponding to source symbol i is formed by k − 1 base-c digits, each increased by s, and a final base-s digit. If k = 1, the codeword is simply the stopper i. Otherwise the codeword is formed by the number ⌊x/s⌋ written in base c using k − 1 digits, adding s to each digit, followed by x mod s, where x = i − s(c^{k−1} − 1)/(c − 1).

That is, using symbols of 8 bits (b = 8), the encoding process can be described as follows:

• One-byte codewords from 0 to s − 1 are given to the first s words in the vocabulary.

• Words ranked from s to s + sc − 1 are sequentially assigned two-byte codewords. The first byte of each codeword has a value in the range [s, s + c − 1] and the second in the range [0, s − 1].

• Words from s + sc to s + sc + sc^2 − 1 are assigned three-byte codewords, and so on. Table 6.1 presents this process.

Let us see another example of how codewords are assigned.


Word rank        Codeword assigned        # Bytes    # Words
0                [0]                         1
1                [1]                         1
2                [2]                         1            s
...              ...                         1
s−1              [s-1]                       1
s                [s][0]                      2
s+1              [s][1]                      2
s+2              [s][2]                      2
...              ...                         2
s+s−1            [s][s-1]                    2           sc
s+s              [s+1][0]                    2
s+s+1            [s+1][1]                    2
...              ...                         2
s+sc−1           [s+c-1][s-1]                2
s+sc             [s][s][0]                   3
s+sc+1           [s][s][1]                   3
s+sc+2           [s][s][2]                   3           sc^2
...              ...                         3
s+sc+sc^2−1      [s+c-1][s+c-1][s-1]         3
...              ...                        ...          ...

Table 6.1: Code assignment in (s, c)-Dense Code

Example 6.3.1 The codes assigned to symbols i ∈ 0 . . . 15 by a (2,3)-DC are as follows: ⟨0⟩, ⟨1⟩, ⟨2,0⟩, ⟨2,1⟩, ⟨3,0⟩, ⟨3,1⟩, ⟨4,0⟩, ⟨4,1⟩, ⟨2,2,0⟩, ⟨2,2,1⟩, ⟨2,3,0⟩, ⟨2,3,1⟩, ⟨2,4,0⟩, ⟨2,4,1⟩, ⟨3,2,0⟩ and ⟨3,2,1⟩. □
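The sequential assignment just described can be reproduced with a few lines of code. The following is a minimal Python sketch (the function name is ours, not part of the thesis implementation); running it with s = 2 and c = 3 reproduces the codewords of Example 6.3.1.

    def sc_dense_codeword(i, s, c):
        """(s,c)-Dense codeword of the i-th ranked word (i starts at 0),
        returned as a list of digit values, most significant digit first."""
        digits = [i % s]              # final digit: a stopper in [0, s-1]
        x = i // s
        while x > 0:                  # remaining digits: continuers in [s, s+c-1]
            x -= 1
            digits.append(s + x % c)
            x //= c
        return digits[::-1]

    # A (2,3)-DC over symbols 0..15, as in Example 6.3.1
    print([sc_dense_codeword(i, 2, 3) for i in range(16)])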

Note that the code does not depend on the exact symbol probabilities, but just on their ordering by frequency. We now prove that the dense coding is an optimal stop-cont coding.

Property 6.3.2 The average length of a (s, c)-dense code is minimal with respect to any other (s, c) stop-cont code.

Proof 6.3.2 Let us consider an arbitrary (s, c) stop-cont code, and let us write all the possible codewords in numerical order, as in Example 6.3.1, together with the symbol they encode, if any. Then it is clear that (i) any unused code in the middle could be used to represent the source symbol with

longest codeword, hence a compact assignment of target symbols is optimal; and (ii) if a less probable symbol with a shorter code is swapped with a more probable symbol with a longer code then the average code length decreases, and hence sorting the symbols by decreasing frequency is optimal. □

Since s c^{k-1} different codewords can be coded using k digits, let us call

W_k^s = \sum_{j=1}^{k} s c^{j-1} = s \frac{c^k - 1}{c - 1}

(where W_0^s = 0) the number of source symbols that can be coded with up to k digits. Let us also call

f_k^s = \sum_{j=W_{k-1}^s + 1}^{W_k^s} p_j

the sum of probabilities of source symbols coded with k digits by an (s, c)-DC.

Then, the average codeword length, L_{D(s,c)}, for the (s, c)-DC is

L_{D(s,c)} = \sum_{k=1}^{K^s} k f_k^s = \sum_{k=1}^{K^s} k \sum_{j=W_{k-1}^s+1}^{W_k^s} p_j = 1 + \sum_{k=1}^{K^s-1} k \sum_{j=W_k^s+1}^{W_{k+1}^s} p_j = 1 + \sum_{k=1}^{K^s-1} \sum_{j=W_k^s+1}^{n} p_j

where K^s = \lceil \log_c ( 1 + \frac{n(c-1)}{s} ) \rceil (with c = 2^b − s) and n is the number of symbols in the vocabulary.

It is clear from Definition 6.3.2 that End-Tagged Dense Code is a (2^{b-1}, 2^{b-1})-DC, and therefore (s,c)-DC can be seen as a generalization of End-Tagged Dense Code where s and c are adjusted to optimize the compression for the distribution of frequencies and the size of the vocabulary.

As shown in [13], the (2^{b-1}, 2^{b-1})-DC is more efficient than Tagged Huffman. This is because Tagged Huffman is a (2^{b-1}, 2^{b-1}) (non-dense) stop-cont code, while End-Tagged Dense Code is a (2^{b-1}, 2^{b-1})-Dense Code.


                               Codewords                                       Freq × bytes
Rank  Word  Freq     PH      (6,2)-DC  ETDC     TH            PH     (6,2)-DC  ETDC   TH
1     A     0.200    [0]     [0]       [4]      [4]           0.20   0.20      0.20   0.20
2     B     0.200    [1]     [1]       [5]      [5]           0.20   0.20      0.20   0.20
3     C     0.150    [2]     [2]       [6]      [6]           0.15   0.15      0.15   0.15
4     D     0.150    [3]     [3]       [7]      [7][0]        0.15   0.15      0.15   0.30
5     E     0.140    [4]     [4]       [0][4]   [7][1]        0.14   0.14      0.28   0.28
6     F     0.090    [5]     [5]       [0][5]   [7][2]        0.09   0.09      0.18   0.18
7     G     0.040    [6]     [6][0]    [0][6]   [7][3][0]     0.04   0.08      0.08   0.12
8     H     0.020    [7][0]  [6][1]    [0][7]   [7][3][1]     0.04   0.04      0.04   0.06
9     I     0.005    [7][1]  [6][2]    [1][4]   [7][3][2]     0.01   0.01      0.01   0.015
10    J     0.005    [7][2]  [6][3]    [1][5]   [7][3][3]     0.01   0.01      0.01   0.015
                              average codeword length         1.03   1.07      1.30   1.52

Table 6.2: Comparative example among compression methods, for b=3

Example 6.3.2 Table 6.2 shows the codewords assigned to a small set of words ordered by their frequency when using Plain Huffman (P.H.), (6,2)-DC, End-Tagged Dense Code (ETDC), which is a (4,4)-DC, and Tagged Huffman (TH). Digits of three bits (instead of bytes) are used for simplicity (b=3), and therefore s + c = 8. The last four columns present the product of the number of bytes and the frequency for each word, and their sum, the average codeword length, is shown in the last row.

It is easy to see that, for this example, Plain Huffman and the (6,2)-Dense Code are better than the (4,4)-Dense Code (ETDC), and therefore they are also better than Tagged Huffman. Notice that (6,2)-Dense Code is clearly better than (4,4)-Dense Code because it takes advantage of the distribution of frequencies and of the number of words in the vocabulary. However, the values (6,2) for s and c are not the optimal ones, since a (7,1)-Dense Code obtains an optimal compressed text, having, in this example, the same result as Plain Huffman. □

The problem now consists of finding the s and c values (assuming a fixed b where 2^b = s + c) that minimize the size of the compressed text.

6.4 Optimal s and c values

The key advantage of this method with respect to End-Tagged Dense Code is the ability to use the optimal s,c values. In all the real text corpora from

trec² used in our experiments, the size of the compressed text, as a function of s, has only one local minimum. See Figures 6.2 and 6.3, where optimal s values are shown for some real corpora, as well as the curves where it can be seen that a unique minimum exists.

In Figure 6.2 the size of the compressed texts and the compression ratios are shown as a function of the s value, for the Ziff and AP-Newswire texts. As also shown in the table of Figure 6.2, the optimal s value for the Ziff corpus is 198, while for the AP-Newswire corpus the maximum compression ratio is achieved with s = 189. In the bottom table, we show sizes and compression ratios when s values close to the optimum are used for these two corpora.

[Plot: compression ratio (%) versus s value (50–250) for the AP Newswire and ZIFF Data corpora.]

             AP Newswire Corpus              ZIFF Corpus
s value    ratio(%)    size(bytes)       ratio(%)    size(bytes)
186        31.4575     78,956,616        31.9234     59,191,693
187        31.4565     78,954,060        31.9192     59,183,930
188        31.4558     78,952,303        31.9156     59,177,207
189        31.4554     78,951,377        31.9122     59,171,003
190        31.4555     78,951,674        31.9092     59,165,319
191        31.4560     78,952,893        31.9064     59,160,290
192        31.4569     78,955,045        31.9041     59,155,863
195        31.4622     78,968,414        31.8990     59,146,488
196        31.4650     78,975,466        31.8981     59,144,824
197        31.4684     78,983,892        31.8976     59,143,905
198        31.4722     78,993,433        31.8976     59,143,837
199        31.4764     79,004,086        31.8980     59,144,550
200        31.4813     79,016,243        31.8987     59,146,012
201        31.4866     79,029,758        31.9000     59,148,335

Figure 6.2: Compressed text sizes and compression ratios for different s values.

2Text REtrieval Conference (trec) is an international conference where standard and well-known real corpora are used to test Text Retrieval Systems.


[Plot: size of the compressed text (bytes, ×10^7) versus s value (50–250) for the AP, ZIFF, FT92, FT93 and FT94 corpora.]

Figure 6.3: Size of the compressed text for different s values

This property (which is used as a heuristic) permits a binary search for the optimal s value (and c = 2^b − s). We have produced, however, artificial distributions where more than one local minimum exists. Hence, although a sequential search for the optimal s, c values is theoretically necessary, a binary search will, in real cases, find the best s and c values.

When a unique minimum exists, the size of the compressed text decreases as s is increased, until reaching the unique optimal s value. After that optimal s value, increments of s produce a loss in compression ratio. This is shown in Figure 6.3. Of course, the value of c depends on the value of s because c = 2^b − s always holds.

Intuitively, it is easy to see that when s is very small the number of high-frequency words encoded with very few bytes (that is, one or two bytes) is also very small (s words are encoded with just one byte and s·c with two bytes). But in this case c is large, and therefore words with low frequency will be encoded with few bytes (s·c^2 words will be encoded with 3 bytes, s·c^3 with 4 bytes, and so on; if c is that large, 3 bytes will probably be enough to encode the last word of the ranked vocabulary).

It is clear that, as s grows, the highest-frequency words will be encoded with

fewer bytes, so we improve the compression of high-frequency words. But at the same time, as s grows, the lowest-frequency words will need more bytes to be encoded, so we lose compression in those words.

As a consequence, if we try all the possible values of s starting at s = 1, we will see (as in Figure 6.3) that, in the beginning, compression improves a lot because each increment of s causes some high-frequency words to be encoded with a codeword that is one byte shorter.

When s becomes larger, for each increment of s the number of words encoded with fewer bytes is proportionally smaller and these words have lower frequency. Therefore, with each increment of s, we gain less and less compression in the highest-frequency words. At the same time, we lose more and more compression in the lowest-frequency words, because with each increment of s they need more bytes to be encoded. At some point, the compression lost in the last words is larger than the compression gained in the words at the beginning, and therefore the global compression ratio decreases. That point gives us the optimal s value. It is easy to see in Figure 6.3 that, around the optimal value, the compression is relatively insensitive to the exact value of s. This fact explains the smooth bottom of the curve.

The binary search algorithm takes advantage of this property. It is not necessary to check all the values of s because the shape of the distribution of compression ratios as a function of s is known. Therefore it leads the search towards the area where compression ratio improves.

6.4.1 Algorithm to get the optimal s and c values

This Section presents both the sequential and binary algorithms developed to obtain the optimal s and c values for a given vocabulary.

Both algorithms, BinaryFindBestS() and SequentialFindBestS(), need to know the size (in bytes) of the compressed text for any given s and c values, in order to choose the best pair. This size is computed by another algorithm called computeSizeS().

ComputeSizeS() uses a list of accumulated frequencies acc(), computed beforehand, to obtain the size of the compressed text efficiently. The size (in bytes) of the compressed text can be computed for any given s and c values as follows:

size = acc(s) + \sum_{k \geq 2} k \cdot ( acc(s c^{k-1}) - acc(s c^{k-2}) ) = acc(n) + \sum_{i \geq 0} ( acc(n) - acc(s c^i) )

The pseudocode of computeSizeS() is given next:

computeSizeS (s, c, acc)
(1)  //inputs: s, c and acc, the vector of accumulated frequencies
(2)  //output: the length of the compressed text using s and c
(3)  k ← 1; n ← number of positions in vector 'acc';
(4)  Right ← min(s, n);
(5)  total ← acc[Right − 1];
(6)  while Right < n
(7)      Left ← Right;
(8)      Right ← Right + s·c^k;
(9)      k ← k + 1;
(10)     if Right > n then
(11)         Right ← n;
(12)     total ← total + k ∗ (acc[Right − 1] − acc[Left]);
(13) return total;

Notice that computing the size of the compressed text for a specific value of s costs O(log_c n), except for c = 1, in which case it costs O(n/s) = O(n/2^b).
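As an illustration, the following Python sketch mirrors computeSizeS(); it is a transcription under our own assumptions (0-based lists, with acc[i] holding the accumulated frequency of the i + 1 most frequent words), not the thesis code.

    def compute_size_s(s, c, acc):
        """Size (in bytes) of the compressed text for given s and c, following
        the computeSizeS() pseudocode above (indexing adapted to 0-based lists)."""
        n = len(acc)
        k = 1
        right = min(s, n)
        total = acc[right - 1]                    # words encoded with one byte
        while right < n:
            left = right
            right = min(right + s * c ** k, n)
            k += 1
            total += k * (acc[right - 1] - acc[left - 1])   # words encoded with k bytes
        return total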

Sequential search

Sequentially searching for the best s and c values consists of computing the size of the compressed text for each possible s value and then choosing the s value that minimizes the compressed text size.

SequentialFindBestS (b, acc)
(1)  //inputs: b value (2^b = c + s) and acc, the vector of accumulated frequencies
(2)  //output: the best s and c values
(3)  sizeBestS ← ∞;
(4)  for i = 1 to 2^b − 1
(5)      sizeS ← computeSizeS(i, 2^b − i, acc);
(6)      if sizeS < sizeBestS then
(7)          bestS ← i;
(8)          sizeBestS ← sizeS;
(9)  bestC ← 2^b − bestS;
(10) return bestS, bestC;

Since this algorithm calls computeSizeS() for each s ∈ {1 .. 2^b − 1}, the sequential search cost is

O( \frac{n}{2^b} + \sum_{i=2}^{2^b-1} \log_i n ) = O( \frac{n}{2^b} + \log(n) \sum_{i=2}^{2^b-1} \frac{1}{\log(i)} ) = O( \frac{n}{2^b} + 2^b \log(n) )

The other operations of the sequential search are constant, and we also have an extra O(n) cost to compute the accumulated frequencies. Hence the overall cost to find s and c is O(n + n/2^b + 2^b log(n)) = O(n), assuming a previously sorted vocabulary.

Binary search

It consists of a binary search algorithm that, using the ComputeSizeS() function, computes the size of the compressed text for two consecutive values of s in the middle of the interval checked in each iteration. Initially these two points are 2^{b-1} − 1 and 2^{b-1}. Then the algorithm (using the heuristic of the existence of a unique minimum) can lead the search towards the point that reaches the best compression ratio. In each new iteration, the search space is reduced by half and the compression obtained with the two central points of the new interval is computed again. Finally, the s and c values that minimize the length are returned.

BinaryFindBestS (b, acc)
(1)  //inputs: b value (2^b = c + s) and acc, the vector of accumulated frequencies
(2)  //output: the best s and c values
(3)  Lp ← 1;                                        //Lp and Up are the lower and upper
(4)  Up ← 2^b − 1;                                  //points of the interval being checked
(5)  while Lp + 1 < Up
(6)      M ← ⌊(Lp + Up)/2⌋;
(7)      sizePp ← computeSizeS(M − 1, 2^b − (M − 1), acc);   //size with M − 1
(8)      sizeM ← computeSizeS(M, 2^b − M, acc);              //size with M
(9)      if sizePp < sizeM then
(10)         Up ← M − 1;
(11)     else Lp ← M;
(12) if Lp < Up then                                //Lp = Up − 1 and M = Lp
(13)     sizeNp ← computeSizeS(Up, 2^b − Up, acc);            //size with M + 1
(14)     if sizeM < sizeNp then
(15)         bestS ← M;
(16)     else bestS ← Up;
(17) else bestS ← Lp;                               //Lp = Up = M − 1
(18) bestC ← 2^b − bestS;
(19) return bestS, bestC;
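The binary search can also be sketched in Python. The sketch below relies on the compute_size_s() sketch given earlier and simplifies the tail of the pseudocode by re-evaluating the two remaining candidates instead of reusing sizeM; it is an illustration under the single-minimum heuristic, not the thesis implementation.

    def binary_find_best_s(b, acc):
        """Binary search for the s that minimizes compute_size_s, assuming the
        compressed size has a single local minimum as a function of s."""
        lo, hi = 1, 2 ** b - 1
        while lo + 1 < hi:
            m = (lo + hi) // 2
            # compare two consecutive points to decide which half keeps the minimum
            if compute_size_s(m - 1, 2 ** b - (m - 1), acc) < compute_size_s(m, 2 ** b - m, acc):
                hi = m - 1
            else:
                lo = m
        best_s = min((lo, hi), key=lambda s: compute_size_s(s, 2 ** b - s, acc))
        return best_s, 2 ** b - best_s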

Notice that computing the size of the compressed text for a specific value of s costs O(log_c n), except for c = 1, in which case it costs O(n/s) = O(n/2^b). Hence the most expensive possible sequence of calls to computeSizeS in a binary search is that for values c = 2^{b-1}, c = 2^{b-2}, c = 2^{b-3}, ..., c = 1. The total cost of computeSizeS over that sequence of c values is

\frac{n}{2^b} + \sum_{i=1}^{b-1} \log_{2^{b-i}} n = \frac{n}{2^b} + \sum_{i=1}^{b-1} \frac{\log_2 n}{b-i} = O( \frac{n}{2^b} + \log n \log b )

The other operations of the binary search are constant, and we also have an extra O(n) cost to compute the accumulated frequencies. Hence the overall cost to find s and c is O(n + log(n) log(b)). Since the maximum b of interest is such that b = ⌈log_2 n⌉ (as at this point we can code each symbol using a single stopper), the optimization algorithm costs at most O(n + log(n) log log(n)) = O(n), assuming the vocabulary is already sorted. We have succeeded in making the part that computes the optimal s and c values totally negligible.

In comparison, the Huffman algorithm is also linear once the vocabulary is sorted, but its constant is larger in practice because it involves more operations than just adding up frequencies.


Comparing the cost of computing the optimal s and c values against the process of building a Plain Huffman tree optimally [35], it can be seen that both processes run in linear time O(n) once the vocabulary is sorted. However, building a Huffman tree³ takes O([n + n/2^b] + [n/2^b] + [n/2^b + n]), while computing the optimal s and c values takes only O(n + log(n) log log(n)). As a result, the (s, c)-Dense Code encoding process is faster than the Plain Huffman one.

6.5 Encoding and Decoding algorithms

Once the optimal s and c values for a given vocabulary are known, it is possible to perform the code generation process. This encoding is usually done in a sequential fashion, as shown in Table 6.1. However, an on-the-fly encoding process is also available, as happened with End-Tagged Dense Code. Given a word rank i, its ℓ-byte codeword can be easily computed in O(ℓ) = O(log i) time.

Hence, there is no need to store the codewords (in any form, such as a tree) or the frequencies in the compressed file. It is enough to store the plain words sorted by frequency and the s value used in the compression process. Therefore, the vocabulary will be slightly smaller than in the case of the Huffman code, where some information about the shape of the tree must be stored (even when a canonical Huffman tree is used).

6.5.1 Encoding algorithm

Even though the encoding process, which assigns a codeword to each word in the ordered vocabulary, is performed in a sequential way, an on-the-fly encoding is also possible.

The following pseudocode presents the on-the-fly encode algorithm. The

³In order to perform a canonical Huffman encoding, it is not necessary to maintain a Huffman tree in memory. It is only necessary to calculate the number of leaves that will appear in each level of the tree, that is, the number of codewords of 1, 2, 3, ... bytes that will be generated.

algorithm outputs the bytes of each codeword one at a time, from right to left. That is, it outputs the least significant bytes first.

Encode (i)
(1) //input: i, the rank of the word in the vocabulary
(2) //output: the codeword Ci, emitted from right to left
(3) output i mod s;
(4) x ← i/s;
(5) while x > 0
(6)     x ← x − 1;
(7)     output (x mod c) + s;
(8)     x ← x/c;

6.5.2 Decoding algorithm

The first step of decompression is to store the words that compose the vocabulary in a vector. Since these words were saved ranked by frequency along with the s value and the compressed text during compression, the vocabulary of the decompressor is already recovered ranked by frequency. Once the sorted vocabulary is loaded, the decoding of codewords can begin.

In order to obtain the word wi that corresponds to a given codeword ci, the decoder can run a simple computation to obtain, from the codeword, the rank i of the word. Then, using the value i, it obtains the word from the vocabulary sorted by frequency. A codeword ci of ℓ bytes can be decoded in O(ℓ) = O(log i) time as follows:

The decoder uses a base table. This table indicates the rank of the first word of the vocabulary that is encoded with k bytes (k ≥ 1). Therefore base[1] = 0, base[2] = s, base[3] = s + sc, ..., base[k] = base[k − 1] + s·c^{k-2}.

The decode algorithm takes as input a codeword x, and it iterates over each byte of x. The end of the codeword is easily recognized because its value is smaller than s. After the iteration, the value i holds the relative position of the word wi among all the words of k bytes. Then the base table is used, and i ← i + base[k] is performed. As a result, a position i is returned, and the decoded word wi is obtained from vocabulary[i].


Decode (base, x)
(1) //inputs: x, the codeword to be decoded, and the base table
(2) //output: i, the position of the decoded word in the ranked vocabulary
(3) i ← 0;
(4) k ← 1;                    // number of bytes of the codeword
(5) while x[k] ≥ s
(6)     i ← i × c + (x[k] − s);
(7)     k ← k + 1;
(8) i ← i × s + x[k];
(9) i ← i + base[k];
(10) return i;
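For illustration, the decoding computation can be written in Python as follows. The sketch pairs with the sc_dense_codeword() sketch given in Section 6.3 (both names are ours) and checks the round trip for the (2,3)-DC of Example 6.3.1, with the base table derived from base[k] = base[k−1] + s·c^{k-2}.

    def sc_dense_decode(codeword, s, c, base):
        """Recover the vocabulary rank from a codeword (list of digit values),
        using base[k] = rank of the first word encoded with k digits."""
        i, k = 0, 1
        while codeword[k - 1] >= s:              # continuers
            i = i * c + (codeword[k - 1] - s)
            k += 1
        i = i * s + codeword[k - 1]              # final stopper
        return i + base[k]

    base = {1: 0, 2: 2, 3: 8}                    # for s = 2, c = 3
    assert all(sc_dense_decode(sc_dense_codeword(i, 2, 3), 2, 3, base) == i for i in range(16))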

6.6 Empirical Results

We used some large text collections from trec-2 (AP Newswire 1988 and Ziff Data 1989-1990) and from trec-4 (Congressional Record 1993, Financial Times 1991, 1992, 1993 and 1994). We compressed them applying Plain Huffman (P.H.), (s,c)-Dense Code ((s, c)-DC), End-Tagged Dense Code (ETDC) and Tagged Huffman (T.H.), using bytes (b = 8) as the symbols of the codewords. We used the spaceless word model [49] to create the vocabulary, that is, if a word was followed by a space, we just encoded the word, otherwise both the word and the separator were encoded.

6.6.1 Compression Ratio

We excluded the size of the compressed vocabulary from the results (this size is negligible and similar in all cases, although a bit smaller in (s, c)-DC and ETDC because only the sorted words are needed).

Table 6.3 shows the compression ratio obtained by the different codes. The second column contains the original size of the processed corpus, the third the number of words in the vocabulary, and the following columns give the compression ratio for each method. The fifth column, which refers to (s, c)-DC, also gives the optimal (s, c) values.

As can be seen in Table 6.3, Plain Huffman gets the best compression


Corpus                 Original Size         n      P.H.       (s,c)-DC        ETDC    T.H.
AP Newswire 1988         250,994,525    268,896    31.18    (189,67)  31.46    32.00   34.57
Ziff Data 1989-1990      185,417,980    237,607    31.71    (198,58)  31.90    32.60   35.13
Congress Record           51,085,545    117,713    27.28    (195,61)  27.50    28.11   30.29
Financial Times 1991      14,749,355     75,687    30.19    (193,63)  30.44    31.06   33.44
Financial Times 1992     175,449,248    284,878    30.49    (193,63)  30.71    31.31   33.88
Financial Times 1993     197,586,334    291,404    30.60    (195,61)  30.79    31.48   34.10
Financial Times 1994     203,783,923    294,490    30.57    (195,61)  30.77    31.46   34.08

Table 6.3: Comparison of compression ratios.

ratio (as expected, since it is the optimal prefix code) and End-Tagged Dense Code always obtains better results than Tagged Huffman, with an improvement of up to 2.5 percentage points. As expected, (s, c)-DC improves the results reached by ETDC (a (128, 128)-DC), and it is worse than the optimal Plain Huffman only by less than 0.5 percentage points on average.

6.6.2 Encoding Time

As shown in the previous section, (s, c)-DC compression ratios are very close to the Plain Huffman ones. In this section we compare the (s, c)-DC and Plain Huffman encoding phases and measure code generation time.

The model used for compressing a corpus in our experiments is described in Figure 6.4. Three main phases arise.

1. The first phase is vocabulary extraction. The corpus is processed once in order to obtain all distinct words in it (n) and their number of occurrences. The result is a list of pairs (word, frequency), which is then sorted by frequency. This phase is identical for both Plain Huffman and (s, c)-Dense Code.

2. In the second phase, encoding, each word in the vocabulary is assigned a codeword that minimizes average codeword length. This process is done in a different way for each method:

• Plain Huffman encoding phase is split into two main parts: Creating the Huffman tree uses the Huffman algorithm to build


[Diagram: the corpus is processed to obtain the (word, frequency) vector, which is sorted by frequency; the encoding phase then either creates the Huffman tree (Plain Huffman) or computes the accumulated frequency list and runs Find Best S to obtain the optimal s, c ((s,c)-DC), followed in both cases by sequential code generation; the resulting hash table of pairs is used in the compression phase. T0 and T1 mark the start and end of code generation.]

Figure 6.4: Vocabulary extraction and encoding phases

a tree where each leaf corresponds to one of the n words in the vocabulary, and the number of internal nodes is at most ⌈n/(2^b − 1)⌉. Then, starting from the root of the tree, the depth of each leaf is computed. Further details of this first part can be found in [35, 39]. Code assignment starts at the bottom of the tree (longest codewords) and goes through all leaf nodes. Nodes in the same level are given codewords sequentially, and a jump of level is determined by using the previously computed leaf depths. During this process, two vectors base and first, needed for fast decompression, are also set: base[l] = x if x is the first node of the l-th level, and first[l] = y if y is the first codeword of l bytes. Encoding takes O(n) time overall.

• The (s, c)-DC encoding phase also has two parts: The first computes


the list of accumulated frequencies and searches for the optimal s and c values. Its cost is O(n + 2^b log(n)) = O(n). After getting the optimal s and c values, (s, c) sequential encoding is performed. The overall cost is O(n).

In both cases, the result of the encoding section is a hash table of pairs (word, codeword).

3. The third phase, compression, processes again the whole source text. For each word input, compression looks for it inside the hash table and outputs its corresponding codeword.

Since vocabulary extraction (previous to code generation), and building the hash table of pairs and compression (after code generation) are common phases for Plain Huffman and (s, c)-DC methods, we only measured code generation time (T1 − T0 in Figure 6.4), to compare both methods.

We also included other text collections, such as the Calgary corpus (a very small one), and two larger collections: ALL FT, which aggregates the corpora FT91, FT92, FT93 and FT94, and ALL, which is composed of the Calgary corpus and all texts from trec-2 and trec-4. We compressed them applying Plain Huffman and (s,c)-Dense Code.

A dual Intel® Pentium®-III 800 MHz system, with 768 MB of SDRAM (100 MHz), was used in our tests. It ran Debian GNU/Linux (kernel version 2.2.19). The compiler used was gcc version 2.95.2 20000220 and -O9 compiler optimizations were used. Time results measure encoding "user time".

CORPUS          #words          n      (s,c)-DC (msec)   Plain (msec)   DIFF (%)
CALGARY          528,611     31,000          6.150          11.133        44.76
FT91           3,135,383     75,687         15.350          26.500        42.08
CR            10,230,907    117,713         25.750          49.833        48.33
ZIFF          40,866,492    237,607         56.650         105.900        46.51
AP            53,349,620    268,896         64.700         121.900        46.92
FT92          36,803,204    284,878         69.000         129.817        46.85
FT93          42,063,804    291,404         69.725         133.350        47.71
FT94          43,335,126    294,490         71.600         134.367        46.71
ALL FT       124,971,944    577,290        142.875         260.800        45.22
ALL          229,596,845    885,873        216.225         402.875        46.33

Table 6.4: Code generation time comparison


Table 6.4 shows the results obtained. The first column indicates the corpus processed, the second the number of words in the corpus, and the third the number of distinct words in the vocabulary. The fourth and fifth columns give the encoding time (in milliseconds) for (s, c)-DC and Plain Huffman. The last column shows the gain (in percentage) of (s, c)-DC over Plain Huffman.

The (s, c)-DC code generation process is always about 45% faster than Plain Huffman's. Although encoding is O(n) in both methods, (s, c)-DC performs simpler operations. Computing the list of accumulated frequencies and searching for the best (s, c) pair only involve elementary operations, while the process of building a canonical Huffman tree has to deal with the tree structure.

6.6.3 Decompression Time

The decompression process is almost identical for Plain Huffman and (s, c)-DC. The process starts by loading the words of the vocabulary into a vector V. For decoding a codeword, (s, c)-DC also needs the s value used in compression, while Plain Huffman needs to load two vectors: base and first. Next, the compressed text is read and each codeword is replaced by its corresponding uncompressed word. Since it is possible to detect the end of a codeword by using either the s value (in (s, c)-DC) or the first vector (in Plain Huffman), decompression is performed codeword-wise. Given a codeword C, a simple decoding algorithm obtains the position i of the word in the vocabulary, such that V[i] is the uncompressed word that corresponds to codeword C. Decompression takes O(n) time, where n is the size of the compressed text.

Each corpus described in Section 6.6.2 was decompressed using both Plain Huffman and (s, c)-DC. Results are shown in Table 6.5. The size of the compressed text is shown in columns two and three. The next two columns present decompression "user time" (in seconds). The sixth and seventh columns show decompression speed (in Kbytes per second), and the last one shows the gain (in percentage) in decompression speed of (s, c)-DC over Plain Huffman. Results give (s, c)-DC a small advantage in decompression

speed in most corpora, although differences are usually negligible.

             Compressed Text Size            Decompression time (sec)    Decompression speed (Kbytes/sec)
CORPUS       (s,c)-DC        PH              (s,c)-DC    PH              (s,c)-DC    PH          DIFF%
CALGARY          748,657        740,698         0.165     0.158           4,530.45    4,687.96   -3.477
FT91           4,489,787      4,453,239         1.073     1.006           4,186.28    4,426.68   -5.743
CR            14,049,996     13,937,879         3.429     3.346           4,097.40    4,165.53   -1.663
ZIFF          59,143,837     58,793,833        13.064    12.994           4,527.14    4,524.59    0.056
AP            78,951,674     78,266,031        18.091    17.985           4,364.16    4,351.74    0.285
FT92          53,884,292     53,501,281        12.541    12.509           4,296.55    4,277.17    0.451
FT93          60,846,348     60,460,320        13.752    13.676           4,424.65    4,420.91    0.085
FT94          62,704,466     62,305,648        14.315    14.392           4,380.33    4,329.19    1.168
ALL FT       182,657,196    181,810,383        42.070    41.882           4,341.74    4,341.01    0.017
ALL          348,361,693    346,337,592        77.627    77.756           4,487.63    4,454.16    0.746

Table 6.5: Decompression speed comparison

6.7 Searching (s, c)-Dense Code

As shown, (s, c)-Dense Code is a prefix code. For this reason, the concatenation of bytes from two consecutive codewords cannot produce a false match. Therefore, it is possible to use a Boyer-Moore type search, which permits skipping some bytes of the compressed text. However, it has to be taken into account that both End-Tagged Dense Code and (s, c)-Dense Code are not suffix-free codes (a codeword can be the suffix of another codeword). Hence, each time a match occurs, it is mandatory to check whether the byte preceding the position where the text starts to match the pattern is a stopper or not. If the previous byte is a stopper, then a true match is reported. However, if it is a continuer, the match is not with the searched codeword but with the suffix of a longer codeword, and the process continues. An example of how false matches can be detected is presented in Figure 6.5.

This overhead in searches is negligible, because checking the previous byte is only needed when a match is detected and not during the search phase. Moreover, this small disadvantage with respect to Tagged Huffman (which is a suffix code) is compensated by the fact that the compressed text is smaller with (s, c)-Dense Code and End-Tagged Dense Code than with Tagged Huffman.


Pattern (codeword to search):  101 001   (a continuer followed by a stopper)

Compressed text:  010  111  110  101  001  101  001

The first occurrence of 101 001 is preceded by the continuer 110, so it is a false matching (it is the suffix of a longer codeword); the second occurrence is preceded by the stopper 001, so it is a right matching.

Figure 6.5: Searching (5, 3)-Dense Code
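The check described above is cheap to implement. The following Python sketch illustrates it on top of a plain substring search (the thesis uses a Boyer-Moore type search instead; the function name is ours).

    def search_codeword(text, codeword, s):
        """Positions where `codeword` (bytes) really occurs in the compressed
        `text` (bytes): candidate matches preceded by a continuer are discarded,
        since they are suffixes of longer codewords."""
        hits, pos = [], text.find(codeword)
        while pos != -1:
            if pos == 0 or text[pos - 1] < s:    # preceding byte is a stopper
                hits.append(pos)
            pos = text.find(codeword, pos + 1)
        return hits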

6.8 Bounding Plain Huffman with (s, c)-Dense Code

Gonzalo, once you have reviewed what we did with the Zipf-Mandelbrot analysis and given it your approval (I hope so), we will include it here.

6.9 Conclusions

(s, c)-Dense Code, a simple method for compressing natural language text databases with several advantages over the existing techniques, was presented. This technique is a generalization of the previous End-Tagged Dense Code, and it improves its compression ratio by adjusting the parameters s and c to the distribution of frequencies of the text to be compressed.

[Chart: qualitative comparison of P.H., T.H., (s,c)-DC and ETDC in terms of compression ratio (low/med/high), compression speed (slow/med/fast) and search speed (slow/med/fast).]

Figure 6.6: Compression ratio, compression speed and search speed comparison

Some empirical results comparing (s, c)-DC against Huffman codes were shown. In compression ratio, our new code is strictly better than Tagged Huffman Code (by 3.5% in practice), reaching only 0.5% excess over the optimal Plain Huffman Code. The new code is simpler to build than Huffman codes and can be built in around half the time. This makes (s, c)-DC a real

alternative to Plain Huffman in situations where a simple, fast and good compression method is required. Moreover, (s, c)-DC enables, like Tagged Huffman, fast direct searching of the compressed text, which improves on Plain Huffman's search efficiency. Only Tagged Huffman Code can be searched so efficiently, but it produces about 11% larger compressed texts.

Figure 6.6 summarizes compression ratio, compression speed and search speed of the different methods.

Part II

Adaptive Compression


7

Dynamic Text Compression Techniques

7.1 Overview

In this Chapter a brief description of the state of the art in dynamic text compression techniques is presented. First, the motivation of adaptive techniques versus static and semi-static ones is shown. In Section 7.3 the operation of statistical dynamic codes is explained. Dynamic character-based Huffman techniques, which are the basis of the methods presented in Chapters 8, 9 and 10, are described in Subsection 7.3.1. Arithmetic coding, another typical statistical dynamic technique, is shown in Subsection 7.3.2. Section 7.4 is dedicated to PPM, an interesting statistical compression scheme. Finally, Section 7.5 describes dictionary techniques and the variations of Ziv-Lempel, one of the most widespread compression families.

7.2 Introduction

Transmission of compressed data is usually composed of four processes: compression, transmission, reception, and decompression. The first two are carried out by a sender process and the last two by a receiver. This structure

can be seen, for example, when a user downloads a zip file from a website. The file, which was previously compressed on the server, is transmitted in compressed form when the user downloads it. Finally, once the transmission has finished, it can be decompressed.

There are several interesting real-time transmission scenarios, however, where compression, transmission, reception, and decompression processes should take place concurrently. That is, the sender should be able to start the transmission of compressed data without preprocessing the whole text, and simultaneously the receiver should start reception and decompression of the text as it arrives.

Real-time transmission is usually of interest when communicating over a network. This kind of compression can be applied, for example, in the following scenarios:

• Interactive services such as remote talk/chat protocols, where small messages are exchanged during the whole communication process.

• Transmission of Web pages. Installing a browser plug-in to handle decompression enables the exchange of compressed (relatively small) pages between a server and a client over time in a more efficient way.

• Wireless communication with hand-held devices with little bandwidth and processing power (therefore the decompressor must be simple).

Real-time transmission is handled by dynamic compression techniques. Currently, the most widely used adaptive compression techniques belong to the Ziv-Lempel family [60, 61, 54]. When applied to natural language text, however, the compression ratios achieved by Ziv-Lempel are not that good (around 40%). Their advantages are compression speed and mainly decompression speed.

Other adaptive techniques, such as arithmetic coding [1, 57, 38] or the Prediction by Partial Matching (PPM) technique [8], have proven competitive regarding compression ratio. However, they are not time-efficient.


Classic Huffman code [27] is a well-known two-pass method. Making it dynamic was first proposed in [19, 22]. This method was later improved in [29, 52]. However, being character-based, the compression ratios achieved were not good. It is interesting to point out that the operation of the adaptive Huffman-based techniques can be extrapolated to a word-based approach. They are the basis of the dynamic techniques presented in Chapters 8, 9 and 10.

This Chapter presents the most common adaptive compression alternatives, which are used in Chapter 10 to compare empirically against the dynamic compression techniques developed in this work.

7.3 Statistical Dynamic Codes

Statistical dynamic compression techniques are one-pass. Symbol frequencies are collected as the text is read, and consequently, the mapping between symbols and codewords is updated as compression progresses. The receiver acts in the same way as the sender: it computes symbol frequencies and updates the correspondence between codewords and symbols each time a codeword is received.

In particular, dynamic statistical compressors model the text using only the information about source symbol frequencies, that is, f(si) is the number of times that the source symbol si appears in the text (read up to now).

In order to maintain the vocabulary up-to-date, dynamic techniques need a data structure to keep all symbols si and their frequencies f(si) up to now. This data structure is used by the encoding/decoding scheme, and is continuously updated during compression/decompression. For each new source symbol, if it is already in the vocabulary, its frequency is increased by 1. If it is not, it is inserted in the vocabulary and its frequency is set to 1. After each change, the vocabulary is reordered and therefore the codeword assigned to any source symbol may be different than before.

To permit the sender to inform the receiver of new source symbols that appear in the text, a special source symbol new-Symbol (whose frequency

is zero by definition to keep it always in the last position) is always held in the vocabulary. The sender transmits new-Symbol each time a new symbol arises in the source text. Then, the sender encodes the source symbol in plain form (e.g., using ASCII code for words) so that the receiver can insert it in its vocabulary.

Figure 7.1 depicts the sender and receiver processes, highlighting the symmetry of the scheme. CodeBook stands for the mapping between symbols and codes, which is used to assign codes to source symbols or vice versa. Note that new-Symbol is always the least frequent symbol of the CodeBook, and n is the number of symbols in the source text.

7.3.1 Dynamic Huffman Codes

In [19, 22] an adaptive character-oriented Huffman coding algorithm was presented. It was later improved in [29], being named FGK algorithm. FGK is the basis of the UNIX compact command.

FGK maintains a Huffman tree for the source text already read. The tree is adapted each time a symbol is read to keep it optimal. It is maintained both by the sender, to determine the code corresponding to a given source symbol, and by the receiver, to do the opposite.

Thus, the Huffman tree acts as the CodeBook of Figure 7.1. Consequently, it is initialized with a unique special node called zeroNode (corresponding to new-Symbol), and it is updated every time a new source symbol is inserted in the vocabulary or when a frequency is increased. The codeword for a source symbol corresponds to the path from the tree root to the leaf corresponding to that symbol. Any leaf insertion or frequency change may require reorganizing the tree to restore its optimality.

The main challenge of Dynamic Huffman is how to reorganize the Huffman tree efficiently upon leaf insertions and frequency increments. This is a complex and potentially time-consuming process that must be carried out both by the sender and the receiver.

The basis of the FGK algorithm is the sibling property defined by Gallager in [22].


Sender ()
(1)  Vocabulary ← {C_{new-Symbol}};
(2)  Initialize CodeBook;
(3)  for i ∈ 1 . . . n do
(4)      read si from the text;
(5)      if si ∉ Vocabulary then
(6)          send C_{new-Symbol};
(7)          send si in plain form;
(8)          Vocabulary ← Vocabulary ∪ {si};
(9)          f(si) ← 1;
(10)     else
(11)         send CodeBook(si);
(12)         f(si) ← f(si) + 1;
(13)     Update CodeBook;

Receiver ()
(1)  Vocabulary ← {C_{new-Symbol}};
(2)  Initialize CodeBook;
(3)  for i ∈ 1 . . . n do
(4)      receive Ci;
(5)      if Ci = C_{new-Symbol} then
(6)          receive si in plain form;
(7)          Vocabulary ← Vocabulary ∪ {si};
(8)          f(si) ← 1;
(9)      else
(10)         si ← CodeBook⁻¹(Ci);
(11)         f(si) ← f(si) + 1;
(12)     output si;
(13)     Update CodeBook;

Figure 7.1: Sender and receiver processes in statistical dynamic text compression.
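The scheme of Figure 7.1 can be sketched in Python as follows. The CodeBook interface (encode, decode, update) and the NEW_SYMBOL marker are hypothetical placeholders of ours: the actual codeword assignment (dynamic Huffman, the dynamic dense codes of later chapters, ...) is orthogonal to this loop.

    NEW_SYMBOL = object()            # plays the role of new-Symbol

    def sender(words, codebook, send):
        """Sketch of the sender loop: new words are announced and sent in plain
        form; known words are sent as codewords; frequencies drive the codebook."""
        freq = {}
        for w in words:
            if w not in freq:
                send(codebook.encode(NEW_SYMBOL))
                send(w)                          # new word in plain form
                freq[w] = 1
            else:
                send(codebook.encode(w))
                freq[w] += 1
            codebook.update(freq)                # may change every codeword

    def receiver(receive, codebook, output, n):
        """Mirror of the sender: it rebuilds the same vocabulary and codebook."""
        freq = {}
        for _ in range(n):
            code = receive()
            if code == codebook.encode(NEW_SYMBOL):
                w = receive()                    # plain-form word
                freq[w] = 1
            else:
                w = codebook.decode(code)
                freq[w] += 1
            output(w)
            codebook.update(freq)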


Property 7.3.1 A binary code tree has the sibling property if each node (except the root) has a sibling and if all nodes can be listed in decreasing weight order, with each node adjacent to its sibling.

Gallager also proved that a binary prefix code is a Huffman code iff the code tree has the sibling property.

Using the sibling property, the main achievement of FGK is to ensure that the tree can be updated by doing only a constant amount of work per node in the path from the affected leaf to the tree root. Calling l(si) the path length from the leaf of source symbol si to the root, and f(si) its frequency, the overall cost of algorithm FGK is Σ f(si) l(si), which is exactly the length of the compressed text, measured in number of target symbols.

In [52] an improvement upon FGK, named the Λ algorithm, was presented. An implementation of Λ can be found in [53]. The main difference with respect to FGK is that the Λ algorithm uses a different method to update the tree, which minimizes not only Σ f(si) l(si) (the compressed text length) but also the external path length, Σ l(si), and the height of the tree, max l(si). Moreover, the Λ algorithm reduces to 1 the number of interchanges in which a node is moved upwards in the tree during an update. Although these changes do not modify the complexity of the whole algorithm, they give the Λ algorithm an advantage in compression ratio over FGK, and even over static Huffman for small messages. Results in [52] show that adaptive Huffman methods are directly comparable to classic Huffman in compression ratio.

7.3.2 Arithmetic Codes

Arithmetic coding was first presented in the sixties in [1]. Following the same idea as Huffman, it uses the probabilities of the input symbols in order to achieve compression.

Distinct models can be used to calculate, for a given context, the probability of the next input symbol, so that static, semi-static and also adaptive arithmetic codes are available. The key idea of this technique is to represent an input sequence of symbols by a unique real number in

the range [0,1). As the message to be encoded becomes larger, the interval needed to represent it becomes narrower, and therefore the number of bits needed to represent it grows.

Basically an arithmetic encoder starts with a list of the n symbols of the vocabulary and their probabilities. The initial interval is [0,1). When a new input symbol is processed, the interval is reduced in accordance with the symbol probability provided by the model used, and the interval becomes a narrower range which represents the input sequence of symbols already processed.

We present in Example 7.3.1 a semi-static arithmetic compressor to explain how arithmetic compression works. Note that making it dynamic consists only of adapting the frequency of the source symbols each time one of them is input.

Example 7.3.1 Let us compress the message "AABC!" using a semi-static model. In the first phase the compressor creates the vocabulary, which is composed of four symbols: 'A', 'B', 'C' and '!'. Their frequencies are 0.4, 0.2, 0.2 and 0.2, respectively. Therefore, in the initial state, any number in the interval [0, 0.4) represents symbol 'A', and the intervals [0.4, 0.6), [0.6, 0.8) and [0.8, 1) represent symbols 'B', 'C', and '!' respectively. Since the first symbol to encode is 'A', the interval [0,1) is reduced to [0, 0.4). The next possible sub-intervals are [0, 0.16), [0.16, 0.24), [0.24, 0.32) and [0.32, 0.4). They would represent the sequences 'AA', 'AB', 'AC' or 'A!'. Figure 7.2 represents graphically the intervals of the whole process. Since the next symbol is again 'A', the current working interval is reduced to [0, 0.16). Note that the size of this interval matches the probability of the sequence encoded, that is, 0.4 × 0.4 = 0.16. To encode the next input symbol 'B' the interval is reduced to [0.064, 0.096), so that the sequence 'AAB' has a probability of 0.032 = 0.096 − 0.064. After processing 'C' the range becomes [0.0832, 0.0896), and the probability associated to 'AABC' is 0.0064. Finally, the possible sub-intervals are [0.0832, 0.08576), [0.08576, 0.08704), [0.08704, 0.08832) and [0.08832, 0.0896). Since '!' was the last symbol of the vocabulary, each number in the interval [0.08832, 0.0896) represents the message 'AABC!'. So the encoder generates the number in that interval that can be encoded with the fewest bits. □


[0, 1):             A [0, 0.4)          B [0.4, 0.6)          C [0.6, 0.8)          ! [0.8, 1)
[0, 0.4):           A [0, 0.16)         B [0.16, 0.24)        C [0.24, 0.32)        ! [0.32, 0.4)
[0, 0.16):          A [0, 0.064)        B [0.064, 0.096)      C [0.096, 0.128)      ! [0.128, 0.16)
[0.064, 0.096):     A [0.064, 0.0768)   B [0.0768, 0.0832)    C [0.0832, 0.0896)    ! [0.0896, 0.096)
[0.0832, 0.0896):   A [0.0832, 0.08576) B [0.08576, 0.08704)  C [0.08704, 0.08832)  ! [0.08832, 0.0896)

Figure 7.2: Operation of an arithmetic compressor
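The interval narrowing of Example 7.3.1 is easy to follow in code. The following Python sketch (ours, with the example's fixed symbol table) recomputes the successive intervals and ends with the interval [0.08832, 0.0896) of the example.

    def arithmetic_intervals(message, probs):
        """Return the final [low, high) interval representing `message`, where
        `probs` lists (symbol, probability) pairs in vocabulary order."""
        low, high = 0.0, 1.0
        for sym in message:
            width = high - low
            cum = 0.0
            for s, p in probs:                   # locate the sub-interval of `sym`
                if s == sym:
                    low, high = low + cum * width, low + (cum + p) * width
                    break
                cum += p
        return low, high

    print(arithmetic_intervals("AABC!", [("A", 0.4), ("B", 0.2), ("C", 0.2), ("!", 0.2)]))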

The decompressor only has to know the vocabulary used, the probabilities of the source symbols and the number of symbols transmitted. From the compressed data it can detect the intervals used in the encoding phase and, from these intervals, recover the source symbols.

In general, arithmetic compression surpasses Huffman compression ratios. When static or semi-static models are used, compression and decompression speeds are not competitive with Huffman-based techniques, and moreover they have the disadvantage that decompression and searches cannot be performed directly on the compressed text, because it is represented by a single number, so decompression is mandatory before a search can be performed. As a result, arithmetic coding is not useful in


TR environments. However, it becomes a good alternative to adaptive Huffman codes. In [57] an adaptive character-based arithmetic technique is compared against the compact algorithm (which is based on FGK dynamic Huffman). In this case arithmetic encoding is faster in both compression and decompression, and it also achieves a better compression ratio.

Several modifications of the basic arithmetic algorithm improving its performance and/or using distinct models have been made. In [57], Witten, Neal and Cleary explained an arithmetic coder based on the use of integer arithmetic. In [38] Moffat et al. made improvements focused on avoiding multiplications and divisions, which could be replaced with faster shift/add operations. Moreover, the code from [16] (based on [38]) is in the public domain and has been used in our tests.

7.4 Prediction by Partial Matching

The Prediction by Partial Matching (PPM) technique was first presented in 1984 by Cleary and Witten [18].

PPM is an adaptive statistical technique based on context modelling and prediction. Basically, PPM uses sequences of previous symbols in the uncompressed symbol stream to predict the frequency of the next symbol in the input stream, and then it applies an arithmetic technique [57, 36] to encode that symbol using the predicted frequency.

PPM is based on using the last k characters from the input stream to predict the probability of the following one. This is why it is a finite-context model of order k. That is, a finite-context model of order 2 uses only the two previous symbols to predict the frequency of the next incoming symbol.

The maximum context length (that is, the length of a sequence of symbols that are considered through the highest-order model) is usually 5. It was shown [18, 33] that increasing the context length beyond 5 − 6 does not usually improve compression ratio.

PPM combines several finite-context models of order k, where k takes

the values from 0 to m (m is the maximum context length). For each model, PPM keeps track of all k-length sequences Si that have appeared previously. Moreover, for each of those sequences Si, all characters that have previously followed Si, as well as the number of times they have appeared, are kept. The number of times that a character followed a k-length sequence is used to predict the probability of the incoming character in the model of order k. Therefore, for each model, a separate predicted probability distribution is obtained.

The m + 1 probability distributions are blended into a single one, and arithmetic coding is used to encode the current character using that distribution. In general, the probability predicted by the highest-order model (the m-order model) is used. However, if a novel character is found in this context (no m-length sequence precedes the new character), then it is not possible to encode the new character using the given m-order model, and the (m − 1)-order model has to be tried. In this case, an escape symbol is transmitted to warn the decoder that a change from an m to an (m − 1)-order model occurs. The process continues until reaching a model where the incoming symbol is not novel (so that the symbol can be encoded with the frequency predicted by the model). To ensure that the process always finishes, a (−1)-order model is assumed to exist. This bottom-level model predicts all characters si from the source alphabet (Σ) with the same probability, that is, p(si) = 1/|Σ|. Note that, each time a model shift (from k to k − 1) happens due to a novel symbol, the probability given to the escape symbol in the k-order model needs to be combined with the probability that the (k − 1)-order model assigns to the novel symbol.

We call ω the symbol whose probability is being predicted, pk(ω) the probability that a k-order model assigns to ω and ek the probability assigned to the escape symbol by a k-order model. Two situations can happen:

1. ω can be predicted by the m-order model. In this case ω is encoded

using the probability p(ω) = pm(ω).

2. ω can be predicted by a k-order model (k < m). Then the probability used to encode ω is p(ω) = p_k(ω) × \prod_{l=k+1}^{m} e_l (see the sketch below).
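A minimal Python sketch of this combination rule follows; the escape dictionary indexed by model order is a hypothetical input of ours, and the function merely multiplies the escape probabilities accumulated while shifting from order m down to order k.

    def blended_probability(k, m, p_k_omega, escape):
        """Probability used to encode a symbol first predicted by the order-k
        model after escaping from orders m, m-1, ..., k+1 (escape[l] = e_l)."""
        p = p_k_omega
        for l in range(k + 1, m + 1):
            p *= escape[l]
        return p                                 # with k == m this is just p_m(omega)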


Distinct methods can be used to assign probabilities to both the escape symbol and a novel input symbol. For example, PPMA [18] uses a method called method A, which makes p_k(ω) = c_k(ω)/(1 + c_k) and e_k = 1/(1 + c_k), where c_k(ω) is the number of times that the character ω appeared in the k-order model, and c_k is the total count of characters that were first predicted by a k-order model. PPMC [33], the first popular PPM variant (because PPM algorithms require a significant amount of memory and earlier computers were not powerful enough), uses method C to assign probabilities. It sets e_k = d_k/c_k and p_k(ω) = (c_k − d_k) × c_k(ω) / c_k², where d_k is the number of distinct characters that were first predicted by a k-order model.

Choosing a method to assign a probability to the escape symbol is an interesting problem known as the zero-frequency problem. Distinct methods have been proposed: methods A and B [18], method C [33], method D [26] and method X [55]. A description and a comparison of those methods can be found in [37].

Recent PPM implementations obtain good compression results. As shown in [17], PPMC needs 2.48 bits per character (bpc) to encode the Calgary corpus, and another variant, PPM* (or PPMX) [17], needs only 2.34 bpc (less than 30% in compression ratio). However, compression and decompression speeds are not so good (PPM is about 5 times slower than gzip in compression).

Nowadays, the main competitor of PPM in compression ratio is bzip2¹. In [56] it is shown that PPM and bzip2 are quite similar in compression ratio and compression speed (with a small advantage for PPM). However, bzip2 is 3 times faster than PPM in decompression. These reasons (similar compression, but worse decompression for PPM) and the fact that bzip2 has become a well-known and widespread compression technique, led us to use bzip2 instead of PPM in our tests.

¹bzip2 is a freely available, patent-free data compressor. It uses the Burrows-Wheeler Transform along with a Huffman code.


7.5 Dictionary techniques

Dictionary techniques do not rely on statistics about the number of occurrences of symbols in a text. They are based on building a dictionary where substrings are stored, in such a way that each substring is assigned a codeword (usually the codeword consists of its position in the dictionary). Using that dictionary, each time a substring is read from the source stream, it is looked up in the dictionary and substituted by its codeword. Compression is achieved because the length of the substituted substrings is usually larger than the number of bits needed to represent their codewords.

Dictionary techniques are the most commonly used compression techniques. In fact, the Ziv-Lempel family [60] is the most representative one, and its two main variants, LZ77 [60] and LZ78 [61], are the basis of the gzip and compress programs, respectively.

7.5.1 LZ77

LZ77, presented in 1977 [60], is the first technique that Abraham Lempel and Jacob Ziv proposed. It follows the dictionary strategy: its dictionary is composed of the last n characters already processed (n being a fixed value) and is commonly called the sliding window.

Compression starts with an empty sliding window. In each step of the compression process, the longest substring w that already appears in the window is read. That is, characters w1, w2, ..., wk after the window are read while the substring w = w1 w2 ... wk can be found in the window. Suppose that U is the next character after w. Then the compressor outputs the triplet < position, length, U >, and the window is slid k + 1 positions. If w = ∅ then the triplet < 0, 0, U > is output. The position element of the triplet is the backwards offset, with respect to the end of the window, at which w appears, and the length element is the size of the substring w (that is, k).
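Before the worked example, the following C sketch illustrates this compression loop over a small window. It is only a toy illustration (brute-force match search, no overlap with the lookahead, ties in match length resolved towards the smallest offset), not the thesis code nor the actual gzip implementation:

#include <stdio.h>
#include <string.h>

#define WIN 5   /* sliding window size, as in Example 7.5.1 */

/* Toy LZ77 compressor: emits <position, length, next char> triplets, where
 * position is the backwards offset from the end of the window (0 = no match). */
static void lz77_compress(const char *text)
{
    size_t n = strlen(text), i = 0;
    while (i < n) {
        size_t best_len = 0, best_pos = 0;
        size_t wstart = (i > WIN) ? i - WIN : 0;            /* start of the window   */
        for (size_t j = wstart; j < i; j++) {               /* candidate match start */
            size_t len = 0;
            while (i + len + 1 < n && j + len < i &&        /* stay inside window    */
                   text[j + len] == text[i + len])          /* and keep matching     */
                len++;
            if (len > 0 && len >= best_len) {               /* prefer nearest match  */
                best_len = len;
                best_pos = i - j;
            }
        }
        printf("<%zu,%zu,%c>\n", best_pos, best_len, text[i + best_len]);
        i += best_len + 1;                                  /* slide the window      */
    }
}

int main(void) { lz77_compress("abbabcabbbbc"); return 0; }

On the text of Example 7.5.1 below, this sketch emits exactly the six triplets shown in Figure 7.3.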

Example 7.5.1 Let us compress the text “abbabcabbbbc” using the LZ77 technique, supposing a sliding window of only 5 bytes. The process, which is summarized in Figure 7.3, consists of the following six steps:

Figure 7.3: Compression using LZ77 (the window position and the triplet output at each of the six steps, with step 5 shown in detail: offset = 3, w = “ab”, |w| = 2)

1. The window starts with no characters, so ’a’ is not in the dictionary and < 0, 0, a > is output. Moreover, the window is slid 1 character to the right, so the window contains ⟨-,-,-,-,’a’⟩.

2. Next ’b’ is read; it is not in the dictionary and < 0, 0, b > is output. Finally the window is slid 1 position, so the new window contains ⟨-,-,-,’a’,’b’⟩.

3. A new ’b’ is read. It is in the last position of the window, but the prefix ’ba’ is not in the window. Therefore w = ’b’, U = ’a’ and < 1, 1, a > is output. Finally the window is slid |w| + 1 = 2 positions, so the window contains ⟨-,’a’,’b’,’b’,’a’⟩.

4. The next substring is ’b’, but ’bc’ does not appear in the sliding window, so w = ’b’, U = ’c’ and < 2, 1, c > is output. After sliding the window |w| + 1 = 2 positions, it contains ⟨’b’,’b’,’a’,’b’,’c’⟩.

5. The process continues by reading ’abb’, such that ’ab’ is found in the window, and < 3, 2, b > is output. The new window contains ⟨’b’,’c’,’a’,’b’,’b’⟩.

6. Finally, since ’bb’ is a recognized substring but ’bbc’ does not appear in the sliding window, < 2, 2, c > is output.


Decompression is similar to compression, since the window holds the last decoded elements. Given a triplet < position, length, nextChar >, the decompressor outputs length characters starting position characters before the end of the window, and finally it also outputs nextChar. Note that the window is quite small (it fits in the cache of a modern processor) and that decoding each triplet implies only one array lookup, so decompression is very fast.
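The following C sketch (an illustration, not the thesis code) shows how simple the decoder is: for each triplet it copies length characters from the already decoded text and then appends the explicit character. Feeding it the triplets of Example 7.5.1 reproduces the original text:

#include <stdio.h>

/* 'out' holds the text decoded so far (it plays the role of the window); each
 * triplet copies 'length' characters starting 'position' characters before the
 * end of 'out' and then appends the explicit next character. */
struct triplet { size_t position, length; char next; };

static size_t lz77_decode_triplet(char *out, size_t outlen, struct triplet t)
{
    size_t start = outlen - t.position;          /* backwards offset from the end */
    for (size_t k = 0; k < t.length; k++)
        out[outlen + k] = out[start + k];        /* copy from the window          */
    out[outlen + t.length] = t.next;             /* append the explicit character */
    return outlen + t.length + 1;                /* new length of the decoded text */
}

int main(void)
{
    /* The triplets produced for "abbabcabbbbc" in Example 7.5.1. */
    struct triplet in[] = { {0,0,'a'}, {0,0,'b'}, {1,1,'a'},
                            {2,1,'c'}, {3,2,'b'}, {2,2,'c'} };
    char out[64]; size_t len = 0;
    for (size_t i = 0; i < sizeof in / sizeof in[0]; i++)
        len = lz77_decode_triplet(out, len, in[i]);
    out[len] = '\0';
    printf("%s\n", out);                         /* prints abbabcabbbbc */
    return 0;
}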

In general, a minimum substring size (usually 3 characters) is used to avoid the substitution of very short prefixes. Moreover, the length of the sliding window is fixed. A higher value allows larger substrings to be found in the sliding window, but it also implies that a larger pointer is needed to represent the position element of the triplet. In general, the position element is represented with 12 bits (hence the sliding window has at most 4,096 bytes), and 4 bits are used for the length element. That is, 2 bytes are needed to represent both position and length.

7.5.2 LZ78

Instead of a window that contains the last processed text, the LZ78 technique [61] builds a dictionary that holds all phrases recognized in the text. This dictionary is efficiently searched via a trie data structure, in such a way that a leaf li in the trie stores a pointer i to a dictionary entry (entryi), and the path from the root of the trie to the leaf node li spells out the letters of entryi.

Encoding and decoding in LZ78 are simple. Basically, each step consists of: i) from the current position in the text, finding the longest entry in the vocabulary (entryi) that matches the following characters in the text; ii) outputting the pair (i, U), where U is the character that follows entryi in the text; and iii) appending the new phrase (entryi + U) to the dictionary. Figure 7.4 shows how the text “abbabcabbbbc” is compressed using the LZ78 technique.

Compression in LZ78 is faster than in LZ77, but decompression is slower because the decompressor needs to store the phrases. The technique is nevertheless interesting, and a variation of LZ78 called LZW is widely used, as shown next.


step   input   output   Dictionary
1      a       (0,a)    entry1 = "a"
2      b       (0,b)    entry2 = "b"
3      ba      (2,a)    entry3 = "ba"
4      bc      (2,c)    entry4 = "bc"
5      ab      (1,b)    entry5 = "ab"
6      bb      (2,b)    entry6 = "bb"
7      bc      (4,∅)    –

(The figure also shows the trie built over the dictionary entries.)

Figure 7.4: Compression of the text “abbabcabbbbc” using LZ78

LZW

LZW is a variation of LZ78 that was proposed by Welch in 1984 [54]. It is widely used: the Unix compress program and the GIF image format use this technique to achieve compression. The main difference of LZW with respect to LZ78 is that it only produces a list of pointers in the output, whereas LZ78 also outputs characters explicitly. This is avoided in LZW by initializing the vocabulary with phrases that include all the characters of the source alphabet (for example, the 128 ASCII values). Another difference is that fast searching for a phrase in the vocabulary is done through a hash table instead of a trie data structure.

The encoding process starts with the initial vocabulary. From the input, the longest phrase that exists in the vocabulary is searched for (note that, since all single characters belong to the vocabulary, some phrase is always found). Suppose that the vocabulary entry corresponding to that phrase is i; then i is output, and a new phrase, formed as the concatenation of phrasei and the next character in the input, is added to the vocabulary. The process then iterates, reading the next characters from the input and searching again for the longest phrase in the vocabulary. An example describing the whole compression process is shown in Table 7.1.
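The following C sketch illustrates this loop. It is only a toy version (the alphabet is fixed to {a, b, c} and the dictionary is searched linearly instead of through a hash table), but on the text of Table 7.1 it emits the same sequence of pointers, 0 1 1 3 2 3 4 1 2:

#include <stdio.h>

#define MAXDICT 4096

/* Each dictionary entry stores the index of its prefix phrase plus one character. */
static int  prefix[MAXDICT];         /* index of the prefix phrase (-1 = empty) */
static char lastch[MAXDICT];         /* last character of the phrase            */
static int  dictsize;

static int find_phrase(int p, char c)        /* returns the entry index, or -1  */
{
    for (int i = 0; i < dictsize; i++)
        if (prefix[i] == p && lastch[i] == c) return i;
    return -1;
}

static void lzw_compress(const char *text)
{
    dictsize = 0;
    for (char c = 'a'; c <= 'c'; c++) {      /* initial vocabulary: the alphabet */
        prefix[dictsize] = -1; lastch[dictsize] = c; dictsize++;
    }
    int cur = -1;                            /* current phrase (empty)           */
    for (size_t i = 0; text[i] != '\0'; i++) {
        int next = find_phrase(cur, text[i]);
        if (next != -1) {                    /* phrase + c is still in the dictionary */
            cur = next;
        } else {
            printf("%d ", cur);              /* output the longest known phrase  */
            if (dictsize < MAXDICT) {        /* add phrase + c to the dictionary */
                prefix[dictsize] = cur; lastch[dictsize] = text[i]; dictsize++;
            }
            cur = find_phrase(-1, text[i]);  /* restart from the single character */
        }
    }
    if (cur != -1) printf("%d\n", cur);      /* flush the pending phrase         */
}

int main(void) { lzw_compress("abbabcabbbbc"); return 0; }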


input     next character   output   Dictionary
initial   initial          initial  entry0 = "a"
initial   initial          initial  entry1 = "b"
initial   initial          initial  entry2 = "c"
a         b                0        entry3 = "ab"
b         b                1        entry4 = "bb"
b         a                1        entry5 = "ba"
ab        c                3        entry6 = "abc"
c         a                2        entry7 = "ca"
ab        b                3        entry8 = "abb"
bb        b                4        entry9 = "bbb"
b         c                1        entry10 = "bc"
c         ∅                2        –

Table 7.1: Compression of “abbabcabbbbc” (Σ = {a, b, c}) using LZW

Since a maximum dictionary size has to be defined, some policy has to be chosen for when all the available entries in the vocabulary are used up:

• Do not permit more entries and continue compression with the current vocabulary.

• Use a least-recently-used policy to discard entries when a new entry has to be added.

• Delete the whole dictionary and continue with an empty one (which contains only the initial entries, that is, the symbols of the alphabet).

• Monitor the compression ratio, and drop the dictionary when compression degrades.

Two Ziv-Lempel based techniques are widely used: gzip (based on LZ77) and compress (based on LZW). As shown in [56], gzip achieves a better compression ratio (about 35-40%, against about 40% for compress). However, compress is a bit faster in compression. Gzip was used in our tests as a representative of dictionary-based techniques; we chose gzip instead of compress because of its better compression ratio.

7.6 Summary

In this Chapter, the state of the art in dynamic text compression techniques has been revisited. Special attention was paid to statistical methods such as the Huffman-based ones. Arithmetic coding, as well as a symbolwise model such as the Prediction by Partial Matching technique, were also described. Finally, a brief review of Ziv-Lempel based compressors, the most commonly used compression strategy, was presented.


8 Dynamic Byte-oriented word-based Huffman code

8.1 Chapter overview

In this Chapter, a dynamic byte-oriented word-based Huffman code is presented. This is a minor contribution of this thesis. Even though we developed and implemented this technique in order to compare it against the Dynamic End-Tagged Dense Code and the Dynamic (s, c)-Dense Code, presented in Chapters 9 and 10 respectively, the results achieved both in compression ratio and in compression/decompression speed make it a competitive adaptive technique, and hence a valuable contribution. Therefore it is presented in this Chapter in detail.

We start with a brief introduction that explains the motivation for developing a word-based byte-oriented Huffman code. Next, some properties and definitions about the code are given. In Section 8.4 the technique is described. Later, the data structures used and the update algorithm that maintains a well-formed Huffman tree dynamically are shown. Finally, some empirical results compare our dynamic word-based Huffman code against the well-known character-based approach.


8.2 Introduction

The classic dynamic Huffman codes reviewed in Chapter 7 (such as the FGK algorithm) are character- rather than word-oriented, and thus their compression ratio on natural language is poor (around 60%).

In recent years, however, new Huffman-based compression techniques for natural language have appeared, based on the idea of taking the words, not the characters, as the source symbols to be compressed [34]. Since in natural language texts the frequency distribution of words is much more biased than that of characters, the gain in compression is enormous, achieving compression ratios around 25%-30%. Additionally, since in Information Retrieval (IR) words are the atoms searched for, these compression schemes are well suited to IR tasks. Word-based Huffman variants focused on fast retrieval are presented in [49], where a byte- rather than bit-oriented coding alphabet speeds up decompression and search.

Two-pass codes, unfortunately, are not suitable for real-time transmission. Hence, developing an adaptive compression technique with good compression ratio for natural language texts is a relevant problem.

In this Chapter a dynamic word-based byte-oriented Huffman method is presented. It combines the advantages of the semi-static word-based techniques (compression ratios around 30%-35% on large texts) and the real-time facilities of the classic dynamic character-based bit-oriented Huffman techniques. Moreover, being byte- rather than bit-oriented, compression and decompression speeds are clearly competitive.

8.3 Word-based Dynamic Huffman Codes

We implemented a word-based byte-oriented version of algorithm FGK. This is by itself a contribution because no existing adaptive technique obtains similar compression ratio on natural language.

As the number of text words is much larger than the number of characters, several challenges arose in managing such a large vocabulary. The original FGK algorithm pays little attention to these issues because of its underlying assumption that the source alphabet is not very large.

For example, the sender must maintain a hash table that permits fast searching for a word si, in order to obtain its corresponding tree leaf and its current frequency. In a character-oriented approach, this can be simply an array indexed by character.

However, the most important difference between our word-based version and the original FGK is that we chose the code to be byte- rather than bit-oriented. Although this necessarily implies some loss in compression ratio, it gives a decisive advantage in efficiency. Recall that the algorithm complexity corresponds to the number of target symbols in the compressed text. A bit-oriented approach requires time proportional to the number of bits in the compressed text, while ours requires time proportional to the number of bytes. Hence byte-coding is almost 8 times faster.

Being byte-oriented implies that each internal node can have up to 256 children in the resulting Huffman tree, instead of 2 as in a binary tree. This required extending the FGK algorithm in several aspects. In particular, some parent/child information that is implicit under an appropriate node numbering must be made explicit when the tree arity exceeds 2, so each node must store pointers to its first and last child. Also, the process of restructuring the tree each time an input symbol is processed is more complex in general.

The sibling property (see Property 7.3.1) can be modified for its use in a byte-oriented Huffman code as follows:

Property 8.3.1 A 2^b-ary code tree has the sibling property if each node (except the root and the zeroNode and its siblings) has 2^b − 1 siblings, and if all nodes can be listed in decreasing frequency order with each node adjacent to its siblings.

Note that the zeroNode is a special zero-frequency node that is used in dynamic codes to introduce those symbols that have not appeared yet.

Intuitively, it can be seen that in a well-formed 2^b-ary Huffman tree all internal nodes have 2^b children. However, the parent of the zeroNode can have less than 2^b children. In fact, if n is the number of symbols (leaves) in a 2^b-ary Huffman tree (including the zeroNode), the parent of the zeroNode has exactly R child nodes, where

    R = 1 + ((n − 2^b) mod (2^b − 1))   if (n − 2^b) mod (2^b − 1) > 0
    R = 2^b                             otherwise

Note that the zeroNode and its R − 1 siblings are the least frequent leaf nodes of the tree.
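As a small sanity check of the rule above (reading the second case as R = 2^b, which is what the 4-ary trees of Figure 8.1 show), a C helper could be:

#include <stdio.h>

/* Number of children of the zeroNode's parent in a 2^b-ary Huffman tree with
 * n leaves (zeroNode included); the second case is taken as 2^b. */
static int children_of_zero_parent(int n, int b)
{
    int arity = 1 << b;                   /* 2^b                       */
    int r = (n - arity) % (arity - 1);    /* (n - 2^b) mod (2^b - 1)   */
    return (r > 0) ? 1 + r : arity;
}

int main(void)
{
    /* With b = 2 (4-ary) this prints 2, 3 and 4 for n = 5, 6 and 7, exactly
     * the number of children of the zeroNode's parent in Figure 8.1. */
    for (int n = 5; n <= 7; n++)
        printf("n = %d -> R = %d\n", n, children_of_zero_parent(n, 2));
    return 0;
}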

Definition 8.3.1 To obtain a dynamic word-based byte-oriented Huffman code it is only necessary to maintain two conditions:

- All nodes of the Huffman tree (both internal and leaf nodes) remain sorted by frequency. In this ranking the root is the most frequent node, and the zeroNode is the least weighted node.

- All the siblings of a node remain adjacent.


To maintain the conditions required by Definition 8.3.1, a node-numbering method was adopted. It ranks nodes in the following way:

- Nodes in the top levels of the tree precede nodes in the bottom levels.

- Within a given level of the tree, nodes are ranked left to right. Therefore the left-most node of a level precedes all the other nodes in that level.

8.4 Method Overview

The compressor/sender and decompressor/receiver algorithms use the general guidelines shown in Figure 7.1. Considering that general scheme, the main issue that has to be taken into account is how to maintain the Codebook up to date, after insertions of new words, or when the frequency of a word is increased.


In this case, the Codebook is a 256-ary Huffman tree that stores all the words processed up to a given moment of the compression/decompression process.

This Huffman tree is used by both the encoder and the decoder to encode a word and to decode a codeword. In the encoding process, the codeword of a word is given by the path from the root of the tree to the leaf node where that word is placed. The decoding of a codeword starts at the root of the tree, and each byte value in the codeword specifies which child to follow in the next step of a top-down traversal of the tree. A word is recognized each time a leaf node is reached.
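The decoding traversal can be pictured with the following C sketch. The pointer-based node layout is hypothetical and used only for illustration; the structures actually used are the array-based ones described in Section 8.5:

#include <stdio.h>

/* Hypothetical pointer-based 256-ary tree node, for illustration only. */
struct node {
    int          is_leaf;
    const char  *word;           /* set when is_leaf   */
    struct node *child[256];     /* set when !is_leaf  */
};

/* Decode one codeword: each byte selects a child in a top-down traversal;
 * a word is recognized when a leaf is reached. */
static const char *decode_codeword(const struct node *root,
                                   const unsigned char *code, int len)
{
    const struct node *n = root;
    for (int i = 0; i < len && !n->is_leaf; i++)
        n = n->child[code[i]];
    return n->is_leaf ? n->word : NULL;
}

int main(void)
{
    struct node leaf = { 1, "rose", { 0 } };
    struct node root = { 0, NULL,   { 0 } };
    root.child[3] = &leaf;                             /* codeword = single byte 3 */
    unsigned char code[] = { 3 };
    printf("%s\n", decode_codeword(&root, code, 1));   /* prints "rose" */
    return 0;
}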

Once the encoding/decoding of a word si has been performed, the tree has to be updated. If si was already in the tree, the frequency of the leaf node q representing si has to be increased, taking into account that the order of the nodes by frequency must be maintained. To achieve this, the first node t in the node numbering such that frequencyq = frequencyt is located. The next step is to interchange nodes q and t. After that, q’s frequency can be increased (without altering the order of the nodes). Finally, the increase of frequency has to be propagated to q’s parent. The process continues until the root of the tree is reached.

If si was not in the Huffman tree a new node q representing si has to be added (with frequency = 1). Two situations can take place: i) if zeroNode has less than 255 siblings then q is added as a new sibling of the zeroNode. ii) If the zeroNode has 255 siblings then a new internal node p is created in the position of zeroNode and both q and zeroNode are set as children of p. In both cases the increase of frequency has to be promoted to the ancestors of q.

Figure 8.1 presents the way a 4-ary Huffman tree is maintained via the adaptive process when words “aaaaaaaaabbbbbbbbcddddddcccccceeefg” are transmitted.

The first box shows the initial state of the Huffman tree. The only nodes in the tree are the zeroNode and the root node. In step 2, a new node ’a’ is inserted into the tree as a sibling of the zeroNode (as the parent of the zeroNode has enough space to hold ’a’ no extra internal nodes have to be added). In step 7, node ’d’ is added. As the zeroNode has already



Figure 8.1: Dynamic process to maintain a well-formed 4-ary Huffman tree


2^b − 1 = 3 siblings, a new internal node is created. In the eighth box, increasing the frequency of node ’d’ implies an interchange with node ’c’ (needed to maintain the ordered list of nodes). In step 13, the increase of the frequency of node ’e’ causes an interchange between the parent of ’e’ and node ’c’.

8.5 Data Structures

In this Section, the data structures that support this code are presented. Two main parts arise: i) a data structure that supports the Huffman tree and enables fast lexicographical searches (given a word, we need to find the node representing it in the tree); and ii) a blocks data structure that, given a frequency value x, enables efficiently finding the first node q whose frequency is x in the ordered list of nodes.

8.5.1 Definition of the tree data structures

Let us suppose that the maximum number of distinct words to be encoded is n. That is, n is the maximum number of leaf nodes of the Huffman tree, and the number of internal nodes is at most ⌈n/255⌉. Let us call N = n + ⌈n/255⌉ the maximum number of nodes in the tree.

The Huffman tree is held through three different data structures: i) a node index that enables the management of both leaves and internal nodes independently of their type, ii) an internal nodes data structure, and iii) a hash table that holds the words of the vocabulary (and hence the leaf nodes of the tree). Two more variables, zeroNode and maxInternalNode, hold respectively the first free position in the node index and the first free position in the internal nodes data structure. Figure 8.2 presents all the data structures and their relationships.


Node Index

As shown, this structure keeps up to N nodes (both internal and leaf nodes). It consists of four arrays, all of them of N elements:

• nodeType[q] = ’I’, 1 ≤ q ≤ N, iff the qth most frequent node of the Huffman tree is an internal node. If it is a leaf node then nodeType[q] = ’L’.

• freq[q] = w, 1 ≤ q ≤ N. Depending on the value of nodeType[q]:

  - If nodeType[q] = ’L’ then freq[q] = w indicates that the number of occurrences of the qth node in the tree is w.
  - If nodeType[q] = ’I’ then freq[q] = w represents the sum of the frequencies of all the leaf nodes located in the subtree whose root is the qth node in the tree.

• parent[q] = j, 1 ≤ q ≤ N and 1 ≤ j ≤ ⌈n/255⌉, iff the jth internal node is the parent of node q. In the case of the root of the tree, that is, the first node in the node index, parent[1] = 1.

• relPos[q] = j, 1 ≤ q ≤ N. Depending on the value of nodeType[q]:

  - If nodeType[q] = ’I’ then relPos[q] = j, 1 ≤ j ≤ ⌈n/255⌉, means that the qth node of the tree corresponds to the internal node stored in the jth position of the internal nodes data structure.
  - If nodeType[q] = ’L’ then relPos[q] = j, 1 ≤ j ≤ nextPrime(2n), means that the qth node of the tree corresponds to the leaf node stored in the jth position of the hash table.

Internal nodes data structure

It stores the data that the algorithms need for internal nodes. As shown, the maximum number of internal nodes is ⌈n/255⌉. Three vectors are used:

• minChild[i] = q, 1 ≤ i ≤ ⌈n/255⌉, 1 ≤ q ≤ N, iff the first child of internal node i is held in the qth position of the node index.


• maxChild[i] = q, 1 ≤ i ≤ ⌈n/255⌉, 1 ≤ q ≤ N, iff the last child (the least frequent one) of internal node i is located in the qth position of the node index. Note that it always holds that maxChild[i] − minChild[i] = 255 (except if i is the parent of the zeroNode; in that case the number of children of i can be smaller than 256).

• inodePos[i] = q, 1 ≤ i ≤ ⌈n/255⌉, 1 ≤ q ≤ N, means that the ith internal node is the qth node of the node index.

Hash table to hold leaf nodes

• word[q] = wi, 1 ≤ q ≤ nextPrime(2n). That is, word wi is in position q of the hash table. This happens if and only if: i) q = fhash(wi) and word[q] = Null before adding wi into the tree, or ii) q′ = fhash(wi), word[q′] ≠ wi and q = solveCollision(q′, wi).

• lnodePos[i] = q, 1 ≤ i ≤ nextPrime(2n), 1 ≤ q ≤ N, means that the word in position i of the hash table corresponds to the qth node of the node index.

Figure 8.2 shows how the tree data structures are used, assuming a 4-ary tree. The right part shows the tree that is represented by the data structures previously defined. On the left, those vectors are filled so as to represent the tree of step 12 in Figure 8.1.

Note that the tree data structures just defined are enough to keep all the nodes of the Huffman tree sorted by frequency. In fact, the first position of the node index stores the root of the tree (the most frequent node), the second position holds the most frequent of the remaining nodes, and so on. However, with only those data structures it is expensive to find the first node of a given weight in the node index, as a sequential search would be needed (remember that finding such a first node is needed, at least once, when updating the tree). In order to improve performance, the following structure is also used.



Figure 8.2: Use of the data structure to represent a Huffman tree

8.5.2 List of blocks

Following the idea pointed out in [29, 52], nodes with the same weight are joined into blocks, and a list of blocks is maintained sorted in decreasing order of weight. We define blockj as the block that groups all nodes whose number of occurrences is j. We also define topj and lastj respectively as the first and the last nodes in the node index that are members of blockj.

The main advantage of the list of blocks is that it can be kept sorted in constant time. Let us suppose that a word wi, which already belongs to block bj, is input. The number of occurrences of the leaf node that represents wi needs to be increased (if the word was not in the tree yet, it is added at the end of the tree by using the zeroNode value).

This is done by moving wi from block bj to block bj+1, which basically implies two operations:

• Interchange wi with topj, where topj is, by definition, the first node in the node index whose frequency is j; that is, freq[topj] = j and j ≥ freq[x] for every node x > topj.

• Set wi as lastj+1.



Figure 8.3: Increasing the frequency of word e

Figure 8.3 describes the process of moving a given node e from block b4 to block b5. Figure 8.3(B) shows how e is interchanged with d, the top of its block b4. Finally, in Figure 8.3(C), e is set as the last node of b5.

When a node wi is processed, several situations can take place depending on the state of the tree and on the state of the sorted list of blocks:

A. wi was not in the tree. In this case, wi is added to the tree and set as the last node of b1; see Figure 8.4(A). To add wi to the tree, both the node index and the hash table are updated in the following way: i) if the zeroNode has less than 255 siblings then wi is added as a new sibling of the zeroNode; ii) if the zeroNode already has 255 siblings then a new internal node q is created (in the position of the zeroNode) and both wi and the zeroNode are set as children of q.

B. wi is the unique node in bj and bj+1 exists. Since wi is the only node in bj, when wi passes to block bj+1, bj becomes empty and has to be removed from the list of blocks. This scenario is shown in Figure 8.4(B).

C. wi belongs to bj and bj+1 does not exist. If there are two or more nodes in bj, then block bj+1 has to be added to the list of blocks (see Figure 8.4(C)).

D. wi is the unique node of bj and bj+1 does not exist. Therefore bj is simply turned into block bj+1, and no insertions or deletions of blocks are needed, as shown in Figure 8.4(D).

Figure 8.4: Distinct situations when increasing the frequency of a node

Several arrays and a variable called availableBlock define the list of blocks data structure:

• nodeInBlock[q] = i, 1 ≤ q ≤ N, if and only if the node ranked q in the node index belongs to block i.

• topBlock[q] = i, 1 ≤ q ≤ N, if and only if the node ranked i in the node index is the top of block bq.

• lastBlock[q] = i, 1 ≤ q ≤ N, if and only if the node ranked i in the node index is the last node of block bq.

• nextBlock[q] = nb, 1 ≤ q ≤ N, if and only if the block bnb is the next block of bq in the list of blocks.

• previousBlock[q] = nb, 1 ≤ q ≤ N, if and only if the block bnb is the previous block of bq in the list of blocks.

• freq[q] = w, 1 ≤ q ≤ N, if and only if the frequency of all words in block bq is w. Note that this vector is used instead of the one described in the node index data structure.

• availableBlock indicates the first block that is unused. All unused blocks are linked together in a list using nextBlock array, and availableBlock is the header of that list or 1 if the available blocks list is empty.

8.6 Huffman tree update algorithm

The compressor/sender and decompressor/receiver algorithms use the general guidelines shown in Figure 7.1.

The update algorithm is presented next. It assumes that q is the rank in the node index of the node whose frequency has to be increased (or q = zeroNode if a new word is being added to the tree).


update(q)
(1)  findNode(q);                          // sets q as the node to be incremented
(2)  while q ≠ 0 do                        // bottom-up traversal of the tree
(3)      top = topBlock[nodeInBlock[q]];
(4)      if q ≠ top then
(5)          if nodeType[q] = 'L' then             // q is a leaf node
(6)              if nodeType[top] = 'L' then
(7)                  interchangeNodesLL(q, top);
(8)              else interchangeNodesIL(q, top);
(9)          else                                  // q is an internal node
(10)             if nodeType[top] = 'L' then
(11)                 interchangeNodesIL(top, q);
(12)             else interchangeNodesII(top, q);
(13)         q = top;
(14)     increaseBlock(q);
(15) frequency[0] = frequency[0] + 1;      // root's frequency

Interchanging two nodes depends on their types. The next three procedures interchange two leaves, a leaf and an internal node, and two internal nodes, respectively.

interchangeNodesLL(leafi, leafj)
(1) lNodePos[relPos[leafi]] ↔ lNodePos[relPos[leafj]];
(2) relPos[leafi] ↔ relPos[leafj];

interchangeNodesIL(leaf, internal)
(1) lNodePos[relPos[leaf]] ↔ iNodePos[relPos[internal]];
(2) relPos[leaf] ↔ relPos[internal];
(3) nodeType[leaf] ↔ nodeType[internal];

interchangeNodesII(i, j)
(1) iNodePos[relPos[i]] ↔ iNodePos[relPos[j]];
(2) relPos[i] ↔ relPos[j];

Next, the procedure findNode() is shown. It sets the current node q as the node whose frequency must be incremented, adding it to the tree first if it is a new word.


findNode(q)
(1)  bq = nodeInBlock[q];
(2)  if q = zeroNode then
(3)      pq = parent[q];
(4)      if (maxChild[pq] − minChild[pq]) < 255 then   // pq has less than 256 children, so it has space to hold q
(5)          nbq = nextBlock[bq];
(6)          if frequency[nbq] = 1 then                // block "1" already exists
(7)              nodeInBlock[q] = 1;
(8)              nodeInBlock[q + 1] = 0;
(9)              lastBlock[1] = q;
(10)             lastBlock[0] = q + 1;
(11)             topBlock[0] = q + 1;
(12)         else                                      // block "1" does not exist yet, so it is created
(13)             bqI = availableBlock;
(14)             availableBlock = nextBlock[availableBlock];
(15)             nextBlock[bqI] = nextBlock[bq];
(16)             previousBlock[bqI] = previousBlock[nbq];
(17)             nextBlock[bq] = bqI;
(18)             previousBlock[nbq] = bqI;
(19)             frequency[bqI] = 1;
(20)             nodeInBlock[q] = 1;
(21)             nodeInBlock[q + 1] = 0;
(22)             topBlock[bqI] = q;
(23)             lastBlock[bqI] = q;
(24)             topBlock[bq] = q + 1;
(25)             lastBlock[bq] = q + 1;
(26)         // q is added to the tree
(27)         addr = fhash(si);
(28)         word[addr] = si;
(29)         lNodePos[addr] = q;
(30)         relPos[q] = addr;
(31)         nodeType[q] = 'L';
(32)         parent[q] = pq;
(33)         zeroNode = zeroNode + 1;
(34)         nodeType[zeroNode] = 'L';
(35)         parent[zeroNode] = pq;
(36)         intNodes[pq].maxChild = intNodes[pq].maxChild + 1;
(37)     else                                          // a new internal node has to be created in position zeroNode
(38)         lastIntNode = lastIntNode + 1;
(39)         npq = lastIntNode;
(40)         relPos[q] = npq;
(41)         nodeType[q] = 'I';
(42)         parent[q] = pq;
(43)         minChild[npq] = q + 1;
(44)         maxChild[npq] = q + 2;
(45)         iNodePos[npq] = q;
(46)         nbq = nextBlock[bq];
(47)         if frequency[nbq] = 1 then                // block "1" already exists
(48)             nodeInBlock[q] = 1;
(49)             nodeInBlock[q + 1] = 1;
(50)             nodeInBlock[q + 2] = 0;
(51)             lastBlock[1] = q + 1;
(52)             lastBlock[0] = q + 2;
(53)             topBlock[0] = q + 2;
(54)         else                                      // block "1" does not exist yet
(55)             bqI = availableBlock;
(56)             availableBlock = nextBlock[availableBlock];
(57)             nextBlock[bqI] = nextBlock[bq];
(58)             previousBlock[bqI] = previousBlock[nbq];
(59)             nextBlock[bq] = bqI;
(60)             previousBlock[nbq] = bqI;
(61)             frequency[bqI] = 1;
(62)             nodeInBlock[q] = 1;
(63)             nodeInBlock[q + 1] = 1;
(64)             nodeInBlock[q + 2] = 0;
(65)             topBlock[bqI] = q;
(66)             lastBlock[bqI] = q + 1;
(67)             topBlock[bq] = q + 2;
(68)             lastBlock[bq] = q + 2;
(69)         q = q + 1;
(70)         addr = fhash(currentWord);
(71)         word[addr] = currentWord;
(72)         lNodePos[addr] = q;
(73)         relPos[q] = addr;
(74)         nodeType[q] = 'L';
(75)         parent[q] = npq;
(76)         zeroNode = zeroNode + 2;
(77)         nodeType[zeroNode] = 'L';
(78)         parent[zeroNode] = npq;
(79)         q = intNodes[pq].nodePos;
(80) else    // q is not the zeroNode, so it is already the node to be increased

Note that the function fhash(currentWord) used in lines 27 and 70 returns the position inside the hash table where currentWord should be placed. Note also that in the case of the decompressor, which only needs an array of words rather than a hash table, these lines should be replaced by the following two instructions: addr = nextFreeWord; nextFreeWord = nextFreeWord + 1; where nextFreeWord stores the next free position in the words vector.

The last procedure used in update() is increaseBlock(q). This procedure is called in each step of the bottom-up traversal of the Huffman tree. It increases the frequency of node q and then sets q to point to its parent, one level upward in the tree.

increaseBlock(q)
(1)  bq = nodeInBlock[q];
(2)  nbq = nextBlock[bq];
(3)  if frequency[nbq] == (frequency[bq] + 1) then
(4)      nodeInBlock[q] = nbq;
(5)      lastBlock[nbq] = q;
(6)      if lastBlock[bq] == q then        // q's old block disappears
(7)          previousBlock[nbq] = previousBlock[bq];
(8)          nextBlock[previousBlock[bq]] = nbq;
(9)          nextBlock[bq] = availableBlock;
(10)         availableBlock = bq;
(11)     else topBlock[bq] = q + 1;
(12) else                                  // next block does not exist
(13)     if lastBlock[bq] == q then
(14)         frequency[bq] = frequency[bq] + 1;
(15)     else
(16)         b = availableBlock;
(17)         availableBlock = nextBlock[availableBlock];
(18)         nextBlock[bq] = b;
(19)         previousBlock[b] = bq;
(20)         nextBlock[b] = nbq;
(21)         previousBlock[nbq] = b;
(22)         topBlock[bq] = q + 1;
(23)         topBlock[b] = q;
(24)         lastBlock[b] = q;
(25)         frequency[b] = frequency[bq] + 1;
(26)         nodeInBlock[q] = b;
(27) q = iNodePos[parent[q]];


8.7 Empirical Results: Character- versus word-oriented Huffman

We compressed the real texts used in Section 10.5 to test the compression ratio and time performance of the character-based FGK algorithm (using the compact command) against the dynamic word-based byte-oriented Huffman code described in this Chapter.

Table 8.1 compares the compression ratio and compression speed of both techniques. As expected, being a word-based technique, our code reduces the compression ratio of the character-based dynamic Huffman technique to about one half. Compression speed is also increased: the dynamic word-based byte-oriented Huffman code (DynPH) reduces the compression time of the FGK algorithm by a factor of about 6.

CORPUS     ORIGINAL SIZE     DynPH                                 FGK algorithm
           (bytes)           bytes         ratio %   time (s)      bytes         ratio %   time (s)
CALGARY        2,131,045         991,911   46.546      0.520         1,315,774   61.743      3.270
FT91          14,749,355       5,123,739   34.739      3.428         9,109,215   61.760     20.020
CR            51,085,545      15,888,830   31.102     11.450        31,398,726   61.463     71.010
FT92         175,449,235      56,185,629   32.024     41.330       108,420,660   61.796    259.125
ZIFF         185,220,215      60,928,765   32.895     44.628       155,010,899   83.690    264.570
FT93         197,586,294      63,238,059   32.005     47.118       124,501,969   63.011    272.375
FT94         203,783,923      65,126,566   31.959     48.260       128,274,442   62.946    278.210
AP           250,714,271      80,964,800   32.294     60.702       155,010,899   61.828    361.120
ALL FT       591,568,807     187,586,995   31.710    143.050       370,551,380   62.639    832.134
ALL        1,080,719,883     355,005,679   32.849    268.983       678,287,443   62.763  1,523.890

Table 8.1: Comparison of word-based and character-based dynamic approaches

In Section 10.5, more empirical results are presented, comparing the dynamic word-based byte-oriented Huffman technique explained in this Chapter against several well-known compression techniques.

8.8 Conclusions

This Chapter presented our dynamic word-based byte-oriented Huffman code and the data structures that efficiently support the update of the Huffman tree at each step of both the compression and the decompression processes.


Word-based Huffman codes are known to obtain good compression. For this sake, we adapted an existing algorithm so as to handle very large sets of source words and byte-oriented output. The latter decision sacrifices some compression ratio in exchange for an 8-fold improvement in time performance. The resulting algorithm obtains a compression ratio very similar to that of its static version (only 0.06% off) and compresses about 4 megabytes per second on a standard PC.

Some empirical results comparing the new technique with the FGK algorithm, a good character-based dynamic Huffman method, were presented. Those results show that our implementation clearly improves the compression ratio and reduces the compression time achieved by the character-based FGK.

As a result, we have obtained an adaptive natural language text compressor that obtains a 30%-32% compression ratio on large texts and compresses more than 4 megabytes per second. Empirical results also show its good performance when compared against other compressors such as gzip, bzip2 and arithmetic encoding (see Section 10.5 for more details).

Having established the efficiency of this algorithm, we will use it as a competitive baseline to prove that the dynamic versions of both End-Tagged Dense Code and (s, c)-Dense Code are, like their two-pass versions, good alternatives to Huffman compression techniques, due to their competitive compression ratio, their compression and decompression speed and, mainly, the ease of their implementation.


9 Dynamic End-Tagged Dense Code

9.1 Chapter overview

This Chapter presents our first contribution to dynamic compression, a dynamic version of the End-Tagged Dense Code. First, the motivation for this new technique is explained. Next, its basis is presented in Section 9.3. The data structures needed to make it an efficient adaptive technique are explained in Section 9.4. Section 9.5 presents pseudocode for both the compressor and the decompressor processes, including the algorithms that adapt the model whenever a symbol is input. Empirical results comparing the compression ratio of the dynamic version against the semi-static approach are given in Section 9.6. Finally, some conclusions regarding this new technique are drawn.

9.2 Introduction

In Chapter 8 a good word-based byte-oriented dynamic Huffman code was presented. That code joins the good compression ratios achieved by the word-based semi-static statistical Huffman methods (in this case,


Plain Huffman method [48]) with the advantages of adaptive compression techniques in file transmission scenarios. The resulting code permits real-time transmission of data and achieves compression ratios very close to those of the semi-static version of the code.

However, this algorithm is rather complex to implement and the update of the Huffman tree is expensive. Each time a symbol si is input, it is required to perform a full bottom-up traversal of the tree to update the weights of all the ancestors of si until reaching the root (and at each level, reorganizations can happen).

In Chapter 5 we describe End-Tagged Dense Code, a statistical word-based compression technique (not using Huffman) which we first presented in [13]. End-Tagged Dense Code is shown to be a good alternative to Huffman due to several features:

• It is really simple to build since only a list of words ranked by frequency is needed by both the encoding and decoding processes.

• End-Tagged Dense Codes are faster than Huffman-based codes.

• The loss of compression ratio with respect to the Plain Huffman code is small (about 1 percentage point).

• It enables direct searching of the compressed text.

For the first three reasons it seemed interesting to develop a Dynamic End-Tagged Dense Code in this thesis. This new code is presented in the remaining sections of this Chapter; it improves the compression/decompression speed of the dynamic Huffman-based technique (about 22%-26% faster in compression and 35% in decompression), at the expense of losing a little compression ratio (about one percentage point).

9.3 Method overview

In this Section it is shown how ETDC can be made dynamic. Considering again the general scheme of Figure 7.1, the main issue is how to maintain the CodeBook up to date upon insertions of new source symbols and frequency increments.

Figure 9.1: Transmission of the message "the rose rose is beautiful beautiful" (the plain text takes 36 bytes; the data sent is C1 the, C2 rose, C2, C3 is, C4 beautiful, C4, for a total of 28 bytes, and the figure also shows the vocabulary state after each step)

In the case of ETDC, the CodeBook consists essentially of one structure that keeps the vocabulary ordered by frequency. Since we maintain such a sorted vocabulary upon insertions and frequency changes, we can encode any source symbol or decode any target symbol by using the on-the-fly encode and decode procedures explained in Section 5.4.
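For reference (Section 5.4 is not reproduced here), the following C sketch shows one usual formulation of such an on-the-fly encode/decode pair, under the assumptions that word ranks are 0-based, byte values 0-127 act as continuers and values 128-255 act as stoppers; the exact conventions of Section 5.4 may differ slightly. With these conventions, the first 128 words get 1-byte codewords, the next 128² words get 2-byte codewords, and so on:

#include <assert.h>
#include <stdio.h>

/* Sketch of on-the-fly dense coding (assumed conventions: 0-based rank i,
 * bytes 0..127 are continuers, 128..255 are stoppers). */
static int etdc_encode(unsigned int i, unsigned char *code)
{
    unsigned char buf[8];
    int k = 0;
    buf[k++] = (unsigned char)(i % 128 + 128);   /* stopper byte (last one)    */
    i /= 128;
    while (i > 0) {
        i--;                                     /* dense assignment of codes  */
        buf[k++] = (unsigned char)(i % 128);     /* continuer byte             */
        i /= 128;
    }
    for (int j = 0; j < k; j++)                  /* bytes were generated last-  */
        code[j] = buf[k - 1 - j];                /* to-first, so reverse them   */
    return k;                                    /* codeword length in bytes    */
}

static unsigned int etdc_decode(const unsigned char *code, int *bytes)
{
    unsigned int i = 0;
    int k = 0;
    while (code[k] < 128)                        /* continuer bytes            */
        i = i * 128 + code[k++] + 1;
    i = i * 128 + (code[k] - 128);               /* the stopper closes the code */
    *bytes = k + 1;
    return i;                                    /* rank of the decoded word   */
}

int main(void)
{
    unsigned char code[8];
    int len, used;
    for (unsigned int i = 0; i < 20000; i++) {   /* round-trip a few ranks     */
        len = etdc_encode(i, code);
        assert(etdc_decode(code, &used) == i && used == len);
    }
    printf("round-trip ok\n");
    return 0;
}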

Figure 9.1 shows how the compressor operates. At first (step 0), no words have been read, so new-Symbol is the only word in the vocabulary (it is implicitly placed at position 1). In step 1, a new symbol "the" is read.

Since it is not in the vocabulary, C1 (the codeword of new-Symbol) is sent, followed by "the" in plain form (bytes ’t’, ’h’, ’e’ and some terminator ’#’). Next, "the" is added to the vocabulary with frequency 1, at position 1. Implicitly, new-Symbol has been displaced to position 2. Step 2 shows the transmission of "rose", which is not yet in the vocabulary. In step 3, "rose" is read again. As it was in the vocabulary at position 2, only the codeword C2 is sent. Now, "rose" becomes more frequent than "the", so it moves upwards in the ordered vocabulary. Note that a hypothetical new occurrence of "rose" would be transmitted as C1, while it was sent as C2 in step 3. In steps 4 and 5, two more new words, "is" and "beautiful", are

transmitted and added to the vocabulary. Finally, in step 6, "beautiful" is read again, and it becomes more frequent than "is" and "the". Therefore, it moves upwards in the vocabulary by means of an exchange with "the".

The receiver works similarly to the sender. It receives a codeword Ci and decodes it, obtaining a symbol si. If Ci corresponds to new-Symbol, then the receiver knows that a new word si will be received next in plain form, so si is received and added to the vocabulary. When Ci corresponds to a word si that is already in the vocabulary, the receiver only has to increase its frequency.

The main challenge is how to efficiently maintain the vocabulary sorted. In Section 9.4 it is shown how to do it with a complexity equal to the number of source symbols transmitted. This is always lower than FGK complexity, because at least one target symbol must be transmitted for each source symbol, and usually several more.

Essentially, we must be able to identify groups of words with the same frequency in the ordered vocabulary, and to quickly promote a word to the next group when its frequency increases. Promoting a word wi with frequency f to the next frequency group f + 1 consists of:

• Sliding wi over all the words whose frequency is f. This involves two operations:

  - Locating the first-ranked word in the ordered vocabulary whose frequency is f. This word is called topf.

  - Interchanging wi with topf.

• Increasing the frequency of wi.

Example 9.3.1 Suppose that the ranked vocabulary is composed of words {’A’, ’B’, ’C’, ’D’, ’E’, ’F’, ’G’} with frequencies {3, 3, 2, 1, 1, 1, 1}, and a new ’G’ comes, so its frequency has to be increased. The frequency of ’G’ is 1, and the first word in the vocabulary with frequency 1 is ’D’. Therefore, ’D’ and ’G’ are interchanged. Now it is possible to increase the frequency of ’G’, and the vocabulary remains sorted. So the

resulting vocabulary contains words {’A’, ’B’, ’C’, ’G’, ’E’, ’F’, ’D’} and their frequencies are {3, 3, 2, 2, 1, 1, 1} respectively.
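The promotion step of the example can be sketched in a few lines of C over a plain array sorted by frequency. This is an illustration only; the actual data structures, which also deal with new words and with the hash table, are described in Section 9.4. Here top[f] is assumed to hold the position of the first word whose frequency is f:

#include <stdio.h>

#define MAXWORDS 128
#define MAXFREQ  128

static const char *word[MAXWORDS] = {"A","B","C","D","E","F","G"};
static int  freq[MAXWORDS]        = { 3,  3,  2,  1,  1,  1,  1 };
static int  top[MAXFREQ];          /* top[f]: position of the first word of frequency f */
static int  nwords = 7;

static void promote(int i)         /* the word at position i got one more occurrence    */
{
    int f = freq[i];
    int j = top[f];                /* first word with frequency f                       */
    const char *tw = word[i]; word[i] = word[j]; word[j] = tw;   /* swap the two words  */
    freq[i] = freq[j];             /* both were f; kept for clarity                     */
    freq[j] = f + 1;               /* the promoted word moves up                        */
    top[f] = j + 1;                /* group f now starts one position later             */
}

int main(void)
{
    for (int i = nwords - 1; i >= 0; i--) top[freq[i]] = i;  /* first index per frequency */
    promote(6);                                              /* a new 'G' arrives          */
    for (int i = 0; i < nwords; i++) printf("%s:%d ", word[i], freq[i]);
    printf("\n");   /* prints A:3 B:3 C:2 G:2 E:1 F:1 D:1, as in Example 9.3.1 */
    return 0;
}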

The complete data structures and the way they are used to maintain the sorted vocabulary efficiently are shown next.

9.4 Implementation Data structures

As with the word-based dynamic Huffman code (Chapter 8), the sender maintains a hash table that permits fast searching for a source word si, this time to obtain its rank i in the vocabulary vector (remember that to encode a word si only its rank i is needed), as well as its current frequency fi (which is used to rapidly find the position topfi). The receiver does not need a hash table to hold the words: it only needs a word vector, because the decoding process directly generates a rank value i that can be used to index the word vector. Therefore, it never has to search for a word lexicographically.

Let n be the vocabulary size, F the maximum frequency value expected for any word in the vocabulary, and N = max{n, F}. The data structures used by the sender and the receiver, as well as their functionality, are shown in Subsections 9.4.1 and 9.4.2 respectively.

9.4.1 Sender’s Data Structures

The following three main data structures are needed:

• A hash table of words keeps in word the source word characters, in posInVoc the rank (or position) of the word in the ordered vocabulary, and in freq its frequency. word, posInVoc and freq are defined as H-element arrays (with H = nextPrime(2n)).

• In the posInHT array, each position represents a specific word of the vocabulary. Words are not explicitly represented; instead, a pointer to the word vector of the hash table is stored. posInHT is defined as an n-element vector. This vector keeps words ordered by frequency; that is, posInHT[1] points to the most frequent word, posInHT[2] to the second most frequent word, and so on.

• Array top is defined as an N-element array. Each position implicitly represents a frequency value; that is, top[1] refers to words whose frequency is 1, top[2] to words with frequency 2, and so on. For each possible frequency, top gives the vocabulary array position of the first word with that frequency.

Two more variables, new-Symbol and maxFreq, hold respectively the first free position in the vocabulary (in posInHT) and the value "highest current word frequency + 1" (associated with the top vector).
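As a summary, the sender's structures just listed could be declared in C as follows. This is a hypothetical declaration for illustration only; field names follow the text, and H, n and N are as defined above:

/* Hypothetical C declaration of the sender's data structures
 * (H = nextPrime(2n), n = vocabulary size, N = max{n, F}). */
typedef struct {
    /* hash table: one slot per possible source word (H slots) */
    char **word;        /* word[p]     : source word stored in slot p             */
    int   *posInVoc;    /* posInVoc[p] : rank of word[p] in the sorted vocabulary */
    int   *freq;        /* freq[p]     : number of occurrences of word[p]         */
    /* sorted vocabulary (n entries) and frequency groups (N entries) */
    int   *posInHT;     /* posInHT[i]  : hash slot of the i-th most frequent word */
    int   *top;         /* top[f]      : vocabulary position of the first word
                                         whose frequency is f                     */
    int    newSymbol;   /* first free position in the vocabulary                  */
    int    maxFreq;     /* highest current word frequency + 1                     */
} dyn_etdc_sender;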

9.4.2 Receiver’s Data Structures

The following three vectors are needed:

• A word vector that keeps the source word characters. It is an n-element array.

• A freq vector that keeps the frequency of each word; that is, freq[i] = f if the number of occurrences of the word stored in word[i] is f. Like the word array, this vector keeps up to n elements.

• Array top. As in the sender, this array gives, for each possible frequency, the position of the first word with that frequency. It is also defined here as an N-element array.

Variables new-Symbol and maxFreq are also needed by the receiver.

The next Section explains the way both the sender and the receiver work, and how they use the data structures just presented to make the sending/compression and reception/decompression processes efficient.



Figure 9.2: Transmission of words C, C, D and D having transmitted ABABB earlier.

9.5 Sender’s and Receiver’s pseudocode

When the sender reads a word si, it uses the hash function to obtain its position p in the hash table, so that fhash(si) = p and therefore word[p] = si. After reading f = freq[p], it increments freq[p]. The position of si in the vocabulary array is also obtained as i = posInVoc[p] (so the sender will send code Ci). Now, word si must be promoted to its next group. For this sake, the sender finds the head of its group, j = top[f], and the corresponding word position h = posInHT[j], so as to swap words i and j in the vector posInHT. The swapping requires exchanging posInHT[j] with posInHT[i] = p, setting posInVoc[p] = j and setting posInVoc[h] = i. Once the swapping is done, we promote j to the next group by setting top[f] = j + 1.

If si turns out to be a new word, we set word[p] = si, freq[p] = 0, and posInVoc[p] = new-Symbol. Then exactly the above procedure is followed with f = 0 and i = new-Symbol. Finally, new-Symbol is also increased.

Figure 9.2 explains the way the sender works and how its data structures are used.



Figure 9.3: Reception of c3, c3, c4D# and c4 having received c1A#c2B#c1c2c2c3C# previously.

The receiver works very similarly to the sender, but it is even simpler.

Given a codeword Ci, the receiver decodes it using the decode algorithm of Page 64, obtaining the position i such that decode(Ci) = i, where word[i] contains the word si that corresponds to Ci. Therefore word[i] can be output. Next it obtains f = freq[i] and then increases freq[i]. In order to promote si in the vocabulary, j = top[f] is located. In the next step, word[i] and word[j], as well as freq[i] and freq[j], are swapped. Finally, j is promoted to the group of frequency f + 1 by setting top[f] = j + 1.

If i = new-Symbol then a new word si is being transmitted in plain form. We set word[i] = si, freq[i] = 0, and again the previous process is performed with f = 0 and i =new-Symbol. Finally new-Symbol is also increased.

Figure 9.3 gives an example of how the receiver works.

Pseudocode for both sender and receiver processes is shown in Figures 9.4 and 9.5. Notice that implementing dynamic ETDC is simpler than building dynamic word-based Huffman. In fact, our implementation of the Huffman tree update (Section 8.6) takes about 120 C source code lines, while the update procedure takes less than 20 lines in dynamic ETDC.


Sender main algorithm()
(1)  new-Symbol ← 1;
(2)  top[0] ← new-Symbol;
(3)  maxFreq ← 1;
(4)  while more words left
(5)      read si from the text;
(6)      if (si ∉ word) then
(7)          p ← fhash(si);
(8)          i ← new-Symbol;
(9)          send(encode(i));
(10)         send si in plain form;
(11)     else
(12)         i ← posInVoc[p];
(13)         send(encode(i));
(14)     update();

Sender update()
(1)  if i = new-Symbol then            // new word
(2)      word[p] ← si;
(3)      freq[p] ← 0;
(4)      posInVoc[p] ← new-Symbol;
(5)      posInHT[new-Symbol] ← p;
(6)      new-Symbol ← new-Symbol + 1;
(7)  f ← freq[p];
(8)  freq[p] ← freq[p] + 1;
(9)  j ← top[f];
(10) h ← posInHT[j];
(11) posInHT[i] ↔ posInHT[j];
(12) posInVoc[p] ← j;
(13) posInVoc[h] ← i;
(14) top[f] ← j + 1;
(15) if maxFreq = f + 1 then
(16)     top[f + 1] ← 0;
(17)     maxFreq ← maxFreq + 1;

Figure 9.4: Dynamic ETDC sender pseudocode


Receiver main algorithm()
(1)  new-Symbol ← 1;
(2)  top[0] ← new-Symbol;
(3)  maxFreq ← 1;
(4)  while more codewords remain
(5)      i ← decode(ci);
(6)      if i = new-Symbol then
(7)          receive si in plain form;
(8)          output si;
(9)      else
(10)         output word[i];
(11)     update();

Receiver update()
(1)  if i = new-Symbol then            // new word
(2)      word[i] ← si;
(3)      freq[i] ← 0;
(4)      new-Symbol ← new-Symbol + 1;
(5)  f ← freq[i];
(6)  freq[i] ← freq[i] + 1;
(7)  j ← top[f];
(8)  freq[i] ↔ freq[j];
(9)  word[i] ↔ word[j];
(10) top[f] ← j + 1;
(11) if maxFreq = f + 1 then
(12)     top[f + 1] ← 0;
(13)     maxFreq ← maxFreq + 1;

Figure 9.5: Dynamic ETDC receiver pseudocode


9.6 Empirical Results

We compressed the real texts used in Section 10.5 to test the compression ratio of the one- and two-pass versions of End-Tagged Dense Code (ETDC) and Plain Huffman (PH).

Table 9.1 compares the compression ratio of two-pass versus one-pass techniques. Columns labelled diff measure the increase, in percentage points, in the compression ratio of the dynamic codes compared to their semi-static version. The last column compares those differences between Plain Huffman and ETDC.

                                Plain Huffman                   End-Tagged Dense Code            diff
CORPUS     TEXT SIZE            2-pass    1-pass    diff        2-pass    1-pass    diff        ETDC
           (bytes)              ratio %   ratio %   PH          ratio %   ratio %   ETDC        - PH
CALGARY        2,131,045        46.238    46.546    0.308       47.397    47.730    0.332       0.024
FT91          14,749,355        34.628    34.739    0.111       35.521    35.638    0.116       0.005
CR            51,085,545        31.057    31.102    0.046       31.941    31.985    0.045      -0.001
FT92         175,449,235        32.000    32.024    0.024       32.815    32.838    0.023      -0.001
ZIFF         185,220,215        32.876    32.895    0.019       33.770    33.787    0.017      -0.002
FT93         197,586,294        31.983    32.005    0.022       32.866    32.887    0.021      -0.001
FT94         203,783,923        31.937    31.959    0.022       32.825    32.845    0.020      -0.002
AP           250,714,271        32.272    32.294    0.021       33.087    33.106    0.018      -0.003
ALL FT       591,568,807        31.696    31.710    0.014       32.527    32.537    0.011      -0.003
ALL        1,080,719,883        32.830    32.849    0.019       33.656    33.664    0.008      -0.011

Table 9.1: Compression ratios of dynamic versus semi-static techniques

Compression ratios are around 31%-34% for the larger texts. For the smaller ones, compression is poorer because the size of the vocabulary is proportionally too large with respect to the compressed text size.

From the experimental results, it can also be seen that the cost of dynamism in terms of compression ratio is negligible. The dynamic versions lose very little in compression (about 0.02 percentage points, 0.06%) compared to their semi-static versions. Moreover, in most texts (the negative values in the last column) dynamic ETDC loses even less compression than the dynamic Plain Huffman.


9.7 Conclusions

This Chapter considered the problem of providing adaptive compression for natural language text, with the combined aims of competitive compression ratios and good time performance.

We adapted End-Tagged Dense Code (ETDC), a simpler alternative to Huffman, to make it one-pass. The resulting dynamic version is much simpler than the Huffman-based one proposed in Chapter 8, because keeping a list of words sorted is much simpler than updating a Huffman tree (and fewer operations are performed).

As a result, Dynamic ETDC is 22%-26% faster, typically compressing 5.5 megabytes per second. Moreover, the compressed text is only 0.06% larger than with semi-static ETDC and 3% larger than with Huffman. Section 10.5 presents empirical results comparing the semi-static and one-pass versions of ETDC, as well as a comparison against other well-known compression techniques.

10

Dynamic (s, c)-Dense Code (UNFINISHED !!)

!! THIS CHAPTER IS NOT FINISHED !!! but the empirical results are VALID, pending improvements to Dyn-SCDC.

...I STILL HAVE TO:

• REVIEW THE preliminary IMPLEMENTATION WE MADE OF Dyn-SCDC, based on the ideas we discussed when you were here in A Coruña.

Currently I have a preliminary version and (in the few tests I ran with the new version) we improve the Dyn-SCDC times a little (especially on small texts) with respect to the times included in the empirical results section (obtained with the old version).

I want to review the implementation a bit and then take more measurements so that the averages are reliable!

• DESCRIBE HOW THAT NEW VERSION WORKS. Sections 10.3 and 10.4


10.1 Chapter overview

This Chapter presents the last contribution of this thesis: the development of a new adaptive compression technique named Dynamic (s,c)-Dense Code (D-SCDC). D-SCDC is a generalization of Dynamic End-Tagged Dense Code in which not only is the vocabulary dynamically kept sorted, but the s and c parameters can also vary during the compression/decompression processes in order to achieve better compression.

The Chapter is structured as follows. First, the motivation of this new technique is introduced. In Section 10.3, Dynamic (s, c)-Dense Code is described: the similarities between the Dynamic End-Tagged Dense Code and Dynamic (s, c)-Dense Code data structures are covered, as well as the differences that arise from the need to adapt the s and c parameters. Then, the way the s and c parameters are kept optimal is presented, and its pseudocode is given in Section 10.4. Section 10.5 shows empirical results where the three new dynamic techniques developed are compared with the two-pass versions, as well as with other well-known compressors such as gzip, bzip2 and an adaptive arithmetic compressor [38]. Finally, some conclusions are presented in Section 10.6.

10.2 Introduction

In the previous two chapters, two dynamic word-based byte-oriented techniques were defined.

The first one is based on Plain Huffman. It achieves better compression than the second one, Dynamic End-Tagged Dense Code (ETDC): the compressed text it produces is about 3% smaller (less than 1 percentage point of compression ratio). However, it is slower in compression and decompression (about 20% on large texts).

As shown in Chapter 6, (s, c)-Dense Code [14] is a compression technique that generalizes End-Tagged Dense Code and obtains compression ratios very close to those of the optimal Plain Huffman Code.

The work presented in the current Chapter involves building an adaptive version of (s, c)-Dense Code [14]. It consists of an extension to ETDC where the number of byte values that signal the end of a codeword can be adapted to optimize compression, instead of being fixed at 128 as in ETDC.

The remainder of this Chapter covers the similarities between the Dynamic ETDC and Dynamic (s, c)-Dense Code data structures, as well as the differences that arise from the need to adapt s and c. Finally, empirical results and some conclusions are presented.

10.3 Dynamic (s, c)-Dense Codes

This Section shows how to generalize Dynamic ETDC in order to build Dynamic (s, c)-Dense Code (Dynamic SCDC).

Based on Dynamic ETDC, Dynamic (s, c)-DC also uses the scheme presented in Figure 7.1. The CodeBook is essentially the array of source symbols sorted by frequency, and the on-the-fly encode and decode procedures explained in Section 6.5 are used.
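Although the on-the-fly encoder and decoder of Section 6.5 are the reference, the following sketch may help recall the idea behind them: the rank of a word is rewritten as a sequence of continuer digits (base c) followed by one final stopper digit (base s). The byte-value convention assumed here (stoppers occupying values 0 .. s−1, continuers the remaining values) and the output order are assumptions of this sketch, not necessarily the exact convention fixed in Chapter 6.

    /* Sketch of an on-the-fly (s,c)-dense encoder: rank -> codeword bytes.
     * Assumption: stopper bytes take values 0..s-1, continuers s..255, and
     * 1 <= s <= 255.  rank is 0-based; bytes are output most significant first. */
    #include <stdio.h>

    static int scdc_encode(long rank, int s, unsigned char out[], int maxlen) {
        int c = 256 - s;
        unsigned char buf[16];                         /* enough for realistic codewords */
        int k = 0;
        buf[k++] = (unsigned char)(rank % s);          /* the final stopper byte         */
        long x = rank / s;
        while (x > 0 && k < (int)sizeof buf) {         /* remaining continuer bytes      */
            x--;
            buf[k++] = (unsigned char)(s + (x % c));
            x /= c;
        }
        if (x > 0 || k > maxlen) return -1;            /* did not fit in the buffers     */
        for (int b = 0; b < k; b++)                    /* reverse into output order      */
            out[b] = buf[k - 1 - b];
        return k;                                      /* codeword length in bytes       */
    }

    int main(void) {                                   /* tiny usage example             */
        unsigned char cw[8];
        int len = scdc_encode(300, 192, cw, 8);        /* s = 192, c = 64, rank 300      */
        printf("codeword length: %d\n", len);          /* ranks 192..192*65-1 take 2 bytes */
        return 0;
    }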

The main difference with respect to Dynamic ETDC is that, in each step of the compression/decompression processes, it is mandatory not only to maintain the sorted vocabulary, but also to check whether the current s value remains optimal or whether it should change.

10.3.1 Maintaining the s and c parameters optimal

Both the encoder and the decoder start with s = 256. This s value is optimal while the first 255 words of the vocabulary are input. When word 256 arrives, s has to be decreased by 1, since a two-byte codeword is needed. From this point on, the optimal s and c values are estimated by computing the number of bytes needed to encode the last processed input word using the current s and c values, and comparing that number with the number of bytes that would be needed if the current s value had been prevs = s − 1 or nexts = s + 1 (recall that c = 256 − s).


Three variables are needed: prev, s0 and next. prev holds an estimation of the size of the compressed text if s − 1 had been used as the number of stoppers in the encoding/decoding process. Analogously, s0 and next accumulate the size of the compressed text supposing that the input words were encoded using the current s value or s + 1, respectively.

First, the three variables are initialized to zero. Each time a symbol si arrives, prev, s0 and next are increased. Let countBytes(sValue, i) be a function that computes the number of bytes needed to encode the word ranked i when s = sValue.

Therefore the three variables are increased as follows:

• prev ← prev + countBytes(s − 1, i)

• s0 ← s0 + countBytes(s, i)

• next ← next + countBytes(s + 1, i)

A change of the s value takes place when prev < s0 or when next < s0. If prev < s0, then s − 1 becomes the new s value. If next < s0 holds, then s + 1 is set as the new s value.

Each time the parameter s changes, s0, next and prev are initialized again, and the process continues. This initialization depends on the change of s that takes place.

- If s is increased then prev ← s0; and s0 ← next (next does not change).

- If s is decreased then next ← s0; and s0 ← prev (prev does not change).

The update algorithm that keeps the list of words sorted is the same as the one used for Dynamic End-Tagged Dense Code, shown in Figures 9.4 and 9.5. However, the test that indicates a possible change of the s value has to be performed once after each call to the update process.
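To make this bookkeeping concrete, the following is a small C sketch of countBytes() and of the accumulate-and-compare test described above; Figure 10.1 gives the thesis pseudocode, and the names, the long accumulators and the bound checks on s are choices of this sketch only.

    /* Sketch of the s/c adaptation of Dynamic SCDC.  prevBytes, curBytes and
     * nextBytes play the role of prev, s0 and next; they start at zero.      */
    static long prevBytes, curBytes, nextBytes;

    /* Bytes of the codeword of the word ranked ipos (0-based) when sval
     * stoppers and 256 - sval continuers are used (1 <= sval <= 255 assumed,
     * or sval = 256 with ipos < 256).                                        */
    static int countBytes(int sval, long ipos) {
        int  k    = 1;
        long last = sval;              /* ranks 0 .. sval-1 take one byte          */
        long pot  = sval;
        int  ci   = 256 - sval;
        while (last <= ipos) {         /* one more byte covers sval*ci^(k-1) ranks */
            pot  *= ci;
            last += pot;
            k++;
        }
        return k;
    }

    /* Called once per processed word; i is the rank used to encode it.
     * *s and *c (= 256 - *s) are adjusted in place when a neighbouring value
     * of s would have produced a smaller compressed text so far.             */
    static void takeIntoAccount(int *s, int *c, long i) {
        if (*s > 1)
            prevBytes += countBytes(*s - 1, i);
        curBytes += countBytes(*s, i);
        if (*s > 1 && prevBytes < curBytes) {      /* fewer stoppers look better */
            (*s)--; (*c)++;
            nextBytes = curBytes;                  /* re-seed, as in the text    */
            curBytes  = prevBytes;
        } else if (*s < 255) {
            nextBytes += countBytes(*s + 1, i);
            if (nextBytes < curBytes) {            /* more stoppers look better  */
                (*s)++; (*c)--;
                prevBytes = curBytes;
                curBytes  = nextBytes;
            }
        }
    }

In use, both the compressor and the decompressor would call takeIntoAccount(&s, &c, rank) right after each call to update(), with the rank the word had when it was encoded, so that both ends switch to a new s value at exactly the same point of the stream.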


10.4 Pseudocode for parameters check and change

Two main procedures are used to check if s and c values have to change after having processed a new input symbol. The first one is countBytes().

Given a candidate value si for the s parameter and a word ranked ipos in the vocabulary, this algorithm calculates the number of bytes needed to encode the word ranked ipos using s = si.

The second algorithm, TakeIntoAccount(), is invoked after each call to the update() procedure. It checks whether prev < s0 or next < s0 holds, and changes the s value used in compression and decompression if needed.

The pseudocode for both algorithms is shown in Figure 10.1.

10.5 Empirical Results

We tested the different compressors over several texts. As a representative of short texts, we used the Calgary corpus. We also used some large text collections from trec-2 (AP Newswire 1988 and Ziff Data 1989-1990) and from trec-4 (Congressional Record 1993, Financial Times 1991 to 1994). Finally, two larger collections, ALL FT and ALL, were used. ALL FT aggregates all the texts from the Financial Times collection. The ALL collection is composed of the Calgary corpus and all the texts from trec-2 and trec-4.

We first compressed the texts to test the compression ratio and time performance of the one- and two-pass versions of End-Tagged Dense Code (ETDC), (s, c)-Dense Code (SCDC) and Plain Huffman (PH). These results are shown in Subsection 10.5.1. Next, the corpora were also compressed with two well-known compressors, gzip and bzip2, and with a word-based arithmetic compressor [16]. Results showing compression and decompression time and compression ratio are presented in Tables 10.3, 10.4 and 10.5.

The spaceless word model [48] was used to model the separators. A separator is the text between two contiguous words, and it must be coded too. In the spaceless word model, if the separator following a word is a single whitespace, we just encode the word; otherwise both the word and the separator are encoded. Hence, the vocabulary is formed by all the different words and all the different separators, excluding the single whitespace.
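As a small illustration of the spaceless word model, the sketch below shows the encoding decision taken for each parsed word; encode_token() is a hypothetical hook standing for whichever coder (PH, ETDC or SCDC) is in use.

    /* Spaceless word model: a single blank after a word is implicit and is not
     * encoded; any other separator is encoded as a token of its own.          */
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical hook into the underlying coder; here it just prints the
     * token so that the sketch is self-contained.                             */
    static void encode_token(const char *token) {
        printf("[%s]", token);
    }

    static void spaceless_emit(const char *word, const char *separator) {
        encode_token(word);                       /* the word is always encoded */
        if (strcmp(separator, " ") != 0)          /* a single blank is implicit */
            encode_token(separator);              /* anything else is encoded   */
    }

    int main(void) {
        spaceless_emit("compression", " ");       /* only the word is encoded   */
        spaceless_emit("ratios", ", ");           /* word and separator encoded */
        spaceless_emit("speed", "\n");
        printf("\n");
        return 0;
    }

At decompression the inverse rule applies: a single blank is output after each decoded word unless the next decoded token is itself a separator.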


countBytes (si, ipos)
// Calculates the size of the encoding of the word ranked ipos
// when using s = si in the compression process
(1) k ← 1;
(2) last ← si;
(3) pot ← si;
(4) ci ← 256 − si;
(5) while last ≤ ipos
(6)     pot ← pot ∗ ci;
(7)     last ← last + pot;
(8)     k ← k + 1;
(9) return k;

TakeIntoAccount (s, i)
// Changes, if needed, the s value that will be used in
// both the compression and decompression processes
(1) prev ← prev + countBytes(s − 1, i);
(2) s0 ← s0 + countBytes(s, i);
(3) if prev < s0 then
(4)     s ← s − 1;   // s is decreased
(5)     c ← c + 1;
(6)     next ← s0;
(7)     s0 ← prev;
(8) else
(9)     next ← next + countBytes(s + 1, i);
(10)    if next < s0 then
(11)        s ← s + 1;   // s is increased
(12)        c ← c − 1;
(13)        prev ← s0;
(14)        s0 ← next;

Figure 10.1: Algorithm to change s and c parameters if needed


A dual Intel® Pentium®-III 800 MHz system with 768 MB of SDRAM-100MHz was used in our tests. It ran Debian GNU/Linux (kernel version 2.2.19). The compiler used was gcc version 3.3.3 20040429, and -O9 compiler optimizations were used. Time results measure CPU user time.

10.5.1 Semi-static vs. dynamic approach: compression ratio

Table 10.1 compares the compression ratios of the two-pass versus the one-pass versions of ETDC, SCDC and PH. Columns labeled diffPH, diffETDC and diffSCDC measure the increase, in percentage points, in the compression ratio of the dynamic codes compared with their semi-static versions. The last two columns show the differences between those increases: diffETDC − diffPH and diffSCDC − diffPH, respectively.

                            Plain Huffman                     End-Tagged Dense Code
CORPUS     TEXT SIZE        2-pass    dynamic   Increase      2-pass    dynamic   Increase
           (bytes)          ratio %   ratio %   diffPH        ratio %   ratio %   diffETDC
CALGARY    2,131,045        46.238    46.546    0.308         47.397    47.730    0.332
FT91       14,749,355       34.628    34.739    0.111         35.521    35.638    0.116
CR         51,085,545       31.057    31.102    0.046         31.941    31.985    0.045
FT92       175,449,235      32.000    32.024    0.024         32.815    32.838    0.023
ZIFF       185,220,215      32.876    32.895    0.019         33.770    33.787    0.017
FT93       197,586,294      31.983    32.005    0.022         32.866    32.887    0.021
FT94       203,783,923      31.937    31.959    0.022         32.825    32.845    0.020
AP         250,714,271      32.272    32.294    0.021         33.087    33.106    0.018
ALL FT     591,568,807      31.696    31.710    0.014         32.527    32.537    0.011
ALL        1,080,719,883    32.830    32.849    0.019         33.656    33.664    0.008

           (s, c)-Dense Code
CORPUS     2-pass    dynamic   Increase    diffETDC     diffSCDC
           ratio %   ratio %   diffSCDC    − diffPH     − diffPH
CALGARY    46.611    46.809    0.198        0.024       -0.110
FT91       34.875    34.962    0.086        0.005       -0.025
CR         31.291    31.332    0.041       -0.001       -0.005
FT92       32.218    32.237    0.020       -0.001       -0.004
ZIFF       33.062    33.078    0.016       -0.002       -0.003
FT93       32.178    32.202    0.024       -0.001        0.002
FT94       32.132    32.154    0.021       -0.002       -0.001
AP         32.542    32.557    0.015       -0.003       -0.006
ALL FT     31.839    31.849    0.010       -0.003       -0.004
ALL        33.018    33.029    0.011       -0.011       -0.008

Table 10.1: Compression ratio of dynamic versus semi-static techniques

As can be seen, a very small loss of compression occurs when converting the techniques into dynamic codes. To understand this increase in size of the dynamic versus the semi-static codes, two issues have to be considered: (i) each new word si parsed during dynamic compression is represented in the compressed text (or sent to the receiver) as a pair ⟨Cnew-Symbol, si⟩, while in two-pass compression only the word si needs to be stored/transmitted in the vocabulary; (ii) on the other hand, some low-frequency words can be encoded with shorter codewords by the dynamic techniques, since by the time they appear the vocabulary may still be small.

Compression ratios are around 31%-33% for the large texts. For the smallest one (the Calgary collection), compression is poor because the size of the vocabulary is proportionally too large with respect to the compressed text size (as expected from Heaps' law [23]). This means that proportionally too many words are transmitted in plain form. The increase of compression ratio of ETDC compared with PH is always under 1 percentage point in the larger texts.

On the other hand, the dynamic versions lose very little compression (not more than 0.05 percentage points in general) compared to their semi-static versions. This shows that the price paid for dynamism in terms of compression ratio is practically negligible. Note also that in most cases dynamic ETDC loses even less compression than dynamic Plain Huffman, while adding dynamism to SCDC produces a larger loss of compression ratio than for PH only in the FT93 corpus.

10.5.2 Comparison of Dynamic PH, Dynamic ETDC and Dynamic SCDC: compression speed and compression ratio

Table 10.2 compares the time performance and the compression ratio of our three dynamic compressors. The first three columns indicate the corpora, their size and the size of the vocabulary (number of distinct words parsed).

The last four columns (columns "Dyn ETDC Vs Dyn PH" and "Dyn SCDC Vs Dyn PH") measure the increase of compression ratio and the reduction of compression time (in percentage) of Dynamic ETDC and Dynamic SCDC with respect to Dynamic PH.

As can be seen, Dynamic ETDC loses less than 1 percentage point of compression ratio (about a 3% increase in size) with respect to Dynamic Plain Huffman in the larger texts. In exchange, it is 22%-26% faster and considerably simpler to implement.


                                          Dyn PH                   Dyn ETDC
CORPUS     TEXT SIZE        n             time (sec)  ratio %      time (sec)  ratio %
           (bytes)
CALGARY    2,131,045        30.995        0.520       46.546       0.384       47.730
FT91       14,749,355       75.681        3.428       34.739       2.488       35.638
CR         51,085,545       117.713       11.450      31.102       8.418       31.985
FT92       175,449,235      284.892       41.330      32.024       31.440      32.838
ZIFF       185,220,215      237.622       44.628      32.895       33.394      33.787
FT93       197,586,294      291.427       47.118      32.005       36.306      32.887
FT94       203,783,923      295.018       48.260      31.959       36.718      32.845
AP         250,714,271      269.141       60.702      32.294       47.048      33.106
ALL FT     591,568,807      577.352       143.050     31.710       111.068     32.537
ALL        1,080,719,883    886.190       268.983     32.849       213.068     33.664

           Dyn SCDC                  Dyn ETDC Vs Dyn PH            Dyn SCDC Vs Dyn PH
CORPUS     time (sec)  ratio %       Inc. size %  Decr. time %     Inc. size %  Decr. time %
CALGARY    0.410       46.809        2.543        22.892           0.565        17.671
FT91       2.660       34.962        2.588        22.685           0.642        17.340
CR         9.010       31.332        2.839        22.629           0.737        17.188
FT92       33.980      32.237        2.542        26.404           0.667        20.459
ZIFF       36.197      33.078        2.710        22.559           0.557        16.060
FT93       37.923      32.202        2.755        20.840           0.614        17.314
FT94       39.240      32.154        2.774        22.006           0.610        16.649
AP         50.610      32.557        2.514        22.796           0.816        16.951
ALL FT     120.073     31.849        2.609        23.796           0.438        17.617
ALL        227.747     33.029        2.481        25.927           0.548        21.134

Table 10.2: Comparison of dynamic ETDC, dynamic SCDC and dynamic PH

In the case of Dynamic SCDC, the loss of compression ratio with respect to Dynamic Plain Huffman is about 0.25 percentage points, which implies an increase in the size of the compressed text of less than 1%. Moreover, compression speed improves by about 17%.

Dynamic Plain Huffman compresses 4 megabytes per second, while Dynamic SCDC reaches 5, and Dynamic ETDC compression speed is about 5.5 megabytes per second.

10.5.3 Comparison against gzip, bzip2 and arithmetic compressors

Tables 10.3, 10.4 and 10.5 compare Dynamic PH, Dynamic SCDC and Dynamic ETDC against gzip (Ziv-Lempel family), bzip2 (Burrows-Wheeler [15] type technique) and an adaptive arithmetic compressor (arith) [38].


Experiments were run setting gzip and bzip2 parameters to both “best” (-b) and “fast” (-f) compression.

           compression ratio %
CORPUS     D-PH      D-SCDC    D-ETDC    arith     gzip -f   gzip -b   bzip2 -f  bzip2 -b
CALGARY    46.546    46.809    47.730    34.679    43.530    36.840    32.827    28.924
FT91       34.739    34.962    35.638    28.331    42.566    36.330    32.305    27.060
CR         31.102    31.332    31.985    26.301    39.506    33.176    29.507    24.142
FT92       32.024    32.237    32.838    29.817    42.585    36.377    32.369    27.088
ZIFF       32.895    33.078    33.787    26.362    39.656    32.975    29.642    25.106
FT93       32.005    32.202    32.887    27.889    40.230    34.122    30.624    25.322
FT94       31.959    32.154    32.845    27.860    40.236    34.122    30.535    25.267
AP         32.294    32.557    33.106    28.002    43.651    37.225    33.260    27.219
ALL FT     31.710    31.849    32.537    27.852    40.988    34.845    31.152    25.865
ALL        32.849    33.029    33.664    27.982    41.312    35.001    31.304    25.981

Table 10.3: Comparison of compression ratio against gzip, bzip2 and arithmetic compression

As expected, "bzip2 -b" achieves the best compression ratio. It is about 5-7 percentage points better than Dynamic PH (and hence a little more with respect to Dynamic SCDC and Dynamic ETDC). However, it is much slower than the other techniques tested, in both compression and decompression. Using the "fast" bzip2 option seems to be undesirable, since the compression ratio gets worse (it becomes closer to Dynamic PH) while compression and decompression speeds remain poor.

On the other hand, "gzip -f" achieves good compression speed at the expense of compression ratio (about 40%). Dynamic ETDC is shown to be a fast technique as well: it is able to beat "gzip -f" in compression speed (except in the ALL corpus). Dynamic SCDC is also really close to "gzip -f" in compression speed (and even sometimes better). Regarding compression ratio, Dynamic ETDC also achieves better results than "gzip -b" (except in the CALGARY and ZIFF corpora). However, gzip is clearly the fastest method in decompression. Note that "gzip -f" is slower than "gzip -b" in decompression; this is because "gzip -b" decompression operates over a smaller compressed file ("gzip -b" compresses more than "gzip -f").

Regarding decompression speed, Dynamic PH decompresses about 6 megabytes per second, while Dynamic SCDC and Dynamic ETDC reach about 8 and 9 megabytes per second, respectively. Therefore, Dynamic SCDC decompresses 35% faster than Dynamic PH, and Dynamic ETDC is 50% faster in decompression.

Hence, Dynamic ETDC is either much faster or compresses much better than gzip, and it is by far faster than bzip2. Dynamic SCDC is also a good alternative to gzip: it is almost as fast as "gzip -f" in compression, and its compression ratio is better. Dynamic PH, in turn, is a well-balanced technique: it is about 22%-26% slower than Dynamic ETDC (17% with respect to Dynamic SCDC), and its compression ratios are even more competitive.

With respect to the arithmetic compressor used (it also follows a word-based approach, but it is bit-oriented rather than byte-oriented), the advantage in compression ratio of bit-oriented word-based techniques can be noted. However, its compression and decompression speeds are about half those of Dynamic PH, and about 3 times slower than the byte-oriented techniques presented.

           compression time (sec)
CORPUS     D-PH      D-SCDC    D-ETDC    arith     gzip -f   gzip -b   bzip2 -f  bzip2 -b
CALGARY    0.498     0.410     0.384     1.030     0.360     1.095     2.180     2.660
FT91       3.218     2.660     2.488     6.347     2.720     7.065     14.380    18.200
CR         10.880    9.010     8.418     21.930    8.875     25.155    48.210    65.170
FT92       42.720    33.980    31.440    80.390    34.465    84.955    166.310   221.460
ZIFF       43.122    36.197    33.394    82.720    33.550    82.470    174.670   233.250
FT93       45.864    37.923    36.306    91.057    36.805    93.135    181.720   237.750
FT94       47.078    39.240    36.718    93.467    37.500    96.115    185.107   255.220
AP         60.940    50.610    47.048    116.983   50.330    124.775   231.785   310.620
ALL FT     145.750   120.073   91.068    274.310   117.255   293.565   558.530   718.250
ALL        288.778   227.747   213.905   509.710   188.310   532.645   996.530   1,342.430

Table 10.4: Comparison of compression speed against gzip, bzip2 and arithmetic compression

           decompression time (sec)
CORPUS     D-PH      D-SCDC    D-ETDC    arith     gzip -f   gzip -b   bzip2 -f  bzip2 -b
CALGARY    0.330     0.275     0.240     0.973     0.090     0.110     0.775     0.830
FT91       2.350     1.790     1.545     5.527     0.900     0.825     4.655     5.890
CR         7.745     5.940     5.265     18.053    3.010     2.425     15.910    19.890
FT92       30.690    21.630    19.415    65.680    8.735     7.390     57.815    71.050
ZIFF       30.440    23.405    20.690    67.120    9.070     8.020     58.790    72.340
FT93       32.780    24.498    21.935    71.233    10.040    9.345     62.565    77.860
FT94       33.550    25.320    22.213    75.925    10.845    10.020    62.795    80.370
AP         43.660    31.745    27.233    88.823    15.990    13.200    81.875    103.010
ALL FT     104.395   75.535    66.238    214.180   36.295    30.430    189.905   235.370
ALL        218.745   144.168   126.938   394.067   62.485    56.510    328.240   432.390

Table 10.5: Comparison of decompression time against gzip, bzip2 and arithmetic technique


10.6 Conclusions

In this Chapter, Dynamic ETDC was extended so that the number of stoppers and continuers used during the encoding/decoding processes can vary.

The resulting code achieves compression ratios very close to those of the Huffman-based technique presented in Chapter 8, while being faster to build.

With respect to Dynamic ETDC, Dynamic SCDC is slightly slower in compression and decompression; however, its compression ratios are better and very close to the Huffman values.

Empirical results comparing our three contributions to the state of the art in dynamic text compression against one another, as well as against well-known compressors, have been presented. As a result, we have obtained adaptive natural language text compressors that achieve compression ratios of about 30%-35%, compress more than 4 megabytes per second and decompress over 6 megabytes per second.

11

Conclusions and Future Work

Through the work contained in this thesis, four compression codes have been developed.

The first one is a semi-static code denominated (s, c)-Dense Code. It is well suited for integration into a Text Retrieval system. It achieves almost the same compression ratio as the optimal Plain Huffman Code [49, 48] and maintains the good features of other similar compression schemes such as Tagged Huffman Code:

• It is a word-based technique.

• It generates a prefix code.

• It enables the decompression of random portions of the compressed text, since a tag condition allows codeword boundaries to be distinguished inside the compressed text (see the sketch after this list).

• It speeds up searches with respect to searching the uncompressed text (about 8 times faster) and enables direct searching of the compressed text.
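As an illustration of that tag condition, the sketch below shows how a process can resynchronize at an arbitrary byte offset of an (s, c)-dense compressed text: whether a byte ends a codeword depends only on its value. The convention assumed here (stoppers occupying the byte values 0 .. s−1) is an assumption of the sketch; the actual convention is the one fixed when the code is defined.

    /* Sketch: synchronising inside an (s,c)-dense compressed stream.  A byte is
     * a "stopper" (it ends a codeword) depending only on its value, so from any
     * offset we can skip to the start of the next codeword.                    */
    #include <stddef.h>

    static int is_stopper(unsigned char b, int s) {
        return b < s;                      /* assumption: stoppers are 0..s-1   */
    }

    /* Return the offset right after the codeword that contains position pos,
     * i.e. the start of the next codeword (or len if the stream ends first).  */
    size_t next_codeword(const unsigned char *text, size_t len, size_t pos, int s) {
        while (pos < len && !is_stopper(text[pos], s))
            pos++;                         /* skip the continuers of the current codeword */
        return (pos < len) ? pos + 1 : len;/* the byte after a stopper starts a codeword  */
    }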

This encoding incorporates new advantages over the previous ones:


• Its compression ratios are better than those of Tagged Huffman Code and also of End-Tagged Dense Code.

• Encoding and decoding are faster and simpler than in Huffman-based methods.

Summarizing, an almost optimal, fast and simple compression technique was developed.

The other three techniques developed are: i) a dynamic word-based byte-oriented Huffman code, ii) a Dynamic End-Tagged Dense Code and iii) an adaptive version of (s, c)-Dense Code. These techniques are based on the semi-static word-based Plain Huffman Code, on End-Tagged Dense Code and on (s, c)-Dense Code, respectively.

They are appropriate for both file compression and network transmission. Their main contributions are the following:

• They join the real-time capabilities of the existing character-oriented adaptive Huffman techniques with the good compression ratios achieved by the statistical two-pass word-based compression techniques.

• Being byte-oriented, their compression and decompression speeds are high.

As future work, we are interested in applying (s, c)-Dense Code to indexing scenarios, for example in the implementation of block addressing indexes, and also in combining it with suffix arrays [5], in order to test (s, c)-Dense Code in compressed suffix arrays.

The dynamic techniques are directly applicable to file compression, and would also enable, for example, the transmission of compressed web pages, or even the introduction of compression in other real-time scenarios such as chat/talk services. Implementing a web-browser plug-in would therefore make it possible to compare the transmission of plain web pages against compressed ones.

Appendix A

Publications and Other Research Results Related to the Thesis

A.1 !!! MISSING !!!

A.2 Publications

A.2.1 International Conferences

A.2.2 National Conferences

A.2.3 Journals and Book Chapters

A.3 Research Stays


Appendix B

Descripción del Trabajo Realizado

B.1 !!!FALTA !!!

B.2 Introducción

B.3 Metodología Utilizada

B.4 Conclusiones y Trabajo Futuro



Bibliography

[1] N. Abramson. Information Theory and Coding. McGraw-Hill, 1963.

[2] N. Abramson. Information Theory and Coding. McGraw-Hill, New York, 1963.

[3] R. A. Baeza-Yates and G. Navarro. Faster approximate string matching. Algorithmica, 23(2):127–158, 1999.

[4] Ricardo Baeza-Yates and Gaston H. Gonnet. A new approach to text searching. Communications of the ACM, 35(10):74–82, October 1992.

[5] Ricardo A. Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., 1999.

[6] Ricardo A. Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman, May 1999.

[7] Bernhard Balkenhol and Yuri M. Shtarkov. One attempt of a compression algorithm using the BWT, 1999.

[8] T. Bell, J. Cleary, and I. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396–402, 1984.

[9] T. C. Bell, J. G. Cleary, and I. H. Witten. Text Compression. Prentice Hall, 1990.

[10] Timothy Bell, Ian H. Witten, and John G. Cleary. Modeling for text compression. ACM Comput. Surv., 21(4):557–591, 1989.


[11] Jon Louis Bentley, Daniel D. Sleator, Robert E. Tarjan, and Victor K. Wei. A locally adaptive data compression scheme. Commun. ACM, 29(4):320–330, 1986.

[12] Robert S. Boyer and J. Strother Moore. A fast string searching algorithm. Communications of the ACM, 20(10):762–772, October 1977.

[13] Nieves R. Brisaboa, Eva L. Iglesias, Gonzalo Navarro, and Jos´eR. Param´a. An efficient compression code for text databases. In 25th European Conference on IR Research, ECIR 2003; LNCS 2633, pages 468–481, Pisa, Italy, 2003.

[14] N.R. Brisaboa, A. Fari˜na,G. Navarro, and M.F. Esteller. (s,c)-dense coding: An optimized compression code for natural language text databases. In Proc. 10th International Symposium on String Processing and Information Retrieval (SPIRE 2003), LNCS 2857, pages 122–136, 2003.

[15] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, 1994.

[16] John Carpinelli, Alistair Moffat, Radford Neal, Wayne Salamonsen, Lang Stuiver, Andrew Turpin, and Ian Witten. Word, character, integer, and bit based compression using arithmetic coding. Available at http://www.cs.mu.oz.au/~alistair/arith_coder/, 1999.

[17] J. G. Cleary, W. J. Teahan, and I. H. Witten. Unbounded length contexts for PPM. In Data Compression Conference, pages 52–61, 1995.

[18] John G. Cleary and Ian H. Witten. Data compression using ADaptive coding and partial string matching. IEEE Trans. Comm., 32(4):396– 402, 1984.

[19] N. Faller. An adaptive system for data compression. In Record of the 7th Asilomar Conference on Circuits, Systems, and Computers, pages 593–597, 1973.

[20] Peter Fenwick. Block sorting text compression - final report. Technical report, April 23 1996.


[21] Philip Gage. A new algorithm for data compression. C Users Journal, 12(2):23–??, February 1994.

[22] R. G. Gallager. Variations on a theme by Huffman. IEEE Transactions on Information Theory, 24(6):668–674, 1978.

[23] H. S. Heaps. Information Retrieval: Computational and Theoretical Aspects. Academic Press, New York, 1978.

[24] Daniel S. Hirschberg and Debra A. Lelewer. Efficient decoding of prefix codes. Commun. ACM, 33(4):449–459, 1990.

[25] R. N. Horspool. Practical fast searching in strings. Software Practice and Experience, 10:501–506, 1980.

[26] Paul Glor Howard. The design and analysis of efficient lossless data compression systems. Technical Report CS-93-28, 1993.

[27] D. A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the Institute of Radio Engineers, 40(9):1098–1101, September 1952.

[28] D. E. Knuth, J. H. Morris, and V. R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6:323–350, 1977.

[29] Donald E. Knuth. Dynamic Huffman coding. Journal of Algorithms, 6(2):163–180, June 1985.

[30] L. G. Kraft. A device for quantizing, grouping and coding amplitude modulated pulses. Master's thesis, Department of Electrical Engineering, MIT, Cambridge, MA, 1949.

[31] Udi Manber and Sun Wu. GLIMPSE: A tool to search through entire file systems. In Proc. of the Winter 1994 USENIX Technical Conference, pages 23–32, 1994.

[32] M. Miyazaki, S. Fukamachi, M. Takeda, and T. Shinohara. Speeding up the pattern matching machine for compressed texts. Transactions of Information Processing Society of Japan, 39(9):2638–2648, 1998.

[33] A. Moffat. Implementing the PPM data compression scheme. IEEE Transactions on Communications, 38, 1990.


[34] A. Moffat. Word-based text compression. Software - Practice and Experience, 19(2):185–198, 1989.

[35] A. Moffat and J. Katajainen. In-place calculation of minimum- redundancy codes. In S.G. Akl, F. Dehne, and J.-R. Sack, editors, Proc. Workshop on Algorithms and Data Structures (WADS’95), LNCS 955, pages 393–402, 1995.

[36] A. Moffat, R. Neal, and I. H. Witten. Arithmetic coding revisited. In J. A. Storer and M. Cohn, editors, Proc. IEEE Data Compression Conference, pages 202–211, Snowbird, Utah, March 1995. IEEE Computer Society Press, Los Alamitos, California.

[37] A. Moffat and A. Turpin. Compression and Coding Algorithms. Kluwer Academic Publ., March 2002.

[38] Alistair Moffat, Radford M. Neal, and Ian H. Witten. Arithmetic coding revisited. ACM Trans. Inf. Syst., 16(3):256–294, 1998.

[39] Alistair Moffat and Andrew Turpin. On the implementation of minimum redundancy prefix codes. IEEE Transactions on Communications, 45:170–179, 1996.

[40] G. Navarro, Edleno Silva de Moura, M. Neubert, Nivio Ziviani, and Ricardo Baeza-Yates. Adding compression to block addressing inverted indexes. Information Retrieval, 3(1):49–77, 2000.

[41] G. Navarro and J. Tarhio. Boyer-Moore string matching over Ziv-Lempel compressed text. In R. Giancarlo and D. Sankoff, editors, Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, number 1848 in Lecture Notes in Computer Science, pages 166–180, Montréal, Canada, 2000. Springer-Verlag, Berlin.

[42] G. Navarro and M. Raffinot. Flexible Pattern Matching in Strings – Practical on-line search algorithms for texts and biological sequences. Cambridge University Press, 2002. ISBN 0-521-81307-7. 280 pages.

[43] J. Rautio, J. Tanninen, and J. Tarhio. String matching with stopper encoding and code splitting. In Proc. 13th Annual Symposium on Combinatorial Pattern Matching (CPM 2002), LNCS 2373, pages 42– 52, 2002.


[44] Eugene S. Schwartz and Bruce Kallick. Generating a canonical prefix encoding. Commun. ACM, 7(3):166–169, 1964.

[45] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Illinois, 1949.

[46] Y. Shibata, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa. A Boyer-Moore type algorithm for compressed pattern matching. In Proc. 11th Ann. Symp. on Combinatorial Pattern Matching (CPM’00), LNCS 1848, pages 181–194, 2000.

[47] Ayumi Shinohara, Masayuki Takeda, Shuichi Fukamachi, Takeshi Shinohara, Takuya Kida, and Yusuke Shibata. Byte pair encoding: A text compression scheme that accelerates pattern matching. April 19 1999.

[48] Edleno Silva de Moura, G. Navarro, Nivio Ziviani, and Ricardo Baeza- Yates. Fast searching on compressed text allowing errors. In W. Bruce Croft, Alistair Moffat, C. J. van Rijsbergen, Ross Wilkinson, and Justin Zobel, editors, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-98), pages 298–306, New York City, August 24–28 1998. ACM Press.

[49] Edleno Silva de Moura, G. Navarro, Nivio Ziviani, and Ricardo Baeza- Yates. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems, 18(2):113–139, April 2000.

[50] Daniel M. Sunday. A very fast substring search algorithm. Communications of the ACM, 33(8):132–142, August 1990.

[51] M. Takeda, Y. Shibata, T. Matsumoto, T. Kida, A. Shinohara, S. Fukamachi, T. Shinohara, and S. Arikawa. Speeding up string pattern matching by text compression: The dawn of a new era. Transactions of Information Processing Society of Japan, 42(3):370–384, 2001.

[52] J.S. Vitter. Design and analysis of dynamic Huffman codes. Journal of the ACM (JACM), 34(4):825–845, 1987.

[53] J.S. Vitter. Algorithm 673: dynamic Huffman coding. ACM Transactions on Mathematical Software (TOMS), 15(2):158–167, 1989.


[54] Terry A. Welch. A technique for high performance data compression. Computer, 17(6):8–20, June 1984.

[55] Ian H. Witten and T. C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085–1094, 1991.

[56] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, USA, 1999.

[57] Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Commun. ACM, 30(6):520–540, 1987.

[58] Sun Wu and Udi Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83–91, October 1992.

[59] George K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley (Reading MA), 1949.

[60] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977.

[61] Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530–536, 1978.

[62] Nivio Ziviani, Edleno Silva de Moura, G. Navarro, and Ricardo Baeza- Yates. Compression: A key for next-generation text retrieval systems. IEEE Computer, 33(11):37–44, 2000.

[63] Justin Zobel and Alistair Moffat. Adding compression to a full-text retrieval system. Software Practice and Experience, 25(8):891–903, August 1995.
