Universidade Da Coru˜Na New Compression Codes for Text

Total Page:16

File Type:pdf, Size:1020Kb

Universidade Da Coru˜Na New Compression Codes for Text Universidade da Coru˜na Departamento de Computaci´on New Compression Codes for Text Databases Tese Doutoral Doctorando: Antonio Fari˜naMart´ınez Directores: Nieves R. Brisaboa e Gonzalo Navarro A Coru˜na,Outubro de 2004 i ii Ph. D. Thesis supervised by Tese doutoral dirixida por Nieves Rodr´ıguezBrisaboa Departamento de Computaci´on Facultade de Inform´atica Universidade da Coru˜na 15071 A Coru˜na(Espa˜na) Tel: +34 981 167000 ext. 1243 Fax: +34 981 167160 [email protected] Gonzalo Navarro Centro de Investigaci´onde la Web Departamento de Ciencias de la Computaci´on Universidad de Chile Blanco Encalada 2120 Santiago (Chile) Tel: +56-2-6892736 Fax: +56-2-6895531 [email protected] iii iv Abstract This page will contain a small Abstract of the contents of the thesis v Resumo Neste p´axinavai ir un pequeno resumo dos contidos da tese en galego. vi Acknowledgements I owe a great part of this thesis to my directors for their tireless support and their help during the whole work. They always told me the way to follow. This thesis is dedicated to my family, who was always there whenever I need them. Above all, I want to dedicate it to the most combative and strong people I have ever met. Thank you mum and dad. Do neither I want myself to forget my brother and sisters: Arosa, Manu and ’Mar´ıa’,nor the latest ones: Marina e Lexo. CAMBIAR !!! CAMBIAR !!! I cannot omit my LBD colleague (so, I am not going to give names). They did easy the difficult moments and brought the light of our friendship. Thanks also for you, Luisa, for the last months we share and the following that I hope will come. And all the others I did not name, but you know you have your part in this work. This is all for you. vii Agradecementos Grande parte desta tese d´ebollela aos meus directores de tese polo seu apoio incansable, e porque sempre me serviron de apoio e me marcaron o cami˜noque deb´ıade seguir. A tese vai adicada ´ami˜nafamilia, que sempre estivo al´ıcando eu os necesitaba, pero sobre todo qu´erollelaadicar ´asd´uaspersoas m´aisloitadoras e fortes que co˜nezodesde hai tanto tempo. Gracias mam´ae pap´a. Tampouco me quero olvidar dos meus irm´ansde toda a vida: Arosa, Manu e ’Mar´ıa’,nin dos m´aisnovos: Marina e Lexo. CAMBIAR !!! CAMBIAR !!! Non me quero olvidar de ning´undos meus compa˜neirosdo LBD (as´ıque non vou dar nomes). Eles fixeron m´aisdoados os momentos dif´ıcilese deron luz propia ´anosa amizade. Gracias tam´ena ti, Luisa, polos ´ultimosmeses que compartimos, e polos que espero vir´an. E a t´odolosdemais que non citei, pero sabedes que tam´entedes a vosa parte neste traballo. Vai por v´os. viii Aos meus pais e irm´ans Aos meus sobri˜nos:Mar´ıa,Lexo e Paula ix x Contents Contents 1 Introduction 1 1.1 Text Compression ........................ 1 1.1.1 Compression to space saving .............. 1 1.1.2 Compression to file transmission ............ 4 1.2 Objectives and contributions of the thesis ........... 6 1.3 Outline .............................. 7 2 Basic concepts 9 2.1 Chapter overview ......................... 9 2.2 Concepts on Information Theory ................ 9 2.2.1 The Kraft Inequality ................... 11 2.3 Redundancy and compression .................. 12 2.4 Entrophy in Context dependent messages ........... 13 2.5 Classification of Text Compression Techniques ......... 14 2.6 Measuring the efficiency of compression techniques ...... 18 xi Contents I Semi-static Compression 21 3 Compressed Text Databases 23 3.1 Chapter overview ......................... 23 3.2 Motivation ............................ 23 3.3 Inverted indexes ......................... 24 3.4 Compression schemes for Text Databases ........... 26 3.4.1 Direct access ....................... 26 3.4.2 Direct search ....................... 27 3.5 String matching .......................... 27 3.5.1 Boyer-Moore algorithm ................. 28 3.5.2 Shift-Or algorithm .................... 30 3.6 Summary ............................. 32 4 Semi-static text compression techniques 35 4.1 Chapter overview ......................... 35 4.2 Classic Huffman Code ...................... 36 4.2.1 Building a Huffman Tree ................ 36 4.2.2 Canonical Huffman tree ................. 39 4.3 Word-Based Huffman Compression ............... 40 4.3.1 Plain Huffman and Tagged Huffman Codes ...... 42 4.4 Searching Huffman Compressed Text .............. 44 4.4.1 Searching Plain Huffman Code ............. 46 4.4.2 Searching Tagged Huffman Code ............ 47 xii Contents 4.5 Other techniques ......................... 49 4.5.1 Byte Pair Encoding ................... 49 4.5.2 Burrows-Wheeler Transform ............... 51 4.6 Summary ............................. 55 5 End-Tagged Dense Code 57 5.1 Chapter overview ......................... 57 5.2 Introduction ............................ 57 5.3 End-Tagged Dense Code ..................... 58 5.4 Encoding and Decoding algorithms ............... 61 5.4.1 Encoding algorithm ................... 61 5.4.2 Decoding algorithm ................... 63 5.5 Using ETDC to bound Plain Huffman ............. 64 5.6 Empirical Results ......................... 65 5.7 Conclusions ............................ 66 6 The new technique: (s; c)-Dense Code 69 6.1 Chapter overview ......................... 69 6.2 Introduction ............................ 69 6.3 (s; c)-Dense Code ......................... 72 6.4 Optimal s and c values ...................... 76 6.4.1 Algorithm to get the optimal s and c values ...... 79 6.5 Encoding and Decoding algorithms ............... 83 6.5.1 Encoding algorithm ................... 83 xiii Contents 6.5.2 Decoding algorithm ................... 84 6.6 Empirical Results ......................... 85 6.6.1 Compression Ratio .................... 85 6.6.2 Encoding Time ...................... 86 6.6.3 Decompression Time ................... 89 6.7 Searching (s; c)-Dense Code ................... 90 6.8 Bounding Plain Huffman with (s; c)-Dense Code ....... 91 6.9 Conclusions ............................ 91 II Adaptive Compression 93 7 Dynamic Text Compression Techniques 95 7.1 Overview ............................. 95 7.2 Introduction ............................ 95 7.3 Statistical Dynamic Codes .................... 97 7.3.1 Dynamic Huffman Codes ................ 98 7.3.2 Arithmetic Codes ..................... 100 7.4 Prediction by Partial Matching ................. 103 7.5 Dictionary techniques ...................... 106 7.5.1 LZ77 ............................ 106 7.5.2 LZ78 ............................ 108 7.6 Summary ............................. 111 8 Dynamic Byte-oriented word-based Huffman code 113 xiv Contents 8.1 Chapter overview ......................... 113 8.2 Introduction ............................ 114 8.3 Word-based Dynamic Huffman Codes ............. 114 8.4 Method Overview ......................... 116 8.5 Data Structures .......................... 119 8.5.1 Definition of the tree data structures .......... 119 8.5.2 List of blocks ....................... 122 8.6 Huffman tree update algorithm ................. 125 8.7 Empirical Results: Character- versus word-oriented Huffman 130 8.8 Conclusions ............................ 130 9 Dynamic End-Tagged Dense Code 133 9.1 Chapter overview ......................... 133 9.2 Introduction ............................ 133 9.3 Method overview ......................... 134 9.4 Implementation Data structures ................ 137 9.4.1 Sender’s Data Structures ................ 137 9.4.2 Receiver’s Data Structures ............... 138 9.5 Sender’s and Receiver’s pseudocode .............. 139 9.6 Empirical Results ......................... 143 9.7 Conclusions ............................ 144 10 Dynamic (s; c)-Dense Code SIN TERMINAR !! 145 10.1 Chapter overview ......................... 146 xv Contents 10.2 Introduction ............................ 146 10.3 Dynamic (s; c)-Dense Codes ................... 147 10.3.1 Maintaining the s and c parameters optimal ...... 147 10.4 Pseudocode for parameters check and change ......... 149 10.5 Empirical Results ......................... 149 10.5.1 Semi-static Vs dynamic approach: Compression ratio 151 10.5.2 Comparative of Dynamic PH, Dynamic ETDC and Dynamic SCDC: compression speed and compression ratio ............................ 152 10.5.3 Comparative against gzip, bzip2 and arithmetic compressors ........................ 153 10.6 Conclusions ............................ 156 11 Conclusions and Future Work 157 A Publications and Other Research Results Related to the Thesis 159 A.1 !!!FALTA !!! ............................ 159 A.2 Publications ............................ 159 A.2.1 International Conferences ................ 159 A.2.2 National Conferences ................... 159 A.2.3 Journals and Book Chapters .............. 159 A.3 Research Stays .......................... 159 B Descripci´ondel Trabajo Realizado 161 B.1 !!!FALTA !!! ............................ 161 xvi Contents B.2 Introducci´on ............................ 161 B.3 Metodolog´ıaUtilizada ...................... 161 B.4 Conclusiones y Trabajo Futuro ................. 161 Bibliography 162 xvii Contents xviii List of Tables List of Tables 4.1 Codewords assigned to each symbol in Example 4.2.1 ..... 38 4.2 Codes for an uniform distribution. ............... 45 4.3 Codes for an exponential distribution. ............. 45 5.1 Format of codeword in both Tagged Huffman and End-Tagged Dense Code ............................ 59 5.2 Code assignment in End-Tagged Dense Code ......... 60 5.3 Codes for an uniform distribution. ............... 62 5.4 Codes for an exponential distribution. ............. 62 5.5 Comparison of compression
Recommended publications
  • 16.1 Digital “Modes”
    Contents 16.1 Digital “Modes” 16.5 Networking Modes 16.1.1 Symbols, Baud, Bits and Bandwidth 16.5.1 OSI Networking Model 16.1.2 Error Detection and Correction 16.5.2 Connected and Connectionless 16.1.3 Data Representations Protocols 16.1.4 Compression Techniques 16.5.3 The Terminal Node Controller (TNC) 16.1.5 Compression vs. Encryption 16.5.4 PACTOR-I 16.2 Unstructured Digital Modes 16.5.5 PACTOR-II 16.2.1 Radioteletype (RTTY) 16.5.6 PACTOR-III 16.2.2 PSK31 16.5.7 G-TOR 16.2.3 MFSK16 16.5.8 CLOVER-II 16.2.4 DominoEX 16.5.9 CLOVER-2000 16.2.5 THROB 16.5.10 WINMOR 16.2.6 MT63 16.5.11 Packet Radio 16.2.7 Olivia 16.5.12 APRS 16.3 Fuzzy Modes 16.5.13 Winlink 2000 16.3.1 Facsimile (fax) 16.5.14 D-STAR 16.3.2 Slow-Scan TV (SSTV) 16.5.15 P25 16.3.3 Hellschreiber, Feld-Hell or Hell 16.6 Digital Mode Table 16.4 Structured Digital Modes 16.7 Glossary 16.4.1 FSK441 16.8 References and Bibliography 16.4.2 JT6M 16.4.3 JT65 16.4.4 WSPR 16.4.5 HF Digital Voice 16.4.6 ALE Chapter 16 — CD-ROM Content Supplemental Files • Table of digital mode characteristics (section 16.6) • ASCII and ITA2 code tables • Varicode tables for PSK31, MFSK16 and DominoEX • Tips for using FreeDV HF digital voice software by Mel Whitten, KØPFX Chapter 16 Digital Modes There is a broad array of digital modes to service various needs with more coming.
    [Show full text]
  • The Deep Learning Solutions on Lossless Compression Methods for Alleviating Data Load on Iot Nodes in Smart Cities
    sensors Article The Deep Learning Solutions on Lossless Compression Methods for Alleviating Data Load on IoT Nodes in Smart Cities Ammar Nasif *, Zulaiha Ali Othman and Nor Samsiah Sani Center for Artificial Intelligence Technology (CAIT), Faculty of Information Science & Technology, University Kebangsaan Malaysia, Bangi 43600, Malaysia; [email protected] (Z.A.O.); [email protected] (N.S.S.) * Correspondence: [email protected] Abstract: Networking is crucial for smart city projects nowadays, as it offers an environment where people and things are connected. This paper presents a chronology of factors on the development of smart cities, including IoT technologies as network infrastructure. Increasing IoT nodes leads to increasing data flow, which is a potential source of failure for IoT networks. The biggest challenge of IoT networks is that the IoT may have insufficient memory to handle all transaction data within the IoT network. We aim in this paper to propose a potential compression method for reducing IoT network data traffic. Therefore, we investigate various lossless compression algorithms, such as entropy or dictionary-based algorithms, and general compression methods to determine which algorithm or method adheres to the IoT specifications. Furthermore, this study conducts compression experiments using entropy (Huffman, Adaptive Huffman) and Dictionary (LZ77, LZ78) as well as five different types of datasets of the IoT data traffic. Though the above algorithms can alleviate the IoT data traffic, adaptive Huffman gave the best compression algorithm. Therefore, in this paper, Citation: Nasif, A.; Othman, Z.A.; we aim to propose a conceptual compression method for IoT data traffic by improving an adaptive Sani, N.S.
    [Show full text]
  • Digram Coding Based Lossless Data Compression Algorithm
    Computing and Informatics, Vol. 29, 2010, 741–756 ISSDC: DIGRAM CODING BASED LOSSLESS DATA COMPRESSION ALGORITHM Altan Mesut, Aydin Carus Computer Engineering Department Trakya University Ahmet Karadeniz Yerleskesi Edirne, Turkey e-mail: {altanmesut, aydinc}@trakya.edu.tr Manuscript received 18 August 2008; revised 14 January 2009 Communicated by Jacek Kitowski Abstract. In this paper, a new lossless data compression method that is based on digram coding is introduced. This data compression method uses semi-static dic- tionaries: All of the used characters and most frequently used two character blocks (digrams) in the source are found and inserted into a dictionary in the first pass, compression is performed in the second pass. This two-pass structure is repeated several times and in every iteration particular number of elements is inserted in the dictionary until the dictionary is filled. This algorithm (ISSDC: Iterative Semi- Static Digram Coding) also includes some mechanisms that can decide about total number of iterations and dictionary size whenever these values are not given by the user. Our experiments show that ISSDC is better than LZW/GIF and BPE in com- pression ratio. It is worse than DEFLATE in compression of text and binary data, but better than PNG (which uses DEFLATE compression) in lossless compression of simple images. Keywords: Lossless data compression, dictionary-based compression, semi-static dictionary, digram coding 1 INTRODUCTION Lossy and lossless data compression techniques are widely used to increase capacity of data storage devices and to transfer data faster on networks. Lossy data com- 742 A. Mesut, A. Carus pression reduces the size of the source data by permanently eliminating redundant information.
    [Show full text]
  • Quad-Byte Transformation As a Pre-Processing to Arithmetic Coding
    International Journal of Engineering Research & Technology (IJERT) ISSN: 2278-0181 Vol. 2 Issue 12, December - 2013 Quad-Byte Transformation as a Pre-processing to Arithmetic Coding Jyotika Doshi GLS Inst.of Computer Technology Opp. Law Garden, Ellisbridge Ahmedabad-380006, INDIA Savita Gandhi Dept. of Computer Science; Gujarat University Navrangpura Ahmedabad-380009, INDIA second-stage conventional compression algorithm. Abstract Thus, the intent is to use the paradigm to improve the overall compression ratio in comparison with Transformation algorithms are an interesting what could have been achieved by using only the class of data-compression techniques in which one compression algorithm. can perform reversible transformations on datasets Arithmetic coding [8, 11, 30] is the most widely to increase their susceptibility to other compression preferred efficient entropy coding technique techniques. In recent days, due to its optimal providing optimal entropy. Most of the data entropy, arithmetic coding is the most widely compression algorithms like LZ algorithms [22, 28, preferred entropy encoder in most of the 31, 32]; DMC (Dynamic Markov Compression) [2, compression methods. In this paper, we have 4]; PPM [15] and their variants such as PPMC, proposed QBT-I (Quad-Byte Transformation using PPMD and PPMD+ and others, context-tree Indexes) method to be used as a pre-processing weighting method [29], Grammar—based codes [9] stage before applying arithmetic codingIJERT IJERTand many methods of image compression, audio compression method. QBT-I is intended to and video compression transforms data first and introduce more redundancy in the data and make it then apply entropy coding in the last step. Earlier- more compressible using arithmetic coding.
    [Show full text]
  • Leveraging Word and Phrase Alignments for Multilingual Learning
    Leveraging Word and Phrase Alignments for Multilingual Learning Junjie Hu CMU-LTI-21-012 Language Technologies Institute School of Computer Science Carnegie Mellon University 5000 Forbes Avenue, Pittsburgh, PA 15123 www.lti.cs.cmu.edu Thesis Committee: Dr. Graham Neubig (Chair) Carnegie Mellon University Dr. Yulia Tsvetkov Carnegie Mellon University Dr. Zachary Lipton Carnegie Mellon University Dr. Kyunghyun Cho New York University Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies. © 2020 Junjie Hu Keywords: natural language processing, multilingual learning, cross-lingual transfer learn- ing, machine translation, domain adaptation, deep learning Dedicated to my family for their wholehearted love and support in all my endeavours. iv Abstract Recent years have witnessed impressive success in natural language processing (NLP) thanks to the advances of neural networks and the availability of large amounts of labeled data. However, many NLP systems predominately have focused on high- resource languages (e.g., English, Chinese) that have large, computationally accessi- ble collections of labeled data for training. While the achievements on high-resource languages are exciting, there are more than 6,900 languages in the world and the ma- jority of them have far fewer resources for training deep neural networks. In fact, it is often expensive, or sometimes infeasible, to collect labeled data written in all possible languages. As a result, this data scarcity issue limits the generalization of NLP systems in many multilingual scenarios. Moreover, as models may be used to process text from a wide range of domains (e.g., social media or medical articles), the data scarcity issue is further exacerbated by the domain shift between the training and test data.
    [Show full text]
  • Quad-Byte Transformation Performs
    International Journal of Engineering Research and Development e-ISSN: 2278-067X, p-ISSN: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.16-21 Quad-Byte Transformation Performs Better Than Arithmetic Coding Jyotika Doshi, Savita Gandhi GLS Institute of Computer Technology, Gujarat Technological University, Ahmedabad, INDIA Department of Computer Science, Gujarat University, Ahmedabad, INDIA [email protected] [email protected] Abstract:- Arithmetic coding (AC) is the most widely used encoding/decoding technique used with most of the data compression programs in the later stage. Quad-byte transformation using zero frequency byte symbols (QBT-Z) is a technique used to compress/decompress data. With k-pass QBT-Z, the algorithm transforms half of the possible most-frequent byte pairs in each pass except the last. In the last pass, it transforms all remaining possible quad-bytes. Authors have experimented k-pass QBT-Z and AC on 18 different types of files with varying sizes having total size of 39,796,037 bytes. It is seen that arithmetic coding has taken 17.488 seconds to compress and the compression rate is 16.77%. The compression time is 7.629, 12.1839 and 16.091 seconds; decompression time is 2.147, 2.340 and 2.432 seconds; compression rate is 17.041%, 18.906% and 19.253% with k-pass QBT-Z for 1, 2, and 3 pass respectively. Decoder of QBT-Z is very fast, so significant saving is observed in decompression with QBT-Z as compared to arithmetic decoding. Both these techniques require computing the frequencies before starting compression.
    [Show full text]
  • Variable-Rate Discrete Representation Learning
    Variable-rate discrete representation learning Sander Dieleman 1 Charlie Nash 1 Jesse Engel 2 Karen Simonyan 1 Abstract Semantically meaningful information content in perceptual signals is usually unevenly distributed. In speech signals for example, there are often many silences, and the speed of pronunciation can vary considerably. In this work, we propose e d f j ð slow autoencoders (SlowAEs) for unsupervised learning of high-level variable-rate discrete repre- sentations of sequences, and apply them to speech. We show that the resulting event-based represen- tations automatically grow or shrink depending on the density of salient information in the in- put signals, while still allowing for faithful signal reconstruction. We develop run-length Transform- ers (RLTs) for event-based representation mod- elling and use them to construct language models Figure 1. From top to bottom: (a) a waveform corresponding to one in the speech domain, which are able to gener- second of American English speech: ‘[... has p]aid off! You’re the ate grammatical and semantically coherent utter- [...]’; (b) the corresponding aligned phoneme sequence (using the ances and continuations. Samples can be found at International Phonetic Alphabet); (c) a discrete representation of the speech segment with 4 channels, learnt with the proposed slow https://vdrl.github.io/. autoencoder (SlowAE) model from audio only (without supervi- sion), which is amenable to run-length encoding; (d) a ‘barcode’ plot indicating where events (changes in the discrete representa- 1. Introduction tion) occur. Event density is correlated with meaningful signal structure, e.g. changes in phonemes. The density of semantically meaningful information in per- ceptual signals (e.g.
    [Show full text]
  • State of the Art and Future Trends in Data Reduction for High-Performance Computing
    DOI: 10.14529/jsfi200101 State of the Art and Future Trends in Data Reduction for High-Performance Computing Kira Duwe1, Jakob L¨uttgau1, Georgiana Mania2, Jannek Squar1, Anna Fuchs1, Michael Kuhn1, Eugen Betke3, Thomas Ludwig3 c The Authors 2020. This paper is published with open access at SuperFri.org Research into data reduction techniques has gained popularity in recent years as storage ca- pacity and performance become a growing concern. This survey paper provides an overview of leveraging points found in high-performance computing (HPC) systems and suitable mechanisms to reduce data volumes. We present the underlying theories and their application throughout the HPC stack and also discuss related hardware acceleration and reduction approaches. After intro- ducing relevant use-cases, an overview of modern lossless and lossy compression algorithms and their respective usage at the application and file system layer is given. In anticipation of their increasing relevance for adaptive and in situ approaches, dimensionality reduction techniques are summarized with a focus on non-linear feature extraction. Adaptive approaches and in situ com- pression algorithms and frameworks follow. The key stages and new opportunities to deduplication are covered next. An unconventional but promising method is recomputation, which is proposed at last. We conclude the survey with an outlook on future developments. Keywords: data reduction, lossless compression, lossy compression, dimensionality reduction, adaptive approaches, deduplication, in situ, recomputation, scientific data set. Introduction Breakthroughs in science are increasingly enabled by supercomputers and large scale data collection operations. Applications span the spectrum of scientific domains from fluid-dynamics in climate simulations and engineering, to particle simulations in astrophysics, quantum mechan- ics and molecular dynamics, to high-throughput computing in biology for genome sequencing and protein-folding.
    [Show full text]
  • Machine Translation Summit XVI
    Machine Translation Summit XVI http://www.mtsummit2017.org Proceedings of MT Summit XVI Vol.1 Research Track Sadao Kurohashi, Pascale Fung, Editors MT Summit XVI September 18 – 22, 2017 -- Nagoya, Aichi, Japan Proceedings of MT Summit XVI, Vol. 1: MT Research Track Sadao Kurohashi (Kyoto University) & Pascale Fung (Hong Kong University of Science and Technology), Eds. Co-hosted by International Asia-Pacific Association for Association for Graduate School of Machine Translation Machine Translation Informatics, Nagoya University http://www.eamt.org/iamt.php http://www.aamt.info http://www.is.nagoya-u.ac.jp/index_en.html ©2017 The Authors. These articles are licensed under a Creative Commons 3.0 license, no derivative works, attribution, CC-BY-ND. Research Program Co-chairs Sadao Kurohashi Kyoto University Pascale Fung Hong Kong University of Science and Technology Research Program Committee Yuki Arase Osaka University Laurent Besacier LIG Houda Bouamor Carnegie Mellon University Hailong Cao Harbin Institute of Technology Michael Carl Copenhagen Business School Daniel Cer Google Boxing Chen NRC Colin Cherry NRC Chenhui Chu Osaka University Marta R. Costa-jussa Universitat Politècnica de Catalunya Steve DeNeefe SDL Language Weaver Markus Dreyer Amazon Kevin Duh Johns Hopkins University Andreas Eisele DGT, European Commission Christian Federmann Microsoft Research Minwei Feng IBM Watson Group Mikel L. Forcada Universitat d'Alacant George Foster National Research Council Isao Goto NHK Spence Green Lilt Inc Eva Hasler SDL Yifan He Bosch Research
    [Show full text]
  • Appendix a Information Theory
    Appendix A Information Theory This appendix serves as a brief introduction to information theory, the foundation of many techniques used in data compression. The two most important terms covered here are entropy and redundancy (see also Section 2.3 for an alternative discussion of these concepts). A.1 Information Theory Concepts We intuitively know what information is. We constantly receive and send information in the form of text, sound, and images. We also feel that information is an elusive nonmathematical quantity that cannot be precisely defined, captured, or measured. The standard dictionary definitions of information are (1) knowledge derived from study, experience, or instruction; (2) knowledge of a specific event or situation; intelligence; (3) a collection of facts or data; (4) the act of informing or the condition of being informed; communication of knowledge. Imagine a person who does not know what information is. Would those definitions make it clear to them? Unlikely. The importance of information theory is that it quantifies information. It shows how to measure information, so that we can answer the question “how much information is included in this piece of data?” with a precise number! Quantifying information is based on the observation that the information content of a message is equivalent to the amount of surprise in the message. If I tell you something that you already know (for example, “you and I work here”), I haven’t given you any information. If I tell you something new (for example, “we both received a raise”), I have given you some information. If I tell you something that really surprises you (for example, “only I received a raise”), I have given you more information, regardless of the number of words I have used, and of how you feel about my information.
    [Show full text]
  • Pattern Matching in Compressed\\ Texts and Images
    Full text available at: http://dx.doi.org/10.1561/2000000038 Pattern Matching in Compressed Texts and Images Full text available at: http://dx.doi.org/10.1561/2000000038 Pattern Matching in Compressed Texts and Images Don Adjeroh West Virginia University USA [email protected] Tim Bell University of Canterbury New Zealand [email protected] Amar Mukherjee University of Central Florida USA [email protected] Boston { Delft Full text available at: http://dx.doi.org/10.1561/2000000038 Foundations and Trends R in Signal Processing Published, sold and distributed by: now Publishers Inc. PO Box 1024 Hanover, MA 02339 USA Tel. +1-781-985-4510 www.nowpublishers.com [email protected] Outside North America: now Publishers Inc. PO Box 179 2600 AD Delft The Netherlands Tel. +31-6-51115274 The preferred citation for this publication is D. Adjeroh, T. Bell and A. Mukherjee, Pattern Matching in Compressed Texts and Images, Foundations and Trends R in Signal Processing, vol 6, nos 2{3, pp 97{241, 2012 ISBN: 978-1-60198-684-9 c 2013 D. Adjeroh, T. Bell and A. Mukherjee All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, mechanical, photocopying, recording or otherwise, without prior written permission of the publishers. Photocopying. In the USA: This journal is registered at the Copyright Clearance Cen- ter, Inc., 222 Rosewood Drive, Danvers, MA 01923. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by now Publishers Inc for users registered with the Copyright Clearance Center (CCC).
    [Show full text]
  • Dictionary Based Compression Type Classification Using a CNN
    Proceedings of APSIPA Annual Summit and Conference 2019 18-21 November 2019, Lanzhou, China Dictionary based Compression Type Classification using a CNN Architecture Hyewon Song, Beom Kwon, Seongmin Lee and Sanghoon Lee Electrical and Electronic Department, Yonsei University, Seoul, South Korea E-mail: fshw3164, hsm260, lseong721, [email protected] Abstract—As digital devices which are capable of viewing of dictionary-based compression. There are various methods contents easily such as mobile phones and tablet PCs have according to how to find the patterns and represent as indices. become widespread, the number of digital crimes using these Although there are various compression types, there have been digital contents also increases. Usually, the data which can be the evidence of crimes is compressed and the header of data is a few studies to discriminate compression types by analyzing damaged to conceal the contents. Therefore, it is necessary to the characteristics of the entire bitstreams [1][2]. Therefore, in identify the characteristics of the entire bits of the compressed this paper, we propose a new Convolutional Neural Network data to discriminate the compression type not using the header (CNN) which distinguishes the dictionary-based compression of the data. In this paper, we propose a method for distin- type for compressed data. guishing 16 dictionary-based compression types. We utilize 5- layered Convolutional Neural Network (CNN) for classification of The previous discrimination methods used Support Vector compression type using Spatial Pyramid Pooling (SPP) layer. We Machine (SVM) by learning with the feature vector from the evaluate our proposed method on the Wikileaks Dataset, which is compressed data [3][4].
    [Show full text]