Source Encoding and Compression


Jukka Teuhola
Computer Science, Department of Information Technology
University of Turku
Spring 2014
Lecture notes

Table of Contents

1. Introduction
2. Coding-theoretic foundations
3. Information-theoretic foundations
4. Source coding methods
   4.1. Shannon-Fano coding
   4.2. Huffman coding
      4.2.1. Extended Huffman code
      4.2.2. Adaptive Huffman coding
      4.2.3. Canonical Huffman code
   4.3. Tunstall code
   4.4. Arithmetic coding
      4.4.1. Adaptive arithmetic coding
      4.4.2. Adaptive arithmetic coding for a binary alphabet
      4.4.3. Viewpoints to arithmetic coding
5. Predictive models for text compression
   5.1. Predictive coding based on fixed-length contexts
   5.2. Dynamic-context predictive compression
   5.3. Prediction by partial match
   5.4. Burrows-Wheeler Transform
6. Dictionary models for text compression
   6.1. LZ77 family of adaptive dictionary methods
   6.2. LZ78 family of adaptive dictionary methods
   6.3. Performance comparison
7. Introduction to Image Compression
   7.1. Lossless compression of bi-level images
   7.2. Lossless compression of grey-scale images
   7.3. Lossy image compression: JPEG

Literature (optional)

• T. C. Bell, J. G. Cleary, I. H. Witten: Text Compression, 1990.
• R. W. Hamming: Coding and Information Theory, 2nd ed., Prentice-Hall, 1986.
• K. Sayood: Introduction to Data Compression, 3rd ed., Morgan Kaufmann, 2006.
• K. Sayood: Lossless Compression Handbook, Academic Press, 2003.
• I. H. Witten, A. Moffat, T. C. Bell: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd ed., Morgan Kaufmann, 1999.
• Miscellaneous articles (mentioned in footnotes)

1. Introduction

This course is about data compression, which means reducing the size of the source data representation by decreasing the redundancy occurring in it. In practice this means choosing a suitable coding scheme for the source symbols. There are two practical motivations for compression: reduction of storage requirements, and increase of transmission speed in data communication. Actually, the former can be regarded as transmission of data ‘from now to then’.

We shall concentrate on lossless compression, where the source data must be recovered in the decompression phase exactly in its original form. The alternative is lossy compression, where it is sufficient to recover the original data approximately, within specified error bounds. Lossless compression is typical for alphabetic and other symbolic source data types, whereas lossy compression is most often applied to numerical data which result from digitally sampled continuous phenomena, such as sound and images.

There are two main fields of coding theory, namely

1. Source coding, which tries to represent the source symbols in minimal form for storage or transmission efficiency.
2. Channel coding, the purpose of which is to enhance detection and correction of transmission errors, by choosing symbol representations which are far apart from each other.

Data compression can be considered an extension of source coding. It can be divided into two phases:

1. Modelling of the information source means defining suitable units for coding, such as characters or words, and estimating the probability distribution of these units.
2. Source coding (also called statistical or entropy coding) is applied to the units, using their probabilities.

The theory of the latter phase is nowadays quite mature, and optimal methods are known. Modelling, instead, still offers challenges, because it is most often approximate, and can be made more and more precise.
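
The two phases above can be illustrated with a minimal sketch. It assumes the simplest possible model, in which the coding units are single characters and their probabilities are estimated from relative frequencies in the message itself. The function names `estimate_model` and `ideal_code_lengths` are hypothetical, chosen only for this illustration. Instead of building an actual code, the second function reports the ideal code length of -log2(p) bits per unit, a standard information-theoretic quantity that an entropy coder tries to approach.

```python
import math
from collections import Counter


def estimate_model(message):
    """Modelling phase: estimate the probability of each character
    from its relative frequency in the message (an assumed, very simple model)."""
    counts = Counter(message)
    total = len(message)
    return {symbol: count / total for symbol, count in counts.items()}


def ideal_code_lengths(model):
    """Coding phase, idealised: a unit with probability p would ideally
    be given a code of -log2(p) bits."""
    return {symbol: -math.log2(p) for symbol, p in model.items()}


message = "abracadabra"
model = estimate_model(message)
lengths = ideal_code_lengths(model)

# Average number of bits per character under this model.
average_bits = sum(model[s] * lengths[s] for s in model)

for symbol in sorted(model):
    print(symbol, round(model[symbol], 3), round(lengths[symbol], 2))
print("average bits per character:", round(average_bits, 3))
```

A more precise model (for example, one that conditions on preceding characters) would change only `estimate_model`; the coding step would stay the same, which is one way to see why the two phases can be treated separately.
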
On the other hand, there are also practical considerations in addition to minimizing the data size, such as compression and decompression speed, and also the size of the model itself.

In data transmission, the different steps can be thought of as being performed in sequence, as follows:

[Figure: Source → Source encoding → Channel encoding → Communication channel (subject to errors) → Channel decoding → Source decoding → Sink. A model is attached to both the source encoding and the source decoding steps.]

We consider source and channel coding to be independent, and concentrate on the former. Both phases can also be performed either by hardware or software. We shall discuss only the software solutions, since they are more flexible.

In compression, we usually assume two alphabets, one for the source and the other for the target. The former could be for example ASCII or Unicode, and the latter is most often binary.

There are many ways to classify the huge number of suggested compression methods. One is based on the grouping of source and target symbols in compression:

1. Fixed-to-fixed: A fixed number of source symbols are grouped and represented by a fixed number of target symbols. This is seldom applicable; an example could be reducing the ASCII codes of numbers 0-9 to 4 bits, if we know (from our model) that only numbers occur in the source.
2. Variable-to-fixed: A variable number of source symbols are grouped and represented by fixed-length codes. These methods are quite popular and many of the commercial compression packages belong to this class. A manual example is the Braille code, developed for the blind, where 2x3 arrays of dots represent characters but also some combinations of characters.
3. Fixed-to-variable: The source symbols are represented by a variable number of target symbols. A well-known example of this class is the Morse code, where each character is represented by a variable number of dots and dashes. The most frequent characters are assigned the shortest codes, for compression. The target alphabet of Morse code is not strictly binary, because inter-character and inter-word spaces also appear.
4. Variable-to-variable: A variable-size group of source symbols is represented by a variable-size code. We could, for example, make an index of all words occurring in a source text, and assign them variable-length binary codes; the shortest codes of course for the most common words.

The first two categories are often called block coding methods, where the block refers to the fixed-size result units. The best methods today (with respect to compressed size) belong to category 3, where extremely precise models of the source result in very high gains. For example, English text can typically be compressed to about 2 bits per source character. Category 4 is, of course, the most general, but it seems that it does not offer notable improvement over category 3; instead, modelling of the source becomes more complicated, with respect to both space and time. Thus, in this course we shall mainly take example methods from categories 2 and 3. The former are often also called dictionary methods, and the latter statistical methods.

Another classification of compression methods is based on the availability of the source during compression:

1. Off-line methods assume that all the source data (the whole message) is available at the start of the compression process. Thus, the model of the data can be built before the actual encoding. This is typical of storage compression trying to reduce the consumed disk space.
2. On-line methods can start the coding without seeing the whole message at once. It is possible that the tail of the message is not even generated yet. This is the normal situation in data transmission, where the sender
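
The fixed-to-fixed example from the first classification above (digits 0-9 reduced to 4 bits each) can be sketched as follows. This is only an illustration under stated assumptions: the names `pack_digits` and `unpack_digits` are hypothetical, two digits are stored per byte, and the filler value `PAD` used for an odd number of digits is a convention invented here, not something the notes prescribe.

```python
DIGITS = "0123456789"
PAD = 0xF  # hypothetical filler nibble for odd-length input; never a valid digit code


def pack_digits(text):
    """Fixed-to-fixed coding: each decimal digit becomes a 4-bit code,
    so two digits fit into one byte."""
    if any(ch not in DIGITS for ch in text):
        raise ValueError("the model assumes that only the digits 0-9 occur")
    nibbles = [DIGITS.index(ch) for ch in text]
    if len(nibbles) % 2:
        nibbles.append(PAD)  # fill the last byte when the digit count is odd
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


def unpack_digits(packed):
    """Inverse mapping: split each byte into two 4-bit codes and drop the filler."""
    nibbles = []
    for byte in packed:
        nibbles.extend((byte >> 4, byte & 0x0F))
    return "".join(DIGITS[n] for n in nibbles if n != PAD)


original = "20140315"
packed = pack_digits(original)
print(len(original), "characters ->", len(packed), "bytes")
print(unpack_digits(packed) == original)
```

The gain is fixed at one half of a byte per digit and depends entirely on the model's assumption that nothing but digits occurs, which is why the text calls this class seldom applicable.
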