Universal Lossless Data Compression Algorithms


Silesian University of Technology
Faculty of Automatic Control, Electronics and Computer Science
Institute of Computer Science

Doctor of Philosophy Dissertation
Universal lossless data compression algorithms
Sebastian Deorowicz
Supervisor: Prof. dr hab. inż. Zbigniew J. Czech
Gliwice, 2003

To My Parents

Contents

1 Preface
2 Introduction to data compression
  2.1 Preliminaries
  2.2 What is data compression?
  2.3 Lossy and lossless compression
    2.3.1 Lossy compression
    2.3.2 Lossless compression
  2.4 Definitions
  2.5 Modelling and coding
    2.5.1 Modern paradigm of data compression
    2.5.2 Modelling
    2.5.3 Entropy coding
  2.6 Classes of sources
    2.6.1 Types of data
    2.6.2 Memoryless source
    2.6.3 Piecewise stationary memoryless source
    2.6.4 Finite-state machine sources
    2.6.5 Context tree sources
  2.7 Families of universal algorithms for lossless data compression
    2.7.1 Universal compression
    2.7.2 Ziv–Lempel algorithms
    2.7.3 Prediction by partial matching algorithms
    2.7.4 Dynamic Markov coding algorithm
    2.7.5 Context tree weighting algorithm
    2.7.6 Switching method
  2.8 Specialised compression algorithms
3 Algorithms based on the Burrows–Wheeler transform
  3.1 Description of the algorithm
    3.1.1 Compression algorithm
    3.1.2 Decompression algorithm
  3.2 Discussion of the algorithm stages
    3.2.1 Original algorithm
    3.2.2 Burrows–Wheeler transform
    3.2.3 Run length encoding
    3.2.4 Second stage transforms
    3.2.5 Entropy coding
    3.2.6 Preprocessing the input sequence
4 Improved compression algorithm based on the Burrows–Wheeler transform
  4.1 Modifications of the basic version of the compression algorithm
    4.1.1 General structure of the algorithm
    4.1.2 Computing the Burrows–Wheeler transform
    4.1.3 Analysis of the output sequence of the Burrows–Wheeler transform
    4.1.4 Probability estimation for the piecewise stationary memoryless source
    4.1.5 Weighted frequency count as the algorithm's second stage
    4.1.6 Efficient probability estimation in the last stage
  4.2 How to compare data compression algorithms?
    4.2.1 Data sets
    4.2.2 Multi criteria optimisation in compression
  4.3 Experiments with the algorithm stages
    4.3.1 Burrows–Wheeler transform computation
    4.3.2 Weight functions in the weighted frequency count transform
    4.3.3 Approaches to the second stage
    4.3.4 Probability estimation
  4.4 Experimental comparison of the improved algorithm and the other algorithms
    4.4.1 Choosing the algorithms for comparison
    4.4.2 Examined algorithms
    4.4.3 Comparison procedure
    4.4.4 Experiments on the Calgary corpus
    4.4.5 Experiments on the Silesia corpus
    4.4.6 Experiments on files of different sizes and similar contents
    4.4.7 Summary of comparison results
5 Conclusions
Acknowledgements
Bibliography
Appendices
  A Silesia corpus
  B Implementation details
  C Detailed options of examined compression programs
  D Illustration of the properties of the weight functions
  E Detailed compression results for files of different sizes and similar contents
List of Symbols and Abbreviations
List of Figures
List of Tables
Index

Chapter 1
Preface

I am now going to begin my story (said the old man), so please attend.
—ANDREW LANG, The Arabian Nights Entertainments (1898)

Contemporary computers process and store huge amounts of data. Some parts of these data are redundant. Data compression is a process that reduces the data size by removing this redundant information. Why is a shorter data sequence often more suitable? The answer is simple: it reduces costs. A full-length movie of high quality could occupy a vast part of a hard disk; the compressed movie can be stored on a single CD-ROM. Large amounts of data are transmitted by telecommunication satellites. Without compression we would have to launch many more satellites than we do to transmit the same number of television programs. The capacity of Internet links is also limited, and several methods are used to reduce the immense amount of transmitted data. Some of them, such as mirror or proxy servers, minimise the number of long-distance transmissions. The other methods reduce the size of the data by compressing them. Multimedia is a field in which data of vast sizes are processed. The sizes of text documents and application files also grow rapidly. Another type of data for which compression is useful is database tables. Nowadays, the amount of information stored in databases grows fast, while their contents often exhibit much redundancy.

Data compression methods can be classified in several ways. One of the most important criteria of classification is whether the compression algorithm removes parts of the data that cannot be recovered during decompression. Algorithms that irreversibly remove some parts of the data are called lossy, while the others are called lossless. Lossy algorithms are usually used when perfect consistency with the original data is not necessary after decompression. Such a situation occurs, for example, in the compression of video or picture data. If the recipient of the video is a human, then small changes of the colors of some pixels introduced during compression may be imperceptible. Lossy compression methods typically yield much better compression ratios than lossless algorithms, so if we can accept some distortion of the data, these methods can be used. There are, however, situations in which lossy methods must not be used to compress picture data. In many countries, medical images may be compressed only by lossless algorithms, because of legal regulations.

One of the main strategies in developing compression methods is to prepare a specialised compression algorithm for the data we are going to transmit or store. One of many examples where this approach is useful comes from astronomy. The distance to a spacecraft which explores the universe is huge, which causes serious communication problems. A critical situation took place during the Jupiter mission of the Galileo spacecraft: after two years of flight, Galileo's high-gain antenna did not open.
There was a way to get the collected data through a supporting antenna, but the data transmission speed through it was low. The supporting antenna was designed to work at a speed of 16 bits per second at the distance of Jupiter. The Galileo team improved this speed to 120 bits per second, but the transmission time was still quite long. Another way to improve the transmission speed was to apply a highly efficient compression algorithm. The compression algorithm running on the Galileo spacecraft reduces the data size about 10 times before sending. The data have been transmitted this way since 1995. Let us imagine the situation without compression: to receive the same amount of data we would have to wait about 80 years.

The situation described above is of course specific: we have good knowledge of what kind of information is transmitted, reducing the size of the data is crucial, and the cost of developing a compression method is of lower importance. In general, however, it is not possible to prepare a specialised compression method for each type of data. The main reasons are that it would result in a vast number of algorithms, and that the cost of developing a new compression method could surpass the gain obtained by the reduction of the data size. On the other hand, we could assume nothing about the data, but if we do so, we have no way of finding the redundant information. Thus a compromise is needed. The standard approach in compression is to define classes of sources producing different types of data. We assume that the data are produced by a source of some class and apply a compression method designed for this particular class. The algorithms that work well on data which can be approximated as the output of some general source class are called universal.

Before we turn to the families of universal lossless data compression algorithms, we have to mention entropy coders. An entropy coder is a method that assigns to every symbol from the alphabet a code depending on the probability of the symbol's occurrence. Symbols that are more probable to occur get shorter codes than the less probable ones. The codes are assigned to the symbols in such a way that the expected length of the compressed sequence is minimal. The most popular entropy coders are the Huffman coder and the arithmetic coder. Both methods are optimal, in the sense that one cannot assign codes for which the expected length of the compressed sequence would be shorter. The Huffman coder is optimal within the class of methods that assign codes of integer bit length, while the arithmetic coder is free from this limitation and therefore usually yields a shorter expected code length.

A number of universal lossless data compression algorithms have been proposed, and nowadays they are widely used. Historically, the first ones were introduced by Ziv and Lempel [202, 203] in 1977–78. The authors proposed to search the data being compressed for identical parts and to replace the repetitions with information about where the identical subsequences appeared before. This task can be accomplished in several ways.
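To make the idea concrete, the following minimal sketch (not taken from the dissertation) shows the repetition-replacement principle behind the Ziv–Lempel family in the spirit of LZ77: the encoder emits (offset, length, next byte) tokens that point back to earlier occurrences of the same substring, and the decoder simply copies the referenced text.

```python
def lz77_encode(data: bytes, window: int = 4096, max_len: int = 255):
    """Greedy LZ77-style factorisation: each token is (offset, length, next_byte).

    offset = 0 and length = 0 mean "no earlier match, just emit next_byte".
    """
    tokens, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        # search the sliding window for the longest match starting at position i
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        if i + best_len >= len(data):          # always keep one byte as the literal
            best_len = len(data) - i - 1
        tokens.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return tokens


def lz77_decode(tokens):
    out = bytearray()
    for off, length, nxt in tokens:
        for _ in range(length):
            out.append(out[-off])              # copy from the already decoded text
        out.append(nxt)
    return bytes(out)


if __name__ == "__main__":
    sample = b"abracadabra abracadabra abracadabra"
    assert lz77_decode(lz77_encode(sample)) == sample
```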
Recommended publications
  • A Survey Paper on Different Speech Compression Techniques
    A Survey Paper on Different Speech Compression Techniques
    Kanawade Pramila R. (M.E. Electronics, Department of Electronics Engineering, Amrutvahini College of Engineering, Sangamner, Maharashtra, India) and Prof. Gundal Shital S. (HOD, Department of Electronics Engineering, Amrutvahini College of Engineering, Sangamner, Maharashtra, India). IJARIIE, Vol. 2, Issue 5, 2016, ISSN (O) 2395-4396.
    Abstract: This paper describes the different types of speech compression techniques. Speech compression can be divided into two main types, lossless and lossy compression. This survey covers waveform-based, parametric-based and hybrid speech compression. Compression is nothing but reducing the size of data while considering memory size. Speech compression means compressing voiced signals for different applications such as high-quality databases of speech signals, multimedia applications, music databases and Internet applications. Today speech compression is very useful in our lives. The main aim of speech compression is to compress any type of audio that is transferred over the communication channel, because of the limited channel bandwidth, the limited data storage capacity and the need for low bit rates. Using lossless or lossy techniques for speech compression means reducing the number of bits in the original information. With lossless data compression there is no loss of the original information, while with a lossy data compression technique some bits are lost.
    Keywords: bit rate, compression, waveform-based speech compression, parametric-based speech compression, hybrid-based speech compression.
    1. Introduction: Speech compression is used in the encoding system.
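    As a concrete illustration of the waveform-based family mentioned above, the sketch below applies the textbook mu-law companding rule (the quantiser standardised in G.711 telephony) to squeeze floating-point samples into 8-bit codes. It is a generic example of lossy waveform coding, not code from the surveyed paper.

```python
import numpy as np

MU = 255.0  # standard mu-law constant used in G.711 telephony

def mu_law_compress(x: np.ndarray) -> np.ndarray:
    """Map samples in [-1, 1] to 8-bit codes using mu-law companding."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)    # compand to [-1, 1]
    return np.round((y + 1.0) / 2.0 * 255.0).astype(np.uint8)   # quantise to 8 bits

def mu_law_expand(codes: np.ndarray) -> np.ndarray:
    """Inverse mapping: 8-bit codes back to approximate samples in [-1, 1]."""
    y = codes.astype(np.float64) / 255.0 * 2.0 - 1.0
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

if __name__ == "__main__":
    t = np.linspace(0.0, 1.0, 8000)
    speech_like = 0.5 * np.sin(2 * np.pi * 200 * t) * np.exp(-3 * t)
    restored = mu_law_expand(mu_law_compress(speech_like))
    print("max absolute error:", np.max(np.abs(speech_like - restored)))
```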
  • ROOT I/O Compression Improvements for HEP Analysis (arXiv:2004.10531v1 [cs.OH], 8 Apr 2020)
    ROOT I/O compression improvements for HEP analysis
    Oksana Shadura (University of Nebraska-Lincoln), Brian Paul Bockelman (Morgridge Institute for Research), Philippe Canal (Fermilab), Danilo Piparo (CERN) and Zhe Zhang (University of Nebraska-Lincoln). arXiv:2004.10531v1 [cs.OH], 8 Apr 2020.
    Abstract: We overview recent changes in the ROOT I/O system, increasing its performance and improving its interaction with other data analysis ecosystems. The newly introduced compression algorithms, the much faster bulk I/O data path, and a few additional techniques have the potential to significantly improve the performance of experiments' software. The need for efficient lossless data compression has grown significantly as the amount of HEP data collected, transmitted, and stored has dramatically increased during the LHC era. While compression reduces storage space and, potentially, I/O bandwidth usage, it should not be applied blindly: there are significant trade-offs between the increased CPU cost for reading and writing files and the reduced storage space.
    1. Introduction: In the past years the LHC experiments were commissioned and now manage about an exabyte of storage for analysis purposes, approximately half of which is used for archival purposes and half for traditional disk storage. Meanwhile, for the HL-LHC the storage requirements per year are expected to increase by a factor of 10 [1]. Looking at these predictions, we would like to state that storage will remain one of the major cost drivers and, at the same time, one of the bottlenecks for HEP computing.
  • Data Compression for Remote Laboratories
    Data Compression for Remote Laboratories
    https://doi.org/10.3991/ijim.v11i4.6743
    Olawale B. Akinwale and Lawrence O. Kehinde, Obafemi Awolowo University, Ile-Ife, Nigeria.
    Abstract: Remote laboratories on mobile phones have been around for a few years now. This has greatly improved accessibility of these remote labs to students who cannot afford computers but have mobile phones. When money is a factor, however (as is often the case with those who cannot afford a computer), the cost of use of these remote laboratories should be minimized. This work addressed the issue of minimizing the cost of use of the remote lab by making use of data compression for data sent between the user and the remote lab.
    Keywords: data compression; Lempel-Ziv-Welch; remote laboratory; WebSockets
    1. Introduction: The mobile phone is now the de facto computational device of choice. Mobile phones are quite powerful, connected devices which are almost always with us, so much so that about every important online service has a mobile version or at least a version compatible with mobile phones and tablets (in this paper the phrase mobile device covers both mobile phones and tablets). This has caused online laboratory developers to develop online laboratory solutions which are mobile device-friendly [1, 2, 3, 4, 5, 6, 7, 8]. One of the main benefits of online laboratories is the possibility of using fewer resources to serve more people, i.e.
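    The paper's keywords point to Lempel-Ziv-Welch (LZW). As a rough illustration of the kind of dictionary coder that could sit on either end of the WebSocket link, here is a minimal LZW encoder/decoder sketch; it is a generic textbook version, not the authors' implementation, and the payload shown is made up.

```python
def lzw_encode(data: bytes):
    """Return a list of integer codes; the dictionary starts with all single bytes."""
    dictionary = {bytes([i]): i for i in range(256)}
    next_code, w, codes = 256, b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc                       # keep extending the current phrase
        else:
            codes.append(dictionary[w])  # emit code for the longest known phrase
            dictionary[wc] = next_code   # remember the new phrase
            next_code += 1
            w = bytes([byte])
    if w:
        codes.append(dictionary[w])
    return codes


def lzw_decode(codes):
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    w = dictionary[codes[0]]
    out = bytearray(w)
    for code in codes[1:]:
        entry = dictionary[code] if code in dictionary else w + w[:1]  # KwKwK case
        out.extend(entry)
        dictionary[next_code] = w + entry[:1]
        next_code += 1
        w = entry
    return bytes(out)


if __name__ == "__main__":
    payload = b"measurement,measurement,measurement,1.00,1.01,1.02"
    enc = lzw_encode(payload)
    assert lzw_decode(enc) == payload
    print(len(payload), "bytes ->", len(enc), "codes")
```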
  • Incompressibility and Lossless Data Compression: An Approach by Pattern Discovery
    Incompressibility and Lossless Data Compression: An Approach by Pattern Discovery
    Oscar Herrera Alcántara and Francisco Javier Zaragoza Martínez, Universidad Autónoma Metropolitana, Unidad Azcapotzalco, Departamento de Sistemas, Av. San Pablo No. 180, Col. Reynosa Tamaulipas, Del. Azcapotzalco, 02200, Mexico City, Mexico. Article received on July 14, 2008; accepted on April 3, 2009.
    Abstract: We present a novel method for lossless data compression that aims to achieve a performance different from those proposed in the last decades to tackle the underlying volume of data of the Information and Multimedia Ages. These latter methods are called entropic or classic because they are based on the classic information theory of Claude E. Shannon and include the Huffman [8], arithmetic [14], Lempel-Ziv [15], Burrows-Wheeler (BWT) [4], Move To Front (MTF) [3] and Prediction by Partial Matching (PPM) [5] techniques. We review the Incompressibility Theorem and its relation with the classic methods and with our method, which is based on discovering symbol patterns called metasymbols. Experimental results allow us to propose metasymbolic compression as a tool for multimedia compression, sequence analysis and unsupervised clustering.
    Keywords: incompressibility, data compression, information theory, pattern discovery, clustering.
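    For reference, one of the classic building blocks cited above, the Move To Front transform (MTF) [3], fits in a few lines: recently seen symbols get small indices, which a later entropy coder can exploit. This is a generic sketch for orientation only, not the metasymbol-discovery method the paper proposes.

```python
def mtf_encode(data: bytes):
    """Replace each byte by its current position in a recency list."""
    alphabet = list(range(256))
    out = []
    for b in data:
        idx = alphabet.index(b)
        out.append(idx)
        alphabet.pop(idx)
        alphabet.insert(0, b)      # move the symbol to the front
    return out

def mtf_decode(indices):
    alphabet = list(range(256))
    out = bytearray()
    for idx in indices:
        b = alphabet.pop(idx)
        out.append(b)
        alphabet.insert(0, b)
    return bytes(out)

if __name__ == "__main__":
    s = b"aaabbbaaacccaaa"
    assert mtf_decode(mtf_encode(s)) == s
    print(mtf_encode(s))           # runs of identical bytes become runs of zeros
```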
  • Lossless Compression of Audio Data
    Chapter 12: Lossless Compression of Audio Data. Robert C. Maher.
    Overview: Lossless data compression of digital audio signals is useful when it is necessary to minimize the storage space or transmission bandwidth of audio data while still maintaining archival quality. Available techniques for lossless audio compression, or lossless audio packing, generally employ an adaptive waveform predictor with a variable-rate entropy coding of the residual, such as Huffman or Golomb-Rice coding. The amount of data compression can vary considerably from one audio waveform to another, but ratios of less than 3 are typical. Several freeware, shareware, and proprietary commercial lossless audio packing programs are available.
    12.1 Introduction: The Internet is increasingly being used as a means to deliver audio content to end-users for entertainment, education, and commerce. It is clearly advantageous to minimize the time required to download an audio data file and the storage capacity required to hold it. Moreover, the expectations of end-users with regard to signal quality, number of audio channels, metadata such as song lyrics, and similar additional features provide incentives to compress the audio data.
    12.1.1 Background: In the past decade there have been significant breakthroughs in audio data compression using lossy perceptual coding [1]. These techniques lower the bit rate required to represent the signal by establishing perceptual error criteria, meaning that a model of human hearing perception is used to guide the elimination of excess bits that can be either reconstructed (redundancy in the signal) or ignored (inaudible components in the signal).
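    A toy version of the scheme described above, a fixed first-order predictor followed by Golomb-Rice coding of the residuals, is sketched below. Real audio packers adapt the predictor order and the Rice parameter per block; the fixed parameter k and the zigzag mapping used here are illustrative assumptions, not the chapter's own code.

```python
def rice_encode(value: int, k: int) -> str:
    """Golomb-Rice code for a non-negative integer: unary quotient + k-bit remainder."""
    q, r = value >> k, value & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

def zigzag(n: int) -> int:
    """Map signed residuals to non-negative integers: 0,-1,1,-2,2 -> 0,1,2,3,4."""
    return 2 * n if n >= 0 else -2 * n - 1

def encode_block(samples, k=4):
    """First-order prediction (previous sample) plus Rice coding of the residuals."""
    bits, prev = [], 0
    for s in samples:
        residual = s - prev        # predictor: next sample is roughly the previous one
        prev = s
        bits.append(rice_encode(zigzag(residual), k))
    return "".join(bits)

if __name__ == "__main__":
    samples = [0, 3, 6, 8, 9, 9, 8, 6, 3, 0, -3, -6]   # smooth, predictable waveform
    stream = encode_block(samples)
    print(len(stream), "bits instead of", 16 * len(samples), "for raw 16-bit PCM")
```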
  • PPM Performance with BWT Complexity: A New Method for Lossless Data Compression
    PPM Performance with BWT Complexity: A New Method for Lossless Data Compression
    Michelle Effros, California Institute of Technology, [email protected]
    Abstract: This work combines a new fast context-search algorithm with the lossless source coding models of PPM to achieve a lossless data compression algorithm with the linear context-search complexity and memory of BWT and Ziv-Lempel codes and the compression performance of PPM-based algorithms. Both sequential and nonsequential encoding are considered. The proposed algorithm yields an average rate of 2.27 bits per character (bpc) on the Calgary corpus, comparing favorably to the 2.33 and 2.34 bpc of PPM5 and PPM* and the 2.43 bpc of BW94, but not matching the 2.12 bpc of PPMZ9, which, at the time of this publication, gives the greatest compression of all algorithms reported on the Calgary corpus results page. The proposed algorithm gives an average rate of 2.14 bpc on the Canterbury corpus. The Canterbury corpus web page gives average rates of 1.99 bpc for PPMZ9, 2.11 bpc for PPM5, 2.15 bpc for PPM7, and 2.23 bpc for BZIP2 (a BWT-based code) on the same data set.
    I. Introduction: The Burrows-Wheeler Transform (BWT) [1] is a reversible sequence transformation that is becoming increasingly popular for lossless data compression. The BWT rearranges the symbols of a data sequence in order to group together all symbols that share the same unbounded history or "context." Intuitively, this operation is achieved by forming a table in which each row is a distinct cyclic shift of the original data string.
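    The cyclic-shift table described above translates almost literally into code. The naive construction below is quadratic in time and memory, so practical coders use suffix sorting instead, but it shows exactly what the transform and its inverse do; it is a generic sketch, not the paper's algorithm.

```python
def bwt(text: str, eos: str = "\0") -> str:
    """Naive Burrows-Wheeler transform: sort all cyclic rotations, keep the last column."""
    s = text + eos                                   # unique end marker makes the sort unambiguous
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last_column: str, eos: str = "\0") -> str:
    """Rebuild the rotation table one column at a time, then read the row ending in eos."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    row = next(r for r in table if r.endswith(eos))
    return row[:-1]                                  # drop the end marker

if __name__ == "__main__":
    s = "banana_bandana_banana"
    transformed = bwt(s)
    print(transformed)                 # similar contexts group equal symbols together
    assert inverse_bwt(transformed) == s
```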
  • Google’s Brotli Compression Algorithm Lands to Windows Edge (InfoQ, 2016-12-25)
    Google’s Brotli Compression Algorithm Lands to Windows Edge
    www.infoq.com, 2016-12-25.
    Microsoft has announced that its Edge browser has started using Brotli, the compression algorithm that Google open-sourced last year. Brotli is on by default in the latest Edge build and can be previewed via the Windows Insider Program. It will reach stable status early next year, says Microsoft. Microsoft touts 20% higher compression ratios over comparable compression algorithms, which would benefit page load times without impacting client-side CPU costs. According to Google, Brotli uses a whole new data format, which makes it incompatible with Deflate but ensures higher compression ratios. In particular, Google says, Brotli is roughly as fast as zlib when decompressing and provides a better compression ratio than LZMA and bzip2 on the Canterbury corpus. Brotli appears to be especially tuned for the web, that is, for offline encoding and online decoding of Web assets or Android APKs. Google claims a compression ratio improvement of 20–26% over its own Zopfli algorithm, which still provides the best compression ratio of any deflate algorithm.
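    For a rough feel of the ratios being discussed, the snippet below compresses the same repetitive buffer with zlib and with Brotli at its highest quality, assuming the brotli PyPI bindings for Google's library are installed; the exact numbers depend heavily on the input data.

```python
import zlib

import brotli  # pip install brotli -- Python bindings for Google's reference library

data = (b"<html><body>"
        + b"<p>The quick brown fox jumps over the lazy dog.</p>" * 500
        + b"</body></html>")

deflated = zlib.compress(data, 9)             # zlib/Deflate at maximum compression
brotlied = brotli.compress(data, quality=11)  # Brotli at maximum quality

print(f"original : {len(data)} bytes")
print(f"zlib -9  : {len(deflated)} bytes ({len(deflated) / len(data):.1%})")
print(f"brotli 11: {len(brotlied)} bytes ({len(brotlied) / len(data):.1%})")

assert brotli.decompress(brotlied) == data    # lossless round trip
```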
  • Implementing Associative Coder of Buyanovsky (ACB)
    Implementing Associative Coder of Buyanovsky (ACB) Data Compression
    Sean Michael Lambert. A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, Montana State University-Bozeman, Bozeman, Montana, April 1999. © Copyright by Sean Michael Lambert, 1999.
    Abstract: In 1994 George Mechislavovich Buyanovsky published a basic description of a new data compression algorithm he called the "Associative Coder of Buyanovsky," or ACB. The archive program using this idea, which he released in 1996 and updated in 1997, is still one of the best general compression utilities available. Despite this, the ACB algorithm is still barely understood by data compression experts, primarily because Buyanovsky never published a detailed description of it. ACB is a new idea in data compression, merging concepts from existing statistical and dictionary-based algorithms with entirely original ideas. This document presents several variations of the ACB algorithm and the details required to implement a basic version of ACB.
  • An Analysis of XML Compression Efficiency
    An Analysis of XML Compression Efficiency
    Christopher J. Augeri, Barry E. Mullins, Rusty O. Baldwin (Department of Electrical and Computer Engineering) and Dursun A. Bulutoglu (Department of Mathematics and Statistics), Air Force Institute of Technology (AFIT), Wright-Patterson Air Force Base, Dayton, OH; Leemon C. Baird III, Department of Computer Science, United States Air Force Academy (USAFA), Colorado Springs, CO. Emails: {chris.augeri, barry.mullins}@afit.edu, {dursun.bulutoglu, rusty.baldwin}@afit.edu.
    Abstract: XML simplifies data exchange among heterogeneous computers, but it is notoriously verbose and has spawned the development of many XML-specific compressors and binary formats. We present an XML test corpus and a combined efficiency metric integrating compression ratio and execution speed. We use this corpus and linear regression to assess 14 general-purpose and XML-specific compressors relative to the proposed metric. We also identify key factors when selecting a compressor. Our results show XMill or WBXML may be useful in some instances, but a general-purpose compressor is often the best choice.
    We expand previous XML compression studies [9, 26, 34, 47] by proposing the XML file corpus and a combined efficiency metric. The corpus was assembled using guidelines given by developers of the Canterbury corpus [3], files often used to assess compressor performance. The efficiency metric combines execution speed and compression ratio, enabling simultaneous assessment of these metrics, versus prioritizing one metric over the other. We analyze the collected metrics using linear regression models (ANOVA) versus a simple comparison of means, e.g., X is 20% better than Y.
    2. XML Overview: XML has gained much acceptance since first proposed in 1998 by the World-Wide Web Consortium (W3C).
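    The combined efficiency metric itself is defined in the paper; as a loose stand-in, the harness below measures its two raw ingredients, compression ratio and speed, for a few general-purpose compressors from the Python standard library. The file name sample.xml is a placeholder path, not one of the corpus files.

```python
import bz2
import lzma
import time
import zlib


def benchmark(name, compress, data):
    """Report compression ratio and throughput for one compressor on one buffer."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(packed)        # higher is better
    speed = len(data) / elapsed / 1e6      # MB/s, higher is better
    print(f"{name:5s} ratio={ratio:6.2f}  speed={speed:8.1f} MB/s")


if __name__ == "__main__":
    with open("sample.xml", "rb") as f:    # placeholder input file
        data = f.read()
    benchmark("zlib", lambda d: zlib.compress(d, 9), data)
    benchmark("bz2",  lambda d: bz2.compress(d, 9), data)
    benchmark("lzma", lambda d: lzma.compress(d, preset=6), data)
```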
  • Improving Compression Ratio in Backup
    Improving compression ratio in backup
    Master's thesis (examensarbete) in information coding / image coding, Department of Electrical Engineering (Institutionen för systemteknik), Linköping University, by Mattias Zeidlitz. Report no. LITH-ISY-EX--12/4588--SE, Linköping, 2012. Presented 2012-06-13, 58 pages. Electronic version: http://www.ep.liu.se
    Abstract (translated from Swedish): This report describes a master's thesis project carried out at Degoo Backup AB in Stockholm during the spring of 2012. The aim was to design a compression suite
  • Comparison of Image Compressions: Analog Transformations
    Comparison of Image Compressions: Analog Transformations
    Jose Balsa, CITIC Research Center, Universidade da Coruña (University of A Coruña), 15071 A Coruña, Spain. Presented at the 3rd XoveTIC Conference, A Coruña, Spain, 8–9 October 2020. Published: 21 August 2020.
    Abstract: A comparison between the four most used transforms, the discrete Fourier transform (DFT), the discrete cosine transform (DCT), the Walsh-Hadamard transform (WHT) and the Haar wavelet transform (DWT), for the transmission of analog images, varying their compression and comparing their quality, is presented. Additionally, performance tests are done for different levels of additive white Gaussian noise.
    Keywords: analog image transformation; analog image compression; analog image quality
    1. Introduction: Digitized image coding systems employ reversible mathematical transformations. These transformations change values and function domains in order to rearrange information in a way that condenses the information important to human vision [1]. In the new domain, it is possible to filter out relevant information and discard information that is irrelevant or of lesser importance for image quality [2]. Both digital and analog systems use the same transformations in source coding. Some examples of digital systems that employ these transformations are JPEG, M-JPEG, JPEG2000, MPEG-1, 2, 3 and 4, DV and HDV, among others; after transformation and filtering, digital systems additionally make use of lossless compression techniques such as Huffman coding. In this work, we aim to compare the most commonly used transformations in state-of-the-art image compression systems. Typically, the transformations used to compress analog images work either on the entire image or on regions of the image.
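    To show what such a transformation does, here is a single-level 2-D Haar wavelet step written directly in NumPy, followed by crude thresholding of small detail coefficients; the DCT, DFT and WHT are used in the same transform-quantise-reconstruct pattern. This is a generic sketch, not the test code used in the paper.

```python
import numpy as np

def haar2d(block: np.ndarray) -> np.ndarray:
    """One level of the 2-D Haar transform (averages and differences along both axes)."""
    x = block.astype(np.float64)
    rows_lo = (x[:, 0::2] + x[:, 1::2]) / 2.0          # horizontal averages
    rows_hi = (x[:, 0::2] - x[:, 1::2]) / 2.0          # horizontal details
    x = np.hstack([rows_lo, rows_hi])
    cols_lo = (x[0::2, :] + x[1::2, :]) / 2.0          # vertical averages
    cols_hi = (x[0::2, :] - x[1::2, :]) / 2.0          # vertical details
    return np.vstack([cols_lo, cols_hi])

def inverse_haar2d(coeffs: np.ndarray) -> np.ndarray:
    h, w = coeffs.shape
    lo, hi = coeffs[: h // 2, :], coeffs[h // 2 :, :]
    x = np.empty_like(coeffs)
    x[0::2, :], x[1::2, :] = lo + hi, lo - hi          # undo the vertical step
    lo, hi = x[:, : w // 2], x[:, w // 2 :]
    out = np.empty_like(coeffs)
    out[:, 0::2], out[:, 1::2] = lo + hi, lo - hi      # undo the horizontal step
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(8, 8)).astype(np.float64)
    coeffs = haar2d(image)
    coeffs[np.abs(coeffs) < 4] = 0.0                   # "compress": drop small details
    print("max reconstruction error:", np.max(np.abs(inverse_haar2d(coeffs) - image)))
```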
  • A Machine Learning Perspective on Predictive Coding with PAQ
    A Machine Learning Perspective on Predictive Coding with PAQ
    Byron Knoll and Nando de Freitas, University of British Columbia, Vancouver, Canada. arXiv:1108.3298v1 [cs.LG], August 17, 2011.
    Abstract: PAQ8 is an open source lossless data compression algorithm that currently achieves the best compression rates on many benchmarks. This report presents a detailed description of PAQ8 from a statistical machine learning perspective. It shows that it is possible to understand some of the modules of PAQ8 and use this understanding to improve the method. However, intuitive statistical explanations of the behavior of other modules remain elusive. We hope the description in this report will be a starting point for discussions that will increase our understanding, lead to improvements to PAQ8, and facilitate a transfer of knowledge from PAQ8 to other machine learning methods, such as recurrent neural networks and stochastic memoizers. Finally, the report presents a broad range of new applications of PAQ to machine learning tasks including language modeling and adaptive text prediction, adaptive game playing, classification, and compression using features from the field of deep learning.
    1. Introduction: Detecting temporal patterns and predicting into the future is a fundamental problem in machine learning. It has gained great interest recently in the areas of nonparametric Bayesian statistics (Wood et al., 2009) and deep learning (Sutskever et al., 2011), with applications to several domains including language modeling and unsupervised learning of audio and video sequences. Some researchers have argued that sequence prediction is key to understanding human intelligence (Hawkins and Blakeslee, 2005). The close connections between sequence prediction and data compression are perhaps underappreciated within the machine learning community.
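    The part of PAQ8 that reads most naturally in machine-learning terms is its mixing stage: the bit probabilities of many context models are combined by an online logistic model trained on coding loss. The sketch below mixes just two toy predictors with a single logistic mixer updated by plain gradient descent; real PAQ8 mixes hundreds of models, selects weight sets by context and adds further APM/SSE stages, so this is only a schematic approximation of the idea.

```python
import math

def squash(x):    # logistic function
    return 1.0 / (1.0 + math.exp(-x))

def stretch(p):   # inverse logistic, applied to each model's probability
    return math.log(p / (1.0 - p))

class Mixer:
    """Online logistic mixing of several bit predictions (one weight per model)."""
    def __init__(self, n_models, lr=0.02):
        self.w = [0.0] * n_models
        self.lr = lr

    def predict(self, probs):
        self.x = [stretch(p) for p in probs]
        self.p = squash(sum(w * x for w, x in zip(self.w, self.x)))
        return self.p

    def update(self, bit):
        err = bit - self.p                   # gradient of the coding loss w.r.t. the logit
        self.w = [w + self.lr * err * x for w, x in zip(self.w, self.x)]

if __name__ == "__main__":
    # Two toy "models": one always predicts 0.9 for a 1, one adapts via a running count.
    data = [1, 1, 0, 1, 1, 1, 0, 1] * 200
    ones, total = 1, 2
    mixer, loss = Mixer(2), 0.0
    for bit in data:
        p = mixer.predict([0.9, ones / total])
        loss += -math.log(p if bit else 1.0 - p, 2)   # bits needed to code this bit
        mixer.update(bit)
        ones, total = ones + bit, total + 1
    print(f"average code length: {loss / len(data):.3f} bits per bit")
```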