Lempel Ziv Welch Data Compression Using Associative Processing As an Enabling Technology for Real Time Application

Total Page:16

File Type:pdf, Size:1020Kb

Lempel Ziv Welch Data Compression Using Associative Processing As an Enabling Technology for Real Time Application Rochester Institute of Technology RIT Scholar Works Theses 3-1-1998 Lempel Ziv Welch data compression using associative processing as an enabling technology for real time application Manish Narang Follow this and additional works at: https://scholarworks.rit.edu/theses Recommended Citation Narang, Manish, "Lempel Ziv Welch data compression using associative processing as an enabling technology for real time application" (1998). Thesis. Rochester Institute of Technology. Accessed from This Thesis is brought to you for free and open access by RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact [email protected]. Lempel Ziv Welch Data Compression using Associative Processing as an Enabling Technology for Real Time Applications. by Manish Narang A Thesis Submitted m Partial Fulfillment ofthe Requirements for the Degree of MASTER OF SCIENCE in Computer Engineering Approved by: _ Graduate Advisor - Tony H. Chang, Professor Roy Czemikowski, Professor Muhammad Shabaan, Assistant Professor Department ofComputer Engineering College ofEngineering Rochester Institute ofTechnology Rochester, New York March,1998 Thesis Release Permission Form Rochester Institute ofTechnology College ofEngineering Title: Lempel Ziv Welch Data Compression using Associative Processing as an Enabling Technology for Real Time Applications. I, Manish Narang, hereby grant pennission to the Wallace Memorial Library to reproduce my thesis in whole or in part. Signature: _ Date: -----------.L----=-------/19? ii Abstract Data compression is a term that refers to the reduction of data representation requirements either in storage and/or in transmission. A commonly used algorithm for compression is the Lempel-Ziv-Welch (LZW) method proposed by Terry A. Welch[l]. LZW is an adaptive, dictionary based, lossless algorithm. This provides for a general compression mechanism that is applicable to a broad range of inputs. Furthermore, the lossless nature of LZW implies that it is a reversible process which results in the original file/message being fully recoverable from compression. "compress" A variant of this algorithm is currently the foundation of the UNIX program. Additionally, LZW is one of the compression schemes defined in the TIFF standard[12], as well as in the CCITT V.42bis standard. One of the challenges in designing an efficient compression mechanism, such as LZW, which can be used in real time applications, is the speed of the search into the data dictionary. In this paper an Associative Processing implementation of the LZW algorithm is presented. This approach provides an efficient solution to this requirement. Additionally, it is shown that Associative Processing (ASP) allows for rapid and elegant development ofthe LZW algorithm that will generally outperform standard approaches in complexity, readability, and performance. in Acknowledgment I would like to acknowledge my family, Gurprasad, Vimal and Vibhu Narang, for the guidance and support they offered throughout the implementation ofthis work. Also, I would like to thank the Thesis committee for their input and time in reviewing my work. IV Table of Contents ABSTRACT Ill ACKNOWLEDGMENT IV TABLE OF CONTENTS V LIST OF FIGURES VII LIST OF TABLES VIII GLOSSARY IX 1. INTRODUCTION 1 1.1 General 1 1 .2 Information Sources 2 1.3 Deliverables 4 2. INFORMATION THEORY 5 2.1 Information 5 2.2 Redundancy 11 3. COMPRESSION .13 3.1 General Overview 13 3.2 Run Length Encoding 18 3.3 Shannon-Fano Coding ., 19 3.4 Huffman Coding 23 3.5 Arithmetic Coding 28 3.6 Lempel-Ziv-Welch 32 3.6.1 Basic Algorithms 34 3.6.2 An Example 36 3.6.3 Hashing 40 3.7 Considerations 43 3.8 The Need for Hardware Compression 44 4. ASSOCIATIVE PROCESSING 48 4.1 general -cam 48 4.2 Integration Issues 54 4.2.1 Constituents of a cell 54 4.2.2 Scalability 56 4.3 Coherent Processor 58 4.4 Alternatives 60 4.5 Architecture for Compression using Associative Processing 62 4.6 a variable length 12 bit code implementation of lzw compression using associative Processing 63 4.6.1 Data Structure 64 4.6.2 CP Macros 66 4.6.3 Accomplishments 68 4.7 a variable length 16 bit code implementation of lzw compression using associative Processing 71 4.7.1 Data Structure 71 4.7.2 CP Macros 72 4.8 Results 73 5. COMMERCIAL HARDWARE COMPRESSORS 79 6. CONCLUSION 80 6.1 Summary 80 6.2 Significance 83 PREFERENCES 84 8. APPENDIX A: MAIN LZW 12 BIT CODE 87 9. APPENDIX B: CP LZW 12 BIT CODE 94 10. APPENDIX C: MAIN LZW 16 BIT CODE 96 11. APPENDIX D: CP LZW 16 BIT CODE 103 VI List of Figures Figure 1. Generic Statistical Compression 15 Figure 2. Generic Adaptive Compression 17 Figure 3. Generic Adaptive Decompression 18 Figure 4. Binary Tree Representation of Shannon-Fano Code 22 Figure 5. Huffman Tree after second iteration 25 Figure 6. Binary Tree representation of Huffman code 26 Figure 7. Incoming Stream to LZW compression 36 Figure 8. Compressed Stream after LZW compression 37 Figure 9. Mask Functionality in an ASP 50 Figure 10. Logic for a single bit of CAM 51 Figure 11. Multiple Response Resolver , 52 Figure 12. Static CAM Cell Circuit 54 Figure 13. Dynamic CAM Cell Circuit 55 Figure 14. Proposed ASP Architecture 63 Figure 15. CAM Data Structure for 12 bit LZW 64 Figure 16. Representation of Table 13 in CAM using 12 bit codes 65 Figure 17. User Data Structure for CAM interface 66 Figure 18. CAM Data Structure for 16 bit LZW 71 Figure 19. Representation of Table 13 in CAM using 16 bit codes 72 Figure 20. Compression Comparison using Calgary Corpus 75 Figure 21 . Relative performance improvement using ASP 79 Vll List of Tables Table 1. Entropies of English relative to order 9 Table 2. Probability of a sample alphabet 20 Table 3. First iteration of Shannon-Fano 21 Table 4. Shannon-Fano Code for the sample alphabet of Table 2 21 Table 5. Example of Shannon-Fano Coding 22 Table 6. Huffman Code for Table 2 26 Table 7. Huffman coding example 27 "ROCHESTER" Table 8. Range assignment for text string 30 "ROCHESTER" Table 9. Arithmetic Coding of text string 30 "ROCHESTER" Table 10. Arithmetic decoding of text string 31 Table 1 1. String table after initialization 36 Table 12. Building of String Table during Compression 36 Table 13. String Table at completion 37 Table 14. Decompression of Compressed Stream up to the seventh code 37 Table 15. Hash table using Open Chaining - Linear Probe 41 Table 16. Calgary Corpus [33] 69 Table 17. Compression results for 12 bit implementation 70 Table 18. Compression results for 16 bit implementation 74 Table 19. Run time approximations for 16 bit version 77 Table 20. Run time approximations for 12 bit version 78 vin Glossary Active-X A software component interface architecture proposed by Microsoft. ASIC Application Specific Integrated Circuit. ASP Associative Processor(s) or Processing. CAM Content Addressable Memory. CP Coherent Processor, an associative processor designed by Coherent Research. CU Control Unit. Java-Beans A software component interface architecture proposed by Sun Microsystems. GPLB General Purpose Logic Block. LAN Local Area Network. LFU Least Frequently Used. LRU Least Recently Used. LZW Lempel-Ziv-Welch. MRR Multiple Response Resolver. MRReg Multiple Response Register. MRROut Output of the MRR. OLAP Online Analytical Processing. PPM Prediction by Partial Matching. RLE Run Length Encoding. SIMD Single Instruction Multiple Data. SR Shift Register. UDS User Data Structure. VLSI Very Large Scale Integration. WAN Wide Area Network. wcs Writable Control Store. IX 1. Introduction 1.1 General Today's world has become an information society. Bob Lucky has stated, "The industrial here." age has come and gone; the information age is [17, p. 2] This is in sight of the fact that more than 65% percent of the population of the United States is involved in information related work. Much like the industrial age and other turning points in the history of man, this age too has many implications. The greatest of which is the management of digital information. There are many aspects relating to this challenge; from creating, securing, sharing, storing, and distributing, to analyzing. In this paper, solutions to two of these challenges are being examined. Namely, storing, and distributing information will be presented. It is obvious that as information grows, so does the need to store and distribute it. Here, when we speak of storing information, we are referring to a means of physically writing data into a permanent storage device such as magnetic or optical media. Although the density of such devices is ever increasing, without an appreciable increase in cost, there are issues that this technology change does not address. One, those that have already invested in the purchase of storage media or networks will not be easily willing to spend additional money in revamping existing systems. Unfortunately, due to the fact that information is growing at a faster rate than the nothing" technology, a "do philosophy is not the solution. Therefore, the purchase of new hardware is unavoidable but must happen in tandem with the utilization of data compression to meet the needs of the user and society. Two, there is the issue of physical space constraints. Unchallenged, it is not difficult to imagine the trend in computer technology going towards a future, which like the days of the ENIAC will have large computer rooms. Unlike those days, where this room was used for processing hardware, the space will now be dedicated for data storage. Again, data compression can provide a mechanism to help control this growth in information. Moreover, it is quite common to see the interconnection of high speed LANs via low speed WANs.
Recommended publications
  • Data Compression: Dictionary-Based Coding 2 / 37 Dictionary-Based Coding Dictionary-Based Coding
    Dictionary-based Coding already coded not yet coded search buffer look-ahead buffer cursor (N symbols) (L symbols) We know the past but cannot control it. We control the future but... Last Lecture Last Lecture: Predictive Lossless Coding Predictive Lossless Coding Simple and effective way to exploit dependencies between neighboring symbols / samples Optimal predictor: Conditional mean (requires storage of large tables) Affine and Linear Prediction Simple structure, low-complex implementation possible Optimal prediction parameters are given by solution of Yule-Walker equations Works very well for real signals (e.g., audio, images, ...) Efficient Lossless Coding for Real-World Signals Affine/linear prediction (often: block-adaptive choice of prediction parameters) Entropy coding of prediction errors (e.g., arithmetic coding) Using marginal pmf often already yields good results Can be improved by using conditional pmfs (with simple conditions) Heiko Schwarz (Freie Universität Berlin) — Data Compression: Dictionary-based Coding 2 / 37 Dictionary-based Coding Dictionary-Based Coding Coding of Text Files Very high amount of dependencies Affine prediction does not work (requires linear dependencies) Higher-order conditional coding should work well, but is way to complex (memory) Alternative: Do not code single characters, but words or phrases Example: English Texts Oxford English Dictionary lists less than 230 000 words (including obsolete words) On average, a word contains about 6 characters Average codeword length per character would be limited by 1
    [Show full text]
  • Annual Report 2016
    ANNUAL REPORT 2016 PUNJABI UNIVERSITY, PATIALA © Punjabi University, Patiala (Established under Punjab Act No. 35 of 1961) Editor Dr. Shivani Thakar Asst. Professor (English) Department of Distance Education, Punjabi University, Patiala Laser Type Setting : Kakkar Computer, N.K. Road, Patiala Published by Dr. Manjit Singh Nijjar, Registrar, Punjabi University, Patiala and Printed at Kakkar Computer, Patiala :{Bhtof;Nh X[Bh nk;k wjbk ñ Ò uT[gd/ Ò ftfdnk thukoh sK goT[gekoh Ò iK gzu ok;h sK shoE tk;h Ò ñ Ò x[zxo{ tki? i/ wB[ bkr? Ò sT[ iw[ ejk eo/ w' f;T[ nkr? Ò ñ Ò ojkT[.. nk; fBok;h sT[ ;zfBnk;h Ò iK is[ i'rh sK ekfJnk G'rh Ò ò Ò dfJnk fdrzpo[ d/j phukoh Ò nkfg wo? ntok Bj wkoh Ò ó Ò J/e[ s{ j'fo t/; pj[s/o/.. BkBe[ ikD? u'i B s/o/ Ò ô Ò òõ Ò (;qh r[o{ rqzE ;kfjp, gzBk óôù) English Translation of University Dhuni True learning induces in the mind service of mankind. One subduing the five passions has truly taken abode at holy bathing-spots (1) The mind attuned to the infinite is the true singing of ankle-bells in ritual dances. With this how dare Yama intimidate me in the hereafter ? (Pause 1) One renouncing desire is the true Sanayasi. From continence comes true joy of living in the body (2) One contemplating to subdue the flesh is the truly Compassionate Jain ascetic. Such a one subduing the self, forbears harming others. (3) Thou Lord, art one and Sole.
    [Show full text]
  • Arithmetic Coding
    Arithmetic Coding Arithmetic coding is the most efficient method to code symbols according to the probability of their occurrence. The average code length corresponds exactly to the possible minimum given by information theory. Deviations which are caused by the bit-resolution of binary code trees do not exist. In contrast to a binary Huffman code tree the arithmetic coding offers a clearly better compression rate. Its implementation is more complex on the other hand. In arithmetic coding, a message is encoded as a real number in an interval from one to zero. Arithmetic coding typically has a better compression ratio than Huffman coding, as it produces a single symbol rather than several separate codewords. Arithmetic coding differs from other forms of entropy encoding such as Huffman coding in that rather than separating the input into component symbols and replacing each with a code, arithmetic coding encodes the entire message into a single number, a fraction n where (0.0 ≤ n < 1.0) Arithmetic coding is a lossless coding technique. There are a few disadvantages of arithmetic coding. One is that the whole codeword must be received to start decoding the symbols, and if there is a corrupt bit in the codeword, the entire message could become corrupt. Another is that there is a limit to the precision of the number which can be encoded, thus limiting the number of symbols to encode within a codeword. There also exist many patents upon arithmetic coding, so the use of some of the algorithms also call upon royalty fees. Arithmetic coding is part of the JPEG data format.
    [Show full text]
  • Source Coding: Part I of Fundamentals of Source and Video Coding
    Foundations and Trends R in sample Vol. 1, No 1 (2011) 1–217 c 2011 Thomas Wiegand and Heiko Schwarz DOI: xxxxxx Source Coding: Part I of Fundamentals of Source and Video Coding Thomas Wiegand1 and Heiko Schwarz2 1 Berlin Institute of Technology and Fraunhofer Institute for Telecommunica- tions — Heinrich Hertz Institute, Germany, [email protected] 2 Fraunhofer Institute for Telecommunications — Heinrich Hertz Institute, Germany, [email protected] Abstract Digital media technologies have become an integral part of the way we create, communicate, and consume information. At the core of these technologies are source coding methods that are described in this text. Based on the fundamentals of information and rate distortion theory, the most relevant techniques used in source coding algorithms are de- scribed: entropy coding, quantization as well as predictive and trans- form coding. The emphasis is put onto algorithms that are also used in video coding, which will be described in the other text of this two-part monograph. To our families Contents 1 Introduction 1 1.1 The Communication Problem 3 1.2 Scope and Overview of the Text 4 1.3 The Source Coding Principle 5 2 Random Processes 7 2.1 Probability 8 2.2 Random Variables 9 2.2.1 Continuous Random Variables 10 2.2.2 Discrete Random Variables 11 2.2.3 Expectation 13 2.3 Random Processes 14 2.3.1 Markov Processes 16 2.3.2 Gaussian Processes 18 2.3.3 Gauss-Markov Processes 18 2.4 Summary of Random Processes 19 i ii Contents 3 Lossless Source Coding 20 3.1 Classification
    [Show full text]
  • Image Compression Using Discrete Cosine Transform Method
    Qusay Kanaan Kadhim, International Journal of Computer Science and Mobile Computing, Vol.5 Issue.9, September- 2016, pg. 186-192 Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320–088X IMPACT FACTOR: 5.258 IJCSMC, Vol. 5, Issue. 9, September 2016, pg.186 – 192 Image Compression Using Discrete Cosine Transform Method Qusay Kanaan Kadhim Al-Yarmook University College / Computer Science Department, Iraq [email protected] ABSTRACT: The processing of digital images took a wide importance in the knowledge field in the last decades ago due to the rapid development in the communication techniques and the need to find and develop methods assist in enhancing and exploiting the image information. The field of digital images compression becomes an important field of digital images processing fields due to the need to exploit the available storage space as much as possible and reduce the time required to transmit the image. Baseline JPEG Standard technique is used in compression of images with 8-bit color depth. Basically, this scheme consists of seven operations which are the sampling, the partitioning, the transform, the quantization, the entropy coding and Huffman coding. First, the sampling process is used to reduce the size of the image and the number bits required to represent it. Next, the partitioning process is applied to the image to get (8×8) image block. Then, the discrete cosine transform is used to transform the image block data from spatial domain to frequency domain to make the data easy to process.
    [Show full text]
  • Data Compression for Remote Laboratories
    Paper—Data Compression for Remote Laboratories Data Compression for Remote Laboratories https://doi.org/10.3991/ijim.v11i4.6743 Olawale B. Akinwale Obafemi Awolowo University, Ile-Ife, Nigeria [email protected] Lawrence O. Kehinde Obafemi Awolowo University, Ile-Ife, Nigeria [email protected] Abstract—Remote laboratories on mobile phones have been around for a few years now. This has greatly improved accessibility of these remote labs to students who cannot afford computers but have mobile phones. When money is a factor however (as is often the case with those who can’t afford a computer), the cost of use of these remote laboratories should be minimized. This work ad- dressed this issue of minimizing the cost of use of the remote lab by making use of data compression for data sent between the user and remote lab. Keywords—Data compression; Lempel-Ziv-Welch; remote laboratory; Web- Sockets 1 Introduction The mobile phone is now the de facto computational device of choice. They are quite powerful, connected devices which are almost always with us. This is so much so that about every important online service has a mobile version or at least a version compatible with mobile phones and tablets (for this paper we would adopt the phrase mobile device to represent mobile phones and tablets). This has caused online labora- tory developers to develop online laboratory solutions which are mobile device- friendly [1, 2, 3, 4, 5, 6, 7, 8]. One of the main benefits of online laboratories is the possibility of using fewer re- sources to serve more people i.e.
    [Show full text]
  • Randomized Lempel-Ziv Compression for Anti-Compression Side-Channel Attacks
    Randomized Lempel-Ziv Compression for Anti-Compression Side-Channel Attacks by Meng Yang A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Applied Science in Electrical and Computer Engineering Waterloo, Ontario, Canada, 2018 c Meng Yang 2018 I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public. ii Abstract Security experts confront new attacks on TLS/SSL every year. Ever since the compres- sion side-channel attacks CRIME and BREACH were presented during security conferences in 2012 and 2013, online users connecting to HTTP servers that run TLS version 1.2 are susceptible of being impersonated. We set up three Randomized Lempel-Ziv Models, which are built on Lempel-Ziv77, to confront this attack. Our three models change the determin- istic characteristic of the compression algorithm: each compression with the same input gives output of different lengths. We implemented SSL/TLS protocol and the Lempel- Ziv77 compression algorithm, and used them as a base for our simulations of compression side-channel attack. After performing the simulations, all three models successfully pre- vented the attack. However, we demonstrate that our randomized models can still be broken by a stronger version of compression side-channel attack that we created. But this latter attack has a greater time complexity and is easily detectable. Finally, from the results, we conclude that our models couldn't compress as well as Lempel-Ziv77, but they can be used against compression side-channel attacks.
    [Show full text]
  • Incompressibility and Lossless Data Compression: an Approach By
    Incompressibility and Lossless Data Compression: An Approach by Pattern Discovery Incompresibilidad y compresión de datos sin pérdidas: Un acercamiento con descubrimiento de patrones Oscar Herrera Alcántara and Francisco Javier Zaragoza Martínez Universidad Autónoma Metropolitana Unidad Azcapotzalco Departamento de Sistemas Av. San Pablo No. 180, Col. Reynosa Tamaulipas Del. Azcapotzalco, 02200, Mexico City, Mexico Tel. 53 18 95 32, Fax 53 94 45 34 [email protected], [email protected] Article received on July 14, 2008; accepted on April 03, 2009 Abstract We present a novel method for lossless data compression that aims to get a different performance to those proposed in the last decades to tackle the underlying volume of data of the Information and Multimedia Ages. These latter methods are called entropic or classic because they are based on the Classic Information Theory of Claude E. Shannon and include Huffman [8], Arithmetic [14], Lempel-Ziv [15], Burrows Wheeler (BWT) [4], Move To Front (MTF) [3] and Prediction by Partial Matching (PPM) [5] techniques. We review the Incompressibility Theorem and its relation with classic methods and our method based on discovering symbol patterns called metasymbols. Experimental results allow us to propose metasymbolic compression as a tool for multimedia compression, sequence analysis and unsupervised clustering. Keywords: Incompressibility, Data Compression, Information Theory, Pattern Discovery, Clustering. Resumen Presentamos un método novedoso para compresión de datos sin pérdidas que tiene por objetivo principal lograr un desempeño distinto a los propuestos en las últimas décadas para tratar con los volúmenes de datos propios de la Era de la Información y la Era Multimedia.
    [Show full text]
  • Conversion Principles and Circuits
    Interfacing Analog and Digital Worlds: Conversion Principles and Circuits Nimal Skandhakumar Faculty of Technology University of Sri Jayewardenepura 2019 1 From Analog to Digital to Analog 2 Analog Signal ● A continuous signal that contains 1. Continuous time-varying quantities, such as 2. Infinite range of values temperature or speed, with infinite 3. More exact values, but more possible values in between difficult to work with ● Can be used to measure changes in some physical phenomena, such as light, sound, pressure, or temperature. 3 Analog Signal ● Advantages: 1. Major advantages of the analog signal is infinite amount of data. 2. Density is much higher. 3. Easy processing. ● Disadvantages: 1. Unwanted noise in recording. 2. If we transmit data at long distance then unwanted disturbance is there. 3. Generation loss is also a big con of analog signals. 4 Digital Signal ● A type of signal that can take on a 1. Discrete set of discrete values (a quantized 2. Finite range of values signal) 3. Not as exact as analog, but easier ● Can represent a discrete set of to work with values using any discrete set of waveforms; and we can represent it like (0 or 1), (on or off) 5 Difference between analog and digital signals Analog Digital Signalling Continuous signal Discrete time signal Data Subjected to deterioration by noise during Can be noise-immune without deterioration transmissions transmission and write/read cycle. during transmission and write/read cycle. Bandwidth Analog signal processing can be done in There is no guarantee that digital signal real time and consumes less bandwidth. processing can be done in real time and consumes more bandwidth to carry out the same information.
    [Show full text]
  • 8 Video Compression Codecs
    ARTICLE VIDEO COMPRESSION CODECS: A SURVIVAL GUIDE Iain E. Richardson, Vcodex Ltd., UK 1. Introduction Not another video codec! Since the frst commercially viable video codec formats appeared in the early 1990s, we have seen the emergence of a plethora of compressed digital video formats, from MPEG-1 and MPEG-2 to recent codecs such as HEVC and VP9. Each new format offers certain advantages over its predecessors. However, the increasing variety of codec formats poses many questions for anyone involved in collecting, archiving and delivering digital video content, such as: ■■ Which codec format (if any) is best? ■■ What is a suitable acquisition protocol for digital video? ■■ Is it possible to ensure that early ‘born digital’ material will still be playable in future decades? ■■ What are the advantages and disadvantages of converting (transcoding) older formats into newer standards? ■■ What is the best way to deliver video content to end-users? In this article I explain how a video compression codec works and consider some of the practi- cal concerns relating to choosing and controlling a codec. I discuss the motivations behind the continued development of new codec standards and suggest practical measures to help deal with the questions listed above. 2. Codecs and compression 2.1 What is a video codec? ‘Codec’ is a contraction of ‘encoder and decoder’. A video encoder converts ‘raw’ or uncom- pressed digital video data into a compressed form which is suitable for storage or transmission. A video decoder extracts digital video data from a compressed fle, converting it into a display- able, uncompressed form.
    [Show full text]
  • Comparison of Entropy and Dictionary Based Text Compression in English, German, French, Italian, Czech, Hungarian, Finnish, and Croatian
    mathematics Article Comparison of Entropy and Dictionary Based Text Compression in English, German, French, Italian, Czech, Hungarian, Finnish, and Croatian Matea Ignatoski 1 , Jonatan Lerga 1,2,* , Ljubiša Stankovi´c 3 and Miloš Dakovi´c 3 1 Department of Computer Engineering, Faculty of Engineering, University of Rijeka, Vukovarska 58, HR-51000 Rijeka, Croatia; [email protected] 2 Center for Artificial Intelligence and Cybersecurity, University of Rijeka, R. Matejcic 2, HR-51000 Rijeka, Croatia 3 Faculty of Electrical Engineering, University of Montenegro, Džordža Vašingtona bb, 81000 Podgorica, Montenegro; [email protected] (L.S.); [email protected] (M.D.) * Correspondence: [email protected]; Tel.: +385-51-651-583 Received: 3 June 2020; Accepted: 17 June 2020; Published: 1 July 2020 Abstract: The rapid growth in the amount of data in the digital world leads to the need for data compression, and so forth, reducing the number of bits needed to represent a text file, an image, audio, or video content. Compressing data saves storage capacity and speeds up data transmission. In this paper, we focus on the text compression and provide a comparison of algorithms (in particular, entropy-based arithmetic and dictionary-based Lempel–Ziv–Welch (LZW) methods) for text compression in different languages (Croatian, Finnish, Hungarian, Czech, Italian, French, German, and English). The main goal is to answer a question: ”How does the language of a text affect the compression ratio?” The results indicated that the compression ratio is affected by the size of the language alphabet, and size or type of the text. For example, The European Green Deal was compressed by 75.79%, 76.17%, 77.33%, 76.84%, 73.25%, 74.63%, 75.14%, and 74.51% using the LZW algorithm, and by 72.54%, 71.47%, 72.87%, 73.43%, 69.62%, 69.94%, 72.42% and 72% using the arithmetic algorithm for the English, German, French, Italian, Czech, Hungarian, Finnish, and Croatian versions, respectively.
    [Show full text]
  • A Machine Learning Perspective on Predictive Coding with PAQ
    A Machine Learning Perspective on Predictive Coding with PAQ Byron Knoll & Nando de Freitas University of British Columbia Vancouver, Canada fknoll,[email protected] August 17, 2011 Abstract PAQ8 is an open source lossless data compression algorithm that currently achieves the best compression rates on many benchmarks. This report presents a detailed description of PAQ8 from a statistical machine learning perspective. It shows that it is possible to understand some of the modules of PAQ8 and use this understanding to improve the method. However, intuitive statistical explanations of the behavior of other modules remain elusive. We hope the description in this report will be a starting point for discussions that will increase our understanding, lead to improvements to PAQ8, and facilitate a transfer of knowledge from PAQ8 to other machine learning methods, such a recurrent neural networks and stochastic memoizers. Finally, the report presents a broad range of new applications of PAQ to machine learning tasks including language modeling and adaptive text prediction, adaptive game playing, classification, and compression using features from the field of deep learning. 1 Introduction Detecting temporal patterns and predicting into the future is a fundamental problem in machine learning. It has gained great interest recently in the areas of nonparametric Bayesian statistics (Wood et al., 2009) and deep learning (Sutskever et al., 2011), with applications to several domains including language modeling and unsupervised learning of audio and video sequences. Some re- searchers have argued that sequence prediction is key to understanding human intelligence (Hawkins and Blakeslee, 2005). The close connections between sequence prediction and data compression are perhaps under- arXiv:1108.3298v1 [cs.LG] 16 Aug 2011 appreciated within the machine learning community.
    [Show full text]