Compression Method LZFSE Student: Martin Hron Supervisor: Ing
Total Page:16
File Type:pdf, Size:1020Kb
ASSIGNMENT OF BACHELOR’S THESIS Title: Compression method LZFSE Student: Martin Hron Supervisor: Ing. Jan Baier Study Programme: Informatics Study Branch: Computer Science Department: Department of Theoretical Computer Science Validity: Until the end of summer semester 2018/19 Instructions Introduce yourself with LZ family compression methods. Analyze compression method LZFSE [1] and explain its main enhancements. Implement this method and/or these enhancements into the ExCom library [2], perform evaluation tests using the standard test suite and compare performance with other implemented LZ methods. References [1] https://github.com/lzfse/lzfse [2] http://www.stringology.org/projects/ExCom/ doc. Ing. Jan Janoušek, Ph.D. doc. RNDr. Ing. Marcel Jiřina, Ph.D. Head of Department Dean Prague January 17, 2018 Bachelor’s thesis Compression method LZFSE Martin Hron Department of Theoretical Computer Science Supervisor: Ing. Jan Baier May 12, 2018 Acknowledgements I would like to thank my supervisor, Ing. Jan Baier, for guiding this thesis and for all his valuable advice. Declaration I hereby declare that the presented thesis is my own work and that I have cited all sources of information in accordance with the Guideline for adhering to ethical principles when elaborating an academic final thesis. I acknowledge that my thesis is subject to the rights and obligations stip- ulated by the Act No. 121/2000 Coll., the Copyright Act, as amended. In accordance with Article 46(6) of the Act, I hereby grant a nonexclusive au- thorization (license) to utilize this thesis, including any and all computer pro- grams incorporated therein or attached thereto and all corresponding docu- mentation (hereinafter collectively referred to as the “Work”), to any and all persons that wish to utilize the Work. Such persons are entitled to use the Work in any way (including for-profit purposes) that does not detract from its value. This authorization is not limited in terms of time, location and quan- tity. However, all persons that makes use of the above license shall be obliged to grant a license at least in the same scope as defined above with respect to each and every work that is created (wholly or in part) based on the Work, by modifying the Work, by combining the Work with another work, by including the Work in a collection of works or by adapting the Work (including trans- lation), and at the same time make available the source code of such work at least in a way and scope that are comparable to the way and scope in which the source code of the Work is made available. In Prague on May 12, 2018 . Czech Technical University in Prague Faculty of Information Technology c 2018 Martin Hron. All rights reserved. This thesis is school work as defined by Copyright Act of the Czech Republic. It has been submitted at Czech Technical University in Prague, Faculty of Information Technology. The thesis is protected by the Copyright Act and its usage without author’s permission is prohibited (with exceptions defined by the Copyright Act). Citation of this thesis Hron, Martin. Compression method LZFSE. Bachelor’s thesis. Czech Techni- cal University in Prague, Faculty of Information Technology, 2018. Abstract This thesis focuses on LZFSE compression method, which combines a dictio- nary compression scheme with a technique based on ANS (asymmetric nu- meral systems). It describes the principles on which the method works and analyses the reference implementation of LZFSE by Eric Bainville. As part of this thesis, the LZFSE method is added as a new module into the ExCom library and compared with other implemented compression methods using the files from the Prague Corpus. The impacts that different settings of adjustable LZFSE parameters have are also examined. Keywords LZFSE, ExCom library, data compression, dictionary compres- sion methods, lossless compression, finite state entropy, asymmetric numeral systems vii Abstrakt Tato pr´acese zab´yv´akompresn´ımetodou LZFSE, kter´akombinuje slovn´ıkovou kompresi s technikou zaloˇzenouna ANS (asymmetric numeral systems). Pr´ace popisuje principy, na kter´ych tato metoda funguje, a analyzuje referenˇcn´ıim- plementaci, jej´ıˇzautorem je Eric Bainville. V r´amcit´etopr´aceje metoda LZFSE pˇrid´anajako nov´ymodul do knihovny ExCom a porovn´anas ostatn´ımi implementovan´ymimetodami na datech Praˇzsk´ehoKorpusu. D´aleje prozkoum´anvliv nastaviteln´ych parametr˚umetody LZFSE. Kl´ıˇcov´a slova LZFSE, knihovna ExCom, komprese dat, slovn´ıkov´ekom- presn´ımetody, bezeztr´atov´akomprese, finite state entropy, asymmetric nu- meral systems viii Contents Introduction 1 1 Data compression 3 1.1 Basic data compression concepts . 3 1.2 Information entropy and redundancy . 3 1.3 Classification of compression methods . 4 1.4 Measures of performance . 6 1.5 ExCom library . 7 1.6 Hash function and hash table . 7 2 LZ family algorithms 9 2.1 LZ77 . 9 2.2 LZ78 . 11 3 Asymmetric numeral systems 15 3.1 Entropy coding . 15 3.2 Asymmetric numeral systems . 16 3.3 Finite state entropy . 19 4 LZFSE 21 4.1 Compression . 21 4.2 Decompression . 29 5 Implementation 33 5.1 Implementation of LZFSE module . 33 6 Benchmarks 37 6.1 Methodology . 37 6.2 Testing platform specifications . 37 6.3 Results . 38 ix Conclusion 51 Bibliography 53 A Acronyms 55 B Reference implementation 57 B.1 Source files . 57 B.2 The API functions . 59 C Prague Corpus files 61 D Building the ExCom library 63 E Scripts used for benchmarking 65 F Detailed benchmark results 69 F.1 Tables . 70 F.2 Additional graphs . 76 G Contents of enclosed CD 79 x List of Figures 2.1 LZ77 sliding window . 9 2.2 Example of LZ77 compression . 11 2.3 Example of LZ77 decompression . 12 3.1 Standard numeral system and ANS . 18 4.1 LZFSE history table . 24 6.1 Comparison of compression time of LZFSE, LZ78 and LZW . 39 6.2 Comparison of decompression time of LZFSE, LZ78 and LZW . 39 6.3 Comparison of compression ratio of dictionary methods . 41 6.4 Comparison of compression time of LZFSE and non-dictionary methods . 43 6.5 Comparison of decompression time of LZFSE and non-dictionary methods . 43 6.6 Comparison of compression ratio of LZFSE and non-dictionary methods . 44 6.7 Comparison of compression time and compression ratio of all tested methods on flower file . 46 6.8 Comparison of decompression time and compression ratio of all tested methods on flower file . 46 6.9 Impact of good match parameter on compression time and ratio . 47 6.10 Impact of hash bits parameter on compression time and ratio . 48 F.1 Comparison of compression time of LZFSE, LZ78 and LZW on all files . 76 F.2 Comparison of decompression time of LZFSE, LZ78 and LZW on all files . 77 F.3 Comparison of compression ratio of LZFSE, LZ78 and LZMW on all files . 78 xi List of Tables C.1 The Prague Corpus files [19] . 61 F.1 Compression time of dictionary methods . 70 F.2 Decompression time of dictionary methods . 71 F.3 Compression ratio of dictionary methods . 72 F.4 Compression time of non-dictionary methods . 73 F.5 Decompression time of non-dictionary methods . 74 F.6 Compression ratio of non-dictionary methods . 75 List of Algorithms 1 LZ77 compression . 10 2 LZ77 decompression . 11 3 LZ78 compression [12] . 13 4 LZ78 decompression . 13 5 FSE symbol encoding . 29 6 LZFSE match decoding step . 31 7 Decoding of a literal using FSE . 32 8 Decoding of one L, M or D value using FSE . 32 xiii Introduction Motivation The amount of computer data produced by various information systems and applications is huge and grows every day. Therefore, vast quantity of data needs to be transmitted over the network or stored on some medium. To speed up data transmission and conserve storage space it is useful to reduce size of computer data while still maintaining the information that is contained in it. This is the purpose of data compression. Compression achieves data size reduction by decreasing redundancy and changing information representation to one that is more efficient. Data compression is used very commonly even by non-professional com- puter users. Often we want to archive some files, transfer them to another computer or share them with other users. In those common cases, compres- sion ratio (i.e. how much space is saved) and speed are typically more or less equally important, so we generally need to make a compromise between speed and compression quality. Also for compressing general files, a lossless method is usually required so that no vital information is lost. For example, ZIP is frequently used archive file format that is probably known to most computer users. ZIP archives are usually compressed using a lossless algorithm called DEFLATE, which is designed to keep good balance between compression ratio and speed. LZFSE is a new open source data compression algorithm developed by Apple Inc. It is designed for similar purpose as DEFLATE algorithm but is claimed to be significantly faster (up to three times) and to use less resources while preserving comparable compression ratio. LZFSE combines dictionary compression with an implementation of asymmetric numeral systems. Asym- metric numeral systems is a recent entropy coding method based on the work of Jaros law Duda, which aims to end the tradeoff between compression ratio and speed. 1 Introduction Various compression methods are collected in the ExCom library. ExCom is a library developed as part of Filip Simek’sˇ thesis on CTU in 2009, which supports benchmarking and experimenting with compression methods. ExCom contains many dictionary algorithms. However, it does not yet contain the LZFSE method, nor any other method that would use asymmetric numeral systems. Integrating LZFSE into the ExCom library will allow to run, experi- ment with this method and perform benchmarks on it to confirm its enhance- ments.