Data Compression

Total Page:16

File Type:pdf, Size:1020Kb

Data Compression Data Compression Compression reduces the size of a file: ! To save space when storing it. ! To save time when transmitting it. Data Compression ! Most files have lots of redundancy. Who needs compression? ! Moore's law: # transistors on a chip doubles every 18-24 months. ! Parkinson's law: data expands to fill space available. • introduction ! Text, images, sound, video, … • Huffman codes • an application • entropy All of the books in the world contain no more information than is • LZW codes broadcast as video in a single large American city in a single year. Not all bits have equal value. -Carl Sagan References: Algorithms 2nd edition, Chapter 22 Basic concepts ancient (1950s), best technology recently developed. Intro to Algs and Data Structs, Section 6.5 Copyright © 2007 by Robert Sedgewick and Kevin Wayne. 3 Applications of Data Compression Generic file compression. ! Files: GZIP, BZIP, BOA. ! Archivers: PKZIP. ! File systems: NTFS. Multimedia. introduction ! Images: GIF, JPEG. Huffman codes ! Sound: MP3. an application ! Video: MPEG, DivX™, HDTV. entropy Communication. LZW codes ! ITU-T T4 Group 3 Fax. ! V.42bis modem. Databases. Google. 2 4 Encoding and Decoding Fixed length encoding Message. Binary data M we want to compress. uses fewer bits (you hope) ! Use same number of bits for each symbol. Encode. Generate a "compressed" representation C(M). ! k-bit code supports 2k different symbols Decode. Reconstruct original message or some approximation M'. Ex. 7-bit ASCII char decimal code NUL 0 0000000 M Encoder C(M) Decoder M' ... ... a 97 1100001 b 98 1100010 c 99 1100011 Compression ratio. Bits in C(M) / bits in M. d 100 1100100 ... ... this lecture: Lossless. M = M', 50-75% or lower. ~ 126 1111110 special code for this lecture Ex. Natural language, source code, executables. 127 1111111 end-of-message a b r a c a d a b r a ! Lossy. M ! M', 10% or lower. 1100001 1100010 1110010 1100001 1100011 1100001 1100100 1100001 1100010 1110010 1100001 1111111 Ex. Images, sound, video. 12 symbols " 7 bits per symbol = 84 bits in code 5 7 Ancient Ideas Fixed length encoding Ancient ideas. ! Use same number of bits for each symbol. ! Braille. ! k-bit code supports 2k different symbols ! Morse code. ! Natural languages. Ex. 3-bit custom code ! Mathematical notation. char code ! Decimal number system. a 000 b 001 c 010 d 011 r 100 12 symbols " 3 bits per - ! 111 36 bits in code "Poetry is the art of lossy data compression." a b r a c a d a b r a ! 000 001 100 000 010 000 011 000 001 100 000 111 Important detail: decoder needs to know the code! 6 8 Fixed length encoding: general scheme Variable-length encoding ! count number of different symbols. Use different number of bits to encode different characters. ! $lg M% bits suffice to support M different symbols Q. How do we avoid ambiguity? Ex. genomic sequences S • • • prefix of V char code A1. Append special stop symbol to each codeword. ! 4 different codons A2. Ensure that no encoding is a prefix of another. E • prefix of I, S a 00 I •• prefix of S ! 2 bits suffice c 01 char code V •••- t 10 Ex. custom prefix-free code a 0 g 11 2N bits to encode b 111 genome with N codons c 1010 a c t a c a g a t g a d 100 00 01 10 00 01 00 11 00 10 11 00 r 110 ! 1011 28 bits in code ! Amazing but true: initial databases in 1990s did not use such a code! a b r a c a d a b r a ! 0 1 1 1 1 1 0 0 1 0 1 0 0 1 0 0 0 1 1 1 1 1 0 0 1 0 1 1 Decoder needs to know the code ! can amortize over large number of files with the same code Note 1: fixed-length codes are prefix-free ! in general, can encode an N-char file with N $lg M% + 16 $lg M% bits Note 2: can amortize cost of including the code over similar messages 9 11 Variable Length Encoding Prefix-Free Code: Encoding and Decoding Use different number of bits to encode different characters. How to represent? Use a binary trie. ! Symbols are stored in leaves. 0 1 Ex. Morse code. ! Encoding is path to leaf. a 0 1 Issue: ambiguity. Encoding. 0 1 0 1 ! d r b • • • # # # • • • Method 1: start at leaf; follow path up 0 1 to the root, and print bits in reverse order. c ! SOS ? ! Method 2: create ST of symbol-encoding pairs. IAMIE ? EEWNI ? Decoding. char encoding V7O ? ! Start at root of tree. a 0 ! Go left if bit is 0; go right if 1. b 111 c 1010 ! If leaf node, print symbol and return to root. d 100 r 110 ! 1011 10 12 How to transmit the trie Prefix-Free Decoding Implementation How to transmit the trie? ! Send preorder traversal of trie. 0 1 – we use * as sentinel for internal nodes public void decode() { – what if there is no sentinel? a 0 1 int N = StdIn.readInt(); ! Send number of characters to decode. for (int i = 0; i < N; i++) { use bits in real applications ! Send bits (packed 8 to the byte). 0 1 0 1 Node x = root; instead of chars while (x.isInternal()) { d r b 0 1 char bit = StdIn.readChar(); c ! if (bit == '0') x = x.left; preorder traversal *a**d*c!*rb else if (bit == '1') x = x.right; # chars to decode 12 } the message bits 0111110010100100011111001011 System.out.print(x.ch); } } ! If message is long, overhead of transmitting trie is small. char encoding 12 a 0 0111110010100100011111001011 b 111 c 1010 d 100 a b r a c a d a b r a ! r 110 0 1 1 1 1 1 0 0 1 0 1 0 0 1 0 0 0 1 1 1 1 1 0 0 1 0 1 1 ! 1011 13 15 Prefix-Free Decoding Implementation Introduction to compression: summary Variable-length codes can provide better compression than fixed-length public class HuffmanDecoder a b r a c a d a b r a ! { 1100001 1100010 1110010 1100001 1100011 1100001 1100100 1100001 1100010 1110010 1100001 1111111 private Node root = new Node(); private class Node a b r a c a d a b r a ! { 000 001 100 000 010 000 011 000 001 100 000 111 char ch; Node left, right; Node() a b r a c a d a b r a ! { build tree from 0 1 1 1 1 1 0 0 1 0 1 0 0 1 0 0 0 1 1 1 1 1 0 0 1 0 1 1 ch = StdIn.readChar(); preorder traversal if (ch == '*') { *a**d*c!*rb left = new Node(); right = new Node(); Every trie defines a variable-length code } } boolean isInternal() { } } Q. What is the best variable length code for a given message? 14 16 Huffman coding example a b r a c a d a b r a ! 5 7 a 1 1 1 2 2 5 3 4 c ! d r b a 2 2 1 2 1 2 2 2 5 r b d d r b a 1 1 1 1 c ! introduction c ! Huffman codes 12 an application 2 2 3 5 r b a entropy 1 2 5 7 d a 1 1 LZW 3 4 c ! 2 2 1 2 r b d 1 1 3 4 5 c ! a 2 2 1 2 r b d 1 1 17 c ! 19 Huffman Coding Huffman trie construction code Q. What is the best variable length code for a given message? A. Huffman code. [David Huffman, 1950] int[] freq = new int[128]; for (int i = 0; i < input.length(); i++) tabulate { freq[input.charAt(i)]++; } frequencies To compute Huffman code: MinPQ<Node> pq = new MinPQ<Node>(); ! for (int i = 0; i < 128; i++) count frequency ps for each symbol s in message. if (freq[i] > 0) initialize ! start with one node corresponding to each symbol s (with weight ps). pq.insert(new Node((char) i, freq[i], null, null)); PQ ! repeat until single trie formed: while (pq.size() > 1) – select two tries with min weight p1 and p2 { Node x = pq.delMin(); – merge into single trie with weight p1 + p2 Node y = pq.delMin(); Node parent = new Node('*', x.freq + y.freq, x, y); merge pq.insert(parent); trees } internal node Applications. JPEG, MP3, MPEG, PKZIP, GZIP, … root = pq.delMin(); total two subtrees marker frequency David Huffman 18 20 Huffman encoding An application: compress a bitmap Typical black-and-white-scanned image Theorem. [Huffman] Huffman coding is an optimal prefix-free code. 300 pixels/inch no prefix-free code uses fewer bits Implementation. 8.5 by 11 inches ! pass 1: tabulate symbol frequencies and build trie ! pass 2: encode file by traversing trie or lookup table. 300*8.5*300*11 = 8.415 million bits Running time. Use binary heap & O(M + N log N). Bits are mostly white output distinct bits symbols Typical amount of text on a page: 40 lines * 75 chars per line = 3000 chars Can we do better? [stay tuned] 21 23 Natural encoding of a bitmap one bit per pixel 000000000000000000000000000011111111111111000000000 000000000000000000000000001111111111111111110000000 000000000000000000000001111111111111111111111110000 000000000000000000000011111111111111111111111111000 000000000000000000001111111111111111111111111111110 introduction 000000000000000000011111110000000000000000001111111 000000000000000000011111000000000000000000000011111 Huffman codes 000000000000000000011100000000000000000000000000111 000000000000000000011100000000000000000000000000111 an application 000000000000000000011100000000000000000000000000111 000000000000000000011100000000000000000000000000111 entropy 000000000000000000001111000000000000000000000001110 000000000000000000000011100000000000000000000111000 011111111111111111111111111111111111111111111111111 LZW 011111111111111111111111111111111111111111111111111 011111111111111111111111111111111111111111111111111 011111111111111111111111111111111111111111111111111 011111111111111111111111111111111111111111111111111 011000000000000000000000000000000000000000000000011 19-by-51 raster of letter 'q' lying on its side 22 24 Run-length encoding of a bitmap Run-length encoding and Huffman codes in the wild to encode number of bits per line natural encoding.
Recommended publications
  • Data Compression: Dictionary-Based Coding 2 / 37 Dictionary-Based Coding Dictionary-Based Coding
    Dictionary-based Coding already coded not yet coded search buffer look-ahead buffer cursor (N symbols) (L symbols) We know the past but cannot control it. We control the future but... Last Lecture Last Lecture: Predictive Lossless Coding Predictive Lossless Coding Simple and effective way to exploit dependencies between neighboring symbols / samples Optimal predictor: Conditional mean (requires storage of large tables) Affine and Linear Prediction Simple structure, low-complex implementation possible Optimal prediction parameters are given by solution of Yule-Walker equations Works very well for real signals (e.g., audio, images, ...) Efficient Lossless Coding for Real-World Signals Affine/linear prediction (often: block-adaptive choice of prediction parameters) Entropy coding of prediction errors (e.g., arithmetic coding) Using marginal pmf often already yields good results Can be improved by using conditional pmfs (with simple conditions) Heiko Schwarz (Freie Universität Berlin) — Data Compression: Dictionary-based Coding 2 / 37 Dictionary-based Coding Dictionary-Based Coding Coding of Text Files Very high amount of dependencies Affine prediction does not work (requires linear dependencies) Higher-order conditional coding should work well, but is way to complex (memory) Alternative: Do not code single characters, but words or phrases Example: English Texts Oxford English Dictionary lists less than 230 000 words (including obsolete words) On average, a word contains about 6 characters Average codeword length per character would be limited by 1
    [Show full text]
  • Package 'Brotli'
    Package ‘brotli’ May 13, 2018 Type Package Title A Compression Format Optimized for the Web Version 1.2 Description A lossless compressed data format that uses a combination of the LZ77 algorithm and Huffman coding. Brotli is similar in speed to deflate (gzip) but offers more dense compression. License MIT + file LICENSE URL https://tools.ietf.org/html/rfc7932 (spec) https://github.com/google/brotli#readme (upstream) http://github.com/jeroen/brotli#read (devel) BugReports http://github.com/jeroen/brotli/issues VignetteBuilder knitr, R.rsp Suggests spelling, knitr, R.rsp, microbenchmark, rmarkdown, ggplot2 RoxygenNote 6.0.1 Language en-US NeedsCompilation yes Author Jeroen Ooms [aut, cre] (<https://orcid.org/0000-0002-4035-0289>), Google, Inc [aut, cph] (Brotli C++ library) Maintainer Jeroen Ooms <[email protected]> Repository CRAN Date/Publication 2018-05-13 20:31:43 UTC R topics documented: brotli . .2 Index 4 1 2 brotli brotli Brotli Compression Description Brotli is a compression algorithm optimized for the web, in particular small text documents. Usage brotli_compress(buf, quality = 11, window = 22) brotli_decompress(buf) Arguments buf raw vector with data to compress/decompress quality value between 0 and 11 window log of window size Details Brotli decompression is at least as fast as for gzip while significantly improving the compression ratio. The price we pay is that compression is much slower than gzip. Brotli is therefore most effective for serving static content such as fonts and html pages. For binary (non-text) data, the compression ratio of Brotli usually does not beat bz2 or xz (lzma), however decompression for these algorithms is too slow for browsers in e.g.
    [Show full text]
  • The Basic Principles of Data Compression
    The Basic Principles of Data Compression Author: Conrad Chung, 2BrightSparks Introduction Internet users who download or upload files from/to the web, or use email to send or receive attachments will most likely have encountered files in compressed format. In this topic we will cover how compression works, the advantages and disadvantages of compression, as well as types of compression. What is Compression? Compression is the process of encoding data more efficiently to achieve a reduction in file size. One type of compression available is referred to as lossless compression. This means the compressed file will be restored exactly to its original state with no loss of data during the decompression process. This is essential to data compression as the file would be corrupted and unusable should data be lost. Another compression category which will not be covered in this article is “lossy” compression often used in multimedia files for music and images and where data is discarded. Lossless compression algorithms use statistic modeling techniques to reduce repetitive information in a file. Some of the methods may include removal of spacing characters, representing a string of repeated characters with a single character or replacing recurring characters with smaller bit sequences. Advantages/Disadvantages of Compression Compression of files offer many advantages. When compressed, the quantity of bits used to store the information is reduced. Files that are smaller in size will result in shorter transmission times when they are transferred on the Internet. Compressed files also take up less storage space. File compression can zip up several small files into a single file for more convenient email transmission.
    [Show full text]
  • The Deep Learning Solutions on Lossless Compression Methods for Alleviating Data Load on Iot Nodes in Smart Cities
    sensors Article The Deep Learning Solutions on Lossless Compression Methods for Alleviating Data Load on IoT Nodes in Smart Cities Ammar Nasif *, Zulaiha Ali Othman and Nor Samsiah Sani Center for Artificial Intelligence Technology (CAIT), Faculty of Information Science & Technology, University Kebangsaan Malaysia, Bangi 43600, Malaysia; [email protected] (Z.A.O.); [email protected] (N.S.S.) * Correspondence: [email protected] Abstract: Networking is crucial for smart city projects nowadays, as it offers an environment where people and things are connected. This paper presents a chronology of factors on the development of smart cities, including IoT technologies as network infrastructure. Increasing IoT nodes leads to increasing data flow, which is a potential source of failure for IoT networks. The biggest challenge of IoT networks is that the IoT may have insufficient memory to handle all transaction data within the IoT network. We aim in this paper to propose a potential compression method for reducing IoT network data traffic. Therefore, we investigate various lossless compression algorithms, such as entropy or dictionary-based algorithms, and general compression methods to determine which algorithm or method adheres to the IoT specifications. Furthermore, this study conducts compression experiments using entropy (Huffman, Adaptive Huffman) and Dictionary (LZ77, LZ78) as well as five different types of datasets of the IoT data traffic. Though the above algorithms can alleviate the IoT data traffic, adaptive Huffman gave the best compression algorithm. Therefore, in this paper, Citation: Nasif, A.; Othman, Z.A.; we aim to propose a conceptual compression method for IoT data traffic by improving an adaptive Sani, N.S.
    [Show full text]
  • Context-Aware Encoding & Delivery in The
    Context-Aware Encoding & Delivery in the Web ICWE 2020 Benjamin Wollmer, Wolfram Wingerath, Norbert Ritter Universität Hamburg 9 - 12 June, 2020 Business Impact of Page Speed Business Uplift Speed Speed Downlift Uplift Business Downlift Felix Gessert: Mobile Site Speed and the Impact on E-Commerce, CodeTalks 2019 So Far On Compression… GZip SDCH Deflate Delta Brotli Encoding GZIP/Deflate – The De Facto Standard in the Web Encoding Size None 200 kB Gzip ~36 kB This example text is used to show how LZ77 finds repeating elements in the example[70;14] text ~81.9% saved data J. Alakuijala, E. Kliuchnikov, Z. Szabadka, L. Vandevenne: Comparison of Brotli, Deflate, Zopfli, LZMA, LZHAM and Bzip2 Compression Algorithms, 2015 Delta Encoding – Updating Stale Content Encoding Size None 200 kB Gzip ~36 kB Delta Encoding ~34 kB 83% saved data J. C. Mogul, F. Douglis, A. Feldmann, B. Krishnamurthy: Potential Benefits of Delta Encoding and Data Compression for HTTP, 1997 SDCH – Reusing Dictionaries Encoding Size This is an example None 200 kB Gzip ~36 kB Another example Delta Encoding ~34 kB SDCH ~7 kB Up to 81% better results (compared to gzip) O. Shapira: Shared Dictionary Compression for HTTP at LinkedIn, 2015 Brotli – SDCH for Everyone Encoding Size None 200 kB Gzip ~36 kB Delta Encoding ~34 kB SDCH ~7 kB Brotli ~29 kB ~85.6% saved data J. Alakuijala, E. Kliuchnikov, Z. Szabadka, L. Vandevenne: Comparison of Brotli, Deflate, Zopfli, LZMA, LZHAM and Bzip2 Compression Algorithms, 2015 So Far On Compression… Theory vs. Reality GZip (~80%) SDCH Deflate
    [Show full text]
  • Lempel-Ziv Sliding Window Update with Suffix Arrays
    i-ETC: ISEL Academic Journal of Electronics, Telecommunications and Computers CETC2011 Issue, Vol. 2, n. 1 (2013) ID-4 LEMPEL-ZIV SLIDING WINDOW UPDATE WITH SUFFIX ARRAYS Artur Ferreira1,3,4 Arlindo Oliveira2,4 Mario´ Figueiredo3,4 1Instituto Superior de Engenharia de Lisboa (ISEL) 2Instituto de Engenharia de Sistemas e Computadores – Investigac¸ao˜ e Desenvolvimento (INESC-ID) 3Instituto de Telecomunicac¸oes˜ (IT) 4Instituto Superior Tecnico´ (IST), Lisboa, PORTUGAL [email protected] [email protected] [email protected] Keywords: Lempel-Ziv compression, suffix arrays, sliding window update, substring search. Abstract: The sliding window dictionary-based algorithms of the Lempel-Ziv (LZ) 77 family are widely used for uni- versal lossless data compression. The encoding component of these algorithms performs repeated substring search. Data structures, such as hash tables, binary search trees, and suffix trees have been used to speedup these searches, at the expense of memory usage. Previous work has shown how suffix arrays (SA) can be used for dictionary representation and LZ77 decomposition. In this paper, we improve over that work by proposing a new efficient algorithm to update the sliding window each time a token is produced at the output. The pro- posed algorithm toggles between two SA on consecutive tokens. The resulting SA-based encoder requires less memory than the conventional tree-based encoders. In comparing our SA-based technique against tree-based encoders, on a large set of benchmark files, we find that, in some compression settings, our encoder is also faster than tree-based encoders. 1 INTRODUCTION dictionaries [4] and to find repeating sub-sequences for data deduplication [1], among other applications.
    [Show full text]
  • Answers to Exercises
    Answers to Exercises A bird does not sing because he has an answer, he sings because he has a song. —Chinese Proverb Intro.1: abstemious, abstentious, adventitious, annelidous, arsenious, arterious, face- tious, sacrilegious. Intro.2: When a software house has a popular product they tend to come up with new versions. A user can update an old version to a new one, and the update usually comes as a compressed file on a floppy disk. Over time the updates get bigger and, at a certain point, an update may not fit on a single floppy. This is why good compression is important in the case of software updates. The time it takes to compress and decompress the update is unimportant since these operations are typically done just once. Recently, software makers have taken to providing updates over the Internet, but even in such cases it is important to have small files because of the download times involved. 1.1: (1) ask a question, (2) absolutely necessary, (3) advance warning, (4) boiling hot, (5) climb up, (6) close scrutiny, (7) exactly the same, (8) free gift, (9) hot water heater, (10) my personal opinion, (11) newborn baby, (12) postponed until later, (13) unexpected surprise, (14) unsolved mysteries. 1.2: A reasonable way to use them is to code the five most-common strings in the text. Because irreversible text compression is a special-purpose method, the user may know what strings are common in any particular text to be compressed. The user may specify five such strings to the encoder, and they should also be written at the start of the output stream, for the decoder’s use.
    [Show full text]
  • Data Compression in Solid State Storage
    Data Compression in Solid State Storage John Fryar [email protected] Santa Clara, CA August 2013 1 Acknowledgements This presentation would not have been possible without the counsel, hard work and graciousness of the following individuals and/or organizations: Raymond Savarda Sandgate Technologies Santa Clara, CA August 2013 2 Disclaimers The opinions expressed herein are those of the author and do not necessarily represent those of any other organization or individual unless specifically cited. A thorough attempt to acknowledge all sources has been made. That said, we’re all human… Santa Clara, CA August 2013 3 Learning Objectives At the conclusion of this tutorial the audience will have been exposed to: • The different types of Data Compression • Common Data Compression Algorithms • The Deflate/Inflate (GZIP/GUNZIP) algorithms in detail • Implementation Options (Software/Hardware) • Impacts of design parameters in Performance • SSD benefits and challenges • Resources for Further Study Santa Clara, CA August 2013 4 Agenda • Background, Definitions, & Context • Data Compression Overview • Data Compression Algorithm Survey • Deflate/Inflate (GZIP/GUNZIP) in depth • Software Implementations • HW Implementations • Tradeoffs & Advanced Topics • SSD Benefits and Challenges • Conclusions Santa Clara, CA August 2013 5 Definitions Item Description Comments Open A system which will compress Must strictly adhere to standards System data for use by other entities. on compress / decompress I.E. the compressed data will algorithms exit the system Interoperability among vendors mandated for Open Systems Closed A system which utilizes Can support a limited, optimized System compressed data internally but subset of standard. does not expose compressed Also allows custom algorithms data to the outside world No Interoperability req’d.
    [Show full text]
  • Video/Audio Compression
    5.9. Video Compression (1) Basics: video := time sequence of single images frequent point of view: video compression = image compression with a temporal component assumption: successive images of a video sequence are similar, e.g. directly adjacent images contain almost the same information that has to be carried only once wanted: strategies to exploit temporal redundancy and irrelevance! motion prediction/estimation, motion compensation, block matching intraframe and interframe coding video compression algorithms and standards are distinguished according to the peculiar conditions, e.g. videoconferencing, applications such as broadcast video Prof. Dr. Paul Müller, University of Kaiserslautern 1 Video Compression (2) Simple approach: M-JPEG compression of a time sequence of images based on the JPEG standard unfortunately, not standardized! makes use of the baseline system of JPEG, intraframe coding, color subsampling 4:1:1, 6 bit quantizer temporal redundancy is not used! applicable for compression ratios from 5:1 to 20:1, higher rates call for interframe coding possibility to synchronize audio data is not provided direct access to full images at every time position application in proprietary consumer video cutting software and hardware solutions Prof. Dr. Paul Müller, University of Kaiserslautern 2 Video Compression (3) Motion prediction and compensation: kinds of motion: • change of color values / change of position of picture elements • translation, rotation, scaling, deformation of objects • change of lights and shadows • translation, rotation, zoom of camera kinds of motion prediction techniques: • prediction of pixels or ranges of pixels neighbouring but no semantic relations • model based prediction grid model with model parameters describing the motion, e.g. head- shoulder-arrangements • object or region based prediction extraction of (video) objects, processing of geometric and texture information, e.g.
    [Show full text]
  • An Inter-Data Encoding Technique That Exploits Synchronized Data for Network Applications
    1 An Inter-data Encoding Technique that Exploits Synchronized Data for Network Applications Wooseung Nam, Student Member, IEEE, Joohyun Lee, Member, IEEE, Ness B. Shroff, Fellow, IEEE, and Kyunghan Lee, Member, IEEE Abstract—In a variety of network applications, there exists significant amount of shared data between two end hosts. Examples include data synchronization services that replicate data from one node to another. Given that shared data may have high correlation with new data to transmit, we question how such shared data can be best utilized to improve the efficiency of data transmission. To answer this, we develop an inter-data encoding technique, SyncCoding, that effectively replaces bit sequences of the data to be transmitted with the pointers to their matching bit sequences in the shared data so called references. By doing so, SyncCoding can reduce data traffic, speed up data transmission, and save energy consumption for transmission. Our evaluations of SyncCoding implemented in Linux show that it outperforms existing popular encoding techniques, Brotli, LZMA, Deflate, and Deduplication. The gains of SyncCoding over those techniques in the perspective of data size after compression in a cloud storage scenario are about 12.5%, 20.8%, 30.1%, and 66.1%, and are about 78.4%, 80.3%, 84.3%, and 94.3% in a web browsing scenario, respectively. Index Terms—Source coding; Data compression; Encoding; Data synchronization; Shared data; Reference selection F 1 INTRODUCTION are capable of exploiting previously stored or delivered URING the last decade, cloud-based data synchroniza- data for storing or transmitting new data. However, they D tion services for end-users such as Dropbox, OneDrive, mostly work at the level of files or chunks of a fixed and Google Drive have attracted a huge number of sub- size (e.g., 4MB in Dropbox, 8kB in Neptune [4]), which scribers.
    [Show full text]
  • I Came to Drop Bombs Auditing the Compression Algorithm Weapons Cache
    I Came to Drop Bombs Auditing the Compression Algorithm Weapons Cache Cara Marie NCC Group Blackhat USA 2016 About Me • NCC Group Senior Security Consultant Pentested numerous networks, web applications, mobile applications, etc. • Hackbright Graduate • Ticket scalper in a previous life • @bones_codes | [email protected] What is a Decompression Bomb? A decompression bomb is a file designed to crash or render useless the program or system reading it. Vulnerable Vectors • Chat clients • Image hosting • Web browsers • Web servers • Everyday web-services software • Everyday client software • Embedded devices (especially vulnerable due to weak hardware) • Embedded documents • Gzip’d log uploads A History Lesson early 90’s • ARC/LZH/ZIP/RAR bombs were used to DoS FidoNet systems 2002 • Paul L. Daniels publishes Arbomb (Archive “Bomb” detection utility) 2003 • Posting by Steve Wray on FullDisclosure about a bzip2 bomb antivirus software DoS 2004 • AERAsec Network Services and Security publishes research on the various reactions of antivirus software against decompression bombs, includes a comparison chart 2014 • Several CVEs for PIL are issued — first release July 2010 (CVE-2014-3589, CVE-2014-3598, CVE-2014-9601) 2015 • CVE for libpng — first release Aug 2004 (CVE-2015-8126) Why Are We Still Talking About This?!? Why Are We Still Talking About This?!? Compression is the New Hotness Who This Is For Who This Is For The Archives An archive bomb, a.k.a. zip bomb, is often employed to disable antivirus software, in order to create an opening for more traditional viruses • Singly compressed large file • Self-reproducing compressed files, i.e. Russ Cox’s Zips All The Way Down • Nested compressed files, i.e.
    [Show full text]
  • HTTP/2 Compression Dictionaries
    HTTP/2 Compression Dictionaries Vlad Krasnov In a nutshell ● Allow cross-stream compression in HTTP/2 by means of "dictionaries" ● Including a set(s?) of static dictionaries for initialization ○ Each dictionary targets a different MIME type ● Up to 256 dictionaries per connection ● Default dictionary size is 217 ○ Defined by settings ● Server indicates if a stream might be used for compression in the future by sending a SET_DICTIONARY frame, before the data ○ Client keeps part of the data ● Server can use a previously defined dictionary with a USE_DICTIONARY frame Rationale ● Yesterday ○ HTTP/1 with large assets ○ CPU time was expensive ○ Static assets compressed only with "gzip" ■ What about "brotli", "sdch", "sdch+gzip", "sdch+brotli"? ● Today ○ HTTP/2 with (ideally) smaller assets ○ CPU time significantly cheaper ■ Cloudflare uses gzip -8 for dynamic compression ■ Tomorrow: FPGAs ○ Store in gzip -> compress to other formats on demand ● Network ○ Gets cheaper, but slowly ○ Less data -> less packet loss Benefits ● Client ○ Less bandwidth wasted ■ Reduced packet loss ■ Faster page loads ● Server ○ Less bandwidth wasted ○ Improved compression ratio almost for free ■ Alternatively: keep compression ratio, reduce CPU usage ○ Greater incentive to re-compress static content ● CDN ○ Highly efficient origin pulls ■ Almost free in many cases Performance simulation ● Crawled over ~2000 Alexa top ● Used Chromedriver to load each page ● Simulated cross-stream compression with gzip and brotli ● Several compression strategies and dictionary sizes ● Best overall strategy: ○ If asset first of its type -> use static dictionary for type ○ Else use dynamic dictionary for type ○ Append asset to the dictionary for the type Performance Performance Performance Performance Performance ● Brotli -5 w.
    [Show full text]