Data Compression Explained

Total Page:16

File Type:pdf, Size:1020Kb

Data Compression Explained Data Compression Explained http://mattmahoney.net/dc/dce.html Data Compression Explained Matt Mahoney Copyright (C) 2010, 2011, Dell, Inc. You are permitted to copy and distribute material rom this book provided (1) any material you distribute includes this license, (2) the material is not modi ied, and (3) you do not charge a ee or require any other considerations or copies or or any works that incorporate material rom this book. These restrictions do not apply to normal " air use", de ined as cited quotations totaling less than one page. This book may be downloaded without charge rom http://mattmahoney.net /dc/dce.html. Last update: Oct. 30, 2011. e,reader translations by Ale.o Sanchez, Oct. 21, 2011: mobi (2indle) and epub (other readers). 3omanian translation by Alexander Ovsov, Aug. 3, 2011. About this Book This book is or the reader who wants to understand how data compression works, or who wants to write data compression so tware. 5rior programming ability and some math skills will be needed. Speci ic topics include: 1. In ormation theory No universal compression Coding is bounded Modeling is not computable Compression is an arti icial intelligence problem 2. 7enchmarks Calgary corpus results 3. Coding 8u man arithmetic asymmetric binary numeric codes (unary, 3ice, 9olomb, e4tra bit) archive ormats (error detection, encryption) 4. Modeling Fi4ed order: bytewise bitwise, indirect Variable order: DMC, 55M, CTW Context mixing: linear mixing, logistic mixing, SSE, indirect SSE, match, 5AQ, Z5AQ, Crinkler A. Transforms 3LE LZ77 (LZSS, de late, LZMA, LZC, 3OLZ, LZ5, snappy, deduplication) LZW and dictionary encoding Symbol ranking 7WT (context sorting, inverse, bzip2, 777, MSu Sort v2 and v3, Itoh,Tanaka, DivSu Sort, 7i.ective) 5redictive iltering (delta coding, color transform, linear iltering) Specialized transforms (E8E1, precomp) 1 of 110 10/29/2011 9:17 PM Data Compression Explained http://mattmahoney.net/dc/dce.html 8u man pre,coding E. Lossy compression Images (7M5, 9IF, 5N9, TIFF, F5E9) F5E9 recompression (Stu it, 5AQ, WinZI5 5ackJP9) Video (NTSC, M5E9) Audio (CD, M53, AAC, Dolby, Vorbis) Conclusion Acknowledgements 3e erences This book is intended to be sel contained. Sources are linked when appropriate, but you donGt need to click on them to understand the material. 1. Information Theory Data compression is the art o reducing the number o bits needed to store or transmit data. Compression can be either lossless or lossy. Losslessly compressed data can be decompressed to exactly its original value. An example is 1848 Morse Code. Each letter o the alphabet is coded as a sequence o dots and dashes. The most common letters in English like E and T receive the shortest codes. The least common like F, Q, C, and Z are assigned the longest codes. All data compression algorithms consist o at least a model and a coder (with optional preprocesing trans orms). A model estimates the probability distribution (E is more common than Z). The coder assigns shorter codes to the more likely symbols. There are e icient and optimal solutions to the coding problem. 8owever, optimal modeling has been proven not computable. Modeling (or equivalently, prediction) is both an arti icial intelligence (AI) problem and an art. Lossy compression discards "unimportant" data, or example, details o an image or audio clip that are not perceptible to the eye or ear. An example is the 11A3 NTSC standard or broadcast color TV, used until 2001. The human eye is less sensitive to ine detail between colors o equal brightness (like red and green) than it is to brightness (black and white). Thus, the color signal is transmitted with less resolution over a narrower requency band than the monochrome signal. Lossy compression consists o a transform to separate important rom unimportant data, ollowed by lossless compression o the important part and discarding the rest. The transform is an AI problem because it requires understanding what the human brain can and cannot perceive. In ormation theory places hard limits on what can and cannot be compressed losslessly, and by how much: 1. There is no such thing as a "universal" compression algorithm that is guaranteed to compress any input, or even any input above a certain si0e. In particular, it is not possible to compress random data or compress recursively. 2. 9iven a model (probability distribution) o your input data, the best you can do is code symbols with probability p using log 1/p bits. E icient and optimal codes are known. 2 3. Data has a universal but uncomputable probability distribution. Speci ically, any string x has probability (about) 2,HMH where M is the shortest possible description o x, and HMH is the length o M in bits, almost independent o the language in which M is written. 8owever there is no general procedure or inding M or even estimating HMH in any language. There is no algorithm that tests or randomness or tells you whether a string can be compressed any urther. 2 of 110 10/29/2011 9:17 PM Data Compression Explained http://mattmahoney.net/dc/dce.html 1.1. No Uni ersal Compression This is proved by the counting argument. Suppose there were a compression algorithm that could compress all strings o at least a certain size, say, n bits. There are exactly 2n di erent binary strings o length n. A universal compressor would have to encode each input di erently. Otherwise, i two inputs compressed to the same output, then the decompresser would not be able to decompress that output correctly. 8owever there are only 2n , 1 binary strings shorter than n bits. In act, the vast ma.ority o strings cannot be compressed by very much. The raction o strings that can be compressed rom n bits to m bits is at most 2m , n. For example, less than 0.4I o strings can be compressed by one byte. Every compressor that can compress any input must also expand some o its input. 8owever, the expansion never needs to be more than one symbol. Any compression algorithm can be modi ied by adding one bit to indicate that the rest o the data is stored uncompressed. The counting argument applies to systems that would recursively compress their own output. In general, compressed data appears random to the algorithm that compressed it so that it cannot be compressed again. 1.2. Coding is Bounded Suppose we wish to compress the digits o J, e.g. "3141A12EA3A817132384E2E4...". Assume our model is that each digit occurs with probability 0.1, independent o any other digits. Consider 3 possible binary codes: Digit BCD Huffman Binary ---- ---- ---- ---- 0 0000 000 0 1 0001 001 1 2 0010 010 10 3 0011 011 11 4 0100 100 100 5 0101 101 101 6 0110 1100 110 7 0111 1101 111 8 1000 1110 1000 9 1001 1111 1001 --- ---- ---- ---- bpc 4.0 3.4 not alid Using a 7CD (binary coded decimal) code, J would be encoded as 0011 0001 0100 0001 0101... (Spaces are shown or readability only). The compression ratio is 4 bits per character (4 bpc). I the input was ASCII te4t, the output would be compressed A0I. The decompresser would decode the data by dividing it into 4 bit strings. The 8u man code would code J as 011 001 100 001 101 1111... The decoder would read bits one at a time and decode a digit as soon as it ound a match in the table (a ter either 3 or 4 bits). The code is uniquely decodable because no code is a pre ix o any other code. The compression ratio is 3.4 bpc. The binary code is not uniquely decodable. For example, 111 could be decoded as 7 or 31 or 13 or 111. There are better codes than the 8u man code given above. For example, we could assign 8u man codes to pairs o digits. There are 100 pairs each with probability 0.01. We could assign E bit codes (000000 through 011011) to 00 through 27, and 7 bits (0111000 through 1111111) to 28 through 11. The of 110 10/29/2011 9:17 PM Data Compression Explained http://mattmahoney.net/dc/dce.html average code length is E.72 bits per pair o digits, or 3.3E bpc. Similarly, coding groups o 3 digits using 1 or 10 bits would yield 3.32A3 bpc. Shannon and Weaver (1141) proved that the best you can do or a symbol with probability p is assign a code o length log 1/p. In this example, log 1/0.1 L 3.3211 bpc. 2 2 Shannon de ined the expected in ormation content or equivocation (now called entropy) o a random variable C as its e4pected code length. Suppose C may have values C , C ,... and that each C has 1 2 i probability p(i). Then the entropy o C is 8(C) L EMlog 1/p(C)N L O p(i) log 1/p(i). For example, the 2 i 2 entropy o the digits o J, according to our model, is 10 (0.1 log 1/0.1) L 3.3211 bpc. There is no 2 smaller code or this model that could be decoded unambiguously. The in ormation content o a set o strings is at most the sum o the in ormation content o the individual strings. I C and Y are strings, then 8(C,Y) P 8(C) Q 8(Y). I they are equal, then C and Y are independent. 2nowing one string would tell you nothing about the other. The conditional entropy 8(CHY) L 8(C,Y) , 8(Y) is the in ormation content o C given Y.
Recommended publications
  • Schematic Entry
    Schematic Entry Copyrights Software, documentation and related materials: Copyright © 2002 Altium Limited This software product is copyrighted and all rights are reserved. The distribution and sale of this product are intended for the use of the original purchaser only per the terms of the License Agreement. This document may not, in whole or part, be copied, photocopied, reproduced, translated, reduced or transferred to any electronic medium or machine-readable form without prior consent in writing from Altium Limited. U.S. Government use, duplication or disclosure is subject to RESTRICTED RIGHTS under applicable government regulations pertaining to trade secret, commercial computer software developed at private expense, including FAR 227-14 subparagraph (g)(3)(i), Alternative III and DFAR 252.227-7013 subparagraph (c)(1)(ii). P-CAD is a registered trademark and P-CAD Schematic, P-CAD Relay, P-CAD PCB, P-CAD ProRoute, P-CAD QuickRoute, P-CAD InterRoute, P-CAD InterRoute Gold, P-CAD Library Manager, P-CAD Library Executive, P-CAD Document Toolbox, P-CAD InterPlace, P-CAD Parametric Constraint Solver, P-CAD Signal Integrity, P-CAD Shape-Based Autorouter, P-CAD DesignFlow, P-CAD ViewCenter, Master Designer and Associate Designer are trademarks of Altium Limited. Other brand names are trademarks of their respective companies. Altium Limited www.altium.com Table of Contents chapter 1 Introducing P-CAD Schematic P-CAD Schematic Features ................................................................................................1 About
    [Show full text]
  • ACS – the Archival Cytometry Standard
    http://flowcyt.sf.net/acs/latest.pdf ACS – the Archival Cytometry Standard Archival Cytometry Standard ACS International Society for Advancement of Cytometry Candidate Recommendation DRAFT Document Status The Archival Cytometry Standard (ACS) has undergone several revisions since its initial development in June 2007. The current proposal is an ISAC Candidate Recommendation Draft. It is assumed, however not guaranteed, that significant features and design aspects will remain unchanged for the final version of the Recommendation. This specification has been formally tested to comply with the W3C XML schema version 1.0 specification but no position is taken with respect to whether a particular software implementing this specification performs according to medical or other valid regulations. The work may be used under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported license. You are free to share (copy, distribute and transmit), and adapt the work under the conditions specified at http://creativecommons.org/licenses/by-sa/3.0/legalcode. Disclaimer of Liability The International Society for Advancement of Cytometry (ISAC) disclaims liability for any injury, harm, or other damage of any nature whatsoever, to persons or property, whether direct, indirect, consequential or compensatory, directly or indirectly resulting from publication, use of, or reliance on this Specification, and users of this Specification, as a condition of use, forever release ISAC from such liability and waive all claims against ISAC that may in any manner arise out of such liability. ISAC further disclaims all warranties, whether express, implied or statutory, and makes no assurances as to the accuracy or completeness of any information published in the Specification.
    [Show full text]
  • Encryption Introduction to Using 7-Zip
    IT Services Training Guide Encryption Introduction to using 7-Zip It Services Training Team The University of Manchester email: [email protected] www.itservices.manchester.ac.uk/trainingcourses/coursesforstaff Version: 5.3 Training Guide Introduction to Using 7-Zip Page 2 IT Services Training Introduction to Using 7-Zip Table of Contents Contents Introduction ......................................................................................................................... 4 Compress/encrypt individual files ....................................................................................... 5 Email compressed/encrypted files ....................................................................................... 8 Decrypt an encrypted file ..................................................................................................... 9 Create a self-extracting encrypted file .............................................................................. 10 Decrypt/un-zip a file .......................................................................................................... 14 APPENDIX A Downloading and installing 7-Zip ................................................................. 15 Help and Further Reference ............................................................................................... 18 Page 3 Training Guide Introduction to Using 7-Zip Introduction 7-Zip is an application that allows you to: Compress a file – for example a file that is 5MB can be compressed to 3MB Secure the
    [Show full text]
  • Pack, Encrypt, Authenticate Document Revision: 2021 05 02
    PEA Pack, Encrypt, Authenticate Document revision: 2021 05 02 Author: Giorgio Tani Translation: Giorgio Tani This document refers to: PEA file format specification version 1 revision 3 (1.3); PEA file format specification version 2.0; PEA 1.01 executable implementation; Present documentation is released under GNU GFDL License. PEA executable implementation is released under GNU LGPL License; please note that all units provided by the Author are released under LGPL, while Wolfgang Ehrhardt’s crypto library units used in PEA are released under zlib/libpng License. PEA file format and PCOMPRESS specifications are hereby released under PUBLIC DOMAIN: the Author neither has, nor is aware of, any patents or pending patents relevant to this technology and do not intend to apply for any patents covering it. As far as the Author knows, PEA file format in all of it’s parts is free and unencumbered for all uses. Pea is on PeaZip project official site: https://peazip.github.io , https://peazip.org , and https://peazip.sourceforge.io For more information about the licenses: GNU GFDL License, see http://www.gnu.org/licenses/fdl.txt GNU LGPL License, see http://www.gnu.org/licenses/lgpl.txt 1 Content: Section 1: PEA file format ..3 Description ..3 PEA 1.3 file format details ..5 Differences between 1.3 and older revisions ..5 PEA 2.0 file format details ..7 PEA file format’s and implementation’s limitations ..8 PCOMPRESS compression scheme ..9 Algorithms used in PEA format ..9 PEA security model .10 Cryptanalysis of PEA format .12 Data recovery from
    [Show full text]
  • Metadefender Core V4.12.2
    MetaDefender Core v4.12.2 © 2018 OPSWAT, Inc. All rights reserved. OPSWAT®, MetadefenderTM and the OPSWAT logo are trademarks of OPSWAT, Inc. All other trademarks, trade names, service marks, service names, and images mentioned and/or used herein belong to their respective owners. Table of Contents About This Guide 13 Key Features of Metadefender Core 14 1. Quick Start with Metadefender Core 15 1.1. Installation 15 Operating system invariant initial steps 15 Basic setup 16 1.1.1. Configuration wizard 16 1.2. License Activation 21 1.3. Scan Files with Metadefender Core 21 2. Installing or Upgrading Metadefender Core 22 2.1. Recommended System Requirements 22 System Requirements For Server 22 Browser Requirements for the Metadefender Core Management Console 24 2.2. Installing Metadefender 25 Installation 25 Installation notes 25 2.2.1. Installing Metadefender Core using command line 26 2.2.2. Installing Metadefender Core using the Install Wizard 27 2.3. Upgrading MetaDefender Core 27 Upgrading from MetaDefender Core 3.x 27 Upgrading from MetaDefender Core 4.x 28 2.4. Metadefender Core Licensing 28 2.4.1. Activating Metadefender Licenses 28 2.4.2. Checking Your Metadefender Core License 35 2.5. Performance and Load Estimation 36 What to know before reading the results: Some factors that affect performance 36 How test results are calculated 37 Test Reports 37 Performance Report - Multi-Scanning On Linux 37 Performance Report - Multi-Scanning On Windows 41 2.6. Special installation options 46 Use RAMDISK for the tempdirectory 46 3. Configuring Metadefender Core 50 3.1. Management Console 50 3.2.
    [Show full text]
  • Data Compression for Remote Laboratories
    Paper—Data Compression for Remote Laboratories Data Compression for Remote Laboratories https://doi.org/10.3991/ijim.v11i4.6743 Olawale B. Akinwale Obafemi Awolowo University, Ile-Ife, Nigeria [email protected] Lawrence O. Kehinde Obafemi Awolowo University, Ile-Ife, Nigeria [email protected] Abstract—Remote laboratories on mobile phones have been around for a few years now. This has greatly improved accessibility of these remote labs to students who cannot afford computers but have mobile phones. When money is a factor however (as is often the case with those who can’t afford a computer), the cost of use of these remote laboratories should be minimized. This work ad- dressed this issue of minimizing the cost of use of the remote lab by making use of data compression for data sent between the user and remote lab. Keywords—Data compression; Lempel-Ziv-Welch; remote laboratory; Web- Sockets 1 Introduction The mobile phone is now the de facto computational device of choice. They are quite powerful, connected devices which are almost always with us. This is so much so that about every important online service has a mobile version or at least a version compatible with mobile phones and tablets (for this paper we would adopt the phrase mobile device to represent mobile phones and tablets). This has caused online labora- tory developers to develop online laboratory solutions which are mobile device- friendly [1, 2, 3, 4, 5, 6, 7, 8]. One of the main benefits of online laboratories is the possibility of using fewer re- sources to serve more people i.e.
    [Show full text]
  • Improved Neural Network Based General-Purpose Lossless Compression Mohit Goyal, Kedar Tatwawadi, Shubham Chandak, Idoia Ochoa
    JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1 DZip: improved neural network based general-purpose lossless compression Mohit Goyal, Kedar Tatwawadi, Shubham Chandak, Idoia Ochoa Abstract—We consider lossless compression based on statistical [4], [5] and generative modeling [6]). Neural network based data modeling followed by prediction-based encoding, where an models can typically learn highly complex patterns in the data accurate statistical model for the input data leads to substantial much better than traditional finite context and Markov models, improvements in compression. We propose DZip, a general- purpose compressor for sequential data that exploits the well- leading to significantly lower prediction error (measured as known modeling capabilities of neural networks (NNs) for pre- log-loss or perplexity [4]). This has led to the development of diction, followed by arithmetic coding. DZip uses a novel hybrid several compressors using neural networks as predictors [7]– architecture based on adaptive and semi-adaptive training. Unlike [9], including the recently proposed LSTM-Compress [10], most NN based compressors, DZip does not require additional NNCP [11] and DecMac [12]. Most of the previous works, training data and is not restricted to specific data types. The proposed compressor outperforms general-purpose compressors however, have been tailored for compression of certain data such as Gzip (29% size reduction on average) and 7zip (12% size types (e.g., text [12] [13] or images [14], [15]), where the reduction on average) on a variety of real datasets, achieves near- prediction model is trained in a supervised framework on optimal compression on synthetic datasets, and performs close to separate training data or the model architecture is tuned for specialized compressors for large sequence lengths, without any the specific data type.
    [Show full text]
  • Incompressibility and Lossless Data Compression: an Approach By
    Incompressibility and Lossless Data Compression: An Approach by Pattern Discovery Incompresibilidad y compresión de datos sin pérdidas: Un acercamiento con descubrimiento de patrones Oscar Herrera Alcántara and Francisco Javier Zaragoza Martínez Universidad Autónoma Metropolitana Unidad Azcapotzalco Departamento de Sistemas Av. San Pablo No. 180, Col. Reynosa Tamaulipas Del. Azcapotzalco, 02200, Mexico City, Mexico Tel. 53 18 95 32, Fax 53 94 45 34 [email protected], [email protected] Article received on July 14, 2008; accepted on April 03, 2009 Abstract We present a novel method for lossless data compression that aims to get a different performance to those proposed in the last decades to tackle the underlying volume of data of the Information and Multimedia Ages. These latter methods are called entropic or classic because they are based on the Classic Information Theory of Claude E. Shannon and include Huffman [8], Arithmetic [14], Lempel-Ziv [15], Burrows Wheeler (BWT) [4], Move To Front (MTF) [3] and Prediction by Partial Matching (PPM) [5] techniques. We review the Incompressibility Theorem and its relation with classic methods and our method based on discovering symbol patterns called metasymbols. Experimental results allow us to propose metasymbolic compression as a tool for multimedia compression, sequence analysis and unsupervised clustering. Keywords: Incompressibility, Data Compression, Information Theory, Pattern Discovery, Clustering. Resumen Presentamos un método novedoso para compresión de datos sin pérdidas que tiene por objetivo principal lograr un desempeño distinto a los propuestos en las últimas décadas para tratar con los volúmenes de datos propios de la Era de la Información y la Era Multimedia.
    [Show full text]
  • Lossless Compression of Internal Files in Parallel Reservoir Simulation
    Lossless Compression of Internal Files in Parallel Reservoir Simulation Suha Kayum Marcin Rogowski Florian Mannuss 9/26/2019 Outline • I/O Challenges in Reservoir Simulation • Evaluation of Compression Algorithms on Reservoir Simulation Data • Real-world application - Constraints - Algorithm - Results • Conclusions 2 Challenge Reservoir simulation 1 3 Reservoir Simulation • Largest field in the world are represented as 50 million – 1 billion grid block models • Each runs takes hours on 500-5000 cores • Calibrating the model requires 100s of runs and sophisticated methods • “History matched” model is only a beginning 4 Files in Reservoir Simulation • Internal Files • Input / Output Files - Interact with pre- & post-processing tools Date Restart/Checkpoint Files 5 Reservoir Simulation in Saudi Aramco • 100’000+ simulations annually • The largest simulation of 10 billion cells • Currently multiple machines in TOP500 • Petabytes of storage required 600x • Resources are Finite • File Compression is one solution 50x 6 Compression algorithm evaluation 2 7 Compression ratio Tested a number of algorithms on a GRID restart file for two models 4 - Model A – 77.3 million active grid blocks 3.5 - Model K – 8.7 million active grid blocks 3 - 15.6 GB and 7.2 GB respectively 2.5 2 Compression ratio is between 1.5 1 compression ratio compression - From 2.27 for snappy (Model A) 0.5 0 - Up to 3.5 for bzip2 -9 (Model K) Model A Model K lz4 snappy gzip -1 gzip -9 bzip2 -1 bzip2 -9 8 Compression speed • LZ4 and Snappy significantly outperformed other algorithms
    [Show full text]
  • 7Z Zip File Download How to Open 7Z Files
    7z zip file download How to Open 7z Files. This article was written by Nicole Levine, MFA. Nicole Levine is a Technology Writer and Editor for wikiHow. She has more than 20 years of experience creating technical documentation and leading support teams at major web hosting and software companies. Nicole also holds an MFA in Creative Writing from Portland State University and teaches composition, fiction-writing, and zine-making at various institutions. There are 8 references cited in this article, which can be found at the bottom of the page. This article has been viewed 292,786 times. If you’ve come across a file that ends in “.7z”, you’re probably wondering why you can’t open it. These files, known as “7z” or “7-Zip files,” are archives of one or more files in one single compressed package. You’ll need to install an unzipping app to extract files from the archive. These apps are usually free for any operating system, including iOS and Android. Learn how to open 7z files with iZip on your mobile device, 7-Zip or WinZip on Windows, and The Unarchiver in Mac OS X. 7z zip file download. If you are looking for 7-Zip, you have come to the right place. We explain what 7-Zip is and point you to the official download. What is 7-Zip? 7-zip is a compression and extraction software similar to WinZIP and WinRAR, that can read from and write to .7z archive files (although it can open other compressed archives such as .zip and .rar, among others, even including limited access to the contents of .msi, or Microsoft Installer files, and .exe files, or executable files).
    [Show full text]
  • Winzip 12 Reviewer's Guide
    Introducing WinZip® 12 WinZip® is the most trusted way to work with compressed files. No other compression utility is as easy to use or offers the comprehensive and productivity-enhancing approach that has made WinZip the gold standard for file-compression tools. With the new WinZip 12, you can quickly and securely zip and unzip files to conserve storage space, speed up e-mail transmission, and reduce download times. State-of-the-art file compression, strong AES encryption, compatibility with more compression formats, and new intuitive photo compression, make WinZip 12 the complete compression and archiving solution. Building on the favorite features of a worldwide base of several million users, WinZip 12 adds new features for image compression and management, support for new compression methods, improved compression performance, support for additional archive formats, and more. Users can work smarter, faster, and safer with WinZip 12. Who will benefit from WinZip® 12? The simple answer is anyone who uses a PC. Any PC user can benefit from the compression and encryption features in WinZip to protect data, save space, and reduce the time to transfer files on the Internet. There are, however, some PC users to whom WinZip is an even more valuable and essential tool. Digital photo enthusiasts: As the average file size of their digital photos increases, people are looking for ways to preserve storage space on their PCs. They have lots of photos, so they are always seeking better ways to manage them. Sharing their photos is also important, so they strive to simplify the process and reduce the time of e-mailing large numbers of images.
    [Show full text]
  • An Analysis of XML Compression Efficiency
    An Analysis of XML Compression Efficiency Christopher J. Augeri1 Barry E. Mullins1 Leemon C. Baird III Dursun A. Bulutoglu2 Rusty O. Baldwin1 1Department of Electrical and Computer Engineering Department of Computer Science 2Department of Mathematics and Statistics United States Air Force Academy (USAFA) Air Force Institute of Technology (AFIT) USAFA, Colorado Springs, CO Wright Patterson Air Force Base, Dayton, OH {chris.augeri, barry.mullins}@afit.edu [email protected] {dursun.bulutoglu, rusty.baldwin}@afit.edu ABSTRACT We expand previous XML compression studies [9, 26, 34, 47] by XML simplifies data exchange among heterogeneous computers, proposing the XML file corpus and a combined efficiency metric. but it is notoriously verbose and has spawned the development of The corpus was assembled using guidelines given by developers many XML-specific compressors and binary formats. We present of the Canterbury corpus [3], files often used to assess compressor an XML test corpus and a combined efficiency metric integrating performance. The efficiency metric combines execution speed compression ratio and execution speed. We use this corpus and and compression ratio, enabling simultaneous assessment of these linear regression to assess 14 general-purpose and XML-specific metrics, versus prioritizing one metric over the other. We analyze compressors relative to the proposed metric. We also identify key collected metrics using linear regression models (ANOVA) versus factors when selecting a compressor. Our results show XMill or a simple comparison of means, e.g., X is 20% better than Y. WBXML may be useful in some instances, but a general-purpose compressor is often the best choice. 2. XML OVERVIEW XML has gained much acceptance since first proposed in 1998 by Categories and Subject Descriptors the World-Wide Web Consortium (W3C).
    [Show full text]