Bloscpack: a Compressed Lightweight Serialization Format for Numerical Data


PROC. OF THE 6th EUR. CONF. ON PYTHON IN SCIENCE (EUROSCIPY 2013)

Bloscpack: a compressed lightweight serialization format for numerical data

Valentin Haenel (* Corresponding author: [email protected]; † Independent)

arXiv:1404.6383v2 [cs.MS] 29 Apr 2014

Abstract—This paper introduces the Bloscpack file format and the accompanying Python reference implementation. Bloscpack is a lightweight, compressed binary file format based on the Blosc codec, designed for fast serialization of numerical data. This article presents the features of the file format and some API aspects of the reference implementation, in particular the ability to handle Numpy ndarrays. Furthermore, in order to demonstrate its utility, the format is compared both feature- and performance-wise to a few alternative lightweight serialization solutions for Numpy ndarrays. The performance comparisons take the form of comprehensive benchmarks over a range of artificial datasets of varying size and complexity, the results of which are presented in the last section of this article.

Index Terms—applied information theory, compression/decompression, python, numpy, file format, serialization, blosc

Copyright © 2014 Valentin Haenel. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. http://creativecommons.org/licenses/by/3.0/

1 INTRODUCTION

When using compression during storage of numerical data there are two potential improvements one can make. First, by using compression, one can naturally save storage space. Secondly, and this is often overlooked, one can save time. When using compression during serialization, the total compression time is the sum of the time taken to perform the compression and the time taken to write the compressed data to the storage medium. Depending on the compression speed and the compression ratio, this sum may be less than the time taken to serialize the data in uncompressed format, i.e.

    time_write(uncompressed) > time_write(compressed) + time_compress

The Bloscpack file format and Python reference implementation aim to achieve exactly this by leveraging the fast, multithreaded, blocking and shuffling Blosc codec.

2 BLOSC

Blosc [Blosc] is a fast, multithreaded, blocking and shuffling compressor designed initially for in-memory compression. Contrary to many other available compressors, which operate sequentially on a data buffer, Blosc uses the blocking technique [Alted2009], [Alted2010] to split the dataset into individual blocks. It can then operate on each block using a different thread, which effectively leads to a multithreaded compressor. The block size is chosen such that it fits either into a typical L1 cache (for compression levels up to 6) or L2 cache (for compression levels larger than 6). In modern CPUs the L1 and L2 caches are typically not shared between cores, and so this choice of block size leads to optimal performance during multi-threaded operation.

Also, Blosc features a shuffle filter [Alted2009] (p. 71) which may reshuffle multi-byte elements, e.g. 8-byte doubles, by significance. The net result, for series of numerical elements with little difference between neighbouring elements, is that similar bytes are placed closer together and can thus be better compressed (this is especially true for time-series datasets).

Internally, Blosc uses its own codec, blosclz, which is a derivative of FastLZ [FastLZ] and implements the LZ77 [LZ77] scheme. The reasons for Blosc to introduce its own codec are mainly the desire for simplicity (blosclz is a highly streamlined version of FastLZ), as well as better interaction with the Blosc infrastructure.

Moreover, Blosc is designed to be extensible, and allows codecs other than blosclz to be used in it. In other words, one can consider Blosc a meta-compressor, in that it handles the splitting of the data into blocks, optionally applies the shuffle filter (or other future filters), and is responsible for coordinating the individual threads during operation. Blosc then relies on a "real" codec to perform the actual compression of the data blocks. As such, one can think of Blosc as a way to parallelize existing codecs, while allowing filters (also called pre-conditioners) to be applied. In fact, at the time when the research presented in this paper was conducted (Summer 2013), a proof-of-concept implementation existed to integrate the well-known Snappy codec [Snappy] as well as LZ4 [LZ4] into the Blosc framework. As of January 2014 this proof of concept has matured, and as of version 1.3.0 Blosc comes equipped with support for Snappy [Snappy], LZ4 [LZ4] and even Zlib [zlib].

Blosc was initially developed to support in-memory compression in order to mitigate the effects of the memory hierarchy [Jacob2009]. More specifically, to mitigate the effects of memory latency, i.e. the ever-growing divide between CPU speed and memory access speed, which is also known as the problem of the starving CPUs [Alted2009].

The goal of in-memory compression techniques is to have a numerical container which keeps all data as in-memory compressed blocks. If the data needs to be operated on, it is decompressed only in the caches of the CPU. Hence, data can be moved faster from memory to CPU, and the net result is faster computation, since fewer CPU cycles are wasted while waiting for data. Similar techniques are applied successfully in other settings. Imagine, for example, one wishes to transfer binary files over the internet. In this case the transfer time can be significantly improved by compressing the data before transferring it and decompressing it after having received it. As a result, the total compressed transfer time, which is taken to be the sum of the compression and decompression process and the time taken to transfer the compressed file, is less than the time taken to transfer the plain file. For example, the well-known UNIX tool rsync [rsync] implements a -z switch which performs compression of the data before sending it and decompression after receiving it. The same basic principle applies to in-memory compression, except that we are transferring data from memory to CPU. Initial implementations based on Blosc exist, c.f. Blaze [Blaze] and carray [CArray], and have been shown to yield favourable results [personal communication with Francesc Alted].

3 NUMPY

The Numpy [VanDerWalt2011], [Numpy] ndarray is the de-facto multidimensional numerical container for scientific python applications. It is probably the most fundamental package of the scientific python ecosystem and is widely used and relied upon by third-party libraries and applications. It consists of the N-dimensional array class, various different initialization routines and many different ways to operate on the data efficiently.

4 EXISTING LIGHTWEIGHT SOLUTIONS

There are a number of other plain (uncompressed) and compressed lightweight serialization formats for Numpy arrays that we can compare Bloscpack to. We specifically ignore more heavyweight solutions, such as HDF5, in this comparison.

• NPY
• NPZ
• ZFile

4.1 NPY

NPY [NPY] is a simple plain serialization format for numpy. It is considered somewhat of a gold standard for serialization. One of its advantages is that it is very lightweight. The format specification is simple and can easily be digested within an hour. In essence, it simply contains the ndarray metadata

4.3 ZFile

ZFile is the native serialization format that ships with the Joblib [Joblib] framework. Joblib is equipped with a caching mechanism that supports caching the input and output arguments of functions and can thus avoid running heavy computations if the input has not changed. When serializing ndarrays with Joblib, a special subclass of the Pickler is used to store the metadata, whereas the data block is serialized as a ZFile. ZFile uses zlib [zlib] internally and simply runs zlib on the entire data buffer. zlib is also an implementation of the DEFLATE algorithm. One drawback of the current ZFile implementation is that no chunking scheme is employed. This means that the memory requirements might be twice that of the original input. Imagine trying to compress an incompressible buffer of 1GB: in this case the memory requirement would be 2GB, since the entire buffer must be copied in memory as part of the compression process before it can be written out to disk.

5 BLOSCPACK FORMAT

The Bloscpack format and reference implementation build a serialization format around the Blosc codec. It is a simple chunked file format well suited for the storage of numerical data. As described in the Bloscpack format description, the big picture of the file format is as follows:

|-header-|-meta-|-offsets-|
|-chunk-|-checksum-|-chunk-|-checksum-|...|

The format contains a 32-byte header with various options and settings for the file, for example a magic string, the format version number and the total number of chunks. The meta section is of variable size and can contain any metadata that needs to be saved alongside the data. An optional offsets section is provided to allow for partial decompression of the file in the future. This is followed by a series of chunks, each of which is a Blosc compressed buffer. Each chunk can optionally be followed by a checksum of the compressed data, which can help to protect against silent data corruption.

The chunked format was initially chosen to circumvent a 2GB limitation of the Blosc codec. In fact, the ZFile format suffers from this exact limitation, since zlib (at least the Python bindings) is also limited to buffers of 2GB in size. The limitation stems from the fact that an int32 is used internally by the algorithms to store the size of the buffer, and the maximum value of an int32 is indeed 2GB.