Another Factorial File Compression Experiment Using SAS® and UNIX Compression Algorithms

NESUG 18 Posters

Adeline J. Wilcox, US Census Bureau, Washington, DC

1 ABSTRACT

Continuing experimental work on SAS data set compression presented at NESUG in 2004, I designed another two-factor factorial experiment. My first factor compares seven file compression treatments: the three kinds of data set compression offered by SAS on UNIX (the SAS data set options COMPRESS=CHAR and COMPRESS=BINARY, and SAS sequential format files created with the V9TAPE engine), three UNIX file compression algorithms (compress, gzip, and bzip2), and a control without any file compression. My second factor compares four kinds of SAS data sets: all character variables, all numeric variables, half character and half numeric, and all numeric in which LENGTHs shorter than 8 were used for smaller values. bzip2 minimized compressed file size for all four control SAS data sets. Only the three SAS file compression methods can give other SAS users read access to compressed SAS data sets without giving them write permission to these files. SAS COMPRESS=BINARY reduced compressed file size more than SAS COMPRESS=CHAR on all four variable type treatments tested, including SAS data sets containing only character variables.

2 INTRODUCTION

Experimentation is generally an iterative process (Montgomery, 1997). Using what I learned from the results of the experiments I presented at NESUG in 2004 (Wilcox, 2004), and reconsidering other information, I designed another factorial file experiment. This experiment's design also reflects the fact that I now work in a different computing environment, one in which disk space is more precious than the environment in which I conducted my earlier experiments. This experiment aims for a more comprehensive comparison of compression algorithms and variable types.
Testing different file compression algorithms on files composed of different variable types is one of the two primary objectives of this experiment. My first experiment did not control for variable composition in any way. Testing SAS COMPRESS=CHAR on SAS data sets consisting solely of character variables should determine whether this file compression algorithm can be dropped from further testing. In my first experiment, SAS COMPRESS=CHAR was the slowest and second worst compression algorithm for file size reduction. In that experiment, SAS COMPRESS=BINARY actually increased compressed file size because observation lengths were not long enough to properly test that compression algorithm. In this experiment, I created test data sets with sufficient record length. The other primary objective is a more comprehensive comparison of compression algorithms, including gzip and bzip2. No measurements of SAS CPU time or total CPU time are reported here.

3 DESIGN OF MY EXPERIMENT

In my 7 x 4 factorial experiment, the first factor was one of six file compression treatments or a control without compression. The second factor compared SAS data sets of different composition: all character variables, all numeric variables, half character and half numeric, and all numeric in which LENGTHs shorter than 8 were used for smaller values. The fixed model for my factorial experiment is

    y_ijk = µ + τ_i + β_j + (τβ)_ij + ε_ijk

where the response y_ijk is disk space used, µ is the overall mean, τ_i represents the file compression treatment, β_j represents the SAS variable type and length treatment, (τβ)_ij is the interaction between the file compression and SAS variable type composition treatments, and ε_ijk is the random error (Montgomery). Table 3.1 shows the design of my experiment, with 10 replicates within each of the 28 treatments. The order of treatment of units within each block was not random.
Table 3.1 Assignment of Treatments

                                                 Variable Type
  File Compression Treatment   Character   Numeric   Short Numeric   Character and Numeric
  None                             10         10           10                 10
  Sequential Format                10         10           10                 10
  COMPRESS=CHAR                    10         10           10                 10
  COMPRESS=BINARY                  10         10           10                 10
  UNIX compress                    10         10           10                 10
  UNIX gzip                        10         10           10                 10
  UNIX bzip2                       10         10           10                 10

3.1 Choice Of Sample Size

Before I ran this experiment, I decided that I needed a reduction in file size of at least 30 percent to make file compression worthwhile. Having used ten replicates in my first file compression experiment, I again used ten replicates, creating 10 data set subsets of only character variables. A 30 percent reduction would cut file size by 1,241,088 bytes, to no more than 2,895,872 bytes. Referring to the method Montgomery gives for sample size computation for two-factor factorial designs, ten replicates for each treatment may be considerably more than needed. However, it was convenient for me to continue working with the same number of replicates that I used in my first experiment.

4 METHODS

Because this experiment is designed to be a comprehensive test of the file compression algorithms available in my computing environment, I ran Tukey's Studentized Range (HSD) test to make all pairwise comparisons of the compression factor and of the interaction of the compression factor with the variable type factor. I also tested for differences from the control data sets.

4.1 Creating Test Data Sets

All test data sets were generated from decennial census data. The original data were stored in 52 files, one for each of the 50 US states and one each for the District of Columbia and the territory of Puerto Rico. From these 52 files, ten were randomly selected. The second 10,000 observations were read from each of these ten files. Within each control treatment, all ten subsets of the original files were identical in file size and observation length. Table 4.1 shows the size of each of the four sets of control files.
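The 30 percent threshold arithmetic above can be checked with a few lines of shell; this is a sketch, using the 4,136,960-byte character control file size reported later in Table 4.1:

```shell
#!/bin/sh
# Compute the required reduction and the maximum acceptable compressed
# size for a 30 percent file size reduction.
size=4136960                      # character control file size in bytes
cut=$(( size * 30 / 100 ))        # required reduction: 1,241,088 bytes
target=$(( size - cut ))          # maximum acceptable size: 2,895,872 bytes
echo "reduction=$cut target=$target"
```

Running this prints `reduction=1241088 target=2895872`, matching the figures in the text.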
Because my first data set, consisting only of character variables, contained solely numeric data stored as character variables, it was possible for me to make all four test data sets identical in data content. I created my second, third and fourth data sets by converting character variables to numeric variables.

Table 4.1 The Four Variable Type Treatments

                           Number of   Number     File Size    Observation
  Variable Type(s)         Variables   of Files   (bytes)      Length (bytes)
  Character                   140         10       4,136,960        407
  Numeric                     140         10      11,345,920       1120
  Short Numeric               140         10       5,046,272        496
  Character and Numeric       140         10       7,675,904        760

In the original files, all variables were character. From the original data sets, I chose 140 variables that could be converted to numeric variables. All ten of these files were 4,136,960 bytes in size. In an effort to control metadata size, I gave the numeric versions of the variables names of the same length as the original character variable names. All work was done with bash shell scripts and 32-bit SAS 9.1.3 on an AMD Opteron™ processor running Linux.

4.2 File Compression Algorithms

I compared six file compression treatments to controls. I experimented with three SAS compression treatments: the data set options COMPRESS=CHAR and COMPRESS=BINARY, and the SAS sequential format with a named pipe. I also experimented with the three file compression algorithms installed in my Linux computing environment: gzip, compress and bzip2. In my earlier file compression experiments, I did not use bzip2 because my initial experience with it on a very large file wasn't successful. I tried bzip2 again, this time without getting a non-zero exit status. In this paper, all measures of file size were obtained from Linux. In one of my bash shell scripts, I used the command export oneoften to export an environment variable named oneoften that identifies the replicate.
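The UNIX side of the compression factor can be previewed with a short shell loop. This is a sketch, not code from the paper: the test file is synthetic, and compress is omitted because many current Linux distributions no longer install it by default.

```shell
#!/bin/sh
# Compare gzip and bzip2 on one highly compressible test file and
# report the percent size reduction each tool achieves.
f=$(mktemp)
head -c 1000000 /dev/zero | tr '\0' 'A' > "$f"   # 1,000,000 bytes of 'A'
orig=$(wc -c < "$f")
for tool in gzip bzip2; do
  "$tool" -c "$f" > "$f.$tool"                   # compress to stdout, keep original
  new=$(wc -c < "$f.$tool")
  echo "$tool: $(( (orig - new) * 100 / orig ))% reduction"
done
rm -f "$f" "$f.gzip" "$f.bzip2"
```

On real data the reductions are far smaller than on this artificial file, which is why a controlled experiment on representative data sets is needed at all.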
Subsequently, I created a named pipe with the command mknod pipech$oneoften p, as shown in a SAS Tech Support sample (SAS Institute Inc., 2002). In this SAS log excerpt, the macro variable named state resolves to 01.

    13   %let state=%sysget(oneoften);
    14   libname mine '/adelines/portland/amd/';
    NOTE: Libref MINE was successfully assigned as follows:
          Engine:        V9
          Physical Name: /adelines/portland/amd
    15   libname fargo "pipech&state";
    NOTE: Libref FARGO was successfully assigned as follows:
          Engine:        V9TAPE
          Physical Name: /adelines/portland/amd/pipech01
    16   filename nwrpipe pipe "compress < pipech&state > char2&state..Z &";
    17   data _null_;
    18   infile nwrpipe;
    19   run;
    NOTE: The infile NWRPIPE is:
          Pipe command="compress < pipech01 > char201.Z &"
    NOTE: 0 records were read from the infile NWRPIPE.
    20   data fargo.a; set mine.char2&state;
    21   run;
    NOTE: There were 10000 observations read from the data set MINE.CHAR201.
    NOTE: The data set FARGO.A has 10000 observations and 140 variables.

5 RESULTS

In the subdirectory where I wrote these 280 files, I ran the commands

    ls -l *.Z *.sas7bdat *.gz *.bz2 > two80a.txt
    cut -c31-42,57-63 two80a.txt > two80b.txt

giving me a list of all 280 files with their file sizes in bytes. File names were designed to identify the treatment(s) applied to the SAS data sets contained in the files. Consequently, this information was captured in the file named two80b.txt. I used two80b.txt as the input to my SAS program two80b.sas, in which I analyzed the effects of the data type composition and file compression treatment factors on file size.

5.1 Compressed File Size

Table 5.1 shows treatment means without adjustment for the other factor or for the interaction between the factors. This table also shows 95 percent confidence intervals. Means and confidence limits are rounded to the nearest byte.
Table 5.1 Means of the File Compression Treatments

  Variable    Compression   Number of    Mean      Lower     Upper
  Type        Treatment     Replicates   (bytes)   95% CL    95% CL
  Character   Control           10       4136960      .
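The named-pipe pattern from the Section 4.2 log excerpt can be sketched in plain shell. This is a hypothetical illustration, not the paper's script: gzip stands in for compress (which may be absent on current Linux systems), and a printf stands in for the SAS DATA step that writes to the V9TAPE libref.

```shell
#!/bin/sh
# Producer/consumer through a FIFO: a background compressor drains the
# pipe while a writer streams data into it, as in the SAS example.
mkfifo pipech01
gzip -c < pipech01 > char201.gz &                 # background reader/compressor
printf 'observations would stream here\n' > pipech01   # writer side
wait                                              # let gzip finish char201.gz
gunzip -c char201.gz                              # round-trip check
rm -f pipech01 char201.gz
```

Opening the FIFO for reading blocks until a writer opens it, so the compressor simply waits until data starts flowing; closing the writer's end delivers end-of-file and lets the compressor finish the output file.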