Another Factorial File Compression Experiment Using SAS® and UNIX Compression Algorithms

NESUG 18 Posters

Adeline J. Wilcox, US Census Bureau, Washington, DC

1 ABSTRACT

Continuing experimental work on SAS data set compression presented at NESUG in 2004, I designed another two-factor factorial experiment. My first factor compares seven file compression treatments: the three kinds of data set compression offered by SAS on UNIX (the SAS data set options COMPRESS=CHAR and COMPRESS=BINARY, and SAS sequential format files created with the V9TAPE engine), three UNIX file compression algorithms (compress, gzip, and bzip2), and a control without any file compression. My second factor compares four kinds of SAS data sets: all character variables, all numeric variables, half character and half numeric, and all numeric in which LENGTHs shorter than 8 were used for smaller values. bzip2 minimized compressed file size for all four control SAS data sets. Only the three SAS file compression methods can give other SAS users read access to compressed SAS data sets without giving them write permission to these files. SAS COMPRESS=BINARY reduced compressed file size more than SAS COMPRESS=CHAR on all four variable type treatments tested, including SAS data sets containing only character variables.

2 INTRODUCTION

Experimentation is generally an iterative process (Montgomery, 1997). Using what I learned from the results of the experiments I presented at NESUG in 2004 (Wilcox, 2004), and reconsidering other information, I designed another factorial file experiment. This experiment's design also reflects the fact that I now work in a different computing environment, one in which disk space is more precious than the environment in which I conducted my earlier experiments. This experiment aims for a more comprehensive comparison of compression algorithms and variable types.
Testing different file compression algorithms on files composed of different variable types is one of the two primary objectives of this experiment. My first experiment did not control for variable composition in any way. Testing SAS COMPRESS=CHAR on SAS data sets consisting solely of character variables should determine whether this file compression algorithm can be dropped from further testing. In my first experiment, SAS COMPRESS=CHAR was the slowest and second worst compression algorithm for file size reduction. In that experiment, SAS COMPRESS=BINARY actually increased compressed file size because observation lengths were not long enough to properly test that compression algorithm. In this experiment, I created test data sets with sufficient record length. The other primary objective is a more comprehensive comparison of compression algorithms, including gzip and bzip2. No measurements of SAS CPU time or total CPU time are reported here.

3 DESIGN OF MY EXPERIMENT

In my 7 x 4 factorial experiment, the first factor was one of six file compression treatments or a control without compression. The second factor compared SAS data sets of different composition: all character variables, all numeric variables, half character and half numeric, and all numeric in which LENGTHs shorter than 8 were used for smaller values. The fixed model for my factorial experiment is

    y_ijk = µ + τ_i + β_j + (τβ)_ij + ε_ijk

where the response y_ijk is disk space used, µ is the overall mean, τ_i represents the file compression treatment, β_j represents the SAS variable type and length treatment, (τβ)_ij is the interaction between the file compression and SAS variable type composition treatments, and ε_ijk is the random error (Montgomery). Table 3.1 shows the design of my experiment, with 10 replicates within each of the 28 treatments. The order of treatment of units within each block was not random.
Table 3.1 Assignment of Treatments

                                                 Variable Type
  File Compression Treatment   Character   Numeric   Short Numeric   Character and Numeric
  None                             10         10           10                 10
  Sequential Format                10         10           10                 10
  COMPRESS=CHAR                    10         10           10                 10
  COMPRESS=BINARY                  10         10           10                 10
  UNIX compress                    10         10           10                 10
  UNIX gzip                        10         10           10                 10
  UNIX bzip2                       10         10           10                 10

3.1 Choice Of Sample Size

Before I ran this experiment, I decided that I needed a reduction in file size of at least 30 percent to make file compression worthwhile. Having used ten replicates in my first file compression experiment, I again used ten replicates, creating 10 data set subsets of only character variables. A 30 percent reduction would cut file size by 1,241,088 bytes, to no more than 2,895,872 bytes. Referring to the method Montgomery gives for sample size computation for two-factor factorial designs, ten replicates for each treatment may be considerably more than needed. However, it was convenient for me to continue working with the same number of replicates that I used in my first experiment.

4 METHODS

Because this experiment is designed to be a comprehensive test of the file compression algorithms available in my computing environment, I ran Tukey's Studentized Range (HSD) test to make all pairwise comparisons of the compression factor and of the interaction of the compression factor with the variable type factor. I also tested for differences from the control data sets.

4.1 Creating Test Data Sets

All test data sets were generated from decennial census data. The original data were stored in 52 files, one for each of the 50 US states and one each for the District of Columbia and the territory of Puerto Rico. From these 52 files, ten were randomly selected. The second 10,000 observations were read from each of these ten files. Within each control treatment, all ten subsets of the original files were identical in file size and observation length. Table 4.1 shows the size of each of the four sets of control files.
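The 30 percent threshold arithmetic above can be checked with a few lines of shell; this is a sketch, using the 4,136,960-byte character control file size reported later in Table 4.1:

```shell
#!/bin/sh
# Compute the required reduction and the maximum acceptable compressed
# size for a 30 percent file size reduction.
size=4136960                      # character control file size in bytes
cut=$(( size * 30 / 100 ))        # required reduction: 1,241,088 bytes
target=$(( size - cut ))          # maximum acceptable size: 2,895,872 bytes
echo "reduction=$cut target=$target"
```

Running this prints `reduction=1241088 target=2895872`, matching the figures in the text.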
Because my first data set, consisting only of character variables, contained solely numeric data stored as character variables, it was possible for me to make all four test data sets identical in data content. I created my second, third and fourth data sets by converting character variables to numeric variables.

Table 4.1 The Four Variable Type Treatments

                           Number of   Number     File Size    Observation
  Variable Type(s)         Variables   of Files   (bytes)      Length (bytes)
  Character                   140         10       4,136,960        407
  Numeric                     140         10      11,345,920       1120
  Short Numeric               140         10       5,046,272        496
  Character and Numeric       140         10       7,675,904        760

In the original files, all variables were character. From the original data sets, I chose 140 variables that could be converted to numeric variables. All ten of these files were 4,136,960 bytes in size. In an effort to control metadata size, I gave the numeric versions of the variables names of the same length as the original character variable names. All work was done with bash shell scripts and 32-bit SAS 9.1.3 on an AMD Opteron™ processor running Linux.

4.2 File Compression Algorithms

I compared six file compression treatments to controls. I experimented with three SAS compression treatments: the data set options COMPRESS=CHAR and COMPRESS=BINARY, and the SAS sequential format with a named pipe. I also experimented with the three file compression algorithms installed in my Linux computing environment: gzip, compress and bzip2. In my earlier file compression experiments, I did not use bzip2 because my initial experience with it on a very large file wasn't successful. I tried bzip2 again, this time without getting a non-zero exit status. In this paper, all measures of file size were obtained from Linux. In one of my bash shell scripts, I used the command export oneoften to export an environment variable named oneoften that identifies the replicate.
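The UNIX side of the compression factor can be previewed with a short shell loop. This is a sketch, not code from the paper: the test file is synthetic, and compress is omitted because many current Linux distributions no longer install it by default.

```shell
#!/bin/sh
# Compare gzip and bzip2 on one highly compressible test file and
# report the percent size reduction each tool achieves.
f=$(mktemp)
head -c 1000000 /dev/zero | tr '\0' 'A' > "$f"   # 1,000,000 bytes of 'A'
orig=$(wc -c < "$f")
for tool in gzip bzip2; do
  "$tool" -c "$f" > "$f.$tool"                   # compress to stdout, keep original
  new=$(wc -c < "$f.$tool")
  echo "$tool: $(( (orig - new) * 100 / orig ))% reduction"
done
rm -f "$f" "$f.gzip" "$f.bzip2"
```

On real data the reductions are far smaller than on this artificial file, which is why a controlled experiment on representative data sets is needed at all.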
Subsequently, I created a named pipe with the command mknod pipech$oneoften p, as shown in a SAS Tech Support sample (SAS Institute Inc., 2002). In this SAS log excerpt, the macro variable named state resolves to 01.

    13   %let state=%sysget(oneoften);
    14   libname mine '/adelines/portland/amd/';
    NOTE: Libref MINE was successfully assigned as follows:
          Engine:        V9
          Physical Name: /adelines/portland/amd
    15   libname fargo "pipech&state";
    NOTE: Libref FARGO was successfully assigned as follows:
          Engine:        V9TAPE
          Physical Name: /adelines/portland/amd/pipech01
    16   filename nwrpipe pipe "compress < pipech&state > char2&state..Z &";
    17   data _null_;
    18   infile nwrpipe;
    19   run;
    NOTE: The infile NWRPIPE is:
          Pipe command="compress < pipech01 > char201.Z &"
    NOTE: 0 records were read from the infile NWRPIPE.
    20   data fargo.a; set mine.char2&state;
    21   run;
    NOTE: There were 10000 observations read from the data set MINE.CHAR201.
    NOTE: The data set FARGO.A has 10000 observations and 140 variables.

5 RESULTS

In the subdirectory where I wrote these 280 files, I ran the commands

    ls -l *.Z *.sas7bdat *.gz *.bz2 > two80a.txt
    cut -c31-42,57-63 two80a.txt > two80b.txt

giving me a list of all 280 files with their file sizes in bytes. File names were designed to identify the treatment(s) applied to the SAS data sets contained in the files. Consequently, this information was captured in the file named two80b.txt. I used two80b.txt as the input to my SAS program two80b.sas, in which I analyzed the effects of the data type composition and file compression treatment factors on file size.

5.1 Compressed File Size

Table 5.1 shows treatment means without adjustment for the other factor or for the interaction between the factors. This table also shows 95 percent confidence intervals. Means and confidence limits are rounded to the nearest byte.
Table 5.1 Means of the File Compression Treatments

  Variable    Compression   Number of    Mean      Lower     Upper
  Type        Treatment     Replicates   (bytes)   95% CL    95% CL
  Character   Control           10       4136960      .
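The named-pipe pattern from the Section 4.2 log excerpt can be sketched in plain shell. This is a hypothetical illustration, not the paper's script: gzip stands in for compress (which may be absent on current Linux systems), and a printf stands in for the SAS DATA step that writes to the V9TAPE libref.

```shell
#!/bin/sh
# Producer/consumer through a FIFO: a background compressor drains the
# pipe while a writer streams data into it, as in the SAS example.
mkfifo pipech01
gzip -c < pipech01 > char201.gz &                 # background reader/compressor
printf 'observations would stream here\n' > pipech01   # writer side
wait                                              # let gzip finish char201.gz
gunzip -c char201.gz                              # round-trip check
rm -f pipech01 char201.gz
```

Opening the FIFO for reading blocks until a writer opens it, so the compressor simply waits until data starts flowing; closing the writer's end delivers end-of-file and lets the compressor finish the output file.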