The SAS Data Set Compressor Application

Yonghong Shang, Westat, Rockville, MD
Barbara Gustavson, Westat, Rockville, MD

ABSTRACT

This paper presents a SAS data set compressor application (SDSCA) that was developed to automatically compress all SAS files in a project directory and its subfolders, using the appropriate compression algorithm for each file. Choosing the correct algorithm is critical for condensing the file size and reducing the cost of storage. Specifying the COMPRESS= option as a system option or as a LIBNAME option applies only one algorithm to all files created in the SAS session or library, and thus is not an optimal choice.

The selection of the algorithm is influenced by the types of variables and the record lengths in the SAS data set. A comparison is made among the sizes of the original SAS file and its two compressed SAS files created by applying each of the compression algorithms, CHAR and BINARY. The smallest file is then saved as the final output file. Using the SDSCA could result in significant reductions in disk space, and thus storage costs.

INTRODUCTION

When the charges for SAS data file storage on the network are a significant portion of the project budget, and data sets in the project’s directories cannot be archived and moved off-line, compressing the files in place can be an option for reducing costs.

There are many general-purpose compression applications, such as the easy-to-use zipping software PKZIP, 7-Zip, WinZip, and WinRAR, which may be used to reduce the size of SAS data sets. According to outside testing, these are very efficient, with compression ratios close to 95% (depending on the data in the files). WinZip is used at Westat to zip SAS files in project directories that are rarely retrieved. However, zipped SAS data sets cannot be used directly as input to SAS procedures or DATA steps, so general zipping software is not suitable for frequently used project folders. In contrast, data sets compressed by SAS can be opened, read, and modified without any extra code, options, or steps. SAS compression is therefore the better option for reducing the size of these files.

METHODOLOGY

The methodology for the design and programming of the SDSCA addresses the following issues.

• Where should the SAS compression option be specified in the SDSCA?
• How should the SAS compression algorithm be chosen in the SDSCA?
• Is compressing a given SAS data set actually optimal in terms of reducing space usage?

SPECIFY THE SAS COMPRESSION OPTION

SAS provides three ways to specify the SAS compression option: system option, library option, and data set option. The COMPRESS= system option compresses all data sets created during the current SAS session. The COMPRESS= option in the LIBNAME statement compresses all data files created in a particular SAS library. The COMPRESS= data set option compresses only the data set for which the option is specified.
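As a minimal sketch of these three levels (not code taken from the SDSCA), the statements below set COMPRESS= as a system option, as a LIBNAME option, and as a data set option. The libref PROJ, the data set name CLASS_COPY, and the use of the WORK directory as a stand-in for a project folder are assumptions made for illustration.

```sas
/* 1. System option: applies to data sets created later in this session */
options compress=char;

/* 2. LIBNAME option: applies to data sets created in this library.
      PROJ is a hypothetical libref; the WORK directory stands in
      for a project folder. */
libname proj "%sysfunc(pathname(work))" compress=binary;

/* 3. Data set option: applies to this one data set only and
      takes precedence over the LIBNAME and system settings */
data proj.class_copy(compress=no);
   set sashelp.class;
run;

/* Turn the system option back off and release the hypothetical libref */
options compress=no;
libname proj clear;
```

Because the data set option takes precedence over the other two levels, it is the only one of the three that lets a program decide file by file.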

To minimize storage costs, individual data sets in the project directory need different SAS compression algorithms (CHAR or BINARY), depending on their predominant variable types and observation lengths. In addition, SAS compression is not beneficial for every data set: for some, the file size is inflated rather than reduced after compression. Therefore, the SDSCA applies the COMPRESS= data set option, which gives explicit control over which data sets in the program are compressed, with which algorithm (CHAR or BINARY), and which are left uncompressed.

CHOOSE THE OPTIMAL SAS COMPRESSION ALGORITHM

There are two SAS compression algorithms the programmer can choose from, depending on the variety of data set attributes. They are the Run Length Encoding algorithm (RLE) and the Ross algorithm (RDC).

When COMPRESS=YES|CHAR is chosen to compress the data set, SAS uses RLE, which compresses observations by reducing repeated consecutive characters (including blanks) to two- or three-byte representations. Generally, data sets containing mostly long character variables are excellent candidates for compression using the RLE algorithm.

When COMPRESS=BINARY is applied, SAS uses the RDC algorithm to compress the SAS data set. This method is highly effective for compressing medium to large (several hundred bytes or larger) blocks of binary data, that is, numeric variables. RDC combines run-length encoding and sliding-window compression to compress the file.

Choosing the correct SAS compression algorithm to use is critical in terms of condensing the file size and reducing the cost of storage. In the SDSCA, a comparison is made among the sizes of the original SAS file and its two compressed SAS files created by applying each of the compression algorithms, CHAR and BINARY; and then the compression algorithm is dynamically chosen that outputs the smallest file size for every data set.
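The comparison can be illustrated with a short sketch. It is not the SDSCA code itself: SASHELP.CARS and the CARS_N, CARS_C, and CARS_B names are illustrative choices, and the sketch assumes a SAS release in which DICTIONARY.TABLES exposes the FILESIZE column.

```sas
/* Write three copies of one table: uncompressed, RLE (CHAR), and RDC (BINARY).
   SASHELP.CARS and the CARS_* names are illustrative choices. */
data work.cars_n(compress=no)
     work.cars_c(compress=char)
     work.cars_b(compress=binary);
   set sashelp.cars;
run;

/* List the three files, smallest first */
proc sql;
   select memname, compress, filesize
   from dictionary.tables
   where libname = 'WORK'
     and memname in ('CARS_N', 'CARS_C', 'CARS_B')
   order by filesize;
quit;
```

The first row of the listing identifies the candidate to keep. Which algorithm wins depends on the data, which is why the choice is made per data set rather than once per session.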

DETERMINE WHETHER A SAS DATA SET SHOULD BE COMPRESSED

During SAS compression, each observation in the resulting compressed data set requires some space for compression overhead. The overhead varies for each operating environment, but is typically between 8 and 12 bytes per observation. When COMPRESS=YES|CHAR is requested, SAS removes repeating blanks and characters, and then adds a “tag” to each observation containing the information SAS needs to uncompress the observation when it is used in a subsequent SAS procedure or DATA step. If the values of the data set’s character variables do not have many repeating blanks and characters, there is very little for the compression algorithm to do, and compression yields no savings in file size. Specifying COMPRESS=BINARY causes SAS to remove the repeating blanks and numbers and likewise to add the decompression “tag” to each observation. When binary compression is requested, even if SAS cannot find anything for the compression algorithm to “do” with the data set, the result can still be a “bloated” data set that carries the decompression “tag”. SAS compression reduces the size of the data set only when the resulting byte-length savings exceed either 12 or 24 bytes (depending on the host) per observation.
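To see the overhead effect in practice, one option is to compress a deliberately narrow table and inspect what SAS recorded for it. This is only a sketch under stated assumptions: the table NARROW and the variable ID are made up for the example, and the result depends on the host and the SAS release. The log may report that the size increased, or it may note that compression was disabled for the data set.

```sas
/* A narrow table: one 8-byte numeric per observation, so there is
   little room to recover the per-observation overhead. */
data work.narrow(compress=binary);
   do id = 1 to 100000;
      output;
   end;
run;

/* COMPRESS shows the setting stored with the file;
   PCOMPRESS is the percent compression recorded by SAS. */
proc sql;
   select memname, compress, pcompress, obslen, nobs
   from dictionary.tables
   where libname = 'WORK' and memname = 'NARROW';
quit;
```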

Generally if the attributes of a SAS data set are unknown, it cannot be determined whether the SAS data set should be compressed or not. In the SDSCA, the goal is to get the smallest file size among the original file and the files compressed by using the BINARY and CHAR options.

LOGICAL DESIGN USED IN SDSCA

The logical design provides the architectural blueprint that details the application development process. It aids in conceptualizing what is to be developed and describes how the components in the domain relate to one another. Its role is critical because it shows the developer where a breakdown may occur while translating requirements to code. Figure 1 shows the logical design used in the SDSCA and guided its programming.

Figure 1. Logical Design

[Flowchart not reproduced. Recoverable flow: INPUT (full path of the project folder); allocate the external files into a virtual text file; process the virtual text file to create the macro variables; call %COMPRESSFILE through a %DO loop if there are multiple files, or call it directly if there is a single file; OUTPUT (the smallest SAS data files, compressed or uncompressed). The %COMPRESSFILE macro compresses individual SAS files, the %COMPLIBRARY macro is called to execute the three processing components, and the design closes with testing and tuning.]

PROCESSING COMPONENTS IN THE SDSCA

The macro %COMPRESSFILE is the core process in the SDSCA. It can be executed as many times as needed based on the number of SAS data files in the project directory.

HOW THE MACRO ‘%COMPRESSFILE’ WORKS

%COMPRESSFILE (INDATA = ,   /* the data name */
               INLIB  =     /* the library name of the data */
              );

The macro %COMPRESSFILE performs the following functions:

1. Creates the interim data sets &indata.__C and &indata.__B for each of the files under the project directories, where &indata.__C is compressed with COMPRESS=YES|CHAR and &indata.__B is compressed with COMPRESS=BINARY.
2. Uses DICTIONARY.TABLES to capture selected information in the current SAS session, such as library reference name, member name and type, number of observations, observation length, number of variables, compression, and buffer size.
3. Selects the COMPRESS= option value, as well as the file name and library reference name, with the minimum file size (FILESIZE) by comparing the sizes of the original SAS file and its two compressed SAS files created by applying each of the compression algorithms, CHAR and BINARY, to that file.
4. Creates the macro variables for the variables selected in step 3.
5. Creates the compressed file if needed, and outputs it to the original folder. If compressing does not yield a smaller file, there is no output.
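The five steps above can be sketched as a macro. This is a hedged illustration, not the authors' implementation: the macro name compressFileSketch, the macro variables bestMem, bestComp, and bestSize, and the demo data set are hypothetical, and the sketch assumes that DICTIONARY.TABLES exposes FILESIZE. It also ignores details a production version would need, such as indexes, integrity constraints, passwords, and member names too long to take the __C and __B suffixes.

```sas
%macro compressFileSketch(indata=, inlib=);
   %local bestMem bestComp bestSize;

   /* Step 1: interim copies, one per algorithm */
   data work.&indata.__C(compress=char)
        work.&indata.__B(compress=binary);
      set &inlib..&indata;
   run;

   /* Steps 2-4: read FILESIZE from DICTIONARY.TABLES. INTO keeps the
      first row only, which after sorting is the smallest file
      (an uncompressed file is listed first when sizes tie). */
   proc sql noprint;
      select memname, compress, filesize
         into :bestMem, :bestComp, :bestSize
         from dictionary.tables
         where (libname = upcase("&inlib") and memname = upcase("&indata"))
            or (libname = 'WORK'
                and memname in (upcase("&indata.__C"), upcase("&indata.__B")))
         order by filesize, compress desc, memname;
   quit;
   %let bestMem  = &bestMem;    /* strip padding added by INTO */
   %let bestComp = &bestComp;
   %let bestSize = &bestSize;

   /* Step 5: overwrite the original only if an interim copy is smaller */
   %if %upcase(&bestMem) ne %upcase(&indata) %then %do;
      data &inlib..&indata(compress=&bestComp);
         set work.&bestMem;
      run;
      %put NOTE: &inlib..&indata rewritten with COMPRESS=&bestComp (&bestSize bytes).;
   %end;
   %else %put NOTE: &inlib..&indata left as is.;

   /* Remove the interim copies */
   proc datasets lib=work nolist;
      delete &indata.__C &indata.__B;
   quit;
%mend compressFileSketch;

/* Usage on a throwaway copy of SASHELP.CARS */
data work.demo;
   set sashelp.cars;
run;
%compressFileSketch(indata=demo, inlib=work)
```

Sorting by FILESIZE and reading only the first row keeps the selection logic in one query, at the cost of writing two full interim copies of every data set.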

The macro %COMPRESSFILE is called within a %DO loop, once for each SAS data set name in the project folder.

ALLOCATE THE EXTERNAL SAS FILES INTO A VIRTUAL TEXT FILE

Instead of a programmer manually entering each file name from the project directory in FILENAME statements, an unnamed pipe (the PIPE device type in the FILENAME statement) redirects the output of the DOS ‘DIR’ command to a virtual text file. This single FILENAME statement automatically captures all the data sets in the project directory. The code for using the unnamed pipe with the DOS command ‘DIR’ is:

filename DIRLIST pipe " DIR /b /s ""&dirPath\*.sas7bdat""" ;

Key points:

• FILENAME is a SAS keyword. It is used to assign the FILEREF “DIRLIST” to a PIPE.
• The PIPE option directs the output from the ‘DIR’ command to a virtual text file, referenced by DIRLIST, that can be accessed in the DATA step.
• The DOS command ‘DIR’ with the option ‘/B’ produces a bare listing of file names, with no heading or summary lines. The option ‘/S’ is added to include the files in all the subfolders of the project directory, so each record holds the full path of one file.
• &dirPath is a macro variable holding the path of the project directory selected for this process. The path is enclosed in double quotes in the DIR command so that spaces in the file path are handled correctly.
• The ‘*.sas7bdat’ pattern is appended to the path of the project folder to filter out non-SAS files.
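Before building on the pipe, it can help to confirm that it returns records at all. The following is a small check, assuming a Windows host on which SAS is allowed to run operating system commands (the XCMD option); using the WORK directory as a stand-in for the project folder and the PIPE_DEMO data set are assumptions made for the example.

```sas
/* Assumption: the WORK directory stands in for the project folder */
%let dirPath = %sysfunc(pathname(work));

/* Make sure at least one .sas7bdat file exists there */
data work.pipe_demo;
   x = 1;
run;

filename DIRLIST pipe "DIR /b /s ""&dirPath\*.sas7bdat""";

/* Echo every record returned by the pipe to the log */
data _null_;
   infile DIRLIST truncover;
   input line $char1024.;
   put line=;                 /* one full path per record */
run;

filename DIRLIST clear;
```

If the log reports that 0 records were read, the problem lies with the pipe or with how SAS was started rather than with the later processing steps.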

PROCESS THE VIRTUAL TEXT FILE TO CREATE THE SERIES OF MACRO VARIABLES

Once the FILEREF ‘DIRLIST’ is created, it is read by a SAS DATA step as if it were a text file. The following code creates the temporary data set ‘listDirs’ by reading the ‘text file’.

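data listDirs ;
   infile DIRLIST length=reclen ;       /* reclen = length of the current record */
   input line $varying1024. reclen ;    /* read the whole record as one value */
   length path filename $ 200 ;
   retain path ;
   filename = scan(line, -1, '\') ;     /* last segment = file name */
   /* the part of the record that precedes the file name */
   path = substr(line, 1, (length(line) - length(filename))) ;
run;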

After executing the DATA step “data listDirs”, the newly created data set (‘listDirs’) looks like the following.

Line                                    Path                      Filename
S:\data\test1.sas7bdat                  S:\data                   test1.sas7bdat
S:\data\test2.sas7bdat                  S:\data                   test2.sas7bdat
S:\data\test3.sas7bdat                  S:\data                   test3.sas7bdat
......
S:\data\old\pers.sas7bdat               S:\data\old               pers.sas7bdat
S:\data\old\base.sas7bdat               S:\data\old               base.sas7bdat
......
S:\data\output\output1\roca.sas7bdat    S:\data\output\output1    roca.sas7bdat
......

The LIBREFs are generated since the SAS files are associated with different libraries.

/** Generate LIBREFs **/
proc sort data=listDirs(keep=path) out=listDirs2 nodupkey;
   by path;
run;

data libnames ;
   length libname $ 8;
   set listDirs2;
   libname = catt('__', _n_);
   rc = libname(libname, path);
run;
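The LIBNAME function returns a code that the step above stores in rc but does not inspect. A minimal sketch of checking it follows, assuming a hypothetical libref __DEMO and using the WORK directory as a stand-in for a project subfolder. A nonzero value is not always a failure (it can accompany a note or warning), so the sketch only reports the message text.

```sas
data _null_;
   length msg $ 200;
   /* __DEMO is a hypothetical libref; the WORK directory stands in
      for a project subfolder */
   rc = libname('__demo', pathname('work'));
   if rc ne 0 then do;
      msg = sysmsg();            /* text of the note, warning, or error */
      put 'LIBNAME returned ' rc= msg=;
   end;
   else put 'Libref __DEMO assigned.';
   rc = libname('__demo');       /* clear the libref again */
run;
```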

Every SAS data set is assigned to its own library reference by merging the temporary data set ‘listDirs’ with the data set ‘libnames’.

/** Assign each SAS file with its own library reference **/
proc sort data=listDirs out=listDirs3(keep=path filename) ;
   by path;
run;

proc sort data=libnames(keep=path libname) ;
   by path;
run;

data fulldata;
   merge listDirs3(in=aa) libnames(in=bb);
   by path;
   if aa and bb;
run;

A series of macro variables is then created for use by the %DO loop in the next step.

/** Create series of macro variables **/
data _null_;
   set fulldata;
   length filename2 $32;   /* SAS member names can be up to 32 characters */
   filename2 = scan(filename, 1, '.');

   /* create a series of macro variables (inFName1, inFName2, inFName3, ...) to hold each of the file names */
   /* CALL SYMPUTX strips blanks and avoids numeric-to-character conversion notes */
   call symputx(cats("inFName", _n_), filename2);

   /* create a series of macro variables (orgLib1, orgLib2, orgLib3, ...) to hold the library name corresponding to each file */
   call symputx(cats("orgLib", _n_), libname);

   /* create one macro variable "obsNum" to hold the number of file names */
   call symputx("obsNum", _n_);
run;

%DO LOOP

Using a %DO loop to call the macro %COMPRESSFILE for each SAS data set in the project directory is a simple way to drive the processing dynamically, without hard-coding any file names.

%do i=1 %to &obsNum;
   %compressFile( indata=&&inFName&i,
                  inlib=&&orgLib&i );
%end;
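The double ampersand delays resolution: on the first pass && becomes & and &i becomes the loop index, and on the second pass the resulting reference (for example &inFName1) is resolved. A small sketch, with a hypothetical macro name and made-up values, shows this in the log:

```sas
%macro showResolve;
   /* Made-up values standing in for the series created from the file list */
   %let inFName1 = test1;
   %let inFName2 = pers;
   %let obsNum   = 2;

   %do i = 1 %to &obsNum;
      /* &&inFName&i -> &inFName1 -> test1 (when i is 1) */
      %put NOTE: i=&i resolves to &&inFName&i;
   %end;
%mend showResolve;

%showResolve
```

Because %DO is valid only inside a macro definition, the loop itself must live in a macro such as %COMPLIBRARY.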

PROGRAM THE MACRO %COMPLIBRARY

To automatically execute the three processing components discussed above, a macro called %COMPLIBRARY was developed. The macro %COMPLIBRARY is called with only one parameter, ‘DIRPATH’: the full path of the project directory.

%COMPLIBRARY(dirPath = /* the name of full path of the project directory */)

CODE TESTING AND TUNING

An error message appeared in the log when the SDSCA was first executed:

……
Stderr output:
There is not enough space on the disk.
NOTE: 0 records were read from the infile DIRLIST.

OR

……
Stderr output:
stdout: Bad file descriptor
NOTE: 0 records were read from the infile DIRLIST.

This error message is not caused by anything in the programs but by invoking SAS through the desktop shortcut. To work around this issue, do one of the following:

• Open SAS from either the Start menu or the taskbar. Invoking SAS from a pinned location prevents the error.

• For Windows 7, edit the SAS shortcut to use the following command:
C:\Windows\System32\cmd.exe /c "C:\Program Files (x86)\sas92\SASFoundation\9.2(32-bit)\sas.exe -CONFIG C:\Program Files (x86)\sas92\SASFoundation\9.2(32-bit)\nls\en\SASV9.CFG"

PERFORMANCE TESTING

Figure 2 shows the statistics based on all the SAS data sets from the project directory before running the SDSCA.

Figure 2

The MEANS Procedure

Analysis Variable : SIZEMB

 N    Minimum    Maximum    Mean     Median    Sum
75    0.009      46.406     5.442    2.359     408.121

Figure 3 provides the same statistics after running the SDSCA. For the project directory presented here, of the 75 SAS files in the main folder and subfolders, 28 files were compressed with COMPRESS=BINARY; 5 files were compressed with COMPRESS=CHAR; and 43 files were left uncompressed because neither algorithm reduced the size of the SAS data file.

Figure 3

The MEANS Procedure

Analysis Variable : SIZEMB

 N    Minimum    Maximum    Mean     Median    Sum
75    0.009      19.219     3.038    0.949     227.863
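Statistics of the kind shown in Figures 2 and 3 can be produced from DICTIONARY.TABLES. The sketch below is an assumption about how such a summary might be built, not the authors' code: it uses the SASHELP library as a stand-in for the project libraries, a hypothetical table WORK.SIZES, and a conversion of bytes to megabytes by dividing by 1024 squared.

```sas
/* Collect the size of every data set in a library (SASHELP as a stand-in) */
proc sql;
   create table work.sizes as
   select libname,
          memname,
          filesize / (1024 * 1024) as sizeMB   /* bytes to megabytes */
   from dictionary.tables
   where libname = 'SASHELP' and memtype = 'DATA';
quit;

/* Same statistics as in the figures */
proc means data=work.sizes n min max mean median sum maxdec=3;
   var sizeMB;
run;
```

Running the same two steps before and after compression gives a like-for-like comparison of total storage.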

SDSCA LIMITATIONS

Because the SAS data files in this test project folder are not password protected, the SDSCA ran without intervention. To use the SDSCA on password-protected SAS data files, run it in interactive mode so that SAS can prompt for the password.

Also, when applying the SDSCA to SAS data files that use custom formats, comment out the option ‘Options FmtSearch =’. ‘INFO’ messages may appear in the log stating that SAS cannot find the variable formats, but they do not affect the compression of the SAS data files.

CONCLUSIONS

The SDSCA is a user-friendly tool for compressing SAS data files in a project directory using whichever compression algorithm is optimal for each file. It compresses a SAS data file by squeezing out unnecessary bytes without losing any accuracy in numeric variables or any content of character variables. The input is the full path of the project directory whose data sets are to be compressed. If the BINARY or CHAR compression option reduces the space allocated for a file, the SDSCA saves that compressed file to the network. Otherwise, it leaves the uncompressed data set as is. The SDSCA can be widely used because it chooses the compression algorithm dynamically for each file. It can be applied to any SAS data set, with or without format libraries, to get the smallest file size.

In the case presented here, the project folder contains 32 SAS files stored in the project main folder (S:\data\); 30 SAS files in the subfolders of the project directory (S:\data\old\ & S:\data\output\); and 13 files in lower subfolders (S:\data\output\output1\ & S:\data\output\output2\). The storage used decreased from 408 MB to 228 MB, a reduction in disk space of about 44% ((408.121 - 227.863) / 408.121).

REFERENCES

‘Effectiveness and Cost of SAS® Compression’, Hitesh Sharma, http://support.sas.com/resources/papers/proceedings09/065-2009.pdf

‘Indexing and Compressing SAS® Data Sets: How, Why and Why Not’, Andrew H. Karp & David Shamlin, http://www2.sas.com/proceedings/sugi28/003-28.pdf

‘Programming Tricks For Reducing Storage And Work Space’, Curtis A. Smith, http://www2.sas.com/proceedings/sugi27/p023-27.pdf

‘Making the Most Least Of It with SAS® Compression’, Randy Herbison, http://wesinfo.westat.com/version2/computer/programming/sas/sas_users_group.cfm

The basic SAS compression usage at support.sas.com
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a001288760.htm
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000202890.htm
http://v8doc.sas.com/sashtml/lgref/z0202890.htm

DOS Command DIR Options http://www.csulb.edu/~murdock/dir.html

DISCLAIMER

The contents of this paper are the work of the authors and do not necessarily represent the opinions, recommendations, or practices of Westat.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.

ACKNOWLEDGMENTS

Special thanks go to Caroline Morganstein for her assistance and generosity in helping with this paper. We would also like to thank Michael Raithel for overall guidance and paper review. Thanks to the MEPS Data Delivery team at Westat for their support.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

Yonghong Shang
Westat
1600 Research Blvd
Rockville, MD 20850
Email: [email protected]

Barbara Gustavson
Westat
1600 Research Blvd
Rockville, MD 20850
Email: [email protected]