NESUG 17 Administration & Support

Disk Space Management Topics for SAS® in the Solaris™ Operating Environment

Adeline J. Wilcox
Delmarva Foundation for Medical Care, Inc.
Hanover, MD

1 ABSTRACT

Adding disk space to Sun® enterprise servers has become a relatively inexpensive solution. Still, these gigabytes and terabytes are finite and must be managed. I address two related topics: file compression and reporting disk space usage by userid. First, I describe a KornShell script that calls sed, awk, and SAS before producing a SAS listing showing gigabytes used by userid. Second, I performed factorial experiments in which I compared the CHAR, BINARY, and NO values of the SAS data set COMPRESS= option with each other and with the UNIX compress program. UNIX compress effectively compressed all three types of test SAS data sets but minimized file size when SAS COMPRESS=NO. Using ten replicates, I also measured the CPU time required to run PROC FREQ on one character variable from each test data set compressed the three different ways with SAS. As expected, PROC FREQ executed most quickly on SAS data sets with COMPRESS=NO. When no UNIX compression was used, SAS took, on average, 73 percent longer to execute PROC FREQ on SAS data sets with COMPRESS=CHAR than on those with COMPRESS=NO.

2 INTRODUCTION

This paper covers two related topics, file compression and reporting disk space usage by userid, both specific to running SAS on Solaris. The first topic, reporting disk space usage by userid, may be relevant to any UNIX or Linux system running the KornShell, sed, awk, and SAS. Identifying the users occupying the most disk space can help system administrators determine which users may need a reminder to delete old files, or need help archiving or compressing files. Here I use the term system administrator to mean any person trying to help manage a Sun server, not just the superuser(s).
Relevance of the second topic beyond Solaris to other Unices is unknown because I conducted my file compression experiments only on Solaris. Both time and space matter to SAS users working with large files. I ran factorial experiments to look at the effects of two different file compression treatments, SAS compression and UNIX compression, on both CPU time and the disk space occupied by the files.

3 REPORTING DISK SPACE USAGE

I've written a system of three programs named sizerept.ksh, sizerept.sed, and sizerept.sas that produces a SAS listing showing disk space usage by userid within a file system, in gigabytes. Each of these programs is listed below. Unless you are a superuser, the ability to run this system depends on read access to the file system and the good will of the other users of the file system to give you read permission on all their files and directories. Alternatively, you could ask your superuser to execute this system for you. I execute this system by invoking sizerept.ksh at my KornShell prompt with an argument such as /projdata/, where the argument names the file system on which I want a report of disk space usage. Line 14 of sizerept.ksh writes the output of the shell command df with the option -k to standard output, a file named sizerept.out, where the available disk space is reported by df in kilobytes (Gilly). With a variable named reporton, Line 15 passes the value of the string in the argument naming the file system to the KornShell environment, where it will be available to the SAS program, in which it will be assigned to a macro variable. Line 16 recursively lists all files in subdirectories under the directory named in the argument as well as all files in the top directory. File size in bytes is listed with each file name. The output of the ls -aFlR command is piped to my sed script named sizerept.sed, which deletes all records from the listing that are not file names.
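On systems with GNU findutils, the listing-and-filtering step just described can be approximated in a single command, because find can print each regular file's owner and size directly, with no directory or "total" lines to delete. This is a sketch of my own, not part of the paper's system, and GNU find's -printf is not available on stock Solaris:

```shell
# Hypothetical alternative to the ls -aFlR / sed filtering step: GNU find
# emits "owner size-in-bytes" for regular files only, then awk totals bytes
# per userid and converts to gigabytes. Not portable to stock Solaris find.
find "${1:-.}" -type f -printf '%u %s\n' 2>/dev/null |
  awk '{ bytes[$1] += $2 }
       END { for (u in bytes)
               printf "%s %.1f gigabytes\n", u, bytes[u]/1073741824 }'
```

Piping the result through sort -k2 -rn would give the same descending order by size that the system's PROC SORT step produces.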
Next, I pipe the output of my sed script to an awk command that prints only the columns containing the UNIX userid and file size to an ASCII file that my SAS program named sizerept.sas will read. The call to my SAS program comes on Line 17 of sizerept.ksh. The first line of my SAS program sizerept.sas that is not a comment is Line 13. Here SAS gets the value of the environment variable named reporton from the KornShell and assigns it to the macro variable PATHNAME. SAS reads the userid and file size fields from sizerept.txt into a WORK data set named READONE. With PROC SUMMARY, I sum disk space used in bytes BY userid. Because there are 1024 = 2^10 bytes in a kilobyte and so on, I divide total bytes by 1024^3 = 1,073,741,824 to obtain gigabytes in DATA step TWO. I use PROC PRINT to write my report to a file named sizerept.lst. This output file shows, for the file system named in the TITLE, total disk space usage by userid in gigabytes. The value PRINTed by the SUM statement on Line 33 of sizerept.sas can be compared with the output of the df command written to sizerept.out. The disk space, in kilobytes, reported as used by the df command with the -k option may be divided by 1024^2 to get a value in gigabytes approximately equal to that output by my system in sizerept.lst.

 1 #!/bin/ksh
 2 #PROGRAM: sizerept.ksh
 3 #FUNCTION: report disk space usage by user on a file system or subdirectory
 4 #INPUTS: argument naming the file system or subdirectory
 5 #OUTPUTS: sizerept.out
 6 #CALLS TO: sizerept.sed, awk, sizerept.sas
 7 #SPECIAL NOTES:
 8 #AUTHOR: Adeline Wilcox DATE: 21Jan04
 9 #UPDATED: DATE:
10 date
11 exec 1>sizerept.out
12 exec 2>&1
13 df -k >&1
14 export reporton=$(print $1)
15 ls -aFlR $1 | sizerept.sed | awk '{ print $3" "$5 }' > sizerept.txt
16 sas sizerept -log sizerept.log
17 if (($? > 0))
18 then
19    print "Something went wrong with sizerept.sas"
20    print "Something went wrong with sizerept.sas" > /dev/tty
21    exit
22 fi
23 rm sizerept.txt
24 date

 1 #!/bin/sed -f
 2 #PROGRAM: sizerept.sed
 3 #FUNCTION: delete records not representing files
 4 #INPUTS:
 5 #OUTPUTS:
 6 #CALLS TO: nothing
 7 #SPECIAL NOTES:
 8 #AUTHOR: Adeline Wilcox DATE: 15Jan04
 9 #UPDATED: DATE:
10 /^d/d
11 /^$/d
12 /^total/d
13 /^\/projdata/d

 1 options ps=67 ls=80 noovp nostimer dkricond=error dkrocond=error nomautosource;
 2 /*****************************************************************************/
 3 /*PROGRAM: sizerept.sas
 4 /*FUNCTION: report on disk space usage
 5 /*INPUTS: sizerept.txt
 6 /*OUTPUTS: sizerept.log, sizerept.lst
 7 /*CALLS TO: nothing
 8 /*SPECIAL NOTES:
 9 /*AUTHOR: Adeline Wilcox DATE: 15Jan04
10 /*UPDATED: DATE:
11 /*****************************************************************************/
12 %let pathname=%sysget(reporton);
13 filename gotthis '~/sizerept.txt';
14 data readone; infile gotthis truncover;
15 input userid $ bytes;
16 run;
17 proc summary data=readone;
18 class userid;
19 var bytes;
20 ways 1;
21 output out=sumone sum=totbytes;
22 run;
23 data two; set sumone(rename=(_FREQ_=numfiles) drop=_TYPE_);
24 gigs=round(totbytes/1073741824,4.1);
25 run;
26 proc sort data=two; by descending gigs;
27 run;
28 options nonumber;
29 title1 "Disk space use in &pathname by userid";
30 title2 "excludes any subdirectories that Adeline can't read";
31 proc print data=two;
32 sum gigs;
33 run;

4 FILE COMPRESSION ALGORITHMS

On the Sun Enterprise server on which I work, I've found man pages for four file compression algorithms: gzip, compress, zip, and bzip2. More information about these algorithms may be found in the March 2001 issue of the now defunct periodical Server/Workstation Expert.
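As a rough way to compare these tools on your own system, the sketch below (my own, not from the paper, using made-up file names and wc -c rather than ls -l for sizes) compresses copies of one sample file with whichever of the tools are installed and reports the resulting sizes:

```shell
# Quick benchmark sketch: compress copies of a sample file with each
# available tool and report sizes in bytes. Tools that are not installed
# (compress is often absent on Linux) are simply skipped.
f=$(mktemp)
seq 1 100000 > "$f"                     # repetitive sample data
printf 'original %s bytes\n' "$(wc -c < "$f")"
for tool in gzip bzip2 compress; do
  command -v "$tool" >/dev/null 2>&1 || continue
  cp "$f" "$f.copy"
  "$tool" -f "$f.copy"                  # writes $f.copy.gz, .bz2, or .Z
  packed=$(ls "$f.copy".* 2>/dev/null | head -1)
  [ -n "$packed" ] && printf '%s %s bytes\n' "$tool" "$(wc -c < "$packed")"
  rm -f "$f.copy" "$f.copy".*
done
rm -f "$f"
```

On highly repetitive data like this, all three tools shrink the file substantially; the relative rankings Copeland and Haemer describe are easier to see on large, mixed-content files.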
http://swexpert.com/C9/SE.C9.MAR.01.pdf

In their piece titled "Squishing Data", Jeffreys Copeland and Haemer state that gzip is more efficient than the older compress algorithm. They report that the man page for the even newer bzip2 algorithm says that it is faster than gzip. For reasons explained by Copeland and Haemer, the zip algorithm is less efficient than gzip. On Solaris, I've had trouble running gzip and bzip2 on larger files. bzip2 has given me

bzip2: Input file yourfile.sas7bdat doesn't exist, skipping.

and gzip has told me

yourfile.sas7bdat: Value too large for defined data type

returning an exit status of 1, indicating an error. Unlike gzip, zip, and bzip2, compress is large-file aware. So for the last two years, I've stuck with the compress algorithm and successfully compressed every file. An even newer compression tool named rzip may be found on some systems (Zawodny).

According to the man page for compress on Solaris, compress uses "adaptive Lempel-Ziv coding". The SAS documentation says that SAS COMPRESS=CHAR uses run-length encoding while SAS COMPRESS=BINARY uses both run-length encoding and sliding-window compression. Where SAS compression has been used, SAS logs may contain a seemingly authoritative NOTE such as the one I found after running a SAS program:

NOTE: Compressing data set BIGSPACE.UNCH0 decreased size by 12.71 percent.

Actually, SAS only estimates the percent change in file size resulting from SAS compression (Thacher). When I looked at the file size with the shell command ls -l, I saw that SAS COMPRESS=CHAR reduced the file size by nearly 18 percent. The reported increase in size resulting from SAS COMPRESS=BINARY was 16.31 percent, but the increase in file size found by running the ls -l command was under 10 percent. While this difference surely isn't important in daily work, it may be of interest in compression benchmarking. In this paper, all measures of file size were obtained from Solaris, not SAS.
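The cross-check described above is easy to script: given the byte counts that ls -l (or wc -c) reports for a data set before and after compression, compute the actual percent change and compare it with the NOTE in the SAS log. A minimal sketch, with illustrative sizes rather than figures from my experiments:

```shell
# Percent change in file size from two byte counts, as a cross-check on
# the estimate SAS prints in its log NOTE. The sizes passed below are
# hypothetical (1 GB shrinking to 840 MB), not measurements from the paper.
pct_change() {
  awk -v before="$1" -v after="$2" \
      'BEGIN { printf "%.2f\n", (before - after) / before * 100 }'
}
pct_change 1073741824 880803840    # prints 17.97, i.e. an 18 percent decrease
```

A negative result would indicate that compression grew the file, as can happen with COMPRESS=BINARY on unsuitable data.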