Optimization Techniques for Processing Large Files in an OpenVMS Environment

Adrien Ndikumwami, Regional Economic Studies Institute, Towson, MD

ABSTRACT

Despite the recent technological advances made in the computer industry, both in terms of enhanced processing power and larger storage devices, the demand for these resources keeps increasing considerably. Unlike in the past, large data files spanning several gigabytes are now routinely processed to generate regular and ad hoc reports. Such regular use of large files is forcing programmers to look for new techniques that enable optimal use of scarce computer resources. While the SAS® system provides advanced capabilities to handle and manipulate large files, an appreciation of the host operating system by the programmer is essential to use scarce computer resources efficiently. For large data files, the way data are read, processed, and stored can make a great deal of difference in the optimal utilization of resources. This paper examines the issues that a SAS programmer needs to address while processing large files in an OpenVMS environment. Optimization techniques related to creating and setting datasets, sorting, indexing, compressing, running jobs in batch mode, and certain other host-specific issues are explored as they apply to an OpenVMS environment.

I. INTRODUCTION

It seems paradoxical that despite the availability of higher processing power and larger storage space, there is an ever growing need for these resources. The reality is that, unlike in the past, large data files spanning several gigabytes are now routinely processed to generate regular and ad hoc reports. Most of these operational data files could not previously be used on a regular basis due to computing resource constraints. The proliferation of large data files, and the increasing need to access such files regularly for ad hoc reporting, micro-simulation, trend analysis and data mining activities, pose new challenges to the programmer. Programmers constantly look for alternative methods to make optimal use of existing computer resources. While the SAS system provides advanced capabilities to handle and manipulate large files, an appreciation of the host operating system by the programmer is essential to manage scarce computer resources efficiently. While some of the techniques discussed in this paper are applicable to all operating systems, the focus of this paper is on processing large data files in an OpenVMS environment.

Optimization techniques that are relevant to most operating systems are explored first, along with their OpenVMS-specific features. In the later sections, certain techniques that are either unique or more relevant to an OpenVMS system are discussed. Wherever appropriate, examples are provided to demonstrate the use of these techniques. Costs and benefits with regard to the use of computer resources (I/O, disk storage, CPU and memory) are also discussed for each optimization technique.

II. OPTIMIZATION TECHNIQUES THAT ARE RELEVANT TO OTHER OPERATING SYSTEMS

1. Reducing dataset observation length

Reducing the observation length of a dataset reduces I/O and thereby shortens the elapsed time, as the amount of input data decreases. There are several methods that help decrease the length of an observation. The more popular techniques are the DROP= and KEEP= dataset options, the LENGTH statement and the data compression feature.

a. KEEP= and DROP= Dataset Options

While creating a dataset, you should keep only the variables needed for further or future processing. You can accomplish this by using the KEEP= or DROP= dataset options. By reducing the number of variables in a large dataset, you are actually reducing the amount of data to be read or written, which results in a significant reduction of I/O and elapsed time. This technique is beneficial when the variables being dropped are no longer needed or can be recreated from the remaining variables. If you are unsure whether to retain or drop certain variables, keep them: the cost of recreating dropped variables far outweighs the benefit of dropping them. Assuming dataset ONE has variables A, B, C, X, Y, Z, the following examples demonstrate the use of the DROP= and KEEP= options:

DATA TWO;
  SET ONE (DROP=C Z);
RUN;

or

DATA TWO;
  SET ONE (KEEP=A B X Y);
RUN;

b. The LENGTH Statement

LENGTH statements may be used in a DATA step to reduce the number of bytes used for storage. The default length of numeric variables in the SAS system is 8 bytes. Often, disk storage is wasted because a smaller length could be used to store the same value that is stored in 8 bytes, without compromising precision.
See Table 1 for the largest integer that can be represented exactly at each length. Using the LENGTH statement can significantly reduce storage space, although CPU time may increase slightly. This feature comes in handy when a dataset contains numeric variables with small values. The LENGTH statement is not recommended for storing fractional numeric values, because precision may be lost.

Table 1: Largest integer that can be stored exactly by SAS variables on AXP OpenVMS, by variable length

Length (bytes)    Largest integer represented exactly
3                 8,191
4                 2,097,151
5                 536,870,911
6                 137,438,953,471
7                 35,184,372,088,831
8                 9,007,199,254,740,991

The following is an example of how to use the LENGTH statement:

DATA ONE;
  LENGTH X Y Z 4;
  INPUT X Y Z;
RUN;

c. Dataset Compression

The SAS system is equipped with a powerful compression algorithm. It treats an entire observation as a string of bytes, ignoring variable types and boundaries. Data compression is most helpful when a dataset has repeating numeric or character data. Compression reduces I/O but uses more CPU time; in addition, disk storage space is reduced by the compression factor. If your site does not have many users, does not carry other CPU-intensive applications, or CPU time is free (no monetary charge for using the CPU), then compression is an ideal technique. Over the last decade, the speed of OpenVMS processors has increased by a factor of more than 20, while over the same period disk I/O systems have merely sped up by a factor of 2; shifting the I/O pressure to the CPU is the best way to work around this imbalance. However, if the dataset does not have many repeating values, you should avoid compressing, because performance may get worse. Under certain circumstances, SAS compression may actually increase the size of a large dataset; in that case the I/O, the CPU time and the elapsed time all increase. You can use SAS compression as a global option, as in the following example:

OPTIONS COMPRESS=YES;

Alternatively, you can use it as a dataset-specific option to compress a particular dataset, as in the following example:

LIBNAME X '[ ]';

DATA X.BIGDATA (COMPRESS=YES);
  INPUT VAR1-VAR100 $200.;
RUN;

2. Sorting

Sorting may consume a considerable amount of computing resources; in many cases, a program will fail simply because those resources have not been allocated properly. By default, sorting on AXP OpenVMS is routed to SYS$SCRATCH, which in turn points to SYS$LOGIN. Typically, SYS$LOGIN resides on a quota-enabled disk that may not have enough space to sort a large dataset. For large datasets, it is advisable to use the host sort. Sorting requires scratch space of about 2 to 3 times the size of the input dataset. If your LOGIN disk has a small quota and you need to process a large file, the best approach is to point SYS$SCRATCH to a disk with more space. The following command defines another sort area and should be included in LOGIN.COM:

$ DEFINE/PROCESS SYS$SCRATCH -
  DKA102:[SORTAREA]

Starting with AXP OpenVMS maintenance release TS048 of the SAS system, a new SORTWORK option directs the SAS System to create up to 6 SORTWORK logicals that will be used by the OpenVMS host sort for temporary work space. The SORTWORK option is used as follows:

LIBNAME SWORK1 'DKB2:[SORTAREA1]';
LIBNAME SWORK2 'DKB2:[SORTAREA2]';

OPTIONS SORTWORK=(SWORK1, SWORK2);

The SORTWORK option accepts either SAS librefs assigned in a LIBNAME statement or OpenVMS paths, which must be enclosed in single or double quotes. To prevent SAS from using the SORTWORK areas, use the following statement prior to sorting:

OPTIONS NOSORTWORK;

The SAS system is shipped with two command procedures, SORTCHK.COM and SORTSIZE.COM, located in the SAS$ROOT:[USAGE] directory. These procedures can help in gathering information about the system; their usage is explained at the beginning of each file.
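The section above recommends the OpenVMS host sort for large datasets. One way to request it explicitly from within SAS is the SORTPGM= system option; the following is a minimal sketch (SORTPGM=HOST is assumed from the OpenVMS SAS Companion, and the dataset and variable names are illustrative):

```sas
/* Route sorting to the OpenVMS host sort instead of the   */
/* internal SAS sort (SORTPGM= is a host system option).   */
OPTIONS SORTPGM=HOST;

PROC SORT DATA=X.BIGDATA OUT=X.SORTED;
  BY VAR1;
RUN;
```

The host sort honors the SYS$SCRATCH and SORTWORK definitions discussed above, so the scratch space lands on the disk you pointed them to.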
3. Indexing

Indexing can enhance performance when you create a large dataset that will be read many times using a WHERE clause or BY-group processing. When an index exists, an observation can be read directly. Without an index, the SAS system starts from the top of the dataset and reads all observations sequentially; only then does it apply the WHERE clause or BY-group statement. It is recommended to avoid indexing a dataset that is constantly rewritten or updated, or a dataset that needs to be read in its entirety: indexing consumes more resources than sequential reading when a dataset is read without subsetting. However, if you are subsetting with a WHERE clause or BY-group statement, both the I/O and the elapsed time are reduced. The following is an example of indexing:

LIBNAME EMP '[ ]';
PROC DATASETS LIBRARY=EMP;
  MODIFY EMPLOYEE;
  INDEX CREATE SSNO;
  INDEX CREATE BTHDATE;
RUN;

The above code generates two simple indexes on the EMPLOYEE dataset, one for each of the following two variables: social security number and birth date.
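Once the indexes exist, a subsetting WHERE clause can take advantage of them. A minimal sketch (the SSNO value and the variable's character type are hypothetical):

```sas
LIBNAME EMP '[ ]';

/* The WHERE clause below can use the SSNO index to locate      */
/* matching observations directly instead of reading EMPLOYEE   */
/* sequentially from the top.                                   */
DATA ONESSN;
  SET EMP.EMPLOYEE;
  WHERE SSNO = '123456789';
RUN;
```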
4. Dataset Space Preallocation

Starting from the second 6.09 maintenance release for OpenVMS on AXP, the SAS System initially allocates 129 disk blocks for a data set. This initial allocation is called ALQ, or allocation quantity. Each time the data set is extended, another 384 blocks must be allocated on the disk. This allocation is called DEQ, or default file extension quantity. OpenVMS maintains a bit map file on each disk (BITMAP.SYS) that identifies the blocks that are available for use. When a data set is written and then extended, OpenVMS alternates between scanning the bit map to locate free blocks and actually writing the data set. However, if the data set is written with larger initial and extent allocations, writes to the data set proceed uninterrupted for longer periods of time. At the hardware level, the movement of the disk heads between the bit map and the data set is minimized; the result is a reduction in I/O and elapsed time. Because large initial extents preallocate the space reserved for a dataset, disk fragmentation is also reduced, thereby cutting the time needed to read the dataset. The recommended ALQ= value is the size of the dataset. In cases of uncertainty, an underestimated ALQ= can be used, and the DEQ= value can be used for extents. The value of ALQ= ranges from 0 to 2,147,483,647 and the value of DEQ= ranges from 0 to 65,535. For example:

DATA X.BIGFILE (ALQ=750000 DEQ=25000);
  INPUT VAR1-VAR100 $200.;
RUN;

Caution must be exercised not to use ALQ= or DEQ= values that are incompatible with the data set size, for they may result in performance degradation. The size of the dataset may be estimated from the number of observations, the number of variables and the variable lengths.
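As a rough sketch of such an estimate (the observation and variable counts below are hypothetical), a dataset of 1,000,000 observations with 100 numeric variables of 8 bytes each holds about 800,000,000 bytes of data, or 1,562,500 512-byte OpenVMS blocks, suggesting an ALQ= value of that order. The arithmetic can be done in a DATA step:

```sas
/* Rough ALQ= estimate: observations x variables x bytes per    */
/* variable, converted to 512-byte OpenVMS disk blocks.         */
/* The counts below are hypothetical; substitute your own.      */
DATA _NULL_;
  NOBS   = 1000000;              /* number of observations */
  NVARS  = 100;                  /* number of variables    */
  VARLEN = 8;                    /* bytes per variable     */
  BLOCKS = CEIL(NOBS * NVARS * VARLEN / 512);
  PUT 'ESTIMATED ALQ= ' BLOCKS;  /* about 1,562,500 blocks */
RUN;
```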
III. OPTIMIZATION TECHNIQUES MORE RELEVANT OR UNIQUE TO OPENVMS

1. Redirecting the SAS Work Library

By default, the SAS work library is created in the directory where the program is running. Due to disk storage limitations, there may not be enough space there. There are two ways to tackle this problem. The first is to use the WORK= option at SAS invocation. For example:

$ SAS/WORK=DKA102:[WORKAREA] MYPROG.SAS

The other way is to define the logical SAS$WORKROOT and point it to the directory that will be used for the SAS data library. It is recommended that the system manager set up a disk for that purpose and define a system-wide logical pointing to the work area. The logical should be defined as follows:

$ DEFINE/SYSTEM -
  SAS$WORKROOT DKA102:[WORKAREA]

This disk must be defragmented periodically and the work files cleaned regularly. To use the cleanup utility, first declare a DCL symbol CLEANUP as follows:

$ CLEANUP :== $SAS$ROOT:[PROCS]CLEANUP.EXE

To use it, just issue the following command:

$ CLEANUP SAS$WORKROOT

2. Removing Unnecessary Datasets from the Work Library

If the SAS workspace is limited in disk space allocation, any temporary dataset that was created and is not used later in the program takes up disk storage unnecessarily. The SAS system gets rid of unused datasets through PROC DATASETS. The following is an example of the use of PROC DATASETS:

PROC DATASETS MEMTYPE=(CATALOG,DATA);
  DELETE ONE;
RUN;

In this example, the catalog and dataset ONE are permanently removed from the WORK library even before the program terminates.

3. Purging Old Versions of Datasets

Most newcomers to OpenVMS make the mistake of using the same dataset name over and over. Unlike other platforms, OpenVMS does not overwrite the previous dataset; it creates a new version of the file each time. Although this feature can be turned off, most sites prefer to keep it because it instantly backs up old files. Using the DATASETS procedure to remove a dataset deletes all of its versions. When processing large files on OpenVMS, it is recommended to use different names for different datasets, or to use DCL within the SAS program to purge the work library regularly. For example, if the WORK library is in the directory where the SAS program is run, the following statement may be included in the program:

X 'PURGE [.SASW*]';
4. Disabling Disk Volume Highwater Marking

Highwater marking (HWM) is an OpenVMS security attribute which guarantees that users cannot read data they have not written: the system erases the previous contents of disk blocks for files that are opened for random access. This creates more overhead every time a dataset is created or extended. Since all SAS data sets are random access files, there is a performance penalty of pre-zeroing, increased I/O, and increased elapsed time. The following is an example of the DCL commands to turn off highwater marking:

$! USE AT INITIALIZATION TIME
$ INITIALIZE/NOHIGHWATER DKA100 USERDISK1
$! USE FOR AN ACTIVE DISK
$ SET VOLUME/NOHIGHWATER DKA102

Turning off highwater marking can significantly reduce the elapsed time and the I/O, especially for programs that are write intensive. The only cost of turning off this attribute is that some OpenVMS sites may require the highwater marking feature to be running for security purposes.

5. Disk Defragmentation

Disk defragmentation is the process that causes files to become physically contiguous. Contiguous files can be accessed with fewer I/O operations than non-contiguous files. On a defragmented disk, datasets are kept contiguous; after one I/O operation the disk head is well positioned for the next I/O operation. It is recommended to maintain frequently accessed datasets on a defragmented disk. Running a SAS program on a defragmented disk can decrease the I/O and the elapsed time. However, defragmenting can prove costly because of the time and effort needed to defragment disks regularly, or the cost of acquiring additional disk drives. The two ways to defragment a disk are to do an image BACKUP and RESTORE to the target disk, or to use a commercially available disk defragmentation product. Caution should be exercised in using commercial defragmentation products, because they may corrupt datasets that are being accessed concurrently. Fragmentation may also be reduced by performing a disk-to-disk image backup without using the /SAVE_SET qualifier.

6. Caching and Buffering Datasets for Sequential Writes and Reads

When your programs constantly perform sequential I/O operations, using the CACHESIZ= and BUFSIZE= options may be beneficial. The host-specific CACHESIZ= option controls the buffering of data set pages during I/O operations. The BUFSIZE= option sets the dataset's page size (the SAS internal page size) when the data set is created. It is important to note that BUFSIZE= can be set only at file creation; once set, it becomes a permanent attribute of the file that cannot be changed. The CACHESIZ= option, on the other hand, can be changed any time a file is opened, and is in effect only for the life of the current open file. While the BUFSIZE= option can appear only as a dataset option, CACHESIZ= can appear as a data set option or on a LIBNAME statement that uses the Base engine. If appropriate values are chosen for a particular dataset, there is a significant decrease in elapsed time and I/O. When your dataset observation size is large, you may waste a great deal of space in the dataset if you do not choose an appropriate BUFSIZE=. If, for example, BUFSIZE= is set to 51,200 and the last page contains only 5,000 bytes, you could be wasting over 45,000 bytes, or 90 blocks, per page. The following are examples of the use of the BUFSIZE= and CACHESIZ= options:

LIBNAME X '[ ]';
DATA X.BIGFILE (BUFSIZE=63488);
  SET ONE;
RUN;

or

LIBNAME X '[ ]';
DATA X.BIGFILE (CACHESIZ=65024);
  SET ONE;
RUN;
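As noted above, CACHESIZ= can also be supplied on a LIBNAME statement for the Base engine, so that every dataset opened through that libref uses the same cache size. A minimal sketch (the exact LIBNAME option syntax is assumed from the OpenVMS SAS Companion; the path and value are illustrative):

```sas
/* Illustrative sketch: setting CACHESIZ= once on the LIBNAME   */
/* statement applies it to all datasets opened through libref X. */
LIBNAME X '[ ]' CACHESIZ=65024;

DATA X.BIGFILE;
  SET ONE;
RUN;
```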
7. Installing SAS System Images

SAS images are collections of procedures and data bound together by the Linker utility to form an executable program. Installing SAS images can conserve memory, because only one copy of the code needs to be in memory at any time and many users can access that code concurrently. An added benefit of installing SAS known images is that the elapsed time of SAS system startup may decrease significantly. Installing images is most effective on systems where two or more users are using SAS concurrently. For example, the following commands install the core set of SAS images of Release 6.12 of SAS for AXP OpenVMS:

$ INSTALL :== $SYS$SYSTEM:INSTALL/COMMAND
$ INSTALL ADD/OPEN/HEADER/SHARE -
  SAS$ROOT:[IMAGE]SAS612.EXE
$ INSTALL ADD/OPEN/HEADER/SHARE -
  SAS$LIBRARY:SABXSPH.EXE
$ INSTALL ADD/OPEN/HEADER/SHARE -
  SAS$LIBRARY:SASDS.EXE
$ INSTALL ADD/OPEN/HEADER/SHARE -
  SAS$LIBRARY:SABMOTIF.EXE
$ INSTALL ADD/OPEN/HEADER/SHARE -
  SAS$LIBRARY:SASMSG.EXE
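To confirm that an image was installed with the intended attributes, the INSTALL utility's LIST command can be used; a sketch (the /FULL qualifier, which also reports usage counts, is assumed from the OpenVMS documentation):

```
$ INSTALL :== $SYS$SYSTEM:INSTALL/COMMAND
$! List one installed image, its attributes and usage counts
$ INSTALL LIST/FULL SAS$ROOT:[IMAGE]SAS612.EXE
```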
8. Increasing the Page File Quota

To process a large file, the page file quota (PGFLQUOTA) of your OpenVMS account needs to be high, because the page file quota determines the virtual memory allocated to your SAS process. Depending on the dataset size, this quota may need to be increased; if your quota is too low, your program will run out of memory. SAS Institute recommends an initial page file quota of 150,000. To check how your program is using the page file, insert the following statement in your programs:

X '@SAS$ROOT:[USAGE]MEMCHK.COM';

This command file, provided by SAS Institute, reports on the current values of the various parameters and quotas and on what levels of memory have been used. It may be used in conjunction with the OpenVMS accounting utility to determine the optimal quota.
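From DCL, you can inspect the quotas of your current process, and a system manager can raise PGFLQUOTA for an account with the AUTHORIZE utility. A sketch (the username is hypothetical; the new quota takes effect at the next login):

```
$! Display the quotas of the current process,
$! including the paging file quota
$ SHOW PROCESS/QUOTAS
$!
$! System manager: raise the quota for an account
$ SET DEFAULT SYS$SYSTEM
$ RUN AUTHORIZE
UAF> MODIFY SMITH/PGFLQUOTA=150000
UAF> EXIT
```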
9. Batch Processing

If a SAS program takes several hours to run and is run noninteractively from your terminal (e.g. $ SAS MYPROG.SAS), any problem with the terminal (a power failure, a frozen screen) can cause the program to stop after several hours of execution. To avoid this problem, you should run the program in batch mode. After submitting a batch job, your terminal session is free for further programming, and you can request that OpenVMS notify you when the program is finished. In most organizations, the CPUs and I/O subsystems are almost idle at night and under intense pressure during the day; batch processing can reduce this imbalance by running SAS jobs at night when the system is not busy. Batch is not suitable if the program requires user input during execution. To run SAS in batch mode, create a DCL command file (e.g. MYJOB.COM) that includes all the programs you need to run, and issue the following command:

$ SUBMIT/NOTIFY -
  /AFTER=08-OCT-1997:01:30 MYJOB.COM

The programs in the DCL command file MYJOB.COM are scheduled to run on October 8, 1997 at 1:30 am. Running a program in batch mode is an efficient way of using system resources; the elapsed time depends on the number of other jobs running at the same time.
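The command file itself simply contains the DCL commands the batch job should execute. A minimal sketch of MYJOB.COM (the directory and program names are hypothetical):

```
$! MYJOB.COM - commands executed by the batch job
$ SET DEFAULT DKA102:[MYDIR]     ! hypothetical working directory
$ SAS MYPROG1.SAS                ! run the SAS programs in order
$ SAS MYPROG2.SAS
$ EXIT
```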

IV. CONCLUSION

Processing large files is becoming increasingly common as more organizations look for new ways to leverage existing data, and programmers are being asked to process large files more frequently than in the past. The optimization techniques discussed in this paper will help programmers get more out of existing computer resources.

V. BIBLIOGRAPHY

SAS Institute references:

Installation Instructions and System Manager's Guide, Release 6.12 of the SAS System under OpenVMS for AXP Systems.
SAS Language Reference, Version 6, First Edition.
SAS Companion for the VMS Environment, Version 6, Second Edition.
SAS Programming Tips: A Guide to Efficient SAS Processing.

Digital Equipment Corporation references:

Guide to OpenVMS File Applications, March 1994.
OpenVMS System Manager's Manual: Essentials, March 1994.
OpenVMS System Manager's Manual: Tuning, Monitoring, and Complex Systems, March 1994.
OpenVMS DCL Dictionary, A-M, March 1994.
OpenVMS DCL Dictionary, N-Z, March 1994.

ACKNOWLEDGMENTS

The author wishes to thank Linda McGrillies, Rama Jampani and Guy Noce for their invaluable assistance.

TRADEMARKS

SAS is a registered trademark of SAS Institute Inc. in the USA. ® indicates registration. AXP and OpenVMS are registered trademarks of Digital Equipment Corporation. Other brand and product names are registered trademarks of their respective companies.

AUTHOR INFORMATION

The author may be contacted by e-mail at: [email protected]