Optimization Techniques for Processing Large Files in an OpenVMS Environment

Adrien Ndikumwami, Regional Economic Studies Institute, Towson, MD

ABSTRACT

Despite the recent technological advances made in the computer industry, both in terms of enhanced processing power and larger storage devices, the demand for these resources keeps increasing considerably. Unlike in the past, large data files spanning several gigabytes are now routinely processed to generate regular and ad hoc reports. Such regular use of large files is forcing programmers to look for new techniques that enable optimal use of scarce computer resources. While the SAS® system provides advanced capabilities to handle and manipulate large files, an appreciation of the host operating system by the programmer is essential to use scarce computer resources efficiently. For large data files, the way data are read, processed, and stored can make a great deal of difference in the optimal utilization of resources. This paper examines the issues that a SAS programmer needs to address while processing large files in an OpenVMS environment. Optimization techniques related to creating and setting datasets, sorting, indexing, compressing, running jobs in batch mode, and certain other host-specific issues are explored as they apply to an OpenVMS environment.

I. INTRODUCTION

It seems paradoxical that despite the availability of higher processing power and larger storage space, there is an ever growing need for these resources. The reality is that, unlike in the past, large data files spanning several gigabytes are now routinely processed to generate regular and ad hoc reports. Most of these operational data files could not previously be used on a regular basis due to computing resource constraints. The proliferation of large data files, and the increasing need to access such files regularly for ad hoc reporting, micro-simulation, trend analysis and data mining activities, pose new challenges to the programmer. Programmers constantly look for alternative methods to make optimal use of existing computer resources. While the SAS system provides advanced capabilities to handle and manipulate large files, an appreciation of the host operating system by the programmer is essential to manage scarce computer resources efficiently. While some of the techniques discussed in this paper are applicable to all operating systems, the focus of this paper is on processing large data files in an OpenVMS environment.

Optimization techniques that are relevant to most operating systems are explored first, along with their OpenVMS-specific features. In the later sections, certain techniques that are either unique or more relevant to an OpenVMS system are discussed. Wherever appropriate, examples are provided to demonstrate the use of these techniques. Costs and benefits with regard to the use of computer resources (I/O, disk storage, CPU and memory) are also discussed for each optimization technique.

II. OPTIMIZATION TECHNIQUES THAT ARE RELEVANT TO OTHER OPERATING SYSTEMS

1. Reducing dataset observation length

Reducing the observation length of a dataset reduces I/O and thereby shortens the elapsed time, as the amount of input data decreases. There are several methods that help decrease the length of an observation. The more popular techniques are the DROP= and KEEP= dataset options, the LENGTH statement and the data compression feature.

a. KEEP= and DROP= Dataset Options

While creating a dataset, you should keep only the variables needed for further or future processing. You can accomplish this by using the KEEP= or DROP= dataset options. By reducing the number of variables in a large dataset, you are actually reducing the amount of data to be read or written, which results in a significant reduction of I/O and elapsed time. This technique is beneficial when the variables being dropped are no longer needed or can be recreated from the remaining variables. If you are unsure whether to retain or drop certain variables, keep them: the cost of recreating dropped variables far outweighs the benefit of dropping them. Assuming dataset ONE has variables A, B, C, X, Y, Z, the following examples demonstrate the use of the DROP= and KEEP= options:

DATA TWO;
  SET ONE (DROP=C Z);
RUN;

or

DATA TWO;
  SET ONE (KEEP=A B X Y);
RUN;

b. The LENGTH Statement

LENGTH statements may be used in a DATA step to reduce the number of bytes used for storage. The default length of numeric variables in the SAS system is 8 bytes. Often, disk storage is wasted because a smaller length could be used to store the same value that is stored in 8 bytes, without compromising precision.
See Table 1 for the largest integer that can be represented exactly at each length. Using the LENGTH statement can significantly reduce storage space, although CPU time may increase slightly. This feature comes in handy when a dataset contains numeric variables with small values. The LENGTH statement is not recommended for storing fractional numeric values, because precision may be lost.

Table 1: Largest integer that can be stored exactly by SAS variables on AXP OpenVMS, by variable length

Length (bytes)    Largest integer represented exactly
3                 8,191
4                 2,097,151
5                 536,870,911
6                 137,438,953,471
7                 35,184,372,088,831
8                 9,007,199,254,740,991

The following is an example of how to use the LENGTH statement:

DATA ONE;
  LENGTH X Y Z 4;
  INPUT X Y Z;
RUN;

c. Dataset Compression

The SAS system is equipped with a powerful compression algorithm. It treats an entire observation as a string of bytes, ignoring variable types and boundaries. Data compression is most helpful when a dataset has repeating numeric or character data. Compression reduces I/O but uses more CPU time; in addition, disk storage space is reduced by the compression factor. If your site does not have many users, does not carry other CPU-intensive applications, or CPU time is free (no monetary charge for using the CPU), then compression is an ideal technique. Over the last decade, the speed of OpenVMS processors has increased by a factor of more than 20, while over the same period disk I/O systems have merely sped up by a factor of 2; shifting the I/O pressure to the CPU is the best way to work around this imbalance. However, if the dataset does not have many repeating values, you should avoid compressing, because performance may get worse. Under certain circumstances, SAS compression may actually increase the size of a large dataset; in that case the I/O, the CPU time and the elapsed time all increase. You can use SAS compression as a global option, as in the following example:

OPTIONS COMPRESS=YES;

Alternatively, you can use it as a dataset-specific option to compress a particular dataset, as in the following example:

LIBNAME X '[ ]';

DATA X.BIGDATA (COMPRESS=YES);
  INPUT VAR1-VAR100 $200.;
RUN;

2. Sorting

Sorting may consume a considerable amount of computing resources; in many cases, a program will fail simply because those resources have not been allocated properly. By default, sorting on AXP OpenVMS is routed to SYS$SCRATCH, which in turn points to SYS$LOGIN. Typically, SYS$LOGIN resides on a quota-enabled disk that may not have enough space to sort a large dataset. For large datasets, it is advisable to use the host sort. Sorting requires scratch space of about 2 to 3 times the size of the input dataset. If your LOGIN disk has a small quota and you need to process a large file, the best approach is to point SYS$SCRATCH to a disk with more space. The following command defines another sort area and should be included in LOGIN.COM:

$ DEFINE/PROCESS SYS$SCRATCH -
  DKA102:[SORTAREA]

Starting with AXP OpenVMS maintenance release TS048 of the SAS system, a new SORTWORK option directs the SAS System to create up to 6 SORTWORK logicals that will be used by the OpenVMS host sort for temporary work space. The SORTWORK option is used as follows:

LIBNAME SWORK1 'DKB2:[SORTAREA1]';
LIBNAME SWORK2 'DKB2:[SORTAREA2]';

OPTIONS SORTWORK=(SWORK1, SWORK2);

The SORTWORK option accepts either SAS librefs assigned in a LIBNAME statement or OpenVMS paths, which must be enclosed in single or double quotes. To prevent SAS from using the SORTWORK areas, use the following statement prior to sorting:

OPTIONS NOSORTWORK;

The SAS system is shipped with two command procedures, SORTCHK.COM and SORTSIZE.COM, located in the SAS$ROOT:[USAGE] directory. These procedures can help in gathering information about the system; their usage is explained at the beginning of each file.
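The section above recommends the OpenVMS host sort for large datasets. One way to request it explicitly from within SAS is the SORTPGM= system option; the following is a minimal sketch (SORTPGM=HOST is assumed from the OpenVMS SAS Companion, and the dataset and variable names are illustrative):

```sas
/* Route sorting to the OpenVMS host sort instead of the   */
/* internal SAS sort (SORTPGM= is a host system option).   */
OPTIONS SORTPGM=HOST;

PROC SORT DATA=X.BIGDATA OUT=X.SORTED;
  BY VAR1;
RUN;
```

The host sort honors the SYS$SCRATCH and SORTWORK definitions discussed above, so the scratch space lands on the disk you pointed them to.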
3. Indexing

Indexing can enhance performance when you create a large dataset that will be read many times using a WHERE clause or BY-group processing. When an index exists, an observation can be read directly. Without an index, the SAS system starts from the top of the dataset and reads all observations sequentially; only then does it apply the WHERE clause or BY-group statement. It is recommended to avoid indexing a dataset that is constantly rewritten or updated, or a dataset that needs to be read in its entirety: indexing consumes more resources than sequential reading when a dataset is read without subsetting. However, if you are subsetting with a WHERE clause or BY-group statement, both the I/O and the elapsed time are reduced. The following is an example of indexing:

LIBNAME EMP '[ ]';
PROC DATASETS LIBRARY=EMP;
  MODIFY EMPLOYEE;
  INDEX CREATE SSNO;
  INDEX CREATE BTHDATE;
RUN;

The above code generates two simple indexes on the EMPLOYEE dataset, one for each of the following two variables: social security number and birth date.
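Once the indexes exist, a subsetting WHERE clause can take advantage of them. A minimal sketch (the SSNO value and the variable's character type are hypothetical):

```sas
LIBNAME EMP '[ ]';

/* The WHERE clause below can use the SSNO index to locate      */
/* matching observations directly instead of reading EMPLOYEE   */
/* sequentially from the top.                                   */
DATA ONESSN;
  SET EMP.EMPLOYEE;
  WHERE SSNO = '123456789';
RUN;
```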
4. Dataset Space Preallocation

Starting from the second 6.09 maintenance release for OpenVMS on AXP, the SAS System initially allocates 129 disk blocks for a data set. This initial allocation is called ALQ, or allocation quantity. Each time the data set is extended, another 384 blocks must be allocated on the disk. This allocation is called DEQ, or default file extension quantity. OpenVMS maintains a bit map file on each disk (BITMAP.SYS) that identifies the blocks that are available for use. When a data set is written and then extended, OpenVMS alternates between scanning the bit map to locate free blocks and actually writing the data set. However, if the data set is written with larger initial and extent allocations, writes to the data set proceed uninterrupted for longer periods of time. At the hardware level, the movement of the disk heads between the bit map and the data set is minimized; the result is a reduction in I/O and elapsed time. Because large initial extents preallocate the space reserved for a dataset, disk fragmentation is also reduced, thereby cutting the time needed to read the dataset. The recommended ALQ= value is the size of the dataset. In cases of uncertainty, an underestimated ALQ= can be used, and the DEQ= value can be used for extents. The value of ALQ= ranges from 0 to 2,147,483,647 and the value of DEQ= ranges from 0 to 65,535. For example:

DATA X.BIGFILE (ALQ=750000 DEQ=25000);
  INPUT VAR1-VAR100 $200.;
RUN;

Caution must be exercised not to use ALQ= or DEQ= values that are incompatible with the data set size, for they may result in performance degradation. The size of the dataset may be estimated from the number of observations, the number of variables and the variable lengths.
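As a rough sketch of such an estimate (the observation and variable counts below are hypothetical), a dataset of 1,000,000 observations with 100 numeric variables of 8 bytes each holds about 800,000,000 bytes of data, or 1,562,500 512-byte OpenVMS blocks, suggesting an ALQ= value of that order. The arithmetic can be done in a DATA step:

```sas
/* Rough ALQ= estimate: observations x variables x bytes per    */
/* variable, converted to 512-byte OpenVMS disk blocks.         */
/* The counts below are hypothetical; substitute your own.      */
DATA _NULL_;
  NOBS   = 1000000;              /* number of observations */
  NVARS  = 100;                  /* number of variables    */
  VARLEN = 8;                    /* bytes per variable     */
  BLOCKS = CEIL(NOBS * NVARS * VARLEN / 512);
  PUT 'ESTIMATED ALQ= ' BLOCKS;  /* about 1,562,500 blocks */
RUN;
```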
III. OPTIMIZATION TECHNIQUES MORE RELEVANT OR UNIQUE TO OPENVMS

1. Redirecting the SAS Work Library

By default, the SAS work library is created in the directory where the program is running. Due to disk storage limitations, there may not be enough space there. There are two ways to tackle this problem. The first is to use the WORK= option at SAS invocation. For example:

$ SAS/WORK=DKA102:[WORKAREA] MYPROG.SAS

The other way is to define the logical SAS$WORKROOT and point it to the directory that will be used for the SAS data library. It is recommended that the system manager set up a disk for that purpose and define a system-wide logical pointing to the work area. The logical should be defined as follows:

$ DEFINE/SYSTEM -
  SAS$WORKROOT DKA102:[WORKAREA]

This disk must be defragmented periodically and the work files cleaned regularly. To use the cleanup utility, first declare a DCL symbol CLEANUP as follows:

$ CLEANUP :== $SAS$ROOT:[PROCS]CLEANUP.EXE

To use it, just issue the following command:

$ CLEANUP SAS$WORKROOT

2. Removing Unnecessary Datasets from the Work Library

If the SAS workspace is limited in disk space allocation, any temporary dataset that was created and is not used later in the program takes up disk storage unnecessarily. The SAS system gets rid of unused datasets through PROC DATASETS. The following is an example of the use of PROC DATASETS:

PROC DATASETS MEMTYPE=(CATALOG,DATA);
  DELETE ONE;
RUN;

In this example, the catalog and dataset ONE are permanently removed from the WORK library even before the program terminates.

3. Purging Old Versions of Datasets

Most newcomers to OpenVMS make the mistake of using the same dataset name over and over. Unlike other platforms, OpenVMS does not overwrite the previous dataset; it creates a new version of the file each time. Although this feature can be turned off, most sites prefer to keep it because it instantly backs up old files. Using the DATASETS procedure to remove a dataset deletes all of its versions. When processing large files on OpenVMS, it is recommended to use different names for different datasets, or to use DCL within the SAS program to purge the work library regularly. For example, if the WORK library is in the directory where the SAS program is run, the following statement may be included in the program:

X 'PURGE [.SASW*]';
4. Disabling Disk Volume Highwater Marking

Highwater marking (HWM) is an OpenVMS security attribute which guarantees that users cannot read data they have not written: the system erases the previous contents of disk blocks for files that are opened for random access. This creates more overhead every time a dataset is created or extended. Since all SAS data sets are random access files, there is a performance penalty of pre-zeroing, increased I/O, and increased elapsed time. The following is an example of the DCL commands to turn off highwater marking:

$! USE AT INITIALIZATION TIME
$ INITIALIZE/NOHIGHWATER DKA100 USERDISK1
$! USE FOR AN ACTIVE DISK
$ SET VOLUME/NOHIGHWATER DKA102

Turning off highwater marking can significantly reduce the elapsed time and the I/O, especially for programs that are write intensive. The only cost of turning off this attribute is that some OpenVMS sites may require the highwater marking feature to be running for security purposes.

5. Disk Defragmentation

Disk defragmentation is the process that causes files to become physically contiguous. Contiguous files can be accessed with fewer I/O operations than non-contiguous files. On a defragmented disk, datasets are kept contiguous; after one I/O operation the disk head is well positioned for the next I/O operation. It is recommended to maintain frequently accessed datasets on a defragmented disk. Running a SAS program on a defragmented disk can decrease the I/O and the elapsed time. However, defragmenting can prove costly because of the time and effort needed to defragment disks regularly, or the cost of acquiring additional disk drives. The two ways to defragment a disk are to do an image BACKUP and RESTORE to the target disk, or to use a commercially available disk defragmentation product. Caution should be exercised in using commercial defragmentation products, because they may corrupt datasets that are being accessed concurrently. Fragmentation may also be reduced by performing a disk-to-disk image backup without using the /SAVE_SET qualifier.

6. Caching and Buffering Datasets for Sequential Writes and Reads

When your programs constantly perform sequential I/O operations, using the CACHESIZ= and BUFSIZE= options may be beneficial. The host-specific CACHESIZ= option controls the buffering of data set pages during I/O operations. The BUFSIZE= option sets the dataset's page size (the SAS internal page size) when the data set is created. It is important to note that BUFSIZE= can be set only at file creation; once set, it becomes a permanent attribute of the file that cannot be changed. The CACHESIZ= option, on the other hand, can be changed any time a file is opened, and is in effect only for the life of the current open file. While the BUFSIZE= option can appear only as a dataset option, CACHESIZ= can appear as a data set option or on a LIBNAME statement that uses the Base engine. If appropriate values are chosen for a particular dataset, there is a significant decrease in elapsed time and I/O. When your dataset observation size is large, you may waste a great deal of space in the dataset if you do not choose an appropriate BUFSIZE=. If, for example, BUFSIZE= is set to 51,200 and the last page contains only 5,000 bytes, you could be wasting over 45,000 bytes, or 90 blocks, per page. The following are examples of the use of the BUFSIZE= and CACHESIZ= options:

LIBNAME X '[ ]';
DATA X.BIGFILE (BUFSIZE=63488);
  SET ONE;
RUN;

or

LIBNAME X '[ ]';
DATA X.BIGFILE (CACHESIZ=65024);
  SET ONE;
RUN;
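As noted above, CACHESIZ= can also be supplied on a LIBNAME statement for the Base engine, so that every dataset opened through that libref uses the same cache size. A minimal sketch (the exact LIBNAME option syntax is assumed from the OpenVMS SAS Companion; the path and value are illustrative):

```sas
/* Illustrative sketch: setting CACHESIZ= once on the LIBNAME   */
/* statement applies it to all datasets opened through libref X. */
LIBNAME X '[ ]' CACHESIZ=65024;

DATA X.BIGFILE;
  SET ONE;
RUN;
```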
7. Installing SAS System Images

SAS images are collections of procedures and data bound together by the Linker utility to form an executable program. Installing SAS images can conserve memory, because only one copy of the code needs to be in memory at any time and many users can access that code concurrently. An added benefit of installing SAS known images is that the elapsed time of SAS system startup may decrease significantly. Installing images is most effective on systems where two or more users are using SAS concurrently. For example, the following commands install the core set of SAS images of Release 6.12 of SAS for AXP OpenVMS:

$ INSTALL :== $SYS$SYSTEM:INSTALL/COMMAND
$ INSTALL ADD/OPEN/HEADER/SHARE -
  SAS$ROOT:[IMAGE]SAS612.EXE
$ INSTALL ADD/OPEN/HEADER/SHARE -
  SAS$LIBRARY:SABXSPH.EXE
$ INSTALL ADD/OPEN/HEADER/SHARE -
  SAS$LIBRARY:SASDS.EXE
$ INSTALL ADD/OPEN/HEADER/SHARE -
  SAS$LIBRARY:SABMOTIF.EXE
$ INSTALL ADD/OPEN/HEADER/SHARE -
  SAS$LIBRARY:SASMSG.EXE
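To confirm that an image was installed with the intended attributes, the INSTALL utility's LIST command can be used; a sketch (the /FULL qualifier, which also reports usage counts, is assumed from the OpenVMS documentation):

```
$ INSTALL :== $SYS$SYSTEM:INSTALL/COMMAND
$! List one installed image, its attributes and usage counts
$ INSTALL LIST/FULL SAS$ROOT:[IMAGE]SAS612.EXE
```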
8. Increasing the Page File Quota

To process a large file, the page file quota (PGFLQUOTA) of your OpenVMS account needs to be high, because the page file quota determines the virtual memory allocated to your SAS process. Depending on the dataset size, this quota may need to be increased; if your quota is too low, your program will run out of memory. SAS Institute recommends an initial page file quota of 150,000. To check how your program is using the page file, insert the following statement in your programs:

X '@SAS$ROOT:[USAGE]MEMCHK.COM';

This command file, provided by SAS Institute, reports on the current values of the various parameters and quotas and on what levels of memory have been used. It may be used in conjunction with the OpenVMS accounting utility to determine the optimal quota.
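From DCL, you can inspect the quotas of your current process, and a system manager can raise PGFLQUOTA for an account with the AUTHORIZE utility. A sketch (the username is hypothetical; the new quota takes effect at the next login):

```
$! Display the quotas of the current process,
$! including the paging file quota
$ SHOW PROCESS/QUOTAS
$!
$! System manager: raise the quota for an account
$ SET DEFAULT SYS$SYSTEM
$ RUN AUTHORIZE
UAF> MODIFY SMITH/PGFLQUOTA=150000
UAF> EXIT
```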
9. Batch Processing

If a SAS program takes several hours to run and is run noninteractively from your terminal (e.g. $ SAS MYPROG.SAS), any problem with the terminal (a power failure, a frozen screen) can cause the program to stop after several hours of execution. To avoid this problem, you should run the program in batch mode. After submitting a batch job, your terminal session is free for further programming, and you can request that OpenVMS notify you when the program is finished. In most organizations, the CPUs and I/O subsystems are almost idle at night and under intense pressure during the day; batch processing can reduce this imbalance by running SAS jobs at night when the system is not busy. Batch is not suitable if the program requires user input during execution. To run SAS in batch mode, create a DCL command file (e.g. MYJOB.COM) that includes all the programs you need to run, and issue the following command:

$ SUBMIT/NOTIFY -
  /AFTER=08-OCT-1997:01:30 MYJOB.COM

The programs in the DCL command file MYJOB.COM are scheduled to run on October 8, 1997 at 1:30 am. Running a program in batch mode is an efficient way of using system resources; the elapsed time depends on the number of other jobs running at the same time.
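The command file itself simply contains the DCL commands the batch job should execute. A minimal sketch of MYJOB.COM (the directory and program names are hypothetical):

```
$! MYJOB.COM - commands executed by the batch job
$ SET DEFAULT DKA102:[MYDIR]     ! hypothetical working directory
$ SAS MYPROG1.SAS                ! run the SAS programs in order
$ SAS MYPROG2.SAS
$ EXIT
```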

IV. CONCLUSION

Processing large files is becoming increasingly common as more organizations look for new ways to leverage existing data, and programmers are being asked to process large files more frequently than in the past. The optimization techniques discussed in this paper will help programmers get more out of existing computer resources.

V. BIBLIOGRAPHY

SAS Institute references:

Installation Instructions and System Manager's Guide, Release 6.12 of the SAS System under OpenVMS for AXP Systems.
SAS Language Reference, Version 6, First Edition.
SAS Companion for the VMS Environment, Version 6, Second Edition.
SAS Programming Tips: A Guide to Efficient SAS Processing.

Digital Equipment Corporation references:

Guide to OpenVMS File Applications, March 1994.
OpenVMS System Manager's Manual: Essentials, March 1994.
OpenVMS System Manager's Manual: Tuning, Monitoring, and Complex Systems, March 1994.
OpenVMS DCL Dictionary, A-M, March 1994.
OpenVMS DCL Dictionary, N-Z, March 1994.

ACKNOWLEDGMENTS

The author wishes to thank Linda McGrillies, Rama Jampani and Guy Noce for their invaluable assistance.

TRADEMARKS

SAS is a registered trademark of SAS Institute Inc. in the USA. ® indicates registration. AXP and OpenVMS are registered trademarks of Digital Equipment Corporation. Other brand and product names are registered trademarks of their respective companies.

AUTHOR INFORMATION

The author may be contacted by e-mail at: [email protected]