NESUG 2009 Applications Big & Small

Clearing the Pipe: Overcoming System Limitations of the SAS® Pipe Channel of Communication

Fuad J. Foty, U.S. Census Bureau, Suitland, MD

ABSTRACT

Many Linux users are all too familiar with the error message "Argument list too long". The Pipe channel of communication in SAS programs cannot list the files in a directory when the list is exceptionally long. A careful investigation of the problem demonstrates that the root cause is a buffer limit within the operating system. This problem can be easily overcome by giving multiple commands, separated by semicolons, in one single line of the SAS code. This paper concentrates on different ways of handling such tasks in SAS while ensuring that the intended result is achieved. Basic tasks such as getting file names and sizes from your disk are as crucial to the proper running of a SAS job as the program logic itself. Renaming, moving, and setting security standards on production files is easily done in an automated fashion with the help of the Pipe channel of communication and the computer’s operating system. Knowing the operating system constraints allows the programmer to code commands in a smart way, so that information flows properly down the Pipe.

INTRODUCTION

The U.S. Census Bureau maintains a Master Address File (MAF), which contains millions of addresses. The MAF is a database which includes current addresses, as well as those determined to no longer exist, and reflects ongoing additions, deletions, and changes in those files. The MAF is the Census Bureau’s official inventory of known living quarters (housing units and group quarters facilities) and selected nonresidential units (public, private, and commercial) in the United States and Puerto Rico. The MAF data are provided to the American Community Survey Office (ACSO) in files called MAF extracts. Methodologists in the Decennial Statistical Studies Division define the set of valid addresses to be included in the sampling frame used for the American Community Survey (ACS).

ACSO receives several sets of county-level files delivered twice per year. These sets of files go through a check-in procedure where renaming and moving the files to production occur via SAS programs. Various unnamed pipes are used in these SAS programs to channel the filenames and their related attributes to make security-level settings. Reading these filenames through the SAS pipe channel of communication is often challenging because of the operating system constraints. The ACSO uses a Red Hat Enterprise Linux (RHEL) machine to test those constraints.

PIPE DEFINITION AND SYNTAX

SAS defines a pipe as a channel of communication between two processes. A process with a handle to one end can communicate with another process that has a handle to the other end. This means that you can use a specialized application to provide information to your SAS session or vice versa. Pipes can be one-way or two-way. With a one-way pipe, one application can only write data to the pipe while the other application reads from it. With a two-way pipe, both applications can read and write data. There are two types of pipes:

• unnamed pipe - handles one-way communication. Also called an anonymous pipe (or simply pipe), it is typically used to communicate between a parent process and a child process. Within SAS, SAS is the parent process that invokes (and reads data from) a child process.

• named pipe - handles one-way or two-way communication between two unrelated processes. That is, one process is not started by the other process. In fact, it is possible to have two applications communicating over a pipe on a network. You can use named pipes within SAS to communicate with other applications or even with another SAS session.

To use an unnamed pipe, one must issue a FILENAME statement with the following syntax:

FILENAME fileref PIPE 'program-name' option-list;

Fileref, PIPE, program-name, and option-list are described briefly below:

• Fileref - The logical name associated with an external file. SAS provides several ways to assign filerefs, to obtain a listing of the active filerefs, and to clear filerefs during your SAS session.

• PIPE - The device-type keyword that tells SAS to use an unnamed pipe.

• Program-name - Specifies the external application program. This argument must fully specify the pathname to the program, or the directory containing the program must be listed in the PATH environment variable. This argument can also contain program options.

• Option-list - Any of the options valid in the FILENAME statement, such as the LRECL= or RECFM= options.

Throughout this paper, unnamed pipes are the only pipes used and discussed, and they are referred to simply as “pipes”. Unnamed pipes enable programmers to invoke a program outside of SAS and redirect the program's input, output, and error messages to SAS. This capability allows data to be captured from a program external to SAS without creating an intermediate data file.

For unnamed pipes to work, external applications must read data from standard input (STDIN or 0), write output to standard output (STDOUT or 1), and write errors to standard error (STDERR or 2). When SAS captures STDERR from another application, the error messages are routed by default to the SAS log. To write to STDIN in another application, one may use a PUT statement in a SAS DATA step. Because SAS can write to STDIN and capture from STDOUT in the same application, unnamed pipes can be used to send data to an external program, as well as to capture the output and error messages of the same program.

Redirection sequences can be used to redirect STDIN, STDOUT, and STDERR. Any application that uses standard input, output, and error can use the unnamed pipe feature. Because many system commands do so, they can be run through unnamed pipes within SAS. Unless specified otherwise, an unnamed pipe directs STDOUT and STDERR to two different files. To combine STDOUT and STDERR into the same file, one may use redirection sequences. The following example redirects STDERR to STDOUT for the Linux “ls -l” command:

filename listing pipe 'ls -l *.sas 2>&1';

In this example, if any errors occur in performing this command, STDERR (2) is redirected to the same file as STDOUT (1). This is an example of SAS's ability to capitalize on operating environment capabilities. This feature of redirecting file handles is a function of the operating system rather than of SAS.
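The effect of the redirection can be checked at the shell prompt before the command is placed in a FILENAME statement. The following is only a sketch: the directory name is a hypothetical path that is assumed not to exist, and it is used solely to make “ls” produce an error message.

```shell
#!/bin/sh
# Hypothetical path, assumed not to exist, used only to force an error from ls.
missing="/no/such/directory.$$"

# Here STDERR is discarded, so the error message never reaches STDOUT
# and nothing is available to be read through a pipe.
without=$(ls -l "$missing" 2>/dev/null | wc -l)

# With 2>&1 the error message is merged into STDOUT,
# so it would arrive in SAS together with the normal output.
with=$(ls -l "$missing" 2>&1 | wc -l)

echo "lines on STDOUT when STDERR is discarded: $without"
echo "lines on STDOUT with 2>&1: $with"
```

The first count is zero because the only output of the failed command went to STDERR; the second count is at least one because the message now travels on STDOUT.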

UNDERSTANDING THE “ARGUMENT LIST TOO LONG” ISSUE

The bottom line for the "Argument list too long" issue is that the total number of bytes needed for the arguments, plus the number of bytes in the environment passed to the child process, must not exceed ARG_MAX (131072 bytes in RHEL). If ARG_MAX is exceeded, an error like the following will appear:

shell : command name : ... : Argument list too long

So, where is ARG_MAX defined? The definition is in the “limits.h” header file located in the “/usr/include/linux/” directory. The following three lines are taken from the “limits.h” file:

#define NGROUPS_MAX 65536 /* supplemental group IDs are available */
#define ARG_MAX 131072 /* # bytes of args + environ for exec() */
#define CHILD_MAX 999 /* no limit :-) */

Notice that the ARG_MAX value is 131072 on the RHEL system. One can simply open the “limits.h” file to see the value, or use the “getconf” command. The “getconf” command displays the value of a configuration variable defined in “limits.h” and related header files when the variable name is passed to it as an argument. So ARG_MAX should not be exceeded for command line arguments.

Suppose we create sixty thousand files, where each filename contains 20 bytes, to show how ARG_MAX can be exceeded by just a simple listing (ls). Sixty thousand files times 20 bytes makes 1,200,000 bytes. When the “$( ls *.txt )” command is executed, the shell first expands “*.txt” to all filenames with the “.txt” extension present in the current directory, and only then executes the command line. The expanded arguments for “ls” total more than ARG_MAX (131072 bytes on the RHEL system), so the shell (ksh) reports an ARG_MAX violation. One can avoid the problem by single-quoting “*.txt”, which suppresses wildcard expansion during command line construction; the quoted pattern must then be given to a command that does its own matching, such as “find” with the “-name” option.

Note: This paper concentrates on the RHEL operating system. The value of ARG_MAX varies from one machine to another and from one Unix operating system to another.
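As a quick check, the limit can be queried on the machine at hand and compared with the size of the listing described above. This is a minimal sketch; the figures of 60,000 files and 20 bytes per name are the ones used in the text, and the value printed by “getconf” will differ from system to system.

```shell
#!/bin/sh
# Ask the operating system for its own limit; do not assume 131072.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX on this machine: $arg_max bytes"

# Size of the filenames alone for the listing discussed in the text:
# 60,000 filenames at 20 bytes each.
needed=$((60000 * 20))
echo "bytes needed for the filenames: $needed"

# Compare with the value quoted in the paper for RHEL.
if [ "$needed" -gt 131072 ]; then
    echo "exceeds the 131072-byte limit quoted for RHEL"
fi
```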

DISCOVERING THE “ARGUMENT LIST TOO LONG” ERROR

It is important to explain how the “Argument list too long” issue came about while working at the Census Bureau. After writing and executing my check-in SAS program code, I discovered that there was no output generated from the pipe utility. The unnamed pipe looked as follows:

filename cty pipe "ls -l <<32 bytes path>>/cty/ff?????.sas7bdat 2>/dev/null";

The listing was generated with filenames that included the full 32-byte path plus the subdirectory and the filename. After executing the “ls -l” command at the Linux prompt, I discovered the message “Argument list too long”. I then changed to the directory where the files existed and issued the “ls -l” command again, after which the error did not appear. The error resulted from the fact that each filename passed as an argument to the “ls -l” command included the full path, so that the total exceeded ARG_MAX. Each of the 3221 file arguments carried about 60 bytes. Looking further into this, it became clear to me that there were byte limitations on the “ls -l” command. My first solution was to eliminate the full path from the filenames by expanding the pipe into two commands within the FILENAME statement as follows:

filename cty pipe "cd <<32 bytes path>>/cty/; ls -l ff?????.sas7bdat 2>/dev/null";

The first command (cd) changes to the directory where the files exist, and the second command (ls -l) lists the files. By issuing two commands instead of one, the problem was solved. I would call this a “luck solution”, because the argument bytes plus the environment bytes needed to execute the “ls -l” command just happened not to exceed ARG_MAX.

HOW THE “LUCK SOLUTION” IS NO LONGER A SOLUTION

The “Luck Solution” worked well for some time, until I had to pipe a list of about 12,000 filenames into my SAS program. The luck solution failed because ARG_MAX was exceeded. So, I started researching how to raise the value of ARG_MAX. I examined how to calculate the size of a given argument list and thereby learn whether a command would exceed the limit and fail. While one may follow this process, there is a drawback. To calculate the bytes, one must remember that the total includes the number of bytes in the arguments plus the number of arguments (the latter term accounts for the NUL byte at the end of each argument). The environment is treated identically to the arguments for this purpose, so one must add in the size of the environment (names and equal signs included) plus the number of environment variables. This value can be calculated simply by using “$(env | wc -c)”. This formula works because the newlines in the “env” output take the place of the NULs used internally. Here is the “env” invocation syntax:

$ echo $(env | wc -c)

The result of the above command was 754 bytes on my machine. This number will probably be different on other machines. It must be added to the total number of argument bytes. Remember that many limits are dynamically (at run time) configurable. However, ARG_MAX apparently is not. So, once the total number of bytes needed to execute the shell command is determined, make sure that it is not greater than ARG_MAX. If it is greater than ARG_MAX, then there are three choices to circumvent the problem:

1. If the total number of bytes exceeds ARG_MAX by “a little”, then one can change the value of ARG_MAX, which I do not recommend. Changing the value of ARG_MAX will resolve the basic problem. However, the new value of ARG_MAX might interfere with the memory allocation of your entire system. It is possible to actually make things worse by upping the ARG_MAX value.

2. Try to reduce the number of bytes by reducing the length of each filename. One may accomplish this by renaming the files to shorter names. I do not recommend this method either because it is only temporary.

3. Use xargs to solve the problem, which I will discuss in the next section.
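The calculation described above can be sketched as a small shell function. The function name “estimate_bytes” is hypothetical and is introduced here only for illustration; it applies the formula from the text (argument lengths, plus one NUL per argument, plus the output size of “env | wc -c”). Note that “${#arg}” counts characters, so the estimate is exact only for single-byte filenames.

```shell
#!/bin/sh
# estimate_bytes: hypothetical helper that applies the formula from the text
# to whatever command line is passed to it.
estimate_bytes() {
    total=0
    for arg in "$@"; do
        # Length of the argument plus 1 for its terminating NUL byte.
        total=$((total + ${#arg} + 1))
    done
    # env prints one NAME=value per line; each newline stands in for a NUL.
    env_bytes=$(env | wc -c)
    echo $((total + env_bytes))
}

# Example: estimate the cost of a short command line and compare it
# with the limit reported by the operating system.
needed=$(estimate_bytes ls -l ff00001.txt ff00002.txt)
limit=$(getconf ARG_MAX)
echo "estimated bytes: $needed (limit: $limit)"
```

In practice the function would be called with the expanded list of filenames, which is exactly the case where the estimate approaches the limit.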

XARGS IS THE “BEST SOLUTION” TO THE “ARGUMENT LIST TOO LONG” ERROR

The xargs command is the most practical solution to the “argument list too long” error, with the fewest side effects on the rest of the operations taking place. The GNU version of xargs reads arguments from the standard input, delimited by blanks (which can be protected with double or single quotes or a backslash) or newlines, and executes the command (default is /bin/echo) one or more times with any initial arguments followed by arguments read from standard input. Blank lines on the standard input are ignored. Consider the following “find-xargs” command combination:

$ find . -type f -name 'ff?????.txt' -print | xargs -n 20 rm

By issuing the above “find-xargs” command combination, we are able to remove the files without exceeding ARG_MAX. Xargs controls the number of times “rm” gets called. In our case above, xargs passes at most 20 filenames at a time to “rm”, so every invocation of “rm” receives a short argument list. The object here is to pass twenty arguments at a time to the “rm” command, which removes the files that are found by the “find” command.

Let us say that the “find” command identifies 60,000 filenames, and suppose that instead of xargs we use the “-exec rm” action of “find”. In this scenario, the “find” command would fork a child process for each of the 60,000 filenames. In other words, each child process would be an “rm” working on one single object instead of a batch. Meanwhile, the parent “find” process would sleep, waiting for its child (rm) to finish before searching for the next file that meets the “find” criteria.

The xargs command reads names on its standard input and feeds them in batches to the command, so that the command is run fewer times. So, if we use “find | xargs rm”, find is able to work non-stop, filling the pipe with the names it finds. Xargs collects a batch of names and issues a single “rm” for the entire batch. Find does not pause, and the number of “rm” invocations is only about the number of objects found divided by the value of the “-n” option of xargs, that is, 3,000 for 60,000 files with “-n 20”, rather than 60,000. There is no forked process for every single object, which was the case for “-exec rm”.

Note that the “find” command in the RHEL operating system produces a list of files; it is often useful to be able to supply that list as arguments to another command. Normally, this is done with the shell’s command substitution feature, as in the example of searching for a symbol such as POSIX_OPEN_MAX in the system header files. The following command does the job:

$ grep POSIX_OPEN_MAX /dev/null $(find /usr/include -type f | sort)

When writing a program or a command that deals with a list of objects or files, one should make sure that the operation behaves properly if the list is empty or if the list is exceptionally large. Because “grep” reads the standard input when it is given no file arguments, a “/dev/null” argument is supplied to ensure that it does not hang waiting for terminal input if “find” produces no output. This should not happen in the above example, but it is good practice to develop defensive programming habits.

The output from the substituted command can sometimes be lengthy, with the result that a kernel limit on the combined length of a command line and its environment variables is exceeded. When that happens, you’ll see the “Argument list too long” error. The “best solution” to the ARG_MAX problem is provided by xargs: it takes a list of arguments on standard input, one per line, and feeds them in suitably sized groups (determined by the host’s value of ARG_MAX) to another command given as arguments to xargs. The following example eliminates the “Argument list too long” error:

$ find /usr/include -type f | xargs grep POSIX_ARG_MAX /dev/null
/usr/include/bits/posix1_lim.h:#define _POSIX_ARG_MAX 4096
/usr/include/bits/xopen_lim.h:#define NL_ARGMAX _POSIX_ARG_MAX

Here, the “/dev/null” argument ensures that grep always sees at least two file arguments, causing it to print the filename at the start of each reported match. If xargs gets no input filenames, it terminates silently without even invoking its argument program. GNU xargs has the “--null” option to handle the NUL-terminated filename lists produced by GNU find’s “-print0” option. Xargs passes each such filename as a complete argument to the command that it runs, without danger of shell (mis)interpretation or newline confusion; it is then up to that command to handle its arguments sensibly.

Xargs has options to control where the arguments are substituted, and to limit the number of arguments passed to one invocation of the argument command. The GNU version can even run multiple argument processes in parallel. However, the simple form shown here suffices most of the time. Consult the xargs(1) manual pages for further details, and for examples of some of the wizardry possible with its fancier features.
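The NUL-terminated form can be tried safely in a scratch directory. The sketch below is illustrative only: the two filenames are made up, and one of them deliberately contains a blank. It uses “-0”, the short form of the GNU “--null” option.

```shell
#!/bin/sh
# Work in a scratch directory so that no real files are touched.
workdir=$(mktemp -d)
touch "$workdir/ff 00001.txt" "$workdir/ff00002.txt"

# -print0 ends each name with a NUL byte, and -0 tells xargs to split
# its input on NUL, so the name containing a blank stays one argument.
count=$(find "$workdir" -type f -name 'ff*.txt' -print0 | xargs -0 ls | wc -l)
echo "files listed: $count"

# Clean up the scratch directory.
rm -r "$workdir"
```

With the plain “-print” and no “-0”, xargs would split the first name at the blank and “ls” would report two files that do not exist.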

HOW TO CREATE THOUSANDS OF EMPTY FILES FOR TESTING:

I started investigating the ARG_MAX problem by generating thousands of empty files. The following code was used to generate 12,880 empty files with the names “ff00001.txt” to “ff12880.txt”:

filename x_code "x_code.sas";

%macro create_txt;
  data _null_;
    file x_code;
    /* Write one X statement per file into x_code.sas */
    %do i = 1 %to 12880;
      ii = "x 'touch ff" || put(&i,z5.) || ".txt';";
      put ii;
    %end;
  run;
%mend create_txt;
%create_txt;

data _null_;
  x "mkdir ./cty";
  x "cd ./cty/";
  /* Run the generated X statements inside ./cty */
  %include x_code;
  x "\rm ../x_code.sas";
run;

The above macro can be used to generate any number of empty files to test the limits of your system. Feel free to change 12,880 to some other large number that you would like to test. The code above writes a list of generated “X” statements to a file called “x_code.sas”, creates a subdirectory “cty”, and then includes the file in the data step (%include). This executes 12,880 “x” statements, generating that many empty files with the names “ff00001.txt” through “ff12880.txt”. Of course there are other ways of generating the 12,880 empty files, but this should suffice to demonstrate the point.
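One of those other ways, sketched here outside of SAS, is to generate the names with “awk” and hand them to “touch” through xargs. This is an alternative to the macro, not the method used in the paper; the scratch directory created with “mktemp” is an assumption made so that the sketch leaves nothing behind.

```shell
#!/bin/sh
# Scratch directory so that the test files do not clutter a real directory.
testdir=$(mktemp -d)

# Print the names ff00001.txt through ff12880.txt, one per line, and let
# xargs pass them to touch in batches that stay within ARG_MAX.
(
    cd "$testdir" || exit 1
    awk 'BEGIN { for (i = 1; i <= 12880; i++) printf "ff%05d.txt\n", i }' | xargs touch
)

# Count the files with find, which matches the quoted pattern itself,
# so no wildcard is expanded on the command line.
created=$(find "$testdir" -type f -name 'ff?????.txt' | wc -l)
echo "created $created files"

# Remove the scratch directory and everything in it.
rm -r "$testdir"
```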

EXAMPLE 1: A SIMPLE FILE LISTING TO EXCEED ARG_MAX

Execute the above code to generate the 12,880 files, and then try to remove (rm) the files by issuing the following command:

$ \rm *
ksh: rm: Argument list too long

Oops, “Argument list too long”. Remember, ARG_MAX is 131072 bytes in the RHEL system. Let’s try to calculate how many bytes we have used in the (rm) command.

12880 * 12 + 754 = 155,314 > ARG_MAX = 131,072 (exceeded)
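The arithmetic can be reproduced with shell integer arithmetic. In this sketch the 12 bytes per file are read as the 11 characters of a name such as “ff00001.txt” plus one terminating NUL, which is an interpretation consistent with the formula given earlier; the 754 environment bytes are the value measured on the author's machine.

```shell
#!/bin/sh
files=12880
name_bytes=$((11 + 1))   # 11 characters in "ff00001.txt" plus 1 NUL (interpretation)
env_bytes=754            # value reported in the paper; differs per machine
limit=131072             # ARG_MAX quoted for RHEL

total=$((files * name_bytes + env_bytes))
echo "total bytes: $total"

if [ "$total" -gt "$limit" ]; then
    echo "exceeds ARG_MAX by $((total - limit)) bytes"
fi
```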

Note: the calculation varies from machine to machine and one operating system to another as discussed earlier. The values that I am using in this paper refer to the RHEL machine and operating system.

EXAMPLE 2: USING XARGS ON 20,000 FILES

I modified the macro that generated 12,880 files to generate 20,000 files instead. When I executed the SAS macro, the system said that it was “out of memory”. So, I split the creation of the 20,000 files into two steps, as you can see in the following code:

filename x_code1 "x_code1.sas";
filename x_code2 "x_code2.sas";

%macro create_txt;
  data _null_;
    file x_code1;
    %do i = 1 %to 10000;
      ii = "x 'touch ff" || put(&i,z5.) || ".txt';";
      put ii;
    %end;
  run;
  data _null_;
    file x_code2;
    %do i = 10001 %to 20000;
      ii = "x 'touch ff" || put(&i,z5.) || ".txt';";
      put ii;
    %end;
  run;
%mend create_txt;
%create_txt;

data _null_;
  x "cd ./cty/";
  x "find . -type f -name 'ff?????.txt' -print | xargs -n 20 rm";
  %include x_code1;
  x "\rm ../x_code1.sas";
run;

data _null_;
  %include x_code2;
  x "\rm ../x_code2.sas";
run;

The example above uses xargs to remove any existing files before the new ones are created. If you try to remove the files without using xargs, you will get the “Argument list too long” error. Xargs passes a batch of 20 arguments at a time and will not exceed ARG_MAX.
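The batching can be observed without touching any files by substituting “echo” for “rm” and feeding xargs a generated list of names. The sketch below is illustrative only; the names are produced by “awk” rather than by “find”, and each invocation of “echo” prints exactly one line, so counting lines counts invocations.

```shell
#!/bin/sh
# Generate 20,000 names of the form used in the paper, pass them to echo
# in batches of 20, and count how many times echo was run.
invocations=$(
    awk 'BEGIN { for (i = 1; i <= 20000; i++) printf "ff%05d.txt\n", i }' |
        xargs -n 20 echo |
        wc -l
)
echo "echo was invoked $invocations times"
```

With 20,000 names and “-n 20”, the command runs 1,000 times, each time with an argument list far below ARG_MAX.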

EXAMPLE 3: USING XARGS TO READ THE 20,000 FILES

The following code reads in the names of the 20,000 “.txt” files with the help of xargs, which feeds the filenames found by “find” to the “ls” command:

filename cty pipe "cd ./cty/; find . -type f -name 'ff?????.txt' -print | xargs -n 20 ls 2>/dev/null";

* Read in the County Files;
data cty_site;
  infile cty truncover;
  input @5 site $5.;
run;

proc sort;
  by site;
run;
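To see what the INPUT statement picks up, note that “find .” prints each name with a leading “./”, so the five-digit number starts in column 5. The shell sketch below mimics “input @5 site $5.” with “cut”; the sample line is an assumed example of what the pipe delivers to SAS.

```shell
#!/bin/sh
# One line as it would arrive from the pipe (assumed sample).
line='./ff00001.txt'

# Columns 5 through 9 correspond to "@5 site $5." in the DATA step.
site=$(printf '%s\n' "$line" | cut -c5-9)
echo "site = $site"
```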

CONCLUSION

I have discussed and shown how to resolve the "Argument list too long" error. The Pipe channel of communication in SAS programs is a powerful tool and has the ability to list files in directories as long as ARG_MAX is not exceeded. After a careful review, the problem may be overcome by several methods; however, the one that seems to work best is xargs. Knowing the operating system constraints allows the programmer to code in a smart way without having to worry about errors generated as a result of some system constraint. SAS programs should run flawlessly, allowing information to flow properly down the pipe.

REFERENCES

SAS OnlineDoc® Version 9.1.3, http://support.sas.com/onlinedoc/913/docMainpage.jsp

Dynamically Allocating Exported Datasets by the Combination of Pipes and ‘X’ Statement, Xin Wei, Hoffmann-La Roche, SAS Global Forum 2009, Paper 083-2009, http://support.sas.com/resources/papers/proceedings09/083-2009.pdf, 3/2009.

Classic Shell Scripting, O’REILLY, Arnold Robbins & Nelson H.F. Beebe.

Design and Methodology, American Community Survey, April, 2009

Red Hat Linux Operating System Documentation

ARG_MAX, maximum length of arguments for a new process, http://www.in-ulm.de/~mascheck/various/argmax/index.html

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION

Your comments and questions are welcomed and encouraged. Contact the author at:

Fuad J Foty
U.S. Census Bureau
4600 Silver Hill Road
Room 3K460F
Suitland, MD 20746
301-763-5476
[email protected]
