<<

NESUG 2008 Coders' Corner

Run SAS ® code based on extensions Shiva Srinivasan, PJM Interconnection LLC, Norristown, PA Shailaja P Ramesh, Prosoft Software Inc, Wayne, PA

ABSTRACT

Do you want to execute SAS processes based on the filename extension? Do you want to know how to the Windows structure information via SAS?

You have come to the right place. This paper breezes you through some quick and slick SAS code that you can use to read the Windows dir structure, look through the directories, sub-directories, given a root and then finally process SAS code based on the filename extensions. The paper uses SAS pipes, SAS char functions, SAS DATASTEP, and SAS Macro functionalities.

INTRODUCTION

Sometimes we would need to process files a little or very differently when their types are different. All files have file extensions which identify them as belonging to a particular file type. The objective of this paper is to automate a process which scans for files within a root , identifies the file extension and then executes SAS macros depending upon their file extension.

As we are all aware in the Windows operating system (o/s) the files are stored in a file structure such as Root Dir->Dir->Files. We can view a list of files and directories in a particular directory or a root directory by using the command “dir ”. In SAS we can pipe the output of O/S commands to a fileref using the FILENAME statement and then read the contents using the INFILE file reference statement in a DATA STEP. From then on it’s just a matter of putting the right logic in place to get to the desired results. The following sections explain in detail the SAS code that can be used in any SAS process which needs to process individual files by scanning through directories.

CODE DETAILS

Suppose we have files of types XLS (Excel spreadsheet) and CSV (Comma Separated) that we needed to process. We also have two efficient macros, %isExcel and %isCsv put in place that would process files of these types. Now suppose we have hundreds of these files organized in folders. It would be a nightmare for us to run these macros for every file and also run the correct macro i.e. %isExcel for XLS files and %isCsv for CSV files. Wouldn’t it be great if we had a process in place which takes in the root directory as the parameter, and then scans through all the sub directories and gathers a list of files in a SAS dataset? Once the file information such as filename, and extension have been captured we could easily kick off %isExcel and %isCsv macros from within a data step. Yes, it would become that simple.

1 NESUG 2008 Coders' Corner

The file structure for the example is shown below

D:\Personal\NESUG_PIT a. :\Personal\NESUG_PIT\Loads\Loads_Aug_06 • D:\Personal\NESUG_PIT\Loads\Loads_Aug_06\load_08_01_2006.CSV • D:\Personal\NESUG_PIT\Loads\Loads_Aug_06\load_08_01_2006.XLS • D:\Personal\NESUG_PIT\Loads\Loads_Aug_06\load_08_02_2006.CSV • D:\Personal\NESUG_PIT\Loads\Loads_Aug_06\load_08_02_2006.XLS • D:\Personal\NESUG_PIT\Loads\Loads_Aug_06\load_08_03_2006.CSV • D:\Personal\NESUG_PIT\Loads\Loads_Aug_06\load_08_03_2006.XLS

b. D:\Personal\NESUG_PIT\Loads\Loads_Aug_07 • D:\Personal\NESUG_PIT\Loads\Loads_Aug_07\ load_aug_01_2007.CSV • D:\Personal\NESUG_PIT\Loads\Loads_Aug_07\ load_aug_01_2007.XLS

The root or startDir is D:\Personal\NESUG_PIT which has two sub directories; Loads_Aug_06 and Loads_Aug_07. The sub directory Loads_Aug_06 has six files, three csv files and three xls files. The sub directory Loads_Aug_07 has two files, one csv file and one xls file. The objective of the automation is to get a list of directories within D:\Personal\NESUG_PIT and a list of files within D:\Personal\NESUG_PIT and all its sub directories into a SAS dataset for further processing.

This automation of reading files from a given directory or folder is achieved through a set of macros listed below.

1. main 2. readDirectories 3. getFiles 4. processFiles

%main macro

%macro main ; %let host = \\sas05awd; /* Remote Server name or Drive name eg d:\ */ %let startDir= sasDev$\sasAdHoc\sasUsers\srinis\Steam; %let baseFileList=FileList; /* Name of the Files dataset */ %let baseDirList=DirList; /* Name of the Directories dataset */

%readDirectories (&host,&startDir,baseFileList=&baseFileList, baseDirList=&baseDirList);

%processFiles ;

%mend ;

%main ;

2 NESUG 2008 Coders' Corner

This macro acts as a controller program and passes parameters such as host (server name or drive name), startDir (root directory or root folder path), baseFileList (SAS dataset name which would store list of files) and baseDirList (SAS dataset name which would store a list of directories) to the %readDirectories macro. Once the readDirectories macro does its job of creating the SAS datasets FileList and DirList the processFiles macro is executed which kicks off the %isExcel and %isCsv macros to process individual files of their respective file types.

%readDirectories macro

%macro readDirectories(host,startDir,baseFileList=FileList, baseDirList=DirList);

/* Clear the Work directory */ proc datasets library=WORK nolist; delete &baseDirList &baseFileList; quit;

/* Make a call to the %getFiles macro to read all directories */ %getFiles (&host,&startDir,&baseFileList,&baseDirList); %let curListNumber = 1;

/* Repeat the call to macro getFiles to get a list of all Files */ /* in all directories*/ %do %while (&startDir ne %str()); %let startDir=;

data _null_; /* Limit to read the ouput of Windows command in this eg is */ /* 500 characters. Increase the size as needed or get the */ /* maximum length of the output in a SAS macro variable */ length inDir $ 500 ; set &baseDirList; if _n_ = &curListNumber; inDir = trim(left(parentDir))||"\"||trim(left(File)); call symput('startDir',inDir); run;

%let curListNumber = %eval(&curListNumber+1);

%if &startDir ne %str() %then %do; %getFiles (&host,&startDir,&baseFileList,&baseDirList); %end; %end;

%mend readDirectories;

The macro %readDirectories contains all the information such as host name, folder path for the startDir and SAS dataset names needed to create a list of directories and files contained in the root or startDir. This is actually achieved by executing the %getFiles macro several times; once for creating a list of sub directories within the root directory or

3 NESUG 2008 Coders' Corner

startDir, and then it’s executed again to get a list of files within every sub directory. This process is repeated until there are no further directories to scan.

%getFiles macro

%macro getFiles(host,curDir,FileList,DirList);

/* Assign the fileref "fileinfo" to a pipe file with the ouput */ /* from the Windows command "dir", which lists the files in the */ /* directory &host\&curDir */ %let filrf=fileinfo; %let rc=%sysfunc(filename(filrf, %str(dir &host\%nrbquote("&curDir")), pipe));

/* Reads the files/folders listed in the output of the Windows */ /* dir command which is now the fileref "fileinfo". */ data Files (Keep=host parentDir line File Extension isExcel isCsv Date Dir);

length host $ 100 parentDir $ 250 File $ 50 Extension $ 25 isExcel 8. isCsv 8. Date $ 18 Dir 8. ;

/* Get the length of the observation and store it in “l” var */ infile fileinfo length=l; input @; /* variable “line” has varying length specified by the length */ /* of the row that the infile statement is currently in */ input @ 1 line $varying200. l;

/* Reads the name of the file with extension */ /* in Windows the output of the dir command would list files */ /* in the following format */ /* 06/23/2008 10:01 AM

Dir_Name */ /* for directories and */ /* 08/02/2007 04:25 PM FileName.FileExtension */ /* for files */ File = trim(left(substr(line, 40 ,50 )));

/* Read the extension using a combination of find, length and */ /* substr SAS functions */ Extension = upcase(substr(File,find(File,'.')+ 1,length(File)));

/* Use the select SAS function to read the File extension for */ /* files listed in a particular directory and set the dataset */ /* variables isExcel and isCsv accordingly for further processing */ select (Extension); when ('XLS') do; isExcel = 1; isCsv = 0; end; when ('CSV') do; isCsv = 1; isExcel = 0; end;

4 NESUG 2008 Coders' Corner

otherwise do; isExcel = 0; isCsv = 0; end; end; Date = substr(line, 1,18 ); /* Get the parent Dir from the macro variable curDir */ parentDir = "&curDir"; /* Get the host from the macro variable curDir */ host = "&host";

/* Files/Dirs with names and no missing values */ if substr(date, 3,1)='/' and File ne ' ';

/* Set the Dir dataset variable to 1 or 0 for further processing */ if substr(line, 26 ,3) = "DIR" then Dir = 1; else Dir = 0;

if File ne '.' and File ne '..'; run;

/* Clear the Fileref */ filename fileinfo clear;

proc sort data=Files; by Dir File; run;

/* Make a list of files in all directories within a given */ /* directory and save them to a SAS dataset */ proc append base=&FileList data=Files(where=(dir eq 0)); run;

/* Make a list of directories within a given */ /* directory and save them to a SAS dataset */ proc append base=&DirList data=Files(where=(dir)); run;

%mend getFiles;

The first 2 statements in the %getFiles macro are the most important step in the whole process.

%let filrf=fileinfo; %let rc=%sysfunc(filename(filrf, %str(dir &host\%nrbquote("&curDir")), pipe));

This step assigns the fileref “fileinfo” to a pipe file with the output from a WINDOWS command “dir”, which lists the files in “&curDir” (macro variable which resolves to the directory path). The next step is to read the fileref using a SAS INFILE statement. The filename/dirname, extension, date file/dir was created are all variables that be read in from the output of the dir command using the substr(var,startPos,LastPos) SAS

5 NESUG 2008 Coders' Corner

character function. The %getFiles is a generic macro that is used for both scanning a list of sub directories and files. Data step variables such as isExcel, isCsv, and Dir store binary values, 0 or 1. These values are used in the %processFiles macro to kick of %isCsv and %isExcel macros. Since the %getFiles macro is called by the %readDirectories macro several times, a PROC APPEND SAS procedure is used such that all previous and current “rows” or observations are captured in the desired SAS datasets.

%processFiles macro

%macro processFiles ; %let curRecord = 1; %let endOfFile = 0;

/* Read the Filelist SAS dataset and process row by row in a */ /* SAS datastep until the last row has been processed */ %do %while (&endofFile ne 1); data _null_; length inDir $ 500 ; set Filelist end = eof; if _n_ = &curRecord; inDir = trim(left(parentDir))||"\"||trim(left(File)); /* Store the required variables as macro variables */ call symput('FilePath',inDir); call symput('EndOfFile',eof); call symput('isExcel',isExcel); call symput('isCsv',isCsv); run;

/* The Filelist SAS dataset contains the variables isExcel */ /* and isCSV which can further be used to run macros %isExcel */ /* and %isCSV. */ %put &FilePath; %if &isExcel = 1 %then %do; %ifExcel (&FilePath) %end; %if &isCsv = 1 %then %do; %ifCsv (&FilePath) %end; %let curRecord = %eval(&curRecord+1); %end; %mend processFiles;

The %processFiles macro reads every record in the FileList SAS dataset using the macro variable &curRecord in the statement if _n_ = &curRecord;

Using the “CALL SYMPUT” DATASTEP function, four macro variables are created which store FilePath (complete path of the file obtained by concatenating the parentDir and File DATASTEP variables), EndOfFile (stores the value of the datastep set statement

6 NESUG 2008 Coders' Corner

option end= which suggests whether the end of file has been reached or not), isExcel and isCsv information which is used to kick of %isExcel and %isCsv macros.

%isCsv and %isExcel macro’s

%macro ifCsv(File); %put The &File is an CSV File; %mend ;

%macro ifExcel(File); %put The &File is an XLS File; %mend ;

These are the actual final process that we intend to run for each filetype. For simplicity we are just printing the name of the file and its filetype to the log.

Feel free to use all the macros shown in this paper and improvise based on your needs.

CONCLUSION As we can see through some simple SAS code we were able to automate a process which might have seemed cumbersome and manual to do at first thought. By using techniques such as assigning the output of O/S commands to a fileref and performing SAS character functions to get the file extensions of all files within all sub folders from the root directory, we were able to automate the process and make use of our time in writing the code to process the xls and csv files.

REFERENCES:

SAS Support (http://support.sas.com)

ACKNOWLEDGMENTS SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. SAS is a Registered Trademark of the SAS Institute, Inc. of Cary, North Carolina.

CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: Author Name Shiva Srinivasan Shailaja P Ramesh Company PJM Prosoft Software Inc Address 955 Jefferson Ave 1275 Drummers Ln City State Norristown, PA 19403 Wayne, PA 19087 Work Phone: (610) 666-4524 (484) 363-3741 : [email protected] Web: www.pjm.com

7