CC13 An Automatic Process to Compare Files

Simon Lin, Merck & Co., Inc., Rahway, NJ
Huei-Ling Chen, Merck & Co., Inc., Rahway, NJ

ABSTRACT

Comparing different versions of output files is a common task in the validation stage of work. When the number of output files is small, the simplest way to compare versions is to open each file manually and check it visually. When the files to be compared are in a standard format, a more systematic approach is to use SAS. In certain applications, thousands of output files may be generated for one project. Because it is tedious to enter thousands of file names in SAS code, there is a need to automate the process so that all file names in the specified folders are read in automatically by SAS. This paper shows how the X command features in SAS can be used to compare multiple folders without specifying individual file names. The files compared can be Word documents, SAS programs, output logs, text files, and datasets. In addition, the decision on how to compare a specific pair of files is made automatically, based on defined file attributes. The summary report uses PROC REPORT to present the comparison outcome. The automatic process presented is simple and portable, and it can easily be combined with macros that compare individual files.

KEYWORDS X COMMAND, PROC REPORT, QC, VALIDATION, PROC COMPARE

INTRODUCTION

In the pharmaceutical industry, the input datasets, programs, output listings, and tables used for clinical trial analysis and reporting are usually stored under a standardized folder structure with specified file names. When the files are updated, the datasets, programs, and output reports need to be compared with their earlier versions. Likewise, files stored in similar folders (such as backup folders) must be checked against each other, and testing folders must be compared with production folders. Done manually, the comparison is cumbersome and time consuming, and when the number of files is large, automating it becomes critical.

These comparisons can certainly be handled outside of SAS, especially when no specific output report is needed. However, some applications require a comparison report that identifies the differences between specified files in various folders, including statistics and details about those differences. It is therefore important to be able to combine the comparison process and the statistical reporting in the same SAS session. The methodology presented here uses X statements within SAS to automate the comparison and generates different reports based on the file attributes.

Papers such as Xu et al. (2007) [1] have presented macros for comparing individual output files. Applying the automatic file name read-in process described here extends that capability to the comparison of multiple output files.

X COMMAND BASICS

Running DOS or Windows Commands from within SAS

The X statement can execute DOS or Windows commands from the SAS session [2]. The X statement has the following syntax:

X <'Command'>

This paper uses two DOS commands to do the work:

COMMAND   DESCRIPTION
CD        Change directories.
DIR       List the contents of a directory.

XWAIT and XSYNC System Options

The XWAIT system option controls whether the SAS user must type EXIT to return to the SAS session after an X statement or X command has finished executing a DOS command. With XWAIT in effect, the user must type EXIT. With NOXWAIT, the command processor returns to the SAS session automatically after the specified command is executed, and there is no need to type EXIT.

The XSYNC system option causes the operating system command to execute synchronously with the SAS session. That is, control is not returned to the SAS System until the command has completed. With NOXSYNC, the SAS user can execute an X command or X statement and return to the SAS session without closing the window spawned by the X command or X statement.

This paper recommends using the combination of NOXWAIT and XSYNC because the command prompt window closes automatically when the application finishes, and the SAS System waits for the application to finish.

IMPLEMENTATION AND ALGORITHM

Step 1: Get the file names

X commands and the NOXWAIT and XSYNC options are used to obtain the list of files. In the example illustrated here, two folders are compared. To perform the comparison within SAS, the contents of each folder are first written to a text file.

The following key syntax shows how to write the file names in the specified folders, along with the other folder contents and file attributes, to text files. The folders to be compared are supplied as user-defined macro parameters, while the names of the text files that hold the folder contents are arbitrary. In this paper, &basefder and &compfder are the user-defined macro parameters for the two folders, and &outfder is the folder in which the text files are written; 'base' is the text file name for folder &basefder and 'compare' is the text file name for folder &compfder.

Key syntax

   options noxwait xsync;
   x "cd &outfder";
   x "dir &basefder. >base.txt";
   x "dir &compfder. >compare.txt";
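As a hedged variation on the key syntax above, the sketch below writes each listing directly to a full path, so the CD command is not needed, and wraps the paths in doubled quotation marks so that folder names containing spaces are passed to DOS intact. The folder paths are hypothetical placeholders, and the check of the automatic macro variable SYSRC (the last return code from the operating system) is an assumption about how a failed DIR command might be detected; it is not part of the original macro.

```sas
/* Hypothetical folders: replace with real paths.            */
/* &outfder is assumed to end with a backslash.              */
%let outfder  = C:\temp\;
%let basefder = C:\project\production\data;
%let compfder = C:\project\testing\data;

options noxwait xsync;

/* Doubled quotation marks inside the string become literal  */
/* quotation marks in the DOS command.                       */
x "dir ""&basefder"" > ""&outfder.base.txt""";
%put NOTE: Return code from DIR on the base folder: &sysrc;

x "dir ""&compfder"" > ""&outfder.compare.txt""";
%put NOTE: Return code from DIR on the compare folder: &sysrc;
```

A nonzero value in the log would suggest that the folder could not be listed, in which case the text file should not be trusted as input to the next step.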

Step 2: Convert the text files to SAS datasets

Each text file created in Step 1 is read into a SAS dataset by using the INFILE statement. The variables taken from the folder content text file are: file name, file created or modified date, file created or modified time, and file size. The file name contains the suffix or extension, which usually indicates the type of file. An additional user-defined macro parameter, &ftype, specifies which type of file is to be compared. Only files with the extension given in &ftype are compared. For example, &ftype=sas7bdat compares SAS datasets.

The following code illustrates how to read the text file from Step 1 into a SAS dataset.

Key syntax

   data base;
      infile "&outfder.base.txt" firstobs = 8 delimiter = ' ' missover;
      *** Declare the informat before INPUT so that filenm gets length 120;
      informat filenm $120.;
      input date_b mmddyy10. +2 time_b $5. ampm_b $2. +10 size_b $ filenm & $ ;
      format date_b mmddyy10.;
      part1_b = scan (filenm,1,'.');
      part2_b = scan (filenm,2,'.');
      filenmlb=length(filenm);
      *** Keep only the files of the requested type;
      if part2_b = "&ftype";
   run;

   data compare;
      infile "&outfder.compare.txt" firstobs = 8 delimiter = ' ' missover;
      informat filenm $120.;
      input date_c mmddyy10. +2 time_c $5. ampm_c $2. +10 size_c $ filenm & $ ;
      format date_c mmddyy10.;
      part1_c = scan (filenm,1,'.');
      part2_c = scan (filenm,2,'.');
      filenmlc=length(filenm);
      if part2_c = "&ftype";
   run;
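The key syntax above takes the second dot-delimited piece of the file name as the extension. As an illustrative sketch only, the following self-contained DATA step contrasts that with taking the last piece and lowercasing it, which may be preferable when file names contain more than one dot or differ in case. The dataset name ext_demo, the variable names ext_second and ext_last, and the sample file names are hypothetical.

```sas
/* Hypothetical demonstration data, not taken from the paper */
data ext_demo;
   length filenm $120 ext_second ext_last $32;
   infile datalines truncover;
   input filenm $120.;
   /* Second dot-delimited piece, as in the key syntax       */
   ext_second = scan(filenm, 2, '.');
   /* Last dot-delimited piece, lowercased                   */
   ext_last   = lowcase(scan(filenm, -1, '.'));
   datalines;
adsl.sas7bdat
report.v2.rtf
AE.SAS7BDAT
;
run;

proc print data=ext_demo noobs;
run;
```

For report.v2.rtf, ext_second is v2 while ext_last is rtf; for AE.SAS7BDAT, ext_last is sas7bdat, so a comparison with &ftype=sas7bdat would still select the file.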

Step 3: Summarize the comparison outcome

After the 'base' and 'compare' datasets have been created from the folder content text files, they are ready for the folder comparison. To start the comparison, the base folder description dataset and the compare folder description dataset are merged by file name. The merge can produce five outcomes: files that are exactly the same; files with the same size but a different date or time; files with different sizes; files found only in the base folder; and files found only in the compare folder. A flag is introduced to classify these outcomes.

When two files share the same file name, the same created date and time, and the same file size, it is reasonable to assume that they are the same file. When two files with the same file name have the same file size but a different created date and/or time, the file contents may or may not be the same. When two files with the same file name have different file sizes, it is almost certain that the file contents differ. The remaining possibilities are that the file exists in one folder but not the other. The flag created in this step is used later to decide which files need further comparison.

This paper uses PROC FREQ and PROC REPORT to summarize the findings of the comparison.

Key syntax

   *** MERGE with a BY statement requires both datasets to be sorted by filenm;
   proc sort data=base; by filenm; run;
   proc sort data=compare; by filenm; run;

   data result;
      merge base(in=in1) compare(in=in2);
      by filenm;

      if in1 = 1 and in2 = 1 then do;
         if date_b = date_c and time_b = time_c and ampm_b = ampm_c and size_b eq size_c then flag = 1;
         else if size_b eq size_c then flag = 2;
         else if size_b ne size_c then flag = 3;
      end;
      if in2 = 0 then flag = 4;
      if in1 = 0 then flag = 5;

      label filenm = 'File Name' flag = 'Compare Result';
   run;

   *** Generate summary reports for the discrepancies;
   proc format;
      value compflag
         1 = 'Same File'
         2 = 'Different Files but Same Size'
         3 = 'Files with Different Size'
         4 = 'Files Found Only in Base Folder'
         5 = 'Files Found Only in Compare Folder'
      ;
   run;

   ods pdf file="&outfder.&report..pdf" ;

   proc freq data=result;
      table flag / nocum nopercent ;
      format flag compflag.;
      title 'Summary result of the folder comparison';
      title2 "Base folder: &basefder";
      title3 "Compare folder: &compfder";
   run;

   proc report data=result nowindows;
      column .. ;

      define . .

      compute after ;

      title 'Same File';
      title2 "Base folder: &basefder";
      title3 "Compare folder: &compfder";
   run;

Step 4: Compare individual files

When the folder comparison summary result suggests the two files are different, the next logical step is to compare the individual file content. If the two files to be compared are SAS datasets, PROC COMPARE can be used to compare the difference in terms of the variables and observations. The following syntax would direct SAS to go through all SAS dataset pairs with either different dataset sizes or dataset dates identified in Step 3 from the two compared folders and compare them with PROC COMPARE.

Key syntax

   %if &ftype = sas7bdat %then %do;
      data diff;
         set result;
         if flag in (2,3) ;
      run;

      proc sql noprint;
         select count (flag) into: num from diff;
      quit;

      /* Remove the leading blanks from the count */
      %let num=&num;

      proc sql noprint;
         select part1_c into :file1 - :file&num from diff;
      quit;

      libname base "&basefder";
      libname compare "&compfder";

      %do i=1 %to &num;
         proc compare base = base.&&file&i compare = compare.&&file&i ;
         run;
      %end;
   %end;
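When many dataset pairs are compared, it can help to record in the log which pairs actually differ. The sketch below is one possible way to do this, using the automatic macro variable SYSINFO, which PROC COMPARE sets to a return code that is 0 when no differences are found. The macro name cmp_one and the dataset name adsl are hypothetical, and the sketch assumes that the BASE and COMPARE librefs from the key syntax are already assigned.

```sas
/* Hypothetical helper macro, not part of the original paper */
%macro cmp_one(ds);
   %local rc;

   /* NOPRINT suppresses the printed comparison output */
   proc compare base=base.&ds compare=compare.&ds noprint;
   run;

   /* SYSINFO is reset by the next step, so capture it immediately */
   %let rc = &sysinfo;

   %if &rc = 0 %then %put NOTE: &ds shows no differences between the two folders.;
   %else %put NOTE: &ds differs (PROC COMPARE return code = &rc).;
%mend cmp_one;

/* Hypothetical dataset name; BASE and COMPARE librefs must exist */
%cmp_one(adsl)
```

Inside the %DO loop of the key syntax, the same check could be applied to each &&file&i, with the full PROC COMPARE output requested only for the pairs whose return code is not 0.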

The application shown here uses PROC COMPARE to compare SAS datasets. Other applications can use existing macros, for example the macro of Xu et al. (2007) [1], to compare Word or RTF documents. Combining such a macro with the automatic file name read-in process extends it to the comparison of multiple Word or RTF files.

OUTPUT

The ODS PDF statement and the REPORT procedure generate a summary result that is stored in a PDF file.

Snapshot of the Comparison Report:

CONCLUSION

An automatic process to compare files and generate reports has been explored with the use of X statements within SAS. A generic macro has been developed to accomplish the automation. The examples show how it can shorten a potentially lengthy comparison process and produce different reports based on file attributes. The macro can easily be combined with macros that compare individual files.

REFERENCES

[1] Xu, M., Zhou, J. (2007) “%DiFF: A SAS Macro to Compare Documents in Word or ASCII Format,” in Proceedings of the Pharmaceutical SAS Users Group Conference (PharmaSUG 2007)

[2] SAS online document: "Running DOS or Windows Commands from within SAS."

ACKNOWLEDGMENTS The authors would like to thank the management team of Merck Research Laboratories for their advice on this paper/presentation.

Contact Information Your comments and questions are valued and encouraged. Contact the authors at:

Simon Lin
Merck & Co., Inc.
126 Lincoln Avenue
P.O. Box 2000
Rahway, NJ 07065
Phone: 732-594-0773

Huei-Ling Chen
Merck & Co., Inc.
126 Lincoln Avenue
P.O. Box 2000
Rahway, NJ 07065
Phone: 732-594-2249

TRADEMARK

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.