Finding All Differences in two SAS® libraries using Proc Compare

By: Bharat Kumar Janapala

Company: Individual Contributor

Abstract:

In the clinical industry, validating datasets by parallel programming and comparing the derived datasets is routine practice, but constant updates to the raw data make it increasingly difficult to identify the differences between two datasets. The COMPARE procedure is a valuable tool in these situations, and for a variety of other purposes, including data validation, data cleaning, and reporting the differences between datasets efficiently.

The purpose of this paper is to demonstrate how PROC COMPARE and the SASHELP directories can be used together to compare two SAS® libraries. First, they identify the datasets present in each library and list the datasets that are not common to both. Second, they provide the total number of observations and variables in each dataset, listing the uncommon variables and the datasets whose observation counts differ. Third, for datasets with like names in both libraries, the COMPARE procedure captures the value-level differences, and the maximum number of differences reported per variable is capped to keep the run time manageable. Finally, all the differences are collected into a consolidated report, followed by a detailed description by dataset.

Introduction

In most companies, the usual way to compare multiple datasets is to run the standard code “PROC COMPARE BASE = BASE_LIB.DATASET COMPARE = COMPARE_LIB.DATASET;” once per dataset. Comparing every dataset in a library this way is a time-consuming process. Hence, I made an attempt to point out all the differences between two libraries in an optimized way with the help of PROC COMPARE and the SASHELP directories.
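For readers who have not used the procedure, the single-dataset comparison referred to above can be sketched as follows. This is a minimal illustration, not part of the COMPARE_LIB macro; the dataset names WORK.CLASS_BASE and WORK.CLASS_COMP are hypothetical, and SASHELP.CLASS is assumed to be available as sample data.

```sas
/* Hypothetical base copy of a sample dataset */
data work.class_base;
  set sashelp.class;
run;

/* Hypothetical compare copy with one value deliberately changed */
data work.class_comp;
  set sashelp.class;
  if name = "Alfred" then age = age + 1;
run;

/* The standard one-dataset-at-a-time comparison */
proc compare base = work.class_base compare = work.class_comp;
run;
```

Repeating this step by hand for every dataset in a library is the effort the macro in this paper tries to avoid.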

My Solutions

Assigning libraries:

The macro COMPARE_LIB takes the paths of the base and compare libraries and assigns them to the librefs BASE and COMPARE. The FORMAT parameter lets the programmer use the formats if they are stored in these libraries (FORMAT = Y). Otherwise (FORMAT = N, the default), OPTIONS NOFMTERR is set so that SAS can process the datasets without the formats.

%MACRO COMPARE_LIB (BASE= ,COMPARE= ,FORMAT=N);
/*ASSIGNING THE LIBRARIES*/
LIBNAME BASE "&BASE.";
%LET BASE_PATH=%SYSFUNC(TRANWRD(&BASE.,\,\\));
LIBNAME COMPARE "&COMPARE.";
%LET COMPARE_PATH=%SYSFUNC(TRANWRD(&COMPARE.,\,\\));

/*INCLUDING FORMATS*/
%IF &FORMAT = N %THEN %DO;
  OPTIONS NOFMTERR;
%END;
%IF &FORMAT = Y %THEN %DO;
  OPTIONS FMTSEARCH = (BASE COMPARE);
%END;

Data Extraction

The data from the libraries are extracted with the help of the macro %DATA. The macro creates two datasets per library: LIBOBS (BASEOBS and COMPAREOBS) and LIBVAR (BASEVAR and COMPAREVAR). The LIBOBS datasets list every dataset in the library along with its number of observations, extracted from SASHELP.VTABLE. The LIBVAR datasets contain the dataset name, column name, length, label, format, informat, and type, extracted from SASHELP.VCOLUMN.

%MACRO DATA (LIB) ;
PROC SQL;
  CREATE TABLE &LIB.OBS AS
    SELECT DISTINCT MEMNAME , NOBS AS &LIB.OBS
    FROM SASHELP.VTABLE
    WHERE LIBNAME EQ "&LIB." AND MEMTYPE EQ "DATA"
    ORDER BY MEMNAME;
  CREATE TABLE &LIB.VAR AS
    SELECT MEMNAME, NAME, LENGTH , LABEL, FORMAT, INFORMAT, XTYPE
    FROM SASHELP.VCOLUMN
    WHERE LIBNAME EQ "&LIB." AND MEMTYPE EQ "DATA"
    ORDER BY MEMNAME, NAME;
QUIT;
%MEND DATA ;
%DATA(BASE);
%DATA(COMPARE);
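As a side note, SASHELP.VTABLE is a view over the PROC SQL dictionary table DICTIONARY.TABLES, so the same information can be queried directly inside PROC SQL. The sketch below is an illustration only and is not the macro's method; the macro variable LIB and the output table WORK.LIB_TABLES are hypothetical names, and SASHELP is used simply because it exists in every session.

```sas
/* Hypothetical example library; librefs are stored in uppercase */
%let lib = SASHELP;

proc sql;
  /* One row per dataset with its observation count */
  create table work.lib_tables as
    select memname, nobs
    from dictionary.tables
    where libname = "%upcase(&lib)" and memtype = "DATA"
    order by memname;
quit;
```

Because the LIBNAME column holds uppercase values, wrapping the libref in %UPCASE guards against a lowercase argument silently returning no rows.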

Dataset and Observation Differences using LIBOBS:

The LIBOBS datasets are used to find the two major differences between the libraries. They are merged into a dataset OBS, and a flag variable DIFF marks missing datasets as “MISSING :: BASE/COMPARE DATASET”. If a dataset is present in both libraries, the numbers of observations are compared and any difference is assigned to DIFF as “DIFFERENCE :: NUMBER OF OBSERVATIONS (N)”, where N is the absolute difference in the number of observations between the two datasets. A second variable, Variables, records the observation counts as “BASEOBS = XX COMPAREOBS = XX”, where XX is the number of observations present in each dataset.

DATA OBS;
  MERGE BASEOBS (IN = A) COMPAREOBS (IN = B);
  BY MEMNAME;
  FORMAT DIFF $75.;
  IF BASEOBS NE COMPAREOBS THEN DIFF = PROPCASE("DIFFERENCE :: NUMBER OF OBSERVATIONS ( ")|| STRIP(PUT(ABS(BASEOBS - COMPAREOBS),8.))||" )";
  IF BASEOBS = . THEN DIFF = PROPCASE("MISSING :: BASE DATASET") ;
  IF COMPAREOBS = . THEN DIFF = PROPCASE("MISSING :: COMPARE DATASET") ;
  IF DIFF NE "";
  IF SCAN(DIFF,1) = PROPCASE("DIFFERENCE") THEN Variables = PROPCASE("BASEOBS = ") || STRIP(PUT(BASEOBS,8.)) || PROPCASE(" \LINE COMPAREOBS = ") || STRIP(PUT(COMPAREOBS,8.)) ;
PROC SORT; BY MEMNAME ;
RUN;

Differences in variable properties using LIBVAR:

The LIBVAR datasets are used to find differences in the characteristics of the variables, which include label, length, format, informat, and type. These differences are reported by creating the DIFF flag “DIFFERENCE :: TYPE BASE = XYZ COMPARE = XYZ”, where TYPE is the category (label, length, format, informat, or type) and XYZ is the value of that characteristic in each library.

The LIBVAR datasets extracted from SASHELP.VCOLUMN are first transposed using the %TRANS macro and then merged to create the VAR dataset. The VAR dataset is then merged with the OBS dataset to keep only the datasets present in both libraries. Finally, the mismatching properties are written to the DIFF variable in the format described above.

%MACRO TRANS(DS);
PROC TRANSPOSE DATA = &DS.VAR OUT = &DS.VAR (DROP = _LABEL_) PREFIX = &DS.;
  VAR NAME LENGTH LABEL FORMAT INFORMAT XTYPE;
  BY MEMNAME NAME;
PROC SORT; BY MEMNAME NAME _NAME_;
RUN;
%MEND TRANS;
%TRANS(BASE);
%TRANS(COMPARE);

DATA VAR; MERGE BASEVAR COMPAREVAR ; BY MEMNAME NAME _NAME_; RUN;

DATA VAR;
  MERGE OBS (IN = A WHERE = (BASEOBS = . OR COMPAREOBS = .) KEEP = MEMNAME BASEOBS COMPAREOBS) VAR;
  BY MEMNAME;
  RENAME NAME = Variables;
  /* DROP DATASETS THAT ARE MISSING FROM EITHER LIBRARY */
  IF NOT A;
  IF BASE1 NE COMPARE1;
  DIFF = PROPCASE("DIFFERENCE :: ") || PROPCASE (_NAME_)|| " \LINE BASE = " || STRIP(BASE1) || " \LINE COMPARE = " || STRIP(COMPARE1);
  KEEP NAME DIFF MEMNAME;
RUN;

Differences in Observations using PROC COMPARE:

Comparing the data using PROC COMPARE follows the steps below:

1) Identify the datasets present in both libraries by merging the BASEOBS and COMPAREOBS datasets.
2) Create a macro variable ds_list containing the names of the datasets present in both libraries, separated by “~”.
3) Using a %DO loop, run PROC COMPARE against each of the common datasets to create the COMPAREOUT dataset, with the options OUTNOEQUAL, OUTBASE, and OUTCOMP to capture the observations that contain a difference.
4) If both datasets are identical, create a COMPARE dataset containing 0 observations. Otherwise, process the differences further to create the COMPARE dataset by finding the differences by variable:
   - To optimize the process, only about 10 differences per dataset are considered when creating the dataset ALL (the code reads the first 22 rows of COMPAREOUT, that is, 11 BASE/COMPARE pairs).
   - The dataset ALL is transposed by observation number and dataset (BASE/COMPARE) and split into two datasets, BASE and COMP.
   - The BASE and COMP datasets are then merged to create the dataset COMPARE.
   - The COMPARE dataset compares each variable to create the DIFF flag “DIFFERENCE :: Variable BASE=XYZ COMPARE=XYZ”, where XYZ is the value present in the BASE and COMPARE datasets.
   - If more than 10 differences are present for a variable, a new flag is created: "DIFFERENCE :: Variables *GREATER THAN 10 DIFFERENCES PER Variables IN DATASET* ".
5) Combine each of these COMPARE datasets to form the ALLCOMPARE dataset.
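To make step 3 concrete, the toy example below shows the shape of the dataset that PROC COMPARE writes with these options. It is a sketch for illustration only; the datasets WORK.TOY_BASE, WORK.TOY_COMP, and WORK.TOY_OUT are hypothetical and are not used by the macro.

```sas
/* Hypothetical base dataset */
data work.toy_base;
  input id x $;
  datalines;
1 A
2 B
3 C
;
run;

/* Hypothetical compare dataset: the value of X differs in observation 2 */
data work.toy_comp;
  input id x $;
  datalines;
1 A
2 Z
3 C
;
run;

/* Same options as the macro: keep only unequal observations,
   and write both the BASE and the COMPARE version of each */
proc compare base = work.toy_base compare = work.toy_comp
             outnoequal outbase outcomp noprint out = work.toy_out;
run;

proc print data = work.toy_out;
run;
```

The output dataset carries the automatic variables _TYPE_ and _OBS_, which is why the macro can later sort and transpose by those two variables and split the rows into the BASE and COMP datasets.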

DATA DATASETS; MERGE BASEOBS (IN = A KEEP = MEMNAME ) COMPAREOBS (IN = B KEEP = MEMNAME ); BY MEMNAME; IF A AND B; RUN;

PROC SQL NOPRINT;
  SELECT DISTINCT(MEMNAME), COUNT(DISTINCT(MEMNAME))
    INTO :ds_list SEPARATED BY '~', :ds_cnt
    FROM DATASETS
    ORDER BY MEMNAME;
QUIT;

%LET NO_DIFF = 0 ; %DO I = 1 %TO &DS_CNT; %LET DS = %SCAN (&DS_LIST., &I., ~);

PROC COMPARE BASE = BASE.&DS. COMPARE = COMPARE.&DS. OUTNOEQUAL OUTBASE OUTCOMP NOPRINT OUT = COMPAREOUT; RUN;

/*OPTIMISED THE CODE TO RUN FAST BY NOT RUNNING FILES THAT ARE EQUAL */ PROC SQL NOPRINT; SELECT COUNT(*) INTO : COMPARECNT FROM COMPAREOUT; QUIT; %PUT &COMPARECNT. ;

%IF &COMPARECNT. = 0 %THEN %DO;

%LET NO_DIFF = %EVAL(&NO_DIFF. + 1 ) ;
%PUT &NO_DIFF.;
DATA COMPARE; SET _NULL_; RUN;
%END;

%ELSE %DO;

OPTIONS OBS = 22; PROC SORT DATA = COMPAREOUT OUT = ALL ; BY _OBS_ _TYPE_ ; RUN; OPTIONS OBS = MAX;

PROC TRANSPOSE DATA = ALL OUT = ALL; VAR _ALL_; BY _OBS_ _TYPE_; RUN;

DATA BASE COMP ; SET ALL; IF _TYPE_ = "BASE" THEN OUTPUT BASE ; IF _TYPE_ = "COMPARE" THEN OUTPUT COMP; RUN;

PROC SORT DATA = BASE ; BY _NAME_ _OBS_ _LABEL_; PROC SORT DATA = COMP ; BY _NAME_ _OBS_ _LABEL_; RUN;

DATA COMPARE; MERGE BASE (DROP = _TYPE_ _LABEL_ RENAME = (COL1 = BASE)) COMP (DROP = _TYPE_ _LABEL_ RENAME = (COL1 = COMPARE)) ; BY _NAME_ _OBS_ ; WHERE _NAME_ NOT IN ("_TYPE_" "_OBS_"); IF BASE NE COMPARE ; DIFF = PROPCASE("DIFFERENCE :: Variable \LINE\tab {\uc1\u10146\' 2} BASE = ") || STRIP(BASE) || PROPCASE(" \LINE\tab {\uc1\u10146\' 2} COMPARE = ") || STRIP(COMPARE) ; RENAME _NAME_ = Variables; MEMNAME = "&DS."; RUN;

DATA COMPARE;
  SET COMPARE;
  BY Variables _OBS_ ;
  RETAIN N 0;
  IF FIRST.Variables THEN N = 0;
  N + 1;
  /* THE 11TH DIFFERENCE FOR A VARIABLE IS REPLACED BY A SUMMARY FLAG */
  IF N = 11 THEN DO ;
    _OBS_ = .U ;
    DIFF = PROPCASE("DIFFERENCE :: Variables *GREATER THAN 10 DIFFERENCES PER Variables IN DATASET* ");
  END;
  KEEP Variables DIFF MEMNAME _OBS_;
PROC SORT; BY Variables _OBS_ ;
RUN;
%END;

/*COMBINING ALL COMPARE OUT*/
%IF &I. = 1 %THEN %DO;
  DATA ALLCOMPARE; SET COMPARE; RUN;
%END;
%ELSE %DO;
  DATA ALLCOMPARE;
    FORMAT MEMNAME $50. Variables $100. DIFF $5000.;
    SET COMPARE ALLCOMPARE;
  RUN;
%END;
/* CLOSES THE %DO I = 1 %TO &DS_CNT LOOP */
%END;
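The macro decides whether two datasets are equal by counting the rows of COMPAREOUT. As an alternative that some programmers prefer, PROC COMPARE also stores a return code in the automatic macro variable SYSINFO, where 0 indicates that no differences were found. The sketch below is an assumption-labelled illustration rather than the author's method; CMP_RC is a hypothetical macro variable name, and SASHELP.CLASS is compared with itself only so that the example is self-contained.

```sas
/* Compare a sample dataset with itself, so no differences are expected */
proc compare base = sashelp.class compare = sashelp.class noprint;
run;

/* Capture SYSINFO immediately: later steps overwrite it */
%let cmp_rc = &sysinfo;
%put NOTE: PROC COMPARE return code = &cmp_rc;
```

A nonzero value is a sum of codes describing the kinds of differences found, so inside the loop a test such as &cmp_rc = 0 could stand in for the row count when only equality matters.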

Merging All The Datasets to create the Final Dataset

The final dataset is obtained by (1) combining the datasets OBS, VAR, and ALLCOMPARE and (2) sorting the differences using two flags, SORT1 and SORT2. The SORT1 flag categorizes the datasets that are missing in either the BASE or the COMPARE library, whereas the SORT2 flag categorizes the differences by category (NAME, LENGTH, LABEL, FORMAT, INFORMAT, and XTYPE).

DATA FINAL;
  FORMAT MEMNAME $50. Variables $100. DIFF $5000.;
  SET OBS (KEEP = MEMNAME Variables DIFF) VAR ALLCOMPARE;
  /*SORTING*/
  IF SCAN(DIFF,1) = PROPCASE("MISSING") THEN DO;
    IF STRIP(SCAN(DIFF,3)) = PROPCASE("BASE") THEN SORT1 = 1 ;
    IF STRIP(SCAN(DIFF,3)) = PROPCASE("COMPARE") THEN SORT1 = 2 ;
  END;
  ELSE DO;
    SORT1 = 3;
    IF SCAN(DIFF,5) = PROPCASE("OBSERVATIONS") THEN SORT2 = 1;
    ELSE IF SCAN(DIFF,3) = PROPCASE("NAME") THEN SORT2 = 2;
    ELSE IF SCAN(DIFF,3) = PROPCASE("LENGTH") THEN SORT2 = 3;
    ELSE IF SCAN(DIFF,3) = PROPCASE("LABEL") THEN SORT2 = 4;
    ELSE IF SCAN(DIFF,3) = PROPCASE("FORMAT") THEN SORT2 = 5;
    ELSE IF SCAN(DIFF,3) = PROPCASE("INFORMAT") THEN SORT2 = 6;
    ELSE IF SCAN(DIFF,3) = PROPCASE("XTYPE") THEN SORT2 = 7;
    /* ELSE IF SCAN(DIFF,3) = PROPCASE("*") THEN _OBS_ = -8; */
    ELSE IF _OBS_ = . THEN SORT2 = 9;
    ELSE SORT2 = 99;
  END;
PROC SORT; BY SORT1 MEMNAME SORT2 Variables _OBS_;
RUN;

Summary of the FINAL dataset and Output

The summary of the final dataset is obtained by counting the SORT1 and SORT2 flags in the FINAL dataset into the macro variables DIFF_DS_CNT, BASEMISS, COMPAREMISS, DIFF_OBS, DIFF_NAME, DIFF_LENGTH, DIFF_LABEL, DIFF_FORMAT, DIFF_INFORMAT, DIFF_XTYPE, DIFF_GT10, and DIFF_VAR, and then building the SUMMARY dataset from these macro variables. The SUMMARY and FINAL datasets are then output using PROC REPORT.

PROC SQL NOPRINT;
  SELECT COUNT(MEMNAME) INTO : BASECNT FROM BASEOBS ;
  SELECT COUNT(MEMNAME) INTO : COMPARECNT FROM COMPAREOBS ;
  SELECT COUNT(DISTINCT MEMNAME) FORMAT = 3. ,
         COUNT(CASE WHEN SORT1 = 1 THEN MEMNAME ELSE "" END ) FORMAT = 3. ,
         COUNT(CASE WHEN SORT1 = 2 THEN MEMNAME ELSE "" END ) FORMAT = 3. ,
         COUNT(CASE WHEN SORT2 = 1 THEN MEMNAME ELSE "" END ) FORMAT = 3. ,
         COUNT(CASE WHEN SORT2 = 2 THEN MEMNAME ELSE "" END ) FORMAT = 3. ,
         COUNT(CASE WHEN SORT2 = 3 THEN MEMNAME ELSE "" END ) FORMAT = 3. ,
         COUNT(CASE WHEN SORT2 = 4 THEN MEMNAME ELSE "" END ) FORMAT = 3. ,
         COUNT(CASE WHEN SORT2 = 5 THEN MEMNAME ELSE "" END ) FORMAT = 3. ,
         COUNT(CASE WHEN SORT2 = 6 THEN MEMNAME ELSE "" END ) FORMAT = 3. ,
         COUNT(CASE WHEN SORT2 = 7 THEN MEMNAME ELSE "" END ) FORMAT = 3. ,
         /* COUNT(CASE WHEN SORT2 = 8 THEN MEMNAME ELSE "" END ) FORMAT = 3. ,*/
         COUNT(CASE WHEN SORT2 = 9 THEN MEMNAME ELSE "" END ) FORMAT = 3. ,
         COUNT(CASE WHEN _OBS_ GT 1 THEN MEMNAME ELSE "" END ) FORMAT = 5.
    INTO :DIFF_DS_CNT , :BASEMISS , :COMPAREMISS , :DIFF_OBS , :DIFF_NAME , :DIFF_LENGTH ,
         :DIFF_LABEL , :DIFF_FORMAT , :DIFF_INFORMAT , :DIFF_XTYPE , :DIFF_GT10 , :DIFF_VAR
    FROM FINAL;
QUIT;

DATA SUMMARY;
  FORMAT DISC $85. PRG $10.;
  %MACRO SUMMARY (SORT,PRG,DISC);
    DISC = PROPCASE("&DISC.");
    PRG = STRIP(PUT(&&&PRG.,5.));
    SORT = &SORT. ;
    OUTPUT;
  %MEND SUMMARY ;

  %SUMMARY (1,BASECNT,%STR(NUMBER OF DATASETS IN THE BASE DIRECTORY)) ;
  %SUMMARY (2,COMPARECNT,%STR(NUMBER OF DATASETS IN THE COMPARE DIRECTORY)) ;
  %SUMMARY (3,DS_CNT,%STR(NUMBER OF DATASETS IN BOTH THE DIRECTORIES)) ;
  %SUMMARY (4,BASEMISS,%STR(NUMBER OF DATASETS MISSING IN BASE DIRECTORY)) ;
  %SUMMARY (5,COMPAREMISS,%STR(NUMBER OF DATASETS MISSING IN COMPARE DIRECTORY)) ;
  %SUMMARY (6,NO_DIFF,%STR(NUMBER OF DATASETS WITH NO DIFFERENCES)) ;
  %SUMMARY (7,DIFF_DS_CNT,%STR(NUMBER OF DATASETS WITH DIFFERENCES)) ;
  %SUMMARY (8,DIFF_OBS,%STR(DATASETS WITH DIFFERENCES IN NUMBER OF OBSERVATIONS)) ;
  %SUMMARY (9,DIFF_NAME,%STR(DIFFERENCE IN VARIABLE NAMES)) ;
  %SUMMARY (11,DIFF_LENGTH,%STR(DIFFERENCE IN VARIABLE LENGTH)) ;
  %SUMMARY (12,DIFF_LABEL,%STR(DIFFERENCE IN VARIABLE LABELS)) ;
  %SUMMARY (13,DIFF_FORMAT,%STR(DIFFERENCE IN VARIABLE FORMAT)) ;
  %SUMMARY (14,DIFF_INFORMAT,%STR(DIFFERENCE IN VARIABLE INFORMAT)) ;
  %SUMMARY (15,DIFF_XTYPE,%STR(DIFFERENCE IN VARIABLE TYPE ( CHAR/ NUM ))) ;
  %SUMMARY (16,DIFF_GT10,%STR(VARIABLES WITH GREATER THAN 10 DIFFERENCES)) ;
  /* DIFF_VAR (NOT DIFF_GT10) HOLDS THE TOTAL COUNT CREATED IN THE PROC SQL STEP */
  %SUMMARY (17,DIFF_VAR,%STR(TOTAL NO OF VARIABLE DIFFERENCES)) ;
PROC SORT DATA = SUMMARY (WHERE = (PRG NE "0")) ; BY SORT;
RUN;

ods listing close;
ods noresults;
ods escapechar="^";
options orientation=landscape nonumber;
ods rtf file ="proc compare.rtf" style=RTF;
title1 j=l "Bharat's Proc Compare output" j=r "Testing";
title3 j=c "PROC COMPARE LIBRARIES";
title4 j=l "Study :: Test";
title5 j=l "Base :: &BASE_PATH. ";
title6 j=l "Compare Dir :: &COMPARE_PATH.";

footnote j=l "^{style [outputwidth=100% bordertopcolor=black bordertopwidth=2pt] SYSUSERID}" j=r "^{style [outputwidth=100% bordertopcolor=black bordertopwidth=2pt] Page: ^{thispage} of ^{lastpage}}";

PROC REPORT DATA = SUMMARY HEADLINE SPACING=1 MISSING SPLIT="#" STYLE(REPORT)={CELLPADDING=1PT RULES=COLS PROTECTSPECIALCHARS=OFF};
  COLUMN SORT DISC PRG;
  DEFINE SORT / Noprint ORDER ORDER=DATA ;
  DEFINE DISC / flow STYLE(COLUMN)=[CELLWIDTH=6IN] STYLE(HEADER)=[JUST=LEFT] 'DESCRIPTION' ;
  DEFINE PRG / STYLE(COLUMN)=[CELLWIDTH=1.2IN] STYLE(HEADER)=[JUST=CENTER] CENTER 'No: of Programs' ;
RUN;

PROC REPORT DATA = FINAL HEADLINE SPACING=1 MISSING SPLIT="#" STYLE(REPORT)={CELLPADDING=1PT RULES=COLS PROTECTSPECIALCHARS=OFF};
  COLUMN SORT1 MEMNAME SORT2 Variables _OBS_ DIFF ;
  DEFINE SORT1 / Noprint ORDER ORDER=DATA ;
  DEFINE SORT2 / Noprint ORDER ORDER=DATA ;
  DEFINE MEMNAME / Group ORDER ORDER=DATA STYLE(COLUMN)=[CELLWIDTH=2.2IN] STYLE(HEADER)=[JUST=LEFT] 'DATASET';
  DEFINE Variables / Group ORDER ORDER=DATA STYLE(COLUMN)=[CELLWIDTH=2.2IN] STYLE(HEADER)=[JUST=LEFT] 'Variables' ;
  DEFINE _OBS_ / Group ORDER ORDER=DATA STYLE(COLUMN)=[CELLWIDTH=0.5IN] STYLE(HEADER)=[JUST=LEFT] center 'OBS NO:' ;
  DEFINE DIFF / flow STYLE(COLUMN)=[CELLWIDTH=3.6IN] STYLE(HEADER)=[JUST=center] left 'Difference' ;
RUN;

/* CLOSE THE RTF DESTINATION SO THE REPORT FILE IS WRITTEN */
ODS RTF CLOSE;
%MEND COMPARE_LIB;

%COMPARE_LIB (BASE = C:\Users\bkumar16\Desktop\WUSS Paper\base , COMPARE = C:\Users\bkumar16\Desktop\WUSS Paper\Compare);

Conclusion

As mentioned above, the COMPARE_LIB macro is a compartmentalized approach to pointing out all the differences between two libraries in an optimized way with the help of PROC COMPARE and the SASHELP directories.

Contact Information

Name: Bharat Kumar Janapala
Enterprise: Individual Contributor
Address: 250 W Central Ave
City, State ZIP: Brea, CA 92821
Work Phone: 856-904-9336
E-mail: [email protected]
LinkedIn: https://www.linkedin.com/in/bharat-kumar-janapala-1319315b

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.