Finding All Differences in Two SAS® Libraries Using PROC COMPARE
By: Bharat Kumar Janapala, Company: Individual Contributor

Abstract:
In the clinical industry, validating datasets by parallel programming and comparing the independently derived datasets is routine practice, but constant updates to the raw data make it increasingly difficult to pin down the differences between two datasets. The COMPARE procedure is a valuable tool in these situations, and it serves a variety of other purposes, including data validation, data cleaning, and reporting the differences between datasets efficiently. This paper demonstrates how PROC COMPARE and the SASHELP views (SASHELP.VTABLE and SASHELP.VCOLUMN) can be combined to compare two SAS® libraries. First, the datasets present in each library are identified and the datasets found in only one library are listed. Second, the number of observations and the variables in each pair of like-named datasets are compared, listing the variables that are not common to both and the datasets whose observation counts differ. Third, on the assumption that like-named datasets should be identical, the COMPARE procedure compares them and captures the differences in their values; for efficiency, the number of differences reported per variable is capped. Finally, all the differences are collected into a consolidated report, followed by a detailed description by dataset.

Introduction
In most companies, the usual way to compare datasets is the standard code “PROC COMPARE BASE = BASE_LIB.DATASET COMPARE = COMPARE_LIB.DATASET;”, run once per dataset. Comparing every dataset in a library this way is a time-consuming process. Hence, I made an attempt to report all the differences between two libraries efficiently with the help of PROC COMPARE and the SASHELP views.

My Solutions

Assigning libraries:
The macro COMPARE_LIB takes the paths of the base and the compare library and assigns them to the librefs BASE and COMPARE.
It also lets the programmer use the formats when they are present in these libraries (FORMAT=Y). Otherwise, OPTIONS NOFMTERR is set so that SAS can process the datasets without their formats.

    %MACRO COMPARE_LIB (BASE= ,COMPARE= ,FORMAT=N);
    /*ASSIGNING THE LIBRARIES*/
    LIBNAME BASE "&BASE.";
    %LET BASE_PATH=%SYSFUNC(TRANWRD(&BASE.,\,\\));
    LIBNAME COMPARE "&COMPARE.";
    %LET COMPARE_PATH=%SYSFUNC(TRANWRD(&COMPARE.,\,\\));
    /*INCLUDING FORMATS*/
    %IF &FORMAT = N %THEN %DO;
        OPTIONS NOFMTERR;
    %END;
    %IF &FORMAT = Y %THEN %DO;
        OPTIONS FMTSEARCH = (BASE COMPARE);
    %END;

Data Extraction
The information about each library is extracted with the macro %DATA. The macro creates two datasets per library: LIBOBS (BASEOBS and COMPAREOBS) and LIBVAR (BASEVAR and COMPAREVAR). The LIBOBS dataset lists every dataset in the library along with its number of observations, extracted from SASHELP.VTABLE. The LIBVAR dataset contains the dataset name, column name, length, label, format, informat and type of every variable, extracted from SASHELP.VCOLUMN.

    %MACRO DATA (LIB) ;
    PROC SQL;
        /*ONE ROW PER DATASET WITH ITS NUMBER OF OBSERVATIONS*/
        /*LIBNAME IS STORED IN UPPERCASE IN THE SASHELP VIEWS*/
        CREATE TABLE &LIB.OBS AS
            SELECT DISTINCT MEMNAME , NOBS AS &LIB.OBS
            FROM SASHELP.VTABLE
            WHERE LIBNAME EQ "&LIB." AND MEMTYPE EQ "DATA"
            ORDER BY MEMNAME;
        /*ONE ROW PER VARIABLE WITH ITS ATTRIBUTES*/
        CREATE TABLE &LIB.VAR AS
            SELECT MEMNAME, NAME, LENGTH , LABEL, FORMAT, INFORMAT, XTYPE
            FROM SASHELP.VCOLUMN
            WHERE LIBNAME EQ "&LIB." AND MEMTYPE EQ "DATA"
            ORDER BY MEMNAME, NAME;
    QUIT;
    %MEND DATA ;
    %DATA(BASE);
    %DATA(COMPARE);

Dataset and Observation Differences using LIBOBS:
The LIBOBS datasets are used to find the two major kinds of difference between the libraries. BASEOBS and COMPAREOBS are merged into a dataset OBS, and a flag variable DIFF marks a dataset that is missing from one library as “MISSING :: BASE DATASET” or “MISSING :: COMPARE DATASET”.
If a dataset is present in both libraries, the numbers of observations are compared and any difference is assigned to the flag DIFF as “DIFFERENCE :: NUMBER OF OBSERVATIONS (N)”, where N is the absolute difference between the two observation counts. A second variable, Variables, is also created to record the counts themselves as “BASEOBS = XX COMPAREOBS = XX”, where XX is the number of observations in each dataset.

    DATA OBS;
        MERGE BASEOBS (IN = A) COMPAREOBS (IN = B);
        BY MEMNAME;
        FORMAT DIFF $75.;
        IF BASEOBS NE COMPAREOBS THEN DIFF = PROPCASE("DIFFERENCE :: NUMBER OF OBSERVATIONS ( ")|| STRIP(PUT(ABS(BASEOBS - COMPAREOBS),8.))||" )";
        /*A MISSING COUNT MEANS THE DATASET IS ABSENT FROM THAT LIBRARY*/
        IF BASEOBS = . THEN DIFF = PROPCASE("MISSING :: BASE DATASET") ;
        IF COMPAREOBS = . THEN DIFF = PROPCASE("MISSING :: COMPARE DATASET") ;
        /*KEEP ONLY THE DATASETS WITH A DIFFERENCE*/
        IF DIFF NE "";
        IF SCAN(DIFF,1) = PROPCASE("DIFFERENCE") THEN Variables = PROPCASE("BASEOBS = ") || STRIP(PUT(BASEOBS,8.)) || PROPCASE(" \LINE COMPAREOBS = ") || STRIP(PUT(COMPAREOBS,8.)) ;
    RUN;
    PROC SORT DATA = OBS;
        BY MEMNAME ;
    RUN;

Differences in variable properties using LIBVAR:
The LIBVAR datasets are used to find differences in the characteristics of the variables, namely label, length, format, informat, and type. The LIBVAR datasets extracted from SASHELP.VCOLUMN are first transposed using the %TRANS macro and then merged together to create the VAR dataset. The VAR dataset is then merged with the OBS dataset to exclude the datasets that are missing from either library, leaving only the datasets present in both. The mismatching properties are then flagged in the DIFF variable as “DIFFERENCE :: TYPE BASE = XYZ COMPARE = XYZ”, where TYPE names the category (label, length, format, informat, or type) and XYZ is the value of that property in the BASE and COMPARE library respectively.
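To make the attribute comparison concrete, here is a minimal sketch on toy data. It is not the paper's %TRANS-based method: it swaps in a plain PROC SQL inner join, which is a simpler way to see the same idea. The datasets below are hypothetical stand-ins for BASEVAR and COMPAREVAR with invented values. The output name ATTR_DIFF and the BASE_/COMPARE_ column names are assumptions made for illustration, and only length and label are compared.

```sas
/* Hypothetical stand-ins for BASEVAR and COMPAREVAR; all values are invented. */
data basevar;
  infile datalines dsd truncover;
  input memname :$32. name :$32. length label :$40.;
  datalines;
DM,AGE,8,Age
DM,SEX,1,Sex
;
run;

data comparevar;
  infile datalines dsd truncover;
  input memname :$32. name :$32. length label :$40.;
  datalines;
DM,AGE,8,Age (years)
DM,SEX,2,Sex
;
run;

/* Match variables on dataset name and variable name, then keep only */
/* the rows where at least one of the compared attributes differs.   */
proc sql;
  create table attr_diff as
  select b.memname,
         b.name,
         b.length as base_length,
         c.length as compare_length,
         b.label  as base_label,
         c.label  as compare_label
  from basevar as b
       inner join comparevar as c
       on b.memname = c.memname and b.name = c.name
  where b.length ne c.length or b.label ne c.label;
quit;

proc print data=attr_diff noobs;
run;
```

In this toy case both variables are returned: AGE because its label differs and SEX because its length differs. An inner join only sees variables present in both libraries, so a variable that exists on one side only would need a full join or a separate check.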
The %TRANS macro and the two merges that build the VAR dataset are:

    %MACRO TRANS(DS);
    PROC TRANSPOSE DATA = &DS.VAR OUT = &DS.VAR (DROP = _LABEL_) PREFIX = &DS.;
        VAR NAME LENGTH LABEL FORMAT INFORMAT XTYPE;
        BY MEMNAME NAME;
    RUN;
    PROC SORT DATA = &DS.VAR;
        BY MEMNAME NAME _NAME_;
    RUN;
    %MEND TRANS;
    %TRANS(BASE);
    %TRANS(COMPARE);

    DATA VAR;
        MERGE BASEVAR COMPAREVAR ;
        BY MEMNAME NAME _NAME_;
    RUN;

    DATA VAR;
        MERGE OBS (IN = A WHERE = (BASEOBS = . OR COMPAREOBS = .) KEEP = MEMNAME BASEOBS COMPAREOBS) VAR;
        BY MEMNAME;
        RENAME NAME = Variables;
        /*DROP THE DATASETS THAT ARE MISSING FROM EITHER LIBRARY*/
        IF NOT A;
        /*KEEP ONLY THE PROPERTIES THAT DIFFER*/
        IF BASE1 NE COMPARE1;
        DIFF = PROPCASE("DIFFERENCE :: ") || PROPCASE (_NAME_)|| " \LINE BASE = " || STRIP(BASE1) || " \LINE COMPARE = " || STRIP(COMPARE1);
        KEEP NAME DIFF MEMNAME;
    RUN;

Differences in Observations using PROC COMPARE:
Comparing the data using PROC COMPARE follows the steps below.
1) The datasets present in both libraries are identified by merging the BASEOBS and COMPAREOBS datasets.
2) A macro variable ds_list is created, containing the names of the datasets present in both libraries separated by “~”.
3) Inside a %DO loop, PROC COMPARE is run against each of the common datasets to create the COMPAREOUT dataset, with the options OUTNOEQUAL OUTBASE OUTCOMP to capture only the observations that contain a difference.
4) If the two datasets are identical, an empty COMPARE dataset (0 observations) is created. Otherwise, the captured differences are processed further to build the COMPARE dataset, which lists the differences by variable. To keep the run time down, only the first 22 rows of COMPAREOUT (OPTIONS OBS = 22) are read to create the dataset ALL, and at most 10 differences per variable are reported. The dataset ALL is first transposed by observation number and source (_TYPE_ = BASE or COMPARE) and then split into two datasets, BASE and COMP. The BASE and COMP datasets are then merged to create the dataset COMPARE, in which the values of each variable are compared to create the DIFF flag “DIFFERENCE :: Variable BASE=XYZ COMPARE=XYZ”, where XYZ is the value in the BASE and COMPARE dataset respectively. If a variable has more than 10 differences, a new flag is created: "DIFFERENCE :: Variables *GREATER THAN 10 DIFFERENCES PER Variables IN DATASET* "
5) Each of these COMPARE datasets is then combined to form the ALLCOMPARE dataset.

    DATA DATASETS;
        MERGE BASEOBS (IN = A KEEP = MEMNAME ) COMPAREOBS (IN = B KEEP = MEMNAME );
        BY MEMNAME;
        IF A AND B;
    RUN;

    PROC SQL NOPRINT;
        SELECT DISTINCT MEMNAME INTO :DS_LIST SEPARATED BY '~'
        FROM DATASETS
        ORDER BY MEMNAME;
    QUIT;
    /*SQLOBS HOLDS THE NUMBER OF ROWS RETURNED BY THE LAST SELECT*/
    %LET DS_CNT = &SQLOBS.;

    %LET NO_DIFF = 0 ;
    %DO I = 1 %TO &DS_CNT;
        %LET DS = %SCAN (&DS_LIST., &I., ~);
        PROC COMPARE BASE = BASE.&DS. COMPARE = COMPARE.&DS. OUTNOEQUAL OUTBASE OUTCOMP NOPRINT OUT = COMPAREOUT;
        RUN;
        /*OPTIMISED THE CODE TO RUN FAST BY NOT RUNNING FILES THAT ARE EQUAL */
        PROC SQL NOPRINT;
            SELECT COUNT(*) INTO : COMPARECNT FROM COMPAREOUT;
        QUIT;
        %PUT &COMPARECNT. ;
        %IF &COMPARECNT. = 0 %THEN %DO;
            %LET NO_DIFF = %EVAL(&NO_DIFF. + 1 ) ;
            %PUT &NO_DIFF.;
            DATA COMPARE;
                SET _NULL_;
            RUN;
        %END;
        %ELSE %DO;
            /*READ ONLY THE FIRST 22 ROWS OF COMPAREOUT*/
            OPTIONS OBS = 22;
            PROC SORT DATA = COMPAREOUT OUT = ALL ;
                BY _OBS_ _TYPE_ ;
            RUN;
            OPTIONS OBS = MAX;
            PROC TRANSPOSE DATA = ALL OUT = ALL;
                VAR _ALL_;
                BY _OBS_ _TYPE_;
            RUN;
            DATA BASE COMP ;
                SET ALL;
                IF _TYPE_ = "BASE" THEN OUTPUT BASE ;
                IF _TYPE_ = "COMPARE" THEN OUTPUT COMP;
            RUN;
            PROC SORT DATA = BASE ;
                BY _NAME_ _OBS_ _LABEL_;
            RUN;
            PROC SORT DATA = COMP ;
                BY _NAME_ _OBS_ _LABEL_;
            RUN;
            DATA COMPARE;
                MERGE BASE (DROP = _TYPE_ _LABEL_ RENAME = (COL1 = BASE))
                      COMP (DROP = _TYPE_ _LABEL_ RENAME = (COL1 = COMPARE)) ;
                BY _NAME_ _OBS_ ;
                WHERE _NAME_ NOT IN ("_TYPE_" "_OBS_");
                IF BASE NE COMPARE ;
                DIFF = PROPCASE("DIFFERENCE :: Variable \LINE\tab {\uc1\u10146\' 2} BASE = ") || STRIP(BASE) || PROPCASE(" \LINE\tab {\uc1\u10146\' 2} COMPARE = ") || STRIP(COMPARE) ;
                RENAME _NAME_ = Variables;
                MEMNAME = "&DS.";
            RUN;
            DATA COMPARE;
                SET COMPARE;
                BY Variables _OBS_ ;
                /*COUNT THE DIFFERENCES WITHIN EACH VARIABLE*/
                RETAIN N 0;
                IF FIRST.Variables THEN N = 0;
                N + 1;
                IF N = 11 THEN DO ;
                    _OBS_ = .U ;
                    DIFF = PROPCASE("DIFFERENCE :: Variables *GREATER THAN 10 DIFFERENCES PER Variables