SUGI 31 Data Warehousing, Management and Quality

Paper 101-31
Utilizing External Data Dictionaries to Build SQL Queries in Base SAS®
Mike Tangedal, US Bank, St. Paul, MN

ABSTRACT
Documentation of created variables within a data warehouse requires a compromise between definitions in code (Base SAS) and text supplied by analysts. Business rules nestled within code serve no documentation audience other than skilled SAS programmers, not the management and analysts who ultimately provide the logic for these rules. However, business rules stated purely in terms of the customer are not directly translatable into code without major concessions to both the sophistication of the code and the expertise of the customers. A workable solution is to store each business rule for each unique defined variable in an external file readily accessible to the Base SAS language, customers, and analysts alike. The challenge to the programmer is then to successfully implement these external business rules into a usable end product. The challenge to the analysts is to document all business rules in absolute terms of the data available. Such a solution involves a unique approach from the SAS professional, much of which is discussed in both code examples and concepts. Programming issues discussed include macro routines to verify the existence of required data sets, development of a hierarchical data dictionary, parameter verification within Base SAS, limits of macro variables within SQL, and remote building of a sophisticated SQL query.

PURPOSE
An oft-neglected yet critical component of any successful implementation of a data warehouse is the incorporation of the data dictionary. The data dictionary stores the necessary business rules applied to the source data before it becomes readily available within any data warehouse cube structure. After the loading phase, where the data updates are physically loaded into the data warehouse structure, and after the initial quality assurance phase, when the data is checked for initial validity and conformity with previous loads, comes the phase of applying business rules. Since additional quality assurance is almost always a mandatory precaution before placing data in a dimensional structure available for reporting, a data dictionary containing these established rules is likewise a necessity. Although the business rule components of such a quality assurance platform need not reside in any rigid structure, ease of maintenance and implementation warrants segregating each component into an entry in a separate file referred to as a data dictionary. The methodology employed to implement an external data dictionary file in a SAS SQL query tool involves some macro code sophistication and development of a hierarchical structure for the data dictionary itself.

INTRODUCTION
The most direct methodology for transferring the business rules required to transform the available data residing in the data warehouse into summarized report-level data consisting of dimensional categories and calculated metrics is to place all of this logic within a SAS code module.
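
To make this direct approach concrete, here is a minimal sketch, not taken from the production code, of a summarization step with its business rule written straight into the program. The data set work.accounts, its variables, and the output name work.summary are hypothetical and exist only for illustration.

```sas
/* Hypothetical account-level source data (illustration only) */
data work.accounts;
   input region $ balance;
   datalines;
East 100
East 300
West 250
West .
;
run;

/* The business rule for avg_balance lives only in this code:       */
/* no non-missing balances yields missing, otherwise sum over count */
proc sql;
   create table work.summary as
   select region,
          count(*)     as records,
          sum(balance) as balance_sum,
          case when count(balance) = 0 then .
               else sum(balance) / count(balance)
          end          as avg_balance
   from work.accounts
   group by region;
quit;
```

In a layout like this, any change to how missing balances are treated means editing the program itself.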

[Figure: diagram with elements labeled Customer, Data Warehouse, and Report]

This most direct approach fails in two major ways. First, any business rules included in the process are under the complete control of the programmer. Such a ‘black box’ approach puts all responsibility for maintenance and communication of these business rules on the programmer. Second, the stress put upon an increasingly complex block of code increases the chances of its failure. Maintaining complex logic structures within a single large block of code is not only difficult but dangerous. Given the number of standard dimensions and metrics available for each data warehouse source, multiplied by the complexity of each business rule accounting for missing and default values, the block of code required can be so large as to be unmanageable in a single program.

An approach that takes into account the future needs of the customer, rather than the most direct solution to the problem, reveals a better solution for both the customer and the programmer.

In the most practical terms, the customers for whom the end reports are created are going to discuss the data contained within them in terms specific to their business. Academic descriptions of these terms as popularized in the well-known data warehouse manuals do not suffice. Your customer is not going to appear as a typical customer database definition in a data warehouse manual. Your customers need to derive business rules specific to their needs and always in terms of the data available.

These terms need to be defined in a centralized location accessible to all pertinent members of your business group. Therein lies the need for a data dictionary separate from any single SAS program but accessible both from SAS and from your business customers.

The job of the SAS programmer savvy to the ways of data warehouse architecture is to create these definitions so that they are both translatable into a programming language (SQL or Base SAS) and relatively easy for the business users to discern. The business rules themselves will of course not be in clear English, but additional descriptive text should be made available to accompany each business rule in its most basic form.

Such a resulting data dictionary can then be used to create reports that are not only of maximum benefit to the end user but also fully documented through the use of the data dictionary as a reference tool. Your customers will be able to understand the business rule definitions completely without the burden and bother of understanding the overall SAS program. Making the data dictionary as collaborative an effort as possible benefits both the customer and the programmer.
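
As an illustration of pairing each rule with descriptive text, a data dictionary could hold one row per created variable. The sketch below is an assumption about layout, not the author's actual file: the data set name work.datadict, the columns varname, logic, and description, and the two sample rules are all hypothetical.

```sas
/* Hypothetical data dictionary: one row per created variable.      */
/* LOGIC holds the rule in code form, DESCRIPTION is for analysts.  */
data work.datadict;
   length varname $32 logic $200 description $200;
   varname     = 'util_rate';
   logic       = 'case when credit_limit in (.,0) then . else balance/credit_limit end';
   description = 'Balance as a share of credit limit, missing when no limit is on file';
   output;
   varname     = 'records';
   logic       = 'count(*)';
   description = 'Number of source rows in each summary cell';
   output;
run;

/* Print the dictionary as a plain reference listing for end users */
proc print data=work.datadict noobs label;
   label varname     = 'Variable'
         logic       = 'Business rule'
         description = 'Description';
run;
```

Keeping the code form and the plain-language description side by side lets analysts review a rule without reading the program that applies it.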

[Figure: diagram with elements labeled Lookup Libraries & Format Tables, Customer, Data Warehouse, Data Dictionary, Report, and Customers]

The integrity of a data dictionary is more easily maintained when definitions and descriptions are in the same format. Instead of arbitrarily segmenting such a complex block of logic into modular parts, the best solution is to map each business rule logic block to a separate location and reference them all through mapping within the program. Cross-referencing such a data definition library is best done within a spreadsheet or database. Through the use of the much-improved ‘Proc Import’ procedure in SAS, referencing and utilizing the contents of these files is simple.

IMPLEMENTATION
Creation of a validated and usable data dictionary containing business rule definitions for each created variable, amongst other entries, is a far greater challenge than creating the code to utilize these business rules in a production SAS program. The main reason this task is so daunting is that creation of each unique business rule requires extensive coordination between all interested parties. Trust in the business rules is bought with the foresight the programmer gains by working with analysts and customers.

For example, a common created variable requiring a business rule may be an average based on the sum from one field divided by the count of another field. The simplest business rule would appear as follows in Base SAS code:

   field_avg = field_sum / field_cnt;

Again, this simplest, most direct solution is not the best. What if one of the fields in the calculation is zero or, worse, missing? The result is a mess in the summary file. The best solution is derived from first meeting with the customers to decide how missing and zero values are to be interpreted. Most likely the resulting business rule will then appear as follows:

   if field_cnt in (.,0) then field_avg=.;
   else if field_sum in (.,0) then field_avg=0;
   else field_avg = field_sum / field_cnt;

The structure of the data dictionary itself should fit as a module within the overall concept of a control file hierarchy, as noted in various papers and books by legendary SAS programmer Art Carpenter. Mr. Carpenter explains the concept and the SAS code used to implement this concept far better than I ever could. In brief, one of the main control files contains a list of all other files utilized in the data dictionary. The data dictionary reference file can be as simple as a flat file containing the name of the variable along with the business rule defining this variable. Also handy, if not mandatory, is a list of source files found on the data warehouse and the variables contained in each file. Development of a hierarchical control file structure will ease the utilization of the data dictionary concept significantly.

MACRO PARAMETERS USED TO CREATE PRODUCTION REPORTS FROM DATA DICTIONARY: THE ADHOC PROGRAM

The front-end tool consists of a simple SAS program called ‘AdHoc’. Portions of this code are explained at the end of this paper. The SAS program thoroughly explains all the input parameters available. Parameters are also available to customize output to a high degree. The AdHoc query tool can create almost any query falling within the context of the standard business rules.

The AdHoc SAS program allows for various user parameters to select the source query file, standard dimensions, and metrics in order to create a resulting data set containing dimensions and metrics from the source query table. The complete list of user parameters available for use with the AdHoc program follows.

• Source – The name of the source query table to be read. This parameter assumes a standard file format and directory available within the particular operating system. The other assumption is that these source files are readily accessible through SQL queries using Base SAS.
• Wear – The customizable selection criteria inserted as a ‘where’ clause in the query to the source table. No default value is assigned and the user is responsible for proper context of the code appearing in this parameter.
• Outset – The name of the resulting data set created by the query, with the default value being ‘AdHoc’.
• Msaccess – The flag set to either ‘Y’ or ‘N’ (with ‘N’ being the default value) used to determine whether the resulting data set is to be saved as a Microsoft Access database table.
• Account – The flag set to either ‘Y’ or ‘N’ (with ‘N’ being the default value) used to include the primary source file key in the data set resulting from the query.
• Extras – The list of additional variables to be added to the resulting data set. Note the contents of this field are only applied if the ‘account’ parameter is set to ‘Y’, and the user is responsible for proper context of the code appearing in this parameter.
• Dims – The list of dimension variables to be stored in the resulting data set, with the default value being all available dimension variables (‘all’).
• Mets – The list of metric variables to be stored in the resulting data set, with the default value being all available metric variables (‘all’).
• Stopit – The flag set to either ‘Y’ or ‘N’ (with the default value set to ‘N’) which stops processing if the contents of the ‘dims’ or ‘mets’ parameter contain incorrect values.

ADHOC PROGRAM DESCRIPTION
The SAS program called ‘AdHoc’ creates data sets based on the parameters specified. The program creates no printed reports. Below is an outline of the steps taken during the processing of the AdHoc program.

1. The source file lookup table is read to determine the proper source table record. The proper record is determined through the ‘source’ parameter along with the performance month specified. From this table is extracted the proper name of the source table, the primary key variable for this table, and the logic file location used to set up any needed lookup tables before the source table query is run.

2. The values entered into the parameters ‘dims’ and ‘mets’ are parsed into unique lists of values.

3. The summary variable lookup table is read to build lists of all possible dimensions and metrics given the contents of the parameter ‘source’. Also read from this table is the format associated with the dimension or metric variable.

4. The user-specified list of dimension and metric variables is matched to the master list for that ‘source’ to create a corrected version of the submitted ‘mets’ and ‘dims’ parameters to use in the resulting data set. If the parameter ‘stopit’ is set to ‘Y’ and the submitted list of dimensions or metrics does not match the master list, then the program is terminated at this point with a message noting the dimensions and/or metrics that were incorrect given the value for ‘source’.

5. The summary variable component lookup table is read to build a list of all history table variables needed to create the query given the ‘source’ and the list of corrected dimensions and metrics.

6. The summary variable logic table is read to extract the code used to compile the dimension or metric variable. The proper tab within the spreadsheet is determined through the ‘source’ variable, and rows are selected using the contents of the corrected dimensions and metrics lists. The resulting data set contains either blocks of code to be inserted directly into the query against the history table or pointers to files containing blocks of code to be inserted into the query. Pointers are used in some cases because the spreadsheet truncates long text strings.

7. The blocks of code referenced by the pointers in the logic spreadsheet are concatenated into one temporary file for ease of insertion into the query against the history table.

8. The actual blocks of code in the logic are translated to the appropriate SQL statements for insertion into the query against the history table.

9. The block of SAS code referenced by the pointer in the history lookup table is executed so that the history table query has all necessary associated lookup tables available.

10. The completed version of the ‘proc SQL’ code to be submitted as the query to the history table is compiled and submitted. The name of the resulting data set is taken from the parameter ‘outset’. If the parameter ‘account’ is set to ‘Y’, then the resulting data set is at the account level, and additional variables may be created through the use of the ‘extras’ parameter. However, if the parameter ‘account’ is not set to ‘Y’, then the resulting data set is summarized on a combination of all dimension variables for all metrics.

Processing logic is applied to the temporary file containing the block of code created by the pointers in the
logic spreadsheet before the contents of the file are applied to the query. First, comments within the temporary file are removed. Specialized processing is also required, as source history tables may have different formats by performance month. If the ‘account’ parameter is not set to ‘Y’, then additional processing is required to build the metric variable definition as a summary variable.

The code compiled from the logic spreadsheet entries not including pointers can be placed directly into the query since it has already been translated (step 8). The next step is to create the ‘from’ component of the ‘proc SQL’ statement using the component variable lists compiled in step 5. If the ‘wear’ parameter has an entry, a ‘where’ entry is added to the ‘proc SQL’ statement. If the ‘account’ parameter is not set to ‘Y’, then a ‘group by’ entry is created using the dimension list.

11. If the parameter ‘msaccess’ is set to ‘Y’, then the resulting data set from the query is also saved as a Microsoft Access database table within the default directory.

DATA DICTIONARIES AS DOCUMENTATION SOURCE
Implementation of a hierarchical data dictionary file structure allows for ease of use in editing existing business rules as well as documentation of the business rules by analysts and end users. The data definitions stored within the spreadsheet or database serving as the library contain the most direct yet most thorough definition of the business rule itself. The data definitions are written in a logical structure but do not contain all the formatting required of the programming language. The security and coordination required for such a project are increased but are no more than the current maintenance of a series of production SAS programs. The increased complexity of the SAS programs required to build queries from business rules stored in outside files is balanced by the business rules themselves being defined in the clearest manner possible.

The business rules within the library are altered and updated as simple text entries. No additional formatting is required. As long as the logic within the text is sound and the variable names match those in the history and lookup tables, the logic contained within the library entries will be applied to the tables referenced by the SQL query.

A simple HTML-based script can be written to publish the contents of the data dictionary file as documentation for all pertinent end users. In this manner, the data dictionary file can serve both as a source of documentation and as the single source of all business rules to be applied to the base data before summarization into standard dimensional variables or metrics.

CONCLUSION
The key to implementation of an external data dictionary for any data warehouse standard summarization program is an organized hierarchical structure of data files composing the data dictionary itself and the means by which to implement the contents of these external files in SAS. Base SAS is used not so much as the data compilation tool but as an interpreter of the data dictionary contents into a SQL query. The SQL query built by Base SAS serves to create the required summary data set. Once the restrictions of using the SAS macro language in building an SQL query are
known, the AdHoc program described in this paper can be thought of as a database query tool to the data dictionary.

Note that implementation of such a construct should only be done once all business rules are clearly defined. The locations of the source files and necessary lookup files should also be well defined within a production environment. Such a tool should be utilized only with a clear understanding of the exact processes taking place in a regularly updated or queried source file. Once the existing production process is well understood, moving the business rules to an external data dictionary and utilizing the constructs shown in the AdHoc program can serve to ease the maintenance of business rules as well as their documentation for all end users.

REFERENCES
Carpenter, Arthur L. and Richard O. Smith. 2004. “Data Management: Building a Dynamic Application.” Proceedings of the Fourteenth Annual Midwest SAS Users Group Conference. Cary, NC: SAS Institute Inc.

Kimball, Ralph. 1996. The Data Warehouse Toolkit. John Wiley and Sons, Inc. 287 pp.

ABOUT THE AUTHOR
Mike Tangedal has been employed as a data analyst within the Risk Management division of US Bank since 2001. He has 21 years of SAS experience with 16 years as a professional SAS programmer. Much of his programming expertise has been devoted to quality assurance applications and efficiency in Base SAS. He has presented various papers at SUGI, from macro development to quality control reporting.

Mike Tangedal
651-205-0743
[email protected]

ADHOC CODE (selected sections…)

%macro adhoc(source=, wear=, outset=AdHoc, msaccess=N, account=N, extras=, dims=all, mets=all, stopit=N);
%exists
%if &exist=Y and &source^= %then %do;

/*** STEP 1 *** Read lookup table containing source table information to extract data set name, key variable within the data set, and any set up code required to run before the main query *******/
PROC IMPORT OUT=histlist
   datafile="&inlogic.SAS Information\AdHoc\Source Table Directory.xls"
   dbms=Excel replace;
   SHEET="Sheet1";
   GETNAMES=YES;
RUN;

/* code here cleans up the HISTLIST data set and creates macro variables from it****/

%let dsid = %sysfunc(open(histlist));
%let nobs = %sysfunc(attrn(&dsid,nobs));
%let rc = %sysfunc(close(&dsid));
%if &nobs=0 %then %do;
   %put Source parameter value: &source not found in Source Table Directory spreadsheet;
%end;

/*** STEP 2 *** Values entered in 'dims' and 'mets' parameters are parsed into unique list of values */

/* code here parses macro text strings to derive unique words */
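
The parsing code elided here is not shown in the paper. As a rough sketch of what such a step might look like, the macro below reduces a blank-delimited list to its unique words. The macro name uniqwords is hypothetical, and folding words to upper case before comparing them is an assumption rather than something the paper states.

```sas
/* Hypothetical helper: return the unique words of a blank-delimited list */
%macro uniqwords(list);
   %local i word out;
   %let i = 1;
   %let word = %scan(&list, &i, %str( ));
   %do %while (%length(&word) > 0);
      %let word = %upcase(&word);
      /* pad both strings with blanks so only whole words can match */
      %if %index(%str( &out ), %str( &word )) = 0 %then %let out = &out &word;
      %let i = %eval(&i + 1);
      %let word = %scan(&list, &i, %str( ));
   %end;
   &out
%mend uniqwords;

/* Example call: duplicates differing only in case are collapsed */
%put %uniqwords(region product Region balance product);
```

Padding with blanks on both sides keeps a short name such as ‘bal’ from matching inside a longer one such as ‘balance’.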

/*** STEP 3 *** Read lookup table containing all summary variables for each source to build a list to compare to user-supplied values *******/

/* basic proc import code to read excel file goes here */

/*********** create macro array of all dimension and all metric variables *******/

/* code here creates a macro array from a data set through call symput statements */
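
The macro array code is also elided in the paper. A minimal sketch of the general technique follows, assuming a hypothetical master list data set work.masterlist with columns varname and vartype. It uses CALL SYMPUTX (available from SAS 9) in place of the CALL SYMPUT statements the paper mentions, because SYMPUTX trims blanks and can target the global symbol table directly.

```sas
/* Hypothetical master list of variables for one source */
data work.masterlist;
   length varname $32 vartype $3;
   input varname $ vartype $;
   datalines;
region dim
product dim
balance_avg met
records met
;
run;

/* Default the count so the loop below is safe if no rows qualify */
%let ndim = 0;

/* Load dimension names into DIM1, DIM2, ... and the count into NDIM */
data _null_;
   set work.masterlist end=last;
   where vartype = 'dim';
   n + 1;
   call symputx(cats('dim', n), varname, 'G');
   if last then call symputx('ndim', n, 'G');
run;

/* Walk the macro array; &&dim&i resolves to the i-th name */
%macro showdims;
   %local i;
   %do i = 1 %to &ndim;
      %put Dimension &i: &&dim&i;
   %end;
%mend showdims;
%showdims
```

A second array built the same way for the user-supplied names would allow the two lists to be compared element by element.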

/*** STEP 4 *** The user-specified list of dimension and metric variables is matched to the master list ****/

/* code here compares one macro array to another */
%else %do;
   %if &stopit=N and (&baddim ^=0 or &badmet ^=0) %then %do;
      %put Number of dimension variables submitted that do not match master list: &baddim;
      %do i = 1 %to &baddim;
         %put &&baddim&i;
      %end;
      %put Number of metric variables submitted that do not match master list: &badmet;
      %do i = 1 %to &badmet;
         %put &&badmet&i;
      %end;
   %end;

/*** STEP 5 *** A list is created from the source table components table of all source table variables needed to complete the query ****/

/* proc import goes here importing the summary variable components file Only components that were selected from the ‘dim’ and ‘met’ parameters are retained*/

/*** STEP 6 *** Read lookup table containing business rules for each source file ******/
PROC IMPORT OUT=logic
   datafile="&inlogic.SAS Information\AdHoc\Summary Variable Logic.xls"
   dbms=Excel replace;
   SHEET="&source";
   GETNAMES=YES;
RUN;

/* code here matches selected dimensions and metrics with lookup table Then selects additional needed summary variables based on related variables */

/*** STEP 7 *** Build list of files containing SQL code to concatenate ****/
%let flist=;
proc sql noprint;
   select logic into :flist separated by '" "'
   from logic
   where substr(logic,1,2) = '\\';
quit;

/*** STEP 8 *** Build SQL statements from logic table entries containing code and not pointers ***/
data _null_;
   set logic end=last;
   where substr(logic,1,2) ne '\\';

/* code here is complicated data step processing basically parsing text into the most appropriate file */

/*** STEP 9 *** Run SAS code setting up all necessary lookup tables before main proc sql is run */
%if %length(&beglogic)>0 %then %do;
   %include "&beglogic";
%end;

/*** STEP 10 ** Build complete proc SQL statements from contents of logic table and contents of temporary file containing blocks of code referenced by the logic table pointers ***/
%if %length(&flist)>0 %then %do;
   filename logiclu ("&flist") lrecl=2048;
%end;
filename tmpfile "%sysfunc(pathname(work))\adhocsql.sas" lrecl=600;
data _null_;
   length string $500 fn str1 str2 $200;
   file tmpfile;
   if _n_ = 1 then do;
      put "proc sql;" / "create table &outset as select";
      %if &account=Y %then %do;
         put "&acct" ",";
         %if &extras^= %then %do;
            put "&extras" ",";
         %end;
      %end;
   end;
   %if %length(&flist)>0 %then %do;
      infile logiclu filename=fn eof=last truncover;
      input @1 flag $3. @4 string $200.;
      flag = upcase(flag);
      if string ne '';
      retain commentflag 0 recno 1;
      if (index(flag,'/*') > 0 or index(string,'/*') > 0) then commentflag = 1;
      if commentflag = 1 then do;
         if index(string,'*/')>0 then commentflag = 0;
      end;
      else do;

         /* more complex data step processing to ensure the right text gets put in the right file */
      end; * processing non-comment lines of code;
      return;
      last:
   %end;
   %do i = 1 %to &logiccnt;
      string=symget("logic&i");
      put string;
   %end;
   %if &account=N %then %do;
      put "count(*) as records,";
   %end;
   put '"' "&yyyymm" '" as PerfYearMonth';
   put "from (select ";
   %if &account=Y or %length(&beglogic)>0 %then %do;
      put "&acct" ",";
   %end;
   put "&compon1";
   put "&compon2";
   put "&compon3";
   put "&compon4";
   put "from &dset &tranlist )as hist" ;
   %if %length(&beglogic)>0 %then %do;
      string=symget("leftjoin");
      put string;
   %end;
   %if &wear ^= %then %do;
      string=symget("wear");
      put "where " string;
   %end;
   %if &account=N %then %do;
      put "group by ";
      %do i = 1 %to &gooddim;
         put "&&gooddim&i" ", ";
      %end;
      put "PerfYearMonth";
   %end;
   put ";" / "quit;";
run;
/***** submit SQL statement to SAS server to create SAS data set **/
%include "%sysfunc(pathname(work))\adhocsql.sas" /source2;

/*** STEP 11 ** Save resulting data set to a Microsoft Access table if requested *****/

/* MSACCESS macro goes here */
%end; %*STOPIT parameter not set to Y;
%end; %*source value entered did not match source Table Definition value;
%end; %* required source tables did not exist;
%mend;
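
Because the listing above shows only selected sections, it cannot be run as printed. The following self-contained sketch illustrates the same general idea on a very small scale: rule text stored as data is written to a temporary file as a ‘proc SQL’ step, which is then submitted with %include. Everything here is hypothetical and for illustration only: the data sets work.hist and work.logic, the variable names, the output table work.adhoc_demo, and the use of a FILENAME TEMP fileref in place of the file in the WORK directory used by the AdHoc program.

```sas
/* Hypothetical history table (illustration only) */
data work.hist;
   input region $ field_sum field_cnt;
   datalines;
East 100 4
East 50 0
West . 2
;
run;

/* Hypothetical dictionary rows: rule text already in SQL form */
data work.logic;
   length varname $32 logic $200;
   varname = 'field_sum_tot';
   logic   = 'sum(field_sum)';
   output;
   varname = 'field_avg';
   logic   = 'case when sum(field_cnt) in (.,0) then . ' ||
             'when sum(field_sum) in (.,0) then 0 ' ||
             'else sum(field_sum)/sum(field_cnt) end';
   output;
run;

/* Write the query text to a temporary file, one rule per line */
filename tmpsql temp;

data _null_;
   set work.logic end=last;
   file tmpsql lrecl=600;
   length line $300;
   if _n_ = 1 then
      put 'proc sql;' / 'create table work.adhoc_demo as select region,';
   line = catx(' ', logic, 'as', varname);
   if not last then line = cats(line, ',');   /* comma between columns */
   put line;
   if last then put 'from work.hist' / 'group by region;' / 'quit;';
run;

/* Submit the generated query; SOURCE2 echoes it to the log */
%include tmpsql / source2;
filename tmpsql clear;
```

Changing a rule in this arrangement means editing a row in work.logic rather than the program, which is the maintenance benefit the paper argues for.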