NESUG 17 Hands-On Workshops

How to Develop a SAS® Data Mart in One Day (with code examples)

Samuel Berestizhevsky, YieldWise Canada Inc, Canada
Tanya Kolosova, YieldWise Canada Inc, Canada

ABSTRACT

Creation of data marts is usually driven by specific application and business needs. To create a useful data mart, you first need to answer the question: what are the information needs of the specific business application? The answer to this question is not permanent in time. Learning from the initially required information will inevitably lead to new requirements. In addition, changes in the business situation will raise new questions, which will lead to new information needs. Clearly, the time needed to incorporate new requirements into the data mart is critical and should be counted in days, even hours, rather than weeks or months. How can a data mart keep pace with these changes? The answer to this question lies in the data mart development technology [1]. Creating a data mart involves at least the three following steps:
• Design and implementation of the data mart model.
• Design and implementation of operations on data, such as ETL processes, aggregation, integrity and value verification, etc.
• Design and implementation of data authorization access.
In this paper we describe strategies and techniques intended to reduce data mart development time and simplify modifications. We show by example how to implement these three major steps of data mart development in SAS in only one day.

INTRODUCTION

In this paper we describe a technique that lets the designer specify a data mart visually, in terms of its structure and functionality, and permits the development of mission-critical tasks almost without programming. The main principles of this approach are:
• The data mart design is perceived as data and nothing but data. This means that the data mart design is defined in a set of specially structured tables and is stored, updated, and managed in the same way as ordinary data.
• Operations on data are defined in terms of what must be done, but not how to do it. These definitions are stored, updated, and managed as ordinary data.
• The data mart is managed from a single control point.
The table-driven environment is the most fundamental aspect of the described approach. The heart of this environment is the set of specially structured tables forming the data dictionary. The data dictionary contains a variety of information concerning application objects and operations, such as data structures, application activities, and so on. Further in the paper we lead you through examples of data mart development, and show you how to create a data mart, and how to update and verify data, in a time-saving, flexible and reliable manner.

HOW TO IMPLEMENT A DATA MART

In the table-driven environment, data mart design involves writing definitions of the data mart objects in the tables of the data dictionary [1]. A description of the data dictionary is out of the scope of this paper. However, we want to clarify that we do not mean the SAS data dictionary, but specially structured SAS data sets that contain comprehensive information about:
• Topology (physical locations) of data mart parts and ways to communicate between those parts
• Structure of data mart tables, including detailed information about columns (not only type, label, format, etc., but also information about primary and secondary keys, associated domains, and other constraints)
• Relationships between data mart tables in terms of relational integrity (foreign keys)
It is important to mention that this data dictionary can be changed or extended to keep additional information describing your data mart. A small sketch of the topology part follows.
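To make the first bullet concrete, here is a minimal sketch of a topology table and a data step that assigns the corresponding librefs. The TOPOLOGY data set name, the KERNEL library, and the paths are hypothetical illustrations, not part of the original design:

data kernel.topology ;
   length libref $ 8 path $ 200 ;
   input libref $ path $ ;
   datalines ;
daily /data/mart/daily
finance /data/mart/finance
;
run ;

/* generate one LIBNAME statement per row of the topology table */
data _null_ ;
   set kernel.topology ;
   call execute('libname ' || trim(libref) || ' "' || trim(path) || '" ;') ;
run ;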

How to import data from a relational data base

Let us assume that, as a first step in data mart creation, we want to create a copy of a relational data base in SAS data sets.


Every relational data base has its own data dictionary, which contains the most accurate information about the data base structure. This information enables us to create an exact copy of the data base completely automatically.

Oracle data dictionary

The following table shows selected columns from the Oracle table ALL_TAB_COLUMNS.

OWNER  TABLE_NAME     COLUMN_NAME  DATA_TYPE  DATA_LENGTH
…      …              …            …          …
USER   CUSTOMER_INFO  customer_id  VARCHAR2   16
USER   CUSTOMER_INFO  first_name   VARCHAR2   40
USER   CUSTOMER_INFO  second_name  VARCHAR2   40
USER   CUSTOMER_INFO  …            …          …
USER   CUSTOMER_INFO  audit_key    NUMBER     8
…      …              …            …          …

As you can see, this table describes the structure of the CUSTOMER_INFO table from the USER schema.
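Before the extraction macro in the next subsection can insert rows, the SOURCE data set it inserts into must exist. A minimal sketch that creates its structure; the variable names follow the tables shown in this paper, while the lengths and the KERNEL library are assumptions:

data kernel.source ;
   length schema $ 32 t_name $ 32 c_name $ 32 d_type $ 16 d_len 8
          ds_name $ 32 var_name $ 32 type $ 1 len 8 format $ 16 ;
   call missing(of _all_) ;   /* avoid uninitialized-variable notes */
   stop ;                     /* create the structure only, no observations */
run ;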

Extracting data from the Oracle data dictionary

The following SAS macro reads this information and creates a SAS data set containing an image of the Oracle data dictionary table:

%macro oracle_dd(libname) ;
   proc sql ;
      connect to oracle as mydb (user=USER orapw=PWD path='path') ;
      insert into &libname..source (schema, t_name, c_name, d_type, d_len)
         select OWNER, TABLE_NAME, COLUMN_NAME, DATA_TYPE, DATA_LENGTH
         from connection to mydb
            (select OWNER, TABLE_NAME, COLUMN_NAME, DATA_TYPE, DATA_LENGTH
             from USER.ALL_TAB_COLUMNS) ;
      disconnect from mydb ;
   quit ;
%mend ;
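A hypothetical invocation, assuming the dictionary data sets live in a library named KERNEL (the path is a placeholder):

libname kernel '/data/mart/kernel' ;
%oracle_dd(kernel) ;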

The only information coded in this program is the structure of the Oracle data dictionary table (which is quite permanent and does not change even across Oracle versions) and of the SOURCE data set, which we structure for the specific purpose of data import. As a result of this program, the following SAS data set will be created:

SOURCE data set

SCHEMA  T_NAME         C_NAME       D_TYPE    D_LEN  DS_NAME  VAR_NAME  TYPE  LEN
…       …              …            …         …      …        …         …     …
USER    CUSTOMER_INFO  customer_id  VARCHAR2  16     .        .         .     .
USER    CUSTOMER_INFO  first_name   VARCHAR2  40     .        .         .     .
USER    CUSTOMER_INFO  second_name  VARCHAR2  40     .        .         .     .
USER    CUSTOMER_INFO  …            …         …      .        .         .     .
USER    CUSTOMER_INFO  audit_key    NUMBER    8      .        .         .     .
…       …              …            …         …      …        …         …     …

Description of data to be loaded

Now we know what should be extracted. How can we describe what should be loaded? The following program fills the additional columns of the SOURCE data set:

data &libname..source (drop = t c) ;
   set &libname..source ;
   by t_name ;
   retain t c 0 ;
   if first.t_name then
      do ;
         t + 1 ;   /* next data set number */
         c = 0 ;   /* restart variable numbering within the data set */
      end ;
   c + 1 ;
   ds_name = "&prefixt" || left(t) ;
   var_name = "&prefixc" || left(c) ;
   if d_type = "DATE" then
      do ;
         type = "D" ;
         len = 8 ;
         format = "date9." ;
      end ;
   if d_type = "NUMBER" then
      do ;
         type = "N" ;
         len = 8 ;
         format = "&mis" ;
      end ;
   if d_type = "VARCHAR2" then
      do ;
         type = "C" ;
         len = d_len ;
         format = "&mis" ;
      end ;
run ;

The rules for translating Oracle data types into corresponding SAS data types are known and easy to define. Conversion of table names to data set names, and of field names to variable names, can be done in different ways. One possibility is to keep them as is. However, this approach has several disadvantages. One of them is that Oracle and SAS naming conventions may differ from version to version, and from platform to platform. Another is the consideration that your data mart may be fed from different sources, in which case tables with the same name may exist in different data bases. A systematic approach to data set and variable names solves this problem and creates unique data set and variable names.
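The translation rules themselves can also be kept as data rather than hard-coded in the data step above. This is not part of the original design, but a hedged sketch in the same table-driven spirit; the TYPE_MAP data set, the SOURCE_TYPED output name, and the convention that len = 0 means "take the length from D_LEN" are all assumptions:

data kernel.type_map ;
   length d_type $ 16 type $ 1 len 8 format $ 16 ;
   input d_type $ type $ len format $ ;
   datalines ;
DATE D 8 date9.
NUMBER N 8 .
VARCHAR2 C 0 .
;
run ;

proc sql ;
   create table kernel.source_typed as
      select s.schema, s.t_name, s.c_name, s.d_type, s.d_len,
             s.ds_name, s.var_name, m.type,
             case when m.len = 0 then s.d_len else m.len end as len,
             m.format
      from kernel.source as s
           left join kernel.type_map as m
           on s.d_type = m.d_type ;
quit ;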

As a result of the data step above, the SOURCE data set is populated as follows:

Populated SOURCE data set

SCHEMA  T_NAME         C_NAME       D_TYPE    D_LEN  DS_NAME  VAR_NAME  TYPE  LEN
…       …              …            …         …      …        …         …     …
USER    CUSTOMER_INFO  customer_id  VARCHAR2  16     T23      V1        C     16
USER    CUSTOMER_INFO  first_name   VARCHAR2  40     T23      V2        C     40
USER    TRANSACT_INFO  second_name  VARCHAR2  40     T24      V1        C     40
USER    TRANSACT_INFO  …            …         …      T24      V2        …     …
USER    TRANSACT_INFO  audit_key    NUMBER    8      T24      V3        N     8
…       …              …            …         …      …        …         …     …

ETL process

Now our SOURCE data set contains comprehensive information about all Oracle tables from the same schema, and full information about the target SAS data sets. However, we probably do not need to extract and load all tables at the same time. ETL processes for each piece of information can be easily and flexibly scheduled. Let us consider the following data set.

PROCESS data set

PROCESS  T_NAME              DS_NAME  LIB_NAME
…        …                   …        …
DAILY    DAILY_TRANSACTION   T315     daily
DAILY    DAILY_SUBSCRIPTION  T403     daily
MONTHLY  PAYMENT_RECEIPT     T315     finance
…        …                   …        …

Each process defined in the PROCESS table specifies a different portion of data that can be extracted and loaded with different frequency and on different days.

The following program implements the required ETL process. Let us follow each step of this program and understand what it does.

/* count – a macro variable containing the number of Oracle tables to be
           extracted and loaded into SAS data sets
   _t    – a series of macro variables containing the names of the Oracle tables
   _d    – a series of macro variables containing the names of the SAS data sets
   _l    – a series of macro variables containing the libraries where the SAS
           data sets should be located */

%_import_(libname = kernel, process = DAILY) ;
…
data _null_ ;
   retain i 1 ;
   set &libname..process (where = (process = "&process")) ;
   call symput("_t" || left(i), trim(left(t_name))) ;
   call symput("_d" || left(i), trim(left(ds_name))) ;
   call symput("_l" || left(i), trim(left(lib_name))) ;
   call symput("count", left(i)) ;
   i + 1 ;
run ;

As a result of this data step, macro variables will get the following values:

&count = 2 (there are two tables that should be transferred to SAS data sets)

&_t1 = DAILY_TRANSACTION &_d1 = T315 &_l1 = daily (the first Oracle table DAILY_TRANSACTION will be imported into daily.T315 SAS data set)

&_t2 = DAILY_SUBSCRIPTION &_d2 = T403 &_l2 = daily (the second Oracle table DAILY_SUBSCRIPTION will be imported into daily.T403 SAS data set)

/* scount – a macro variable containing the number of variables in the SAS data set
   _sc    – a series of macro variables containing the names of the variables
   _st    – a series of macro variables containing the types of the variables
   _sl    – a series of macro variables containing the lengths of the variables
   _sb    – a series of macro variables containing the names of the fields
            in the Oracle table
   _sf    – a series of macro variables containing the formats of the variables */

%do i = 1 %to &count ;
   %let scount = 0 ;
   /* j is the data step's own row counter, distinct from the macro loop index &i */
   data _null_ ;
      retain j 1 ;
      set &libname..source (where = (ds_name = "&&_d&i")) ;
      call symput("schema", trim(left(schema))) ;
      call symput("_sc" || left(j), trim(left(var_name))) ;
      call symput("_st" || left(j), trim(left(type))) ;
      call symput("_sl" || left(j), trim(left(len))) ;
      call symput("_sb" || left(j), trim(left(c_name))) ;
      call symput("_sf" || left(j), trim(left(format))) ;
      call symput("scount", left(j)) ;
      j + 1 ;
   run ;

As a result of this data step, macro variables will get the following values:

&i = 1 &_d1 = T315 &schema = USER &scount = 78 (the first data set T315 will have 78 variables)

&_sc1 = V1 &_st1 = C &_sl1 = 15 &_sb1 = CUSTOMER_ID &_sf1 = . (the first variable V1 is of type CHAR, length 15, with label CUSTOMER_ID created from the Oracle field name, and without an associated format)

&_sc2 = V2 &_st2 = D &_sl2 = 8 &_sb2 = DATE_OF_BIRTH &_sf2 = date9. (the second variable V2 is of type NUM representing a date, length 8, with label DATE_OF_BIRTH and format date9.)


The macro variables described above are used in the following data step:

…
data &&_l&i..&&_d&i ;
   length
      %do k = 1 %to &scount ;
         &&_sc&k
         %if %upcase(&&_st&k) = C %then %do ; $ %end ;
         &&_sl&k
      %end ;
   ;
   %do k = 1 %to &scount ;
      %if &&_sf&k ^= &mis %then %do ;
         format &&_sc&k &&_sf&k ;
      %end ;
   %end ;
   %do k = 1 %to &scount ;
      label &&_sc&k = "&&_sb&k" ;
   %end ;
run ;

After macro resolution, the program looks like:

data daily.T315 ;
   length V1 $ 15 V2 8 ... ;
   format V2 date9. ... ;
   label V1 = "CUSTOMER_ID" V2 = "DATE_OF_BIRTH" ... ;
run ;

Execution of this data step creates a SAS data set in the required location and with the required structure.

The following program uses the previously described macro variables to extract data from the Oracle data base and load it into the newly created SAS data set.

proc sql ;
   connect to oracle as mydb (user=USER orapw=PWD path='path') ;
   insert into &&_l&i..&&_d&i
      ( %do k = 1 %to &scount ;
           &&_sc&k
           %if &k < &scount %then %do ; , %end ;
        %end ; )
   select
      %do k = 1 %to &scount ;
         &&_sb&k
         %if &k < &scount %then %do ; , %end ;
      %end ;
   from connection to mydb
      (select
          %do k = 1 %to &scount ;
             &&_sb&k
             %if &k < &scount %then %do ; , %end ;
          %end ;
       from &schema..&&_t&i) ;
   disconnect from mydb ;
quit ;

After macro resolution, the program looks like:

proc sql ;
   connect to oracle as mydb (user=USER orapw=PWD path='path') ;
   insert into daily.T315 (V1, V2, ..., V78)
      select CUSTOMER_ID, DATE_OF_BIRTH, ...
      from connection to mydb
         (select CUSTOMER_ID, DATE_OF_BIRTH, ...
          from USER.DAILY_TRANSACTION) ;
   disconnect from mydb ;
quit ;

The described programming technique makes it possible to create a single program that works for all Oracle tables. Using this technique, the programmer works with the structures of tables, whether data dictionary tables in any relational data base or SAS data sets. While the contents of the data dictionary tables can be continuously changed (in accordance with changing information requirements), their structures remain unchanged. Once developed, the programs that process these tables can generate a wide variety of ETL processes. Extending the SAS data dictionary tables extends the features of the ETL processes.

For example, transformation functions can easily be added through an additional variable in the SOURCE data set. If it is required to extract only the date part from a date-and-time value, or to eliminate leading blanks, you can define this requirement in a new variable called FUNC.

SOURCE data set

SCHEMA  T_NAME         C_NAME       …  DS_NAME  VAR_NAME  …  FUNC
…       …              …            …  …        …         …  …
USER    CUSTOMER_INFO  customer_id  …  T23      V1        …
USER    CUSTOMER_INFO  first_name   …  T23      V2        …  trim(left(*))
USER    TRANSACT_INFO  second_name  …  T24      V1        …  trim(left(*))
USER    TRANSACT_INFO  …            …  T24      V2        …
USER    TRANSACT_INFO  audit_date   …  T24      V3        …  datepart(*)
…       …              …            …  …        …         …  …

It is easy to write code that substitutes * with the corresponding variable name, and all imaginable transformations can be specified in the SOURCE data set without changing the program. A sketch of such a substitution is shown below.
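A minimal sketch of the substitution step, assuming an empty FUNC means the column passes through unchanged. The data set name T23 is taken from the table above; the macro variable names _sel and selcount are hypothetical:

data _null_ ;
   length expr $ 200 ;
   set kernel.source (where = (ds_name = "T23")) end = last ;
   /* replace the * placeholder with the actual Oracle column name */
   if func ne ' ' then expr = tranwrd(func, '*', trim(c_name)) ;
   else expr = trim(c_name) ;
   call symput("_sel" || left(_n_), trim(expr)) ;
   if last then call symput("selcount", left(_n_)) ;
run ;

The generated &_sel1, &_sel2, … can then be spliced into the select list of the pass-through query shown earlier.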

HOW TO DEFINE OPERATIONS ON DATA

Operations on data form the dynamic part of a data mart. Using the same table-driven approach, you can design SAS data sets in which you define comprehensive data manipulation, including data restructuring, aggregation, and much more [1]. In this paper we constrain our examples to binary operations on data only.

How to perform inner join

In order to perform an inner join between two data sets, the foreign key between these data sets must be defined. Those who deal with data integration know that the correct definition of a join is a critical issue. Writing join operations directly in SAS programs inevitably leads to mistakes and makes verification of, or changes to, foreign keys a time-consuming task. We propose to define foreign keys between data sets in only one place, which can be easily verified, changed and documented – a specially structured SAS data set. One possibility is shown here.

LINK data set

LINK_ID  LH_COL  RH_COL
1        V1      T1
1        V5      T2

The LINK data set contains information about the correspondence between different columns, and LINK_ID makes it possible to define as many foreign keys as required.
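For completeness, a minimal sketch of how the LINK data set above could be created (a plain data step with datalines; the KERNEL library is an assumption, and any editing interface would serve equally well):

data kernel.link ;
   length link_id 8 lh_col rh_col $ 32 ;
   input link_id lh_col $ rh_col $ ;
   datalines ;
1 V1 T1
1 V5 T2
;
run ;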

Each time we need to perform a join between two data sets, we specify it in the OPERATN data set.

OPERATN data set

OPER_ID  LH_NAME  LH_LIB  RH_NAME  RH_LIB  LINK_ID  OPERATN
1        T315     daily   T403     daily   1        JOIN

As you can easily see, the operation with OPER_ID equal to 1 specifies a join between the daily.T315 and daily.T403 data sets on V1 = T1 and V5 = T2. It is clear that OPER_ID provides a way to specify a sequence of different operations. The following program performs an inner join according to the described definition.

/* link_id – a macro variable containing the id of the required link
   operatn – a macro variable containing the name of the binary operation
   lh_name – a macro variable containing the name of the left-hand SAS data set
   rh_name – a macro variable containing the name of the right-hand SAS data set
   lh_lib  – a macro variable containing the name of the library where the
             left-hand SAS data set is located
   rh_lib  – a macro variable containing the name of the library where the
             right-hand SAS data set is located */

data _null_ ;
   set &libname..operatn (where = (oper_id = &oper_id)) ;
   call symput("link_id", left(link_id)) ;
   call symput("operatn", trim(left(operatn))) ;
   call symput("lh_name", trim(left(lh_name))) ;
   call symput("rh_name", trim(left(rh_name))) ;
   call symput("lh_lib", trim(left(lh_lib))) ;
   call symput("rh_lib", trim(left(rh_lib))) ;
run ;

As a result of this data step, macro variables will get the following values:

&oper_id = 1 &link_id = 1 &operatn = JOIN &lh_name = T315 &rh_name = T403 &lh_lib = daily &rh_lib = daily (an inner join operation is required on the daily.T315 and daily.T403 data sets)

/* count – a macro variable containing the number of variables in the link
   _lh   – a series of macro variables containing the names of the link
           variables in the left-hand SAS data set
   _rh   – a series of macro variables containing the names of the link
           variables in the right-hand SAS data set */

data _null_ ;
   retain count 0 ;
   set &libname..link (where = (link_id = &link_id)) ;
   count + 1 ;
   call symput("_lh" || left(count), trim(left(lh_col))) ;
   call symput("_rh" || left(count), trim(left(rh_col))) ;
   call symput("count", left(count)) ;
run ;

As a result of this data step, macro variables will keep information about the relations between the two data sets:

&count = 2 (relations are established based on two variables)

&_lh1 = V1 &_rh1 = T1 (a value of the V1 variable is linked to a value of the T1 variable)

&_lh2 = V5 &_rh2 = T2 (a value of V5 variable is linked to a value of T2 variable)

The following code implements the operation itself, using the previously described macro variables:

%if &operatn = JOIN %then %join ;

%macro join ;
   proc sql ;
      create table _intern_ as
         select *
         from &lh_lib..&lh_name inner join &rh_lib..&rh_name
         on %do loop = 1 %to &count ;
               &lh_name..&&_lh&loop = &rh_name..&&_rh&loop and
            %end ;
            1 = 1 ;
   quit ;
%mend join ;

After macro resolution, the following PROC SQL will be executed:

proc sql ;
   create table _intern_ as
      select *
      from daily.T315 inner join daily.T403
      on T315.V1 = T403.T1 and
         T315.V5 = T403.T2 and
         1 = 1 ;
quit ;

This simple example demonstrates how we can ensure correct joins every time.
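As noted above, OPER_ID provides a way to run a sequence of operations. The driver below is not in the original paper, just a hedged sketch that walks the OPERATN data set and dispatches each row to the corresponding macro (%differ is defined in the next section); it assumes OPER_ID values are sequential 1..n:

%macro run_operations(libname) ;
   %local n i ;
   proc sql noprint ;
      select count(*) into :n trimmed from &libname..operatn ;
   quit ;
   %do i = 1 %to &n ;
      /* load the definition of operation &i (same step as shown above) */
      data _null_ ;
         set &libname..operatn (where = (oper_id = &i)) ;
         call symput("link_id", left(link_id)) ;
         call symput("operatn", trim(left(operatn))) ;
         call symput("lh_name", trim(left(lh_name))) ;
         call symput("rh_name", trim(left(rh_name))) ;
         call symput("lh_lib", trim(left(lh_lib))) ;
         call symput("rh_lib", trim(left(rh_lib))) ;
      run ;
      /* load the foreign key for this operation (same step as shown above) */
      data _null_ ;
         set &libname..link (where = (link_id = &link_id)) ;
         count + 1 ;
         call symput("_lh" || left(count), trim(left(lh_col))) ;
         call symput("_rh" || left(count), trim(left(rh_col))) ;
         call symput("count", left(count)) ;
      run ;
      /* each operation overwrites _intern_; a real driver would route results */
      %if &operatn = JOIN %then %join ;
      %else %if &operatn = DIFF %then %differ ;
   %end ;
%mend run_operations ;

With the definitions above, %run_operations(kernel) would execute the JOIN and DIFF operations in order.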

How to perform difference

As you know, the definition of foreign keys is required not only for inner joins, but for all binary operations between data sets, such as difference, union, etc. The same data structure supports these operations. To perform a difference between the same data sets according to the same foreign key, we simply need to add a second record to the OPERATN data set:

OPERATN data set

OPER_ID  LH_NAME  LH_LIB  RH_NAME  RH_LIB  LINK_ID  OPERATN
1        T315     daily   T403     daily   1        JOIN
2        T315     daily   T403     daily   1        DIFF

The following code performs the difference operation:

%if &operatn = DIFF %then %differ ;

%macro differ ;
   proc sql ;
      create table _temp_ as
         select
            %do loop = 1 %to &count ;
               %if &loop > 1 %then , ;
               &&_lh&loop
            %end ;
         from &lh_lib..&lh_name
         except
         select
            %do loop = 1 %to &count ;
               %if &loop > 1 %then , ;
               &&_rh&loop as &&_lh&loop
            %end ;
         from &rh_lib..&rh_name ;
      create table _intern_ as
         select *
         from _temp_ inner join &lh_lib..&lh_name
         on %do loop = 1 %to &count ;
               _temp_.&&_lh&loop = &lh_name..&&_lh&loop and
            %end ;
            1 = 1 ;
   quit ;
%mend differ ;

After macro resolution, the code looks like:

proc sql ;
   create table _temp_ as
      select V1, V5 from daily.T315
      except
      select T1 as V1, T2 as V5 from daily.T403 ;
   create table _intern_ as
      select *
      from _temp_ inner join daily.T315
      on _temp_.V1 = T315.V1 and
         _temp_.V5 = T315.V5 and
         1 = 1 ;
quit ;
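The same definitions also cover union: adding an OPERATN record with the value UNION and dispatching it to a %union macro is enough. The macro below is not in the original paper, just a hedged sketch that mirrors %differ with EXCEPT replaced by UNION:

%if &operatn = UNION %then %union ;

%macro union ;
   proc sql ;
      create table _intern_ as
         select
            %do loop = 1 %to &count ;
               %if &loop > 1 %then , ;
               &&_lh&loop
            %end ;
         from &lh_lib..&lh_name
         union
         select
            %do loop = 1 %to &count ;
               %if &loop > 1 %then , ;
               &&_rh&loop as &&_lh&loop
            %end ;
         from &rh_lib..&rh_name ;
   quit ;
%mend union ;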

DATA MART DEVELOPMENT PROCESS

The data mart operations model facilitates the data mart development process in a number of significant ways:
• it may not be necessary to develop an operations program (in the traditional sense of the term) at all
• the data mart can be developed quickly and easily, without programming in the conventional sense
• when it is necessary to write a conventional program, it is easier to write, requires less maintenance, and is easier to change when it does require maintenance than it would be in other development approaches
• the data mart development cycle can involve a great deal more prototyping than it used to:
− a first version can be built and shown to the intended users, who can suggest improvements for incorporation into the next version
− as a result, the final data mart provides exactly what its users require of it
• the overall development process is far less rigid than it used to be, and the data mart users can be far more involved in that process.

HOW TO DEFINE DATA AUTHORIZATION ACCESS AND HOW IT WORKS

Authorized access to data stored in a data mart can be defined using the same table-driven approach. We propose to consider access to data from two different points of view: authorized access and permitted operations on the one hand, and information needs on the other. For example, some information can be viewed, retrieved or updated only by users with special permission. However, information can also be viewed or retrieved for specific information needs: a user who is interested in information about sales may not need information about replenishment. Such information needs can be specified through a modes mechanism. In order to implement data authorization access, the programmer has to solve the following general problems:
1. Implement access to data tables and operations
2. Implement the mode mechanism.


The implementation can be done using the same technique, providing the same flexibility in data authorization access; a sketch of the first problem follows.
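A hedged sketch of the first problem only. The ACCESS data set, its variables, and the %check_access macro are assumptions in the spirit of the LINK and OPERATN data sets, not part of the original paper:

/* one row per (user, operation, mode) permission; the mode column
   would drive the information-needs mechanism in the same way */
data kernel.access ;
   length user_id $ 16 oper_id 8 mode $ 8 ;
   input user_id $ oper_id mode $ ;
   datalines ;
analyst1 1 SALES
analyst1 2 SALES
manager1 1 ALL
;
run ;

%macro check_access(user, oper_id) ;
   %local ok ;
   %let ok = 0 ;
   proc sql noprint ;
      select count(*) into :ok trimmed
      from kernel.access
      where user_id = "&user" and oper_id = &oper_id ;
   quit ;
   %if &ok = 0 %then
      %put ERROR: user &user is not authorized to run operation &oper_id ;
%mend check_access ;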

SUMMARY

This paper has described time-saving strategies and techniques for designing, developing and maintaining SAS data marts. This approach brings valuable benefits to the overall data mart development and maintenance process:
• The data dictionary is considered a single source of the data mart definition.
• It supports an iterative approach to data mart development.
• The data dictionary promotes consistent, standard definitions and reduces duplication of meta-data.
• Definitions stored in the data dictionary are shared among different parts of the data mart.
• The data dictionary enhances communication between designer, user and programmer by establishing a "common" language.
• The table-driven approach makes it possible to quickly determine the impact of requested modifications.
• Being a single source of the data mart definition, the data dictionary provides a documentation medium.

CONTACT INFORMATION

The authors may be reached at the following address:

Tanya Kolosova
YieldWise.Com Canada Inc
6A-49 The Donway West, Suite 918
Toronto, Ontario, Canada M3C 2E8
Phone: 416 841 0791
Email: [email protected]

REFERENCES

[1] Kolosova, Tanya and Berestizhevsky, Samuel, Table-Driven Strategies for Rapid SAS Applications Development, Cary, NC: SAS Institute Inc., 1995, 259 pp.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries.  indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.
