NESUG 17 Hands-On Workshops

How to Develop a SAS® Data Mart in One Day (with code examples)

Samuel Berestizhevsky, YieldWise Canada Inc, Canada
Tanya Kolosova, YieldWise Canada Inc, Canada

ABSTRACT

Creation of data marts is usually driven by specific application and business needs. To create a useful data mart, you first need to answer the question: what are the information needs of the specific business application? The answer to this question is not permanent in time. Learning from the initially required information will inevitably lead to new requirements. In addition, changes in the business situation will raise new questions, which will lead to new information needs. Clearly, the time needed to incorporate new requirements into the data mart is critical and should be counted in days, even hours, rather than weeks or months. How can a data mart keep pace with these changes? The answer to this question lies in the data mart development technology [1]. Creating a data mart involves at least the three following steps:
• Design and implementation of the data mart model.
• Design and implementation of operations on data, such as ETL processes, aggregation, integrity and value verification, etc.
• Design and implementation of data authorization access.
In this paper we describe strategies and techniques intended to reduce data mart development time and simplify modifications. We show by example how to implement these three major steps of data mart development in SAS in only one day.

INTRODUCTION

In this paper we describe a technique that lets the designer specify a data mart visually, in terms of its structure and functionality, and permits the development of mission-critical tasks almost without programming. The main principles of this approach are:
• The data mart design is perceived as data and nothing but data. This means that the data mart design is defined in a set of specially structured tables and is stored, updated, and managed in the same way as ordinary data.
• Operations on data are defined in terms of what must be done, but not how to do it. These definitions are stored, updated, and managed as ordinary data.
• The data mart is managed from a single control point.
The table-driven environment is the most fundamental aspect of the described approach. The heart of this environment is the set of specially structured tables forming the data dictionary. The data dictionary contains a variety of information concerning application objects and operations, such as data structures, application activities, and so on. Further in the paper we lead you through examples of data mart development, and show you how to create a data mart, and how to update and verify data, in a time-saving, flexible and reliable manner.

HOW TO IMPLEMENT A DATA MART

In the table-driven environment, data mart design involves writing definitions of the data mart objects in the tables of the data dictionary [1]. A description of the data dictionary is out of the scope of this paper. However, we want to clarify that we do not mean the SAS data dictionary, but specially structured SAS data sets that contain comprehensive information about:
• Topology (physical locations) of data mart parts and ways to communicate between those parts
• Structure of data mart tables, including detailed information about columns (not only type, label, format, etc., but also information about primary and secondary keys, associated domains, and other constraints)
• Relationships between data mart tables in terms of relational integrity (foreign keys)
It is important to mention that this data dictionary can be changed or extended to keep additional information describing your data mart. A small sketch of the topology part follows.
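To make the first bullet concrete, here is a minimal sketch of a topology table and a data step that assigns the corresponding librefs. The TOPOLOGY data set name, the KERNEL library, and the paths are hypothetical illustrations, not part of the original design:

data kernel.topology ;
   length libref $ 8 path $ 200 ;
   input libref $ path $ ;
   datalines ;
daily /data/mart/daily
finance /data/mart/finance
;
run ;

/* generate one LIBNAME statement per row of the topology table */
data _null_ ;
   set kernel.topology ;
   call execute('libname ' || trim(libref) || ' "' || trim(path) || '" ;') ;
run ;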

How to import data from a relational data base

Let us assume that, as a first step in data mart creation, we want to create a copy of a relational data base in SAS data sets.


Every relational data base has its own data dictionary, which contains the most accurate information about the data base structure. This information enables us to create an exact copy of the data base completely automatically.

Oracle data dictionary

The following table shows selected columns from the Oracle table ALL_TAB_COLUMNS.

OWNER  TABLE_NAME     COLUMN_NAME  DATA_TYPE  DATA_LENGTH
…      …              …            …          …
USER   CUSTOMER_INFO  customer_id  VARCHAR2   16
USER   CUSTOMER_INFO  first_name   VARCHAR2   40
USER   CUSTOMER_INFO  second_name  VARCHAR2   40
USER   CUSTOMER_INFO  …            …          …
USER   CUSTOMER_INFO  audit_key    NUMBER     8
…      …              …            …          …

As you can see, this table describes the structure of the CUSTOMER_INFO table from the USER schema.
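Before the extraction macro in the next subsection can insert rows, the SOURCE data set it inserts into must exist. A minimal sketch that creates its structure; the variable names follow the tables shown in this paper, while the lengths and the KERNEL library are assumptions:

data kernel.source ;
   length schema $ 32 t_name $ 32 c_name $ 32 d_type $ 16 d_len 8
          ds_name $ 32 var_name $ 32 type $ 1 len 8 format $ 16 ;
   call missing(of _all_) ;   /* avoid uninitialized-variable notes */
   stop ;                     /* create the structure only, no observations */
run ;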

Extracting data from the Oracle data dictionary

The following SAS macro reads this information and creates a SAS data set containing an image of the Oracle data dictionary table:

%macro oracle_dd(libname) ;
   proc sql ;
      connect to oracle as mydb (user=USER orapw=PWD path='path') ;
      insert into &libname..source (schema, t_name, c_name, d_type, d_len)
         select OWNER, TABLE_NAME, COLUMN_NAME, DATA_TYPE, DATA_LENGTH
         from connection to mydb
            (select OWNER, TABLE_NAME, COLUMN_NAME, DATA_TYPE, DATA_LENGTH
             from USER.ALL_TAB_COLUMNS) ;
      disconnect from mydb ;
   quit ;
%mend ;
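A hypothetical invocation, assuming the dictionary data sets live in a library named KERNEL (the path is a placeholder):

libname kernel '/data/mart/kernel' ;
%oracle_dd(kernel) ;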

The only information coded in this program is the structure of the Oracle data dictionary table (which is quite permanent and does not change even across Oracle versions) and of the SOURCE data set, which we structure for the specific purpose of data import. As a result of this program, the following SAS data set will be created:

SOURCE data set

SCHEMA  T_NAME         C_NAME       D_TYPE    D_LEN  DS_NAME  VAR_NAME  TYPE  LEN
…       …              …            …         …      …        …         …     …
USER    CUSTOMER_INFO  customer_id  VARCHAR2  16     .        .         .     .
USER    CUSTOMER_INFO  first_name   VARCHAR2  40     .        .         .     .
USER    CUSTOMER_INFO  second_name  VARCHAR2  40     .        .         .     .
USER    CUSTOMER_INFO  …            …         …      .        .         .     .
USER    CUSTOMER_INFO  audit_key    NUMBER    8      .        .         .     .
…       …              …            …         …      …        …         …     …

Description of data to be loaded

Now we know what should be extracted. How can we describe what should be loaded? The following program fills the additional columns of the SOURCE data set:

data &libname..source (drop = t c) ;
   set &libname..source ;
   by t_name ;
   retain t c 0 ;
   if first.t_name then
      do ;
         t + 1 ;   /* next data set number */
         c = 0 ;   /* restart variable numbering within the data set */
      end ;
   c + 1 ;
   ds_name = "&prefixt" || left(t) ;
   var_name = "&prefixc" || left(c) ;
   if d_type = "DATE" then
      do ;
         type = "D" ;
         len = 8 ;
         format = "date9." ;
      end ;
   if d_type = "NUMBER" then
      do ;
         type = "N" ;
         len = 8 ;
         format = "&mis" ;
      end ;
   if d_type = "VARCHAR2" then
      do ;
         type = "C" ;
         len = d_len ;
         format = "&mis" ;
      end ;
run ;

The rules for translating Oracle data types into corresponding SAS data types are known and easy to define. Conversion of table names to data set names, and of field names to variable names, can be done in different ways. One possibility is to keep them as is. However, this approach has several disadvantages. One of them is that Oracle and SAS naming conventions may differ from version to version, and from platform to platform. Another is the consideration that your data mart may be fed from different sources, in which case tables with the same name may exist in different data bases. A systematic approach to data set and variable names solves this problem and creates unique data set and variable names.
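The translation rules themselves can also be kept as data rather than hard-coded in the data step above. This is not part of the original design, but a hedged sketch in the same table-driven spirit; the TYPE_MAP data set, the SOURCE_TYPED output name, and the convention that len = 0 means "take the length from D_LEN" are all assumptions:

data kernel.type_map ;
   length d_type $ 16 type $ 1 len 8 format $ 16 ;
   input d_type $ type $ len format $ ;
   datalines ;
DATE D 8 date9.
NUMBER N 8 .
VARCHAR2 C 0 .
;
run ;

proc sql ;
   create table kernel.source_typed as
      select s.schema, s.t_name, s.c_name, s.d_type, s.d_len,
             s.ds_name, s.var_name, m.type,
             case when m.len = 0 then s.d_len else m.len end as len,
             m.format
      from kernel.source as s
           left join kernel.type_map as m
           on s.d_type = m.d_type ;
quit ;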

As a result of the data step above, the SOURCE data set is populated as follows:

Populated SOURCE data set

SCHEMA  T_NAME         C_NAME       D_TYPE    D_LEN  DS_NAME  VAR_NAME  TYPE  LEN
…       …              …            …         …      …        …         …     …
USER    CUSTOMER_INFO  customer_id  VARCHAR2  16     T23      V1        C     16
USER    CUSTOMER_INFO  first_name   VARCHAR2  40     T23      V2        C     40
USER    TRANSACT_INFO  second_name  VARCHAR2  40     T24      V1        C     40
USER    TRANSACT_INFO  …            …         …      T24      V2        …     …
USER    TRANSACT_INFO  audit_key    NUMBER    8      T24      V3        N     8
…       …              …            …         …      …        …         …     …

ETL process

Now our SOURCE data set contains comprehensive information about all Oracle tables from the same schema, and full information about the target SAS data sets. However, we probably do not need to extract and load all tables at the same time. ETL processes for each piece of information can be easily and flexibly scheduled. Let us consider the following data set.

PROCESS data set

PROCESS  T_NAME              DS_NAME  LIB_NAME
…        …                   …        …
DAILY    DAILY_TRANSACTION   T315     daily
DAILY    DAILY_SUBSCRIPTION  T403     daily
MONTHLY  PAYMENT_RECEIPT     T315     finance
…        …                   …        …

Each process defined in the PROCESS table specifies a different portion of data that can be extracted and loaded with different frequency and on different days.

The following program implements the required ETL process. Let us follow each step of this program and understand what it does.

/* count – a macro variable containing the number of Oracle tables to be
           extracted and loaded into SAS data sets
   _t    – a series of macro variables containing the names of the Oracle tables
   _d    – a series of macro variables containing the names of the SAS data sets
   _l    – a series of macro variables containing the libraries where the SAS
           data sets should be located */

%_import_(libname = kernel, process = DAILY) ;
…
data _null_ ;
   retain i 1 ;
   set &libname..process (where = (process = "&process")) ;
   call symput("_t" || left(i), trim(left(t_name))) ;
   call symput("_d" || left(i), trim(left(ds_name))) ;
   call symput("_l" || left(i), trim(left(lib_name))) ;
   call symput("count", left(i)) ;
   i + 1 ;
run ;

As a result of this data step, macro variables will get the following values:

&count = 2 (there are two tables that should be transferred to SAS data sets)

&_t1 = DAILY_TRANSACTION &_d1 = T315 &_l1 = daily (the first Oracle table DAILY_TRANSACTION will be imported into daily.T315 SAS data set)

&_t2 = DAILY_SUBSCRIPTION &_d2 = T403 &_l2 = daily (the second Oracle table DAILY_SUBSCRIPTION will be imported into daily.T403 SAS data set)

/* scount – a macro variable containing the number of variables in the SAS data set
   _sc    – a series of macro variables containing the names of the variables
   _st    – a series of macro variables containing the types of the variables
   _sl    – a series of macro variables containing the lengths of the variables
   _sb    – a series of macro variables containing the names of the fields
            in the Oracle table
   _sf    – a series of macro variables containing the formats of the variables */

%do i = 1 %to &count ;
   %let scount = 0 ;
   /* j is the data step's own row counter, distinct from the macro loop index &i */
   data _null_ ;
      retain j 1 ;
      set &libname..source (where = (ds_name = "&&_d&i")) ;
      call symput("schema", trim(left(schema))) ;
      call symput("_sc" || left(j), trim(left(var_name))) ;
      call symput("_st" || left(j), trim(left(type))) ;
      call symput("_sl" || left(j), trim(left(len))) ;
      call symput("_sb" || left(j), trim(left(c_name))) ;
      call symput("_sf" || left(j), trim(left(format))) ;
      call symput("scount", left(j)) ;
      j + 1 ;
   run ;

As a result of this data step, macro variables will get the following values:

&i = 1 &_d1 = T315 &schema = USER &scount = 78 (the first data set T315 will have 78 variables)

&_sc1 = V1 &_st1 = C &_sl1 = 15 &_sb1 = CUSTOMER_ID &_sf1 = . (the first variable V1 is of type CHAR, length 15, with label CUSTOMER_ID created from the Oracle field name, and without an associated format)

&_sc2 = V2 &_st2 = D &_sl2 = 8 &_sb2 = DATE_OF_BIRTH &_sf2 = date9. (the second variable V2 is of type NUM representing a date, length 8, with label DATE_OF_BIRTH and format date9.)


The macro variables described above are used in the following data step:

…
data &&_l&i..&&_d&i ;
   length
      %do k = 1 %to &scount ;
         &&_sc&k
         %if %upcase(&&_st&k) = C %then %do ; $ %end ;
         &&_sl&k
      %end ;
   ;
   %do k = 1 %to &scount ;
      %if &&_sf&k ^= &mis %then %do ;
         format &&_sc&k &&_sf&k ;
      %end ;
   %end ;
   %do k = 1 %to &scount ;
      label &&_sc&k = "&&_sb&k" ;
   %end ;
run ;

After macro resolution, the program looks like:

data daily.T315 ;
   length V1 $ 15 V2 8 ... ;
   format V2 date9. ... ;
   label V1 = "CUSTOMER_ID" V2 = "DATE_OF_BIRTH" ... ;
run ;

Execution of this data step creates a SAS data set in the required location and with the required structure.

The following program uses the previously described macro variables to extract data from the Oracle data base and load it into the newly created SAS data set.

proc sql ;
   connect to oracle as mydb (user=USER orapw=PWD path='path') ;
   insert into &&_l&i..&&_d&i
      ( %do k = 1 %to &scount ;
           &&_sc&k
           %if &k < &scount %then %do ; , %end ;
        %end ; )
   select
      %do k = 1 %to &scount ;
         &&_sb&k
         %if &k < &scount %then %do ; , %end ;
      %end ;
   from connection to mydb
      (select
          %do k = 1 %to &scount ;
             &&_sb&k
             %if &k < &scount %then %do ; , %end ;
          %end ;
       from &schema..&&_t&i) ;
   disconnect from mydb ;
quit ;

After macro resolution, the program looks like:

proc sql ;
   connect to oracle as mydb (user=USER orapw=PWD path='path') ;
   insert into daily.T315 (V1, V2, ..., V78)
      select CUSTOMER_ID, DATE_OF_BIRTH, ...
      from connection to mydb
         (select CUSTOMER_ID, DATE_OF_BIRTH, ...
          from USER.DAILY_TRANSACTION) ;
   disconnect from mydb ;
quit ;

The described programming technique makes it possible to create a single program that works for all Oracle tables. Using this technique, the programmer works with the structures of tables, whether data dictionary tables in any relational data base or SAS data sets. While the contents of the data dictionary tables can be continuously changed (in accordance with changing information requirements), their structures remain unchanged. Once developed, the programs that process these tables can generate a wide variety of ETL processes. Extending the SAS data dictionary tables extends the features of the ETL processes.

For example, transformation functions can easily be added through an additional variable in the SOURCE data set. If it is required to extract only the date part from a date-and-time value, or to eliminate leading blanks, you can define this requirement in a new variable called FUNC.

SOURCE data set

SCHEMA  T_NAME         C_NAME       …  DS_NAME  VAR_NAME  …  FUNC
…       …              …            …  …        …         …  …
USER    CUSTOMER_INFO  customer_id  …  T23      V1        …
USER    CUSTOMER_INFO  first_name   …  T23      V2        …  trim(left(*))
USER    TRANSACT_INFO  second_name  …  T24      V1        …  trim(left(*))
USER    TRANSACT_INFO  …            …  T24      V2        …
USER    TRANSACT_INFO  audit_date   …  T24      V3        …  datepart(*)
…       …              …            …  …        …         …  …

It is easy to write code that substitutes * with the corresponding variable name, and all imaginable transformations can be specified in the SOURCE data set without changing the program. A sketch of such a substitution is shown below.
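A minimal sketch of the substitution step, assuming an empty FUNC means the column passes through unchanged. The data set name T23 is taken from the table above; the macro variable names _sel and selcount are hypothetical:

data _null_ ;
   length expr $ 200 ;
   set kernel.source (where = (ds_name = "T23")) end = last ;
   /* replace the * placeholder with the actual Oracle column name */
   if func ne ' ' then expr = tranwrd(func, '*', trim(c_name)) ;
   else expr = trim(c_name) ;
   call symput("_sel" || left(_n_), trim(expr)) ;
   if last then call symput("selcount", left(_n_)) ;
run ;

The generated &_sel1, &_sel2, … can then be spliced into the select list of the pass-through query shown earlier.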

HOW TO DEFINE OPERATIONS ON DATA

Operations on data form the dynamic part of a data mart. Using the same table-driven approach, you can design SAS data sets in which you define comprehensive data manipulation, including data restructuring, aggregation, and much more [1]. In this paper we constrain our examples to binary operations on data only.

How to perform inner join

In order to perform an inner join between two data sets, the foreign key between these data sets must be defined. Those who deal with data integration know that the correct definition of a join is a critical issue. Writing join operations directly in SAS programs inevitably leads to mistakes and makes verification of, or changes to, foreign keys a time-consuming task. We propose to define foreign keys between data sets in only one place, which can be easily verified, changed and documented – a specially structured SAS data set. One possibility is shown here.

LINK data set

LINK_ID  LH_COL  RH_COL
1        V1      T1
1        V5      T2

The LINK data set contains information about the correspondence between different columns, and LINK_ID makes it possible to define as many foreign keys as required.
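For completeness, a minimal sketch of how the LINK data set above could be created (a plain data step with datalines; the KERNEL library is an assumption, and any editing interface would serve equally well):

data kernel.link ;
   length link_id 8 lh_col rh_col $ 32 ;
   input link_id lh_col $ rh_col $ ;
   datalines ;
1 V1 T1
1 V5 T2
;
run ;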

Each time we need to perform a join between two data sets, we specify it in the OPERATN data set.

OPERATN data set

OPER_ID  LH_NAME  LH_LIB  RH_NAME  RH_LIB  LINK_ID  OPERATN
1        T315     daily   T403     daily   1        JOIN

As you can easily see, the operation with OPER_ID equal to 1 specifies a join between the daily.T315 and daily.T403 data sets on V1 = T1 and V5 = T2. It is clear that OPER_ID provides a way to specify a sequence of different operations. The following program performs an inner join according to the described definition.

/* link_id – a macro variable containing the id of the required link
   operatn – a macro variable containing the name of the binary operation
   lh_name – a macro variable containing the name of the left-hand SAS data set
   rh_name – a macro variable containing the name of the right-hand SAS data set
   lh_lib  – a macro variable containing the name of the library where the
             left-hand SAS data set is located
   rh_lib  – a macro variable containing the name of the library where the
             right-hand SAS data set is located */

data _null_ ;
   set &libname..operatn (where = (oper_id = &oper_id)) ;
   call symput("link_id", left(link_id)) ;
   call symput("operatn", trim(left(operatn))) ;
   call symput("lh_name", trim(left(lh_name))) ;
   call symput("rh_name", trim(left(rh_name))) ;
   call symput("lh_lib", trim(left(lh_lib))) ;
   call symput("rh_lib", trim(left(rh_lib))) ;
run ;

As a result of this data step, macro variables will get the following values:

&oper_id = 1 &link_id = 1 &operatn = JOIN &lh_name = T315 &rh_name = T403 &lh_lib = daily &rh_lib = daily (an inner join operation is required on the daily.T315 and daily.T403 data sets)

/* count – a macro variable containing the number of variables in the link
   _lh   – a series of macro variables containing the names of the link
           variables in the left-hand SAS data set
   _rh   – a series of macro variables containing the names of the link
           variables in the right-hand SAS data set */

data _null_ ;
   retain count 0 ;
   set &libname..link (where = (link_id = &link_id)) ;
   count + 1 ;
   call symput("_lh" || left(count), trim(left(lh_col))) ;
   call symput("_rh" || left(count), trim(left(rh_col))) ;
   call symput("count", left(count)) ;
run ;

As a result of this data step, macro variables will keep information about the relations between the two data sets:

&count = 2 (relations are established based on two variables)

&_lh1 = V1 &_rh1 = T1 (a value of the V1 variable is linked to a value of the T1 variable)

&_lh2 = V5 &_rh2 = T2 (a value of V5 variable is linked to a value of T2 variable)

The following code implements the operation itself, using the previously described macro variables:

%if &operatn = JOIN %then %join ;

%macro join ;
   proc sql ;
      create table _intern_ as
         select *
         from &lh_lib..&lh_name inner join &rh_lib..&rh_name
         on %do loop = 1 %to &count ;
               &lh_name..&&_lh&loop = &rh_name..&&_rh&loop and
            %end ;
            1 = 1 ;
   quit ;
%mend join ;

After macro resolution, the following PROC SQL will be executed:

proc sql ;
   create table _intern_ as
      select *
      from daily.T315 inner join daily.T403
      on T315.V1 = T403.T1 and
         T315.V5 = T403.T2 and
         1 = 1 ;
quit ;

This simple example demonstrates how we can ensure correct joins every time.
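As noted above, OPER_ID provides a way to run a sequence of operations. The driver below is not in the original paper, just a hedged sketch that walks the OPERATN data set and dispatches each row to the corresponding macro (%differ is defined in the next section); it assumes OPER_ID values are sequential 1..n:

%macro run_operations(libname) ;
   %local n i ;
   proc sql noprint ;
      select count(*) into :n trimmed from &libname..operatn ;
   quit ;
   %do i = 1 %to &n ;
      /* load the definition of operation &i (same step as shown above) */
      data _null_ ;
         set &libname..operatn (where = (oper_id = &i)) ;
         call symput("link_id", left(link_id)) ;
         call symput("operatn", trim(left(operatn))) ;
         call symput("lh_name", trim(left(lh_name))) ;
         call symput("rh_name", trim(left(rh_name))) ;
         call symput("lh_lib", trim(left(lh_lib))) ;
         call symput("rh_lib", trim(left(rh_lib))) ;
      run ;
      /* load the foreign key for this operation (same step as shown above) */
      data _null_ ;
         set &libname..link (where = (link_id = &link_id)) ;
         count + 1 ;
         call symput("_lh" || left(count), trim(left(lh_col))) ;
         call symput("_rh" || left(count), trim(left(rh_col))) ;
         call symput("count", left(count)) ;
      run ;
      /* each operation overwrites _intern_; a real driver would route results */
      %if &operatn = JOIN %then %join ;
      %else %if &operatn = DIFF %then %differ ;
   %end ;
%mend run_operations ;

With the definitions above, %run_operations(kernel) would execute the JOIN and DIFF operations in order.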

How to perform difference

As you know, the definition of foreign keys is required not only for inner joins, but for all binary operations between data sets, such as difference, union, etc. The same data structure supports these operations. To perform a difference between the same data sets according to the same foreign key, we simply need to add a second record to the OPERATN data set:

OPERATN data set

OPER_ID  LH_NAME  LH_LIB  RH_NAME  RH_LIB  LINK_ID  OPERATN
1        T315     daily   T403     daily   1        JOIN
2        T315     daily   T403     daily   1        DIFF

The following code performs the difference operation:

%if &operatn = DIFF %then %differ ;

%macro differ ;
   proc sql ;
      create table _temp_ as
         select
            %do loop = 1 %to &count ;
               %if &loop > 1 %then , ;
               &&_lh&loop
            %end ;
         from &lh_lib..&lh_name
         except
         select
            %do loop = 1 %to &count ;
               %if &loop > 1 %then , ;
               &&_rh&loop as &&_lh&loop
            %end ;
         from &rh_lib..&rh_name ;
      create table _intern_ as
         select *
         from _temp_ inner join &lh_lib..&lh_name
         on %do loop = 1 %to &count ;
               _temp_.&&_lh&loop = &lh_name..&&_lh&loop and
            %end ;
            1 = 1 ;
   quit ;
%mend differ ;

After macro resolution, the code looks like:

proc sql ;
   create table _temp_ as
      select V1, V5 from daily.T315
      except
      select T1 as V1, T2 as V5 from daily.T403 ;
   create table _intern_ as
      select *
      from _temp_ inner join daily.T315
      on _temp_.V1 = T315.V1 and
         _temp_.V5 = T315.V5 and
         1 = 1 ;
quit ;
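The same definitions also cover union: adding an OPERATN record with the value UNION and dispatching it to a %union macro is enough. The macro below is not in the original paper, just a hedged sketch that mirrors %differ with EXCEPT replaced by UNION:

%if &operatn = UNION %then %union ;

%macro union ;
   proc sql ;
      create table _intern_ as
         select
            %do loop = 1 %to &count ;
               %if &loop > 1 %then , ;
               &&_lh&loop
            %end ;
         from &lh_lib..&lh_name
         union
         select
            %do loop = 1 %to &count ;
               %if &loop > 1 %then , ;
               &&_rh&loop as &&_lh&loop
            %end ;
         from &rh_lib..&rh_name ;
   quit ;
%mend union ;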

DATA MART DEVELOPMENT PROCESS

The data mart operations model facilitates the data mart development process in a number of significant ways:
• it may not be necessary to develop an operations program (in the traditional sense of the term) at all
• the data mart can be developed quickly and easily, without programming in the conventional sense
• when it is necessary to write a conventional program, it is easier to write, requires less maintenance, and is easier to change when it does require maintenance than it would be in other development approaches
• the data mart development cycle can involve a great deal more prototyping than it used to:
− a first version can be built and shown to the intended users, who can suggest improvements for incorporation into the next version
− as a result, the final data mart provides exactly what its users require of it
• the overall development process is far less rigid than it used to be, and the data mart users can be far more involved in that process.

HOW TO DEFINE DATA AUTHORIZATION ACCESS AND HOW IT WORKS

Authorized access to data stored in a data mart can be defined using the same table-driven approach. We propose to consider access to data from two different points of view: authorized access and permitted operations on the one hand, and information needs on the other. For example, some information can be viewed, retrieved or updated only by users with special permission. However, information can also be viewed or retrieved for specific information needs: a user who is interested in information about sales may not need information about replenishment. Such information needs can be specified through a modes mechanism. In order to implement data authorization access, the programmer has to solve the following general problems:
1. Implement access to data tables and operations
2. Implement the mode mechanism.


The implementation can be done using the same technique, providing the same flexibility in data authorization access; a sketch of the first problem follows.
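A hedged sketch of the first problem only. The ACCESS data set, its variables, and the %check_access macro are assumptions in the spirit of the LINK and OPERATN data sets, not part of the original paper:

/* one row per (user, operation, mode) permission; the mode column
   would drive the information-needs mechanism in the same way */
data kernel.access ;
   length user_id $ 16 oper_id 8 mode $ 8 ;
   input user_id $ oper_id mode $ ;
   datalines ;
analyst1 1 SALES
analyst1 2 SALES
manager1 1 ALL
;
run ;

%macro check_access(user, oper_id) ;
   %local ok ;
   %let ok = 0 ;
   proc sql noprint ;
      select count(*) into :ok trimmed
      from kernel.access
      where user_id = "&user" and oper_id = &oper_id ;
   quit ;
   %if &ok = 0 %then
      %put ERROR: user &user is not authorized to run operation &oper_id ;
%mend check_access ;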

SUMMARY

This paper has described time-saving strategies and techniques for designing, developing and maintaining SAS data marts. This approach brings valuable benefits to the overall data mart development and maintenance process:
• The data dictionary is considered a single source of the data mart definition.
• It supports an iterative approach to data mart development.
• The data dictionary promotes consistent, standard definitions and reduces duplication of meta-data.
• Definitions stored in the data dictionary are shared among different parts of the data mart.
• The data dictionary enhances communication between designer, user and programmer by establishing a "common" language.
• The table-driven approach makes it possible to quickly determine the impact of requested modifications.
• Being a single source of the data mart definition, the data dictionary provides a documentation medium.

CONTACT INFORMATION

The authors may be reached at the following address:

Tanya Kolosova
YieldWise.Com Canada Inc
6A-49 The Donway West, Suite 918
Toronto, Ontario, Canada M3C 2E8
Phone: 416 841 0791
Email: [email protected]

REFERENCES

[1] Kolosova, Tanya and Berestizhevsky, Samuel, Table-Driven Strategies for Rapid SAS Applications Development, Cary, NC: SAS Institute Inc., 1995, 259 pp.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries.  indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.
