Building Star Schema with SAS® Software

Building Star Schema With SAS® Software An Introduction to a Data Warehouse Data Structure Mark Shephard [email protected] June 20 - 23 2000 SeUGI 18 DUBLIN 1 Building Star Schema... ! Basics ! Review of query mechanisms ! Performance ! A look at metadata ! Exploiting SAS/AF® classes ! Abstractions from the basic model ! More metadata, another class... June 20 - 23 2000 SeUGI 18 DUBLIN 2 Star Schema Basics ! Few, additive facts ! Facts described by Dimensions " unique business key on each row " arbitrary keys - dates an exception " unknown data has a valid key " appropriate key lengths ! Facts selected by constraining dimensions June 20 - 23 2000 SeUGI 18 DUBLIN 3 2 1 2 3 4 5 6 StarKV1V2 V3 VSchema4 V5 V6 Basics K V V V V V V 1 1 2 2 3 KK’K2 K3 K4 K5 $ 3 4 4 8 7 12 4 8 45 99 5 5 11 3 75 71 28 52 0 13 21 67 10 25 47 12 14 15 34 95 11 12 67 52 3 12 85 1086 K’ V1V2 V3 V4 V5 V6 K3 V1V2 V3 V4 V5 V6 1 1 2 2 3 3 Fact Table 4 4 5 5 Dimension Table June 20 - 23 2000 SeUGI 18 DUBLIN 4 Query Mechanisms ! Two query phases: " constrain dimension keys to select particular rows from the fact table; " use fact row foreign keys to recover additional dimensional information ! SAS offers both SQL and Datastep: " does it matter which we choose? June 20 - 23 2000 SeUGI 18 DUBLIN 5 Sample SQL proc sql; create table results as select * from work.fact_tab where keyA = (select keyA from work.dimA where varA = 'keyA= 3') and keyB = (select keyB from work.dimB where varB = 'keyB= 9') and keyC = (select keyC from work.dimC where varC = 'keyC= 22') ; quit; June 20 - 23 2000 SeUGI 18 DUBLIN 6 Datastep Segment data results ; _iorc_ = 0; set dimA (where = (varA = 'keyA= 3')); do while (_iorc_ = 0); set fact_tab key = keyA; if _iorc_ = 0 then do; do while (_iorc_ = 0); set dimB key = keyB /unique; if _iorc_ = 0 and varB = 'keyB= 9' then do; do while (_iorc_ = 0); set dimC key = keyC /unique; if _iorc_ = 0 and varC = 'keyC= 22' then do; June 20 - 23 2000 SeUGI 18 DUBLIN 7 Query Performance ! SQL 5.996.206.256.206.09 secs ! Datastep#1 2.41 2.37 2.41 2.41 2.41 secs ! Datastep#2 1.20 1.26 1.32 1.32 1.26 secs ! 66,000 row fact table; Pentium 100; Win95 ! Additional dimension data can be retrieved concurrently ! Datasteps are faster and tunable... June 20 - 23 2000 SeUGI 18 DUBLIN 8 Improving Performance ! Fix the order of dimension processing " choose the dimension that will return the least rows from the fact table first " requires the processing of dimensions before the fact table ! Assumes dimension value distribution is the same as foreign key distribution in the fact table... June 20 - 23 2000 SeUGI 18 DUBLIN 9 Realizing data with classes ! LMC: Logical Metadata Class " providing the user’s view on the Schema ! QMC: Query Metadata Class " encapsulating the user’s query ! QEC: Query Engine Class " generating an instance of the user’s query " performing query optimization ! SAS/Warehouse Administrator™ June 20 - 23 2000 SeUGI 18 DUBLIN 10 Realizing data with classes Query INTERFACE QMC Metadata LMC Data Facts; set Facts key = kvar QEC /unique; Logical Metadata June 20 - 23 2000 SeUGI 18 DUBLIN 11 Abstractions ! ‘effective periods’ and slowly moving dimensions... ! ‘AND’ operators between values ! ‘Navigational’ Dimensions ! Joining schemas ! ‘fact-less’ schemas ! Hierarchy support, multiple passes,... June 20 - 23 2000 SeUGI 18 DUBLIN 12 Effective Periods ! Folks get married, situations change, etc. ! Manage in the dimension tables… " retain keys; " add ‘effective dates’; ! Query complexity rises - " all but rules out SQL ! Ensure the ‘truthfulness’ of a query... June 20 - 23 2000 SeUGI 18 DUBLIN 13 Effective Period WHEF WHET Fact Row WHEF WHET Dimension#1 Row WHEF WHET Dimension#2 Row WHEF WHET Dimension#3 Row Effective Period Time June 20 - 23 2000 SeUGI 18 DUBLIN 14 ‘AND’ operations ! Required when a single fact foreign key needs to describe a combination of values " e.g. multiple covers on an insurance policy ! Only useful for the selection of data " can’t resolve which of the combination is responsible for what proportion of the fact " the number of actually occurring combinations is the critical factor June 20 - 23 2000 SeUGI 18 DUBLIN 15 ‘AND’ operations KV1V2 V3 V4 V5 V6 K’ 12345 # KK’K2 K3 K4 K5 $ 1 1’ 2 8 7 12 4 8 45 99 2 2’ 2 11 3 75 71 28 52 0 3 3’ 2 13 21 67 10 25 47 12 4 4’ 4 14 15 34 95 11 12 67 5 5’ 1 52 3 12 85 1086 Dimension Table: ‘Link’ Table: Fact Table: Two rows have been Dimension Rows translate The rows with the selected by the user to columns to return a Link Table keys are single key where the selected combination is valid June 20 - 23 2000 SeUGI 18 DUBLIN 16 Further abstractions ! Partitioning physical dimensions " improving update performance # targeted indexing " improved file space usage " sympathetic to the user view - putting data where the user expects to find it ! PMC: Physical Metadata Class " Organization and management of the ‘real’ datasets June 20 - 23 2000 SeUGI 18 DUBLIN 17 Further abstractions a b a a i c d c b i b j d e e j k g f i h k f c g j c h e i d k d f k h j h h g Physical “Physical” Logical Datasets Dimensions Dimensions June 20 - 23 2000 SeUGI 18 DUBLIN 18 Closing thoughts ! The SAS® System offers a number of facilities to build and extend Star Schema structures. ! Metadata is the key to providing an interface user’s will use in combination with the functionality they’d want. ! Organize metadata carefully - use SAS/Warehouse Administrator™ June 20 - 23 2000 SeUGI 18 DUBLIN 19 Acknowledgements SAS is a registered trademark of SAS Institute Inc., Cary, NC, USA SAS/AF is a registered trademark of SAS Institute Inc., Cary, NC, USA SAS/Warehouse Administrator is a trademark of SAS Institute Inc., Cary, NC, USA All other brand and product names are trademarks or registered trademarks of the respective companies. Mark Shephard [email protected] June 20 - 23 2000 SeUGI 18 DUBLIN 20 Building Star Schema With SAS! Software An Introduction to a Data Warehouse Data Structure. Mark Shephard Sound Marketing Hindhead, UK This paper discusses the creation of star schema data structures as a store for detailed data within a Data Warehouse. Using the SAS! System throughout, as data loading mechanism, storage medium and exploitation tool, an efficient and capable Data Warehouse can be created to enable exploratory analysis of large volumes of detailed data. A number of abstractions from the familiar structure are made, exploiting the facilities of the SAS! System to better meet our requirements. The Star Schema is a very popular Clustered around the fact table are mechanism for the storage of data with a dimension tables, appearing as the ‘rays’ data warehouse. There are any number of books and conference papers expounding Star Schema (generic) 2 1 2 3 4 5 6 their virtues as means for enabling multi- KV1 V2 V3 V4 V5 V6 K V V V V V V 1 1 dimensional analysis of often fairly detailed 2 2 3 KK’K2 K3 K4 K5 $ 3 4 data. What you may have noticed, if you 4 8 7 12 4 8 45 99 5 5 11 3 75 71 28 52 0 have read any of this literature, is the 13 21 67 10 25 47 12 14 15 34 95 11 12 67 absence of a discussion of a star schema 52 3 12 85 1086 ! K’ V1 V2 V3 V4 V5 V6 K3 V1 V2 V3 V4 V5 V6 built using the SAS System as the primary 1 1 2 2 data store. This paper redresses that 3 3 Fact Table 4 4 balance. 5 5 Dimension Table Commonly the data warehouse data store is built using a mainstream OLTP database. Similarly the incumbent star schema is Figure 1: The basic star schema. queried using SQL. This imposes a number of restrictions on the function and capability of the warehouse, largely because the emanating from the star. Typically the fact schema design has to closely adhere to the table is a highly normalised structure. Each limitations of the database and particularly of its columns contain either a dimension those of SQL. Here we describe a table key or the information or ‘fact’ that we warehouse data structure built entirely from require. The purpose of the dimension SAS! System Software, enabling the tables is to describe the ‘fact’ in the fact construction of a data store that is both table. Each dimension table is therefore de- functionally rich and generically capable. normalised allowing the values within it to be browsed, thereby enabling the simplest possible mechanism for identifying a fact. Star Schema Basics A fact is completely described by the A brief recap of the basics of the star foreign keys associated with it in its row of schema structure is perhaps appropriate, if the fact table. Joins should be made between only to standardise on a number of terms of the dimension tables and the fact table only nomenclature. This we’ll do with the aid of – not between one dimension table and Figure 1. The primary component of the another. The keys used to relate the tables star schema is the fact table.

Building Star Schema with SAS® Software

Cubes Documentation Release 1.0.1

Star Schema Modeling with Pentaho Data Integration

Business Intelligence and Column-Oriented Databases

Star Vs Snowflake Schema in Data Warehouse

Beyond the Data Model: Designing the Data Warehouse

CDW: a Conceptual Overview 2017

Building an Effective Data Warehousing for Financial Sector

Integrating Compression and Execution in Column-Oriented Database Systems

Column-Stores Vs. Row-Stores: How Different Are They Really?

Using Online Analytical Processing (OLAP) in Data Warehousing

Columnar Storage in SQL Server 2012

Virtual Denormalization Via Array Index Reference for Main Memory OLAP