Building Star Schema With SAS® Software

An Introduction to a Data Warehouse Data Structure

Mark Shephard [email protected]

SeUGI 18, Dublin, June 20 - 23 2000

Building Star Schema...

- Basics
- Review of query mechanisms
- Performance
- A look at exploiting SAS/AF® classes
- Abstractions from the basic model
- More metadata, another class...

Star Schema Basics

- Few, additive facts
- Facts described by Dimensions
  - unique business key on each row
  - arbitrary keys - dates an exception
  - unknown data has a valid key
  - appropriate key lengths (see the sketch below)
- Facts selected by constraining dimensions
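A minimal sketch of these points, using hypothetical names (dimA, keyA and varA mirror the sample data used later in the slides): an arbitrary surrogate key held in a short numeric length, with an explicit 'unknown' member so that every fact row can carry a valid foreign key.

data work.dimA (index=(keyA));
  length keyA 4 varA $20;
  keyA = 0;  varA = 'Unknown';   output;   /* valid key for unknown data */
  keyA = 3;  varA = 'keyA= 3';   output;
  keyA = 9;  varA = 'keyA= 9';   output;
run;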

Star Schema Basics

[Figure: a generic star schema - a central fact table (foreign keys K', K2-K5 plus a fact column $) surrounded by dimension tables, each keyed on K with descriptive columns V1-V6.]

Query Mechanisms

- Two query phases:
  - constrain dimension keys to select particular rows from the fact table;
  - use fact row foreign keys to recover additional dimensional information
- SAS offers both SQL and Datastep:
  - does it matter which we choose?

Sample SQL

proc sql;
  create table results as
  select * from work.fact_tab
  where keyA = (select keyA from work.dimA where varA = 'keyA= 3')
    and keyB = (select keyB from work.dimB where varB = 'keyB= 9')
    and keyC = (select keyC from work.dimC where varC = 'keyC= 22')
  ;
quit;

Datastep Segment

data results ;
  _iorc_ = 0;
  set dimA (where = (varA = 'keyA= 3'));
  do while (_iorc_ = 0);
    set fact_tab key = keyA;
    if _iorc_ = 0 then do;
      do while (_iorc_ = 0);
        set dimB key = keyB /unique;
        if _iorc_ = 0 and varB = 'keyB= 9' then do;
          do while (_iorc_ = 0);
            set dimC key = keyC /unique;
            if _iorc_ = 0 and varC = 'keyC= 22' then do;
              /* ... segment continues - the complete step appears in the paper */

Query Performance

- SQL          5.99  6.20  6.25  6.20  6.09 secs
- Datastep#1   2.41  2.37  2.41  2.41  2.41 secs
- Datastep#2   1.20  1.26  1.32  1.32  1.26 secs

- 66,000 row fact table; Pentium 100; Win95
- Additional dimension data can be retrieved concurrently
- Datasteps are faster and tunable...

Improving Performance

- Fix the order of dimension processing
  - choose the dimension that will return the fewest rows from the fact table first
  - requires the processing of dimensions before the fact table (a sketch follows below)
- Assumes dimension value distribution is the same as foreign key distribution in the fact table...
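A rough sketch of that idea, assuming the dimensions and constraints from the earlier sample code (the names are hypothetical): count how many keys each constrained dimension would contribute before generating the fact-table probe, and process the most selective dimension first. The counts are only a proxy, since they assume the foreign-key distribution in the fact table mirrors the dimension.

proc sql noprint;
  select count(*) into :hitsA from work.dimA where varA = 'keyA= 3';
  select count(*) into :hitsB from work.dimB where varB = 'keyB= 9';
quit;

%macro order_dims;
  %if &hitsA <= &hitsB %then %put NOTE: probe the fact table via dimA first;
  %else                      %put NOTE: probe the fact table via dimB first;
%mend;
%order_dims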

Realizing data with classes

- LMC: Logical Metadata Class
  - providing the user's view on the Schema
- QMC: Query Metadata Class
  - encapsulating the user's query
- QEC: Query Engine Class
  - generating an instance of the user's query
  - performing query optimization
- SAS/Warehouse Administrator™

Realizing data with classes

[Figure: metadata class interaction - the user INTERFACE feeds the QMC (query metadata); the QEC draws on the QMC and the LMC (logical metadata) to generate datastep code such as: data Facts; set Facts key = kvar /unique;]

Abstractions

- 'effective periods' and slowly changing dimensions...
- 'AND' operators between values
- 'Navigational' Dimensions
- Joining schemas
- 'fact-less' schemas
- Hierarchy support, multiple passes,...

Effective Periods

- Folks get married, situations change, etc.
- Manage in the dimension tables…
  - retain keys;
  - add 'effective dates' (see the sketch below);
- Query complexity rises -
  - all but rules out SQL
- Ensure the 'truthfulness' of a query...
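A sketch of the 'effective dates' idea, assuming hypothetical WHEF ('effective from') and WHET ('effective to') columns as in the figure on the next slide: the dimension row used to resolve a key is the one whose effective period covers the date for which the query must be truthful.

%let asof = '15JUN1999'd;     /* the date the query must be true for */

data results;
  set dimA (where = (varA = 'keyA= 3'
                     and whef <= &asof and &asof <= whet));
  /* ...then probe the fact table with keyA as before, applying the
     same WHEF/WHET test to each fact row that is returned... */
run;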

Effective Period

[Figure: effective periods - the fact row and each of its dimension rows carry WHEF/WHET ('effective from'/'effective to') dates; along the time axis, a query is truthful only for the effective period where all of these intervals overlap.]

'AND' operations

- Required when a single fact foreign key needs to describe a combination of values
  - e.g. multiple covers on an insurance policy
- Only useful for the selection of data
  - can't resolve which of the combination is responsible for what proportion of the fact
  - the number of actually occurring combinations is the critical factor (a query sketch follows below)
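A sketch of how such a combination key might be selected, anticipating the 'link' table shown on the next slide (stream_link, linkkey and the flag columns are hypothetical names): the flags identify every stored combination that contains both of the required dimension keys.

proc sql;
  create table results as
  select f.*
  from work.fact_tab as f
  where f.linkkey in
        (select linkkey
         from work.stream_link
         where flag2 = 1 and flag4 = 1);   /* combination includes keys 2 AND 4 */
quit;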

'AND' operations

[Figure: secondary dimension processing - a dimension table (key K, columns V1-V6), a 'link' table (key K', one flag column per dimension key plus a count column #), and a fact table whose combination column holds the link-table keys K'.]

Dimension Table: two rows have been selected by the user.
'Link' Table: dimension rows translate to columns, returning a single key where the selected combination is valid.
Fact Table: the rows carrying the Link Table keys are selected.

Further abstractions

- Partitioning physical dimensions
  - improving update performance
    - targeted indexing
  - improved file space usage
  - sympathetic to the user view - putting data where the user expects to find it
- PMC: Physical Metadata Class
  - Organization and management of the 'real' datasets

Further abstractions

[Figure: partitioning - many physical datasets map onto a smaller number of "physical" dimensions, which in turn map onto the logical dimensions presented to the user.]
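One way the "physical dimension" layer can be surfaced is sketched below, with hypothetical dataset names: two partitions holding different column sets are presented as a single logical client dimension through a data step view, with only the surrogate key and business key guaranteed to be present in every partition.

data work.dim_client / view=work.dim_client;
  set work.dim_client_uk         /* full detail for home attendees   */
      work.dim_client_overseas;  /* reduced detail for overseas rows */
run;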

Closing thoughts

- The SAS® System offers a number of facilities to build and extend Star Schema structures.
- Metadata is the key to providing an interface users will use, combined with the functionality they want.
- Organize metadata carefully - use SAS/Warehouse Administrator™

Acknowledgements

SAS is a registered trademark of SAS Institute Inc., Cary, NC, USA.
SAS/AF is a registered trademark of SAS Institute Inc., Cary, NC, USA.
SAS/Warehouse Administrator is a trademark of SAS Institute Inc., Cary, NC, USA.
All other brand and product names are trademarks or registered trademarks of the respective companies.

Mark Shephard [email protected]

Building Star Schema With SAS® Software
An Introduction to a Data Warehouse Data Structure

Mark Shephard, Sound Marketing, Hindhead, UK

This paper discusses the creation of star schema data structures as a store for detailed data within a Data Warehouse. Using the SAS® System throughout, as data loading mechanism, storage medium and exploitation tool, an efficient and capable Data Warehouse can be created to enable exploratory analysis of large volumes of detailed data. A number of abstractions from the familiar structure are made, exploiting the facilities of the SAS® System to better meet our requirements.

The Star Schema is a very popular mechanism for the storage of data within a data warehouse. There are any number of books and conference papers expounding their virtues as a means for enabling multi-dimensional analysis of often fairly detailed data. What you may have noticed, if you have read any of this literature, is the absence of a discussion of a star schema built using the SAS System as the primary data store. This paper redresses that balance.

Commonly the data warehouse data store is built using a mainstream OLTP database. Similarly the incumbent star schema is queried using SQL. This imposes a number of restrictions on the function and capability of the warehouse, largely because the schema design has to closely adhere to the limitations of the database and particularly those of SQL. Here we describe a warehouse data structure built entirely from SAS® System Software, enabling the construction of a data store that is both functionally rich and generically capable.

Star Schema Basics
A brief recap of the basics of the star schema structure is perhaps appropriate, if only to standardise on a number of terms of nomenclature. This we'll do with the aid of Figure 1. The primary component of the star schema is the fact table. We can envisage this as the centre of a star. Clustered around the fact table are dimension tables, appearing as the 'rays' emanating from the star. Typically the fact table is a highly normalised structure. Each of its columns contains either a dimension table key or the information or 'fact' that we require. The purpose of the dimension tables is to describe the 'fact' in the fact table. Each dimension table is therefore de-normalised, allowing the values within it to be browsed, thereby enabling the simplest possible mechanism for identifying a fact.

[Figure 1: The basic star schema (generic) - a central fact table (foreign keys K', K2-K5 plus a fact column $) surrounded by dimension tables, each keyed on K with descriptive columns V1-V6.]

A fact is completely described by the foreign keys associated with it in its row of the fact table. Joins should be made between the dimension tables and the fact table only – not between one dimension table and another. The keys used to relate the tables are arbitrarily selected: not related to the data values. Each row of a dimension does have a 'business key' however. This is either one or a combination of values that makes the row unique within the table.

Sometimes the complete description of a fact is unavailable at the time when it is written to the fact table. To ensure that each key column has a valid foreign key, each dimension should have a key that is associated with an undefined or unknown value.

The definition of the dimensions to be used in the schema is the outcome of a process of data modelling, which is not the subject of this paper. However, we will note that if any star schema is to function correctly and efficiently, then careful attention must be paid to the data modelling process. To find a new relationship between data items after the schema is built is not a desirable discovery!

Specific information is extracted from the fact table by applying restrictions to the values within the dimension tables; extracting the appropriate dimensional key values; and finding the rows within the fact table which have each of these dimensional key values as values of their corresponding foreign keys. The query is done, the rows extracted; the result obtained.

The Query Mechanism
There are many ways with which to extract data from a star schema structure such as the one described above. As with all things, the selection of the appropriate solution is not wholly determined by the immediate task we would like to perform. Additional 'issues' arise that influence our decision. As we shall see, the more functionality we would like to embrace, the more sophisticated our solution is required to be.

For the moment, let us ignore the nature of the user interface and accept that through some user interaction a set of instructions will be created which will in turn execute upon our schema to deliver the result that the user requires.

A 'SAS®' Star Schema
We are setting out to build a star schema structure using only the SAS System. In the context of this paper this means that each of the tables referenced as components of the star are comprised of SAS datasets. (With the advent of Version 8, it seems that dataset and table, variable and column, row and observation are used interchangeably: I will continue in that fashion.) Each of the key and foreign key columns within each of the tables is indexed. The keys themselves are numeric values, stored with carefully selected lengths. By default, SAS will store numeric values within 8 bytes, however far fewer are usually necessary to store key values, which are often small integers. For many dimensions four bytes are sufficient; few require more than five (supporting more than 536 million discrete integer values). While key values should not be related to the business key values, it is often convenient to ignore this rule when working with a dimension with a business key of date. As SAS date values are represented as the number of days since 1st January 1960, setting the key value to the date is equivalent to choosing a key value with an arbitrary offset. An advantage is that the schema developer is more easily able to make some sense of the sea of keys in the fact table while the schema is being built and debugged. Remember though that a key representing an undefined or unknown date should be carefully selected: 0 would be a valid date value! A value of -138062 might be applicable, returning a string of asterisks if formatted with a 'date9.' format.
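As a small illustration of the date-key convention just described (the dataset and variable names here are hypothetical), the surrogate key can simply be the SAS date value itself, with the 'unknown date' member given the out-of-range value suggested above:

data work.dim_date (index=(date_key));
  length date_key 4;                     /* four bytes comfortably holds date values */
  format date_value date9.;
  date_key = -138062; date_value = .;    output;   /* 'unknown date' member */
  do date_value = '01JAN2000'd to '31DEC2000'd;
    date_key = date_value;               output;
  end;
run;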
One possible form of that set of instructions would be SQL. The SAS® System offers a comprehensive set of SQL facilities and, if the data were stored in the traditional way – an OLTP database – it would be perhaps the only viable solution for us. But this is not the case. The data is stored in SAS datasets, allowing us to efficiently harness the power of the datastep. So far, however, our description of the structure of the schema and its data has offered nothing to influence our decision. We might choose either, as a matter of personal preference. If you were to compare the two approaches, however, you might be swayed. Two code segments are presented below to perform the same task. One is SQL and the other datastep code.

proc sql;
  create table results as
  select * from work.fact_tab
  where keyA =
        (select keyA from work.dimA
         where varA = 'keyA= 3')
    and keyB =
        (select keyB from work.dimB
         where varB = 'keyB= 9')
    and keyC =
        (select keyC from work.dimC
         where varC = 'keyC= 22')
  ;
quit;

data results ;
  _iorc_ = 0;
  set dimA (where = (varA = 'keyA= 3'));
  do while (_iorc_ = 0);
    set fact_tab key = keyA;
    if _iorc_ = 0 then do;
      do while (_iorc_ = 0);
        set dimB key = keyB /unique;
        if _iorc_ = 0 and
           varB = 'keyB= 9' then do;
          do while (_iorc_ = 0);
            set dimC key = keyC /unique;
            if _iorc_ = 0 and varC = 'keyC= 22'
            then do;
              output;
              /* finished with this fact row */
              _iorc_ = 1;
            end;
            else do;
              _error_ = 0;
              _iorc_ = 1;
            end;
          end;
        end;
        else do;
          _error_ = 0;
          _iorc_ = 1;
        end;
      end;
    end;
    else do;
      _error_ = 0;
      /* this is the fact table read...
         don't reset _IORC_ */
    end;
    if _iorc_ = 1 then _iorc_ = 0;
  end;
run;
Each code segment selects a number of rows from a fact table by selecting values of variables within a number of dimension tables. The first comment that most people would make is most likely along the lines of "I'd rather write the SQL than the Datastep!" It's easy to see why. However, closer inspection of the datastep code reveals that it is not quite as convoluted as it seems. Indeed this code is iteratively repetitive and so almost as easily programmatically generated as the SQL would be. "But why bother?" comes the obvious question. The answer becomes more relevant the more functionality we attempt to include in the schema. The datastep language is far richer in functionality than SQL. This is illustrated in this example by its ability to extract both key and data values from the dimension sub-queries – unlike the SQL. But most important is the question of performance.

Table 1 shows the number of seconds of processor time required to retrieve around 100 rows from a 66,000-row fact table by specifying values for 3 of the 4 foreign keys it contains. Table 2 shows the time required when only a single row within the fact table meets the required criteria. Without exception the datastep processing is significantly faster than SQL. More important still is the second set of datastep code figures, which, in Table 2, are better still.

Table 1: Selection of many rows (seconds of processor time)
  SQL         5.92  5.98  6.09  5.99  6.04
  Datastep#1  2.58  2.52  2.35  2.41  2.41
  Datastep#2  2.41  2.41  2.25  2.41  2.52

Table 2: Selection of a unique row (seconds of processor time)
  SQL         5.99  6.20  6.25  6.20  6.09
  Datastep#1  2.41  2.37  2.41  2.41  2.41
  Datastep#2  1.20  1.26  1.32  1.32  1.26

These two different sets of datastep performance results illustrate the ability of datastep code to be 'tuned' to the task at hand. In this case it was known that a particular key value occurred only once in the fact table. By choosing to search for that key first, the performance of the query could be significantly enhanced. There is no opportunity to influence the SQL query in such a way. We'll address how you might do this with the datastep solution later.

We will assume after this discussion that the query upon the schema will be constructed from datastep code.
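As a rough sketch of the kind of re-ordering referred to above (the choice of keyC is hypothetical – it is simply the key known to match very few fact rows), the 'tuned' datastep probes the fact table with the most selective dimension first, so the remaining look-ups run only for the handful of rows that survive:

data results ;
  _iorc_ = 0;
  set dimC (where = (varC = 'keyC= 22'));   /* most selective constraint first */
  do while (_iorc_ = 0);
    set fact_tab key = keyC;                /* few fact rows carry this key    */
    if _iorc_ = 0 then do;
      /* ...resolve keyA against dimA and keyB against dimB as before,
         and OUTPUT the row if both constraints are met... */
    end;
    else _error_ = 0;                       /* no more rows: loop will end     */
  end;
run;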

Query Definition
Again, without invoking a particular style of interface to this store of data, we can envisage a number of things that it must do and a little of the way in which it must do them.

The simplest way for a user to extract information from the schema is to enable them to browse the data and make selections from what they might see. Each of the dimensions defined by the data modelling must be available for the user to review, selecting both the dimension and particular columns within them using terms that they commonly use. The user will define rules for the selection of data by specifying equalities or inequalities against possible values held in these columns. It may be necessary to reference the unique values of some or all of the column values. In this way the user will perceive that each of the dimensions associated with a fact are 'real' tables. While it's possible that they could be, it's more efficient if they're not.

Logical Structures
It is likely that most schema designs will have more than a single dimension to describe time. If a row is written to the fact table for every business transaction, then at least two time dimensions will be defined: the first to specify the beginning of the period affected by the transaction and the second to record the end of that period. The fact table in this hypothetical schema will have a column of foreign keys for the 'beginning' dimension and another for the 'end' dimension. Aren't these the same set of data? Clearly they are. We have a situation where there need not be as many physical dimension tables as there are 'logical' ones defined with keys in the fact table. How can this be supported?

Logical Metadata
Earlier we mentioned that the code that performs the query upon the schema will be 'created' as a result of some user interaction. We can expand on this now, as we know that this 'interaction' will be with a logical view of the schema, rather than a physical one. The logical view enables the user to deal in terms they commonly use to define a query that specifies values for the dimension foreign keys in the fact table. We can consider the dimensions from which they make their selections as 'logical dimensions'. The tables that really store the data we will refer to as 'physical dimensions'.

To store the mapping between the physical and logical dimensions we need to define a set of metadata. This metadata needs both an interface to define it and an access method to exploit it. As this type of metadata can be thought of as extended attributes to physical data tables, the most appropriate place to define it is via SAS/Warehouse Administrator™. This requires some extensions to the basic product using the API, a potential topic for another paper – if Steve Morton doesn't beat me to it! Once complete, such an interface provides an excellent means to support the logical structure of the star schema we are producing.

Exploitation of this metadata requires the creation of a query-time access method to service it. By definition, all query creation processes must reference the metadata in order to map the user's logical requirements into a query of physical tables. This is an ideal opportunity to create a SAS/AF® class! We'll call it a 'Logical Metadata Class' or 'LMC'. This LMC requires both SET_ methods to insert metadata into it and GET_ methods to extract metadata from it. Additionally, a further extension to SAS/Warehouse Administrator™ is required to export the logical-to-physical mapping we have defined within it.

When a user defines a query, it is the LMC that provides the user interface with the information it needs to support the user's activity. Our Logical Metadata Class supports the navigation of our logical star schema.

[Figure 2: Metadata class interaction - the user INTERFACE feeds the QMC (query metadata); the QEC draws on the QMC and the LMC (logical metadata) to generate datastep code such as: data Facts; set Facts key = kvar /unique;]

Practical Mapping
So how does this actually work? Well, with the assistance of a little more metadata! (And ideally another class to encapsulate it.) As we saw earlier, our 'end-game' is the creation of a set of datastep code that can be executed against our schema. The instructions to build this code are created by the user's interaction with the logical view of the schema. By storing the results of this interaction in a set of 'query metadata', the datastep code can then be generated by reference to this metadata together with the information made available by the LMC. This process is made all the easier and more robust by encapsulating the query metadata within another SAS/AF® class, which we might call a Query Metadata Class or QMC. Managing the query process in this way enables sets of metadata that describe queries to be stored and used at any time. If due care is taken to store the appropriate information in the correct metadata set, then changes to the logical-to-physical mapping can be effected without requiring the redefinition of any stored queries.

In practice then, an interface provides the user with the logical view of the schema, presented by the LMC. The user is able to review each of the logical dimensions, making selections of values of variables that are within them. The QMC records these selections as they are made. The two sets of metadata are then used by a Query Engine Class (QEC) to build the datastep code statements.

Query Optimisation
Now we can return to the question of datastep optimisation suggested earlier. As the query defined by the user is first described by a set of metadata before being generated as code, there is the opportunity to influence the order in which the dimensions should be processed. The datastep code presented above has a particular order to the processing of the dimensions. Rows are selected from the fact table by comparing the value of each logical key in a row against the required key values selected by the user from the associated logical dimension. As soon as a match is not found, the row is discarded. Thus the performance of the query is influenced by the cardinality of the dimension tables. Consider selecting facts based upon selections made from two dimensions. One dimension has rows enough to describe 500,000 clients, while the other has two rows to describe client gender. We'd like to find all facts relating to a client named 'C. Crawford', who has a gender of 'Female'. If we process the gender dimension first, we could expect to process half of the rows in the fact table, given that this dimension has a cardinality of 0.5¹. Each of these rows would have its client key compared with the required key selected from the client dimension. Alternatively we could process the client dimension first, which, having a far lower cardinality, would return far fewer fact rows with which to compare the gender dimension key. While I don't know the number of people named C. Crawford in the world, I'm sure it is far fewer than the number of females!

¹ The term 'cardinality' here is used to describe the reciprocal of the number of possible unique values within a column.

So how can we obtain this relative cardinality information to influence the order of execution of our query? Two ways present themselves. Either we can store cardinality information within the logical metadata for each column of each logical dimension, or we can calculate appropriate values during the query process. The former requires considerable effort during the registration of the metadata and subsequent ongoing maintenance to ensure its efficacy. The latter requires each dimension that takes part in a query to be processed before the fact table, to calculate the ratio of required dimension keys to total keys.

While this cardinality-based mechanism is mostly successful in the optimisation of query performance, it makes the assumption that the distribution of key values within a dimension is the same as the distribution of foreign keys in the appropriate fact table column. What would be the optimal order of dimension processing in our example above if there were only one female client amongst the entire 500,000 clients? Clearly there is scope here for a more sophisticated scoring process based upon the actual values of column variables. This could be achieved by recording more information within the logical metadata. However, the size and methods of navigation of the metadata itself soon become an issue. We would not want to store specific scoring information for every value of every variable in our schema! A compromise between methods needs to be arrived at – providing adequate query performance without unwieldy and impractical amounts of scoring metadata. This compromise is managed by the logic embedded within the Query Engine Class.

Extensions To The Model
With data being selected from the fact table by extracting those rows having keys which have been selected from the dimension tables, we can consider that a logical AND is being performed between each of the key columns. Similarly, if two different properties are constrained within the same dimension, then these too are joined with a logical AND. If several values are selected for a single property within a dimension, then they are joined with a logical OR. We can better illustrate this with an example. Let's consider a star schema that describes the attendees at a conference. Amongst the logical dimensions of the schema are one for Attendee, another for Attendee's Organisation and yet another for the subject streams at the conference. Some of the properties described by these dimensions and their potential values are shown in the tables below.

Attendee
  Key  Name   Age  Gender
  1    Jones  35   F
  2    Smith  42   M

Attendee's Organisation
  Key  Name  #Staff
  1    ABC   5
  2    XYZ   400

Subject Stream
  Key  Title     #Papers
  2    Mngmnt.   10
  3    Tech.     12
  4    Beginner  6

Consider also that each attendee has to register separately for each subject stream and that they may register for more than one. Each registration creates a row in the fact table. We can see that to discover the income from all male attendees named 'Smith' on the technical subject stream from companies with less than 100 employees, we must:
- AND the 'name' and 'gender' columns of the Attendee dimension to retrieve the key-value of 2;
- retrieve key-values of 1 from the Attendee's Organisation and 3 from the Subject Stream dimension;
- AND each of these key-values to find the appropriate facts.

Should we be interested in those that attended either the beginners' stream or the management stream, then we would select both keys 2 and 4 from the relevant dimension and select rows from the fact table that contain either one OR the other value.

'Secondary' Dimensions
But what if we're interested in those attendees that are registered on both the beginners' stream and the management stream? We could extend the fact table so that it includes columns (and so logical dimensions) for each of the subject streams. If we considered ten subject streams, for which most attendees registered for two at the most, there would be a lot of empty space on each row of the fact table, not to mention a rather odd logical view to present to the schema's user.

A more practical solution is to add a single column to the fact table with a key value that describes a combination of dimension column values (subject streams, in this case). In other words, a key that represents the result of ANDing multiple keys in the subject stream dimension. We can do this if we introduce an intermediate table between the dimension and the fact table to 'link' the key values together. Figure 3 illustrates this. As the user will make selections from a logical dimension that has the same appearance as the subject stream dimension, we refer to it as a 'secondary' dimension. Its keys are not those in the fact table; the fact table column has the keys of the 'link' table. The intermediate 'link' table contains a row for each actual combination of keys rather than a row for each possible combination. There are columns in this table for each of the keys of the secondary dimension. The value of these columns is a binary flag to indicate whether the associated secondary dimension key is part of the combination described by the 'link' table row. An additional column in the 'link' table provides a count of the number of keys in the combination.

[Figure 3: ANDing dimension keys - Dimension Table: two rows have been selected by the user; 'Link' Table: dimension rows translate to columns, returning a single key where the selected combination is valid; Fact Table: the rows carrying the Link Table keys are selected.]

To find rows in the fact table that relate to attendees registered on both the beginners' and the management streams, key values 2 and 4 will be selected from the subject stream secondary dimension. The rows in the 'link' table that have contributions from these key values will be selected, which will in turn provide a set of keys that can be found in the fact table. Use of the link table's count column can determine whether the attendee registered on these two streams only (a count of 2), or on these two along with others (a count greater than 2). This mechanism functions well in many practical situations, though at first thought it would seem that there would need to be a very large 'link' table for most applications. However, the difference between the number of actual combinations and the possible number is usually very large.

There are a couple of limitations though. Firstly, the link table does need to be carefully attended to – any new combination of values must be added to it during the schema update process. Secondly, this secondary dimension process is only useful during the navigation of the schema, not for the precise understanding of a fact. There's no way to gain an understanding of the contribution to a fact that each or any of the components of the combination make. We only know that they're all involved in some way. In other words, while this mechanism will allow us to easily identify those attendees that are registered on both the beginners' stream and the management stream, we need another dimension to resolve just which stream the fact is associated with.

The Strength In Metadata
What's important here is that the user of the schema would still be presented with a set of values to choose from (found in the columns of dimensions) regardless of whether the table they are selecting from is logical, physical or separated from the fact table by a link table. The user's view of the schema is kept consistent by the metadata that surrounds it. Provided that we can maintain this consistent approach, then the metadata may hide a multitude of sins that the Logical Metadata Class redeems us from.

Navigational Dimensions
The purpose to which you put a schema will often give rise to a requirement for (perhaps ever-naughtier) sins to be admonished by the metadata. Typically the information written to a star's fact table is transactional, such that it is not until several transactions have been recorded that the whole picture of a subject can be visualised. We may want to extract particular information from the fact table only once the final transaction of a set has been recorded, but that information may require the extraction of all the relevant rows. We can assist ourselves in this feat by adding a column to the relevant subject dimension to record the subject's latest transactional status. A reference to this column in the query process can significantly improve the performance of the query. If the subject dimension is large and the number of transactions high, this may become inconvenient when we search through the dimension, effectively offsetting the performance improvement it provides. As an alternative, we might create a new 'navigation' dimension that has the same key value as the subject dimension but far fewer rows. Such a dimension would be rather different to our standard ones as it would not have its own key in the fact table – it would use another's. Its function would be to ease the navigation of the schema for some commonly performed queries.

Dates, Dates And More Dates
The further we progress with this discussion, the more limited we become with regard to the particular application of the schema. However, suffice it to say that there are usually a number of dates involved in the construction of a star schema within a data warehouse. Many will relate to the business that is being described, but many others are related to the infrastructure of the schema itself. While I don't intend to go into any particular detail here, there are two basic sets of dates that are necessary if the schema is to function efficiently.

The first of these relates to the date that information was added into the schema. If nothing else, we need to know from what temporal viewpoint we are viewing this information about our business. Such dates may be used to great effect if the schema update policy is to 'expire' existing rows and add new ones. Such a strategy enables every historical view of the schema's data to be maintained as regular updates are applied.

A second set of dates relates to the 'effectiveness' of the rows of data within the schema tables. At any point in time we can usually say that for a particular subject a particular set of attributes was true, and for another point in time another set of attributes was in force. A similar statement could be made about the facts.

We could now spiral off into a discussion on what to do about attributes that change with time. This might well be interesting, as the treatment of such 'slowly changing dimensions' is challenging and perhaps controversial. As the discussion is quite involved we shall forsake it here, but note that with the control of the query that the datastep provides we are able to provide a satisfactory solution to an awkward problem by making use of 'effective' dates.

Further Practicalities
What will concern us now, given that we seem to have a structure and associated metadata enough to provide a functional and effective star schema, is the question of size. And size does matter. At the beginning of this paper we said that we were considering a schema built from SAS datasets and that these datasets would have one or more indexed columns. We all know that in most circumstances a dataset's size is limited by the size of the volume that supports it. Anyone who has used SAS for a while knows that indexing large datasets is not a speedy thing to do. What might we do to rescue us from this inevitable problem?

Partitioning
What we can do is partition our tables. We can build our fact table and each of our physical dimensions from any number of physical datasets, so reducing our dataset size problem to a question of 'how much disk space have we got?' Then, if we can arrange for any new data that must be added to the schema to be added to very few of these individual datasets, our re-indexing burden can also be reduced. This again we can do with a little bit of metadata and a SAS/AF® class to encapsulate it – the Physical Metadata Class (PMC). We then have an architecture that contains two major sets of metadata, one to relate numbers of physical datasets to our physical dimensions and fact table, and another to relate each physical dimension with one or more logical dimensions. This is illustrated in Figure 4.

[Figure 4: Logical relationships - metadata registration processes map many physical datasets onto "physical" dimensions, which in turn map onto the logical dimensions presented to the user.]

The metadata for the Physical Metadata Class is entered using API extensions to SAS/Warehouse Administrator™, in a similar manner to that used for the LMC. The capabilities of the PMC are limited to some degree by the operating environment that the schema is supported by. In any environment where a SAS table is equivalent to a native dataset within a catalogue (such as flavours of Unix or Microsoft Windows), the PMC is able to perform dynamic partition creation: the number and size of tables referenced by a single LIBREF is able to grow without the intervention of operating system utilities. In other operating environments, where the LIBREF is assigned to a native dataset rather than a catalogue, this degree of flexibility is curtailed. Each library then has an effectively fixed size, and the creation of a new partition requires the creation of a new native dataset and associated LIBREF.

Partitioning introduces another level of abstraction between the physical storage of the data and the logical view of the schema that is presented to the user. This can be used to great effect. For instance, each of the datasets that comprise the partitions of a dimension need not have the same set of columns. Only the dimension key and the business key columns need to be in every table. While this might at first seem rather a strange thing to do, it has benefits for both the technical implementation and the user of the schema. In our initial discussion of star schema basics, we suggested that the navigation of the data within the schema was simple because the dimensions described the facts and the dimensions could simply be browsed (given the appropriate viewer). Assuming that the data modelling was performed correctly, each of the dimensions would relate to particular areas of business that the users were familiar with. However, the users may expect to find particular sets of information brought together within the same dimension even though, from a schema designer's point of view, they should be separate dimensions. Navigation of the schema will be difficult if the data within it is not in the order that its users would expect. Dimension partitioning enables us to build partially dissimilar sets of data into the same logical dimension, so countering this problem. Consider our conference attendee schema again and think about the client dimension. We may decide that we require far less data to be captured for the overseas attendees, but because of some other marketing campaign we are devising, we'd like as much detail as possible for the other attendees. If we have a large number of attendees from overseas, then the datasets supporting the client dimension will have a lot of empty columns. Partitioning enables us to put these dissimilar sets of data into discrete datasets, so optimising the use of space, while having them appear in the same logical dimension where the users would expect them.

Conclusion
The SAS® System is an ideal tool for the development of a Data Warehouse built upon a star schema structure. Through the use of the flexible and powerful datastep, together with the encapsulation of metadata provided by SAS/AF® classes, a highly functional, extensible and above all usable solution can be developed to meet the requirements of the most demanding business. This 'model-viewer' implementation of a Data Warehouse, de-coupling the data store from the user interface, enables its data content to be exploited through a variety of technologies, from the mainframe batch job to the Internet Web browser.

The ideas and concepts outlined in this paper have been devised and developed by the author in a number of organisations in the UK and Portugal. The author can be contacted at:

Sound Marketing
Wotton Waven
Hindhead Road
Hindhead
Surrey
GU26 6AY

Email: [email protected]

SAS is a registered trademark of SAS Institute Inc., Cary, NC, USA.
SAS/AF is a registered trademark of SAS Institute Inc., Cary, NC, USA.
SAS/Warehouse Administrator is a trademark of SAS Institute Inc., Cary, NC, USA.
All other brand and product names are trademarks or registered trademarks of the respective companies.
Assuming that the data modelling the author in a number of organisations in was performed correctly, each of the the UK and Portugal. The author can be dimensions would relate to particular areas contacted at: of business that the users were familiar with. However, the users may expect to find Sound Marketing particular sets of information brought Wotton Waven together within the same dimension even Hindhead Road though, from a schema designer’s point of Hindhead view, they should be separate dimensions. Surrey Navigation of the schema will be difficult if GU26 6AY the data within it is not in the order that its users would expect. Dimension partitioning Email: [email protected] enables us to build partially dissimilar sets of data into the same logical dimension, so SAS is a registered trademark of SAS Institute Inc., Cary, NC, USA countering this problem. Consider our conference attendee schema again and think SAS/AF is a registered trademark of SAS Institute about the client dimension. We may decide Inc., Cary, NC, USA that we require far less data to be captured SAS/Warehouse Administrator is a trademark of SAS for the overseas attendees, but because of Institute Inc., Cary, NC, USA some other marketing campaign we are All other brand and product names are trademarks or devising, we’d like as much detail as registered trademarks of the respective companies.