
Data Model Overview
Modeling for the Enterprise While Serving the Individual

By Debbie Smith, Consultant, Teradata Global Sales Support

Table of Contents

Executive Summary
Introduction
Revisit Dr. E. F. Codd's 12 Rules
So – What is an EDW?
Data Infrastructure
Data Modeling Theory
From Logical to Physical Modeling
Physical Models
Impact of Information Delivery Tools
Impact of ETL Tools
Surrogate Keys
Changing Dimensions
The Teradata Direction
Impact of the Data Warehousing Evolution
Summary
Appendix A – Codd's 12 Rules
Appendix B – A Comparison
Appendix C – Glossary
Appendix D – Endnotes
Appendix E – References

Executive Summary

The choice of data model for the data warehouse is often a matter of great controversy. The desire is to offer a self-service environment that allows business users easy access with acceptable response times. Response time also includes the time required between the conception of a new application and the delivery of that application. What data model do you employ to provide ease of use for the business user while still being able to address the current and future needs of the data warehouse in terms of updating, expansion, availability, and management? This paper provides an overview of popular data modeling approaches and the Teradata Corporation position regarding data modeling.


Introduction

Data model choices are often a matter of great controversy when discussing building a data warehouse. When building the data warehouse, how do you build it? Do you attempt to build it for the business entity – current and future – or do you build it in a manner that satisfies current business users' needs? This paper will discuss the idea of building a centralized enterprise data warehouse that sustains longevity, and using a view layer to form-fit individual business users' needs, combining the benefits of a normalized model with the benefits of a dimensional model.

Revisit Dr. E. F. Codd's 12 Rules

In 1985, Dr. Codd published a list of 12 principles that have become the design guidelines of relational systems (see Appendix A).1 Dr. Codd is credited as the creator of the relational model, and within these guidelines, he outlines the structuring of information into tables, that null values are properly identified, and that a high-level insert, update, and delete capability is maintained. Note guidelines 6, 8, and 9.

> number 6; View Updating Rule – where logical views support full-range data manipulation
> number 8; Physical Data Independence – where the user is isolated from how the data is physically stored
> number 9; Logical Data Independence – where the user is not impacted should the logical structure change

These three guidelines specifically provide a methodology to isolate users from the impact of IT activities, and they lay the foundation for building a logical, user-friendly data structure that is independent of the physical data structure. This foundation provides for form-fitting individual business users' needs while maintaining a physical model that facilitates the needs of ETL, update frequencies, and data management requirements, endowing the enterprise data warehouse with user friendliness.

So – What is an EDW?

The foundation of this paper is built on the concept of an enterprise data warehouse (EDW). Many people use the term data warehouse to represent similar concepts; however, there can be some variances in what is meant. It is generally accepted that the data warehouse is the environment that provides for decision support within the organization. By appending the word enterprise, that environment is now thought of, or expected, to become reflective of the entire enterprise. Before continuing, let's discuss the environment that Teradata refers to when we say enterprise data warehouse. We define an EDW as an area where the data of the business (the enterprise) is centrally integrated, centrally stored, and accessed through common business-defined methods. We believe, and our customers have shown us, that the value of centralizing the data is the synergy gained by storing the data once, managing it once, and accessing it for and in many varied ways, times, methods, and reasons. (By gathering and integrating the data of the business, business users are able to have a 360-degree view of the data.) Of course, the accumulation of business data is an ongoing and ever-evolving process. One of the goals is to develop a long-term strategy that provides a methodology for adding and refining business data. The higher the degree of customer success, the greater the degree to which the infrastructure is able to answer any question, from any user (internal or external), on any data, at any time, including the yet unknown queries or unknown future application needs, without the need for or intervention of additional resources. This synergy allows the business to use the information of the business to quickly respond to and create changes in the market place.
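To make the view-layer idea concrete, consider a minimal SQL sketch. The table, view, and column names here are hypothetical, not from the paper; the point is only that business users query the view, so the underlying table can be re-indexed or extended (rules 8 and 9) without disturbing them.

    -- Hypothetical base table; users never query it directly.
    CREATE TABLE sales (
        item      INTEGER NOT NULL,
        sale_date DATE NOT NULL,
        sales_amt DECIMAL(10,2)
    );

    -- Business users see only this view. Physical changes to the table
    -- (new indexes, added columns) do not change what the view presents.
    CREATE VIEW item_daily_sales AS
    SELECT item, sale_date, SUM(sales_amt) AS total_sales
    FROM sales
    GROUP BY item, sale_date;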


Data Infrastructure

The data model is the core of the data warehouse. The decisions made when defining the data model determine the data infrastructure. This data infrastructure can impact performance, time to market for new applications, responses to changes in the market place, business users' ability to retrieve information, data latency, and the optimization of the data warehouse over the long term. The focal point of the data warehouse data modeling discussion typically covers normalization theory and dimensional theory.

As data warehouses evolve and take on greater responsibility and active roles in how an enterprise conducts business, data model requirements and expectations are changing at a faster rate to support this increased responsibility. The data warehouse is expected to reflect changes within the enterprise with increasing speed. It is also driving changes in how the enterprise conducts business by providing insight into current business practices. These increasing requirements of the data warehouse impact data modeling choices. The goal of the EDW is to address the needs of the enterprise. This requires a data model that provides an enterprise solution rather than a localized solution that only addresses specific application needs for individuals within the organization. The charge of the EDW is to enable the business at an enterprise level. To do that, hard decisions concerning data modeling must be addressed. Questions such as: Should the data model reflect the needs of the enterprise or the needs of the individual? What is the impact on the data model as the needs of the individual and/or the needs of the enterprise change? Do you have to sacrifice one over the other? What if you could minimize those sacrifices?

Experience has shown us that with Teradata Database, it is possible to take full advantage of Codd's guidelines of physical data independence, logical data independence, and view updating to create a data model that provides the flexibility to satisfy both current and future needs of the enterprise, and then, through the use of views, create a presentation layer that addresses the needs of the individual. This provides a methodology to utilize the normalized model and augment it with views that use dimensional modeling techniques, taking the best of both worlds.

Teradata® Database, as an RDBMS, is and always has been agnostic about which well-defined data model is chosen. There are many customers who execute a normalized model, others using a snowflake model, others using a star schema, and many others using some variation or derivation of each, any, and all. With that said, it is important to note that Teradata as a company, with its vast experience in data warehousing, does have preferences to ensure the data model employed provides the highest flexibility and responsiveness to the business for not only current needs but also future needs.

With more than 35 years of experience in database technology and data warehouse design, Bill Inmon is recognized as an authority in the data warehousing industry. His lectures and papers promote a normalized model as the model of choice for the data warehouse, reserving the use of dimensional models for data marts if and when necessary. He states, "It is not possible to build an effective DSS environment without an enterprise data warehouse. The design of the enterprise data warehouse is typically normalized. The classical entity relationship structuring of data accompanied by the ensuing normalization of the data structures fits very conveniently with the requirements of the usage of the enterprise data warehouse."2

Data Modeling Theory

Entity-Relationship Modeling

Before designing and establishing the physical data model, there is a logical data modeling process. This process includes extended discussions with the business community. During these discussions, the business requirements are identified, the data entities and elements required to meet those business requirements are established, and the relationships between the data entities are captured. These insights are later diagrammed into a representation of the business, referred to as an Entity Relationship (ER) model. An ER model or diagram represents the data entities of the business and the relationships between those entities. At this stage, the specific functions or queries the model will be used to support are not included.


Figure 1. Example ER Diagram. Customer and Travel-Request are entities; Customer-Name and Customer-Addr are attributes of the Customer entity; "Places" is the relationship between them. Business rules: a Customer places zero or more Travel-Requests, and a Travel-Request (e.g., Boston to New York) is placed by only one Customer.
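Rendered as DDL, the Figure 1 entities and the "Places" relationship might look like the sketch below. The column types and lengths are assumptions; the zero-or-more/only-one business rules are carried by the foreign key.

    -- Customer entity with its two attributes from Figure 1.
    CREATE TABLE customer (
        customer_id   INTEGER NOT NULL PRIMARY KEY,
        customer_name VARCHAR(60),
        customer_addr VARCHAR(120)
    );

    -- Each Travel-Request is placed by exactly one Customer (NOT NULL foreign
    -- key); a Customer may place zero or more Travel-Requests.
    CREATE TABLE travel_request (
        request_id  INTEGER NOT NULL PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
        trip_desc   VARCHAR(60)    -- e.g., 'Boston to New York'
    );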

This is a logical representation of the enterprise and is later used to develop a physical data model or schema. There are a number of tools available to map the ER diagram; ERwin®, now known as AllFusion™ ERwin Data Modeler, is the most recognized. ER modeling is guided by the rules of normalization. These are strict rules meant to ensure the essence of business relationships is captured. Examples of the components (entities and relationships) of an ER diagram are shown in Figure 1.

Dimensional Modeling

Dimensional modeling is another logical design method used to organize data for functional groups of users or business functions. It identifies the dimensions and levels within the business, separating out the facts or measures of the business. The dimensional model enforces additional rules that eliminate many-to-many relationships between entities, allowing only many-to-one relationships. It fuses multiple entities together into a new entity. This new entity does not directly reflect the entities and relationships that occur in the business, but is established for the convenience of storing a data point or metric that is important to the targeted group of business users. The goal of the dimensional model is to provide a presentation layer that facilitates easy navigation of the data for business users and quick access to reports.

The dimensional model is often thought of or represented as a cube with dimensions such as time, product, and geography. The business metric is at the intersection of these dimensions. Visualizing the dimensional model as a cube, sometimes referred to as a star schema (see Figure 2), makes it easy to imagine being able to slice and dice a segment of data. Creating these cubes for reporting requires understanding what questions will be asked before designing. Each cube is designed to facilitate quick and easy access to a specific business application.


Figure 2. A Cube (Star Schema). A Sales fact sits at the intersection of several dimension hierarchies: Advertising (Ad Year, Ad Period, Ad Week), Fiscal (Year, Quarter, Period, Week), Calendar (Year, Quarter, Month, Week, Day), Product (Dept, Minor Dept, Category, Sub Category, SKU, UPC), and Geography (Region, District, Store).
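A physical star corresponding to a slice of Figure 2 might be sketched as below. The names and types are hypothetical; the point is that each dimension's hierarchy is flattened into one denormalized table around a central fact.

    -- Flattened product dimension: every hierarchy level sits on one row.
    CREATE TABLE product_dim (
        sku          INTEGER NOT NULL PRIMARY KEY,
        sub_category VARCHAR(30),
        category     VARCHAR(30),
        dept         VARCHAR(30)
    );

    -- Flattened geography dimension.
    CREATE TABLE store_dim (
        store_id INTEGER NOT NULL PRIMARY KEY,
        district VARCHAR(30),
        region   VARCHAR(30)
    );

    -- The Sales fact sits at the intersection of the dimensions,
    -- fixed at one grain (here, SKU by store by day).
    CREATE TABLE sales_fact (
        sku       INTEGER NOT NULL REFERENCES product_dim (sku),
        store_id  INTEGER NOT NULL REFERENCES store_dim (store_id),
        sale_date DATE NOT NULL,
        sales_amt DECIMAL(10,2)
    );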

The dimensional model flattens the dimensions via denormalization. Business rules are used to validate and enforce additional restrictions (as in Figure 2). For example, many companies have a fiscal year that differs from the calendar year; they may even have another time dimension, such as advertising, which differs from both calendar and fiscal. If or when the advertising week does not fall within the fiscal week, business rules will be used to determine and assign the ad week to a fiscal week, providing a methodology for drilling through to lower levels. In effect, the functional rules and processes are accommodated within the model. This, in turn, means that much more needs to be understood about the queries and outputs that the model is expected to support. A dimensional model can be created from an ER model; however, an ER model cannot be created from a dimensional model. Once entities are fused together, separating those fused entities is difficult, if not impossible.

With a dimensional model, the central fact table is forced to a single grain, causing the initial fact table design to become brittle or inflexible. You can't incorporate new data sources without breaking the original star schema design or creating separate fact tables or data marts. To get an enterprise view, you would then have to drill across these different stars or data marts.
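One plausible way to encode the advertising-week rule described earlier in this section is sketched below. This is purely an illustration: the tables are hypothetical, and the tie-breaking rule (assign the ad week to the fiscal week containing its start date) is a made-up example of such a business rule.

    -- Map each ad week to the fiscal week containing its start date.
    SELECT a.ad_week_id,
           f.fiscal_week_id
    FROM   ad_week_dim a
    JOIN   fiscal_week_dim f
      ON   a.week_start_date BETWEEN f.week_start_date AND f.week_end_date;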


From Logical to Physical Modeling

Why would you create a physical model that is different from the logical model? The physical model often differs from the logical model to yield the best overall performance for specific database technologies, and not all database technologies are equal. Technologies differ so much that in an article written for Intelligent Enterprise magazine, Neil Raden states, "As you know, the primary schema for a data warehouse is either a star schema or a 'normalized' schema. The latter is a term so loosely defined that it's hard to describe, but a normalized schema typically resembles a third (or higher) normal form (3NF) schema that's not dimensional. These 3NF designs don't support query and analysis. Their sole purpose is to act as a staging area, an upstream data repository for a series of star schemas, online analytic processing (OLAP) cubes, and other structures that are directly queried by analysts. The only routine exception to this is Teradata implementations: Because of the unique characteristics of the massively parallel architecture and database optimizer, Teradata can process analytical SQL against a 3NF schema with acceptable performance."4

This implies the physical model is often changed from the logical model not necessarily due to business requirements but to facilitate the technology being used and ensure query speed. This transition from the logical model to the physical model is where the greatest variation, deviation, and difference of opinion takes shape, which, of course, forms the basis for the confusion surrounding data modeling. On one end of the spectrum is the flattened denormalized model, while on the other end is the normalized model. Of course, the reality is that most customers develop a data model that falls somewhere between the opposing ends, or some combination of both. This is often the result of the data warehouse evolving to understand business demands.

Figure 3. The Modeling Spectrum. Normalized and snowflake models (less data redundancy) sit at one end; star and flattened denormalized models (simpler data access) sit at the other.

In his book Data Modeler's Workbench, Steve Hoberman discusses what he refers to as the normalization hike and the denormalization survival guide.5 Hoberman's principles state that the logical model is completely normalized, up to fifth normal form. His analogy states that once you've accomplished this, you can see the business functions as if looking down from a skyscraper. The denormalization survival guide then provides guidelines for transforming the normalized logical model into a physical model. This survival guide provides some of the principles for denormalization, the basics of which are to minimize denormalization so as not to compromise the business's current and future needs.

Physical Models

Normalized Physical Model

As you move along the modeling spectrum (see Figure 3), the further to the left your data model is, the greater the degree of normalization. A normalized model separates data into tables based on very strict rules that relate an attribute to the primary key. One of the fundamental rules of normalization is the elimination of data redundancy within the model, keeping each attribute within each table functionally dependent upon the primary key.


By following this rule, eliminating data redundancy within the model, the number of tables, along with the complexity of the data model, increases. As new data subjects for the data warehouse are identified, the model is extended following the same normalization rules to include the new data subjects.

[While data warehouse modelers are often embroiled in controversy over how to model the data warehouse, the world of transactional systems is in agreement that a normalized model is the optimal choice. As data warehouses evolve towards transactional decision making using very current data, the need for a more normalized physical design becomes evident.]

A normalized data model provides a methodology for capturing and storing the lowest level of data, eliminates multiple updates to various tables with the same transaction, and is the model of choice for the OLTP types of transactions that are targeted and specific.

Customer experience has shown that the normalized model can answer new business questions, or previously unknown questions, without changes to the structure of the database, because the relationships and entities have been represented physically and provide the greatest flexibility for the business. It eliminates data duplication and, therefore, the complexity of maintaining duplicated data. Even though there are typically more tables in a normalized model, the greater the degree of normalization, the greater the degree of storage space conservation, because data is not duplicated. However, denormalized structures, such as summary tables, will and often do exist in data warehouses that have been developed under the constructs of normalization. When the business value warrants the cost to build and manage summary table structures, they are a viable augmentation of the normalized data model.

Denormalized Model

Moving right along the modeling spectrum (see Figure 3), the degree of denormalization increases. A flattened model essentially resembles an Excel spreadsheet. Denormalized data models flourished out of a need to provide business users with quick and easy access to data without the burden of understanding the underlying data model. Business users, whose job does not include knowing SQL or the logic of data models and, therefore, how to join tables together, needed easy access to information. However, those same business users did understand spreadsheet formats and, in fact, often requested information in a report format. Therefore, when data was presented in a two-dimensional form similar to a spreadsheet, as opposed to a series of connected tables, business users were protected from data model complexity. Because the data model was generated to address specific business constituencies, new user constituencies (often called power users) were created within the organization to code and generate reports for business users. These power users became so busy that requests were often queued for weeks at a time. Business users wanted direct access to information, and they wanted it instantaneously.

Denormalized models provided simple, easy navigation of the data. The data are typically aggregated along one of the lines of dimension. The types of queries requested are well understood, and the model is established to answer those questions. These pre-defined models ensure the known query will return an answer set quickly. When new queries are identified, the model is extended to include a duplication of existing data in a form that allows the business user to navigate and quickly retrieve the answer set for the new set of queries.

The star schema model is typically only considered for use in data marts or the data warehouse and is built to address a specific business need, report, or application. Each star schema is built as an individual data mart. When the business needs a new report, another star schema, or data mart, is built. If the needed dimensions are the same as what has already been built, they are duplicated to this new data mart. Each of these data marts caters to the needs of the individual. There may be a temptation to think of a group of these data marts as constituting a data warehouse. However, they are distinct structures that duplicate data in order to store it either in a different format or on a different platform, whether a different database or just different hardware.


The star schema model does not have the goal of storing data once for multiple uses. Its goal is to facilitate end-user usage. When the business community has a new request, a new star schema is typically built to facilitate the request.

For example, when a business user wants to know the monthly sales of products, the report produced would have headings such as Item, January Sales, February Sales, March Sales, etc. The normalized Sales table would consist of the columns Item, Date, and Sales. This table would provide the greatest flexibility, allowing for the next questions. These questions might include what are sales per week for specific items, what were fiscal month sales, what items sell on Mondays, or what are weekend sales. However, because it is often difficult for users or even database administrators to make the conversion from the table structure to the report layout, in the dimensional model, DBAs simply store the table in the denormalized report format (see Figure 4).

Figure 4. Example table in normalized and denormalized format.

Normalized:
Item    Date      Sales
100012  01102001  10.00
300012  02122001   3.00
200012  01152001   2.50
100012  03042001  15.00

Denormalized:
Item    Jan Sales  Feb Sales  Mar Sales  Apr Sales
100012  345.00     450.00     326.50     245.90
200012  456.60     376.50     210.00     390.00
300012  254.00     112.00     310.00     295.00
400012  510.00     610.00     590.00     545.00

In an effort to maintain conformity, the star schema is built with constrained dimensions. Ralph Kimball explains that data are placed in a staging area for transformation. These data, using what he refers to as the data warehouse bus architecture, are then propagated out to all the different data marts that require them, forcing all dimensions to conform to what is established in the staging area. This bus architecture is the roadwork that connects all the different data marts and servers being used. Kimball states, "It is acceptable to create a normalized database to support the staging processes; however, this is not the end goal. The normalized structures must be off-limits to user queries because they defeat understandability and performance. As soon as a database supports query and presentation services, it must be considered part of the data warehouse presentation area. By default, normalized databases are excluded from the presentation area, which should be strictly dimensionally structured."6

Contrasting Ralph Kimball's belief that data warehouses are best suited for a star schema, Bill Inmon firmly believes the EDW has at its heart a normalized model, reserving the use of star schema models for data marts if and when necessary. Inmon states, "In short, simply doing dimensional modeling as a basis for data warehouse design leads down a dark path when multiple star joins are considered. It is never apparent that there is a problem with star joins when you are looking at just one star join. But when you look at multiple star joins, the limitations of dimensional modeling become apparent. Does this mean that dimensional modeling is invalid as a design technique for data warehousing? The answer is not at all. Dimensional modeling and star joins fit the bill very nicely for data marts. In fact, if I had to design a data mart tomorrow, I would not consider using any other approach. But, when it comes to the foundation data, it's another story. The foundation data – the data warehouse – requires a different treatment than dimensional modeling altogether. The data warehouse, which is the proper foundation for all DSS activity, including star joins, requires very granular, very flexible data. The ideal structure for the data warehouse is normalized data. The normalized data can be bent and shaped any old way."7
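Returning to the Figure 4 example: rather than physically storing the denormalized report format, a view over the normalized Item/Date/Sales table can present the same layout. A sketch, with hypothetical names (this foreshadows the paper's larger argument about views):

    -- Presents the normalized sales table in the Figure 4 report layout.
    CREATE VIEW monthly_sales_report AS
    SELECT item,
           SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 1
                    THEN sales_amt ELSE 0 END) AS jan_sales,
           SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 2
                    THEN sales_amt ELSE 0 END) AS feb_sales,
           SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 3
                    THEN sales_amt ELSE 0 END) AS mar_sales,
           SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 4
                    THEN sales_amt ELSE 0 END) AS apr_sales
    FROM   sales
    GROUP BY item;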


Inmon continues, "The fact table is connected to the dimension tables by means of foreign key relationships. The keys to the dimension tables sit in the fact table. There may be many relationships that find their way into the fact table. In addition, the fact table carries with it other non-key data if needed. In other words, the dimension table may share with the fact table data other than just the key. When the fact table and the dimension tables are connected, they form a star, hence the name 'star join'. The net effect of the fact table is a structure which is highly denormalized and which is very efficient to access. Once the fact table is created, it can be efficiently analysed. But there is a tradeoff. The extreme efficiency of the star join also makes it inflexible. If there is a desire to see the data in the star join in a manner other than that for which the structure is built, then the star join is very inflexible."8

Impact of Information Delivery Tools

Denormalized models, specifically star schemas, gained general acceptance on the basis of the ease provided for business users to directly access data. After gathering the detailed business requirements, developers often determined that to mitigate risk and provide the business users what they requested, a denormalized model would be the easiest to implement. Since the data warehouse was updated or refreshed on a weekly or even monthly basis, the update schedule mitigated issues surrounding the complexity of updates to maintain simple data models. Business users were expected to know, understand, and code simple SQL for retrieving information, allowing IT to turn the data warehouse over to the business users and only maintain responsibility for keeping it updated. When the business began coming up with additional questions that the data model couldn't address, and joins between star schemas became complex, information delivery tools were developed to help business users retrieve the information they needed through an easy interface.

While many of these information delivery tools have grown up expecting the star data model, many are growing in sophistication and developing the technology to recognize and use more normalized data models, making it easier for end users to navigate a more complex schema.

Impact of Extraction, Transformation, and Loading (ETL) Tools

While information delivery tools were developed to assist the business user in retrieving information from the data warehouse, ETL tools were developed to assist in extracting data from transactional systems. Since the ETL tools are meant to extract data from the transactional systems, and those are in normalized format, ETL tools work better with the normalized data model.

If the data warehouse is built with a denormalized data model, then most require additional steps to land the data in a separate area for ETL. While the ETL tools work well with a normalized model, moving those data into a denormalized model requires manipulation or transformation. The suggested way of handling this manipulation or transformation is the creation of a data staging area. This staging area is used to land the extracted data, then manipulate or transform them to the format needed before loading. This staging area also provides the platform from which data, in particular dimensions, can be duplicated out to the different star schema data marts.

When loading to a normalized model, the need to transform the data for modeling purposes is significantly less, and the transformation can actually happen during the load process, eliminating the need for a staging area. In a normalized model, those using a staging area are typically using it to correct data quality issues that exist in the transactional systems – for example, ensuring character fields follow the same set of rules. If an address contains the word 'street', then all formats of 'street' are transformed to be the same: ST becomes Street and St. becomes Street.
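A staging-area cleanup of the kind just described might look like the sketch below. The staging table and its parsed suffix column are assumptions, and a real implementation would cover many more variants.

    -- Standardize street-suffix variants in a (hypothetical) staging table
    -- before loading, so 'ST' and 'St.' both become 'Street'.
    UPDATE stg_address
    SET    addr_suffix = 'Street'
    WHERE  addr_suffix IN ('ST', 'St', 'St.');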


Surrogate Keys

In general, a logical key comprises the data fields whose value ensures uniqueness for each row and whose value will not change during the life of the row. Implemented physically, it is a natural key. Often, the terms key and index are used interchangeably, but they are not the same. An index is a mechanism used to optimize performance; its physical implementation differs depending on the database management system employed.

A surrogate key is an artificial, or generated, key that is used as a primary key in substitution for natural keys. Using a surrogate key is a choice rather than a requirement. Typically, surrogate keys are numerical and randomly generated to uniquely identify a row. The use of a surrogate key is often promoted on the basis of adding flexibility and data independence to the data model. In business, change is an ongoing fact of being in business. Businesses are involved in mergers and acquisitions, new products and channels are added, they are reorganized, and they enter into new markets. There are a couple of arguments for surrogate keys. One is that as business changes, so do the values, and potentially the format, of natural keys. When the natural key is used as the primary key, changing either the value or the format of that key is an intrusive effort, which could result in loss of historical information and data relationships. When a surrogate key is used as the primary key, the impact of changing the value of the natural key is the same as changing the value of any column within the table – minimal. Another argument is that when the natural key is large or consists of multiple columns, a surrogate key would be smaller and require less space as it is populated throughout the data model. To facilitate access, a process is developed that will lead to the surrogate key for joins and retrieval of information.

In a data warehouse, which brings together data, often disparate data, from the transactional environment, surrogate keys may provide a viable methodology for doing so. The data warehouse is subject oriented, which contrasts with the application or function orientation of transactional environments. Integrating and reorganizing multiple systems that represent the same subject, such as customer, can require the integration of data that does not occur naturally in the operations environment. A surrogate key will uniquely identify each instance of that integrated subject, each customer, without worry of duplication in the future as the business continues to change. This provides a stable data model that will withstand the test of time. However, beware not to use surrogates as replacements for good data modeling practices.

Before committing to surrogate keys, you must consider some things. To generate the surrogate key, a process must be put in place to ensure uniqueness and consistency. Updating and inserting new rows require first the creation of the surrogate key and also a lookup process to ensure the newly generated key hasn't been used. Since the user doesn't know the surrogate value, a process to link the values the user does know with the surrogate value must be established to retrieve information. Typically, this requires additional secondary indexes. Using surrogate keys tends to add both a column and a unique index to each table, sometimes multiple secondary indexes, and in some cases, entirely new tables. Surrogate key values also must be propagated to all the dependent tables as foreign keys.

While surrogate keys are frequently associated with star schemas, their use is not reserved for specific data models. The decision to use or not use a surrogate key should be part of the data model design considerations. That decision should be predicated on what the data modeler believes to be best for the organization. Will there be requirements for the natural key to change? Is there a need to ensure that every row is unique? Will there be a resulting impact from adding secondary indexes?

For example, if the natural key for an order table is order number, and a business user wanted to know how many units of product 'X' were ordered by company 'ABC', chances are he wouldn't know the order number. But would the business user know the product number and/or the company name? When a surrogate key is used as the primary key, order number, product, and ordering company would probably each be made secondary indexes. When the natural key of order number is used as the primary key, product and ordering company would probably each be made secondary indexes.

Outside of the maintenance issues of the surrogate key, the data modeler must determine when it is wise to use either a surrogate key or a natural key. For example, while order number may make sense as the primary key of an order table, in the case of a customer table, it may make sense to use a surrogate key for customer id. The decision to use a surrogate key should be evaluated on a case-by-case basis and not entered into as a standard.


Figure 5. Methods to address SCD. Three sample SKU/category tables illustrating Type One (the category is overwritten in place), Type Two (a new row is inserted with an Updated date, preserving history), and Type Three (current and previous category columns are carried on one row).

Changing Dimensions

A changing dimension is often referred to as a Slowly Changing Dimension (SCD). The impact of time on the data warehouse can be significant, and time has a tendency to generate many changes. Customers move and their addresses change; people change their names. Area demographics change and populations shift. Products move from one category to another. These are changing dimensions. Time impacts all data models, and dimensions will change over time. However, a changing dimension is only a problem when the model is built on dimensions. Typically, there are three methods used to address SCDs (see Figure 5).

> Type one updates the dimension value in place as it changes. This provides a dimension that is always current, eliminating the capture of history because the association of the data with the old dimension value has been overlaid.

> Type two inserts a new row into the table as the dimension changes. This method provides both current and history information for the changing dimension, and usually involves carrying an effective date to indicate which row is current.

> Type three updates the dimension in place after moving the changed dimension value to an 'old' column. This method provides for the current and most recent values.

A normalized model is typically developed as type two, building a history as changed rows are inserted into the table. If the business application requires only the current value, a view can be used to present only current values to the application. Those applications that need history, or the ability to identify values during a specific time frame, would then have access to all the history data.

While the normalized model provides for the type two SCD, the construct of the dimensional model increases the challenge of providing for changing dimensions. Included in this challenge is the need to determine how to reflect the changing dimension. Which type satisfies the business need for the dimension? Does it satisfy all the business needs? Which method satisfies the needs of all the data marts in the dimensional model?
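A minimal type two sketch using the Figure 5 sample values (table and column names assumed): history accumulates as rows are inserted, and a view serves applications that want only the current value.

    CREATE TABLE product_catg_hist (
        sku      INTEGER NOT NULL,
        catg     CHAR(2) NOT NULL,
        eff_date DATE    NOT NULL    -- effective date marks the current row
    );

    -- A category change is recorded by inserting a new row, not updating.
    INSERT INTO product_catg_hist (sku, catg, eff_date)
    VALUES (1003, '01', DATE '2002-09-15');

    -- Current-only applications read this view; history stays in the table.
    CREATE VIEW product_catg_current AS
    SELECT sku, catg
    FROM   product_catg_hist h
    WHERE  eff_date = (SELECT MAX(eff_date)
                       FROM   product_catg_hist
                       WHERE  sku = h.sku);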


The Teradata Direction

Identifying the most appropriate choice of data model for a Teradata solution is a two-step process. First, you must distinguish between what Teradata Database lends itself to and what Teradata professionals advocate. Teradata Database is impartial to the data model. The database has the technical ability and the power to perform with any well-designed data model. Teradata's ability to parallelize every aspect of query processing eliminates the technical issues and complexity of decision support processing against large volumes of data, including the multi-table joins and joins within joins that are often the catalyst for denormalized data models on some technologies. Teradata engineering is always reviewing and identifying areas within the optimizer to improve performance, whether the data model is normalized or a denormalized star schema. Join efficiency improvements, such as large table/small table joins in the early 1990s, were targeted at star schemas. Advanced indexing features, the join index introduced in Teradata Database V2R3 and the aggregate join index introduced in Teradata Database V2R4, provide a methodology for users of Teradata Database to keep a normalized model and use these features to create star schemas if and where processing merits it. Teradata Database V2R5 made generating surrogate keys easier with the introduction of the Identity column. Optimizer learning and intelligence occur without regard to data model.

Years of field experience have shown Teradata professionals that a normalized model of atomic data provides the greatest flexibility and, therefore, often the greatest long-term business benefit. A well-integrated data warehouse is a valuable mechanism in business operations. As businesses use the data warehouse to uncover patterns in their manufacturing processes, their ordering processes, or their customers' purchasing habits, business processes change. The business expects the data warehouse to enable and reflect those changes. If business changes require an additional column (dimension) on one of the fact tables, or a product in one category is transferred to another category, the business expects the data warehouse to reflect these changes based on a business timeline. If that timeline is negatively impacted by the time required to make the data model changes that enable and reflect the business changes, the data warehouse becomes an obstacle to the business's success. Customer and field experience repeatedly shows that a normalized model provides the most flexibility and ease in delivering new products to market and in facilitating rapid business changes. So Teradata professionals often suggest starting with a normalized model because of its flexibility and the ease of adding new data elements. However, this does not mean that the data warehouse follows all the rules of third normal form in the strictest sense. Based on business requirements, summary tables are often found useful to augment the normalized model. The goal is to provide a model that supports the business through all of the changes that occur over time.

One of the key factors that sets Teradata Database apart from all others is the ability to use a normalized data model that facilitates ease of management and ease of expansion, simplifies load strategies, and allows for full integration of enterprise data. This facilitates building for the enterprise. Addressing the needs of the individual becomes as easy as applying a view to the underlying data structure. This view layer then allows IT and the data warehouse to address the needs of the individual. Views are a transparent layer on top of the underlying data structure and provide a methodology for creating a presentation layer that eases business user access. In Teradata Database, views don't require space and will transition the underlying data model into nearly any presentation with such efficiency that the cost of using views is reduced to simply the management of them.
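As a sketch of that presentation layer, assuming hypothetical normalized sales, product, and store tables: a view can join them into a star-like, user-friendly shape while storing nothing.

    -- A dimensional-style presentation over normalized tables; the view
    -- takes no space, and the physical model underneath stays normalized.
    CREATE VIEW sales_star_v AS
    SELECT p.dept,
           p.category,
           p.sku,
           g.region,
           g.district,
           g.store_id,
           s.sale_date,
           s.sales_amt
    FROM   sales   s
    JOIN   product p ON p.sku      = s.sku
    JOIN   store   g ON g.store_id = s.store_id;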


Impact of the Data Warehousing Evolution

As data warehousing evolves to an active environment where frequent and/or continual data updates are required, where processing becomes event driven rather than report driven, and where responses to changes in the market place are expected with increasing speed, the data model becomes even more critical. The mixed workload of shorter tactical queries with longer strategic queries against the same data creates additional denormalization challenges. Evolving to an active data warehouse eliminates many of the known variables, such as what queries will be submitted and who will submit them, that were used in defining the data models of early data warehouses. To accommodate this evolution, some of Teradata's competitors' technologies are beginning to add functions and features that enable a more normalized model. Tools such as information delivery tools are also preparing for this evolution. As the data warehouse evolves to an active environment, the requirement for fresh data to be available and accessible erodes the time available to maintain denormalized models.

The data model for the data warehouse should be chosen with a clear understanding of the benefits of each approach. The early days of data warehousing tended to depend on a denormalized model. Much of that decision was based on addressing specific business needs rather than the breadth of the business, or on existing technology that limited choices. During those early days, when Teradata professionals recommended a normalized model, it was considered unusual. As data warehousing evolves to being active, and as the need and requirement for businesses to change increase, the idea of a normalized data model for the data warehouse no longer seems so radical. In fact, many of the leading analysts and data warehouse professionals understand the value of a normalized model and are becoming more critical of denormalized models.

Summary

The goal of a data warehouse is to provide the business with a tool that facilitates and enables better business decisions and timely action on those decisions. The robustness of the data infrastructure determines the long-term success of the data warehouse and the ability of the business to harness the information for its own success. The robustness of Teradata Database provides the ability to combine the best of all worlds. The DBA can generate a DBA-friendly normalized data model that facilitates the full integration of data representing the enterprise, following the teachings of Codd, Date, and Inmon. The use of views – a presentation layer that employs the user friendliness of dimensional modeling – enables catering to the needs of the individual, interfacing with the data warehouse for business user ease of access, capitalizing on Codd's guidelines for relational databases, and following the teachings of Kimball. This allows data to be centrally stored; eliminates redundant data, easing data management; and provides a methodology to support future business needs by capitalizing on existing data structures and shortening time to market.

For example, one Teradata customer had their business users request a new application from IT. The applications group said it would take six months just to find the data; once it was found, they would then have to determine the time required to generate the application. The business users left the meeting believing they would be unable to resolve their issues in a timely manner – time that had a direct impact on the success of the business. After determining the data was available in the data warehouse, in a normalized format, the data warehouse group was able to generate a presentation layer that resolved the business needs within a couple of hours. The ability to do this was valuable to the business's bottom line.

The data model for the data warehouse should reflect the requirements of the business, and it should enable the business. Taking a hard line with any specific data model type won't satisfy the business needs. The impact of time generates change within the business, and the data warehouse requires the flexibility to address and yield to those changes. A normalized data model is the correct place to begin.

Debbie Smith is a Data Warehouse Consultant in Global Sales Support. She has 14 years of experience with Teradata systems at a retail customer, where her responsibilities included application development and support, database administration, business power use, and developing, implementing, and managing an information delivery strategy for business end users. Since joining Teradata, Debbie has worked extensively with prospects as well as new and mature data warehouse customers to implement effective tactical and strategic data warehousing initiatives.


Appendix A – Codd's 12 Rules

Rule 1: The Information Rule
All data should be presented to the user in table form.

Rule 2: Guaranteed Access Rule
All data should be accessible without ambiguity. This can be accomplished through a combination of the table name, primary key, and column name.

Rule 3: Systematic Treatment of Null Values
A field should be allowed to remain empty. This involves the support of a null value, which is distinct from an empty string or a number with a value of zero. Of course, this can't apply to primary keys. In addition, most database implementations support the concept of a non-null field constraint that prevents null values in a specific table column.

Rule 4: Dynamic On-Line Catalog Based on the Relational Model
A relational database must provide access to its structure through the same tools that are used to access the data. This is usually accomplished by storing the structure definition within special system tables.

Rule 5: Comprehensive Data Sublanguage Rule
The database must support at least one clearly defined language that includes functionality for data definition, data manipulation, data integrity, and database transaction control. All commercial relational databases use forms of standard SQL (Structured Query Language) as their supported comprehensive language.

Rule 6: View Updating Rule
Data can be presented to the user in different logical combinations, called views. Each view should support the same full range of data manipulation that direct access to a table has available. In practice, providing update and delete access to logical views is difficult and is not fully supported by any current database.

Rule 7: High-level Insert, Update, and Delete
Data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables. This rule states that insert, update, and delete operations should be supported for any retrievable set rather than just for a single row in a single table.

Rule 8: Physical Data Independence
The user is isolated from the physical method of storing and retrieving information from the database. Changes can be made to the underlying architecture (hardware, disk storage methods) without affecting how the user accesses it.

Rule 9: Logical Data Independence
How a user views data should not change when the logical structure (table structure) of the database changes. This rule is particularly difficult to satisfy. Most databases rely on strong ties between the user view of the data and the actual structure of the underlying tables.

Rule 10: Integrity Independence
The database language (like SQL) should support constraints on user input that maintain database integrity. This rule is not fully implemented by most major vendors. At a minimum, all databases do preserve two constraints through SQL: no component of a primary key can have a null value (see Rule 3), and if a foreign key is defined in one table, any value in it must exist as a primary key in another table.

Rule 11: Distribution Independence
A user should be totally unaware of whether or not the database is distributed (whether parts of the database exist in multiple locations). A variety of reasons make this rule difficult to implement.

Rule 12: Nonsubversion Rule
There should be no way to modify the database structure other than through the multiple-row database language (like SQL). Most databases today support administrative tools that allow some direct manipulation of the data structure.

(From: ITWorld.com, 'Codd's 12 Rules – Data Management Strategies', 05-07-2001.)
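A minimal sketch of the two constraints described under Rule 10, with hypothetical tables:

    -- Rule 10's two minimum constraints: a primary key column cannot be
    -- null, and a foreign key value must exist as a primary key elsewhere.
    CREATE TABLE dept (
        dept_id INTEGER NOT NULL PRIMARY KEY
    );

    CREATE TABLE employee (
        emp_id  INTEGER NOT NULL PRIMARY KEY,
        dept_id INTEGER REFERENCES dept (dept_id)
    );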


Appendix B – A Comparison

Active Data Warehousing
Normalized Model: Supports active data warehousing initiatives.
Denormalized (Star) Model: Challenges the ability to be active.

Any Question
Normalized Model: The flexibility of the normalized model provides for any question to be answered.
Denormalized (Star) Model: The simplicity of the denormalized model limits what questions can be answered without changes.

Complex Analysis
Normalized Model: A normalized model supports complex analysis.
Denormalized (Star) Model: The enforced simplicity of a denormalized model is unable to support complex analysis.

Data Duplication
Normalized Model: The rules of normalization require a single version of data.
Denormalized (Star) Model: Data is often duplicated multiple times to satisfy different groups of questions.

Data Granularity
Normalized Model: The normalized model easily stores the lowest level of data.
Denormalized (Star) Model: To maintain the ease of navigation, the denormalized model is typically aggregated along one of the dimension lines.

Data Navigation
Normalized Model: The normalized model is more difficult to navigate. Business users have a difficult time at best in understanding the model, the necessary joins, and sometimes even the relationships between data elements.
Denormalized (Star) Model: The denormalized model is specifically designed for easy navigation by the business user.

Flexibility
Normalized Model: The greater the degree of normalization in the data model, the greater the flexibility.
Denormalized (Star) Model: The denormalized model is meant to be simplistic and is designed for specific requirements. As requirements change, changes to the model are needed.

Maintenance
Normalized Model: Updates to the normalized model are easier because a piece of information is in one place and requires one update. The normalized model is the standard for the transactional environment; continuing that model throughout the data warehouse provides a closer link to and reflection of transactional processes.
Denormalized (Star) Model: Updating the denormalized model requires pre-processing, aggregation, and longer time frames.


Appendix C – Glossary

Attribute
Data that describes an entity.

Cube
A way to visualize how a dimensional model is organized. It is also a type of dimensional model design where the measures and dimensions are stored in a format that is expected by an OLAP tool (many times a proprietary format). One cube may represent one fact table with its contributing dimensions.

Data Mart (logical)
The data is stored in one place, but views are used to present to the end user a portion or derivation of the underlying data for their purposes. Data is not replicated.

Data Mart (physical)
A segment of data that is aggregated or preprocessed to enable specific end-user queries and requirements.

Data Warehouse
A reflection of integrated data that is organized by subject areas and is stored to address cross-functional and multiple end-user requirements and queries.

Dimension
The corporate 'line' along which facts are measured (time, product, geography).

Dimensional Model
A design method that models the data based on a predefined set of business questions to facilitate ease of access and response time speed.

Drill (Across)
Movement of an end-user request across different dimensions.

Drill (Up or Down)
Movement of an end-user request through data along a dimension.

Entity
An identifiable object that is distinguishable from all other objects.

Entity Relationship Model
Shows the enterprise entities (or objects) and the relationships between the entities in that enterprise, without duplication.

ETL
Terminology given to a group of tools that extract data, transform it, and then load it. Sometimes the sequence of processing changes to reflect ELT: extract, load, and then transform.

Fact
A measurement or value (quantity).

Flattened Model
Made up of rows that contain the data elements of all the column headings of a business report, so that nothing has to be calculated to generate portions of that report.

Foreign Key
Columns in a table that reflect primary keys of other entities. Determines the relationship between entities.

Hierarchy
The arrangement or ranking of entities into a graded series.

Intelligent Key
A key made up of values that reflect a business requirement or use.

Natural Key
The data fields whose value ensures uniqueness for each row and will not change during the life of the row.

Normalized Model
A model of entities designed to remove data redundancy; the higher the 'form' (2NF, 3NF, 4NF, etc.), the less redundancy.

OLTP
Online transaction processing. Consists of a set of instructions that serially, mechanically, and repetitively processes many thousands of transactions through a predefined access path that ultimately updates the underlying data store. OLTP transactions are used by a transactional system that is focused on the acquisition of data; they normally have strict service level agreements (SLAs) attached to them requiring sub-second response time. OLTP concentrates on data updates and/or inserts.



Primary Key
Unique identifier for rows in a table.

Schema
Structured framework that, for purposes of this paper, provides and represents relationships between entities.

Service Level Agreement (SLA)
An agreement between IT and the business that pertains to a performance aspect of a process. The agreement could be based on availability, enforcing that data is loaded by a specific timeframe, or on retrieval processing, enforcing a processing time for specific queries or types of queries.

Slowly Changing Dimensions (SCD)
The impact of time on a dimension and that dimension's changes (i.e., customer addresses).

Snowflake Schema
Made up of multiple star schemas, without the duplication of dimension tables, that can be joined together.

Star Schema
An individual fact table surrounded by the dimension tables to which it will join.

Surrogate Key
Artificial key used as a substitute for natural data keys (i.e., customer id).

Appendix D – Endnotes

1. Dr. E. F. Codd, Codd's 12 Rules – Data Management Strategies, ITWorld.com, 05-07-2001.
2. William Inmon, Enterprise Data Warehouse, http://www.billinmon.com/library/articles/introedw.asp.
3. Ralph Kimball and Margy Ross, The Data Warehouse Toolkit, Second Edition, Wiley Computer Publishing, New York, 2002, pp. 11-12.
4. Neil Raden, edited by Ralph Kimball, 'Real Time: Get Real, Part II', Data Warehouse Designer, Intelligent Enterprise, www.intelligententerprise.com, June 30, 2003.
5. Steve Hoberman, Data Modeler's Workbench: Tools and Techniques for Analysis and Design, John Wiley & Sons, New York, 2002.
6. Ralph Kimball and Margy Ross, The Data Warehouse Toolkit, Second Edition, Wiley Computer Publishing, 2002, p. 9.
7. William Inmon, The Problem with Dimensional Modeling, http://www.billinmon.com/library/articles/artdimmd.asp.
8. William Inmon, Star Joins, http://www.billinmon.com/library/articles/starjoin.asp.

Appendix E – References

Ralph Kimball, The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses, John Wiley & Sons, New York, 1996.

Daniel L. Moody (Department of Information Systems, University of Melbourne) and Mark Kortink, From Enterprise Models to Dimensional Models: A Methodology for Data Warehouse and Data Mart Design, Simsion Bowles & Associates, 2000.

William Inmon and Peter Meers, The Dilemma of Change: Managing Changes over Time in the Data Warehouse/DSS Environment, white paper, http://www.billinmon.com/library/whiteprs/Dilemma of Change.pdf, March 2001.

AllFusion is a trademark and ERwin is a registered trademark of Computer Associates International, Inc. Teradata continually improves products as new technologies and components become available. Teradata, therefore, reserves the right to change specifications without prior notice. All features, functions, and operations described herein may not be marketed in all parts of the world. Consult your Teradata representative or Teradata.com for more information. Copyright © 2004-2007 by Teradata Corporation. All Rights Reserved. Produced in U.S.A.
