Beyond the Data Model: Designing the Data Warehouse

From a three-part series
By Josh Jones and Eric Johnson
CA ERwin

TABLE OF CONTENTS
INTRODUCTION
DATA WAREHOUSE DESIGN
MODELING A DATA WAREHOUSE
Data Warehouse Elements
Star Schema
Snowflake Schema
Building the Model
EXTRACT, TRANSFORM, AND LOAD
Extract
Transform
Load
Metadata
SUMMARY

INTRODUCTION

Without a doubt, one of the most important aspects of data storage and manipulation is the use of data for critical decision making. While companies have been searching their stored data for decades, it's only really in the last few years that advanced data mining and data warehousing techniques have become a focus for large businesses. Data warehousing is particularly valuable for large enterprises that have amassed a significant amount of historical data such as sales figures, orders, production output, etc. Now more than ever, it is critical to be able to build scalable, accurate data warehouse solutions that can help a business move forward successfully.

Database modeling goes beyond online transaction processing (OLTP) models for traditional relational databases and extends into the world of data warehousing. In addition to requiring that a schema be designed, data warehouses have several other components that are equally important. When working with data warehouses you also have to take into account metadata, and how the data is to be placed into the warehouse from one or more sources. This whitepaper will cover data warehouse design, working with metadata, and building ETL processes, and will offer some guidelines for the overall design of your data warehouse solution.

DATA WAREHOUSE DESIGN

Put simply, a data warehouse is a repository of data. Typically, this is historical data about the business that has been conducted by a company. This may sound like a database, and at the most basic definition, it is. However, the generally accepted difference is that the term database actually refers to a relational database, and often more specifically an online transaction processing (OLTP) relational database. OLTP databases are designed to perform well under transactional loads, often supporting applications that are highly focused in nature, such as purchasing interfaces. Conversely, a data warehouse is built for online analytical processing (OLAP), to meet the needs of users who are responsible for analyzing the data and making decisions based on it.

It is worth noting that there is not a strictly defined, unified standard for building data warehouses. There are at least two very well known methodologies behind designing a data warehouse: the "dimensional" model and the "normalized" model. Each of these methods works, and each has its pros and cons. While we will be using the dimensional approach, it is worth noting the basics of each.

The normalized model is based on the normalization models defined by E.F. Codd. Specifically, the idea is to store the data in third normal form (3NF), grouping the data by topic, i.e. customers, orders, products. This allows the data warehouse to be updated with new types of data very easily, because you can add new topics without affecting the existing data. However, this method can be cumbersome for non-technical users to perform ad-hoc queries against, as they must have an understanding of how the data is related. Additionally, reporting-style queries may not perform well because of the number of tables involved in each query.

In a nutshell, the dimensional model describes a data warehouse that has been built from the bottom up, gathering transactional data into collections of "facts" and "dimensions". The facts are generally the numeric data (think dollars, inventory counts, etc.), and the dimensions are the bits of information that put the numbers, or facts, into context (think salesperson names, regions, etc.). This method allows for very efficient queries from the standpoint of non-technical users, because the data is grouped in a logical way, and it performs quite well since similar types of data are stored together. However, maintaining a dimensional data warehouse requires some effort, because adding new types of data may require that the entire warehouse be updated, or reprocessed, in order to incorporate the new types of data.

Finally, it is important to understand where the data for a data warehouse comes from. Obviously, a large portion of it (in some cases, all of it) will come from the operational databases being used by the enterprise. Just remember, in large enterprises, this may mean data could be gathered from databases on mainframes as well as distributed systems. However, there may be data stored in non-relational sources, such as file shares and even email servers. Because of this, the process of designing a data warehouse also includes the task of designing the processes that bring data into the data warehouse. The process of getting that data is often referred to as Extract, Transform, and Load (ETL), reflecting the three major operations that process raw data into the data warehouse.
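To make the three ETL steps concrete, the sketch below wires them together in Python against in-memory SQLite databases. It is only an illustration: the source table (orders), the warehouse table (fact_sales), and all of the column names are assumptions chosen for this example, not structures taken from the whitepaper.

```python
# Minimal ETL sketch. The "orders" source table and "fact_sales" warehouse table
# are hypothetical examples, not the whitepaper's actual model.
import sqlite3

def extract(source_conn):
    # Extract: pull raw transactional rows from the operational (OLTP) source.
    return source_conn.execute(
        "SELECT order_id, order_date, customer_id, amount FROM orders"
    ).fetchall()

def transform(rows):
    # Transform: drop unusable records and derive a date key for the warehouse.
    cleaned = []
    for order_id, order_date, customer_id, amount in rows:
        if customer_id is None or amount is None:
            continue  # skip rows that cannot be loaded meaningfully
        date_key = int(order_date.replace("-", ""))  # '2007-09-30' -> 20070930
        cleaned.append((order_id, date_key, customer_id, float(amount)))
    return cleaned

def load(warehouse_conn, facts):
    # Load: insert the transformed rows into the warehouse fact table.
    warehouse_conn.executemany(
        "INSERT INTO fact_sales (order_id, date_key, customer_key, amount) "
        "VALUES (?, ?, ?, ?)",
        facts,
    )
    warehouse_conn.commit()

if __name__ == "__main__":
    src = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, "
                "customer_id INTEGER, amount REAL)")
    src.execute("INSERT INTO orders VALUES (1, '2007-09-30', 42, 19.99)")

    dw = sqlite3.connect(":memory:")
    dw.execute("CREATE TABLE fact_sales (order_id INTEGER, date_key INTEGER, "
               "customer_key INTEGER, amount REAL)")

    load(dw, transform(extract(src)))
    print(dw.execute("SELECT * FROM fact_sales").fetchall())
```

Real ETL processes are, of course, far more involved; extraction may span mainframe, distributed, and non-relational sources, and the transform step typically handles surrogate key lookups, deduplication, and type conversion. The separation into these three operations, however, stays the same.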
MODELING A DATA WAREHOUSE

Just as in relational data modeling, building a data model for a data warehouse requires that a significant amount of work be done prior to the start of the project. This is because a data warehouse is a very structured, specific deliverable. You must account for what the initial data will be, and what data will be incorporated in the future. It is necessary to approach the design of a data warehouse much like an application design project; in fact, oftentimes a data warehouse will be coupled with a front-end reporting system, whether it is a web-based application or a desktop client. As such, you'll follow standard design practices, such as gathering requirements, building documentation and diagrams, and generating a significant amount of data about the data, or metadata, that will help define the structure and scope of the data warehouse. These activities are outside the scope of this paper, because preparing for a data warehouse requires that these tasks be completed prior to the beginning of the technical process described here.

DATA WAREHOUSE ELEMENTS

The basic pieces of a dimensional data warehouse model are tables and relationships, laid out in a schema. Specifically, we'll have fact tables and dimension tables, and they will be related to one another in a manner that facilitates understanding of the relationships between those tables. This layout is called a schema, and there are two primary schema types used in a dimensional data warehouse: snowflake and star.

STAR SCHEMA

One of the simplest schemas, star schemas are named as such because they usually involve a single fact table related to a single layer of dimension tables. Each dimension table has a simple primary key, as it is responsible for a single type of data. The fact table has a compound primary key, as it is the collection of the primary keys from the dimension tables. Figure 1 shows a very simple star schema outlining sales data.

Figure 1. Simple star schema.

The simplicity of a star schema is that for each fact table, you will only ever have to join one level of dimension tables to get at the data. Unfortunately, the drawback of a star schema is that for intricate systems, it can become quite complex to read and query when it is necessary to retrieve a large amount of fact data.

SNOWFLAKE SCHEMA

The snowflake schema is very similar to the star schema, with the exception of the relationships between dimensions and facts. A snowflake schema still has fact tables at its center, with relationships to dimension tables. However, if there are dimensions with large numbers of attributes, it might be necessary to break the dimensions down into subdimension tables. The end result is a schema that visually looks like a snowflake, because of the layers of dimension tables. Figure 2 shows a slightly expanded view of the sales data in a data warehouse, employing a snowflake schema.

Figure 2. Snowflake schema.

While this is not an exhaustive look at a data model, it shows the basics of a snowflake schema. There is a central fact table, Sales, and two layers of dimension tables, effectively normalizing the data. The advantage here is that, logically, we can see how the data is related very easily. We can drill down from the sales, to the region, and look up the country code. In a large system, a snowflake can also facilitate data mining queries that require complex comparisons across different dimensions. For example, if we need to see all sales from the third quarter of 2007 for a specific country, filtered by product manufacturer, we're literally joining together seven tables. If this data had been stored in a star, we'd be joining four tables. However, from a performance standpoint, this schema may slow down larger queries that seek to simply retrieve data from a few dimensions. Just remember, because of the differences in performance and usability, it is vital to structure your schema according to your specific needs.
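To spell out that seven-table versus four-table comparison, the sketch below phrases the same question (third-quarter 2007 sales for one country, filtered by manufacturer) against a star layout and against a snowflake layout. The table and column names are assumptions standing in for Figures 1 and 2, which are not reproduced in this text; they are not the authors' actual model.

```python
# Hedged sketch of the join-depth difference between star and snowflake layouts.
# All table and column names are illustrative assumptions.

# Star schema: the fact table joins directly to one layer of dimensions
# (four tables in total), so country and manufacturer live as denormalized
# attributes inside dim_customer and dim_product.
STAR_QUERY = """
SELECT SUM(f.amount) AS total_sales
FROM fact_sales AS f
JOIN dim_date     AS d ON f.date_key     = d.date_key
JOIN dim_customer AS c ON f.customer_key = c.customer_key
JOIN dim_product  AS p ON f.product_key  = p.product_key
WHERE d.year = 2007 AND d.quarter = 3
  AND c.country_code = 'US'
  AND p.manufacturer_name = 'Acme';
"""

# Snowflake schema: country and manufacturer are normalized into sub-dimension
# tables, so answering the same question requires seven tables.
SNOWFLAKE_QUERY = """
SELECT SUM(f.amount) AS total_sales
FROM fact_sales AS f
JOIN dim_date         AS d  ON f.date_key         = d.date_key
JOIN dim_customer     AS c  ON f.customer_key     = c.customer_key
JOIN dim_region       AS r  ON c.region_key       = r.region_key
JOIN dim_country      AS co ON r.country_key      = co.country_key
JOIN dim_product      AS p  ON f.product_key      = p.product_key
JOIN dim_manufacturer AS m  ON p.manufacturer_key = m.manufacturer_key
WHERE d.year = 2007 AND d.quarter = 3
  AND co.country_code = 'US'
  AND m.manufacturer_name = 'Acme';
"""
```

The extra joins buy a cleaner, more normalized set of dimensions, at the cost of longer queries and, potentially, slower retrieval when only a few dimensions are needed.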
BUILDING THE MODEL

Let's now assume that we have a specific project, and have gathered the necessary requirements. The project will be fairly simple; we're building a data warehouse for a DVD retailer that wants to be able to figure out what products are being ordered and what their customer demographics are. This is a readily recognizable need; almost every business is either selling a product or a service, and while the details vary, the basics stay the same.

You can see that it is a standard, normalized …
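Starting from a standard, normalized OLTP source like the one just mentioned, a first cut at a star schema for the DVD retailer might look something like the sketch below. Every table and column name here is an assumption chosen to illustrate the idea of one fact table surrounded by a single layer of dimensions; it is not the model the authors go on to build.

```python
# Hedged sketch of a possible star schema for the DVD retailer example.
# All table and column names are illustrative assumptions, not the paper's model.
import sqlite3

DDL = """
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    name         TEXT,
    gender       TEXT,
    age_band     TEXT,   -- demographic attributes the retailer wants to analyze
    postal_code  TEXT
);

CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    title        TEXT,
    genre        TEXT,
    studio       TEXT
);

CREATE TABLE dim_date (
    date_key     INTEGER PRIMARY KEY,  -- e.g. 20070930
    year         INTEGER,
    quarter      INTEGER,
    month        INTEGER,
    day          INTEGER
);

-- Grain: one row per customer, product, and day. The compound primary key is
-- built from the dimension keys, as described for star schemas above.
CREATE TABLE fact_order (
    customer_key INTEGER REFERENCES dim_customer (customer_key),
    product_key  INTEGER REFERENCES dim_product (product_key),
    date_key     INTEGER REFERENCES dim_date (date_key),
    quantity     INTEGER,
    amount       REAL,
    PRIMARY KEY (customer_key, product_key, date_key)
);
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    print("Star schema created:", [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")])
```

With a layout like this in place, the demographic questions the retailer cares about become single-level joins from fact_order out to dim_customer and dim_product.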