What Is a Data Warehouse?

What is a Data Warehouse? By Susan L. Miertschin “A data warehouse is a subject oriented, integrated, time variant, nonvolatile, collection of data in support of management's decision making process.” https: //www. bus iness.auc. dk/oe kostyr /file /What_ is_a_ Data_ Ware house.pdf 2 What is a Data Warehouse? “A copy of transaction data specifically structured for query and analysis” 3 “Data Warehousing is the coordination, architected, and periodic copying of data from various sources, both inside and outside the enterprise, into an environment optimized for analytical and informational processing” ‐ Alan Simon Data Warehousing for Dummies 4 Business Intelligence (BI) • “…implies thinking abstractly about the organization, reasoning about the business, organizing large quantities of information about the business environment.” p. 6 in Giovinazzo textbook • Purpose of BI is to define and execute a strategy 5 Strategic Thinking • Business strategist – Always looking forward to see how the company can meet the objectives reflected in the mission statement • Successful companies – Do more than just react to the day‐to‐day environment – Understand the past – Are able to predict and adapt to the future 6 Business Intelligence Loop Business Intelligence Figure 1‐1 p. 2 Giovinazzo • Encompasses entire loop shown Business Strategist • Data Storage + ETC = OLAP Data Mining Reports Data Warehouse Data Storage • Data WhWarehouse + Tools (yellow) = Extraction,Transformation, Cleaning DiiDecision Support CRM Accounting Finance HR System 7 The Data Warehouse Decision Support Systems Central Repository Metadata Dependent Data DtData Mar t EtExtrac tion DtData Log Administration Cleansing/Tranformation External Extraction Source Extraction Store Independent Data Mart Operational Environment Figure 1-2 p. 9 Giovinazzo 8 Data Path • The path to get data Decision Support Systems from the operational environment to the Central Repository Metadata Dependent Data business strategist is DtData Mar t EtExtrac tion DtData Log Administration complex • There is much more Cleansing/Tranformation External Extraction to a data warehouse Source Extraction Store than the Central Independent Data Mart Repository Operational Environment 9 Operational Environment • Operational environment runs Cleansing/Tranformation day to day activities Extraction Extraction Store of the organization • Systems contain Independent Data Mart Operational Environment “raw” data – transactional data • DtData dibdescribes the current state of the organitiization 10 Independent Data Mart • Data Mart focuses on one subject area Cleansing/Tranformation within the Extraction Extraction Store organization • Data warehouse Independent Data Mart Operational Environment focuses on the entire organization 11 Extraction • Extraction engine retrieves/receives data from the Cleansing/Tranformation operational External Extraction Source Extraction Store environment • Data from other external sources may also be collected during extraction 12 Extraction Store • Holding area for the collected data until it can be cleaned and Cleansing/Tranformation transformed into the External Extraction Source Extraction Store correct format 13 Transformation/Cleansing • Scrubbing = data transformation + cleansing Cleansing/Tranformation • Transformation = External Extraction Source Extraction Store converting data to a common format • Cleansing = removing errors from data 14 The Extraction Log • The extraction log records success/failure of Central Repository Metadata extraction process Dependent Data Data Mart Extraction Data Log Administration steps (+ more) • The log is part of the Metadata • UdUsed to verify quality of data placed in the warehouse 15 Central Repository • Cornerstone of the data warehouse architecture Central Repository • Stores all the data Metadata Dependent Data Data Mart Extraction Data and metadata for the Log Administration data warehouse 16 Dependent Data Mart • Different from an independent data mart Central Repository • Dependent data mart Metadata Dependent Data Data Mart Extraction Data relies on the data Log Administration warehouse as the source of its data 17 Business Intelligence Infrastructure Decision Support Systems Central Repository Metadata Dependent Data DtData Mar t EtExtrac tion DtData Log Administration Cleansing/Tranformation External Extraction Source Extraction Store Independent Data Mart Operational Environment 18 Subject Orientation of DW Data Warehouse Operational Database • Subject oriented • Focuses on day‐to‐ • Focuses on the what day transactions things drive • Normalized and operational optimized for this transactions purpose 19 Data Warehouse vs. Operational Db Data Warehouse Operational Database • Gathers distributed • Distributed across data together into multiple tables one place within an application • Facilitates analysis • Distributed across processes multiple applications 20 Integrating Transactional Data into DW • Most time‐ consuming and problematic process Cleansing/Tranformation • Two steps External Extraction Source Extraction Store – Data transformation – DtData Cleans ing 21 Data Cleansing • Remove errors from data extracted from the operation environment • Critical • What should be done with data that contains errors? – Send the data back to be fixed at the operational level and resubmitted – Fix the data and inform the operational system of the errors 22 Data Transformation • Operational environment consists of numerous applications and databases • Data definitions will not be consistent • Must be consistent format for DW • Four issues to address – Description – Encoding – Units of Measure – Format 23 Description • Same things may be described differently across systems • Map each different description into a single description • Example: customer, client, user 24 Encoding • Nominal scale: number or letter assigned as a label, a category name – ordering is arbitrary • ElExample: – R = Red Red = Red 36 = Red – B = Blue Blue = Blue 45 = Blue 25 Units of Measure • Measurement system • Example: must be common – Values in metric vs. • Precision can cause English units problems • Example: – 1/3 = .333 = .3333333333333 26 Format • Different operational systems store data in different formats • Example: – SS#: 999999999 • Char(9) • Int 27 Is Data a “Political” Issue? • Can be • OtilOperational lllevel didesigners work hdhard to make the data match their needs • Arguments can arise over whether customer name should be 30 characters or 32 characters long 28 DW Contains a Snapshot • Once data is placed in the central repository – it becomes read‐only • Time becomes an important dimension for the data • DW: A series of organizational snapshots over time 29 Decision Support Systems (DSS) • Data placed in the data warehouse must be easy to access for business strategists – Timely – Support their mission • Three common DSS tools – Reports – OLAP – Data Mining 30 Reports • One of the most basic DSS tools • PtPresent summary ifinforma tion • Reporting tool should support – Rapid development – Easy maintenance – Easy distribution – Internet enabled 31 On‐Line Analytical Processing (OLAP) • OLAP environment allows business strategist to interact directly with data • OLAP tool should support – Muliltip le dimens iona l presentation of data – Rotation {Data Cube} – Drill‐down / Roll‐up – “What if” analysis 32 Data Mining • “ Data Mining is defined as a process of identifying hidden patterns and relationships within data.” ‐ Robert Groth • BsinessBusiness strategists use OLAP to help them find answers to their questions • Data mining supplies answers without knowing the questions (sometimes) 33 In Summary … • DW is the heart of BI • DW is more than jtjust an archive of operational data • Data must be formed according to business needs for strategic information • Time becomes a dimension of the data • DSS tools used to analyze the data 34 What is a Data Warehouse? By Susan L. Miertschin.

What Is a Data Warehouse?

Informal Data Transformation Considered Harmful

Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation Jie Song George Alter H

POLITECNICO DI TORINO Repository ISTITUZIONALE

Data Warehousing on AWS

Master Data Simplified

Lineage Tracing for General Data Warehouse Transformations

Doppiodb 2.0: Hardware Techniques for Improved Integration of Machine Learning Into Databases

Extract, Transform, Load | ETL Development

Rethinking Software Network Data Planes in the Era of Microservices

Doppiodb 2.0: Hardware Techniques for Improved Integration of Machine Learning Into Databases

Automating the Capture of Data Transformation Metadata

Biological Data Transformation in Pathway Simulation∗