Data Warehousing and OLAP • Large Retailer • Several Databases: Inventory, Personnel, Sales Etc

Home , Data mart

Motivation Data Warehousing and OLAP • Large retailer • Several databases: inventory, personnel, sales etc. • High volume of updates INFO 330 • Management requirements • Efficient support for decision making • Comprehensive view of all aspects of an enterprise • Trends, summaries, analysis of historical data • Information from several departments Slides courtesy of Mirek Riedewald • Why not using operational systems?

Motivation (contd.) Outline

• Integrate data from diverse sources • Overview of data warehousing • Common schema • Semantic mismatches (currency, naming, normalization, databases structure) • Clean data (missing values, inconsistencies) • Accumulate historical data • Not relevant for operational databases • Efficient analysis • Complex queries versus frequent updates

Terminology From OLTP to the Data Warehouse

• OLTP (Online Transaction Processing) • Traditionally, database systems stored data • DSS (Decision Support System) relevant to current business processes • Old data was archived or purged • DW (Data Warehouse) • A database stores the current snapshot of the • OLAP (Online Analytical Processing) business: • Current customers with current addresses • Current inventory • Current orders • Current account balance

1 The Data Warehouse OLTP Architecture

• The data warehouse is a historical collection of all relevant data for analysis purposes Clients • Examples: OLTP • Current customers versus all customers DBMSs • Current orders versus history of all orders Cash Register • Current inventory versus history of all shipments • Thus the data warehouse stores information Product Purchase that might be useless for the operational part of a business Inventory Update

DW Architecture Building a Data Warehouse

Information Sources Data Warehouse OLAP Servers • Data warehouse is a collection of data marts Server • Data marts contain one dimensional star MOLAP OLTP Analysis DBMSs schema that captures one business aspect • Notes:

Query/Reporting • It is crucial to centralize the logical definition and format of dimensions and facts (political challenge; assign a dimension authority to each dimension). Other Data Extract Everything else is a distributed effort throughout the Clean Sources Data Mining company (technical challenge) Transform Data Marts Aggregate • Each data mart will have its own fact table, but Load dimension tables are duplicated over several data Update ROLAP marts

OLTP Versus Data Warehousing Three Complementary Trends

OLTP Data Warehouse • Data Warehousing: Consolidate data from Typical user Clerical worker Management many sources in one large repository System usage Regular business Analysis • Loading, periodic synchronization of replicas Workload Read/Write Read only • Semantic integration Types of queries Predefined Ad-hoc • OLAP: Unit of interaction Transaction Query • Complex SQL queries and views Level of isolation High Low • Queries based on spreadsheet-style operations and required “multidimensional” view of data No of records accessed <100 >1,000,000 • Interactive and “online” queries No of concurrent users Thousands Hundreds • Data Mining: Exploratory search for interesting Focus Data in and out Information out trends and anomalies.

2 EXTERNAL DATA SOURCES Data Warehousing Warehousing Issues

• Integrated data spanning • Semantic Integration: When getting data from EXTRACT long time periods, often TRANSFORM multiple sources, must eliminate mismatches, augmented with LOAD e.g., different currencies, schemas REFRESH summary information • Heterogeneous Sources: Must access data from • Several gigabytes to a variety of source formats and repositories terabytes common • Replication capabilities can be exploited here Metadata DATA • Interactive response WAREHOUSE • Load, Refresh, Purge: Must load data, times expected for Repository periodically refresh it, and purge too-old data complex queries; ad-hoc updates uncommon SUPPORTS • Metadata Management: Must keep track of source, loading time, and other information for all data in the warehouse DATA MINING OLAP

Outline Dimensional Data Modeling

• Overview of data warehousing • Recall: The relational model • Dimensional Modeling The dimensional data model: • Relational model with two different types of attributes and tables • Attribute level: Facts (numerical, additive, dependent) versus dimensions (descriptive, independent) • Table level: Fact tables (large tables with facts and foreign keys to dimensions) versus dimension tables (small tables with dimensions)

Dimensional Modeling (contd.) OLTP versus Data Warehouse

• Fact (attribute): • Dimension (attribute): OLTP Data warehouse Measures performance Specifies a fact • Regular relational • Dimensional model of a business schema • Fact table in BCNF • Example dimensions: • Normalized • Dimension tables not • Example facts: • Product, customer • Updates overwrite normalized: few updates, mostly queries • Sales, budget, profit, data, sales person, previous values: One • Updates add new version: inventory store instance of a customer with a unique Several instances of the • Example fact table: • Example dimension customerID same customer (with different data, e.g., • Transactions (timekey, table: • Queries return address) storekey, pkey, promkey, • Sales(productid, information about the ckey, units, price) storeid, …) current state of affairs • Queries return aggregate information about historical facts

3 Example: Dimensional Data Modeling Another View: Star Schema

ckey cid name byear state Customers: Transactions Dimension Table Time Store (timekey, timekey Day Month Year Time: storekey, Dim. Table pkey, promkey, Transactions: ckey, ckey timekey pkey #units $price units, Fact Table price) pkey pid pname price category Customers Products: Promotions Products Dim. Table

Fact versus Dimension Tables Grain

• Fact tables are usually very large; they • The grain defines the level of resolution of can grow to several hundred GB and TB a single record in the fact table. • Dimension tables are usually smaller • Example fact tables: (although can grow large, e.g., Customers • Transactions (timekey, storekey, pkey, table), but they have many fields promkey, ckey, units, price); grain is • Queries over fact tables usually involve individual item many records • Transactions (timekey, storekey, ckey, units, price); grain is one market basket

Typical Queries Example Query

• SQL: • “Break down sales by year and category for the SELECT D1.d1, …, Dk.dk, agg1(F.f1) last two years; show only categories with more FROM Dimension D1, …, than $1M in sales.” Dimension Dk, Fact F • SQL: WHERE D1.key = F.key1 AND … AND SELECT T.year, P.category, SUM(X.units * X.price) Dk.keyk = F.keyk AND FROM Time T, Products P, Transactions X otherPredicates WHERE T.year = 1999 OR T.year = 2000 GROUP BY D1.d1, …, Dk.dk GROUP BY T.year, P.category HAVING groupPredicates HAVING SUM(X.units * X.price) > 1000000

• This query is called a “Star Join”.

4 Outline Online Analytical Processing (OLAP)

• Overview of data warehousing • Ad hoc complex queries • Dimensional Modeling • Simple, but intuitive and powerful query interface • Online Analytical Processing • Spreadsheet influenced analysis process • Specialized query operators for multidimensional analysis • Roll-up and drill-down • Slice and dice • Pivoting

Visual Intuition: Cube Multidimensional Data Analysis

roll-up to category Data warehouse: Customer Data Mart roll-up to state SH Transactions(ckey, timekey, pkey, units, price) SF LA Customers(ckey, cid, name, byear, city, state, country) Product1 20 Time(tkey, day, month, quarter, year) Product2 30 Product3 20 Products(pkey, pname, price, pid, category, industry) Product4 15 Hierarchies on dimensions: Product Product5 10 Year Product6 50 roll-up to week Industry Country M T W Th F S S Quarter Time Category State Month Week 50 Units of Product6 sold on Monday in LA Product City Day

Multidimensional Data Analysis Corresponding Query in SQL

NY CA WI • SELECT SUM(units) FROM Transactions T, Products P, Customers C Industry1 $1000 $2000 $1000 WHERE T.pkey = P.pkey AND T.ckey = C.ckey Industry2 $500 $1000 $500 AND C.country = “USA” GROUP BY P.industry, C.state Industry3 $3000 $3000 $3000 • We think that Industry3 in CA is interesting. Year Year Industry Country=“USA” Industry Country=“USA” Quarter Quarter Category State Category State Month Week Month Week Product City Day Product City Day

5 Slice and Drill-Down Corresponding Query in SQL

San San Jose Los Angeles • SELECT SUM(units) Francisco FROM Transactions T, Products P, Customers C Category1 $300 $300 $400 WHERE T.pkey = P.pkey AND T.ckey = C.ckey AND P.industry = “Industry3” AND C.state = “CA” Category2 $300 $300 $400 GROUP BY P.category, C.city Category3 $100 $800 $100 • We think that Category3 is interesting.

Year Year Industry=“Industry3” Country Industry=“Industry3” Country Quarter Quarter Category State=“CA” Category State=“CA” Month Week Month Week Product City Day Product City Day

Slice and Drill-Down Corresponding Query in SQL

San San Jose Los Angeles • SELECT SUM(units) Francisco FROM Transactions T, Products P, Customers C Product1 $20 $160 $20 WHERE T.pkey = P.pkey AND T.ckey = C.ckey AND C.state = “CA” AND P.category = “Category3” Product2 $20 $160 $20 GROUP BY P.product, C.city Product3 $60 $480 $60 • Nothing new in this view of the data.

Year Year Industry Country Industry Country Quarter Quarter State=“CA” State=“CA” Category=“Category3” Month Week Category=“Category3” Month Week Product City Day Product City Day

Pivot To (City, Year) Corresponding Query in SQL

San San Jose Los Angeles • SELECT SUM(units) Francisco FROM Transactions T, Products P, Customers C 1997 $20 $100 $20 WHERE T.pkey = P.pkey AND T.ckey = C.ckey AND C.state = “CA” AND P.category = “Category3” 1998 $20 $600 $20 GROUP BY C.city, T.year 1999 $60 $100 $60

Year Year Industry Country Industry Country Quarter Quarter State=“CA” State=“CA” Category=“Category3” Month Week Category=“Category3” Month Week Product City Day Product City Day

6 Multidimensional Data Analysis The CUBE Operator

Set of data manipulation operators • Generalizing GROUP BY and aggregation • Roll-up: Go up one step in a dimension hierarchy • If there are k dimensions, we have 2k possible (e.g., month -> quarter) SQL GROUP BY queries that can be generated • Drill-down: Go down one step in a dimension through pivoting on a subset of dimensions. hierarchy (e.g., quarter -> month) • CUBE pid, locid, timeid BY SUM Sales • Slice: Select a value of a dimension (e.g., all categories -> only Category3) • Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid}; each roll-up • Dice: Select range of values of a dimension (e.g., corresponds to an SQL query of the form: Year > 1999) • Pivot: Select new dimensions to visualize the data SELECT SUM(S.sales) (e.g., pivot to Time(quarter) and Customer(state)) Lots of recent work on optimizing the CUBE operator! FROM Sales S GROUP BY grouping-list

Example Multidimensional View Of CUBE

OLAP Server Architectures Summary: Multidimensional Analysis

• Relational OLAP (ROLAP) • Spreadsheet style data analysis • Relational DBMS stores data mart (star schema) • OLAP middleware: • Roll-up, drill-down, slice, dice, and pivot • Aggregation and navigation logic your way to interesting cells in the CUBE • Optimized for DBMS in the background, but slow and complex • Mainstream technology • Multidimensional OLAP (MOLAP) • Established enterprises already have OLAP • Specialized array-based storage structure installations • Desktop OLAP (DOLAP) • Performs OLAP directly at your PC • Hybrids and Application OLAP

7 Summary

• Decision support is a rapidly growing subarea of databases • Involves the creation of large, consolidated data repositories called data warehouses • Warehouses are exploited using sophisticated analysis techniques: complex SQL queries and OLAP “multidimensional” queries (influenced by both SQL and spreadsheets) • New techniques for database design, indexing, view maintenance, and interactive querying need to be supported