Chapter 4: Data Warehousing and On-Line Analytical Processing

Erwin M. Bakker & Stefan Manegold
Databases and Data Mining 2018
https://homepages.cwi.nl/~manegold/DBDM/
http://liacs.leidenuniv.nl/~bakkerem2/dbdm/

Outline
- Data Warehouse: Basic Concepts
- Data Warehouse Modeling: Data Cube and OLAP
- Data Warehouse Design and Usage
- Data Warehouse Implementation
- Summary

What is a Data Warehouse?
- Defined in many different ways, but not rigorously
  - A decision support database that is maintained separately from the organization's operational database
  - Supports information processing by providing a solid platform of consolidated, historical data for analysis
- "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
- Data warehousing: the process of constructing and using data warehouses

Data Warehouse: Subject-Oriented
- Organized around major subjects, such as customer, product, sales
- Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
- Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Data Warehouse: Integrated
- Constructed by integrating multiple, heterogeneous data sources
  - relational databases, flat files, on-line transaction records
- Data cleaning and data integration techniques are applied
  - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    - Ex. hotel price: differences in currency, tax, breakfast covered, and parking
  - When data is moved to the warehouse, it is converted

Data Warehouse: Time Variant
- The time horizon for the data warehouse is significantly longer than that of operational systems
  - Operational database: current value data
  - Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
- Every key structure in the data warehouse
  - Contains an element of time, explicitly or implicitly
  - But the key of operational data may or may not contain a "time element"

Data Warehouse: Nonvolatile
- Independence
  - A physically separate store of data transformed from the operational environment
- Static: operational update of data does not occur in the data warehouse environment
  - Does not require transaction processing, recovery, and concurrency control mechanisms
  - Requires only two operations in data accessing: initial loading of data and access of data

OLTP vs. OLAP
- OLTP: Online transactional processing
  - DBMS operations
  - Query and transactional processing
- OLAP: Online analytical processing
  - Data warehouse operations
  - Drilling, slicing, dicing, etc.

                    OLTP                                                      OLAP
users               clerk, IT professional                                    knowledge worker
function            day to day operations                                     decision support
DB design           application-oriented                                      subject-oriented
data                current, up-to-date, detailed, flat relational, isolated  historical, summarized, multidimensional, integrated, consolidated
usage               repetitive                                                ad-hoc
access              read/write, index/hash on prim. key                       lots of scans
unit of work        short, simple transaction                                 complex query
# records accessed  tens                                                      millions
# users             thousands                                                 hundreds
DB size             100MB-GB                                                  100GB-TB
metric              transaction throughput                                    query throughput, response

Why a Separate Data Warehouse?
- High performance for both systems
  - DBMS: tuned for OLTP: access methods, indexing, concurrency control, recovery
  - Warehouse: tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
- Different functions and different data:
  - missing data: decision support requires historical data which operational DBs do not typically maintain
  - data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  - data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
- Note: There are more and more systems which perform OLAP analysis directly on relational databases

Data Warehouse: A Multi-Tiered Architecture
- Top Tier: Front-End Tools
- Middle Tier: OLAP Server
- Bottom Tier: Data Warehouse Server
- Data

Three Data Warehouse Models
- Enterprise warehouse
  - Collects all of the information about subjects spanning the entire organization
- Data Mart
  - A subset of corporate-wide data that is of value to a specific group of users
  - Its scope is confined to specific, selected groups, such as a marketing data mart
  - Independent vs. dependent (directly from warehouse) data mart
- Virtual warehouse
  - A set of views over operational databases
  - Only some of the possible summary views may be materialized

Extraction, Transformation, and Loading (ETL)
- Data extraction
  - get data from multiple, heterogeneous, and external sources
- Data cleaning
  - detect errors in the data and rectify them when possible
- Data transformation
  - convert data from legacy or host format to warehouse format
- Load
  - sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
- Refresh
  - propagate the updates from the data sources to the warehouse

From Tables and Spreadsheets to Data Cubes
- A data warehouse is based on a multidimensional data model which views data in the form of a data cube
- A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  - Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  - Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
- Data cube: a lattice of cuboids
  - In data warehousing literature, an n-D base cube is called a base cuboid
  - The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid
  - The lattice of cuboids forms a data cube
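
The idea that a data cube is a lattice of cuboids can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fact rows, the dimension names, and the helper functions `cuboids` and `aggregate` are hypothetical and do not come from the slides. Each cuboid is one choice of group-by dimensions, from the apex (no dimensions) to the base (all dimensions).

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (time, item, location, dollars_sold).
FACTS = [
    ("Q1", "TV", "Vancouver", 100.0),
    ("Q1", "PC", "Vancouver", 250.0),
    ("Q2", "TV", "Toronto", 120.0),
    ("Q2", "TV", "Vancouver", 80.0),
]
DIMENSIONS = ("time", "item", "location")


def cuboids(dimensions):
    """Yield every subset of the dimensions, from the apex () to the base."""
    for k in range(len(dimensions) + 1):
        yield from combinations(dimensions, k)


def aggregate(facts, dimensions, group_by):
    """Sum the measure (last column) grouped by the chosen dimensions."""
    positions = [dimensions.index(d) for d in group_by]
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]
    return dict(totals)


# The full cube: one aggregated table per cuboid in the lattice.
lattice = list(cuboids(DIMENSIONS))
cube = {group_by: aggregate(FACTS, DIMENSIONS, group_by) for group_by in lattice}

print(len(lattice), "cuboids")           # n dimensions give 2**n cuboids
print("apex cuboid:", cube[()])          # highest level of summarization
print("1-D cuboid (item):", cube[("item",)])
```

With n dimensions and no concept hierarchies, the lattice has 2^n cuboids, which is why warehouses usually materialize only some of them.
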

Data Cube: A Lattice of Cuboids
- 0-D (apex) cuboid: all
- 1-D cuboids: time; item; location; supplier
- 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
- 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
- 4-D (base) cuboid: time, item, location, supplier

Conceptual Modeling of Data Warehouses
- Modeling data warehouses: dimensions & measures
  - Star schema: a fact table in the middle connected to a set of dimension tables
  - Snowflake schema: a refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  - Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Star Schema: An Example
- Sales Fact Table: time_key, item_key, branch_key, location_key
  - Measures: units_sold, dollars_sold, avg_sales
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city, state_or_province, country

Snowflake Schema: An Example
- Sales Fact Table: time_key, item_key, branch_key, location_key
  - Measures: units_sold, dollars_sold, avg_sales
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_key
- supplier: supplier_key, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city_key
- city: city_key, city, state_or_province, country

Fact Constellation: An Example
- Sales Fact Table: time_key, item_key, branch_key, location_key
  - Measures: units_sold, dollars_sold, avg_sales
- Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
  - Measures: dollars_cost, units_shipped
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city, province_or_state, country
- shipper: shipper_key, shipper_name, location_key, shipper_type

A Concept Hierarchy for a Dimension (location)
- all: all
- region: Europe ... North_America
- country: Germany ... Spain; Canada ... Mexico
- city: Frankfurt ...; Vancouver ... Toronto
- office: L. Chan ... M. Wind

Data Cube Measures: Three Categories
- Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning
  - E.g., count(), sum(), min(), max()
- Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
  - avg(x) = sum(x) / count(x)
  - Is min_N() an algebraic measure? How about standard_deviation()?
- Holistic: if there is no constant bound on the storage size needed to describe a subaggregate
  - E.g., median(), mode(), rank()

Multidimensional Data
- Sales volume as a function of product, month, and region
- Dimensions: Product, Location, Time
- Hierarchical summarization paths
  - Product: Industry > Category > Product
  - Location: Region > Country > City > Office
  - Time: Year > Quarter > Month / Week > Day

A Sample Data Cube
[Figure: sales cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A, Canada, Mexico, sum); the highlighted cell is the total annual sales of TVs in U.S.A.]

Cuboids Corresponding to the Cube
- 0-D (apex) cuboid: all
- 1-D cuboids: product; date; country
- 2-D cuboids: product,date; product,country; date,country
- 3-D (base) cuboid: product,
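
The three measure categories (distributive, algebraic, holistic) can be made concrete with a small Python sketch. The numbers and the helper names `summarize` and `merge_avg` are hypothetical, chosen only for illustration. The sketch combines per-partition results and checks when that reproduces the answer computed over all the data.

```python
from statistics import median

# Hypothetical measure values split into two partitions (e.g. two regions).
part_a = [4.0, 8.0, 15.0]
part_b = [16.0, 23.0, 42.0, 100.0]
everything = part_a + part_b

# Distributive: applying sum() to the partition sums gives the overall sum.
sum_from_parts = sum([sum(part_a), sum(part_b)])


def summarize(values):
    """Keep a bounded summary per partition: (sum, count)."""
    return (sum(values), len(values))


def merge_avg(summaries):
    """Algebraic: avg is computed from two distributive aggregates."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count


avg_from_parts = merge_avg([summarize(part_a), summarize(part_b)])

# Holistic: no fixed-size summary is enough in general; for this data the
# median of the partition medians is not the median of all the data.
median_of_medians = median([median(part_a), median(part_b)])
true_median = median(everything)

print(sum_from_parts == sum(everything))                    # True
print(avg_from_parts == sum(everything) / len(everything))  # True
print(median_of_medians == true_median)                     # False
```

This is why distributive and algebraic measures are cheap to roll up from lower-level cuboids, while holistic measures generally need the underlying data or an approximation.
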
