Krzysztof Dembczy´nski Intelligent Decision Support Systems Laboratory (IDSS) Pozna´nUniversity of Technology, Poland

Bachelor studies, seventh semester Academic year 2018/19 (winter semester)

1 / 48 Review of the Previous Lecture

• Processing of massive datasets. • Evolution of systems:

I Operational (OLTP) vs. analytical (OLAP) systems. I Analytical database systems. I Design of data warehouses. I vs. multidimensional model. I NoSQL. I Processing of massive datasets.

2 / 48 Outline

1 Motivation

2 Conceptual Schemes of Data Warehouses

3 Dimensional Modeling

4 Summary

3 / 48 Outline

1 Motivation

2 Conceptual Schemes of Data Warehouses

3 Dimensional Modeling

4 Summary

4 / 48 I What is the average score of students over academic years? I What is the number of students over academic years? I What is the average score by faculties, instructors, etc.? I What is the distribution of students over faculties, semesters, etc.? I ...

• They would like to get answers for the following queries:

Motivation

• University authorities decided to analyze teaching performance by using the data collected in owned by the university containing information about students, instructors, lectures, faculties, etc.

5 / 48 I What is the average score of students over academic years? I What is the number of students over academic years? I What is the average score by faculties, instructors, etc.? I What is the distribution of students over faculties, semesters, etc.? I ...

Motivation

• University authorities decided to analyze teaching performance by using the data collected in databases owned by the university containing information about students, instructors, lectures, faculties, etc. • They would like to get answers for the following queries:

5 / 48 I What is the average score of students over academic years? I What is the number of students over academic years? I What is the average score by faculties, instructors, etc.? I What is the distribution of students over faculties, semesters, etc.? I ...

Motivation

• University authorities decided to analyze teaching performance by using the data collected in databases owned by the university containing information about students, instructors, lectures, faculties, etc. • They would like to get answers for the following queries:

5 / 48 I What is the number of students over academic years? I What is the average score by faculties, instructors, etc.? I What is the distribution of students over faculties, semesters, etc.? I ...

Motivation

• University authorities decided to analyze teaching performance by using the data collected in databases owned by the university containing information about students, instructors, lectures, faculties, etc. • They would like to get answers for the following queries:

I What is the average score of students over academic years?

5 / 48 I What is the average score by faculties, instructors, etc.? I What is the distribution of students over faculties, semesters, etc.? I ...

Motivation

• University authorities decided to analyze teaching performance by using the data collected in databases owned by the university containing information about students, instructors, lectures, faculties, etc. • They would like to get answers for the following queries:

I What is the average score of students over academic years? I What is the number of students over academic years?

5 / 48 I What is the distribution of students over faculties, semesters, etc.? I ...

Motivation

• University authorities decided to analyze teaching performance by using the data collected in databases owned by the university containing information about students, instructors, lectures, faculties, etc. • They would like to get answers for the following queries:

I What is the average score of students over academic years? I What is the number of students over academic years? I What is the average score by faculties, instructors, etc.?

5 / 48 I ...

Motivation

• University authorities decided to analyze teaching performance by using the data collected in databases owned by the university containing information about students, instructors, lectures, faculties, etc. • They would like to get answers for the following queries:

I What is the average score of students over academic years? I What is the number of students over academic years? I What is the average score by faculties, instructors, etc.? I What is the distribution of students over faculties, semesters, etc.?

5 / 48 Motivation

• University authorities decided to analyze teaching performance by using the data collected in databases owned by the university containing information about students, instructors, lectures, faculties, etc. • They would like to get answers for the following queries:

I What is the average score of students over academic years? I What is the number of students over academic years? I What is the average score by faculties, instructors, etc.? I What is the distribution of students over faculties, semesters, etc.? I ...

5 / 48 Example

• An exemplary query could be the following: SELECT Instructor, Academic_year, AVG(Grade) FROM Data_Warehouse GROUP BY Instructor, Academic_year • And the result: Academic year Name AVG(Grade) 2010/11 Stefanowski 4.2 2011/12 Stefanowski 4.5 2010/12 Slowi´nski 4.3 2011/12 Slowi´nski 4.1 2011/12 Dembczy´nski :)

6 / 48 Motivation

• Problem to solve: How to design a database for analytical queries?

7 / 48 Outline

1 Motivation

2 Conceptual Schemes of Data Warehouses

3 Dimensional Modeling

4 Summary

8 / 48 Conceptual schemes of data warehouses

• Three main goals for logical design:

I Simplicity: • Users should understand the design, • Data model should match users’ conceptual model, • Queries should be easy and intuitive to write.

I Expressiveness: • Include enough information to answer all important queries, • Include all relevant data (without irrelevant data).

I Performance: • An efficient physical design should be possible to apply.

9 / 48 Three basic conceptual schemes

, • Snowflake schema, • Fact constellations.

10 / 48 Star schema

• A single table in the middle connected to a number of dimension tables. Instructor id (PK) first name Academic Year last name id (PK) institute academic year faculty semester degree

Grade id (PK) student (FK) academic year (FK) instructor (FK) lecture (FK) Grade

Student Lecture id (PK) id (PK) first name name last name type city country secondary school

11 / 48 I Relates the dimensions to the measures.

I Represent information about dimensions (student, academic year, etc.). I Each dimension has a set of descriptive attributes.

I Measures should be aggregative. I Measures depend on a set of dimensions, e.g. student grade depends on student, course, instructor, faculty, academic year, etc.. •

• Dimension tables

Star schema

• Measures, e.g. grades, price, quantity.

12 / 48 I Relates the dimensions to the measures.

I Represent information about dimensions (student, academic year, etc.). I Each dimension has a set of descriptive attributes.

I Measures depend on a set of dimensions, e.g. student grade depends on student, course, instructor, faculty, academic year, etc.. • Fact table

• Dimension tables

Star schema

• Measures, e.g. grades, price, quantity.

I Measures should be aggregative.

12 / 48 I Relates the dimensions to the measures.

I Represent information about dimensions (student, academic year, etc.). I Each dimension has a set of descriptive attributes.

• Fact table

• Dimension tables

Star schema

• Measures, e.g. grades, price, quantity.

I Measures should be aggregative. I Measures depend on a set of dimensions, e.g. student grade depends on student, course, instructor, faculty, academic year, etc..

12 / 48 I Represent information about dimensions (student, academic year, etc.). I Each dimension has a set of descriptive attributes.

I Relates the dimensions to the measures. • Dimension tables

Star schema

• Measures, e.g. grades, price, quantity.

I Measures should be aggregative. I Measures depend on a set of dimensions, e.g. student grade depends on student, course, instructor, faculty, academic year, etc.. • Fact table

12 / 48 I Represent information about dimensions (student, academic year, etc.). I Each dimension has a set of descriptive attributes.

• Dimension tables

Star schema

• Measures, e.g. grades, price, quantity.

I Measures should be aggregative. I Measures depend on a set of dimensions, e.g. student grade depends on student, course, instructor, faculty, academic year, etc.. • Fact table

I Relates the dimensions to the measures.

12 / 48 I Represent information about dimensions (student, academic year, etc.). I Each dimension has a set of descriptive attributes.

Star schema

• Measures, e.g. grades, price, quantity.

I Measures should be aggregative. I Measures depend on a set of dimensions, e.g. student grade depends on student, course, instructor, faculty, academic year, etc.. • Fact table

I Relates the dimensions to the measures. • Dimension tables

12 / 48 I Each dimension has a set of descriptive attributes.

Star schema

• Measures, e.g. grades, price, quantity.

I Measures should be aggregative. I Measures depend on a set of dimensions, e.g. student grade depends on student, course, instructor, faculty, academic year, etc.. • Fact table

I Relates the dimensions to the measures. • Dimension tables

I Represent information about dimensions (student, academic year, etc.).

12 / 48 Star schema

• Measures, e.g. grades, price, quantity.

I Measures should be aggregative. I Measures depend on a set of dimensions, e.g. student grade depends on student, course, instructor, faculty, academic year, etc.. • Fact table

I Relates the dimensions to the measures. • Dimension tables

I Represent information about dimensions (student, academic year, etc.). I Each dimension has a set of descriptive attributes.

12 / 48 • Each fact row contains foreign keys to dimension tables and numerical columns. • Any new fact is added to the fact table. • The aggregated fact columns are the matter of the analysis.

Fact table

• Each fact table contains measurements about a process of interest.

13 / 48 • Any new fact is added to the fact table. • The aggregated fact columns are the matter of the analysis.

Fact table

• Each fact table contains measurements about a process of interest. • Each fact row contains foreign keys to dimension tables and numerical measure columns.

13 / 48 • The aggregated fact columns are the matter of the analysis.

Fact table

• Each fact table contains measurements about a process of interest. • Each fact row contains foreign keys to dimension tables and numerical measure columns. • Any new fact is added to the fact table.

13 / 48 Fact table

• Each fact table contains measurements about a process of interest. • Each fact row contains foreign keys to dimension tables and numerical measure columns. • Any new fact is added to the fact table. • The aggregated fact columns are the matter of the analysis.

13 / 48 • Dimension tables contain many descriptive columns. • Generally do not have too many rows (in comparison to the fact table). • Content is relatively static. • The attributes of dimension tables are used for filtering and grouping. • Dimension tables describe facts stored in the fact table.

Dimension tables

• Each dimension table corresponds to a real-world object or concept, e.g. customer, product, region, employee, store, etc..

14 / 48 • Generally do not have too many rows (in comparison to the fact table). • Content is relatively static. • The attributes of dimension tables are used for filtering and grouping. • Dimension tables describe facts stored in the fact table.

Dimension tables

• Each dimension table corresponds to a real-world object or concept, e.g. customer, product, region, employee, store, etc.. • Dimension tables contain many descriptive columns.

14 / 48 • Content is relatively static. • The attributes of dimension tables are used for filtering and grouping. • Dimension tables describe facts stored in the fact table.

Dimension tables

• Each dimension table corresponds to a real-world object or concept, e.g. customer, product, region, employee, store, etc.. • Dimension tables contain many descriptive columns. • Generally do not have too many rows (in comparison to the fact table).

14 / 48 • The attributes of dimension tables are used for filtering and grouping. • Dimension tables describe facts stored in the fact table.

Dimension tables

• Each dimension table corresponds to a real-world object or concept, e.g. customer, product, region, employee, store, etc.. • Dimension tables contain many descriptive columns. • Generally do not have too many rows (in comparison to the fact table). • Content is relatively static.

14 / 48 • Dimension tables describe facts stored in the fact table.

Dimension tables

• Each dimension table corresponds to a real-world object or concept, e.g. customer, product, region, employee, store, etc.. • Dimension tables contain many descriptive columns. • Generally do not have too many rows (in comparison to the fact table). • Content is relatively static. • The attributes of dimension tables are used for filtering and grouping.

14 / 48 Dimension tables

• Each dimension table corresponds to a real-world object or concept, e.g. customer, product, region, employee, store, etc.. • Dimension tables contain many descriptive columns. • Generally do not have too many rows (in comparison to the fact table). • Content is relatively static. • The attributes of dimension tables are used for filtering and grouping. • Dimension tables describe facts stored in the fact table.

14 / 48 I wide, I small (few rows), I descriptive (rows are described by descriptive attributes), I static.

I narrow, I big (many rows), I numeric (rows are described by numerical measures), I dynamic (growing over time). • Dimension table

Fact table vs. Dimension tables

• Fact table:

Facts contain numbers, dimensions contain labels

15 / 48 I wide, I small (few rows), I descriptive (rows are described by descriptive attributes), I static.

I big (many rows), I numeric (rows are described by numerical measures), I dynamic (growing over time). • Dimension table

Fact table vs. Dimension tables

• Fact table:

I narrow,

Facts contain numbers, dimensions contain labels

15 / 48 I wide, I small (few rows), I descriptive (rows are described by descriptive attributes), I static.

I numeric (rows are described by numerical measures), I dynamic (growing over time). • Dimension table

Fact table vs. Dimension tables

• Fact table:

I narrow, I big (many rows),

Facts contain numbers, dimensions contain labels

15 / 48 I wide, I small (few rows), I descriptive (rows are described by descriptive attributes), I static.

I dynamic (growing over time). • Dimension table

Fact table vs. Dimension tables

• Fact table:

I narrow, I big (many rows), I numeric (rows are described by numerical measures),

Facts contain numbers, dimensions contain labels

15 / 48 I wide, I small (few rows), I descriptive (rows are described by descriptive attributes), I static.

• Dimension table

Fact table vs. Dimension tables

• Fact table:

I narrow, I big (many rows), I numeric (rows are described by numerical measures), I dynamic (growing over time).

Facts contain numbers, dimensions contain labels

15 / 48 I wide, I small (few rows), I descriptive (rows are described by descriptive attributes), I static.

Fact table vs. Dimension tables

• Fact table:

I narrow, I big (many rows), I numeric (rows are described by numerical measures), I dynamic (growing over time). • Dimension table

Facts contain numbers, dimensions contain labels

15 / 48 I small (few rows), I descriptive (rows are described by descriptive attributes), I static.

Fact table vs. Dimension tables

• Fact table:

I narrow, I big (many rows), I numeric (rows are described by numerical measures), I dynamic (growing over time). • Dimension table

I wide,

Facts contain numbers, dimensions contain labels

15 / 48 I descriptive (rows are described by descriptive attributes), I static.

Fact table vs. Dimension tables

• Fact table:

I narrow, I big (many rows), I numeric (rows are described by numerical measures), I dynamic (growing over time). • Dimension table

I wide, I small (few rows),

Facts contain numbers, dimensions contain labels

15 / 48 I static.

Fact table vs. Dimension tables

• Fact table:

I narrow, I big (many rows), I numeric (rows are described by numerical measures), I dynamic (growing over time). • Dimension table

I wide, I small (few rows), I descriptive (rows are described by descriptive attributes),

Facts contain numbers, dimensions contain labels

15 / 48 Fact table vs. Dimension tables

• Fact table:

I narrow, I big (many rows), I numeric (rows are described by numerical measures), I dynamic (growing over time). • Dimension table

I wide, I small (few rows), I descriptive (rows are described by descriptive attributes), I static.

Facts contain numbers, dimensions contain labels

15 / 48 Dimension hierarchies

• For each dimension, one can define several hierarchies of attributes.

Semester Course

Academic year Faculty Type

Instructor Student

Institute City

Faculty Country

16 / 48 Snowflake schema

• Snowflake schema is a refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables.

Institute Instructor Faculty id (PK) id (PK) name first name id (PK) address last name name degree degree address faculty id (FK) institute id (FK)

17 / 48 Normalization

• Normal forms provide criteria for determining a table’s degree of vulnerability to inconsistencies, redundancy and update anomalies.

I First normal form (1NF), I Second normal form (2NF), I Third normal from (3NF), I Boyce-Codd normal form (BCNF), I Fourth normal form (4NF), I ...

18 / 48 Denormalization

• Denormalization is the process of attempting to optimize the performance of a database by adding redundant data or by grouping data. • Denormalization helps cover up the inefficiencies inherent in relational database software. • Normalize until it hurts, denormalize until it works :)

19 / 48 • Such schema appears in a design of a for a huge and complex problem. • One should start with a simple model, and then try to define a more complex one (project success).

Fact constellation schema

• Fact constellation schema consists of multiple fact tables which share dimension tables.

20 / 48 • One should start with a simple model, and then try to define a more complex one (project success).

Fact constellation schema

• Fact constellation schema consists of multiple fact tables which share dimension tables. • Such schema appears in a design of a data warehouse for a huge and complex problem.

20 / 48 Fact constellation schema

• Fact constellation schema consists of multiple fact tables which share dimension tables. • Such schema appears in a design of a data warehouse for a huge and complex problem. • One should start with a simple model, and then try to define a more complex one (project success).

20 / 48 Outline

1 Motivation

2 Conceptual Schemes of Data Warehouses

3 Dimensional Modeling

4 Summary

21 / 48 I Select the business process to model (e.g. sales). I Determine the grain of the business process (e.g. single transaction in a market identified by bar-code scanners at cash register). I Choose the dimensions describing the business process (e.g. localization, product, data, promotion, etc.). I Identify the numeric measures for the facts (e.g. price, quantity).

Dimensional modeling

• Four Steps in Dimensional Modeling:

22 / 48 I Determine the grain of the business process (e.g. single transaction in a market identified by bar-code scanners at cash register). I Choose the dimensions describing the business process (e.g. localization, product, data, promotion, etc.). I Identify the numeric measures for the facts (e.g. price, quantity).

Dimensional modeling

• Four Steps in Dimensional Modeling:

I Select the business process to model (e.g. sales).

22 / 48 I Choose the dimensions describing the business process (e.g. localization, product, data, promotion, etc.). I Identify the numeric measures for the facts (e.g. price, quantity).

Dimensional modeling

• Four Steps in Dimensional Modeling:

I Select the business process to model (e.g. sales). I Determine the grain of the business process (e.g. single transaction in a market identified by bar-code scanners at cash register).

22 / 48 I Identify the numeric measures for the facts (e.g. price, quantity).

Dimensional modeling

• Four Steps in Dimensional Modeling:

I Select the business process to model (e.g. sales). I Determine the grain of the business process (e.g. single transaction in a market identified by bar-code scanners at cash register). I Choose the dimensions describing the business process (e.g. localization, product, data, promotion, etc.).

22 / 48 Dimensional modeling

• Four Steps in Dimensional Modeling:

I Select the business process to model (e.g. sales). I Determine the grain of the business process (e.g. single transaction in a market identified by bar-code scanners at cash register). I Choose the dimensions describing the business process (e.g. localization, product, data, promotion, etc.). I Identify the numeric measures for the facts (e.g. price, quantity).

22 / 48 • Examples: sales, orders, teaching. • Business process should not be referred to an organizational business department or function. • Selection of the business process should be dependent on its complexity, and also on time and budget of the project.

Business process

• Business process has to be a natural activity performed in the given organization that is supported by an operational database system.

23 / 48 • Business process should not be referred to an organizational business department or function. • Selection of the business process should be dependent on its complexity, and also on time and budget of the project.

Business process

• Business process has to be a natural activity performed in the given organization that is supported by an operational database system. • Examples: sales, orders, teaching.

23 / 48 • Selection of the business process should be dependent on its complexity, and also on time and budget of the project.

Business process

• Business process has to be a natural activity performed in the given organization that is supported by an operational database system. • Examples: sales, orders, teaching. • Business process should not be referred to an organizational business department or function.

23 / 48 Business process

• Business process has to be a natural activity performed in the given organization that is supported by an operational database system. • Examples: sales, orders, teaching. • Business process should not be referred to an organizational business department or function. • Selection of the business process should be dependent on its complexity, and also on time and budget of the project.

23 / 48 • Determine the maximum level of detail of the warehouse. • Examples: line item from a cash register receipt, daily snapshot of inventory level for a product in a warehouse, monthly snapshot for each bank account. • Finer-grained fact tables are more expressive, but have more rows and grow faster.

Grain of the business process

• Define what an individual fact table row represents.

24 / 48 • Examples: line item from a cash register receipt, daily snapshot of inventory level for a product in a warehouse, monthly snapshot for each bank account. • Finer-grained fact tables are more expressive, but have more rows and grow faster.

Grain of the business process

• Define what an individual fact table row represents. • Determine the maximum level of detail of the warehouse.

24 / 48 • Finer-grained fact tables are more expressive, but have more rows and grow faster.

Grain of the business process

• Define what an individual fact table row represents. • Determine the maximum level of detail of the warehouse. • Examples: line item from a cash register receipt, daily snapshot of inventory level for a product in a warehouse, monthly snapshot for each bank account.

24 / 48 Grain of the business process

• Define what an individual fact table row represents. • Determine the maximum level of detail of the warehouse. • Examples: line item from a cash register receipt, daily snapshot of inventory level for a product in a warehouse, monthly snapshot for each bank account. • Finer-grained fact tables are more expressive, but have more rows and grow faster.

24 / 48 • In other words: detailed description of the grain defined in the previous step. • Grain of the fact table defines the grain of the dimensions. • Examples: localization, product, date, promotion, etc.

Dimensions describing the business process

• Decorate fact table with a set of dimensions that determine a candidate key based on the grain and all other dimensions functionally determined by the candidate key.

25 / 48 • Grain of the fact table defines the grain of the dimensions. • Examples: localization, product, date, promotion, etc.

Dimensions describing the business process

• Decorate fact table with a set of dimensions that determine a candidate key based on the grain and all other dimensions functionally determined by the candidate key. • In other words: detailed description of the grain defined in the previous step.

25 / 48 • Examples: localization, product, date, promotion, etc.

Dimensions describing the business process

• Decorate fact table with a set of dimensions that determine a candidate key based on the grain and all other dimensions functionally determined by the candidate key. • In other words: detailed description of the grain defined in the previous step. • Grain of the fact table defines the grain of the dimensions.

25 / 48 Dimensions describing the business process

• Decorate fact table with a set of dimensions that determine a candidate key based on the grain and all other dimensions functionally determined by the candidate key. • In other words: detailed description of the grain defined in the previous step. • Grain of the fact table defines the grain of the dimensions. • Examples: localization, product, date, promotion, etc.

25 / 48 I fully additive: can be summed across any of the dimensions associated with the fact table I semi-additive: can be summed across, but not all, for example balance amounts are additive across all dimensions except time. I non-additive: are completely non-additive, such as ratios (store the fully additive components of the non-additive measure and sum these components into the final answer before calculating the final non-additive fact)

• All the measures correspond to the same grain defined previously. • Measures should be numeric and additive:

Numeric measures

• Numeric measures allow to measure the performance of the process.

26 / 48 I fully additive: can be summed across any of the dimensions associated with the fact table I semi-additive: can be summed across, but not all, for example balance amounts are additive across all dimensions except time. I non-additive: are completely non-additive, such as ratios (store the fully additive components of the non-additive measure and sum these components into the final answer before calculating the final non-additive fact)

• Measures should be numeric and additive:

Numeric measures

• Numeric measures allow to measure the performance of the process. • All the measures correspond to the same grain defined previously.

26 / 48 I fully additive: can be summed across any of the dimensions associated with the fact table I semi-additive: can be summed across, but not all, for example balance amounts are additive across all dimensions except time. I non-additive: are completely non-additive, such as ratios (store the fully additive components of the non-additive measure and sum these components into the final answer before calculating the final non-additive fact)

Numeric measures

• Numeric measures allow to measure the performance of the process. • All the measures correspond to the same grain defined previously. • Measures should be numeric and additive:

26 / 48 I semi-additive: can be summed across, but not all, for example balance amounts are additive across all dimensions except time. I non-additive: are completely non-additive, such as ratios (store the fully additive components of the non-additive measure and sum these components into the final answer before calculating the final non-additive fact)

Numeric measures

• Numeric measures allow to measure the performance of the process. • All the measures correspond to the same grain defined previously. • Measures should be numeric and additive:

I fully additive: can be summed across any of the dimensions associated with the fact table

26 / 48 I non-additive: are completely non-additive, such as ratios (store the fully additive components of the non-additive measure and sum these components into the final answer before calculating the final non-additive fact)

Numeric measures

• Numeric measures allow to measure the performance of the process. • All the measures correspond to the same grain defined previously. • Measures should be numeric and additive:

I fully additive: can be summed across any of the dimensions associated with the fact table I semi-additive: can be summed across, but not all, for example balance amounts are additive across all dimensions except time.

26 / 48 Numeric measures

• Numeric measures allow to measure the performance of the process. • All the measures correspond to the same grain defined previously. • Measures should be numeric and additive:

I fully additive: can be summed across any of the dimensions associated with the fact table I semi-additive: can be summed across, but not all, for example balance amounts are additive across all dimensions except time. I non-additive: are completely non-additive, such as ratios (store the fully additive components of the non-additive measure and sum these components into the final answer before calculating the final non-additive fact)

26 / 48 Dimensional modeling techniques

• Derived measures vs. calculated measures, • Describing dimensions, • Date and time dimension, • Surrogate keys, • Degenerate dimensions, • Role-playing dimensions, • Junk dimension, • Slowly changing dimensions, • Mini dimension, • Factless fact tables, • Data warehouse bus architecture.

27 / 48 • Data warehouse is huge from the definition, so storing additional values does not change a lot. • Computing additional measures (like profit from incomes and outcomes) decreases performance of data warehouse.

Derived measures vs. Calculated measures

• If possible materialize (calculate and store) all measures needed for analytical queries.

28 / 48 • Computing additional measures (like profit from incomes and outcomes) decreases performance of data warehouse.

Derived measures vs. Calculated measures

• If possible materialize (calculate and store) all measures needed for analytical queries. • Data warehouse is huge from the definition, so storing additional values does not change a lot.

28 / 48 Derived measures vs. Calculated measures

• If possible materialize (calculate and store) all measures needed for analytical queries. • Data warehouse is huge from the definition, so storing additional values does not change a lot. • Computing additional measures (like profit from incomes and outcomes) decreases performance of data warehouse.

28 / 48 • Use as many as possible, but useful, dimension attributes

Describing dimension

• Always use informative names for tables, attributes, and values,

29 / 48 Describing dimension

• Always use informative names for tables, attributes, and values, • Use as many as possible, but useful, dimension attributes

29 / 48 I A separate dimension is more reasonable . . .

• Data warehouse should be treated as a . • Date dimension allows analysis of time trends. • Should we treat date as a separate dimension or use timestamps to compute all needed information?

Date dimension

• Date dimension is a specific and inseparable dimension of the data warehouse design.

30 / 48 I A separate dimension is more reasonable . . .

• Date dimension allows analysis of time trends. • Should we treat date as a separate dimension or use timestamps to compute all needed information?

Date dimension

• Date dimension is a specific and inseparable dimension of the data warehouse design. • Data warehouse should be treated as a temporal database.

30 / 48 I A separate dimension is more reasonable . . .

• Should we treat date as a separate dimension or use timestamps to compute all needed information?

Date dimension

• Date dimension is a specific and inseparable dimension of the data warehouse design. • Data warehouse should be treated as a temporal database. • Date dimension allows analysis of time trends.

30 / 48 I A separate dimension is more reasonable . . .

Date dimension

• Date dimension is a specific and inseparable dimension of the data warehouse design. • Data warehouse should be treated as a temporal database. • Date dimension allows analysis of time trends. • Should we treat date as a separate dimension or use timestamps to compute all needed information?

30 / 48 Date dimension

• Date dimension is a specific and inseparable dimension of the data warehouse design. • Data warehouse should be treated as a temporal database. • Date dimension allows analysis of time trends. • Should we treat date as a separate dimension or use timestamps to compute all needed information?

I A separate dimension is more reasonable . . .

30 / 48 Date dimension

• Typical attributes in date dimension:

I date and time (candidate key), I week of year, I day (number) of year, I month, I day (number) of month, I name of month, I day (number) of week, I quarter, I day number in epoch, I year, I weekday, I fiscal year, I weekend, I selling season, I 24-hour work day I back-to-school I holiday I Christmas

31 / 48 I Capturing interesting date properties (holiday/non-holiday), I Avoiding relying on special date-handling SQL functions, I Making use of indexes. • For finer-grained time measurements, use separate date and time-of-day

Date dimension

• Advantages of using date dimension:

32 / 48 I Avoiding relying on special date-handling SQL functions, I Making use of indexes. • For finer-grained time measurements, use separate date and time-of-day

Date dimension

• Advantages of using date dimension:

I Capturing interesting date properties (holiday/non-holiday),

32 / 48 I Making use of indexes. • For finer-grained time measurements, use separate date and time-of-day

Date dimension

• Advantages of using date dimension:

I Capturing interesting date properties (holiday/non-holiday), I Avoiding relying on special date-handling SQL functions,

32 / 48 • For finer-grained time measurements, use separate date and time-of-day

Date dimension

• Advantages of using date dimension:

I Capturing interesting date properties (holiday/non-holiday), I Avoiding relying on special date-handling SQL functions, I Making use of indexes.

32 / 48 Date dimension

• Advantages of using date dimension:

I Capturing interesting date properties (holiday/non-holiday), I Avoiding relying on special date-handling SQL functions, I Making use of indexes. • For finer-grained time measurements, use separate date and time-of-day

32 / 48 I Surrogate keys can be shorter. I Better handling of exceptional cases (e.g. for missing values – try to use a specific row in the dimension table that represents missing value). I Surrogate keys avoid to assume implicit semantic. I Resistant to change of natural key meaning. I Resistant to reuse of old value of natural key.

• Surrogate keys should be used instead of natural keys (like PESEL number in Poland):

Surrogate keys

• A surrogate key is a meaningless integer key that is assigned by the data warehouse.

33 / 48 I Surrogate keys can be shorter. I Better handling of exceptional cases (e.g. for missing values – try to use a specific row in the dimension table that represents missing value). I Surrogate keys avoid to assume implicit semantic. I Resistant to change of natural key meaning. I Resistant to reuse of old value of natural key.

Surrogate keys

• A surrogate key is a meaningless integer key that is assigned by the data warehouse. • Surrogate keys should be used instead of natural keys (like PESEL number in Poland):

33 / 48 I Better handling of exceptional cases (e.g. for missing values – try to use a specific row in the dimension table that represents missing value). I Surrogate keys avoid to assume implicit semantic. I Resistant to change of natural key meaning. I Resistant to reuse of old value of natural key.

Surrogate keys

• A surrogate key is a meaningless integer key that is assigned by the data warehouse. • Surrogate keys should be used instead of natural keys (like PESEL number in Poland):

I Surrogate keys can be shorter.

33 / 48 I Surrogate keys avoid to assume implicit semantic. I Resistant to change of natural key meaning. I Resistant to reuse of old value of natural key.

Surrogate keys

• A surrogate key is a meaningless integer key that is assigned by the data warehouse. • Surrogate keys should be used instead of natural keys (like PESEL number in Poland):

I Surrogate keys can be shorter. I Better handling of exceptional cases (e.g. for missing values – try to use a specific row in the dimension table that represents missing value).

33 / 48 I Resistant to change of natural key meaning. I Resistant to reuse of old value of natural key.

Surrogate keys

• A surrogate key is a meaningless integer key that is assigned by the data warehouse. • Surrogate keys should be used instead of natural keys (like PESEL number in Poland):

I Surrogate keys can be shorter. I Better handling of exceptional cases (e.g. for missing values – try to use a specific row in the dimension table that represents missing value). I Surrogate keys avoid to assume implicit semantic.

33 / 48 I Resistant to reuse of old value of natural key.

Surrogate keys

• A surrogate key is a meaningless integer key that is assigned by the data warehouse. • Surrogate keys should be used instead of natural keys (like PESEL number in Poland):

I Surrogate keys can be shorter. I Better handling of exceptional cases (e.g. for missing values – try to use a specific row in the dimension table that represents missing value). I Surrogate keys avoid to assume implicit semantic. I Resistant to change of natural key meaning.

33 / 48 Surrogate keys

• A surrogate key is a meaningless integer key that is assigned by the data warehouse. • Surrogate keys should be used instead of natural keys (like PESEL number in Poland):

I Surrogate keys can be shorter. I Better handling of exceptional cases (e.g. for missing values – try to use a specific row in the dimension table that represents missing value). I Surrogate keys avoid to assume implicit semantic. I Resistant to change of natural key meaning. I Resistant to reuse of old value of natural key.

33 / 48 I Discard the dimension, I Use a ...

I Consider a data warehouses for a retail market, I Typical transaction can be described by (id transaction, product, . . . ), I id transaction is just a unique identifier. • There are two options for dealing with such dimensions:

Degenerate dimensions

• Dimension can be merely an identifier, without any interesting attributes:

34 / 48 I Discard the dimension, I Use a degenerate dimension ...

I Typical transaction can be described by (id transaction, product, . . . ), I id transaction is just a unique identifier. • There are two options for dealing with such dimensions:

Degenerate dimensions

• Dimension can be merely an identifier, without any interesting attributes:

I Consider a data warehouses for a retail market,

34 / 48 I Discard the dimension, I Use a degenerate dimension ...

I id transaction is just a unique identifier. • There are two options for dealing with such dimensions:

Degenerate dimensions

• Dimension can be merely an identifier, without any interesting attributes:

I Consider a data warehouses for a retail market, I Typical transaction can be described by (id transaction, product, . . . ),

34 / 48 I Discard the dimension, I Use a degenerate dimension ...

• There are two options for dealing with such dimensions:

Degenerate dimensions

• Dimension can be merely an identifier, without any interesting attributes:

I Consider a data warehouses for a retail market, I Typical transaction can be described by (id transaction, product, . . . ), I id transaction is just a unique identifier.

34 / 48 I Discard the dimension, I Use a degenerate dimension ...

Degenerate dimensions

• Dimension can be merely an identifier, without any interesting attributes:

I Consider a data warehouses for a retail market, I Typical transaction can be described by (id transaction, product, . . . ), I id transaction is just a unique identifier. • There are two options for dealing with such dimensions:

34 / 48 I Use a degenerate dimension ...

Degenerate dimensions

• Dimension can be merely an identifier, without any interesting attributes:

I Consider a data warehouses for a retail market, I Typical transaction can be described by (id transaction, product, . . . ), I id transaction is just a unique identifier. • There are two options for dealing with such dimensions:

I Discard the dimension,

34 / 48 Degenerate dimensions

• Dimension can be merely an identifier, without any interesting attributes:

I Consider a data warehouses for a retail market, I Typical transaction can be described by (id transaction, product, . . . ), I id transaction is just a unique identifier. • There are two options for dealing with such dimensions:

I Discard the dimension, I Use a degenerate dimension ...

34 / 48 • Do not create a separate dimension table. • Such a dimension serves to group together products that were bought in the same market basket (market basket analysis).

Degenerate dimensions

• Store the dimension identifier directly in the fact table.

35 / 48 • Such a dimension serves to group together products that were bought in the same market basket (market basket analysis).

Degenerate dimensions

• Store the dimension identifier directly in the fact table. • Do not create a separate dimension table.

35 / 48 Degenerate dimensions

• Store the dimension identifier directly in the fact table. • Do not create a separate dimension table. • Such a dimension serves to group together products that were bought in the same market basket (market basket analysis).

35 / 48 • Example: a fact is associated with several dates (e.g. order, payment, or delivery date), each of which is represented by a foreign key.

Role-playing dimensions

• A single physical dimension can be referenced multiple times in a fact table, with each reference linking to a logically distinct role for the dimension.

36 / 48 Role-playing dimensions

• A single physical dimension can be referenced multiple times in a fact table, with each reference linking to a logically distinct role for the dimension. • Example: a fact is associated with several dates (e.g. order, payment, or delivery date), each of which is represented by a foreign key.

36 / 48 I Example: consider such attributes as payment method (cash and credit card) and packing type (box, paper, or none). • A possible solution is to create a single junk dimension combining these columns together. • This dimension does not need to be the Cartesian product of all the attributes’ possible values, but should only contain the combination of values that actually occur in the source data.

Junk dimension

• Some columns may not fit to any of the dimensions, but creating a separate dimension for each of them would increase significantly the size of fact table (a foreign key would be required for each such column).

37 / 48 • A possible solution is to create a single junk dimension combining these columns together. • This dimension does not need to be the Cartesian product of all the attributes’ possible values, but should only contain the combination of values that actually occur in the source data.

Junk dimension

• Some columns may not fit to any of the dimensions, but creating a separate dimension for each of them would increase significantly the size of fact table (a foreign key would be required for each such column).

I Example: consider such attributes as payment method (cash and credit card) and packing type (box, paper, or none).

37 / 48 • This dimension does not need to be the Cartesian product of all the attributes’ possible values, but should only contain the combination of values that actually occur in the source data.

Junk dimension

• Some columns may not fit to any of the dimensions, but creating a separate dimension for each of them would increase significantly the size of fact table (a foreign key would be required for each such column).

I Example: consider such attributes as payment method (cash and credit card) and packing type (box, paper, or none). • A possible solution is to create a single junk dimension combining these columns together.

37 / 48 Junk dimension

• Some columns may not fit to any of the dimensions, but creating a separate dimension for each of them would increase significantly the size of fact table (a foreign key would be required for each such column).

I Example: consider such attributes as payment method (cash and credit card) and packing type (box, paper, or none). • A possible solution is to create a single junk dimension combining these columns together. • This dimension does not need to be the Cartesian product of all the attributes’ possible values, but should only contain the combination of values that actually occur in the source data.

37 / 48 I Customer moves to a new address, I Administrative reform in Poland, I Change of product categorization.

I New sales transactions occur constantly, I New products are introduced rarely, I New stores are opened very rarely. • Attribute values for existing dimension rows do occasionally change over time:

Slowly changing dimensions

• Compared to fact tables, dimension tables are relatively stable:

38 / 48 I Customer moves to a new address, I Administrative reform in Poland, I Change of product categorization.

I New products are introduced rarely, I New stores are opened very rarely. • Attribute values for existing dimension rows do occasionally change over time:

Slowly changing dimensions

• Compared to fact tables, dimension tables are relatively stable:

I New sales transactions occur constantly,

38 / 48 I Customer moves to a new address, I Administrative reform in Poland, I Change of product categorization.

I New stores are opened very rarely. • Attribute values for existing dimension rows do occasionally change over time:

Slowly changing dimensions

• Compared to fact tables, dimension tables are relatively stable:

I New sales transactions occur constantly, I New products are introduced rarely,

38 / 48 I Customer moves to a new address, I Administrative reform in Poland, I Change of product categorization.

• Attribute values for existing dimension rows do occasionally change over time:

Slowly changing dimensions

• Compared to fact tables, dimension tables are relatively stable:

I New sales transactions occur constantly, I New products are introduced rarely, I New stores are opened very rarely.

38 / 48 I Customer moves to a new address, I Administrative reform in Poland, I Change of product categorization.

Slowly changing dimensions

• Compared to fact tables, dimension tables are relatively stable:

I New sales transactions occur constantly, I New products are introduced rarely, I New stores are opened very rarely. • Attribute values for existing dimension rows do occasionally change over time:

38 / 48 I Administrative reform in Poland, I Change of product categorization.

Slowly changing dimensions

• Compared to fact tables, dimension tables are relatively stable:

I New sales transactions occur constantly, I New products are introduced rarely, I New stores are opened very rarely. • Attribute values for existing dimension rows do occasionally change over time:

I Customer moves to a new address,

38 / 48 I Change of product categorization.

Slowly changing dimensions

• Compared to fact tables, dimension tables are relatively stable:

I New sales transactions occur constantly, I New products are introduced rarely, I New stores are opened very rarely. • Attribute values for existing dimension rows do occasionally change over time:

I Customer moves to a new address, I Administrative reform in Poland,

38 / 48 Slowly changing dimensions

• Compared to fact tables, dimension tables are relatively stable:

I New sales transactions occur constantly, I New products are introduced rarely, I New stores are opened very rarely. • Attribute values for existing dimension rows do occasionally change over time:

I Customer moves to a new address, I Administrative reform in Poland, I Change of product categorization.

38 / 48 I Wrong name of the street (correct approach), I Change of customer address: Nowak moves from Pozna´nto Warsaw → all products purchased in Pozna´nmove to Warsaw too!!!

I Simple to implement, I Does not report history, I Can affect analysis. • Examples

Slowly changing dimensions

• Type I (the simplest solution): update the dimension table:

39 / 48 I Wrong name of the street (correct approach), I Change of customer address: Nowak moves from Pozna´nto Warsaw → all products purchased in Pozna´nmove to Warsaw too!!!

I Does not report history, I Can affect analysis. • Examples

Slowly changing dimensions

• Type I (the simplest solution): update the dimension table:

I Simple to implement,

39 / 48 I Wrong name of the street (correct approach), I Change of customer address: Nowak moves from Pozna´nto Warsaw → all products purchased in Pozna´nmove to Warsaw too!!!

I Can affect analysis. • Examples

Slowly changing dimensions

• Type I (the simplest solution): update the dimension table:

I Simple to implement, I Does not report history,

39 / 48 I Wrong name of the street (correct approach), I Change of customer address: Nowak moves from Pozna´nto Warsaw → all products purchased in Pozna´nmove to Warsaw too!!!

• Examples

Slowly changing dimensions

• Type I (the simplest solution): update the dimension table:

I Simple to implement, I Does not report history, I Can affect analysis.

39 / 48 I Wrong name of the street (correct approach), I Change of customer address: Nowak moves from Pozna´nto Warsaw → all products purchased in Pozna´nmove to Warsaw too!!!

Slowly changing dimensions

• Type I (the simplest solution): update the dimension table:

I Simple to implement, I Does not report history, I Can affect analysis. • Examples

39 / 48 I Change of customer address: Nowak moves from Pozna´nto Warsaw → all products purchased in Pozna´nmove to Warsaw too!!!

Slowly changing dimensions

• Type I (the simplest solution): update the dimension table:

I Simple to implement, I Does not report history, I Can affect analysis. • Examples

I Wrong name of the street (correct approach),

39 / 48 Slowly changing dimensions

• Type I (the simplest solution): update the dimension table:

I Simple to implement, I Does not report history, I Can affect analysis. • Examples

I Wrong name of the street (correct approach), I Change of customer address: Nowak moves from Pozna´nto Warsaw → all products purchased in Pozna´nmove to Warsaw too!!!

39 / 48 I Nowak moves to Warsaw: id local id name city row effec- row expira- current row tive date tion date indicator 23401 234 Nowak Pozna´n 2012-01-01 2014-01-05 Expired 23402 234 Nowak Warsaw 2014-01-06 9999-12-31 Current

I Old fact table rows point to the old row, and new fact table rows point to the new row, I To report on Nowak’s activity over time one can use: WHERE local id = "234"

I Accurate historical reporting, I Pre-computed aggregates unaffected, I Dimension table grows over time, I Type II requires surrogate keys. • Examples:

Slowly changing dimensions

• Type II: create a new dimension row:

40 / 48 I Nowak moves to Warsaw: id local id name city row effec- row expira- current row tive date tion date indicator 23401 234 Nowak Pozna´n 2012-01-01 2014-01-05 Expired 23402 234 Nowak Warsaw 2014-01-06 9999-12-31 Current

I Old fact table rows point to the old row, and new fact table rows point to the new row, I To report on Nowak’s activity over time one can use: WHERE local id = "234"

I Pre-computed aggregates unaffected, I Dimension table grows over time, I Type II requires surrogate keys. • Examples:

Slowly changing dimensions

• Type II: create a new dimension row:

I Accurate historical reporting,

40 / 48 I Nowak moves to Warsaw: id local id name city row effec- row expira- current row tive date tion date indicator 23401 234 Nowak Pozna´n 2012-01-01 2014-01-05 Expired 23402 234 Nowak Warsaw 2014-01-06 9999-12-31 Current

I Old fact table rows point to the old row, and new fact table rows point to the new row, I To report on Nowak’s activity over time one can use: WHERE local id = "234"

I Dimension table grows over time, I Type II requires surrogate keys. • Examples:

Slowly changing dimensions

• Type II: create a new dimension row:

I Accurate historical reporting, I Pre-computed aggregates unaffected,

40 / 48 I Nowak moves to Warsaw: id local id name city row effec- row expira- current row tive date tion date indicator 23401 234 Nowak Pozna´n 2012-01-01 2014-01-05 Expired 23402 234 Nowak Warsaw 2014-01-06 9999-12-31 Current

I Old fact table rows point to the old row, and new fact table rows point to the new row, I To report on Nowak’s activity over time one can use: WHERE local id = "234"

I Type II requires surrogate keys. • Examples:

Slowly changing dimensions

• Type II: create a new dimension row:

I Accurate historical reporting, I Pre-computed aggregates unaffected, I Dimension table grows over time,

40 / 48 I Nowak moves to Warsaw: id local id name city row effec- row expira- current row tive date tion date indicator 23401 234 Nowak Pozna´n 2012-01-01 2014-01-05 Expired 23402 234 Nowak Warsaw 2014-01-06 9999-12-31 Current

I Old fact table rows point to the old row, and new fact table rows point to the new row, I To report on Nowak’s activity over time one can use: WHERE local id = "234"

• Examples:

Slowly changing dimensions

• Type II: create a new dimension row:

I Accurate historical reporting, I Pre-computed aggregates unaffected, I Dimension table grows over time, I Type II requires surrogate keys.

40 / 48 I Nowak moves to Warsaw: id local id name city row effec- row expira- current row tive date tion date indicator 23401 234 Nowak Pozna´n 2012-01-01 2014-01-05 Expired 23402 234 Nowak Warsaw 2014-01-06 9999-12-31 Current

I Old fact table rows point to the old row, and new fact table rows point to the new row, I To report on Nowak’s activity over time one can use: WHERE local id = "234"

Slowly changing dimensions

• Type II: create a new dimension row:

I Accurate historical reporting, I Pre-computed aggregates unaffected, I Dimension table grows over time, I Type II requires surrogate keys. • Examples:

40 / 48 I Old fact table rows point to the old row, and new fact table rows point to the new row, I To report on Nowak’s activity over time one can use: WHERE local id = "234"

Slowly changing dimensions

• Type II: create a new dimension row:

I Accurate historical reporting, I Pre-computed aggregates unaffected, I Dimension table grows over time, I Type II requires surrogate keys. • Examples:

I Nowak moves to Warsaw: id local id name city row effec- row expira- current row tive date tion date indicator 23401 234 Nowak Pozna´n 2012-01-01 2014-01-05 Expired 23402 234 Nowak Warsaw 2014-01-06 9999-12-31 Current

40 / 48 I To report on Nowak’s activity over time one can use: WHERE local id = "234"

Slowly changing dimensions

• Type II: create a new dimension row:

I Accurate historical reporting, I Pre-computed aggregates unaffected, I Dimension table grows over time, I Type II requires surrogate keys. • Examples:

I Nowak moves to Warsaw: id local id name city row effec- row expira- current row tive date tion date indicator 23401 234 Nowak Pozna´n 2012-01-01 2014-01-05 Expired 23402 234 Nowak Warsaw 2014-01-06 9999-12-31 Current

I Old fact table rows point to the old row, and new fact table rows point to the new row,

40 / 48 Slowly changing dimensions

• Type II: create a new dimension row:

I Accurate historical reporting, I Pre-computed aggregates unaffected, I Dimension table grows over time, I Type II requires surrogate keys. • Examples:

I Nowak moves to Warsaw: id local id name city row effec- row expira- current row tive date tion date indicator 23401 234 Nowak Pozna´n 2012-01-01 2014-01-05 Expired 23402 234 Nowak Warsaw 2014-01-06 9999-12-31 Current

I Old fact table rows point to the old row, and new fact table rows point to the new row, I To report on Nowak’s activity over time one can use: WHERE local id = "234"

40 / 48 I Administrative reform in Poland: id previous name current name 23401 pozna´nskie wielkopolskie 23402 pilskie wielkopolskie

I All the fact table rows point to both values.

I Allows reporting using either the old or the new values, I Mostly useful for different reorganizations. • Examples:

Slowly changing dimensions

• Type III: create a new dimension column:

41 / 48 I Administrative reform in Poland: id previous name current name 23401 pozna´nskie wielkopolskie 23402 pilskie wielkopolskie

I All the fact table rows point to both values.

I Mostly useful for different reorganizations. • Examples:

Slowly changing dimensions

• Type III: create a new dimension column:

I Allows reporting using either the old or the new values,

41 / 48 I Administrative reform in Poland: id previous name current name 23401 pozna´nskie wielkopolskie 23402 pilskie wielkopolskie

I All the fact table rows point to both values.

• Examples:

Slowly changing dimensions

• Type III: create a new dimension column:

I Allows reporting using either the old or the new values, I Mostly useful for different reorganizations.

41 / 48 I Administrative reform in Poland: id previous name current name 23401 pozna´nskie wielkopolskie 23402 pilskie wielkopolskie

I All the fact table rows point to both values.

Slowly changing dimensions

• Type III: create a new dimension column:

I Allows reporting using either the old or the new values, I Mostly useful for different reorganizations. • Examples:

41 / 48 I All the fact table rows point to both values.

Slowly changing dimensions

• Type III: create a new dimension column:

I Allows reporting using either the old or the new values, I Mostly useful for different reorganizations. • Examples:

I Administrative reform in Poland: id previous name current name 23401 pozna´nskie wielkopolskie 23402 pilskie wielkopolskie

41 / 48 Slowly changing dimensions

• Type III: create a new dimension column:

I Allows reporting using either the old or the new values, I Mostly useful for different reorganizations. • Examples:

I Administrative reform in Poland: id previous name current name 23401 pozna´nskie wielkopolskie 23402 pilskie wielkopolskie

I All the fact table rows point to both values.

41 / 48 • For such attributes it may be reasonable to create a separate dimension table called mini dimension. • It relies on removing of frequently-changing or frequently-queried attributes from an original dimension and adding them to a new mini-dimension table. • The link between the original dimension and the mini dimension is through the fact table, but additional foreign key can be added to the original dimension.

Mini dimension

• Some attributes change or are queried frequently.

42 / 48 • It relies on removing of frequently-changing or frequently-queried attributes from an original dimension and adding them to a new mini-dimension table. • The link between the original dimension and the mini dimension is through the fact table, but additional foreign key can be added to the original dimension.

Mini dimension

• Some attributes change or are queried frequently. • For such attributes it may be reasonable to create a separate dimension table called mini dimension.

42 / 48 • The link between the original dimension and the mini dimension is through the fact table, but additional foreign key can be added to the original dimension.

Mini dimension

• Some attributes change or are queried frequently. • For such attributes it may be reasonable to create a separate dimension table called mini dimension. • It relies on removing of frequently-changing or frequently-queried attributes from an original dimension and adding them to a new mini-dimension table.

42 / 48 Mini dimension

• Some attributes change or are queried frequently. • For such attributes it may be reasonable to create a separate dimension table called mini dimension. • It relies on removing of frequently-changing or frequently-queried attributes from an original dimension and adding them to a new mini-dimension table. • The link between the original dimension and the mini dimension is through the fact table, but additional foreign key can be added to the original dimension.

42 / 48 I Fact table without numeric measure columns. I Dummy measure is included that always has value 1. I Describe relationships between dimensions. I Example: Which products were on promotion in which stores for which days?

• Fact tables do not contain rows for non-events: no rows for products that did not sell. • Factless Fact Tables:

Factless fact tables

• Fact tables take the advantage of sparsity (much less data to store, if events are rare).

43 / 48 I Fact table without numeric measure columns. I Dummy measure is included that always has value 1. I Describe relationships between dimensions. I Example: Which products were on promotion in which stores for which days?

• Factless Fact Tables:

Factless fact tables

• Fact tables take the advantage of sparsity (much less data to store, if events are rare). • Fact tables do not contain rows for non-events: no rows for products that did not sell.

43 / 48 I Fact table without numeric measure columns. I Dummy measure is included that always has value 1. I Describe relationships between dimensions. I Example: Which products were on promotion in which stores for which days?

Factless fact tables

• Fact tables take the advantage of sparsity (much less data to store, if events are rare). • Fact tables do not contain rows for non-events: no rows for products that did not sell. • Factless Fact Tables:

43 / 48 I Dummy measure is included that always has value 1. I Describe relationships between dimensions. I Example: Which products were on promotion in which stores for which days?

Factless fact tables

• Fact tables take the advantage of sparsity (much less data to store, if events are rare). • Fact tables do not contain rows for non-events: no rows for products that did not sell. • Factless Fact Tables:

I Fact table without numeric measure columns.

43 / 48 I Describe relationships between dimensions. I Example: Which products were on promotion in which stores for which days?

Factless fact tables

• Fact tables take the advantage of sparsity (much less data to store, if events are rare). • Fact tables do not contain rows for non-events: no rows for products that did not sell. • Factless Fact Tables:

I Fact table without numeric measure columns. I Dummy measure is included that always has value 1.

43 / 48 I Example: Which products were on promotion in which stores for which days?

Factless fact tables

• Fact tables take the advantage of sparsity (much less data to store, if events are rare). • Fact tables do not contain rows for non-events: no rows for products that did not sell. • Factless Fact Tables:

I Fact table without numeric measure columns. I Dummy measure is included that always has value 1. I Describe relationships between dimensions.

43 / 48 Factless fact tables

• Fact tables take the advantage of sparsity (much less data to store, if events are rare). • Fact tables do not contain rows for non-events: no rows for products that did not sell. • Factless Fact Tables:

I Fact table without numeric measure columns. I Dummy measure is included that always has value 1. I Describe relationships between dimensions. I Example: Which products were on promotion in which stores for which days?

43 / 48 Data warehouse bus architecture

• In order to create an data warehouse for entire organization, which takes into account several processes, one can use a data warehouse bus matrix.

Business process Common dimensions Date Product Store Promotion Warehouse Vendor Contract Shipper

Retail sales * * * * Retail inventory * * * Retail deliveries * * * Warehouse inventory * * * * Warehouse deliveries * * * * Purchase orders * * * * * * ...

44 / 48 Some other problems

• Querying fact tables or dimensions? It is generally reckoned that some 80 percent of data warehouse queries are dimension-browsing queries. This means that they do not access any fact table.

Ch. Todman (2001) • Example:

I How many customers do we have?

45 / 48 Outline

1 Motivation

2 Conceptual Schemes of Data Warehouses

3 Dimensional Modeling

4 Summary

46 / 48 Summary

• Three goals of the logical design of data warehouse: simplicity, expressiveness and performance. • The most popular conceptual schema: star schema. • Designing data warehouses is not an easy task . . .

47 / 48 Bibliography

• Ch. Todman. Designing a Data Warehouse: Supporting Customer Relationship Management. Prentice Hall PTR, 2001

• Z. Kr´olikowski. Hurtownie danych: logiczne i fizyczne struktury danych. Wydawnictwo Politechniki Pozna´nskiej,2007

• R. Kimball and M. Ross. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition. John Wiley & Sons, 2013 • http://www.kimballgroup.com/

48 / 48