Data Warehouse Logical Modeling and Design (6)

Bernard ESPINASSE
Professor at Aix-Marseille Université (AMU)
Ecole Polytechnique Universitaire de Marseille
January, 2020

• Methodological framework
• Logical Modeling: The Multidimensional Model
• Logical Design : From Fact schema to ROLAP Logical schema
• ROLAP schema in MDX for Mondrian

1. Methodological Framework
• Conceptual Design & Logical Design
• Design Phases and schemata derivations

2. Logical Modelling: The Multidimensional Model
• Issues in Logical Design
• The Multidimensional Model: fact, measures, dimensions

3. Implementing a Dimensional Model in ROLAP
• Star schema
• Snowflake schema
• Aggregates and views

4. Logical Design: From Fact schema to ROLAP
• From fact schema to relational star-schema: basic rules
• Examples towards Relational Star Schema
• Examples towards Relational Snowflake Schema
• Advanced logical modelling

5. ROLAP schema in MDX for Mondrian

Bernard ESPINASSE - Data Warehouse Logical Modelling and Design 1 Bernard ESPINASSE - Data Warehouse Logical Modelling and Design 2

References :
• Books
§ Golfarelli M., Rizzi S., « Data Warehouse Design: Modern Principles and Methodologies », McGraw-Hill, 2009.
§ Kimball R., Ross M., « Entrepôts de données : guide pratique de modélisation dimensionnelle », 2nd edition, Ed. Vuibert, 2003, ISBN: 2-7117-4811-1.
§ Franco J-M., « Le Data Warehouse », Ed. Eyrolles, Paris, 1997, ISBN: 2-212-08956-2.
• Courses
§ Course of M. Golfarelli and S. Rizzi, University of Bologna
§ Course of M. Böhlen and J. Gamper, Free University of Bolzano
§ …

Methodological framework, outline :
• Conceptual Design & Logical Design
• Life-Cycle
• Design Phases
• Schemata derivations

Methodological framework

• Entity-Relationship models are not very useful in modeling DWs
• It is now universally recognized that a DW is based on a multidimensional view of data :
§ But there is still no agreement on HOW to implement its conceptual design !
• Most of the time, DW design is at the logical level : a multidimensional model (star/snowflake schema) is directly designed :
§ But a star schema is nothing but a relational schema: it contains only the definition of a set of relations and integrity constraints !
• A better approach:
§ 1) Design first a conceptual model : Conceptual Design
§ 2) Then translate this conceptual model into a logical model : Logical Design

[Figure: DW design life-cycle. Phases: analysis of the operational db, requirement specification, conceptual design, workload refinement, logical design, physical design. Actors: db administrator, designer, final user, business user. DWs are based on a pre-existing information system. Source: J. Gamper, Free University of Bolzano, DWDM 2012/13]

Methodological framework (2)


Logical Design transforms the Conceptual Schema for a DM into a Logical Schema :
• Choice of the type of logical schema
• Translation of conceptual schema to logical schema
• Optimization (view materialization, fragmentation)

Different principles from the ones used in operational databases :
• data redundancy
• denormalization of tables

[Figure: design phases from the E/R Scheme to the Conceptual, Logical (Relational) and Physical Schemes through CONCEPTUAL DESIGN, LOGICAL DESIGN and PHYSICAL DESIGN. Inputs: Facts, Workload, Preliminary workload, Target logical model, Target DBMS. The sample store and sales tables shown in the figure are not recoverable.]

(Figure source: J. Gamper, Free University of Bolzano, DWDM 2012-13)

Logical Modelling: The Multidimensional Model
• The Multidimensional Model:
§ Fact,
§ Measures,
§ Dimensions
• Star and Snowflake Schemata
• Aggregates and Views

Multidimensional Model :
• is a logical model
• has one purpose: Data analysis
• is the most popular for DW
• has more built-in “meaning” :
§ What is important
§ What describes the important
§ What we want to optimize
§ Automatic aggregations mean easy querying
• is recognized by OLAP/BI tools : Tools offer powerful query facilities based on MD design


The multidimensional model :
• Data is divided into facts (with measures) and dimensions
• Facts :
§ are the important entity, e.g., a sale
§ have measures that can be aggregated, e.g., sales price
• Dimensions :
§ describe facts
§ Ex : a sale has the dimensions Product, Store and Time
• Goal for dimensional modeling:
§ Surround facts with as much context/dimensions as possible (redundancy may be ok in well-chosen places)
§ But you should not try to model all relationships in the data (unlike E/R and OO modeling!)
Facts (data) “live” in a multidimensional « cube »

Facts :
Facts represent the subject of the desired analysis : the ”important” in the business that should be analyzed
• A fact is most often identified via its dimension values :
§ A fact is a non-empty cell
§ Some models give facts an explicit identity
• Generally, a fact should :
§ be attached to exactly one dimension value in each dimension;
§ only be attached to dimension values in the bottom levels
§ Ex : if the lowest time granularity is day, for each fact the exact day should be specified
§ some models do not require this.

Different types of facts are distinguished :
• Event facts (transaction) : a fact for every business event (Ex : sale)
• « Fact-less » facts :
§ A fact per event (Ex : customer contact)
§ No numerical measures
§ An event happened for a dimension value combination
• Snapshot fact :
§ A fact for every dimension combination at given time interval
§ Captures current status (Ex : inventory)
• Cumulative snapshot facts :
§ A fact for every dimension combination at given time interval
§ Captures cumulative status up to now (Ex : sales to date)
Every type of fact answers different questions : often both event facts and snapshot facts exist

Dimensions :
• Dimensions are the core of multidimensional databases
• Other types of databases do not support dimensions
• Dimensions are used for :
§ Selection of data
§ Grouping of data at the right level of detail
• Dimensions consist of dimension values
Ex :
§ Product dimension has values ”milk”, ”cream”, ...
§ Time dimension has values ”1/1/2001”, ”2/1/2001”, ...
• Dimension values may have an ordering :
§ Used for comparing cube data across values
§ Especially used for Time dimension
Ex : percentage of sales increase compared with last month


Dimension hierarchies :
• Dimensions encode hierarchies with levels : Typically 3-5 levels (of detail)
• Dimension values are organized in a tree structure or lattice :
§ Ex : Product: Product -> Type -> Category
§ Ex : Store: Store -> Area -> City -> County
§ Ex : Time: Day -> Month -> Quarter -> Year
• Dimensions have a bottom level and a top level (ALL)
• Levels may have attributes : simple, non-hierarchical information
§ Ex : Day has Workday as attribute
• General rule: dimensions should contain much information :
§ Ex : Time dimensions may contain holiday, season, events, ...
Good dimensions have 50-100 or more attributes/levels

• A location dimension with attributes street, city, province_or_state, and country encodes implicitly the following hierarchy :
Instance :
§ All
§ country : Canada, USA
§ province_or_state : British Columbia, Quebec (Canada) ; New York, Illinois (USA)
§ city : Vancouver ... Victoria (British Columbia) ; Montréal ... Québec City (Quebec) ; New York ... Buffalo (New York) ; Chicago ... Urbana (Illinois)
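As a minimal sketch of how a hierarchy supports grouping at the right level of detail, the Python below rolls a few sales up the location hierarchy city -> province_or_state -> country -> ALL. The quantities and the function name `roll_up` are illustrative assumptions, not part of the course material.

```python
from collections import defaultdict

# Location dimension: each bottom-level value (city) carries its ancestors.
# City/state/country names follow the location example; quantities are invented.
LOCATION = {
    "Vancouver": {"province_or_state": "British Columbia", "country": "Canada"},
    "Montréal": {"province_or_state": "Quebec", "country": "Canada"},
    "Chicago": {"province_or_state": "Illinois", "country": "USA"},
    "Urbana": {"province_or_state": "Illinois", "country": "USA"},
}

# Facts are attached to bottom-level dimension values only: (city, quantity).
FACTS = [("Vancouver", 10), ("Montréal", 5), ("Chicago", 7), ("Urbana", 3)]


def roll_up(facts, level):
    """Sum quantities at one level: city, province_or_state, country or ALL."""
    totals = defaultdict(int)
    for city, quantity in facts:
        if level == "city":
            key = city
        elif level == "ALL":
            key = "ALL"
        else:
            key = LOCATION[city][level]
        totals[key] += quantity
    return dict(totals)


print(roll_up(FACTS, "province_or_state"))
print(roll_up(FACTS, "country"))
print(roll_up(FACTS, "ALL"))
```

Because every fact is attached to exactly one bottom-level value, each coarser level is obtained by following the parent links, and the grand total is the same at every level.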

Cubes :
• A cube may have many dimensions :
§ More than 3 : the term ”hypercube” is sometimes used
§ Theoretically no limit for the number of dimensions
§ Typical cubes have 4-12 dimensions
• But only 2-3 dimensions can be viewed at a time :
§ Dimensionality reduced by queries via projection/aggregation
• A cube consists of cells :
§ A given combination of dimension values
§ A cell can be empty (no data for this combination)
§ A sparse cube has many empty cells
§ A dense cube has few empty cells
§ Cubes become sparser for many/large dimensions.

Granularity :
• Granularity of facts is important :
§ What does a single fact mean?
§ Determines the level of detail
§ Given by the combination of bottom levels
§ Ex : ”total sales per store per day per product”
• Important for number of facts : Scalability
• Often the granularity is a single business transaction :
§ Ex : sale
§ Sometimes the data is aggregated (total sales per store per day per product)
§ Aggregation might be necessary for scalability
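The notions of cell, sparsity and dimensionality reduction can be illustrated with a small sketch. The cube below stores only its non-empty cells; the dimension values, quantities and function names are assumptions chosen for the example.

```python
from collections import defaultdict

PRODUCTS = ["milk", "cream", "butter"]
STORES = ["S1", "S2"]
DAYS = ["2001-01-01", "2001-01-02"]

# Only non-empty cells are stored: (product, store, day) -> quantity sold.
CELLS = {
    ("milk", "S1", "2001-01-01"): 12,
    ("milk", "S1", "2001-01-02"): 8,
    ("milk", "S2", "2001-01-01"): 7,
    ("cream", "S2", "2001-01-02"): 4,
}


def sparsity(cells, *dimensions):
    """Fraction of empty cells among all possible dimension value combinations."""
    possible = 1
    for values in dimensions:
        possible *= len(values)
    return 1 - len(cells) / possible


def aggregate_away_day(cells):
    """Reduce dimensionality from 3 to 2 by summing over the day dimension."""
    totals = defaultdict(int)
    for (product, store, _day), quantity in cells.items():
        totals[(product, store)] += quantity
    return dict(totals)


print(round(sparsity(CELLS, PRODUCTS, STORES, DAYS), 3))
print(aggregate_away_day(CELLS))
```

Adding a value to any dimension multiplies the number of possible cells while the number of facts stays the same, which is why cubes become sparser as dimensions grow.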


Measures :
• Measures represent the fact property that users want to study and analyze :
§ Ex : the total sales price
• A measure has 2 components :
§ Numerical value (ex : sales price)
§ Aggregation formula (ex : SUM): used for aggregating / combining a number of measure values into one
• Additivity is an important property for measures :
§ Single fact table rows are (almost) never retrieved, but aggregations over millions of fact rows
• Measure value determined by the combination of dimension values :
§ Measure value is important for all aggregation levels.

3 types of measures are distinguished :
• Additive measures :
§ Can be aggregated over all dimensions using SUM operator
§ Ex : sales price, gross profit computed from sales and cost
§ Often occur in event facts
• Semi-additive measures :
§ Cannot be aggregated over some dimensions, typically time
§ Ex : inventory: additive across product or store, non-additive across time
§ Often occur in snapshot facts
• Non-additive measures :
§ Cannot be aggregated over any dimensions
§ Ex : unit cost
§ Occur in all types of facts
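A short sketch of a semi-additive measure, assuming an inventory snapshot with one stock level per day, store and product: the level can be summed across stores for a given day, but summing it across days counts the same stock twice, so another operator (here the average) is used over time. Data and function names are invented for the illustration.

```python
from collections import defaultdict

# Snapshot facts: (day, store, product, stock level). Values are invented.
INVENTORY = [
    ("2001-01-01", "S1", "milk", 10),
    ("2001-01-01", "S2", "milk", 20),
    ("2001-01-02", "S1", "milk", 8),
    ("2001-01-02", "S2", "milk", 16),
]


def level_per_day(rows):
    """SUM across stores is meaningful: total stock held on each day."""
    totals = defaultdict(int)
    for day, _store, _product, level in rows:
        totals[day] += level
    return dict(totals)


def average_level_over_time(rows):
    """Across time, aggregate the daily totals with AVG instead of SUM."""
    per_day = level_per_day(rows)
    return sum(per_day.values()) / len(per_day)


print(level_per_day(INVENTORY))
print(average_level_over_time(INVENTORY))
# A plain SUM over every row would mix the two days and overstate the stock.
print(sum(level for *_ignored, level in INVENTORY))
```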

• 1. Choose the business process(es) to model :
§ Ex : Sales
• 2. Choose the granularity of the business process :
§ Ex : Items by Store by Promotion by Day
§ Is low granularity needed ?
§ Are individual transactions necessary/feasible ?
• 3. Choose the dimensions :
§ Ex : Time, Store, ...
• 4. Choose the measures :
§ Ex : Dollar_sales, unit_sales, dollar_cost, customer_count

Implementing a Dimensional Model in ROLAP
• Star schema
• Snowflake schema
• Aggregates and views


Star schema :
• Is a common approach to draw a dimensional model
• Consists of : one fact table and many dimension tables :
§ fact table : SalesFT (ProductId, StoreId, DateId, Quantity)
§ dimension table : ProductDT (ProductId, Product, Type, Category)
§ dimension table : StoreDT (StoreId, Store, City, Country)
§ dimension table : DateDT (DateId, Day, DateFull, DayOfWeek, CalMonth, CalYear, Holiday)

(Figure source: J. Gamper, Free University of Bolzano, DWDM 2012-13)
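The star schema can be tried out with SQLite from Python. The sketch below is an assumption-laden illustration: table and column names follow the figure (with a shortened DateDT), while the rows, the in-memory database and the chosen query are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ProductDT (ProductId INTEGER PRIMARY KEY, Product TEXT, Type TEXT, Category TEXT);
CREATE TABLE StoreDT   (StoreId INTEGER PRIMARY KEY, Store TEXT, City TEXT, Country TEXT);
CREATE TABLE DateDT    (DateId INTEGER PRIMARY KEY, Day TEXT, CalMonth TEXT, CalYear INTEGER);
-- The fact table holds one foreign key per dimension plus the measure.
CREATE TABLE SalesFT (
    ProductId INTEGER REFERENCES ProductDT(ProductId),
    StoreId   INTEGER REFERENCES StoreDT(StoreId),
    DateId    INTEGER REFERENCES DateDT(DateId),
    Quantity  INTEGER,
    PRIMARY KEY (ProductId, StoreId, DateId)
);
INSERT INTO ProductDT VALUES (1, 'milk', 'dairy', 'Food'), (2, 'soap', 'hygiene', 'NonFood');
INSERT INTO StoreDT VALUES (1, 'Coop1', 'Marseille', 'France'), (2, 'Coop2', 'Aix', 'France');
INSERT INTO DateDT VALUES (1, '2020-01-06', '2020-01', 2020), (2, '2020-01-07', '2020-01', 2020);
INSERT INTO SalesFT VALUES (1, 1, 1, 5), (1, 1, 2, 3), (1, 2, 1, 4), (2, 1, 1, 2);
""")

# Roll up to (Category, City): one join per dimension actually used.
rows = conn.execute("""
    SELECT p.Category, s.City, SUM(f.Quantity)
    FROM SalesFT f
    JOIN ProductDT p ON p.ProductId = f.ProductId
    JOIN StoreDT s ON s.StoreId = f.StoreId
    GROUP BY p.Category, s.City
    ORDER BY p.Category, s.City
""").fetchall()
print(rows)
```

The hierarchy Product -> Type -> Category sits in plain columns of ProductDT, so grouping by a coarser level needs no extra join.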

Star schema, strengths :
• Simple -> ease-of-use
• Relatively flexible
• Fact table is normalized (dimension tables are not necessarily normalized)
• Dimension tables often relatively small
• “Recognized” by many RDBMS -> good performance

Star schema, weaknesses :
• Hierarchies are ”hidden” in the columns
• Dimension tables are de-normalized
=> From relational Star schema to relational « Snowflake » schema

Constellation schema :
• Merges several star schemata which use common dimensions
• Consequently has several facts, and dimensions that are common or not
Ex : Drug sales in pharmacies
§ fact table : SALE (DateId, ProductId, StoreId, Quantity, Amount)
§ fact table : PRESCRIPTION (DateId, DrugId, StoreId, DrugNumber, Honorary)
§ dimension table : DATE (DateId, Day, Month, Year)
§ dimension table : PRODUCT (ProductId, Gamme, ProductName, Color)
§ dimension table : DRUG (DrugId, Category, Molecule, SecondaryEffect, Posology)
§ dimension table : STORE (StoreId, Store, City, Country)
• Made up of 2 star schemas :
§ the first concerning sales in pharmacies (SALE)
§ the second concerning medical prescriptions (PRESCRIPTION)
• DATE and STORE dimensions are shared by PRESCRIPTION and SALE facts.


Snowflake schema :
• Is obtained from a star schema by breaking down one or more dimension tables into smaller tables to remove transitive functional dependencies
• Dimension tables referenced in the fact table are « primary dimension tables » and the others are « secondary dimension tables »
§ fact table : SALES (ProductId, StoreId, DateId, Sale)
§ primary dimension tables : PRODUCT (ProductId, Product, TypeId), DATE (DateId, Day, MonthId), STORE (StoreId, Store, CityId)
§ secondary dimension tables : TYPE (TypeId, Type, CategoryId), CATEGORY (CategoryId, category), MonthYearDescription (MonthId, Month, YearId), CITY (CityId, StoreCity, State, Country)

Snowflake schema, strengths :
• Hierarchies are made explicit/visible
• Very flexible
• Dimension tables use less space :
§ However this is a minor saving
§ Disk space of dimensions is typically less than 5 percent of disk for DW

Snowflake schema, weaknesses :
• Harder to use due to many joins
• Worse performance :
Ex : efficient bitmap indexes are not applicable
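To make the derivation of a snowflake schema concrete, the sketch below splits a denormalized product dimension into PRODUCT, TYPE and CATEGORY tables, removing the transitive dependencies Product -> Type -> Category. The sample rows, the generated integer keys and the function name are assumptions; it also assumes each Type belongs to exactly one Category.

```python
# Denormalized star-schema dimension rows (invented sample data).
PRODUCT_DT = [
    {"ProductId": 1, "Product": "milk", "Type": "dairy", "Category": "Food"},
    {"ProductId": 2, "Product": "cream", "Type": "dairy", "Category": "Food"},
    {"ProductId": 3, "Product": "soap", "Type": "hygiene", "Category": "NonFood"},
]


def snowflake_product(rows):
    """Return (PRODUCT, TYPE, CATEGORY) tuples with generated TypeId/CategoryId keys."""
    type_ids, category_ids = {}, {}
    product, type_table, category = [], [], []
    for row in rows:
        if row["Category"] not in category_ids:
            category_ids[row["Category"]] = len(category_ids) + 1
            category.append((category_ids[row["Category"]], row["Category"]))
        if row["Type"] not in type_ids:
            type_ids[row["Type"]] = len(type_ids) + 1
            type_table.append(
                (type_ids[row["Type"]], row["Type"], category_ids[row["Category"]])
            )
        # The primary dimension table keeps only a reference to its parent level.
        product.append((row["ProductId"], row["Product"], type_ids[row["Type"]]))
    return product, type_table, category


PRODUCT, TYPE, CATEGORY = snowflake_product(PRODUCT_DT)
print(PRODUCT)
print(TYPE)
print(CATEGORY)
```

Each type and category value is now stored once, at the cost of two extra joins whenever a query groups by category.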

Aggregates and views :
View = fact table that includes aggregate data
• Primary view : defined in the fact schema dimensions (primary group-by sets), populated by operational data
• Secondary views : new fact tables including aggregate(s) (secondary group-by sets), populated by other views (not by operational data)

Sales star schema :
§ STORE (keyStore, store, storeCity, state, country, salesManager, salesDistrict)
§ DATE (keyDate, date, month, quarter, year, day, week, holiday)
§ PRODUCT (ProductId, Product, Type, Category)
§ SALES (keyStore, keyDate, keyProduct, quantity, receipts, unitPrice, nbOfCustomer)

Some views on the sales schema :
§ v1 = {product, date, store}
§ v2 = {type, date, storeCity}
§ v3 = {category, month, storeCity}
§ v4 = {type, month, state}
§ v5 = {quarter, state}
• v1 = primary view and v2 … v5 = secondary views
• if vi -> vj then vj can be calculated by aggregating vi data

Solution 1 : the easiest solution, in a star schema, is to store both primary view data and secondary view data in the same fact table :

SALES (fact table)
keyStore  keyDate  keyProduct  quantity  receipts  …
1         1        1           170       85        …
2         1        1           300       150       …
3         1        1           1700      850       …
…         …        …           …         …         …

STORE (dimension table), - = NULL
keyStore  store  storeCity  state  …
1         Coop1  Columbus   Ohio   …
2         -      Austin     Texas  …
3         -      -          Texas  …
…         …      …          …      …

Aggregation level of fact table tuples is specified by the corresponding tuples in the dimension table :
• the dimension table STORE related to aggregate data will have NULL values in all attributes whose aggregation level is finer than the current one
• in the fact table SALES :
§ the first tuple is related to one sale,
§ the second tuple stores aggregate data for every sale in Austin,
§ the third tuple sums up all the sales in Texas.
Only one fact table is used for requests, but performance is bad because the table is too big.
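A sketch of how Solution 1 can be queried, using the three SALES and STORE tuples from the slide in an in-memory SQLite database. The NULL pattern of the STORE row tells which aggregation level a fact row belongs to; the `LEVEL_FILTER` mapping and the function name are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE STORE (keyStore INTEGER PRIMARY KEY, store TEXT, storeCity TEXT, state TEXT);
CREATE TABLE SALES (keyStore INTEGER, keyDate INTEGER, keyProduct INTEGER,
                    quantity INTEGER, receipts INTEGER);
INSERT INTO STORE VALUES (1, 'Coop1', 'Columbus', 'Ohio'),
                         (2, NULL, 'Austin', 'Texas'),
                         (3, NULL, NULL, 'Texas');
INSERT INTO SALES VALUES (1, 1, 1, 170, 85), (2, 1, 1, 300, 150), (3, 1, 1, 1700, 850);
""")

# Attributes finer than the requested level must be NULL in the dimension row.
LEVEL_FILTER = {
    "store": "s.store IS NOT NULL",
    "storeCity": "s.store IS NULL AND s.storeCity IS NOT NULL",
    "state": "s.store IS NULL AND s.storeCity IS NULL",
}


def sales_at_level(level):
    """Return (level value, quantity, receipts) for fact rows stored at that level."""
    query = f"""
        SELECT s.{level}, f.quantity, f.receipts
        FROM SALES f JOIN STORE s ON s.keyStore = f.keyStore
        WHERE {LEVEL_FILTER[level]}
    """
    return conn.execute(query).fetchall()


print(sales_at_level("store"))
print(sales_at_level("storeCity"))
print(sales_at_level("state"))
```

Every query has to carry such a level filter, otherwise detailed and pre-aggregated rows would be summed together and counted twice.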


Solution 2: a common solution is to store different group-by sets into separate fact tables in a constellation schema :

[Figure: constellation schema with the V1 view fact table (keyStore, keyDate, keyProduct, quantity, receipts, unitPrice, nbOfCustomer) and the V5 view fact table, sharing the dimension tables STORE, DATE and PRODUCT]

• here one fact table concerns the primary view V1 and another the secondary view V5
• dimension tables can be merged as in solution 1, or replicated for each aggregate view
• the fact tables that correspond to the group-by sets where one or more dimensions are completely aggregated do not have foreign keys referencing these dimensions
The size of the fact table is far larger than the size of the dimension tables, so performance depends on the fact table optimisation.

Solution 3: dimension tables are replicated for every view, including only the set of attributes that are valid for the aggregation level at which that dimension table is created :

[Figure: V1 view with its dimension tables STORE, DATE and PRODUCT; V5 view with its own dimension tables STATE (keyState, state, country) and QUARTER (keyQuarter, quarter, year)]

• here are the V1 primary view and the V5 secondary view with their specific dimension tables
• note that the key of the fact table for secondary view V5 has no attribute related to the product hierarchy, which is completely aggregated
This solution is the best in performance because the table accesses are optimized; however, replicated dimension tables need disk space.
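The rule that a secondary view is populated from another view rather than from operational data can be sketched as follows: the V5 = {quarter, state} fact table is materialized by aggregating V1 through its dimension tables. The rows are invented, only one measure is kept, and the date dimension is named `DATE_DT` here as an assumption to keep the SQL unambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE STORE   (keyStore INTEGER PRIMARY KEY, store TEXT, storeCity TEXT, state TEXT);
CREATE TABLE DATE_DT (keyDate INTEGER PRIMARY KEY, date TEXT, quarter TEXT);
CREATE TABLE V1 (keyStore INTEGER, keyDate INTEGER, keyProduct INTEGER, quantity INTEGER);
INSERT INTO STORE VALUES (1, 'Coop1', 'Columbus', 'Ohio'),
                         (2, 'Coop2', 'Austin', 'Texas'),
                         (3, 'Coop3', 'Dallas', 'Texas');
INSERT INTO DATE_DT VALUES (1, '2020-01-15', '2020-Q1'),
                           (2, '2020-02-10', '2020-Q1'),
                           (3, '2020-04-02', '2020-Q2');
INSERT INTO V1 VALUES (1, 1, 1, 10), (2, 1, 1, 20), (3, 2, 2, 30), (2, 3, 1, 5);

-- Secondary view: product is completely aggregated, so V5 has no product key.
CREATE TABLE V5 AS
    SELECT d.quarter AS quarter, s.state AS state, SUM(f.quantity) AS quantity
    FROM V1 f
    JOIN STORE s ON s.keyStore = f.keyStore
    JOIN DATE_DT d ON d.keyDate = f.keyDate
    GROUP BY d.quarter, s.state;
""")

v5 = conn.execute(
    "SELECT quarter, state, quantity FROM V5 ORDER BY quarter, state"
).fetchall()
print(v5)
```

Queries at the quarter/state level can then read the small V5 table instead of scanning and joining V1.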

Solution 4: a compromise, where aggregate views are materialized in a snowflake schema :

[Figure: V1 view and V5 view in one snowflake schema. STORE (keyStore, store, storeCity, keyState) references STATE (keyState, state, country); DATE (keyDate, date, month, keyQuarter, day, week, holiday) references QUARTER (keyQuarter, quarter, year); PRODUCT (ProductId, Product, Type, Category). The V5 view carries the measures quantity, receipts, unitPrice and nbOfCustomer at the STATE and QUARTER levels.]

• takes advantage of the optimization achieved with aggregate data by aggregation level, without replicating dimension tables
The sales fact is modeled by a snowflake schema.

Logical Design : From Fact schema to ROLAP
• From fact schema to relational star-schema: basic rules
• Examples towards Relational Star Schema
• Examples towards Relational Snowflake Schema
• Advanced logical modelling


Steps to derive Dimensional Fact (DF) schemata from the relational schema of operational sources (DB) :
• 1. Finding and defining FACT from Relational/E-R schema
• 2. Building the Attribute Tree from Relational/E-R schema
• 3. Building the Fact Schemata from Attribute Tree

Logical Design :
• transforms the Fact Schemata into a Logical Schema, a Relational Schema (in ROLAP strategy)

Conceptual Design : Fact Schema
[Figure: rental fact schema; the labels of the figure are not recoverable]

Relational Schema :
The logical schema obtained from the previous rental fact schema is the following snowflake schema :
• DT_CAR (idCar, Car, Fuel, Model, Brand, Category, registrationDate:DT_DATE)
• DT_OFFICE (idOffice, Office, City, State, Country, Area)
• DT_DATE (idDate, Date, Month, Year)
• FT_RENTAL (pickupOffice:DT_OFFICE, dropoffOffice:DT_OFFICE, idCar:DT_CAR, pickupDate:DT_DATE, PaymentMode, Amount, Discount, Duration, Miles)

Fact schema :
[Figure: fact SALE with measure quantity and dimensions product (product -> type -> category), week (week -> month -> year) and store (store -> city -> country)]

R1. From « Fact schema » to « Tables » :
• the fact box of the fact schema leads to the creation of a Fact Table (FT)
• each dimension leads to the creation of a Dimension Table (DT)

R2. Dimension table (DT) attributes :
• dimension attributes of the fact schema become DT attributes
• the first dimension attribute (first level) becomes the DT key attribute

R3. Fact table (FT) attributes :
• the DT key attribute of each dimension becomes an FT foreign key
• measure attributes of the fact schema become FT measures
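Rules R1 to R3 are mechanical enough to be sketched as a small function. The dictionary layout, the `<Dimension>Id` key naming (which mirrors the ProductId / StoreId style of the examples rather than reusing the first level itself as key) and the function name are assumptions made for this illustration.

```python
# A fact schema described as plain data (hypothetical representation).
FACT_SCHEMA = {
    "fact": "sale",
    "measures": ["quantity"],
    "dimensions": {
        "product": ["product", "type", "category"],
        "store": ["store", "city", "country"],
        "week": ["week", "month", "year"],
    },
}


def to_star_schema(fact_schema):
    """Apply R1-R3 and return {table name: [column names]}."""
    tables = {}
    foreign_keys = []
    for dimension, levels in fact_schema["dimensions"].items():
        # R1: one dimension table per dimension; R2: levels become DT attributes.
        key = dimension.capitalize() + "Id"
        tables[dimension.capitalize() + "DT"] = [key] + [l.capitalize() for l in levels]
        foreign_keys.append(key)
    # R1: one fact table; R3: DT keys become foreign keys, measures become columns.
    measures = [m.capitalize() for m in fact_schema["measures"]]
    tables[fact_schema["fact"].capitalize() + "FT"] = foreign_keys + measures
    return tables


STAR = to_star_schema(FACT_SCHEMA)
for name, columns in STAR.items():
    print(f"{name} ({', '.join(columns)})")
```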

Relational schema :
• SaleFT (ProductId, StoreId, DateId, Quantity)
• ProductDT (ProductId, Product, Type, Category)
• StoreDT (StoreId, Store, City, State, Country)
• DateDT (DateId, Day, Month, Year)

[Figure: star schema with fact table SalesFT (ProductId, StoreId, DateId, Quantity) and dimension tables ProductDT, StoreDT and DateDT]

Instances :
[Figure: sample instances of the fact and dimension tables; the values are not recoverable. Source: J. Gamper, Free University of Bolzano, DWDM 2012-13]


OLAP Query on Star schema :
Total quantity sold for each product type, week, and city, only for food products (the query assumes a time dimension table WeekDT with a WeekID key that is also a foreign key of SaleFT) :

    SELECT City, Week, Type, SUM(Quantity)
    FROM WeekDT, StoreDT, ProductDT, SaleFT
    WHERE WeekDT.WeekID = SaleFT.WeekID AND
          StoreDT.StoreID = SaleFT.StoreID AND
          ProductDT.ProductID = SaleFT.ProductID AND
          ProductDT.Category = 'Food'
    GROUP BY City, Week, Type;

Relational schema (snowflake) :
• SaleFT (ProductId, StoreId, DateId, Sale)
• ProductDT (ProductId, Product, TypeId)
• ProductType (TypeId, Type, CategoryId)
• StoreDT (StoreId, Store, City, Country)
• DateDT (DateId, Day, MonthId)
• MonthYearDescription (MonthId, Month, YearId)

[Figure: fact table SaleFT; primary dimension tables ProductDT, StoreDT and DateDT; secondary dimension tables ProductType and MonthYearDescription]

Instances :
[Figure: sample instances of the snowflake tables; the values are not recoverable. Source: J. Gamper, Free University of Bolzano, DWDM 2012-13]

OLAP Query on Snowflake schema :
Total quantity sold for each product type, week, and city, only for food products :

    SELECT City, Week, Type, SUM(Quantity)
    FROM WeekDT, StoreDT, ProductDT, CityDT, TypeDT, SaleFT
    WHERE WeekDT.WeekID = SaleFT.WeekID AND
          StoreDT.StoreID = SaleFT.StoreID AND
          ProductDT.ProductID = SaleFT.ProductID AND
          StoreDT.CityID = CityDT.CityID AND
          ProductDT.TypeID = TypeDT.TypeID AND
          ProductDT.Category = 'Food'
    GROUP BY City, Week, Type;
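The snowflake query can be executed against a toy database to see the extra joins at work. In the sketch below the table names come from the query itself; the column lists, the sample rows and the added ORDER BY (used only to make the output order deterministic) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE WeekDT (WeekID INTEGER PRIMARY KEY, Week TEXT);
CREATE TABLE CityDT (CityID INTEGER PRIMARY KEY, City TEXT);
CREATE TABLE StoreDT (StoreID INTEGER PRIMARY KEY, Store TEXT,
                      CityID INTEGER REFERENCES CityDT(CityID));
CREATE TABLE TypeDT (TypeID INTEGER PRIMARY KEY, Type TEXT);
CREATE TABLE ProductDT (ProductID INTEGER PRIMARY KEY, Product TEXT,
                        TypeID INTEGER REFERENCES TypeDT(TypeID), Category TEXT);
CREATE TABLE SaleFT (ProductID INTEGER, StoreID INTEGER, WeekID INTEGER, Quantity INTEGER);
INSERT INTO WeekDT VALUES (1, '2020-W01');
INSERT INTO CityDT VALUES (1, 'Marseille'), (2, 'Aix');
INSERT INTO StoreDT VALUES (1, 'Coop1', 1), (2, 'Coop2', 1), (3, 'Coop3', 2);
INSERT INTO TypeDT VALUES (1, 'dairy'), (2, 'hygiene');
INSERT INTO ProductDT VALUES (1, 'milk', 1, 'Food'), (2, 'cream', 1, 'Food'),
                             (3, 'soap', 2, 'NonFood');
INSERT INTO SaleFT VALUES (1, 1, 1, 5), (2, 2, 1, 3), (1, 3, 1, 4), (3, 1, 1, 9);
""")

SNOWFLAKE_QUERY = """
SELECT City, Week, Type, SUM(Quantity)
FROM WeekDT, StoreDT, ProductDT, CityDT, TypeDT, SaleFT
WHERE WeekDT.WeekID = SaleFT.WeekID AND
      StoreDT.StoreID = SaleFT.StoreID AND
      ProductDT.ProductID = SaleFT.ProductID AND
      StoreDT.CityID = CityDT.CityID AND
      ProductDT.TypeID = TypeDT.TypeID AND
      ProductDT.Category = 'Food'
GROUP BY City, Week, Type
ORDER BY City
"""

result = conn.execute(SNOWFLAKE_QUERY).fetchall()
print(result)
```

Compared with the star version, City and Type are reached through the secondary tables CityDT and TypeDT, which adds two join conditions to the WHERE clause.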


Descriptive attributes :
Conceptual Modelling:
• A descriptive attribute contains additional information about an attribute of the hierarchy
• It cannot be used for aggregation
Logical Modelling:
• If a descriptive attribute is linked to a dimensional attribute, it has to be included in the dimension table for the hierarchy that contains it
Ex : the size of the product has to be included in the PRODUCT table
• If a descriptive attribute is linked to a fact, it has to be included in the fact table with the measures

[Figure: SALE fact schema (measures quantity, receipts, unitPrice, numberOfCustomer) and the resulting tables PRODUCT, STORE and SALES]

Cross-dimensional attributes :
Conceptual Modelling: A cross-dimensional attribute b is a dimensional or descriptive attribute whose value is defined by the combination of 2 or more dimensional attributes a1, …, am, possibly belonging to different hierarchies.
Ex : a product Value Added Tax (VAT) depends both on the product category and on the country where the product is sold
Logical Modelling: the translation leads to a new table that includes the cross-dimensional attribute b and has the attributes a1, …, am as primary key

[Figure: table VAT (category, country, VAT) next to PRODUCT (productId, product, type, category, ...), STORE (storeId, store, storeCity, state, country, ...) and SALES (productId, storeId, dateId, quantity, ...)]
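A sketch of the VAT example: the cross-dimensional attribute lives in its own table, keyed on (category, country), and is reached from a sale through the product and store dimensions. Table names follow the slide; the reduced column lists, the rows and in particular the VAT rates are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PRODUCT (productId INTEGER PRIMARY KEY, product TEXT, category TEXT);
CREATE TABLE STORE   (storeId INTEGER PRIMARY KEY, store TEXT, country TEXT);
-- Cross-dimensional attribute: its key combines attributes of two hierarchies.
CREATE TABLE VAT (category TEXT, country TEXT, VAT REAL, PRIMARY KEY (category, country));
CREATE TABLE SALES (productId INTEGER, storeId INTEGER, dateId INTEGER, quantity INTEGER);
INSERT INTO PRODUCT VALUES (1, 'milk', 'Food'), (2, 'soap', 'Hygiene');
INSERT INTO STORE VALUES (1, 'Coop1', 'France'), (2, 'Coop2', 'Italy');
INSERT INTO VAT VALUES ('Food', 'France', 0.055), ('Food', 'Italy', 0.04),
                       ('Hygiene', 'France', 0.20), ('Hygiene', 'Italy', 0.22);
INSERT INTO SALES VALUES (1, 1, 1, 10), (2, 2, 1, 3);
""")

vat_per_sale = conn.execute("""
    SELECT p.product, s.country, v.VAT
    FROM SALES f
    JOIN PRODUCT p ON p.productId = f.productId
    JOIN STORE s ON s.storeId = f.storeId
    JOIN VAT v ON v.category = p.category AND v.country = s.country
    ORDER BY p.product
""").fetchall()
print(vat_per_sale)
```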

Shared hierarchies :
Conceptual Modelling: Entire portions of hierarchies are frequently replicated 2 or more times in fact schemata.
Ex : calling and called phone numbers

[Figure: CALL fact schema (measures number, duration) with the calling and called dimensions sharing the hierarchy telNumber -> district, type; plus date (date -> month -> year) and hour]

Logical Modelling: we have to avoid multiple dimension tables that contain all or part of the same data.
Ex : the hierarchies contain exactly the same data, so we choose to insert in the fact table 2 foreign keys referencing the single table that models the telephone number :
§ CALLS (KeyCalling, KeyCalled, KeyDate, KeyHour, number, duration)
§ NUMBERS (KeyNumber, telNumber, type, district)

Multiple arcs :
Conceptual Modelling: A multiple arc models a many-to-many association between the 2 dimensional attributes it connects
Ex : in a fact schema modeling the sales of books, book is a dimension, but it would not be relevant to model author as a dimensional child attribute of book because many different authors can write many books => the relationship between books and authors is modeled as a multiple arc

[Figure: SALE fact schema (measures quantity, receipts, unitPrice, numberOfCustomer) with dimensions book (genre, author through a multiple arc) and date (date -> month -> quarter -> year, week)]

Logical Modelling: the solution is to insert a bridge table to model multiple arcs :
§ SALES (KeyBook, KeyDate, number, receipts)
§ BOOK (KeyBook, book, genre)
§ BRIDGE_AUTHOR (KeyBook, KeyAuthor, weight)
§ AUTHOR (KeyAuthor, author)
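The bridge table can be exercised with a weighted query: each sale is attributed to the authors of the book in proportion to the `weight` column, so that totals per author still add up to the overall receipts. Table names follow the slide; the rows, the equal weights and the weighted-sum query are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE BOOK   (KeyBook INTEGER PRIMARY KEY, book TEXT, genre TEXT);
CREATE TABLE AUTHOR (KeyAuthor INTEGER PRIMARY KEY, author TEXT);
-- Bridge table: one row per (book, author) pair; weights of a book sum to 1.
CREATE TABLE BRIDGE_AUTHOR (KeyBook INTEGER, KeyAuthor INTEGER, weight REAL,
                            PRIMARY KEY (KeyBook, KeyAuthor));
CREATE TABLE SALES (KeyBook INTEGER, KeyDate INTEGER, quantity INTEGER, receipts REAL);
INSERT INTO BOOK VALUES (1, 'Book A', 'novel'), (2, 'Book B', 'essay');
INSERT INTO AUTHOR VALUES (1, 'Author X'), (2, 'Author Y');
INSERT INTO BRIDGE_AUTHOR VALUES (1, 1, 0.5), (1, 2, 0.5), (2, 1, 1.0);
INSERT INTO SALES VALUES (1, 1, 10, 100.0), (2, 1, 4, 40.0);
""")

receipts_per_author = conn.execute("""
    SELECT a.author, SUM(f.receipts * b.weight)
    FROM SALES f
    JOIN BRIDGE_AUTHOR b ON b.KeyBook = f.KeyBook
    JOIN AUTHOR a ON a.KeyAuthor = b.KeyAuthor
    GROUP BY a.author
    ORDER BY a.author
""").fetchall()
print(receipts_per_author)
```

Without the weight, the receipts of a co-authored book would be counted once per author and the per-author totals would exceed the real total.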


Non-additive measures :
Conceptual Modelling: A non-additive measure can be explicitly specified with the operator(s) used for its aggregation, other than SUM.
Ex : AVG and MIN for inventory level

[Figure: INVENTORY fact schema with the non-additive measure level (AVG, MIN) and the measure incomingQuantity; dimensions product, date (date -> month -> quarter -> year, week) and warehouse (warehouse -> city -> country, address)]

Logical Modelling: set a new measure for each aggregation operator.
§ INVENTORY (KeyMonth, KeyType, avgLevel, minLevel, count, incomingQuantity)
§ MONTH (KeyMonth, month, year)
§ TYPE (KeyType, type, ...)

• count is a support measure necessary to calculate the average level
• minLevel is required to calculate the minimum level for each month and for each product type.
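Why the support measure count is needed can be shown with a small calculation: when monthly rows are rolled up further, the average level must be weighted by count, whereas the minimum can be combined directly. The numbers and the function name are invented for this sketch.

```python
# Aggregated INVENTORY rows for one product type (invented values).
MONTHLY = [
    {"month": "2020-01", "avgLevel": 10.0, "minLevel": 4, "count": 30},
    {"month": "2020-02", "avgLevel": 20.0, "minLevel": 6, "count": 10},
]


def roll_up_levels(rows):
    """Combine aggregated rows into one coarser row (for example a quarter)."""
    total_count = sum(row["count"] for row in rows)
    weighted_sum = sum(row["avgLevel"] * row["count"] for row in rows)
    return {
        "avgLevel": weighted_sum / total_count,  # AVG needs the counts
        "minLevel": min(row["minLevel"] for row in rows),  # MIN combines directly
        "count": total_count,
    }


combined = roll_up_levels(MONTHLY)
naive_average = sum(row["avgLevel"] for row in MONTHLY) / len(MONTHLY)
print(combined)
print(naive_average)
```

The naive average of the two monthly averages differs from the weighted one, because the first month contributes three times as many observations as the second.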

ROLAP schema in MDX for Mondrian

Book excerpt shown on the slide (5.3 Relational Implementation of the Conceptual Model, p. 127; it begins mid-sentence) :
... be kept in the dimensions to be able to match data from sources with data in the warehouse. Obviously, an alternative solution is to reuse the keys from the source systems in the data warehouse. Notice that a fact table obtained by the mapping rules above will contain the surrogate key of each level related to the fact with a one-to-many relationship, one for each role that the level is playing. The key of the table is composed of the surrogate keys of all the participating levels. Alternatively, if a surrogate key is added to the fact table, the combination of the surrogate keys of all the participating levels becomes an alternate key. As we will see in Sect. 5.5, more specialized rules are needed for mapping the various kinds of hierarchies that we studied in Chap. 4.

[Fig. 5.4: Relational representation of the Northwind data warehouse in Fig. 4.2. Tables: Product, Supplier, Category, Customer, City, State, Country, Continent, Employee, Territories, Shipper, Time, and the fact table Sales]

Mondrian :
• Mondrian is an open-source relational online analytical processing (ROLAP) server
• It is also known as Pentaho Analysis Services and is a component of the Pentaho Business Analytics suite
• In Mondrian, a cube schema, written in an XML syntax, defines a mapping between the physical structure of the relational data warehouse and the multidimensional cube
• A cube schema contains the declaration of cubes, dimensions, hierarchies, levels, measures, and calculated members
• A cube schema does not define the data source; this is done using a JDBC connection string.


[Listing: Mondrian cube schema of 24 numbered lines; the XML source did not survive extraction. The line ranges below refer to that listing.]

• The PhysicalSchema element (lines 3–5) defines the tables that are the source data for the dimensions and cubes, and the foreign key links between these tables
• The Dimension element (lines 6–8) defines the shared dimension Time, which is used several times in the MySociety cube for the role-playing dimensions OrderDate, DueDate, and ShippingDate. Shared dimensions can also be used in several cubes
• The Cube element (lines 9–23) defines the Sales cube. A cube schema contains dimensions and measures, the latter organized in measure groups
• The Dimensions element (lines 10–12) defines the dimensions of the cube
• The measure groups are defined using the element MeasureGroups (lines 13–22)
• The measure group Sales (lines 14–21) defines the measures using the Measures element (lines 15–17)
• The DimensionLinks element (lines 18–20) defines how measures relate to dimensions.
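The nesting of elements described above can be sketched by building a skeleton with Python's standard `xml.etree.ElementTree`. Only the element names and the dimension names come from the description; the schema name, every attribute (including `source`) and the empty element bodies are assumptions, so this is not a loadable Mondrian schema and not the listing from the slide.

```python
import xml.etree.ElementTree as ET

# Skeleton only: the schema name and all attributes are illustrative assumptions.
schema = ET.Element("Schema", name="NorthwindSketch")
ET.SubElement(schema, "PhysicalSchema")          # source tables and foreign key links
ET.SubElement(schema, "Dimension", name="Time")  # shared dimension

cube = ET.SubElement(schema, "Cube", name="Sales")
dimensions = ET.SubElement(cube, "Dimensions")
for role in ("OrderDate", "DueDate", "ShippingDate"):
    # Role-playing dimensions reuse the shared Time dimension.
    ET.SubElement(dimensions, "Dimension", name=role, source="Time")

measure_groups = ET.SubElement(cube, "MeasureGroups")
group = ET.SubElement(measure_groups, "MeasureGroup", name="Sales")
ET.SubElement(group, "Measures")        # the measures of the group
ET.SubElement(group, "DimensionLinks")  # how measures relate to dimensions

xml_text = ET.tostring(schema, encoding="unicode")
print(xml_text)
```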

Saiku is an open-source analytics client server. In the figure :
• the countries of customers and the categories of products are displayed on rows,
• the years are displayed on columns, and
• the sales amount measure is displayed on cells.
Saiku also allows the user to write OLAP queries in MDX directly on an editor.

[Fig. 5.33: Browsing the Northwind cube in Saiku]

Book excerpt shown on the slide (5 Logical Data Warehouse Design, p. 172; it begins mid-sentence) :
... chapter showing how we can implement the Northwind data cube over Microsoft Analysis Services and Mondrian, starting from the Northwind data warehouse.

5.12 Bibliographic Notes
A comprehensive reference to data warehouse modeling can be found in the book by Kimball and Ross [109]. A work by Jagadish et al. [101] discusses the uses of hierarchies in data warehousing. Complex hierarchies like the ones discussed in this chapter were studied, among other works, in [90, 92, 167, 168]. The problem of summarizability is studied in the classic paper of Lenz and Shoshani [121], and in [91, 93]. Following the ideas of Codd for the […], there have been attempts to define normal forms for multidimensional databases [120]. Regarding SQL, analytics and OLAP are covered in the books [31, 139]. There is a wide array of books on Analysis Services that describe in detail the functionalities and capabilities of this tool [77, 85, 171, 191, 192]. Finally, a description of Mondrian can be found in the books [14, 22].