Conceptual Design

Bodo H¨usemann, Jens Lechtenb¨orger, Gottfried Vossen Institut f¨ur Wirtschaftsinformatik Universit¨at M¨unster, Steinfurter Straße 109 D-48149 M¨unster, Germany

[email protected] g flechten,vossen @helios.uni-muenster.de

methods exist to date for deriving such a schema from an operational . In this paper, we recall the no- Abstract tion of multidimensional normal form (MNF) [LAW98] as a means to describe “good” warehouse schemata, and we A data warehouse is an integrated and time- show how to systematically derive a conceptual warehouse varying collection of data derived from opera- schema in MNF from a given operational schema. tional data and primarily used in strategic deci- Traditional database design methods [BCN92, Vos99] sion making by means of online analytical pro- structure the design process into a sequence of phases or cessing (OLAP) techniques. Although it is gener- steps. It is common to start with requirements analysis and ally agreed that warehouse design is a non-trivial specification,thendoconceptual design,thenlogical de- problem and that multidimensional data models sign, and finally physical design; this broad description can and star or snowflake schemata are relevant in be refined in a variety of ways. As a central issue during this context, hardly any methods exist to date a design process, a conceptual schema for the database un- for deriving such a schema from an operational der design is established which is then transformed into the database. In this paper, we fill this gap by showing “language of a logical as the basis for physical how to systematically derive a conceptual ware- implementation. For data warehouses, no corresponding house schema that is even in generalized multidi- methodology is yet in sight, although there is a large body mensional normal form. of literature that discusses how to derive schemas based on intuitive principles. We remedy this situation by present- 1 Introduction ing a method for conceptual warehouse design that is in line with traditional database design, and that fits into a A data warehouse is generally understood as an integrated modeling process which follows classical approaches. An and time-varying collection of data primarily used in strate- important point for our exposition is that traditional design gic decision making by means of online analytical process- methods have clearly defined goals and objectives, such as ing (OLAP) techniques. It is essentially a database that completeness w.r.t. a coverage of the underlying applica- stores integrated, often historical, and aggregated informa- tion, minimality of resulting schemata, freedom of redun- tion extracted from multiple, heterogeneous, autonomous, dancies, readability, etc. Some of these requirements can and distributed information sources. Although it is gen- even be made formally precise. Most notably, relational de- erally agreed that warehouse design is a non-trivial prob- pendency theory provides insight to understanding the rea- lem and that multidimensional data models and star or sons for redundancies in database relations, and formalizes snowflake schemata are relevant in this context, hardly any normal forms and normalization as a way to avoid them; The copyright of this paper belongs to the paper’s authors. Permission to moreover, there are algorithmic approaches (such as the copy without fee all or part of this material is granted provided that the synthesis algorithm [Vos99]) for constructing normalized copies are not made or distributed for direct commercial advantage. schemata. Proceedings of the International Workshop on Design and For two reasons, the above is in remarkable contrast Management of Data Warehouses (DMDW’2000) with what can be called warehouse design. First, current Stockholm, Sweden, June 5-6, 2000 practice in data warehousing and its applications marks a (M. Jeusfeld, H. Shu, M. Staudt, G. Vossen, eds.) radical departure from the principles of normalized schema http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-28/ design. Indeed, a common understanding of a “well-

B. Husemann,¨ J. Lechtenborger,¨ G. Vossen 6-1 designed” warehouse schema is that the schema under con- database management system, is ia well-understood pro- sideration should have the form of a “star”, i.e., it should cess; following [BCN92], database design can roughly consist of a central that contains the facts of in- be seen as being done in four steps, with the result of a terest to an OLAP application, and that is connected to a database schema which can be processed by a database number of dimension tables through referential integrity management system. This process comprises the phases constraints based on the various dimension keys. Since di- requirements analysis, conceptual schema design, logical mensions can be composed of attribute hierarchies, it is of- schema design,andphysical schema design. ten the case that dimension tables are unnormalized, and Concerning data warehouse design, there is a general their normalization results in what is known as a snowflake agreement that at least a conceptual or logical modeling schema. As an example, the conceptual schema shown activity should precede the actual implementation [WB97, in Figure 1 may be perceived as a conceptual snowflake AGS97, CT98, GMR98]. Typically, the modeling activ- schema that corresponds to the operational schema of Fig- ity is based on a multidimensional model (see [BSHD98, ure 2. Second, the research community has not been paying VS99, PJ99] for comparisons of various multidimensional too much attention so far to (a) developing complete de- models), whereas the implementation is carried out either sign methods for data warehouses in general or for concep- within relational or multidimensional [CD97]. tual design in particular, and (b) establishing guidelines for However, most of these models were developed without an good schema design or integrity constraints within the con- embedding into a design process and thus without guide- text of multidimensional models. As a result, there appears lines on how to use them or what to do with the result- to be a discrepancy between traditional database design as ing schemata. Notable exceptions in this respect are the applied to operational databases, and the design principles approaches of [CT98] and [GR98], which we summarize that apply to data warehouses. next. We remedy this situation by presenting a phase-oriented In [CT98] a design method is presented that starts design process for data warehouses that is modeled after from an existing E/R schema, derives a multidimensional the traditional design process, where we use an example schema, and provides implementations in terms of rela- from the banking domain as our running example. Con- tional tables as well as multidimensional arrays. The cretely, we will show how to obtain the warehouse schema derivation of the multidimensional schema is structured shown in Figure 1 from the conceptual operational database into the steps (1) identification of facts and dimensions, schema in Figure 2. The most important phase, the phase of (2) restructuring of the E/R schema, (3) derivation of a di- conceptual design, has to sort out dimensions, correspond- mensional graph, and (4) translation into the multidimen- ing dimension hierarchies,andmeasures, and has to deter- sional model. In the first step, facts along with their mea- mine which attribute from the underlying database(s) be- sures have to be selected, and afterwards dimensions for a longs where. Our contribution here is threefold: First, we fact are identified by navigating the schema. Then, in the establish guidelines for answering the question of whether second step, the initial E/R schema is restructured in or- an attribute is a dimension level or a property attribute. der to express facts and dimensions explicitly, thus arriving Second, we propose a graphical formalism for conceptual at a schema that can be mapped to the multidimensional warehouse design that captures this distinction in an ap- model. From this point on, the remaining steps provide propriate way. Third, we show how generalized multidi- natural translations of the E/R schema into an multidimen- mensional normal form (GMNF), originally proposed in sional schema and then into a target database schema. [LAW98], can be obtained for a warehouse schema under The work of [GR98] presents a complete warehouse de- design. sign method which resembles the traditional database de- sign and consists of the following steps: (1) analysis of The organization of this paper is as follows: In Section the information system, (2) requirement specification, (3) 2, we briefly recall the database design process, and we re- conceptual design (following the method of [GMR98]), (4) previous approaches on warehouse design and normal workload refinement and schema validation, (5) logical de- forms for warehouse schemata. We then introduce neces- sign, (6) physical design. We note here that the design of a sary terminology in Section 3, before we propose and illus- conceptual schema is carried out by producing a so-called trate our approach to conceptual warehouse design in Sec- fact schema for each fact, which, following [GMR98], can tion 4. Afterwards, we show in Section 5 that our design be derived from an E/R schema using an algorithmic pro- method does indeed produce schemata in GMNF. Finally, cedure, which starting from facts navigates through the concluding remarks and plans for future research appear in schema along x-to-one relationships in order to determine Section 6. dimensions and their hierarchies. 2 Related Research Concerning these two approaches, we critically remark the following. In [CT98], conceptual and logical design Database design, i.e., the task of mapping a given real- are mixed, and a “logical” multidimensional model is pre- world application to the formal data model of a given sented. We argue (in accordance with, e.g., [WB97]) that

B. Husemann,¨ J. Lechtenborger,¨ G. Vossen 6-2 account- account accountID custID profession customerType facts custName branch

custAge balance orgID orgGroup businessector turnov er creditlimit orgName orgType

interest produc tID productType

balanceClass

turnoverClass

tim e effectiveDay month quarter year

Figure 1: Conceptual Multidimensional Schema.

ORGANISATION ORGID ORGNAME ORGGROUP ORGTYPE BUSINESSSECTOR

ACCOUNT CUSTOMER PERSON ACCOUNTID CUSTID AGE EFFECTIVEDAY CUSTNAME PROFESSION BALANCE TURNOVER COMPANY CREDITLIMIT BRANCH INTEREST

PRODUCT PRODUCTID PRODUCTTYPE

Figure 2: Sample Conceptual Operational Schema.

B. Husemann,¨ J. Lechtenborger,¨ G. Vossen 6-3 the conceptual design should be independent of the target stored in measures and a qualifying context which is deter- database system, whereas the logical design should be tai- mined through (terminal) dimension levels. Each dimen- lored towards the chosen target system (here: relational sion level contains a set of instances or elements.Anag- or multidimensional database system). As a consequence, gregation path is a subsequence of dimension levels, which

what is called logical design in [CT98] should actually be starts in a terminal dimension level and ends in an implicit >

called conceptual design instead, and the implementation dimension level All containing a single element

;::: ; µ

activity of [CT98] should be divided into a logical and a Given an aggregation path ´d1 dn ,wheredi is a di-  physical design phase, as in [GR98]. Moreover, the crucial mension level, 1  i n, we assume that every element of

first two steps of [CT98] are presented by examples only, di belongs to (or is associated to) at most one element of

  · and remain rather vague. In contrast, [GR98] contains an di·1,1 i n 1; moreover, we say that di 1 is a higher algorithmic procedure to perform the corresponding activ- (aggregation) level than di. A dimension is structured in ities. However, neither approach provides any formal jus- terms of one or more aggregation paths that share the same tification for the applied method, and nor does it supply terminal dimension level. Therefore, each dimension com- criteria according to which schema quality could be mea- prises at least one terminal dimension level and the im- sured. plicit level All. For example, a dimension time might

In summary, well-established design phases for mul- consist of the single aggregation path (day, month, >

tidimensional databases are still missing, and while nor- year), where the element <2000/03/29 of dimension >

mal forms have a long tradition in the area of relational level day is associated to element <2000/03 of level > databases as guidelines for good schema design, research month, which in turn belongs to element <2000 of level on quality factors for multidimensional database schemata year. In slight abuse of terminology, the graph consisting seems to have started only recently. Indeed, the work pre- of all aggregation paths for a given dimension is called di- sented in [LS97] can be regarded as a first step in this re- mension hierarchy (although a dimension level might have spect. The authors argue that summarizability, i.e., guar- more than one parent level). We assume that sets of dimen- anteed correctness of aggregation results, is of most im- sion levels of distinct dimensions are disjoint. portance for OLAP queries. Hence, any multidimensional A property attribute describes additional information schema should be set up in such a way that summarizabil- related to a dimension level (e.g., the property attribute ity is obtained to the highest possible degree. Moreover, if custName for dimension level custID in Figure 1). An summarizability is violated along certain aggregation paths optional property attribute (e.g., custAge) needs not be

then the schema should express this constraint clearly. Fi- specified for each element of the corresponding dimension > nally, [LS97] provides criteria that imply summarizability. level and therefore may contain

We briefly describe a graphical notation for conceptual ear aggregation path within a dimension (e.g., path ! multidimensional data modeling next. To avoid misunder- day!month year in dimension time of Figure 3). standings resulting from the variety of terminology in mul- A multiple dimension hierarchy contains at least two dif- tidimensional databases, we first clarify some terms and ferent aggregation paths in a dimension (e.g., the dimen- then use these throughout the rest of the paper. sion account of Figure 3). A group of aggregation paths rooted in a common dimension level d is called alterna- 3.1 Basic multidimensional structures tive, if every element of d belongs to exactly one element of each higher level (e.g., the paths of Figure 3 starting Facts represent atomic information elements in a multidi- in dimension level orgID). Moreover, we allow optional mensional database. A fact consists of quantifying values groups of aggregation paths. Here, we assume that some

B. Husemann,¨ J. Lechtenborger,¨ G. Vossen 6-4 account- account accountID custID profession custom erType facts custName branch

custAge balance orgID orgGroup businessector turnover orgType

time effectiveDay month quarter year g

Figure 3: Simplified Fact Schema account facts with measures fbalance, turnover and dimensions g faccount, time . dimension level dl has two or more higher levels such that for every element of dl there is exactly one element in a Operational database / Requirements analysis schema higher level to which that element belongs. For example, and -specification in Figure 3 the group of aggregation paths starting in di- Semif ormal business mension level custID is optional, and each element of c onc ept dimension level custID is related to either a profes- Conceptual design sion or a branch); within an optional group of aggrega- Formal conceptual tion paths a dimension level such as custID is called split schema level whereas customerType is called join level. Logical design In our graphical notation, simple hierarchies and alter- native groups of aggregation paths are indicated by sim- Formal logical schema ple arrows, pointing from the lower dimension level to the higher one. Optional groups of aggregation paths are spec- Physical design Physical database/ schema ified by double lined arrows. A mandatory property at- tribute is connected via a diamond to its dimension level, and an optional property attribute without. Figure 4: Process Model For Data Warehouse Design. Intuitively, aggregations along different aggregation paths in an alternative group yield the same result. How- 4.1 A process model for Data Warehouse design ever, if a group is optional then aggregations along differ- The data warehouse design process comprises four sequen- ent paths within the group represent (partial) aggregation tial phases just like the classical database design process. results for mutually disjoint subsets of the elements of the In Figure 4 these phases along with their respective input terminal dimension level. and output are shown. We point out that if some element of a dimension level does not belong to an element of some higher level then all Requirements analysis and specification data related to that element is lost at the higher aggregation level, resulting in erroneous query results. To avoid such The operational E/R schema delivers basic information to

partial mappings we propose the insertion of an element determine the multidimensional analysis potential, where >

allocate every nonrelated element of the lower level to the house designer (if it is not, it could be obtained, for exam- > element

B. Husemann,¨ J. Lechtenborger,¨ G. Vossen 6-5 be added in a textual appendix, which also includes typical “show the average turnover by year and product group” or strategic queries. “how many current accounts are managed online.” We note that Table 1 contains additional attributes Conceptual design that represent multidimensional requirements which are

not part of the operational schema (e.g., fmonth, The conceptual design phase performs a transformation of customerType g). In particular, the new dimension level the semi-formal business requirements specification into a customerType is introduced in order to represent the formalized conceptual multidimensional schema. The for- specialization hierarchy concerning different types (e.g. malization results in a graphical multidimensional diagram, business and private) of customers as shown in Figure 2. which comprises fact schemata with their related measures These attributes extend operational data and augment mul- and dimension hierarchies. For each of a fact tidimensional aggregation possibilities according to multi- scheme the summarizability constraints are formalized in dimensional analysis needs. Further additional attributes an tabular appendix. In Section 4.2 we propose a phase are obtained by an examination of those attributes that are model to derive such diagrams and appendices in a straight- specified both as measure and dimension. In our case, we forward way starting from the requirements specification. note that related to business management it may not be use- ful to query an aggregation level on discrete balance and Logical design turnover values. Thus, we define new dimension lev- els balanceClass, turnoverClass that represent The logical design phase converts the conceptual schema analysis intervals; moreover, the corresponding original at-

to a logical one with respect to the target logical data g tributes fbalance, turnover are then rated as non- model (mostly relational or multidimensional). The logical dimensional. schemata are generated according to transformation rules, which only refer to the developed conceptual diagrams and summarizability constraints. A process model to conceptual data warehouse design We subdivide the process phase of conceptual data ware- Physical design house design into three sequential phases:

The data warehouse design process ends in a physical im- 1. context definition of measures, plementation of the logical schemata with respect to the in- dividual properties of the target database system, including 2. dimensional hierarchy design, and physical optimization techniques such as commonly known indexing strategies, partitioning etc., as well as OLAP- 3. definition of summarizability constraints. specific adjustments like relational of di- mension tables, pre-aggregations or special use of bitmap Context definition of measures

indexes.

f ;::: ; g Given the set M = m1 mk of measures defined dur- ing requirements analysis and the set D of dimensional at- 4.2 Conceptual Data Warehouse Modeling tributes, every fact can be perceived as an element of a As we mentioned already, the aim of the conceptual schema graph of some function from dimension levels to measures. design phase is to produce a graphical multidimensional Hence, the conceptual design phase starts by determining schema, which for each measure expresses its multidimen- functional dependencies (FDs)1 from dimension levels to sional context in terms of relevant dimensions and their hi- measures. First, we determine a (minimal) key Di  D for erarchies. each measure mi; then we define the set FKey as consisting

We assume that a global operational E/R schema such of all FDs of the form Di ! mi so obtained. Thus, given an ¾ as the one shown in Figure 2 exists, which describes the FD Di ! mi FKey, the dimension levels in Di functionally available source information. Moreover, we assume that re- determine measure mi, but are not functionally determined quirements analysis has been carried out together with do- by any other level. Hence, they qualify as terminal dimen- main experts, and that the global operational E/R schema sion levels that are used as roots of dimension hierarchies. has been analyzed to determine interesting measures, di- For each terminal dimension level we define a correspond- mensions, and initial OLAP queries. The output of this ing dimension.

phase consists of (a) tables such as the extract concern- In our running example, we hence obtain the FD (ac- ¾ ing account information shown in Table 1 (which contains countID, effectiveDay) ! balance FKey for an informal description for each relevant attribute and in- measure balance. Furthermore, all measures mi ; m j with dicates whether the attribute may be used as measure or 1Since functional dependencies are a well-known concept in relational dimensional attribute and whether the attribute is optional databases, we assume the reader to be familiar with it; otherwise see, for or not) and (b) standard multidimensional queries such as example, [BCN92, Vos99].

B. Husemann,¨ J. Lechtenborger,¨ G. Vossen 6-6 Table 1: Requirements Specification.

Attribute Description M D O effectiveDay data import date no yes no month time aggregation no yes no quarter time aggregation no yes no year time aggregation no yes no accountID account key no yes no balance balance at effective day yes no no balanceClass balance classification no yes no turnover turnover at effective day yes no no turnoverClass turnover classification no yes no creditlimit creditlimit of the account yes no no interest interest rate yes no no custID customer key no yes no custName costumer name no yes no custAge age of a private customer no yes yes customerType classification of customers no yes no profession profession of a private customer no yes yes branch branch of a business customer no yes yes productID product description no yes no productType classification of products no yes no orgID attending organizational unit no yes no orgName name of a organizational unit no yes no orgGroup grouping of organizational units no yes no orgType classification of organizational units no yes no businessector classification of orgGroup and orgType no yes no

Di = D j are grouped into the same fact schema,asthey Dimensional hierarchy design share the same dimensional context. In the next step, we gradually develop the dimension hi- Table 2 shows the result of this process applied to mea- erarchies for each dimension. To this end, we determine sures balance, turnover, creditlimit,andin- all FDs between dimension levels belonging to a dimen-

terest occurring in Table 1. All measures are func- sion dim with terminal dimension level d j as follows: Sup-

¾ ! tionally dependent on the same terminal dimension lev- pose we are given dimension levels dk ; dl D s.t. dk dl els, namely accountID and effectiveDay,which is a valid FD and there exists a (potentially transitive) func-

in turn belong to the dimensions account and time, tional dependence of dk on d j, then we add dk ! dl to the resp. Hence, all measures are grouped into a common set Fdim. fact schema accountfacts. At this point, we start the In our example, fact schema accountfacts includes graphical conceptual design by modeling fact schemata up the accountID and effectiveDay as terminal dimen- to terminal dimension levels (see Figure 5). sion levels. Starting with terminal dimension level ef- fectiveDay of dimension time we determine the fol- lowing FDs:

account- account accountID

f ! facts Ftime = effectiveDay month, time effectiveDay

balance month ! quarter, g turnover quarter ! year

creditlim it Graphically, we derive the simple dimension hierarchy interest shown in Figure 6.

Figure 5: First Part of the Conceptual Schema.

B. Husemann,¨ J. Lechtenborger,¨ G. Vossen 6-7 Table 2: Functional Dependencies between Terminal Dimension Levels and Measures.

Fact schema Measure Dimension Terminal dimension level accountfacts balance, account accountID turnover, creditlimit, time effectiveDay interest

gation paths or not. If all dimension levels are manda-

... time effectiveDay month quarter year tory according to the requirements specification then all groups of aggregation paths are alternative. If, however, Figure 6: Simple Dimension Hierarchy time. some dimension levels are optional (such as profes- sion and branch in our example) then these levels in- The dimension levels of dimension account exhibit the troduce groups of optional aggregation paths. Assume following FDs:

that dl is a (mandatory) split level that functionally de-

;:::

f !

Faccount = accountID orgID, termines the optional dimension levels dc1 dck .These !

accountID ! custID, accountID turnoverClass, optional levels are now grouped by building disjoint sub-

!

;::: g accountID balanceClass f , sets of dc1 dck such that the elements of levels in each

accountID ! productID, group form a complete and disjoint partitioning of the ele-

productID ! productType

, ments of split level dl. ! orgID ! orgGroup, orgID orgType, In our example, custID is the split level, and

orgType ! businessector, g fprofession, branch are the optional dimension lev-

orgGroup ! businessector, els which have to be grouped. Clearly, each element

orgID ! orgName,

of custID belongs either to profession (in case of ! custID ! profession, custID branch, a private customer) or to branch (in case of a busi- profession ! customerType,

branch ! customerType, ness customer), which implies that the aggregation paths

! g custID ! custName, custID custAge from custID to profession and branch form an optional group of aggregation paths. Moreover, cus- In the following, we use the dimension account (see tomerType is the join level for this group as cus- Figure 7) as running example for the derivation of complex tomerType is functionally dependent on profession dimension hierarchies. In a first step, property attributes and branch. We point out that in general, however, (a) and dimension levels have to be distinguished according to multiple groups may arise and (b) it may be necessary analysis requirements (recall that property attributes may to introduce new optional levels in order to ensure com- only be used for selections but not for aggregations). For pleteness for each group. Furthermore, we require that example, orgName represents a property attribute, since each group has a join level whose elements indicate for aggregations according to orgName are meaningless to each element of the split level to which optional level business analysts. Similarly, custName and custAge of the group it belongs. For example, if customer-

are identified as property attributes, whereas all remaining > Type contains the elements

dimensional attributes represent dimension levels. > and

Next, in a second step, a rough approximation of the di- each element of custID, belongs to branch if the cus- > mension hierarchy is obtained by building a directed graph tomerType is

whose nodes are dimension levels. This graph contains fession otherwise. Thus, customerType is a join =

an edge from dimension level di to level d j,ifdi 6 d j level that satisfies our requirements. In general, it might !

and di ! d j is a non-transitive FD, i.e., if di d j holds be necessary to introduce such a join level explicitly.

= ;

and there is no dimension level dk (dk 6 di d j) such that ! d ! d d holds. i k j Definition of summarizability constraints The graph obtained so far is now augmented with prop- erty attributes: Property attribute dp is attached to dimen- As already argued in, e.g., [LS97, GMR98, PJ99], not all sion level dl if the FD dl ! dp is non-transitive. The infor- possible aggregations of measures within a certain applica- mation whether a property attribute is optional or not can tion scenario make sense in general. For example, given a be retrieved from the requirements specification. group of customers, summing over as well as taking aver- Finally, multiple dimension hierarchies have to be ages of account balances may be perfectly reasonable; sim- checked whether they contain optional groups of aggre- ilarly, the computation of average ages may be reasonable,

B. Husemann,¨ J. Lechtenborger,¨ G. Vossen 6-8 ... account accountID custID profession customerType

custName branch

custAge

orgID orgGroup businessector

orgName orgType

productID productType

balanceClass

turnoverClass

Figure 7: Multiple Dimension Hierarchy account. whereas the sum of customer ages will probably be mean- based on the allowed aggregation functions along every ingless. Consequently, a conceptual model should provide path is meaningful. We declare a chosen restriction on means to distinguish meaningful aggregations of measures a given dimension level implicitly valid for all dependent from meaningless ones, as this information helps analysts higher dimension levels. This does not exclude the possi- in formulating their queries. In particular, the warehouse bility to define further constraints with greater restriction schema should express explicitly which measure may be level on higher dimension levels. aggregated along what dimension hierarchy according to The results of this process are arranged within a sum- what aggregation function. Clearly, we could integrate this marizability appendix such as the one shown in Table 4, information into the graphical fact schemata by connect- which contains the restriction levels for measures in the fact ing each pair of measures and terminal attributes by an schema of Figure 1. edge that is labeled with all meaningful (or, alternatively, The measure balance is additive in the dimension by all meaningless) aggregate functions. However, for the level accountID, so we assign restriction level 1 to it for schema shown in Figure 1 we would already have to add all aggregation paths starting in accountID. In contrast at least 8 edges, thereby reducing readability. Hence, we summing up balance-values in the dimension level ef- refrain from representing meaningful aggregations within fectiveDay is meaningless. Nevertheless the analysis graphical schemata, but propose to use summarizability ap- of average values or other statistical characterizing quanti- pendices as described next. ties makes sense, so restriction level 2 is allocated for the effectiveDay Along the lines of [PJ99] we distinguish four degrees of dimension level . increasing restriction levels for measures within the con-

text of dimension levels as shown in Table 3. Given a pair 5 Discussion

; µ ´m d of a measure m and a dimension level d, we asso- Compared to the approaches [CT98, GMR98], our analysis ciate restriction level 1, if all aggregate functions may be of FDs corresponds to the “schema navigation” performed applied to roll-up m from dimension level d to every func- in those approaches. However, we start by an identifica-

tionally dependent higher level. Restriction level 2 is asso- tion of measures and proceed algorithmically from there

; µ ciated to those pairs ´m d , where all aggregate functions on, whereas [CT98, GMR98] start by a rather vague identi- but the SUM-Operator are applicable (e.g., age information fication of facts. Moreover, our method is justified by mul- along the customer dimension as explained above). The tidimensional normal forms as we will show next. restriction level 3 represents the highest limitation, where First, we recall terminology related to multidimensional aggregation is still possible, but only in terms of counting normal forms from [LAW98], suitably adapted to our (e.g., for constant measures). Finally, degree 4 states that framework. We point out that [LAW98] distinguishes weak no aggregation function is permissible. FDs (which amount to partial functions) from non-weak Given the graphical, conceptual schemata so far, the ones (denoting total functions), whereas we (equivalently) next step is to define restriction levels for all measures make use of FDs in the usual sense but distinguish optional along the different aggregation paths in every fact schema: from mandatory attributes.

For each pair of measures and dimension levels we define A dimensional schema is a set of dimensional attributes

¾ Òf g a restriction level such that every multidimensional query D where for all di ¾ D there is d j D di such that we

B. Husemann,¨ J. Lechtenborger,¨ G. Vossen 6-9 Table 3: Classification of Restriction Levels.

Restriction level Applicable aggregate functions g

1 f SUM, AVG, MIN, MAX, STDDEV, VAR, COUNT g

2 f AVG, MIN, MAX, STDDEV, VAR, COUNT g 3 f COUNT

4 fg

Table 4: Summarizability Appendix for Fact Schema accountfacts.

Fact schema Measure Dimension levels Restriction level account facts balance accountID 1 effectiveDay 2 turnover accountID 1 effectiveDay 1 creditlimit accountID 2 effectiveDay 2 interest accountID 2

effectiveDay 2 !

have an FD of the form either di ! d j or d j di.Amultidi- the dimensions.

f ;::: ; g; µ

mensional schema is a pair M =´ D1 Dk S ,where Now we are ready to relate our approach to the concepts

;::: ; g fD1 Dk is a set of dimensional schemata and S is a introduced in [LAW98]: measure that is functionally determined by the attributes

Lemma 5.1

;::: ; g ¾

occurring in fD1 Dk . A dimensional attribute dt D

Òf g !

is terminal, if there is no d ¾ D dt such that d dt .A 1. Each dimension hierarchy determined during concep-

Òf g

dimensional attribute d ¾ D dt is a category attribute if tual design is a dimensional schema that is rooted in

¼ ¼

Òf ; g !

d is mandatory and there is d ¾ D dt d such that d d exactly one terminal attribute.

¼ ¼ or d is mandatory and d ! d ; all other attributes are prop- 2. The dimensions are orthogonal to each other. erty attributes.Letdt be a terminal attribute, dp a property attribute, and dc a category attribute of a common dimen- 3. In dimensions without optional hierarchies all dimen- sion. An element c of dc is a context of validity of dp if (a) sion levels are mandatory. for each element of dt belonging to c there is a value for dp and (b) for each element of dt not belonging to c there is no 4. If a dimension contains an optional hierarchy then the value for dp. elements of the join level represent contexts of validity for property attributes. A dimensional schema D is in dimensional normal form 5. All measures are full functionally determined by the if (a) there is exactly one terminal attribute dt ¾ D,(b)the elements of dt are complete (i.e., all real-world concepts set of terminal dimension levels of the dimensions.

are captured), and (c) all dimensional attributes are manda- Using this lemma, we can show the following:

f ;::: ; g; µ tory. A multidimensional schema M =´ D1 Dk S is in generalized multidimensional normal form (GMNF) if Theorem 5.1 The conceptual design described in Section the following conditions are satisfied: (1) For each property 4 produces fact schemata in generalized multidimensional

attribute dp ¾ Di there is an element of a category attribute normal form.

d ¾ D denoting the context of validity of d . (2) Each c i p It is instructive to note that the distinction of property and dimensional schema restricted to category attributes is in category attributes in the sense of [LAW98] is not neces- dimensional normal form. (3) The dimensions are orthog- sary in order to characterize schemata in GMNF. onal to each other, i.e., there are no FDs among attribute from distinct dimension schemata. (4) Measure S is full Theorem 5.2 A fact schema is in GMNF if the following functionally determined by the set of terminal attributes of conditions are satisfied.

B. Husemann,¨ J. Lechtenborger,¨ G. Vossen 6-10 1. For each optional dimension level dl in dimension d References there is an element of a mandatory higher level dc de- [AGS97] R. Agrawal, A. Gupta, S. Sarawagi, “Modeling multi- noting the context of validity of dl. dimensional databases,” Proc. 13th ICDE 1997, 232– 243. 2. Each dimensional schema restricted to mandatory at- [BCN92] C. Batini, S. Ceri, S. B. Navathe, Conceptual Database tributes is in dimensional normal form. Design - An Entity–Relationship Approach,Ben- jamin/Cummings, Redwood City, 1992. [BSHD98] M. Blaschka, C. Sapia, G. H¨ofling, B. Dinter, “Find- 3. The dimensions are orthogonal to each other. ing your way through multidimensional data mod- els,” DEXA Workshop on Data Warehouse Design and 4. All measures are full functionally determined by the OLAP Technology 1998. set of terminal dimension levels of the dimensions. [CT98] L. Cabibbo, R. Torlone, “A logical approach to multi- dimensional databases,” Proc. 6th EDBT 1998, LNCS 1377, 183–197. 6 Conclusions [CD97] S. Chaudhuri, U. Dayal, “An Overview of Data Ware- housing and OLAP Technologies,” Proc. ACM SIG- In this paper, we have outlined an approach to conceptual MOD 1997, 65–74. warehouse design which incorporates new guidelines to ad- [FV95] C. Fahrner, G. Vossen, “A Survey of Database De- dress specific warehouse needs and which can naturally be sign Transformations Based on the Entity-Relationship embedded into the traditional database design process. In Model,” Data & Knowledge Engineering 15, 1995, this respect, the generalized multidimensional normal form 213–250. can be regarded as a quality factor that avoids aggregation [GMR98] M. Golfarelli, D. Maio, S. Rizzi, “Conceptual design anomalies (similar to the avoidance of update anomalies of data warehouses from E/R schemes,” Proc. 32th by traditional normal forms for operational databases) and HICSS 1998. is therefore of most importance to guarantee correct query [GR98] M. Golfarelli, S. Rizzi, “A methodological framework results. for data warehousing design,” ACM Workshop on Data Warehousing and OLAP, 1998. Previously, the method proposed in [GR98] seems to have been the only existing approach that covers a com- [Hue99] B. H¨usemann, “OLAP-Datenmodelle f¨ur Controlling im Finanzdienstleistungsbereich,” Diplomhausarbeit, plete data warehouse design starting from source integra- Institut f¨ur Wirtschaftsinformatik, Universit¨at M¨unster, tion and requirements analysis, very much in resemblance September 1999. of traditional database design. We believe that their over- [Leh98] W. Lehner, “Modeling large scale OLAP scenarios,” all approach is well-suited to data warehouse design, since Proc. 6th EDBT 1998, LNCS 1377, 153–167. the separation of design phases is essential to any design [LAW98] W. Lehner, J. Albrecht, H. Wedekind, “Normal forms process, within the context of relational databases or other- for multidimensional databases,” Proc. 10th SSDBM wise. For example, it is the requirements analysis that al- 1998, 63–72. lows to identify facts, measures, and typical queries, which [LS97] H. Lenz, A. Shoshani, “Summarizability in OLAP and are basic ingredients for the following design steps. In statistical databases,” Proc. 9th SSDBM 1997, 132– contrast, starting with the conceptual design the approach 143. [CT98] indicates only that facts and dimensions have to [MQM97] I.S. Mumick, D. Quass, B.S. Mumick, “Maintenance be identified and selected but contains no hints as how to of Data Cubes and Summary Tables in a Warehouse,”, achieve this. Proc. ACM SIGMOD 1997, 100–111. Concerning further details on logical schema design, the [PJ99] T.B. Pedersen, Ch.S. Jensen, “Multidimensional Data reader is referred to [Hue99], which focuses on the rela- Modeling for Complex Data,” Proc. ICDE 1999, 336– tional model, and develops a straightforward transforma- 345. tion procedure from the conceptual to the relational model, [SBHD98] C. Sapia, M. Blaschka, G. H¨ofling, B. Dinter, where fact schemata are mapped to fact relations and each “Extending the E/R Model for the Multidimensional dimension level is mapped to a separate table. Interest- Paradigm,” ER Workshops 1998, 105–116. ingly, the final schema turns out to be a fully normalized [VS99] P. Vassiliadis, T. Sellis, “A Survey on Logical Models snowflake schema (if only a single fact schema occurs) or for OLAP Databases,” ACM SIGMOD Record 28(4) galaxy schema (if multiple fact schemata occur and share 1999, 64–69. some dimensions). [Vos99] G. Vossen, Data Models, Database Languages and Database Management Systems, 4th edition (in Ger- Acknowledgement. The authors are grateful to Thomas man), Oldenbourg, 2000. L¨ochte of zeb/rolfes-schierenbeck/associates for a number [WB97] M.-C. Wu, A. P. Buchmann, “Research issues in data of discussions on the subject of this paper. warehousing,” Proc. 7th BTW 1997, 61–82.

B. Husemann,¨ J. Lechtenborger,¨ G. Vossen 6-11