
International Journal of Data Warehousing and Mining, X(X), X-X, XXX-XXX 2012 1

Cube Algebra: A Generic User-Centric Model and Query Language for OLAP Cubes

Cristina Ciferri, Ricardo Ciferri
Universidade de São Paulo em São Carlos, Brazil

Leticia Gómez
Instituto Tecnológico de Buenos Aires, Argentina

Markus Schneider
University of Florida, USA

Alejandro Vaisman, Esteban Zimányi
Université Libre de Bruxelles, Belgium

ABSTRACT

The lack of an appropriate conceptual model for data warehouses and OLAP systems has led to the tendency to deploy logical models (for example, star, snowflake, and constellation schemas) for them as conceptual models. ER model extensions, UML extensions, special graphical user interfaces, and dashboards have been proposed as conceptual approaches. However, they introduce their own problems, are somewhat complex and difficult to understand, and are not always user-friendly. They also have a steep learning curve, and most of them address only structural design, not considering associated operations. Therefore, they are not really an improvement and, in the end, only represent a reflection of the logical model. The essential drawback of offering this system-centric view as a user concept is that knowledge workers are confronted with the full and overwhelming complexity of these systems as well as complicated and user-unfriendly query languages such as SQL OLAP and MDX. In this article, we propose a user-centric conceptual model for data warehouses and OLAP systems, called the Cube Algebra. It takes the cube metaphor literally and provides the knowledge worker with high-level cube objects and related concepts. A novel query language leverages well-known high-level operations such as roll-up, drill-down, slice, and drill-across. As a result, the logical and physical levels are hidden from the unskilled end user.

Keywords: data warehouses, OLAP, cube, conceptual model, user-centric model, query language

Copyright © 2012, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.

INTRODUCTION

Nowadays, data warehouses are at the forefront of information technology applications as a way for organizations to effectively use and analyze information for business planning and decision making. Data warehouses are large repositories of analytical and subject-oriented data integrated from several heterogeneous sources over a large period of time. The technique of performing complex analysis over the information stored in the data warehouse is commonly called Online Analytical Processing (OLAP). A review of the evolution of data warehouse technology reveals that research and development has mainly focused on system aspects such as the construction of data warehouses, materialization, indexing, and the implementation of OLAP functionality. This system-centric view has led to well-established and commercialized technologies such as relational OLAP (ROLAP), multidimensional OLAP (MOLAP), and hybrid OLAP (HOLAP) at the logical and the physical levels.

However, the unskilled user, such as the manager in a consulting company or the analyst in a financial institution, is confronted with the problem that the handling of data warehouses and OLAP systems requires expert knowledge due to complicated data warehouse structures and the complexity of OLAP systems and query languages. Two main reasons are responsible for this problem. First, due to the lack of a generic, user-friendly, and comprehensible conceptual data model, data warehouse design is usually performed at the logical level and leads to the exposure of logical design schemas that are difficult for the unskilled user to understand. In a ROLAP environment, for example, the user is faced with the logical design of relational tables in terms of star, snowflake, or fact constellation schemas. The proposal to alleviate the problem by providing extensions to the Entity-Relationship Model and the Unified Modeling Language, or by offering specific graphical user interfaces or dashboards for data warehouse design, is not really convincing, since ultimately they represent a reflection and visualization of relational technology concepts and, in addition, reveal their own problems. Second, available OLAP query and analysis languages such as MDX and SQL OLAP operate at the logical level and require the user's deep understanding of the data warehouse structure in order to be able to formulate queries. These languages are quite complex, overwhelm the unskilled user, and are therefore inappropriate as end-user languages.

We conclude that a generic, conceptual, and user-centric data warehouse model that focuses on user requirements is missing and needed. Such a model should fulfill several design criteria. First, it should be located above the logical level. Second, it should abstract from and be independent of the models and technologies (ROLAP, MOLAP, HOLAP) at the logical level. Third, it should be able to cooperate with any of these logical models and technologies. Fourth, it should enable the user to generically and abstractly represent and query hierarchical multidimensional data. Fifth, it should have an associated query language based exclusively on the conceptual level, thus providing high-level query operations for the user.

The goal of this article is to propose and formally describe a conceptual and user-centric data warehouse model and query language that satisfies these design criteria. Surprisingly, the conceptual view this model adopts is not new; on the contrary, it is well known. However, the way and resoluteness with which we offer this concept is novel. Our proposed conceptual model leverages the cube view of data warehouses but takes the cube metaphor literally. This means that the user's conceptual world is solely the cube that the user can create, manipulate, update, and query. The cube is used as the user concept that completely abstracts from any logical and physical implementation details. Technically, this implies that cubes can be regarded as an abstract data type that provides cubes as the only kind of values (objects), offers high-level operations on cubes or between cubes such as slice, dice, drill-down, roll-up, and drill-across as the only available access methods, and hides any data representation and algorithmic details from the user, who can concentrate on her main interest, namely to analyze large volumes of data. Another characterization is to say that we define a universal algebra with cubes as the only sort and a collection of unary and binary operations on cubes. We therefore name our approach Cube Algebra. We will show that this algebra develops its full power and expressiveness if it is used as a high-level query language.

The paper is organized as follows. The next section discusses related work and compares available data warehouse models with our Cube Algebra. Then, we describe an application scenario that we use throughout the paper to illustrate important aspects of the Cube Algebra. In the same section, we provide a three-level architecture of a data warehouse and OLAP system that includes our Cube Algebra. We further specify the formal data model supporting the Cube Algebra. The section concludes with a sketch of a data definition language to specify the structure of a cube. Then, we define high-level OLAP cube operations such as slice and drill-across, and illustrate their use in a number of queries that refer to our application scenario. Finally, the last section draws some conclusions and sketches future work.

RELATED WORK

Several data warehouse (DW) models have been proposed in the literature (see for example the survey in Marcel (1999)). Most of these models address the logical level (e.g., Li & Wang, 1996; Cabibbo & Torlone, 1997; Cabibbo & Torlone, 1998; Lehner, 1998). Therefore, they are dependent on specific technologies, for example ROLAP, MOLAP, and HOLAP, which lead to complex and non-user-friendly query languages. In this section, we limit ourselves to conceptual models and discuss them with respect to our proposal, i.e., we focus on the user support provided by these models. We claim that most of the models aimed at addressing the conceptual level actually rely on structures that are close to the logical level, thus not addressing end-user needs. We call these system-centric models, as opposed to the user-centric approach we present in this paper. Similarly, we also claim that the user should be provided with a query and analysis language that is exclusively based on the conceptual level. Although several proposals in the literature define a set of operators to handle multidimensional data (see for example the survey and the reference algebra in Romero and Abelló (2007)), these proposals do not abstract from the logical level and thus do not provide high-level query operations for the user. Taking the aforementioned discussion into account, we next comment on related work, and present an analysis against our proposal.

We first classify existing models into three classes: (a) conceptual models based on extensions to the Entity-Relationship (ER) Model (Chen, 1976); (b) conceptual models based on extensions to the Unified Modeling Language (UML); (c) models based on a view of data as a cube.

We start with a discussion on models in class (a) (ER-based models).

Rizzi (2007) proposed the Dimensional Fact Model (DFM), which uses the typical DW concepts of facts, dimensions, measures, hierarchies, descriptive and cross-dimension attributes; the model also supports shared, incomplete, recursive, and dynamic hierarchies, and notions such as additivity. To represent these concepts, DFM relies on a graphical notation that facilitates the understanding of the conceptual schema, and that is an abstraction of the star schema, in which there is a central fact entity and a graph per dimension to represent the attribute hierarchies. Golfarelli and Rizzi (1998) extend DFM by presenting a methodological framework for DW conceptual modeling, which starts by gathering user requirements and carries out the data warehouse design semi-automatically from the operational database schema. In addition to providing an abstraction of the star schema in terms of a central fact entity and several graphs, they also formalize each concept of the DFM and define a language to denote queries according to the DFM to validate the generated schema.

Ravat et al. (2008) study three related issues. First, they propose a conceptual multidimensional model, which is based on the concepts of constellations, dimensions, dimension attributes, hierarchies, facts, measures, and multidimensional table structures (i.e., tabular representations of multidimensional data). Second, they define a set of algebra operators to manipulate multidimensional data, which include: (i) a minimal core of operators to modify analysis precision (i.e., drill-down, roll-up, select), to change analysis criteria (i.e., rotate, add measure, delete measure, push and pull, nest), and to change the multidimensional table presentation (i.e., switch, aggregate); (ii) advanced operators (i.e., frotate, hrotate, order, plot, unselect) that are obtained from the combination of core operators, aiming at simplifying complex queries; and (iii) binary operators (set operators), such as union, intersect, and minus, based on the (semi-)compatibility of input tables. Third, they develop a graphical interface based on the multidimensional table structures and the DFM commented above. This interface is supported by a graphic language that encompasses the core algebra operators. Although this work focuses on a user-oriented query language composed of a formalized algebra and a graphical query language, the multidimensional model on which this work is based is very close to the star schema and strongly based on the concept of multidimensional table, i.e., it remains a system-centric model.

Tryfona et al. (1999) propose the starER model, which combines the star structure with the constructs of the ER Model, in addition to proposing special types of relationships to support attribute hierarchies on dimensions. The starER model encompasses the following main constructors: facts, entities, relationships among entities, and attributes (i.e., properties of entities, relationships, or facts). Further, the starER model provides a graphical notation very close to the ER Model. Along similar lines (i.e., starting from an ER-based data model), Malinowski and Zimányi (2008) introduce a metamodel of hierarchy classification that ranges from symmetric simple hierarchies to more complex ones, such as non-strict simple hierarchies, asymmetric and generalized simple hierarchies, multiple hierarchies, and parallel hierarchies. They also present a graphical notation for representing these hierarchies, close to a relational representation.

We now move on to discuss models in class (b) (UML-based models).

Nguyen et al. (2000) use UML to map their proposed conceptual multidimensional data model to an object-oriented data model. The data model is based on the concepts of dimensions, dimension members, dimension levels, dimension schemas, dimension paths and hierarchies, dimension operators, measures, and data cubes. This model aims at representing natural hierarchical relationships among members within a dimension as well as unbalanced and multiple hierarchies. The authors also define the following cube operators: groupBy, jumping, rollingUp, and drillingDown. However, this model is defined at a level very close to the logical one. Also, this proposal does not introduce a full set of high-level operations such as slice and drill-across.

Abelló et al. (2001) investigate relationships between cubes in an object-oriented framework with navigation operations. Here, the data cube is defined in terms of a set of concepts, such as measures and cells, dimensions and aggregation levels, and facts. An algebra is defined as a set of multidimensional operations, such as base changes, dice, slice, drill-across, roll-up and drill-down. In sequels of this paper (Abelló, Samos & Saltor 2002; Abelló, Samos & Saltor 2006), the authors propose YAM², a multidimensional conceptual object-oriented model for data warehousing and OLAP tools extending the UML, which is defined in terms of its structures, integrity constraints, and query operations. However, in spite of being defined at the conceptual level, YAM² relies on star schema-like design, and therefore is not completely independent of logical modeling concepts.

Pardillo et al. (2008) introduce platform-independent conceptual OLAP queries that can be automatically traced to their logical implementation, together with an OLAP algebra at the conceptual level by using the Object Constraint Language (OCL), aimed at allowing end-users to query data warehouses without being aware of logical details. The authors introduce cube manipulation operators (e.g., dimension addition and removal), and operators such as slice, dice, drill-across, multidimensional projection, roll-up and drill-down. Cabot et al. (2009) extend the former work through a conceptual specification of statistical functions using OCL. Finally, Pardillo et al. (2010) further extend OCL for OLAP querying, introducing a code-generation architecture aligned with the model-driven architecture (MDA), to map an extension to the OCL as a set of predefined OLAP operators. The main drawback of these works is that they are based on OCL, which is not user-friendly for data cube manipulation.

Let us now comment on models based on the data cube (i.e., the ones in class (c)).

Tsois et al. (2001) propose a multidimensional aggregation cube (MAC) using the concepts of dimension, dimension level, dimension member, drilling relationship, and dimension path. Although they address the DW modeling problem from the end-user point of view, and describe a set of requirements for the conceptual modeling of real-world OLAP scenarios, the authors do not present a language supporting the model.

Along different lines, some authors formally define the notion of a cube and introduce operations for this. Agrawal et al. (1997) propose a data model whose core features are the symmetric treatment of dimensions and measures, the support of multiple hierarchies along each dimension, and the possibility of performing ad-hoc aggregates. They also define a minimal set of algebraic operators that is composed of the following operators: push and pull (to allow symmetric treatment of dimensions and measures), destroy dimension, restriction (slice and dice), join, and associate. This minimal set of operators stays as close to relational algebra as possible and can be translated to SQL through an algebraic application programming interface.

The conceptual multidimensional model proposed by Gyssens and Lakshmanan (1997) focuses on the separation between structural aspects and the content, allowing the definition of a data manipulation language that can express the cube operator. They define an algebra (and an equivalent calculus), which includes set operators (like selection, projection, Cartesian product), operators for summarization, and re-structuring operators (fold and unfold).

Vassiliadis (1998) formally defines the concepts of dimensions, hierarchies, and cube to propose a model for multidimensional databases. He also introduces a set of cube operators based on the notion of the base cube, which is used for providing the calculation of the results of cube operations. These operations are: level climbing, packing, function application, projection, navigation, slicing, and dicing. He also provides a mapping of the proposed model to the relational model and to multidimensional arrays.

Datta and Thomas (1999) also defined a data cube model and an associated algebra. The data cube model includes the concepts of data cube, dimensions, dimension attributes, measures and attribute hierarchies, and focuses on the symmetric treatment of dimensions and measures. As for the algebra, the purpose of the work is to provide comprehensive OLAP functionalities, including aggregation (to be applied several times to enable roll-up and drill-down operations and comparisons of aggregate values), transformations to convert dimensions to measures and vice versa, and partitioning (grouping of data for aggregating purposes). The algebra has the following operators: restriction, aggregation, Cartesian product, join, union, difference, pull, push and partition.

The proposals in this class, although based on the concept of a cube, do not approach the problem from a conceptual modeling viewpoint, nor do they consider users' needs (i.e., the user-centric view). Further, the proposed operators are not the ones commonly needed by high-level users such as managers.

| Related Work | Focus | Cube Metaphor | Cube as ADT | Model Level | Model Extension | Algebra or Calculus |
|---|---|---|---|---|---|---|
| Abelló et al. (2001) | system-centric | yes | no | conceptual | object-oriented | yes |
| Abelló et al. (2002); Abelló et al. (2006) | system-centric | no | no | higher than logical, lower than conceptual | UML-based | yes |
| Agrawal et al. (1997) | system-centric | yes | no | higher than logical, lower than conceptual | not an extension | yes |
| Cabibbo and Torlone (1997) | system-centric | no | no | logical | not an extension | yes |
| Cabibbo and Torlone (1998) | system-centric | no | no | logical | not an extension | yes |
| Cabot et al. (2009) | system-centric | no | no | logical | UML-based | no |
| Datta and Thomas (1999) | system-centric | yes | no | higher than logical, lower than conceptual | not an extension | yes |
| Golfarelli and Rizzi (1998) | system-centric | no | no | logical | not an extension | no |
| Gyssens and Lakshmanan (1997) | system-centric | yes | no | logical | not an extension | yes |
| Lehner (1998) | system-centric | no | no | logical | not an extension | yes |
| Li and Wang (1996) | system-centric | yes | no | logical | not an extension | yes |
| Malinowski and Zimányi (2008) | system-centric | no | no | logical | not an extension | no |
| Nguyen et al. (2000) | system-centric | yes | no | higher than logical, lower than conceptual | UML-based | yes |
| Pardillo et al. (2008) | system-centric | yes | no | higher than logical, lower than conceptual | UML-based | yes |
| Pardillo et al. (2010) | system-centric | yes | no | higher than logical, lower than conceptual | UML-based | yes |
| Ravat et al. (2008) | system-centric | no | no | higher than logical, lower than conceptual | not an extension | no |
| Rizzi (2007) | system-centric | yes | no | logical | not an extension | no |
| Tryfona et al. (1999) | system-centric | no | no | conceptual | ER-based | no |
| Tsois et al. (2001) | user-centric | yes | no | conceptual | not an extension | no |
| Vassiliadis (1998) | system-centric | yes | no | higher than logical, lower than conceptual | not an extension | yes |
| Cube Algebra | user-centric | yes | yes | conceptual | not an extension | yes |

Table 1: Core properties of data warehouse models.


Table 1 refers to the following core characteristics of each studied DW model: (i) its focus (system-centric vs. user-centric); (ii) if the model supports the data cube metaphor; (iii) if the model provides a cube as an abstract data type; (iv) the level at which the model is defined (conceptual level, logical level, or somewhere in-between them); (v) the model that it is based on; and (vi) if it includes an OLAP algebra or calculus. We can see that almost all proposals defined at the conceptual level are actually system-centric rather than user-centric. On the contrary, the Cube Algebra proposal applies to the conceptual level of a data warehouse architecture, independently from implementation issues. The only user-centric model is the proposal of Tsois et al. (2001). However, this model does not provide a cube as an abstract data type and, more important, it does not propose an OLAP algebra or calculus to support it.

| Related Work | Operations defined only over the cube metaphor | Complexity of the query language | Kind of query language | Graphic tools or dashboards |
|---|---|---|---|---|
| Abelló et al. (2001) | no | complex | conceptual | no |
| Abelló et al. (2002); Abelló et al. (2006) | no | complex | relational algebra | no |
| Agrawal et al. (1997) | yes | complex | relational algebra | no |
| Cabibbo and Torlone (1997) | no | complex | calculus-based | no |
| Cabibbo and Torlone (1998) | no | complex | relational algebra | yes |
| Cabot et al. (2009) | no | complex | OCL-based | no |
| Datta and Thomas (1999) | yes | complex | relational algebra | no |
| Golfarelli and Rizzi (1998) | no | not available | not available | yes |
| Gyssens and Lakshmanan (1997) | no | complex | relational algebra/calculus | no |
| Lehner (1998) | no | not available | relational algebra | no |
| Li and Wang (1996) | no | complex | relational algebra | no |
| Malinowski and Zimányi (2008) | no | not available | not available | yes |
| Nguyen et al. (2000) | yes | complex | relational calculus | no |
| Pardillo et al. (2008) | no | complex | OCL-based | no |
| Pardillo et al. (2010) | yes | complex | relational algebra plus OCL | yes |
| Ravat et al. (2008) | no | user-friendly | relational algebra | yes |
| Rizzi (2007) | no | not available | not available | yes |
| Tryfona et al. (1999) | no | not available | not available | yes |
| Tsois et al. (2001) | no | not available | not available | yes |
| Vassiliadis (1998) | no | complex | relational calculus | no |
| Cube Algebra | yes | user-friendly | relational algebra | no |

Table 2: Query language and operations of data warehouse models.

Table 2 compares existing proposals against Cube Algebra, with respect to query language functionalities, namely: (i) if the operations are defined over the cube or over other data objects, at lower levels of abstraction (for example, tables, star schemas); (ii) the complexity of the query language for unskilled end users; (iii) the kind of query language, such as cube-based, relational algebra-based, relational calculus-based, OCL-based and MDX-based, or no language provided at all; and (iv) if the work provides some graphical notation or dashboard to aid in the data warehouse modeling. As we can see in Table 2, besides the Cube Algebra, the only proposal that introduces a query language at the conceptual level is that of Abelló et al. (2001). Nevertheless, that query language is complex for unskilled end users. Besides, the operations defined over the cube, such as base changes, generalization, specialization, and derivation, are far from the knowledge of managers and analysts in an OLAP scenario.

Finally, Table 3 details and compares the set of operations that each proposal provides, that is, general functionalities to create, manipulate, and update the cube metaphor, and the set of high-level operations for querying the data cube (roll-up, drill-down, slice, dice, drill-across, pivot). As we can see, only the Cube Algebra encompasses all the operations. We can also see that most of the proposals do not offer operations to create, manipulate, and update the cube.

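The first three operation columns are General Functionalities (create, manipulate, update cube); the remaining five are Query Cube operations.

| Related Work | create cube | manipulate cube | update cube | roll-up (drill-down) | slice | dice | drill-across | pivot |
|---|---|---|---|---|---|---|---|---|
| Abelló et al. (2001) | no | yes | no | yes | yes | yes | yes | no |
| Abelló et al. (2002); Abelló et al. (2006) | no | no | no | roll-up | no | yes | yes | no |
| Agrawal et al. (1997) | no | no | no | yes | yes | yes | no | no |
| Cabibbo and Torlone (1997) | no | no | no | no | no | no | no | no |
| Cabibbo and Torlone (1998) | no | no | no | roll-up | no | no | no | no |
| Cabot et al. (2009) | no | no | no | no | no | no | no | no |
| Datta and Thomas (1999) | no | no | no | no | yes | yes | no | no |
| Golfarelli and Rizzi (1998) | no | no | no | no | no | no | no | no |
| Gyssens and Lakshmanan (1997) | no | yes | no | roll-up | yes | yes | no | no |
| Lehner (1998) | no | no | no | yes | yes | no | no | no |
| Li and Wang (1996) | yes | yes | no | no | no | no | no | no |
| Malinowski and Zimányi (2005) | no | no | no | no | no | no | no | no |
| Nguyen et al. (2000) | no | no | no | yes | no | no | no | no |
| Pardillo et al. (2008) | no | yes | no | yes | yes | yes | yes | no |
| Pardillo et al. (2010) | no | yes | no | yes | yes | yes | yes | no |
| Ravat et al. (2008) | no | no | no | yes | yes | yes | yes | yes |
| Rizzi (2007) | no | no | no | no | no | no | no | no |
| Tryfona et al. (1999) | no | no | no | no | no | no | no | no |
| Tsois et al. (2001) | no | no | no | no | no | no | no | no |
| Vassiliadis (1998) | no | no | no | drill-down | yes | yes | no | no |
| Cube Algebra | yes | yes | no | yes | yes | yes | yes | yes |

Table 3: Types of operations of data warehouse manipulation languages.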
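To make the query operations compared in Table 3 more concrete, the following is a minimal Python sketch of dice, slice, and roll-up over a toy cube. It is only an informal illustration under stated assumptions, not the formal Cube Algebra definition: the dictionary-based cube representation, the function names, the cell values, the quarter-to-semester mapping, and the use of `sum` as the aggregation function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy cube: coordinates (station, quarter, pollutant) -> concentration.
# All values below are invented for illustration only.
DIMS = ("Station", "Time", "Pollution")
CELLS = {
    ("S1", "Q1", "P1"): 21,
    ("S1", "Q2", "P1"): 27,
    ("S2", "Q1", "P1"): 12,
    ("S2", "Q3", "P2"): 20,
}

# Hypothetical mapping from the Quarter level to the Semester level.
SEMESTER_OF = {"Q1": "H1", "Q2": "H1", "Q3": "H2", "Q4": "H2"}


def dice(cells, dims, predicate):
    """Keep only the cells whose coordinates satisfy the predicate."""
    return {c: v for c, v in cells.items() if predicate(dict(zip(dims, c)))}


def slice_(cells, dims, dim, member):
    """Fix one dimension to a single member and drop that dimension."""
    i = dims.index(dim)
    new_cells = {c[:i] + c[i + 1:]: v for c, v in cells.items() if c[i] == member}
    return new_cells, dims[:i] + dims[i + 1:]


def roll_up(cells, dims, dim, parent_of, agg=sum):
    """Replace each member of `dim` by its parent and aggregate the measure."""
    i = dims.index(dim)
    groups = defaultdict(list)
    for c, v in cells.items():
        groups[c[:i] + (parent_of[c[i]],) + c[i + 1:]].append(v)
    return {c: agg(vs) for c, vs in groups.items()}


if __name__ == "__main__":
    print(dice(CELLS, DIMS, lambda m: m["Pollution"] == "P2"))
    print(slice_(CELLS, DIMS, "Station", "S1"))
    print(roll_up(CELLS, DIMS, "Time", SEMESTER_OF))
```

In this sketch every operation takes a cube and returns a cube, which mirrors the idea of an algebra whose only sort is the cube. A real measure such as a concentration would more plausibly be aggregated with an average than a sum; `sum` is used here only to keep the example short.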


Figure 1: (a) A three-dimensional cube for AirQuality data having dimensions Time, Pollution, and Station, and a measure concentration. (b) Dimension hierarchies.
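As an informal illustration of the dimension hierarchies in Figure 1b, the sketch below encodes the Pollution hierarchy as a set of child-to-parent edges between levels and checks the structural properties the paper requires of a dimension schema: a unique bottom level, All as the unique top level, every level reachable from the bottom, and every level reaching All. The edge list is the one the paper gives for Pollution; the function names and the set-of-pairs representation are assumptions made for this example, and acyclicity is not checked.

```python
# Edges of the Pollution dimension schema (child level -> parent level),
# as listed in the paper for this dimension.
EDGES = {
    ("Pollutant", "Type"),
    ("Type", "Group"),
    ("Group", "All"),
    ("Pollutant", "Category"),
    ("Category", "All"),
}


def reachable(edges, start):
    """Return all levels reachable from `start`, including `start` itself."""
    seen, stack = {start}, [start]
    while stack:
        level = stack.pop()
        for child, parent in edges:
            if child == level and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_valid_schema(edges, top="All"):
    """Check unique bottom, `top` as the unique top, and full connectivity.

    This hypothetical helper does not verify that the graph is acyclic.
    """
    levels = {level for edge in edges for level in edge}
    bottoms = [l for l in levels if all(parent != l for _, parent in edges)]
    tops = [l for l in levels if all(child != l for child, _ in edges)]
    if len(bottoms) != 1 or tops != [top]:
        return False
    if reachable(edges, bottoms[0]) != levels:
        return False
    return all(top in reachable(edges, l) for l in levels)


if __name__ == "__main__":
    print(is_valid_schema(EDGES))
    # Dropping Category -> All leaves Category as a second top level.
    print(is_valid_schema(EDGES - {("Category", "All")}))
```

The two paths from Pollutant to All (through Type and Group, and through Category) share one graph, which is how a single schema can carry multiple hierarchies.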

CUBE DATA MODEL AND measured. These aspects form the three THREE-LEVEL DATA dimensions of a cube that are shown on the ARCHITECTURE Figure 1a. In this example, we call the dimensions Station , Time , and In this section, we present our user- Pollution . Dimensions are an essential centric cube data model and show how it and distinguishing first-class concept in fits into the landscape of data warehouses. cubes. They are visually or geometrically By selecting the application scenario of represented by the lateral faces of the pollution control, next section informally (hyper) cube. A dimension is organized as a introduces the main cube concepts that a containment-like hierarchy. Each hierarchy user should be able to understand. Then, we level represents a different (aggregation) present a three-level architecture of a data level of detail, as it is later required by the warehouse and OLAP system that integrates desired analyses. Figure 1b shows the three our Cube Algebra, and formally define the hierarchical structures of the Station , underlying user-centric data model Time , and Pollution dimensions. Each supporting such algebra. Finally, we sketch hierarchical structure is called a dimension a high-level data definition language for schema . For example, the dimension data cubes. schema for Pollution models two Application Scenario: Cubes for hierarchies with Pollutant as their common lowest level, namely the hierarchy Pollution Control consisting of the levels Pollutant , Type , We illustrate the needed user concepts and Group , as well as the hierarchy by leveraging an application scenario of consisting of the levels Pollutant and pollution control in Belgium. Monitoring Category . The levels above Pollutant stations located at different locations enable allow grouping and are therefore interesting the measurement of certain pollutants (such for the knowledge worker. For example, as carbon monoxide, CO). 
Thus, three similar pollutants can be grouped under the aspects or perspectives play a role here for level Category . Note that there is a unique the knowledge worker: (i) the monitoring top level in the dimension schema, denoted station where a measurement is captured, All , to which all levels aggregate. Each (ii) the time when a measurement is taken, dimension level includes a finite set of and (iii) the kind of pollution that is Copyright © 2012, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.

values called members. A dimension instance comprises all members at all levels of a dimension hierarchy. Figure 3 gives an example of a dimension instance with respect to the dimension schema for Pollution shown in Figure 1b. The level Type, for example, contains the three members T1, T2, and T3.

A combination of members taken from each dimension uniquely defines a cell of a cube and implicitly specifies a fact if the cell is not empty. For example, in Figure 1a, at station S2 and quarter Q3, the pollutant P4 was measured since the cell is not empty. A value in a cell (such as 12 in our case) is called a measure value. A measure represents a numerical property of a fact. In our example, the measure is named concentration. A cube can have several different measures. Each measure comes with an aggregation function that can combine several measure values into one.

Data Warehouse Architecture

Figure 2 shows the three-level data warehouse architecture we devise (see Ullman (1988)). At the conceptual level (i.e., the highest abstraction level) of this architecture, there is the cube model described above (independent of how this cube is actually implemented), and the associated cube algebra we introduce later. At the logical level we have the implementation-dependent representation of the data cube. At this level we place the well-known star, snowflake, and constellation schemas (i.e., a ROLAP representation), as well as multidimensional (MOLAP) and hybrid (HOLAP) representations. Query languages for these representations are relational query languages such as SQL dialects, and MDX. Note that although MDX works on cubes in the same way as SQL works on tables, it cannot be considered a language operating at the conceptual level: not only has its semantics never been clearly defined, but the language is also far from being user-friendly (as we will show later), a key issue in our approach. Finally, at the physical level, we find the different ways of efficiently implementing the data warehouse. For example, ROLAP implementations can use multidimensional indices such as variations of R-Trees, as well as bitmap indices. MOLAP implementations often use efficient, proprietary algorithms for implementing sparse matrices.

Figure 2: The three-level architecture for multidimensional databases.

Formal Cube Data Model

Our cube-based, user-centric conceptual model is supported by the formal model we introduce next. For simplicity, and without loss of generality, we assume that dimension level names are unique^i.

Definition 1 (Dimension Schema)
A dimension schema is a tuple 〈nameDS, L, →〉 where: (a) nameDS is the name of the dimension; (b) L is a non-empty finite set of pairs of the form 〈l, A〉 such that l is a level (there is a distinguished level name denoted All, such that 〈All, ∅〉 ∈ L), and A is a set of attributes describing a level. Each attribute a has a domain Dom_a; without loss of generality, we consider that one attribute in A univocally identifies a member in level l; (c) → is a partial order on the levels l ∈ L. This partial order defines a graph, whose nodes are the levels l ∈ L, annotated by the attributes in A; (d) the reflexive and transitive closure of →, denoted →*, has a unique bottom level

l_b, and a unique top level. The top level is the distinguished level All. We denote by l_i →* l_j the fact that there is a path between l_i and l_j. Moreover, all levels l are such that l_b →* l, and l →* All. □

Intuitively, a dimension schema is a directed acyclic graph (DAG). Each node in this graph represents an aggregation level and is annotated with a list of attributes. There is a distinguished level denoted All without attributes, and a unique bottom level. All levels are (directly or transitively) reachable from the bottom, and all levels (directly or transitively) reach the level All. Definition 1 could be simplified if we ignore the level attributes, and consider each node in the graph as a single data element. However, we decided to state the formal model in this way to account for the way in which the user operates with the cube in real-world practice, where she defines a dimension level as an aggregation level, and attributes are displayed after aggregation has been performed.

Definition 2 (Dimension Instance)
An instance I of a dimension schema 〈nameDS, L, →〉 consists of (a) a finite set of members M_l, for each level l in L, such that each member can be uniquely identified (the level All has a unique member all); (b) a set of partial functions, denoted as roll-up functions (following Cabibbo & Torlone (1997)), of the form Roll-up_{l_i}^{l_j} from the members of level l_i to the members of level l_j, for each pair of levels l_i and l_j in L such that l_i → l_j in →; (c) a collection of functions f_l^1,…, f_l^k mapping members of l to values in the domain of each level attribute a_1,…, a_k ∈ A. □

Intuitively, a dimension instance is also a DAG. Associated with an edge (l_i, l_j) in the schema graph such that l_i → l_j, there is a function from l_i to l_j. This function describes how members in the lower level aggregate to members in the upper level. Thus, a dimension instance is just a collection of such functions. Also note that our model supports multiple hierarchies, meaning that the same DAG models the different aggregation paths from bottom to top.

Example 1 (Dimension Schema and Instance)
In Figure 1b we can see three dimension schemas. Let us consider the one for dimension Pollution. The schema of this dimension is formally defined as follows.

    nameDS = Pollution,
    L = {〈Pollutant, (name, loadLimit)〉, 〈Type, (name)〉,…, 〈Group, (name)〉},
    → = {Pollutant → Type, Type → Group, Group → All, Pollutant → Category, Category → All}

Note that in Figure 1b the attributes in the dimension levels that are underlined are the identifiers.

The instances for dimension Pollution are depicted in Figure 3, and are of the form:

    M_Pollutant = {P1,…, P5}, M_Category = {C1, C2}, M_Type = {T1, T2, T3}, …
    Roll-up_{Pollutant}^{Type} = {(P1,T1), (P2,T2), …, (P5,T3)}, …
    Roll-up_{Pollutant}^{Category} = {(P1,C1), (P2,C1), …, (P5,C2)}
    Roll-up_{Category}^{All} = {(C1,all), (C2,all)}
    f_Pollutant^name(P1) = CO,…, f_Pollutant^name(P5) = PM.
    f_Pollutant^loadLimit(P1) = 34,…, f_Pollutant^loadLimit(P5) = 44, …
    f_Category^name(C1) = gas,…, f_Category^name(C2) = solid.

Figure 3: Some instances (left) and roll-up functions (right) of dimension Pollution of Figure 1b.
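To make Definitions 1 and 2 more tangible, the following minimal Python sketch encodes the Pollution dimension as a DAG of levels plus one partial roll-up function per edge. Only the pairs explicitly listed in Example 1 are included (the elided ones are left out, which is consistent with roll-up functions being partial); the names `EDGES`, `ROLL_UP`, `paths`, and `roll_up` are illustrative assumptions, not part of the model.

```python
# Schema graph of the Pollution dimension (the partial order "->" of Example 1).
EDGES = {
    "Pollutant": ["Type", "Category"],
    "Type": ["Group"],
    "Group": ["All"],
    "Category": ["All"],
    "All": [],
}

# One partial roll-up function per schema edge; only pairs stated in Example 1.
ROLL_UP = {
    ("Pollutant", "Type"): {"P1": "T1", "P2": "T2", "P5": "T3"},
    ("Pollutant", "Category"): {"P1": "C1", "P2": "C1", "P5": "C2"},
    ("Category", "All"): {"C1": "all", "C2": "all"},
}


def paths(src, dst):
    """Yield every path of levels from src to dst in the schema DAG."""
    if src == dst:
        yield [src]
        return
    for nxt in EDGES[src]:
        for rest in paths(nxt, dst):
            yield [src] + rest


def roll_up(member, src, dst):
    """Compose the edge functions along some path; None if no path is defined."""
    for path in paths(src, dst):
        current = member
        for lower, upper in zip(path, path[1:]):
            current = ROLL_UP.get((lower, upper), {}).get(current)
            if current is None:
                break  # this path is not (fully) defined for the member
        else:
            return current
    return None


# Every level reaches the top level All, as Definition 1 requires.
assert all(any(True for _ in paths(level, "All")) for level in EDGES)
```

With this encoding, `roll_up("P2", "Pollutant", "Category")` follows a single edge, while `roll_up("P5", "Pollutant", "All")` composes two edge functions along the Category path.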

Intuitively, the aggregation hierarchy is used as follows: let us suppose that the concentrations of pollutants P2 and P3 measured at station S1 in a given day D1 are 30 and 40 mg/m³, respectively. If we want to know the average concentration at S1, aggregated by type on that day, according to the instance depicted in Figure 3, P2 and P3 will contribute to the aggregation over type T2. Thus, for D1, S1, T2, we will have a value of 35. □

Definition 3 (Cube Schema)
A cube schema is a tuple 〈nameCS, D, M〉 where nameCS is the name of the cube, D is a finite set of dimension levels, with |D| = d, corresponding to d bottom levels of d dimension schemas, different from each other, and M is a finite set of m attributes called measures. Each measure also has an associated domain. □

Definition 4 (Cube Instance)
Consider a cube schema 〈nameCS, D, M〉; each l_bi ∈ D, i = 1,…, d, has a set of members. Let us call Points = {(c_1,…, c_d) | c_i is a member of l_bi, i = 1,…, d}. A cube instance C is a partial function C: Points → dom(M_1) × … × dom(M_m) where M_i ∈ M, i = 1,…, m. □

Example 2 (Cube Schema and Instance)
Let us now define a cube denoted AirQuality, composed of the dimensions Pollution, Time, and Station, and a measure concentration, as introduced in our application scenario (Figure 1a). The cube has schema 〈AirQuality, {Pollutant, Day, Station}, {concentration}〉 with instances of the form C(P1,T1,S1) = 35,…, C(P5,T3,S4) = 44,…. □

Cube Algebra Definition Language

We now sketch a language, which we denote CADL (standing for Cube Algebra Definition Language). The language aims at providing a description of the conceptual data cube model, and could be the basis of a Cube Definition Language at lower abstraction levels.

In CADL, a cube schema is defined using the keyword CUBE followed by the cube name. Each cube is composed of a set of dimensions, identified with the keyword DIMENSION followed by the dimension name. After defining a dimension, we must list all dimension levels, along with their attributes. A dimension level is defined with the keyword LEVEL, followed by its name. In addition, each level contains a set of associated attributes, defined with the keyword ATTRIBUTES, followed by a list of attribute names and their types. In addition, the optional keyword UNIQUE indicates that the value of the attribute is unique among all the values of such attribute for all the level members.

For our running example, we define the AirQuality data cube, along with its dimensions, levels, and attributes in CADL as follows.

    CUBE AirQuality {
      DIMENSION Time
        LEVEL Day ATTRIBUTES {date (date) UNIQUE, season (string)}
        LEVEL Month ATTRIBUTES {month (string) UNIQUE}
        LEVEL Quarter ATTRIBUTES {quarter (string) UNIQUE}
        LEVEL Semester ATTRIBUTES {semester (string) UNIQUE}
        LEVEL Year ATTRIBUTES {year (string) UNIQUE}
        Day ROLL-UP to Month
        Month ROLL-UP to Quarter
        Quarter ROLL-UP to Semester
        Semester ROLL-UP to Year
      DIMENSION Station
        LEVEL Station ATTRIBUTES {name (string) UNIQUE}
      DIMENSION Road
        LEVEL Road ATTRIBUTES {name (string) UNIQUE, length (real)}

      DIMENSION Pollution
        LEVEL Pollutant ATTRIBUTES {name (string) UNIQUE, loadLimit (real)}
        LEVEL Category ATTRIBUTES {name (string) UNIQUE}
        LEVEL Type ATTRIBUTES {name (string) UNIQUE}
        LEVEL Group ATTRIBUTES {name (string) UNIQUE}
        Pollutant ROLL-UP to Category
        Pollutant ROLL-UP to Type
        Type ROLL-UP to Group
      DIMENSION Geography
        LEVEL District ATTRIBUTES {name (string) UNIQUE, area (real)}
        LEVEL Province ATTRIBUTES {name (string) UNIQUE}
        District ROLL-UP to Province
      MEASURE concentration (real)
    }

CUBE ALGEBRA QUERY LANGUAGE

In this section we sketch our proposal for a query language that implements the ideas discussed in the preceding sections. We define an algebra, which we denote Cube Algebra, such that the user can define her queries just by means of the typical OLAP operators. We first describe how the user would be able to operate in our application scenario, intuitively manipulating the cube using the traditional OLAP operations, e.g., slicing, dicing, rolling-up, drilling down, and pivoting. This basic set of operators can be extended with other ones that are useful in real-world OLAP practice. We illustrate this extensibility by introducing the map operator, which allows changing the values of the measures in a cube, for example, to convert currency units or to perform what-if analysis. Then, we formally define each operator in our algebra. We conclude with some example queries that illustrate the query language.

Application Scenario: OLAP Operations for Pollution Control

We now discuss how the user-centric conceptual model we propose in this paper could be used to analyze data. Let us recall the cube in Figure 1a, containing quarterly values of pollutant concentration at each measuring station, for the year 2011. The end user can operate intuitively over this cube in order to analyze data in different ways. Figure 4 shows a sequence of such operations, which start from the initial cube of Figure 1a. We describe her operations on a step-by-step basis.

The user first wants to compute the sum of concentrations per semester, station, and pollutant, to look for significant differences between these periods, if they exist. For this, the Cube Algebra offers her a roll-up operation, which she applies along the Time dimension. The result is shown in Figure 4a: the new cube contains two values over the Time dimension, each corresponding to one semester (the original cube contained four values, one for each quarter). The remaining dimensions are not affected. All values in cells corresponding to the same pollutant and station (for example, P1 and S1, respectively), and to quarters Q1 or Q2, contribute to the aggregation to the values in the first semester (S1). We can see in Figure 1a that the concentrations of P1 measured at station S1 for the first and second quarters are, respectively, 21 and 27. In Figure 4a we also see that these values are aggregated to 48 in the first semester. Computation of the cells corresponding to the second semester proceeds analogously.

Our user then notices that in the second semester the concentration of pollutant P3 at station S1 was unusually high. The Cube Algebra allows her to drill down along the Time dimension, to the month level, to find out if this high value is due to a particular month. In this way, she discovers that

December presented a much higher concentration of this pollutant than the other months (Figure 4b). Note that since she now starts from data aggregated by semester, station, and pollutant, the user needs first to take the cube back to the quarter aggregation level, and then continues drilling down to the month level.

Figure 4: OLAP operations. (a) Roll-up to the Semester level; (b) Drill-down to the Month level; (c) Pivot; (d) Slice on Station for StationId = 'S1'; (e) Dice on Station = 'S1' or 'S2' and Time.Quarter = 'Q1' or 'Q2'; (f) Map function for defining contamination levels.
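The roll-up step of Figure 4a can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the cube is a plain dictionary keyed by (pollutant, station, quarter), it contains only the two cells whose values are quoted in the text (21 and 27 for P1 at S1), and the semester labels `Sem1` and `Sem2` as well as the function name `roll_up_time` are hypothetical (the text calls the first semester S1, which we avoid here because it clashes with station S1).

```python
from collections import defaultdict

# Roll-up function from the Quarter level to the Semester level.
QUARTER_TO_SEMESTER = {"Q1": "Sem1", "Q2": "Sem1", "Q3": "Sem2", "Q4": "Sem2"}

# Fragment of the cube of Figure 1a: only the two cells quoted in the text.
cube = {
    ("P1", "S1", "Q1"): 21,
    ("P1", "S1", "Q2"): 27,
}


def roll_up_time(cells, mapping, aggregate=sum):
    """Aggregate measure values along Time; other dimensions are untouched."""
    groups = defaultdict(list)
    for (pollutant, station, quarter), value in cells.items():
        groups[(pollutant, station, mapping[quarter])].append(value)
    return {coords: aggregate(values) for coords, values in groups.items()}


by_semester = roll_up_time(cube, QUARTER_TO_SEMESTER)
```

Here `by_semester` holds the single cell (P1, S1, Sem1) with value 48, matching Figure 4a. Note that the rolled-up cube alone does not contain enough information to drill down again; the finer-grained cube has to be kept, or recomputed, for that.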

Continuing her browsing of the cube, our user now wants to see the cube with the Time dimension on the x axis. Therefore, she rotates the axes of the cube without changing granularities. This restructuring operation is called pivoting (Figure 4c). (Note that she previously rolled up the cube back to the one of Figure 1a.)

She then wants to visualize time series of average pollutant concentration by quarter, only for the station S1. For this, she first applies a dice operator that selects the sub-cube containing only values for the station S1, and then eliminates the Station dimension, applying a slice operation. This is depicted in Figure 4d. Here, she obtained

a matrix, where each column represents the evolution of the concentration of a pollutant by quarter, i.e., a collection of time series.

Our user also wants to compare, for each pollutant, the concentration values against similar records for 2010. For this, she has a two-dimensional "cube" similar to the one in Figure 4d, with the average concentrations by quarter and by pollutant, for 2010. She would like to have this information consolidated in a single cube. For this, she knows that the Cube Algebra offers the drill-across operator that, given two cubes, builds a new one with the measures of both, making the comparison very easy. In this way, if in 2010, for P4, the average concentration on Q1 was 32, the cell corresponding to (Q1,P4) resulting from a drill-across will contain the pair (35,32).

Next, she wants pollution information corresponding only to stations S1 and S2 in the first two quarters. For this, starting over from the original cube, she produces a sub-cube, using the dice operator (Figure 4e).

Finally, instead of a cube containing pollution values, she wants to produce a cube containing indicators of pollution classified in four categories: Low (L) for values less than or equal to 20; Moderate (M) for values greater than 20 and less than or equal to 30; High (H) for values greater than 30 and less than or equal to 45; and Very High (VH) for values greater than 45. For this, she uses the map operation shown in Figure 4f.

In what follows, to make the examples more interesting, we add a Geography dimension, composed of levels District and Province, with a roll-up relationship defined between them, and a Road dimension, indicating the roads where the stations are located.

Cube Algebra Query Operators

We now formally define the operators of our Cube Algebra in terms of our data model. Even though there are many works describing multidimensional operations (Agrawal, Gupta & Sarawagi, 1997; Gyssens & Lakshmanan, 1997; Vassiliadis, 1998), curiously none of them describes a common whole set of them in terms of the well-known slice, dice, roll-up (and its inverse, drill-down) and drill-across, which are the ones that intuitively reflect how an OLAP user manipulates a cube, or combines two cubes. Existing efforts, such as the ones cited above, usually define a subset of these operators, combined with other ones that, although suited to the models proposed by the authors, are, in most cases, not intuitive to non-expert users (e.g., push-pull, nest). The same occurs with the many commercial OLAP tools, as we show later taking MDX (the de facto standard for OLAP) as an example. We therefore chose to define a small set of the most intuitive and widely used operators for cube manipulation, which, in addition, can be shown to be orthogonal to each other. That is, none of them can be expressed as a combination of the others. However, it should not be assumed that this set of operators is minimal in a formal mathematical sense.

This basic set of operators can be extended with many other ones, in order to add functionality to the language. As an example of this, we introduce the map operator, which applies the same function to all the cells in a data cube, allowing, for example, currency conversion, or more complex operations such as the one illustrated in the example of the previous section.

Remark Although the pivot operator was introduced in Figure 4c, since it does not modify either the cube schema or the cube instances, and it is just used for (mainly) interactive visualization, we do not include it in the following discussion. □

Dice This operator receives a cube and a Boolean condition ϕ, and returns another cube containing only the cells that satisfy ϕ. The syntax for this operation is

    DICE(cube_name, ϕ)

where ϕ is a Boolean condition over dimension levels and measures.

The semantics is the following. Dice receives a cube with schema S = 〈nameCS, D, M〉 and instances of the form C(c_11,…, c_1d) = m_1,…, C(c_m1,…, c_kj) = m_r (for simplicity let us assume only one measure), and returns a cube with the same schema, and the points (c_i1,…, c_id, m_i) that make the condition ϕ true. We can consider dice analogous to a relational selection.

Roll-up and Drill-down The roll-up operator aggregates measures according to a dimension hierarchy (using an aggregate function), to obtain measures at a coarser granularity for a given dimension.

The syntax for the roll-up operation is:

    ROLL-UP(cube_name, Dimension->level, (measure, aggregate_function)*)

The term Dimension->level indicates to which level in a dimension we want the roll-up to be performed. Note that since there can be more than one measure, we must specify an aggregate function for each one of them. In what follows, we assume that if there is only one measure in the cube, for conciseness we only specify the aggregate function.

As for the semantics, roll-up receives a cube with schema S = 〈nameCS, D, M〉, a level l in a dimension D, such that l_s ∈ D, l_s →* l in D, and an aggregate function F_agg. Roll-up returns a cube whose cells are aggregated along D up to the level l. Thus, all values v_1,…, v_k in the original cube, such that C(c_1,…, c_li,…, c_d) = v_j, j = 1,…, k (Definition 5) contribute to the aggregation over Roll-up_{l_s}^{l}(c_li).

Example 4 (Roll-up)
Suppose in Example 1 that we have the following coordinates for the cube AirQuality: (P1,T1,35), (P5,T3,44), (P4,T3,22). According to Figure 3:

    Roll-up_{Pollutant}^{Category}(P1) = C1
    Roll-up_{Pollutant}^{Category}(P4) = C2
    Roll-up_{Pollutant}^{Category}(P5) = C2.

Then rolling up from the cube AirQuality to a new cube with schema 〈AirQCateg, {Category, Time}, {concentration}〉 yields the instance (C1,T1,35), (C2,T3,66), where the two cells for P5 and P4 at T3 are added up (44 + 22 = 66). □

Drill-down de-aggregates previously summarized measures and can be considered the inverse of roll-up. Following Agrawal et al. (1997), we consider drill-down a high-level operation that can be implemented by tracking the (stored) paths followed during user rolling-up. Therefore, we omit its definition.

Slice Removes a dimension in a cube, i.e., a cube of n-1 dimensions is obtained from a cube of n dimensions. The dimension to remove must contain a unique value in its domain. If the dimension has more than one value, two approaches can be used: apply either a roll-up operator for summarizing into a singleton (i.e., all) (Agrawal, Gupta & Sarawagi, 1997), or (prior to slicing) a dice operator, to obtain a cube with only one value in the selected dimension.

The syntax of this operator is:

    SLICE(cube_name, Dimension, [ROLL-UP{Aggregate_function}])

According to what we explained above, ROLL-UP{Aggregate_function} stands for ROLL-UP(cube_name, Dimension->All, Aggregate_function). The former yields a more concise expression. If the roll-up is not included, slice will have two arguments, meaning that the dimension instance has already been reduced to a single value. If this is not the case, the operator fails. That is, the operator is analogous to a relational projection.

The semantics is the following. Slice receives a cube with schema S = 〈nameCS, D, M〉, and instances of the form {(c_11,…, c_s,…, c_1d, m_1),…, (c_m1,…, c_s,…, c_kd, m_r)} (where c_s is the unique value in l_s), and a dimension name d_s (assume that the level

l_s ∈ D belongs to dimension d_s). The operator returns a cube with schema S_1 = 〈nameCS, (D \ {l_s}), M〉, where the instances are of the form (c_11,…, c_s-1, c_s+1,…, c_1d, m_1),…, (c_m1,…, c_s-1, c_s+1,…, c_kd, m_r), that is, the same as in the original one, except for the coordinate corresponding to l_s.

Drill-across Relates information contained in two data cubes having the same dimensions. Thus, measures from different cubes can be compared. According to Kimball and Ross (2002), drill-across can only be applied when both cubes have the same schema dimensions and the same instances. Other authors relax this restriction. This is the approach of Abelló et al. (2002), who define two concepts: (a) Dimension-Dimension Derivation: Used when two dimensions come from a common concept although their structures differ, for example, because their granularities are not the same. In this case, a roll-up can be applied to make both dimensions consistent. (b) Dimension-Dimension Association: Corresponds to the case in which two cubes have different dimensions, but one of them could be defined as the association of several ones. For example, in one cube we define latitude and longitude as separate dimensions; in another one we store only one dimension containing the 'point' geometry. A mapping function can solve this problem. Other authors also address this problem, all of them along the same lines (Cabibbo & Torlone, 2004; Riazati, Thom & Zhang, 2008). We assume that the operator receives two compatible cubes (i.e., sharing dimensions and instances). The syntax of the operator is:

    DRILL-ACROSS(cube_name_1, cube_name_2)

Let us now explain the semantics. Drill-across receives two cubes with schemas S_1 = 〈nameCS1, D, M_1〉 and S_2 = 〈nameCS2, D, M_2〉, that is, the dimension levels are the same. The corresponding instances (except for the measures) are the same. The result is a cube with schema S = 〈nameCS, D, M_1 ∪ M_2〉, with the same instances as the input cubes. In other words, the operator is analogous to a relational natural join.

We now formally define the map operator, which extends the basic set of operations defined above. We remark that this is only one of the many operators the Cube Algebra could be extended with.

Map This operator receives a cube and a collection of pairs (m_i, f_i), where m_i is a measure and f_i is a function mapping values in Dom_mi to values in the same or another domain (see the example above, where concentration values in the domain of the real numbers are mapped to the domain of strings, by a partitioned function). The operator returns another cube, with the same dimension schema and instances, and with the values in each cell that correspond to the mappings produced by each function f_i. The syntax for the map operation is:

    MAP(cube_name, (measure, function)*)

The semantics is the following. Map receives a cube with schema S = 〈nameCS, D, M〉 and instances of the form C(c_11,…, c_1d) = m_1,…, C(c_m1,…, c_kj) = m_r (for simplicity we assume only one measure), and returns a cube with the same schema, and instances of the form C(c_11,…, c_1d) = f(m_1),…, C(c_m1,…, c_kj) = f(m_r).

Cube Algebra QL by Example

We now give the flavor of the language by means of examples, and show that Cube Algebra allows an OLAP user to express queries using just the operators she is acquainted with, instead of, for example, complex MDX expressions. We make our point by showing, for each query, a possible MDX version.
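Before turning to the queries, the operator semantics given above can be approximated by a small in-memory Python sketch. It is not the authors' implementation: the `Cube` class, the function names, and the measure name `concentration_2010` are assumptions made for illustration. A cube instance is modeled as a partial function from coordinate tuples to measure values, the roll-up function of a dimension is passed as a dictionary, and the classification thresholds follow the application scenario (Low up to 20, Moderate up to 30, High up to 45, Very High above).

```python
from dataclasses import dataclass


@dataclass
class Cube:
    dims: tuple   # dimension names, in coordinate order
    cells: dict   # coordinate tuple -> {measure name: value}; absent = empty cell


def dice(cube, condition):
    """Keep the cells for which condition(coordinates, measures) is true."""
    kept = {c: m for c, m in cube.cells.items()
            if condition(dict(zip(cube.dims, c)), m)}
    return Cube(cube.dims, kept)


def roll_up(cube, dim, mapping, aggregate):
    """Replace each member of `dim` by mapping[member]; aggregate collisions."""
    i = cube.dims.index(dim)
    groups = {}
    for coords, measures in cube.cells.items():
        key = coords[:i] + (mapping[coords[i]],) + coords[i + 1:]
        for name, value in measures.items():
            groups.setdefault(key, {}).setdefault(name, []).append(value)
    cells = {k: {n: aggregate(vs) for n, vs in ms.items()}
             for k, ms in groups.items()}
    return Cube(cube.dims, cells)


def slice_(cube, dim):
    """Drop `dim`; fails unless it has already been reduced to one member."""
    i = cube.dims.index(dim)
    if len({c[i] for c in cube.cells}) > 1:
        raise ValueError(f"{dim} has several members; roll up or dice first")
    cells = {c[:i] + c[i + 1:]: m for c, m in cube.cells.items()}
    return Cube(cube.dims[:i] + cube.dims[i + 1:], cells)


def drill_across(left, right):
    """Merge the measures of two cubes sharing dimensions and instances."""
    if left.dims != right.dims or left.cells.keys() != right.cells.keys():
        raise ValueError("cubes are not compatible")
    return Cube(left.dims,
                {c: {**left.cells[c], **right.cells[c]} for c in left.cells})


def map_measure(cube, measure, fn):
    """Apply fn to one measure in every cell."""
    return Cube(cube.dims, {c: {**m, measure: fn(m[measure])}
                            for c, m in cube.cells.items()})


def level(value):
    """Partitioned function of the application scenario (Figure 4f)."""
    if value <= 20:
        return "L"
    if value <= 30:
        return "M"
    if value <= 45:
        return "H"
    return "VH"


# Example 4: roll up from Pollutant to Category, summing concentrations.
TO_CATEGORY = {"P1": "C1", "P4": "C2", "P5": "C2"}
air = Cube(("Pollution", "Time"), {
    ("P1", "T1"): {"concentration": 35},
    ("P5", "T3"): {"concentration": 44},
    ("P4", "T3"): {"concentration": 22},
})
by_category = roll_up(air, "Pollution", TO_CATEGORY, sum)
levels = map_measure(by_category, "concentration", level)
only_c2 = dice(by_category, lambda coords, m: coords["Pollution"] == "C2")
time_only = slice_(only_c2, "Pollution")

# Drill-across on the (Q1, P4) cell discussed in the application scenario.
y2011 = Cube(("Time", "Pollution"), {("Q1", "P4"): {"concentration": 35}})
y2010 = Cube(("Time", "Pollution"), {("Q1", "P4"): {"concentration_2010": 32}})
both = drill_across(y2011, y2010)
```

The sketch mirrors the relational analogies made in the text: `dice` behaves like a selection, `slice_` like a projection that refuses to drop a dimension still holding several members, and `drill_across` like a natural join on the shared coordinates.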

Query 1 For each province and pollutant category, give the average concentration by quarter.

    c1 := SLICE (AirQuality, Station, ROLL-UP{Avg})
    c2 := SLICE (c1, Road, ROLL-UP{Avg})
    c3 := ROLL-UP (c2, {Time->Quarter, Pollution->Category, Geography->Province}, Avg)

Remark The roll-up operator can only be applied over one dimension at a time. For simplicity, in the query above, the expression for c3 is shorthand for:

    c3 := ROLL-UP(c2, Time->Quarter, Avg);
    c4 := ROLL-UP(c3, Geography->Province, Avg);
    c5 := ROLL-UP(c4, Pollution->Category, Avg);

This is the syntax we use in the sequel. □

We next show how this query would look in MDX.

Query 1 (MDX Version)
We assume that measure concentration has been associated with the aggregate function Avg. In Cube Algebra, the levels of a dimension are organized as a lattice. On the contrary, in MDX the levels are organized into named linear hierarchies. Thus, when referring to a level we must qualify it with the name of a hierarchy it belongs to (note that there could be more than one). This is the case of the dimension Pollution in the query below.

In MDX, Query 1 reads:

    SELECT
      [Geography].[Hierarchy].[Province].Members On Axis(0),
      [Pollution].[H0].[Category].Members On Axis(1),
      [Time].[Hierarchy].[Quarter].Members On Axis(2)
    FROM AirQuality
    WHERE (Station.[All], Road.[All])

Compared to the simplicity of the Cube Algebra expression above, the MDX query looks cryptic and unintuitive. In part, this is due to the fact that, as shown in Figure 2, MDX is placed at the logical level, while the Cube Algebra is at the conceptual level. Besides, MDX lacks a clearly defined semantics. On the contrary, the semantics of each Cube Algebra operator is clear, intuitive, and well known to most OLAP practitioners. □

Query 2 Number of districts where, for at least one pollutant, the average load of air pollution in 2011 was larger than the concentration limit.

    c1 := ROLL-UP (AirQuality, Time->Year, {Avg})
    c2 := DICE (c1, Time.Year.year = 2011)
    c3 := SLICE (c2, Time, ROLL-UP{Avg})
    c4 := SLICE (c3, Station, ROLL-UP{Avg})
    c5 := SLICE (c4, Road, ROLL-UP{Avg})

We have obtained a cube with dimensions Geography and Pollution containing data only for 2011. Now:

    c6 := DICE (c5, concentration >= Pollution.Pollutant.loadLimit)
    c7 := SLICE (c6, Pollution, ROLL-UP{Avg})
    c8 := ROLL-UP (c7, Geography->All, {Count})

The semantics of the expression

    DICE (c5, concentration >= Pollution.Pollutant.loadLimit)

is the following: each cell in c5 is analyzed, and the values of the variables are instantiated with the values of the cell coordinates and measures (i.e., Pollution.Pollutant is instantiated with the identifier of the pollutant corresponding to the cell). Analogously, concentration is instantiated with the value of the measure in the cell.
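The cell-by-cell evaluation of the dice condition described above can be sketched as follows. This is only an illustration: the cell values and the district identifiers D1 to D3 are hypothetical, the load limits for P1 and P5 are the ones given in Example 1, and the function names are ours. The final step counts distinct districts, which is what the slice on Pollution followed by the roll-up with Count achieves in Query 2.

```python
# f_Pollutant^loadLimit, restricted to the two values stated in Example 1.
LOAD_LIMIT = {"P1": 34, "P5": 44}

# Hypothetical instance of cube c5: (district, pollutant) -> average concentration.
c5 = {
    ("D1", "P1"): 35.0,
    ("D1", "P5"): 40.0,
    ("D2", "P5"): 44.0,
    ("D3", "P1"): 30.0,
}


def dice(cells, condition):
    """Keep the cells whose coordinates and measure satisfy the condition."""
    return {coords: value for coords, value in cells.items()
            if condition(coords, value)}


def at_or_above_limit(coords, concentration):
    """concentration >= Pollution.Pollutant.loadLimit, instantiated per cell."""
    _district, pollutant = coords
    return concentration >= LOAD_LIMIT[pollutant]


c6 = dice(c5, at_or_above_limit)
district_count = len({district for district, _pollutant in c6})
```

In this toy instance two cells survive the dice, (D1, P1) and (D2, P5), so `district_count` is 2.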

Query 2 (MDX Version)

The query can be solved in three steps. First, a sub-cube is created for representing the underlying data filtered using the districts satisfying the condition. The use of the TYPED keyword in the Properties function is necessary in order to consider the type of the 'loadLimit' attribute; otherwise MDX would consider it a string, preventing the comparison with concentration. For counting elements, in our case the number of different districts, we must define a new measure using the WITH clause over the set of names of the districts. Finally, the sub-cube is dropped.

    CREATE SUBCUBE [AirQuality] AS
    SELECT
      Filter(
        ([Pollution].[H0].[Pollutant].members,
         [Geography].[Hierarchy].[District].members,
         [Time].[Hierarchy].[Year].[2011]),
        [Pollution].[H0].CurrentMember.Properties("loadLimit",TYPED)
          <= [Measures].[Concentration]) on Axis(0)
    FROM AirQuality
    WHERE ([Station].[All], [Road].[All])

    WITH
      SET [DistNames] As
        ([Geography].[Hierarchy].[District].members,
         [Measures].[Concentration])
      MEMBER [Measures].[Count Districts] As
        COUNT([DistNames])
    SELECT [Measures].[Count Districts] On Axis(0)
    FROM [AirQuality]

    DROP SUBCUBE [AirQuality]

Note, as in Query 1, the difference between the Cube Algebra expression and the MDX one. The latter requires a deep understanding of the MDX syntax, while the former allows the user to focus just on the semantics of the OLAP operators.

Query 3 Build a cube with the dimensions of the original cube, containing only concentrations corresponding to stations located in districts in the province of Limburg, and to polluting agents of 'organic' type such that the station had at least once registered a concentration higher than the limit for the corresponding pollutant.

This query is simply expressed as:

    c1 := DICE (AirQuality,
                concentration >= Pollution.Pollutant.loadLimit
                AND Geography.Province.name = 'Limburg'
                AND Pollution.Category.name = 'organic').

Query 3 (MDX Version)

    SELECT [Pollution].[CategoryName].[Organic] On Axis(0),
      [Time].[Hierarchy].[Day] On Axis(1),
      [Station].[Station].[Station] On Axis(2),
      [Road].[Road].[Road] On Axis(3),
      [Geography].[Hierarchy].[District] on Axis(4)
    FROM AirQuality
    WHERE Filter(
      ([Pollution].[H0].[Pollutant].members,
       [Geography].[Province].[Limburg]),
      [Pollution].[H0].CurrentMember.Properties("loadLimit",TYPED)
        <= [Measures].[Concentration])

Even this simple query requires implementation knowledge at the logical level, e.g., the decomposition of the multiple hierarchy into many single ones (in this case, only [H0] is needed).

Query 4 For stations, and pollutants belonging to the 'organic' category, give the maximum concentration by month.

    c1 := DICE (AirQuality, Pollution.Category.name = 'organic')
    c2 := SLICE (c1, Road, ROLL-UP{Max})
    c3 := SLICE (c2, Geography, ROLL-UP{Max})
    c4 := ROLL-UP (c3, Time->Month, {Max})

Query 4 (MDX Version)

    WITH MEMBER [Measures].[Maximal] as
      Max([Time].[Hierarchy].currentMember.children,
          [Measures].[Concentration])
    SELECT [Time].[Hierarchy].[Month].members On Axis(0),
      [Station].[Station].[station].members On Axis(1),
      [Pollution].[H0].Pollutant.members On Axis(2),
      [Measures].[Maximal] On Axis(3)
    FROM AirQuality
    WHERE ([Pollution].[categoryName].[Organic],
           [Road].[Road].[All],
           [Geography].[Hierarchy].[All])

Query 5 Stations located over the part of the E34 road within the Berchem district, with an average content of nitrates in the last quarter of 2011 above the load limit for that pollutant.

    c1 := DICE (AirQuality,
                Geography.District.name = 'Berchem'
                AND Time.Quarter.quarter = 'Q4-2011'
                AND Time.Year.year = 2011
                AND Pollution.Category.name = 'Nitrates'
                AND Road.Road.name = 'E34')
    c2 := SLICE (c1, Road)
    c3 := SLICE (c2, Geography)

Note that in the SLICE operations that generate cubes c2 and c3, we do not use the third argument, since the previous dicing selected unique values for roads and districts. Cube c3 has Station, Time, and Pollution dimensions.

    c4 := ROLL-UP (c3, Time->Quarter, Avg)
    c5 := DICE (c4, concentration >= Pollution.Pollutant.loadLimit)

Query 5 (MDX Version)

    SELECT
      Filter(
        (([Pollution].[H0].[Pollutant].members,
          [Station].[Station].[Station].members),
         [Time].[Quarter].[Q4-2011]),
        [Pollution].[H0].CurrentMember.Properties("loadLimit", TYPED)
          < [Measures].[Concentration]) on Axis(0)
    FROM AirQuality
    WHERE ([Road].[Road].[E34],
           [Geography].[District].[Berchem],
           [Pollution].[typeName].[Nitrates])

The following query illustrates the use of the map operator. In our running example, assuming that pollutant concentrations are expressed in mg/m³ (milligrams per cubic meter), we want to express the results in µg/m³ (micrograms per cubic meter).

Query 6 Average concentration by Station, Time, District, Road, and pollutant Category, expressed in µg/m³.

    c1 := MAP (AirQuality, concentration, Mult(1000))
    c2 := ROLL-UP (c1, Pollution->Category, Avg)

In the expression that generates c1, Mult(1000) is a function that, given a value, multiplies it by a constant (in this case, 1000).

Query 6 (MDX Version)

    WITH MEMBER [Measures].[NewValue] as
      [Measures].[Concentration]*1000
    SELECT
      [Pollution].[H0].[Category].members On Axis(0),
      [Station].[Station].[Station].members On Axis(1),

International Journal of Data Warehousing and Mining, X(X), X-X, XXX-XXX 2012 21

[Geography].[District].[District].members On Axis(2),
[Road].[Road].[Road].Members On Axis(3),
[Time].[Hierarchy].[Day].members On Axis(4),
[Measures].[NewValue] On Axis(5)
FROM AirQuality

The next query shows the use of the Drill-across operator. Let us assume we have a cube denoted Demography with the dimensions Time and Geography described above, and measure population. The instances of the dimensions satisfy the operator’s preconditions.

Query 7 Total population and average pollutant concentration by province and year.

c1 := SLICE (AirQuality, Road, ROLL-UP{Sum})
c2 := SLICE (c1, Pollution, ROLL-UP{Max})
c3 := SLICE (c2, Station, ROLL-UP{Max})

Now, the cube c3 contains just the dimensions Time and Geography.

c4 := DRILL-ACROSS (c3, Demography)
c5 := ROLL-UP (c4, Time->Year, concentration, Avg, population, Sum)
c6 := ROLL-UP (c5, Geography->Province, concentration, Avg, population, Sum)

The drill-across operator is not directly supported in MDX: since the FROM clause only supports one cube, the only way of adding a measure is to define it at design time.

CONCLUSIONS AND FURTHER WORK

In this article, we have identified the need for an appropriate conceptual model for data warehouses and OLAP systems. This need stems from the fact that logical models (for example, star, snowflake, and constellation schemas) have been deployed for these systems as conceptual models. But logical models represent a system-centric view of data warehouses and OLAP systems and are ultimately implementation concepts. We propose a user-centric conceptual model for data warehouses and OLAP systems, called the Cube Algebra. It takes the cube metaphor literally and provides the knowledge worker with high-level cube objects and related high-level concepts. A novel query language leverages high-level operations such as roll-up, slice, and drill-across. An important design criterion is that all aspects of the logical level and the physical level are hidden from the user.

We plan our future research in at least three directions. First, further data definition commands have to be added for updating cube schemas, dimension schemas, and measures. In addition, further data manipulation commands are needed for the insertion, deletion, and update of data in cubes. Second, transformation rules are needed that map the concepts of the Cube Algebra at the conceptual level to corresponding ROLAP, MOLAP, and HOLAP concepts at the logical level. Third, we are interested in adding other data categories such as spatial, spatiotemporal, image, and multimedia data into our Cube Algebra. Open questions include how the different data categories are integrated and stored, what kinds of aggregation operations exist on the different data categories, how these aggregation operations are defined, and how they are implemented.

ACKNOWLEDGEMENTS

Cristina and Ricardo Ciferri are thankful for the support of the following Brazilian research agencies: FAPESP, CNPq, CAPES, and FINEP. Leticia Gómez and
Alejandro Vaisman were partially supported by the LACCIR project “Monitoring Protected Areas using an OLAP-enabled Spatiotemporal GIS”. Markus Schneider was partially supported by the U.S. National Aeronautics and Space Administration (NASA) under the grant number NASA-AIST-08-0081 and by the U.S. National Science Foundation (NSF) under the grant number NSF-IIS-0812194. Alejandro Vaisman and Esteban Zimányi were partially supported by the project OSCB (Open Semantic cloud for Brussels) funded by Innoviris.

i We use this simplification in the formal model to avoid the need of referring to a dimension level as dimension.level, which would make the formal definition too verbose. However, in the algebra defined in the next section, we drop this restriction and qualify level names with dimension names.
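As an informal illustration of the operator semantics used in Queries 4, 6, and 7, the following Python sketch evaluates simplified analogues of those queries over a handful of invented facts. It is not part of the Cube Algebra and does not reflect any implementation by the authors: the data, the flat list-of-records representation, and the helper names dice, map_measure, roll_up, and drill_across are assumptions made only for this example. In particular, a slice with aggregation is folded into roll_up here, and drill_across is applied to cubes that were already aggregated to a common level, whereas Query 7 drills across first and rolls up afterwards.

```python
from collections import defaultdict
from statistics import mean

# Invented readings of an AirQuality-like cube; concentration is in mg/m^3.
# Each record is one cell: its coordinates plus the measure.
AIR_QUALITY = [
    {"station": "S1", "district": "D1", "pollutant": "P1",
     "category": "organic", "month": "2011-10", "concentration": 0.25},
    {"station": "S1", "district": "D1", "pollutant": "P1",
     "category": "organic", "month": "2011-10", "concentration": 0.75},
    {"station": "S1", "district": "D1", "pollutant": "P1",
     "category": "organic", "month": "2011-11", "concentration": 0.5},
    {"station": "S2", "district": "D2", "pollutant": "P2",
     "category": "inorganic", "month": "2011-10", "concentration": 1.0},
]

# Invented Demography-like cube that shares the district coordinate.
DEMOGRAPHY = [
    {"district": "D1", "population": 40000},
    {"district": "D2", "population": 25000},
]


def dice(cube, predicate):
    """Keep only the cells for which the Boolean condition holds."""
    return [cell for cell in cube if predicate(cell)]


def map_measure(cube, measure, fn):
    """Apply fn to one measure of every cell; coordinates are unchanged."""
    return [{**cell, measure: fn(cell[measure])} for cell in cube]


def roll_up(cube, keep, measure, agg):
    """Group cells by the coordinates in keep and aggregate the measure."""
    groups = defaultdict(list)
    for cell in cube:
        groups[tuple(cell[name] for name in keep)].append(cell[measure])
    return {key: agg(values) for key, values in groups.items()}


def drill_across(left, right):
    """Pair the measures of two aggregated cubes on their shared coordinates."""
    return {key: (left[key], right[key]) for key in left if key in right}


# Analogue of Query 4: maximum concentration of organic pollutants by month.
organic = dice(AIR_QUALITY, lambda cell: cell["category"] == "organic")
max_by_month = roll_up(organic, ("station", "pollutant", "month"),
                       "concentration", max)

# Analogue of Query 6: convert mg/m^3 to ug/m^3, then average by category.
in_micrograms = map_measure(AIR_QUALITY, "concentration", lambda v: v * 1000)
avg_by_category = roll_up(in_micrograms, ("station", "month", "category"),
                          "concentration", mean)

# Analogue of Query 7: average concentration next to total population.
avg_by_district = roll_up(AIR_QUALITY, ("district",), "concentration", mean)
pop_by_district = roll_up(DEMOGRAPHY, ("district",), "population", sum)
combined = drill_across(avg_by_district, pop_by_district)

print(max_by_month)
print(avg_by_category)
print(combined)
```

Representing aggregated cubes as dictionaries keyed by coordinate tuples reduces drill-across to a key intersection. A fuller treatment would also verify that the dimension instances of the two cubes match, which corresponds to the operator's preconditions mentioned for Query 7.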