Implementing Multidimensional Data Warehouses into NoSQL

Max Chevalier¹, Mohammed El Malki¹,², Arlind Kopliku¹, Olivier Teste¹ and Ronan Tournier¹
¹University of Toulouse, IRIT UMR 5505, Toulouse, France
²Capgemini, Toulouse, France

Keywords: NoSQL, OLAP, Aggregate Lattice, Column-oriented, Document-oriented.

Abstract: Not only SQL (NoSQL) systems are becoming increasingly popular and have some interesting strengths such as scalability and flexibility. In this paper, we investigate the use of NoSQL systems for implementing OLAP (On-Line Analytical Processing) systems. More precisely, we are interested in instantiating OLAP systems (from the conceptual level to the logical level) and instantiating an aggregation lattice (optimization). We define a set of rules to map star schemas into two NoSQL models: column-oriented and document-oriented. The experimental part is carried out using the reference benchmark TPC-DS. Our experiments show that our rules can effectively instantiate such systems (star schema and lattice). We also analyze differences between the two NoSQL systems considered. In our experiments, HBase (column-oriented) happens to be faster than MongoDB (document-oriented) in terms of loading time.

1 INTRODUCTION

Nowadays, analysis data volumes are reaching critical sizes (Jacobs, 2009), challenging traditional data warehousing approaches. Currently implemented solutions are mainly based on relational databases (using R-OLAP approaches) that are no longer adapted to these data volumes (Stonebraker, 2012), (Cuzzocrea et al., 2013), (Dehdouh et al., 2014). With the rise of large Web platforms (e.g. Google, Facebook, Twitter, Amazon, etc.), solutions for “Big Data” management have been developed. These are based on decentralized approaches managing large data amounts and have contributed to developing “Not only SQL” (NoSQL) data management systems (Stonebraker, 2012). NoSQL solutions allow us to consider new approaches for data warehousing, especially from the multidimensional data management point of view. This is the scope of this paper.

In this paper, we investigate the use of NoSQL models for decision support systems. Until now (and to our knowledge), there are no mapping rules that transform a multidimensional conceptual model into NoSQL logical models. Existing research instantiates OLAP systems in NoSQL through R-OLAP systems, i.e., using an intermediate relational logical model. In this paper, we define a set of rules to translate automatically and directly a conceptual multidimensional model into NoSQL logical models. We consider two NoSQL logical models: one column-oriented and one document-oriented. For each model, we define mapping rules translating from the conceptual level to the logical one. In Figure 1, we position our approach based on abstraction levels of information systems. The conceptual level consists in describing the data in a generic way regardless of information technologies, whereas the logical level consists in using a specific technique for implementing the conceptual level.

[Figure 1: Translations of a conceptual multidimensional model into logical models. The conceptual level (Multidimensional OLAP) is mapped to the logical level (Relational-OLAP and NoSQL-OLAP); the legend distinguishes new transformations from existing transformations.]

Our motivation is multiple. Implementing OLAP systems using NoSQL systems is a relatively new alternative. It is justified by advantages of these systems such as flexibility and scalability. The

Chevalier M., El Malki M., Kopliku A., Teste O. and Tournier R. Implementing Multidimensional Data Warehouses into NoSQL. DOI: 10.5220/0005379801720183. In Proceedings of the 17th International Conference on Enterprise Information Systems (ICEIS-2015), pages 172-183. ISBN: 978-989-758-096-3. Copyright © 2015 SCITEPRESS (Science and Technology Publications, Lda.)

increasing research in this direction calls for formalization, common models and empirical evaluation of different NoSQL systems. In this scope, this work contributes by investigating two logical models and their respective mapping rules. We also investigate data loading issues including pre-computing data aggregates.

Traditionally, decision support systems use data warehouses to centralize data in a uniform fashion (Kimball et al., 2013). Within data warehouses, interactive data analysis and exploration is performed using On-Line Analytical Processing (OLAP) (Colliat, 1996), (Chaudhuri et al., 1997). Data is often described using a conceptual multidimensional model, such as a star schema (Chaudhuri et al., 1997). We illustrate this multidimensional model with a case study about RSS (Really Simple Syndication) feeds of news bulletins from an information website. We study the Content of news bulletins (the subject of the analysis or fact) using three dimensions of those bulletins (analysis axes of the fact): Keyword (contained in the bulletin), Time (publication date) and Location (geographical region concerned by the news). The fact has two measures (analysis indicators):
• The number of news bulletins (NewsCount).
• The number of keyword occurrences (OccurrenceCount).

[Figure 2: Multidimensional conceptual schema of our example, news bulletin contents according to keywords, publication time and location concerned by the news.]

The conceptual multidimensional schema of our case study is described in Figure 2, using a graphical formalism based on (Golfarelli et al., 1998), (Ravat et al., 2008).

One of the most successful implementations of OLAP systems uses relational databases (R-OLAP implementations). In these implementations, the conceptual schema is transformed into a logical schema (i.e. a relational schema, called in this case a denormalized R-OLAP schema) using two transformation rules:
• Each dimension is a table that uses the same name. Table attributes are derived from attributes of the dimension (called parameters and weak attributes). The root parameter is the primary key.
• Each fact is a table that uses the same name, with attributes derived from 1) fact attributes (called measures) and 2) the root parameter of each associated dimension. Attributes derived from root parameters are foreign keys linking each dimension table to the fact table and form a compound primary key of the table.

Due to the huge amount of data that can be stored in OLAP systems, it is common to pre-compute some aggregated data to speed up common analysis queries. In this case, fact measures are aggregated using different combinations of either dimension attributes or root parameters only. This pre-computation is a lattice of pre-computed aggregates (Gray et al., 1997), or aggregate lattice for short. The lattice is a set of nodes, one per combination of dimensions. Each node (e.g. the node called “Time, Location”) is stored as a relation called an aggregate relation (e.g. the relation time-location). This relation is composed of attributes corresponding to the measures and the parameters or weak attributes from selected dimensions. Attributes corresponding to measures are used to store aggregated values computed with functions such as SUM, COUNT, MAX, MIN, etc. The aggregate lattice schema for our case study is shown in Figure 3.

[Figure 3: A pre-computed aggregate lattice in a R-OLAP system. The figure shows the aggregate lattice (ALL; Keyword; Time; Location; Keyword, Time; Keyword, Location; Time, Location; Keyword, Time, Location) above the logical multidimensional model: the dimension tables «Keyword», «Time» and «Location» and the fact table «News», linked by foreign keys.]

Considering NoSQL systems as candidates for implementing OLAP systems, we must consider the above issues. In this paper, in order to deal with these issues, we use two logical NoSQL models for the logical implementation; we define mapping rules (that allow us to translate a conceptual design into a logical one) and we study the lattice computation.

The rest of this paper is organized as follows: in section 2, we present possible approaches that allow getting a NoSQL implementation from a data warehouse conceptual model using a pivot logical model; in section 3 we define our conceptual multidimensional model, followed by a section for each of the two NoSQL models we consider along with their associated transformation rules, i.e. the column-oriented model in section 4 and the document-oriented model in section 5. Finally, section 6 details our experiments.

2 RELATED WORK

To our knowledge, there is no work for automatically and directly transforming data warehouses defined by a multidimensional conceptual model into a NoSQL model.

Several research works have been conducted to translate data warehousing concepts to a relational R-OLAP logical level (Morfonios et al., 2007). Multidimensional databases are mostly implemented using these relational technologies. Mapping rules are used to convert structures of the conceptual level (facts, dimensions and hierarchies) into a logical model based on relations. Moreover, many works have focused on implementing logical optimization methods based on pre-computed aggregates (also called materialized views) as in (Gray et al., 1997), (Morfonios et al., 2007). However, R-OLAP implementations have difficulty scaling up to large data volumes (i.e. “Big Data”). Research is currently under way for new solutions such as using NoSQL systems (Lee et al., 2012). Our approach aims at revisiting these processes for automatically implementing multidimensional conceptual models directly into NoSQL models.

Other studies investigate the process of transforming relational databases into a NoSQL logical model (bottom part of Figure 1). In (Li, 2010), the author has proposed an approach for transforming a relational database into a column-oriented NoSQL database using HBase (Han et al., 2012), a column-oriented NoSQL database. In (Vajk et al., 2013), an algorithm is introduced for mapping a relational schema to a NoSQL schema in MongoDB (Dede et al., 2013), a document-oriented NoSQL database. However, these approaches never consider the conceptual model of data warehouses. They are limited to the logical level, i.e. transforming a relational model into a column-oriented model. More specifically, the duality fact/dimension requires guaranteeing a number of constraints usually handled by the relational integrity constraints, and these constraints cannot be considered in these logical approaches.

This study highlights that there are currently no approaches for automatically and directly transforming a multidimensional conceptual model into a NoSQL logical model. It is possible to transform multidimensional conceptual models into a logical relational model, and then to transform this relational model into a logical NoSQL model. However, this transformation using the relational model as a pivot model has not been formalized, as both transformations were studied independently of each other. Also, this indirect approach can be tedious.

We can also cite several recent works that are aimed at developing data warehouses in NoSQL systems, whether column-oriented (Dehdouh et al., 2014) or key-value oriented (Zhao et al., 2014). However, the main goal of these papers is to propose benchmarks. These studies have not put the focus on the model transformation process. Likewise, they focus on only one NoSQL model, and limit themselves to an abstraction at the HBase logical level. Both


models (Dehdouh et al., 2014), (Zhao et al., 2014) require the relational model to be generated first before the abstraction step. By contrast, we consider the conceptual model as well as two orthogonal logical models that allow distributing multidimensional data either vertically using a column-oriented model or horizontally using a document-oriented model.

Finally, we take into account hierarchies in our transformation rules by providing transformation rules to manage the aggregate lattice.

3 CONCEPTUAL MULTI-DIMENSIONAL MODEL

To ensure robust translation rules we first define the multidimensional model used at the conceptual level.

A multidimensional schema, namely $E$, is defined by $(F^E, D^E, Star^E)$ where:
• $F^E = \{F_1, \ldots, F_n\}$ is a finite set of facts,
• $D^E = \{D_1, \ldots, D_m\}$ is a finite set of dimensions,
• $Star^E: F^E \rightarrow 2^{D^E}$ is a function that associates each fact $F_i$ of $F^E$ to a set of dimensions $D_i \in Star^E(F_i)$ along which it can be analyzed; note that $2^{D^E}$ is the power set of $D^E$.

A dimension, denoted $D_i \in D^E$ (abusively noted as $D$), is defined by $(N^D, A^D, H^D)$ where:
• $N^D$ is the name of the dimension,
• $A^D = \{a_1^D, \ldots, a_u^D\} \cup \{id^D, All^D\}$ is a set of dimension attributes,
• $H^D = \{H_1^D, \ldots, H_v^D\}$ is a set of hierarchies.

A hierarchy of the dimension $D$, denoted $H_i \in H^D$, is defined by $(N^{H_i}, Param^{H_i}, Weak^{H_i})$ where:
• $N^{H_i}$ is the name of the hierarchy,
• $Param^{H_i} = \langle id^D, p_1^{H_i}, \ldots, p_{v_i}^{H_i}, All^D \rangle$ is an ordered set of $v_i + 2$ attributes which are called parameters of the relevant graduation scale of the hierarchy, $\forall k \in [1..v_i],\ p_k^{H_i} \in A^D$,
• $Weak^{H_i}: Param^{H_i} \rightarrow 2^{A^D - Param^{H_i}}$ is a function associating with each parameter zero or more weak attributes.

A fact, $F \in F^E$, is defined by $(N^F, M^F)$ where:
• $N^F$ is the name of the fact,
• $M^F = \{f_1(m_1^F), \ldots, f_v(m_v^F)\}$ is a set of measures, each associated with an aggregation function $f_i$.

Example. Consider our case study where news bulletins are loaded into a multidimensional data warehouse consistent with the conceptual schema described in Figure 2.

The multidimensional schema $E^{News}$ is defined by $F^{News} = \{F_{Content}\}$, $D^{News} = \{D_{Time}, D_{Location}, D_{Keyword}\}$ and $Star^{News}(F_{Content}) = \{D_{Time}, D_{Location}, D_{Keyword}\}$.

The fact represents the data analysis of the news feeds and uses two measures: the number of news (NewsCount) and the number of occurrences (OccurrenceCount); both for the set of news corresponding to a given term (or keyword), a specific date and a given location. This fact, $F_{Content}$, is defined by (Content, {SUM(NewsCount), SUM(OccurrenceCount)}) and is analyzed according to three dimensions, each consisting of several hierarchical levels (detail levels):
• The geographical location (Location) concerned by the news (with levels City, Country, Continent and Zone). A complementary piece of information about the country is its Population (modeled as additional information; it is a weak attribute).
• The publication date (Time) of the bulletin (with levels Day, Month and Year); note that the month number is associated to its Name (also a weak attribute).
• The Keyword used in the News (with the levels Term and Category of the term).

For instance, the dimension $D_{Location}$ is defined by (Location, {City, Country, Continent, Zone, $ALL^{Location}$}, {HCont, HZn}) with City = $id^{Location}$ and:
• HCont = (HCont, {City, Country, Continent, $ALL^{Location}$}, (Country, {Population})); note that $Weak^{HCont}(Country) = \{Population\}$,
• HZn = (HZn, {City, Country, Zone, $ALL^{Location}$}, (Country, {Population})).

4 CONVERSION INTO A NoSQL COLUMN-ORIENTED MODEL

The column-oriented model considers each record as a key associated with a value decomposed in several columns. Data is a set of lines in a table composed of columns (grouped in families) that may be different from one row to the other.

4.1 NoSQL Column-oriented Model

In relational databases, the data structure is determined in advance with a limited number of


typed columns (a few thousand), each similar for all records (also called “tuples”). Column-oriented NoSQL models provide a flexible schema (untyped columns) where the number of columns may vary between each record (or “row”).

A column-oriented database (represented in Figure 4) is a set of tables that are defined row by row (but whose physical storage is organized by groups of columns: column families; hence a “vertical partitioning” of the data). In short, in these systems, each table is a logical mapping of rows and their column families. A column family can contain a very large number of columns. For each row, a column exists if it contains a value.

[Figure 4: UML class diagram representing the concepts of a column-oriented NoSQL database (tables, rows, column families and columns).]

A table $T = \{R_1, \ldots, R_n\}$ is a set of rows $R_i$. A row $R_i = (Key_i, (CF_i^1, \ldots, CF_i^m))$ is composed of a row key $Key_i$ and a set of column families $CF_i^j$.

A column family $CF_i^j = \{(C_i^{j1}, \{v_i^{j1}\}), \ldots, (C_i^{jp}, \{v_i^{jp}\})\}$ consists of a set of columns, each associated with an atomic value. Every value can be “historised” thanks to a timestamp. This principle, useful for version management (Wrembel, 2009), will not be used in this paper due to limited space, although it may be important.

The flexibility of a column-oriented NoSQL database allows managing the absence of some columns between the different table rows. However, in the context of multidimensional data storage, data is usually highly structured (Malinowski et al., 2006). Thus, this implies that the structure of a column family (i.e. the set of columns defined by the column family) will be the same for all the table rows. The initial structure is provided by the data integration process called ETL, Extract, Transform, and Load (Simitsis et al., 2005).

Example: Let us have a table $T^{News}$ representing aggregated data related to news bulletins (see Figure 5), with $T^{News} = \{R_1, \ldots, R_x, \ldots, R_n\}$. We detail the $R_x$ row that corresponds to the number of news bulletins, and the number of occurrences where the keyword “Iraq” appears in those news bulletins, published at the date of 09/22/2014 and concerning the location “Toulouse” (south of France).

$R_x = (x, ($
$\quad CF_x^{Time} = \{(C_x^{Day}, \{V_x^{Day}\}), (C_x^{Month}, \{V_x^{Month}\}), (C_x^{Name}, \{V_x^{Name}\}), (C_x^{Year}, \{V_x^{Year}\})\},$
$\quad CF_x^{Location} = \{(C_x^{City}, \{V_x^{City}\}), (C_x^{Country}, \{V_x^{Country}\}), (C_x^{Population}, \{V_x^{Population}\}), (C_x^{Continent}, \{V_x^{Continent}\}), (C_x^{Zone}, \{V_x^{Zone}\})\},$
$\quad CF_x^{Keyword} = \{(C_x^{Term}, \{V_x^{Term}\}), (C_x^{Category}, \{V_x^{Category}\})\},$
$\quad CF_x^{Content} = \{(C_x^{NewsCount}, \{V_x^{NewsCount}\}), (C_x^{OccurrenceCount}, \{V_x^{OccurrenceCount}\})\}$
$))$

The values of the five columns of $CF_x^{Location}$ ($C_x^{City}$, $C_x^{Country}$, $C_x^{Population}$, $C_x^{Continent}$ and $C_x^{Zone}$) are ($V_x^{City}$, $V_x^{Country}$, $V_x^{Population}$, $V_x^{Continent}$ and $V_x^{Zone}$); e.g. $V_x^{City}$ = Toulouse, $V_x^{Country}$ = France, $V_x^{Population}$ = 65991000, $V_x^{Continent}$ = Europe, $V_x^{Zone}$ = Europe-Western.

More simply we note: $CF_x^{Location}$ = {(City, {Toulouse}), (Country, {France}), (Population, {65991000}), (Continent, {Europe}), (Zone, {Europe-Western})}.

[Figure 5: Example of a row (key = x) in the $T^{News}$ table. The row shown has four column families: Time (Day 09/24/14, Month 09/14, Name September, Year 2014), Location (City Toulouse, Country France, Population 65991000, Continent Europe, Zone Europe-Western), KeyWord (Term Iraq, Category Middle-East) and Content (NewsCount 2, OccurrenceCount 10).]

4.2 Column-oriented Model Mapping Rules

The elements (facts, dimensions, etc.) of the conceptual multidimensional model have to be transformed into different elements of the column-oriented NoSQL model (see Figure 6).
• Each conceptual star schema (one $F_i$ and its associated dimensions $Star^E(F_i)$) is transformed into a table $T$.
• The fact $F_i$ is transformed into a column family $CF^M$ of $T$ in which each $m_i$ is a column $C_i \in CF^M$.
• Each dimension $D_i \in Star^E(F_i)$ is transformed into a column family $CF^{D_i}$ where each dimension attribute $A_i \in A^D$ (parameters and weak


attributes) is transformed into a column $C_i$ of the column family $CF^{D_i}$ ($C_i \in CF^{D_i}$), except the parameter $All^{D_i}$.

Remarks. Each fact instance and its associated instances of dimensions are transformed into a row $R_x$ of $T$. The fact instance is thus composed of the column family $CF^M$ (the measures and their values) and the column families of the dimensions $CF^{D_i} \in CF^{D^E}$ (the attributes, i.e. parameters and weak attributes, of each dimension and their values).

As in a denormalized R-OLAP star schema (Kimball et al., 2013), the hierarchical organization of the attributes of each dimension is not represented in the NoSQL system. Nevertheless, hierarchies are used to build the aggregate lattice. Note that the hierarchies may also be used by the ETL processes which build the instances respecting the constraints induced by these conceptual structures (Malinowski et al., 2006); however, we do not consider ETL processes in this paper.

[Figure 6: Implementing a multidimensional conceptual model into the column-oriented NoSQL logical model. The multidimensional conceptual schema (fact CONTENT with measures NewsCount and OccurrenceCount; dimensions Time, Location and KeyWord with their hierarchies) is mapped to the table $T^{News}$ with the identifier Id and the column families Time, Location, KeyWord and Content. Values for each column are not displayed.]

Example. Let $E^{News}$ be the multidimensional conceptual schema implemented using a table named $T^{News}$ (see Figure 6). The fact ($F_{Content}$) and its dimensions ($D_{Time}$, $D_{Location}$, $D_{Keyword}$) are implemented into four column families $CF^{Time}$, $CF^{Location}$, $CF^{Keyword}$, $CF^{Content}$. Each column family contains a set of columns, corresponding either to dimension attributes or to measures of the fact. For instance, the column family $CF^{Location}$ is composed of the columns {$C^{City}$, $C^{Country}$, $C^{Population}$, $C^{Continent}$, $C^{Zone}$}.

Unlike R-OLAP implementations, where each fact is translated into a central table associated with dimension tables, our rules translate the schema into a single table that includes the fact and its associated dimensions together. When performing queries, this approach has the advantage of avoiding joins between fact and dimension tables. As a consequence, our approach increases information redundancy as dimension data is duplicated for each fact instance. This redundancy generates an increased volume of the overall data while providing a reduced query time. In a NoSQL context, problems linked to this volume increase may be reduced by an adapted data distribution strategy. Moreover, our choice for accepting this important redundancy is motivated by the data warehousing context, where data updates consist essentially in inserting new data; additional costs incurred by data changes are thus limited in our context.

4.3 Lattice Mapping Rules

We will use the following notations to define our lattice mapping rules. A pre-computed aggregate lattice or aggregate lattice $L$ is a set of nodes $A^L$ (pre-computed aggregates or aggregates) linked by edges $E^L$ (possible paths to calculate the aggregates). An aggregate node $A \in A^L$ is composed of a set of parameters $p_i$ (one per dimension) and a set of aggregated measures $f_i(m_i)$: $A = \langle p_1, \ldots, p_k, f_1(m_1), \ldots, f_v(m_v) \rangle$, $k \le m$ ($m$ being the number of dimensions, $v$ being the number of measures of the fact).

The lattice can be implemented in a column-oriented NoSQL database using the following rules:
• Each aggregate node $A \in A^L$ is stored in a dedicated table.
• For each dimension $D_i$ associated to this node, a column family $CF^{D_i}$ is created; each dimension attribute $a_i$ of this dimension is stored in a column $C$ of $CF^{D_i}$.
• The set of aggregated measures is also stored in a column family $CF^F$ where each aggregated measure is stored as a column $C$ (see Figure 7).

Example. We consider the lattice News (see Figure 7). The lattice News is stored in tables. The node (Keyword_Time) is stored in a table $T^{Keyword\_Time}$ composed of the column families $CF^{Keyword}$, $CF^{Time}$ and $CF^{Fact}$. The attribute Year is stored in a column $C^{Year}$, itself in $CF^{Time}$. The attribute Term is stored in a column $C^{Term}$, itself in $CF^{Keyword}$. The two measures are stored as two columns in the column family $CF^{Fact}$.

Many studies have been conducted about how to select the pre-computed aggregates that should be computed. In our proposition we favor computing all aggregates (Kimball et al., 2013). This choice may be sensitive due to the increase in the data volume.


However, in a NoSQL architecture we consider that storage space should not be a major issue.

[Figure 7: Implementing the pre-computed aggregation lattice into a column-oriented NoSQL logical model. Each lattice node (ALL; Keyword; Time; Location; Keyword, Time; Keyword, Location; Time, Location; Keyword, Time, Location) is stored as a table. The node Keyword is shown with the column families Keyword (Term, Category) and Fact (NewsCount, OccurrenceCount); the node Keyword, Time is shown with the column families Keyword, Time (Day, Month, Name, Year) and Fact.]

[Figure 8: UML class diagram representing the concepts of a document-oriented NoSQL database.]
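The column-oriented rules of sections 4.2 and 4.3 can be illustrated with a minimal sketch in plain JavaScript. It does not use the HBase API: a row is modelled as an in-memory object whose `families` property plays the role of the column families. The names `schema`, `toColumnRow` and `aggregateNode` are hypothetical helpers introduced here for illustration; the first instance reuses the values shown in Figure 5, while the second instance is made up.

```javascript
// Hypothetical schema description: each dimension lists its attributes
// (parameters and weak attributes, without the All parameter).
const schema = {
  dimensions: {
    Time: ["Day", "Month", "Name", "Year"],
    Location: ["City", "Country", "Population", "Continent", "Zone"],
    Keyword: ["Term", "Category"],
  },
  fact: { name: "Content", measures: ["NewsCount", "OccurrenceCount"] },
};

// Section 4.2 sketch: one row per fact instance, one column family per
// dimension, plus one column family holding the measures.
function toColumnRow(key, instance) {
  const families = {};
  for (const [dim, attrs] of Object.entries(schema.dimensions)) {
    families[dim] = {};
    for (const attr of attrs) {
      // A column exists for a row only if it contains a value.
      if (instance[attr] !== undefined) families[dim][attr] = instance[attr];
    }
  }
  families[schema.fact.name] = {};
  for (const m of schema.fact.measures) {
    families[schema.fact.name][m] = instance[m];
  }
  return { key, families };
}

// Section 4.3 sketch: one table per aggregate node. groupBy selects one
// attribute per kept dimension; every measure is aggregated with SUM.
function aggregateNode(rows, groupBy) {
  const table = new Map();
  for (const row of rows) {
    const families = {};
    for (const [dim, attr] of Object.entries(groupBy)) {
      families[dim] = { [attr]: row.families[dim][attr] };
    }
    const nodeKey = JSON.stringify(families);
    if (!table.has(nodeKey)) {
      const fact = {};
      for (const m of schema.fact.measures) fact[m] = 0;
      table.set(nodeKey, { key: nodeKey, families: { ...families, Fact: fact } });
    }
    const fact = table.get(nodeKey).families.Fact;
    for (const m of schema.fact.measures) {
      fact[m] += row.families[schema.fact.name][m];
    }
  }
  return [...table.values()];
}

// Row x with the values of Figure 5, and a second, made-up instance.
const rowX = toColumnRow("x", {
  Day: "09/24/14", Month: "09/14", Name: "September", Year: 2014,
  City: "Toulouse", Country: "France", Population: 65991000,
  Continent: "Europe", Zone: "Europe-Western",
  Term: "Iraq", Category: "Middle-East",
  NewsCount: 2, OccurrenceCount: 10,
});
const rowY = toColumnRow("y", {
  Day: "09/25/14", Month: "09/14", Name: "September", Year: 2014,
  City: "Paris", Country: "France", Population: 65991000,
  Continent: "Europe", Zone: "Europe-Western",
  Term: "Iraq", Category: "Middle-East",
  NewsCount: 3, OccurrenceCount: 4,
});

// Table of the node (Keyword, Time), here at the Term and Year levels.
const keywordTimeTable = aggregateNode([rowX, rowY], { Keyword: "Term", Time: "Year" });
console.log(JSON.stringify(keywordTimeTable, null, 2));
```

In this sketch the dimension values are copied into every row, which mirrors the redundancy accepted by the mapping rules: no join is needed to build an aggregate table, at the price of a larger data volume.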

5 CONVERSION INTO A NoSQL DOCUMENT-ORIENTED MODEL

The document-oriented model considers each record as a document, which means a set of records containing “attribute/value” pairs; these values are either atomic or complex (embedded in sub-records). Each sub-record can be assimilated as a document, i.e. a subdocument.

5.1 NoSQL Document-oriented Model

In the document-oriented model, each key is associated with a value structured as a document. These documents are grouped into collections. A document is a hierarchy of elements which may be either atomic values or documents. In the NoSQL approach, the schema of documents is not established in advance (hence the “schema less” concept).

Formally, a NoSQL document-oriented database can be defined as a collection $C$ composed of a set of documents $D_i$, $C = \{D_1, \ldots, D_n\}$.

Each document $D_i$ is defined by a set of pairs $D_i = \{(Att_i^1, V_i^1), \ldots, (Att_i^m, V_i^m)\}$, $j \in [1, m]$, where $Att_i^j$ is an attribute (which is similar to a key) and $V_i^j$ is a value that can be of two forms:
• The value is atomic.
• The value is itself composed of a nested document that is defined as a new set of pairs (attribute, value).

We distinguish simple attributes, whose values are atomic, from compound attributes, whose values are documents called nested documents.

[Figure 9: Graphic representation of a collection $C^{News}$.]

Example. Let $C$ be a collection, $C = \{D_1, \ldots, D_x, \ldots, D_n\}$, in which we detail the document $D_x$ (see Figure 9). Suppose that $D_x$ provides the number of news and the number of occurrences for the keyword “Iraq” in the news having a publication date equal to 09/22/2014 and that are related to Toulouse.

Within the collection $C^{News} = \{D_1, \ldots, D_x, \ldots, D_n\}$, the document $D_x$ could be defined as follows:

$D_x = \{(Att_x^{Id}, V_x^{Id}), (Att_x^{Time}, V_x^{Time}), (Att_x^{Location}, V_x^{Location}), (Att_x^{Keyword}, V_x^{Keyword}), (Att_x^{Content}, V_x^{Content})\}$

where $Att_x^{Id}$ is a simple attribute while the other 4 ($Att_x^{Time}$, $Att_x^{Location}$, $Att_x^{Keyword}$ and $Att_x^{Content}$) are compound attributes. Thus, $V_x^{Id}$ is an atomic value (e.g. “X”) corresponding to the key (that has to be unique). The other 4 values ($V_x^{Time}$, $V_x^{Location}$, $V_x^{Keyword}$ and $V_x^{Content}$) are nested documents:

$V_x^{Time} = \{(Att_x^{Day}, V_x^{Day}), (Att_x^{Month}, V_x^{Month}), (Att_x^{Year}, V_x^{Year})\}$,
$V_x^{Location} = \{(Att_x^{City}, V_x^{City}), (Att_x^{Country}, V_x^{Country}), (Att_x^{Population}, V_x^{Population}), (Att_x^{Continent}, V_x^{Continent}), (Att_x^{Zone}, V_x^{Zone})\}$,
$V_x^{Keyword} = \{(Att_x^{Term}, V_x^{Term}), (Att_x^{Category}, V_x^{Category})\}$,
$V_x^{Content} = \{(Att_x^{NewsCount}, V_x^{NewsCount}), (Att_x^{OccurrenceCount}, V_x^{OccurrenceCount})\}$.

In this example, the values in the nested documents are all atomic values. For example, values associated to the attributes $Att_x^{City}$, $Att_x^{Country}$, $Att_x^{Population}$, $Att_x^{Continent}$ and $Att_x^{Zone}$ are:
$V_x^{City}$ = “Toulouse”,
$V_x^{Country}$ = “France”,
$V_x^{Population}$ = “65991000”,
$V_x^{Continent}$ = “Europe”,
$V_x^{Zone}$ = “Europe Western”.
See Figure 9 for the complete example.

5.2 Document-oriented Model Mapping Rules

Under the NoSQL document-oriented model, the data is not organized in rows and columns as in the previous model, but it is organized in nested documents (see Figure 10).
• Each conceptual star schema (one $F_i$ and its dimensions $Star^E(F_i)$) is translated into a collection $C$.
• The fact $F_i$ is translated into a compound attribute $Att^{CF}$. Each measure $m_i$ is translated into a simple attribute $Att^{SM}$.
• Each dimension $D_i \in Star^E(F_i)$ is converted into a compound attribute $Att^{CD}$ (i.e. a nested document). Each attribute $A_i \in A^D$ (parameters and weak attributes) of the dimension $D_i$ is converted into a simple attribute $Att^A$ contained in $Att^{CD}$.

[Figure 10: Implementing the conceptual model into a document-oriented NoSQL logical model. Each document of the collection has the form (Id: v_Id, Time: {Day, Month, Name, Year}, Location: {City, Country, Population, Continent, Zone}, KeyWord: {Term, Category}, Content: {NewsCount, OccurrenceCount}). In the legend, {} denotes a collection, () a document, "Att: {}" a nested document and "Att:" an attribute; attribute values are represented only to ease understanding.]

Remarks. A fact instance is converted into a document $d$. Measure values are combined within a nested document of $d$. Each dimension is also translated as a nested document of $d$ (each combining parameter and weak attribute values).

The hierarchical organization of the dimension is not preserved. But as in the previous approach, we use hierarchies to build the aggregate lattice.

Example. The document noted $D_x$ is composed of 4 nested documents: $Att^{Content}$, which groups the measures, and $Att^{Location}$, $Att^{Keyword}$ and $Att^{Time}$, which correspond to the instances of each associated dimension.

As in the previous model, the transformation process produces a large collection of redundant data. This choice has the advantage of promoting data querying where each fact instance is directly combined with the corresponding dimension instances. The generated volume can be compensated by an architecture that would massively distribute this data.

5.3 Lattice Mapping Rules

As in the previous approach, we store all the pre-computed aggregates in a separate unique collection.

Formally, we use the same definition for the aggregate lattice as above (see section 4.3). However, when using a document-oriented NoSQL model, the implementation rules are:
• Each node $A$ is stored in a collection.
• For each dimension $D_i$ concerned by this node, a compound attribute (nested document) $Att^{CD}_{D_i}$ is created; each attribute $a_i$ of this dimension is stored in a simple attribute $Att^{a_i}$ of $Att^{CD}_{D_i}$.
• The set of aggregated measures is stored in a compound attribute $Att^{CD}_F$ where each aggregated measure is stored as a simple attribute $Att^{m_i}$.

Example. Let us consider the lattice $L^{News}$ (see Figure 11). This lattice is stored in a collection $C^{News}$. The node is stored as a document $d$. The dimensions Time and Location are stored in nested documents $d^{Time}$ and $d^{Location}$ of $d$. The month attribute is stored as a simple attribute in the nested document $d^{Time}$. The country attribute is stored as a simple attribute in the nested document $d^{Location}$. The two measures are also stored in a nested document denoted $d^{Fact}$.

As in the column-oriented model, we choose to store all possible aggregates of the lattice. With this choice there is a large potential decrease in terms of query response time.
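The document-oriented lattice rules can be sketched in plain JavaScript as follows. The sketch does not call MongoDB: documents are ordinary objects shaped like the template of Figure 10, and every lattice node is obtained as a subset of the dimensions. The helper names `latticeNodes` and `aggregateDocuments`, the `levels` map (one attribute chosen per dimension) and the sample documents are assumptions made for illustration, and SUM is the only aggregation function shown.

```javascript
// Hypothetical helper: enumerates every subset of the dimension list,
// i.e. the nodes of the aggregate lattice, from ALL ([]) up to the
// combination of all dimensions.
function latticeNodes(dimensions) {
  const nodes = [[]];
  for (const dim of dimensions) {
    for (const node of nodes.slice()) nodes.push([...node, dim]);
  }
  return nodes;
}

// Builds the aggregate documents of one node: one nested document per kept
// dimension and one nested document ("Fact") holding the summed measures.
function aggregateDocuments(docs, node, levels, measures) {
  const byKey = new Map();
  for (const doc of docs) {
    const agg = {};
    for (const dim of node) {
      agg[dim] = { [levels[dim]]: doc[dim][levels[dim]] };
    }
    const key = JSON.stringify(agg);
    if (!byKey.has(key)) {
      agg.Fact = Object.fromEntries(measures.map((m) => [m, 0]));
      byKey.set(key, agg);
    }
    const fact = byKey.get(key).Fact;
    for (const m of measures) fact[m] += doc.Content[m];
  }
  return [...byKey.values()];
}

// Made-up fact documents following the nested structure of section 5.2.
const sampleDocs = [
  {
    Id: 1,
    Time: { Day: "05/11/14", Month: "11/14", Name: "November", Year: 2014 },
    Location: { City: "Toulouse", Country: "France" },
    Keyword: { Term: "Virus", Category: "Health" },
    Content: { NewsCount: 1, OccurrenceCount: 2 },
  },
  {
    Id: 2,
    Time: { Day: "06/11/14", Month: "11/14", Name: "November", Year: 2014 },
    Location: { City: "Paris", Country: "France" },
    Keyword: { Term: "Ill", Category: "Health" },
    Content: { NewsCount: 3, OccurrenceCount: 4 },
  },
];

const levels = { Time: "Month", Location: "Country", Keyword: "Term" };
const measures = ["NewsCount", "OccurrenceCount"];

// One collection of aggregate documents per lattice node.
const lattice = new Map();
for (const node of latticeNodes(["Keyword", "Time", "Location"])) {
  const name = node.length ? node.join("_") : "ALL";
  lattice.set(name, aggregateDocuments(sampleDocs, node, levels, measures));
}
console.log(JSON.stringify(lattice.get("Time_Location"), null, 2));
```

With three dimensions the enumeration yields eight nodes, which matches the node list of the lattice figures. A real implementation would more likely derive coarser nodes from finer ones along the lattice edges instead of rescanning the detailed documents for every node, as this sketch does.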

179 ICEIS2015-17thInternationalConferenceonEnterpriseInformationSystems

Figure 11: Implementing the pre-computed aggregation lattice into a document-oriented NoSQL logical model.

6 EXPERIMENTS

6.1 Experimental Setup

Our experimental goal is to illustrate the instantiation of our logical models (both star schema and lattice levels). Thus, we show here experiments with respect to data loading and lattice generation. We use HBase and MongoDB respectively for testing the column-oriented and the document-oriented models. Data is generated with a reference benchmark (TPC-DS, 2014). We generate datasets of sizes 1GB, 10GB and 100GB. After loading data, we compute the aggregate lattice with the map-reduce/aggregation functions offered by both HBase and MongoDB. The details of the experimental setup are as follows:

Dataset. The TPC-DS benchmark is used for generating our test data. This is a reference benchmark for testing decision support (including OLAP) systems. It involves a total of 7 fact tables and 17 shared dimension tables. Data is meant to support a retailer decision system. We use the store_sales fact and its 10 associated dimension tables (the most used ones). Some of its dimension tables are higher hierarchical levels of other dimensions. We consider aggregations on the following dimensions: date (day, month, year), customer address (city, country), store address (city, country) and item (class, category).

Data Generation. Data is generated by the DSGen generator (1.3.0). Data is generated in different CSV-like files (Comma Separated Values), one per table (whether dimension or fact). We process this data to keep only the store_sales measures and the associated dimension values (by joining across tables and projecting the data). Data is then formatted as CSV files and JSON files, used for loading data into HBase and MongoDB respectively. We obtain successively 1GB, 10GB and 100GB of random data. The JSON files turn out to be approximately 3 times larger for the same data. The entire process is shown in Figure 12.

Figure 12: Broad schema of the experimental setup.

Data Loading. Data is loaded into HBase and MongoDB using native instructions. These are expected to be faster when loading from files. The current version of MongoDB would not load data with our logical model from CSV files, thus we had to use JSON files.

Lattice Computation. To compute the aggregate lattice, we use map-reduce functions from both HBase and MongoDB. Four levels of aggregates are computed on top of the detailed facts. These aggregates are: all combinations of 3 dimensions, all combinations of 2 dimensions, all combinations of 1 dimension, and all data.

MongoDB and HBase allow aggregating data using map-reduce functions, which are efficient for distributed data systems. At each aggregation level, we apply the aggregation functions max, min, sum and count on all dimensions. For MongoDB, instructions look like:

    db.ss1.mapReduce(
      // map: emit one (key, value) pair per fact, keyed by item, store and customer
      function () {
        emit(
          {
            item: {i_class: this.item.i_class, i_category: this.item.i_category},
            store: {s_city: this.store.s_city, s_country: this.store.s_country},
            customer: {ca_city: this.customer.ca_city, ca_country: this.customer.ca_country}
          },
          this.ss_wholesale_cost
        );
      },
      // reduce: aggregate the values emitted for one key
      function (key, values) {
        return {
          sum: Array.sum(values),
          max: Math.max.apply(Math, values),
          min: Math.min.apply(Math, values),
          count: values.length
        };
      },
      // write the result to the collection 'ss1_isc'
      {out: 'ss1_isc'}
    );
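The four aggregation levels and the four aggregation functions can also be illustrated without a cluster. The sketch below is plain JavaScript, not MongoDB or HBase code; `combinations`, `mapFact`, `reducePartials`, `aggregate` and the sample facts are hypothetical names and data introduced for the example, and each dimension is simplified to a single value. In this variant the map step emits a partial aggregate of the same shape as the reduce output, so the reduce step can safely be applied to its own results. MongoDB requires this of reduce functions, because they may be invoked more than once for the same key.

```javascript
// Dimensions used in the experiments; each lattice node is a subset of them.
const dimensions = ["date", "item", "store", "customer"];

// All subsets of `items` of size k (order of `items` is preserved).
function combinations(items, k) {
  if (k === 0) return [[]];
  if (items.length < k) return [];
  const [head, ...tail] = items;
  const withHead = combinations(tail, k - 1).map(c => [head, ...c]);
  return withHead.concat(combinations(tail, k));
}

// The four aggregate levels: 3 dimensions, 2 dimensions, 1 dimension, all data.
const lattice = [3, 2, 1, 0].map(k => combinations(dimensions, k));

// Map step: key = values of the node's dimensions, value = partial aggregate.
function mapFact(fact, node) {
  const key = JSON.stringify(node.map(dim => fact[dim]));
  const v = fact.ss_wholesale_cost;
  return { key, value: { sum: v, max: v, min: v, count: 1 } };
}

// Reduce step: merges partial aggregates; input and output have the same shape,
// so the result can be reduced again with further partial aggregates.
function reducePartials(values) {
  return values.reduce((a, b) => ({
    sum: a.sum + b.sum,
    max: Math.max(a.max, b.max),
    min: Math.min(a.min, b.min),
    count: a.count + b.count,
  }));
}

// Sequential stand-in for a map-reduce job over one lattice node.
function aggregate(facts, node) {
  const groups = new Map();
  for (const f of facts) {
    const { key, value } = mapFact(f, node);
    groups.set(key, groups.has(key) ? reducePartials([groups.get(key), value]) : value);
  }
  return groups;
}

// Invented sample facts, one value per dimension for brevity.
const facts = [
  { date: "2014-01", item: "shoes", store: "Toulouse", customer: "Paris", ss_wholesale_cost: 10 },
  { date: "2014-01", item: "shoes", store: "Toulouse", customer: "Lyon", ss_wholesale_cost: 30 },
  { date: "2014-02", item: "books", store: "Toulouse", customer: "Paris", ss_wholesale_cost: 5 },
];

const byItemStore = aggregate(facts, ["item", "store"]); // a 2-dimension node
const allData = aggregate(facts, []);                    // the single "all data" node
console.log(lattice.map(level => level.length), [...byItemStore], [...allData]);
```

With four dimensions, the levels contain 4, 6, 4 and 1 nodes respectively, which is why computing the whole lattice is done here for illustration rather than as a recommendation.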


Figure 13: The pre-computed aggregate lattice with processing time (seconds) and size (records/documents), using HBase (H) and MongoDB (M). The dimensions are abbreviated (D: Date, I: item, S: store, C: customer).

In the MongoDB instruction above, data is aggregated using the item, store and customer dimensions. For HBase, we use Hive on top to ease the writing of aggregation queries. Queries with Hive are SQL-like. The query below illustrates the aggregation on the item, store and customer dimensions.

    -- aggregate the measure over the item, store and customer dimensions
    INSERT OVERWRITE TABLE out
    SELECT
      sum(ss_wholesale_cost), max(ss_wholesale_cost),
      min(ss_wholesale_cost), count(ss_wholesale_cost),
      i_class, i_category, s_city, s_country, ca_city, ca_country
    FROM store_sales
    GROUP BY
      i_class, i_category, s_city, s_country, ca_city, ca_country;

Hardware. The experiments are done on a cluster composed of 3 PCs (4 core-i5, 8GB RAM, 2TB disks, 1Gb/s network), each being a worker node; one node also acts as dispatcher.

Data Management Systems. We use two NoSQL data management systems: HBase (v.0.98) and MongoDB (v.2.6). They are both successful key-value database management systems, respectively for column-oriented and document-oriented data storage. Hadoop (v.2.4) is used as the underlying distributed storage system.

6.2 Experimental Results

Loading Data: The data generation process produced files of 1GB, 10GB and 100GB respectively. The equivalent files in JSON were about 3.4 times larger due to the extra format. In the table below, we show loading times for each dataset and for both HBase and MongoDB. Data loading was successful in both cases. It confirms that HBase is faster when it comes to loading. However, we did not tune either system for loading performance. We should also consider that the raw data (JSON files) takes more space in memory in the case of MongoDB for the same number of records. Thus we can expect a higher network transfer penalty.

Table 1: Dataset loading times (in minutes) for each NoSQL database management system.

    Dataset size    1GB      10GB     100GB
    MongoDB         9.045    109      132
    HBase           2.26     2.078    10.3

Lattice Computation: We report here the experimental observations on the lattice computation. The results are shown in the schema of Figure 13. Dimensions are abbreviated (D: date, C: customer, I: item, S: store). The top level corresponds to IDCS (detailed data). On the second level, we keep combinations of only three dimensions, and so on. For every aggregate node, we show the number of records/documents it contains and the computation time in seconds for HBase (H) and MongoDB (M) respectively.

In HBase, the total time to compute all aggregates was 1700 seconds, with respectively 1207s, 488s, 4s and 0.004s per level (from more detailed to less). In MongoDB, the total time to compute all aggregates was 3210 seconds, with respectively 2611s, 594s, 5s and 0.002s per level (from more detailed to less). We can easily observe


that computing the lower levels is much faster, as the amount of data to be processed is smaller. The size of the aggregates (in terms of records) decreases too when we move down the hierarchy: 8.7 million (level 2), 3.4 million (level 3), 55 thousand (level 4) and 1 record in the bottom level.

7 DISCUSSION

In this section, we provide a discussion on our results. We want to answer three questions:

- Are the proposed models convincing?
- How can we explain performance differences across MongoDB and HBase?
- Is it recommended to use column-oriented and document-oriented approaches for OLAP systems and when?

The choice of our logical NoSQL models can be criticized for being simple. However, we argue that it is better to start from the simplest and most natural models before studying more complex ones. The two models we studied are simple and intuitive, which makes them easy to implement. Processing the TPC-DS benchmark data did not require much effort. Data from the TPC-DS benchmark was successfully mapped and inserted into MongoDB and HBase, which supports the simplicity and effectiveness of the approach.

HBase outperforms MongoDB with respect to data loading. This is not surprising. Other studies highlight the good performance of HBase on loading data. We should also consider that the data fed to MongoDB was larger due to additional markup, as MongoDB does not support CSV-like files when the collection schema contains nested fields. Current benchmarks produce data in a columnar format (CSV-like). This gives an advantage to relational DBMS. The column-oriented model we propose is closer to the relational model than the document-oriented model is. This remains an advantage for HBase compared to MongoDB. It would therefore be useful to have benchmarks that produce data adequate for the different NoSQL models.

At this stage, it is difficult to draw detailed recommendations on the use of column-oriented or document-oriented approaches for OLAP systems. We recommend HBase if data loading is the priority. HBase also uses less memory space and it is known for effective data compression (due to column redundancy). Computing aggregates takes a reasonable time for both, and many aggregates take little memory space.

A major difference between the different NoSQL systems concerns interrogation. For queries that demand multiple attributes of a relation, the column-oriented approaches might take longer because data will not be available in one place. For some queries, the nested fields supported by document-oriented approaches can be an advantage, while for others they would be a disadvantage. Studying differences with respect to interrogation is left for future work.

8 CONCLUSION

This paper investigates the instantiation of OLAP systems through NoSQL approaches, namely column-oriented and document-oriented approaches. We have proposed two NoSQL logical models for this purpose, one for each approach. The models are accompanied by rules that can transform a multidimensional conceptual model into a NoSQL logical model.

Experiments are carried out with data from the TPC-DS benchmark. We generate datasets of size 1GB, 10GB and 100GB respectively. The experimental setup shows how we can instantiate OLAP systems with column-oriented and document-oriented databases, respectively with HBase and MongoDB. This process includes [...], data loading and aggregate computation. The entire process allows us to compare the different approaches with each other.

We show how to compute an aggregate lattice. Results show that both NoSQL systems we considered perform well, with HBase being more efficient at some steps. Using map-reduce functions we compute the entire lattice. This is done for illustrative purposes and we acknowledge that it is not always necessary to compute the entire lattice. This kind of further optimization is not the main goal of the paper.

The experiments confirm that data loading and aggregate computation are faster with HBase. However, document-based approaches have other advantages that remain to be thoroughly explored.

The use of NoSQL technologies for implementing OLAP systems is a promising research direction. At this stage, we focus on the modeling and loading stages. This research direction seems fertile and many questions remain unanswered.

Future Work: We list here some of the work we consider interesting for the future. A major issue concerns the study of NoSQL systems with


respect to OLAP usage, i.e. interrogation for analysis purposes. We need to study the different types of queries and identify the queries that benefit most from NoSQL models.

Finally, all approaches (relational models, NoSQL models) should be compared with each other in the context of OLAP systems. We can also consider different NoSQL logical implementations. We have proposed simple models and we want to compare them with more complex and optimized ones.

In addition, we believe that it is timely to build benchmarks for OLAP systems that generalize to NoSQL systems. These benchmarks should account for data loading and database usage. Most existing benchmarks favor relational models.

ACKNOWLEDGEMENTS

These studies are supported by the ANRT funding under CIFRE-Capgemini partnership.

REFERENCES

Chaudhuri, S., Dayal, U., 1997. An overview of data warehousing and olap technology. SIGMOD Record, 26, ACM, pp. 65–74.

Colliat, G., 1996. Olap, relational, and multidimensional database systems. SIGMOD Record, 25(3), ACM, pp. 64–69.

Cuzzocrea, A., Bellatreche, L., Song, I.-Y., 2013. Data warehousing and olap over big data: Current challenges and future research directions. 16th Int. Workshop on Data Warehousing and OLAP (DOLAP), ACM, pp. 67–70.

Dede, E., Govindaraju, M., Gunter, D., Canon, R. S., Ramakrishnan, L., 2013. Performance evaluation of a mongodb and hadoop platform for scientific data analysis. 4th Workshop on Scientific Cloud Computing, ACM, pp. 13–20.

Dehdouh, K., Boussaid, O., Bentayeb, F., 2014. Columnar nosql star schema benchmark. Model and Data Engineering, LNCS 8748, Springer, pp. 281–288.

Golfarelli, M., Maio, D., Rizzi, S., 1998. The dimensional fact model: A conceptual model for data warehouses. Int. Journal of Cooperative Information Systems, 7, pp. 215–247.

Gray, J., Bosworth, A., Layman, A., Pirahesh, H., 1996. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total. Int. Conf. on Data Engineering (ICDE), IEEE Computer Society, pp. 152–159.

Han, D., Stroulia, E., 2012. A three-dimensional data model in hbase for large time-series dataset analysis. 6th Int. Workshop on the Maintenance and Evolution of Service-Oriented and Cloud-Based Systems (MESOCA), IEEE, pp. 47–56.

Jacobs, A., 2009. The pathologies of big data. Communications of the ACM, 52(8), pp. 36–44.

Kimball, R., Ross, M., 2013. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling. John Wiley & Sons, Inc., 3rd edition.

Lee, S., Kim, J., Moon, Y.-S., Lee, W., 2012. Efficient distributed parallel top-down computation of R-OLAP data cube using mapreduce. Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK), LNCS 7448, Springer, pp. 168–179.

Li, C., 2010. Transforming relational database into hbase: A case study. Int. Conf. on Software Engineering and Service Sciences (ICSESS), IEEE, pp. 683–687.

Malinowski, E., Zimányi, E., 2006. Hierarchies in a multidimensional model: From conceptual modeling to logical representation. Data and Knowledge Engineering, 59(2), Elsevier, pp. 348–377.

Morfonios, K., Konakas, S., Ioannidis, Y., Kotsis, N., 2007. R-OLAP implementations of the data cube. ACM Computing Survey, 39(4), p. 12.

Simitsis, A., Vassiliadis, P., Sellis, T., 2005. Optimizing etl processes in data warehouses. Int. Conf. on Data Engineering (ICDE), IEEE, pp. 564–575.

Ravat, F., Teste, O., Tournier, R., Zurfluh, G., 2008. Algebraic and Graphic Languages for OLAP Manipulations. Int. Journal of Data Warehousing and Mining (ijDWM), 4(1), IGI Publishing, pp. 17–46.

Stonebraker, M., 2012. New opportunities for new sql. Communications of the ACM, 55(11), pp. 10–11.

Vajk, T., Feher, P., Fekete, K., Charaf, H., 2013. Denormalizing data into schema-free databases. 4th Int. Conf. on Cognitive Infocommunications (CogInfoCom), IEEE, pp. 747–752.

Vassiliadis, P., Vagena, Z., Skiadopoulos, S., Karayannidis, N., 2000. ARKTOS: A Tool For Data Cleaning and Transformation in Data Warehouse Environments. IEEE Data Engineering Bulletin, 23(4), pp. 42–47.

TPC-DS, 2014. Transaction Processing Performance Council, Decision Support benchmark, version 1.3.0, http://www.tpc.org/tpcds/.

Wrembel, R., 2009. A survey of managing the evolution of data warehouses. Int. Journal of Data Warehousing and Mining (ijDWM), 5(2), IGI Publishing, pp. 24–56.

Zhao, H., Ye, X., 2014. A practice of tpc-ds multidimensional implementation on nosql database systems. 5th TPC Tech. Conf. Performance Characterization and Benchmarking, LNCS 8391, Springer, pp. 93–108.
