Implementing Multidimensional Data Warehouses into NoSQL Max Chevalier, Mohammed El Malki, Arlind Kopliku, Olivier Teste, Ronan Tournier

To cite this version:

Max Chevalier, Mohammed El Malki, Arlind Kopliku, Olivier Teste, Ronan Tournier. Implementing Multidimensional Data Warehouses into NoSQL. 17th International Conference on Enterprise Information Systems (ICEIS 2015), held in conjunction with ENASE 2015 and GISTAM 2015, Apr 2015, Barcelona, Spain. hal-01387759


Max Chevalier 1, Mohammed El Malki 1,2, Arlind Kopliku 1, Olivier Teste 1 and Ronan Tournier 1
1 University of Toulouse, IRIT UMR 5505 (www.irit.fr), Toulouse, France
2 Capgemini (www.capgemini.com), Toulouse, France
{Max.Chevalier, Mohammed.ElMalki, Arlind.Kopliku, Olivier.Teste, Ronan.Tournier}@irit.fr

Keywords: NoSQL, OLAP, Aggregate Lattice, Column-Oriented, Document-Oriented.

Abstract: Not only SQL (NoSQL) systems are becoming increasingly popular and have some interesting strengths such as scalability and flexibility. In this paper, we investigate the use of NoSQL systems for implementing OLAP (On-Line Analytical Processing) systems. More precisely, we are interested in instantiating OLAP systems (from the conceptual level to the logical level) and in instantiating an aggregation lattice (optimization). We define a set of rules to map star schemas into two NoSQL models: column-oriented and document-oriented. The experimental part is carried out using the reference benchmark TPC. Our experiments show that our rules can effectively instantiate such systems (schemas and lattices). We also analyze differences between the two NoSQL systems considered. In our experiments, HBase (column-oriented) happens to be faster than MongoDB (document-oriented) in terms of loading time.

1 INTRODUCTION

Nowadays, analysis data volumes are reaching critical sizes (Jacobs, 2009), challenging traditional data warehousing approaches. Current implemented solutions are mainly based on relational databases (using R-OLAP approaches) that are no longer adapted to these data volumes (Stonebraker, 2012), (Cuzzocrea et al., 2013), (Dehdouh et al., 2014). With the rise of large Web platforms (e.g. Google, Facebook, Twitter, Amazon, etc.), solutions for "Big Data" management have been developed. These are based on decentralized approaches managing large data amounts and have contributed to developing "Not only SQL" (NoSQL) data management systems (Stonebraker, 2012). NoSQL solutions allow us to consider new approaches for data warehousing, especially from the multidimensional data management point of view. This is the scope of this paper.

In this paper, we investigate the use of NoSQL models for decision support systems. Until now (and to our knowledge), there are no mapping rules that transform a multidimensional conceptual model into NoSQL logical models. Existing research instantiates OLAP systems in NoSQL through R-OLAP systems, i.e. using an intermediate relational logical model. In this paper, we define a set of rules to translate automatically and directly a conceptual multidimensional model into NoSQL logical models. We consider two NoSQL logical models: one column-oriented and one document-oriented. For each model, we define mapping rules translating from the conceptual level to the logical one. In Figure 1, we position our approach based on the abstraction levels of information systems. The conceptual level consists in describing the data in a generic way, regardless of information technologies, whereas the logical level consists in using a specific technique for implementing the conceptual level.

Figure 1: Translations of a conceptual multidimensional model into logical models. (The conceptual level holds the multidimensional OLAP model; the logical level holds Relational-OLAP and NoSQL-OLAP. Our transformation is a new, direct one, whereas existing transformations pass through the relational model.)

Our motivation is multiple. Implementing OLAP systems using NoSQL systems is a relatively new

alternative. It is justified by the advantages of these systems, such as flexibility and scalability. The increasing research in this direction demands formalization, common models and empirical evaluation of different NoSQL systems. In this scope, this work contributes to investigating two logical models and their respective mapping rules. We also investigate data loading issues, including pre-computing data aggregates.

Traditionally, decision support systems use data warehouses to centralize data in a uniform fashion (Kimball et al., 2013). Within data warehouses, interactive data analysis and exploration is performed using On-Line Analytical Processing (OLAP) (Colliat, 1996), (Chaudhuri et al., 1997). Data is often described using a conceptual multidimensional model, such as a star schema (Chaudhuri et al., 1997). We illustrate this multidimensional model with a case study about RSS (Really Simple Syndication) feeds of news bulletins from an information website. We study the Content of news bulletins (the subject of the analysis, or fact) using three dimensions of those bulletins (analysis axes of the fact): Keyword (contained in the bulletin), Time (publication date) and Location (geographical region concerned by the news). The fact has two measures (analysis indicators):
§ The number of news bulletins (NewsCount).
§ The number of keyword occurrences (OccurrenceCount).

Figure 2: Multidimensional conceptual schema of our example, news bulletin contents according to keywords, publication time and location concerned by the news.

The conceptual multidimensional schema of our case study is described in Figure 2, using a graphical formalism based on (Golfarelli et al., 1998), (Ravat et al., 2008).

One of the most successful implementations of OLAP systems uses relational databases (R-OLAP implementations). In these implementations, the conceptual schema is transformed into a logical schema (i.e. a relational schema, called in this case a denormalized R-OLAP schema) using two transformation rules:
§ Each dimension is a table that uses the same name. Table attributes are derived from the attributes of the dimension (called parameters and weak attributes). The root parameter is the primary key.
§ Each fact is a table that uses the same name, with attributes derived from 1) fact attributes (called measures) and 2) the root parameter of each associated dimension. Attributes derived from root parameters are foreign keys linking each dimension table to the fact table, and they form a compound primary key of the table.

Due to the huge amount of data that can be stored in OLAP systems, it is common to pre-compute some aggregated data to speed up common analysis queries. In this case, fact measures are aggregated using different combinations of either dimension attributes or root parameters only. This pre-computation produces a lattice of pre-computed aggregates (Gray et al., 1997), or aggregate lattice for short. The lattice is a set of nodes, one per dimension combination. Each node (e.g. the node called "Time, Location") is stored as a relation called an aggregate relation (e.g. the relation time-location). This relation is composed of attributes corresponding to the measures and to the parameters or weak attributes from the selected dimensions. Attributes corresponding to measures are used to store aggregated values computed with functions such as SUM, COUNT, MAX, MIN, etc. The aggregate lattice schema for our case study is shown in Figure 3.

Considering NoSQL systems as candidates for implementing OLAP systems, we must consider the above issues. In this paper, in order to deal with these issues, we use two logical NoSQL models for the logical implementation; we define mapping rules (that allow us to translate a conceptual design into a logical one) and we study the lattice computation.

The rest of this paper is organized as follows: in section 2, we present possible approaches that allow getting a NoSQL implementation from a data warehouse conceptual model using a pivot logical model; in section 3, we define our conceptual multidimensional model, followed by a section for each of the two NoSQL models we consider along with their associated transformation rules, i.e. the column-oriented model in section 4 and the document-oriented model in section 5. Finally, section 6 details our experiments.
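To fix ideas, the two R-OLAP transformation rules recalled above can be sketched on the case study. The following is our own illustration, not part of the paper's formalism: the variable names and the table contents (one row per table, taken from the example data of Figure 3) are hypothetical, and Python dictionaries stand in for relational tables.

```python
# Rule 1: each dimension becomes a table; the root parameter is the
# primary key (City for Location, Day for Time, Term for Keyword).
dim_time = {
    "12/10/14": {"Day": "12/10/14", "Month": "10/14",
                 "Name": "october", "Year": "2014"},
}
dim_location = {
    "Paris": {"City": "Paris", "Country": "FR", "Population": 65991000,
              "Continent": "Europe", "Zone": "Europe-West"},
}
dim_keyword = {
    "ill": {"Term": "ill", "Category": "health"},
}

# Rule 2: the fact becomes a table whose compound primary key is made of
# the root parameters (foreign keys) of the associated dimensions.
fact_news = {
    ("ill", "12/10/14", "Paris"): {"NewsCount": 10, "OccurrenceCount": 10},
}

# A query joins the fact to a dimension through the foreign key.
keyword_fk, time_fk, city_fk = next(iter(fact_news))
print(dim_location[city_fk]["Country"])  # FR
```

This denormalized layout is the pivot that existing approaches generate before reaching NoSQL; the rules of sections 4 and 5 bypass it.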

Figure 3: A pre-computed aggregate lattice in a R-OLAP system. (The figure shows the aggregate lattice with the nodes ALL; Keyword; Time; Location; Keyword,Time; Keyword,Location; Time,Location; Keyword,Time,Location, linked by foreign-key paths between join attributes used for aggregate calculation. It also shows the dimension tables Keyword (Term, Category), Time (Day, Month, Name, Year) and Location (City, Country, Population, Continent, Zone), and the fact table News (Id, Keyword, Time, Location, NewsCount, OccurrenceCount) with example rows such as (1, ill, 12/10/14, Paris, 10, 10), (2, Virus, 15/10/14, Paris, 15, 25) and (3, vaccine, 09/06/14, London, 20, 30).)
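The lattice of Figure 3 can be sketched in a few lines. The code below is our illustration (function and variable names are ours); the detailed facts are the example rows of Figure 3, and each lattice node is obtained by grouping the facts on a subset of dimensions and SUM-aggregating the two measures.

```python
from itertools import combinations

# Detailed facts: (keyword, date, city) -> measures, as in Figure 3.
facts = {
    ("ill",     "12/10/14", "Paris"):  {"NewsCount": 10, "OccurrenceCount": 10},
    ("Virus",   "15/10/14", "Paris"):  {"NewsCount": 15, "OccurrenceCount": 25},
    ("vaccine", "09/06/14", "London"): {"NewsCount": 20, "OccurrenceCount": 30},
}
DIMS = ("Keyword", "Time", "Location")

def aggregate(facts, kept):
    """One lattice node: SUM the measures grouped on the kept dimensions."""
    idx = [DIMS.index(d) for d in kept]
    node = {}
    for key, measures in facts.items():
        gkey = tuple(key[i] for i in idx)
        acc = node.setdefault(gkey, {"NewsCount": 0, "OccurrenceCount": 0})
        for name, value in measures.items():
            acc[name] += value
    return node

# One node per dimension combination: 2^3 = 8 nodes here.
lattice = {kept: aggregate(facts, kept)
           for r in range(len(DIMS) + 1)
           for kept in combinations(DIMS, r)}

print(lattice[("Location",)][("Paris",)]["NewsCount"])  # 25
print(lattice[()][()]["OccurrenceCount"])               # node ALL: 65
```

The empty combination corresponds to the node ALL; the full combination is the detailed fact itself.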

2 RELATED WORK

To our knowledge, there is no work for automatically and directly transforming data warehouses defined by a multidimensional conceptual model into a NoSQL model.

Several research works have been conducted to translate data warehousing concepts to a relational logical level (Morfonios et al., 2007). Multidimensional databases are mostly implemented using these relational technologies. Mapping rules are used to convert structures of the conceptual level (facts, dimensions and hierarchies) into a logical model based on relations. Moreover, many works have focused on implementing logical optimization methods based on pre-computed aggregates (also called materialized views), as in (Gray et al., 1997), (Morfonios et al., 2007). However, R-OLAP implementations suffer from scaling-up to large data volumes (i.e. "Big Data"). Research is currently under way for new solutions such as using NoSQL systems (Lee et al., 2012). Our approach aims at revisiting these processes for automatically implementing multidimensional conceptual models directly into NoSQL models.

Other studies investigate the process of transforming relational databases into a NoSQL logical model (bottom part of Figure 1). In (Li, 2010), the author has proposed an approach for transforming a relational database into a column-oriented NoSQL database using HBase (Han et al., 2012), a column-oriented NoSQL database. In (Vajk et al., 2013), an algorithm is introduced for mapping a relational schema to a NoSQL schema in MongoDB (Dede et al., 2013), a document-oriented NoSQL database. However, these approaches never consider the conceptual model of data warehouses. They are limited to the logical level, i.e. transforming a relational model into a column-oriented model. More specifically, the fact/dimension duality requires guaranteeing a number of constraints usually handled by relational integrity constraints, and these constraints cannot be considered in these logical approaches.

This study highlights that there is currently no approach for automatically and directly transforming a multidimensional conceptual model into a NoSQL logical model. It is possible to transform a multidimensional conceptual model into a logical relational model, and then to transform this relational model into a logical NoSQL model. However, this transformation using the relational model as a pivot has not been formalized, as both transformations were studied independently of each other. Also, this indirect approach can be tedious.

We can also cite several recent works that aim at developing data warehouses in NoSQL systems, whether column-oriented (Dehdouh et al., 2014) or key-value oriented (Zhao et al., 2014). However, the main goal of these papers is to propose benchmarks. These studies have not put the focus on the model transformation process. Likewise, they focus on only one NoSQL model, and limit themselves to an abstraction at the HBase logical level. Both models (Dehdouh et al., 2014), (Zhao et al., 2014) require the relational model to be generated first, before the abstraction step. By contrast, we consider the conceptual model as well as two orthogonal logical models that allow distributing multidimensional data either vertically, using a column-oriented model, or horizontally, using a document-oriented model. Finally, we take hierarchies into account in our transformation rules by providing rules to manage the aggregate lattice.

3 CONCEPTUAL MULTIDIMENSIONAL MODEL

To ensure robust translation rules, we first define the multidimensional model used at the conceptual level.

A multidimensional schema, namely E, is defined by (F^E, D^E, Star^E) where:
§ F^E = {F1,…, Fn} is a finite set of facts,
§ D^E = {D1,…, Dm} is a finite set of dimensions,
§ Star^E: F^E → 2^(D^E) is a function that associates each fact Fi of F^E with a set of dimensions Di ∈ Star^E(Fi) along which it can be analyzed; note that 2^(D^E) is the power set of D^E.

A dimension, denoted Di ∈ D^E (abusively noted as D), is defined by (N^D, A^D, H^D) where:
§ N^D is the name of the dimension,
§ A^D = {a^D_1,…, a^D_u} ∪ {id^D, All^D} is a set of dimension attributes,
§ H^D = {H^D_1,…, H^D_v} is a set of hierarchies.

A hierarchy of the dimension D, denoted Hi ∈ H^D, is defined by (N^Hi, Param^Hi, Weak^Hi) where:
§ N^Hi is the name of the hierarchy,
§ Param^Hi = <id^D, p^Hi_1,…, p^Hi_vi, All^D> is an ordered set of vi+2 attributes, called parameters of the relevant graduation scale of the hierarchy, with ∀k ∈ [1..vi], p^Hi_k ∈ A^D,
§ Weak^Hi: Param^Hi → 2^(A^D − Param^Hi) is a function associating zero or more weak attributes with each parameter.

A fact, F ∈ F^E, is defined by (N^F, M^F) where:
§ N^F is the name of the fact,
§ M^F = {f1(m1),…, fn(mn)} is a set of measures, each associated with an aggregation function fi.

Example. Consider our case study where news bulletins are loaded into a multidimensional data warehouse consistent with the conceptual schema described in Figure 2. The multidimensional schema E^News is defined by F^News = {F_Content}, D^News = {D_Time, D_Location, D_Keyword} and Star^News(F_Content) = {D_Time, D_Location, D_Keyword}.

The fact represents the data analysis of the news feeds and uses two measures: the number of news bulletins (NewsCount) and the number of occurrences (OccurrenceCount), both for the set of news corresponding to a given term (or keyword), a specific date and a given location. This fact, F_Content, is defined by (Content, {SUM(NewsCount), SUM(OccurrenceCount)}) and is analyzed according to three dimensions, each consisting of several hierarchical levels (detail levels):
§ The geographical location (Location) concerned by the news (with the levels City, Country, Continent and Zone). Complementary information about the country is its Population (modeled as additional information; it is a weak attribute).
§ The publication date (Time) of the bulletin (with the levels Day, Month and Year); note that the month number is associated with its Name (also a weak attribute).
§ The Keyword used in the news (with the levels Term and Category of the term).

For instance, the dimension D_Location is defined by (Location, {City, Country, Continent, Zone, All^Location}, {H^Location_Cont, H^Location_Zn}) with id^Location = City and:
§ H_Cont = (HCont, <City, Country, Continent, All^Location>, (Country, {Population})); note that Weak^HCont(Country) = {Population},
§ H_Zn = (HZn, <City, Country, Zone, All^Location>, (Country, {Population})).

4 CONVERSION INTO A NOSQL COLUMN-ORIENTED MODEL

The column-oriented model considers each record as a key associated with a value decomposed into several columns. Data is a set of rows in a table composed of columns (grouped into families) that may differ from one row to the other.
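Before detailing the column-oriented mapping, it may help to see the conceptual schema as a concrete data structure. The sketch below is our own hypothetical encoding (field names such as "params" and "weak" are ours, not part of the formalism) of the dimension D_Location and of the schema E_News defined in section 3.

```python
# Our encoding of D_Location = (Location, A^D, {HCont, HZn}).
D_Location = {
    "name": "Location",
    "attributes": ["City", "Country", "Continent", "Zone", "Population", "ALL"],
    "hierarchies": {
        "HCont": {"params": ["City", "Country", "Continent", "ALL"],
                  "weak": {"Country": ["Population"]}},
        "HZn":   {"params": ["City", "Country", "Zone", "ALL"],
                  "weak": {"Country": ["Population"]}},
    },
}

# Our encoding of E_News = (F^E, D^E, Star^E).
E_News = {
    "facts": {"Content": {"NewsCount": "SUM", "OccurrenceCount": "SUM"}},
    "dimensions": {"Location": D_Location},  # Time and Keyword omitted here
    "star": {"Content": ["Time", "Location", "Keyword"]},
}

# Every hierarchy is ordered from the root parameter id^D (City) up to All.
for h in D_Location["hierarchies"].values():
    print(h["params"][0], "->", h["params"][-1])  # City -> ALL (twice)
```

Such a structure is what the mapping rules of sections 4 and 5 consume: the hierarchies drive the lattice construction, while the flat attribute sets drive the column families and nested documents.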

4.1 NoSQL Column-Oriented Model

In relational databases, the data structure is determined in advance with a limited number of typed columns (a few thousand), each similar for all records (also called "tuples"). Column-oriented NoSQL models provide a flexible schema (untyped columns) where the number of columns may vary between each record (or "row").

A column-oriented database (represented in Figure 4) is a set of tables that are defined row by row (but whose physical storage is organized by groups of columns: column families; hence a "vertical partitioning" of the data). In short, in these systems, each table is a logical mapping of rows and their column families. A column family can contain a very large number of columns. For each row, a column exists if it contains a value.

Figure 4: UML class diagram representing the concepts of a column-oriented NoSQL database (tables, rows, column families and columns).

A table T = {R1,…, Rn} is a set of rows Ri. A row Ri = (Key_i, (CF^1_i,…, CF^m_i)) is composed of a row key Key_i and a set of column families CF^j_i. A column family CF^j_i = {(C^j1_i, {v^j1_i}),…, (C^jp_i, {v^jp_i})} consists of a set of columns, each associated with an atomic value. Every value can be "historised" thanks to a timestamp. This principle, useful for version management (Wrembel, 2009), will not be used in this paper due to limited space, although it may be important.

The flexibility of a column-oriented NoSQL database allows managing the absence of some columns between the different table rows. However, in the context of multidimensional data storage, data is usually highly structured (Malinowski et al., 2006). This implies that the structure of a column family (i.e. the set of columns defined by the column family) will be the same for all the table rows. The initial structure is provided by the data integration process called ETL (Extract, Transform, and Load) (Simitsis et al., 2005).

Example: Let us have a table T^News representing aggregated data related to news bulletins (see Figure 5), with T^News = {R1,…, Rx,…, Rn}. We detail the row Rx that corresponds to the number of news bulletins, and the number of occurrences, where the keyword "Iraq" appears in those news bulletins, published at the date of 09/22/2014 and concerning the location "Toulouse" (south of France).

Rx = (x, (CF^Time_x = {(C^Day_x, {V^Day_x}), (C^Month_x, {V^Month_x}), (C^Name_x, {V^Name_x}), (C^Year_x, {V^Year_x})}, CF^Location_x = {(C^City_x, {V^City_x}), (C^Country_x, {V^Country_x}), (C^Population_x, {V^Population_x}), (C^Continent_x, {V^Continent_x}), (C^Zone_x, {V^Zone_x})}, CF^Keyword_x = {(C^Term_x, {V^Term_x}), (C^Category_x, {V^Category_x})}, CF^Content_x = {(C^NewsCount_x, {V^NewsCount_x}), (C^OccurrenceCount_x, {V^OccurrenceCount_x})}))

The values of the five columns of CF^Location_x, (C^City_x, C^Country_x, C^Population_x, C^Continent_x and C^Zone_x), are (V^City_x, V^Country_x, V^Population_x, V^Continent_x and V^Zone_x); e.g. V^City_x = Toulouse, V^Country_x = France, V^Population_x = 65991000, V^Continent_x = Europe, V^Zone_x = Europe-Western.

More simply, we note: CF^Location_x = {(City, {Toulouse}), (Country, {France}), (Population, {65991000}), (Continent, {Europe}), (Zone, {Europe-Western})}.

Figure 5: Example of a row (key = x) in the T^News table. (The row holds the column families Time = {Day: 09/24/14, Month: 09/14, Name: September, Year: 2014}, Location = {City: Toulouse, Country: France, Population: 65991000, Continent: Europe, Zone: Europe-Western}, KeyWord = {Term: Iraq, Category: Middle-East} and Content = {NewsCount: 2, OccurrenceCount: 10}.)

4.2 Column-Oriented Model Mapping Rules

The elements (facts, dimensions, etc.) of the conceptual multidimensional model have to be transformed into different elements of the column-oriented NoSQL model (see Figure 6).
§ Each conceptual star schema (one Fi and its associated dimensions Star^E(Fi)) is transformed into a table T.
§ The fact Fi is transformed into a column family CF^M of T in which each measure mi is a column C_i ∈ CF^M.

§ Each dimension Di ∈ Star^E(Fi) is transformed into a column family CF^Di where each dimension attribute Ai ∈ A^D (parameters and weak attributes) is transformed into a column C_i of the column family CF^Di (C_i ∈ CF^Di), except the parameter All^Di.

Remarks. Each fact instance and its associated instances of dimensions are transformed into a row Rx of T. The fact instance is thus composed of the column family CF^M (the measures and their values) and of the column families of the dimensions CF^Di ∈ CF^DE (the attributes, i.e. parameters and weak attributes, of each dimension and their values).

As in a denormalized R-OLAP star schema (Kimball et al., 2013), the hierarchical organization of the attributes of each dimension is not represented in the NoSQL system. Nevertheless, hierarchies are used to build the aggregate lattice. Note that the hierarchies may also be used by the ETL processes which build the instances respecting the constraints induced by these conceptual structures (Malinowski et al., 2006); however, we do not consider ETL processes in this paper.

Figure 6: Implementing a multidimensional conceptual model into the column-oriented NoSQL logical model. (The figure maps the conceptual schema of Figure 2 to the logical table T^News with the column families Time {Day, Month, Name, Year}, Location {City, Country, Population, Continent, Zone}, KeyWord {Term, Category} and Content {NewsCount, OccurrenceCount}; column values are not displayed.)

Example. Let E^News be the multidimensional conceptual schema implemented using a table named T^News (see Figure 6). The fact (F_Contents) and its dimensions (D_Time, D_Location, D_Keyword) are implemented into four column families CF^Time, CF^Location, CF^Keyword, CF^Contents. Each column family contains a set of columns, corresponding either to dimension attributes or to measures of the fact. For instance, the column family CF^Location is composed of the columns {C^City, C^Country, C^Population, C^Continent, C^Zone}.

Unlike R-OLAP implementations, where each fact is translated into a central table associated with dimension tables, our rules translate the schema into a single table that includes the fact and its associated dimensions together. When performing queries, this approach has the advantage of avoiding joins between fact and dimension tables. As a consequence, our approach increases information redundancy, as dimension data is duplicated for each fact instance. This redundancy generates an increased volume of the overall data while providing a reduced query time. In a NoSQL context, problems linked to this volume increase may be reduced by an adapted data distribution strategy. Moreover, our choice of accepting this important redundancy is motivated by the data warehousing context, where data updates consist essentially in inserting new data; additional costs incurred by data changes are thus limited in our context.

4.3 Lattice Mapping Rules

We will use the following notations to define our lattice mapping rules. A pre-computed aggregate lattice, or aggregate lattice L for short, is a set of nodes A^L (pre-computed aggregates, or aggregates) linked by edges E^L (possible paths to calculate the aggregates). An aggregate node A ∈ A^L is composed of a set of parameters (one per dimension) and a set of aggregated measures fi(mi): A = <p1,…, pk, f1(m1),…, fv(mv)>, k ≤ m (m being the number of dimensions, v being the number of measures of the fact).

The lattice can be implemented in a column-oriented NoSQL database using the following rules:
§ Each aggregate node A ∈ A^L is stored in a dedicated table.
§ For each dimension Di associated with this node, a column family CF^Di is created; each dimension attribute ai of this dimension is stored in a column C of CF^Di.
§ The set of aggregated measures is also stored in a column family CF^F, where each aggregated measure is stored as a column C (see Figure 7).

Example. We consider the lattice L^News (see Figure 7). The lattice L^News is stored in tables. The node <Keyword_Time> is stored in a table T^Keyword_Time composed of the column families CF^Keyword, CF^Time and CF^Fact. The attribute Year is stored in a column C^Year, itself in CF^Time. The attribute Term is stored in a column C^Term, itself in CF^Keyword. The two measures are stored as two columns in the column family CF^Fact.
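The mapping rules of section 4.2 can be sketched as a small function. The code below is our illustration, not the paper's implementation: the function name and the family label "CF_M" are ours, and nested dictionaries stand in for a column-oriented row (key, {column family -> {column -> value}}).

```python
def to_column_row(key, fact_measures, dim_instances):
    """Map one fact instance and its dimension instances to a row:
    one column family per dimension plus CF_M for the measures;
    the parameter ALL is never materialized as a column."""
    families = {name: {a: v for a, v in attrs.items() if a != "ALL"}
                for name, attrs in dim_instances.items()}
    families["CF_M"] = dict(fact_measures)
    return (key, families)

# The row Rx of Figure 5, rebuilt through the mapping (values from the figure).
row = to_column_row(
    "x",
    {"NewsCount": 2, "OccurrenceCount": 10},
    {"Time": {"Day": "09/24/14", "Month": "09/14",
              "Name": "September", "Year": "2014", "ALL": "all"},
     "Location": {"City": "Toulouse", "Country": "France",
                  "Population": "65991000", "Continent": "Europe",
                  "Zone": "Europe-Western", "ALL": "all"},
     "KeyWord": {"Term": "Iraq", "Category": "Middle-East", "ALL": "all"}},
)
key, families = row
print(sorted(families))           # ['CF_M', 'KeyWord', 'Location', 'Time']
print("ALL" in families["Time"])  # False
```

A lattice node table (section 4.3) has the same shape, with the grouped-on parameters in the dimension families and the SUM-aggregated measures in the fact family.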

Many studies have been conducted about how to select the pre-computed aggregates that should be computed. In our proposition we favor computing all aggregates (Kimball et al., 2013). This choice may be sensitive due to the increase in the data volume. However, in a NoSQL architecture, we consider that storage space should not be a major issue.

Figure 7: Implementing the pre-computed aggregation lattice into a column-oriented NoSQL logical model. (The figure shows the lattice ALL; Keyword; Time; Location; Keyword,Time; Keyword,Location; Time,Location; Keyword,Time,Location, and example node tables such as <Keyword> with the rows (vId1, KeyWord {Term: Virus, Category: Health}, Fact {NewsCount: 5, OccurrenceCount: 8}) and (vId2, KeyWord {Term: Ill, Category: Health}, Fact {NewsCount: 4, OccurrenceCount: 6}), and <Keyword,Time> with rows additionally holding Time {Day: 05/11/14, Month: 11/14, Name: November, Year: 2014}.)

5 CONVERSION INTO A NOSQL DOCUMENT-ORIENTED MODEL

The document-oriented model considers each record as a document, i.e. a set of "attribute/value" pairs; these values are either atomic or complex (embedded in sub-records). Each sub-record can be assimilated to a document, i.e. a subdocument.

5.1 NoSQL Document-Oriented Model

In the document-oriented model, each key is associated with a value structured as a document. These documents are grouped into collections. A document is a hierarchy of elements which may be either atomic values or documents. In the NoSQL approach, the schema of the documents is not established in advance (hence the "schema-less" concept).

Formally, a NoSQL document-oriented database can be defined as a collection C composed of a set of documents Di, C = {D1,…, Dn}. Each document Di is defined by a set of pairs Di = {(Att^1_i, V^1_i),…, (Att^m_i, V^m_i)}, j ∈ [1, m], where Att^j_i is an attribute (which is similar to a key) and V^j_i is a value that can be of two forms:
§ The value is atomic.
§ The value is itself a nested document that is defined as a new set of pairs (attribute, value).

We distinguish simple attributes, whose values are atomic, from compound attributes, whose values are documents called nested documents.

Figure 8: UML class diagram representing the concepts of a document-oriented NoSQL database (collections, documents, attributes and atomic or nested values).

Example. Let C be a collection, C = {D1,…, Dx,…, Dn}, in which we detail the document Dx (see Figure 9). Suppose that Dx provides the number of news bulletins and the number of occurrences for the keyword "Iraq" in the news having a publication date equal to 09/22/2014 and related to Toulouse. Within the collection C^News = {D1,…, Dx,…, Dn}, the document Dx could be defined as follows:

Dx = {(Att^Id_x, V^Id_x), (Att^Time_x, V^Time_x), (Att^Location_x, V^Location_x), (Att^Keyword_x, V^Keyword_x), (Att^Content_x, V^Content_x)}

where Att^Id_x is a simple attribute while the other four (Att^Time_x, Att^Location_x, Att^Keyword_x and Att^Content_x) are compound attributes. Thus, V^Id_x is an atomic value (e.g. "X") corresponding to the key (which has to be unique). The other four values (V^Time_x, V^Location_x, V^Keyword_x and V^Content_x) are nested documents:

V^Time_x = {(Att^Day_x, V^Day_x), (Att^Month_x, V^Month_x), (Att^Year_x, V^Year_x)},
V^Location_x = {(Att^City_x, V^City_x), (Att^Country_x, V^Country_x), (Att^Population_x, V^Population_x), (Att^Continent_x, V^Continent_x), (Att^Zone_x, V^Zone_x)},
V^Keyword_x = {(Att^Term_x, V^Term_x), (Att^Category_x, V^Category_x)},
V^Content_x = {(Att^NewsCount_x, V^NewsCount_x), (Att^OccurrenceCount_x, V^OccurrenceCount_x)}.

In this example, the values in the nested documents are all atomic. For example, the values associated with the attributes Att^City_x, Att^Country_x, Att^Population_x, Att^Continent_x and Att^Zone_x are:
V^City_x = "Toulouse",
V^Country_x = "France",
V^Population_x = "65991000",
V^Continent_x = "Europe",
V^Zone_x = "Europe Western".

See Figure 9 for the complete example.
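The document Dx translates directly into nested structures. The sketch below is our illustration of the shape such a document would take (for instance before insertion into a document store such as MongoDB); the literal values follow the example above, except NewsCount and OccurrenceCount, which are hypothetical.

```python
import json

# The document Dx of section 5.1 as nested dictionaries; compound
# attributes (Time, Location, KeyWord, Content) are nested documents.
Dx = {
    "Id": "X",
    "Time": {"Day": "09/22/2014", "Month": "09/14", "Year": "2014"},
    "Location": {"City": "Toulouse", "Country": "France",
                 "Population": "65991000", "Continent": "Europe",
                 "Zone": "Europe Western"},
    "KeyWord": {"Term": "Iraq", "Category": "Middle-East"},
    "Content": {"NewsCount": 2, "OccurrenceCount": 10},
}

# Simple vs compound attributes: Id is atomic, the other four are nested.
compound = [k for k, v in Dx.items() if isinstance(v, dict)]
print(compound)  # ['Time', 'Location', 'KeyWord', 'Content']

# Serialized form, as it would be stored in a JSON-based document store.
print(json.dumps(Dx["KeyWord"], sort_keys=True))
```

The nesting replaces the column families of section 4: one compound attribute plays the role of one column family.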

Figure 9: Graphic representation of a collection C^News. (The figure shows a document with the simple attribute Id and the nested documents Time {Day, Month, Name, Year}, Location {City, Country, Population, Continent, Zone}, KeyWord {Term, Category} and Content {NewsCount, OccurrenceCount}; attribute values are represented only to ease understanding.)

5.2 Document-Oriented Model Mapping Rules

Figure 10: Implementing the conceptual model into a document-oriented NoSQL logical model.

Under the NoSQL document-oriented model, the data is not organized in rows and columns as in the previous model, but in nested documents (see Figure 10).
§ Each conceptual star schema (one fact F_i and its dimensions Star(F_i)) is translated into a collection C.
§ The fact F_i is translated into a compound attribute Att^CF. Each measure m_i is translated into a simple attribute Att^SM.
§ Each dimension D_i ∈ Star(F_i) is converted into a compound attribute Att^CD (i.e. a nested document). Each attribute A_i ∈ A^D (parameters and weak attributes) of the dimension D_i is converted into a simple attribute Att^A contained in Att^CD.

Remarks. A fact instance is converted into a document d. Measure values are combined within a nested document of d. Each dimension is also translated as a nested document of d (combining parameter and weak attribute values). The hierarchical organization of the dimensions is not preserved, but, as in the previous approach, we use hierarchies to build the aggregate lattice.

Example. The document noted D_x is composed of 4 nested documents: Att^Content, which groups the measures, and Att^Location, Att^Keyword and Att^Time, which correspond to the instances of each associated dimension. As in the previous model, the transformation process produces a large collection of redundant data.

5.3 Lattice Mapping Rules

As in the previous approach, we store all the pre-computed aggregates in a separate unique collection. Formally, we use the same definition for the aggregate lattice as above (see section 4.3). However, when using a document-oriented NoSQL model, the implementation rules are:
§ Each node A^E is stored in a collection.
§ For each dimension D_i concerned by this node, a compound attribute (nested document) Att^CD_Di is created; each attribute a_i of this dimension is stored in a simple attribute Att^ai of Att^CD_Di.
§ The set of aggregated measures is stored in a compound attribute Att^CF where each aggregated measure is stored as a simple attribute Att^mi.

Example. Let us consider the lattice L_News (see Figure 11). This lattice is stored in a collection C_News. The node <month_country> is stored as a document d. The dimensions Time and Location are stored in nested documents d^time and d^location of d. The month attribute is stored as a simple attribute in the nested document d^time; the country attribute is stored as a simple attribute in the nested document d^location. The two measures are also stored in a nested document, denoted d^fact.
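The mapping rules above can be made concrete with a short sketch. The following Python fragment is illustrative only — the function and the sample values (e.g. the city names) are ours, not part of the original implementation. It builds a detailed document d according to the rules of section 5.2: the measures go into one nested document, each dimension into another, and dimension hierarchies are flattened.

```python
def fact_to_document(fact_id, measures, dimensions):
    # One fact instance becomes one document: Att^CF holds the
    # measures as simple attributes; each dimension becomes a nested
    # document of simple attributes (hierarchy not preserved).
    doc = {"Id": fact_id, "Content": dict(measures)}
    for dim_name, attributes in dimensions.items():
        doc[dim_name] = dict(attributes)
    return doc

d = fact_to_document(
    "vId1",
    {"NewsCount": 5, "OccurrenceCount": 8},
    {"Time": {"Day": "05/11/14", "Month": "11/14",
              "Name": "November", "Year": 2014},
     "Location": {"City": "Paris", "Country": "France"},
     "KeyWord": {"Term": "Virus", "Category": "Health"}})
```

The nesting mirrors Figure 9: one compound attribute per dimension plus one for the fact.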

Figure 11: Implementing the pre-computed aggregation lattice into a document-oriented NoSQL logical model. [The figure contrasts detailed documents, e.g. (Id: v_Id1, KeyWord: {Term: Virus, Category: Health}, Time: {Day: 05/11/14, Month: 11/14, Name: November, Year: 2014}, Fact: {NewsCount: 5, OccurrenceCount: 8}) and (Id: v_Id2, KeyWord: {Term: Ill, Category: Health}, Time: {Day: 05/11/14, Month: 11/14, Name: November, Year: 2014}, Fact: {NewsCount: 1, OccurrenceCount: 1}), with the documents of an aggregate node where the Time dimension has been rolled up, e.g. (Id: v_Id2, KeyWord: {Term: Ill, Category: Health}, Fact: {NewsCount: 4, OccurrenceCount: 6}).]

As in the column-oriented model, we choose to store all possible aggregates of the lattice. With this choice there is a large potential decrease in terms of query response time.
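To make the lattice mapping concrete, the following Python sketch (a hypothetical helper over made-up data, in the spirit of Figure 11) derives the documents of one aggregate node by grouping detailed documents on the KeyWord dimension and summing the two measures:

```python
from collections import defaultdict

def aggregate_node(documents, dim):
    # Group detailed documents on one dimension (a nested document)
    # and sum the measures; each group yields one node document.
    groups = defaultdict(lambda: {"NewsCount": 0, "OccurrenceCount": 0})
    for doc in documents:
        key = tuple(sorted(doc[dim].items()))
        for m in ("NewsCount", "OccurrenceCount"):
            groups[key][m] += doc["Fact"][m]
    return [{dim: dict(k), "Fact": f} for k, f in groups.items()]

detailed = [
    {"KeyWord": {"Term": "Virus", "Category": "Health"},
     "Time": {"Day": "05/11/14", "Month": "11/14"},
     "Fact": {"NewsCount": 5, "OccurrenceCount": 8}},
    {"KeyWord": {"Term": "Virus", "Category": "Health"},
     "Time": {"Day": "06/11/14", "Month": "11/14"},
     "Fact": {"NewsCount": 2, "OccurrenceCount": 3}},
]
keyword_node = aggregate_node(detailed, "KeyWord")
```

Each resulting document keeps only the grouping dimension (a nested document) and the nested fact document, as the lattice mapping rules prescribe.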

6 EXPERIMENTS

6.1 Experimental setup

Our experimental goal is to illustrate the instantiation of our logical models (both star schema and lattice levels). Thus, we show here experiments with respect to data loading and lattice generation. We use HBase and MongoDB respectively for testing the column-oriented and the document-oriented models. Data is generated with a reference benchmark (TPC-DS, 2014). We generate datasets of sizes 1GB, 10GB and 100GB. After loading data, we compute the aggregate lattice with the map-reduce/aggregation facilities offered by both HBase and MongoDB. The details of the experimental setup are as follows:

Dataset. The TPC-DS benchmark is used for generating our test data. This is a reference benchmark for testing decision support (including OLAP) systems. It involves a total of 7 fact tables and 17 shared dimension tables. Data is meant to support a retailer decision system. We use the store_sales fact and its 10 associated dimension tables (the most used ones). Some of its dimension tables are higher, hierarchically organized parts of other dimensions. We consider aggregations on the following dimensions: date (day, month, year), customer address (city, country), store address (city, country) and item (class, category).

Data generation. Data is generated by the DSGen generator (v1.3.0) in different CSV-like files (Comma-Separated Values), one per table (whether dimension or fact). We process this data to keep only the store_sales measures and the associated dimension values (by joining across tables and projecting the data). Data is then formatted as CSV files and JSON files, used for loading data into HBase and MongoDB respectively. We obtain successively 1GB, 10GB and 100GB of random data. The JSON files turn out to be approximately 3 times larger for the same data. The entire process is shown in Figure 12.

Figure 12: Broad schema of the experimental setup.

Data loading. Data is loaded into HBase and MongoDB using native instructions. These are supposed to load data faster when loading from files. The current version of MongoDB would not load data matching our logical model from CSV files, thus we had to use JSON files.

Lattice computation. To compute the aggregate lattice, we use map-reduce functions in both HBase and MongoDB. Four levels of aggregates are computed on top of the detailed facts. These aggregates are: all combinations of 3 dimensions, all combinations of 2 dimensions, all combinations of 1 dimension, and all data. MongoDB and HBase allow aggregating data using map-reduce functions, which are efficient for distributed data systems. At each aggregation level, we apply the aggregation functions max, min, sum and count on all dimensions. For MongoDB, the instructions look like the following (note that the reduce function must be associative, since it may be re-applied to partial results; the map function therefore emits partial aggregates):

    db.ss1.mapReduce(
      function() {
        // emit one key per (item, store, customer) combination,
        // with the partial aggregates as the value
        emit({item: {i_class: this.item.i_class,
                     i_category: this.item.i_category},
              store: {s_city: this.store.s_city,
                      s_country: this.store.s_country},
              customer: {ca_city: this.customer.ca_city,
                         ca_country: this.customer.ca_country}},
             {sum: this.ss_wholesale_cost, max: this.ss_wholesale_cost,
              min: this.ss_wholesale_cost, count: 1});
      },
      function(key, values) {
        var r = values[0];
        for (var i = 1; i < values.length; i++) {
          r.sum += values[i].sum;
          r.max = Math.max(r.max, values[i].max);
          r.min = Math.min(r.min, values[i].min);
          r.count += values[i].count;
        }
        return r;
      },
      {out: 'ss1_isc'}
    );
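For readers less familiar with map-reduce, the aggregation performed by the MongoDB instruction above amounts to the following plain-Python computation (an illustrative sketch over hypothetical store_sales rows, not the code we ran on the cluster):

```python
from collections import defaultdict

def cube_node(rows, key_fields, measure):
    # Group rows on the chosen dimension attributes, then compute
    # sum, max, min and count of the measure for each group,
    # like the map-reduce job above.
    acc = defaultdict(list)
    for row in rows:
        acc[tuple(row[f] for f in key_fields)].append(row[measure])
    return {k: {"sum": sum(v), "max": max(v), "min": min(v), "count": len(v)}
            for k, v in acc.items()}

rows = [
    {"i_class": "shoes", "s_city": "Lyon", "ca_city": "Paris", "ss_wholesale_cost": 10.0},
    {"i_class": "shoes", "s_city": "Lyon", "ca_city": "Paris", "ss_wholesale_cost": 30.0},
    {"i_class": "books", "s_city": "Lyon", "ca_city": "Paris", "ss_wholesale_cost": 5.0},
]
node = cube_node(rows, ("i_class", "s_city", "ca_city"), "ss_wholesale_cost")
```

The grouping key plays the role of the emitted map-reduce key; the per-group dictionary corresponds to the reduced value.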

In the MongoDB instruction above, data is aggregated using the item, store and customer dimensions. For HBase, we use Hive on top to ease the writing of aggregation queries. Queries with Hive are SQL-like. The query below illustrates the same aggregation on the item, store and customer dimensions:

    INSERT OVERWRITE TABLE out
    SELECT sum(ss_wholesale_cost), max(ss_wholesale_cost),
           min(ss_wholesale_cost), count(ss_wholesale_cost),
           i_class, i_category, s_city, s_country, ca_city, ca_country
    FROM store_sales
    GROUP BY i_class, i_category, s_city, s_country, ca_city, ca_country;

Hardware. The experiments are done on a cluster composed of 3 PCs (4-core i5, 8GB RAM, 2TB disks, 1Gb/s network), each being a worker node; one node also acts as dispatcher.

Data Management Systems. We use two NoSQL data management systems: HBase (v0.98) and MongoDB (v2.6). They are both successful key-value database management systems, for column-oriented and document-oriented data storage respectively. Hadoop (v2.4) is used as the underlying distributed storage system.

6.2 Experimental results

Loading data: The data generation process produced files of 1GB, 10GB and 100GB respectively. The equivalent JSON files were about 3.4 times larger due to the extra format. Table 1 shows loading times for each dataset and for both HBase and MongoDB. Data loading was successful in both cases. It confirms that HBase is faster when it comes to loading. However, we did not pay much attention to tuning each system for loading performance. We should also consider that the raw data (JSON files) takes more space in memory in the case of MongoDB for the same number of records; thus we can expect a higher network transfer penalty.

Table 1: Dataset loading times for each NoSQL database management system.

    Dataset size    1GB       10GB      100GB
    MongoDB         9.045m    109m      132m
    HBase           2.26m     2.078m    10.3m

Lattice computation: We report here the experimental observations on the lattice computation. The results are shown in the schema of Figure 13. Dimensions are abbreviated (D: date, C: customer, I: item, S: store). The top level corresponds to IDCS (detailed data). On the second level, we keep combinations of only three dimensions, and so on. For every aggregate node, we show the number of records/documents it contains and the computation time in seconds, respectively for HBase (H) and MongoDB (M).

Figure 13: The pre-computed aggregate lattice with processing time (seconds) and size (records/documents), using HBase (H) and MongoDB (M). The dimensions are abbreviated (D: date, I: item, S: store, C: customer).

In HBase, the total time to compute all aggregates was 1700 seconds, with respectively 1207s, 488s, 4s and 0.004s per level (from more detailed to less). In MongoDB, the total time was 3210 seconds, with respectively 2611s, 594s, 5s and 0.002s per level (from more detailed to less). We can easily observe that computing the lower levels is much faster, as the amount of data to be processed is smaller.

The size of the aggregates (in terms of records) also decreases as we move down the hierarchy: 8.7 million records (level 2), 3.4 million (level 3), 55 thousand (level 4) and 1 record at the bottom level.

7 DISCUSSION

In this section, we provide a discussion of our results. We want to answer three questions:
§ Are the proposed models convincing?
§ How can we explain the performance differences between MongoDB and HBase?
§ Is it recommended to use column-oriented and document-oriented approaches for OLAP systems, and when?

The choice of our logical NoSQL models can be criticized for being simple. However, we argue that it is better to start from the simplest and most natural models before studying more complex ones. The two models we studied are simple and intuitive, which makes them easy to implement. Processing the TPC-DS benchmark data required little effort: the data was successfully mapped and inserted into MongoDB and HBase, proving the simplicity and effectiveness of the approach.

HBase outperforms MongoDB with respect to data loading. This is not surprising; other studies also highlight the good loading performance of HBase. We should also consider that the data fed to MongoDB was larger due to additional markup, as MongoDB does not support CSV-like files when the collection schema contains nested fields. Current benchmarks produce data in a columnar (CSV-like) format, which gives an advantage to relational DBMSs. The column-oriented model we propose is closer to the relational model than the document-oriented model is, which remains an advantage for HBase compared to MongoDB. We can observe that it becomes useful to have benchmarks that produce data adequate for the different NoSQL models.

At this stage, it is difficult to draw detailed recommendations on the use of column-oriented or document-oriented approaches for OLAP systems. We recommend HBase if data loading is the priority. HBase also uses less memory space, and it is known for effective data compression (due to column redundancy). Computing aggregates takes a reasonable time for both systems, and many aggregates take little memory space.

A major difference between the NoSQL systems concerns interrogation. For queries that demand multiple attributes of a relation, the column-oriented approaches might take longer because the data will not be available in one place. For some queries, the nested fields supported by document-oriented approaches can be an advantage, while for others they would be a disadvantage. Studying these differences with respect to interrogation is left for future work.

8 CONCLUSION

This paper investigates the instantiation of OLAP systems through NoSQL approaches, namely column-oriented and document-oriented approaches. We have proposed two NoSQL logical models for this purpose, accompanied by rules that transform a multidimensional conceptual model into a NoSQL logical model.

Experiments were carried out with data from the TPC-DS benchmark. We generated datasets of sizes 1GB, 10GB and 100GB. The experimental setup shows how we can instantiate OLAP systems with column-oriented and document-oriented databases, respectively with HBase and MongoDB. This process includes data loading and aggregate computation, and it allows us to compare the different approaches with each other.

We show how to compute an aggregate lattice. Results show that both NoSQL systems we considered perform well, with HBase being more efficient at some steps. Using map-reduce functions, we compute the entire lattice. This is done for illustrative purposes, and we acknowledge that it is not always necessary to compute the entire lattice; this kind of further optimization is not the main goal of the paper.

The experiments confirm that data loading and aggregate computation are faster with HBase. However, document-based approaches have other advantages that remain to be thoroughly explored. The use of NoSQL technologies for implementing OLAP systems is a promising research direction. At this stage, we focus on the modeling and loading stages. This direction seems fertile, and many questions remain unanswered.

Future work. We list here some of the work we consider interesting. A major issue concerns the study of NoSQL systems with respect to OLAP usage, i.e. interrogation for analysis purposes.

We need to study the different types of queries and identify those that benefit most from each NoSQL model.

Finally, all approaches (relational models, NoSQL models) should be compared with each other in the context of OLAP systems. We can also consider different NoSQL logical implementations. We have proposed simple models, and we want to compare them with more complex and optimized ones.

In addition, we believe that it is timely to build benchmarks for OLAP systems that generalize to NoSQL systems. These benchmarks should account for data loading and database usage. Most existing benchmarks favor relational models.

ACKNOWLEDGEMENTS

These studies are supported by ANRT funding under the CIFRE-Capgemini partnership.

REFERENCES

Chaudhuri, S., Dayal, U., 1997. An overview of data warehousing and OLAP technology. SIGMOD Record, 26, ACM, pp. 65-74.

Colliat, G., 1996. OLAP, relational, and multidimensional database systems. SIGMOD Record, 25(3), ACM, pp. 64-69.

Cuzzocrea, A., Bellatreche, L., Song, I.-Y., 2013. Data warehousing and OLAP over big data: Current challenges and future research directions. 16th Int. Workshop on Data Warehousing and OLAP (DOLAP), ACM, pp. 67-70.

Dede, E., Govindaraju, M., Gunter, D., Canon, R. S., Ramakrishnan, L., 2013. Performance evaluation of a MongoDB and Hadoop platform for scientific data analysis. 4th Workshop on Scientific Cloud Computing, ACM, pp. 13-20.

Dehdouh, K., Boussaid, O., Bentayeb, F., 2014. Columnar NoSQL star schema benchmark. Model and Data Engineering, LNCS 8748, Springer, pp. 281-288.

Golfarelli, M., Maio, D., Rizzi, S., 1998. The dimensional fact model: A conceptual model for data warehouses. Int. Journal of Cooperative Information Systems, 7, pp. 215-247.

Gray, J., Bosworth, A., Layman, A., Pirahesh, H., 1996. Data Cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. Int. Conf. on Data Engineering (ICDE), IEEE Computer Society, pp. 152-159.

Han, D., Stroulia, E., 2012. A three-dimensional data model in HBase for large time-series dataset analysis. 6th Int. Workshop on the Maintenance and Evolution of Service-Oriented and Cloud-Based Systems (MESOCA), IEEE, pp. 47-56.

Jacobs, A., 2009. The pathologies of big data. Communications of the ACM, 52(8), pp. 36-44.

Kimball, R., Ross, M., 2013. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling. John Wiley & Sons, Inc., 3rd edition.

Lee, S., Kim, J., Moon, Y.-S., Lee, W., 2012. Efficient distributed parallel top-down computation of R-OLAP data cube using MapReduce. Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK), LNCS 7448, Springer, pp. 168-179.

Li, C., 2010. Transforming relational database into HBase: A case study. Int. Conf. on Software Engineering and Service Sciences (ICSESS), IEEE, pp. 683-687.

Malinowski, E., Zimányi, E., 2006. Hierarchies in a multidimensional model: From conceptual modeling to logical representation. Data and Knowledge Engineering, 59(2), Elsevier, pp. 348-377.

Morfonios, K., Konakas, S., Ioannidis, Y., Kotsis, N., 2007. R-OLAP implementations of the data cube. ACM Computing Surveys, 39(4), article 12.

Ravat, F., Teste, O., Tournier, R., Zurfluh, G., 2008. Algebraic and graphic languages for OLAP manipulations. Int. Journal of Data Warehousing and Mining (ijDWM), 4(1), IGI Publishing, pp. 17-46.

Simitsis, A., Vassiliadis, P., Sellis, T., 2005. Optimizing ETL processes in data warehouses. Int. Conf. on Data Engineering (ICDE), IEEE, pp. 564-575.

Stonebraker, M., 2012. New opportunities for New SQL. Communications of the ACM, 55(11), pp. 10-11.

TPC-DS, 2014. Transaction Processing Performance Council, Decision Support benchmark, version 1.3.0, http://www.tpc.org/tpcds/.

Vajk, T., Feher, P., Fekete, K., Charaf, H., 2013. Denormalizing data into schema-free databases. 4th Int. Conf. on Cognitive Infocommunications (CogInfoCom), IEEE, pp. 747-752.

Vassiliadis, P., Vagena, Z., Skiadopoulos, S., Karayannidis, N., 2000. ARKTOS: A tool for data cleaning and transformation in data warehouse environments. IEEE Data Engineering Bulletin, 23(4), pp. 42-47.

Wrembel, R., 2009. A survey of managing the evolution of data warehouses. Int. Journal of Data Warehousing and Mining (ijDWM), 5(2), IGI Publishing, pp. 24-56.

Zhao, H., Ye, X., 2014. A practice of TPC-DS multidimensional implementation on NoSQL database systems. 5th TPC Tech. Conf. on Performance Characterization and Benchmarking, LNCS 8391, Springer, pp. 93-108.