Multidimensional Modeling and Analysis of Large and Complex Watercourse Data: an OLAP-Based Solution

Multidimensional modeling and analysis of large and complex watercourse data: an OLAP-based solution Kamal Boulil, Florence Le Ber, Sandro Bimonte, Corinne Grac, Flavie Cernesson To cite this version: Kamal Boulil, Florence Le Ber, Sandro Bimonte, Corinne Grac, Flavie Cernesson. Multidimensional modeling and analysis of large and complex watercourse data: an OLAP-based solution. Ecological Informatics, Elsevier, 2014, 24, pp.30. 10.1016/j.ecoinf.2014.07.001. hal-01057105 HAL Id: hal-01057105 https://hal.archives-ouvertes.fr/hal-01057105 Submitted on 20 Nov 2019 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Ecological Informatics Multidimensional modeling and analysis of large and complex watercourse data: an OLAP-based solution a a b c d Kamal Boulil , Florence Le Ber , Sandro Bimonte , Corinne Grac , Flavie Cernesson a Laboratoire ICube, Université de Strasbourg, ENGEES, CNRS, 300 bd Sébastien Brant, F67412 Illkirch, France b Equipe Copain, UR TSCF, Irstea—Centre de Clermont-Ferrand, 24 Avenue des Landais, F63170 Aubière, France c Laboratoire LIVE, Université de Strasbourg/ENGEES, CNRS, rue de l'Argonne, F67000 Strasbourg, France d AgroParisTech—TETIS, 500 rue Jean François Breton, F34090 Montpellier, France 1. Introduction (descriptions of water quality stations and their hydrographic networks), land uses (land cover, flow obstacles, etc.), and contextual fac- In the context of the European Water Framework Directive (DCE, tors (climate, etc.). 2000), the Fresqueau project funded by the French Agency for research Current solutions (text files, ad hoc programs, spreadsheet tools ANR (2011–2014) aims to develop new methods for studying, compar- such as Open Office and MS Excel, etc.) used by water quality practi- ing and exploiting all the available parameters concerning the status of tioners are not suited to manage and to analyze large volumes of infor- running waters as well as the information describing uses and under- mation. Recently, some studies have shown the ease and power of using taken measures. More precisely, the project will contribute to the an- Data Warehouse (DW) and On-Line Analytical Processing (OLAP) tech- swer to two specific issues: (1) going farther into the understanding nologies to store and to analyze environmental data (Alexandru et al., of running water functioning through the analysis of taxa that support 2010; Boulil et al., 2013b; McGuire et al., 2008; Pinet and Schneider, biological indices, and (2) connecting the sources of pressure and the 2010). DWs are databases dedicated to the integration and storage of physicochemical and biological quality of running waters. The first large volumes of data to support the decision processes of organizations step of the project was the definition of an integrated Database (DB), (Inmon, 2005). DWs store decisional data at the finest granularity level integrating data from 20 public databases related to water quality as- and organize the data in a way that facilitates the analysis/aggregation. sessment. This large DB (2.6 Go) was constructed to provide data OLAP tools allow an interactive exploration of DW data at different analysis and knowledge discovery tools and methods with the levels of detail, following a multidimensional approach. These tools necessary data at the appropriate level of detail. This large DB build multidimensional data structures having different granularities, contains data related to five major topics: water quality parameters called data cubes, by aggregating DW data and provide users with oper- (bioindices, physicochemical parameters, etc.), hydrographic networks ators for rapid exploration of these data cubes. Data cubes represent the measures/metrics (e.g., temperature) of the subjects analyzed or of the E-mail addresses: [email protected] (K. Boulil), fl[email protected] facts in a space with multiple dimensions (analysis criteria for facts such (F. Le Ber), [email protected] (S. Bimonte), [email protected] as time and geographic locations, etc.) according to the multidimen- (C. Grac), fl[email protected] (F. Cernesson). sional abstraction model. The dimensions are organized into hierarchies of aggregation levels to allow viewing analysis indicators (aggregations complex analysis indicators using complex and/or multiple aggregate of measures) at different granularities. functions. DW and OLAP data cubes are generally implemented in relational In this paper, we first show the application of the OLAP technology to platforms consisting of four tiers (Boulil et al., 2013b; Malinowski and the field of water quality assessment. The architecture of the OLAP sys- Zimányi, 2008): the ETL that integrates data from data sources into tem that we defined consists of two data cubes, a data cube storing data the DW, the DW (a relational database that stores the finest data), the about the physicochemical water quality and another data cube OLAP server that calculates the data cubes from the DW data and the concerning hydrobiological water quality data, tools that allow their pe- OLAP client that displays the data cube information using tables and dif- riodical feeding with data from operational data sources (an integrated ferent types of statistical diagrams (pie charts, histograms, etc.) and re- database and some Excel files) and tools for the OLAP analysis by water ports in different export formats. Unlike the OLTP tools, OLAP tools quality practitioners. This architecture is based only on free software provide end users with easy to use and powerful analysis methods and can easily be extended with other data cubes (e.g., a data cube for that enable dynamic changes of the analysis perspective and the granu- the analysis of morphological data on watercourses that is another im- larity of data. The OLAP client interface is used to trigger OLAP opera- portant element in the water quality assessment) and software compo- tions such as Roll-up and Drill-down which, respectively, decrease and nents (other data sources, other data analysis tools such as data mining, increase the granularity of indicator values, and Slice, which returns a etc.). Using some examples, we demonstrate how the OLAP technology, sub-cube by applying a filtering condition on one dimension, etc. unlike OLTP tools such as Excel, can help water quality practitioners to In addition to data structures (facts, measures, dimensions, etc.), the increase their productivity by offering a series of intuitive interfaces definition of analysis indicators is a fundamental part of data cubes. The that facilitate and accelerate the multidimensional analysis and under- analysis indicators are computed by aggregating measure values along di- standing of water quality data. We render this possible by defining var- mension hierarchies. A simple analysis indicator is an application of a ious analysis indicators and enabling simple (thematic, spatial, and common aggregate function to a measure along all dimensions of the temporal) and combined (spatiotemporal) analysis on multiple scales. data cube (e.g., the average temperature (Fig. 1)). Common aggregate Using our framework (Boulil et al., 2013a) that we extend here by com- functions are supported by OLAP tools and DBMSs (e.g., Sum and plex aggregate functions (e.g., a generic function to calculate all percen- Count). In contrast, a complex analysis indicator may involve different ag- tiles), we show how to define complex analysis indicators by using gregate functions on different measures and along different dimensions, these complex functions and by introducing additional analysis dimen- or simply a complex aggregate function on a measure. Complex aggregate sions to allow their calculation and also for information rendering pur- functions (e.g., percentile functions) are not supported by OLAP tools. poses. In Boulil et al. (2013a, 2013b), we have shown how to define In the literature, many multidimensional models (languages) complex indicators having multiple but only simple aggregate func- (Abelló et al., 2006; Luján-Mora et al., 2006; Malinowski and Zimányi, tions. In addition, we propose two strategies to address the heterogene- 2008; Pinet and Schneider, 2010) and development approaches ity of measurement units (one of the main summarizability semantic (Glorio and Trujillo, 2008; Hahn et al., 2000; Pardillo and Mazón, problems) by (i) transforming source data at the ETL tier, and (ii) by in- 2010; Romero and Abelló, 2009) have been proposed to model data troducing an additional analysis dimension at the OLAP server tier. Se- cubes, but none of these models and approaches has been adopted as mantic and structural summarizability conditions grant the accuracy a standard. All the existing propositions in the area of multidimensional and the correctness of indicator values if we assume a good DW data modeling ignore the definition aspect of analysis indicators; the propo- quality (please see Boulil et al. (2012) for more details about the quality sitions only allow the design of simple indicators. In Boulil et

Multidimensional Modeling and Analysis of Large and Complex Watercourse Data: an OLAP-Based Solution

Overview of Mapreduce and Spark

Mapreduce: a Major Step Backwards - the Database Column

Applying OLAP Pre-Aggregation Techniques to Speed up Aggregate Query Processing in Array Databases by Angélica Garcıa Gutiérr

A Survey on Preparing Data Sets for Data Mining Analysis Using Horizontal Aggregations in SQL Prashant B

Sharded Parallel Mapreduce in Mongodb for Online Aggregation B Rama Mohan Rao, a Govardhan, Dept

Aggregate Calculations Alculations and Subqueries

Cleanm: an Optimizable Query Language for Unified Scale-Out Data Cleaning

Pandas Dataframe Pivot Table Example

Excel? DATA 301 Spreadsheets Are the Most Common, General‐Purpose Software for Introduction to Data Analytics Data Analysis and Reporting

Excel Pivot Table Transpose Rows to Columns

Sql Server Create Aggregate Function Example Nplifytm

In-Situ Mapreduce for Log Processing