To Each His Own: Accommodating Data Variety by a Multimodel Star Schema

Sandro Bimonte (University Clermont, TSCF, INRAE, Aubiere, France)
Yassine Hifdi (LIFAT Laboratory, University Tours, Blois, France)
Mohammed Maliari (ENSA, Tangier, Morocco)
Patrick Marcel (LIFAT Laboratory, University Tours, Blois, France)
Stefano Rizzi (DISI, University of Bologna, Bologna, Italy)

To cite this version: Sandro Bimonte, Yassine Hifdi, Mohammed Maliari, Patrick Marcel, Stefano Rizzi. To Each His Own: Accommodating Data Variety by a Multimodel Star Schema. Proceedings of the 22nd International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data co-located with EDBT/ICDT 2020 Joint Conference (EDBT/ICDT 2020), Mar 2020, Copenhagen, Denmark. hal-03009808

HAL Id: hal-03009808
https://hal.archives-ouvertes.fr/hal-03009808
Submitted on 17 Nov 2020

© Copyright 2020 for this paper held by its author(s). Published in the proceedings of DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT 2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Recent approaches adopt multimodel databases (MMDBs) to natively handle the variety issues arising from the increasing amounts of heterogeneous data (structured, semi-structured, graph-based, etc.) made available. However, when it comes to analyzing these data, traditional data warehouses (DWs) and OLAP systems fall short because they rely on relational Database Management Systems (DBMSs) for storage and querying, thus constraining data variety into the rigidity of a structured schema. This paper provides a preliminary investigation of the performance of an MMDB when used to store multidimensional data for OLAP analysis. A multimodel DW would store each of its elements according to its native model; among the benefits we envision for this solution are bridging the architectural gap between data lakes and DWs, reducing the cost of ETL data transformations, and ensuring better flexibility, extensibility, and evolvability thanks to the use of schemaless models. To support our investigation we present an implementation, based on the UniBench benchmark dataset, that extends a star schema with JSON, XML, spatial, and key-value data; we also define a sample OLAP workload and use it to test the performance of our solution and compare it with that of a classical star schema. As expected, the full-relational implementation performs better, but we believe that this gap could be balanced by the benefits of multimodel in dealing with variety. Finally, we give our perspective view of the research on this topic.

1 INTRODUCTION

Big Data is notoriously characterized by (at least) the 3 V's: volume, velocity, and variety. To handle velocity and volume, some distributed file system-based storage (such as Hadoop) and new Database Management Systems (DBMSs) have been proposed. In particular, four main categories of NoSQL databases have been proposed [2]: key-value, extensible record, graph-based, and document-based.

Although NoSQL DBMSs have proved successful in supporting volume and velocity, variety is still a challenge [21]. Indeed, several practical applications (e.g., retail and agriculture) require collecting and analyzing data of different types: structured (e.g., relational tables), semi-structured (e.g., XML and JSON), and unstructured (such as text and images). Using the right DBMS for the right data type is essential to grant good storage and analysis performance. Traditionally, each DBMS has been conceived for handling a specific data type; for example, relational DBMSs for structured data, document-based DBMSs for semi-structured data, etc. Therefore, when an application requires different data types, two solutions are possible: (i) integrating all data into a single DBMS, or (ii) using two or more DBMSs together. The former solution presents serious drawbacks: first of all, some types of data cannot be stored and analyzed (e.g., the pure relational model does not support the storage of images, XML, arrays, etc. [29]); besides, even when data can be converted and stored in the target DBMS, query performance could be unsatisfactory. The latter approach (known as polyglot persistence [16]) presents important challenges as well, namely, technically managing more DBMSs, complex query languages, inadequate performance optimization, etc. Therefore, multimodel databases (MMDBs) have recently been proposed to overcome these issues. An MMDB is a DBMS that natively supports different data types under a single query language to grant performance, scalability, and fault tolerance [21]. Remarkably, using a single platform for multimodel data promises to deliver several benefits to users besides that of providing a unified query interface; namely, it will simplify query operations, reduce development and maintenance issues, speed up development, and eliminate migration problems [21]. Examples of MMDBs are PostgreSQL and ArangoDB. PostgreSQL supports the row-oriented, column-oriented, key-value, and document-oriented data models, offering XML, HSTORE, and JSON/JSONB data types for storage. ArangoDB supports the graph-based, key-value, and document-oriented data models.

Handling variety while at the same time granting volume and velocity is even more complex in Data Warehouses (DWs) and OLAP systems. Indeed, warehoused data result from the integration of huge volumes of heterogeneous data, and OLAP requires very good performance for data-intensive analytical queries [20]. Traditional DW architectures rely on a single, relational DBMS for storage and querying¹. To offer better support to volume while maintaining velocity, some recent works propose the usage of NoSQL DBMSs; for example, [8] relies on a document-based DBMS, and [5] on a column-based DBMS. NoSQL proposals for DWs are based on a single data model, and all data are transformed to fit that model (document, graph, etc.). Overall, although these approaches offer interesting results in terms of volume and velocity, they have been mainly conceived and tested for structured data, without taking variety into account.

¹ More precisely, this is true for so-called ROLAP architectures. In MOLAP architectures, data are stored in multidimensional arrays. Finally, in HOLAP architectures, a MOLAP system and a ROLAP system are coupled.
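The idea of keeping each element of a DW in its native model, rather than transforming everything to fit one model, can be illustrated with a small sketch. The example below is only an illustration under stated assumptions: plain Python structures stand in for a DBMS, and every table name, attribute name, and value is hypothetical (it is not the UniBench schema and not the authors' implementation). Fact rows are kept in a structured form, while the product dimension is kept as raw JSON documents whose attributes are read only at query time.

```python
import json
from collections import defaultdict

# Hypothetical fact rows (structured part): foreign key plus measures.
facts = [
    {"order_id": 1, "product_id": "p1", "quantity": 2, "amount": 40.0},
    {"order_id": 2, "product_id": "p2", "quantity": 1, "amount": 15.0},
    {"order_id": 3, "product_id": "p1", "quantity": 3, "amount": 60.0},
]

# Hypothetical product dimension kept as raw JSON documents (schemaless):
# document p2 deliberately lacks the "brand" attribute.
product_docs = {
    "p1": '{"title": "Tent", "category": "Outdoor", "brand": "Acme"}',
    "p2": '{"title": "Mug", "category": "Kitchen"}',
}


def rollup(fact_rows, docs, attribute, measure):
    """Sum `measure` grouped by `attribute`, read from the JSON at query time."""
    totals = defaultdict(float)
    for row in fact_rows:
        doc = json.loads(docs[row["product_id"]])
        # A missing attribute is tolerated instead of breaking the query.
        key = doc.get(attribute, "unknown")
        totals[key] += row[measure]
    return dict(totals)


by_category = rollup(facts, product_docs, "category", "amount")
by_brand = rollup(facts, product_docs, "brand", "amount")
print(by_category)
print(by_brand)
```

In this sketch, grouping by a newly appearing attribute such as `brand` requires no change to the stored dimension, which is the kind of flexibility attributed above to schemaless models; the price is that each query has to parse the documents.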
Furthermore, to facilitate OLAP querying, DWs are normally based on the multidimensional model, which introduces the concepts of facts, dimensions, and measures to analyze data, so source data must be forcibly transformed to fit a multidimensional logical schema following a so-called schema-on-write approach. Since this is not always painless because of the schemaless nature of some source data, some recent works (such as [12]) propose to directly rewrite OLAP queries over document stores that are not organized according to the multidimensional model, following a schema-on-read approach (i.e., the multidimensional schema is not decided at design time and forced in a DW, but decided by each single user at querying time). However, even this approach relies on a single DBMS.

An interesting direction towards a solution for effectively handling the 3 V's in DW and OLAP systems is represented by MMDBs. A multimodel data warehouse (MMDW) can store data according to the multidimensional model and, at the same time, let each of its elements be natively represented through the most appropriate model. Among the benefits we envision for MMDWs are bridging the architectural gap between data lakes and DWs, reducing the cost of ETL data transformations, and ensuring better flexibility, extensibility, and evolvability thanks to the use of schemaless models.

In this paper we conduct a preliminary investigation of the

[Figure 1: Overview of the UniBench data. Labels in the figure: Key-Value, Relational, XML, Graph, JSON (data models); Ranking and feedback, Customers, Invoices, Vendors, Social networks, Orders, Products, RegUsers.]

logical model for column-based DWs has been proposed by [5] and [7] to address volume scalability. In [28], transformation rules for DW implementation in graph-based DBMSs have been proposed for better handling social network data. To the best of our knowledge, only [22] presents a benchmark for comparing NoSQL DW proposals; specifically, this benchmark is applied to MongoDB and HBase. Some works also study the usage of XML DBMSs for warehousing XML data [24]. Although XML DWs represent a first effort towards native storage of semi-structured data, their query performance does not scale well with size, and compression techniques must be adopted [4]. Among all these proposals, it is hard to champion