Noxperanto: Crowdsourced Polyglot Persistence Antonio Maccioni, Orlando Cassano, Yongming Luo, Juan Castrejón, Genoveva Vargas-Solar
Total Page:16
File Type:pdf, Size:1020Kb
NoXperanto: Crowdsourced Polyglot Persistence Antonio Maccioni, Orlando Cassano, Yongming Luo, Juan Castrejón, Genoveva Vargas-Solar To cite this version: Antonio Maccioni, Orlando Cassano, Yongming Luo, Juan Castrejón, Genoveva Vargas-Solar. NoX- peranto: Crowdsourced Polyglot Persistence. Polibits, 2014, 50, pp.43 - 48. 10.17562/PB-50-6. hal-01584798 HAL Id: hal-01584798 https://hal.archives-ouvertes.fr/hal-01584798 Submitted on 11 Sep 2017 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. NoXperanto: Crowdsourced Polyglot Persistence Antonio Maccioni, Orlando Cassano, Yongming Luo, Juan Castrejon,´ and Genoveva Vargas-Solar Abstract—This paper proposes NOXPERANTO, a novel modern application development [2], [3]. They enable the crowdsourcing approach to address querying over data integration of these data stores for managing big datasets in collections managed by polyglot persistence settings. The main a scalable and loosely coupled way. This approach seems to contribution of NOXPERANTO is the ability to solve complex queries involving different data stores by exploiting queries be a pertinent strategy to deal with big datasets integration. from expert users (i.e. a crowd of database administrators, data Nonetheless, the combination of these heterogeneous data engineers, domain experts, etc.), assuming that these users can stores, flexible schemas and non-standard APIs, represent an submit meaningful queries. NOXPERANTO exploits the results added complexity for application developers. For instance, of “meaningful queries” in order to facilitate the forthcoming considering that data are spread across multiple data stores, query answering processes. In particular, queries results are used to: (i) help non-expert users in using the multi-database each of which possibly relies on distinct data models (graph, environment and (ii) improve performances of the multi-database document, etc.), developers must be familiar with a high environment, which not only uses disk and memory resources, number of implementation details, in order to effectively work but heavily rely on network bandwidth. NOXPERANTO employs with and maintain the overall data entities. a layer to keep track of the information produced by the crowd Providing an integrated view of the underlying data modeled as a Property Graph and managed in a Graph Database Management System (GDBMS). collections for enabling querying, is still an open issue, often solved at the application code level. Given that there Index Terms—Polyglot persistence, crowdsourcing, multi- databases, big data, property graph, graph databases. is no uniformization of schemas and there is a lack of meta-information available about the data (i.e. mappings, meta-data, semantic equivalences, data similarities), query I. INTRODUCTION results are not integrated collections of data, often, they are IG datasets are not uniform collections of homogeneous bags of non related data collections. It follows that, the quality B data and, for this reason, they are not logically stored of querying in such polyglot systems is poor. in one single database. In addition, the size of big datasets Contribution. This paper proposes NOXPERANTO, a novel is such that it is not possible to adopt data cleaning, dataset crowdsourcing1 approach to address querying over data preparation and integration using classic methods. Otherwise, collections managed by polyglot persistence settings. data would never be ready for being analyzed and exploited in reasonable time. We avoid the pursue of uniformity, but rather, we preserve Often, these datasets are sharded across distributed persis- the variety of original systems and languages. The main tence supports that adopt different access and management contribution of NOXPERANTO is the ability to solve complex policies, different degrees of availability, fault tolerance and queries involving different data stores by exploiting queries consistency. from expert users (i.e. a crowd of database administrators, data Polyglot Persistence [1] is a brand new term to indicate engineers, domain experts, etc.), assuming that these users can the combination of approaches that cope with a variety submit meaningful queries. of persistence systems and that are used for “integrating” Crowds have been revealed to be effective to solve queries heterogeneous datasets. over a single database [4], but they have never been used to As part of the emerging polyglot persistence movement [1], query a multi-database system, where the required expertise the simultaneous use of multiple SQL, NoSQL and NewSQL (about schemas, models, data formats and instances of the data stores is gradually becoming a common practice in database) cannot be superficial. In fact, from the structure of the queries, we can infer who is expert. Manuscript received on May 3, 2014; accepted for publication on June 10, NOXPERANTO exploits the results of such “meaningful 2014, published on November 15, 2014. queries” in order to facilitate the forthcoming query Antonio Maccioni is with the Universita` Roma Tre, Rome, Italy (e-mail: [email protected]). answering processes. In particular, the results of queries Orlando Cassano is with the Universite´ Libre de Bruxelles, Brussels, are used to: (i) help non-expert users in using the Belgium (e-mail: [email protected]). multi-database environment and (ii) improve performances of Yongming Luo is with the Eindhoven University of Technology, the Netherlands (e-mail: [email protected]). the multi-database environment, which not only uses disk and Juan Castrejon´ is with the Universite´ de Grenoble, Grenoble, France (e- memory resources, but heavily rely on network bandwidth. mail: [email protected]). Genoveva Vargas-Solar (corresponding author) is with the Centre national de la recherche scientifique (CNRS), LIG-LAFMIA labs, Saint Martin 1The modality to fulfil a target by relying on the contributions coming from d’Heres,` France (e-mail: [email protected]). a large group of people. ISSN 1870-9044; pp. 43–48 43 Polibits (50) 2014 Antonio Maccioni, Orlando Cassano, Yongming Luo, Juan Castrejón, Genoveva Vargas-Solar NOXPERANTO employs a layer to keep track of the KNOWS name : “Yongming” name : “Antonio” information produced by the crowd. This is modeled as a age : “3 months” surname : “Luo” surname : “Maccioni” Property Graph and thus, it is persisted by a Graph Database Country : “China” Country : “Italy” Management System (GDBMS). KNOWS Outline. The remainder of the paper is organized as follows. age : “3 months” MET date : “2013-09-01” While Section III introduces the concepts that we need KNOWS MET age : “7 months” KNOWS date : “2013-10-10” throughout the paper, Section II discusses works that are age : “7 months” name : “Juan” related to our. Section IV explains our approach in detail. surname : “Castrejòn” Country : “Mexico” Section V describes implementation issues. Section VI SUPERVISES age : “2 year” concludes the paper with future perspectives of research. MET date : “2013-09-03” name : “Orlando” name : “Genoveva” II. RELATED WORKS surname : “Cassano” surname : “Vargas-Solar” Country : “Belgium” Country : “Mexico” Most information systems are supported by distributed and heterogeneous data sources: multi-databases, data warehouses and Web portals. Such systems are, in fact, mediation infrastructures that emerged more than 20 years ago to enable Fig. 1. An example of property graph. transparent access to multiple data sources through querying, analysis, navigation and management facilities. With the advent of NoSQL stores and polyglot persistence [11], [9], [4], [10], [12]. In particular, CrowdDB [4] integrates approaches, problems of data integration emerge with some human inputs within a query plan that contains operations specifities: the high heterogeneity of data models (e.g., NoSQL that usual database systems cannot always answer correctly underlying data models are not standardized), the absence (i.e. matching, ranking, etc.). RandomDB [12] answers non of semantic descriptions, the volume of data to integrate, deterministic queries using social network crowds with the just to name a few. Transformation and other well-known aim to replace existing random number generators. techniques to bridge different databases seem hard to employ at large scale [5]; they would bring data duplication that is III. QUERYING A POLYGLOT MULTI-STORE unfeasible when the data size is huge. In order to deal with heterogeneity, the majority of information systems define an This section introduces basic concepts that are needed to integrated schema. However, the integration process, generally introduce our approach in the next section. controlled by the administrator of the system, is hardly Property Graph. Graph data models are able to generalize suited for managing data in highly distributed and evolving other data models: environments such as the Web. Hardly coupled multi-sources – relational data: both schema and instance of relational storing huge datasets with semantic information (e.g. global