Uniform Data Access Platform for SQL and NoSQL Database Systems
Information Systems 69 (2017) 93–105
Contents lists available at ScienceDirect
Information Systems
journal homepage: www.elsevier.com/locate/is

Ágnes Vathy-Fogarassy ∗, Tamás Hugyák
University of Pannonia, Department of Computer Science and Systems Technology, P.O.Box 158, Veszprém, H-8201 Hungary

∗ Corresponding author. E-mail addresses: [email protected] , [email protected] (Á. Vathy-Fogarassy), [email protected] (T. Hugyák).
http://dx.doi.org/10.1016/j.is.2017.04.002

Article info

Article history: Received 8 August 2016; Revised 1 March 2017; Accepted 18 April 2017; Available online 4 May 2017

Keywords: Uniform data access; Relational database management systems; NoSQL database management systems; MongoDB; Data integration; JSON

Abstract

Integration of data stored in heterogeneous database systems is a very challenging task and it may hide several difficulties. As NoSQL databases are growing in popularity, integration of different NoSQL systems and interoperability of NoSQL systems with SQL databases become an increasingly important issue. In this paper, we propose a novel data integration methodology to query data individually from different relational and NoSQL database systems. The suggested solution does not support joins and aggregates across data sources; it only collects data from different separated database management systems according to the filtering options and migrates them. The proposed method is based on a metamodel approach and it covers the structural, semantic and syntactic heterogeneities of source systems. To introduce the applicability of the proposed methodology, we developed a web-based application, which convincingly confirms the usefulness of the novel method.

© 2017 Elsevier Ltd. All rights reserved.

1. Introduction

Integration of existing information systems is becoming more and more inevitable in order to satisfy customer and business needs. General data retrieval from different database systems is a challenging task and has become a widely researched topic [1–7]. In many cases, we need information simultaneously from different database systems which are separated from each other (e.g. to query patient information from separated hospital information systems or to manage computer-integrated manufacturing). In these cases, users need homogeneous logical views of data physically distributed over heterogeneous data sources. These views have to represent the collected information as if data were stored in a uniform way. The basic problem originates from the fact that source systems were built separately and the need for information linking arose later. If the source systems were designed and built independently of each other, then the implemented database schemas and semantics of data may be significantly different. The main tasks of general data access are (i) to identify those data elements in source systems that match each other and (ii) to formulate such global database access methods that can handle the differences of the source systems. To solve the problem we need an intermediate solution to retrieve data from heterogeneous source systems and to deliver them to the user.

There exist a lot of solutions which aim to integrate data from separated relational systems into a new one. But the problem is further complicated if the source database systems are heterogeneous, namely they implement different types of data models (e.g. the relational data model vs. different non-relational data models). While all relational database management systems are based on the well-known relational data model, NoSQL systems implement different semistructured data models (column stores, key-value stores, document databases, graph databases) [8,9]. As data models define how the logical structure of a database is modeled, they play a significant role in data access. The aim of this paper is to suggest a generalized solution to query relational and non-relational database systems simultaneously. The contribution of this work is to present a novel data integration method and workflow, which implement data integration of different source systems in such a way that users do not need any programming skills. The central element of the proposed method is an intermediate data model presented in JSON.

The organization of this paper is as follows. Section 2 gives a brief overview of data integration methods. In Section 3, the main problems of data integration processes are introduced. Section 4 presents the proposed solution from a theoretical aspect and in Section 5 we introduce a data integration application which implements the proposed method. Finally, Section 6 concludes the paper.

2. Background

2.1. Data integration methods

Data integration can be performed on different abstraction levels [10] and according to these levels, the applied techniques can be grouped as (i) manual integration, (ii) supplying the user with a common user interface, (iii) developing an integration application, (iv) using a middleware interface, (v) working out a uniform data access, and (vi) building a common data storage [11]. The first four approaches offer simpler solutions, but each has its own significant drawbacks. Manual data integration is a very time-consuming and costly process. An integration application can access various databases and return the merged result to the user. The main drawback of this solution is that it can handle only those databases which are integrated into it, and so if the number and type of source systems increase, the system becomes more and more complex. In the case of developing common user interfaces, data is still separately presented and homogenization and integration of data have to be done by the users [11]. Middleware, such as OLE DB or .NET providers, or ODBC and Java Database Connectivity (JDBC) database access drivers, provides generalized functionality to query and manipulate data in different database management systems, but it cannot resolve the structural and semantic differences of source databases.

Building a common data storage and migrating all required information into a new database provides a more complex solution. This process is usually performed programmatically and includes data extraction, transformation, and loading steps (ETL process). The most important step of the ETL process is the data transformation, which maps data types and structures from the source systems into data types and structures of the target system. Currently, due to the problems of the whole migration process (e.g. diversity of source storage structures and data types, the problem of data inconsistency, lack of management strategies), the migration process is typically done manually or in a semi-automatic way [12]. But above all, the most important drawback of the data migration process is that it requires storage capacity and the migrated data set does not follow the change of the original data in the source systems automatically. If we want to follow changes of data stored in the source systems, the migration process must be repeated frequently, and so it becomes very time-consuming and costly.

Uniform data access provides a logical integration of data and eliminates many of the aforementioned disadvantages. In this case, the required information will not be stored in a target database, so users and developers do not have to deal with the [...]

[...] manner [16]. Such ontology-based data integration methods are widely used in commerce and e-commerce [17–19], furthermore in geospatial information systems [20,21], web search [22,23] and in life sciences [3,24,25] as well.

2.2. Data integration solutions

Generally, data integration methods principally aim to integrate data arising from different SQL systems. However, the problem of data integration is further complicated by the fact that data can be stored in different database management systems that implement different data models (e.g. relational database management systems vs. different NoSQL systems).

As the role of NoSQL systems has increased significantly in recent years, the demand for data interoperability of NoSQL systems has also increased. In the last few years, only a few case studies and methodological recommendations were published which aim to migrate data stored in NoSQL systems (e.g. [26–28]). In [29] Atzeni and his co-authors present a very impressive work, which aims to offer a uniform interface called Save Our Systems (SOS), that allows access to data stored in different NoSQL systems (HBase, Redis, and MongoDB). The architecture of the proposed system contains a Common Interface, which implements uniform data retrieval and manipulation methods, and a Common Data Model, which is responsible for managing the translations from the uniform data model to the specific structures of the NoSQL systems. In this solution, simultaneous data access is implemented instead of data migration. The main drawback of the proposed method is that the expressive power of the implemented methods is limited exclusively to single objects of NoSQL systems. Furthermore, this solution does not provide users with data migration (e.g. schema matching) facilities.

To the best of our knowledge, only a few works currently exist to query data from NoSQL and relational database systems simultaneously. In [30] the Unified Modelset Definition is given, which provides a basis to query data from relational database management systems (RDBMS) and NoSQL systems. OPEN-PaaS-DataBase API (ODBAPI) [31] is a unified REST-based API, which enables the execution of CRUD operations on relational and NoSQL data stores. It provides users with a set of operators to manage different types of databases, but data migration and JOIN functions are not implemented in this solution. Roijackers and his colleagues [32] present a framework to seamlessly bridge the gap between SQL and document stores. In the proposed method documents are logically incorporated into the relational database and querying is performed via an extended SQL language.
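
To make the idea of uniform data access more concrete, the following minimal Python sketch collects filtered records from a relational source and from a document source separately, renames their fields to one common schema, and returns the result as JSON. It is only an illustration under stated assumptions, not the implementation proposed in this paper: the table `patients`, the field mappings `SQL_MAPPING` and `DOC_MAPPING`, and all function names are hypothetical, an in-memory SQLite database stands in for the relational system, and a plain list of dictionaries stands in for a document store such as MongoDB.

```python
import json
import sqlite3

# Hypothetical mappings: source field name -> common (global) field name.
SQL_MAPPING = {"pid": "patient_id", "full_name": "name", "birth_year": "born"}
DOC_MAPPING = {"_id": "patient_id", "name": "name", "born": "born"}


def to_common(record, mapping):
    """Rename the fields of one source record to the common schema."""
    return {common: record[source] for source, common in mapping.items()}


def fetch_sql(conn, min_born):
    """Query the relational source; the filter is evaluated by the database."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT pid, full_name, birth_year FROM patients WHERE birth_year >= ?",
        (min_born,),
    )
    return [to_common(dict(row), SQL_MAPPING) for row in rows]


def fetch_documents(collection, min_born):
    """Query the document source (a list of dicts stands in for a document store)."""
    return [to_common(doc, DOC_MAPPING) for doc in collection if doc["born"] >= min_born]


def collect(conn, collection, min_born):
    """Collect matching records from each source separately (no cross-source join)."""
    records = fetch_sql(conn, min_born) + fetch_documents(collection, min_born)
    return json.dumps(records, sort_keys=True)


def demo():
    """Build two tiny example sources and collect everyone born in 1970 or later."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (pid INTEGER, full_name TEXT, birth_year INTEGER)")
    conn.executemany(
        "INSERT INTO patients VALUES (?, ?, ?)",
        [(1, "Anna", 1980), (2, "Bela", 1955)],
    )
    documents = [
        {"_id": "a7", "name": "Csaba", "born": 1991},
        {"_id": "b3", "name": "Dora", "born": 1949},
    ]
    return collect(conn, documents, 1970)


if __name__ == "__main__":
    print(demo())
```

In this sketch the per-source mappings carry the structural and naming differences, so the caller only sees records in the common form; data types and semantics are not reconciled here, which a real solution would also have to address.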