F O U N D A T I O N S O F O M P U T I N G A N D D E C I S I O N S C I E N C E S Vol. 44 (2019) No. 4 ISSN 0867-6356 DOI: 10.2478/fcds-2019-0021 e-ISSN 2300-3405

Integration of Relational and Graph Functionally

Jaroslav POKORNÝ*

Abstract. In today’s multi-model world there is an effort to integrate databases expressed in different data models. The aim of the article is to show possibilities of integration of relational and graph databases with the help of a functional data model and its formal language – a typed lambda calculus. We suppose the existence of a data schema both for the relational and graph database. In this approach, relations are considered as characteristic functions and property graphs as sets of single-valued and multivalued functions. Then it is possible to express a query over such integrated heterogeneous database by one query expression expressed in a version of the typed lambda calculus. A more user- friendly version of such language could serve as a powerful query tool in practice. We discuss also queries sent to the integrated system and translated into queries in SQL and - the graph for .

Keywords: graph database, , databases integration, functional

1. Introduction

The ubiquitous NoSQL databases (e.g., [25]) represent a completely new way of thinking about databases as such. In practice, it means one of the most fundamental choices to make when developing an application is whether to use a SQL or NoSQL database to store the data. One observation says that neither everything is good in NoSQL nor everything is so bad in relational databases (RDB). Some analyses show, that relational and NoSQL databases are even in some sense complementary [13]. The choice of technology is critical for applications that can be both transactional and analytical. They typically require different software and hardware architectures. Now, in current applications, especially where extensive analyses are needed, it turns out that it is non-trivial to design an infrastructure involving data and software of both types. A particular case of integration of relational and NoSQL databases concerns graph databases. One tendency is to use multi-level modelling approaches involving both relational and NoSQL architectures including their integration [8], [19]. Today, NoSQL databases are considered in contrast to traditional RDBMSs products. The NoSQL movement has brought many new database models in database industry where each

* MFF UK, Malostranské nám. 25, 118 00 Praha 1, Czech Republic, [email protected] 428 J. POKORNÝ model had some important features that RDBMSs do not have or enable to do only with a high effort. A well-known example shows directed graphs implemented using relations. Some typical graph tasks, such as finding a path from node A to node B, require many operations or complex recursive queries in SQL. On the other hand, in native graph query tools, these tasks are specified in a more simple way and with more effective implementation. The interest in graph databases is continuously growing because of its ability to analyze the data in non-relational format (e.g., social networks). Obviously, transactional data processing requires rather a relational database and classical querying guaranteed by SQL. A graph database (GDB) is based on . It uses nodes, properties, and (directed) edges [22]. A node represents an entity, such as a User, Movie, Object, and an edge represents the relationship between two nodes, e.g., is_friend_of, who_buys_what, etc. Labelled nodes and edges may have various properties (attributes) given by key-value pairs attached to them. There may be more edges between two nodes. The property (attribute) graph is then a multigraph. A GDB can be either one graph or a collection of property graphs. In practice, we can find also more semantically enriched data structures, so-called triplestores1. Here we suppose GDB represented by one property graph. Similarly to traditional databases, we use the notion of graph database management system (GDBMS). GDBMSs proved to be very effective and suitable for many data handling use cases. Graph querying is a key issue in any graph-based applications. Today, NoSQL databases are considered in contrast to traditional RDBMSs products. On the other hand, yet other approaches are possible, e.g., a functional approach. In the late 80s, there was the functional language DAPLEX [24]. In the current era of GDBMSs, we can mention Gremlin2 - a functional developed by Apache TinkerPop which allows to express complex graph traversals and mutation operations over property graphs. Traversal operators/functions are chained together to form path-like expressions. is supported by many GDBMSs (e.g., Titan3 ). A number of significant works using functional approach to data management are contained in [9]. We will use a functional approach in which a property graph is represented by typed partial functions. We are inspired by the HIT Database Model, see, e.g., [15], as a functional alternative variant of E- model. The functions considered will be of two kinds: single- valued and multivalued. Then, a typed lambda calculus, i.e., the language of lambda terms, can be used as a data manipulation language. Regardless of the fact that the graphs are considered schema-less in NoSQL databases, we will use graph database schemas. The main aim of the paper is to introduce a functional approach to modelling both relations and property graphs. The relations will be also considered as typed functions. Then some possibilities to integration on the level of a common query language are presented and discussed. For some real references to graph querying in today’s practice, we will consider GDBMS Neo4j [22] and its popular query language Cypher4. For SQL relational databases, of course, we will assume SQL. The text follows the work of Pokorny [20]. The rest of the paper is organized as follows. In Section 2, we introduce basic notions of GDBs and summarize some works related to integration of relational and NoSQL databases.

1 https://www.ontotext.com/knowledgehub/fundamentals/what-is-rdf-/(retrieved on 10.8.2019) 2 https://tinkerpop.apache.org/gremlin.html (retrieved on 10.8.2019) 3 http://titan.thinkaurelius.com/ (retrieved on 10.8.2019) 4 http://neo4j.com/developer/cypher-query-language/ (retrieved on 10. 1. 2019) Integration of Relational and Graph Databases Functionally 429

In Section 3, we deal with a functional approach to data modelling with application to graphs and relations. Graph and relational database schemas are defined. Section 4 is devoted to querying. We will present a variant of typed lambda calculus as a query language over functional data objects. Its usability to the integration of RDBs and GDBs is discussed in Section 5. An integrated database schema describing both database types will contain a set of functional specifications describing all database objects. So called hybrid approach is used for this purpose. Section 5 gives the conclusion and some challenges for future work.

2. Background and related works

There are many different types of graphs [6]. Here, we focus on the property graphs. Formally, they can be described as follows. Supposing a set of nodes V and a multiset of edges E, where E Í V ´ V, a multigraph G is an ordered pair (V, E). In the world of GDBs we will use a (labelled) property multigraph model whose basic constructs include: entities (V), relationships (E) having a direction, start node, and end node, and identifiers, properties (attributes), and labels (types), Entities and relationships can have any number of properties, nodes and edges can be tagged with labels. Both nodes and edges are defined by a unique identifier (Id). Properties are of key-value pairs, i.e. only single-valued attributes are considered. In graph-theoretic notions, we also talk about labelled and directed attributed multigraphs in this case. In the world of NoSQL databases, we do not need to build a fixed database schema for a GDB. New connections between entities in a database can be added through direct interactions with the graph to specify the desired relationships. Here, for practical reasons, we are considering the graph database schema. The book [8] offers three ways of integration of the two different worlds of relational and NoSQL databases: native, hybrid, and reducing to one option, either relational or NoSQL. For example, Virtuoso5 is a hybrid RDB server that supports relational tables and entity- relationship property graphs (RDF graphs). A native solution is based on the standard database drivers and ways how the application layer communicates with a specific database. A hybrid solution is based on development of additional layer that enables SQL communication between application and data layer. Generally, several approaches are under development (see, e.g., [19]): • - provides support for multiple database models. Each database in a data architecture is used for a different purpose. Polyglot persistence is a process for storing data in the best database available, no matter the data model and data storage technology. “Polyglot” means “able to speak many languages”, not integration. • multi-model approach - means heterogeneous database integration, i.e. multiple models in one product. For example, OrientDB6 is a multi-model DBMS including geospatial, graph, fulltext, and key-valued data models. Couchbase7 supports document and relational models in its Couchbase Server database. • multi-level modelling - is used mostly in hybrid solutions (see, Section 4).

5 https://virtuoso.openlinksw.com/ (retrieved on 10. 8. 2019) 6 http://orientdb.com/orientdb/ (retrieved on 10. 8. 2019) 7 https://www.couchbase.com/ (retrieved on 10. 8. 2019) 430 J. POKORNÝ

• NoSQL relationally - has features both multi-model and multi-level ones. The approach includes, e.g., a multi-model solution considering document and - oriented DB integrated through a middleware into a virtual SQL database [4]. • schema and data conversion - includes, e.g., a schema conversion model, in which the SQL database schema is converted to the NoSQL database schema [27]. We can observe that the approaches are mutually related and rather describe particular cases with given NoSQL/relational databases than more exact specified categories.

3. Functional approach to data modelling

The functional approach used here is based on a typing system. Different types can be used. For our purposes, we use elementary types and two structured types (Section 3.1). Typed functions appropriate to modelling real data objects are attributes (Section 3.2) viewed as empirical typed functions that are described by an expression of a natural language [15]. A lot of papers are devoted to this approach studied mainly in 90ties in context of conceptual modelling of databases (see, e.g., [16]). Section 3.3 discusses the notion of a conceptual/database schema for GDB and relational databases.

3.1 Types

A hierarchy of types is constructed as follows. We assume the existence of some (elementary) types S1,...,Sk (k≥1) constituting a base B. More complex types are constructed in the following way. If S, R1,...,Rn (n≥1) are types, then

(i) (S:R1,...,Rn) is a (functional) type, (ii) (R1,...,Rn) is a (tuple) type. The set of types T over B is the least set containing all types from B and those given by (i)-(ii). When Si in B are interpreted as non-empty sets, then (S:R1,...,Rn) denotes the set of all (total or partial) functions from R1×...×Rn into S, (R1,...,Rn) denotes the Cartesian product R1×...×Rn. The elementary type Bool = {TRUE, FALSE} is also in B. It allows to model sets (resp. relationships) as unary (resp. n-ary) characteristic functions. The notions of a set and a are then redundant here. Thus, relations in an RDB can be simply modelled as their characteristic functions, i.e. functionally typed objects. The fact that X is an object of type R Î T can be written as X/R, or "X is the R-object". For each typed object o the function type returns type(o) Î T of o. Logical connectives, quantifiers and predicates are typed functions, e.g., and/(Bool:Bool,Bool), implies/(Bool:Bool,Bool), R-identity =R is (Bool:R,R)-object, universal R-quantifier ΠR, and existential R-quantifiers ΣR are (Bool:(Bool:R))-objects. R-singularizer I/(R:(Bool:R)) denotes the function whose value is the only member of an R-singleton and in all other cases the application of IR is undefined. Arithmetic operations are (Number: Number, Number)- Integration of Relational and Graph Databases Functionally 431 objects, COUNTS/((Number:(Bool:S)) is an aggregation function. The approach also enables to type functions of functions, etc.

3.2 Atributes

Object structures usable in building a database can be described by some expressions of a natural language. Each base B will consist of descriptive and entity types. Consider B = {User, Movie, U_ID, Name, Birth_year,…, String, Number}. For GDBs, we can conceive entity types as sets of node IDs. Types String, Number, etc., serve for domains of properties, i.e., of descriptive types. Then, e.g., the expression "the movies rated by a user" denotes a ((Bool:Movie):User)-object, i.e. a (partial) function f:User→(Bool:Movie). Such (named) functions represent attributes. More formally, attributes are functions of type ((S:T):W), where W is the logical space (possible worlds), T contains time moments, and S ϵ T. For simplicity, we will not explicitly consider types W and T for attributes in this paper. A lot of functions need even no possible world. For example, aggregation functions like COUNT, AVARAGE, and arithmetic functions provide such functions. They have the same behaviour in all possible worlds and time moments. We call such functions analytical. Empirical functions (e.g. attributes) are conceived as partial functions from the logical space and time moments. Range of these functions are again functions.

3.3 Graph and relational data schemas

As usually, both conceptual and database schemas can be developed for GDBs (see, e.g., [21], [23]). A difference between conceptual and database view is rather negligible here. The notion of attribute applied in GDBs can be restricted to attributes of types (R1:R2) and ((Bool:R1):R2), respectively, where R1 and R2 are entity types. That is, single-valued and multivalued attributes are considered. Now we add properties. A set of properties describing an entity type is described as ((S1,...,Sm):R) (m≥1), where Si are descriptive types and R is an entity type. Thus, a set of properties is expressed by a tuple type here. Similarly, we can express properties of edges. They are of types ((S1,...,Sm,R1):R2) or ((Bool:S1,...,Sm,R1):R2). That is, each edge of a single- valued and multivalued functional type, respectively, has m properties (m≥0). As we mentioned in Section 2, despite of the fact that NoSQL databases have not to use a database schema, our approach requires to have these structures. For GDBs we use a modification of the database model described in [17]. In this model, a GDB schema is expressed again by a property graph. Figure 1 describes movies having a Title, Director, and release date (Released). Users given by U_ID, Name and Birth_year rate the movies by a number of Stars. We need express integrity constraints (ICs) expressing facts like “A user can rate more films”, “Users have more friends“ or “A user submits a review for at most one journal”. The arrows of edges serve for this purpose. Then, they represent inherit ICs. Other ICs, e.g., “A user does not rate movies directed a “Spielberg” would have to be expressed explicitly, which is not common in the NoSQL world. It is simply not possible, because NoSQL systems do not provide such functionalities for the sake of performance gain. 432 J. POKORNÝ

Movie

Title Director Released Submittes_to Rates

Date

Stars Journal User

Publisher U_ID Address U_Name Birth_year Friend-of

Figure 1. GDB schema Movies We describe multivalued functions by double arrows in a database schema graph. Single arrows denote directional functional relationships, e.g., Submittes_to. Tuple types with S1,...,Sm used for properties of edges contain as a last component the entity type of the end node of the given edge, e.g., (Date, Journal) for Submittes_to. Then a functional database schema corresponding to the GDB schema in Figure 1 can look as Movie/((Title, Director, Released):Movie) User/((U_ID, U_Name, Birth_year):User) Journal/((Address, Publisher):Journal) Friends_of_friend/((Bool:User):User) Rates/((Bool:Stars, Movie):User) Submittes_to/((Date, Journal):User) We remark, however, that our functional GDBs with such schemas can contain isolated nodes with at least one property. IDs of edges are not necessary, because edges are not explicitly considered without nodes. Considering RDBs, (relational) attributes Ai:Dj will be used as Si. Si are non-empty sets of values, Si ¹ Sj for i ¹ j. Then, relations are simply (Bool:S1,...,Sm)-objects, where Si are descriptive types. To simplify the relational modelling, we will not use domains explicitly. We also do not consider relational ICs like primary keys. They require explicit ICs. Obviously, they are used in the relational source, e.g., in CREATE statements. We will consider relations Actor(Name, Title, Role) Movie(Title, Released, Director, Genre), as attributes Actors/(Bool:Name, Title, Role), Movies/(Bool:Title, Released, Director, Genre). expressing associated RDB schema. Thus, there is a difference between Movie and Movies. Integration of Relational and Graph Databases Functionally 433

The Actor relation describes film actors by their names, movie titles, and a role performed by the actor in the movie. The Movie relation contains information about movies, i.e., their title, director, genre, and release date. Considering RDB in this formalism, no entity types are necessary. A convention Actor → Actors, Movie → Movies is used here. It may be unusual for someone to assume an existence of a GDB schema despite of the fact that GDBs in practice mostly do not require a schema. NoSQL GDB does not need its schema re-defined before adding new data. Here, the tuple type used for property modelling maybe restricts possibilities for inserting a new property. To model properties as sets of properties and not tuples would be more feasible. Then, NoSQL GDB could automatically re-define its schema before adding new data. A would have to be changed. On the other hand, the schema existence supports without doubts easier formulate a query.

4. Querying with typed lambda calculus

A manipulation language for functions is traditionally a typed lambda calculus. Our version of the typed lambda calculus directly supports manipulating objects typed by T introduced in Section 3.1. First we introduce a typed lambda calculus and use it as a tool for querying over a database described as a set of attributes in Section 4.1. A syntactic sugar for such query language and examples of queries are presented in Section 4.2. We will suppose a collection Func of constants, each having a fixed type, and denumerable many variables of each type. Then the language of (lambda) terms LT is defined as follows: Let types R, S, R1,…,Rn (n ≥ 1) are elements of T. (1) Every variable of type R is a term of type R. (2) Every constant of type R is a term of type R. (3) If M is a term of type (S:R1,…, Rn), and N1,…, Nn are terms of types R1,…, Rn, respectively, then M(N1,…, Nn) is a term of type S. /application/ (4) If x1,…, xn are different variables of the respective types R1,…, Rn and M is a term of type S, then λx1,…, xn(M) is a term of type (S:R1,… , Rn) /lambda abstraction/ (5) If N1,…,Nn are terms of types R1, . . . , Rn, respectively, then (N1,…, Nn) is a term of type (R1,…, Rn). /tuple/ (6) If M is a term of type (R1,…,Rn), then M[1],…,M[n] are terms of respective types R1,…, Rn. /components/

4.1 From LT to a functional query language

The language LT with applications of functions and lambda abstractions provides a powerful tool for querying graph data conceived as functions. A query is expressed by a term of LT. Application of this approach in graph environment is described in details in [18]. In accordance to the semantics of the quantifiers and the singularizer, we can write simply "x(M) instead of Π(λ x(M)). Similarly, $x(M) replaces Σ(λ x(M)). Finally, we write I(λ x(M)) shortly as onlyone x(M) and read „the only x such as M". Certainly, M/Bool. The LT language can be used as a theoretical tool for building a functional database language. Using this language as a query tool enables to define more precisely the notion of a functional 434 J. POKORNÝ database schema. It is a tuple (A1,...,An) of typed variables of LT called attribute identifiers. In the conventional approach a valuation δ is then used. Supposing this mapping, we can assign database objects to every variable. Typically, lambda abstractions serve as a tool for expressing queries. The query „Find titles of movies directed by Spielberg", e.g., can be expressed by the term λ t ($ m, r Movie(m)(t,’Spielberg’, r)) (1) The query "Give a set of couples associating to each user the number of movies rating by him/her" can be expressed by the term

λ u, n (n=COUNTMovie(λ m ($s Rates(u)(s , m))) ) (2) with u, m, n, and s of types User, Movie, Number, Stars, respectively. The answer to this query is a binary relation informing about the non-zero number of movies rated by users. Obviously, the query could be also reformulated as λ t (λ n (...)), or more conventionally λ u onlyone n (...). In this case, the answer will be a more general typed function. In practice, the answer for such query can be represented extensionally, e.g., by a (structured) table, similarly as in SQL. In applications often YES/NO queries are usable. The can be expressed by terms M/Bool (3) The functions from Func influence the expressive power of LT. Func can contain usual arithmetic and aggregate functions, etc. Of a special importance is R-identity =(Bool:S) usable for a comparison of two sets of S-objects. Consider User-objects John and Frank. Then with

λ u1 (Friends_of_friend(John)( u1)) =(Bool:User) (4) λ u2 (Friends_of_friend(Frank)( u2)) we can test whether John and Frank have the same sets of (direct) friends. Clearly, for real John and Frank we would have to use their names of type U_Name. Querying a typed RDB reminds a use of the classical domain (DRC) (e.g. [11]). A query in DRC has the form

{x1, x2, …, xn | P(x1, x2, …, xn)} where xi represent domain variables free in P, P is a logical formula. The answer for the query includes all tuples a1, a2, …, an which the formula P evaluates to true. A more complex example uses a universal quantifier and implication: λ n ("t ($ re, g Movies(t, re, ’Spielberg’, g) implies $ ro Actors(n, t, ro))) (5) expressing the query "Find the actors, who play in each film by director Spielberg." Regardless of expressive power, LT terms are not too user-friendly. We focus on this problem in Section 4.2.

4.2 A syntactic sugar for LT queries

Similarly to a practice with DRC, we can omit some existentially quantified variables and add explicit types of variables expressing type names occurring in attribute types. The Integration of Relational and Graph Databases Functionally 435 following versions of queries (1), (2), and (5), respectively, indicate how to (partially) solve this problem: {tTitle | exists mMovie Movie(m Movie)(tTitle,’Spielberg’Director)}

{uUser, nNumber | (n=COUNT(m| (Rates(uUser)(sStars, mMovie))) )

{nName | foreach tTitle (Movies(tTitle, ’Spielberg’Director) implies Actors(nName, tTitle))} The expression (2) can also look as

User Number Number User Movie λ u , n (n = COUNTMovie (λ m (Rates(u )(m ))) ) A further simplification could be made by omitting lambda abstraction supposing that information about objects considered is in the (Bool:User)-identity. Then we can write

Friends_of_friend(‘John‘) =(Bool:User) Friends_of_friend(‘Frank‘) In other words, an application is evaluated as the application of an associated function to given arguments, a lambda abstraction "constructs" a new function. Instead of the position notation, we can use more readable dot notation for components. For example, instead of Movie(‘Amadeus‘)[2], we can write Movie(‘Amadeus‘).Director. To consider in (2) only users born in 1948, the lambda expression might look like

User Number Number User Movie λ u , n (n =COUNTMovie (λ m (Rates(u )(m ) and User(uUser).Birth_year = ’1948’)) Moreover, we have used infix notation in arithmetical and logical formulae, as it is used in practice. The following term expresses a YES/NO query “Is John born in 1948?” User(John).’1948’Birth_year Remark: Clearly, the resulted lambda expressions can express more complex graph structures than queries (1)-(5). They can contribute to constructions of new graphs from the original GDB. For example, regular path queries are of a user importance in querying GDBs, Consider * expressions FOF (u1, uk) denoting the expression FOF(u1)(u2) and … and FOF(uk-1,)(uk), for a k (k>1). FOF is the abbreviation of Friends_of_friend. Then λ x, y FOF* (x, y) provides a set of couples (p1, pk), where there is a directed path from p1 to pk along edges of FOF. For more details, see [18]. Concerning some example from practice reminding this approach to querying, we could mention SPASQL — a hybrid implementation of SPARQL-within-SQL, enabling traditional SQL-based tools to operate on RDF Graphs without direct modification of those tools. This feature is a part of Virtuoso mentioned in Section 2. 436 J. POKORNÝ

5. Integration of relations and property graphs

There are more and more opportunities to access heterogeneous database systems. Particularly, we can integrate information which is from relational and graph-oriented data sources and deliver it to any system on Big Data platform. We focus in this section on multi- model approach to integration using a functional data model.

5.1 Multi-model approach

A multi-model approach reminds a more user-friendly solution of heterogeneous database integration as it was known in the context of federated heterogeneous relational databases in the past. Here, we suppose that different database models are behind participating databases. An alternative for data processing with relational and NoSQL data in one infrastructure uses common design methods based on a modification of the traditional 3-level ANSI/SPARC approach [10]. This multi-level modelling is most relevant for our multi-mode approach. It covers the following subapproaches: (a) Special abstract model. A methodology for NoSQL systems based on NoAM (NoSQL Abstract Model), a novel abstract data model for NoSQL databases, is presented in [2]. A designer starts with a UML class diagram, then he/she identifies so called aggregates (“chunks” of related data) and maps the aggregates into NoAM blocks. These blocks are transformed into constructs of a particular NoSQL data model. In fact, no integration is considered here. Oracle [14] uses the notion of unified query in this context. A single SQL statement can be executed seamlessly across more data stores: relational databases, Hadoop clusters, and NoSQL databases. (b) NoSQL-on-RDBMS. For example, storing and querying JSON data in an RDBMS belong to this category, see, e.g. Argo/SQL [3]. Argo/SQL is an SQL-like query language for collections of JSON objects. Or, SQL Server 2017 makes it possible to implement GDB within a relational table structure. (c) SQL-on-Hadoop. The Splice machine8 provides a hybrid storage architecture supporting in-memory relational database storage and disk-based storage from Hadoop. Splice Machine uses Spark in-memory computation to materialize the intermediate results of long-running queries but leverages the power of HBase to durably store and access data at scale. For examples of other similar architectures see [1]. (d) Ontology integration. Databases are described by several ontologies and a generated global ontology. Authors of [5] analyze a set of schema-less NoSQL databases to generate local ontologies and then generate a global ontology. Global SPARQL queries are then transformed into query languages of the sources. We are closed to the (a) subapproach, called the hybrid approach in [26] and [7]. In this case, the query is executed on two data sources, relational and NoSQL, but the additional layer is used to enable data integration. The former paper also contains a literature review of works done by number of researchers in the field of hybrid databases using graph databases and relational databases. The latter describes a hybrid solution explained on the example of integration of transactional data from with data stored in MongoDB.

8 https://www.splicemachine.com/ (retrieved on 10. 8. 2019) Integration of Relational and Graph Databases Functionally 437

A generalizable SQL query interface for both relational and NoSQL systems called Unity is described in [12]. Unity was just used for testing the hybrid solution described in [7]. Unity is an integration and virtualization system as it allows SQL queries that span multiple sources and uses its internal query engine to perform joins across sources. Unity architecture consists of an SQL query parser that converts a SQL query into a parse and validates the query. Then the query is transformed into portions to execute on each NoSQL data source via source-specific . This allows to query and join data from both NoSQL and SQL systems in a single SQL query.

5.2 Hybrid approach through a functional data model

A functional approaching GDB and RDB is introduced in [19], where the starting model is the functional data model. As the functional modelling provides rather hybrid between a conceptual and database schema, the (d) subapproach from Section 5.1 is also relevant. Database schema of both databases are described by sets of attributes, i.e. rather as local conceptual schemas, a global schema is obtained by union of these local schemas. LT queries sent to the integrated system are translated into queries compatible with the RDBMS (e.g., SQL) and GDBMS (e.g., Cypher language of Neo4j), respectively. For example, the query (1) can be translated into Cypher as follows: MATCH (m:Movie) WHERE m.director = "Spielberg" RETURN m.title The query (2) can have the following expression in Cypher: MATCH (u:User)-[:Rates]->(m:Movie) RETURN u, count(m) AS n DRC-like queries are translated to SQL in a more straightforward way. As an example we take the query (5). Clearly, first it is necessary to remove the universal quantifier by the following transformations: λ n (¬$t¬ ($ re, g Movie(t, re, ’Spielberg’, g) implies $ ro Actor(n, t, ro)))

λ n (¬$t ($ re, g Movie(t, re, ’Spielberg’, g) and ¬$ ro Actor(n, t, ro))), or

{nName ênot exists tTitle (Movie(tTitle, ’Spielberg’ Director) and not exists roRole Actor(nName, tTitle, roRole)), in a more user-friendly notation. An equivalent in SQL looks as follows: SELECT Name FROM Actor WHERE NOT EXISTS (SELECT Title FROM Movie WHERE Director = ’Spielberg’ and NOT EXISTS (SELECT Role FROM Actor A WHERE A.Name = Name)); 438 J. POKORNÝ

This kind of query answering is referenced as on-demand since the information stays at the sources and is retrieved from queries expressed over the target schema. This requires a translation since the query facilities available at both the target and the source are different. A generic architecture of such infrastructure inspired by the Unity architecture is depicted in Figure 2. On the data level, associated databases are generally heterogeneous. Elementary type Title of movies has not to be the same as the domain(Title) from the Movies relation. Only their non-empty intersection should be supposed. Consider the integrated database containing data associated with GDB and relational schemas from Section 3.2. Then the term λ uUser, gGenre, nNumber (6) Number Movie User Movie (n =COUNTMovie (λ m (Rates(u )(m ) and $ tTitle sTitle Movie(mMovie).tTitle = sTitle and Movies(sTitle, gGenre)) ) ) expresses the query, “Find for each user and genre the number of reviews done by him/her in the genre”, i.e., both graph and relational database has to be used.

LT query extensional result

JDBC API

LT parser

parse tree virtualization

translator system

tree of operations

optimizer

execution plan

Virtualizuation Execution Engine

JDBC Driver JDBC Driver

GDB API

GDP SQL Source Source

Figure 2. Generic architecture for integration of graph and SQL databases Integration of Relational and Graph Databases Functionally 439

6. Conclusions

Key issues for building Big Data processing infrastructure are in decisions concerning NoSQL databases, i.e., choosing the right (correct) product and designing a suitable database architecture for a given application class in the case when it is necessary to integrate a RDB and data stored and processed in a NoSQL database. In the paper, we have focused on GDBs and RDBs. We have seen that a multi-level approach based on a sound formal apparatus can be used. The apparatus contains a functional typing system serving for specification of so called attributes – typed functions reflecting structure of directed graphs and relations in the 1NF. As a manipulation language can be used the language LT. This tool creates a formal background for a more user-friendly variant for querying an integrated database containing graphs and relations, i.e., for a single query it possible to process data from multiple databases in common functional data model environment. This approach overcomes today’s approach based on ad hoc SQL extensions. Moreover, it offers to extend the repertoire of constant functions representing, e.g., usual embedded functions in query languages. In general, such approach requires an increased responsibility of developers in the optimization of database processes, which means that additional programming and time are necessary for completing the integration of data. Current challenges for database research of such infrastructure include: • finding an appropriate and successfully powerful subset of LT covering query requirements of a GDB integrated with an RDB, • developing a meaningful and usable user-friendly version of a query language based on LT, • developing a prototype using an SQL engine and Neo4j GDBMS for source databases.

Acknowledgments

This work was supported by the Charles University project Q48.

References

[1] Abadi, D., Babu, Sh., Ozcan, F., Pandis, I., Tutorial: SQL-on-Hadoop Systems, in Proc. of VLDB Endowment, Vol. 8, No. 12, 2015. [2] Bugiotti, F., Cabibbo, L., Atzeni, P., & Torlone, R., Database Design for NoSQL Systems, in: Proc. of ER Conf., LNCS 8824, Springer, 2014, 223-231. [3] Chasseur, C., Li, Y., Patel, J.M., Enabling JSON Document Stores in Relational Systems, in: 16th Int. Workshop on the Web and Databases (WebDB 2013), 2013, 1-6. [4] Curé, O., Hecht, R., Duc, Ch. L., Lamole, M., Data Integration over NoSQL Stores Using Access Path Based Mappings, in: Proc. of DEXA 2011, Part I, LNCS 6860, Springer, 2011, 481–495. 440 J. POKORNÝ

[5] Curé, O., Lamole, M., Duc, Ch. L., Ontology Based Data Integration over Document and Column Family Oriented NOSQL. CoRR, arXiv:1307.2603, 2013. [6] Diestel, R.: Graph Theory, Springer GTM 173, 5th ed., 2016. [7] Gašpar, D., Mabić, M., Krtalić, T., Integrating Two Worlds: Relational and NoSQL, in: Proc. of 28th CECIIS Conf., 2017, 11-18. [8] Gašpar D., Coric I., Bridging Relational and NoSQL Databases, IGI Global, 2017. [9] Gray P.M.D., Kerschberg L., King P.J.H., Poulovassilje A. (Eds.) The Functional Approach to Data Management, Modeling, Analyzing and Integrating Heterogeneous Data, Springer, Berlin, 2004. [10] Herrero, V., Abelló, A., Romero, O., NOSQL Design for Analytical Workloads: Variability Matters, in: Proc. of ER Conf., LNCS 9974, Springer, 2016, 50-64. [11] Lacroix, M., Pirotte, A., Domain-Oriented Relational Languages, in: Proc. of VLDB 1977, 1977, 370-378. [12] Lawrence, L., Integration and Virtualization of Relational SQL and NoSQL Systems Including MySQL and MongoDB, in: Proc. CSCI Int. Conf. on Computational Science and Computational Intelligence / Volume 0, IEEE, 2014, 285-290. [13] Meijer, E., Bierman, G.M., A co- of data for large shared data banks. Commun. ACM 54(4), 2011, 49-58. [14] Oracle, Unified Query for Big Data Management Systems Integrating Big Data Systems with Enterprise Data Warehouses. Oracle White Paper, 2016. [15] Pokorný J., A function: unifying mechanism for entity-oriented database models, in: C. Batini, (Ed.), Entity-Relationship Approach, Elsevier Science Publishers B.V., North- Holland, 1989, 165-181. [16] Pokorný J., Database semantics in heterogeneous environment, in: K.G. Jeffery, J. Král, M. Bartošek (Eds.), in: Proc. of 23rd Seminar SOFSEM’96: Theory and Practice of Informatics, Springer-Verlag, 1996, 125-142. [17] Pokorný, J., Conceptual and Database Modelling of Graph Databases, in: Proc. of IDEAS’ 16, B. Desai (Ed.), ACM, 2016, 370-377. [18] Pokorný J., Functional Querying in Graph Databases. Vietnam Journal of Computer Science, Springer, 5, 2, 2017, 95-105. DOI 10.1007/s40595-017-0104-6 [19] Pokorný J., Integration of Relational and NoSQL Databases, in: N. T. Nguyen, et al (eds), ACIIDS (2), LNCS, vol. 10752, Springer, 2018, 35-45. [20] Pokorný, J., Integration of Relational and Graph Databases Functionally. CoRR abs/1809.03822, 2018. [21] Ramachandran, S., Graph Database Theory - Comparing Graph and Relational Data Models. LambdaZen, 2015. [22] Robinson, I., Webber J., Eifrém E., Graph Databases, O’Reilly Media, 2013. Integration of Relational and Graph Databases Functionally 441

[23] Roy-Hubara, N., Rokach, L., Shapira, B., Shoval, P., Modeling Graph Database Schema. IT Professional 19(6): 34-43, 2017. [24] Shipman D.W., The functional data model and the data languages DAPLEX. ACM Transactions on Database Systems (TODS), 6, 1, 1981, 140-173. [25] Tivari, S.: Professional NoSQL, Wiley/Wrox, 2015. [26] Vyawahare H.R., Karde P.P., Thakare V.M., A Hybrid Database Approach Using Graph and Relational Database, in: Proc. 2018 IEEE International Conference on Research in Intelligent and in Engineering (RICE), IEEE, 1-4. [27] Zhao, G., Lin, Q., Li, L., Li, Z., Schema Conversion Model of SQL Database to NoSQL, in: Proc. of the 9th Int. Conf. on P2P, Parallel, Grid, Cloud and Internet Computing, IEEE, 2014, 355-362. This paper is a revised and extended version of the work originally presented at workshop 'Semantics in Big Data Management', held within IFIP World Computer Congress in Poznań, Poland on 18th September 2018

Received 6.02.2019, Accepted 10.09.2019