Deductive Graph Database – Datalog in Action
2015 International Conference on Computational Science and Computational Intelligence
978-1-4673-9795-7/15 $31.00 © 2015 IEEE, DOI 10.1109/CSCI.2015.60

Kornelije Rabuzin
Faculty of Organization and Informatics, University of Zagreb
Varazdin, Croatia
[email protected]

Abstract: In recent years many NoSQL systems, including graph databases, have become available. For large amounts of interconnected data, these systems represent a good choice. Deductive databases have been used to deduce new pieces of information from a database that contains large amounts of data. It is important to keep in mind, however, that such databases were mostly relational, i.e., relations were used to store the data to which the deductive mechanisms were applied. In this paper, deductive graph databases are proposed. In a deductive graph database, data are stored in a graph database, and Datalog is used for reasoning on a relational representation of that graph database.

Keywords: databases, SQL, NoSQL, graph databases, Datalog

I. INTRODUCTION

The relational data model has been widely used over the past 40 years. Many databases were implemented in order to store large amounts of important data. Dr. Codd's vision of storing data in relations turned out to be crucial and, because of its ideal properties, the relational data model has survived. Although the term "relation" is used in the theory, people who use databases on a daily basis usually say that a database is, in fact, a set of tables. In order to implement a database, a database management system (DBMS) is required. The rich query interface (Structured Query Language, SQL) that a DBMS supports can be used to work with databases. The ability to store and efficiently manage large amounts of data has turned database management systems into significant parts of many applications and information systems developed over time.

SQL is a standardized language that is used to work with databases. Relational database management system vendors support SQL, which is declarative and typically not complex. However, queries sometimes do get quite complex. Furthermore, different database management systems do not support all statements in the same form, and some differences may exist. For more information on SQL, see [8] and [9].

In recent years, however, the NoSQL movement has become popular. The relational data model is starting to reveal its weaknesses and, for some problems, new solutions have to be found. The amounts of data that some applications have to store today strain the capabilities of relational databases. Furthermore, a fixed database schema is not always an option. Because of this, many NoSQL systems have been developed, and, generally speaking, we distinguish:

• Document oriented databases
• Column oriented databases
• Key value databases
• Graph databases

Graph databases store information in nodes and relationships. Nodes do not all have to contain the same number of properties (attributes), and the same applies to relationships between the nodes. For large amounts of interconnected data, graph databases represent a good choice. They are especially suitable for social network analysis. In the next section a small graph database is implemented (the Neo4j system is used) and graph databases are discussed in more detail. For additional information on graph databases see [6]. In this paper, we primarily discuss graph databases (other types are not discussed).

A deductive database uses rules to produce new pieces of knowledge based on facts, which are stored in the database. The following definition can be found in [11]: "A deductive DB D is a triple D = (F, DR, IC), where F is a finite set of ground facts, DR a finite set of deductive rules, and IC a finite set of integrity constraints. The set F of facts is called the extensional part of the DB (EDB), and the sets DR and IC together form the so-called intensional part (IDB)". A small example is borrowed from [11]:

Facts
    Father(John, Tony)
    Mother(Mary, Bob)
    Father(Peter, Mary)

Deductive Rules
    Parent(x,y) ← Father(x,y)
    Parent(x,y) ← Mother(x,y)
    GrandMother(x,y) ← Mother(x,z) ∧ Parent(z,y)
    Ancestor(x,y) ← Parent(x,y)
    Ancestor(x,y) ← Parent(x,z) ∧ Ancestor(z,y)
    Nondirect-anc(x,y) ← Ancestor(x,y) ∧ ¬Parent(x,y)

Integrity Constraints
    IC1(x) ← Parent(x,x)
    IC2(x) ← Father(x,y) ∧ Mother(x,z)

Thus, there are three facts and several rules that define different relationships (parent, grandmother, etc.). For instance, Father(Peter, Mary) and Mother(Mary, Bob) yield Parent(Peter, Mary) and Parent(Mary, Bob), so Ancestor(Peter, Bob) can be deduced although it is not stored as a fact. The two integrity constraints prevent someone from being their own parent and prevent a person from being both a father and a mother at the same time. For more information on deductive databases see [11] or [14]. Some things that first appeared in deductive databases are used in SQL today, for example, recursive queries.

Some books that cover databases in general are [1], [2], [3], [5] and [12]; other books are available as well.

In the next section of this paper, graph databases are defined. Then the Deductive Graph Database is presented and a few Datalog queries are written. Afterward, the discussion is given and the conclusion is presented.

II. GRAPH DATABASES

Unlike relational databases, which store data in tables, graph databases use nodes and relationships between nodes. Storing data in such a way has certain benefits, and it is more natural for large amounts of interconnected data (for example, social network analysis). Thus, one cannot say that graph databases are always better or that relational databases are always better; it depends on one's needs.

Nodes and relationships have properties. Below, we define a small graph database (the Neo4j system is used for implementation purposes):

    CREATE (p1:Course {name: "Mathematics", ects: 7}),
           (p2:Course {name: "Informatics"}),
           (p3:Course {name: "Programming I", ects: 7}),
           (p4:Course {name: "Databases I"}),
           (p5:Course {name: "Databases II", ects: 6}),
           (p6:Course {name: "Data warehouses I", ects: 5}),
           (p1)-[:PREREQ]->(p3),
           (p1)-[:PREREQ]->(p4),
           (p2)-[:PREREQ]->(p4),
           (p4)-[:PREREQ]->(p5),
           (p5)-[:PREREQ]->(p6)

The visual interpretation of the database defined above is shown in Figure 1.

Figure 1. Graph database

This example stores data about courses and their prerequisites. To list courses and their direct (first-level) prerequisites, it is enough to read the graph database. Cypher is the language used to read data from the database. The MATCH clause starts the query; it finds two courses that are connected by a relationship called PREREQ, and their names are then returned in the result:

    MATCH (n:Course)-[:PREREQ]->(m:Course)
    RETURN n.name, m.name

    n.name          m.name
    Mathematics     Programming I
    Mathematics     Databases I
    Informatics     Databases I
    Databases I     Databases II
    Databases II    Data warehouses I

For the second level (as well as for any other level), it is again enough to read the graph database. Now we are looking for three nodes and two relationships between them:

    MATCH (n:Course)-[:PREREQ]->(m:Course)-[:PREREQ]->(o:Course)
    RETURN n.name, o.name

    n.name          o.name
    Mathematics     Databases II
    Informatics     Databases II
    Databases I     Data warehouses I

Now we are looking for four nodes and three relationships between them:

    MATCH (n:Course)-[:PREREQ]->(m:Course)-[:PREREQ]->(o:Course)-[:PREREQ]->(p:Course)
    RETURN n.name, p.name

    n.name          p.name
    Mathematics     Data warehouses I
    Informatics     Data warehouses I

The first problem is that no recursion is supported in Neo4j, so we have the same problem as in SQL before recursion was added to the SQL standard. In order to merge all the queries listed above and to find all courses and their prerequisites, we should use the UNION clause (the result is obvious):

    MATCH (n:Course)-[:PREREQ]->(m:Course)
    RETURN n.name AS c, m.name AS p
    UNION
    MATCH (n:Course)-[:PREREQ]->(m:Course)-[:PREREQ]->(o:Course)
    RETURN n.name AS c, o.name AS p
    UNION
    MATCH (n:Course)-[:PREREQ]->(m:Course)-[:PREREQ]->(o:Course)-[:PREREQ]->(p:Course)
    RETURN n.name AS c, p.name AS p

Recursive queries are supported in Datalog (deductive databases), however, and we could use them on a graph database in order to find courses and their prerequisites more easily. Deductive databases also provide views. One can define a view; in this way, queries may be posed much more easily. Furthermore, hypothetical queries are supported in Datalog as well. Because we see that Cypher has certain problems while

    course(mathematics, 7).
    course(informatics, null).
    course('programming I', 7).
    course('databases I', null).
    course('databases II', 6).
    course('data warehouses I', 5).

However, one has to keep in mind that nodes do not have to have the same properties. Here, we see that not all courses have a number of ECTS points and, because of that, null is used. Relationships are stored as facts as well:

    prereq(mathematics, 'programming I').
    prereq(mathematics, 'databases I').
    prereq(informatics, 'databases I').
    prereq('databases I', 'databases II').
    prereq('databases II', 'data warehouses I').

Regarding the translation, one may ask why the translation should be performed at all. The answer may be surprising: graph databases use different methods to store data [6]:

"Some graph databases use native graph storage that is optimized and designed for storing and managing graphs. Not all graph database technologies use native graph storage however. Some serialize the graph data into a relational database, an object-oriented database, or some other general-purpose data store."

Thus, we see that the underlying storage model can rely on relational databases as well as on some other types, and this underpins the idea of translating the graph database into a set of facts that can be stored in tables in order to use Datalog on such a database.
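
The translation into facts and the recursive prerequisite query discussed above can be sketched in Python. This is a minimal sketch under stated assumptions, not the paper's implementation: the names COURSES, PREREQ and all_prereqs are hypothetical, None stands in for null, and the recursive rule is evaluated with a naive bottom-up fixpoint loop instead of a Datalog engine.

```python
# Hypothetical encoding of the course graph as Datalog-style facts.
# The names COURSES, PREREQ and all_prereqs are illustrative only.

# course(Name, Ects): None plays the role of null for a missing property.
COURSES = {
    ("mathematics", 7),
    ("informatics", None),
    ("programming I", 7),
    ("databases I", None),
    ("databases II", 6),
    ("data warehouses I", 5),
}

# prereq(X, Y): course X is a prerequisite of course Y.
PREREQ = {
    ("mathematics", "programming I"),
    ("mathematics", "databases I"),
    ("informatics", "databases I"),
    ("databases I", "databases II"),
    ("databases II", "data warehouses I"),
}


def all_prereqs(prereq):
    """Naive bottom-up evaluation of the recursive rules

        all_prereq(X, Y) <- prereq(X, Y)
        all_prereq(X, Y) <- prereq(X, Z), all_prereq(Z, Y)

    repeated until no new fact can be derived (a fixpoint).
    """
    derived = set(prereq)  # facts produced by the non-recursive rule
    while True:
        # Join prereq(X, Z) with the already derived all_prereq(Z, Y).
        new_facts = {
            (x, y)
            for (x, z) in prereq
            for (z2, y) in derived
            if z == z2
        } - derived
        if not new_facts:
            return derived
        derived |= new_facts


if __name__ == "__main__":
    for course, successor in sorted(all_prereqs(PREREQ)):
        print(course, "->", successor)
```

On the example graph the loop derives ten pairs, the same rows that the three Cypher queries combined with UNION return (five, three and two), without fixing the path length in advance.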