(JPMNT) Journal of Process Management – New Technologies, International Vol. 5, No 2, 2017.

RELATIONAL AND GRAPH DATABASE: A COMPARATIVE ANALYSIS

Surajit Medhi 1 , Hemanta K. Baruah 2 1Department of Computer Science, Gauhati University, Assam, India 2Bodoland University, Assam, India [email protected] ; [email protected]

Original Scientific Paper doi:10.5937/jouproman5-13553

Abstract: Since 1970, relational be represented and visualized using the database models have been in use for storing, graph database [2] manipulating and retrieving data. The importance of relational is going to decrease due to 1.1 the exponential growth of data as it is difficult to work with large number of joining tables. For such Relational database is a collection of kinds of problems, one of the best solutions is to use graph database for storing data. The graph data which are stored in a tabular form the database can be used to store highly connected organization of which is based on the data. In this article, we are going to put forward a proposed by E. F. Codd in comparison between relational database and graph 1970. In Relational database, each database with reference to an experiment contain rows and columns, rows performed. representing records and columns Keywords and Phrases : Graph, Neo4j, MySQL, representing attributes. Let us suppose that we want to maintain a database that contains information of trains of a railway system 1. INTRODUCTION in a particular year. The information about In recent years, the importance of trains, pilots, and assignments of pilots to storing and analysing data in the form of trains are maintained by this database. graph has been increasing. Graph database Every train has a unique number, a source is now used in social networks, city, a destination city, an arrival time and recommendation systems, biological a departure time. Each pilot has a unique network, web graph etc. and these graphs employee identity, a name, an address, are highly interconnected. In relational experience in years, etc. Pilots are assigned database, data are stored in tabular form. to drive certain trains on particular days of For less connected or static data, relational the year. First, from the description of the database is perfect, but for highly database, we would identify the data connected data such as the network of objects and the relationships. Data objects hyperlinks in the World Wide Web, it is are train and pilot. Relationship is assigned complex for representing in relational in the sense that different pilots can be database [1]. A similar kind of problem assigned to different trains. After that we arises when we want to model the social proceed to identify the attributes and the networks like , etc. It has primary key for a particular data object. been accepted that the web data can also

1 www.japmnt.com (JPMNT) Journal of Process Management – New Technologies, International Vol. 5, No 2, 2017.

So, the attributes for trains are train which allows storing the particular objects number, source city, destination city, with between them. A graph can arrival time and departure time. Train be expressed as G = (V, E) where V is the number is the primary key. The attributes set of vertices (nodes) and E is the set of for pilots are employee identity, name, edges (relations). The data structure such address, experience year. Employee as , , identity is the primary key. For the and linked list are used to relationship “assigned to”, the attributes store the graph. The Resource Description are the primary key of train, pilot and date. Framework (RDF) is a popular representation for graph data, and it is used Now we can use relational tables to as a model by some graph databases even represent the data object and the though the primary objective of RDF was relationship with their attributes. The train not really representation of graphs [4]. table contains the above mentioned attributes for trains, the pilot table contains In recent years, some graph databases the above mentioned attributes for pilots such as DEX, Infinite Graph, Neo4j, and there is one more table for the Orient DB, Titan etc. have been relationship assigned to attributes like train developed, which are currently used in number, employee identity and date. operations for storing data over a traditional relational model. A database system Big Table has been created and 1.2 Graph Database used by [5].

Before going to discuss different types of graph databases, we have to know the 1.3 Neo4j NOSQL. NOSQL is a family of database and graph database is a part of NOSQL Neo4j is an open source graph database. NOSQL databases are optimized database which is implemented in Java and for handling very large data in which developed by Neo technology [11]. It uses elements are not closely related, and graph structure with nodes, edges and therefore expensive joins are not needed. properties to store data. It provides index In NOSQL large volumes of data are free adjacency, which means that each stored efficiently. There are different types node is a pointer. It also supports the of NOSQL databases [3]. These are: key labelled property graph model. A labelled value databases, family stores property graph model has a number of databases, document oriented databases characteristics. It contains nodes and and graph databases. In this article, we are relationships where properties and nodes going to focus on graph database. Every can be labelled with more than one label. graph database is a database which uses In this model, if we want to model a graph structures with nodes, edges and relationship among friends then we have to properties to represent and to store data. A create different nodes for friends and graph model is used in graph database connect the friends with edges.

2 www.japmnt.com (JPMNT) Journal of Process Management – New Technologies, International Vol. 5, No 2, 2017.

Fig. 1: Labelled Property Graph Model

In the graph model [11] shown above, the huge static graphs and has used there are three nodes and five relationships geometry based technique to overcome and a label. User is the label. Ruth, Billy this problem. Hong et. al. [9] used the and Harry are the nodes and the property is graph database for subgraph matching with name for the nodes and ‘follows’ is the set similarity. Sakr and Al -Naymat [10] relationshi p. Here, Ruth follows Billy, have proposed efficient relational which means that there is a connection techniques f or processing graph queries. between these two nodes using a directed They have used two kinds of relational edge, and the edge has a property called mapping schemes for storing graph ‘follows”. In Neo4j, the edges are database such as Vertex -Edge mapping bidirectional as we can see in the model in and Edge-Edge mapping. In Vertex -Edge Fig. 1. Neo4j uses the cyph er query mapping, they have used two tables: language for creating nodes, relationships Vertices (graph ID, Vertex ID, Vertex with properties and also for retrieving, Label ) and Edges (graph ID, s -vertex, d- searching and manipulating data. vertex and edge label) In Edge -Edge mapping they have used only one table for 2. Related Work representing the graph. The table contains Many researchers are working on Edge (GID, e-ID, e-label, s -VID, sv-Label, graph database in different fields such as Dvid, dv-Label). computer science, chemistry, bio logy, 3. Relational Databases vs. Graph electronics etc. Balaur et. al. [6] have Databases applied graph database technologies for managing comprehensive genome scale Relational database s were originally networks. In their work, they have designed in tabular structure. Modelling visualized an metabolic network relationships are not quite suitable in based on the cypher query language . Batra relational databases. It is so because they [7] also compared the relational database use foreign keys to relate one information and graph database based on some with another. In relational databases, tables evaluation parameters such as level of are needed to be joined using foreign keys support or maturity, security and to connect information from different flexibility. Ching-Yung Lin [8] has tables. If we want to many tables for worked on graph and linked big querying, it may not be optimal to deal data, and has faced challenges to visualize with.

3 www.japmnt.com (JPMNT) Journal of Process Management – New Technologies, International Vol. 5, No 2, 2017.

Table 1 Table 2

Employees Relationships

Emp_id Emp_name Emp_id Relation_id

E1 Raj E1 R3

E2 Hirak E1 R4

E3 Bimal E2 R3

E4 John E3 R2

E3 R4

E4 R3

Here, employees and relationships are relational databases. But in our example, shown in two tables and the relationship however the relations are undirected, so table represents the relation between that emp_id ‘E3’ and relation_id’ R4’ are employees. An id has been assigned to the same as emp_id ‘E4’ and relation_id every employee in Table 1. An emp_id in “R3”. the employees’ table is referenced as either emp_id or a relation_id; both the emp_id Now, suppose we want to make and relation_id columns in relationship simple queries such as to find out what table refer ids in the Employees’ table. For Bimal’s relationships are in the relational example, Bimal has emp_id ‘E3’ is a model. First, the query looks at the relation to Raj and Hirak who has emp_id employees’ table, finds which emp_id is ‘E1’ and ‘E2’ respectively. assigned to Bimal, takes that emp_id to the relationships table, finds all relation_ids In the employees’ table, emp_id is the that are assigned to Bimal’s emp_id, and primary key and both emp_id and then those relation_ids are returned back to relation_id are used as composite keys in the employees’ table to find the emp_name the relationship table. A composite key associated to those relation_ids. This type needs both of the id’s to be a unique of a query is still possible with relational combination of values. For example, model, but it is computationally expensive. emp_id ‘E1’, relation_id ‘R3’ and emp_id ‘E1’, relation_id ‘R4’ are not same In contrast, for the same type of a composite value. If directed relationships model, if we use a graph database such as are needed to be modelled, the composite Neo4j then Neo4j gives us better result to value such as emp_id ‘E3’ and realtion_id model the data for above relational model. ‘R4’ are completely different from the In Neo4j, the Model is the following: composite value than emp_id ‘E4’ in the

4 www.japmnt.com (JPMNT) Journal of Process Management – New Technologies, International Vol. 5, No 2, 2017.

Fig. 2: Graph Model for Employee and Relationship

In Neo4j, we can easily design the relational database Mysql version-5.1.0 model and also retrieve the information and Neo4j version-2.0.3. For Mysql PHP from the graph model which is scripting languages are used to query the computationally less expensive. Suppose, database and for Neo4j cypher query we want to make a query: “What are language is used to query the graph Bimal’s relationships?” We can easily find database. the result out because Bimal is connected to Raj and Hirak. So, if we compare the The scheme for the Relational same query in relational model, then there Database includes the following tables: would be a lot of joins and it would also be a) Cricketers computationally expensive. b) Teams and 4. Implementation and Experimental ) Games Results The Cricketers’ Table contains the We have performed an experiment attributes cric_id and cric_name. Teams’ with a simple dataset of the game of Table contains the attributes team_id, cricket. The dataset contains the players’ team_name and cric_id. Games’ Table information, team and which player has contains the attributes cric_id and played Test Cricket or One Day game_name. International (ODI) matches or Twenty- Twenty (T20) matches. We would The scheme for the Graph Database implement the cricket players’ dataset in includes the following nodes and relationship:

5 www.japmnt.com (JPMNT) Journal of Process Management – New Technologies, International Vol. 5, No 2, 2017.

Fig. 3: Representation of Cricket Game Dataset for Relationship, Player and Team

In Fig. 3, for representing the Cricket Dravid, Virat etc. and also set properties game dataset, we have created different such as cric_id and cric_name. For team nodes for players, teams, game type and label, we have created nodes such as India, relationships like “belongs_to” and Australia, Pakistan etc. In this model, we “plays”. For example, we have created first have created the relationship between the a label “Cricketer” and then different Cricketer and the Team. nodes for that label such as Sachin,

Fig. 4: Representation of the Cricket Game Dataset for Relationship, Cricketer and Game Type.

6 www.japmnt.com (JPMNT) Journal of Process Management – New Technologies, International Vol. 5, No 2, 2017.

In Fig. 4, for representing the cricket Query 3. Find the Cricket players who game dataset with the relation between belong to Team India. crickete and game, we already have created different nodes for players, and we In Table - 3 we have shown the query have added different nodes for game label results in milliseconds (ms) for Mysql and such as Test, ODI and T20. We have set Neo4j. We have mentioned the number of properties such as cric_id and game_name objects for the queries in Mysql and Neo4j also. In this model, we have created the and also have given the time taken for relationship between the cricketer and different queries in Mysql and Neo4j. We game. The objective test of evaluation have calculated the time for executing the including processing speed between Mysql above mentioned queries for 100, 300 and and Neo4j is based upon a set of queries 400 objects. We have drawn the column for the above schema. The Queries are as graph for the execution ti me for three follows: queries in Mysql and Neo4j. It was found that graph databases are faster than Query 1. Find the names of teams. relational database with respect to time taken for executing the complex queries. Query 2. Find the names of the cricketers Further, we can design the complex who have played both ODI and Test relationship model easily. Cricket.

Table - 3: Execution Time for Different Queries for Different Objects

No. of Query1 Query1 Query2 Query2 Query3 Query3 objects (Mysql) (Neo4j) (Mysql) (Neo4j) (Mysql) (Neo4j)

100 12.56 ms 5ms 18.52ms 7.32ms 15.75ms 6 ms

300 153ms 7ms 212.53ms 13ms 180.24ms 9.32ms

400 165.43ms 8.32ms 387.34ms 15.67ms 302.44ms 14.32ms

20 18 16 14 12 10 mysql 8 neo4j 6 4 2 0 query1 query2 query3

Fig. 5: Execution Time for 100 Objects in Column Graph (in Mysql and Neo4j)

7 www.japmnt.com (JPMNT) Journal of Process Management – New Technologies, International Vol. 5, No 2, 2017.

250

200

150 mysql

100 neo4j

50

0 query1 query2 query3

Fig. 6: Execution Time for 300 Objects in Column Graph (in Mysql and Neo4j)

400 350 300 250 200 mysql 150 neo4j 100 50 0 query1 query2 query3

Fig. 7: Execution Time for 400 Objects in Column Graph (in Mysql and Neo4j)

5. Conclusions REFERENCES Both relational database and graph database are useful to represent data. [1] Database Trends and Applications. Available http://www.dbta.com/Articles/Columns/Notes -on- However, if we compare with reference to NoSQL/Graph-Dat abases -and-the-Value-They- storing and retrieving highly connected Provide-74544.aspx,2012 data, graph databases appear to give better results. This means that we can more [2] Surajit Medhi, Visualization of Graph Models easily design and retrieve the results of for Web Document in Neo4j, International Journal of Energy, Information and Communications, queries if we use graph databases instead Vol.7, Issue 6, 2016 of using relational databases. When we want to add a new relationship to graph [3] Gabriele Pozzani, Introduction to NoSQL, database, we do not need to restructure the profs.sci.univr.it/~pozzani/attachments/%20 - %2001%20-%20introduction.pdf, April 24, 2013 database again. As we can see from our experiment, we get bett er results for [4] Jonathan Hayes , A Graph Model for R DF executing the predefined queries in graph (Diploma thesis), Technische Universit¨at databases than in relational database with Darmstadt Universidad de Chile, 2004. respect to time. The retrieval times for [5] R. Angles and C. Gutierrez, Survey of Graph quires give us a conclusion that graph Database Models, ACM Comput. Surv . 40(1):1–39, database is suitable to use in commercial 2008. purposes such as developing social [6] Irina Balaur, Alexander Mazein, Mansoor Saqi, network, stock market, recommendation Artem Lysenko, Christopher J. Rawlings and engine and network management. Charles Auffray, Recon2neo4j : Applying Graph Data Based Technology for Managing

8 www.japmnt.com (JPMNT) Journal of Process Management – New Technologies, International Vol. 5, No 2, 2017.

Comprehensive Genome – Scale Networks , [9] Liang Hong, Lei Zou, Xiang Lian, Philip S. Yu, Bioinformatics, 2017, 1–3. Subgraph Matching with Set Similarity in a Large . Graph Database, IEEE Transactions on Knowledge [7] Shalini Batra and Charu Tyagi, Comparative and Data Engineering , 1041-4347, 2015. Analysis of Relational and Graph Databases, International Journal of Soft Computing and [10] Sherif Sakr and Ghazi Al-Naymat, Efficient Engineering, Volume-2, Issue-2, 2012 Relational Techniques for Processing Graph Queries, Journal of Computer Science and Technology, 25(6): 1237 – 1255, 2010.

[8] Ching-Yung Lin, Graph Computing and Linked [11] www.neo4j.com , Keynote Speech @ International Conference on Semantic Computing , 2014.

9 www.japmnt.com