Execution Time Analysis of Electrical Network Tracing in Relational and Graph Databases
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

FELIX DE SILVA
Master in Computer Science
Date: March 16, 2019
Supervisor: Mika Cohen
Examiner: Mads Dam
KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science

Abstract

In today’s society, we handle a large amount of connected data. Companies such as Facebook and Amazon handle connected data in different ways. Geographic Information Systems and Network Information Systems handle connected data in the form of networks or graphs that can represent anything from an electrical network to a product network.

When it comes to connected data, the most commonly used database technology is the relational database. However, with many new databases emerging, there may be better alternatives for connected data that can provide higher performance.

In this study we look at the Oracle relational database and the Neo4j graph database and study how both databases traverse an electrical network. The findings indicate that the Neo4j graph database outperforms the Oracle relational database regarding execution time of search queries.

Sammanfattning

In today’s society we handle a great deal of connected data. Examples are companies such as Facebook and Amazon, which handle connected data in different ways. Geographic information systems and network information systems handle connected data in the form of networks or graphs that can represent anything from an electrical network to a product network.

When it comes to connected data, the most widely used technology is the relational database. However, with many new databases emerging, there may now be better alternatives for connected data that can provide higher performance.

In this study we look at the Oracle relational database and the Neo4j graph database and examine how both databases traverse an electrical network. The results presented show that the Neo4j graph database performs graph traversal faster than the Oracle relational database, with the focus on execution time.

Contents

1 Introduction
  1.1 Problem Background
  1.2 Research Question
  1.3 Objective
  1.4 Purpose
  1.5 Scope
  1.6 Terminology
2 Theoretical Background
  2.1 Relational Databases
    2.1.1 Tables and Keys
    2.1.2 Index
    2.1.3 Stored Procedure
    2.1.4 Query Processing
    2.1.5 The Oracle Relational Database
  2.2 Graph Databases
    2.2.1 Graphs
    2.2.2 Index-Free Adjacency
    2.2.3 Query Processing
    2.2.4 The Neo4j Graph Database
  2.3 Database Benchmarking
  2.4 Database Modeling
    2.4.1 Relational Modeling
    2.4.2 Graph Modeling
  2.5 Query Execution Time Estimation
    2.5.1 Access Time
    2.5.2 Storage Time
    2.5.3 Computation Time
    2.5.4 Communication Time
  2.6 Database Storage
3 Related Research
  3.1 Benchmarking Database Systems for Social Network Applications
  3.2 The Shortest Path Algorithm Performance Comparison in Graph and Relational Database on a Transportation Network
  3.3 Relational Database and Graph Database: A Comparative Analysis
  3.4 Comparative Analysis of Relational and Graph Databases
4 Methodology
  4.1 Datasets
  4.2 Modeling
  4.3 Benchmark Framework
    4.3.1 Query
5 Results
  5.1 Execution Time
  5.2 Throughput
  5.3 Standard Deviation
6 Discussion
  6.1 Benchmark Comparison
  6.2 Complexity Analysis
  6.3 Execution Plan Analysis
  6.4 Practical Applications
7 Conclusion
  7.1 Summary
  7.2 Future Work
Bibliography
A Neo4j Graph Creation Algorithm
B Cypher BFS Complete Search
C Cypher BFS Stop-Label Search
D Result Data - Complete Search
E Result Data - Stop-Label Search

Chapter 1 Introduction

In this chapter we present the research question along with the objective of this thesis. The purpose of the project is then clarified, and the limitations of the thesis are presented in the scope.

1.1 Problem Background

We live in a world where information exists everywhere and information storage is an essential part of society. Information such as e-mails and personal information needs to be stored in easily accessible and efficient ways. Today, information can be stored in either physical or digital form; for computerized devices, digital formats such as file systems and databases are the most efficient. The development of traditional SQL databases has been ongoing since the 1980s [19] and has laid the foundation of modern databases. NoSQL and NewSQL are prominent database types that are widely used today.

There are many aspects to consider when selecting a database type for storage in a system or application, one of which is the property of connectivity. Everything that can be abstracted into a graph or network has the property of connectivity. The Internet with its hyperlink network, Facebook with its social network and Amazon with its product network are examples of large information networks built from data with high connectivity. When data are connected in a database in such a manner, the connections can be described as relationships between data points.

In relational databases, the data structures used for data storage are grid-structured tables. Connectivity among data here means that a data cell in one table refers to a data row in the same or another table. The data points in such relations can be accessed through SQL JOIN operations, which merge the tables to allow access to these data points.
When datasets become more interrelated, carrying out queries can become more complex because more JOIN operations may be needed.

Native graph databases store data in nodes and connect the nodes with relationships, both of which are first-class citizens [10], creating a graph structure. Each node contains a list of relationship records that represent the node’s relationships to other nodes. When carrying out queries similar to JOIN operations, the database uses these lists and has direct access to the connected nodes, eliminating the time-consuming computation required in relational databases.

There are several reasons why connectivity and JOIN operations have such a complicated relationship. One reason is the underlying architecture of each database type, that is, how the inner mechanics of handling and enabling connectivity work. According to Harrison [9], native graph databases are built with a primary focus on connectivity, which is not the case for relational databases. The choice of database type therefore varies depending on the application and on which aspects are prioritized.

Digpro is a company that deals with Geographic Information Technologies. They develop and provide software in the form of Geographic Information Systems (GIS) and Network Information Systems (NIS). dpSpatial is a platform developed by Digpro that lays the foundation of all of Digpro’s product applications. It utilizes an Oracle relational database for a range of functionalities, mostly for storing and retrieving data and executing search queries.

As mentioned above, Digpro uses a relational database, in which connectivity is not the main focus, and this may affect the execution time of search queries. Using an alternative database that has connectivity as its main focus, such as a native graph database, may enhance the performance of dpSpatial.

The Neo4j graph database [20] is of specific interest for Digpro because it is widely used and optimized for connected data [19][10]. Digpro handles highly connected infrastructural electrical networks with different layers representing various levels of structures, cables and cable housing. Due to the complexity of the structured data with respect to connectivity, it is possible that the Neo4j graph database can be an effective solution for handling search queries. This possibility is further explored in this project.

1.2 Research Question

How does a native graph database affect the execution time of search queries in an interconnected electrical network, compared to a relational database?

1.3 Objective

This thesis aims to investigate whether a native graph database is a better alternative than a relational database concerning the execution time of search queries that require retrieval of connected data. In this study, execution time is interpreted as the time from when a query is sent until a response is received from the database.

1.4 Purpose

This study explores whether the execution time of search queries can be improved by replacing a relational database with a native graph database. If the study shows improvements, the results may be relevant for companies handling connected data and may help them improve their software by exploring native graph databases.

1.5 Scope

The databases studied in this thesis are the Oracle relational database and the Neo4j graph database. Specifically, we investigate whether Neo4j has a shorter execution time than Oracle when traversing an interconnected electrical network. The underlying models of the electrical networks are identical for both databases in this study. We compare Neo4j and Oracle in terms of database internals, underlying architecture and algorithms.
External factors such as system latency are not within the scope of this study, because both databases operate on machines with similar hardware (e.g. SSD, CPU and RAM).

1.6 Terminology

Expression                   Abbreviation   Definition
Oracle Relational Database   Oracle         A relational database developed by Oracle Co.
Neo4j Graph Database         Neo4j          A native graph database developed by Neo4j Inc.
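
The contrast drawn in Section 1.1 between JOIN-based access and per-node relationship lists can be illustrated with a toy sketch in Python. It is not how Oracle or Neo4j work internally: the edge-table trace rescans every row on each hop (a deliberate simplification of an unindexed self-join), while the adjacency trace follows each node's own neighbour list. The network, the function names and the `timed` helper are hypothetical and chosen only for illustration; `timed` mirrors the definition of execution time in Section 1.3 by measuring from call to returned result.

```python
import time
from collections import defaultdict, deque

# Hypothetical toy network: each pair is a cable between two components.
EDGES = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "E"), ("F", "G")]


def trace_edge_table(edges, start):
    """Join-style trace: every hop scans the whole edge table for matches."""
    visited = {start}
    frontier = {start}
    while frontier:
        next_frontier = set()
        for u, v in edges:  # full scan on each hop (simplifying assumption)
            if u in frontier and v not in visited:
                next_frontier.add(v)
            elif v in frontier and u not in visited:
                next_frontier.add(u)
        visited |= next_frontier
        frontier = next_frontier
    return visited


def build_adjacency(edges):
    """Give each node its own list of directly connected nodes."""
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    return adjacency


def trace_adjacency(adjacency, start):
    """Breadth-first trace that only touches the neighbours of visited nodes."""
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return visited


def timed(func, *args):
    """Return (result, seconds elapsed between the call and its result)."""
    begin = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - begin


if __name__ == "__main__":
    adjacency = build_adjacency(EDGES)
    table_result, table_seconds = timed(trace_edge_table, EDGES, "A")
    graph_result, graph_seconds = timed(trace_adjacency, adjacency, "A")
    print(sorted(table_result), table_seconds)
    print(sorted(graph_result), graph_seconds)
```

Both traces return the same set of reachable components; they differ only in how much data is touched per hop. On a network this small the measured times say nothing about either database and should not be read as results.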