Graph Databases and Their Application to the Italian Business Register for Efficient Search of Relationships Among Companies
Total Page:16
File Type:pdf, Size:1020Kb
Universit`adegli Studi di Padova Dipartimento di Ingegneria dell'Informazione Corso di Laurea Magistrale in Ingegneria Informatica Tesi di Laurea Magistrale Graph databases and their application to the Italian Business Register for efficient search of relationships among companies Basi di dati a grafo e loro applicazione al Registro Imprese Italiano per la ricerca efficiente di relazioni tra le imprese 4 aprile 2017 Laureando: Relatore: Sinico Luca Prof. Ferro Nicola 1108637 Anno Accademico 2016/2017 Universit`adegli Studi di Padova Dipartimento di Ingegneria dell'Informazione Corso di Laurea Magistrale in Ingegneria Informatica Tesi di Laurea Magistrale Graph databases and their application to the Italian Business Register for efficient search of relationships among companies Basi di dati a grafo e loro applicazione al Registro Imprese Italiano per la ricerca efficiente di relazioni tra le imprese 4 aprile 2017 Laureando: Relatore: Sinico Luca Prof. Ferro Nicola 1108637 Anno Accademico 2016/2017 Abstract \What are the main characteristics of graph databases? In what do they differ from relational ones, and in what situations should we use them?" These are the main questions of our work. We studied and tested three of the major graph databases of the moment (ArangoDB, Neo4j, OrientDB), and we compared them with a relational database, implemented with PostgreSQL. We worked on a dataset representing equity participations among companies, and we found out that the strong points of graph databases are: the purposely designed storage techniques, which let them have good performance on graph datasets; and the purposely designed query languages, which go beyond the standard SQL and manage the typical problems that arise when graphs are explored. However, we have seen that the main performance increments have been obtained when heavy graph situations are queried; for simpler situations and queries, a relational database performs equally well. i ii Dedicato ad Anna. Dedicato alla mia famiglia. iii iv Contents Abstract i 1 Introduction1 2 Graph Databases3 2.1 Introduction to Graph Databases..........................3 2.2 General definitions..................................4 2.3 Graph database definition.............................. 10 2.4 Graph data models.................................. 12 2.4.1 Property graph................................ 12 2.4.2 RDF graph.................................. 14 2.4.3 Hypergraph.................................. 23 2.5 Graph database characteristics............................ 25 2.5.1 Motivations for the adoption of graph databases.............. 27 2.5.2 Differences from graph computing engines................. 29 2.5.3 Differences from the relational database.................. 31 2.5.4 Differences from document stores...................... 34 v vi CONTENTS 2.6 General approach to process graph queries..................... 35 3 DBMSs comparison 39 3.1 Which are the major Graph DBMSs........................ 39 3.2 The compared DBMSs................................ 43 3.2.1 ArangoDB................................... 43 3.2.2 Neo4j..................................... 54 3.2.3 OrientDB................................... 65 3.2.4 PostgreSQL.................................. 76 3.3 Feature matrix.................................... 82 4 Use case 85 4.1 The domain of interest................................ 85 4.2 Entity-Relationship representation.......................... 88 4.2.1 The existing database solution........................ 89 4.3 Graph representation................................. 91 4.3.1 Design Best Practices............................. 92 4.3.2 Logical design................................. 94 4.3.3 Some graph description metrics....................... 95 4.4 Data behaviour.................................... 96 4.5 Applications that could benefit from a graph database usage........... 97 4.6 Rationale for the choice of the graph DBMSs.................... 98 CONTENTS vii 5 Evaluation and comparison of Graph Databases 101 5.1 Use case........................................ 101 5.2 Experimental setup.................................. 102 5.3 Data import...................................... 104 5.3.1 Summary................................... 110 5.4 Cache warm-up.................................... 110 5.5 Performance comparison............................... 113 5.5.1 single match by cf .............................. 116 5.5.2 get direct fathers ............................... 118 5.5.3 get direct fathers only cfs ........................... 122 5.5.4 get direct neighbors .............................. 124 5.5.5 get distinct descendants and level ...................... 128 5.5.6 count distinct ancestors per level ...................... 140 5.5.7 get common descendants ........................... 146 5.5.8 get common ancestors ............................ 152 5.5.9 get shortest directed path ........................... 152 5.5.10 Summary................................... 158 6 Displaying the resulting subgraph of a query 161 7 Conclusions 171 7.1 DBMSs assessment.................................. 171 7.2 Main conclusions................................... 174 7.3 Final considerations.................................. 175 7.4 Future developments................................. 176 Bibliography 179 viii List of Tables 3.1 Feature matrix.................................... 83 5.1 Data import comparison................................ 110 5.2 Result example for query count distinct ancestors per level............ 141 6.1 Constraints combinations on node `00102681419'.................. 166 7.1 Assessment table of the compared DBMSs...................... 174 ix x List of Figures 2.1 Popularity changes of database technologies during the last four years [1].....4 2.2 A directed graph [31]..................................5 2.3 Breadth-First Search pseudocode [122]........................8 2.4 Breadth-First Search exploration order [153].....................8 2.5 Depth-First Search pseudocode [122].........................9 2.6 Depth-First Search exploration order [154]......................9 2.7 Visualization of a property graph [62]........................ 13 2.8 TinkerPop system architecture [125]......................... 13 2.9 Visualization of a RDF graph [141].......................... 15 2.10 RDF classification based on storage technique [38]................. 16 2.11 Linked Data publishing options and workflows [131]................. 20 2.12 Linking Open Data cloud diagram in May 2007 [58]................. 21 2.13 Linking Open Data cloud diagram in August 2014 [58]............... 21 2.14 Directed hypergraph example [48]........................... 24 2.15 Hypergraph translated in a property graph [48]................... 24 xi xii LIST OF FIGURES 2.16 The graph database space [48]............................ 26 2.17 A high-level view of a typical graph analytics engine deployment [48]....... 29 2.18 Modeling friends and friends-of-friends in a relational database [48]........ 33 2.19 Social network example encoded in a document store [48]............. 35 2.20 Graph query phases.................................. 36 3.1 Ranking list of some of the DBMSs tracked by DB-ENGINES.com [1]....... 40 3.2 Trending chart of some of the DBMSs tracked by DB-ENGINES.com [1]..... 41 3.3 Ranking list of the graph DBMSs tracked by DB-ENGINES.com [1]........ 42 3.4 Trending chart of the top 5 graph DBMSs tracked by DB-ENGINES.com [1]... 42 3.5 ArangoDB web interface............................... 53 3.6 ArangoDB subscription levels [25].......................... 53 3.7 Neo4j architecture layers [48]............................. 55 3.8 Neo4j node and relationship store file record structure [48]............. 56 3.9 Example Neo4j social network data structure [48].................. 57 3.10 A simple Neo4j example graph [48].......................... 61 3.11 Neo4j web interface.................................. 64 3.12 Neo4j editions comparison [61]............................ 65 3.13 OrientDB caching architecture [121]......................... 73 3.14 OrientDB web interface................................ 74 3.15 OrientDB editions comparison [99].......................... 75 3.16 OrientDB support subscriptions comparison [83].................. 75 LIST OF FIGURES xiii 4.1 Italian Chamber system map [50]........................... 86 4.2 ER schema before restructuration.......................... 89 4.3 ER schema...................................... 90 4.4 Logical schema for a relational database...................... 91 4.5 Graph data model................................... 95 5.1 Warm-up cache, times chart............................. 112 5.2 Warm-up quick study................................. 112 5.3 Warm-up cache, memory chart............................ 113 5.4 Query single match by cf charts........................... 117 5.5 Fathers of the node with codice fiscale = `00102681419'.............. 119 5.6 Query get direct fathers charts............................ 121 5.7 Query get direct fathers only cfs charts....................... 123 5.8 Neighbors of the node with codice fiscale = `00102681419'............. 125 5.9 Query get direct neighbors charts.......................... 127 5.10 Descendants of the node with codice fiscale = `00102681419'........... 128 5.11 Query get distinct descendants and level charts.................. 137 5.12 Quick study of path uniqueness vs. global uniqueness................ 138 5.13 Query get distinct descendants and level only cfs charts.............. 139 5.14 Ancestors of the