Beyond Macrobenchmarks: Microbenchmark-Based Graph Database Evaluation
Matteo Lissandrini (Aalborg University), Martin Brugnara (University of Trento), Yannis Velegrakis (University of Trento)

ABSTRACT

Despite the increasing interest in graph databases, their requirements and specifications are not yet fully understood by everyone, leading to a great deal of variation in the supported functionalities and the achieved performance. In this work, we provide a comprehensive study of the existing graph database systems. We introduce a novel micro-benchmarking framework that provides insights on their performance that go beyond what macro-benchmarks can offer. The framework includes the largest set of queries and operators considered so far. The graph database systems are evaluated on synthetic and real data, from different domains, and at scales much larger than any previous work. The framework is materialized as an open-source suite and is easily extended to new datasets, systems, and queries.¹

¹ https://graphbenchmark.com

PVLDB Reference Format:
Matteo Lissandrini, Martin Brugnara, and Yannis Velegrakis. Beyond Macrobenchmarks: Microbenchmark-based Graph Database Evaluation. PVLDB, 12(4): 390-403, 2018.
Proceedings of the VLDB Endowment, Vol. 12, No. 4. ISSN 2150-8097.
DOI: https://doi.org/10.14778/3297753.3297759

1. INTRODUCTION

Graphs have become increasingly important for a wide range of applications [17, 41] and domains, including biological data [16], knowledge graphs [61], and social networks [31]. As graph data becomes prevalent, larger, and more complex, the need for efficient and effective graph management is becoming apparent. Since graph management systems are a relatively new technology, their features, performance, and capabilities are not yet fully understood nor agreed upon. Thus, there is a need for effective benchmarks to provide a comprehensive picture of the different systems. This is of major importance for practitioners, in order to understand the capabilities and limitations of each system; for researchers, to decide where to invest their efforts; and for developers, to be able to evaluate their systems and compare them with competitors.

There are two categories of graph management systems that address two complementary yet distinct sets of functionalities. The first is that of graph processing systems [27, 42, 44], which analyze graphs to discover characteristic properties, e.g., average connectivity degree, density, and modularity. They also perform batch analytics at large scale, implementing computationally expensive graph algorithms such as PageRank [54], SVD [23], strongly connected components identification [63], and core identification [12, 20]. These are systems like GraphLab, Giraph, Graph Engine, and GraphX [67]. The second category is that of graph databases (GDBs for short) [5]. Their focus is on storage and querying tasks where the priority is on high-throughput, transactional operations. Examples in this category are Neo4j [51], OrientDB [53], Sparksee [60] (formerly known as DEX), Titan [64] (recently renamed JanusGraph), ArangoDB [11], and BlazeGraph [62]. To make this distinction clear, graph processing systems can, in some sense, be seen as the graph-world parallel to OLAP systems, while graph databases are the parallel to OLTP systems.

Graph processing systems and their evaluation have received considerable attention [27, 32, 42, 44]. Graph databases, instead, lag far behind. Our focus is specifically on graph databases, aiming to reduce this gap with a two-fold contribution: first, the introduction of a novel evaluation methodology for graph databases that complements existing approaches, and second, the generation of a number of insights on the performance of the existing GDBs. Some experimental comparisons of graph databases do exist [22, 37, 38]. However, they test a limited set of features, providing a partial understanding of the systems; experiment at small scale, making assumptions that cannot be verified at large scale [37, 38]; sometimes provide contradicting results; and fail to pinpoint the fine-grained limitations of each system.

Motivated by the above, we provide a complete and systematic evaluation of existing graph databases that is not provided by any other existing work to date. We test 35 classes of operations with both single queries and batch workloads, for a total of about 70 different tests, as opposed to the 4-13 of existing studies, and we scale our experiments up to 76M nodes / 314M edges, as opposed to the 250K nodes / 2.2M edges of existing works. Our tests cover all the types of insert-select-update-delete queries that have so far been considered and, in addition, cover a whole new spectrum of use cases, data types, and scales. Moreover, we study both synthetic and real workloads and datasets.

Micro-benchmarking. In designing the evaluation methodology, we follow a principled micro-benchmarking approach. To substantiate our choice, we look at the test queries provided by the popular LDBC Social Network benchmark [26], and show how the produced results are ambiguous and limited in providing a clear picture of the advantages of each system. So, instead of considering queries with such a complex structure, we opt for a set of primitive operators. The primitive operators are derived by decomposing the complex queries found in LDBC, the related literature, and some real application scenarios. Their advantage is that they are often implemented by opaque components in the system; thus, by identifying the underperforming operators, one can pinpoint the exact components that underperform. Furthermore, any complex query can typically be decomposed into a combination of primitive operations; thus, its performance can be explained by the performance of the components implementing them. Query optimizers may change the order of the basic operators, or select among different implementations, but the performance of the primitive operators is always a significant performance factor. This evaluation model is known as micro-benchmarking [15] and is similar to the principles that have been successfully followed in the design of benchmarks in many other areas [2, 22, 35, 36, 37, 38]. Note that micro-benchmarking is not intended to replace macro-benchmarks. Macro-benchmarks are equally important in order to evaluate the overall performance of query planners, optimizers, and caches. They are, however, limited in identifying underperforming operators at a fine grain.

Our evaluation provides numerous specific insights. Among them, three are of particular importance: (i) we highlight the different insights that micro- and macro-benchmarks can provide; (ii) we experimentally demonstrate limitations of the tested hybrid systems when dealing with localized traversal queries that span multiple long paths, such as breadth-first search; and (iii) we identify the trade-offs between the logical and physical data organizations, supporting the choice of the native graph databases we study.

Graph processing systems [32, 43, 45, 68] are designed for computationally expensive algorithms that often require traversing the entire graph multiple times to obtain an answer, like PageRank or community detection. Such systems are very different in nature from graph database systems; thus, in their evaluation, "needle in the haystack" queries like those that are typical of transactional workloads are not considered. Of course, there are proposals for unified graph processing and database systems [27], but this idea is in its infancy. Our focus is not on graph processing systems or their functionalities.

Evaluating Graph Databases. In contrast to graph processing systems, graph databases are designed for transactional workloads and "needle in the haystack" operations, i.e., queries that identify and retrieve a small part of the data. Existing evaluation works [3, 5] for such systems are limited to describing the systems' implementation, data modelling, and query capabilities, but provide no experimental evaluation. A different group of studies provides an experimental comparison but is incomplete and fails to deliver a consistent picture. In particular, one work [22] analyzes only four systems, two of which are no longer supported, with small graphs and a restricted set of operations. Two other empirical works [37, 38] compared almost the same set of graph databases over datasets of comparably small sizes, but agree only partially in their conclusions. Moreover, no existing study tests graphs at large scale and with rich structures. Our work comes to fill exactly this gap in graph database evaluation, by providing the most extensive evaluation of state-of-the-art systems in a complete and principled manner.

Distribution & Cluster Evaluation. In the era of Big Data, it is important to understand the abilities of graph databases in exploiting parallel processing and distributed architectures. This has already been done for graph processing systems [32, 43, 66].
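The micro-benchmarking idea described above — timing primitive operators in isolation rather than whole complex queries — can be sketched independently of any particular graph database. The following is a minimal illustrative Python sketch over an in-memory adjacency list; the operator names, graph sizes, and timing harness are our own assumptions and not part of the paper's framework, which issues the corresponding operations through each system's query interface.

```python
import random
import time
from collections import deque

def make_graph(n_nodes=10_000, n_edges=50_000, seed=42):
    """Build a hypothetical random directed graph as an adjacency list."""
    rng = random.Random(seed)
    adj = {v: [] for v in range(n_nodes)}
    for _ in range(n_edges):
        adj[rng.randrange(n_nodes)].append(rng.randrange(n_nodes))
    return adj

# Three illustrative primitive operators, ranging from cheap point lookups
# ("needle in the haystack" queries) to a localized traversal that spans
# multiple long paths.

def op_node_lookup(adj, v):
    """Select a single node and return its stored adjacency."""
    return adj[v]

def op_expand_neighbors(adj, v):
    """One-hop traversal: materialize the neighbors of a node."""
    return list(adj[v])

def op_bfs(adj, src):
    """Breadth-first search from src; returns the number of reachable nodes."""
    seen = {src}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen)

def bench(name, fn, repeats=5):
    """Time one operator in isolation; report the best of `repeats` runs."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    print(f"{name:>18}: {min(times) * 1e3:8.3f} ms (best of {repeats})")

if __name__ == "__main__":
    adj = make_graph()
    bench("node lookup", lambda: op_node_lookup(adj, 123))
    bench("expand neighbors", lambda: op_expand_neighbors(adj, 123))
    bench("BFS", lambda: op_bfs(adj, 123))
```

Because each operator is timed by itself, a slowdown points at the specific component implementing it (lookup index, adjacency scan, traversal engine) instead of being averaged away inside an end-to-end query time — the property the methodology exploits when decomposing complex LDBC-style queries into primitives.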