Towards Scalable Loading Into Graph Database Systems
Piecing together large puzzles, efficiently: Towards scalable loading into graph database systems
Work in progress paper

Gabriel Campero Durand, Jingy Ma, Marcus Pinnecke, Gunter Saake
University of Magdeburg
[email protected], [email protected], [email protected], [email protected]

ABSTRACT
Many applications rely on network analysis to extract business intelligence from large datasets, requiring specialized graph tools such as processing frameworks (e.g. Apache Giraph, Gradoop), database systems (e.g. Neo4j, JanusGraph) or applications/libraries (e.g. NetworkX, nvGraph). A recent survey reports scalability, particularly for loading, as the foremost practical challenge faced by users. In this paper we consider the design space of tools for efficient and scalable graph bulk loading. For this we implement a prototypical loader for a property graph DBMS, using a distributed message bus. With our implementation we evaluate the impact and limits of basic optimizations. Our results confirm the expectation that bulk loading can be best supported as a server-side process. We also find, for our specific case, gains from batching writes (up to 64x speedups in our evaluation), uniform behavior across partitioning strategies, and the need for careful tuning to find the optimal configuration of batching and partitioning. In future work we aim to study loading into alternative physical storages with GeckoDB, an HTAP database system developed in our group.

[Figure 1: Bulk Loading into a Graph Storage. Recoverable stage labels: File Interpretation (CSV, JSON, Adjacency lists, Implicit graphs...); Parsing for different input characteristics; Pre-processing (Partitioning, Batching, Process organization); Id assignation, Transaction checks; Physical storage.]

Categories and Subject Descriptors
H.2.m [Information Systems]: Miscellaneous—Graph-based database models; H.2.m [Information Systems]: Miscellaneous—Extraction, transformation and loading

General Terms
Measurement

Keywords
Graph database systems, Bulk loading, Streaming graph partitioning

30th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 22.05.2018 - 25.05.2018, Wuppertal, Germany. Copyright is held by the author/owner(s).

1. INTRODUCTION
Network analysis is one of many methods used by scientists to study large data collections. For this, data has to be represented with models based on graph theory, namely as a graph structure composed of nodes/vertexes and relationships/edges. Additional details, such as properties, labels, semantics or the direction of edges, allow more specific models to be defined, such as RDF, hypergraphs or property graphs.
Building on such models, network analysis is commonly assisted by three kinds of tools specialized for graph management:

• Standalone applications, toolkits and libraries, such as Gephi, NetworkX and nvGraph, provide cost-effective processing for small to medium networks in a single-user environment. Accordingly, they target single-machine shared-memory configurations. Standalone applications like Pajek and Gephi also offer visualization.

• Graph database systems emphasize persistent storage and transactional guarantees in the presence of updates from a multi-user environment. Physical/logical data independence is accomplished through the adoption of high-level graph models (e.g., the property graph model), backed by different physical storage and indexing alternatives. According to their storage, graph databases can be distinguished as native (i.e., with graph-specific storage) or non-native (i.e., with storage developed for other models, like documents). Example systems include Neo4j, OrientDB, ArangoDB, Gaffer, SAP Hana Graph, and JanusGraph.
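As a minimal sketch of the property graph model mentioned above (class and method names are illustrative, not the API of any system cited here), vertexes and edges carry a label plus an arbitrary property map, and storing an edge presupposes that both endpoint vertexes already exist:

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    vid: int
    label: str
    properties: dict = field(default_factory=dict)

@dataclass
class Edge:
    src: int          # id of the source vertex
    dst: int          # id of the target vertex
    label: str
    properties: dict = field(default_factory=dict)

class PropertyGraph:
    """Toy in-memory property graph; purely for illustration."""

    def __init__(self):
        self.vertexes = {}   # vid -> Vertex
        self.edges = []

    def add_vertex(self, vid, label, **props):
        self.vertexes[vid] = Vertex(vid, label, props)

    def add_edge(self, src, dst, label, **props):
        # an edge may only reference vertexes that are already stored
        if src not in self.vertexes or dst not in self.vertexes:
            raise KeyError("both endpoint vertexes must be loaded first")
        self.edges.append(Edge(src, dst, label, props))

g = PropertyGraph()
g.add_vertex(1, "Person", name="Ada")
g.add_vertex(2, "Person", name="Bob")
g.add_edge(1, 2, "KNOWS", since=2017)
```

Dropping the labels and property maps from this sketch yields the plain directed-graph model; adding edge identity and multi-valued properties moves it toward RDF-style models.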
• Large scale graph processing frameworks are characterized by their goal of supporting scale-out analytical workloads over large graphs, with failure-tolerance properties. As a result they adopt parallel and distributed strategies for batch or stream processing. Distributed job scheduling for skew handling and communication avoidance, next to I/O reduction, are some of the key design characteristics of these tools. Apache Giraph, Pregel, Flink's Gelly and Gradoop are some representatives of such frameworks.

A recent survey among researchers and practitioners employing these specialized graph tools reveals several characteristics of their usage [8]. The first finding is what the authors identify as the ubiquity of large graphs: about a third of the surveyed users report having graphs with more than 100 million edges, and, more specifically, graphs in the range of billions of edges are not exclusive to large organizations but are actually used in organizations of all sizes. Second, the survey reports that graph database systems currently occupy a prominent place within the community, constituting the most popular category of tools in use. Third, the survey finds that scalability (i.e., the need for processing larger graphs efficiently), followed by visualization and requirements for expressive query languages, are the most pressing challenges faced by users of these tools. Moreover, regarding scalability, users report that the precise challenges are inefficiencies in loading, updating and performing computations over large graphs. In this paper we share early considerations about a specific challenge from this list: bulk loading large networks into a specialized graph tool. This is a process that can become a bottleneck and delay the time to analysis. In this sense, optimizing this process can be especially impactful, enabling analysis on data with more currency.

We organize our study as follows:
• We introduce the bulk loading process into graph tools, outlining performance-impacting aspects and usability requirements gathered by surveying the SNAP repository for network datasets (Sec. 2).
• We develop an early prototype for bulk loading into a graph database, evaluating client- vs. server-side loading, request batching and different partitioning strategies (Sec. 3, 4).
• We summarize related work providing context for our research (Sec. 5).
• We conclude by suggesting takeaways from our study and future work (Sec. 6).

2. THE GRAPH BULK LOADING PROCESS
Bulk loading is a process common to most graph tools (Fig. 1). We propose that the general process consists of:
1. Reading input data sources, often in tabular formats.
2. Interpreting data as nodes and edges according to rules. Intermediate data structures can be utilized.
3. Cleaning, deduplication

Regarding constraints, to store an edge the existence of the connected vertexes needs to be checked. For storing both vertexes and edges, integrity constraints specific to the data model might also require validation. When sources present duplicate entities, each write request might entail determining first whether the entity needs to be created or not. For large graphs, both checks for constraints and checks for duplicate entities might impact the loading time.

In terms of distributing the process, this can take place either by distributing the data sources (e.g., chunking and sharding files) or the interpreted data.

Perhaps the most pressing aspects that need to be considered by tools for the general graph data loading process we describe are efficiency, which for the purposes of our discussion encompasses scalability, and usability, which refers to fulfilling requirements from diverse input characteristics. In what follows we briefly present these aspects before advancing to the specific contributions of our study.

2.1 Performance-impacting factors
One of the main factors determining how the loading will take place, its efficiency and the possible optimizations, is the physical storage model. Paradies and Voigt [7] survey some of the more prevalent alternatives for this. The authors distinguish between choices for storing only the topology and choices for storing richer logical graph models, including labels and properties associated with nodes and edges. Among the first group they count adjacency matrixes, adjacency lists and compressed sparse rows (consisting of an ordering of edges stored in a sequential immutable array, with an index of offsets to improve access). Among the second they list triple tables (storing the subject-object-predicate data that conform RDF triples in a single table with dictionary compression), universal tables (wherein a single table is assigned to edges and another to vertexes), emerging schemas (for which tables are still employed, but with schemas tuned to the data), schema hashing (where item ids and properties are used as hashes to store the corresponding values in separate tables) and separate property storage (a strategy that simply separates the storage of properties from that of the topology). Specialized compressed structures, adaptive strategies and structural storage are also discussed by the authors as alternative storage approaches. Finally, graph summaries like sparsifiers and spanners could be considered storage alternatives too, though specifically attuned to expected uses.

In addition to the specific storage model selected, which should reasonably determine the operations involved in bulk loading, we consider that other performance-impacting aspects pertaining to the design of an efficient bulk loading tool are: input file parsing, memory allocation, access patterns, I/O paths for persistent storage, write batching, the amount of parallelism employed and load balance, concurrency control, consensus for distributed writes, transactional management,
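As one concrete reading of steps 1-3 above, together with the duplicate-entity and endpoint-existence checks discussed in this section, a minimal batched loader can be sketched as follows. This is not our prototype: the CSV layout, the `store_batch` callback and the batch size are assumptions for illustration only.

```python
import csv
import io

def load_edges(csv_file, store_batch, batch_size=1024):
    """Read an edge-list CSV, interpret rows as vertexes/edges,
    deduplicate vertexes, and flush write requests in batches."""
    seen_vertexes = set()                   # duplicate-entity check
    batch = []
    for row in csv.reader(csv_file):
        src, dst = row[0], row[1]           # 1. read tabular source
        for v in (src, dst):                # 2. interpret as vertexes
            if v not in seen_vertexes:      # 3. deduplicate
                seen_vertexes.add(v)
                batch.append(("vertex", v))
        # endpoint-existence constraint holds: both vertexes were
        # emitted (or already seen) before the edge write below
        batch.append(("edge", src, dst))
        if len(batch) >= batch_size:
            store_batch(batch)              # batched write request
            batch = []
    if batch:
        store_batch(batch)                  # flush the final partial batch
    return len(seen_vertexes)

# usage with an in-memory sink standing in for the database client
writes = []
data = io.StringIO("a,b\nb,c\na,c\n")
n = load_edges(data, writes.extend, batch_size=2)
```

Varying `batch_size` in such a loop is the kind of tuning knob whose effect we measure in our evaluation; a real loader would replace `store_batch` with requests to the database or message bus.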
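Distributing the data sources, as mentioned above, can be sketched by sharding edge rows across partitions. The hash-by-source-id rule and integer vertex ids here are illustrative assumptions, not one of the partitioning strategies evaluated later:

```python
def shard(rows, num_partitions):
    """Assign each (src, dst) edge row to a partition by its source id,
    so all out-edges of a vertex land on the same partition."""
    parts = [[] for _ in range(num_partitions)]
    for src, dst in rows:
        parts[src % num_partitions].append((src, dst))
    return parts

parts = shard([(0, 1), (1, 2), (2, 3), (3, 0)], 2)
```

Sharding the interpreted data instead of the source rows would apply the same rule after parsing, at the cost of first materializing vertexes and edges.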
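The compressed-sparse-rows layout surveyed in Sec. 2.1 (edges ordered by source vertex in a sequential immutable array, plus an index of offsets) can be sketched as below; the function names and 0-based id convention are ours, not from [7].

```python
def build_csr(num_vertexes, edge_list):
    """Build a CSR topology from (src, dst) pairs with 0-based ids."""
    edges_sorted = sorted(edge_list)            # order edges by source id
    neighbors = [dst for _, dst in edges_sorted]
    offsets = [0] * (num_vertexes + 1)
    for src, _ in edges_sorted:
        offsets[src + 1] += 1                   # out-degree histogram
    for v in range(num_vertexes):
        offsets[v + 1] += offsets[v]            # prefix sums -> offsets index
    return offsets, neighbors

def out_neighbors(offsets, neighbors, v):
    # the offsets index gives direct access to vertex v's edge run
    return neighbors[offsets[v]:offsets[v + 1]]

offsets, neighbors = build_csr(4, [(0, 1), (0, 2), (2, 3), (1, 3)])
```

The immutability of the two arrays is what makes CSR compact and fast to scan, and also what makes it awkward under the incremental writes that bulk loading produces, motivating intermediate data structures as in step 2 above.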