Integrating Column-Oriented Storage and Query Processing Techniques Into Graph Database Management Systems
Total Page:16
File Type:pdf, Size:1020Kb
Integrating Column-Oriented Storage and Query Processing Techniques Into Graph Database Management Systems Pranjal Gupta, Amine Mhedhbi, Semih Salihoglu University of Waterloo {pranjal.gupta, amine.mhedhbi, semih.salihoglu}@uwaterloo.ca ABSTRACT such, these columnar techniques are relevant for improving We revisit column-oriented storage and query processing the performance and scalability of GDBMSs. techniques in the context of contemporary graph database In this paper, we revisit columnar storage and query pro- management systems (GDBMSs). Similar to column-orien- cessing techniques in the context of GDBMSs. Specifically, ted RDBMSs, GDBMSs support read-heavy analytical work- we focus on an in-memory GDBMS setting and discuss the loads that however have fundamentally different data access applicability of columnar storage and compression techniques patterns than traditional analytical workloads. We first de- for storing different components of graphs [17, 19, 58, 66], rive a set of desiderata for optimizing storage and query and block-based query processing [20, 27]. Despite their sim- processors of GDBMS based on their access patterns. We ilarities, workloads in GDBMS and column-oriented RDBMSs then present the design of columnar storage, compression, also have fundamentally different access patterns. For ex- and query processing techniques based on these desiderata. ample, workloads in GDBMSs contain large many-to-many In addition to showing direct integration of existing tech- joins, which are not frequent in column-oriented RDBMSs. niques from columnar RDBMSs, we also propose novel ones This calls for redesigning columnar techniques in the context that are optimized for GDBMSs. These include a novel list- of GDBMSs. The contributions of this paper are as follows. based query processor, which avoids expensive data copies Guidelines and Desiderata: We begin in Section3 by ana- of traditional block-based processors under many-to-many lyzing the properties of data access patterns in GDBMSs. joins and avoids materializing adjacency lists in intermedi- For example, we observe that different components of data ate tuples, a new data structure we call single-indexed edge stored in GDBMSs can have some structure and the order property pages and an accompanying edge ID scheme, and a in which operators access vertex and edge properties often new application of Jacobson’s bit vector index for compress- follow the order of edges in adjacency lists. This analysis ing NULL values and empty lists. We integrated our tech- instructs a set of guidelines and desiderata for designing the niques into the GraphflowDB in-memory GDBMS. Through physical data layout and query processor of a GDBMS. extensive experiments, we demonstrate the scalability and Columnar Storage: Section4 explores the application of query performance benefits of our techniques. columnar data structures for storing different components of GDBMSs. While existing columnar structures can directly be used for storing vertex properties and many-to-many (n- 1. INTRODUCTION n) edges, we observe that using a straightforward columnar Contemporary GDBMSs are data management software structure, which we call edge columns, to store properties such as Neo4j [9], Neptune [1], TigerGraph [13], and Graph- of n-n edges is suboptimal as it does not guarantee sequen- flowDB [44, 50] that adopt the property graph data model [10]. tial access when reading edge properties in either forward or In this model, application data is represented as a set of ver- backward directions. An alternative, which we call double- tices, which represent the entities in the application, directed indexed property CSRs, can achieve sequential access in both arXiv:2103.02284v1 [cs.DB] 3 Mar 2021 edges, which represent the relationships between entities, directions but requires duplicating edge properties, which and key-value properties on the vertices and edges. can be undesirable as graph-structured data often contain GDBMSs have lately gained popularity to support a wide orders of magnitude more edges than vertices. We then de- range of analytical applications, from fraud detection and scribe an alternative design point, single-directional property risk assessment in financial services to recommendations in pages, that avoids duplication and achieves good locality e-commerce and social networks [57]. These applications when reading properties of edges in one direction and still have workloads that search for patterns in a graph-structured guarantees random access in the other when using a new database, which often requires reading large amounts of edge ID scheme that we describe. Our new ID schemes allow data. In the context of RDBMSs, column-oriented sys- for extensive compression when storing them in adjacency tems [11, 41, 58, 63] employ a set of read-optimized storage, lists without decompression overheads. Lastly, as a new ap- indexing, and query processing techniques to support tradi- plication of vertex columns, we show that single cardinality tional analytical applications, such as business intelligence edges and edge properties, i.e. those with one-to-one (1-1), and reporting, that also process large amounts of data. As one-to-many (1-n) or many-to-one (n-1) cardinalities, are stored more efficiently with vertex columns instead of the structures we described above for n-n edges. 1 name: “peter” name: “alice” age: 17 Columnar Compression: In Section5, we review existing since: 2015 age: 45 Vertex labels - gender: M gender: F columnar compression techniques, such as dictionary encod- PERSON ing, that satisfy our desiderata and can be directly applied doj: 2014 doj: 2019 ORG to GDBMSs. We next show that existing techniques for Name: “UofT” compressing NULL values in columns from references [17, name: “UW” doj: 2006 estd: 1885 estd: 1934 19] by Abadi et al. lead to very slow accesses to arbitrary since: 2011 Edge labels - doj: 2006 non-NULL values. We then review Jacobson’s bit vector in- since: 1999 FOLLOWS since: 1992 since: 2006 since: doj: 1980 dex [42, 43] to support constant time rank queries, which has since: since:2009 2012 STUDYAT found several prior applications e.g., recently in a range filter name: “jenny” name: “bob” WORKAT age: 23 age: 54 gender: M structure in databases [61], in information retrieval [37, 52] gender: F since: 2003 ★ doj: date of joining and computational geometry [30, 53]. We show how to en- hance one of Abadi’s schemes with an adaptation of Jacob- Figure 1: Running example graph. son’s index to make it suitable for compressing NULL values and empty adjacency lists in GDBMSs. Our final structure (iii) edge properties. In every native GDBMS we are aware allows constant-time access to NULL or non-NULL values of, the topology is stored in data structures that organize with a small increase in storage overhead per entry compared data in adjacency lists [28], such as in compressed sparse to prior techniques. row (CSR) format. Typically, given the ID of a vertex v, List-based Processing: In Section6, we observe that tradi- the system can in constant-time access v’s adjacency list, tional Volcano-style [38] tuple-at-a-time processors, which which contains a list of (edge ID, neighbour ID) pairs. Typ- are used in some GDBMSs, do not benefit from processing ically, an adjacency list of v is further clustered by the blocks of data in tight loops but also has the advantage of edge label which enables traversing the neighbourhood of avoiding expensive data copies when performing many-to- v based on a particular label efficiently. Vertex and edge many joins, e.g. in long path queries. On the other hand, properties can be stored in a number of ways. For exam- columnar RDBMSs have block-based [20, 64] query proces- ple, some systems use a separate key-value store, such as sors that process fixed-length blocks of data in tight loops, DGraph [2] and JanusGraph [5], and some use a variant of which achieves better CPU and cache utility. However, interpreted attribute layout [25], where records consist of se- block-based processing results in expensive data copies un- rialized variable-sized key-value properties. These records der many-to-many joins. To address this, we propose a new can be located consecutively in disk or memory or have block-based processor we call list-based processor (LBP), pointers to each other, as in Neo4j. which modifies traditional block-based processors in two ways Queries in GDBMSs consist of a subgraph pattern Q that to tailor them for GDBMSs: (i) Instead of representing the describes the joins in the query (similar to SQL’s FROM) intermediate tuples processed by operators as a single group and optionally predicates on these patterns with final group- of equal-sized blocks, we represent them as multiple groups by-and-aggregation operations. We assume a GDBMS with of blocks. We call these list groups. LBP avoids expensive a query processor that uses variants of the following rela- data copies by flattening blocks of some groups into sin- tional operators, as in many GDBMSs, e.g., Neo4j [9], Mem- gle values when performing many-to-many joins. Flattening graph [6], or GraphflowDB: does not require any data copies and happens by setting an SCAN: Scans a set of vertices from the graph. index field of a list group. (ii) Instead of fixed-length blocks, JOIN (e.g. EXPAND in Neo4j and Memgraph, EXTEND in Graph- LBP uses variable length blocks that take the lengths of ad- flowDB): Performs an index nested loop join using the ad- jacency lists that are represented in the intermediate tuples. jacency list index to match an edge of Q. Takes as input a Because adjacency lists are already stored in memory con- partial match t that has matched k of the query edges in Q. secutively, this allows us to avoid materializing adjacency For each t, extends t by matching an unmatched query edge lists during join processing, improving query performance.