Cypher-Based Graph Pattern Matching in Gradoop
Total Page:16
File Type:pdf, Size:1020Kb
Cypher-based Graph Paern Matching in Gradoop Martin Junghanns Max Kießling Alex Averbuch University of Leipzig & ScaDS Dresden/Leipzig University of Leipzig & Neo Technology Neo Technology [email protected] [email protected] [email protected] Andre´ Petermann Erhard Rahm University of Leipzig & ScaDS Dresden/Leipzig University of Leipzig & ScaDS Dresden/Leipzig [email protected] [email protected] ABSTRACT In consequence, valuable insights may remain hidden as analysts Graph paern matching is an important and challenging operation are restrained either by limited scalability of graph databases or on graph data. Typical use cases are related to graph analytics. missing functionality of graph processing systems. In particular, Since analysts are oen non-programmers, a graph system will we see a need to extend graph processing by query capabilities only gain acceptance, if there is a comprehensible way to declare that show the same expressiveness as those of graph database paern matching queries. However, respective query languages systems. is motivated us to add the paern matching core of are currently only supported by graph databases but not by dis- Neo4j’s declarative graph query language Cypher1 to Gradoop, a tributed graph processing systems. To enable paern matching on distributed open-source framework for graph analytics and process- a large scale, we implemented the declarative graph query language ing [12]. Our query engine is fully integrated and paern matching Cypher within the distributed graph analysis platform Gradoop. can be used in combination with other analytical graph operators Using LDBC graph data, we show that our query engine is scalable provided by the framework. Gradoop is based on the dataow for operational as well as analytical workloads. e implementation framework Apache Flink [3] which scales out computation across is open-source and easy to extend for further research. multiple machines. In short, our contributions are: (1) We provide the rst implemen- CCS CONCEPTS tation of the Cypher query language based on a distributed dataow •Computing methodologies ! Distributed algorithms; system, (2) we implemented a modular query engine which is the •Information systems ! Graph-based data models; foundation for our ongoing research on graph paern matching in KEYWORDS distributed environments and (3) we present results of scalability Cypher, Graph Paern Matching, Apache Flink, Gradoop experiments based on the LDBC social network. We provide the source code as part of the Gradoop framework under an open 2 1 INTRODUCTION source license . Graph paern matching is the problem of nding all subgraphs e remainder of this paper is organized as follows: In Section of a data graph that match a given paern or query graph. It has 2 we provide prelimiaries on the graph data model, graph paern manifold applications in research and industry, e.g., in social net- matching and Cypher. Section 3 describes the implementation work analysis, life sciences or business intelligence. An established of our query engine while Section 4 presents evaluation results. solution to manage and query graph data is using a graph database Finally, we briey discuss related work and give an outlook on our system such as Neo4j [18]. ese systems provide exible data ongoing research. models to t dierent application domains and oer declarative 2 BACKGROUND graph query languages to enable non-programmers to express a We rst introduce the graph data model of Gradoop. Aerwards, query without a deeper understanding of the underlying system. In we specify the formal semantics of graph paern matching and contrast, graph processing systems focus on large-scale applications outline the core features of the Cypher graph query language. with very large amounts of graph data and high computational requirements for graph analysis and mining [13]. In such cases, 2.1 Extended Property Graph Model parallel and distributed execution on many processors became an e Property Graph Model [16] is a widely accepted graph data established solution. However, expressing graph algorithms in such model used by many graph database systems [2]. A property graph systems requires a profound knowledge of the underlying frame- is a directed, labeled and aributed multigraph. Vertex and edge works and programming APIs. Moreover, support for graph paern semantics are expressed using type labels (e.g., Person or knows). matching is still limited [6]. Aributes have the form of key-value pairs (e.g., name:Alice or classYear:2015) and are referred to as properties. Properties are Permission to make digital or hard copies of all or part of this work for personal or set at the instance level without an upfront schema denition. e classroom use is granted without fee provided that copies are not made or distributed Extended Property Graph Model adds support for graph collections for prot or commercial advantage and that copies bear this notice and the full citation containing multiple, possibly overlapping property graphs, which on the rst page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permied. To copy otherwise, or republish, are referred to as logical graphs [12]. Like vertices and edges, logical to post on servers or to redistribute to lists, requires prior specic permission and/or a graphs have a type label and an arbitrary number of properties. fee. Request permissions from [email protected]. GRADES’17, Chicago, IL, USA © 2017 ACM. 978-1-4503-5038-9/17/05...$15.00 1hps://neo4j.com/developer/cypher-query-language/ DOI: hp://dx.doi.org/10.1145/3078447.3078450 2hp://www.gradoop.com GRADES’17, May 19, 2017, Chicago, IL, USA Martin Junghanns, Max Kießling, Alex Averbuch, Andre´ Petermann, and Erhard Rahm L = f100g V = Figure 1: Extended Property Graph with , Denition 2.2. (ery graph). Let G be a data graph. A tuple f10; 20; ::; 50g, E = f1; 2; ::; 8g, T = fCommunity; Person;knows; :::g, Q = (Vq; Eq;s; t;θv ;θe ) is a query graph of query vertices Vq and K = farea;name;classYear; :::g and A = fLeipzig; Alice; 2014; :::g. query edges Eq . θv : V (G) ! ftrue; f alseg and θe : E(G) ! ftrue; f alseg are predicate functions dened on type labels and properties of data vertices and edges. Denition 2.3. (Homomorphism/Isomorphism). A graph G0 = (V 0; E0) will be a subgraph of a (logical) graph G = (V ; E), denoted by G0 v G, i V 0 ⊆ V and E0 ⊆ E with 8e 2 E0 : s(e); t (e) 2 V 0. Given a query graph Q and a subgraph G0, G0 will match Q by homomorphism, denoted by Q ∼ G0, if there are two mappings 0 0 fv : Vq ! V and fe : Eq ! E , such that 8vq 2 Vq : θv (fv (vq )) 0 and 8e 2 Eq : (fv (s(eq )); fv (t (eq ))) 2 E ^ θe (fe (eq )). If fv and 0 fe are bijective, then G matches Q by isomorphism, denoted by Q ' G0. If G0 matches Q, then Q will be embedded in G0. An embedding is represented by a tuple (fv ; fe )G0 . Based on the previous denitions, we are now able to dene a Denition 2.1. (Extended Property Graph Model.) A tuple new EPGM operator for graph paern matching: G = (L;V ; E;l;s; t;T;τ; K; A;κ) represents an extended property graph. L is a set of graphs (graph identiers), V is a set of vertices Denition 2.4. (Graph pattern matching). Given a logical (vertex identiers) and E is a set of edges (edge identiers). Graph (data) graph G and a query graph Q, the graph paern matching containment is represented by the mapping l : V [ E ! P(L) n ; operator returns a collection of new logical graphs GQ , such that 0 0 0 0 whereas the mappings s : E ! V / t : E ! V determine a source G 2 GQ , G v G ^ Q ∼ G (or Q ' G for isomorphism). Note, and a target vertex for every edge. An edge is directed from source that the resulting logical graphs are added to the set of all graphs, 8 2 G 2 to target. A logical graph Gi = (Vi ; Ei ) (i 2 L) represents a subset of i.e., Gi Q : i L. vertices Vi ⊆ V and a subset of edges Ei ⊆ E such that 8v 2 Vi : i 2 2.3 Cypher ery Language l (v) and 8e 2 Ei : s(e); t (e) 2 Vi ^ i 2 l (e). T is a set of type labels For the paern matching operator, it is necessary to declare a query and τ : L [ V [ E ! T assigns a label to a graph, vertex or edge. graph Q. For this purpose, we adopt core features of Cypher, the Similarly, properties are dened by a set of property keys K, a set graph query language of Neo4j. Since Cypher is designed on top of of property values A and a mapping κ : (L [ V [ E) × K ! A [ fεg, the property graph model, it can also be used to express a query on a where ε is returned if the element has no value for a given key. A logical graph. Furthermore, there is an ongoing eort to standardize graph collection G = fG1;G2; :::;Gn g is a set of logical graphs. Cypher as a graph query language within the openCypher3 project. Figure 1 shows an example social network represented by a single Having the social network of Figure 1 in mind, we give an exam- logical graph containing persons, universities and cities as well as ple for searching pairs of persons who study at the University of their mutual relations. e EPGM further denes a set of operators Leipzig, have dierent genders and know each other either directly to analyze logical graphs and graph collections [12].