Non-Standard Database Systems
Graph Databases

Nikolaus Augsten
[email protected]
Department of Computer Sciences (FB Computerwissenschaften), Universität Salzburg
http://dbresearch.uni-salzburg.at

Summer semester 2016
Version of June 9, 2016

Contents

1 Introduction to Graphs
2 Property Graph Model
3 Graph Database Implementations

Augsten (Univ. Salzburg), NSDB – Graph Databases, summer semester 2016

1 Introduction to Graphs

Storing Data in Graphs – Examples

[Figure: social network. Nodes: (Name: Alice, Age: 34), (Name: Bob, Age: 27), (Name: Clare, Age: 29). Edge labels: knows, knows, dislikes.]
© Lena Wiese: Advanced Data Management, DeGruyter, 2015.

[Figure: road network. Nodes: (City: Hannover, Population: 522K), (City: Braunschweig, Population: 248K), (City: Hildesheim, Population: 102K). Edge labels: 65km, 35km, 45km.]
© Lena Wiese: Advanced Data Management, DeGruyter, 2015.

Graph Terms

- graph $G = (V, E)$
- $V$: set of nodes (node = vertex)
- $E$: set of edges
- adjacent nodes (= neighbors) are connected by an edge
- an edge is incident to a node if it is connected to the node
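
The terms from the Graph Terms slide can be made concrete with a minimal sketch. The node names and edges below are made up for illustration and are not taken from the figures.

```python
# Illustrative graph G = (V, E); nodes and edges are assumptions for this sketch.
V = {"a", "b", "c", "d"}
E = {frozenset({"a", "b"}), frozenset({"b", "c"})}

def is_incident(edge, node):
    """An edge is incident to a node if the node is one of its endpoints."""
    return node in edge

def neighbors(node):
    """Adjacent nodes (neighbors) are the other endpoints of incident edges."""
    return {other for edge in E if is_incident(edge, node)
            for other in edge if other != node}

print(sorted(neighbors("b")))  # ['a', 'c']
print(neighbors("d"))          # set(): d has no incident edge
```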

Different Types of Graphs

- simple undirected graph
- simple directed graph
- undirected multi-graph
- directed multi-graph
- weighted graphs

Simple Undirected Graphs

[Figure: nodes v1, v2, v3 connected by undirected edges e1, e2, e3.]
© Lena Wiese: Advanced Data Management, DeGruyter, 2015.

- edges are (unordered) two-element subsets of $V$, e.g., $\{v_1, v_3\} = \{v_3, v_1\} \in E$
- complete graph: maximum of $\frac{n(n-1)}{2}$ edges for $n = |V|$ nodes (without self-loops)
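
The edge count of a complete simple undirected graph can be checked with a short sketch. Representing an undirected edge as a `frozenset` is one possible choice made here, not something the slides prescribe.

```python
from itertools import combinations

def complete_undirected_edges(nodes):
    """All two-element subsets of the node set, i.e. every possible edge without self-loops."""
    return {frozenset(pair) for pair in combinations(nodes, 2)}

nodes = ["v1", "v2", "v3", "v4"]
edges = complete_undirected_edges(nodes)
n = len(nodes)

# Unordered edges: {v1, v3} and {v3, v1} are the same edge.
print(frozenset({"v1", "v3"}) == frozenset({"v3", "v1"}))  # True
print(len(edges), n * (n - 1) // 2)                        # 6 6
```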


Simple Directed Graphs

[Figure: nodes v1, v2, v3 connected by directed edges e1, e2, e3.]
© Lena Wiese: Advanced Data Management, DeGruyter, 2015.

- edges $E \subseteq V \times V$ are (ordered) two-element tuples of $V$, e.g., $(v_1, v_3) \in E$, $(v_3, v_1) \notin E$
- source/tail node of an edge: the edge is outgoing from it (e.g., $v_1$ in $(v_1, v_3)$)
- target/head node of an edge: the edge is incoming to it (e.g., $v_3$ in $(v_1, v_3)$)
- complete graph: maximum of $n(n-1)$ edges for $n = |V|$ nodes (without self-loops)

Multigraphs

- a pair of nodes may be connected by multiple edges (in the same direction)

[Figure: undirected multigraph with nodes v1, v2, v3 and edges e1, e2, e3, e4.]
© Lena Wiese: Advanced Data Management, DeGruyter, 2015.

[Figure: directed multigraph with nodes v1, v2, v3 and edges e1, e2, e3, e4.]
© Lena Wiese: Advanced Data Management, DeGruyter, 2015.
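
A directed multigraph needs edge identifiers, because the tuple (tail, head) alone no longer identifies an edge. The sketch below uses the endpoints that the incidence list example of these slides gives for the directed multigraph (e1 = (v1, v2), e2 = (v2, v3), e3 = (v1, v3), e4 = (v1, v3)); storing the edges in a dictionary is an assumption made for illustration.

```python
from collections import Counter

# Edge id -> (tail, head); e3 and e4 are parallel edges in the same direction.
edges = {
    "e1": ("v1", "v2"),
    "e2": ("v2", "v3"),
    "e3": ("v1", "v3"),
    "e4": ("v1", "v3"),
}

multiplicity = Counter(edges.values())
print(multiplicity[("v1", "v3")])  # 2
print(multiplicity[("v3", "v1")])  # 0: direction matters
```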

Weighted Graph

- a weight (e.g., road distance) is assigned to edges

[Figure: nodes v1, v2, v3 with weighted edges e1: w1, e2: w2, e3: w3, e4: w4.]
© Lena Wiese: Advanced Data Management, DeGruyter, 2015.

Graph Traversals

- depth-first: visit start node, then recursively traverse all unvisited neighbors depth-first
- breadth-first: visit start node (distance 0), visit all neighbors (distance 1), then all other nodes in order of increasing distance
- Eulerian path/cycle: visit each edge exactly once
- Hamiltonian path/cycle: visit each vertex exactly once
- spanning tree: visit each vertex and a subset of edges such that visited vertices and edges form a tree
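
The two basic traversals can be sketched as follows. The example graph is hypothetical, and the dictionary of neighbor lists is just one convenient representation.

```python
from collections import deque

# Hypothetical undirected graph: node -> list of neighbors.
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

def depth_first(graph, start, visited=None):
    """Visit start, then recursively traverse each unvisited neighbor."""
    if visited is None:
        visited = []
    visited.append(start)
    for neighbor in graph[start]:
        if neighbor not in visited:
            depth_first(graph, neighbor, visited)
    return visited

def breadth_first(graph, start):
    """Visit nodes in order of increasing distance from start; return the distances."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance

print(depth_first(graph, "a"))    # ['a', 'b', 'd', 'c', 'e']
print(breadth_first(graph, "a"))  # {'a': 0, 'b': 1, 'c': 1, 'd': 2, 'e': 3}
```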


Graph Data Structures

- edge list
- adjacency matrix
- incidence matrix
- adjacency list
- incidence list

Edge List

- follows the mathematical definition: store edges $E$ and nodes $V$ as sets
- adding and deleting edges or nodes is efficient
- small memory footprint
- most queries are inefficient and require a search among all edges:
  - find all neighbors of a node
  - find incident edges in a directed graph
  - traverse a specific path
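
A minimal edge list sketch shows both sides of the trade-off: updates are simple set operations, while a neighbor query scans all of $E$. The node names and edges are assumptions for this example.

```python
# Edge list for a directed graph: plain sets of nodes and (tail, head) tuples.
nodes = {"v1", "v2", "v3"}
edges = {("v1", "v2"), ("v2", "v3"), ("v1", "v3")}

def add_edge(tail, head):
    """Adding an edge is a cheap set insertion."""
    nodes.update((tail, head))
    edges.add((tail, head))

def out_neighbors(node):
    """Finding neighbors requires a scan over every edge."""
    return {head for tail, head in edges if tail == node}

add_edge("v3", "v4")
print(sorted(out_neighbors("v1")))  # ['v2', 'v3']
print(sorted(out_neighbors("v3")))  # ['v4']
```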

Adjacency Matrix

- matrix $A$ of size $|V| \times |V|$
- element $a_{i,j}$ is the number of (directed) edges between $v_i$ and $v_j$
- adjacency matrix for undirected graphs is symmetric
- adding/deleting nodes is problematic, adding/deleting edges is efficient
- storage size $O(|V|^2)$, large overhead if the graph is sparse (small average degree, i.e., few edges per node)
- edge lookup by tail and head nodes is very efficient
- finding incident edges requires scanning a matrix row or column

Incidence Matrix

- matrix $B$ of size $|V| \times |E|$
- element $b_{i,j}$ is 1 if edge $e_j$ is incident to node $v_i$ (-1 for an outgoing edge in a directed graph)
- adding/deleting nodes/edges is problematic
- less memory than the adjacency matrix for sparse graphs since there are no zero-only columns
- storage size may grow to $O(|V|^3)$ (since $|E| = O(|V|^2)$ in a complete graph)
- checking for the existence of an edge between a vertex pair is expensive
- finding incident edges requires searching a matrix row
- finding the head for a given edge tail requires searching a column
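
Both matrices can be built for the directed multigraph used in the later list examples of these slides (e1 = (v1, v2), e2 = (v2, v3), e3 = (v1, v3), e4 = (v1, v3)). This is a sketch with plain nested lists; the sign convention follows the slide (-1 at the tail, where the edge is outgoing, and 1 at the head).

```python
nodes = ["v1", "v2", "v3"]
edges = [("v1", "v2"), ("v2", "v3"), ("v1", "v3"), ("v1", "v3")]  # e1..e4
index = {node: i for i, node in enumerate(nodes)}

# Adjacency matrix A (|V| x |V|): a[i][j] counts the edges from v_i to v_j.
A = [[0] * len(nodes) for _ in nodes]
for tail, head in edges:
    A[index[tail]][index[head]] += 1

# Incidence matrix B (|V| x |E|): column j describes edge e_j.
B = [[0] * len(edges) for _ in nodes]
for j, (tail, head) in enumerate(edges):
    B[index[tail]][j] = -1  # outgoing edge
    B[index[head]][j] = 1   # incoming edge

print(A)  # [[0, 1, 2], [0, 0, 1], [0, 0, 0]]
print(B)  # [[-1, 0, -1, -1], [1, -1, 0, 0], [0, 1, 1, 1]]
```

Looking up the edges from v1 to v3 is a single access `A[0][2]`, whereas in `B` it requires comparing columns.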


Adjacency List

- each vertex stores a linked list of adjacent vertices, one entry per incident edge (only outgoing edges in a directed graph)
- edges are not stored explicitly
- adding/deleting nodes/edges is efficient
- finding all neighbors is efficient
- small memory footprint
- checking the existence of an edge between a vertex pair requires a search in the adjacency list
- finding incoming edges in directed graphs is inefficient (solution: forward and backward search adjacency list)

Adjacency List – Examples

simple, undirected graph
[Figure: nodes v1, v2, v3 with edges e1, e2, e3.]
© Lena Wiese: Advanced Data Management, DeGruyter, 2015.

  v1: v2, v3
  v2: v1, v3
  v3: v1, v2

directed multigraph
[Figure: nodes v1, v2, v3 with edges e1, e2, e3, e4.]
© Lena Wiese: Advanced Data Management, DeGruyter, 2015.

  v1: v2, v3, v3
  v2: v3
  v3: (empty)
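
The forward and backward adjacency lists mentioned above can be sketched as follows. The class and method names are hypothetical, and Python lists stand in for linked lists. The edges are those of the directed multigraph example.

```python
from collections import defaultdict

class AdjacencyList:
    """Directed multigraph with a forward and a backward adjacency list."""

    def __init__(self):
        self.forward = defaultdict(list)   # node -> heads of its outgoing edges
        self.backward = defaultdict(list)  # node -> tails of its incoming edges

    def add_edge(self, tail, head):
        self.forward[tail].append(head)
        self.backward[head].append(tail)

    def has_edge(self, tail, head):
        # Existence check is a linear search in the list of the tail node.
        return head in self.forward.get(tail, [])

g = AdjacencyList()
for tail, head in [("v1", "v2"), ("v2", "v3"), ("v1", "v3"), ("v1", "v3")]:
    g.add_edge(tail, head)

print(g.forward["v1"])   # ['v2', 'v3', 'v3']
print(g.backward["v3"])  # ['v2', 'v1', 'v1']
```

Without the backward list, finding the incoming edges of v3 would require scanning the lists of all other nodes.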

Incidence List

- each vertex stores a linked list of incident edges (only outgoing edges in a directed graph)
- edges are listed explicitly such that information can be stored with edges
- finding all neighbors is efficient
- small memory footprint
- checking the existence of an edge between a vertex pair requires a search in the incidence list
- finding incoming edges in directed graphs is inefficient (solution: forward and backward search incidence list)

Incidence List – Examples

simple, undirected graph
[Figure: nodes v1, v2, v3 with edges e1, e2, e3.]
© Lena Wiese: Advanced Data Management, DeGruyter, 2015.

  v1: e1 {v1, v2}, e3 {v1, v3}
  v2: e1 {v1, v2}, e2 {v2, v3}
  v3: e3 {v1, v3}, e2 {v2, v3}

directed multigraph
[Figure: nodes v1, v2, v3 with edges e1, e2, e3, e4.]
© Lena Wiese: Advanced Data Management, DeGruyter, 2015.

  v1: e1 (v1, v2), e3 (v1, v3), e4 (v1, v3)
  v2: e2 (v2, v3)
  v3: (empty)
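
In contrast to an adjacency list, an incidence list keeps explicit edge objects, so information can be attached to each edge. The sketch below rebuilds the directed multigraph example; the `Edge` class and the `weight` values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    edge_id: str
    tail: str
    head: str
    info: dict = field(default_factory=dict)  # data stored with the edge

# Node -> list of outgoing edges.
incidence = {"v1": [], "v2": [], "v3": []}

def add_edge(edge):
    incidence[edge.tail].append(edge)

for edge in (
    Edge("e1", "v1", "v2"),
    Edge("e2", "v2", "v3"),
    Edge("e3", "v1", "v3", {"weight": 3}),  # weights are made up
    Edge("e4", "v1", "v3", {"weight": 5}),
):
    add_edge(edge)

def edges_between(tail, head):
    """Existence check: search the incidence list of the tail node."""
    return [edge.edge_id for edge in incidence[tail] if edge.head == head]

print([edge.edge_id for edge in incidence["v1"]])  # ['e1', 'e3', 'e4']
print(edges_between("v1", "v3"))                   # ['e3', 'e4']
```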


2 Property Graph Model

Property Graph Model

- directed, multi-relational, labeled multi-graph
- multi-relational
  - single-relational graph: only one "kind" of nodes/edges
  - multi-relational graph: nodes and edges have a type
- labels
  - node label is the node type
  - edge label is the edge type
- nodes and edges may have attributes
  - name:value pairs
  - name is the key (e.g., age)
  - value has a domain (e.g., non-negative integer)
- each node and each edge has an explicit ID
- only one edge of a specific type is allowed between a given pair of nodes
- restrictions on edges can be defined (e.g., edges of type "likes" allowed only between nodes of type "person")

Property Graph – Social Network Example

[Figure: property graph. Nodes: (Id: 1, Label: Person, Name: Alice, Age: 34), (Id: 2, Label: Person, Name: Bob, Age: 27), (Id: 3, Label: Person, Name: Charlene, Age: 29). Edges: (Id: 4, Label: knows, since: 31-21-2009), (Id: 5, Label: knows, since: 10-04-2011), (Id: 6, Label: dislikes).]
© Lena Wiese: Advanced Data Management, DeGruyter, 2015.

Property Graph – Social Network Example (continued)

- multiple edges between a node pair are only allowed if they differ by type

[Figure: the same property graph with a second dislikes edge (Id: 7, Label: dislikes) marked "not allowed!".]
© Lena Wiese: Advanced Data Management, DeGruyter, 2015.
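
The rule that parallel edges must differ by type can be enforced at insertion time. The following is a toy sketch, not the API of any particular graph database. The class `PropertyGraph` is hypothetical, the edge endpoints and directions are assumptions (they cannot be recovered from the figure), and the uniqueness check treats a node pair as an ordered (source, target) pair.

```python
class PropertyGraph:
    """Toy property graph with explicit IDs, labels, and attributes."""

    def __init__(self):
        self.nodes = {}  # id -> (label, attributes)
        self.edges = {}  # id -> (label, source id, target id, attributes)
        self._next_id = 1

    def _new_id(self):
        new_id = self._next_id
        self._next_id += 1
        return new_id

    def add_node(self, label, **attributes):
        node_id = self._new_id()
        self.nodes[node_id] = (label, attributes)
        return node_id

    def add_edge(self, label, source, target, **attributes):
        # Only one edge of a given type between a given (source, target) pair.
        for other_label, other_source, other_target, _ in self.edges.values():
            if (other_label, other_source, other_target) == (label, source, target):
                raise ValueError(
                    f"edge of type {label!r} already exists between {source} and {target}"
                )
        edge_id = self._new_id()
        self.edges[edge_id] = (label, source, target, attributes)
        return edge_id

g = PropertyGraph()
alice = g.add_node("Person", Name="Alice", Age=34)        # Id 1
bob = g.add_node("Person", Name="Bob", Age=27)            # Id 2
charlene = g.add_node("Person", Name="Charlene", Age=29)  # Id 3
g.add_edge("knows", alice, bob)                           # Id 4, endpoints assumed
g.add_edge("knows", bob, charlene, since="10-04-2011")    # Id 5, endpoints assumed
g.add_edge("dislikes", alice, charlene)                   # Id 6, endpoints assumed

try:
    g.add_edge("dislikes", alice, charlene)               # second dislikes edge
except ValueError as error:
    print("rejected:", error)
```

An edge of a different type between the same pair, such as `g.add_edge("knows", alice, charlene)`, is accepted.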


Storing Property Graphs in Relations

Alternative 1:
- nodes and their attributes:
    Node(NodeID, NodeLabel)
    Person(NodeID, Name, Age)
    $\pi_{NodeID}(Person) \subseteq \pi_{NodeID}(Node)$
- edges and their attributes:
    Edge(EdgeID, EdgeLabel, Source, Target)
    Knows(EdgeID, Since)
    $\pi_{EdgeID}(Knows) \subseteq \pi_{EdgeID}(Edge)$
    $\pi_{Source}(Edge) \subseteq \pi_{NodeID}(Node)$
    $\pi_{Target}(Edge) \subseteq \pi_{NodeID}(Node)$

Alternative 2:
- general attribute table: Attributes(ID, Name, Value)
- ID is edge or node ID, Name is attribute key
- problem: values may be of different types
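
Alternative 1 maps directly to SQL tables, with the inclusion dependencies expressed as foreign keys. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the column types and the sample rows (including the direction of the knows edge) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE Node   (NodeID INTEGER PRIMARY KEY, NodeLabel TEXT NOT NULL);
CREATE TABLE Person (NodeID INTEGER PRIMARY KEY REFERENCES Node(NodeID),
                     Name TEXT, Age INTEGER);
CREATE TABLE Edge   (EdgeID INTEGER PRIMARY KEY, EdgeLabel TEXT NOT NULL,
                     Source INTEGER NOT NULL REFERENCES Node(NodeID),
                     Target INTEGER NOT NULL REFERENCES Node(NodeID));
CREATE TABLE Knows  (EdgeID INTEGER PRIMARY KEY REFERENCES Edge(EdgeID),
                     Since TEXT);
""")

# Sample rows; the edge direction Bob -> Charlene is an assumption.
conn.executemany("INSERT INTO Node VALUES (?, ?)", [(2, "Person"), (3, "Person")])
conn.executemany("INSERT INTO Person VALUES (?, ?, ?)",
                 [(2, "Bob", 27), (3, "Charlene", 29)])
conn.execute("INSERT INTO Edge VALUES (5, 'knows', 2, 3)")
conn.execute("INSERT INTO Knows VALUES (5, '10-04-2011')")

# Names of all persons that Bob knows: one join per traversal step.
known = conn.execute("""
    SELECT p2.Name
    FROM Person p1
    JOIN Edge e    ON e.Source = p1.NodeID AND e.EdgeLabel = 'knows'
    JOIN Person p2 ON p2.NodeID = e.Target
    WHERE p1.Name = 'Bob'
""").fetchall()
print(known)  # [('Charlene',)]
```

Each additional hop along a path adds another join over `Edge`, which is one reason why dedicated graph stores use adjacency or incidence structures instead.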

3 Graph Database Implementations

Apache TinkerPop

- Java interfaces for property graphs
- Gremlin traversal language: queries over TinkerPop graphs
- TinkerPop-enabled databases implement these interfaces:1
  - Hadoop (Giraph) - OLAP graph processor using Giraph
  - Hadoop (Spark) - OLAP graph processor using Spark
  - Neo4j - OLTP graph database
  - Sqlg - RDBMS OLTP implementation with HSQLDB and PostgreSQL support
  - TinkerGraph - In-memory OLTP and OLAP reference implementation
  - Titan - Distributed OLTP and OLAP graph database with BerkeleyDB, Cassandra and HBase support
  - ...
- storage backend can be substituted without changing the code

TinkerPop Structure API

- Graph: set of edges and vertices
- Element: has a label and a collection of properties
- Vertex: Element with incoming and outgoing edges
- Edge: Element with one incoming and one outgoing vertex
- Property: attribute key:value pair, key is of type string, Property<V> allows only values of type V
- VertexProperty: Property with a collection of key value pairs (i.e., allows for nested properties)

1 see http://tinkerpop.incubator.apache.org

TinkerPop Structure API – Code Example

    import org.apache.tinkerpop.gremlin.structure.Graph;
    import org.apache.tinkerpop.gremlin.structure.Vertex;
    import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

    Graph g = TinkerGraph.open();                       // in-memory reference implementation
    Vertex alice = g.addVertex("name", "Alice");        // key/value pairs become properties
    alice.property("age", 34);                          // add a property to an existing vertex
    Vertex bob = g.addVertex("name", "Bob");
    alice.addEdge("knows", bob, "knows_since", 2010);   // edge label "knows" with one property

TinkerPop Graph Process API

- defines "traversals" in the graph
- traversal: definition of how the graph should be traversed (starting with nodes or edges)
- returns a GraphTraversal object (iterator)
- code example: names of all nodes that Alice knows

    g.traversal().V().has("name", "Alice").out("knows").values("name");

- Gremlin console is an interpreter for the Gremlin query language
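
To see what the traversal steps compute, the query can be emulated on plain data structures. The sketch below is not TinkerPop or Gremlin code. The functions `has`, `out`, and `values` are hypothetical stand-ins that only mimic the meaning of the corresponding steps, applied to the graph built in the structure API example.

```python
# Vertex id -> properties; the ids are made up for this sketch.
vertices = {1: {"name": "Alice", "age": 34}, 2: {"name": "Bob"}}
# (source id, label, target id, properties)
edges = [(1, "knows", 2, {"knows_since": 2010})]

def has(vertex_ids, key, value):
    """Keep the vertices whose property `key` equals `value`."""
    return [v for v in vertex_ids if vertices[v].get(key) == value]

def out(vertex_ids, label):
    """Follow outgoing edges with the given label to their target vertices."""
    return [target for v in vertex_ids
            for source, edge_label, target, _ in edges
            if source == v and edge_label == label]

def values(vertex_ids, key):
    """Project the vertices onto one property."""
    return [vertices[v][key] for v in vertex_ids if key in vertices[v]]

names = values(out(has(vertices.keys(), "name", "Alice"), "knows"), "name")
print(names)  # ['Bob']
```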

Neo4J

- widely used graph database for property graphs
- support for ACID transactions (but eventual consistency with replicas)
- support for replication
- properties
  - Apache Lucene indices for properties
  - property names are strings
  - property values can be strings, booleans, numbers, or arrays
- Cypher query language:

    START alice = (people_index, name, "Alice")
    MATCH (alice)-[:knows]->(aperson)
    RETURN (aperson)

- TinkerPop enabled

Neo4J Clusters – Updates and Replication

- master node and slaves with full replication
- updates on slaves
  - slave must be up-to-date
  - acquire lock on slave and master
  - commit on master first
- replication
  - push from master to slaves
  - optimistic: commit happens before push is successful
  - eventual consistency: outdated reads on slave are possible
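
The effect of optimistic replication can be illustrated with a toy model. It is not Neo4j's actual protocol: the class `Cluster` is hypothetical, locking is ignored, and the delayed push is simulated by an explicit method call.

```python
class Cluster:
    """Toy master/slave cluster with full replication and delayed pushes."""

    def __init__(self, slave_count):
        self.master = {}
        self.slaves = [{} for _ in range(slave_count)]
        self.pending = []  # pushes that have not reached the slaves yet

    def write(self, key, value):
        # Commit on the master first; the push to the slaves happens later.
        self.master[key] = value
        self.pending.append((key, value))

    def deliver_pushes(self):
        # Simulates the push from master to slaves arriving.
        for key, value in self.pending:
            for slave in self.slaves:
                slave[key] = value
        self.pending.clear()

    def read_from_slave(self, slave_index, key):
        return self.slaves[slave_index].get(key)

cluster = Cluster(slave_count=2)
cluster.write("alice.age", 34)
stale = cluster.read_from_slave(0, "alice.age")  # committed, but not pushed yet
cluster.deliver_pushes()
fresh = cluster.read_from_slave(0, "alice.age")
print(stale, fresh)  # None 34
```

The read between commit and push returns outdated data, which is the eventual consistency named on the slide.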


Neo4J Clusters – Availability

- failing nodes are detected and marked
- master fails:
  - other nodes elect new master
  - master needs quorum
  - no writes during master election
- network partitioning2:
  - writes only on (strict) majority partition with master
  - minority partition cannot elect a new master
  - minority partition with master cannot perform writes
  - reads are possible in any minority partition

Resource Description Framework – RDF

- RDF stores so-called "linked data"
- RDF stores graphs as triples
  - subject (source node): string or URI
  - object (target node): string or URI
  - predicate (edge source → target): string or URI
- based on XML
- RDF databases are called "triple stores"
  - RDF3X (based on relations, joins, and B-tree indexes)
  - Blazegraph - RDF graph database with OLTP support
  - Oracle Spatial and Graph
  - ...
- common query language: SPARQL
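
The triple representation from the RDF slide can be sketched with a set of (subject, predicate, object) tuples and a pattern match in which `None` acts as a wildcard. This is plain Python, not SPARQL or an RDF library, and the sample triples (including the edge directions) are assumptions.

```python
# Hypothetical triples: (subject, predicate, object).
triples = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "charlene"),
    ("alice", "dislikes", "charlene"),
}

def match(subject=None, predicate=None, obj=None):
    """Return all triples that agree with the given components (None = wildcard)."""
    return sorted(
        triple for triple in triples
        if (subject is None or triple[0] == subject)
        and (predicate is None or triple[1] == predicate)
        and (obj is None or triple[2] == obj)
    )

print(match(subject="alice"))
# [('alice', 'dislikes', 'charlene'), ('alice', 'knows', 'bob')]
print(match(predicate="knows", obj="charlene"))
# [('bob', 'knows', 'charlene')]
```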

2 see http://neo4j.com/resources/understanding-neo4j-scalability-white-paper/
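
The partitioning rules on the Neo4J Clusters – Availability slide reduce to a strict majority test. The function names below are hypothetical, and the sketch only encodes the rules as stated on the slide.

```python
def has_quorum(partition_size, cluster_size):
    """A strict majority means more than half of all cluster nodes."""
    return 2 * partition_size > cluster_size

def can_elect_master(partition_size, cluster_size):
    """A minority partition cannot elect a new master."""
    return has_quorum(partition_size, cluster_size)

def can_write(partition_size, cluster_size, has_master):
    """Writes are only possible on a majority partition that has a master."""
    return has_master and has_quorum(partition_size, cluster_size)

# Five-node cluster split into partitions of 3 and 2 nodes.
print(can_write(3, 5, has_master=True))   # True
print(can_write(2, 5, has_master=True))   # False: minority partition with master
print(can_elect_master(2, 5))             # False
# Four-node cluster split 2/2: neither side has a strict majority.
print(can_elect_master(2, 4))             # False
```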