Regular Path Query Evaluation on Streaming Graphs

Anil Pacaci 1 Angela Bonifati 2 M. Tamer Ozsu¨ 1

1University of Waterloo

2Lyon 1 University

1 Characteristics of Streaming Graphs

I Streams are unbounded – no global access I High streaming rates in real-world applications

Streaming Graphs

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 y x u

b a a

u z v

v

y a b

u

x a

z

t10 2 Characteristics of Streaming Graphs

I Streams are unbounded – no global access I High streaming rates in real-world applications

Streaming Graphs

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 y x u u

b a a b

u z v x

v v a y y a b b

u u b

x x a a

z z

t10 t11 2 Characteristics of Streaming Graphs

I Streams are unbounded – no global access I High streaming rates in real-world applications

Streaming Graphs

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 y x u u

b a a b

u z v x

v v a y y a b b

u u b

x x a a

z z

t10 t11 2 Characteristics of Streaming Graphs

I Streams are unbounded – no global access I High streaming rates in real-world applications

Streaming Graphs

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 y x u u x

b a a b a

u z v x y

v v v a a y y a y b b b

u u a u

b b

x x x a a a

z z z

t10 t11 t13 2 Characteristics of Streaming Graphs

I Streams are unbounded – no global access I High streaming rates in real-world applications

Streaming Graphs

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 y x u u x z

b a a b a b

u z v x y u

v v v v a a y y a y y a b b b b

u u a u a u

b b b

x x x x b a a a a

z z z z

t10 t11 t13 t14 2 Characteristics of Streaming Graphs

I Streams are unbounded – no global access I High streaming rates in real-world applications

Streaming Graphs

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 y x u u x z

b a a b a b

u z v x y u

v v v v v a a y y a y y a y a b b b b

u u a u a u a u

b b b b

x x x x b x b a a a a a

z z z z z

t10 t11 t13 t14 t15 2 Characteristics of Streaming Graphs

I Streams are unbounded – no global access I High streaming rates in real-world applications

Streaming Graphs

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 y x u u x z u b a a b a b Xb u z v x y u x

v v v v v v a a a y y a y y a y a y b b b b

u u a u a u a u a u

b b b b

x x x x b x b x b a a a a a a

z z z z z z

t10 t11 t13 t14 t15 t16 2 Streaming Graphs

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 y x u u x z u b a a b a b Xb u z v x y u x Characteristics of Streaming Graphs

I Streamsv are unboundedv – nov global accessv v v I High streaming rates in real-world applications a a a y y a y y a y a y b b b b

u u a u a u a u a u

b b b b

x x x x b x b x b a a a a a a

z z z z z z

t10 t11 t13 t14 t15 t16 2 Intrusion detection on networks [Kent et al., 2015; Choudhury et al., 2015]

TCP TCP

Attacker TCP Victim TCP Bots

(b) Denial-of-service (DOS) attack

(Taken from [Choudhury et al., 2015])

Applications of Streaming Graphs

Fraud detection in e-commerce [Qiu et al., 2018]

Criminal Merchant credit 2 3 payment

4 transfer transfer 1 1 Middleman accounts

(a) Credit-card fraud

(Taken from [Qiu et al., 2018])

3 Applications of Streaming Graphs

Fraud detection in e-commerce [Qiu et al., 2018] Intrusion detection on networks [Kent et al., 2015; Choudhury et al., 2015]

Criminal Merchant credit 2 3 payment TCP TCP 4 transfer transfer Attacker TCP Victim TCP 1 1 Bots Middleman accounts

(a) Credit-card fraud (b) Denial-of-service (DOS) attack

(Taken from [Qiu et al., 2018]) (Taken from [Choudhury et al., 2015])

3 Applications of Streaming Graphs

Fraud detection in e-commerce [Qiu et al., 2018] Intrusion detection on networks [Kent et al., 2015; Choudhury et al., 2015]

OurCriminal setting: Query processingMerchant over streaming graphs credit 2 3 payment TCP I Persistent graph queries on streaming dataTCP Real-time results that are4 continuously updated I transfer transfer Attacker TCP Victim TCP 1 1 Bots Middleman accounts

(a) Credit-card fraud (b) Denial-of-service (DOS) attack

(Taken from [Qiu et al., 2018]) (Taken from [Choudhury et al., 2015])

3 Path Navigation (follows · mentions)+

P1 P2

Path Navigation Queries Property paths in SPARQL v1.1 Single-label reachability in Cypher Path expressions in G-CORE & PGQL Regular Path Queries Generalized reachability & traversal Directed paths that match regular expressions

What is a graph query?

Graph queries feature: Subgraph Pattern c worksAt

worksAt

u1 u2 follows

4 Path Navigation Queries Property paths in SPARQL v1.1 Single-label reachability in Cypher Path expressions in G-CORE & PGQL Regular Path Queries Generalized reachability & traversal Directed paths that match regular expressions

What is a graph query?

Graph queries feature: Subgraph Pattern Path Navigation c (follows · mentions)+ worksAt

worksAt P1 P2 u1 u2 follows

4 Regular Path Queries Generalized reachability & traversal Directed paths that match regular expressions

What is a graph query?

Graph queries feature: Subgraph Pattern Path Navigation c (follows · mentions)+ worksAt

worksAt P1 P2 u1 u2 follows

Path Navigation Queries Property paths in SPARQL v1.1 Single-label reachability in Cypher Path expressions in G-CORE & PGQL

4 What is a graph query?

Graph queries feature: Subgraph Pattern Path Navigation c (follows · mentions)+ worksAt

worksAt P1 P2 u1 u2 follows

Path Navigation Queries Property paths in SPARQL v1.1 Single-label reachability in Cypher Path expressions in G-CORE & PGQL Regular Path Queries Generalized reachability & traversal Directed paths that match regular expressions

4 Tractability results [Mendelzon and Wood, 1995; Bagan et al., 2013] Cost-based planning for SPARQL property paths [Yakovets et al., 2016]

Windowed processing Incremental & non-blocking operators

Largely focus on pattern matching

RPQ evaluation on static graphs The objective of this paper

PersistentUnboundednessRPQ evaluation & high over streamingstreaming rates graphs

Continuous processing of streaming graphs

Regular Path Queries

RPQs are heavily used in practice SPARQL v1.1, Cypher, PGQL, G-CORE 1 in 4 queries in Wikidata query logs [Bonifati et al., 2019]

5 Windowed processing Incremental & non-blocking operators

Largely focus on pattern matching

The objective of this paper

PersistentUnboundednessRPQ evaluation & high over streamingstreaming rates graphs

Continuous processing of streaming graphs

Regular Path Queries

RPQs are heavily used in practice SPARQL v1.1, Cypher, PGQL, G-CORE 1 in 4 queries in Wikidata query logs [Bonifati et al., 2019] RPQ evaluation on static graphs Tractability results [Mendelzon and Wood, 1995; Bagan et al., 2013] Cost-based planning for SPARQL property paths [Yakovets et al., 2016]

5 Largely focus on pattern matching

The objective of this paper

Persistent RPQ evaluation over streaming graphs

Continuous processing of streaming graphs

Regular Path Queries

RPQs are heavily used in practice SPARQL v1.1, Cypher, PGQL, G-CORE 1 in 4 queries in Wikidata query logs [Bonifati et al., 2019] RPQ evaluation on static graphs Tractability results [Mendelzon and Wood, 1995; Bagan et al., 2013] Cost-based planning for SPARQL property paths [Yakovets et al., 2016] Unboundedness & high streaming rates Windowed processing Incremental & non-blocking operators

5 The objective of this paper

Persistent RPQ evaluation over streaming graphs

Regular Path Queries

RPQs are heavily used in practice SPARQL v1.1, Cypher, PGQL, G-CORE 1 in 4 queries in Wikidata query logs [Bonifati et al., 2019] RPQ evaluation on static graphs Tractability results [Mendelzon and Wood, 1995; Bagan et al., 2013] Cost-based planning for SPARQL property paths [Yakovets et al., 2016] Unboundedness & high streaming rates Windowed processing Incremental & non-blocking operators Continuous processing of streaming graphs Largely focus on pattern matching

5 Regular Path Queries

RPQs are heavily used in practice SPARQL v1.1, Cypher, PGQL, G-CORE 1 in 4 queries in Wikidata query logs [Bonifati et al., 2019] RPQ evaluation on static graphs Tractability results [Mendelzon and Wood, 1995; Bagan et al., 2013] The objectiveCost-based of this planning paper for SPARQL property paths [Yakovets et al., 2016] PersistentUnboundednessRPQ evaluation & high over streamingstreaming rates graphs Windowed processing Incremental & non-blocking operators Continuous processing of streaming graphs Largely focus on pattern matching

5 2 Path semantics used in practice Simple paths (no repeating vertex): navigation on road networks Arbitrary paths: reachability on communication networks 3 Result semantics & types Append-only streams with fast insertions Support for explicit deletions 4 Physical operators for path navigation queries Compatible with existing languages: Cypher, G-CORE, PGQL, SPARQL v1.1 Efficiency & efficacy in real-world workloads

Our Contributions

Persistent Regular Path Query Evaluation on Streaming Graphs

1 The design space of persistent RPQ Path semantics Result semantics & stream types

6 3 Result semantics & stream types Append-only streams with fast insertions Support for explicit deletions 4 Physical operators for path navigation queries Compatible with existing languages: Cypher, G-CORE, PGQL, SPARQL v1.1 Efficiency & efficacy in real-world workloads

Our Contributions

Persistent Regular Path Query Evaluation on Streaming Graphs

1 The design space of persistent RPQ algorithms Path semantics Result semantics & stream types 2 Path semantics used in practice Simple paths (no repeating vertex): navigation on road networks Arbitrary paths: reachability on communication networks

6 4 Physical operators for path navigation queries Compatible with existing languages: Cypher, G-CORE, PGQL, SPARQL v1.1 Efficiency & efficacy in real-world workloads

Our Contributions

Persistent Regular Path Query Evaluation on Streaming Graphs

1 The design space of persistent RPQ algorithms Path semantics Result semantics & stream types 2 Path semantics used in practice Simple paths (no repeating vertex): navigation on road networks Arbitrary paths: reachability on communication networks 3 Result semantics & stream types Append-only streams with fast insertions Support for explicit deletions

6 Our Contributions

Persistent Regular Path Query Evaluation on Streaming Graphs

1 The design space of persistent RPQ algorithms Path semantics Result semantics & stream types 2 Path semantics used in practice Simple paths (no repeating vertex): navigation on road networks Arbitrary paths: reachability on communication networks 3 Result semantics & stream types Append-only streams with fast insertions Support for explicit deletions 4 Physical operators for path navigation queries Compatible with existing languages: Cypher, G-CORE, PGQL, SPARQL v1.1 Efficiency & efficacy in real-world workloads

6 Our Solution

7 Spanning- index to encode partial results

Design Space for Streaming RPQ

An automata-based Automata transition graph to guide traversals

8 Design Space for Streaming RPQ

An automata-based algorithm Automata transition graph to guide traversals Spanning-tree index to encode partial results

8 Design Space for Streaming RPQ

An automata-based algorithm Automata transition graph to guide traversals Spanning-tree index to encode partial results

Result Semantics Path Semantics Append-only Explicit Deletions

Arbitrary O(n · k2) O(n2 · k) Simple1 O(n · k2) O(n2 · k).

1 These results hold in the absence of conflicts, a condition on cyclic structure of the query and graph. 8 Design Space for Streaming RPQ

An automata-based algorithm Automata transition graph to guide traversals Spanning-tree index to encode partial results

Result Semantics Path Semantics Append-only Explicit Deletions

Arbitrary O(n · k2) O(n2 · k) Simple1 O(n · k2) O(n2 · k).

1 These results hold in the absence of conflicts, a condition on cyclic structure of the query and graph. 8 Cross-edge

Append-only Algorithm for Arbitrary Path Semantics

I O(n) amortized insertion cost (n = # of vertices in the(x, 0) window W ) I Expired tuples are removed in batches I Explicit windows are supported via negative tuples(y, 1) (z, 1) I A variation of Counting and DRed [Gupta et al., 1993] (u, 2) (w, 2)

(v, 1)

(y, 2)

Incremental Maintenance of Spanning Trees

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 y x u u x

b a a b a

u z v x y

+ Q1 = (a · b) a

start s0 s1 s2 a b

9 Append-only Algorithm for Arbitrary Path Semantics

I O(n) amortized insertion cost (n = # of vertices in the window W ) I Expired tuples are removed in batches I Explicit windows are supported via negative tuples I A variation of Counting and DRed [Gupta et al., 1993]

Cross-edge

Incremental Maintenance of Spanning Trees

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 y x u u x

b a a b a

u z v x y

(x, 0) + Q1 = (a · b)

a (y, 1) (z, 1)

start s0 s1 s2 a b (u, 2) (w, 2)

(v, 1)

(y, 2)

9 Append-only Algorithm for Arbitrary Path Semantics

I O(n) amortized insertion cost (n = # of vertices in the window W ) I Expired tuples are removed in batches I Explicit windows are supported via negative tuples I A variation of Counting and DRed [Gupta et al., 1993]

Incremental Maintenance of Spanning Trees

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 y x u u x z

b a a b a b

u z v x y u

(x, 0) + Q1 = (a · b)

a (y, 1) (z, 1)

start s0 s1 s2 a b (u, 2) (w, 2)

Cross-edge (v, 1)

(y, 2)

9 Append-only Algorithm for Arbitrary Path Semantics

I O(n) amortized insertion cost (n = # of vertices in the window W ) I Expired tuples are removed in batches I Explicit windows are supported via negative tuples I A variation of Counting and DRed [Gupta et al., 1993]

Cross-edge

Incremental Maintenance of Spanning Trees

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 y x u u x z z

b a a b a b b

u z v x y u w

(x, 0) + Q1 = (a · b)

a (y, 1) (z, 1)

start s0 s1 s2 a b (u, 2) (w, 2)

(v, 1)

(y, 2)

9 Append-only Algorithm for Arbitrary Path Semantics

I O(n) amortized insertion cost (n = # of vertices in the window W ) I Expired tuples are removed in batches I Explicit windows are supported via negative tuples I A variation of Counting and DRed [Gupta et al., 1993]

Cross-edge

Incremental Maintenance of Spanning Trees

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 y x u u x z z

b a a b a b b

u z v x y u w

(x, 0) + Q1 = (a · b)

a (y, 1) (z, 1)

start s0 s1 s2 a b (u, 2) (w, 2)

(v, 1)

(y, 2)

9 Append-only Algorithm for Arbitrary Path Semantics

I O(n) amortized insertion cost (n = # of vertices in the window W ) I Expired tuples are removed in batches I Explicit windows are supported via negative tuples I A variation of Counting and DRed [Gupta et al., 1993]

Cross-edge

Incremental Maintenance of Spanning Trees

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 y x u u x z z

b a a b a b b

u z v x y u w

(x, 0) + Q1 = (a · b)

a (y, 1) (z, 1)

start s0 s1 s2 a b (u, 2) (w, 2)

(v, 1)

(y, 2)

9 Cross-edge

Incremental Maintenance of Spanning Trees

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 y x u u x z z

b a a b a b b Append-onlyu Algorithmz forv Arbitraryx Pathy u Semanticsw

I O(n) amortized insertion cost (n = # of vertices in the(x, 0) window W ) Q = (a · b)+ I Expired tuples1 are removed in batches a I Explicit windows are supported via negative tuples(y, 1) (z, 1) IstartA variations0 of Countings1 ands2 DRed [Gupta et al., 1993] a b (u, 2) (w, 2)

(v, 1)

(y, 2)

9 Append-only Algorithm for Arbitrary Path Semantics

I O(n) amortized insertion cost (n = # of vertices in the window W ) I Expired tuples are removed in batches I Explicit windows are supported via negative tuples I A variation of Counting and DRed [Gupta et al., 1993]

Cross-edge

Incremental Maintenance of Spanning Trees

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 y x u u x z z v

b a a b a b b b

u z v x y u w y

(x, 0) + Q1 = (a · b)

a (y, 1) (z, 1)

start s0 s1 s2 a b (u, 2) (w, 2)

(v, 1)

(y, 2)

9 Append-only Algorithm for Arbitrary Path Semantics

I O(n) amortized insertion cost (n = # of vertices in the window W ) I Expired tuples are removed in batches I Explicit windows are supported via negative tuples I A variation of Counting and DRed [Gupta et al., 1993]

Cross-edge

Incremental Maintenance of Spanning Trees

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 y x u u x z z v

b a a b a b b b

u z v x y u w y

(x, 0) + Q1 = (a · b)

a (y, 1) (z, 1)

start s0 s1 s2 a b (u, 2) (w, 2)

(v, 1)

(y, 2)

9 Append-only Algorithm for Arbitrary Path Semantics

I O(n) amortized insertion cost (n = # of vertices in the window W ) I Expired tuples are removed in batches I Explicit windows are supported via negative tuples I A variation of Counting and DRed [Gupta et al., 1993]

Incremental Maintenance of Spanning Trees

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 y x u u x z z v

b a a b a b b b

u z v x y u w y

(x, 0) + Q1 = (a · b)

a (y, 1) (z, 1)

start s0 s1 s2 a b (u, 2) (w, 2)

Cross-edge (v, 1)

(y, 2)

9 Optimized for append-only streams An optimistic conflict-detection mechanism Aggressive pruning – only if absolutely necessary Correctness & complexity proofs are in the paper

Conflict-freeness: A sufficient condition for efficient evaluation A condition on the cyclic structure of the graph and the automata An arbitrary path pa : u→v =⇒ a simple path ps : u→v Requires DFS ordering

Our contribution: A streaming algorithm for conflict detection

How to Find Simple Paths?

Finding simple paths for arbitrary graphs and queries is NP-complete [Mendelzon and Wood, 1995]

10 Optimized for append-only streams An optimistic conflict-detection mechanism Aggressive pruning – backtracking only if absolutely necessary Correctness & complexity proofs are in the paper

Requires DFS ordering

Our contribution: A streaming algorithm for conflict detection

How to Find Simple Paths?

Finding simple paths for arbitrary graphs and queries is NP-complete [Mendelzon and Wood, 1995] Conflict-freeness: A sufficient condition for efficient evaluation A condition on the cyclic structure of the graph and the automata An arbitrary path pa : u→v =⇒ a simple path ps : u→v

10 Optimized for append-only streams An optimistic conflict-detection mechanism Aggressive pruning – backtracking only if absolutely necessary Correctness & complexity proofs are in the paper

Our contribution: A streaming algorithm for conflict detection

How to Find Simple Paths?

Finding simple paths for arbitrary graphs and queries is NP-complete [Mendelzon and Wood, 1995] Conflict-freeness: A sufficient condition for efficient evaluation A condition on the cyclic structure of the graph and the automata An arbitrary path pa : u→v =⇒ a simple path ps : u→v Requires DFS ordering

10 Optimized for append-only streams An optimistic conflict-detection mechanism Aggressive pruning – backtracking only if absolutely necessary Correctness & complexity proofs are in the paper

How to Find Simple Paths?

Finding simple paths for arbitrary graphs and queries is NP-complete [Mendelzon and Wood, 1995] Conflict-freeness: A sufficient condition for efficient evaluation A condition on the cyclic structure of the graph and the automata An arbitrary path pa : u→v =⇒ a simple path ps : u→v Requires DFS ordering

Our contribution: A streaming algorithm for conflict detection

10 An optimistic conflict-detection mechanism Aggressive pruning – backtracking only if absolutely necessary Correctness & complexity proofs are in the paper

How to Find Simple Paths?

Finding simple paths for arbitrary graphs and queries is NP-complete [Mendelzon and Wood, 1995] Conflict-freeness: A sufficient condition for efficient evaluation A condition on the cyclic structure of the graph and the automata An arbitrary path pa : u→v =⇒ a simple path ps : u→v Requires DFS ordering

Our contribution: A streaming algorithm for conflict detection Optimized for append-only streams

10 Aggressive pruning – backtracking only if absolutely necessary Correctness & complexity proofs are in the paper

How to Find Simple Paths?

Finding simple paths for arbitrary graphs and queries is NP-complete [Mendelzon and Wood, 1995] Conflict-freeness: A sufficient condition for efficient evaluation A condition on the cyclic structure of the graph and the automata An arbitrary path pa : u→v =⇒ a simple path ps : u→v Requires DFS ordering

Our contribution: A streaming algorithm for conflict detection Optimized for append-only streams An optimistic conflict-detection mechanism

10 Correctness & complexity proofs are in the paper

How to Find Simple Paths?

Finding simple paths for arbitrary graphs and queries is NP-complete [Mendelzon and Wood, 1995] Conflict-freeness: A sufficient condition for efficient evaluation A condition on the cyclic structure of the graph and the automata An arbitrary path pa : u→v =⇒ a simple path ps : u→v Requires DFS ordering

Our contribution: A streaming algorithm for conflict detection Optimized for append-only streams An optimistic conflict-detection mechanism Aggressive pruning – backtracking only if absolutely necessary

10 How to Find Simple Paths?

Finding simple paths for arbitrary graphs and queries is NP-complete [Mendelzon and Wood, 1995] Conflict-freeness: A sufficient condition for efficient evaluation A condition on the cyclic structure of the graph and the automata An arbitrary path pa : u→v =⇒ a simple path ps : u→v Requires DFS ordering

Our contribution: A streaming algorithm for conflict detection Optimized for append-only streams An optimistic conflict-detection mechanism Aggressive pruning – backtracking only if absolutely necessary Correctness & complexity proofs are in the paper

10 Over 70% of the workloads 2 − 5× impact on the tail latency

Update stream of LDBC SNB [Erling et al., 2015] Stackoverflow temporal graph Yago2s RDF graph Sub-millisecond tail latency (99thpercentile) Performance matching the complexity analysis Finding simple paths

Scalability analysis using synthetic RPQ workloads (gMark [Bagan et al., 2016])

Performance Results

Most common RPQs in Wikidata query logs [Bonifati et al., 2019]

11 Over 70% of the workloads 2 − 5× impact on the tail latency

Sub-millisecond tail latency (99thpercentile) Performance matching the complexity analysis Finding simple paths

Scalability analysis using synthetic RPQ workloads (gMark [Bagan et al., 2016])

Performance Results

Most common RPQs in Wikidata query logs [Bonifati et al., 2019] Update stream of LDBC SNB [Erling et al., 2015] Stackoverflow temporal graph Yago2s RDF graph

11 Over 70% of the workloads 2 − 5× impact on the tail latency

Performance matching the complexity analysis Finding simple paths

Scalability analysis using synthetic RPQ workloads (gMark [Bagan et al., 2016])

Performance Results

Most common RPQs in Wikidata query logs [Bonifati et al., 2019] Update stream of LDBC SNB [Erling et al., 2015] Stackoverflow temporal graph Yago2s RDF graph Sub-millisecond tail latency (99thpercentile)

11 Over 70% of the workloads 2 − 5× impact on the tail latency

Finding simple paths

Scalability analysis using synthetic RPQ workloads (gMark [Bagan et al., 2016])

Performance Results

Most common RPQs in Wikidata query logs [Bonifati et al., 2019] Update stream of LDBC SNB [Erling et al., 2015] Stackoverflow temporal graph Yago2s RDF graph Sub-millisecond tail latency (99thpercentile) Performance matching the complexity analysis

11 Scalability analysis using synthetic RPQ workloads (gMark [Bagan et al., 2016])

Performance Results

Most common RPQs in Wikidata query logs [Bonifati et al., 2019] Update stream of LDBC SNB [Erling et al., 2015] Stackoverflow temporal graph Yago2s RDF graph Sub-millisecond tail latency (99thpercentile) Performance matching the complexity analysis Finding simple paths Over 70% of the workloads 2 − 5× impact on the tail latency

11 Performance Results

Most common RPQs in Wikidata query logs [Bonifati et al., 2019] Update stream of LDBC SNB [Erling et al., 2015] Stackoverflow temporal graph Yago2s RDF graph Sub-millisecond tail latency (99thpercentile) Performance matching the complexity analysis Finding simple paths Over 70% of the workloads 2 − 5× impact on the tail latency Scalability analysis using synthetic RPQ workloads (gMark [Bagan et al., 2016])

11 Support for explicit edge deletions Implicit & explicit window semantics

Safely prune the search space to avoid exhaustive search Tractable evaluation on conflict-free graphs (matching tractability results [Mendelzon and Wood, 1995])

Scalability w.r.t. the window & query size Feasibility of simple path semantics

2 Persistent RPQs under arbitrary path semantics

3 Streaming conflict-detection algorithm

4 Extensive of experiments with real-world workloads [Bonifati et al., 2019]

Wrapping-up

1 Design space of non-blocking RPQ operators Path semantics Result semantics

12 Safely prune the search space to avoid exhaustive search Tractable evaluation on conflict-free graphs (matching tractability results [Mendelzon and Wood, 1995])

Scalability w.r.t. the window & query size Feasibility of simple path semantics

3 Streaming conflict-detection algorithm

4 Extensive of experiments with real-world workloads [Bonifati et al., 2019]

Wrapping-up

1 Design space of non-blocking RPQ operators Path semantics Result semantics 2 Persistent RPQs under arbitrary path semantics Support for explicit edge deletions Implicit & explicit window semantics

12 Scalability w.r.t. the window & query size Feasibility of simple path semantics

4 Extensive of experiments with real-world workloads [Bonifati et al., 2019]

Wrapping-up

1 Design space of non-blocking RPQ operators Path semantics Result semantics 2 Persistent RPQs under arbitrary path semantics Support for explicit edge deletions Implicit & explicit window semantics 3 Streaming conflict-detection algorithm Safely prune the search space to avoid exhaustive search Tractable evaluation on conflict-free graphs (matching tractability results [Mendelzon and Wood, 1995])

12 Wrapping-up

1 Design space of non-blocking RPQ operators Path semantics Result semantics 2 Persistent RPQs under arbitrary path semantics Support for explicit edge deletions Implicit & explicit window semantics 3 Streaming conflict-detection algorithm Safely prune the search space to avoid exhaustive search Tractable evaluation on conflict-free graphs (matching tractability results [Mendelzon and Wood, 1995]) 4 Extensive of experiments with real-world workloads [Bonifati et al., 2019] Scalability w.r.t. the window & query size Feasibility of simple path semantics

12 13 ReferencesI

Bagan, G., Bonifati, A., Ciucanu, R., Fletcher, G. H., Lemay, A., and Advokaat, N. (2016). gmark: schema-driven generation of graphs and queries. IEEE Trans. Knowl. and Data Eng., 29(4):856–869. Bagan, G., Bonifati, A., and Groz, B. (2013). A trichotomy for regular simple path queries on graphs. In Proc. 32nd ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, pages 261–272. Barbieri, D. F., Braga, D., Ceri, S., Della Valle, E., and Grossniklaus, M. (2009). C-sparql: Sparql for continuous querying. In Proc. 18th Int. World Wide Web Conf., pages 1061–1062. Bonifati, A., Martens, W., and Timm, T. (2019). Navigating the maze of wikidata query logs. In Proc. 28th Int. World Wide Web Conf., pages 127–138. ACM. Calbimonte, J.-P., Corcho, O., and Gray, A. J. (2010). Enabling ontology-based access to streaming data sources. In Proc. 9th Int. Semantic Web Conf., pages 96–111. Springer. Choudhury, S., Holder, L. B., Jr, G. C., Agarwal, K., and Feo, J. (2015). A Selectivity based approach to Continuous Pattern Detection in Streaming Graphs. In Proc. 18th Int. Conf. on Extending Database Technology, pages 157–168.

14 ReferencesII

Dell’Aglio, D., Calbimonte, J.-P., Della Valle, E., and Corcho, O. (2015). Towards a unified language for rdf stream query processing. In Proc. 12th Extended Semantic Web Conf., pages 353–363. Springer. Erling, O., Averbuch, A., Larriba-Pey, J., Chafi, H., Gubichev, A., Prat, A., Pham, M.-D., and Boncz, P. (2015). The ldbc social network benchmark: Interactive workload. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 619–630. http://doi.acm.org/10.1145/2723372.2742786. Gupta, A., Mumick, I. S., and Subrahmanian, V. S. (1993). Maintaining views incrementally. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 157–166. Iyer, A. P., Li, L. E., Das, T., and Stoica, I. (2016). Time-evolving graph processing at scale. In Proc. 4th Int. Workshop on Graph Data Management Experiences and Systems, pages 1–6. Kent, A. D., Liebrock, L. M., and Neil, J. C. (2015). Authentication graphs: Analyzing user behavior within an enterprise network. Computers & Security, 48:150–166. Kumar, P. and Huang, H. H. (2020). Graphone: A data store for real-time analytics on evolving graphs. ACM Trans. Storage, 15(4):1–40.

15 ReferencesIII

Le-Phuoc, D., Dao-Tran, M., Parreira, J. X., and Hauswirth, M. (2011). A native and adaptive approach for unified processing of linked streams and linked data. In Proc. 10th Int. Semantic Web Conf., pages 370–388. Springer. Mariappan, M. and Vora, K. (2019). Graphbolt: Dependency-driven synchronous processing of streaming graphs. In Proc. 14th ACM SIGOPS/EuroSys European Conf. on Comp. Syst., pages 1–16. Mendelzon, A. O. and Wood, P. T. (1995). Finding regular simple paths in graph databases. SIAM J. on Comput., 24(6):1235–1258. Qiu, X., Cen, W., Qian, Z., Peng, Y., Zhang, Y., Lin, X., and Zhou, J. (2018). Real-time constrained cycle detection in large dynamic graphs. Proc. VLDB Endowment, 11(12):1876–1888. Yakovets, N., Godfrey, P., and Gryz, J. (2016). Query planning for evaluating sparql property paths. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1875–1889. ACM.

16 Additional Slides

17 This presentation is not

1 A streaming RDF engine C-SPARQL [Barbieri et al., 2009], CQELS [Le-Phuoc et al., 2011], SPARQLstream [Calbimonte et al., 2010], W3C RSP-QL [Dell’Aglio et al., 2015] Based on SPARQL v1.0 - no path navigation Our algorithms can be incorporated to provide incremental RPQ evaluation 2 A graph analytics engine for dynamic graphs GraphOne [Kumar and Huang, 2020], GraphBolt [Mariappan and Vora, 2019], GraphTau [Iyer et al., 2016] Efficient maintenance of graph snapshots Iterative graph analytics in the vertex-centric model We focus on persistent evaluation of path queries that are specified declaratively

18 Workloads

Table: Most common RPQs in real workloads [Bonifati et al., 2019].

Name Query Name Query ∗ ∗ Q1 a Q7 a ◦ b ◦ c ∗ ∗ Q2 a ◦ b Q8 a? ◦ b ∗ ∗ + Q3 a ◦ b ◦c Q9 (a1 + a2 + ··· + ak ) ∗ ∗ Q4 (a1 + a2 + ··· + ak ) Q10 (a1 + a2 + ··· + ak ) ◦ b ∗ Q5 a ◦ b ◦ c Q11 a1 ◦ a2 ◦ · · · ◦ ak ∗ ∗ Q6 a ◦ b

19 Throughput & Tail Latency

Throughput (edges/s) Tail Latency (μs) Throughput (edges/s) Tail Latency (μs)

105 9.2 × 101 105 104 9 × 101

104 8.8 × 101 104 103 8.6 × 101

3 3 10 8.4 × 101 10 102 Tail Latency (μs) Tail Latency (μs) 8.2 × 101 Throughput (edges/s) Throughput (edges/s) 102 102 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q1 Q2 Q3 Q5 Q6 Q7 Q11 Query Query

(a) Yago2s (b) LDBC SF10

Throughput (edges/s) Tail Latency (μs)

105

104 104

103 Tail Latency (μs) Throughput (edges/s) 102 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Query

(c) Stackoverflow 20 Figure: Throughput and tail latency of the Algorithm ??. Y axis is given in log-scale. Index Size

# of Trees # of Nodes

40 20.0 35 17.5 30 15.0 25 12.5 20 10.0 15 7.5

10 # of nodes (1M) # of trees(1000) 5.0

2.5 5

0.0 0 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Query

Figure: Size of the tree index ∆ on the SO graph with |W | = 30days.

21 Scalability - Tail Latency (99thPercentile)

Q1 Q3 Q5 Q7 Q9 Q11 Q2 Q4 Q6 Q8 Q10

3 × 102

2 × 102 Tail latency (μs)

5M 10M 15M 20M 0.5M 1M 1.5M 2M Window size Slide size Figure: Tail latency with various |W | and β

22 Scalability - Window Management Time

Q1 Q3 Q5 Q7 Q9 Q11 Q2 Q4 Q6 Q8 Q10

3 × 105

2 × 105

5

Expiry Time (μs) 10

5M 10M 15M 20M 0.5M 1M 1.5M 2M Window size Slide size Figure: The average window maintenance cost with various |W | and β

23 Explicit Deletions

Q1 Q3 Q5 Q7 Q9 Q11 Q2 Q4 Q6 Q8 Q10

140

130

120

110

100 Tail Latency (μs) 90

80 0 2 4 6 8 10 % of explicit deletions

Figure: Impact of the ratio of explicit deletions on tail latency for all queries on Yago2s RDF graph with |W | =≈ 10M tuples.

24 Impact of DFA Construction

12 ) s

120 / s e

10 g )

100 d k ( e

s 8 3 e 0

t 80 1 a ( t

t s

6 u f 60 p o

h g #

4 40 u o r h

2 20 T 2 4 6 8 10 12 14 16 18 20 2 4 6 8 10 12 Query size (|QR|) # of states (k)

Figure: The impact of the query length |QR | on the automata size, k, and the throughput. RPQs are generated using gMark [Bagan et al., 2016] where the query size ranges from 2 to 20. Each RPQ is formulated by grouping labels into concatenations and alternations of size up to 3, and each group has a 50% probability of having ∗ and +.

25