Approximations of Consistent Query Answers 1

Jef Wijsen

UMONS

DaQuaTa International Workshop 2016 Lyon, 12–13 December 2016

1 Joint work with Floris Geerts, Paris Koutris, and Fabian Pijcke.

Jef Wijsen (UMONS) Approximations of CQA DaQuaTa 2016

Outline

1 Motivation

2 On the Complexity of Embracing Primary Key Violations

3 First-Order Under-Approximations Of Consistent Query Answers

4 Beyond (Un)certainty: Counts and Probabilities

5 Attack Graphs, a Complexity Classification Tool

6 Final Thoughts

1 Motivation

Data Quality

Consistent and Complete Relational Database

Integrity constraints are satisfied. The database contains all (and only) the facts that are true (Closed-World Assumption). No missing values.

Dealing with Imperfect Data

It is common to have inconsistent data, incomplete data, missing data, uncertain data, ... What can we do with this data?

Some data quality problems already have principled solutions: incomplete data [IJ84], NULLs in SQL [Lib16], ...

Embrace Imperfections

Example

P
PID  FirstName  LastName  BloodType  Gender  ···
1    John       Adams     NULL       M       ···
2    Jan        Peeters   A+         M       ···
3    Jean       Dubois    A+         M       ···
3    Jean       Dubois    AB+        M       ···

Caveat

If we embrace imperfections, we should rethink query answering. For example, what should be the answer to the following queries?

SELECT COUNT(DISTINCT PID)
FROM P
WHERE BloodType = 'A+';

SELECT COUNT(DISTINCT PID)
FROM P
WHERE BloodType <> 'A+';
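To make the caveat concrete, the two queries can be run unchanged on the example table. The sketch below is illustrative only: it uses Python's built-in sqlite3 module, and it assumes a table P declared without a primary-key constraint so that the two conflicting rows for PID 3 can be stored side by side.

```python
import sqlite3

# In-memory table P without a key constraint (an assumption for this sketch),
# so the conflicting rows for PID 3 and the NULL blood type can coexist.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE P (PID INTEGER, FirstName TEXT, LastName TEXT, "
    "BloodType TEXT, Gender TEXT)"
)
conn.executemany(
    "INSERT INTO P VALUES (?, ?, ?, ?, ?)",
    [
        (1, "John", "Adams", None, "M"),
        (2, "Jan", "Peeters", "A+", "M"),
        (3, "Jean", "Dubois", "A+", "M"),
        (3, "Jean", "Dubois", "AB+", "M"),
    ],
)

count_a_plus = conn.execute(
    "SELECT COUNT(DISTINCT PID) FROM P WHERE BloodType = 'A+'"
).fetchone()[0]
count_not_a_plus = conn.execute(
    "SELECT COUNT(DISTINCT PID) FROM P WHERE BloodType <> 'A+'"
).fetchone()[0]
conn.close()

print(count_a_plus, count_not_a_plus)  # 2 1
```

Standard SQL counts person 3 in both answers and person 1 in neither, which is hard to defend as the intended meaning of either query.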

2 On the Complexity of Embracing Primary Key Violations

Embrace Primary Key Violations

Data model

We allow (primary) key violations.

Example (keys are underlined in the slides: Agent in WorksFor, Dept in ManagedBy)

WorksFor
Agent     Dept
Sherlock  MI6
- - - - - - - -
James     CIA
James     MI6

ManagedBy
Dept  Mgr   Budget
CIA   John  60M
- - - - - - - - -
MI6   Alex  60M

⟹ James works for either CIA or MI6.

Definition (Block)

A block is a maximal set of tuples of the same relation with the same value for the key. (Blocks are separated by dashed lines.)
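The block definition amounts to grouping the tuples of a relation by their key value. A minimal sketch follows, assuming relations are lists of Python tuples whose key is a prefix of the tuple; the helper name blocks and its key_len parameter are hypothetical.

```python
from collections import defaultdict

def blocks(relation, key_len=1):
    """Group the tuples of one relation by their key (the first key_len attributes)."""
    grouped = defaultdict(list)
    for t in relation:
        grouped[t[:key_len]].append(t)
    return list(grouped.values())

works_for = [("Sherlock", "MI6"), ("James", "CIA"), ("James", "MI6")]
managed_by = [("CIA", "John", "60M"), ("MI6", "Alex", "60M")]

works_for_blocks = blocks(works_for)
managed_by_blocks = blocks(managed_by)

for block in works_for_blocks:
    print(block)
```

A block with more than one tuple, such as the one for James, is exactly a primary key violation.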

Certainty Semantics

Definition (Repair and Certainty)

A repair is obtained by selecting exactly one tuple from each block. A Boolean query is certain if it is true in all repairs.

Certainty semantics

WorksFor
Agent     Dept
Sherlock  MI6
- - - - - - - -
James     CIA
James     MI6

ManagedBy
Dept  Mgr   Budget
CIA   John  60M
- - - - - - - - -
MI6   Alex  60M

Is the budget of James’ department equal to 60M?
∃d∃m (WorksFor(‘James’, d) ∧ ManagedBy(d, m, ‘60M’)) is certain.

Is James’ department managed by Alex?
∃d∃b (WorksFor(‘James’, d) ∧ ManagedBy(d, ‘Alex’, b)) is not certain.
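The two definitions can be checked by brute force on this small database. The sketch below enumerates all repairs and tests each query in every one of them; it assumes the key of each relation is its first attribute, and the helper names (blocks, repairs, is_certain) are hypothetical. It is exponential in general and only meant to illustrate the semantics.

```python
from itertools import product

def blocks(relation):
    """Group tuples by their key, assumed to be the first attribute."""
    grouped = {}
    for t in relation:
        grouped.setdefault(t[0], []).append(t)
    return list(grouped.values())

def repairs(db):
    """Yield every repair: exactly one tuple chosen from each block of each relation."""
    names = list(db)
    per_relation = [list(product(*blocks(db[name]))) for name in names]
    for combo in product(*per_relation):
        yield {name: set(chosen) for name, chosen in zip(names, combo)}

def is_certain(query, db):
    """A Boolean query is certain if it is true in all repairs."""
    return all(query(r) for r in repairs(db))

db = {
    "WorksFor": [("Sherlock", "MI6"), ("James", "CIA"), ("James", "MI6")],
    "ManagedBy": [("CIA", "John", "60M"), ("MI6", "Alex", "60M")],
}

def budget_is_60m(r):
    # ∃d∃m (WorksFor('James', d) ∧ ManagedBy(d, m, '60M'))
    return any(a == "James" and d == d2 and b == "60M"
               for (a, d) in r["WorksFor"] for (d2, m, b) in r["ManagedBy"])

def managed_by_alex(r):
    # ∃d∃b (WorksFor('James', d) ∧ ManagedBy(d, 'Alex', b))
    return any(a == "James" and d == d2 and m == "Alex"
               for (a, d) in r["WorksFor"] for (d2, m, b) in r["ManagedBy"])

number_of_repairs = len(list(repairs(db)))
print(number_of_repairs, is_certain(budget_is_60m, db), is_certain(managed_by_alex, db))
```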

The Computational Complexity of Deciding Certainty I

Relation with exponentially many repairs

WorksFor
Agent  Dept
1      MI6
1      CIA
- - - - - -
2      MI6
2      CIA
- - - - - -
...    ...
- - - - - -
n      MI6
n      CIA

This WorksFor relation contains 2n tuples and has (√2)^(2n) = 2^n distinct repairs.
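The count follows because a repair makes one independent choice per block, so the number of repairs is the product of the block sizes. A small sketch, with hypothetical helper names and the key assumed to be the first attribute:

```python
from math import prod

def works_for(n):
    """The relation from the slide: agents 1..n, each listed under both MI6 and CIA."""
    return [(agent, dept) for agent in range(1, n + 1) for dept in ("MI6", "CIA")]

def count_repairs(relation):
    """Number of repairs = product of the block sizes (key = first attribute)."""
    sizes = {}
    for t in relation:
        sizes[t[0]] = sizes.get(t[0], 0) + 1
    return prod(sizes.values())

for n in (1, 5, 10, 20):
    print(n, len(works_for(n)), count_repairs(works_for(n)))
```

Enumerating repairs is therefore not a viable general algorithm, which motivates the complexity analysis that follows.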

The Computational Complexity of Deciding Certainty II

Example of Low Complexity

Let

q1 = ∃d∃b (WorksFor(‘James’, d) ∧ ManagedBy(d, ‘Alex’, b))

For example, q1 is certain in the following database:

ManagedBy
Dept  Mgr   Budget
CIA   Alex  50M
CIA   Alex  60M
- - - - - - - - -
MI6   Alex  60M

WorksFor
Agent  Dept
James  CIA
James  MI6

One can verify that q1 is certain iff the following query is true:

∃d WorksFor(‘James’, d) ∧
∀d (WorksFor(‘James’, d) → ∃m∃b [ManagedBy(d, m, b) ∧ ∀m∀b (ManagedBy(d, m, b) → m = ‘Alex’)])
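The rewriting can be evaluated directly on the inconsistent database, without enumerating repairs. The sketch below transcribes the formula into Python and cross-checks it against brute-force enumeration on two small databases; the function names are hypothetical and the key of each relation is assumed to be its first attribute.

```python
from itertools import product

def q1_rewriting(db):
    """Evaluate the first-order rewriting of q1 on the (possibly inconsistent) database."""
    james_depts = {d for (a, d) in db["WorksFor"] if a == "James"}

    def always_alex(dept):
        managers = [m for (d, m, b) in db["ManagedBy"] if d == dept]
        # ∃m∃b ManagedBy(dept, m, b) ∧ ∀m∀b (ManagedBy(dept, m, b) → m = 'Alex')
        return bool(managers) and all(m == "Alex" for m in managers)

    # ∃d WorksFor('James', d) ∧ ∀d (WorksFor('James', d) → ...)
    return bool(james_depts) and all(always_alex(d) for d in james_depts)

def blocks(relation):
    grouped = {}
    for t in relation:
        grouped.setdefault(t[0], []).append(t)
    return list(grouped.values())

def q1_certain_by_enumeration(db):
    """Reference check: q1 holds in every repair (exponential in general)."""
    names = list(db)
    per_relation = [list(product(*blocks(db[name]))) for name in names]
    for combo in product(*per_relation):
        r = dict(zip(names, combo))
        if not any(a == "James" and d == d2 and m == "Alex"
                   for (a, d) in r["WorksFor"] for (d2, m, b) in r["ManagedBy"]):
            return False
    return True

db_certain = {
    "ManagedBy": [("CIA", "Alex", "50M"), ("CIA", "Alex", "60M"), ("MI6", "Alex", "60M")],
    "WorksFor": [("James", "CIA"), ("James", "MI6")],
}
db_not_certain = {
    "ManagedBy": [("CIA", "John", "60M"), ("MI6", "Alex", "60M")],
    "WorksFor": [("Sherlock", "MI6"), ("James", "CIA"), ("James", "MI6")],
}

print(q1_rewriting(db_certain), q1_rewriting(db_not_certain))
```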

The Computational Complexity of Deciding Certainty III

Example of Higher Complexity

Is some department self-managed (i.e., managed by an agent of the department)? Let

q_self-managed = ∃d∃m∃b (ManagedBy(d, m, b) ∧ WorksFor(m, d))

For example, q_self-managed is certain in the following database:

ManagedBy
Dept  Mgr       Budget
CIA   James     60M
- - - - - - - - - - - -
MI6   James     60M
MI6   Sherlock  60M

WorksFor
Agent     Dept
James     CIA
James     MI6
- - - - - - - -
Sherlock  MI6

No first-order query can decide whether q_self-managed is certain [Wij10]. Intuition: neither d nor m can be “skolemized.”

The Computational Complexity of Deciding Certainty IV

Definition

For every Boolean first-order query q, the problem CERTAINTY(q) is the following:
Input: A database instance (possibly with key violations)
Question: Is q certain?

Complexity Classification Task

Input: A Boolean first-order query q
Question: What complexity classes does CERTAINTY(q) belong to?

Complexity classes of interest:

FO ⊆ P ⊆ coNP

Complexity in FO is of interest to database practitioners, because it allows for implementation in SQL.
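As an illustration of what "implementation in SQL" can look like, the sketch below encodes a first-order rewriting for q1 = ∃d∃b (WorksFor(‘James’, d) ∧ ManagedBy(d, ‘Alex’, b)) as a single SQL statement and runs it with Python's built-in sqlite3 module. The SQL text, the function name, and the column names are assumptions made for this example; the universal quantifiers are expressed through NOT EXISTS.

```python
import sqlite3

# Hypothetical SQL encoding: James works somewhere, and no department of James
# is unmanaged or has a candidate manager other than Alex.
REWRITING_SQL = """
SELECT EXISTS (SELECT 1 FROM WorksFor WHERE Agent = 'James')
   AND NOT EXISTS (
         SELECT 1
         FROM WorksFor AS w
         WHERE w.Agent = 'James'
           AND ( NOT EXISTS (SELECT 1 FROM ManagedBy AS m WHERE m.Dept = w.Dept)
                 OR EXISTS (SELECT 1 FROM ManagedBy AS m
                            WHERE m.Dept = w.Dept AND m.Mgr <> 'Alex') )
       )
"""

def q1_is_certain(works_for, managed_by):
    """Load the (possibly inconsistent) relations and evaluate the rewriting once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE WorksFor (Agent TEXT, Dept TEXT)")
    conn.execute("CREATE TABLE ManagedBy (Dept TEXT, Mgr TEXT, Budget TEXT)")
    conn.executemany("INSERT INTO WorksFor VALUES (?, ?)", works_for)
    conn.executemany("INSERT INTO ManagedBy VALUES (?, ?, ?)", managed_by)
    (result,) = conn.execute(REWRITING_SQL).fetchone()
    conn.close()
    return bool(result)

certain_case = q1_is_certain(
    [("James", "CIA"), ("James", "MI6")],
    [("CIA", "Alex", "50M"), ("CIA", "Alex", "60M"), ("MI6", "Alex", "60M")],
)
uncertain_case = q1_is_certain(
    [("Sherlock", "MI6"), ("James", "CIA"), ("James", "MI6")],
    [("CIA", "John", "60M"), ("MI6", "Alex", "60M")],
)
print(certain_case, uncertain_case)
```

The statement is evaluated once on the stored data, so its cost does not depend on the number of repairs.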

Main Result

We solved the aforementioned complexity classification task when the input queries q are conjunctive and self-join-free (i.e., no relation name occurs more than once in q):

Theorem (Complexity Classification)

For every self-join-free Boolean conjunctive query q, the following hold:
1 CERTAINTY(q) is either in P or coNP-complete (and the dichotomy is decidable);
2 it can be decided whether CERTAINTY(q) is in FO; and
3 if CERTAINTY(q) is in FO, then its first-order definition can be computed effectively.

The theorem settles a conjecture that had been open for 10 years. The ACM SIGMOD Research Highlight Award 2015 was awarded to [KW15].

The Geography of coNP (assuming P ≠ coNP)

[Figure: nested complexity classes. FO lies inside P, which lies inside coNP; the part of coNP outside P contains the coNP-intermediate and the coNP-complete problems.]

Examples of Different Complexities

Example

q1 = ∃d∃b (WorksFor(‘James’, d) ∧ ManagedBy(d, ‘Alex’, b))

q_self-managed = ∃d∃m∃b (ManagedBy(d, m, b) ∧ WorksFor(m, d))

q3 = ∃d∃m∃b∃a (ManagedBy(d, m, b) ∧ WorksFor(a, m)) (a)

Our results allow us to tell that
CERTAINTY(q1) is in FO;
CERTAINTY(q_self-managed) is in P but not in FO; and
CERTAINTY(q3) is coNP-complete.

(a) A meaningless query for our example database.

Open Problems

Conjecture

For every Boolean conjunctive query q, CERTAINTY(q) is in P or coNP-complete.

Conjecture

For every query q that is a finite disjunction of Boolean conjunctive queries, CERTAINTY(q) is in P or coNP-complete.

Caveat

It is known [Fon13] that the latter conjecture implies Bulatov’s complexity dichotomy theorem for conservative CSP [Bul11], the proof of which is very involved (the full paper contains 66 pages).

3 First-Order Under-Approximations Of Consistent Query Answers

CQA for Open Queries

Definition (Consistent query answer)

Let q be an open query (i.e., containing at least one free variable). Given a database db, the consistent answer to q is defined by

⋂ { q(r) | r is a repair of db }.

We write ⌊q⌋ for the query that maps db to the consistent answer to q.

Example

q = {x | WorksFor(x, ‘MI6’)}

db:
WorksFor
Agent     Dept
Sherlock  MI6
- - - - - - - -
James     CIA
James     MI6

q returns ‘Sherlock’ and ‘James’; ⌊q⌋ returns only ‘Sherlock’.

Free Variables Can be Treated as Constants

Note: For every relational calculus query q(x⃗) and database db,

c⃗ ∈ ⌊q⌋(db) ⟺ the Boolean query q[x⃗↦c⃗] is certain in db.

Carry Over of Complexity

For every self-join-free conjunctive query q(x⃗) and sequence c⃗ of constants (of the same length as x⃗):
CERTAINTY(q[x⃗↦c⃗]) is in FO ⟺ ⌊q⌋ can be expressed in relational calculus (or SQL);
CERTAINTY(q[x⃗↦c⃗]) is in P ⟺ ⌊q⌋ can be computed in polynomial time; and
CERTAINTY(q[x⃗↦c⃗]) is coNP-hard ⟹ ⌊q⌋ cannot be computed in polynomial time (unless P = coNP).
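Both the definition of the consistent answer and the note on free variables can be checked on the WorksFor example with q = {x | WorksFor(x, ‘MI6’)}. The sketch below computes the consistent answer as an intersection over all repairs, then confirms that a constant belongs to it exactly when the instantiated Boolean query is certain. All helper names are hypothetical, and the key is assumed to be the first attribute.

```python
from itertools import product

def blocks(relation):
    grouped = {}
    for t in relation:
        grouped.setdefault(t[0], []).append(t)
    return list(grouped.values())

def repairs(relation):
    """Yield every repair of a single relation as a set of tuples."""
    for choice in product(*blocks(relation)):
        yield set(choice)

def q(r):
    # q = {x | WorksFor(x, 'MI6')}
    return {a for (a, d) in r if d == "MI6"}

def consistent_answer(query, relation):
    """Intersection of query(r) over all repairs r."""
    return set.intersection(*(query(r) for r in repairs(relation)))

def instantiated_is_certain(c, relation):
    """Certainty of the Boolean query obtained by replacing x with the constant c."""
    return all((c, "MI6") in r for r in repairs(relation))

works_for = [("Sherlock", "MI6"), ("James", "CIA"), ("James", "MI6")]

plain_answer = q(set(works_for))
certain_answer = consistent_answer(q, works_for)
print(sorted(plain_answer), sorted(certain_answer))
```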

Restricted Setting for CQA I

Restricted Setting for CQA II

Setting

Database owner Bob answers queries on his database db, subject to two postulates:
Consistent query answering: Inconsistencies must not be divulged.
Upper bounded complexity: Only queries with low data complexity (in FO) will be answered.
Database querier Alice can post-process query answers by means of (finitely many) first-order operations.

Strategy

Suppose Alice wants to ask query q. What is her best strategy to get a maximal subset of q(db), without false positives?

Bob’s Interface

Bob only returns consistent query answers computable with low complexity.

Input to the CQA interface on db: a first-order query q, restricted in the final version of the slide to a self-join-free conjunctive query q.
Output: if ⌊q⌋ is in FO then ⌊q⌋(db) else reject.

Strategy Examples I

WorksFor
Agent     Dept
James     CIA
James     MI6
- - - - - - - -
Sherlock  MI6

Example

Alice wants to answer q = {a | ∃d1∃d2 (WorksFor(a, d1) ∧ WorksFor(a, d2) ∧ d1 ≠ d2)}.

Alice: q1 = {a | ∃d WorksFor(a, d)}
Bob: ‘Sherlock’ and ‘James’.

Alice: q2 = {⟨a, d⟩ | WorksFor(a, d)}
Bob: ⟨‘Sherlock’, ‘MI6’⟩.

Alice: The answer to q is ‘James’.

Alice’s strategy can be summarized as follows: {a | ⌊q1⌋ ∧ ¬∃d ⌊q2⌋}
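Alice's dialogue with Bob can be replayed in a few lines. In the sketch below, Bob's side is simulated by intersecting query answers over all repairs, and Alice's side only combines the two consistent answers she receives; the result is compared with evaluating q directly on the inconsistent relation. Helper names are hypothetical, and the key of WorksFor is assumed to be Agent.

```python
from itertools import product

def blocks(relation):
    grouped = {}
    for t in relation:
        grouped.setdefault(t[0], []).append(t)
    return list(grouped.values())

def consistent_answer(query, relation):
    """Bob's side: intersect query(r) over every repair r of the relation."""
    return set.intersection(*(query(set(choice))
                              for choice in product(*blocks(relation))))

works_for = [("James", "CIA"), ("James", "MI6"), ("Sherlock", "MI6")]

def q1(r):
    # q1 = {a | ∃d WorksFor(a, d)}
    return {a for (a, d) in r}

def q2(r):
    # q2 = {<a, d> | WorksFor(a, d)}
    return set(r)

bob_q1 = consistent_answer(q1, works_for)
bob_q2 = consistent_answer(q2, works_for)

# Alice's post-processing: agents in Bob's first answer with no tuple in his second.
alice_answer = {a for a in bob_q1 if not any(a2 == a for (a2, d) in bob_q2)}

# Direct evaluation of q on the inconsistent relation, for comparison only.
direct_answer = {a1 for (a1, d1) in works_for for (a2, d2) in works_for
                 if a1 == a2 and d1 != d2}

print(sorted(bob_q1), sorted(bob_q2), sorted(alice_answer))
```

Although Bob never divulges an inconsistent tuple, Alice can still infer which agents violate the key.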

Strategy Examples II

Example

Alice wants to get budgets of self-managed departments:

q_self-managed = {b | ∃d∃m (ManagedBy(d, m, b) ∧ WorksFor(m, d))}.

Since ⌊q_self-managed⌋ is not in FO, the query q_self-managed will be rejected. The following queries will not be rejected:

q0 = {d, m, b | ManagedBy(d, m, b)} and q1 = {m, d | WorksFor(m, d)}

q2 = {d, b | ∃m (ManagedBy(d, m, b) ∧ WorksFor(m, d))}

q3 = {a, b | ∃d (WorksFor(a, d) ∧ ManagedBy(d, a, b))}

Some strategies:

s01 = {b | ∃d∃m (⌊q0⌋ ∧ ⌊q1⌋)}
s2 = {b | ∃d ⌊q2⌋}
s3 = {b | ∃a ⌊q3⌋}

Since s01 ⊆ s2 and s01 ⊆ s3, the strategy s2 ∪ s3 seems optimal.

Strategy Examples III

Example (Continued)

q2 = {d, b | ∃m (ManagedBy(d, m, b) ∧ WorksFor(m, d))}

q3 = {a, b | ∃d (WorksFor(a, d) ∧ ManagedBy(d, a, b))}

On the next db, ⌊q2⌋ returns {⟨‘CIA’, ‘60M’⟩}, while ⌊q3⌋ returns ∅.

ManagedBy                 WorksFor
Dept  Mgr   Budget        Agent  Dept
CIA   John  60M           John   CIA
CIA   Alex  60M           Alex   CIA

On the next db, ⌊q3⌋ returns {⟨‘James’, ‘50M’⟩}, while ⌊q2⌋ returns ∅.

WorksFor                  ManagedBy
Agent  Dept               Dept  Mgr    Budget
James  MI6                MI6   James  50M
James  MI7                MI7   James  50M

No perfect strategy

Since every strategy is in FO, but ⌊q_self-managed⌋ is not in FO, no strategy is equivalent to ⌊q_self-managed⌋.
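
The two databases above can be checked with a brute-force sketch. It assumes that the first attribute of each relation is its primary key (Dept for ManagedBy, Agent for WorksFor). The function names are illustrative, and the repair enumeration is exponential in general.

```python
from itertools import product

def repairs(relations):
    """relations: name -> list of facts; the first component is the (assumed) primary key."""
    names, choices = [], []
    for name, facts in relations.items():
        blocks = {}
        for fact in facts:
            blocks.setdefault(fact[0], []).append(fact)
        names.append(name)
        choices.append([set(pick) for pick in product(*blocks.values())])
    return [dict(zip(names, combo)) for combo in product(*choices)]

def consistent(query, relations):
    """Answers that hold in every repair."""
    return set.intersection(*(query(r) for r in repairs(relations)))

def q2(db):
    return {(d, b) for (d, m, b) in db["ManagedBy"] if (m, d) in db["WorksFor"]}

def q3(db):
    return {(a, b) for (a, d) in db["WorksFor"]
            for (d2, m, b) in db["ManagedBy"] if d2 == d and m == a}

def s2_union_s3(relations):
    """The strategy s2 ∪ s3: project both consistent answers onto the budget."""
    return ({b for (_, b) in consistent(q2, relations)}
            | {b for (_, b) in consistent(q3, relations)})

db1 = {"ManagedBy": [("CIA", "John", "60M"), ("CIA", "Alex", "60M")],
       "WorksFor": [("John", "CIA"), ("Alex", "CIA")]}
db2 = {"WorksFor": [("James", "MI6"), ("James", "MI7")],
       "ManagedBy": [("MI6", "James", "50M"), ("MI7", "James", "50M")]}

print(consistent(q2, db1), consistent(q3, db1))  # {('CIA', '60M')} set()
print(consistent(q2, db2), consistent(q3, db2))  # set() {('James', '50M')}
```

On both databases the union s2 ∪ s3 recovers the budget, whereas each of s2 and s3 misses it on one of them.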

Wrap Up

CQAFO
If q is a self-join-free conjunctive query such that ⌊q⌋ is in FO, then ⌊q⌋ is an (atomic) CQAFO query.
CQAFO is closed under first-order operations (∧, ∨, ¬, ∃, ∀).

Open problem
Input: Self-join-free conjunctive query q
Question: Construct a CQAFO query ϕ such that
Under-Approximation: ϕ ⊆ ⌊q⌋; and
Maximality: for every CQAFO query ϕ′ such that ϕ ⊆ ϕ′ ⊆ ⌊q⌋, we have ϕ ≡ ϕ′.

Studied in [GPW16] for the case where post-processing uses only ∨ and ∃.

4 Beyond (Un)certainty: Counts and Probabilities

Counting Semantics

♯CERTAINTY(q)
For a Boolean first-order query q, the counting problem ♯CERTAINTY(q) is:
INPUT: A database instance db (possibly with key violations)
QUESTION: How many repairs of db satisfy q?

Counting semantics

T                              R
Town     Country  Michelin     Conf  Year  Town
Mons     Belgium  ∗            EDBT  2015  Mons
Mons     Belgium  ∗∗           EDBT  2015  Brussel
Brussel  Belgium  ∗∗

q1 = ∃y T(‘Mons’, y, ‘∗∗’) ⟹ true in 2 repairs
q2 = ∃x∃y∃z (R(‘EDBT’, x, y) ∧ T(y, ‘Belgium’, z)) ⟹ true in 4 repairs
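
The two counts can be reproduced with a brute-force sketch. It assumes that Town is the key of T and that (Conf, Year) is the key of R, since the underlining of key positions is not visible in this text. The helper names are illustrative.

```python
from itertools import product

# Assumption: Town is the key of T, and (Conf, Year) is the key of R.
T = [("Mons", "Belgium", "*"), ("Mons", "Belgium", "**"), ("Brussel", "Belgium", "**")]
R = [("EDBT", "2015", "Mons"), ("EDBT", "2015", "Brussel")]

def blocks(facts, key_len):
    """Group facts into blocks of key-equal facts."""
    grouped = {}
    for fact in facts:
        grouped.setdefault(fact[:key_len], []).append(fact)
    return list(grouped.values())

def repairs():
    """Yield every repair as a pair (T-facts, R-facts), one fact per block."""
    t_blocks, r_blocks = blocks(T, 1), blocks(R, 2)
    for pick in product(*t_blocks, *r_blocks):
        yield set(pick[:len(t_blocks)]), set(pick[len(t_blocks):])

def q1(t, r):
    return any(town == "Mons" and stars == "**" for (town, _, stars) in t)

def q2(t, r):
    return any(conf == "EDBT" and town == t_town and country == "Belgium"
               for (conf, _, town) in r for (t_town, country, _) in t)

def count(query):
    """Number of repairs satisfying the Boolean query."""
    return sum(1 for t, r in repairs() if query(t, r))

print(count(q1), count(q2))  # 2 4
```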

Complexity Dichotomies for ♯CERTAINTY(q)

Theorem ([MW13])

For every self-join-free Boolean conjunctive query q, ♯CERTAINTY(q) is either in FP or ♯P-complete, and it is decidable which of the two cases applies.

Theorem ([MW14])

For every Boolean conjunctive query q in which all relation names are simple-key, ♯CERTAINTY(q) is either in FP or ♯P-complete, and it is decidable which of the two cases applies.

Note
The previous theorem is the only general result for queries with self-joins.

Complexity Classes

NP, the class of decision problems whose “yes” instances have succinct certificates that can be verified in deterministic polynomial time.
FP, the class of function problems that can be solved in deterministic polynomial time.
♯P, the class of counting problems associated with decision problems in NP. Given an instance of a decision problem in NP, the associated counting problem asks for the number of succinct certificates of its being a “yes” instance.

By Toda’s theorem (stating PH ⊆ P^♯P), ♯P-complete problems appear to be extremely hard. ♯P-completeness suggests a higher level of intractability than NP-completeness, insofar as decision problems and counting problems can be compared.

The Two Sides of the Dichotomy (by Size)

Tractability is Rare

Paths in directed graphs
Let C be the class of Boolean queries of the form
∃x⃗ (R(x_1, x_2) ∧ R(x_3, x_4) ∧ · · · ∧ R(x_{2n−1}, x_{2n}))
with n ≠ 0 and x_1, x_2, ..., x_{2n} (not necessarily distinct) variables. Queries in C ask for the existence of paths and cycles of fixed length.

Let
q0 = ∃x R(x, x) ⟹ “cycle of length 1?”
q1 = ∃x∃y R(x, y) ⟹ “path of length 1?”
q2 = ∃x∃y∃z (R(x, y) ∧ R(y, z)) ⟹ “path of length 2?”

R
From  To
a     a
a     b
a     c
c     d

How many repairs satisfy q0? 1
How many repairs satisfy q1? 3
How many repairs satisfy q2? 2

Counting is hard for most queries

For every q ∈ C, if q ≡ q0, q ≡ q1, or q ≡ q2, then ♯CERTAINTY(q) is in FP; otherwise ♯CERTAINTY(q) is ♯P-complete.
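
The three counts for the relation R above can be verified with a brute-force sketch, assuming that From is the primary key of R. This enumeration illustrates the counting semantics only. It is not the polynomial-time procedure behind the FP cases.

```python
from itertools import product

# R(From, To); assumption: From is the primary key.
R = [("a", "a"), ("a", "b"), ("a", "c"), ("c", "d")]

def repairs(facts):
    """Pick exactly one fact per block of facts that agree on From."""
    blocks = {}
    for fact in facts:
        blocks.setdefault(fact[0], []).append(fact)
    return [set(pick) for pick in product(*blocks.values())]

def q0(r):
    # cycle of length 1
    return any(x == y for (x, y) in r)

def q1(r):
    # path of length 1
    return len(r) > 0

def q2(r):
    # path of length 2
    return any(y == y2 for (_, y) in r for (y2, _) in r)

counts = [sum(1 for r in repairs(R) if q(r)) for q in (q0, q1, q2)]
print(counts)  # [1, 3, 2]
```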

Block-Independent-Disjoint Probabilistic Databases

BID probabilistic database

T                                   R
     Town     Country  Michelin          Conf  Year  Town
(t1) Mons     Belgium  ∗            (r1) EDBT  2015  Mons
(t2) Mons     Belgium  ∗∗           (r2) EDBT  2015  Brussel
(t3) Brussel  Belgium  ∗∗

A possible world selects at most one fact from each block. Assume the following probability distribution Pr:

{r1, t1, t3} ↦ 0.18
{r1, t2, t3} ↦ 0.12
{r2, t1, t3} ↦ 0.36
{r2, t2, t3} ↦ 0.24
{t1, t3} ↦ 0.06
{t2, t3} ↦ 0.04
other possible worlds ↦ 0

Since the “block-independence” assumption holds, the following BID specification does not lose information:

T                                        R
     Town     Country  Michelin  Pr           Conf  Year  Town     Pr
(t1) Mons     Belgium  ∗         0.6     (r1) EDBT  2015  Mons     0.3
(t2) Mons     Belgium  ∗∗        0.4     (r2) EDBT  2015  Brussel  0.6
(t3) Brussel  Belgium  ∗∗        1.0
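
As a worked check of the claim that the specification loses no information, the world probabilities can be recomputed as products over blocks. A world that contains no R-fact receives the leftover mass 1 − 0.3 − 0.6 of the R-block. Three representative cases:

```latex
\begin{align*}
\Pr(\{r_1, t_1, t_3\}) &= \Pr(r_1)\cdot\Pr(t_1)\cdot\Pr(t_3) = 0.3 \cdot 0.6 \cdot 1.0 = 0.18,\\
\Pr(\{r_2, t_2, t_3\}) &= \Pr(r_2)\cdot\Pr(t_2)\cdot\Pr(t_3) = 0.6 \cdot 0.4 \cdot 1.0 = 0.24,\\
\Pr(\{t_1, t_3\}) &= (1 - 0.3 - 0.6)\cdot\Pr(t_1)\cdot\Pr(t_3) = 0.1 \cdot 0.6 \cdot 1.0 = 0.06.
\end{align*}
```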

Probabilistic Query Answering

Probabilistic query evaluation

T                                        R
     Town     Country  Michelin  Pr           Conf  Year  Town     Pr
(t1) Mons     Belgium  ∗         0.6     (r1) EDBT  2015  Mons     0.3
(t2) Mons     Belgium  ∗∗        0.4     (r2) EDBT  2015  Brussel  0.6
(t3) Brussel  Belgium  ∗∗        1.0

q3 = ∃x∃y∃z (R(‘EDBT’, x, y) ∧ T(y, z, ‘∗∗’))

{r1, t1, t3} ↦ 0.18
{r1, t2, t3} ↦ 0.12
{r2, t1, t3} ↦ 0.36
{r2, t2, t3} ↦ 0.24
{t1, t3} ↦ 0.06
{t2, t3} ↦ 0.04

⟹ Pr(q3) = 0.12 + 0.36 + 0.24 = 0.72.
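
The value of Pr(q3) can be recomputed by enumerating possible worlds from the BID specification. This is a minimal sketch: the data layout and function names are illustrative, and enumerating all worlds is exponential in the number of blocks.

```python
from itertools import product

# BID specification: block key -> list of (fact, probability).
T_BLOCKS = {
    "Mons": [(("Mons", "Belgium", "*"), 0.6), (("Mons", "Belgium", "**"), 0.4)],
    "Brussel": [(("Brussel", "Belgium", "**"), 1.0)],
}
R_BLOCKS = {
    ("EDBT", "2015"): [(("EDBT", "2015", "Mons"), 0.3), (("EDBT", "2015", "Brussel"), 0.6)],
}

def block_options(block):
    """A block contributes one of its facts, or no fact with the leftover probability."""
    options = list(block)
    leftover = 1.0 - sum(p for (_, p) in block)
    if leftover > 1e-12:
        options.append((None, leftover))
    return options

def possible_worlds():
    """Yield (T-facts, R-facts, probability) for every world with nonzero probability."""
    t_opts = [block_options(b) for b in T_BLOCKS.values()]
    r_opts = [block_options(b) for b in R_BLOCKS.values()]
    for pick in product(*t_opts, *r_opts):
        prob = 1.0
        for _, p in pick:
            prob *= p
        t = {f for f, _ in pick[:len(t_opts)] if f is not None}
        r = {f for f, _ in pick[len(t_opts):] if f is not None}
        yield t, r, prob

def q3(t, r):
    return any(conf == "EDBT" and town == t_town and stars == "**"
               for (conf, _, town) in r for (t_town, _, stars) in t)

pr_q3 = sum(p for t, r, p in possible_worlds() if q3(t, r))
print(round(pr_q3, 2))  # 0.72
```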

Probabilities in an Uncertain Database

Uniform probability distributions

T                                        R
     Town     Country  Michelin  Pr           Conf  Year  Town     Pr
(t1) Mons     Belgium  ∗         0.5     (r1) EDBT  2015  Mons     0.5
(t2) Mons     Belgium  ∗∗        0.5     (r2) EDBT  2015  Brussel  0.5
(t3) Brussel  Belgium  ∗∗        1.0

Uniform probability distribution over the set of all repairs:

{r1, t1, t3} ↦ 0.25
{r1, t2, t3} ↦ 0.25
{r2, t1, t3} ↦ 0.25
{r2, t2, t3} ↦ 0.25

Why can’t one carry over results from BID probabilistic databases then?
In an uncertain database, the probabilities of facts in the same block sum up to 1.
No FP–♯P-hard dichotomy is currently known for the probabilistic evaluation of self-joins on BID probabilistic databases.
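
Under the uniform distribution over repairs, the probability of a Boolean query is the number of repairs satisfying it divided by the total number of repairs. A worked instance, using the counts stated earlier for q1 and q2 on this database (2 and 4 out of 4 repairs):

```latex
\Pr(q) \;=\; \frac{\text{number of repairs of } \mathbf{db} \text{ that satisfy } q}{\text{number of repairs of } \mathbf{db}},
\qquad
\Pr(q_1) = \frac{2}{4} = 0.5,
\qquad
\Pr(q_2) = \frac{4}{4} = 1.
```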

5 Attack Graphs, a Complexity Classification Tool

Conjunctive Queries

Boolean conjunctive query (BCQ)
First-order expression of the form
∃u1 · · · ∃uk (R1(x⃗1, y⃗1) ∧ · · · ∧ Rn(x⃗n, y⃗n)),
containing no variables other than u1, ..., uk. Primary-key positions are underlined.

Often represented by {R1(x⃗1, y⃗1), ..., Rn(x⃗n, y⃗n)}, its set of atoms.
Self-join-free if no relation name occurs more than once.

Functional Dependencies

Functional dependencies derived from primary keys
1 A Boolean conjunctive query is represented by its set of atoms (also called goals).
2 For a set q of atoms, we define K(q) as the following set of functional dependencies:

K(q) := {key(G) → vars(G) | G ∈ q},

where key(G) is the set of variables in G’s primary key, and vars(G) is the set of all variables in G.
3 For a set q of atoms and an atom F ∈ q, we say that a variable x is externally independent (a) of F if

K(q \ {F}) ⊭ key(F) → x.

(a) Terminology thanks to Benny Kimelfeld.

Example
Let q = {R(x, y), G(y, z), B(z, x), U(x, u), V(x, u, v)}. Take F = R(x, y).

key(F) = {x}
K(q \ {F}) ≡ {y → z, z → x, x → u, xu → v}

The variables y and z are externally independent of F, because

K(q \ {F}) ⊭ x → y, K(q \ {F}) ⊭ x → z.
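
External independence reduces to an attribute-closure test. The sketch below encodes each atom by its key variables and its full variable set, with keys read off from K(q \ {F}) above (x for R and U, y for G, z for B, xu for V). The function names are illustrative.

```python
# Atoms as (relation name, key variables, all variables).
Q = [
    ("R", {"x"}, {"x", "y"}),
    ("G", {"y"}, {"y", "z"}),
    ("B", {"z"}, {"z", "x"}),
    ("U", {"x"}, {"x", "u"}),
    ("V", {"x", "u"}, {"x", "u", "v"}),
]

def closure(start, fds):
    """Attribute closure of `start` under functional dependencies given as (lhs, rhs) pairs."""
    closed = set(start)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return closed

def externally_independent(atom, query):
    """Variables x with K(q \\ {F}) not implying key(F) -> x, where F is `atom`."""
    name, key, _ = atom
    fds = [(k, v) for (n, k, v) in query if n != name]
    all_vars = set().union(*(v for (_, _, v) in query))
    return all_vars - closure(key, fds)

print(sorted(externally_independent(Q[0], Q)))  # ['y', 'z']
```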

Attack Graph

Attack graph
The attack graph of a self-join-free BCQ q is a directed graph whose vertices are the atoms of q. There is a directed edge (called attack) from F to G (F ≠ G) if there exists a sequence of atoms of q, beginning with F and ending with G, such that every two successive atoms share some variable that is externally independent of F.

Example of attack computation
Let q = {R(x, y), G(y, z), B(z, x), U(x, u), V(x, u, v)}. The variables y and z are externally independent of R(x, y). There is an attack from R(x, y) to B(z, x), because in the sequence ⟨R(x, y), G(y, z), B(z, x)⟩, every two successive atoms share y or z.

Weak and Strong Attacks

Weak and strong attacks
Let q be a self-join-free BCQ. An attack from F to G is called weak if K(q) ⊨ key(F) → key(G); otherwise it is strong. A cycle in q’s attack graph is called strong if at least one attack in the cycle is strong.

Example of weak attack
Let q = {R(x, y), G(y, z), B(z, x), U(x, u), V(x, u, v)}. The attack from R(x, y) to B(z, x) is weak because K(q) ⊨ x → z.

Complexity Classification of CERTAINTY(q): Theorem

Theorem
Let q be a self-join-free BCQ.
1 If the attack graph of q is acyclic, then CERTAINTY(q) is in FO.
2 If the attack graph of q is cyclic but no cycle is strong, then CERTAINTY(q) is in P and is L-hard.
3 If the attack graph of q contains a strong cycle, then CERTAINTY(q) is coNP-complete.
Furthermore, it can be decided in polynomial time in the size of q which of the above three cases applies.

Complexity Classification of CERTAINTY(q): Example

Example
Let q = {R(x, y), G(y, z), B(z, x), U(x, u), V(x, u, v)}. The attack graph is cyclic.

[Figure: attack graph of q, with vertices R(x, y), G(y, z), B(z, x), U(x, u), V(x, u, v)]

All attacks are weak. Consequently, CERTAINTY(q) is L-hard and in P.

Adding W(a, x)

Example
When W(a, x) is added, the attack graph becomes acyclic.

[Figure: attack graph of q ∪ {W(a, x)}, with vertices W(a, x), R(x, y), G(y, z), B(z, x), U(x, u), V(x, u, v)]

Consequently, CERTAINTY(q ∪ {W(a, x)}) is in FO.
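
The classification of q and of q ∪ {W(a, x)} can be recomputed with a short sketch of the attack-graph construction. It assumes the keys implied by K(q) (x for R and U, y for G, z for B, xu for V) and additionally assumes that the first position of W is its key, so that key(W) contains no variable because a is a constant. All function names are illustrative.

```python
# Atoms as name -> (key variables, all variables).
Q = {
    "R": ({"x"}, {"x", "y"}),
    "G": ({"y"}, {"y", "z"}),
    "B": ({"z"}, {"z", "x"}),
    "U": ({"x"}, {"x", "u"}),
    "V": ({"x", "u"}, {"x", "u", "v"}),
}
# Assumption: W(a, x) has the constant a in its key position, hence no key variables.
Q_W = dict(Q, W=(set(), {"x"}))

def closure(start, fds):
    """Attribute closure under functional dependencies given as (lhs, rhs) pairs."""
    closed, changed = set(start), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return closed

def attack_graph(q):
    """Return {F: set of atoms attacked by F}."""
    all_vars = set().union(*(vs for _, vs in q.values()))
    graph = {}
    for f, (key, _) in q.items():
        fds = [kv for g, kv in q.items() if g != f]
        free = all_vars - closure(key, fds)   # variables externally independent of f
        reached, frontier = {f}, [f]
        while frontier:                        # follow atoms that share a free variable
            g = frontier.pop()
            for h, (_, vs) in q.items():
                if h not in reached and q[g][1] & vs & free:
                    reached.add(h)
                    frontier.append(h)
        graph[f] = reached - {f}
    return graph

def is_weak(q, f, g):
    """An attack from f to g is weak if K(q) implies key(f) -> key(g)."""
    return q[g][0] <= closure(q[f][0], list(q.values()))

def reachable(graph, start):
    """Atoms reachable from `start` along one or more attacks."""
    seen, stack = set(), [start]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def classify(q):
    graph = attack_graph(q)
    # An attack f -> g lies on a cycle exactly when f is reachable from g.
    cyclic = [(f, g) for f in graph for g in graph[f] if f in reachable(graph, g)]
    if not cyclic:
        return "in FO"
    if all(is_weak(q, f, g) for f, g in cyclic):
        return "in P and L-hard"
    return "coNP-complete"

print(classify(Q))    # in P and L-hard
print(classify(Q_W))  # in FO
```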

FO Algorithm for Acyclic Attack Graph

Let q be a self-join-free BCQ with an acyclic attack graph.

Let R1, R2, ..., Rn be a topological sort of the attack graph of q.

We write ri ∼ si if ri and si are key-equal tuples, belonging to the same Ri-block. The following algorithm tests whether q is true in every repair:

Input: Database db
Output: Does every repair of db satisfy q?

if ∃s1 ∈ R1 ∀r1 ∈ R1 such that r1 ∼ s1:
     ∃s2 ∈ R2 ∀r2 ∈ R2 such that r2 ∼ s2:
       ...
         ∃sn ∈ Rn ∀rn ∈ Rn such that rn ∼ sn:
           the tuples r1, r2, ..., rn together satisfy q
then return “yes”
else return “no”
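
The nested quantifier pattern above can be sketched as a short recursive function. The sketch assumes that atoms are supplied in a topological order of the attack graph, that variables are written with a leading "?", and that each relation's key is a prefix of its attributes. The query is ∃d∃b (WorksFor(‘James’, d) ∧ ManagedBy(d, ‘Alex’, b)) from this talk, and the two small databases are hypothetical test data.

```python
def satisfies(atoms, tuples):
    """Check that the chosen tuples r1, ..., rn together satisfy q under one valuation."""
    binding = {}
    for (_, _, terms), fact in zip(atoms, tuples):
        for term, value in zip(terms, fact):
            if term.startswith("?"):            # variable: must be bound consistently
                if binding.setdefault(term, value) != value:
                    return False
            elif term != value:                 # constant: must match exactly
                return False
    return True

def certain(atoms, db, chosen=()):
    """atoms: list of (relation, key length, terms), in topological order of the attack graph."""
    if len(chosen) == len(atoms):
        return satisfies(atoms, chosen)
    relation, key_len, _ = atoms[len(chosen)]
    facts = db[relation]
    # Exists s_i in R_i such that for all r_i key-equal to s_i, the rest succeeds.
    return any(
        all(certain(atoms, db, chosen + (r,))
            for r in facts if r[:key_len] == s[:key_len])
        for s in facts
    )

QUERY = [
    ("WorksFor", 1, ("James", "?d")),
    ("ManagedBy", 1, ("?d", "Alex", "?b")),
]

# Hypothetical data: James has two departments, both managed by Alex.
db_yes = {
    "WorksFor": [("James", "MI6"), ("James", "CIA")],
    "ManagedBy": [("MI6", "Alex", "50M"), ("CIA", "Alex", "60M")],
}
# Same data plus a conflicting manager for CIA.
db_no = {
    "WorksFor": db_yes["WorksFor"],
    "ManagedBy": db_yes["ManagedBy"] + [("CIA", "John", "60M")],
}

print(certain(QUERY, db_yes), certain(QUERY, db_no))  # True False
```

In the second database the repair that keeps (James, CIA) and (CIA, John, 60M) falsifies the query, so the answer is "no".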

FO Algorithm in SQL I

Example

∃d∃b (WorksFor(‘James’, d) ∧ ManagedBy(d, ‘Alex’, b))

The attack graph is acyclic:

[Figure: attack graph with vertices WorksFor(‘James’, d) and ManagedBy(d, ‘Alex’, b)]

FO Algorithm in SQL II

-- WE, ME: exists quantifiers; WA, MA: forall quantifiers
SELECT DISTINCT 'yes'
FROM WorksFor AS WE
WHERE NOT EXISTS (
  SELECT *
  FROM WorksFor AS WA
  WHERE NOT EXISTS (
    SELECT *
    FROM ManagedBy AS ME
    WHERE NOT EXISTS (
      SELECT *
      FROM ManagedBy AS MA
      WHERE WA.Agent = WE.Agent AND MA.Dept = ME.Dept   -- equality of primary keys
        AND NOT ( WA.Agent = 'James'                    -- conjunctive query
                  AND WA.Dept = MA.Dept
                  AND MA.Mgr = 'Alex' )
    )));
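
To try the query, it can be run against an in-memory SQLite database from Python. This is a minimal sketch: the table definitions (plain text columns, no declared keys, so that key violations can be stored) and the sample rows are assumptions made for illustration.

```python
import sqlite3

CERTAIN_SQL = """
SELECT DISTINCT 'yes'
FROM WorksFor AS WE
WHERE NOT EXISTS (
  SELECT * FROM WorksFor AS WA
  WHERE NOT EXISTS (
    SELECT * FROM ManagedBy AS ME
    WHERE NOT EXISTS (
      SELECT * FROM ManagedBy AS MA
      WHERE WA.Agent = WE.Agent AND MA.Dept = ME.Dept
        AND NOT (WA.Agent = 'James' AND WA.Dept = MA.Dept AND MA.Mgr = 'Alex'))));
"""

def certain(works_for, managed_by):
    """Load the (possibly inconsistent) data and report whether the query returns 'yes'."""
    con = sqlite3.connect(":memory:")
    # No PRIMARY KEY constraints: the tables must be able to hold key violations.
    con.execute("CREATE TABLE WorksFor (Agent TEXT, Dept TEXT)")
    con.execute("CREATE TABLE ManagedBy (Dept TEXT, Mgr TEXT, Budget TEXT)")
    con.executemany("INSERT INTO WorksFor VALUES (?, ?)", works_for)
    con.executemany("INSERT INTO ManagedBy VALUES (?, ?, ?)", managed_by)
    rows = con.execute(CERTAIN_SQL).fetchall()
    con.close()
    return rows == [("yes",)]

# Hypothetical data: James has two departments, both managed by Alex.
works_for = [("James", "MI6"), ("James", "CIA")]
managed_by = [("MI6", "Alex", "50M"), ("CIA", "Alex", "60M")]

print(certain(works_for, managed_by))                            # True
print(certain(works_for, managed_by + [("CIA", "John", "60M")]))  # False
```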

Tool

Canswer available at http://co2.umons.ac.be:8080/

6 Final Thoughts

Cleaning the database vs Cleaning query answers
“Probably data cleaning will remain art and science, entangled with analysis, and resistant to fully generalizable principles.” [Blog of Aaron Schumacher]

References

[Bul11] Andrei A. Bulatov. Complexity of conservative constraint satisfaction problems. ACM Trans. Comput. Log., 12(4):24, 2011.

[Fon13] Gaëlle Fontaine. Why is it hard to obtain a dichotomy for consistent query answering? In LICS, pages 550–559. IEEE Computer Society, 2013.

[GPW16] Floris Geerts, Fabian Pijcke, and Jef Wijsen. First-order under-approximations of consistent query answers. International Journal of Approximate Reasoning, pages –, 2016.

[IJ84] Tomasz Imielinski and Witold Lipski Jr. Incomplete information in relational databases. J. ACM, 31(4):761–791, 1984.

[KW15] Paris Koutris and Jef Wijsen. The data complexity of consistent query answering for self-join-free conjunctive queries under primary key constraints. In Tova Milo and Diego Calvanese, editors, PODS. ACM, 2015.

[Lib16] Leonid Libkin. SQL’s three-valued logic and certain answers. ACM Trans. Database Syst., 41(1):1, 2016.

[MW13] Dany Maslowski and Jef Wijsen. A dichotomy in the complexity of counting database repairs. J. Comput. Syst. Sci., 79(6):958–983, 2013.

[MW14] Dany Maslowski and Jef Wijsen. Counting database repairs that satisfy conjunctive queries with self-joins. In Nicole Schweikardt, Vassilis Christophides, and Vincent Leroy, editors, ICDT, pages 155–164. OpenProceedings.org, 2014.

[Wij10] Jef Wijsen. A remark on the complexity of consistent conjunctive query answering under primary key violations. Inf. Process. Lett., 110(21):950–955, 2010.