Approximations of Consistent Query Answers¹
Jef Wijsen
UMONS
DaQuaTa International Workshop 2016
Lyon, 12-13 December 2016
¹ Joint work with Floris Geerts, Paris Koutris, and Fabian Pijcke

Jef Wijsen (UMONS) Approximations of CQA DaQuaTa 2016 1 / 54

Outline
1 Motivation
2 On the Complexity of Embracing Primary Key Violations
3 First-Order Under-Approximations of Consistent Query Answers
4 Beyond (Un)certainty: Counts and Probabilities
5 Attack Graphs, a Complexity Classification Tool
6 Final Thoughts

1 Motivation

Data Quality

Consistent and Complete Relational Database
Integrity constraints are satisfied.
The database contains all (and only) the facts that are true (Closed-World Assumption).
No missing values.

Dealing with Imperfect Data
It is common to have inconsistent data, incomplete data, missing data, uncertain data.
What can we do with this data?
Some data quality problems already have principled solutions: incomplete data [IJ84], NULLs in SQL [Lib16].

Embrace Imperfections
Example

    P   PID  FirstName  LastName  BloodType  Gender  ···
        1    John       Adams     NULL       M       ···
        2    Jan        Peeters   A+         M       ···
        3    Jean       Dubois    A+         M       ···
        3    Jean       Dubois    AB+        M       ···

Caveat
If we embrace imperfections, we should rethink query answering. For example, what should be the answer to the following queries?

    SELECT COUNT(DISTINCT PID)
    FROM P
    WHERE BloodType = 'A+';

    SELECT COUNT(DISTINCT PID)
    FROM P
    WHERE BloodType <> 'A+';

2 On the Complexity of Embracing Primary Key Violations

Embrace Primary Key Violations
Data model
We allow (primary) key violations.

Example (key attributes: Agent in WorksFor, Dept in ManagedBy)

    WorksFor
    Agent     Dept
    Sherlock  MI6
    - - - - - - - -
    James     CIA
    James     MI6

    ManagedBy
    Dept  Mgr   Budget
    CIA   John  60M
    - - - - - - - - - -
    MI6   Alex  60M

$\Rightarrow$ James works for either CIA or MI6.

Definition (Block)
A block is a maximal set of tuples of the same relation with the same value for the key. (Blocks are separated by dashed lines.)
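
The block definition can be illustrated with a small sketch. It assumes that each relation is a list of tuples whose first attribute is the key (Agent for WorksFor, Dept for ManagedBy); the function name `blocks` and the parameter `key_len` are illustrative and do not come from the slides.

```python
from collections import defaultdict


def blocks(relation, key_len=1):
    """Group tuples that agree on their first key_len attributes (the key)."""
    grouped = defaultdict(list)
    for t in relation:
        grouped[t[:key_len]].append(t)
    # Dicts keep insertion order, so blocks appear in order of first occurrence.
    return list(grouped.values())


# The example database from the slide.
works_for = [("Sherlock", "MI6"), ("James", "CIA"), ("James", "MI6")]
managed_by = [("CIA", "John", "60M"), ("MI6", "Alex", "60M")]

works_for_blocks = blocks(works_for)
managed_by_blocks = blocks(managed_by)
```

Here WorksFor splits into two blocks (one for Sherlock, one with both James tuples), while every block of ManagedBy is a singleton, so only the James block violates the key.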

Certainty Semantics
Definition (Repair and Certainty)
A repair is obtained by selecting exactly one tuple from each block.
A Boolean query is certain if it is true in all repairs.

Certainty semantics

    WorksFor
    Agent     Dept
    Sherlock  MI6
    - - - - - - - -
    James     CIA
    James     MI6

    ManagedBy
    Dept  Mgr   Budget
    CIA   John  60M
    - - - - - - - - - -
    MI6   Alex  60M

Is the budget of James' department equal to 60M?
$\exists d\, \exists m\, (\mathrm{WorksFor}(\text{`James'}, d) \land \mathrm{ManagedBy}(d, m, \text{`60M'}))$ is certain.

Is James' department managed by Alex?
$\exists d\, \exists b\, (\mathrm{WorksFor}(\text{`James'}, d) \land \mathrm{ManagedBy}(d, \text{`Alex'}, b))$ is not certain.

The Computational Complexity of Deciding Certainty I
Relation with exponentially many repairs

    WorksFor
    Agent  Dept
    1      MI6
    1      CIA
    - - - - - -
    2      MI6
    2      CIA
    - - - - - -
    ...
    - - - - - -
    n      MI6
    n      CIA

This WorksFor relation contains $2n$ tuples and has $\sqrt{2}^{\,2n} = 2^n$ distinct repairs.

The Computational Complexity of Deciding Certainty II
Example of Low Complexity
Let $q_1 = \exists d\, \exists b\, (\mathrm{WorksFor}(\text{`James'}, d) \land \mathrm{ManagedBy}(d, \text{`Alex'}, b))$.
For example, $q_1$ is certain in the following database:

    ManagedBy
    Dept  Mgr   Budget
    CIA   Alex  50M
    CIA   Alex  60M
    - - - - - - - - - -
    MI6   Alex  60M

    WorksFor
    Agent  Dept
    James  CIA
    James  MI6

One can verify that $q_1$ is certain iff the following query is true:

$\exists d\, \mathrm{WorksFor}(\text{`James'}, d) \land \forall d\, \bigl(\mathrm{WorksFor}(\text{`James'}, d) \rightarrow \exists m\, \exists b\, [\mathrm{ManagedBy}(d, m, b) \land \forall m\, \forall b\, (\mathrm{ManagedBy}(d, m, b) \rightarrow m = \text{`Alex'})]\bigr)$
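
The following sketch compares two ways of deciding whether q1 is certain: enumerating all repairs, and evaluating the first-order condition directly on the inconsistent database. It assumes relations are lists of tuples with the key as first attribute; all function names (`blocks`, `repairs`, `certain`, `q1`, `q1_rewriting`) are illustrative and do not come from the slides.

```python
from collections import defaultdict
from itertools import product


def blocks(relation, key_len=1):
    """Group tuples that agree on their first key_len attributes (the key)."""
    grouped = defaultdict(list)
    for t in relation:
        grouped[t[:key_len]].append(t)
    return list(grouped.values())


def repairs(relation, key_len=1):
    """Yield every repair: exactly one tuple chosen from each block."""
    for choice in product(*blocks(relation, key_len)):
        yield set(choice)


def q1(works_for, managed_by):
    """Exists d, b: WorksFor('James', d) and ManagedBy(d, 'Alex', b)."""
    return any(a == "James" and m == "Alex" and d == d2
               for (a, d) in works_for
               for (d2, m, _b) in managed_by)


def certain(query, works_for, managed_by):
    """Brute force: the query must hold in every repair (exponential time)."""
    return all(query(wf, mb)
               for wf in repairs(works_for)
               for mb in repairs(managed_by))


def q1_rewriting(works_for, managed_by):
    """Evaluate the first-order condition on the inconsistent database itself."""
    james_depts = {d for (a, d) in works_for if a == "James"}

    def managed_only_by_alex(d):
        managers = [m for (d2, m, _b) in managed_by if d2 == d]
        return bool(managers) and all(m == "Alex" for m in managers)

    return bool(james_depts) and all(managed_only_by_alex(d) for d in james_depts)


# Database from the "Example of Low Complexity" slide.
works_for_a = [("James", "CIA"), ("James", "MI6")]
managed_by_a = [("CIA", "Alex", "50M"), ("CIA", "Alex", "60M"), ("MI6", "Alex", "60M")]

# Database from the "Certainty Semantics" slide.
works_for_b = [("Sherlock", "MI6"), ("James", "CIA"), ("James", "MI6")]
managed_by_b = [("CIA", "John", "60M"), ("MI6", "Alex", "60M")]

# n agents with two candidate departments each: 2n tuples, 2**n repairs.
n = 10
big = [(i, dept) for i in range(1, n + 1) for dept in ("MI6", "CIA")]
repair_count = sum(1 for _ in repairs(big))
```

On both example databases the brute-force check and the rewriting agree: q1 is certain in the first and not certain in the second. The brute-force check visits every repair, whose number grows exponentially with the number of violated blocks, whereas the rewriting is a single pass over the stored tuples.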