The Computation of Optimal Subset Repairs⇤

The Computation of Optimal Subset Repairs⇤ Dongjing Miao Zhipeng Cai Jianzhong Li Harbin Institute of Technology Georgia State University Harbin Institute of Technology [email protected] [email protected] [email protected] Xiangyu Gao Xianmin Liu Harbin Institute of Technology Harbin Institute of Technology [email protected] [email protected] ABSTRACT In the research on managing inconsistencies, the notion of Computing an optimal subset repair of an inconsistent data- repair was first introduced decades ago [4]. Subset repair is base is becoming a standalone research problem and has a a popular type of repair [19, 2]. A subset repair of an incon- wide range of applications. However, it has not been well- sistent database I is a consistent subset of I,whichsatisfies studied yet. A tight inapproximability bound of the prob- all the given integrity constraints, obtained by performing lem computing optimal subset repairs is still unknown, and a minimal set of operations on I.Anoptimal subset repair there is still no existing algorithm with a constant approxi- is the subset repair with minimum cost, where the cost can mation factor better than two. In this paper, we prove a new be defined in di↵erent ways depending on the requirement tighter inapproximability bound of the problem computing of applications. optimal subset repairs. We show that it is usually NP-hard Nowadays, the computation of optimal subset repairs is to approximate it within a factor better than 17/16. An becoming a basic task in a wide-range of applications not σ 1 only data cleaning and repairing, but also many other prac- algorithm with an approximation ratio (2 1/2 − )isde- veloped, where σ is the number of functional− dependencies. tical scenarios. Thus, the computation of optimal subset It is the current best algorithm in terms of approximation repairs is becoming a standalone research problem and at- ratio. The ratio can be further improved if there are a tracted a lot of attention. In the following, we give three ex- large amount of quasi-Turán clusters in the input database. ample scenarios, in which the computation of optimal subset Plenty of experiments are conducted on real data to exam- repairs is a core task. ine the performance and the e↵ectiveness of the proposed Data repairing.Asdiscussedin[40,20,11],theben- approximation algorithms in real-world applications. efit of computing optimal subset repairs to data repairing is twofold. First, computing optimal subset repairs is con- PVLDB Reference Format: sidered an important task in automatic data repairing [22, Dongjing Miao, Zhipeng Cai, Jianzhong Li, Xiangyu Gao, Xian- 45, 25]. In those methods, the cost of a subset repair is min Liu. The Computation of Optimal Subset Repairs. PVLDB, the sum of the confidence of each deleted tuple. Therefore, 13(11): 2061-2074, 2020. an optimal subset repair is considered as the best repaired DOI: https://doi.org/10.14778/3407790.3407809 data. Second, optimal subset repairs are also used as the best recommendation to enlighten the human how to repair 1. INTRODUCTION data from scratch in semi-automatic data repairing methods [7, 9, 24, 30]. In addition, the deletion cost to obtain an AdatabaseI is inconsistent if it violates some given in- optimal subset repair also can be used as a measurement to tegrity constraints, that is, I contains conflicts or inconsis- evaluate the database inconsistency degree [11]. tencies. For example, two tuples conflict with each other, if Consistent query answering. Optimal subset repairs they have the same city but di↵erent area code. The incon- are the key to answer aggregation queries, such as count, sistency has a serious influence on the quality of databases, sum and average, in inconsistent databases. For example, a and has been studied for a long time. consistent answer of a sum query is usually defined as the range between its greatest lower bound (glb)anditsleast ⇤Work funded by the National Natural Science Foundation upper bound (lub). Optimal subset repairs can help deriving of China (NSFC) Grant No. 61972110, 61832003, U1811461, and the National Key R&D Program of China Grant No. such two bounds. 2019YFB2101900. Consider the inconsistent database GreenVotes shown in Figure.1(a), due to functional dependency County,Date Tal ly, This work is licensed under the Creative Commons Attribution- ! NonCommercial-NoDerivatives 4.0 International License. To view a copy tuples s1 and s2 conflict with each other, hence, it has two of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For repairs J1 : s1,s3 and J2 : s2,s3 . Obviously, aggrega- { } { } any use beyond those covered by this license, obtain permission by emailing tion query Q1 has di↵erent answers, such as 554 in J1 but [email protected]. Copyright is held by the owner/author(s). Publication rights 731 in J2. A smart way to answer such queries, as defined licensed to the VLDB Endowment. in [6, 10], is to return the lub and glb which are 554 and 731. Proceedings of the VLDB Endowment, Vol. 13, No. 11 ISSN 2150-8097. Computing the lub is equivalent to computing optimal DOI: https://doi.org/10.14778/3407790.3407809 subset repairs since the lub is exactly the sum of weights of 2061 GreenVotes Procurement Plan Σ = {Vendor → Subsystem, Subsystem Part Vendor Vendor Rating Aggregation Q1: County Date Tally t1 #1 a C 3.3 Subsystem, Part → Vendor} s SELECT SUM(Tally) t2 #1 a J 5.0 1 A 11/07 453 SUM(Vendor Rating) t3 #1 b J 5.0 s2 A 11/07 630 FROM GreenVotes t4 #1 b M 4.1 Procurement Plan A (optimal) s3 B 11/07 101 CQA of Q : [554, 731] t5 #2 a C 3.3 Cost (rating loss): 13.7 (31.6 -17.9) 1 t 6 #2 a J 5.0 Procurement Plan B t7 #2 c C 3.3 16.6 (31.6 15.0) t8 #2 c E 4.6 Cost (rating loss): - BrownVotes Σ = {County→ Tally} Procurement Plan A (Optimal) Procurement Plan B Subsystem Part Vendor Vendor Rating Subsystem Part Vendor Vendor Rating County Date Tally t1 #1 a C 3.3 t1 #1 a C 3.3 r1 A 11/07 541 t2 #1 a J 5.0 t2 #1 a J 5.0 Aggregation Q2: t t r2 A 11/07 546 3 #1 b J 5.0 3 #1 b J 5.0 r triad SELECT SUM(Tally) t4 #1 b M 2.1 t4 #1 b M 2.1 3 A 11/07 560 t5 #2 a C 3.3 t5 #2 a C 3.3 FROM BrownVotes t t r4 A 11/07 550 6 #2 a J 5.0 6 #2 a J 5.0 r B 11/07 302 [843, 862] t7 #2 c C 3.3 t7 #2 c C 3.3 5 CQA of Q2 : t8 #2 c E 4.6 t8 #2 c E 4.6 (a) Consistent query answering of scalar aggregation (b) Refine the procurement plan when business constraints change Figure 1: Example applications of computing optimal subset repairs tuples in an optimal subset repair, J2, if the vote tallies are tional dependencies is NP-complete [6, 10] even if given two treated as tuple weights. functional dependencies. This result is too rough since it is On the other side, computing the glb can be speed up by just proved by a particular case. The most recent work in an optimal subset repair. Computing glb is equivalent to find [40] improves the complexity result. The boundary between a weighted maximum minimal vertex cover that is NP-hard polynomial intractable and tractable cases is clearly delin- 1/2 and cannot be approximated within o(n− )[15]butcan eated, such that the problem of computing optimal subset be solved in O(2knc) time [49], where k equals to the cost of repairs cannot be approximated within some small constant an optimal subset repair, and c>1. If k is clear in advance, unless P=NP, if the given constraint set can be simplified we will be able to decide to run the exact algorithm and tol- as one of6 four specific cases, otherwise it is polynomial solv- erate exponential running time, or to run an approximation able. Other complexity results on computing optimal repairs algorithm and tolerate the unbounded accuracy loss. exist, such as [22], [37], [12], [41], [28]. They are either re- Moreover, when the two ranges are computed, one can stricted to repairs obtained by value modifications instead safely conclude that Brown won the election, since the max- of tuple deletions or constraints stronger than functional imum vote tally for Green is smaller than the minimum vote dependencies. Complexity results on repair checking exists, tally for Brown. such as [19], [2]. Repair checking is to decide if a given Optimization. Let us consider an example raised in mar- database is a repair of another database. However, it is a keting as shown in Figure.1(b). The procurement plan in problem much easier than computing optimal subset repairs. Figure.1(b) lists all the current vendors for all the parts, Due to their restrictions, these results cannot help deriving and each vendor has a rating representing its production ca- tight complexity results for the problem of computing opti- pacity, after-sales service, supply price and so on [29]. Note mal subset repairs. Therefore, a tight lower bound remains that, all the tuples in the list are correct and consistent. unknown so far. However, the business consideration recently changes. First, Algorithm aspects. When the database schema and the a vendor is no longer allowed to take part in two or more number of functional dependencies are both unbounded, com- subsystems for protection of commercial secrets.

The Computation of Optimal Subset Repairs⇤

1 Elementary Set Theory

The Matroid Theorem We First Review Our Definitions: a Subset System Is A

Algebra I Chapter 1. Basic Facts from Set Theory 1.1 Glossary of Abbreviations

Ergodicity and Metric Transitivity

POINT-COFINITE COVERS in the LAVER MODEL 1. Background Let

A TEXTBOOK of TOPOLOGY Lltld

The Hardness of the Hidden Subset Sum Problem and Its Cryptographic Implications

The Axiom of Choice and Its Implications

Composite Subset Measures

Equivalents to the Axiom of Choice and Their Uses A

On the Cardinality of a Topological Space

Chapter I Sums of Independent Random Variables