
The Computation of Optimal Subset Repairs*

Dongjing Miao (Harbin Institute of Technology, [email protected])
Zhipeng Cai (Georgia State University, [email protected])
Jianzhong Li (Harbin Institute of Technology, [email protected])
Xiangyu Gao (Harbin Institute of Technology, [email protected])
Xianmin Liu (Harbin Institute of Technology, [email protected])

ABSTRACT
Computing an optimal subset repair of an inconsistent database is becoming a standalone research problem and has a wide range of applications. However, it has not been well studied yet. A tight inapproximability bound for the problem of computing optimal subset repairs is still unknown, and there is still no existing algorithm with a constant approximation factor better than two. In this paper, we prove a new, tighter inapproximability bound for the problem of computing optimal subset repairs. We show that it is usually NP-hard to approximate it within a factor better than 17/16. An algorithm with an approximation ratio (2 − 1/2^{|Σ|−1}) is developed, where |Σ| is the number of functional dependencies. It is the current best algorithm in terms of approximation ratio. The ratio can be further improved if there is a large amount of quasi-Turán clusters in the input database. Plenty of experiments are conducted on real data to examine the performance and the effectiveness of the proposed approximation algorithms in real-world applications.

PVLDB Reference Format:
Dongjing Miao, Zhipeng Cai, Jianzhong Li, Xiangyu Gao, Xianmin Liu. The Computation of Optimal Subset Repairs. PVLDB, 13(11): 2061-2074, 2020.
DOI: https://doi.org/10.14778/3407790.3407809

*Work funded by the National Natural Science Foundation of China (NSFC) Grant No. 61972110, 61832003, U1811461, and the National Key R&D Program of China Grant No. 2019YFB2101900.

1. INTRODUCTION
A database I is inconsistent if it violates some given integrity constraints, that is, I contains conflicts or inconsistencies. For example, two tuples conflict with each other if they have the same city but different area codes. Inconsistency has a serious influence on the quality of databases, and has been studied for a long time.

In the research on managing inconsistencies, the notion of repair was first introduced decades ago [4]. Subset repair is a popular type of repair [19, 2]. A subset repair of an inconsistent database I is a consistent subset of I, which satisfies all the given integrity constraints, obtained by performing a minimal set of deletion operations on I. An optimal subset repair is the subset repair with minimum cost, where the cost can be defined in different ways depending on the requirements of applications.

Nowadays, the computation of optimal subset repairs is becoming a basic task in a wide range of applications, not only data cleaning and repairing but also many other practical scenarios. Thus, the computation of optimal subset repairs is becoming a standalone research problem and has attracted a lot of attention. In the following, we give three example scenarios in which the computation of optimal subset repairs is a core task.

Data repairing. As discussed in [40, 20, 11], the benefit of computing optimal subset repairs for data repairing is twofold. First, computing optimal subset repairs is considered an important task in automatic data repairing [22, 45, 25]. In those methods, the cost of a subset repair is the sum of the confidences of the deleted tuples. Therefore, an optimal subset repair is considered the best repaired data. Second, optimal subset repairs are also used as the best recommendation to enlighten the human on how to repair data from scratch in semi-automatic data repairing methods [7, 9, 24, 30]. In addition, the deletion cost to obtain an optimal subset repair can also be used as a measurement to evaluate the inconsistency degree of a database [11].

Consistent query answering. Optimal subset repairs are the key to answering aggregation queries, such as count, sum and average, in inconsistent databases. For example, a consistent answer to a sum query is usually defined as the range between its greatest lower bound (glb) and its least upper bound (lub). Optimal subset repairs can help derive these two bounds.

Consider the inconsistent database GreenVotes shown in Figure 1(a). Due to County, Date → Tally, tuples s1 and s2 conflict with each other; hence, it has two repairs J1: {s1, s3} and J2: {s2, s3}. Obviously, aggregation query Q1 has different answers, such as 554 in J1 but 731 in J2. A smart way to answer such queries, as defined in [6, 10], is to return the glb and lub, which are 554 and 731. Computing the lub is equivalent to computing optimal subset repairs, since the lub is exactly the sum of weights of tuples in an optimal subset repair (here J2) if the vote tallies are treated as tuple weights.
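The GreenVotes example can be made concrete with a brute-force sketch (helper names are illustrative, and exhaustive enumeration is practical only for tiny instances): it enumerates all subset repairs of the three-tuple table under County, Date → Tally and reads the glb and lub of SUM(Tally) off the repairs.

```python
from itertools import combinations

# Toy GreenVotes instance from Figure 1(a): (County, Date, Tally).
# Sigma = {County, Date -> Tally}; s1 and s2 conflict.
tuples = [("A", "11/07", 453), ("A", "11/07", 630), ("B", "11/07", 101)]

def consistent(subset):
    """Check County, Date -> Tally on a candidate subset."""
    seen = {}
    for county, date, tally in subset:
        if (county, date) in seen and seen[(county, date)] != tally:
            return False
        seen[(county, date)] = tally
    return True

def subset_repairs(instance):
    """Enumerate maximal consistent subsets (S-repairs) by brute force."""
    repairs = []
    n = len(instance)
    for size in range(n, 0, -1):  # large subsets first, so kept ones are maximal
        for idx in combinations(range(n), size):
            cand = [instance[i] for i in idx]
            if consistent(cand) and not any(set(idx) <= set(r) for r in repairs):
                repairs.append(idx)
    return [[instance[i] for i in idx] for idx in repairs]

repairs = subset_repairs(tuples)
sums = [sum(t[2] for t in r) for r in repairs]
print(sorted(sums))  # glb and lub of SUM(Tally): [554, 731]
```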

Figure 1: Example applications of computing optimal subset repairs. (a) Consistent query answering of scalar aggregation: GreenVotes with Σ = {County, Date → Tally} and Q1 (SELECT SUM(Tally) FROM GreenVotes), whose consistent answer is [554, 731]; BrownVotes with Σ = {County → Tally} and Q2 (SELECT SUM(Tally) FROM BrownVotes), whose consistent answer is [843, 862]. (b) Refining the procurement plan when business constraints change, with Σ = {Vendor → Subsystem; Subsystem, Part → Vendor}: plan A (optimal) has cost (rating loss) 13.7, and plan B has cost 16.6.

On the other side, computing the glb can be sped up by an optimal subset repair. Computing the glb is equivalent to finding a weighted maximum minimal vertex cover, which is NP-hard and cannot be approximated within o(n^{1/2}) [15], but can be solved in O(2^k n^c) time [49], where k equals the cost of an optimal subset repair and c > 1. If k is known in advance, we are able to decide whether to run the exact algorithm and tolerate exponential running time, or to run an approximation algorithm and tolerate the unbounded accuracy loss.

Moreover, when the two ranges are computed, one can safely conclude that Brown won the election, since the maximum vote tally for Green is smaller than the minimum vote tally for Brown.

Optimization. Let us consider an example raised in marketing, as shown in Figure 1(b). The procurement plan in Figure 1(b) lists all the current vendors for all the parts, and each vendor has a rating representing its production capacity, after-sales service, supply price and so on [29]. Note that all the tuples in the list are correct and consistent. However, the business considerations recently changed. First, a vendor is no longer allowed to take part in two or more subsystems, for protection of commercial secrets. Second, each part should be supplied by the same vendor, for convenience of maintenance. We now need to refine the plan to satisfy these requirements while preserving high-rating vendors as much as possible.

The problem of refining the procurement plan can be transformed into the problem of computing optimal subset repairs. First, the business requirements can be written as the following two functional dependencies:

Vendor → Subsystem,
Subsystem, Part → Vendor.

Then, an optimal subset repair is the best refined plan if we treat vendor rating as tuple weight. For example, in Figure 1(b), plan A is an optimal subset repair and has a total rating 17.9, which is much better than plan B.

In fact, there are many other applications of optimal subset repairs, but we cannot detail any more here due to space limitations. As we can see from these applications, computing optimal subset repairs is becoming a standalone problem. However, it has not been well studied yet, especially regarding complexity and algorithms.

Complexity aspects. It is already shown that the problem of computing optimal subset repairs with respect to functional dependencies is NP-complete [6, 10] even given only two functional dependencies. This result is too coarse, since it is proved only for a particular case. The most recent work [40] improves the complexity result. The boundary between polynomially intractable and tractable cases is clearly delineated: the problem of computing optimal subset repairs cannot be approximated within some small constant unless P = NP if the given constraint set can be simplified to one of four specific cases; otherwise it is polynomially solvable. Other complexity results on computing optimal repairs exist, such as [22], [37], [12], [41], [28]. They are either restricted to repairs obtained by value modifications instead of tuple deletions, or to constraints stronger than functional dependencies. Complexity results on repair checking also exist, such as [19], [2]. Repair checking is to decide whether a given database is a repair of another database; however, it is a much easier problem than computing optimal subset repairs. Due to their restrictions, these results cannot help derive tight complexity results for the problem of computing optimal subset repairs. Therefore, a tight lower bound remains unknown so far.

Algorithm aspects. When the database schema and the number of functional dependencies are both unbounded, computing optimal subset repairs is equivalent to the weighted minimum vertex cover problem. Algorithms for vertex cover can be applied directly to compute optimal subset repairs. However, even the best algorithm [34] can only give an approximation with ratio 2 if the data is large enough. Fortunately, as shown in the applications above, the schema is fixed in practice, and the number of constraints is bounded by some constant. In this context, the problem is no longer the general vertex cover problem, but it is not clear whether it can be better approximated. Although a large body of research on data repairing is concerned with the computation of an optimal repair, it does not tell us whether we can do better in terms of approximation ratio for computing optimal subset repairs. Approximation algorithms [13, 42, 37] focus on computing repairs obtained by value modification rather than subset repairs. Heuristic methods [13, 22, 21] and probabilistic methods [44, 25] are unable to provide theoretical guarantees on the approximation ratio. SAT-solver based methods [24] suffer an exponential time complexity. Other methods [18, 31] try to modify both data and constraints, and thus are not able to return the optimal subset repair that we need.
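The vertex-cover view above can be sketched directly: treat each tuple as a vertex and each FD-violating pair as an edge; since FD violations are pairwise, consistent subsets are exactly independent sets of this conflict graph, so S-repairs are complements of minimal vertex covers. A minimal sketch (FDs encoded as attribute-index pairs; the three-tuple data is in the spirit of Figure 1(b), not the figure's exact table):

```python
from itertools import combinations

def conflicts(t1, t2, fds):
    """True iff some FD (lhs, rhs) is violated by the pair {t1, t2}."""
    return any(all(t1[a] == t2[a] for a in lhs) and
               any(t1[a] != t2[a] for a in rhs)
               for lhs, rhs in fds)

def conflict_edges(instance, fds):
    """Edges of the conflict graph over tuple indices."""
    return [(i, j) for i, j in combinations(range(len(instance)), 2)
            if conflicts(instance[i], instance[j], fds)]

# Sigma = {Vendor -> Subsystem; Subsystem, Part -> Vendor} on
# (Subsystem, Part, Vendor) tuples.
fds = [((2,), (0,)), ((0, 1), (2,))]
plan = [("#1", "a", "C"), ("#1", "a", "J"), ("#2", "a", "C")]
print(conflict_edges(plan, fds))  # [(0, 1), (0, 2)]
```

Edge (0, 1) comes from Subsystem, Part → Vendor; edge (0, 2) comes from Vendor → Subsystem.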

In addition, as we mentioned above, the notion of subset repair was first proposed in the research on consistent query answering. However, the line of works on consistent query answering also cannot provide a better approximation to an optimal subset repair. Their goal is to derive certain answers for given queries by finding the answers that hold in all repairs [4, 48]. There are three approaches to achieve this goal. The first is the answer set programming based approaches [5, 16]. The second is the query rewriting based approaches [46, 47]. The third is the SAT/LP-solver based approaches [38, 39, 27, 26]. All these approaches try to avoid computing exponentially many repairs in the size of the database.

In summary, the following problems for computing optimal subset repairs remain open. Firstly, a tight lower bound is still unknown, so it is impossible to decide whether an algorithm for computing optimal subset repairs is optimal in terms of approximation ratio. Secondly, it is unknown whether we can approximate optimal subset repairs within a constant better than 2, which has a very important practical significance.

This paper focuses on these two open problems, and the main contributions are as follows.

Firstly, a tighter lower bound for the problem of computing optimal subset repairs is obtained. Concretely, it is proved that approximating optimal subset repairs within 17/16 is NP-hard for most of the polynomially intractable cases. For the other cases, it is proved that approximating optimal subset repairs within 69246103/69246100 is also NP-hard.

Secondly, an algorithm with an approximation ratio (2 − 1/2^{|Σ|−1}) is developed, where |Σ| is the number of given functional dependencies. This ratio is a constant better than 2. It is the current best algorithm in terms of approximation ratio.

Thirdly, to further improve the approximation ratio, the quasi-Turán cluster is proposed to extend the algorithm above. Based on the quasi-Turán cluster, an approximation algorithm computing subset repairs with ratio (2 − 1/2^{|Σ|−1} − ε) is designed, where the factor ε ≥ 0 depends on the characteristics of the input database.

Fourthly, plenty of experiments are conducted on real data to examine the performance and the effectiveness of the proposed approximation algorithms in real-world applications.

The remaining parts of this paper are organized as follows. Necessary concepts and the definition of the problem of computing subset repairs are formally stated in Section 2. The complexity analysis of the problem is shown in Section 3. The approximation algorithms are given in Section 4. Experiment results are illustrated in Section 5. At last, Section 6 concludes the paper.

2. PROBLEM STATEMENT
Concepts used in database theory and the formal definition of the problem are given in this section.

2.1 Integrity Constraints
The consistency of databases is usually guaranteed by integrity constraints. The languages for specifying integrity constraints can get quite involved. For example, functional dependency [1] is the most popular type of integrity constraint and has a long history; a large body of work on managing data quality is built on it. An important special case of functional dependencies is key constraints [1]. Conditional functional dependencies [14] generalize functional dependencies by conditioning requirements to specific data values using pattern tableaux; in particular, a single tuple may violate a conditional functional dependency.

In the research on computing optimal subset repairs, functional dependency is the commonly used integrity constraint [6, 40]. For convenience of discussion, this paper also takes functional dependency as the integrity constraint, like [40]. Note that it is quite straightforward to extend our results to the other types of integrity constraints mentioned above.

Let R be a relation schema, and I be an instance of R.

Functional dependency. Let X and Y be two sets of attributes of a relation R. A functional dependency φ is an expression of the form X → Y, which requires that there is no pair of tuples t1, t2 in any instance I of R such that t1[X] = t2[X] but t1[Y] ≠ t2[Y].

Inconsistent database. Let Σ be a set of functional dependencies for relation schema R, and I be an instance of R. I does not satisfy a given functional dependency φ: X → Y, denoted by I ⊭ φ, if there are two tuples t1, t2 ∈ I such that t1[X] = t2[X] but t1[Y] ≠ t2[Y]. An instance I of R is said to be consistent with respect to Σ, denoted by I ⊨ Σ, if I satisfies every functional dependency in Σ. Otherwise, I is inconsistent, denoted by I ⊭ Σ, i.e., there exists φ ∈ Σ such that I ⊭ φ.

2.2 Subset Repairs
Let Σ be a set of functional dependencies over R. A consistent subset of I with respect to Σ is a subset J ⊆ I such that J ⊨ Σ. A subset repair (a.k.a. S-repair) is a consistent subset that is not strictly contained in any other consistent subset. Formally, a consistent subset J of I is a subset repair of I with respect to Σ if there is no other consistent subset K of I such that J ⊊ K.

Before defining the cost of a subset repair, we need to discuss the weight of each tuple. The weight of a tuple t_i is a non-negative number, denoted by w_i, representing some important information concerned by applications. The weight is defined in different ways for different applications; the following are some examples. In the context of data repairing, the weight of a tuple usually represents the confidence of its correctness. In the context of answering aggregation queries, the weight of a tuple usually represents its contribution to query results. In the context of optimization, the weight of a tuple represents the benefit of the tuple.

Based on the tuple weights, the cost of a subset repair J ⊆ I is defined as

C(J, I) = Σ_{t_i ∈ I \ J} w_i,

which represents the cost of obtaining the subset repair. Note that the cost of a subset repair is |I \ J|, which is the distance between J and I, if every tuple has weight 1. This kind of cost is commonly used in data quality management [11].

Optimal subset repair. A subset repair J* ⊆ I is an optimal subset repair of I, optimal S-repair for short, if

C(J*, I) = min{C(J, I) : J ∈ S_{I,Σ}},

where S_{I,Σ} is the set of all subset repairs of instance I. For example, in Figure 1(b), both plans A and B are S-repairs of the procurement plan. A is an optimal S-repair since it has the minimum cost 13.7.
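The definitions above translate directly into code. A minimal sketch (FDs given as attribute-index pairs; names are illustrative) of checking I ⊨ Σ and computing the cost C(J, I):

```python
from itertools import combinations

def satisfies(instance, fds):
    """instance |= Sigma iff no pair of tuples violates any FD (lhs, rhs)."""
    for t1, t2 in combinations(instance, 2):
        for lhs, rhs in fds:
            if all(t1[a] == t2[a] for a in lhs) and any(t1[a] != t2[a] for a in rhs):
                return False
    return True

def cost(J, I, w):
    """C(J, I): total weight of the deleted tuples I \\ J."""
    kept = set(J)
    return sum(w[i] for i, t in enumerate(I) if t not in kept)

# County, Date -> Tally written over attribute indices: {0, 1} -> {2}
fds = [((0, 1), (2,))]
I = [("A", "11/07", 453), ("A", "11/07", 630), ("B", "11/07", 101)]
J1 = [("A", "11/07", 453), ("B", "11/07", 101)]
assert satisfies(J1, fds) and not satisfies(I, fds)
print(cost(J1, I, [1, 1, 1]))  # one tuple deleted under unit weights -> 1
```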

2.3 Problem Definition and Remarks
The problem of computing optimal subset repairs is formally defined as follows.

Definition 1 (OPTSR). Let R be a fixed relation schema and Σ be a fixed set of functional dependencies over R. For any instance I of R, the problem OPTSR is to compute an optimal subset repair J* of I with respect to Σ.

Complexity. In practice, compared with the data size, the number of attributes in the relation is very small, and so is the number of functional dependencies. This motivates us to adopt data complexity as the measure for the complexity analysis of OPTSR. In this way, the relation schema R and the functional dependency set Σ are fixed in advance; thus, the number of attributes in R and the number of functional dependencies can be considered constants. The instance I of R is considered the input of OPTSR. This paper focuses on data complexity.

Approximation. OPTSR is NP-hard under most settings of R and Σ. This paper will develop approximation algorithms for solving the problem OPTSR. For later use, the approximation of optimal subset repairs is defined as follows. Let J* be an optimal subset repair of I. For a constant k ≥ 1, a subset repair J is said to be a k-optimal S-repair if

C(J*, I) ≤ C(J, I) ≤ k · C(J*, I).

In particular, an optimal S-repair of any instance I is a 1-optimal S-repair of I. For example, in Figure 1(b), plan B is a 1.5-optimal S-repair of the procurement plan since its cost 16.6 ≤ 1.5 × 13.7.

3. COMPLEXITY OF PROBLEM OPTSR
In this section, we prove a tighter lower bound on the approximation ratio for the OPTSR problem. In the rest of the paper, we refer to the lower bound of the approximation ratio and the approximation ratio simply as the lower bound and the ratio, respectively.

As shown in [40], it is possible to decide for any input FD set Σ whether OPTSR is polynomially solvable, using the procedure OSRSucceed. Basically, OSRSucceed iteratively reduces the left-hand side of each FD in Σ according to three simple rules. It terminates in polynomial time when no FD in Σ can be further simplified, and returns true if only FDs of the form ∅ → X are left. It is proved that OPTSR is polynomially solvable if OSRSucceed(Σ) returns true. Otherwise, Σ makes OPTSR polynomially unsolvable, and there is a fact-wise reduction(1) from one of the following four special FD sets to Σ:

Σ_{A→B→C} = {A → B, B → C},
Σ_{A→B←C} = {A → B, C → B},
Σ_{AB→C→B} = {AB → C, C → B},
Σ_{AB↔AC↔BC} = {AB → C, AC → B, BC → A}.

That is, if OPTSR is NP-hard for a given Σ, then one of the following cases must happen.

Case 1. There is a fact-wise reduction from Σ_{A→B→C} or Σ_{A→B←C} to Σ, and it is NP-hard to compute a (67/66 − ε)-optimal S-repair.
Case 2. There is a fact-wise reduction from Σ_{AB→C→B} to Σ, and it is NP-hard to compute a (57/56 − ε)-optimal S-repair.
Case 3. There is a fact-wise reduction from Σ_{AB↔AC↔BC} to Σ, and it is NP-hard to compute a (1 + ε)-optimal S-repair for a certain ε > 0.

(1) A fact-wise reduction is a kind of strict reduction that is formally defined in [40]. For any ε > 0, if problem B has a (1 + ε)-approximation, then problem A has a (1 + ε)-approximation whenever there is a fact-wise reduction from A to B.

Therefore, the lower bound of problem OPTSR can be improved by finding tighter lower bounds for the four special FD sets. In this section, we derive a tighter lower bound for cases 1 and 2. Moreover, instead of the lengthy proofs, we provide much more concise proofs in which all the reductions are built from the problem MAX-NM-E3SAT [32]. For case 3, we carefully merge four previous reductions to derive the bound, which was not shown explicitly in previous research.

Lemma 1. Let the FD set Σ be one of Σ_{A→B→C}, Σ_{A→B←C} and Σ_{AB→C→B}. Given an ε > 0, it is NP-hard to compute a (17/16 − ε)-optimal S-repair for OPTSR even if every tuple in the instance has weight 1.

Proof. We here show three similar reductions. All of them are built from the problem MAX-NM-E3SAT [32], which cannot be approximated better than 7/8. Every input expression ψ of problem MAX-NM-E3SAT is subject to the following restrictions: (i) each clause has exactly three variables; (ii) each clause is monotone, i.e., either of the form "x + y + z" or of the form "x̄ + ȳ + z̄"; and (iii) no variable x occurs more than once in a clause. Specifically, each reduction builds, for each expression ψ of MAX-NM-E3SAT, an instance I of the relation schema R(A, B, C) in which every tuple has weight 1.

Reduction for Σ_{A→B→C}. First, two tuples (j, x_j, x_j) and (j, x_j, x̄_j) are inserted into I for each variable x_j. Second, a tuple is inserted into I for each clause c_i and each variable x_j in it as follows:
1. If c_i contains a positive literal of variable x_j, insert (c_i, x_j, x_j) into I.
2. If c_i contains a negative literal of variable x_j, insert (c_i, x_j, x̄_j) into I.
Suppose there are n variables and m clauses in ψ; then exactly 2n + 3m tuples are created in total. Intuitively, A → B guarantees that exactly one of the three corresponding tuples survives in any S-repair once a clause is satisfied. B → C guarantees that the assignment of each variable is valid, i.e., either true or false, but not both.

Reduction for Σ_{A→B←C}. First, two tuples (j, x_j, x_j) and (j, x̄_j, x_j) are inserted into I for each variable x_j. Second, a tuple is inserted into I for each clause c_i and each variable x_j in it as follows:
1. If c_i contains a positive literal of variable x_j, insert (c_i, x_j, x_j) into I.
2. If c_i contains a negative literal of variable x_j, insert (c_i, x̄_j, x_j) into I.
Suppose ψ contains n variables and m clauses; then exactly 2n + 3m tuples are created in total. Similarly, A → B guarantees that exactly one of the three corresponding tuples survives in any S-repair once a clause is satisfied. C → B guarantees the consistency of the variable assignment.

Reduction for Σ_{AB→C→B}. First, two tuples (x_i, 1, x_i) and (x_i, 0, x_i) are inserted into I for each variable x_i. Second, a tuple is inserted into I for each clause c_i and each variable x_j in it as follows:

1. If c_i contains a positive literal of variable x_j, insert (c_i, 1, x_j) into I.
2. If c_i contains a negative literal of variable x_j, insert (c_i, 0, x_j) into I.
Suppose there are n variables and m clauses in ψ; then exactly 2n + 3m tuples are created in total. AB → C guarantees that exactly one of the three tuples survives in any S-repair once the corresponding clause is satisfied. C → B guarantees the consistency of the variable assignment.

Deriving the lower bound. We only show the process of deriving the lower bound for Σ_{AB→C→B}; the processes for the other two are similar. Let ψ be an input expression of problem MAX-NM-E3SAT, and I be the instance built by our reduction.

The functional dependency AB → C guarantees that any S-repair J of I contains at most one of the three tuples with the same value c_i on attribute A, where 1 ≤ i ≤ m. The functional dependency C → B guarantees that any S-repair J of I contains either (c_i, 1, x_j) or (c_i, 0, x_j) for 1 ≤ i ≤ m and 1 ≤ j ≤ n. Therefore, we can derive a variable assignment τ from every S-repair J:

τ(x_i) = 1 if (x_i, 1, x_i) ∈ J, and τ(x_i) = 0 otherwise.

τ(·) is valid in the sense that τ(x_i) is either 0 or 1, but not both, for each 1 ≤ i ≤ n.

Let J* be an optimal S-repair. It can be proved that I \ J* does not contain both (x_i, 1, x_i) and (x_i, 0, x_i) for any variable x_i. If this were not true, there would be some i such that both (x_i, 1, x_i) and (x_i, 0, x_i) are in I \ J*. We could pick one of them and insert it into J* without causing any inconsistency. Then an S-repair J larger than J* would be found, i.e., C(J, I) > C(J*, I), a contradiction.

Let τ be an arbitrary valid variable assignment, and N(ψ) be the number of clauses in ψ satisfied by τ. Let τ_max be an optimal variable assignment, and N_max(ψ) be the number of clauses in ψ satisfied by τ_max. Thus, we have

N_max(ψ) = |I| − |I \ J*| − n,

and for any S-repair J of I,

N(ψ) ≥ |I| − |I \ J| − n.

Additionally, we have |I| = 2n + 3m. Let k > 1 and let J be a k-optimal S-repair of I, so that |I \ J*| ≤ |I \ J| ≤ k · |I \ J*|. Then we have

N(ψ)/N_max(ψ) ≥ (|I| − |I \ J| − n) / (|I| − |I \ J*| − n)
            ≥ (|I| − k·|I \ J*| − n) / (|I| − |I \ J*| − n)
            = 1 + (1 − k) · |I \ J*| / (|I| − |I \ J*| − n).   (1)

Since each clause has exactly three literals, we have

|I \ J*| ≤ n + 2 · (|I| − 2n)/3.

Applying this fact to the right-hand side of (1), and since |I| = 2n + 3m > 2n, the following inequality holds:

|I \ J*| / (|I| − |I \ J*| − n) ≤ 2.   (2)

Applying inequality (2) to inequality (1), we obtain

N(ψ)/N_max(ψ) ≥ 3 − 2k.

This inequality implies that the problem MAX-NM-E3SAT can be approximated with ratio 3 − 2k if a k-optimal S-repair can be computed in polynomial time. Suppose k < 17/16; then 3 − 2k > 7/8, which implies that the MAX-NM-E3SAT problem admits an approximation with ratio better than 7/8. This contradicts the hardness result obtained in [32]. Therefore, k ≥ 17/16.

Finally, the same bound can be derived in the same way for Σ_{A→B→C} and Σ_{A→B←C}. The lemma then follows immediately.

The following lemma derives the explicit lower bound for Σ_{AB↔AC↔BC}.

Lemma 2. For Σ_{AB↔AC↔BC}, it is NP-hard to compute a (69246103/69246100 − ε)-optimal S-repair for any ε > 0, even if every tuple in the instance has weight 1.

Proof. By merging the following (α, β)-L-reductions given in previous literature,

MAX B29-3SAT ≤_{529,1} 3DM [33],
3DM ≤_{1,1} MAX 3SC [33],
MAX 3SC ≤_{55,1} MECT-B [3],
MECT-B ≤_{7/6,1} OPTSR(R, Σ_{AB↔AC↔BC}) [40],

we can conclude that MAX B29-3SAT could be approximated within 680/679 if a (69246103/69246100 − ε)-optimal S-repair could be computed in polynomial time for any ε > 0 when Σ is Σ_{AB↔AC↔BC}, which contradicts the hardness result shown in [23].

Based on LEMMA 1 and LEMMA 2, the following theorem gives a new complexity result.

Theorem 1. Given a fixed schema R and a fixed FD set Σ, for any ε > 0:
1. If OSRSucceed(Σ) returns true, then an optimal S-repair can be computed in polynomial time.
2. If OSRSucceed(Σ) returns false, and there is a fact-wise reduction from Σ_{AB↔AC↔BC} to Σ, then it is NP-hard to compute a (69246103/69246100 − ε)-optimal S-repair.
3. Otherwise, it is NP-hard to compute a (17/16 − ε)-optimal S-repair.

Proof. Case 1 has been proved by [40]. Case 2 is proved by LEMMA 2. Case 3 is proved by LEMMA 1.
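The reduction for Σ_{AB→C→B} used in the proof of Lemma 1 is mechanical to implement. A minimal sketch (the clause encoding as (is_positive, variables) pairs is an assumption of this sketch, exploiting that MAX-NM-E3SAT clauses are monotone):

```python
# Lemma 1 reduction for Sigma_{AB->C->B}: a monotone E3SAT expression
# becomes a weight-1 instance of R(A, B, C) with exactly 2n + 3m tuples.
def build_instance(variables, clauses):
    I = []
    for x in variables:                 # variable gadgets: (x, 1, x) and (x, 0, x)
        I.append((x, 1, x))
        I.append((x, 0, x))
    for ci, (positive, lits) in enumerate(clauses):
        for x in lits:                  # clause gadgets: (c_i, 1, x) or (c_i, 0, x)
            I.append((f"c{ci}", 1 if positive else 0, x))
    return I

# psi = (x + y + z)(x' + y' + w') with n = 4 variables and m = 2 clauses
I = build_instance(["x", "y", "z", "w"],
                   [(True, ["x", "y", "z"]), (False, ["x", "y", "w"])])
print(len(I))  # 2*4 + 3*2 = 14
```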

In Figure 1(b), the two functional dependencies form the case Σ_{AB→C→B}. Our result implies that no polynomial algorithm can compute a solution with a constant ratio better than 17/16 for the case in Figure 1(b).

Relation with Vertex Cover. In general, OPTSR is equivalent to the weighted minimum vertex cover problem (WMVC) if the schema is not fixed and the number of functional dependencies is unbounded. Therefore, approximation algorithms for WMVC can be directly applied to computing approximations of optimal S-repairs. On the other hand, the current best known approximation ratio for WMVC is (2 − Θ(1/√(log n))) [34]. However, this ratio depends on the size of the input. In fact, there is no polynomial-time (2 − ε)-approximation algorithm for WMVC for any ε > 0 [36] if the unique games conjecture [35] is true. This rules out the possibility of computing a (2 − ε)-optimal S-repair for any constant ε > 0 unless the unique games conjecture is false.

4. APPROXIMATION ALGORITHMS

Three approximation algorithms are developed for OPTSR in this section.

The first algorithm, named BL-LP, is a baseline algorithm. It returns a (2 − 2/p)-optimal S-repair, where p is the size of the Σ-partition and may be as large as n; this approximation ratio depends on the size of the input data.

The second algorithm, named TE-LP, is an improved algorithm. It returns a (2 − 1/2^{σ−1})-optimal S-repair, where σ is the number of functional dependencies in Σ and is a constant.

The third algorithm, named et-TE-LP, is an extension of algorithm TE-LP. It returns a (2 − 1/2^{σ−1} − ε)-optimal S-repair, where ε ≥ 0 is still a factor independent of the data size. It outperforms TE-LP when there are a large number of quasi-Turán clusters in the input database.

4.1 Algorithm BL-LP

We start from a half-integral linear program formulated for OPTSR, then show a basic rounding strategy to obtain an approximation to optimal subset repairs, and finally analyze the performance of the algorithm.

4.1.1 A Half-Integral Linear Program

For an input I of OPTSR, we define a 0-1 variable x_i for each tuple t_i ∈ I as follows:

    x_i = 0, if t_i is in a feasible S-repair J,
    x_i = 1, otherwise.

OPTSR can be formulated as the following 0-1 integer linear program M_ILP:

    minimize   Σ_{t_i∈I} w_i x_i
    s.t.       x_i + x_j ≥ 1,  ∀ t_i, t_j ∈ I, {t_i, t_j} ⊭ Σ,
               x_i ∈ {0, 1},   ∀ t_i ∈ I.

We can routinely relax M_ILP to a linear program in which any solution x takes values from [0, 1]^n. Furthermore, every extreme point of the linear-program relaxation takes values from {0, 1/2, 1}^n, as shown in [43]. This gives the following half-integral program M_HIP:

    minimize   Σ_{t_i∈I} w_i x_i
    s.t.       x_i + x_j ≥ 1,  ∀ t_i, t_j ∈ I, {t_i, t_j} ⊭ Σ,
               x_i ∈ {0, 1/2, 1}, ∀ t_i ∈ I.

Let x* be the optimal solution of M_ILP and x′ be the optimal solution of M_HIP; then

    Σ_{t_i∈I} w_i x′_i ≤ Σ_{t_i∈I} w_i x*_i.    (3)

4.1.2 Σ-Partition

To obtain a good approximation from M_HIP, we first partition the input I with respect to Σ, and then do the rounding based on the partition. Intuitively, all the tuples of I are grouped into non-empty consistent subsets in such a way that every tuple is in exactly one consistent subset. A Σ-partition of instance I, denoted P(I, Σ), is a collection of disjoint consistent subsets of I such that

- ∀K ∈ P(I, Σ), K ⊨ Σ,
- ∪_{K∈P(I,Σ)} K = I,
- ∀K_i, K_j ∈ P(I, Σ), K_i ∩ K_j = ∅.

The size of a Σ-partition is the number of sets in it. The smallest Σ-partition is preferred for the rounding. Unfortunately, the size of a Σ-partition may be as large as n: when all the tuples in I are pairwise conflicting with each other, the size of any Σ-partition is n. Therefore, the size of a Σ-partition cannot be bounded by a constant.

We use greedy search to find a Σ-partition. For each tuple t ∈ I, the greedy search works as follows.

1. Find an existing consistent set K ∈ P(I, Σ) such that K ∪ {t} ⊨ Σ, and put t into K.
2. If no such K can be found, create a new set K = {t} and let P(I, Σ) = P(I, Σ) ∪ {K}.
3. Once all the tuples have been placed into a consistent set, a Σ-partition P(I, Σ) is obtained.

4.1.3 Approximation via Σ-Partition-Based Rounding

A solution of M_HIP can be rounded by a Σ-partition-based rounding. When a Σ-partition P(I, Σ) is found, the rounding of the solution x′ of M_HIP is carried out as follows.

1. Let x_i = x′_i if x′_i ≠ 1/2.
2. Let x_i = 0 if x′_i = 1/2 and t_i ∈ K*, where
       K* ∈ argmax_{K_j∈P(I,Σ)} Σ { x′_i | t_i ∈ K_j ∧ x′_i = 1/2 };
   otherwise, let x_i = 1.
3. Output the S-repair J = {t_i : x_i = 0}.

The baseline algorithm BL-LP based on the rounding method above is presented as follows.

Algorithm 1 BL-LP
Input: Instance I of R, FD set Σ, and Σ-partition P(I, Σ)
Output: Approximate optimal S-repair
1: Formulate M_HIP with respect to I and Σ;
2: Solve M_HIP by an LP-solver to obtain a solution x′;
3: Let J ← ∅;
4: Find K* from P(I, Σ) such that
     K* ∈ argmax_{K_j∈P(I,Σ)} Σ { x′_i | t_i ∈ K_j ∧ x′_i = 1/2 };
5: for each t_i ∈ I do
6:   if x′_i = 0 or (x′_i = 1/2 and t_i ∈ K*) then
7:     J ← J ∪ {t_i};
8: return J
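The greedy Σ-partition search above can be sketched as follows. This is a minimal illustration under our own encoding (tuples as dicts, FDs as (LHS-attributes, RHS-attribute) pairs); the function names are ours, not the paper's.

```python
def violates(t1, t2, fds):
    """Two tuples conflict iff they agree on some FD's LHS but differ on its RHS."""
    return any(all(t1[a] == t2[a] for a in lhs) and t1[rhs] != t2[rhs]
               for lhs, rhs in fds)

def greedy_sigma_partition(tuples_, fds):
    """Greedy search: put each tuple into the first consistent set that accepts
    it, opening a new consistent set when none does."""
    partition = []
    for t in tuples_:
        for block in partition:
            if all(not violates(t, u, fds) for u in block):
                block.append(t)
                break
        else:
            partition.append([t])  # no existing consistent set accepts t
    return partition

# Toy instance for the FD A -> B: the first two rows conflict.
fds = [(("A",), "B")]
rows = [{"A": 1, "B": "x"}, {"A": 1, "B": "y"}, {"A": 2, "B": "x"}]
partition = greedy_sigma_partition(rows, fds)  # two consistent sets
```

As the text notes, on a pairwise-conflicting instance this greedy search degenerates to n singleton sets, which is why the trimming step of TE-LP is needed.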

4.1.4 Performance Analysis of BL-LP

Theorem 2. Let p ≥ 2 be the size of the Σ-partition of inconsistent instance I. Algorithm BL-LP returns a (2 − 2/p)-optimal S-repair in O(n²) time.

Proof. We first prove that J is an S-repair. Let x′ be the solution of M_HIP. Let t_i and t_j be any two tuples such that {t_i, t_j} ⊭ Σ. Without loss of generality, we assume x′_i ≥ x′_j. Clearly, either x′_i = 1 or x′_i = x′_j = 1/2.

When x′_i = 1, t_i ∉ J due to the rounding.

When x′_i = x′_j = 1/2, the following statements hold due to the rounding. First, if t_i, t_j ∉ K*, then t_i, t_j ∉ J. Second, if t_i ∈ K* and t_j ∉ K*, then t_j ∉ J. Third, if t_i ∉ K* and t_j ∈ K*, then t_i ∉ J. Finally, t_i and t_j cannot both belong to K*, since K* is consistent.

Therefore, J is a consistent subset. Since x′ is the optimal solution of M_HIP, J is maximal. Hence, J is an S-repair.

Now we analyze the performance of BL-LP. Let

    S_1 = {x′_i : t_i ∈ I, x′_i = 1},
    S_{1/2} = {x′_i : t_i ∈ I, x′_i = 1/2}, and
    S_{K*} = {x′_i : t_i ∈ K*, x′_i = 1/2}.

Thus, J satisfies the following inequality:

    C(J, I) = Σ_{x′_i∈S_1} w_i x′_i + 2 ( Σ_{x′_i∈S_{1/2}} w_i x′_i − Σ_{x′_i∈S_{K*}} w_i x′_i )
           ≤ Σ_{x′_i∈S_1} w_i x′_i + 2 ( Σ_{x′_i∈S_{1/2}} w_i x′_i − (1/p) · Σ_{x′_i∈S_{1/2}} w_i x′_i )
           = Σ_{x′_i∈S_1} w_i x′_i + (2 − 2/p) · Σ_{x′_i∈S_{1/2}} w_i x′_i
           ≤ (2 − 2/p) Σ_{x′_i∈S_1∪S_{1/2}} w_i x′_i.

Let x* be the optimal solution of M_ILP and J* be the optimal S-repair. Due to inequality (3), we have

    Σ_{x′_i∈S_1∪S_{1/2}} w_i x′_i = Σ_{t_i∈I} w_i x′_i ≤ Σ_{t_i∈I} w_i x*_i = C(J*, I).

Therefore, C(J, I) ≤ (2 − 2/p) · C(J*, I).

Algorithm BL-LP takes O(n²) time to formulate M_HIP and find a Σ-partition. Then, by calling an existing fast LP-solver, it takes O(n) time to solve M_HIP and do the rounding. Therefore, the time complexity of algorithm BL-LP is O(n²) since σ is a constant. ∎

Remarks. The value of p in Theorem 2 depends on the value distribution over the input instance I. As mentioned above, p may be as large as n. Hence, algorithm BL-LP may return a (2 − o(1))-optimal S-repair. We next show algorithm TE-LP, which has a much better approximation ratio.

4.2 Algorithm TE-LP

In this subsection, we develop an improved algorithm TE-LP which returns a (2 − 1/2^{σ−1})-optimal S-repair, where σ is the constant number of functional dependencies in Σ. Formally, let J* be an optimal S-repair; algorithm TE-LP requires O(n²) time and returns a solution J such that

    C(J*, I) ≤ C(J, I) ≤ (2 − 1/2^{σ−1}) · C(J*, I).    (4)

4.2.1 Overview of Algorithm TE-LP

The algorithm TE-LP consists of three main steps, as shown in Algorithm 2.

Algorithm 2 TE-LP
Input: Instance I of R, FD set Σ
Output: Approximate optimal S-repair J
1: J ← ∅;
2: I′ ← TrimInstance(I, Σ);
3: P(I′, Σ) ← FindPartition(I′, Σ);
4: J ← BL-LP(I′, Σ, P(I′, Σ));
5: return J

Step 1: Trim instance. It first trims instance I by eliminating all the triads, where a triad is a set of three tuples that are pairwise conflicting with each other.

Step 2: Find partition. After trimming off all the triads in I, an instance I′ ⊆ I is obtained whose Σ-partition has a constant size. Algorithm TE-LP then computes such a small Σ-partition of I′.

Step 3: Approximate. Algorithm TE-LP runs BL-LP on the trimmed instance I′ and computes an S-repair with a better ratio based on the small Σ-partition.

We next detail each step and prove the related theoretical results.

4.2.2 Trim Instance by Triad Elimination

Triad. Any subset T = {t_i, t_j, t_k} ⊆ I is a triad if
- w_i ≠ 0, w_j ≠ 0, w_k ≠ 0, and
- ∃φ ∈ Σ such that {t_i, t_j} ⊭ φ, {t_i, t_k} ⊭ φ, {t_j, t_k} ⊭ φ.

For example, in Figure 1(a), {r1, r2, r3} is a triad, and there is no triad in Figure 1(b). Actually, any three of {r1, r2, r3, r4} form a triad.

Due to the definition of S-repair, we have the following observation.

Observation 1. For any triad T ⊆ I, any S-repair J of I contains at most one tuple of T.

By this observation, we can trim off all the triads of the input instance I without a significant loss of approximation accuracy. The resultant instance then has a small Σ-partition.

Algorithm 3 TrimInstance
Input: Instance I of R and FD set Σ
Output: Instance I′ with no triad in it
1: H ← ∅;
2: while there exists a triad in I do
3:   find a triad T = {t_i, t_j, t_k} ⊆ I;
4:   w(T) ← min{w_i, w_j, w_k};
5:   for each z ∈ {i, j, k} do
6:     reweight t_z such that w_z ← w_z − w(T);
7:     if w_z = 0 then H ← H ∪ {t_z};
8: I′ ← I \ H;
9: return I′

Trim Instance. As shown in Algorithm 3, TrimInstance obtains a new instance I′ in the following steps.
Step 1. Find a triad T ⊆ I.
Step 2. Suppose T = {t_1, t_2, t_3} and w_1 ≤ w_2 ≤ w_3. Reweight each tuple by reducing w_1, i.e., w_1 ← 0, w_2 ← w_2 − w_1, w_3 ← w_3 − w_1.

Step 3. Remove the tuples with weight 0 from instance I.
Step 4. Repeat Steps 1–3 until no triad remains; the resultant instance is I′.

Let J be an arbitrary consistent subset of I; then J ∩ I^(1) is still a consistent subset of I^(1), the instance obtained after processing the first triad T_1. …
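Steps 1–4 of TrimInstance amount to a local-ratio style reweighting. The following is a minimal sketch under our own encoding: per-FD conflict pairs are given as sets of index pairs, and `find_triad`/`trim_instance` are illustrative names of our choosing.

```python
def find_triad(weights, conflict_sets):
    """A triad: three positive-weight tuples pairwise conflicting on the SAME
    FD. `conflict_sets` holds, per FD in Sigma, the set of conflicting pairs
    (frozensets of tuple indices)."""
    live = [i for i, w in enumerate(weights) if w > 0]
    for pairs in conflict_sets:                 # one entry per FD
        for a in range(len(live)):
            for b in range(a + 1, len(live)):
                for c in range(b + 1, len(live)):
                    i, j, k = live[a], live[b], live[c]
                    if {frozenset((i, j)), frozenset((i, k)),
                        frozenset((j, k))} <= pairs:
                        return i, j, k
    return None

def trim_instance(weights, conflict_sets):
    """TrimInstance sketch: subtract each triad's minimum weight from all
    three of its tuples; tuples reaching weight 0 are removed."""
    w = list(weights)
    while (t := find_triad(w, conflict_sets)) is not None:
        delta = min(w[z] for z in t)
        for z in t:
            w[z] -= delta
    kept = [i for i, wi in enumerate(w) if wi > 0]
    return kept, w

# Three pairwise-conflicting tuples with weights 3, 1, 2: the lightest one
# (index 1) is trimmed away, and the triad's minimum weight is charged to all.
weights = [3, 1, 2]
conflicts = [{frozenset((0, 1)), frozenset((0, 2)), frozenset((1, 2))}]
kept, final_w = trim_instance(weights, conflicts)
```

The cubic scan in `find_triad` is only for clarity; the paper's complexity analysis assumes a faster detection.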

Note that ∀t_i ∈ I \ I′, w_i^(m) = 0. Therefore, the left part can be written as

    C(J, I \ I′) ≤ (3/2) ( Σ_{t_i∈(I\I′)\J*} w_i + Σ_{t_i∈I′\J*, w_i^(m)<w_i} (w_i − w_i^(m)) ).    (8)

Given any instance I with no triad, FindPartition iteratively partitions I as follows.
Step 1. Initially, P(I, Σ) = {I}.
Step 2. In the i-th iteration, each K ∈ P(I, Σ) is partitioned into at most two disjoint sets that are consistent with respect to the i-th functional dependency φ_i ∈ Σ.
Step 3. FindPartition terminates once all the functional dependencies have been processed.
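The per-FD split in Step 2 can be sketched as follows. This is our own illustration of why two parts always suffice on a triad-free instance: within one FD, three same-LHS tuples with pairwise different RHS values would form a triad, so each LHS group carries at most two RHS values. The names are ours, not the paper's.

```python
def split_by_fd(block, lhs, rhs):
    """Split a triad-free block into at most two parts, each consistent with
    the FD lhs -> rhs. Triad-freeness guarantees each LHS group carries at
    most two distinct RHS values, so two parts always suffice."""
    part_a, part_b = [], []
    first_rhs = {}                       # LHS value -> first RHS value seen
    for t in block:
        key = tuple(t[a] for a in lhs)
        first_rhs.setdefault(key, t[rhs])
        (part_a if t[rhs] == first_rhs[key] else part_b).append(t)
    return [p for p in (part_a, part_b) if p]

def find_partition(tuples_, fds):
    """FindPartition sketch: start from {I}; refine by each FD in turn.
    The result has at most 2**sigma blocks for sigma FDs."""
    partition = [list(tuples_)]
    for lhs, rhs in fds:
        partition = [piece for block in partition
                     for piece in split_by_fd(block, lhs, rhs)]
    return partition

fds = [(("A",), "B")]
rows = [{"A": 1, "B": "x"}, {"A": 1, "B": "y"}, {"A": 2, "B": "z"}]
parts = find_partition(rows, fds)   # two blocks, each consistent with A -> B
```

Subsets of a φ-consistent set stay φ-consistent, so refining by later FDs never breaks consistency with earlier ones.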

Theorem 3. Algorithm TE-LP returns a (2 − 1/2^{σ−1})-optimal S-repair J in O(n²) time, and J satisfies inequality (4). Writing r = 2 − 1/2^{σ−1}, the bounds together yield

    C(J, I) ≤ r · ( Σ_{t_i∈(I\I′)\J*} w_i + Σ_{t_i∈I′\J*, w_i^(m)<w_i} w_i + Σ_{t_i∈I′\J*, w_i^(m)=w_i} w_i ).
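For concreteness, the ratio in Theorem 3 only depends on σ, the number of functional dependencies; a quick arithmetic check (our own, matching the values quoted in the remarks of this subsection):

```python
def te_lp_ratio(sigma):
    """Approximation ratio 2 - 1/2**(sigma - 1) of TE-LP for sigma FDs."""
    return 2 - 1 / 2 ** (sigma - 1)

# One FD gives ratio 1 (the single-FD case is easy); two FDs give 1.5;
# three FDs give 1.75; the ratio approaches 2 as sigma grows.
```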

Proof. After trimming off all the triads from I, an instance I′ is obtained. The cost of solution J is

    C(J, I) = C(J, I \ I′) + C(J, I′).    (7)

Let T = {T_1, ..., T_m} be the sequence of triads processed by TE-LP. For each tuple t_i ∈ I′, let w_i^(m) be the weight of t_i after trimming all the m triads.

First, let us consider C(J, I \ I′) in equation (7). All the tuples in I \ I′ have final weight 0. The weight of any tuple t_i is decreased only when trimming off some triad T_h ∈ T containing t_i. Hence,

    C(J, I \ I′) = Σ_{t_i∈I\I′} w_i ≤ Σ_{T_h∈T} 3 w(T_h) = (3/2) Σ_{T_h∈T} 2 w(T_h).

Let J* be an optimal S-repair of I. By Lemma 3,

    Σ_{T_h∈T} 2 w(T_h) ≤ Σ_{t_i∈I\J*, w_i^(m)<w_i} ( w_i − w_i^(m) ).

Combining the bounds,

    C(J, I) ≤ r · ( Σ_{t_i∈(I\I′)\J*} w_i + Σ_{t_i∈I′\J*} w_i )
           = r · C(J*, I)
           = (2 − 1/2^{σ−1}) · C(J*, I).

Therefore, J is a (2 − 1/2^{σ−1})-optimal S-repair.

Algorithm TE-LP takes O(n²) time to formulate M_HIP and find a Σ-partition. Then, by calling the existing fast LP-solver, it takes O(n) time to solve M_HIP and do the rounding. Therefore, the time complexity of algorithm TE-LP is O(n²) since σ is a constant. ∎

Remarks. The approximation ratio of TE-LP depends only on the number of functional dependencies, not on the data size. For instance, TE-LP returns a 1.5-optimal S-repair for the cases Σ_{A→B→C}, Σ_{A→B←C} and Σ_{AB→C→B}, and a 1.75-optimal S-repair for the case Σ_{AB↔AC↔BC}, no matter how large the data is.

4.3 Algorithm et-TE-LP

Meanwhile, continuing the proof of Theorem 3,

    C(J, I \ I′) ≤ (3/2) Σ_{t_i∈I\J*, w_i^(m)<w_i} ( w_i − w_i^(m) ).

Given an integer h > 1, a subset K ⊆ I is an h-quasi-Turán cluster if there exists a functional dependency X → Y of Σ such that

1. K consists of all tuples of I sharing the same X-value, and
2. K satisfies

    Σ_{t_i∈K} w_i ≥ (h + 1) · max_y { Σ_{t_i∈K, t_i[Y]=y} w_i }.

Obviously, the elimination of an h-quasi-Turán cluster may lead to a loss in the approximation ratio, but the loss can be bounded by a factor depending on h.

Example. Consider BrownVotes shown in Figure 1(a). K = {r1, r2, r3, r4} is a 2-quasi-Turán cluster, because it consists of all tuples sharing the same value of County, and it satisfies the constraint w1 + w2 + w3 + w4 ≥ (2 + 1) · w3 = 3 × 560, since w3 = 560 is larger than each of the other three weights. Any optimal S-repair drops weight 2197 − 560, while any S-repair disjoint from K drops weight 2197; therefore the ratio loss is 560/2197, which is less than 1/3.

Algorithm et-TE-LP iteratively calls TE-LP to compute the approximate S-repair. In each iteration, it first eliminates all the disjoint h-quasi-Turán clusters from the input instance, and then calls TE-LP to compute an approximate S-repair. At last, the best approximate S-repair among those already computed is returned. The procedure of et-TE-LP is shown in Algorithm 5.

If J* is the optimal S-repair of I, then

    C(J1*^(h), I1^(h)) + C(J2*^(h), I2^(h)) ≤ C(J*, I).

Now, the bound on the approximation ratio is

    C(J^(h), I) / C(J*, I)
        ≤ [ ((h+1)/h) · C(J^(h), I1^(h)) + (2 − 1/2^{σ−1}) · C(J^(h), I2^(h)) ] / [ C(J1*^(h), I1^(h)) + C(J2*^(h), I2^(h)) ]
        ≤ ((h+1)/h) · ρ_h + (2 − 1/2^{σ−1}) · (1 − ρ_h)
        = 2 − 1/2^{σ−1} − (1 − 1/h − 1/2^{σ−1}) · ρ_h.

Since Algorithm 5 finally selects the best solution,

    C(J^(h), I) / C(J*, I) ≤ 2 − 1/2^{σ−1} − max_h { (1 − 1/h − 1/2^{σ−1}) · ρ_h }.

In each iteration, algorithm et-TE-LP takes O(n log n) time to eliminate all the h-quasi-Turán clusters, by sorting the tuples of I for each functional dependency, and O(n²) time to run TE-LP. Since there are n iterations in total, the time complexity of algorithm et-TE-LP is O(n³).

Remarks. As we can see, the performance of et-TE-LP depends on the characteristics of the input rather than the data size.
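Condition (2) of the definition can be checked by a single pass over the cluster. The sketch below uses our own encoding, a cluster as (Y-value, weight) pairs of the tuples sharing one X-value, and hypothetical weights in the spirit of the BrownVotes example (the figure's exact values are not reproduced here).

```python
from collections import defaultdict

def is_h_quasi_turan(cluster, h):
    """cluster: (y_value, weight) pairs for tuples sharing one X-value.
    Condition (2): the cluster's total weight is at least (h + 1) times the
    weight of its heaviest single-Y-value group."""
    groups = defaultdict(float)
    for y, w in cluster:
        groups[y] += w
    return sum(w for _, w in cluster) >= (h + 1) * max(groups.values())

# Hypothetical weights: total 2210, heaviest Y-group 600, so the cluster is
# a 2-quasi-Turan cluster (3 * 600 <= 2210) but not a 3-quasi-Turan one.
cluster = [("y1", 500), ("y2", 550), ("y3", 560), ("y4", 600)]
```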
et-TE-LP has an approximation ratio much better than TE-LP for a non-zero ρ^(h), where 1 < h ≤ n.

Algorithm 5 et-TE-LP
Input: Instance I of R and FD set Σ
Output: Approximate optimal S-repair J
…
4: J′ ← TE-LP(I′, Σ);
5: if J′ is better than J then J ← J′;
6: return J

4.3.1 Performance Analysis of et-TE-LP

Let I1^(h) be the set of all tuples eliminated by EliminateQuasiTuránClusters in the h-th iteration, and let I2^(h) = I \ I1^(h) for any integer h > 1. Let J^(h) be the approximate S-repair returned by calling TE-LP in the h-th iteration of et-TE-LP. Obviously, I1^(h) ⊆ I \ J^(h). We define ρ^(h) as

    ρ^(h) = ( Σ_{t_i∈I1^(h)} w_i ) / ( Σ_{t_i∈I\J^(h)} w_i )
          = C(J^(h), I1^(h)) / ( C(J^(h), I1^(h)) + C(J^(h), I2^(h)) ).

The following theorem shows the performance of et-TE-LP.

Theorem 4. Algorithm et-TE-LP returns a (2 − 1/2^{σ−1} − ε)-optimal S-repair in O(n³) time, where

    ε = max_h { (1 − 1/h − 1/2^{σ−1}) · ρ_h } ≥ 0

for 1 < h ≤ n.

5. EXPERIMENTS

Experiments were conducted on real and synthetic data to examine BL-LP, TE-LP and et-TE-LP. Given the schema and constraint set, both the efficiency and the accuracy of the three algorithms were evaluated by varying the data. In addition, taking consistent query answering as an example, we evaluated the effectiveness of our algorithms in real-world applications.

5.1 Experimental Settings

Experiments are conducted on a CentOS 7 machine with eight 16-core Intel Xeon processors and 3072 GB of memory.

Datasets. All the datasets used in the experiments were collected from real-world applications.

PROCP is an instance of the schema Procurement Plan shown in Figure 1(b). The data in PROCP are online product reviews collected from amazon² and some other online sales sites³. The number of tuples was varied from 10K to 40K. The FD set Σ1 consists of the two functional dependencies shown in Figure 1(b).

ORDER was synthesized by propagating PROCP with zip, area code⁴ and street information⁵ for US states. …

DBLP was extracted from dblp⁶ of schema (title, authors, year, publisher, pages, ee, url), which has 40K tuples. The FD set Σ3 includes {url}→{title}, {title}→{author}, {ee}→{title}, and {year, publisher, pages}→{title, ee, url}.

One can easily verify that Σ1, Σ2 and Σ3 cannot be simplified further by procedure OSRSucceed [40]. Hence, the problem is NP-hard in each of these cases.

Injecting Inconsistency. PROCP was used to simulate the procurement plan refinement in marketing; it is already an inconsistent database with respect to Σ1. Inconsistent versions of ORDER and DBLP were generated by adding noise only to the attributes that are related to some functional dependency. We refer to the initial consistent data as I0 and to the inconsistent version as I1. The noise is controlled by the noise rate ρ. We introduced two types of noise: typos and values drawn from the active domain.

Weights. In accordance with the cost model defined in Section 2.1, we assign weights to the tuples in the following way. A tuple t is called a dirty tuple if some attribute value of t in I1 differs from its counterpart in I0. For each dirty tuple t of ORDER and DBLP, a random weight w(t) in [0, a] was assigned to t. For the other tuples t of ORDER and DBLP, a weight w(t) in [b, 1] was selected randomly. This follows the assumption, commonly used in data repairing, that the weight of a dirty tuple is usually slightly lower than the weight of other tuples. In the experiments, we set a = 0.3 and b = 0.7. For PROCP, in the efficiency and accuracy testing, the customer rating was treated as the tuple weight, which is a number in [0, 5]. In the effectiveness test, the price is also treated as the tuple weight, which may contribute to the results of aggregation queries.

Methods. We implemented algorithms BL-LP, TE-LP and et-TE-LP. As a basic tool, the GLPK-LP/MIP-Solver⁷ was used to solve large-scale linear programs. We also implemented L2VC as a tool to obtain an upper-bound line of approximation; it is the 2-approximation algorithm for weighted minimum vertex cover given in [8]. All the algorithms were implemented in C++.

Metrics. To test the accuracy, L2VC was used as an upper bound of optimal S-repairs. To examine the efficiency, we ran each algorithm 10 times at each dataset size and plotted the average running time. To evaluate the effectiveness of our algorithms, we issued 100 pairs of aggregation queries similar to the examples shown in Figure 1(a). Each query is either SUM on attribute price of PROCP and ORDER, or SUM on attribute pages of DBLP. For each query pair (Q, Q′), either lub(Q) ≤ glb(Q′) or lub(Q) > glb(Q′) holds. We used our algorithms to help decide which relationship holds. Specifically, let J̃ be the approximate optimal S-repair, r be the approximation ratio of the algorithm used to compute J̃, and W(Q) be the maximum value of the query result. We compute the estimation of the aggregation result of Q by

    lub~(Q) = W(Q) − C(J̃)/r.

Note that dividing by r is to get rid of false positives. To obtain the ground truth, we employed the FPT algorithm [17] for weighted minimum vertex cover and another FPT algorithm [49] for weighted maximum minimal vertex cover to compute the exact lub and glb for the query result. Let Y_i be a 0-1 variable such that

    Y_i = 0, if lub(Q_i) ≤ glb(Q′_i) and lub~(Q_i) ≤ glb~(Q′_i),
    Y_i = 0, if glb(Q_i) ≥ lub(Q′_i) and glb~(Q_i) ≥ lub~(Q′_i),
    Y_i = 1, otherwise.

Let Y := Σ_{i=1}^{100} Y_i be the number of incorrect decisions made by our algorithms. Due to the time and space limitations of the two FPT algorithms, W(Q) was bounded by 120 by restricting the range of each aggregation in our experiments. The gap between two query results is defined as

    gap(Q, Q′) = max{ lub(Q)/glb(Q′), glb(Q′)/lub(Q) }

to help quantify the difficulty of query answering.

⁶https://dblp.org/xml/
⁷https://www.gnu.org/software/glpk/

5.2 Experimental Results

We report our findings concerning the performance and effectiveness of our algorithms.

(a) procp (b) order and ρ = 10% (c) dblp and ρ = 10% (d) dblp and ρ = 20%
Figure 2: The accuracy of approximation algorithms

Accuracy. We first show the accuracy of BL-LP, TE-LP and et-TE-LP by varying the number of tuples and the noise rate. The x-axis is the number of tuples. The left y-axis is the cost of approximate S-repairs, and the right y-axis is the percentage of triads or quasi-Turán clusters processed during the procedure. In Figure 2, we ran them together with the upper bound L2VC on datasets consisting of 10K to 40K tuples, in steps of 2K, with the noise rate ρ ranging from 5% to 20%. TE-LP and et-TE-LP outperformed BL-LP and L2VC on PROCP, i.e., the gap between them is larger than in the other cases. This is also consistent with the theoretical result: their ratio bounds are less than 3/2, which is much better than 2; recall that Σ1 contains two functional dependencies. In theory, TE-LP and et-TE-LP should not perform as well on PROCP with Σ1, since Σ2 and Σ3 contain more constraints. However, the theoretical results we proved are worst-case approximation ratios. In our experiments, the accuracy was not necessarily better when the FD set Σ is smaller, because the accuracy is also affected by other factors beyond our control, such as the ratio of the number of 1-variables to the number of 1/2-variables in the solution returned by the LP solver. Nevertheless, on average, TE-LP and et-TE-LP outperformed BL-LP and

L2VC in our experiments, which is evident from their average distances to BL-LP and L2VC.

TE-LP performed much better than L2VC and BL-LP on ORDER and DBLP, as shown in Figures 2(b), 2(c) and 2(d), since a large percentage of the noise was trimmed off as triads. et-TE-LP behaved better than L2VC and BL-LP whenever the percentage of disjoint quasi-Turán clusters was large. TE-LP and et-TE-LP did not do very well on PROCP, because the size of the greedy Σ-partition was always small, so the results were not improved much by trimming off triads or quasi-Turán clusters. Moreover, et-TE-LP was even better than TE-LP when the percentage of disjoint quasi-Turán clusters was large.
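The decision metric used in the effectiveness test can be sketched as follows. This is our own rendering of the two formulas from Section 5.1, with function and parameter names of our choosing.

```python
def estimated_lub(w_q, repair_cost, ratio):
    """lub~(Q) = W(Q) - C(J~)/r: the approximate repair cost is divided by
    the algorithm's ratio r so the estimate does not overshoot."""
    return w_q - repair_cost / ratio

def gap(lub_q, glb_q_prime):
    """gap(Q, Q') = max{lub(Q)/glb(Q'), glb(Q')/lub(Q)} >= 1; values near 1
    mean the two query results are hard to tell apart."""
    return max(lub_q / glb_q_prime, glb_q_prime / lub_q)
```

For example, with W(Q) = 120, repair cost 30 and ratio 1.5, the estimate is 100; a pair with lub(Q) = 100 and glb(Q′) = 80 has gap 1.25.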

(a) Gap(Q,Q′) around 1.2–1.3 (b) Gap(Q,Q′) < 1.12 (c) Gap(Q,Q′) > 1.5
Figure 4: Number of correct answers with ρ = 10%

Effectiveness. We evaluated the precision of our algorithms in helping to answer aggregation queries posed on inconsistent databases. The results for dataset PROCP are given in Figure 4. In the three figures, the x-axis is for the dataset and algorithm, and the y-axis is for precision. We set the noise rate at 10%. The three figures show that algorithms TE-LP and et-TE-LP are more effective on average. Specifically, Figures 4(c) and 4(a) imply that our algorithms with better ratios returned more dependable answers (i.e., higher precision) for those aggregation queries. When the gap is large (larger than 1.5 in Figure 4(c)), all the algorithms performed well, since it is easy to distinguish lub(Q) and glb(Q′). When the gap is not very small (around 1.2–1.3 in Figure 4(a)), both TE-LP and et-TE-LP outperformed BL-LP and L2VC by a clear margin. The reason is that the gap is nearly the approximation ratio of TE-LP and et-TE-LP but much lower than that of the other two. However, all the algorithms proposed in this paper have a low precision for those query pairs with a small gap (less than 1.12 in Figure 4(b)). In such cases, the gap is nearly 1, and it is hard to distinguish the values of lub(Q) and glb(Q′) by using anything short of an exact algorithm. Nevertheless, when the gap is not very small, TE-LP and et-TE-LP are always the current best choice for answering aggregation queries.

(a) procp (b) order and ρ = 10% (c) dblp and ρ = 5% (d) dblp and ρ = 15%
Figure 3: The running time of the algorithms

Efficiency. We evaluated the efficiency of our algorithms by varying the number of tuples and the noise rate. The results for the three datasets are shown in Figures 3(a)–(d). The x-axis is the number of tuples, the left y-axis is the running time in seconds, and the right y-axis is the percentage of triads or quasi-Turán clusters processed during the procedure. Since L2VC does not call the LP solver, its running time was lower than that of our algorithms, at the cost of accuracy.
In theory, et-TE-LP incurs the largest time cost to compute a good approximation, because it takes a lot of time to find h-quasi-Turán clusters for every possible h. In practice, it terminated after several iterations, once h was large enough, since there is almost no cluster with a large h value; therefore, et-TE-LP ran in time almost linear in that of TE-LP. Observe that et-TE-LP sometimes performed as efficiently as TE-LP; this is because the disjoint quasi-Turán clusters cover most of the inconsistencies. TE-LP took nearly the same time as BL-LP to compute a good approximate S-repair, since the trimming is very fast and the triads cover most of the inconsistencies. In addition, as shown in Figure 3(b), TE-LP and et-TE-LP saved more time when ρ = 10%, since the percentages of triads and quasi-Turán clusters that could be processed are larger than in the cases with smaller ρ.

6. CONCLUSIONS

Optimal subset repairs are NP-hard to approximate within 17/16 for most of the polynomially intractable cases. For the other cases, approximating optimal subset repairs within 69246103/69246100 is also NP-hard. Optimal subset repairs can be approximated within a ratio of (2 − 1/2^{σ−1}) given σ functional dependencies, which is a constant better than 2. Our results give a way to compute optimal S-repairs efficiently with both theoretical and empirical guarantees. However, there is still a large gap between the upper bound and the lower bound derived in this paper. Tighter bounds on computing optimal S-repairs are suggested as future work.

7. REFERENCES
[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases: The Logical Level. Addison-Wesley, 1995.
[2] F. N. Afrati and P. G. Kolaitis. Repair checking in inconsistent databases: algorithms and complexity. In ICDT, pages 31–41, 2009.
[3] O. Amini, S. Pérennes, and I. Sau. Hardness and approximation of traffic grooming. Theor. Comput. Sci., 410(38-40):3751–3760, 2009.
[4] M. Arenas, L. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In PODS, pages 68–79, 1999.
[5] M. Arenas, L. Bertossi, and J. Chomicki. Answer sets for consistent query answering in inconsistent databases. Theor. Pract. Log. Prog., 3(4):393–424, 2003.
[6] M. Arenas, L. Bertossi, J. Chomicki, X. He, V. Raghavan, and J. Spinrad. Scalar aggregation in inconsistent databases. Theor. Comput. Sci., 296(3):405–434, 2003.
[7] A. Assadi, T. Milo, and S. Novgorodov. DANCE: data cleaning with constraints and experts. In ICDE, pages 1409–1410, 2017.
[8] R. Bar-Yehuda and S. Even. A linear-time approximation algorithm for the weighted vertex cover problem. J. Algorithms, 2(2):198–203, 1981.
[9] M. Bergman, T. Milo, S. Novgorodov, and W.-C. Tan. QOCO: a query oriented data cleaning system with oracles. PVLDB, 8(12):1900–1903, 2015.
[10] L. Bertossi. Database repairs and consistent query answering: origins and further developments. In PODS, pages 48–58, 2019.
[11] L. Bertossi. Repair-based degrees of database inconsistency. In LPNMR, pages 195–209, 2019.
[12] L. Bertossi, L. Bravo, E. Franconi, and A. Lopatenko. The complexity and approximation of fixing numerical attributes in databases under integrity constraints. Inform. Syst., 33(4):407–434, 2008.
[13] P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD, pages 143–154, 2005.
[14] P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, pages 746–755, 2007.
[15] N. Boria, F. D. Croce, and V. T. Paschos. On the max min vertex cover problem. Discrete Appl. Math., 196:62–71, 2015.
[16] M. Caniupán and L. Bertossi. The consistency extractor system: answer set programs for consistent query answering in databases. Data Knowl. Eng., 69(6):545–572, 2010.
[17] J. Chen, I. A. Kanj, and G. Xia. Improved upper bounds for vertex cover. Theor. Comput. Sci., 411(40):3736–3756, 2010.
[18] F. Chiang and R. J. Miller. A unified model for data and constraint repair. In ICDE, pages 446–457, 2011.
[19] J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance using tuple deletions. Inf. Comput., 197(1-2), 2005.
[20] X. Chu, I. F. Ilyas, S. Krishnan, and J. Wang. Data cleaning: overview and emerging challenges. In SIGMOD, pages 2201–2206, 2016.
[21] X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: putting violations into context. In ICDE, pages 458–469, 2013.
[22] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: consistency and accuracy. PVLDB, 7(6):315–325, 2007.
[23] P. Crescenzi. A short guide to approximation preserving reductions. In CCC, pages 262–273, 1997.
[24] M. Dallachiesa, A. Ebaid, A. Eldawy, A. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. NADEEF: a commodity data cleaning system. In SIGMOD, pages 541–552, 2013.
[25] C. De Sa, I. F. Ilyas, B. Kimelfeld, C. Ré, and T. Rekatsinas. A formal framework for probabilistic unclean databases. In ICDT, pages 26–28, 2019.
[26] A. A. Dixit. CAvSAT: a system for query answering over inconsistent databases. In SIGMOD, pages 1823–1825, 2019.
[27] A. A. Dixit and P. G. Kolaitis. A SAT-based system for consistent query answering. In SAT, pages 117–135, 2019.
[28] S. Flesca, F. Furfaro, and F. Parisi. Consistent query answers on numerical databases under aggregate constraints. In DBPL, pages 279–294, 2005.
[29] Gartner. Vendor Rating Service, accessed May 15, 2020.
[30] F. Geerts, G. Mecca, P. Papotti, and D. Santoro. The Llunatic data-cleaning framework. PVLDB, 6(9):625–636, 2013.
[31] L. Golab, I. F. Ilyas, G. Beskales, and A. Galiullin. On the relative trust between inconsistent data and inaccurate constraints. In ICDE, pages 541–552, 2013.
[32] V. Guruswami and S. Khot. Hardness of Max 3SAT with no mixed clauses. In CCC, pages 154–162, 2005.
[33] V. Kann. Maximum bounded 3-dimensional matching is MAX SNP-complete. Inform. Process. Lett., 37(1):27–35, 1991.
[34] G. Karakostas. A better approximation ratio for the vertex cover problem. ACM Trans. Algorithms, 5(4):41:1–41:8, 2009.
[35] S. Khot. On the unique games conjecture. In FOCS, page 3, 2005.
[36] S. Khot and O. Regev. Vertex cover might be hard to approximate to within 2−ε. J. Comput. Syst. Sci., 74(3):335–349, 2008.
[37] S. Kolahi and L. V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, pages 53–62, 2009.
[38] P. G. Kolaitis, E. Pema, and W.-C. Tan. Efficient querying of inconsistent databases with binary integer programming. PVLDB, 6(6):397–408, 2013.
[39] P. Koutris and J. Wijsen. Consistent query answering for self-join-free conjunctive queries under primary key constraints. ACM Trans. Database Syst., 42(2):1–45, 2017.
[40] E. Livshits, B. Kimelfeld, and S. Roy. Computing optimal repairs for functional dependencies. ACM Trans. Database Syst., 45(1):1–46, 2020.
[41] A. Lopatenko and L. Bertossi. Complexity of consistent query answering in databases under cardinality-based and incremental repair semantics. In ICDT, pages 179–193, 2007.
[42] A. Lopatenko and L. Bravo. Efficient approximation algorithms for repairing inconsistent databases. In ICDE, pages 216–225, 2007.
[43] G. L. Nemhauser and L. E. Trotter. Vertex packings: structural properties and algorithms. Math. Program., 8(4):232–248, 1975.
[44] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré. HoloClean: holistic data repairs with probabilistic inference. PVLDB, 10(11):1190–1201, 2017.
[45] B. Salimi, L. Rodriguez, B. Howe, and D. Suciu. Interventional fairness: causal database repair for algorithmic fairness. In SIGMOD, pages 793–810, 2019.
[46] J. Wijsen. On the consistent rewriting of conjunctive queries under primary key constraints. Inform. Syst., 34(7):578–601, 2009.
[47] J. Wijsen. Certain conjunctive query answering in first-order logic. ACM Trans. Database Syst., 37(2):1–35, 2012.
[48] J. Wijsen. Foundations of query answering on inconsistent databases. SIGMOD Rec., 48(3):6–16, 2019.
[49] M. Zehavi. Maximum minimal vertex cover parameterized by vertex cover. SIAM J. Discrete Math., 31(4):2440–2456, 2017.
