The Computation of Optimal Subset Repairs⇤
Dongjing Miao Zhipeng Cai Jianzhong Li Harbin Institute of Technology Georgia State University Harbin Institute of Technology [email protected] [email protected] [email protected] Xiangyu Gao Xianmin Liu Harbin Institute of Technology Harbin Institute of Technology [email protected] [email protected]
ABSTRACT In the research on managing inconsistencies, the notion of Computing an optimal subset repair of an inconsistent data- repair was first introduced decades ago [4]. Subset repair is base is becoming a standalone research problem and has a a popular type of repair [19, 2]. A subset repair of an incon- wide range of applications. However, it has not been well- sistent database I is a consistent subset of I,whichsatisfies studied yet. A tight inapproximability bound of the prob- all the given integrity constraints, obtained by performing lem computing optimal subset repairs is still unknown, and a minimal set of operations on I.Anoptimal subset repair there is still no existing algorithm with a constant approxi- is the subset repair with minimum cost, where the cost can mation factor better than two. In this paper, we prove a new be defined in di↵erent ways depending on the requirement tighter inapproximability bound of the problem computing of applications. optimal subset repairs. We show that it is usually NP-hard Nowadays, the computation of optimal subset repairs is to approximate it within a factor better than 17/16. An becoming a basic task in a wide-range of applications not 1 only data cleaning and repairing, but also many other prac- algorithm with an approximation ratio (2 1/2 )isde- veloped, where is the number of functional dependencies. tical scenarios. Thus, the computation of optimal subset It is the current best algorithm in terms of approximation repairs is becoming a standalone research problem and at- ratio. The ratio can be further improved if there are a tracted a lot of attention. In the following, we give three ex- large amount of quasi-Tur´an clusters in the input database. ample scenarios, in which the computation of optimal subset Plenty of experiments are conducted on real data to exam- repairs is a core task. ine the performance and the e↵ectiveness of the proposed Data repairing.Asdiscussedin[40,20,11],theben- approximation algorithms in real-world applications. efit of computing optimal subset repairs to data repairing is twofold. First, computing optimal subset repairs is con- PVLDB Reference Format: sidered an important task in automatic data repairing [22, Dongjing Miao, Zhipeng Cai, Jianzhong Li, Xiangyu Gao, Xian- 45, 25]. In those methods, the cost of a subset repair is min Liu. The Computation of Optimal Subset Repairs. PVLDB, the sum of the confidence of each deleted tuple. Therefore, 13(11): 2061-2074, 2020. an optimal subset repair is considered as the best repaired DOI: https://doi.org/10.14778/3407790.3407809 data. Second, optimal subset repairs are also used as the best recommendation to enlighten the human how to repair 1. INTRODUCTION data from scratch in semi-automatic data repairing meth- ods [7, 9, 24, 30]. In addition, the deletion cost to obtain an AdatabaseI is inconsistent if it violates some given in- optimal subset repair also can be used as a measurement to tegrity constraints, that is, I contains conflicts or inconsis- evaluate the database inconsistency degree [11]. tencies. For example, two tuples conflict with each other, if Consistent query answering. Optimal subset repairs they have the same city but di↵erent area code. The incon- are the key to answer aggregation queries, such as count, sistency has a serious influence on the quality of databases, sum and average, in inconsistent databases. For example, a and has been studied for a long time. consistent answer of a sum query is usually defined as the range between its greatest lower bound (glb)anditsleast ⇤Work funded by the National Natural Science Foundation upper bound (lub). Optimal subset repairs can help deriving of China (NSFC) Grant No. 61972110, 61832003, U1811461, and the National Key R&D Program of China Grant No. such two bounds. 2019YFB2101900. Consider the inconsistent database GreenVotes shown in Figure.1(a), due to functional dependency County,Date Tal ly, This work is licensed under the Creative Commons Attribution- ! NonCommercial-NoDerivatives 4.0 International License. To view a copy tuples s1 and s2 conflict with each other, hence, it has two of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For repairs J1 : s1,s3 and J2 : s2,s3 . Obviously, aggrega- { } { } any use beyond those covered by this license, obtain permission by emailing tion query Q1 has di↵erent answers, such as 554 in J1 but [email protected]. Copyright is held by the owner/author(s). Publication rights 731 in J2. A smart way to answer such queries, as defined licensed to the VLDB Endowment. in [6, 10], is to return the lub and glb which are 554 and 731. Proceedings of the VLDB Endowment, Vol. 13, No. 11 ISSN 2150-8097. Computing the lub is equivalent to computing optimal DOI: https://doi.org/10.14778/3407790.3407809 subset repairs since the lub is exactly the sum of weights of
2061 GreenVotes Procurement Plan Σ = {Vendor → Subsystem, Subsystem Part Vendor Vendor Rating Aggregation Q1: County Date Tally t1 #1 a C 3.3 Subsystem, Part → Vendor} s SELECT SUM(Tally) t2 #1 a J 5.0 1 A 11/07 453 SUM(Vendor Rating) t3 #1 b J 5.0 s2 A 11/07 630 FROM GreenVotes t4 #1 b M 4.1 Procurement Plan A (optimal) s3 B 11/07 101 CQA of Q : [554, 731] t5 #2 a C 3.3 Cost (rating loss): 13.7 (31.6 -17.9) 1 t 6 #2 a J 5.0 Procurement Plan B t7 #2 c C 3.3 16.6 (31.6 15.0) t8 #2 c E 4.6 Cost (rating loss): - BrownVotes Σ = {County→ Tally} Procurement Plan A (Optimal) Procurement Plan B Subsystem Part Vendor Vendor Rating Subsystem Part Vendor Vendor Rating County Date Tally t1 #1 a C 3.3 t1 #1 a C 3.3 r1 A 11/07 541 t2 #1 a J 5.0 t2 #1 a J 5.0 Aggregation Q2: t t r2 A 11/07 546 3 #1 b J 5.0 3 #1 b J 5.0 r triad SELECT SUM(Tally) t4 #1 b M 2.1 t4 #1 b M 2.1 3 A 11/07 560 t5 #2 a C 3.3 t5 #2 a C 3.3 FROM BrownVotes t t r4 A 11/07 550 6 #2 a J 5.0 6 #2 a J 5.0 r B 11/07 302 [843, 862] t7 #2 c C 3.3 t7 #2 c C 3.3 5 CQA of Q2 : t8 #2 c E 4.6 t8 #2 c E 4.6 (a) Consistent query answering of scalar aggregation (b) Refine the procurement plan when business constraints change Figure 1: Example applications of computing optimal subset repairs tuples in an optimal subset repair, J2, if the vote tallies are tional dependencies is NP-complete [6, 10] even if given two treated as tuple weights. functional dependencies. This result is too rough since it is On the other side, computing the glb can be speed up by just proved by a particular case. The most recent work in an optimal subset repair. Computing glb is equivalent to find [40] improves the complexity result. The boundary between a weighted maximum minimal vertex cover that is NP-hard polynomial intractable and tractable cases is clearly delin- 1/2 and cannot be approximated within o(n )[15]butcan eated, such that the problem of computing optimal subset be solved in O(2knc) time [49], where k equals to the cost of repairs cannot be approximated within some small constant an optimal subset repair, and c>1. If k is clear in advance, unless P=NP, if the given constraint set can be simplified we will be able to decide to run the exact algorithm and tol- as one of6 four specific cases, otherwise it is polynomial solv- erate exponential running time, or to run an approximation able. Other complexity results on computing optimal repairs algorithm and tolerate the unbounded accuracy loss. exist, such as [22], [37], [12], [41], [28]. They are either re- Moreover, when the two ranges are computed, one can stricted to repairs obtained by value modifications instead safely conclude that Brown won the election, since the max- of tuple deletions or constraints stronger than functional imum vote tally for Green is smaller than the minimum vote dependencies. Complexity results on repair checking exists, tally for Brown. such as [19], [2]. Repair checking is to decide if a given Optimization. Let us consider an example raised in mar- database is a repair of another database. However, it is a keting as shown in Figure.1(b). The procurement plan in problem much easier than computing optimal subset repairs. Figure.1(b) lists all the current vendors for all the parts, Due to their restrictions, these results cannot help deriving and each vendor has a rating representing its production ca- tight complexity results for the problem of computing opti- pacity, after-sales service, supply price and so on [29]. Note mal subset repairs. Therefore, a tight lower bound remains that, all the tuples in the list are correct and consistent. unknown so far. However, the business consideration recently changes. First, Algorithm aspects. When the database schema and the a vendor is no longer allowed to take part in two or more number of functional dependencies are both unbounded, com- subsystems for protection of commercial secrets. Second, puting optimal subset repairs is equivalent to the weighted each part should be supplied from the same vendor for con- minimum vertex cover problem. Algorithms for vertex cover venience of maintenance. For now, we need to refine the can be applied directly to compute optimal subset repairs. plan to satisfy these requirements but preserving high-rating However, even the best algorithm [34] can only give an ap- vendors as possible. proximation with ratio 2 if the data is large enough. For- The problem of refining the procurement plan can be tunately, as shown in the applications above, the relation transformed into the problem of computing optimal subset schema is fixed in practice, and the number of constraints repairs. First, the business requirements can be written as is bounded by some constant. In such context, the problem the following two functional dependencies is no longer the general vertex cover problem, but it is not Vendor Subsystem, clear whether the problem can be better approximated. Subsystem,Part ! Vendor. Although a large body of research on data repairing con- ! cerned with the computation of an optimal repair, they do Then, an optimal subset repair is the best refined plan if not tell us if we could do better in terms of approximation we treat vendor rating as tuple weight. For example, in ratio for computing optimal subset repairs. Approximation Figure.1(b), plan A is an optimal subset repair and has a algorithms [13, 42, 37] focus on computing repairs obtained total rating 17.9, which is much better than another plan B. by value modification rather than subset repairs. Heuristic In fact, there are many other applications of optimal sub- methods [13, 22, 21] and probabilistic methods [44, 25] are set repairs, but we can not detail any more here due to space unable to provide theoretical guarantee on approximation limitation. As we can see from these applications, comput- ratio. SAT-solver based methods [24] su↵er an exponential ing optimal subset repairs is becoming a standalone prob- time complexity. Other methods [18, 31] try to modify both lem. However, it has not been well-studied yet, especially data and constraints, which are not able to return an opti- on the complexity and algorithms. mal subset repair that we need. Complexity aspects. It is already shown that the problem of computing optimal subset repair with respect to func-
2062 In addition, as we mentioned above, the notion subset re- case of functional dependencies are key constraints [1]. Con- pair was first proposed in the research on consistent query ditional functional dependencies [14] generalize functional answering. However, the series of works on consistent query dependencies by conditioning requirements to specific data answering also cannot provide a better approximation to an values using pattern tableaux. In particular, a single tuple optimal subset repair. Their goal is to derive a certain an- may violate a conditional functional dependency. swers for given queries by finding the answers held in all re- In the research on computing optimal subset repairs, func- pairs [4, 48]. There are three approaches to achieve the goal. tional dependency is commonly used integrity constraint [6, The first is the answer set programming based approaches [5, 40]. For convenience of discussion, this paper also takes 16]. The second is the query rewriting based approaches [46, functional dependency as the integrity constraint like [40]. 47]. The third is the SAT/LP-solver based approaches[38, Note that, it is quite straightforward to extend our results 39, 27, 26]. All the approaches try to avoid computing ex- to the other types of integrity constraints mentioned above. ponentially many repairs in the size of database. Let R be a relation schema, and I be an instance of R. In summary, the following problems for computing opti- Functional dependency.LetX and Y be two sets of at- mal subset repairs remain open. tributes of a relation R. A functional dependency ' is an Firstly, a tight lower bound is still unknown so that it is expression of the form X Y, which requires that there ! impossible to decide whether an algorithm for computing is no pair of tuples t1,t2 in any instance I of R such that optimal subset repairs is optimal in terms of approximation t1[X]=t2[X]butt1[Y] = t2[Y]. ratio. Inconsistent database6 .Let⌃ be a set of functional depen- Secondly, it is unknown whether we can approximate op- dencies for relation schema R and I be an instance of R. I timal subset repairs within a constant better than 2, which does not satisfy a given functional dependency ' : X Y, ! has a very important practical significance. denoted by I 2 ', if there are two tuples t1,t2 I such that 2 This paper focuses on the two open problems, and the t1[X]=t2[X]butt1[Y] = t2[Y]. main contributions are as follows. An instance I of R is6 said to be consistent with respect Firstly, a tighter lower bound of the problem of com- to ⌃,denotedbyI = ⌃,ifI satisfies every functional de- puting optimal subset repair is obtained. Concretely, it is pendencies in ⌃. Otherwise,| I is inconsistent,denotedby proved that the problem of approximating optimal subset I 2 ⌃, i.e., ' ⌃ such that I 2 '. repairs within 17/16 is NP-hard for most of the polyno- 9 2 mial intractable cases. For the other cases, it is proved that 2.2 Subset Repairs the problem of approximating optimal subset repairs within Let ⌃ be a set of functional dependencies over R.Acon- 69246103/69246100 is also NP-hard. sistent subset of I with respect to ⌃ is a subset J I such Secondly, an algorithm with an approximation ratio (2 that J = ⌃.Asubset repair (a.k.a, S-repair)isaconsistent✓ 1 | 1/2 ) is developed, where is the number of given func- subset that is not strictly contained in any other consistent tional dependencies. This ratio is a constant better than 2. subset. Formally, a consistent subset J of I is a subset repair It is the current best algorithm in terms of approximation of I with respect to ⌃ if J K = J for any other consistent ratio. subset K of I. \ 6 Thirdly, to further improve the approximation ratio, the Before defining the cost of a subset repair, we need to dis- quasi-Tur´an cluster is proposed for extending the algorithm cuss the weight of each tuple. The weight of a tuple ti is a above. Based on the quasi-Tur´an cluster, an approximation non-negative number, denoted by wi, representing some im- algorithm for computing the subset repairs with ratio (2 portant information concerned by applications. The weight 1 1/2 ✏) is designed, where the factor ✏ 0dependson is defined in di↵erent ways for di↵erent applications. The the characteristic of input database. followings are some examples. In the context of data repair- Fourthly, plenty of experiments are conducted on real data ing, the weight of a tuple usually represents the confidence to examine the performance and the e↵ectiveness of the pro- of its correctness. In the context of answering aggregation posed approximation algorithms in real-world applications. queries, the weight of a tuple usually represents its contri- The following parts of this paper is organized as follows. bution to query results. In the context of optimizations, the Necessary conceptions and definition of the problem of com- weight of a tuple represents the benefit of a tuple. puting subset repairs are formally stated in Section 2. The Based on the tuple weights, the cost of a subset repair complexity analysis of the problem is shown in Section 3. J I is defined as The approximation algorithms are given in Section 4. Ex- ✓ periment results are illustrated in Section 5. At last, Section C(J, I)= wi, t I J 6 concludes the paper. iX2 \ which represents the cost of obtaining a subset repair. Note that the cost of a subset repair is I J ,whichisthedistance 2. PROBLEM STATEMENT | \ | Concepts used in database theory and the formal defini- between J and I if every tuple has weight 1. This kind of tion of the problem are given in this section. cost is commonly used in data quality management [11]. Optimal subset repair. A subset repair J I is an optimal 2.1 Integrity Constraints subset repair of I, optimal S-repair for short,✓ if The consistency of databases is usually guaranteed by in- C(J ⇤,I)=min C(J, I):J I,⌃ , tegrity constraints. The languages for specifying integrity { 2S } constraints can get quite involved. For example, functional where I,⌃ is the set of all subset repairs of instance I. dependency [1] is the most popular type of integrity con- For example,S in Figure.1(b), both plan A and B are S- straint and has a long history. A large body of work on repairs of procurement plan. A is an optimal S-repair since managing data quality is built on it. An important special it has a minimum cost 13.7.
2063 Case 2. There is a fact-wise reduction from ⌃AB C B 2.3 Problem Definition and Remarks ! ! to ⌃, and it is NP-hard to compute a (57/56 ✏)-optimal The problem of computing optimal subset repairs is for- mally defined as follows. S-repair. Case 3. There is a fact-wise reduction from ⌃AB AC BC $ $ Definition 1. (OPTSR) Let R be a fixed relation schema to ⌃, and it is NP-hard to compute a (1+✏)-optimal S-repair and ⌃ be a fixed set of functional dependencies over R.For for a certain ✏>0. any instance I of R,theproblemof OPTSR is to compute Therefore, the lower bound of problem OPTSR can be im- an optimal subset repair J ⇤ of I with respect to ⌃. proved by finding a tighter lower bound for the four special FD sets. Complexity. In practice, compared with the data size, the In this section, we derive a tighter lower bound for case 1 number of attributes in the relation is very small, and so is and 2. Moreover, instead of the lengthy proof, we provide a the number of functional dependencies. This motivates us to much conciser proof where all the reductions are built from adopt data complexity as the measure to perform the com- the problem MAX-NM-E3SAT [32]. For case 3, we carefully plexity analysis of OPTSR. In this way, the relation schema merge four previous reductions to derive the bound, which R and the functional dependency set ⌃ are fixed in advance, is not shown explicitly in previous research. and thus, the number of attributes in R and the number of functional dependencies can be considered as constants. The Lemma 1. Let FD set be one of ⌃A B C , ⌃A B C and ! ! ! instance I of R is considered as the input of OPTSR. This ⌃AB C B .Givenan✏>0,itisNP-hardtocomputea ! ! paper focus on the data complexity. (17/16 ✏)-optimal S-repair for OPTSR even if every tuple Approximation.OPTSR is NP-hard under most settings of in the instance has weight 1. R and ⌃. This paper will develop approximation algorithms Proof. We here show three similar reductions. All of for solving the problem OPTSR. For later use, the approximation of optimal subset repairs them are built from problem MAX-NM-E3SAT [32] which cannot be approximated better than 7/8. Note that every is explicitly defined here. Let J ⇤ be an optimal subset repair of I.Foraconstantk 1, a subset repair J is said to be a input expression of problem MAX-NM-E3SAT is subject k-optimal S-repair if to the following restrictions, (i) each clause has exactly three variables, (ii)eachclauseismonotone as either “x + y + z” C(J ⇤,I) C(J, I) k C(J ⇤,I) or “¯x +¯y +¯z”, and (iii) any variable x does not occur more · than once in each clause. In particular, an optimal S-repair of any instance I is a 1- Specifically, each reduction builds an instance I of the optimal S-repair of I. relation schema R(A, B, C), in which every tuple has weight For example, in Figure.1(b), plan B is a 1.5-optimal S- 1, for each expression of MAX-NM-E3SAT. repair of procurement plan since its cost 16.6 1.5 13.7. ⇥ Reduction for ⌃A B C . First, two tuples (j, xj ,xj )and ! ! 3. COMPLEXITY OF PROBLEM OPTSR (j, xj , x¯j ) are inserted into I for each variable xi.Second, a tuple is inserted into I for each clause ci and each variable In this section, we prove a tighter lower bound of approx- xj in it as follows. PT imation ratio for O SR problem. In the rest of the paper, 1. If ci contains a positive literal of variable xj , insert we refer the lower bound of approximation ratio and the (ci,xj ,xj )intoI , approximation ratio as lower bound and ratio respectively. 2. If ci contains a negative literal of variable xj , insert As shown in [40], it is possible to decide for any input FD (ci,xj , x¯j )intoI . set ⌃ if OPTSR is polynomially solvable using the procedure Suppose there are n variables and m clauses in ,then OSRSucceed.Basically,OSRSucceed iteratively reduces the exactly 2n+3m tuples are created in total. Intuitively, A left hand side of each FD in ⌃ according to three simple B guarantees that exactly one of its three corresponding! rules. It terminates in polynomial time until any FD in tuples survives in any S-repair once a clause is satisfied. ⌃ cannot be further simplified, and returns true if there B C guarantees that the assignment of each variable are only FDs of the form X left. It is proved that ! ;! should be valid, i.e., either true or false, but not both. OPTSR is polynomially solvable if OSRSucceed(⌃) returns Reduction for ⌃A B C . First, two tuples (j, xj ,xj )and true. Otherwise, ⌃ makes OPTSR polynomially unsolvable, ! (j, x¯ ,x ) are inserted into I for each variable x .Second, and there is a fact-wise reduction1 from one of the following j j i a tuple is inserted into I for each clause c and each variable four special FD sets to ⌃, i xj in it as follows. ⌃A B C = A B,B C , ! ! { ! ! } 1. If ci contains a positive literal of variable xj , insert ⌃A B C = A B,C B , ! { ! ! } (ci,xj ,xj )intoI , ⌃AB C B = AB C, C B , ! ! { ! ! } 2. If ci contains a negative literal of variable xj , insert ⌃AB AC BC = AB C, AC B,BC A . $ $ { ! ! ! } (c , x¯ ,x )intoI . That is, if OPTSR is NP-hard for a given ⌃, then one of i j j the following cases must happen, Suppose contains n variables and m clauses, then ex- actly 2n +3m tuples are created in total. Similarly, A B Case 1. There is a fact-wise reduction from ⌃A B C or ! ! ! guarantees that exactly one of its three corresponding tuples ⌃A B C to ⌃, and it is NP-hard to compute a (67/66 ✏)- ! optimal S-repair. survives in any S-repair once a clause is satisfied. C B guarantees the consistency of variable assignment. ! 1Fact-wise reduction is a kind of strict reduction that is for- Reduction for ⌃AB C B . First, two tuples (xi, 1,xi)and mally defined in [40]. For any ✏>0, if problem B has ! ! a(1+✏)-approximation, then problem A has a (1 + ✏)- (xi, 0,xi) are inserted into I for each variable xi.Second,a approximation whenever there is a fact-wise reduction from tuple is inserted into I for each clause ci and each variable AtoB. xj in it as follows,
2064 1. If ci contains a positive literal of variable xj , insert Apply this fact in the right hand of inequality (1), the fol- (ci, 1,xj )intoI , lowing inequality holds 2. If ci contains a negative literal of variable xj , insert I J ⇤ 2 I n 3 (ci, 0,xj )intoI . | \ | | | =2+ I . Suppose there are n variables and m clauses in ,then I I J ⇤ n I 2n | | 2 | | | \ | | | n exactly 2n +3m tuples are created in total. AB C guar- ! Since I =2n +3m>2n,weget antees that exactly one of the three tuples survives in any | | S-repair once the corresponding clause is satisfied. C B I J ⇤ guarantees the consistency of variable assignment. ! | \ | > 2. (2) I I J ⇤ n Deriving lower bound. We here only show the process of | | | \ | Apply inequality (2) into inequality (1), then deriving the lower bound for ⌃AB C B . The processes of ! ! deriving other two are similar. Let be an input expression ( ) of problem MAX-NM-E3SAT, and I be the instance data N > 3 2k max( ) built by our reduction. N The functional dependency AB C guarantees that any This inequality implies the problem MAX-NM-E3SAT can S-repair J of I contains at most one! of the three tuples with be approximated with ratio 3 2k if a k-optimal S-repair the same value ci on attribute A, where 1 i m.The can be computed in polynomial time. functional dependency C B guarantees that any S-repair Suppose k<17/16, then 3 2k>7/8, which implies the ! J of I contains either (ci, 1,xj )or(ci, 0,xj ) for 1 i m MAX-NM-E3SAT problem admits an approximation with and 1 j n. ratio better than 7/8. This contraries to the hardness result Therefore, we can derive a variable assignment ⌧ from obtained in [32]. Therefore, k 17/16. every S-repair J, Finally, the same bound can be derived in the same way for ⌃A B C and ⌃A B C . Then the lemma follows im- ! ! ! 1, if (xi, 1,xi) J, mediately. s.t. ⌧ (x )= i 0, otherwise. 2 ⇢ The following lemma derives the explicit lower bound for ⌧( )isvalidsuchthat⌧(xi) either 0 or 1, but not both, for ⌃AB AC BC. each· 1 i n. $ $ Let J ⇤ be an optimal S-repair. It can be proved that Lemma 2. For ⌃AB AC BC,itisNP-hardtocompute $ $ I J ⇤ does not contain both (xi, 1,xi)and(xi, 0,xi)for a (69246103/69246100 ✏)-optimal S-repair for any ✏>0, \ each variable xi. Since if it is not true, there must be some i even if every tuple in the instance has weight 1. such that both (xi, 1,xi)and(xi, 0,xi) are in I J ⇤.Wecan \ Proof. By merging the following ↵, -reductions given pick one of them and insert it into J ⇤ without resulting in in previous literatures, L inconsistency. Then, an S-repair J larger than J ⇤ is found, i.e., C(J, I ) >C(J ⇤,I ). This is a contradiction. MAX B29-3SAT 529,1 3DM [33] 3DM 1,1 MAX 3SC [33] Let ⌧ be an arbitrary valid variable assignment, and ( ) N MAX 3SC 55,1 MECT-B [3] be the number of clauses in satisfied by ⌧.Let⌧max be an MECT-B 7 ,1 OPTSR(R, ⌃AB AC BC)[40] optimal variable assignment, and max( )bethenumberof 6 $ $ N clauses in satisfied by ⌧max.Thus,wehave we can conclude that MAX B29-3SAT can be approxi- mated within 680/679 if a (69246103/69246100 ✏)-optimal max( )= I I J ⇤ n N | | | \ | S-repair can be computed in polynomial time for any ✏>0 when ⌃ is ⌃AB AC BC, which is contrary to the hardness and for any solution J of I , $ $ result shown in [23].
( ) I I J n. N | | | \ | Based on LEMMA 1 and 2, the following theorem gives a
Additionally, we have I =2n +3m. new complexity result. Let k>1andJ be| a k|-optimal S-repair of I such that Theorem 1. Given a fixed schema R and a fixed FD set ⌃,forany✏>0, I J ⇤ I J k I J ⇤ , | \ || \ | ·| \ | then we have 1. If OSRSucceed(⌃) returns true,thenanoptimalS- repair can be computed in polynomial time. ( ) I I J n N | | | \ | max( ) I I J ⇤ n 2. If OSRSucceed(⌃) returns false,andthereisafact- N | | | \ | wise reduction from ⌃AB AC BC to ⌃,thenitisNP- I k I J ⇤ n $ $ | | ·| \ | hard to compute a (69246103/69246100 ✏)-optimal I I J n | | | \ ⇤| S-repair. I J ⇤ =1+(1 k) | \ | . (1) · I I J ⇤ n 3. Otherwise, it is NP-hard to compute a (17/16 ✏)- | | | \ | optimal S-repair. Since each clause has exactly three literals, we have
I 2n Proof. Case 1 has been proved by [40]. Case 2 is proved I J ⇤ n +2 | | . | \ | · 3 by LEMMA 2. Case 3 is proved by LEMMA 1.
2065 In Figure.1(b), the two functional dependencies form the Let x⇤ be the optimal solution of ILP,andx0 be the M case ⌃AB C B . Our result implies that no polynomial algo- optimal solution of HIP,then ! ! rithm could compute a solution with a constant ratio better M wix0 wix⇤ (3) than 17/16 for the case in Figure.1(b). i i ti I ti I Relation with Vertex Cover. In general, OPTSR is equiva- X2 X2 lent to the weighted minimum vertex cover problem (WMVC) 4.1.2 ⌃-Partition if the schema is not fixed and the number of functional To obtain a good approximation from the HIP,weneed dependencies is unbounded. Therefore, approximation al- M gorithms for WMVC could be directly applied to comput- first to partition the input I with respect to ⌃,andthen ing approximations of optimal S-repairs. On the other side, do the rounding based on it. Intuitively, all the tuples of the current best known approximation ratio for WMVC is I are grouped into non-empty consistent subsets in such a (2 ⇥(1/plog n)) [34]. However, this is a ratio depending way that every tuple is in exactly one consistent subset. on the size of input. In fact, there is no polynomial time A ⌃-partition of instance I,say (I,⌃), is a collection of P (2 ✏)-approximation algorithm for WMVC [36] and ✏>0 disjoint consistent subsets of I such that if the unique game conjecture [35] is true. This rules out - K (I,⌃),K = ⌃, 8 2P | the possibility to compute a (2 ✏)-optimal S-repair for any - K (I,⌃) K = I, constant ✏>0 unless unique game conjecture is false. - K 2,KP (I,⌃),K K = . S i j i j The8 size of2 aP⌃-partition\ is the number; of sets in it. The 4. APPROXIMATION ALGORITHMS smallest ⌃-partition is preferred to do the rounding. Un- fortunately, the size of a ⌃-partition may be as large as n. Three approximation algorithms are developed for OPTSR When all the tuples in I are pairwise conflicting with each in this section. other, the size of any ⌃-partition is n. Therefore, the size of The first algorithm, named BL-LP, is a baseline algorithm. a ⌃-partition cannot be bounded by a constant. It returns a (2 1/n)-optimal S-repair. This approximation ratio dependents on the size of input data. We use greedy search to find a ⌃-partition. For each tuple t I, the greedy search works as follows. The second algorithm, named TE-LP, is an improved al- 2 1 1. Find an existing consistent set K such that K gorithm. It returns a (2 1/2 )-optimal S-repair, where 2P [ is the number of functional dependencies in ⌃ and it is a t = ⌃ and put t into K. { }|2. If no such K can be found, create a new set K = t constant. { } The third algorithm, named et-TE-LP,isanextensionof and (I,⌃)= (I,⌃) K. 1 P P [ algorithm TE-LP. It returns (2 1/2 ✏)-optimal S-repair 3. Until all the tuples are picked into a consistent set, a ⌃-partition (I,⌃)isobtained. where ✏ 0 is still a factor independent of the data size. It P outperforms TE-LP when there are a large amount of quasi- ⌃ Tur´an clusters in the input database. 4.1.3 Approximate via -Partition-Based Rounding Asolutionof HIP can be obtained by a ⌃-partition- 4.1 Algorithm BL-LP based rounding.M When a ⌃-partition (I,⌃) is found, the P rounding of solution x0 of HIP is carried out as follows. We start from a half-integral linear programming formu- M lated for OPTSR, then show a basic rounding strategy to 1. Let xi = x0 if x0 =1/2. i i 6 obtain an approximation to optimal subset repairs, and fi- 2. Let xi =0ifxi0 =1/2andti K⇤ such that nally analyze the performance of the algorithm. 2 K⇤ arg max xi0 ti Kj xi0 =1/2 , Kj (I,⌃) { | 2 ^ } 4.1.1 A Half-Integral Linear Programming 2P For an input I of OPTSR, we define a 0-1 variable xi for otherwise, let xi =1. each tuple ti I as follows, 3. Output the S-repair J = ti : xi =0 . 2 { } 0, if t is in a feasible S-repair J, The baseline algorithm BL-LP based on the rounding method x = i i 1, otherwise, above is presented as follows. ⇢ OPTSR can be formulated as the following 0-1 integer Algorithm 1 BL-LP linear programming ILP, M Input: Instance I of R,FDset⌃,and minimize wixi ⌃-partition (I,⌃) ti I P 2 Output: Approximate optimal S-repair s. t. xi + xj 1, ti,tj I, ti,tj 2 ⌃, P 8 2 { } 1: Formulate the HIP with respect to I and ⌃; xi 0, 1 , ti I. M 2{ } 8 2 2: Solve HIP by LP-solver to obtain a solution x0; M We could routinely relax ILP to a linear program where 3: Let J ; M n ; any solution x takes value from [0, 1] . Furthermore, every 4: Find K⇤ from (I,⌃)suchthat extreme point of the linear program relaxation takes value P from 0, 1/2, 1 n as shown in [43]. This gives the following K⇤ arg max xi0 ti Kj xi0 =1/2 ; Kj (I,⌃) { | 2 ^ } { } 2P half-integer programming HIP, M 5: for each ti I do minimize wixi 2 ti I 6: if x0 =0or(x0 =1/2andti K⇤) then 2 i i 2 s. t. xPi + xj 1, ti,tj I, ti,tj 2 ⌃, 7: J J ti ; 8 2 { } 8: return J [{ } xi 0, 1/2, 1 , ti I. 2{ } 8 2
2066 4.1.4 Performance Analysis of BL-LP 4.2.1 Overview of algorithm TE-LP The algorithm TE-LP consists of three main steps as shown Theorem 2. Let p 2 be the size of the ⌃-partition of in algorithm 2. inconsistent instance I.Algorithm BL-LP returns a (2 2/p)- 2 optimal S-repair in O(n ) time. Algorithm 2 TE-LP
Proof. We first prove J is an S-repair. Let x0 be the Input: Instance I of R,FDset⌃ Output: Approximate optimal S-repair J solution of HIP.Letti and tj be any two tuples such that M ti,tj 2 ⌃. Without loss of generality, we assume xi0 xj0 . 1: J ; { } ; Clearly, either x0 =1orx0 = x0 =1/2. 2: I0 TrimInstance(I,⌃); i i j When x0 =1,ti / J due to the rounding. 3: (I0, ⌃) FindPartition(I0, ⌃); i 2 P When x0 = x0 =1/2, the following four statements hold 4: J BL-LP (I0, ⌃, (I0, ⌃)); i j P due to the rounding. First, if ti,tj / K⇤, ti,tj / J.Second, 5: return J 2 2 if ti K⇤,tj / K⇤, tj / J. Third, if ti / K⇤,tj K⇤, 2 2 2 2 2 ti / J.Finally,ifti,tj K⇤, ti,tj / J. 2 2 2 Step 1 Trim instance. It first trims instance I by elim- Therefore, J is a consistent subset. Since x0 is the optimal inating all the triads, where a triad is a set of three tuples solution of HIP, J is maximal. Hence, J is an S-repair. M such that they are pairwise conflicting with each other. Now, we analyze the performance of BL-LP.Let Step 2 Find partition. After trimming o↵all the triads in I, I0 I is obtained whose ⌃-partition has a constant S1 = x0 : ti I,x0 =1 , ✓ { i 2 i } size. Algorithm TE-LP turns to compute such a small ⌃- S = x0 : ti I,x0 =1/2 and 1/2 { i 2 i } partition of I0. SK = x0 : ti K⇤,x0 =1/2 . ⇤ { i 2 i } Step 3 Approximate. Algorithm TE-LP runs BL-LP on the trimmed instance I0, and computes an S-repair with a Thus, J satisfies the following inequality, better ratio based on the small ⌃-partition. We next detail each step and prove the related theoretical
C(J, I)= wix0 +2 wix0 wix0 results. i 0 i i1 x S x S x S Xi0 2 1 02X1/2 02XK⇤ 4.2.2 Trim Instance by Triad Elimination @ A Triad.AnysubsetT = ti,tj ,tk I is a triad if wixi0 +2 wixi0 1/p wixi0 { }✓ 0 · 1 - wi =0,wj =0,wk =0, x S x S x S 6 6 6 i0 2 1 i0 2 1/2 i0 2 1/2 X X X - ' ⌃, ti,tj 2 ', ti,tk 2 ', tj ,tk 2 '. @ A 9 2 { } { } { } wix0 +(2 2/p) wix0 i · i For example, in Figure.1(a), r1,r2,r3 is a triad, and x S x S { } Xi0 2 1 i0 X2 1/2 there is no triad in Figure.1(b). Actually, any three of r1,r2,r3,r4 is a triad. (2 2/p) wix0 { } i Due to the definition of S-repair, we have the following x S S i0 2 X1[ 1/2 observation.
Let x⇤ be the optimal solution of ILP, J ⇤ be the optimal Observation 1. For any triad T I,anyS-repairJ of M S-repair. Due to inequality (3), we have I contains at most one tuple of T . ✓ By the observation, we can trim o↵all the triads of input wixi0 = wixi0 wixi⇤ = C(J ⇤,I) instance I without a significant loss of approximation accu- x S1 S ti I ti I i0 2 X[ 1/2 X2 X2 racy. Then the resultant instance has a small ⌃-partition. Therefore, C(J, I) (2 2/p) C(J ⇤,I). · 2 Algorithm 3 T rimInstance Algorithm BL-LP takes O( n ) time to formulate the HIP and find a ⌃-partition. Then, by calling the existingM fast Input: Instance I of R and FD set ⌃ LP-solver, it takes O(n)timetosolvethe HIP and do Output: Instance I0 with no triad in it the rounding. Therefore, the time complexityM of algorithm 1: H ; BL-LP is O(n2)since is a constant. 2: while ;there exists a triad in I do 3: find a triad T = ti,tj ,tk I; { }✓ Remarks.Thevalueofp in Theorem 2 depends on the 4: w(T ) min wi,wj ,wk ; value distribution over input instance I.Asmentionedabove, 5: for each z { i, j, k do} 2{ } p may be as large as n. Hence, algorithm BL-LP may returns 6: reweight tz such that wz wz w(T ); a(2 o(1))-optimal S-repair. We next show algorithm TE- 7: if wz =0then H H tz ; LP with a much better approximation ratio. [{ } 8: I0 I H; \ 4.2 Algorithm TE-LP 9: return I0 In this subsection, we develop an improved algorithm TE- 1 Trim Instance. As shown in algorithm 3, TrimInstance LP which returns a (2 1/2 )-optimal S-repair, where is the constant number of functional dependencies in ⌃. obtains a new instance I0 in the following steps. Step 1. Find a triad T I. Formally, let J ⇤ be an optimal S-repair, algorithm TE-LP ✓ 2 Step 2.SupposeT = t1,t2,t3 and w1 w2 w3. requires O(n ) time and returns a solution J such that { } Reweight each tuple by reducing w1, i.e., w1 0, w2 1 C(J ⇤,I) C(J, I) (2 1/2 ) C(J ⇤,I)(4) w2 w1, w3 w3 w1. ·
2067 Step 3. Remove tuples with weight 0 from instance I. Let J be an arbitrary consistent subset of I, J I(1) is (1) \ Step 4. Repeat steps 1-3 until no more triad. Instance still a consistent subset of I .Now, T1