Crowdsourced Truth Discovery in the Presence of Hierarchies for Knowledge Fusion
Woohwan Jung Younghoon Kim Kyuseok Shim∗ Seoul National University Hanyang University Seoul National University [email protected] [email protected] [email protected] ABSTRACT Table 1: Locations of tourist attractions Existing works for truth discovery in categorical data usually Object Source Claimed value assume that claimed values are mutually exclusive and only one Statue of Liberty UNESCO NY among them is correct. However, many claimed values are not Statue of Liberty Wikipedia Liberty Island mutually exclusive even for functional predicates due to their Statue of Liberty Arrangy LA hierarchical structures. Thus, we need to consider the hierarchi- Big Ben Quora Manchester cal structure to effectively estimate the trustworthiness of the Big Ben tripadvisor London sources and infer the truths. We propose a probabilistic model to utilize the hierarchical structures and an inference algorithm to find the truths. In addition, in the knowledge fusion, thestep Island’ do not conflict with each other. Thus, we can conclude of automatically extracting information from unstructured data that the Statue of Liberty stands on Liberty Island in NY. (e.g., text) generates a lot of false claims. To take advantages We also observed that many sources provide generalized val- of the human cognitive abilities in understanding unstructured ues in the real-life. Figure 1 shows the graph of the general- data, we utilize crowdsourcing to refine the result of the truth ized accuracy against the accuracy of the sources in the real-life discovery. We propose a task assignment algorithm to maximize datasets BirthPlaces and Heritages used for experiments in Sec- the accuracy of the inferred truths. The performance study with tion 5. The accuracy and the generalized accuracy of a source are real-life datasets confirms the effectiveness of our truth inference the proportions of the exactly correct values and hierarchically- and task assignment algorithms. correct values among all claimed values, respectively. If a source claims exactly correct values without generalization, it is located at the dotted diagonal line in the graph. This graph shows that 1 INTRODUCTION many sources in real-life datasets claim with generalized values Automatic construction of large-scale knowledge bases is very and each source has its own tendency of generalization when important for the communities of database and knowledge man- claiming values. agement. Knowledge fusion (KF) [8] is one of the methods used Most of the existing methods [7, 9, 30, 38, 39] simply regard the to automatically construct knowledge bases (a.k.a. knowledge generalized values of a correct value as incorrect. Thus, it causes harvesting). It collects the possibly conflicting values of objects a problem in estimating the reliabilities of sources. According to from data sources and applies truth discovery techniques for re- [8], 35% of the false negatives in the data fusion task are produced solving the conflicts in the collected values. Since the values are by ignoring such hierarchical structures. Note that there are many extracted from unstructured or semi-structured data, the col- publicly available hierarchies such as WordNet [32] and DBpedia lected information exhibits error-prone behavior. The goal of [1]. Thus, a truth discovery algorithm to incorporate hierarchies the truth discovery used in knowledge fusion is to infer the true is proposed in [2]. However, it does not consider the different value of each object from the noisy observed values retrieved tendencies of generalization and may lead to the degradation of from multiple information sources while simultaneously estimat- the accuracy. Another drawback is that it needs a threshold to ing the reliabilities of the sources. Two potential applications control the granularity of the estimated truth. of knowledge fusion are web source trustworthiness estimation We propose a novel probabilistic model to capture the different and data cleaning [10]. By utilizing truth discovery algorithms, generalization tendencies shown in Figure 1. Existing probabilis- we can evaluate the quality of web sources and find systematic tic models [7, 9, 30, 39] basically assume two interpretations of a errors in data curation by analyzing the identified wrong values. claimed value (i.e., correct and incorrect). By introducing three interpretations of a claimed value (i.e., exactly correct, hierarchi- arXiv:1904.10217v1 [cs.DB] 23 Apr 2019 Truth discovery with hierarchies: As pointed out in [6, 8, 25], cally correct, and incorrect), our proposed model represents the the extracted values can be hierarchically structured. In this case, generalization tendency and reliability of the sources. there may be multiple correct values in the hierarchy for an ob- ject even for functional predicates and we can utilize them to find the most specific correct value among the candidate values. For example, consider the three claimed values of ‘NY’, ‘Liberty Island’ and ‘LA’ about the location of the Statue of Liberty in Ta- ble 1. Because Liberty Island is an island in NY, ‘NY’ and ‘Liberty
∗Corresponding author
© 2019 Copyright held by the owner/author(s). Published in Proceedings of the 22nd International Conference on Extending Database Technology (EDBT), March 26-29, 2019, ISBN 978-3-89318-081-3 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0. Figure 1: Generalization tendencies of the sources Our contributions: The contributions of this paper are sum- marized below. • We propose a truth inference algorithm utilizing the hi- erarchical structures in claimed values. To the best of our knowledge, it is the first work which considers both the re- liabilities and the generalization tendencies of the sources. • To assign a task which will most improve the accuracy, we develop an incremental EM algorithm to estimate the ac- Figure 2: Crowdsourced truth discovery in KF curacy improvement for a task by considering the number of claimed values as well as the confidence distribution. We also devise an efficient task assignment algorithm for multiple crowd workers based on the quality measure. Crowdsourced truth discovery: It is reported in [8] that upto • We empirically show that the proposed algorithm outper- 96% of the false claims are made by extraction errors rather than forms the existing works with extensive experiments on by the sources themselves. Since crowdsourcing is an efficient real-life datasets. way to utilize human intelligence with low cost, it has been successfully applied in various areas of data integration such as 2 PRELIMINARIES schema matching [12], entity resolution [34], graph alignment In this section, we provide the definitions and the problem formu- [17] and truth discovery [39, 41]. Thus, we utilize crowdsourcing lation of crowdsourced truth discovery in the presence of hierarchy. to improve the accuracy of the truth discovery. It is essential in practice to minimize the cost of crowdsourc- 2.1 Definitions ing by assigning proper tasks to workers. A popular approach For the ease of presentation, we assume that we are interested for selecting queries in active learning is uncertainty sampling in a single attribute of objects although our algorithms can be [3, 18, 19, 39]. It asks a query to reduce the uncertainty of the con- easily generalized to find the truths of multiple attributes. Thus, fidences on the candidate values the most. However, it considers we use ‘the target attribute value of an object’ and ‘the value of only the uncertainty regardless of the accuracy improvement. an object’ interchangeably. QASCA algorithm [41] asks a query with the highest accuracy A source is a structured or unstructured database which con- improvement, but measures the improvement without consider- tains the information on target attribute values for a set of objects. ing the number of collected claimed values. It can be inaccurate In this paper, a source is a certain web page or website and a since an additional answer may be less informative for an object worker represents a human worker in crowdsourcing platforms. which already has many records and answers. The information of an object provided by a source or a worker is Assume that there are two candidate values of an object with called a claimed value. equal confidences. If only a few sources provide the claimed Definition 2.1. A record is a data describing the information values for the object, an additional answer from a crowd worker about an object from a source. A record on an object o from will significantly change the confidence distribution. Meanwhile, a source s is represented as a triple (o,s,vs ) where vs is the if hundreds of sources already provide the claimed values for o o claimed value of an object o collected from s. Similarly, if a worker the object, the influence of an additional answer is likely tobe w answers that the truth on an object o is vw , the answer is very little. Thus, we need to consider the number of collected o represented as (o,w,vw ). answers as well as the current confidence distribution. Based o on the observation, we develop a new method to estimate the Let So be the set of the sources which claimed a value on the increase of accuracy more precisely by considering the number object o and Vo be the set of candidate values collected from So. of collected records and answers. We also present an incremental Each worker in Wo answers a question about the object o by EM algorithm to quickly measure the accuracy improvement and selecting a value from Vo. propose a pruning technique to efficiently assign the tasks to In our problem setting, we assume that we have a hierarchy workers. tree H of the claimed values. If we are interested in an attribute H An overview of our truth discovery algorithm: By combin- related to locations (e.g., birthplace), would be a geographi- ing the proposed task assignment and truth inference algorithms, cal hierarchy with different levels of granularity (e.g., continent, we develop a novel crowdsourced truth discovery algorithm using country, city, etc.). We also assume that there is no answer with hierarchies. As illustrated in Figure 2, our algorithm consists of the value of the root in the hierarchy since it provides no in- two components: hierarchical truth inference and task assignment. formation at all (e.g., Earth as a birthplace). We summarize the The hierarchical truth inference algorithm finds the correct val- notations to be used in the paper in Table 2. ues from the conflicting values, which are collected from different Example 2.2. Consider the records in Table 1. Since the source sources and crowd workers, using hierarchies. The task assign- Wikipedia claims that the location of the Statue of Liberty is s ment algorithm distributes objects to the workers who are likely Liberty Island, it is represented by vo =‘Liberty Island’ where to increase the accuracy of the truth discovery the most. The o =‘Statue of Liberty’ and s =‘Wikipedia’. If a human worker proposed crowdsourced truth discovery algorithm repeatedly alter- ‘Emma Stone’ answered Big Ben is in London, it is represented w nates the truth inference and task assignment until the budget by vo =‘London’ where o =‘Big Ben’ and w =‘Emma Stone’. of crowdsourcing runs out. As discussed in [20], some workers answer slower than others and increase the latency. However, we 2.2 Problem Definition do not investigate how to reduce the latency in this work since Given a set of objects O and a hierarchy tree H, we define the we can utilize the techniques proposed in [13]. two subproblems of the crowdsourced truth discovery. Recall that Vo is the set of candidate values for an object o. Fur- thermore, we let Go(v) denote the set of candidate values which are ancestors of a value v except for the root in the hierarchy H. s There are three relationships between a claimed value vo and ∗ s ∗ s ∗ the truth vo: (1) vo = vo, (2) vo ∈ Go(vo) and (3) otherwise. Let ϕs = (ϕs,1, ϕs,2, ϕs,3) be the trustworthiness distribution of a Figure 3: A graphical model for truth inference source s where ϕs,i is the probability that a claimed value of the source s corresponds to the i-th relationship. In each relationship, Definition 2.3 (Hierarchical truth inference problem). For a set a claimed value is generated as follows: s ∗ of records R collected from the sources and a set of answers A • Case 1 (vo = vo): The source s provides the exact true ∗ from the workers, we find the most specific true value vo of each value with a probability ϕs,1. s ∗ object o ∈ O among the candidate values in Vo by using the • Case 2 (vo ∈ Go(vo)): The source s provides a general- s hierarchy H. ized true value vo with a probability ϕs,2. In this case, the claimed value is an ancestor of the truth v∗ in H. We as- Definition 2.4 (Task assignment problem). For each worker w o sume that the claimed value is uniformly selected from in a set of workers W , we select the top-k objects from O which G (v∗). are likely to increase the overall accuracy of the inferred truths o o • Case 3 (otherwise): The source s provides a wrong value the most by using the hierarchy H. s ∗ vo not even in Go(vo). The claimed value is uniformly We present a hierarchical truth inference algorithm in Sec- selected among the rest of the candidate values in Vo. tion 3 and a task assignment algorithm in Section 4. The probability distribution ϕs is an initially-unknown model parameter to be estimated in our inference algorithm. Accord- s 3 HIERARCHICAL TRUTH INFERENCE ingly, the probability of selecting an answer vo among the values V o For the hierarchical truth inference, we first model the trust- in o for an object is represented by s ∗ worthiness of sources and workers for a given hierarchy. Then, ϕ , if v = v , s 1 o o s ∗ ∗ s ∗ we propose a probabilistic model to describe the process of gen- P(vo |vo, ϕs ) = ϕs,2/|Go (vo )| if vo ∈ Go (vo ), (1) ∗ erating the set of records and the set of answers based on the ϕs,3/(|Vo | − |Go (vo )| − 1) otherwise. trustworthiness modeling. We next develop an inference algo- rithm to estimate the model parameters and determine the truths. For the prior of the distribution ϕs , we assume that it follows a Dirichlet distribution Dir(α), with a hyperparameter α = (α1, α2, 3.1 Our Generative Model α3), which is the conjugate prior of categorical distributions. Let OH be the set of objects who have an ancestor-descendant Our probabilistic graphical model in Figure 3 expresses the con- relationship in their candidate set. In practice, there may exist ditional dependence (represented by edges) between random some objects whose candidate values do not have an ancestor- variables (represented by nodes). While the previous works [5, descendant relationship. In this case, the probability of the sec- 15, 31, 35] assume that all sources and workers have their own ond case (i.e., ϕs,2) may be underestimated. Thus, if there is no reliabilities only, we assume that each source or worker has its ancestor-descendant relationship between the claimed values generalization tendency as well as reliability. We first describe about o (i.e., o < OH ), we assume that a source generates its s how sources and workers generate the claimed values based on claimed value vo with the following probability their trustworthiness. We next present the model for generating ( s ∗ s ∗ ϕs,1 + ϕs,2 if vo = vo , the true value. Finally, we provide the detailed generative process P(vo |vo, ϕs ) = (2) ϕs,3/(|Vo | − 1) otherwise. of our probabilistic model. ∗ w Model for source trustworthiness: For an object o, let vo be Model for worker trustworthiness: Let vo be the claimed s the truth and vo be the claimed value reported by a source s. value chosen by a worker w among the candidates in Vo for an object o. Similar to the model for source trustworthiness, we also assume the three relationships between a claimed value Table 2: Notations w ∗ w ∗ w ∗ vo and the truth vo: (1) vo = vo, (2) vo ∈ Go(vo) and (3) otherwise. Each worker w has its trustworthiness distribution Symbol Description ψw = (ψw,1,ψw,2,ψw,3) where ψw,i is the probability that an s A data source answer of the worker w corresponds to the i-th relationship. We w A crowd worker assume that the trustworthiness distribution is generated from s vo Claimed value from s about o Dir(β) with a hyperparameter β = (β1, β2, β3). w vo Claimed value from w about o Since it is difficult for the workers to be aware of the correct R Set of all records collected from the set of sources S answer for every object, a worker can refer to web sites to answer A Set of all answers collected from the set of workers W the question. In such a case, if there is a widespread misinforma- Vo Set of candidate values about o tion across multiple sources, the worker is also likely to respond
So Set of sources which post information about o with the incorrect information. Similar to [9, 30], we thus ex- Wo Set of workers who answered about o ploit the popularity of a value in Cases 2 and 3 to consider such Os Set of objects that source s provided a value dependency between sources and workers. O Set of objects that worker w answered to w ∗ w • Case 1 (vo = vo): The worker w provides the exact true Set of values in Vo which are ancestors of a value v Go (v) value with a probability ψw,1. except the root in the hierarchy H w ∗ • Case 2 (vo ∈ Go(vo)): The worker w provides a general- Do (v) Set of values in Vo which are descendants of v ized true value with a probability ψw,2. We assume that ϕs,1 ·µo,vs 1 o ψw,1 · µo,vw дo,s =Í s ∗ 1 o v∈V P(vo |vo =v, ϕs )·µo,v дo,w =Í w ∗ o v ∈V P(vo |vo =v,ψw )· µo,v s ∗ o ( | )· ϕs,2 ∗ P vo vo =v, ϕs µo,v Í s ·µ Í w · ( w | )· v v ∈Do (vo ) |Go (v)| o,v v∈Do (v ) ψw,2 Pop2 vo vo=v µo,v fo,s = s ∗ if o ∈ O o ∈ Í ′ ( | ′ )· ′ Í ( s | ∗ , )· H Í w ∗ if o OH v ∈Vo P vo vo =v , ϕs µo,v 2 v ∈Vo P vo vo=v ϕs µo,v v ∈V P(vo |vo=v,ψw )·µo,v дo,s = д2 = o w ∗ ϕs,2 ·µ s o,w ψ ·µ w v P(vo |vo =v,ψw )·µo,v o,vo w,2 o,vo f = Í ( s | ∗ )· otherwise Í w ∗ otherwise o,w Í w ∗ ′ v∈Vo P vo vo=v,ϕs µo,v v ∈V P(vo |vo=v,ψw )·µo,v ′∈ P(v |v =v ,ψw )·µo,v′ o v Vo o o w ∗ ϕs,3 Í w · ( | = )· Í s · v ∈¬Do (vo ) ψw,3 Pop3 vo vo v µo,v v ∈¬Do (v ) | − ( )|− µo,v 3 3 o Vo Go v 1 дo,w = w ∗ д , = Í P(v |v =v,ψ )· µ o s Í ( s | ∗ )· v ∈Vo o o w o,v v ∈Vo P vo vo =v, ϕs µo,v
Figure 4: E-step for the proposed truth inference algorithm
w the claimed value vo is selected according to the popu- ϕ = {ϕs |s ∈S},ψ = {ψw |w ∈W } and µ = {µo |o ∈O}. We propose an |{ | ∈ s }| w ∗ s s So,vo =v EM algorithm to find the maximum a posteriori (MAP) estimate larity Pop2(v |v ) = s ∗ which is the o o |{s |s ∈So,vo ∈Go (vo )}| w of the parameters in our model. proportion of the records whose claimed value is vo out of the records with generalized values of v∗. The maximum a posteriori (MAP) estimator: Recall that o s • Case 3 (otherwise): The claimed value is selected from the R = {(o,s,vo)} is the set of records from the sources and A = w ∗ {(o,w,vw )} is the set of answers from the workers. For every wrong values according to the popularity Pop3(vo |vo) = o s object o, each source s ∈ S and each worker w ∈ W generates |{s |s ∈So,vo =v }| o o s ∗ s ∗ . |{s |s ∈So,vo