
Knowledge Verification for Long-Tail Verticals


Furong Li†    Xin Luna Dong‡    Anno Langen§    Yang Li§
†National University of Singapore    ‡ § Inc.
[email protected]    [email protected]    {arl, ngli}@google.com

ABSTRACT

Collecting structured knowledge for real-world entities has become a critical task for many applications. A big gap between the knowledge in existing knowledge repositories and the knowledge in the real world is the knowledge on tail verticals (i.e., less popular domains). Such knowledge, though not necessarily globally popular, can be a personal hobby to many people and is thus collectively impactful. This paper studies the problem of knowledge verification for tail verticals; that is, deciding the correctness of a given triple. Through a comprehensive experimental study we answer the following questions. 1) Can we find evidence for tail knowledge from an extensive set of sources, including knowledge bases, the web, and query logs? 2) Can we judge the correctness of the triples based on the collected evidence? 3) How can we further improve knowledge verification on tail verticals? Our empirical study suggests a new knowledge-verification framework, which we call FACTY, that applies various kinds of evidence-collection techniques followed by knowledge fusion. FACTY can verify 50% of the (correct) tail knowledge with a precision of 84%, and it significantly outperforms state-of-the-art methods. A detailed error analysis on the obtained results suggests future research directions.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected].
Proceedings of the VLDB Endowment, Vol. 10, No. 11. Copyright 2017 VLDB Endowment 2150-8097/17/07.

1. INTRODUCTION

Collecting structured knowledge for real-world entities has become a critical task for many applications, such as semantic search, query answering and machine reading. Both academia and industry have spent considerable effort on constructing large-scale knowledge bases (KBs), such as YAGO [35], NELL [7], Knowledge Vault [11], DeepDive [30], DBpedia [1], Probase [41], Google Knowledge Graph [20], and Microsoft Satori [34].

Knowledge is usually stored as (subject, predicate, object) triples, where each triple states a fact about some entity. For example, the triple (Kobe Bryant, profession, basketball player) means that Kobe Bryant's profession is basketball player. Triples in knowledge bases are often organized into verticals, where each vertical describes a set of entities in the same domain sharing common attributes (i.e., predicates). For instance, the vertical of athletes contains triples regarding different athletes, and describes each athlete by profession, country, team, and so on. Verticals may have hierarchy; for example, basketball players is a sub-vertical of athletes.

A big gap between the knowledge in existing knowledge bases and the knowledge in the real world is the knowledge on tail verticals. Roughly speaking, a vertical is a tail vertical if its subject entities are not globally popular (for example, in terms of search query volume); the number of triples in a tail vertical is usually not huge (below millions). In contrast to head (popular) verticals such as movies and celebrities, examples of tail verticals include gym exercises, yoga poses, cheese varieties, and tomato varieties. Although existing knowledge bases contain billions of triples, their information on tail verticals is still limited. For instance, we found that in Freebase [4] about 40% of entities have no factual triples (i.e., triples that state some properties of an entity), but only triples about their names, types, and descriptions; the majority of such entities belong to some tail vertical. As another example, we manually collected triples for the four aforementioned tail verticals from up to three manually selected authoritative sources, and observed that in total only about 150 triples exist in Freebase, and the coverage of their subject entities is below 10%. Although a tail vertical may not be popular by itself, given the large number of tail verticals, they can be collectively impactful.

Collecting knowledge for tail verticals is hard. On the one hand, there can be millions of tail verticals and their attributes are highly diverse, so manual curation cannot scale. On the other hand, automatic extraction falls short both because we lack good training data, and because reconciliation (i.e., deciding if two mentions refer to the same entity) on tail entities can be error-prone. We thus tried a different approach: we identified a set of tail verticals and a few data sources for each vertical¹, and then asked the crowd to extract triples from these given sources through annotation tools and hand-crafted patterns [9]. Although the results are much cleaner than those from automatic extraction systems, there can still be remnant extraction errors [14] and imprecise information from the sources. Thus, it is critical to verify the correctness of the collected knowledge before populating knowledge bases.

¹ Vertical discovery and source selection are very important problems but are out of the scope of this paper.

There exist two approaches for knowledge verification. First, one can search the subject and object of a triple on the web, and then apply a classifier to decide if the triple is true based on the search results [25]. However, this approach obtains poor results on tail verticals: we can verify only 19% of the tail knowledge with a precision of 22% (i.e., for every 100 triples that we verified as true, only 22 are actually true) in our experiments. This is because tail knowledge is not globally popular on the web and search results can be very noisy. Another solution is to apply supervised knowledge extraction [15] on the web, and consider a triple as verified if it can be extracted.

Unfortunately, this solution usually leads to a low recall on tail verticals because it cannot extract any triple whose subject or object is unknown to existing knowledge bases.

In this paper we investigate a third approach that first leverages both search-based and extraction-based techniques to find supporting evidence for each triple, and subsequently predicts the correctness of each triple based on the evidence. Our investigation tries to answer the following questions:
• How can we find evidence for the tail triples, and what are the sources that we can use?
• Can we judge the correctness of the triples based on the collected evidence?
• How can we further improve knowledge verification on tail verticals?

This paper makes four contributions. First, we explored an extensive set of sources to collect supporting evidence for a triple. We start with existing knowledge bases, which provide highly reliable information but with limited coverage on tail verticals. We then expand our search space to the web, which has a much higher coverage but can be noisy. Further, we enrich the evidence by analysing search query logs, which reflect users' perspectives on the world. In total we tried seven approaches to extract evidence from these sources. Overall we found evidence for 60% of the correct triples, covering over 96% of the entities in the tail verticals we examined. However, there is evidence for wrong triples too. We provide a detailed study comparing the various sources and approaches that we used.

Second, we investigate how knowledge fusion [12] can be applied to distinguish correct triples from wrong ones based on the collected evidence. Knowledge fusion [12, 13] was recently invented to decide the correctness of an extracted triple based on the data sources that provide the triple and the extractors that obtain the triple. We tried both single-truth methods [10, 12] and multi-truth methods [32, 45]; the former assume that there is only one true value for an entity, while the latter allow the existence of multiple true values (e.g., a book can have multiple authors). Our experiments show that single-truth methods usually give the highest precision, and multi-truth methods usually lead to the highest recall. We then propose a hybrid approach that combines their strengths and thus balances precision and recall.

Third, this paper is the first to propose an end-to-end knowledge-verification framework that performs evidence collection followed by knowledge fusion. Our framework, which we call FACTY, can verify 50% of the (correct) tail knowledge with a precision of 84%, significantly better than existing approaches.

Finally, we conducted a detailed error analysis on the obtained results, and suggest several directions for improving both knowledge verification and knowledge curation in general.

The rest of the paper is organized as follows. Section 2 defines the problem and describes our experiment datasets. Section 3 describes how we collect evidence from different sources. Section 4 studies how knowledge fusion can be applied to decide the correctness of a triple. Section 5 presents results obtained by knowledge fusion. Section 6 compares our framework with existing knowledge-verification systems, and Section 7 discusses future directions. Section 8 reviews related work, and Section 9 concludes the paper.

2. PROBLEM DEFINITION

Triple, vertical. A triple is in the form of (subject, predicate, object), where subject is an entity, predicate is an attribute of the entity, and object is a value for the attribute of the entity. An object may be an entity, an arbitrary string, a number, and so on. A subject entity could have several values for an attribute, and there is a triple for each value. For example, Table 1 has three triples (t3-t5) regarding ice hockey equipment, one for each piece of equipment. We consider factual triples and say a triple is true if it conforms to the real world; for example, (ice hockey, equipment, stick) is true, while (ice hockey, equipment, board) is false. If a (subject, predicate) pair has only one true triple (e.g., date-of-birth), we call the case single-truth; otherwise, we call it multi-truth.

Table 1: Sample triples for the vertical Winter sports.
     subject        predicate   object
t1   skiing         equipment   boots
t2   snowboarding   equipment   board
t3   ice hockey     equipment   helmet
t4   ice hockey     equipment   stick
t5   ice hockey     equipment   neck guard
t6   ice hockey     venue       hockey rink
t7   skiing         venue       outdoor

A vertical is a collection of triples whose subjects are entities in the same domain and have a set of predicates in common. Table 1 exemplifies a small set of triples in the vertical Winter sports; it contains seven triples for three winter sports on two predicates.

We can now formally define the problem we study in this paper.

DEFINITION 2.1 (KNOWLEDGE VERIFICATION). Given a set T of triples in a vertical, knowledge verification decides if each triple in T is true. □

Experiment dataset. We experimented on four verticals: Cheese varieties, Tomato varieties, Gym exercises and Yoga poses. We chose these four verticals because they represent verticals with different characteristics. For each vertical, we manually collected triples from up to three carefully selected authoritative and comprehensive sources, and kept those whose correctness we can manually validate as true triples.

Then, for experimental purposes, we generate false triples as follows. Given a true triple, we consider triples that share the same subject and predicate, but have different objects, as its alternatives; for instance, (ice hockey, venue, outdoor) is an alternative of the triple t6 in Table 1. All alternatives that are not contained in our input are considered false; this is known as the Local Closed-World Assumption (LCWA) and is commonly used for generating negative examples for knowledge extraction and fusion [11]. More precisely, given a set T of true triples, for each triple t = (s, p, o) ∈ T, let O be the set of objects associated with p. We generate a false triple t′ = (s, p, o′) for each o′ ∈ O if t′ ∉ T. In our experiments, for a true triple, we generate at most 20 false triples by considering only popular objects for the predicate. The input of the knowledge-verification task then contains both the true triples and the false triples. We chose the top-20 popular objects because they typically occur more often on the web and thus increase the hardness of the problem. We also note that we found very few false negatives since the data sources we selected have high coverage for these domains.

Table 2 shows the statistics of the verticals, where each vertical contains hundreds of entities and thousands of true triples. The number of predicates ranges from 7 to 17. The ratio of false triples to true triples ranges from 5 to 14. In the first three verticals, the majority of the (subject, predicate) pairs have a single object as the true value, whereas in the Yoga vertical, most (subject, predicate) pairs have multiple truths.

Table 2: Statistics for each tail vertical.
Vertical   #entities   #preds   #true triples   #false triples   %multi-truth
Cheese     420         17       4,753           68,480           2.3%
Tomato     574         14       5,464           64,935           3.7%
Gym        931         7        7,114           48,348           7.4%
Yoga       123         10       1,826           9,759            59.0%
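To make the LCWA-based construction of the evaluation input concrete, the sketch below generates false triples as described above. It is a minimal illustration under our own simplifying assumptions (popularity is approximated by how often an object appears as a truth for the predicate), and the names generate_false_triples and top_k are ours, not the paper's.

```python
from collections import defaultdict, Counter

def generate_false_triples(true_triples, top_k=20):
    """Negative examples under the Local Closed-World Assumption: for each
    true triple (s, p, o), pair s and p with other popular objects observed
    for predicate p, and label the result false unless it is a known truth."""
    truths = set(true_triples)
    # Objects observed for each predicate, with rough popularity counts.
    objects_by_pred = defaultdict(Counter)
    for s, p, o in true_triples:
        objects_by_pred[p][o] += 1

    false_triples = []
    for s, p, o in true_triples:
        # Consider only the top-k popular objects for this predicate.
        for alt_obj, _count in objects_by_pred[p].most_common(top_k):
            candidate = (s, p, alt_obj)
            if candidate not in truths:
                false_triples.append(candidate)
    return false_triples

if __name__ == "__main__":
    truths = [("ice hockey", "equipment", "stick"),
              ("ice hockey", "equipment", "helmet"),
              ("skiing", "equipment", "boots")]
    for t in generate_false_triples(truths):
        print(t)   # e.g., ('ice hockey', 'equipment', 'boots') is labeled false
```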

3. COLLECTING EVIDENCE

We first try a simple approach for knowledge verification: consider any triple as correct if we can find some kind of evidence that supports the triple. In this section we answer the following questions: 1) Is this approach adequate? 2) How many true triples can we find evidence for (i.e., recall)? 3) Will we also find evidence for false triples (i.e., precision)? 4) Which sources contain rich evidence for tail knowledge and which extraction methods are effective? We explain where and how we collect evidence in Sections 3.1 and 3.2, and present results in Section 3.3.

3.1 Sources for evidence collection

We first define the concept of evidence. Evidence of a triple is a piece of information that supports the triple. Evidence can be a triple in a knowledge base, a sentence in a document, a row in a web table, and so on. For instance, the sentence "Besides ice skates and sticks, hockey players are usually equipped with..." found on a webpage is considered evidence for the triple (ice hockey, equipment, ice skates).

We consider three types of sources to find evidence for triples. First, we consider existing knowledge bases. As data in knowledge bases are structured and often of high quality, they should provide fairly reliable evidence. However, their coverage of tail knowledge can be low. Second, we consider the web, which contains rich data. However, web data are often unstructured and thus make evidence collection harder. There may also be errors on the web [13], reducing the trustworthiness of the evidence. Third, we consider query logs. As reported by previous research, 71% of search queries contain named entities [17]. The entities extracted from query logs can be different from those appearing in web documents, as query logs model the world from user perspectives [18].

3.2 Techniques for evidence collection

Ideally, the evidence of a triple should mention the subject, predicate and object in some way. Usually one can recognize the mentions of subjects or objects through string matching or entity linking [19]. However, it is hard to identify the mentions of a predicate, because predicates often appear in distinct forms (e.g., birthday vs. was born on), and sometimes do not appear at all (e.g., search queries may contain only subject and object). Explicitly looking for evidence containing a particular predicate usually leads to a low recall. Therefore, when collecting evidence, we only require subject and object matching, but relax on the predicate. Unsurprisingly, not requiring predicate matching causes errors; we compensate for this by recording the matching information in pattern (explained shortly) and leveraging it in knowledge fusion (Section 5).

To facilitate the matching between entities, we conduct reconciliation [19] to map all entities to a unified repository (Freebase in particular). More precisely, if the subject/object of a triple exists in Freebase, we record its entity ID (known as mid); otherwise, the subject/object remains in its raw form.

For book-keeping purposes, we record the provenance of each piece of evidence. We write ⟨url, system, pattern⟩ as provenance, where url is the webpage on which the evidence is found, system is the system that finds the evidence, and pattern is the pattern used in evidence discovery (e.g., a pattern for extracting sport equipment from texts can be "⟨sport⟩ players are equipped with ⟨noun⟩"). Note that url may be set to null if the evidence is not from a webpage (e.g., query logs), and pattern can be null if we are not aware of the pattern used to discover the evidence. As we show later, such provenance information helps significantly in knowledge fusion (Sections 4-5).

We next describe how we find evidence from each type of data source in detail.

3.2.1 Collecting evidence from knowledge bases

We consider two types of knowledge bases for evidence collection: manually curated KBs and automatically generated KBs.

Freebase: Freebase is a human-curated knowledge base consisting of 50M entities and 1,500 verticals². All subjects in the triples are entity IDs, and objects are either entity IDs or values from a fixed domain (e.g., date, number). The predicates come from a pre-defined ontology.

² https://developers.google.com/freebase/data

Given an input triple t = (s, p, o), we consider a Freebase triple (s′, p′, o′) as evidence for t if s = s′ and o = o′ (recall that predicate matching is not required). Two subjects are the same if they are the same ID; two objects are the same if they are the same ID, date, or number (possibly in different formats). The provenance of this evidence is ⟨null, Freebase, p′⟩; url is set to null as it typically is not recorded in Freebase; pattern is set to p′, meaning that we discover the evidence by considering triples with predicate p′.

KV [11]: Knowledge Vault (KV) is an automatically generated knowledge base containing 2.8B triples, among which 90M high-probability (≥ 0.7) triples are not in Freebase. A KV triple is in the same format as a Freebase triple, and in addition has a probability indicating the likelihood of the triple being true, and a list of URLs from which the triple was extracted.

We consider a KV triple t′ = (s′, p′, o′) as evidence of t = (s, p, o) if s = s′ and o = o′, and Pr(t′) ≥ 0.7 (0.7 is a threshold suggested in [11]). For each webpage url from which t′ is extracted, we output a provenance ⟨url, KV, p′⟩.

3.2.2 Collecting evidence from the web

The web contains rich information in various formats, including web tables (structured), DOM trees (semi-structured), and texts (unstructured). We applied a wide spectrum of techniques to collect evidence from the web, ranging from sophisticated supervised learning to simple co-occurrences of entity names.

Web tables: Previous research [5] shows that web tables contain a vast amount of structured information about entities. Here we consider webpages where some keywords of the vertical (e.g., "winter sports") are mentioned. Then, following the techniques in [6], we extract from these webpages all instances under the table tag, and further identify relational tables in contrast to tables used for layout (1.3% relational tables among all raw tables [5]).

Next we extract a set of triples from each relational table. We distinguish vertical tables and horizontal tables (the "vertical" here should not be confused with that in "tail verticals"). A vertical table (e.g., a Wikipedia infobox [40]) describes an entity in two columns, where the first column gives the attributes and the second column gives the values. We then generate a triple from each row, where the subject entity is identified from the table header or surrounding text. A horizontal table often contains multiple entities, where each row describes an entity, each column represents an attribute, and each cell gives the value of the column attribute for the row entity. We train a classifier to decide the entity column, from which we extract subjects; we extract predicates from the table header; and we generate a triple from each remaining cell. Whenever possible, we reconcile the subjects and objects to Freebase mids.

We consider an extracted triple (s′, p′, o′) as evidence of an input triple (s, p, o) if they match on both subjects and objects. We say s′ and s (respectively, o′ and o) match if (1) their mids are identical, or (2) their Jaccard string similarity is above a threshold θ.
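The matching rule above can be sketched as follows. This is a minimal illustration, not the paper's implementation: Jaccard similarity is computed here over word tokens, subjects and objects are represented as (mid, raw_string) pairs, and the helper names jaccard, same_entity and is_evidence are ours.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def same_entity(x, y, theta=0.35):
    """x and y are (mid, raw_string) pairs; mid is None if unreconciled.
    They match if their Freebase mids are identical, or if the string
    similarity of their raw forms reaches the threshold theta."""
    mid_x, name_x = x
    mid_y, name_y = y
    if mid_x is not None and mid_x == mid_y:
        return True
    return jaccard(name_x, name_y) >= theta

def is_evidence(extracted, input_triple, theta=0.35):
    """An extracted (s', p', o') supports an input (s, p, o) if the subjects
    and the objects match; predicate matching is not required here."""
    (s1, _p1, o1), (s2, _p2, o2) = extracted, input_triple
    return same_entity(s1, s2, theta) and same_entity(o1, o2, theta)
```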

The predicates extracted from web tables are strings, so they can have much higher variety than KB predicates. Hence, instead of tracking each predicate as a pattern, we compare p with p′ and decide whether or not they match (string similarity above θ). The provenance of the evidence is ⟨url, Webtables, pred-match/pred-unmatch⟩, where url is the webpage from which the evidence is extracted.

Closed IE: We apply closed information extraction [15] techniques to extract triples from DOM trees and texts. The general idea is to learn a set of patterns from training data for each targeted relation (i.e., predicate), and then apply the learnt patterns to extract new instances. We say this extraction method is closed since it is restricted to pre-defined predicates and vocabularies.

In particular, we take our true triples as training examples, and apply distant supervision [29] to learn extraction patterns for each predicate. We then use the learnt patterns to extract triples from the web. In addition to phrases, the patterns we learned include two types of information for the purpose of improving extraction precision. First, we apply Natural Language Processing (NLP) techniques to parse texts [8], and the patterns may contain sentence structure. Second, we annotate Freebase entities in web documents [19], and the patterns contain the types of annotated entities; as such, we cannot extract triples whose subjects or objects are unknown to Freebase.

We consider an extracted triple (s′, p′, o′) as evidence of (s, p, o) if s = s′, p = p′ and o = o′ (we require p = p′ here because triples extracted by Closed IE are guaranteed to have the same predicates as our input). For each webpage url and extraction pattern pattern, we have a provenance ⟨url, ClosedIE, pattern⟩ for the evidence.

Open IE: Open IE [2] systems can extract triples in any domain without specifying a vocabulary, and are therefore called open. We apply Open IE on web texts; our extraction system works in a similar way to TEXTRUNNER [2] and REVERB [16]. Basically, it leverages NLP annotations to identify a relation phrase (e.g., equipped with) and a pair of entities from the surrounding text, and then generates a triple correspondingly. Both the relations and the entities can be arbitrary strings, but we reconcile them to Freebase mids whenever possible. The way we identify an Open IE extraction as evidence of a triple is exactly the same as for web-table extractions. We do not repeat the details here.

Web co-occurrences: Given a triple t = (s, p, o), we find the co-occurrences of the subject s and the object o on webpages as evidence for t. To allow for small variations, for each entity we (1) look up its aliases in Freebase if possible, and (2) use the n-grams of its name and aliases, where each n-gram contains n consecutive words in the string. To improve accuracy, (1) we allow a maximum of 30 words between the two n-grams; (2) in the case of free text (instead of DOM trees), we further restrict the co-occurrence to the same sentence.

We consider the pair (s-n-gram, o-n-gram) as evidence of t if they co-occur in at least three webpages (so it is less likely to be random noise). For each webpage url, we output a provenance ⟨url, Web co-occur, null⟩; we do not require a predicate match here in order to improve recall, and we set the extraction pattern to null.

3.2.3 Collecting evidence from query logs

We annotate each query in Google's search log with Freebase entities following the techniques in [17]. We consider a query Q as evidence of a triple t = (s, p, o) if (1) s has a corresponding Freebase mid, and Q contains an annotation for this mid, and (2) Q contains the name, alias, or Freebase annotation of o. We do not check predicates here since they rarely occur in a query where both subjects and objects occur; we require the subject to match on mid to reduce noise.

We distinguish two patterns for the evidence: entity-to-entity if the object o and Q match on Freebase mids, and entity-to-text otherwise. The provenance, thus, is ⟨null, Query logs, entity-to-entity/entity-to-text⟩. We distinguish these two patterns since they often lead to different evidence quality (see Section 5).

3.2.4 Summary

Table 3 summarizes the different evidence-collection approaches. Horizontally, the approaches are divided into three classes based on the type of the data sources. Vertically, there are also three categories: Freebase, KV and Closed IE are "closed" since they are restricted to known entities and relations; Open IE and Web tables are considered "open" since they can recognize unknown entities and relations; Web co-occurrences and Query logs are based on the "co-occurrences" of subject and object.

Table 3: Summary of different evidence-collection approaches.
Source            Technique       Closed   Open   Co-occur
Knowledge bases   Freebase        √
                  KV              √
Web               Web tables               √
                  Closed IE       √
                  Open IE                  √
                  Web co-occur                    √
Query log         Query logs                      √

3.3 Results and discoveries

We next examine the evidence we obtained through the above techniques. Empirically, we set the string-similarity threshold θ = 0.35, and use 4-grams for Web co-occurrences (we discuss parameter setting shortly). We use precision and recall to quantify the quality of the collected evidence. Let Evidence be the set of triples that we found evidence for, and Truth be the set of true triples. We define:

    Precision = |Evidence ∩ Truth| / |Evidence|;
    Recall    = |Evidence ∩ Truth| / |Truth|.

Further, we have F-measure = 2 · Precision · Recall / (Precision + Recall).

Table 4: Evidence collection results on each vertical.
            Cheese   Tomato   Gym     Yoga    Average
# URLs      20.6M    2.1M     4.6M    16.4M   9.2M
Precision   0.183    0.181    0.244   0.188   0.201
Recall      0.518    0.642    0.548   0.833   0.595
F1          0.270    0.282    0.338   0.307   0.300

Quality of evidence. Table 4 shows the evidence-collection results on each vertical. We find evidence from millions of URLs; there are more webpages about cheese and yoga poses, and fewer about tomato and gym exercises. The precision of the evidence is fairly similar among different verticals (between 0.18 and 0.25), while the recall ranges from 0.51 to 0.84. On average the obtained evidence has a precision of 0.20 and recall of 0.60, and covers 96% of the entities (we discuss the recall loss in Section 7.1).

Interestingly, based on the results, a more widely mentioned vertical may not have a higher recall in evidence collection. For instance, the vertical Cheese obtains evidence from the largest number of URLs, but it has the lowest recall. A detailed examination shows that in this vertical, the evidence for triples about different predicates differs greatly. For predicates like cheese type and state of origin, we obtain a recall over 0.8 and each triple has hundreds of pieces of evidence; in contrast, for predicates like taste and pairings, the recall is around 0.1 and each triple has only a few pieces of evidence.
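The precision, recall, and F-measure numbers reported in Table 4 follow directly from the set definitions given at the beginning of this subsection. A minimal sketch (the function name evidence_quality is ours):

```python
def evidence_quality(evidence_triples, true_triples):
    """Evidence-level quality as defined in Section 3.3:
    precision = |Evidence ∩ Truth| / |Evidence|,
    recall    = |Evidence ∩ Truth| / |Truth|."""
    evidence, truth = set(evidence_triples), set(true_triples)
    overlap = len(evidence & truth)
    precision = overlap / len(evidence) if evidence else 0.0
    recall = overlap / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1
```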

1373 1.0 1.0 Table 6: Comparing evidence for true triples and evidence for Freebase Freebase KV KV false triples. %True and %False are the percentage of true 0.8 Web tables 0.8 Web tables Closed IE Closed IE triples that have evidence, and respectively, the percentage of Open IE Open IE 0.6 0.6 false triples that have evidence. #forTrue and #forFalse are the Web co-occur Web co-occur Query logs Query logs number of pieces of evidence for a true triple, and respectively, 0.4 Overall 0.4 Overall Precision Precision for a false triple; ratio is the ratio between them. %True %False #forTrue #forFalse ratio 0.2 0.2 Freebase 3.1% 0.02% 1.5 1.1 1.4 KV 0.7% 0.003% 4.3 2.0 2.1 0.0 0.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Web tables 15.6% 1.9% 4.6 2.4 1.9 Recall Recall Closed IE 2.8% 0.1% 43.1 6.7 6.4 Figure 1: Quality of the ev- Figure 2: Average quality Open IE 13.8% 1.0% 21.3 3.9 5.5 idence from different sys- of the evidence from differ- Web co-occur 46.7% 14.8% 209.4 16.2 12.9 tems on Cheese vertical. ent systems on all verticals. Query logs 7.3% 0.9% 2.7 2.0 1.4 Overall 51.8% 17.3% 117.7 13.4 8.8 Table 5: The recall of evidence obtained by different methods from various sources; singleton rate is the percentage of triples triple is different from that of a false triple. We show results on for which only the particular method can find evidence. Cheese vertical in Table 6 (we observe similar patterns on other Closed Open Co-occur verticals). Knowledge bases 3.3% - - We first compare the percentage of true triples with evidence Web 2.8% 19.7% 46.7% against that of false triples. On average, we were able to find ev- Query logs - - 7.3% idence for 52% of the true triples and 17% of the false triples; all Singleton rate 0.0% 4.4% 29.0% systems are more likely to find evidence for true triples rather than false triples. Next we compare the amount of evidence for a true Comparison of different strategies. Using the Cheese vertical triple against that for a false triple. We observe that the number of as an example, we further evaluate the performance of different pieces of evidence for a true triple is 8.8 times as many as that for evidence collection approaches. Figure 1 plots the precision and a false triple on average. recall of each approach. For all evidence-collection approaches, there are more evidence We can see that the two knowledge bases (Freebase and KV) for true triples than for false triples, which conforms to our intu- provide evidence with the highest precision but the lowest recall, ition. Among different approaches, although Web co-occur finds exactly as expected. Both Closed IE and Query logs have low re- evidence for 14.8% false triples, its evidence is the most distin- call (but higher than the knowledge bases), as they provide evi- guishable: the number of pieces of evidence for a true triple is 13 dence only for Freebase entities, which is a very small portion of times as many as that for a false triple. Closed IE and Open IE also the input. Query logs also has quite low precision, indicating that give reasonable results, where the evidence for true triples is six understanding short query phrases is hard. Open IE and Web ta- times as much as that for false triples. Web tables and Query logs bles in a sense both apply open extractions; the former on texts and are less effective on distinguishing true triples from false triples the latter on tables. They have medium precision and fairly low re- (with a ratio below 2). 
Although the ratio for Freebase and KV are call, but better than any closed system. Web co-occurrences has the not high, they rarely provide evidence for false triples (i.e., with highest recall but the lowest precision. This is not surprising as it low %False), even though we do not require predicate match. purely relies on string matching and ignores any possible hint from Parameter tunning. We found that setting θ = 0.35 and using structural information. 4-gram for Web co-occurrences obtain the best results for all ver- Figure 2 shows the average performance on the four verticals. ticals. A higher parameter value would inevitably reduce recall. We observe very similar trend, except that the quality of Open IE Interestingly, a lower parameter value can increase recall and F- and Web tables, both being open extraction systems but applied on measure for evidence collection, but introduce noise in the evi- different types of data, are even closer. dence, and thus hurt the precision and F-measure of the final results In Table 5 we compare the recall of different methods. If we from knowledge fusion, which we describe in the next section. rely on closed techniques only, we obtain a recall of 0.04 in total: most of the evidence obtained by Closed IE are already contained Summary. By exploiting a wide spectrum of techniques, we were in existing knowledge bases, and all evidence obtained by closed able to collect evidence for 60% of the (correct) tail knowledge. techniques can also be found through other approaches (singleton However, we also found evidence for much more false triples, show- rate = 0). After including open techniques and relax the require- ing that simply looking for supporting evidence is not a reliable ment for predicate matching, we obtain a recall of only 0.2; 4.4% approach for knowledge verification. Among different strategies, of the true triples can only gain evidence through open extraction we found that the web is the best source for finding evidence, co- techniques. Web co-occurrences is the major contributor, with a occurence-based approaches have the highest recall, but open tech- singleton rate of 29%; the quality of all collected evidence is more niques make the balance between precision and recall. or less dominated by it. However, because removing evidence col- lected by other approaches will decrease the recall of the final re- 4. OVERVIEW OF KNOWLEDGE FUSION sults by about 15% (by 30% in the vertical Tomato), to maintain The previous section shows that although we found evidence for the same precision. This justifies the need of the larger spectrum both true triples and false triples, the amount of evidence for a true of extraction techniques we used. The majority of the evidence are triple is usually much more than that for a false triple. Also, the ev- collected from the web, which is not surprising. idence obtained by different approaches often have different qual- Difference between true triples and false triples. As we have ities. These two observations inspire us to apply knowledge fu- shown, we can find evidence for not only true triples, but also false sion [12, 13] to decide the truthfulness of each triple based on the triples. We are now wondering if the amount of evidence for a true collected evidence.
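In code terms, fusion operates per data item on a mapping from sources to the values they provide. A small sketch of building that observation from the collected evidence is shown below; it assumes evidence records carry a provenance identifier (e.g., a ⟨url, system, pattern⟩ tuple) that plays the role of a source, and the name group_evidence is ours.

```python
from collections import defaultdict

def group_evidence(evidence):
    """evidence: iterable of (subject, predicate, object, provenance) records,
    where provenance is a hashable identifier of the source of the evidence.
    Returns, per data item (subject, predicate), a mapping source -> set of
    objects, i.e., the observation Phi used by the fusion models in Section 4."""
    phi = defaultdict(lambda: defaultdict(set))
    for subj, pred, obj, provenance in evidence:
        phi[(subj, pred)][provenance].add(obj)
    return phi
```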

Table 7: Triples regarding ice hockey equipment. √/× indicates the correctness of a triple.
     subject      predicate   object       source        correct
c1   ice hockey   equipment   helmet       s1, s3        √
c2   ice hockey   equipment   stick        s1, s2        √
c3   ice hockey   equipment   boots        s2            ×
c4   ice hockey   equipment   board        s3            ×
c5   ice hockey   equipment   neck guard   (none)        √

Knowledge fusion is a research topic that predicts the correctness of knowledge triples by examining extractions made by multiple systems from various sources. We review the existing knowledge fusion methods and propose a new approach in Section 4.1, and then study their performance in Section 4.2.

4.1 Review of knowledge fusion methods

In the context of knowledge verification, each (subject, predicate) pair is considered a data item, and each object is considered a value. For simplicity, we follow [12] and consider an evidence provenance as a source that provides the triple (our experiments did not show significant gain by separating sources and extractors as in [13]). For instance, Table 7 shows five triples regarding the data item (ice hockey, equipment). There are three sources (s1, s2 and s3) that provide four values for this data item, while "neck guard" is not provided by any source.

Given a data item d and a set S̄ of sources that provide values on d, knowledge fusion aims to compute a probability p(v) for each value v ∈ V being true, where V denotes all possible values for d (V may contain values not provided by any source). Let Φ denote the mapping between S̄ and V, and Φ(S) denote the values provided by a source S ∈ S̄ on d (we omit d in the notation for simplicity). Our goal then becomes computing the posterior probability p(v|Φ) of v being true based on the observations from the sources. Table 8 summarizes the notations. Note that in this paper we focus on the case where the sources are independent of each other; one can incorporate the techniques from [10, 32] to address correlations between sources.

Table 8: Notations used in Section 4.
Notation   Description
d          a data item
v          a value
S          a source that provides values
Φ          mapping between values and sources for a data item
Φ(S)       the set of values provided by S on a data item
O          a sequence of values that have been selected as truths
⊥          "there is no more truth"

We categorize the existing knowledge fusion methods into two classes: single-truth models [24] and multi-truth models [32, 45]. Our observation from Table 2 shows that the majority of our data items have a single truth, but there are also cases with multiple truths. We thus propose a hybrid method that combines the strengths of existing methods and meanwhile takes into consideration the prior on the number of truths. We next introduce each category respectively. While our review mainly focuses on Bayesian approaches, there are also graphical-model approaches [31] and optimization-based approaches [23] that share similar intuitions.

4.1.1 Single-truth models

Single-truth models [24, 26] assume that there is only one true value for a data item, and thus Σ_{v∈V} p(v|Φ) = 1. The value v with the highest probability p(v|Φ) is then selected as the truth. The intuition behind the single-truth models is that values provided by more sources and higher-quality sources are more likely to be true. The quality of a source S is measured by its accuracy A(S), which is the probability that a value provided by S is true.

We now explain how to compute p(v|Φ) using the ACCU [10] model, as other models share a lot of commonalities. Let Φ(S) denote the values provided by a source S on d. Under the source-independence assumption and applying Bayesian analysis, we have

    p(v|Φ) = [ Π_{S∈S̄} p(Φ(S)|v) · λ ] / [ Σ_{v′∈V} Π_{S∈S̄} p(Φ(S)|v′) · λ ]    (1)

Here p(Φ(S)|v) is the probability of observing Φ(S) given that v is the truth. λ is the a priori probability that a value v is true; we usually assume λ is the same for all values in V, so it can be cancelled from the numerator and the denominator.

Assuming there are n false values in the domain (n = |V| − 1), S provides a true value with probability A(S), and a particular false value with probability (1 − A(S))/n. Thus we have:

    p(Φ(S)|v) = A(S)            if v ∈ Φ(S);
                (1 − A(S))/n    if v ∉ Φ(S).    (2)

EXAMPLE 4.1. Consider the triples c1-c5 in Table 7. Suppose n = 10 and A = 0.6 for all sources. Triples c1 and c2 are each provided by 2 sources, so we have Π_{S∈S̄} p(Φ(S)|c1) = Π_{S∈S̄} p(Φ(S)|c2) = 0.6² × (1 − 0.6)/10 = 0.0144. Similarly, Π_{S∈S̄} p(Φ(S)|c3) = Π_{S∈S̄} p(Φ(S)|c4) ≈ 0.001. From Eq. (1) we compute triple probabilities as follows: p(c1) = p(c2) = 0.0144/(0.0144 + 0.0144 + 0.001 + 0.001) ≈ 0.47; p(c3) = p(c4) ≈ 0.03, and p(c5) = 0 as it is not provided by any source. We see that the probabilities of all values add up to 1, so even true values (helmet and stick) have rather low probabilities. □

Obviously, the limitation of single-truth models is that when multiple truths exist, they find at best one of them.
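A minimal sketch of the ACCU computation in Eqs. (1)-(2) on the data item of Table 7 is shown below. It is our own illustration, not the paper's code; with n = 10 and A(S) = 0.6 it is consistent with Example 4.1 (the slight difference from 0.47 comes from also normalizing over the domain values provided by nobody). The function name accu_probabilities is ours.

```python
def accu_probabilities(values, provided_by, sources, accuracy=0.6, n=10):
    """Single-truth ACCU model (Eqs. 1-2): for each candidate value, multiply
    per-source likelihoods A(S) if the source provides it and (1-A(S))/n
    otherwise, then normalize over all values in the domain."""
    def likelihood(v):
        prod = 1.0
        for s in sources:
            prod *= accuracy if v in provided_by[s] else (1 - accuracy) / n
        return prod

    scores = {v: likelihood(v) for v in values}
    # The domain contains n+1 values in total; those never listed above are
    # provided by nobody and each contributes ((1-A)/n)**|sources| to the sum.
    unseen = (n + 1) - len(values)
    norm = sum(scores.values()) + unseen * ((1 - accuracy) / n) ** len(sources)
    return {v: scores[v] / norm for v in values}

provided_by = {"s1": {"helmet", "stick"},
               "s2": {"stick", "boots"},
               "s3": {"helmet", "board"}}
values = ["helmet", "stick", "boots", "board", "neck guard"]
print(accu_probabilities(values, provided_by, ["s1", "s2", "s3"]))
# helmet and stick score highest (about 0.46, cf. 0.47 in Example 4.1);
# boots and board get about 0.03; neck guard is close to 0.
```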

4.1.2 Multi-truth models

Multi-truth models [32, 38, 45] allow the existence of multiple truths. They compute the probability of each value separately. Hence, they do not require Σ_{v∈V} p(v|Φ) = 1, but only p(v|Φ) + p(¬v|Φ) = 1, where ¬v denotes that v is a false value. A value v is considered true if p(v|Φ) > p(¬v|Φ); that is, p(v|Φ) > 0.5.

An unknown semantics is used to capture the nature of multi-truth: if a source S does not provide the value v on d, S means that it does not know whether or not v is correct (instead of saying v is incorrect). Accordingly, multi-truth methods capture the quality of a source S by two metrics: the precision P(S), which is the same as the accuracy in ACCU, and the recall R(S), which is the probability that a truth is provided by S. Intuitively, values provided by high-precision sources are likely to be true, and values not provided by high-recall sources are likely to be false.

The PRECREC method [32] computes p(v|Φ) as follows:

    p(v|Φ) = [ Π_{S∈S̄} p(Φ(S)|v) · λ ] / [ Π_{S∈S̄} p(Φ(S)|v) · λ + Π_{S∈S̄} p(Φ(S)|¬v) · (1 − λ) ]    (3)

We now explain the computation of p(Φ(S)|v) and p(Φ(S)|¬v). First, from P(S) and R(S), we derive the false positive rate Q(S) of S as Q(S) = λ/(1 − λ) · (1 − P(S))/P(S) · R(S), according to [32]. Then S provides v with probability R(S) if v is true, and with probability Q(S) if v is false:

    p(Φ(S)|v)  = R(S)        if v ∈ Φ(S);
                 1 − R(S)    if v ∉ Φ(S).    (4)

    p(Φ(S)|¬v) = Q(S)        if v ∈ Φ(S);
                 1 − Q(S)    if v ∉ Φ(S).    (5)

EXAMPLE 4.2. Again, consider triples c1-c5 in Table 7. Suppose R = 0.5 and Q = 0.33 for all sources. Then for c1, we have Π_{S∈S̄} p(Φ(S)|c1) = R(s1)(1 − R(s2))R(s3) = 0.125 and Π_{S∈S̄} p(Φ(S)|¬c1) = Q(s1)(1 − Q(s2))Q(s3) = 0.074. Assuming λ = 0.5, from Eq. (3) we have p(c1) = 0.125 × 0.5 / (0.125 × 0.5 + 0.074 × (1 − 0.5)) = 0.63. Similarly, p(c2) = 0.63, p(c3) = p(c4) = 0.54, and p(c5) = 0.28. Therefore, all provided values are considered true under PRECREC, resulting in false positives (i.e., boots and board). □

In practice, even multi-truth items often have only a few true values instead of an unbounded number of truths. Existing multi-truth models cannot capture this because they decide the truthfulness of each value independently, without considering the other values in V, and thus lack a global view of a data item.

4.1.3 Hybrid models

To gain a global view in truth finding while allowing the identification of multiple truths, we propose a hybrid model, called HYBRID. HYBRID considers the conflicts between values as important evidence for ruling out wrong values, while keeping the flexibility of allowing multiple truths. In addition, HYBRID considers the prior on the number of truths in the model.

HYBRID makes two decisions for a data item: (i) how many truths there are, and (ii) what they are. Essentially, it interleaves the two decisions and finds the truths one by one: conditioning on a sequence O of true values that have been selected previously, it decides (1) the probability of a value v being the next truth, denoted by p(v|O, Φ), and (2) the probability that there is no more truth, denoted by p(⊥|O, Φ). These are disjoint decisions so their probabilities sum up to 1: Σ_{v∈V\O} p(v|O, Φ) + p(⊥|O, Φ) = 1. Thus, when selecting the next truth, HYBRID basically applies a single-truth model. However, when deciding whether there is any more truth (i.e., p(⊥|O, Φ)), HYBRID incorporates the unknown semantics used in multi-truth models.

Moreover, HYBRID leverages the typical number of truths for each type of data item. For example, a person typically has 2 parents and 1-5 children. HYBRID allows incorporating such knowledge as an a priori probability of p(⊥|O, Φ).

Source quality metrics. As HYBRID jointly models the number of truths as well as the true values, it captures the quality of a source with two sets of metrics: one for deciding whether there exists a truth, and one for deciding the true values.

The first set of metrics enables the unknown semantics for multi-truth, and it includes two measures: (1) precision P(S), the probability that when S provides a value, there indeed exists a truth; (2) recall R(S), the probability that when there exists a truth, S provides a value. Note that our P(S) and R(S) are different from the same notions in PRECREC: we only measure how well S predicts whether or not there exists a truth, but not how well S predicts what the truth is; in other words, we do not require the value provided by S to be the same as the truth.

The second set of metrics follows single-truth models to address the conflicts between values. It contains one measure: accuracy A(S), the probability that a value provided by S for a "real" truth slot is true (i.e., S provides a true value after it has correctly predicted the existence of a truth slot). Note that values provided for non-existing truth slots, which are absolutely false, are not counted here, as they have been captured by P(S).

Value probability computation. We next describe how to obtain p(v|Φ). As we select the truths one by one, there can be various sequences of truths (of any length below |V|) that are selected before v (we may select the value v after selecting v1, or after selecting v1v2, etc.). We call each sequence O a possible world and denote by Ω all possible worlds. Then the probability of v is the weighted sum of its probability p(v|O, Φ) in each possible world O:

    p(v|Φ) = Σ_{O∈Ω} p(v|O, Φ) · p(O|Φ)    (6)

where p(O|Φ) is the probability of entering the possible world O. Let O = v1 v2 ... v_{|O|}, v ∉ O, denote a possible world with the sequence v1, v2, ..., v_{|O|} of values selected as truths. Let O_j denote a prefix of O with length j and O_0 = ∅. Applying the chain rule leads us to:

    p(O|Φ) = Π_{j=1..|O|} p(v_j | O_{j−1}, Φ)    (7)

Now the only piece missing from Eqs. (6)-(7) is the conditional probability p(v|O, Φ). It is computed according to the three quality metrics, and we refer readers to [21] for details.

Figure 3: Tree structure for computing p(c2) in Table 7. We omit triples without any source for simplicity.

EXAMPLE 4.3. One way to think about the computation in HYBRID is through a tree structure (see Figure 3 as an example for c2 in Table 7). The root of the tree represents having not selected any value. A path from the root to a node v represents a possible way to select v; for example, the path c1-c4-c2 corresponds to the case where we select c2 after selecting c1 and c4 sequentially (O = c1c4). The children of a node represent candidates for the next truth, containing all unselected values in V and ⊥.

The number under each node v is the probability p(v|O, Φ). For example, following the path ∅-c1-c4-c2, we have p(c2|c1c4, Φ) = 0.23 (see [21] for details on how this number is obtained). The probability p(O) is given by the product of the numbers along the path. For example, p(c1c4c2|Φ) = 0.47 × 0.06 × 0.23 = 0.007. The probability p(v|Φ) of v being true is thus the sum of the probabilities of all paths ending with v, according to Eq. (6). In our example, we can reach c2 through 16 paths, and we obtain p(c2) = 0.92. □

While computing value probabilities by enumerating all possible worlds is expensive, we develop an efficient algorithm that runs in polynomial time and has an approximation bound of 1/6 (see [21] for details).

Table 9: Value probabilities computed by different fusion methods for data item (ice hockey, equipment).
                    helmet   stick   boots   board   neck guard
Single-truth [10]   0.47     0.47    0.03    0.03    0.00
Multi-truth [32]    0.63     0.63    0.54    0.54    0.28
HYBRID              0.92     0.92    0.08    0.08    0.00

Table 9 compares the probabilities computed by different fusion models on the data item (ice hockey, equipment). We can see that
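For comparison with the single-truth sketch above, the following is a minimal sketch of the PRECREC computation in Eqs. (3)-(5) on the same data item. With R = 0.5, Q = 0.33 and λ = 0.5 it reproduces p(c1) ≈ 0.63 from Example 4.2; the exact treatment of sources that are silent on a candidate value may differ slightly from the paper's implementation, and the function name precrec_probability is ours.

```python
def precrec_probability(value, provided_by, sources, recall=0.5, q=0.33, lam=0.5):
    """Multi-truth PRECREC model (Eqs. 3-5): each value is judged on its own.
    A source that provides the value contributes R(S) under the hypothesis
    'value is true' and Q(S) under 'value is false'; a silent source
    contributes 1-R(S) and 1-Q(S) respectively."""
    p_true, p_false = lam, 1.0 - lam
    for s in sources:
        if value in provided_by[s]:
            p_true *= recall
            p_false *= q
        else:
            p_true *= 1.0 - recall
            p_false *= 1.0 - q
    return p_true / (p_true + p_false)

provided_by = {"s1": {"helmet", "stick"},
               "s2": {"stick", "boots"},
               "s3": {"helmet", "board"}}
sources = ["s1", "s2", "s3"]
for v in ["helmet", "stick", "boots", "board", "neck guard"]:
    print(v, round(precrec_probability(v, provided_by, sources), 2))
# helmet and stick come out around 0.63; values backed by a single source
# and the unprovided "neck guard" receive lower scores.
```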

1376 Table 10: Statics of the Book data. 10 data sources providing values on 100 data items, where wrong #entities #triples #sources precision recall %multi-truth values are randomly sampled from a domain of 100 values. 1,263 6,139 876 0.62 0.98 57% The data generation is controlled by two sets of parameters. The first parameter is the number of truths for a data item. It ranges from Table 11: Results on Book data. HYBRID obtains the highest 1 to 10, and by default follows a Gaussian distribution with mean = recall and F-measure. 6 and standard deviation = 1. The second set of parameters control Precision Recall F1 the values of a source S. Given the ground truth T of a data item, ACCU 0.990 0.532 0.692 for each v ∈ T , S has a probability of %cover to provide a value, ACCU LIST 0.974 0.801 0.879 and with probability %correct this provided value is v (otherwise LTM 0.911 0.973 0.941 it is a wrong value). Meanwhile, for each v ∈ T , S has a proba- PRECREC 0.965 0.931 0.947 bility of %extra to provide a random wrong value. Hence, %cover HYBRID 0.941 0.973 0.957 and %extra control the number of values provided by a source, while %correct controls the correctness of a provided value. All HYBRID clearly gives high probabilities for true values and low the three parameters range from 0.2 to 1; by default we set %cover probabilities for false values. and %correct to 0.7, and %extra to 0.2. All experiments were re- 4.2 Performance study of different methods peated 100 times and we report the average performance. Figure 4 shows the results when we vary the number of truths for Before showing the results on tail verticals, we first compare a data item. We can see that HYBRID can fairly well “guess” the the various methods, including the newly proposed HYBRID, on number of truths and consistently outperforms the others. As the a widely used dataset, as well as synthetic data, to gain insights on number of truths increases, the precision of HYBRID remains high, their performance. We compare HYBRID with a single-truth ap- while the precision of PRECREC drops. This is because the extra proach ACCU [10], and two multi-truth approaches PRECREC [32] ratio is fixed; so when there are more truths, more wrong values and LTM [45] (LTM shares the same intuition as PRECREC but will be provided by the sources, and PRECREC is more sensitive to uses a graphical model). noise. Again, LTM gives a low precision but a high recall, consis- We implemented all methods in Java. Following previous works, tent with our observation on the Book data. ACCU usually has the we set n = 10, and λ = 0.5. We initialize the source quality metrics highest precision but the lowest recall. However, its precision can as A(S) = P (S) = 0.8 and Q(S) = 0.2 for every source S. We then be low when the number of truths is small; this is because ACCU al- iteratively compute triple probabilities and source qualities for up ways finds a truth for each data item, while other methods may not to 5 iterations. output any truth (when all the value probabilities are below 0.5). We report the precision, recall and F1 for each method. Preci- Figure 5 shows the F-measure of all methods when we vary sion measures among all values predicted as true, the percentage %cover, %correct, and %extra of the sources (see [21] for preci- that are indeed true. Recall measures among all true values, the sion and recall). Not surprisingly, all methods obtain better results percentage that are predicted as true. 
when the sources are more complete (higher %cover) and accurate 4.2.1 Results on Book data (higher %correct). On the contrary, the performance of all meth- ods drop when %extra increases. We observe that HYBRID has We first use the Book data from [43], which has been widely the highest F-measure in general and it is the most robust; it typ- used for knowledge-fusion experiments. As shown in Table 10, it ically outperforms others when the source observations are noisy. contains 6,139 book-author triples on 1,263 books from 876 retail- PRECREC obtains the worst results when %cover and %correct ers. The gold standard consists of authors for 100 randomly sam- are medium (0.4-0.6); this is because in this case the sources have pled books, where the authors were manually identified from book similar probabilities to provide a true value and a false value, and covers. According to the gold standard, 62% of the provided au- PRECREC is unable to distinguish them. thors are correct, and 98% of the true values are provided by some source. There are 57% of the books that have multiple authors. Summary. HYBRID outperforms existing single-truth and multi- In addition to the five fusion methods we listed before, we also truth models and is the most robust under varying source qualities. compared with ACCU LIST, which applies ACCU but considers the full list of authors as a whole [10, 43]. 5. APPLYING KNOWLEDGE FUSION Table 11 shows the results. Not surprisingly, ACCU has the high- We now apply knowledge fusion on tail verticals. As shown in est precision but the lowest recall as it only finds one author for a Figure 2 and Table 10, the quality of the collected evidence on tail book; even though ACCU LIST treats each author list as a whole, its verticals is much lower than that of the Book data. Furthermore, the recall is still low. The two multi-truth models, LTM and PRECREC, long-tail data has a higher variety in terms of the number of truths have similar F-measure, while PRECREC appears to be more con- and the number of sources. Hence, in this set of experiments we servative (a higher precision but lower recall). In this dataset many seek to answer the following questions: 1) Is knowledge fusion ef- sources only provide the first author of a book and this explains fective on datasets that are highly incomplete and noisy? 2) Among the low recall; the high precision is because the sources rarely pro- different fusion models, which one performs the best on tail data? vide the same wrong value. HYBRID obtains a higher F-measure We point out that the effectiveness of knowledge fusion relies on than existing single-truth and multi-truth models. By considering the fact that the set of triples are in the same vertical, so that we both conflicts between values and the possibility of having multi- can estimate the quality of the sources meaningfully. We also point ple truths, it is able to identify more true values without sacrificing out that the predicate matching information recorded in pattern is much of precision. leveraged in knowledge fusion as we evaluate the quality of each provenance. 4.2.2 Results on synthetic data To better understand the performance of different methods in var- 5.1 Implementations ious situations, we next compare them on synthetic data where we We compare the following three knowledge fusion algorithms: vary the number of truths and the quality of sources. We generated the single-truth model ACCU, the multi-truth model PRECREC, and

Figure 4: Varying the number of truths on synthetic data. HYBRID improves over other models when the number of truths is large. (Three panels plot precision, recall, and F1 against the number of true values, 1-10, for ACCU, PRECREC, LTM, and HYBRID.)

Figure 5: Varying source quality on synthetic data. HYBRID performs best and is the most robust among all methods. (Three panels plot F1 against %correct, %cover, and %extra for ACCU, PRECREC, LTM, and HYBRID.)

1.0 the hybrid model HYBRID. We skip LTM because it does not scale on this dataset. We test two settings of the algorithms: (1) assume 0.8 each source has the same quality, and then iteratively compute triple probabilities and source qualities; (2) initialize source quality by 0.6 comparing its provided values against the gold standard3, and run 0.4 the fusion algorithms in one-pass. We distinguish these two options Precision by appending “+” to the latter (e.g.,ACCU+ for quality bootstrap). 0.2 By default, we use HYBRID+ for our framework. Cheese Gym We remove evidence from sources that provides no evidence for Tomato Yoga 0.0 0.0 0.2 0.4 0.6 0.8 1.0 any true triple, as these are likely to be irrelevant sources. Our Recall experimental results show slightly better results and we skip the Figure 6: PR-curves of knowledge fusion on tail verticals. details because of lack of space. 5.2 Results Table 12: Requiring a precision of 0.9, we achieve a recall of 47.1%, and we can provide at least three sources for 44.8% of High-level results. We first evaluate the effectiveness of knowl- the true triples. edge fusion on tail verticals. Note that knowledge fusion models Cheese Tomato Gym Yoga Average predict the probability of a triple being true. We thus order triples [email protected] 46.7% 61.7% 40.0% 31.9% 47.1% in decreasing order of the predicted probabilities; then as we gradu- #URLs ≥ 3 45.0% 60.1% 38.8% 21.7% 44.8% ally consider triples above a probability threshold, we plot the pre- cision vs. recall (known as the PR-curve) obtained by HYBRID+ in closer examination reveals that in Yoga the collected evidence is Figure 6. In a PR-curve, the highest point corresponds to the re- very noisy: we find evidence for 67% false triples, and a true triple sults when considering triples with probabilities above 0.99, while has only twice as much evidence as a false triple. In contrast, on the lowest point corresponds to all triples with evidence. the Cheese vertical a true triple has eight times as much evidence We observe that knowledge fusion effectively filters false triples. as a false triple. Therefore the knowledge fusion is less effective on For instance, in the Cheese vertical, while the raw evidence has an Yoga since the collected evidence is less distinguishable. F-measure of 0.27 (precision = 0.18, recall = 0.52, corresponding to In practice we require a high precision such that the approach the lowest point in the curve), the highest F-measure obtained after can be deemed reliable. As shown in Table 12, when we require a knowledge fusion is 0.62 (precision = 0.84, recall = 0.49). With precision of 0.9, we were able to verify 47.1% of the true triples, only 3% recall loss, the precision is 4.6 times higher. and for 44.8% of the true triples we can find at least three web- Among different verticals, we obtain fairly similar precision (be- sites that contain supporting evidence, showing that they are public tween 0.81 and 0.86), but the recall vary from 0.37 to 0.61. In- knowledge. terestingly, while the vertical Yoga obtains the lowest recall after knowledge fusion, its recall on evidence collection (the rightmost Effect of triple types. Using the Cheese vertical as an example, point of the red curve) is considerably higher than the others. A we next study how the type of a triple may affect the performance of knowledge fusion. 
Effect of triple types. Using the Cheese vertical as an example, we next study how the type of a triple may affect the performance of knowledge fusion. We distinguish triples based on whether or not their objects can be reconciled to Freebase mids, calling them entity-obj triples and string-obj triples, respectively. Figure 7 shows the PR-curve of each category.

³In practice, we can use a gold standard on a small sample to initialize source quality. Our experiments on a 10% sample obtain similar results.
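To make setting (2) concrete, source accuracy can be initialized by comparing each source's provided values against the (sampled) gold standard. The sketch below is a minimal illustration with hypothetical data structures, not the exact implementation used in our experiments.

    # Minimal sketch: initialize source accuracy from a gold-standard sample.
    # 'provided' maps source -> list of (data_item, value); 'gold' maps data_item -> true value.
    def bootstrap_source_accuracy(provided, gold, default=0.5):
        accuracy = {}
        for source, triples in provided.items():
            judged = [(item, val) for item, val in triples if item in gold]
            if not judged:
                accuracy[source] = default              # no overlap with the sample
                continue
            correct = sum(1 for item, val in judged if gold[item] == val)
            accuracy[source] = correct / len(judged)
        return accuracy

    if __name__ == "__main__":
        provided = {"site_a": [("gouda,origin", "Netherlands"), ("brie,origin", "Italy")],
                    "site_b": [("gouda,origin", "Netherlands")]}
        gold = {"gouda,origin": "Netherlands", "brie,origin": "France"}
        print(bootstrap_source_accuracy(provided, gold))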

Table 14: Results on knowledge verification. Each column evaluates the output of a component (e.g., Evidence collection and Knowledge fusion for FACTY). The column of each method's final step (KFusion for FACTY, Ranking for T-VERIFIER) gives the final results; FACTY outperforms T-VERIFIER significantly.

                          FACTY                    T-VERIFIER
                    Step I      Step II      Step I      Step II
                    Evidence    KFusion      Search      Ranking
  Cheese      P     0.183       0.842        0.066       0.163
              R     0.518       0.494        0.984       0.159
              F     0.270       0.623        0.124       0.161
  Tomato      P     0.181       0.851        0.129       0.131
              R     0.642       0.614        0.979       0.218
              F     0.282       0.713        0.228       0.163
  Gym         P     0.244       0.849        0.147       0.279
              R     0.548       0.466        0.898       0.232
              F     0.338       0.602        0.252       0.253
  Yoga        P     0.188       0.813        0.161       0.245
              R     0.833       0.373        0.947       0.093
              F     0.307       0.512        0.275       0.135
  Overall     P     0.201       0.843        0.123       0.223
              R     0.595       0.506        0.947       0.190
              F     0.301       0.633        0.218       0.205

Figure 7: Effect of triple types (PR-curves for entity-obj, string-obj, and overall).

Figure 8: Effect of source quality bootstrap (result precision and recall, perturbed input precision, and no-bootstrap precision and recall, as the perturbation rate varies).

Table 13: Knowledge fusion results on four verticals.

                  Cheese                  Tomato
              P       R       F       P       R       F
  ACCU        0.637   0.790   0.705   0.725   0.828   0.773
  PRECREC     0.597   0.820   0.691   0.701   0.836   0.763
  HYBRID      0.610   0.822   0.700   0.706   0.853   0.773
  ACCU+       0.850   0.923   0.885   0.852   0.944   0.896
  PRECREC+    0.791   0.975   0.873   0.826   0.970   0.892
  HYBRID+     0.842   0.954   0.894   0.851   0.958   0.901

                  Gym                     Yoga
              P       R       F       P       R       F
  ACCU        0.669   0.704   0.686   0.514   0.233   0.321
  PRECREC     0.642   0.712   0.675   0.388   0.212   0.275
  HYBRID      0.652   0.724   0.686   0.482   0.253   0.332
  ACCU+       0.838   0.841   0.839   0.786   0.353   0.487
  PRECREC+    0.796   0.929   0.857   0.720   0.709   0.715
  HYBRID+     0.829   0.861   0.845   0.813   0.448   0.577

Not surprisingly, entity-obj triples obtained much higher precision and recall than string-obj triples: the higher recall is because reconciled objects are typically more popular and thus widely mentioned; the higher precision is due to the more reliable evidence from structured sources like Freebase and KV.

Comparing different knowledge fusion methods. Table 13 compares the performance of the different approaches. To focus on the fusion models, we report the relative recall; that is, the recall on triples for which at least some evidence is collected.

We observe that HYBRID (or HYBRID+) has slightly better results than ACCU and PRECREC. Comparing with ACCU, it has similar precision but higher recall; it is able to find 10% more truths on predicates where over 10% of the data items have multiple truths. Comparing with PRECREC, it has higher precision and slightly lower recall, but a higher F-measure; the precision gain is more pronounced for predicates with a lot of noise, where the precision of HYBRID can be twice as high as that of PRECREC.

An exception is the vertical Yoga, where about 60% of the data items have multiple truths. Although HYBRID+ has a 13% higher precision than PRECREC+, its recall is 35% lower. This is because most of the provenances contribute evidence for a single triple of a data item, so HYBRID+ often predicts that there is only one truth, especially given that the evidence for this vertical is very noisy. Interestingly, without source-quality bootstrap, HYBRID outperforms PRECREC by 24% on precision and by 19% on recall, showing that HYBRID is more robust in the absence of bootstrap.

Effect of source quality bootstrap. As shown in Table 13, by initializing source quality using the gold standard before fusion, we obtained significantly better results. We next examine the effect of this bootstrap process. We ask two questions: (1) does this bootstrap always work? (2) if our prior knowledge of source quality is inaccurate, how robust are the results?

We perturbed the gold standard that we use to initialize source quality: for a given percentage of data items, we replace the triple in the gold standard with a wrong triple. As a result, the initial source quality estimated from this perturbed gold standard differs from the real quality of the sources. We show results on the Cheese vertical with HYBRID+ in Figure 8 (similar trends are observed in the other verticals). We see that both the precision and the recall of the fusion results drop when the initial source quality becomes inaccurate. However, compared with the rate at which the perturbed gold standard itself degrades, the dropping rate of result precision is only 44% and that of result recall only 30%. Further, the result precision is always higher than that without bootstrap.
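The perturbation procedure can be sketched as follows; the data and the replacement values are placeholders, and the function only illustrates the shape of the experiment, not our exact code.

    import random

    # Minimal sketch: corrupt a fraction of the gold standard by replacing the
    # true value of randomly chosen data items with a wrong value.
    def perturb_gold(gold, rate, wrong_values, seed=0):
        rng = random.Random(seed)
        items = list(gold)
        k = int(round(rate * len(items)))
        perturbed = dict(gold)
        for item in rng.sample(items, k):
            candidates = [v for v in wrong_values if v != gold[item]]
            if candidates:
                perturbed[item] = rng.choice(candidates)
        return perturbed

    if __name__ == "__main__":
        gold = {"gouda,origin": "Netherlands", "brie,origin": "France", "feta,origin": "Greece"}
        print(perturb_gold(gold, rate=0.34, wrong_values=["Italy", "Spain", "France"]))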
Summary. Our experiments show that knowledge fusion can effectively identify the correct triples based on the evidence, improving precision from 0.20 to 0.84 without sacrificing much of the recall. In general HYBRID is adequate for both single-truth and multi-truth cases; however, compared with the results on synthetic data, the improvement of HYBRID is limited because of the sparsity of the evidence we can automatically collect for tail verticals. In cases where the majority of data items have multiple truths and we have reliable data to estimate source quality, applying PRECREC may better leverage the bootstrap and improve recall.

6. OVERALL EVALUATION OF KNOWLEDGE VERIFICATION

We finally evaluate the overall performance of knowledge verification. We compare the following methods.
• FACTY, our framework that performs evidence collection followed by knowledge fusion. It applies HYBRID+ to decide the correctness of each triple.
• T-VERIFIER [25], the state-of-the-art approach that verifies a triple by searching the web. For each candidate triple, it first searches its subject and object using Google. It then ranks the candidates of each data item based on features such as the number of results returned by the query, the distance between keywords in the result snippets, the ranking given by Google, and so on. A weight vector is learnt to combine the features. Among the different candidates, it selects the triple with the highest score as true (a schematic sketch is given below).
Table 14 reports the precision, recall and F1 of each method.
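For intuition, the ranking step can be pictured as a linear scoring function over such features; the feature names and weights below are illustrative placeholders and not the actual model of [25].

    # Illustrative sketch of feature-weighted candidate ranking (not T-VERIFIER's actual code):
    # each candidate triple of a data item receives a weighted-sum score, and the
    # top-scored candidate is taken as the true one.
    def rank_candidates(candidates, weights):
        """candidates: list of (triple, feature_dict); weights: feature_name -> weight."""
        def score(features):
            return sum(weights.get(name, 0.0) * value for name, value in features.items())
        return max(candidates, key=lambda c: score(c[1]))[0]

    if __name__ == "__main__":
        weights = {"hit_count": 0.4, "keyword_proximity": 0.4, "search_rank": 0.2}
        candidates = [
            (("gouda", "origin", "Netherlands"), {"hit_count": 0.9, "keyword_proximity": 0.8, "search_rank": 0.9}),
            (("gouda", "origin", "Italy"),       {"hit_count": 0.7, "keyword_proximity": 0.3, "search_rank": 0.5}),
        ]
        print(rank_candidates(candidates, weights))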

On average, FACTY obtains a precision of 0.843 and a recall of 0.506 (in column "KFusion"), whereas T-VERIFIER gives much lower precision and recall (in column "Ranking"). In the first step, T-VERIFIER obtains very noisy results (low precision in column "Search"); this is because a false triple whose object contains popular words or phrases can occur much more often than a true triple. Unfortunately, none of the features considered in the ranking step is able to detect this, resulting in a low precision. In contrast, the knowledge-fusion step in FACTY dramatically increases the precision over the raw evidence. Note that the quality of the T-VERIFIER results we obtained is also lower than that reported in [25], indicating that it is not suitable for verifying long-tail triples.

We also note that the evidence collected by FACTY is more precise than that obtained by web search in T-VERIFIER. This is because our evidence-collection approaches usually exploit stricter rules than web search. However, the search results obtained by T-VERIFIER have a very high recall (95% on average); this is not surprising since for most queries the search engine returns some results, although they may not be relevant.

7. FUTURE DIRECTIONS

Our experimental results show that by employing various evidence-collection techniques and applying knowledge fusion on the obtained evidence, we can obtain reasonable results for knowledge verification, and significantly outperform the existing approach. However, there is still much room for improvement. Next, through a series of error analyses, we discuss possible future directions.

7.1 Improving recall

As shown in Figure 2, our evidence collection has a recall of 60%, which means that for 40% of the correct triples we cannot find any evidence. We randomly selected 10 such triples, manually constructed various keyword search queries for each of them, and tried to find evidence in the top-20 search results. We were able to find evidence for 8 of them. Of the two we could not find, one has a very ambiguous entity name and it is even hard to find the entity itself; the other is not mentioned in the top-20 results of any keyword search we tried. This investigation suggests several possibilities for improving the recall of evidence collection.

Open extraction for DOM trees. Among the eight triples, six exist in DOM data. Recall that we have open extractions for web tables and texts, but not for DOM trees. Indeed, we are not aware of any such technique, since DOM trees have neither grammar structure nor table structure to indicate the semantics. There are rich data in DOM trees, and many of them regard entities and attributes not in existing knowledge bases. We believe that extracting knowledge in this form will dramatically enrich existing knowledge bases.

Allowing long-distance co-occurrences. In our examination, four triples have the subject mentioned in the article (or DOM tree) title, while the predicate or object is mentioned in the text. Extractions, including Web co-occurrences, fail because of the long distance between subjects and objects. An easy fix would be to increase the window size when looking for co-occurrences, but this may hurt precision. A more advanced approach is to treat the article title and the text differently.

Using embedding methods. Another possibility is to jointly embed web documents and the existing knowledge-base triples into the same vector space [28, 39]. In this way, we represent predicates with dense vectors instead of lexical words, and thus largely reduce the dimensionality. A candidate triple is then true if its subject vector can be transformed to its object vector through the predicate vector. While prior research in this direction is restricted to pre-known predicates, extending it to unknown predicates is an interesting problem.
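As a toy illustration of this idea, a TransE-style check scores a candidate triple by how close the subject vector translated by the predicate vector is to the object vector; the vectors below are made-up placeholders, not learned embeddings.

    import math

    # Minimal TransE-style sketch: a triple (s, p, o) is plausible if the subject
    # vector translated by the predicate vector lands near the object vector.
    def triple_score(subj_vec, pred_vec, obj_vec):
        return -math.sqrt(sum((s + p - o) ** 2
                              for s, p, o in zip(subj_vec, pred_vec, obj_vec)))

    def is_plausible(subj_vec, pred_vec, obj_vec, threshold=-0.5):
        return triple_score(subj_vec, pred_vec, obj_vec) >= threshold

    if __name__ == "__main__":
        # Toy 3-dimensional embeddings (placeholders, not learned vectors).
        gouda, origin, netherlands = [0.1, 0.2, 0.0], [0.3, 0.1, 0.5], [0.4, 0.3, 0.5]
        print(is_plausible(gouda, origin, netherlands))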
Extractions of dates and numbers. In general we miss a lot of dates and numbers (often with units) because they can appear in many different formats, especially for number ranges. Sometimes the number is not exact (such as a population), which makes it hard to verify; this applies to one triple in our sample. Possible solutions include defining numerical value mapping rules and improving the extraction of numerical relations [27].

7.2 Improving precision

Matching predicates. Recall that when we look for evidence, we require a subject match and an object match, but are tolerant on predicate semantics; in fact, Web co-occurrences, our major evidence contributor, completely ignore predicates. This strategy significantly increases recall, but fails in two cases (the vertical Yoga has both cases, making the collected evidence very noisy):
1. When we have two predicates with related but different (or even contradictory) semantics, such as "major source" versus "other sources", or "friends" versus "foes", the objects of the two predicates are from the same domain, and are thus indistinguishable without considering the predicates.
2. When the subjects and objects of a predicate are in the same domain, such as the "follow-up posture" of a posture in exercises, Web co-occurrences may simply return all postures in a list.
We may improve precision by requiring the predicate to also co-occur with the subject and object in such special cases, or by using embeddings instead of lexical words to represent the predicates.

Removing noise from highly frequent names. Another place where we make mistakes is when an object name is very popular and thus ambiguous, so false triples may obtain more evidence due to the high frequency of the entity name. How to incorporate the DF (i.e., document frequency) in evidence collection is a topic of future research.
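One possible direction, sketched below under the assumption of an IDF-style discount, is to down-weight co-occurrence evidence for objects whose names are very frequent; this is a hypothetical weighting, not a method we evaluated.

    import math

    # Hypothetical sketch: discount evidence counts for objects with high document
    # frequency, so popular (ambiguous) names contribute less support per mention.
    def weighted_evidence(cooccurrence_count, object_df, num_docs):
        idf = math.log((1 + num_docs) / (1 + object_df))
        return cooccurrence_count * idf

    if __name__ == "__main__":
        # A rare object name gets more credit per co-occurrence than a frequent one.
        print(weighted_evidence(10, object_df=50, num_docs=1_000_000))       # rare name
        print(weighted_evidence(10, object_df=200_000, num_docs=1_000_000))  # frequent name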
8. RELATED WORK

Knowledge verification. To the best of our knowledge, T-VERIFIER [25] is the first system for knowledge verification, and we have reviewed it in Section 6. A recently proposed supervised approach, ClaimEval [33], aims to decide the correctness of a binary claim, such as "yoghurt is healthy food" or "SIGMOD is a top CS conference". It also finds evidence for each claim through web search; it then trains a classifier for each category of claims (e.g., classifiers for "healthy food" and "top CS conferences", respectively). This method works only for IS-A relationships, whereas we consider all possible relationships.

Tylenda et al. [36] study how to find evidence for a triple inside a given document, but focus on the cases where entities and predicates all exist in some knowledge base; in our datasets all predicates are new and no more than 10% of the subject entities exist in Freebase. Knowledge verification has also been studied in the NLP community with emphasis on linguistic features [44]. Our framework allows applying the extraction techniques proposed in [36, 44] whenever applicable.

Knowledge fusion. Besides the methods we have reviewed and evaluated in Section 4, a recent method, TEM [46], additionally considers whether the truth for a data item exists at all (e.g., date-of-death does not exist for a living person); it is mainly designed for the single-truth scenario. The work in [22] studies the case where most sources provide very few triples, and thus source quality cannot be reliably estimated. Our framework can be enhanced by these techniques.

Knowledge collection on the long tail. Our work is generally related to knowledge collection on the long tail. A concept expansion system [37] finds entities belonging to a tail vertical by leveraging both structured and unstructured signals in web tables. An open IE system called ReNoun [42] focuses on extracting nominal attributes on the long tail. The work in [3] answers queries regarding long-tail knowledge through log mining and crowdsourcing. Our work differs in that we consider the context of collecting knowledge by crowdsourcing, and focus on verifying the collected knowledge.

9. CONCLUSIONS

This paper studies the problem of verifying knowledge for long-tail verticals. We investigated seven evidence-collection approaches and four knowledge-fusion algorithms, and provided a detailed analysis of their performance on various long-tail verticals. Our experimental results inspire the FACTY knowledge-verification framework, which first finds supporting evidence for a triple from various sources, and then applies knowledge fusion to predict the correctness of the triple based on the evidence. We showed that our framework significantly outperforms existing knowledge-verification techniques. Finally, a detailed loss analysis suggests future directions to improve knowledge verification for tail verticals.

10. REFERENCES
[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a Web of Open Data. The Semantic Web, 2007.
[2] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open information extraction for the web. In IJCAI, 2007.
[3] M. S. Bernstein, J. Teevan, S. Dumais, D. Liebling, and E. Horvitz. Direct answers for search queries in the long tail. In CHI, 2012.
[4] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In SIGMOD, 2008.
[5] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. WebTables: Exploring the power of tables on the web. PVLDB, 2008.
[6] M. J. Cafarella, A. Y. Halevy, Y. Zhang, D. Z. Wang, and E. Wu. Uncovering the relational web. In WebDB, 2008.
[7] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
[8] M.-C. De Marneffe, B. MacCartney, C. D. Manning, et al. Generating typed dependency parses from phrase structure parses. In LREC, 2006.
[9] X. L. Dong. Leave no valuable data behind: The crazy ideas and the . PVLDB, 2016.
[10] X. L. Dong, L. Berti-Equille, and D. Srivastava. Integrating conflicting data: The role of source dependence. PVLDB, 2009.
[11] X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A web-scale approach to probabilistic knowledge fusion. In KDD, 2014.
[12] X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, K. Murphy, S. Sun, and W. Zhang. From data fusion to knowledge fusion. PVLDB, 2014.
[13] X. L. Dong, E. Gabrilovich, K. Murphy, V. Dang, W. Horn, C. Lugaresi, S. Sun, and W. Zhang. Knowledge-based trust: Estimating the trustworthiness of web sources. PVLDB, 2015.
[14] J. S. Downs, M. B. Holbrook, S. Sheng, and L. F. Cranor. Are your participants gaming the system? Screening Mechanical Turk workers. In CHI, 2010.
[15] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In WWW, 2004.
[16] A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. In EMNLP, 2011.
[17] J. Guo, G. Xu, X. Cheng, and H. Li. Named entity recognition in query. In SIGIR, 2009.
[18] A. Jain and M. Pennacchiotti. Open entity extraction from web search query logs. In COLING, 2010.
[19] H. Ji. Entity linking and wikification reading list, 2014. http://nlp.cs.rpi.edu/kbp/2014/elreading.html.
[20] Introducing the knowledge graph: Things, not strings. https://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html.
[21] F. Li, X. L. Dong, A. Langen, and Y. Li. Discovering multiple truths with a hybrid model. CoRR, abs/1705.04915, 2017.
[22] Q. Li, Y. Li, J. Gao, L. Su, B. Zhao, M. Demirbas, W. Fan, and J. Han. A confidence-aware approach for truth discovery on long-tail data. PVLDB, 2014.
[23] Q. Li, Y. Li, J. Gao, B. Zhao, W. Fan, and J. Han. Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation. In SIGMOD, 2014.
[24] X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava. Truth finding on the Deep Web: Is the problem solved? PVLDB, 2013.
[25] X. Li, W. Meng, and C. Yu. T-verifier: Verifying truthfulness of fact statements. In ICDE, 2011.
[26] Y. Li, J. Gao, C. Meng, Q. Li, L. Su, B. Zhao, W. Fan, and J. Han. A survey on truth discovery. SIGKDD Explorations Newsletter, 2016.
[27] A. Madaan, A. Mittal, Mausam, G. Ramakrishnan, and S. Sarawagi. Numerical relation extraction with minimal supervision. In AAAI, 2016.
[28] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[29] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In ACL, 2009.
[30] F. Niu, C. Zhang, C. Ré, and J. W. Shavlik. DeepDive: Web-scale knowledge-base construction using statistical learning and inference. VLDS, 2012.
[31] J. Pasternack and D. Roth. Latent credibility analysis. In WWW, 2013.
[32] R. Pochampally, A. Das Sarma, X. L. Dong, A. Meliou, and D. Srivastava. Fusing data with correlations. In SIGMOD, 2014.
[33] M. Samadi, P. Talukdar, M. Veloso, and M. Blum. ClaimEval: Integrated and flexible framework for claim evaluation using credibility of sources. In AAAI, 2016.
[34] Understand your world with Bing. https://blogs.bing.com/search/2013/03/21/understand-your-world-with-bing.
[35] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: A core of semantic knowledge. In WWW, 2007.
[36] T. Tylenda, Y. Wang, and G. Weikum. Spotting facts in the wild. In Workshop on Automatic Creation and Curation of Knowledge Bases, 2014.
[37] C. Wang, K. Chakrabarti, Y. He, K. Ganjam, Z. Chen, and P. A. Bernstein. Concept expansion using web tables. In WWW, 2015.
[38] X. Wang, Q. Z. Sheng, X. S. Fang, L. Yao, X. Xu, and X. Li. An integrated Bayesian approach for effective multi-truth discovery. In CIKM, 2015.
[39] Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph and text jointly embedding. In EMNLP, 2014.
[40] F. Wu, R. Hoffmann, and D. S. Weld. Information extraction from Wikipedia: Moving down the long tail. In KDD, 2008.
[41] W. Wu, H. Li, H. Wang, and K. Q. Zhu. Probase: A probabilistic taxonomy for text understanding. In SIGMOD, 2012.
[42] M. Yahya, S. E. Whang, R. Gupta, and A. Halevy. ReNoun: Fact extraction for nominal attributes. In EMNLP, 2014.
[43] X. Yin, J. Han, and P. S. Yu. Truth discovery with multiple conflicting information providers on the web. In KDD, 2007.
[44] D. Yu, H. Huang, T. Cassidy, H. Ji, C. Wang, S. Zhi, J. Han, C. Voss, and M. Magdon-Ismail. The wisdom of minority: Unsupervised slot filling validation based on multi-dimensional truth-finding. In COLING, 2014.
[45] B. Zhao, B. I. P. Rubinstein, J. Gemmell, and J. Han. A Bayesian approach to discovering truth from conflicting sources for data integration. PVLDB, 2012.
[46] S. Zhi, B. Zhao, W. Tong, J. Gao, D. Yu, H. Ji, and J. Han. Modeling truth existence in truth discovery. In KDD, 2015.
