Incorporating Trustiness and Collective Synonym/Contrastive Evidence Into Taxonomy Construction
Total Page:16
File Type:pdf, Size:1020Kb
Incorporating Trustiness and Collective Synonym/Contrastive Evidence into Taxonomy Construction #1 #2 3 Luu Anh Tuan , Jung-jae Kim , Ng See Kiong ∗ #School of Computer Engineering, Nanyang Technological University, Singapore [email protected], [email protected] ∗Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore [email protected] Abstract Previous works on the task of taxonomy con- struction capture information about potential tax- Taxonomy plays an important role in many onomic relations between concepts, rank the can- applications by organizing domain knowl- didate relations based on the captured information, edge into a hierarchy of is-a relations be- and integrate the highly ranked relations into a tax- tween terms. Previous works on the taxo- onomic structure. They utilize such information nomic relation identification from text cor- as hypernym patterns (e.g. A is a B, A such as pora lack in two aspects: 1) They do not B) (Kozareva et al., 2008), syntactic dependency consider the trustiness of individual source (Drumond and Girardi, 2010), definition sentences texts, which is important to filter out in- (Navigli et al., 2011), co-occurrence (Zhu et al., correct relations from unreliable sources. 2013), syntactic contextual similarity (Tuan et al., 2) They also do not consider collective 2014), and sibling relations (Bansal et al., 2014). evidence from synonyms and contrastive terms, where synonyms may provide ad- They, however, lack in the three following as- ditional supports to taxonomic relations, pects: 1) Trustiness: Not all sources are trust- while contrastive terms may contradict worthy (e.g. gossip, forum posts written by them. In this paper, we present a method non-experts) (Dong et al., 2015). The trustiness of taxonomic relation identification that of source texts is important in taxonomic rela- incorporates the trustiness of source texts tion identification because evidence from unre- measured with such techniques as PageR- liable sources can be incorrect. For example, ank and knowledge-based trust, and the the invalid taxonomic relation between “American collective evidence of synonyms and con- chameleon” and “chameleon” is mistakenly more trastive terms identified by linguistic pat- popular in the Web than the valid taxonomic rela- tern matching and machine learning. The tion between “American chameleon” and “lizard”, experimental results show that the pro- and statistical methods without considering the posed features can consistently improve trustiness may incorrectly extract the invalid rela- performance up to 4%-10% of F-measure. tion instead of the latter. However, to the best of our knowledge, no previous work considered this 1 Introduction aspect. Taxonomies which serve as backbone of struc- 2) Synonyms: A concept may be expressed in tured knowledge are useful for many applica- multiple ways, for example with synonyms. The tions such as question answering (Harabagiu et previous works mostly assumed that a term repre- al., 2003) and document clustering (Fodeh et al., sents an independent concept, and did not combine 2011). Even though there are many hand-crafted, information about a concept, which is expressed well-structured taxonomies publicly available, in- with multiple synonyms. The lack of evidence cluding WordNet (Miller, 1995), OpenCyc (Ma- from synonyms may hamper the ranking of can- tuszek et al., 2006), and Freebase (Bollacker et didate taxonomic relations. Navigli and Velardi al., 2008), they are incomplete in specific domains, (2004) combined synonyms into a concept, but and it is time-consuming to manually extend them only for those from WordNet, called synsets. or create new ones. There is thus a need for auto- 3) Contrastive terms: We observe that if two matically extracting taxonomic relations from text terms are often contrasted (e.g. A but not B, A corpora to construct/extend taxonomies. is different from B) (Kim et al., 2006), they may 1013 Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1013–1022, Lisbon, Portugal, 17-21 September 2015. c 2015 Association for Computational Linguistics. not have a taxonomic relation. tactic contextual similarity (Tuan et al., 2014), and In this paper, we present a method based on the word embedding (Fu et al., 2014). The main idea state-of-the-art method (Tuan et al., 2014), which behind these techniques is that the terms that are incorporates the trustiness of source texts and asymmetrically similar to each other with regard the collective evidence from synonyms/contrastive to, for example, co-occurrences, syntactic con- terms. Tuan et al. (2014) rank candidate tax- texts, and latent vector representation may have onomic relations based on such evidence as hy- taxonomic relationships. Such methods, however, pernym patterns, WordNet, and syntactic contex- usually suffer from low accuracy, though show- tual similarity, where the pattern matches and the ing relatively high coverage. To get the balance syntactic contexts are found from the Web by us- between the two approaches, Yang and Callan ing a Web search engine. First, we calculate the (2009), Zhu et al. (2013) and Tuan et al. (2014) trustiness score of each data source with the four combine both statistical and linguistic features in following weights: importance (if it is linked by the process of finding taxonomic relations. many pages), popularity (if it is visited by many Most of these previous methods do not consider users), authority (if it is from a creditable Web site) if the source text of evidence (e.g. co-occurrences, and accuracy (if it has many facts). We integrate pattern matches) is trustworthy or not and do not this score to the work of (Tuan et al., 2014) so that combine evidence from synonyms and contrastive evidence for taxonomic relations from unreliable terms as discussed earlier. Related to synonyms, a sources is discarded. few previous works utilize siblings for taxonomy Second, we identify synonyms of two terms (t1, construction. Yang and Callan (2009) use siblings t2), whose taxonomic relation is being scrutinized, as one of the features in the metric-based frame- by matching such queries as “t1 also known as” work which incrementally clusters terms to form against the Web to find t1’s synonyms next to the taxonomies. Wentao et al. (2012) also utilize such query matches (e.g. t in “t1 also known as t”). We sibling feature that, for example of the linguistic then collect the evidence for all taxonomic rela- pattern “A such as B1, B2, and Bn”, if the ··· tions between t10 and t20 , where ti0 is either ti or its concept at the k-th position (e.g. Bk) from pat- synonym (i 1, 2 ), and combine them to calcu- tern keywords (e.g. such as) is a valid sub-concept ∈ { } late the evidence score of the candidate taxonomic (e.g. of A), then most likely its siblings from po- relation between t1 and t2. Similarly, for each pair sition 1 to position k-1 (e.g. B1, , Bk 1) are ··· − of two terms (t1, t2), we collect their contrastive also valid sub-concepts. Bansal et al. (2014) in- evidence by matching such queries as “t1 is not a clude the sibling factors to a structured probabilis- type of t2” against the Web, and use them to pro- tic model over the full space of taxonomy trees, portionally decrease the evidence score for taxo- thus helping to add more evidence to taxonomic nomic relation between contrasting terms. relations. Navigli and Velardi (2004) utilize the synonym feature (i.e. WordNet synsets) for the 2 Related Work process of semantic disambiguation and concept clustering as mentioned above, but not for the pro- The previous methods for identifying taxonomic cess of inducing novel taxonomic relations. relations from text can be generally classified into two categories: linguistic and statistical ap- 3 Methodology proaches. The former approach mostly exploits lexical-syntactic patterns (e.g. A is a B, A such We briefly introduce (Tuan et al., 2014) in Section as B) (Hearst, 1992). Those patterns can be man- 3.1. We then explain how to incorporate trustiness ually created (Kozareva et al., 2008; Wentao et al., (Section 3.2) and collective evidence from syn- 2012) or automatically identified (Navigli et al., onyms (Section 3.3) and from contrastive terms 2011; Bansal et al., 2014). The pattern matching (Section 3.4) into the work of (Tuan et al., 2014). methods show high precision when the patterns are carefully defined, but low coverage due to the 3.1 Overview of baseline (Tuan et al., 2014) lack of contextual analysis across sentences. Tuan et al. (2014) follow three steps to construct a The latter approach, on the other hand, includes taxonomy: term extraction/filtering, taxonomic re- asymmetrical term co-occurrence (Fotzo and Gal- lation identification and taxonomy induction. Be- linari, 2004), clustering (Wong et al., 2007), syn- cause the focus of this paper is on the second step, 1014 taxonomic relation identification, we use the same 3.2 Trustiness of the evidence data methods for the first and third steps as in (Tuan et We introduce our method of estimating the trusti- al., 2014) and will not discuss them here. ness of a given source text in Section 3.2.1 and ex- Given each ordered pair of two terms t1 and t2 plain how to incorporate it into the work of (Tuan from the term extraction/filtering, the taxonomic et al., 2014) in Section 3.2.2. relation identification of (Tuan et al., 2014) calcu- lates the evidence score that t1 is a hypernym of t2 3.2.1 Collecting trustiness score of the (denoted as t t ) based on the following three evidence data 1 2 measures: Given a data source (e.g. Web page), we consider four aspects of trustiness as follows: String inclusion with WordNet (SIWN): This Importance: A data source may be important • measure is to check if t1 is a substring of t2, con- if it is referenced by many other sources.