
Incorporating Trustiness and Collective Synonym/Contrastive Evidence into Taxonomy Construction

Luu Anh Tuan#, Jung-jae Kim#, Ng See Kiong∗
#School of Computer Engineering, Nanyang Technological University, Singapore
[email protected], [email protected]

∗Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore [email protected]

Abstract

Taxonomy plays an important role in many applications by organizing domain knowledge into a hierarchy of is-a relations between terms. Previous works on taxonomic relation identification from text corpora lack in two aspects: 1) They do not consider the trustiness of individual source texts, which is important to filter out incorrect relations from unreliable sources. 2) They also do not consider collective evidence from synonyms and contrastive terms, where synonyms may provide additional support to taxonomic relations, while contrastive terms may contradict them. In this paper, we present a method of taxonomic relation identification that incorporates the trustiness of source texts measured with such techniques as PageRank and knowledge-based trust, and the collective evidence of synonyms and contrastive terms identified by linguistic pattern matching and machine learning. The experimental results show that the proposed features can consistently improve performance by up to 4%-10% of F-measure.

1 Introduction

Taxonomies, which serve as the backbone of structured knowledge, are useful for many applications such as question answering (Harabagiu et al., 2003) and document clustering (Fodeh et al., 2011). Even though there are many hand-crafted, well-structured taxonomies publicly available, including WordNet (Miller, 1995), OpenCyc (Matuszek et al., 2006), and Freebase (Bollacker et al., 2008), they are incomplete in specific domains, and it is time-consuming to manually extend them or create new ones. There is thus a need for automatically extracting taxonomic relations from text corpora to construct/extend taxonomies.

Previous works on the task of taxonomy construction capture information about potential taxonomic relations between concepts, rank the candidate relations based on the captured information, and integrate the highly ranked relations into a taxonomic structure. They utilize such information as hypernym patterns (e.g. A is a B, A such as B) (Kozareva et al., 2008), syntactic dependency (Drumond and Girardi, 2010), definition sentences (Navigli et al., 2011), co-occurrence (Zhu et al., 2013), syntactic contextual similarity (Tuan et al., 2014), and sibling relations (Bansal et al., 2014).

They, however, lack in the three following aspects: 1) Trustiness: Not all sources are trustworthy (e.g. gossip, forum posts written by non-experts) (Dong et al., 2015). The trustiness of source texts is important in taxonomic relation identification because evidence from unreliable sources can be incorrect. For example, the invalid taxonomic relation between "American chameleon" and "chameleon" is mistakenly more popular in the Web than the valid taxonomic relation between "American chameleon" and "lizard", and statistical methods that do not consider trustiness may incorrectly extract the invalid relation instead of the latter. However, to the best of our knowledge, no previous work considered this aspect.

2) Synonyms: A concept may be expressed in multiple ways, for example with synonyms. The previous works mostly assumed that a term represents an independent concept, and did not combine information about a concept that is expressed with multiple synonyms. The lack of evidence from synonyms may hamper the ranking of candidate taxonomic relations. Navigli and Velardi (2004) combined synonyms into a concept, but only for those from WordNet, called synsets.

3) Contrastive terms: We observe that if two terms are often contrasted (e.g. A but not B, A is different from B) (Kim et al., 2006), they may not have a taxonomic relation.

In this paper, we present a method based on the state-of-the-art method (Tuan et al., 2014), which incorporates the trustiness of source texts and the collective evidence from synonyms/contrastive terms. Tuan et al. (2014) rank candidate taxonomic relations based on such evidence as hypernym patterns, WordNet, and syntactic contextual similarity, where the pattern matches and the syntactic contexts are found from the Web by using a Web search engine. First, we calculate the trustiness score of each data source with the four following weights: importance (if it is linked by many pages), popularity (if it is visited by many users), authority (if it is from a creditable Web site) and accuracy (if it has many facts). We integrate this score into the work of (Tuan et al., 2014) so that evidence for taxonomic relations from unreliable sources is discarded.

Second, we identify synonyms of two terms (t1, t2), whose taxonomic relation is being scrutinized, by matching such queries as "t1 also known as" against the Web to find t1's synonyms next to the query matches (e.g. t in "t1 also known as t"). We then collect the evidence for all taxonomic relations between t1′ and t2′, where ti′ is either ti or its synonym (i ∈ {1, 2}), and combine them to calculate the evidence score of the candidate taxonomic relation between t1 and t2. Similarly, for each pair of two terms (t1, t2), we collect their contrastive evidence by matching such queries as "t1 is not a kind of t2" against the Web, and use it to proportionally decrease the evidence score for the taxonomic relation between contrasting terms.

2 Related Work

The previous methods for identifying taxonomic relations from text can be generally classified into two categories: linguistic and statistical approaches. The former approach mostly exploits lexical-syntactic patterns (e.g. A is a B, A such as B) (Hearst, 1992). Those patterns can be manually created (Kozareva et al., 2008; Wentao et al., 2012) or automatically identified (Navigli et al., 2011; Bansal et al., 2014). The pattern matching methods show high precision when the patterns are carefully defined, but low coverage due to the lack of contextual analysis across sentences.

The latter approach, on the other hand, includes asymmetrical term co-occurrence (Fotzo and Gallinari, 2004), clustering (Wong et al., 2007), syntactic contextual similarity (Tuan et al., 2014), and word embedding (Fu et al., 2014). The main idea behind these techniques is that terms which are asymmetrically similar to each other with regard to, for example, co-occurrences, syntactic contexts, and latent vector representations may have taxonomic relationships. Such methods, however, usually suffer from low accuracy, though showing relatively high coverage. To get the balance between the two approaches, Yang and Callan (2009), Zhu et al. (2013) and Tuan et al. (2014) combine both statistical and linguistic features in the process of finding taxonomic relations.

Most of these previous methods do not consider whether the source text of evidence (e.g. co-occurrences, pattern matches) is trustworthy or not, and do not combine evidence from synonyms and contrastive terms as discussed earlier. Related to synonyms, a few previous works utilize siblings for taxonomy construction. Yang and Callan (2009) use siblings as one of the features in the metric-based framework which incrementally clusters terms to form taxonomies. Wentao et al. (2012) also utilize such a sibling feature: for example, given the linguistic pattern "A such as B1, B2, ···, and Bn", if the concept at the k-th position (e.g. Bk) from the pattern keywords (e.g. such as) is a valid sub-concept (e.g. of A), then most likely its siblings from position 1 to position k-1 (e.g. B1, ···, Bk−1) are also valid sub-concepts. Bansal et al. (2014) include the sibling factors in a structured probabilistic model over the full space of taxonomy trees, thus helping to add more evidence to taxonomic relations. Navigli and Velardi (2004) utilize the synonym feature (i.e. WordNet synsets) for the process of semantic disambiguation and concept clustering as mentioned above, but not for the process of inducing novel taxonomic relations.

3 Methodology

We briefly introduce (Tuan et al., 2014) in Section 3.1. We then explain how to incorporate trustiness (Section 3.2) and collective evidence from synonyms (Section 3.3) and from contrastive terms (Section 3.4) into the work of (Tuan et al., 2014).

3.1 Overview of baseline (Tuan et al., 2014)

Tuan et al. (2014) follow three steps to construct a taxonomy: term extraction/filtering, taxonomic relation identification and taxonomy induction. Because the focus of this paper is on the second step, taxonomic relation identification, we use the same methods for the first and third steps as in (Tuan et al., 2014) and will not discuss them here.

Given each ordered pair of two terms t1 and t2 from the term extraction/filtering, the taxonomic relation identification of (Tuan et al., 2014) calculates the evidence score that t1 is a hypernym of t2 (denoted as t1 ≻ t2) based on the following three measures:

String inclusion with WordNet (SIWN): This measure is to check if t1 is a substring of t2, considering synonymy between words using WordNet synsets. Score_SIWN(t1, t2) is set to 1 if there is such evidence; otherwise, it is set to 0.

Lexical-syntactic pattern (LSP): This measure is to find how much more evidence for t1 ≻ t2 is found in the Web than for t2 ≻ t1. Specifically, a list of manually constructed hypernym patterns (e.g. "t2 is a t1") is queried with a Web search engine to estimate the amount of evidence for t1 ≻ t2 from the Web. The LSP measure is calculated as follows, where C_Web(t1, t2) denotes the collection of search results:

    Score_LSP(t1, t2) = log(|C_Web(t1, t2)|) / (1 + log(|C_Web(t2, t1)|))

Syntactic contextual subsumption (SCS): The idea of this measure is to derive the hypernymy evidence for two terms t1 and t2 from their syntactic contexts, particularly from the triples of (subject, verb, object). They observe that if the context set of t1 mostly contains that of t2 but not vice versa, then t1 is likely to be a hypernym of t2. To implement this idea, they first find the most common relation (or verb) r between t1 and t2, and use the queries "t1 r" and "t2 r" to construct two contextual corpora Corpus^Γ_t1 and Corpus^Γ_t2 for t1 and t2, respectively. Then the syntactic context sets are created from these contextual corpora using a non-taxonomic relation identification method. The details of calculating Score_SCS(t1, t2) can be found in (Tuan et al., 2014).

They linearly combine the three scores as follows:

    Score(t1, t2) = α × Score_SIWN(t1, t2) + β × Score_LSP(t1, t2) + γ × Score_SCS(t1, t2)    (1)

If Score(t1, t2) is greater than a threshold value, then t1 is regarded as a hypernym of t2. We use the same values of α, β and γ as in (Tuan et al., 2014).
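
To make the baseline scoring concrete, the following is a minimal Python sketch of the LSP measure and Formula 1. The hit counts and the combination weights are illustrative assumptions, not values from (Tuan et al., 2014); a real implementation would obtain the counts from a Web search engine.

```python
import math

def score_lsp(hits_12: int, hits_21: int) -> float:
    """LSP measure: log ratio of hypernym-pattern hits for t1 > t2 vs. t2 > t1.
    The +1 guards against log(0); the paper's formula uses the raw counts."""
    return math.log(hits_12 + 1) / (1.0 + math.log(hits_21 + 1))

def baseline_score(siwn: float, lsp: float, scs: float,
                   alpha: float = 0.4, beta: float = 0.3, gamma: float = 0.3) -> float:
    """Formula 1: linear combination of the SIWN, LSP and SCS evidence scores.
    The weights here are placeholders; the paper reuses those of Tuan et al. (2014)."""
    return alpha * siwn + beta * lsp + gamma * scs

# Hypothetical hit counts for a candidate pair such as ("lizard", "American chameleon").
print(baseline_score(siwn=0.0, lsp=score_lsp(1200, 40), scs=0.5))
```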

3.2 Trustiness of the evidence data

We introduce our method of estimating the trustiness of a given source text in Section 3.2.1 and explain how to incorporate it into the work of (Tuan et al., 2014) in Section 3.2.2.

3.2.1 Collecting trustiness score of the evidence data

Given a data source (e.g. a Web page), we consider four aspects of trustiness as follows:

• Importance: A data source may be important if it is referenced by many other sources.
• Popularity: If a data source is accessed by many people, it is considered popular.
• Authority: If the data is created by a trusted agency, such as a government or education institute, it may be more trustful than data from less trusted sources such as forums and blogs.
• Accuracy: If the data contains many pieces of accurate information, it seems trustful.

Importance

To measure the importance of a Web page d as a data source, we use the Google PageRank score¹ (Score_PageRank(d)), which is calculated based on the number and quality of links to the page. The PageRank scores have a scale from 0 to 9, where a bigger score means more importance than a lower one. Using this score, the importance of a page is calculated as follows:

    Trust_Imp(d) = 1 / (10 − Score_PageRank(d))    (2)

Note that we use this non-linearity for the PageRank score rather than just normalizing PageRank to 0-1. The reason is to widen the gaps between the important sites (which usually have a PageRank score value from 7-10) and the majority of unimportant sites (which usually have a PageRank score value less than 5).

1 http://searchengineland.com/what-is-google-pagerank-a-guide-for-searchers-webmasters-11068
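
A small sketch of Formula 2, contrasted with a plain 0-1 normalization to illustrate the non-linearity argument above; the comparison values are only illustrative.

```python
def trust_imp(pagerank: int) -> float:
    """Formula 2: non-linear importance score from a PageRank value in 0..9."""
    return 1.0 / (10 - pagerank)

def trust_imp_linear(pagerank: int) -> float:
    """Plain normalization to 0-1, shown only for comparison."""
    return pagerank / 9.0

for pr in (3, 5, 7, 9):
    print(pr, round(trust_imp(pr), 3), round(trust_imp_linear(pr), 3))
# The non-linear score grows from 0.143 (PR 3) to 1.0 (PR 9), stretching the gap
# between ordinary sites and highly linked ones far more than the linear variant does.
```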

Popularity

We use Alexa's Traffic Rank² as the measure of popularity (Score_Alexa(d)), which is based on the traffic data provided by users in Alexa's global data panel over a rolling 3-month period. The Traffic Ranks are updated daily. A site's rank is based on a combined measure of unique visitors and page views. Using this rank, the popularity of a data source is calculated as follows:

    Trust_Pop(d) = 1 / log(Score_Alexa(d) + 1)    (3)

We use a log transform in the popularity score instead of, for example, linear scoring because we want to avoid the bias of the very large gaps between the Alexa scores of different sites (e.g. one site may have Alexa score 1,000, but another may have score 100,000).

2 http://www.alexa.com/

Authority

We rank the authority of a data source based on its internet top-level domain (TLD). We observe that pages with limited and registered TLDs (e.g. .gov, .mil, .edu) are often more credible than those with open domains (e.g. .com, .net). Therefore, the authority score of a data source is calculated as follows:

    Trust_Auth(d) = 1 if the TLD of d is .gov, .mil or .edu; 0 otherwise    (4)

Note that there are some reasons we choose such an elementary implementation of Authority. First, we tried a finer categorization of various domains, e.g. .int has score 1/3, .com has score 1/4, etc. However, the experimental results did not show much change in performance. In addition, there is controversy on which open TLD domains are more trustful than the others, e.g. it is difficult to judge whether a .net site is more trustful than .org or not. Thus, we let all open TLD domains have the same score.

Accuracy

If the data source contains many pieces of accurate information, it is trustful. Inspired by the idea of Dong et al. (2015), we estimate the accuracy of a data source by identifying correct and incorrect information in the form of triples (Subject, Predicate, Object) in the source, where Subject, Predicate and Object are normalized with regard to the knowledge base Freebase. The extraction of the triples includes six tasks: named entity recognition, part-of-speech tagging, dependency parsing, triple extraction, entity linkage (which maps mentions of proper nouns and their co-references to the corresponding entities in Freebase) and relation linkage. We use three information extraction (IE) tools (Angeli et al. (2014), Manning et al. (2014), MITIE³) for the first four tasks, and develop a method similar to Hachey et al. (2013) for the last two tasks of entity linkage and relation linkage.

3 https://github.com/mit-nlp/MITIE

Since the IE tools may produce noisy or unreliable triples, we use a voting scheme for triple extraction as follows: a triple is only considered to be true if it is extracted by at least two extractors. After obtaining all triples in the data source, we use the closed world assumption as follows: Given subject s and predicate p, O(s, p) denotes the set of such objects that a triple (s, p, o) is found in Freebase. Now given a triple (s, p, o) found in the data source, if o ∈ O(s, p), we conclude that the triple is correct; but if o ∉ O(s, p) and |O(s, p)| > 0, we conclude that the triple is incorrect. If |O(s, p)| = 0, we do not conclude anything about the triple, and the triple is removed from the set of facts found in the data source.

Given a data source d, we define cf(d) as the number of correct facts, and icf(d) as the number of incorrect facts found in d. The accuracy of d is calculated as follows:

    Trust_Accu(d) = 1 / (1 + icf(d)²) − 1 / (1 + cf(d)²)    (5)
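
The fact-checking step above can be sketched as follows. The `freebase_objects` lookup table and the extractor outputs are placeholders for illustration; the real system normalizes triples against Freebase with the IE pipeline described in the text.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def vote_triples(extractor_outputs: Iterable[Iterable[Triple]], min_votes: int = 2) -> Set[Triple]:
    """Keep only triples produced by at least `min_votes` of the IE extractors."""
    counts = Counter(t for output in extractor_outputs for t in set(output))
    return {t for t, c in counts.items() if c >= min_votes}

def trust_accu(triples: Set[Triple], freebase_objects: Dict[Tuple[str, str], Set[str]]) -> float:
    """Closed-world accuracy check (Formula 5).
    freebase_objects maps (subject, predicate) to the set O(s, p) of known objects."""
    cf = icf = 0
    for s, p, o in triples:
        known = freebase_objects.get((s, p), set())
        if not known:      # |O(s, p)| = 0: no conclusion, the triple is dropped
            continue
        if o in known:
            cf += 1        # correct fact
        else:
            icf += 1       # contradicts the closed-world assumption
    return 1.0 / (1 + icf ** 2) - 1.0 / (1 + cf ** 2)
```
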
Combining trustiness scores

The final trustiness score of a data source is the linear combination of the four scores as follows:

    Trust(d) = α × Trust_Imp(d) + β × Trust_Pop(d) + γ × Trust_Auth(d) + δ × Trust_Accu(d)    (6)

To estimate the optimal combination of the parameters α, β, γ and δ, we apply a linear regression algorithm (Hastie et al., 2009). For parameter learning, we manually list 50 websites as trusted sources (e.g. stanford.edu, bbc.com, nasa.gov), and the top 15 gossip websites listed in a site⁴ as untrusted sources. Then we use the scores of their individual pages by the four methods to learn the parameters in Formula 6. The learning results are as follows: α=0.46, β=0.46, γ=2.03, δ=0.61.

4 http://www.ebizmba.com/articles/gossip-websites

3.2.2 Integrating trustiness into taxonomic relation identification methods

Given a data collection C, we define:

    AvgTrust(C) = Σ_{d ∈ C} Trust(d) / |C|

as the average trustiness score of all data in C.
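
A minimal sketch of Formulas 3, 4 and 6 together with AvgTrust, using the learned weights reported above. The `imp` and `accu` inputs are assumed to be precomputed with the earlier sketches of Formulas 2 and 5.

```python
import math
from typing import List

def trust_pop(alexa_rank: int) -> float:
    """Formula 3: popularity from the Alexa Traffic Rank."""
    return 1.0 / math.log(alexa_rank + 1)

def trust_auth(tld: str) -> float:
    """Formula 4: authority from the top-level domain."""
    return 1.0 if tld in {"gov", "mil", "edu"} else 0.0

def trust(imp: float, pop: float, auth: float, accu: float,
          alpha: float = 0.46, beta: float = 0.46,
          gamma: float = 2.03, delta: float = 0.61) -> float:
    """Formula 6: linear combination of the four trustiness aspects."""
    return alpha * imp + beta * pop + gamma * auth + delta * accu

def avg_trust(trust_scores: List[float]) -> float:
    """AvgTrust(C): average trustiness of the data sources in a collection C,
    given their precomputed Trust(d) scores."""
    return sum(trust_scores) / len(trust_scores) if trust_scores else 0.0
```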

We integrate the trustiness score into the three taxonomic relation identification methods described in Section 3.1 as follows:

LSP method: The LSP evidence score of the taxonomic relation between t1 and t2 is recalculated as follows:

    Score^Trust_LSP(t1, t2) = Score_LSP(t1, t2) × (AvgTrust(C_Web(t1, t2)) + AvgTrust(C_Web(t2, t1)))    (7)

The intuition of Formula 7 is that the original LSP evidence score is multiplied with the average trustiness score of all evidence documents for the taxonomic relation from the Web. If the number of Web search results is too large, we use only the first 1,000 results to estimate the average trustiness score.

SCS method: Similarly, the SCS evidence score is recalculated as follows:

    Score^Trust_SCS(t1, t2) = Score_SCS(t1, t2) × (AvgTrust(Corpus^Γ_t1) + AvgTrust(Corpus^Γ_t2))    (8)

SIWN method: This method does not use any evidence from the Web, and so its measure does not change:

    Score^Trust_SIWN(t1, t2) = Score_SIWN(t1, t2)    (9)

The three trustiness-weighted measures are also linearly combined as follows:

    Score^Trust(t1, t2) = α × Score^Trust_SIWN(t1, t2) + β × Score^Trust_LSP(t1, t2) + γ × Score^Trust_SCS(t1, t2)    (10)

The values of α, β, and γ in Formula 10 are identical to those for Formula 1.
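
A compact sketch of Formulas 7-10. The avg_trust_* arguments stand for AvgTrust values computed over the evidence documents (only the first 1,000 Web results per direction), as in the previous sketch; the weights are those of Formula 1.

```python
def score_trust(siwn: float, lsp: float, scs: float,
                avg_trust_web_12: float, avg_trust_web_21: float,
                avg_trust_corpus_t1: float, avg_trust_corpus_t2: float,
                alpha: float, beta: float, gamma: float) -> float:
    """Trustiness-weighted evidence score for a candidate pair (t1, t2)."""
    lsp_trust = lsp * (avg_trust_web_12 + avg_trust_web_21)          # Formula 7
    scs_trust = scs * (avg_trust_corpus_t1 + avg_trust_corpus_t2)    # Formula 8
    # Formula 9 keeps SIWN unchanged; Formula 10 recombines the three measures.
    return alpha * siwn + beta * lsp_trust + gamma * scs_trust
```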

3.3 Collective synonym evidence

3.3.1 Synonymy identification

We use the three following methods to collect synonyms: dictionaries, pattern matching, and supervised learning.

Dictionaries: Synonyms can be found in dictionaries like the general-purpose dictionary WordNet and also domain-specific ones. Since our domains of interest include virus, animal and plant (see the next section for details), we also utilize MeSH⁵, a well-known vocabulary in biomedicine.

5 http://www.ncbi.nlm.nih.gov/mesh

Pattern matching: Given two terms t1 and t2, we use the following patterns to find their synonymy evidence from the Web:

• t1 also [known|called|named|abbreviated] as t2
• Other common name[s] of t1 [is|are|include] t2
• t1, or t2, is a
• t1 (short for t2)

, where [a|b] denotes a choice between a and b. If the number of Web search results is greater than a threshold Ψ, t1 is considered as a synonym of t2.

Supervised learning: We randomly pick 100 pairs of synonyms in WordNet, and for each pair, we use the Web search engine to collect sample sentences in which both terms of the pair are mentioned. If the number of collected sentences is greater than 2,000, we use only the first 2,000 sentences for training. After that, we extract the following features from the sentences to train a logistic regression model (Hastie et al., 2009) for synonymy identification:

• Headwords of the two terms
• Average distance between the terms
• Sequence of words between the terms
• Bag of words between the terms
• Dependency path between the terms (using the Stanford parser (Klein and Manning, 2003))
• Bag of words on the dependency path

The average F-measure of the obtained model with 10-fold cross-validation is 81%. We use the learned model to identify more synonym pairs in the next step.

3.3.2 Embedding synonym information

Given a term t, we denote Syn(t) as the set of synonyms of t (including t itself). The evidence scores of the SCS and LSP methods are recalculated with synonyms as follows:

    Score^Synonym_X(t1, t2) = Σ_{t1′ ∈ Syn(t1)} Σ_{t2′ ∈ Syn(t2)} Score_X(t1′, t2′)    (11)

, where the variable X can be replaced with SCS and LSP.

The intuition of Formula 11 is that the evidence score of the taxonomic relation between two terms t1 and t2 can be boosted by adding all the evidence scores of taxonomic relations between them and their synonyms.
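
A minimal sketch of the synonym expansion and Formula 11. The `dictionary_syns`, `pattern_hits`, `synonyms` and `score_x` arguments are placeholders for the dictionary/pattern/classifier pipeline and for the LSP or SCS scorer described above.

```python
from typing import Callable, Dict, Set

def collect_synonyms(term: str, dictionary_syns: Set[str],
                     pattern_hits: Dict[str, int], psi: int = 100) -> Set[str]:
    """Syn(t): the term itself, its dictionary synonyms, and pattern-matched
    candidates whose Web hit count exceeds the threshold Psi."""
    syns = {term} | dictionary_syns
    syns |= {cand for cand, hits in pattern_hits.items() if hits > psi}
    return syns

def score_synonym(t1: str, t2: str, synonyms: Callable[[str], Set[str]],
                  score_x: Callable[[str, str], float]) -> float:
    """Formula 11: sum the LSP or SCS evidence over all synonym pairs of (t1, t2)."""
    return sum(score_x(a, b) for a in synonyms(t1) for b in synonyms(t2))
```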

Again, as for the SIWN method, we do not change the evidence score:

    Score^Synonym_SIWN(t1, t2) = Score_SIWN(t1, t2)

3.4 Contrastive evidence

Given two terms t1 and t2, we use the following patterns to find their contrastive (thus negative) evidence from the Web:

• t1 is not a t2
• t1 is not a [type|kind] of t2
• t1, unlike the t2
• t1 is different [from|with] t2
• t1 but not t2
• t1, not t2

WH(t1, t2) denotes the total number of Web search results, and the contrastive evidence score between t1 and t2 is computed as follows:

    Contrast(t1, t2) = log(WH(t1, t2) + 1)    (12)

Similar to the collective synonym evidence, the contrastive evidence score of the taxonomic relation between t1 and t2 is boosted with the contrastive evidence scores of taxonomic relations between the two terms and their synonyms as follows:

    Score^Contrast(t1, t2) = Σ_{t1′ ∈ Syn(t1)} Σ_{t2′ ∈ Syn(t2)} Contrast(t1′, t2′) / (|Syn(t1)| × |Syn(t2)|)    (13)

3.5 Combining trustiness, synonym and contrastive evidence

We combine all the three features into the system of (Tuan et al., 2014) as follows:

    Score^Final_X(t1, t2) = Score^Trust_X(t1, t2) + Score^Synonym_X(t1, t2)    (14)

, where the variable X can be replaced with each of the three taxonomic relation evidence measures (i.e. SCS, LSP, SIWN). The final combined score is calculated as follows:

    Score^Final_Combined(t1, t2) = α × Score^Final_SIWN(t1, t2) + β × Score^Final_LSP(t1, t2) + γ × Score^Final_SCS(t1, t2) − δ × Score^Contrast(t1, t2)    (15)

For each ordered pair of terms t1 and t2, if Score^Final_Combined(t1, t2) is greater than a threshold value, then t1 is considered as a hypernym of t2.

We estimate the optimal values of the parameters α, β, γ and δ in Formula 15 with the ridge regression technique (Hastie et al., 2009) as follows: First, we randomly select 100 taxonomic relations in the Animal domain as the training set. For each taxonomic relation t1 ≻ t2, its evidence score is estimated as τ + 1/Dist(t1, t2), where τ is the threshold value for Score^Final_Combined, and Dist(t1, t2) is the length of the shortest path between t1 and t2 found in WordNet. Then we use our system to find the evidence scores with the taxonomic relation identification methods in Formulas 13 and 14. Finally, we build the training set using Formula 15, and use ridge regression to learn that the best value for α is 1.31, β 1.57, γ 1.24 and δ 0.79, where τ=2.3.

4 Experiment

4.1 Datasets

We evaluate our method for taxonomy construction against the following collections of six domains:

• Artificial Intelligence (AI) domain: The corpus consists of 4,976 papers extracted from the IJCAI proceedings from 1969 to 2014 and the ACL archives from 1979 to 2014.
• Finance domain: The corpus consists of 1,253 papers extracted from the freely available collection of the "Journal of Financial Economics" from 1995 to 2012 and from the "Review of Finance" from 1997 to 2012.
• Virus domain: We submit the query "virus" to the PUBMED search engine⁶ and retrieve the first 20,000 abstracts as the corpus of the virus domain.
• Animals, Plants and Vehicles domains: Collections of Web pages crawled by using the bootstrapping algorithm described by Kozareva et al. (2008).

6 http://www.ncbi.nlm.nih.gov/pubmed

We report the results of two experiments in this section: (1) evaluating the construction of new taxonomies for the Finance and AI domains (Section 4.2), and (2) comparing with the curated databases' sub-hierarchies.

We compare our approach with three other state-of-the-art methods in the literature, i.e. (Kozareva and Hovy, 2010), (Navigli et al., 2011) and (Tuan et al., 2014) (Section 4.3). In addition, for the Animal domain, we also compare with the reported performance of Bansal et al. (2014), a recent work that constructs taxonomies using belief propagation.

4.2 Constructing new taxonomies for Finance and AI domains

Referential taxonomy structures such as WordNet and OpenCyc are widely used in semantic analytics applications. However, their coverage is limited to common, well-known areas, and many specific domains like Finance and AI are not well covered in those structures. Therefore, an automatic method which can induce taxonomies for those specific domains can greatly contribute to the process of knowledge discovery.

To estimate the precision of a given method, we randomly choose 100 relations among the results of the method and manually check their correctness. The results summarized in Table 1 show that our method extracts many more relations, though with slightly lower precision, than Kozareva et al. (2008) and Navigli and Velardi (2004). Note that due to the lack of gold standards in these two domains, we do not compare the methods in terms of F-score, which we will measure with curated databases in the next section. Compared to Tuan et al. (2014), which can be considered as the baseline of our approach, our method has significant improvement in both precision and the number of extracted relations. It indicates that the three incorporated features of trustiness, and synonym and contrastive evidence are effective in improving the performance of existing taxonomy construction methods.

                Finance          AI
                P      N         P      N
  Kozareva      90%    753       94%    950
  Navigli       88%    1161      93%    1711
  Tuan          85%    1312      90%    1927
  Our method    88%    1570      92%    2168

Table 1: Experiment result for the Finance and AI domains. P stands for Precision, and N indicates the number of extracted relations.

4.3 Evaluation against curated databases

We evaluate the automatically constructed taxonomies for four domains (i.e. Animal, Plant, Vehicle, Virus) against the corresponding sub-hierarchies of curated databases. For the Animal, Plant and Vehicle domains, we use WordNet as the gold standard, whereas for the Virus domain, we use the MeSH sub-hierarchy of virus as the reference.

Note that in this comparison, to be fair, we change our algorithm to avoid using WordNet in identifying taxonomic relations (i.e. the SIWN algorithm), and we only use exact string matching for comparison without WordNet. The evaluation uses the following measures:

    Precision = #relations found in database and by the method / #relations found by the method
    Recall = #relations found in database and by the method / #relations found in database
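
A small sketch of the two measures, assuming both the system output and the curated sub-hierarchy are available as sets of (hypernym, hyponym) pairs.

```python
from typing import Set, Tuple

Relation = Tuple[str, str]  # (hypernym, hyponym)

def precision_recall(found: Set[Relation], gold: Set[Relation]) -> Tuple[float, float]:
    """Precision and recall of the extracted relations against a curated sub-hierarchy."""
    overlap = len(found & gold)
    precision = overlap / len(found) if found else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall
```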

To understand the individual contribution of the three introduced features (i.e. trustiness, synonym, contrast), we also evaluate our method with only one of the three features each time, as well as with all the three features (denoted as "Combined").

Tables 2 and 3 summarize the experiment results. Our combined method achieves significantly better performance than the previous state-of-the-art methods in terms of F-measure and Recall (t-test, p-value < 0.05) for all the four domains. For the Animal domain, it also shows higher performance than the reported performance of Bansal et al. (2014). In addition, the proposed method improves the baseline (i.e. Tuan et al. (2014)) by up to 4%-10% of F-measure.

Furthermore, we find that the three features make different contributions to the performance improvement. The trustiness feature contributes to the improvement of both precision and recall. The synonym feature has the tendency of improving the recall further than the trustiness, whereas the contrastive evidence improves the precision. Note that we discussed these different contributions of the features in the Introduction.

                Animal            Plant
                P    R    F       P    R    F
  Kozareva      98   38   55      97   39   56
  Navigli       97   44   61      97   38   55
  Bansal        84   55   67      -    -    -
  Tuan          95   56   70      95   53   68
  Trustiness    97   61   75      97   56   71
  Synonym       92   65   76      93   58   71
  Contrast      97   55   70      97   53   69
  Combined      97   65   78      96   59   73

Table 2: Experiment results for the Animal and Plant domains. P stands for Precision, R Recall, and F F-score. The unit is %.

                Vehicle           Virus
                P    R    F       P    R    F
  Kozareva      99   60   75      97   31   47
  Navigli       91   49   64      99   37   54
  Tuan          93   69   79      93   43   59
  Trustiness    96   72   82      97   48   64
  Synonym       91   72   80      91   53   67
  Contrastive   97   68   80      98   42   59
  Combined      95   73   83      96   54   69

Table 3: Experiment results for the Vehicle and Virus domains. P stands for Precision, R Recall, and F F-score. The unit is %.

4.3.1 Evaluation of individual methods for trustiness and synonymy identification

We evaluate the individual methods for trustiness measurement and synonymy identification described in Sections 3.2.1 and 3.3.1. For this purpose, we evaluate our system with only one of the individual methods at a time (i.e. importance, popularity, authority and accuracy for the trustiness measure, and dictionary, pattern matching, and machine learning methods for synonymy identification).

As summarized in Table 4, the "Importance" and "Accuracy" methods for trustiness measurement, based on PageRank and IE systems, respectively, contribute more than the others. Similarly, the experiment results indicate that the "Machine learning" method has the most contribution among the three methods of synonymy identification.

                        Animal  Plant  Vehicle  Virus
  Trustiness:
    Importance          74%     70%    81%      63%
    Popularity          72%     69%    81%      61%
    Authority           72%     69%    80%      61%
    Accuracy            73%     70%    81%      62%
    Imp + Accu          75%     70%    82%      64%
  Synonym:
    Dictionaries        73%     69%    79%      62%
    Pattern matching    74%     69%    80%      64%
    Machine learning    74%     70%    80%      65%

Table 4: Contribution of individual trustiness measures and collective synonym evidence in terms of F-measure. Imp stands for Importance and Accu stands for Accuracy.

In addition, we also examine the interdependence of the four introduced aspects of trustiness by running the system with the combination of only two aspects, Importance and Accuracy. The results in all domains show that when combining only Importance and Accuracy, the system almost achieves the same performance as the combined system with all four criteria, except for the Plant domain. It can be explained as the Importance aspect (which is expressed as the PageRank score) may subsume the Popularity and Authority aspects. Another interesting point is that the performance of Accuracy, which is solely based on the local information from the website, when applied individually, is almost the same as that of Importance, which is based on distributed information. It shows that the method of ranking sites based on knowledge-based facts can achieve effectiveness as good as the traditional ranking method using the PageRank score.

4.4 Discussion

4.4.1 Case studies

We give two examples to illustrate how the proposed features help to infer correct taxonomic relations and filter out incorrect relations. Our baseline (Tuan et al. (2014)) extracts an incorrect taxonomic relation between 'fox' and 'flying fox' due to the following reasons: (1) 'flying fox' includes 'fox' (SIWN) and (2) untrusted sources such as a public forum⁷ support the relation. Using our proposed method, this relation is filtered out because those untrusted sources are discouraged by the trustiness feature, and also because there is contrastive evidence⁸ saying that 'flying fox' is NOT a 'fox'. Specifically, the average trustiness score of the LSP method of the sources for the invalid relation (i.e. AvgTrust(C_Web(fox, flying fox)) + AvgTrust(C_Web(flying fox, fox))) is 0.63, which is lower than the average of those scores, 0.90. Also, the collective contrastive evidence score (i.e. Score^Contrast(fox, flying fox)) is 1.10, which is higher than the average collective contrastive score, 0.32.

7 http://redwall.wikia.com/wiki/User:Ferretmaiden/Archive3
8 http://en.cc-english.com/index.php?shownews-1397

On the other hand, the true taxonomic relation between 'bat' and 'flying fox' is not identified by the baseline, mainly due to the rare mention of this relation in the Web. However, our proposed method can recognize this relation for two reasons: (1) 'flying fox' has many synonyms such as 'fruit bat', 'pteropus' and 'kalong', and there is much evidence that these synonyms are kinds of 'bat' (i.e. using the collective synonym evidence).

(2) The evidence for the taxonomic relation between 'flying fox' and 'bat', though rare, is from trusted sites⁹ which are maintained by scientists. Thus, the trustiness feature helps to boost the evidence score for this relation over the threshold value. Specifically, the average trustiness score of the LSP method (i.e. AvgTrust(C_Web(bat, flying fox)) + AvgTrust(C_Web(flying fox, bat))), 2.84, is higher than the average in total, 0.90.

9 http://krjsoutheastasianrainforests.weebly.com/animals-in-biome-and-habitat-structures.html

We further investigate 256 taxonomic relations that were missed by the baseline but correctly identified by the proposed method. The average Score_LSP and the average Score_SCS of these relations by the baseline are 0.35 and 0.60, respectively, while those by the proposed method are 1.17 and 0.82, respectively. We thus find that the proposed method is more effective in correctly improving the LSP method than the SCS method.

4.4.2 Empirical comparison with WordNet

By error analysis, we find that our results may complement WordNet. For example, in the Animal domain, our method identifies 'wild sheep' as a hyponym of 'sheep', but in WordNet, they are siblings. However, many references¹⁰ ¹¹ consider 'wild sheep' as a kind of 'sheep'. Another such example is that our system recognizes 'aquatic vertebrate' as a hypernym of 'aquatic mammal', but WordNet places them in different subtrees incorrectly¹². Therefore, our results may help restructure and extend WordNet.

10 http://en.wikipedia.org/wiki/Ovis
11 http://www.bjornefabrikken.no/side/norwegian-sheep/
12 http://en.wikipedia.org/wiki/Aquatic_mammal

4.4.3 Threshold tuning

Our scoring methods utilize several thresholds to select relations of high ranks. We discuss them in detail below.

The threshold value Ψ for the pattern matching method in Section 3.3.1 controls the number of synonymy relations extracted from text. The threshold value for Score^Final_Combined of Formula 15 in Section 3.5 controls the number of extracted taxonomic relations. In general, the larger these threshold values are, the higher the number of synonyms and taxonomic relations we can get. In our experiments, we found that threshold values for Ψ between 100 and 120, and those for Score^Final_Combined between 2.3 and 2.5, generally help the system achieve the best performance.

5 Conclusion

In this paper, we propose the features of trustiness, and synonym and contrastive collective evidence for the task of taxonomy construction, and show that these features help the system improve its performance significantly. As future work, we will investigate the task of automatically constructing patterns for the pattern matching methods in Sections 3.3 and 3.4, to improve coverage. We will also enhance the accuracy measure of trustiness, based on the observation that some untrusted sites copy information from other sites to make them look more trustful.

References

Gabor Angeli, Julie Tibshirani, Jean Y. Wu, and Christopher D. Manning. 2014. Combining Distant and Partial Supervision for Relation Extraction. Proceedings of the Conference on Empirical Methods on Natural Language Processing, pages 1556–1567.

Mohit Bansal, David Burkett, Gerard De Melo, and Dan Klein. 2014. Structured learning for taxonomy induction with belief propagation. Proceedings of the 52nd Annual Meeting of the ACL, pages 1041–1051.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 1247–1250.

Xin L. Dong, Evgeniy Gabrilovich, Kevin Murphy, Van Dang, Wilko Horn, Camillo Lugaresi, Shaohua Sun, and Wei Zhang. 2015. Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources. Proceedings of the VLDB Endowment, 8(9).

Lucas Drumond and Rosario Girardi. 2010. Extracting ontology concept hierarchies from text using markov logic. Proceedings of the ACM Symposium on Applied Computing, pages 1354–1358.

Samah Fodeh, Bill Punch, and Pang N. Tan. 2011. On ontology-driven document clustering using core semantic features. Knowledge and Information Systems, 28(2):395–421.

Hermine N. Fotzo and Patrick Gallinari. 2004. Learning "generalization/specialization" relations between concepts - application for automatically building thematic document hierarchies. Proceedings of the 7th International Conference on Computer-Assisted Information Retrieval.

Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning semantic hierarchies via word embeddings. Proceedings of the 52nd Annual Meeting of the ACL, pages 1199–1209.

Ben Hachey, Will Radford, Joel Nothman, Matthew Honnibal, and James R. Curran. 2013. Evaluating entity linking with Wikipedia. Artificial Intelligence, 194:130–150.

Sanda M. Harabagiu, Steven J. Maiorano, and Marius A. Pasca. 2003. Open-domain textual question answering techniques. Natural Language Engineering, 9(3):231–267.

Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. 2009. The elements of statistical learning. Springer-Verlag.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. Proceedings of the 14th Conference on Computational Linguistics, pages 539–545.

Jung J. Kim, Zhuo Zhang, Jong C. Park, and See K. Ng. 2006. Biocontrasts: extracting and exploiting protein-protein contrastive relations from biomedical literature. Bioinformatics, 22(5):597–605.

Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. Proceedings of the 41st Annual Meeting of the ACL, pages 423–430.

Zornitsa Kozareva and Eduard Hovy. 2010. A Semi-supervised Method to Learn and Construct Taxonomies Using the Web. Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1110–1118.

Zornitsa Kozareva, Ellen Riloff, and Eduard H. Hovy. 2008. Semantic Class Learning from the Web with Hyponym Pattern Linkage Graphs. Proceedings of the 46th Annual Meeting of the ACL, pages 1048–1056.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. Proceedings of the 52nd Annual Meeting of the ACL, pages 55–60.

Cynthia Matuszek, John Cabral, Michael J. Witbrock, and John DeOliveira. 2006. An introduction to the syntax and content of Cyc. Proceedings of the AAAI Spring Symposium, pages 44–49.

George A. Miller. 1995. WordNet: a Lexical Database for English. Communications of the ACM, 38(11):39–41.

Roberto Navigli and Paola Velardi. 2004. Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites. Computational Linguistics, 30(2):151–179.

Roberto Navigli, Paola Velardi, and Stefano Faralli. 2011. A Graph-based Algorithm for Inducing Lexical Taxonomies from Scratch. Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 1872–1877.

Luu A. Tuan, Jung J. Kim, and See K. Ng. 2014. Taxonomy Construction using Syntactic Contextual Evidence. Proceedings of the EMNLP conference, pages 810–819.

Wu Wentao, Li Hongsong, Wang Haixun, and Kenny Q. Zhu. 2012. Probase: A probabilistic taxonomy for text understanding. Proceedings of the ACM SIGMOD conference, pages 481–492.

Wilson Wong, Wei Liu, and Mohammed Bennamoun. 2007. Tree-traversing ant algorithm for term clustering based on featureless similarities. Data Mining and Knowledge Discovery, 15(3):349–381.

Hui Yang and Jamie Callan. 2009. A Metric-based Framework for Automatic Taxonomy Induction. Proceedings of the 47th Annual Meeting of the ACL, pages 271–279.

Xingwei Zhu, Zhao Y. Ming, and Tat S. Chua. 2013. Topic hierarchy construction for the organization of multi-source user generated contents. Proceedings of the 36th ACM SIGIR conference, pages 233–242.
