End-To-End Reinforcement Learning for Automatic Taxonomy Induction
Yuning Mao¹, Xiang Ren², Jiaming Shen¹, Xiaotao Gu¹, Jiawei Han¹
¹Department of Computer Science, University of Illinois Urbana-Champaign, IL, USA
²Department of Computer Science, University of Southern California, CA, USA
1fyuningm2, js2, xiaotao2, [email protected] [email protected]

Abstract

We present a novel end-to-end reinforcement learning approach to automatic taxonomy induction from a set of terms. While prior methods treat the problem as a two-phase task (i.e., detecting hypernymy pairs followed by organizing these pairs into a tree-structured hierarchy), we argue that such two-phase methods may suffer from error propagation, and cannot effectively optimize metrics that capture the holistic structure of a taxonomy. In our approach, the representations of term pairs are learned using multiple sources of information and used to determine which term to select and where to place it on the taxonomy via a policy network. All components are trained in an end-to-end manner with cumulative rewards, measured by a holistic tree metric over the training taxonomies. Experiments on two public datasets of different domains show that our approach outperforms prior state-of-the-art taxonomy induction methods by up to 19.6% on ancestor F1.¹

¹Code and data can be found at https://github.com/morningmoni/TaxoRL

1 Introduction

Many tasks in natural language understanding (e.g., information extraction (Demeester et al., 2016), question answering (Yang et al., 2017), and textual entailment (Sammons, 2012)) rely on lexical resources in the form of term taxonomies (cf. rightmost column in Fig. 1). However, most existing taxonomies, such as WordNet (Miller, 1995) and Cyc (Lenat, 1995), are manually curated and thus may have limited coverage or become unavailable in some domains and languages. Therefore, recent efforts have been focusing on automatic taxonomy induction, which aims to organize a set of terms into a taxonomy based on relevant resources such as text corpora.

Prior studies on automatic taxonomy induction (Gupta et al., 2017; Camacho-Collados, 2017) often divide the problem into two sequential subtasks: (1) hypernymy detection (i.e., extracting term pairs of “is-a” relation); and (2) hypernymy organization (i.e., organizing is-a term pairs into a tree-structured hierarchy). Methods developed for hypernymy detection either harvest new terms (Yamada et al., 2009; Kozareva and Hovy, 2010) or presume a vocabulary is given and study term semantics (Snow et al., 2005; Fu et al., 2014; Tuan et al., 2016; Shwartz et al., 2016). The hypernymy pairs extracted in the first subtask form a noisy hypernym graph, which is then transformed into a tree-structured taxonomy in the hypernymy organization subtask, using different graph pruning methods including maximum spanning tree (MST) (Bansal et al., 2014; Zhang et al., 2016), minimum-cost flow (MCF) (Gupta et al., 2017) and other pruning heuristics (Kozareva and Hovy, 2010; Velardi et al., 2013; Faralli et al., 2015; Panchenko et al., 2016).

However, these two-phase methods encounter two major limitations. First, most of them ignore the taxonomy structure when estimating the probability that a term pair holds the hypernymy relation. They estimate the probability of different term pairs independently and the learned term pair representations are fixed during hypernymy organization. In consequence, there is no feedback from the second phase to the first phase and possibly wrong representations cannot be rectified based on the results of hypernymy organization, which causes the error propagation problem. Second, some methods (Bansal et al., 2014; Zhang et al., 2016) do explore the taxonomy space by regarding the induction of taxonomy structure as inferring the conditional distribution of edges. In other words, they use the product of edge probabilities to represent the taxonomy quality. However, the edges are treated equally, while in reality, they contribute to the taxonomy differently. For example, a high-level edge is likely to be more important than a bottom-out edge because it has much more influence on its descendants. In addition, these methods cannot explicitly capture the holistic taxonomy structure by optimizing global metrics.

[Figure 1 image omitted; panels show the taxonomy at t=0, t=5, t=6 and t=8.]
Figure 1: An illustrative example showing the process of taxonomy induction. The input vocabulary $V_0$ is {“working dog”, “pinscher”, “shepherd dog”, ...}, and the initial taxonomy $T_0$ is empty. We use a virtual “root” node to represent $T_0$ at $t = 0$. At time $t = 5$, there are 5 terms on the taxonomy $T_5$ and 3 terms left to be attached: $V_t$ = {“shepherd dog”, “collie”, “affenpinscher”}. Suppose the term “affenpinscher” is selected and put under “pinscher”, then the remaining vocabulary $V_{t+1}$ at next time step becomes {“shepherd dog”, “collie”}. Finally, after $|V_0|$ time steps, all the terms are attached to the taxonomy and $V_{|V_0|} = V_8 = \{\}$. A full taxonomy is then constructed from scratch.

To address the above issues, we propose to jointly conduct hypernymy detection and organization by learning term pair representations and constructing the taxonomy simultaneously. Since it is infeasible to estimate the quality of all possible taxonomies, we design an end-to-end reinforcement learning (RL) model to combine the two phases. Specifically, we train an RL agent that employs the term pair representations using multiple sources of information and determines which term to select and where to place it on the taxonomy via a policy network. The feedback from hypernymy organization is propagated back to the hypernymy detection phase, based on which the term pair representations are adjusted. All components are trained in an end-to-end manner with cumulative rewards, measured by a holistic tree metric over the training taxonomies. The probability of a full taxonomy is no longer a simple aggregated probability of its edges. Instead, we assess an edge based on how much it can contribute to the whole quality of the taxonomy.

We perform two sets of experiments to evaluate the effectiveness of our proposed approach. First, we test the end-to-end taxonomy induction performance by comparing our approach with the state-of-the-art two-phase methods, and show that our approach outperforms them significantly on the quality of constructed taxonomies. Second, we use the same (noisy) hypernym graph as the input of all compared methods, and demonstrate that our RL approach does better hypernymy organization through optimizing metrics that can capture holistic taxonomy structure.

Contributions. In summary, we have made the following contributions: (1) We propose a deep reinforcement learning approach to unify hypernymy detection and organization so as to induce taxonomies in an end-to-end manner. (2) We design a policy network to incorporate semantic information of term pairs and use cumulative rewards to measure the quality of constructed taxonomies holistically. (3) Experiments on two public datasets from different domains demonstrate the superior performance of our approach compared with state-of-the-art methods. We also show that our method can effectively reduce error propagation and capture global taxonomy structure.

2 Automatic Taxonomy Induction

2.1 Problem Definition

We define a taxonomy $T = (V, R)$ as a tree-structured hierarchy with term set $V$ (i.e., vocabulary), and edge set $R$ (which indicates is-a relationship between terms). A term $v \in V$ can be either a unigram or a multi-word phrase. The task of end-to-end taxonomy induction takes a set of training taxonomies and related resources (e.g., background text corpora) as input, and aims to learn a model to construct a full taxonomy $T$ by adding terms from a given vocabulary $V_0$ onto an empty hierarchy $T_0$ one at a time. An illustration of the taxonomy induction process is shown in Fig. 1.

2.2 Modeling Hypernymy Relation

Determining which term to select from $V_0$ and where to place it on the current hierarchy requires understanding of the semantic relationships between the selected term and all the other terms. We consider multiple sources of information (i.e., resources) for learning hypernymy relation representations of term pairs, including dependency path-based contextual embedding and distributional term embeddings (Shwartz et al., 2016).

... embeddings of x and y, which capture the distributional semantics of two terms.

Surface String Features. In practice, even the embeddings of many terms are missing because the terms in the input vocabulary may be multi-word phrases, proper nouns or named entities, which are likely not covered by the external pre-trained word embeddings. To address this issue, we utilize several surface features described in previous studies (Yang and Callan, 2009; Bansal et al., 2014; Zhang et al., 2016). Specifically, we employ Capitalization, Ends with, Contains, Suffix match, Longest common substring and Length difference. These features are effective for detecting hypernyms solely based on the term pairs.

Frequency and Generality Features. Another feature source that we employ is the hypernym candidates from TAXI³ (Panchenko et al.,

Path-based Information. We extract the shortest dependency paths between each co-occurring term pair from sentences in the given background corpora. Each path is represented as a sequence of edges that goes from term x to term y in the dependency tree, and each edge consists of the word lemma, the part-of-speech tag, the dependency label and the edge direction between two contiguous words.
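To make the edge representation concrete, the following is a minimal sketch of how one dependency path could be stored as a sequence of (lemma, part-of-speech tag, dependency label, direction) edges. The class and function names, the "<" / ">" / "-" direction symbols, and the use of X and Y as placeholders for the two terms are all assumptions made for illustration; the example path is written by hand rather than produced by a parser, and the sketch does not claim to reproduce the authors' preprocessing.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PathEdge:
    """One edge of a dependency path (field names are illustrative)."""
    lemma: str      # word lemma
    pos: str        # part-of-speech tag
    dep: str        # dependency label
    direction: str  # edge direction; "<", ">" and "-" are an assumed notation

    def as_token(self) -> str:
        # Join the four components into a single string token.
        return "/".join((self.lemma, self.pos, self.dep, self.direction))


def encode_path(edges: List[PathEdge]) -> List[str]:
    """Turn a path from term x to term y into a sequence of edge tokens."""
    return [edge.as_token() for edge in edges]


# Hand-written path for "A collie is a shepherd dog"
# (x = "collie", y = "shepherd dog"); tags are typed in by hand.
example_path = [
    PathEdge("X", "NOUN", "nsubj", "<"),
    PathEdge("be", "VERB", "ROOT", "-"),
    PathEdge("Y", "NOUN", "attr", ">"),
]

print(encode_path(example_path))
```

Each token could then be looked up in an embedding table before the sequence is fed to a downstream encoder; that step is left out here because the text above does not specify it.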