End-to-End Reinforcement Learning for Automatic Taxonomy Induction

Yuning Mao¹, Xiang Ren², Jiaming Shen¹, Xiaotao Gu¹, Jiawei Han¹
¹Department of Computer Science, University of Illinois at Urbana-Champaign, IL, USA
²Department of Computer Science, University of Southern California, CA, USA
¹{yuningm2, js2, xiaotao2, hanj}@illinois.edu  ²[email protected]

Abstract

We present a novel end-to-end reinforcement learning approach to automatic taxonomy induction from a set of terms. While prior methods treat the problem as a two-phase task (i.e., detecting hypernymy pairs followed by organizing these pairs into a tree-structured hierarchy), we argue that such two-phase methods may suffer from error propagation, and cannot effectively optimize metrics that capture the holistic structure of a taxonomy. In our approach, the representations of term pairs are learned using multiple sources of information and used to determine which term to select and where to place it on the taxonomy via a policy network. All components are trained in an end-to-end manner with cumulative rewards, measured by a holistic tree metric over the training taxonomies. Experiments on two public datasets of different domains show that our approach outperforms prior state-of-the-art taxonomy induction methods by up to 19.6% on ancestor F1.¹

1 Introduction

Many tasks in natural language understanding (e.g., information extraction (Demeester et al., 2016), question answering (Yang et al., 2017), and textual entailment (Sammons, 2012)) rely on lexical resources in the form of term taxonomies (cf. rightmost column in Fig. 1). However, most existing taxonomies, such as WordNet (Miller, 1995) and Cyc (Lenat, 1995), are manually curated and thus may have limited coverage or become unavailable in some domains and languages. Therefore, recent efforts have been focusing on automatic taxonomy induction, which aims to organize a set of terms into a taxonomy based on relevant resources such as text corpora.

Prior studies on automatic taxonomy induction (Gupta et al., 2017; Camacho-Collados, 2017) often divide the problem into two sequential subtasks: (1) hypernymy detection (i.e., extracting term pairs of "is-a" relation); and (2) hypernymy organization (i.e., organizing is-a term pairs into a tree-structured hierarchy). Methods developed for hypernymy detection either harvest new terms (Yamada et al., 2009; Kozareva and Hovy, 2010) or presume a vocabulary is given and study term semantics (Snow et al., 2005; Fu et al., 2014; Tuan et al., 2016; Shwartz et al., 2016). The hypernymy pairs extracted in the first subtask form a noisy hypernym graph, which is then transformed into a tree-structured taxonomy in the hypernymy organization subtask, using different graph pruning methods including maximum spanning tree (MST) (Bansal et al., 2014; Zhang et al., 2016), minimum-cost flow (MCF) (Gupta et al., 2017), and other pruning heuristics (Kozareva and Hovy, 2010; Velardi et al., 2013; Faralli et al., 2015; Panchenko et al., 2016).

However, these two-phase methods encounter two major limitations. First, most of them ignore the taxonomy structure when estimating the probability that a term pair holds the hypernymy relation. They estimate the probability of different term pairs independently, and the learned term pair representations are fixed during hypernymy organization. In consequence, there is no feedback from the second phase to the first phase, and possibly wrong representations cannot be rectified based on the results of hypernymy organization, which causes the error propagation problem. Secondly, some methods (Bansal et al., 2014; Zhang et al., 2016) do explore the taxonomy space by regarding the induction of taxonomy structure as inferring the conditional distribution of edges. In other words, they use the product of edge probabilities to represent the taxonomy quality.

¹ Code and data can be found at https://github.com/morningmoni/TaxoRL

Figure 1: An illustrative example showing the process of taxonomy induction. The input vocabulary V0 is {"working dog", "pinscher", "shepherd dog", ...}, and the initial taxonomy T0 is empty. We use a virtual "root" node to represent T0 at t = 0. At time t = 5, there are 5 terms on the taxonomy T5 and 3 terms left to be attached: Vt = {"shepherd dog", "collie", "affenpinscher"}. Suppose the term "affenpinscher" is selected and put under "pinscher", then the remaining vocabulary Vt+1 at the next time step becomes {"shepherd dog", "collie"}. Finally, after |V0| time steps, all the terms are attached to the taxonomy and V_{|V0|} = V8 = {}. A full taxonomy is then constructed from scratch.

However, the edges are treated equally, while in reality, they contribute to the taxonomy differently. For example, a high-level edge is likely to be more important than a bottom-out edge because it has much more influence on its descendants. In addition, these methods cannot explicitly capture the holistic taxonomy structure by optimizing global metrics.

To address the above issues, we propose to jointly conduct hypernymy detection and organization by learning term pair representations and constructing the taxonomy simultaneously. Since it is infeasible to estimate the quality of all possible taxonomies, we design an end-to-end reinforcement learning (RL) model to combine the two phases. Specifically, we train an RL agent that employs the term pair representations using multiple sources of information and determines which term to select and where to place it on the taxonomy via a policy network. The feedback from hypernymy organization is propagated back to the hypernymy detection phase, based on which the term pair representations are adjusted. All components are trained in an end-to-end manner with cumulative rewards, measured by a holistic tree metric over the training taxonomies. The probability of a full taxonomy is no longer a simple aggregated probability of its edges. Instead, we assess an edge based on how much it can contribute to the whole quality of the taxonomy.

We perform two sets of experiments to evaluate the effectiveness of our proposed approach. First, we test the end-to-end taxonomy induction performance by comparing our approach with the state-of-the-art two-phase methods, and show that our approach outperforms them significantly on the quality of constructed taxonomies. Second, we use the same (noisy) hypernym graph as the input of all compared methods, and demonstrate that our RL approach does better hypernymy organization through optimizing metrics that can capture holistic taxonomy structure.

Contributions. In summary, we have made the following contributions: (1) We propose a deep reinforcement learning approach to unify hypernymy detection and organization so as to induct taxonomies in an end-to-end manner. (2) We design a policy network to incorporate semantic information of term pairs and use cumulative rewards to measure the quality of constructed taxonomies holistically. (3) Experiments on two public datasets from different domains demonstrate the superior performance of our approach compared with state-of-the-art methods. We also show that our method can effectively reduce error propagation and capture global taxonomy structure.

2 Automatic Taxonomy Induction

2.1 Problem Definition

We define a taxonomy T = (V, R) as a tree-structured hierarchy with term set V (i.e., vocabulary) and edge set R (which indicates the is-a relationship between terms). A term v ∈ V can be either a unigram or a multi-word phrase. The task of end-to-end taxonomy induction takes a set of training taxonomies and related resources (e.g., background text corpora) as input, and aims to learn a model to construct a full taxonomy T by adding terms from a given vocabulary V0 onto an empty hierarchy T0 one at a time. An illustration of the taxonomy induction process is shown in Fig. 1.
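As a rough, illustrative sketch of this induction process (not the released TaxoRL code), the environment can be modeled as a partial tree plus a remaining vocabulary; the class and method names below (TaxonomyEnv, attach) are placeholders we introduce for illustration:

```python
# Minimal sketch of the induction environment described above (cf. Fig. 1).
# Names (TaxonomyEnv, attach) are illustrative, not from the released TaxoRL code.

class TaxonomyEnv:
    """Holds the partial taxonomy T_t and the remaining vocabulary V_t."""

    def __init__(self, vocab, root="<root>"):
        self.remaining = set(vocab)     # V_t: terms not yet attached
        self.on_tree = [root]           # terms already on T_t (virtual-root variant)
        self.parent = {}                # child term -> hypernym (parent) term

    def candidate_actions(self):
        # One action per (new term, existing term) pair: |V_t| x |T_t| choices.
        return [(x1, x2) for x1 in self.remaining for x2 in self.on_tree]

    def attach(self, x1, x2):
        # Apply one action: remove x1 from V_t and add the edge (x1 is-a x2) to T_t.
        self.remaining.remove(x1)
        self.parent[x1] = x2
        self.on_tree.append(x1)

    def done(self):
        # The episode ends after |V_0| steps, when every term has been attached.
        return not self.remaining
```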

2.2 Modeling Hypernymy Relation

Determining which term to select from V0 and where to place it on the current hierarchy requires understanding of the semantic relationships between the selected term and all the other terms. We consider multiple sources of information (i.e., resources) for learning hypernymy relation representations of term pairs, including dependency path-based contextual embeddings and distributional term embeddings (Shwartz et al., 2016).

Path-based Information. We extract the shortest dependency paths between each co-occurring term pair from sentences in the given background corpora. Each path is represented as a sequence of edges that goes from term x to term y in the dependency tree, and each edge consists of the word lemma, the part-of-speech tag, the dependency label, and the edge direction between two contiguous words. The edge is represented by the concatenation of the embeddings of its four components:

    V_e = [V_l, V_pos, V_dep, V_dir].

Instead of treating the entire dependency path as a single feature, we encode the sequence of dependency edges V_{e_1}, V_{e_2}, ..., V_{e_k} using an LSTM so that the model can focus on learning from parts of the path that are more informative while ignoring others. We denote the final output of the LSTM for path p as O_p, and use P(x, y) to represent the set of all dependency paths between term pair (x, y). A single vector representation of the term pair (x, y) is then computed as P_{P(x,y)}, the weighted average of all its path representations obtained by applying average pooling:

    P_{P(x,y)} = ( Σ_{p ∈ P(x,y)} c_{(x,y)}(p) · O_p ) / ( Σ_{p ∈ P(x,y)} c_{(x,y)}(p) ),

where c_{(x,y)}(p) denotes the frequency of path p in P(x, y). For those term pairs without dependency paths, we use a randomly initialized empty path to represent them, as in Shwartz et al. (2016).

Distributional Term Embedding. The previous path-based features are only applicable when two terms co-occur in a sentence. In our experiments, however, we found that only about 17% of term pairs have sentence-level co-occurrences.² To alleviate the sparse co-occurrence issue, we concatenate the path representation P_{P(x,y)} with the word embeddings of x and y, which capture the distributional semantics of the two terms.

Surface String Features. In practice, even the embeddings of many terms are missing because the terms in the input vocabulary may be multi-word phrases, proper nouns, or named entities, which are likely not covered by the external pre-trained word embeddings. To address this issue, we utilize several surface features described in previous studies (Yang and Callan, 2009; Bansal et al., 2014; Zhang et al., 2016). Specifically, we employ Capitalization, Ends with, Contains, Suffix match, Longest common substring, and Length difference. These features are effective for detecting hypernyms solely based on the term pairs.

Frequency and Generality Features. Another feature source that we employ is the hypernym candidates from TAXI³ (Panchenko et al., 2016). These hypernym candidates are extracted by lexico-syntactic patterns and may be noisy. As only the term pairs and their co-occurrence frequencies (under specific patterns) are available, we cannot recover the dependency paths between these terms. Thus, we design two features that are similar to those used in (Panchenko et al., 2016; Gupta et al., 2017).⁴

• Normalized Frequency Diff. For a hyponym-hypernym pair (x_i, x_j) where x_i is the hyponym and x_j is the hypernym, its normalized frequency is defined as freq_n(x_i, x_j) = freq(x_i, x_j) / max_k freq(x_i, x_k), where freq(x_i, x_j) denotes the raw frequency of (x_i, x_j). The final feature score is defined as freq_n(x_i, x_j) − freq_n(x_j, x_i), which down-ranks synonyms and co-hyponyms. Intuitively, a higher score indicates a higher probability that the term pair holds the hypernymy relation.

• Generality Diff. The generality g(x) of a term x is defined as the logarithm of the number of its distinct hyponyms, i.e., g(x) = log(1 + |hypo_x|), where for any hypo ∈ hypo_x, (hypo, x) is a hypernym candidate. A high g(x) of the term x implies that x is general since it has many distinct hyponyms. The generality of a term pair is defined as the difference in generality between x_j and x_i: g(x_j) − g(x_i). This feature would promote term pairs with the right level of generality and penalize term pairs that are either too general or too specific.

² In comparison, more than 70% of term pairs have sentence-level co-occurrences in BLESS (Baroni and Lenci, 2011), a standard hypernymy detection dataset.
³ http://tudarmstadt-lt.github.io/taxi/
⁴ Since these features use an additional resource, we do not include them unless otherwise specified.
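As a rough illustration of how these sources of information can be combined into one term pair vector, the PyTorch sketch below encodes each dependency path with an LSTM, pools the path vectors by their counts, and concatenates the result with word and feature embeddings. Dimensions, module names, and the overall framework choice are placeholders, not the authors' DyNet implementation.

```python
import torch
import torch.nn as nn

class TermPairEncoder(nn.Module):
    """Sketch: path LSTM + frequency-weighted pooling + word/feature embeddings."""

    def __init__(self, edge_dim=100, path_dim=60, word_dim=50, feat_dim=20):
        super().__init__()
        self.path_lstm = nn.LSTM(edge_dim, path_dim, batch_first=True)

    def forward(self, paths, counts, wx, wy, feat):
        # paths:  (num_paths, max_len, edge_dim) padded edge-embedding sequences
        # counts: (num_paths,) frequency c_(x,y)(p) of each path
        # wx, wy: (word_dim,) word embeddings of the two terms
        # feat:   (feat_dim,) concatenated binned feature embeddings
        _, (h_n, _) = self.path_lstm(paths)          # final LSTM state per path: O_p
        o_p = h_n.squeeze(0)                         # (num_paths, path_dim)
        w = counts / counts.sum()                    # normalized path frequencies
        pooled = (w.unsqueeze(1) * o_p).sum(dim=0)   # weighted average P_P(x,y)
        return torch.cat([pooled, wx, wy, feat])     # term pair representation

# Toy usage: two paths of length 4 between one term pair (x, y).
enc = TermPairEncoder()
r_xy = enc(torch.randn(2, 4, 100), torch.tensor([3.0, 1.0]),
           torch.randn(50), torch.randn(50), torch.randn(20))
```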

The surface, frequency, and generality features are binned and their embeddings are concatenated as a part of the term pair representation. In summary, the final term pair representation R_xy has the following form:

    R_xy = [P_{P(x,y)}, V_{w_x}, V_{w_y}, V_{F(x,y)}],

where P_{P(x,y)}, V_{w_x}, V_{w_y}, and V_{F(x,y)} denote the path representation, the word embeddings of x and y, and the feature embeddings, respectively.

Our approach is general and can be flexibly extended to incorporate different feature representation components introduced by other relation extraction models (Zhang et al., 2017; Lin et al., 2016; Shwartz et al., 2016). We leave in-depth discussion of the design choices for hypernymy relation representation components as future work.

3 Reinforcement Learning for End-to-End Taxonomy Induction

We present the reinforcement learning (RL) approach to taxonomy induction in this section. The RL agent employs the term pair representations described in Section 2.2 as input, and explores how to generate a whole taxonomy by selecting one term at each time step and attaching it to the current taxonomy. We first describe the environment, including the actions, states, and rewards. Then, we introduce how to choose actions via a policy network.

3.1 Actions

We regard the process of building a taxonomy as making a sequence of actions. Specifically, we define that an action a_t at time step t is to (1) select a term x1 from the remaining vocabulary V_t; (2) remove x1 from V_t; and (3) attach x1 as a hyponym of one term x2 that is already on the current taxonomy T_t. Therefore, the size of the action space at time step t is |V_t| × |T_t|, where |V_t| is the size of the remaining vocabulary V_t and |T_t| is the number of terms on the current taxonomy. At the beginning of each episode, the remaining vocabulary V_0 is equal to the input vocabulary and the taxonomy T_0 is empty. During the taxonomy induction process, the following relations always hold: |V_t| = |V_{t−1}| − 1, |T_t| = |T_{t−1}| + 1, and |V_t| + |T_t| = |V_0|. The episode terminates when all the terms are attached to the taxonomy, which makes the length of one episode equal to |V_0|.

A remaining issue is how to select the first term when no terms are on the taxonomy. One approach that we tried is to add a virtual node as the root and consider it as if it were a real node. The root embedding is randomly initialized and updated with the other parameters. This approach presumes that all taxonomies share a common root representation and expects to find the real root of a taxonomy via the term pair representations between the virtual root and other terms. Another approach that we explored is to postpone the decision of the root by initializing T with a random term as the current root at the beginning of one episode, and allowing the selection of a new root by attaching one term as the hypernym of the current root. In this way, it overcomes the lack of prior knowledge when the first term is chosen. The size of the action space then becomes |A_t| = |V_t| × |T_t| + |V_t|, and the length of one episode becomes |V_0| − 1. We compare the performance of the two approaches in Section 4.

3.2 States

The state s at time t comprises the current taxonomy T_t and the remaining vocabulary V_t. At each time step, the environment provides the information of the current state, based on which the RL agent takes an action. Once a term pair (x1, x2) is selected, the position of the new term x1 is automatically determined since the other term x2 is already on the taxonomy and we can simply attach x1 by adding an edge between x1 and x2.

3.3 Rewards

The agent takes a scalar reward as feedback on its actions to learn its policy. One obvious reward is to wait until the end of taxonomy induction, and then compare the predicted taxonomy with the gold taxonomy. However, this reward is delayed and makes it difficult to measure the effect of individual actions in our scenario. Instead, we use reward shaping, i.e., giving intermediate rewards at each time step, to accelerate the learning process. Empirically, we set the reward r at time step t to be the difference of Edge-F1 (defined in Section 4.2 and evaluated by comparing the current taxonomy with the gold taxonomy) between the current and the last time step: r_t = F1e_t − F1e_{t−1}. If the current Edge-F1 is better than that at the last time step, the reward is positive, and vice versa. The cumulative reward from the current time step to the end of an episode cancels the intermediate rewards and thus reflects whether the current action improves the overall performance or not. As a result, the agent does not focus only on the selection of the current term pair but takes a long-term view that accounts for the following actions. For example, suppose there are two actions at the same time step. One action attaches a leaf node to a high-level node, and the other action attaches a non-leaf node to the same high-level node. Both attachments form a wrong edge, but the latter is likely to receive a higher cumulative reward because its following attachments are more likely to be correct.
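A minimal Python sketch of this shaped reward, under the simplification that edges are compared as (hyponym, hypernym) pairs; the function names are illustrative rather than taken from the released code:

```python
# Sketch of the shaped reward r_t = F1e_t - F1e_{t-1}.

def edge_f1(pred_edges, gold_edges):
    """Edge-F1 of a (partial) taxonomy's edge set against the gold edge set."""
    correct = len(pred_edges & gold_edges)
    if correct == 0:
        return 0.0
    p, r = correct / len(pred_edges), correct / len(gold_edges)
    return 2 * p * r / (p + r)

def shaped_reward(prev_edges, new_edge, gold_edges):
    """Intermediate reward for attaching new_edge = (hyponym, hypernym)."""
    before = edge_f1(prev_edges, gold_edges)
    after = edge_f1(prev_edges | {new_edge}, gold_edges)
    return after - before   # positive if Edge-F1 improved, negative otherwise
```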

Figure 2: The architecture of the policy network. The dependency paths are encoded and concatenated with word embeddings and feature embeddings, and then fed into a two-layer feed-forward network.

3.4 Policy Network

After we introduce the term pair representations and define the states, actions, and rewards, the problem becomes how to choose an action from the action space, i.e., which term pair (x1, x2) should be selected given the current state? To solve this problem, we parameterize each action a by a policy network π(a | s; W_RL). The architecture of our policy network is shown in Fig. 2. For each term pair, its representation is obtained from the path LSTM encoder, the word embeddings of both terms, and the embeddings of features. By stacking the term pair representations, we obtain an action matrix A_t with size (|V_t| × |T_t|) × dim(R), where (|V_t| × |T_t|) denotes the number of possible actions (term pairs) at time t and dim(R) denotes the dimension of the term pair representation R. A_t is then fed into a two-layer feed-forward network followed by a softmax layer which outputs the probability distribution of actions.⁵ Finally, an action a_t is sampled based on the probability distribution of the action space:

    H_t = ReLU(W_RL^1 A_t + b_RL^1),
    π(a | s; W_RL) = softmax(W_RL^2 H_t + b_RL^2),
    a_t ∼ π(a | s; W_RL).

At the time of inference, instead of sampling an action from the probability distribution, we greedily select the term pair with the highest probability.

We use REINFORCE (Williams, 1992), one instance of the policy gradient methods, as the optimization algorithm. Specifically, for each episode, the weights of the policy network are updated as follows:

    W_RL = W_RL + α Σ_{t=1}^{T} ∇ log π(a_t | s; W_RL) · v_t,

where v_i = Σ_{t=i}^{T} γ^{t−i} r_t is the cumulative future reward at time i and γ ∈ [0, 1] is a discounting factor of future rewards.

To reduce variance, 10 rollouts for each training sample are run and the rewards are averaged. Another common strategy for variance reduction is to use a baseline and give the agent the difference between the real reward and the baseline reward instead of feeding the real reward directly. We use a moving average of the reward as the baseline for simplicity.

⁵ We tried to encode induction history by feeding representations of previously selected term pairs into an LSTM, and leveraging the output of the LSTM as a history representation (concatenating it with the current term pair representations or passing it to a feed-forward network). However, we did not observe a clear performance change.
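The following PyTorch sketch mirrors the two-layer policy network and the REINFORCE update with a moving-average baseline; the layer sizes, the MovingBaseline helper, and the framework choice are illustrative assumptions, not the paper's exact DyNet configuration (only the discount factor 0.4 is taken from Section 3.5).

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Two-layer feed-forward scorer over the stacked term pair representations A_t."""

    def __init__(self, rep_dim, hidden_dim=64):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(rep_dim, hidden_dim), nn.ReLU(),
                                nn.Linear(hidden_dim, 1))

    def forward(self, action_matrix):
        # action_matrix: (num_actions, rep_dim) -> pi(a | s), a distribution over actions
        scores = self.ff(action_matrix).squeeze(-1)
        return torch.softmax(scores, dim=0)

class MovingBaseline:
    """Moving average of the episode reward, used as the REINFORCE baseline."""

    def __init__(self, momentum=0.9):
        self.avg, self.momentum = 0.0, momentum

    def update(self, episode_reward):
        self.avg = self.momentum * self.avg + (1 - self.momentum) * episode_reward

def reinforce_update(optimizer, log_probs, rewards, baseline, gamma=0.4):
    """One policy gradient step after a finished episode of shaped rewards r_1..r_T."""
    returns, g = [], 0.0
    for r in reversed(rewards):               # v_i = sum_{t >= i} gamma^(t-i) * r_t
        g = r + gamma * g
        returns.insert(0, g)
    loss = sum(-log_p * (v - baseline.avg)    # subtract the moving-average baseline
               for log_p, v in zip(log_probs, returns))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    baseline.update(sum(rewards))
```

During an episode, the agent would sample a_t with torch.distributions.Categorical over the returned probabilities and store its log-probability; at inference it takes the argmax instead, matching the greedy selection described above.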

3.5 Implementation Details

We use pre-trained GloVe word vectors (Pennington et al., 2014) with dimensionality 50 as word embeddings. We limit the maximum number of dependency paths between each term pair to 200 because some term pairs containing general terms may have too many dependency paths. We run with different random seeds and hyperparameters and use the validation set to pick the best model. We use an Adam optimizer with initial learning rate 10^-3. We set the discounting factor γ to 0.4, as it has been shown that using a smaller discount factor than defined can be viewed as regularization (Jiang et al., 2015). Since the parameter updates are performed at the end of each episode, we cache the term pair representations and reuse them when the same term pairs are encountered again in the same episode. As a result, the proposed approach is very time efficient – each training epoch takes less than 20 minutes on a single-core CPU using DyNet (Neubig et al., 2017).

4 Experiments

We design two experiments to demonstrate the effectiveness of our proposed RL approach for taxonomy induction. First, we compare our end-to-end approach with two-phase methods and show that our approach yields taxonomies with higher quality through reducing error propagation and optimizing towards holistic metrics. Second, we conduct a controlled experiment on hypernymy organization, where the same hypernym graph is used as the input of both our approach and the compared methods. We show that our RL method is more effective at hypernymy organization.

4.1 Experiment Setup

Here we introduce the details of our two experiments, which validate that (1) the proposed approach can effectively reduce error propagation; and (2) our approach yields better taxonomies via optimizing metrics on holistic taxonomy structure.

Performance Study on End-to-End Taxonomy Induction. In the first experiment, we show that our joint learning approach is superior to two-phase methods. Towards this goal, we compare with TAXI (Panchenko et al., 2016), a typical two-phase approach; two-phase HypeNET, implemented as pairwise hypernymy detection followed by hypernymy organization using MST; and Bansal et al. (2014). The dataset we use in this experiment is from Bansal et al. (2014), which is a set of medium-sized full-domain taxonomies consisting of bottom-out full subtrees sampled from WordNet. Terms in different taxonomies are from various domains such as animals, general concepts, and daily necessities. Each taxonomy is of height four (i.e., 4 nodes from root to leaf) and contains (10, 50] nodes. The dataset contains 761 non-overlapping taxonomies in total and is partitioned by 70/15/15% (533/114/114) into training, validation, and test sets, respectively.

Testing on Hypernymy Organization. In the second experiment, we show that our approach is better at hypernymy organization by leveraging the global taxonomy structure. For a fair comparison, we reuse the hypernym graph as in TAXI (Panchenko et al., 2016) and SubSeq (Gupta et al., 2017) so that the inputs of each model are the same. Specifically, we restrict the action space to be the same as the baselines by considering only term pairs in the hypernym graph, rather than all |V| × |T| possible term pairs (see the sketch below). As a result, it is possible that at some point no more hypernym candidates can be found but the remaining vocabulary is still not empty. If the induction terminates at this point, we call it a partial induction. We can also continue the induction by restoring the original action space at this moment so that all the terms in V are eventually attached to the taxonomy. We call this setting a full induction. In this experiment, we use the English environment and science taxonomies in the SemEval-2016 task 13 (TExEval-2) (Bordea et al., 2016). Each taxonomy is composed of hundreds of terms, which is much larger than the WordNet taxonomies. The taxonomies are aggregated from existing resources such as WordNet, Eurovoc⁶, and the Wikipedia Bitaxonomy (Flati et al., 2014). Since this dataset provides no training data, we train our model using the WordNet dataset from the first experiment. To avoid possible overlap between these two sources, we exclude those taxonomies constructed from WordNet.

In both experiments, we combine three public corpora – the latest Wikipedia dump, the UMBC web-based corpus (Han et al., 2013), and the One Billion Word Language Modeling Benchmark (Chelba et al., 2013). Only sentences where term pairs co-occur are retained, which results in a corpus of size 2.6 GB for the WordNet dataset and 810 MB for the TExEval-2 dataset. Dependency paths between term pairs are extracted from the corpus via spaCy⁷.

⁶ http://eurovoc.europa.eu/drupal/
⁷ https://spacy.io/
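To make the partial and full induction settings concrete, the small sketch below filters the candidate actions by the noisy hypernym graph, building on the hypothetical TaxonomyEnv sketch from Section 2.1; the variable names and the fallback rule are illustrative only.

```python
# Illustrative filtering of candidate actions by the noisy hypernym graph.
# hypernym_graph holds the (hyponym, hypernym) candidates reused from TAXI/SubSeq.

def restricted_actions(env, hypernym_graph, full_induction=False):
    candidates = [(x1, x2) for (x1, x2) in env.candidate_actions()
                  if (x1, x2) in hypernym_graph]
    if candidates:
        return candidates
    if full_induction:
        # Full induction: fall back to the unrestricted action space so that
        # every remaining term is eventually attached.
        return env.candidate_actions()
    # Partial induction: stop early, leaving some terms unattached.
    return []
```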

Table 1: Results of the end-to-end taxonomy induction experiment. Our approach significantly outperforms two-phase methods (Panchenko et al., 2016; Shwartz et al., 2016; Bansal et al., 2014). Bansal et al. (2014) and TaxoRL (NR) + FG are listed separately because they use extra resources.

Model                  Pa    Ra    F1a   Pe    Re    F1e
TAXI                   66.1  13.9  23.0  54.8  18.0  27.1
HypeNET                32.8  26.7  29.4  26.1  17.2  20.7
HypeNET+MST            33.7  41.1  37.0  29.2  29.2  29.2
TaxoRL (RE)            35.8  47.4  40.8  35.4  35.4  35.4
TaxoRL (NR)            41.3  49.2  44.9  35.6  35.6  35.6
Bansal et al. (2014)   48.0  55.2  51.4  -     -     -
TaxoRL (NR) + FG       52.9  58.6  55.6  43.8  43.8  43.8

Table 2: Results of the hypernymy organization experiment. Our approach outperforms Panchenko et al. (2016); Gupta et al. (2017) when the same hypernym graph is used as input. The precision of partial induction in both metrics is high. The precision of full induction is relatively lower but its recall is much higher.

Domain  Model             Pa    Ra    F1a   Pe    Re    F1e
Env     TAXI (DAG)        50.1  32.7  39.6  33.8  26.8  29.9
Env     TAXI (tree)       67.5  30.8  42.3  41.1  23.1  29.6
Env     SubSeq            -     -     -     -     -     22.4
Env     TaxoRL (Partial)  51.6  36.4  42.7  37.5  24.2  29.4
Env     TaxoRL (Full)     47.2  54.6  50.6  32.3  32.3  32.3
Sci     TAXI (DAG)        61.6  41.7  49.7  38.8  34.8  36.7
Sci     TAXI (tree)       76.8  38.3  51.1  44.8  28.8  35.1
Sci     SubSeq            -     -     -     -     -     39.9
Sci     TaxoRL (Partial)  84.6  34.4  48.9  56.9  33.0  41.8
Sci     TaxoRL (Full)     68.3  52.9  59.6  37.9  37.9  37.9

4.2 Evaluation Metrics

Ancestor-F1. It compares the ancestors ("is-a" pairs) on the predicted taxonomy with those on the gold taxonomy. We use Pa, Ra, F1a to denote the precision, recall, and F1-score, respectively:

    Pa = |is-a_sys ∧ is-a_gold| / |is-a_sys|,    Ra = |is-a_sys ∧ is-a_gold| / |is-a_gold|.

Edge-F1. It is more strict than Ancestor-F1 since it only compares predicted edges with gold edges. Similarly, we denote the edge-based metrics as Pe, Re, and F1e, respectively. Note that Pe = Re = F1e if the number of predicted edges is the same as the number of gold edges.

4.3 Results

Comparison on End-to-End Taxonomy Induction. Table 1 shows the results of the first experiment. HypeNET (Shwartz et al., 2016) uses additional surface features described in Section 2.2. HypeNET+MST extends HypeNET by first constructing a hypernym graph using HypeNET's output as weights of edges and then finding the MST (Chu, 1965) of this graph. TaxoRL (RE) denotes our RL approach which assumes a common Root Embedding, and TaxoRL (NR) denotes its variant that allows a New Root to be added. We can see that TAXI has the lowest F1a while HypeNET performs the worst in F1e. Both TAXI's and HypeNET's F1a and F1e are lower than 30. HypeNET+MST outperforms HypeNET in both F1a and F1e, because it considers the global taxonomy structure, although the two phases are performed independently. TaxoRL (RE) uses exactly the same input as HypeNET+MST and yet achieves significantly better performance, which demonstrates the superiority of combining the phases of hypernymy detection and hypernymy organization. Also, we found that presuming a shared root embedding for all taxonomies can be inappropriate if they are from different domains, which explains why TaxoRL (NR) performs better than TaxoRL (RE). Finally, after we add the frequency and generality features (TaxoRL (NR) + FG), our approach outperforms Bansal et al. (2014), even if a much smaller corpus is used.⁸

Analysis on Hypernymy Organization. Table 2 lists the results of the second experiment. TAXI (DAG) (Panchenko et al., 2016) denotes TAXI's original performance on the TExEval-2 dataset.⁹ Since we don't allow DAGs in our setting, we convert its results to trees (denoted by TAXI (tree)) by only keeping the first parent of each node. SubSeq (Gupta et al., 2017) also reuses TAXI's hypernym candidates. TaxoRL (Partial) and TaxoRL (Full) denote partial induction and full induction, respectively. Our joint RL approach outperforms baselines in both domains substantially. TaxoRL (Partial) achieves higher precision in both ancestor-based and edge-based metrics but has relatively lower recall since it discards some terms.

⁸ Bansal et al. (2014) use an unavailable resource (Brants and Franz, 2006) which contains one trillion tokens, while our public corpus contains several billion tokens. The frequency and generality features are sparse because the vocabulary that TAXI (in the TExEval-2 competition) used for focused crawling and hypernymy detection was different.
⁹ alt.qcri.org/semeval2016/task13/index.php?id=evaluation
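The two metrics above can be computed from parent maps roughly as follows. This is a simplified sketch that treats a taxonomy as a child-to-parent dictionary and counts strict ancestors; it is not the official TExEval-2 scorer, whose root handling and normalization may differ.

```python
# Simplified sketch of Ancestor-F1 and Edge-F1 over child -> parent maps.

def ancestor_pairs(parent):
    """All (descendant, ancestor) is-a pairs implied by a child -> parent map."""
    pairs = set()
    for child in parent:
        node = child
        while node in parent:            # walk up to the root
            node = parent[node]
            pairs.add((child, node))
    return pairs

def f1(pred, gold):
    correct = len(pred & gold)
    if correct == 0:
        return 0.0, 0.0, 0.0
    p, r = correct / len(pred), correct / len(gold)
    return p, r, 2 * p * r / (p + r)

def evaluate(pred_parent, gold_parent):
    pa, ra, f1a = f1(ancestor_pairs(pred_parent), ancestor_pairs(gold_parent))
    pe, re, f1e = f1(set(pred_parent.items()), set(gold_parent.items()))
    return {"Pa": pa, "Ra": ra, "F1a": f1a, "Pe": pe, "Re": re, "F1e": f1e}
```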

In addition, it achieves the best F1e in the science domain. TaxoRL (Full) has the highest recall in both domains and metrics, with the compromise of lower precision. Overall, TaxoRL (Full) performs the best in both domains in terms of F1a and achieves the best F1e in the environment domain.

Table 3: Ablation study on the WordNet dataset (Bansal et al., 2014). Pe and Re are omitted because they are the same as F1e for each model. We can see that our approach benefits from multiple sources of information, which are complementary to each other.

Model                       Pa    Ra    F1a   F1e
Distributional Info         27.1  24.3  25.6  13.8
Path-based Info             27.8  48.5  33.7  27.4
D + P                       36.6  39.4  37.9  28.3
D + P + Surface Features    41.3  49.2  44.9  35.6
D + P + S + FG              52.9  58.6  55.6  43.8

5 Ablation Analysis and Case Study

In this section, we conduct an ablation analysis and present a concrete case for better interpreting our model and experimental results.

Table 3 shows the ablation study of TaxoRL (NR) on the WordNet dataset. As one may find, different types of features are complementary to each other. Combining distributional and path-based features performs better than using either of them alone (Shwartz et al., 2016). Adding surface features helps model string-level statistics that are hard to capture by distributional or path-based features. A significant improvement is observed when more data is used, meaning that standard corpora (such as Wikipedia) might not be enough for complicated taxonomies like WordNet.

Fig. 3 shows the results for the taxonomy about filter. We denote the selected term pair at time step t as (hypo, hyper, t). Initially, the term water filter is randomly chosen as the taxonomy root. Then, a wrong term pair (water filter, air filter, 1) is selected, possibly due to the noise and sparsity of features, which makes the term air filter become the new root. (air filter, filter, 2) is selected next, and the current root becomes filter, which is identical to the real root. After that, term pairs such as (fuel filter, filter, 3) and (coffee filter, filter, 4) are selected correctly, mainly because of the substring inclusion intuition. Other term pairs such as (colander, strainer, 13) and (glass wool, filter, 16) are discovered later, largely by the information encoded in the dependency paths and embeddings. As for the undiscovered relations, (filter tip, air filter) has no dependency path in the corpus. sifter is attached to the taxonomy before its hypernym sieve. There is no co-occurrence between bacteria bed (or drain basket) and other terms. In addition, it is hard to utilize the surface features since they "look different" from other terms. That is also why (bacteria bed, air filter, 17) and (drain basket, air filter, 18) are attached in the end: our approach prefers to select term pairs with high confidence first.

Figure 3: The gold taxonomy in WordNet is on the left. The predicted taxonomy is on the right. The numbers indicate the order of term pair selections. Term pairs with high confidence are selected first.

6 Related Work

6.1 Hypernymy Detection

Finding high-quality hypernyms is of great importance since it serves as the first step of taxonomy induction. In previous works, there are mainly two categories of approaches for hypernymy detection, namely pattern-based and distributional methods. Pattern-based methods consider lexico-syntactic patterns between the joint occurrences of term pairs for hypernymy detection. They generally achieve high precision but suffer from low recall. Typical methods that leverage patterns for hypernym extraction include (Hearst, 1992; Snow et al., 2005; Kozareva and Hovy, 2010; Panchenko et al., 2016; Nakashole et al., 2012). Distributional methods leverage the contexts of each term separately, so the co-occurrence of term pairs is unnecessary. Some distributional methods are developed in an unsupervised manner: measures such as symmetric similarity (Lin et al., 1998) and those based on the distributional inclusion hypothesis (Weeds et al., 2004; Chang et al., 2017) have been proposed. Supervised methods, on the other hand, usually have better performance than unsupervised methods for hypernymy detection. Recent works in this direction include (Fu et al., 2014; Rimell, 2014; Yu et al., 2015; Tuan et al., 2016; Shwartz et al., 2016).

6.2 Taxonomy Induction

There are many lines of work on taxonomy induction in the prior literature. One line of works (Snow et al., 2005; Yang and Callan, 2009; Shen et al., 2012; Jurgens and Pilehvar, 2015) aims to complete existing taxonomies by attaching new terms in an incremental way.

Snow et al. (2005) enrich WordNet by maximizing the probability of an extended taxonomy given evidence of relations from text corpora. Shen et al. (2012) determine whether an entity is on the taxonomy and either attach it to the right category or link it to an existing one based on the results. Another line of works (Suchanek et al., 2007; Ponzetto and Strube, 2008; Flati et al., 2014) focuses on the taxonomy induction of existing encyclopedias (e.g., Wikipedia), mainly by exploiting the fact that they are already organized into semi-structured data. To deal with the issue of incomplete coverage, some works (Liu et al., 2012; Dong et al., 2014; Panchenko et al., 2016; Kozareva and Hovy, 2010) utilize data from domain-specific resources or the Web. Panchenko et al. (2016) extract hypernyms by patterns from general purpose corpora and domain-specific corpora bootstrapped from the input vocabulary. Kozareva and Hovy (2010) harvest new terms from the Web by employing Hearst-like lexico-syntactic patterns and validate the learned is-a relations by a web-based concept positioning procedure.

Many works (Kozareva and Hovy, 2010; Anh et al., 2014; Velardi et al., 2013; Bansal et al., 2014; Zhang et al., 2016; Panchenko et al., 2016; Gupta et al., 2017) cast the task of hypernymy organization as a graph optimization problem. Kozareva and Hovy (2010) begin with a set of root terms and leaf terms and aim to generate intermediate terms by deriving the longest path from the root to leaf in a noisy hypernym graph. Velardi et al. (2013) induct a taxonomy from the hypernym graph via optimal branching and a weighting policy. Bansal et al. (2014) regard the induction of a taxonomy as a structured learning problem by building a factor graph to model the relations between edges and siblings, and output the MST found by the Chu-Liu/Edmonds algorithm (Chu, 1965). Zhang et al. (2016) propose a probabilistic Bayesian model which incorporates visual features (images) in addition to text features (words) to improve the performance. The optimal taxonomy is also found by the MST. Gupta et al. (2017) extract hypernym subsequences based on hypernym pairs, and regard the task of taxonomy induction as an instance of the minimum-cost flow problem.

7 Conclusion and Future Work

This paper presents a novel end-to-end reinforcement learning approach for automatic taxonomy induction. Unlike previous two-phase methods that treat term pairs independently or equally, our approach learns the representations of term pairs by optimizing a holistic tree metric over the training taxonomies. The error propagation between the two phases is thus effectively reduced, and the global taxonomy structure is better captured. Experiments on two public datasets from different domains show that our approach outperforms state-of-the-art methods significantly. In the future, we will explore more strategies for term pair selection (e.g., allowing the RL agent to remove terms from the taxonomy) and reward function design. In addition, studying how to effectively encode induction history will be interesting.

Acknowledgments

Research was sponsored in part by U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), DARPA under Agreement No. W911NF-17-C-0099, National Science Foundation IIS 16-18481, IIS 17-04532, and IIS-17-41317, and grant 1U54GM-114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). The views and conclusions contained in this document are those of the author(s) and should not be interpreted as representing the official policies of the U.S. Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. We thank Mohit Bansal, Hao Zhang, and anonymous reviewers for valuable feedback.

2470 References Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning semantic hier- Tuan Luu Anh, Jung-jae Kim, and See Kiong Ng. 2014. archies via word embeddings. In Proceedings of the Taxonomy construction using syntactic contextual 52nd Annual Meeting of the Association for Compu- evidence. In Proceedings of the 2014 Conference on tational Linguistics (Volume 1: Long Papers), vol- Empirical Methods in Natural Language Processing ume 1, pages 1199–1209. (EMNLP), pages 810–819. Mohit Bansal, David Burkett, Gerard De Melo, and Amit Gupta, Remi´ Lebret, Hamza Harkous, and Karl Dan Klein. 2014. Structured learning for taxonomy Aberer. 2017. Taxonomy induction using hypernym induction with belief propagation. In ACL (1), pages subsequences. In Proceedings of the 2017 ACM 1041–1051. on Conference on Information and Knowledge Man- agement, pages 1329–1338. ACM. Marco Baroni and Alessandro Lenci. 2011. How we blessed distributional semantic evaluation. In Pro- Lushan Han, Abhay L Kashyap, Tim Finin, ceedings of the GEMS 2011 Workshop on GEomet- James Mayfield, and Jonathan Weese. 2013. rical Models of Natural Language Semantics, pages Umbc ebiquity-core: Semantic textual similarity 1–10. Association for Computational Linguistics. systems. In * SEM@ NAACL-HLT, pages 44–52.

Georgeta Bordea, Els Lefever, and Paul Buitelaar. Marti A Hearst. 1992. Automatic acquisition of hy- 2016. Semeval-2016 task 13: Taxonomy extrac- ponyms from large text corpora. In Proceedings of tion evaluation (texeval-2). In SemEval-2016, pages the 14th conference on Computational linguistics- 1081–1091. Association for Computational Linguis- Volume 2, pages 539–545. Association for Compu- tics. tational Linguistics.

Thorsten Brants and Alex Franz. 2006. Web 1t 5-gram Nan Jiang, Alex Kulesza, Satinder Singh, and Richard corpus version 1.1. Google Inc. Lewis. 2015. The dependence of effective plan- ning horizon on model accuracy. In Proceedings of Jose Camacho-Collados. 2017. Why we have switched the 2015 International Conference on Autonomous from building full-fledged taxonomies to simply Agents and Multiagent Systems, pages 1181–1189. detecting hypernymy relations. arXiv preprint International Foundation for Autonomous Agents arXiv:1703.04178. and Multiagent Systems. Haw-Shiuan Chang, ZiYun Wang, Luke Vilnis, and Andrew McCallum. 2017. Unsupervised hypernym David Jurgens and Mohammad Taher Pilehvar. 2015. detection by distributional inclusion vector embed- Reserating the awesometastic: An automatic exten- ding. arXiv preprint arXiv:1710.00880. sion of the wordnet taxonomy for novel terms. In Proceedings of the 2015 Conference of the North Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, American Chapter of the Association for Computa- Thorsten Brants, Phillipp Koehn, and Tony Robin- tional Linguistics: Human Language Technologies, son. 2013. One billion word benchmark for measur- pages 1459–1465. ing progress in statistical language modeling. arXiv preprint arXiv:1312.3005. Zornitsa Kozareva and Eduard Hovy. 2010. A semi-supervised method to learn and construct tax- Yoeng-Jin Chu. 1965. On the shortest arborescence of onomies using the web. In Proceedings of the a directed graph. Science Sinica, 14:1396–1400. 2010 conference on empirical methods in natural language processing, pages 1110–1118. Association Thomas Demeester, Tim Rocktaschel,¨ and Sebastian for Computational Linguistics. Riedel. 2016. Lifted rule injection for relation em- beddings. In EMNLP. Douglas B Lenat. 1995. Cyc: A large-scale investment Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko in knowledge infrastructure. Communications of the Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, ACM, 38(11):33–38. Shaohua Sun, and Wei Zhang. 2014. Knowledge vault: A web-scale approach to probabilistic knowl- Dekang Lin et al. 1998. An information-theoretic defi- edge fusion. In Proceedings of the 20th ACM nition of similarity. In Icml, volume 98, pages 296– SIGKDD international conference on Knowledge 304. Citeseer. discovery and data mining, pages 601–610. ACM. Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, Stefano Faralli, Giovanni Stilo, and Paola Velardi. and Maosong Sun. 2016. Neural relation extraction 2015. Large scale homophily analysis in twitter us- with selective attention over instances. In ACL (1). ing a twixonomy. In IJCAI, pages 2334–2340. Xueqing Liu, Yangqiu Song, Shixia Liu, and Haixun Tiziano Flati, Daniele Vannella, Tommaso Pasini, and Wang. 2012. Automatic taxonomy construction Roberto Navigli. 2014. Two is bigger (and better) from keywords. In Proceedings of the 18th ACM than one: the wikipedia bitaxonomy project. In ACL SIGKDD international conference on Knowledge (1), pages 945–955. discovery and data mining, pages 1433–1441. ACM.

2471 George A Miller. 1995. Wordnet: a lexical database for Fabian M Suchanek, Gjergji Kasneci, and Gerhard english. Communications of the ACM, 38(11):39– Weikum. 2007. Yago: a core of semantic knowl- 41. edge. In Proceedings of the 16th international con- ference on World Wide Web, pages 697–706. ACM. Ndapandula Nakashole, Gerhard Weikum, and Fabian Suchanek. 2012. Patty: a taxonomy of relational Luu A Tuan, Yi Tay, Siu C Hui, and See K Ng. 2016. patterns with semantic types. In Proceedings of Learning term embeddings for taxonomic relation the 2012 Joint Conference on Empirical Methods identification using dynamic weighting neural net- in Natural Language Processing and Computational work. In Proceedings of the EMNLP conference, Natural Language Learning, pages 1135–1145. As- pages 403–413. sociation for Computational Linguistics. Paola Velardi, Stefano Faralli, and Roberto Navigli. Graham Neubig, Chris Dyer, Yoav Goldberg, Austin 2013. Ontolearn reloaded: A graph-based algorithm Matthews, Waleed Ammar, Antonios Anastasopou- for taxonomy induction. Computational Linguistics, los, Miguel Ballesteros, David Chiang, Daniel 39(3):665–707. Clothiaux, Trevor Cohn, et al. 2017. Dynet: The Julie Weeds, David Weir, and Diana McCarthy. 2004. dynamic neural network toolkit. arXiv preprint Characterising measures of lexical distributional arXiv:1701.03980. similarity. In Proceedings of the 20th interna- tional conference on Computational Linguistics, Alexander Panchenko, Stefano Faralli, Eugen Ruppert, page 1015. Association for Computational Linguis- Steffen Remus, Hubert Naets, Cedrick Fairon, Si- tics. mone Paolo Ponzetto, and Chris Biemann. 2016. Taxi at semeval-2016 task 13: a taxonomy induc- Ronald J Williams. 1992. Simple statistical gradient- tion method based on lexico-syntactic patterns, sub- following for connectionist reinforce- strings and focused crawling. In Proceedings of the ment learning. , 8(3-4):229–256. 10th International Workshop on Semantic Evalua- tion, San Diego, CA, USA. Association for Compu- Ichiro Yamada, Kentaro Torisawa, Jun’ichi Kazama, tational Linguistics. Kow Kuroda, Masaki Murata, Stijn De Saeger, Fran- cis Bond, and Asuka Sumida. 2009. Hypernym dis- Jeffrey Pennington, Richard Socher, and Christopher covery based on distributional similarity and hierar- Manning. 2014. Glove: Global vectors for word chical structures. In Proceedings of the 2009 Con- representation. In Proceedings of the 2014 confer- ference on Empirical Methods in Natural Language ence on empirical methods in natural language pro- Processing: Volume 2-Volume 2, pages 929–937. cessing (EMNLP), pages 1532–1543. Association for Computational Linguistics.

Simone Paolo Ponzetto and Michael Strube. 2008. Hui Yang and Jamie Callan. 2009. A metric-based Wikitaxonomy: A large scale knowledge resource. framework for automatic taxonomy induction. In In ECAI, volume 178, pages 751–752. Proceedings of the Joint Conference of the 47th An- nual Meeting of the ACL and the 4th International Laura Rimell. 2014. Distributional lexical entailment Joint Conference on Natural Language Processing by topic coherence. In Proceedings of the 14th Con- of the AFNLP: Volume 1-Volume 1, pages 271–279. ference of the European Chapter of the Association Association for Computational Linguistics. for Computational Linguistics, pages 511–519. Shuo Yang, Lei Zou, Zhongyuan Wang, Jun Yan, and Mark Sammons. 2012. Recognizing textual entail- Ji-Rong Wen. 2017. Efficiently answering technical ment. questions - a knowledge graph approach. In AAAI. Zheng Yu, Haixun Wang, Xuemin Lin, and Min Wang. Wei Shen, Jianyong Wang, Ping Luo, and Min Wang. 2015. Learning term embeddings for hypernymy 2012. A graph-based approach for ontology popula- identification. In IJCAI, pages 1390–1397. tion with named entities. In Proceedings of the 21st ACM international conference on Information and Hao Zhang, Zhiting Hu, Yuntian Deng, Mrinmaya knowledge management, pages 345–354. ACM. Sachan, Zhicheng Yan, and Eric Xing. 2016. Learn- ing concept taxonomies from multi-modal data. In Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Proceedings of the 54th Annual Meeting of the As- Improving hypernymy detection with an integrated sociation for Computational Linguistics (Volume 1: path-based and distributional method. In Proceed- Long Papers), volume 1, pages 1791–1801. ings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor An- pers), volume 1, pages 2389–2398. geli, and Christopher D Manning. 2017. Position- aware attention and supervised data improve slot fill- Rion Snow, Daniel Jurafsky, and Andrew Y Ng. 2005. ing. In Proceedings of the 2017 Conference on Em- Learning syntactic patterns for automatic hypernym pirical Methods in Natural Language Processing, discovery. In Advances in neural information pro- pages 35–45. cessing systems, pages 1297–1304.
