
Structured Learning for Taxonomy Induction with Belief Propagation

Mohit Bansal, TTI Chicago ([email protected])
David Burkett, Twitter Inc. ([email protected])
Gerard de Melo, Tsinghua University ([email protected])
Dan Klein, UC Berkeley ([email protected])

Abstract

We present a structured learning approach to inducing hypernym taxonomies using a probabilistic graphical model formulation. Our model incorporates heterogeneous relational evidence about both hypernymy and siblinghood, captured by semantic features based on patterns and statistics from Web n-grams and Wikipedia abstracts. For efficient inference over taxonomy structures, we use loopy belief propagation along with a directed spanning tree algorithm for the core hypernymy factor. To train the system, we extract sub-structures of WordNet and discriminatively learn to reproduce them, using adaptive subgradient stochastic optimization. On the task of reproducing sub-hierarchies of WordNet, our approach achieves a 51% error reduction over a chance baseline, including a 15% error reduction due to the non-hypernym-factored sibling features. On a comparison setup, we find up to 29% relative error reduction over previous work on ancestor F1.

[Figure 1 diagram: an excerpt of the WordNet subtree rooted at vertebrate, with children mammal and reptile; mammal dominates placental (cow, rodent) and metatherian (marsupial), rodent dominates squirrel and rat, marsupial dominates kangaroo; reptile dominates diapsid (snake, crocodilian) and anapsid (chelonian), chelonian dominates turtle.]

Figure 1: An excerpt of WordNet's vertebrates taxonomy.

1 Introduction

Many tasks in natural language understanding, such as question answering, information extraction, and textual entailment, benefit from lexical semantic information in the form of types and hypernyms. A recent example is IBM's Jeopardy! system (Ferrucci et al., 2010), which used type information to restrict the set of answer candidates. Information of this sort is present in term taxonomies (e.g., Figure 1), ontologies, and thesauri. However, currently available taxonomies such as WordNet are incomplete in coverage (Pennacchiotti and Pantel, 2006; Hovy et al., 2009), unavailable in many domains and languages, and time-intensive to create or extend manually. There has thus been considerable interest in building lexical taxonomies automatically.

In this work, we focus on the task of taking collections of terms as input and predicting a complete taxonomy structure over them as output. Our model takes a loglinear form and is represented using a factor graph that includes both 1st-order scoring factors on directed hypernymy edges (a parent and child in the taxonomy) and 2nd-order scoring factors on sibling edge pairs (pairs of hypernym edges with a shared parent), as well as incorporating a global (directed spanning tree) structural constraint. Inference for both learning and decoding uses structured loopy belief propagation (BP), incorporating standard spanning tree algorithms (Chu and Liu, 1965; Edmonds, 1967; Tutte, 1984). The belief propagation approach allows us to efficiently and effectively incorporate heterogeneous relational evidence via hypernymy and siblinghood (e.g., coordination) cues, which we capture by semantic features based on simple surface patterns and statistics from Web n-grams and Wikipedia abstracts. We train our model to maximize the likelihood of existing example ontologies using stochastic optimization, automatically learning the most useful relational patterns for full taxonomy induction.

As an example of the relational patterns that our system learns, suppose we are interested in building a taxonomy for types of mammals (see Figure 1). Frequent attestation of hypernymy patterns like rat is a rodent in large corpora is a strong signal of the link rodent → rat. Moreover, sibling or coordination cues like either rats or squirrels suggest that rat is a sibling of squirrel and add evidence for the links rodent → rat and rodent → squirrel. Our supervised model captures exactly these types of intuitions by automatically discovering such heterogeneous relational patterns as features (and learning their weights) on edges and on sibling edge pairs, respectively.

There have been several previous studies on taxonomy induction, e.g., the incremental taxonomy induction system of Snow et al. (2006), the longest path approach of Kozareva and Hovy (2010), and the maximum spanning tree (MST) approach of Navigli et al. (2011) (see Section 4 for a more detailed overview). The main contribution of this work is that we present the first discriminatively trained, structured probabilistic model over the full space of taxonomy trees, using a structured inference procedure through both the learning and decoding phases. Our model is also the first to directly learn relational patterns as part of the process of training an end-to-end taxonomic induction system, rather than using patterns that were hand-selected or learned via pairwise classifiers on manually annotated co-occurrence patterns. Finally, it is the first end-to-end (i.e., non-incremental) system to include sibling (e.g., coordination) patterns at all.

We test our approach in two ways. First, on the task of recreating fragments of WordNet, we achieve a 51% error reduction on ancestor-based F1 over a chance baseline, including a 15% error reduction due to the non-hypernym-factored sibling features. Second, we also compare to the results of Kozareva and Hovy (2010) by predicting the large animal subtree of WordNet. Here, we get up to 29% relative error reduction on ancestor-based F1. We note that our approach falls at a different point in the space of performance trade-offs from past work – by producing complete, highly articulated trees, we naturally see a more even balance between precision and recall, while past work generally focused on precision.¹ To avoid presumption of a single optimal tradeoff, we also present results for precision-based decoding, where we trade off recall for precision.

2 Structured Taxonomy Induction

Given an input term set x = {x_1, x_2, ..., x_n}, we wish to compute the conditional distribution over taxonomy trees y. This distribution P(y|x) is represented using the graphical model formulation shown in Figure 2. A taxonomy tree y is composed of a set of indicator random variables y_ij (circles in Figure 2), where y_ij = ON means that x_i is the parent of x_j in the taxonomy tree (i.e., there exists a directed edge from x_i to x_j). One such variable exists for each pair (i, j) with 0 ≤ i ≤ n, 1 ≤ j ≤ n, and i ≠ j.²

In a factor graph formulation, a set of factors (squares and rectangles in Figure 2) determines the probability of each possible variable assignment. Each factor F has an associated scoring function φ_F, with the probability of a total assignment determined by the product of all these scores:

    P(y | x) \propto \prod_F \phi_F(y)    (1)

2.1 Factor Types

In the models we present here, there are three types of factors: EDGE factors that score individual edges in the taxonomy tree, SIBLING factors that score pairs of edges with a shared parent, and a global TREE factor that imposes the structural constraint that y form a legal taxonomy tree.

EDGE Factors. For each edge variable y_ij in the model, there is a corresponding factor E_ij (small blue squares in Figure 2) that depends only on y_ij. We score each edge by extracting a set of features f(x_i, x_j) and weighting them by the (learned) weight vector w. So, the factor scoring function is:

    \phi_{E_{ij}}(y_{ij}) = \begin{cases} \exp(w \cdot f(x_i, x_j)) & \text{if } y_{ij} = \text{ON} \\ \exp(0) = 1 & \text{if } y_{ij} = \text{OFF} \end{cases}

¹ While different applications will value precision and recall differently, and past work was often intentionally precision-focused, it is certainly the case that an ideal solution would maximize both.
² We assume a special dummy root symbol x_0.
³ The ordering of the siblings x_j and x_k doesn't matter here, so having separate factors for (i, j, k) and (i, k, j) would be redundant.
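For concreteness, here is a minimal Python sketch of the EDGE scoring function and of the edges-only score of a full assignment (Equation 1 restricted to EDGE factors); the function names and data layout are ours, and the feature extractor producing the vectors is described in Section 3.

    import numpy as np

    def edge_potential(w, f_ij, on):
        # phi_E for one edge variable y_ij: exp(w . f(x_i, x_j)) when the
        # edge is ON, and exp(0) = 1 when it is OFF.
        return float(np.exp(w @ f_ij)) if on else 1.0

    def edges_only_score(w, feats, on_edges):
        # Unnormalized score of a full assignment under the edges-only model:
        # the product over factors in Equation 1 reduces to a product over
        # the ON edges, i.e. exp(w . f(y)). feats[(i, j)] holds f(x_i, x_j).
        return float(np.exp(sum(w @ feats[e] for e in on_edges)))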

[Figure 2 diagram: edge variables y_ij (including root edges y_01 ... y_0n) with per-edge factors E_ij and a global TREE factor T in panel (a), Edge Features Only; panel (b), Full Model, additionally includes SIBLING factors S_ijk such as S_12n, S_21n, and S_n12.]

Figure 2: Factor graph representation of our model, both without (a) and with (b) SIBLING factors.

SIBLING Factors. Our second model also includes factors that permit 2nd-order features looking at terms that are siblings in the taxonomy tree. For each triple (i, j, k) with i ≠ j, i ≠ k, and j < k,³ we have a factor S_ijk (green rectangles in Figure 2b) that depends on y_ij and y_ik, and thus can be used to encode features that should be active whenever x_j and x_k share the same parent, x_i. The scoring function is similar to the one above:

    \phi_{S_{ijk}}(y_{ij}, y_{ik}) = \begin{cases} \exp(w \cdot f(x_i, x_j, x_k)) & \text{if } y_{ij} = y_{ik} = \text{ON} \\ 1 & \text{otherwise} \end{cases}

TREE Factor. Of course, not all variable assignments y form legal taxonomy trees (i.e., directed spanning trees). For example, the assignment ∀i, j: y_ij = ON might get a high score, but would not be a valid output of the model. Thus, we need to impose a structural constraint to ensure that such illegal variable assignments are assigned 0 probability by the model. We encode this in our factor graph setting using a single global factor T (shown as a large red square in Figure 2) with the following scoring function:

    \phi_T(y) = \begin{cases} 1 & \text{if } y \text{ forms a legal taxonomy tree} \\ 0 & \text{otherwise} \end{cases}

Model. For a given global assignment y, let

    f(y) = \sum_{i,j : y_{ij} = \text{ON}} f(x_i, x_j) + \sum_{i,j,k : y_{ij} = y_{ik} = \text{ON}} f(x_i, x_j, x_k)

Note that by substituting our model's factor scoring functions into Equation 1, we get:

    P(y | x) \propto \begin{cases} \exp(w \cdot f(y)) & \text{if } y \text{ is a tree} \\ 0 & \text{otherwise} \end{cases}

Thus, our model has the form of a standard loglinear model with feature function f.

2.2 Inference via Belief Propagation

With the model defined, there are two main inference tasks we wish to accomplish: computing expected feature counts and selecting a particular taxonomy tree for a given set of input terms (decoding). As an initial step in each of these procedures, we wish to compute the marginal probabilities of particular edges (and pairs of edges) being on. In a factor graph, the natural inference procedure for computing marginals is belief propagation. Note that finding taxonomy trees is a problem structurally identical to finding directed spanning trees (and thereby to non-projective dependency parsing), for which belief propagation has previously been worked out in depth (Smith and Eisner, 2008). Therefore, we will only briefly sketch the procedure here.

Belief propagation is a general-purpose inference method that computes marginals via directed messages passed from variables to adjacent factors (and vice versa) in the factor graph. These messages take the form of (possibly unnormalized) distributions over values of the variable. The two types of messages (variable to factor or factor to variable) have mutually recursive definitions. The message from a factor F to an adjacent variable V involves a sum over all possible values of every other variable that F touches. While the EDGE and SIBLING factors are simple enough to compute this sum by brute force, performing the sum naïvely for computing messages from the TREE factor would take exponential time.

However, due to the structure of that particular factor, all of its outgoing messages can be computed simultaneously in O(n³) time via an efficient adaptation of Kirchhoff's Matrix Tree Theorem (MTT) (Tutte, 1984), which computes partition functions and marginals for directed spanning trees. Once message passing is completed, marginal beliefs are computed by merely multiplying together all the messages received by a particular variable or factor.

2.2.1 Loopy Belief Propagation

Looking closely at Figure 2a, one can observe that the factor graph for the first version of our model, containing only EDGE and TREE factors, is acyclic. In this special case, belief propagation is exact: after one round of message passing, the beliefs computed (as discussed in Section 2.2) will be the true marginal probabilities under the current model. However, in the full model, shown in Figure 2b, the SIBLING factors introduce cycles into the factor graph, and now the messages being passed around often depend on each other, and so they will change as they are recomputed. The process of iteratively recomputing messages based on earlier messages is known as loopy belief propagation. This procedure only finds approximate marginal beliefs and is not actually guaranteed to converge, but in practice it can be quite effective for finding workable marginals in models for which exact inference is intractable, as is the case here. All else being equal, the more rounds of message passing that are performed, the closer the computed marginal beliefs will be to the true marginals, though in practice there are usually diminishing returns after the first few iterations. In our experiments, we used a fairly conservative upper bound of 20 iterations, but in most cases the messages converged much earlier than that.

2.3 Training

We used gradient-based maximum likelihood training to learn the model parameters w. Since our model has a loglinear form, the derivative of the likelihood objective with respect to w is computed by just taking the gold feature vector and subtracting the vector of expected feature counts. For computing expected counts, we run belief propagation until completion and then, for each factor in the model, we simply read off the marginal probability of that factor being active (as computed in Section 2.2) and accumulate a partial count for each feature that is fired by that factor. This method of computing the gradient can be incorporated into any gradient-based optimizer in order to learn the weights w. In our experiments we used AdaGrad (Duchi et al., 2011), an adaptive subgradient variant of standard stochastic gradient ascent for online learning.

2.4 Decoding

Finally, once the model parameters have been learned, we want to use the model to find taxonomy trees for particular sets of input terms. Note that if we limit our scores to be edge-factored, then finding the highest scoring taxonomy tree becomes an instance of the MST problem (also known as the maximum arborescence problem for the directed case), which can be solved efficiently in O(n²) time (Tarjan, 1977) using the greedy, recursive Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967).⁴

Since the MST problem can be solved efficiently, the main challenge becomes finding a way to ensure that our scores are edge-factored. In the first version of our model, we could simply set the score of each edge to be w · f(x_i, x_j), and the MST recovered in this way would indeed be the highest scoring tree: argmax_y P(y|x). However, this straightforward approach doesn't apply to the full model, which also uses sibling features. Hence, at decoding time, we instead start out by once more using belief propagation to find marginal beliefs, and then set the score of each edge to be its belief odds ratio b_{y_ij}(ON) / b_{y_ij}(OFF).⁵

⁴ See Georgiadis (2003) for a detailed algorithmic proof, and McDonald et al. (2005) for an illustrative example. Also, we constrain the Chu-Liu-Edmonds MST algorithm to output only single-root MSTs, where the (dummy) root has exactly one child (Koo et al., 2007), because multi-root spanning 'forests' are not applicable to our task. Also, note that we currently assume one node per term. We are following the task description from previous work, where the goal is to create a taxonomy for a specific domain (e.g., animals); within a specific domain, terms typically just have a single sense. However, our algorithms could certainly be adapted to the case of multiple term senses (by treating the different senses as unique nodes in the tree) in future work.
⁵ The MST that is found using these edge scores is actually the minimum Bayes risk tree (Goodman, 1996) for an edge accuracy loss function (Smith and Eisner, 2008).
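To make the TREE-factor computation concrete, here is a minimal numpy sketch of single-root edge marginals via the Matrix-Tree Theorem, following the construction of Koo et al. (2007); the names are ours, and the multiplicative weights theta would come from exponentiated edge scores (or, inside loopy BP, from the incoming edge-variable messages).

    import numpy as np

    def mtt_edge_marginals(theta):
        # theta[h, m] > 0 is the multiplicative weight of edge x_h -> x_m,
        # with h in 0..n (0 = dummy root) and m in 1..n; theta has shape
        # (n+1, n+1) and column 0 is unused.
        n = theta.shape[0] - 1
        A = theta[1:, 1:].copy()          # weights between the n real terms
        np.fill_diagonal(A, 0.0)
        L = np.diag(A.sum(axis=0)) - A    # Laplacian over the real terms
        L[0, :] = theta[0, 1:]            # replace first row with root weights
        Linv = np.linalg.inv(L)
        Z = np.linalg.det(L)              # partition function over trees
        mu = np.zeros_like(theta)         # mu[h, m] = P(y_hm = ON)
        for m in range(1, n + 1):
            mu[0, m] = theta[0, m] * Linv[m - 1, 0]
            for h in range(1, n + 1):
                if h == m:
                    continue
                val = (Linv[m - 1, m - 1] if m != 1 else 0.0)
                val -= (Linv[m - 1, h - 1] if h != 1 else 0.0)
                mu[h, m] = theta[h, m] * val
        return mu, Z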

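Similarly, a minimal sketch of the decoding step of Section 2.4, assuming the networkx library for the Chu-Liu-Edmonds maximum spanning arborescence; the data layout is ours, we take logs so that the additive arborescence objective maximizes the product of the odds ratios, and the single-root constraint of footnote 4 is omitted for brevity.

    import math
    import networkx as nx  # assumed dependency; provides Chu-Liu-Edmonds

    def decode_taxonomy(beliefs):
        # beliefs[(i, j)] is the marginal belief b(y_ij = ON) from loopy BP,
        # with node 0 as the dummy root. Each edge is scored by its log
        # belief odds ratio, and we return the best-scoring arborescence
        # (the minimum Bayes risk tree for edge accuracy; footnote 5).
        G = nx.DiGraph()
        for (i, j), b_on in beliefs.items():
            score = math.log(b_on) - math.log(1.0 - b_on)
            G.add_edge(i, j, weight=score)
        return nx.maximum_spanning_arborescence(G)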
3 Features

While spanning trees are familiar from non-projective dependency parsing, features based on the linear order of the words or on lexical identities or syntactic word classes, which are primary drivers for dependency parsing, are mostly uninformative for taxonomy induction. Instead, inducing taxonomies requires world knowledge to capture the semantic relations between various unseen terms. For this, we use semantic cues to hypernymy and siblinghood via features on simple surface patterns and statistics in large text corpora. We fire features on both the edge and the sibling factors. We first describe all the edge features in detail (Sections 3.1 and 3.2), and then briefly describe the sibling features (Section 3.3), which are quite similar to the edge ones.

For each edge factor E_ij, which represents the potential parent-child term pair (x_i, x_j), we add the surface and semantic features discussed below. Note that since edges are directed, we have separate features for the factors E_ij versus E_ji.

3.1 Surface Features

Capitalization: Checks which of x_i and x_j are capitalized, with one feature for each value of the tuple (isCap(x_i), isCap(x_j)). The intuition is that leaves of a taxonomy are often proper names and hence capitalized, e.g., (bison, American bison). Therefore, the feature for (true, false) (i.e., parent capitalized but not the child) gets a substantially negative weight.

Ends with: Checks if x_j ends with x_i, or not. This captures pairs such as (fish, bony fish) in our data.

Contains: Checks if x_j contains x_i, or not. This captures pairs such as (bird, bird of prey).

Suffix match: Checks whether the k-length suffixes of x_i and x_j match, or not, for k = 1, 2, ..., 7.

LCS: We compute the longest common substring of x_i and x_j, and create indicator features for rounded-off and binned values of |LCS| / ((|x_i| + |x_j|) / 2).

Length difference: We compute the signed length difference between x_j and x_i, and create indicator features for rounded-off and binned values of (|x_j| − |x_i|) / ((|x_i| + |x_j|) / 2). Yang and Callan (2009) use a similar feature.
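As an illustration, a sketch of the surface feature extractor for one directed candidate edge follows; the paper says only that the real-valued features are rounded off and binned, so the granularity used below (a factor of 10) is our assumption.

    def surface_features(xi, xj):
        # Section 3.1 features for the directed pair (parent xi, child xj).
        feats = {}
        feats["cap=%s,%s" % (xi[0].isupper(), xj[0].isupper())] = 1.0
        feats["ends_with=%s" % xj.endswith(xi)] = 1.0
        feats["contains=%s" % (xi in xj)] = 1.0
        for k in range(1, 8):                      # suffix match, k = 1..7
            feats["suffix%d=%s" % (k, xi[-k:] == xj[-k:])] = 1.0
        # longest common substring via simple dynamic programming
        best, prev = 0, [0] * (len(xj) + 1)
        for a in range(1, len(xi) + 1):
            cur = [0] * (len(xj) + 1)
            for b in range(1, len(xj) + 1):
                if xi[a - 1] == xj[b - 1]:
                    cur[b] = prev[b - 1] + 1
                    best = max(best, cur[b])
            prev = cur
        avg = (len(xi) + len(xj)) / 2.0
        feats["lcs_bin=%d" % round(10 * best / avg)] = 1.0
        feats["lendiff_bin=%d" % round(10 * (len(xj) - len(xi)) / avg)] = 1.0
        return feats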

3.2 Semantic Features

3.2.1 Web n-gram Features

Patterns and counts: Hypernymy for a term pair (P=x_i, C=x_j) is often signaled by the presence of surface patterns like C is a P or P such as C in large text corpora, an observation going back to Hearst (1992). For each potential parent-child edge (P=x_i, C=x_j), we mine the top k strings (based on count) in which both x_i and x_j occur (we use k=200). We collect patterns in both directions, which allows us to judge the correct direction of an edge (e.g., C is a P is a positive signal for hypernymy whereas P is a C is a negative signal).⁶ Next, for each pattern in this top-k list, we compute its normalized pattern count c, and fire an indicator feature on the tuple (pattern, t) for all thresholds t (in a fixed set) s.t. c ≥ t. Our supervised model then automatically learns which patterns are good indicators of hypernymy.

Pattern order: We add features on the order (direction) in which the pair (x_i, x_j) was found in a pattern (in its top-k list) – indicator features for boolean values of the four cases: P...C, C...P, neither direction, and both directions. Ritter et al. (2009) used the 'both' case of this feature.

Individual counts: We also compute the individual Web-scale term counts c_{x_i} and c_{x_j}, and add a comparison feature (c_{x_i} > c_{x_j}), plus features on values of the signed count difference (c_{x_i} − c_{x_j}) / ((c_{x_i} + c_{x_j}) / 2), after rounding off and binning at multiple granularities. The intuition is that this feature could learn whether the relative popularity of the terms signals their hypernymy direction.

⁶ We also allow patterns with surrounding words, e.g., the C is a P and C, P of.
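A minimal sketch of the thresholded pattern features and the pattern-order feature just described; the threshold set and the input format (one map per direction from each mined top-k pattern string to its normalized count) are our assumptions.

    def ngram_pattern_features(fwd_patterns, rev_patterns,
                               thresholds=(0.01, 0.05, 0.1, 0.25)):
        # Indicator features for a candidate (P, C) pair. fwd_patterns maps
        # each top-k pattern string found in the P...C direction to its
        # normalized count c (rev_patterns likewise for C...P); we fire
        # (pattern, t) for every threshold t <= c, so the model can learn
        # per-pattern reliability.
        feats = {}
        for direction, patterns in (("P..C", fwd_patterns),
                                    ("C..P", rev_patterns)):
            for pattern, c in patterns.items():
                for t in thresholds:
                    if c >= t:
                        feats["%s|%s|t>=%g" % (direction, pattern, t)] = 1.0
        # pattern order: which of the four direction cases holds for the pair
        fwd, rev = bool(fwd_patterns), bool(rev_patterns)
        order = ("both" if fwd and rev else
                 "P..C" if fwd else "C..P" if rev else "neither")
        feats["order=" + order] = 1.0
        return feats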

3.2.2 Wikipedia Abstract Features

The Web n-grams corpus has broad coverage but is limited to up to 5-grams, so it may not contain pattern-based evidence for various longer multi-word terms and pairs. Therefore, we supplement it with a full-sentence resource, namely Wikipedia abstracts, which are concise descriptions (hence useful for signaling hypernymy) of a large variety of world entities.

Presence and distance: For each potential edge (x_i, x_j), we mine patterns from all abstracts in which the two terms co-occur in either order, allowing a maximum term distance of 20 (because beyond that, co-occurrence may not imply a relation). We add a presence feature based on whether the process above found at least one pattern for that term pair, or not. We also fire features on the value of the minimum distance d_min at which the two terms were found in some abstract (plus thresholded versions).

Patterns: For each term pair, we take the top-k′ patterns (based on count) of length up to l from its full list of patterns, and add an indicator feature on each pattern string (without the counts). We use k′=5, l=10. Similar to the Web n-grams case, we also fire Wikipedia-based pattern order features.

3.3 Sibling Features

We also incorporate similar features on sibling factors. For each sibling factor S_ijk, which represents the potential parent-children term triple (x_i, x_j, x_k), we consider the potential sibling term pair (x_j, x_k). Siblinghood for this pair would be indicated by the presence of surface patterns such as either C1 or C2 and C1 is similar to C2 in large corpora. Hence, we fire Web n-gram pattern features and Wikipedia presence, distance, and pattern features, similar to those described above, on each potential sibling term pair.⁷ The main difference here from the edge factors is that the sibling factors are symmetric (in the sense that S_ijk is redundant to S_ikj) and hence the patterns are undirected. Therefore, for each term pair, we first symmetrize the collected Web n-grams and Wikipedia patterns by accumulating the counts of symmetric patterns like rats or squirrels and squirrels or rats.⁸

4 Related Work

In our work, we assume a known term set and do not address the problem of extracting related terms from text. However, a great deal of past work has considered automating this process, typically taking one of two major approaches. The clustering-based approach (Lin, 1998; Lin and Pantel, 2002; Davidov and Rappoport, 2006; Yamada et al., 2009) discovers relations based on the assumption that similar concepts appear in similar contexts (Harris, 1954). The pattern-based approach uses special lexico-syntactic patterns to extract pairwise relation lists (Phillips and Riloff, 2002; Girju et al., 2003; Pantel and Pennacchiotti, 2006; Suchanek et al., 2007; Ritter et al., 2009; Hovy et al., 2009; Baroni et al., 2010; Ponzetto and Strube, 2011) and semantic classes or class-instance pairs (Riloff and Shepherd, 1997; Katz and Lin, 2003; Paşca, 2004; Etzioni et al., 2005; Talukdar et al., 2008).

We focus on the second step of taxonomy induction, namely the structured organization of terms into a complete and coherent tree-like hierarchy.⁹ Early work on this task assumes a starting partial taxonomy and inserts missing terms into it. Widdows (2003) places unknown words into a region with the most semantically-similar neighbors. Snow et al. (2006) add novel terms by greedily maximizing the conditional probability of a set of relational evidence given a taxonomy. Yang and Callan (2009) incrementally cluster terms based on a pairwise semantic distance. Lao et al. (2012) extend a knowledge base using a random walk model to learn binary relational inference rules.

However, the task of inducing full taxonomies without assuming a substantial initial partial taxonomy is relatively less well studied. There is some prior work on the related task of hierarchical clustering, or grouping together of semantically related words (Cimiano et al., 2005; Cimiano and Staab, 2005; Poon and Domingos, 2010; Fountain and Lapata, 2012). The task we focus on, though, is the discovery of direct taxonomic relationships (e.g., hypernymy) between words.

We know of two closely-related previous systems, Kozareva and Hovy (2010) and Navigli et al. (2011), that build full taxonomies from scratch. Both of these systems use a process that starts by finding basic level terms (leaves of the final taxonomy tree, typically) and then uses relational patterns (hand-selected ones in the case of Kozareva and Hovy (2010), and ones learned separately by a pairwise classifier on manually annotated co-occurrence patterns for Navigli and Velardi (2010) and Navigli et al. (2011)) to find intermediate terms and all the attested hypernymy links between them.¹⁰

⁷ One can also add features on the full triple (x_i, x_j, x_k), but most such features will be sparse.
⁸ All the patterns and counts for our Web and Wikipedia edge and sibling features described above are extracted after stemming the words in the terms, the n-grams, and the abstracts (using the Porter stemmer). Also, we threshold the features (to prune away the sparse ones) by considering only those that fire for at least t trees in the training data (t = 4 in our experiments). Note that one could also add various complementary types of useful features presented by previous work, e.g., bootstrapping using syntactic heuristics (Phillips and Riloff, 2002), dependency patterns (Snow et al., 2006), doubly anchored patterns (Kozareva et al., 2008; Hovy et al., 2009), and Web definition classifiers (Navigli et al., 2011).
⁹ Determining the set of input terms is orthogonal to our work, and our method can be used in conjunction with the various term extraction approaches described above.
¹⁰ Unlike our system, which assumes a complete set of terms and only attempts to induce the taxonomic structure, both these systems include term discovery in the taxonomy building process.

To prune down the resulting taxonomy graph, Kozareva and Hovy (2010) use a procedure that iteratively retains the longest paths between root and leaf terms, removing conflicting graph edges as they go. The end result is acyclic, though not necessarily a tree; Navigli et al. (2011) instead use the longest path intuition to weight edges in the graph and then find the highest weight taxonomic tree using a standard MST algorithm.

Our work differs from the two systems above in that ours is the first discriminatively trained, structured probabilistic model over the full space of taxonomy trees that uses structured inference via spanning tree algorithms (MST and MTT) through both the learning and decoding phases. Our model also automatically learns relational patterns as a part of the taxonomic training phase, instead of relying on hand-picked rules or pairwise classifiers on manually annotated co-occurrence patterns, and it is the first end-to-end (i.e., non-incremental) system to include heterogeneous relational information via sibling (e.g., coordination) patterns.

5 Experiments

5.1 Data and Experimental Regime

We considered two distinct experimental setups: one that illustrates the general performance of our model by reproducing various medium-sized WordNet domains, and another that facilitates comparison to previous work by reproducing the much larger animal subtree provided by Kozareva and Hovy (2010).

General setup: In order to test the accuracy of structured prediction on medium-sized full-domain taxonomies, we extracted from WordNet 3.0 all bottomed-out full subtrees which had a tree-height of 3 (i.e., 4 nodes from root to leaf) and contained (10, 50] terms.¹¹ This gives us 761 non-overlapping trees, which we partition into 70/15/15% (533/114/114 trees) train/dev/test sets.

Comparison setup: We also compare our method (as closely as possible) with related previous work by testing on the much larger animal subtree made available by Kozareva and Hovy (2010), who created this dataset by selecting a set of 'harvested' terms and retrieving all the WordNet hypernyms between each input term and the root (i.e., animal), resulting in ∼700 terms and ∼4,300 is-a ancestor-child links.¹² Our training set for this animal test case was generated from WordNet using the following process: First, we strictly remove the full animal subtree from WordNet in order to avoid any possible overlap with the test data. Next, we create random 25-sized trees by picking random nodes as singleton trees and repeatedly adding child edges from WordNet to the tree, as sketched after this list of setups. This process gives us a total of ∼1600 training trees.¹³

Feature sources: The n-gram semantic features are extracted from the Google n-grams corpus (Brants and Franz, 2006), a large collection of English n-grams (for n = 1 to 5) and their frequencies computed from almost 1 trillion tokens (95 billion sentences) of Web text. The Wikipedia abstracts are obtained via the publicly available dump, which contains almost 4.1 million articles.¹⁴ Preprocessing includes standard XML parsing and tokenization. Efficient collection of feature statistics is important because these must be extracted for millions of query pairs (for each potential edge and sibling pair in each term set). For this, we use a hash-trie on term pairs (similar to that of Bansal and Klein (2011)), and scan once through the n-gram (or abstract) set, skipping many n-grams (or abstracts) based on fast checks of missing unigrams, exceeding length, suffix mismatches, etc.

¹¹ Subtrees that had a smaller or larger tree height were discarded in order to avoid overlap between the training and test divisions. This makes it a much stricter setting than other tasks such as parsing, which usually has repeated sentences, clauses, and phrases between training and test sets. To project WordNet synsets to terms, we used the first (most frequent) term in each synset. A few WordNet synsets have multiple parents, so we only keep the first of each such pair of overlapping trees. We also discard a few trees with duplicate terms, because this is mostly due to the projection of different synsets to the same term, and theoretically makes the tree a graph.
¹² This is somewhat different from our general setup, where we work with any given set of terms; they start with a large set of leaves which have substantial Web-based relational information based on their selected, hand-picked patterns. Their data is available at http://www.isi.edu/~kozareva/downloads.html.
¹³ We tried this training regimen, different from that of the general setup (which contains only bottomed-out subtrees), so as to match the animal test tree, which is of depth 12 and has intermediate nodes from higher up in WordNet.
¹⁴ We used the 20130102 dump.
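For concreteness, here is a sketch of the comparison-setup training-tree generation using NLTK's WordNet interface; the sampling details are our reading of the description above, not the authors' released code.

    import random
    from nltk.corpus import wordnet as wn  # assumed dependency

    def sample_training_tree(size=25):
        # Grow one random training tree (Section 5.1, comparison setup):
        # start from a random noun synset outside the held-out animal
        # subtree and repeatedly attach WordNet child (hyponym) edges.
        root = wn.synset("animal.n.01")
        banned = set(root.closure(lambda s: s.hyponyms())) | {root}
        nouns = [s for s in wn.all_synsets("n") if s not in banned]
        seed = random.choice(nouns)
        tree, frontier, edges = {seed}, [seed], []
        while frontier and len(tree) < size:
            node = frontier.pop(random.randrange(len(frontier)))
            for child in node.hyponyms():
                if (child not in banned and child not in tree
                        and len(tree) < size):
                    tree.add(child)
                    edges.append((node, child))
                    frontier.append(child)
        return edges  # (parent, child) pairs of one training tree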

5.2 Evaluation Metric

Ancestor F1: Measures the precision, recall, and F1 = 2PR / (P + R) of correctly predicted ancestors, i.e., pairwise is-a relations:

    P = |isa_predicted ∩ isa_gold| / |isa_predicted|
    R = |isa_predicted ∩ isa_gold| / |isa_gold|

5.3 Results

Table 1 shows our main results for the ancestor-based evaluation on the general setup. We present a development set ablation study where we start with the edges-only model (Figure 2a) and its random tree baseline (which chooses an arbitrary spanning tree for the term set). Next, we show results on the edges-only model with surface features (Section 3.1), semantic features (Section 3.2), and both. We see that both surface and semantic features make substantial contributions, and they also stack. Finally, we add the sibling factors and features (Figure 2b, Section 3.3), which further improves the results significantly (8% absolute and 15% relative error reduction over the edges-only results on the ancestor F1 metric). The last row shows the final test set results for the full model with all features.

    System                     P     R     F1
    Edges-Only Model
      Baseline                 5.9   8.3   6.9
      Surface Features        17.5  41.3  24.6
      Semantic Features       37.0  49.1  42.2
      Surface+Semantic        41.1  54.4  46.8
    Edges + Siblings Model
      Surface+Semantic        53.1  56.6  54.8
      Surface+Semantic (Test) 48.0  55.2  51.4

Table 1: Main results on our general setup. On the development set, we present incremental results on the edges-only model, where we start with the chance baseline, then use surface features only, semantic features only, and both. Finally, we add sibling factors and features to get results for the full, edges+siblings model with all features, and also report the final test result for this setting.

Table 2 shows our results for the comparison on the larger animal dataset of Kozareva and Hovy (2010).¹⁵ In the table, 'Kozareva2010' refers to Kozareva and Hovy (2010) and 'Navigli2011' refers to Navigli et al. (2011).¹⁶ For appropriate comparison to each previous work, we show results for two different setups. The first setup, 'Fixed Prediction', assumes that the model knows the true root and leaves of the taxonomy, to provide a somewhat fairer comparison to Kozareva and Hovy (2010). We get substantial improvements on ancestor-based recall and F1 (a 29% relative error reduction). The second setup, 'Free Prediction', assumes no prior knowledge and predicts the full tree (similar to the general setup case). On this setup, we do compare as closely as possible to Navigli et al. (2011) and see a small gain in F1, but regardless, we should note that their results are incomparable (denoted by ** in Table 2) because they have a different ground-truth data condition: their definition and hypernym extraction phase involves using the Google define keyword, which often returns WordNet glosses itself.

    System              P       R       F1
    Previous Work
      Kozareva2010      98.6    36.2    52.9
      Navigli2011**     97.0**  43.7**  60.3**
    This Paper
      Fixed Prediction  84.2    55.1    66.6
      Free Prediction   79.3    49.0    60.6

Table 2: Comparison results on the animal dataset of Kozareva and Hovy (2010). Here, 'Kozareva2010' refers to Kozareva and Hovy (2010) and 'Navigli2011' refers to Navigli et al. (2011). For appropriate comparison to each previous work, we show our results both for the 'Fixed Prediction' setup, which assumes the true root and leaves, and for the 'Free Prediction' setup, which doesn't assume any prior information. The ** results of Navigli et al. (2011) represent a different ground-truth data condition, making them incomparable to our results; see Section 5.3 for details.

We note that previous work achieves higher ancestor precision, while our approach achieves a more even balance between precision and recall. Of course, precision and recall should both ideally be high, even if some applications weigh one over the other. This is why our tuning optimized for F1, which represents a neutral combination for comparison, but other Fα metrics could also be optimized. In this direction, we also tried an experiment with precision-based decoding (for the 'Free Prediction' scenario), where we discard any edges with score (i.e., the belief odds ratio described in Section 2.4) less than a certain threshold. This allowed us to achieve high values of precision (e.g., 90.8%) at still high enough F1 values (e.g., 61.7%).

¹⁵ These results are for the 1st-order model, due to the scale of the animal taxonomy (∼700 terms). For scaling the 2nd-order sibling model, one can use approximations, e.g., pruning the set of sibling factors based on 1st-order link marginals, a hierarchical coarse-to-fine approach based on taxonomy induction on subtrees, or a greedy approach of adding a few sibling factors at a time. This is future work.
¹⁶ The Kozareva and Hovy (2010) ancestor results are obtained by using the output files provided on their webpage.

6 Analysis

Table 3 shows some of the hypernymy and siblinghood features given highest weight by our model (in general-setup development experiments). The training process not only rediscovers most of the standard Hearst-style hypernymy patterns (e.g., C and other P, C is a P), but also finds various novel, intuitive patterns. For example, the pattern C, american P is prominent because it captures pairs like Lemmon, american actor and Bryon, american politician, etc. Another pattern, > P > C, captures webpage navigation breadcrumb trails (representing category hierarchies). Similarly, the algorithm also discovers useful siblinghood features, e.g., either C1 or C2, C1 and / or C2, etc.

    Hypernymy features
      C and other P    > P > C       C , P of
      C is a P         C , a P       P , including C
      C or other P     P ( C         C : a P
      C , american P   C - like P    C , the P
    Siblinghood features
      C1 and C2        C1 , C2 (
      C1 or C2         of C1 and / or C2
      , C1 , C2 and    either C1 or C2
      the C1 / C2      C1 and C2

Table 3: Examples of high-weighted hypernymy and siblinghood features learned during development.

Finally, we look at some specific output errors to give as concrete a sense as possible of some system confusions, though of course any hand-chosen examples must be taken as illustrative. In Figure 3, we attach white admiral to admiral, whereas the gold standard makes these two terms siblings. In reality, however, white admirals are indeed a species of admirals, so WordNet's ground truth turns out to be incomplete. Another such example is that we place logistic assessment in the evaluation subtree of judgment, but WordNet makes it a direct child of judgment. However, other dictionaries do consider logistic assessments to be evaluations. Hence, this illustrates that there may be more than one right answer, and that the low results on this task should only be interpreted as such. In Figure 4, our algorithm did not recognize that thermos is a hyponym of vacuum flask, and that jeroboam is a kind of wine bottle. Here, our Web n-grams dataset (which only contains frequent n-grams) and the Wikipedia abstracts do not suffice, and we would need to add richer Web data for such world knowledge to be reflected in the features.

[Figure 3 diagram: predicted subtree with butterfly dominating copper, hairstreak, and admiral; copper dominates American copper, hairstreak dominates Strymon melinus, and admiral dominates white admiral.]

Figure 3: Excerpt from the predicted butterfly tree. The terms attached erroneously according to WordNet are marked in red and italicized.

[Figure 4 diagram: predicted subtree with bottle dominating flask, wine bottle, and jeroboam; flask dominates vacuum flask, thermos, and Erlenmeyer flask.]

Figure 4: Excerpt from the predicted bottle tree. The terms attached erroneously according to WordNet are marked in red and italicized.

7 Conclusion

Our approach to taxonomy induction allows heterogeneous information sources to be combined and balanced in an error-driven way. Direct indicators of hypernymy, such as Hearst-style context patterns, are the core features for the model and are discovered automatically via discriminative training. However, other indicators, such as coordination cues, can indicate that two words might be siblings, independently of what their shared parent might be. Adding second-order factors to our model allows these two kinds of evidence to be weighed and balanced in a discriminative, structured probabilistic framework. Empirically, we see substantial gains (in ancestor F1) from sibling features, and also over comparable previous work. We also present results on the precision and recall trade-offs inherent in this task.

Acknowledgments

We would like to thank the anonymous reviewers for their insightful comments. This work was supported by BBN under DARPA contract HR0011-12-C-0014, 973 Program China Grants 2011CBA00300 and 2011CBA00301, and NSFC Grants 61033001 and 61361136003.

References

Mohit Bansal and Dan Klein. 2011. Web-scale features for full-scale parsing. In Proceedings of ACL.

Marco Baroni, Brian Murphy, Eduard Barbu, and Massimo Poesio. 2010. Strudel: A corpus-based semantic model based on properties and types. Cognitive Science, 34(2):222–254.

Thorsten Brants and Alex Franz. 2006. The Google Web 1T 5-gram corpus version 1.1. LDC2006T13.

Yoeng-Jin Chu and Tseng-Hong Liu. 1965. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400.

Philipp Cimiano and Steffen Staab. 2005. Learning concept hierarchies from text with a guided agglomerative clustering algorithm. In Proceedings of the ICML 2005 Workshop on Learning and Extending Lexical Ontologies with Machine Learning Methods.

Philipp Cimiano, Andreas Hotho, and Steffen Staab. 2005. Learning concept hierarchies from text corpora using formal concept analysis. Journal of Artificial Intelligence Research, 24(1):305–339.

Dmitry Davidov and Ari Rappoport. 2006. Efficient unsupervised discovery of word categories using symmetric patterns and high frequency words. In Proceedings of COLING-ACL.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

Jack Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards B, 71:233–240.

Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2005. Unsupervised named-entity extraction from the Web: An experimental study. Artificial Intelligence, 165(1):91–134.

David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. 2010. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79.

Trevor Fountain and Mirella Lapata. 2012. Taxonomy induction using hierarchical random graphs. In Proceedings of NAACL.

Leonidas Georgiadis. 2003. Arborescence optimization problems solvable by Edmonds' algorithm. Theoretical Computer Science, 301(1):427–437.

Roxana Girju, Adriana Badulescu, and Dan Moldovan. 2003. Learning semantic constraints for the automatic discovery of part-whole relations. In Proceedings of NAACL.

Joshua Goodman. 1996. Parsing algorithms and metrics. In Proceedings of ACL.

Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.

Marti Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING.

Eduard Hovy, Zornitsa Kozareva, and Ellen Riloff. 2009. Toward completeness in concept extraction and classification. In Proceedings of EMNLP.

Boris Katz and Jimmy Lin. 2003. Selectively using relations to improve precision in question answering. In Proceedings of the Workshop on NLP for Question Answering (EACL 2003).

Terry Koo, Amir Globerson, Xavier Carreras, and Michael Collins. 2007. Structured prediction models via the matrix-tree theorem. In Proceedings of EMNLP-CoNLL.

Zornitsa Kozareva and Eduard Hovy. 2010. A semi-supervised method to learn and construct taxonomies using the Web. In Proceedings of EMNLP.

Zornitsa Kozareva, Ellen Riloff, and Eduard Hovy. 2008. Semantic class learning from the web with hyponym pattern linkage graphs. In Proceedings of ACL.

Ni Lao, Amarnag Subramanya, Fernando Pereira, and William W. Cohen. 2012. Reading the web with learned syntactic-semantic inference rules. In Proceedings of EMNLP.

Dekang Lin and Patrick Pantel. 2002. Concept discovery from text. In Proceedings of COLING.

Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of COLING.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of HLT-EMNLP.

Roberto Navigli and Paola Velardi. 2010. Learning word-class lattices for definition and hypernym extraction. In Proceedings of ACL.

Roberto Navigli, Paola Velardi, and Stefano Faralli. 2011. A graph-based algorithm for inducing lexical taxonomies from scratch. In Proceedings of IJCAI.

Patrick Pantel and Marco Pennacchiotti. 2006. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of COLING-ACL.

Marius Paşca. 2004. Acquisition of categorized named entities for web search. In Proceedings of CIKM.

Marco Pennacchiotti and Patrick Pantel. 2006. Ontologizing semantic relations. In Proceedings of COLING-ACL.

William Phillips and Ellen Riloff. 2002. Exploiting strong syntactic heuristics and co-training to learn semantic lexicons. In Proceedings of EMNLP.

Simone Paolo Ponzetto and Michael Strube. 2011. Taxonomy induction based on a collaboratively built knowledge repository. Artificial Intelligence, 175(9):1737–1756.

Hoifung Poon and Pedro Domingos. 2010. Unsupervised ontology induction from text. In Proceedings of ACL.

Ellen Riloff and Jessica Shepherd. 1997. A corpus-based approach for building semantic lexicons. In Proceedings of EMNLP.

Alan Ritter, Stephen Soderland, and Oren Etzioni. 2009. What is this, anyway: Automatic hypernym discovery. In Proceedings of the AAAI Spring Symposium on Learning by Reading and Learning to Read.

David A. Smith and Jason Eisner. 2008. Dependency parsing by belief propagation. In Proceedings of EMNLP.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of COLING-ACL.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A core of semantic knowledge. In Proceedings of WWW.

Partha Pratim Talukdar, Joseph Reisinger, Marius Paşca, Deepak Ravichandran, Rahul Bhagat, and Fernando Pereira. 2008. Weakly-supervised acquisition of labeled class instances using graph random walks. In Proceedings of EMNLP.

Robert E. Tarjan. 1977. Finding optimum branchings. Networks, 7:25–35.

William T. Tutte. 1984. Graph Theory. Addison-Wesley.

Dominic Widdows. 2003. Unsupervised methods for developing taxonomies by combining syntactic and statistical information. In Proceedings of HLT-NAACL.

Ichiro Yamada, Kentaro Torisawa, Jun'ichi Kazama, Kow Kuroda, Masaki Murata, Stijn De Saeger, Francis Bond, and Asuka Sumida. 2009. Hypernym discovery based on distributional similarity and hierarchical structures. In Proceedings of EMNLP.

Hui Yang and Jamie Callan. 2009. A metric-based framework for automatic taxonomy induction. In Proceedings of ACL-IJCNLP.
