A Vector Space for Distributional Semantics for Entailment ∗
James Henderson and Diana Nicoleta Popa
Xerox Research Centre Europe
[email protected] and [email protected]

∗ This work was partially supported by French ANR grant CIFRE N 1324/2014.

Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 2052–2062, Berlin, Germany, August 7–12, 2016. © 2016 Association for Computational Linguistics.

Abstract

Distributional semantics creates vector-space representations that capture many forms of semantic similarity, but their relation to semantic entailment has been less clear. We propose a vector-space model which provides a formal foundation for a distributional semantics of entailment. Using a mean-field approximation, we develop approximate inference procedures and entailment operators over vectors of probabilities of features being known (versus unknown). We use this framework to reinterpret an existing distributional-semantic model (Word2Vec) as approximating an entailment-based model of the distributions of words in contexts, thereby predicting lexical entailment relations. In both unsupervised and semi-supervised experiments on hyponymy detection, we get substantial improvements over previous results.

1 Introduction

Modelling entailment is a fundamental issue in computational semantics. It is also important for many applications, for example to produce abstract summaries or to answer questions from text, where we need to ensure that the input text entails the output text. There has been a lot of interest in modelling entailment in a vector space, but most of this work takes an empirical, often ad-hoc, approach to this problem, and achieving good results has been difficult (Levy et al., 2015). In this work, we propose a new framework for modelling entailment in a vector space, and illustrate its effectiveness with a distributional-semantic model of hyponymy detection.

Unlike previous vector-space models of entailment, the proposed framework explicitly models what information is unknown. This is a crucial property, because entailment reflects what information is and is not known; a representation y entails a representation x if and only if everything that is known given x is also known given y. Thus, we model entailment in a vector space where each dimension represents something we might know. As illustrated in Table 1, knowing that a feature f is true always entails knowing that same feature, but never entails knowing that a different feature g is true. Also, knowing that a feature is true always entails not knowing anything (unk), since strictly less information is still entailment, but the reverse is never true. Table 1 also illustrates that knowing that a feature f is false (¬f) patterns exactly the same way as knowing that an unrelated feature g is true. This illustrates that the relevant dichotomy for entailment is known versus unknown, and not true versus false.

⇒      unk   f    g    ¬f
unk     1    0    0    0
f       1    1    0    0
g       1    0    1    0
¬f      1    0    0    1

Table 1: Pattern of logical entailment between nothing known (unk), two different features f and g known, and the complement of f (¬f) known.
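As a concrete illustration (a sketch added here, not part of the paper), the pattern in Table 1 can be reproduced by treating each representation as the set of features it makes known and checking that everything known in x is also known in y. The feature names and the helper entails are purely illustrative.

```python
# Illustrative sketch (not from the paper): a representation is the set of
# features it makes known; y entails x iff every feature known in x is also
# known in y.
def entails(y, x):
    return int(set(x) <= set(y))

# Reproduce Table 1. "unk" knows nothing, and the negated feature "not-f" is
# just another feature, distinct from "f": known-vs-unknown, not true-vs-false.
reprs = {"unk": set(), "f": {"f"}, "g": {"g"}, "not-f": {"not-f"}}

for y_name, y in reprs.items():
    print(y_name, [entails(y, x) for x in reprs.values()])
# unk   [1, 0, 0, 0]
# f     [1, 1, 0, 0]
# g     [1, 0, 1, 0]
# not-f [1, 0, 0, 1]
```

Note that the negated feature behaves like any other feature here; the only asymmetry is between known and unknown.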
Previous vector-space models have been very successful at modelling semantic similarity, in particular using distributional semantic models (e.g. (Deerwester et al., 1990; Schütze, 1993; Mikolov et al., 2013a)). Distributional semantics uses the distributions of words in contexts to induce vector-space embeddings of words, which have been shown to be useful for a wide variety of tasks. Two words are predicted to be similar if the dot product between their vectors is high. But the dot product is an anti-symmetric operator, which makes it more natural to interpret these vectors as representing whether features are true or false, whereas the dichotomy known versus unknown is asymmetric. We surmise that this is why distributional semantic models have had difficulty modelling lexical entailment (Levy et al., 2015).

To develop a vector-space model of whether features are known or unknown, we start with discrete binary vectors, where 1 means known and 0 means unknown. Entailment between these discrete binary vectors can be calculated by independently checking each dimension. But as soon as we try to do calculations with distributions over these vectors, we need to deal with the case where the features are not independent. For example, if feature f has a 50% chance of being true and a 50% chance of being false, we can't assume that there is a 25% chance that both f and ¬f are known. This simple case of mutual exclusion is just one example of a wide range of constraints between features which we need to handle in semantic models. These constraints mean that the different dimensions of our vector space are not independent, and therefore exact models are not factorised. Because the models are not factorised, exact calculations of entailment and exact inference of vectors are intractable.

Mean-field approximations are a popular approach to efficient inference for intractable models. In a mean-field approximation, distributions over binary vectors are represented using a single probability for each dimension. These vectors of real values are the basis of our proposed vector space for entailment.

In this work, we propose a vector-space model which provides a formal foundation for a distributional semantics of entailment. This framework is derived from a mean-field approximation to entailment between binary vectors, and includes operators for measuring entailment between vectors, and procedures for inferring vectors in an entailment graph. We validate this framework by using it to reinterpret existing Word2Vec (Mikolov et al., 2013a) word embedding vectors as approximating an entailment-based model of the distribution of words in contexts. This reinterpretation allows us to use existing word embeddings as an unsupervised model of lexical entailment, successfully predicting hyponymy relations using the proposed entailment operators in both unsupervised and semi-supervised experiments.

2 Modelling Entailment in a Vector Space

To develop a model of entailment in a vector space, we start with the logical definition of entailment in terms of vectors of discrete known features: y entails x if and only if all the known features in x are also included in y. We formalise this relation with binary vectors x, y where 1 means known and 0 means unknown, so this discrete entailment relation (y ⇒ x) can be defined with the binary formula:

P((y \Rightarrow x) \mid x, y) = \prod_k (1 - (1 - y_k)\, x_k)

Given prior probability distributions P(x), P(y) over these vectors, the exact joint and marginal probabilities for an entailment relation are:

P(x, y, (y \Rightarrow x)) = P(x)\, P(y) \prod_k (1 - (1 - y_k)\, x_k)

P((y \Rightarrow x)) = E_{P(x)}\, E_{P(y)} \prod_k (1 - (1 - y_k)\, x_k)    (1)
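To make these formulas concrete, the following sketch (mine, not the authors' code; the helper names p_entails_given and p_entails are hypothetical) evaluates them directly: for binary vectors the factor 1 − (1 − y_k)x_k is 0 exactly when dimension k is known in x but not in y, and Equation (1) can be computed exactly by enumerating every pair of binary vectors under arbitrary, possibly non-factorised, priors. The exponential cost of that enumeration is what the remainder of the section works around.

```python
from itertools import product

def p_entails_given(x, y):
    # P((y => x) | x, y) = prod_k (1 - (1 - y_k) * x_k):
    # equals 1 iff every dimension known in x is also known in y, else 0.
    p = 1
    for xk, yk in zip(x, y):
        p *= 1 - (1 - yk) * xk
    return p

def p_entails(P_x, P_y, K):
    # Equation (1): E_{P(x)} E_{P(y)} prod_k (1 - (1 - y_k) x_k), computed by
    # brute force over all 2^K vectors for x and for y (O(4^K) terms).
    total = 0.0
    for x in product([0, 1], repeat=K):
        for y in product([0, 1], repeat=K):
            total += P_x.get(x, 0.0) * P_y.get(y, 0.0) * p_entails_given(x, y)
    return total

# Toy priors over K = 2 dimensions standing for f and not-f; both priors give
# zero probability to (1, 1), enforcing the mutual-exclusion constraint.
P_x = {(0, 0): 0.2, (1, 0): 0.4, (0, 1): 0.4}
P_y = {(0, 0): 0.1, (1, 0): 0.6, (0, 1): 0.3}
print(p_entails(P_x, P_y, K=2))
```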
We cannot assume that the priors P(x) and P(y) are factorised, because there are many important correlations between features and therefore we cannot assume that the features are independent. As discussed in Section 1, even just representing both a feature f and its negation ¬f requires two different dimensions k and k′ in the vector space, because 0 represents unknown and not false. Given valid feature vectors, calculating entailment can consider these two dimensions separately, but to reason with distributions over vectors we need the prior P(x) to enforce the constraint that x_k and x_{k′} are mutually exclusive. In general, such correlations and anti-correlations exist between many semantic features, which makes inference and calculating the probability of entailment intractable.

To allow for efficient inference in such a model, we propose a mean-field approximation. This in effect assumes that the posterior distribution over vectors is factorised, but in practice this is a much weaker assumption than assuming the prior is factorised. The posterior distribution has less uncertainty and therefore is influenced less by non-factorised prior constraints. By assuming a factorised posterior, we can then represent distributions over feature vectors with simple vectors of probabilities of individual features (or, as below, with their log-odds). These real-valued vectors are the basis of the proposed vector-space model of entailment.

In the next two subsections, we derive a mean-field approximation for inference of real-valued vectors in entailment graphs. This derivation leads to three proposed vector-space operators for approximating the log-probability of entailment, summarised in Table 2. These operators will be used in the evaluation in Section 5. This inference framework will also be used in Section 3 to model how existing word embeddings can be mapped to vectors to which the entailment operators can be applied.

2.1 A Mean-Field Approximation

A mean-field approximation approximates the posterior distribution with factorised distributions Q(x) and Q(y). Under this approximation, the expected log-probability of x under the prior P(x) can be bounded as

E_{Q(x)} \log P(x) \;\geq\; E_{Q(x)} \log \frac{\exp(\sum_k \theta^x_k x_k)}{Z_\theta} \;=\; \sum_k E_{Q(x_k)}[\theta^x_k\, x_k] \;-\; \log Z_\theta

where the \log Z_\theta term is not relevant in any of our inference problems and thus will be dropped below.

As is typical in mean-field approximations, inference of Q(x) and Q(y) cannot be done efficiently with this exact objective L, because of the non-linear interdependence between x_k and y_k in the last term. Thus, we introduce two approximations to L, one for use in inferring Q(x) given Q(y) (forward inference), and one for the reverse inference problem (backward inference). In both cases, the approximation is done with an application of Jensen's inequality to the log function, which gives us an upper bound on L, as is standard practice in mean-field approximations.
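To make the mean-field representation concrete, here is a sketch (mine, not one of the paper's three operators; the function log_entailment_score is hypothetical) under the simplifying assumption that Q(x) and Q(y) are fully factorised and independent of each other: each vector stores the per-dimension probabilities of features being known (equivalently, their log-odds), and under that assumption the expectation in Equation (1) factorises into a product over dimensions, giving one simple candidate for a log-probability-of-entailment score.

```python
import numpy as np

def sigmoid(t):
    # Map log-odds to probabilities of a feature being known.
    return 1.0 / (1.0 + np.exp(-t))

def log_entailment_score(theta_y, theta_x):
    # Under fully factorised, mutually independent Q(x) and Q(y), the
    # expectation in Equation (1) factorises:
    #   log E[prod_k (1 - (1 - y_k) x_k)] = sum_k log(1 - (1 - q_y[k]) * q_x[k])
    q_x = sigmoid(np.asarray(theta_x, dtype=float))
    q_y = sigmoid(np.asarray(theta_y, dtype=float))
    return float(np.sum(np.log(1.0 - (1.0 - q_y) * q_x)))

# Toy example: y confidently knows both features, x confidently knows only the
# first, so "y entails x" scores near 0 (high probability) while the reverse
# direction scores strongly negative.
theta_y = [4.0, 4.0]
theta_x = [4.0, -4.0]
print(log_entailment_score(theta_y, theta_x))   # ~ -0.02
print(log_entailment_score(theta_x, theta_y))   # ~ -3.35
```

The paper's actual operators are derived from the mean-field objective in the following subsections; this product form is only meant to show why vectors of probabilities (or log-odds) are a natural representation for scoring entailment.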