
A Vector Space for Distributional Semantics for Entailment∗

James Henderson and Diana Nicoleta Popa
Xerox Research Centre Europe

∗ This work was partially supported by French ANR grant CIFRE N 1324/2014.

Abstract

Distributional semantics creates vector-space representations that capture many forms of semantic similarity, but their relation to semantic entailment has been less clear. We propose a vector-space model which provides a formal foundation for a distributional semantics of entailment. Using a mean-field approximation, we develop approximate inference procedures and entailment operators over vectors of probabilities of features being known (versus unknown). We use this framework to reinterpret an existing distributional-semantic model (Word2Vec) as approximating an entailment-based model of the distributions of words in contexts, thereby predicting lexical entailment relations. In both unsupervised and semi-supervised experiments on hyponymy detection, we get substantial improvements over previous results.

    ⇒      unk    f     g     ¬f
    unk     1     0     0     0
    f       1     1     0     0
    g       1     0     1     0
    ¬f      1     0     0     1

Table 1: Pattern of logical entailment between nothing known (unk), two different features f and g known, and the complement of f (¬f) known.

1 Introduction

Modelling entailment is a fundamental issue in computational semantics. It is also important for many applications, for example to produce abstract summaries or to answer questions from text, where we need to ensure that the input text entails the output text. There has been a lot of interest in modelling entailment in a vector space, but most of this work takes an empirical, often ad-hoc, approach to this problem, and achieving good results has been difficult (Levy et al., 2015). In this work, we propose a new framework for modelling entailment in a vector space, and illustrate its effectiveness with a distributional-semantic model of hyponymy detection.

Unlike previous vector-space models of entailment, the proposed framework explicitly models what information is unknown. This is a crucial property, because entailment reflects what information is and is not known; a representation y entails a representation x if and only if everything that is known given x is also known given y. Thus, we model entailment in a vector space where each dimension represents something we might know. As illustrated in Table 1, knowing that a feature f is true always entails knowing that same feature, but never entails knowing that a different feature g is true. Also, knowing that a feature is true always entails not knowing anything (unk), since strictly less information is still entailment, but the reverse is never true. Table 1 also illustrates that knowing that a feature f is false (¬f) patterns exactly the same way as knowing that an unrelated feature g is true. This illustrates that the relevant dichotomy for entailment is known versus unknown, and not true versus false.
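For concreteness, the following minimal Python sketch (an illustration added here, not code from the paper; the encoding of ¬f as a separate dimension follows the discussion above) checks this discrete known-versus-unknown entailment relation and reproduces the pattern of Table 1.

    # Minimal illustration of the discrete entailment relation of Table 1.
    # Each dimension is 1 if that feature is known, 0 if unknown.  To represent
    # a feature f and its negation ~f we use two separate dimensions, because
    # 0 means "unknown", not "false".  (Encoding chosen for illustration only.)

    def entails(y, x):
        """y => x iff every feature known in x is also known in y."""
        return all(yk >= xk for yk, xk in zip(y, x))

    # dimensions: [f, g, not-f]
    vectors = {
        "unk":   (0, 0, 0),   # nothing known
        "f":     (1, 0, 0),   # f known (true)
        "g":     (0, 1, 0),   # g known (true)
        "not-f": (0, 0, 1),   # f known to be false
    }

    # Rows entail columns, as in Table 1.
    for row_name, y in vectors.items():
        row = [int(entails(y, x)) for x in vectors.values()]
        print(f"{row_name:>6}: {row}")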

Previous vector-space models have been very successful at modelling semantic similarity, in particular using distributional semantic models (e.g. (Deerwester et al., 1990; Schütze, 1993; Mikolov et al., 2013a)). Distributional semantics uses the distributions of words in contexts to induce vector-space embeddings of words, which have been shown to be useful for a wide variety of tasks.

Two words are predicted to be similar if the dot product between their vectors is high. But the dot product is an anti-symmetric operator, which makes it more natural to interpret these vectors as representing whether features are true or false, whereas the dichotomy known versus unknown is asymmetric. We surmise that this is why distributional semantic models have had difficulty modelling lexical entailment (Levy et al., 2015).

To develop a vector-space model of whether features are known or unknown, we start with discrete binary vectors, where 1 means known and 0 means unknown. Entailment between these discrete binary vectors can be calculated by independently checking each dimension. But as soon as we try to do calculations with distributions over these vectors, we need to deal with the case where the features are not independent. For example, if feature f has a 50% chance of being true and a 50% chance of being false, we can't assume that there is a 25% chance that both f and ¬f are known. This simple case of mutual exclusion is just one example of a wide range of constraints between features which we need to handle in semantic models. These constraints mean that the different dimensions of our vector space are not independent, and therefore exact models are not factorised. Because the models are not factorised, exact calculations of entailment and exact inference of vectors are intractable.

Mean-field approximations are a popular approach to efficient inference for intractable models. In a mean-field approximation, distributions over binary vectors are represented using a single probability for each dimension. These vectors of real values are the basis of our proposed vector space for entailment.

In this work, we propose a vector-space model which provides a formal foundation for a distributional semantics of entailment. This framework is derived from a mean-field approximation to entailment between binary vectors, and includes operators for measuring entailment between vectors, and procedures for inferring vectors in an entailment graph. We validate this framework by using it to reinterpret existing Word2Vec (Mikolov et al., 2013a) embedding vectors as approximating an entailment-based model of the distribution of words in contexts. This reinterpretation allows us to use existing word embeddings as an unsupervised model of lexical entailment, successfully predicting hyponymy relations using the proposed entailment operators in both unsupervised and semi-supervised experiments.

2 Modelling Entailment in a Vector Space

To develop a model of entailment in a vector space, we start with the logical definition of entailment in terms of vectors of discrete known features: y entails x if and only if all the known features in x are also included in y. We formalise this relation with binary vectors x, y where 1 means known and 0 means unknown, so this discrete entailment relation (y ⇒ x) can be defined with the binary formula:

    P((y ⇒ x) | x, y) = Π_k (1 − (1 − y_k) x_k)

Given prior probability distributions P(x), P(y) over these vectors, the exact joint and marginal probabilities for an entailment relation are:

    P(x, y, (y ⇒ x)) = P(x) P(y) Π_k (1 − (1 − y_k) x_k)
    P((y ⇒ x)) = E_{P(x)} E_{P(y)} Π_k (1 − (1 − y_k) x_k)                        (1)

We cannot assume that the priors P(x) and P(y) are factorised, because there are many important correlations between features and therefore we cannot assume that the features are independent. As discussed in Section 1, even just representing both a feature f and its negation ¬f requires two different dimensions k and k′ in the vector space, because 0 represents unknown and not false. Given valid feature vectors, calculating entailment can consider these two dimensions separately, but to reason with distributions over vectors we need the prior P(x) to enforce the constraint that x_k and x_k′ are mutually exclusive. In general, such correlations and anti-correlations exist between many semantic features, which makes inference and calculating the probability of entailment intractable.

To allow for efficient inference in such a model, we propose a mean-field approximation. This in effect assumes that the posterior distribution over vectors is factorised, but in practice this is a much weaker assumption than assuming the prior is factorised. The posterior distribution has less uncertainty and therefore is influenced less by non-factorised prior constraints. By assuming a factorised posterior, we can then represent distributions over feature vectors with simple vectors of probabilities of individual features (or, as below, with their log-odds). These real-valued vectors are the basis of the proposed vector-space model of entailment.
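As a concrete reading of equation (1), the sketch below (illustrative only; the toy non-factorised priors are invented to encode the mutual exclusion of f and ¬f discussed in Section 1) computes P(y ⇒ x) by enumerating all binary vectors. The enumeration is exponential in the number of features, which is exactly the intractability that the mean-field approximation developed next is meant to avoid.

    from itertools import product

    # Equation (1), by brute force: P(y => x) = E_P(x) E_P(y) prod_k (1 - (1 - y_k) x_k).
    # The priors are toy, hand-picked distributions (illustration only); dimension 0 is
    # "f known" and dimension 1 is "not-f known", so both being known has zero mass.

    K = 2
    vecs = list(product([0, 1], repeat=K))

    P_x = {(0, 0): 0.5, (1, 0): 0.25, (0, 1): 0.25, (1, 1): 0.0}   # non-factorised prior
    P_y = dict(P_x)                                                # same prior for y

    def entail_prob(x, y):
        p = 1.0
        for xk, yk in zip(x, y):
            p *= 1.0 - (1.0 - yk) * xk
        return p  # 1 if all known features of x are known in y, else 0

    p_entail = sum(P_x[x] * P_y[y] * entail_prob(x, y) for x in vecs for y in vecs)
    print("P(y => x) =", p_entail)   # enumeration is O(2^K * 2^K): intractable for realistic K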

In the next two subsections, we derive a mean-field approximation for inference of real-valued vectors in entailment graphs. This derivation leads to three proposed vector-space operators for approximating the log-probability of entailment, summarised in Table 2. These operators will be used in the evaluation in Section 5. This inference framework will also be used in Section 3 to model how existing word embeddings can be mapped to vectors to which the entailment operators can be applied.

2.1 A Mean-Field Approximation

A mean-field approximation approximates the posterior P using a factorised distribution Q. First of all, this gives us a concise description of the posterior P(x | …) as a vector of continuous values Q(x=1), where Q(x=1)_k = Q(x_k=1) ≈ E_{P(x|…)} x_k = P(x_k=1 | …) (i.e. the marginal probabilities of each bit). Secondly, as is shown below, this gives us efficient methods for doing approximate inference of vectors in a model.

First we consider the simple case where we want to approximate the posterior distribution P(x, y | y ⇒ x). In a mean-field approximation, we want to find a factorised distribution Q(x, y) which minimises the KL-divergence D_KL(Q(x, y) ‖ P(x, y | y ⇒ x)) with the true distribution P(x, y | y ⇒ x).

    L = D_KL( Q(x, y) ‖ P(x, y | (y ⇒ x)) )
      ∝ Σ_{x,y} Q(x, y) log [ Q(x, y) / P(x, y, (y ⇒ x)) ]
      = Σ_k E_{Q(x_k)} log Q(x_k) + Σ_k E_{Q(y_k)} log Q(y_k)
        − E_{Q(x)} log P(x) − E_{Q(y)} log P(y)
        − Σ_k E_{Q(x_k)} E_{Q(y_k)} log(1 − (1 − y_k) x_k)

In the final equation, the first two terms are the negative entropy of Q, −H(Q), which acts as a maximum entropy regulariser, the final term enforces the entailment constraint, and the middle two terms represent the prior for x and y. One approach (generalised further in the next subsection) to the prior terms −E_{Q(x)} log P(x) is to bound them by assuming P(x) is a function in the exponential family, giving us:

    −E_{Q(x)} log P(x) ≥ −E_{Q(x)} log [ exp(Σ_k θ^x_k x_k) / Z_θ ]
                       = −Σ_k E_{Q(x_k)} θ^x_k x_k + log Z_θ

where the log Z_θ is not relevant in any of our inference problems and thus will be dropped below.

As typically in mean-field approximations, inference of Q(x) and Q(y) can't be done efficiently with this exact objective L, because of the non-linear interdependence between x_k and y_k in the last term. Thus, we introduce two approximations to L, one for use in inferring Q(x) given Q(y) (forward inference), and one for the reverse inference problem (backward inference). In both cases, the approximation is done with an application of Jensen's inequality to the log function, which gives us an upper bound on L, as is standard practice in mean-field approximations. For forward inference:

    L ≤ −H(Q) − Σ_k ( Q(x_k=1) θ^x_k + E_{Q(y_k)} θ^y_k y_k + Q(x_k=1) log Q(y_k=1) )             (2)

which we can optimise for Q(x_k=1):

    Q(x_k=1) = σ( θ^x_k + log Q(y_k=1) )                                                          (3)

where σ() is the sigmoid function. The sigmoid function arises from the entropy regulariser, making this a specific form of maximum entropy model. And for backward inference:

    L ≤ −H(Q) − Σ_k ( E_{Q(x_k)} θ^x_k x_k + Q(y_k=1) θ^y_k + (1 − Q(y_k=1)) log(1 − Q(x_k=1)) )   (4)

which we can optimise for Q(y_k=1):

    Q(y_k=1) = σ( θ^y_k − log(1 − Q(x_k=1)) )                                                     (5)

Note that in equations (2) and (4) the final terms, Q(x_k=1) log Q(y_k=1) and (1 − Q(y_k=1)) log(1 − Q(x_k=1)) respectively, are approximations to the log-probability of the entailment. We define two vector-space operators, < and >, to be these same approximations:

    log Q(y ⇒ x) ≈ Σ_k E_{Q(x_k)} log( E_{Q(y_k)} (1 − (1 − y_k) x_k) )
                 = Q(x=1) · log Q(y=1)  ≡  X < Y

    log Q(y ⇒ x) ≈ Σ_k E_{Q(y_k)} log( E_{Q(x_k)} (1 − (1 − y_k) x_k) )
                 = (1 − Q(y=1)) · log(1 − Q(x=1))  ≡  Y > X
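The updates (3) and (5) are straightforward to apply in code. The sketch below is an illustration under assumed toy prior log-odds (the θ values are invented here), using a numerically stable sigmoid; it is not the authors' implementation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward_update(theta_x, Q_y1):
        # Equation (3): Q(x_k=1) = sigma(theta^x_k + log Q(y_k=1))
        return sigmoid(theta_x + np.log(Q_y1))

    def backward_update(theta_y, Q_x1):
        # Equation (5): Q(y_k=1) = sigma(theta^y_k - log(1 - Q(x_k=1)))
        return sigmoid(theta_y - np.log(1.0 - Q_x1))

    # Toy prior log-odds for a 4-dimensional feature vector (illustrative values only).
    theta_x = np.array([ 1.0, -0.5,  0.0,  2.0])
    theta_y = np.array([ 0.5,  0.5, -1.0,  1.0])

    Q_y1 = sigmoid(theta_y)                 # start y from its prior
    Q_x1 = forward_update(theta_x, Q_y1)    # infer x given y   (y => x)
    Q_y1 = backward_update(theta_y, Q_x1)   # infer y given x   (y => x)
    print(Q_x1, Q_y1)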

    X < Y    ≡  σ(X) · log σ(Y)
    Y > X    ≡  σ(−Y) · log σ(−X)
    Y ⇒̃ X    ≡  Σ_k log(1 − σ(−Y_k) σ(X_k))

Table 2: The proposed entailment operators, approximating log P(y ⇒ x).

We parametrise these operators with the vectors X, Y of log-odds of Q(x), Q(y), namely X = log [Q(x=1) / Q(x=0)] = σ⁻¹(Q(x=1)). The resulting operator definitions are summarised in Table 2.

Also note that the probability of entailment given in equation (1) becomes factorised when we replace P with Q. We define a third vector-space operator, ⇒̃, to be this factorised approximation, also shown in Table 2.

2.2 Inference in Entailment Graphs

In general, doing inference for one entailment is not enough; we want to do inference in a graph of entailments between variables. In this section we generalise the above mean-field approximation to entailment graphs.

To represent information about variables that comes from outside the entailment graph, we assume we are given a prior P(x) over all variables x_i in the graph. As above, we do not assume that this prior is factorised. Instead we assume that the prior P(x) is itself a graphical model which can be approximated with a mean-field approximation.

Given a set of variables x_i each representing vectors of binary variables x_ik, a set of entailment relations r = {(i, j) | (x_i ⇒ x_j)}, and a set of negated entailment relations r̄ = {(i, j) | (x_i ⇏ x_j)}, we can write the joint posterior probability as:

    P(x, r, r̄) = (1/Z) P(x) Π_i ( Π_{j:r(i,j)} Π_k P(x_ik ⇒ x_jk | x_ik, x_jk) )
                                ( Π_{j:r̄(i,j)} Π_k (1 − P(x_ik ⇒ x_jk | x_ik, x_jk)) )

We want to find a factorised distribution Q that minimises L = D_KL(Q(x) ‖ P(x | r, r̄)). As above, we bound this loss for each element X_ik = σ⁻¹(Q(x_ik=1)) of each vector we want to infer, using analogous Jensen's inequalities for the terms involving nodes i and j such that r(i, j) or r(j, i). For completeness, we also propose similar inequalities for nodes i and j such that r̄(i, j) or r̄(j, i), and bound them using the constants C_ijk ≥ Π_{k′≠k} (1 − σ(−X_ik′) σ(X_jk′)).

To represent the prior P(x), we use the terms

    θ_ik(X_ik̄) ≤ log [ E_{Q(x_ik̄)} P(x_ik̄, x_ik=1) / (1 − E_{Q(x_ik̄)} P(x_ik̄, x_ik=1)) ]

where x_ik̄ is the set of all x_i′k′ such that either i′≠i or k′≠k. These terms can be thought of as the log-odds terms that would be contributed to the loss function by including the prior's graphical model in the mean-field approximation.

Now we can infer the optimal X_ik as:

    X_ik = θ_ik(X_ik̄) − Σ_{j:r(i,j)} log σ(−X_jk) + Σ_{j:r(j,i)} log σ(X_jk)
           + Σ_{j:r̄(j,i)} log [ (1 − C_ijk σ(X_jk)) / (1 − C_ijk) ]
           + Σ_{j:r̄(i,j)} log [ (1 − C_ijk σ(−X_jk)) / (1 − C_ijk) ]                              (6)

In summary, the proposed mean-field approximation does inference in entailment graphs by iteratively re-estimating each X_i as the sum of: the prior log-odds, −log σ(−X_j) for each entailed variable j, and log σ(X_j) for each entailing variable j.¹ This inference optimises X_i < X_j for each entailing j plus X_i > X_j for each entailed j, plus a maximum entropy regulariser on X_i. Negative entailment relations, if they exist, can also be incorporated with some additional approximations. Complex priors can also be incorporated through their log-odds, simulating the inclusion of the prior within the mean-field approximation.

¹ It is interesting to note that −log σ(−X_j) is a non-negative transform of X_j, similar to the ReLU nonlinearity which is popular in deep neural networks (Glorot et al., 2011). log σ(X_j) is the analogous non-positive transform.

Given its dependence on mean-field approximations, it is an empirical question to what extent we should view this model as computing real entailment probabilities and to what extent we should view it as a well-motivated non-linear mapping for which we simply optimise the input-output behaviour (as for neural networks (Henderson and Titov, 2010)). In Sections 3 and 5 we argue for the former (stronger) view.
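The following sketch illustrates the operators of Table 2 and the iterative re-estimation just summarised, restricted to positive entailment relations (the C_ijk correction terms for negated relations are omitted). Vectors are log-odds as in the text; the graph, priors and iteration count are assumptions made for the example, not details from the paper.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def log_sigmoid(z):
        # numerically stable log(sigma(z))
        return -np.logaddexp(0.0, -z)

    # The three entailment operators of Table 2, approximating log P(y => x).
    def op_lt(X, Y):          # X < Y   =  sigma(X) . log sigma(Y)
        return np.dot(sigmoid(X), log_sigmoid(Y))

    def op_gt(Y, X):          # Y > X   =  sigma(-Y) . log sigma(-X)
        return np.dot(sigmoid(-Y), log_sigmoid(-X))

    def op_factorised(Y, X):  # Y ~=> X =  sum_k log(1 - sigma(-Y_k) sigma(X_k))
        return np.sum(np.log(1.0 - sigmoid(-Y) * sigmoid(X)))

    # Mean-field inference in an entailment graph (positive relations only):
    # X_i = prior log-odds - sum_{j entailed by i} log sigma(-X_j) + sum_{j entailing i} log sigma(X_j)
    def infer(theta, entails, n_iters=20):
        X = {i: th.copy() for i, th in theta.items()}
        for _ in range(n_iters):
            for i in theta:
                X[i] = theta[i].copy()
                for (a, b) in entails:
                    if a == i:                 # i entails b: b is an entailed variable
                        X[i] -= log_sigmoid(-X[b])
                    if b == i:                 # a entails i: a is an entailing variable
                        X[i] += log_sigmoid(X[a])
        return X

    # Toy 3-node chain x1 => x2 => x3 with illustrative priors.
    theta = {1: np.array([2.0, -1.0]), 2: np.array([0.0, 0.0]), 3: np.array([-1.0, 1.0])}
    X = infer(theta, entails=[(1, 2), (2, 3)])
    print({i: np.round(v, 2) for i, v in X.items()})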

3 Interpreting Word2Vec Vectors

To evaluate how well the proposed framework provides a formal foundation for the distributional semantics of entailment, we use it to re-interpret an existing model of distributional semantics in terms of semantic entailment. There has been a lot of work on how to use the distribution of contexts in which a word occurs to induce a vector representation of the semantics of words. In this paper, we leverage this previous work on distributional semantics by re-interpreting a previous distributional semantic model and using this understanding to map its vector-space word embeddings to vectors in the proposed framework. We then use the proposed operators to predict entailment between words using these vectors. In Section 5 below, we evaluate these predictions on the task of hyponymy detection. In this section we motivate three different ways to interpret the Word2Vec (Mikolov et al., 2013a; Mikolov et al., 2013b) distributional semantic model as an approximation to an entailment-based model of the semantic relationship between a word and its context.

Distributional semantics learns the semantics of words by looking at the distribution of contexts in which they occur. To model this relationship, we assume that the semantic features of a word are (statistically speaking) redundant with those of its context words, and consistent with those of its context words. We model these properties using a hidden vector which is the consistent unification of the features of the middle word and the context. In other words, there must exist a hidden vector which entails both of these vectors, and is consistent with prior constraints on vectors. We split this into two steps, inference of the hidden vector Y from the middle vector X_m, context vectors X_c and prior, and computing the log-probability (7) that this hidden vector entails the middle and context vectors:

    max_Y ( log P(y, y ⇒ x_m, y ⇒ x_c) )                                         (7)

We interpret Word2Vec's Skip-Gram model as learning its context and middle word vectors so that the log-probability of this entailment is high for the observed context words and low for other (sampled) context words. The word embeddings produced by Word2Vec are only related to the vectors X_m assigned to the middle words; context vectors are computed but not output. We model the context vectors X_c′ as combining (as in equation (5)) information about a context word itself with information which can be inferred from this word given the prior, X_c′ = θ_c − log σ(−X_c).

The numbers in the vectors output by Word2Vec are real numbers between negative infinity and infinity, so the simplest interpretation of them is as the log-odds of a feature being known. In this case we can treat these vectors directly as the X_m in the model. The inferred hidden vector Y can then be calculated using the model of backward inference from the previous section:

    Y = θ_c − log σ(−X_c) − log σ(−X_m)
      = X_c′ − log σ(−X_m)

Since the unification Y of context and middle word features is computed using backward inference, we use the backward-inference operator > to calculate how successful that unification was. This gives us the final score:

    log P(y, y ⇒ x_m, y ⇒ x_c) ≈ Y > X_m + Y > X_c + σ(−Y) · θ_c
                               = Y > X_m + σ(−Y) · X_c′

This is a natural interpretation, but it ignores the equivalence in Word2Vec between pairs of positive values and pairs of negative values, due to its use of the dot product. As a more accurate interpretation, we interpret each Word2Vec dimension as specifying whether its feature is known to be true or known to be false. Translating this Word2Vec vector into a vector in our entailment vector space, we get one copy Y⁺ of the vector representing known-to-be-true features and a second negated duplicate Y⁻ of the vector representing known-to-be-false features, which we concatenate to get our representation Y:

    Y⁺ = X_c′ − log σ(−X_m)
    Y⁻ = −X_c′ − log σ(X_m)

    log P(y, y ⇒ x_m, y ⇒ x_c) ≈ Y⁺ > X_m + σ(−Y⁺) · X_c′ + Y⁻ > (−X_m) + σ(−Y⁻) · (−X_c′)

As a third alternative, we modify this latter interpretation with some probability mass reserved for unknown in the vicinity of zero. By subtracting 1 from both the original and negated copies of each dimension, we get a probability of unknown of 1 − σ(X_m − 1) − σ(−X_m − 1). This gives us:

    Y⁺ = X_c′ − log σ(−(X_m − 1))
    Y⁻ = −X_c′ − log σ(−(−X_m − 1))

    log P(y, y ⇒ x_m, y ⇒ x_c) ≈ Y⁺ > (X_m − 1) + σ(−Y⁺) · X_c′ + Y⁻ > (−X_m − 1) + σ(−Y⁻) · (−X_c′)
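As we read them, the three interpretations correspond to three simple translations of a Word2Vec vector into the entailment space, which is how they are applied to word vectors in the unsupervised experiments of Section 5: keep the vector as log-odds, concatenate a negated duplicate, or additionally shift both copies by the constant 1. The sketch below is our own illustration of these translations; the random vectors stand in for real embeddings.

    import numpy as np

    def log_odds_interp(v):
        # first interpretation: dimensions are log-odds of features being known
        return v

    def dup_interp(v):
        # second interpretation: concatenate known-true and (negated) known-false copies
        return np.concatenate([v, -v])

    def unk_dup_interp(v, shift=1.0):
        # third interpretation: subtract a constant from both copies, leaving
        # probability mass for "unknown" when values are near zero
        return np.concatenate([v - shift, -v - shift])

    # toy 300-dimensional "word vectors" (random stand-ins for real embeddings)
    rng = np.random.default_rng(0)
    hyper, hypo = rng.normal(size=300), rng.normal(size=300)

    def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
    def log_sigmoid(z): return -np.logaddexp(0.0, -z)
    def op_gt(Y, X): return np.dot(sigmoid(-Y), log_sigmoid(-X))   # Y > X from Table 2

    # e.g. an "unk dup >" score for hyponym => hypernym
    # (the hyponym's features should include the hypernym's)
    print(op_gt(unk_dup_interp(hypo), unk_dup_interp(hyper)))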

[Figure 1: The learning gradients for Word2Vec, the log-odds >, and the unk dup > interpretation of its vectors.]

To understand better the relative accuracy of these three interpretations, we compared the training gradient which Word2Vec uses to train its middle-word vectors to the training gradient for each of these interpretations. We plotted these gradients for the range of values typically found in Word2Vec vectors for both the middle vector and the context vector. Figure 1 shows three of these plots. As expected, the second interpretation is more accurate than the first because its plot is anti-symmetric around the diagonal, like the Word2Vec gradient. In the third alternative, the constant 1 was chosen to optimise this match, producing a close match to the Word2Vec training gradient, as shown in Figure 1 (Word2Vec versus Unk dup).

Thus, Word2Vec can be seen as a good approximation to the third model, and a progressively worse approximation to the second and first models. Therefore, if the entailment-based distributional semantic model we propose is accurate, then we would expect the best accuracy in hyponymy detection using the third interpretation of Word2Vec vectors, and progressively worse accuracy for the other two interpretations. As we will see in Section 5, this prediction holds.

4 Related Work

There has been a significant amount of work on using distributional-semantic vectors for hyponymy detection, using supervised, semi-supervised or unsupervised methods (e.g. (Yu et al., 2015; Necsulescu et al., 2015; Vylomova et al., 2015; Weeds et al., 2014; Fu et al., 2015; Rei and Briscoe, 2014)). Because our main concern is modelling entailment within a vector space, we do not do a thorough comparison to models which use measures computed outside the vector space (e.g. symmetric measures (LIN (Lin, 1998)), asymmetric measures (WeedsPrec (Weeds and Weir, 2003; Weeds et al., 2004), balAPinc (Kotlerman et al., 2010), invCL (Lenci and Benotto, 2012)) and entropy-based measures (SLQS (Santus et al., 2014))), nor to models which encode hyponymy in the parameters of a vector-space operator or classifier (Fu et al., 2015; Roller et al., 2014; Baroni et al., 2012). We also limit our evaluation of lexical entailment to hyponymy, not including other related lexical relations (cf. (Weeds et al., 2014; Vylomova et al., 2015; Turney and Mohammad, 2014; Levy et al., 2014)), leaving more complex cases to future work on compositional semantics. We are also not concerned with models or evaluations which require supervised learning about individual words, instead limiting ourselves to semi-supervised learning where the words in the training and test sets are disjoint.

For these reasons, in our evaluations we replicate the experimental setup of Weeds et al. (2014), for both unsupervised and semi-supervised models. Within this setup, we compare to the results of the models evaluated by Weeds et al. (2014) and to previously proposed vector-space operators. This includes one vector space operator for hyponymy which doesn't have trained parameters, proposed by Rei and Briscoe (2014), called weighted cosine. The dimensions of the dot product (normalised to make it a cosine measure) are weighted to put more weight on the larger values in the entailed (hypernym) vector.

We base this evaluation on the Word2Vec (Mikolov et al., 2013a; Mikolov et al., 2013b) distributional semantic model and its publicly available word embeddings. We choose it because it is popular, simple, fast, and its embeddings have been derived from a very large corpus. Levy and Goldberg (2014) showed that it is closely related to the previous PMI-based distributional semantic models (e.g. (Turney and Pantel, 2010)).

The most similar previous work, in terms of motivation and aims, is that of Vilnis and McCallum (2015). They also model entailment directly using a vector space, without training a classifier. But instead of representing words as a point in a vector space (as in this work), they represent words as a Gaussian distribution over points in a vector space. This allows them to represent the extent to which a feature is known versus unknown as the amount of variance in the distribution for that feature's dimension. While nicely motivated theoretically, the model appears to be more computationally expensive than the one proposed here, particularly for inferring vectors. They do make unsupervised predictions of hyponymy relations with their learned vector distributions, using KL-divergence between the distributions for the two words. They evaluate their models on the hyponymy data from (Baroni et al., 2012). As discussed further in Section 5.2, our best models achieve non-significantly better average precision than their best models.

The semi-supervised model of Kruszewski et al. (2015) also models entailment in a vector space, but they use a discrete vector space. They train a mapping from distributional semantic vectors to Boolean vectors such that feature inclusion respects a training set of entailment relations. They then use feature inclusion to predict hyponymy, and other lexical entailment relations. This approach is similar to the one used in our semi-supervised experiments, except that their discrete entailment prediction operator is very different from our proposed entailment operators.

5 Evaluation

To evaluate whether the proposed framework is an effective model of entailment in vector spaces, we apply the interpretations from Section 3 to publicly available word embeddings and use them to predict the hyponymy relations in a benchmark dataset. This framework predicts that the more accurate interpretations of Word2Vec result in more accurate unsupervised models of hyponymy. We evaluate on detecting hyponymy relations between words because hyponymy is the canonical type of lexical entailment; most of the semantic features of a hypernym (e.g. "animal") must be included in the semantic features of the hyponym (e.g. "dog"). We evaluate in both a fully unsupervised setup and a semi-supervised setup.

5.1 Hyponymy with Word2Vec Vectors

For our evaluation on hyponymy detection, we replicate the experimental setup of Weeds et al. (2014), using their selection of word pairs² from the BLESS dataset (Baroni and Lenci, 2011).³ These noun-noun word pairs include positive hyponymy pairs, plus negative pairs consisting of some other hyponymy pairs reversed, some pairs in other semantic relations, and some random pairs. Their selection is balanced between positive and negative examples, so that accuracy can be used as the performance measure. For their semi-supervised experiments, ten-fold cross validation is used, where for each test set, items are removed from the associated training set if they contain any word from the test set. Thus, the vocabulary of the training and testing sets are always disjoint, thereby requiring that the models learn about the vector space and not about the words themselves. We had to perform our own 10-fold split, but apply the same procedure to filter the training set.

² https://github.com/SussexCompSem/learninghypernyms
³ Of the 1667 word pairs in this data, 24 were removed because we do not have an embedding for one of the words.

We could not replicate the word embeddings used in Weeds et al. (2014), so instead we use publicly available word embeddings.⁴ These vectors were trained with the Word2Vec software applied to about 100 billion words of the Google-News dataset, and have 300 dimensions.

⁴ https://code.google.com/archive/p/word2vec/

The hyponymy detection results are given in Table 3, including both unsupervised (upper box) and semi-supervised (lower box) experiments. We report two measures of performance, hyponymy detection accuracy (50% Acc) and direction classification accuracy (Dir Acc). Since all the operators only determine a score, we need to choose a threshold to get detection accuracies. Given that the proportion of positive examples in the dataset has been artificially set at 50%, we threshold each model's score at the point where the proportion of positive examples output is 50%, which we call "50% Acc". Thus the threshold is set after seeing the testing inputs but not their target labels.

Direction classification accuracy (Dir Acc) indicates how well the method distinguishes the relative abstractness of two nouns. Given a pair of nouns which are in a hyponymy relation, it classifies which word is the hypernym and which is the hyponym. This measure only considers positive examples and chooses one of two directions, so it is inherently a balanced binary classification task. Classification is performed by simply comparing the scores in both directions. If both directions produce the same score, the expected random accuracy (50%) is used.
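Both measures can be stated compactly; the sketch below is an illustrative reading of the protocol (the function and argument names are ours). Thresholding so that half of the outputs are positive amounts to thresholding at the median test score, and direction classification simply compares the score in the two directions, counting ties as chance.

    import numpy as np

    def fifty_percent_accuracy(scores, labels):
        # Threshold at the point where half of the outputs are labelled positive
        # (i.e. at the median test score); labels are 0/1.
        threshold = np.median(scores)
        predictions = (scores > threshold).astype(int)
        return np.mean(predictions == labels)

    def direction_accuracy(score_fn, positive_pairs):
        # For each true (hyponym, hypernym) pair, check that the score in the
        # correct direction beats the reversed one; ties count as chance (0.5).
        correct = 0.0
        for hypo, hyper in positive_pairs:
            forward, backward = score_fn(hypo, hyper), score_fn(hyper, hypo)
            correct += 1.0 if forward > backward else (0.5 if forward == backward else 0.0)
        return correct / len(positive_pairs)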

As representative of previous work, we report the best results from Weeds et al. (2014), who try a number of unsupervised and semi-supervised models, and use the same testing methodology and hyponymy data. However, note that their word embeddings are different. For the semi-supervised models, Weeds et al. (2014) train classifiers, which are potentially more powerful than our linear vector mappings. We also compare the proposed operators to the dot product (dot),⁵ vector differences (dif), and the weighted cosine of Rei and Briscoe (2014) (weighted cos), all computed with the same word embeddings as for the proposed operators.

⁵ We also tested the cosine measure, but results were very slightly worse than dot.

    operator        supervision   50% Acc   Dir Acc
    Weeds et al.    None          58%       –
    log-odds <      None          54.0%     55.9%
    weighted cos    None          55.5%     57.9%
    dot             None          56.3%     50%
    dif             None          56.9%     59.6%
    log-odds ⇒̃      None          57.0%     59.4%
    log-odds >      None          60.1%*    62.2%
    dup >           None          61.7%     68.8%
    unk dup ⇒̃       None          63.4%*    68.8%
    unk dup >       None          64.5%     68.8%
    Weeds et al.    SVM           75%       –
    mapped dif      cross ent     64.3%     72.3%
    mapped <        cross ent     74.5%     91.0%
    mapped ⇒̃        cross ent     77.5%     92.3%
    mapped >        cross ent     80.1%     90.0%

Table 3: Accuracies on the BLESS data from Weeds et al. (2014), for hyponymy detection (50% Acc) and hyponymy direction classification (Dir Acc), in the unsupervised (upper box) and semi-supervised (lower box) experiments. For unsupervised accuracies, * marks a significant difference with the previous row.

In Section 3 we argued for three progressively more accurate interpretations of Word2Vec vectors in the proposed framework, the log-odds interpretation (log-odds >), the negated duplicate interpretation (dup >), and the negated duplicate interpretation with unknown around zero (unk dup >). We also evaluate using the factorised calculation of entailment (log-odds ⇒̃, unk dup ⇒̃), and the forward-inference entailment operator (log-odds <), neither of which match the proposed interpretations. For the semi-supervised case, we train a linear vector-space mapping into a new vector space, in which we apply the operators (mapped operators). All these results are discussed in the next two subsections.

5.2 Unsupervised Hyponymy Detection

The first set of experiments evaluates the vector-space operators in unsupervised models of hyponymy detection. The proposed models are compared to the dot product, because this is the standard vector-space operator and has been shown to capture semantic similarity very well. However, because the dot product is a symmetric operator, it always performs at chance for direction classification. Another vector-space operator which has received much attention recently is vector differences. This is used (with vector sum) to perform semantic transforms, such as "king − male + female = queen", and has previously been used for modelling hyponymy (Vylomova et al., 2015; Weeds et al., 2014). For our purposes, we sum the pairwise differences to get a score which we use for hyponymy detection.

For the unsupervised results in the upper box of Table 3, the best unsupervised model of Weeds et al. (2014) and the operators dot, dif and weighted cos all perform similarly on accuracy, as does the log-odds factorised entailment calculation (log-odds ⇒̃). The forward-inference entailment operator (log-odds <) performs above chance but not well, as expected given the backward-inference-based interpretation of Word2Vec vectors. By definition, dot is at chance for direction classification, but the other models all perform better, indicating that all these operators are able to measure relative abstractness. As predicted, the > operator performs significantly better than all these results on accuracy, as well as on direction classification, even assuming the log-odds interpretation of Word2Vec vectors.

When we move to the more accurate interpretation of Word2Vec vectors as specifying both original and negated features (dup >), we improve (non-significantly) on the log-odds interpretation. Finally, the third and most accurate interpretation, where values around zero can be unknown (unk dup >), achieves the best results in unsupervised hyponymy detection, as well as for direction classification. Changing to the factorised entailment operator (unk dup ⇒̃) is worse but also significantly better than the other accuracies.

To allow a direct comparison to the model of Vilnis and McCallum (2015), we also evaluated the unsupervised models on the hyponymy data from (Baroni et al., 2012). Our best model achieved 81% average precision on this dataset, non-significantly better than the 80% achieved by the best model of Vilnis and McCallum (2015).

5.3 Semi-supervised Hyponymy Detection

Since the unsupervised learning of word embeddings may reflect many context-word correlations which have nothing to do with hyponymy, we also consider a semi-supervised setting. Adding some supervision helps distinguish features that capture semantic properties from other features which are not relevant to hyponymy detection. But even with supervision, we still want the resulting model to be captured in a vector space, and not in a parametrised scoring function. Thus, we train mappings from the Word2Vec word vectors to new word vectors, and then apply the entailment operators in this new vector space to predict hyponymy. Because the words in the testing set are always disjoint from the words in the training set, this experiment measures how well the original unsupervised vector space captures features that generalise entailment across words, and not how well the mapping can learn about individual words.

Our objective is to learn a mapping to a new vector space in which an operator can be applied to predict hyponymy. We train linear mappings for the > operator (mapped >) and for vector differences (mapped dif), since these were the best performing proposed operator and baseline operator, respectively, in the unsupervised experiments. We do not use the duplicated interpretations because these transforms are subsumed by the ability to learn a linear mapping.⁶ Previous work on using vector differences for semi-supervised hyponymy detection has used a linear SVM (Vylomova et al., 2015; Weeds et al., 2014), which is mathematically equivalent to our vector-differences model, except that we use cross entropy loss and they use a large-margin loss and SVM training.

⁶ Empirical results confirm that this is in practice the case, so we do not include these results in the table.

The semi-supervised results in the bottom box of Table 3 show a similar pattern to the unsupervised results.⁷ The > operator achieves the best generalisation from training word vectors to testing word vectors. The mapped > model has the best accuracy, followed by the factorised entailment operator mapped ⇒̃ and Weeds et al. (2014). Direction accuracies of all the proposed operators (mapped >, mapped ⇒̃, mapped <) reach into the 90's. The dif operator performs particularly poorly in this mapped setting, perhaps because both the mapping and the operator are linear. These semi-supervised results again support our distributional-semantic interpretations of Word2Vec vectors and their associated entailment operator >.

⁷ It is not clear how to measure significance for cross-validation results, so we do not attempt to do so.

6 Conclusion

In this work, we propose a vector-space model which provides a formal foundation for a distributional semantics of entailment. We developed a mean-field approximation to probabilistic entailment between vectors which represent known versus unknown features. And we used this framework to derive vector operators for entailment and vector inference equations for entailment graphs. This framework allows us to reinterpret Word2Vec as approximating an entailment-based distributional semantic model of words in context, and show that more accurate interpretations result in more accurate unsupervised models of lexical entailment, achieving better accuracies than previous models. Semi-supervised evaluations confirm these results.

A crucial distinction between the semi-supervised models here and much previous work is that they learn a mapping into a vector space which represents entailment, rather than learning a parametrised entailment classifier. Within this new vector space, the entailment operators and inference equations apply, thereby generalising naturally from these lexical representations to the compositional semantics of multi-word expressions and sentences. Further work is needed to explore the full power of these abilities to extract information about entailment from both unlabelled text and labelled entailment data, encode it all in a single vector space, and efficiently perform complex inferences about vectors and entailments. This future work on compositional distributional semantics should further demonstrate the full power of the proposed framework for modelling entailment in a vector space.

References

Marco Baroni and Alessandro Lenci. 2011. How we blessed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, GEMS '11, pages 1–10. Association for Computational Linguistics.

Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do, and Chung-chieh Shan. 2012. Entailment above the word level in distributional semantics. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 23–32, Avignon, France. Association for Computational Linguistics.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. 2015. Learning semantic hierarchies: A continuous vector space approach. Audio, Speech, and Language Processing, IEEE/ACM Transactions on, 23(3):461–471.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics, pages 315–323.

James Henderson and Ivan Titov. 2010. Incremental sigmoid belief networks for grammar learning. Journal of Machine Learning Research, 11(Dec):3541–3570.

Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4):359–389.

Germán Kruszewski, Denis Paperno, and Marco Baroni. 2015. Deriving boolean structures from distributional vectors. Transactions of the Association for Computational Linguistics, 3:375–388.

Alessandro Lenci and Giulia Benotto. 2012. Identifying hypernyms in distributional semantic spaces. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, SemEval '12, pages 75–79. Association for Computational Linguistics.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2177–2185. Curran Associates, Inc.

Omer Levy, Ido Dagan, and Jacob Goldberger. 2014. Focused entailment graphs for open IE propositions. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 87–97, Ann Arbor, Michigan. Association for Computational Linguistics.

Omer Levy, Steffen Remus, Chris Biemann, and Ido Dagan. 2015. Do supervised distributional methods really learn lexical inference relations? In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 970–976, Denver, Colorado, May–June. Association for Computational Linguistics.

Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 2, COLING '98, pages 768–774. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Silvia Necsulescu, Sara Mendes, David Jurgens, Núria Bel, and Roberto Navigli. 2015. Reading between the lines: Overcoming data sparsity for accurate classification of lexical relationships. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, pages 182–192, Denver, Colorado. Association for Computational Linguistics.

Marek Rei and Ted Briscoe. 2014. Looking for hyponyms in vector space. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 68–77, Ann Arbor, Michigan. Association for Computational Linguistics.

Stephen Roller, Katrin Erk, and Gemma Boleda. 2014. Inclusive yet selective: Supervised distributional hypernymy detection. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1025–1036, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Enrico Santus, Alessandro Lenci, Qin Lu, and Sabine Schulte im Walde. 2014. Chasing hypernyms in vector spaces with entropy. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2014, April 26-30, 2014, Gothenburg, Sweden, pages 38–42.

Hinrich Schütze. 1993. Word space. In Advances in Neural Information Processing Systems 5, pages 895–902. Morgan Kaufmann.

Peter D. Turney and Saif M. Mohammad. 2014. Experiments with three approaches to recognizing lexical entailment. CoRR, abs/1401.8269.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. J. Artif. Int. Res., 37(1):141–188.

Luke Vilnis and Andrew McCallum. 2015. Word representations via Gaussian embedding. In Proceedings of the International Conference on Learning Representations 2015 (ICLR).

Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and Timothy Baldwin. 2015. Take and took, gaggle and goose, book and read: Evaluating the utility of vector differences for lexical relation learning. In CoRR 2015.

Julie Weeds and David Weir. 2003. A general framework for distributional similarity. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, EMNLP '03, pages 81–88.

Julie Weeds, David Weir, and Diana McCarthy. 2004. Characterising measures of lexical distributional similarity. In Proceedings of the 20th International Conference on Computational Linguistics, COLING '04, pages 1015–1021. Association for Computational Linguistics.

Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir, and Bill Keller. 2014. Learning to distinguish hypernyms and co-hyponyms. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2249–2259, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Zheng Yu, Haixun Wang, Xuemin Lin, and Min Wang. 2015. Learning term embeddings for hypernymy identification. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015. AAAI Press / International Joint Conferences on Artificial Intelligence.
