Two experiments for embedding Wordnet hierarchy into vector spaces∗

Jean-Philippe Bernardy and Aleksandre Maskharashvili
Gothenburg University, Department of Philosophy, Linguistics and Theory of Science, Centre for Linguistics and Studies in Probability

jean-philippe.bernardy,[email protected]

∗ Supported by the Swedish Research Council, Grant number 2014-39.

Abstract

In this paper, we investigate the mapping of the WORDNET hyponymy relation to feature vectors. Our aim is to model lexical knowledge in such a way that it can be used as input in generic machine-learning models, such as phrase entailment predictors. We propose two models. The first one leverages an existing mapping of words to feature vectors (fastText), and attempts to classify such vectors as within or outside of each class. The second model is fully supervised, using solely WORDNET as a ground truth. It maps each concept to an interval or a disjunction thereof. The first model approaches but does not quite attain state-of-the-art performance. The second model can achieve near-perfect accuracy.

1 Introduction

Distributional encodings of word meanings from large corpora (Mikolov et al., 2013; Mikolov et al., 2018; Pennington et al., 2014) have been found to be useful for a number of NLP tasks.

While the major goal of distributional approaches is to identify distributional patterns of words and word sequences, they have even found use in tasks that require modeling more fine-grained relations between words than co-occurrence in word sequences. But distributional word embeddings are not easy to map onto ontological relations or vice-versa. We consider in this paper the hyponymy relation, also called the is-a relation, which is one of the most fundamental ontological relations. We take as the source of truth for hyponymy WORDNET (Fellbaum, 1998), which has been designed to include various kinds of lexical relations between words, phrases, etc. However, WORDNET has a fundamentally symbolic representation, which cannot be readily used as input to neural NLP models.

Several authors have proposed to encode hyponymy relations in feature vectors (Vilnis and McCallum, 2014; Vendrov et al., 2015; Athiwaratkun and Wilson, 2018; Nickel and Kiela, 2017). However, there does not seem to be a common consensus on the underlying properties of such encodings. In this paper, we aim to fill this gap and clearly characterize the properties that such an embedding should have. We additionally propose two baseline models approaching these properties: a simple mapping of FASTTEXT embeddings to the WORDNET hyponymy relation, and a (fully supervised) encoding of this relation in feature vectors.

2 Goals

We want to model the hyponymy relation (ground truth) given by WORDNET — hereafter referred to as HYPONYMY. In this section we make this goal precise and formal. Hyponymy can in general relate common noun phrases, verb phrases or any predicative phrase, but hereafter we abstract from all this and simply write "word" for this underlying set. In this paper, we write (⊆) for the reflexive transitive closure of the hyponymy relation (ground truth), and (⊆_M) for the relation predicted by a model M.¹ Ideally, we want the model to be sound and complete with respect to the ground truth. However, a machine-learned model will typically only approach those properties to a certain level, so the usual relaxations are made:

Property 1 (Partial soundness) A model M is partially sound with precision α iff., for a proportion α of the pairs of words w, w′ such that w ⊆_M w′ holds, w ⊆ w′ holds as well.

Property 2 (Partial completeness) A model M is partially complete with recall α iff., for a proportion α of the pairs of words w, w′ such that w ⊆ w′ holds, then w ⊆_M w′ holds as well.

¹ We note right away that, on its own, the popular metric of cosine similarity (or indeed any metric) is incapable of modeling HYPONYMY, because it is an asymmetric relation. That is to say, we may know that the embedding of "animal" is close to that of "bird", but from that property we have no idea if we should conclude that "a bird is an animal" or rather that "an animal is a bird".

These properties do not constrain the way the relation (⊆_M) is generated from a feature space. However, a satisfying way to generate the inclusion relation is by associating a subset of the vector space to each predicate, and leveraging the inclusion from the feature space. Concretely, the mapping of words to subsets is done by a function P such that, given a word w and a feature vector x, P(w, x) indicates if the word w applies to a situation (state of the world, sentence meaning, sensory input, etc.) described by the feature vector x. We will refer to P as a classifier. The inclusion model is then fully characterized by P, so we can denote it as such (⊆_P).

Property 3 (Space-inclusion compatibility) There exists P : (Word × R^d) → [0, 1] such that

    (w ⊆_P w′) ⇐⇒ (∀x. P(w, x) ≤ P(w′, x))

Any model given by such a P yields a relation (⊆_P) which is necessarily reflexive and transitive (because subset inclusion is such) — the model does not have to learn this. Again, the above property will apply only to ideal situations: it needs to be relaxed in some machine-learning contexts. To this effect, we can define the measure of the subset of situations which satisfies a predicate p : R^d → [0, 1] as follows:

    measure(p) = ∫_{R^d} p(x) dx

(Note that this is well-defined only if p is a measurable function over the measurable space of feature vectors.) We leave implicit the density of the vector space in this definition. Following this definition, a predicate p is included in a predicate q iff.

    measure(p ∧ q) / measure(p) = (∫_{R^d} p(x)q(x) dx) / (∫_{R^d} p(x) dx) = 1

Following this thread, we can define a relaxed inclusion relation, corresponding to a proportion ρ of p being included in q:

Property 4 (Relaxed space-inclusion compatibility) There exists P : Word → R^d → [0, 1] and ρ ∈ [0, 1] such that

    (w ⊆_P w′) ⇐⇒ (∫_{R^d} P(w′, x)P(w, x) dx) / (∫_{R^d} P(w, x) dx) ≥ ρ

In the following, we call ρ the relaxation factor.

3 Mapping WORDNET over fastText

Our first model of HYPONYMY works by leveraging a general-purpose, unsupervised method of generating word vectors. We use fastText (Mikolov et al., 2018) as a modern representative of word-vector embeddings. Precisely, we use the pre-trained word embeddings available on the fastText webpage, trained on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset (16B tokens). We call FTDom the set of words in these pre-trained embeddings.

A stepping stone towards modeling the inclusion relation correctly is modeling correctly each predicate individually. That is, we want to learn a separation between the fastText embeddings of words that belong to a given class (according to WORDNET) and the words that do not. We let each word w in fastText represent a situation corresponding to its embedding f(w). Formally, we aim to find P such that

Property 5 P(w, f(w′)) = 1 ⇐⇒ w′ ⊆ w

for every word w and w′ found both in WORDNET and in the pre-trained embeddings. If the above property is always satisfied, the model is sound and complete, and satisfies Property 3.

Because many classes have few representative elements relative to the number of dimensions of the fastText embeddings, we limit ourselves to a linear model for P, to limit the possibility of overfitting. That is, for any word w, P(w) is entirely determined by a bias b(w) and a vector θ(w) (with 300 dimensions):

    P(w, x) = δ(θ(w) · x + b(w) > 0)

where δ(true) = 1 and δ(false) = 0.

We learn θ(w) and b(w) by using logistic regression, independently for each WORDNET word w. The set of all positive examples for w is {f(w′) | w′ ∈ FTDom, w′ ⊆ w}, while the set of negative examples is {f(w′) | w′ ∈ FTDom, w′ ⊈ w}. We train and test for all the predicates with at least 10 positive examples. We use 90% of the set of positive examples for training (reserving 10% for testing), and we use the same number of negative examples.
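For concreteness, this per-word training step could be implemented along the following lines. This is only a minimal sketch (for instance using scikit-learn), not the implementation behind the reported numbers; the helpers vectors (the pre-trained fastText table, word to 300-dimensional array) and hyponyms (returning the reflexive-transitive hyponyms of a WORDNET word) are hypothetical stand-ins.

# Sketch of the per-word linear classifier P(w, x) = δ(θ(w) · x + b(w) > 0),
# trained by logistic regression as described above.
import random
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_predicate(w, vectors, hyponyms, min_positives=10, train_fraction=0.9):
    hypo = set(hyponyms(w))
    positives = [u for u in hypo if u in vectors]
    if len(positives) < min_positives:
        return None  # we only train predicates with at least 10 positive examples
    negatives = [u for u in vectors if u not in hypo]

    random.shuffle(positives)
    n_train = int(train_fraction * len(positives))
    train_pos = positives[:n_train]                       # 90% of the positives
    train_neg = random.sample(negatives, len(train_pos))  # as many negatives

    X = np.array([vectors[u] for u in train_pos + train_neg])
    y = np.array([1] * len(train_pos) + [0] * len(train_neg))
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # θ(w) = clf.coef_[0] and b(w) = clf.intercept_[0];
    # P(w, x) is then 1 iff θ(w) · x + b(w) > 0, i.e. clf.predict([x])[0] == 1.
    return clf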

We then test Property 5 on the 10% of positive examples reserved for testing, for each word. On average, we find that 89.4% of positives are identified correctly (std. dev. 14.6 points). On 1000 randomly selected negative examples, we find that on average 89.7% are correctly classified (std. dev. 5.9 points). The result for positives may look high, but because the number of true negative cases is typically much higher than that of true positives (often by a factor of 100), this means that the recall and precision are in fact very low for this task. That is, the classifier can often identify correctly a random situation, but this is a relatively easy task. Consider for example the predicate for "bird". If we test random negative entities ("democracy", "paper", "hour", etc.), then we may get more than 97% accuracy. However, if we pick our samples in a direct subclass, such as (non-bird) animals, we typically get only 75% accuracy. That is to say, 25% of animals are incorrectly classified as birds. To get a better intuition for this result, we show a Principal Component Analysis (PCA) of animals, separating birds from non-birds (Figure 1). It shows a mixing of the two classes. This mixture can be explained by the presence of many uncommon words in the database (e.g. types of birds that are only known to ornithologists). One might argue that we should not take such words into account. But this would severely limit the number of examples: there would be few classes where logistic regression would make sense. We are not ready to admit defeat yet, as we are ultimately not interested in Property 5, but rather in Properties 1 and 2, which we address in the next section.

Figure 1: PCA representation of animals. Birds are highlighted in orange.

4 Inclusion of subsets

A strict interpretation of Property 3 would dictate to check whether the subsets defined in the previous section are included in each other or not. However, there are several problems with this approach. To begin, the hyperplanes defined by θ and b will (stochastically) always intersect; therefore one must take into account the actual density of the fastText embeddings. One possible approximation would be that they lie within a ball of a certain radius around the origin. However, this assumption is incorrect: modeling the density is a hard problem in itself. In fact, the density of word vectors is so low (due to the high dimensionality of the space) that the question may not make sense. Therefore, we refrain from making any conclusion on the inclusion relation of the subsets, and fall back to a more experimental approach.

Thus, we will test the suitability of the learned P(w) by testing whether the elements of its subclasses are contained in the superclass. That is, we define the following quantity

    Q(w′, w) = average{P(w, f(x)) | x ∈ FTDom, P(w′, f(x)) = 1}

which is the proportion of elements of w′ that are found in w. This value corresponds to the relaxation parameter ρ in Property 4. If w′ ⊆ w holds, then we want Q(w′, w) to be close to 1, and close to 0 if w′ is disjoint from w. We plot (Figure 2) the distribution of Q(w′, w) for all pairs w′ ⊆ w, and for a random selection of pairs such that w′ ⊈ w. The negative pairs are generated by taking all pairs (w′, w) such that w′ ⊆ w, and generating two pairs (w1, w) and (w′, w2), by picking w1 and w2 at random, such that neither of the generated pairs is in the HYPONYMY relation. We see that most of the density is concentrated at the extrema. Thus, the exact choice of ρ has little influence on the accuracy of the model. For ρ = 0.5, the recall is 88.8%. The proportion of negative test cases correctly identified is 85.7%.
However, we have a very large number of negative cases (the square of the number of classes, about 7 billion). Because of this, we get about 1 billion false positives, and the precision is only 0.07%. Regardless, the results are comparable with state-of-the-art models (section 6).

Figure 2: Results of inclusion tests. On the left-hand side, we show the distribution of correctly identified inclusion relations as a function of ρ. On the right-hand side, we show the distribution of (incorrectly) identified inclusion relations as a function of ρ.
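As an illustration, the quantity Q and the relaxed decision at a given ρ could be computed as in the sketch below, which builds on the hypothetical per-word classifiers of the previous sketch (clfs maps each WORDNET word to its trained classifier, and vectors is the fastText table); it is not the code used for the reported figures.

# Sketch of the relaxed inclusion test of this section.
import numpy as np

def classify(clf, x):
    # P(w, x): 1 if θ(w) · x + b(w) > 0, else 0
    return int(clf.predict(np.asarray(x).reshape(1, -1))[0])

def Q(w_sub, w_sup, clfs, vectors):
    # Proportion of the words classified as w_sub that are also classified as w_sup.
    members = [x for x in vectors.values() if classify(clfs[w_sub], x) == 1]
    if not members:
        return 0.0
    return sum(classify(clfs[w_sup], x) for x in members) / len(members)

def relaxed_included(w_sub, w_sup, clfs, vectors, rho=0.5):
    # Decide w_sub ⊆_P w_sup with relaxation factor rho (Property 4).
    return Q(w_sub, w_sup, clfs, vectors) >= rho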


Figure 3: Two trees underlying the same DAG. Nodes are labeled with their depth-first index on the left and their associated interval on the right. Removed edges are drawn as a dotted line.

5 WORDNET predicates as disjunction of intervals

In this section we propose a baseline, fully supervised model for HYPONYMY.

The key observation is that most of the HYPONYMY relation fits in a tree. Indeed, out of 82115 nouns, 7726 have no hypernym, 72967 have a single hypernym, and 1422 have two hypernyms or more. In fact, by removing only 1461 direct edges, we obtain a tree. The number of edges removed in the transitive closure of the relation varies, depending on which exact edges are removed, but a typical number is 10% of the edges. In other words, when removing edges in such a way, one lowers the recall to about 90%, but the precision remains 100%. Indeed, no pair is added to the HYPONYMY relation. This tree can then be mapped to one-dimensional intervals, by assigning a position to each of the nodes, according to their index in depth-first order (ix_T(w) below). Then, each node is assigned an interval corresponding to the minimum and the maximum position assigned to its leaves. A possible directed acyclic graph (DAG) and a corresponding assignment of intervals is shown in Fig. 3. The corresponding definition of predicates is the following:

    P(w, x) = x ≥ lo(w) ∧ x ≤ hi(w)
    lo(w) = min{ix_T(w′) | w′ ⊆_T w}
    hi(w) = max{ix_T(w′) | w′ ⊆_T w}

where (⊆_T) is the reflexive-transitive closure of the tree relation T (included in HYPONYMY). The pair of numbers (lo(w), hi(w)) fully characterizes P(w). In other words, the above model is fully sound (precision = 1), and has a recall of about 0.9. Additionally, Property 3 is verified.

Because it is fully sound, a model like the above can always be combined with another model to improve its recall with no impact on precision — including itself. Such a self-combination is useful if one makes another choice of removed edges. Thus, each word is characterized by an n-dimensional co-product (disjoint sum) of intervals:

    w′ ⊆_M w = ⋁_i ( lo_i(w′) ≥ lo_i(w) ∧ hi_i(w′) ≤ hi_i(w) )
    lo_i(w) = min{ix_{T_i}(w′) | w′ ⊆_{T_i} w}
    hi_i(w) = max{ix_{T_i}(w′) | w′ ⊆_{T_i} w}

By increasing n, one can increase the recall to obtain a near-perfect model. Table 4b shows typical recall results for various values of n. However, Property 3 is not verified: the co-product of intervals does not form subspaces in any measurable set.
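To make the construction concrete, the following sketch implements the single-interval case (n = 1): nodes are numbered in depth-first order, every word receives the interval spanning its subtree, and inclusion is decided by interval containment. The inputs children and root are hypothetical (the tree T obtained after removing the extra hypernym edges); the n-dimensional variant simply repeats the construction for several choices of removed edges and takes the disjunction of the containment tests, as shown by the last function.

# Sketch of the interval model of this section (single tree, n = 1).
# `children` maps each node of the tree T to its direct hyponyms; `root` is
# the node that dominates all others.
def assign_intervals(root, children):
    intervals = {}
    counter = [0]

    def visit(w):
        counter[0] += 1
        ix = counter[0]                      # depth-first index ix_T(w)
        lo = hi = ix
        for c in children.get(w, ()):
            c_lo, c_hi = visit(c)
            lo, hi = min(lo, c_lo), max(hi, c_hi)
        intervals[w] = (lo, hi)              # min/max index over {w' | w' ⊆_T w}
        return intervals[w]

    visit(root)
    return intervals

def included(w_sub, w_sup, intervals):
    # w_sub ⊆_M w_sup iff the interval of w_sub is contained in that of w_sup.
    lo1, hi1 = intervals[w_sub]
    lo2, hi2 = intervals[w_sup]
    return lo2 <= lo1 and hi1 <= hi2

def included_any(w_sub, w_sup, interval_list):
    # n-dimensional variant: disjunction over n interval assignments.
    return any(included(w_sub, w_sup, ivs) for ivs in interval_list)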
    Authors                                    Result
    (Vendrov et al., 2015)                     90.6
    (Athiwaratkun and Wilson, 2018)            92.3
    (Vilnis et al., 2018)                      92.3
    us, fastText with LR and ρ = 0.5           87.2
    us, single interval (tree-model)           94.5
    us, interval disjunctions, n = 5           99.6

(a) Authors, systems and respective results on the task of detection of HYPONYMY in WORDNET.

    n    recall
    1    0.91766
    2    0.96863
    5    0.99288
    10   0.99973

(b) Typical recalls for the multi-dimensional interval model. (Precision is always 1.)

Figure 4: Tables

6 Related Work: Precision and recall for hyponymy models

Many authors have considered modeling hyponymy. However, in many cases, this task was not the main point of their work, and we feel that the evaluation of the task has often been partially lacking. Here, we review several of those and attempt to shed a new light on existing results, based on the properties presented in section 2.

Several authors (Athiwaratkun and Wilson, 2018; Vendrov et al., 2015; Vilnis et al., 2018) have proposed feature-vector embeddings of WORDNET. Among them, several have tested their embedding on the following task: they feed their model with the transitive closure of HYPONYMY, but withhold 4000 edges. They then test how many of those edges can be recovered by their model. They also test how many of 4000 random negative edges are correctly classified. They report the average of those numbers. We reproduce here their results for this task in Table 4a. As we see it, there are two issues with this task. First, it mainly accounts for recall, mostly ignoring precision. As we have explained in section 4, this can be a significant problem for WORDNET, which is sparse. Second, because WORDNET is the only input, it is questionable whether any edge should be withheld at all (beyond those in the transitive closure of the generating edges). We believe that, in this case, the gold standard to achieve is precisely the transitive closure. Indeed, because the graph presentation of WORDNET is nearly a tree, most of the time the effect of removing an edge will be to detach a subtree. But, without any other source of information, this subtree could in principle be re-attached to any node and still be a reasonable ontology, from a purely formal perspective. Thus we did not withhold any edge when training our second model on this task (the first one uses no edge at all). In turn, the numbers reported in Table 4a should not be taken too strictly.
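For reference, the metric reported in Table 4a boils down to the average of the accuracies on the withheld positive edges and on the random negative pairs, roughly as in the sketch below. The names withheld_pairs, closure_pairs, all_words and predict are hypothetical stand-ins for the 4000 withheld edges, the transitive closure of HYPONYMY, the vocabulary (as a list), and a trained model's decision function predict(u, v) ("u ⊆ v").

# Sketch of the evaluation protocol described above. The withheld edges are
# assumed to have been excluded from training already.
import random

def evaluate(withheld_pairs, closure_pairs, all_words, predict, n_negatives=4000):
    negatives = []
    while len(negatives) < n_negatives:
        pair = (random.choice(all_words), random.choice(all_words))
        if pair not in closure_pairs:
            negatives.append(pair)
    pos_acc = sum(predict(u, v) for u, v in withheld_pairs) / len(withheld_pairs)
    neg_acc = sum(not predict(u, v) for u, v in negatives) / len(negatives)
    return 100 * (pos_acc + neg_acc) / 2    # average accuracy, in %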
7 Future Work and Conclusion

We found that defining the problem of representing HYPONYMY in a feature vector is not easy. Difficulties include 1. the sparseness of data, 2. whether one wants to base inclusion on an underlying (possibly relaxed) inclusion in the space of vectors, and 3. determining what one should generalize.

Our investigation of WORDNET over fastText demonstrates that WORDNET classes are not cleanly linearly separated in fastText, but they are sufficiently well separated to give a useful recall for an approximate inclusion property. Despite this, and because the negative cases vastly outnumber the positive cases, the rate of false positives is still too high to give any reasonable precision. One could try to use more complex models, but the sparsity of the data would make such models extremely sensitive to overfitting.

Our second model takes a wholly different approach: we construct intervals directly from the HYPONYMY relation. The main advantage of this method is its simplicity and high accuracy. Even with a single dimension it rivals other models. A possible disadvantage is that the multi-dimensional version of this model requires disjunctions to be performed. Such operations are not necessarily available in models which need to make use of the HYPONYMY relation. At this stage, we make no attempt to match the size of intervals to the probability of a word. We aim to address this issue in future work.

Finally, one could see our study as a criticism of using WORDNET as a natural representative of HYPONYMY: because WORDNET is almost structured like a tree, one can suspect that it in fact misses many hyponymy relations. This would also explain why our simple fastText-based model predicts more relations than are present in WORDNET. One could think of using other resources, such as JEUXDEMOTS (Lafourcade and Joubert, 2008). Yet our preliminary investigations suggest that these suffer from similar flaws — we leave a complete analysis to further work.

References

[Athiwaratkun and Wilson2018] Ben Athiwaratkun and Andrew Gordon Wilson. 2018. On modeling hierarchical data via probabilistic order embeddings. In International Conference on Learning Representations.

[Fellbaum1998] Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. Language, Speech, and Communication. MIT Press, Cambridge, MA.

[Lafourcade and Joubert2008] Mathieu Lafourcade and Alain Joubert. 2008. JeuxDeMots : un prototype ludique pour l'émergence de relations entre termes. In JADT'08 : Journées internationales d'Analyse statistiques des Données Textuelles, pages 657–666, France.

[Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3111–3119.

[Mikolov et al.2018] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

[Nickel and Kiela2017] Maximillian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6338–6347. Curran Associates, Inc.

[Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

[Vendrov et al.2015] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2015. Order-embeddings of images and language. CoRR, abs/1511.06361.

[Vilnis and McCallum2014] Luke Vilnis and Andrew McCallum. 2014. Word representations via Gaussian embedding. CoRR, abs/1412.6623.

[Vilnis et al.2018] Luke Vilnis, Xiang Li, Shikhar Murty, and Andrew McCallum. 2018. Probabilistic embedding of knowledge graphs with box lattice measures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 263–272. Association for Computational Linguistics.