Two experiments for embedding Wordnet hierarchy into vector spaces∗

Jean-Philippe Bernardy and Aleksandre Maskharashvili
Gothenburg University, Department of Philosophy, Linguistics and Theory of Science, Centre for Linguistics and Studies in Probability

jean-philippe.bernardy,[email protected]

∗ Supported by the Swedish Research Council, Grant number 2014-39.

Abstract

In this paper, we investigate the mapping of the WORDNET hyponymy relation to feature vectors. Our aim is to model lexical knowledge in such a way that it can be used as input in generic machine-learning models, such as phrase entailment predictors. We propose two models. The first one leverages an existing mapping of words to feature vectors (fastText), and attempts to classify such vectors as within or outside of each class. The second model is fully supervised, using solely WORDNET as a ground truth. It maps each concept to an interval or a disjunction thereof. The first model approaches but does not quite attain state-of-the-art performance. The second model can achieve near-perfect accuracy.

1 Introduction

Distributional encodings of word meanings from large corpora (Mikolov et al., 2013; Mikolov et al., 2018; Pennington et al., 2014) have been found to be useful for a number of NLP tasks.

While the major goal of distributional approaches is to identify distributional patterns of words and word sequences, they have even found use in tasks that require modeling more fine-grained relations between words than co-occurrence in word sequences. But distributional word embeddings are not easy to map onto ontological relations or vice-versa. We consider in this paper the hyponymy relation, also called the is-a relation, which is one of the most fundamental ontological relations. We take as the source of truth for hyponymy WORDNET (Fellbaum, 1998), which has been designed to include various kinds of lexical relations between words, phrases, etc. However, WORDNET has a fundamentally symbolic representation, which cannot be readily used as input to neural NLP models.

Several authors have proposed to encode hyponymy relations in feature vectors (Vilnis and McCallum, 2014; Vendrov et al., 2015; Athiwaratkun and Wilson, 2018; Nickel and Kiela, 2017). However, there does not seem to be a common consensus on the underlying properties of such encodings. In this paper, we aim to fill this gap and clearly characterize the properties that such an embedding should have. We additionally propose two baseline models approaching these properties: a simple mapping of FASTTEXT embeddings to the WORDNET hyponymy relation, and a (fully supervised) encoding of this relation in feature vectors.

2 Goals

We want to model the hyponymy relation (ground truth) given by WORDNET — hereafter referred to as HYPONYMY. In this section we make this goal precise and formal. Hyponymy can in general relate common noun phrases, verb phrases or any predicative phrase, but hereafter we abstract from all this and simply write "word" for this underlying set. In this paper, we write (⊆) for the reflexive transitive closure of the hyponymy relation (ground truth), and (⊆_M) for the relation predicted by a model M.¹ Ideally, we want the model to be sound and complete with respect to the ground truth. However, a machine-learned model will typically only approach those properties to a certain level, so the usual relaxations are made:

Property 1 (Partial soundness) A model M is partially sound with precision α iff., for a proportion α of the pairs of words w, w′ such that w ⊆_M w′ holds, w ⊆ w′ holds as well.

Property 2 (Partial completeness) A model M is partially complete with recall α iff., for a proportion α of the pairs of words w, w′ such that w ⊆ w′ holds, then w ⊆_M w′ holds as well.

¹ We note right away that, on its own, the popular metric of cosine similarity (or indeed any metric) is incapable of modeling HYPONYMY, because it is an asymmetric relation. That is to say, we may know that the embedding of "animal" is close to that of "bird", but from that property we have no idea if we should conclude that "a bird is an animal" or rather that "an animal is a bird".

These properties do not constrain the way the relation (⊆_M) is generated from a feature space. However, a satisfying way to generate the inclusion relation is by associating a subset of the vector space to each predicate, and leveraging the inclusion from the feature space. Concretely, the mapping of words to subsets is done by a function P such that, given a word w and a feature vector x, P(w, x) indicates if the word w applies to a situation (state of the world, sentence meaning, sensory input, etc.) described by the feature vector x. We will refer to P as a classifier. The inclusion model is then fully characterized by P, so we can denote it as such (⊆_P).

Property 3 (Space-inclusion compatibility) There exists P : (Word × R^d) → [0, 1] such that

    (w ⊆_P w′) ⇐⇒ (∀x. P(w, x) ≤ P(w′, x))

Any model given by such a P yields a relation (⊆_P) which is necessarily reflexive and transitive (because subset inclusion is such) — the model does not have to learn this. Again, the above property will apply only to ideal situations: it needs to be relaxed in some machine-learning contexts. To this effect, we can define the measure of the subset of situations which satisfies a predicate p : R^d → [0, 1] as follows:

    measure(p) = ∫_{R^d} p(x) dx

(Note that this is well-defined only if p is a measurable function over the measurable space of feature vectors.) We leave implicit the density of the vector space in this definition. Following this definition, a predicate p is included in a predicate q iff.

    measure(p ∧ q) / measure(p) = (∫_{R^d} p(x)q(x) dx) / (∫_{R^d} p(x) dx) = 1

Following this thread, we can define a relaxed inclusion relation, corresponding to a proportion ρ of p being included in q:

Property 4 (Relaxed space-inclusion compatibility) There exists P : Word → R^d → [0, 1] and ρ ∈ [0, 1] such that

    (w ⊆_P w′) ⇐⇒ (∫_{R^d} P(w′, x)P(w, x) dx) / (∫_{R^d} P(w, x) dx) ≥ ρ

In the following, we call ρ the relaxation factor.

3 Mapping WORDNET over fastText

Our first model of HYPONYMY works by leveraging a general-purpose, unsupervised method of generating word vectors. We use fastText (Mikolov et al., 2018) as a modern representative of word-vector embeddings. Precisely, we use the pre-trained word embeddings available on the fastText webpage, trained on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset (16B tokens). We call FTDom the set of words in these pre-trained embeddings.

A stepping stone towards modeling the inclusion relation correctly is modeling correctly each predicate individually. That is, we want to learn a separation between the fastText embeddings of words that belong to a given class (according to WORDNET) and the words that do not. We let each word w in fastText represent a situation corresponding to its embedding f(w). Formally, we aim to find P such that

Property 5 P(w, f(w′)) = 1 ⇐⇒ w′ ⊆ w

for every word w and w′ found both in WORDNET and in the pre-trained embeddings. If the above property is always satisfied, the model is sound and complete, and satisfies Property 3.

Because many classes have few representative elements relative to the number of dimensions of the fastText embeddings, we limit ourselves to a linear model for P, to limit the possibility of overfitting. That is, for any word w, P(w) is entirely determined by a bias b(w) and a vector θ(w) (with 300 dimensions):

    P(w, x) = δ(θ(w) · x + b(w) > 0)

where δ(true) = 1 and δ(false) = 0.

We learn θ(w) and b(w) by using logistic regression, independently for each WORDNET word w. The set of all positive examples for w is {f(w′) | w′ ∈ FTDom, w′ ⊆ w}, while the set of negative examples is {f(w′) | w′ ∈ FTDom, w′ ⊈ w}. We train and test for all the predicates with at least 10 positive examples. We use 90% of the set of positive examples for training (reserving 10% for testing), and we use the same number of negative examples.
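For concreteness, this per-word training step could be implemented along the following lines. This is only a minimal sketch (for instance using scikit-learn), not the implementation behind the reported numbers; the helpers vectors (the pre-trained fastText table, word to 300-dimensional array) and hyponyms (returning the reflexive-transitive hyponyms of a WORDNET word) are hypothetical stand-ins.

# Sketch of the per-word linear classifier P(w, x) = δ(θ(w) · x + b(w) > 0),
# trained by logistic regression as described above.
import random
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_predicate(w, vectors, hyponyms, min_positives=10, train_fraction=0.9):
    hypo = set(hyponyms(w))
    positives = [u for u in hypo if u in vectors]
    if len(positives) < min_positives:
        return None  # we only train predicates with at least 10 positive examples
    negatives = [u for u in vectors if u not in hypo]

    random.shuffle(positives)
    n_train = int(train_fraction * len(positives))
    train_pos = positives[:n_train]                       # 90% of the positives
    train_neg = random.sample(negatives, len(train_pos))  # as many negatives

    X = np.array([vectors[u] for u in train_pos + train_neg])
    y = np.array([1] * len(train_pos) + [0] * len(train_neg))
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # θ(w) = clf.coef_[0] and b(w) = clf.intercept_[0];
    # P(w, x) is then 1 iff θ(w) · x + b(w) > 0, i.e. clf.predict([x])[0] == 1.
    return clf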

We then test Property 5 on the 10% of positive examples reserved for testing, for each word. On average, we find that 89.4% of positives are identified correctly (std. dev. 14.6 points). On 1000 randomly selected negative examples, we find that on average 89.7% are correctly classified (std. dev. 5.9 points). The result for positives may look high, but because the number of true negative cases is typically much higher than that of true positives (often by a factor of 100), this means that the recall and precision are in fact very low for this task. That is, the classifier can often identify correctly a random situation, but this is a relatively easy task. Consider for example the predicate for "bird". If we test random negative entities ("democracy", "paper", "hour", etc.), then we may get more than 97% accuracy. However, if we pick our samples in a direct subclass, such as (non-bird) animals, we typically get only 75% accuracy. That is to say, 25% of animals are incorrectly classified as birds. To get a better intuition for this result, we show a Principal Component Analysis (PCA) of animals, separating birds from non-birds (Figure 1). It shows a mixing of the two classes. This mixture can be explained by the presence of many uncommon words in the database (e.g. types of birds that are only known to ornithologists). One might argue that we should not take such words into account. But this would severely limit the number of examples: there would be few classes where logistic regression would make sense. We are not ready to admit defeat yet, as we are ultimately not interested in Property 5, but rather in Properties 1 and 2, which we address in the next section.

Figure 1: PCA representation of animals. Birds are highlighted in orange.

4 Inclusion of subsets

A strict interpretation of Property 3 would dictate to check whether the subsets defined in the previous section are included in each other or not. However, there are several problems with this approach. To begin, the hyperplanes defined by θ and b will (stochastically) always intersect; therefore one must take into account the actual density of the fastText embeddings. One possible approximation would be that they lie within a ball of a certain radius around the origin. However, this assumption is incorrect: modeling the density is a hard problem in itself. In fact, the density of word vectors is so low (due to the high dimensionality of the space) that the question may not make sense. Therefore, we refrain from making any conclusion on the inclusion relation of the subsets, and fall back to a more experimental approach.

Thus, we will test the suitability of the learned P(w) by testing whether the elements of its subclasses are contained in the superclass. That is, we define the following quantity

    Q(w′, w) = average{P(w, f(x)) | x ∈ FTDom, P(w′, f(x)) = 1}

which is the proportion of elements of w′ that are found in w. This value corresponds to the relaxation parameter ρ in Property 4. If w′ ⊆ w holds, then we want Q(w′, w) to be close to 1, and close to 0 if w′ is disjoint from w. We plot (Figure 2) the distribution of Q(w′, w) for all pairs w′ ⊆ w, and for a random selection of pairs such that w′ ⊈ w. The negative pairs are generated by taking all pairs (w′, w) such that w′ ⊆ w, and generating two pairs (w1, w) and (w′, w2), by picking w1 and w2 at random, such that neither of the generated pairs is in the HYPONYMY relation. We see that most of the density is concentrated at the extrema. Thus, the exact choice of ρ has little influence on the accuracy of the model. For ρ = 0.5, the recall is 88.8%. The proportion of negative test cases correctly identified is 85.7%.
However, we have a very large number of negative cases (the square of the number of classes, about 7 billion). Because of this, we get about 1 billion false positives, and the precision is only 0.07%. Regardless, the results are comparable with state-of-the-art models (section 6).

Figure 2: Results of inclusion tests. On the left-hand side, we show the distribution of correctly identified inclusion relations as a function of ρ. On the right-hand side, we show the distribution of (incorrectly) identified inclusion relations as a function of ρ.
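As an illustration, the quantity Q and the relaxed decision at a given ρ could be computed as in the sketch below, which builds on the hypothetical per-word classifiers of the previous sketch (clfs maps each WORDNET word to its trained classifier, and vectors is the fastText table); it is not the code used for the reported figures.

# Sketch of the relaxed inclusion test of this section.
import numpy as np

def classify(clf, x):
    # P(w, x): 1 if θ(w) · x + b(w) > 0, else 0
    return int(clf.predict(np.asarray(x).reshape(1, -1))[0])

def Q(w_sub, w_sup, clfs, vectors):
    # Proportion of the words classified as w_sub that are also classified as w_sup.
    members = [x for x in vectors.values() if classify(clfs[w_sub], x) == 1]
    if not members:
        return 0.0
    return sum(classify(clfs[w_sup], x) for x in members) / len(members)

def relaxed_included(w_sub, w_sup, clfs, vectors, rho=0.5):
    # Decide w_sub ⊆_P w_sup with relaxation factor rho (Property 4).
    return Q(w_sub, w_sup, clfs, vectors) >= rho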


Figure 3: Two trees underlying the same DAG. Nodes are labeled with their depth-first index on the left and their associated interval on the right. Removed edges are drawn as a dotted line.

5 WORDNET predicates as disjunction of intervals

In this section we propose a baseline, fully supervised model for HYPONYMY.

The key observation is that most of the HYPONYMY relation fits in a tree. Indeed, out of 82115 nouns, 7726 have no hypernym, 72967 have a single hypernym, and 1422 have two hypernyms or more. In fact, by removing only 1461 direct edges, we obtain a tree. The number of edges removed in the transitive closure of the relation varies, depending on which exact edges are removed, but a typical number is 10% of the edges. In other words, when removing edges in such a way, one lowers the recall to about 90%, but the precision remains 100%. Indeed, no pair is added to the HYPONYMY relation. This tree can then be mapped to one-dimensional intervals, by assigning a position to each of the nodes, according to their index in depth-first order (ix_T(w) below). Then, each node is assigned an interval corresponding to the minimum and the maximum position assigned to its leaves. A possible directed acyclic graph (DAG) and a corresponding assignment of intervals is shown in Fig. 3. The corresponding definition of predicates is the following:

    P(w, x) = x ≥ lo(w) ∧ x ≤ hi(w)
    lo(w) = min{ix_T(w′) | w′ ⊆_T w}
    hi(w) = max{ix_T(w′) | w′ ⊆_T w}

where (⊆_T) is the reflexive-transitive closure of the tree relation T (included in HYPONYMY). The pair of numbers (lo(w), hi(w)) fully characterizes P(w). In other words, the above model is fully sound (precision = 1), and has a recall of about 0.9. Additionally, Property 3 is verified.

Because it is fully sound, a model like the above can always be combined with another model to improve its recall with no impact on precision — including itself. Such a self-combination is useful if one makes another choice of removed edges. Thus, each word is characterized by an n-dimensional co-product (disjoint sum) of intervals:

    w′ ⊆_M w = ⋁_i ( lo_i(w′) ≥ lo_i(w) ∧ hi_i(w′) ≤ hi_i(w) )
    lo_i(w) = min{ix_{T_i}(w′) | w′ ⊆_{T_i} w}
    hi_i(w) = max{ix_{T_i}(w′) | w′ ⊆_{T_i} w}

By increasing n, one can increase the recall to obtain a near-perfect model. Table 4b shows typical recall results for various values of n. However, Property 3 is not verified: the co-product of intervals does not form subspaces in any measurable set.
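To make the construction concrete, the following sketch implements the single-interval case (n = 1): nodes are numbered in depth-first order, every word receives the interval spanning its subtree, and inclusion is decided by interval containment. The inputs children and root are hypothetical (the tree T obtained after removing the extra hypernym edges); the n-dimensional variant simply repeats the construction for several choices of removed edges and takes the disjunction of the containment tests, as shown by the last function.

# Sketch of the interval model of this section (single tree, n = 1).
# `children` maps each node of the tree T to its direct hyponyms; `root` is
# the node that dominates all others.
def assign_intervals(root, children):
    intervals = {}
    counter = [0]

    def visit(w):
        counter[0] += 1
        ix = counter[0]                      # depth-first index ix_T(w)
        lo = hi = ix
        for c in children.get(w, ()):
            c_lo, c_hi = visit(c)
            lo, hi = min(lo, c_lo), max(hi, c_hi)
        intervals[w] = (lo, hi)              # min/max index over {w' | w' ⊆_T w}
        return intervals[w]

    visit(root)
    return intervals

def included(w_sub, w_sup, intervals):
    # w_sub ⊆_M w_sup iff the interval of w_sub is contained in that of w_sup.
    lo1, hi1 = intervals[w_sub]
    lo2, hi2 = intervals[w_sup]
    return lo2 <= lo1 and hi1 <= hi2

def included_any(w_sub, w_sup, interval_list):
    # n-dimensional variant: disjunction over n interval assignments.
    return any(included(w_sub, w_sup, ivs) for ivs in interval_list)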
    Authors                                    Result
    (Vendrov et al., 2015)                     90.6
    (Athiwaratkun and Wilson, 2018)            92.3
    (Vilnis et al., 2018)                      92.3
    us, fastText with LR and ρ = 0.5           87.2
    us, single interval (tree-model)           94.5
    us, interval disjunctions, n = 5           99.6

(a) Authors, systems and respective results on the task of detection of HYPONYMY in WORDNET.

    n    recall
    1    0.91766
    2    0.96863
    5    0.99288
    10   0.99973

(b) Typical recalls for the multi-dimensional interval model. (Precision is always 1.)

Figure 4: Tables

6 Related Work: Precision and recall for hyponymy models

Many authors have considered modeling hyponymy. However, in many cases, this task was not the main point of their work, and we feel that the evaluation of the task has often been partially lacking. Here, we review several of those and attempt to shed a new light on existing results, based on the properties presented in section 2.

Several authors (Athiwaratkun and Wilson, 2018; Vendrov et al., 2015; Vilnis et al., 2018) have proposed feature-vector embeddings of WORDNET. Among them, several have tested their embedding on the following task: they feed their model with the transitive closure of HYPONYMY, but withhold 4000 edges. They then test how many of those edges can be recovered by their model. They also test how many of 4000 random negative edges are correctly classified. They report the average of those numbers. We reproduce here their results for this task in Table 4a. As we see it, there are two issues with this task. First, it mainly accounts for recall, mostly ignoring precision. As we have explained in section 4, this can be a significant problem for WORDNET, which is sparse. Second, because WORDNET is the only input, it is questionable whether any edge should be withheld at all (beyond those in the transitive closure of the generating edges). We believe that, in this case, the gold standard to achieve is precisely the transitive closure. Indeed, because the graph presentation of WORDNET is nearly a tree, most of the time the effect of removing an edge will be to detach a subtree. But, without any other source of information, this subtree could in principle be re-attached to any node and still be a reasonable ontology, from a purely formal perspective. Thus we did not withhold any edge when training our second model on this task (the first one uses no edge at all). In turn, the numbers reported in Table 4a should not be taken too strictly.
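For reference, the metric reported in Table 4a boils down to the average of the accuracies on the withheld positive edges and on the random negative pairs, roughly as in the sketch below. The names withheld_pairs, closure_pairs, all_words and predict are hypothetical stand-ins for the 4000 withheld edges, the transitive closure of HYPONYMY, the vocabulary (as a list), and a trained model's decision function predict(u, v) ("u ⊆ v").

# Sketch of the evaluation protocol described above. The withheld edges are
# assumed to have been excluded from training already.
import random

def evaluate(withheld_pairs, closure_pairs, all_words, predict, n_negatives=4000):
    negatives = []
    while len(negatives) < n_negatives:
        pair = (random.choice(all_words), random.choice(all_words))
        if pair not in closure_pairs:
            negatives.append(pair)
    pos_acc = sum(predict(u, v) for u, v in withheld_pairs) / len(withheld_pairs)
    neg_acc = sum(not predict(u, v) for u, v in negatives) / len(negatives)
    return 100 * (pos_acc + neg_acc) / 2    # average accuracy, in %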
7 Future Work and Conclusion

We found that defining the problem of representing HYPONYMY in a feature vector is not easy. Difficulties include 1. the sparseness of data, 2. whether one wants to base inclusion on an underlying (possibly relaxed) inclusion in the space of vectors, and 3. determining what one should generalize.

Our investigation of WORDNET over fastText demonstrates that WORDNET classes are not cleanly linearly separated in fastText, but they are sufficiently well separated to give a useful recall for an approximate inclusion property. Despite this, and because the negative cases vastly outnumber the positive cases, the rate of false positives is still too high to give any reasonable precision. One could try to use more complex models, but the sparsity of the data would make such models extremely sensitive to overfitting.

Our second model takes a wholly different approach: we construct intervals directly from the HYPONYMY relation. The main advantage of this method is its simplicity and high accuracy. Even with a single dimension it rivals other models. A possible disadvantage is that the multi-dimensional version of this model requires disjunctions to be performed. Such operations are not necessarily available in models which need to make use of the HYPONYMY relation. At this stage, we make no attempt to match the size of intervals to the probability of a word. We aim to address this issue in future work.

Finally, one could see our study as a criticism of using WORDNET as a natural representative of HYPONYMY: because WORDNET is almost structured like a tree, one can suspect that it in fact misses many hyponymy relations. This would also explain why our simple fastText-based model predicts more relations than are present in WORDNET. One could think of using other resources, such as JEUXDEMOTS (Lafourcade and Joubert, 2008). Yet our preliminary investigations suggest that these suffer from similar flaws — we leave a complete analysis to further work.

References

[Athiwaratkun and Wilson2018] Ben Athiwaratkun and Andrew Gordon Wilson. 2018. On modeling hierarchical data via probabilistic order embeddings. In International Conference on Learning Representations.

[Fellbaum1998] Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. Language, Speech, and Communication. MIT Press, Cambridge, MA.

[Lafourcade and Joubert2008] Mathieu Lafourcade and Alain Joubert. 2008. JeuxDeMots : un prototype ludique pour l'émergence de relations entre termes. In JADT'08 : Journées internationales d'Analyse statistiques des Données Textuelles, pages 657–666, France.

[Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3111–3119.

[Mikolov et al.2018] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

[Nickel and Kiela2017] Maximillian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6338–6347. Curran Associates, Inc.

[Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

[Vendrov et al.2015] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2015. Order-embeddings of images and language. CoRR, abs/1511.06361.

[Vilnis and McCallum2014] Luke Vilnis and Andrew McCallum. 2014. Word representations via Gaussian embedding. CoRR, abs/1412.6623.

[Vilnis et al.2018] Luke Vilnis, Xiang Li, Shikhar Murty, and Andrew McCallum. 2018. Probabilistic embedding of knowledge graphs with box lattice measures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 263–272. Association for Computational Linguistics.