
Embedding a Semantic Network in a Word Space

Richard Johansson and Luis Nieto Piña
Språkbanken, Department of Swedish, University of Gothenburg
Box 200, SE-40530 Gothenburg, Sweden
{richard.johansson, luis.nieto.pina}@svenska.gu.se

Abstract

We present a framework for using continuous-space vector representations of word meaning to derive new vectors representing the meaning of senses listed in a semantic network. It is a post-processing approach that can be applied to several types of word vector representations. It uses two ideas: first, that vectors for polysemous words can be decomposed into a convex combination of sense vectors; secondly, that the vector for a sense is kept similar to those of its neighbors in the network. This leads to a constrained optimization problem, and we present an approximation for the case when the distance function is the squared Euclidean. We applied this algorithm on a Swedish semantic network, and we evaluate the quality of the resulting sense representations extrinsically by showing that they give large improvements when used in a classifier that creates lexical units for FrameNet frames.

1 Introduction

Representing word meaning computationally is central in natural language processing. Manual, knowledge-based approaches to meaning representation map word strings to symbolic concepts, which can be described using any knowledge representation framework; using the relations between concepts defined in the ontology, we can infer implicit facts from the information stated in a text: a mouse is a rodent, so it has prominent teeth.

Conversely, data-driven meaning representation approaches rely on cooccurrence patterns to derive a vector representation (Turney and Pantel, 2010). There are two classes of methods that compute word vectors: context-counting and context-predicting; while the latter has seen much interest lately, their respective strengths and weaknesses are still being debated (Baroni et al., 2014; Levy and Goldberg, 2014). The most important relation defined in a vector space between the meanings of two words is similarity: a mouse is something quite similar to a rat. Similarity of meaning is operationalized in terms of geometry, by defining a metric.

Symbolic representations seem to have an advantage in describing word sense ambiguity: the situation where a surface form corresponds to more than one concept. For instance, the word mouse can refer to a rodent or an electronic device. Vector-space representations typically represent surface forms only, which makes it hard to search e.g. for a group of words similar to the rodent sense of mouse, or to reliably use the vectors in classifiers that rely on the meaning of the word. There have been several attempts to create vectors representing senses, most of them based on some variant of the idea first proposed by Schütze (1998): that senses can be seen as clusters of similar contexts. Recent examples in this tradition include the work by Huang et al. (2012) and Neelakantan et al. (2014). However, because sense distributions are often highly imbalanced, it is not clear that context clusters can be reliably created for senses that occur rarely. These approaches also lack interpretability: if we are interested in the rodent sense of mouse, which of the vectors should we use?

In this work, we instead derive sense vectors by embedding the graph structure of a semantic network in the word space. By combining two complementary sources of information – corpus statistics and network structure – we derive useful vectors also for concepts that occur rarely. The method, which can be applied to context-counting as well as context-predicting spaces, works by decomposing word vectors as linear combinations of sense vectors, and by pushing the sense vectors towards their neighbors in the semantic network. This intuition leads to a constrained optimization problem, for which we present an approximate algorithm.


We applied the algorithm to derive vectors for the senses in a Swedish semantic network, and we evaluated their quality extrinsically by using them as features in a semantic classification task – mapping senses to their corresponding FrameNet frames. When using the sense vectors in this task, we saw a large improvement over using word vectors.

2 Embedding a Semantic Network

The goal of the algorithm is to embed the semantic network in a geometric space: that is, to associate each sense s_{ij} with a sense embedding, a vector E(s_{ij}) of real numbers, in a way that reflects the topology of the semantic network but also keeps the vectors representing the lemmas related to those corresponding to the senses. We now formalize this intuition, and we start by introducing some notation.

For each lemma l_i, there is a set of possible senses s_{i1}, ..., s_{im_i} for which l_i is a surface realization. Furthermore, for each sense s_{ij}, there is a neighborhood consisting of senses semantically related to s_{ij}. Each neighbor n_{ijk} of s_{ij} is associated with a weight w_{ijk} representing the degree of semantic relatedness between s_{ij} and n_{ijk}. How we define the neighborhood, i.e. our notion of semantic relatedness, will obviously have an impact on the result. In this work, we simply assume that it can be computed from the network, e.g. by picking a number of hypernyms and hyponyms in a lexicon such as WordNet. We then assume that for each lemma l_i, we have a D-dimensional vector F(l_i) of real numbers; this can be computed using any method described in Section 1. Finally, we assume a distance function \Delta(x, y) that returns a non-negative real number for each pair of vectors in R^D.

The algorithm maps each sense s_{ij} to a sense embedding, a real-valued vector E(s_{ij}) in the same vector space as the lemma embeddings. The lemma and sense embeddings are related through a mix constraint: F(l_i) is decomposed as a convex combination \sum_j p_{ij} E(s_{ij}), where the p_{ij} are picked from the probability simplex. Intuitively, the mix variables correspond to the occurrence probabilities of the senses, but strictly speaking this is only the case when the vectors are built using simple context counting. Since the mix gives an estimate of which sense is the most frequent in the corpus, we get a strong baseline for word sense disambiguation (McCarthy et al., 2007) as a bonus; see our followup paper (Johansson and Nieto Piña, 2015) for a discussion of this.

We can now formalize the intuition above: the weighted sum of distances between each sense and its neighbors is minimized, while satisfying the mix constraint for each lemma. We get the following constrained optimization program:

    minimize_{E, p}  \sum_{i,j,k} w_{ijk} \Delta(E(s_{ij}), E(n_{ijk}))
    subject to       \sum_j p_{ij} E(s_{ij}) = F(l_i)   for all i        (1)
                     \sum_j p_{ij} = 1                   for all i
                     p_{ij} \ge 0                        for all i, j

The mix constraints make sure that the solution is nontrivial. In particular, a very large number of words are monosemous, and the procedure will leave the embeddings of these words unchanged.

2.1 An Approximate Algorithm

The difficulty of solving the problem stated in Equation (1) obviously depends on the distance function \Delta. Henceforth, we focus on the case where \Delta is the squared Euclidean distance. This is an important special case that is related to a number of other distances or similarities, e.g. cosine similarity and Hellinger distance. In this case, (1) is a quadratically constrained quadratic problem, which is NP-hard in general and difficult to handle with off-the-shelf optimization tools. We therefore resort to an approximation; we show empirically in Sections 3 and 4 that it works well in practice.

The approximate algorithm works in an online fashion by considering one lemma at a time. It adjusts the embeddings of the senses as well as their mix in order to minimize the loss function

    L_i = \sum_{j,k} w_{ijk} \| E(s_{ij}) - E(n_{ijk}) \|^2    (2)
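As a concrete illustration of Equation (2), the minimal sketch below computes the per-lemma loss L_i with NumPy. The data layout (one array of neighbor vectors and one array of weights per sense) is an assumption made for the example, not something specified in the paper.

```python
import numpy as np

def lemma_loss(sense_vecs, neighbour_vecs, neighbour_weights):
    """Loss L_i of Eq. (2): the weighted sum of squared Euclidean distances
    between each sense embedding E(s_ij) and its (fixed) neighbours E(n_ijk).

    sense_vecs:        list of m_i vectors of shape (D,), one per sense
    neighbour_vecs:    list of arrays of shape (k_j, D), the neighbours of each sense
    neighbour_weights: list of arrays of shape (k_j,), the weights w_ijk
    """
    total = 0.0
    for e_ij, nbrs, wts in zip(sense_vecs, neighbour_vecs, neighbour_weights):
        total += np.sum(wts * np.sum((nbrs - e_ij) ** 2, axis=1))
    return total
```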

The embeddings of the neighbors n_{ijk} of the sense are kept fixed at each such step. We iterate through the whole set of lemmas for a fixed number of epochs or until the objective is unchanged.

Furthermore, instead of directly optimizing with respect to the sense embeddings (which involves m_i · D scalars), the sense embeddings (and therefore also the loss L_i) can be computed analytically if the mix variables p_{i1}, ..., p_{im_i} are given, so we have reduced the optimization problem to one involving m_i − 1 scalars, i.e. it is univariate in most cases.

Given a sense s_{ij} of a lemma l_i, we define the weighted centroid of the set of neighbors of s_{ij} as

    c_{ij} = \frac{\sum_k w_{ijk} E(n_{ijk})}{\sum_k w_{ijk}}    (3)

If the mix constraints were removed, the c_{ij} would be the solution to the optimization problem: they minimize the weighted sum of squared Euclidean distances to the neighbors. Then, given the mix, the residual is defined as

    r_i = \frac{1}{\sum_j p_{ij}^2 / \sum_k w_{ijk}} \left( \sum_j p_{ij} c_{ij} - F(l_i) \right)    (4)

The vector r_i represents the difference between the linear combination of the weighted centroids and the lemma embedding. Finally, the sense embeddings for the lemma l_i become

    E(s_{ij}) = c_{ij} - \frac{p_{ij}}{\sum_k w_{ijk}} r_i    (5)

Equations (4) and (5) show the role of the mix variables: if p_{ij} = 0, then the sense embedding E(s_{ij}) is completely determined by the neighbors of the sense (that is, it is equal to the weighted centroid). On the other hand, if p_{ij} = 1, then the sense embedding becomes equal to the embedding of the lemma, F(l_i).

To optimize the mix variables p_{i1}, ..., p_{im_i} with respect to the loss L_i, basically any search procedure could be used; we found it easiest to use a variant of a simple randomized gradient-free search method (Matyas, 1965).
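The sketch below puts Equations (3)–(5) and the randomized search together for a single lemma. The closed-form update follows the equations directly; the `optimise_mix` routine is only a hypothetical Matyas-style random search (the paper does not spell out the exact variant used), and all names and array shapes are assumptions made for the example.

```python
import numpy as np

def senses_given_mix(F_l, centroids, w_sums, p):
    """Closed-form sense embeddings for one lemma, following Eqs. (3)-(5).

    F_l:       (D,)   lemma embedding F(l_i)
    centroids: (m, D) weighted neighbour centroids c_ij (Eq. 3)
    w_sums:    (m,)   normalisers sum_k w_ijk (assumed > 0: every sense
                      has at least one neighbour)
    p:         (m,)   mix variables on the probability simplex
    """
    r = (p @ centroids - F_l) / np.sum(p ** 2 / w_sums)   # residual, Eq. (4)
    return centroids - np.outer(p / w_sums, r)             # E(s_ij), Eq. (5)

def optimise_mix(F_l, centroids, w_sums, loss_fn, steps=100, scale=0.1, seed=0):
    """Gradient-free random search over the mix, in the spirit of Matyas (1965):
    perturb p, project back onto the simplex, keep the step if the loss improves."""
    rng = np.random.default_rng(seed)
    m = len(w_sums)
    p = np.full(m, 1.0 / m)                                # start from a uniform mix
    E = senses_given_mix(F_l, centroids, w_sums, p)
    best = loss_fn(E)
    for _ in range(steps):
        q = np.clip(p + scale * rng.standard_normal(m), 0.0, None)
        if q.sum() == 0.0:
            continue
        q /= q.sum()                                       # back onto the simplex
        E_new = senses_given_mix(F_l, centroids, w_sums, q)
        l_new = loss_fn(E_new)
        if l_new < best:
            p, E, best = q, E_new, l_new
    return p, E
```

Here `loss_fn` would compute L_i for the candidate sense embeddings against the fixed neighbors, e.g. by closing over the per-sense neighbor arrays as in the earlier sketch. Note that for a monosemous lemma (m = 1) the mix constraint forces E(s_{i1}) = F(l_i), so these words keep their original vectors.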

3 Application to Swedish Data

The algorithm described in Section 2 was applied to Swedish data: we started with lemma embeddings computed from a corpus, and then created sense embeddings by using the SALDO semantic network (Borin et al., 2013). The algorithm was run for a few epochs, which seemed to be enough for reaching a plateau in performance; the total runtime of the algorithm was a few minutes.

3.1 Creating Lemma Embeddings

We created a corpus of 1 billion words downloaded from Språkbanken, the Swedish language bank.[1] The corpora are distributed in a format where the text has been tokenized, part-of-speech-tagged and lemmatized. Compounds have been segmented automatically, and when a lemma was not listed in SALDO, we used the parts of the compounds instead. The input to the software computing the lemma embeddings consisted of lemma forms with concatenated part-of-speech tags, e.g. dricka..vb for the verb 'to drink' and dricka..nn for the noun 'drink'. We used the word2vec tool[2] to build the lemma embeddings. All the default settings were used, except the vector space dimensionality, which was set to 512.

[1] http://spraakbanken.gu.se
[2] https://code.google.com/p/word2vec
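As an illustration of the preprocessing just described, the sketch below trains 512-dimensional skip-gram lemma embeddings with Gensim's Word2Vec, used here as a stand-in for the original word2vec tool; the file names and the assumption that the corpus is already tokenized, lemmatized and suffixed with `..pos` tags are hypothetical.

```python
from gensim.models import Word2Vec

def read_sentences(path):
    # one tokenised, lemmatised sentence per line, tokens in the
    # lemma..pos format described above (e.g. "dricka..vb")
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.split()

sentences = list(read_sentences("lemmatised_corpus.txt"))

# 512-dimensional skip-gram vectors; other settings left at their defaults
model = Word2Vec(sentences, vector_size=512, sg=1)
model.wv.save_word2vec_format("lemma_embeddings.txt")
```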

3.2 SALDO, a Swedish Semantic Network

SALDO (Borin et al., 2013) is the largest freely available lexical resource for Swedish. We used a version from May 2014, which contains 125,781 entries organized into a single semantic network. Compared to WordNet (Fellbaum, 1998), there are similarities but also significant differences. Most significantly, SALDO is organized according to the lexical-semantic relation of association, not is-a as in WordNet. In SALDO, when we go up in the hierarchy we move from specialized vocabulary to the most central vocabulary of the language (e.g. 'move', 'want', 'who'); in WordNet we move from specific to abstract (e.g. 'entity'). Another difference is that sense distinctions in SALDO tend to be more coarse-grained than in WordNet.

Each entry except a special root is connected to other entries, its semantic descriptors. One of the semantic descriptors is called the primary descriptor: this is the semantic neighbor that is conceptually most central. Primary descriptors are most often hypernyms or synonyms, but they can also be e.g. antonyms or meronyms, or be in an argument–predicate relationship with the entry. To exemplify, we consider the word rock, which has two senses: a long coat and rock music, respectively. Its first sense has the primary descriptor kappa 'coat', while for the second sense it is musik 'music'.

When embedding the SALDO network using the algorithm in Section 2, the neighbors n_{ijk} of a SALDO sense s_{ij} are its primary descriptor and its inverse primaries (the senses for which s_{ij} is the primary descriptor). We did not use any descriptors besides the primary. The neighborhood weights were set so that the primary descriptor and the set of inverse primaries were balanced, and so that all weights sum to 1: if there were N inverse primaries, we set the weight of the primary descriptor to 1/2 and of each inverse primary to 1/(2N). We did not see any measurable effect of using a larger set of neighbors (e.g. adding connections to second-order neighbors).
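A minimal sketch of this weighting scheme, assuming the SALDO graph is available as two hypothetical lookup tables (sense → primary descriptor, and sense → list of inverse primaries):

```python
def saldo_neighbourhood(sense, primary_of, inverse_primaries):
    """Neighbours and weights w_ijk for a SALDO sense: the primary descriptor
    gets weight 1/2 and each of the N inverse primaries 1/(2N), so that the
    weights sum to 1."""
    inv = inverse_primaries.get(sense, [])
    if not inv:
        # assumption: with no inverse primaries, all weight goes to the primary
        return {primary_of[sense]: 1.0}
    weights = {primary_of[sense]: 0.5}
    for n in inv:
        weights[n] = 0.5 / len(inv)
    return weights
```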
3.3 Inspection of Sense Embeddings

Table 1 shows a list of the nearest neighbors of the two senses of rock; as we can see, both lists make sense semantically. (A similar query among the lemma embeddings returns a list almost identical to that of the second sense.)

rock-1 'coat'          rock-2 'rock music'
syrtut-1 'overcoat'    punk-1 'punk music'
kåpa-2 'cloak'         pop-1 'pop music'
kappa-1 'coat'         soul-1 'soul music'
kavaj-1 'blazer'       hårdrock-1 'hard rock'
jacka-1 'jacket'       hot-2 'hot jazz'

Table 1: The senses closest to the two senses of rock.

A more difficult and interesting use case, where corpus-based methods clearly have an advantage over knowledge-based methods, is to search among the lemmas that have occurred in the corpus but which are not listed in SALDO. Table 2 shows the result of this search, and again the result is very useful. We stress that the list for the first sense would have been hard to derive in a standard word vector space, due to the dominance of the music sense.

rock-1 'coat'              rock-2 'rock music'
midjekort 'doublet'        indie 'indie'
trekvartsärm '3/4 sleeve'  indierock 'indie rock'
spetsbh 'lace bra'         doo-wop 'doo-wop'
blåjeans 'blue jeans'      psykedelia 'psychedelia'
treggings 'treggings'      R&B 'R&B'

Table 2: The unlisted lemmas closest to the two senses of rock.

4 Evaluation

Evaluating intrinsically, using e.g. a correlation between a graph-based similarity measure and geometric similarity, would be problematic, since this is in some sense what our algorithm optimizes. We therefore evaluate extrinsically, by using the sense vectors in a classifier that maps senses to semantic classes defined by FrameNet (Fillmore and Baker, 2009). FrameNet is a semantic resource consisting of two parts: first, an ontology of semantic frames – the classes – and secondly a lexicon that maps word senses to frames. In standard FrameNet terminology, the senses assigned to a frame are called its lexical units (LUs). Coverage is often a problem in frame-semantic lexicons, and this has a negative impact on the quality of NLP systems using FrameNet (Palmer and Sporleder, 2010). The task of finding LUs for frames is thus a useful testbed for evaluating lemma and sense vectors.

To evaluate, we used 567 frames from the Swedish FrameNet (Friberg Heppin and Toporowska Gronostaj, 2012); in total we had 28,842 verb, noun, adjective, and adverb LUs, which we split into training (67%) and test (33%) sets. For each frame, we trained an SVM with LIBLINEAR (Fan et al., 2008), using the LUs in that frame as positive training instances, and all other LUs as negative instances. Each LU was represented as a vector: its lemma or sense embedding normalized to unit length.

Table 3 shows the precision, recall, and F-measure for the classifiers for the five frames with the most LUs, and finally the micro-average over all frames.
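A sketch of this classification setup, using scikit-learn's LinearSVC (a LIBLINEAR wrapper) as a stand-in for the original LIBLINEAR setup; the function and variable names are illustrative, and the LU vectors are assumed to be normalized to unit length beforehand.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_frame_classifiers(lu_vectors, lu_frames, frames):
    """One binary classifier per frame: the frame's LUs are positive
    instances and all other LUs negative, as in the setup above.

    lu_vectors: (n, D) array of unit-normalised lemma or sense embeddings
    lu_frames:  list of n frame names, one per LU
    frames:     iterable of frames to train classifiers for
    """
    lu_frames = np.asarray(lu_frames)
    classifiers = {}
    for frame in frames:
        y = (lu_frames == frame).astype(int)   # this frame vs. all other LUs
        classifiers[frame] = LinearSVC().fit(lu_vectors, y)
    return classifiers
```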

In the overall evaluation as well as in four out of the five largest frames, the classifiers using sense vectors clearly outperform those using lemma vectors. The frame where we do not see any improvement from introducing sense distinctions, PEOPLE BY VOCATION, contains terms for professions such as painter and builder; since SALDO derives such terms from their corresponding verbs rather than from a common hypernym (e.g. worker), they do not form a coherent subnetwork in SALDO or subregion in the embedding space.

(a) Using lemma embeddings.
Frame                P      R      F
ANIMALS              0.741  0.643  0.689
FOOD                 0.684  0.679  0.682
PEOPLE BY VOCATION   0.595  0.651  0.622
ORIGIN               0.789  0.691  0.737
PEOPLE BY ORIGIN     0.693  0.481  0.568
Overall              0.569  0.292  0.386

(b) Using sense embeddings.
Frame                P      R      F
ANIMALS              0.826  0.663  0.736
FOOD                 0.726  0.743  0.735
PEOPLE BY VOCATION   0.605  0.637  0.621
ORIGIN               0.813  0.684  0.742
PEOPLE BY ORIGIN     0.756  0.508  0.608
Overall              0.667  0.332  0.443

Table 3: FrameNet lexical unit classification.

5 Conclusion

We have presented a new method to embed a semantic network consisting of linked word senses into a continuous-vector word space; the method is agnostic about whether the original word space was produced using a context-counting or a context-predicting method. Unlike previous approaches for creating sense vectors, since we rely on the network structure, we can create representations for senses that occur rarely in corpora. While the experiments described in this paper have been carried out using a Swedish corpus and semantic network, the algorithm we have described is generally applicable and the software[3] can be applied to other languages and semantic networks.

The algorithm takes word vectors and uses them and the network structure to induce the sense vectors. It is based on two separate ideas: first, sense embeddings should preserve the structure of the semantic network as much as possible, so two senses should be close if they are neighbors in the graph; secondly, the word vectors are a probabilistic mix of sense vectors. These two ideas are stated as an optimization problem where the first becomes the objective and the second a constraint. While this is hard to solve in the general case, we presented an approximation that can be applied when using the squared Euclidean distance.

We implemented the algorithm and used it to embed the senses of a Swedish semantic network into a word space produced using the skip-gram model. While a qualitative inspection of nearest-neighbor lists of a few senses gives very appealing results, our main evaluation was extrinsic: a FrameNet lexical unit classifier saw a large performance boost when using sense vectors instead of word vectors.

In a followup paper (Johansson and Nieto Piña, 2015), we have shown that sense embeddings can be used to build an efficient word sense disambiguation system that is much faster than graph-based systems with a similar accuracy, and that the mix variables can be used to predict the predominant sense of a word. In future work, we plan to investigate whether the sense vectors are useful for retrieving rarely occurring senses in corpora. Furthermore, since we have so far evaluated extrinsically, it would be useful to devise intrinsic sense-based evaluation schemes, e.g. a sense analogy task similar to the word analogy task used by Mikolov et al. (2013).

Acknowledgments

This research was funded by the Swedish Research Council under grant 2013–4944, Distributional methods to represent the meaning of frames and constructions, and grant 2012–5738, Towards a knowledge-based culturomics. The evaluation material used in this work was developed in the project Swedish FrameNet++, grant 2010–6013. We also acknowledge the University of Gothenburg for its support of the Centre for Language Technology and Språkbanken.

[3] http://demo.spraakdata.gu.se/richard/scouse

References

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247, Baltimore, United States.

Lars Borin, Markus Forsberg, and Lennart Lönngren. 2013. SALDO: a touch of yin to WordNet's yang. Language Resources and Evaluation, 47(4):1191–1211.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874.

Christiane Fellbaum, editor. 1998. WordNet: An electronic lexical database. MIT Press.

Charles J. Fillmore and Collin Baker. 2009. A frames approach to semantic analysis. In B. Heine and H. Narrog, editors, The Oxford Handbook of Linguistic Analysis, pages 313–340. Oxford: OUP.

Karin Friberg Heppin and Maria Toporowska Gronostaj. 2012. The rocky road towards a Swedish FrameNet – creating SweFN. In Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC-2012), pages 256–261, Istanbul, Turkey.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Association for Computational Linguistics 2012 Conference (ACL 2012), pages 41–48, Boston, United States.

Richard Johansson and Luis Nieto Piña. 2015. Combining relational and distributional knowledge for word sense disambiguation. In Proceedings of the 20th Nordic Conference of Computational Linguistics, Vilnius, Lithuania.

Omer Levy and Yoav Goldberg. 2014. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171–180, Ann Arbor, United States.

J. Matyas. 1965. Random optimization. Automation and Remote Control, 26(2):246–253.

Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. 2007. Unsupervised acquisition of predominant word senses. Computational Linguistics, 33(4):553–590.

Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, USA.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1059–1069, Doha, Qatar.

Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–123.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.
