Linear Ensembles of Word Embedding Models


Avo Muromägi, Kairit Sirts, Sven Laur
University of Tartu, Tartu, Estonia

Abstract

This paper explores linear methods for combining several word embedding models into an ensemble. We construct the combined models using an iterative method based on either ordinary least squares regression or the solution to the orthogonal Procrustes problem.

We evaluate the proposed approaches on Estonian, a morphologically complex language for which the available corpora for training word embeddings are relatively small. We compare both combined models with each other and with the input word embedding models using synonym and analogy tests. The results show that while the ordinary least squares regression performs poorly in our experiments, using orthogonal Procrustes to combine several word embedding models into an ensemble model leads to 7-10% relative improvements over the mean result of the initial models in synonym tests and 19-47% in analogy tests.

1 Introduction

Word embeddings, dense low-dimensional vector representations of words, have become very popular in recent years in the field of natural language processing (NLP). Various methods have been proposed to train word embeddings from unannotated text corpora (Mikolov et al., 2013b; Pennington et al., 2014; Al-Rfou et al., 2013; Turian et al., 2010; Levy and Goldberg, 2014), the most well-known of them perhaps being Word2Vec (Mikolov et al., 2013b). Embedding learning systems essentially train a model from a corpus of text, and the word embeddings are the model parameters. These systems contain a randomized component, so the trained models are not directly comparable, even when they have been trained on exactly the same data. This random behaviour provides an opportunity to combine several embedding models into an ensemble which, hopefully, results in a better set of word embeddings. Although model ensembles have often been used in various NLP systems to improve the overall accuracy, the idea of combining several word embedding models into an ensemble has not been explored before.

The main contribution of this paper is to show that word embeddings can benefit from ensemble learning, too. We study two methods for combining word embedding models into an ensemble. Both methods use a simple linear transformation. The first is based on the standard ordinary least squares (OLS) solution for linear regression; the second uses the solution to the orthogonal Procrustes problem (OPP) (Schönemann, 1966), which essentially also solves the OLS problem but adds an orthogonality constraint that keeps the angles between vectors and their distances unchanged.

There are several reasons why an ensemble of word embedding models could be useful. The first is the typical ensemble learning argument: the ensemble is simply better because it cancels out the random noise of the individual models and reinforces the useful patterns expressed by several input models. Secondly, word embedding systems require a lot of training data to learn reliable word representations. While there is a lot of textual data available for English, there are many smaller languages for which even obtaining enough plain unannotated text for training reliable embeddings is a problem. Thus, an ensemble approach that uses the available data more effectively would be beneficial.

To our knowledge, this is the first work that attempts to leverage the data by combining several word embedding models into a new, improved model. Linear methods for combining two embedding models for some task-specific purpose have been used previously. Mikolov et al. (2013a) optimized a linear regression with stochastic gradient descent to learn linear transformations between the embeddings in two languages for machine translation. Mogadala and Rettinger (2016) used OPP to translate embeddings between two languages to perform cross-lingual document classification. Hamilton et al. (2016) aligned a series of embedding models with OPP to detect changes in word meanings over time. The same problem was addressed by Kulkarni et al. (2015), who aligned the embedding models using piecewise linear regression based on a set of nearest neighboring words for each word.

Recently, Yin and Schütze (2016) experimented with several methods to learn meta-embeddings by combining different word embedding sets. Our work differs from theirs in two important aspects. First, in their work each initial model is trained with a different word embedding system and on a different data set, while we propose to combine models trained with the same system and on the same dataset, albeit using different random initialisations. Secondly, although the 1toN model proposed in (Yin and Schütze, 2016) is very similar to the linear models studied in this paper, it does not involve the orthogonality constraint included in the OPP method, which in our experiments, as shown later, proves to be crucial.

We conduct experiments on Estonian and construct ensembles from ten different embedding models trained with Word2Vec. We compare the initial and combined models in synonym and analogy tests and find that the ensemble embeddings combined with the orthogonal Procrustes method indeed perform significantly better in both tests, leading to a relative improvement of 7-10% over the mean result of the initial models in synonym tests and 19-47% in analogy tests.

2 Combining word embedding models

A word embedding model is a matrix W \in R^{|V| \times d}, where |V| is the number of words in the model lexicon and d is the dimensionality of the vectors. Each row in the matrix W is the continuous representation of a word in a vector space.

Given r embedding models W_1, ..., W_r, we want to combine them into a target model Y. We define a linear objective function that is the sum of r linear regression optimization goals:

    J = \sum_{i=1}^{r} \| Y - W_i P_i \|^2,    (1)

where P_1, ..., P_r are transformation matrices that translate W_1, ..., W_r, respectively, into the common vector space containing Y.

We use an iterative algorithm to find the matrices P_1, ..., P_r and Y. During each iteration the algorithm performs two steps:

1. Solve r linear regression problems with respect to the current target model Y, which results in updated values for the matrices P_1, ..., P_r;

2. Update Y to be the mean of the translations of all r models:

    Y = \frac{1}{r} \sum_{i=1}^{r} W_i P_i.    (2)

This procedure is continued until the change in the average normalised residual error, computed as

    \frac{1}{r} \sum_{i=1}^{r} \frac{\| Y - W_i P_i \|}{\sqrt{|V| \cdot d}},    (3)

becomes smaller than a predefined threshold value.

We experiment with two different methods for computing the translation matrices P_1, ..., P_r. The first is based on the standard least squares solution to the linear regression problem; the second is known as the solution to the orthogonal Procrustes problem (Schönemann, 1966).
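To make the iterative procedure concrete, here is a minimal sketch in Python with NumPy. It is an illustration rather than the authors' code: the names (combine_embeddings, solve_P), the initialisation of Y as the plain mean of the inputs, and the default stopping parameters are our own assumptions; the paper itself specifies only the two update steps and the convergence criterion of Eqs. (1)-(3).

```python
import numpy as np

def combine_embeddings(models, solve_P, tol=1e-4, max_iter=100):
    """Combine r aligned embedding models (each a |V| x d array whose rows
    correspond to the same words) into a target model Y.

    `solve_P` returns a d x d transformation matrix for one model given the
    current target; plugging in an OLS or an orthogonal Procrustes solver
    yields the SOLS or SOPP variant, respectively."""
    r = len(models)
    V, d = models[0].shape
    Y = np.mean(models, axis=0)  # assumed initialisation of the target
    prev_err = np.inf
    for _ in range(max_iter):
        # Step 1: solve r regression problems w.r.t. the current target Y.
        Ps = [solve_P(W, Y) for W in models]
        # Step 2: update Y as the mean of the translated models, Eq. (2).
        Y = sum(W @ P for W, P in zip(models, Ps)) / r
        # Average normalised residual error, Eq. (3).
        err = sum(np.linalg.norm(Y - W @ P)
                  for W, P in zip(models, Ps)) / (r * np.sqrt(V * d))
        if abs(prev_err - err) < tol:  # stop once the change is small
            break
        prev_err = err
    return Y, Ps
```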
2.1 Solution with the ordinary least squares (SOLS)

The analytical solution to the linear regression problem Y = WP, i.e. finding the transformation matrix P given the input data matrix W and the result Y, is:

    P = (W^T W)^{-1} W^T Y    (4)

We can use this formula to update all matrices P_i at each iteration. The problem with this approach is that, because Y is also unknown and is updated repeatedly in the second step of the iterative algorithm, the OLS might lead to solutions where both W_i P_i and Y are optimized towards 0, which is not a useful solution. To counteract this effect, we rescale Y at the start of each iteration. This is done by scaling the elements of Y so that the variance of each column of Y is equal to 1.
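A sketch of the SOLS update under the same assumptions as above. np.linalg.lstsq solves the least squares problem of Eq. (4) without forming the explicit inverse, which is numerically safer; rescale_columns is one way to implement the column-variance rescaling and, in the SOLS variant, would be applied to Y at the top of each iteration of the loop sketched earlier (the generic loop omits this step).

```python
def solve_ols(W, Y):
    """OLS update, Eq. (4): the P minimising ||Y - WP||^2.
    Mathematically P = (W^T W)^{-1} W^T Y, computed here via lstsq
    for numerical stability."""
    P, *_ = np.linalg.lstsq(W, Y, rcond=None)
    return P

def rescale_columns(Y):
    """Rescale Y so that every column has unit variance, counteracting
    the degenerate solutions where W_i P_i and Y collapse towards 0."""
    return Y / Y.std(axis=0, keepdims=True)
```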
Table 1: Final errors and the number of iterations until convergence for both SOLS and SOPP. The first column shows the embedding size.

    Dim    SOLS Error    SOLS # Iter    SOPP Error    SOPP # Iter
    50     0.162828      33             0.200994      5
    100    0.168316      38             0.183933      5
    150    0.169554      41             0.171266      4
    200    0.172987      40             0.167554      4
    250    0.175723      40             0.164493      4
    300    0.177082      40             0.160988      4

Table 2: Average mean ranks of the synonym test; smaller values are better. SOPP achieves the best result in each row. All differences are statistically significant with p < 2.2 \cdot 10^{-16} in all cases.

    Dim    SOLS     SOPP     Mean W
    50     70098    38998    41933
    100    68175    32485    35986
    150    73182    30249    33564
    200    73946    29310    32865
    250    75884    28469    32194
    300    77098    28906    32729
    Avg    73064    31403    34879

2.2 Solution to the Orthogonal Procrustes problem (SOPP)

Orthogonal Procrustes is a linear regression problem of transforming the input matrix W to the output matrix Y using an orthogonal transformation matrix P (Schönemann, 1966). The orthogonality constraint is specified as

    P P^T = P^T P = I.

The solution to the Orthogonal Procrustes can be computed analytically using singular value decomposition (SVD); a sketch is given after the following experimental details.

The Estonian Reference Corpus is the largest text corpus available for Estonian. Its size is approximately 240M word tokens, which may seem like a lot, but compared to, for instance, the English Gigaword corpus, which is often used to train word embeddings for English and contains more than 4B words, it is quite small. All models were trained using a window size of 10 and the skip-gram architecture. We experimented with models of 6 different embedding sizes: 50, 100, 150, 200, 250 and 300.
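The sketch below completes the picture with the classical SVD-based orthogonal Procrustes update (Schönemann, 1966): the orthogonal P minimising ||Y - WP|| is P = U V^T, where U S V^T is the SVD of W^T Y. As before, this is an illustrative reconstruction under the same assumptions, not the authors' code.

```python
def solve_opp(W, Y):
    """Orthogonal Procrustes update: the orthogonal matrix P minimising
    ||Y - WP|| is P = U V^T, where U S V^T is the SVD of W^T Y."""
    U, _, Vt = np.linalg.svd(W.T @ Y)
    return U @ Vt

# Usage with the iterative loop sketched earlier, for some hypothetical
# list of aligned Word2Vec matrices W1, W2, W3:
#   Y, Ps = combine_embeddings([W1, W2, W3], solve_opp)
```

Because the resulting P is orthogonal, the angles and distances between the word vectors of each input model are preserved, which is exactly the property that distinguishes SOPP from SOLS.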