Resources to Examine the Quality of Word Embedding Models Trained on n-Gram Data


Ábel Elekes, Adrian Englhardt, Martin Schäler, Klemens Böhm
Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
{abel.elekes, adrian.englhardt, martin.schaeler, klemens.boehm}@kit.edu

Abstract

Word embeddings are powerful tools that facilitate better analysis of natural language. However, their quality highly depends on the resource used for training. There are various approaches relying on n-gram corpora, such as the Google n-gram corpus. However, n-gram corpora only offer a small window into the full text – 5 words for the Google corpus at best. This gives way to the concern whether the extracted word semantics are of high quality. In this paper, we address this concern with two contributions. First, we provide a resource containing 120 word-embedding models – one of the largest collections of embedding models. Furthermore, the resource contains the n-grammed versions of all used corpora, as well as the scripts we used for corpus generation, model generation and evaluation. Second, we define a set of meaningful experiments that allows evaluating the aforementioned quality differences. We conduct these experiments using our resource to show its usage and significance. The evaluation results confirm that one can generally expect high quality for n-grams with n ≥ 3.

1 Introduction

Motivation. Word embedding approaches like Word2Vec (Mikolov et al., 2013b) or GloVe (Pennington et al., 2014) are powerful tools for the semantic analysis of natural language. One can train them on arbitrary text corpora. Each word in the corpus is mapped to a d-dimensional vector. These vectors feature the semantic similarity and analogy properties, as follows. Semantic similarity means that representations of words used in a similar context tend to be close to each other in the vector space. The analogy property can be described by the example that "man" is to "woman" like "king" is to "queen" (Mikolov et al., 2013b,a; Jansen, 2017).
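As an illustration, the analogy property can be queried with simple vector arithmetic. The following is a minimal sketch, assuming the word vectors are available as a plain dictionary of numpy arrays; the function and variable names are ours for illustration, not part of the released resource:

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine similarity between two vectors; close to 1.0 for related words.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(vectors, a, b, c):
    # Solve "a is to b as c is to ?" by vector arithmetic:
    # the answer should lie near b - a + c (Mikolov et al.).
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

# With a well-trained model one expects:
# analogy(vectors, "man", "woman", "king") == "queen"
```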
These properties have been applied in numerous approaches (Mitra and Craswell, 2017) like sentiment analysis (Tang et al., 2014), irony detection (Reyes et al., 2012), out-of-vocabulary word classification (Ma and Zhang, 2015), or semantic shift detection (Hamilton et al., 2016a; Martinez-Ortiz et al., 2016). One prerequisite when creating high-quality embedding models is a good training corpus. To this end, many approaches use the Google n-gram corpus (Hellrich and Hahn, 2016; Pyysalo et al., 2013; Kim et al., 2014; Martinez-Ortiz et al., 2016; Kulkarni et al., 2016, 2015; Hamilton et al., 2016b). It also is the largest currently available corpus with historic data and exists for several languages. It incorporates over 5 million books from the last centuries, split into n-grams (Michel et al., 2015). n-grams are text segments separated into pieces consisting of n words each. The fragmentation of a corpus is the size of its n-grams. To illustrate, a corpus of 2-grams is highly fragmented, one of 5-grams is moderately fragmented.

n-gram counts over time can be published even if the underlying full text is subject to copyright protection. Next, this format reduces the data volume very much. So it is important to know how good models built on n-gram corpora are. While the quality of word embedding models trained on full-text corpora is fairly well known (Lebret and Collobert, 2015; Baroni et al., 2014), an assessment of models built on fragmented corpora is missing (Hill et al., 2014). The resource advertised in this paper is a set of such models, which should help to shed some light on the issue, together with some experiments.

Difficulties. An obvious benefit of making these models available is the huge runtime necessary to build them. However, evaluating them is not straightforward, for various reasons. First, drawing general conclusions on the quality of embedding models only based on the performance of specific approaches, i.e., examining the extrinsic suitability of models, is error-prone (Gladkova and Drozd, 2016; Schnabel et al., 2015). Consequently, to come to general conclusions one needs to investigate general properties of the embedding models themselves, i.e., examine their intrinsic suitability. Properties of this kind are semantic similarity and analogy. For both properties, one can use well-known test sets that serve as comprehensive baselines. Second, there are various parameters which influence what the model looks like. Using n-grams as training corpus gives way to two new parameters: fragmentation, as just discussed, and minimum count, i.e., the minimum occurrence count of an n-gram in order to be considered when building the model. The latter is often used to filter error-prone n-grams from a corpus, e.g., spelling errors. While the effect of the other parameters on the models is known (Lebret and Collobert, 2015; Baroni et al., 2014), the effect of these two new parameters is not. We have to define meaningful experiments to quantify and compare their effects. Third, the full text, such as the Google Books corpus, is in many cases not openly available as reference. Hence, we need to examine how to compare results from other corpora where the full text is available, referring e.g. to well-known baselines such as the Wikipedia corpus.
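Such intrinsic, test-set-based evaluations can be run with standard tooling. Below is a minimal sketch using gensim, assuming a trained model in word2vec format and local copies of common test-set files; the file names are placeholders of ours, not files shipped with this paper:

```python
from gensim.models import KeyedVectors

# Load a trained embedding model (path and serialization format are assumptions).
model = KeyedVectors.load_word2vec_format("wiki_5gram_model.bin", binary=True)

# Word similarity: Spearman correlation between model scores and human
# ratings on a pair file such as WordSim-353 (tab-separated word pairs).
pearson, spearman, oov_ratio = model.evaluate_word_pairs("wordsim353.tsv")
print(f"Spearman: {spearman[0]:.3f}, out-of-vocabulary: {oov_ratio:.1f}%")

# Analogical reasoning: accuracy on the Google analogy test set.
accuracy, sections = model.evaluate_word_analogies("questions-words.txt")
print(f"Analogy accuracy: {accuracy:.3f}")
```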
Contribution. The resource provided here is a systematic collection of word embedding models trained on n-gram corpora, accessible at our project website¹. The collection consists of 120 word embedding models trained on the Wikipedia and 1-Billion word data sets. Its training has required more than two months of computing time on a modern machine. To our knowledge, it currently is one of the most comprehensive collections of its type. In order to make this resource re-usable and our experiments repeatable, we also provide the n-grammed versions of the Wikipedia and 1-Billion word datasets which we used for training, as well as the tools to create n-gram corpora from arbitrary text.

In addition, we describe some experiments to examine how much model quality changes when the training corpus is not full text but n-grams. The experiments quantify how much fragmentation (i.e., the value of n) and minimum count reduce the average quality of the corresponding word embedding model on common word similarity and analogical reasoning test sets.

To show the usefulness and significance of the experiments, to give general recommendations on which n-gram corpus to use, and to create a baseline for comparison, we conduct the experiments on the full English Wikipedia dump and on Chelba et al.'s 1-Billion word dataset. However, we recommend conducting this examination for any corpus before using it as a training resource, particularly if its size differs much from those of the baseline corpora. Our experiments address three questions:

1. What is the smallest number n for which an n-gram corpus is good training data for word embedding models?

2. How sensitive is the quality of the models to the fragmentation and the minimum count parameter?

3. What is the actual reason for any quality loss of models trained with high fragmentation or a high minimum count parameter?

Our results for the baseline test sets indicate that minimum count values exceeding a corpus-size-dependent threshold drastically reduce the quality of the models. Fragmentation in turn brings down the quality only if the fragments are very small. Based on this, one can conclude that n-gram corpora such as Google Books are valid training data for word embedding models.

¹ http://dbis.ipd.kit.edu/2568.php

2 Fundamentals and Notation

We first introduce word embedding models. Then we explain how to build word embedding models on n-gram corpora.

2.1 Background on Word Embedding Models

In the following, we review specific distributional models called word embedding models. Experiments have shown that word embedding models are superior to conventional distributional models (Baroni et al., 2014; Mikolov et al., 2013b). In this section, we briefly describe how such models work and which parameters influence their building process.

2.1.1 Building a Word Embedding Model

Word embedding models 'embed' words into a high-dimensional space, representing them as dense vectors of real numbers. Vectors close to each other according to a distance function, often the cosine distance, represent words that are semantically related. Formally, a word embedding model is a function F which takes a corpus C as input, generates a dictionary D, and associates any word w ∈ D in the dictionary with a d-dimensional vector v ∈ R^d. The dimension size parameter d sets the dimensionality of the vectors. win is the window size parameter. It determines the context of a word. For example, a window size of 5 means that the context of a specific word is any other word in its sentence at a distance of at most 5 words.

2.2 Generating Fragmented and Trimmed Corpora

To create fragmented n-gram corpora from raw text, we use a simple method described in the following. With a sliding window of size n passing through the whole raw text, we collect all the n-grams which appear in the corpus and store them in a dictionary, together with their match count; a sketch of this step is shown below. This means that we create datasets similar to the Google Books dataset, but from other raw text such as the Wikipedia dump. For every fragmented corpus, we create different versions of it,
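A minimal sketch of this sliding-window collection step, including the minimum count trimming discussed in Section 1, might look as follows; this is our own illustration, and the released corpus-generation scripts may differ in tokenization and storage details:

```python
from collections import Counter

def build_ngram_corpus(tokens, n, min_count=1):
    # Slide a window of size n over the token stream and count every n-gram.
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Trim rare n-grams: the minimum count parameter filters error-prone
    # entries such as spelling errors.
    return {gram: c for gram, c in counts.items() if c >= min_count}

tokens = "the quick brown fox jumps over the lazy dog".split()
print(build_ngram_corpus(tokens, n=3))  # nine tokens yield seven 3-grams
```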