Resources to Examine the Quality of Word Embedding Models Trained on n-Gram Data

Ábel Elekes, Adrian Englhardt, Martin Schäler, Klemens Böhm
Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
{abel.elekes, adrian.englhardt, martin.schaeler, [email protected]}

Abstract

Word embeddings are powerful tools that facilitate better analysis of natural language. However, their quality highly depends on the resource used for training. There are various approaches relying on n-gram corpora, such as the Google n-gram corpus. However, n-gram corpora only offer a small window into the full text – 5 words for the Google corpus at best. This gives way to the concern whether the extracted word semantics are of high quality. In this paper, we address this concern with two contributions. First, we provide a resource containing 120 word embedding models – one of the largest collections of embedding models. Furthermore, the resource contains the n-grammed versions of all corpora used, as well as our scripts for corpus generation, model generation and evaluation. Second, we define a set of meaningful experiments that allow evaluating the aforementioned quality differences. We conduct these experiments using our resource to show its usage and significance. The evaluation results confirm that one generally can expect high quality even with n ≥ 3.

1 Introduction

Motivation. Word embedding approaches like Word2Vec (Mikolov et al., 2013b) or GloVe (Pennington et al., 2014) are powerful tools for the semantic analysis of natural language. One can train them on arbitrary text corpora. Each word in the corpus is mapped to a d-dimensional vector. These vectors feature the semantic similarity and analogy properties, as follows. Semantic similarity means that representations of words used in a similar context tend to be close to each other in the vector space. The analogy property can be described by the example that "man" is to "woman" like "king" is to "queen" (Mikolov et al., 2013b,a; Jansen, 2017).
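To make these two properties concrete, the following minimal Python sketch shows how one can query them. It is an illustration only: it assumes the gensim library (version 4 or later) and its downloadable pretrained GloVe vectors, not a model from our resource:

    # Illustration: querying the similarity and analogy properties of an
    # embedding model. Assumes gensim >= 4.0; the first call downloads a
    # small set of pretrained GloVe vectors.
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-50")  # a KeyedVectors object

    # Semantic similarity: words used in similar contexts obtain close
    # vectors, measured here by cosine similarity.
    print(vectors.similarity("king", "queen"))   # comparatively high
    print(vectors.similarity("king", "carrot"))  # comparatively low

    # Analogy: vector("king") - vector("man") + vector("woman") should
    # end up near vector("queen").
    print(vectors.most_similar(positive=["king", "woman"],
                               negative=["man"], topn=3))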
These properties have been applied in numerous approaches (Mitra and Craswell, 2017), such as sentiment analysis (Tang et al., 2014), irony detection (Reyes et al., 2012), out-of-vocabulary word classification (Ma and Zhang, 2015), or semantic shift detection (Hamilton et al., 2016a; Martinez-Ortiz et al., 2016). One prerequisite for creating high-quality embedding models is a good training corpus. To this end, many approaches use the Google n-gram corpus (Hellrich and Hahn, 2016; Pyysalo et al., 2013; Kim et al., 2014; Martinez-Ortiz et al., 2016; Kulkarni et al., 2016, 2015; Hamilton et al., 2016b). It is also the largest currently available corpus with historic data and exists for several languages. It incorporates over 5 million books from the last centuries, split into n-grams (Michel et al., 2015). n-grams are text segments consisting of n words each. The fragmentation of a corpus is the size of its n-grams. To illustrate, a corpus of 2-grams is highly fragmented, while one of 5-grams is moderately fragmented.

n-gram counts over time can be published even if the underlying full text is subject to copyright protection. Moreover, this format reduces the data volume considerably. It is therefore important to know how good models built on n-gram corpora are. While the quality of word embedding models trained on full-text corpora is fairly well known (Lebret and Collobert, 2015; Baroni et al., 2014), an assessment of models built on fragmented corpora is missing (Hill et al., 2014). The resource advertised in this paper is a set of such models, together with some experiments, which should help to shed some light on the issue.

Difficulties. An obvious benefit of making these models available is that others are spared the huge runtime necessary to build them. However, evaluating them is not straightforward, for various reasons. First, drawing general conclusions on the quality of embedding models based only on the performance of specific approaches, i.e., examining the extrinsic suitability of models, is error-prone (Gladkova and Drozd, 2016; Schnabel et al., 2015). Consequently, to come to general conclusions, one needs to investigate general properties of the embedding models themselves, i.e., examine their intrinsic suitability. Properties of this kind are semantic similarity and analogy. For both properties, one can use well-known test sets that serve as comprehensive baselines. Second, there are various parameters which influence what the model looks like. Using n-grams as training corpus gives way to two new parameters: fragmentation, as just discussed, and minimum count, i.e., the minimum occurrence count of an n-gram in order to be considered when building the model. The latter is often used to filter error-prone n-grams, e.g., spelling errors, from a corpus. While the effect of the other parameters on the models is known (Lebret and Collobert, 2015; Baroni et al., 2014), that of these two new parameters is not. We have to define meaningful experiments to quantify and compare their effects. Third, the full text, such as the Google Books corpus, is in many cases not openly available as a reference. Hence, we need to examine how to compare results from other corpora where the full text is available, referring, e.g., to well-known baselines such as the Wikipedia corpus.

Contribution. The resource provided here is a systematic collection of word embedding models trained on n-gram corpora, accessible at our project website¹. The collection consists of 120 word embedding models trained on the Wikipedia and 1 Billion words data sets. Training them has required more than two months of computing time on a modern machine. To our knowledge, it currently is one of the most comprehensive collections of its type. In order to make this resource re-usable and our experiments repeatable, we also provide the n-grammed versions of the Wikipedia and 1-Billion word datasets which we used for training, as well as the tools to create n-gram corpora from arbitrary text.

¹ http://dbis.ipd.kit.edu/2568.php

In addition, we describe some experiments to examine how much model quality changes when the training corpus is not full text, but n-grams. The experiments quantify how much fragmentation (i.e., the value of n) and minimum count reduce the average quality of the corresponding word embedding model on common word similarity and analogical reasoning test sets.

To show the usefulness and significance of the experiments, to give general recommendations on which n-gram corpus to use, and to create a baseline for comparison, we conduct the experiments on the full English Wikipedia dump and Chelba et al.'s 1-Billion word dataset. However, we recommend conducting this examination for any corpus before using it as a training resource, particularly if its size differs much from that of the baseline corpora. Our experiments address the following questions:

1. What is the smallest number n for which an n-gram corpus is good training data for word embedding models?

2. How sensitive is the quality of the models to the fragmentation and the minimum count parameter?

3. What is the actual reason for any quality loss of models trained with high fragmentation or a high minimum count parameter?

Our results for the baseline test sets indicate that minimum count values exceeding a corpus-size-dependent threshold drastically reduce the quality of the models. Fragmentation, in turn, brings down the quality only if the fragments are very small. Based on this, one can conclude that n-gram corpora such as Google Books are valid training data for word embedding models.

2 Fundamentals and Notation

We first introduce word embedding models. Then we explain how to build word embedding models on n-gram corpora.

2.1 Background on Word Embedding Models

In the following, we review specific distributional models called word embedding models. Experiments have shown that word embedding models are superior to conventional distributional models (Baroni et al., 2014; Mikolov et al., 2013b). In this section, we briefly explain how such models work and which parameters influence their building process.

2.1.1 Building a Word Embedding Model

Word embedding models 'embed' words into a high-dimensional space, representing them as dense vectors of real numbers. Vectors close to each other according to a distance function, often the cosine distance, represent words that are semantically related. Formally, a word embedding model is a function F which takes a corpus C as input, generates a dictionary D and associates every word w ∈ D in the dictionary with a d-dimensional vector v ∈ ℝ^d. The dimension size parameter d sets the dimensionality of the vectors. win is the window size parameter. It determines the context of a word. For example, a window size of 5 means that the context of a specific word is any other word in its sentence at a distance of at most 5 words.
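For illustration, the following minimal Python sketch trains such a model with the gensim library (version 4 or later). This is an assumption made for the example, not one of the training scripts shipped with our resource, and corpus.txt is a hypothetical placeholder for a tokenized training corpus:

    # Illustration: building a word embedding model with the parameters
    # introduced above. Assumes gensim >= 4.0 and a hypothetical file
    # corpus.txt with one whitespace-tokenized sentence per line.
    from gensim.models import Word2Vec

    with open("corpus.txt", encoding="utf-8") as f:
        sentences = [line.split() for line in f]

    model = Word2Vec(
        sentences,
        vector_size=300,  # dimension size parameter d
        window=5,         # window size parameter win
        min_count=5,      # discard words occurring fewer than 5 times
    )

    # Every word w in the dictionary D is now mapped to a vector in R^300.
    print(model.wv["word"])  # raises KeyError if "word" was discarded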
2.2 Generating Fragmented and Trimmed Corpora

To create fragmented n-gram corpora from raw text, we use a simple method, described in the following. With a sliding window of size n passing through the whole raw text, we collect all the n-grams which appear in the corpus and store them in a dictionary, together with their match counts. This means that we create datasets similar to the Google Books dataset, but from other raw text, such as the Wikipedia dump. For every fragmented corpus, we create different versions of it,
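The following minimal Python sketch mirrors this procedure under simplifying assumptions (whitespace tokenization, the whole text held in memory). It is an illustration rather than one of our released corpus generation scripts; its min_count argument implements the trimming by minimum count discussed in Section 1:

    # Illustration: fragmenting raw text into an n-gram corpus with match
    # counts, then trimming it with a minimum count threshold.
    from collections import Counter

    def build_ngram_corpus(tokens, n, min_count=1):
        # Slide a window of size n over the token stream and count every
        # n-gram that appears.
        counts = Counter(tuple(tokens[i:i + n])
                         for i in range(len(tokens) - n + 1))
        # Trimming: keep only n-grams occurring at least min_count times.
        return {gram: c for gram, c in counts.items() if c >= min_count}

    tokens = "the quick brown fox jumps over the lazy dog".split()
    print(build_ngram_corpus(tokens, n=2))  # eight 2-grams, count 1 each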
