Evaluating Word Embedding Models: Methods and Experimental Results Bin Wang,1∗ Angela Wang,2∗ Fenxiao Chen,1 Yuncheng Wang1 and C.-C

SIP (2019),vol.8,e19,page1of14©TheAuthors,2019. This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited. doi:10.1017/ATSIP.2019.12 overview paper Evaluating word embedding models: methods and experimental results bin wang,1∗ angela wang,2∗ fenxiao chen,1 yuncheng wang1 and c.-c. jay kuo1 Extensive evaluation on a large number of word embedding models for language processing applications is conducted in this work.First,weintroducepopularwordembeddingmodelsanddiscussdesiredpropertiesofwordmodelsandevaluationmethods (or evaluators). Then, we categorize evaluators into intrinsic and extrinsic two types. Intrinsic evaluators test the quality of a representation independent of specific natural language processing tasks while extrinsic evaluators use word embeddings as input features to a downstream task and measure changes in performance metrics specific to that task. We report experimental results of intrinsic and extrinsic evaluators on six word embedding models. It is shown that different evaluators focus on different aspects of word models, and some are more correlated with natural language processing tasks. Finally, we adopt correlation analysis to study performance consistency of extrinsic and intrinsic evaluators. Keywords: Natural language processing, Word embedding, Word embedding evaluation Received 21 March 2019; Revised 9 June 2019 I. INTRODUCTION specific NLP tasks. They measure syntactic or semantic rela- tionships among words directly. Aggregate scores are given Word embedding is a real-valued vector representation of from testing the vectors in selected sets of query terms and words by embedding both semantic and syntactic meanings semantically related target words. One can further classify obtained from unlabeled large corpus. It is a powerful tool intrinsic evaluators into two types: (1) absolute evaluation, widely used in modern natural language processing (NLP) where embeddings are evaluated individually and only their tasks, including semantic analysis [1], information retrieval final scores are compared, and (2) comparative evaluation, [2], dependency parsing [3–5], question answering [6, 7], wherepeopleareaskedabouttheirpreferencesamongdif- and machine translation [6, 8, 9]. Learning a high-quality ferent word embeddings [15]. Since comparative intrinsic representation is extremely important for these tasks, yet the evaluators demand additional resources for subjective tests, question “what is a good word embedding model” remains they are not as popular as the absolute ones. an open problem. A good word representation should have certain good Various evaluation methods (or evaluators) have been properties. An ideal word evaluator should be able to ana- proposed to test the qualities of word embedding mod- lyze word embedding models from different perspectives. els. As introduced in [10], there are two main categories Yet, existing evaluators put emphasis on a certain aspect for evaluation methods – intrinsic and extrinsic evaluators. with or without consciousness. There is no unified eval- Extrinsic evaluators use word embeddings as input features uator that analyzes word embedding models comprehen- to a downstream task and measure changes in performance sively. Researchers have a hard time in selecting among metrics specific to that task. Examples include part-of- word embedding models because models do not always per- speech (POS) tagging [11], named-entity recognition (NER) form at the same level on different intrinsic evaluators. As a [12], sentiment analysis [13], and machine translation [14]. result, the gold standard for a good word embedding model Extrinsic evaluators are more computationally expensive, differs for different language tasks. In this work, we will and they may not be directly applicable. Intrinsic evalu- conduct correlation study between intrinsic evaluators and ators test the quality of a representation independent of language tasks so as to provide insights into various evaluators and help people select word embedding models for specific language tasks. 1University of Southern California, Los Angeles, CA 90089, USA Although correlation between intrinsic and extrinsic 2University of California, Berkeley, Berkeley, CA 94720, USA evaluators was studied before [16, 17], this topic is never Corresponding author: thoroughly and seriously treated. For example, producing Bin Wang, models by changing the window size only does not happen Email: [email protected] ∗Equal Contribution ofteninreal-worldapplications,andtheconclusiondrawn Downloaded from https://www.cambridge.org/core. IP address: 170.106.34.90, on 25 Sep 2021 at 21:30:51, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/ATSIP.2019.12 1 2 bin wang, et al. in [16] might be biased. The work in [17]onlyfocusedon CBOW model trains two matrices: (1) an input word matrix Chinese characters with limited experiments. We provide denoted by V ∈ RN×|W|,wheretheith column of V is the N- the most comprehensive study and try to avoid the bias as dimensional embedded vector for input word vi, and (2) an much as possible in this work. output word matrix denoted by U ∈ R|W|×N ,wherethejth Therestofthepaperisorganizedasfollows.Popular row of U is the N-dimensional embedded vector for output word embedding models are reviewed in Section II.Prop- word uj. To embed input context words, we use the one-hot erties of good embedding models and intrinsic evaluators representation for each word initially, and apply VT to get are discussed in Section III. Representative performance the corresponding word vector embeddings of dimension metrics of intrinsic evaluation are presented in Section IV N. We apply UT to an input word vector to generate a score and the corresponding experimental results are offered in vector and use the softmax operation to convert a score vec- Section V. Representative performance metrics of extrin- tor into a probability vector of size W.Thisprocessisto sic evaluation are introduced in Section VI and the corre- yield a probability vector that matches the vector represen- sponding experimental results are provided in Section VII. tation of the output word. The CBOW model is obtained by We conduct consistency study on intrinsic and extrinsic minimizing the cross-entropy loss between the probability evaluators using correlation analysis in Section VIII.Finally, vector and the embedded vector of the output word. This is concluding remarks and future research directions are dis- achieved by minimizing the following objective function: cussed in Section IX. |W| ( ) =− T ˆ + ( T ˆ) J ui ui v log exp uj v ,(2) II. WORD EMBEDDING MODELS j=1 th where ui is the i row of matrix U and vˆ is the average of As extensive NLP downstream tasks emerge, the demand embedded input words. for word embedding is growing significantly. As a result, Initial values for matrices V and U are randomly lots of word embedding methods are proposed while some assigned. The dimension N of word embedding can vary of them share the same concept. We categorize the existing based on different application scenarios. Usually, it ranges word embedding methods based on their techniques. from 50 to 300 dimensions. After obtaining both matrices V or U, they can either be used solely or averaged to obtained A) Neural Network Language Model thefinalwordembeddingmatrix. The Neural Network Language Model (NNLM) [18]jointly The skip-gram model [19]predictsthesurroundingcon- learnsawordvectorrepresentationandastatisticallan- text words given a center word. It focuses on maximizing guage model with a feedforward neural network that con- probabilities of context words given a specific center word, tains a linear projection layer and a non-linear hidden layer. which can be written as An N-dimensional one-hot vector that represents the word P(wi−c, wi−c+1, ..., wi−1, wi+1, ..., wi+c−1, wi+c|wi).(3) is used as the input, where N is the size of the vocabulary. The input is first projected onto the projection layer. The optimization procedure is similar to that for the CBOW Afterwards, a softmax operation is used to compute the model but with a reversed order for context and center probability distribution over all words in the vocabulary. As words. a result of its non-linear hidden layers, the NNLM model The softmax function mentioned above is a method to is very computationally complex. To lower the complexity, generate probability distributions from word vectors. It can an NNLM is first trained using continuous word vectors be written as T learned from simple models. Then, another N-gram NNLM exp(v vw ) ( | ) = wc i P wc wi | | . (4) is trained from the word vectors. W ( T ) w=1 exp vwvwi B) Continuous-Bag-of-Words and skip-gram This softmax function is not the most efficient one since we must take a sum over all W words to normalize this Twoiteration-basedmethodswereproposedinthe function. Other functions that are more efficient include word2vec paper [19]. The first one is the Continuous-Bag- negative sampling and hierarchical softmax [20]. Negative of-Words (CBOW) model, which predicts the center word sampling is a method that maximizes the log probability of from its surrounding context. This model maximizes the the softmax model by only summing over a smaller subset probability of a word

Evaluating Word Embedding Models: Methods and Experimental Results Bin Wang,1∗ Angela Wang,2∗ Fenxiao Chen,1 Yuncheng Wang1 and C.-C

Arxiv:2002.06235V1

Combining Word Embeddings with Semantic Resources

Word-Like Character N-Gram Embedding

Learned in Speech Recognition: Contextual Acoustic Word Embeddings

Knowledge-Powered Deep Learning for Word Embedding

Local Homology of Word Embeddings

Word Embedding, Sense Embedding and Their Application to Word Sense Induction

Review on Parse Tree Generation in Natural Language Processing

Building a Treebank for French

Learning Word Meta-Embeddings by Autoencoding

A Comparison of Word Embeddings and N-Gram Models for Dbpedia Type and Invalid Entity Detection †

Sentiment Analysis for Multilingual Corpora