Ngram2vec: Learning Improved Word Representations from Ngram Co-occurrence Statistics

Zhe Zhao1,2 Tao Liu1,2 [email protected] [email protected]
Shen Li3,4 Bofang Li1,2 Xiaoyong Du1,2 [email protected] [email protected] [email protected]
1 School of Information, Renmin University of China
2 Key Laboratory of Data Engineering and Knowledge Engineering, MOE
3 Institute of Chinese Information Processing, Beijing Normal University
4 UltraPower-BNU Joint Laboratory for Artificial Intelligence, Beijing Normal University

Abstract

The existing word representation methods mostly limit their information source to word co-occurrence statistics. In this paper, we introduce ngrams into four representation methods: SGNS, GloVe, the PPMI matrix, and its SVD factorization. Comprehensive experiments are conducted on word analogy and similarity tasks. The results show that improved word representations are learned from ngram co-occurrence statistics. We also demonstrate that the trained ngram representations are useful in many aspects, such as finding antonyms and collocations. Besides, a novel approach of building the co-occurrence matrix is proposed to alleviate the hardware burdens brought by ngrams.

1 Introduction

Recently, deep learning approaches have achieved state-of-the-art results on a range of NLP tasks. One of the most fundamental works in this field is word embedding, where low-dimensional word representations are learned from unlabeled corpora through neural models. The trained word embeddings reflect semantic and syntactic information of words. They are not only useful in revealing lexical semantics, but are also used as inputs of various downstream tasks for better performance (Kim, 2014; Collobert et al., 2011; Pennington et al., 2014).

Most word embedding models are trained upon <word, context> pairs in the local window. Among them, word2vec gains its popularity by its remarkable effectiveness and efficiency (Mikolov et al., 2013b,a). It achieves state-of-the-art results on a range of linguistic tasks in only a fraction of the time required by previous techniques. A challenger of word2vec is GloVe (Pennington et al., 2014). Instead of training on <word, context> pairs, GloVe directly utilizes the word co-occurrence matrix. They claim that this change brings improvements over word2vec in both accuracy and speed. Levy and Goldberg (2014b) further reveal that the attractive properties observed in word embeddings are not restricted to neural models such as word2vec and GloVe. They use a traditional count-based method (the PPMI matrix with hyper-parameter tuning) to represent words, and achieve results comparable with the above neural embedding models.

The above models limit their information source to word co-occurrence statistics (Levy et al., 2015). To learn improved word representations, we extend the information source from co-occurrence of the 'word-word' type to co-occurrence of the 'ngram-ngram' type. The idea of using ngrams is well supported by language modeling, one of the oldest problems studied in statistical NLP. In language models, co-occurrence of words and ngrams is used to predict the next word (Kneser and Ney, 1995; Katz, 1987). In fact, the idea of word embedding models is rooted in language models. They are closely related but are used for different purposes: word embedding models aim at learning useful word representations rather than word prediction. Since ngrams are a vital part of language modeling, we are inspired to integrate ngram statistical information into recent word representation methods for better performance.

The idea of using ngrams is intuitive. However, there is still little work using ngrams in recent representation methods.
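The extension from 'word-word' to 'ngram-ngram' co-occurrence can be made concrete with a small sketch. The helpers below (`extract_ngrams`, `ngram_pairs`) are hypothetical illustrations, not the paper's toolkit code: they slide over a tokenized corpus and count <ngram, ngram> pairs whose start positions fall within a local window, keeping overlapping ngrams for simplicity.

```python
from collections import Counter

def extract_ngrams(tokens, max_order=2):
    """Return (start position, ngram string) pairs for all orders up to max_order."""
    grams = []
    for order in range(1, max_order + 1):
        for i in range(len(tokens) - order + 1):
            grams.append((i, " ".join(tokens[i:i + order])))
    return grams

def ngram_pairs(tokens, window=2, max_order=2):
    """Count <ngram, ngram> co-occurrence pairs whose start positions
    are within `window` of each other (a simplified window definition)."""
    pairs = Counter()
    grams = extract_ngrams(tokens, max_order)
    for i, g in grams:
        for j, h in grams:
            if i != j and abs(i - j) <= window:
                pairs[(g, h)] += 1
    return pairs

tokens = "the cat sat on the mat".split()
counts = ngram_pairs(tokens)
```

With `max_order=1` this reduces to the usual word-word pairs; raising `max_order` is what enlarges the information source, at the cost of a much larger pair vocabulary.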
In this paper, we introduce ngrams into SGNS, GloVe, PPMI, and its SVD factorization. To evaluate the ngram-based models, comprehensive experiments are conducted on word analogy and similarity tasks. Experimental results demonstrate that improved word representations are learned from ngram co-occurrence statistics. Besides that, we qualitatively evaluate the trained ngram representations. We show that they are able to reflect ngrams' meanings and syntactic patterns (e.g. the 'be + past participle' pattern). The high-quality ngram representations are useful in many ways. For example, ngrams in negative form (e.g. 'not interesting') can be used for finding antonyms (e.g. 'boring'). Finally, a novel method is proposed to build the ngram co-occurrence matrix. Our method reduces disk I/O as much as possible, largely alleviating the costs brought by ngrams. We unify the different representation methods in a pipeline. The source code is organized as the ngram2vec toolkit and released at https://github.com/zhezhaoa/ngram2vec.

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 244–253, Copenhagen, Denmark, September 7–11, 2017. © 2017 Association for Computational Linguistics.

2 Related Work

SGNS, GloVe, PPMI, and its SVD factorization are used as baselines. The information used by these methods does not go beyond word co-occurrence statistics; however, their approaches to using that information are different. We review these methods in the following three sections. In Section 2.4, we revisit the use of ngrams in the deep learning context.

2.1 SGNS

Skip-gram with negative sampling (SGNS) is a model in the word2vec toolkit (Mikolov et al., 2013b,a). Its training procedure follows the majority of neural embedding models (Bengio et al., 2003): (1) scan the corpus and use <word, context> pairs in the local window as training samples; (2) train the model to make words useful for predicting contexts (or in reverse). The details of SGNS are discussed in Section 3.1. Compared to previous neural embedding models, SGNS speeds up the training process, reducing the training time from days or weeks to hours. Also, the trained embeddings possess attractive properties: they are able to reflect relations between two words accurately, which is evaluated by a task called word analogy.

Due to the above advantages, many models have been proposed on the basis of SGNS. For example, Faruqui et al. (2015) introduce knowledge from lexical resources into the models in word2vec. Zhao et al. (2016) extend the contexts from the local window to the entire document. Li et al. (2015) use supervised information to guide the training. A dependency parse-tree is used for defining context in (Levy and Goldberg, 2014a). An LSTM is used for modeling context in (Melamud et al., 2016). Sub-word information is considered in (Sun et al., 2016; Soricut and Och, 2015).

2.2 GloVe

Different from typical neural embedding models, which are trained on <word, context> pairs, GloVe learns word representations on the basis of the co-occurrence matrix (Pennington et al., 2014). GloVe breaks the traditional 'words predict contexts' paradigm: its objective is to reconstruct the non-zero values in the matrix. The direct use of the matrix is reported to bring improved results and higher speed. However, there is still dispute about the advantages of GloVe over word2vec (Levy et al., 2015; Schnabel et al., 2015). GloVe and other embedding models are essentially based on word co-occurrence statistics of the corpus; the <word, context> pairs and the co-occurrence matrix can be converted into each other. Suzuki and Nagata (2015) try to unify GloVe and SGNS in one framework.

2.3 PPMI & SVD

When we are satisfied with the huge improvements achieved by embedding models on linguistic tasks, a natural question is raised: where do the superiorities come from? One conjecture is that they are due to the neural networks. However, Levy and Goldberg (2014c) reveal that SGNS is just factorizing the PMI matrix implicitly. Also, Levy and Goldberg (2014b) show that the positive PMI (PPMI) matrix still rivals the newly proposed embedding models on a range of linguistic tasks. Properties like word analogy are not restricted to neural models. To obtain dense word representations from the PPMI matrix, we factorize it with SVD, a classic dimensionality reduction method for learning low-dimensional vectors from a sparse matrix (Deerwester et al., 1990).

2.4 Ngram in Deep Learning

In the deep learning literature, ngrams have been shown to be useful in generating text representations. Recently, convolutional neural networks (CNNs) have been reported to perform well on a range of NLP tasks (Blunsom et al., 2014; Hu et al., 2014; Severyn and Moschitti, 2015). CNNs essentially use ngram information to represent texts: 1-D convolutional layers extract ngram features, and the distinctive features are selected by max-pooling layers. In (Li et al., 2016), ngram embedding is introduced into the Paragraph Vector model, where the text embedding is trained to be useful for predicting ngrams in the text. In the word embedding literature, related work is done by Melamud et al. (2014), where word embedding models are used as baselines. They propose to use ngram language models to model the context, showing the effectiveness of ngrams on similarity tasks. Another work related to ngrams is from Mikolov et al. (2013b), where phrases are embedded into vectors. It should be noted that phrases are different from ngrams: phrases have clear semantics, and the number of phrases is much smaller than the number of ngrams.

Figure 1: Illustration of 'word predicts word'.

'J.K.'. In this paper, negative sampling (Mikolov et al., 2013b) is used to approximate the conditional probability:

p(c|w) = \sigma(\vec{w}^{T}\vec{c}) \prod_{j=1}^{k} \mathbb{E}_{c_j \sim P_n(C)} \left[ \sigma(-\vec{w}^{T}\vec{c}_j) \right] \quad (2)

where \sigma is the sigmoid function, and the k samples (c_1 to c_k) are drawn from the context distribution raised to the power of n.

3.2 Word Predicts Ngram
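Before moving on, the negative-sampling approximation of p(c|w) from the previous section can be sketched with toy vectors. This is a minimal illustration, not word2vec code: `neg_sampling_prob` is a hypothetical helper, the expectation over P_n(C) is replaced by k Monte-Carlo draws, and uniform sampling stands in for the smoothed context distribution.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def neg_sampling_prob(w, c, contexts, k=5, seed=0):
    """Approximate p(c|w) as sigma(w.c) * prod_j sigma(-w.c_j),
    drawing k negative context vectors c_j as stand-ins for
    samples from the (smoothed) context distribution P_n(C)."""
    rng = random.Random(seed)
    p = sigmoid(dot(w, c))
    for _ in range(k):
        c_j = rng.choice(contexts)  # one negative sample
        p *= sigmoid(-dot(w, c_j))
    return p

# Toy word and context vectors (made up for illustration).
w = [0.5, -0.2]
c = [0.4, 0.1]
contexts = [[0.1, 0.3], [-0.2, 0.2], [0.0, -0.1]]
prob = neg_sampling_prob(w, c, contexts)
```

Each factor lies in (0, 1), so the product is a valid probability-like score; maximizing it pushes the observed pair's dot product up and the sampled pairs' dot products down.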