PSDVec: A Toolbox for Incremental and Scalable Word Embedding
Shaohua Li a, Jun Zhu b, Chunyan Miao a

a Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), Nanyang Technological University, Singapore
b Tsinghua University, PR China
Corresponding author: Shaohua Li.

Article history: Received 28 July 2015; received in revised form 28 January 2016; accepted 25 May 2016. Communicated by Wang Gang.

Keywords: Word embedding; Matrix factorization; Incremental learning

Abstract

PSDVec is a Python/Perl toolbox that learns word embeddings, i.e. the mapping of words in a natural language to continuous vectors which encode the semantic/syntactic regularities between the words. PSDVec implements a word embedding learning method based on a weighted low-rank positive semidefinite approximation. To scale up the learning process, we implement a blockwise online learning algorithm to learn the embeddings incrementally. This strategy greatly reduces the learning time of word embeddings on a large vocabulary, and can learn the embeddings of new words without re-learning the whole vocabulary. On 9 word similarity/analogy benchmark sets and 2 Natural Language Processing (NLP) tasks, PSDVec produces embeddings that have the best average performance among popular word embedding tools. PSDVec provides a new option for NLP practitioners.

1. Introduction

Word embedding has gained popularity as an important unsupervised Natural Language Processing (NLP) technique in recent years. The task of word embedding is to derive a set of vectors in a Euclidean space corresponding to words which best fit certain statistics derived from a corpus. These vectors, commonly referred to as the embeddings, capture the semantic/syntactic regularities between the words. Word embeddings can supersede the traditional one-hot encoding of words as the input of an NLP learning system, and can often significantly improve the performance of the system.

There are two lines of word embedding methods. The first line is neural word embedding models, which use softmax regression to fit bigram probabilities and are optimized with Stochastic Gradient Descent (SGD). One of the best known tools is word2vec (https://code.google.com/p/word2vec/) [10]. The second line is low-rank matrix factorization (MF)-based methods, which aim to reconstruct a certain bigram statistics matrix extracted from a corpus by the product of two low-rank factor matrices. Representative methods/toolboxes include Hyperwords (https://bitbucket.org/omerlevy/hyperwords) [4,5], GloVe (http://nlp.stanford.edu/projects/glove/) [11], Singular (https://github.com/karlstratos/singular) [14], and Sparse (https://github.com/mfaruqui/sparse-coding) [2]. All these methods use two different sets of embeddings for words and their context words, respectively. SVD-based optimization procedures are used to yield two singular matrices, of which only the left singular matrix is used as the embeddings of words. However, SVD operates on G⊤G, which incurs information loss in G and may not correctly capture the signed correlations between words. An empirical comparison of popular methods is presented in [5].

The toolbox presented in this paper is an implementation of our previous work [8], a new MF-based method that relies on eigendecomposition instead. In [8], we establish a Bayesian generative model of word embedding, derive a weighted low-rank positive semidefinite approximation problem to the Pointwise Mutual Information (PMI) matrix, and finally solve it using eigendecomposition. Eigendecomposition avoids the information loss incurred by SVD-based methods, and the yielded embeddings are of higher quality than those of SVD-based methods.
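To make the low-rank positive semidefinite approximation concrete, the following is a minimal NumPy sketch (our illustration, not code from the toolbox) of the unweighted, unregularized special case: a symmetric PMI-like matrix G is eigendecomposed, and a rank-d positive semidefinite fit V⊤V ≈ G is formed from its largest eigenpairs, so that the columns of V serve as d-dimensional embeddings. The weighted, regularized objective that PSDVec actually solves is given in Section 2; all names below are illustrative.

```python
import numpy as np

def psd_embeddings(G, d):
    """Rank-d positive semidefinite fit of a symmetric matrix G via
    eigendecomposition (unweighted, unregularized special case only).

    Returns V of shape (d, n) such that V.T @ V approximates G.
    """
    eigvals, eigvecs = np.linalg.eigh(G)    # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]     # indices of the d largest eigenvalues
    lam = np.clip(eigvals[top], 0.0, None)  # clip negatives so V.T @ V stays PSD
    return np.sqrt(lam)[:, None] * eigvecs[:, top].T

# Toy usage with a random symmetric stand-in for the PMI matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
G = (A + A.T) / 2
V = psd_embeddings(G, d=3)
print(np.linalg.norm(G - V.T @ V))          # residual of the rank-3 PSD fit
```

Note that this factorization yields a single set of embeddings, in contrast to the SVD-based pipelines above, which maintain separate embeddings for words and context words.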
However, eigendecomposition is known to be difficult to scale up. To make our method scalable to large vocabularies, we exploit the sparsity pattern of the weight matrix and implement a divide-and-conquer approximate solver to find the embeddings incrementally.

Our toolbox is named Positive-Semidefinite Vectors (PSDVec). It offers the following advantages over other word embedding tools:

1. The incremental solver in PSDVec has O(cd²n) time complexity and O(cd) space complexity, where n is the number of words in the vocabulary, d is the specified dimensionality of the embeddings, and c ≪ n is the number of specified core words. Note that the space complexity does not increase with the vocabulary. In contrast, other MF-based solvers, including the core embedding generation, have O(n³) time complexity and O(n²) space complexity. Hence, asymptotically, PSDVec takes about O(cd²/n²) of the time and O(cd/n²) of the space of other MF-based solvers.

2. Given the embeddings of an original vocabulary, PSDVec is able to learn the embeddings of new words incrementally. To the best of our knowledge, no other word embedding tool provides this functionality; instead, new words have to be learned together with old words in batch mode. A common situation is that we have a huge general corpus such as English Wikipedia, and also a small domain-specific corpus, such as the NIPS dataset. In the general corpus, domain-specific terms may appear only rarely. It would then be desirable to train the embeddings of a general vocabulary on the general corpus, and afterwards incrementally learn the words that are unique to the domain-specific corpus. This is where the incremental learning feature comes into play.

3. On word similarity/analogy benchmark sets and common Natural Language Processing (NLP) tasks, PSDVec produces embeddings that have the best average performance among popular word embedding tools.

4. PSDVec is established as a Bayesian generative model [8]. The probabilistic modeling endows PSDVec with a clear probabilistic interpretation, and the modular structure of the generative model is easy to customize and extend in a principled manner. For example, global factors like topics can be naturally incorporated, resulting in a hybrid model [9] of word embedding and Latent Dirichlet Allocation [1]. For such extensions, PSDVec would serve as a good prototype. In other methods, by contrast, the regression objectives are usually heuristic, and other factors are difficult to incorporate.

2. Problem and solution

PSDVec implements a low-rank MF-based word embedding method. This method aims to fit \mathrm{PMI}(s_i, s_j) = \log \frac{P(s_i, s_j)}{P(s_i)\,P(s_j)} using v_{s_j}^\top v_{s_i}, where P(s_i) and P(s_i, s_j) are the empirical unigram and bigram probabilities, respectively, and v_{s_i} is the embedding of s_i. The regression residual \mathrm{PMI}(s_i, s_j) - v_{s_j}^\top v_{s_i} is penalized by a monotonic transformation f(·) of P(s_i, s_j), which implies that a more frequent (and therefore more important) bigram s_i, s_j is expected to be fitted better. The optimization objective in matrix form is

V^* = \arg\min_V \|G - V^\top V\|_{f(H)} + \sum_{i=1}^{W} \mu_i \|v_{s_i}\|_2^2,   (1)

where G is the PMI matrix, V is the embedding matrix, H is the bigram probabilities matrix, \|\cdot\|_{f(H)} is the f(H)-weighted Frobenius norm, and the μ_i are the Tikhonov regularization coefficients. The purpose of the Tikhonov regularization is to penalize overlong embeddings; overlong embeddings are a sign of overfitting the corpus. Our experiments showed that, with such regularization, the yielded embeddings perform better on all tasks. Eq. (1) seeks a weighted low-rank positive semidefinite approximation to G. Prior to computing G, the bigram probabilities {P(s_i, s_j)} are smoothed using Jelinek–Mercer smoothing.

However, eigendecomposition is known to be difficult to scale up. As a remedy, we implement an approximate solution that learns the embeddings incrementally. The incremental learning proceeds as follows:

1. Partition the vocabulary S into K consecutive groups S_1, ..., S_K. Take K = 3 as an example. S_1 consists of the most frequent words, referred to as the core words; the remaining words are noncore words.

2. Accordingly, partition G into K × K blocks:

G = \begin{pmatrix} G_{11} & G_{12} & G_{13} \\ G_{21} & G_{22} & G_{23} \\ G_{31} & G_{32} & G_{33} \end{pmatrix}.

Partition f(H) in the same way; G_{11} and f(H)_{11} correspond to core–core bigrams (consisting of two core words). Partition V into (V_1, V_2, V_3), whose column blocks correspond to S_1, S_2 and S_3, respectively.

3. For the core words, set μ_i = 0 and solve \arg\min_{V_1} \|G_{11} - V_1^\top V_1\|_{f(H_{11})} using eigendecomposition, obtaining the core embeddings V_1^*.

4. Set V_1 = V_1^*, and find V_2^* that minimizes the total penalty of the (1,2)th and (2,1)th blocks (the (2,2)th block is ignored due to its high sparsity):

\arg\min_{V_2} \|G_{12} - V_1^\top V_2\|^2_{f(H_{12})} + \|G_{21} - V_2^\top V_1\|^2_{f(H_{21})} + \sum_{s_i \in S_2} \mu_i \|v_{s_i}\|^2.

The columns of V_2 are independent, so each v_{s_i} poses a separate weighted ridge regression problem, which has a closed-form solution (a minimal sketch is given after this list).

5. For any other set of noncore words S_k, find V_k^* that minimizes the total penalty of the (1,k)th and (k,1)th blocks, ignoring all other (k,j)th and (j,k)th blocks.

6. Combine all subsets of embeddings to form V^*. Here V^* = (V_1^*, V_2^*, V_3^*).
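To make step 4 concrete, here is a minimal NumPy sketch (our illustration, not code from the toolbox) of the closed-form weighted ridge regression for a single noncore word, assuming the core embeddings V_1 from step 3 are already available. The weight vectors stand in for the relevant entries of f(H_{12}) and f(H_{21}), and all names are illustrative.

```python
import numpy as np

def noncore_embedding(V1, g12, g21, w12, w21, mu):
    """Closed-form weighted ridge regression for one noncore word (cf. step 4).

    V1       : (d, c) core embeddings, held fixed.
    g12, g21 : (c,) PMI values of the (core, noncore) and (noncore, core) bigrams.
    w12, w21 : (c,) weights, i.e. the relevant entries of f(H12) and f(H21).
    mu       : Tikhonov regularization coefficient of this word.

    Minimizes sum_i w12[i]*(g12[i] - dot(v1_i, v))**2
            + sum_i w21[i]*(g21[i] - dot(v1_i, v))**2 + mu*||v||**2.
    """
    d = V1.shape[0]
    A = (V1 * (w12 + w21)) @ V1.T + mu * np.eye(d)  # normal-equation matrix
    b = V1 @ (w12 * g12 + w21 * g21)
    return np.linalg.solve(A, b)

# Toy usage: c = 5 core words, d = 3 dimensions, one new noncore word.
rng = np.random.default_rng(0)
V1 = rng.standard_normal((3, 5))       # stand-in for the core embeddings V1*
g12, g21 = rng.standard_normal(5), rng.standard_normal(5)
w12, w21 = rng.random(5), rng.random(5)
v_new = noncore_embedding(V1, g12, g21, w12, w21, mu=0.1)
print(v_new)                           # 3-dimensional embedding of the new word
```

Because each noncore word is solved independently of the others, new words can be appended at any time without re-learning the existing embeddings, which is the basis of the incremental-learning feature highlighted in Section 1.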
3. Software architecture and functionalities

Our toolbox consists of 4 Python/Perl scripts: extractwiki.py, gramcount.pl, factorize.py and evaluate.py. Fig. 1 presents the overall architecture.

1. extractwiki.py first receives a Wikipedia snapshot as input; it then removes non-textual elements, non-English words and