Effective Dimensionality Reduction for Word Embeddings
Vikas Raunak (Carnegie Mellon University), Vivek Gupta (University of Utah), Florian Metze (Carnegie Mellon University)

Abstract

Pre-trained word embeddings are used in several downstream applications as well as for constructing representations for sentences, paragraphs and documents. Recently, there has been an emphasis on improving the pre-trained word vectors through post-processing algorithms. One improvement area is reducing the dimensionality of word embeddings. Reducing the size of word embeddings can improve their utility in memory-constrained devices, benefiting several real-world applications. In this work, we present a novel technique that efficiently combines PCA-based dimensionality reduction with a recently proposed post-processing algorithm (Mu and Viswanath, 2018) to construct effective word embeddings of lower dimensions. Empirical evaluations on several benchmarks show that our algorithm efficiently reduces the embedding size while achieving similar or (more often) better performance than the original embeddings. To foster reproducibility, we have released the source code along with the paper¹.

1 Introduction

Word embeddings such as Glove (Pennington et al., 2014) and word2vec Skip-Gram (Mikolov et al., 2013), obtained from unlabeled text corpora, represent words as distributed, dense, real-valued, low-dimensional vectors that geometrically capture the semantic 'meaning' of a word. These embeddings capture several linguistic regularities such as analogy relationships, and they play a pivotal role in several natural language processing tasks.

Recently, there has been an emphasis on applying post-processing algorithms to the pre-trained word vectors to further improve their quality. For example, the algorithm in (Mrkšić et al., 2016) tries to inject antonymy and synonymy constraints into vector representations, while (Faruqui et al., 2015) tries to refine word vectors by using relational information from semantic lexicons such as WordNet (Miller, 1995). (Bolukbasi et al., 2016) tries to remove the biases (e.g. gender biases) present in word embeddings, and (Nguyen et al., 2016) tries to 'denoise' word embeddings by strengthening salient information and weakening noise. In particular, the post-processing algorithm in (Mu and Viswanath, 2018) improves word embeddings by projecting the embeddings away from the most dominant directions, and considerably improves their performance by making them more discriminative. However, a major issue with word embeddings is their size (Ling et al., 2016): e.g., loading a word embedding matrix of 2.5 M tokens takes up to 6 GB of memory (for 300-dimensional vectors, on a 64-bit system). Such large memory requirements impose significant constraints on the practical use of word embeddings, especially on mobile devices where the available memory is often highly restricted. In this work we combine a simple dimensionality reduction technique, PCA, with the post-processing technique of (Mu and Viswanath, 2018) discussed above.

In Section 2, we first explain the post-processing algorithm of (Mu and Viswanath, 2018) and then our novel algorithm, and describe the choices behind its design with an example. The evaluation results are presented in Section 3. In Section 4, we discuss related work, followed by the conclusion.

¹ https://github.com/vyraun/Half-Size
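As a quick sanity check on the memory figure quoted above, 6 GB follows directly from storing 2.5 M vectors of 300 double-precision (64-bit) values each. The short Python sketch below is our own back-of-the-envelope calculation (not from the paper), ignoring the token strings themselves and any framework overhead:

    # Approximate memory for a 2.5M x 300 embedding matrix
    # stored as 64-bit (8-byte) floating point values.
    tokens, dims, bytes_per_value = 2_500_000, 300, 8
    total_gb = tokens * dims * bytes_per_value / 1e9
    print(f"{total_gb:.1f} GB")  # prints 6.0 GB, matching the figure above

Since this storage cost scales linearly with the number of dimensions, halving the dimensionality halves the memory footprint for the same vocabulary.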
2 Proposed Algorithm

We first explain the post-processing algorithm of (Mu and Viswanath, 2018) in Section 2.1. Our main algorithm, along with its motivation, is explained next, in Section 2.2.

2.1 Post-Processing Algorithm

(Mu and Viswanath, 2018) presents a simple post-processing algorithm that renders off-the-shelf word embeddings even stronger, as measured on a number of lexical-level and sentence-level tasks. The algorithm is based on the geometrical observation that word embeddings (across all representations such as Glove, word2vec etc.) have a large mean vector, and that most of their energy, after subtracting the mean vector, is located in a subspace of about 8 dimensions. Since all embeddings share a common mean vector and the same dominating directions, both of which strongly influence the representations, eliminating them makes the embeddings stronger. A detailed description of the post-processing algorithm is given in Algorithm 1 (PPA).

Algorithm 1: Post-Processing Algorithm PPA(X, D)
  Data: Word Embedding Matrix X, Threshold Parameter D
  Result: Post-Processed Word Embedding Matrix X
  /* Subtract Mean Embedding */
  X = X − mean(X)
  /* Compute PCA Components */
  u_i = PCA(X), where i = 1, 2, ..., d
  /* Remove Top-D Components */
  for all v in X do
      v = v − \sum_{i=1}^{D} (u_i^T · v) u_i
  end

Figure 1: Comparison of the fraction of variance explained by the top 20 principal components of the original and post-processed (PPA) Glove embeddings (300D).

Figure 1 demonstrates the impact of the post-processing algorithm (PPA, with D = 7) as observed on wiki pre-trained Glove embeddings (300 dimensions). It compares the fraction of variance explained by the top 20 principal components of the original and post-processed word vectors respectively². In the post-processed word embeddings, none of the top principal components are disproportionately dominant in terms of explaining the data, which implies that the post-processed word vectors are not as influenced by the common dominant directions as the original embeddings. This makes the individual word vectors more 'discriminative', hence improving their quality, as validated on several benchmarks in (Mu and Viswanath, 2018).

² The total sum of explained variances over the 300 principal components is equal to 1.0.

2.2 Proposed Algorithm

This section explains our algorithm (Algorithm 2), which effectively uses the PPA algorithm along with PCA for constructing lower-dimensional embeddings. We first apply Algorithm 1 (PPA) of (Mu and Viswanath, 2018) on the original word embedding, to make it more 'discriminative'. We then construct a lower-dimensional representation of the post-processed 'purified' word embedding using a Principal Component Analysis (PCA (Shlens, 2014)) based dimensionality reduction technique. Lastly, we again 'purify' the reduced word embedding by applying Algorithm 1 (PPA) on the reduced word embedding to get our final word embeddings.

To explain the last step of applying Algorithm 1 (PPA) on the reduced word embedding, consider Figure 2. It compares the variance explained by the top 20 principal components of the embeddings constructed by first post-processing the Glove-300D embeddings according to Algorithm 1 (PPA) and then transforming them to 150 dimensions with PCA (labeled as Post-Processing + PCA), against a further post-processed version obtained by again applying Algorithm 1 (PPA) on the reduced word embeddings³.

Figure 2: Comparison of the fraction of variance explained by the top 20 principal components of the PPA + PCA-150D baseline and the further post-processed Glove embedding (150D).

We observe that even though PCA has been applied on post-processed embeddings which had their dominant directions eliminated, the variance in the reduced embeddings is still explained disproportionately by a few top principal components. The re-emergence of this geometrical behavior implies that further post-processing (Algorithm 1 (PPA)) could improve the embeddings further. Thus, in our algorithm we apply the post-processing algorithm on either side of a PCA dimensionality reduction of the word vectors when constructing lower-dimensional word embeddings.

Finally, from Figures 1 and 2, it is also evident that the extent to which the top principal components explain the data in the case of the reduced embeddings is not as great as in the case of the original 300-dimensional embeddings. Hence, multiple levels of post-processing at different levels of dimensionality will yield diminishing returns, as the influence of the common dominant directions on the word embeddings decreases. Details of the reduction technique are given in Algorithm 2.

Algorithm 2: Dimensionality Reduction Algorithm
  Data: Word Embedding Matrix X, New Dimension N, Threshold Parameter D
  Result: Word Embedding Matrix X of Reduced Dimension N
  /* Apply Algorithm 1 (PPA) */
  X = PPA(X, D)
  /* Transform X using PCA */
  X = PCA(X)
  /* Apply Algorithm 1 (PPA) */
  X = PPA(X, D)
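One way to realize Algorithms 1 and 2 in code is sketched below. This is a minimal illustration under our own assumptions, not the released implementation: X is taken to be a vocabulary-by-dimension numpy matrix, scikit-learn's PCA stands in for the PCA step, and the function and variable names (ppa, reduce_dimensions, d_threshold) are ours.

    import numpy as np
    from sklearn.decomposition import PCA

    def ppa(X, d_threshold=7):
        # Algorithm 1 (PPA): subtract the mean embedding, then remove the
        # projection of every vector onto the top-D principal components.
        X = np.asarray(X, dtype=float)
        X = X - X.mean(axis=0)
        components = PCA(n_components=X.shape[1]).fit(X).components_
        top = components[:d_threshold]      # the D dominant directions u_1 .. u_D
        return X - (X @ top.T) @ top        # v = v - sum_i (u_i . v) u_i for each row v

    def reduce_dimensions(X, n=150, d_threshold=7):
        # Algorithm 2: PPA -> PCA down to n dimensions -> PPA again.
        X = ppa(X, d_threshold)
        X = PCA(n_components=n).fit_transform(X)
        return ppa(X, d_threshold)

For example, reduce_dimensions(X, n=150, d_threshold=7) applied to the Glove-300D matrix would correspond to the 150-dimensional setting analysed in Figure 2.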
[…]lary)⁵. The next subsection also presents results using word2vec embeddings trained on the Google News dataset⁶.

3.1 Word Similarity Benchmarks

We use the standard word similarity benchmarks summarized in (Faruqui and Dyer, 2014) for evaluating the word vectors.

Dataset: The datasets (Faruqui and Dyer, 2014) contain word pairs (WP) that have been assigned similarity ratings by humans. When evaluating word vectors, the similarity between two words is calculated as the cosine similarity of their vector representations. Spearman's rank correlation coefficient (Rho × 100) between the ranks produced by the word vectors and the human rankings is then used for the evaluation. The reported metric in our experiments is Rho × 100; hence, a higher value of the evaluation metric indicates better word similarity.

Baselines: To evaluate the performance of our algorithm, we compare it against different schemes of combining the post-processing algorithm with PCA⁷, as the following baselines:

• PCA: Transform word vectors using PCA.
• P+PCA: Apply PPA (Algorithm 1) and then transform word vectors using PCA.
• PCA+P: Transform word vectors using PCA and then apply PPA.

These baselines can also be regarded as ablations of our algorithm and can shed light on whether our intuitions in developing the algorithm were correct. In the comparisons ahead, we represent our algorithm as Algo-N, where N is the reduced dimensionality of the word embeddings.
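To make the evaluation protocol above concrete, the sketch below is our own illustration (with hypothetical names such as word_vectors and evaluate_word_similarity, and our own handling of out-of-vocabulary pairs): it scores one set of embeddings on one benchmark by computing cosine similarity per word pair and then Spearman's rank correlation against the human ratings, reported as Rho × 100.

    import numpy as np
    from scipy.stats import spearmanr

    def evaluate_word_similarity(word_vectors, word_pairs):
        # word_vectors: dict mapping a word to its numpy vector.
        # word_pairs: list of (word1, word2, human_score) tuples from a
        # benchmark such as those summarized in (Faruqui and Dyer, 2014).
        model_scores, human_scores = [], []
        for w1, w2, human_score in word_pairs:
            if w1 not in word_vectors or w2 not in word_vectors:
                continue                  # skip out-of-vocabulary pairs
            v1, v2 = word_vectors[w1], word_vectors[w2]
            cosine = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
            model_scores.append(cosine)
            human_scores.append(human_score)
        rho, _ = spearmanr(model_scores, human_scores)
        return rho * 100                  # reported metric: Rho x 100

The same function can score the original vectors, the PCA, P+PCA and PCA+P baselines, and Algo-N, since these differ only in how the embedding matrix is transformed before evaluation.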