Explaining Word Embeddings via Disentangled Representation

Keng-Te Liao    Cheng-Syuan Lee    Zhong-Yu Huang    Shou-de Lin
National Taiwan University

Abstract

Disentangled representations have attracted increasing attention recently. However, how to transfer desired properties of disentanglement to word representations is unclear. In this work, we propose to transform typical dense word vectors into disentangled embeddings featuring improved interpretability via encoding polysemous semantics separately. We also found that the modular structure of our disentangled word embeddings helps generate more efficient and effective features for natural language processing tasks.

1 Introduction

Disentangled representations are known to represent interpretable factors in separated dimensions. This property can potentially help people understand or discover knowledge in the embeddings. In natural language processing (NLP), work on disentangled representations has shown notable impact on sentence- and document-level applications. For example, Larsson et al. (2017) and Melnyk et al. (2017) proposed to disentangle the sentiment and semantics of sentences; by manipulating sentiment factors, the machine can rewrite a sentence with a different sentiment. Brunner et al. (2018) also demonstrated sentence generation while focusing more on syntactic factors such as part-of-speech tags. For document-level applications, Jain et al. (2018) presented a learning algorithm which embeds biomedical abstracts while disentangling populations, interventions and outcomes. Regarding word-level disentanglement, Athiwaratkun and Wilson (2017) proposed mixtures of Gaussian models which can disentangle the meanings of polysemous words into two or three clusters. This has a connection with unsupervised sense representations (Camacho-Collados and Pilehvar, 2018), an active research topic in the community.

In this work, we focus on word-level disentanglement and introduce an idea of transforming dense word embeddings such as GloVe (Pennington et al., 2014) or word2vec (Mikolov et al., 2013b) into disentangled word embeddings (DWE). The main feature of our DWE is that it can be segmented into multiple sub-embeddings or sub-areas as illustrated in Figure 1. In the figure, each sub-area encodes information relevant to one specific topical factor such as Animal or Location. As an example, we found that the words most similar to "turkey" are "geese", "flock" and "goose" in the Animal area, while the similar words turn into "Greece", "Cyprus" and "Ankara" in the Location area (a toy version of such a query is sketched at the end of this section).

Figure 1: Disentangled embedding with factors Animal, Location and Unseen.

We also found that our DWE generally satisfies the Modularity and Compactness properties proposed by Higgins et al. (2018) and Ridgeway and Mozer (2018), which can serve as a definition of general-purpose disentangled representations. In addition, our DWE has the following advantages:

• Explaining Underlying Knowledge: The multiple senses of words can be extracted and separately encoded even though the learning algorithm of the original word embeddings (e.g. GloVe) does not perform disambiguation. As a result, the encoded semantics can be presented in an intuitive way for examination.

• Modular and Compact Features: Each sub-area of our DWE can itself serve as informative features. The advantage is that people are free to abandon features in sub-areas irrelevant to the given downstream task while still achieving competitive performance. In Section 4, we show that using the compact features is not only efficient but also helps improve performance on downstream tasks.

• Quality Preservation: In addition to higher interpretability, our DWE preserves the co-occurrence statistics information in the original word embeddings. We found this also helps preserve performance on downstream tasks including word similarity, word analogy, POS tagging, chunking, and named entity recognition.
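The sub-area query mentioned above (e.g. "turkey" as an animal versus a location) can be reproduced with a few lines of code. Below is a minimal sketch, assuming the DWE matrix is available as a NumPy array and the dimension indices of each sub-area are known; the variable names and the example words are illustrative only, not part of the released implementation.

```python
import numpy as np

def nearest_in_subarea(query, vocab, Z, dims, k=3):
    """Return the k nearest neighbours of `query`, comparing words only on
    the dimensions in `dims`, i.e. one attribute's sub-area of the DWE."""
    index = {w: i for i, w in enumerate(vocab)}
    sub = Z[:, dims]                                        # restrict to the sub-area
    sub = sub / (np.linalg.norm(sub, axis=1, keepdims=True) + 1e-8)
    sims = sub @ sub[index[query]]                          # cosine similarities
    sims[index[query]] = -np.inf                            # exclude the query itself
    return [vocab[i] for i in np.argsort(-sims)[:k]]

# Illustrative usage (animal_dims / location_dims would be the dimension
# indices learned for the Animal and Location attributes in Section 2.4):
#   nearest_in_subarea("turkey", vocab, Z, animal_dims)     # animal sense
#   nearest_in_subarea("turkey", vocab, Z, location_dims)   # country sense
```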

2 Obtaining Disentangled Word Representations

2.1 Problem Definition

Our goal is to transform N d-dimensional dense word vectors X ∈ R^{N×d} into disentangled embeddings Z ∈ R^{N×d} by leveraging a set of binary attributes A = {a_1, ..., a_M} labelled on words. Z is expected to have two properties. The first is preserving the word features encoded in X. More specifically, we require XX^T ≈ ZZ^T since, as pointed out by Levy and Goldberg (2014), typical dense word embeddings can be regarded as factorizing co-occurrence statistics matrices.

The second property is that Z can be decomposed into M+1 sub-embedding sets Z_{a_1}, ..., Z_{a_M} and Z_unseen, where each sub-embedding set encodes information only relevant to the corresponding attribute. For example, Z_{a_1} is expected to be relevant to a_1 and irrelevant to a_2, ..., a_M. Information in X not relevant to any attribute in A is then encoded in Z_unseen. An example of transforming X into Z with two attributes, Animal and Location, is illustrated in Figure 1.

For modelling the relevance between sub-embeddings and attributes, we use the mutual information I(Z_a, a) as learning objectives, where a is an arbitrary attribute in A.

2.2 Transformation with Quality Preservation

We obtain Z by transforming X with a matrix W ∈ R^{d×d}. That is, Z = XW. To ensure XX^T ≈ ZZ^T, an additional constraint WW^T = I is included: ZZ^T = (XW)(XW)^T = X(WW^T)X^T = XX^T if WW^T = I holds.

2.3 Optimizing I(Z_a, a)

Let z_{a,i} be the i-th row in Z_a. By derivation,

    I(Z_a, a) = Σ_{i=1}^{N} p(z_i) p(a|z_{a,i}) [log p(a|z_{a,i}) − log p(a)]
              ≈ (1/N) Σ_{i=1}^{N} p(a|z_{a,i}) [log p(a|z_{a,i}) − log p(a)]

We let log p(a) be constant and replace p(a|z) with a parametrized model q_θ(a|z). In our experiments, we found a logistic regression with parameter θ to be sufficient as q_θ(a|z). Intuitively, high I(Z_a, a) means that Z_a are informative features for a classifier to distinguish whether a word has attribute a.

When increasing I(Z_a, a) by optimizing q_θ(a|z), we found a strategy that helps generate higher-quality Z. The strategy is letting Z_a be features to reconstruct the original vectors of words having attribute a. For words with a, the approach becomes a semi-supervised learning architecture which attempts to predict labels and reconstruct inputs simultaneously.

The loss function L(W, θ, φ) for maximizing I(Z_a, a) is as follows:

    L(W, θ, φ) = (1/N) Σ_{i=1}^{N} [ −q_θ(a|z_{a,i}) + λ I_{a,i} ||x_i − φ(z_{a,i})||²_2 ],
    where I_{a,i} = 1 if the i-th word has attribute a, and 0 otherwise,        (1)

where φ is a single fully-connected layer, x_i is the original vector of the i-th word in X, and λ is a hyper-parameter. We set λ = 1/d in all experiments.
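As a concrete illustration of Sections 2.2 and 2.3, the sketch below implements the transformation Z = XW together with a per-attribute loss in PyTorch. This is a simplified reading of Equation 1, not the authors' released code: the classification term uses standard binary cross-entropy as a stand-in for −q_θ(a|z), the hard constraint WW^T = I is replaced by a soft penalty, and the loss is applied to the full Z rather than to the noise-selected sub-embedding introduced in Section 2.4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DWETransform(nn.Module):
    """Z = XW with WW^T ~= I, a logistic-regression head q_theta for one
    attribute, and a single fully-connected decoder phi (cf. Equation 1)."""

    def __init__(self, d):
        super().__init__()
        self.W = nn.Parameter(torch.eye(d))   # d x d transformation matrix
        self.q_theta = nn.Linear(d, 1)        # logistic regression on Z
        self.phi = nn.Linear(d, d)            # reconstruction decoder

    def forward(self, X):
        return X @ self.W                     # Z = XW

    def loss(self, X, y, lam, ortho_weight=1.0):
        """X: (N, d) word vectors; y: binary labels (1 = word has attribute a);
        lam: the paper's lambda (set to 1/d in their experiments)."""
        Z = self.forward(X)
        logits = self.q_theta(Z).squeeze(-1)
        cls = F.binary_cross_entropy_with_logits(logits, y.float())
        # Reconstruction term is active only where I_{a,i} = 1.
        rec = (y.float() * ((X - self.phi(Z)) ** 2).sum(dim=1)).mean()
        # Soft penalty standing in for the hard constraint WW^T = I.
        eye = torch.eye(self.W.shape[0], device=X.device)
        ortho = ((self.W @ self.W.t() - eye) ** 2).sum()
        return cls + lam * rec + ortho_weight * ortho
```

With multiple attributes, one such loss per attribute (sharing W) would be summed before backpropagation; the orthogonality weight here is an assumption rather than a value reported in the paper.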

2.4 Learning to Generate Sub-embedding Z_a

As discussed in Section 2.3, high I(Z_a, a) indicates that Z_a are informative features for classification, so we propose to regard sub-embedding generation as a feature selection problem. More specifically, we apply a sparsity constraint on Z. Ideally, when predicting a, a smaller number of dimensions of Z are selected as the informative features, which are regarded as Z_a.

In this work, we use Variational Dropout (Kingma et al., 2015; Molchanov et al., 2017) as the sparsity constraint. At each training iteration, a set of multiplicative noise ξ is sampled from a normal distribution N(1, α_a = p_a / (1 − p_a)) and injected into Z. That is, the prediction and reconstruction are done by q_θ(ξ ⊙ Z) and φ(ξ ⊙ Z). The parameter α_a ∈ R^d is jointly learned with W, θ, and φ. Afterwards, the d-dimensional dropout rates p_a = sigmoid(log α_a) can be obtained. For each attribute a in A, the dimensions with dropout rates lower than 50% are normally regarded as Z_a.

We would like to emphasize that the learned dropout rates are not binary values. Therefore, deciding the length of sub-embeddings can actually depend on users' preferences or task requirements. For example, users can obtain a more compact and pure Z_a by selecting dimensions with dropout rates lower than 10%, or get a more thorough yet less disentangled Z_a by setting the threshold to 70%.

To encourage disentanglement when handling multiple attributes, we include an additional loss function on the dropout rates. For a specific dimension, let the M-dimensional vector P collect 1 − p_a for all a in A. The idea is to minimize Π_{i=1}^{M} P_i under the constraint Σ_{i=1}^{M} P_i = 1. The optimal solution is that the dimension is relevant to only one attribute a′ with 1 − p_{a′} ≈ 1. In implementation, we minimize the following loss function:

    Σ_{i=1}^{M} log P_i + β ||Σ_{i=1}^{M} P_i − 1||²_2        (2)

We set β = 1 in the experiments, and Equations 1 and 2 are optimized jointly.

To generate Z_unseen, we initially select a set of dimensions and constrain their dropout rates to always be larger than 50%. The number of dimensions of Z_unseen is a hyper-parameter. After selection, we do not apply Equation 2 to the selected dimensions.
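The dimension-selection step and the disentanglement penalty of Equation 2 can be sketched as follows. This is an illustrative rendering under the assumption that one log-alpha vector per attribute is kept as a learnable tensor; the function names, the thresholds and the numerical clamping are assumptions, not details taken from the paper.

```python
import torch

def dropout_rates(log_alpha):
    """p_a = sigmoid(log alpha_a), one rate per embedding dimension."""
    return torch.sigmoid(log_alpha)

def select_subembedding(log_alpha, threshold=0.5):
    """Indices of dimensions whose dropout rate is below the threshold;
    these dimensions form Z_a for the attribute (Section 2.4)."""
    return (dropout_rates(log_alpha) < threshold).nonzero(as_tuple=True)[0]

def disentangle_penalty(log_alphas, beta=1.0, eps=1e-6):
    """Equation 2, summed over dimensions. `log_alphas` is an (M, d) tensor,
    one row per attribute; P[:, j] collects 1 - p_a for dimension j."""
    P = 1.0 - dropout_rates(log_alphas)                   # (M, d)
    log_term = torch.log(P.clamp_min(eps)).sum(dim=0)     # sum_i log P_i per dimension
    sum_term = (P.sum(dim=0) - 1.0) ** 2                  # ||sum_i P_i - 1||^2 per dimension
    return (log_term + beta * sum_term).sum()
```

Adding this penalty to the per-attribute losses of Equation 1 would correspond to the joint optimization described above.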
3 Evaluation

3.1 Word Embeddings and Attributes

We transform 300-dimensional GloVe (https://nlp.stanford.edu/projects/glove/) into DWE; the original embeddings are denoted by GloVe-300. For the word attributes A, we use the labels in WordStat (https://provalisresearch.com/products/content-analysis-software/). WordStat contains 45 kinds of attributes labeled on 70,651 words. Among them, we select 5 high-level and easily understandable attributes for our experiments: Artifact, Location, Animal, Adjective (ADJ) and Adverb (ADV). The number of words labelled with these 5 attributes is 13,337. After training, all pre-trained GloVe vectors are transformed by the learned matrix W (i.e. XW) for the downstream evaluations.

The number of learned dimensions for each attribute is illustrated in Figure 2, where the threshold of dropout rates for dimension selection is 50%.

Figure 2: Disentangled embedding with five attributes: Artifact, Location, Animal, Adjective and Adverb. The remaining dimensions are viewed as Unseen.

3.2 Evaluation of Quality Preservation

We first examine whether DWE preserves the features encoded in GloVe-300. The examination is done by intrinsic evaluations including the following tasks and datasets:

• Word Similarity: Marco, Elia and Nam (MEN) (Bruni et al., 2014) and SimLex-999 (Hill et al., 2015).

• Word Analogy: Bigger Analogy Test Set (BATS) (Gladkova et al., 2016) and Google Analogy (GA) (Mikolov et al., 2013a).

• POS tagging, Chunking and Named Entity Recognition (NER): CoNLL 2003 (Sang and De Meulder, 2003; Li et al., 2017).

• QVEC-CCA (Tsvetkov et al., 2015; https://github.com/ytsvetko/qvec): performance is measured by semantic and syntactic CCA.

            MEN     SimLex   BATS    GA
GloVe-300   0.749   0.369    18.83   63.58
DWE         0.764   0.390    18.75   62.30

Table 1: Word similarity and analogy performance.

            POS     Chunking   NER
GloVe-300   65.0    64.9       65.2
DWE         67.2    66.3       66.1

Table 2: POS tagging, chunking and NER performance.

            Semantic   Syntactic
GloVe-300   0.473      0.341
DWE         0.474      0.348

Table 3: QVEC-CCA evaluation.

As shown in Tables 1, 2 and 3, DWE preserves the performance of GloVe-300 on various NLP tasks. Probably owing to the additional information from word attributes, DWE even shows slightly better performance than GloVe-300 on seven of the tasks.
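For concreteness, the word-similarity part of this evaluation is typically computed as the Spearman correlation between human ratings and cosine similarities of embedding pairs. The sketch below shows that computation; the pair list and the vector lookup are placeholders rather than the actual MEN/SimLex loaders used for Table 1.

```python
import numpy as np
from scipy.stats import spearmanr

def word_similarity_score(pairs, vectors):
    """pairs: list of (word1, word2, human_score); vectors: dict word -> np.array.
    Returns the Spearman correlation between cosine similarity and human scores."""
    cos, gold = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:     # skip out-of-vocabulary pairs
            v1, v2 = vectors[w1], vectors[w2]
            cos.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            gold.append(score)
    return spearmanr(cos, gold).correlation

# Illustrative usage: the same routine is run once with the GloVe-300 vectors
# and once with the transformed DWE vectors (Z = XW) to compare the two rows.
```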

3.3 Attribute Classification

We design an attribute classification task for examining whether the DWE meets the requirements described in Section 2.3. We use logistic regression and take the sub-embeddings Z_a as input features, verifying the classification performance by cross-validation (a sketch of such a probe is given after Table 4). For each attribute, we randomly sample 400 data points for testing, with the numbers of positive and negative test examples balanced. Therefore, a random predictor would get around 50% accuracy in each classification task.

The binary classification accuracies are shown in Table 4. Take the second column of Table 4 as an example: for distinguishing whether a word can be a location, taking Z_location as features for training a classifier achieves the highest accuracy, 83.8%. On the other hand, the accuracies reported in the second row of Table 4 imply that Z_location are less informative features for the other attributes. Similar results can also be observed for the other attributes.

             Artifact   Location   Animal   ADJ    ADV
Z_artifact   77.8       71.0       68.0     65.5   71.2
Z_location   59.2       83.8       64.0     60.5   69.8
Z_animal     58.5       67.5       84.2     60.2   71.0
Z_adj        69.8       70.7       68.2     82.0   72.5
Z_adv        59.0       72.5       71.8     71.5   84.2
Z_unseen     54.8       70.0       66.5     60.2   68.8

Table 4: Attribute classification accuracies (%).
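A probe of this kind can be reproduced with scikit-learn. The sketch below approximates the protocol described above (logistic regression on one sub-area's dimensions, accuracy estimated by cross-validation); the exact balanced-sampling and split details behind Table 4 are not reproduced here, so treat them as assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def attribute_probe(Z, labels, dims, folds=5):
    """Accuracy of a logistic-regression probe that predicts a binary attribute
    (`labels`) from the dimensions `dims` of the disentangled embeddings Z."""
    X = Z[:, dims]
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, labels, cv=folds, scoring="accuracy").mean()

# Illustrative usage: probing every (sub-area, attribute) combination fills a
# grid like Table 4, e.g. attribute_probe(Z, is_location, location_dims).
```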

3.4 Disentangled Interpretability

We provide some examples to demonstrate that words having ambiguous or different aspects of semantics can be disentangled. Table 5 shows the results of nearby words. As can be seen, querying a word in Z_a with different attributes can help discover the ambiguous semantics implicitly encoded in the original word vectors X. The results also show that Z_unseen does capture meaningful information having little relevance to the given attributes.

Query    Vectors       Nearby Words
turkey   Z_animal      geese, flock, goose
turkey   Z_location    greece, cyprus, ankara
mouse    Z_animal      mice, rat, rats
mouse    Z_artifact    keyboard, joystick, buttons
japan    Z_location    korea, vietnam, singapore
japan    Z_unseen      japanese, yakuza, yen
apple    Z_artifact    macintosh, software, mac
apple    Z_unseen      mango, cherry, tomato

Table 5: Results of nearby words.

4 Application: Compact Features for Downstream Tasks

Here we demonstrate an application of the modularity and compactness properties of our DWE. We first aim to show that the sub-embeddings can directly serve as informative features and can outperform GloVe with the same number of dimensions; with the high interpretability, selecting relevant sub-embeddings can be intuitive. Secondly, we demonstrate that when fine-tuning word vectors for a given downstream task, our DWE allows focusing on updating the relevant sub-embedding instead of the whole embedding. The advantage is that it reduces the number of learnable parameters. It can also be regarded as an interpretable, dimension-wise regularization technique that reduces overfitting.

We take a sentiment analysis task, IMDB movie review classification (Maas et al., 2011), for the experiments. Intuitively, ADJ and ADV should be the most relevant attributes in A. We therefore select the 50 dimensions of Z ∈ R^300 with the lowest dropout rates in the ADJ and ADV sub-areas for comparison with 50-dimensional GloVe (GloVe-50, https://nlp.stanford.edu/projects/glove/). The embeddings with the selected dimensions are denoted by Z_adj+adv-50. When tuning our DWE with the classifier, we update the 52 dimensions (Z_adj ∈ R^23 and Z_adv ∈ R^29) of DWE and compare it with GloVe-300.

The document representation for classification is the average of the word embeddings, and the classifier is a logistic regression (a sketch of the untuned setting is given after Table 6). When tuning the input word embeddings, we update the embeddings with the gradient propagated from the classifier.

The results are listed in Table 6. From the table, we can see that Z_adj+adv-50 directly outperforms GloVe-50 without tuning. A possible explanation is that GloVe-50 is forced to encode information less relevant to sentiment, making it less effective than Z_adj+adv-50 in this task. In the fine-tuning experiments, DWE shows slightly higher accuracy than GloVe-300 by updating only 52 instead of 300 dimensions.

Feature         Without Tuning   After Tuning
GloVe-50        76.55            86.72
Z_adj+adv-50    79.78            87.60
GloVe-300       83.85            87.72
DWE             83.67            87.84

Table 6: Classification accuracies (%) on the IMDB dataset.
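The untuned part of this experiment reduces to averaging word vectors over the selected dimensions and training a linear classifier. Below is a minimal sketch under that reading; the tokenized IMDB documents, the vector lookup and the dimension indices (adj_adv_dims) are placeholders, and the fine-tuning variant that backpropagates into the selected dimensions is not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def doc_features(docs, vectors, dims):
    """Average the selected embedding dimensions over the tokens of each document.
    `docs` is a list of token lists; `vectors` maps word -> np.array; `dims` are
    the dimension indices of the chosen sub-areas (e.g. ADJ + ADV)."""
    feats = []
    for tokens in docs:
        vecs = [vectors[t][dims] for t in tokens if t in vectors]
        feats.append(np.mean(vecs, axis=0) if vecs else np.zeros(len(dims)))
    return np.vstack(feats)

# Illustrative usage for the "without tuning" rows of Table 6:
#   X_train = doc_features(train_docs, dwe_vectors, adj_adv_dims)
#   clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
#   acc = clf.score(doc_features(test_docs, dwe_vectors, adj_adv_dims), y_test)
```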

5 Conclusion

In this work, we propose a new definition and learning algorithm for obtaining disentangled word representations. The resulting disentangled word vectors show higher interpretability while preserving performance on various NLP tasks. We also see that the ambiguous semantics hidden in typical dense word embeddings can be extracted and separately encoded. Finally, we showed that the disentangled word vectors can help generate compact and effective features for NLP applications. In the future, we would like to investigate whether similar effects can be found in non-distributional or contextualized word embeddings.

References

Ben Athiwaratkun and Andrew Gordon Wilson. 2017. Multimodal word distributions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1645–1656.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

Gino Brunner, Yuyi Wang, Roger Wattenhofer, and Michael Weigelt. 2018. Natural language multitasking: Analyzing and improving syntactic saliency of hidden representations. CoRR, abs/1801.06024.

José Camacho-Collados and Mohammad Taher Pilehvar. 2018. From word to sense embeddings: A survey on vector representations of meaning. Journal of Artificial Intelligence Research, 63:743–788.

Anna Gladkova, Aleksandr Drozd, and Satoshi Matsuoka. 2016. Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn't. In Proceedings of the NAACL Student Research Workshop, pages 8–15.

Irina Higgins, David Amos, David Pfau, Sébastien Racanière, Loïc Matthey, Danilo J. Rezende, and Alexander Lerchner. 2018. Towards a definition of disentangled representations. CoRR, abs/1812.02230.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.

Sarthak Jain, Edward Banner, Jan-Willem van de Meent, Iain J. Marshall, and Byron C. Wallace. 2018. Learning disentangled representations of texts with application to biomedical abstracts. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4683–4693.

Durk P. Kingma, Tim Salimans, and Max Welling. 2015. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems 28, pages 2575–2583.

Maria Larsson, Amanda Nilsson, and Mikael Kågebäck. 2017. Disentangled representations for manipulation of sentiment in text. CoRR, abs/1712.10066.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185.

Bofang Li, Tao Liu, Zhe Zhao, Buzhou Tang, Aleksandr Drozd, Anna Rogers, and Xiaoyong Du. 2017. Investigating different syntactic context types and context representations for learning word embeddings. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2421–2431.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150.

Igor Melnyk, Cícero Nogueira dos Santos, Kahini Wadhawan, Inkit Padhi, and Abhishek Kumar. 2017. Improved neural text attribute transfer with non-parallel data. CoRR, abs/1711.09395.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.

Dmitry Molchanov, Arsenii Ashukha, and Dmitry P. Vetrov. 2017. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 2498–2507.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Karl Ridgeway and Michael C. Mozer. 2018. Learning deep disentangled embeddings with the F-statistic loss. In Advances in Neural Information Processing Systems 31, pages 185–194.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. CoRR, cs.CL/0306050.

Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Guillaume Lample, and Chris Dyer. 2015. Evaluation of word vector representations by subspace alignment. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).
