Explaining Word Embeddings via Disentangled Representation

Keng-Te Liao    Cheng-Syuan Lee    Zhong-Yu Huang    Shou-de Lin
National Taiwan University

Abstract

Disentangled representations have attracted increasing attention recently. However, how to transfer desired properties of disentanglement to word representations is unclear. In this work, we propose to transform typical dense word vectors into disentangled embeddings featuring improved interpretability via encoding polysemous semantics separately. We also found that the modular structure of our disentangled word embeddings helps generate more efficient and effective features for natural language processing tasks.

1 Introduction

Disentangled representations are known to represent interpretable factors in separated dimensions. This property can potentially help people understand or discover knowledge in the embeddings. In natural language processing (NLP), work on disentangled representations has shown notable impact on sentence- and document-level applications. For example, Larsson et al. (2017) and Melnyk et al. (2017) proposed to disentangle the sentiment and semantics of sentences; by manipulating sentiment factors, the machine can rewrite a sentence with a different sentiment. Brunner et al. (2018) also demonstrated sentence generation while focusing more on syntactic factors such as part-of-speech tags. For document-level applications, Jain et al. (2018) presented a learning algorithm which embeds biomedical abstracts while disentangling populations, interventions and outcomes. Regarding word-level disentanglement, Athiwaratkun and Wilson (2017) proposed mixtures of Gaussian models which can disentangle the meanings of polysemous words into two or three clusters. This has a connection with unsupervised sense representations (Camacho-Collados and Pilehvar, 2018), an active research topic in the community.

In this work, we focus on word-level disentanglement and introduce an idea of transforming dense word embeddings such as GloVe (Pennington et al., 2014) or word2vec (Mikolov et al., 2013b) into disentangled word embeddings (DWE). The main feature of our DWE is that it can be segmented into multiple sub-embeddings or sub-areas as illustrated in Figure 1. In the figure, each sub-area encodes information relevant to one specific topical factor such as Animal or Location. As an example, we found that the words most similar to "turkey" are "geese", "flock" and "goose" in the Animal area, while the similar words turn into "Greece", "Cyprus" and "Ankara" in the Location area (a toy version of such a query is sketched at the end of this section).

Figure 1: Disentangled embedding with factors Animal, Location and Unseen.

We also found that our DWE generally satisfies the Modularity and Compactness properties proposed by Higgins et al. (2018) and Ridgeway and Mozer (2018), which can serve as a definition of general-purpose disentangled representations. In addition, our DWE has the following advantages:

• Explaining Underlying Knowledge: The multiple senses of words can be extracted and separately encoded even though the learning algorithm of the original word embeddings (e.g. GloVe) does not perform disambiguation. As a result, the encoded semantics can be presented in an intuitive way for examination.

• Modular and Compact Features: Each sub-area of our DWE can itself serve as informative features. The advantage is that people are free to abandon features in sub-areas irrelevant to the given downstream task while still achieving competitive performance. In Section 4, we show that using the compact features is not only efficient but also helps improve performance on downstream tasks.

• Quality Preservation: In addition to higher interpretability, our DWE preserves the co-occurrence statistics information in the original word embeddings. We found this also helps preserve performance on downstream tasks including word similarity, word analogy, POS tagging, chunking, and named entity recognition.
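The sub-area query mentioned above (e.g. "turkey" as an animal versus a location) can be reproduced with a few lines of code. Below is a minimal sketch, assuming the DWE matrix is available as a NumPy array and the dimension indices of each sub-area are known; the variable names and the example words are illustrative only, not part of the released implementation.

```python
import numpy as np

def nearest_in_subarea(query, vocab, Z, dims, k=3):
    """Return the k nearest neighbours of `query`, comparing words only on
    the dimensions in `dims`, i.e. one attribute's sub-area of the DWE."""
    index = {w: i for i, w in enumerate(vocab)}
    sub = Z[:, dims]                                        # restrict to the sub-area
    sub = sub / (np.linalg.norm(sub, axis=1, keepdims=True) + 1e-8)
    sims = sub @ sub[index[query]]                          # cosine similarities
    sims[index[query]] = -np.inf                            # exclude the query itself
    return [vocab[i] for i in np.argsort(-sims)[:k]]

# Illustrative usage (animal_dims / location_dims would be the dimension
# indices learned for the Animal and Location attributes in Section 2.4):
#   nearest_in_subarea("turkey", vocab, Z, animal_dims)     # animal sense
#   nearest_in_subarea("turkey", vocab, Z, location_dims)   # country sense
```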

2 Obtaining Disentangled Word Representations

2.1 Problem Definition

Our goal is to transform N d-dimensional dense word vectors X ∈ R^{N×d} into disentangled embeddings Z ∈ R^{N×d} by leveraging a set of binary attributes A = {a_1, ..., a_M} labelled on words. Z is expected to have two properties. The first is preserving the word features encoded in X. More specifically, we require XX^T ≈ ZZ^T since, as pointed out by Levy and Goldberg (2014), typical dense word embeddings can be regarded as factorizing co-occurrence statistics matrices.

The second property is that Z can be decomposed into M+1 sub-embedding sets Z_{a_1}, ..., Z_{a_M} and Z_unseen, where each sub-embedding set encodes information only relevant to the corresponding attribute. For example, Z_{a_1} is expected to be relevant to a_1 and irrelevant to a_2, ..., a_M. Information in X not relevant to any attribute in A is then encoded in Z_unseen. An example of transforming X into Z with two attributes, Animal and Location, is illustrated in Figure 1.

For modelling the relevance between sub-embeddings and attributes, we use the mutual information I(Z_a, a) as learning objectives, where a is an arbitrary attribute in A.

2.2 Transformation with Quality Preservation

We obtain Z by transforming X with a matrix W ∈ R^{d×d}. That is, Z = XW. To ensure XX^T ≈ ZZ^T, an additional constraint WW^T = I is included: ZZ^T = (XW)(XW)^T = X(WW^T)X^T = XX^T if WW^T = I holds.

2.3 Optimizing I(Z_a, a)

Let z_{a,i} be the i-th row in Z_a. By derivation,

    I(Z_a, a) = Σ_{i=1}^{N} p(z_i) p(a|z_{a,i}) [log p(a|z_{a,i}) − log p(a)]
              ≈ (1/N) Σ_{i=1}^{N} p(a|z_{a,i}) [log p(a|z_{a,i}) − log p(a)]

We let log p(a) be constant and replace p(a|z) with a parametrized model q_θ(a|z). In our experiments, we found a logistic regression with parameter θ to be sufficient as q_θ(a|z). Intuitively, high I(Z_a, a) means that Z_a are informative features for a classifier to distinguish whether a word has attribute a.

When increasing I(Z_a, a) by optimizing q_θ(a|z), we found a strategy that helps generate higher-quality Z. The strategy is letting Z_a be features to reconstruct the original vectors of words having attribute a. For words with a, the approach becomes a semi-supervised learning architecture which attempts to predict labels and reconstruct inputs simultaneously.

The loss function L(W, θ, φ) for maximizing I(Z_a, a) is as follows:

    L(W, θ, φ) = (1/N) Σ_{i=1}^{N} [ −q_θ(a|z_{a,i}) + λ I_{a,i} ||x_i − φ(z_{a,i})||²_2 ],
    where I_{a,i} = 1 if the i-th word has attribute a, and 0 otherwise,        (1)

where φ is a single fully-connected layer, x_i is the original vector of the i-th word in X, and λ is a hyper-parameter. We set λ = 1/d in all experiments.
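As a concrete illustration of Sections 2.2 and 2.3, the sketch below implements the transformation Z = XW together with a per-attribute loss in PyTorch. This is a simplified reading of Equation 1, not the authors' released code: the classification term uses standard binary cross-entropy as a stand-in for −q_θ(a|z), the hard constraint WW^T = I is replaced by a soft penalty, and the loss is applied to the full Z rather than to the noise-selected sub-embedding introduced in Section 2.4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DWETransform(nn.Module):
    """Z = XW with WW^T ~= I, a logistic-regression head q_theta for one
    attribute, and a single fully-connected decoder phi (cf. Equation 1)."""

    def __init__(self, d):
        super().__init__()
        self.W = nn.Parameter(torch.eye(d))   # d x d transformation matrix
        self.q_theta = nn.Linear(d, 1)        # logistic regression on Z
        self.phi = nn.Linear(d, d)            # reconstruction decoder

    def forward(self, X):
        return X @ self.W                     # Z = XW

    def loss(self, X, y, lam, ortho_weight=1.0):
        """X: (N, d) word vectors; y: binary labels (1 = word has attribute a);
        lam: the paper's lambda (set to 1/d in their experiments)."""
        Z = self.forward(X)
        logits = self.q_theta(Z).squeeze(-1)
        cls = F.binary_cross_entropy_with_logits(logits, y.float())
        # Reconstruction term is active only where I_{a,i} = 1.
        rec = (y.float() * ((X - self.phi(Z)) ** 2).sum(dim=1)).mean()
        # Soft penalty standing in for the hard constraint WW^T = I.
        eye = torch.eye(self.W.shape[0], device=X.device)
        ortho = ((self.W @ self.W.t() - eye) ** 2).sum()
        return cls + lam * rec + ortho_weight * ortho
```

With multiple attributes, one such loss per attribute (sharing W) would be summed before backpropagation; the orthogonality weight here is an assumption rather than a value reported in the paper.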

2.4 Learning to Generate Sub-embedding Z_a

As discussed in Section 2.3, high I(Z_a, a) indicates that Z_a are informative features for classification, so we propose to regard sub-embedding generation as a feature selection problem. More specifically, we apply a sparsity constraint on Z. Ideally, when predicting a, a smaller number of dimensions of Z are selected as the informative features, which are regarded as Z_a.

In this work, we use Variational Dropout (Kingma et al., 2015; Molchanov et al., 2017) as the sparsity constraint. At each training iteration, a set of multiplicative noise ξ is sampled from a normal distribution N(1, α_a = p_a / (1 − p_a)) and injected into Z. That is, the prediction and reconstruction are done by q_θ(ξ ⊙ Z) and φ(ξ ⊙ Z). The parameter α_a ∈ R^d is jointly learned with W, θ, and φ. Afterwards, the d-dimensional dropout rates p_a = sigmoid(log α_a) can be obtained. For each attribute a in A, the dimensions with dropout rates lower than 50% are normally regarded as Z_a.

We would like to emphasize that the learned dropout rates are not binary values. Therefore, deciding the length of sub-embeddings can actually depend on users' preferences or task requirements. For example, users can obtain a more compact and pure Z_a by selecting dimensions with dropout rates lower than 10%, or get a more thorough yet less disentangled Z_a by setting the threshold to 70%.

To encourage disentanglement when handling multiple attributes, we include an additional loss function on the dropout rates. For a specific dimension, let the M-dimensional vector P collect 1 − p_a for all a in A. The idea is to minimize Π_{i=1}^{M} P_i under the constraint Σ_{i=1}^{M} P_i = 1. The optimal solution is that the dimension is relevant to only one attribute a′ with 1 − p_{a′} ≈ 1. In implementation, we minimize the following loss function:

    Σ_{i=1}^{M} log P_i + β ||Σ_{i=1}^{M} P_i − 1||²_2        (2)

We set β = 1 in the experiments, and Equations 1 and 2 are optimized jointly.

To generate Z_unseen, we initially select a set of dimensions and constrain their dropout rates to always be larger than 50%. The number of dimensions of Z_unseen is a hyper-parameter. After selection, we do not apply Equation 2 to the selected dimensions.
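The dimension-selection step and the disentanglement penalty of Equation 2 can be sketched as follows. This is an illustrative rendering under the assumption that one log-alpha vector per attribute is kept as a learnable tensor; the function names, the thresholds and the numerical clamping are assumptions, not details taken from the paper.

```python
import torch

def dropout_rates(log_alpha):
    """p_a = sigmoid(log alpha_a), one rate per embedding dimension."""
    return torch.sigmoid(log_alpha)

def select_subembedding(log_alpha, threshold=0.5):
    """Indices of dimensions whose dropout rate is below the threshold;
    these dimensions form Z_a for the attribute (Section 2.4)."""
    return (dropout_rates(log_alpha) < threshold).nonzero(as_tuple=True)[0]

def disentangle_penalty(log_alphas, beta=1.0, eps=1e-6):
    """Equation 2, summed over dimensions. `log_alphas` is an (M, d) tensor,
    one row per attribute; P[:, j] collects 1 - p_a for dimension j."""
    P = 1.0 - dropout_rates(log_alphas)                   # (M, d)
    log_term = torch.log(P.clamp_min(eps)).sum(dim=0)     # sum_i log P_i per dimension
    sum_term = (P.sum(dim=0) - 1.0) ** 2                  # ||sum_i P_i - 1||^2 per dimension
    return (log_term + beta * sum_term).sum()
```

Adding this penalty to the per-attribute losses of Equation 1 would correspond to the joint optimization described above.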
3 Evaluation

3.1 Word Embeddings and Attributes

We transform 300-dimensional GloVe (https://nlp.stanford.edu/projects/glove/) into DWE; the original embeddings are denoted by GloVe-300. For the word attributes A, we use the labels in WordStat (https://provalisresearch.com/products/content-analysis-software/). WordStat contains 45 kinds of attributes labeled on 70,651 words. Among them, we select 5 high-level and easily understandable attributes for our experiments: Artifact, Location, Animal, Adjective (ADJ) and Adverb (ADV). The number of words labelled with these 5 attributes is 13,337. After training, all pre-trained GloVe vectors are transformed by the learned matrix W (i.e. XW) for the downstream evaluations.

The number of learned dimensions for each attribute is illustrated in Figure 2, where the threshold of dropout rates for dimension selection is 50%.

Figure 2: Disentangled embedding with five attributes: Artifact, Location, Animal, Adjective and Adverb. The remaining dimensions are viewed as Unseen.

3.2 Evaluation of Quality Preservation

We first examine whether DWE preserves the features encoded in GloVe-300. The examination is done by intrinsic evaluations including the following tasks and datasets:

• Word Similarity: Marco, Elia and Nam (MEN) (Bruni et al., 2014) and SimLex-999 (Hill et al., 2015).

• Word Analogy: Bigger Analogy Test Set (BATS) (Gladkova et al., 2016) and Google Analogy (GA) (Mikolov et al., 2013a).

• POS tagging, Chunking and Named Entity Recognition (NER): CoNLL 2003 (Sang and De Meulder, 2003; Li et al., 2017).

• QVEC-CCA (Tsvetkov et al., 2015; https://github.com/ytsvetko/qvec): performance is measured by semantic and syntactic CCA.

            MEN     SimLex   BATS    GA
GloVe-300   0.749   0.369    18.83   63.58
DWE         0.764   0.390    18.75   62.30

Table 1: Word similarity and analogy performance.

            POS     Chunking   NER
GloVe-300   65.0    64.9       65.2
DWE         67.2    66.3       66.1

Table 2: POS tagging, chunking and NER performance.

            Semantic   Syntactic
GloVe-300   0.473      0.341
DWE         0.474      0.348

Table 3: QVEC-CCA evaluation.

As shown in Tables 1, 2 and 3, DWE preserves the performance of GloVe-300 on various NLP tasks. Probably owing to the additional information from word attributes, DWE even shows slightly better performance than GloVe-300 on seven of the tasks.
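For concreteness, the word-similarity part of this evaluation is typically computed as the Spearman correlation between human ratings and cosine similarities of embedding pairs. The sketch below shows that computation; the pair list and the vector lookup are placeholders rather than the actual MEN/SimLex loaders used for Table 1.

```python
import numpy as np
from scipy.stats import spearmanr

def word_similarity_score(pairs, vectors):
    """pairs: list of (word1, word2, human_score); vectors: dict word -> np.array.
    Returns the Spearman correlation between cosine similarity and human scores."""
    cos, gold = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:     # skip out-of-vocabulary pairs
            v1, v2 = vectors[w1], vectors[w2]
            cos.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            gold.append(score)
    return spearmanr(cos, gold).correlation

# Illustrative usage: the same routine is run once with the GloVe-300 vectors
# and once with the transformed DWE vectors (Z = XW) to compare the two rows.
```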

3.3 Attribute Classification

We design an attribute classification task for examining whether the DWE meets the requirements described in Section 2.3. We use logistic regression and take the sub-embeddings Z_a as input features, verifying the classification performance by cross-validation (a sketch of such a probe is given after Table 4). For each attribute, we randomly sample 400 data points for testing, with the numbers of positive and negative test examples balanced. Therefore, a random predictor would get around 50% accuracy in each classification task.

The binary classification accuracies are shown in Table 4. Take the second column of Table 4 as an example: for distinguishing whether a word can be a location, taking Z_location as features for training a classifier achieves the highest accuracy, 83.8%. On the other hand, the accuracies reported in the second row of Table 4 imply that Z_location are less informative features for the other attributes. Similar results can also be observed for the other attributes.

             Artifact   Location   Animal   ADJ    ADV
Z_artifact   77.8       71.0       68.0     65.5   71.2
Z_location   59.2       83.8       64.0     60.5   69.8
Z_animal     58.5       67.5       84.2     60.2   71.0
Z_adj        69.8       70.7       68.2     82.0   72.5
Z_adv        59.0       72.5       71.8     71.5   84.2
Z_unseen     54.8       70.0       66.5     60.2   68.8

Table 4: Attribute classification accuracies (%).
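A probe of this kind can be reproduced with scikit-learn. The sketch below approximates the protocol described above (logistic regression on one sub-area's dimensions, accuracy estimated by cross-validation); the exact balanced-sampling and split details behind Table 4 are not reproduced here, so treat them as assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def attribute_probe(Z, labels, dims, folds=5):
    """Accuracy of a logistic-regression probe that predicts a binary attribute
    (`labels`) from the dimensions `dims` of the disentangled embeddings Z."""
    X = Z[:, dims]
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, labels, cv=folds, scoring="accuracy").mean()

# Illustrative usage: probing every (sub-area, attribute) combination fills a
# grid like Table 4, e.g. attribute_probe(Z, is_location, location_dims).
```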

3.4 Disentangled Interpretability

We provide some examples to demonstrate that words having ambiguous or different aspects of semantics can be disentangled. Table 5 shows the results of nearby words. As can be seen, querying a word in Z_a with different attributes can help discover the ambiguous semantics implicitly encoded in the original word vectors X. The results also show that Z_unseen does capture meaningful information having little relevance to the given attributes.

Query    Vectors       Nearby Words
turkey   Z_animal      geese, flock, goose
turkey   Z_location    greece, cyprus, ankara
mouse    Z_animal      mice, rat, rats
mouse    Z_artifact    keyboard, joystick, buttons
japan    Z_location    korea, vietnam, singapore
japan    Z_unseen      japanese, yakuza, yen
apple    Z_artifact    macintosh, software, mac
apple    Z_unseen      mango, cherry, tomato

Table 5: Results of nearby words.

4 Application: Compact Features for Downstream Tasks

Here we demonstrate an application of the modularity and compactness properties of our DWE. We first aim to show that the sub-embeddings can directly serve as informative features and can outperform GloVe with the same number of dimensions; with the high interpretability, selecting relevant sub-embeddings can be intuitive. Secondly, we demonstrate that when fine-tuning word vectors for a given downstream task, our DWE allows focusing on updating the relevant sub-embedding instead of the whole embedding. The advantage is that it reduces the number of learnable parameters. It can also be regarded as an interpretable, dimension-wise regularization technique that reduces overfitting.

We take a sentiment analysis task, IMDB movie review classification (Maas et al., 2011), for the experiments. Intuitively, ADJ and ADV should be the most relevant attributes in A. We therefore select the 50 dimensions of Z ∈ R^300 with the lowest dropout rates in the ADJ and ADV sub-areas for comparison with 50-dimensional GloVe (GloVe-50, https://nlp.stanford.edu/projects/glove/). The embeddings with the selected dimensions are denoted by Z_adj+adv-50. When tuning our DWE with the classifier, we update the 52 dimensions (Z_adj ∈ R^23 and Z_adv ∈ R^29) of DWE and compare it with GloVe-300.

The document representation for classification is the average of the word embeddings, and the classifier is a logistic regression (a sketch of the untuned setting is given after Table 6). When tuning the input word embeddings, we update the embeddings with the gradient propagated from the classifier.

The results are listed in Table 6. From the table, we can see that Z_adj+adv-50 directly outperforms GloVe-50 without tuning. A possible explanation is that GloVe-50 is forced to encode information less relevant to sentiment, making it less effective than Z_adj+adv-50 in this task. In the fine-tuning experiments, DWE shows slightly higher accuracy than GloVe-300 by updating only 52 instead of 300 dimensions.

Feature         Without Tuning   After Tuning
GloVe-50        76.55            86.72
Z_adj+adv-50    79.78            87.60
GloVe-300       83.85            87.72
DWE             83.67            87.84

Table 6: Classification accuracies (%) on the IMDB dataset.
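The untuned part of this experiment reduces to averaging word vectors over the selected dimensions and training a linear classifier. Below is a minimal sketch under that reading; the tokenized IMDB documents, the vector lookup and the dimension indices (adj_adv_dims) are placeholders, and the fine-tuning variant that backpropagates into the selected dimensions is not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def doc_features(docs, vectors, dims):
    """Average the selected embedding dimensions over the tokens of each document.
    `docs` is a list of token lists; `vectors` maps word -> np.array; `dims` are
    the dimension indices of the chosen sub-areas (e.g. ADJ + ADV)."""
    feats = []
    for tokens in docs:
        vecs = [vectors[t][dims] for t in tokens if t in vectors]
        feats.append(np.mean(vecs, axis=0) if vecs else np.zeros(len(dims)))
    return np.vstack(feats)

# Illustrative usage for the "without tuning" rows of Table 6:
#   X_train = doc_features(train_docs, dwe_vectors, adj_adv_dims)
#   clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
#   acc = clf.score(doc_features(test_docs, dwe_vectors, adj_adv_dims), y_test)
```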

5 Conclusion

In this work, we propose a new definition and learning algorithm for obtaining disentangled word representations. The resulting disentangled word vectors show higher interpretability while preserving performance on various NLP tasks. We also see that the ambiguous semantics hidden in typical dense word embeddings can be extracted and separately encoded. Finally, we showed that the disentangled word vectors can help generate compact and effective features for NLP applications. In the future, we would like to investigate whether similar effects can be found in non-distributional or contextualized word embeddings.

References

Ben Athiwaratkun and Andrew Gordon Wilson. 2017. Multimodal word distributions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1645–1656.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

Gino Brunner, Yuyi Wang, Roger Wattenhofer, and Michael Weigelt. 2018. Natural language multitasking: Analyzing and improving syntactic saliency of hidden representations. CoRR, abs/1801.06024.

José Camacho-Collados and Mohammad Taher Pilehvar. 2018. From word to sense embeddings: A survey on vector representations of meaning. Journal of Artificial Intelligence Research, 63:743–788.

Anna Gladkova, Aleksandr Drozd, and Satoshi Matsuoka. 2016. Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn't. In Proceedings of the NAACL Student Research Workshop, pages 8–15.

Irina Higgins, David Amos, David Pfau, Sébastien Racanière, Loïc Matthey, Danilo J. Rezende, and Alexander Lerchner. 2018. Towards a definition of disentangled representations. CoRR, abs/1812.02230.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.

Sarthak Jain, Edward Banner, Jan-Willem van de Meent, Iain J. Marshall, and Byron C. Wallace. 2018. Learning disentangled representations of texts with application to biomedical abstracts. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4683–4693.

Durk P. Kingma, Tim Salimans, and Max Welling. 2015. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems 28, pages 2575–2583.

Maria Larsson, Amanda Nilsson, and Mikael Kågebäck. 2017. Disentangled representations for manipulation of sentiment in text. CoRR, abs/1712.10066.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185.

Bofang Li, Tao Liu, Zhe Zhao, Buzhou Tang, Aleksandr Drozd, Anna Rogers, and Xiaoyong Du. 2017. Investigating different syntactic context types and context representations for learning word embeddings. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2421–2431.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150.

Igor Melnyk, Cícero Nogueira dos Santos, Kahini Wadhawan, Inkit Padhi, and Abhishek Kumar. 2017. Improved neural text attribute transfer with non-parallel data. CoRR, abs/1711.09395.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.

Dmitry Molchanov, Arsenii Ashukha, and Dmitry P. Vetrov. 2017. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 2498–2507.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Karl Ridgeway and Michael C. Mozer. 2018. Learning deep disentangled embeddings with the F-statistic loss. In Advances in Neural Information Processing Systems 31, pages 185–194.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. CoRR, cs.CL/0306050.

Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Guillaume Lample, and Chris Dyer. 2015. Evaluation of word vector representations by subspace alignment. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).
