A Compositional and Interpretable Semantic Space
Alona Fyshe,1 Leila Wehbe,1 Partha Talukdar,2 Brian Murphy,3 and Tom Mitchell1
1 Machine Learning Department, Carnegie Mellon University, Pittsburgh, USA
2 Indian Institute of Science, Bangalore, India
3 Queen's University Belfast, Belfast, Northern Ireland
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

Vector Space Models (VSMs) of Semantics are useful tools for exploring the semantics of single words, and the composition of words to make phrasal meaning. While many methods can estimate the meaning (i.e. vector) of a phrase, few do so in an interpretable way. We introduce a new method (CNNSE) that allows word and phrase vectors to adapt to the notion of composition. Our method learns a VSM that is both tailored to support a chosen semantic composition operation, and whose resulting features have an intuitive interpretation. Interpretability allows for the exploration of phrasal semantics, which we leverage to analyze performance on a behavioral task.

1 Introduction

Vector Space Models (VSMs) are models of word semantics typically built with word usage statistics derived from corpora. VSMs have been shown to closely match human judgements of semantics (for an overview see Sahlgren (2006), Chapter 5), and can be used to study semantic composition (Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010; Socher et al., 2012; Turney, 2012).

Composition has been explored with different types of composition functions (Mitchell and Lapata, 2010; Mikolov et al., 2013; Dinu et al., 2013), including higher order functions such as matrices (Baroni and Zamparelli, 2010), and some have considered which corpus-derived information is most useful for semantic composition (Turney, 2012; Fyshe et al., 2013). Still, many VSMs act like a black box: it is unclear what VSM dimensions represent (save for broad classes of corpus statistic types) and what the application of a composition function to those dimensions entails. Neural network (NN) models are becoming increasingly popular (Socher et al., 2012; Hashimoto et al., 2014; Mikolov et al., 2013; Pennington et al., 2014), and some model introspection has been attempted: Levy and Goldberg (2014) examined connections between layers, and Mikolov et al. (2013) and Pennington et al. (2014) explored how shifts in VSM space encode semantic relationships. Still, interpreting NN VSM dimensions, or factors, remains elusive.

This paper introduces a new method, Compositional Non-negative Sparse Embedding (CNNSE). In contrast to many other VSMs, our method learns an interpretable VSM that is tailored to suit the semantic composition function. Such interpretability allows for deeper exploration of semantic composition than previously possible. We will begin with an overview of the CNNSE algorithm, and follow with empirical results which show that CNNSE produces:

1. more interpretable dimensions than the typical VSM, and
2. composed representations that outperform previous methods on a phrase similarity task.

Compared to methods that do not consider composition when learning embeddings, CNNSE produces:

1. better approximations of phrasal semantics, and
2. phrasal representations with dimensions that more closely match phrase meaning.
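As a concrete illustration of the kind of composition functions surveyed above (and not of the CNNSE method itself, which is developed in Section 2), the sketch below composes two word vectors with the additive and element-wise multiplicative functions of Mitchell and Lapata (2010) and compares each composed vector to an observed phrase vector. The vectors, their dimensionality, and the "observed" phrase vector are all invented for the example.

```python
import numpy as np

# Hypothetical 5-dimensional corpus-derived vectors (real VSMs use hundreds
# of dimensions estimated from word usage statistics).
military = np.array([0.9, 0.1, 0.0, 0.4, 0.2])
aid      = np.array([0.1, 0.8, 0.3, 0.0, 0.2])
observed_phrase = np.array([0.5, 0.5, 0.1, 0.2, 0.2])   # "military aid", invented

def cosine(u, v):
    """Cosine similarity, the usual way phrase vectors are compared."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Two simple composition functions from Mitchell and Lapata (2010):
additive       = military + aid    # vector addition
multiplicative = military * aid    # element-wise product

print("additive vs observed:      ", cosine(additive, observed_phrase))
print("multiplicative vs observed:", cosine(multiplicative, observed_phrase))
```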
2 Method

Typically, word usage statistics used to create a VSM form a sparse matrix with many columns, too unwieldy to be practical. Thus, most models use some form of dimensionality reduction to compress the full matrix. For example, Latent Semantic Analysis (LSA) (Deerwester et al., 1990) uses Singular Value Decomposition (SVD) to create a compact VSM. SVD often produces matrices where, for the vast majority of the dimensions, it is difficult to interpret what a high or low score entails for the semantics of a given word. In addition, the SVD factorization does not take into account the phrasal relationships between the input words.

2.1 Non-negative Sparse Embeddings

Our method is inspired by Non-negative Sparse Embeddings (NNSEs) (Murphy et al., 2012). NNSE promotes interpretability by including sparsity and non-negativity constraints in a matrix factorization algorithm. The result is a VSM with extremely coherent dimensions, as quantified by a behavioral task (Murphy et al., 2012). The output of NNSE is a matrix with rows corresponding to words and columns corresponding to latent dimensions.

To interpret a particular latent dimension, we can examine the words with the highest numerical values in that dimension (i.e. identify rows with the highest values for a particular column). Though the representations in Table 1 were created with our new method, CNNSE, we will use them to illustrate the interpretability of both NNSE and CNNSE, as the form of the learned representations is similar. One of the dimensions in Table 1 has top scoring words guidance, advice and assistance, words related to help and support. We will refer to these word list summaries as the dimension's interpretable summarization. To interpret the meaning of a particular word, we can select its highest scoring dimensions (i.e. choose columns with maximum values for a particular row). For example, the interpretable summarizations for the top scoring dimensions of the word military include both positions in the military (e.g. commandos) and military groups (e.g. paramilitary). More examples are in the Supplementary Material (http://www.cs.cmu.edu/~fmri/papers/naacl2015/).

Table 1: CNNSE interpretable summarizations for the top 3 dimensions of an adjective, noun and adjective-noun phrase.

  military                                      | aid                              | military aid (observed)
  --------------------------------------------- | -------------------------------- | ---------------------------------------------
  servicemen, commandos, military intelligence  | guidance, advice, assistance     | servicemen, commandos, military intelligence
  guerrilla, paramilitary, anti-terrorist       | mentoring, tutoring, internships | guidance, advice, assistance
  conglomerate, giants, conglomerates           | award, awards, honors            | compliments, congratulations, replies

NNSE is an algorithm which seeks a lower dimensional representation for w words using the c-dimensional corpus statistics in a matrix X ∈ R^{w×c}. The solution is two matrices: A ∈ R^{w×ℓ}, which is sparse, non-negative, and represents word semantics in an ℓ-dimensional latent space, and D ∈ R^{ℓ×c}, the encoding of corpus statistics in the latent space. NNSE minimizes the following objective:

  argmin_{A,D}  Σ_{i=1}^{w}  (1/2) ||X_{i,:} − A_{i,:} × D||² + λ₁ ||A_{i,:}||₁        (1)
  st:  D_{i,:} D_{i,:}^T ≤ 1,  ∀ 1 ≤ i ≤ ℓ                                             (2)
       A_{i,j} ≥ 0,  1 ≤ i ≤ w,  1 ≤ j ≤ ℓ                                             (3)

where A_{i,j} indicates the entry at the ith row and jth column of matrix A, and A_{i,:} indicates the ith row of the matrix. The L1 constraint encourages sparsity in A; λ₁ is a hyperparameter. Equation 2 constrains D to eliminate solutions where the elements of A are made arbitrarily small by making the norm of D arbitrarily large. Equation 3 ensures that A is non-negative. Together, A and D factor the original corpus statistics matrix X to minimize reconstruction error. One may tune ℓ and λ₁ to vary the sparsity of the final solution.

Murphy et al. (2012) solved this system of constraints using the Online Dictionary Learning algorithm described in Mairal et al. (2010). Though Equations 1-3 represent a non-convex system, when solving for A with D fixed (and vice versa) the loss function is convex. Mairal et al. break the problem into two alternating optimization steps (solving for A and D) and find the system converges to a stationary solution. The solution for A is found with a LARS implementation for lasso regression (Efron et al., 2004); D is found via gradient descent. Though the final solution may not be globally optimal, this method is capable of handling large amounts of data and has been shown to produce useful solutions in practice (Mairal et al., 2010; Murphy et al., 2012).
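As a rough sketch of the alternating optimization described above (not the authors' implementation), the code below factors a small random corpus-statistics matrix X into non-negative, sparse codes A and a dictionary D: each row of A is solved as a non-negative lasso problem, and D is updated by a gradient step followed by projection onto the unit ball of Equation 2. It then reads interpretable summarizations off the learned A, mirroring the procedure in the text. The toy dimensions, step size, iteration count, placeholder vocabulary, and the use of scikit-learn's coordinate-descent Lasso (in place of the LARS solver of Efron et al. (2004)) are all choices made for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Toy corpus-statistics matrix: w words, c corpus features, ell latent dimensions.
w, c, ell = 50, 40, 10
X = np.abs(rng.standard_normal((w, c)))

lam1 = 0.1                                   # L1 hyperparameter (lambda_1 in Equation 1)
A = np.abs(rng.standard_normal((w, ell)))    # sparse, non-negative word codes
D = rng.standard_normal((ell, c))
D /= np.maximum(np.linalg.norm(D, axis=1, keepdims=True), 1.0)  # Equation 2

for it in range(20):
    # A-step: with D fixed, each row of A solves a non-negative lasso problem.
    # sklearn's Lasso scales the squared error by 1/(2c), hence alpha = lam1 / c.
    lasso = Lasso(alpha=lam1 / c, positive=True, fit_intercept=False, max_iter=5000)
    for i in range(w):
        lasso.fit(D.T, X[i])                 # X[i] ~ A[i] @ D, so regress X[i] on D.T
        A[i] = lasso.coef_

    # D-step: with A fixed, take a gradient step on the squared reconstruction
    # error, then project each row of D back onto the unit ball (Equation 2).
    grad = -A.T @ (X - A @ D)
    D -= 0.01 * grad
    D /= np.maximum(np.linalg.norm(D, axis=1, keepdims=True), 1.0)

# Interpretable summarization: top-scoring words per dimension, and
# top-scoring dimensions per word, read directly off the learned A.
vocab = [f"word{i}" for i in range(w)]       # placeholder vocabulary

def summarize_dimension(A, k, top=3):
    """Words with the highest values in column k of A."""
    return [vocab[i] for i in np.argsort(A[:, k])[::-1][:top]]

def summarize_word(A, i, top=3):
    """Dimensions with the highest values in row i of A."""
    return np.argsort(A[i])[::-1][:top]

print("dimension 0:", summarize_dimension(A, 0))
print("word 0, top dimensions:", summarize_word(A, 0))
```

Reading the top entries of a column gives word lists like those in Table 1; reading the top entries of a row gives the dimensions that summarize a word, as described above for military.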
2.2 Compositional NNSE

We add an additional constraint to the NNSE loss function that allows us to learn a latent representation that respects the notion of semantic composition. As we will see, this change to the loss function has a huge effect on the learned latent space. Just as the L1 regularizer can have a large impact on sparsity, our composition constraint represents a considerable change in composition compatibility.

Consider a phrase p made up of words i and j. In the most general setting, the following composition constraint could be applied to the rows of matrix A corresponding to p, i and j:

  A_{(p,:)} = f(A_{(i,:)}, A_{(j,:)})        (4)

where f is some composition function. The composition function constrains the space of learned latent representations A ∈ R^{w×ℓ} to be those solutions that are compatible with the composition function defined by f. Incorporating f into Equation 1, we add the constraint of Equation 4, for every phrase p = (i, j), to the objective of Equations 1-3, and take f to be weighted addition: f(A_{(i,:)}, A_{(j,:)}) = α A_{(i,:)} + β A_{(j,:)}.

This choice of f requires that we simultaneously optimize for A, D, α and β. However, α and β are simply constant scaling factors for the vectors in A corresponding to adjectives and nouns. For adjective-noun composition, the optimization of α and β can be absorbed by the optimization of A. For models that include noun-noun composition, if α and β are assumed to be absorbed by the optimization of A, this is equivalent to setting α = β.

We can further simplify the loss function by constructing a matrix B that imposes the composition by addition constraint. B is constructed so that for each phrase p = (i, j): B_{(p,p)} = 1, B_{(p,i)} = −α, and B_{(p,j)} = −β. For our models, we use α = β = 0.5, which serves to average the single word representations.
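To make the composition-by-addition constraint concrete, the sketch below builds B for a toy vocabulary containing two phrases and checks that the phrase rows of B A vanish exactly when A_{(p,:)} = α A_{(i,:)} + β A_{(j,:)} holds, which is how B encodes Equation 4 under weighted addition. The vocabulary, row indices, and latent width are invented for illustration, and how B enters the full training objective is not shown here.

```python
import numpy as np

# Toy setup: rows 0-3 are single words, rows 4-5 are phrases.
rows = ["military", "european", "aid", "union", "military aid", "european union"]
phrases = {4: (0, 2), 5: (1, 3)}   # phrase row -> (word-i row, word-j row)
w, ell = len(rows), 8
alpha = beta = 0.5                 # averaging the single word representations

# Build B so that, for each phrase p = (i, j):
#   B[p, p] = 1, B[p, i] = -alpha, B[p, j] = -beta
# Rows of B for single words are left at zero, so they are unconstrained.
B = np.zeros((w, w))
for p, (i, j) in phrases.items():
    B[p, p] = 1.0
    B[p, i] = -alpha
    B[p, j] = -beta

# An A that satisfies composition by addition: each phrase row is the
# average of its word rows, so every row of B @ A is (numerically) zero.
rng = np.random.default_rng(1)
A = np.abs(rng.standard_normal((w, ell)))
for p, (i, j) in phrases.items():
    A[p] = alpha * A[i] + beta * A[j]

print(np.allclose(B @ A, 0.0))     # True: B A = 0 expresses the constraint
```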