A Compositional and Interpretable Semantic Space

Alona Fyshe,1 Leila Wehbe,1 Partha Talukdar,2 Brian Murphy,3 and Tom Mitchell1

1 Machine Learning Department, Carnegie Mellon University, Pittsburgh, USA
2 Indian Institute of Science, Bangalore, India
3 Queen's University Belfast, Belfast, Northern Ireland

[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

Vector Space Models (VSMs) of Semantics are useful tools for exploring the semantics of single words, and the composition of words to make phrasal meaning. While many methods can estimate the meaning (i.e. vector) of a phrase, few do so in an interpretable way. We introduce a new method (CNNSE) that allows word and phrase vectors to adapt to the notion of composition. Our method learns a VSM that is both tailored to support a chosen semantic composition operation, and whose resulting features have an intuitive interpretation. Interpretability allows for the exploration of phrasal semantics, which we leverage to analyze performance on a behavioral task.

1 Introduction

Vector Space Models (VSMs) are models of word semantics typically built with word usage statistics derived from corpora. VSMs have been shown to closely match human judgements of semantics (for an overview see Sahlgren (2006), Chapter 5), and can be used to study semantic composition (Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010; Socher et al., 2012; Turney, 2012).

Composition has been explored with different types of composition functions (Mitchell and Lapata, 2010; Mikolov et al., 2013; Dinu et al., 2013), including higher order functions such as matrices (Baroni and Zamparelli, 2010), and some work has considered which corpus-derived information is most useful for semantic composition (Turney, 2012; Fyshe et al., 2013). Still, many VSMs act like a black box: it is unclear what VSM dimensions represent (save for broad classes of corpus statistic types) and what the application of a composition function to those dimensions entails. Neural network (NN) models are becoming increasingly popular (Socher et al., 2012; Hashimoto et al., 2014; Mikolov et al., 2013; Pennington et al., 2014), and some model introspection has been attempted: Levy and Goldberg (2014) examined connections between layers, and Mikolov et al. (2013) and Pennington et al. (2014) explored how shifts in VSM space encode semantic relationships. Still, interpreting NN VSM dimensions, or factors, remains elusive.

This paper introduces a new method, Compositional Non-negative Sparse Embedding (CNNSE). In contrast to many other VSMs, our method learns an interpretable VSM that is tailored to suit the semantic composition function. Such interpretability allows for deeper exploration of semantic composition than previously possible. We will begin with an overview of the CNNSE algorithm, and follow with empirical results which show that CNNSE produces:

1. more interpretable dimensions than the typical VSM, and
2. composed representations that outperform previous methods on a phrase similarity task.

Compared to methods that do not consider composition when learning embeddings, CNNSE produces:

1. better approximations of phrasal semantics, and
2. phrasal representations with dimensions that more closely match phrase meaning.


2 Method

Typically, the word usage statistics used to create a VSM form a sparse matrix with many columns, too unwieldy to be practical. Thus, most models use some form of dimensionality reduction to compress the full matrix. For example, Latent Semantic Analysis (LSA) (Deerwester et al., 1990) uses Singular Value Decomposition (SVD) to create a compact VSM. SVD often produces matrices where, for the vast majority of the dimensions, it is difficult to interpret what a high or low score entails for the semantics of a given word. In addition, the SVD factorization does not take into account the phrasal relationships between the input words.

2.1 Non-negative Sparse Embeddings

Our method is inspired by Non-negative Sparse Embeddings (NNSEs) (Murphy et al., 2012). NNSE promotes interpretability by incorporating sparsity and non-negativity constraints into a matrix factorization algorithm. The result is a VSM with extremely coherent dimensions, as quantified by a behavioral task (Murphy et al., 2012). The output of NNSE is a matrix with rows corresponding to words and columns corresponding to latent dimensions.

To interpret a particular latent dimension, we can examine the words with the highest numerical values in that dimension (i.e. identify rows with the highest values for a particular column). Though the representations in Table 1 were created with our new method, CNNSE, we will use them to illustrate the interpretability of both NNSE and CNNSE, as the form of the learned representations is similar. One of the dimensions in Table 1 has top scoring words guidance, advice and assistance, all words related to help and support. We will refer to these word list summaries as the dimension's interpretable summarization. To interpret the meaning of a particular word, we can select its highest scoring dimensions (i.e. choose columns with maximum values for a particular row). For example, the interpretable summarizations for the top scoring dimensions of the word military include both positions in the military (e.g. commandos) and military groups (e.g. paramilitary). More examples appear in the Supplementary Material (http://www.cs.cmu.edu/~fmri/papers/naacl2015/).

NNSE is an algorithm which seeks a lower-dimensional representation for w words using the c-dimensional corpus statistics in a matrix X ∈ R^{w×c}. The solution is two matrices: A ∈ R^{w×ℓ}, which is sparse, non-negative, and represents word semantics in an ℓ-dimensional latent space, and D ∈ R^{ℓ×c}, the encoding of corpus statistics in the latent space. NNSE minimizes the following objective:

\[
\operatorname*{argmin}_{A,D}\; \sum_{i=1}^{w} \frac{1}{2}\lVert X_{i,:} - A_{i,:} \times D \rVert^{2} + \lambda_1 \lVert A_{i,:} \rVert_{1} \tag{1}
\]
\[
\text{s.t.}\quad D_{i,:} D_{i,:}^{T} \leq 1,\; \forall\, 1 \leq i \leq \ell \tag{2}
\]
\[
\phantom{\text{s.t.}\quad} A_{i,j} \geq 0,\; 1 \leq i \leq w,\; 1 \leq j \leq \ell \tag{3}
\]

where A_{i,j} indicates the entry at the ith row and jth column of matrix A, and A_{i,:} indicates the ith row of the matrix. The L1 constraint encourages sparsity in A; λ1 is a hyperparameter. Equation 2 constrains D to eliminate solutions where the elements of A are made arbitrarily small by making the norm of D arbitrarily large. Equation 3 ensures that A is non-negative. Together, A and D factor the original corpus statistics matrix X so as to minimize reconstruction error. One may tune ℓ and λ1 to vary the sparsity of the final solution.

Murphy et al. (2012) solved this system of constraints using the Online Dictionary Learning algorithm described in Mairal et al. (2010). Though Equations 1-3 represent a non-convex system, when solving for A with D fixed (and vice versa) the loss function is convex. Mairal et al. break the problem into two alternating optimization steps (solving for A and D) and find that the system converges to a stationary solution. The solution for A is found with a LARS implementation for lasso regression (Efron et al., 2004); D is found via gradient descent. Though the final solution may not be globally optimal, this method is capable of handling large amounts of data and has been shown to produce useful solutions in practice (Mairal et al., 2010; Murphy et al., 2012).

Table 1: CNNSE interpretable summarizations for the top 3 dimensions of an adjective, noun and adjective-noun phrase.
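For concreteness, here is a minimal NumPy sketch of one round of this alternating scheme. It substitutes simple proximal/projected gradient updates for the LARS and online dictionary-learning solvers actually used by Murphy et al. (2012); the matrix sizes, learning rate, and function name are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def nnse_step(X, A, D, lam1=0.05, lr=1e-3):
    """One alternating update for min 0.5*||X - AD||^2 + lam1*||A||_1
    s.t. A >= 0 and each row of D has norm <= 1 (Eqs. 1-3).
    A proximal-gradient stand-in for the LARS / dictionary-learning
    solvers used in the actual NNSE implementation."""
    # A-step (D fixed): gradient step, then soft-threshold + nonnegativity
    grad_A = (A @ D - X) @ D.T
    A = A - lr * grad_A
    A = np.maximum(A - lr * lam1, 0.0)   # L1 prox merged with A >= 0
    # D-step (A fixed): gradient step, then project rows into the unit ball
    grad_D = A.T @ (A @ D - X)
    D = D - lr * grad_D
    norms = np.maximum(np.linalg.norm(D, axis=1, keepdims=True), 1.0)
    D = D / norms                        # enforce D_i D_i^T <= 1 (Eq. 2)
    return A, D

# toy usage: w=6 words, c=10 corpus features, l=4 latent dimensions
rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(6, 10)))
A = np.abs(rng.normal(size=(6, 4)))
D = rng.normal(size=(4, 10))
for _ in range(200):
    A, D = nnse_step(X, A, D)
```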

military | aid | military aid (observed)
servicemen, commandos, military intelligence | guidance, advice, assistance | servicemen, commandos, military intelligence
guerrilla, paramilitary, anti-terrorist | mentoring, tutoring, internships | guidance, advice, assistance
conglomerate, giants, conglomerates | award, awards, honors | compliments, congratulations, replies
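Computing an interpretable summarization like those in Table 1 is a simple ranking operation. A short sketch, under assumed names (A as the learned word-by-dimension matrix, vocab as the parallel word list):

```python
import numpy as np

def summarize_dimension(A, vocab, dim, k=3):
    """Top-k scoring words for one latent dimension (a column of A)."""
    top = np.argsort(A[:, dim])[::-1][:k]
    return [vocab[i] for i in top]

def summarize_word(A, vocab, word_index, k=3):
    """Interpretable summarizations of a word's top-k scoring dimensions."""
    dims = np.argsort(A[word_index, :])[::-1][:k]
    return [summarize_dimension(A, vocab, d) for d in dims]
```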

2.2 Compositional NNSE

We add an additional constraint to the NNSE loss function that allows us to learn a latent representation that respects the notion of semantic composition. As we will see, this change to the loss function has a huge effect on the learned latent space. Just as the L1 regularizer can have a large impact on sparsity, our composition constraint represents a considerable change in composition compatibility.

Consider a phrase p made up of words i and j. In the most general setting, the following composition constraint could be applied to the rows of matrix A corresponding to p, i and j:

\[
A_{(p,:)} = f(A_{(i,:)}, A_{(j,:)}) \tag{4}
\]

where f is some composition function. The composition function constrains the space of learned latent representations A ∈ R^{w×ℓ} to be those solutions that are compatible with the composition function defined by f. Incorporating f into Equation 1 we have:

\[
\operatorname*{argmin}_{A,D,\Omega}\; \sum_{i=1}^{w} \frac{1}{2}\lVert X_{i,:} - A_{i,:} \times D \rVert^{2} + \lambda_1 \lVert A_{i,:} \rVert_{1} + \frac{\lambda_c}{2} \sum_{\text{phrase } p,\ p=(i,j)} \lVert A_{(p,:)} - f(A_{(i,:)}, A_{(j,:)}) \rVert^{2} \tag{5}
\]

where each phrase p is comprised of words (i, j), and Ω represents all parameters of f to be optimized. We have added a squared loss term for composition, and a new regularization parameter λc to weight the importance of respecting composition. We call this new formulation Compositional Non-Negative Sparse Embeddings (CNNSE). Some examples of the interpretable representations learned by CNNSE for adjectives, nouns and phrases appear in Table 1.

There are many choices for f: addition, multiplication, dilation, etc. (Mitchell and Lapata, 2010). Here we choose f to be weighted addition because it has been shown to work well for adjective-noun composition (Mitchell and Lapata, 2010; Dinu et al., 2013; Hashimoto et al., 2014), and because it lends itself well to optimization. Weighted addition is:

\[
f(A_{(i,:)}, A_{(j,:)}) = \alpha A_{(i,:)} + \beta A_{(j,:)} \tag{6}
\]

This choice of f requires that we simultaneously optimize for A, D, α and β. However, α and β are simply constant scaling factors for the vectors in A corresponding to adjectives and nouns. For adjective-noun composition, the optimization of α and β can be absorbed by the optimization of A. For models that include noun-noun composition, if α and β are assumed to be absorbed by the optimization of A, this is equivalent to setting α = β.

We can further simplify the loss function by constructing a matrix B that imposes the composition-by-addition constraint. B is constructed so that for each phrase p = (i, j): B_{(p,p)} = 1, B_{(p,i)} = −α, and B_{(p,j)} = −β. For our models, we use α = β = 0.5, which serves to average the single word representations. The matrix B allows us to reformulate the loss function from Equation 5:

\[
\operatorname*{argmin}_{A,D}\; \frac{1}{2}\lVert X - AD \rVert_{F}^{2} + \lambda_1 \lVert A \rVert_{1} + \frac{\lambda_c}{2}\lVert BA \rVert_{F}^{2} \tag{7}
\]

where F indicates the Frobenius norm. B acts as a selector matrix, subtracting from the latent representation of the phrase the average latent representation of the phrase's constituent words.
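As a sketch of how such a B might be assembled and the objective of Equation 7 evaluated (the phrase_rows index convention, function names, and hyperparameter defaults are assumptions, not the authors' released code):

```python
import numpy as np
from scipy import sparse

def build_B(n_rows, phrase_rows, alpha=0.5, beta=0.5):
    """Selector matrix B for the composition penalty in Eq. 7.
    phrase_rows holds (p, i, j) row indices: row p of A is a phrase,
    rows i and j are its adjective and noun. Row p of B then encodes
    A[p,:] - alpha*A[i,:] - beta*A[j,:]; all other rows are zero."""
    rows, cols, vals = [], [], []
    for p, i, j in phrase_rows:
        rows += [p, p, p]
        cols += [p, i, j]
        vals += [1.0, -alpha, -beta]
    return sparse.csr_matrix((vals, (rows, cols)), shape=(n_rows, n_rows))

def cnnse_objective(X, A, D, B, lam1=0.05, lamc=0.5):
    """Value of Eq. 7: reconstruction error + sparsity + composition."""
    recon = 0.5 * np.linalg.norm(X - A @ D, "fro") ** 2
    sparsity = lam1 * np.abs(A).sum()
    composition = 0.5 * lamc * np.linalg.norm(B @ A, "fro") ** 2
    return recon + sparsity + composition
```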
We now have a loss function that is the sum of several convex functions of A: squared reconstruction loss for A, L1 regularization, and the composition constraint. This sum of sub-functions is the format required for the alternating direction method of multipliers (ADMM) (Boyd, 2010). ADMM substitutes a dummy variable z for A in the sub-functions:

\[
\operatorname*{argmin}_{A,D}\; \frac{1}{2}\lVert X - AD \rVert_{F}^{2} + \lambda_1 \lVert z_1 \rVert_{1} + \frac{\lambda_c}{2}\lVert B z_c \rVert_{F}^{2} \tag{8}
\]

and, in addition to the constraints in Equations 2 and 3, incorporates the constraints A = z1 and A = zc to ensure the dummy variables match A. ADMM uses an augmented Lagrangian to incorporate and relax these new constraints. We optimize for A, z1 and zc separately, update the dual variables, and repeat until convergence (see Supplementary Material for the Lagrangian form, solutions and updates). We modified code for ADMM that is available online at http://www.stanford.edu/~boyd/papers/admm/. ADMM is used when solving for A in the Online Dictionary Learning algorithm; solving for D remains unchanged from the NNSE implementation (see Algorithms 1 and 2 in the Supplementary Material).

We use the weighted addition composition function because it performed well for adjective-noun composition in previous work (Mitchell and Lapata, 2010; Dinu et al., 2013; Hashimoto et al., 2014), maintains the convexity of the loss function, and is easy to optimize. In contrast, an element-wise multiplication, dilation or higher-order matrix composition function would lead to a non-convex optimization problem which cannot be solved using ADMM. Though not explored here, we hypothesize that A could be molded to respect many different composition functions. However, if the chosen composition function does not maintain convexity, finding a suitable solution for A may prove challenging. We also hypothesize that even if the chosen composition function is not the "true" composition function (whatever that may be), the fact that A can change to suit the composition function may compensate for this mismatch. This has the flavor of variational inference for Bayesian methods: an approximation in place of an intractable problem often yields better results with limited data, in less time.

3 Data and Experiments

We use the semantic vectors made available by Fyshe et al. (2013), which were compiled from a 16 billion word subset of ClueWeb09 (Callan and Hoy, 2009). We used the 1000 dependency SVD dimensions, which were shown to perform well for composition tasks. Dependency features are tuples consisting of two POS-tagged words and their dependency relationship in a sentence; the feature value is the pointwise positive mutual information (PPMI) for the tuple. The dataset is comprised of 54,454 words and phrases. We randomly split the approximately 14,000 adjective-noun phrases into a train (2/3) and test (1/3) set. From the test set we removed 200 randomly selected phrases as a development set for parameter tuning. We did not lexically split the train and test sets, so many words appearing in training phrases also appear in test phrases. For this reason we cannot make specific claims about the generalizability of our methods to unseen words.

NNSE has one parameter to tune (λ1); CNNSE has two: λ1 and λc. In general, these methods are not overly sensitive to parameter tuning, and searching over orders of magnitude will suffice. We found the optimal settings for NNSE were λ1 = 0.05, and for CNNSE λ1 = 0.05, λc = 0.5. Too large a λ1 leads to overly sparse solutions; too small reduces interpretability. We set ℓ = 1000 for both NNSE and CNNSE and altered sparsity by tuning only λ1.

3.1 Phrase Vector Estimation

To test the ability of each model to estimate phrase semantics, we trained models on the training set, and used the learned model and the composition function to estimate vectors of held-out phrases. We sort the vectors for the test phrases, X_test, by their cosine distance to the predicted phrase vector X̂_{(p,:)}.

We report two measures of accuracy. The first is median rank accuracy. Rank accuracy is 100 × (1 − r/P), where r is the position of the correct phrase in the sorted list of test phrases, and P = |X_test| (the number of test phrases). The second measure is mean reciprocal rank (MRR), which is often used to evaluate information retrieval tasks (Kantor and Voorhees, 2000). MRR is

\[
100 \times \frac{1}{P}\sum_{i=1}^{P}\frac{1}{r_i} \tag{9}
\]

For both rank accuracy and MRR, a perfect score is 100. However, MRR places more emphasis on ranking items close to the top of the list, and less on differences in ranking lower in the list. For example, if the correct phrase is always ranked 2, 50 or 100 out of a list of 4,600, median rank accuracy would be 99.95, 98.91 or 97.83. In contrast, MRR would be 50, 2 or 1. Note that rank accuracy and reciprocal rank produce identical orderings of methods. That is, whatever method performs best in terms of rank accuracy will also perform best in terms of reciprocal rank. MRR simply allows us to discriminate between very accurate models. As we will see, the rank accuracy of all models is very high (> 99%), approaching the rank accuracy ceiling.

3.1.1 Estimation Methods

We will compare to two other previously studied composition methods: weighted addition (w.addSVD), and lexfunc (Baroni and Zamparelli, 2010). Weighted addition finds α, β to optimize

\[
\bigl( X_{(p,:)} - (\alpha X_{(i,:)} + \beta X_{(j,:)}) \bigr)^{2}
\]

Note that this optimization is performed over the SVD matrix X, rather than on A. To estimate X for a new phrase p = (i, j) we compute

\[
\hat{X}_{(p,:)} = \alpha X_{(i,:)} + \beta X_{(j,:)}
\]

Lexfunc finds an adjective-specific matrix M_i that solves

\[
X_{(p,:)} = M_i X_{(j,:)}
\]

for all phrases p = (i, j) for adjective i. We solved each adjective-specific problem with Matlab's partial least squares implementation, which uses the SIMPLS algorithm (Dejong, 1993). To estimate X for a new phrase p = (i, j) we compute

\[
\hat{X}_{(p,:)} = M_i X_{(j,:)}
\]

We also optimized the weighted addition composition function over NNSE vectors, which we call w.addNNSE. After optimizing α and β using the training set, we compose the latent word vectors to estimate the held-out phrase:

\[
\hat{A}_{(p,:)} = \alpha A_{(i,:)} + \beta A_{(j,:)}
\]

For CNNSE, as in the loss function, α = β = 0.5, so that the average of the word vectors approximates the phrase:

\[
\hat{A}_{(p,:)} = 0.5 \times (A_{(i,:)} + A_{(j,:)})
\]

Crucially, w.addNNSE estimates α, β after learning the latent space A, whereas CNNSE simultaneously learns the latent space A while taking the composition function into account. Once we have an estimate Â_{(p,:)}, we can use the NNSE and CNNSE solutions for D to estimate the corpus statistics X:

\[
\hat{X}_{(p,:)} = \hat{A}_{(p,:)} D
\]

Results for the four methods appear in Table 2.

Table 2: Median rank accuracy, mean reciprocal rank (MRR) and percentage of test phrases ranked perfectly (i.e. first in a sorted list of approx. 4,600 test phrases) for four methods of estimating the test phrase vectors. w.addSVD is weighted addition of SVD vectors; w.addNNSE is weighted addition of NNSE vectors.

Model | Med. Rank | MRR | Perfect
w.addSVD | 99.89 | 35.26 | 20%
w.addNNSE | 99.80 | 28.17 | 16%
Lexfunc | 99.65 | 28.96 | 20%
CNNSE | 99.91 | 40.65 | 26%

Median rank accuracies were all within half a percentage point of each other. However, MRR shows a striking difference in performance. CNNSE has an MRR of 40.64, more than 5 points higher than the second highest MRR score, belonging to w.addSVD (35.26). CNNSE ranks the correct phrase in the first position for 26% of phrases, compared to 20% for w.addSVD. Lexfunc ranks the correct phrase first for 20% of the test phrases, w.addNNSE for 16%. So, while all models perform quite well in terms of rank accuracy, when we use the more discriminative MRR, CNNSE is the clear winner. Note that the performance of w.addNNSE is much lower than CNNSE. Incorporating a composition constraint into the learning algorithm has produced a latent space that surpasses all methods tested for this task.

We were surprised to find that lexfunc performed relatively poorly in our experiments. Dinu et al. (2013) used simple unregularized regression to estimate M. We also replicated that formulation, and found phrase ranking to be worse when compared to the Partial Least Squares method described in Baroni and Zamparelli (2010). In addition, Baroni and Zamparelli use 300 SVD dimensions to estimate M. We found that, for our dataset, using all 1000 dimensions performed slightly better. We hypothesize that our difference in performance could be due to the difference in input corpus statistics (in particular the thresholding of infrequent words and phrases), or due to the fact that we did not specifically create the training and test sets to evenly distribute the phrases for each adjective. If an adjective i appears only in phrases in the test set, lexfunc cannot estimate M_i using training data (a hindrance not present for other methods, which require only that the adjective appear in the training data). To compensate for this possibly unfair train/test split, the results in Table 2 are calculated over only those adjectives which could be estimated using the training set.

Though the results reported here are not as high as previously reported, lexfunc was found to be only slightly better than w.addSVD for adjective-noun composition (Dinu et al., 2013). CNNSE outperforms w.addSVD by a large margin, so even if lexfunc could be tuned to perform at previous levels on this dataset, CNNSE would likely still dominate.
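The evaluation protocol above reduces to a short loop: rank the true test-phrase vectors by cosine distance to each estimate, then accumulate rank accuracy and MRR (Eq. 9). A sketch with illustrative names, assuming row p of X_hat is the estimate for the true phrase in row p of X_test:

```python
import numpy as np

def cosine_distances(M, v):
    """Cosine distance from each row of M to the vector v."""
    Mn = M / np.linalg.norm(M, axis=1, keepdims=True)
    return 1.0 - Mn @ (v / np.linalg.norm(v))

def rank_metrics(X_test, X_hat):
    """Median rank accuracy, MRR (Eq. 9), and fraction ranked first."""
    ranks = []
    for p in range(X_hat.shape[0]):
        d = cosine_distances(X_test, X_hat[p])
        ranks.append(1 + np.sum(d < d[p]))   # 1-based rank of true phrase
    r = np.asarray(ranks, dtype=float)
    P = len(r)
    med_rank_acc = np.median(100.0 * (1.0 - r / P))
    mrr = 100.0 * np.mean(1.0 / r)
    perfect = np.mean(r == 1)
    return med_rank_acc, mrr, perfect

# composing a CNNSE estimate for phrase p = (i, j), then decoding:
# A_hat = 0.5 * (A[i] + A[j]);  x_hat = A_hat @ D
```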
3.1.2 Phrase Estimation Errors

None of the models explored here are perfect. Even the top scoring model, CNNSE, only identifies the correct phrase for 26% of the test phrases. When a model makes a "mistake", it is possible that the top-ranked phrase is a synonym of, or closely related to, the actual phrase. To evaluate mistakes, we chose test phrases for which all 4 models are incorrect and produce a different top-ranked phrase (likely these are the most difficult phrases to estimate). We then asked Mechanical Turk (Mturk, http://mturk.com) users to evaluate the mistakes. We presented the 4 mistakenly top-ranked phrases to Mturk users, who were asked to choose the one phrase most related to the actual test phrase. We randomly selected 200 such phrases and asked 5 Mturk users to evaluate each, paying $0.01 per answer. We report here the results for questions where a majority (3) of users chose the same answer (82% of questions). For all Mturk experiments described in this paper, a screen shot of the question appears in the Supplementary Material.

Table 3: A comparison of mistakes in phrase ranking across 4 composition methods. To evaluate mistakes, we chose phrases for which all 4 models rank a different (incorrect) phrase first. Mturk users were asked to identify the phrase that was semantically closest to the target phrase.

Model | Predicted phrase deemed closest match to actual phrase
w.addSVD | 21.3%
w.addNNSE | 11.6%
Lexfunc | 31.7%
CNNSE | 35.4%

Table 3 shows the Mturk evaluation of model mistakes. CNNSE and lexfunc make the most reasonable mistakes, having their top-ranked phrase chosen as the most related phrase 35.4% and 31.7% of the time, respectively. This makes us slightly more comfortable with our phrase estimation results (Table 2); though lexfunc does not reliably predict the correct phrase, it often chooses a close approximation. The mistakes from CNNSE are chosen slightly more often than lexfunc, indicating that CNNSE also has the ability to reliably predict the correct phrase, or a phrase deemed more related than those chosen by other methods.

3.2 Interpretability

Though our improvement in MRR for phrase vector estimation is compelling, we seek to explore the meaning encoded in the word space features. We turn now to the interpretation of phrasal semantics and semantic composition.

3.2.1 Interpretability of Latent Dimensions

Due to the sparsity and non-negativity constraints, NNSE produces dimensions with very coherent semantic groupings (Murphy et al., 2012). Murphy et al. used an intruder task to quantify the interpretability of semantic dimensions. The intruder task presents a human user with a list of words, and they are to choose the one word that does not belong in the list (Chang et al., 2009). For example, from the list (red, green, desk, pink, purple, blue), it is clear to see that the word "desk" does not belong in the list of colors.

To create questions for the intruder task, we selected the top 5 scoring words in a particular dimension, as well as a low scoring word from that same dimension such that the low scoring word is also in the top 10th percentile of another dimension. Like the word "desk" in the example above, this low scoring word is called the intruder, and the human subject's task is to select the intruder from a shuffled list of 6 words. Five Mturk users answered each question, each paid $0.01 per answer. If Mturk users identify a high percentage of intruders, this indicates that the latent representation groups words in a human-interpretable way. We chose 100 questions for each of the NNSE, CNNSE and SVD representations. Because the output of lexfunc is the SVD representation X, SVD interpretability is a proxy for lexfunc interpretability.
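A sketch of the intruder construction just described, with sampling and tie-breaking details left as assumptions: five top words from a dimension, plus a word that scores low there but sits in the top 10th percentile of some other dimension.

```python
import numpy as np

def make_intruder_question(A, vocab, dim, rng):
    """One intruder question for latent dimension `dim`: the 5 top
    scoring words plus a word that scores low on `dim` but falls in
    the top 10th percentile of some other dimension."""
    order = np.argsort(A[:, dim])[::-1]
    top5 = [vocab[i] for i in order[:5]]
    cutoffs = np.percentile(A, 90, axis=0)   # per-dimension 90th percentile
    for w in rng.permutation(order[len(order) // 2:]):   # low scorers on dim
        if any(A[w, d] >= cutoffs[d] for d in range(A.shape[1]) if d != dim):
            words = top5 + [vocab[w]]
            shuffled = [words[k] for k in rng.permutation(len(words))]
            return shuffled, vocab[w]        # question list, correct answer
    return None

# usage: rng = np.random.default_rng(0)
#        question, intruder = make_intruder_question(A, vocab, 3, rng)
```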
Table 4: Quantifying the interpretability of learned semantic representations via the intruder task. Intruders detected: % of questions for which the majority response was the intruder. Mturk agreement: the % of questions for which a majority of users chose the same response.

Method | Intruders Detected | Mturk Agreement
SVD | 17.6% | 74%
NNSE | 86.2% | 94%
CNNSE | 88.9% | 90%

Results for the intruder task appear in Table 4. Consistent with previous studies, NNSE provides a much more interpretable latent representation than SVD. We find that the additional composition constraint used in CNNSE has maintained the interpretability of the learned latent space. Because intruders detected is higher for CNNSE, but agreement amongst Mturk users is higher for NNSE, we consider the interpretability results for the two methods to be equivalent. Note that SVD interpretability is close to chance (1/6 = 16.7%).

3.2.2 Coherence of Phrase Representations

The dimensions of NNSE and CNNSE are comparably interpretable. But has the composition constraint in CNNSE resulted in better phrasal representations? To test this, we randomly selected 200 phrases, and then identified the top scoring dimension for each phrase in both the NNSE and CNNSE models. We presented Mturk users with the interpretable summarizations for these top scoring dimensions. Users were asked to select the list of words (interpretable summarization) most closely related to the target phrase. Mturk users could also select that neither list was related, or that the lists were equally related to the target phrase. We paid $0.01 per answer and had 5 users answer each question. In Table 5 we report results for phrases where the majority of users selected the same answer (78% of questions). CNNSE phrasal representations are found to be much more consistent, receiving a positive evaluation almost twice as often as NNSE.

Table 5: Comparing the coherence of phrase representations from CNNSE and NNSE. Mturk users were shown the interpretable summarization for the top scoring dimension of target phrases. Representations from CNNSE and NNSE were shown side by side and users were asked to choose the list (summarization) most related to the phrase, or that the lists were equally good or bad.

Model | Model representation deemed most consistent with phrase
CNNSE | 54.5%
NNSE | 29.5%
Both | 4.5%
Neither | 11.5%

Together, these results show that CNNSE representations maintain the interpretability of NNSE dimensions, while improving the coherence of phrase representations.

3.3 Evaluation on Behavioral Data

We now compare the performance of various composition methods on an adjective-noun phrase similarity dataset (Mitchell and Lapata, 2010). This dataset is comprised of 108 adjective-noun phrase pairs split into high, medium and low similarity groups. Similarity scores from 18 human subjects are averaged to create one similarity score per phrase pair. We then compute the cosine similarity between the composed phrasal representations of each phrase pair under each compositional model. As in Mitchell and Lapata (2010), we report the correlation of the cosine similarity measures to the behavioral scores. We withheld 12 of the 108 questions for parameter tuning, four randomly selected from each of the high, medium and low similarity groups.

Table 6: Correlation of phrase similarity judgements (Mitchell and Lapata, 2010) to pairwise distances in several adjective-noun composition models.

Model | Correlation to behavioral data
w.addSVD | 0.5377
w.addNNSE | 0.4469
Lexfunc | 0.1347
CNNSE | 0.5923

Table 6 shows the correlation of each model's similarity scores to behavioral similarity scores. Again, lexfunc performs poorly. This is probably attributable to the fact that there are, on average, only 39 phrases available for training each adjective in the dataset, whereas the original lexfunc study had at least 50 per adjective (Baroni and Zamparelli, 2010). CNNSE is the top performer, followed closely by weighted addition. Interestingly, weighted NNSE correlation is lower than CNNSE by nearly 0.15, which shows the value of allowing the learned latent space to conform to the desired composition function.
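A sketch of this evaluation, composing each phrase with the CNNSE averaging rule and correlating cosine similarities with the averaged human judgements. The paper does not state which correlation variant is used here, so Spearman's ρ (as in Mitchell and Lapata's own evaluation) is assumed; the pair-index convention is also an assumption.

```python
import numpy as np
from scipy.stats import spearmanr

def behavioral_correlation(A, pairs, human_scores):
    """pairs: ((i1, j1), (i2, j2)) row indices of the adjective and noun
    of the two phrases in each pair; human_scores: one averaged
    similarity judgement per pair (Mitchell and Lapata, 2010)."""
    sims = []
    for (i1, j1), (i2, j2) in pairs:
        v1 = 0.5 * (A[i1] + A[j1])   # CNNSE composition, alpha = beta = 0.5
        v2 = 0.5 * (A[i2] + A[j2])
        sims.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    rho, _ = spearmanr(sims, human_scores)
    return rho
```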
3.3.1 Interpretability and Phrase Similarity

CNNSE has the additional advantage of interpretability. To illustrate, we created a web page to explore the dataset under the CNNSE model. The page http://www.cs.cmu.edu/~fmri/papers/naacl2015/cnnse_mitchell_lapata_all.html displays phrase pairs sorted by average similarity score. For each phrase in the pair we show a summary of the CNNSE composed phrase meaning. The scores of the 10 top dimensions are displayed in descending order. Each dimension is described by its interpretable summarization. As one scrolls down the page, the similarity scores increase, and the number of dimensions shared between the phrase pairs (highlighted in red) increases. Some phrase pairs with high similarity scores share no top scoring dimensions. Because we can interpret the dimensions, we can begin to understand how the CNNSE model is failing, and how it might be improved.

For example, the phrase pair judged most similar by the human subjects, but that shares none of the top 10 dimensions in common, is "large number" and "great majority" (behavioral similarity score 5.61/7). Upon exploration of CNNSE phrasal representations, we see that the representation for "great majority" suffers from the multiple word senses of majority. Majority is often used in political settings to describe the party or group with larger membership. We see that the top scoring dimension for "great majority" has top scoring words "candidacy, candidate, caucus", a politically-themed dimension. Though the CNNSE representation is not incorrect for the word, the common theme between the two test phrases is not political.

The second highest scoring dimension for "large number" is "First name, address, complete address". Here we see another case of the collision of multiple word senses, as this dimension is related to identifying numbers, rather than the quantity-related sense of number. While it is satisfying that the word senses for majority and number have been separated out into different dimensions for each word, it is clear that both the composition and similarity functions used for this task are not gracefully handling multiple word senses. To address this issue, we could partition the dimensions of A into sense-related groups and use the maximally correlated groups to score phrase pairs. CNNSE interpretability allows us to perform these analyses, and will also allow us to iterate and improve future compositional models.

4 Conclusion

We explored a new method to create an interpretable VSM that respects the notion of semantic composition. We found that our technique for incorporating phrasal relationship constraints produced a VSM that is more consistent with observed phrasal representations and with behavioral data.

We found that, compared to NNSE, human evaluators judged CNNSE phrasal representations to be a better match to phrase meaning. We leveraged this improved interpretability to explore composition in the context of a previously published compositional task. We note that the collision of word senses often hinders performance on the behavioral data from Mitchell and Lapata (2010).

More generally, we have shown that incorporating constraints to represent the task of interest can improve a model's performance on that task. Additionally, incorporating such constraints into an interpretable model allows for a deeper exploration of performance in the context of evaluation tasks.

Acknowledgments

This work was supported in part by a gift from Google, NIH award 5R01HD075328, IARPA award FA865013C7360, DARPA award FA8750-13-2-0005, and by a fellowship to Alona Fyshe from the Multimodal Neuroimaging Training Program (NIH awards T90DA022761 and R90DA023420).

References

Marco Baroni and Roberto Zamparelli. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183–1193. Association for Computational Linguistics, 2010.

Stephen Boyd. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2010.

Jamie Callan and Mark Hoy. The ClueWeb09 Dataset, 2009. URL http://boston.lti.cs.cmu.edu/Data/clueweb09/.

Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David M. Blei. Reading Tea Leaves: How Humans Interpret Topic Models. In Advances in Neural Information Processing Systems, pages 1–9, 2009.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

S. Dejong. SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18(3):251–263, 1993.

Georgiana Dinu, Nghia The Pham, and Marco Baroni. General estimation and evaluation of compositional distributional semantic models. In Workshop on Continuous Vector Space Models and their Compositionality, Sofia, Bulgaria, 2013.

Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
Alona Fyshe, Partha Talukdar, Brian Murphy, and Tom Mitchell. Documents and Dependencies: An Exploration of Vector Space Models for Semantic Composition. In Computational Natural Language Learning, Sofia, Bulgaria, 2013.

Kazuma Hashimoto, Pontus Stenetorp, Makoto Miwa, and Yoshimasa Tsuruoka. Jointly learning word representations and composition functions using predicate-argument structures. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1544–1555, 2014.

Paul B. Kantor and Ellen M. Voorhees. The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text. Information Retrieval, 2:165–176, 2000.

Omer Levy and Yoav Goldberg. Neural Word Embedding as Implicit Matrix Factorization. In Advances in Neural Information Processing Systems, pages 1–9, 2014.

Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60, 2010.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of Neural Information Processing Systems, pages 1–9, 2013.

Jeff Mitchell and Mirella Lapata. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429, November 2010.

Brian Murphy, Partha Talukdar, and Tom Mitchell. Learning Effective and Interpretable Semantic Models using Non-Negative Sparse Embedding. In Proceedings of the Conference on Computational Linguistics (COLING), 2012.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. In Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 2014.

Magnus Sahlgren. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words. PhD thesis, Stockholm University, 2006.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic Compositionality through Recursive Matrix-Vector Spaces. In Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012.

Peter D. Turney. Domain and Function: A Dual-Space Model of Semantic Relations and Compositions. Journal of Artificial Intelligence Research, 44:533–585, 2012.
