A Hybrid Distributional and Knowledge-based Model of Lexical Semantics

Nikolaos Aletras
Department of Computer Science, University College London
Gower Street, London WC1E 6BT, United Kingdom
[email protected]

Mark Stevenson
Department of Computer Science, University of Sheffield
Regent Court, 211 Portobello, Sheffield S1 4DP, United Kingdom
[email protected]

Abstract

A range of approaches to the representation of lexical semantics have been explored within Computational Linguistics. Two of the most popular are distributional and knowledge-based models. This paper proposes hybrid models of lexical semantics that combine the advantages of these two approaches. Our models provide robust representations of synonymous words derived from WordNet. We also make use of WordNet's hierarchy to refine the synset vectors. The models are evaluated on two widely explored tasks involving lexical semantics: lexical similarity and Word Sense Disambiguation. The hybrid models are found to perform better than standard distributional models and have the additional benefit of modelling polysemy.

1 Introduction

The representation of lexical semantics is a core problem in Computational Linguistics and a variety of approaches have been developed. Two of the most widely explored have been knowledge-based and distributional semantics.

Knowledge-based approaches make use of some external information source which defines the set of possible meanings for each lexical item. The most widely used information source is WordNet (Fellbaum, 1998), although other resources, such as Machine Readable Dictionaries, thesauri and ontologies, have also been used (see Navigli (2009)). One advantage of these resources is that they represent the various possible meanings of lexical items, which makes it straightforward to identify the ones that are ambiguous. For example, these resources would include multiple meanings for the word ball, including the 'event' and 'sports equipment' senses. However, the fact that there are multiple meanings associated with ambiguous lexical items can also be problematic, since it may not be straightforward to identify which one is being used for an instance of an ambiguous word in text. This issue has led to significant exploration of the problem of Word Sense Disambiguation (Ide and Véronis, 1998; Navigli, 2009).

More recently, distributional semantics has become a popular approach to representing lexical semantics (Turney and Pantel, 2010; Erk, 2012). These approaches are based on the premise that the semantics of lexical items can be modelled by their context (Firth, 1957; Harris, 1985). Distributional semantic models have the advantages of being robust and straightforward to create from unannotated corpora. However, problems can arise when they are used to represent the semantics of polysemous words. Distributional semantic models are generally constructed by examining the contexts of lexical items in unannotated corpora, but for ambiguous words, like ball, it is not clear whether a particular instance of the word in a corpus refers to the 'event', 'sports equipment' or another sense, which can lead to the distributional semantic model becoming a mixture of different meanings without representing any of them individually.

This paper proposes models that merge elements of distributional and knowledge-based approaches to lexical semantics and combine the advantages of both techniques. A standard distributional semantic model is created from an unannotated corpus and then refined using WordNet. The resulting models can be viewed as enhanced distributional models that have been refined using the information from WordNet to reduce the problems caused by ambiguous terms when models are created. Alternatively, they can be used as a version of the WordNet hierarchy in which distributional semantic models are attached to synsets, thereby creating a version of WordNet for which the appropriate synsets can be identified more easily for ambiguous lexical items that occur in text.

We evaluate our models on two standard tasks: lexical similarity and word sense disambiguation. Results show that the proposed hybrid models perform consistently better than traditional distributional semantic models.

The remainder of the paper is organised as follows. Section 2 describes our hybrid models, which combine information from WordNet and a standard distributional semantic model. These models are augmented using Latent Semantic Analysis and Canonical Correlation Analysis. Sections 3 and 4 describe the evaluation of the models on the word similarity and word sense disambiguation tasks. Related work is presented in Section 5 and conclusions in Section 6.

2 Semantic Models

First, we consider a standard distributional semantic space to represent words as vectors (Section 2.1). Then, we make use of WordNet's clusters of synonyms and its hierarchy in combination with the standard distributional space to build hybrid models (Section 2.2), which are augmented using Latent Semantic Analysis (Section 2.3) and Canonical Correlation Analysis (Section 2.4).

2.1 Distributional Model

We consider a semantic space D as a word by context feature matrix, L × C. Vector representations consist of context features C in a reference corpus. We made use of pre-computed, publicly available vectors optimised for word similarity tasks (Baroni et al., 2014).¹ Word co-occurrence counts are extracted using a symmetric window of two words over a corpus of 2.8 billion tokens obtained by concatenating ukWaC, the English Wikipedia and the British National Corpus. Vectors are weighted using positive Pointwise Mutual Information and the set of context features consists of the top 300K most frequent words in the corpus.

¹ http://clic.cimec.unitn.it/composes/semantic-vectors.html
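The experiments use the pre-computed predict vectors of Baroni et al. (2014) rather than building D from scratch, but the count-and-weight scheme described above is easy to sketch. The following is an illustration only, assuming tokenised `sentences` as input; the function name and return format are ours, not part of the original implementation.

```python
import numpy as np
from collections import Counter

def build_ppmi_space(sentences, window=2, top_k=300_000):
    """Illustrative count-based construction of D: co-occurrence counts from a
    symmetric window, weighted with positive Pointwise Mutual Information."""
    pair_counts, word_counts, ctx_counts = Counter(), Counter(), Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    pair_counts[(w, sent[j])] += 1
                    word_counts[w] += 1
                    ctx_counts[sent[j]] += 1
    total = sum(pair_counts.values())
    contexts = [c for c, _ in ctx_counts.most_common(top_k)]
    rows = {w: i for i, w in enumerate(word_counts)}
    cols = {c: i for i, c in enumerate(contexts)}
    D = np.zeros((len(rows), len(cols)))
    for (w, c), n in pair_counts.items():
        if c in cols:
            pmi = np.log2(n * total / (word_counts[w] * ctx_counts[c]))
            D[rows[w], cols[c]] = max(pmi, 0.0)   # keep only positive PMI
    return D, rows, cols
```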
2.2 Hybrid Models

2.2.1 Synset Distributional Model

We assume that making use of information about the structure of WordNet can reduce the noise introduced into the vectors of D due to polysemy. We make use of all noun and verb synsets (excluding numbers and compounds) that contain at least one of the words in L to create a vector-based synset representation, H, where H is a synset by context feature matrix, i.e. S × C. Each synset vector is generated by computing the centroid of its lemma vectors in S (i.e. the sum of the lemmas' vectors normalised by the number of lemmas in the synset). For example, the vector of the synset car.n.01 is computed as the centroid of its lemma vectors, i.e. car, auto, automobile, machine and motorcar (see Figure 1).

Figure 1: In the Synset Distributional Model the vector representing a synset (white box) is computed as the centroid of its lemma vectors (grey boxes).

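A minimal sketch of this step, assuming a dictionary `word_vectors` that maps lemmas to NumPy vectors (e.g. loaded from the space D) and using NLTK's WordNet interface; the helper name `build_synset_space` is ours.

```python
import numpy as np
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def build_synset_space(word_vectors):
    """Sketch of H: for every noun and verb synset, the vector is the centroid
    of the vectors of its single-word lemmas found in the word space D."""
    H = {}
    for pos in (wn.NOUN, wn.VERB):
        for synset in wn.all_synsets(pos=pos):
            lemmas = [l.name() for l in synset.lemmas()
                      if '_' not in l.name()          # skip compounds
                      and not l.name().isdigit()      # skip numbers
                      and l.name() in word_vectors]
            if lemmas:
                H[synset.name()] = np.mean([word_vectors[l] for l in lemmas], axis=0)
    return H

# e.g. H['car.n.01'] is the centroid of the vectors of car, auto, automobile,
# machine and motorcar (whichever of them appear in word_vectors).
```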
2.2.2 Synset Rank Model

The Synset Distributional Model provides a vector representation for each synset in WordNet, created using information about which lemmas share synset membership. An advantage of this approach is that vectors from multiple lemmas are combined to form the synset representation. However, a disadvantage is that many of these lemmas are polysemous and their vectors represent multiple senses, not just the one that is relevant to the synset. For example, in WordNet the lemma machine has several possible meanings, only one of which is a member of the synset car.n.01.

WordNet also contains information about the relations between synsets, in the form of the synset hierarchy, which can be exploited to re-weight the importance of context features for particular synsets. We employ a graph-based algorithm that makes use of the WordNet is-a hierarchy. The intuition behind this approach is that context features that are relevant to a given synset are likely to be shared by its neighbours in the hierarchy, while those that are not relevant (i.e. have been introduced via an irrelevant sense of a synset member) will not be. The graph-based algorithm increases the weight of context features that synsets share with their neighbours and reduces those that are not shared.

PageRank (Page et al., 1999) is a graph-based algorithm for identifying important nodes in a graph that has been applied to a range of NLP tasks, including word sense disambiguation (Agirre and Soroa, 2009) and keyword extraction (Mihalcea and Tarau, 2004). Let G = (V, E) be a graph with a set of vertices, V, denoting synsets and a set of edges, E, denoting links between synsets in the WordNet hierarchy. The PageRank score Pr over G for a synset V_i can be computed by the following equation:

    Pr(V_i) = d \cdot \sum_{V_j \in I(V_i)} \frac{Pr(V_j)}{O(V_j)} + (1 - d)\, v    (1)

where I(V_i) denotes the in-degree of the vertex V_i and O(V_j) is the out-degree of vertex V_j. d is the damping factor, which is set to the default value of d = 0.85 (Page et al., 1999). In standard PageRank all elements of the vector v are the same, 1/N, where N is the number of nodes in the graph.

Personalised PageRank (PPR) (Haveliwala et al., 2003) is a variant of the PageRank algorithm in which extra importance is assigned to certain vertices in the graph. This is achieved by adjusting the values of the vector v in equation 1 to prefer certain nodes. The values in v effectively initialise the graph, and assigning high values to nodes in v makes them more likely to be assigned a high PPR score.

For each context feature c in C, if c ∈ LM, where LM contains all the lemma names of synsets in S, we apply PPR to assign importance to synsets. The score of each synset containing c in the personalisation vector v is set to 1/|S_c|, where |S_c| is the number of synsets that the context feature c belongs to. The personalisation value of all the other synsets is set to 0.

We apply PPR over WordNet for each context feature using UKB (Agirre et al., 2009) and obtain weights for each synset-context feature pair, resulting in a new semantic space Hp, S × C, where vector elements are weighted by PageRank values. Figure 2 shows how the synset scores are computed by applying PPR over WordNet given the context feature car. Note that we use the context features of the distributional model D.

Figure 2: In the Synset Rank Model, each synset is assigned a score by computing PPR over WordNet. The personalisation vector is initialised by assigning probabilities only to the synsets that include the context feature as a lemma name.

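The paper runs PPR with the UKB tool; purely as an illustration of the idea (not of UKB itself), the personalisation vector and a PPR run could be approximated with networkx as follows. The helpers `wordnet_graph` and `ppr_column` are hypothetical names of ours, and the undirected hypernym graph is a simplification of the full WordNet graph.

```python
import networkx as nx
from nltk.corpus import wordnet as wn

def wordnet_graph():
    """Undirected graph whose nodes are synsets and whose edges follow the
    WordNet is-a (hypernym) hierarchy."""
    G = nx.Graph()
    for s in wn.all_synsets():
        for h in s.hypernyms():
            G.add_edge(s.name(), h.name())
    return G

def ppr_column(context_feature, G, d=0.85):
    """PPR run for one context feature: the personalisation vector puts
    1/|S_c| on every synset that has the feature as a lemma name, 0 elsewhere."""
    seeds = [s.name() for s in wn.synsets(context_feature) if s.name() in G]
    if not seeds:
        return {}
    personalization = {name: 1.0 / len(seeds) for name in seeds}
    return nx.pagerank(G, alpha=d, personalization=personalization)

# e.g. scores = ppr_column('car', wordnet_graph()); the values scores[synset]
# become the entries of the 'car' column of Hp.
```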
2.3 Latent Semantic Analysis

Latent Semantic Analysis (LSA) (Deerwester et al., 1990; Landauer and Dumais, 1997) has been used to reduce the dimensionality of semantic spaces, leading to improved performance. LSA applies Singular Value Decomposition (SVD) to a matrix X, W × C, which represents a distributional semantic space. This is a form of factor analysis in which X is decomposed into three other matrices:

    X = U \Sigma V^T    (2)

where U is a W × W matrix whose columns are the eigenvectors of XX^T, Σ is a diagonal W × C matrix containing the singular values, and V is a C × C matrix of context feature vectors whose columns are the eigenvectors of X^T X. The multiplication of the three component matrices results in the original matrix, X. Any matrix can be decomposed perfectly if the number of singular values is no smaller than the smallest dimension of X. When fewer singular values are used, the matrix product is an approximation of the original matrix. LSA reduces the dimensionality of the SVD by deleting coefficients in the diagonal matrix Σ, starting with the smallest. The approximation of matrix X retaining the K largest singular values, X̃, is then given by:

    \tilde{X} \approx U_K \Sigma_K V_K^T    (3)

where U_K is a W × K matrix of word vectors, Σ_K is a K × K diagonal matrix of singular values and V_K is a K × C matrix of context feature vectors.

We apply LSA to the Synset Distributional Model, H, and the Synset Rank Model, Hp, to obtain the reduced semantic spaces H̃ and H̃p respectively.

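A minimal sketch of the truncated SVD step, using a dense NumPy decomposition for clarity; with 300K context features a sparse solver (e.g. scipy.sparse.linalg.svds or scikit-learn's TruncatedSVD) would be needed in practice. `H` and `Hp` are assumed here to be matrices whose rows correspond to synsets.

```python
import numpy as np

def lsa_reduce(X, k):
    """Truncated SVD as in equations (2)-(3): keep the K largest singular
    values and return the reduced row (synset) vectors U_K * Sigma_K."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # s is sorted descending
    return U[:, :k] * s[:k]            # rows: synsets, columns: k latent dims

# e.g. H_tilde  = lsa_reduce(H, 700)   # k = 700 tuned for H  (Section 3.3)
#      Hp_tilde = lsa_reduce(Hp, 650)  # k = 650 tuned for Hp
```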

2.4 Joint Representation using CCA

Recent work has demonstrated that distributional models can benefit from combining alternative views of data (see Section 5). H and Hp provide two different views of the synsets and we incorporate evidence from both to learn a joint representation using Canonical Correlation Analysis (CCA) (Hardoon et al., 2004). Given two multidimensional variables x and y, CCA finds two projection vectors by maximising the correlation of the variables projected onto them. The function to be maximised is:

    \rho = \frac{E[xy]}{\sqrt{E[x^2]\, E[y^2]}}    (4)

The dimensionality of the projection vectors is lower than or equal to the dimensionality of the original variables.

The computation of CCA directly over H and Hp is computationally infeasible because of their high dimensionality (300K context features). We apply CCA over the reduced spaces learned using LSA, H̃ and H̃p, to obtain two joint semantic spaces, following a similar approach to Faruqui and Dyer (2014). These are the spaces H*, resulting from the projection of the Synset Distributional Model H̃, and Hp*, resulting from the projection of the Synset Rank Model H̃p.

3 Word Similarity

3.1 Computing Similarity

Since the hybrid models represent words as synset vectors, the similarity between two words can be computed in two ways. First, we can compute the similarity between two words as the maximum of their pairwise synset similarities. Alternatively, similarity can be computed as the average pairwise similarity over the synsets that the two words belong to. In both cases, similarity is computed as the cosine of the angle between word or synset vectors.

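A minimal sketch of the two similarity strategies, assuming `synset_vectors` maps WordNet synset names to NumPy vectors (one of H, Hp, H̃, H̃p, H* or Hp*); the function names are ours.

```python
import numpy as np
from nltk.corpus import wordnet as wn

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity(word1, word2, synset_vectors, mode='max'):
    """Similarity of two words in a hybrid space: maximum or average cosine
    over all pairs of synsets the two words belong to."""
    vecs1 = [synset_vectors[s.name()] for s in wn.synsets(word1)
             if s.name() in synset_vectors]
    vecs2 = [synset_vectors[s.name()] for s in wn.synsets(word2)
             if s.name() in synset_vectors]
    if not vecs1 or not vecs2:
        return 0.0
    sims = [cosine(u, v) for u in vecs1 for v in vecs2]
    return max(sims) if mode == 'max' else sum(sims) / len(sims)

# e.g. word_similarity('car', 'automobile', H, mode='average')
```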
Model                   WS-353  WS-Sim  WS-Rel    RG      MC     MEN
Distributional Model
  D                      0.62    0.70    0.59    0.79    0.72    0.72
Hybrid Models - Full
  H                      0.49    0.60    0.36    0.69    0.64    0.58
  Hp                     0.58    0.67    0.49    0.82    0.86    0.63
Hybrid Models - LSA
  H̃                      0.55    0.69    0.42    0.71    0.71    0.54
  H̃p                     0.58    0.68    0.46    0.85    0.86    0.55
Hybrid Models - CCA
  H*                     0.67    0.76    0.57    0.81    0.79    0.72
  Hp*                    0.52    0.62    0.41    0.86    0.80    0.56

Table 1: Spearman's correlation on various data sets. Maximum similarity between pairs of synsets.

3.2 Data

We make use of six standard data sets that have been widely used for evaluating lexical similarity and relatedness. First, we make use of WS-353 (Finkelstein et al., 2001), which contains 353 pairs of words annotated by humans. Furthermore, we make use of the similarity (WS-Sim) and relatedness (WS-Rel) pairs of words created by Agirre et al. (2009) from the original WS-353 data set. We also made use of the RG (Rubenstein and Goodenough, 1965) and MC (Miller and Charles, 1991) data sets, which contain 65 and 30 pairs of nouns respectively. Finally, we make use of the larger MEN data set (Bruni et al., 2012), which contains 3,000 pairs of words that have been used as image tags; annotations were obtained using crowdsourcing.

3.3 Model Parameters

The parameters we need to tune are the number of top components in the LSA spaces, H̃ and H̃p, and the CCA spaces, H* and Hp*. For the LSA spaces, we tune the number of the top k components on RG. We set k ∈ {50, 100, ..., 1000} and select the value that maximises performance, which is k = 700 for H̃ and k = 650 for H̃p. For the joint spaces learned using CCA, we also tune the number of the top l correlated features on RG. We set l ∈ {10, 20, ..., 650} and select the value that maximises performance, which is l = 250 for H* and l = 40 for Hp*.

3.4 Evaluation Metric

Performance is measured as the correlation between the similarity scores returned by each proposed method and the human judgements. This is the standard approach to evaluating word and text similarity tasks, e.g. (Budanitsky and Hirst, 2001; Agirre et al., 2009; Agirre et al., 2012). Our experiments use Spearman's correlation coefficient.

3.5 Results

Table 1 shows the Spearman's correlation between the similarity scores generated by each model and human judgements of similarity across the various data sets, taking the maximum pairwise similarity score of two words' synsets. The first row of the table shows the results obtained by the word distributional model of Baroni et al. (2014). The full hybrid models H and Hp perform consistently worse than the original distributional model D across data sets. The main reason is that a large number of synsets contain only one lemma name, which might be polysemous.

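The evaluation protocol of Section 3.4 reduces to a few lines; a sketch assuming the `word_similarity` helper above and lists of word pairs with their gold scores (names such as `ws353_pairs` are placeholders).

```python
from scipy.stats import spearmanr

def evaluate(word_pairs, gold_scores, similarity_fn):
    """Spearman correlation between model similarities and human judgements
    (Section 3.4). word_pairs is a list of (word1, word2) tuples."""
    model_scores = [similarity_fn(w1, w2) for w1, w2 in word_pairs]
    rho, _ = spearmanr(model_scores, gold_scores)
    return rho

# e.g. evaluate(ws353_pairs, ws353_gold,
#               lambda a, b: word_similarity(a, b, H_star, mode='average'))
```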
Model                   WS-353  WS-Sim  WS-Rel    RG      MC     MEN
Distributional Model
  D                      0.62    0.70    0.59    0.79    0.72    0.72
Hybrid Models - Full
  H                      0.61    0.71    0.52    0.72    0.65    0.64
  Hp                     0.65    0.73    0.56    0.79    0.81    0.58
Hybrid Models - LSA
  H̃                      0.59    0.70    0.48    0.68    0.68    0.63
  H̃p                     0.65    0.73    0.56    0.81    0.86    0.58
Hybrid Models - CCA
  H*                     0.70    0.77    0.64    0.78    0.84    0.74
  Hp*                    0.61    0.69    0.52    0.72    0.76    0.62

Table 2: Spearman's correlation on various data sets. Average pairwise similarity between pairs of synsets.

kicked in games’) and ‘ball.n.04’ (‘the people as- hybrid model, H∗, achieves correlations that are be- sembled at a lavish formal dance’) is ‘ball’. In this tween 2% and 12% than D for the majority of data case, the synset vector in H and the lemma vector in sets, although performance is 1% lower for the RG D are identical and still polysemous. This problem data set. This improved performance suggest that hu- does not hold in Hp and therefore the correlations man judgements of word similarity are based on the are higher for that semantic space but still lower than relation between all the senses of two given words those obtained for D. Applying LSA on H and Hp rather than just the most similar ones. improves results but correlations are still lower than 2 those obtained using D . On the other hand, the 4 Word Sense Disambiguation joint space learned by applying CCA, H∗, produces consistently better similarity estimates than D while 4.1 Data outperforms all the other models in the majority of the data sets. That confirms our main assumption We test the efficiency of our hybrid models on the than incorporating information obtained from a large English All Words tasks of Senseval-2 (Palmer et corpus and a knowledge-base improves word vector al., 2001) and Senseval-3 (Snyder and Palmer, 2004), representations. two standard data sets for evaluating WSD. Our ex- Table 2 shows the Spearman’s correlation of sim- periments focus on the disambiguation of nouns in ilarity scores generated by each model and human these data sets. judgements of similarity across various data sets by taking the average pairwise similarity score of two 4.2 Word Sense Tagging words’ synsets. Results show that using the average A simple approach to all-words WSD was imple- rather than the maximum system similarity improves mented in which each sense of an ambiguous word results for almost all data sets. For example, the best is compared against its context and the most similar 2Note that Baroni et al. (2014) found that applying SVD to chosen. D did not improve performance over using the full space. For example suppose that we want to disambiguate 25 Nouns Senseval-2 Senseval-3 Precision Recall Precision Recall Hybrid Models - Full H 0.46 0.45 0.37 0.36

                         Senseval-2            Senseval-3
Model                 Precision  Recall     Precision  Recall
Hybrid Models - Full
  H                      0.46     0.45         0.37     0.36
  Hp                     0.65     0.63         0.50     0.48
Hybrid Models - LSA
  H̃                      0.45     0.44         0.39     0.37
  H̃p                     0.60     0.58         0.46     0.45
Hybrid Models - CCA
  H*                     0.44     0.43         0.36     0.34
  Hp*                    0.61     0.60         0.48     0.46

Table 3: Results obtained by the hybrid models on the Senseval-2 and Senseval-3 data sets (nouns only).

For example, suppose that we want to disambiguate the word bank in the following sentence:

"Banks provide payment services."

Assume that the word bank has two senses, 'bank.n.01' and 'bank.n.02', defined as 'sloping land (especially the slope beside a body of water)' and 'a financial institution that accepts deposits and channels the money into lending activities' respectively. First we consider the vectors of all the possible noun synsets containing the word bank as a lemma name. Then, for each context word (provide, payment and service) that exists in our semantic spaces, we compute a centroid vector from its constituent senses. Finally, we compute a context vector for the entire context by summing all the context word vectors. We select the synset of the target word whose vector has the highest cosine similarity to the context vector.

4.3 Model Parameters

The parameters we need to tune are the same as for the word similarity task and we use the best settings obtained for that task. We also experimented with varying the number of surrounding sentences used as context, testing values between ±1 and ±4. The best performance was obtained using a context created from the sentence containing the target word and the ±1 sentences surrounding it.
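A minimal sketch of this sense-tagging procedure for a single noun, assuming `synset_vectors` is one of the hybrid spaces (e.g. Hp) keyed by synset name; the function name is ours and the context here is just the word list from the example rather than the ±1 surrounding sentences used in the experiments.

```python
import numpy as np
from nltk.corpus import wordnet as wn

def disambiguate_noun(target, context_words, synset_vectors):
    """Pick the noun synset of `target` whose vector is most similar (cosine)
    to a context vector built from the centroids of the context words' senses."""
    context_vec = None
    for w in context_words:
        sense_vecs = [synset_vectors[s.name()] for s in wn.synsets(w)
                      if s.name() in synset_vectors]
        if sense_vecs:
            centroid = np.mean(sense_vecs, axis=0)   # centroid over the word's senses
            context_vec = centroid if context_vec is None else context_vec + centroid
    if context_vec is None:
        return None
    best_synset, best_sim = None, float('-inf')
    for s in wn.synsets(target, pos=wn.NOUN):
        if s.name() in synset_vectors:
            v = synset_vectors[s.name()]
            sim = np.dot(v, context_vec) / (np.linalg.norm(v) * np.linalg.norm(context_vec))
            if sim > best_sim:
                best_synset, best_sim = s.name(), sim
    return best_synset

# e.g. disambiguate_noun('bank', ['provide', 'payment', 'service'], Hp)
# should prefer the 'financial institution' sense over the 'sloping land' one.
```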
4.4 Evaluation Metrics

Word sense disambiguation systems are evaluated by computing precision and recall. Precision measures the proportion of disambiguated words that have been assigned the correct sense. Recall measures the proportion of words disambiguated correctly out of all words available for disambiguation.

4.5 Results

Table 3 shows the results obtained by our hybrid models on the two word sense disambiguation data sets. The full Synset Rank model Hp is consistently the best method in terms of precision and recall on both data sets. On the other hand, it is somewhat surprising that dimensionality reduction and integration of semantic spaces do not help to improve performance; that is, the H̃p and Hp* models achieve lower precision and recall than the full Hp. The Synset Distributional models H, H̃ and H* consistently fail to perform well. The difference in precision and recall compared to the Synset Rank models is between 12% and 19%. This suggests that the knowledge-based weighting of the context features generates less noisy vectors for sense tagging.

The pattern of results observed for the WSD task is somewhat different to those obtained for word similarity, where applying LSA and CCA improved performance (see Section 3). The most likely explanation of this difference is that WSD requires the model to represent the possible senses of each ambiguous word. It is also important that these senses correspond to the ones used in the relevant lexicon (WordNet in this case). The Synset Rank model Hp does this by making use of information from WordNet. However, these synset representations are disrupted by LSA and CCA, which compress the semantic space by extracting general features from it. This is not a problem for word similarity since there is no need to model the senses found in the lexicon.

5 Related Work

Dealing with polysemy in distributional semantics is a fundamental issue since the various senses of a word type are conflated in a single vector. Previous work has tackled the problem through vector adaptation, clustering and language models (Erk, 2012). Vector adaptation methods modify a traditional (i.e. polysemous) target word vector by applying pointwise operations, such as addition or multiplication, to that vector and those of the surrounding words in a sentence (Mitchell and Lapata, 2008; Erk and Padó, 2008; Thater et al., 2011; Van de Cruys et al., 2011). Alternatively, clustering methods have been used to cluster together the different contexts in which a target word appears, assuming that each cluster of contexts captures a different sense of the target word (Dinu and Lapata, 2010; Erk and Padó, 2010; Reisinger and Mooney, 2010). Language models have also been used to remove polysemy from word vectors by predicting words that could replace the target word in a given context (Deschacht and Moens, 2009; Washtell, 2010; Moon and Erk, 2013). More recently, Polajnar and Clark (2014) applied context selection and normalisation to improve the quality of word vectors. Our hybrid models are related to the vector adaptation methods since we modify the synset vectors using their lemmas' vectors to remove noise.

Our work is also inspired by recent work on improving classic distributional vector representations of words by incorporating information from different modalities. For example, researchers have developed methods that make use of both visual and contextual information to improve word vectors (Bruni et al., 2011; Silberer et al., 2013; Lazaridou et al., 2014). Following a similar direction, Faruqui and Dyer (2014) found that learning joint spaces from multilingual vector spaces using CCA improves the performance of standard monolingual vector spaces on word similarity tasks. Fyshe et al. (2014) showed that integrating textual vector space models with brain activation data recorded while people read words achieves better correlation with behavioural data than models of one modality.

Our hybrid models are also closely related to a supervised method proposed by Faruqui et al. (2015). Their method refines distributional semantic models using relational information from various semantic lexicons, including WordNet, by encouraging linked words in these lexicons to have similar vector representations. While our models are also based on using information from WordNet to refine vector representations, they are fundamentally different: they create synset vectors in an unsupervised fashion and, more importantly, can be used for sense tagging.

6 Conclusions

This paper proposed hybrid models of lexical semantics that combine distributional and knowledge-based approaches and offer advantages of both techniques. A standard distributional semantic model is created from an unannotated corpus and then refined by (1) using WordNet synsets to create synset vectors; and (2) applying a graph-based technique over WordNet to reweight synset vectors. The resulting hybrid models can be viewed as enhanced distributional models that use the information from WordNet to reduce the problems caused by ambiguous terms when models are created. Results show that our models perform better than traditional distributional models on lexical similarity tasks. Unlike standard distributional approaches, the techniques proposed here also model polysemy and can be used to carry out word sense disambiguation.

References
Eneko Agirre and Aitor Soroa. 2009. Personalizing PageRank for word sense disambiguation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL '09), pages 33–41, Athens, Greece.

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT '09), pages 19–27, Boulder, Colorado.

Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation), pages 385–393. Association for Computational Linguistics.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247, Baltimore, Maryland.

Elia Bruni, Giang Binh Tran, and Marco Baroni. 2011. Distributional semantics from text and images. In Proceedings of the Workshop on GEometrical Models of Natural Language Semantics (GEMS '11), pages 22–32, Edinburgh, UK.

Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers (Volume 1), pages 136–145, Jeju Island, Korea.

Alexander Budanitsky and Graeme Hirst. 2001. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Proceedings of the Workshop on "WordNet and Other Lexical Resources" at the Second Annual Meeting of the North American Association for Computational Linguistics, Pittsburgh, PA.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Koen Deschacht and Marie-Francine Moens. 2009. Semi-supervised semantic role labeling using the Latent Words Language Model. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 21–29, Singapore.

Georgiana Dinu and Mirella Lapata. 2010. Measuring distributional similarity in context. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1162–1172, Cambridge, MA.

Katrin Erk and Sebastian Padó. 2008. A structured vector space model for word meaning in context. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 897–906, Honolulu, Hawaii.

Katrin Erk and Sebastian Padó. 2010. Exemplar-based models for word meaning in context. In Proceedings of the ACL 2010 Conference Short Papers, pages 92–97, Uppsala, Sweden.

Katrin Erk. 2012. Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Compass, 6(10):635–653.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 462–471, Gothenburg, Sweden.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of NAACL, Denver, Colorado.

C. Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, London.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, pages 406–414, New York, USA. ACM Press.

J. Firth. 1957. A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis.

Alona Fyshe, Partha P. Talukdar, Brian Murphy, and Tom M. Mitchell. 2014. Interpretable semantic vectors from a joint model of brain- and text-based meaning. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 489–499, Baltimore, Maryland.

David Hardoon, Sandor Szedmak, and John Shawe-Taylor. 2004. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664.

Z. Harris. 1985. Distributional structure. In J. Katz, editor, The Philosophy of Linguistics, pages 26–47. Oxford University Press, New York.

Taher Haveliwala, Sepandar Kamvar, and Glen Jeh. 2003. An analytical comparison of approaches to personalizing PageRank. Technical Report 2003-35, Stanford InfoLab.

N. Ide and J. Véronis. 1998. Introduction to the special issue on word sense disambiguation: The state of the art. Computational Linguistics, 24(1):1–40.

Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211.

Angeliki Lazaridou, Elia Bruni, and Marco Baroni. 2014. Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1403–1414, Baltimore, Maryland.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of the International Conference on Empirical Methods in Natural Language Processing (EMNLP '04), pages 404–411, Barcelona, Spain.

George A. Miller and Walter G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT, pages 236–244, Columbus, Ohio.

Taesun Moon and Katrin Erk. 2013. An inference-based model of word meaning in context as a paraphrase distribution. ACM Transactions on Intelligent Systems and Technology (TIST), 4(3):42.

Roberto Navigli. 2009. Word Sense Disambiguation: A survey. ACM Computing Surveys, 41(2):1–69.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab.

Martha Palmer, Christiane Fellbaum, Scott Cotton, Lauren Delfs, and Hoa Trang Dang. 2001. English tasks: All-words and verb lexical sample. In Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 21–24, Toulouse, France.

Tamara Polajnar and Stephen Clark. 2014. Improving distributional semantic vectors through context selection and normalisation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 230–238, Gothenburg, Sweden. Association for Computational Linguistics.

Joseph Reisinger and Raymond J. Mooney. 2010. Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 109–117, Los Angeles, California.

Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.

Carina Silberer, Vittorio Ferrari, and Mirella Lapata. 2013. Models of semantic representation with visual attributes. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 572–582, Sofia, Bulgaria.

Benjamin Snyder and Martha Palmer. 2004. The English all-words task. In Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 41–43, Barcelona, Spain.

Stefan Thater, Hagen Fürstenau, and Manfred Pinkal. 2011. Word meaning in context: A simple and effective vector model. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1134–1143, Chiang Mai, Thailand.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Tim Van de Cruys, Thierry Poibeau, and Anna Korhonen. 2011. Latent vector weighting for word meaning in context. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1012–1022, Edinburgh, Scotland, UK.

Justin Washtell. 2010. Expectation vectors: A semiotics inspired approach to geometric lexical-semantic representation. In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics (GEMS), pages 45–50, Uppsala, Sweden.