A Hybrid Distributional and Knowledge-Based Model of Lexical Semantics
Nikolaos Aletras
Department of Computer Science, University College London, Gower Street, London WC1E 6BT, United Kingdom

Mark Stevenson
Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello, Sheffield S1 4DP, United Kingdom

Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics (*SEM 2015), pages 20–29, Denver, Colorado, June 4–5, 2015.

Abstract

A range of approaches to the representation of lexical semantics have been explored within Computational Linguistics. Two of the most popular are distributional and knowledge-based models. This paper proposes hybrid models of lexical semantics that combine the advantages of these two approaches. Our models provide robust representations of synonymous words derived from WordNet. We also make use of WordNet’s hierarchy to refine the synset vectors. The models are evaluated on two widely explored tasks involving lexical semantics: lexical similarity and Word Sense Disambiguation. The hybrid models are found to perform better than standard distributional models and have the additional benefit of modelling polysemy.

1 Introduction

The representation of lexical semantics is a core problem in Computational Linguistics and a variety of approaches have been developed. Two of the most widely explored have been knowledge-based and distributional semantics.

Knowledge-based approaches make use of some external information source which defines the set of possible meanings for each lexical item. The most widely used information source is WordNet (Fellbaum, 1998), although other resources, such as Machine Readable Dictionaries, thesauri and ontologies have also been used (see Navigli (2009)).

One advantage of these resources is that they represent the various possible meanings of lexical items, which makes it straightforward to identify ones that are ambiguous. For example, these resources would include multiple meanings for the word ball including the ‘event’ and ‘sports equipment’ senses. However, the fact that there are multiple meanings associated with ambiguous lexical items can also be problematic since it may not be straightforward to identify which one is being used for an instance of an ambiguous word in text. This issue has led to significant exploration of the problem of Word Sense Disambiguation (Ide and Véronis, 1998; Navigli, 2009).

More recently distributional semantics has become a popular approach to representing lexical semantics (Turney and Pantel, 2010; Erk, 2012). These approaches are based on the premise that the semantics of lexical items can be modelled by their context (Firth, 1957; Harris, 1985). Distributional semantic models have the advantages of being robust and straightforward to create from unannotated corpora. However, problems can arise when they are used to represent the semantics of polysemous words. Distributional semantic models are generally constructed by examining the context of lexical items in unannotated corpora. But for ambiguous words, like ball, it is not clear if a particular instance of the word in a corpus refers to the ‘event’, ‘sports equipment’ or another sense, which can lead to the distributional semantic model becoming a mixture of different meanings without representing any of the meanings individually.

This paper proposes models that merge elements of distributional and knowledge-based approaches to lexical semantics and combine advantages of both techniques. A standard distributional semantic model is created from an unannotated corpus and then refined using WordNet. The resulting models can be viewed as enhanced distributional models that have been refined using the information from WordNet to reduce the problems caused by ambiguous terms when models are created. Alternatively, they can be used as a version of the WordNet hierarchy in which distributional semantic models are attached to synsets, thereby creating a version of WordNet for which the appropriate synsets can be identified more easily for ambiguous lexical items that occur in text.

We evaluate our models on two standard tasks: lexical similarity and word sense disambiguation. Results show that the proposed hybrid models perform consistently better than traditional distributional semantic models.

The remainder of the paper is organised as follows. Section 2 describes our hybrid models which combine information from WordNet and a standard distributional semantic model. These models are augmented using Latent Semantic Analysis and Canonical Correlation Analysis. Sections 3 and 4 describe evaluation of the models on the word similarity and word sense disambiguation tasks. Related work is presented in Section 5 and conclusions in Section 6.

2 Semantic Models

First, we consider a standard distributional semantic space to represent words as vectors (Section 2.1). Then, we make use of WordNet’s clusters of synonyms and hierarchy in combination with the standard distributional space to build hybrid models (Section 2.2) which are augmented using Latent Semantic Analysis (Section 2.3) and Canonical Correlation Analysis (Section 2.4).

2.1 Distributional Model

We consider a semantic space D, as a word by context feature matrix, L × C. Vector representations consist of context features C in a reference corpus. We made use of pre-computed publicly available vectors¹ optimised for word similarity tasks (Baroni et al., 2014). Word co-occurrence counts are extracted using a symmetric window of two words over a corpus of 2.8 billion tokens obtained by concatenating ukWaC, the English Wikipedia and the British National Corpus. Vectors are weighted using positive Pointwise Mutual Information and the set of context features consists of the top 300K most frequent words in the corpus.

¹ http://clic.cimec.unitn.it/composes/semantic-vectors.html

2.2 Hybrid Models

2.2.1 Synset Distributional Model

We assume that making use of information about the structure of WordNet can reduce noise introduced in vectors of D due to polysemy. We make use of all noun and verb synsets (excluding numbers and compounds) that contain at least one of the words in L to create a vector-based synset representation, H, where H is a synset by context feature matrix, i.e. S × C. Each synset vector is generated by computing the centroid of its lemma vectors in S (i.e. the sum of the lemmas’ vectors normalised by the number of lemmas in the synset). For example, the vector of the synset car.n.01 is computed as the centroid of its lemma vectors, i.e. car, auto, automobile, machine and motorcar (see Figure 1).

Figure 1: In the Synset Distributional Model the vector representing a synset (white box) is computed as the centroid of its lemma vectors (grey boxes).

2.2.2 Synset Rank Model

The Synset Distributional Model provides a vector representation for each synset in WordNet which is created using information about which lemmas share synset membership. An advantage of this approach is that vectors from multiple lemmas are combined to form the synset representation. However, a disadvantage is that many of these lemmas are polysemous and their vectors represent multiple senses, not just the one that is relevant to the synset. For example, in WordNet the lemma machine has several possible meanings, only one of which is a member of the synset car.n.01.

WordNet also contains information about the relations between synsets, in the form of the synset hierarchy, which can be exploited to re-weight the importance of context features for particular synsets. We employ a graph-based algorithm that makes use of the WordNet is-a hierarchy. The intuition behind this approach is that context features that are relevant to a given synset are likely to be shared by its neighbours in the hierarchy while those that are not relevant (i.e. have been introduced via an irrelevant sense of a synset member) will not be. The graph-based algorithm increases the weight of context features that synsets share with neighbours and reduces those that are not shared.

PageRank (Page et al., 1999) is a graph-based algorithm for identifying important nodes in a graph that has been applied to a range of NLP tasks including word sense disambiguation (Agirre and Soroa, 2009) and keyword extraction (Mihalcea and Tarau, 2004). Let $G = (V, E)$ be a graph with a set of vertices, $V$, denoting synsets and a set of edges, $E$, denoting links between synsets in the WordNet hierarchy. The PageRank score ($Pr$) over $G$ for a synset ($V_i$) can be computed by the following equation:

$$Pr(V_i) = d \cdot \sum_{V_j \in I(V_i)} \frac{1}{O(V_j)} Pr(V_j) + (1 - d)\,v \tag{1}$$

where $I(V_i)$ denotes the in-degree of the vertex $V_i$ and $O(V_j)$ is the out-degree of vertex $V_j$.

v, is set to $\frac{1}{|S_c|}$ where $|S_c|$ is the number of synsets to which context feature i belongs. The personalisation value of all the other synsets is set to 0.

We apply PPR over WordNet for each context feature using UKB (Agirre et al., 2009) and obtain weights for each synset-context feature pair, resulting in a new semantic space $H_p$, S × C, where vector elements are weighted by PageRank values. Figure 2 shows how the synset scores are computed by applying PPR over WordNet given the context feature car. Note that we use the context features of the distributional model D.

2.3 Latent Semantic Analysis

Latent Semantic Analysis (LSA) (Deerwester et al., 1990; Landauer and Dumais, 1997) has been used to reduce the dimensionality of semantic spaces lead-