A Structured Vector Space Model for Word Meaning in Context
Katrin Erk, Department of Linguistics, University of Texas at Austin, [email protected]
Sebastian Padó, Department of Linguistics, Stanford University, [email protected]

Abstract

We address the task of computing vector space representations for the meaning of word occurrences, which can vary widely according to context. This task is a crucial step towards a robust, vector-based compositional account of sentence meaning. We argue that existing models for this task do not take syntactic structure sufficiently into account.

We present a novel structured vector space model that addresses these issues by incorporating the selectional preferences for words' argument positions. This makes it possible to integrate syntax into the computation of word meaning in context. In addition, the model performs at and above the state of the art for modeling the contextual adequacy of paraphrases.

1 Introduction

Semantic spaces are a popular framework for the representation of word meaning, encoding the meaning of lemmas as high-dimensional vectors. In the default case, the components of these vectors measure the co-occurrence of the lemma with context features over a large corpus. These vectors are able to provide a robust model of semantic similarity that has been used in NLP (Salton et al., 1975; McCarthy and Carroll, 2003; Manning et al., 2008) and to model experimental results in cognitive science (Landauer and Dumais, 1997; McDonald and Ramscar, 2001). Semantic spaces are attractive because they provide a model of word meaning that is independent of dictionary senses and their much-discussed problems (Kilgarriff, 1997; McCarthy and Navigli, 2007).

In a default semantic space as described above, each vector represents one lemma, averaging over all its possible usages (Landauer and Dumais, 1997; Lund and Burgess, 1996). Since the meaning of words can vary substantially between occurrences (e.g., for polysemous words), the next necessary step is to characterize the meaning of individual words in context.

There have been several approaches in the literature (Smolensky, 1990; Schütze, 1998; Kintsch, 2001; McDonald and Brew, 2004; Mitchell and Lapata, 2008) that compute meaning in context from lemma vectors. Most of these studies phrase the problem as one of vector composition: The meaning of a target occurrence a in context b is a single new vector c that is a function (for example, the centroid) of the vectors: c = a ⊙ b.

The context b can consist of as little as one word, as shown in Example (1). In (1a), the meaning of catch combined with ball is similar to grab, while in (1b), combined with disease, it can be paraphrased by contract. Conversely, verbs can influence the interpretation of nouns: In (1a), ball is understood as a spherical object, and in (1c) as a dancing event.

(1) a. catch a ball
    b. catch a disease
    c. attend a ball

In this paper, we argue that models of word meaning relying on this procedure of vector composition are limited both in their scope and scalability. The underlying shortcoming is a failure to consider syntax in two important ways.

The syntactic relation is ignored. The first problem concerns the manner of vector composition, which ignores the relation between the target a and its context b. This relation can have a decisive influence on their interpretation, as Example (2) shows:

(2) a. a horse draws
    b. draw a horse

In (2a), the meaning of the verb draw can be paraphrased as pull, while in (2b) it is similar to sketch. This difference in meaning is due to the difference in relation: in (2a), horse is the subject, while in (2b) it is the object. On the modeling side, however, a vector combination function that ignores the relation will assign the same representation to (2a) and (2b). Thus, existing models are systematically unable to capture this class of phenomena.
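To make the first problem concrete, the following sketch (our illustration with toy numbers and invented context dimensions, not the authors' implementation) shows two common relation-insensitive composition functions; because each one sees only the two lemma vectors, (2a) and (2b) necessarily receive identical contextualized representations.

```python
# Minimal sketch (ours, not the authors' code): relation-insensitive composition.
# Toy co-occurrence vectors over four invented context dimensions,
# e.g. ("pull", "sketch", "animal", "picture").
import numpy as np

horse = np.array([3.0, 1.0, 5.0, 1.0])   # hypothetical lemma vector for "horse"
draw  = np.array([4.0, 4.0, 1.0, 3.0])   # hypothetical lemma vector for "draw"

def centroid(a, b):
    # compose by averaging the two lemma vectors
    return (a + b) / 2.0

def multiply(a, b):
    # compose by component-wise multiplication (cf. Mitchell and Lapata, 2008)
    return a * b

# Neither function sees the syntactic relation, so "a horse draws"
# (horse as subject) and "draw a horse" (horse as object) come out identical:
assert np.allclose(centroid(horse, draw), centroid(draw, horse))
assert np.allclose(multiply(horse, draw), multiply(draw, horse))
```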
Single vectors are too weak to represent phrases. The second problem arises in the context of the important open question of how semantic spaces can "scale up" to provide interesting meaning representations for entire sentences. We believe that the current vector composition methods, which result in a single vector c, are not informative enough for this purpose. One proposal for "scaling up" is to straightforwardly interpret c = a ⊙ b as the meaning of the phrase a + b (Kintsch, 2001; Mitchell and Lapata, 2008). The problem is that the vector c can only encode a fixed amount of structural information if its dimensionality is fixed, but there is no upper limit on sentence length, and hence on the amount of structure to be encoded. It is difficult to conceive how c could encode deeper semantic properties, like predicate-argument structure (distinguishing "dog bites man" and "man bites dog"), that are crucial for sentence-level semantic tasks such as the recognition of textual entailment (Dagan et al., 2006). An alternative approach to sentence meaning would be to use the vector space representation only for representing word meaning, and to represent sentence structure separately. Unfortunately, present models cannot provide this grounding either, since they compute a single vector c that provides the same representation for both the meanings of a and b in context.

In this paper, we propose a new, structured vector space model for word meaning (SVS) that addresses these problems. An SVS representation of a lemma comprises several vectors representing the word's lexical meaning as well as the selectional preferences that it has for its argument positions. The meaning of a word a in context b is computed by combining a with b's selectional preference vector specific to the relation between a and b, addressing the first problem above. In an expression a + b, the meanings of a and b in this context are computed as two separate vectors a′ and b′. These vectors can then be combined with a representation of the expression's structure (e.g., a parse tree) to address the second problem discussed above. We test the SVS model on the task of recognizing contextually appropriate paraphrases, finding that SVS performs at and above the state of the art.

Plan of the paper. Section 2 reviews related work. Section 3 presents the SVS model for word meaning in context. Sections 4 to 6 relate experiments on the paraphrase appropriateness task.

2 Related Work

In this section we give a short overview of existing vector space based approaches to computing word meaning in context.

General context effects. The first category of models aims at integrating the widest possible range of context information without recourse to linguistic structure. The best-known work in this category is Schütze (1998). He first computes "first-order" vector representations for word meaning by collecting co-occurrence counts from the entire corpus. Then, he determines "second-order" vectors for individual word instances in their context, which is taken to be a simple surface window, by summing up all first-order vectors of the words in this context. The resulting vectors form sense clusters.

McDonald and Brew (2004) present a similar model. They compute the expectation for a word w_i in a sequence by summing the first-order vectors for the words w_1 to w_{i-1}, and they show that the distance between this expectation and the first-order vector for w_i correlates with human reading times.
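As an illustration of these two window-based models, the sketch below (ours, with toy first-order vectors; not code from either cited paper) sums first-order vectors over a surface window to obtain a Schütze-style second-order token vector, and builds a McDonald-and-Brew-style expectation vector from the preceding words w_1 to w_{i-1}.

```python
# Minimal sketch (our illustration, not code from the cited papers): second-order
# context vectors in the style of Schütze (1998) and expectation vectors in the
# style of McDonald and Brew (2004), given toy first-order vectors.
import numpy as np

first_order = {                          # toy first-order (co-occurrence) vectors
    "catch":   np.array([2.0, 0.5, 1.0]),
    "a":       np.array([1.0, 1.0, 1.0]),
    "ball":    np.array([0.5, 2.0, 1.5]),
    "disease": np.array([0.1, 0.2, 3.0]),
}

def second_order(window_words):
    # Schütze-style token vector: sum of the first-order vectors of the words
    # in the surface window around the target occurrence
    return np.sum([first_order[w] for w in window_words], axis=0)

def expectation(preceding_words):
    # McDonald/Brew-style expectation for word w_i, built from w_1 .. w_{i-1}
    return np.sum([first_order[w] for w in preceding_words], axis=0)

def surprise(word, preceding_words):
    # distance between the expectation and the word's own first-order vector
    return np.linalg.norm(expectation(preceding_words) - first_order[word])

print(second_order(["catch", "a", "ball"]))    # second-order vector for the token
print(surprise("ball", ["catch", "a"]))        # smaller distance ...
print(surprise("disease", ["catch", "a"]))     # ... than for the less expected word here
```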
Predicate-argument combination. The second category of prior studies concentrates on contexts consisting of a single word only, typically modeling the combination of a predicate p and an argument a. Kintsch (2001) uses vector representations of p and a to identify the set of words that are similar to both p and a. After this set has been narrowed down in a self-inhibitory network, the meaning of the predicate-argument combination is obtained by computing the centroid of its members' vectors. The procedure does not take the relation between p and a into account.

Mitchell and Lapata (2008) propose a framework to represent the meaning of the combination p + a as a function f operating on four components:

c = f(p, a, R, K)   (3)

R is the relation holding between p and a, and K is additional knowledge. This framework allows sensitivity to the relation. However, the concrete instantiations that Mitchell and Lapata consider disregard K and R, thus sharing the other models' limitations. They focus instead on methods for the direct combination of p and a: In a comparison between component-wise addition and multiplication of p and a, they find far superior results for the multiplication approach.

Tensor product-based models. Smolensky (1990) uses the tensor product to combine two word vectors a and b into a vector c representing the expression a + b.

[...] types of information. Current kernels are mostly tree kernels that compare syntactic structure, and use semantic information mostly for smoothing syntactic similarity (Moschitti and Quarteroni, 2008). In contrast, vector-space models focus on the interaction between the lexical meaning of words in composition.

3 A structured vector space model for word meaning in context

In this section, we define the structured vector space (SVS) model of word meaning.

The main intuition behind our model is to view the interpretation of a word in context as guided by expectations about typical events. For example, in (1a), we assume that upon hearing the phrase "catch a ball", the hearer will interpret the meaning of "catch" to match typical actions that can be performed with a ball. Similarly, the interpretation of "ball" will reflect the hearer's expectations about typical things that can