A Structured Model for Word Meaning in Context

Katrin Erk
Department of Linguistics
University of Texas at Austin
[email protected]

Sebastian Padó
Department of Linguistics
Stanford University
[email protected]

Abstract

We address the task of computing vector space representations for the meaning of word occurrences, which can vary widely according to context. This task is a crucial step towards a robust, vector-based compositional account of sentence meaning. We argue that existing models for this task do not take syntactic structure sufficiently into account.

We present a novel structured vector space model that addresses these issues by incorporating the selectional preferences for words' argument positions. This makes it possible to integrate syntax into the computation of word meaning in context. In addition, the model performs at and above the state of the art for modeling the contextual adequacy of paraphrases.

1 Introduction

Semantic spaces are a popular framework for the representation of word meaning, encoding the meaning of lemmas as high-dimensional vectors. In the default case, the components of these vectors measure the co-occurrence of the lemma with context features over a large corpus. These vectors are able to provide a robust model of semantic similarity that has been used in NLP (Salton et al., 1975; McCarthy and Carroll, 2003; Manning et al., 2008) and to model experimental results in cognitive science (Landauer and Dumais, 1997; McDonald and Ramscar, 2001). Semantic spaces are attractive because they provide a model of word meaning that is independent of dictionary senses and their much-discussed problems (Kilgarriff, 1997; McCarthy and Navigli, 2007).

In a default semantic space as described above, each vector represents one lemma, averaging over all its possible usages (Landauer and Dumais, 1997; Lund and Burgess, 1996). Since the meaning of words can vary substantially between occurrences (e.g., for polysemous words), the next necessary step is to characterize the meaning of individual words in context.

There have been several approaches in the literature (Smolensky, 1990; Schütze, 1998; Kintsch, 2001; McDonald and Brew, 2004; Mitchell and Lapata, 2008) that compute meaning in context from lemma vectors. Most of these studies phrase the problem as one of vector composition: The meaning of a target occurrence a in context b is a single new vector c that is a function (for example, the centroid) of the vectors: c = a ⊙ b.

The context b can consist of as little as one word, as shown in Example (1). In (1a), the meaning of catch combined with ball is similar to grab, while in (1b), combined with disease, it can be paraphrased by contract. Conversely, verbs can influence the interpretation of nouns: In (1a), ball is understood as a spherical object, and in (1c) as a dancing event.

(1) a. catch a ball
    b. catch a disease
    c. attend a ball

In this paper, we argue that models of word meaning relying on this procedure of vector composition are limited both in their scope and scalability. The underlying shortcoming is a failure to consider syntax in two important ways.

The syntactic relation is ignored. The first problem concerns the manner of vector composition, which ignores the relation between the target a and its context b. This relation can have a decisive influence on their interpretation, as Example (2) shows:

(2) a. a horse draws
    b. draw a horse

In (2a), the meaning of the verb draw can be paraphrased as pull, while in (2b) it is similar to sketch. This difference in meaning is due to the difference in relation: in (2a), horse is the subject, while in (2b) it is the object. On the modeling side, however, a vector combination function that ignores the relation will assign the same representation to (2a) and (2b). Thus, existing models are systematically unable to capture this class of phenomena.

Single vectors are too weak to represent phrases. The second problem arises in the context of the important open question of how semantic spaces can "scale up" to provide interesting meaning representations for entire sentences. We believe that the current vector composition methods, which result in a single vector c, are not informative enough for this purpose. One proposal for "scaling up" is to straightforwardly interpret c = a ⊙ b as the meaning of the phrase a + b (Kintsch, 2001; Mitchell and Lapata, 2008). The problem is that the vector c can only encode a fixed amount of structural information if its dimensionality is fixed, but there is no upper limit on sentence length, and hence on the amount of structure to be encoded. It is difficult to conceive how c could encode deeper semantic properties, like predicate-argument structure (distinguishing "dog bites man" and "man bites dog"), that are crucial for sentence-level semantic tasks such as the recognition of textual entailment (Dagan et al., 2006). An alternative approach to sentence meaning would be to use the vector space representation only for representing word meaning, and to represent sentence structure separately. Unfortunately, present models cannot provide this grounding either, since they compute a single vector c that provides the same representation for both the meanings of a and b in context.

In this paper, we propose a new, structured vector space model for word meaning (SVS) that addresses these problems. An SVS representation of a lemma comprises several vectors representing the word's lexical meaning as well as the selectional preferences that it has for its argument positions. The meaning of word a in context b is computed by combining a with b's selectional preference vector specific to the relation between a and b, addressing the first problem above. In an expression a + b, the meanings of a and b in this context are computed as two separate vectors a′ and b′. These vectors can then be combined with a representation of the expression's structure (e.g., a parse tree), to address the second problem discussed above. We test the SVS model on the task of recognizing contextually appropriate paraphrases, finding that SVS performs at and above the state of the art.

Plan of the paper. Section 2 reviews related work. Section 3 presents the SVS model for word meaning in context. Sections 4 to 6 relate experiments on the paraphrase appropriateness task.

2 Related Work

In this section we give a short overview of existing vector space based approaches to computing word meaning in context.

General context effects. The first category of models aims at integrating the widest possible range of context information without recourse to linguistic structure. The best-known work in this category is Schütze (1998). He first computes "first-order" vector representations for word meaning by collecting co-occurrence counts from the entire corpus. Then, he determines "second-order" vectors for individual word instances in their context, which is taken to be a simple surface window, by summing up all first-order vectors of the words in this context. The resulting vectors form sense clusters.

McDonald and Brew (2004) present a similar model. They compute the expectation for a word w_i in a sequence by summing the first-order vectors for the words w_1 to w_{i-1}, and showed that the distance between expectation and first-order vector for w_i correlates with human reading times.

Predicate-argument combination. The second category of prior studies concentrates on contexts consisting of a single word only, typically modeling the combination of a predicate p and an argument a.

Kintsch (2001) uses vector representations of p and a to identify the set of words that are similar to both p and a. After this set has been narrowed down in a self-inhibitory network, the meaning of the predicate-argument combination is obtained by computing the centroid of its members' vectors. The procedure does not take the relation between p and a into account.

Mitchell and Lapata (2008) propose a framework to represent the meaning of the combination p + a as a function f operating on four components:

    c = f(p, a, R, K)    (3)

R is the relation holding between p and a, and K additional knowledge. This framework allows sensitivity to the relation. However, the concrete instantiations that Mitchell and Lapata consider disregard K and R, thus sharing the other models' limitations. They focus instead on methods for the direct combination of p and a: In a comparison between component-wise addition and multiplication of p and a, they find far superior results for the multiplication approach.

Tensor product-based models. Smolensky (1990) uses the tensor product to combine two word vectors a and b into a vector c representing the expression a + b. The vector c is located in a very high-dimensional space and is thus capable of encoding the structure of the expression; however, this makes the model infeasible in practice, as dimensionality rises with every word added to the representation. Jones and Mewhort (2007) represent lemma meaning by using circular convolution to encode n-gram co-occurrence information into vectors of fixed dimensionality. Similar to McDonald and Brew (2004), they predict most likely next words in a sequence, without taking syntax into account.

Kernel methods. One of the main tests for the quality of models of word meaning in context is the ability to predict the appropriateness of paraphrases given a context. Typically, a paraphrase applies only to some senses of a word, not all, as can be seen in the paraphrases "grab" and "contract" of "catch". Vector space models generally predict paraphrase appropriateness based on the similarity between vectors. This task can also be addressed with kernel methods, which project items into an implicit feature space for efficient similarity computation. Consequently, vector space methods and kernel methods have both been used for NLP tasks based on similarity, notably Textual Entailment. Nevertheless, they place their emphasis on different types of information. Current kernels are mostly tree kernels that compare syntactic structure, and use semantic information mostly for smoothing syntactic similarity (Moschitti and Quarteroni, 2008). In contrast, vector-space models focus on the interaction between the lexical meaning of words in composition.

3 A structured vector space model for word meaning in context

In this section, we define the structured vector space (SVS) model of word meaning.

The main intuition behind our model is to view the interpretation of a word in context as guided by expectations about typical events. For example, in (1a), we assume that upon hearing the phrase "catch a ball", the hearer will interpret the meaning of "catch" to match typical actions that can be performed with a ball. Similarly, the interpretation of "ball" will reflect the hearer's expectations about typical things that can be caught. This move to include typical arguments and predicates into a model of word meaning can be motivated both on cognitive and linguistic grounds.

In cognitive science, the central role of expectations about typical events for human language processing is well-established. Expectations affect reading times (McRae et al., 1998), the interpretation of participles (Ferretti et al., 2003), and sentence processing generally (Narayanan and Jurafsky, 2002; Padó et al., 2006). Expectations exist both for verbs and nouns (McRae et al., 1998; McRae et al., 2005).

In linguistics, expectations, in the form of selectional restrictions and selectional preferences, have long been used in semantic theories (Katz and Fodor, 1964; Wilks, 1975), and more recently induced from corpora (Resnik, 1996; Brockmann and Lapata, 2003). Attention has mostly been limited to selectional preferences of verbs, which have been used for example for syntactic disambiguation (Hindle and Rooth, 1993), word sense disambiguation (McCarthy and Carroll, 2003) and semantic role labeling (Gildea and Jurafsky, 2002). Recently, a vector-space model of selectional preferences has been proposed that computes the typicality of an argument simply through similarity to previously seen arguments (Erk, 2007; Padó et al., 2007).

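The SVS model defined below reuses the direct combination operation ⊙ from the models reviewed in Section 2. As a minimal sketch (ours, not taken from any of the cited papers, with invented toy vectors), the two instantiations compared by Mitchell and Lapata, component-wise addition and multiplication, look as follows.

import numpy as np

def compose_add(p, a):
    # Additive composition: c = p + a
    return p + a

def compose_mult(p, a):
    # Component-wise multiplicative composition: c = p * a
    return p * a

# Invented toy co-occurrence vectors for a predicate and an argument.
p = np.array([0.2, 0.0, 0.7, 0.1])   # e.g. a vector for "catch"
a = np.array([0.5, 0.3, 0.6, 0.0])   # e.g. a vector for "ball"

print(compose_add(p, a))    # additive combination
print(compose_mult(p, a))   # multiplicative combination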
We first present the SVS model of word meaning that integrates lexical information with selectional preferences. Then, we show how the SVS model provides a new way of computing meaning in context.

[Figure 1: Structured meaning representations for noun ball and verb catch: lexical information plus expectations]

[Figure 2: Combining predicate and argument via relation-specific semantic expectations]

Representing lemma meaning. We abandon the traditional choice of representing word meaning as a single vector. Instead, we encode each word as a combination of (a) one vector that models the lexical meaning of the word, and (b) a set of vectors, each of which represents the semantic expectations/selectional preferences for one particular relation that the word supports. (Footnote 1: We do not commit to a particular set of relations; see the discussion at the end of this section.)

The idea is illustrated in Fig. 1. In the representation of the verb catch, the central square stands for the lexical vector of catch itself. The three arrows link it to catch's preferences for its subjects (subj), its objects (obj), and for verbs for which it appears as a complement (comp^-1). The figure shows the selectional preferences as word lists for readability; in practice, each selectional preference is a single vector (cf. Section 4). Likewise, ball is represented by one vector for ball itself, one for ball's preferences for its modifiers (mod), one vector for the verbs of which it is a subject (subj^-1), and one for the verbs of which it is an object (obj^-1).

This representation includes selectional preferences (like subj, obj, mod) exactly parallel to inverse selectional preferences (subj^-1, obj^-1, comp^-1). To our knowledge, preferences of the latter kind have not been studied in computational linguistics. However, their existence is supported in psycholinguistics by priming effects from nouns to typical verbs (McRae et al., 2005).

Formally, let D be a vector space (the set of possible vectors), and let R be some set of relation labels. In the structured vector space (SVS) model, we represent the meaning of a lemma w as a triple

    w = (v, R, R^-1)

where v ∈ D is a lexical vector describing the word w itself, R : R → D maps each relation label onto a vector that describes w's selectional preferences, and R^-1 : R → D maps from role labels to vectors describing inverse selectional preferences of w. Both R and R^-1 are partial functions. For example, the direct object preference would be undefined for intransitive verbs.

Computing meaning in context. The SVS model of lemma meaning permits us to compute the meaning of a word a in the context of another word b in a new way, via their selectional preferences. Let (v_a, R_a, R_a^-1) and (v_b, R_b, R_b^-1) be the representations of the two words, and let r ∈ R be the relation linking a to b. Then, we define the meaning of a and b in this context as a pair (a′, b′) of vectors, where a′ is the meaning of a in the context of b, and b′ the meaning of b in the context of a:

    a′ = (v_a ⊙ R_b^-1(r), R_a − {r}, R_a^-1)
    b′ = (v_b ⊙ R_a(r), R_b, R_b^-1 − {r})    (4)

where v_1 ⊙ v_2 is a direct vector combination function as in traditional models, e.g. addition or component-wise multiplication. If either R_a(r) or R_b^-1(r) is not defined, the combination fails. Afterwards, the argument position r is considered filled, and is deleted from R_a and R_b^-1.
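As an illustrative sketch only (our own reading of the definitions above, not the authors' implementation), the triple representation and the combination of Eq. (4) can be written as follows; the relation labels, toy vectors, and the choice of component-wise multiplication for ⊙ are assumptions made for the example.

import numpy as np

class SVS:
    """Structured vector space entry: lexical vector plus preference vectors."""
    def __init__(self, v, prefs=None, inv_prefs=None):
        self.v = v                          # lexical vector
        self.R = dict(prefs or {})          # relation label -> selectional preference vector
        self.R_inv = dict(inv_prefs or {})  # relation label -> inverse preference vector

def combine(a, b, r, op=np.multiply):
    """Meaning of a and b linked by relation r (Eq. 4); returns (a', b').
    The combination fails if R_a(r) or R_b^-1(r) is undefined."""
    if r not in a.R or r not in b.R_inv:
        raise ValueError("relation %r not defined for both words" % r)
    a_new = SVS(op(a.v, b.R_inv[r]),
                {k: v for k, v in a.R.items() if k != r},   # argument position r is now filled
                a.R_inv)
    b_new = SVS(op(b.v, a.R[r]),
                b.R,
                {k: v for k, v in b.R_inv.items() if k != r})
    return a_new, b_new

# Invented toy vectors and relation labels.
catch = SVS(np.array([0.6, 0.1, 0.3]),
            prefs={"obj": np.array([0.8, 0.1, 0.4])},        # what "catch" expects as its object
            inv_prefs={"comp": np.array([0.2, 0.5, 0.1])})
ball = SVS(np.array([0.5, 0.2, 0.7]),
           prefs={"mod": np.array([0.3, 0.3, 0.2])},
           inv_prefs={"obj": np.array([0.7, 0.2, 0.5])})     # verbs that take "ball" as object

catch_in_context, ball_in_context = combine(catch, ball, "obj")
print(catch_in_context.v)   # v_catch combined with ball's obj^-1 preference
print(ball_in_context.v)    # v_ball combined with catch's obj preference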

Figure 2 illustrates this procedure on the representations from Figure 1. The dotted lines indicate that the lexical vector for catch is combined with the inverse object preference of ball. Likewise, the lexical vector for ball is combined with the object preference vector of catch.

Note that our procedure for computing meaning in context can be expressed within the framework of Mitchell and Lapata (Eq. (3)). We can encode the expectations of a and b as additional knowledge K. The combined representation c is the pair (a′, b′) that is computed according to our model (Eq. (4)).

The SVS scheme we have proposed incorporates syntactic information in a more general manner than previous models, and thus addresses the issues we have discussed in Section 1. Since the representation retains individual selectional preferences for all relations, combining the same words through different relations can (and will in general) result in different adapted representations. For instance, in the case of Example (2), we would expect the inverse subject preference of horse ("things that a horse typically does") to push the lexical vector of draw into the direction of pulling, while its inverse object preference ("things that are done to horses") suggests a different interpretation.

Rather than yielding a single, joint vector for the whole expression, our procedure for computing meaning in context results in one context-adapted meaning representation per word, similar to the output of a WSD system. As a consequence, our model can be combined with any formalism representing the structure of an expression. (The formalism used then determines the set R of relations.) For example, combining SVS with a dependency tree would yield a tree in which each node is labeled by an SVS tuple that represents the word's meaning in context.

4 Experimental setup

This section provides the background to the following experimental evaluation of SVS, including the parameters used for computing the SVS representations that will be used in the experiments.

4.1 Experimental rationale

In this paper, we evaluate the SVS model against the task of predicting, given a predicate-argument pair, how appropriate a paraphrase (of either the predicate or the argument) is in that context. We perform two experiments that both use the paraphrase task, but differ in their emphasis. Experiment 1 replicates an existing evaluation against human judgments. This evaluation uses a synthetic dataset, limited to one particular construction, and constructed to provide maximally distinct paraphrase candidates. Experiment 2 considers a broader class of constructions along with annotator-generated paraphrase candidates that are not screened for distinctness. In both experiments, we compare the SVS model against the state-of-the-art model by Mitchell and Lapata (2008) (henceforth M&L; cf. Sec. 2 for model details).

4.2 Parameter choices

Vector space. In our parameterization of the vector space, we largely follow M&L because their model has been rigorously evaluated and found to outperform a range of other models.

Our first space is a traditional "bag-of-words" vector space (BOW; Lund and Burgess, 1996). For each pair of a target word and context word, the BOW space records a function of their co-occurrence frequency within a surface window of size 10. The space is constructed from the British National Corpus (BNC), and uses the 2,000 most frequent context words as dimensions.

We also consider a "dependency-based" vector space (SYN; Padó and Lapata, 2007). In this space, target and context words have to be linked by a "valid" dependency path in a dependency graph to count as co-occurring. (Footnote 2: More specifically, we used the minimal context specification and plain weight function. See Padó and Lapata (2007).) This space was built from BNC dependency parses obtained from Minipar (Lin, 1993).

For both spaces, we used pre-experiments to compare two methods for the computation of vector components, namely raw co-occurrence counts, the standard model, and the pointwise mutual information (PMI) definition employed by M&L.

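For illustration, the following is a minimal sketch (ours; the actual spaces were built from the full BNC with 2,000 context-word dimensions, and the exact window handling is an assumption here) of collecting bag-of-words co-occurrence counts within a symmetric surface window.

from collections import Counter, defaultdict

def bow_counts(tokens, context_words, window=5):
    # For every target token, count context words within +/- `window` tokens.
    counts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in context_words:
                counts[target][tokens[j]] += 1
    return counts

toy_corpus = "the fielder tried to catch the ball before the ball hit the ground".split()
dimensions = {"catch", "ball", "fielder", "ground"}
space = bow_counts(toy_corpus, dimensions, window=5)
print(space["catch"])   # co-occurrence counts for "catch" over the chosen dimensions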
Selectional preferences. We use a simple, knowledge-lean representation for selectional preferences inspired by Erk (2007), who models selectional preference through similarity to seen filler vectors v_a: We compute the selectional preference vector for word b and relation r as the weighted centroid of seen filler vectors v_a. We collect seen fillers from the Minipar parse of the BNC.

Let f(a, r, b) denote the frequency of a occurring in relation r to b in the parsed BNC, then

    R_b(r)_SELPREF = Σ_{a : f(a,r,b) > 0} f(a, r, b) · v_a    (5)

We call this base model SELPREF. We will also study two variants of SELPREF, based on two different hypotheses about what properties of the selectional preferences are particularly important for meaning adaptation. The first model aims specifically at alleviating noise introduced by infrequent fillers, a common problem in data-driven approaches. It only uses fillers seen more often than a threshold θ. We call this model SELPREF-CUT:

    R_b(r)_SELPREF-CUT = Σ_{a : f(a,r,b) > θ} f(a, r, b) · v_a    (6)

Our second variant again aims at alleviating noise, but noise introduced by low-valued dimensions rather than infrequent fillers. It achieves this by taking each component of the selectional preference vector to the nth power. In this manner, dimensions with high counts are further inflated, while dimensions with low counts are depressed. (Footnote 3: Since we focus on the size-invariant cosine similarity, the use of this model does not require normalization.) This model, SELPREF-POW, is defined as follows: If R_b(r)_SELPREF = ⟨v_1, ..., v_m⟩, then

    R_b(r)_SELPREF-POW = ⟨v_1^n, ..., v_m^n⟩    (7)

The inverse selectional preferences R_b^-1 are defined analogously for all three model variants. We instantiate the vector combination function ⊙ as component-wise multiplication, following M&L.

Baselines and significance testing. All tasks that we consider below involve judgments for the meaning of a word a in the context of a word b. A first baseline that every model must beat is simply using the original vector for a. We call this baseline "target only". Since we assume that the selectional preferences of b model the expectations for a, we use b's selectional preference vector for the given relation as a second baseline, "selpref only". Differences between the performance of models were tested for significance using a stratified shuffling-based randomization test (Yeh, 2000). (Footnote 4: The software is available at http://www.nlpado.de/~sebastian/sigf.html.)
5 Exp. 1: Predicting similarity ratings

In our first experiment, we attempt to predict human similarity judgments. This experiment is a replication of the evaluation of M&L on their dataset. (Footnote 5: We thank J. Mitchell and M. Lapata for providing their data.)

Dataset. The M&L dataset comprises a total of 3,600 human similarity judgements for 120 experimental items. Each item, as shown in Figure 3, consists of an intransitive verb and a subject noun that are combined with a "landmark", a synonym of the verb that is chosen to be either similar or dissimilar to the verb in the context of the given subject.

    verb     subject    landmark   sim    judgment
    slump    shoulder   slouch     high   7
    slump    shoulder   decline    low    2
    slump    value      slouch     low    3
    slump    value      decline    high   7

    Figure 3: Experiment 1: Human similarity judgements for a subject-verb pair with high- and low-similarity landmarks

The dataset was constructed by extracting pairs of subjects and intransitive verbs from a parsed version of the BNC. Each item was paired with two landmarks, chosen to be as dissimilar as possible according to a WordNet similarity measure. All nouns and verbs were subjected to a pretest, where only those with highly significant variations in human judgments across landmarks were retained.

For each item of the final dataset, judgements on a 7-point scale were elicited. For example, judges considered the compatible landmark "slouch" to be much more similar to "shoulder slumps" than the incompatible landmark "decline". In Figure 3, the sim column shows whether the experiment designers considered the respective landmark to have high or low similarity to the verb, and the judgment column shows a participant's judgments.

Experimental procedure. We used cosine to compute similarity to the lexical vector of the landmark. "Target only" compares the landmark against the lexical vector of the verb, and "selpref only" compares it to the noun's subj^-1 preference. For the M&L model, the comparison is to the combined lexical vectors of verb and noun. For our models SELPREF, SELPREF-CUT and SELPREF-POW, we combine the verb's lexical vector with the subj^-1 preference of the noun. We used a held-out dataset of 10% of the data to optimize the parameters θ of SELPREF-CUT and n of SELPREF-POW. Vectors with PMI components could model the data, while raw frequency components could not; we report only the former.

We use the same two evaluation scores as M&L: The first score is the average similarity to compatible landmarks (high) and incompatible landmarks (low). The second is Spearman's ρ, a nonparametric correlation coefficient. We compute ρ between individual human similarity scores and our predictions. Based on agreement between human judges, M&L estimate an upper bound ρ of 0.4 for the dataset.

    Model                  high   low    ρ
    BOW space
    Target only            0.32   0.32   0.0
    Selpref only           0.46   0.4    0.06**
    M&L                    0.25   0.15   0.20**
    SELPREF                0.32   0.26   0.12**
    SELPREF-CUT, θ=10      0.31   0.24   0.11**
    SELPREF-POW, n=20      0.11   0.03   0.27**
    Upper bound            –      –      0.4
    SYN space
    Target only            0.2    0.2    0.08**
    Selpref only           0.27   0.21   0.16**
    M&L                    0.13   0.06   0.24**
    SELPREF                0.22   0.16   0.13**
    SELPREF-CUT, θ=10      0.2    0.13   0.13**
    SELPREF-POW, n=30      0.08   0.04   0.22**
    Upper bound            –      –      0.4

    Table 1: Experiment 1: Mean cosine similarity for items with high- and low-similarity landmarks; correlation with human judgements (ρ). (**: p < 0.01)

    Model               lex. vector   obj^-1 selpref
    SELPREF             0.23 (0.09)   0.88 (0.07)
    SELPREF-CUT (10)    0.20 (0.10)   0.72 (0.18)
    SELPREF-POW (30)    0.03 (0.08)   0.52 (0.48)

    Table 2: Experiment 1: Average similarity (and standard deviation) between the inverse subject preferences of a noun and (left) its lexical vector and (right) its inverse object preferences vector (cosine similarity in SYN space)

Results and discussion. Table 1 shows the results of Exp. 1 on the test set. In the upper half (BOW), we replicate M&L's main finding that simple component-wise multiplication of the predicate and argument vectors results in a highly significant correlation of ρ = 0.2, significantly outperforming both baselines. It is interesting, though, that the subj^-1 preference itself ("Selpref only") is already highly significantly correlated with the human judgments.

A comparison of the upper half (BOW) with the lower half (SYN) shows that the dependency-based space generally shows better correlation with human judgements. This corresponds to a beneficial effect of syntactic information found for other applications of semantic spaces (Lin, 1998; Padó and Lapata, 2007).

All instances of the SELPREF model show highly significant correlations. SELPREF and SELPREF-CUT show very similar performance. They do better than both baselines in the BOW space; however, in the cleaner SYN space, their performance is numerically lower than using selectional preferences only (ρ = 0.13 vs. 0.16). SELPREF-POW is always significantly better than SELPREF and SELPREF-CUT, and shows the best result of all tested models (ρ = 0.27, BOW space). The performance is somewhat lower in the SYN space (ρ = 0.22). However, this difference, and the difference to the best M&L model at ρ = 0.24, are not statistically significant.

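For concreteness, this is a small sketch (ours, with invented numbers) of the two evaluation scores described in the experimental procedure above: mean predicted similarity for high- and low-similarity landmarks, and Spearman's ρ between predictions and individual human judgments.

import numpy as np
from scipy.stats import spearmanr

# Invented example: model predictions (cosine similarities) and human ratings
# for six individual judgments; is_high marks the designed landmark condition.
predictions = [0.41, 0.38, 0.12, 0.45, 0.09, 0.33]
judgments   = [7,    6,    2,    7,    1,    5]
is_high     = [True, True, False, True, False, True]

mean_high = np.mean([p for p, h in zip(predictions, is_high) if h])
mean_low  = np.mean([p for p, h in zip(predictions, is_high) if not h])
rho, _ = spearmanr(predictions, judgments)

print(mean_high, mean_low, rho)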
The SVS model computes meaning in context by combining a word's lexical representation with the preference vector of its context. In this, it differs from previous models, including that by M&L, which used what we have been calling "direct combination". So it is important to ask to what extent this difference in method translates to a difference in predictions. We analyzed this by measuring the similarity between the nouns' lexical vectors, used by direct combination methods, and their inverse subject preferences, which SVS uses. The result is shown in the first column in Table 2, computed as mean cosine similarities and standard deviations between noun vectors and selectional preferences. The table shows that these vectors have generally low similarity, which is further reduced by applying cutoff and potentiation. Thus, the predictions of SVS will differ from those of direct combination models like M&L.

A related question is whether syntax-aware vector combination makes a difference: Does the model encode different expectations for different syntactic relations (cf. Example 2)? The second column of Table 2 explores this question by comparing inverse selectional preferences for the subject and object slots. We observe that the similarity is very high for raw preferences, but becomes lower when noise is eliminated. Since the SELPREF-POW model performed best in our evaluation, we read this as evidence that potentiation helps to suppress noise introduced by mis-identified subject and object fillers.

In Experiment 1, all experimental items were verbs, which means that all disambiguation was done through inverse selectional preferences. As inverse selectional preferences are currently largely unexplored, it is interesting to note that the evidence that they provide for the paraphrase task is as strong as that of the context nouns themselves.

6 Exp. 2: Ranking paraphrases

This section reports on a second, more NLP-oriented experiment whose task is to distinguish between appropriate and inappropriate paraphrases on a broader range of constructions.

Dataset. For this experiment, we use the SemEval-1 lexical substitution (lexsub) dataset (McCarthy and Navigli, 2007), which contains 10 instances each of 200 target words in sentential contexts, drawn from Sharoff's (2006) English Internet Corpus. Contextually appropriate paraphrases for each instance of each target word were elicited from up to 6 participants. Fig. 4 shows two instances for the verb to work. The distribution over paraphrases can be seen as a characterization of the target word's meaning in each context.

    Sentence: "By asking people who work there, I have since determined that he didn't." (# 2002)
    Substitutes: be employed 4; labour 1

    Sentence: "Remember how hard your ancestors worked." (# 2005)
    Substitutes: toil 4; labour 3; task 1

    Figure 4: Lexical substitution example items for "work"

Experimental procedure. In this paper, we predict appropriate paraphrases solely on the basis of a single context word that stands in a direct predicate-argument relation to the target word. We extracted all instances from the lexsub test data with such a relation. After parsing all sentences with verbal and nominal targets with Minipar, this resulted in three sets of sentences: (a) target intransitive verbs with noun subjects (V-SUBJ, 48 sentences); (b) target transitive verbs with noun objects (V-OBJ, 213 sent.); and (c) target nouns occurring as objects of verbs (N-OBJ, 102 sent.). (Footnote 6: The specification of this dataset will be made available.) Note that since we use only part of the lexical substitution dataset in this experiment, a direct comparison with results from the SemEval task is not possible.

As in the original SemEval task, we phrase the task as a ranking problem. For each target word, the paraphrases given for all 10 instances are pooled. The task is to rank the list for each item so that appropriate paraphrases (such as "be employed" for # 2002) rank higher than paraphrases not given (e.g., "toil").

Our model ranks paraphrases by their similarity to the following combinations (Eq. (4)): for V-SUBJ, the verb plus the noun's subj^-1 preferences; for V-OBJ, the verb plus the noun's obj^-1 preferences; and for N-OBJ, the noun plus the verb's obj preferences. Our comparison model, M&L, ranks all paraphrases by their similarity to the direct noun-verb combination. To avoid overfitting, we consider only the two models that performed optimally in the SYN space in Experiment 1 (SELPREF-POW with n=30 and M&L). However, since we found that vectors with raw frequency components could model the data, while PMI components could not, we only report the former.

For evaluation, we adopt the SemEval "out of ten" precision metric P_OOT. It uses the model's ten top-ranked paraphrases as its guesses for appropriate paraphrases. Let G_i be the gold paraphrases for item i, M_i the model's top ten paraphrases for i, and f(s, i) the frequency of s as paraphrase for i:

    P_OOT = 1/|I| · Σ_i ( Σ_{s ∈ M_i ∩ G_i} f(s, i) ) / ( Σ_{s ∈ G_i} f(s, i) )    (8)

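The following sketch (ours; the gold counts are invented, loosely following Fig. 4) implements the out-of-ten precision of Eq. (8): for each item, the gold frequencies of the model's ten top-ranked candidates that are also gold paraphrases, divided by the total gold frequency, averaged over all items.

def p_oot(items):
    # items: list of (gold, ranked) pairs, where gold maps each gold paraphrase s
    # of item i to its frequency f(s, i), and ranked is the model's ranked
    # candidate list; only the ten top-ranked candidates count as guesses.
    scores = []
    for gold, ranked in items:
        top10 = set(ranked[:10])
        hit = sum(freq for s, freq in gold.items() if s in top10)
        scores.append(hit / sum(gold.values()))
    return sum(scores) / len(items)

# Invented toy item in the spirit of Fig. 4 (# 2002, target "work").
gold_2002 = {"be employed": 4, "labour": 1}
ranked_2002 = ["be employed", "toil", "task", "function", "operate",
               "do", "act", "serve", "run", "perform", "labour"]
print(p_oot([(gold_2002, ranked_2002)]))   # 4 / 5 = 0.8 ("labour" is ranked outside the top ten)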
McCarthy and Navigli propose this metric for the dataset for robustness. Due to the sparsity of paraphrases, a metric that considers fewer guesses leads to artificially low results when a "good" paraphrase was not mentioned by the annotators by chance but is ranked highly by a model.

Results and discussion. Table 3 shows the mean out-of-ten precision for all models. The behavior is fairly uniform across all three datasets. Unsurprisingly, "target only", which uses the same ranking for all instances of a target, yields the worst results. (Footnote 7: "Target only" still does very much better than a random baseline, which performs at 22% P_OOT.)

    Model                 V-SUBJ   V-OBJ   N-OBJ
    Target only           47.9     47.4    49.6
    Selpref only          54.8     51.4    55.0
    M&L                   50.3     52.0    53.4
    SELPREF-POW, n=30     63.1     55.8    56.9

    Table 3: Experiment 2: Mean "out of ten" precision (P_OOT)

M&L's direct combination model outperforms "target only" significantly (p < 0.05). However, on both the V-SUBJ and the N-OBJ datasets the "selpref only" baseline does better than direct combination. The best results on all datasets are obtained by SELPREF-POW. The difference between SELPREF-POW and the "target only" baseline is highly significant (p < 0.01). The difference to M&L's model is significant at p = 0.05.

We interpret these results as encouraging evidence for the usefulness of selectional preferences for judging substitutability in context. Knowledge about the selectional preferences of a single context word can already lead to a significant improvement in precision. We find this overall effect even though the word is not informative in all cases. For instance, the subject of item 2002 in Fig. 4, "who", presumably helps little in determining the verb's context-adapted meaning.

It is interesting that the improvement of SELPREF-POW over "selpref only" is smallest for the N-OBJ dataset (1.9% P_OOT). N-OBJ uses selectional preferences for nouns that may fill the direct object position, while V-SUBJ and V-OBJ use inverse selectional preferences for verbs (cf. the two graphs in Fig. 1).

7 Conclusion

In this paper, we have considered semantic space models that can account for the meaning of word occurrences in context. Arguing that existing models do not sufficiently take syntax into account, we have introduced the new structured vector space (SVS) model of word meaning. In addition to a vector representing a word's lexical meaning, it contains vectors representing the word's selectional preferences. These selectional preferences play a central role in the computation of meaning in context.

We have evaluated the SVS model on two datasets on the task of predicting the felicitousness of paraphrases in given contexts. On the M&L dataset, SVS outperforms the state-of-the-art model of M&L, though the difference is not significant. On the Lexical Substitution dataset, SVS significantly outperforms the state of the art. This is especially interesting as the Lexical Substitution dataset, in contrast to the M&L data, uses "realistic" paraphrase candidates that are not necessarily maximally distinct.

The most important limitation of the evaluation that we have given in this paper is that we have only considered single words as context. Our next step will be to integrate information from multiple relations (such as both the subject and object positions of a verb) into the computation of context-specific meaning. Our eventual aim is a model that can give a compositional account of a word's meaning in context, where all words in an expression disambiguate one another according to the relations between them.

We will explore the usability of vector space models of word meaning in NLP applications, formulated as the question of how to perform inferences on them in the context of the Textual Entailment task (Dagan et al., 2006). Paraphrase-based inference rules play a large role in several recent approaches to Textual Entailment (e.g. Szpektor et al. (2008)); appropriateness judgments of paraphrases in context, the task of Experiments 1 and 2 above, can be viewed as testing the applicability of these inference rules.

Acknowledgments. Many thanks for helpful discussion to Jason Baldridge, David Beaver, Dedre Gentner, James Hampton, Dan Jurafsky, Alexander Koller, Brad Love, and Ray Mooney.

References

C. Brockmann, M. Lapata. 2003. Evaluating and combining approaches to selectional preference acquisition. In Proceedings of EACL, 27–34.

I. Dagan, O. Glickman, B. Magnini. 2006. The PASCAL Recognising Textual Entailment Challenge. In Machine Learning Challenges, Lecture Notes in Computer Science, 177–190. Springer.

K. Erk. 2007. A simple, similarity-based model for selectional preferences. In Proceedings of ACL, 216–223.

T. Ferretti, C. Gagné, K. McRae. 2003. Thematic role focusing by participle inflections: evidence from conceptual combination. Journal of Experimental Psychology, 29(1):118–127.

D. Gildea, D. Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.

D. Hindle, M. Rooth. 1993. Structural ambiguity and lexical relations. Computational Linguistics, 19(1):103–120.

M. Jones, D. Mewhort. 2007. Representing word meaning and order information in a composite holographic lexicon. Psychological Review, 114:1–37.

J. J. Katz, J. A. Fodor. 1964. The structure of a semantic theory. In The Structure of Language. Prentice-Hall.

A. Kilgarriff. 1997. I don't believe in word senses. Computers and the Humanities, 31(2):91–113.

W. Kintsch. 2001. Predication. Cognitive Science, 25:173–202.

T. Landauer, S. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240.

D. Lin. 1993. Principle-based parsing without overgeneration. In Proceedings of ACL, 112–120.

D. Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of COLING-ACL, 768–774.

K. Lund, C. Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, and Computers, 28:203–208.

C. D. Manning, P. Raghavan, H. Schütze. 2008. Introduction to Information Retrieval. CUP.

D. McCarthy, J. Carroll. 2003. Disambiguating nouns, verbs, and adjectives using automatically acquired selectional preferences. Computational Linguistics, 29(4):639–654.

D. McCarthy, R. Navigli. 2007. SemEval-2007 Task 10: English Lexical Substitution Task. In Proceedings of SemEval, 48–53.

S. McDonald, C. Brew. 2004. A distributional model of semantic context effects in lexical processing. In Proceedings of ACL, 17–24.

S. McDonald, M. Ramscar. 2001. Testing the distributional hypothesis: The influence of context on judgements of semantic similarity. In Proceedings of CogSci, 611–616.

K. McRae, M. Spivey-Knowlton, M. Tanenhaus. 1998. Modeling the influence of thematic fit (and other constraints) in on-line sentence comprehension. Journal of Memory and Language, 38:283–312.

K. McRae, M. Hare, J. Elman, T. Ferretti. 2005. A basis for generating expectancies for verbs from nouns. Memory and Cognition, 33(7):1174–1184.

J. Mitchell, M. Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL, 236–244.

A. Moschitti, S. Quarteroni. 2008. Kernels on linguistic structures for answer extraction. In Proceedings of ACL, 113–116, Columbus, OH.

S. Narayanan, D. Jurafsky. 2002. A Bayesian model predicts human parse preference and reading time in sentence processing. In Proceedings of NIPS, 59–65.

S. Padó, M. Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199.

U. Padó, F. Keller, M. W. Crocker. 2006. Combining syntax and thematic fit in a probabilistic model of sentence processing. In Proceedings of CogSci, 657–662.

S. Padó, U. Padó, K. Erk. 2007. Flexible, corpus-based modelling of human plausibility judgements. In Proceedings of EMNLP/CoNLL, 400–409.

P. Resnik. 1996. Selectional constraints: An information-theoretic model and its computational realization. Cognition, 61:127–159.

G. Salton, A. Wang, C. Yang. 1975. A vector-space model for information retrieval. Journal of the American Society for Information Science, 18:613–620.

H. Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–124.

S. Sharoff. 2006. Open-source corpora: Using the net to fish for linguistic data. International Journal of Corpus Linguistics, 11(4):435–462.

P. Smolensky. 1990. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46:159–216.

I. Szpektor, I. Dagan, R. Bar-Haim, J. Goldberger. 2008. Contextual preferences. In Proceedings of ACL, 683–691, Columbus, OH.

Y. Wilks. 1975. Preference semantics. In Formal Semantics of Natural Language. CUP.

A. Yeh. 2000. More accurate tests for the statistical significance of result differences. In Proceedings of COLING, 947–953.
