
Multi-Modal Models for Concrete and Abstract Concept Meaning

Felix Hill (Computer Laboratory, University of Cambridge) [email protected]
Roi Reichart (Technion - IIT, Haifa, Israel) [email protected]
Anna Korhonen (Computer Laboratory, University of Cambridge) [email protected]

Abstract

Multi-modal models that learn semantic representations from both linguistic and perceptual input outperform language-only models on a range of evaluations, and better reflect human concept acquisition. Most perceptual input to such models corresponds to concrete noun concepts and the superiority of the multi-modal approach has only been established when evaluating on such concepts. We therefore investigate which concepts can be effectively learned by multi-modal models. We show that concreteness determines both which linguistic features are most informative and the impact of perceptual input in such models. We then introduce ridge regression as a means of propagating perceptual information from concrete nouns to more abstract concepts that is more robust than previous approaches. Finally, we introduce weighted gram matrix combination, a means of combining representations from distinct modalities that outperforms alternatives when both modalities are sufficiently rich.

1 Introduction

What is needed to learn the meaning of a word? Children learning words are exposed to a diverse mix of information sources. These include clues in the language itself, such as nearby words or cues from the speaker, but also what the child perceives about the world around it when the word is heard. Learning the meaning of words requires not only a sensitivity to both linguistic and perceptual input, but also the ability to process and combine information from these modalities in a productive way.

Many computational semantic models represent words as real-valued vectors, encoding their relative frequency of occurrence in particular forms and contexts in linguistic corpora (Sahlgren, 2006; Turney et al., 2010). Motivated both by parallels with human language acquisition and by evidence that many word meanings are grounded in the perceptual system (Barsalou et al., 2003), recent research has explored the integration into text-based models of input that approximates the visual or other sensory modalities (Silberer and Lapata, 2012; Bruni et al., 2014). Such models can learn higher-quality semantic representations than conventional corpus-only models, as evidenced by a range of evaluations.

However, the majority of perceptual input for the models in these studies corresponds directly to concrete noun concepts, such as chocolate or cheeseburger, and the superiority of the multi-modal over the corpus-only approach has only been established when evaluations include such concepts (Leong and Mihalcea, 2011; Bruni et al., 2012; Roller and Schulte im Walde, 2013; Silberer and Lapata, 2012). It is thus unclear if the multi-modal approach is effective for more abstract words, such as guilt or obesity. Indeed, since empirical evidence indicates differences in the representational frameworks of both concrete and abstract concepts (Paivio, 1991; Hill et al., 2013), and verb and noun concepts (Markman and Wisniewski, 1997), perceptual information may not fulfill the same role in the representation of the various concept types. This potential challenge to the multi-modal approach is of particular practical importance since concrete nouns constitute only a small proportion of the open-class, meaning-bearing words in everyday language (Section 2).


In light of these considerations, this paper addresses three questions: (1) Which information sources (modalities) are important for acquiring concepts of different types? (2) Can perceptual input be propagated effectively from concrete to more abstract words? (3) What is the best way to combine information from the different sources?

We construct models that acquire semantic representations for four sets of concepts: concrete nouns, abstract nouns, concrete verbs and abstract verbs. The linguistic input to the models comes from the recently released Google Syntactic N-Grams Corpus (Goldberg and Orwant, 2013), from which a selection of linguistic features are extracted. Perceptual input is approximated by data from the McRae et al. (2005) norms, which encode perceptual properties of concrete nouns, and the ESP-Game dataset (Von Ahn and Dabbish, 2004), which contains manually generated descriptions of 100,000 images.

To address (1) we extract representations for each concept type from combinations of information sources. We first focus on different classes of linguistic features, before extending our models to the multi-modal context. While linguistic information overall effectively reflects the meaning of all concept types, we show that features encoding syntactic information are only valuable for the acquisition of abstract concepts. On the other hand, perceptual information, whether directly encoded or propagated through the model, plays a more important role in the representation of concrete concepts.

In addressing (2), we propose ridge regression (Myers, 1990) as a means of propagating features from concrete nouns to more abstract concepts. The regularization term in ridge regression encourages solutions that generalize well across concept types. We show that ridge regression effectively propagates perceptual information to abstract nouns and concrete verbs, and is overall preferable to both linear regression and the method of Johns and Jones (2012) applied to a similar task by Silberer and Lapata (2012). However, for all propagation methods, the impact of integrating perceptual information depends on the concreteness of the target concepts. Indeed, for abstract verbs, the most abstract concept type in our evaluations, perceptual input actually degrades representation quality. This highlights the need to consider the concreteness of the target domain when constructing multi-modal models.

To address (3), we present various means of combining information from different modalities. We propose weighted gram matrix combination, a technique in which representations of distinct modalities are mapped to a space of common dimension where coordinates reflect proximity to other concepts. This transformation, which has been shown to enhance semantic representations in the context of verb clustering (Reichart and Korhonen, 2013), reduces representation sparsity and facilitates a product-based combination that results in greater inter-modal dependency. Weighted gram matrix combination outperforms alternatives such as concatenation and Canonical Correlation Analysis (CCA) (Hardoon et al., 2004) when combining representations from two similarly rich information sources.

In Section 3, we present experiments with linguistic features designed to address question (1). These analyses are extended to multi-modal models in Section 4, where we also address (2) and (3). We first discuss the relevance of concreteness and part-of-speech (lexical function) to concept representation.

2 Concreteness and Word Meaning

A large and growing body of psychological evidence indicates differences between abstract and concrete concepts.[1] It has been shown that concrete words are more easily learned, remembered and processed than abstract words (Paivio, 1991; Schwanenflugel and Shoben, 1983), while neuroimaging studies demonstrate differences in brain activity when subjects are presented with stimuli corresponding to the two concept types (Binder et al., 2005).

[1] Here concreteness is understood intuitively, as per the psychological literature (Rosen, 2001; Gallese and Lakoff, 2005).

The abstract/concrete distinction is important to computational semantics for various reasons. While many models construct representations of concrete words (Andrews et al., 2009; Landauer and Dumais, 1997), abstract words are in fact far more common in everyday language. For instance, based on an analysis of those noun concepts in the University of South Florida dataset (USF) and their occurrence in the British National Corpus (BNC) (Leech et al., 1994), 72% of noun tokens in corpora are rated by human judges as more abstract than the noun war, a concept that many would already consider quite abstract.[2]

[2] This sample covers 15.2% of all noun tokens in the BNC.

[Figure 1 appears here: boxplots of average concreteness rating (scale 0-6) for the noun and verb concepts in the USF data; example nouns marked on the plot include mood, rule, praise, beam, clam, sardine and penguin, and example verbs include believe, enjoy, leave, look, stab and swing.]

Figure 1: Boxplot of concreteness distributions for noun and verb concepts in the USF data, with selected example concepts. The bold vertical line is the mean, boxes extend from the first to the third quartile, and dots represent outliers.

The recent interest in multi-modal semantics further motivates a principled modelling approach to lexical concreteness. Many multi-modal models implicitly distinguish concrete and abstract concepts since their perceptual input corresponds only to concrete words (Bruni et al., 2012; Silberer and Lapata, 2012; Roller and Schulte im Walde, 2013). However, given that many abstract concepts express relations or modifications of concrete concepts (Gentner and Markman, 1997), it is reasonable to expect that perceptual information about concrete concepts could also enhance the quality of more abstract representations in an appropriately constructed model.

Moreover, concreteness is closely related to more functional lexical distinctions, such as those between adjectives, nouns and verbs. An analysis of the USF dataset, which includes concreteness ratings for over 4,000 words collected from thousands of participants, indicates that on average verbs (mean concreteness, 3.64) are considered more abstract than nouns (mean concreteness, 4.91), an effect illustrated in Figure 1. This connection between lexical function and concreteness suggests that a sensitivity to concreteness could improve models that already make principled distinctions between words based on their part-of-speech (POS) (Im Walde, 2006; Baroni and Zamparelli, 2010).

Although the focus of this paper is on multi-modal models, few conventional semantic models make principled distinctions between concepts based on function or concreteness. Before turning to the multi-modal case, we thus investigate whether these distinctions are pertinent to text-only models.

3 Concreteness and Linguistic Features

It has long been known that aspects of word meaning can be inferred from nearby words in corpora. Approaches that exploit this fact are often called distributional models (Sahlgren, 2006; Turney et al., 2010). We take a distributional approach to learning linguistic representations. The advantage of using distributional methods to learn representations from corpora versus approaches that rely on knowledge bases (Pedersen et al., 2004; Leong and Mihalcea, 2011) is that they are more scalable, easily applicable across languages and plausibly reflect the process of human word learning (Landauer and Dumais, 1997; Griffiths et al., 2007). We group distributional features into three classes to test which forms of linguistic information are most pertinent to the abstract/concrete and verb/noun distinctions.

All features are extracted from the Google Syntactic N-grams Corpus. The dataset contains counted dependency-tree fragments for over 10bn words of the English Google Books Corpus.

3.1 Feature Classes

Lexical Features  Our lexical features are the co-occurrence counts of a concept word with each of the other 2,529 concepts in the USF data. Co-occurrences are counted in a 5-word window, and, as elsewhere (Erk and Padó, 2008), weighted by pointwise mutual information (PMI) to control for the underlying frequency of both concept and word.

POS-tag Features  Many words function as more than one POS, and this variation can be indicative of meaning (Manning, 2011). For example, deverbal nouns, such as shiver or walk, often refer to processes rather than entities. To capture such effects, we count the frequency of occurrence with the POS categories adjective, adverb, noun and verb.
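A minimal sketch of the lexical feature extraction described above, written over a plain tokenized corpus rather than the Google Syntactic N-grams data; the function name and toy sentences are illustrative assumptions, not part of the paper's implementation.

    import math
    from collections import Counter

    def lexical_features(sentences, target_words, context_words, window=5):
        """PMI-weighted co-occurrence vectors over a +/- `window` token span."""
        pair_count = Counter()
        for sent in sentences:
            for i, w in enumerate(sent):
                for c in sent[i + 1 : i + 1 + window]:
                    pair_count[(w, c)] += 1
                    pair_count[(c, w)] += 1

        total = sum(pair_count.values())
        marginal = Counter()               # co-occurrence mass of each word
        for (w, _c), n in pair_count.items():
            marginal[w] += n

        def pmi(w, c):
            joint = pair_count[(w, c)]
            if joint == 0 or marginal[w] == 0 or marginal[c] == 0:
                return 0.0
            return math.log((joint / total) / ((marginal[w] / total) * (marginal[c] / total)))

        return {w: [pmi(w, c) for c in context_words] for w in target_words}

    # toy usage
    sents = [["the", "dog", "chased", "the", "cat"], ["the", "cat", "slept"]]
    print(lexical_features(sents, ["dog"], ["cat", "chased", "slept"]))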

                     Context              Example
    Noun Concepts    indirect object      gave it to the man
                     direct object        gave the pie to him
                     subject              the man grinned
                     in PP                was in his mouth
                     adject. modifier     the portly man
                     infinitive clause    to eat is human
    Verb Concepts    transitive           he bit the steak
                     intransitive         he salivated
                     ditransitive         put jam on the toast
                     phrasal verb         he gobbled it up
                     infinitival comp.    he wants to snooze
                     clausal comp.        I bet he won't diet

Table 1: Grammatical features for noun/verb concepts.

    Concept Type      Words   Pairs   Examples
    concrete nouns      303    1280   yacht, cup
    abstract nouns      100     295   fear, respect
    all nouns           403    1716   cup, respect
    concrete verbs       50      66   kiss, launch
    abstract verbs       50     127   differ, obey
    all verbs           100     221   kiss, differ

Table 2: Evaluation sets used throughout. All nouns and all verbs are the union of abstract and concrete subsets and mixed abstract-concrete or concrete-abstract pairs.

Grammatical Features  Grammatical role is a strong predictor of semantics (Gildea and Jurafsky, 2002). For instance, the subject of transitive verbs is more likely to refer to an animate entity than a noun chosen at random. Syntactic context also predicts verb semantics (Kipper et al., 2008). We thus count the frequency of nouns in a range of (non-lexicalized) syntactic contexts, and of verbs in one of the six most common subcategorization-frame classes as defined in Van de Cruys et al. (2012). These contexts are detailed in Table 1.

3.2 Evaluation Sets

We create evaluation sets of abstract and concrete concepts, and introduce a complementary dichotomy between nouns and verbs, the two POS categories most fundamental to propositional meaning. To construct these sets, we extract nouns and verbs from word pairs in the USF data based on their majority POS-tag in the lemmatized BNC (Leech et al., 1994), excluding any word not assigned to either of the POS categories in more than 70% of instances. From the resulting 2175 nouns and 354 verbs, the abstract-concrete distinction is drawn by ordering words according to concreteness and sampling at random from the first and fourth quartiles. Any concrete nouns not occurring in the McRae et al. (2005) Norm dataset were also excluded.

For each list of concepts L ∈ {concrete nouns, concrete verbs, abstract nouns, abstract verbs}, together with the lists all nouns and all verbs, a corresponding set of pairs {(w1, w2) ∈ USF : w1, w2 ∈ L} is defined for evaluation. These details are summarized in Table 2. Evaluation lists, sets of pairs and USF scores are downloadable from our website.

3.3 Evaluation Methodology

All models are evaluated by measuring correlations with the free-association scores in the USF dataset (Nelson et al., 2004). This dataset contains the free-association strength of over 150,000 word pairs.[3] These data reflect the cognitive proximity of concepts and have been widely used in NLP as a gold-standard for computational models (Andrews et al., 2009; Feng and Lapata, 2010; Silberer and Lapata, 2012; Roller and Schulte im Walde, 2013).

For evaluation pairs (c1, c2) we calculate the cosine similarity between our learned feature representations for c1 and c2, a standard measure of the proximity of two vectors (Turney et al., 2010), and follow previous studies (Leong and Mihalcea, 2011; Huang et al., 2012) in using Spearman's ρ as a measure of correlation between these values and our gold-standard.[4] All representations in this section are combined by concatenation, since the present focus is not on combination methods.[5]

[3] Free-association strength is measured by presenting subjects with a cue word and asking them to produce the first word they can think of that is associated with that cue word.
[4] We consider Spearman's ρ, a non-parametric ranking correlation, to be more appropriate than Pearson's r for free-association data, which is naturally skewed and non-continuous.
[5] When combining multiple representations we normalize each representation, then concatenate and then renormalize.
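The evaluation procedure of Section 3.3 amounts to scoring each evaluation pair with the cosine of its two vectors and correlating those scores with the USF association strengths. A minimal sketch, with vectors and scores invented purely for illustration:

    import numpy as np
    from scipy.stats import spearmanr

    def cosine(u, v):
        u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom else 0.0

    def evaluate(representations, gold_pairs):
        """representations: dict word -> vector; gold_pairs: (w1, w2, usf_strength) triples.
        Returns Spearman's rho between model cosine similarities and gold scores."""
        model, gold = [], []
        for w1, w2, strength in gold_pairs:
            if w1 in representations and w2 in representations:
                model.append(cosine(representations[w1], representations[w2]))
                gold.append(strength)
        rho, _p = spearmanr(model, gold)
        return rho

    # toy usage (vectors and association scores are made up)
    reps = {"cup": [1.0, 0.2, 0.0], "yacht": [0.9, 0.1, 0.1], "fear": [0.0, 0.3, 1.0]}
    pairs = [("cup", "yacht", 0.41), ("cup", "fear", 0.02), ("yacht", "fear", 0.01)]
    print(evaluate(reps, pairs))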

    Feature Type       All Nouns   Conc. Nouns   Abs. Nouns   All Verbs   Conc. Verbs   Abs. Verbs
    (1) Lexical          0.168*       0.199*       0.248*       0.173*       0.268*       0.109
    (2) POS-tag          0.059*       0.012        0.119*       0.052       -0.074        0.123
    (3) Grammatical      0.078*       0.027        0.121*       0.009       -0.017        0.114
    (1)+(2)+(3)          0.182*       0.181*       0.247*       0.172*       0.267*       0.108

Table 3: Spearman correlation ρ of cosine similarity between vector representations derived from three feature classes with USF scores. * indicates statistically significant correlations (p < 0.05).

3.4 Results

The performance of each feature class on the evaluation sets is detailed in Table 3. When all linguistic features are included, performance is somewhat better on noun concepts (ρ = 0.182) than verbs (ρ = 0.172). However, while correlations are significant on concrete (ρ = 0.181) and abstract nouns (ρ = 0.247) and concrete verbs, the effect is not significant on abstract verbs (although it is on verbs overall). The highest correlations for the linguistic features together are on abstract nouns (ρ = 0.247) and concrete verbs (ρ = 0.267). Referring back to the continuum in Figure 1, it is possible that there is an optimum concreteness level, exhibited by abstract nouns and concrete verbs, at which conceptual meaning is best captured by linguistic models.

The results indicate that the three feature classes convey distinct information. It is perhaps unsurprising that lexical features produce the best performance in the majority of cases; the importance of lexical co-occurrence statistics in conveying word meaning is expressed in the well known distributional hypothesis (Harris, 1954). More interestingly, on abstract concepts the contribution of POS-tag features (nouns, ρ = 0.119; verbs, ρ = 0.123) and grammatical features (nouns, ρ = 0.121; verbs, ρ = 0.114) is notably higher than on the corresponding concrete concepts. The importance of such features to modelling free-association between abstract concepts suggests that they may convey information about how concepts are (subjectively) organized and interrelated in the minds of language users, independent of their realisation in the physical world. Indeed, since abstract representations rely to a lesser extent than concrete representations on perceptual input (Section 4), it is perhaps unsurprising that more of their meaning is reflected in subtle linguistic patterns.

The results in this section demonstrate that different information is required to learn representations for abstract and concrete concepts and for noun and verb concepts. In the next section, we investigate how perceptual information fits into this equation.

4 Acquiring Multi-Modal Representations

As noted in Section 2, there is experimental evidence that perceptual information plays a distinct role in the representation of different concept types. We explore whether this finding extends to computational models by integrating such information into our corpus-based approaches. We focus on two aspects of the integration process. Propagation: Can models infer useful information about abstract nouns and verbs from perceptual information corresponding to concrete nouns? And combination: How can linguistic and (propagated or actual) perceptual information be integrated into a single, multi-modal representation? We begin by introducing the two sources of perceptual information.

4.1 Perceptual Information Sources

The McRae Dataset  The McRae et al. (2005) Property Norms dataset is commonly used as a perceptual information source in cognitively-motivated semantic models (Kelly et al., 2010; Roller and Schulte im Walde, 2013). The dataset contains properties of over 500 concrete noun concepts produced by 30 human annotators. The proportion of subjects producing each property gives a measure of the strength of that property for a given concept. We encode this data in vectors with coordinates for each of the 2,526 features in the dataset. A concept representation contains (real-valued) feature strengths in places corresponding to the features of that concept and zeros elsewhere. Having defined the concrete noun evaluation set as the 303 concepts found in both the USF and McRae datasets, this information is available for all concrete nouns.
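A minimal sketch of the encoding just described: each concept receives a vector over the full property inventory, with production proportions as feature strengths and zeros elsewhere. The property names and proportions below are invented for illustration.

    import numpy as np

    def encode_property_norms(norms, property_index):
        """norms: dict concept -> {property: production proportion};
        property_index: dict property -> vector coordinate (about 2,526 in the McRae data).
        Returns dict concept -> dense vector of feature strengths."""
        dim = len(property_index)
        vectors = {}
        for concept, props in norms.items():
            v = np.zeros(dim)
            for prop, strength in props.items():
                v[property_index[prop]] = strength
            vectors[concept] = v
        return vectors

    # toy usage with invented properties
    property_index = {"is_round": 0, "made_of_metal": 1, "used_for_drinking": 2}
    norms = {"cup": {"is_round": 0.6, "used_for_drinking": 0.9},
             "spoon": {"made_of_metal": 0.8}}
    print(encode_property_norms(norms, property_index)["cup"])   # [0.6 0.  0.9]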

The ESP-Game Dataset  To complement the cognitively-driven McRae data with a more explicitly visual information source, we also extract information from the ESP-Game dataset (Von Ahn and Dabbish, 2004) of 100,000 photographs, each annotated with a list of entities depicted in that image. This input enables connections to be made between concepts that co-occur in scenes, and thus might be experienced together by language learners at a given moment. Because we want our models to reflect human concept learning in inferring conceptual knowledge from comparatively unstructured data, we use the ESP-Game dataset in preference to resources such as ImageNet (Deng et al., 2009), in which the conceptual hierarchy is directly encoded by expert annotators. An additional motivation is that ESP-Game was produced by crowdsourcing a simple task with untrained annotators, and thus represents a more scalable class of data source.

We represent the ESP-Game data in 100,000-dimensional vectors, with co-ordinates corresponding to each image in the dataset. A concept representation contains a 1 in any place that corresponds to an image in which the concept appears, and a 0 otherwise. Although it is possible to portray actions and processes in static images, and several of the ESP-Game images are annotated with verb concepts, for a cleaner analysis of the information propagation process we only include ESP input in our models for the concrete nouns in the evaluation set.

The data encoding outlined above results in perceptual representations of dimension ≈ 100,000, for which, on average, fewer than 0.5% of entries are non-zero.[6] In contrast, in our full linguistic representations of nouns (dimension ≈ 4,000) and verbs (dimension ≈ 8,000) (Section 3), an average of ≈ 24% of entries are non-zero. One of the challenges for the propagation and combination methods described in the following subsections is therefore to manage the differences in dimension and sparsity between linguistic and perceptual representations.

[6] The ESP-Game and McRae representations are of approximately equal sparsity.

4.2 Information Propagation

Johns and Jones  Silberer and Lapata (2012) apply a method designed by Johns and Jones (2012) to infer quasi-perceptual representations for a concept in the case that actual perceptual information is not available. Translating their approach to the present context, for verbs and abstract nouns we infer quasi-perceptual representations based on the perceptual features of concrete nouns that are nearby in the semantic space defined by the linguistic features.

In the first step of their two-step method, for each abstract noun or verb k, a quasi-perceptual representation is computed as an average of the perceptual representations of the concrete nouns, weighted by the proximity between these nouns and k:

    k_p = Σ_{c ∈ C̄} S(k_l, c_l)^λ · c_p

where C̄ is the set of concrete nouns, c_p and k_p are the perceptual representations for c and k respectively, and c_l and k_l the linguistic representations. The exponent parameter λ reflects the learning rate. Following Johns and Jones (2012), we define the proximity function S between noun concepts to be cosine similarity. However, because our verb and noun representations are of different dimension, we take verb-noun proximity to be the PMI between the two words in the corpus, with co-occurrences counted within a 5-word window.

In step two, the initial quasi-perceptual representations are inferred for a second time, but with the weighted average calculated over the perceptual or initial quasi-perceptual representations of all other words, not just concrete nouns. As with Johns and Jones (2012), we set the learning rate parameter λ to be 3 in the first step and 13 in the second.
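A minimal sketch of this two-step propagation, assuming for simplicity that all linguistic vectors share one dimension so that cosine similarity can serve as the proximity function throughout (the paper uses corpus PMI for verb-noun proximity); similarities are clipped at zero before exponentiation, which is an added safeguard rather than a detail from the paper.

    import numpy as np

    def cos(u, v):
        d = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / d) if d else 0.0

    def jj_propagate(ling, percept, lam1=3.0, lam2=13.0):
        """ling: dict word -> linguistic vector (all words);
        percept: dict word -> perceptual vector (concrete nouns only).
        Returns dict word -> (quasi-)perceptual vector."""
        def weighted_sum(k, sources, lam):
            words = [w for w in sources if w != k]
            weights = np.array([max(cos(ling[k], ling[w]), 0.0) ** lam for w in words])
            mat = np.stack([sources[w] for w in words])
            return weights @ mat

        # Step 1: weighted sum over concrete-noun perceptual vectors (lambda = 3).
        step1 = dict(percept)
        for k in ling:
            if k not in percept:
                step1[k] = weighted_sum(k, percept, lam1)
        # Step 2: re-infer over all other words' (quasi-)perceptual vectors (lambda = 13).
        step2 = dict(percept)
        for k in ling:
            if k not in percept:
                step2[k] = weighted_sum(k, step1, lam2)
        return step2

    # toy usage
    rng = np.random.default_rng(0)
    ling = {w: rng.random(6) for w in ["cup", "yacht", "fear", "respect"]}
    percept = {w: rng.random(4) for w in ["cup", "yacht"]}   # concrete nouns only
    print(jj_propagate(ling, percept)["fear"].shape)          # (4,)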

Ridge Regression  As an alternative propagation method we propose ridge regression (Myers, 1990). Ridge regression is a variant of least squares regression in which a regularization term is added to the training objective to favor solutions with certain properties. Here we apply it to learn parameters for linear maps from linguistic representations of concrete nouns to features in their perceptual representations. For concepts with perceptual representations of dimension n_p, we learn n_p linear functions f_i : R^(n_l) → R that map the linguistic representations (of dimension n_l) to a particular perceptual feature i. These functions are then applied together to map the linguistic representations of abstract nouns and verbs to full quasi-perceptual representations.[7]

[7] Because the POS-tag and grammatical features are different for nouns and for verbs, we exclude them from our linguistic representations when implementing ridge regression.

As our model is trained on concrete nouns but applied to other concept types, we do not wish the mapping to reflect the training data too faithfully. To mitigate against this we define our regularization term as the Euclidean l2 norm of the inferred parameter vector. This term ensures that the regression favors lower coefficients and a smoother solution function, which should provide better generalization performance than simple linear regression. The objective for learning the f_i is then to minimize

    ||a X − Y_i||²_2 + ||a||²_2

where a is the vector of regression coefficients, X is a matrix of linguistic representations and Y_i a vector of perceptual feature i for the set of concrete nouns.

We now investigate ways in which the (quasi-)perceptual representations acquired via these methods can be combined with linguistic representations.
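A minimal sketch of this propagation step using scikit-learn: fitting one multi-output Ridge model is equivalent to learning the n_p per-feature functions f_i, each with an l2 penalty on its coefficients. The regularization strength alpha and the random matrices are illustrative assumptions, not values reported in the paper.

    import numpy as np
    from sklearn.linear_model import Ridge

    def ridge_propagate(ling_concrete, percept_concrete, ling_other, alpha=1.0):
        """ling_concrete: (n_concrete, n_l) linguistic matrix for concrete nouns;
        percept_concrete: (n_concrete, n_p) perceptual matrix for the same nouns;
        ling_other: (n_other, n_l) linguistic matrix for abstract nouns / verbs.
        Returns an (n_other, n_p) matrix of quasi-perceptual representations."""
        model = Ridge(alpha=alpha, fit_intercept=False)
        model.fit(ling_concrete, percept_concrete)   # learns all n_p linear maps at once
        return model.predict(ling_other)

    # toy usage with random matrices
    rng = np.random.default_rng(1)
    X_conc, Y_conc = rng.random((50, 20)), rng.random((50, 30))
    X_abs = rng.random((10, 20))
    print(ridge_propagate(X_conc, Y_conc, X_abs).shape)   # (10, 30)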
tics is reflected by projecting representations into The transformed linguistic and perceptual vectors a space defined by the set of concepts themselves, are then concatenated. We follow Silberer and Lap- rather than low-level features. It is also captured by 9 ata by applying a kernalized variant of CCA. the quality weighting, which lends primacy to con- 7Because the POS-tag and grammatical features are differ- cept dimensions that are central to the space. ent for nouns and for verbs, we exclude them from our linguistic Second, mapping representations of different di- representations when implementing ridge regression. mension into vector spaces of equal dimension re- 8Dimensionality reduction is desirable in the present context because of the sparsity of our perceptual representations. sults in dense representations of equal dimension for 9The KernelCCA package in Python: each modality. This naturally lends equal weighting http://pythonhosted.org/apgl/KernelCCA.html or status to each modality and resolves any issues

[Figure 2 appears here: four panels (Concrete Nouns; Abstract Nouns (JJ); Concrete Verbs (JJ); Abstract Verbs (JJ)), each showing the change in performance (y-axis, roughly −0.1 to 0.2) by linguistic feature class (x-axis: Lexical, POS, Grammatical) and by perceptual information source (McRae, ESP, McRae & ESP).]

Figure 2: Additive change in Spearman's ρ when representations acquired from particular classes of linguistic features are combined with (actual or inferred) perceptual representations. Perceptual representations are derived from either the McRae Dataset, the ESP-Game Dataset or both (concatenated). For concepts other than concrete nouns, perceptual information is propagated using the Johns and Jones (JJ) method, and combined with simple concatenation.

Weighted Gram Matrix Combination  The method we propose as an alternative means of fusing linguistic and extra-linguistic information is weighted gram matrix combination, which derives from an information combination technique applied to verb clustering by Reichart and Korhonen (2013). For a set of concepts C = {c_1, ..., c_n} with representations {r_1, ..., r_n}, the method involves creating an n × n weighted gram matrix L in which

    L_ij = S(r_i, r_j) · φ(r_i) · φ(r_j).

Here, S is again a similarity function (we use cosine similarity), and φ(r) is the quality score of r.

The quality scoring function φ can be any mapping R^n → R that reflects the importance of a concept relative to other concepts in C. In the present context, we follow Reichart and Korhonen (2013) in defining a quality score φ as the average cosine similarity of a concept with all other concepts in C:

    φ(r_j) = (1/n) Σ_{i=1}^{n} S(r_i, r_j).

For c_j ∈ C, the matrix L then encodes a scalar projection of r_j onto the other members r_i, i ≤ n, weighted by their quality. Each word representation in the set is thus mapped into a new space of dimension n determined by the concepts in C.

Converting concept representations to weighted gram matrix form has several advantages in the present context. First, both when evaluating and applying semantic representations, we generally require models to determine similarities between concepts relative to others. We might, for instance, require close associates of a given word, a selection of potential synonyms, or the two most similar search queries in a given set. This relative view of semantics is reflected by projecting representations into a space defined by the set of concepts themselves, rather than low-level features. It is also captured by the quality weighting, which lends primacy to concept dimensions that are central to the space.

Second, mapping representations of different dimension into vector spaces of equal dimension results in dense representations of equal dimension for each modality. This naturally lends equal weighting or status to each modality and resolves any issues of representation sparsity. In addition, the dimension equality in particular enables a wider range of mathematical operations for combining information sources. Here, we follow Reichart and Korhonen (2013) in taking the product of the linguistic and perceptual weighted gram matrices L and P, producing a new matrix containing fused representations for each concept:

    M = L P P L.

By taking the composite product LPPL rather than LP or PL, M is symmetric and no ad hoc status is conferred to one modality over the other.
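A minimal sketch of the whole combination step, assuming the two modalities cover the same ordered list of concepts; rows of the resulting matrix M are the fused concept representations.

    import numpy as np

    def weighted_gram(R):
        """R: (n_concepts, d) matrix, one row per concept.
        Returns the n x n matrix L with L_ij = S(r_i, r_j) * phi(r_i) * phi(r_j),
        where S is cosine similarity and phi is the mean cosine similarity of a
        concept with all concepts in the set (its quality score)."""
        norms = np.linalg.norm(R, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        unit = R / norms
        S = unit @ unit.T          # pairwise cosine similarities
        phi = S.mean(axis=1)       # quality scores
        return S * np.outer(phi, phi)

    def wgm_combine(ling, percept):
        """Fuse linguistic and perceptual matrices via the symmetric product M = L P P L."""
        L, P = weighted_gram(ling), weighted_gram(percept)
        return L @ P @ P @ L

    # toy usage: 30 concepts, linguistic and perceptual vectors of different dimension
    rng = np.random.default_rng(3)
    M = wgm_combine(rng.random((30, 80)), rng.random((30, 200)))
    print(M.shape)   # (30, 30)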

4.4 Results

The experiments in this section were designed to address the three questions specified in Section 1: (1) Which information sources are important for acquiring word concepts of different types? (2) Can perceptual information be propagated from concrete to abstract concepts? (3) What is the best way to combine the information from the different sources?

Question (1)  To build on insights from Section 3, we first examined how perceptual input interacts with the three classes of linguistic features defined there. Figure 2 shows the additive difference in correlation between (i) models in which perceptual and particular linguistic features are concatenated and (ii) models based on just the linguistic features.

For concrete nouns and concrete verbs, (actual or inferred) perceptual information was beneficial in almost all cases. The largest improvement for both concept types was over grammatical features, achieved by including only the McRae data. This suggests that signals from this perceptual input and the grammatical features reflect complementary aspects of the meaning of these concepts. We hypothesize that grammatical features (and POS features, which also perform strongly in this combination) confer information to concrete representations about the function and mutual interaction of concepts (the most 'relational' aspects of their meaning (Gentner, 1978)), which complements the more intrinsic properties conferred by perceptual features.

For abstract concepts, it is perhaps unsurprising that the overall contribution of perceptual information was smaller. Indeed, combining linguistic and perceptual information actually harmed performance on abstract verbs in all cases. For these concepts, the inferred perceptual features seem to obscure or contradict some of the information conveyed in the linguistic representations.

While the McRae data was clearly the most valuable source of perceptual input for concrete nouns and concrete verbs, for abstract nouns the combination of ESP-Game and McRae data was most informative. Both inspection of the data and cognitive theories (Rosch et al., 1976) suggest that entities identified in scenes, as in the ESP-Game dataset, generally correspond to a particular (basic) level of the conceptual hierarchy. The ESP-Game data reflects relations between these basic-level concepts in the world, whereas the McRae data typically describes their (intrinsic) properties. Together, these sources seem to combine information on the properties of, and relations between, concepts in a way that particularly facilitates the learning of abstract nouns.

    Model         All Nouns       Conc. Nouns†    Abs. Nouns      All Verbs       Conc. Verbs     Abs. Verbs
    Linguistic    0.175 / 0.335   0.169 / 0.317   0.233 / 0.344   0.148 / 0.178   0.204 / 0.191   0.094 / 0.330
    (JJ)+Concat   0.116 / 0.375   0.258 / 0.442   0.190 / 0.267   0.129 / 0.162   0.301 / 0.062   0.019 / 0.280
    (JJ)+CCA      0.082 / 0.021   0.001 / 0.067   0.085 / -0.018  0.027 / 0.213   0.079 / 0.276   0.095 / 0.200
    (JJ)+WGM      0.098 / 0.213   0.397 / 0.523   0.238 / 0.329   0.059 / 0.169   0.253 / 0.064   -0.080 / 0.254
    RR+Concat     0.232 / 0.432        n/a        0.248 / 0.343   0.013 / 0.212   0.046 / 0.484   0.023 / 0.133
    RR+CCA        0.033 / -0.045       n/a        0.044 / -0.023  0.001 / -0.006  0.018 / 0.344   0.018 / 0.085
    RR+WGM        0.094 / 0.069        n/a        0.232 / 0.327   0.159 / 0.131   0.244 / 0.194   0.075 / 0.283
    LR (best)     0.216 / 0.402        n/a        0.216 / 0.282   0.004 / 0.051   -0.051 / 0.139  -0.008 / 0.197

Table 4: Performance of different methods of information propagation (JJ = Johns and Jones, RR = ridge regression, LR = linear regression) and combination (Concat = concatenation, CCA = canonical correlation analysis, WGM = weighted gram matrix multiplication) across evaluation sets. Values are Spearman's ρ correlation with USF scores (left of each slash) and WordNet path similarity (right of each slash). For the LR baseline we only report the highest score across the three combination types. † No propagation takes place for concrete nouns; this column reflects the performance of combination methods only.

Question (2)  The performance of different methods of information propagation and combination is presented in Table 4. The underlying linguistic representations in this case contained all three distributional feature classes. For more robust conclusions, in addition to the USF gold-standard we also measured the correlation between model output and the WordNet path similarity of words in our evaluation pairs. The path similarity between words w1 and w2 is the shortest distance between synsets of w1 and w2 in the WordNet taxonomy (Fellbaum, 1999), which correlates significantly with human judgements of concept similarity (Pedersen et al., 2004).[10]

[10] Other widely-used evaluation gold-standards, such as WordSim-353 and the MEN dataset, do not contain a sufficient number of abstract concepts for the current purpose.

The correlations with the USF data (the left-hand value in each column of Table 4) of our linguistic-only models (ρ = 0.094–0.233) and best performing multi-modal models (on both concrete nouns, ρ = 0.397, and more abstract concepts, ρ = 0.095–0.301) were higher than the best comparable models described elsewhere (Feng and Lapata, 2010; Silberer and Lapata, 2012; Silberer et al., 2013).[11] This confirms both that the underlying linguistic space is of high quality and that the ESP and McRae perceptual input is similarly or more informative than the input applied in previous work.

[11] Feng and Lapata (2010) report ρ = .08 for language-only and .12 for multi-modal models evaluated on USF over concrete and abstract concepts. Silberer and Lapata (2012) report ρ = .14 (language-only) and .35 (multi-modal) over concrete nouns.

Consistent with previous studies, adding perceptual input improved the quality of concrete noun representations as measured against both USF and path similarity gold-standards. Further, effective information propagation was indeed possible for both abstract nouns (USF evaluation) and concrete verbs (both evaluations). Interestingly, however, this was not the case for abstract verbs, for which no mix of propagation and combination methods produced an improvement on the linguistic-only model on either evaluation set. Indeed, as shown in Figure 2, no type of perceptual input generated an improvement in abstract verb representations, regardless of the underlying class of linguistic features.

This result underlines the link between concreteness and perceptual grounding proposed in the psychological literature. More practically, it shows that concreteness can determine if propagation of perceptual input will be effective and, if so, the potential degree of improvement over text-only models.

Turning to means of propagation, both the Johns and Jones method and ridge regression outperformed the linear regression baseline on the majority of concept types in our evaluation.

Across the five sets and ten evaluations on which propagation takes place (All Nouns, Abstract Nouns, All Verbs, Abstract Verbs and Concrete Verbs), ridge regression performed more robustly, achieving the best performance on six evaluation sets compared to two for the Johns and Jones method.[12]

[12] For these comparisons, the optimal combination method is selected in each case.

Question (3)  Weighted gram matrix multiplication (ρ = 0.397 on USF and ρ = 0.523 on path similarity) outperformed both simple vector concatenation (ρ = 0.258 and ρ = 0.442) and CCA (ρ = 0.001 and ρ = 0.067) on concrete nouns. In the case of both abstract nouns and concrete verbs, however, the most effective means of combining quasi-perceptual information with linguistic representations was concatenation (abstract nouns, ρ = 0.248 and ρ = 0.343; concrete verbs, ρ = 0.301 and ρ = 0.484). One evident drawback of multiplicative methods such as weighted gram matrix combination is the greater inter-dependence of the information sources; a weak signal from one modality can undermine the contribution of the other modality. We hypothesize that this underlies the comparatively poor performance of the method on verbs and abstract nouns, as the perceptual input for concrete nouns is clearly a richer information source than the propagated features of more abstract concepts.

5 Conclusion

Motivated by the inherent difference between abstract and concrete concepts and the fact that abstract words occur more frequently in language, in this paper we have addressed the question of whether multi-modal models can enhance semantic representations of both concept types.

In Section 3, we demonstrated that different information sources are important for acquiring concrete and abstract noun and verb concepts. Within the linguistic modality, while lexical features are informative for all concept types, syntactic features are only significantly informative for abstract concepts.

In contrast, in Section 4 we observed that perceptual input is a more valuable information source for concrete concepts than abstract concepts. Nevertheless, perceptual input can be effectively propagated from concrete nouns to enhance representations of both abstract nouns and concrete verbs. Indeed, conceptual concreteness appears to determine the degree to which perceptual input is beneficial, since representations of abstract verbs, the most abstract concepts in our experiments, were actually degraded by this additional information. One important contribution of this work is therefore an insight into when multi-modal models should or should not aim to combine and/or propagate perceptual input to ensure that optimal representations are learned. In this respect, our conclusions align with the findings of Kiela and Hill (2014), who take an explicitly visual approach to resolving the same question.

Various methods for propagating and combining perceptual information with linguistic input were presented. We proposed ridge regression for inferring perceptual representations for abstract concepts, which proved more robust than alternatives across the range of concept types. This approach is particularly simple to implement, since it is based on an established statistical procedure. In addition, we introduced weighted gram matrix combination for combining representations from distinct modalities of differing sparsity and dimension. This method produces the highest quality composite representations for concrete nouns, where both modalities represent high quality information sources.

Overall, our results demonstrate that the potential practical benefits of multi-modal models extend beyond concrete domains into a significant proportion of the lexical concepts found in language. In future work we aim to extend our experiments to concept types such as adjectives and adverbs, and to develop models that further improve the propagation and combination of extra-linguistic input.

Moreover, while we cannot draw definitive conclusions about human language processing, the effectiveness of the methods presented in this paper offers tentative support for the view that even abstract concepts are grounded in the perceptual system (Barsalou et al., 2003). As such, it may be that, even in the more abstract cases of human communication, we find ways to see what people mean precisely by finding ways to see what they mean.

Acknowledgements

We thank The Royal Society and St John's College for their support.

References

Mark Andrews, Gabriella Vigliocco, and David Vinson. 2009. Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116(3):463.

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183-1193. Association for Computational Linguistics.

Lawrence W Barsalou, W Kyle Simmons, Aron K Barbey, and Christine D Wilson. 2003. Grounding conceptual knowledge in modality-specific systems. Trends in Cognitive Sciences, 7(2):84-91.

Jeffrey R Binder, Chris F Westbury, Kristen A McKiernan, Edward T Possing, and David A Medler. 2005. Distinct brain systems for processing concrete and abstract concepts. Journal of Cognitive Neuroscience, 17(6):905-917.

Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 136-145. Association for Computational Linguistics.

Elia Bruni, Nam Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1-47.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009, pages 248-255. IEEE.

Katrin Erk and Sebastian Padó. 2008. A structured vector space model for word meaning in context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 897-906. Association for Computational Linguistics.

Christiane Fellbaum. 1999. WordNet. Wiley Online Library.

Yansong Feng and Mirella Lapata. 2010. Visual information in semantic representation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 91-99. Association for Computational Linguistics.

Vittorio Gallese and George Lakoff. 2005. The brain's concepts: The role of the sensory-motor system in conceptual knowledge. Cognitive Neuropsychology, 22(3-4):455-479.

Dedre Gentner and Arthur B Markman. 1997. Structure mapping in analogy and similarity. American Psychologist, 52(1):45.

Dedre Gentner. 1978. On relational meaning: The acquisition of verb meaning. Child Development, pages 988-998.

Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245-288.

Yoav Goldberg and Jon Orwant. 2013. A dataset of syntactic-ngrams over time from a very large corpus of English books. In Second Joint Conference on Lexical and Computational Semantics, pages 241-247. Association for Computational Linguistics.

Thomas L Griffiths, Mark Steyvers, and Joshua B Tenenbaum. 2007. Topics in semantic representation. Psychological Review, 114(2):211.

David R Hardoon, Sandor Szedmak, and John Shawe-Taylor. 2004. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639-2664.

Zellig Harris. 1954. Distributional structure. Word, 10(23):146-162.

Felix Hill, Anna Korhonen, and Christian Bentz. 2013. A quantitative empirical analysis of the abstract/concrete distinction. Cognitive Science.

Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 873-882. Association for Computational Linguistics.

Sabine Schulte Im Walde. 2006. Experiments on the automatic induction of German semantic verb classes. Computational Linguistics, 32(2):159-194.

Brendan T Johns and Michael N Jones. 2012. Perceptual inference through global lexical similarity. Topics in Cognitive Science, 4(1):103-120.

Colin Kelly, Barry Devereux, and Anna Korhonen. 2010. Acquiring human-like feature-based conceptual representations from corpora. In Proceedings of the NAACL HLT 2010 First Workshop on Computational Neurolinguistics, pages 61-69. Association for Computational Linguistics.

Douwe Kiela and Felix Hill. 2014. Improving multi-modal representations using image dispersion: Why less is sometimes more. In Proceedings of ACL 2014, Baltimore. Association for Computational Linguistics.

Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. 2008. A large-scale classification of English verbs. Language Resources and Evaluation, 42(1):21-40.

Thomas K Landauer and Susan T Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211.

Geoffrey Leech, Roger Garside, and Michael Bryant. 1994. CLAWS4: the tagging of the British National Corpus. In Proceedings of the 15th Conference on Computational Linguistics - Volume 1, pages 622-628. Association for Computational Linguistics.

Chee Wee Leong and Rada Mihalcea. 2011. Going beyond text: A hybrid image-text approach for measuring word relatedness. In IJCNLP, pages 1403-1407.

Christopher D Manning. 2011. Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In Computational Linguistics and Intelligent Text Processing, pages 171-189. Springer.

Arthur B Markman and Edward J Wisniewski. 1997. Similar and different: The differentiation of basic-level categories. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23(1).

Ken McRae, George S Cree, Mark S Seidenberg, and Chris McNorgan. 2005. Semantic feature production norms for a large set of living and non-living things. Behavior Research Methods, 37(4):547-559.

Raymond H Myers. 1990. Classical and modern regression with applications, volume 2. Duxbury Press, Belmont, CA.

Douglas L Nelson, Cathy L McEvoy, and Thomas A Schreiber. 2004. The University of South Florida free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments, & Computers, 36(3):402-407.

Allan Paivio. 1991. Dual coding theory: Retrospect and current status. Canadian Journal of Psychology/Revue Canadienne de Psychologie, 45(3):255.

Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::Similarity: measuring the relatedness of concepts. In Demonstration Papers at HLT-NAACL 2004, pages 38-41. Association for Computational Linguistics.

Roi Reichart and Anna Korhonen. 2013. Improved lexical acquisition through DPP-based verb clustering. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). Association for Computational Linguistics.

Stephen Roller and Sabine Schulte im Walde. 2013. A multimodal LDA model integrating textual, cognitive and visual modalities. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1146-1157, Seattle, Washington, USA, October. Association for Computational Linguistics.

Eleanor Rosch, Carolyn B Mervis, Wayne D Gray, David M Johnson, and Penny Boyes-Braem. 1976. Basic objects in natural categories. Cognitive Psychology, 8(3):382-439.

Gideon Rosen. 2001. Nominalism, naturalism, epistemic relativism. Noûs, 35(s15):69-91.

Magnus Sahlgren. 2006. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Ph.D. thesis, Stockholm.

Paula J Schwanenflugel and Edward J Shoben. 1983. Differential context effects in the comprehension of abstract and concrete verbal materials. Journal of Experimental Psychology: Learning, Memory, and Cognition, 9(1):82.

Carina Silberer and Mirella Lapata. 2012. Grounded models of semantic representation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1423-1433. Association for Computational Linguistics.

Carina Silberer, Vittorio Ferrari, and Mirella Lapata. 2013. Models of semantic representation with visual attributes. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria.

Peter D Turney, Patrick Pantel, et al. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141-188.

Tim Van de Cruys, Laura Rimell, Thierry Poibeau, Anna Korhonen, et al. 2012. Multiway tensor factorization for unsupervised lexical acquisition. COLING 2012: Technical Papers, pages 2703-2720.

Luis Von Ahn and Laura Dabbish. 2004. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 319-326. ACM.
