Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world

Angeliki Lazaridou, Elia Bruni and Marco Baroni
Center for Mind/Brain Sciences, University of Trento
{angeliki.lazaridou|elia.bruni|marco.baroni}@unitn.it

Abstract

Following up on recent work on establishing a mapping between vector-based semantic embeddings of words and the visual representations of the corresponding objects from natural images, we first present a simple approach to cross-modal vector-based semantics for the task of zero-shot learning, in which an image of a previously unseen object is mapped to a linguistic representation denoting its word. We then introduce fast mapping, a challenging and more cognitively plausible variant of the zero-shot task, in which the learner is exposed to new objects and the corresponding words in very limited linguistic contexts. By combining prior linguistic and visual knowledge acquired about words and their objects, as well as exploiting the limited new evidence available, the learner must learn to associate new objects with words. Our results on this task pave the way to realistic simulations of how children or robots could use existing knowledge to bootstrap grounded semantic knowledge about new concepts.

1 Introduction

Computational models of meaning that rely on corpus-extracted context vectors, such as LSA (Landauer and Dumais, 1997), HAL (Lund and Burgess, 1996), Topic Models (Griffiths et al., 2007) and more recent neural-network approaches (Collobert and Weston, 2008; Mikolov et al., 2013b), have successfully tackled a number of lexical semantics tasks, where context vector similarity highly correlates with various indices of semantic relatedness (Turney and Pantel, 2010). Given that these models are learned from naturally occurring data using simple associative techniques, various authors have advanced the claim that they might also be capturing some crucial aspects of how humans acquire and use language (Landauer and Dumais, 1997; Lenci, 2008).

However, the models induce the meaning of words entirely from their co-occurrence with other words, without links to the external world. This constitutes a serious blow to claims of cognitive plausibility in at least two respects. One is the grounding problem (Harnad, 1990; Searle, 1984). Irrespective of their relatively high performance on various semantic tasks, it is debatable whether models that have no access to visual and perceptual information can capture the holistic, grounded knowledge that humans have about concepts. However, a possibly even more serious pitfall of vector models is lack of reference: natural language is, fundamentally, a means to communicate, and thus our words must be able to refer to objects, properties and events in the outside world (Abbott, 2010). Current vector models are purely language-internal, solipsistic models of meaning. Consider the very simple scenario in which visual information is being provided to an agent about the current state of the world, and the agent's task is to determine the truth of a statement similar to There is a dog in the room. Although the agent is equipped with a powerful context vector model, this will not suffice to successfully complete the task. The model might suggest that the concepts of dog and cat are semantically related, but it has no means to determine the visual appearance of dogs, and consequently no way to verify the truth of such a simple statement.

Mapping words to the objects they denote is such a core function of language that humans are highly optimized for it, as shown by the so-called fast mapping phenomenon, whereby children can learn to associate a word to an object or property by a single exposure to it (Bloom, 2000; Carey, 1978; Carey and Bartlett, 1978; Heibeck and Markman, 1987).

But lack of reference is not only a theoretical weakness: Without the ability to refer to the outside world, context vectors are arguably useless for practical goals such as learning to execute natural language instructions (Branavan et al., 2009; Chen and Mooney, 2011), that could greatly benefit from the rich network of lexical meaning such vectors encode, in order to scale up to real-life challenges.

Very recently, a number of papers have exploited advances in automated feature extraction from images and videos to enrich context vectors with visual information (Bruni et al., 2014; Feng and Lapata, 2010; Leong and Mihalcea, 2011; Regneri et al., 2013; Silberer et al., 2013). This line of research tackles the grounding problem: Word representations are no longer limited to their linguistic contexts but also encode visual information present in images associated with the corresponding objects. In this paper, we rely on the same image analysis techniques but instead focus on the reference problem: We do not aim at enriching word representations with visual information, although this might be a side effect of our approach, but we address the issue of automatically mapping objects, as depicted in images, to the context vectors representing the corresponding words. This is achieved by means of a simple neural network trained to project image-extracted feature vectors to text-based vectors through a hidden layer that can be interpreted as a cross-modal semantic space.

We first test the effectiveness of our cross-modal semantic space on the so-called zero-shot learning task (Palatucci et al., 2009), which has recently been explored in the machine learning community (Frome et al., 2013; Socher et al., 2013). In this setting, we assume that our system possesses linguistic and visual information for a set of concepts in the form of text-based representations of words and image-based vectors of the corresponding objects, used for vision-to-language mapping training. The system is then provided with visual information for a previously unseen object, and the task is to associate it with a word by cross-modal mapping. Our approach is competitive with respect to the recently proposed alternatives, while being overall simpler.

The aforementioned task is very demanding and interesting from an engineering point of view. However, from a cognitive angle, it relies on strong, unrealistic assumptions: The learner is asked to establish a link between a new object and a word for which they possess a full-fledged text-based vector extracted from a billion-word corpus. On the contrary, the first time a learner is exposed to a new object, the linguistic information available is likely also very limited. Thus, in order to consider vision-to-language mapping under more plausible conditions, similar to the ones that children or robots in a new environment are faced with, we next simulate a scenario akin to fast mapping. We show that the induced cross-modal semantic space is powerful enough that sensible guesses about the correct word denoting an object can be made, even when the linguistic context vector representing the word has been created from as little as 1 sentence containing it.

The contributions of this work are three-fold. First, we conduct experiments with simple image- and text-based vector representations and compare alternative methods to perform cross-modal mapping. Then, we complement recent work (Frome et al., 2013) and show that zero-shot learning scales to a large and noisy dataset. Finally, we provide preliminary evidence that cross-modal projections can be used effectively to simulate a fast mapping scenario, thus strengthening the claims of this approach as a full-fledged, fully inductive theory of meaning acquisition.

2 Related Work

The problem of establishing word reference has been extensively explored in computational simulations of cross-situational learning (see Fazly et al. (2010) for a recent proposal and extended review of previous work). This line of research has traditionally assumed artificial models of the external world, typically a set of linguistic or logical labels for objects, actions and possibly other aspects of a scene (Siskind, 1996). Recently, Yu and Siskind (2013) presented a system that induces word-object mappings from features extracted from short videos paired with sentences. Our work complements theirs in two ways. First, unlike Yu and Siskind (2013), who considered a limited lexicon of 15 items with only 4 nouns, we conduct experiments in a large search space containing a highly ambiguous set of potential target words for every object (see Section 4.1).

Most importantly, by projecting visual representations of objects into a shared semantic space, we do not limit ourselves to establishing a link between objects and words. We induce a rich semantic representation of the multimodal concept, that can lead, among other things, to the discovery of important properties of an object even when we lack its linguistic label. Nevertheless, Yu and Siskind's system could in principle be used to initialize the vision-language mapping that we rely upon.

Closer to the spirit of our work are two very recent studies coming from the machine learning community. Socher et al. (2013) and Frome et al. (2013) focus on zero-shot learning in the vision-language domain by exploiting a shared visual-linguistic semantic space. Socher et al. (2013) learn to project unsupervised vector-based image representations onto a word-based semantic space using a neural network architecture. Unlike us, Socher and colleagues train an outlier detector to decide whether a test image should receive a known-word label by means of a standard supervised object classifier, or be assigned an unseen label by vision-to-language mapping. In our zero-shot experiments, we assume no access to an outlier detector, and thus, the search for the correct label is performed in the full concept space. Furthermore, Socher and colleagues present a much more constrained evaluation setup, where only 10 concepts are considered, compared to our experiments with hundreds or thousands of concepts.

Frome et al. (2013) use linear regression to transform vector-based image representations onto vectors representing the same concepts in linguistic semantic space. Unlike Socher et al. (2013) and the current study, which adopt simple unsupervised techniques for constructing image representations, Frome et al. (2013) rely on a supervised state-of-the-art method: They feed low-level features to a deep neural network trained on a supervised object recognition task (Krizhevsky et al., 2012). Furthermore, their text-based vectors encode very rich information, such as king − man + woman = queen (Mikolov et al., 2013c). A natural question we aim to answer is whether the success of cross-modal mapping is due to the high-quality embeddings or to the general algorithmic design. If the latter is the case, then these results could be extended to traditional distributional vectors bearing other desirable properties, such as high interpretability of dimensions.

3 Zero-shot learning and fast mapping

"We found a cute, hairy wampimuk sleeping behind the tree." Even though the previous statement is certainly the first time one hears about wampimuks, the linguistic context already creates some visual expectations: Wampimuks probably resemble small animals (Figure 1a). This is the scenario of zero-shot learning. Moreover, if this is also the first linguistic encounter of that concept, then we refer to the task as fast mapping.

Figure 1: A potential wampimuk (a) together with its projection onto the linguistic space (b).

Concretely, we assume that concepts, denoted for convenience by word labels, are represented in linguistic terms by vectors in a text-based distributional semantic space (see Section 4.3). Objects corresponding to concepts are represented in visual terms by vectors in an image-based semantic space (Section 4.2). For a subset of concepts (e.g., a set of animals, a set of vehicles), we possess information related to both their linguistic and visual representations. During training, this cross-modal vocabulary is used to induce a projection function (Section 4.4), which – intuitively – represents a mapping between visual and linguistic dimensions. Thus, this function, given a visual vector, returns its corresponding linguistic representation. At test time, the system is presented with a previously unseen object (e.g., wampimuk). This object is projected onto the linguistic space and associated with the word label of the nearest neighbor in that space (degus in Figure 1b).

The fast mapping setting can be seen as a special case of the zero-shot task. Whereas for the latter our system assumes that all concepts have rich linguistic representations (i.e., representations estimated from a large corpus), in the case of the former, new concepts are assumed to be encountered in a limited linguistic context and therefore lacking rich linguistic representations. This is operationalized by constructing the text-based vector for these concepts from a context of just a few occurrences. In this way, we simulate the first encounter of a learner with a concept that is new in both visual and linguistic terms.

4 Experimental Setup

4.1 Visual Datasets

CIFAR-100  The CIFAR-100 dataset (Krizhevsky, 2009) consists of 60,000 32x32 colour images (note the extremely small size) representing 100 distinct concepts, with 600 images per concept. The dataset covers a wide range of concrete domains and is organized into 20 broader categories. Table 1 lists the concepts used in our experiments organized by category.

ESP  Our second dataset consists of 100K images from the ESP-Game data set,[1] labeled through a "game with a purpose" (Von Ahn, 2006). The ESP image tags form a vocabulary of 20,515 unique words. Unlike other datasets used for zero-shot learning, it covers adjectives and verbs in addition to nouns. On average, an image has 14 tags and a word appears as a tag for 70 images. Unlike the CIFAR-100 images, which were chosen specifically for image object recognition tasks (i.e., each image is clearly depicting a single object in the foreground), ESP contains a random selection of images from the Web. Consequently, objects do not appear in most images in their prototypical display, but rather as elements of complex scenes (see Figure 2). Thus, ESP constitutes a more realistic, and at the same time more challenging, simulation of how things are encountered in real life, testing the potentials of cross-modal mapping in dealing with the complex scenes that one would encounter in event recognition and caption generation tasks.

[1] http://www.cs.cmu.edu/~biglou/resources/

Figure 2: Images of chair as extracted from CIFAR-100 (left) and ESP (right).

4.2 Visual Semantic Spaces

Image-based vectors are extracted using the unsupervised bag-of-visual-words (BoVW) representational architecture (Sivic and Zisserman, 2003; Csurka et al., 2004), which has been widely and successfully applied to computer vision tasks such as object recognition and image retrieval (Yang et al., 2007). First, low-level visual features (Szeliski, 2010) are extracted from a large collection of images and clustered into a set of "visual words". The low-level features of a specific image are then mapped to the corresponding visual words, and the image is represented by a count vector recording the number of occurrences of each visual word in it. We do not attempt any parameter tuning of the pipeline.

As low-level features, we use Scale Invariant Feature Transform (SIFT) features (Lowe, 2004). SIFT features are tailored to capture object parts and to be invariant to several image transformations such as rotation, illumination and scale change. These features are clustered into vocabularies of 5,000 (ESP) and 4,096 (CIFAR-100) visual words.[2] To preserve spatial information in the BoVW representation, we use the spatial pyramid technique (Lazebnik et al., 2006), which consists in dividing the image into several regions, computing BoVW vectors for each region and concatenating them. In particular, we divide ESP images into 16 regions and the smaller CIFAR-100 images into 4. The vectors resulting from region concatenation have dimensionality 5,000 × 16 = 80,000 (ESP) and 4,096 × 4 = 16,384 (CIFAR-100), respectively. We apply Local Mutual Information (LMI; Evert, 2005) as weighting scheme and reduce the full co-occurrence space to 300 dimensions using the Singular Value Decomposition.

[2] For selecting the vocabulary size, we relied on standard settings found in the relevant literature (Bruni et al., 2014; Chatfield et al., 2011).

For CIFAR-100, we extract distinct visual vectors for single images. For ESP, given the size and amount of noise in this dataset, we build vectors for visual concepts, by normalizing and summing the BoVW vectors of all the images that have the relevant concept as a tag. Note that relevant literature (Pereira et al., 2010) has emphasized the importance of learners self-generating multiple views when faced with new objects. Thus, our multiple-image assumption should not be considered as problematic in the current setup.
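As a rough illustration of the BoVW construction just described (not the actual VSEM pipeline), the following sketch assumes SIFT descriptors have already been extracted and split by spatial-pyramid region; function names, data layout and hyperparameters are placeholders chosen to mirror the settings reported above.

```python
# Minimal bag-of-visual-words sketch (illustrative only, not the VSEM code).
# `descriptors` per image: a list with one (n_keypoints, 128) SIFT array per
# spatial-pyramid region (16 regions for ESP, 4 for CIFAR-100 in the paper).
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD

def build_vocabulary(all_descriptors, n_words=5000):
    """Cluster low-level descriptors into a vocabulary of visual words."""
    kmeans = MiniBatchKMeans(n_clusters=n_words, random_state=0)
    kmeans.fit(np.vstack(all_descriptors))
    return kmeans

def bovw_vector(region_descriptors, kmeans):
    """Concatenate per-region visual-word histograms (spatial pyramid)."""
    parts = []
    for region in region_descriptors:
        words = kmeans.predict(region)
        parts.append(np.bincount(words, minlength=kmeans.n_clusters))
    return np.concatenate(parts)          # length n_words * n_regions

def lmi_weight(counts):
    """Local Mutual Information: raw count times pointwise mutual information."""
    total = counts.sum()
    p_rc = counts / total
    p_r = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_rc / (p_r * p_c))
    pmi[~np.isfinite(pmi)] = 0.0
    return counts * pmi

# counts: one BoVW row per image (or per concept, after summing over images);
# the paper then reduces the LMI-weighted matrix to 300 dimensions:
# visual_space = TruncatedSVD(n_components=300).fit_transform(lmi_weight(counts))
```

For ESP, the per-image rows would first be normalized and summed per concept tag before weighting and reduction, as described above.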

Category | Seen Concepts | Unseen (Test) Concepts
aquatic mammals | beaver, otter, seal, whale | dolphin
fish | ray, trout | shark
flowers | orchid, poppy, sunflower, tulip | rose
food containers | bottle, bowl, can, plate | cup
fruit and vegetables | apple, mushroom, pear | orange
household electrical devices | keyboard, lamp, telephone, television | clock
household furniture | chair, couch, table, wardrobe | bed
insects | bee, beetle, caterpillar, cockroach | butterfly
large carnivores | bear, leopard, lion, wolf | tiger
large man-made outdoor things | bridge, castle, house, road | skyscraper
large natural outdoor scenes | cloud, mountain, plain, sea | forest
large omnivores and herbivores | camel, cattle, chimpanzee, kangaroo | elephant
medium-sized mammals | fox, porcupine, possum, skunk | raccoon
non-insect invertebrates | crab, snail, spider, worm | lobster
people | baby, girl, man, woman | boy
reptiles | crocodile, dinosaur, snake, turtle | lizard
small mammals | hamster, mouse, rabbit, shrew | squirrel
vehicles 1 | bicycle, motorcycle, train | bus
vehicles 2 | rocket, tank, tractor | streetcar

Table 1: Concepts in our version of the CIFAR-100 data set

We implement the entire visual pipeline with VSEM, an open library for visual semantics (Bruni et al., 2013).[3]

[3] http://clic.cimec.unitn.it/vsem/

4.3 Linguistic Semantic Spaces

For constructing the text-based vectors, we follow a standard pipeline in distributional semantics (Turney and Pantel, 2010) without tuning its parameters and collect co-occurrence statistics from the concatenation of ukWaC[4] and the Wikipedia, amounting to 2.7 billion tokens in total. Semantic vectors are constructed for a set of 30K target words (lemmas), namely the top 20K most frequent nouns, 5K most frequent adjectives and 5K most frequent verbs, and the same 30K lemmas are also employed as contextual elements. We collect co-occurrences in a symmetric context window of 20 elements around a target word. Finally, similarly to the visual semantic space, raw counts are transformed by applying LMI and then reduced to 300 dimensions with SVD.[5]

[4] http://wacky.sslmit.unibo.it
[5] We also experimented with the image- and text-based vectors of Socher et al. (2013), but achieved better performance with the reported setup.
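The sketch below illustrates this standard pipeline (window counts, LMI weighting, SVD reduction) under the settings reported above; the corpus reading, lemmatization and vocabulary selection are assumed to be done elsewhere, and the function names are purely illustrative.

```python
# Illustrative sketch of the text-based space: symmetric 20-word window counts
# over a lemmatized corpus, LMI weighting, then SVD to 300 dimensions.
# `corpus_sentences` (lists of lemmas) and `vocab` (the 30K lemmas) are assumed given.
import numpy as np
from sklearn.decomposition import TruncatedSVD

def cooccurrence_matrix(corpus_sentences, vocab, window=20):
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in corpus_sentences:
        for pos, target in enumerate(sent):
            if target not in idx:
                continue
            lo, hi = max(0, pos - window), pos + window + 1
            for ctx in sent[lo:pos] + sent[pos + 1:hi]:
                if ctx in idx:
                    counts[idx[target], idx[ctx]] += 1
    return counts

def lmi_weight(counts):
    total = counts.sum()
    p_rc = counts / total
    p_r = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_rc / (p_r * p_c))
    pmi[~np.isfinite(pmi)] = 0.0
    return counts * pmi

def text_space(corpus_sentences, vocab, dim=300):
    weighted = lmi_weight(cooccurrence_matrix(corpus_sentences, vocab))
    return TruncatedSVD(n_components=dim).fit_transform(weighted)
```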

4.4 Cross-modal Mapping

The process of learning to map objects to their word label is implemented by training a projection function f_proj_{v→w} from the visual onto the linguistic semantic space. For the learning, we use a set of N_s seen concepts for which we have both image-based visual representations V_s ∈ R^{N_s×d_v} and text-based linguistic representations W_s ∈ R^{N_s×d_w}. The projection function is subject to an objective that aims at minimizing some cost function between the induced text-based representations Ŵ_s ∈ R^{N_s×d_w} and the gold ones W_s. The induced f_proj_{v→w} is then applied to the image-based representations V_u ∈ R^{N_u×d_v} of N_u unseen objects to transform them into text-based representations Ŵ_u ∈ R^{N_u×d_w}. We implement 4 alternative learning algorithms for inducing the cross-modal projection function f_proj_{v→w}.

Linear Regression (lin)  Our first model is a very simple linear mapping between the two modalities estimated by solving a least-squares problem. This method is similar to the one introduced by Mikolov et al. (2013a) for estimating a translation matrix, only solved analytically. In our setup, we can see the two different modalities as if they were different languages. By using least-squares regression, the projection function f_proj_{v→w} can be derived as

    f_proj_{v→w} = (V_s^T V_s)^{-1} V_s^T W_s                (1)
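A minimal sketch of the lin mapping of Equation (1) is given below; the toy matrices are stand-ins for the actual visual and linguistic spaces, and np.linalg.lstsq is used as a numerically stable equivalent of the analytic solution.

```python
# Sketch of the lin mapping (Equation 1): least-squares projection from the
# visual space onto the linguistic space. V_s, W_s are the seen-concept matrices.
import numpy as np

def fit_linear_projection(V_s, W_s):
    # Solves min ||V_s P - W_s||^2, i.e. P = (V_s^T V_s)^-1 V_s^T W_s.
    P, *_ = np.linalg.lstsq(V_s, W_s, rcond=None)
    return P                                  # shape (d_v, d_w)

def project(V_u, P):
    return V_u @ P                            # predicted text vectors, (N_u, d_w)

# Tiny usage example with random stand-in data:
rng = np.random.default_rng(0)
V_s, W_s = rng.normal(size=(71, 300)), rng.normal(size=(71, 300))
W_hat_u = project(rng.normal(size=(19, 300)), fit_linear_projection(V_s, W_s))
```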

Canonical Correlation Analysis (CCA)  CCA (Hardoon et al., 2004; Hotelling, 1936) and variations thereof have been successfully used in the past for annotation of regions (Socher and Fei-Fei, 2010) and complete images (Hardoon et al., 2006; Hodosh et al., 2013). Given two paired observation matrices, in our case V_s and W_s, CCA aims at capturing the linear relationship that exists between these variables. This is achieved by finding a pair of matrices, in our case C_V ∈ R^{d_v×d} and C_W ∈ R^{d_w×d}, such that the correlation between the projections of the two multidimensional variables into a common, lower-rank space is maximized. The resulting multimodal space has been shown to provide a good approximation to human concept similarity judgments (Silberer and Lapata, 2012). In our setup, after applying CCA on the two spaces V_s and W_s, we obtain the two projection mappings onto the common space and thus our projection function can be derived as:

    f_proj_{v→w} = C_V C_W^{-1}                (2)

Singular Value Decomposition (SVD)  SVD is the most widely used dimensionality reduction technique in distributional semantics (Turney and Pantel, 2010), and it has recently been exploited to combine visual and linguistic dimensions in the multimodal distributional semantic model of Bruni et al. (2014). SVD smoothing is also a way to infer values of unseen dimensions in partially incomplete matrices, a technique that has been applied to the task of inferring word tags of unannotated images (Hare et al., 2008). Assuming that the concept-representing rows of V_s and W_s are ordered in the same way, we apply the (k-truncated) SVD to the concatenated matrix [V_s W_s], such that [V̂_s Ŵ_s] = U_k Σ_k Z_k^T is a k-rank approximation of the original matrix.[6] The projection function is then:

    f_proj_{v→w} = Z_k Z_k^T                (3)

where the input is appropriately padded with 0s ([V_u 0_{N_u×d_w}]) and we discard the visual block of the output matrix [V̂_u Ŵ_u].

[6] We denote the right singular vectors matrix by Z instead of the customary V to avoid confusion with the visual matrix.
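The SVD model of Equation (3) can be sketched as follows; this is an illustrative reading of the smoothing procedure described above (zero-padding the text block of unseen items and keeping only the reconstructed text block), not the authors' implementation.

```python
# Sketch of the SVD model (Equation 3): k-truncated SVD of the concatenated
# matrix [V_s W_s]; unseen visual vectors are padded with zeros on the text
# block, smoothed by Z_k Z_k^T, and the text block of the result is kept.
import numpy as np

def fit_svd_projection(V_s, W_s, k=300):
    M = np.hstack([V_s, W_s])                     # (N_s, d_v + d_w)
    _, _, Zt = np.linalg.svd(M, full_matrices=False)
    k = min(k, Zt.shape[0])                       # k cannot exceed the number of seen rows
    Z_k = Zt[:k].T                                # right singular vectors, (d_v + d_w, k)
    return Z_k @ Z_k.T                            # projection matrix of Equation 3

def project_svd(V_u, proj, d_w):
    padded = np.hstack([V_u, np.zeros((V_u.shape[0], d_w))])
    smoothed = padded @ proj                      # [V_hat_u  W_hat_u]
    return smoothed[:, -d_w:]                     # keep only the text block
```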

Neural Network (NNet)  The last model that we introduce is a neural network with one hidden layer. The projection function in this model can be described as:

    f_proj_{v→w} = Θ_{v→w}                (4)

where Θ_{v→w} consists of the model weights θ^(1) ∈ R^{d_v×h} and θ^(2) ∈ R^{h×d_w} that map the input image-based vectors V_s first to the hidden layer and then to the output layer in order to obtain text-based vectors, i.e., Ŵ_s = σ^(2)(σ^(1)(V_s θ^(1)) θ^(2)), where σ^(1) and σ^(2) are the non-linear activation functions. We experimented with sigmoid, hyperbolic tangent and linear; hyperbolic tangent yielded the highest performance. The weights are estimated by minimizing the objective function

    J(Θ_{v→w}) = 1/2 (1 − sim(W_s, Ŵ_s))                (5)

where sim is some similarity function. In our experiments we used cosine as similarity function, so that sim(A, B) = A·B / (‖A‖ ‖B‖), thus penalizing parameter settings leading to a low cosine between the target linguistic representations W_s and those produced by the projection function Ŵ_s. The cosine has been widely used in the distributional semantic literature, and it has been shown to outperform Euclidean distance (Bullinaria and Levy, 2007).[7] Parameters were estimated with standard backpropagation and L-BFGS.

[7] We also experimented with the same objective function as Socher et al. (2013); however, our objective function yielded consistently better results in all experimental settings.
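The sketch below illustrates Equations (4)–(5) on toy data: a tanh hidden layer and a mean row-wise cosine objective. It is an assumption-laden simplification – the paper reports backpropagation with L-BFGS, whereas here SciPy's L-BFGS-B with a numerical gradient is used to keep the example short; all sizes are illustrative.

```python
# Sketch of the NNet mapping (Equations 4-5): one tanh hidden layer trained to
# maximize the row-wise cosine between predicted and gold text vectors.
import numpy as np
from scipy.optimize import minimize

d_v, h, d_w, N_s = 10, 4, 8, 20                   # toy sizes, illustrative only

def unpack(theta):
    t1 = theta[:d_v * h].reshape(d_v, h)
    t2 = theta[d_v * h:].reshape(h, d_w)
    return t1, t2

def forward(V, theta):
    t1, t2 = unpack(theta)
    return np.tanh(np.tanh(V @ t1) @ t2)          # W_hat = sigma2(sigma1(V t1) t2)

def objective(theta, V, W):
    W_hat = forward(V, theta)
    cos = np.sum(W * W_hat, axis=1) / (
        np.linalg.norm(W, axis=1) * np.linalg.norm(W_hat, axis=1) + 1e-12)
    return 0.5 * (1.0 - cos.mean())               # Equation 5 with a mean cosine

rng = np.random.default_rng(0)
V_s, W_s = rng.normal(size=(N_s, d_v)), rng.normal(size=(N_s, d_w))
theta0 = rng.normal(scale=0.1, size=d_v * h + h * d_w)
res = minimize(objective, theta0, args=(V_s, W_s), method="L-BFGS-B",
               options={"maxiter": 50})
W_hat_u = forward(rng.normal(size=(5, d_v)), res.x)   # project unseen objects
```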

5 Results

Our experiments focus on the tasks of zero-shot learning (Sections 5.1 and 5.2) and fast mapping (Section 5.3). In both tasks, the projected vector of the unseen concept is labeled with the word associated to its cosine-based nearest neighbor vector in the corresponding semantic space. For the zero-shot task we report the accuracy of retrieving the correct label among the top k neighbors from a semantic space populated with the union of seen and unseen concepts. For fast mapping, we report the mean rank of the correct concept among fast mapping candidates.

5.1 Zero-shot Learning in CIFAR-100

For this experiment, we use the intersection of our linguistic space with the concepts present in CIFAR-100, containing a total of 90 concepts. For each concept category, we treat all concepts but one as seen concepts (Table 1). The 71 seen concepts correspond to 42,600 distinct visual vectors and are used to induce the projection function. Table 2 reports results obtained by averaging the performance on the 11,400 distinct vectors of the 19 unseen concepts.

Model | k=1 | k=2 | k=3 | k=5 | k=10 | k=20
Chance | 1.1 | 2.2 | 3.3 | 5.5 | 11.0 | 22.0
SVD | 1.9 | 5.0 | 8.1 | 14.5 | 29.0 | 48.6
CCA | 3.0 | 6.9 | 10.7 | 17.9 | 31.7 | 51.7
lin | 2.4 | 6.4 | 10.5 | 18.7 | 33.0 | 55.0
NN | 3.9 | 6.6 | 10.6 | 21.9 | 37.9 | 58.2

Table 2: Percentage accuracy among top k nearest neighbors on CIFAR-100.

Our 4 models introduced in Section 4.4 are compared to a theoretically derived baseline Chance, simulating selecting a label at random. For the neural network NN, we use prior knowledge about the number of concept categories to set the number of hidden units to 20 in order to avoid tuning of this parameter. For the SVD model, we set the number of dimensions to 300, a common choice in distributional semantics, coherent with the settings we used for the visual and linguistic spaces.

First and foremost, all 4 models outperform Chance by a large margin. Surprisingly, the very simple lin method outperforms both CCA and SVD. However, NN, an architecture that can capture more complex, non-linear relations in features across modalities, emerges as the best performing model, confirming on a larger scale the recent findings of Socher et al. (2013).
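For concreteness, the evaluation protocol described at the start of this section can be sketched as follows; the inputs are placeholders (projected vectors, gold indices and a search-space matrix), not the paper's data.

```python
# Sketch of the zero-shot evaluation: label each projected unseen vector with
# its cosine nearest neighbors in a space holding seen and unseen concepts,
# then measure how often the gold word falls within the top k.
import numpy as np

def topk_accuracy(W_hat_u, gold_indices, search_space, ks=(1, 2, 3, 5, 10, 20)):
    # Normalize rows so that dot products are cosines.
    P = W_hat_u / np.linalg.norm(W_hat_u, axis=1, keepdims=True)
    S = search_space / np.linalg.norm(search_space, axis=1, keepdims=True)
    sims = P @ S.T                                     # (N_u, N_vocab)
    ranking = np.argsort(-sims, axis=1)                # best neighbor first
    hits = {k: 0 for k in ks}
    for row, gold in zip(ranking, gold_indices):
        for k in ks:
            if gold in row[:k]:
                hits[k] += 1
    return {k: 100.0 * hits[k] / len(gold_indices) for k in ks}
```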
5.1.1 Concept Categorization

In order to gain qualitative insights into the performance of the projection process of NN, we attempt to investigate the role and interpretability of the hidden layer. We achieve this by looking at which visual concepts result in the highest hidden unit activation.[8] This is inspired by analogous qualitative analysis conducted in Topic Models (Griffiths et al., 2007), where "topics" are interpreted in terms of the words with the highest probability under each of them.

[8] For this post-hoc analysis, we include a sparsity parameter in the objective function of Equation 5 in order to get more interpretable results; hidden units are therefore maximally activated by only a few concepts.

Table 3 presents both seen and unseen concepts corresponding to visual vectors that trigger the highest activation for a subset of hidden units. The table further reports, for each hidden unit, the "correct" unseen concept for the category of the top seen concepts, together with its rank in terms of activation of the unit. The analysis demonstrates that, although prior knowledge about categories was not explicitly used to train the network, the latter induced an organization of concepts into superordinate categories in which the hidden layer acts as a cross-modal concept categorization/organization system. When the induced projection function maps an object onto the linguistic space, the derived text vector will inherit a mixture of textual features from the concepts that activated the same hidden unit as the object. This suggests a bias towards seen concepts. Furthermore, in many cases of miscategorization, the concepts are still semantically coherent with the induced category, confirming that the projection function is indeed capturing a latent, cross-modal semantic space. A squirrel, although not a "large omnivore", is still an animal, while butterflies are not flowers but often feed on their nectar.

5.2 Zero-shot Learning in ESP

For this experiment, we focus on NN, the best performing model in the previous experiment. We use a set of approximately 9,500 concepts, the intersection of the ESP-based visual semantic space with the linguistic space. For tuning the number of hidden units of NN, we use the MEN-concrete dataset of Bruni et al. (2014). Finally, we randomly pick 70% of the concepts to induce the projection function f_proj_{v→w} and report results on the remaining 30%. Note that the search space for the correct label in this experiment is approximately 95 times larger than the one used for the experiment presented in Section 5.1.

Although our experimental setup differs from the one of Frome et al. (2013), thus preventing a direct comparison, the results reported in Table 5 are on a comparable scale to theirs. We note that previous work on zero-shot learning has used standard object recognition benchmarks. To the best of our knowledge, this is the first time this task has been performed on a dataset as noisy as ESP. Overall, the results suggest that cross-modal mapping could be applied in tasks where images exhibit a more complex structure, e.g., caption generation and event recognition.

Unseen Concept | Nearest Neighbors
tiger | cat, microchip, kitten, vet, pet
bike | spoke, wheel, brake, tyre, motorcycle
blossom | bud, leaf, jasmine, petal, dandelion
bakery | quiche, bread, pie, bagel, curry

Table 4: Top 5 neighbors in linguistic space after visual vector projection of 4 unseen concepts.

Unit | Seen Concepts | Unseen Concept | Rank of Correct Unseen Concept | CIFAR-100 Category
Unit 1 | sunflower, tulip, pear | butterfly | 2 (rose) | flowers
Unit 2 | cattle, camel, bear | squirrel | 2 (elephant) | large omnivores and herbivores
Unit 3 | castle, bridge, house | bus | 4 (skyscraper) | large man-made outdoor things
Unit 4 | man, girl, baby | boy | 1 | people
Unit 5 | motorcycle, bicycle, tractor | streetcar | 2 (bus) | vehicles 1
Unit 6 | sea, plain, cloud | forest | 1 | large natural outdoor scenes
Unit 7 | chair, couch, table | bed | 1 | household furniture
Unit 8 | plate, bowl, can | clock | 3 (cup) | food containers
Unit 9 | apple, pear, mushroom | orange | 1 | fruit and vegetables

Table 3: Categorization induced by the hidden layer of the NN; concepts belonging to the same CIFAR-100 categories, reported in the last column, are marked in bold. Example: Unit 1 receives the highest activation during training by the category flowers and at test time by butterfly, belonging to insects. The same unit receives the second highest activation by the "correct" test concept, the flower rose.

Model | k=1 | k=2 | k=5 | k=10 | k=50
Chance | 0.01 | 0.02 | 0.05 | 0.10 | 0.5
NN | 0.8 | 1.9 | 5.6 | 9.7 | 30.9

Table 5: Percentage accuracy among top k nearest neighbors on ESP.

5.3 Fast Mapping in ESP

In this section, we aim at simulating a fast mapping scenario in which the learner has just been exposed to a new concept, and thus has limited linguistic evidence for that concept. We operationalize this by considering the 34 concrete concepts introduced by Frassinelli and Keller (2012), and deriving their text-based representations from just a few sentences randomly picked from the corpus. Concretely, we implement 5 models: context 1, context 5, context 10, context 20 and context full, where the name of the model denotes the number of sentences used to construct the text-based representations. The derived vectors were reduced with the same SVD projection induced from the complete corpus. Cross-modal mapping is done via NN.

The zero-shot framework leads us to frame fast mapping as the task of projecting visual representations of new objects onto language space for retrieving their word labels (v→w). This mapping from visual to textual representations is arguably a more plausible task than vice versa. If we think about how linguistic reference is acquired, a scenario in which a learner first encounters a new object and then seeks its reference in the language of the surrounding environment (e.g., adults having a conversation, the text of a book with an illustration of an unknown object) is very natural. Furthermore, since not all new concepts in the linguistic environment refer to new objects (they might denote abstract concepts or out-of-scene objects), it seems more reasonable for the learner to be more alerted to linguistic cues about a recently-spotted new object than vice versa. Moreover, once the learner observes a new object, she can easily construct a full visual representation for it (and the acquisition literature has shown that humans are wired for good object segmentation and recognition (Spelke, 1994)) – the more challenging task is to scan the ongoing and very ambiguous linguistic communication for contexts that might be relevant and informative about the new object. However, fast mapping is often described in the psychological literature as the opposite task: The learner is exposed to a new word in context and has to search for the right object referring to it. We implement this second setup (w→v) by training the projection function f_proj_{w→v}, which maps linguistic vectors to visual ones. The adaptation of NN is straightforward; the new objective function is derived as

    J(Θ_{w→v}) = 1/2 (1 − sim(V_s, V̂_s))                (6)

where V̂_s = σ^(2)(σ^(1)(W_s θ^(1)) θ^(2)), θ^(1) ∈ R^{d_w×h} and θ^(2) ∈ R^{h×d_v}.
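As a rough illustration of how the context n vectors described above could be built, the sketch below counts co-occurrences for the new word from only a handful of sentences and maps them with the SVD components fitted on the full corpus. This is an assumption-laden reading of the procedure: the exact counting and weighting steps (e.g., whether LMI is applied to such sparse counts) are not spelled out here, and the function names are illustrative.

```python
# Sketch of a "context n" text vector: counts from a few sentences containing
# the new word, reduced with an SVD projection fitted on the full corpus.
# `svd` is assumed to be a fitted sklearn TruncatedSVD (300 components);
# LMI weighting is omitted in this sketch for brevity.
import numpy as np

def limited_context_vector(sentences, target, context_vocab, svd, window=20):
    idx = {w: i for i, w in enumerate(context_vocab)}
    counts = np.zeros(len(context_vocab))
    for sent in sentences:                              # a few tokenized sentences
        for pos, word in enumerate(sent):
            if word != target:
                continue
            lo, hi = max(0, pos - window), pos + window + 1
            for ctx in sent[lo:pos] + sent[pos + 1:hi]:
                if ctx in idx:
                    counts[idx[ctx]] += 1
    return svd.transform(counts[np.newaxis, :])[0]      # 300-dim text vector
```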

Table 7 presents the results. Not surprisingly, performance increases with the number of sentences that are used to construct the textual representations. Furthermore, all models perform better than Chance, including those that are based on just 1 or 5 sentences. This suggests that the system can make reasonable inferences about object-word connections even when linguistic evidence is very scarce.

Context | v→w | w→v
Chance | 17 | 17
context 1 | 12.6 | 14.5
context 5 | 8.08 | 13.29
context 10 | 7.29 | 13.44
context 20 | 6.02 | 12.17
context full | 5.52 | 5.88

Table 7: Mean rank results averaged across 34 concepts when mapping an image-based vector and retrieving its linguistic neighbors (v→w) as well as when mapping a text-based vector and retrieving its visual neighbors (w→v). Lower numbers cue better performance.

Regarding the sources of error, a qualitative analysis of predicted word labels and objects as presented in Table 6 suggests that both textual and visual representations, although capturing relevant "topical" or "domain" information, are not enough to single out the properties of the target concept. As an example, the textual vector of dishwasher contains kitchen-related dimensions such as ⟨fridge, oven, gas, hob, ..., sink⟩. After projecting onto the visual space, its nearest visual neighbours are the visual ones of the same-domain concepts corkscrew and kettle. The latter is shown in Figure 3a, with a gas hob well in evidence. As a further example, the visual vector for cooker is extracted from pictures such as the one in Figure 3b. Not surprisingly, when projecting it onto the linguistic space, the nearest neighbours are other kitchen-related terms, i.e., potato and dishwasher.

v→w | w→v
cooker → potato | dishwasher → corkscrew
clarinet → drum | potato → corn
gorilla → elephant | guitar → violin
scooter → car | scarf → trouser

Table 6: Top-ranked concepts in cases where the gold concepts received numerically high ranks.

Figure 3: Two images from ESP: (a) a kettle; (b) a cooker.
6 Conclusion

At the outset of this work, we considered the problem of linking purely language-based distributional semantic spaces with objects in the visual world by means of cross-modal mapping. We compared recent models for this task both on a benchmark object recognition dataset and on a more realistic and noisier dataset covering a wide range of concepts. The neural network architecture emerged as the best performing approach, and our qualitative analysis revealed that it induced a categorical organization of concepts. Most importantly, our results suggest the viability of cross-modal mapping for grounded word-meaning acquisition in a simulation of fast mapping.

Given the success of NN, we plan to experiment in the future with more sophisticated neural network architectures inspired by recent work in machine translation (Gao et al., 2013) and multimodal deep learning (Srivastava and Salakhutdinov, 2012). Furthermore, we intend to adopt visual attributes (Farhadi et al., 2009; Silberer et al., 2013) as visual representations, since they should allow a better understanding of how cross-modal mapping works, thanks to their linguistic interpretability. The error analysis in Section 5.3 suggests that automated localization techniques (van de Sande et al., 2011), distinguishing an object from its surroundings, might drastically improve mapping accuracy. Similarly, in the textual domain, models that extract collocates of a word that are more likely to denote conceptual properties (Kelly et al., 2012) might lead to more informative and discriminative linguistic vectors. Finally, the lack of large child-directed speech corpora constrained the experimental design of fast mapping simulations; we plan to run more realistic experiments with true nonce words and using source corpora (e.g., the Simple Wikipedia, child stories, portions of CHILDES) that contain sentences more akin to those a child might effectively hear or read in her word-learning years.

Acknowledgments

We thank Adam Liška for helpful discussions and the 3 anonymous reviewers for useful comments. This work was supported by ERC 2011 Starting Independent Research Grant n. 283554 (COMPOSES).

References

Barbara Abbott. 2010. Reference. Oxford University Press, Oxford, UK.

Paul Bloom. 2000. How Children Learn the Meanings of Words. MIT Press, Cambridge, MA.
S. R. K. Branavan, Harr Chen, Luke S. Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In Proceedings of ACL/IJCNLP, pages 82–90.
Elia Bruni, Ulisse Bordignon, Adam Liska, Jasper Uijlings, and Irina Sergienya. 2013. VSEM: An open library for visual semantics representation. In Proceedings of ACL, Sofia, Bulgaria.
Elia Bruni, Nam Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.
John Bullinaria and Joseph Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39:510–526.
Susan Carey and Elsa Bartlett. 1978. Acquiring a single new word. Papers and Reports on Child Language Development, 15:17–29.
Susan Carey. 1978. The child as a word learner. In M. Halle, J. Bresnan, and G. Miller, editors, Linguistic Theory and Psychological Reality. MIT Press, Cambridge, MA.
Ken Chatfield, Victor Lempitsky, Andrea Vedaldi, and Andrew Zisserman. 2011. The devil is in the details: an evaluation of recent feature encoding methods. In Proceedings of BMVC, Dundee, UK.
David Chen and Raymond Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of AAAI, pages 859–865, San Francisco, CA.
Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of ICML, pages 160–167, Helsinki, Finland.
Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cedric Bray. 2004. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, pages 1–22, Prague, Czech Republic.
Stefan Evert. 2005. The Statistics of Word Cooccurrences. Ph.D. dissertation, Stuttgart University.
Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. 2009. Describing objects by their attributes. In Proceedings of CVPR, pages 1778–1785, Miami Beach, FL.
Afsaneh Fazly, Afra Alishahi, and Suzanne Stevenson. 2010. A probabilistic computational model of cross-situational word learning. Cognitive Science, 34:1017–1063.
Yansong Feng and Mirella Lapata. 2010. Visual information in semantic representation. In Proceedings of HLT-NAACL, pages 91–99, Los Angeles, CA.
Diego Frassinelli and Frank Keller. 2012. The plausibility of semantic properties generated by a distributional model: Evidence from a visual world experiment. In Proceedings of CogSci, pages 1560–1565.
Andrea Frome, Greg Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. In Proceedings of NIPS, pages 2121–2129, Lake Tahoe, Nevada.
Jianfeng Gao, Xiaodong He, Wen-tau Yih, and Li Deng. 2013. Learning semantic representations for the phrase translation model. arXiv preprint arXiv:1312.0482.
Tom Griffiths, Mark Steyvers, and Josh Tenenbaum. 2007. Topics in semantic representation. Psychological Review, 114:211–244.
David R. Hardoon, Sandor Szedmak, and John Shawe-Taylor. 2004. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664.
David R. Hardoon, Craig Saunders, Sandor Szedmak, and John Shawe-Taylor. 2006. A correlation approach for automatic image annotation. In Advanced Data Mining and Applications, pages 681–692. Springer.
Jonathon Hare, Sina Samangooei, Paul Lewis, and Mark Nixon. 2008. Semantic spaces revisited: Investigating the performance of auto-annotation and semantic retrieval using semantic spaces. In Proceedings of CIVR, pages 359–368, Niagara Falls, Canada.
Stevan Harnad. 1990. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346.
Tracy Heibeck and Ellen Markman. 1987. Word learning in children: an examination of fast mapping. Child Development, 58:1021–1024.
Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899.
Harold Hotelling. 1936. Relations between two sets of variates. Biometrika, 28(3/4):321–377.
Colin Kelly, Barry Devereux, and Anna Korhonen. 2012. Semi-supervised learning for automatic conceptual property extraction. In Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics, pages 11–20, Montreal, Canada.
Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of NIPS, pages 1106–1114.
Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. Master's thesis.
Thomas Landauer and Susan Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240.
Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. 2006. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of CVPR, pages 2169–2178, Washington, DC.
Alessandro Lenci. 2008. Distributional approaches in linguistic and cognitive research. Italian Journal of Linguistics, 20(1):1–31.
Chee Wee Leong and Rada Mihalcea. 2011. Going beyond text: A hybrid image-text approach for measuring word relatedness. In Proceedings of IJCNLP, pages 1403–1407.
David Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2).
Kevin Lund and Curt Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, 28:203–208.
Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, pages 3111–3119, Lake Tahoe, Nevada.
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of NAACL, pages 746–751, Atlanta, Georgia.
Mark Palatucci, Dean Pomerleau, Geoffrey Hinton, and Tom Mitchell. 2009. Zero-shot learning with semantic output codes. In Proceedings of NIPS, pages 1410–1418, Vancouver, Canada.
Alfredo F. Pereira, Karin H. James, Susan S. Jones, and Linda B. Smith. 2010. Early biases and developmental changes in self-generated object views. Journal of Vision, 10(11).
Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25–36.
John Searle. 1984. Minds, Brains and Science. Harvard University Press, Cambridge, MA.
Carina Silberer and Mirella Lapata. 2012. Grounded models of semantic representation. In Proceedings of EMNLP, pages 1423–1433, Jeju, Korea.
Carina Silberer, Vittorio Ferrari, and Mirella Lapata. 2013. Models of semantic representation with visual attributes. In Proceedings of ACL, pages 572–582, Sofia, Bulgaria.
Jeffrey Siskind. 1996. A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 61:39–91.
Josef Sivic and Andrew Zisserman. 2003. Video Google: A text retrieval approach to object matching in videos. In Proceedings of ICCV, pages 1470–1477, Nice, France.
Richard Socher and Li Fei-Fei. 2010. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In Proceedings of CVPR, pages 966–973.
Richard Socher, Milind Ganjoo, Christopher Manning, and Andrew Ng. 2013. Zero-shot learning through cross-modal transfer. In Proceedings of NIPS, pages 935–943, Lake Tahoe, Nevada.
Elizabeth Spelke. 1994. Initial knowledge: Six suggestions. Cognition, 50:431–445.
Nitish Srivastava and Ruslan Salakhutdinov. 2012. Multimodal learning with deep Boltzmann machines. In Proceedings of NIPS, pages 2231–2239.
Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer, Berlin.
Peter Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.
Koen van de Sande, Jasper Uijlings, Theo Gevers, and Arnold Smeulders. 2011. Segmentation as selective search for object recognition. In Proceedings of ICCV, pages 1879–1886, Barcelona, Spain.
Luis Von Ahn. 2006. Games with a purpose. Computer, 29(6):92–94.
Jun Yang, Yu-Gang Jiang, Alexander Hauptmann, and Chong-Wah Ngo. 2007. Evaluating bag-of-visual-words representations in scene classification. In James Ze Wang, Nozha Boujemaa, Alberto Del Bimbo, and Jia Li, editors, Multimedia Information Retrieval, pages 197–206. ACM.
Haonan Yu and Jeffrey Siskind. 2013. Grounded language learning from video described with sentences. In Proceedings of ACL, pages 53–63, Sofia, Bulgaria.