A Joint Model of Language and Perception for Grounded Attribute Learning
Cynthia Matuszek [email protected]
Nicholas FitzGerald [email protected]
Luke Zettlemoyer [email protected]
Liefeng Bo [email protected]
Dieter Fox [email protected]
Computer Science and Engineering, Box 352350, University of Washington, Seattle, WA 98195-2350

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

Abstract

As robots become more ubiquitous and capable, it becomes ever more important for untrained users to easily interact with them. Recently, this has led to study of the language grounding problem, where the goal is to extract representations of the meanings of natural language tied to the physical world. We present an approach for joint learning of language and perception models for grounded attribute induction. The perception model includes classifiers for physical characteristics, and the language model is based on a probabilistic categorial grammar that enables the construction of compositional meaning representations. We evaluate on the task of interpreting sentences that describe sets of objects in a physical workspace, and demonstrate accurate task performance and effective latent-variable concept induction in physically grounded scenes.

1. Introduction

Physically grounded settings provide exciting opportunities for learning. For example, a person might be able to teach a robot about objects in its environment. However, to do this, a robot must jointly reason about the different modalities encountered (for example, language and vision), and induce rich associations with as little guidance as possible.

Consider a simple sentence such as "These are the yellow blocks," uttered in a setting where there is a physical workspace that contains a number of objects that vary in shape and color. We assume that a robot can understand sentences like this if it can solve the associated grounded object selection task. Specifically, it must realize that words such as "yellow" and "blocks" refer to object attributes, and ground the meaning of such words by mapping them to a perceptual system that will enable it to identify the specific physical objects referred to. To do so robustly, even in cases where words or attributes are new, our robot must learn (1) visual classifiers that identify the appropriate object properties, (2) representations of the meaning of individual words that incorporate these classifiers, and (3) a model of compositional semantics used to analyze complete sentences.

In this paper, we present an approach for jointly learning these components. Our approach builds on existing work on visual attribute classification (Bo et al., 2011) and probabilistic categorial grammar induction for semantic parsing (Zettlemoyer & Collins, 2005; Kwiatkowski et al., 2011). Specifically, our system induces new grounded concepts (groups of words along with the parameters of the attribute classifier they are paired with) from a set of scenes containing only sentences, images, and indications of which objects are being referred to. As a result, it can be taught to recognize previously unknown object attributes by someone describing objects while pointing out the relevant objects in a set of training scenes. Learning is online, adding one scene at a time, and EM-like, in that the parameters are updated to maximize the expected marginal likelihood of the latent language and visual components of the model. This integrated approach allows for effective model updates with no explicit labeling of logical meaning representations or attribute classifier outputs.
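To make this kind of training scene concrete, the toy example below shows the shape of one such example: a sentence, a set of segmented scene objects, and the subset of objects the speaker indicated. It is a minimal sketch in Python; the class, field names, and feature values are all invented for illustration and are not the system's actual data format.

```python
# One toy training scene: a sentence, segmented objects, and the indicated subset.
# All names and values are illustrative placeholders, not the paper's data format.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SceneObject:
    object_id: int
    features: Tuple[float, ...]   # stand-in for color/shape features of an RGB-D segment

sentence = "These are the yellow blocks."
objects = [
    SceneObject(0, (0.9, 0.8, 0.1)),
    SceneObject(1, (0.2, 0.3, 0.7)),
    SceneObject(2, (0.8, 0.7, 0.2)),
]
selected = {0, 2}   # the objects the speaker pointed out while talking

training_example = (sentence, objects, selected)   # one scene used during learning
```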
We evaluate this approach on data gathered on Amazon Mechanical Turk, in which people describe sets of objects on a table. Experiments demonstrate that the joint learning approach can effectively extend the set of grounded concepts in an incomplete model initialized with supervised training on a small dataset. This provides a simple mechanism for learning vocabulary in a physical environment.

2. Overview of the Approach

Problem: We wish to learn a joint language and perception model for the object selection task. The goal is to automatically map a natural language sentence x and a set of scene objects O to the subset G ⊆ O of objects described by x. The left panel of Fig. 1 shows an example scene. Here, O is the set of objects present in this scene. The individual objects o ∈ O are extracted from the scene via segmentation (the right panel of Fig. 1 shows example segments). Given the sentence x = "Here are the yellow ones," the goal is to select the five yellow objects for the named set G.

Figure 1. An example of an RGB-D object identification scene. Columns on the right show example segments, identified as positive (far right) and negative (center).

Model Components: Given a sentence and segmented scene objects, we learn a distribution P(G | x, O) over the selected set. Our approach combines recent models of language and vision, including:

(1) A semantic parsing model that defines P(z|x), a distribution over logical meaning representations z for each sentence x. In our running example, the desired representation z = λx.color(x, yellow) is a lambda-calculus expression that defines a set of objects that are yellow. For this task, we build on an existing semantic parsing model (Kwiatkowski et al., 2011).

(2) A set of visual attribute classifiers C, where each classifier c ∈ C defines a distribution P(c = true|o) of the classifier returning true for each possible object o ∈ O in the scene. For example, there would be a unique classifier c ∈ C for each possible color or shape an object can have. We use logistic regression to train classifiers on color and shape features extracted from object segments recorded using a Kinect depth camera.

Joint Model: We combine these language and vision models in two ways. First, we introduce an explicit model of alignment between the logical constants in the logical form z and classifiers in the set C. This alignment would, for example, enable us to learn that the logical constant yellow should be paired with a classifier c ∈ C that fires on yellow objects.

Next, we introduce an execution model that allows us to determine what scene objects in O would be selected by a logical expression z, given the classifiers in C. This allows us to, for example, execute λx.color(x, green) ∧ shape(x, triangle) by testing all of the objects with the appropriate classifiers (for green and triangle), then selecting objects on which both classifiers return true. This execution model includes uncertainty from the semantic parser P(z|x), classifier confidences P(c = true|o), and a deterministic ground-truth constraint that encodes what objects are actually intended to be selected. Full details are in Sec. 5.

Model Learning: We present an approach that learns the meaning of new words from a dataset D = {(x_i, O_i, G_i) | i = 1, ..., n}, where each example i contains a sentence x_i, the objects O_i, and the selected set G_i. This setup is an abstraction of the situation where a teacher mentions x_i while pointing to the objects G_i ⊆ O_i she describes. As described in detail in Sec. 6, learning proceeds in an online, EM-like fashion by repeatedly estimating expectations over the latent logical forms z_i and the outputs of the classifiers c ∈ C, then using these expectations to update the parameters of the component models for language P(z|x) and visual classification P(c|o). To bootstrap the learning approach, we first train a limited language and perception system in a fully supervised way: in this stage, each example additionally contains labeled logical meaning expressions and classifier outputs, as described in Sec. 6.
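As a concrete illustration of how per-attribute classifiers and a logical form can be combined to select objects, the sketch below scores each object against a conjunctive expression such as λx.color(x, green) ∧ shape(x, triangle) by multiplying the relevant classifier confidences P(c = true|o) and keeping objects above a threshold. This is a minimal toy, not the authors' implementation: the feature vectors, the weights, and the product-and-threshold rule (with its independence assumption) are all invented for illustration.

```python
import math

# Per-attribute logistic-regression classifiers: P(c = true | o) = sigmoid(w . phi(o)).
# Weights and features below are invented toy values, not learned parameters.
def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def confidence(weights, features):
    """P(c = true | o) for one attribute classifier applied to one object."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)))

classifiers = {                    # one weight vector per grounded attribute
    "green":    [4.0, -3.0, 0.5],
    "triangle": [-2.0, 5.0, 1.0],
}

scene_objects = {                  # object id -> toy feature vector phi(o)
    "o1": [0.9, 0.8, 1.0],
    "o2": [0.1, 0.9, 1.0],
    "o3": [0.9, 0.1, 1.0],
}

# A conjunctive logical form, e.g. lambda x. color(x, green) ^ shape(x, triangle),
# represented here simply as the list of attribute constants it tests.
logical_form = ["green", "triangle"]

def execute(logical_form, scene_objects, classifiers, threshold=0.5):
    """Return the objects selected by the logical form, treating the conjuncts
    as independent and multiplying their classifier confidences."""
    selected = set()
    for obj_id, features in scene_objects.items():
        score = 1.0
        for attribute in logical_form:
            score *= confidence(classifiers[attribute], features)
        if score > threshold:
            selected.add(obj_id)
    return selected

print(execute(logical_form, scene_objects, classifiers))   # {'o1'}
```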
Figure 2. An example semantic analysis for a sentence from our dataset.

3. Related Work

To the best of our knowledge, this paper presents the first approach for jointly learning visual classifiers and semantic parsers, to produce rich, compositional models that span directly from sensors to meaning. However, there is significant related work on the model components, and on grounded learning in general.

Vision: Current state-of-the-art object recognition systems (Felzenszwalb et al., 2009; Yang et al., 2009) are based on local image descriptors, for example SIFT over images (Lowe, 2004) and Spin Images over 3D point clouds (Johnson & Hebert, 1999). Visual attributes provide rich descriptions of objects, and have become a popular topic in the vision community (Farhadi et al., 2009; Parikh & Grauman, 2011); although very successful, we still lack a deep understanding of the design rules underlying them and how they measure similarity. Recent work on kernel descriptors (Bo et al., 2010) shows that these hand-designed features are equivalent to a type of match kernel that performs similarly to sparse coding (Yang et al., 2009; Yu & Zhang, 2010) and deep [...]

[...] that include a visual component (Tellex et al., 2011). However, these approaches ground language into predefined language formalisms, rather than extending the model to account for entirely novel input.

4. Background on Semantic Parsing

Our grounded language learning incorporates a state-of-the-art model, FUBL, for semantic parsing, as reviewed in this section.
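To illustrate the kind of compositional, lambda-calculus meaning representation involved (the derivation in Figure 2 yields λx.shape(x, arch) ∧ color(x, red)), the toy sketch below builds that meaning by function composition and evaluates it on a hand-written scene. It illustrates only the logical forms themselves, not FUBL and not the classifier-backed predicates used by the full model; the attribute dictionaries are invented stand-ins.

```python
# Toy illustration of composing lambda-calculus meanings and evaluating the
# result on a scene; the attribute dictionaries stand in for classifier-backed
# predicates of the full model.

red = lambda x: x["color"] == "red"        # lambda x. color(x, red)
arch = lambda x: x["shape"] == "arch"      # lambda x. shape(x, arch)

# Conjunction contributed by the grammar: lambda f. lambda g. lambda x. f(x) ^ g(x)
conj = lambda f: lambda g: lambda x: f(x) and g(x)

# Function application composes the word meanings into the sentence meaning
# lambda x. shape(x, arch) ^ color(x, red).
meaning = conj(arch)(red)

scene = [
    {"id": 1, "color": "red", "shape": "arch"},
    {"id": 2, "color": "red", "shape": "cube"},
    {"id": 3, "color": "green", "shape": "arch"},
]

selected = [obj["id"] for obj in scene if meaning(obj)]
print(selected)   # [1]
```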