
A Joint Model of Language and Perception for Grounded Attribute Learning

Cynthia Matuszek [email protected] Nicholas FitzGerald [email protected] Luke Zettlemoyer [email protected] Liefeng Bo [email protected] Dieter Fox [email protected] Computer Science and Engineering, Box 352350, University of Washington, Seattle, WA 98195-2350

Abstract

As robots become more ubiquitous and capable, it becomes ever more important for untrained users to easily interact with them. Recently, this has led to the study of the language grounding problem, where the goal is to extract representations of the meanings of natural language tied to the physical world. We present an approach for joint learning of language and perception models for grounded attribute induction. The perception model includes classifiers for physical characteristics and a language model based on a probabilistic categorial grammar that enables the construction of compositional meaning representations. We evaluate on the task of interpreting sentences that describe sets of objects in a physical workspace, and demonstrate accurate task performance and effective latent-variable concept induction in physically grounded scenes.

1. Introduction

Physically grounded settings provide exciting opportunities for learning. For example, a person might be able to teach a robot about objects in its environment. However, to do this, a robot must jointly reason about the different modalities encountered (for example language and vision), and induce rich associations with as little guidance as possible.

Consider a simple sentence such as "These are the yellow blocks," uttered in a setting where there is a physical workspace that contains a number of objects that vary in shape and color. We assume that a robot can understand sentences like this if it can solve the associated grounded object selection task. Specifically, it must realize that words such as "yellow" and "blocks" refer to object attributes, and ground the meaning of such words by mapping them to a perceptual system that will enable it to identify the specific physical objects referred to. To do so robustly, even in cases where words or attributes are new, our robot must learn (1) visual classifiers that identify the appropriate object properties, (2) representations of the meaning of individual words that incorporate these classifiers, and (3) a model of compositional semantics used to analyze complete sentences.

In this paper, we present an approach for jointly learning these components. Our approach builds on existing work on visual attribute classification (Bo et al., 2011) and probabilistic categorial grammar induction for semantic parsing (Zettlemoyer & Collins, 2005; Kwiatkowski et al., 2011). Specifically, our system induces new grounded concepts (groups of words along with the parameters of the attribute classifier they are paired with) from a set of scenes containing only sentences, images, and indications of what objects are being referred to. As a result, it can be taught to recognize previously unknown object attributes by someone describing objects while pointing out the relevant objects in a set of training scenes. Learning is online, adding one scene at a time, and EM-like, in that the parameters are updated to maximize the expected marginal likelihood of the latent language and visual components of the model. This integrated approach allows for effective model updates with no explicit labeling of logical meaning representations or attribute classifier outputs.
We evaluate this approach on data gathered on Amazon Mechanical Turk, in which people describe sets of objects on a table. Experiments demonstrate that the joint learning approach can effectively extend the set of grounded concepts in an incomplete model initialized with supervised training on a small dataset. This provides a simple mechanism for learning vocabulary in a physical environment.

2. Overview of the Approach

Problem We wish to learn a joint language and perception model for the object selection task. The goal is to automatically map a natural language sentence x and a set of scene objects O to the subset G ⊆ O of objects described by x. The left panel of Fig. 1 shows an example scene. Here, O is the set of objects present in this scene. The individual objects o ∈ O are extracted from the scene via segmentation (the right panel of Fig. 1 shows example segments). Given the sentence x = "Here are the yellow ones," the goal is to select the five yellow objects for the named set G.

Figure 1. An example of an RGB-D object identification scene. Columns on the right show example segments, identified as positive (far right) and negative (center).

Model Components Given a sentence and segmented scene objects, we learn a distribution P(G | x, O) over the selected set. Our approach combines recent models of language and vision, including:

(1) A semantic parsing model that defines P(z|x), a distribution over logical meaning representations z for each sentence x. In our running example, the desired representation z = λx.color(x, yellow) is a lambda-calculus expression that defines a set of objects that are yellow. For this task, we build on an existing semantic parsing model (Kwiatkowski et al., 2011).

(2) A set of visual attribute classifiers C, where each classifier c ∈ C defines a distribution P(c = true|o) of the classifier returning true for each possible object o ∈ O in the scene. For example, there would be a unique classifier c ∈ C for each possible color or shape an object can have. We use logistic regression to train classifiers on color and shape features extracted from object segments recorded using a Kinect depth camera.

Joint Model We combine these language and vision models in two ways. First, we introduce an explicit model of alignment between the logical constants in the logical form z and classifiers in the set C. This alignment would, for example, enable us to learn that the logical constant yellow should be paired with a classifier c ∈ C that fires on yellow objects. Next, we introduce an execution model that allows us to determine what scene objects in O would be selected by a logical expression z, given the classifiers in C. This allows us to, for example, execute λx.color(x, green) ∧ shape(x, triangle) by testing all of the objects with the appropriate classifiers (for green and triangle), then selecting objects on which both classifiers return true. This execution model includes uncertainty from the semantic parser P(z|x), classifier confidences P(c = true|o), and a deterministic ground-truth constraint that encodes what objects are actually intended to be selected. Full details are in Sec. 5.

Model Learning We present an approach that learns the meaning of new words from a dataset D = {(x_i, O_i, G_i) | i = 1 . . . n}, where each example i contains a sentence x_i, the objects O_i, and the selected set G_i. This setup is an abstraction of the situation where a teacher mentions x_i while pointing to the objects G_i ⊆ O_i she describes. As described in detail in Sec. 6, learning proceeds in an online, EM-like fashion by repeatedly estimating expectations over the latent logical forms z_i and the outputs of the classifiers c ∈ C, then using these expectations to update the parameters of the component models for language P(z|x) and visual classification P(c|o). To bootstrap the learning approach, we first train a limited language and perception system in a fully supervised way: in this stage, each example additionally contains labeled logical meaning expressions and classifier outputs, as described in Sec. 6.
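As a concrete illustration of the deterministic part of this execution model, the following minimal sketch (ours, not from the paper; Python, with illustrative object and classifier names) applies a conjunctive logical form to a fixed world of boolean classifier outputs to select a set of objects:

```python
# A conjunctive logical form such as lambda x. color(x,green) ^ shape(x,triangle)
# is represented here simply as the list of logical constants it conjoins.
# 'world' maps (object_id, constant) -> bool, i.e. the classifier outputs w_{o,c}.

def execute(logical_form, objects, world):
    """Return z(w): the objects for which every conjoined classifier fires."""
    return {o for o in objects
            if all(world[(o, const)] for const in logical_form)}

# Hypothetical example: two objects, classifiers for 'green' and 'triangle'.
objects = ["o1", "o2"]
world = {("o1", "green"): True, ("o1", "triangle"): True,
         ("o2", "green"): True, ("o2", "triangle"): False}

z = ["green", "triangle"]          # lambda x. color(x,green) ^ shape(x,triangle)
print(execute(z, objects, world))  # {'o1'}
```

The probabilistic model described in Sec. 5 marginalizes this deterministic selection over uncertainty in both the parse and the classifier outputs.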

3. Related Work

To the best of our knowledge, this paper presents the first approach for jointly learning visual classifiers and semantic parsers, to produce rich, compositional models that span directly from sensors to meaning. However, there is significant related work on the model components, and on grounded learning in general.

Vision Current state-of-the-art object recognition systems (Felzenszwalb et al., 2009; Yang et al., 2009) are based on local image descriptors, for example SIFT over images (Lowe, 2004) and Spin Images over 3D point clouds (Johnson & Hebert, 1999). Visual attributes provide rich descriptions of objects, and have become a popular topic in the vision community (Farhadi et al., 2009; Parikh & Grauman, 2011); although very successful, we still lack a deep understanding of the design rules underlying them and how they measure similarity. Recent work on kernel descriptors (Bo et al., 2010) shows that these hand-designed features are equivalent to a type of match kernel that performs similarly to sparse coding (Yang et al., 2009; Yu & Zhang, 2010) and deep networks (Lee et al., 2009) on many object recognition benchmarks (Bo et al., 2010). We adapt kernel descriptors as feature extractors for attribute classifiers because of their strong empirical performance.

Semantic Parsing There has been significant work on supervised learning for inducing semantic parsers (Zelle & Mooney, 1996; He & Young, 2006; Wong & Mooney, 2007). Our research builds on work on supervised learning of CCG parsers (Zettlemoyer & Collins, 2005; Kwiatkowski et al., 2011); there is also work on performing semantic analysis with alternate forms of supervision. Clarke et al. (2010) and Liang et al. (2011) describe approaches to learning semantic parsers from questions paired with database answers, while Goldwasser et al. (2011) present work on unsupervised learning. However, none of these approaches include joint models of language and vision.

Grounding There has been significant work on grounded learning more generally in the robotics and vision communities. A full review is beyond the scope of this paper, so we highlight a few examples. Roy developed a series of techniques for grounding words in visual scenes (Mavridis & Roy, 2006; Reckman et al., 2010; Gorniak & Roy, 2003). In computer vision, the grounding problem often relates to detecting objects and attributes in visual information (e.g., see (Barnard et al., 2003)); however, these approaches primarily focus on isolated word meaning, rather than compositional semantic analyses. Most closely related to our work are approaches that learn probabilistic language models from natural language input (Matuszek et al., 2012; Chen & Mooney, 2011), especially those that include a visual component (Tellex et al., 2011). However, these approaches ground language into predefined language formalisms, rather than extending the model to account for entirely novel input.

4. Background on Semantic Parsing

Our grounded language learning incorporates a state-of-the-art model, FUBL, for semantic parsing, as reviewed in this section. FUBL (Kwiatkowski et al., 2011) is an algorithm for learning factored Combinatory Categorial Grammar (CCG) lexicons for semantic parsing. Given a dataset {(x_i, z_i) | i = 1 . . . n} of natural language sentences x_i, which are paired with logical forms z_i that represent their meaning, FUBL learns a factored lexicon Λ made up of a set of lexemes L and a set of lexical templates T. Lexemes combine with templates in order to form lexical items, which can be used by a semantic parser to parse natural language sentences into logical forms. For example, given the sentence x = "this red block is in the shape of a half-pipe" and the logical form z = λx.color(x, red) ∧ shape(x, arch), FUBL learns a parse like the example in Figure 2. In this parse, the lexeme (half-pipe, [arch]) has combined with the template λ(ω, v⃗).[ω ⊢ NP : v_1] to yield the lexical item half-pipe ⊢ NP : arch.

Figure 2. An example semantic analysis for a sentence from our dataset: a CCG derivation of "this red block is in the shape of a half-pipe" that yields the final logical form S : λx.shape(x, arch) ∧ color(x, red).

FUBL also learns a log-linear model which produces the probability of a parse y that yields logical form z given the sentence x:

P(y, z | x; Θ^L, Λ) = exp(Θ^L · φ(x, y, z)) / Σ_{(y', z')} exp(Θ^L · φ(x, y', z'))    (1)

where φ(x, y, z) is a feature vector encompassing the lexemes and lexical templates used to generate y, amongst other things.

In this work, we initialize our parse model using the standard FUBL approach, followed by automatically inducing lexemes paired with new visual attributes not present in the initial training set, as we will see in the next section.
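To make the log-linear distribution in eqn. (1) concrete, here is a small sketch (ours; the feature names and weights are hypothetical) that scores candidate parses by a sparse dot product and normalizes with a softmax:

```python
import math

# Each candidate parse y (with logical form z) is described by a sparse
# feature dict phi(x, y, z): counts of the lexemes and templates it used.
def parse_prob(candidates, theta):
    """candidates: list of sparse feature dicts; theta: weight dict.
    Returns the normalized parse probabilities of eqn. (1)."""
    scores = [sum(theta.get(f, 0.0) * v for f, v in phi.items())
              for phi in candidates]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

# Hypothetical example: two parses of "the yellow blocks".
theta = {"lexeme:yellow=yellow": 1.2, "lexeme:yellow=null": -0.3}
parses = [{"lexeme:yellow=yellow": 1.0},  # yellow -> color constant
          {"lexeme:yellow=null": 1.0}]    # yellow skipped as semantically empty
print(parse_prob(parses, theta))          # approx. [0.818, 0.182]
```

In the full model, the summation in the denominator ranges over all parses the grammar licenses; in practice a beam of top-scoring parses is used, as discussed in Sec. 6.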

5. Joint Language/Perception Model

As described in Sec. 2, the object selection task is to identify a subset of objects, G, given a scene O and an NL sentence x. We define a possible world w to be a set of classifier outputs, where w_{o,c} ∈ {T, F} specifies the boolean output of classifier c for object o. Our joint probabilistic model is:

P(G | x, O) = Σ_z Σ_w P(G, z, w | x, O)    (2)

where the latent variable z over logical forms models linguistic uncertainty and the latent w over possible worlds models perceptual uncertainty.

We further decompose (2) into a product of models for language, vision, and grounded execution. This model selects the named objects G, as motivated in Sec. 2 and described below; the final decomposition is:

P(G, z, w | x, O) = P(z | x) P(w | O) P(G | z, w)    (3)

Here, the language model P(z|x) and vision model P(w|O) are held in agreement by the conditional probability term P(G|z, w). Let z(w) be the set of objects that are selected, under the assignment in w, when z is applied to them. For example, the expression z = λx.shape(x, cube) ∧ color(x, red) would return true when applied to the objects in w for which the classifiers for the cube and red logical constants return true. Now, P(G|z, w) forces agreement and models object selection by putting all of its probability mass on the set G that equals z(w).

In this formulation, the language and vision distributions are conditionally independent given this agreement. The semantic parsing model P(z|x) builds on previous work, as described in eqn. (1). The perceptual classification P(w|O) is defined as follows: we assume each perceptual classifier is applied independently, decomposing this term into:

P(w | O) = Π_{o ∈ O} Π_{c ∈ C} P(w_{o,c} | o)    (4)

where the probability of a world is simply the product of the probabilities of the individual classifier assignments for all of the objects.

Each classifier is a logistic regression model, where the probability of a classifier c on a given object o is:

P(w_{o,c} = 1 | o; Θ^P) = exp(Θ^P_c · φ(o)) / (1 + exp(Θ^P_c · φ(o)))    (5)

where Θ^P_c is the set of parameters in Θ^P for classifier c. This approach provides a simple, direct way to couple the individual language and vision components to model the object selection task.

Inference There are two key inference problems in a model of this type. During learning, we need to compute the marginal distribution P(z, w | x, O, G) over latent logical forms z and perceptual assignments w (see next section). At test time, we must compute arg max_G P(G | x, O) to find the set of named objects. Computing this probability distribution requires summing the total probability of all world/logical form pairs that name G. For each possible world w, determining if z names G is equivalent to a SAT problem, as z can theoretically encode an arbitrary logical expression that will name the appropriate G only when satisfied. Computing the marginal probability is then a weighted model counting problem, which is in #P. However, the logical expressions allowed by our current grammar (conjunctions of unary attribute descriptors) admit efficient exact computation, described below.
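For the conjunctive logical forms our grammar allows, eqns. (2)-(4) can be evaluated exactly. The sketch below (ours; the numbers and names are illustrative, and the brute-force world enumeration is only meant for small scenes) accumulates P(G | x, O) by enumerating possible worlds over the classifiers a parse mentions:

```python
from itertools import product

def joint_prob_G(G, objects, parses, clf_probs):
    """P(G|x,O) per eqns. (2)-(4), by enumeration over small scenes.
    parses: list of (prob, logical_form) pairs standing in for P(z|x), where a
            logical form is a list of classifier names (a conjunction).
    clf_probs: (object, classifier) -> P(w_{o,c} = 1 | o), eqn. (5) outputs."""
    total = 0.0
    for p_z, z in parses:
        # Only classifiers mentioned by z affect z(w); the remaining classifier
        # outputs sum to one and are marginalized out implicitly.
        for bits in product([True, False], repeat=len(objects) * len(z)):
            w = dict(zip(product(objects, z), bits))
            p_w = 1.0
            for (o, c), val in w.items():
                p = clf_probs[(o, c)]
                p_w *= p if val else (1.0 - p)
            selected = {o for o in objects if all(w[(o, c)] for c in z)}
            if selected == set(G):         # P(G|z,w) is 1 iff G equals z(w)
                total += p_z * p_w
    return total

objects = ["o1", "o2"]
parses = [(0.9, ["yellow"]), (0.1, ["yellow", "block"])]
clf_probs = {("o1", "yellow"): 0.95, ("o2", "yellow"): 0.10,
             ("o1", "block"): 0.80, ("o2", "block"): 0.70}
print(joint_prob_G({"o1"}, objects, parses, clf_probs))  # approx. 0.84
```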

6. Model Learning

The physically grounded joint learning problem is to induce a model P(G|x, O), given data of the form D = {(x_i, O_i, G_i) | i = 1 . . . n}, where each example i contains a sentence x_i, the objects O_i, and the selected set G_i. We consider the case where the learner already has a partial model, including a CCG parser with a small vocabulary and a small set of attribute classifiers. The goal is to automatically extend the model to induce new classifiers that are tied to new words in the semantic parser. We first describe the learning algorithm, then present how we initialize the approach by learning decoupled models from small datasets with more extensive annotations.

Aligning Words to Classifiers One key challenge is to learn to create new attribute classifiers associated with unseen words in the sentences x_i in the data D. We take a simple, exhaustive approach by creating a set of k new classifiers, initialized to uniform distributions. Each classifier is additionally paired with a new logical constant in the FUBL lambda-calculus language. Finally, a new lexeme is created by pairing each previously unknown word in a sentence in D with either one of these new classifier constants, or the logical expressions from an existing lexeme in the lexicon. The parsing weights for the indicator features for each of these additions are set to 0. This approach learns, through the probabilistic updates described below, to jointly reestimate the parameters of both the new classifiers and the expanded semantic parsing model.

Parameter Estimation We aim to estimate the language parameters Θ^L and perception parameters Θ^P from data D = {(x_i, O_i, G_i) | i = 1 . . . n}, as defined above. We want to find parameter settings that maximize the marginal log likelihood of D:

LL(D; Θ^L, Θ^P) = Σ_{i=1...n} ln P(G_i | x_i, O_i; Θ^L, Θ^P)    (6)

This objective is non-convex due to the sum over latent assignments for the logical form z and attribute classifier outputs w in the definition of P(G_i | x_i, O_i; Θ^L, Θ^P) from eqn. (2). However, if z and w are labeled, the overall algorithm reduces to simply training the log-linear models for the semantic parser P(z | x_i; Θ^L) and attribute classifiers P(w | O_i; Θ^P), both well-studied problems. In this situation, we can use an EM algorithm to first estimate the marginal P(z, w | x_i, O_i, G_i; Θ^L, Θ^P), then maximize the expected likelihood according to this distribution, with a weighted version of our familiar log-linear model parameter updates. We present an online version of this approach, with updates computed one example at a time.

Computing Expectations For each example i, we must compute the marginal over latent variables given by:

P(z, w | x_i, O_i, G_i; Θ^L, Θ^P) = P(z | x_i; Θ^L) P(w | O_i; Θ^P) P(G_i | z, w) / Σ_{z'} Σ_{w'} P(z' | x_i; Θ^L) P(w' | O_i; Θ^P) P(G_i | z', w')    (7)

Since computing all possible parses z is exponential in the length of the sentence, we use beam search to find the top-N parses. This exact inference could be replaced with an approximate method, such as MC-SAT, to accommodate a more permissive grammar.
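The conditional expectation in eqn. (7) reuses the same unnormalized joint scores as before: each (z, w) pair consistent with G_i is weighted by P(z|x_i) P(w|O_i) and renormalized. A minimal sketch (ours), using the same representations as the joint_prob_G example above:

```python
from itertools import product

def posterior_zw(G, objects, parses, clf_probs):
    """Eqn. (7): normalized weights over (z, w) pairs that select exactly G.
    Only consistent pairs, those with z(w) = G, receive posterior mass."""
    weighted = []
    for p_z, z in parses:
        for bits in product([True, False], repeat=len(objects) * len(z)):
            w = dict(zip(product(objects, z), bits))
            p_w = 1.0
            for (o, c), val in w.items():
                p = clf_probs[(o, c)]
                p_w *= p if val else (1.0 - p)
            selected = {o for o in objects if all(w[(o, c)] for c in z)}
            if selected == set(G):
                weighted.append((p_z * p_w, z, w))
    Z = sum(s for s, _, _ in weighted)
    return [(s / Z, z, w) for s, z, w in weighted]

# With a single candidate parse, the posterior concentrates on the one world
# consistent with G: w(o1,yellow)=True, w(o2,yellow)=False gets weight 1.0.
post = posterior_zw({"o1"}, ["o1", "o2"], [(0.9, ["yellow"])],
                    {("o1", "yellow"): 0.95, ("o2", "yellow"): 0.10})
```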

Conditional Expected Gradient For each example, we update the parameters with the expected gradient, according to the marginal distribution above. For the language parameters Θ^L, the gradient is

Δ^L = Σ_{z'} Σ_{w'} P(z', w' | x_i, O_i, G_i; Θ^L, Θ^P) * [ E_{P(y | x_i, z'; Θ^L)}[φ_j(x_i, y, z')] − E_{P(y, z | x_i; Θ^L)}[φ_j(x_i, y, z)] ]    (8)

where the inner difference of expectations is the familiar gradient of a log-linear model for conditional random fields with hidden variables (Quattoni et al., 2007; Kwiatkowski et al., 2010), and is weighted according to the expectation.

Similarly, for the perception parameters Θ^P, the gradient is:

Δ^P = Σ_{z'} Σ_{w'} P(z', w' | x_i, O_i, G_i; Θ^L, Θ^P) * Σ_{o ∈ O_i} [ w'_{o,c} − P(w_{o,c} = 1 | φ(o); Θ^P) ] φ(o)    (9)

where the inner sum ranges over the objects and adds in the familiar gradient for logistic regression binary-classification models.

Online Updates We use a simple, online parameter estimation scheme that loops over the data K = 10 times (K was picked on a validation set). For each data point i, consisting of the tuple (x_i, O_i, G_i), we perform an update where we take a step according to the above expected gradient over the latent variables. We use a learning rate of 0.1 with a constant decay of 0.00001 per update for all experiments.

Discussion This complete learning approach provides an efficient online algorithm that closely matches the style of interactive, grounded language learning we are pursuing in this work. Given the decayed learning rate, the algorithm is guaranteed to converge, but little can be said about the optimality of the solution. However, as we see in Sec. 7, the approach works well in practice for the object set selection task we consider.

Bootstrapping To construct the initial limited language and perceptual models, we make use of a small, supervised data set D_sup = {(x_i, z_i, w_i, O_i, G_i) | i = 1 . . . m}, which matches our previous setup but additionally labels the latent logical form z_i and classifier outputs w_i. As mentioned above, learning in this setting is completely decoupled and we can estimate the semantic parsing distribution P(z_i | x_i; Θ^L) with the FUBL learning algorithm (Kwiatkowski et al., 2011) and the attribute classifiers P(w_i | O_i; Θ^P) with gradient ascent for logistic regression. As we show experimentally, D_sup can often be quite small, and will in general not contain many of the words and attributes that must be additionally learned in the full approach. Exploring approaches for learning without D_sup, such as replacing it with interactive dialog with a human teacher, is an important area for future work.
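Before moving to the experiments, here is a self-contained sketch (ours; the posterior, features, and weights are tiny made-up values) of the perception update in eqn. (9): a posterior-weighted logistic-regression gradient, followed by one step with learning rate 0.1:

```python
import math

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

def perception_gradient(posterior, features, theta_c, clf):
    """Eqn. (9) for one classifier clf: each latent world w' contributes the
    logistic-regression gradient [w'_{o,c} - P(w_{o,c}=1|phi(o))] phi(o),
    weighted by its posterior mass P(z', w' | x, O, G)."""
    grad = [0.0] * len(theta_c)
    for weight, _z, w in posterior:
        for (o, c), val in w.items():
            if c != clf:
                continue
            phi = features[o]
            p = logistic(sum(t * f for t, f in zip(theta_c, phi)))
            err = (1.0 if val else 0.0) - p
            grad = [g + weight * err * f for g, f in zip(grad, phi)]
    return grad

# Hypothetical posterior over two (z, w) pairs for a one-object scene:
posterior = [(0.7, ["yellow"], {("o1", "yellow"): True}),
             (0.3, ["yellow"], {("o1", "yellow"): False})]
features = {"o1": [1.0, 0.5]}   # phi(o1)
theta = [0.2, -0.1]             # current weights of the 'yellow' classifier
g = perception_gradient(posterior, features, theta, "yellow")
theta = [t + 0.1 * gj for t, gj in zip(theta, g)]  # one online step, lr = 0.1
```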

7. Experimental Setup

Data Set Data was collected using a selection of toys, including wooden blocks, plastic food, and building bricks. For each scene, we collected short RGB-D videos with a Kinect depth camera, showing a person gesturing to a subset of the objects. Natural language annotations were gathered using Mechanical Turk; workers were asked to describe the objects being pointed to in the video (see Fig. 3). The referenced objects were then marked as belonging to G, the positive set of objects for that scene. A total of 142 scenes were shown, eliciting descriptions of 12 attributes, divided evenly into shapes and colors. In total, there were 1003 sentence/annotation pairs.

Figure 3. Example scenes presented on Mechanical Turk. Left: A scene that elicited the descriptions "here are some red things" and "these are various types of red colored objects", both labeled as λx.color(x, red). Right: A scene associated with sentence/meaning pairs such as "this toy is orange cube" and λx.color(x, orange) ∧ shape(x, cube).

Perceptual Features To automatically segment objects from each scene, we performed RANSAC plane fitting on the Kinect depth values to find the table plane, then extracted connected components (segments) of points more than a minimum distance above that plane (a sketch appears at the end of this section). After getting segmented objects, features for every object are extracted using kernel descriptors (Bo et al., 2011). We extract two types of features, for depth values and RGB values; these correspond to shape and color attributes, respectively. During training, the system learns logistic regression classifiers using these features. In the initialization phase used to bootstrap the model, the annotation provides information about which language attributes relate to shape or color. However, this information is not provided in the training phase.

Language Features We follow Kwiatkowski et al. (2011) in including a standard set of binary indicator features to define the log-linear model P(z|x; Θ^L) over logical forms, given sentences. This includes indicators for which lexical entries were used and properties of the logical forms that are constructed. These features allow the joint learning approach to weight lexical selection against evidence provided by the compositional analysis and the visual model components.
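The segmentation step above can be sketched as follows (ours; a toy numpy-only RANSAC, since the paper does not give implementation details beyond plane fitting and connected components):

```python
import numpy as np

def fit_table_plane(points, iters=200, thresh=0.01, seed=0):
    """Toy RANSAC: repeatedly fit a plane to 3 random points and keep the
    plane with the most inliers (points within `thresh` of the plane)."""
    rng = np.random.default_rng(seed)
    best_inliers, best_plane = None, None
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-9:     # degenerate (collinear) sample
            continue
        n = n / np.linalg.norm(n)
        dist = np.abs((points - p0) @ n)
        inliers = dist < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (n, p0)
    return best_plane, best_inliers

def points_above_plane(points, plane, min_height=0.02):
    """Keep points more than min_height from the plane; clustering these
    into connected components would then yield the object segments."""
    n, p0 = plane
    return points[np.abs((points - p0) @ n) > min_height]
```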
8. Results

This section presents results and a discussion of our evaluation. We demonstrate effective learning in the full model for the object set selection task. We then briefly describe ablation studies and examples of learned models.

8.1. Object Set Selection

To measure set selection task performance, we divided the data according to attribute. To initialize the model, we used the data for six of the attributes to train supervised classifiers, and provided logical forms for the corresponding sentences to train the initial semantic parsing model, as described at the end of Sec. 6. Data for the remaining six attributes were used for evaluation, with 80% allocated for training and 20% held out for testing. Here, all of the visual scenes are previously unseen, the words in the sentences describing the new attributes are unknown, and the only available labels are the output object set G.

We report precision, recall, and F1-score on the set selection task. Results are averaged over 10 different runs with the training data presented in different randomized orders. The system performs well, achieving an average precision of 82%, recall of 71%, and a 76% F1-score. This level of performance is achieved relatively quickly; performance generally converges within five passes over the training data.

8.2. Ablation Studies

To examine the need for a joint model, we measure performance of two models in which either the language or the visual component is sharply limited. In each case, performance significantly degrades. These results are summarized in Fig. 4.

Vision In order to measure how a set of classifiers would perform on the set selection task with only a simple language model, we manually created a thesaurus of words used in the dataset to refer to target attributes, containing on average 5 different ways of referring to each color and shape. To learn the unsupervised concepts for this baseline, we first extracted a list of all words appearing in the training corpus but not in the initialization data; words which appear in the thesaurus are grouped into synonym sets. To train classifiers, we collect objects from scenes in which only terms from the given synonym set appear. Any synonym set which does not occur in at least 2 distinct scenes is discarded. The resulting positive and negative objects are used to train classifiers. To generate a predicted set of objects at test time, we find all synonym sets which occur in the sentence x, and determine whether the classifiers associated with those words successfully identify the object.

Averaged across our trials, the results are as follows: Precision=0.92; Recall=0.41; F1-score=0.55. These results are, on average, notably worse than the performance of the jointly trained model.
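A sketch of the synonym-set grouping used by this baseline (ours; the thesaurus contents are hypothetical):

```python
def build_synonym_sets(corpus_words, init_words, thesaurus):
    """Group novel words (those not seen during initialization) into synonym
    sets using a hand-built thesaurus: attribute -> set of surface forms."""
    novel = set(corpus_words) - set(init_words)
    return {attr: forms & novel
            for attr, forms in thesaurus.items()
            if forms & novel}

thesaurus = {"red": {"red", "crimson", "scarlet"},   # hypothetical entries
             "cube": {"cube", "block", "box"}}
print(build_synonym_sets(["crimson", "box", "thing"], ["red"], thesaurus))
# {'red': {'crimson'}, 'cube': {'box'}}
```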

Semantic Parsing As a baseline for testing how well a pure parsing model will perform when the perception model is ablated, we run the parsing model obtained during initialization directly on the test set, training no new classifiers. Since the parser is capable of generating parses by skipping unknown words, this baseline is equivalent to treating the unknown concept words as if they are semantically empty.

Averaged across our trials, the results are as follows: Precision=0.52; Recall=0.09; F1-score=0.14. Not surprisingly, a substantial number of parses selected no objects, as the parser has no way of determining the meaning of an unknown word.

             Precision   Recall   F1-Score
  Vision       0.92       0.41      0.55
  Language     0.52       0.09      0.14
  Joint        0.82       0.71      0.76

Figure 4. A summary of precision, recall, and F1 for ablated models and the joint learning model.

8.3. Discussion and Examples

This section discusses typical training runs and data requirements. We present examples of learned models, highlighting what is learned and typical errors, and then describe a simple experiment investigating the amount of supervised data required for initialization.

Classifier performance after training affects the system's ability to perform the set selection task. During a sample trial, the average accuracies of color and shape classifiers for newly learned concepts are 97% and 74%, respectively. Although these values are sufficient for reasonable task performance, there are some failures; for example, the shape attributes "cube" and "cylinder" are sometimes challenging to differentiate.

As noted in Sec. 4, the semantic parser contains lexemes that pair words with learned classifiers, and features that indicate lexeme use during parsing. Fig. 5 shows some selected word/classifier pairs, along with the weight for their associated feature (each trial produces a large number of such lexemes). The classifiers new0–new2 and new3–new5 are color and shape classifiers, respectively. As can be seen, each of the novel attributes is most strongly associated with a newly-created classifier, while irrelevant words such as "thing" tend to parse to null. The system must identify which of the classifier types to use for novel words.

Figure 5. Feature weights for hypothesized lexemes pairing natural language words (rows) with newly created terms referring to novel classifiers (columns), as well as the special null token. Each weight serves as an unnormalized indicator of which associations are preferred.

We ran additional tests investigating whether the system is able to learn synonyms. Here, we split the data so that the training set has attributes learned during initialization, but referred to by new, synonymous words. These runs performed comparably to those reported above; the approach easily learns lexemes that pair these new words with the appropriate classifiers.

Finally, we briefly discuss the effects of reducing the amount of annotated data used to initialize the language and perception model (see Fig. 6). As can be seen, with fewer than 150 sentences, the learned grammar does not seem to have sufficient coverage to model unknown words in joint learning; however, beyond that, performance is quite stable.

Figure 6. Example F1-score on object recognition from models initialized with reduced amounts of labeled data, reported over one particular data split. The F1-score for this split peaks at roughly 73%.

9. Conclusion

This paper presents a joint model of language and perception for grounded attribute learning. Our approach learns representations of the meanings of natural language, using visual perception to ground those meanings in the physical world. Learning is performed by optimizing the data log-likelihood using an online, EM-like training algorithm.
This system is able to learn accurate language and attribute models for the object set selection task, given data containing only language, raw percepts, and the target objects. By jointly learning language and perception models, the approach can identify which novel words are color attributes, shape attributes, or no attributes at all.

We believe our approach has significant potential to scale to general language grounding problems. In particular, our modular framework was designed to easily incorporate future advances in visual classification and semantic parsing. We are also working to scale the complexity of the language and physical scenes, with the eventual goal of robust learning in completely unconstrained environments.

Acknowledgments

This work was funded in part by the Intel Science and Technology Center for Pervasive Computing, the Robotics Consortium sponsored by the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program (W911NF-10-2-0016), and NSF grant IIS-1115966.

References

Barnard, K., Duygulu, P., Forsyth, D., De Freitas, N., Blei, D.M., and Jordan, M.I. Matching words and pictures. The Journal of Machine Learning Research, 3:1107–1135, 2003.

Bo, L., Ren, X., and Fox, D. Kernel descriptors for visual recognition. In Neural Information Processing Systems (NIPS), 2010.

Bo, L., Ren, X., and Fox, D. Depth kernel descriptors for object recognition. In IEEE/RSJ Int'l Conf. on Intelligent Robots and Systems (IROS), 2011.

Chen, D.L. and Mooney, R.J. Learning to interpret natural language navigation instructions from observations. In Proc. of the 25th AAAI Conf. on Artificial Intelligence (AAAI-2011), pp. 859–865, August 2011.

Clarke, J., Goldwasser, D., Chang, M., and Roth, D. Driving semantic parsing from the world's response. In Proc. of the Conf. on Computational Natural Language Learning, 2010.

Farhadi, A., Endres, I., Hoiem, D., and Forsyth, D. Describing objects by their attributes. In IEEE Conf. on Computer Vision and Pattern Recognition, 2009.

Felzenszwalb, P., Girshick, R., McAllester, D., and Ramanan, D. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2009.

Goldwasser, D., Reichart, R., Clarke, J., and Roth, D. Confidence driven unsupervised semantic parsing. In Proc. of the Association for Computational Linguistics, 2011.

Gorniak, P. and Roy, D. Understanding complex visually referring utterances. In Proc. of the HLT-NAACL 2003 Workshop on Learning Word Meaning from Non-Linguistic Data, 2003.

He, Y. and Young, S. Spoken language understanding using the hidden vector state model. Speech Communication, 48(3-4), 2006.

Johnson, A. and Hebert, M. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5), 1999.

Kwiatkowski, T., Zettlemoyer, L.S., Goldwater, S., and Steedman, M. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proc. of the Conf. on Empirical Methods in Natural Language Processing, 2010.

Kwiatkowski, T., Zettlemoyer, L.S., Goldwater, S., and Steedman, M. Lexical generalization in CCG grammar induction for semantic parsing. In Proc. of the Conf. on Empirical Methods in Natural Language Processing, 2011.

Lee, H., Grosse, R., Ranganath, R., and Ng, A. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proc. of the Int'l Conf. on Machine Learning (ICML), 2009.

Liang, P., Jordan, M.I., and Klein, D. Learning dependency-based compositional semantics. In Proc. of the Association for Computational Linguistics, 2011.

Lowe, D. Distinctive image features from scale-invariant keypoints. Int'l Journal of Computer Vision (IJCV), 60:91–110, 2004.

Matuszek, C., Herbst, E., Zettlemoyer, L., and Fox, D. Learning to parse natural language commands to a robot control system. In Proc. of the 13th Int'l Symposium on Experimental Robotics (ISER), June 2012.

Mavridis, N. and Roy, D. Grounded situation models for robots: Where words and percepts meet. In IEEE/RSJ Int'l Conf. on Intelligent Robots and Systems, 2006.

Parikh, D. and Grauman, K. Relative attributes. In Int'l Conf. on Computer Vision, 2011.

Quattoni, A., Wang, S., Morency, L.-P., Collins, M., and Darrell, T. Hidden-state conditional random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.

Reckman, H., Orkin, J., and Roy, D. Learning meanings of words and constructions, grounded in a virtual game. In Proc. of the 10th Conf. on Natural Language Processing (KONVENS), 2010.

Tellex, S., Kollar, T., Dickerson, S., Walter, M.R., Banerjee, A.G., Teller, S., and Roy, N. Understanding natural language commands for robotic navigation and mobile manipulation. In Proc. of the National Conf. on Artificial Intelligence (AAAI), August 2011.

Wong, Y.W. and Mooney, R.J. Learning synchronous grammars for semantic parsing with lambda calculus. In Proc. of the Ass'n for Computational Linguistics, 2007.

Yang, J., Yu, K., Gong, Y., and Huang, T. Linear spatial pyramid matching using sparse coding for image classification. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2009.

Yu, K. and Zhang, T. Improved local coordinate coding using local tangents. In Proc. of the Int'l Conf. on Machine Learning (ICML), 2010.

Zelle, J.M. and Mooney, R.J. Learning to parse database queries using inductive logic programming. In Proc. of the National Conf. on Artificial Intelligence, 1996.

Zettlemoyer, L.S. and Collins, M. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proc. of the Conf. on Uncertainty in Artificial Intelligence, 2005.