Towards Understanding Situated Natural Language
Antoine Bordes (LIP6, Université Paris 6, Paris, France) [email protected]
Nicolas Usunier (LIP6, Université Paris 6, Paris, France) [email protected]
Ronan Collobert (NEC Laboratories America, Princeton, USA) [email protected]
Jason Weston (Google, New York, USA) [email protected]
Abstract

We present a general framework and learning algorithm for the task of concept labeling: each word in a given sentence has to be tagged with the unique physical entity (e.g. person, object or location) or abstract concept it refers to. Our method allows both world knowledge and linguistic information to be used during learning and prediction. We show experimentally that we can learn to use world knowledge to resolve ambiguities in language, such as word senses or reference resolution, without the use of hand-crafted rules or features.

Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.

1 Introduction

Much of the focus of the natural language processing community lies in solving syntactic or semantic tasks with the aid of sophisticated machine learning algorithms and the encoding of linguistic prior knowledge. For example, a typical way of encoding prior knowledge is to hand-code syntax-based input features or rules for a given task. However, one of the most important features of natural language is that its real-world use (as a tool for humans) is to communicate something about our physical reality or metaphysical considerations of that reality. This is strong prior knowledge that is simply ignored in most current systems.

For example, in current parsing systems there is no allowance for the ability to disambiguate a sentence given knowledge of the physical reality of the world. So, if one happened to know that Bill owned a telescope while John did not, then this should affect parsing decisions given the sentence "John saw Bill in the park with his telescope." Likewise, in terms of reference resolution one could disambiguate the sentence "He passed the exam." if one happens to know that Bill is taking an exam and John is not. Further, one can improve disambiguation of the word bank in "John went to the bank" if one happens to know whether John is out for a walk in the countryside or in the city. Many human disambiguation decisions are in fact based on whether the current sentence agrees well with one's current world model. Such a model is dynamic, as the current state of the world (e.g. the existing entities and their relations) changes over time. Note that the model we use to disambiguate can be built from our perception of the world (e.g. visual perception) and not necessarily from language at all.

Concept labeling In this paper, we propose a general framework for learning to use world knowledge called concept labeling. The "knowledge" we consider can be viewed as a database of physical entities existing in the world (e.g. people, locations or objects) as well as abstract concepts, and relations between them, e.g. the location of one entity is expressed in terms of its relation with another entity. Our task consists of labeling each word of a sentence with its corresponding concept from the database.

The solution to this task does not provide a full semantic interpretation of a sentence, but we believe it is an important part of that goal. Indeed, in many cases, the meaning of a sentence can only be uncovered after knowing exactly which concepts, e.g. which unique objects in the world, are involved. If one wants to interpret "He passed the exam", one has to infer not only that "He" refers to a "John" and "exam" to a school test, but also exactly which "John" and which test it was. In that sense, concept labeling is more general than traditional tasks like word-sense disambiguation, co-reference resolution and named-entity recognition, and can be seen as a unification of them.

Learning algorithm We then go on to propose a tractable algorithm for this task that seamlessly learns to integrate both world knowledge and the linguistic content of a sentence without the use of any hand-crafted rules or features. This is a challenging goal and standard algorithms do not achieve it. Our algorithm is first evaluated on human-generated sentences from RoboCup commentaries (Chen & Mooney, 2008). Yet this dataset does not involve world knowledge. Hence we present experiments using a novel simulation procedure to generate natural language and concept label pairs: the simulation generates an evolving world, together with sentences describing the successive evolutions. These experiments demonstrate that our algorithm can learn to use world knowledge for word disambiguation and reference resolution when standard methods cannot.

Although clearly only a first step towards the goal of language understanding, which is AI-complete, we feel our work is an original way of tackling an important and central problem. In a nutshell, we show one can learn to resolve ambiguities using world knowledge, which is a prerequisite for further semantic analysis, e.g. for communication.

Previous work Our work concerns learning the connection between two symbolic systems: that of natural language and the non-linguistic one of the concepts present in a database. Making such an association has been studied as the symbol grounding problem (Harnad, 1990) in the literature. More specifically, the problem of connecting natural language to another symbolic system is called grounded language processing (Roy & Reiter, 2005). Some of the earliest such works used world knowledge to improve linguistic processing. Related work on the contextual knowledge of previously uttered sentences also ranges from dialogue systems, e.g. (Allen, 1995), to co-reference resolution (Soon et al., 2001). We do not consider this type of contextual knowledge in this paper; however, our framework is extensible to those settings.

2 Concept Labeling

We consider the following setup. One must learn a mapping from a natural language sentence x ∈ X to its labeling in terms of concepts y ∈ Y, where y is an ordered set of concepts, one concept for each word in the sentence, i.e. y = (c_1, ..., c_|x|) where c_i ∈ C, the set of concepts. These concepts belong to a current model of the world, which expresses one's knowledge of it; we term it a "universe". The framework of concept labeling is illustrated in Figure 1.

World knowledge We define the universe as a set of concepts and their relations to other concepts: U = (C, R_1, ..., R_n), where n is the number of types of relation and R_i ⊆ C^2 for all i = 1, ..., n. Much work has been done on knowledge representation itself (see (Russell et al., 1995) for an introduction). This is not the focus of this paper and so we made the simplest possible choice. To make things concrete, we now describe the template database we use in this paper. Each concept c of the database is identified by a unique string name(c); each physical object or action (verb) of the universe has its own referent, so that, for example, two different cartons of milk are referred to by two distinct concept names.
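To make the abstract form U = (C, R_1, ..., R_n) concrete, here is a minimal sketch of such a database in Python. The concept names and relation contents are invented for illustration; the paper specifies only the abstract structure, not a serialization.

```python
# A toy universe U = (C, R_1, ..., R_n): a set of concept names plus
# binary relations R_i, each stored as a set of (concept, concept) pairs.
# All names below are invented for this sketch.
concepts = {"<john>", "<kitchen>", "<rice1>", "<fridge>"}

relations = {
    "location":    {("<john>", "<kitchen>"), ("<rice1>", "<kitchen>")},
    "containedby": {("<rice1>", "<fridge>")},
}

def related(rel, a, b):
    """True if the pair (a, b) belongs to the relation named `rel`."""
    return (a, b) in relations[rel]
```

Since each R_i ⊆ C^2, membership tests like `related("location", "<john>", "<kitchen>")` are all the machinery the definition requires.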
an immediately realisable setting is within a computer game environment, e.g. multiplayer Internet games. For such games, the knowledge base of concepts is al-
• Location-based ambiguities that can be resolved by the locations of the concepts. Examples: "John picked it up" or "He dropped his coat in the hall". Information about the location of John, co-located objects and so on can improve the disambiguation.
• Containedby-based ambiguities that can be resolved through knowledge of containedby relations, as in "the milk in the closet" or "the one in the closet" where there are several cartons of milk (e.g. one in the fridge and one in the closet).
• Category-based ambiguities: a concept is identified in a sentence by an ambiguous term and the disambiguation can be resolved by using semantic categorization. Example: "He cooks the rice in the kitchen" where two persons, a male and a female, are in the kitchen.

The first two kinds of ambiguities require the algorithm to be able to learn rules based on its available universe knowledge. The last kind can be solved using linguistic information such as word gender or category. However, the necessary rules or linguistic information are not given as input features, and again the algorithm has to learn to infer them. This is one of the main goals of our work. Figure 3 describes the necessary steps for an algorithm to perform such disambiguation. Even for a simple sentence the procedure is rather complex and somehow requires "reasoning".

shown in Figure 1. In order to construct such a dataset it seems likely that humans would have to annotate data for our learner in much the same way as linguists have built labeled datasets for parsing, semantic role labeling, part-of-speech tagging and reference resolution.

A more difficult, but more realistic, learning problem would be for our model to learn from weakly supervised data by just observing language given the evolving world-state context. Hence, one instead considers weakly labeled training triples (x, b, u): for a given input sentence x, the supervision b is now a "bag" of concepts (there is no information about the ordering/alignment of elements of b to x). This is similar to the setting of (Kate & Mooney, 2007), except we learn to use world knowledge. An illustrative example of a training triple (x, b, u) is given in Figure 2.

Figure 2: Weak training triple. In the weakly supervised setting the supervision bag b consists of the set of concepts related to the sentence x. The alignment between words and concepts is not given and must be discovered.
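Such a weakly labeled triple can be written down directly. The sketch below uses invented concept names for the running sentence; u would be the universe database of Section 2.

```python
# Illustrative weakly supervised training triple (x, b, u) for
# "He cooks the rice". Concept names are hypothetical.
x = ["he", "cooks", "the", "rice"]       # input sentence
b = {"<john>", "<cook>", "<rice1>"}      # bag of related concepts, unordered

# The word-to-concept alignment is NOT provided, and |b| may be smaller
# than |x| because some words (here "the") map to no concept.
unaligned = len(x) - len(b)
```

The learner must both decide which words carry concepts and which bag element each one corresponds to.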
Inference A straightforward approach is to learn a function that maps from a sentence x to a concept sequence y.

To conduct experiments with such a situated learner
Figure 3: Inference scheme. Step 0 defines the task: recover the concepts y given a sentence x and the current state of the universe u. For simplicity only relevant concepts and location relations are depicted in this figure. First, non-ambiguous words are labeled in steps 1-3 (not shown). In step 4, to tag the ambiguous pronoun "he", the system has to combine two pieces of information.
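The stepwise scheme of Figure 3 can be mimicked by a simple order-free greedy loop. This is our own minimal reconstruction, not the paper's actual Algorithm 1; the `score` function stands in for the learned model.

```python
# Order-free greedy labeling sketch: at each step, commit the single
# (position, concept) pair with the highest score. Confident, unambiguous
# words are thus labeled first and can inform later, harder decisions.
def order_free_label(words, concepts, score):
    labels = [None] * len(words)                 # None plays the role of the
    while None in labels:                        # unlabeled symbol (⊥)
        i, c = max(
            ((i, c) for i in range(len(words)) if labels[i] is None
             for c in concepts),
            key=lambda pair: score(words, labels, pair[0], pair[1]),
        )
        labels[i] = c                            # commit the best decision
    return labels
```

Because `score` sees the partial labeling `labels`, a decision made at one position (e.g. labeling "rice") can change the scores, and hence the outcome, at another (e.g. the pronoun "he"), which is exactly the behavior the figure illustrates.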
Both g and h are linear neural networks in a similar spirit to (Collobert & Weston, 2008). The first layer of both is a so-called "Lookup Table". We represent each word m in the dictionary with a unique vector D(m) ∈ R^d and every unique concept name name(c) also with a unique vector C(name(c)) ∈ R^d, where we learn these mappings. No hand-crafted syntactic features are used.

To represent a concept and its relations we do something slightly more complicated. A particular concept c (e.g. an object in a particular location, or being held by a particular person) is expressed as the concatenation of three unique concept name vectors:

C̄(c) = (C(name(c)), C(name(location(c))), C(name(containedby(c)))).   (3)

In this way, the learning algorithm can take these dynamic relations from the universe into account, if they are relevant for the labeling task. Hence, the first layer of the network g_i(·) outputs²:

g_i^1(x, y_{-i}, u) = ( D(x_{i-(w-1)/2}), ..., D(x_{i+(w-1)/2}), C̄((y_{-i})_{i-(w-1)/2}), ..., C̄((y_{-i})_{i+(w-1)/2}) ).

The second layer is a linear layer that maps from this 4wd-dimensional vector to the N-dimensional output, i.e. overall we have the function (with parameters W_g ∈ R^{4wd×N} and b_g ∈ R^N):

g_i(x, y_{-i}, u) = W_g g_i^1(x, y_{-i}, u) + b_g.

(An increase in the number of relations would increase the number of parameters to learn; this can be remedied by mapping a large set of relations to a low-dimensional subspace first, using one more layer in the network. However, in this paper, for simplicity, we do not describe this extension further.)

LaSO-type training We train our system online by employing a variation of the LaSO (Learning As Search Optimization) training process (Daumé III & Marcu, 2005). LaSO's central idea is to mix training and inference in a single learning-for-search strategy. During training, we thus perform inference on each given triple (x, b, u) using Algorithm 1, with the following updates in order to learn the model parameters.

Strong supervision We define the predicted labeling ŷ^t at inference step t (see Algorithm 1, step 5) as y-good, compared to the true labeling y, if either ŷ_i^t = y_i or ŷ_i^t = ⊥ for all i. As soon as ŷ^t is no longer y-good we make an "early update" to the model, similar to (Collins & Roark, 2004). The update is a stochastic gradient step such that each possible y-good state one can choose from ŷ^{t-1} is ranked higher than the current incorrect state, i.e. we would like to satisfy the ranking constraints:

for all i with ŷ_i^{t-1} = ⊥:   g(x, ŷ^{t-1}_{+(i,y_i)}, u) > g(x, ŷ^t, u),

where ŷ^{t-1}_{+(i,y_i)} denotes the vector equal to ŷ^{t-1} except that its i-th element is set to y_i. Note that if all such constraints are satisfied then all training examples must be correctly classified.
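As a sanity check on the dimensions of C̄(c) and the first scorer, here is a toy numpy rendering. All sizes, vocabularies and the random initialization are our own illustrative choices, not the paper's; the weight matrix is stored as (N, 4wd) purely as an implementation convention.

```python
import numpy as np

d, N, w = 4, 6, 3            # toy embedding dim, output dim, window width
rng = np.random.default_rng(0)

# Lookup tables: one learned d-dimensional vector per word and per concept
# name (toy vocabularies; "<pad>" handles out-of-bounds window positions).
D = {m: rng.normal(size=d) for m in ["he", "cooks", "the", "rice", "<pad>"]}
C = {n: rng.normal(size=d) for n in ["<john>", "<kitchen>", "<rice1>", "<none>"]}

def concept_vec(name, loc, container):
    """C-bar(c): concatenation of the embeddings of a concept's name, its
    location's name and its containedby's name -- a 3d-dimensional vector."""
    return np.concatenate([C[name], C[loc], C[container]])

# Linear second layer: maps the 4wd-dimensional window features (w word
# vectors of size d plus w concept vectors of size 3d) to N dimensions.
Wg = rng.normal(size=(N, 4 * w * d))
bg = np.zeros(N)

def g(word_window, concept_window):
    feats = np.concatenate([D[m] for m in word_window] +
                           [concept_vec(*c) for c in concept_window])
    return Wg @ feats + bg
```

With w = 3 words and their (possibly empty) concept labels, the feature vector has 3·d + 3·3d = 4wd = 48 entries, matching the 4wd count in the text.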
Likewise, h(y_i, u) has a first layer which outputs C̄(y_i), followed by a linear layer mapping from this 3d-dimensional vector to R^N, i.e. (with parameters W_h ∈ R^{3d×N} and b_h ∈ R^N):

h(y_i, u) = W_h C̄(y_i) + b_h.

Overall, we chose a linear architecture that avoids strongly engineered features and assumes little prior knowledge about the mapping task at hand, but is powerful enough to capture many kinds of relations between words and concepts. We justify our algorithmic choices by comparing with alternatives in the experimental Section 4.

It is important that algorithms for this task both capture the complexity of the task and remain scalable. Our algorithm should scale to a large number of concepts; indeed, it was designed with that in mind: inference scales linearly with the number of concepts. In the model described so far, however, an increase in the number of relations increases the number of parameters to learn in the network, which might not be scalable.

Weak supervision For the weakly supervised case we are given only a supervision bag b. In that case, at each round we define the predicted labeling ŷ^t as y-good if either ŷ_i^t ∈ b or ŷ_i^t = ⊥ for all i (all the words are either unlabeled or labeled with an element of the bag).³ As soon as ŷ^t is no longer y-good, the model is updated to satisfy the ranking constraints:

for all i with ŷ_i^{t-1} = ⊥, for all c ∈ b:   g(x, ŷ^{t-1}_{+(i,c)}, u) > g(x, ŷ^t, u),

where ŷ^{t-1}_{+(i,c)} is the vector equal to ŷ^{t-1} except that its i-th element is set to the concept c. Intuitively, if a label prediction for the word x_j in position j does not belong to the bag b, then we require any prediction that does belong to it to be ranked above this incorrect prediction. (If all such constraints are satisfied, the correct bags are predicted.)

After a concept c from b has been picked, we remove it from the bag. Otherwise the same element could be picked again and again, resulting in an output sequence that does not violate the constraints. Yet, as a bag can contain duplicates, this does not forbid associating the same concept with several words.

Why does this work? Consider again the example "He cooks the rice" in Figure 3. We cannot resolve the first word in the sentence, "He", with the true concept label.

We treated each semantic description as a "bag" of concepts for weak supervision. Following Chen & Mooney (2008), we trained on one match and tested on the other three, averaging over all four possible splits, and we report the "matching" score: each input commentary is associated with several actions, and we measure how often the system retrieves them.

² Padding must be used when indices are out of bounds.
³ For weak supervision, to simplify learning we chose to initialize the weights C(·) and D(·) of our network using CCA (Li & Shawe-Taylor, 2006), treating both x and y as bags. For unambiguous words this essentially initializes their representation similar to their corresponding concept.
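The bag bookkeeping described above, y-goodness with duplicates allowed and removal of an element once picked, can be sketched as follows. This is our own rendering, not code from the paper.

```python
# A partial labeling is y-good under weak supervision if its committed
# labels can all be matched against (multiset) elements of the bag b.
# None plays the role of the unlabeled symbol (⊥).
def is_y_good(partial_labels, bag):
    remaining = list(bag)             # bags may contain duplicate concepts
    for c in partial_labels:
        if c is None:
            continue                  # unlabeled positions are always fine
        if c not in remaining:
            return False              # label not coverable by the bag
        remaining.remove(c)           # a picked element cannot be reused
    return True
```

Using a consumable copy of the bag captures both rules at once: the same bag element cannot license two words, but a concept appearing twice in the bag can still label two words.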
Before using any world knowledge, we wanted to assess

1. Pick the event
⁴ See (Chen & Mooney, 2008) or http://www.cs.utexas.edu/~ml/clamp/sportscasting/#data for details.
⁵ The dataset can be downloaded from http://webia.lip6.fr/~bordes/mywiki/doku.php?id=worlddata.
Table 1: Simulated world examples, e.g. x: "he sits on the chair" and x: "the father gets some yoghurt from the sideboard". Our task is to label sentences x given world knowledge u (not shown).
Method      Supervision   Features               Train Err   Test Err
SVMstruct   strong        x + u (loc, contain)   18.68%      23.57%
NNLR        strong        x + u (loc, contain)    5.42%       5.75%
NNOF        strong        x                      32.50%      35.87%
NNOF        strong        x + u (contain)        15.15%      17.04%
NNOF        strong        x + u (loc)             5.07%       5.22%
NNOF        strong        x + u (loc, contain)    0.00%       0.11%
NNOF        weak          x + u (loc, contain)    0.64%       0.72%

Table 2: Simulation results. We compare our order-free neural network (NNOF) with world knowledge, trained using strong (line 6) and weak (line 7) supervision, to several variants (no world knowledge, line 3; partial world knowledge, lines 4-5; SVM, line 1; left-right inference, line 2). NNOF using full world knowledge performs best.
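The error rates in Table 2 are sequence-level: a predicted labeling counts as wrong if any single tag differs from the gold labeling. A quick sketch of that criterion:

```python
# Proportion of predicted sequences with at least one incorrect tag,
# the error measure reported in Table 2.
def sequence_error_rate(predicted, gold):
    wrong = sum(any(p != q for p, q in zip(ps, qs))
                for ps, qs in zip(predicted, gold))
    return wrong / len(gold)
```

This is a strict measure: a 10-word sentence with one bad tag scores the same as one that is entirely wrong, which makes the near-0% figures in the table all the more demanding.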
Algorithms We compare several models. First, we evaluate our "order-free" neural network algorithm presented in Section 3 (NNOF using x + u), trained with two kinds of supervision: the strong setting, for which word-concept alignments are given, and our more realistic weak setting using a bag. We compare to a model with no access to world knowledge (NNOF using x), or with only partial access via location or containedby knowledge alone (rather than both).

Finally, we compare to two more strongly supervised methods: a greedy left-to-right labeling NN, NNLR (in order to assess the necessity of order-free labeling), and a structured output SVM, SVMstruct (Tsochantaridis et al., 2005). For the SVM, the features from the world model are used as additional input features and Viterbi is used to decode the outputs. Only a linear model was used due to the infeasibility of training non-linear ones (and all the NNs are linear). In all experiments we used word and concept dimension d = 20, g(·) and h(·) have dimension N = 200, a sliding window width of w = 13 (i.e., 6 words on either side of a central word), and we chose the learning rate that minimized the training error, as described in Section 3.

Results The results are given in Table 2. The error rates express the proportion of predicted sequences with at least one incorrect tag. Our model (NNOF) learns to use world knowledge to disambiguate on this task: it obtains a test error close to 0% (line 6) with this knowledge, and around 35% error without it (line 3). It is worth noting that training under weak supervision (last line) does not degrade accuracy: the same model using the less realistic strong setting (line 6) is only slightly better.

Confirming our intuitions about the inference (as in Figure 3), the comparison with other algorithms highlights the following points: (i) order-free labeling of concepts is important compared to more restricted labeling schemes such as left-right labeling (NNLR); (ii) the architecture of our NN, which embeds concepts, is able to capture some useful linguistic information and thus helps generalization; this should be compared to SVMstruct, which does not perform as well. Note that a nonlinear SVM or a linear SVM with hand-crafted features would likely perform better, but the former is too slow and the latter is what we are trying to avoid, as such methods will not generalize to harder tasks.

We believe SVMstruct fails for several reasons. First, it uses Viterbi decoding, which can only employ first-order Markov interactions (for tractability). Thus, simple reasoning involving previously labeled words located "too far" before or after the current word is impossible. Second, its fixed feature representation encodes world knowledge in a binary encoding, so one cannot learn, for example, that concepts sharing the same location are likely to appear within a sentence, other than by memorizing all the possibilities. Our method can discover this and adapt its feature representation accordingly, while SVMstruct cannot.

We constructed our simulation such that all ambiguities could be resolved with world knowledge, which is why we can obtain almost 0% error: this is a good sanity check showing that our method is working well. We believe doing well here is a prerequisite for doing well on harder tasks. The simulation we built uses rules to generate actions and utterances, but our learning algorithm uses no such hand-built rules and instead successfully learns them. We believe this could make our algorithm generalize to harder tasks. One may still be concerned that the environment is
simple and that we know a priori that the model we are learning is sufficiently expressive to capture all the relevant information in the world. In the real world one might not have enough information to achieve such high recognition rates. We therefore considered settings where aspects of the world could not be captured directly in the model that is learned: NNOF using x + u (contain) employs a world model with only a subset of the relational information (it does not have access to the loc relations). Similarly, we tried NNOF using x + u (loc) as well. The results in Table 2 show that our model still learns to perform well (i.e. better than with no world knowledge at all) in the presence of hidden or unavailable world knowledge. Finally, if the amount of training data is reduced we can still perform well: with 5000 training examples for NNOF (x + u (loc, contain)) with the same parameters, we obtain 3.1% test error.

5 Conclusion and Future Work

We have described a general framework for language grounding based on the task of concept labeling. The learning algorithm we propose is scalable and flexible: it learns with only weakly supervised raw data, and with no prior knowledge of how concepts in the world are expressed in natural language. We have tested our framework within a simulation, showing that it is possible to learn (rather than engineer) to resolve ambiguities using world knowledge. We also showed we can learn with real human-annotated data (RoboCup commentaries). Although this is clearly only a first step towards the goal of language understanding, we feel our work is an original way of tackling an important and central problem. The most direct application of our work is within computer games, but other communication tasks could also apply with increased effort.

In terms of algorithms, we feel two important research directions concern (i) addressing the scalability of learning with increased numbers of relations and database concepts, and (ii) improving the algorithms for dealing with weak supervision for larger, noisier and more realistic datasets.

Acknowledgments

Part of this work was supported by ANR-09-EMER-001 and by the Network of Excellence PASCAL2. Antoine Bordes was also supported by the French DGA.

References

Allen, J. Natural Language Understanding. Benjamin-Cummings Publishing Co., 1995.
Barnard, K. and Johnson, M. Word Sense Disambiguation with Pictures. Artificial Intelligence, 167(1-2), 2005.
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A Neural Probabilistic Language Model. JMLR, 3:1137-1155, 2003.
Branavan, S.R.K., Chen, H., Zettlemoyer, L., and Barzilay, R. Reinforcement Learning for Mapping Instructions to Actions. In ACL '09, 2009.
Chen, D.L. and Mooney, R.J. Learning to Sportscast: A Test of Grounded Language Acquisition. In ICML '08, 2008.
Collins, M. and Roark, B. Incremental Parsing with the Perceptron Algorithm. In ACL '04, 2004.
Collobert, R. and Weston, J. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In ICML '08, 2008.
Daumé III, H. and Marcu, D. Learning as Search Optimization: Approximate Large Margin Methods for Structured Prediction. In ICML '05, 2005.
Feldman, J., Lakoff, G., Bailey, D., Narayanan, S., Regier, T., and Stolcke, A. L0: The First Five Years of an Automated Language Acquisition Project. Artificial Intelligence Review, 10(1):103-129, 1996.
Fleischman, M. and Roy, D. Intentional Context in Situated Language Learning. In 9th Conference on Computational Natural Language Learning, 2005.
Harnad, S. The Symbol Grounding Problem. Physica D, 42(1-3):335-346, 1990.
Kate, R.J. and Mooney, R.J. Learning Language Semantics from Ambiguous Supervision. In AAAI '07, 2007.
Li, Y. and Shawe-Taylor, J. Using KCCA for Japanese-English Cross-Language Information Retrieval and Document Classification. Journal of Intelligent Information Systems, 27(2):117-133, 2006.
Liang, P., Jordan, M.I., and Klein, D. Learning Semantic Correspondences with Less Supervision. In ACL '09, 2009.
Roy, D. and Reiter, E. Connecting Language to the World. Artificial Intelligence, 167(1-2):1-12, 2005.
Russell, S.J., Norvig, P., Canny, J.F., Malik, J., and Edwards, D.D. Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, NJ, 1995.
Shen, L., Satta, G., and Joshi, A. Guided Learning for Bidirectional Sequence Classification. In ACL '07, 2007.
Siskind, J.M. Grounding Language in Perception. Artificial Intelligence Review, 8(5):371-391, 1994.
Soon, W.M., Ng, H.T., and Lim, D.C.Y. A Machine Learning Approach to Coreference Resolution of Noun Phrases. Computational Linguistics, 27(4):521-544, 2001.
Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large Margin Methods for Structured and Interdependent Output Variables. JMLR, 6:1453-1484, 2005.
Winograd, T., Barbour, M.G., and Stocking, C.R. Understanding Natural Language. Academic Press, 1972.
Wong, Y.W. and Mooney, R. Learning Synchronous Grammars for Semantic Parsing with Lambda Calculus. In ACL '07, 2007.
Yu, C. and Ballard, D.H. On the Integration of Grounding Language and Learning Objects. In AAAI '04, 2004.
Zettlemoyer, L. and Collins, M. Learning Context-Dependent Mappings from Sentences to Logical Form. In ACL '09, 2009.