Extracting Common Sense Knowledge from Text for Robot Planning

Peter Kaiser1 Mike Lewis2 Ronald P. A. Petrick2 Tamim Asfour1 Mark Steedman2

Abstract— Autonomous robots often require domain knowledge to act intelligently in their environment. This is particularly true for robots that use automated planning techniques, which require symbolic representations of the operating environment and the robot's capabilities. However, the task of specifying domain knowledge by hand is tedious and prone to error. As a result, we aim to automate the process of acquiring general common sense knowledge of objects, relations, and actions, by extracting such information from large amounts of natural language text, written by humans for human readers. We present two methods for knowledge acquisition, requiring only limited human input, which focus on the inference of spatial relations from text. Although our approach is applicable to a range of domains and information, we only consider one type of knowledge here, namely object locations in a kitchen environment. As a proof of concept, we test our approach using an automated planner and show how the addition of common sense knowledge can improve the quality of the generated plans.

Fig. 1: The humanoid robots ARMAR-IIIa (left) and ARMAR-IIIb working in a kitchen environment ([5], [6]).

I. INTRODUCTION AND RELATED WORK

Autonomous robots that use automated planning to make decisions about how to act in the world require symbolic representations of the robot's environment and the actions the robot is able to perform. Such models can be aided by the presence of common sense knowledge, which may help guide the planner to build higher quality plans, compared with the absence of such information. In particular, knowledge about default locations of objects (the juice is in the refrigerator) or the most suitable tool for an action (knives are used for cutting) could help the planner decide which actions are more appropriate in a given context.

For example, if a robot needs a certain object for a task, it can typically employ one of two strategies in the absence of prior domain knowledge: the robot can ask a human for the location of the object, or the robot can search the domain in an attempt to locate the object itself. Both techniques are potentially time consuming and prevent the immediate deployment of autonomous robots in unknown environments. By contrast, the techniques proposed in this paper allow the robot to consider likely locations for an object, informed by common sense knowledge. This potentially improves plan quality, by avoiding exhaustive search, and does not require the aid of a human either to inform the robot directly or to encode the necessary domain knowledge a priori.

While it is not possible to automatically generate all the domain knowledge that could possibly be required, we propose two methods for learning useful elements of domain knowledge based on information gathered from natural language texts. These methods will provide the set of object and action types for the domain, as well as certain relations between entities of these types, of the kind that are commonly used in planning. As an evaluation, we build a domain for a robot working in a kitchen environment (see Fig. 1) and infer spatial relations between objects in this domain. We then show how the induced knowledge can be used by an automated planning system. (The generated symbols will not be grounded in the robot's internal model; however, approaches to establish these links given the names of objects or actions are available, e.g., [1], [2], [3] and [4].)

The extraction of spatial relations from natural language has been studied in the context of understanding commands and directions given to robots in natural language (e.g., [7], [8], [9]). In contrast to approaches based on annotated corpora of command executions or route instructions, or the use of knowledge bases like Open Mind Common Sense [10], explicitly created for artificial intelligence applications, we extract relevant relations from large amounts of text written by humans for humans. The techniques used in [11], [12], [13] to extract action-tool relations to disambiguate visual interpretations of kitchen actions are related. In [14], spatial relations are inferred based on search engine queries and common sense.

In the following, we describe a process for learning domain ontologies (Section II) and for extracting relations (Section III). The last two sections evaluate both methods (Section IV) and describe how the resulting knowledge can be used in an automated planning system (Section V).

II. AUTOMATIC DOMAIN ONTOLOGY LEARNING

1 Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Karlsruhe, Germany {peter.kaiser, asfour}@kit.edu
2 School of Informatics, University of Edinburgh, Edinburgh, Scotland, United Kingdom {mike.lewis, rpetrick, steedman}@inf.ed.ac.uk

In this section, we propose a method for automatically learning a domain ontology D (a set of symbols that refer to a robot's environment or capabilities) with very little human input. The method can be configured to learn a domain of objects or actions. With robotic planning in mind, it is crucial that in either case the contained symbols are not too abstract. In terms of a kitchen environment, interesting objects might be saucepan, refrigerator or apple, while abstract terms like minute or temperature that do not directly refer to objects are avoided. Similarly, we focus on actions that are directly applied to objects, like knead, open or screw, and ignore more abstract actions like have or think.

Automatic domain ontology learning is based on a domain-defining corpus CD, which contains texts concerning the environment that the domain should model. For example, a compilation of recipes is a good domain-defining corpus for a kitchen environment. Note that these texts have been written by humans for human readers, and no effort is taken to make them more suitable for CD. However, CD needs to be part-of-speech (POS) tagged and possibly parsed if compound nouns are to be appropriately recognized.1

1 We use the Stanford Parser [15] to do this.

Fig. 2: The process of domain ontology learning (left) and relation extraction (right). The ontology resulting from the first method can be used as input to relation extraction.

The domain-defining corpus CD is used to retrieve an initial vocabulary V, which is then filtered for abstract symbols. Depending on the type of symbol that this vocabulary is meant to model, only nouns or verbs are included in V. In the first step, CD is analyzed for word frequency and the k most frequent words are extracted (see Alg. 1). Only words with a part-of-speech tag (POS-tag) equal to p ∈ {noun, verb} are considered. The resulting vocabulary V is then filtered according to the score Θ(w, p), which expresses the concreteness of a word w.

Algorithm 1: learnDomainOntology(CD, p, k, Θmin)
1  V ← mostFrequentWords(CD, p, k)
2  D ← {w ∈ V : Θ(w, p) ≥ Θmin}
3  return D

Fig. 2 gives an overview of the domain ontology learning process. Additionally, it shows details about the relation extraction procedure that will be discussed below, and the interoperability between the two methods. In the following section, we discuss the concreteness score Θ in detail.

A. The Concreteness Θ

Having a measure of concreteness is necessary for filtering out symbols that are too abstract to play a role in our target domain. In particular, the score Θ(w, p) expresses the concreteness of a word w with POS-tag p using the lexical database WordNet ([16], [17]). For nouns, WordNet features an ontology that differentiates between physical and abstract entities. However, a word can have different meanings, some of which could be abstract and others not. WordNet solves this issue by working on word-senses rather than on words. For a word w with a sense2 s from the set S(w) of possible senses of w, we can compute a Boolean indicator c_{w,s} that tells us if s is a physical meaning of w:

    c_{w,s} = { 1, if s is a physical meaning of w
              { 0, otherwise.                                   (1)

2 WordNet numbers the different word-senses, so S(w) ⊂ N.

WordNet also features a frequency measure f_{w,s} that indicates how often a word w was encountered in the sense of s, based on a reference corpus. As we are not doing semantic parsing on CD, we do not know which of the possible senses of w is true. However, we can compute a weighted average of the concreteness of the different meanings of w, weighting each word-sense with its likelihood:

    Θ(w) = ( Σ_{s ∈ S(w)} f_{w,s} · c_{w,s} ) / ( Σ_{s ∈ S(w)} f_{w,s} ).    (2)

As a byproduct, Θ can only have nonzero values for words that are contained in WordNet, which filters out misspelled words or parsing errors.

As there is no suitable differentiation in WordNet's ontology for verbs, we cannot apply the exact same approach there. However, WordNet features a rough clustering of verbs that we use to define the filter. We set c_{w,s} to 1 if the verb w with sense s is in one of the following categories: verb.change, verb.contact, verb.creation or verb.motion.

III. RELATION EXTRACTION

The second technique we propose for information acquisition deals with relations between symbols, defined using syntactic patterns. Such patterns capture the syntactic contexts that describe the relevant relations, as well as the relations' arguments. For example, the pattern

    ((#object, noun), (in, prep), (#location, noun))            (3)

describes a prepositional relation between two nouns using the preposition in. The pattern also defines two classes,3 #object and #location, which stand for the two arguments of the relation. Given the above syntactic pattern, two types of questions are relevant in this work:

• Class inference: Is a symbol w more likely an object or a location?
• Relation inference: What is the most likely location for a symbol w?

3 In examples we use a hash to indicate a classname.

The acquisition of relational information is interesting for endowing a robot with initial knowledge of its environment. Our main application for this method is the extraction of spatial relations in a kitchen setting, such as the locations of common objects. However, the proposed method is not constrained to objects and locations, and we will show different use cases in the evaluation in Section IV-C.

The relation extraction process works in two phases:

• In the crawling phase, the text sources are searched for predefined syntactic patterns. Words falling into classes defined in the patterns are counted. The counts are compiled into a set of distributions.
• In the query phase, information can be queried from the distributions computed in the crawling phase. Different kinds of queries are possible.

The foundation for relation extraction is the domain-independent corpus CI. In contrast to the domain-defining corpus CD, CI contains unrestricted text. Because it is rare for common sense information to be explicitly expressed, the size of CI is crucial. We assume that the domain-independent corpus is dependency parsed, i.e., consists of syntactic dependency paths of the kind shown in Fig. 3. A further discussion of CI is given in Section IV-C.

Fig. 3: A dependency path for the fragment milk in refrigerator contains the words, their respective POS-tags and the syntactic relations between them.4

4 NN - noun, IN - preposition, dobj - direct object, prep - preposition, pobj - prepositional object

In the following sections, we give a formal definition of a syntactic pattern and explain the two phases in further detail. Fig. 2 gives an overview of relation extraction.

A. Syntactic Patterns

The goal of the crawling phase is to search large amounts of text for syntactic patterns predefined by the user. These patterns are designed to specify a relation between classes of words. For example, pattern (3) describes a spatial relation between the two classes #object and #location. The fragment milk in refrigerator would match the pattern and would result in the assignments #object=milk and #location=refrigerator.

Formally, a syntactic pattern is defined as a sequence of tuples containing a symbol s_i and a POS-tag p_i:

    Π = ((s_1, p_1), ···, (s_k, p_k)).                          (4)

When matching the pattern to a sequence of words, each tuple will match exactly one word of the sequence. The condition for a match depends on the symbol s_i:

• If s_i is a word, the i-th tuple matches this exact word with POS-tag p_i.
• If s_i is a classname, the i-th tuple matches all words from D having the POS-tag p_i.

We will use the predicates isclass(s_i) and isword(s_i) to distinguish between the two possible meanings of the symbol s_i.

The search for matches happens on word-sequences. Such sequences can represent sentences or, as is the case for our corpus CI, dependency paths. A word-sequence contains words w_i together with their respective POS-tags t_i:

    Σ = ((w_1, t_1), ···, (w_n, t_n)).                          (5)

Alg. 2 decides if an element (s, p) from a syntactic pattern matches an element (w, t) from a word-sequence. If the symbol s is a class, it only checks if the word w is part of the domain ontology D_p that contains the valid words with POS-tag p. If s is a word, it must equal w. In both cases, the POS-tags p and t have to match.

Algorithm 2: match((s, p), (w, t), D)
1  if isclass(s) then
2    return p = t ∧ w ∈ D_p
3  else if isword(s) then
4    return p = t ∧ s = w
5  end

Alg. 3 describes the matching process for a complete syntactic pattern (using Alg. 2). If a match is found, the class configuration is returned as a set of class assignments, i.e., class-word pairs. For example, using pattern (3), the fragment milk in refrigerator results in the class configuration:

    K = {(#object, milk), (#location, refrigerator)}.           (6)

Algorithm 3: configuration(Σ, Π, D)
1  for i = 1, ···, |Σ| − |Π| + 1 do
2    I ← {0, ···, |Π| − 1}
3    if match(Π_{j+1}, Σ_{i+j}, D) ∀j ∈ I then
4      return {(s_{j+1}, w_{i+j}) : j ∈ I, isclass(s_{j+1})}
5    end
6  end
7  return ∅

B. The Crawling Phase

In the crawling phase, CI is searched for pattern matches using Alg. 2 and Alg. 3. Two different distributions are then computed based on the resulting class configurations:

• The Relation Distribution DR counts the occurrences of class configurations (e.g., (6)). DR is suitable for answering the question: How likely is a class configuration for the relation induced by pattern Π?
• The Class Distribution DC counts the occurrences of individual class assignments. It is suitable for answering the question: How likely is a class for a given word?

Alg. 4 shows how DR and DC are computed given a set of dependency paths S and a syntactic pattern Π.

Algorithm 4: computeDistribution(S, Π, D)
1   DR ← Empty Relation Distribution
2   DC ← Empty Class Distribution
3   foreach Σ = ((w_1, t_1), ···, (w_n, t_n)) ∈ S do
4     K ← configuration(Σ, Π, D)
5     if K ≠ ∅ then
6       DR[K] ← DR[K] + 1
7       foreach (c, w) ∈ K do
8         DC[(c, w)] ← DC[(c, w)] + 1
9       end
10    end
11  end
12  return (DR, DC)

C. The Query Phase

The query phase uses the distributions DR and DC to compute pseudo-probabilities for class assignments.

A class query γ(c, w) approximates the probability of a word w falling into a class c. If Γ = {c_1, ···, c_l} is the set of defined classes, the class query can be formulated as:

    γ(c, w) = DC[(c, w)] / ( Σ_{x ∈ Γ} DC[(x, w)] ).            (7)

A relation query ρ(Q, c∗) approximates the probability of a relation with class assignments Q = {(c_1, w_1), ···, (c_l, w_l)}, normalizing over the possible values of the class c∗. With Q∗ = {(c, w) ∈ Q : c ≠ c∗}, the relation query can be formulated as:

    ρ(Q, c∗) = DR[Q] / ( Σ_{v ∈ D} DR[Q∗ ∪ {(c∗, v)}] ).        (8)

In the evaluation, we will consider both types of queries.

IV. EVALUATION

To evaluate the proposed methods of domain learning and relation extraction, we first show that it is possible to use a specialized corpus to generate a domain ontology of entity types that matches people's expectations for the kitchen environment. We then use another, more general text corpus to infer spatial relations and action-tool relations for those entities. These components are independent: we show in Section V that hand specification of the domain entities by a human expert can aid the automated extraction processes.

A. Prerequisites

To learn a domain ontology using the proposed method, the two text corpora CD and CI must first be defined.

1) The domain-defining Corpus CD: This corpus is used to generate an initial vocabulary by analysing word frequencies. CD should therefore be reasonably large but, more importantly, should contain descriptions of common objects and actions from the desired domain. For a kitchen environment, we chose to build CD from a set of about 11,000 recipes,5 with a total size of 19.5 MB.

5 From http://www.ehow.com.

2) The domain-independent Corpus CI: The domain-independent corpus is used to sort entities into different classes according to the results of syntactic pattern matches. CI does not need to be a different corpus than CD, but it is difficult to extract reliable information on rare relations from small corpora. This is especially true for common sense knowledge that is rarely explicitly expressed. Hence, CI should be extensive. As it is often difficult to gather large amounts of text about a specific topic, it is useful to separate CI from CD, and use a large standard corpus for CI.

We use the Google Books Ngrams Corpus [18], in the following referred to as the Google Corpus, which contains a representation of 3.5 million English books with about 345 billion words in total. The corpus is already parsed, tagged and frequency counted. The Google Corpus does not work on sentences, but on syntactic ngrams (Fig. 3), which are subpaths of the dependency paths that are n content words long. We use the corpus in its arcs form, which contains syntactic ngrams with two content words (n = 2) plus possible non-content-words like prepositions or conjunctions. However, the proposed methods can also be used in combination with corpora containing longer syntactic ngrams.

B. Domain Ontology Learning

Using the corpora mentioned above, we can run the method for automatic domain ontology learning. Generating a domain ontology for nouns using parameter values of k = 300, Θmin = 0.35 results in an ontology of 198 words, of which the 80 most frequent are listed in Table I. The 20 most frequent nouns that were part of the initial vocabulary but did not pass the concreteness filter are listed in Table II. Analogously, the 80 most frequent actions from a domain ontology learnt from verbs using the parameters k = 300, Θmin = 0.2 are depicted in Table III. The full domain ontology contains 206 verbs. The 20 most frequent verbs that did not pass the concreteness filter are listed in Table IV.

Results show that for objects, as well as actions, the generated domain ontologies are reasonable, but contain obvious mistakes. For example, the concrete noun cream was rejected while abstract nouns like top and bottom were included. The reason for this is the diversity of possible word senses present in WordNet, which can mislead the filter Θ.

To evaluate the strength of the domain learning method, we asked four people6 to manually extract kitchen-related objects and actions from the sets of the 300 most frequent nouns and verbs from CD. Fig. 4 shows the F1-scores of the automatically learnt domain ontologies for objects and

6 Native speakers of English, not involved in the research.

TABLE I: Automatic domain ontology learning (objects)
TABLE III: Automatic domain ontology learning (actions)

Table I, k = 300, Θmin = 0.35:
 1 wine      21 milk          41 container   61 salad
 2 water     22 bottle        42 home        62 tea
 3 meat      23 fruit         43 bag         63 grill
 4 bowl      24 pot           44 garlic      64 center
 5 sugar     25 dough         45 skillet     65 soup
 6 mixture   26 glass         46 hand        66 alcohol
 7 pan       27 side          47 lid         67 coffee
 8 oil       28 pepper        48 onion       68 beer
 9 top       29 meal          49 skin        69 sheet
10 oven      30 flour         50 saucepan    70 world
11 salt      31 fish          51 egg         71 diet
12 dish      32 refrigerator  52 beef        72 freezer
13 cheese    33 drink         53 layer       73 blender
14 cup       34 chocolate     54 piece       74 batter
15 butter    35 turkey        55 liquid      75 pasta
16 chicken   36 bottom        56 spoon       76 pork
17 juice     37 cake          57 surface     77 addition
18 bread     38 place         58 restaurant  78 dinner
19 rice      39 ice           59 fat         79 vodka
20 sauce     40 knife         60 plate       80 powder

Table III, k = 300, Θmin = 0.2:
 1 add       21 cool      41 come         61 stick
 2 make      22 fill      42 press        62 beat
 3 place     23 leave     43 freeze       63 clean
 4 remove    24 go        44 garnish      64 begin
 5 cook      25 bring     45 pick         65 burn
 6 pour      26 hold      46 open         66 spread
 7 stir      27 reduce    47 slice        67 replace
 8 do        28 follow    48 become       68 whisk
 9 put       29 heat      49 refrigerate  69 boil
10 take      30 pan       50 soak         70 produce
11 get       31 sprinkle  51 dip          71 preheat
12 turn      32 dry       52 form         72 squeeze
13 set       33 start     53 shake        73 chill
14 cut       34 melt      54 cause        74 top
15 cover     35 sit       55 pull         75 peel
16 mix       36 chop      56 break        76 fit
17 combine   37 drain     57 wash         77 move
18 create    38 rinse     58 simmer       78 coat
19 prepare   39 blend     59 lay          79 increase
20 bake      40 roll      60 transfer     80 seal
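The concreteness filter that produced these ontologies (Alg. 1, step 2, using Eq. (2)) can be sketched in a few lines. The word-sense inventory below is invented toy data for illustration; the paper derives c_{w,s} and f_{w,s} from WordNet rather than from a hand-written table.

```python
# Illustrative sketch of Alg. 1's filter step and Eq. (2); the sense
# inventory below is invented toy data, not WordNet's actual senses.

# For each word: a list of (f_ws, c_ws) pairs, i.e., sense frequency
# and a 0/1 flag marking whether that sense is physical.
SENSES = {
    "saucepan":    [(10, 1)],            # single, physical sense
    "temperature": [(25, 0), (3, 0)],    # only abstract senses
    "top":         [(12, 0), (4, 1)],    # mixed senses mislead the filter
}

def theta(word):
    """Eq. (2): frequency-weighted average concreteness over all senses."""
    senses = SENSES.get(word, [])
    total = sum(f for f, _ in senses)
    if total == 0:
        return 0.0  # not in WordNet: misspellings and parse errors drop out
    return sum(f * c for f, c in senses) / total

def learn_domain_ontology(vocabulary, theta_min):
    """Alg. 1, line 2: keep the words whose concreteness passes Θmin."""
    return {w for w in vocabulary if theta(w) >= theta_min}

print(learn_domain_ontology({"saucepan", "temperature", "top", "xzzy"}, 0.35))
# only 'saucepan' survives: 'top' scores 4/16 = 0.25 < 0.35
```

With this toy data, the sketch reproduces the failure mode noted above: a word with mixed senses, like top, is averaged down below the threshold even though one of its senses is physical.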

TABLE II: Not part of the object domain ontology
k = 300, Θmin = 0.35:
 1 time    6 taste    11 temperature  16 type
 2 flavor  7 way      12 day          17 tbsp
 3 heat    8 variety  13 process      18 color
 4 recipe  9 cream    14 boil         19 hour
 5 food   10 amount   15 cooking      20 half

TABLE IV: Not part of the action domain ontology
k = 300, Θmin = 0.2:
 1 be     6 keep   11 eat     16 check
 2 use    7 let    12 choose  17 enjoy
 3 have   8 allow  13 need    18 give
 4 serve  9 try    14 help    19 see
 5 show  10 find   15 buy     20 want
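The candidate lists that these rejected words came from are produced by the frequency step of Alg. 1 (mostFrequentWords), which reduces to counting POS-tagged tokens. The tagged mini-corpus here is an invented stand-in for the recipe corpus CD, which in the paper is tagged by the Stanford Parser.

```python
# Sketch of mostFrequentWords from Alg. 1 on an invented, already
# POS-tagged mini-corpus (the real C_D is far larger and parser-tagged).
from collections import Counter

def most_frequent_words(tagged_corpus, pos, k):
    """Count tokens carrying the requested POS-tag and keep the top k."""
    counts = Counter(w for w, t in tagged_corpus if t == pos)
    return [w for w, _ in counts.most_common(k)]

corpus = [("chop", "verb"), ("the", "det"), ("onion", "noun"),
          ("stir", "verb"), ("the", "det"), ("sauce", "noun"),
          ("stir", "verb"), ("again", "adv"), ("sauce", "noun")]

print(most_frequent_words(corpus, "noun", 2))   # ['sauce', 'onion']
```

Only after this purely frequency-based step does the concreteness filter Θ decide which of the k candidates survive, which is why frequent but abstract words such as time or be appear in the rejection tables above.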

actions, using different values for Θmin, compared to the domains created by the human participants. The results show that enabling the concreteness filter (Θmin > 0) significantly increases the quality of the resulting domain for nouns as well as for verbs. The results also show that values of roughly Θmin > 0.5 produce a filter that is too restrictive. While in the case of nouns the restrictive filter still produces a better domain than if no filter is applied, this is not true for verbs: the quality of a verb domain drops dramatically the more restrictive the filter gets. The reason for the difference between the two plots is that verbs often have a variety of possible meanings. By contrast, nouns usually have a predominant interpretation, at least in terms of the differentiation between physical and abstract meanings. This is also reflected in the fact that the participants found it significantly harder to create a domain of actions than to create a domain of nouns. (We can also tune domain generation to work in a more restrictive way, e.g., by using the F0.5 measure instead of F1 to emphasize precision over recall.)

C. Inference

We now evaluate the relation and class inference mechanisms described in Section III. To illustrate the capabilities of these methods, we generated the results using the manually created domain ontology as a gold standard. We additionally show how false positives can affect the process by using the automatically learnt domain ontology. The parameter Θmin can be determined in practice by generating and evaluating an ontology for a subset of the initial vocabulary using plots similar to Fig. 4. Different syntactic patterns can be used to conduct different kinds of inference. The following sections show examples of possible queries.

1) Location Inference: A good use of knowledge acquisition is the exploration of spatial relations between objects and locations using prepositional contexts. For instance, pattern (3) matches fragments where two nouns, #object and #location, are linked by the preposition in. This pattern can be used in combination with the above object ontology (Table I) to infer spatial relations in a kitchen environment.

Table V shows the most likely locations for the ten most frequent objects from the automatically learnt domain ontology.7 Note that for generating the results we used pattern (3) combined with three similar patterns using the prepositions on, at and from. Table V presents two sets of locations for each object: the upper, highlighted rows refer to the manually created domain ontology and the lower, non-highlighted rows refer to the automatically learnt domain ontology in Table I. Results from the automatically learnt domain ontologies are more noisy, and distracting terms like side or bottom have not been filtered out.

7 We consider top and oven not to be objects.

TABLE VI: Results for tool inference
Highlighted rows: Manually created domain ontology. Non-highlighted rows: Automatically learnt domain ontology.

action  first           second        third
cut     knife / 0.80    fork / 0.01   machine / 0.01
        knife / 0.68    hand / 0.04   world / 0.03
flip    spatula / 0.89  spoon / 0.06  fork / 0.03
        spatula / 0.65  hand / 0.24   spoon / 0.05
mash    fork / 0.58     spoon / 0.16  butter / 0.09
        fork / 0.59     spoon / 0.16  butter / 0.09
stir    spoon / 0.50    fork / 0.20   spatula / 0.08
        spoon / 0.48    fork / 0.19   spatula / 0.08

The results demonstrate that the system is able to infer typical locations for objects. However, two problems constrain its performance. First, the automatically learnt domain ontology does not contain typical locations like cupboard or drawer, because these words do not frequently appear in the initial vocabulary. Second, the system is not able to differentiate between container objects like pot or pan, and actual locations like refrigerator or oven (i.e., objects that have a fixed position in the kitchen). Improving the domain entity specification by using more diverse but relevant domain-specific corpora is the subject of ongoing research. In Section V
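The location and tool inference queries evaluated here chain together the matching of Alg. 3, the counting of Alg. 4, and the queries of Eqs. (7) and (8). A condensed, runnable sketch on a handful of toy dependency paths (the paths and the domain ontology are invented; the real system runs over the Google Corpus):

```python
# Condensed sketch of Algs. 2-4 and Eqs. (7)-(8) on invented toy data.
from collections import Counter

PATTERN = [("#object", "noun"), ("in", "prep"), ("#location", "noun")]
DOMAIN = {"noun": {"milk", "juice", "refrigerator", "bowl"}}

def configuration(sigma, pi, domain):
    """Algs. 2 + 3 (condensed): class assignments of the first match."""
    for i in range(len(sigma) - len(pi) + 1):
        window = sigma[i:i + len(pi)]
        def ok(elem, word):
            (s, p), (w, t) = elem, word
            if s.startswith("#"):                       # classname
                return p == t and w in domain.get(p, set())
            return p == t and s == w                    # literal word
        if all(ok(pi[j], window[j]) for j in range(len(pi))):
            return frozenset((s, w) for (s, _), (w, _) in zip(pi, window)
                             if s.startswith("#"))
    return frozenset()

def compute_distributions(paths, pi, domain):
    """Alg. 4: relation distribution D_R and class distribution D_C."""
    d_r, d_c = Counter(), Counter()
    for sigma in paths:
        k = configuration(sigma, pi, domain)
        if k:
            d_r[k] += 1
            d_c.update(k)
    return d_r, d_c

def class_query(c, w, classes, d_c):
    """Eq. (7): gamma(c, w)."""
    total = sum(d_c[(x, w)] for x in classes)
    return d_c[(c, w)] / total if total else 0.0

def relation_query(q, c_star, values, d_r):
    """Eq. (8): rho(Q, c*), normalized over possible values of c*."""
    q_rest = {(c, w) for (c, w) in q if c != c_star}
    total = sum(d_r[frozenset(q_rest | {(c_star, v)})] for v in values)
    return d_r[frozenset(q)] / total if total else 0.0

paths = [
    [("milk", "noun"), ("in", "prep"), ("refrigerator", "noun")],
    [("milk", "noun"), ("in", "prep"), ("refrigerator", "noun")],
    [("milk", "noun"), ("in", "prep"), ("bowl", "noun")],
    [("juice", "noun"), ("in", "prep"), ("refrigerator", "noun")],
]
D_R, D_C = compute_distributions(paths, PATTERN, DOMAIN)

# Location inference: 2 of the 3 matches for milk put it in the refrigerator.
q = {("#object", "milk"), ("#location", "refrigerator")}
print(relation_query(q, "#location", DOMAIN["noun"], D_R))  # 0.6666666666666666
```

The same machinery answers class queries, e.g. whether refrigerator behaves more like an object or a location, which is the kind of question evaluated for Table VII.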

we show the effect of more helpful entity specification.

2) Tool Inference: A similar approach that also includes the action domain ontology uses the preposition with to infer relations between actions and tools. The following syntactic pattern matches a verb #action and a noun #tool from the respective domain ontologies, linked together by with:

    ((#action, verb), (with, prep), (#tool, noun)).             (9)

Table VI shows the three most probable tools for different actions from the kitchen domain. The results are shown for both the manually created domain ontology (upper rows, highlighted) and the automatically learnt one (lower rows).

3) Class Inference: Another possible result the system can compute is the probability that a word falls into a certain class of a syntactic pattern. For example, given the above pattern (3), the system can approximate the probability that a word names an object or a location (Table VII). These results can be used to improve the location inference results, e.g., by dropping words that seem unlikely to name a location.

4) Computation Time: In this work, the domain ontology learning and distribution computation steps are considered to be run offline. However, we note that the computation time for these steps depends heavily on the sizes and representations of CD and CI.8 Processing the Google Corpus requires especially high computational power. On the other hand, the inference step consists of simple lookups in precomputed distributions, and can therefore be done online.

8 The Google Arcs Corpus contains 38G of compressed text.

Fig. 4: Automatically learnt domain ontologies are evaluated using different values of Θmin, by comparing them to domain ontologies created manually by human participants. (Plot: F1-score for objects and for actions against the concreteness filter threshold Θmin.)

TABLE VII: Results for class inference
Manually created domain ontology (left), automatically learnt domain ontology (right)

symbol    object  location  |  object  location
wine      0.80    0.20      |  0.85    0.15
water     0.48    0.52      |  0.56    0.44
meat      0.77    0.23      |  0.81    0.19
bowl      0.11    0.89      |  0.16    0.84
sugar     0.93    0.07      |  0.95    0.05
mixture   0.72    0.28      |  0.77    0.23
pan       0.17    0.83      |  0.18    0.82
oil       0.82    0.18      |  0.83    0.17
oven      0.12    0.88      |  0.11    0.89
salt      0.95    0.05      |  0.96    0.04

TABLE V: Results for location inference
Highlighted rows: Manually created domain ontology. Non-highlighted rows: Automatically learnt domain ontology.
refr. - refrigerator, scp. - saucepan, swc. - sandwich, kit. - kitchen

object   first           second         third           fourth
wine     glass / 0.20    bottle / 0.20  table / 0.15    cup / 0.14
         glass / 0.13    bottle / 0.13  table / 0.10    cup / 0.09
water    surface / 0.09  bottle / 0.07  water / 0.07    glass / 0.07
         bottom / 0.06   side / 0.06    surface / 0.05  bottle / 0.04
meat     table / 0.08    pan / 0.06     swc. / 0.05     pot / 0.05
         diet / 0.11     table / 0.05   pan / 0.04      swc. / 0.04
bowl     table / 0.51    refr. / 0.06   stem / 0.05     kit. / 0.05
         table / 0.27    hand / 0.20    top / 0.06      side / 0.05
sugar    water / 0.15    bowl / 0.15    scp. / 0.10     milk / 0.08
         water / 0.13    bowl / 0.13    scp. / 0.08     milk / 0.06
mixture  pan / 0.10      bowl / 0.08    water / 0.08    dish / 0.08
         top / 0.10      pan / 0.07     bowl / 0.05     water / 0.05
pan      oven / 0.40     stove / 0.19   rack / 0.18     pan / 0.02
         oven / 0.29     stove / 0.14   rack / 0.13     hand / 0.07
oil      skillet / 0.22  pan / 0.19     scp. / 0.08     board / 0.07
         skillet / 0.19  pan / 0.16     scp. / 0.07     board / 0.06
salt     water / 0.40    bowl / 0.17    scp. / 0.05     food / 0.04
         water / 0.33    bowl / 0.14    diet / 0.06     scp. / 0.04
dish     table / 0.30    oven / 0.24    menu / 0.11     pan / 0.03
         table / 0.20    oven / 0.16    hand / 0.12     top / 0.04

V. PLANNING WITH COMMON SENSE KNOWLEDGE

In this section we show how the domain knowledge induced by the processes described above can be used with an automated planning system to improve the quality of generated plans. We have chosen to use the PKS (Planning with Knowledge and Sensing) planner [19], [20] for this task, since PKS has previously been deployed in robot environments like the one in Fig. 1 [21]. However, one of the strengths of the above approach is that it is not planner (or domain) dependent, and the method we outline for PKS can be adapted to a range of different planners and domains.

As an example scenario, we will focus on the use of spatial relations in a small kitchen domain. The domain contains the entities cereal, counter, cup, cupboard, juice, plate, refrigerator and stove. Table VIII shows the results of the location inference method. We will first postprocess this data for planning by considering the entity juice.

TABLE VIII: Location inference for a small domain
Omitted values are zero.

object  counter  cup   cupboard  dishwasher  juice  plate  refrigerator  stove
cereal  1.00
cup     0.25     0.33  0.13      0.01        0.18   0.02   0.08
juice   0.02     0.43                        0.21   0.08   0.26
plate   0.14     0.10  0.07      0.58        0.06   0.06

A. Postprocessing

Given the initial domain of objects, and using pattern (3), we can approximate the probability of an object o being spatially related to a location l by issuing a relation query:

    P(loc = l | obj = o) = ρ({(obj, o), (loc, l)}, loc).        (10)

To put these results into a suitable form for planning, we introduce the predicate at and output the computed likelihoods for pairs of objects. The resulting relations that are extracted, and their likelihoods, are shown in the top half of Table IX.

TABLE IX: Extracted and postprocessed relations

Extracted Relations        Likelihood
at(juice, cup)             0.43
at(juice, refrigerator)    0.27
at(juice, juice)           0.21
at(juice, plate)           0.08
at(juice, counter)         0.02

Postprocessed Relations    Likelihood
at(applejuice, fridge)     0.27
at(applejuice, counter)    0.02
at(orangejuice, fridge)    0.27
at(orangejuice, counter)   0.02

The postprocessor must now refine the results, possibly making use of additional information about the structure of the planning domain and the types of objects that are available. Refinement can be done in three possible ways:

1) Symbol Mapping: A word that describes an object in natural language may not necessarily match the symbol name for that object in the planning domain. This is currently corrected by an appropriate mapping process that uses a dictionary of likely synonyms. E.g., the word refrigerator may be mapped to fridge.
2) Type Filtering: Many planners have the concept of object types, which enables us to filter out relations that have entity arguments of the incorrect type. Assuming the planning domain provides us with a type location that is required for the second argument of at, the postprocessor can then remove the extracted relations at(juice, cup), at(juice, juice), and at(juice, plate), since the entities cup, juice, and plate are not locations.
3) Instantiation: The symbols extracted by our processes will often refer to classes of objects, rather than the specific object identifiers used by the planner. Making use of type information in the planning domain, the postprocessor can instantiate objects of the appropriate types from the extracted relational information. For instance, the class juice might be instantiated into two objects, applejuice and orangejuice. These objects can subsequently be substituted into any relation that contains the appropriate class type.

The final set of postprocessed relations from our example is shown in the bottom half of Table IX. We note that the necessity and possibility of applying these postprocessing steps depends on the nature of the planning domain. Furthermore, the information that is needed to perform the postprocessing, i.e., the symbol mapping table or the type information, needs to be manually encoded in the planning domain.

Given the postprocessed set of relations, the final step is to decide how this information will be included in the planning domain. For planners that work with probabilistic representations, the relation/likelihood information could be directly encoded. For planners like PKS that do not deal with probabilities, there are two main possibilities:

1) The most probable location for each object could be encoded as a single fact in the planner's knowledge, i.e., at(applejuice, fridge) and at(orangejuice, fridge).
2) Some or all of the most probable locations could be encoded as a disjunction of possible alternatives, i.e., at(applejuice, fridge) | at(applejuice, counter), and at(orangejuice, fridge) | at(orangejuice, counter).

Depending on the domain, either form may be appropriate.

B. Plan Generation

Consider the task of finding the apple juice container in the kitchen. In the absence of precise information as to the object's location, but knowing there are various places in the kitchen where objects could be located (e.g., counter, cupboard, fridge, stove), a planner could potentially build a plan for a robot to exhaustively check all locations: move-robot-to-counter, check-for-apple-juice, if not present move-robot-to-cupboard, check-for-apple-juice, if not present move-robot-to-fridge, etc., until all locations have been checked. If the robot does not have information-gathering capabilities to check for the apple juice in a particular location, the planner may not be able to generate such a plan at all.

With the availability of more certain information about the location of the apple juice, the planner can potentially eliminate some parts of the plan (e.g., by ignoring certain locations), or at least prioritise certain likely locations over others, resulting in higher quality plans. For instance, in the case that the planner had the knowledge at(applejuice, fridge), resulting from the above relation extraction process, then the planner could build the simple plan move-robot-to-fridge, under the assumption that the extracted information was true. Similarly, if the planner had the disjunctive information at(applejuice, fridge) | at(applejuice, counter), then the planner could build the plan: move-robot-to-fridge, check-for-apple-juice, if not present move-robot-to-counter. Again, this plan improves on the exhaustive search plan by only considering the most likely locations for the apple juice, resulting from the extracted relational information.

One inherent danger when dealing with common sense knowledge is that the plans that are built from such information alone may ultimately fail to achieve their goals in the real world. For instance, even though relation extraction provides us with likely locations for the apple juice, there is no guarantee that this is the way the robot's world is

[3] K. Welke, P. Kaiser, A. Kozlov, N. Adermann, T. Asfour, M. Lewis, and M. Steedman, "Grounded spatial symbols for task planning based on experience," in 13th IEEE/RAS International Conference on Humanoid Robots (Humanoids), 2013.
actually configured. (E.g., another robot may have left the [4] A. Kasper, R. Becher, P. Steinhaus, and R. Dillmann, “Developing and apple juice on the stove.) However, such information does analyzing intuitive modes for interactive object modeling,” in ICMI ’07: Proceedings of the 9th international conference on Multimodal give us a starting point for building plans, in the absence of interfaces. New York, NY, USA: ACM, 2007, pp. 74–81. more certain information, and can also aid plan execution [5] T. Asfour, K. Regenstein, P. Azad, J. Schroder,¨ N. Vahrenkamp, monitoring to guide replanning activities in the case of plan and R. Dillmann, “ARMAR-III: An integrated humanoid platform for sensory-motor control,” in IEEE International Conference on failure. (E.g., if a plan built using common sense knowledge Humanoid Robots (Humanoids), 2006, pp. 169–175. fails to locate the apple juice, fall back to the exhaustive [6] T. Asfour, P. Azad, N. Vahrenkamp, K. Regenstein, A. Bierbaum, search plan for the locations that haven’t been checked.) K. Welke, J. Schroder,¨ and R. Dillmann, “Toward humanoid manip- ulation in human-centred environments,” Robotics and Autonomous Finally, we note that the use of common sense knowledge Systems, vol. 56, no. 1, pp. 54–65, 2008. may improve the efficiency of plan generation, since in [7] S. Tellex, T. Kollar, S. Dickerson, M. Walter, A. Banerjee, S. Teller, general more specific information helps constrain the plan and N. Roy, “Understanding natural language commands for robotic navigation and mobile manipulation,” in Proceedings of the 25th generation process. However, plan generation time is both National Conference on Artificial Intelligence. AAAI, 2011, pp. domain and planner dependent, and it is difficult to quantify 1507–1514. any improvements without experimentation. (E.g., planning [8] T. Kollar, S. Tellex, D. Roy, and N. 
Roy, “Toward understanding natural language directions,” in Proceedings of the 5th International time went from 0.003s to 0.001s in our small examples.) Conference on Human-Robot Interaction (HRI). IEEE, 2010, pp. VI.CONCLUSIONAND FUTURE WORK 259–266. [9] D. Chen and R. Mooney, “Learning to interpret natural language We have presented two techniques for reducing the amount navigation instructions from observations,” in Proceedings of the 25th of prior, hardcoded knowledge that is necessary for building AAAI Conference on Artificial Intelligence (AAAI-2011), 2011, pp. 859–865. a robotic planning domain. Using the methods described [10] P. Singh, T. Lin, E. T. Mueller, G. Lim, T. Perkins, and W. L. Zhu, here, a domain ontology of object and action types can be “Open mind common sense: Knowledge acquisition from the general defined automatically, over which user-defined relations can public,” in On the Move to Meaningful Systems 2002: CoopIS, DOA, and ODBASE. Springer, 2002, pp. 1223–1237. be inferred automatically from sources of natural language [11] C. Teo, Y. Yang, H. Daume´ III, C. Fermuller,¨ and Y. Aloimonos, “A text. The resulting representation of common sense domain corpus-guided framework for robotic visual perception,” in Workshop knowledge has been tested using an automated planning on Language-Action Tools for Cognitive Artificial Agents, held at the 25th National Conference on Artificial Intelligence. San Francisco: system, improving the quality of the generated plans. AAAI, 2011, pp. 36–42. As future work, we are exploring a number of improve- [12] ——, “Toward a Watson that sees: Language-guided action recogni- tion for robots,” in IEEE International Conference on Robotics and ments to our techniques. First, more specialized corpora CI , Automation. St. Paul, MN: IEEE, 2012, pp. 374–381. longer syntactic patterns, or databases of common sense [13] M. Tamosiunaite, I. Markelic, T. Kulvicius, and F. 
Worg¨ otter,¨ “Gen- knowledge might help in overcoming the sparsity of com- eralizing objects by analyzing language,” in 11th International Con- mon sense information in text sources. Second, the location ference on Humanoid Robots (Humanoids). IEEE/RAS, 2011, pp. 557–563. inference does not perform any checks for plausibility. While [14] K. Zhou, M. Zillich, H. Zender, and M. Vincze, “Web mining driven the class inference will help in filtering results that are not object locality knowledge acquisition for efficient robot behavior,” locations at all, additional methods are needed to differentiate in 2012 International Conference on Intelligent Robots and Systems (IROS). IEEE/RSJ, 2012, pp. 3962–3969. between locations for temporary storage and locations for [15] D. Klein and C. D. Manning, “Accurate unlexicalized parsing,” Pro- long-term storage. Another interesting improvement would ceedings of the 41st Annual Meeting on Association for Computational be the generalization of inferred relations to still missing Linguistics ACL 03, vol. 1, pp. 423–430, 2003. [16] G. A. Miller, “WordNet: a lexical database for English,” Communica- knowledge. For example one could conclude by analyzing tions of the ACM, vol. 38, pp. 39–41, 1995. text sources that bowl and dish are conceptually similar and [17] C. Fellbaum, WordNet: An Electronic Lexical Database. Cambridge, therefore apply relations inferred for bowls also to dishes. MA: MIT Press, 1998. [18] Y. Goldberg and J. Orwant, “A dataset of syntactic-ngrams over time Finally, we are investigating the application of our methods from a very large corpus of english books,” in Second Joint Conference to robot domains other than the kitchen environment. on Lexical and Computational Semantics, 2013, pp. 241–247. [19] R. P. A. Petrick and F. 
Bacchus, “A knowledge-based approach to ACKNOWLEDGMENT planning with incomplete information and sensing,” in International The research leading to these results received funding from Conference on Artificial Intelligence Planning and Scheduling (AIPS- 2002), 2002, pp. 212–221. the European Union’s 7th Framework Programme FP7/2007- [20] ——, “Extending the knowledge-based approach to planning with 2013, under grant agreement No270273 (Xperience). incomplete information and sensing,” in International Conference on Automated Planning and Scheduling (ICAPS 2004), 2004, pp. 2–11. REFERENCES [21] R. Petrick, N. Adermann, T. Asfour, M. Steedman, and R. Dillmann, “Connecting knowledge-level planning and task execution on a hu- [1] M. Ternoth, U. Klank, D. Pangercic, and M. Beetz, “Web-enabled manoid robot using Object-Action Complexes,” in Proceedings of the robots,” Robotics & Automation Magazine, vol. 18, no. 2, pp. 58–68, International Conference on Cognitive Systems (CogSys 2010), 2010. 2011. [2] M. Waibel, M. Beetz, J. Civera, R. D’Andrea, J. Elfring, D. Galvez- Lopez, K. Haussermann, R. Janssen, J. Montiel, A. Perzylo, B. Schiessle, M. Tenorth, O. Zweigle, and R. van de Molengraft, “Roboearth,” Robotics Automation Magazine, IEEE, vol. 18, no. 2, pp. 69–82, 2011.
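The three postprocessing steps of Section A (symbol mapping, type filtering, and instantiation) can be sketched as a small pipeline over the extracted relations. This is a minimal illustration, not the authors' implementation: the dictionaries below simply encode the example from Table IX, and the function name `postprocess` is our own.

```python
"""Sketch of the postprocessing pipeline from Sec. A (illustrative only)."""

# Extracted likelihoods for the entity "juice" (top half of Table IX).
extracted = {
    ("juice", "cup"): 0.43,
    ("juice", "refrigerator"): 0.27,
    ("juice", "juice"): 0.21,
    ("juice", "plate"): 0.08,
    ("juice", "counter"): 0.02,
}

# 1) Symbol mapping: natural-language words -> planner symbol names.
synonyms = {"refrigerator": "fridge"}

# 2) Type filtering: only symbols of type "location" may fill the
#    second argument of at/2.
locations = {"fridge", "counter", "cupboard", "stove", "dishwasher"}

# 3) Instantiation: object classes expand to concrete planner objects.
instances = {"juice": ["applejuice", "orangejuice"]}

def postprocess(relations):
    out = {}
    for (obj, loc), p in relations.items():
        loc = synonyms.get(loc, loc)            # symbol mapping
        if loc not in locations:                # type filtering
            continue
        for inst in instances.get(obj, [obj]):  # instantiation
            out[(inst, loc)] = p
    return out

for (obj, loc), p in sorted(postprocess(extracted).items()):
    print(f"at({obj}, {loc}) {p:.2f}")
```

Running the sketch reproduces the bottom half of Table IX: the relations involving cup, juice, and plate are filtered out, refrigerator is mapped to fridge, and the class juice is expanded into applejuice and orangejuice.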
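The two non-probabilistic encodings discussed at the end of Section A (a single most-likely fact, or a disjunction of alternatives) can likewise be sketched. The string format of the facts follows the paper's notation; the function names `best_fact` and `disjunction` are our own illustrative choices, not part of PKS.

```python
"""Sketch of the two encoding options for non-probabilistic planners."""

# Postprocessed likelihoods (bottom half of Table IX).
postprocessed = {
    ("applejuice", "fridge"): 0.27,
    ("applejuice", "counter"): 0.02,
    ("orangejuice", "fridge"): 0.27,
    ("orangejuice", "counter"): 0.02,
}

def best_fact(obj):
    # Option 1: keep only the most probable location as a known fact.
    loc, _ = max(((l, p) for (o, l), p in postprocessed.items() if o == obj),
                 key=lambda lp: lp[1])
    return f"at({obj}, {loc})"

def disjunction(obj):
    # Option 2: keep all candidate locations as a disjunction,
    # ordered from most to least likely.
    cands = sorted(((l, p) for (o, l), p in postprocessed.items() if o == obj),
                   key=lambda lp: -lp[1])
    return " | ".join(f"at({obj}, {l})" for l, _ in cands)

print(best_fact("applejuice"))    # at(applejuice, fridge)
print(disjunction("applejuice"))  # at(applejuice, fridge) | at(applejuice, counter)
```

As the paper notes, which form is appropriate depends on the domain; the disjunction retains more of the extracted information at the cost of a larger planning state.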
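The exhaustive and knowledge-informed search plans of Section B differ only in which candidate locations are considered, and in what order. The following sketch makes that explicit; the flat list-of-strings plan representation is an illustrative assumption (a real planner such as PKS would produce a structured conditional plan), while the action names follow the text.

```python
"""Sketch of conditional search-plan construction from Sec. B."""

ALL_LOCATIONS = ["counter", "cupboard", "fridge", "stove"]

def search_plan(obj, likely=None):
    # Without common sense knowledge, exhaustively check every location;
    # with it, check only the likely locations, most probable first.
    candidates = likely if likely else ALL_LOCATIONS
    steps = []
    for i, loc in enumerate(candidates):
        if i > 0:
            steps.append("if not present")
        steps.append(f"move-robot-to-{loc}")
        steps.append(f"check-for-{obj}")
    return steps

# Exhaustive plan (no common sense knowledge available):
print(search_plan("apple-juice"))

# Plan from the disjunction at(applejuice, fridge) | at(applejuice, counter):
print(search_plan("apple-juice", ["fridge", "counter"]))
```

The second call yields the shorter plan from the paper (fridge first, then counter on failure), and the first can serve as the fallback plan mentioned for execution monitoring.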
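The generalization idea sketched in the conclusion (transferring relations inferred for bowls to dishes) could, for instance, be realised with a distributional similarity measure over syntactic contexts. Everything below is an illustrative assumption on our part: the toy context counts, the cosine measure, the 0.9 threshold, and the `transfer` function are not from the paper.

```python
"""Sketch of relation transfer between distributionally similar words."""

from math import sqrt

# Toy counts of (word, syntactic context) co-occurrences.
contexts = {
    "bowl":  {"in the X": 10, "wash the X": 6, "X of cereal": 4},
    "dish":  {"in the X": 9,  "wash the X": 7, "X of cereal": 3},
    "stove": {"on the X": 12, "turn on the X": 5},
}

def cosine(u, v):
    # Cosine similarity of two sparse count vectors.
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def transfer(relations, word, new_word, threshold=0.9):
    # Copy a word's relations over to a sufficiently similar word.
    if cosine(contexts[word], contexts[new_word]) < threshold:
        return {}
    return {(new_word, loc): p for (w, loc), p in relations.items() if w == word}

rels = {("bowl", "cupboard"): 0.6}
print(transfer(rels, "bowl", "dish"))   # {('dish', 'cupboard'): 0.6}
print(transfer(rels, "bowl", "stove"))  # {}
```

A lexical resource such as WordNet [16], [17], already used elsewhere in this line of work, would be a natural alternative source of the similarity judgement.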