Arxiv:1709.10426V1 [Cs.CL] 29 Sep 2017

Training an adaptive dialogue policy for interactive learning of visually grounded word meanings

Yanchao Yu Arash Eshghi Oliver Lemon Interaction Lab Interaction Lab Interaction Lab Heriot-Watt University Heriot-Watt University Heriot-Watt University [email protected] [email protected] [email protected]

Abstract

We present a multi-modal dialogue system for interactive learning of perceptually grounded word meanings from a human tutor. The system integrates an incremental, semantic parsing/generation framework - Dynamic Syntax and Type Theory with Records (DS-TTR) - with a set of visual classifiers that are learned through- out the interaction and which ground the Figure 1: Example dialogues & interactively meaning representations that it produces. agreed semantic contents. We use this system in interaction with a al., 2014; Socher et al., 2014; Farhadi et al., 2009; simulated human tutor to study the ef- Silberer and Lapata, 2014; Sun et al., 2013). fects of different dialogue policies and ca- Our goal is to build interactive systems that can pabilities on accuracy of learned mean- learn grounded word meanings relating to their ings, learning rates, and efforts/costs to the perceptions of real-world objects – this is differ- tutor. We show that the overall perfor- ent from previous work such as e.g. (Roy, 2002), mance of the learning agent is affected by that learn groundings from descriptions without (1) who takes initiative in the dialogues; any interaction, and more recent work using Deep (2) the ability to express/use their confi- Learning methods (e.g. (Socher et al., 2014)). dence level about visual attributes; and (3) Most of these systems rely on training data of the ability to process elliptical and incre- high quantity with no possibility of online error mentally constructed dialogue turns. Ulti- correction. Furthermore, they are unsuitable for mately, we train an adaptive dialogue pol- robots and multimodal systems that need to con- icy which optimises the trade-off between tinuously, and incrementally learn from the envi- classifier accuracy and tutoring costs. ronment, and may encounter objects they haven’t 1 Introduction seen in training data. These limitations are likely arXiv:1709.10426v1 [cs.CL] 29 Sep 2017 to be alleviated if systems can learn concepts, as Identifying, classifying, and talking about objects and when needed, from situated dialogue with hu- or events in the surrounding environment are key mans. Interaction with a human tutor also enables capabilities for intelligent, goal-driven systems systems to take initiative and seek the particular that interact with other agents and the external information they need or lack by e.g. asking ques- world (e.g. robots, smart spaces, and other auto- tions with the highest information gain (see e.g. mated systems). To this end, there has recently (Skocaj et al., 2011), and Fig. 1). For example, a been a surge of interest and significant progress robot could ask questions to learn the colour of a made on a variety of related tasks, including gen- “square" or to request to be presented with more eration of Natural Language (NL) descriptions of “red" things to improve its performance on the images, or identifying images based on NL de- concept (see e.g. Fig. 1). Furthermore, such sys- scriptions (Karpathy and Fei-Fei, 2014; Bruni et tems could allow for meaning negotiation in the form of clarification interactions with the tutor. novel physical objects and their attributes, either This setting means that the system must be using offline methods as in (Farhadi et al., 2009; trainable from little data, compositional, adaptive, Lampert et al., 2014; Socher et al., 2013; Kong et and able to handle natural human dialogue with al., 2013), or through an incremental learning pro- all its glorious context-sensitivity and messiness cess, where the system’s parameters are updated – for instance so that it can learn visual concepts after each training example is presented to the sys- suitable for specific tasks/domains, or even those tem (Furao and Hasegawa, 2006; Zheng et al., specific to a particular user. Interactive systems 2013; Kristan and Leonardis, 2014). For the inter- that learn continuously, and over the long run from active learning task presented here, only the latter humans need to do so incrementally, quickly, and is appropriate, as the system is expected to learn with minimal effort/cost to human tutors. from its interactions with a human tutor over a pe- In this paper, we first outline an implemented riod of time. Shen & Hasegawa (2006) propose dialogue system that integrates an incremental, se- the SOINN-SVM model that re-trains linear SVM mantic grammar framework, especially suited to classifiers with data points that are clustered to- dialogue processing – Dynamic Syntax and Type gether with all the examples seen so far. The clus- Theory with Records (DS-TTR1 (Kempson et al., tering is done incrementally, but the system needs 2001; Eshghi et al., 2012)) with visual classi- to keep all the examples so far in memory. Kristian fiers which are learned during the interaction, and & Leonardis (2014), on the other hand, propose which provide perceptual grounding for the ba- the oKDE model that continuously learns categor- sic semantic atoms in the semantic representations ical knowledge about visual attributes as probabil- (Record Types in TTR) produced by the parser ity distributions over the categories (e.g. colours). (see Fig. 1, Fig. 2 and section 3). However, when learning from scratch, it is unre- We then use this system in interaction with a alistic to predefine these concept groups (e.g. that simulated human tutor, to test hypotheses about red, blue, and green are colours). Systems need to how the accuracy of learned meanings, learning learn for themselves that, e.g. colour is grounded rates, and the overall cost/effort for the human tu- in a specific sub-space of an object’s features. For tor are affected by different dialogue policies and the visual classifiers, we therefore assume no such capabilities: (1) who takes initiative in the dia- category groupings here, and instead learn individ- logues; (2) the agent’s ability to utilise their level ual binary classifiers for each visual attribute (see of uncertainty about an object’s attributes; and section 3.1 for details). (3) their ability to process elliptical as well as incrementally constructed dialogue turns. The re- Distributional vs. Logical Representations. sults show that differences along these dimensions Learning to ground natural language in percep- have significant impact both on the accuracy of the tion is one of the fundamental problems in Arti- grounded word meanings that are learned, and the ficial Intelligence. There are two main strands of processing effort required by the tutors. work that address this problem: (1) those that learn In section 4.3 we train an adaptive dialogue distributional representations using Deep Learn- strategy that finds a better trade-off between clas- ing methods: this often works by projecting vector sifier accuracy and tutor cost. representations from different modalities (e.g. vision and language) into the same space in order 2 Related work to be able to retrieve one from the other (Socher et al., 2014; Karpathy and Li, 2015; Silberer and In this section, we will present an overview of vi- Lapata, 2014); (2) those that attempt to ground sion and language processing systems, as well as symbolic logical forms, obtained through seman- multi-modal systems that learn to associate them. tic parsing (Tellex et al., 2014; Kollar et al., 2013; We compare them along two main dimensions: Vi- Matuszek et al., 2014) in classifiers of various en- sual Classification methods: offline vs. online and tities types/events/relations in a segment of an im- the kinds of representation learned/used. age or a video. Perhaps one advantage of the latter Online vs. Offline Learning. A number of im- over the former method, is that it is strictly com- plemented systems have shown good performance positional, i.e. the contribution of the meaning of on classification as well as NL-description of an individual word, or semantic atom, to the whole 1Download from http://dylan.sourceforge.net representation is clear, whereas this is hard to say about the distributional models. As noted, our seen images by predicting binary label vectors. work also uses the latter methodology, though it is We build visual feature representations to learn dialogue, rather than sentence semantics that we classifiers for particular attributes, as explained in care about. Most similar to our work is probably the following subsections. that of Kennington & Schlangen (2015) who learn a mapping between individual words - rather than 3.1.1 Visual Feature Representation logical atoms - and low-level visual features (e.g. In contrast to previous work (Yu et al., 2015a; colour-values) directly. The system is composi- Yu et al., 2015b), to reduce feature noise through tional, yet does not use a grammar (the composi- the learning process, we simplify the method of tions are defined by hand). Further, the ground- feature extraction consisting of two base feature ings are learned from pairings of object references categories, i.e. the colour space for colour at- in NL and images rather than from dialogue. tributes, and a ‘bag of visual words’ for the object What sets our approach apart from others is: shapes/class. a) that we use a domain-general, incremental se- Colour descriptors, consisting of HSV colour mantic grammar with principled mechanisms for space values, are extracted for each pixel and then parsing and generation; b) Given the DS model are quantized to a 16×4×4 HSV matrix. These de- of dialogue (Eshghi et al., 2015), representations scriptors inside the bounding box are binned into are constructed jointly and interactively by the tu- individual histograms. Meanwhile, a bag of vi- tor and system over the course of several turns sual words is built in PHOW descriptors using a (see Fig. 1); c) perception and NL-semantics are visual dictionary (that is pre-defined with a hand- modelled in a single logical formalism (TTR); d) made image set). These visual words will be we effectively induce an ontology of atomic types calculated using 2x2 blocks, a 4-pixel step size, in TTR, which can be combined in arbitrarily and quantized into 1024 k-means centres. The complex ways for generation of complex descrip- feature extractor in the vision module presents a tions of arbitrarily complex visual scenes (see e.g. 1280-dimensional feature vector for a single train- (Dobnik et al., 2012) and compare this with (Ken- ing/test instance by stacking all quantized features, nington and Schlangen, 2015), who do not use a as shown in Figure 2. grammar and therefore do not have logical struc- ture over grounded meanings). 3.2 Dynamic Syntax and Type Theory with Records 3 System Architecture Dynamic Syntax (DS) a is a word-by-word incremental semantic parser/generator, based around We have developed a system to support an the Dynamic Syntax (DS) grammar framework attribute-based object learning process through (Cann et al., 2005) especially suited to the frag- natural, incremental spoken dialogue interaction. mentary and highly contextual nature of dialogue. The architecture of the system is shown in Fig. 2. In DS, dialogue is modelled as the interactive and The system has two main modules: a vision mod- incremental construction of contextual and seman- ule for visual feature extraction, classification, and tic representations (Eshghi et al., 2015). The con- learning; and a dialogue system module using DS- textual representations afforded by DS are of the TTR. Below we describe these components indi- fine-grained semantic content that is jointly nego- vidually and then explain how they interact. tiated/agreed upon by the interlocutors, as a re- 3.1 Attribute-based Classifiers used sult of processing questions and answers, clarification requests, corrections, acceptances, etc. Yu et. al (2015a; 2015b) point out that neither We cannot go into any further detail due to lack multi-label classification models nor ‘zero-shot’ of space, but proceed to introduce Type Theory learning models show acceptable performance on with Records, the formalism in which the DS attribute-based learning tasks. Here, we instead contextual/semantic representations are couched, use Logistic Regression SVM classifiers with but also that within which perception is modelled Stochastic Gradient Descent (SGD) (Zhang, 2004) here. to incrementally learn attribute predictions. All classifiers will output attribute-based label Type Theory with Records (TTR) is an ex- sets and corresponding probabilities for novel un- tension of standard type theory shown to be use- Figure 2: Architecture of the teachable system ful in semantics and dialogue modelling (Cooper, code partial knowledge about objects, and for this 2005; Ginzburg, 2012). TTR is particularly well- knowledge (e.g. object attributes) to be extended suited to our problem here as it allows information in a principled way, as and when this information from various modalities, including vision and lan- becomes available. guage, to be represented within a single semantic framework (see e.g. Larsson (2013); Dobnik et al. 3.3 Integration (2012) who use it to model the semantics of spatial Fig. 2 shows how the various parts of the system language and perceptual classification). interact. At any point in time, the system has ac- In TTR, logical forms are specified as record cess to an ontology of (object) types and attributes types (RTs), which are sequences of fields of the encoded as a set of TTR Record Types, whose in- form [l : T ] containing a label l and a type T. RTs dividual atomic symbols, such as ‘red’ or ‘square’ can be witnessed (i.e. judged true) by records of are grounded in the set of classifiers trained so far. that type, where a record is a sequence of label- Given a set of individuated objects in a scene, value pairs [l = v]. We say that [l = v] is of type encoded as a TTR Record, the system can utilise [l : T ] just in case v is of type T. its existing ontology to output a Record Type which maximally characterises the scene (see e.g.    l1 : T1  " # Fig. 1). Dynamic Syntax operates over the same   l1 : T1 R1 :  l2=a : T2  R2 : R3 : [] representations, they provide a direct interface be-   l2 : T20 l3=p(l2) : T3 tween perceptual classification and semantic processing in dialogue: this representation acts not Figure 3: Example TTR record types only as (1) the non-linguistic (here, visual) context Fields can be manifest, i.e. given a singleton of the dialogue for the resolution of e.g. definite type e.g. [l : Ta ] where Ta is the type of which reference and indexicals (see (Hough and Purver, only a is a member; here, we write this using the 2014)); but also as (2) the logical database from syntactic sugar [l=a : T ]. Fields can also be de- which the system can generate utterances (descrip- pendent on fields preceding them (i.e. higher) in tions), ask, or answer questions about the objects - the record type (see Fig. 3). Fig. 4 illustrates how the semantics of the answer The standard subtype relation v can be defined to a question is retrieved from the visual context for record types: R1 v R2 if for all fields [l : T2 ] through unification (this uses the standard sub- in R2, R1 contains [l : T1 ] where T1 v T2. In Fig- type checking operation within TTR). ure 3, R1 v R2 if T2 v T20 , and both R1 and R2 are Conversely, for concept learning, the DS-TTR subtypes of R3. This subtyping relation allows se- parser incrementally produces Record Types (RT), mantic information to be incrementally specified, representing the meaning jointly established by i.e. record types can be indefinitely extended with the tutor and the system so far. In this domain, more information/constraints. For us here, this this is ultimately one or more type judgements, i.e. is a key feature since it allows the system to en- that some scene/image/object is judged to be of a Figure 4: Question Answering by the system particular type, e.g. in Fig. 1 that the individuated Design. We use the dialogue system outlined object, o1 is a red square. These jointly negoti- above to carry out our main experiment with a ated type judgements then go on to provide train- 2 × 2 × 2 factorial design, i.e. with three fac- ing instances for the classifiers. In general, the tors each with two levels. Together, these fac- training instances are of the form, hO, Ti, where tors determine the learner’s dialogue behaviour: O is an image/scene segment (an object or TTR (1) Initiative (Learner/Tutor): determines who Record), and T, a record type. T is then decom- takes initiative in the dialogues. When the tu- posed into its constituent atomic types T1 ... Tn, tor takes initiative, s/he is the one that drives the V V s.t. Ti = T - where is the so called meet conversation forward, by asking questions to the operation corresponding to type conjunction. The learner (e.g. “What colour is this?” or “So this judgements O : Ti are then used directly to train is a ....” ) or making a statement about the at- the classifiers that ground the Ti. tributes of the object. On the other hand, when the learner has initiative, it makes statements, asks 4 Experimental Setup questions, initiates topics etc. (2) Uncertainty (+UC/-UC): determines whether the learner takes As noted in the introduction, interactive systems into account, in its dialogue behaviour, its own that learn continuously, and over the long run from subjective confidence about the attributes of the humans need to do so incrementally; as quickly as presented object. The confidence is the proba- possible; and with as little effort/cost to the human bility assigned by any of its attribute classifiers tutor as possible. In addition, when learning takes of the object being a positive instance of an at- place through dialogue, the dialogue needs to be tribute (e.g. ‘red’) - see below for how a confi- as human-like/natural as possible. dence threshold is used here. In +UC, the agent In general, there are several different dialogue will not ask a question if it is confident about the capabilities and policies that a concept-learning answer, and it will hedge the answer to a tutor agent might adopt, and these will lead to differ- question if it is not confident, e.g. “T: What is ent outcomes for the accuracy of the learned con- this? L: errm, maybe a square?". In -UC, the cepts/meanings, learning rates, and cost to the tu- agent always takes itself to know the attributes of tor – with trade-offs between these. Our goal in the given object (as given by its currently trained this paper is therefore an experimental study of classifiers), and behaves according to that assump- the effect of different dialogue policies and capa- tion. (3) Context-Dependency (+CD/-CD): de- bilities on the overall performance of the learn- termines whether the learner can process (pro- ing agent, which, as we describe below is a mea- duce/parse) context-dependent expressions such sure capturing the trade-off between accuracy of as short answers and incrementally constructed learned meanings and the cost of tutoring. turns, e.g. “T: What is this? L: a square", or “T: Figure 5: Example dialogues in different conditions

So this one is ...? L: red/a circle”. This setting can trusts its prediction or not. If the confidence score be turned off/on in the DS-TTR dialogue model. of a classifier is between the positive and base thresholds, the learner is not very confident about Tutor Simulation and Policy To run our experi- its knowledge, and will check with the tutor, e.g. ment on a large-scale, we have hand-crafted an In- “L: is this red?”. However, if the confidence score teractive Tutoring Simulator, which simulates the of a classifier is above the positive threshold, the behaviour of a human tutor2. The tutor policy is learner is confident enough in its knowledge not to kept constant across all conditions. Its policy is bother verifying it with the tutor. This will lead to that of an always truthful, helpful and omniscient less effort needed from the tutor as the learner be- one: it (1) has complete access to the labels of comes more confident about its knowledge. How- each object; and (2) always acts as the context of ever, since a learning agent that has high confi- the dialogue dictates: answers any question asked, dence about a prediction will not ask for assistance confirms or rejects when the learner describes an from the tutor, a low positive threshold would re- object; and (3) always corrects the learner when it duce the chances that allow the tutor to correct the describes an object erroneously. learner’s mistakes. We therefore tested different Dependent Measures We now go on to describe fixed values for the confidence threshold and this the dependent measures in our experiment, i.e. that determined a fixed 0.5 base threshold and a 0.9 of classifier accuracy/score, tutoring cost, and the positive threshold were deemed to be the most ap- overall performance measure which combines the propriate values for an interactive learning process former two measures. - i.e. these values preserved good classifier accuracy while not requiring much effort from the tutor Confidence Threshold To determine when the - see below Section 4.3 for how an adaptive pol- agent takes themselves to be confident in an at- icy was learned that adjusts the agent’s confidence tribute prediction, we use confidence-score thresh- threshold dynamically over time. olds. It consists of two values, a base threshold (e.g. 0.5) and a positive threshold (e.g. 0.9). 4.1 Evaluation Metrics If the confidences of all classifiers are under the To test how the different dialogue capabilities and base threshold (i.e. the learner has no attribute la- strategies affect the learning process, we consider bel that it is confident about), the agent will ask both the cost to the tutor and the accuracy of the for information directly from the tutor via ques- learned meanings, i.e. the classifiers that ground tions (e.g. “L: what is this?"). our colour and shape concepts. On the other hand, if one or more classifiers score above the base threshold, then the positive Cost The cost measure reflects the effort needed threshold is used to judge to what extent the agent by a human tutor in interacting with the system. Skocaj et. al. (2009) point out that a com- 2The experiment involves hundreds of dialogues, so run- prehensive teachable system should learn as au- ning this experiment with real human tutors has proven too costly at this juncture, though we plan to do this for a full tonomously as possible, rather than involving the evaluation of our system in the future. human tutor too frequently. There are several pos- ing. For each training instance, the system inter- Table 1: Tutoring Cost Table acts (only through dialogue) with the simulated C C C C C in f ack crt parsing production tutor. Each dialogue about an object ends either 1 0.25 1 0.5 1 when both the shape and the colour of the object are discussed and agreed upon, or when the learner sible costs that the tutor might incur, see Table 1: requests to be presented with the next image (this Cin f refers to the cost of the tutor providing infor- happens only in the Learner initiative conditions). mation on a single attribute concept (e.g. “this is We define a Learning Step as comprised of 10 red” or “this is a square"); Cack is the cost for a such dialogues. At the end of each learning step, simple confirmation (like “yes", “right") or rejec- the system is tested using the test set (100 test in- tion (such as “no”); Ccrt is the cost of correction stances). for a single concept (e.g. “no, it is blue" or “no, it This process is repeated 20 times, i.e. for 20 is a circle"). We associate a higher cost with cor- rounds/folds, each time with a different, random rection of statements than that of polar questions. 500-100 split, thus resulting in 20 data-points for This is to penalise the learning agent when it confi- cost and accuracy after every learning step. The dently makes a false statement – thereby incorpo- values reported below, including those on the plots rating an aspect of trust in the metric (humans will in Fig. 6a, 6b and 6c, correspond to averages not trust systems which confidently make false across the 20 folds. statements). And finally, parsing (Cparse) as well 4.3 Learning an Adaptive Policy for a as production (Cproduction) costs for tutor are taken into account: each single word costs 0.5 when Dynamic Confidence Threshold parsed by the tutor, and 1 if generated (produc- In the experiment presented above, the learning tion costs twice as much as parsing). These exact agent’s positive confidence threshold was held values are based on intuition but are kept constant constant, at 0.9. However, since the confidence across the experimental conditions and therefore threshold itself becomes more reliable as the agent do not confound the results reported below. is exposed to more training instances, we further hypothesised that a threshold that changes dynam- Learning Performance As mentioned above, ically over time should lead to a better trade-off an efficient learner dialogue policy should con- between classification accuracy and cost for the sider both classification accuracy and tutor effort tutor, i.e. a better Overall Performance Ratio (see (Cost). We thus define an integrated measure – the above). For example, lower positive thresholds Overall Performance Ratio (R ) – that we use to per f may be more appropriate at the later stages of compare the learner’s overall performance across training when the agent is already performing well the different conditions: ∆Acc with attribute classifiers which are more reliable. Rper f = This leads to different dialogue behaviours, as the Ctutor learner takes different decisions as it encounters i.e. the increase in accuracy per unit of the cost, or more training examples. equivalently the gradient of the curve in Fig. 4c. To test this hypothesis we further trained and We seek dialogue strategies that maximise this. evaluated an adaptive policy that adjusts the learn- Dataset The dataset used here is comprised of ing agent’s confidence threshold as it interacts 600 images of single, simple handmade objects with the tutor (in the +UC conditions only). with a white background (see Fig.1)3. There are This optimization used a Markov Decision Pro- 4 nine attributes considered in this dataset: 6 colours cess (MDP) model and Reinforcement Learning , (black, blue, green, orange, purple and red) and 3 where: (1) the state space was determined by vari- shapes (circle, square and triangle), with a relative 4A reviewer points out that one can handle uncertainty balance on the number of instances per attribute. in a more principled way, possibly with better results, using POMDPs. Another reviewer points out that the policy learned 4.2 Evaluation and Cross-validation is only adapting the confidence threshold, and not the other conditions (uncertainty, initiative, context-dependency). We In each round, the system is trained using 500 point out that we are addressing both of these limitations in training instances, with the rest set aside for test- work in progress, where we feed each classifier’s outputted confidence level as a continuous feature in a (continuous 3All data from this paper will be made freely available. space) MDP for full dialogue control. (a) Accuracy (b) Tutoring Cost

(c) Overall Performance Figure 6: Evolution of Learning Performance ables for the number of training instances seen so as the system interacts over time with the tutor far, and the agent’s current confidence threshold about each of the 500 training instances. The (2) the actions were either to increase or decrease ninth curve in red (L+UC(Adaptive)+CD) shows the confidence threshold by 0.05, or keep it the the same for the learning agent with a dynamic same; (3) the local reward signal was directly confidence threshold using the policy trained us- proportional to the agent’s Overall Performance ing Reinforcement Learning (section 4.3) - the lat- Ratio over the previous Learning Step (10 train- ter is only compared below to the dark blue curve ing instances, see above); and (4) the SARSA al- (L+UC+CD). As noted in passing, the vertical gorithm (Sutton and Barto, 1998) was chosen for axes in these graphs are based on averages across learning, with each episode defined as a complete the 20 folds - recall that for Accuracy the system run through the 500 training instances. was tested, in each fold, at every learning step, i.e. after every 10 training instances. 5 Results Fig. 6c, on the other hand, plots Accuracy Fig. 5 shows example interactions between the against Tutoring Cost directly. Note that it is to learner and the tutor in some of the experimental be expected that the curves should not terminate conditions. Note how the system is able to deal in the same place on the x-axis since the different with (parse and generate) utterance continuations conditions incur different total costs for the tutor as in T+UC+CD, short answers as in L+UC+CD, across the 500 training instances. The gradient of and polar answers as in T + UC + CD. this curve corresponds to increase in Accuracy per Fig. 6a and 6b plot the progression of aver- unit of the Tutoring Cost. It is the gradient of the age Accuracy and (cumulative) Tutoring Cost for line drawn from the beginning to the end of each each of the 8 conditions in our main experiment, curve (tan(β) on Fig. 4c) that constitutes our main evaluation measure of the system’s overall perfor- old (red/L+UC(adaptive)+CD) achieves slightly mance in each condition, and it is this measure for lower overall gradient than its constant threshold which we report statistical significance results: counter-part (blue/L+UC+CD), it achieves much A between-subjects Analysis of Variance higher Accuracy overall, and does this much faster (ANOVA) shows significant main effects of Ini- in the first 1000 units of cost (roughly the total tiative (p < 0.01; F = 448.33), Uncertainty cost in L+UC+CD condition). We therefore con- (p < 0.01; F = 206.06) and Context-Dependency clude that the adaptive policy is more desirable. (p < 0.05; F = 4.31) on the system’s over- Finally, the significant main effect of Context- all performance. There is also a significant Dependency on the overall performance is ex- Initiative×Uncertainty interaction (p < 0.01; F = plained by the fact in the +CD conditions, the 194.31). agent is able to process context-dependent and Keeping all other conditions constant incrementally constructed turns, leading to less (L+UC+CD), there is also a significant main repetition, shorter dialogues, and therefore better effect of Confidence Threshold type (Con- overall performance. stant vs. Adaptive) on the same measure (p < 0.01; F = 206.06). The mean gradient of the 7 Conclusion and Future work red, adaptive curve is actually slightly lower than We have presented a multi-modal dialogue system its constant-threshold counter-part blue curve - that learns grounded word meanings from a hu- discussed below. man tutor, incrementally, over time, and employs 6 Discussion a dynamic dialogue policy (optimised using Rein- forcement Learning). The system integrates a se- Tutoring Cost As can be seen on Fig. 6b, the cu- mantic grammar for dialogue (DS), and a logical mulative cost for the tutor progresses more slowly theory of types (TTR), with a set of visual classi- when the learner has initiative (L) and takes its fiers in which the TTR semantic representations confidence into account in its behaviour (+UC) - are grounded. We used this implemented sys- the grey, blue, and red curves. This is so because a tem to study the effect of different dialogue poli- form of active learning is taking place: the learner cies and capabilities on the overall performance only asks a question about an attribute if it isn’t of a learning agent - a combined measure of ac- confident enough already about that attribute. This curacy and cost. The results show that in order also explains the slight decrease in the gradients to maximise its performance, the agent needs to of the curves as the agent is exposed to more and take initiative in the dialogues, take into account more training instances: its subjective confidence its changing confidence about its predictions, and about its own predictions increases over time, and be able to process natural, human-like dialogue. thus there is progressively less need for tutoring. Ongoing work further uses Reinforcement Learning to learn complete, incremental dialogue Accuracy On the other hand, the L+UC curves policies, i.e. which choose system output at the (grey and blue) on Fig. 6a show the slowest in- lexical level (Eshghi and Lemon, 2014). To deal crease in accuracy and flatten out at about 0.76. with uncertainty this system takes all the classi- This is because the agent’s confidence score in the fiers’ outputted confidence levels directly as fea- beginning is unreliable as the agent has only seen tures in a continuous space MDP. a few training instances: in many cases it doesn’t query the tutor or have any interaction whatsoever Acknowledgements with it and so there are informative examples that it doesn’t get exposed to. In contrast to this, the This research is supported by the EPSRC, un- L+UC(adaptive)+CD curve (red) achieves much der grant number EP/M01553X/1 (BABBLE better accuracy. project5), and by the European Union’s Hori- Comparing the gradients of the curves on Fig. zon 2020 research and innovation programme 6c shows that the overall performance of the agent under grant agreement No. 688147 (MuMMER on the gradient measure is significantly better than project6). others in the L+UC conditions (recall the signif- 5https://sites.google.com/site/ icant Initiative × Uncertainty interaction). How- hwinteractionlab/babble ever, while the agent with an adaptive thresh- 6http://mummer-project.eu/ References [Karpathy and Fei-Fei2014] Andrej Karpathy and Li Fei-Fei. 2014. Deep visual-semantic align- [Bruni et al.2014] Elia Bruni, Nam-Khanh Tran, and ments for generating image descriptions. CoRR, Marco Baroni. 2014. Multimodal distributional se- abs/1412.2306. mantics. J. Artif. Intell. Res.(JAIR), 49(1–47). [Karpathy and Li2015] Andrej Karpathy and Fei-Fei [Cann et al.2005] Ronnie Cann, Ruth Kempson, and Li. 2015. Deep visual-semantic alignments for gen- Lutz Marten. 2005. The Dynamics of Language. erating image descriptions. In IEEE Conference on Elsevier, Oxford. Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages [Cooper2005] Robin Cooper. 2005. Records and 3128–3137. record types in semantic theory. Journal of Logic and Computation, 15(2):99–112. [Kempson et al.2001] Ruth Kempson, Wilfried Meyer- Viol, and Dov Gabbay. 2001. Dynamic Syntax: The [Dobnik et al.2012] Simon Dobnik, Robin Cooper, and Flow of Language Understanding. Blackwell. Staffan Larsson. 2012. Modelling language, ac- tion, and perception in type theory with records. [Kennington and Schlangen2015] Casey Kennington In Proceedings of the 7th International Workshop and David Schlangen. 2015. Simple learning and on Constraint Solving and Language Processing compositional application of perceptually grounded ˘ (CSLPâAŽÄô12), pages 51–63. word meanings for incremental reference resolution. In Proceedings of the Conference for the Associa- [Eshghi and Lemon2014] Arash Eshghi and Oliver tion for Computational Linguistics (ACL-IJCNLP). Lemon. 2014. How domain-general can we be? Association for Computational Linguistics. learning incremental dialogue systems without dialogue acts. In Proceedings of SemDial. [Kollar et al.2013] Thomas Kollar, Jayant Krishna- murthy, and Grant Strimel. 2013. Toward inter- [Eshghi et al.2012] Arash Eshghi, Julian Hough, active grounded language acqusition. In Robotics: Matthew Purver, Ruth Kempson, and Eleni Gre- Science and Systems. goromichelaki. 2012. Conversational interactions: Capturing dialogue dynamics. In S. Larsson and [Kong et al.2013] Xiangnan Kong, Michael K. Ng, and L. Borin, editors, From Quantification to Conversa- Zhi-Hua Zhou. 2013. Transductive multilabel tion: Festschrift for Robin Cooper on the occasion learning via label set propagation. IEEE Trans. of his 65th birthday, volume 19 of Tributes, pages Knowl. Data Eng., 25(3):704–719. 325–349. College Publications, London. [Kristan and Leonardis2014] Matej Kristan and Ales [Eshghi et al.2015] A. Eshghi, C. Howes, E. Gre- Leonardis. 2014. Online discriminative kernel den- goromichelaki, J. Hough, and M. Purver. 2015. sity estimator with gaussian kernels. IEEE Trans. Feedback in conversation as incremental seman- Cybernetics, 44(3):355–365. tic update. In Proceedings of the 11th Inter- national Conference on Computational Semantics [Lampert et al.2014] Christoph H. Lampert, Hannes (IWCS 2015), London, UK. Association for Com- Nickisch, and Stefan Harmeling. 2014. Attribute- putational Linguisitics. based classification for zero-shot visual object cate- gorization. IEEE Trans. Pattern Anal. Mach. Intell., [Farhadi et al.2009] Ali Farhadi, Ian Endres, Derek 36(3):453–465. Hoiem, and David Forsyth. 2009. Describing objects by their attributes. In Proceedings of the IEEE [Larsson2013] Staffan Larsson. 2013. Formal seman- Computer Society Conference on Computer Vision tics for perceptual classification. Journal of logic and Pattern Recognition (CVPR. and computation.

[Furao and Hasegawa2006] Shen Furao and Osamu [Matuszek et al.2014] Cynthia Matuszek, Liefeng Bo, Hasegawa. 2006. An incremental network for on- Luke Zettlemoyer, and Dieter Fox. 2014. Learn- line unsupervised classification and topology learning from unscripted deictic gesture and language ing. Neural Networks, 19(1):90–106. for human-robot interactions. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intel- [Ginzburg2012] Jonathan Ginzburg. 2012. The Inter- ligence, July 27 -31, 2014, Québec City, Québec, active Stance: Meaning for Conversation. Oxford Canada., pages 2556–2563. University Press. [Roy2002] Deb Roy. 2002. A trainable visually- [Hough and Purver2014] Julian Hough and Matthew grounded spoken language generation system. In Purver. 2014. Probabilistic type theory for in- Proceedings of the International Conference of Spo- cremental dialogue processing. In Proceedings of ken Language Processing. the EACL 2014 Workshop on Type Theory and Nat- ural Language Semantics (TTNLS), pages 80–88, [Silberer and Lapata2014] Carina Silberer and Mirella Gothenburg, Sweden, April. Association for Com- Lapata. 2014. Learning grounded meaning repre- putational Linguistics. sentations with autoencoders. In Proceedings of the 52nd Annual Meeting of the Association for Compu- [Zhang2004] Tong Zhang. 2004. Solving large scale tational Linguistics (Volume 1: Long Papers), vol- linear prediction problems using stochastic gradient ume 1, pages 721–732, Baltimore, Maryland, June. descent algorithms. In Proceedings of the twenty- Association for Computational Linguistics. first international conference on Machine learning, page 116. ACM. [Skocaj et al.2011] Danijel Skocaj, Matej Kristan, Alen Vrecko, Marko Mahnic, Miroslav Janícek, Geert- [Zheng et al.2013] Jun Zheng, Furao Shen, Hongjun Jan M. Kruijff, Marc Hanheide, Nick Hawes, Fan, and Jinxi Zhao. 2013. An online incremental Thomas Keller, Michael Zillich, and Kai Zhou. learning support vector machine for large-scale data. 2011. A system for interactive learning in dialogue Neural Computing and Applications, 22(5):1023– with a tutor. In 2011 IEEE/RSJ International Con- 1035. ference on Intelligent Robots and Systems, IROS 2011, San Francisco, CA, USA, September 25-30, 2011, pages 3387–3394. [Skocajˇ et al.2009] Danijel Skocaj,ˇ Matej Kristan, and Aleš Leonardis. 2009. Formalization of different learning strategies in a continuous learning framework. In Proceedings of the Ninth International Conference on Epigenetic Robotics; Modeling Cog- nitive Development in Robotic Systems, pages 153– 160. Lund University Cognitive Studies. [Socher et al.2013] Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Y. Ng. 2013. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems, pages 935–943, Lake Tahoe, Nevada, USA. [Socher et al.2014] Richard Socher, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218. [Sun et al.2013] Yuyin Sun, Liefeng Bo, and Dieter Fox. 2013. Attribute based object identification. In Robotics and Automation (ICRA), 2013 IEEE Inter- national Conference on, pages 2096–2103. IEEE. [Sutton and Barto1998] Richard S. Sutton and An- drew G. Barto. 1998. Reinforcement Learning: an Introduction. MIT Press. [Tellex et al.2014] Stefanie Tellex, Pratiksha Thaker, Joshua Mason Joseph, and Nicholas Roy. 2014. Learning perceptually grounded word meanings from unaligned parallel data. Machine Learning, 94(2):151–167. [Yu et al.2015a] Yanchao Yu, Arash Eshghi, and Oliver Lemon. 2015a. Comparing attribute classifiers for interactive language grounding. In Proceedings of the Fourth Workshop on Vision and Language, pages 60–69, Lisbon, Portugal, September. Association for Computational Linguistics. [Yu et al.2015b] Yanchao Yu, Oliver Lemon, and Arash Eshghi. 2015b. Interactive learning through dialogue for multimodal language grounding. In Sem- Dial 2015, Proceedings of the 19th Workshop on the Semantics and Pragmatics of Dialogue, Gothen- burg, Sweden, August 24-26 2015, pages 214–215.