De-aliasing States in Dialogue Modelling with Inverse Reinforcement Learning

Layla El Asri (Borealis AI¹)    Adam Trischler (Microsoft Research)    Geoff Gordon (Microsoft Research)

¹ Work done at Microsoft Research.

Abstract

End-to-end dialogue response generation models learn dialogue state tracking, dialogue management, and natural language generation at the same time and through the same training signal. These models scale better than traditional modular architectures as they do not require much annotation. Despite significant advances, these models, often built using Recurrent Neural Networks (RNNs), exhibit deficiencies such as repetition, inconsistency, and low task-completion rates. To understand some of these issues more deeply, this paper investigates the representations learned by RNNs trained on dialogue data. We highlight the problem of state aliasing, which entails conflating two or more distinct states in the representation space. We show empirically that state aliasing often occurs when encoder-decoder RNNs are trained via maximum likelihood or policy gradients. We propose to augment the training signal with information about the future to force the latent representations of the RNNs to hold sufficient information for predicting the future. Specifically, we train encoder-decoder RNNs to predict both the next utterance and a feature vector that represents the expected dialogue future. We draw inspiration from the Structured-Classification Inverse Reinforcement Learning (SCIRL, Klein et al., 2012) algorithm to compute this feature vector. In experiments on a generated dataset of text-based games, the augmented training signal mitigates state aliasing and improves model performance significantly.

1 Introduction

Dialogue response generation (DRG) can be framed as a reinforcement learning problem where the actions are words or structured sequences of words. Much recent work has adopted the former framing and has proposed data-driven training of DRG models in two steps: first, train the model to output responses using the maximum-likelihood objective, via teacher forcing; second, continue training with a distinct, possibly non-differentiable objective (e.g., maximizing the BLEU score) using policy-gradient methods (Papineni et al., 2002; Ranzato et al., 2016; Bahdanau et al., 2017; Strub et al., 2017; Narayan et al., 2018; Wu et al., 2018).

Encoder-decoder models based on recurrent neural networks now set the state of the art in DRG, but still exhibit deficiencies like inconsistency, poor syntax, and repetition. For example, they tend to repeat the same sentences inappropriately within a dialogue or across dialogues. This has been observed in both goal-oriented and general-purpose conversational settings (Das et al., 2017; Strub et al., 2017; Li et al., 2016; Holtzman et al., 2019). Several recent works have focused on improving the decoding mechanism of such models (Bahdanau et al., 2017; Wiseman and Rush, 2016; Wu et al., 2018; Holtzman et al., 2019). In this paper, we take the complementary approach and investigate the representations learned by the encoder. We hypothesize that one cause of repetition is state aliasing, which entails conflating two or more distinct states in the representation space. Recent work has shown that an RNN encoder's hidden representations of two different dialogue contexts may be very similar if the utterances following these contexts are the same (El Asri and Trischler, 2019). In other words, when the immediate outputs are the same, the encoder may learn to represent the inputs similarly. This was demonstrated when training end-to-end dialogue models with policy gradient algorithms; we investigate whether it also happens with other training settings, in particular the maximum likelihood setting.

We train a retrieval model and a generative model with a recurrent encoder and decoder. Our experiments suggest that state aliasing also occurs in MLE training in both settings. We go on to propose a solution to state aliasing based on inverse reinforcement learning, which forces the model to learn representations that depend less on the next utterance and more on the expected future dialogue trajectory. We adapt the Structured-Classification Inverse Reinforcement Learning algorithm (SCIRL, Klein et al., 2012) to compute a feature vector representing an expectation of the dialogue future and train models to predict this feature vector. SCIRL is a simple inverse reinforcement learning algorithm with strong theoretical guarantees, and it requires neither knowledge of the environment's transition dynamics nor access to an environment simulator. With this approach, we show significant improvement on a dataset of generated text-based games.

2 Background: State Aliasing in RNNs Trained with Policy Gradient Methods

2.1 Definition

In this section, we define the state aliasing problem and summarize the main results from El Asri and Trischler (2019), which we build upon. This also serves to introduce the experimental setting and several components of the model we use in our experiments.

We consider two states s_i and s_j to be ε-aliased by a model M, for which ||M(s)|| ≤ n ∀ s, when the Euclidean distance between the representations of these states in M (denoted by M(s_i) and M(s_j), respectively) is less than ε, with ε < 2n:

    ||M(s_i) − M(s_j)||_2 ≤ ε.

If ε is zero, the model learns exactly the same representation for both states. El Asri and Trischler (2019) studied state aliasing in neural DRG models trained with policy gradients. They showed that the model depicted in Figure 1 (which will be described below) went through phases in which different states were ε-aliased during training.
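To make the definition concrete, the following sketch (ours, not part of the original work) flags ε-aliased pairs among a set of history-encoder states; the array and threshold names are illustrative.

```python
import numpy as np

def aliased_pairs(hidden_states, epsilon):
    """Return index pairs (i, j) of states whose representations are epsilon-aliased,
    i.e. whose Euclidean distance is at most epsilon."""
    H = np.asarray(hidden_states)            # shape: (num_states, hidden_dim)
    # Pairwise Euclidean distances between all state representations.
    diffs = H[:, None, :] - H[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    pairs = []
    for i in range(len(H)):
        for j in range(i + 1, len(H)):
            if dists[i, j] <= epsilon:
                pairs.append((i, j, float(dists[i, j])))
    return pairs

# Example: the two states preceding the "go west" actions would show up here
# if their representations have collapsed together.
```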
2.2 Experimental setting

This result was demonstrated using a proxy for dialogue response generation: playing simple text-based games constructed in TextWorld (Côté et al., 2018). Text-based games are interactive turn-based simulations that use template-based natural language to describe the game state, to accept actions from the player, and to describe consequent changes in the game environment. Thus, a text-based game can be viewed as a dialogue between the player and the environment. The game used by El Asri and Trischler (2019) to study aliasing is set in a house with several rooms. To succeed, the player must perform the following sequence of actions, called a quest: {go west, take the blue key, go east, unlock the blue chest with the blue key, open the blue chest, take the red key, take the bottle of shampoo, go west, unlock the red chest with the red key, open the red chest, insert the bottle of shampoo into the red chest}. Notice that to complete the game, the agent must go west twice. If state aliasing occurs because of this repetition, then the agent should struggle to learn to perform different actions in the two situations. The next section describes the RNN-based model trained on this game. We refer to the cited paper for more details on the experiments.

2.3 Model

Figure 1: Retrieval-based dialogue response generation model trained on text-based games.

At each turn, the model in Figure 1 takes as input the game-generated text observation of the state and uses its policy to select (i.e., retrieve) a textual action to take to make progress in the game. The observation describes the agent's current location (a room) and the various objects in this room. The sentences in the description are concatenated and then tokenized into words. The words are mapped to embeddings using ELMo (Peters et al., 2018; we use the small ELMo model available at https://allennlp.org/elmo), and the model encodes the sequence of embeddings via LSTM (the LSTM encoder). The model encodes the quest that it must perform using a separate LSTM encoder with distinct parameters. The quest consists of a short text string that describes the objective, in the form of the sequence of actions that completes the game. An example of a quest is given in the appendix. The concatenation of the quest encoding and the observation encoding is passed to a higher-level LSTM called the history encoder. Its hidden state at turn t represents the game history up to that turn and gives the representation for state s_t, i.e., M(s_t). At each turn, the model is provided with a set of candidate actions to select from. In the RL setting, these actions are the valid actions returned by the game engine; in the MLE setting, they are randomly sampled from the list of actions present in the data. Similarly to the observations, the model encodes these candidates via LSTM. It then replicates the history encoding and concatenates it with each action encoding. Passing these concatenated vectors separately through an MLP yields logits for a softmax function, which in turn induces a distribution p(a_t | s_t) over the valid actions.
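The scoring step described above can be sketched as follows. This is a minimal illustration under our own naming and dimension assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CandidateScorer(nn.Module):
    """Score candidate actions by concatenating the history encoding M(s_t)
    with each action encoding and passing the result through an MLP + softmax."""
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state_enc, action_encs):
        # state_enc: (state_dim,); action_encs: (num_candidates, action_dim)
        expanded = state_enc.unsqueeze(0).expand(action_encs.size(0), -1)
        logits = self.mlp(torch.cat([expanded, action_encs], dim=-1)).squeeze(-1)
        return torch.softmax(logits, dim=-1)   # p(a_t | s_t) over the candidates
```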
2.4 Observations

El Asri and Trischler (2019) trained their model with policy gradients to play a TextWorld game requiring the sequence of 11 actions described above. They analyzed the hidden states of the LSTM history encoder and showed that, in certain cases, the hidden representations corresponding to two different game states underwent ε-aliasing during training. The aliased hidden states are precisely those preceding the go west actions. As shown in Figure 2, the Euclidean distance between these two states decreases during training. When the distance reaches as low as 0.002, the agent learns to go east when it should have instead inserted the shampoo bottle into the chest. In other words, when the agent goes west again, its hidden state looks very similar to the one when it went west for the first time, and this propagates to subsequent hidden states such that the agent repeats other former actions that came after going west (e.g., going east). This impedes the agent's training and the agent often never recovers from this aliasing.

Figure 2: Illustration of state aliasing with a model trained with policy gradients. Source: El Asri and Trischler (2019).

The intuition given for this phenomenon is that the output distributions at separate states with the same optimal action look very similar (i.e., close to 1 for the optimal action and 0 for the others). Policy gradients then push the corresponding hidden states together. Experiments suggest that entropy-based regularization helps mitigate this issue. Adding an entropy-based bonus to the loss function forces the output distribution to be less peaked, so even if the optimal action is the same for two different states, the policy distribution might differ enough to represent the states differently. Another helpful modification is to train the RL agent to output not only the policy (as a distribution over actions) but also a baseline function representing the expected sum of rewards at each state. If states share the same optimal actions but not the same expected sum of rewards, then the model is forced to learn different representations to predict the baseline accurately at each state.

It was left as an open question whether the same sort of aliasing occurs with other training objectives such as classification by maximum likelihood. Below, we investigate this question.

3 A Proxy Dataset for Dialogue

For our experiments we use a dataset of text-based games built with TextWorld (Côté et al., 2018). All games share the same environment layout and overarching structure, depicted in Figure 3. We define 10 different quests with lengths between 1 and 11 actions. There is only 1 quest of length 11 and it is similar to the game proposed by El Asri and Trischler (2019). The other quests are either subquests of this trajectory or contain slight variations.

Figure 3: Structure of the text-based games used in experiments.

We generate the training set by building games for all 10 quests with 200 different combinations of (Color1, Color2, Material1). We define 5 different colors (so 20 combinations of Color1 and Color2) and 10 different materials. This process yields 2000 games. The training data consist of the optimal trajectory for each game, including the description of the quest and the observations returned by the game engine for the shortest sequence of actions that completes the quest. We build validation and test sets by the same process, but with distinct colors and shelf materials. We define 5 unseen colors and 5 unseen materials for the validation set and likewise for the test set. This yields 1000 trajectories each for validation and test. We will make the generation code and dataset available upon publication. An example trajectory as well as the list of quests and the lists of colors and materials are given in the appendix.

We designed this dataset as a simplified proxy for short, goal-oriented dialogues. The aim is to require behaviors analogous to those of a dialogue agent. Such an agent must learn to understand, recall, and make use of many distinct entities referenced by the user, and to adapt to the different ways users express themselves; however, there are clear patterns to dialogue types (e.g., planning travel) and sub-parts (e.g., booking a flight, renting a car). Similarly, TextWorld provides a common game structure with many distinct entities and combinations. Just like a dialogue system must remember the entities mentioned by the user during the dialogue, an agent successful on our dataset must remember the objects that it has collected (e.g., it is necessary to first get a key to unlock a chest). Although the TextWorld engine produces templated language, it exhibits a degree of variability. For instance, given the same quest, the engine may emit "Welcome to TextWorld! Here is how to play! First thing I need you to do is to head west...", or "It's time to explore the amazing world of TextWorld! Here is how to play! First stop, try to venture west...", or some other slight variant as the observable description.

We designed the quests so that in the longest ones, the agent needs to take the same action of going west twice. Other actions share similar first tokens, e.g., go west and go east, open Color1 chest and open Color2 chest, etc. This dataset is thus specifically designed to study whether state aliasing occurs in similar conditions during maximum likelihood training, and to explore solutions to this problem. In the next section, we describe the different models trained with maximum likelihood and our experimental observations on state aliasing in this setting.

4 State Aliasing during Maximum Likelihood Training

4.1 Models

4.1.1 Retrieval Models

This model was described in Section 2. Note that the model we train here does not include the value function head in Figure 1 since this is only used for RL training. We optimize this model on the training set D described in Section 3 with the following loss function:

    L_ret = − (1/|D|) Σ_{(s_t, a_t) ∈ D} log p_θ(a_t | s_t),    (1)

where θ are the model parameters and p_θ(a_t | s_t) is the probability assigned to the correct action a_t from among the candidate actions a_j in state s_t.
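A minimal sketch of Equation (1) as it would typically be implemented, assuming the MLP's unnormalized candidate scores are available; the function and argument names are ours.

```python
import torch.nn.functional as F

def retrieval_loss(candidate_logits, target_indices):
    """Equation (1): mean negative log-likelihood of the correct candidate action.
    candidate_logits: (batch, num_candidates) unnormalized scores from the MLP;
    target_indices:   (batch,) index of the ground-truth action among the candidates."""
    return F.cross_entropy(candidate_logits, target_indices)
```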
We also train a version of this model that uses the following attention scheme α, which attends over the history encodings h_τ based on the action encodings:

    α_τj = softmax(h_τ^T a_j),
    h̃_j = Σ_τ α_τj h_τ,

where h_τ is the hidden state at turn τ ≤ t of the LSTM history encoder and a_j is the encoding vector of the jth candidate action at turn t. We concatenate the derived hidden states h̃_j with the respective action encodings, then pass these through the MLP and softmax that induce the policy distribution.
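The attention scheme can be sketched as follows, assuming history states and action encodings share the same dimensionality; this is an illustrative reimplementation, not the original code.

```python
import torch

def attend_history(history_states, action_encs):
    """Attention over history encodings h_tau, driven by each candidate action encoding a_j:
    alpha_{tau j} = softmax_tau(h_tau^T a_j),  h_tilde_j = sum_tau alpha_{tau j} h_tau."""
    # history_states: (T, d); action_encs: (num_candidates, d)
    scores = action_encs @ history_states.t()      # (num_candidates, T)
    alphas = torch.softmax(scores, dim=-1)          # normalize over history turns
    h_tilde = alphas @ history_states               # (num_candidates, d)
    # h_tilde_j is then concatenated with a_j and passed through the MLP + softmax.
    return h_tilde
```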
4.1.2 Generation Model

To investigate response generation rather than retrieval, we replace the MLP+softmax decoder of the previous model, which selects from a given set of commands, with an LSTM decoder that generates each command sequentially, word by word. The LSTM decoder takes as input the previously generated token (after it has been mapped to its corresponding ELMo embedding); its hidden state is initialized with that of the history encoder. To generate an output token, we take the product of the decoder's hidden state at the current time step and an output embedding matrix, then pass the result through a softmax. Our best results came from using the output vocabulary's corresponding ELMo embeddings as the output matrix, inspired by Press and Wolf (2016). As in the retrieval setup, we also train a version of this sequence-to-sequence model that uses attention. In this case, the decoder uses its current hidden representation to attend over the history encoder's hidden states in the usual way (Bahdanau et al., 2014).

Our generation models are trained using teacher forcing to minimize the negative log-likelihood of the utterances observed in D: for each action utterance in the training set, we sum the log-likelihoods under the decoder model for each token in that utterance, where the input to the decoder is the ground-truth previous token for the given time step. The loss function is

    L_gen = − (1/|D|) Σ_{(s_t, a_t) ∈ D} Σ_{w_i ∈ a_t} log p_θ(w_i | w_{i−1}, s_t),    (2)

where θ are the model parameters and p_θ(w_i | w_{i−1}, s_t) is the probability of the ith token in the action utterance a_t taken in state s_t. The previous token w_{i−1} comes from the ground-truth sequence. We construct the model's history encoding, which represents state s_t as M(s_t), by processing the ground-truth trajectory up to the current turn in the game. Recall that this is a sequence of textual observations of the evolving game state; we process it through the two LSTM encoders detailed in Section 2. During inference, we use beam search to generate the next utterance.

Implementation details for the retrieval and generative models are given in the appendix.
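A sketch of the teacher-forcing objective in Equation (2), assuming padded batches of token logits and ground-truth tokens; the padding convention and normalization (mean over tokens rather than a sum per utterance) are our assumptions.

```python
import torch.nn.functional as F

def generation_loss(token_logits, target_tokens, pad_id=0):
    """Equation (2), up to normalization: token-level negative log-likelihood under
    teacher forcing. token_logits: (batch, seq_len, vocab_size), produced with the
    ground-truth previous tokens as decoder inputs; target_tokens: (batch, seq_len)."""
    return F.cross_entropy(
        token_logits.reshape(-1, token_logits.size(-1)),
        target_tokens.reshape(-1),
        ignore_index=pad_id,   # assumption: batches are padded with a pad token
    )
```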
4.2 Results

In Table 1, we report the accuracy of the different models in terms of exact match on the test set. The exact match score increments by 1 if the retrieved or generated action matches the ground-truth action word for word. This is not the best way to evaluate DRG, but it makes sense in the case of TextWorld games since the parser only accepts a limited set of sentences. To relax the evaluation, we also compute action precision on the test set: this is defined similarly to the exact match metric, except that we ignore color and material adjectives. Action precision thus measures only whether commands are correct in terms of verbs and objects.

    Model                    Accuracy   Action Precision
    Retrieval                0.612      0.675
    Retrieval + attention    0.711      0.801
    Generation + attention   0.262      0.478

Table 1: Accuracy and action precision of the different models trained on text-based games.

4.2.1 State Aliasing

To identify state aliasing in models, we inspect the Euclidean distance between hidden states for certain examples from the test set. Let us consider the following quest: {go west, take Color2 key, go east, unlock Color2 chest with the Color2 key, open Color2 chest, take Color1 key, take the bottle of shampoo, go west, unlock Color1 chest with Color1 key, open Color1 chest, insert the bottle of shampoo into Color1 chest}. Several actions are repeated, namely going west, unlocking a chest, and opening a chest. We look at the Euclidean distance between hidden states when the LSTM history encoder processes this trajectory up to the point where it must perform the last action. This game is similar to the one used by El Asri and Trischler (2019) to study state aliasing in the RL setting. They observed that states often became aliased after the agent should go west for the second time.

With the retrieval model, we observe similar behaviour: whenever the action of going east or going west is available to the agent at the end of the game, it consistently chooses this incorrect action over all others, with a probability close to 1. Focusing on these failure cases, we show in Figure 4 that the states s_E and s_W, which correspond to going east and going west for the second time, respectively, are consistently the closest in Euclidean space, with an average distance of 0.00013. The average distance from s_W to all other previous hidden states is 0.060. Thus, for all trajectories where the agent incorrectly goes east or west at the end of the game, s_E is the closest hidden state to s_W by several orders of magnitude. Following states s_E and s_W, the agent must unlock and open a chest. Since the agent must repeat the same sequence of actions, it represents the immediately previous states similarly. This aliasing has downstream effects on the model's trajectory, leading to its failure at the final game state: it elects to go west or to go east rather than placing the bottle. Go actions occur after unlocking and opening the earlier chest, and aliasing locks the model into looping behaviors.

Figure 4: Illustration of state aliasing after max likelihood training. Left: retrieval model without attention, right: retrieval model with attention.

We observe the same phenomenon in the retrieval model with attention: s_W is always closest to s_E in Euclidean distance. On average, this distance is 0.0056; the average distance to other states is 1.066. In addition, we observe the same failure mode: whenever the agent has the choice to go east or go west at the end, it does so.

Interestingly, the generation model's performance suffers differently: the model tends to produce the go west action much more frequently, including when it should produce the {take Color2 key} or the {open Color2 chest} actions. We suspect that the second occurrence is due to state aliasing with the first going west action. As for the first occurrence, this seems to suggest that encoder-decoder RNN models are prone to state aliasing based on temporal distance (the agent goes west and then repeats the action of going west). In general, the behaviors of this more complicated model are harder to interpret, but they show symptoms of state aliasing.

5 Augmenting the Training Signal with Future Feature Vectors

5.1 Motivation

State aliasing is a problem in partially-observable environments: agents with limited sensors might not always be able to differentiate distinct states. This problem has therefore been studied extensively. In our case, states are aliased in the hidden space rather than the sensor space, but we can apply ideas from the partially-observable literature to mitigate the problem. McCallum (1996) proposed to learn a representation of an RL agent's state based on its expected sum of rewards. In the deep reinforcement learning literature, this corresponds to approaches like DQN (Mnih et al., 2013), where the RL agent's representation is based on the estimated Q-values or on the estimated values of states given by models trained to predict the next best action. In this last setting, experiments by El Asri and Trischler (2019) suggest that training an RNN model subject to state aliasing to predict states' values (in addition to the next action) helps to prevent aliasing.

We propose a related approach for the maximum likelihood setting. Dialogue datasets often do not come with ratings, so there is no natural reward function for training a dialogue agent. However, we can adapt algorithms from the inverse reinforcement learning literature to learn a reward function from data, assuming that the data consists of expert trajectories. In particular, the Structured Classification Inverse Reinforcement Learning (SCIRL, Klein et al., 2012) algorithm enables learning a reward function without knowing the true environment model or having access to an environment simulator. We describe this algorithm in the next section.

Algorithm 1: Expert feature expectation estimation

input : A dataset of history-response pairs D = {(s_i, a_i = π_E(s_i)), 1 ≤ i ≤ N}
output: An estimate μ̂^{π_E}(s) of the expert feature expectation μ^{π_E}(s) at each state s

1. Let s_t be the dialogue history at turn t. Compute the feature vector φ(s_t) = (φ_1(s_t), ..., φ_p(s_t)) as φ(s_t) = (1/t) Σ_{t' ≤ t} emb(s_{t'});
2. Compute μ̂^{π_E}(s) through SARSA updates on D such that, for all i, μ̂_i^{π_E}(s) is the value function for the reward φ_i(s) and the expert policy π_E.
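A minimal sketch of Algorithm 1's second step, assuming φ(s) has already been computed for consecutive states along each expert trajectory. The regressor M_S, the SGD settings, and the zero bootstrap at episode ends are our assumptions rather than the authors' exact setup.

```python
import torch

def fit_feature_expectations(M_S, trajectories, gamma=0.9, lr=1e-7, epochs=10):
    """Estimate the expert feature expectation mu^{pi_E}(s) with SARSA-style TD regression.
    M_S maps a state feature vector phi(s) to a p-dimensional prediction; each trajectory is
    a list of tensors [phi(s_0), phi(s_1), ...] along one expert game or dialogue."""
    opt = torch.optim.SGD(M_S.parameters(), lr=lr)
    for _ in range(epochs):
        for phis in trajectories:
            for t in range(len(phis)):
                phi_t = phis[t]
                # Bootstrapped target: phi(s_t) + gamma * M_S(s_{t+1}), zero beyond the end.
                with torch.no_grad():
                    nxt = M_S(phis[t + 1]) if t + 1 < len(phis) else torch.zeros_like(M_S(phi_t))
                    target = phi_t + gamma * nxt
                loss = ((M_S(phi_t) - target) ** 2).sum()
                opt.zero_grad()
                loss.backward()
                opt.step()
    return M_S   # M_S(phi(s)) now approximates mu^{pi_E}(s)
```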

5.2 SCIRL

SCIRL relies on a linear parameterization of the reward function R_θ(s) = θ^T φ(s), where θ are parameters and φ(s) = (φ_1(s), ..., φ_p(s)) is a feature vector of p basis functions φ_i. Let us define a policy π as a mapping from a state-action pair (s, a) to a probability π(s, a) ∈ [0, 1], and a greedy policy π as a mapping from states to actions: π(s) = a. Given this parameterization, the Q function for a policy π evaluated with this reward function is Q_θ^π(s, a) = θ^T μ^π(s, a), where μ^π(s, a) is the feature expectation vector: μ^π(s, a) = E[Σ_{t≥0} γ^t φ(s_t) | S_0 = s, A_0 = a, π].

The SCIRL algorithm proceeds as follows. Suppose we have a dataset D of state-action pairs collected with a greedy expert policy π_E: D = {(s_i, a_i = π_E(s_i)), 1 ≤ i ≤ N}. In our case, states are dialogue histories and actions are dialogue responses. These pairs are collected from human conversations or, in TextWorld, by generating game trajectories. Suppose that we also have an estimate of the expert feature expectation vector μ^{π_E} and access to a multi-class classification (MC for short) algorithm. We use this MC algorithm to learn a decision rule g_{θ_c}(s) = a = argmax_{a'} θ_c^T μ^{π_E}(s, a') ∀ (s, a) ∈ D. In other words, we learn parameters θ_c such that the actions in the dataset are the optimal actions given the score function θ_c^T μ^{π_E}(s, a). The reward function R_{θ_c}(s) = θ_c^T φ(s) is then a reward function under which the greedy expert policy π_E is optimal; i.e., this policy maximizes the expected sum of rewards for all states s.

5.3 Using Expert Feature Expectations to Train Dialogue Models

In the case of dialogue, we showed that states become aliased when the outputs for those states are similar. We could compute a reward as in SCIRL, then use this reward to compute expected sums of rewards at each state, and then predict the expected value using our model. However, we only need an indication of which states should be differentiated. The expert feature expectation vector is sufficient for this purpose. Indeed, this vector captures the sum of features that will be observed after visiting a state. If states lead to different outcomes, then their feature expectations should be different. The combination of the expectation vector and the parameters θ_c gives an extra indication of which actions are optimal, but because we are training with MLE rather than RL, the optimal actions should be learned through the MLE objective. We thus propose only to estimate the expert feature expectation vector and have our model predict this quantity at each state. We define the expert feature expectation for a state s as μ^{π_E}(s) = E[Σ_{t≥0} γ^t φ(s_t) | S_0 = s, π_E]. This approach relates to the predictive state representation literature, where states are built to hold sufficient information to predict the future (Downey et al., 2017).

Klein et al. (2012) observed that, for all i, μ_i^{π_E}(s) corresponds to the value function of π_E evaluated with the reward function φ_i: μ_i^{π_E}(s) = V_{φ_i}^{π_E}(s) = E[Σ_{t≥0} γ^t φ_i(s_t) | S_0 = s, π_E]. We will use this observation to compute an estimate of the expert feature expectation vector. This approach is described in Algorithm 1. The first step consists of computing the feature vector φ, as the SCIRL algorithm assumes φ is given. This vector must be a meaningful representation of the state, which is in our case the dialogue history. We use a very simple representation that averages the embeddings of all past observations: the feature vector φ(s_t) for state s_t is the average sentence embedding of all sentences preceding time t, where the sentence embeddings are the average of each sentence's word embeddings.
This representation of the dialogue history only requires access to pre-trained word embeddings. The second step of the algorithm computes the estimates μ̂^{π_E}(s). For this, we compute the value functions V_{φ_i}^{π_E} for all i. We use the SARSA algorithm (Rummery and Niranjan, 1994) to compute these estimates. We train a model M^S with the following loss function:

    L_{θ_s} = Σ_{(s, s') ∈ D} Σ_i ( φ_i(s) + γ M^S_{θ_s}(s')_i − M^S_{θ_s}(s)_i )².

SARSA updates are equivalent to Q-learning updates, except that they estimate the value of the policy used for data collection instead of estimating the optimal policy. We define our estimator as μ̂^{π_E}(s) = M^S_{θ_s}(s) ∀ s.

We modify our retrieval and generative models to incorporate a head that predicts the output of M^S at each state s. We change the learning objectives of these models by adding the mean squared error loss

    L_sar = (1/|D|) Σ_{s ∈ D} ( M^S(s) − M(s) )²,    (3)

where M indicates either the generative or the retrieval model. We combine the losses as follows: L = λ_MLE L_ret + λ_sar L_sar and L = λ_MLE L_gen + λ_sar L_sar for the retrieval and generative models, respectively. We achieved the best results with λ_MLE = 0.1 and λ_sar = 1000 in both cases. Further implementation details are given in the appendix.
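Putting the pieces together, a sketch of the combined objective (the MLE term plus Equation (3)); the detached target reflects our assumption that M^S is trained separately and then held fixed, and the default weights are the best-performing values reported above.

```python
def augmented_loss(mle_loss, predicted_feat_exp, target_feat_exp,
                   lambda_mle=0.1, lambda_sar=1000.0):
    """Combine the MLE objective with Equation (3): a mean squared error between the
    model's predicted expected dialogue future and the estimate produced by M^S."""
    # target_feat_exp comes from the (frozen) M^S network and is treated as a constant.
    l_sar = ((predicted_feat_exp - target_feat_exp.detach()) ** 2).mean()
    return lambda_mle * mle_loss + lambda_sar * l_sar
```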
6 Results

Results on the test set are given in Table 2. A first observation is that adding expert feature expectations improves performance significantly for both the retrieval and the generative models. In the retrieval case, accuracy improves by more than 20% and action precision improves by 18.6%. Notably, all the instances of state aliasing illustrated in Figure 4 were resolved by this solution: when the augmented model had the possibility of going east at the end of the longest games, for instance, it never chose this option.

    Model                      Accuracy   Action Precision
    Ret                        0.612      0.675
    Ret + attn                 0.711      0.801
    Ret + feat exp             0.915      0.987
    Ret + attn + feat exp      0.733      0.801
    Gen + attn                 0.262      0.478
    Gen + feat exp             0.342      0.967
    Gen + attn + feat exp      0.273      0.368
    Gen + feat exp + copy      0.799      0.828
    Gen + attn + copy          0.465      0.481

Table 2: Accuracy and action precision of the different models trained on text-based games. Ret is the retrieval model and Gen is the generative model. feat exp stands for feature expectations and attn stands for attention.

Adding feature expectations improves the accuracy of the generative model by 8% and its action precision by 49%. The high action precision suggests that the low accuracy is a result of poor generalization to unseen entities. To counter this, we added a copy mechanism similar to that of Manning and Eric (2017). This mechanism enables the model either to generate a word from the vocabulary (of size 53 in our dataset) or to point to a word in the dialogue history. Similarly to Manning and Eric (2017), we only allow the model to copy entities. In a goal-oriented dialogue dataset, entities are given by a knowledge base. In our case, we assume that we have a knowledge base of game entities, i.e., chest colors and shelf materials. We also limit copying to the last dialogue turn, which is sufficient since the TextWorld observations always describe all the elements present in the room where the agent is currently located. Adding this copying mechanism improved accuracy by 46% while maintaining a high action precision. This confirms that the generation model's low accuracy results mostly from poor generalization to unseen entities. Note that combining the attention schemes with the expert feature expectation prediction did not yield the best results in our experiments. We observed that in this case, even though hidden states were de-aliased, attention states caused aliasing and an increased number of failures.

Our results are encouraging and show how state aliasing can damage the performance of a model. In both the generative and retrieval cases, gains of more than 18% in action precision are obtained by addressing state aliasing through estimation of expert feature expectations.

7 Conclusion

This work investigated state aliasing in RNNs in the maximum likelihood setting, building on related work in the RL setting. We highlighted that RNNs are prone to aliasing states that share similar optimal actions. We augmented the maximum likelihood objective with a loss that encourages the model to represent states based on expected dialogue futures. We showed that this augmented objective mitigates state aliasing and significantly improves performance on a dataset of generated text-based games.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. An Actor-Critic Algorithm for Sequence Prediction. In Proceedings of the International Conference on Learning Representations.

Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben A. Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew J. Hausknecht, Layla El Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler. 2018. TextWorld: A learning environment for text-based games. In Proceedings of the Computer Games Workshop at ICML/IJCAI.

Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, and Dhruv Batra. 2017. Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning. In Proceedings of the International Conference on Computer Vision.

Carlton Downey, Ahmed Hefny, Byron Boots, Geoffrey J. Gordon, and Boyue Li. 2017. Predictive state recurrent neural networks. In Advances in Neural Information Processing Systems, pages 6053–6064.

Layla El Asri and Adam Trischler. 2019. A study of state aliasing in RNNs with . In Deep Reinforcement Learning Meets Structured Prediction workshop at ICLR.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. CoRR, abs/1904.09751.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Edouard Klein, Matthieu Geist, Bilal Piot, and Olivier Pietquin. 2012. Inverse reinforcement learning through structured classification. In Advances in Neural Information Processing Systems.

Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016. Deep Reinforcement Learning for Dialogue Generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Christopher D. Manning and Mihail Eric. 2017. A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics.

Andrew Kachites McCallum. 1996. Reinforcement Learning with Selective Perception and Hidden State. Ph.D. thesis.

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. 2013. Playing Atari with deep reinforcement learning. In Proceedings of the NeurIPS workshop on Deep Reinforcement Learning.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the North American Chapter of the Annual Conference of the Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the Annual Conference of the Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the North American Chapter of the Annual Conference of the Association for Computational Linguistics.

Ofir Press and Lior Wolf. 2016. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence Level Training with Recurrent Neural Networks. In Proceedings of the International Conference on Learning Representations.

G. A. Rummery and M. Niranjan. 1994. On-Line Q-Learning Using Connectionist Systems. Technical report, Cambridge University.

Florian Strub, Harm de Vries, Jeremie Mary, Bilal Piot, Aaron Courville, and Olivier Pietquin. 2017. End-to-end optimization of goal-driven and visually grounded dialogue systems. In Proceedings of the International Joint Conference on Artificial Intelligence.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

L. Wu, F. Tian, T. Qin, J. Lai, and T.-Y. Liu. 2018. A study of reinforcement learning for neural machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
A Implementation Details

Retrieval and generative models. The ELMo embedding layer outputs vectors of size 256 for each word. The LSTM encoder, the LSTM history encoder, and the MLP use hidden states of size 128. The LSTM encoder and the LSTM history encoder both have one layer of LSTM units. The MLP that precedes the softmax operator uses tanh activation in its hidden layer. We use Adam (Kingma and Ba, 2014) with a learning rate of 0.002 and we clip the gradient norm to 10. The retrieval model is given 10 candidate actions to select from. These actions are uniformly sampled from the set of all possible actions.

The decoder of the generative model uses hidden states of size 256 and outputs words from a vocabulary of size 53. Beam search is performed with a beam of width 4. We use Adam (Kingma and Ba, 2014) with a learning rate of 0.0001 and we clip the gradient norm to 10 for both encoder and decoder. All our models are trained on batches of size 32.

Expert feature expectation model. We used ELMo embeddings of size 256 to compute the feature vector φ. The model M^S is a multi-layer perceptron with two hidden layers of size 256 and tanh activation. We trained this model on the TextWorld dataset with batches of size 64. We used stochastic gradient descent with a learning rate of 1e-7. We chose model parameters through early stopping.

B Example of trajectory in the TextWorld dataset

Quest: I hope you're ready to go into rooms and interact with objects, because you've just entered TextWorld! Your first objective is to make an effort to travel west. After that, take the red key from the Bathroom. Then, head east. Then, Insert the red key into the red chest's lock to unlock it. And then, ensure that the red chest is open. And then, take the blue key from the red chest within the Bedroom. After that, recover the shampoo from the red chest inside the Bedroom. After that, go west. After that, make sure that the blue chest in the Bathroom is unlocked with the blue key. And then, make it so that the blue chest is open. After that, you can put the shampoo inside the blue chest within the Bathroom. And if you do that, you're the winner!

Trajectory:

-= Bedroom =- You've just walked into a Bedroom. You can make out a red chest. There is an unguarded exit to the west.
go west
-= Bathroom =- You find yourself in a Bathroom. An ordinary kind of place. You can make out a locked usual looking blue chest in the corner. You make out a wooden shelf. The wooden shelf is typical. Unfortunately, there isn't a thing on it. There is an exit to the east. Don't worry, it is unblocked. There is a red key on the floor.
take red key
-= Bathroom =- You find yourself in a Bathroom. An ordinary kind of place. You can make out a locked usual looking blue chest in the corner. You make out a wooden shelf. The wooden shelf is typical. Unfortunately, there isn't a thing on it. There is an exit to the east. Don't worry, it is unblocked.
go east
-= Bedroom =- You've just walked into a Bedroom. You can make out a red chest. There is an unguarded exit to the west.
unlock red chest with red key
-= Bedroom =- You've just walked into a Bedroom. You can make out a red chest. There is an unguarded exit to the west.
open red chest
-= Bedroom =- You've just walked into a Bedroom. You can make out a red chest. The red chest contains a blue key and a shampoo. There is an unguarded exit to the west.
take blue key from red chest
-= Bedroom =- You've just walked into a Bedroom. You can make out a red chest. The red chest contains a shampoo. There is an unguarded exit to the west.
take shampoo from red chest
-= Bedroom =- You've just walked into a Bedroom. You can make out a red chest. The red chest is empty! This is the worst thing that could possibly happen, ever! There is an unguarded exit to the west.
go west
-= Bathroom =- You find yourself in a Bathroom. An ordinary kind of place. You can make out a locked usual looking blue chest in the corner. You make out a wooden shelf. The wooden shelf is typical. Unfortunately, there isn't a thing on it. There is an exit to the east. Don't worry, it is unblocked.
unlock blue chest with blue key
-= Bathroom =- You find yourself in a Bathroom. An ordinary kind of place. You can make out a closed usual looking blue chest in the corner. You make out a wooden shelf. The wooden shelf is typical. Unfortunately, there isn't a thing on it. There is an exit to the east. Don't worry, it is unblocked.
open blue chest
-= Bathroom =- You find yourself in a Bathroom. An ordinary kind of place. You can make out an opened usual looking blue chest in the corner. The blue chest is empty! What a waste of a day! You make out a wooden shelf. The wooden shelf is typical. Unfortunately, there isn't a thing on it. There is an exit to the east. Don't worry, it is unblocked.
insert shampoo into blue chest

C Details of the TextWorld dataset

C.1 List of quests

• go west, take Color2 key, go east, unlock Color2 chest with Color2 key, open Color2 chest, take Color1 key from Color2 chest, take shampoo from Color2 chest, go west, unlock Color1 chest with Color1 key, insert shampoo into Color1 chest

• go west, take Color2 key, go east, unlock Color2 chest with Color2 key

• go west, take Color2 key, go east, unlock Color2 chest with Color2 key, open Color2 chest, take Color1 key from Color2 chest, go west, unlock Color1 chest with Color1 key

• go west, take Color2 key, go east, unlock Color2 chest with Color2 key, open Color2 chest, take shampoo from Color2 chest

• go west, take Color2 key, go east, unlock Color2 chest with Color2 key, open Color2 chest, take shampoo from Color2 chest, go west, put shampoo on Material1 shelf

• go west

• go west, take Color2 key, go east, unlock Color2 chest with Color2 key, open Color2 chest, take Color1 key from Color2 chest

• go west, take Color2 key, go east, unlock Color2 chest with Color2 key, open Color2 chest, take Color1 key from Color2 chest, go west, unlock Color1 chest with Color1 key, insert Color2 key into Color1 chest

• go west, take Color2 key, go east, unlock Color2 chest with Color2 key, open Color2 chest, take Color1 key from Color2 chest, go west, unlock Color1 chest with Color1 key, insert Color1 key into Color1 chest

• go west, take Color2 key, go east, unlock Color2 chest with Color2 key, open Color2 chest, take Color1 key from Color2 chest, go west, put Color1 key on Material1 shelf

C.2 List of entities

Training Data
Colors: blue, red, brown, white, black
Materials: wooden, gold, silver, bronze, copper, marble, iron, platinum, oak, ebony

Validation Data
Colors: purple, pink, orange, cyan, ochre
Materials: plastic, fabric, cardboard, brick, stone

Test Data
Colors: grey, yellow, vermilion, crimson, green
Materials: metal, clay, ceramic, paper, diamond