
MALA: Cross-Domain Dialogue Generation with Action Learning

Xinting Huang,1 Jianzhong Qi,1 Yu Sun,2 Rui Zhang1*
1The University of Melbourne, 2Twitter Inc.
{xintingh@student., jianzhong.qi@, rui.zhang@}unimelb.edu.au, [email protected]

*Rui Zhang is the corresponding author.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Response generation for task-oriented dialogues involves two basic components: dialogue planning and surface realization. These two components, however, have a discrepancy in their objectives, i.e., task completion and language quality. To deal with such discrepancy, conditioned response generation has been introduced, where the generation is factorized into action decision and language generation via explicit action representations. To obtain action representations, recent studies learn latent actions in an unsupervised manner based on the lexical similarity of utterances. Such an action learning approach is prone to diversities of language surfaces, which may impinge task completion and language quality. To address this issue, we propose multi-stage adaptive latent action learning (MALA), which learns semantic latent actions by distinguishing the effects of utterances on dialogue progress. We model the utterance effect using the transition of dialogue states caused by the utterance, and develop a semantic similarity measurement that estimates whether utterances have similar effects. For learning semantic actions on domains without dialogue states, MALA extends the semantic similarity measurement across domains progressively, i.e., from aligning shared actions to learning domain-specific actions. Experiments using multi-domain datasets, SMD and MultiWOZ, show that our proposed model achieves consistent improvements over the baseline models in terms of both task completion and language quality.

Table 1: System Utterance Action Example

System utterances
  Domain: Hotel
    (a) Was there a particular section of town you were looking for?
    (b) Which area would you like the hotel to be located at?
  Domain: Attraction
    (c) Did you have a particular type of attraction you were looking for?
    (d) great, what are you interested in doing or seeing?
System intention (ground truth action)
  Hotel: Request(Area)  |  Attraction: Request(Type)
Latent action (auto-encoding approach)
  (a): [0,0,0,1,0]; (b): [0,1,0,0,0]  |  (c): [0,0,0,1,0]; (d): [0,0,0,0,1]
Semantic latent action (proposed)
  (a) & (b): [0,0,0,1,0]  |  (c) & (d): [0,0,0,0,1]

1 Introduction

Task-oriented dialogue systems complete tasks for users, such as making a restaurant reservation or scheduling a meeting, in a multi-turn conversation (Gao, Galley, and Li 2018; Sun et al. 2016; Sun et al. 2017). Recently, end-to-end approaches based on neural encoder-decoder structures have shown promising results (Wen et al. 2017b; Madotto, Wu, and Fung 2018). However, such approaches directly map plain text dialogue context to responses (i.e., utterances), and do not distinguish two basic components of response generation: dialogue planning and surface realization. Here, dialogue planning means choosing an action (e.g., to request information such as the preferred cuisine from the user, or to provide a restaurant recommendation to the user), and surface realization means transforming the chosen action into natural language responses. Studies show that not distinguishing these two components can be problematic, since they have a discrepancy in objectives, and optimizing decision making on choosing actions might adversely affect the generated language quality (Yarats and Lewis 2018; Zhao, Xie, and Eskenazi 2019).

To address this problem, conditioned response generation that relies on action representations has been introduced (Wen et al. 2015; Chen et al. 2019). Specifically, each system utterance is coupled with an explicit action representation, and responses with the same action representation convey similar meaning and represent the same action. In this way, the response generation is decoupled into two consecutive steps, and each component of conditioned response generation (i.e., dialogue planning or surface realization) can optimize for its own objective without impinging on the other. Obtaining action representations is critical to conditioned response generation. Recent studies adopt variational autoencoders (VAE) to obtain low-dimensional latent variables that represent system utterances in an unsupervised way. Such an auto-encoding approach cannot effectively handle various types of surface realizations, especially when there exist multiple domains (e.g., hotel and attraction). This is because the latent variables learned in this way mainly rely on the lexical similarity among utterances instead of capturing the underlying intentions of those utterances. In Table 1, for example, system utterances (a) and (c) convey different intentions (i.e., request(area) and request(type)), but may have the same auto-encoding based latent action representation since they share similar wording.

To address the above issues, we propose a multi-stage approach to learn semantic latent actions that encode the underlying intention of system utterances instead of their surface realization. The main idea is that system utterances with the same underlying intention (e.g., request(area)) will lead to similar dialogue state transitions. This is because dialogue states summarize the dialogue progress towards task completion, and a dialogue state transition reflects how the intention of a system utterance influences the progress at this turn.
To encode underlying intentions into semantic latent actions, we formulate a loss based on whether the reconstructed utterances from the VAE cause similar state transitions as the input utterances. To distinguish the underlying intention among utterances more effectively, we further develop a regularization based on the similarity of the resulting state transitions between two system utterances.

Learning the semantic latent actions requires annotations of the dialogue states. In many domains, there are simply no such annotations because they require extensive human effort and are expensive to obtain. We tackle this challenge by transferring the knowledge of learned semantic latent actions from domains with rich state annotations (i.e., source domains) to those without state annotations (i.e., target domains). We achieve this knowledge transfer in a progressive way, starting with actions that exist in both the source and target domains, e.g., Request(Price) in both the hotel and attraction domains. We call such actions shared actions, and actions that exist only in the target domain domain-specific actions. We observe that system utterances with shared actions will lead to similar state transitions despite belonging to different domains. Following this observation, we find and align the shared actions across domains. With action-utterance pairs gathered from the above shared-action alignment, we train a network to predict the similarity of resulting dialogue state transitions by taking as input only the texts of system utterances. We then use such similarity predictions as supervision to better learn semantic latent actions for all utterances with domain-specific actions.

Our contributions are summarized as follows:
• We are the first to address the problem of cross-domain conditioned response generation without requiring action annotations.
• We propose a novel latent action learning approach for conditioned response generation which captures the underlying intentions of system utterances beyond surface realization.
• We propose a novel multi-stage technique to extend latent action learning to cross-domain scenarios via shared-action alignment and domain-specific action learning.
• We conduct extensive experiments on two multi-domain human-to-human conversational datasets. The results show that the proposed model outperforms the state-of-the-art in both in-domain and cross-domain response generation settings.

2 Related Work

2.1 Controlled Text Generation

Controlled text generation aims to generate responses with controllable attributes. Many studies focus on controllable attributes of open-domain dialogues, e.g., style (Yang et al. 2018), sentiment (Shen et al. 2017), and specificity (Zhang et al. 2018). Different from the open-domain case, the controllable attributes for task-oriented dialogues are usually system actions, since it is important that system utterances convey clear intentions. Based on handcrafted system actions obtained from a domain ontology, action-utterance pairs are used to learn semantically conditioned language generation models (Wen et al. 2015; Chen et al. 2019). Since it requires extensive effort to build action sets and collect action labels for system utterances, recent years have seen a growing interest in learning utterance representations in an unsupervised way, i.e., latent action learning (Zhao, Lee, and Eskenazi 2018; Zhao, Xie, and Eskenazi 2019). Latent action learning adopts a pretraining phase to represent each utterance as a latent variable using a reconstruction-based variational autoencoder (Yarats and Lewis 2018). The obtained latent variable, however, mostly reflects lexical similarity and lacks sufficient semantics about the intention of system utterances. We utilize the dialogue state information to enhance the semantics of the learned latent actions.

2.2 Domain Adaptation for Task-oriented Dialogues

Domain adaptation aims to adapt a trained model to a new domain with a small amount of new data. This is studied in computer vision (Saito, Ushiku, and Harada 2017), item ranking (Wang et al. 2018a; Huang et al. 2019), and multi-label classification (Wang et al. 2018b; Wang et al. 2019; Sun and Wang 2019). For task-oriented dialogues, early studies focus on domain adaptation for individual components, e.g., intention determination (Chen, Hakkani-Tür, and He 2016), dialogue state tracking (Mrkšić et al. 2015), and dialogue policy (Mo et al. 2018; Yin et al. 2018). Two recent studies investigate end-to-end domain adaptation. DAML (Qian and Yu 2019) adopts model-agnostic meta-learning to learn a seq-to-seq dialogue model in target domains. ZSDG (Zhao and Eskenazi 2018) conducts adaptation based on action matching, and uses partial target domain system utterances as domain descriptions.
These end-to-end domain adaptation methods are either difficult to adopt for conditioned generation or require full annotations of system actions. We aim to address these limitations in this study.

3 Preliminaries

Let $\{d_i \mid 1 \le i \le N\}$ be a set of dialogue data, where each dialogue $d_i$ contains $n_d$ turns: $d_i = \{(c_t, x_t) \mid 1 \le t \le n_d\}$, where $c_t$ and $x_t$ are the context and system utterance at turn $t$, respectively. The context $c_t = \{u_1, x_1, \ldots, u_t\}$ consists of the dialogue history of user utterances $u$ and system utterances $x$. Latent action learning aims to map each system utterance $x$ to a representation $z_d(x)$, where utterances with the same representation express the same action. The form of the representation $z_d(x)$ can be, e.g., one-hot (Wen et al. 2015), multi-way categorical, or continuous (Zhao, Xie, and Eskenazi 2019). We use the one-hot representation due to its simplicity, although the proposed approach can easily extend to other representation forms.
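To make the notation concrete, the following sketch (ours, for illustration only; the field and function names are not from the paper) shows the dialogue structure and a one-hot latent action:

```python
# Illustrative sketch of the notation in Sec. 3; names are ours, not the paper's.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Turn:
    context: List[str]                  # c_t = {u_1, x_1, ..., u_t}: alternating history
    system_utterance: str               # x_t
    state: Optional[List[int]] = None   # b_t in {0,1}^{N_b}; None in unannotated domains

@dataclass
class Dialogue:
    turns: List[Turn]                   # d_i = {(c_t, x_t) | 1 <= t <= n_d}

def one_hot_action(k: int, K: int) -> List[int]:
    """z_d(x): a K-way one-hot latent action with a 1 at index k."""
    z = [0] * K
    z[k] = 1
    return z

# With K = 5, index 3 yields [0, 0, 0, 1, 0], the representation shown in Table 1.
assert one_hot_action(3, 5) == [0, 0, 0, 1, 0]
```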
We obtain the one-hot representation via VQ-VAE, a discrete latent VAE model (van den Oord, Vinyals, and Koray 2017). Specifically, an encoder $p_E$ encodes an utterance as $z_e(x) \in \mathbb{R}^D$, and a decoder $p_G$ reconstructs the original utterance from the input $z_q(x) \in \mathbb{R}^D$, where $D$ is the hidden dimension. The difference lies in that, between $z_e(x)$ and $z_q(x)$, we build a discretization bottleneck using a nearest-neighbor lookup on an embedding table $e \in \mathbb{R}^{K \times D}$, and obtain $z_q(x)$ by finding the embedding vector in $e$ with the closest Euclidean distance to $z_e(x)$, i.e.,

$$z_q(x) = e_k, \quad \text{where } k = \operatorname*{argmin}_{j \in |K|} \lVert z_e(x) - e_j \rVert_2 .$$

The learned latent action $z_d(x)$ is a one-hot vector that has a 1 only at index $k$. All components, including $p_E$, $p_G$, and the embedding table $e$, are jointly trained using the auto-encoding objective

$$\mathcal{L}_{\text{a-e}} = \mathbb{E}_x\big[-\log p_G(x \mid z_q(x)) + \lVert z_e(x) - z_q(x) \rVert_2^2\big] \quad (1)$$

The structure of the VQ-VAE is illustrated in Fig. 1(a), where the three components are marked in grey.
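Below is a minimal PyTorch sketch of the discretization bottleneck and the objective in Eqn. (1). It is our illustration, not the authors' released code: the paper writes a single $\lVert z_e - z_q \rVert^2$ penalty where standard VQ-VAE splits codebook and commitment terms with stop-gradients, so we follow the paper's single-term form and assume the usual straight-through gradient copy on the decoder path.

```python
import torch

def quantize(z_e: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbor lookup: z_q(x) = e_k with k = argmin_j ||z_e(x) - e_j||_2.

    z_e: (B, D) encoder outputs; codebook: (K, D) embedding table e.
    The index k gives the one-hot latent action z_d(x).
    """
    dists = torch.cdist(z_e, codebook)        # (B, K) Euclidean distances
    k = dists.argmin(dim=1)                   # closest code per utterance
    z_q = codebook[k]                         # (B, D) quantized vectors
    z_q_st = z_e + (z_q - z_e).detach()       # straight-through copy, fed to p_G
    return z_q, z_q_st, k

def auto_encoding_loss(recon_nll: torch.Tensor, z_e: torch.Tensor, z_q: torch.Tensor):
    """Eqn (1): E_x[-log p_G(x | z_q(x)) + ||z_e(x) - z_q(x)||_2^2].

    recon_nll: (B,) per-utterance -log p_G(x | z_q(x)); the squared term uses the
    raw z_q so its gradient reaches both the encoder and the codebook.
    """
    return (recon_nll + ((z_e - z_q) ** 2).sum(dim=1)).mean()
```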
4 Proposed Model

4.1 Overview

To achieve better conditioned response generation for task-oriented dialogues, we propose multi-stage adaptive latent action learning (MALA). Our proposed model works in two scenarios: (i) for domains with dialogue state annotations, we utilize these annotations to learn semantic latent actions that enhance conditioned response generation; (ii) for domains without state annotations, we transfer the knowledge of semantic latent actions learned from the domains with rich annotations, and thus can also enhance conditioned response generation for these domains.

The overall framework of MALA is illustrated in Fig. 1. The proposed model is built on a VQ-VAE that contains the encoder $p_E$, the embedding table $e$, and the decoder $p_G$. Besides the auto-encoding objective $\mathcal{L}_{\text{a-e}}$, we design a pointwise loss $\mathcal{L}_{\text{PT}}$ and a pairwise loss $\mathcal{L}_{\text{PR}}$ to enforce the latent actions to reflect the underlying intentions of system utterances. For domains with state annotations (see Fig. 1a), we train $p_B$ and $p_B^{\text{inv}}$ to measure state transitions and develop the pointwise and pairwise losses (Sec. 4.2). For domains without state annotations (see Fig. 1b), we develop a pairwise loss $\mathcal{L}_{\text{PR}}^{S\text{-}T}$ based on $p_B$ and $p_B^{\text{inv}}$ from the annotation-rich domains. This loss measures state transitions for a cross-domain utterance pair, and thus can find and align shared actions across domains (Sec. 4.3). We then train a similarity prediction network $p_{\text{SPN}}$ to substitute for the state tracking models, taking as input only the raw text of utterances. We use the $p_{\text{SPN}}$ predictions as supervision to form the pointwise loss $\mathcal{L}_{\text{PT}}^{T\text{-}T}$ and pairwise loss $\mathcal{L}_{\text{PR}}^{T\text{-}T}$ (see Fig. 1c), and thus obtain semantic latent actions for domains without state annotations (Sec. 4.4).

[Figure 1: Overall Framework of MALA. (a) Stage-I: Semantic Latent Action Learning; (b) Stage-II: Action Alignment across Domains; (c) Stage-III: Domain-Specific Action Learning.]

4.2 Stage-I: Semantic Latent Action Learning

We aim to learn semantic latent actions that align with the underlying intentions of system utterances. To effectively capture the underlying intention, we utilize dialogue state annotations and regard utterances that lead to similar state transitions as having the same intention. We train dialogue state tracking models to measure whether any two utterances will lead to similar state transitions. We apply such measurement in (i) a pointwise manner, i.e., between a system utterance and its reconstructed counterpart from the VAE, and (ii) a pairwise manner, i.e., between two system utterances.

Dialogue State Tracking. Before presenting the proposed pointwise measure, we first briefly introduce the dialogue state tracking task. Dialogue states (also known as dialogue beliefs) are in the form of predefined slot-value pairs. Dialogues with state (i.e., belief) annotations are represented as $d_i = \{(c_t, b_t, x_t) \mid 1 \le t \le n_d\}$, where $b_t \in \{0,1\}^{N_b}$ is the dialogue state at turn $t$, and $N_b$ is the number of all slot-value pairs. Dialogue state tracking (DST) is a multi-label learning process that models the conditional distribution $p(b_t \mid c_t) = p(b_t \mid u_t, x_{t-1}, c_{t-1})$. Using dialogue state annotations, we first train a state tracking model $p_B$ with the following cross-entropy loss:

$$\mathcal{L} = -\sum_{d_i}\sum_{t=1:n_d} \log\big(b_t^\top \cdot p_B(u_t, x_{t-1}, c_{t-1})\big) \quad (2)$$
$$p_B(u_t, x_{t-1}, c_{t-1}) = \mathrm{softmax}\big(h(u_t, x_{t-1}, c_{t-1})\big)$$

where $h(\cdot)$ is a scoring function that can be implemented in various ways, e.g., as a self-attention model (Zhong, Xiong, and Socher 2018) or an encoder-decoder (Wu et al. 2019).

Pointwise Measure. With the trained state tracking model $p_B$, we measure whether the reconstructed utterance can lead to a similar dialogue state transition from turn $t-1$ to $t$ (i.e., in forward order). We formulate this measure as a cross-entropy loss between the original state $b_t$ and the outputs of $p_B$ when the system utterance $x_{t-1}$ in the inputs is replaced with $\tilde{x}_{t-1}$:

$$\mathcal{L}_{\text{fwd}} = \mathbb{E}_x\big[-\log\big(b_t^\top \cdot p_B(u_t, \tilde{x}_{t-1}, c_{t-1})\big)\big], \quad \tilde{x}_{t-1} \sim p_G(z_q(x_{t-1})) \quad (3)$$

where $\tilde{x}_{t-1}$ is sampled from the decoder output. Note that once the state tracking model $p_B$ finishes training, its parameters are no longer updated, and $\mathcal{L}_{\text{fwd}}$ is only used for training the components of the VAE, i.e., the encoder, the decoder, and the embedding table. To get gradients for these components during back-propagation, we apply a continuous approximation trick (Yang et al. 2018). Specifically, instead of feeding sampled utterances as input to the state tracking models, we sample from a Gumbel-softmax distribution (Jang, Gu, and Poole 2016). In this way, the outputs of the decoder $p_G$ become a sequence of probability vectors, and we can use standard back-propagation to train the generator.

We expect the dialogue state transition in forward order to reflect the underlying intentions of system utterances. However, the state tracking model $p_B$ heavily depends on the user utterance $u_t$, meaning that shifts of system utterance intentions may not sufficiently influence the model outputs. This prevents the modeled state transitions from providing valid supervision for semantic latent action learning. To address this issue, inspired by inverse models in reinforcement learning (Pathak et al. 2017), we formulate inverse state tracking to model the dialogue state transition from turn $t$ to $t-1$. Since the dialogue state at turn $t$ already encodes the information of user utterance $u_t$, we formulate the inverse state tracking as $p(b_{t-1} \mid x_{t-1}, b_t)$. In this way, the system utterance plays a more important role in determining the state transition. Specifically, we use state annotations to train an inverse state tracking model $p_B^{\text{inv}}$ with the following cross-entropy loss:

$$\mathcal{L} = -\sum_{d_i}\sum_{t=2:n_d} \log\big(b_{t-1}^\top \cdot p_B^{\text{inv}}(x_{t-1}, b_t)\big) \quad (4)$$
$$p_B^{\text{inv}}(x_{t-1}, b_t) = \mathrm{softmax}\big(g(x_{t-1}, b_t)\big)$$

where the scoring function $g(\cdot)$ can be implemented with the same structure as $h(\cdot)$. The parameters of the inverse state tracking model $p_B^{\text{inv}}$ also remain fixed once training is finished.

We use the inverse state tracking model to measure the similarity of dialogue state transitions caused by a system utterance and its reconstructed counterpart. The formulation is similar to the forward order:

$$\mathcal{L}_{\text{inv}} = \mathbb{E}_x\big[-\log\big(b_{t-1}^\top \cdot p_B^{\text{inv}}(\tilde{x}_{t-1}, b_t)\big)\big], \quad \tilde{x}_{t-1} \sim p_G(z_q(x_{t-1})). \quad (5)$$

Thus, combining the dialogue state transitions modeled in both forward and inverse order, we get the full pointwise loss for learning semantic latent actions:

$$\mathcal{L}_{\text{PT}} = \mathcal{L}_{\text{fwd}} + \mathcal{L}_{\text{inv}} \quad (6)$$

Pairwise Measure. To learn semantic latent actions that can distinguish utterances with different intentions, we further develop a pairwise measure that estimates whether two utterances lead to similar dialogue state transitions. With a slight abuse of notation, we use $x_i$ and $x_j$ to denote two system utterances, and $u_i$, $c_i$, $b_i$ to denote the input user utterance, dialogue context, and dialogue state for the state tracking models $p_B$ and $p_B^{\text{inv}}$, respectively. We formulate a pairwise measurement of state transitions as

$$s_{i,j} = s_{\text{fwd}}(x_i, x_j) + s_{\text{inv}}(x_i, x_j) \quad (7)$$
$$s_{\text{fwd}}(x_i, x_j) = \mathrm{KL}\big(p_B(u_i, x_i, c_i) \,\|\, p_B(u_i, x_j, c_i)\big)$$
$$s_{\text{inv}}(x_i, x_j) = \mathrm{KL}\big(p_B^{\text{inv}}(x_i, b_i) \,\|\, p_B^{\text{inv}}(x_j, b_i)\big)$$

where KL is the Kullback-Leibler divergence. Both $p_B$ and $p_B^{\text{inv}}$ take inputs related to $x_i$. We can understand $s_{i,j}$ as measuring how similar the state tracking results are when $x_j$ replaces $x_i$ as input to $p_B$ and $p_B^{\text{inv}}$.

To encode the pairwise measure into semantic latent action learning, we first organize all system utterances in a pairwise way: $S = \{(x_i, x_j), s_{i,j} \mid 1 \le i, j \le N_u\}$, where $N_u$ is the total number of system utterances in the domains with state annotations. We then develop a pairwise loss to incorporate this measure on top of the VAE learning:

$$\mathcal{L}_{\text{PR}} = -\sum_{S}\big[s^{\text{avg}}_{ij} \log d(x_i, x_j) + (1 - s^{\text{avg}}_{ij}) \log(1 - d(x_i, x_j))\big] \quad (8)$$
$$d(x_i, x_j) = \sigma\big(-z_e(x_i)^\top z_e(x_j)\big)$$

where $\sigma$ is the sigmoid function, $s^{\text{avg}}_{ij}$ is the average of $s_{i,j}$ and $s_{j,i}$, and $z_e(x) \in \mathbb{R}^D$ is the output of the encoder $p_E$. The pairwise loss $\mathcal{L}_{\text{PR}}$ trains $p_E$ by enforcing its outputs for two system utterances to be far apart when the two utterances lead to different state transitions, and vice versa.

The overall objective function of the semantic action learning stage is

$$\mathcal{L}_{\text{S-I}} = \mathcal{L}_{\text{a-e}} + \alpha \mathcal{L}_{\text{PT}} + \beta \mathcal{L}_{\text{PR}} \quad (9)$$

where $\alpha$ and $\beta$ are hyper-parameters.
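The Stage-I training signals can be sketched as follows. This is our illustration under two assumptions not spelled out in the paper: the frozen trackers are modules returning probability vectors over the $N_b$ slot-value pairs, and $s^{\text{avg}}_{ij}$ is normalized to $[0,1]$ before being used as a soft label.

```python
import torch
import torch.nn.functional as F

def soft_sample(decoder_logits: torch.Tensor, tau: float = 1.0):
    """Continuous approximation (Sec. 4.2): Gumbel-softmax turns each decoded
    token into a probability vector, so gradients reach the VAE through the
    frozen trackers. decoder_logits: (B, T, V) raw scores from p_G."""
    return F.gumbel_softmax(decoder_logits, tau=tau, hard=False, dim=-1)

def pointwise_loss(b_t, b_prev, pB_probs, pBinv_probs, eps=1e-8):
    """Eqns (3), (5), (6): cross-entropy of the frozen trackers' predictions
    when the reconstructed (soft) utterance is substituted into their inputs.
    b_t, b_prev: (B, N_b) gold states; pB_probs, pBinv_probs: tracker outputs."""
    l_fwd = -(b_t * pB_probs.clamp_min(eps).log()).sum(-1).mean()
    l_inv = -(b_prev * pBinv_probs.clamp_min(eps).log()).sum(-1).mean()
    return l_fwd + l_inv

def transition_distance(p_i, p_j, eps=1e-8):
    """One KL term of Eqn (7): KL(p(.|x_i) || p(.|x_j)); summing the forward
    and inverse tracker terms gives s_ij."""
    return (p_i * (p_i.clamp_min(eps).log() - p_j.clamp_min(eps).log())).sum(-1)

def pairwise_loss(z_e_i, z_e_j, s_avg):
    """Eqn (8): d(x_i, x_j) = sigmoid(-z_e(x_i)^T z_e(x_j)) is trained to match
    the transition dissimilarity, pushing utterances with different effects
    apart in the encoder space."""
    d = torch.sigmoid(-(z_e_i * z_e_j).sum(-1))
    return F.binary_cross_entropy(d, s_avg.clamp(0.0, 1.0))
```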
We adopt $\mathcal{L}_{\text{S-I}}$ to train the VAE with the discretization bottleneck, and obtain utterance-action pairs (e.g., utterance (c) and its semantic latent action in Table 1) that encode the underlying intention of each system utterance in the domains with state annotations.

4.3 Stage-II: Action Alignment across Domains

To obtain utterance-action pairs in domains having no state annotations, we propose to progressively transfer the knowledge of semantic latent actions from the domains with rich state annotations. At this stage, we first learn semantic latent actions for the utterances whose intentions co-exist across domains (i.e., shared actions).

We use $x^S$ and $x^T$ to denote system utterances in the source and target domain, respectively. The sets of all utterances are denoted by

$$U^S = \{x_i^S \mid 1 \le i \le N_u^S\}; \quad U^T = \{x_j^T \mid 1 \le j \le N_u^T\}$$

where $N_u^S$ and $N_u^T$ are the total numbers of utterances in each domain, respectively. We adopt the proposed pairwise measure to find the target domain system utterances that have shared actions with the source domain. Based on the assumption that, although from different domains, utterances with the same underlying intention are expected to lead to similar state transitions, we formulate the pairwise measure of cross-domain utterance pairs as

$$s^c_{i,j} = s_{\text{fwd}}(x_i^S, x_j^T) + s_{\text{inv}}(x_i^S, x_j^T) \quad (10)$$

where $s_{\text{fwd}}$ and $s_{\text{inv}}$ are computed using the trained $p_B$ and $p_B^{\text{inv}}$. Since it only requires the trained dialogue state tracking models and the state annotations related to $x_i^S$, this pairwise measure is asymmetrical. Taking advantage of the asymmetry, this cross-domain pairwise measure can still work when we only have raw texts of dialogues in the target domain.

We then utilize the cross-domain pairwise measure for action alignment during latent action learning in the target domain. We formulate a loss incorporating action alignment:

$$\mathcal{L}_{\text{PR}}^{S\text{-}T} = -\sum_{x^S, x^T}\big[s^c_{i,j} \log d(x_i^S, x_j^T) + (1 - s^c_{i,j}) \log(1 - d(x_i^S, x_j^T))\big] \quad (11)$$
$$d(x_i^S, x_j^T) = \sigma\big(-z_e(x_i^S)^\top z_e(x_j^T)\big)$$

where $d(x_i^S, x_j^T)$ is computed based on outputs of the same encoder $p_E$ from the VAE at Stage-I. We also use utterances in the target domain to formulate an auto-encoding loss:

$$\mathcal{L}_{\text{a-e}}^{T} = \mathbb{E}_{x \in U^T}\big[\, l_r + \lVert \mathrm{sg}(z_e(x)) - z_q(x) \rVert_2^2 \,\big] \quad (12)$$

where $l_r$ denotes the reconstruction term and $\mathrm{sg}(\cdot)$ the stop-gradient operation. The overall objective for Stage-II is

$$\mathcal{L}_{\text{S-II}} = \mathcal{L}_{\text{a-e}}^{T} + \beta \mathcal{L}_{\text{PR}}^{S\text{-}T} \quad (13)$$

where $\beta$ is the same hyper-parameter as in $\mathcal{L}_{\text{S-I}}$. With the VAE trained using $\mathcal{L}_{\text{S-II}}$, we can obtain utterance-action pairs for system utterances in the domains having no state annotations. However, for utterances with domain-specific intentions, the semantic latent actions are still unclear; this is tackled at Stage-III.
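Under the same assumptions as the Stage-I sketch, Stage-II can be written as below: the source-domain trackers stay frozen, $s^c_{i,j}$ uses only source-side annotations, and $\beta$ reuses the Stage-I hyper-parameter.

```python
import torch
import torch.nn.functional as F

def cross_domain_similarity(pB_src, pB_tgt, pBinv_src, pBinv_tgt, eps=1e-8):
    """Eqn (10): s^c_ij computed by feeding the target utterance x_j^T into the
    frozen trackers in place of the source utterance x_i^S. Only source-side
    state annotations are needed, hence the measure is asymmetrical."""
    s_fwd = (pB_src * (pB_src.clamp_min(eps).log()
                       - pB_tgt.clamp_min(eps).log())).sum(-1)
    s_inv = (pBinv_src * (pBinv_src.clamp_min(eps).log()
                          - pBinv_tgt.clamp_min(eps).log())).sum(-1)
    return s_fwd + s_inv

def stage2_loss(recon_nll, z_e_tgt, z_q_tgt, z_e_src, s_c, beta):
    """Eqn (13): target-domain auto-encoding (Eqn 12, stop-gradient on z_e, so
    the squared term only updates the codebook) plus the cross-domain
    alignment loss (Eqn 11)."""
    l_ae = (recon_nll + ((z_e_tgt.detach() - z_q_tgt) ** 2).sum(-1)).mean()
    d = torch.sigmoid(-(z_e_src * z_e_tgt).sum(-1))
    l_align = F.binary_cross_entropy(d, s_c.clamp(0.0, 1.0))
    return l_ae + beta * l_align
```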
4.4 Stage-III: Domain-specific Action Learning

At this stage, we aim to learn semantic latent actions for utterances with domain-specific actions.

Similarity Prediction Network (SPN). We train an utterance-level prediction model, SPN, to predict whether two utterances lead to similar state transitions by taking as input only the raw texts of the system utterances. Specifically, SPN gives a similarity score in $[0, 1]$ to an utterance pair:

$$p_{\text{SPN}}(x_i, x_j) = \sigma\big(r(x_i, x_j)\big) \quad (14)$$

where $r(\cdot)$ is a scoring function (we implement it with the same structure as $h(\cdot)$). We use binary labels $a_{ij}$, indicating whether two utterances $x_i$ and $x_j$ have the same semantic latent action, to train the SPN. Specifically, $a_{ij} = 1$ if $z_d(x_i) = z_d(x_j)$, and $a_{ij} = 0$ otherwise. To facilitate effective knowledge transfer, we obtain such labels from both the source and target domains. We consider all pairs of source domain utterances and obtain

$$P^S = \{(x_i, x_j), a_{ij} \mid x_i, x_j \in U^S\}.$$

We also consider pairs of target domain utterances with shared actions: we first get all target domain utterances with aligned actions, $U^T_{\text{shared}} = \{x_j^T \mid x_j^T \in U^T, z_d(x_j^T) \in A^S\}$, where $A^S$ represents the set of shared actions, $A^S = \{z_d(x_i^S) \mid x_i^S \in U^S\}$, and then obtain

$$P^T = \{(x_i, x_j), a_{ij} \mid x_i, x_j \in U^T_{\text{shared}}\}.$$

Using all the collected pairwise training instances $p = ((x_i, x_j), a_{ij})$, we train the SPN via the loss

$$\mathcal{L}_{\text{SPN}} = \mathbb{E}_{p \in P^S + P^T}\big[\text{cross-entropy}(a_{ij}, r(x_i, x_j))\big]. \quad (15)$$

We then use the trained $p_{\text{SPN}}$ to replace the state tracking models in both the pointwise and pairwise measures. Specifically, we formulate the following pointwise loss:

$$\mathcal{L}_{\text{PT}}^{T\text{-}T} = \mathbb{E}_{x \in U^T}\big[-\log p_{\text{SPN}}(x^T, \tilde{x}^T)\big], \quad \tilde{x}^T \sim p_G(z_q(x^T)) \quad (16)$$

which enforces the reconstructed utterances to bring similar dialogue state transitions as the original utterances. We further formulate the pairwise loss as

$$\mathcal{L}_{\text{PR}}^{T\text{-}T} = -\sum_{x_i, x_j \in U^T}\big[p_{\text{SPN}}(x_i, x_j) \log d(x_i^T, x_j^T) + (1 - p_{\text{SPN}}(x_i, x_j)) \log(1 - d(x_i^T, x_j^T))\big] \quad (17)$$
$$d(x_i^T, x_j^T) = \sigma\big(-z_e(x_i^T)^\top z_e(x_j^T)\big).$$

Compared to the pairwise losses at Stage-I (Eqn. 8) and Stage-II (Eqn. 11), the main difference is that we use $p_{\text{SPN}}$ to substitute for $s_{i,j}$, which relies on the trained dialogue state tracking models. The overall objective function for Stage-III is

$$\mathcal{L}_{\text{S-III}} = \mathcal{L}_{\text{a-e}}^{T} + \alpha \mathcal{L}_{\text{PT}}^{T\text{-}T} + \beta \mathcal{L}_{\text{PR}}^{T\text{-}T} \quad (18)$$

4.5 Conditioned Response Generation

After obtaining semantic latent actions, we train the two components, dialogue planning and surface realization, for conditioned response generation. Specifically, we first train a surface realization model $p_r$ that learns to translate a semantic latent action into fluent text in context $c$:

$$\mathcal{L} = \mathbb{E}_x\big[-\log p_r(x \mid z_d(x), c)\big].$$

Then we optimize a dialogue planning model $p_l$ while keeping the parameters of $p_r$ fixed:

$$\mathcal{L} = \mathbb{E}_x \mathbb{E}_z\big[-\log p_r(x \mid z, c)\, p_l(z \mid c)\big].$$

In this way, the response generation is factorized into $p(x \mid c) = p(x \mid z, c)\, p(z \mid c)$, where dialogue planning and surface realization are optimized without impinging on each other.
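A sketch of the two-step training in Sec. 4.5 follows. The module interfaces are hypothetical, and we train the planner with the learned action labels $z_d(x)$ as targets, which is one simple way to optimize the factorized objective; the paper does not specify its estimator for the expectation over $z$.

```python
import torch
import torch.nn.functional as F

def train_realizer(p_r, opt, batches):
    """Step 1: surface realization learns p_r(x | z_d(x), c).
    p_r is a placeholder module returning per-example -log p_r(x | z, c)."""
    for x_tokens, z_idx, c in batches:     # z_idx: learned action index k
        nll = p_r(x_tokens, z_idx, c)
        opt.zero_grad()
        nll.mean().backward()
        opt.step()

def train_planner(p_l, p_r, opt, batches):
    """Step 2: dialogue planning p_l(z | c), with p_r frozen so the two
    objectives do not impinge on each other. We supervise the planner with
    the learned action label of each gold response (our simplification)."""
    for p in p_r.parameters():
        p.requires_grad_(False)
    for _, z_idx, c in batches:
        logits = p_l(c)                    # (B, K) scores over latent actions
        loss = F.cross_entropy(logits, z_idx)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

At inference time, the planner picks an action $z$ from $p_l(z \mid c)$ and the frozen realizer verbalizes it, implementing $p(x \mid c) = p(x \mid z, c)\, p(z \mid c)$.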
5 Experiments

To show the effectiveness of MALA, we consider two experiment settings: multi-domain joint training and cross-domain response generation (Sec. 5.1). We compare against the state-of-the-art on two multi-domain datasets in both settings (Sec. 5.2). We analyze the effectiveness of the semantic latent actions and the multi-stage strategy of MALA under different supervision proportions (Sec. 5.3).

5.1 Settings

Datasets. We use two multi-domain human-human conversational datasets: (1) the SMD dataset (Eric and Manning 2017), which contains 2425 dialogues spanning three domains: calendar, weather, and navigation; (2) the MULTIWOZ dataset (Budzianowski et al. 2018), the largest existing task-oriented corpus, spanning seven domains. It contains in total 8438 dialogues, and each dialogue has 13.7 turns on average. We use only five of the seven domains, i.e., restaurant, hotel, attraction, taxi, and train, since the other two domains contain far fewer dialogues in the training set and do not appear in the test set. This setting is also adopted in the study of dialogue state tracking transfer tasks (Wu et al. 2019). Both datasets contain dialogue state annotations.

We use Entity-F1 (Eric and Manning 2017) to evaluate dialogue task completion; it computes the F1 score by comparing entities in delexicalized forms. Compared to the inform and success rates originally used on MULTIWOZ by Budzianowski et al. (2018), Entity-F1 considers informed and requested entities at the same time and balances recall and precision. We use BLEU (Papineni et al. 2002) to measure the language quality of generated responses. We use a three-layer transformer (Vaswani et al. 2017) with a hidden size of 128 and 4 heads as the base model.
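Our reading of Entity-F1, micro-averaged F1 over entity sets extracted from delexicalized responses, can be sketched as below; this is an illustration, not the official evaluation script.

```python
def entity_f1(pred_entities, gold_entities):
    """Micro-averaged F1 over entity sets from delexicalized responses.
    pred_entities / gold_entities: lists of sets, one pair per response."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_entities, gold_entities):
        tp += len(pred & gold)   # entities both informed/requested and expected
        fp += len(pred - gold)   # entities produced but not expected
        fn += len(gold - pred)   # expected entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# e.g., entity_f1([{"pizza_hut", "3pm"}], [{"pizza_hut", "5pm"}]) == 0.5
```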
This setting is also adopted in the study further find that LIDM and LaRL perform worse than Se- of dialogue state tracking transferring tasks (Wu et al. 2019). quicity on SMD. The reason is that system utterances on Both datasets contain dialogue states annotations. SMD have shorter lengths and various expressions, making it challenging to capture underlying intentions merely based We use Entity-F1 (Eric and Manning 2017) to evalu- on surface realization. MALA overcomes this challenge by ate dialogue task completion, which computes the F1 score considering dialogue state transitions beyond surface real- based on comparing entities in delexicalized forms. Com- ization in semantic latent action learning. pared to inform and success rate originally used on MUL- TIWOZ by Budzianowski et al. (2018), Entity-F1 considers Cross-Domain Response Generation The results on informed and requested entities at the same time and bal- SMD and MULTIWOZ are shown on Tables 3 and 4, respec- ances the recall and precision. We use BLEU (Papineni et tively. We can see that MALA significantly outperforms the al. 2002) to measure the language quality of generated re- baselines on both datasets. For example, on MULTIWOZ, sponses. We use a three-layer transformer (Vaswani et al. MALA-S3 outperforms LaRL by 47.5% and 55.7% under 2017) with a hidden size of 128 and 4 heads as base model. Entity-F1 using train and hotel as target domain, respec- tively. We also find that each stage of MALA is essential Multi-domain Joint Training In this setting, we train in the cross-domain generation scenario. For example, on MALA and other baselines with full training set, i.e., using MULTIWOZ using attraction as target domain, stage-III and complete dialogue data and dialogue state annotations. We stage-II brings 14.7% and 15.8% improvements compared use the separation of training, validation and testing data as original SMD and MULTIWOZ dataset. We compare with 1We also consider using DAML (Qian and Yu 2019), but the em- the following baselines that do not consider conditioned gen- pirical results are worse than those of target only and fine tuning. Table 4: Cross-Domain Generation Results on MULTIWOZ Hotel Train Attraction Restaurant Taxi MODEL Entity-F1 BLEU Entity-F1 BLEU Entity-F1 BLEU Entity-F1 BLEU Entity-F1 BLEU Sequicity 16.1 10.7 27.6 16.8 17.4 14.4 19.6 13.9 22.1 15.4 Target Only LaRL 17.8 10.1 30.5 12.9 24.2 11.7 19.9 9.6 28.5 11.7 Sequicity 17.3 12.3 27.0 17.6 17.9 15.8 26.0 14.5 22.4 16.9 Fine Tuning LaRL 21.0 9.1 34.7 12.8 24.8 11.8 22.1 10.8 31.9 12.6 MALA-S1 23.3 15.5 43.5 18.1 31.5 16.2 24.7 16.5 33.6 18.0 Proposed MALA-S2 26.4 15.8 48.3 18.8 36.5 17.6 28.8 16.6 41.7 18.6 MALA-S3 32.7 16.7 51.2 19.4 41.9 18.1 35.0 17.3 44.7 19.0

[Figure 2: Effects of multiple stages on MULTIWOZ. Entity-F1 against the percentage of dialogues available in the target domain, for (a) restaurant as the target domain and (b) taxi as the target domain. Curves compare LaRL with MALA-S1/S2/S3.]

5.3 Discussions

We first study the effects of each stage of MALA in cross-domain dialogue generation. We compare MALA-(S1/S2/S3) with fine-tuned LaRL under different dialogue proportions in the target domain. The results are shown in Fig. 2(a) and 2(b). We can see that the performance gain of MALA is largely attributed to Stage-III when using restaurant as the target domain, while it is attributed to Stage-II when using taxi as the target. This is largely because there are many shared actions between the taxi and train domains, so many utterance-action pairs learned by action alignment at Stage-II already capture the underlying intentions of utterances. On the other hand, since restaurant does not have many shared actions across domains, MALA relies more on the similarity prediction network to provide supervision at Stage-III.

Lastly, we study the effects of semantic latent actions in both the joint training and the cross-domain generation settings. To investigate how the pointwise measure $\mathcal{L}_{\text{PT}}$ and the pairwise measure $\mathcal{L}_{\text{PR}}$ contribute to capturing utterance intentions, we compare the results of MALA without the pointwise loss (MALA\PT) and without the pairwise loss (MALA\PR) under varying sizes of dialogue state annotations. The results of multi-domain joint training under Entity-F1 on SMD are shown in Fig. 3(a). We can see that both the pointwise and the pairwise measures are important. For example, when using 55% of the state annotations, encoding the pointwise and pairwise measures brings 5.9% and 8.0% improvements, respectively. From the cross-domain generation results shown in Fig. 3(b), we find that these two measures are essential to obtaining semantic latent actions in the target domain.

[Figure 3: Effects of semantic action learning on SMD. Entity-F1 against the percentage of available dialogue state annotations in the source domain, for (a) multi-domain joint training and (b) cross-domain generation with navigation as the target domain. Curves compare Sequicity and LaRL baselines with MALA\PR, MALA\PT, and full MALA.]

6 Conclusion

We propose multi-stage adaptive latent action learning (MALA) for better conditioned response generation. We develop a novel dialogue state transition measurement for learning semantic latent actions, and demonstrate how to effectively generalize semantic latent actions to domains having no state annotations. The experimental results confirm that MALA achieves better task completion and language quality compared with the state-of-the-art under both in-domain and cross-domain settings. For future work, we will explore the potential of semantic action learning for applications with zero state annotations.

Acknowledgement

We would like to thank Xiaojie Wang for his help. This work is supported by Australian Research Council (ARC) Discovery Project DP180102050.
References

[Budzianowski et al. 2018] Budzianowski, P.; Wen, T.-H.; Tseng, B.-H.; Casanueva, I.; Ultes, S.; Ramadan, O.; and Gasic, M. 2018. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In EMNLP, 5016-5026.
[Chen et al. 2019] Chen, W.; Chen, J.; Qin, P.; Yan, X.; and Wang, W. Y. 2019. Semantically conditioned dialog response generation via hierarchical disentangled self-attention. In ACL, 3696-3709.
[Chen, Hakkani-Tür, and He 2016] Chen, Y.-N.; Hakkani-Tür, D.; and He, X. 2016. Zero-shot learning of intent embeddings for expansion by convolutional deep structured semantic models. In ICASSP.
[Eric and Manning 2017] Eric, M., and Manning, C. D. 2017. Key-value retrieval networks for task-oriented dialogue. In SIGdial.
[Gao, Galley, and Li 2018] Gao, J.; Galley, M.; and Li, L. 2018. Neural approaches to conversational AI. arXiv preprint arXiv:1809.08267.
[Huang et al. 2019] Huang, X.; Qi, J.; Sun, Y.; Zhang, R.; and Zheng, H.-T. 2019. CARL: Aggregated search with context-aware module embedding learning. In IJCNN, 101-108. IEEE.
[Jang, Gu, and Poole 2016] Jang, E.; Gu, S.; and Poole, B. 2016. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144.
[Lei et al. 2018] Lei, W.; Jin, X.; Kan, M.-Y.; Ren, Z.; He, X.; and Yin, D. 2018. Sequicity: Simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In ACL.
[Madotto, Wu, and Fung 2018] Madotto, A.; Wu, C.-S.; and Fung, P. 2018. Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In ACL, 1468-1478.
[Mo et al. 2018] Mo, K.; Zhang, Y.; Li, S.; Li, J.; and Yang, Q. 2018. Personalizing a dialogue system with transfer reinforcement learning. In AAAI.
[Mrkšić et al. 2015] Mrkšić, N.; Ó Séaghdha, D.; Thomson, B.; Gašić, M.; Su, P.; Vandyke, D.; Wen, T.; and Young, S. 2015. Multi-domain dialog state tracking using recurrent neural networks. In ACL, volume 2, 794-799.
[Papineni et al. 2002] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL, 311-318.
[Pathak et al. 2017] Pathak, D.; Agrawal, P.; Efros, A. A.; and Darrell, T. 2017. Curiosity-driven exploration by self-supervised prediction. In ICML, 2778-2787.
[Qian and Yu 2019] Qian, K., and Yu, Z. 2019. Domain adaptive dialog generation via meta learning. In ACL, 2639-2649.
[Saito, Ushiku, and Harada 2017] Saito, K.; Ushiku, Y.; and Harada, T. 2017. Asymmetric tri-training for unsupervised domain adaptation. In ICML.
[Shen et al. 2017] Shen, T.; Lei, T.; Barzilay, R.; and Jaakkola, T. 2017. Style transfer from non-parallel text by cross-alignment. In NeurIPS, 6830-6841.
[Sun and Wang 2019] Sun, X., and Wang, Q. 2019. An internet of things solution for intelligence security management. In International Conference on Information Systems.
[Sun et al. 2016] Sun, Y.; Yuan, N. J.; Wang, Y.; Xie, X.; McDonald, K.; and Zhang, R. 2016. Contextual intent tracking for personal assistants. In SIGKDD, 273-282. ACM.
[Sun et al. 2017] Sun, Y.; Yuan, N. J.; Xie, X.; McDonald, K.; and Zhang, R. 2017. Collaborative intent prediction with real-time contextual data. TOIS 35(4):30.
[van den Oord, Vinyals, and Koray 2017] van den Oord, A.; Vinyals, O.; and Koray, K. 2017. Neural discrete representation learning. In NeurIPS, 6306-6315.
[Vaswani et al. 2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NeurIPS, 5998-6008.
[Wang et al. 2018a] Wang, X.; Qi, J.; Ramamohanarao, K.; Sun, Y.; Li, B.; and Zhang, R. 2018a. A joint optimization approach for personalized recommendation diversification. In PAKDD.
[Wang et al. 2018b] Wang, X.; Zhang, R.; Sun, Y.; and Qi, J. 2018b. KDGAN: Knowledge distillation with generative adversarial networks. In NeurIPS, 775-786.
[Wang et al. 2019] Wang, X.; Zhang, R.; Sun, Y.; and Qi, J. 2019. Adversarial distillation for learning with privileged provisions. TPAMI.
[Wen et al. 2015] Wen, T.-H.; Gasic, M.; Mrksic, N.; Su, P.-H.; Vandyke, D.; and Young, S. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In EMNLP.
[Wen et al. 2017a] Wen, T.-H.; Miao, Y.; Blunsom, P.; and Young, S. 2017a. Latent intention dialogue models. In ICML, 3732-3741.
[Wen et al. 2017b] Wen, T.-H.; Vandyke, D.; Mrkšić, N.; Gasic, M.; Barahona, L. M. R.; Su, P.-H.; Ultes, S.; and Young, S. 2017b. A network-based end-to-end trainable task-oriented dialogue system. In EACL, 438-449.
[Wu et al. 2019] Wu, C.-S.; Madotto, A.; Hosseini-Asl, E.; Xiong, C.; Socher, R.; and Fung, P. 2019. Transferable multi-domain state generator for task-oriented dialogue systems. arXiv preprint arXiv:1905.08743.
[Yang et al. 2018] Yang, Z.; Hu, Z.; Dyer, C.; Xing, E. P.; and Berg-Kirkpatrick, T. 2018. Unsupervised text style transfer using language models as discriminators. In NeurIPS, 7287-7298.
[Yarats and Lewis 2018] Yarats, D., and Lewis, M. 2018. Hierarchical text generation and planning for strategic dialogue. In ICML, 5587-5595.
[Yin et al. 2018] Yin, C.; Zhang, R.; Qi, J.; Sun, Y.; and Tan, T. 2018. Context-uncertainty-aware chatbot action selection via parameterized auxiliary reinforcement learning. In PAKDD.
[Zhang et al. 2018] Zhang, R.; Guo, J.; Fan, Y.; Lan, Y.; Xu, J.; and Cheng, X. 2018. Learning to control the specificity in neural response generation. In ACL, 1108-1117.
[Zhao and Eskenazi 2018] Zhao, T., and Eskenazi, M. 2018. Zero-shot dialog generation with cross-domain latent actions. In SIGdial, 1-10.
[Zhao, Lee, and Eskenazi 2018] Zhao, T.; Lee, K.; and Eskenazi, M. 2018. Unsupervised discrete sentence representation learning for interpretable neural dialog generation. In ACL, 1098-1107.
[Zhao, Xie, and Eskenazi 2019] Zhao, T.; Xie, K.; and Eskenazi, M. 2019. Rethinking action spaces for reinforcement learning in end-to-end dialog agents with latent variable models. In ACL, 1208-1218.
[Zhong, Xiong, and Socher 2018] Zhong, V.; Xiong, C.; and Socher, R. 2018. Global-locally self-attentive dialogue state tracker. In ACL, 1098-1107.