
Variational Hierarchical Dialog Autoencoder for Dialog State Tracking Data Augmentation

Kang Min Yoo1 Hanbit Lee1 Franck Dernoncourt2 Trung Bui2 Walter Chang2 Sang-goo Lee1 1Seoul National University, Seoul, Korea 2Adobe Research, San Jose, CA, USA {kangminyoo,skcheon,sglee}@europa.snu.ac.kr {dernonco,bui,wachang}@adobe.com

Abstract

Recent works have shown that generative data augmentation, where synthetic samples generated from deep generative models complement the training dataset, benefits NLP tasks. In this work, we extend this approach to the task of dialog state tracking for goal-oriented dialogs. Due to the inherent hierarchical structure of goal-oriented dialogs over utterances and related annotations, the deep generative model must be capable of capturing the coherence among different hierarchies and types of dialog features. We propose the Variational Hierarchical Dialog Autoencoder (VHDA) for modeling the complete aspects of goal-oriented dialogs, including linguistic features and underlying structured annotations, namely speaker information, dialog acts, and goals. The proposed architecture is designed to model each aspect of goal-oriented dialogs using inter-connected latent variables and learns to generate coherent goal-oriented dialogs from the latent spaces. To overcome training issues that arise from training complex variational models, we propose appropriate training strategies. Experiments on various dialog datasets show that our model improves the downstream dialog trackers' robustness via generative data augmentation. We also discover additional benefits of our unified approach to modeling goal-oriented dialogs – dialog response generation and user simulation, where our model outperforms previous strong baselines.

1 Introduction

Data augmentation, a technique that augments the training set with label-preserving synthetic samples, is commonly employed in modern machine learning approaches. It has been used extensively in visual learning pipelines (Shorten and Khoshgoftaar, 2019) but less frequently for NLP tasks due to the lack of well-established techniques in the area. While some notable work exists in text classification (Zhang et al., 2015), spoken language understanding (Yoo et al., 2019), and machine translation (Fadaee et al., 2017), we still lack a full understanding of how to utilize generative models for text augmentation.

Ideally, a data augmentation technique for supervised tasks must synthesize distribution-preserving and sufficiently realistic samples. Current approaches for data augmentation in NLP tasks mostly revolve around thesaurus data augmentation (Zhang et al., 2015), in which words that belong to the same semantic role are substituted with one another using a preconstructed lexicon, and noisy data augmentation (Wei and Zou, 2019), where random editing operations create perturbations in the language space. Thesaurus data augmentation requires a set of handcrafted semantic dictionaries, which are costly to build and maintain, whereas noisy data augmentation does not synthesize sufficiently realistic samples.
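To make the contrast concrete, the following is a minimal sketch of the noisy-augmentation family (in the spirit of EDA): random swap and deletion operations over tokens. It is a generic illustration of the idea, not a method evaluated in this paper, and the function names are ours.

```python
import random
from typing import List

def random_swap(tokens: List[str], n_swaps: int = 1) -> List[str]:
    # Swap two randomly chosen positions n_swaps times.
    out = tokens[:]
    for _ in range(n_swaps):
        if len(out) < 2:
            break
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens: List[str], p: float = 0.1) -> List[str]:
    # Drop each token with probability p, but never return an empty utterance.
    kept = [tok for tok in tokens if random.random() > p]
    return kept or tokens[:1]
```

Such perturbations preserve the label only approximately, which is precisely the realism limitation noted above.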
The recent trend (Hu et al., 2017; Yoo et al., 2019; Shin et al., 2019) gravitates towards generative data augmentation (GDA), a class of techniques that leverage deep generative models such as VAEs to delegate the automatic discovery of novel class-preserving samples to machine learning. In this work, we explore GDA in the context of dialog modeling and contextual understanding.

Goal-oriented dialogs occur between a user and a system that communicate verbally to accomplish the user's goals (Table 6). However, because the user's goals and the system's possible actions are not transparent to each other, both parties must rely on verbal communication to infer and take appropriate actions to resolve the goals. A dialog state tracker is a core component of such systems, enabling them to track the dialog's latest status (Henderson et al., 2014). A dialog state typically consists of inform and request types of slot values.

For example, a user may verbally refer to a previously mentioned food type as the preferred one, e.g., Asian (inform(food=asian)). Given the user utterance and historical turns, the state tracker must infer the user's current goals. As such, we can view dialog state tracking as a sparse sequential multi-class classification problem. Modeling goal-oriented dialogs for GDA requires a novel approach that simultaneously solves state tracking, user simulation (Schatzmann et al., 2007), and utterance generation.

Various deep models exist for modeling dialogs. The Markov approach (Serban et al., 2017) employs a sequence-to-sequence variational autoencoder (VAE) (Kingma and Welling, 2013) structure to predict the next utterance given a deterministic context representation, while the holistic approach (Park et al., 2018) utilizes a set of global latent variables to encode the entire dialog, improving the awareness of general dialog structures. However, current approaches are limited to linguistic features. Recently, Bak and Oh (2019) proposed a hierarchical VAE structure that incorporates the speaker's information, but a universal approach encompassing the fundamental aspects of goal-oriented dialogs has yet to be explored. Such a unified model, capable of disentangling latents into specific dialog aspects, can increase modeling efficiency and enable interesting extensions based on fine-grained controllability.

This paper proposes a novel multi-level hierarchical and recurrent VAE structure called the Variational Hierarchical Dialog Autoencoder (VHDA). Our model enables modeling all aspects (speaker information, goals, dialog acts, utterances, and general dialog flow) of goal-oriented dialogs in a disentangled manner by assigning latents to each aspect. However, complex and autoregressive VAEs are known to suffer from the risk of inference collapse (Cremer et al., 2018), in which the model converges to a local optimum where the generator network neglects the latents, reducing the generation controllability. To mitigate the issue, we devise two simple but effective training strategies.

Our contributions are summarized as follows.

1. We propose a novel deep latent variable model for modeling dialog utterances and their relationships with goal-oriented annotations. We show that the strong level of coherence and accuracy displayed by the model allows it to be used for augmenting dialog state tracking datasets.

2. Leveraging the model's generation capabilities, we show that generative data augmentation is attainable even for complex dialog-related tasks that pertain to both hierarchical and sequential annotations.

3. We propose simple but effective training policies for our VAE-based model, which have applications in other similar VAE structures.

The code for reproducing this paper is available on GitHub¹.

¹ https://github.com/kaniblu/vhda

2 Background and Related Work

Dialog State Tracking. Dialog state tracking (DST) predicts the user's current goals and dialog acts, given the dialog context. Historically, DST models have gradually evolved from hand-crafted finite-state automata and multi-stage models (Dybkjær and Minker, 2008; Thomson and Young, 2010; Wang and Lemon, 2013) to end-to-end models that directly predict dialog states from dialog features (Zilka and Jurcicek, 2015; Mrkšić et al., 2017; Zhong et al., 2018; Nouri and Hosseini-Asl, 2018).

Among the proposed models, the Neural Belief Tracker (NBT) (Mrkšić et al., 2017) decreases reliance on handcrafted semantic dictionaries by reformulating the classification problem. The Global-locally Self-attentive Dialog tracker (GLAD) (Zhong et al., 2018) introduces global modules for sharing parameters across slots and local modules for learning slot-specific feature representations. The Globally-Conditioned Encoder (GCE) (Nouri and Hosseini-Asl, 2018) improves further by forgoing the separation of global and local modules, allowing the unified module to take slot embeddings for distinction. Recently, dialog state trackers based on pre-trained language models have demonstrated strong performance on many DST tasks (Wu et al., 2019; Kim et al., 2019; Hosseini-Asl et al., 2020). While the utilization of large-scale pre-trained language models is not within our scope, we wish to explore it further in light of recent advances in the area.

Conversation Modeling. While previous approaches to hierarchical dialog modeling rely on the Markov assumption (Serban et al., 2017), recent approaches have geared towards utilizing global latent variables for representing the holistic dialog structure (Park et al., 2018; Gu et al., 2018; Bak and Oh, 2019), which helps in preserving long-term dependencies and the overall semantics.

In this work, we employ global latent variables to maximize the effectiveness of preserving dialog semantics for data augmentation.

Data Augmentation. Transformation-based data augmentation is popular in vision learning (Shorten and Khoshgoftaar, 2019) and speech signal processing (Ko et al., 2015), while thesaurus and noisy data augmentation techniques are more common for text (Zhang et al., 2015; Wei and Zou, 2019). Recently, generative data augmentation (GDA), which augments data with samples generated from fine-tuned deep generative models, has gained traction in several NLP tasks (Hu et al., 2017; Hou et al., 2018; Yoo et al., 2019; Shin et al., 2019). GDA can be seen as a form of unsupervised data augmentation, delegating the automatic discovery of novel data to machine learning without injecting external knowledge or data sources. While most works utilize VAEs for the generative model, some works achieved a similar effect without employing variational inference (Kurata et al., 2016; Hou et al., 2018). In contrast to unsupervised data augmentation, another line of work has explored self-supervision mechanisms to fine-tune the generators for specific tasks (Tran et al., 2017; Antoniou et al., 2017; Cubuk et al., 2018). Recent work proposed a reinforcement learning-based noisy data augmentation framework for state tracking (Yin et al., 2019). Our work belongs to the family of unsupervised GDA, which can incorporate self-supervision mechanisms. We wish to explore further in this regard.

Figure 1: Graphical representation of VHDA. Solid and dashed arrows represent generation and recognition, respectively. (The figure shows, for turns t = 1, 2, 3, the global latent z^(c), the context states h_t, and the turn-level latents z^(r)_t, z^(g)_t, z^(s)_t, and z^(u)_t paired with the observed r_t, g_t, s_t, and u_t.)

3 Proposed Model

This section describes VHDA, our latent variable model for generating goal-oriented dialog datasets. We first introduce a set of notations for describing core concepts.

3.1 Notations

A dialog dataset D is a set of N i.i.d. samples {c_1, ..., c_N}, where each c is a sequence of turns (v_1, ..., v_T). Each goal-oriented dialog turn v is a tuple of speaker information r, the speaker's goals g, dialog state s, and the speaker's utterance u: v = (r, g, s, u). Each utterance u is a sequence of words (w_1, ..., w_|u|). Goals g or a dialog state s is defined as a set of the smallest units of dialog act specification a (Henderson et al., 2014), each of which is a tuple of dialog act, slot, and value defined over the spaces T, S, and V: g = {a_1, ..., a_|g|}, s = {a_1, ..., a_|s|}, where a_i ∈ A = (T, S, V). A dialog act specification is written as act(slot=value).
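As a concrete reading of this notation, the following sketch spells out the corresponding data structures; the class and field names are ours and purely illustrative.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# A dialog act specification a = (act, slot, value), written act(slot=value),
# e.g., ("inform", "food", "asian") for inform(food=asian).
DialogAct = Tuple[str, str, str]

@dataclass
class Turn:                        # v = (r, g, s, u)
    speaker: str                   # r
    goal: FrozenSet[DialogAct]     # g, a set of dialog act specifications
    state: FrozenSet[DialogAct]    # s, the turn-level dialog acts
    utterance: List[str]           # u = (w_1, ..., w_|u|)

Dialog = List[Turn]                # c = (v_1, ..., v_T); a dataset D is a list of dialogs
```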
3.2 VHCR

Given a conversation c, the Variational Hierarchical Conversational RNN (VHCR) (Park et al., 2018) models the holistic features of the conversation and the individual utterances u using a hierarchical and recurrent VAE model. The model introduces global-level latent variables z^(c) for encoding the high-level dialog structure and, at each turn t, local-level latent variables z^(u)_t responsible for encoding and generating the utterance at turn t. The local latent variables z^(u)_t conditionally depend on z^(c) and previous observations, forming a hierarchical structure with the global latents. Furthermore, hidden variables h_t, which are conditionally dependent on the global information and the hidden variables from the previous step h_{t-1}, facilitate the latent inference.

3.3 VHDA

We propose the Variational Hierarchical Dialog Autoencoder (VHDA) to generate dialogs and their underlying dialog annotations simultaneously (Figure 1). Like VHCR, we employ a hierarchical VAE structure to capture holistic dialog semantics using the conversation latent variables z^(c). Our model incorporates the full set of dialog features using turn-level latents z^(r) (speaker), z^(g) (goal), z^(s) (dialog state), and z^(u) (utterance), motivated by speech act theory (Searle et al., 1980). Specifically, at a given dialog turn, the speaker, the speaker's goals, the speaker's turn-level dialog acts, and the utterance are determined cumulatively, one after another in that order.

VHDA consists of multiple encoder and decoder modules, each responsible for encoding or generating a particular dialog feature. The encoders share the identical sequence-encoding architecture described as follows.

Sequence Encoder Architecture. Given a sequence of a variable number of elements X = [x_1; ...; x_n]^T ∈ R^{n×d}, where n is the number of elements, the goal of a sequence encoder is to extract a fixed-size representation h ∈ R^d, where d is the dimensionality of the hidden representation. For our implementation, we employ a self-attention mechanism over the hidden outputs of bidirectional LSTM (Hochreiter and Schmidhuber, 1997) cells produced from the input sequence. We also allow the attention mechanism to be optionally queried by Q, enabling the sequence encoding to depend on external conditions, such as using the dialog context to attend over an utterance:

H = [LSTM_fwd(X); LSTM_bwd(X)] ∈ R^{n×d}
a = softmax([H; Q]w + b) ∈ R^n
h = H^T a ∈ R^d.

Here, Q ∈ R^{n×d_q} is a collection of query vectors of size d_q, where each vector corresponds to one element in the sequence; w ∈ R^{d+d_q} and b ∈ R are learnable parameters. We encapsulate the above operations with the following notation:

E : R^{n×d} (× R^{n×d_q}) → R^d.

Our model utilizes the E structure for encoding dialog features of variable lengths.
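A minimal PyTorch-style sketch of this shared sequence encoder is given below, under the assumption of a single-layer BiLSTM and a linear scoring layer; dimensions and names are illustrative rather than taken from the released implementation.

```python
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    """Self-attention pooling over BiLSTM states, optionally queried by Q."""
    def __init__(self, input_dim, hidden_dim, query_dim=0):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(hidden_dim + query_dim, 1)

    def forward(self, x, q=None):
        # x: (batch, n, input_dim); q: (batch, n, query_dim) or None
        h, _ = self.lstm(x)                               # H in R^{n x d}
        scores_in = h if q is None else torch.cat([h, q], dim=-1)
        a = torch.softmax(self.attn(scores_in).squeeze(-1), dim=-1)  # attention weights a
        return torch.einsum("bnd,bn->bd", h, a)           # h = H^T a
```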
Encoder Networks. Based on the E architecture, feature encoders are responsible for encoding dialog features from their respective raw feature spaces to hidden representations. For goals and turn states, the encoding consists of two steps. Initially, the multi-purpose dialog act encoder E^(a) processes each dialog act triple of the goals a^(g) ∈ g and turn states a^(s) ∈ s into a fixed-size representation h^(a) ∈ R^{d^(a)}. The encoder treats the dialog act triples as sequences of tokens. Subsequently, the goal encoder and the turn state encoder process those dialog act representations to produce goal representations and turn state representations, respectively:

h^(g) = E^(g)([E^(a)(a^(g)_1); ...; E^(a)(a^(g)_|g|)])
h^(s) = E^(s)([E^(a)(a^(s)_1); ...; E^(a)(a^(s)_|s|)]).

Note that, as the model is sensitive to the order of the dialog acts, we randomize the order during training to prevent overfitting. The utterances are encoded by the utterance encoder from the word embedding space, h^(u) = E^(u)([w_1; ...; w_|u|]), while the entire conversation is encoded by the conversation encoder from the encoded utterance vectors, h^(c) = E^(c)([h^(u)_1; ...; h^(u)_T]). All sequence encoders mentioned above depend on the global latent variables z^(c) via the query vector. For the speaker information, we use a speaker embedding matrix W^(r) ∈ R^{n^(r)×d^(r)} to encode the speaker vectors h^(r), where n^(r) is the number of participants and d^(r) is the embedding size.

Main Architecture. At the top level, our architecture consists of five E encoders, a context encoder C, and four types of decoders D. The context encoder C differs from the other encoders in that it does not utilize the bidirectional E architecture but a uni-directional LSTM cell. The four decoders D^(r), D^(g), D^(s), and D^(u) generate the respective dialog features. C is responsible for keeping track of the dialog context by encoding all features generated so far. The context vector at turn t (h_t) is updated using the historical information from the previous step:

v_{t-1} = [h^(r)_{t-1}; h^(g)_{t-1}; h^(s)_{t-1}; h^(u)_{t-1}]
h_t = C(h_{t-1}, v_{t-1}),

where v_t represents all features at step t. VHDA uses the context information to successively generate turn-level latent variables using a series of generator networks:

p_θ(z^(r)_t | h_t, z^(c)) = N(μ^(r)_t, σ^(r)_t I)
p_θ(z^(g)_t | h_t, z^(c), z^(r)_t) = N(μ^(g)_t, σ^(g)_t I)
p_θ(z^(s)_t | h_t, z^(c), z^(r)_t, z^(g)_t) = N(μ^(s)_t, σ^(s)_t I)
p_θ(z^(u)_t | h_t, z^(c), z^(r)_t, z^(g)_t, z^(s)_t) = N(μ^(u)_t, σ^(u)_t I),

where all latents are assumed to be Gaussian. In addition, we assume the standard Gaussian for the global latents: p(z^(c)) = N(0, I).
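The cascade of conditional Gaussian priors can be sketched as follows, assuming fully-connected layers for the distribution parameters (as described in the next paragraph) and the reparameterization trick for sampling; the dimensions are placeholders, not the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianLatent(nn.Module):
    """Predicts (mu, sigma) from a conditioning vector and draws a reparameterized sample."""
    def __init__(self, cond_dim, latent_dim):
        super().__init__()
        self.mu = nn.Linear(cond_dim, latent_dim)
        self.sigma = nn.Linear(cond_dim, latent_dim)

    def forward(self, cond):
        mu = self.mu(cond)
        sigma = F.softplus(self.sigma(cond))       # positive scale
        z = mu + sigma * torch.randn_like(sigma)   # z = mu + sigma * eps
        return z, mu, sigma

# Turn-level cascade: each latent conditions on h_t, z^(c), and all previously
# sampled turn-level latents, mirroring the four priors above.
h_dim, zc_dim, z_dim = 1000, 200, 100
priors, cond_dim = nn.ModuleList(), h_dim + zc_dim
for _ in ("speaker", "goal", "state", "utterance"):
    priors.append(GaussianLatent(cond_dim, z_dim))
    cond_dim += z_dim

h_t, z_c = torch.randn(1, h_dim), torch.randn(1, zc_dim)
cond = torch.cat([h_t, z_c], dim=-1)
for prior in priors:
    z, _, _ = prior(cond)
    cond = torch.cat([cond, z], dim=-1)  # condition the next level on this sample
```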

We implement the Gaussian distribution encoders (μ and σ) using fully-connected networks f, and we apply a softplus on the output of the networks to infer the variance of the distributions. Employing the reparameterization trick (Kingma and Welling, 2013) allows standard backpropagation during training of our model.

Approximate Posterior Networks. We use a separate set of parameters φ and encoders to approximate the posterior distributions of the latent variables from the evidence. In particular, the model infers the global latents z^(c) using the conversation encoder E^(c) solely from the linguistic features:

q_φ(z^(c) | h^(u)_1, ..., h^(u)_T) = N(μ^(c), σ^(c) I).

Similarly, the approximate posterior distributions of all turn-level latent variables are estimated from the evidence in cascade, while maintaining the global conditioning:

q_φ(z^(r)_t | h_t, z^(c), h^(r)_t) = N(μ^(r')_t, σ^(r')_t I)
q_φ(z^(g)_t | h_t, z^(c), z^(r)_t, h^(g)_t) = N(μ^(g')_t, σ^(g')_t I)
q_φ(z^(s)_t | h_t, ..., z^(g)_t, h^(s)_t) = N(μ^(s')_t, σ^(s')_t I)
q_φ(z^(u)_t | h_t, ..., z^(s)_t, h^(u)_t) = N(μ^(u')_t, σ^(u')_t I),

where all Gaussian parameters are estimated using fully-connected layers parameterized by φ.

Realization Networks. A series of generator networks successively decodes dialog features from their respective latent spaces to realize the surface forms:

p_θ(r_t | h_t, z^(c), z^(r)_t) = D^(r)_θ(h_t, z^(c), z^(r)_t)
p_θ(g_t | h_t, ..., z^(g)_t) = D^(g)_θ(h_t, ..., z^(g)_t)
p_θ(s_t | h_t, ..., z^(s)_t) = D^(s)_θ(h_t, ..., z^(s)_t)
p_θ(u_t | h_t, ..., z^(u)_t) = D^(u)_θ(h_t, ..., z^(u)_t).

The utterance decoder D^(u) is implemented using an LSTM cell. To alleviate sparseness in goals and turn-level dialog acts, we formulate the classification problem as a set of binary classification problems (Mrkšić et al., 2017): given a candidate dialog act a, the decoder predicts p_θ(a ∈ s_t | ·) independently for each candidate.
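A hedged sketch of this multi-label formulation is shown below: every candidate act-slot-value triple is scored against the conditioning vector and passed through a sigmoid. The bilinear scorer is our illustrative choice, not necessarily the exact head used in the paper.

```python
import torch
import torch.nn as nn

class BinaryStateDecoder(nn.Module):
    def __init__(self, cond_dim, act_dim):
        super().__init__()
        self.score = nn.Bilinear(cond_dim, act_dim, 1)

    def forward(self, cond, candidates):
        # cond: (batch, cond_dim), e.g. built from [h_t; z_t^(s)]; candidates: (num_acts, act_dim)
        b, k = cond.size(0), candidates.size(0)
        c = cond.unsqueeze(1).expand(b, k, -1).reshape(b * k, -1)
        a = candidates.unsqueeze(0).expand(b, k, -1).reshape(b * k, -1)
        logits = self.score(c, a).view(b, k)
        return torch.sigmoid(logits)  # membership probability per candidate, trained with BCE
```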
3.4 Training Objective

Given all the latent variables z in our model, we optimize the evidence lower bound (ELBO) of the goal-oriented dialog samples c:

L_VHDA = E_{q_φ}[log p_θ(c | z)] − D_KL(q_φ(z | c) ‖ p(z)).   (1)

The reconstruction term of Equation 1 can be factorized into the posterior probabilities of the realization networks. Similarly, the KL-divergence term can be factorized and reformulated in terms of the approximate posterior networks and conditional priors based on the graphical structure.

3.5 Minimizing Inference Collapse

Inference collapse is a relatively common phenomenon among autoregressive VAE structures (Zhao et al., 2017). The hierarchical and recurrent nature of our model makes it especially vulnerable. The standard treatments for alleviating the inference collapse problem include (1) annealing the weight of the KL-divergence term during the initial training stage and (2) employing word dropouts on the decoder inputs (Bowman et al., 2016). For our model, we observe that these basic techniques are insufficient (Table 3). While more recent treatments exist (Kim et al., 2018; He et al., 2019), they incur high computational costs that prohibit practical deployment in our case. We therefore introduce two simpler but effective methods to prevent encoder degeneration.

Mutual Information Maximization. The KL-divergence term in the standard VAE ELBO can be decomposed to reveal a mutual information term (Hoffman and Johnson, 2016):

E_{p_d}[D_KL(q_φ(z | x) ‖ p(z))] = D_KL(q_φ(z) ‖ p(z)) + I_{q_φ}(x; z),

where p_d is the empirical distribution of the data. Re-weighting the decomposed terms to adjust the VAE's behavior has been explored previously (Chen et al., 2018; Zhao et al., 2017; Tolstikhin et al., 2018). In this work, we propose simply canceling out the mutual information term by adding its estimate back to the evidence:

3410 the evidence: weights over 250,000 training steps. For data syn- thesis, we employ ancestral sampling to generate L = [log p (c | z)] (2) VHDA Eqφ θ samples from the empirical posterior distribution. (c) − DKL(qφ(z | c)kp(z)) + Iqφ (c; z ). We fixed the ratio of synthetic to original data sam- In our work, the mutual information term is com- ples to 1. puted empirically using the Monte-Carlo estimator Datasets. We conduct experiments on four for each mini-batch. The details are provided in state tracking corpora: WoZ2.0 (Wen et al., the supplementary material. 2017), DSTC2 (Henderson et al., 2014), Multi- Hierarchically-scaled Dropout. Extending WoZ (Budzianowski et al., 2018), and DialEdit word dropouts and utterance dropouts Park et al. (Manuvinakurike et al., 2018). These corpora cover (2018), we apply dropouts discriminatively to all a variety of domains (restaurant booking, hotel dialog features (goals and dialog acts) according reservation, and image editing). Note that, because to the feature hierarchy level. We hypothesize that the MultiWoZ dataset is a multi-domain corpus, employing dropouts could be detrimental to the we extract single-domain dialog samples from the learning of lower-level latent variables, as infor- two most prominent domains (hotel and restaurant, mation dropouts stack multiplicatively along the denoted by MultiWoZ-H and MultiWoZ-R, respec- hierarchy. However, it is also necessary in order to tively). encourage meaningful encoding of latent variables. Dialog State Trackers. We use GLAD and GCE Specifically, we propose a novel dropout scheme as the two competitive baselines for state tracking. that scales exponentially along with the hierarchi- Besides, modifications are applied to these track- cal depth, allowing higher-level information to flow ers to stabilize the performance on random seeds + + towards lower levels easily. For our implementa- (denoted as GLAD and GCE ). Specifically, we tion, we set the dropout ratio between two adjacent enrich the word embeddings with subword informa- levels to 1.5, resulting in the dropout probabilities tion (Bojanowski et al., 2017) and apply dropout of [0.1, 0.15, 0.23, 0.34, 0.51] for speaker informa- on word embeddings (dropout rate of 0.2). Further- tion to utterances. We confirm our hypothesis in more, we also conduct experiments on a simpler ar- § 4.2. chitecture that shares a similar structure with GCE but does not employ self-attention for the sequence 4 Experiments encoders (denoted as RNN). 4.1 Experimental Settings Evaluation Measures. Joint goal accuracy (goal for short) measures the ratio of the number of turns Following the protocol in (Yoo et al., 2019), we whose goals a tracker has correctly identified over generate three independent sets of synthetic dialog the total number of turns. Similarly, request ac- samples, and, for each augmented dataset, we re- curacy, or request, measures the turn-level accu- peatedly train the same dialog state tracker three racy of request-type dialog acts, while inform ac- times with different seeds. We compare the aggre- curacy (inform) measures the turn-level accuracy gated results from all nine trials with the baseline of inform-type dialog acts. Turn-level goals accu- results. Ultimately, we repeat this procedure for mulate from inform-type dialog acts starting from all combinations of state trackers and datasets. 
For the beginning of the dialog until respective dialog non-augmented baselines, we repeat the experi- turns, and thus they can be inferred from historical ments ten times. inform-type dialog acts (Table 6). Implementation Details. The hidden size of dia- log vectors is 1000, and the hidden size of utter- 4.2 Data Augmentation Results ance, dialog act specification, turn state, and turn goal representations is 500. The dimensionality Main Results. We present the data augmentation for latent variables is between 100 and 200. We results in Table 1. The results strongly suggest that use GloVe (Pennington et al., 2014) and character generative data augmentation for dialog state track- (Hashimoto et al., 2017) embeddings as pre-trained ing is a viable strategy for improving existing DST word emebddings (400 dimensions total) for word models without modifying them, as improvements and dialog act tokens. All models used Adam opti- were observed at statistically significant levels re- mizer (Kingma and Ba, 2014) with the initial learn- gardless of the tracker and dataset. ing rate of 1e-3, We annealed the KL-divergence The margin of improvements was more signifi-

              WOZ2.0        DSTC2         MWOZ-R        MWOZ-H        DIALEDIT
GDA    MODEL  GOAL   REQ    GOAL   REQ    GOAL   INF    GOAL   INF    GOAL   REQ
-      RNN    74.5   96.1   69.7   96.0   43.7   69.4   25.7   55.6   35.8   96.6
VHDA   RNN    78.7‡  96.7‡  74.2†  97.0‡  49.6†  73.4†  31.0†  59.7†  36.4†  96.8
-      GLAD+  87.8   96.8   74.5   96.4   58.9   76.3   33.4   58.9   35.9   96.7
VHDA   GLAD+  88.4   96.6   75.5‡  96.8†  61.5†  77.4   37.8‡  61.3‡  37.1†  96.8
-      GCE+   88.7   97.0   74.8   96.3   60.5   76.7   36.5   61.0   36.1   96.6
VHDA   GCE+   89.3‡  97.1   76.0‡  96.7†  63.3   77.2   38.3   63.1†  37.6†  96.8

† p < 0.1   ‡ p < 0.01

Table 1: Results of data augmentation using VHDA for dialog state tracking on various datasets and state trackers. Note that we report inform accuracies for MultiWoZ datasets instead, as request-type prediction is trivial for those.

              WOZ2.0        DSTC2
GOAL DST      GOAL   REQ    GOAL   REQ
W/O    RNN    77.8   96.4   71.2   97.2
W/     RNN    78.7   96.7   74.2   97.0
W/O    GLAD+  86.5   96.9   74.7   97.0
W/     GLAD+  88.4   96.6   75.5   96.8
W/O    GCE+   86.4   96.3   75.5   96.7
W/     GCE+   89.3   97.1   76.0   96.7

Table 2: Comparison of data augmentation results between VHDA with and without explicit goal tracking.

                        WOZ2.0
DROP.   OBJ.   z(c)-KL  GOAL       REQ
0.00    STD.   5.63     84.1±0.9   95.9±0.6
0.00    MIM    5.79     86.0±0.2   96.1±0.2
0.25    STD.   10.44    88.5±1.4   96.9±0.1
0.25    MIM    11.31    88.9±0.4   97.0±0.2
0.50    STD.   14.68    88.6±1.0   96.9±0.2
0.50    MIM    16.33    89.2±0.8   96.9±0.2
HIER.   STD.   14.34    88.2±1.0   97.1±0.2
HIER.   MIM    16.27    89.3±0.4   97.1±0.2

Table 3: Ablation studies on the training techniques using GCE+ as the tracker. The effect of different dropout schemes and training objectives is quantified. MIM refers to mutual information maximization (§ 3.5).

The margin of improvement was more significant for less expressive state trackers (RNN) than for the more expressive ones (GLAD+ and GCE+). Even so, we observed varying degrees of improvement (zero to two percent in joint goal accuracy) even for the more expressive trackers, suggesting that GDA is effective regardless of downstream model expressiveness.

Comparing performances between the dialog act types, we observe larger improvement margins for inform-type dialog acts (and, subsequently, goals). This is because request-type dialog acts generally depend more on the user utterance in the same turn than on the resolution of long-term dependencies, as illustrated in the dialog sample (Table 6). The observation supports our hypothesis that more diverse synthetic dialogs can benefit data augmentation by exploring unseen dialog dynamics.

Note that the goal tracking performances have relatively high variances due to the accumulative effect of tracking dialogs. However, as an additional benefit of employing GDA, we observe that synthetic dialogs help stabilize downstream tracking performances on the DSTC2 and MultiWoZ-R datasets.

Effects of Joint Goal Tracking. Since user goals can be inferred from turn-level inform-type dialog acts, it may seem redundant to incorporate goal modeling into our model. To verify its effectiveness, we train a variant of VHDA in which the model does not explicitly track goals. The results (Table 2) show that VHDA without explicit goal tracking suffers in joint goal accuracy but performs better in turn request accuracy for some instances. We conjecture that explicit goal tracking helps the model reinforce long-term dialog goals; however, it does so at a minor expense of short-term state tracking (as evident from the lower state tracking accuracy).

Effects of Employing Training Techniques. To demonstrate the effectiveness of the two proposed training techniques, we compare (1) the data augmentation results and (2) the KL-divergence between the posterior and prior of the dialog latents z^(c) (Table 3). The results support our hypothesis that the proposed measures reduce the risk of inference collapse. We also confirm that exponentially-scaled dropouts are more or comparably effective at preventing posterior collapse than uniform dropouts, while generating more coherent samples (evident from the higher data augmentation results).

                      WOZ2.0         DSTC2
MODEL                 ROUGE  ENT     ROUGE  ENT
VHCR (a)              0.476  0.193   0.680  0.153
VHDA (b) w/o goal     0.473  0.195   0.743  0.162
VHDA (b)              0.499  0.193   0.781  0.154

(a) (Park et al., 2018)   (b) Ours

Table 4: Results on language quality and diversity evaluation.

                      WOZ2.0         DSTC2
MODEL                 ACC    ENT     ACC    ENT
VHUS (a)              0.322  0.056   0.367  0.024
VHDA (b) w/o GT       0.408  0.079   0.460  0.034
VHDA (b)              0.460  0.080   0.554  0.043

(a) (Gür et al., 2018)   (b) Ours

Table 5: Comparison of user simulation performances.

1  User    "i want to find a cheap restaurant in the north part of town ."
           Goal: inform(area=north), inform(price range=cheap)
           Turn act: inform(area=north), inform(price range=cheap)
2  Wizard  "what food type are you looking for ?"
           Turn act: request(slot=food)
3  User    "any type of restaurant will be fine ."
           Goal: inform(area=north), inform(food=dontcare), inform(price range=cheap)
           Turn act: inform(food=dontcare)
4  Wizard  "the  is a cheap indian restaurant in the north . would you like more information ?"
5  User    "what is the number ?"
           Goal: inform(area=north), inform(food=dontcare), inform(price range=cheap)
           Turn act: request(slot=phone)
6  Wizard  "'s phone number is  . is there anything else i can help you with ?"
7  User    "no thank you . goodbye ."
           Goal: inform(area=north), inform(food=dontcare), inform(price range=cheap)

Table 6: A sample generated from the midpoint between two latent variables in the z^(c) space encoded from two anchor data points.

4.3 Language Evaluation

To understand the effect of jointly learning various dialog features on language generation, we compare our model with a model that only learns linguistic features. Following the evaluation protocol from prior work (Wen et al., 2017; Bak and Oh, 2019), we use the ROUGE-L F1-score (Lin, 2004) to evaluate linguistic quality and utterance-level unigram cross-entropy (Serban et al., 2017) (with respect to the training corpus distribution) to evaluate diversity. Table 4 shows that our model generates better and more diverse utterances than the previous strong baseline on conversation modeling. These results support the idea that joint learning of dialog annotations improves utterance generation, thereby increasing the chance of generating novel samples that improve the downstream trackers.

4.4 User Simulation Evaluation

Simulating human participants has become a crucial feature for training dialog policy models with reinforcement learning and for the automatic evaluation of dialog systems (Asri et al., 2016). Although our model does not specialize in user simulation, our experiments show that it outperforms the previous model (VHUS)² (Gür et al., 2018) in terms of accuracy and creativeness (diversity). We evaluate the user simulation quality using the prediction accuracy on the test sets and the diversity using the entropy³ of predicted dialog act specifications (act-slot-value triples). We present the results in Table 5.

4.5 z^(c)-interpolation

We conduct z^(c)-interpolation experiments to demonstrate that our model can generalize the dataset space and learn to decode plausible samples from unseen regions of the latent space. The generated sample (Table 6) shows that our model can maintain coherence while generalizing key dialog features, such as the user goal and the dialog length. As a specific example, given the two anchors' user goals (food=mediterranean and food=indian, respectively)⁴, the generated midpoint between the two data points is a novel dialog with no specific food type (food=dontcare).

² The previous model employs variational inference for contextualized sequence-to-sequence dialog act prediction.
³ The entropy is calculated with respect to the training set distribution.
⁴ The supplementary material includes the full examples.
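As a reference for the diversity measures used in §4.3 and §4.4, the following is a rough sketch of utterance-level unigram cross-entropy with respect to the training corpus; it reflects our reading of the measure and is not the exact evaluation code.

```python
import math
from collections import Counter
from typing import Dict, List

def unigram_distribution(corpus: List[List[str]]) -> Dict[str, float]:
    counts = Counter(tok for utt in corpus for tok in utt)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def utterance_cross_entropy(utterance: List[str],
                            unigrams: Dict[str, float],
                            floor: float = 1e-8) -> float:
    # Average negative log-probability of the generated tokens under the
    # training-corpus unigram distribution; higher values indicate more diversity.
    nll = [-math.log(unigrams.get(tok, floor)) for tok in utterance]
    return sum(nll) / max(len(nll), 1)
```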

5 Conclusion

We proposed a novel hierarchical and recurrent VAE-based architecture to accurately capture the semantics of fully annotated goal-oriented dialog corpora. To reduce the risk of inference collapse while maximizing the generation quality, we directly modified the training objective and devised a technique to scale dropouts along the hierarchy. Through extensive experiments, we showed that our proposed model, VHDA, achieves significant improvements for various competitive dialog state trackers on diverse corpora. With recent trends in goal-oriented dialog systems gravitating towards end-to-end approaches (Lei et al., 2018), we wish to explore a self-supervised model that discriminatively generates samples that directly benefit the downstream models for the target task. We would also like to explore different implementations in line with recent advances in dialog models, especially those using large-scale pre-trained language models.

Acknowledgement

We thank Hyunsoo Cho for his help with implementations and Jihun Choi for the thoughtful feedback. We also gratefully acknowledge support from Adobe Inc. in the form of a generous gift to Seoul National University.

References

Antreas Antoniou, Amos Storkey, and Harrison Edwards. 2017. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340.

Layla El Asri, Jing He, and Kaheer Suleman. 2016. A sequence-to-sequence model for user simulation in spoken dialogue systems. In INTERSPEECH, pages 1151–1155.

JinYeong Bak and Alice Oh. 2019. Variational hierarchical user-based conversation model. In EMNLP-IJCNLP, pages 1941–1950.

Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. 2018. MINE: mutual information neural estimation. arXiv preprint arXiv:1801.04062.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL, 5:135–146.

Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In SIGNLL, pages 10–21.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In EMNLP, pages 5016–5026.

Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. 2018. Isolating sources of disentanglement in variational autoencoders. In NIPS, pages 2610–2620.

Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731.

Chris Cremer, Xuechen Li, and David Duvenaud. 2018. Inference suboptimality in variational autoencoders. arXiv preprint arXiv:1801.03558.

Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. 2018. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501.

Laila Dybkjær and Wolfgang Minker. 2008. Recent Trends in Discourse and Dialogue, volume 39. Springer Science & Business Media.

Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data augmentation for low-resource neural machine translation. In ACL, pages 567–573.

Xiaodong Gu, Kyunghyun Cho, Jung-Woo Ha, and Sunghun Kim. 2018. DialogWAE: Multimodal response generation with conditional wasserstein auto-encoder. In ICLR.

Izzeddin Gür, Dilek Hakkani-Tür, Gokhan Tur, and Pararth Shah. 2018. User modeling for task oriented dialogues. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 900–906. IEEE.

Kazuma Hashimoto, Yoshimasa Tsuruoka, Richard Socher, et al. 2017. A joint many-task model: Growing a neural network for multiple NLP tasks. In EMNLP, pages 1923–1933.

Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. 2019. Lagging inference networks and posterior collapse in variational autoencoders. arXiv preprint arXiv:1901.05534.

Matthew Henderson, Blaise Thomson, and Jason D Williams. 2014. The second dialog state tracking challenge. In SIGDIAL, pages 263–272.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

3414 Matthew D Hoffman and Matthew J Johnson. 2016. Nikola Mrksiˇ c,´ Diarmuid OS´ eaghdha,´ TsungHsien Elbo surgery: yet another way to carve up the varia- Wen, Blaise Thomson, and Steve Young. 2017. Neu- tional evidence lower bound. In the NIPS Workshop ral belief tracker: Data-driven dialogue state track- in Advances in Approximate Bayesian Inference, vol- ing. In ACL, pages 1777–1788. ume 1. Elnaz Nouri and Ehsan Hosseini-Asl. 2018. Toward Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, scalable neural dialogue state tracking model. arXiv Semih Yavuz, and Richard Socher. 2020. A simple preprint arXiv:1812.00899. language model for task-oriented dialogue. arXiv preprint arXiv:2005.00796. Yookoon Park, Jaemin Cho, and Gunhee Kim. 2018. A hierarchical latent structure for variational conversa- Yutai Hou, Yijia Liu, Wanxiang Che, and Ting Liu. tion modeling. In NAACL:HLT, pages 1792–1801. 2018. Sequence-to-sequence data augmentation for dialogue language understanding. In ICCL, pages Jeffrey Pennington, Richard Socher, and Christopher 1234–1245. Manning. 2014. Glove: Global vectors for word rep- resentation. In EMNLP, pages 1532–1543. Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward con- Ali Razavi, Aaron¨ van den Oord, Ben Poole, and Oriol trolled generation of text. In ICML, pages 1587– Vinyals. 2018. Preventing posterior collapse with 1596. JMLR. org. delta-vaes. In ICLR.

Sungdong Kim, Sohee Yang, Gyuwan Kim, and Sang- Jost Schatzmann, Blaise Thomson, Karl Weilhammer, Woo Lee. 2019. Efficient dialogue state tracking Hui Ye, and Steve Young. 2007. Agenda-based user by selectively overwriting memory. arXiv preprint simulation for bootstrapping a pomdp dialogue sys- arXiv:1911.03906. tem. In HLT: NAACL, pages 149–152.

Yoon Kim, Sam Wiseman, Andrew Miller, David Son- John R Searle, Ferenc Kiefer, Manfred Bierwisch, et al. tag, and Alexander Rush. 2018. Semi-amortized 1980. Speech act theory and pragmatics, volume 10. variational autoencoders. In International Confer- Springer. ence on Machine Learning, pages 2678–2687. Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Diederik P Kingma and Jimmy Ba. 2014. Adam: A Laurent Charlin, Joelle Pineau, Aaron Courville, and method for stochastic optimization. arXiv preprint Yoshua Bengio. 2017. A hierarchical latent variable arXiv:1412.6980. encoder-decoder model for generating dialogues. In AAAI, pages 3295–3301. AAAI Press. Diederik P Kingma and Max Welling. 2013. Auto- encoding variational bayes. arXiv preprint Youhyun Shin, Kang Min Yoo, and Sanggoo Lee. 2019. arXiv:1312.6114. Utterance generation with variational auto-encoder for slot filling in spoken language understanding. Tom Ko, Vijayaditya Peddinti, Daniel Povey, and San- IEEE Signal Processing Letters, 26(3):505–509. jeev Khudanpur. 2015. Audio augmentation for . In ISCA. Connor Shorten and Taghi M Khoshgoftaar. 2019. A survey on image data augmentation for deep learn- Gakuto Kurata, Bing Xiang, Bowen Zhou, and Mo Yu. ing. Journal of Big Data, 6(1):60. 2016. Leveraging sentence-level information with encoder lstm for semantic slot filling. In EMNLP, Blaise Thomson and Steve Young. 2010. Bayesian pages 2077–2083. update of dialogue state: A pomdp framework for spoken dialogue systems. Computer Speech & Lan- Wenqiang Lei, Xisen Jin, MinYen Kan, Zhaochun Ren, guage, 24(4):562–588. Xiangnan He, and Dawei Yin. 2018. Sequicity: Sim- plifying task-oriented dialogue systems with single Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and sequence-to-sequence architectures. In ACL, pages Bernhard Schoelkopf. 2018. Wasserstein auto- 1437–1447. encoders. In International Conference on Learning Representations. Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Toan Tran, Trung Pham, Gustavo Carneiro, Lyle Branches Out, pages 74–81. Palmer, and Ian Reid. 2017. A bayesian data aug- mentation approach for learning deep models. In Ramesh Manuvinakurike, Jacqueline Brixey, Trung NeurIPS, pages 2797–2806. Bui, Walter Chang, Ron Artstein, and Kallirroi Georgila. 2018. Dialedit: Annotations for spoken Zhuoran Wang and Oliver Lemon. 2013. A simple conversational image editing. In Proceedings 14th and generic belief tracking mechanism for the dia- Joint ACL-ISO Workshop on Interoperable Semantic log state tracking challenge: On the believability of Annotation, pages 1–9. observed information. In SIGDIAL, pages 423–432.

3415 Jason W Wei and Kai Zou. 2019. Eda: Easy data augmentation techniques for boosting perfor- mance on text classification tasks. arXiv preprint arXiv:1901.11196. Tsung-Hsien Wen, David Vandyke, Nikola Mrksiˇ c,´ Milica Gasic, Lina M Rojas Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network- based end-to-end trainable task-oriented dialogue system. In EACL, pages 438–449. Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini- Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable multi-domain state gen- erator for task-oriented dialogue systems. arXiv preprint arXiv:1905.08743. Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. 2019. Dialog state tracking with reinforced data augmentation. arXiv preprint arXiv:1908.07795. Kang Min Yoo, Youhyun Shin, and Sanggoo Lee. 2019. Data augmentation for spoken language understand- ing via joint variational generation. In AAAI, vol- ume 33, pages 7402–7409. Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text clas- sification. In NeurIPS, pages 649–657. Shengjia Zhao, Jiaming Song, and Stefano Ermon. 2017. Infovae: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262. Victor Zhong, Caiming Xiong, and Richard Socher. 2018. Global-locally self-attentive encoder for di- alogue state tracking. In ACL, pages 1458–1467. Lukas Zilka and Filip Jurcicek. 2015. Incremental lstm- based dialog state tracker. In 2015 IEEE Workshop on ASRU, pages 757–762. IEEE.

Appendix A  Mutual Information Maximization for Mitigating Inference Collapse

During the training of VAEs, inference collapse occurs when the model converges to a local optimum where the approximate posterior q_φ(z | x) collapses to the prior p(z), indicating the vanishment of the encoder network due to the decoder's negligence of the encoder signals. Quantifying, diagnosing, and devising mitigation techniques for the inference collapse phenomenon have been studied extensively in the past (Chen et al., 2016; Zhao et al., 2017; Cremer et al., 2018; Razavi et al., 2018; He et al., 2019). However, current approaches for mitigating inference collapse either require significant modifications to the existing VAE framework (He et al., 2019; Kim et al., 2018) or are limited to specific architectural designs (Razavi et al., 2018). These approaches do not work well on our model due to the complexity of our VAE structure. Instead, we employ a relatively simple technique that directly modifies the VAE objective. By doing so, we avoid significant changes to the main VAE framework while achieving satisfactory results on inference collapse mitigation. Though not covered in this paper, our method has applications in other VAE structures. In this appendix, we delve more in-depth into the intuitions and detailed implementation of our approach.

Motivation. As first noted by Hoffman and Johnson (2016) (and subsequently utilized by Zhao et al. (2017) and Chen et al. (2018)), the KL-divergence term of the ELBO objective can be decomposed into two terms: (1) the KL-divergence between the aggregate posterior and the prior and (2) the mutual information between the latent variables and the data:

E_{p_d}[D_KL(q_φ(z | x) ‖ p(z))] = D_KL(q_φ(z) ‖ p(z)) + I_{q_φ}(x; z),   (3)

where p_d is the empirical distribution of the data and the aggregate posterior q_φ(z) is obtained by marginalizing the approximate posterior over the empirical distribution:

q_φ(z) = E_{x∼p_d}[q_φ(z | x)].   (4)

Using the definition of inference collapse, we can deduce that the KL-divergence term D_KL(q_φ(z | x) ‖ p(z)) is zero during inference collapse. This fact implies that both decomposed terms in Equation 3 must be zero, since both terms are non-negative.

Our preliminary studies show an interesting pattern in the KL-divergence term and its decomposed terms during basic training (training without inference-collapse treatments) (Figure 2). We observe that the KL-divergence of the aggregate posterior term vanishes earlier than the mutual information does. We also observe that the mutual information term, which represents the encoder effectiveness, vanishes eventually. This collapse happens after the KL-divergence can no longer be minimized without sacrificing the encoder's expressiveness. Note that optimizing the ELBO minimizes its KL-divergence term and the underlying decomposed terms, one of which is directly related to the encoder's health. Although the reconstruction term in the ELBO encourages maximization of the mutual information, the autoregressive property of the decoder and the complexity of the reconstruction loss "dilute" the goal of maximizing mutual information. Hence, to minimize inference collapse, we propose a modified VAE objective that explicitly maximizes the mutual information between the latents and the data by "canceling" out the mutual information term in the KL-divergence⁵:

L_VHDA = E_{p_d}[E_{q_φ}[log p_θ(c | z)]] − E_{p_d}[D_KL(q_φ(z | c) ‖ p(z))] + I_{q_φ}(c; z).   (5)

Note that some notations (the expectation over the empirical distribution) have been omitted in the main paper for clarity.

Relation to Prior Work. Our approach is related to previous work on manipulating the VAE objective for customizing the VAE behavior (Zhao et al., 2017; Chen et al., 2018). It can also be thought of as a special case of Wasserstein Autoencoders (Tolstikhin et al., 2018). Although not all related works were originally proposed to directly combat inference collapse, our approach can be considered a special case of InfoVAE (Zhao et al., 2017) and β-TCVAE (Chen et al., 2018).

⁵ On a side note, we did not observe any "lag" in the inference network, as described by He et al. (2019). This observation is evident from the sustained mutual information level throughout the training session (Figure 2). Hence we did not employ the recently proposed method.

Figure 2: Failed training behavior. The plot tracks, over training steps, the mutual information I_{q_φ}, the full KL term E_{p_d}[D_KL(q_φ(z|x) ‖ p(z))], and the aggregate-posterior term D_KL(q_φ(z) ‖ p(z)).

Specifically, Zhao et al. (2017) proposed a modified VAE objective as follows:

L_InfoVAE = E_{p_d}[E_{q_φ}[log p_θ(x | z)]] − (1 − α) E_{p_d}[D_KL(q_φ(z | x) ‖ p(z))] − (α + λ − 1) D_KL(q_φ(z) ‖ p(z)).   (6)

Rearranging the equation, we can express the same objective in terms of the mutual information:

L_InfoVAE = E_{p_d}[E_{q_φ}[log p_θ(x | z)]] − λ D_KL(q_φ(z) ‖ p(z)) − (1 − α) I_{q_φ}(x; z).   (7)

Hence, our method is the special case of InfoVAE where α = 1 and λ = 1. Meanwhile, Chen et al. (2018) proposed an extended modification to β-VAE (Higgins et al., 2017) that further decomposes the KL-divergence of the aggregate posterior in terms of latent correlation:

L = E_{p_d}[E_{q_φ}[log p_θ(x | z)]] − α I_{q_φ}(x; z) − β D_KL(q_φ(z) ‖ Π_i q_φ(z_i)) − γ Σ_i D_KL(q_φ(z_i) ‖ p(z_i)).   (8)

In the equation above, our approach corresponds to the case where α = 0 and β = γ = 1.

Mutual Information Estimation. We can estimate the mutual information between the latents and the data under the empirical distribution of x using Monte Carlo sampling. However, this estimation method is known to be biased (Belghazi et al., 2018). Despite recent advances in MI estimation techniques, we find that our unparameterized method is sufficient for achieving inference collapse mitigation and probing. The estimator is shown in Equation 9, where x is sampled from the empirical distribution of the dataset and N, M, and L are hyperparameters. In practice, the estimation is performed over the data samples in a mini-batch for computational efficiency. Given a mini-batch of size N, we further approximate the estimation by sampling the latent variables z once for each data point (M = 1) (Equation 10). We visualize the variance of our mutual information estimation method in Figure 3.

Appendix B  Architectural Diagram

We include a more detailed architectural diagram (Figure 4) depicting the latent variables and the model inference, which we could not illustrate in Figure 1 due to space constraints. Note that the orange crosses denote decoder dropouts. The figure also illustrates the hierarchically-scaled dropout scheme, motivated by the need to minimize information loss while discouraging the decoders from relying on training signals, which leads to exposure bias.

I_{q_φ}(x, z) = E_{p_d}[D_KL(q_φ(z | x) ‖ q_φ(z))]
            ≈ (1 / NM) Σ_i^N Σ_j^M ( log q_φ(z_j | x_i) − log Σ_k^L q_φ(z_j | x_k) + log L )   (9)

I_{q_φ}(x, z) ≈ (1 / N) Σ_i^N [ log q_φ(z | x_i) − log Σ_j^N q_φ(z | x_j) + log N ]_{z ∼ q_φ(z | x_i)}   (10)
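A compact sketch of the single-sample estimator in Equation 10 is given below, assuming diagonal-Gaussian posteriors parameterized by (μ, σ) for a mini-batch; the helper names are ours.

```python
import torch

def gaussian_log_prob(z, mu, sigma):
    # log N(z; mu, diag(sigma^2)), summed over latent dimensions
    var = sigma ** 2
    return (-0.5 * ((z - mu) ** 2 / var + torch.log(2 * torch.pi * var))).sum(-1)

def mutual_information_mc(mu, sigma):
    # mu, sigma: (N, d) posterior parameters for the N data points in a mini-batch
    n = mu.size(0)
    z = mu + sigma * torch.randn_like(sigma)            # one sample per data point (M = 1)
    log_qz_x = gaussian_log_prob(z, mu, sigma)          # log q(z_i | x_i)
    # log q(z_i | x_j) for all pairs; a log-mean-exp over j approximates log q(z_i)
    pair = gaussian_log_prob(z.unsqueeze(1), mu.unsqueeze(0), sigma.unsqueeze(0))
    log_qz = torch.logsumexp(pair, dim=1) - torch.log(torch.tensor(float(n)))
    return (log_qz_x - log_qz).mean()
```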

Figure 3: Estimation of the mutual information over the course of training (x-axis in thousands of steps). MI-2 corresponds to our approach. MI-1 is derived from the Monte Carlo estimation of D_KL(q_φ(z) ‖ p(z)) (not described). Our approach results in less variance in the MI estimation.

Figure 4: The architectural diagram.

3419 0.5 0.3 0.4 0.1 0.4 0.4 ± ± ± ± ± ± EQ R DIT 96.8 96.8 96.8 E † † † IAL 3.1 96.6 1.0 96.7 1.3 96.6 1.4 1.1 2.1 ± ± ± ± ± ± OAL G 36.4 37.1 37.6 † ‡ † 2.3 35.8 1.5 35.9 1.2 36.1 3.1 1.0 1.4 ± ± ± ± ± ± NF I 59.7 61.3 63.1 Z-HD O † ‡ 4.1 55.6 2.4 58.9 2.4 61.0 5.0 2.2 4.1 ± ± ± ± ± ± OAL G 31.0 37.8 38.3 † 1.8 2.0 3.3 5.7 25.7 1.4 33.4 1.2 36.5 ± ± ± ± ± ± NF I 73.4 77.4 77.2 Z-RMW O † † 8.7 69.4 2.5 76.3 3.4 76.7 3.1 2.4 3.9 ± ± ± ± ± ± OAL G 49.6 61.5 63.3 ‡ † † 0.4 43.7 0.2 58.9 0.2 60.5 0.2 0.5 0.4 ± ± ± ± ± ± EQ R 97.0 96.8 96.7 † ‡ ‡ 7.2 96.0 0.5 96.4 0.6 96.3 0.9 0.5 0.2 ± ± ± ± ± ± OAL G 74.2 75.5 76.0 ‡ 0.3 69.7 0.3 74.5 0.2 74.8 0.1 0.2 0.2 ± ± ± ± ± ± EQ R 96.8 96.7 97.1 Z2.0 DSTC2 MW ‡ ‡ O Table 7: The full results of data augmentation, including the standard deviations of 9 repeated runs. W 0.8 96.1 0.8 0.7 97.0 2.1 0.3 96.6 0.4 ± ± ± ± ± ± OAL G 87.8 88.3 78.7 88.4 89.3 + + 01 . 0 + + p < ODEL ‡ 1 . 0 p < GDAM --GLAD RNN-GCE 74.5 VHDARNN VHDAGLAD VHDAGCE † Appendix C Full Results of Data Augmentation the downstream models. The full data augmentation results are shown below, including the statistics. Note that generative data augmentation also has the effect of reducing the variance of

Appendix D  Exhibits of Synthetic Samples

This section describes the method we use to sample synthetic data points from our model's posterior and presents some synthetic samples generated with the described technique. We use ancestral sampling (He et al., 2019), or the posterior sampling technique (Yoo et al., 2019), to sample data points from the empirical distribution of the latent space. Specifically, we choose an anchor data point from the dialog dataset, c ∼ p_d(c), where p_d is the empirical distribution of goal-oriented dialogs. Then, we sample a set of latent variables z^(c) from the encoded distribution of c: z^(c) ∼ q_φ(z^(c) | c). Next, we decode a sample c' that maximizes the log-likelihood given the sampled conversational latents:

c' = argmax_c p_θ(c | z^(c)).

We use these samples to augment the original dataset, and we fix the ratio of the synthetic dataset to the original dataset to 1. In our experiments, we observe that the synthetic samples generated via ancestral sampling are mostly coherent and, most importantly, novel, i.e., each synthetic data point differs from its original anchor point (e.g., variations in utterances, dialog-level semantics, or sometimes annotation errors). In the following tables, we showcase a few dialog samples from our augmentation datasets. The tables present the generated samples along with their reference dialog samples.
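A high-level sketch of this augmentation loop is shown below; model.posterior and model.decode are assumed interfaces standing in for the actual VHDA implementation, and greedy decoding stands in for the likelihood-maximizing step.

```python
import random
import torch

def synthesize(model, dataset, ratio=1.0):
    synthetic, n_samples = [], int(len(dataset) * ratio)   # synthetic-to-original ratio of 1
    for _ in range(n_samples):
        anchor = random.choice(dataset)                     # c ~ p_d(c)
        mu, sigma = model.posterior(anchor)                 # q_phi(z^(c) | c)
        z_c = mu + sigma * torch.randn_like(sigma)          # z^(c) ~ q_phi(z^(c) | c)
        synthetic.append(model.decode(z_c))                 # c' ≈ argmax_c p_theta(c | z^(c))
    return dataset + synthetic
```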

SPEAKER UTTERANCE GOAL TURN ACT

ANCHOR (REAL) 1 User i am looking for a panasian restaurant in inform(area=south) inform(area=south) the south side of town . if there are n’t any inform(food=panasian) inform(food=panasian) maybe chinese . i need an address and price request(slot=price range) request(slot=address) 2 Wizard there is an expensive and a cheap chinese request(slot=price range) restaurant in the south . which would you prefer ? 3 User let ’s try cheap chinese restaurant . can i get inform(area=south) inform(food=chinese) an address ? inform(food=chinese) inform(price range=cheap) inform(price range=cheap) request(slot=address) 4 Wizard of course it ’s 5 User thank you goodbye . inform(area=south) inform(food=chinese) inform(price range=cheap)

POINTWISE POSTERIOR SAMPLE (GENERATED) 1 User i ’m looking for a panasian restaurant in the inform(food=panasian) inform(food=panasian) south side of town . if there are n’t any inform(area=south) inform(area=south) maybe chinese . i need an address and price request(slot=price range) 2 Wizard there are no cheap restaurants serving restaurants i have a seafood the the number is some other available 3 User how about thai inform(food=thai) request(slot=address) inform(price range=cheap) 4 Wizard we ’s on 5 User thank you very much . inform(food=thai) inform(price range=cheap)

3421 SPEAKER UTTERANCE GOAL TURN ACT

ANCHOR (REAL) 1 User i need the address of a gastropub in town . inform(food=gastropub) inform(food=gastropub) request(slot=address) 2 Wizard which part of town ? request(slot=area) 3 User does n’t matter . inform(food=gastropub) inform(area=dont care) inform(area=dont care) 4 Wizard would you prefer moderate or expensive request(slot=price range) pricing ? 5 User moderate please . inform(food=gastropub) inform(price range=moderate) inform(area=dont care) inform(price range=moderate) 6 Wizard i have found one results that matches your criteria the restaurant the is a gastropub located at some code as the price range is moderate 7 User are there any others in that price range ? inform(food=gastropub) inform(area=dont care) inform(price range=moderate) 8 Wizard unfortunately there are not sorry 9 User hello i am looking for a restaurant that inform(food=gastropub) serves gastropub food in any area can you inform(area=dont care) help me ? inform(price range=moderate) 10 Wizard sure would you prefer expensive or request(slot=price range) moderately priced ? 11 User thank you goodbye inform(food=gastropub) inform(area=dont care) inform(price range=moderate)

POINTWISE POSTERIOR SAMPLE (GENERATED) 1 User i need the address of a gastropub in town . inform(food=gastropub) inform(food=gastropub) 2 Wizard i have many options . would you prefer request(slot=area) centre or east ? 3 User does n’t matter . inform(food=gastropub) inform(area=dont care) inform(area=dont care) inform(area=center) inform(area=center) 4 Wizard there are three gastropub restaurants listed . request(slot=price range) one is in the east part of town and the rest are in the centre . 5 User i do n’t care inform(food=gastropub) inform(price range=moderate) inform(price range=moderate) inform(area=dont care) 6 Wizard i found . results that matches your criteria the restaurant the is a gastropub located at some as the price range is moderate 7 User are there any others in that price range ? inform(food=gastropub) inform(price range=moderate) inform(area=dont care) 8 Wizard in that actually not sorry 9 User hello i am looking for a restaurant that inform(food=gastropub) serves gastropub food in any area can you inform(price range=moderate) help me ? inform(area=dont care)


ANCHOR (REAL) 1 User i ’m looking for a cheap restaurant in the inform(area=west) inform(area=west) west part of town . inform(price range=cheap) inform(price range=cheap) 2 Wizard i found a vietnamese and italian cheap request(slot=phone) restaurant in the west side of town . would request(slot=address) you like the phone number or address of either ? 3 User yes please . inform(area=west) request(slot=phone) inform(price range=cheap) request(slot=address) 4 Wizard is the italian restaurant located at . its phone number is . is the vietnamese restaurant located at . its phone number is 5 User thank you . inform(area=west) inform(price range=cheap) 6 Wizard you ’re welcome 7 User goodbye . inform(area=west) inform(price range=cheap)

POINTWISE POSTERIOR SAMPLE 1 (GENERATED)
1 User: i 'm looking for a cheap restaurant in the west part of town .
  GOAL: inform(price range=cheap), inform(area=west)
  TURN ACT: inform(price range=cheap)
2 Wizard: there is a cheap restaurant in the west part of town . would you like their address and location ?
  TURN ACT: request(slot=phone), request(slot=address)
3 User: yes please .
  GOAL: inform(area=west), inform(area=north), inform(price range=cheap)
  TURN ACT: request(slot=phone), request(slot=address)
4 Wizard: is the italian restaurant .
5 User: thank you very much goodbye .
  GOAL: inform(area=north), inform(price range=cheap)

POINTWISE POSTERIOR SAMPLE 2 (GENERATED)
1 User: i want a cheap restaurant on the west side .
  GOAL: inform(price range=cheap), inform(area=west)
  TURN ACT: inform(price range=cheap), inform(area=west)
2 Wizard: is a restaurant that matches your choice in the west .
3 User: the phone and the address ?
  GOAL: inform(food=vietnamese), inform(price range=cheap), inform(area=west)
  TURN ACT: request(slot=phone), request(slot=address)
4 Wizard: 's phone number is
5 User: thank you that will do .
  GOAL: inform(food=vietnamese), inform(price range=cheap), inform(area=west)

Appendix E  z^{(c)} Interpolation Results (Including Both Anchors)

Visualizing samples from a linear interpolation of two points in the latent space (Bowman et al., 2016) is a popular way to showcase the generative capability of VAEs. Given two dialog samples $c_1$ and $c_2$, we map the data points onto the conversational latent space to obtain $z^{(c)}_1$ and $z^{(c)}_2$. Multiple equidistant samples $z'_1, \ldots, z'_N$ are selected from the linear interpolation between the two points: $z'_n = z^{(c)}_1 + n\,(z^{(c)}_2 - z^{(c)}_1)/N$. Likelihood-maximizing samples $x'_1, \ldots, x'_N$ are then chosen from the model posteriors given the intermediate latent samples.
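The interpolation procedure can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration rather than the authors' released implementation: `model`, `encode`, and `greedy_decode` are hypothetical handles for a trained VHDA that maps a dialog to the mean of its conversational latent z^{(c)} and decodes the likelihood-maximizing dialog from a given latent point.

```python
import torch

def interpolate_dialogs(model, c1, c2, n_steps=5):
    """Decode dialogs along the line segment between two anchor dialogs.

    Assumes (hypothetically) that `model.encode` returns the posterior mean of
    the conversational latent z^(c) and that `model.greedy_decode` returns the
    likelihood-maximizing dialog for a given latent point.
    """
    with torch.no_grad():
        z1 = model.encode(c1)  # z^(c) for anchor 1
        z2 = model.encode(c2)  # z^(c) for anchor 2
        dialogs = []
        for n in range(n_steps + 1):  # n = 0..N, so both anchors are included
            z_n = z1 + n * (z2 - z1) / n_steps  # equidistant interpolation point
            dialogs.append(model.greedy_decode(z_n))
    return dialogs
```

Reading off the decoded dialogs at, e.g., 30%, 50%, and 70% of the way between the anchors yields samples like the ones shown in the tables below.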


ANCHOR 1 (REAL) 1 User i ’m looking for a mediterranean place for inform(food=mediterranean) inform(food=mediterranean) any price . what is the phone and postcode inform(price=dont care) inform(price=dont care) ? request(slot=phone) request(slot=postcode) 2 Wizard i found a few places . the first is with a phone number of and postcode of 3 User That will be fine . thank you . inform(food=mediterranean) inform(price=dont care)

MIDPOINT 50% (GENERATED)
1 User: i want to find a cheap restaurant in the north part of town .
  GOAL: inform(area=north), inform(price range=cheap)
  TURN ACT: inform(area=north), inform(price range=cheap)
2 Wizard: what food type are you looking for ?
  TURN ACT: request(slot=food)
3 User: any type of restaurant will be fine .
  GOAL: inform(area=north), inform(food=dontcare), inform(price range=cheap)
  TURN ACT: inform(food=dontcare)
4 Wizard: the is a cheap indian restaurant in the north . would you like more information ?
5 User: what is the number ?
  GOAL: inform(area=north), inform(food=dontcare), inform(price range=cheap)
  TURN ACT: request(slot=phone)
6 Wizard: 's phone number is . is there anything else i can help you with ?
7 User: no thank you . goodbye .
  GOAL: inform(area=north), inform(food=dontcare), inform(price range=cheap)

ANCHOR 2 (REAL)
1 User: i am looking for a cheap restaurant in the north part of town .
  GOAL: inform(area=north), inform(price range=cheap)
  TURN ACT: inform(area=north), inform(price range=cheap)
2 Wizard: there are two restaurants that fit your criteria would you prefer italian or indian food ?
  TURN ACT: request(slot=food)
3 User: let s try indian please
  GOAL: inform(area=north), inform(price range=cheap), inform(food=indian)
  TURN ACT: inform(food=indian)
4 Wizard: serves indian food in the cheap price range and in the north part of town . is there anything else i can help you with ?
5 User: what is the name of the italian restaurant ?
  GOAL: inform(area=north), inform(price range=cheap), inform(food=indian)
  TURN ACT: inform(food=italian), request(slot=name)
6 Wizard:
7 User: what is the address and phone number ?
  GOAL: inform(area=north), inform(price range=cheap), inform(food=indian)
  TURN ACT: request(slot=address), request(slot=phone)
8 Wizard: the address for is and the phone number is .
9 User: thanks so much .
  GOAL: inform(area=north), inform(price range=cheap), inform(food=indian)


ANCHOR 1 (REAL) 1 User hi i ’m looking for a moderately priced inform(area=south) inform(area=south) restaurant in the south part of town . inform(price range=moderate) inform(price range=moderate) 2 Wizard the is request(slot=address) moderately priced and in the south part of town . would you like their location ? 3 User yes . i would like the location and the inform(area=south) request(slot=phone) phone number please . inform(price range=moderate) request(slot=address) 4 Wizard the address of is and the phone number is . 5 User thank you goodbye . inform(area=south) inform(price range=moderate)

30% (GENERATED)
1 User: i am looking for some seafood what can you tell me ?
  GOAL: inform(area=dont care)
  TURN ACT: inform(food=seafood), inform(area=dont care)
2 Wizard: restaurant bar serves mexican food in the south part of town . would you like their location ?
  TURN ACT: request(slot=address)
3 User: yes i 'd like the address phone number and postcode please .
  GOAL: inform(food=lebanese), inform(food=seafood)
  TURN ACT: request(slot=address), request(slot=phone)
4 Wizard: is located at cost the phone number is .
5 User: thank you goodbye .
  GOAL: inform(food=seafood), inform(area=dont care)

70% (GENERATED)
1 User: i would like to find a restaurant in the east part of town that serves gastropub food .
  GOAL: inform(food=mexican)
  TURN ACT: inform(food=mexican)
2 Wizard: restaurant bar serves mexican food in the south part of town . would you like their location ?
  TURN ACT: request(slot=address)
3 User: yes i 'd like the address phone number and postcode please .
  GOAL: inform(food=mexican), request(slot=postcode)
  TURN ACT: request(slot=address), request(slot=phone)
4 Wizard: restaurant bar is located at . the postal code is some code and the phone number is .
5 User: thank you goodbye .
  GOAL: inform(food=mexican)

ANCHOR 2 (REAL)
1 User: i want to find a restaurant in any part of town and serves malaysian food .
  GOAL: inform(area=dont care), inform(food=malaysian)
  TURN ACT: inform(area=dont care), inform(food=malaysian)
2 Wizard: there are no malaysian restaurants . would you like something different ?
3 User: north american please . give me their price range and their address and phone number please .
  GOAL: inform(area=dont care), inform(food=north american), request(slot=price range)
  TURN ACT: inform(food=north american), request(slot=phone), request(slot=address)
4 Wizard: is in the expensive price range their phone number is and their address is
5 User: thank you goodbye
  GOAL: inform(area=dont care), inform(food=north american)
