
Hello, It’s GPT-2 - How Can I Help You? Towards the Use of Pretrained Language Models for Task-Oriented Dialogue Systems

Paweł Budzianowski¹,²,³ and Ivan Vulić²,³
¹Engineering Department, Cambridge University, UK
²Language Technology Lab, Cambridge University, UK
³PolyAI Limited, London, UK
[email protected], [email protected]

Abstract

Data scarcity is a long-standing and crucial challenge that hinders quick development of task-oriented dialogue systems across multiple domains: task-oriented dialogue models are expected to learn grammar, syntax, dialogue reasoning, decision making, and language generation from absurdly small amounts of task-specific data. In this paper, we demonstrate that recent progress in language modeling pre-training and transfer learning shows promise to overcome this problem. We propose a task-oriented dialogue model that operates solely on text input: it effectively bypasses explicit policy and language generation modules. Building on top of the TransferTransfo framework (Wolf et al., 2019) and generative model pre-training (Radford et al., 2019), we validate the approach on complex multi-domain task-oriented dialogues from the MultiWOZ dataset. Our automatic and human evaluations show that the proposed model is on par with a strong task-specific neural baseline. In the long run, our approach holds promise to mitigate the data scarcity problem, and to support the construction of more engaging and more eloquent task-oriented conversational agents.

1 Introduction

Statistical conversational systems can be roughly clustered into two main categories: 1) task-oriented modular systems and 2) open-domain chit-chat neural models. The former typically consist of independently trained constituent modules such as language understanding, dialogue management, and response generation. The main goal of such systems is to provide meaningful system responses, which is invaluable in building conversational agents of practical value for restricted domains and tasks. However, data collection and annotation for such systems is complex, time-intensive, expensive, and not easily transferable (Young et al., 2013).

On the other hand, open-domain conversational bots (Li et al., 2017; Serban et al., 2017) can leverage large amounts of freely available unannotated data (Ritter et al., 2010; Henderson et al., 2019a). Large corpora allow for training end-to-end neural models, which typically rely on sequence-to-sequence architectures (Sutskever et al., 2014). Although highly data-driven, such systems are prone to producing unreliable and meaningless responses, which impedes their deployment in actual conversational applications (Li et al., 2017).

Due to the unresolved issues with end-to-end architectures, the focus has been extended to retrieval-based models. Here, massive datasets can be leveraged to aid task-specific applications (Kannan et al., 2016; Henderson et al., 2017, 2019b). Retrieval systems allow for full control over system responses, but the behaviour of the system is often highly predictable: it depends on the pre-existing set of responses, and the coverage is typically insufficient for a multitude of domains and tasks. However, recent progress in training high-capacity language models (e.g., GPT, GPT-2) (Radford et al., 2018, 2019) on large datasets reopens the question of whether such generative models can support task-oriented dialogue applications. Recently, Wolf et al. (2019) and Golovanov et al. (2019) showed that the GPT model, once fine-tuned, can be useful in the domain of personal conversations. In short, their approach led to substantial improvements on the Persona-Chat dataset (Zhang et al., 2018), showcasing the potential of exploiting large pretrained generative models in the conversational domain.¹

In this paper, we demonstrate that large generative models pretrained on large general-domain corpora can support task-oriented dialogue applications. We first discuss how to combine a set of diverse components such as word tokenization, multi-task learning, and probabilistic sampling to support task-oriented applications. We then show how to adapt the task-oriented dialogue framework to operate entirely on text input, effectively bypassing an explicit dialogue management module and a domain-specific natural language generation module. The proposed model operates entirely in the sequence-to-sequence fashion, consuming only simple text as input: the entire dialogue context, which includes the belief state, the database state, and previous turns, is provided to the decoder as raw text. The proposed model follows the recently proposed TransferTransfo framework (Wolf et al., 2019), and relies on pretrained models from the GPT family (Radford et al., 2018, 2019).

Our results in the standard dialogue-context-to-text task (see Figure 1) on the multi-domain MultiWOZ dataset (Budzianowski et al., 2018b) suggest that our GPT-based task-oriented dialogue model learns to generate and understand domain-specific tokens, which in turn leads to a seamless adaptation to particular focused domains. While automatic evaluation indicates that our framework still falls slightly short of a strong task-specific neural baseline, it also hints at the main advantage of our framework: it is widely portable and easily adaptable to a large number of domains, bypassing the intricate modular design at only a small cost in performance. Furthermore, user-centered evaluations suggest that there is no significant difference between the two models.

¹ E.g., TransferTransfo (Wolf et al., 2019) yields gains in all crucial dialogue evaluation measures such as fluency, consistency, and engagingness on the Persona-Chat dataset.

Figure 1: Dialogue-context-to-text task.

2 From Unsupervised Pretraining to Dialogue Modeling

Task-oriented dialogue modeling requires substantial amounts of domain-specific, manually labeled data. A natural question to ask is: can we leverage transfer learning through generative pretraining on large unlabelled corpora to enable task-oriented dialogue modeling? In this work, we rely on standard language modeling (LM) pretraining, where the task is to predict the next word given the preceding word sequence (Bengio et al., 2003). The objective maximizes the likelihood over the word sequence S = {w_1, ..., w_{|S|}}:

    L_1(S) = \sum_{i=1}^{|S|} \log P(w_i \mid w_0, w_1, \ldots, w_{i-1}).    (1)

Transfer learning based on such LM pretraining, combined with the Transformer decoder model (Vaswani et al., 2017), resulted in significant progress across many downstream tasks (Rei, 2017; Howard and Ruder, 2018; Radford et al., 2018, 2019).

2.1 TransferTransfo Framework

Golovanov et al. (2019) and Wolf et al. (2019) achieved a first successful transfer of a generative pretrained GPT model to an open-domain dialogue task. The pretrained GPT model is fine-tuned in a multi-task learning fashion following the original work (Radford et al., 2018). The LM objective from Eq. (1) is combined with the next-utterance classification task:

    p(c, a) = \mathrm{softmax}(h_l^{*} W_h),    (2)

where c and a represent the context of the conversation and a proposed answer, h_l^{*} is the last hidden state of the transformer decoder, and W_h is learnt during the fine-tuning phase. The model significantly improves upon previous baselines over all automatic dialogue evaluation metrics, as well as in evaluation with human subjects, on the Persona-Chat dataset (Zhang et al., 2018).
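To make the combined objective concrete, the following is a minimal PyTorch sketch of the multi-task loss; the `gpt` interface, tensor names, and shapes are our illustrative assumptions rather than the released implementation, and the 2:1 weighting of the LM term anticipates the training details reported in §4.

```python
import torch.nn.functional as F

def transfertransfo_loss(gpt, input_ids, lm_labels, cand_ids, mc_labels, W_h):
    """Sketch of the multi-task objective: the LM loss of Eq. (1)
    combined with the next-utterance classification loss of Eq. (2)."""
    hidden = gpt(input_ids)                       # (batch, seq_len, dim), assumed API
    lm_logits = gpt.lm_head(hidden)               # output layer tied to input embeddings

    # Eq. (1): predict each next word from the preceding sequence.
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        lm_labels[:, 1:].reshape(-1),
        ignore_index=-100,                        # positions masked out of the LM loss
    )

    # Eq. (2): score each candidate answer with the last hidden state h_l*.
    h_last = gpt(cand_ids.view(-1, cand_ids.size(-1)))[:, -1]  # (batch * n_cand, dim)
    mc_logits = (h_last @ W_h).view(cand_ids.size(0), -1)      # (batch, n_cand)
    mc_loss = F.cross_entropy(mc_logits, mc_labels)

    # The LM loss is weighted twice as high as the response-prediction
    # loss, following the training details in Section 4.
    return 2.0 * lm_loss + mc_loss
```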
The GPT input consists of token embeddings and positional embeddings. In order to move from a single-speaker setting to a setting with two interlocutors, Wolf et al. (2019) introduced dialogue-state embeddings. These embeddings inform the model whether the current token comes from an utterance of the first speaker or an utterance of the second speaker. The dialogue-state embeddings are learned during the fine-tuning phase.
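A short sketch of how the three embedding types combine into the final input representation; the class and parameter names below are illustrative and not taken from the released code (the default sizes mirror the original GPT: a 40,478-token BPE vocabulary, 512 positions, 768 dimensions).

```python
import torch
import torch.nn as nn

class DialogueInput(nn.Module):
    """Sketch: GPT input = token + positional + dialogue-state embeddings."""

    def __init__(self, vocab_size=40478, n_positions=512, dim=768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)   # pretrained, fine-tuned
        self.pos = nn.Embedding(n_positions, dim)  # pretrained, fine-tuned
        self.state = nn.Embedding(2, dim)          # new: one row per speaker

    def forward(self, token_ids, speaker_ids):
        # speaker_ids[t] is 0 when token t belongs to an utterance of the
        # first speaker and 1 when it belongs to the second speaker.
        positions = torch.arange(token_ids.size(-1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions) + self.state(speaker_ids)
```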

Figure 2: The framework for modeling task-oriented conversations based on a pretrained GPT model, which uses only unstructured simple text as input. The context, belief state, and database state are joined together without explicit standalone dialogue policy and generation modules. The token-level (i.e., dialogue-state) embeddings are learned following Wolf et al. (2019).

3 Domain Transfer for (Task-Oriented) Dialogue Modeling

We now briefly discuss several advances in the modeling of natural language that facilitate the applicability of pretrained generative models in task-oriented dialogue modeling. To the best of our knowledge, this work is the first to combine these existing components to enable task-oriented dialogue modeling.

3.1 Domain Adaptation and Delexicalization

Dealing with out-of-vocabulary (OOV) words has been a long-standing challenge in dialogue modeling; e.g., it is crucial for task-oriented generation, where the generated output is often delexicalized (Wen et al., 2015). Delexicalization replaces slot values by their corresponding (generic) slot tokens, which allows learning value-independent parameters. Recently, owing to subword-level tokenisation (Sennrich et al., 2016), language models have become able to deal with OOVs and domain-specific vocabularies more effectively (Radford et al., 2018).

3.2 Simple Text-Only Input

Recent empirical validations suggest that posing NLP tasks in the form of simple text can yield improvements with unsupervised architectures (Wolf et al., 2019; Radford et al., 2019). For instance, in task-oriented dialogue modeling, the Sequicity model (Lei et al., 2018) treats the classification over the belief state as a generation problem. That way, the entire dialogue model pipeline is based on the sequence-to-sequence architecture: the output from one model is the input to the subsequent recurrent model. We follow this approach by providing both the belief state and the knowledge base state in a simple text format to the generator. This significantly simplifies the paradigm of building task-oriented models: any new source of information can simply be added as another part of the text-only input provided in "natural language".

3.3 Transferring Language Generation Capabilities

The Transformer architecture has shown the ability to learn new (i.e., domain-specific) token embeddings in the fine-tuning phase (Radford et al., 2018; Wolf et al., 2019). This means that the GPT models can adapt to particular tasks through special tokens. By providing the input representation as text with domain-specific tokens, we can use off-the-shelf architectures and adapt to the domain-specific input without the need to train new dialogue sub-modules. As mentioned in §2.1, the token-level layer (Figure 2) informs the transformer decoder which part of the input comes from the system side and which from the user side. In our framework, we create two task-oriented specific tokens (System and User tokens) that are learned during fine-tuning.
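As a toy illustration of §3.1–§3.3, the following sketch delexicalizes slot values and marks each turn with a speaker token; the slot inventory and the exact token spellings are our own choices, not the paper's.

```python
# Illustrative slot inventory; the real MultiWOZ preprocessing covers
# many more slots (booking IDs, telephone numbers, etc.).
SLOT_VALUES = {
    "[restaurant_name]": ["golden wok", "pizza hut"],
    "[value_area]": ["centre", "north"],
}

def delexicalize(utterance: str) -> str:
    """Replace slot values with generic slot tokens so the model can
    learn value-independent parameters (Wen et al., 2015)."""
    for slot, values in SLOT_VALUES.items():
        for value in values:
            utterance = utterance.replace(value, slot)
    return utterance

def mark_speakers(turns):
    """Prefix each turn with a special token telling the decoder which
    side of the conversation it comes from (cf. Section 3.3)."""
    tagged = []
    for i, turn in enumerate(turns):
        speaker = "<user>" if i % 2 == 0 else "<system>"
        tagged.append(f"{speaker} {delexicalize(turn)}")
    return " ".join(tagged)

print(mark_speakers(["i want food in the centre",
                     "golden wok is a nice restaurant in the centre"]))
# -> <user> i want food in the [value_area]
#    <system> [restaurant_name] is a nice restaurant in the [value_area]
```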

3.4 Generation Quality

Finally, the long-standing problem of dull and repetitive response generation (Li et al., 2017) has been the focus of recent work (Kulikov et al., 2018; Holtzman et al., 2019). Owing to new sampling strategies, generative models are now able to create longer and more coherent sequence outputs. This has also been validated for open-domain dialogue modeling (Wolf et al., 2019; Golovanov et al., 2019). We experiment with standard decoding strategies as well as with the recently proposed nucleus sampling procedure (Holtzman et al., 2019). A standard greedy sampling strategy chooses the most probable word:

    w_i = \arg\max_{w_i} \log P(w_i \mid w_0, w_1, \ldots, w_{i-1}).

Nucleus sampling, on the other hand, is restricted only to words from the p-th percentile of the distribution during generation: the probabilities of the words for which the cumulative sum exceeds the percentile are rescaled, and the sequence is sampled from this subset. We probe the ability of such large pretrained models to generate more varied and semantically richer responses by relying on nucleus sampling in lieu of greedy sampling, without hurting the actual performance.
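The two decoding strategies can be sketched as follows for a single decoding step (PyTorch; the function names are ours):

```python
import torch

def greedy_step(logits: torch.Tensor) -> torch.Tensor:
    """Greedy decoding: pick the most probable next word."""
    return logits.argmax(dim=-1)

def nucleus_step(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Nucleus (top-p) sampling (Holtzman et al., 2019): keep the smallest
    set of words whose cumulative probability exceeds p, rescale their
    probabilities, and sample from that subset."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Zero out every word after the cumulative sum first exceeds p,
    # always keeping at least the single most probable word.
    mask = cumulative > p
    mask[..., 1:] = mask[..., :-1].clone()
    mask[..., 0] = False
    sorted_probs[mask] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice).squeeze(-1)
```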
4 Fine-Tuning GPT on MultiWOZ

To evaluate the ability to transfer the GPT generation capability to constrained/focused dialogue tasks and domains, we rely on the multi-domain MultiWOZ dataset (Budzianowski et al., 2018b). MultiWOZ consists of 7 domains and 10,438 dialogues, and it is substantially larger than previously available datasets (Wen et al., 2017; El Asri et al., 2017). The conversations are natural, as they were gathered through human-human interactions. However, the dialogues rely on domain-specific vocabulary, such as booking IDs or telephone numbers, that needs to be delexicalized as it is entirely database-dependent.

Natural Language as (the Only) Input. GPT operates solely on text input. This is in opposition to standard task-oriented dialogue architectures (Wen et al., 2017; Zhao et al., 2017), where the belief state and the database state are encoded in numerical form. For example, the database state is typically defined as n-bin encodings representing the number of available entities at the current state of the conversation (Wen et al., 2017). Therefore, we transform the belief state and the knowledge base representation to a simple text representation. The belief state takes the following form:

    Domain1 Slot1 Value1 Slot2 Value2
    Domain2 Slot1 ...

and the database representation is provided as:

    Domain1 # of entities
    Domain2 # of entities ...

This is similar in spirit to the Sequicity architecture (Lei et al., 2018), where the second recurrent model takes as input the belief state in natural-language (i.e., simple text-only) form. In this work, we also transform the knowledge base state to a similar natural language format. These two pieces of information are then concatenated with the history of the conversation, forming the full dialogue context; see Figure 2. Following Wolf et al. (2019), we add new token embeddings for the two parties involved in the conversation to inform the attention layers which part of the context comes from the user and which part is related to the system. Figure 2 presents the final architecture.

Training Details. We use the open-source implementation of the GPT architecture that provides both GPT and GPT-2 fine-tunable checkpoints.² Following previous work (Radford et al., 2018; Wolf et al., 2019), we set the weight on the language model loss to be two times higher than the one on the response prediction. The parameters for the batch size (24), learning rate (1e-5), and the number of candidates per sequence (2) were chosen based on grid search.³

² https://github.com/huggingface/transfer-learning-conv-ai
³ We searched over the following values: learning rates ∈ {1e-4, 1e-5, 5e-6, 1e-6}, batch sizes ∈ {8, 12, 16, 20, 24}, and candidate set sizes ∈ {1, 2, 4, 6}.
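A minimal sketch of assembling such a text-only dialogue context in the format given above; the delimiter tokens and the dictionary layout are our illustrative assumptions.

```python
def serialize_belief(belief: dict) -> str:
    """Render the belief state as 'Domain1 Slot1 Value1 Slot2 Value2 ...'."""
    parts = []
    for domain, slots in belief.items():
        parts.append(domain)
        for slot, value in slots.items():
            parts.extend([slot, value])
    return " ".join(parts)

def serialize_db(db_counts: dict) -> str:
    """Render the database state as 'Domain1 <# of entities> ...'."""
    return " ".join(f"{domain} {count}" for domain, count in db_counts.items())

def build_context(history: str, belief: dict, db_counts: dict) -> str:
    """Concatenate conversation history, belief state, and database state
    into the single text sequence consumed by the decoder (Figure 2)."""
    return " ".join([history,
                     "<belief>", serialize_belief(belief),
                     "<db>", serialize_db(db_counts)])

ctx = build_context(
    "<user> i need a cheap restaurant in the centre",
    {"restaurant": {"pricerange": "cheap", "area": "centre"}},
    {"restaurant": 22},
)
```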

5 Results and Analysis

Following prior work (Budzianowski et al., 2018b; Zhao et al., 2019; Chen et al., 2019), our evaluation task is the dialogue-context-to-text generation task (see Figure 1). Given a dialogue history, the oracle belief state, and the database state, the model needs to output an adequate response. By relying on the oracle belief state, prior work has bypassed possible errors originating from natural language understanding (Budzianowski et al., 2018b).

The main evaluation is based on the comparison between the following two models: 1) the baseline, a neural response generation model with an oracle belief state obtained from the wizard annotations, as in Budzianowski et al. (2018a); and 2) the model proposed in §4 and shown in Figure 2, which works entirely with text-only input. We test all three available pretrained GPT models: the original GPT model (Radford et al., 2018), and two GPT-2 models referred to as small (GPT2) and medium (GPT2-M) (Radford et al., 2019).

             Baseline   GPT     GPT2-S   GPT2-M
Inform (%)   76.7       71.53   66.43    70.96
Success (%)  64.63      55.36   55.16    61.36
BLEU (%)     18.05      17.80   18.02    19.05

Table 1: Evaluation on MultiWOZ with the greedy sampling procedure.

             Baseline   GPT     GPT2-S   GPT2-M
Inform (%)   72.57      70.43   69.3     73.96
Success (%)  57.63      51.0    54.93    61.20
BLEU (%)     15.75      15.65   15.64    16.55

Table 2: Evaluation on MultiWOZ with the nucleus sampling procedure.

Model 1               Model 2
GPT        59%   41%  Baseline
GPT        46%   54%  Target
GPT2       46%   54%  Target
GPT2       45%   55%  Baseline
Baseline   43%   57%  Target
GPT2       51%   49%  GPT

Table 3: Human ranking of responses between all pairs of the four analyzed models and the original responses (Target).

5.1 Evaluation with Automatic Measures

We report scores with three standard automatic evaluation measures. Two of them relate to dialogue task completion: whether the system has provided an appropriate entity (Inform) and then answered all requested attributes (Success rate). Finally, fluency is measured by the BLEU score (Papineni et al., 2002).

First, the three versions of GPT were fine-tuned on MultiWOZ and evaluated with greedy sampling. The results are summarized in Table 1. They show that the baseline obtains the highest scores on the task-related metrics, while the highest BLEU score was achieved by GPT2-M. Although the results are lower for the GPT-based methods, we note the design simplicity of the GPT-based task-oriented dialogue models. Further, the gap in performance might be partially attributed to the chosen greedy sampling procedure, which puts too much focus on the properties of the original pretraining phase (Holtzman et al., 2019).

Therefore, we also report the results with the nucleus sampling method in Table 2. The scores confirm the importance of choosing the correct sampling method. The GPT2 models improve their scores on the Inform and Success metrics. It is worth noting the consistent drop in BLEU scores across all models. This comes from the fact that nucleus sampling allows for increased variability, which might reduce the probability of generating domain-specific tokens.

We have also qualitatively analyzed a sample of successful dialogues. Only around 50% of dialogues are successful both with the baseline and with the GPT-based models. Moreover, there are no clearly observed distinct patterns between successful dialogues for the two model types. This suggests that the two approaches might be effectively ensembled using a ranking model that evaluates the score of each response (Henderson et al., 2019b). We will investigate the complementarity of the two approaches, along with ensemble methods, in future work.

5.2 Human Evaluation

In another, now user-centered, experiment the goal was to analyze the generation quality. Turkers, native speakers of English, were asked to rate their binary preference when presented with one-turn responses from the baseline, GPT, GPT2-M, and the original dialogues (Target). The turkers were required to choose which response they prefer when presented with two responses from two different models, resulting in more than 300 scores per model pair.

The results are summarized in Table 3, while some example dialogues with responses are provided in Figure 3. As expected, the original responses are ranked higher than all neural models, with the largest difference observed between the oracle and the baseline model. Although the generated output from the GPT is strongly preferred over the neural baseline, interestingly, the opposite is observed with the GPT2 model. These inconclusive results call for further analyses in future work, and also show that there are no substantial differences in the quality of generated responses between the strong neural baseline and the GPT-based models.

6 Conclusion

In this paper, we have made a first step towards leveraging large pretrained generative models for modeling task-oriented dialogue in multiple domains. The simplicity of the fine-tuning procedure, where all necessary information can be encoded as simple text, enables a quick adaptation to constrained domains and domain-specific vocabularies.

We hope that this framework will inform and guide future research towards simultaneously improving and simplifying the design of task-oriented conversational systems.

References

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

Paweł Budzianowski, Iñigo Casanueva, Bo-Hsiang Tseng, and Milica Gašić. 2018a. Towards end-to-end multi-domain dialogue modelling. Tech. Rep. CUED/F-INFENG/TR.706, University of Cambridge, Engineering Department.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018b. MultiWOZ - A large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of EMNLP, pages 5016–5026.

Wenhu Chen, Jianshu Chen, Pengda Qin, Xifeng Yan, and William Yang Wang. 2019. Semantically conditioned dialog response generation via hierarchical disentangled self-attention. arXiv preprint arXiv:1905.12866.

Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: A corpus for adding memory to goal-oriented dialogue systems. In Proceedings of SIGDIAL, pages 207–219.

Sergey Golovanov, Rauf Kurbanov, Sergey Nikolenko, Kyryl Truskovskyi, Alexander Tselousov, and Thomas Wolf. 2019. Large-scale transfer learning for natural language generation. In Proceedings of ACL, pages 6053–6058.

Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-hsuan Sung, Laszlo Lukacs, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. 2017. Efficient natural language response suggestion for smart reply. arXiv preprint arXiv:1705.00652.

Matthew Henderson, Paweł Budzianowski, Iñigo Casanueva, Sam Coope, Daniela Gerz, Girish Kumar, Nikola Mrkšić, Georgios Spithourakis, Pei-Hao Su, Ivan Vulić, and Tsung-Hsien Wen. 2019a. A repository of conversational datasets. In Proceedings of the Workshop on NLP for Conversational AI. Data available at github.com/PolyAI-LDN/conversational-datasets.

Matthew Henderson, Ivan Vulić, Daniela Gerz, Iñigo Casanueva, Paweł Budzianowski, Sam Coope, Georgios Spithourakis, Tsung-Hsien Wen, Nikola Mrkšić, and Pei-Hao Su. 2019b. Training neural response selection for task-oriented dialogue systems. In Proceedings of ACL.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of ACL, pages 328–339.

Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, Laszlo Lukacs, Marina Ganea, Peter Young, et al. 2016. Smart reply: Automated response suggestion for email. In Proceedings of KDD, pages 955–964.

Ilya Kulikov, Alexander H. Miller, Kyunghyun Cho, and Jason Weston. 2018. Importance of a search strategy in neural dialogue modelling. arXiv preprint arXiv:1811.00907.

Wenqiang Lei, Xisen Jin, Min-Yen Kan, Zhaochun Ren, Xiangnan He, and Dawei Yin. 2018. Sequicity: Simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In Proceedings of ACL, pages 1437–1447.

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. In Proceedings of EMNLP, pages 2157–2169.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL, pages 311–318.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

Marek Rei. 2017. Semi-supervised multitask learning for sequence labeling. In Proceedings of ACL, pages 2121–2130.

Alan Ritter, Colin Cherry, and Bill Dolan. 2010. Unsupervised modeling of Twitter conversations. In Proceedings of NAACL-HLT, pages 172–180.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of ACL, pages 1715–1725.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of AAAI, pages 3295–3301.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NeurIPS, pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NeurIPS, pages 5998–6008.

Tsung-Hsien Wen, Milica Gašić, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of EMNLP.

Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of EACL, pages 438–449.

Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.

Steve Young, Milica Gašić, Blaise Thomson, and Jason D. Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of ACL, pages 2204–2213.

Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of ACL, pages 654–664.

Tiancheng Zhao, Kaige Xie, and Maxine Eskenazi. 2019. Rethinking action spaces for reinforcement learning in end-to-end dialog agents with latent variable models. In Proceedings of NAACL-HLT, pages 1208–1218.

Figure 3: The comparison of generated responses from the baseline model and GPT2-M.
