
Pretrained Language Models for Dialogue Generation with Multiple Input Sources

Yu Cao 1*, Wei Bi 2, Meng Fang 3, Dacheng Tao 1
1 UBTECH Sydney AI Center, School of Computer Science, Faculty of Engineering, The University of Sydney, Australia
2 Tencent AI Lab, China, 3 Tencent Robotics X, China
[email protected], [email protected], [email protected], [email protected]

Abstract

Large-scale pretrained language models have achieved outstanding performance on natural language understanding tasks. However, it is still under-explored how to apply them to dialogue generation tasks, especially those with responses conditioned on multiple sources. Previous work simply concatenates all input sources or averages information from different input sources. In this work, we study dialogue models with multiple input sources adapted from the pretrained language model GPT2. We explore various methods to fuse the multiple separate attention outputs corresponding to different sources. Our experimental results show that proper fusion methods deliver higher relevance with dialogue history than simple fusion baselines.

1 Introduction

Large-scale pretrained language models (Devlin et al., 2019; Radford et al., 2018, 2019) have achieved outstanding performance on various natural language understanding tasks (Young et al., 2018; Liu et al., 2019). Researchers have since utilized them in dialogue generation tasks (Budzianowski and Vulić, 2019; Edunov et al., 2019; Zhang et al., 2019). Many of these approaches simply concatenate the input dialogue history and the output response during finetuning, since the pretrained language model only accepts a single sequence as input. However, dialogue generation tasks may involve multiple input sources simultaneously. For example, in personalized or knowledge-grounded dialogue generation (Li et al., 2016; Zhang et al., 2018; Dinan et al., 2018), a response is generated conditioned on both the dialogue history and an auxiliary user profile or knowledge article. Beyond simple concatenation of all input sources, an important question arises: how can we better adapt a single-input pretrained language model to a multi-input dialogue generation task?

Some previous work forms an encoder-decoder architecture with both encoder and decoder duplicated from a pretrained language model (Golovanov et al., 2019; Zheng et al., 2019). Recently, BART (Lewis et al., 2019) even obtains a complete pretrained model under this architecture directly. Taking personalized dialogue generation (Zhang et al., 2018) as an example, we can treat persona information, dialogue history, and previously generated tokens as three different input sources. The former two are encoded first and then combined with the last one in the decoder. In Golovanov et al. (2019), the multi-head attention layer in the decoder is copied three times, one copy per input source, and mean pooling is used to average the results from the multiple attentions. This encoder-decoder adaptation is shown to outperform simple concatenation. However, when the dialogue history gets longer, this model tends to use less information from each dialogue history utterance to predict the next token. Zheng et al. (2019) add an extra weight predictor to combine the multiple attention outputs, but they do not perform experiments using publicly released pretrained models, nor on public datasets, making their results not directly comparable to other work.

In this work, we build our dialogue model on the encoder-decoder architecture adapted from the pretrained language model GPT2 (Radford et al., 2019). Our main contribution is to empirically study attention fusion methods for multiple information sources in each decoder layer. Three kinds of methods are explored in total. Our experimental results show improvements on both automatic and human evaluations by using proper attention fusion methods, compared to baselines using concatenation or mean pooling.

* This work was done during Yu Cao's internship in Tencent AI Lab, Shenzhen.

2 Model

2.1 The Encoder-Decoder Architecture

Following the former work (Golovanov et al., 2019), we use the personalized dialogue generation task on PersonaChat (Zhang et al., 2018) as an example in our study. The pretrained language model GPT2 and its parameters are duplicated to form an encoder-decoder architecture, shown in Figure 1(a). We use GPT2 here due to its larger pre-training corpus compared to other models and its strong performance in other generation tasks.

Figure 1: Architecture of our proposed model. (a) The encoder-decoder architecture: personality and dialog history are fed into the GPT2 encoder, the current reply into the GPT2 decoder, and a shared linear layer yields α Persona LM loss + β History LM loss + γ Prediction LM loss = Full Model Loss. (b) Details of each transformer block in the decoder (repeated ×N): MH self-attention over the current state, two MH bi-attentions over the encoded personality and encoded dialog history, attention fusion, layer normalization, and an MLP with residual connections.

We have three separate inputs: personal profile, dialogue history, and current reply (or the previously generated response during the inference stage). Embeddings of the former two, which contain embeddings of tokens, positions, and token types, are successively put into the encoder, a GPT2 model with no attention mask to fit the encoding procedure. The encoded representations, together with the embeddings of the current response tokens, are then used as the input of a modified GPT2 decoder. Each decoder block attends the current state to the three sources using different attentions, then fuses the resulting information as input for the next layer.

Inspired by multi-task learning (Zhang and Yang, 2017), we further separate the original language modeling loss into three parts corresponding to the three input sources. By applying the same linear prediction layer to the output of both encoder and decoder, three cross-entropy losses between the predicted logits and the corresponding ground-truth sequences are weighted by hyperparameters:

L = α L_persona + β L_history + γ L_pred    (1)

The whole model is optimized with the Adam optimizer (Kingma and Ba, 2014).

2.2 Block Details in Decoder

Recall that we have three input sources in the decoder, and thus some modifications are needed if the decoder structure is inherited from GPT2. Details of each modified decoder block are shown in Figure 1(b), in which the most apparent change is the two additional multi-head (MH) bidirectional attentions and the attention fusion module that fuses the various attention outputs. The other parts remain the same as GPT2. In the following, we first describe the MH bi-attention; attention fusion is discussed in the next section.

The MH self-attention in Transformer (Vaswani et al., 2017) handles a single input only. To make it accept two input sources, we regard the current state H^c ∈ R^(L_c × d) from the previous layer (or the embedding of the reply in the first layer) as the query, and the encoded state of auxiliary information H^a ∈ R^(L_a × d) as the key and value in the attention. Here L_c and L_a are the corresponding lengths of these inputs, and H^a can be the encoded personality H^p or dialog history H^h. The output of each single head in MH bi-attention is obtained via

A = softmax((H^c W^Q)(H^a W^K)^T / √d) (H^a W^V),    (2)

where W^Q, W^K, W^V are learnable matrices. In our model, different attentions own separate parameters instead of sharing them. This differs from the previous work (Golovanov et al., 2019), which reuses the self-attention for bi-attention. Besides, the original GPT2 is a single-directional model using a triangular matrix as the attention mask. Since the auxiliary information H^a is visible to the current reply at all time steps, no mask is used in MH bi-attention.

In total, three attention outputs A^c, A^p, and A^h are obtained by attending the current state to itself, the personality, and the history respectively, all with the same dimension R^(L_c × d). They need to be fused into one matrix A^f ∈ R^(L_c × d) so as to proceed to subsequent decoding layers.
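To make Eq. (2) and the no-mask design concrete, the following is a minimal PyTorch sketch of the three attentions computed inside one decoder block. It is our own illustration under assumed class and variable names (MHBiAttention, H_c, etc.), not the authors' released implementation.

```python
# Minimal PyTorch sketch of the MH bi-attention of Eq. (2) and the three
# attention outputs A^c, A^p, A^h of one decoder block. Names and shapes are
# illustrative assumptions, not the paper's released code.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MHBiAttention(nn.Module):
    """Multi-head attention whose query comes from the current decoder state
    and whose key/value come from an encoded auxiliary source; no mask is
    applied unless the source is the reply itself (causal=True)."""
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q, self.w_k = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.w_v, self.w_o = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)

    def _split(self, x):
        # (B, L, d_model) -> (B, n_heads, L, d_head)
        B, L, _ = x.shape
        return x.view(B, L, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, h_c, h_a, causal: bool = False):
        q = self._split(self.w_q(h_c))
        k = self._split(self.w_k(h_a))
        v = self._split(self.w_v(h_a))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # Eq. (2), per head
        if causal:  # triangular mask, only when the reply attends to itself
            mask = torch.tril(torch.ones(q.size(-2), k.size(-2), device=scores.device)).bool()
            scores = scores.masked_fill(~mask, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v
        out = out.transpose(1, 2).reshape(h_c.size(0), -1, self.n_heads * self.d_head)
        return self.w_o(out)

# Three separately parameterized attentions inside one decoder block.
self_attn, p_attn, h_attn = MHBiAttention(), MHBiAttention(), MHBiAttention()

H_c = torch.randn(2, 20, 768)   # current reply state (query)
H_p = torch.randn(2, 30, 768)   # encoded personality
H_h = torch.randn(2, 50, 768)   # encoded dialog history
A_c = self_attn(H_c, H_c, causal=True)   # reply attends to itself
A_p = p_attn(H_c, H_p)                   # reply attends to personality (no mask)
A_h = h_attn(H_c, H_h)                   # reply attends to history (no mask)
# A_c, A_p, A_h all have shape (2, 20, 768) and are passed to attention fusion.
```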

2.3 Attention Fusion

In this section, we discuss various methods to fuse the multiple attention outputs obtained above. The simplest approach is to average the three sources in all dimensions (Golovanov et al., 2019), which treats all sources equally. However, in different dialogues, we may need to concentrate more on the dialogue history or the persona profile in order to generate proper responses. We therefore introduce the following three kinds of methods to allow for more flexible information fusion from all input sources.

• Static methods fuse different information using an identical fusion function with no trainable parameters. Besides the average pooling (avg), which is regarded as a very simple fusion baseline, we also include the Maximum (max) and Minimum (min) operations over every dimension among all sources.

• Weighting methods try to estimate the globally optimal proportion of each source in a given domain by introducing extra learnable weights, which are then fixed at inference. Such methods can be:
(i) source-level scalar weights (sw): three trainable scalars w^c, w^p, w^h per source in each layer, with A^f = (w^c A^c + w^p A^p + w^h A^h) / (w^c + w^p + w^h).
(ii) source-dimension level weights (dw): the weights are learnable vectors w^c, w^p, w^h ∈ R^d. For each dimension j of A^f and the corresponding weight entries, the weighted combination is A^f_j = (w^c_j A^c_j + w^p_j A^p_j + w^h_j A^h_j) / (w^c_j + w^p_j + w^h_j).
(iii) linear method (linear): a linear network transforms the concatenated attention [A^c; A^p; A^h] into A^f. Different from the above two, each dimension in the new feature space contains information from all dimensions of all sources, realizing a better interaction.

• The attention-based method fuses the information with a trainable modified transformer attention (att), so the fusion changes dynamically with the multiple inputs:

Z = softmax(sign(A^c (A^p)^T) ⊙ √|A^c (A^p)^T| / √d) A^h,    (3)

where sign(·) equals 1 for positive elements and -1 for negative ones, |·| denotes the absolute value, ⊙ is the element-wise product, and the square root keeps the value scale unchanged. This method uses matrix multiplication to enable full interaction among all state values, obtaining states conditioned on all information sources dynamically. The history information is selected as the "value" term so that more dialog history is involved in the obtained state.

3 Experiment

We employ the PersonaChat dataset (Zhang et al., 2018; Dinan et al., 2020) in our experiments, which has 164,356 utterances in 10,981 dialogues and 1,155 personas. Each sample contains a dialog history with up to 15 utterances, a gold reply, and a persona description with no more than 5 sentences.

Four kinds of dialogue models using pretrained language models as the initialization are compared:
(i) TransferTransfo (Wolf et al., 2019), a single-input OpenAI GPT using token type embeddings to distinguish the different parts of a single concatenated input (persona profile, dialog history, and reply, successively). We also replace the original GPT in this method with GPT2, denoted as TransferGPT2.
(ii) MI-GPT (Golovanov et al., 2019), which uses the OpenAI GPT in both encoder and decoder with average pooling as the attention fusion method.
(iii) Our architecture using GPT2 as the base model and average as the fusion method (GPT2-avg), a very simple baseline inherited from MI-GPT.
(iv) Our model with each of the attention fusion methods discussed in Sec 2.3, denoted as GPT2-X, where X is the corresponding fusion method.

All GPT2 models used here are the small size (12 layers, hidden size 768). Besides, a Seq2seq model with attention (Bahdanau et al., 2014) using a 6-layer Transformer as the encoder and decoder is also included as an end-to-end single-input baseline.¹

The following automatic metrics are considered in our evaluation: BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and NIST-4, which indicate the n-gram-level similarity between the references and the generated responses. Moreover, Entropy-4, corpus-level Distinct-2, and the average length of replies are used to reflect the diversity of the generated text. In addition, human evaluation is conducted on 200 dialog pairs in terms of fluency (range: 1~3), relevance with dialogue history (h-rel, range: 1~3), and consistency with personality (p-consist, {0, 1}). More experiment configurations can be found in Appendix A.

¹ Source code is available at: https://github.com/caoyu-noob/Multi-GPT2
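For concreteness, the sketch below implements the fusion variants of Sec 2.3 over the three attention outputs. The module and parameter names are our own assumptions rather than the released code, and the "att" branch reproduces only the parameter-free computation of Eq. (3), without any additional trainable projections the full method may use.

```python
# Illustrative PyTorch sketch of the fusion variants in Sec 2.3, applied to the
# three attention outputs A_c, A_p, A_h of shape (batch, L_c, d).
import math
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, d: int = 768, method: str = "sw"):
        super().__init__()
        self.method = method
        if method == "sw":            # one scalar weight per source
            self.w = nn.Parameter(torch.ones(3))
        elif method == "dw":          # one d-dimensional weight vector per source
            self.w = nn.Parameter(torch.ones(3, d))
        elif method == "linear":      # maps [A_c; A_p; A_h] to A_f
            self.proj = nn.Linear(3 * d, d)

    def forward(self, A_c, A_p, A_h):
        stacked = torch.stack([A_c, A_p, A_h], dim=0)          # (3, B, L_c, d)
        if self.method == "avg":
            return stacked.mean(dim=0)
        if self.method in ("max", "min"):
            reduce = torch.amax if self.method == "max" else torch.amin
            return reduce(stacked, dim=0)
        if self.method == "sw":                                 # source-level scalars
            return (self.w.view(3, 1, 1, 1) * stacked).sum(dim=0) / self.w.sum()
        if self.method == "dw":                                 # per-dimension weights
            return (self.w.view(3, 1, 1, -1) * stacked).sum(dim=0) / self.w.sum(dim=0)
        if self.method == "linear":                             # concatenate then project
            return self.proj(torch.cat([A_c, A_p, A_h], dim=-1))
        if self.method == "att":                                # Eq. (3), history as "value"
            scores = A_c @ A_p.transpose(-2, -1)                # (B, L_c, L_c)
            scores = torch.sign(scores) * scores.abs().sqrt() / math.sqrt(A_c.size(-1))
            return torch.softmax(scores, dim=-1) @ A_h
        raise ValueError(f"unknown fusion method: {self.method}")

A_c, A_p, A_h = (torch.randn(2, 20, 768) for _ in range(3))
A_f = AttentionFusion(method="linear")(A_c, A_p, A_h)           # (2, 20, 768)
```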

Model            BLEU   METEOR  NIST-4  Entropy-4  Dist-2  Avg.len  fluency  h-rel  p-consist
Human            -      -       -       10.725     36.688  11.507   2.901    2.645  0.598
Seq2seq          1.769  6.926   1.028   6.789      6.356   8.710    -        -      -
TransferTransfo  2.054  7.672   1.183   8.429      17.917  7.824    2.748    2.348  0.542
MI-GPT           3.151  8.112   1.264   8.054      13.264  9.026    2.809    2.150  0.628
TransferGPT2     2.273  7.872   1.194   8.263      16.444  8.036    2.785    2.385  0.548
GPT2-avg         3.211  8.149   1.291   7.904      13.612  8.932    2.838    2.149  0.648
GPT2-max         3.344  8.156   1.206   8.175      14.104  8.750    -        -      -
GPT2-min         3.774  8.661   1.388   8.099      14.925  9.209    -        -      -
GPT2-sw          3.949  8.881   1.407   8.228      15.294  9.068    2.814    2.355  0.595
GPT2-dw          3.714  8.694   1.385   8.096      14.647  9.095    -        -      -
GPT2-linear      4.147  8.988   1.408   8.279      15.237  9.011    2.777    2.332  0.602
GPT2-att         3.659  8.449   1.249   8.028      14.592  8.723    -        -      -

Table 1: Dialogue generation performance comparison of different models on the test set of PersonaChat. Values for BLEU, METEOR, and Dist-2 are in percentage. Human evaluation is only conducted on representative models.

3.1 Results

Results of different models on both automatic metrics and human evaluations are shown in Table 1. We first analyze the results on automatic metrics. It can be observed that GPT2 is more powerful than OpenAI GPT under the same architecture. Multi-input (MI) models that use the encoder-decoder framework generally outperform single-input (SI) models (TransferTransfo, TransferGPT2), which simply concatenate all inputs. Although SI models show higher diversity, their generated texts are generally shorter. All attention fusion methods of our model improve over the baseline GPT2-avg. Among them, the weighting methods have higher scores than the other two kinds of fusion methods on most metrics. Compared with static methods, weighting methods are more flexible in combining proper proportions of each source, so it is no surprise that they outperform static methods. Meanwhile, although the attention-based method also allows for non-static attention fusion, it essentially poses dynamic weights on the history state, so information from the persona and the reply is not directly used in the final fused representation, which results in its failure. It is also interesting to find that GPT2-dw shows no improvement compared to GPT2-sw, even though it extends the latter with different weights for each dimension.

Now we discuss the human evaluation results. Here, we only conduct human evaluations on baselines and proposed models with the best automatic evaluation results (i.e. the weighting methods). Fluency scores of the generated texts are very close to each other, even compared to the gold replies, which presumably benefits from the pretrained models. However, the h-rel scores (the relevance between dialog history and current responses) of all models are significantly lower than those of a human. Note that compared with SI models, MI models using the average fusion (MI-GPT, GPT2-avg) show lower h-rel scores, though their persona consistency increases considerably. This is also discussed in Golovanov et al. (2019), and the reason is that an SI model is similar to a language model which stays tight with the history, while MI models take the persona as a separate input, which makes it easier to reuse personalized words. However, our models with the weighting fusion methods not only improve the persona consistency compared to SI models, but also maintain comparable best history relevance. A case study of generated replies is given in Appendix B.

3.2 Influence of Attention Fusion

In this section, we further investigate how attention fusion affects the generation results, especially why using the average fusion decreases the performance on the relevance between dialog history and generated responses while the weighting fusion methods can survive.

We group the 200 testing samples for human evaluation by their lengths of history, and then compare the average h-rel scores of different methods within each group. Results are shown in Table 2. We first compare the weighting fusion methods with the average fusion baseline. As can be seen, all methods perform comparably when the dialogue history is short. With longer dialog history, models with weighting fusion methods perform much better than GPT2-avg. The reason is that when the dialogue history gets longer, the effect of each history token on the current reply in bi-attention is averaged out by the dialogue history length, making it harder for the average fusion method to capture key information from any history token when generating the response. Next, we compare GPT2 with weighting fusion methods to TransferGPT2 (the SI model with GPT2); the results indicate that they also outperform SI models when the dialogue history is long. Finally, we can see that SI models beat the MI baselines with the average fusion under all conditions, proving the ineffectiveness of the simple average between different information sources.

Figure 2 further illustrates the estimated optimal weights of each attention output in every decoder layer of GPT2-sw. We observe that the attention weights of different input sources are not equal and change over decoder layers, validating that the use of average fusion is over-simplified. The weights of the diverse sources tend to be equivalent in higher layers while they differ significantly in lower layers, since the history and persona information are already encoded and highly abstractive.

                                 History  Win     Tie     Lose
GPT2-weight vs. GPT2-avg         L        53.2%   28.2%   18.6%
                                 M        37.0%   31.1%   31.9%
                                 S        29.3%   45.2%   25.5%
GPT2-weight vs. TransferGPT2     L        39.7%   35.5%   24.8%
                                 M        28.9%   37.1%   34.0%
                                 S        24.1%   43.7%   32.2%
MI baselines vs. SI baselines    L        17.7%   30.1%   52.2%
                                 M        22.2%   28.9%   48.9%
                                 S        18.9%   42.8%   38.3%

Table 2: Percentage of generated replies by the upper model being better than, equal to, or worse than the bottom one on the h-rel metric. Samples are grouped by dialog history length (long (L): > 9 utterances; short (S): ≤ 3 utterances; medium (M): the rest). GPT2-weight: GPT2-sw and GPT2-linear; MI baselines: MI-GPT and GPT2-avg; SI baselines: TransferTransfo and TransferGPT2.

Figure 2: Visualization of normalized scalar attention weights on the 3 different input sources (history, persona, reply) for each layer (1 to 12) in the GPT2-sw decoder.

4 Conclusion

To handle dialogue generation with multiple input sources, we adapt the pretrained language model GPT2 into an encoder-decoder architecture with multiple independent attentions for the different input sources in the decoder. We then investigate several attention fusion methods to obtain a preferable representation for dialogue generation. Experiments illustrate that the weighting methods improve both automatic metrics and human-annotated dialog history relevance scores over baselines using average fusion, while still maintaining persona consistency scores that outperform single-input models. Such an architecture can be extended to other multi-input dialogue generation tasks with different numbers of information sources.

Acknowledgements

This work was supported by Australian Research Council Projects under grants FL-170100117, DP-180103424, and IC-190100031.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Paweł Budzianowski and Ivan Vulić. 2019. Hello, it's GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. arXiv preprint arXiv:1907.05774.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. 2020. The second conversational intelligence challenge (ConvAI2). In The NeurIPS'18 Competition, pages 187–208. Springer.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations.

Sergey Edunov, Alexei Baevski, and Michael Auli. 2019. Pre-trained language model representations for language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4052–4059.

Sergey Golovanov, Rauf Kurbanov, Sergey Nikolenko, Kyryl Truskovskyi, Alexander Tselousov, and Thomas Wolf. 2019. Large-scale transfer learning for natural language generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6053–6058.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Alon Lavie and Abhaya Agarwal. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 228–231.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994–1003.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4487–4496.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55–75.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.

Yu Zhang and Qiang Yang. 2017. A survey on multi-task learning. arXiv preprint arXiv:1707.08114.

Yinhe Zheng, Rongsheng Zhang, Xiaoxi Mao, and Minlie Huang. 2019. A pre-training based personalized dialogue generation model with persona-sparse data. arXiv preprint arXiv:1911.04700.

A Experiment Details

We use the official code for the implementation of TransferTransfo (Wolf et al., 2019) and MI-GPT (Golovanov et al., 2019), following all default settings to fine-tune the models. To implement our TransferGPT2, GPT2-avg, and all refined attention fusion models, we utilize the HuggingFace Transformers library² with the small-size GPT2 model, which has 12 layers and 768 dimensions in the hidden state. Note that although both our encoder and decoder are initialized from the GPT2 model, their parameters are not shared.

² https://github.com/huggingface/transformers

Similarly, the 3 different attention modules in each layer of the decoder (1 self-attention, 2 bi-attentions) are initialized from the attention module of the corresponding layer in the original GPT2 model, but their parameters are not shared among them either. The parameters of the additional attention fusion module are initialized by: 1) uniform initialization for the source-weighting methods, and 2) random initialization with a normal distribution for the linear and attention-based methods. The linear prediction layer shares its weight with the embedding layer of the decoder.

During fine-tuning, we use the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 5e-4, a warmup proportion of 0.002, and then a linear decay. The learning rate for the additional attention fusion module is 5× the current learning rate of the other parts. We train for 5 epochs using mini-batches of size 256, and only the latest 7 utterances in the dialog history are kept to avoid exceeding the maximum input length. All hyperparameters are determined by manual tuning with the automatic metrics BLEU, METEOR, and NIST as criteria. During inference, we use beam search with size 3 for all tested models. A length penalty (Wu et al., 2016) is added to ensure the diversity of generation. A single NVIDIA V100 GPU with CUDA10 is used to run the experiments.
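For readers reproducing this setup, a hedged sketch of the optimizer and schedule described above is given below. The parameter-name filter ("fusion") and the total step count are our own assumptions, not values taken from the released configuration.

```python
# Sketch of the fine-tuning schedule described above: Adam with lr 5e-4,
# warmup proportion 0.002, linear decay, and 5x lr for the attention fusion
# parameters. The "fusion" name match and total_steps are assumptions.
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, base_lr=5e-4, total_steps=10000, warmup_prop=0.002):
    fusion_params, other_params = [], []
    for name, p in model.named_parameters():
        (fusion_params if "fusion" in name else other_params).append(p)
    optimizer = Adam([
        {"params": other_params, "lr": base_lr},
        {"params": fusion_params, "lr": 5 * base_lr},  # larger lr for the fusion module
    ])
    warmup_steps = max(1, int(total_steps * warmup_prop))

    def lr_lambda(step):
        if step < warmup_steps:  # linear warmup
            return step / warmup_steps
        # linear decay to zero afterwards
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

In such a setup, scheduler.step() would be called once per optimization step alongside optimizer.step().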

B Case Study

We list dialogue generation results of TransferGPT2, GPT2-avg, GPT2-sw, and GPT2-linear for some cases from the PersonaChat dataset (Zhang et al., 2018) in Table 3 and Table 4, containing samples with varied dialog history lengths. h-rel and p-consist indicate the human evaluation scores for dialogue history relevance and personality consistency of the generated replies, respectively.

It can be found that our refined attention fusion models generally show similar personality consistency to the baseline GPT2-avg model, which uses the same architecture but a simple average to combine the different information sources. When the dialog history is long, TransferGPT2 tends to directly respond to the last history utterance with generic replies, while GPT2-avg tends to directly copy personal information as replies. GPT2-sw and GPT2-linear can properly respond to the last context as well as involve the personal profile. In addition, we find that when the history length is not so long (length 5 or 7), this difference is reduced. But when the dialog history is very short (less than 5), all encoder-decoder models tend to generate universal replies or simply reuse personalities, because the history information is too limited for them to combine it with the given personal profile, while the single-input TransferGPT2 is inclined to reuse personality descriptions because the whole input sequence is shorter and the persona information obtains more attention compared to an input with a long history.

Example (dialog history length = 13)
  Personality: i have one cat. / i am a kindergarten teacher. / i can barely pay my bills every month. / i share an apartment in seattle with two roommates. / i just graduated from college two years ago.
  Dialog history: ... / A: i also love to travel and take adventures. what are your hobbies? / B: i like to hang out with friends / A: nice. where do you like to go? / B: we like to do different things do you have pets? / A: i have a dog. his name is max
  Replies (h-rel, p-consist):
    Human:        that is cute i have a cat. (3, 1)
    TransferGPT2: that is a cute name. (3, 0)
    GPT2-avg:     i have two cats and a dog. (2, 0.5)
    GPT2-sw:      that is cool i have two cats. (3, 0.5)
    GPT2-linear:  that is a cute name i have a cat named maxy. (3, 1)

Example (dialog history length = 9)
  Personality: my favorite tv show is the office. / i do not drive because i live in new york. / i like jazz music. / i do not eat anything sweet.
  Dialog history: ... / A: i need something to mellow me out after working all day / B: that can be hard to find up in new york. things are fast and tight here / A: do you have many insects? i do not care for that. / B: not in the city, but yes upstate. / A: what do you do for a living? i am in the science field engineer
  Replies (h-rel, p-consist):
    Human:        i am a nutritionist, so no sweets for me. (3, 1)
    TransferGPT2: i am a stay at home mom. (3, 0)
    GPT2-avg:     i live in new york city. (2, 1)
    GPT2-sw:      i do not work. i just sit around and listen to jazz music. (3, 1)
    GPT2-linear:  that is cool. i work in a lab in new york city. (3, 1)

Table 3: Some cases of generated dialogue replies by TransferGPT2, GPT2-avg, GPT2-sw, and GPT2-linear.

Example (dialog history length = 7)
  Personality: i am a huge fantasy fan. / i want to write fantasy books myself. / i have two part time jobs. / i like to binge watch the lord of the rings
  Dialog history: ... / A: ah gosh. that is the worst! / B: i want to just stay home and write. / A: i would rather play softball than go to work / B: i like to role play and play fantasy games. have you been to com icon? / A: what is that? no i have never been
  Replies (h-rel, p-consist):
    Human:        have you ever heard of the lord of the rings? (2, 1)
    TransferGPT2: that is cool. i am a huge fantasy fan. (2, 1)
    GPT2-avg:     it is a fantasy game. (3, 1)
    GPT2-sw:      it is a fantasy world with lots of dragons. (3, 1)
    GPT2-linear:  i have not. i am more of a fantasy fan. (2, 1)

Example (dialog history length = 5)
  Personality: i do not like exercise or physical activity. / i am a homebody. / my favorite food is ice cream. / i have a husband who i love very much. / i love watching movies and tv.
  Dialog history: A: hello ! tell me about yourself? / B: i really like to stay at home at watch a movie with my darling hubby. you? / A: that sounds nice! i am a nurse who loves zoos and teddy bears. / B: aww do you collect teddy bears? my sweet sweet husband buys me them / A: i do! i bring them to my patients at the childrens hospital.
  Replies (h-rel, p-consist):
    Human:        so nice! do they also like ice cream? i love it! (3, 1)
    TransferGPT2: i do not have any kids, but i do have a husband. (1, 1)
    GPT2-avg:     that's so sweet! i love to eat ice cream and ice cream. (3, 1)
    GPT2-sw:      that sounds like a lot of fun. i am a homebody myself. (3, 1)
    GPT2-linear:  that is nice. i am a homebody myself. (3, 1)

Example (dialog history length = 3)
  Personality: i am going to auburn university next year as a freshman. / i love shopping fashion and clothes. / i want to be a doctor when i grow up. / i am 17 years old and a senior in high school.
  Dialog history: A: hi i am talking to text / B: hello , interesting introduction you have there / A: colors i can not see them well , my sis helps me out
  Replies (h-rel, p-consist):
    Human:        that is unfortunate. what is something that you enjoy doing? (3, 0)
    TransferGPT2: i am a junior in high school, i want to be a doctor (3, 0)
    GPT2-avg:     what do you do for a living? (3, 0)
    GPT2-sw:      i am sorry to hear that. (1, 1)
    GPT2-linear:  what do you do for a living? (3, 0)

Table 4: Some cases of generated dialogue replies by TransferGPT2, GPT2-avg, GPT2-sw, and GPT2-linear.
