
Pretrained Language Models for Dialogue Generation with Multiple Input Sources

Yu Cao 1*, Wei Bi 2, Meng Fang 3, Dacheng Tao 1
1 UBTECH Sydney AI Center, School of Computer Science, Faculty of Engineering, The University of Sydney, Australia
2 Tencent AI Lab, China, 3 Tencent Robotics X, China
[email protected], [email protected], [email protected], [email protected]

Abstract

Large-scale pretrained language models have achieved outstanding performance on natural language understanding tasks. However, it is still under-explored how to apply them to dialogue generation tasks, especially those with responses conditioned on multiple sources. Previous work simply concatenates all input sources or averages information from different input sources. In this work, we study dialogue models with multiple input sources adapted from the pretrained language model GPT2. We explore various methods to fuse the multiple separate attention outputs corresponding to different sources. Our experimental results show that proper fusion methods deliver higher relevance with dialogue history than simple fusion baselines.

1 Introduction

Large-scale pretrained language models (Devlin et al., 2019; Radford et al., 2018, 2019) have achieved outstanding performance on various natural language understanding tasks (Young et al., 2018; Liu et al., 2019). Researchers have since utilized them in dialogue generation tasks (Budzianowski and Vulić, 2019; Edunov et al., 2019; Zhang et al., 2019). Many of these approaches simply concatenate the input dialogue history and the output response during finetuning, since the pretrained language model only accepts a single sequence as input. However, dialogue generation tasks may involve multiple input sources simultaneously. For example, in personalized or knowledge-grounded dialogue generation (Li et al., 2016; Zhang et al., 2018; Dinan et al., 2018), a response is generated conditioned on both the dialogue history and an auxiliary user profile or knowledge article. Beyond simple concatenation of all input sources, an important question arises: how can we better adapt a single-input pretrained language model to a multi-input dialogue generation task?

Some previous work forms an encoder-decoder architecture with both encoder and decoder duplicated from a pretrained language model (Golovanov et al., 2019; Zheng et al., 2019). Recently, BART (Lewis et al., 2019) even obtains a complete pretrained model under this architecture directly. Taking personalized dialogue generation (Zhang et al., 2018) as an example, we can treat persona information, dialogue history, and previously generated tokens as three different input sources. The former two are encoded first and then combined with the last one in the decoder. In Golovanov et al. (2019), the multi-head attention layer in the decoder is copied three times, one copy per input source, and mean pooling is used to average the results from the multiple attentions. This encoder-decoder adaptation is shown to outperform simple concatenation. However, when the dialogue history gets longer, this model tends to use less information from each dialogue history utterance to predict the next token. Zheng et al. (2019) add an extra weight predictor to combine the multiple attention outputs, but they do not perform experiments using publicly released pretrained models, nor on public datasets, making their results not directly comparable to other work.

In this work, we build our dialogue model on the encoder-decoder architecture adapted from the pretrained language model GPT2 (Radford et al., 2019). Our main contribution is to empirically study attention fusion methods for multiple information sources in each decoder layer. Three kinds of methods are explored in total. Our experimental results show improvements on both automatic and human evaluations by using proper attention fusion methods, compared to baselines using concatenation or mean pooling.

* This work was done during Yu Cao's internship in Tencent AI Lab, Shenzhen.

2 Model

2.1 The Encoder-Decoder Architecture

Following the former work (Golovanov et al., 2019), we use the personalized dialogue generation task on PersonaChat (Zhang et al., 2018) as an example in our study. The pretrained language model GPT2 and its parameters are duplicated to form an encoder-decoder architecture, shown in Figure 1(a). We use GPT2 here due to its larger pre-training corpus compared to other models and its strong performance in other generation tasks.

Figure 1: Architecture of our proposed model. (a) The encoder-decoder architecture: personality and dialog history are fed into the GPT2 encoder, the current reply into the GPT2 decoder, and a shared linear layer yields α Persona LM loss + β History LM loss + γ Prediction LM loss = Full Model Loss. (b) Details of each transformer block in the decoder (repeated ×N): MH self-attention over the current state, two MH bi-attentions over the encoded personality and encoded dialog history, attention fusion, layer normalization, and an MLP with residual connections.

We have three separate inputs: personal profile, dialogue history, and current reply (or the previously generated response during the inference stage). Embeddings of the former two, which contain embeddings of tokens, positions, and token types, are successively put into the encoder, a GPT2 model with no attention mask to fit the encoding procedure. The encoded representations, together with the embeddings of the current response tokens, are then used as the input of a modified GPT2 decoder. Each decoder block attends the current state to the three sources using different attentions, then fuses the resulting information as input for the next layer.

Inspired by multi-task learning (Zhang and Yang, 2017), we further separate the original language modeling loss into three parts corresponding to the three input sources. By applying the same linear prediction layer to the output of both encoder and decoder, three cross-entropy losses between the predicted logits and the corresponding ground-truth sequences are weighted by hyperparameters:

L = α L_persona + β L_history + γ L_pred    (1)

The whole model is optimized with the Adam optimizer (Kingma and Ba, 2014).

2.2 Block Details in Decoder

Recall that we have three input sources in the decoder, and thus some modifications are needed if the decoder structure is inherited from GPT2. Details of each modified decoder block are shown in Figure 1(b), in which the most apparent change is the two additional multi-head (MH) bidirectional attentions and the attention fusion module that fuses the various attention outputs. The other parts remain the same as GPT2. In the following, we first describe the MH bi-attention; attention fusion is discussed in the next section.

The MH self-attention in Transformer (Vaswani et al., 2017) handles a single input only. To make it accept two input sources, we regard the current state H^c ∈ R^(L_c × d) from the previous layer (or the embedding of the reply in the first layer) as the query, and the encoded state of auxiliary information H^a ∈ R^(L_a × d) as the key and value in the attention. Here L_c and L_a are the corresponding lengths of these inputs, and H^a can be the encoded personality H^p or dialog history H^h. The output of each single head in MH bi-attention is obtained via

A = softmax((H^c W^Q)(H^a W^K)^T / √d) (H^a W^V),    (2)

where W^Q, W^K, W^V are learnable matrices. In our model, different attentions own separate parameters instead of sharing them. This differs from the previous work (Golovanov et al., 2019), which reuses the self-attention for bi-attention. Besides, the original GPT2 is a single-directional model using a triangular matrix as the attention mask. Since the auxiliary information H^a is visible to the current reply at all time steps, no mask is used in MH bi-attention.

In total, three attention outputs A^c, A^p, and A^h are obtained by attending the current state to itself, the personality, and the history respectively, all with the same dimension R^(L_c × d). They need to be fused into one matrix A^f ∈ R^(L_c × d) so as to proceed to subsequent decoding layers.
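To make Eq. (2) and the no-mask design concrete, the following is a minimal PyTorch sketch of the three attentions computed inside one decoder block. It is our own illustration under assumed class and variable names (MHBiAttention, H_c, etc.), not the authors' released implementation.

```python
# Minimal PyTorch sketch of the MH bi-attention of Eq. (2) and the three
# attention outputs A^c, A^p, A^h of one decoder block. Names and shapes are
# illustrative assumptions, not the paper's released code.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MHBiAttention(nn.Module):
    """Multi-head attention whose query comes from the current decoder state
    and whose key/value come from an encoded auxiliary source; no mask is
    applied unless the source is the reply itself (causal=True)."""
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q, self.w_k = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.w_v, self.w_o = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)

    def _split(self, x):
        # (B, L, d_model) -> (B, n_heads, L, d_head)
        B, L, _ = x.shape
        return x.view(B, L, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, h_c, h_a, causal: bool = False):
        q = self._split(self.w_q(h_c))
        k = self._split(self.w_k(h_a))
        v = self._split(self.w_v(h_a))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # Eq. (2), per head
        if causal:  # triangular mask, only when the reply attends to itself
            mask = torch.tril(torch.ones(q.size(-2), k.size(-2), device=scores.device)).bool()
            scores = scores.masked_fill(~mask, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v
        out = out.transpose(1, 2).reshape(h_c.size(0), -1, self.n_heads * self.d_head)
        return self.w_o(out)

# Three separately parameterized attentions inside one decoder block.
self_attn, p_attn, h_attn = MHBiAttention(), MHBiAttention(), MHBiAttention()

H_c = torch.randn(2, 20, 768)   # current reply state (query)
H_p = torch.randn(2, 30, 768)   # encoded personality
H_h = torch.randn(2, 50, 768)   # encoded dialog history
A_c = self_attn(H_c, H_c, causal=True)   # reply attends to itself
A_p = p_attn(H_c, H_p)                   # reply attends to personality (no mask)
A_h = h_attn(H_c, H_h)                   # reply attends to history (no mask)
# A_c, A_p, A_h all have shape (2, 20, 768) and are passed to attention fusion.
```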

2.3 Attention Fusion

In this section, we discuss various methods to fuse the multiple attention outputs obtained above. The simplest approach is to average the three sources in all dimensions (Golovanov et al., 2019), which treats all sources equally. However, in different dialogues, we may need to concentrate more on the dialogue history or the persona profile in order to generate proper responses. We therefore introduce the following three kinds of methods to allow for more flexible information fusion from all input sources.

• Static methods fuse different information using an identical fusion function with no trainable parameters. Besides the average pooling (avg), which is regarded as a very simple fusion baseline, we also include the Maximum (max) and Minimum (min) operations over every dimension among all sources.

• Weighting methods try to estimate the globally optimal proportion of each source in a given domain by introducing extra learnable weights, which are then fixed at inference. Such methods can be:
(i) source-level scalar weights (sw): three trainable scalars w^c, w^p, w^h per source in each layer, with A^f = (w^c A^c + w^p A^p + w^h A^h) / (w^c + w^p + w^h).
(ii) source-dimension level weights (dw): the weights are learnable vectors w^c, w^p, w^h ∈ R^d. For each dimension j of A^f and the corresponding weight entries, the weighted combination is A^f_j = (w^c_j A^c_j + w^p_j A^p_j + w^h_j A^h_j) / (w^c_j + w^p_j + w^h_j).
(iii) linear method (linear): a linear network transforms the concatenated attention [A^c; A^p; A^h] into A^f. Different from the above two, each dimension in the new feature space contains information from all dimensions of all sources, realizing a better interaction.

• The attention-based method fuses the information with a trainable modified transformer attention (att), so the fusion changes dynamically with the multiple inputs:

Z = softmax(sign(A^c (A^p)^T) ⊙ √|A^c (A^p)^T| / √d) A^h,    (3)

where sign(·) equals 1 for positive elements and -1 for negative ones, |·| denotes the absolute value, ⊙ is the element-wise product, and the square root keeps the value scale unchanged. This method uses matrix multiplication to enable full interaction among all state values, obtaining states conditioned on all information sources dynamically. The history information is selected as the "value" term so that more dialog history is involved in the obtained state.

3 Experiment

We employ the PersonaChat dataset (Zhang et al., 2018; Dinan et al., 2020) in our experiments, which has 164,356 utterances in 10,981 dialogues and 1,155 personas. Each sample contains a dialog history with up to 15 utterances, a gold reply, and a persona description with no more than 5 sentences.

Four kinds of dialogue models using pretrained language models as the initialization are compared:
(i) TransferTransfo (Wolf et al., 2019), a single-input OpenAI GPT using token type embeddings to distinguish the different parts of a single concatenated input (persona profile, dialog history, and reply, successively). We also replace the original GPT in this method with GPT2, denoted as TransferGPT2.
(ii) MI-GPT (Golovanov et al., 2019), which uses the OpenAI GPT in both encoder and decoder with average pooling as the attention fusion method.
(iii) Our architecture using GPT2 as the base model and average as the fusion method (GPT2-avg), a very simple baseline inherited from MI-GPT.
(iv) Our model with each of the attention fusion methods discussed in Sec 2.3, denoted as GPT2-X, where X is the corresponding fusion method.

All GPT2 models used here are the small size (12 layers, hidden size 768). Besides, a Seq2seq model with attention (Bahdanau et al., 2014) using a 6-layer Transformer as the encoder and decoder is also included as an end-to-end single-input baseline.¹

The following automatic metrics are considered in our evaluation: BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and NIST-4, which indicate the n-gram-level similarity between the references and the generated responses. Moreover, Entropy-4, corpus-level Distinct-2, and the average length of replies are used to reflect the diversity of the generated text. In addition, human evaluation is conducted on 200 dialog pairs in terms of fluency (range: 1~3), relevance with dialogue history (h-rel, range: 1~3), and consistency with personality (p-consist, {0, 1}). More experiment configurations can be found in Appendix A.

¹ Source code is available at: https://github.com/caoyu-noob/Multi-GPT2
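For concreteness, the sketch below implements the fusion variants of Sec 2.3 over the three attention outputs. The module and parameter names are our own assumptions rather than the released code, and the "att" branch reproduces only the parameter-free computation of Eq. (3), without any additional trainable projections the full method may use.

```python
# Illustrative PyTorch sketch of the fusion variants in Sec 2.3, applied to the
# three attention outputs A_c, A_p, A_h of shape (batch, L_c, d).
import math
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, d: int = 768, method: str = "sw"):
        super().__init__()
        self.method = method
        if method == "sw":            # one scalar weight per source
            self.w = nn.Parameter(torch.ones(3))
        elif method == "dw":          # one d-dimensional weight vector per source
            self.w = nn.Parameter(torch.ones(3, d))
        elif method == "linear":      # maps [A_c; A_p; A_h] to A_f
            self.proj = nn.Linear(3 * d, d)

    def forward(self, A_c, A_p, A_h):
        stacked = torch.stack([A_c, A_p, A_h], dim=0)          # (3, B, L_c, d)
        if self.method == "avg":
            return stacked.mean(dim=0)
        if self.method in ("max", "min"):
            reduce = torch.amax if self.method == "max" else torch.amin
            return reduce(stacked, dim=0)
        if self.method == "sw":                                 # source-level scalars
            return (self.w.view(3, 1, 1, 1) * stacked).sum(dim=0) / self.w.sum()
        if self.method == "dw":                                 # per-dimension weights
            return (self.w.view(3, 1, 1, -1) * stacked).sum(dim=0) / self.w.sum(dim=0)
        if self.method == "linear":                             # concatenate then project
            return self.proj(torch.cat([A_c, A_p, A_h], dim=-1))
        if self.method == "att":                                # Eq. (3), history as "value"
            scores = A_c @ A_p.transpose(-2, -1)                # (B, L_c, L_c)
            scores = torch.sign(scores) * scores.abs().sqrt() / math.sqrt(A_c.size(-1))
            return torch.softmax(scores, dim=-1) @ A_h
        raise ValueError(f"unknown fusion method: {self.method}")

A_c, A_p, A_h = (torch.randn(2, 20, 768) for _ in range(3))
A_f = AttentionFusion(method="linear")(A_c, A_p, A_h)           # (2, 20, 768)
```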

Model            BLEU   METEOR  NIST-4  Entropy-4  Dist-2  Avg.len  fluency  h-rel  p-consist
Human            -      -       -       10.725     36.688  11.507   2.901    2.645  0.598
Seq2seq          1.769  6.926   1.028   6.789      6.356   8.710    -        -      -
TransferTransfo  2.054  7.672   1.183   8.429      17.917  7.824    2.748    2.348  0.542
MI-GPT           3.151  8.112   1.264   8.054      13.264  9.026    2.809    2.150  0.628
TransferGPT2     2.273  7.872   1.194   8.263      16.444  8.036    2.785    2.385  0.548
GPT2-avg         3.211  8.149   1.291   7.904      13.612  8.932    2.838    2.149  0.648
GPT2-max         3.344  8.156   1.206   8.175      14.104  8.750    -        -      -
GPT2-min         3.774  8.661   1.388   8.099      14.925  9.209    -        -      -
GPT2-sw          3.949  8.881   1.407   8.228      15.294  9.068    2.814    2.355  0.595
GPT2-dw          3.714  8.694   1.385   8.096      14.647  9.095    -        -      -
GPT2-linear      4.147  8.988   1.408   8.279      15.237  9.011    2.777    2.332  0.602
GPT2-att         3.659  8.449   1.249   8.028      14.592  8.723    -        -      -

Table 1: Dialogue generation performance comparison of different models on the test set of PersonaChat. Values for BLEU, METEOR, and Dist-2 are in percentage. Human evaluation is only conducted on representative models.

3.1 Results

Results of different models on both automatic metrics and human evaluations are shown in Table 1. We first analyze the results on automatic metrics. It can be observed that GPT2 is more powerful than OpenAI GPT under the same architecture. Multi-input (MI) models that use the encoder-decoder framework generally outperform single-input (SI) models (TransferTransfo, TransferGPT2), which simply concatenate all inputs. Although SI models show higher diversity, their generated texts are generally shorter. All attention fusion methods of our model improve over the baseline GPT2-avg. Among them, the weighting methods have higher scores than the other two kinds of fusion methods on most metrics. Compared with static methods, weighting methods are more flexible in combining proper proportions of each source, so it is no surprise that they outperform static methods. Meanwhile, although the attention-based method also allows for non-static attention fusion, it essentially poses dynamic weights on the history state, so information from the persona and the reply is not directly used in the final fused representation, which results in its failure. It is also interesting to find that GPT2-dw shows no improvement compared to GPT2-sw, even though it extends the latter with different weights for each dimension.

Now we discuss the human evaluation results. Here, we only conduct human evaluations on baselines and proposed models with the best automatic evaluation results (i.e. the weighting methods). Fluency scores of the generated texts are very close to each other, even compared to the gold replies, which presumably benefits from the pretrained models. However, the h-rel scores (the relevance between dialog history and current responses) of all models are significantly lower than those of a human. Note that compared with SI models, MI models using the average fusion (MI-GPT, GPT2-avg) show lower h-rel scores, though their persona consistency increases considerably. This is also discussed in Golovanov et al. (2019), and the reason is that an SI model is similar to a language model which stays tight with the history, while MI models take the persona as a separate input, which makes it easier to reuse personalized words. However, our models with the weighting fusion methods not only improve the persona consistency compared to SI models, but also maintain comparable best history relevance. A case study of generated replies is given in Appendix B.

3.2 Influence of Attention Fusion

In this section, we further investigate how attention fusion affects the generation results, especially why using the average fusion decreases the performance on the relevance between dialog history and generated responses while the weighting fusion methods can survive.

We group the 200 testing samples for human evaluation by their lengths of history, and then compare the average h-rel scores of different methods within each group. Results are shown in Table 2. We first compare the weighting fusion methods with the average fusion baseline. As can be seen, all methods perform comparably when the dialogue history is short. With longer dialog history, models with weighting fusion methods perform much better than GPT2-avg. The reason is that when the dialogue history gets longer, the effect of each history token on the current reply in bi-attention is averaged out by the dialogue history length, making it harder for the average fusion method to capture key information from any history token when generating the response. Next, we compare GPT2 with weighting fusion methods to TransferGPT2 (the SI model with GPT2); the results indicate that they also outperform SI models when the dialogue history is long. Finally, we can see that SI models beat the MI baselines with the average fusion under all conditions, proving the ineffectiveness of the simple average between different information sources.

Figure 2 further illustrates the estimated optimal weights of each attention output in every decoder layer of GPT2-sw. We observe that the attention weights of different input sources are not equal and change over decoder layers, validating that the use of average fusion is over-simplified. The weights of the diverse sources tend to be equivalent in higher layers while they differ significantly in lower layers, since the history and persona information are already encoded and highly abstractive.

                                 History  Win     Tie     Lose
GPT2-weight vs. GPT2-avg         L        53.2%   28.2%   18.6%
                                 M        37.0%   31.1%   31.9%
                                 S        29.3%   45.2%   25.5%
GPT2-weight vs. TransferGPT2     L        39.7%   35.5%   24.8%
                                 M        28.9%   37.1%   34.0%
                                 S        24.1%   43.7%   32.2%
MI baselines vs. SI baselines    L        17.7%   30.1%   52.2%
                                 M        22.2%   28.9%   48.9%
                                 S        18.9%   42.8%   38.3%

Table 2: Percentage of generated replies by the upper model being better than, equal to, or worse than the bottom one on the h-rel metric. Samples are grouped by dialog history length (long (L): > 9 utterances; short (S): ≤ 3 utterances; medium (M): the rest). GPT2-weight: GPT2-sw and GPT2-linear; MI baselines: MI-GPT and GPT2-avg; SI baselines: TransferTransfo and TransferGPT2.

Figure 2: Visualization of normalized scalar attention weights on the 3 different input sources (history, persona, reply) for each layer (1 to 12) in the GPT2-sw decoder.

4 Conclusion

To handle dialogue generation with multiple input sources, we adapt the pretrained language model GPT2 into an encoder-decoder architecture with multiple independent attentions for the different input sources in the decoder. We then investigate several attention fusion methods to obtain a preferable representation for dialogue generation. Experiments illustrate that the weighting methods improve both automatic metrics and human-annotated dialog history relevance scores over baselines using average fusion, while still maintaining persona consistency scores that outperform single-input models. Such an architecture can be extended to other multi-input dialogue generation tasks with different numbers of information sources.

Acknowledgements

This work was supported by Australian Research Council Projects under grants FL-170100117, DP-180103424, and IC-190100031.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Paweł Budzianowski and Ivan Vulić. 2019. Hello, it's GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. arXiv preprint arXiv:1907.05774.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. 2020. The second conversational intelligence challenge (ConvAI2). In The NeurIPS'18 Competition, pages 187–208. Springer.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations.

Sergey Edunov, Alexei Baevski, and Michael Auli. 2019. Pre-trained language model representations for language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4052–4059.

Sergey Golovanov, Rauf Kurbanov, Sergey Nikolenko, Kyryl Truskovskyi, Alexander Tselousov, and Thomas Wolf. 2019. Large-scale transfer learning for natural language generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6053–6058.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Alon Lavie and Abhaya Agarwal. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 228–231.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994–1003.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4487–4496.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55–75.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.

Yu Zhang and Qiang Yang. 2017. A survey on multi-task learning. arXiv preprint arXiv:1707.08114.

Yinhe Zheng, Rongsheng Zhang, Xiaoxi Mao, and Minlie Huang. 2019. A pre-training based personalized dialogue generation model with persona-sparse data. arXiv preprint arXiv:1911.04700.

A Experiment Details

We use the official code for the implementation of TransferTransfo (Wolf et al., 2019) and MI-GPT (Golovanov et al., 2019), following all default settings to fine-tune the models. To implement our TransferGPT2, GPT2-avg, and all refined attention fusion models, we utilize the HuggingFace Transformers library² with the small-size GPT2 model, which has 12 layers and 768 dimensions in the hidden state. Note that although both our encoder and decoder are initialized from the GPT2 model, their parameters are not shared.

² https://github.com/huggingface/transformers

Similarly, the 3 different attention modules in each layer of the decoder (1 self-attention, 2 bi-attentions) are initialized from the attention module of the corresponding layer in the original GPT2 model, but their parameters are not shared among them either. The parameters of the additional attention fusion module are initialized by: 1) uniform initialization for the source-weighting methods, and 2) random initialization with a normal distribution for the linear and attention-based methods. The linear prediction layer shares its weight with the embedding layer of the decoder.

During fine-tuning, we use the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 5e-4, a warmup proportion of 0.002, and then a linear decay. The learning rate for the additional attention fusion module is 5× the current learning rate of the other parts. We train for 5 epochs using mini-batches of size 256, and only the latest 7 utterances in the dialog history are kept to avoid exceeding the maximum input length. All hyperparameters are determined by manual tuning with the automatic metrics BLEU, METEOR, and NIST as criteria. During inference, we use beam search with size 3 for all tested models. A length penalty (Wu et al., 2016) is added to ensure the diversity of generation. A single NVIDIA V100 GPU with CUDA10 is used to run the experiments.
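For readers reproducing this setup, a hedged sketch of the optimizer and schedule described above is given below. The parameter-name filter ("fusion") and the total step count are our own assumptions, not values taken from the released configuration.

```python
# Sketch of the fine-tuning schedule described above: Adam with lr 5e-4,
# warmup proportion 0.002, linear decay, and 5x lr for the attention fusion
# parameters. The "fusion" name match and total_steps are assumptions.
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, base_lr=5e-4, total_steps=10000, warmup_prop=0.002):
    fusion_params, other_params = [], []
    for name, p in model.named_parameters():
        (fusion_params if "fusion" in name else other_params).append(p)
    optimizer = Adam([
        {"params": other_params, "lr": base_lr},
        {"params": fusion_params, "lr": 5 * base_lr},  # larger lr for the fusion module
    ])
    warmup_steps = max(1, int(total_steps * warmup_prop))

    def lr_lambda(step):
        if step < warmup_steps:  # linear warmup
            return step / warmup_steps
        # linear decay to zero afterwards
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

In such a setup, scheduler.step() would be called once per optimization step alongside optimizer.step().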

B Case Study

We list dialogue generation results of TransferGPT2, GPT2-avg, GPT2-sw, and GPT2-linear for some cases from the PersonaChat dataset (Zhang et al., 2018) in Table 3 and Table 4, containing samples with varied dialog history lengths. h-rel and p-consist indicate the human evaluation scores for dialogue history relevance and personality consistency of the generated replies, respectively.

It can be found that our refined attention fusion models generally show similar personality consistency to the baseline GPT2-avg model, which uses the same architecture but a simple average to combine the different information sources. When the dialog history is long, TransferGPT2 tends to directly respond to the last history utterance with generic replies, while GPT2-avg tends to directly copy personal information as replies. GPT2-sw and GPT2-linear can properly respond to the last context as well as involve the personal profile. In addition, we find that when the history length is not so long (length 5 or 7), this difference is reduced. But when the dialog history is very short (less than 5), all encoder-decoder models tend to generate universal replies or simply reuse personalities, because the history information is too limited for them to combine it with the given personal profile, while the single-input TransferGPT2 is inclined to reuse personality descriptions because the whole input sequence is shorter and the persona information obtains more attention compared to an input with a long history.

Example (dialog history length = 13)
  Personality: i have one cat. / i am a kindergarten teacher. / i can barely pay my bills every month. / i share an apartment in seattle with two roommates. / i just graduated from college two years ago.
  Dialog history: ... / A: i also love to travel and take adventures. what are your hobbies? / B: i like to hang out with friends / A: nice. where do you like to go? / B: we like to do different things do you have pets? / A: i have a dog. his name is max
  Replies (h-rel, p-consist):
    Human:        that is cute i have a cat. (3, 1)
    TransferGPT2: that is a cute name. (3, 0)
    GPT2-avg:     i have two cats and a dog. (2, 0.5)
    GPT2-sw:      that is cool i have two cats. (3, 0.5)
    GPT2-linear:  that is a cute name i have a cat named maxy. (3, 1)

Example (dialog history length = 9)
  Personality: my favorite tv show is the office. / i do not drive because i live in new york. / i like jazz music. / i do not eat anything sweet.
  Dialog history: ... / A: i need something to mellow me out after working all day / B: that can be hard to find up in new york. things are fast and tight here / A: do you have many insects? i do not care for that. / B: not in the city, but yes upstate. / A: what do you do for a living? i am in the science field engineer
  Replies (h-rel, p-consist):
    Human:        i am a nutritionist, so no sweets for me. (3, 1)
    TransferGPT2: i am a stay at home mom. (3, 0)
    GPT2-avg:     i live in new york city. (2, 1)
    GPT2-sw:      i do not work. i just sit around and listen to jazz music. (3, 1)
    GPT2-linear:  that is cool. i work in a lab in new york city. (3, 1)

Table 3: Some cases of generated dialogue replies by TransferGPT2, GPT2-avg, GPT2-sw, and GPT2-linear.

Example (dialog history length = 7)
  Personality: i am a huge fantasy fan. / i want to write fantasy books myself. / i have two part time jobs. / i like to binge watch the lord of the rings
  Dialog history: ... / A: ah gosh. that is the worst! / B: i want to just stay home and write. / A: i would rather play softball than go to work / B: i like to role play and play fantasy games. have you been to com icon? / A: what is that? no i have never been
  Replies (h-rel, p-consist):
    Human:        have you ever heard of the lord of the rings? (2, 1)
    TransferGPT2: that is cool. i am a huge fantasy fan. (2, 1)
    GPT2-avg:     it is a fantasy game. (3, 1)
    GPT2-sw:      it is a fantasy world with lots of dragons. (3, 1)
    GPT2-linear:  i have not. i am more of a fantasy fan. (2, 1)

Example (dialog history length = 5)
  Personality: i do not like exercise or physical activity. / i am a homebody. / my favorite food is ice cream. / i have a husband who i love very much. / i love watching movies and tv.
  Dialog history: A: hello ! tell me about yourself? / B: i really like to stay at home at watch a movie with my darling hubby. you? / A: that sounds nice! i am a nurse who loves zoos and teddy bears. / B: aww do you collect teddy bears? my sweet sweet husband buys me them / A: i do! i bring them to my patients at the childrens hospital.
  Replies (h-rel, p-consist):
    Human:        so nice! do they also like ice cream? i love it! (3, 1)
    TransferGPT2: i do not have any kids, but i do have a husband. (1, 1)
    GPT2-avg:     that's so sweet! i love to eat ice cream and ice cream. (3, 1)
    GPT2-sw:      that sounds like a lot of fun. i am a homebody myself. (3, 1)
    GPT2-linear:  that is nice. i am a homebody myself. (3, 1)

Example (dialog history length = 3)
  Personality: i am going to auburn university next year as a freshman. / i love shopping fashion and clothes. / i want to be a doctor when i grow up. / i am 17 years old and a senior in high school.
  Dialog history: A: hi i am talking to text / B: hello , interesting introduction you have there / A: colors i can not see them well , my sis helps me out
  Replies (h-rel, p-consist):
    Human:        that is unfortunate. what is something that you enjoy doing? (3, 0)
    TransferGPT2: i am a junior in high school, i want to be a doctor (3, 0)
    GPT2-avg:     what do you do for a living? (3, 0)
    GPT2-sw:      i am sorry to hear that. (1, 1)
    GPT2-linear:  what do you do for a living? (3, 0)

Table 4: Some cases of generated dialogue replies by TransferGPT2, GPT2-avg, GPT2-sw, and GPT2-linear.
