Retrieval-Free Knowledge-Grounded Dialogue Response Generation with Adapters

Yan Xu*, Etsuko Ishii*, Samuel Cahyawijaya, Zihan Liu, Genta Indra Winata, Andrea Madotto, Dan Su, Pascale Fung Center for Artificial Intelligence Research (CAiRE) The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong {yxucb,eishii}@connect.ust.hk, [email protected]

Abstract I want to visit Tokyo and get a night view from Tokyo Tower. To diversify and enrich generated dialogue responses, knowledge-grounded dialogue has been investigated in recent Oh, isn't it better from Tokyo Skytree, the tallest tower? years. The existing methods tackle the knowledge ground- 1. Tokyo Tower is the second tallest ing challenge by retrieving the relevant sentences over a structure in Japan, behind Tokyo large corpus and augmenting the dialogues with explicit ex- Skytree. 2. Tokyo Tower is painted in white tra information. Despite their success, however, the exist- Existing Natural Language Knowledge and orange to comply with air ing works have drawbacks on the efficiency. This Works Corpus safety regulations. Generation paper proposes KnowExpert, an end-to-end framework 3. Tokyo Tower has transmission Retrieval antennae used for radio and to bypass the explicit retrieval process and inject knowl- television broadcasting. edge into the pre-trained language models with lightweight Knowledge Selction adapters and adapt to the knowledge-grounded dialogue task. To the best of our knowledge, this is the first attempt to Natural Language tackle this challenge without retrieval in this task under an Ours open-domain chit-chat scenario. The experimental results Generation show that KnowExpert performs comparably with some retrieval-based baselines while being time-efficient in infer- ence, demonstrating the potential of our proposed direction. Knowledge Expert Weighting

Introduction Figure 1: KnowExpert is a retrieval-free dialogue generation framework. Unlike the existing works, The open-domain conversational agents have been rapidly KnowExpert leverages knowledge stored in model progressing in recent years. Although such models are ca- parameters of knowledge experts by weighting them based pable of producing grammatically correct and natural re- on the topic distribution over dialogue history. sponses based on the conversation history, conversations with these systems still present an apparent gap compared to human-to-human conversation. One primary reason is that existing dialogue systems are lacking understanding of cer- able responses. In this scheme, knowledge retrieval and tain knowledge and thus cannot involve in deeper conversa- knowledge selection are considered to make up the knowl- tions with humans when they dive into a specific topic (Li edge conceptualization process. et al. 2020). To bridge this gap, many research works de- Despite having a major progression and a great success velop knowledge-grounded conversational agents (Dinan arXiv:2105.06232v2 [cs.CL] 15 Sep 2021 in achieving promising performances on the correspond- et al. 2019; Chen et al. 2020; Zhao et al. 2020). ing task, this retrieval-based approach has drawbacks on As it is shown in Figure 1, many recent works (Lian et al. its efficiency: First, knowledge retrieval in corpora needs 2019; Kim, Ahn, and Kim 2019; Roller et al. 2020; Chen a model to search over a large amount of data, which re- et al. 2020; Zhao et al. 2020) develop the solutions com- quires considerable memory resources to store the whole prising: (1) Knowledge Retrieval, for retrieving the related knowledge corpus, and additional processing time to retrieve knowledge sentences from a large corpus (e.g., ); knowledge and conduct further knowledge selection; Sec- 2) Knowledge Selection, for selecting the most relevant ond, adding knowledge as additional context to the language knowledge sentences for generation; and 3) Knowledge- generation model also cause significant computation over- augmented Generation, for augmenting the retrieved knowl- head, which slows down the language generation process. edge and conversation history to generate more knowledge- Efficiency plays an essential role in the practical use of di- *These authors contributed equally. alogue systems, and it is necessary to generate responses Copyright © 2022, Association for the Advancement of Artificial faster by requiring fewer resources to support more active Intelligence (www.aaai.org). All rights reserved. users. + Output Topic Topic + 1 Modeling Modeling Training + Feed-forward Up Projection

Knowledge ReLU 2 Experts GPT-2 Layer Training Feed-forward GPT-2 Layer Down Projection Topic Modeling Layer Norm Task 3 GPT-2 Layer Adaptation Input Adapter Usr Sys Usr Token type

Figure 2: High-level architecture of the model. Taking a dialogue history as an Figure 3: Illustration of the training pro- input, the adapters are inserted upon the GPT-2 layers, acting as the knowledge cedure, where the thick lined modules are experts, to enhance the response generation with the help from a topic model trained while the rest (dash lined) kept which assigns weights over the knowledge experts. frozen in each training step.

Recently, large pre-trained language models (LMs) have the relevance of the dialogue sample and the clusters. We been shown to have the capability to carry implicit knowl- thus fine-tune LM layers where frozen pre-trained adapters edge (Wang et al. 2020; Lauscher et al. 2020), which can be are inserted for task adaptation. Experimental results show further applied to downstream classification tasks (Shwartz that KnowExpertperforms comparably with some strong et al. 2020). Many existing works have proved that the retrieval-based baselines, while the inference process of “knowledge” could be embedded in the pre-training pro- KnowExpertis much more efficient since extra knowledge cess (Brown et al. 2020). The explorations on closed-book sentences are not required as one of the components of the QA task (Petroni et al. 2019; Roberts, Raffel, and Shazeer inputs. 2020; Wang, Liu, and Zhang 2021) with large pre-train LMs Our contributions are three-fold: (1) to the best of our also indicate the potential of leveraging the knowledge em- knowledge, we are the first to explore injecting prior knowl- bedded inside the LMs. For task-oriented dialogue systems, edge to generative models for knowledge-grounded dialogue Madotto et al. (2020) stores the knowledge bases (KBs) with generation task under open-domain chit-chat scenario; (2) different sizes directly into the model parameters by align- our model conduct knowledge conceptualization without an ing auto-extracted dialogue templates with the correspond- explicit knowledge retrieval process, which has constant in- ing KBs for each data sample. Our scenario is different from ference time regardless of the size of the knowledge corpus; those mentioned above. For both closed-book QA tasks and and (3) our model performs comparably with some strong task-oriented dialogue tasks, given a question or user query, baselines and shows the potential of purely generation meth- relevant knowledge choices are highly constrained by the ods in the task. inputs. However, open-domain chit-chat suffers much from the one-to-many issue between the inputs and possible out- Related Work puts. In other words, given the inputs on a specific topic, the choice of knowledge candidates is varying, which brings Knowledge-Grounded Dialogue new challenges to embed knowledge in this task. The knowledge-grounded dialogue task requires the models Inspired by these explorations, we propose to tackle the to identify relevant knowledge sentences from a large cor- knowledge-grounded dialogue challenge by using the im- pus for grounding the response generation. IR systems, such plicit knowledge in LMs, to serve as the knowledge con- as TF-IDF, can retrieve related knowledge over a large cor- ceptualization process under the open-domain chit-chat sce- pus quickly. However, the effectiveness of the IR systems is nario. Compared to the existing work scheme shown in Fig- the bottleneck. They are only leveraged to retrieve several ure 1, we bypass the retrieval step and propose an end- relevant documents coarsely. However, providing the mod- to-end framework, KnowExpert, to inject the knowledge els with more documents may not improve the system since corpus into the memory of pre-trained LMs and incorpo- it will also bring more noise into the inputs. What is more, rate the learned knowledge, taking advantage of latent top- the length of packed inputs could exceed the length limita- ics, for knowledge-grounded dialogue generation. In the tion of the LMs. Thus, the existing works still conduct fur- model, lightweight adapters (Bapna and Firat 2019) are at- ther fine-grained knowledge selection to improve the accu- tached in the pre-trained GPT-2 (Radford et al. 2019), act- racy of the knowledge retrieval process, which is one of the ing as knowledge experts. The knowledge sentences are critical problems in the knowledge-grounded dialogue task. embedded into different knowledge experts by pseudo- Motivated by this, latent variables have been introduced to conversation style training, while the latent topics measure minimize the gap between prior and the posterior knowl- Wikipedia is a free, multilingual Wikipedia is a free, multilingual Wikipedia online encyclopedia. online encyclopedia.

[1] Wikipedia is a free, multilingual It is written and maintained by As of 2021, it was ranked as the online encyclopedia. [2] It is written a community of volunteer 13th most popular site. and maintained by a community of contributions. volunteer contributors. [3] Wikipedia Wikipedia is the largest and most- is the largest and most-read reference Wikipedia is the largest and most- read reference work in history. work in history. [4] It is consistently read reference work in history. Usr Sys Usr Sys one of the 15 most popular web- It is consistently one of the most It is consistently one of the most sites as ranked by Alexa. [5] As of popular websites as ranked by Alexa. popular websites as ranked by Alexa. 2021, it was ranked as the 13th most It is written and maintained by popular site. ... As of 2021, it was ranked as the a community of volunteer 13th most popular site. contributions.

Split into sentences to convert Random replacement of two 1 Wikipedia Article 2 3 into utterances utterances

Figure 4: The demonstration of the procedure of converting a document in the knowledge corpus (e.g., a Wikipedia article) into the pseudo-conversation style. First, the article is split into sentences so that each to represent one utterance. Then, random two utterances are permuted to avoid over-fitting (presented with dashed lines). edge selection (Zhao et al. 2019; Kim, Ahn, and Kim 2019; Methodology Lian et al. 2019; Chen et al. 2020). Zhao et al. (2020) ex- In this section, we present the framework and the learning plores the strategy to better rank the knowledge sentences, algorithm of KnowExpert. First, we offer several prelim- avoiding the most relevant candidates are got truncated in inary definitions used throughout the paper. Second, we ex- the input sequences. In this paper, we attempt to tackle this plain the architecture of KnowExpert. Finally, we describe knowledge retrieval challenge from another direction. the strategy to train the whole framework.

Knowledge Retrieval in LMs Preliminary Definition The concept of knowledge retrieval in LMs starts with the n N We denote a dialogue dataset as {D }n=1 and the dia- proposal of the LAMA benchmark (Petroni et al. 2019). It t logue history at turn t as Dt = {(Ui,Si)}i=1, where Ut heavily relies on prompts. By constructing the prompts as is the user utterance and St the system response. Along with “fill-in-the-blank” cloze statements, pre-trained LMs could the dialogue dataset, suppose we have a knowledge corpus predict the factual knowledge (Petroni et al. 2019; Shin et al. M {Km}m=1, where Km refers to a piece of knowledge (e.g., 2020). The application of the idea of knowledge retrieval in a sentence from Wikipedia). LMs also appears in closed-book QA tasks. Roberts, Raffel, Given an input Xt = (Dt−1,Ut), we aim to learn a model and Shazeer (2020) investigates a simple fine-tuning tech- fΘ to generate a knowledgeable response S˜t. Existing works nique on multiple datasets and proves frame this task with retrieving related knowledge K for that T5 (Raffel et al. 2019) can pack the Wikipedia knowl- t S˜ = f (X ,K ) edge into its parameters. augmented input: t Θ t t . Here, we propose to bypass the retrieval process by injecting knowledge into the Inference Efficiency in Language Model model parameters Θ to generate a response solely based on dialogue history: S˜t = fΘ(Xt). Recent progress in natural language processing, including dialogue systems, has been benefited by Transformer-based KnowExpert Architecture large pre-trained LMs, yet current “best performing” mod- KnowExpert els tend to have a more complex architecture with more pa- is composed of two components: a GPT-2 rameters, which is not ideal considering inference on practi- with lightweight adapters and a contextual topic model as cal application. Many modified architectures of transformers depicted in Figure 2. Inspired by Peinelt, Nguyen, and Li- have been explored to speedup inference latency while keep- akata (2020), the topic model is introduced to evoke knowl- ing the performance, for example, by leveraging knowledge edge stored in the GPT-2 by the guidance of the topic infor- distillation to compress or reduce the number of parame- mation during response generation. ters (Tang et al. 2019; Sanh et al. 2019; Jiao et al. 2020; GPT-2 with Adapters To incorporate with knowledge, we Sun et al. 2020), by a simple decomposition in lower lay- insert lightweight adapters (Bapna and Firat 2019) into each ers (Cao et al. 2020), or by converting a structured decoder GPT-2 layer. The adapter has a two-linear-layer structure, into a non-autoregressive module (Sun et al. 2019). Unlike which enables fast adaptation to targets. Given the hidden j×h previous works, we emphasize the inference efficiency of representation of the GPT-2 layer i, denoted as Hi ∈ R , our proposed framework on shortening the input sequences where h and j are the hidden dimension and the current gen- by removing the external knowledge components and reduc- eration step, respectively, the adapter can be formulated as: ing the storage resource need, and provide a faster inference hd dh process when scaling up the knowledge corpus. Aθ(Hi) = ReLU(LN(Hi)Wi )Wi + Hi, WoW Seen WoW Unseen CMU DoG Model PPL F1 KF1 PPL F1 KF1 PPL F1 TMN (Dinan et al. 2019) 66.5 15.9 - 103.6 14.3 - 75.2 9.9 DRD (Zhao et al. 2019) 23.0 18.0 - 25.6 16.5 - 54.4 10.7 Retrieval-based ZRGKG (Li et al. 2020) 40.4 18.7 - 41.5 18.6 - 53.5 12.5 Approach GPT-2trunc (Zhao et al. 2020) 14.6 18.7 - 16.9 18.3 - 18.6 10.8 KnowledGPT (Zhao et al. 2020) 19.2 22.0 - 22.3 20.5 - 20.6 13.5 GPT-2 18.8 17.0 11.2 21.0 16.3 10.8 17.8 11.8 Retrieval-free f KE (ours, one-hot, L = 4) 16.0 18.4 13.3 21.2 16.6 10.9 17.8 12.1 Approach o KEw (ours, weighted-sum, L = 4) 15.3 18.7 13.7 20.1 16.7 11.5 17.2 12.5

Table 1: Automatic evaluation results. PPL is short for Perplexity, where a lower value indicates a better performance. F1 and KF1 refer to the unigram-F1 score between the generation and gold response, and the gold knowledge sentence respec- tively. We highlight the best results for each group with bold. We also highlight the cases using underline when our proposed KnowExpert outperforms the retrieval-based models.

hd h×d dh d×h where Wi ∈ R and Wi ∈ R stand for train- tences of the knowledge corpus in plain text format to train able parameters in θ, LN(·) is layer normalization (Ba, Kiros, CTM with the pre-trained Sentence-Transformers frozen. To and Hinton 2016), and d is the bottleneck dimension. Here, have a better guidance during training, we predict topic dis- L we insert L knowledge adapters parameterized as {θEl }l=1 tribution w using a concatenation of the dialogue history and where each serves as a knowledge expert in a certain topic the response. We also try other input combinations, while domain. we achieve the best performance with the current one. Dur- ing inference, however, this scheme cannot be applied due Topic Modeling In KnowExpert, a topic model is used to the absence of responses. Thus, we further fine-tuned the to inform more relevant “topic” to GPT-2 during response Sentence-Transformer inside CTM model to deal with the generation so to induce more context-appropriate knowl- response absence issue. In other words, we fine-tune the edge. While any sort of topic models can be used, we adopt Sentence-Transformer model to produce the sentence em- contextual topic model (CTM) which outperforms tradi- bedding of the given dialogue history as similar to the sen- tional topic models (Bianchi, Terragni, and Hovy 2021). tence embedding of the concatenation mentioned above. We CTM combines pre-trained Sentence-Transformers embed- leverage MSE loss to evaluate the difference betwee two ding representations (Reimers and Gurevych 2019) with a sentence embeddings and provide the model with supervi- neural topic model Neural-ProdLDA (Srivastava and Sutton son signals. 2017) which takes advantage of Bag of Words (BoW) for (ii) Knowledge Experts Training We train a set of L more coherent representation. Given a knowledge corpus, topic-specific knowledge adapters inserted into the frozen the topic model is trained with the number of topic clusters backbone GPT-2 with the knowledge corpus to generate a of L. knowledge sentence. The adapters are independently trained Once trained, the topic model is used to obtain a prob- to minimize the negative log-likelihood over the knowledge ability distribution to the pre-clustered topics of the dia- corpus of the corresponding topic: logue history. These probabilities are utilized as the sim- X X ilarity weights w = (w1, w2, ..., wL) over the knowledge LKl = − log p(ki|k

Table 4: Examples of generated responses using the four-cluster KEw model with the same dialogue history on the WoW dataset. EXP1/2/3/4 denotes the response generated with GPT-2 backbone with only one specific knowledge expert, where the numbers after EXP denote the knowledge expert indexes.

Index Keywords Results 1 east, river, south, state, city, area, island Table 1 reports the automatic evaluation results. The im- 2 rock, band, music, album, football, single provements over the baseline model GPT-2f demonstrate the 3 fiction, story, characters, novel, film, stars effectiveness of our proposed framework. In this task, KEw 4 bon, bucks, rutgers, canberra, ivy, nets performs comparably with the retrieval-based baselines, es- pecially on the seen domain, without either retrieval or us- Table 5: Example keywords of the topic clusters (L = 4). ing any explicit extra knowledge input in the inference pro- The keywords are generated by the corresponding topic cess. In addition, KEw also shows better performance over model and selected among the top 25 frequent keywords. KEo consistently. Despite the improvements over the base- line model, we also observe the performance gap between seen and unseen domain, which requires future work. and the gold knowledge sentences. What’s important is that Table 2 shows the human evaluation results in terms of because of the one-to-many relationship (Chen et al. 2020) the winning rate for Info. and Human. The results indicate between relevant knowledge sentences and the conversation that the introduction of the knowledge experts brings GPT-2 utterances, KF1 may not be sufficient to prove whether the model a significant improvement on generating a more infor- generation is informative or not. In this case, human evalua- mative response, without hurting the fluency and coherence tion is further required. of the generation under the weight-sum setting. However, when using the knowledge experts under a one-hot setting, Human Evaluation In addition to the automatic evalua- the improvement is not as large as the former one on unseen tion, we conduct human evaluation from two aspects over domain, which follows the results of the automatic metrics. the generated responses: Informativeness (Info.) and Hu- manness (Human.). Info. evaluates how knowledgeable the Effects of Number of Topic Clusters generated responses are, based on the amount of new infor- The number of topic clusters is an important hyperparame- mation introduced into the conversations and the factuality ter since it crucially impacts the quality of topic modeling of the responses. Human. is used for evaluating the fluency and knowledge experts training. Because of the nature of and the coherence of the generated responses. A/B testing is WoW and CMU DoG datasets, we conduct experiments with utilized to compare our proposed framework KEw and KEo L = 4, 8, 16. Example keywords for four topic clusters are with GPT-2f baseline on the WoW dataset. For each compar- shown in Table 5. In Table A1 in the Appendix, we show top ison, the same dialogue context and two options generated frequent words for each topic cluster with different numbers by the models in comparison respectively are shown to the of topic clusters in details. For example, Cluster 2 in L = 8 annotators. Each comparison requires three judgments and is the cluster that is strongly related to movie domain. As 100 data samples are randomly selected from each domain. shown in Table 3, we selected L = 4 since it achieved the We conduct a human evaluation using a crowd-source plat- best average F1 in WoW Seen and Unseen. As the number 2 form offered by Amazon Mechanical Turk . We ensure that of topic cluster grows, we tend to obtain a better KF1 score each sample is evaluated by three annotators. Further details because we have more parameters to store knowledge. How- and templates are included in Appendix B. ever, the other metrics do not show the same trend, since we Model Selection In the training procedure, we have differ- are more likely to have larger distance between the predicted ent criterion for selecting models for the three training steps: and the true topic distribution along with the increase of L, (i) and (ii): We train the corresponding model for a specific thus failing to utilize knowledge experts effectively. number of epochs; (iii) The model is selected according to Case Study the sum of the PPLs on the validation set and validation un- seen set. We leverage different knowledge experts in a one-hot man- ner, so that we can generate responses with only one knowl- 2https://www.mturk.com edge expert and the same dialogue history to study what the 3000 KE-16 (Ours) DPR + ZRKGC DPR + KNOWLEDGPT KE-4 (Ours) GENRE + ZRKGC GENRE + KNOWLEDGPT KE-8 (Ours) TFIDF + ZRKGC TFIDF + KNOWLEDGPT 1219.32 1000 838.73 717.80 660.01 634.13 642.97

300 262.38 Time (s)

141.45

100 83.67 57.79 48.88 48.88 45.21 45.21 43.25 43.25 30 1/2 WoW WoW 2 WoW 5 WoW 10 WoW Wiki Subset 20 WoW 30 WoW Wiki Full (~59k) (~117k) (~235k) (~587k) (~1.17M) (~1.6M) (~2.34M) (~3.52M) (~5.1M) Knowledge Corpus

Figure 5: Inference efficiency of our approach generating 100 samples. We show the time on a logarithmic scale and the knowledge corpus size in ascending order. In the figure above, we demonstrate the overall end-to-end inference time of our method with the generation length of 23 (average response length in WoW dataset). knowledge experts capture. As shown in Table 4, the gen- approach, we measure the end-to-end inference time by erations with different knowledge experts tend to lean into summing the time for retrieval and response generation. different cluster topics with the same context. Some selected The generation length is predefined as the average response keywords are shown in Table 5 and more topic keywords are length in the WoW dataset. We randomly sample 100 in- listed in Table A1 in the Appendix. Comparing the genera- stances from WoW Seen and Unseen and average the infer- tions with the listed topic keywords, our knowledge experts ence time of 10 trials. The detailed configuration is listed in tend to focus on different topics where the knowledge doc- Table C1 in the Appendix. uments they are trained on belong. For example, with the As shown in Figure 5, KnowExpert requires the least same dialogue history, EXP2 of WoW Unseen is leaning into computing time and keeps a constant computational cost the music domain as Cluster 2 is strongly related to music, regardless of the size of the knowledge corpus. It is be- while EXP3 generations are more related to the fiction top- cause our topic modeling requires a constant computation ics which align with the topic in Cluster 3. In addition, the cost while that of TF-IDF or DPR increases along with shown cases also support the observation from Table 1 that the increase of knowledge corpus. Additionally, our model the mixture-of-experts manner ensures a better model per- does not require a large external corpus during the inference formance. The generation of KEw is more on topic and accu- time. These suggest that our model is suitable for deploy- rate thanks to leveraging the weighted sum of those experts. ment in resource-limited platforms such as in the on-device The above performance indicates that the proper ensemble setting. Nevertheless, compression methods such as distil- of knowledge also helps the response generation. lation, quantization, and pruning might need to be further Although the generated responses appear to be knowl- applied to minimize the computation cost of the model. edgeable and fluent, they frequently raise an issue in factual correctness; for example, Harry Potter is not an investor, nor Conclusion a musician as shown in EXP1/2; Orcs are not directly re- We propose KnowExpert, a retrieval-free framework for lated to Italian peninsula. We also observe that a knowledge knowledge-grounded dialogue response generation task. It expert whose topics are more similar to the topic of the dia- is the first attempt to tackle the challenge of injecting the logue tends to generate more factual responses. knowledge into the model parameters and leveraging it for Inference Efficiency the downstream task. We leverage light-weight adapters as knowledge experts for knowledge injection, then train the We evaluate the response generation inference time of backbone model to take advantage of the knowledge ex- KnowExpert and other two retrieval-based baselines: perts for response generation. By this means, our method can ZRGKG (Li et al. 2020) and KnowledGPT (Zhao et al. generate more knowledgeable responses without explicit re- 2020). In addition to the time to generate responses, we also trieval step, compared to our baseline model. By bypassing take the time into consideration, which is required to re- the retrieval step over the knowledge corpus, the inference trieve knowledge candidates from knowledge corpus with efficiency of the model is improved. Experimental results different sizes against the time required for topic model- show that KnowExpert performs comparably with some ing in KnowExpert. We take the retrieval methods TF- retrieval-based models, demonstrating the potential of our IDF, DPR (Karpukhin et al. 2020), and GENRE (Cao et al. proposed direction. 2021) for comparison. To have a fair comparison with our References Kim, B.; Ahn, J.; and Kim, G. 2019. Sequential Latent Ba, L. J.; Kiros, J. R.; and Hinton, G. E. 2016. Layer Nor- Knowledge Selection for Knowledge-Grounded Dialogue. malization. CoRR, abs/1607.06450. In International Conference on Learning Representations. Bapna, A.; and Firat, O. 2019. Simple, Scalable Adap- Kingma, D. P.; and Ba, J. 2015. Adam: A Method for tation for Neural Machine Translation. In Proceedings of Stochastic Optimization. In Bengio, Y.; and LeCun, Y., eds., the 2019 Conference on Empirical Methods in Natural Lan- 3rd International Conference on Learning Representations, guage Processing and the 9th International Joint Confer- ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Confer- ence on Natural Language Processing (EMNLP-IJCNLP), ence Track Proceedings. 1538–1548. Lauscher, A.; Majewska, O.; Ribeiro, L. F.; Gurevych, Bianchi, F.; Terragni, S.; and Hovy, D. 2021. Pre-training I.; Rozanov, N.; and Glavas,ˇ G. 2020. Common Sense is a Hot Topic: Contextualized Document Embeddings Im- or World Knowledge? Investigating Adapter-Based Knowl- prove Topic Coherence. In ACL. edge Injection into Pretrained Transformers. arXiv preprint Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; arXiv:2005.11787. Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, Li, L.; Xu, C.; Wu, W.; ZHAO, Y.; Zhao, X.; and Tao, C. A.; et al. 2020. Language models are few-shot learners. 2020. Zero-Resource Knowledge-Grounded Dialogue Gen- arXiv preprint arXiv:2005.14165. eration. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, Cao, N. D.; Izacard, G.; Riedel, S.; and Petroni, F. 2021. Au- M. F.; and Lin, H., eds., Advances in Neural Information toregressive Entity Retrieval. In International Conference Processing Systems, volume 33, 8475–8485. Curran Asso- on Learning Representations. ciates, Inc. Cao, Q.; Trivedi, H.; Balasubramanian, A.; and Balasub- Lian, R.; Xie, M.; Wang, F.; Peng, J.; and Wu, H. 2019. ramanian, N. 2020. DeFormer: Decomposing Pre-trained Learning to select knowledge for response generation in di- Transformers for Faster Question Answering. In Proceed- alog systems. In Proceedings of the 28th International Joint ings of the 58th Annual Meeting of the Association for Com- Conference on Artificial Intelligence, 5081–5087. AAAI putational , 4487–4497. Online: Association for Press. Computational Linguistics. Madotto, A.; Cahyawijaya, S.; Winata, G. I.; Xu, Y.; Liu, Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; and Specia, Z.; Lin, Z.; and Fung, P. 2020. Learning Knowledge Bases L. 2017. SemEval-2017 Task 1: Semantic Textual Similar- with Parameters for Task-Oriented Dialogue Systems. In ity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 2020 Conference on Empirical Methods Proceedings of the 11th International Workshop on Seman- in Natural Language Processing: Findings, 2372–2394. tic Evaluation (SemEval-2017) , 1–14. Vancouver, Canada: Peinelt, N.; Nguyen, D.; and Liakata, M. 2020. tBERT: Association for Computational Linguistics. Topic Models and BERT Joining Forces for Semantic Sim- Chen, X.; Meng, F.; Li, P.; Chen, F.; Xu, S.; Xu, B.; and ilarity Detection. In Proceedings of the 58th Annual Meet- Zhou, J. 2020. Bridging the Gap between Prior and Pos- ing of the Association for Computational Linguistics, 7047– terior Knowledge Selection for Knowledge-Grounded Di- 7055. Online: Association for Computational Linguistics. alogue Generation. In Proceedings of the 2020 Confer- ence on Empirical Methods in Natural Language Processing Petroni, F.; Rocktaschel,¨ T.; Riedel, S.; Lewis, P.; Bakhtin, (EMNLP), 3426–3437. A.; Wu, Y.; and Miller, A. 2019. Language Models as Knowledge Bases? In Proceedings of the 2019 Conference Conneau, A.; Kiela, D.; Schwenk, H.; Barrault, L.; and Bor- on Empirical Methods in Natural Language Processing and des, A. 2017. Supervised Learning of Universal Sentence the 9th International Joint Conference on Natural Language Representations from Natural Language Inference Data. In Processing (EMNLP-IJCNLP), 2463–2473. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 670–680. Copenhagen, Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Denmark: Association for Computational Linguistics. Sutskever, I. 2019. Language models are unsupervised mul- titask learners. OpenAI Blog, 1(8):9. Dinan, E.; Roller, S.; Shuster, K.; Fan, A.; Auli, M.; and We- ston, J. 2019. Wizard of Wikipedia: Knowledge-Powered Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Conversational Agents. In International Conference on Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2019. Explor- Learning Representations. ing the limits of transfer learning with a unified text-to-text Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; transformer. arXiv preprint arXiv:1910.10683. Wang, F.; and Liu, Q. 2020. TinyBERT: Distilling BERT for Reimers, N.; and Gurevych, I. 2019. Sentence-BERT: Natural Language Understanding. In Findings of the Asso- Sentence Embeddings using Siamese BERT-Networks. In ciation for Computational Linguistics: EMNLP 2020, 4163– Proceedings of the 2019 Conference on Empirical Meth- 4174. Online: Association for Computational Linguistics. ods in Natural Language Processing and the 9th Interna- Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, tional Joint Conference on Natural Language Processing S.; Chen, D.; and Yih, W.-t. 2020. Dense Passage Retrieval (EMNLP-IJCNLP), 3982–3992. Hong Kong, China: Asso- for Open-Domain Question Answering. In Proceedings of ciation for Computational Linguistics. the 2020 Conference on Empirical Methods in Natural Lan- Roberts, A.; Raffel, C.; and Shazeer, N. 2020. How Much guage Processing (EMNLP), 6769–6781. Knowledge Can You Pack into the Parameters of a Language Model? In Proceedings of the 2020 Conference on Em- Zhao, X.; Wu, W.; Tao, C.; Xu, C.; Zhao, D.; and Yan, R. pirical Methods in Natural Language Processing (EMNLP), 2019. Low-Resource Knowledge-Grounded Dialogue Gen- 5418–5426. eration. In International Conference on Learning Represen- Roller, S.; Dinan, E.; Goyal, N.; Ju, D.; Williamson, M.; Liu, tations. Y.; Xu, J.; Ott, M.; Shuster, K.; Smith, E. M.; et al. 2020. Zhao, X.; Wu, W.; Xu, C.; Tao, C.; Zhao, D.; and Yan, R. Recipes for building an open-domain chatbot. arXiv preprint 2020. Knowledge-Grounded Dialogue Generation with Pre- arXiv:2004.13637. trained Language Models. In Proceedings of the 2020 Con- Sanh, V.; Debut, L.; Chaumond, J.; and Wolf, T. 2019. Dis- ference on Empirical Methods in Natural Language Pro- tilBERT, a distilled version of BERT: smaller, faster, cheaper cessing (EMNLP), 3377–3390. and lighter. CoRR, abs/1910.01108. Zhou, K.; Prabhumoye, S.; and Black, A. W. 2018. A Shin, T.; Razeghi, Y.; Logan IV, R. L.; Wallace, E.; and Dataset for Document Grounded Conversations. In Pro- Singh, S. 2020. AutoPrompt: Eliciting Knowledge from ceedings of the 2018 Conference on Empirical Methods in Language Models with Automatically Generated Prompts. Natural Language Processing. arXiv preprint arXiv:2010.15980. Shwartz, V.;West, P.; Bras, R. L.; Bhagavatula, C.; and Choi, Y. 2020. Unsupervised commonsense question answering with self-talk. arXiv preprint arXiv:2004.05483. Srivastava, A.; and Sutton, C. 2017. Autoencoding Varia- tional Inference For Topic Models. In ICLR. Sun, Z.; Li, Z.; Wang, H.; He, D.; Lin, Z.; and Deng, Z. 2019. Fast Structured Decoding for Sequence Models. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alche-Buc,´ F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc. Sun, Z.; Yu, H.; Song, X.; Liu, R.; Yang, Y.; and Zhou, D. 2020. MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Lin- guistics, 2158–2170. Online: Association for Computational Linguistics. Tang, R.; Lu, Y.; Liu, L.; Mou, L.; Vechtomova, O.; and Lin, J. 2019. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks. CoRR, abs/1903.12136. Tuckute, G.; Paunov, A.; Kean, H.; Small, H.; Mineroff, Z.; Blank, I.; and Fedorenko, E. 2021. Frontal language areas do not emerge in the absence of temporal language areas: A case study of an individual born without a left temporal lobe. bioRxiv. Wang, C.; Liu, P.; and Zhang, Y. 2021. Can Generative Pre-trained Language Models Serve as Knowledge Bases for Closed-book QA? arXiv preprint arXiv:2106.01561. Wang, R.; Tang, D.; Duan, N.; Wei, Z.; Huang, X.; Cao, C.; Jiang, D.; Zhou, M.; et al. 2020. K-adapter: Infusing knowl- edge into pre-trained models with adapters. arXiv preprint arXiv:2002.01808. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; Davi- son, J.; Shleifer, S.; von Platen, P.; Ma, C.; Jernite, Y.; Plu, J.; Xu, C.; Scao, T. L.; Gugger, S.; Drame, M.; Lhoest, Q.; and Rush, A. M. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Confer- ence on Empirical Methods in Natural Language Process- ing: System Demonstrations, 38–45. Online: Association for Computational Linguistics. Xu, Y.; and Zhao, H. 2021. Dialogue-oriented Pre-training. arXiv preprint arXiv:2106.00420. Cluster Analysis of Topic Modeling We show the keywords list on each topic in our Contextual- ized Topic Model in Table A1. Additional Details on Human Evaluation We collect human annotations for both Humanness and At- tribute Consistency via crowd-sourcing platform provided by Amazon Mechanical Turk3. The template for human evaluation is shown in Figure B1. For quality control, we limit the annotators’ location to be one of United States, United Kingdom, Canada, and Australia for English fluency. Moreover, we qualify annotators with HIT Approval rate larger than 95% and HIT Approved number to be greater than 5000. As the average time annotators spend per re- sponse comparison for informativeness is 168 seconds, we reject annotators who spend less than 10 seconds to main- tain the quality. The template for human evaluation is shown in Figure B1. Each annotator is asked to judge either human- ness or informativeness of one dialogue. To get a consistent observation, we use the same 100 randomly selected prefixes of the dialogues across the comparisons. Configuration for Inference Efficiency We random sample 100 data sample from the test set and test unseen set of WoW dataset respectively. The sampled data is leverage for all the inference efficiency evaluation exper- iments. Besides, we set the batch size as 1. We repeat the each evaluation for five times respectively on samples from test set and test unseen set. The final value is the average of ten trials. The device configuration for inference efficiency evaluation is shown in Table C1 and Table C2. For the gen- eration inference time evaluation, to have a fair comparison, the generation length is set as 23 for all the models, where 23 is the average response length in WoW dataset.

Model Device (CPU / GPU) # of Device TF-IDF Intel Xeon E5-2620 V4 CPU 1 DPR GeForce GTX 1080Ti 1 GENRE GeForce GTX 1080Ti 1 CTM Intel Xeon E5-2620 V4 CPU 1

Table C1: Device configuration for knowledge retrieval methods and CTM topic modeling.

Model Device (CPU / GPU) # of Device ZRKGC GeForce GTX 1080Ti 2 KnowledGPT GeForce GTX 1080Ti 1 Ours GeForce GTX 1080Ti 1

Table C2: Device configuration for response generation (with knowledge selection if it’s applicable).

3https://www.mturk.com L = 4 cluster 1 east, west, river, south, state, city, area, district, north, center, largest, island, park, states, county cluster 2 rock, band, records, music, team, song, album, club, football, record, league, studio, single, released, professional cluster 3 fiction, story, characters, book, disney, novel, episode, film, films, comic, stars, comics, opera, comedy, character cluster 4 pain, bon, canberra, blocked, rutgers, khalil, edmonton, auckland, auburn, capitals, akron, karim, woodstock, cougars, euro L = 8 cluster 1 systems, theory, data, software, computer, information, person, system, value, mobile, use, users, user, physical, devices cluster 2 film, character, characters, episode, comic, television, comedy, fiction, story, comics, directed, novel, films, fictional, fantasy cluster 3 company, school, university, students, education, founded, schools, institute, president, department, business, public, united, states, private cluster 4 empire, roman, german, period, chinese, century, russian, soviet, religious, french, bc, king, war, battle, dynasty cluster 5 area, south, city, north, west, ye, located, river, population, east, park, part, county, region, island cluster 6 sugar, yellow, rice, tree, meat, cats, neck, pain, egg, sauce, corn, chicken, breed, hair, cheese cluster 7 music, band, rock, song, album, records, studio, singer, pop, single, guitar, group, songs, recorded, released cluster 8 league, team, football, club, sports, professional, championship, hockey, baseball, teams, cup, basketball, division, played, racing L = 16 cluster 1 team, football, league, club, championship, cup, basketball, hockey, wrestling, professional, baseball, olympic, teams, race, rugby cluster 2 company, brand, car, chain, ford, owned, cars, corporation, stores, inc, brands, sold, manufacturer, headquartered, restaurant cluster 3 film, episode, directed, stars, fox, drama, comedy, cast, aired, episodes, soap, abc, show, opera, movie cluster 4 light, used, water, surface, temperature, earth, power, energy, materials, chemical, space, speed, material, carbon, electric cluster 5 album, records, song, studio, single, release, track, recorded, lead, songs, chart, recording, label, hot, hit cluster 6 war, army, military, party, navy, ii, forces, election, force, battle, soviet, royal, corps, armed, campaign cluster 7 care, organization, laws, act, tax, organizations, education, non, profit, policy, legal, law, health, rights, agency cluster 8 brain, blood, normal, condition, cause, sleep, eye, causes, fever, heart, psychological, surgery, emotional, loss, drugs cluster 9 ocean, mountain, region, land, coast, pacific, islands, sea, gulf, island, mountains, capital, rivers, km, river cluster 10 computer, digital, data, software, internet, web, code, users, devices, value, mobile, application, device, systems, user cluster 11 street, park, center, road, railway, station, historic, built, building, route, located, highway, opened, city, line cluster 12 century, chinese, greek, christian, modern, medieval, ancient, period, middle, ad, roman, traditions, culture, bc, tradition cluster 13 yellow, bird, tree, flowers, breed, meat, rice, dog, wild, white, sugar, leaf, colour, pepper, flower cluster 14 professor, father, mother, worked, born, graduated, institute, degree, married, studied, bachelor, moved, mary, graduate, attended cluster 15 bass, jazz, guitar, music, festival, stage, dance, theatre, artists, musical, bands, piano, hip, musician, blues cluster 16 fantasy, comics, published, comic, game, fiction, book, universe, books, marvel, created, video, playstation, developed, dc

Table A1: Top 15 frequent words for each topic cluster of CTM with L = 4, 8, 16. (a) Template for judge humanness.

(b) Template for judge informativeness.

Figure B1: Human evaluation template for judge humanness and informativeness respectively.