MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models
Peng Xu‡∗, Mostofa Patwary§, Mohammad Shoeybi§, Raul Puri§, Pascale Fung‡, Anima Anandkumar§, Bryan Catanzaro§
‡The Hong Kong University of Science and Technology, §NVIDIA
[email protected], [email protected]
∗This work was done during the internship of Peng Xu at NVIDIA. Corresponding authors: Peng Xu, Mostofa Patwary.
Abstract

Existing pre-trained large language models have shown unparalleled generative capabilities. However, they are not controllable. In this paper, we propose MEGATRON-CNTRL, a novel framework that uses large-scale language models and adds control to text generation by incorporating an external knowledge base. Our framework consists of a keyword predictor, a knowledge retriever, a contextual knowledge ranker, and a conditional text generator. As we do not have access to ground-truth supervision for the knowledge ranker, we make use of weak supervision from sentence embedding. The empirical results show that our model generates more fluent, consistent, and coherent stories with less repetition and higher diversity compared to prior work on the ROC story dataset. We showcase the controllability of our model by replacing the keywords used to generate stories and re-running the generation process. Human evaluation results show that 77.5% of these stories are successfully controlled by the new keywords. Furthermore, by scaling our model from 124 million to 8.3 billion parameters we demonstrate that larger models improve both the quality of generation (from 74.5% to 93.0% for consistency) and controllability (from 77.5% to 91.5%).

Context: [FEMALE] was on a long road trip.
Guan et al. (2020): she had just driven into town. she had been driving for hours. she was so tired she decided to take a nap. when she woke up, she still had n't slept well.
MEGATRON-CNTRL-124M (keyword: driving): she was on her way to see a strange country. she had never seen anything like that before. finally she saw a small town that was so small. it was a great little town with a nice park!
MEGATRON-CNTRL-8B (keyword: driving): she was driving in the middle of [PLACE]. all of a sudden the tire pressure light came on. she got out to check her tire. it was flat so she used the roadside assistance.
MEGATRON-CNTRL-8B-ANT (keyword: attract): she really wanted to see a few attractions. the first one she saw was a giant water park. it was amazing. it ended up being a fun experience.

Table 1: Stories generated by models of increasing capacity and controllability. As the model size grows, story quality becomes increasingly coherent, fluent, and logically consistent. The last row demonstrates how the MEGATRON-CNTRL-8B-ANT model controls the story generation with a new keyword, "attract". Note that [MALE] and [FEMALE] denote names and [PLACE] denotes locations.

1 Introduction

Text generation has recently attracted significant attention from the research community as large pre-trained language models, such as GPT-2 (Radford et al., 2018, 2019), demonstrated promising results for generating long, grammatically correct, and fluent text. Finetuning these models has shown significant improvements in downstream tasks, such as persona chat (Wolf et al., 2019). However, one non-negligible drawback of these large models is the lack of knowledge which humans use to produce natural text. For example, GPT-2 based models produce degraded generations that are illogical and ungrammatical for knowledge-driven generation tasks, such as story generation. Guan et al. (2020) therefore introduced commonsense knowledge to the pre-trained language model by further finetuning on commonsense datasets. Although implicit encoding of knowledge is helpful for knowledge incorporation, there is still a lack of a training mechanism to teach the model when and what to incorporate from external knowledge.
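The framework summarized in the abstract makes that selection explicit by splitting generation into four stages: a keyword predictor, a knowledge retriever, a contextual knowledge ranker, and a conditional text generator. The sketch below illustrates one plausible way such a per-sentence loop could be organized; every function here is a hypothetical placeholder introduced for illustration, not the paper's actual models or interfaces.

```python
# Minimal, illustrative sketch of the four-stage generation loop described
# in the abstract. All component functions are hypothetical placeholders.
from typing import List, Optional


def predict_keywords(story: List[str]) -> List[str]:
    # Placeholder: a trained model would predict keywords for the next
    # sentence from the story so far.
    return ["driving"]


def retrieve_knowledge(keywords: List[str]) -> List[str]:
    # Placeholder: query an external (commonsense) knowledge base for
    # sentences related to the predicted keywords.
    return [f"facts related to '{k}'" for k in keywords]


def rank_knowledge(story: List[str], candidates: List[str], top_k: int) -> List[str]:
    # Placeholder: the contextual knowledge ranker scores candidates against
    # the story context; per the abstract it is trained with weak supervision
    # from sentence embeddings, since no gold ranking labels exist.
    return candidates[:top_k]


def generate_sentence(story: List[str], knowledge: List[str]) -> str:
    # Placeholder: a large pre-trained conditional LM produces the next
    # sentence given the story context plus the selected knowledge.
    return "she kept driving until she reached a small town."


def generate_story(context: List[str],
                   num_sentences: int = 4,
                   forced_keywords: Optional[List[str]] = None) -> List[str]:
    """Generate a story sentence by sentence, conditioning on retrieved knowledge."""
    story = list(context)
    for _ in range(num_sentences):
        # Controllability: a user-supplied keyword can override the
        # predicted ones, steering the rest of the story.
        keywords = forced_keywords if forced_keywords else predict_keywords(story)
        candidates = retrieve_knowledge(keywords)
        knowledge = rank_knowledge(story, candidates, top_k=2)
        story.append(generate_sentence(story, knowledge))
    return story


if __name__ == "__main__":
    print(generate_story(["[FEMALE] was on a long road trip."],
                         forced_keywords=["attract"]))
```

Overriding the predicted keyword with a user-supplied one (as with "attract" in the last row of Table 1) is the mechanism the controllability evaluation exercises.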
In addition, these large pre-trained language models are hard to control. Recently, plug-and-play language models (Dathathri et al., 2019) addressed whole-document controllability by adding a linear classifier on top of GPT-2 to predict whether the generated text observes a particular style or property. Keskar et al. (2019) controlled a 1.2B parameter language model generation via the use of