Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks

Suchin Gururangan† Ana Marasović†♦ Swabha Swayamdipta† Kyle Lo† Iz Beltagy† Doug Downey† Noah A. Smith†♦

†Allen Institute for Artificial Intelligence, Seattle, WA, USA
♦Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA

{suching,anam,swabhas,kylel,beltagy,dougd,[email protected]

Abstract

Language models pretrained on text from a wide variety of sources form the foundation of today’s NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task’s unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable.

Figure 1: An illustration of data distributions. Task data comprises an observable task distribution, usually non-randomly sampled from a wider distribution (light grey ellipse) within an even larger target domain, which is not necessarily one of the domains included in the original LM pretraining domain, though overlap is possible. We explore the benefits of continued pretraining on data from the task distribution and the domain distribution.
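To make the continued-pretraining idea concrete, the sketch below continues masked-LM pretraining of an off-the-shelf RoBERTa model on an unlabeled corpus with the Hugging Face transformers and datasets libraries. This is a minimal illustration under stated assumptions, not the paper's released code: the corpus file (domain_corpus.txt) and all hyperparameters are placeholders, and the same loop stands in for either domain-adaptive pretraining (a large in-domain corpus) or task-adaptive pretraining (the task's own unlabeled text), after which the adapted model would be fine-tuned on the labeled end task.

# Minimal sketch of a second phase of masked-LM pretraining (DAPT or TAPT).
# Assumes the Hugging Face `transformers` and `datasets` libraries; the
# corpus path and hyperparameters are illustrative, not the paper's settings.
from transformers import (
    RobertaTokenizerFast, RobertaForMaskedLM,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)
from datasets import load_dataset

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Unlabeled in-domain (DAPT) or task (TAPT) text, one document per line.
corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking with the standard 15% masking probability.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="adapted-roberta",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=5e-5,
    ),
    train_dataset=corpus,
    data_collator=collator,
)
trainer.train()  # continue pretraining; fine-tune on the labeled task afterwards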