Prefix-Tuning: Optimizing Continuous Prompts for Generation

Xiang Lisa Li, Stanford University, [email protected]
Percy Liang, Stanford University, [email protected]

Abstract

Fine-tuning is the de facto way of leveraging large pretrained language models for downstream tasks. However, fine-tuning modifies all the language model parameters and therefore necessitates storing a full copy for each task. In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen and instead optimizes a sequence of continuous task-specific vectors, which we call the prefix. Prefix-tuning draws inspiration from prompting for language models, allowing subsequent tokens to attend to this prefix as if it were "virtual tokens". We apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. We show that by modifying only 0.1% of the parameters, prefix-tuning obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics that are unseen during training.

Figure 1: Fine-tuning (top) updates all LM parameters (the red Transformer box) and requires storing a full model copy for each task. We propose prefix-tuning (bottom), which freezes the LM parameters and only optimizes the prefix (the red prefix blocks). Consequently, we only need to store the prefix for each task, making prefix-tuning modular and space-efficient. Note that each vertical block denotes transformer activations at one time step.

1 Introduction

Fine-tuning is the prevalent paradigm for using large pretrained language models (LMs) (Radford et al., 2019; Devlin et al., 2019) to perform downstream tasks (e.g., summarization), but it requires updating and storing all the parameters of the LM. Consequently, to build and deploy NLP systems that rely on large pretrained LMs, one currently needs to store a modified copy of all the LM parameters for each task. This can be prohibitively expensive given the size of current LMs; for example, GPT-2 has 774M parameters (Radford et al., 2019) and GPT-3 has 175B parameters (Brown et al., 2020).

A natural approach to this problem is lightweight fine-tuning, which freezes most of the pretrained parameters and only tunes a smaller set of parameters. For example, adapter-tuning (Rebuffi et al., 2017; Houlsby et al., 2019) inserts additional task-specific layers between the layers of pretrained language models. Adapter-tuning has promising performance on natural language understanding and generation benchmarks, attaining comparable performance with fine-tuning while adding only around 2–4% task-specific parameters (Houlsby et al., 2019; Lin et al., 2020).

At the limit, GPT-3 (Brown et al., 2020) can be deployed using in-context learning, which is a form of prompting, without modifying any LM parameters. In in-context learning, Brown et al. (2020) prepend a natural language task instruction (e.g., TL;DR for summarization) and a few examples to the task input, and then generate the task output from the LM. However, since Transformers can only condition on a bounded-length context (e.g., 2048 tokens for GPT-3), in-context learning is restricted to very small training sets.

In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation (NLG) tasks, inspired by prompting. Consider the task of generating a textual description of a data table, as shown in Figure 1, where the task input is a linearized table (e.g., "name: Starbucks | type: coffee shop") and the output is a textual description (e.g., "Starbucks serves coffee."). Prefix-tuning prepends a sequence of continuous task-specific vectors to the input, which we call a prefix, depicted by red blocks in Figure 1 (bottom). To generate each token, the LM can attend to the prefix as if it were a sequence of "virtual tokens", but unlike prompting, the prefix consists entirely of free parameters which do not correspond to real tokens. In contrast to fine-tuning in Figure 1 (top), which updates all LM parameters and thus requires storing a tuned copy of the model for each task, prefix-tuning only optimizes the prefix. Consequently, we only need to store one copy of the large LM and a learned task-specific prefix, yielding a very small overhead for each additional task (e.g., 250K parameters for table-to-text).
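To make this concrete, here is a minimal, illustrative sketch (not the authors' released code) of prefix-tuning a HuggingFace GPT-2 model: every LM parameter is frozen, and the only trainable tensor is a set of per-layer key/value vectors passed to the model as past_key_values, so that all subsequent tokens can attend to the prefix. The class name, prefix length, and initialization scale are arbitrary choices for illustration; the paper's actual method additionally reparameterizes the prefix through an MLP during training, which this sketch omits, and recent transformers releases may expect the tuple to be wrapped in a cache object (e.g., via DynamicCache.from_legacy_cache).

import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class PrefixTunedGPT2(nn.Module):
    """Sketch: frozen GPT-2 plus a trainable prefix of key/value vectors."""

    def __init__(self, model_name="gpt2", prefix_len=10):
        super().__init__()
        self.lm = GPT2LMHeadModel.from_pretrained(model_name)
        for p in self.lm.parameters():
            p.requires_grad = False  # keep all LM parameters frozen

        cfg = self.lm.config
        self.prefix_len = prefix_len
        head_dim = cfg.n_embd // cfg.n_head
        # One key and one value vector per layer, head, and prefix position.
        self.prefix = nn.Parameter(
            0.02 * torch.randn(cfg.n_layer, 2, cfg.n_head, prefix_len, head_dim)
        )

    def forward(self, input_ids, labels=None):
        bsz = input_ids.size(0)
        # Arrange the prefix as past_key_values: one (key, value) pair per layer,
        # each of shape (batch, n_head, prefix_len, head_dim).
        past = tuple(
            (layer[0].unsqueeze(0).expand(bsz, -1, -1, -1),
             layer[1].unsqueeze(0).expand(bsz, -1, -1, -1))
            for layer in self.prefix
        )
        # The attention mask must also cover the prefix positions.
        mask = torch.ones(bsz, self.prefix_len + input_ids.size(1),
                          device=input_ids.device)
        return self.lm(input_ids=input_ids, past_key_values=past,
                       attention_mask=mask, labels=labels)

model = PrefixTunedGPT2()
# Only the prefix (a fraction of a percent of the LM's parameters) is optimized.
optimizer = torch.optim.AdamW([model.prefix], lr=5e-5)

Under this setup, the artifact saved per task is just the prefix tensor, which is where the modularity and storage savings described above come from.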
In contrast to full fine-tuning, prefix-tuning is also modular: we train an upstream prefix which steers an unmodified LM, and therefore, a single LM can support many tasks at once. In the context of personalization where the tasks correspond to users (Shokri and Shmatikov, 2015; McMahan et al., 2016), we would have a separate prefix for each user trained only on that user's data, thereby avoiding data cross-contamination. Moreover, the prefix-based architecture enables us to even process examples from multiple users/tasks in a single batch, something that is not possible with other lightweight fine-tuning approaches like adapter-tuning.

We evaluate prefix-tuning on table-to-text generation using GPT-2 and abstractive summarization using BART. In terms of storage, prefix-tuning stores 1000x fewer parameters than full fine-tuning. In terms of performance when trained on full datasets, prefix-tuning and fine-tuning are comparable for table-to-text (§6.1), while prefix-tuning suffers a small degradation for summarization (§6.2). In low-data settings, prefix-tuning outperforms fine-tuning on both tasks (§6.3). Prefix-tuning also extrapolates better to tables (for table-to-text) and articles (for summarization) with unseen topics (§6.4).

2 Related Work

Fine-tuning for natural language generation. Current state-of-the-art systems for natural language generation (NLG) are based on fine-tuning pretrained LMs. For table-to-text generation, Kale (2020) fine-tunes a sequence-to-sequence model (T5; Raffel et al., 2020). For extractive and abstractive summarization, researchers fine-tune masked language models (e.g., BERT; Devlin et al., 2019) and encoder-decoder models (e.g., BART; Lewis et al., 2020), respectively (Zhong et al., 2020; Liu and Lapata, 2019; Raffel et al., 2020). For other conditional NLG tasks such as machine translation and dialogue generation, fine-tuning is also the prevalent paradigm (Zhang et al., 2020c; Stickland et al., 2020; Zhu et al., 2020; Liu et al., 2020). In this paper, we focus on table-to-text using GPT-2 and summarization using BART, but prefix-tuning in principle can be applied to other generation tasks and pretrained models, such as masked LMs.

Lightweight fine-tuning. Prefix-tuning falls under the broad class of lightweight fine-tuning methods, which freeze most of the pretrained parameters and only tune a smaller set of parameters. The key question is how to augment the LM architecture and decide which subset of pretrained parameters to tune. One line of research learns a task-specific parameter mask (Zhao et al., 2020; Radiya-Dixit and Wang, 2020). Another line of research inserts new modules with trainable parameters. For example, Zhang et al. (2020a) trains a "side" network that is fused with the pretrained model via summation; adapter-tuning inserts task-specific layers (adapters) between each layer of the pretrained LM (Houlsby et al., 2019; Lin et al., 2020; Rebuffi et al., 2017; Pfeiffer et al., 2020). Compared to this line of work, which tunes around 3.6% of the LM parameters, our method obtains a further 30x reduction in task-specific parameters, tuning only 0.1% while maintaining comparable performance on table-to-text tasks.
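For reference, the adapters referred to above are essentially small bottleneck layers with a residual connection, inserted inside each Transformer layer while the pretrained weights stay frozen. The sketch below is illustrative, in the style of Houlsby et al. (2019), rather than code from any of the cited papers; the bottleneck width is an arbitrary choice.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual.
    Only these weights are trained; the surrounding pretrained layer is frozen."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)  # start close to the identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

Because such modules are added inside every Transformer layer, the per-task parameter count grows with model depth and width (the 2–4% and 3.6% figures quoted above), whereas prefix-tuning adds no new layers and stores only the prefix vectors.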
Prompting. Prompting is a way of leveraging a pretrained LM by prepending instructions and a few examples to the task input and generating the task output from the LM. For autoregressive LMs, the most successful form of prompting is GPT-3's in-context learning (Brown et al., 2020), which uses manually designed prompts to adapt its generation for different tasks in few-shot settings. For masked LMs like BERT and RoBERTa (Liu et al., 2019), prompt engineering has been explored for natural language understanding tasks (Jiang et al., 2020; Schick and Schütze, 2020). For example, AutoPrompt (Shin et al., 2020) searches for a sequence of discrete trigger words and concatenates it with each input to elicit sentiment or factual knowledge from BERT and RoBERTa. In contrast with AutoPrompt, our method optimizes continuous prefixes, which are more expressive (§7.2); moreover, we focus on language generation tasks. Continuous prompts have also been used to improve GPT-2's performance on natural language understanding tasks, in both few-shot and full data settings. In a followup work, Prompt-tuning (Lester et al., 2021) simplifies our approach and applies it to T5 (Raffel et al., 2020), demonstrating that the performance gap between fine-tuning and p*-tuning vanishes as the model size grows.

Continuous vectors have been used to steer LMs; for example, Subramani et al. (2020) showed that a pretrained LSTM language model can reconstruct arbitrary sentences by optimizing a continuous vector for each sentence, making the vector input-specific. In contrast, prefix-tuning optimizes a task-specific prefix that applies to all instances of that task.

3 Problem Statement

Consider a conditional generation task where the input x is a context and the output y is a sequence of tokens. We focus on two tasks, shown in Figure 2 (right): in table-to-text, x corresponds to a linearized data table and y is a textual description; in summarization, x is an article and y is a summary.

3.1 Autoregressive LM
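As a purely illustrative example of the two input/output formats defined in the problem statement, the snippet below builds an (x, y) pair for each task. The helper name and the "key: value | key: value" linearization are assumptions that follow the Starbucks example from Section 1; the exact format used for the paper's datasets may differ.

def linearize_table(table: dict) -> str:
    """Flatten a data table into a single string for table-to-text generation."""
    return " | ".join(f"{key}: {value}" for key, value in table.items())

# Table-to-text: x is a linearized data table, y is a textual description.
x_table = linearize_table({"name": "Starbucks", "type": "coffee shop"})
y_table = "Starbucks serves coffee."
assert x_table == "name: Starbucks | type: coffee shop"

# Summarization: x is an article, y is a summary.
x_article = "..."  # full article text
y_summary = "..."  # reference summary

In the autoregressive setting discussed next, x and y are typically concatenated into a single token sequence, and the prefix is trained so that the frozen LM assigns high likelihood to the y tokens.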