Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
Yao Lu† Max Bartolo† Alastair Moore‡ Sebastian Riedel† Pontus Stenetorp† †Department of Computer Science, University College London ‡Mishcon de Reya LLP {yao.lu,m.bartolo,s.riedel,p.stenetorp}@cs.ucl.ac.uk [email protected]
Abstract 100
When primed with only a handful of training 90 samples, very large pretrained language models such as GPT-3, have shown competitive results 80 when compared to fully-supervised fine-tuned
large pretrained language models. We demon- 70 strate that the order in which the samples are provided can be the difference between near SST-2 Accuracy (%) 60 state-of-the-art and random guess performance: Essentially some permutations are “fantastic” and some not. We analyse this phenomenon 50
in detail, establishing that: it is present across 0.1 0.3 0.8 1.5 2.7 6.7 13 175 model sizes (even for the largest current mod- Model Parameters (Billion) els), it is not related to a specific subset of sam- Figure 1: Two-shot, full permutation performance and ples, and that a given good permutation for one standard deviation between sample orders for different model is not transferable to another. While one sizes of GPT-family models (GPT-2 and GPT-3). could use a development set to determine which permutations are performant, this would devi- ate from the few-shot setting as it requires addi- as sentiment analysis, they produce text classifica- tional annotated data. Instead, we use the gener- ative nature of the language models to construct tion results that can match those of fully supervised an artificial development set and based on en- models. This type of few shot "In Context Learn- tropy statistics of the candidate permutations ing" (Brown et al., 2020) enables practitioners to from this set we identify performant prompts. address challenging tasks with minimal annotation Our method improves upon GPT-family mod- overhead and without the need to instantiate a new els by on average 13% relative across eleven large language model for each new task. different established text classification tasks. A core component of in-context learning is the 1 Introduction text-based prompt that serves as the context. Com- posing a prompt requires: (i) text linearisation us- Big pretrained language models (PLMs) (Devlin ing a template; and (ii) training sample concatena- tion (See Table1 for a concrete example). Finding arXiv:2104.08786v1 [cs.CL] 18 Apr 2021 et al., 2019; Peters et al., 2018; Raffel et al., 2020; Liu et al., 2019; Yang et al., 2019; Rad- a template that optimises the performance of in- ford et al., 2019) have shown remarkable behaviour context learning has attracted a lot of attention in when conditioned with an appropriate textual con- the field (Shin et al., 2020; Gao et al., 2020; Schick text (Petroni et al., 2019, 2020; Jiang et al., 2020; and Schütze, 2020; Jiang et al., 2020). However, to Shin et al., 2020; Davison et al., 2019). For ex- the best of our knowledge, no existing work studies ample, when conditioned with a long document the effect of the sample order on in-context learning and a “TL;DR:” token, they generate a summary of performance. said document, and when design cloze-task such Perhaps counter-intuitively, we find that the right as “The theory of relativity was developed by __”, sample order can make as much of a difference as they generate the answer for this question. the right template. As can be seen in Figure1, Perhaps most strikingly, when primed with a con- some permutations have comparable performance text consisting of very few training examples, such (over 85% accuracy) to supervised training for sen-
Example (the greatest musicians, 1) training set (redundant concept, 0) Review: the greatest musicians. Sentiment: positive linearization