Arxiv:2104.08786V1 [Cs.CL] 18 Apr 2021

Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity Yao Luy Max Bartoloy Alastair Moorez Sebastian Riedely Pontus Stenetorpy yDepartment of Computer Science, University College London zMishcon de Reya LLP {yao.lu,m.bartolo,s.riedel,p.stenetorp}@cs.ucl.ac.uk [email protected] Abstract 100 When primed with only a handful of training 90 samples, very large pretrained language models such as GPT-3, have shown competitive results 80 when compared to fully-supervised fine-tuned large pretrained language models. We demon- 70 strate that the order in which the samples are provided can be the difference between near SST-2 Accuracy (%) 60 state-of-the-art and random guess performance: Essentially some permutations are “fantastic” and some not. We analyse this phenomenon 50 in detail, establishing that: it is present across 0.1 0.3 0.8 1.5 2.7 6.7 13 175 model sizes (even for the largest current mod- Model Parameters (Billion) els), it is not related to a specific subset of sam- Figure 1: Two-shot, full permutation performance and ples, and that a given good permutation for one standard deviation between sample orders for different model is not transferable to another. While one sizes of GPT-family models (GPT-2 and GPT-3). could use a development set to determine which permutations are performant, this would devi- ate from the few-shot setting as it requires addi- as sentiment analysis, they produce text classifica- tional annotated data. Instead, we use the generative nature of the language models to construct tion results that can match those of fully supervised an artificial development set and based on en- models. This type of few shot "In Context Learn- tropy statistics of the candidate permutations ing" (Brown et al., 2020) enables practitioners to from this set we identify performant prompts. address challenging tasks with minimal annotation Our method improves upon GPT-family mod- overhead and without the need to instantiate a new els by on average 13% relative across eleven large language model for each new task. different established text classification tasks. A core component of in-context learning is the 1 Introduction text-based prompt that serves as the context. Com- posing a prompt requires: (i) text linearisation us- Big pretrained language models (PLMs) (Devlin ing a template; and (ii) training sample concatenation (See Table1 for a concrete example). Finding arXiv:2104.08786v1 [cs.CL] 18 Apr 2021 et al., 2019; Peters et al., 2018; Raffel et al., 2020; Liu et al., 2019; Yang et al., 2019; Rad- a template that optimises the performance of in- ford et al., 2019) have shown remarkable behaviour context learning has attracted a lot of attention in when conditioned with an appropriate textual con- the field (Shin et al., 2020; Gao et al., 2020; Schick text (Petroni et al., 2019, 2020; Jiang et al., 2020; and Schütze, 2020; Jiang et al., 2020). However, to Shin et al., 2020; Davison et al., 2019). For ex- the best of our knowledge, no existing work studies ample, when conditioned with a long document the effect of the sample order on in-context learning and a “TL;DR:” token, they generate a summary of performance. said document, and when design cloze-task such Perhaps counter-intuitively, we find that the right as “The theory of relativity was developed by __”, sample order can make as much of a difference as they generate the answer for this question. the right template. As can be seen in Figure1, Perhaps most strikingly, when primed with a con- some permutations have comparable performance text consisting of very few training examples, such (over 85% accuracy) to supervised training for sen- Example (the greatest musicians, 1) training set (redundant concept, 0) Review: the greatest musicians. Sentiment: positive linearization Review: redundant concept. Sentiment: negative Review: the greatest musicians. Sentiment: positive. Review: redundant concept. Sentiment: negative concatenation OR Review: redundant concept. Sentiment: negative. Review: the greatest musicians. Sentiment: positive Figure 2: Training sample permutations for the in- context learning setting. The concatenation of training Table 1: Procedures for prompt construction. samples as well as test data converts the classification task into a sequence generation task. timent classifications, while others perform close to random (around 50%). This order sensitivity is 2 Order Sensitivity and Prompt Design universal across models, and although increasing the model size somewhat addresses it, the prob- In this section, we study the relationship between lem is still present for some text classification tasks permutation performance and various factors. Un- (Subj, TREC, and CB) for models with billions of less otherwise specified, we use a fixed random parameters. subset of four samples with a balanced label distri- In our analysis, we find no common denomina- bution from the SST-2 dataset and consider all 24 tors between performant sample orders and they possible sample order permutations. This setup is are not transferable across different model sizes illustrated in Figure2. and tasks. In a fully-supervised setting, we could rely on a development set to select among sample Size (almost) does not matter We evaluate the orders. However, this is not desirable in a few-shot order permutations for four different sizes of GPT- setting where the size of the development set is 2 (0.1B–1.5B)1 and four different sizes of GPT-3 very limited. Instead, we use the generative nature (2.7B–175B). As we can observe in Figure1, mod- of language models to construct an unlabelled artifi- els can obtain remarkable few-shot performance. cial development set and refer to it as a probing set. We see that the GPT2-XL (1.5B) model can even Since there are no reliable labels available for the surpass 90% accuracy given just four samples. This probing set, we instead use predicted label distribu- result is comparable to those of supervised models tion statistics and propose an entropy-based metrics trained on more than 60,000 samples. However, to measure the quality of candidate prompts over the performance variation of different permutations the probing set. Experimental results show that remain a big issue, especially for “smaller” mod- we can achieve on average 13% relative improve- els.2 The same model can exhibit nearly perfect ment across eleven different established text classi- behaviour given one sample order, but then fall fication tasks on all different sizes (four orders of back to be on par with a random baseline for an- magnitude) of pretrained language models. other. While increasing the model size (by a few To summarise, our contributions are as follows: order of magnitudes) can sometimes alleviate the issue, it still cannot resolve it entirely (especially 1. We study the order sensitivity of in-context if we consider tasks other than SST-2). In contrast, learning, which we show is crucial for the different initialisations of supervised fine-tuning success of pretrained language models for few- approaches typically result in less than 1% stan- shot learning. dard deviation for their test set performance (Gao 2. We propose a simple, straightforward, et al., 2020). generation-based probing method to identify performant prompts without requiring addi- Adding training samples does not significantly tional data. reduce variance To further explore the sensitivity of few-shot prompts, we increase the number 3. Our probing method is universally applica- of training samples, and then sample a subset of at ble and effective across different sizes of pre- 1 trained language models over different types We can also refer these models as GPT2-base, GPT2- medium, GPT2-Large, and GPT2-XL. of datasets. We achieve on average a 13% rel- 2The smallest model in our experiment is the same size as ative improvement over a wide range of tasks. BERT-base. 100 1 GPT2-Large (0.8B) 175B -0.17 -0.23 -0.35 -0.14 0.05 0.27 -0.22 1.00 GPT2-XL (1.5B) 90 13B -0.24 0.01 -0.12 0.01 0.12 0.04 1.00 -0.22 Spearman Correlation 6.7B -0.10 -0.26 0.19 -0.03 0.13 1.00 0.04 0.27 80 2.7B 0.07 -0.11 0.10 -0.27 1.00 0.13 0.12 0.05 70 1.5B -0.24 0.20 -0.04 1.00 -0.27 -0.03 0.01 -0.14 SST-2 Accuracy(%) 60 Model Parameters 0.8B 0.23 0.08 1.00 -0.04 0.10 0.19 -0.12 -0.35 0.3B 0.09 1.00 0.08 0.20 -0.11 -0.26 0.01 -0.23 50 0.1B 1.00 0.09 0.23 -0.24 0.07 -0.10 -0.24 -0.17 1 2 4 8 16 32 N-shot Training Examples 0 0.1B 0.3B 0.8B 1.5B 2.7B 6.7B 13B 175B Figure 3: Order sensitivity using different numbers of training samples. Figure 4: Permutation performance correlation between different models. 3 250 positive 85 most 24 different orderings. We use the GPT2-XL negative 80 (1.5B) and GPT2-Large (0.8B) models for this ex- 200 75 150 periment. In the results shown in Figure3, we can 70 100 65 observe that increasing the number of training sam- Number of Examples SST-2 Accuracy (%) 60 ples leads to increases in performance. However, 50 55 0 a high level of variance remains even with a large 51.6 85.2 SST-2 Accuracy(%) orginal calibrated number of samples and can even increase. Based on this, we draw the conclusion that order sensitiv- Figure 5: Left: Predicted SST-2 label distribution under ity is likely to be a fundamental issue of in-context different prompts.

Arxiv:2104.08786V1 [Cs.CL] 18 Apr 2021

Generative Adversarial Networks (Gans)

Dynamic Factor Graphs for Time Series Modeling

Energy-Based Generative Adversarial Network

Discriminative Recurrent Sparse Auto-Encoders

Perceptron 1 History of Artificial Neural Networks

Path Towards Better Deep Learning Models and Implications for Hardware

What's Wrong with Deep Learning?

Unsupervised Learning with Regularized Autoencoders

Deep Learning - Review Yann Lecun, Yoshua Bengio & Geoffrey Hinton

Deep Learning

Deep Generative Models Shenlong Wang ● Why Unsupervised Learning?

Generative Adversarial Networks (Gans)