Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity

Yao Lu† Max Bartolo† Alastair Moore‡ Sebastian Riedel† Pontus Stenetorp† †Department of Computer Science, University College London ‡Mishcon de Reya LLP {yao.lu,m.bartolo,s.riedel,p.stenetorp}@cs.ucl.ac.uk [email protected]

Abstract 100

When primed with only a handful of training 90 samples, very large pretrained language models such as GPT-3, have shown competitive results 80 when compared to fully-supervised fine-tuned

large pretrained language models. We demon- 70 strate that the order in which the samples are provided can be the difference between near SST-2 Accuracy (%) 60 state-of-the-art and random guess performance: Essentially some permutations are “fantastic” and some not. We analyse this phenomenon 50

in detail, establishing that: it is present across 0.1 0.3 0.8 1.5 2.7 6.7 13 175 model sizes (even for the largest current mod- Model Parameters (Billion) els), it is not related to a specific subset of sam- Figure 1: Two-shot, full permutation performance and ples, and that a given good permutation for one standard deviation between sample orders for different model is not transferable to another. While one sizes of GPT-family models (GPT-2 and GPT-3). could use a development set to determine which permutations are performant, this would devi- ate from the few-shot setting as it requires addi- as sentiment analysis, they produce text classifica- tional annotated data. Instead, we use the gener- ative nature of the language models to construct tion results that can match those of fully supervised an artificial development set and based on en- models. This type of few shot "In Context Learn- tropy statistics of the candidate permutations ing" (Brown et al., 2020) enables practitioners to from this set we identify performant prompts. address challenging tasks with minimal annotation Our method improves upon GPT-family mod- overhead and without the need to instantiate a new els by on average 13% relative across eleven large language model for each new task. different established text classification tasks. A core component of in-context learning is the 1 Introduction text-based prompt that serves as the context. Com- posing a prompt requires: (i) text linearisation us- Big pretrained language models (PLMs) (Devlin ing a template; and (ii) training sample concatena- tion (See Table1 for a concrete example). Finding arXiv:2104.08786v1 [cs.CL] 18 Apr 2021 et al., 2019; Peters et al., 2018; Raffel et al., 2020; Liu et al., 2019; Yang et al., 2019; Rad- a template that optimises the performance of in- ford et al., 2019) have shown remarkable behaviour context learning has attracted a lot of attention in when conditioned with an appropriate textual con- the field (Shin et al., 2020; Gao et al., 2020; Schick text (Petroni et al., 2019, 2020; Jiang et al., 2020; and Schütze, 2020; Jiang et al., 2020). However, to Shin et al., 2020; Davison et al., 2019). For ex- the best of our knowledge, no existing work studies ample, when conditioned with a long document the effect of the sample order on in-context learning and a “TL;DR:” token, they generate a summary of performance. said document, and when design cloze-task such Perhaps counter-intuitively, we find that the right as “The theory of relativity was developed by __”, sample order can make as much of a difference as they generate the answer for this question. the right template. As can be seen in Figure1, Perhaps most strikingly, when primed with a con- some permutations have comparable performance text consisting of very few training examples, such (over 85% accuracy) to supervised training for sen-

Example (the greatest musicians, 1) training set (redundant concept, 0) Review: the greatest musicians. Sentiment: positive linearization

Review: redundant concept. Sentiment: negative Review: the greatest musicians. Sentiment: positive. Review: redundant concept. Sentiment: negative concatenation OR Review: redundant concept. Sentiment: negative. Review: the greatest musicians. Sentiment: positive Figure 2: Training sample permutations for the in- context learning setting. The concatenation of training Table 1: Procedures for prompt construction. samples as well as test data converts the classification task into a sequence generation task. timent classifications, while others perform close to random (around 50%). This order sensitivity is 2 Order Sensitivity and Prompt Design universal across models, and although increasing the model size somewhat addresses it, the prob- In this section, we study the relationship between lem is still present for some text classification tasks permutation performance and various factors. Un- (Subj, TREC, and CB) for models with billions of less otherwise specified, we use a fixed random parameters. subset of four samples with a balanced label distri- In our analysis, we find no common denomina- bution from the SST-2 dataset and consider all 24 tors between performant sample orders and they possible sample order permutations. This setup is are not transferable across different model sizes illustrated in Figure2. and tasks. In a fully-supervised setting, we could rely on a development set to select among sample Size (almost) does not matter We evaluate the orders. However, this is not desirable in a few-shot order permutations for four different sizes of GPT- setting where the size of the development set is 2 (0.1B–1.5B)1 and four different sizes of GPT-3 very limited. Instead, we use the generative nature (2.7B–175B). As we can observe in Figure1, mod- of language models to construct an unlabelled artifi- els can obtain remarkable few-shot performance. cial development set and refer to it as a probing set. We see that the GPT2-XL (1.5B) model can even Since there are no reliable labels available for the surpass 90% accuracy given just four samples. This probing set, we instead use predicted label distribu- result is comparable to those of supervised models tion statistics and propose an entropy-based metrics trained on more than 60,000 samples. However, to measure the quality of candidate prompts over the performance variation of different permutations the probing set. Experimental results show that remain a big issue, especially for “smaller” mod- we can achieve on average 13% relative improve- els.2 The same model can exhibit nearly perfect ment across eleven different established text classi- behaviour given one sample order, but then fall fication tasks on all different sizes (four orders of back to be on par with a random baseline for an- magnitude) of pretrained language models. other. While increasing the model size (by a few To summarise, our contributions are as follows: order of magnitudes) can sometimes alleviate the issue, it still cannot resolve it entirely (especially 1. We study the order sensitivity of in-context if we consider tasks other than SST-2). In contrast, learning, which we show is crucial for the different initialisations of supervised fine-tuning success of pretrained language models for few- approaches typically result in less than 1% stan- shot learning. dard deviation for their test set performance (Gao 2. We propose a simple, straightforward, et al., 2020). generation-based probing method to identify performant prompts without requiring addi- Adding training samples does not significantly tional data. reduce variance To further explore the sensitiv- ity of few-shot prompts, we increase the number 3. Our probing method is universally applica- of training samples, and then sample a subset of at ble and effective across different sizes of pre- 1 trained language models over different types We can also refer these models as GPT2-base, GPT2- medium, GPT2-Large, and GPT2-XL. of datasets. We achieve on average a 13% rel- 2The smallest model in our experiment is the same size as ative improvement over a wide range of tasks. BERT-base. 100 1 GPT2-Large (0.8B) 175B -0.17 -0.23 -0.35 -0.14 0.05 0.27 -0.22 1.00 GPT2-XL (1.5B)

90 13B -0.24 0.01 -0.12 0.01 0.12 0.04 1.00 -0.22 Spearman Correlation 6.7B -0.10 -0.26 0.19 -0.03 0.13 1.00 0.04 0.27 80

2.7B 0.07 -0.11 0.10 -0.27 1.00 0.13 0.12 0.05 70 1.5B -0.24 0.20 -0.04 1.00 -0.27 -0.03 0.01 -0.14 SST-2 Accuracy(%)

60 Model Parameters 0.8B 0.23 0.08 1.00 -0.04 0.10 0.19 -0.12 -0.35

0.3B 0.09 1.00 0.08 0.20 -0.11 -0.26 0.01 -0.23 50

0.1B 1.00 0.09 0.23 -0.24 0.07 -0.10 -0.24 -0.17 1 2 4 8 16 32 N-shot Training Examples 0

0.1B 0.3B 0.8B 1.5B 2.7B 6.7B 13B 175B Figure 3: Order sensitivity using different numbers of training samples. Figure 4: Permutation performance correlation between different models.

3 250 positive 85 most 24 different orderings. We use the GPT2-XL negative

80 (1.5B) and GPT2-Large (0.8B) models for this ex- 200 75

150 periment. In the results shown in Figure3, we can 70

100 65 observe that increasing the number of training sam- Number of Examples SST-2 Accuracy (%) 60 ples leads to increases in performance. However, 50 55 0 a high level of variance remains even with a large 51.6 85.2 SST-2 Accuracy(%) orginal calibrated number of samples and can even increase. Based on this, we draw the conclusion that order sensitiv- Figure 5: Left: Predicted SST-2 label distribution under ity is likely to be a fundamental issue of in-context different prompts. Right: 2-shot Calibrated performance of full permutations on GPT2-XL (1.5B). learning regardless of the number of training sam- ples. In Figure4 we visualise this correlation and we Performant prompts are not transferable can observe relative low correlation between differ- across models We find that a specific permuta- ent models. For example, the 175B and 2.7B model tion’s performance may drop from 88.7% to 51.6% only obtains a correlation value of 0.05, this means by changing the underlying model from GPT2-XL a good permutation for the 2.7B model has no guar- (1.5B) to GPT2-Large (0.8B). This suggests that a antee that it will also yield good performance on particular permutation working well for one model 175B model. does not imply that it will provide good results for another model. To validate this hypothesis, we Degenerate behaviour of bad prompts We per- use full permutation (of 4 examples) – 24 different form error analysis across performant and non- orderings as prompts. We then perform predic- performant prompts and observe that the majority tion conditioned on these prompts using different of failing prompts suffer from highly unbalanced models. For each model, we will have 24 differ- predicted label distributions (see Figure5). An in- ent accuracy scores for full permutation prompts. tuitive way to address this would be by calibrating It there exists a common pattern for performant the output distribution, along the lines of Zhao et al. prompts, then we should observe high correlation (2021). However, we find that although calibra- between 24 different accuracy scores across differ- tion leads to significantly higher performance, the ent models. variance remains high (see Figure5). We calculate the pairwise Spearman’s rank cor- 3 Methodology relation coefficient with using accuracy scores, the behaviour of permutations is seemingly random The previous section demonstrates that prompt or- across different sizes of the same model. der can have a substantial effect on performance, with some orderings of the same prompts for the 3Bounded at the lower limit by the total number of samples given, and at the upper limit as there can be up to 64! possible same model providing random performance, and orders. other “better” orderings providing performance

Figure 6: Our probing set construction method, showing the various possible ordering permutations of the randomly selected training samples, the probe generation for each permutation, and the concatenation of each into a probing set. competitive with supervised approaches. This sug- We then define a full permutation function group gests that there could be various ways of selecting of n training samples, F = {fm}, m = 1, ··· , n!, 0 prompt orders to achieve better performance, but where each function fm takes input S , and outputs the challenge is to do so automatically and without cm, the concatenation of a unique permutation. In the need for further annotation (e.g., a development our case, sampling four training samples at random set). gives up to 24 possible ordering permutations of Hence, in this section, we explore the question the transformed samples. of: “How can we automatically generate a ‘prob- For each prompt candidate cm, we then sam- ing set’ to find performant prompt orderings”? We ple from the language model to obtain the probing approach this by: (i) for a randomly-selected set of sequence gm ∼ P (·|cm; θ), where θ denotes the training samples, we use every possible ordering parameters of the pretrained language model. We permutation of this set as candidates; (ii) construct- stop decoding from the language model upon gen- ing a probing set by querying the language model erating the special end-of-sentence token defined using all candidate prompts as context; and (iii) by a template, or reach the generation length limit. using this probing set to identify the best ordering Our probing set construction method is illustrated by ranking them using a probing metrics. in Figure6, where the objective is to generate a probing set that shares a similar distribution to the 3.1 Sampling from the Language Model to training samples. Construct a Probing Set We run this sampling process for all possible We propose a simple methodology for automati- prompt ordering permutations and extract probing cally constructing a “probing set” by directly sam- samples from them (T −1(g)). Then concatenate pling from the language model itself. This ap- extracted samples together to form the probing set proach makes it possible to generate probing sets D = T −1(g ) ⊕ ... ⊕ T −1(g ). Although the automatically, without access to any additional 1 n! probing set contains predicted label for each sen- data. tence, there is no guarantee on the validity of these Concretely, given a set of training samples S = labels. Therefore, we discard these labels from the {(x , y )}, i = 1, ··· , n, where x and y denote i i i i probing set as we are only interested in sampling the sentence and label of the ith training sample, we probes from the language model corresponding to randomly select n = 4 samples to form part of the the input distribution. prompt. We then define a transformation T map- ping each sample into natural language space, such that ti = T (xi, yi). ti is therefore a text sequence 3.2 Probing Metrics of the ith training sample using the template defined by T . In this work, we use a simple transformation Once we have constructed a probing set for a given function T such that T (xi, yi) = input:xi type:yi. set of samples, we can now use that probing set This transforms each sample into a standard for- to identify the best possible prompt ordering for mat sentence, which we linearise each element that particular sample set. Here, we explore two in the set into natural language space defined as methods for selecting the best ordering: Global 0 S = {ti}, i = 1, ··· , n. Entropy (GlobalE), and Local Entropy (LocalE). Global Entropy (GlobalE) The motivation be- Dataset # of Classes Avg. Len. Balanced hind GlobalE is to identify prompts of specific SST-2 (Socher et al., 2013) 2 12.4 Yes SST-5 (Socher et al., 2013) 5 23.1 No sample orderings that avoid the issue of extremely MR (Pang and Lee, 2005) 2 25.7 Yes unbalanced predictions described in the previous CR (Hu and Liu, 2004) 2 22.1 Yes MPQA (Wiebe et al., 2005) 2 3.9 Yes section. In GlobalE, we compute the predicted la- Subj (Pang and Lee, 2004) 2 28.9 Yes 0 0 bel yˆ for data point (x , y ) under context c as TREC (Voorhees and Tice, 2000) 6 11.6 No i i i m AGNews (Zhang et al., 2015) 4 53.8 Yes follows: DBPedia (Zhang et al., 2015) 14 65.5 Yes CB (De Marneffe et al., 2019) 3 69.7/8.4 No 0 RTE (Dagan et al., 2005) 2 55.3/11.9 Yes yˆi,m = argmax P (v|cm ⊕ T (xi); θ) (1) v∈V

For each label v ∈ V (where V denotes the Table 2: Statistics of evaluation datasets, average length is calculated based on GPT-2 sentence-piece length. For target label set), we compute the label probability sentence-pair tasks, we report each sentence’s average over the probing set as: length separately. P 1 v i {yˆi,m=v} pm = (2) |D| two different sizes(with 2.7B, and 175B parame- ters). Due to limited context window size (up to We then use the predicted category label entropy 1024 word-pieces for the GPT-2 series of models), as the GlobalE score for c as follows: m we use a 4-shot setting for all datasets except AG- X v v News and DBPedia. Unlike Zhao et al.(2021), we GlobalEm = −pm log pm (3) v∈V use the convention of a few-shot setting, which can easily be generalised to the k-shot setting where k samples are used for each class. Our experiments Local Entropy (LocalE) The motivation behind are based on the open-source checkpoints of GPT-2 LocalE is that if a model is overly confident for all models and access to the OpenAI GPT-3 API. For probing inputs, then it is likely that the model is not probing set generation, we restrict the maximum behaving as desired. At the very least, it is poorly generation length to 128. We also use sampling calibrated, which could also be an indication of with a temperature, t, of 2, and we also make use poor capability to appropriately differentiate be- of block n-gram repetitions (Paulus et al., 2018) to tween classes. Similar to the GlobalE computation, encourage diverse generation during the decoding we calculate the prediction probability of a data stage. 0 0 point (xi, yi) over the target labels v ∈ V under We use 24 different permutations for each set context cm, as follows: of randomly selected training samples, and use 5

v 0 different sets (except for GPT-3 with 175B parame- 0 0 pi,m = P(x ,y )∼D(v|cm ⊕ T (xi); θ), v ∈ V (4) i i ters, where we only do 1 set with 12 different per- We then calculate the average prediction entropy mutation due to the high monetary cost) for each per data point as the LocalE score: experiment, giving a total of 120 runs. We report the mean and standard deviation of the correspond- ing evaluation metric over 5 different sets for all P P v v −p log p experiments. LocalE = i v∈V i,m i,m (5) m |D| For performant prompt selection, we rank candi- date prompts using the LocalE and GlobalE prob- Once we have a way to score each prompt order- ing metrics over the automatically generated prob- ing, based on its effect against the probing set, we ing set. We then select k ranked samples by highest can rank each prompt ordering by performance as entropy values, where k = 4 in our experiments, measured by the GlobalE or LocalE metrics respec- of the available 24 permutations as performant tively. prompts. Finally, we use these performant prompts 4 Experimental Setup to evaluate performance on various datasets and demonstrate both better performance and reduced We use four different sizes of GPT-2 (Radford et al., variance. We also provide results for a majority 2019) (with 0.1B, 0.3B, 0.8B, and 1.5B parame- baseline, which always predicts the majority label teers), and two sizes of GPT-3 (Brown et al., 2020) in the dataset, as a lower-bound of model perfor- SST-2 SST-5 DBPedia MR CR MPQA Subj TREC AGNews RTE CB Majority 50.9 23.1 9.4 50.0 50.0 50.0 50.0 18.8 25.0 52.7 51.8

GPT-2 0.1B 58.97.8 29.04.9 44.99.7 58.67.6 58.46.4 68.97.1 52.10.7 49.24.7 50.811.9 49.72.7 50.11.0 LocalE 65.23.9 34.43.4 53.34.9 66.06.3 65.03.4 72.56.0 52.91.3 48.03.9 61.05.9 53.03.3 49.91.6 GlobalE 63.85.8 35.82.0 56.14.3 66.45.8 64.82.7 73.54.5 53.01.3 46.13.7 62.15.7 53.03.0 50.31.6

GPT-2 0.3B 61.013.2 25.95.9 51.77.0 54.27.8 56.79.4 54.58.8 54.47.9 52.64.9 47.710.6 48.82.6 50.25.3 LocalE 75.34.6 31.03.4 47.13.7 65.26.6 70.96.3 67.67.2 66.79.3 53.03.9 51.27.3 51.81.0 47.14.2 GlobalE 78.75.2 31.75.2 58.35.4 67.05.9 70.76.7 68.36.9 65.810.1 53.34.6 59.67.2 51.11.9 50.33.7

GPT-2 0.8B 74.510.3 34.78.2 55.012.5 64.613.1 70.912.7 65.58.7 56.49.1 56.52.7 62.211.6 53.22.0 38.88.5 LocalE 81.15.5 40.34.7 56.77.5 82.64.2 85.43.8 73.64.8 70.44.2 56.21.7 62.78.1 53.31.6 38.45.2 GlobalE 84.84.1 46.91.1 67.73.6 84.32.9 86.72.5 75.83.1 68.66.5 57.22.3 70.73.6 53.51.5 41.24.5

GPT-2 1.5B 66.810.8 41.76.7 82.62.5 59.111.9 56.99.0 73.98.6 59.710.4 53.13.3 77.67.3 55.01.4 53.84.7 LocalE 76.78.2 45.13.1 83.81.7 78.15.6 71.88.0 78.53.6 69.75.8 53.63.1 79.33.7 56.81.1 52.63.9 GlobalE 81.83.9 43.54.5 83.91.8 77.95.7 73.46.0 81.42.1 70.96.0 55.53.0 83.91.2 56.31.2 55.14.6

GPT-3 2.7B 78.010.7 35.36.9 81.11.8 68.012.9 76.811.7 66.510.3 49.12.9 55.34.4 72.94.8 48.61.9 50.40.7 LocalE 81.06.0 42.34.7 80.31.7 75.64.1 79.05.5 72.55.8 54.24.2 54.02.6 72.34.6 50.41.9 50.50.8 GlobalE 80.24.2 43.24.3 81.20.9 76.13.8 80.33.4 73.04.3 54.34.0 56.72.0 78.11.9 51.31.8 51.20.8

GPT-3 175B 93.90.4 55.12.1 95.81.0 94.30.6 90.21.1 83.11.9 76.86.4 72.22.9 84.02.1 71.82.9 73.73.5 LocalE 93.80.5 56.11.8 96.20.6 94.30.7 90.50.7 83.72.1 81.44.9 72.14.1 85.60.8 73.30.8 71.41.5 GlobalE 93.90.6 55.41.1 96.30.6 94.10.5 89.80.8 81.81.0 81.92.5 74.12.6 85.11.0 72.51.5 78.62.9

Table 3: Our main results on subset of validation set. To fit data within GPT-2 model context window size, We use 1-shot for DBPedia, 2-shot for AGNews, 4-shot for others datasets. All the baseline results are calculated based on 5 different random seeds over 24 train context permutations. LocalE and GlobalE results are calculated based on top 4 context permutations using our proposed approach. For the GPT-3 175B, we only use 1 seed with 12 different permutations due to limited computation budget.

mance. Our selected performant prompts also demonstrate considerably lower variance than using all candi- 4.1 Evaluation Datasets date prompts. It is worth noting that, with Lo- Similar to previous work (Gao et al., 2020; Zhao calE and GlobalE probing, our best performance et al., 2021), we use eleven text classification on most tasks can even approach the levels of su- datasets ranging from sentiment classification to pervised models. textual entailment. Further details of the datasets Performant permutation selection is a safe op- are provided in Table2. For evaluation, we sub- tion for in-context learning We find that for sample 256 samples from the validation sets for models that suffer from high prompt variance, our all datasets to control for the GPT-3 monetary in- prompt selection process can show large improve- ference costs as it requires the usage of a paid-for ments – up to 30% relative improvement. Fur- API. thermore, for tasks with low initial prompt perfor- 5 Results mance variance, our method does not negatively im- pact performance. In the other words, performant We report experimental results in Table3. We ob- prompt selection provides marginal improvement serve consistent improvements of our approaches, at worse, and on average a 13% relative improve- both LocalE and GlobalE, across all tasks. ment in the most cases.

Entropy-based probing is effective for perfor- Sentence-pair tasks remain challenging for mant prompt selection regardless of model size smaller-sized models even with performant per- We find that GlobalE achieves, on average, a 13% mutation selection For the CB and RTE datasets, relative improvement across the 11 different sen- the performance of GPT-2 models is not signif- tence classification tasks in comparison to prompts icantly different from that of a random baseline. that do not make use of probing. LocalE provides Despite this, we find that our method for identify- results slightly inferior to GlobalE, with an average ing performant prompts can still provide minimal 9.6% relative improvement over the baseline model. performance gains, although these are still within the levels of a random guess, or majority vote. One Prompt Design for Pretrained Language Mod- reason this could be that, for these particular sizes els The core challenge of prompt design is to of models on these tasks, no good prompt exists. convert training data (if it exists) into text sequence. As such, optimising the prompt is not particularly Most work on prompt design focuses on how to effective in this setting. This is further supported make prompts more compatible with language by the observation that prompt selection can con- models. Petroni et al.(2019) uses human experi- siderably improve performance on both CB and ence to design natural language sentences and then RTE at larger model sizes (particularly so for the perform token prediction given the input context. GPT-3 175B parameter model). In fact, we find However, hand-crafted templates require signifi- that prompt selection using GlobalE improves per- cant human effort and is likely to end up with sub- formance by 4.9% for GPT-3 175B on CB. This optimal performance. Recent work has explored indicates that our method is widely applicable to automatic template construction, PET (Schick and all model sizes, and across all tasks, as long as they Schütze, 2020) uses cloze-style task to construct already possess some existing classification ability templates, LM-BFF (Gao et al., 2020) uses external that can be improved through prompt design. language model to generate templates, and Auto- Prompt (Shin et al., 2020) uses gradient-guided search to find templates that maximise the perfor- 6 Related Work mance. Jiang et al.(2020) uses mining-based ap- proach to create multiple diversity templates auto- Unified interface Design for NLP Most previ- matically. All the previous work regarding prompt ous work focus on the shared-parameters models, design focuses on the textual quality of the prompt pretrain on some tasks, then fine-tune for different and to the best of my knowledge none has stud- tasks, e.g. Elmo (Peters et al., 2018), BERT (De- ied the order sensitivity of prompts. These work vlin et al., 2019), etc. Eventually, leading to multi- are orthogonal to ours and as the approaches are ple task-specific models. There has for some time complementary and we hope there is potential to been attempts to design a unified interface for NLP combine them to achieve higher levels of perfor- tasks, dating back to the pre-PLM era, Kumar et al. mance and robustness. (2016) claim that it is possible to cast most tasks in natural language processing into question answer- 7 Conclusion ing (QA) problems. Similarly, T5 (Raffel et al., We show that few-shot learning using prompt- 2020) also explores this line by converting NLP based approaches suffers from order sensitivity. tasks into sequence generation, then joint fine-tune Some orderings of prompts can lead to better perfor- multiple tasks with the same underlying pretrained mance than others.We proposed a probing method language model. In parallel with these works, GPT- without relying on external data. The method 2 (Radford et al., 2019) shows that appending trig- can effectively select performant prompts across a ger tokens (e.g., tl;dr) at the end of language model wide-range of NLP tasks and model sizes. input can cause language models to behave like summarisation models. The zero-shot capability of language models shows the potential to unify NLP References tasks into a language modelling framework where Tom B Brown, Benjamin Mann, Nick Ryder, Melanie fine-tuning is not necessary to achieve good perfor- Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind mance. Furthermore, GPT-3 (Brown et al., 2020) Neelakantan, Pranav Shyam, Girish Sastry, Amanda shows that task-agnostic, few-shot performance Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165. can be improved by scaling up language models. It can sometimes even become competitive with Ido Dagan, Oren Glickman, and Bernardo Magnini. prior state-of-the-art fine-tuning approaches even 2005. The pascal recognising textual entailment chal- lenge. In Challenges Workshop, without fine-tuning. Given the language model as pages 177–190. Springer. a unified NLP task interface, we can design text prompts as inputs to induce the desired answer Joe Davison, Joshua Feldman, and Alexander M Rush. 2019. Commonsense knowledge mining from pre- from language models. Our work focuses on the trained models. In Proceedings of the 2019 Confer- prompts design with a specific emphasis on the ence on Empirical Methods in Natural Language Pro- prompt’s order sensitivity. cessing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Computational Linguistics: Human Language Tech- pages 1173–1178. nologies, Volume 1 (Long Papers), pages 2227–2237.

Marie-Catherine De Marneffe, Mandy Simons, and Ju- Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim dith Tonhauser. 2019. The commitmentbank: Investi- Rocktäschel, Yuxiang Wu, Alexander H Miller, and gating projection in naturally occurring discourse. In Sebastian Riedel. 2020. How context affects lan- proceedings of Sinn und Bedeutung, pages 107–124. guage models’ factual predictions. In Automated Knowledge Base Construction. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, bidirectional transformers for language understand- Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and ing. In Proceedings of the 2019 Conference of the Alexander Miller. 2019. Language models as knowl- North American Chapter of the Association for Com- edge bases? In Proceedings of the 2019 Confer- putational Linguistics: Human Language Technolo- ence on Empirical Methods in Natural Language Pro- gies, Volume 1 (Long and Short Papers), pages 4171– cessing and the 9th International Joint Conference 4186. on Natural Language Processing (EMNLP-IJCNLP), Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. pages 2463–2473, Hong Kong, China. Association Making pre-trained language models better few-shot for Computational Linguistics. learners. arXiv preprint arXiv:2012.15723. A. Radford, Jeffrey Wu, R. Child, David Luan, Dario Minqing Hu and Bing Liu. 2004. Mining and summa- Amodei, and Ilya Sutskever. 2019. Language models rizing customer reviews. In Proceedings of the tenth are unsupervised multitask learners. In OpenAI Blog. ACM SIGKDD international conference on Knowl- edge discovery and data mining, pages 168–177. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Wei Li, and Peter J Liu. 2020. Exploring the limits Neubig. 2020. How can we know what language of transfer learning with a unified text-to-text trans- models know? Transactions of the Association for former. Journal of Machine Learning Research, 21:1– Computational Linguistics, 8:423–438. 67.

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, Timo Schick and Hinrich Schütze. 2020. It’s not just James Bradbury, Ishaan Gulrajani, Victor Zhong, Ro- size that matters: Small language models are also main Paulus, and Richard Socher. 2016. Ask me few-shot learners. arXiv preprint arXiv:2009.07118. anything: Dynamic memory networks for natural lan- guage processing. In International conference on Taylor Shin, Yasaman Razeghi, Robert L Logan IV, machine learning, pages 1378–1387. PMLR. Eric Wallace, and Sameer Singh. 2020. Autoprompt: Eliciting knowledge from language models with Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- automatically generated prompts. arXiv preprint dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, arXiv:2010.15980. Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized pretraining ap- Richard Socher, Alex Perelygin, Jean Wu, Jason proach. arXiv preprint arXiv:1907.11692. Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for Bo Pang and Lillian Lee. 2004. A sentimental education: semantic compositionality over a sentiment treebank. Sentiment analysis using subjectivity summarization In Proceedings of the 2013 conference on empiri- based on minimum cuts. In Proceedings of the 42nd cal methods in natural language processing, pages Annual Meeting of the Association for Computational 1631–1642. Linguistics (ACL-04), pages 271–278. Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting Ellen M Voorhees and Dawn M Tice. 2000. Building a class relationships for sentiment categorization with question answering test collection. In Proceedings respect to rating scales. In Proceedings of the 43rd of the 23rd annual international ACM SIGIR confer- Annual Meeting of the Association for Computational ence on Research and development in information Linguistics (ACL’05), pages 115–124. retrieval, pages 200–207. Romain Paulus, Caiming Xiong, and Richard Socher. Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. 2018. A deep reinforced model for abstractive sum- Annotating expressions of opinions and emotions marization. In International Conference on Learning in language. Language resources and evaluation, Representations. 39(2):165–210. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car- Gardner, Christopher Clark, Kenton Lee, and Luke bonell, Ruslan Salakhutdinov, and Quoc V Le. Zettlemoyer. 2018. Deep contextualized word repre- 2019. Xlnet: Generalized autoregressive pretrain- sentations. In Proceedings of the 2018 Conference of ing for language understanding. arXiv preprint the North American Chapter of the Association for arXiv:1906.08237. Xiang Zhang, Junbo Zhao, and Yann Lecun. 2015. Character-level convolutional networks for text classi- fication. Advances in Neural Information Processing Systems, 2015:649–657. Tony Z Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improv- ing few-shot performance of language models. arXiv preprint arXiv:2102.09690. Dataset Prompt Label Mapping Review: contains no wit , only labored gags SST-2 positive/negative Sentiment: negative Review: apparently reassembled from the cutting-room floor of any given daytime soap . SST-5 terrible/bad/okay/good/great Sentiment: terrible Review: lame sweet home leaves no southern stereotype unturned . MR negative/positive Sentiment: negative Review: bluetooth does not work on this phone . CR negative/positive Sentiment: negative Review: dangerous situation MPQA negative/positive Sentiment: negative Input: too slow , too boring , and occasionally annoying . Subj subjective/objective Type: subjective Question: When did the neanderthal man live ? description/entity/expression/ TREC Type: number human/location/number input: Wall St. Bears Claw Back Into the Black (Reuters). AGNews world/sports/business/technology type: business company/school/artist/athlete/politics/ input: CMC Aviation is a charter airline based in Nairobi Kenya. DBPedia transportation/building/nature/village/ type: company animal/plant/album/film/book premise: It was a complex language. Not written down but handed down. One might say it was peeled down. CB true/false/neither hypothesis: the language was peeled down prediction: true premise: No Weapons of Mass Destruction Found in Iraq Yet. RTE hypothesis: Weapons of Mass Destruction Found in Iraq. True/False prediction: False

Table 4: Prompt template and label mapping for different tasks. Notation Description Examples x sentence nice movie y label positive template-based transformation T (x) Review: nice movie without label Review: nice movie T (x,y) template-based transformation Sentiment: positive extract (sentence, label) pair T −1(T (x,y)) (nice movie, positive) from text sequence

Table 5: Examples of transformation notations.