Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level

Ruiqi Zhong Dhruba Ghosh Dan Klein Jacob Steinhardt Computer Science Division, University of California, Berkeley {ruiqi-zhong, djghosh13, klein, jsteinhardt}@berkeley.edu

Abstract Prior works hint at differing answers. Hendrycks Larger language models have higher accu- et al.(2020) and Desai and Durrett(2020) find racy on average, but are they better on ev- that larger pretrained models consistently improve ery single instance (datapoint)? Some work out-of-distribution performance, which implies that suggests larger models have higher out-of- they might be uniformly better at a finer level. distribution robustness, while other work sug- Henighan et al.(2020) claim that larger pretrained gests they have lower accuracy on rare sub- image models have lower downstream classifica- groups. To understand these differences, we tion loss for the majority of instances, and they investigate these models at the level of indi- vidual instances. However, one major chal- predict this trend to be true for other data modal- lenge is that individual predictions are highly ities (e.g. text). On the other hand, Sagawa et al. sensitive to noise in the randomness in train- (2020) find that larger non-pretrained models per- ing. We develop statistically rigorous meth- form worse on rare subgroups; if this result gener- ods to address this, and after accounting for alizes to pretrained language models, larger models pretraining and finetuning noise, we find that will not be uniformly better. Despite all the in- our BERT-LARGE is worse than BERT-MINI direct evidence, it is still inconclusive how many on at least 1−4% of instances across MNLI, SST-2, and QQP, compared to the overall ac- instances larger pretrained models perform worse curacy improvement of 2−10%. We also on. find that finetuning noise increases with model A na¨ıve solution is to finetune a larger model, size, and that instance-level accuracy has mo- compare it to a smaller one, and find instances mentum: improvement from BERT-MINI to where the larger model is worse. However, this BERT-MEDIUM correlates with improvement approach is flawed, since model predictions are from BERT-MEDIUM to BERT-LARGE . Our noisy at the instance level. On MNLI in-domain findings suggest that instance-level predictions development set, even the same architecture with provide a rich source of information; we there- fore recommend that researchers supplement different finetuning seeds leads to different pre- model weights with model predictions. dictions on ∼8% of the instances. This is due to under-specification (D’Amour et al., 2020), where 1 Introduction there are multiple different solutions that can mini- Historically, large deep learning models (Peters mize the training loss. Since the accuracy improve- arXiv:2105.06020v1 [cs.CL] 13 May 2021 et al., 2018; Devlin et al., 2019; Lewis et al., 2020; ment from our BERT-BASE1 to BERT-LARGE is Raffel et al., 2019) have improved the state of 2%, most signals across different model sizes will the art on a wide range of tasks and leaderboards be dominated by noise due to random seeds. (Schwartz et al., 2014; Rajpurkar et al., 2016; Wang To account for the noise in pretraining and fine- et al., 2018), and empirical scaling laws predict tuning, we define instance accuracy as “how often that larger models will continue to increase per- a model correctly predicts an instance” (Figure1 formance (Kaplan et al., 2020). However, little is left) in expectation across pretraining and finetun- understood about such improvement at the instance ing seeds. We estimate this quantity by pretraining (datapoint) level. Are larger models uniformly bet- 10 models with different seeds, finetuning 5 times ter? In other words, are larger pretrained models for each pretrained models (Figure1 middle), and better at every instance, or are they better at some 1This is not the original release by Devlin et al.(2019); we instances, but worse at others? pretrained models ourselves. averaging across them. (L4/H256, 4 Layers with hidden dimension 256), However, this estimate is still inexact, and we SMALL (L4/H512), MEDIUM (L8/H512), BASE might falsely observe smaller models to be better (L12/H768), and LARGE (L24/H1024). For each at some instances by chance. Hence, we propose architecture we pre-trained models with 10 differ- a random baseline to estimate the fraction of false ent random seeds and fine-tuned each of them 5 discoveries (Section3, Figure1 right) and formally times (50 total) on each task; see Figure1 middle. upper-bound the false discoveries in Section4. Our Since pretraining is computationally expensive, we method provides a better upper bound than the clas- reduced the context size during pretraining from sical Benjamini-Hochberg procedure with Fisher’s 512 to 128 and compensated by increasing train- exact test. ing steps from 1M to 2M. AppendixA includes Using the 50 models for each size and our im- more details about pretraining and finetuning and proved statistical tool, we find that, on the MNLI their computational cost, and AppendixB verifies in-domain development set, the accuracy “decays” that our cost-saving changes do not affect accuracy from BERT-LARGE to BERT-MINI on at least ∼4% qualitatively. of the instances, which is significant given that the improvement in overall accuracy is 10%. These Notation. We use i to index an instance in the decaying instances contain more controversial or evaluation set, s for model sizes, P for pretraining wrong labels, but also correct ones (Section 4.2). seeds and F for finetuning seeds. c is a random Therefore, larger pretrained language models are variable of value 0 or 1 to indicate whether the not uniformly better. prediction is correct. Given the pretraining seed P s We make other interesting discoveries at the in- and the finetuning seed F , ci = 1 if the model of stance level. Section5 finds that instance-level size s is correct on instance i, 0 otherwise. To keep accuracy has momentum: improvement from MINI the notation uncluttered, we sometimes omit these to MEDIUM correlates with improvement from superscripts or subscripts if they can be inferred MEDIUM to LARGE . Additionally, Section6 at- from context. tributes variance of model predictions to pretrain- Unless otherwise noted, we present results on ing and finetuning random seeds, and finds that the MNLI in-domain development set in the main finetuning seeds cause more variance for larger paper. models. Our findings suggest that instance-level predictions provide a rich source of information; 3 Comparing Instance Accuracy we therefore recommend that researchers supple- ment model weights with model predictions. In this To find the instances where larger models are worse, spirit, we release all the pretrained models, model a na¨ıve approach is to finetune a larger pretrained predictions, and code here: https://github.com/ model, compare it to a smaller one, and find in- ruiqi-zhong/acl2021-instance-level. stances where the larger is incorrect but the smaller 2 Data, Models, and Predictions is correct. Under this approach, BERT-LARGE is worse than BERT-BASE on 4.5% of the instances To investigate model behavior, we considered dif- and better on 7%, giving an overall accuracy im- ferent sizes of the BERT architecture and fine- provement of 2.5%. tuned them on Quora Question Pairs (QQP2), However, this result is misleading: even if we Multi-Genre Natural Language Inference (MNLI; compare two BERT-BASE model with different Williams et al.(2020)), and the Stanford Sen- finetuning seeds, their predictions differ on 8% of timent Treebank (SST-2; Socher et al.(2013)). the instances, while their accuracies differ only by To account for pretraining and finetuning noise, 0.1%; Table1 reports this baseline randomness we averaged over multiple random initializations across model sizes. Changing the pretraining seed and training data order, and thus needed to pre- also changes around 2% additional predictions be- train our own models rather than downloading yond finetuning. off the internet. Following Turc et al.(2019) we Table1 also reports the standard deviation of trained 5 architectures of increasing size: MINI overall accuracy, which is about 40 times smaller. 2https://www.quora.com/q/quoradata/First-Quora- Such stability starkly contrasts with the noisiness at Dataset-Release-Question-Pairs the instance level, which poses a unique challenge. Instances Instance 1 Instance i Average across seeds Model Finetuning Seeds Instance F(*)1 F(*)2 Seed 1 Seed 2 Seed 3 Sizes Accuracy Pretrain P1 X ✓ X X Instance 1 X ✓ ✓ 66% MINI -ing P2 ✓ X … ✓ X Instance 2 X X X 0% Seeds P3 X ✓ ✓ ✓ Instance 3 ✓ X X 33% … … … Instance 4 ✓ ✓ ✓ 100% ✓ ✓ X ✓ LARGE ✓ X … ✓ X ✓ ✓ X X

Figure 1: Left: Each column represents the same architecture trained with a different seed. We calculate accuracy for each instance (row) by averaging across seeds (column), while it is usually calculated for each model by averaging across instances. Middle: A visual layout of the model predictions we obtain, which is a binary-valued tensor with 4 axes: model size s, instance i, pretraining seeds P and finetuning seeds F . Right: for each instance, we calculate the accuracy gain from MINI to LARGE and plot the histogram in blue, along with a random baseline in red. Since the blue distribution has a bigger left tail, smaller models are better at some instances.

ˆ s DiffFTune DiffPTrain Stdall accuracy estimates Acci , i.e. MINI 7.2% 10.7% 0.2% s s SMALL 7.2% 10.7% 0.3% s1 ˆ ˆ 2 ˆ 1 s2 ∆Acci := Acci − Acci . (3) MEDIUM 8.0% 10.7% 0.3% BASE 8.5% 10.6% 0.2% BASE ˆ We histogram LARGE∆Acci in Figure2 (b). We LARGE 8.6% 10.1% 0.2% observe a unimodal distribution centered near 0, with tails on both sides. Therefore, the estimated Table 1: Larger model sizes are at the bottom rows. differences for some instances are negative. DiffFTune: how much do the predictions differ, if two models have the same pretraining seed but different However, due to estimation noise, we might finetuning seeds F ? DiffPTrain: the difference if the falsely observe this accuracy decay by chance. pretraining seeds P are different. Stdall: the standard Therefore, we introduce a random baseline deviation of overall accuracy, around 40 times smaller s1 0 s2 ∆Acc to control for these false discoveries. Re- than DiffFTune. call that we have 10 smaller pretrained models and 10 larger ones. Our baseline splits these into a Instance-Level Metrics To reflect this noisiness, group A of 5 smaller + 5 larger, and another group we define the instance accuracy Accs to be how B of the remaining 5 + 5. Then the empirical accu- i A B often models of size s predict instance i correctly, racies Accˆ and Accˆ are identically distributed, so we take our baseline s1 ∆Acc0 to be the differ- s s s2 Acc := [c ]. (1) A B i EP,F i ence Accˆ − Accˆ . We visualize and compare how to calculate s1 ∆Accˆ and s1 ∆Acc0 in Figure3. The expectation is taken with respect to the pre- s2 s2 BASE 0 training and finetuning randomness P and F . We We histogram this baseline LARGE∆Acc in s ˆ s Figure2 (b), and find that our noisy estimate estimate Acci via the empirical average Acci ac- BASE ˆ cross 10 pretraining × 5 finetuning runs. LARGE∆Acc has a larger left tail than the baseline. ˆ s This suggests that decaying instances exist. We We histogram Acci in Figure2 (a). On most instances the model always predicts correctly or similarly compare MINI to LARGE in Figure2 (c) incorrectly (Accˆ = 0 or 1), but a sizable fraction and find an even larger left tail. of accuracies lie between the two extremes. Recall that our goal is to find instances where 4 Quantifying the Decaying Instances larger models are less accurate, which we refer The left tail of ∆Accˆ noisily estimates the frac- to as decaying instances. We therefore study the tion of decaying instances, and the left tail of the instance difference between two model sizes s and 1 random baseline ∆Acc0 counts the false discov- s , defined as 2 ery fraction due to the noise. Intuitively, the true s1 s2 s1 fraction of decaying instances can be captured by s ∆Acci := Acci − Acci , (2) 2 the difference of these left tails, and we formally which is estimated by the difference between the quantify this below. (a) BASE vs. LARGE , Acc (b) BASE vs. LARGE , ∆Acc (c) MINI vs. LARGE , ∆Acc

Figure 2: (a) The distribution of instance accuracy Accˆ i. (b, c) Histogram of instance difference estimate (x-axis), 0 ∆Accˆ (blue) and its baseline ∆Acc (red) compares BASE and LARGE . To better visualize, we truncated the density (y-axis) above 2. Since the blue histogram has a larger left tail than the red one, there are indeed instances where larger models are worse.

s1 LARGE MINI MINI ̂ Proof Sketch Suppose we observe c and Aĉc = 0.75 - Aĉc = 0.42 = LARGEΔAcc = .33 R1...2k s2 LARGE MINI c , where there are 2k different random F(*)1 F(*) F(*)3 F(*)1 F(*) F(*)3 R2k+1...4k 2 2 P1 X ✓ ✓ P1 X ✓ ✓ A 3 Pretrain- Group A Aĉc seeds for each model size . Then P ✓ ✓ X P2 ✓ X X = 0.58 ing 2 - ✓ ✓ P3 X X X Seeds P3 X B 2k 4k Group B Aĉc = 0.58 P4 ✓ ✓ ✓ P4 X ✓ ✓ 1 X s X s = ∆Accˆ := ( c 1 − c 2 ), (7) Min i i Rj ,i Rj ,i LargeΔAcc′ = 0 2k j=1 j=2k+1 Figure 3: The tables are model predictions with visual and hence the discovery rate Decay(ˆ t) is defined notations established in Figure1 middle. ∆Accˆ (blue) is the mean difference between the left and the right as table, each corresponding to a model size. The random |T | 0 1 X baseline ∆Acc (red) is the mean difference between Decay(ˆ t) := 1[∆Accˆ ≤ t]. (8) |T | group A (orange) cells and group B (green), which are i=1 identically and independently distributed. For the random baseline estimator, we have

k 3k Suppose instance i is drawn from the empirical 1 X s X s ∆Acc0 := ( c 1 + c 2 (9) evaluation distribution. Then we can define the true i 2k Rj ,i Rj ,i j=1 j=2k+1 decaying fraction Decay as 2k 4k X X − cs1 − cs2 ), Decay := [∆Acc < 0]. (4) Rj ,i Rj ,i Pi i j=k+1 j=3k+1 ∆Acc Since i is not directly observable and and the false discovery control Decay0 is defined ∆Accˆ i is noisy, we add a buffer and only consider as instances with ∆Accˆ i ≤ t, which makes it more |T | likely (but still uncertain) that the true ∆Acci < 0. 1 X ˆ Decay0(t) := 1[∆Acc0 ≤ t]. (10) We denote this “discovery fraction” Decay(t) as |T | i i=1 ˆ ˆ Decay(t) := Pi[∆Acci ≤ t]. (5) Formally, the theorem states that ˆ 0 Similarly, we define a baseline control (false Decay ≥ ER1...R4k [Decay(t)−Decay (t)], (11) 0 0 discovery fraction) Decay (t) := Pi[∆Acci ≤ t]. Hence, Decayˆ and Decay0 are the cumulative dis- which is equivalent to 0 tribution function of ∆Accˆ and ∆Acc (Figure4). |T | X ˆ We have the following theorem, which we for- (1[∆Acci < 0] − P[∆Acci ≤ t] (12) mally state and prove in AppendixD: i=1 0 Theorem 1 (Informal) If all the random seeds are +P[∆Acci ≤ t]) ≥ 0 independent, then for all thresholds t, 3We assumed even number of random seeds since we will mix half of the models from each size to compute the random 0 Decay ≥ E[Decay(ˆ t) − Decay (t)] (6) baseline s1 \ s2 MINI SMALL BASE LARGE MINI N/A 9% 18% 21% SMALL 3% N/A 14% 18% BASE 6% 5% N/A 10% LARGE 6% 5% 2% N/A

Table 2: We lower-bound the fraction of instances that 6% improve when model size changes from s1 (row) to s2 (column). For example, when model size decreases from LARGE to MINI , 6% of instances improve (i.e. decays). Figure 4: The cumulative distribution function of the ˆ 0 histogram in Figure2 (c); only the negative x-axis is Threshold Decay Decay Diff shown because it corresponds to decays. The maxi- t = −0.4 4.22% 3.49e−3 3.87% mum difference between the two curves (6%) is a lower ...... bound of the true decaying fraction. t = −0.9 0.91% 1.44e−7 0.91% t = −1.0 0.48% 2.06e−8 0.48%

Hence, we can declare victory if we can prove Table 3: Comparing MINI vs. LARGE by calculating that for all i, if ∆Acci ≥ 0, the discovery fraction Decayˆ , the false discovery con- trol Decay0, and their difference (Diff) under different 0 ˆ P[∆Acci ≤ t] ≥ P[∆Acci ≤ t]. thresholds t. LARGE is worse on at least ∼4% (maxi- mum Diff) of instances. 0 ˆ This is easy to see, since ∆Acci and ∆Acci are both binomial distributions with the same n, but are essentially the same as individual finetuning the first has a larger rate. 4  runs, except that the finetuning randomness is av- Roughly speaking, the true decaying fraction eraged out. Hence we obtain 10 independent sets is at least the difference between Decay(ˆ t) and of model predictions with different random seeds, Decay0(t) at every threshold t. Therefore, we take which allows us to apply Theorem1. the maximum difference between Decay(ˆ t) and We compare MINI to LARGE using these ensem- Decay0(t) to lower-bound the fraction of decaying bles and report the discovery Decayˆ and the base- instances.5 For example, Figure4 estimates the line Decay0 in Table3. Taking the maximum differ- true decaying fraction between MINI and LARGE ence across thresholds, we estimate at least ∼4% of to be at least 6%. decaying instances. This estimate is lower than the We compute this lower bound for other pairs of previous 6% estimate, which used the full set of 50 model sizes in Table2, and the full results across models’ predictions assuming they were indepen- other tasks and model size pairs are in AppendixC. dent. However, this is still a meaningful amount, In all of these settings we find a non-zero fraction given that the overall accuracy improvement from of decaying instances, and larger model size differ- MINI to LARGE is 10%. ences usually lead to more decaying instances. Unfortunately, applying Theorem1 as above 4.1 Fisher’s Test + Benjamini-Hochberg is not fully rigorous, since some finetuning runs Here is a more classical approach to lower-bound share the same pretraining seeds and hence are de- the decaying fraction. For each instance, we com- 6 pendent. To obtain a statistically rigorous lower pute a significance level α under the null hypothesis bound, we slightly modify our target of interest. In- that the larger model is better, using Fisher’s exact stead of examining individual finetuning runs, we test. We sort the significance levels ascendingly, ensemble our model across 5 different finetuning th and call the p percentile αp. Then we pick a runs for each pretraining seed; these predictions false discovery rate q (say, 25%), find the largest 4More details are in AppendixD. p s.t. αp < pq, and estimate the decaying fraction 5Adaptively picking the best threshold t depending on the to be at least p(1 − q). This calculation is known data may incur a slight upward bias. AppendixE estimates as the Benjamini-Hochberg procedure (Benjamini that the relative bias is at most 10% using a bootstrap method. 6Although we anticipate such dependencies do not cause a and Hochberg, 1995). substantial difference, as discussed in Appendix D.1. To compare our method with this classical ap- s1 s2 2 6 10 Correct Fine Wrong Unsure MINILARGE ours 1.9% 3.1% 4.0% MNLID 66% 17% 9% 5% MINILARGE BH 0.0% 0.9% 1.9% MNLIC 86% 5% 5% 1% BASELARGE ours 0.4% 0.9% 1.2% SST-2D 55% 8% 10% 25% BASELARGE BH 0.0% 0.0% 0.0% SST-2C 88% 4% 0% 6% QQPD 60% 26% 10% 2% Table 4: We compare our method to the Fisher’s exact QQPC 87% 10% 1% 0% test + Benjamin-Hochberg (BH) procedure described in Section4. For all different model size pairs and number Table 5: MINI vs. LARGE . We examine whether there of pretrained models available, ours always provides a are mislabels for the Decaying fractions (superscript D) higher (better) lower bound of the decaying fraction. and the rest of the dataset (Control group C ). The de- caying fraction contains more mislabels, but includes correct labels as well. proach, we estimate the lower bound of the decay- ing fraction for different pairs of model sizes with different numbers of pretrained models available. instance or an instance from the remaining dataset To make sure our choice of the false discovery rate as a control. We are blind to which group it comes q does not bias against the classical approach, we from. adaptively choose q to maximize its performance. For each task of MNLI, QQP, and SST-2, the first AppendixF includes the full results and Table4 is author annotated 100 instances (decay + control a representative subset. group) (Table5). We present all the annotated We find that our approach is more powerful, par- decaying instances in AppendixJ. ticularly when the true decaying fraction is likely to be small and only a few models are available, Conclusion We find that the decaying fraction which is usually the regime of interest. For exam- has more wrong or controversial labels, compared ple, across all pairs of model sizes, our approach to the remaining instances. However, even after only needs 2 random seeds (i.e. pretrained models) we adjust for the fraction of incorrect labels, the Decay to provide a non-zero lower bound on the decaying fraction still exceeds the false discovery fraction, while the classical approach sometimes control. This implies that MINI models are bet- fails to do this even with 10 seeds. Intuitively, when ter than LARGE models on some correctly labeled fewer seeds are available, the smallest possible sig- instances. The second author followed the same nificance level for each instance is larger than the procedure and reproduced the same qualitative re- decaying fraction, hence hurting the classical ap- sults. proach. However, we cannot find an interpretable pattern for these correctly labeled decaying instances by 4.2 Understanding the Decaying Instances simply eyeballing. We discuss future directions to discover interpretable categories in Section7. We next manually examine the decaying instances to see whether we can find any interpretable pat- 5 Correlation of Instance Difference terns. One hypothesis is that all the decaying frac- tions are in fact mislabeled, and hence larger mod- We next investigate whether there is a momen- els are not in fact worse on any instances. tum of instance accuracy increase: for example, To investigate this hypothesis, we examined the if the instance accuracy improves from MINI(s1) MINI ˆ group of instances where LARGE∆Acci ≤ −0.9. to MEDIUM(s2), is it more likely to improve from MINI is almost always correct on these instances, MEDIUM(s2) to LARGE(s3)? while LARGE is almost always wrong, and the false The na¨ıve approach is to calculate the Pearson MINI ˆ discovery fraction is . For each instance, we correlation coefficient between MEDIUM∆Acc and MEDIUM ˆ manually categorize it as either: 1) Correct, if the LARGE ∆Acc, and we find the correlation to be zero. label is correct, 2) Fine, if the label might be con- However, this is partly an artifact of accuracies be- troversial but we could see a reason why this label ing bounded in [0, 1]. If MEDIUM drastically im- is reasonable, 3) Wrong, if the label is wrong, or proves over MINI from 0 to 1, there is no room for 4) Unsure, if we are unsure about how to label LARGE to improve over MEDIUM. To remove this this instance. Each time we annotate, with 50% inherent negative correlation, we calculate the cor- probability we randomly sample either a decaying relation conditioned on the accuracy of the middle- (s1, s2, s3) ↓ Buckets→ 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 SMALL, MEDIUM, BASE 0.07 0.22 0.29 0.40 0.35 0.33 0.38 0.27 0.24 0.13 MINI, MEDIUM, LARGE 0.03 0.15 0.18 0.33 0.17 0.16 0.22 0.20 0.19 0.09

Table 6: Each row corresponds to a triplet of model sizes. Each column t represents a bucket that contains s instances with Accˆ 2 ∈ [t − 0.1, t]. Within each bucket, we calculate the Pearson correlation coefficient between s1 ∆Accˆ s2 ∆Accˆ the estimated accuracy improvements: s2 and s3 . These correlations are positive and become higher when model size differences are small.

MEDIUM sized model, Accˆ . Bias2 PretVar FineVar Therefore, we bucket instances by their esti- MINI 0.203 0.017 0.036 mated MEDIUM accuracy into intervals of size 0.1, SMALL 0.179 0.017 0.036 and we find the correlation to be positive within MEDIUM 0.157 0.014 0.040 each bucket (Table6, row 2). This fixes the prob- BASE 0.134 0.010 0.043 lem with the na¨ıve approach by getting rid of the LARGE 0.111 0.007 0.043 negative correlation, which could have misled us to believe that improvements by larger models are Table 7: The bias, pretraining variance, and finetuning variance for each model size, averaged across all test uncorrelated. instances. Finetuning variance is much larger than pre- We additionally find that the correlations be- training variance; larger models have larger finetuning tween improvements become stronger when model variance. size differences are smaller. Table6 row 1 re- ports results for another model size triplet with smaller size difference, i.e. (s1, s2, s3) = (SMALL, capturing “how wrong is the average prediction”, MEDIUM, BASE), and the correlation is larger for variance due to pretraining, and variance due to all buckets. Results for more tasks and size triplets finetuning seeds, respectively. are in AppendixG and the same conclusions hold We can directly estimate FineVar by first calcu- qualitatively. lating the sample variance across finetuning runs for each pretraining seed, and then averaging the 6 Variance at the Instance Level variances across the pretraining seeds. Estimating Section3 found that the overall accuracy has rela- PretVar is more complicated. A na¨ıve approach is tively low variance, but model predictions are noisy. to calculate the empirical variance, across pretrain- This section formally analyzes variance at the in- ing seeds, of the average accuracy across finetuning stance level. For each instance, we decompose its seeds. However, the estimated average accuracy for loss into three components: Bias2, variance due to each pretraining seed is noisy itself, which causes pretraining randomness, and variance due to fine- an upward bias on the PretVar estimate. We cor- tuning randomness. Formally, we consider the 0/1 rect this bias by estimating the variance of the esti- loss: mated average accuracy and subtracting it from the 2 na¨ıve estimate; see AppendixH for details, as well Li := 1 − ci = (1 − ci) , (13) as a generalization to more than two sources of ran- where ci is a random variable 0/1 indicating domness. Finally, we estimate Bias2 by subtracting whether the prediction is correct or incorrect, with the two variance estimates from the estimated loss. respect to randomness in pretraining and finetun- For each of these three quantities, Bias2, ing. Therefore, by bias-variance decomposition PretVar and FineVar, we estimate it for each in- and total variance decomposition, we have stance, average it across all instances in the evalua- 2 Li = Bias i + PretVari + FineVari, (14) tion set, and report it in Table7. The variances at the instance level are much larger than the variance where, by using P and F as pretraining and fine- of overall accuracy, by a factor of 1000. tuning random seeds: We may conclude from Table7 that larger mod- 2 2 els have larger finetuning variance and smaller pre- Bias i := (1 − EP,F [ci]) , (15) training variance. However, lower bias also inher- PretVar := Var [ [c ]], i P EF i ently implies lower variance. To see this, suppose FineVari := EP [VarF [ci]], a model has perfect accuracy and hence zero bias; conditioned on instance accuracy. We find that larger pretrained models are indeed worse on a non- trivial fraction of instances and have higher vari- ance due to finetuning seeds; additionally, instance accuracy improvements from MINI to MEDIUM cor- relate with improvements from MEDIUM to LARGE . Overall, we treated model prediction data as the central object and built analysis tools around them to obtain a finer understanding of model perfor- mance. We therefore refer to this paradigm as Figure 5: The pretraining variance conditioned on “instance level understanding as data mining”. Bias2 (the level of correctness). Each curve represents We discuss three key factors for this paradigm to a model size. Larger models have lower pretraining thrive: 1) scalability and the cost of obtaining pre- variance across all levels of bias. diction data, 2) other information to collect for each instance, and 3) better statistical tools. We analyze each of these aspects below. then it always predicts the same label (the correct one) and hence has zero variance. This might favor Scalability and Cost of Data Data mining is larger models and “underestimate” their variance, more powerful with more data. How easy is it since they have lower bias. Therefore, we calculate to obtain more model predictions? In our paper, and compare the variances conditioned on the bias, the main bottleneck is pretraining. However, once 2 2 2 i.e. PretVar(b ) := Ei[PretVari|Bias i = b ]. the pretrained models are released, individual re- We estimate PretVars(b2) using Gaussian pro- searchers can download them and only need to cess regression and plot it against b2 in Figure5. repeat the cheaper finetuning procedure. We find that larger models still have lower pre- Furthermore, model prediction data are under- training variance across all levels of bias on the shared: while many recent research papers share specific task of MNLI under the 0/1 loss. To fur- their code or even model weights to help reproduce ther check whether our conclusions are general, we the results, it is not yet a standard practice to share tested them on other tasks and under the squared all the model predictions. Since many researches 2 loss Li := (1 − pi) , where pi is the probability follow almost the same recipe of pretraining and assigned to the correct class. Below are the conclu- finetuning (McCoy et al., 2020; Desai and Durrett, sions that generally hold across different tasks and 2020; Dodge et al., 2020), much computation can loss functions. be saved if model predictions are shared. On the other hand, as the state of the art model size is Conclusion We find that 1) larger models have increasing at a staggering speed7, most researchers larger finetuning variance, 2) LARGE has smaller will not be able to run inference on a single instance. pretraining variance than BASE ; however, the or- The trend that models are becoming larger and dering between other sizes varies across tasks and more similar necessitate more prediction sharing. losses, and 3) finetuning variance is 2−8 times as large as pretraining variance, and the ratio is bigger Meta-Labels and Other Predictions Data min- for larger models. ing is more powerful with more types of informa- tion. One way to add information to each instance 7 Discussion and Future Directions is to assign “meta-labels”. In the HANS (McCoy et al., 2019) dataset, the authors tag each instance To investigate model behaviors at the instance level, with a heuristic 8 that holds for the training distri- we produced massive amounts of model predictions bution but fails on this instance. Naik et al.(2018a) in Section2 and treated them as raw data. To ex- and Ribeiro et al.(2020) associate each instance tract insights from them, we developed better met- rics and statistical tools, including a new method 7e.g. BERT (Devlin et al., 2019) has 340M parameters, to control the false discoveries, an unbiased estima- while Switch-Transformer has over 1 trillion parameters (Fe- dus et al., 2021). tor for the decomposed variances, and metrics that 8For example, “the label [entailment] is likely if the compute variance and correlation of improvements premise and the hypothesis have significant lexical overlap”. with a particular stress test type or subgroup, for ex- about the core challenges underlying instance level ample, whether the instance requires the model to understanding and inspire future work. reason numerically or handle negations. Nie et al. (2020) collects multiple human responses to esti- Acknowledgement mate human disagreement for each instance. This We thank Steven Cao, Cathy Chen, Frances Ding, meta-information can potentially help us identify David Gaddy, Colin Li, and Alex Wei for giving interpretable patterns for the disagreeing instances comments on the initial paper draft. We would also where one model is better than the other. On the like to thank the Google Cloud TPU team for their flip side, identifying disagreeing instances between hardware support. two models can also help us generate hypothesis and decide what subgroup information to annotate. We can also add performance information on References other tasks to each instance. For example, Pruk- sachatkun et al.(2020) studied the correlation Yoav Benjamini and Yosef Hochberg. 1995. Control- ling the false discovery rate: a practical and pow- between syntactic probing accuracy (Hewitt and erful approach to multiple testing. Journal of the Liang, 2019) and downstream task performance. Royal statistical society: series B (Methodological), Turc et al.(2019) and Kaplan et al.(2020) studied 57(1):289–300. the correlation between language modelling loss and the downstream task performance. However, Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large anno- they did not analyze correlations at the instance tated corpus for learning natural language inference. level. We may investigate whether their results In Proceedings of the 2015 Conference on Empiri- hold on the instance level: if an instance is easier cal Methods in Natural Language Processing, pages to tag by a probe or easier to predict by a larger 632–642, Lisbon, Portugal. Association for Compu- tational Linguistics. language model, is the accuracy likely to be higher? Alexander D’Amour, Katherine Heller, Dan Moldovan, Statistical Tools Data mining is more powerful Ben Adlam, Babak Alipanahi, Alex Beutel, with better statistical tools. Initially we used the Christina Chen, Jonathan Deaton, Jacob Eisen- Benjamini-Hochberg procedure with Fisher’s ex- stein, Matthew D Hoffman, et al. 2020. Un- act test, which required us to pretrain 10 models derspecification presents challenges for credibil- ity in modern machine learning. arXiv preprint to formally verify that the decaying instances ex- arXiv:2011.03395. ist. However, we later realized that 2 is in fact enough by using our approach introduced in Sec- Shrey Desai and Greg Durrett. 2020. Calibration of tion4. We could have saved 80% of the compu- pre-trained transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Lan- tation for pretraining if this approach was known guage Processing (EMNLP), pages 295–302, Online. before we started. Association for Computational Linguistics. Future work can explore more complicated met- rics and settings. We compared at most 3 different Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of model sizes at a time, and higher order comparisons deep bidirectional transformers for language under- require novel metrics. We studied two sources of standing. In Proceedings of the 2019 Conference randomness, pretraining and finetuning, but other of the North American Chapter of the Association sources of variation can be interesting as well, e.g. for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), differences in pretraining corpus, different model pages 4171–4186, Minneapolis, Minnesota. Associ- checkpoints, etc. To deal with more sophisticated ation for Computational Linguistics. metrics, handle different sources and hierarchies of randomness, and reach conclusions that are robust Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. to noises at the instance level, researchers need to 2020. Fine-tuning pretrained language models: develop new inference procedures. Weight initializations, data orders, and early stop- To conclude, for better instance level understand- ping. arXiv preprint arXiv:2002.06305. ing, we need to produce and share more prediction William Fedus, Barret Zoph, and Noam Shazeer. 2021. data, annotate more diverse linguistic properties, Switch transformers: Scaling to trillion parameter and develop better statistical tools to infer under models with simple and efficient sparsity. arXiv noises. We hope our work can inform researchers preprint arXiv:2101.03961. Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam for Computational Linguistics, pages 3428–3448, Dziedzic, Rishabh Krishnan, and Dawn Song. 2020. Florence, Italy. Association for Computational Lin- Pretrained transformers improve out-of-distribution guistics. robustness. In Proceedings of the 58th Annual Meet- ing of the Association for Computational Linguistics, Aakanksha Naik, Abhilasha Ravichander, Norman pages 2744–2751, Online. Association for Computa- Sadeh, Carolyn Rose, and Graham Neubig. 2018a. tional Linguistics. Stress test evaluation for natural language inference. In The 27th International Conference on Computa- Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, tional Linguistics (COLING), Santa Fe, New Mex- Christopher Hesse, Jacob Jackson, Heewoo Jun, ico, USA. Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. 2020. Scaling laws for autoregressive generative Aakanksha Naik, Abhilasha Ravichander, Norman modeling. arXiv preprint arXiv:2010.14701. Sadeh, Carolyn Rose, and Graham Neubig. 2018b. Stress test evaluation for natural language inference. John Hewitt and Percy Liang. 2019. Designing and In Proceedings of the 27th International Conference interpreting probes with control tasks. In Proceed- on Computational Linguistics, pages 2340–2353, ings of the 2019 Conference on Empirical Methods Santa Fe, New Mexico, USA. Association for Com- in Natural Language Processing and the 9th Inter- putational Linguistics. national Joint Conference on Natural Language Pro- cessing (EMNLP-IJCNLP), pages 2733–2743, Hong Yixin Nie, Xiang Zhou, and Mohit Bansal. 2020. What Kong, China. Association for Computational Lin- can we learn from collective human opinions on nat- guistics. ural language inference data? In Proceedings of the 2020 Conference on Empirical Methods in Natural Jared Kaplan, Sam McCandlish, Tom Henighan, Language Processing (EMNLP), pages 9131–9143, Tom B Brown, Benjamin Chess, Rewon Child, Scott Online. Association for Computational Linguistics. Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt arXiv preprint arXiv:2001.08361. Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word rep- Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. resentations. In Proceedings of the 2018 Confer- A continuously growing dataset of sentential para- ence of the North American Chapter of the Associ- phrases. In Proceedings of the 2017 Conference on ation for Computational Linguistics: Human Lan- Empirical Methods in Natural Language Processing, guage Technologies, Volume 1 (Long Papers), pages pages 1224–1234, Copenhagen, Denmark. Associa- 2227–2237, New Orleans, Louisiana. Association tion for Computational Linguistics. for Computational Linguistics. Mike Lewis, Yinhan Liu, Naman Goyal, Mar- Yada Pruksachatkun, Jason Phang, Haokun Liu, jan Ghazvininejad, Abdelrahman Mohamed, Omer Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Levy, Veselin Stoyanov, and Luke Zettlemoyer. Pang, Clara Vania, Katharina Kann, and Samuel R. 2020. BART: Denoising sequence-to-sequence pre- Bowman. 2020. Intermediate-task transfer learning training for natural language generation, translation, with pretrained language models: When and why and comprehension. In Proceedings of the 58th An- does it work? In Proceedings of the 58th Annual nual Meeting of the Association for Computational Meeting of the Association for Computational Lin- Linguistics, pages 7871–7880, Online. Association guistics, pages 5231–5247, Online. Association for for Computational Linguistics. Computational Linguistics. Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Kurt Keutzer, Dan Klein, and Joseph E. Gonzalez. Lee, Sharan Narang, Michael Matena, Yanqi Zhou, 2020. Train large, then compress: Rethinking model Wei Li, and Peter J Liu. 2019. Exploring the limits size for efficient training and inference of transform- of transfer learning with a unified text-to-text trans- ers. In International Conference on Machine Learn- former. arXiv preprint arXiv:1910.10683. ing. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and R. Thomas McCoy, Junghyun Min, and Tal Linzen. Percy Liang. 2016. SQuAD: 100,000+ questions for 2020. BERTs of a feather do not generalize to- machine comprehension of text. In Proceedings of gether: Large variability in generalization across the 2016 Conference on Empirical Methods in Natu- models with similar test set performance. In Pro- ral Language Processing, pages 2383–2392, Austin, ceedings of the Third BlackboxNLP Workshop on An- Texas. Association for Computational Linguistics. alyzing and Interpreting Neural Networks for NLP, pages 217–227, Online. Association for Computa- Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, tional Linguistics. and Sameer Singh. 2020. Beyond accuracy: Be- havioral testing of NLP models with CheckList. In Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Proceedings of the 58th Annual Meeting of the Asso- Right for the wrong reasons: Diagnosing syntactic ciation for Computational Linguistics, pages 4902– heuristics in natural language inference. In Proceed- 4912, Online. Association for Computational Lin- ings of the 57th Annual Meeting of the Association guistics. Shiori Sagawa, Aditi Raghunathan, Pang Wei Koh, and Percy Liang. 2020. An investigation of why over- parameterization exacerbates spurious correlations. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 8346–8356. PMLR. Lane Schwartz, Timothy Anderson, Jeremy Gwinnup, and Katherine Young. 2014. Machine translation and monolingual postediting: The AFRL WMT- 14 system. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 186–194, Baltimore, Maryland, USA. Association for Compu- tational Linguistics.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment tree- bank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Asso- ciation for Computational Linguistics. Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962v2. Alex Wang, Amanpreet Singh, Julian Michael, Fe- lix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis plat- form for natural language understanding. In Pro- ceedings of the 2018 EMNLP Workshop Black- boxNLP: Analyzing and Interpreting Neural Net- works for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics. Adina Williams, Tiago Pimentel, Hagen Blix, Arya D. McCarthy, Eleanor Chodroff, and Ryan Cotterell. 2020. Predicting declension class from form and meaning. In Proceedings of the 58th Annual Meet- ing of the Association for Computational Linguistics, pages 6682–6695, Online. Association for Computa- tional Linguistics. A Pretraining and Finetuning Details out-of-domain, or challenging datasets to obtain model predictions. Models trained on MNLI are Here we explain how to obtain the model predic- also evaluated on Stanford Natural Language In- tions, which are analyzed in later sections. To ob- ference (SNLI; (Bowman et al., 2015)), Heuristic tain these predictions under the “pretraining and Analysis for NLI Systems (HANS; (McCoy et al., finetuning” framework (Devlin et al., 2019), we 2019)), and stress test evaluations (STRESS; (Naik need to decide a model size, perform pretraining, et al., 2018b)). Models trained on QQP are also finetune on a training set with a choice of hyper- evaluated on Twitter Paraphrase Database (Twit- parameters, and test the model on an evaluation terPPDB; (Lan et al., 2017)). set. We discuss each bolded aspects below. Since pretraining introduces randomness, for Size Similar to Turc et al.(2019), we exper- each model size s, we pretrain 10 times with dif- imented with the following five model sizes, ferent random seed P ; since finetuning also intro- listed in increasing order: MINI (L4/H256) 9 , duces noise, for each pretrained model we pretrain SMALL (L4/H512), MEDIUM (L8/H512), BASE 5 times with different random seed F ; besides, we (L12/H768), and LARGE (L24/H1024). also evaluate the model at the checkpoints after E 1 2 epochs, where E ∈ [3, 3 3 , 3 3 , 4]. Pretraining We used the pretraining code from Pretraining 10 models for all 5 model sizes alto- Devlin et al.(2019) and the pre-training corpus gether takes around 3840 hours on TPU v3 with 8 from Li et al.(2020). Compared to the original cores. Finetuning all of them 5 times for all three BERT release, we used context size 128 instead of tasks in our paper requires around 1200 hours. 512, since computation cost grows quadratically with respect to context size; we also pretrained for B Compare Our Models to the Original 2M steps instead of 1M. Since we decreased the pre-training context length Training Set We consider 3 datasets: Quora to save computation, these models are not exactly Question Pairs (QQP) 10, Multi-Genre Natural Lan- the same as the original BERT release by Devlin guage Inference (MNLI; Williams et al.(2020)), et al.(2019) and Turc et al.(2019). We need to and the Stanford Sentiment Treebank (SST-2; benchmark our model against theirs to ensure that (Socher et al., 2013)). For QQP we used the of- the performance of our model is still reasonable ficial training split. For MNLI we used 350K out and the qualitative trend still holds. For each each of 400K instances from the original training split, size and task, we finetune the original model 5 and added the remaining 50K to the evaluation set, times and calculate the average of overall accuracy. since the original in-domain development set only The comparison can be seen in Table8. We find contains 10K examples. For SST-2, we mix the that our model does not substantially differ from training and development set of the original split, the original ones on QQP and SST-2. On MNLI, split the instances into 5 folds, train on four of them, the performance of our BERT-BASE and BERT- and evaluate on the remaining fold. LARGE is 2∼3% below the original release, but the qualitative trend that larger models have better Hyperparameters As in Turc et al.(2019), we accuracy still holds robustly. finetune 4 epochs for each dataset. For each task and model size, we tune hyperparameters in the C More Instance Difference Results following way: we first randomly split our new training set into 80% and 20%; then we finetune on Similar to Figure4, for all 10 pairs of model sizes the 80% split with all 9 combination of batch size and all in-distribution instances of MNLI, SST-2, ˆ [16, 32, 64] and learning rate [1e-4, 5e-5, 3e-5], and QQP, we plot the cumulative density of ∆Acc 0 ˆ 0 and choose the combination that leads to the best and ∆Acc , or say, Decay(t) and Decay (t) in Fig- average accuracy on the remaining 20%. ure6,7, and8. Additionally, for each pair of model sizes s1 and Evaluation Set After finetuning our pretrained s2, we estimate “how much instances are getting models, we evaluate them on a range of in-domain, better/worse accuracy?” by taking the maximum difference between the red curve and the blue curve. 94 Layers with hidden dimension 256 10https://www.quora.com/q/quoradata/First-Quora- We report these results for MNLI, SST-2, and QQP Dataset-Release-Question-Pairs in Table9. We find that larger model size gaps QQP MNLI SST-2 For the random baseline estimator, we have MINI orig 88.2% 74.6% 92.8% k 3k MINI ours 87.3% 74.3% 92.8% 1 X X ∆Acc0 := ( cs1 + cs2 (19) SMALL orig 89.1% 77.3% 93.9% i 2k Rj ,i Rj ,i j=1 j=2k+1 SMALL ours 88.7% 76.7% 93.9% 2k 4k MEDIUM orig 89.8% 79.6% 94.2% X X − cs1 − cs2 ), MEDIUM ours 89.5% 78.9% 94.2% Rj ,i Rj ,i j=k+1 j=3k+1 BASE orig 90.8% 83.8% 95.0% ours BASE 90.6% 81.2% 94.6% and the false discovery control Decay0 is defined LARGE orig 91.3% 86.8% 95.2% as LARGE ours 91.0% 83.8% 94.8% |T | 0 1 X 0 Table 8: Comparing our pretrained model (superscript Decay (t) := 1[∆Acci ≤ t]. (20) orig |T | ) to the original release by Devlin et al.(2019) and i=1 Turc et al.(2019) (superscript ours). All pretrained models are finetuned with the training set and tested on To reiterate, the definition of the true decay rate the in-distribution evaluation set described in Appendix is A. |T | 1 X Decay = 1[∆Acc < 0]. (21) |T | i lead to larger decaying fraction, but also larger i=1 improving fraction as well. Our goal is to prove that D Proof of Theorem1 ˆ 0 Decay ≥ ER1...R4k [Decay(t) − Decay (t)] (22) Formal Setup Our goal is to show that if all the random seeds are independent, Proof By re-arranging terms and linearity of ex- pectation, Equation 22 is equivalent to the follow- 0 ing Decay ≥ E[Decay(ˆ t) − Decay (t)] (16) |T | More concretely, suppose each instance is in- X (1[∆Acc < 0] − [∆Accˆ ≤ t] (23) dexed by i, the set of all instances is T , and the i P i s |T | i=1 random seed is R; then cR ∈ {0, 1} is a ran- 0 s +P[∆Acci ≤ t]) ≥ 0 dom |T | dimensional vector, where cR,i = 1 if the model of size s correctly predicts instance i un- Hence, we can declare victory if we can prove der the random seed R. We are comparing model that for all i, size s1 and s2, where s2 is larger; to keep nota- tion uncluttered, we omit these indexes whenever ˆ 1[∆Acci < 0] − P[∆Acci ≤ t] (24) possible. + [∆Acc0 ≤ t] ≥ 0 Suppose we observe cs1 and cs2 , P i R1...2k R2k+1...4k where there are 2k different random seeds for each To prove Equation 24, we observe that if Acc < 11 i model size . Then 0, since the probabilities are bounded by 0 and 1, 1 2k 4k its left-hand side must be positive. Therefore, we ˆ X s1 X s2 ∆Acci := ( c − c ), (17) only need to prove that 2k Rj ,i Rj ,i j=1 j=2k+1 ∆Acci ≥ 0 (25) and hence the discovery rate Decay(ˆ t) is defined 0 ˆ as ⇒ P[∆Acci ≤ t] ≥ P[∆Acci ≤ t], |T | 1 X which will be proved in Lemma1.  Decay(ˆ t) := 1[∆Accˆ ≤ t]. (18) |T | Lemma 1 i=1

11 We assumed even number of random seeds since we will ∆Acci ≥ 0 (26) mix half of the models from each size to compute the random 0 ˆ baseline ⇒ P[∆Acci ≤ t] ≥ P[∆Acci ≤ t], MNLI s1 \ s2 MINISMALLMEDIUMBASELARGE MINI 0.000 0.087 0.136 0.179 0.214 SMALL 0.033 0.000 0.089 0.139 0.180 MEDIUM 0.050 0.028 0.000 0.090 0.143 BASE 0.060 0.048 0.026 0.000 0.101 LARGE 0.059 0.052 0.040 0.021 0.000 QQP s1 \ s2 MINISMALLMEDIUMBASELARGE MINI 0.000 0.057 0.076 0.100 0.107 SMALL 0.019 0.000 0.039 0.073 0.084 MEDIUM 0.029 0.014 0.000 0.044 0.063 BASE 0.034 0.027 0.016 0.000 0.032 LARGE 0.036 0.031 0.027 0.016 0.000 SST-2 s1 \ s2 MINISMALLMEDIUMBASELARGE MINI 0.000 0.037 0.043 0.052 0.057 SMALL 0.010 0.000 0.015 0.031 0.036 MEDIUM 0.016 0.008 0.000 0.020 0.028 BASE 0.019 0.014 0.009 0.000 0.014 LARGE 0.020 0.017 0.015 0.008 0.000

Table 9: On QQP, MNLI in domain development set and SST-2 we lowerbound the fraction of instances that improves when model size changes from s1 (row) to s2 (column).

For m = 1, 2, define our data, since some finetuning runs share the same pretraining seeds. Therefore, the above proof no sm sm pi := ER[ci ], (27) longer holds. Specifically, Lemma1 fails because ∆Accˆ and ∆Acc0 are no longer binomial variables, then and the later does not necessarily dominate the ps1 ≤ ps2 (28) i i first. Here is a counter-example, if the seeds are s1 s2 Since ci and ci are both Bernoulli random vari- not entirely independent. s1 s2 ables with rate pi and pi respectively, we can Hypothetically, suppose we are comparing a write down the probability distribution of ∆Accˆ i smaller model s1 and a larger model s2. For the 0 and ∆Acci as the sum/difference of several bino- smaller model, with .1 probability it finds a per- mial variables, i.e. fect pretrained model that always predict correctly across all finetuning runs and with .9 probability it finds a bad pretrained model that predict always ˆ s2 s2 ∆Acci ∼ (Binom(k, pi ) + Binom(k, pi ) (29) incorrectly. For the larger model, with probability 1 s1 s1 − Binom(k, pi ) − Binom(k, pi ))/2k, it finds an average pretrained model that predict cor- rectly for .2 fraction of finetuning runs. The larger and model is on average better, because it has .2 > .1 probability to be correct. Hence, ∆Acc > 0 ∆Acc0i ∼ (Binom(k, ps2,i) + Binom(k, ps1,i) Suppose we observe 2 independent pretraining (30) seeds for each size and infinite number of fine- s ,i s ,i − Binom(k, p 1 ) − Binom(k, p 2 ))/2k tuning seeds for each pretraining seed, and let us consider the threshold -0.8. Then s1 s2 s2,i pi ≤ pi , Binom(k, p )) first order stochas- s1,i ˆ tically dominates Binom(k, p ). Therefore, P[∆Acci ≤ −0.8] (31) ∆Acc0i dominates ∆Accˆ , hence completing the 0 i = 0.01 ≥ 0 = P[∆Acci ≤ −0.8] (32) proof.  The event that ∆Accˆ i ≤ −0.8 happens with proba- D.1 Independent Seed Assumption bility 0.01 when both of the two small pretrained 0 We notice that Theorem1 requires the seeds R to models have good pretraining seeds, and ∆Acci is be independent. This assumption does not hold on at least -0.5 and will never be less than -0.8. (a) (b) (c) (d) (e)

(f) (g) (h) (i) (j)

Figure 6: Similar to Figure4, on MNLI in-distribution development set, for each pair of model sizes, we plot the cumulative density function of instance differences.

(a) (b) (c) (d) (e)

(f) (g) (h) (i) (j)

Figure 7: Similar to Figure4, on SST-2, for each pair of model sizes, we plot the cumulative density function of instance differences.

The key idea behind this counter-example is that bootstrapping. We first compute a best threshold even if the larger model has better average, the with 10 sampled smaller and larger pretrained mod- distribution of average finetuning accuracy for dif- els, and then compute the lowerbound L with this ferent pretraining seeds might not stochastically threshold on another sample of 10 smaller and dominate the one with lower average because of larger models. Intuitively, we use one bootstrap outliers. Hence, a priori, this is unlikely to happen sample (which contains 10 smaller pretrained mod- in practice, since pretraining variance is generally els and 10 larger pretrained models) as the devel- small, and we have multiple pretraining seeds to opment set to “tune the threshold”, and then use average out the outliers. Nevertheless, future work this threshold on a fresh bootstrap sample to com- is needed to make a more rigorous argument. pute the lowerbound. We refer to the lowerbound that uses the best threshold as L∗, and compute the ∗ E Upward Bias of Adaptive Thresholds relative error E[(L − L)]/E[L)], where the expec- tation is taken with respect to bootstrap samples. In section3 we picked the best threshold that can maximize the lowerbound, which can incur a slight upward bias. Here we estimate that the bias is at most 10% relative to the unbiased lowerbound with a bootstrapping method. We report all results in Table 10. In general, we We use the empirical distribution of 10 pre- find that the upward bias is negligible, which is at trained models as the ground truth distribution for most around 10%. (a) (b) (c) (d) (e)

(f) (g) (h) (i) (j)

Figure 8: Similar to Figure4, on QQP in-domain development set, for each pair of model sizes, we plot the cumulative density function of instance differences.

F Comparison with Significance Testing while on the rest 0.01% smaller completely wrong but bigger completely correct. Setting threshold We also experimented with the classical approach to be 2, our decay estimate Decayˆ is 0.01%, while that calculates the significance-level for each in- Decay0 = 0: since the models either completely stance and then use the Benjamini-Hochberg proce- predict correct or wrongly, there is never a false dure to lowerbound the decaying fraction. To make discovery. Therefore, our method can provide the sure that we are comparing with this approach tightest lowerbound 0.01% in this case. On the fairly, we lend it additional power by picking the other hand, since we only have 4 models in total, false discovery rate that can maximize the true dis- the lowest significance-level given by the fisher ex- covery counts. We report the decaying fraction on act test is 17%  0.1%, hence the discovery made MNLI in-domain development set found by this by the Benjamin-Hochberg procedure is 0. classical method and compare it with our method for different model size differences in Table 11; G More Results on Momentum we also simulate situations when we have fewer models. We report more results on the correlation between In general, we find that our method always pro- instance differences. Specifically, for one triplet vide a tighter (higher) lowerbound than the classical of model sizes (e.g. MINI ⇒ MEDIUM ⇒ LARGE method, and 2 models are sufficient to verify the ), for each group of instances that have similar ˆ MEDIUM existence (i.e. lowerbound > 0) of the decaying Acc , we calculate the correlation between fraction; in contrast, the classical method some- instance differences, i.e. the Pearson-R score be- MINI MEDIUM times fails to do this even with 10 models, e.g., tween MEDIUM∆Acc and LARGE ∆Acc. All results when comparing BASE to LARGE . can be seen in Table 12. Intuitively, our approach provides a better lower- We observe that bound because it better makes use of the infor- • For nearly all buckets, the improvements are mation that on most instances, both the smaller positively correlated. and the larger models agree and predict completely correctly or incorrectly12. For an extreme exam- • When model size gap becomes larger (e.g. ple, suppose we only observe 2 smaller models MINI, MEDIUM, LARGE has the largest model and 2 larger models, and infinite number of dat- size differences), the correlation decreases. apoints, whose predictions are independent. On 99.98% datapoints, both models have instance ac- H Loss Decomposition and Estimation curacy 1; on 0.01% datapoints, smaller model is completely correct while bigger completely wrong, In this section, under the bias-variance decomposi- tion and total variance decomposition framework, 12This is for intuition, though, and we do not need any assumption on the prior of instance accuracy, which requires we decompose loss into four components: bias, a Bayes interpretation. variance brought by pretraining randomness, by MNLI s1 \ s2 MINISMALLMEDIUMBASELARGE MINI 0.000 0.031 0.027 0.026 0.020 SMALL 0.108 0.000 0.027 0.023 0.019 MEDIUM 0.095 0.116 0.000 0.028 0.023 BASE 0.093 0.100 0.144 0.000 0.026 LARGE 0.097 0.103 0.117 0.149 0.000 QQP s1 \ s2 MINISMALLMEDIUMBASELARGE MINI 0.000 0.025 0.022 0.021 0.020 SMALL 0.127 0.000 0.040 0.020 0.020 MEDIUM 0.093 0.146 0.000 0.032 0.031 BASE 0.087 0.119 0.123 0.000 0.049 LARGE 0.090 0.105 0.079 0.106 0.000 SST-2 s1 \ s2 MINISMALLMEDIUMBASELARGE MINI 0.000 0.028 0.022 0.021 0.019 SMALL 0.117 0.000 0.047 0.031 0.029 MEDIUM 0.075 0.093 0.000 0.063 0.035 BASE 0.068 0.067 0.085 0.000 0.071 LARGE 0.071 0.067 0.060 0.098 0.000

Table 10: The same table as 10, except that we are now calculating the relative upward bias E[(L∗ − L)]/E[L)] as described in SectionE.

s1 s2 method 2 4 6 8 10 MINISMALL ours 0.004 0.011 0.016 0.020 0.023 MINISMALL BH 0.000 0.000 0.000 0.000 0.002 MINIMEDIUM ours 0.012 0.019 0.026 0.032 0.035 MINIMEDIUM BH 0.000 0.000 0.003 0.008 0.011 MINIBASE ours 0.019 0.028 0.035 0.040 0.042 MINIBASE BH 0.000 0.000 0.008 0.014 0.020 MINILARGE ours 0.019 0.027 0.031 0.037 0.040 MINILARGE BH 0.000 0.000 0.009 0.015 0.019 SMALLMEDIUM ours 0.002 0.006 0.010 0.015 0.017 SMALLMEDIUM BH 0.000 0.000 0.000 0.000 0.000 SMALLBASE ours 0.013 0.020 0.025 0.030 0.033 SMALLBASE BH 0.000 0.000 0.002 0.006 0.011 SMALLLARGE ours 0.015 0.021 0.026 0.031 0.033 SMALLLARGE BH 0.000 0.000 0.005 0.009 0.013 MEDIUMBASE ours 0.006 0.010 0.013 0.014 0.016 MEDIUMBASE BH 0.000 0.000 0.000 0.000 0.001 MEDIUMLARGE ours 0.010 0.014 0.019 0.022 0.023 MEDIUMLARGE BH 0.000 0.000 0.002 0.004 0.006 BASELARGE ours 0.004 0.005 0.009 0.010 0.012 BASELARGE BH 0.000 0.000 0.000 0.000 0.000

Table 11: We compare each pair of model sizes s1 and s2 and report the lower bound provided by our method and the Benjamin-Hochberg (BH) procedure. The numbers in column name denote how many pretrained model we used to obtain the lower bounds. MNLI. Buckets⇒ 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 MINI, SMALL, MEDIUM 0.00 0.18 0.19 0.18 0.23 0.26 0.24 0.23 0.20 0.12 SMALL, MEDIUM, BASE 0.07 0.22 0.29 0.40 0.35 0.33 0.38 0.27 0.24 0.13 MEDIUM, BASE, LARGE 0.05 0.09 0.17 0.33 0.20 0.30 0.12 0.13 0.16 0.09 MINI, MEDIUM, LARGE 0.03 0.15 0.18 0.33 0.17 0.16 0.22 0.20 0.19 0.09 QQP. Buckets⇒ 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 MINI, SMALL, MEDIUM 0.03 0.21 0.18 0.21 0.21 0.25 0.18 0.16 0.10 0.06 SMALL, MEDIUM, BASE 0.01 0.17 0.23 0.19 0.24 0.22 0.24 0.19 0.16 0.05 MEDIUM, BASE, LARGE -0.02 0.16 0.09 0.23 0.17 0.10 0.14 0.14 0.09 -0.01 MINI, MEDIUM, LARGE -0.01 0.07 0.14 0.09 0.16 0.09 0.16 0.07 0.10 0.07 SST-2. Buckets⇒ 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 MINI, SMALL, MEDIUM 0.09 0.26 0.43 0.22 0.28 0.24 0.27 0.35 0.20 0.12 SMALL, MEDIUM, BASE 0.07 0.12 0.22 0.40 0.07 0.20 0.10 0.12 0.19 0.06 MEDIUM, BASE, LARGE 0.01 0.24 0.29 0.35 0.19 0.19 0.26 0.39 0.15 0.03 MINI, MEDIUM, LARGE 0.01 0.17 0.11 0.41 0.04 0.29 0.16 0.21 0.15 0.07

Table 12: Sorted in ascending order, the model sizes are MINI , SMALL , MEDIUM , BASE , and LARGE . The three model sizes listed for each row represents the model size of interest: for example, MINI, MEDIUM, LARGE MINI MEDIUM means that we are calculating the correlation between MEDIUM∆Acc and LARGE ∆Acc. Each column t represents a bucket that contains instances with middle size accuracy in [t − 0.1, t]. For example, if the row name is ˆ MEDIUM MINI, MEDIUM, LARGE, then the column 0.2 corresponds to a bucket where Acci is between 0.1 and 0.2. MINI ˆ MEDIUM ˆ We calculate the PearsonR correlation score between MEDIUM∆Acc and LARGE ∆Acc across all instances in the bucket.

finetuning randomness, and across different check- stance i can then be written as points throughout training. We formally define the Ls,i = [(1 − cs,i )2] (33) quantities we want to estimate in Appendix H.1, EP,F,E P,F,E present an unbiased estimator for these quantities Since we will analyze this term at a datapoint in Appendix H.2, and show that our method can level, we drop the subscript s and i to keep the be generalized to arbitrary number of source of notation uncluttered. By the standard bias variance randomness in Appendix H.3. decomposition and total variance decomposition, Specifically, the main paper focused on scenar- we decompose the loss L into four terms: ios with 2 sources of randomness: pretraining and finetuning. We discuss the case with 3 sources of L =Bias2 + PretVar (34) randomness in the appendix, rather than 2 as in the + FineVar + CkptVar. main paper, because it is easier to understand the general estimation strategy in the case of 3. We will walk through the meaning and definition of these four terms one by one. Bias2 captures how bad is the average prediction, defined as H.1 Formalizing Decomposition Bias2 = (1 − [c ])2. (35) Recall that P is the pretraining seed, F is the fine- EP,F,E P,F,E tuning seed, E represents a model checkpoint, i PretVar captures the variance brought by ran- s,i domness in pretraining, and is defined as indexes each instance (datapoint). cP,F,E = 1 if the model of size s with pretraining seed p and fine- PretVar = V arP [EF,E[cP,F,E]]. (36) tuning seed F , and trained for E epochs is correct on datapoint i, and 0 otherwise. Notice that we Similarly, we define the variance brought by ran- move the instance index from the subscript to the domness in finetuning FineVar superscript, since we now use subscript for random FineVar = EP [V arF [EE[cP,F,E]]], (37) seeds, and instance index can be omitted in most of our derivations. and that by fluctuations across checkpoints e

The expected squared loss L of model s on in- CkptVar = EP,F [V arE[cP,F,E]]. (38) H.2 Unbiased Estimation checkpoints E, conditioned on the pretraining seed We first describe the data on which we apply our Pj.” estimator. Suppose we pretrain P models with Since Pj is fixed for this estimator, we drop the different random seeds, for each of the P pre- subscripts P to keep notation uncluttered. There- trained models we finetune with F different ran- fore, we want to estimate dom seeds, and we evaluate at E different check- V ar := V ar [ [c ]] (46) points. Then ∀j ∈ [P], k ∈ [F], l ∈ [E] 13 , we F F EE F,E observe Pj,Fjk,Ejkl, and cPj ,Fjk,Ejkl , where each A naive solution is to take first take the mean observed P,F and E are i.i.d. distributed. Our goal c¯Fk of c for each Fk, i.e. is to estimate from c the four quantities described 1 X in the previous section. c¯ := c , (47) Fjk E Fk,Ekl H.2.1 Estimating CkptVar l∈[E] It is straightforward to estimate CkptVar. The and then calculate the sample variance V˜ ar of c¯ estimator CkptVarˆ defined below is unbiased: F with respect to F : ˆ 1 X ˆ Pj ,Fjk CkptVar := V arE , (39) 1 X 2 PF V˜ arF := (¯cF − c¯) , (48) j∈[P],k∈[F] F − 1 k k∈[F] where where Pj ,Fjk 1 X 2 1 X Vˆ ar := (c − c¯ ) , c¯ := c¯F (49) E E − 1 Pj ,Fjk,Ejkl Pj ,Fjk F k l∈E k∈[F] (40) However, this would create an upward bias: the and 1 X empirical mean c¯Fjk is a noisy estimate of the pop- c¯Pj ,Fjk := cPj ,Fjk,Ejkl . (41) E ulation mean EE[cFjk,E], and hence increases let l∈[E] V˜ arF over-estimate the variance. Imagine a sce- ˆ ˆ Pj ,Fjk V ar c¯ CkptVar is unbiased, since V arE is an un- nario where F is in fact 0; however, since Fjk biased estimation of variance of c with fixed Pj is a noisy estimate, V˜ arF will sometimes be posi- ˜ and Fjk, and randomness E, i.e. tive but never below 0. As a result, E[V arF ] > 0, which is a biased estimator. ˆ Pj ,Fjk EEij(·) [V arE ] = (42) We introduce the following general theorem to correct this bias. V arE[cP,F,E|P = Pj,F = Fjk]. Theorem 2 Suppose Dk, k ∈ [F] are indepen- Therefore, ∀j ∈ [P], k ∈ [F], we have dently sampled from the same distribution Ξ, which is a distribution of distributions; µˆ is an unbiased ˆ Pj ,Fjk k EPj ,Fjk [V arE ] = CkptVar, (43) ˆ estimator of EX∈Dk [X], and φk to be an unbiased and hence by linearity of expectation estimator of the variance of µˆk, then

[CkptVar]ˆ = CkptVar. (44) EP(·),F(·),E(·) 1 X Vˆ ar = (ˆµ − µˆ)2 (50) F F − 1 k H.2.2 Estimating FineVar k∈[F] As before, by linearity of expectation, we can de- 1 X − φˆ clare victory if we can develop an unbiased esti- F k mator for the following quantity and then average k∈[F] across Pj: is an unbiased estimator for

V arF [ E[cP,F,E]|P = Pj], (45) E V = V arD∼Ξ[EX∼D[X]], (51) which verbally means ”variance across different where finetuning seeds of the mean of c over different 1 X µˆ := µˆk (52) 13 F [L] := {l : l ∈ N, l ∈ [0,L − 1]} k∈[F] In this estimator, the first term “pretends” that µˆ· are perfect estimator for the population mean and X [−2 (ˆµ − µ )(ˆµ − µ)] (59) calculate the variance, while the second term cor- E k k rects for the fact that the empirical mean estimation k∈[F] is not perfect. Notice the theorem only requires 2 X ˆ = − E[ φk]. ˆ F that µˆ and φ are unbiased, and is agnostic to the k∈[F] actual computation procedure by these estimators. X Proof We define the population mean of Dk to E[−2 (ˆµk − µk)(ˆµ − µ)] be µk, i.e. k∈[F] = −2V. (60) µk := EX∼Dk [X], (53) Putting these five terms together, we continue and the population mean of µk across randomness calculating Equation 55: in D to be µ, i.e. 1 X [ (ˆµ − µˆ)2] (61) F − 1E k µ := ED∼Ξ[EX∼D[X]] (54) k∈[F] 1 X = [ φˆ We look at the first term of the estimator in equa- F − 1E k tion 50: k∈[F] 1 1 X + F( V + [φˆ ]) F F 2 E k 1 X k∈[F] [ (ˆµ − µˆ)2] (55) F − 1E k + FV k∈[F] 2 X ˆ 1 X − φk = E[ ((ˆµk − µk) − (ˆµ − µ) F F − 1 k∈[F] k∈[F] 2 − 2V ] + (µk − µ)) ] 1 X ˆ 1 X 2 2 = V + E[ φk] = E[ [(ˆµk − µk) + (ˆµ − µ)) F F − 1 k k∈[F] 2 Then from Equation 50, we can tell that Vˆ arF + (µk − µ) − 2(ˆµk − µk)(ˆµ − µ) is unbiased. − 2(µ ˆ − µ )(ˆµ − µ)]]  k k Now we come back to the topic of developing an unbiased estimator for V ar as defined in Equa- There are 5 summands within P , and we F k∈[F] tion 46. To utilize Theorem2, we need two compo- look at them one by one: nents:

• An unbiased estimator µˆFk for EE[cF,E|F = X 2 X ˆ E[ (ˆµk − µk) ] = E[ φk], (56) Fk] k∈[F] k∈[F] ˆ • An unbiased estimator φFk for the variance of µˆ V ar (ˆµ ) Fk , i.e. E(·) Fk 2 1 X [(ˆµ − µ) ] = [(µ − µk) (57) c¯Fk is an unbiased estimator for EE[cF,E|F = E E F k∈[F] Fk], and its variance V arE[¯cF |F = Fk] is 1 X 2 1 + (µk − µˆk)) ] V ar (¯c ) = V ar [c |F = F ]. (62) F E(·) Fk E F,E k k∈[F] E 1 1 X ˆ Therefore, to develop an unbiased estimator for = V + 2 E[φk] F F V arE(¯cFjk ), it suffices to have an unbiased esti- k∈[F] mate of V arE[cF,E|F = Fk]. We define X 1 X [ (µ − µ)2] = FV (58) φˆ := (c − c¯ )2, (63) E k Fk E(F − 1) Fk,Ekl Fk k∈[F] k∈[L] ˆ 2 and we can plug in φFk and µˆFk =c ¯Fk into is an unbiased estimator of Bias . Theorem2 as an unbiased estimator to obtain an Notice that the na¨ıve estimator that calculates unbiased estimator for V arF [EE[cP,F,E]|P = Pj], the expected bias and then squares it estimates 2 2 and we average the estimation for each Pj to obtain (E[Bias]) instead of E[Bias ]. an unbiased estimate. H.3 Generalization H.2.3 Estimating PretVar We can generalize this estimation strategy to de- We next estimate V arP [EF,E[cP,F,E]] We can still compose variance into arbitrary number of random- apply the idea from Theorem2, which requires ness. In general, we want to estimate some quantity of the following form • An unbiased estimator µˆPj for EF,E[cP,F,E|P = Pj] Er1...,rn−1 [V arrn [Ern+1...rN [cr1...,cN ]]], (69) ˆ • An unbiased estimator φPj for the variance of from the data that has an hierarchical tree struc- µˆPj , i.e. V arF,E[ˆµPj ]. ture of randomness. µˆ =c ¯ Again, the first is easy to obtain: Pj Pj is For the goal of developing an unbiased estimator, an unbiased estimator for [c |P = P ], EF,E P,F,E j we can get rid of the outer expectation r1 . . . rn−1 where easily by linearity of expectation: simply estimate 1 X the Variance conditioned on r and average c¯ := c (64) 1...n−1 Pj FE Pj ,Fjk,Ejkl them together, as discussed in Section H.2.1. k∈[F],l∈[E] To estimate However, we cannot straightforwardly estimate V arr [ r ...r [cr ...,c ]], (70) V arF,E[ˆµPj ] as before, since samples cPj ,Fjk,Ejkl n E n+1 N 1 N are no longer independent. We need to use Equa- tion 57 to develop an unbiased estimator (the LHS we make use of Theorem2, which requires is exactly what we want!), i.e. • an unbiased estimator µˆrn+1 for the quantity 1 [c ], which we can straightfor- V ar (¯c ) = V ar ( [c ]|P = P ) Ern+1...rN r1...,cN F,E Pj F F EE P,F,E j wardly obtain by average the examples that has the same random variables r1...n (e.g. c¯ ) (65) Pj 1 X • an unbiased estimator for the variance of + 2 V arE(¯cPj ,Fjk ), F µˆrn+1 . If N = n+1, we can directly compute k∈[F] the sample variance of the c as our estimate and we already know how to estimate these two (e.g. in Equation 63). Otherwise, we use summands from the previous discussion on estimat- Equation 57 to decompose the desired quan- ing FineVar. tities into two, and estimate them recursively by applying Theorem2 and Equation 57. H.2.4 Estimating Bias2 It is easy to see that the following Lˆ is an unbiased For readability we wrote the proof with the as- estimator for the loss L. sumption that, in the tree of randomness, the num- ber of branches for each node at the same depth 1 X is the same. However, our proof does not make Lˆ := (1 − c )2, PFE Pj ,Fjk,Ejkl use of this assumption and can be applied to a gen- j∈[P],k∈[F],l∈[E] eral tree structure of randomness as long as the the (66) number of children is larger or equal to 2 for each and non-terminal node. E[Lˆ] = L. (67) By linearity of expectation and loss decomposi- I Variance Conditioned on Bias tion in Equation 34, Since lower bias usually implies lower variance, to Biasˆ 2 := Lˆ − PretVarˆ (68) tease out the latent effect, we estimate the variance “given a fixed level of bias Bias2 of b2 ∈ [0, 1]”, ˆ ˆ − FineVar − CkptVar i.e. (a) (b) (a) (b)

(c) (d) (c) (d)

(e) (f) (e) (f)

Figure 9: The variance curve conditioned on Bias2 for Figure 10: The same figure as9, except for using the in-domain development set of MNLI, QQP and SST-2. squared loss function L = (1 − p)2, where p is the Each curve represents a model size. Left for pretraining probability assigned to the correct label, instead of 0/1 variance and right for finetuning variance. loss.

be controversial but we could see a reason why

2 2 2 this label is reasonable, 3) Wrong, if the label is PretVar(b ) := Ei[PretVari|Bias i = b ] (71) wrong, and 4) Unsure, if we are unsure how to la- We estimate PretVars(b2) and FineVars(b2) us- bel this instance. As a control, we also examined ing gaussian process and plot them against b2 in the remaining fraction of the dataset. Each time Figure9 for MNLI, QQP, and SST-2. We find that we annotate an instance, with 50% probability it is larger models usually have larger finetuning vari- sampled from the decaying fraction or the remain- ance across all levels of biases (except for MEDIUM ing fraction, and we do not know which group it and MINI on SST-2), and BASE model always has comes from. larger pretraining variance than LARGE . We show below all the annotated instances from We also experimented with the squared loss: this decaying fraction and their categories for MNLI (Section J.1), QQP , and SST-2(Section J.3). 2 Li = (1 − pi) , (72) J.1 MNLI where pi is the probability the assigned to the cor- MNLI is the abbreviation of Multi-Genre Natural rect label for instance i. We plot the same curve in Language Inference (Williams et al.(2020)). In Figure 10 and observe the same trend. this task, given a premise and a hypothesis, the model needs to classify whether the premise J Example Decaying Instances entails/contradicts the hypothesis, or otherwise. We manually examined the group of instances The instances can be seen below. MINI ˆ where LARGE∆Acci ≤ −0.9 in Table3. In other words, MINI is almost always correct on these in- Premise : and that you’re very much right stances, while LARGE is almost always wrong. For but the jury may or may not see it that way so you each instance in this group, we manually catego- get a little anticipate you know anxious there and rize it into one of the four categories: 1) Correct, go well you know if the label is correct, 2) Fine, if the label might Hypothesis : Jury’s operate without the benefit of an education in law. Premise : Shoot only the ones that face us, Label : Neutral Jon had told Adrin. Category : Correct Hypothesis : Jon told Adrin and the others to only shoot the ones that face us. Premise : In fiscal year 2000, it reported Label : Entailment estimated improper Medicare Fee-for-Service Category : Wrong payments of $11. Hypothesis : The payments were improper. Premise : But if you take it seriously, the Label : Entailment anti-abortion position is definitive by definition. Category : Fine Hypothesis : If you decide to be serious about supporting anti-abortion, it’s a very run of the mill Premise : is that what you ended up going belief to hold. into Label : Neutral Hypothesis : So that must be what you chose to Category : Unsure do? Label : Entailment Premise : yeah well that’s the other thing Category : Correct you know they talk about women leaving the home and going out to work well still taking care of the Premise : INTEREST RATE - The price children is a very important job and and someone’s charged per unit of money borrowed per year, got to do it and be able to do it right and or other unit of time, usually expressed as a Hypothesis : It is not acceptable for anybody to percentage. refuse work in order to take care of children. Hypothesis : Interest rate is defined as the total Label : Contradiction amount of money borrowed. Category : Correct Label : Entailment Category : Wrong Premise : The researchers found expected stresses like the loss of a check in the mail and the Premise : The analyses comply with the in- illness of loved ones. formational requirements of the sections including Hypothesis : The stresses affected people much the classes of small entities subject to the rule and diffferently than the researchers expected. alternatives considered to reduce the burden on the Label : Contradiction small entities. Category : Correct Hypothesis : The rules place a high burden on the activities of small entities. Premise : so you know it’s something we Label : Contradiction we have tried to help but yeah Category : Correct Hypothesis : We did what we could to help. Label : Entailment Premise : Isn’t a woman’s body her most Category : Correct personal property? Hypothesis : Women’s bodies belong to them- Premise : Czarek was welcomed enthusias- selves, they should decide what to do with it. tically, even though the poultry brotherhood was Label : Neutral paying a lot of sudden attention to the newcomers Category : Unsure - a strong group of young and talented managers from an egzemo-exotic chicken farm in Fodder Premise : The Standard , published a few Band nearby Podunkowice. days before Deng’s death, covers similar territory. Hypothesis : Czarek was welcomed into the group Hypothesis : The Washington Post covers similar by the farmers. territory. Label : Entailment Label : Neutral Category : Correct Category : Correct Premise : ’I don’t suppose you could forget I ever said that?’ Premise : Rightly or wrongly, America is Hypothesis : I hope that you can remember that seen as globalization’s prime mover and head forever. cheerleader and will be blamed for its excesses Label : Contradiction until we start paying official attention to them. Category : Wrong Hypothesis : America’s role in the globalization movement is important whether we agree with it or Premise : Oh, my friend, have I not said to not. you all along that I have no proofs. Label : Entailment Hypothesis : I told you from the start that I had no Category : Correct evidence. Label : Entailment Premise : After being diagnosed with can- Category : Correct cer, Carrey’s Kaufman decides to do a show at Carnegie Hall. Premise : I should put it this way. Hypothesis : Carrey’s Kaufman is only diagnosed Hypothesis : I should phrase it differently. with cancer after doing a show at Carnegie Hall. Label : Entailment Label : Contradiction Category : Correct Category : Correct

Premise : An organization’s activities, core Premise : Several pro-life Dems are mount- processes, and resources must be aligned to ing serious campaigns at the state level, often support its mission and help it achieve its goals. against pro-choice Republicans. Hypothesis : An organization is successful if its Hypothesis : Serious campaigns are being run by activities, resources, and goals align. a few pro-life Democrats. Label : Entailment Label : Entailment Category : Fine Category : Correct

Premise : A more unusual dish is azure, a Premise : On the northwestern Alpine fron- kind of sweet porridge made with cereals, nuts, tier, a new state had appeared on the scene, and fruit sprinkled with rosewater. destined to lead the movement to a united Italy. Hypothesis : Azure is a common and delicious Hypothesis : The unite Italy movement was food made with cereals, nuts and fruit. waiting for a leader. Label : Entailment Label : Neutral Category : Wrong Category : Fine

Premise : once you have something and it’s Premise : well we bought this with credit like i was watching this program on TV yesterday too well we found it with a clearance uh down in in nineteen seventy six NASA came up with Three Memphis i guess and uh D graphics right Hypothesis : We bought non-sale items in Hypothesis : I was watching a program about Memphis on credit. gardening. Label : Contradiction Label : Contradiction Category : Correct Category : Correct Premise : He slowed. Premise : , First-Class Mail used by house- Hypothesis : He stopped moving so quickly. holds to pay their bills) and the household bill mail Label : Entailment (i.e. Category : Correct Hypothesis : Second-Class Mail used by house- holds to pay their bills Premise : As legal scholar Randall Kennedy wrote Label : Contradiction in his book Race, Crime, and the Law , Even Category : Unsure if race is only one of several factors behind a decision, tolerating it at all means tolerating it as potentially the decisive factor. Hypothesis : Race is one of several factors in Premise : Tommy was suddenly galvanized some judicial decisions into life. Label : Entailment Hypothesis : Tommy had been downcast for days. Category : Correct Label : Neutral Category : Correct Premise : Although all four categories of emissions are down substantially, they only Premise : Improved products and services achieve 50-75% of the proposed cap by 2007 Initiate actions and manage risks to develop (shown as the dotted horizontal line in each of the new products and services within or outside the above figures). organization. Hypothesis : All of the emission categories Hypothesis : Managed risks lead to new products experienced a downturn except for one. Label : Entailment Label : Contradiction Category : Fine Category : Correct Premise : Coast Guard rules establishing Premise : He sat up, trying to free himself. bridgeopening schedules). Hypothesis : He was trying to take a nap. Hypothesis : The Coast Guard is in charge of Label : Contradiction opening bridges. Category : Correct Label : Entailment Category : Correct Premise : Impossible. Hypothesis : Cannot be done. Premise : The anthropologist Napoleon Chagnon Label : Entailment has shown that Yanomamo men who have killed Category : Correct other men have more wives and more offspring than average guys. Premise : But, as the last problem I’ll out- Hypothesis : Yanomamo men who kill other men line suggests, neither of the previous two have better chances at getting more wives. objections matters. Label : Entailment Hypothesis : I will not continue to outline any Category : Fine more problems. Label : Entailment Premise : The Varanasi Hindu University Category : Correct has an Art Museum with a superb collection of 16th-century Mughal miniatures, considered Premise : As the Tokugawa shoguns had superior to the national collection in Delhi. feared, this opening of the floodgates of Western Hypothesis : The Varanasi Hindu University culture after such prolonged isolation had a has an art museum on its campus which may be traumatic effect on Japanese society. superior objectively to the national collection in Hypothesis : The Tokugawa shoguns had feared Delhi. that, because they understood the Japanese society Label : Entailment very well. Category : Correct Label : Neutral Category : Fine Premise : Part of the reason for the differ- ence in pieces per possible delivery may be due Premise : In the ancestral environment a to the fact that five percent of possible residential man would be likely to have more offspring if he deliveries are businesses, and it is thought, but got his pick of the most fertile-seeming women. not known, that a lesser percentage of possible Hypothesis : Only a man who stayed with one deliveries on rural routes are businesses. female spread his genes most efficiently. Hypothesis : We all know that the reason for a Label : Contradiction lesser percentage of possible deliveries on rural Category : Fine routes being businesses, is because of the fact that people prefer living in cities rather than rural areas. Hypothesis : A lot of people consider Eddie to be Label : Neutral a bad singer. Category : Correct Label : Neutral Category : Correct Premise : right oh they’ve really done uh good job of keeping everybody informed of what’s Premise : it’s the very same type of paint going on sometimes i’ve wondered if it wasn’t and everything almost more than we needed to know Hypothesis : It’s the same paint formula, it’s Hypothesis : I don’t think I have shared enough great! information with everyone. Label : Entailment Label : Contradiction Category : Fine Category : Correct Premise : Exhibit 3 presents total national Premise : To reach any of the three Carbet emissions of NOx and SO2 from all sectors, falls, you must continue walking after the roads including power. come to an end for 20 minutes, 30 minutes, or two Hypothesis : In Exhibit 3 there are the total hours respectively. regional emissions od NOx and SO2 from all Hypothesis : There are three routes to the three sectors. Carbet falls, each a different length and all Label : Entailment continue after the road seemingly ends. Category : Correct Label : Entailment Category : Correct Premise : uh-huh and is it true i mean is it um Premise : But when the cushion is spent in Hypothesis : It’s true. a year or two, or when the next recession arrives, Label : Entailment the disintermediating voters will find themselves Category : Wrong playing the roles of budget analysts and tax wonks. Hypothesis : The cushion will likely be spent in Premise : When a GAGAS attestation en- under two years. gagement is the basis for an auditor’s subsequent Label : Entailment report under the AICPA standards, it would be Category : Correct advantageous to users of the subsequent report for the auditor’s report to include the information Premise : But, Slate protests, it was [Gates’] on compliance with laws and regulations and byline that appeared on the cover. internal control that is required by GAGAS but not Hypothesis : Slate was one hundred percent required by AICPA standards. positive it was Gates’ byline on the cover. Hypothesis : The report is required by GAGAS Label : Neutral but not AICPA. Category : Correct Label : Entailment Category : Correct Premise : But it’s for us to get busy and do something.” Premise : i’m on i’m in the Plano school Hypothesis : ”We don’t do much, so maybe this system and living in Richardson and there is would be good for us to bond and be together for a real dichotomy in terms of educational and the first time in a while.”. economic background of the kids that are going to Label : Neutral be attending this school Category : Fine Hypothesis : The Plano school system only has children with poor intelligence. Premise : Pearl Jam detractors still can’t Label : Contradiction stand singer Eddie They say he’s unbearably Category : Correct self-important and limits the group’s appeal by refusing to sell out and make videos. J.2 QQP Label : Paraphrase Category : Correct QQP is the abbreviation of Quora Question Pairs14. Given two questions, the model needs Question 1 : Who is answering the ques- to tell whether they have the same meaning (i.e. tions asked on Quora? Paraphrase/Non-paraphrase). Question 2 : Who can answer the questions asked Question 1 : Which universities for MS in CS on Quora? should I apply to? Label : Paraphrase Question 2 : Which universities should I apply to Category : Correct for an MS in CS? Label : Paraphrase Question 1 : What machine learning theory Category : Correct do I need to know in order to be a successful machine learning practitioner? Question 1 : What should I do to make life Question 2 : What do I need to know to learn worth living? machine learning? Question 2 : What makes life worth living? Label : Paraphrase Label : Paraphrase Category : Wrong Category : Fine Question 1 : If you could go back in time Question 1 : Why did Quora remove my and change one thing, what would it be and why? question? Question 2 : If you could go back in time and do Question 2 : Why does Quora remove questions? one thing, what would it be? Label : Paraphrase Label : Paraphrase Category : Correct Category : Correct Question 1 : How do I get thousands of fol- Question 1 : Will there be a civil war if lowers on Instagram? Trump doesn’t become president? Question 2 : How can I get free 10k real Instagram Question 2 : Will there be a second civil war if followers fast? Trump becomes president? Label : Paraphrase Label : Paraphrase Category : Fine Category : Correct Question 1 : What is the basic knowledge Question 1 : Do Quora contributors get of computer science engineers? paid? Question 2 : What is basic syallbus of computer Question 2 : How do contributors get paid by science engineering? Quora? Label : Non-paraphrase Label : Paraphrase Category : Fine Category : Correct Question 1 : How many mosquito bites Question 1 : Did India meet Abdul Kalam’s 2020 does it take to kill a human being? vision so far? Question 2 : How many times can a single Question 2 : How far do you think India has mosquito bite a human within 8 hours? reached on President APJ Kalam’s vision in the Label : Non-paraphrase book India 2020? Category : Correct Label : Non-paraphrase Category : Correct Question 1 : How does it feel to become at- tractive from unattractive? Question 1 : How do I stop my dog from Question 2 : What does it feel like to go from whining after getting spayed? physically unattractive to physically attractive? Question 2 : How do I stop my dog from whining? 14https://www.quora.com/q/quoradata/First-Quora- Label : Paraphrase Dataset-Release-Question-Pairs Category : Wrong Question 1 : How is Tanmay Bhat losing weight? Question 1 : What difference are exactly Question 2 : Tanmay Bhat: How did you manage between Euclidean space and non Euclidean to reduce your fat? space? Label : Non-paraphrase Question 2 : What is the difference between Category : Unsure Euclidean and non-Euclidean? Label : Non-paraphrase Question 1 : Is Xiaomi a brand to trust Category : Wrong (comparing it with brands like Samsung and HTC)? What is better: Xiaomi MI3 or HTC Desire Question 1 : Why doesn’t Hillary Clinton 816? win the White House if she won the popular vote? Question 2 : Is xiaomi a trusted brand? Question 2 : How did Hillary Clinton win the Label : Non-paraphrase popular vote but Donald Trump win the election? Category : Correct Label : Paraphrase Category : Correct Question 1 : Why did Buddhism spread in East Asia and not in its native land India? Question 1 : How is public breastfeeding Question 2 : How was Buddhism spread in Asia? seen where you live? Label : Non-paraphrase Question 2 : How is breastfeeding in public seen Category : Correct in your country? Label : Paraphrase Question 1 : Can I become a multi billion- Category : Correct aire betting on horses? Question 2 : How much money can I make betting Question 1 : What are some ways to change your on horses? A month? Can I make 20,000 a month? Netflix password? Label : Paraphrase Question 2 : How do you change your Netflix Category : Fine password and email? Label : Paraphrase Question 1 : What is a diet for gaining Category : Fine weight? Question 2 : What is a way to gain weight? Question 1 : What do you think, is your Label : Non-paraphrase best answer on Quora? Category : Correct Question 2 : What is your best answer on Quora? Label : Paraphrase Question 1 : How do I use Instagram on Category : Correct my computer? Question 2 : How can I get Instagram on my Question 1 : How can I travel to Mexico computer? without a passport? Label : Paraphrase Question 2 : Can I travel to Mexico without a Category : Fine passport? Label : Paraphrase Question 1 : What is the legal basis of a Category : Correct ”you break it, you buy it” policy? Question 2 : Is a ”you break it you buy it” policy Question 1 : How do modern Congolese actually legal? people view Mobutu in retrospect? Label : Paraphrase Question 2 : How do Congolese currently view Category : Correct Mobutu Sese Seko? Label : Non-paraphrase Question 1 : How I should fix my computer while Category : Correct it is showing no boot device found? Question 2 : How do I fix the ”Boot device not found” problem? placement package from a company if both of Label : Paraphrase them are equally capable? Category : Correct Label : Paraphrase Category : Correct Question 1 : What innovative name can I use for an interior designing firm? Question 1 : What happened to The Joker Question 2 : What can i name my interior after The end of The Dark Knight? designing firm? Question 2 : What happens to the Joker at the end Label : Paraphrase of The Dark Knight (2008 movie)? Category : Correct Label : Non-paraphrase Category : Wrong Question 1 : What would it realistically cost to go to Tomorrowland? Question 1 : I love my wife more then any- Question 2 : How much is a ticket to Tomorrow- thing. Why do I fantasize about her with other land? men? Label : Non-paraphrase Question 2 : Why do I fantasize about other men Category : Fine having sex with my wife? Label : Paraphrase Question 1 : Is there a gender pay gap? If Category : Fine so why? Question 2 : Is the gender pay gap a myth? Question 1 : What is your opinion on the Label : Paraphrase new MacBook Pro Touch Bar? Category : Correct Question 2 : What do you think about the OLED touch bar on the new MacBook Pro? Question 1 : How can I get rid of a canker Label : Paraphrase sore on the bottom of my tongue? Category : Correct Question 2 : How can I get rid or a canker sore on the tip of my tongue? Question 1 : How do I get rid of my nega- Label : Paraphrase tive alter ego? Category : Correct Question 2 : How do you get rid of your negative alter ego? Question 1 : How can I sleep better and Label : Paraphrase early in night? Category : Correct Question 2 : How can I sleep better at night? Label : Paraphrase Question 1 : How can I get wifi driver for Category : Fine my hp laptop with windows 7 os? Question 2 : How can I get wifi driver for my Question 1 : Why did DC Change Captain laptop with windows 7 os? Marvel’s name? Label : Paraphrase Question 2 : Why did DC have to change Captain Category : Fine Marvel’s name but Marvel didn’t have to change Scarecrow’s name? Question 1 : What’s your attitude towards Label : Paraphrase life? Category : Fine Question 2 : What should be your attitude towards life? Question 1 : Should be there any difference Label : Paraphrase between IIT and non IIT students in terms of Category : Wrong placement package from a company if both of them are equally talented? Question 1 : What books would I like if I Question 2 : Should be there any difference loved A Song of Ice and Fire? between IIT and non IIT students in terms of Question 2 : Are there books which are similar to A Song of Ice and Fire? Input : a children ’s party clown Label : Paraphrase Label : Negative Category : Fine Category : Fine

Question 1 : Why do Muslims think they Input : perhaps even the slc high command will conquer the whole world? found writer-director mitch davis ’s wall of kitsch Question 2 : Do you think Muslims will take over hard going . the world? Label : Negative Label : Non-paraphrase Category : Correct Category : Correct Input : own placid way Question 1 : Is dark matter a sea of mas- Label : Negative sive dark photons that ripple when galaxy clusters Category : Correct collide and wave in a double slit experiment? Question 2 : Does a superfluid dark matter which Input : get on a board and , uh , shred ripples when Galaxy clusters collide and waves in , a double slit experiment relate GR and QM? Label : Negative Label : Paraphrase Category : Correct Category : Correct Input : asks what truth can be discerned Question 1 : What is Batman like? from non-firsthand experience , and specifically Question 2 : What is Batman’s personality like? questions cinema ’s capability for recording truth . Label : Non-paraphrase Label : Positive Category : Correct Category : Correct

Input : puts the dutiful efforts of more dis- ciplined grade-grubbers J.3 SST-2 Label : Positive Category : Correct SST-2 is the abbreviation of Stanford Sentiment Treebank (Socher et al., 2013). In this task, the Input : filter out the complexity model needs to recognize whether the phrases or Label : Positive sentences reflect positive or negative sentiments. Category : Correct

Input : predictability is the only winner Input : told what actually happened as if it Label : Negative were the third ending of clue Category : Correct Label : Negative Category : Correct Input : abandon their scripts and go where the moment takes them Input : is more in love with strangeness Label : Negative than excellence . Category : Correct Label : Positive Category : Wrong Input : chases for an hour and then Label : Positive Input : i found myself howling more than Category : Unsure cringing Label : Positive Input : provide much more insight than the Category : Correct inside column of a torn book jacket Label : Negative Input : goldbacher draws on an elegant vi- Category : Unsure sual sense and a talent for easy , seductive pacing ... but she and writing partner laurence coriat do Category : Correct n’t manage an equally assured narrative coinage Label : Positive Input : of those airy cinematic bon bons Category : Unsure whose aims – and by extension , accomplishments – seem deceptively slight on the surface Input : for a thirteen-year-old ’s book re- Label : Positive port Category : Correct Label : Negative Category : Correct Input : do n’t blame eddie murphy but Label : Negative Input : a problem hollywood too long has Category : Correct ignored Label : Negative Input : melodramatic paranormal romance Category : Correct Label : Negative Category : Correct Input : twisted sense Label : Negative Input : could possibly be more contemptu- Category : Correct ous of the single female population Label : Negative Input : a stab at soccer hooliganism Category : Correct Label : Negative Category : Correct Input : cremaster 3 ” should come with the warning “ for serious film buffs only ! ” Input : sinuously plotted Label : Negative Label : Negative Category : Correct Category : Correct Input : softheaded metaphysical claptrap Input : shiner can certainly go the distance Label : Negative , but is n’t world championship material Category : Correct Label : Positive Category : Correct Input : owed to benigni Label : Negative Input : holding equilibrium up Category : Unsure Label : Negative Category : Wrong Input : to be a suspenseful horror movie or a weepy melodrama Input : i am highly amused by the idea that Label : Positive we have come to a point in society where it has Category : Correct been deemed important enough to make a film in which someone has to be hired to portray richard Input : genuinely unnerving . dawson . Label : Positive Label : Positive Category : Correct Category : Wrong Input : gaping enough to an entire Input : waters olympic swim team through Label : Negative Label : Negative Category : Wrong Category : Correct

Input : what might have emerged as hilari- Input : this is popcorn movie fun with equal doses ous lunacy in the hands of woody allen or of action , cheese , ham and cheek ( as well as a Label : Positive serious debt to the road warrior ) , but it feels like unrealized potential Label : Positive Input : that turns me into that annoying Category : Fine specimen of humanity that i usually dread encoun- tering the most Input : feeling like it was worth your seven Label : Negative bucks , even though it does turn out to be a bit of a Category : Fine cheat in the end Label : Negative Input : at least a minimal appreciation Category : Correct Label : Positive Category : Unsure Input : pull it back Label : Negative Input : underlines even the dullest tangents Category : Correct Label : Negative Category : Correct Input : , this is more appetizing than a side dish of asparagus . Input : heard before Label : Negative Label : Positive Category : Correct Category : Unsure

Input : crime drama Input : i like my christmas movies with Label : Negative more elves and snow and less pimps and ho ’s . Category : Unsure Label : Negative Category : Unsure Input : like most movies about the pitfalls of bad behavior Input : can aspire but none can equal Label : Negative Label : Negative Category : Fine Category : Unsure

Input : befallen every other carmen before Input : fathom her Label : Negative Label : Positive Category : Unsure Category : Unsure Input : attempt to bring cohesion to pamela Input : appeal to those without much inter- ’s emotional roller coaster life est in the elizabethans ( as well as rank frustration Label : Negative from those in the know about rubbo ’s dumbed- Category : Unsure down tactics ) Label : Negative Input : movie version Category : Unsure Label : Positive Category : Wrong Input : about existential suffering Label : Negative Input : of spontaneity in its execution and Category : Fine a dearth of real poignancy Label : Positive Input : , if uneven , Category : Correct Label : Negative Category : Unsure

Input : succumbs to sensationalism Label : Positive Category : Wrong