Arxiv:2105.06020V1

Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level Ruiqi Zhong Dhruba Ghosh Dan Klein Jacob Steinhardt Computer Science Division, University of California, Berkeley fruiqi-zhong, djghosh13, klein, [email protected] Abstract Prior works hint at differing answers. Hendrycks Larger language models have higher accu- et al.(2020) and Desai and Durrett(2020) find racy on average, but are they better on ev- that larger pretrained models consistently improve ery single instance (datapoint)? Some work out-of-distribution performance, which implies that suggests larger models have higher out-of- they might be uniformly better at a finer level. distribution robustness, while other work sug- Henighan et al.(2020) claim that larger pretrained gests they have lower accuracy on rare sub- image models have lower downstream classifica- groups. To understand these differences, we tion loss for the majority of instances, and they investigate these models at the level of individual instances. However, one major chal- predict this trend to be true for other data modal- lenge is that individual predictions are highly ities (e.g. text). On the other hand, Sagawa et al. sensitive to noise in the randomness in train- (2020) find that larger non-pretrained models per- ing. We develop statistically rigorous meth- form worse on rare subgroups; if this result gener- ods to address this, and after accounting for alizes to pretrained language models, larger models pretraining and finetuning noise, we find that will not be uniformly better. Despite all the in- our BERT-LARGE is worse than BERT-MINI direct evidence, it is still inconclusive how many on at least 1−4% of instances across MNLI, SST-2, and QQP, compared to the overall ac- instances larger pretrained models perform worse curacy improvement of 2−10%. We also on. find that finetuning noise increases with model A na¨ıve solution is to finetune a larger model, size, and that instance-level accuracy has mo- compare it to a smaller one, and find instances mentum: improvement from BERT-MINI to where the larger model is worse. However, this BERT-MEDIUM correlates with improvement approach is flawed, since model predictions are from BERT-MEDIUM to BERT-LARGE . Our noisy at the instance level. On MNLI in-domain findings suggest that instance-level predictions development set, even the same architecture with provide a rich source of information; we therefore recommend that researchers supplement different finetuning seeds leads to different pre- model weights with model predictions. dictions on ∼8% of the instances. This is due to under-specification (D’Amour et al., 2020), where 1 Introduction there are multiple different solutions that can mini- Historically, large deep learning models (Peters mize the training loss. Since the accuracy improve- arXiv:2105.06020v1 [cs.CL] 13 May 2021 et al., 2018; Devlin et al., 2019; Lewis et al., 2020; ment from our BERT-BASE1 to BERT-LARGE is Raffel et al., 2019) have improved the state of 2%, most signals across different model sizes will the art on a wide range of tasks and leaderboards be dominated by noise due to random seeds. (Schwartz et al., 2014; Rajpurkar et al., 2016; Wang To account for the noise in pretraining and fine- et al., 2018), and empirical scaling laws predict tuning, we define instance accuracy as “how often that larger models will continue to increase per- a model correctly predicts an instance” (Figure1 formance (Kaplan et al., 2020). However, little is left) in expectation across pretraining and finetun- understood about such improvement at the instance ing seeds. We estimate this quantity by pretraining (datapoint) level. Are larger models uniformly bet- 10 models with different seeds, finetuning 5 times ter? In other words, are larger pretrained models for each pretrained models (Figure1 middle), and better at every instance, or are they better at some 1This is not the original release by Devlin et al.(2019); we instances, but worse at others? pretrained models ourselves. averaging across them. (L4/H256, 4 Layers with hidden dimension 256), However, this estimate is still inexact, and we SMALL (L4/H512), MEDIUM (L8/H512), BASE might falsely observe smaller models to be better (L12/H768), and LARGE (L24/H1024). For each at some instances by chance. Hence, we propose architecture we pre-trained models with 10 differ- a random baseline to estimate the fraction of false ent random seeds and fine-tuned each of them 5 discoveries (Section3, Figure1 right) and formally times (50 total) on each task; see Figure1 middle. upper-bound the false discoveries in Section4. Our Since pretraining is computationally expensive, we method provides a better upper bound than the clas- reduced the context size during pretraining from sical Benjamini-Hochberg procedure with Fisher’s 512 to 128 and compensated by increasing train- exact test. ing steps from 1M to 2M. AppendixA includes Using the 50 models for each size and our im- more details about pretraining and finetuning and proved statistical tool, we find that, on the MNLI their computational cost, and AppendixB verifies in-domain development set, the accuracy “decays” that our cost-saving changes do not affect accuracy from BERT-LARGE to BERT-MINI on at least ∼4% qualitatively. of the instances, which is significant given that the improvement in overall accuracy is 10%. These Notation. We use i to index an instance in the decaying instances contain more controversial or evaluation set, s for model sizes, P for pretraining wrong labels, but also correct ones (Section 4.2). seeds and F for finetuning seeds. c is a random Therefore, larger pretrained language models are variable of value 0 or 1 to indicate whether the not uniformly better. prediction is correct. Given the pretraining seed P s We make other interesting discoveries at the in- and the finetuning seed F , ci = 1 if the model of stance level. Section5 finds that instance-level size s is correct on instance i, 0 otherwise. To keep accuracy has momentum: improvement from MINI the notation uncluttered, we sometimes omit these to MEDIUM correlates with improvement from superscripts or subscripts if they can be inferred MEDIUM to LARGE . Additionally, Section6 at- from context. tributes variance of model predictions to pretrain- Unless otherwise noted, we present results on ing and finetuning random seeds, and finds that the MNLI in-domain development set in the main finetuning seeds cause more variance for larger paper. models. Our findings suggest that instance-level predictions provide a rich source of information; 3 Comparing Instance Accuracy we therefore recommend that researchers supplement model weights with model predictions. In this To find the instances where larger models are worse, spirit, we release all the pretrained models, model a na¨ıve approach is to finetune a larger pretrained predictions, and code here: https://github.com/ model, compare it to a smaller one, and find in- ruiqi-zhong/acl2021-instance-level. stances where the larger is incorrect but the smaller 2 Data, Models, and Predictions is correct. Under this approach, BERT-LARGE is worse than BERT-BASE on 4.5% of the instances To investigate model behavior, we considered dif- and better on 7%, giving an overall accuracy im- ferent sizes of the BERT architecture and fine- provement of 2.5%. tuned them on Quora Question Pairs (QQP2), However, this result is misleading: even if we Multi-Genre Natural Language Inference (MNLI; compare two BERT-BASE model with different Williams et al.(2020)), and the Stanford Sen- finetuning seeds, their predictions differ on 8% of timent Treebank (SST-2; Socher et al.(2013)). the instances, while their accuracies differ only by To account for pretraining and finetuning noise, 0.1%; Table1 reports this baseline randomness we averaged over multiple random initializations across model sizes. Changing the pretraining seed and training data order, and thus needed to pre- also changes around 2% additional predictions be- train our own models rather than downloading yond finetuning. off the internet. Following Turc et al.(2019) we Table1 also reports the standard deviation of trained 5 architectures of increasing size: MINI overall accuracy, which is about 40 times smaller. 2https://www.quora.com/q/quoradata/First-Quora- Such stability starkly contrasts with the noisiness at Dataset-Release-Question-Pairs the instance level, which poses a unique challenge. Instances Instance 1 Instance i Average across seeds Model Finetuning Seeds Instance F(*)1 F(*)2 Seed 1 Seed 2 Seed 3 Sizes Accuracy Pretrain P1 X ✓ X X Instance 1 X ✓ ✓ 66% MINI -ing P2 ✓ X … ✓ X Instance 2 X X X 0% Seeds P3 X ✓ ✓ ✓ Instance 3 ✓ X X 33% … … … Instance 4 ✓ ✓ ✓ 100% ✓ ✓ X ✓ LARGE ✓ X … ✓ X ✓ ✓ X X Figure 1: Left: Each column represents the same architecture trained with a different seed. We calculate accuracy for each instance (row) by averaging across seeds (column), while it is usually calculated for each model by averaging across instances. Middle: A visual layout of the model predictions we obtain, which is a binary-valued tensor with 4 axes: model size s, instance i, pretraining seeds P and finetuning seeds F . Right: for each instance, we calculate the accuracy gain from MINI to LARGE and plot the histogram in blue, along with a random baseline in red. Since the blue distribution has a bigger left tail, smaller models are better at some instances. ^ s DiffFTune DiffPTrain Stdall accuracy estimates Acci , i.e. MINI 7.2% 10.7% 0.2% s s SMALL 7.2% 10.7% 0.3% s1 ^ ^ 2 ^ 1 s2 ∆Acci := Acci − Acci : (3) MEDIUM 8.0% 10.7% 0.3% BASE 8.5% 10.6% 0.2% BASE ^ We histogram LARGE∆Acci in Figure2 (b). We LARGE 8.6% 10.1% 0.2% observe a unimodal distribution centered near 0, with tails on both sides.

Load more