
Bayesian Active Learning with Pretrained Language Models

Katerina Margatina, Loic Barrault, Nikolaos Aletras
Computer Science Department, University of Sheffield
{k.margatina,l.barrault,n.aletras}@sheffield.ac.uk

arXiv:2104.08320v1 [cs.CL] 16 Apr 2021

Abstract

Active Learning (AL) is a method to iteratively select data for annotation from a pool of unlabeled data, aiming to achieve better model performance than random selection. Previous AL approaches in Natural Language Processing (NLP) have been limited to either task-specific models that are trained from scratch at each iteration using only the labeled data at hand or using off-the-shelf pretrained language models (LMs) that are not adapted effectively to the downstream task. In this paper, we address these limitations by introducing BALM; Bayesian Active Learning with pretrained language Models. We first propose to adapt the pretrained LM to the downstream task by continuing training with all the available unlabeled data and then use it for AL. We also suggest a simple yet effective fine-tuning method to ensure that the adapted LM is properly trained in both low and high resource scenarios during AL. We finally apply Monte Carlo dropout to the downstream model to obtain well-calibrated confidence scores for data selection with uncertainty sampling. Our experiments in five standard natural language understanding tasks demonstrate that BALM provides substantial data efficiency improvements compared to various combinations of acquisition functions, models and fine-tuning methods proposed in recent AL literature.

1 Introduction

Active Learning (AL) is a method for training supervised models in a data-efficient way (Cohn et al., 1996; Settles, 2009). AL methods iteratively alternate between (i) model training with the labeled data available; and (ii) data selection for annotation using a stopping criterion, e.g. until exhausting a fixed annotation budget or reaching a pre-defined performance on a held-out dataset. Data selection is performed by an acquisition function that ranks unlabeled data points by some informativeness metric aiming to improve over random selection.

AL has been used in NLP for part-of-speech tagging (Engelson and Dagan, 1996), parsing (Tang et al., 2002), sentiment analysis (Li et al., 2012), machine translation (Haffari et al., 2009) and quality estimation (Beck et al., 2013) among others. It is especially useful in scenarios where a large pool of unlabeled data is available but only a limited annotation budget can be afforded; or where expert annotation is prohibitively expensive and time consuming.

Traditional Bayesian AL methods use uncertainty sampling (i.e. informativeness is measured by predictive uncertainty) and typically require probabilistic machine learning models to acquire good uncertainty estimates for the candidate data points. However, current work uses deep learning models that provide large performance gains but not well-calibrated confidence scores (Guo et al., 2017), i.e. predictive softmax probabilities are erroneously interpreted as model confidence (Gal and Ghahramani, 2016). Several approaches have been proposed to calibrate the output probability distribution of deep neural networks, such as temperature scaling (Guo et al., 2017), Monte Carlo dropout (Gal and Ghahramani, 2016) and model ensembles (Lakshminarayanan et al., 2017). Using uncertainty sampling with the vanilla output probabilities for AL may lead to incorrect conclusions, i.e. poor results may be attributed to the acquisition method, while the problem may be in fact the lack of calibration. Still, only a few deep Bayesian AL approaches apply a calibration method to the posterior probabilities (Gal et al., 2017; Shen et al., 2017; Siddhant and Lipton, 2018; Lowell and Lipton, 2019; Ein-Dor et al., 2020).

Furthermore, most current AL approaches in NLP use task-specific neural models that are trained from scratch at each iteration (Shen et al., 2017; Siddhant and Lipton, 2018; Prabhu et al., 2019; Ikhwantri et al., 2018; Kasai et al., 2019).
However, task-specific models are usually outperformed by pretrained language models (LMs) adapted to end-tasks (Howard and Ruder, 2018; Devlin et al., 2019), making them suboptimal for AL. Only recently, pretrained LMs such as BERT (Devlin et al., 2019) have been introduced in AL settings (Yuan et al., 2020; Ein-Dor et al., 2020), where they are transferred and used as downstream classification models. Still, they are trained at each AL iteration with a standard fine-tuning approach that mainly includes a pre-defined number of training epochs, which has been demonstrated to be unstable, especially in small datasets (Mosbach et al., 2021; Zhang et al., 2020; Dodge et al., 2020). Since AL includes both low and high data resource settings, the AL model training scheme should be robust in both scenarios.¹

¹ During the first few AL iterations the available labeled data is limited (low-resource), while it could become very large towards the last iterations (high-resource).

To address these limitations, we introduce Bayesian Active Learning with pretrained language Models (BALM). Contrary to previous work (Yuan et al., 2020; Ein-Dor et al., 2020) that also uses BERT (Devlin et al., 2019), our proposed method accounts for the varying data availability settings, the instability of fine-tuning and the poorly calibrated confidence scores for data selection:

1. We propose to continue pretraining the LM with the available unlabeled data to adapt it to the task-specific domain. This way, we leverage not only the available labeled data at each AL iteration, but the entire unlabeled pool;

2. We further propose a simple yet effective fine-tuning method that is robust in both low and high resource data AL settings;

3. We improve data acquisition by providing well-calibrated uncertainty estimates by using Monte Carlo dropout (Gal and Ghahramani, 2016) instead of using the softmax output as confidence scores.

We evaluate BALM on five standard natural language understanding tasks using a full suite of uncertainty-based acquisition functions, and compare against strong baselines that are based on diversity sampling (i.e. BERT K-means clustering), both uncertainty and diversity (e.g. BADGE (Ash et al., 2020), ALPS (Yuan et al., 2020)), and random sampling. We show that BALM outperforms all combinations of acquisition functions and training methods across all datasets (§5). We also find that our proposed training strategy yields substantial performance improvement when combined with any acquisition function (§6).

2 Background and Related Work

2.1 Problem Formulation

Given a downstream classification task with C classes, a typical pool-based AL setup consists of a pool of unlabeled data D_pool, a model M, a pre-defined annotation budget b of data points and an acquisition function a(.) for selecting k unlabeled data points for annotation (i.e. acquisition size) until b runs out. A validation set D_val is used to evaluate M after each iteration. The goal is to achieve data efficiency by selecting the smallest number of data points from D_pool for annotation while achieving the highest performance on the validation set D_val (Siddhant and Lipton, 2018). The performance of the algorithm is assessed by training a model on the actively acquired dataset and evaluating on a held-out test set D_test.

AL systems are first initialized and subsequently loop over Model Training, Data Acquisition and Data Annotation steps for T iterations, or until a pre-defined performance on D_val is reached.

2.2 Active Learning Initialization

To initialize AL, the total number of AL iterations can be simply calculated by T = b/k, where b is the budget and k the acquisition size.² Then, a data initialization policy selects the first k data points from D_pool to be annotated and updates the labeled dataset D_lab. The most common approach to select the first batch of data for annotation is stratified random sampling (Gal et al., 2017).

² If the budget b is a percentage of the number of unlabeled data points then T = b|D_pool|/k.

2.3 Model Training

In the first step of the AL loop, a model M_i is trained with the available labeled data D_lab at iteration i. If M_i is a task-specific architecture (Shen et al., 2017; Siddhant and Lipton, 2018; Prabhu et al., 2019), it is simply trained from scratch on D_lab until convergence. If M_i is based on a pretrained LM (Yuan et al., 2020; Ein-Dor et al., 2020), then it is initialized with the pretrained weights and fine-tuned to the task on D_lab by adding a task-specific output classification layer and updating all model parameters until convergence. Note that at each iteration i, the model parameters are initialized either randomly, if M_i is trained from scratch, or from the original pretrained LM. Warm-starting the model (i.e. initializing M_i with the parameters of M_{i-1}) has been shown to hinder the model's generalization ability (Ash and Adams, 2020). The AL loop stops if the performance of M_i on D_val is equal to or higher than the goal.

2.4 Data Acquisition

In this step, we use the acquisition function a to select the k most informative unlabeled samples from D_pool for annotation. The acquisition function usually uses the trained model M_i to rank the candidate unlabeled data. This is called a warm-start approach and the acquisition function formally is a(M_i, D_pool, k). A cold-start acquisition function typically does not use the model and selects data based on their input representations, a(D_pool, k).

There are two main strategies for acquiring data: uncertainty and diversity sampling. Uncertainty sampling aims to select the most uncertain data based on the model's predictive uncertainty. The assumption is that the most uncertain data are the most difficult ones for the model, and therefore the most useful to facilitate training. Typical uncertainty-based acquisition functions include LEAST CONFIDENCE (Lewis and Gale, 1994) that sorts D_pool by the probability of not predicting the most confident class, in descending order, ENTROPY (Shannon, 1948) that selects samples that maximize the predictive entropy, and BALD (Houlsby et al., 2011), short for Bayesian Active Learning by Disagreement, that chooses data points that maximize the mutual information between predictions and the model's posterior probabilities. BATCHBALD (Kirsch et al., 2019) is a recently introduced extension of BALD that jointly scores points by estimating the mutual information between multiple data points and the model parameters. This iterative algorithm aims to find batches of informative data points, in contrast to BALD that chooses points that are informative individually. Uncertainty sampling is a warm-start approach since it requires confidence scores from the trained model for all candidate unlabeled data.

On the other hand, diversity-based approaches aim to exploit the heterogeneity of the feature space and typically use clustering to choose a diverse set of points from D_pool (Wang and Ye, 2015; Sener and Savarese, 2018; Zeng et al., 2019). A diversity-based acquisition function can be either cold-start or warm-start. There are also hybrid approaches that aim to select data based on both uncertainty and diversity sampling (He et al., 2014; Yang et al., 2015; Erdmann et al., 2019; Yuan et al., 2020; Ash et al., 2020), and other methods that use reinforcement learning (Fang et al., 2017; Liu et al., 2018). In our work, we use acquisition functions based on uncertainty sampling (§3.3), but any acquisition function that takes as input the unlabeled data, the acquisition size and, if applicable, the model, and outputs a batch of k data points could be used, Q_i = a(M_i, D_pool, k). Comparison of different types of acquisition functions is out of the scope of this paper.

2.5 Data Annotation

Finally, the acquired set Q_i of k data points at iteration i is passed to an oracle for annotation. After acquiring labels, Q_i is appended to the labeled dataset D_lab and subsequently removed from D_pool. The remaining budget b is adjusted accordingly. If it has been exhausted, AL stops. Otherwise iteration i+1 begins from the Model Training step with the updated D_lab and D_pool datasets.

We note that, following previous work (Siddhant and Lipton, 2018; Yuan et al., 2020; Ein-Dor et al., 2020), we use budget as a stopping criterion to facilitate fair comparison between the various methods considered. However, there are various AL stopping criteria for practitioners (Vlachos, 2008; Bloodgood and Vijay-Shanker, 2009) which are beyond the scope of this paper.

3 BALM: Bayesian Active Learning with Pretrained Language Models

Our aim is to improve LM-based AL to (1) account for varying data resource availability; (2) tackle the instability of LM fine-tuning; and (3) improve data acquisition with better calibrated confidence scores. For that purpose, we propose Bayesian Active Learning with pretrained language models (BALM) following the standard AL pipeline (§2). In the AL initialization step (§2.2), we first adapt the LM using all the available unlabeled data of the downstream task (§3.1). We then propose a fine-tuning approach of the model (§3.2) that adjusts to all data availability settings (i.e. the low-resource setting at the first iterations, and the high-resource at the later iterations) during training (§2.3). Last, we extract uncertainty estimates from the adapted model using a probabilistic framework (§3.3) to improve uncertainty-based data acquisition.

In our experiments, we use BERT (Devlin et al., 2019), a state-of-the-art pretrained language model, as our AL classification model but our method is independent of the chosen LM.

Algorithm 1: BALM algorithm
Input: unlabeled data D_pool, pretrained language model P(x; W_0), acquisition size k, AL iterations T, acquisition function a
 1  D_lab ← ∅
 2  P_TAPT(x; W_0') ← Train P(x; W_0) on D_pool
 3  Q_0 ← RANDOM(.), |Q_0| = k
 4  D_lab = D_lab ∪ Q_0
 5  D_pool = D_pool \ Q_0
 6  for i ← 1 to T do
 7      M_i(x; [W_0', W_c]) ← Initialize from P_TAPT(x; W_0')
 8      M_i(x; W_i) ← Train model on D_lab
 9      Q_i ← a(M_i, D_pool, k)
10      D_lab = D_lab ∪ Q_i
11      D_pool = D_pool \ Q_i
12  end
Output: D_lab

3.1 LM Adaptation during AL Initialization

Inspired by recent work on transfer learning that shows improvements in downstream classification performance by continuing the pretraining of the LM with the task data (Howard and Ruder, 2018; Gururangan et al., 2020), we add an extra step in the AL initialization by continuing pretraining the LM. To this end, we apply Task-Adaptive Pretraining (TAPT) to the AL setting. Formally, we use an LM, such as BERT (Devlin et al., 2019), P(x; W_0) with weights W_0, that has been already pretrained on a large corpus. We fine-tune P(x; W_0) with the available unlabeled data of the downstream task D_pool, resulting in the task-adapted LM P_TAPT(x; W_0') with new weights W_0' (cf. line 2 of algorithm 1).

3.2 AL Classification Model Fine-tuning

We now use the adapted LM P_TAPT(x; W_0') for active learning. At each iteration i, we initialize our model M_i with the pretrained weights W_0' and we add a task-specific feedforward layer for classification W_c on top of the [CLS] token representation of BERT-based P_TAPT. We fine-tune the classification model M_i(x; [W_0', W_c]) with all x ∈ D_lab (cf. line 6 to 8 of algorithm 1).

Recent work in AL (Ein-Dor et al., 2020; Yuan et al., 2020) uses the standard fine-tuning method proposed in Devlin et al. (2019) which includes a fixed number of 3 training epochs, a learning rate between 2e-5 and 5e-5, learning rate warmup over the first 10% of the steps and AdamW optimizer (Loshchilov and Hutter, 2019) without bias correction, among other hyperparameters. We follow a different approach by taking into account insights from few-shot fine-tuning literature (Mosbach et al., 2021; Zhang et al., 2020) that proposes longer fine-tuning. We also follow Dodge et al. (2020), who demonstrate more robust BERT fine-tuning by increasing the number of evaluation steps during training.

We combine these guidelines in our fine-tuning approach by using early stopping with 20 epochs based on the validation loss, learning rate 2e-5, bias correction and 5 evaluation steps per epoch. However, increasing the number of epochs from 3 to 20 also increases the warmup steps (10% of total steps³) almost 7 times. This may be problematic in scenarios where the dataset is large but the optimal number of epochs may be small (e.g. 2 or 3). To account for this limitation in our AL setting where the size of the training set changes at each iteration, we propose a simple empirical warmup approach by selecting the warmup steps as min(10% of total steps, 100). We denote standard fine-tuning as SFT and our approach as FT+.

³ Some guidelines propose an even smaller number of warmup steps, such as 6% in RoBERTa (Liu et al., 2020).

3.3 Uncertainty Estimation for Data Acquisition

After fine-tuning the classification model M_i with D_lab, we use it to acquire uncertainty estimates for all candidate data points in D_pool. We use uncertainty sampling by selecting the k most uncertain data from D_pool for annotation (cf. line 9 of algorithm 1). Instead of using the output softmax probabilities for each class, we use a probabilistic formulation of deep neural networks in order to acquire better calibrated scores.

Monte Carlo (MC) dropout (Gal and Ghahramani, 2016) is a simple yet effective method for performing approximate variational inference, based on dropout (Srivastava et al., 2014). Gal and Ghahramani (2016) prove that by simply performing dropout during the forward pass in making predictions, the output is equivalent to the prediction when the parameters are sampled from a variational distribution of the true posterior. Therefore, dropout during inference results in obtaining predictions from different parts of the network.

Our BERT-based M_i model uses dropout layers during training for regularization. We apply MC dropout by simply activating them during test time and we perform multiple stochastic forward passes. Formally, we do N passes of every x ∈ D_pool through M_i(x; W_i) to acquire N different output probability distributions for each x.

Four uncertainty acquisition functions are used in our work: LEAST CONFIDENCE, ENTROPY, BALD and BATCHBALD (§2.4). Note that LEAST CONFIDENCE, ENTROPY and BALD have been used in AL for NLP by Siddhant and Lipton (2018). To the best of our knowledge, BATCHBALD is evaluated for the first time in the NLP domain.

4 Experimental Setup

4.1 Tasks & Datasets

We experiment with five diverse natural language understanding tasks including binary and multi-class labels and varying dataset sizes (Table 1). The first task is question classification using the six-class version of the small TREC-6 dataset of open-domain, fact-based questions divided into broad semantic categories (Voorhees and Tice, 2000). We also evaluate our approach on sentiment analysis using the binary movie review IMDB dataset (Maas et al., 2011) and the binary version of the SST-2 dataset (Socher et al., 2013). We finally use the large-scale AGNEWS and DBPEDIA datasets from Zhang et al. (2015) for topic classification. We undersample the latter and form a D_pool of 20K examples and a D_val of 2K.

DATASETS   TRAIN   VAL    TEST   k      C
TREC-6     4.9K    546    500    1%     6
DBPEDIA    20K     2K     70K    1%     14
IMDB       22.5K   2.5K   25K    1%     2
SST-2      60.6K   6.7K   871    1%     2
AGNEWS     114K    6K     7.6K   0.5%   4

Table 1: Datasets statistics. The TRAIN, VAL and TEST columns give the sizes of D_pool, D_val and D_test respectively. k stands for the acquisition size (% of D_pool) and C the number of classes.

4.2 Training & AL Details

We use BERT-BASE (Devlin et al., 2019) and fine-tune it (TAPT §3.1) for 100K steps, with learning rate 2e-05 and the rest of hyperparameters as in Gururangan et al. (2020) using the HuggingFace library (Wolf et al., 2020). We evaluate the model 5 times per epoch on D_val and keep the one with the lowest validation loss as in Dodge et al. (2020). We use the code provided by Kirsch et al. (2019) for the uncertainty-based acquisition functions and Yuan et al. (2020) for ALPS, BADGE and BERTKM. We use the standard splits provided for all datasets, if available, otherwise we randomly sample a validation set. We test all models on a held-out test set. We repeat all experiments with five different random seeds resulting in different initializations of D_lab and the weights of the extra task-specific output feedforward layer. For all datasets we use 15% of D_pool as budget. Each experiment is run on a single Nvidia Tesla V100 GPU. More details are provided in Appendix A.1.

4.3 Baselines

Acquisition functions We compare uncertainty sampling (§3.3) with four baseline acquisition functions. The first is the standard AL baseline, RANDOM, which applies uniform sampling and selects k data points from D_pool at each iteration. The second is BADGE (Ash et al., 2020), an acquisition function that aims to combine diversity and uncertainty sampling. The algorithm computes gradient embeddings g_x for every candidate data point x in D_pool and then uses clustering to select a batch. Each g_x is computed as the gradient of the cross-entropy loss with respect to the parameters of the model's last layer. We also compare against a recently introduced cold-start acquisition function called ALPS (Yuan et al., 2020). ALPS acquisition uses the masked language model (MLM) loss of BERT as a proxy for model uncertainty in the downstream classification task. Specifically, aiming to leverage both uncertainty and diversity, ALPS forms a surprisal embedding s_x for each x, by passing the unmasked input x through the BERT MLM head to compute the cross-entropy loss for a random 15% subsample of tokens against the target labels. ALPS clusters these embeddings to sample k sentences for each AL iteration. Last, following Yuan et al. (2020), we use BERTKM as a diversity baseline, where the l2 normalized BERT output embeddings are used for clustering.

[Figure 1 plots: one panel per dataset (TREC-6, DBPEDIA, IMDB, SST-2, AGNEWS); x-axis: acquired dataset size (%), from 2 to 14; y-axis: test accuracy. Each panel marks the full-dataset reference: TREC-6 4.9K training data (100%), DBPEDIA 20K training data, IMDB 22.5K training data (100%), SST-2 60.6K training data (100%), AGNEWS 114K training data (100%).]

Figure 1: Test accuracy during AL iterations using BALM with ENTROPY against RANDOM,ALPS,BADGE and BERTKM acquisition functions. The dotted line denotes experiments with BERT and standard fine-tuning (SFT) and the solid line with BERT-TAPT and FT+. We plot the median and standard deviation across five runs.

Models & Fine-tuning Methods We also evaluate (§6) two variants of the pretrained language model: the original BERT model, used in Yuan et al. (2020) and Ein-Dor et al. (2020),⁴ and our adapted model BERT-TAPT (§3.1); and two fine-tuning methods: our proposed fine-tuning approach FT+ (§3.2) and standard BERT fine-tuning SFT.

⁴ Ein-Dor et al. (2020) evaluate various acquisition functions, including entropy with MC dropout, and use BERT with the standard fine-tuning approach (SFT).

5 Results

Figure 1 presents the results for all datasets.⁵ Our proposed method BALM consists of the BERT-TAPT model (§3.1), FT+ fine-tuning method (§3.2) and ENTROPY acquisition (§3.3). For all experiments with ENTROPY acquisition, we use MC dropout with N = 5. We show that BALM consistently outperforms all baselines across datasets.

⁵ We do not evaluate BADGE on AGNEWS because of the increased time complexity of the algorithm: O(Cknd) for a C-way classification task, k queries, n points in D_pool, and d-dimensional BERT embeddings.

Data Efficiency We first observe that BALM achieves large data efficiency since it reaches the full-dataset performance within the 15% budget for all datasets. The performance of BALM is most notable in the smaller datasets. In TREC-6, it achieves the goal accuracy with almost 10% annotated data, while in DBPEDIA it does so already in the first iteration, with 2% of the data. In the first AL iteration in IMDB, BALM is only 2.5 accuracy points below the performance obtained with 100% of the data, which it later reaches after acquiring 15% of the data. In the larger SST-2 and AGNEWS datasets, BALM is closer to the baselines but still outperforms them, achieving the full-dataset performance with 8% and 12% of the data respectively.

Training Strategy We also observe that in all datasets, the addition of our proposed pretraining step (TAPT §3.1) and fine-tuning technique (FT+ §3.2) leads to large performance gains, especially in the first AL iterations. This is particularly evident in TREC-6, DBPEDIA and IMDB datasets, where after the first AL iteration (i.e. equivalent to 2% of training data) BALM with ENTROPY is 45, 30 and 12 points in accuracy, respectively, higher than the ENTROPY baseline with BERT and SFT. This is a rather interesting finding, since our simple additions to the training strategy of the model proved to be particularly effective, resulting in large performance improvements.

Acquisition Strategy We finally observe that the performance curves of the various acquisition functions considered (i.e. dotted lines) are generally close to each other, suggesting that the choice of the acquisition strategy does not substantially affect the AL performance. In other words, we conclude that the training strategy is more important than the acquisition strategy. We find that uncertainty sampling with ENTROPY is generally the best performing acquisition function, followed by BADGE. Still, finding a universally well-performing acquisition function, independent of the training strategy, is an open research question. Our findings show that uncertainty sampling is the strongest approach, with room for improvement over the competitive random sampling baseline (§6).

[Figure 2 plot: y-axis: loss; x-axis: steps, from 0 to 100000; one curve per dataset (TREC-6, DBPEDIA, IMDB, SST-2, AGNEWS).]

Figure 2: Validation MLM loss during TAPT.

[Figure 3 plots: panels AG_NEWS and IMDB; y-axis: accuracy; x-axis: number of training samples (100, 1000, 10000); one curve per number of epochs (3, 10, 20).]

Figure 3: Few-shot standard BERT fine-tuning.

6 Analysis & Discussion

Task-Adaptive Pretraining We present details of TAPT (§3.1) and reflect on its effectiveness in the AL pipeline. Following Gururangan et al. (2020), we continue pretraining BERT for the MLM task using all the unlabeled data D_pool for all datasets separately. We plot the learning curves of BERT-TAPT for all datasets in Figure 2. We first observe that the masked LM loss is steadily decreasing for DBPEDIA, IMDB and AGNEWS across optimization steps, which correlates with the high early AL performance gains of TAPT in these datasets (Fig. 1). We also observe that the LM overfits in TREC-6 and SST-2 datasets. We attribute this to the very small training dataset of TREC-6 and the informal textual style of SST-2. Although SST-2 includes approximately 67K training examples, the sentences are very short (i.e. average length of 9.4 words per input sentence). We hypothesize the LM overfits because of the lack of long and diverse sentences. More details on TAPT can be found in Appendix A.2.

Few-shot Fine-tuning We highlight the importance of considering the few-shot learning problem in the AL pipeline during the first iterations, which is often neglected in the literature. This is more important when using pretrained LMs, since they are overparameterized models that require adapting the training scheme when low data resources are available to ensure robustness.

To illustrate the inefficiency of standard fine-tuning (SFT), we randomly undersample AGNEWS and IMDB to form low and high data settings (i.e. 100, 1,000 and 10,000 training samples) and train BERT for a fixed number of 3, 10, and 20 epochs. Figure 3 shows that SFT is suboptimal for low data settings, indicating that more optimization steps are needed for the model to adapt to the few training samples (Mosbach et al., 2021; Zhang et al., 2020). As the training samples increase, fewer epochs are often better. It is thus evident that there is not a clearly optimal way to choose a predefined number of epochs to train the model given the number of training examples. This motivates the need to find a fine-tuning policy for AL that efficiently adapts to the data resource setting of each iteration (independent of the number of training examples or dataset), which is mainly tackled by our proposed fine-tuning approach FT+ (§3.2).

Ablation Study We also conduct an ablation study to show that our proposed AL training methods, (i) the pretraining step (TAPT §3.1) and (ii) the fine-tuning method (FT+ §3.2), provide large gains compared to standard BERT fine-tuning (SFT) in terms of accuracy, data efficiency and uncertainty calibration. We therefore compare BERT with SFT, BERT with FT+ and BERT-TAPT with FT+ (BALM). Along with test accuracy, we also evaluate each AL model on a benchmark of uncertainty estimation metrics as proposed by Ovadia et al. (2019), namely Brier score, negative log likelihood (NLL), expected calibration error (ECE) and entropy. A well-calibrated model should have high accuracy and low values on the uncertainty metrics.

[Figure 4 plots: columns TREC-6 and AGNEWS; rows: accuracy, Brier, NLL, ECE, entropy; x-axis: acquired dataset size (0 to 600 for TREC-6, 0 to 15000 for AGNEWS).]

Figure 4: BALM ablation study.

Figure 4 shows the results for the smallest and largest datasets, TREC-6 and AGNEWS respectively. For TREC-6, training BERT with our fine-tuning approach FT+ provides large gains both in accuracy and uncertainty calibration, showing how important it is to fine-tune the LM for a larger number of epochs in low resource settings. For the larger dataset, AGNEWS, we see that BERT with SFT performs equally to FT+, which is the ideal scenario. Our fine-tuning approach does not degrade the performance of BERT, as the large increase in warmup steps could otherwise have done (see §3.2), showing that our simple fix provides robust results in both high and low resource settings.

After demonstrating that FT+ yields better results than SFT, we next compare BALM against BERT with FT+. We observe that in both datasets BERT-TAPT outperforms BERT, with this being particularly evident in the early iterations. This finding confirms our hypothesis that by implicitly using the entire pool of unlabeled data in the extra pretraining step (TAPT), we boost the performance of the AL classification model using less data.

Performance of Acquisition Functions In our BALM experiments so far, we showed results with ENTROPY. We have also experimented with various uncertainty-based acquisition functions, i.e. LEAST CONFIDENCE, BALD and BATCHBALD (§3.3), and our findings show that all functions provide similar performance, except for BALD that slightly underperforms. This makes our approach agnostic to the selected uncertainty-based acquisition method. We also evaluate our proposed methods with our baseline acquisition functions, i.e. RANDOM, ALPS, BERTKM and BADGE, since our training strategy is orthogonal to the acquisition strategy. We compare all acquisition functions with BALM for AGNEWS and IMDB in Figure 5. We observe that in general uncertainty-based acquisition performs better compared to diversity, while all acquisition strategies benefit from our BALM training strategy (TAPT and FT+). We discuss the efficiency of the methods in Appendix A.3.

[Figure 5 plots: panels AGNEWS and IMDB; x-axis: acquired dataset size (%), from 5 to 15; y-axis: accuracy. Each panel marks the full-dataset reference: AGNEWS 114K training data (100%), IMDB 22.5K training data (100%).]

Figure 5: Comparison of acquisition functions using TAPT and FT+ in training BERT.

7 Conclusions & Future Work

We have presented Bayesian Active Learning with pretrained language Models (BALM) consisting of (i) an extra pretraining step with the unlabeled task specific data, (ii) a simple yet effective fine-tuning method for the downstream model and (iii) use of MC dropout to acquire well-calibrated confidence scores for uncertainty sampling. BALM accounts for the few-shot learning phase of AL while still adapting effectively to the high-resource setting of the last iterations. Our findings also show that the proposed training strategy is more effective in improving AL performance than the selected acquisition function. In the future, we aim to investigate semi-supervised learning methods to leverage unlabeled data during training the downstream model.

References

Jordan T. Ash and Ryan P. Adams. 2020. On warm-starting neural network training.

Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. 2020. Deep batch active learning by diverse, uncertain gradient lower bounds. In International Conference on Learning Representations.

Daniel Beck, Lucia Specia, and Trevor Cohn. 2013. Reducing annotation effort for quality estimation via active learning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 543–548.

Michael Bloodgood and K. Vijay-Shanker. 2009. A method for stopping active learning based on stabilizing predictions and the need for user-adjustable stopping. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 39–47, Boulder, Colorado. Association for Computational Linguistics.

David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan. 1996. Active learning with statistical models. Journal of Artificial Intelligence Research, 4(1):129–145.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.

Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah A. Smith. 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. ArXiv.

[...] North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2223–2234.

Meng Fang, Yuan Li, and Trevor Cohn. 2017. Learning how to active learn: A deep reinforcement learning approach. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 595–605, Copenhagen, Denmark. Association for Computational Linguistics.

Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning, volume 48, pages 1050–1059.

Yarin Gal, Riashat Islam, and Zoubin Ghahramani. 2017. Deep Bayesian active learning with image data. In Proceedings of the International Conference on Machine Learning, volume 70, pages 1183–1192.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. 2017. On calibration of modern neural networks.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.

Gholamreza Haffari, Maxim Roy, and Anoop Sarkar. 2009. Active learning for statistical phrase-based machine translation. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 415–423.
Tianxu He, Shukui Zhang, Jie Xin, Pengpeng Zhao, Liat Ein-Dor, Alon Halfon, Ariel Gera, Eyal Shnarch, Jian Wu, Xuefeng Xian, Chunhua Li, and Zhiming Lena Dankin, Leshem Choshen, Marina Danilevsky, Cui. 2014. An active learning approach with uncer- Ranit Aharonov, Yoav Katz, and Noam Slonim. tainty, representativeness, and diversity. Scientific 2020. Active learning for BERT: An empirical WorldJ ournal, 2014:827586. study. In Proceedings of theConference on Empiri- cal Methods in Natural Language Processing, pages Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and 7949–7962. Máté Lengyel. 2011. Bayesian active learning for classification and preference learning. ArXiv. Sean P. Engelson and Ido Dagan. 1996. Minimiz- ing manual annotation cost in supervised training Jeremy Howard and Sebastian Ruder. 2018. Universal from corpora. In Proceedings of the Annual Meet- language model fine-tuning for text classification. In ing of the Association for Computational Linguistics, Proceedings of the Annual Meeting of the Associa- pages 319–326. tion for Computational Linguistics, pages 328–339.

Alexander Erdmann, David Joseph Wrisley, Benjamin Fariz Ikhwantri, Samuel Louvan, Kemal Kurniawan, Allen, Christopher Brown, Sophie Cohen-Bodénès, Bagas Abisena, Valdi Rachman, Alfan Farizki Micha Elsner, Yukun Feng, Brian Joseph, Béatrice Wicaksono, and Rahmad Mahendra. 2018. Multi- Joyeux-Prunel, and Marie-Catherine de Marneffe. task active learning for neural semantic role labeling 2019. Practical, efficient, and customizable active on low resource conversational corpus. In Proceed- learning for named entity recognition in the digital ings of the Workshop on Deep Learning Approaches humanities. In Proceedings of the Conference of the for Low-Resource NLP, pages 43–50. Jungo Kasai, Kun Qian, Sairam Gurajada, Yunyao Li, baselines. In International Conference on Learning and Lucian Popa. 2019. Low-resource deep entity Representations. resolution with transfer and active learning. In Pro- ceedings of the Conference of the Association for Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, Computational Linguistic, pages 5851–5861. D. Sculley, Sebastian Nowozin, Joshua Dillon, Bal- aji Lakshminarayanan, and Jasper Snoek. 2019. Can Andreas Kirsch, Joost van Amersfoort, and Yarin Gal. you trust your model's uncertainty? evaluating 2019. BatchBALD: Efficient and diverse batch ac- predictive uncertainty under dataset shift. In Ad- quisition for deep bayesian active learning. In Neu- vances in Neural Information Processing Systems, ral Information Processing Systems, pages 7026– volume 32, pages 13991–14002. 7037. Adam Paszke, Sam Gross, Francisco Massa, Adam Balaji Lakshminarayanan, Alexander Pritzel, and Lerer, James Bradbury, Gregory Chanan, Trevor Charles Blundell. 2017. Simple and scalable predic- Killeen, Zeming Lin, Natalia Gimelshein, Luca tive uncertainty estimation using deep ensembles. In Antiga, Alban Desmaison, Andreas Kopf, Edward Advances in Neural Information Processing Systems, Yang, Zachary DeVito, Martin Raison, Alykhan Te- pages 6402–6413. 
jani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. Pytorch: David D. Lewis and William A. Gale. 1994. A se- An imperative style, high-performance deep learn- quential algorithm for training text classifiers. In In ing library. In Advances in Neural Information Pro- Proceedings of the Annual International ACM SIGIR cessing Systems, pages 8024–8035. Conference on Research and Development in Infor- mation Retrieval. Ameya Prabhu, Charles Dognin, and Maneesh Singh. 2019. Sampling bias in deep active classification: Shoushan Li, Shengfeng Ju, Guodong Zhou, and Xiao- An empirical study. In Proceedings of the Confer- jun Li. 2012. Active learning for imbalanced sen- ence on Empirical Methods in Natural Language timent classification. In Proceedings of the Joint Processing and the International Joint Conference Conference on Empirical Methods in Natural Lan- on Natural Language Processing, pages 4056–4066. guage Processing and Computational Natural Lan- guage Learning, pages 139–148. Ozan Sener and Silvio Savarese. 2018. Active learn- ing for convolutional neural networks: A core-set Ming Liu, Wray Buntine, and Gholamreza Haffari. approach. In Proceedings of the International Con- 2018. Learning how to actively learn: A deep im- ference on Learning Representations. itation learning approach. In Proceedings of the 56th Annual Meeting of the Association for Compu- Burr Settles. 2009. Active learning literature survey. tational Linguistics (Volume 1: Long Papers), pages Computer sciences technical report. 1874–1883, Melbourne, Australia. Association for Computational Linguistics. Claude Elwood Shannon. 1948. A mathematical the- ory of communication. The Bell System Technical Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- Journal. dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2020. 
Yanyao Shen, Hyokun Yun, Zachary Lipton, Yakov Ro{bert}a: A robustly optimized {bert} pretraining Kronrod, and Animashree Anandkumar. 2017. approach. Deep active learning for named entity recognition. In Proceedings of the Workshop on Representation Ilya Loshchilov and Frank Hutter. 2019. Decoupled Learning for NLP, pages 252–256. weight decay regularization. In International Con- ference on Learning Representations. Aditya Siddhant and Zachary C Lipton. 2018. Deep bayesian active learning for natural language pro- David Lowell and Zachary C Lipton. 2019. Practical cessing: Results of a Large-Scale empirical study. obstacles to deploying active learning. Proceedings In Proceedings of the Conference on Empirical of the Conference on Empirical Methods in Natu- Methods in Natural Language Processing, pages ral Language Processing and the International Joint 2904–2909. Conference on Natural Language Processing, pages 21–30. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Christopher Potts. 2013. Recursive deep models Dan Huang, Andrew Y. Ng, and Christopher Potts. for semantic compositionality over a sentiment tree- 2011. Learning word vectors for sentiment analy- bank. In Proceedings of the Conference on Empiri- sis. In Proceedings of the Annual Meeting of the cal Methods in Natural Language Processing, pages Association for Computational Linguistics: Human 1631–1642. Language Technologies, pages 142–150. N Srivastava, G Hinton, A Krizhevsky, and others. Marius Mosbach, Maksym Andriushchenko, and Diet- 2014. Dropout: a simple way to prevent neural net- rich Klakow. 2021. On the stability of fine-tuning works from overfitting. Journal of Machine Learn- {bert}: Misconceptions, explanations, and strong ing Research, 15(56):1929–1958. Min Tang, Xiaoqiang Luo, and Salim Roukos. 2002. A Appendix Active learning for statistical natural language pars- ing. 
In Annual Meeting of the Association for Com- A.1 Hyperparameters & Dataset Details putational Linguistics. In this section we provide details of all the datasets we used in this work and the hyperparparameters Andreas Vlachos. 2008. A stopping criterion for active learning. Comput. Speech Lang., 22(3):295–312. used for training the model. For TREC-6, IMDB and SST-2 we randomly sample 10% from the training Ellen Voorhees and Dawn Tice. 2000. The trec-8 ques- set to serve as the validation set, while for AGNEWS tion answering track evaluation. Proceedings of the we sample 5%. For the DBPEDIA dataset we under- Text Retrieval Conference. sample both training and validation datasets (from the standard splits) to facilitate our AL simulation Alex Wang, Amanpreet Singh, Julian Michael, Felix (i.e. the original dataset consists of 560K train- Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis plat- ing and 28K validation data examples). For all form for natural language understanding. In Inter- datasets we use the standard test set, apart from the national Conference on Learning Representations. SST-2 dataset that is taken from the GLUE bench- mark (Wang et al., 2019) we use the development Zheng Wang and Jieping Ye. 2015. Querying discrim- set as the held-out test set. inative and representative samples for batch mode active learning. ACM Transactions on Knowledge For all datasets we train BERT-BASE (Devlin Discovery from Data, 9(3). et al., 2019) from the HuggingFace library (Wolf et al., 2020) in Pytorch (Paszke et al., 2019). We Thomas Wolf, Lysandre Debut, Victor Sanh, Julien train all models with batch size 16, learning rate Chaumond, Clement Delangue, Anthony Moi, Pier- 2e − 5, no weight decay, AdamW optimizer with ric Cistac, Tim Rault, Remi Louf, Morgan Funtow- 1e − 8 icz, Joe Davison, Sam Shleifer, Patrick von Platen, epsilon . 
For all datasets we use maximum Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, sequenxe length of 128, except for IMDB and AG- Teven Le Scao, Sylvain Gugger, Mariama Drame, NEWS that contain longer input texts, where we use Quentin Lhoest, and Alexander Rush. 2020. Trans- 256. To ensure reproducibility and fair comparison formers: State-of-the-art natural language process- between the various methods under evaluation, we ing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing: System run all experiments with the same five seeds that Demonstrations, pages 38–45. we randomly selected from the range [1, 9999].

Yi Yang, Zhigang Ma, Feiping Nie, Xiaojun Chang, MODELTREC-6 DBPEDIAIMDBSST-2 AGNEWS and Alexander G Hauptmann. 2015. Multi-class ac- VALIDATION SET tive learning by uncertainty sampling with diversity BERT 94.4 99.1 90.7 93.7 94.4 maximization. International Journal of Computer BERT-TAPT 95.2 99.2 91.9 94.3 94.5 Vision, 113(2):113–127. TEST SET BERT 80.6 99.2 91.0 90.6 94.0 Michelle Yuan, Hsuan-Tien Lin, and Jordan Boyd- BERT-TAPT 77.2 99.2 91.9 90.8 94.2 Graber. 2020. Cold-start active learning through self-supervised language modeling. Table 2: Accuracy with 100% of data over five runs (different random seeds). Xiangkai Zeng, Sarthak Garg, Rajen Chatterjee, Ud- hyakumar Nallasamy, and Matthias Paulik. 2019. Empirical evaluation of active learning techniques A.2 Task-Adaptive Pretraining (TAPT)& for neural MT. In Proceedings of the Workshop on Full-Dataset Performance Deep Learning Approaches for Low-Resource NLP As discussed in §3.1 and §6, we continue training (DeepLo 2019), pages 84–93. the BERT-BASE (Devlin et al., 2019) pretrained Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q. masked language model using the available data Weinberger, and Yoav Artzi. 2020. Revisiting few- Dpool. We explored various learning rates between sample bert fine-tuning. ArXiv. 1e-4 and 1e-5 and found the latter to produce the lowest validation loss. We trained each model (one Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. for each dataset) for up to 100K optimization steps, Character-level convolutional networks for text clas- D sification. In Advances in Neural Information Pro- we evaluated on val every 500 steps and saved cessing Systems, volume 28, pages 649–657. Curran the checkpoint with the lowest validation loss. We Associates, Inc. used the resulting model in our BALM experiments. 
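The checkpoint-selection rule described for TAPT (evaluate on Dval every 500 steps and keep the checkpoint with the lowest validation loss) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' training script: `train_step` and `validation_loss` are hypothetical callables standing in for one masked-language-modeling update and one full pass over Dval.

```python
from typing import Callable, Optional, Tuple


def select_best_checkpoint(
    train_step: Callable[[int], None],
    validation_loss: Callable[[], float],
    max_steps: int = 100_000,
    eval_every: int = 500,
) -> Tuple[Optional[int], float]:
    """Train for max_steps and return (step, loss) of the best evaluation."""
    best_step: Optional[int] = None
    best_loss = float("inf")
    for step in range(1, max_steps + 1):
        train_step(step)  # hypothetical: one optimizer update
        if step % eval_every == 0:
            loss = validation_loss()  # hypothetical: loss on D_val
            if loss < best_loss:
                # A real script would also write the model weights to disk here.
                best_step, best_loss = step, loss
    return best_step, best_loss
```

With the defaults this evaluates 200 times per run; only the best evaluation is retained, so training for longer than necessary costs compute but never returns a worse checkpoint.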
                   TREC-6   SST-2      IMDB       DBPEDIA   AGNEWS
RANDOM             0/0      0/0        0/0        0/0       0/0
ALPS               0/57     0/478      0/206      0/134     0/634
BADGE              0/63     0/23110    0/1059     0/192     -
BERTKM             0/47     0/2297     0/324      0/137     0/3651
ENTROPY            81/0     989/0      557/0      264/0     2911/0
LEAST CONFIDENCE   69/0     865/0      522/0      256/0     2607/0
BALD               69/0     797/0      524/0      256/0     2589/0
BATCHBALD          69/21    841/1141   450/104    256/482   2844/5611

Table 3: Runtimes (in seconds) for all datasets. In each cell of the table we present a tuple i/s where i is the inference time and s the selection time. Inference time is the time for the model to perform a forward pass for all the unlabeled data in Dpool and selection time is the time that each acquisition function requires to rank all candidate data points and select k for annotation (for a single iteration). Since we cannot report the runtimes for every model in the AL pipeline (at each iteration the size of Dpool changes), we provide the median.
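As an illustration of how per-iteration timings of this kind could be collected and summarised by their median, consider the sketch below. It is not taken from the paper's code: the helper names `timed`, `median_runtimes` and `format_cell` are hypothetical, and wall-clock timing via `time.perf_counter` is an assumption about the measurement method.

```python
import statistics
import time
from typing import Callable, Iterable, Tuple


def timed(fn: Callable[[], object]) -> Tuple[object, float]:
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def median_runtimes(
    per_iteration: Iterable[Tuple[float, float]],
) -> Tuple[float, float]:
    """Median (inference, selection) seconds over all AL iterations."""
    inference, selection = zip(*per_iteration)
    return statistics.median(inference), statistics.median(selection)


def format_cell(inference: float, selection: float) -> str:
    """Render one table cell in the i/s convention, rounded to seconds."""
    return f"{round(inference)}/{round(selection)}"
```

Reporting the median rather than the mean keeps the summary from being dominated by the first iterations, where Dpool is largest.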

Figure 6: Learning curves of TAPT for various learning rates. (Panels: TREC-6, SST-2 and IMDB; x-axis: steps, 0 to 100000; y-axis: loss; learning rates shown: 5e-06, 1e-05, 5e-05 and 0.0001.)

We plot the learning curves of the masked language modeling task (TAPT) for three datasets and all considered learning rates in Figure 6. We notice that a smaller learning rate facilitates the training of the MLM.

In Table 2 we provide the validation and test accuracy of BERT and BERT-TAPT for all datasets. We present the mean across runs with three random seeds. For fine-tuning the models, we used the proposed approach FT+ (§3.2).

A.3 Efficiency of Acquisition Functions

In this section we discuss the efficiency of the eight acquisition functions considered in this work: RANDOM, ALPS, BADGE, BERTKM, ENTROPY, LEAST CONFIDENCE, BALD and BATCHBALD.

In Table 3 we provide the runtimes for all acquisition functions and datasets. Each AL experiment consists of multiple iterations (and therefore multiple models), each with a different training dataset Dlab and pool of unlabeled data Dpool. In order to evaluate how computationally heavy each method is, we provide the median of all the models in one AL experiment. We calculate the runtime of two types of functionalities. The first is the inference time and stands for the forward pass of each x ∈ Dpool to acquire confidence scores for uncertainty sampling. RANDOM, ALPS, BADGE and BERTKM do not require this step, so it is only applied to uncertainty-based acquisition, where acquiring uncertainty estimates with MC dropout is needed. The second functionality is selection time and measures how much time each acquisition function requires to rank and select the k data points from Dpool to be labeled in the next step of the AL pipeline. RANDOM, ENTROPY, LEAST CONFIDENCE and BALD perform simple equations to rank the data points and therefore do not require selection time. On the other hand, ALPS, BADGE, BERTKM and BATCHBALD perform iterative algorithms that increase selection time.

Of all acquisition functions, ALPS and BERTKM are faster because they do not require passing all the unlabeled data through the model for inference. ENTROPY, LEAST CONFIDENCE and BALD require the same time for selecting data, which is equivalent to the time needed to perform one forward pass of the entire Dpool. Finally, BADGE and BATCHBALD are the most computationally heavy approaches, since both algorithms require multiple computations for the selection time. RANDOM has a total runtime of zero seconds, as expected.
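The simple ranking equations used by ENTROPY, LEAST CONFIDENCE and BALD can be sketched from their standard definitions. The code below is an illustration under assumptions, not the authors' implementation: it takes the class probabilities from T stochastic forward passes (MC dropout) as plain Python lists, and every function name is hypothetical. ENTROPY and LEAST CONFIDENCE are computed on the mean predictive distribution; BALD is the entropy of the mean minus the mean entropy of the individual passes (Houlsby et al., 2011).

```python
import math
from typing import Callable, List, Sequence

Probs = Sequence[float]       # one distribution over classes
McSamples = Sequence[Probs]   # T stochastic passes for one example


def entropy(p: Probs) -> float:
    """Shannon entropy in nats; zero-probability classes contribute 0."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def mean_distribution(samples: McSamples) -> List[float]:
    """Average the T sampled distributions class by class."""
    return [sum(col) / len(samples) for col in zip(*samples)]


def entropy_score(samples: McSamples) -> float:
    return entropy(mean_distribution(samples))


def least_confidence_score(samples: McSamples) -> float:
    return 1.0 - max(mean_distribution(samples))


def bald_score(samples: McSamples) -> float:
    expected_entropy = sum(entropy(p) for p in samples) / len(samples)
    return entropy(mean_distribution(samples)) - expected_entropy


def select_top_k(
    pool: Sequence[McSamples],
    score_fn: Callable[[McSamples], float],
    k: int,
) -> List[int]:
    """Indices of the k pool examples with the highest uncertainty score."""
    scores = [score_fn(samples) for samples in pool]
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

Each score is a per-example computation over outputs that the inference step has already produced, followed by one sort, which is consistent with these functions adding negligible selection time on top of the forward passes.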