arXiv:2104.08320v1 [cs.CL] 16 Apr 2021

Bayesian Active Learning with Pretrained Language Models

Katerina Margatina, Loic Barrault, Nikolaos Aletras
Computer Science Department, University of Sheffield
{k.margatina,l.barrault,n.aletras}@sheffield.ac.uk

Abstract

Active Learning (AL) is a method to iteratively select data for annotation from a pool of unlabeled data, aiming to achieve better model performance than random selection. Previous AL approaches in Natural Language Processing (NLP) have been limited to either task-specific models that are trained from scratch at each iteration using only the labeled data at hand, or off-the-shelf pretrained language models (LMs) that are not adapted effectively to the downstream task. In this paper, we address these limitations by introducing BALM: Bayesian Active Learning with pretrained language Models. We first propose to adapt the pretrained LM to the downstream task by continuing training with all the available unlabeled data and then use it for AL. We also suggest a simple yet effective fine-tuning method to ensure that the adapted LM is properly trained in both low and high resource scenarios during AL. We finally apply Monte Carlo dropout to the downstream model to obtain well-calibrated confidence scores for data selection with uncertainty sampling. Our experiments in five standard natural language understanding tasks demonstrate that BALM provides substantial data efficiency improvements compared to various combinations of acquisition functions, models and fine-tuning methods proposed in recent AL literature.

1 Introduction

Active Learning (AL) is a method for training supervised models in a data-efficient way (Cohn et al., 1996; Settles, 2009). AL methods iteratively alternate between (i) model training with the labeled data available; and (ii) data selection for annotation using a stopping criterion, e.g. until exhausting a fixed annotation budget or reaching a pre-defined performance on a held-out dataset. Data selection is performed by an acquisition function that ranks unlabeled data points by some informativeness metric aiming to improve over random selection.

AL has been used in NLP for part-of-speech tagging (Engelson and Dagan, 1996), parsing (Tang et al., 2002), sentiment analysis (Li et al., 2012), machine translation (Haffari et al., 2009) and quality estimation (Beck et al., 2013) among others. It is especially useful in scenarios where a large pool of unlabeled data is available but only a limited annotation budget can be afforded; or where expert annotation is prohibitively expensive and time consuming.

Traditional Bayesian AL methods use uncertainty sampling (i.e. informativeness is measured by predictive uncertainty) and typically require probabilistic machine learning models to acquire good uncertainty estimates for the candidate data points. However, current work uses deep learning models that provide large performance gains but not well-calibrated confidence scores (Guo et al., 2017), i.e. predictive softmax probabilities are erroneously interpreted as model confidence (Gal and Ghahramani, 2016). Several approaches have been proposed to calibrate the output probability distribution of deep neural networks, such as temperature scaling (Guo et al., 2017), Monte Carlo dropout (Gal and Ghahramani, 2016) and model ensembles (Lakshminarayanan et al., 2017). Using uncertainty sampling with the vanilla output probabilities for AL may lead to incorrect conclusions, i.e. poor results may be attributed to the acquisition method, while the problem may in fact be the lack of calibration. Still, only a few deep Bayesian AL approaches apply a calibration method to the posterior probabilities (Gal et al., 2017; Shen et al., 2017; Siddhant and Lipton, 2018; Lowell and Lipton, 2019; Ein-Dor et al., 2020).

Furthermore, most current AL approaches in NLP use task-specific neural models that are trained from scratch at each iteration (Shen et al., 2017; Siddhant and Lipton, 2018; Prabhu et al., 2019; Ikhwantri et al., 2018; Kasai et al., 2019). However, task-specific models are usually outperformed by pretrained language models (LMs) adapted to end-tasks (Howard and Ruder, 2018; Devlin et al., 2019), making them suboptimal for AL. Only recently, pretrained LMs such as BERT (Devlin et al., 2019) have been introduced in AL settings (Yuan et al., 2020; Ein-Dor et al., 2020), where they are transferred and used as downstream classification models. Still, they are trained at each AL iteration with a standard fine-tuning approach that mainly includes a pre-defined number of training epochs, which has been demonstrated to be unstable, especially in small datasets (Mosbach et al., 2021; Zhang et al., 2020; Dodge et al., 2020). Since AL includes both low and high data resource settings, the AL model training scheme should be robust in both scenarios.¹

¹ During the first few AL iterations the available labeled data is limited (low-resource), while it could become very large towards the last iterations (high-resource).

To address these limitations, we introduce Bayesian Active Learning with pretrained language Models (BALM). Contrary to previous work (Yuan et al., 2020; Ein-Dor et al., 2020) that also uses BERT (Devlin et al., 2019), our proposed method accounts for the varying data availability settings, the instability of fine-tuning and the poorly calibrated confidence scores for data selection:

1. We propose to continue pretraining the LM with the available unlabeled data to adapt it to the task-specific domain. This way, we leverage not only the available labeled data at each AL iteration, but the entire unlabeled pool;

2. We further propose a simple yet effective fine-tuning method that is robust in both low and high resource data AL settings;

3. We improve data acquisition by providing well-calibrated uncertainty estimates by using Monte Carlo dropout (Gal and Ghahramani, 2016) instead of using the softmax output as confidence scores.

We evaluate BALM on five standard natural language understanding tasks using a full suite of uncertainty-based acquisition functions, and compare against strong baselines that are based on diversity sampling (i.e. BERT K-means clustering), both uncertainty and diversity (e.g. BADGE (Ash et al., 2020) and ALPS (Yuan et al., 2020)), and random sampling. We show that BALM outperforms all combinations of acquisition functions and training methods across all datasets (§5). We also find that our proposed training strategy yields substantial performance improvement when combined with any acquisition function (§6).

2 Background and Related Work

2.1 Problem Formulation

Given a downstream classification task with $C$ classes, a typical pool-based AL setup consists of a pool of unlabeled data $\mathcal{D}_{\text{pool}}$, a model $\mathcal{M}$, a pre-defined annotation budget $b$ of data points and an acquisition function $a(\cdot)$ for selecting $k$ unlabeled data points for annotation (i.e. the acquisition size) until $b$ runs out. A validation set $\mathcal{D}_{\text{val}}$ is used to evaluate $\mathcal{M}$ after each iteration. The goal is to achieve data efficiency by selecting the fewest data points from $\mathcal{D}_{\text{pool}}$ for annotation while achieving the highest performance on the validation set $\mathcal{D}_{\text{val}}$ (Siddhant and Lipton, 2018). The performance of the algorithm is assessed by training a model on the actively acquired dataset and evaluating on a held-out test set $\mathcal{D}_{\text{test}}$.

AL systems are first initialized and subsequently loop over Model Training, Data Acquisition and Data Annotation steps for $T$ iterations, or until a pre-defined performance on $\mathcal{D}_{\text{val}}$ is reached.

2.2 Active Learning Initialization

To initialize AL, the total number of AL iterations can be simply calculated by $T = \frac{b}{k}$, where $b$ is the budget and $k$ the acquisition size.² Then, a data initialization policy selects the first $k$ data points from $\mathcal{D}_{\text{pool}}$ to be annotated and updates the labeled dataset $\mathcal{D}_{\text{lab}}$. The most common approach to select the first batch of data for annotation is stratified random sampling (Gal et al., 2017).

² If the budget $b$ is a percentage of the number of unlabeled data points then $T = \frac{b\,|\mathcal{D}_{\text{pool}}|}{k}$.

2.3 Model Training

In the first step of the AL loop, a model $\mathcal{M}_i$ is trained with the available labeled data $\mathcal{D}_{\text{lab}}$ at iteration $i$. If $\mathcal{M}_i$ is a task-specific architecture (Shen et al., 2017; Siddhant and Lipton, 2018; Prabhu et al., 2019), it is simply trained from scratch on $\mathcal{D}_{\text{lab}}$ until convergence. If $\mathcal{M}_i$ is based on a pretrained LM (Yuan et al., 2020; Ein-Dor et al., 2020), then it is initialized with the pretrained weights and fine-tuned to the task on $\mathcal{D}_{\text{lab}}$ by adding a task-specific output classification layer and updating all model parameters until convergence. Note that at each iteration $i$ the parameters of $\mathcal{M}_i$ are re-initialized: randomly if $\mathcal{M}_i$ is a task-specific model trained from scratch, or from the original pretrained LM otherwise. Warm-starting the model (i.e. initializing $\mathcal{M}_i$ with the parameters of $\mathcal{M}_{i-1}$) has been shown to hinder the model's generalization ability (Ash and Adams, 2020). The AL loop stops if the performance of $\mathcal{M}_i$ on $\mathcal{D}_{\text{val}}$ is equal to or higher than the goal.

2.4 Data Acquisition

In this step, we use the acquisition function $a$ to select the $k$ most informative unlabeled samples $Q_i = a(\mathcal{M}_i, k, \mathcal{D}_{\text{pool}})$.

[...] based acquisition function can be either cold-start or warm-start. There are also hybrid approaches that aim to select data based on both uncertainty and diversity sampling (He et al., 2014; Yang et al., 2015; Erdmann et al., 2019; Yuan et al., 2020; Ash et al., 2020), and other methods that use reinforcement learning (Fang et al., 2017; Liu et al., 2018). In our work, we use acquisition functions based on uncertainty sampling (§3.3), but any acquisition function that takes as input the unlabeled data, the acquisition size and, if applied, the model, and outputs a batch of k data points could be used, [...]
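The pool-based setup of Sections 2.1 to 2.4 can be illustrated with a minimal sketch. It is not the authors' implementation. The following are assumptions made for illustration: all function and class names, the toy 1-D centroid classifier standing in for the model, the choice of predictive entropy as the uncertainty score, the initial batch counting toward the budget b, and the simulated oracle that reveals labels. The sketch only shows how T = b / k, stratified initialization, cold-start training and an acquisition function with the interface a(M_i, k, D_pool) fit together.

```python
import math
import random
from collections import defaultdict


def num_iterations(b, k):
    """T = b / k for budget b and acquisition size k (assumes k divides b)."""
    return b // k


def stratified_init(labels, k, rng):
    """Pick k indices whose class proportions roughly match the pool.
    Reading pool labels up front is only possible in simulated AL."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    chosen = []
    for idxs in by_class.values():
        chosen.extend(rng.sample(idxs, k * len(idxs) // len(labels)))
    taken = set(chosen)
    rest = [i for i in range(len(labels)) if i not in taken]
    return chosen + rng.sample(rest, k - len(chosen))  # top up after flooring


class CentroidModel:
    """Hypothetical toy stand-in for M_i: one centroid per class on 1-D inputs."""

    def __init__(self, xs, ys):
        sums, counts = defaultdict(float), defaultdict(int)
        for x, y in zip(xs, ys):
            sums[y] += x
            counts[y] += 1
        self.centroids = {c: sums[c] / counts[c] for c in counts}

    def predict_proba(self, x):
        # Softmax over negative squared distances to the class centroids.
        scores = {c: -(x - m) ** 2 for c, m in self.centroids.items()}
        top = max(scores.values())
        exps = {c: math.exp(s - top) for c, s in scores.items()}
        total = sum(exps.values())
        return {c: e / total for c, e in exps.items()}

    def predict(self, x):
        probs = self.predict_proba(x)
        return max(probs, key=probs.get)


def entropy_acquisition(model, k, unlabeled):
    """a(M_i, k, D_pool): the k unlabeled points with highest predictive entropy."""
    def entropy(x):
        return -sum(p * math.log(p) for p in model.predict_proba(x).values() if p > 0)
    return sorted(unlabeled, key=lambda i: entropy(unlabeled[i]), reverse=True)[:k]


def active_learning(pool_x, oracle, val, b, k, acquire, seed=0):
    rng = random.Random(seed)
    unlabeled = dict(enumerate(pool_x))      # D_pool: index -> input
    labeled = {}                             # D_lab: index -> label
    query = stratified_init(oracle, k, rng)  # initialization batch
    history = []
    for _ in range(num_iterations(b, k)):
        for i in query:                      # Data Annotation (simulated oracle)
            labeled[i] = oracle[i]
            del unlabeled[i]
        # Model Training: rebuilt from scratch each iteration (no warm start).
        model = CentroidModel([pool_x[i] for i in labeled], list(labeled.values()))
        acc = sum(model.predict(x) == y for x, y in val) / len(val)
        history.append((len(labeled), acc))  # (|D_lab|, accuracy on D_val)
        query = acquire(model, k, unlabeled)  # Data Acquisition
    return history


# Toy data: 40 evenly spaced 1-D points, labeled by sign.
pool_x = [(i - 19.5) / 20 for i in range(40)]
oracle = [int(x > 0) for x in pool_x]
val = [(-0.75, 0), (-0.25, 0), (0.25, 1), (0.75, 1)]
history = active_learning(pool_x, oracle, val, b=20, k=4, acquire=entropy_acquisition)
```

Because `acquire` only needs to accept the model, the acquisition size and the unlabeled pool and return k indices, a random or diversity-based selector could be passed in its place without changing the loop.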
