Arxiv:2003.02245V2 [Cs.CL] 31 Jan 2021 Plored for Data Augmentation

Data Augmentation Using Pre-trained Transformer Models Varun Kumar Ashutosh Choudhary Eunah Cho Alexa AI Alexa AI Alexa AI [email protected] [email protected] [email protected] Abstract ment in this example) would negatively impact the performance of the resulting model. Language model based pre-trained models To alleviate this issue, Wu et al.(2019) pro- such as BERT have provided significant gains posed conditional BERT (CBERT) model which across different NLP tasks. In this paper, we study different types of transformer based pre- extends BERT (Devlin et al., 2018) masked lan- trained models such as auto-regressive models guage modeling (MLM) task, by considering class (GPT-2), auto-encoder models (BERT), and labels to predict the masked tokens. Since their seq2seq models (BART) for conditional data method relies on modifying BERT model’s seg- augmentation. We show that prepending the ment embedding, it cannot be generalized to other class labels to text sequences provides a simple pre-trained LMs without segment embeddings. yet effective way to condition the pre-trained Similarly, Anaby-Tavor et al.(2019) used models for data augmentation. Additionally, GPT2 (Radford et al., 2019) for DA where exam- on three classification benchmarks, pre-trained Seq2Seq model outperforms other data aug- ples are generated for a given class by providing mentation methods in a low-resource setting. class as input to a fine-tuned model. In their work, Further, we explore how different data aug- GPT2 is used to generate 10 times the number of ex- mentation methods using pre-trained model amples required for augmentation and then the gen- differ in-terms of data diversity, and how well erated sentences are selected based on the model such methods preserve the class-label informa- confidence score. As data selection is applied only tion. to GPT2 but not to the other models, the augmen- 1 Introduction tation methods can not be fairly compared. Due to such discrepancies, it is not straightforward to Data augmentation (DA) is a widely used technique comprehend how the generated data using different to increase the size of the training data. Increas- pre-trained models varies from each other and their ing training data size is often essential to reduce impact on downstream model performance. overfitting and enhance the robustness of machine This paper proposes a unified approach to use learning models in low-data regime tasks. any pre-trained transformer (Vaswani et al., 2017) In natural language processing (NLP), several based models for data augmentation. In particu- word replacement based methods have been ex- lar, we explore three different pre-trained model arXiv:2003.02245v2 [cs.CL] 31 Jan 2021 plored for data augmentation. In particular, Wei types for DA, including 1) an autoencoder (AE) and Zou(2019) showed that simple word replace- LM: BERT, 2) an auto-regressive (AR) LM: GPT2, ment using knowledge bases like WordNet (Miller, and 3) a pre-trained seq2seq model: BART (Lewis 1998) improves classification performance. Further, et al., 2019). We apply the data generation for three Kobayashi(2018) utilized language models (LM) different NLP tasks: sentiment classification, intent to augment training data. However, such methods classification, and question classification. struggle with preserving class labels. For example, In order to understand the significance of DA, non-conditional DA for an input sentence of senti- we simulate a low-resource data scenario, where ment classification task “a small impact with a big we utilize only 10 training examples per class in a movie” leads to “a small movie with a big impact”. classification task. Section 3.2 provides details of Using such augmented data for training, with the the task and corpora. original input sentence’s label (i.e. negative senti- We show that all three types of pre-trained models can be effectively used for DA, and using the example in Dtrain. Thus, the augmented data is generated data leads to improvement in classifi- same size as the size of the original data. cation performance in the low-data regime setting. Among three types of methods, pre-trained seq2seq 2.1 Conditional DA using Pre-trained LM model provides the best performance. Our code is For conditional DA, a model G incorporates label available at 1. information during fine-tuning for data generation. Our contribution is three-fold: (1) implementa- Wu et al.(2019) proposed CBERT model where tion of a seq2seq pre-trained model based data aug- they utilized BERT’s segment embeddings to con- mentation, (2) experimental comparison of differ- dition model on the labels. Similarly, models can ent data augmentation methods using conditional be conditioned on labels by prepending labels yi to pre-trained model, (3) a unified data augmentation xi (Keskar et al., 2019; Johnson et al., 2017). approach with practical guidelines for using differ- Due to segment embedding reuse, CBERT con- ent types of pre-trained models. ditioning is very specific to BERT architecture thus cannot be applied directly to other pre-trained LMs. 2 DA using Pre-trained Models Thus, we compare two generic ways to condition a pre-trained model on class label: LM pre-training has been studied extensively (Rad- ford et al., 2018; Devlin et al., 2018; Liu et al., • prepend : prepending label yi to each se- 2019). During pre-training, such models are either quence xi in the training data without adding trained in an AE setting or in an AR setting. In yi to model vocabulary the AE setting, certain tokens are masked in the sentence and the model predicts those tokens. In • expand : prepending label yi to each se- an AR setting, the model predicts the next word quence xi in the training data and adding yi given a context. Recently, pre-training for seq2seq to model vocabulary. model has been explored where a seq2seq model is Note that in prepend, the model may split y trained for denoising AE tasks (Lewis et al., 2019; i into multiple subword units (Sennrich et al., 2015b; Raffel et al., 2019). Here, we explore how these Kudo and Richardson, 2018), expand treats a la- models can be used for DA to potentially improve bel as a single token. text classification accuracy. Here, we discuss the fine-tuning and the data generation process for both AE and AR LMs. For Algorithm 1: Data Augmentation ap- transformer based LM implementation, we use Py- proach torch based transformer package (Wolf et al., 2019). Input :Training Dataset Dtrain For all pre-trained models, during fine-tuning, we Pretrained model G2fAE;AR;Seq2Seqg further train the learnable parameters of G using 1 Fine-tune G using Dtrain to obtain Gtuned its default task and loss function. 2 Dsynthetic fg 3 foreach fxi;yig2Dtrain do 2.1.1 Fine-tuning and generation using AE 1 4 Synthesize s examples fxî;yîgp using LMs Gtuned We choose BERT as a representative of AE models. 1 5 Dsynthetic Dsynthetic[fxî;yîgp For fine-tuning, we use the default masking param- 6 end eters and MLM objective which randomly masks some of the tokens from the raw sequence, and the objective is to predict the original token of the DA Problem formulation: Given a training masked words using the context. Both BERT dataset D = fx ; y g1 , where x = fw g1 prepend train i i n i j m and BERT models are fine-tuned using the is a sequence of m words, y is the associated label, expand i same objective. and a pre-trained model G, we want to generate a dataset of Dsynthetic. Algorithm1 describes the 2.1.2 Fine-tuning and generation using AR data generation process. For all augmentation meth- LMs ods, we generate s = 1 synthetic example for every For AR LM experiments, we choose GPT2 as a gen- 1https://github.com/varinf/ erator model and follow the method proposed by TransformersDataAugmentation Anaby-Tavor et al.(2019) to fine-tune and generate data. For fine-tuning GPT2, we create a training 2.3 Pre-trained Model Implementation dataset by concatenating all sequences in Dtrain 2.3.1 BERT based DA models as follows: y SEP x EOSy :::y SEP x EOS. 1 1 2 n n For AutoEncoder (AE) experiments, we use “bert- SEP denotes a separation token between label and base-uncased” model with the default parameters sentence, and EOS denotes the end of a sentence. provided in huggingface’s transformer package. In For generating data, we provide y SEP as a i prepend setting we train model for 10 epochs prompt to G, and we keep generating until the and select the best performing model on dev data model produces EOS token. We use GPT2 to re- partition keeping initial learning rate at 4e−5. For fer to this model. We found that such generation expand setting, training requires 150 epochs to struggles in preserving the label information, and converge. Moreover, a higher learning rate of a simple way to improve the generated data label 1:5e−4 was used for all three datasets. The initial quality is to provide an additional context to G. For- learning rate was adjusted for faster convergence. mally, we provide y SEP w ::w as prompt where i 1 k This is needed for expand setting as embeddings w ::w are the first k words of a sequence x . In 1 k i for labels are randomly initialized. this work, we use k = 3. We call this method GPT2context. 2.3.2 GPT2 model implementation For GPT2 experiments, we use GPT2-Small model 2.2 Conditional DA using Pre-trained provides in huggingface’s transformer package. We Seq2Seq model use default training parameters to fine-tune the Like pre-trained LM models, pre-training seq2seq GPT2 model.

Load more