
EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

Jason Wei (1,2)    Kai Zou (3)
(1) Protago Labs Research, Tysons Corner, Virginia, USA
(2) Department of Computer Science, Dartmouth College
(3) Department of Mathematics and Statistics, Georgetown University
[email protected]    [email protected]

Abstract

We present EDA: easy data augmentation techniques for boosting performance on text classification tasks. EDA consists of four simple but powerful operations: synonym replacement, random insertion, random swap, and random deletion. On five text classification tasks, we show that EDA improves performance for both convolutional and recurrent neural networks. EDA demonstrates particularly strong results for smaller datasets; on average, across five datasets, training with EDA while using only 50% of the available training set achieved the same accuracy as normal training with all available data. We also performed extensive ablation studies and suggest parameters for practical use.

Operation  Sentence
None       A sad, superior human comedy played out on the back roads of life.
SR         A lamentable, superior human comedy played out on the backward road of life.
RI         A sad, superior human comedy played out on funniness the back roads of life.
RS         A sad, superior human comedy played out on roads back the of life.
RD         A sad, superior human out on the roads of life.

Table 1: Sentences generated using EDA. SR: synonym replacement. RI: random insertion. RS: random swap. RD: random deletion.

1 Introduction

Text classification is a fundamental task in natural language processing (NLP). Machine learning and deep learning have achieved high accuracy on tasks ranging from sentiment analysis (Tang et al., 2015) to topic classification (Tong and Koller, 2002), but high performance often depends on the size and quality of training data, which is often tedious to collect. Automatic data augmentation is commonly used in computer vision (Simard et al., 1998; Szegedy et al., 2014; Krizhevsky et al., 2017) and speech (Cui et al., 2015; Ko et al., 2015) and can help train more robust models, particularly when using smaller datasets. However, because it is challenging to come up with generalized rules for language transformation, universal data augmentation techniques in NLP have not been thoroughly explored.

Previous work has proposed some techniques for data augmentation in NLP. One popular study generated new data by translating sentences into French and back into English (Yu et al., 2018). Other work has used data noising as smoothing (Xie et al., 2017) and predictive language models for synonym replacement (Kobayashi, 2018). Although these techniques are valid, they are not often used in practice because they have a high cost of implementation relative to performance gain.

In this paper, we present a simple set of universal data augmentation techniques for NLP called EDA (easy data augmentation). To the best of our knowledge, we are the first to comprehensively explore text editing techniques for data augmentation. We systematically evaluate EDA on five benchmark classification tasks, showing that EDA provides substantial improvements on all five tasks and is particularly helpful for smaller datasets. Code is publicly available at http://github.com/jasonwei20/eda_nlp.

2 EDA

Frustrated by the measly performance of text classifiers trained on small datasets, we tested a number of augmentation operations loosely inspired by those used in computer vision and found that they helped train more robust models. Here, we present the full details of EDA. For a given sentence in the
training set, we randomly choose and perform one of the following operations:

1. Synonym Replacement (SR): Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
2. Random Insertion (RI): Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
3. Random Swap (RS): Randomly choose two words in the sentence and swap their positions. Do this n times.
4. Random Deletion (RD): Randomly remove each word in the sentence with probability p.

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 6382–6388, Hong Kong, China, November 3–7, 2019. © 2019 Association for Computational Linguistics.

Since long sentences have more words than short ones, they can absorb more noise while maintaining their original class label. To compensate, we vary the number of words changed, n, for SR, RI, and RS based on the sentence length l with the formula n = αl, where α is a parameter that indicates the percent of the words in a sentence that are changed (we use p = α for RD). Furthermore, for each original sentence, we generate n_aug augmented sentences. Examples of augmented sentences are shown in Table 1. We note that synonym replacement has been used previously (Kolomiyets et al., 2011; Zhang et al., 2015; Wang and Yang, 2015), but to our knowledge, random insertions, swaps, and deletions have not been extensively studied.

3 Experimental Setup

We choose five benchmark text classification tasks and two network architectures to evaluate EDA.

3.1 Benchmark Datasets

We conduct experiments on five benchmark text classification tasks: (1) SST-2: Stanford Sentiment Treebank (Socher et al., 2013), (2) CR: customer reviews (Hu and Liu, 2004; Liu et al., 2015), (3) SUBJ: subjectivity/objectivity dataset (Pang and Lee, 2004), (4) TREC: question type dataset (Li and Roth, 2002), and (5) PC: Pro-Con dataset (Ganapathibhotla and Liu, 2008). Summary statistics are shown in Table 5 in Supplementary Materials.

3.2 Text Classification Models

We run experiments for two popular models in text classification. (1) Recurrent neural networks (RNNs) are suitable for sequential data. We use an LSTM-RNN (Liu et al., 2016). (2) Convolutional neural networks (CNNs) have also achieved high performance for text classification. We implement them as described in (Kim, 2014). Details are in Section 9.1 in Supplementary Materials.

4 Results

In this section, we test EDA on five NLP tasks with CNNs and RNNs. For all experiments, we average results from five different random seeds.

4.1 EDA Makes Gains

We run both CNN and RNN models with and without EDA across all five datasets for varying training set sizes. Average performances (%) are shown in Table 2. Of note, average improvement was 0.8% for full datasets and 3.0% for Ntrain = 500.

           Training Set Size
Model      500    2,000   5,000   full set
RNN        75.3   83.7    86.1    87.4
 +EDA      79.1   84.4    87.3    88.3
CNN        78.6   85.6    87.7    88.3
 +EDA      80.7   86.4    88.3    88.8
Average    76.9   84.6    86.9    87.8
 +EDA      79.9   85.4    87.8    88.6

Table 2: Average performances (%) across five text classification tasks for models with and without EDA on different training set sizes.

4.2 Training Set Sizing

Overfitting tends to be more severe when training on smaller datasets. By conducting experiments using a restricted fraction of the available training data, we show that EDA has more significant improvements for smaller training sets. We run both normal training and EDA training for the following training set fractions (%): {1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100}.
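The restricted-fraction setup can be sketched as follows. This is an illustrative helper, not the released pipeline: the function name `subsample` and its rounding behavior are ours, and the actual CNN/RNN training step is omitted.

```python
import random

# Sketch of the training-set sizing experiment: draw a random fraction of
# the (sentence, label) training pairs, then train with and without EDA on
# that subset. Only the subsampling step is shown.
FRACTIONS = [1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # percent of data

def subsample(dataset, percent, seed=0):
    """Return a random `percent`% subset of the dataset (at least one example)."""
    rng = random.Random(seed)
    k = max(1, len(dataset) * percent // 100)
    return rng.sample(dataset, k)
```

Running this once per fraction, for both the normal and EDA-augmented variants, reproduces the shape of the experiment described above.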
Furthermore, we hypothesize that EDA is more helpful for smaller datasets, so we designate the following sized datasets by selecting a random subset of the full training set with Ntrain = {500, 2,000, 5,000, all available data}. Figure 1(a)-(e) shows performance with and without EDA for each dataset, and 1(f) shows the averaged performance across all datasets. The best average accuracy without augmentation, 88.3%, was achieved using 100% of the training data. Models trained using EDA surpassed this number by achieving an average accuracy of 88.6% while only using 50% of the available training data.

Figure 1: Performance on benchmark text classification tasks with and without EDA, for various dataset sizes used for training. Panels: (a) SST-2 (N=7,447), (b) CR (N=4,082), (c) SUBJ (N=9,000), (d) TREC (N=5,452), (e) PC (N=39,418), (f) all datasets; each plots accuracy against percent of dataset used. For reference, the dotted grey line indicates best performances from Kim (2014) for SST-2, CR, SUBJ, and TREC, and Ganapathibhotla (2008) for PC.

4.3 Does EDA conserve true labels?

In data augmentation, input data is altered while class labels are maintained. If sentences are significantly changed, however, then original class labels may no longer be valid. We take a visualization approach to examine whether EDA operations significantly change the meanings of augmented sentences. First, we train an RNN on the pro-con classification task (PC) without augmentation.

Figure 2: Latent space visualization of original and augmented sentences (classes shown: Pro (original), Pro (EDA), Con (original), Con (EDA)).
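For concreteness, the four operations and the n = αl scaling defined in Section 2 can be sketched in Python as follows. This is a minimal sketch, not the released implementation: the SYNONYMS dictionary and STOP_WORDS set are toy stand-ins (the actual code draws synonyms from WordNet and uses a full English stop-word list), and rounding n to at least 1 is our choice.

```python
import random

# Toy stand-ins; the released EDA code uses WordNet synonyms and a full
# English stop-word list. Words absent from SYNONYMS simply have no synonyms.
SYNONYMS = {
    "sad": ["lamentable", "sorrowful"],
    "roads": ["routes"],
    "comedy": ["funniness"],
}
STOP_WORDS = {"a", "the", "of", "on", "out"}

def synonym_replacement(words, n):
    """SR: replace n random non-stop-words with a randomly chosen synonym."""
    new_words = list(words)
    candidates = [w for w in words if w not in STOP_WORDS and w in SYNONYMS]
    random.shuffle(candidates)
    for word in candidates[:n]:
        new_words[new_words.index(word)] = random.choice(SYNONYMS[word])
    return new_words

def random_insertion(words, n):
    """RI: insert a synonym of a random non-stop-word at a random position, n times."""
    new_words = list(words)
    candidates = [w for w in words if w not in STOP_WORDS and w in SYNONYMS]
    for _ in range(n):
        if not candidates:
            break
        synonym = random.choice(SYNONYMS[random.choice(candidates)])
        new_words.insert(random.randrange(len(new_words) + 1), synonym)
    return new_words

def random_swap(words, n):
    """RS: swap the words at two randomly chosen positions, n times."""
    new_words = list(words)
    for _ in range(n):
        i, j = random.randrange(len(new_words)), random.randrange(len(new_words))
        new_words[i], new_words[j] = new_words[j], new_words[i]
    return new_words

def random_deletion(words, p):
    """RD: delete each word independently with probability p (never return empty)."""
    kept = [w for w in words if random.random() >= p]
    return kept if kept else [random.choice(words)]

def eda(sentence, alpha=0.1, n_aug=4):
    """Generate n_aug augmented sentences; n = alpha * l scales edits with length."""
    words = sentence.split()
    n = max(1, round(alpha * len(words)))
    ops = [
        lambda w: synonym_replacement(w, n),
        lambda w: random_insertion(w, n),
        lambda w: random_swap(w, n),
        lambda w: random_deletion(w, alpha),  # p = alpha for RD
    ]
    return [" ".join(random.choice(ops)(words)) for _ in range(n_aug)]
```

Calling `eda(sentence)` then yields n_aug noised copies of a training sentence, one randomly chosen operation per copy, which are added to the training set alongside the original.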