Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment
The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Di Jin,1* Zhijing Jin,2* Joey Tianyi Zhou,3 Peter Szolovits1
1Computer Science & Artificial Intelligence Laboratory, MIT   2University of Hong Kong   3A*STAR, Singapore
{jindi15, psz}@mit.edu, [email protected], [email protected]

Abstract

Machine learning algorithms are often vulnerable to adversarial examples that have imperceptible alterations from the original counterparts but can fool the state-of-the-art models. It is helpful to evaluate or even improve the robustness of these models by exposing the maliciously crafted adversarial examples. In this paper, we present TEXTFOOLER, a simple but strong baseline to generate adversarial text. By applying it to two fundamental natural language tasks, text classification and textual entailment, we successfully attacked three target models, including the powerful pre-trained BERT, and the widely used convolutional and recurrent neural networks. We demonstrate three advantages of this framework: (1) effective—it outperforms previous attacks by success rate and perturbation rate, (2) utility-preserving—it preserves semantic content, grammaticality, and correct types classified by humans, and (3) efficient—it generates adversarial text with computational complexity linear to the text length. [1]

* Equal Contribution. Order determined by swapping that in the previous paper at https://arxiv.org/abs/1901.11333.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
[1] The code, pre-trained target models, and test examples are available at https://github.com/jind11/TextFooler.

Introduction

In the last decade, Machine Learning (ML) models have achieved remarkable success in various tasks such as classification, regression, and decision making. However, recently they have been found vulnerable to adversarial examples, that is, legitimate inputs altered by small and often imperceptible perturbations (Kurakin, Goodfellow, and Bengio 2016a; 2016b; Papernot et al. 2017; Zhao, Dua, and Singh 2017). These carefully curated examples are correctly classified by a human observer but can fool a target model, raising serious concerns regarding the security and integrity of existing ML algorithms. On the other hand, it has been shown that the robustness and generalization of ML models can be improved by crafting high-quality adversaries and including them in the training data (Goodfellow, Shlens, and Szegedy 2015).

While existing works on adversarial examples have obtained success in the image and speech domains (Szegedy et al. 2013; Carlini and Wagner 2018), it is still challenging to deal with text data due to its discrete nature. Formally, besides the ability to fool the target models, outputs of a natural language attacking system should also meet three key utility-preserving properties: (1) human prediction consistency—prediction by humans should remain unchanged, (2) semantic similarity—the crafted example should bear the same meaning as the source, as judged by humans, and (3) language fluency—generated examples should look natural and grammatical. Previous works barely conform to all three requirements. For example, methods such as word misspelling (Li et al. 2018; Gao et al. 2018), single-word erasure (Li, Monroe, and Jurafsky 2016), and phrase insertion and removal (Liang et al. 2017) result in unnatural sentences. Moreover, there is almost no work that attacks the newly-risen BERT model on text classification.

In this work, we present TEXTFOOLER, a simple but strong baseline for natural language attack in the black-box setting, a common case where no model architecture or parameters are accessible. We design a more comprehensive paradigm to create both semantically and syntactically similar adversarial examples that meet the aforementioned three desiderata. Specifically, we first identify the important words for the target model and then prioritize replacing them with the most semantically similar and grammatically correct words until the prediction is altered. We successfully applied this framework to attack three state-of-the-art models on five text classification tasks and two textual entailment tasks, respectively.
On the adversarial examples, we can reduce the accuracy of almost all target models in all tasks to below 10% with less than 20% of the original words perturbed. In addition, we validate that the generated examples are (1) correctly classified by human evaluators, (2) semantically similar to the original text, and (3) grammatically acceptable by human judges.

Our main contributions are summarized as follows:
• We propose a simple but strong baseline, TEXTFOOLER, to quickly generate high-profile utility-preserving adversarial examples that force the target models to make wrong predictions under the black-box setting.
• We evaluate TEXTFOOLER on three state-of-the-art deep learning models over five popular text classification tasks and two textual entailment tasks, and it achieved the state-of-the-art attack success rate and perturbation rate.
• We propose a comprehensive four-way automatic and three-way human evaluation of language adversarial attacks to evaluate the effectiveness, efficiency, and utility-preserving properties of our system.
• We open-source the code, pre-trained target models, and test samples for the convenience of future benchmarking.

Method

Problem Formulation

Given a corpus of N sentences X = {X1, X2, ..., XN} and a corresponding set of N labels Y = {Y1, Y2, ..., YN}, we have a pre-trained model F: X → Y, which maps the input text space X to the label space Y.
For a sentence X ∈ X, a valid adversarial example Xadv should conform to the following requirements:

F(Xadv) ≠ F(X), and Sim(Xadv, X) ≥ ε,    (1)

where Sim: X × X → (0, 1) is a similarity function and ε is the minimum similarity between the original and adversarial examples. In the natural language domain, Sim is often a semantic and syntactic similarity function.
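To make the requirement in Eq. (1) concrete, the check below is a minimal sketch in Python. It assumes the target model F, the similarity function Sim, and the threshold ε are available as black-box callables; the function name and signatures are ours for illustration and are not part of the paper.

```python
from typing import Callable, Sequence

# Minimal sketch of the validity test in Eq. (1).
# F maps a token sequence to a predicted label; sim maps two token
# sequences to a similarity score in (0, 1). Both are assumed to be
# black-box callables supplied by the attacker.
def is_valid_adversary(
    F: Callable[[Sequence[str]], int],
    sim: Callable[[Sequence[str], Sequence[str]], float],
    x: Sequence[str],
    x_adv: Sequence[str],
    epsilon: float,
) -> bool:
    # Eq. (1): the prediction must flip and the similarity between the
    # adversarial and original sentences must stay above epsilon.
    return F(x_adv) != F(x) and sim(x_adv, x) >= epsilon
```

Such a check only requires query access to F, which is consistent with the black-box threat model described next.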
Threat Model

Under the black-box setting, the attacker is not aware of the model architecture, parameters, or training data. It can only query the target model with supplied inputs, getting as results the predictions and corresponding confidence scores.

The proposed approach for adversarial text generation is shown in Algorithm 1 and consists of two main steps.

Algorithm 1 Adversarial Attack by TEXTFOOLER
Input: Sentence example X = {w1, w2, ..., wn}, the corresponding ground truth label Y, target model F, sentence similarity function Sim, sentence similarity threshold ε, word embeddings Emb over the vocabulary Vocab.
Output: Adversarial example Xadv
 1: Initialization: Xadv ← X
 2: for each word wi in X do
 3:    Compute the importance score Iwi via Eq. (2)
 4: end for
 5:
 6: Create a set W of all words wi ∈ X sorted by the descending order of their importance score Iwi
 7: Filter out the stop words in W
 8: for each word wj in W do
 9:    Initiate the set of candidates CANDIDATES by extracting the top N synonyms using CosSim(Emb_wj, Emb_word) for each word in Vocab
10:    CANDIDATES ← POSFilter(CANDIDATES)
11:    FINCANDIDATES ← {}
12:    for ck in CANDIDATES do
13:       X' ← Replace wj with ck in Xadv
14:       if Sim(X', Xadv) > ε then
15:          Add ck to the set FINCANDIDATES
16:          Yk ← F(X')
17:          Pk ← F_Yk(X')
18:       end if
19:    end for
20:    if there exists ck whose prediction result Yk ≠ Y then
21:       In FINCANDIDATES, only keep the candidates ck whose prediction result Yk ≠ Y
22:       c* ← argmax_{c ∈ FINCANDIDATES} Sim(X, X'_{wj→c})
23:       Xadv ← Replace wj with c* in Xadv
24:       return Xadv
25:    else if P_Yk(Xadv) > min_{ck ∈ FINCANDIDATES} Pk then
26:       c* ← argmin_{ck ∈ FINCANDIDATES} Pk
27:       Xadv ← Replace wj with c* in Xadv
28:    end if
29: end for
30: return None

Step 1: Word Importance Ranking (line 1–6) Given a sentence of n words X = {w1, w2, ..., wn}, we observe that only some key words act as influential signals for the prediction model F, echoing the discovery of Niven and Kao (2019) that BERT attends to the statistical cues of some words. Therefore, we create a selection mechanism to choose the words that most significantly influence the final prediction results. Using this selection process, we minimize the alterations and thus maintain the semantic similarity as much as possible.

Note that the selection of important words is trivial in a white-box scenario, as it can be easily solved by inspecting the gradients of the model F, while most other words are irrelevant. However, under the more common black-box setup in our paper, the model gradients are unavailable. Therefore, we create a selection mechanism as follows. We use the score Iwi to measure the influence of a word wi ∈ X towards the classification result F(X) = Y. We denote the sentence after deleting the word wi as X\wi = X \ {wi} = {w1, ..., wi−1, wi+1, ..., wn}, and use FY(·) to represent the prediction score for the Y label.

The importance score Iwi is therefore calculated as the prediction change before and after deleting the word wi, which is formally defined as follows:

Iwi = FY(X) − FY(X\wi),  if F(X) = F(X\wi) = Y;
Iwi = (FY(X) − FY(X\wi)) + (FȲ(X\wi) − FȲ(X)),  if F(X) = Y, F(X\wi) = Ȳ, and Y ≠ Ȳ.    (2)

After ranking the words by their importance score, we further filter out stop words derived from the NLTK (https://www.nltk.org/) and spaCy (https://spacy.io/) libraries, such as "the", "when", and "none". This simple filtering step is important to avoid grammar destruction.

Step 2: Word Transformer (line 7–30) For a given word
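Read end to end, Algorithm 1 (the importance ranking of Step 1 and the replacement loop of lines 8–30) can be rendered as the minimal Python sketch below. This is our own condensed reading under stated assumptions, not the authors' released implementation: `predict` stands for black-box query access returning a confidence score per label, and `sim`, `synonyms`, and `pos_ok` are illustrative placeholders for the sentence similarity function, the embedding-based synonym extraction, and the POS filter used by the algorithm.

```python
from typing import Callable, Dict, List, Optional, Sequence, Set

# Black-box model interface: token sequence -> {label: confidence score}.
# We assume it returns a score for every label, as in Algorithm 1.
Predictor = Callable[[Sequence[str]], Dict[int, float]]


def word_importance(x: List[str], y: int, predict: Predictor, i: int) -> float:
    """Deletion-based importance score of word x[i], following Eq. (2).

    Assumes the model classifies the original sentence correctly, i.e. F(X) = Y.
    """
    scores = predict(x)
    scores_del = predict(x[:i] + x[i + 1:])        # sentence with w_i removed
    y_del = max(scores_del, key=scores_del.get)    # predicted label after deletion
    score = scores[y] - scores_del[y]              # drop in confidence for Y
    if y_del != y:                                 # prediction flipped to Y-bar
        score += scores_del[y_del] - scores.get(y_del, 0.0)
    return score


def textfooler_sketch(
    x: List[str],
    y: int,
    predict: Predictor,
    sim: Callable[[Sequence[str], Sequence[str]], float],
    synonyms: Callable[[str], List[str]],   # top-N nearest words in embedding space
    pos_ok: Callable[[str, str], bool],     # part-of-speech consistency filter
    stop_words: Set[str],
    epsilon: float,
) -> Optional[List[str]]:
    """Condensed rendering of Algorithm 1; returns None if the attack fails (line 30)."""
    x_adv = list(x)
    # Lines 2-7: rank non-stop-words by their importance score.
    order = sorted(
        (i for i, w in enumerate(x) if w.lower() not in stop_words),
        key=lambda i: word_importance(x, y, predict, i),
        reverse=True,
    )
    for j in order:
        # Lines 9-19: collect candidates that pass the POS and similarity checks.
        finalists = []
        for c in synonyms(x[j]):
            if not pos_ok(x[j], c):
                continue
            trial = x_adv[:j] + [c] + x_adv[j + 1:]
            if sim(trial, x_adv) > epsilon:
                scores = predict(trial)
                label = max(scores, key=scores.get)
                finalists.append((label, scores[y], trial))
        if not finalists:
            continue
        # Lines 20-24: if some candidate already flips the label, return the
        # flipping candidate most similar to the original sentence.
        flips = [f for f in finalists if f[0] != y]
        if flips:
            return max(flips, key=lambda f: sim(x, f[2]))[2]
        # Lines 25-27: otherwise keep the candidate that lowers the model's
        # confidence in the true label the most, and move to the next word.
        best = min(finalists, key=lambda f: f[1])
        if best[1] < predict(x_adv)[y]:
            x_adv = best[2]
    return None
```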