Arxiv:2009.07502V2 [Cs.CL] 15 Mar 2021
Total Page:16
File Type:pdf, Size:1020Kb
Contextualized Perturbation for Textual Adversarial Attack Dianqi Li♠ Yizhe Zhang} Hao Peng♠ Liqun Chen| Chris Brockett} Ming-Ting Sun♠ Bill Dolan} ♠University of Washington }Microsoft Research |Duke University {dianqili, mts}@uw.edu, [email protected], [email protected] {Yizhe.Zhang, Chris.Brockett, billdol}@microsoft.com Abstract Original Text Super ant colony hits CLARE: Adversarial examples expose the vulnerabil- Australia . Contextualized ities of natural language processing (NLP) Perturbation A giant 100km models, and can be used to evaluate and im- colony of ants could threaten prove their robustness. Existing techniques local insect species. of generating such examples are typically driven by local heuristic rules that are ag- Adversarial Text nostic to the context, often resulting in un- natural and ungrammatical outputs. This Super ant colony hits paper presents CLARE, a ContextuaLized Australia {Coast, , }. AdversaRial Example generation model that A {gigantic, , } 100km produces fluent and grammatical outputs colony of ants could threaten through a mask-then-infill procedure. CLARE {insect, , } species. builds on a pre-trained masked language model and modifies the inputs in a context- Figure 1: Illustration of CLARE. Through a mask-then- aware manner. We propose three contextual- infill procedure, the model generates the adversarial ized perturbations, Replace, Insert and Merge, text with three contextualized perturbations: Replace, that allow for generating outputs of varied Insert and Merge. A mask is indicated by “ ”. The lengths. CLARE can flexibly combine these degree of fade corresponds to the (decreasing) priority perturbations and apply them at any position of the infill tokens. in the inputs, and is thus able to attack the victim model more effectively with fewer ed- its. Extensive experiments and human evalua- Liang et al., 2019; Alzantot et al., 2018; Ren et al., tion demonstrate that CLARE outperforms the 2019; Jin et al., 2020, inter alia). Despite some baselines in terms of attack success rate, tex- empirical success, rule-based methods are agnostic tual similarity, fluency and grammaticality. to context, limiting their ability to produce natu- 1 Introduction ral, fluent, and grammatical outputs (Wang et al., 2019b; Kurita et al., 2020, inter alia). Adversarial example generation for natural lan- This work presents CLARE, a ContextuaLized guage processing (NLP) tasks aims to perturb input AdversaRial Example generation model for text. arXiv:2009.07502v2 [cs.CL] 15 Mar 2021 text to trigger errors in machine learning models, CLARE perturbs the input with a mask-then-infill while keeping the output close to the original. Be- procedure: it first detects the vulnerabilities of sides exposing system vulnerabilities and helping a model and deploys masks to the inputs to in- improve their robustness and security (Zhao et al., dicate missing text, then plugs in an alternative 2018; Wallace et al., 2019; Cheng et al., 2019; Jia using a pretrained masked language model (e.g., et al., 2019, inter alia), adversarial examples are RoBERTa; Liu et al., 2019). CLARE features three also used to analyze and interpret the models’ deci- contextualized perturbations: Replace, Insert and sions (Jia and Liang, 2017; Ribeiro et al., 2018). Merge, which respectively replace a token, insert Generating adversarial examples for NLP tasks a new one, and merge a bigram (Figure1). As can be challenging, in part due to the discrete nature a result, it can generate outputs of varied lengths, of natural language text. Most recent efforts have in contrast to token replacement based methods explored heuristic rules, such as replacing tokens that are limited to outputs of the same lengths as with their synonyms (Samanta and Mehta, 2017; the inputs (Alzantot et al., 2018; Ren et al., 2019; Jin et al., 2020). Further, CLARE searches over a tween x0 and x to be larger than a threshold: wider range of attack strategies, and is thus able sim(x0; x) > `. A common choice of sim(·; ·) to attack the victim model more effectively with is to encode sentences using neural networks, and fewer edits. Building on a masked language model, calculate their cosine similarity in the embedding CLARE maximally preserves textual similarity, flu- space (Jin et al., 2020). ency, and grammaticality of the outputs. 2.1 Masking and Contextualized Infilling We evaluate CLARE on text classification, nat- ural language inference, and sentence paraphrase At a given position of the input sequence, CLARE tasks, by attacking finetuned BERT models (De- can execute three perturbation actions: Replace, vlin et al., 2019). Extensive experiments and hu- Insert, and Merge, which we introduce in this sec- man evaluation results show that CLARE outper- tion. These apply masks at the given position with forms baselines in terms of attack success rate, tex- different strategies, and then fill in the missing text tual similarity, fluency, and grammaticality, and based on the unmasked context. strikes a better balance between attack success Replace: A Replace action substitutes the token rate and preserving input-output similarity. Our at a given position i with an alternative (e.g., chang- analysis further suggests that the CLARE can ing “fantastic” to “amazing” in “The movie is fan- be used to improve the robustness of the down- tastic.”). It first replaces xi with a mask, and then stream models, and improve their accuracy when selects a token z from a candidate set Z to fill in: the available training data is limited. We release our code and models at https://github.com/ xe = x1 : : : xi−1 [MASK] xi+1 : : : xn; cookielee77/CLARE . replace (x; i) = x1 : : : xi−1 z xi+1 : : : xn: 2 CLARE For clarity, we denote replace (x; i) by xez. To produce an adversarial example, At a high level, CLARE applies a sequence of con- • z should fit into the unmasked context; textualized perturbation actions to the input. Each • xez should be similar to x; can be seen as a local mask-then-infill procedure: it • xez should trigger an error in f. first applies a mask to the input around a given po- These can be achieved by selecting a z such that sition, and then fills it in using a pretrained masked • z receives a high probability from a masked language model (§2.1). To produce the output, language model: pMLM(z j xe) > k; CLARE scores and descendingly ranks the actions, • xez is similar to x: sim(x; xez) > `; which are then iteratively applied to the input (§2.2). • f predicts low probability for the gold label We begin with a brief background review and lay- given xez, i.e., pf (y j xez) is small. ing out of necessary notation. pMLM denotes a pretrained masked language model (e.g., RoBERTa; Liu et al., 2019). Using higher k, Background. Adversarial example generation ` thresholds produces outputs that are more fluent centers around a victim model f, which we as- and closer to the original. However, this can under- sume is a text classifier. We focus on the black- mine the success rate of the attack. We choose k, ` box setting, allowing access to f’s outputs but not to trade-off between these two aspects. 2 its configurations such as parameters. Given an The first two requirements can be met by the input sequence x = x x : : : x and its label y 1 2 n construction of the candidate set: Z = (assume f(x) = y), an adversarial example x0 is supposed to modify x to trigger an error in the 0 0 z 2 V j pMLM(z j xe) > k; sim(x; xez0 ) > ` : victim model: f(x0) 6= f(x). At the same time, textual modifications should be minimal, such that V is the vocabulary of the masked language model. x0 is close to x and the human predictions on x0 To meet the third, we select from Z the token that, stay the same.1 if filled in, will cause most “confusion” to f: This is achieved by requiring the similarity be- z = arg min pf (y j xez0 ): (1) z02Z 1In computer vision applications, minor perturbations to continuous pixels can be barely perceptible to humans, thus it 2k and ` are empirically set as 5 × 10−3 and 0:7, respec- can be hard for one to distinguish x and x0 (Goodfellow et al., tively. This also reduces the computation overhead: in our 2015). It is not the case for text, however, since changes to the experiments jZj is 42 on average, much smaller than the vo- discrete tokens are more likely to be noticed by humans. cabulary size (jVj = 50; 265). The Insert and Merge actions differ from Re- Algorithm 1 Adversarial Attack by CLARE place in terms of masking strategies. The alter- 1: Input: Text-label pair (x; y); Victim model f native token z is selected analogously to that in a 2: Output: An adversarial example Replace action. 3: Initialization: x(0) = x 4: A Insert: This aims to add extra information to the ? 5: for 1 ≤ i ≤ jxj do input (e.g., changing “I recommend ...” to “I highly 6: a highest-scoring action from f recommend ...”). It inserts a mask after x and then i replace(x; i), insert(x; i); merge(x; i)g fills it. Slightly overloading the notations, 7: A A Sfag 8: end for x = x1 : : : xi [MASK] xi+1 : : : xn; e 9: for 1 ≤ t ≤ T do insert (x; i) = x1 : : : xi z xi+1 : : : xn: 10: a highest-scoring action from A 11: A A n fag This increases the sequence length by 1. 12: x(t) Apply a on x(t−1) 13: if f(x(t))6=y then return x(t) Merge: This masks out a bigram xixi+1 with a single mask and then fills it, reducing the sequence 14: end if length by 1: 15: end for 16: return NONE xe = x1 : : : xi−1 [MASK] xi+2 : : : xn; merge (x; i) = x1 : : : xi−1 z xi+2 : : : xn: minimizing the probability of outputting the gold label y from f.