Word-level Textual Adversarial Attacking as Combinatorial Optimization
Yuan Zang1*, Fanchao Qi1*, Chenghao Yang2*†, Zhiyuan Liu1‡, Meng Zhang3, Qun Liu3, Maosong Sun1

1 Department of Computer Science and Technology, Tsinghua University; Institute for Artificial Intelligence, Tsinghua University; Beijing National Research Center for Information Science and Technology
2 Columbia University
3 Huawei Noah's Ark Lab

* Indicates equal contribution. Yuan developed the method and designed and conducted most experiments; Fanchao formalized the task, designed some experiments and wrote the paper; Chenghao made the original research proposal, performed human evaluation and conducted some experiments.
† Work done during internship at Tsinghua University.
‡ Corresponding author.

Abstract

Adversarial attacks are carried out to reveal the vulnerability of deep neural networks. Textual adversarial attacking is challenging because text is discrete and a small perturbation can bring a significant change to the original input. Word-level attacking, which can be regarded as a combinatorial optimization problem, is a well-studied class of textual attack methods. However, existing word-level attack models are far from perfect, largely because they employ unsuitable search space reduction methods and inefficient optimization algorithms. In this paper, we propose a novel attack model which incorporates a sememe-based word substitution method and a particle swarm optimization-based search algorithm to solve these two problems separately. We conduct exhaustive experiments to evaluate our attack model by attacking BiLSTM and BERT on three benchmark datasets. Experimental results demonstrate that our model consistently achieves much higher attack success rates and crafts more high-quality adversarial examples than baseline methods. Further experiments also show that our model has higher transferability and can bring more robustness enhancement to victim models through adversarial training. All the code and data of this paper can be obtained at https://github.com/thunlp/SememePSO-Attack.

[Figure 1: An example showing search space reduction with sememe-based word substitution and adversarial example search in word-level adversarial attacks. The original input "I love this movie" admits the substitutes {like, enjoy} for "love" (words sharing its sememe FondOf) and {film, picture, cinema} for "movie"; searching this space yields the adversarial example "I like this film".]

1 Introduction

Adversarial attacks use adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015), which are maliciously crafted by perturbing the original input, to fool deep neural networks (DNNs). Extensive studies have demonstrated that DNNs are vulnerable to adversarial attacks; for example, minor modifications to highly toxic phrases can easily deceive Google's toxic comment detection systems (Hosseini et al., 2017). From another perspective, adversarial attacks are also used to improve the robustness and interpretability of DNNs (Wallace et al., 2019). In the field of natural language processing (NLP), which widely employs DNNs, practical systems such as spam filtering (Stringhini et al., 2010) and malware detection (Kolter and Maloof, 2006) have been broadly deployed, but at the same time concerns about their security are growing. Therefore, research on textual adversarial attacks is becoming increasingly important.

Textual adversarial attacking is challenging. Unlike for images, a truly imperceptible perturbation on text is almost impossible because of its discrete nature. Even the slightest character-level perturbation can either (1) change the meaning and, worse still, the true label of the original input, or (2) break its grammaticality and naturality. Unfortunately, a change of true label makes the adversarial attack invalid. For example, supposing an adversary changes "she" to "he" in an input sentence to attack a gender identification model, although the victim model alters its prediction, this is not a valid attack. And adversarial examples with broken grammaticality and naturality (i.e., of poor quality) can be easily defended against (Pruthi et al., 2019).

Various textual adversarial attack models have been proposed (Wang et al., 2019a), ranging from character-level flipping (Ebrahimi et al., 2018) to sentence-level paraphrasing (Iyyer et al., 2018). Among them, word-level attack models, mostly word substitution-based models, perform comparatively well on both attack efficiency and adversarial example quality (Wang et al., 2019b).

Word-level adversarial attacking is actually a combinatorial optimization problem (Wolsey and Nemhauser, 1999), as its goal is to craft adversarial examples that successfully fool the victim model using a limited vocabulary. In this paper, as shown in Figure 1, we break this combinatorial optimization problem down into two steps: (1) reducing the search space and (2) searching for adversarial examples.

The first step aims to exclude invalid or low-quality potential adversarial examples and retain the valid ones with good grammaticality and naturality. The most common approach is to pick candidate substitutes for each word in the original input and use their combinations as the reduced discrete search space. However, existing attack models either disregard this step (Papernot et al., 2016) or adopt unsatisfactory substitution methods that handle the trade-off between quality and quantity of the retained adversarial examples poorly (Alzantot et al., 2018; Ren et al., 2019).
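To make the reduced search space concrete, the following toy Python sketch (illustrative only, not code from the paper; the words and substitutes are taken from Figure 1) enumerates the combinations of per-word substitutes. Because the number of combinations grows multiplicatively with sentence length, practical attack models search this space rather than enumerating it.

    import itertools

    # Toy search space reduction: each word of the original input is mapped
    # to a small list of candidate substitutes (here, the sememe-based
    # substitutes shown in Figure 1).
    original = ["I", "love", "this", "movie"]
    substitutes = {
        "love": ["like", "enjoy"],              # share the sememe FondOf
        "movie": ["film", "picture", "cinema"],
    }

    # Each position may keep its original word or take any of its substitutes,
    # so the search space is the Cartesian product of the per-position choices.
    choices = [[word] + substitutes.get(word, []) for word in original]
    search_space = [" ".join(combo) for combo in itertools.product(*choices)]

    print(len(search_space))   # 1 * 3 * 1 * 4 = 12 candidate sentences
    print(search_space[:3])    # 'I love this movie', 'I love this film', ...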
The second step is supposed to find adversarial examples that successfully fool the victim model within the reduced search space. Previous studies have explored diverse search algorithms, including gradient descent (Papernot et al., 2016), genetic algorithms (Alzantot et al., 2018) and greedy algorithms (Ren et al., 2019). Some of them, like gradient descent, only work in the white-box setting, where full knowledge of the victim model is required. In real situations, however, we usually have no access to the internal structure of victim models. As for the other, black-box algorithms, they are not efficient and effective enough at searching for adversarial examples.

These problems negatively affect the overall performance of existing word-level adversarial attacks. To solve them, we propose a novel black-box word-level adversarial attack model which reforms both steps. In the first step, we design a word substitution method based on sememes, the minimum semantic units, which retains more potential valid adversarial examples of high quality. In the second step, we present a search algorithm based on particle swarm optimization (Eberhart and Kennedy, 1995), which is very efficient and performs better at finding adversarial examples. We conduct exhaustive experiments to evaluate our model. Experimental results show that, compared with baseline models, our model not only achieves the highest attack success rate (e.g., 100% when attacking BiLSTM on IMDB) but also possesses the best adversarial example quality and comparable attack validity. We also conduct decomposition analyses to demonstrate the advantages of the two parts of our model separately. Finally, we show that our model has the highest transferability and brings the most robustness improvement to victim models through adversarial training.

2 Background

In this section, we first briefly introduce sememes, and then give an overview of the classical particle swarm optimization algorithm.

2.1 Sememes

In linguistics, a sememe is defined as the minimum semantic unit of human languages (Bloomfield, 1926). The meaning of a word can be represented by the composition of its sememes.

In the field of NLP, sememe knowledge bases are built to make sememes usable in practical applications; in such knowledge bases, sememes are generally regarded as semantic labels of words (as shown in Figure 1). HowNet (Dong and Dong, 2006) is the most well-known one. It annotates over one hundred thousand English and Chinese words with a predefined set of about 2,000 sememes. Its sememe annotations are sense-level, i.e., each sense of a (polysemous) word is annotated with sememes separately. With the help of HowNet, sememes have been successfully applied to many NLP tasks, including word representation learning (Niu et al., 2017), sentiment analysis (Fu et al., 2013), semantic composition (Qi et al., 2019), sequence modeling (Qin et al., 2019), reverse dictionaries (Zhang et al., 2019b), etc.

2.2 Particle Swarm Optimization

Inspired by social behaviors such as bird flocking, particle swarm optimization (PSO) is a metaheuristic, population-based evolutionary computation paradigm (Eberhart and Kennedy, 1995). It has proved effective in solving optimization problems such as image classification (Omran et al., 2004), part-of-speech tagging (Silva et al., 2012) and text clustering (Cagnina et al., 2014).

PSO exploits a swarm of interacting particles to iteratively search for the optimal solution, where each particle holds a position and a velocity in the search space. In every iteration, the algorithm records each particle's individual best position p^n and the swarm's global best position p^g (the Record step), and then updates the d-th dimension of the n-th particle's velocity v^n and position x^n as

    v_d^n = ω v_d^n + c1 r1 (p_d^n − x_d^n) + c2 r2 (p_d^g − x_d^n),
    x_d^n = x_d^n + v_d^n,

where ω is the inertia weight; c1 and c2 are acceleration coefficients, positive constants that control how fast the particle moves towards its individual best position and the global best position; and r1 and r2 are random coefficients. After updating, the algorithm goes back to the Record step.
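The update rule above translates directly into code. Below is a minimal, self-contained sketch of the classical continuous PSO described in this section (an illustration only, not the paper's attack: the swarm size, inertia weight, acceleration coefficients and toy objective are arbitrary choices, and the paper's method instead adapts PSO to the discrete word-substitution search space).

    import random

    def pso_minimize(f, dim, n_particles=20, iters=100,
                     omega=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0)):
        """Classical continuous PSO (illustrative sketch): minimize f over a box."""
        lo, hi = bounds
        # Initialize: random positions, zero velocities.
        x = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
        v = [[0.0] * dim for _ in range(n_particles)]
        p_best = [xi[:] for xi in x]             # individual best positions p^n
        p_best_val = [f(xi) for xi in x]
        g = min(range(n_particles), key=lambda n: p_best_val[n])
        g_best, g_best_val = p_best[g][:], p_best_val[g]   # global best position p^g

        for _ in range(iters):
            for n in range(n_particles):
                for d in range(dim):
                    r1, r2 = random.random(), random.random()
                    # Velocity update: inertia term plus pulls towards the
                    # individual best and the global best positions.
                    v[n][d] = (omega * v[n][d]
                               + c1 * r1 * (p_best[n][d] - x[n][d])
                               + c2 * r2 * (g_best[d] - x[n][d]))
                    x[n][d] += v[n][d]
                # Record step: refresh individual and global best positions.
                val = f(x[n])
                if val < p_best_val[n]:
                    p_best[n], p_best_val[n] = x[n][:], val
                    if val < g_best_val:
                        g_best, g_best_val = x[n][:], val
        return g_best, g_best_val

    # Usage: minimize the sphere function; the optimum is the origin.
    best_x, best_val = pso_minimize(lambda x: sum(t * t for t in x), dim=3)
    print(best_x, best_val)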
3 Methodology

In this section,