Adversarial Attacks and Defenses: an Interpretation Perspective

Adversarial Attacks and Defenses: An Interpretation Perspective Ninghao Liuy, Mengnan Duy, Ruocheng Guoz, Huan Liuz, Xia Huy Department of Computer Science and Engineering, Texas A&M University, TX, USA Computer Science & Engineering, Arizona State University, Tempe, AZ, USA yfnhliu43, dumengnan, [email protected], zfrguo12, [email protected] ABSTRACT Interpretation Despite the recent advances in a wide spectrum of applications, machine learning models, especially deep neural networks, have been shown to be vulnerable to adversar- Developer Attacker ial attacks. Attackers add carefully-crafted perturbations to input, where the perturbations are almost imperceptible to humans, but can cause models to make wrong predictions. Techniques to protect models against adversarial input are called adversarial defense methods. Although many How to improve How to attack approaches have been proposed to study adversarial attacks robustness? the model? and defenses in different scenarios, an intriguing and crucial challenge remains that how to really understand model vul- Figure 1: Interpretation can either provide directions for nerability? Inspired by the saying that \if you know your- improving model robustness or attacking on its weakness. self and your enemy, you need not fear the battles", we may tackle the challenge above after interpreting machine learning models to open the black-boxes. The goal of model in- ceptible to adversarial attacks [1; 2]. That is, after adding terpretation, or interpretable machine learning, is to extract certain well-designed but human imperceptible perturbation human-understandable terms for the working mechanism of or transformation to a clean data instance, we are able to models. Recently, some approaches start incorporating in- manipulate the prediction of the model. The data instances terpretation into the exploration of adversarial attacks and after being attacked are called adversarial samples. The defenses. Meanwhile, we also observe that many existing phenomenon is intriguing since clean samples and adversar- methods of adversarial attacks and defenses, although not ial samples are usually not distinguishable to humans. Ad- explicitly claimed, can be understood from the perspective versarial samples may be predicted dramatically differently of interpretation. In this paper, we review recent work on from clean samples, but the predictions usually do not make adversarial attacks and defenses, particularly from the per- sense to a human. spective of machine learning interpretation. We categorize The model vulnerability to adversarial attacks has been dis- interpretation into two types, feature-level interpretation, covered in various applications or under different constraints. and model-level interpretation. For each type of interpreta- For examples, approaches for crafting adversarial samples tion, we elaborate on how it could be used for adversarial have been proposed in tasks such as classification (e.g., on attacks and defenses. We then briefly illustrate additional image data [3], text data [4], tabular data [5], graph data [6; correlations between interpretation and adversaries. Finally, 7]), object detection [8], and fraud detection [9]. Adver- we discuss the challenges and future directions for tackling sarial attacks could be initiated under different constraints, adversary issues with interpretation. such as assuming limited knowledge of attackers on target models [10; 11], assuming higher generalization level of at- Keywords tack [12; 13], posing different real-world constraints on attack [14; 15]. Given the advances, several questions could be Adversarial attacks, adversarial defenses, interpretation, ex- posted. First, are these advances relatively independent of plainability, deep learning each other, or is there an underlying perspective from which we can discover the commonality behind them? Second, 1. INTRODUCTION should adversarial samples be seen as the negligent corner Machine learning (ML) techniques, especially recent deep cases that could be fixed by putting patches to models, or learning models, are progressing rapidly and have been in- are they deeply rooted in the internal working mechanism creasingly applied in various applications. Nevertheless, con- of models that it is not easy to get rid of? cerns have been posed about the security and reliability is- Motivated by the idiom that \if you know yourself and your sues of ML models. In particular, many deep models are sus- enemy, you need not fear the battles" from The Art of War, in this paper, we answer the above questions and review the recent advances of adversarial attack and defense approaches from the perspective of interpretable machine learning. The relation between model interpretation and model robustness Output is illustrated in Figure 1. On the one hand, if adversaries know how the target model works, they may utilize it to 2 find model weakness and initiate attacks accordingly. On Model-level Interpretation the other hand, if model developers know how the model Latent Space works, they could identify the vulnerability and work on remediation in advance. Interpretation refers to the human- understandable information explaining what a model has learned or how a model makes predictions. Exploration of model interpretability has attracted many interests in recent years, because recent machine learning techniques, especially deep learning models, have been criticized due to lack of transparency. Some recent work starts to involve interpretability in the analysis of adversarial robustness. Also, Model although not being explicitly specified, in this survey, we will show that many existing adversary-related work can be 1 Feature-level Interpretation comprehended from another perspective as an extension of model interpretation. Input Before connecting the two domains, we first briefly introduce the subjects of interpretation to be covered in this pa- Figure 2: Illustration of feature-level interpretation and per. Interpretability is defined as \the ability to explain or to model-level interpretation for a deep model. present in understandable terms to a human [16]". Although a formal definition of interpretation still remains elusive [16; 17; 18; 19], the overall goal is to obtain and transform infor- it attempts to mislead a model's prediction to a specific class mation from models or their behaviors into a domain that given an instance. Let f denote the target model exposed human can make sense of [20]. For a more structured analy- to adversarial attack. A clean data instance is x0 2 X, and sis, we categorize existing work into two categories: feature- X is the input space. We consider classification tasks, so level interpretation and model-level interpretation, as shown f(x0) = c; c 2 f1; 2; :::; Cg. One way of formulating the task in Figure 2. Feature-level interpretation targets to find the of targeted attack is as below [2]: most important features in a data sample for its predic- 0 min d(x; x0); s.t. f(x) = c (1) tion. Model-level interpretation explores the functionality x2X of model components, and their internal states after being where c0 6= c, and d(x; x ) measures the distance between fed with input. This categorization is based on whether the 0 the two instances. A typical choice of distance measure is internal working mechanism of models is involved in inter- l norms, where d(x; x ) = kx − x k . The core idea is to pretation. p 0 0 p add small perturbation to the original instance x to make Following the above categorization, the overall structure of 0 it being classified as c0. However, in some cases, it is impor- this article is organized as below. To begin with, we briefly tant to increase the confidence of perturbed samples being introduce different types of adversarial attack and defense misclassified, so the task may also be formulated as: strategies in Section 2. Then, we introduce different categories of interpretation approaches, and demonstrate in de- max fc0 (x); s.t. d(x; x0) ≤ δ (2) tail how interpretation correlates to the attack and defense x2X strategies. Specifically, we discuss feature-level interpreta- where fc0 (x) denotes the probability or confidence that x is tion in Section 3 and model-level interpretation in Section 4. classified as c0 by f, and δ is a threshold limiting perturba- After that, we extend the discussion to additional relations tion magnitude. For untargeted attack, its goal is to prevent between interpretation and adversarial aspects of models in a model from assigning a specific label to an instance. The Section 5. Finally, we discuss some opening challenges for objective of untargeted attack could be formulated in a sim- future work in Section 6. ilar way as targeted attack, where we just need to change the constraint as f(x) 6= c in Equation 1, or change the 2. ADVERSARIAL MACHINE LEARNING objective as minx2X fc(x) in Equation 2. Before understanding how interpretation helps adversarial In some scenarios, the two types of attacks above are also attack and defense, we first provide an overview of existing called false positive attack and false negative attack. The attack and defense methodologies. former aims to make models misclassify negative instances as positive, while the latter tries to mislead models to clas- 2.1 Adversarial Attacks sify positive instances as negative. False positive attacks In this subsection, we introduce different types of threat and false negative attacks sometimes are also called Type-I models for adversarial attacks. The overall threat models attacks and Type-II attacks, respectively. may be categorized under different criteria. Based on different application scenarios, conditions, and adversary ca- 2.1.2 One-Shot vs Iterative Attack pabilities, specific attack strategies will be deployed. According to practical constraints, adversaries may initiate one-shot or iterative attacks to target models. In one-shot 2.1.1 Untargeted vs Targeted Attack attack, they have only one chance to generate adversarial Based on the goal of attackers, the threat models can be clas- samples, while iterative attack could take multiple steps to sified into targeted and untargeted ones. For targeted attack, find the better perturbation direction.

Adversarial Attacks and Defenses: an Interpretation Perspective

Spectrally-Based Audio Distances Are Bad at Pitch

Arxiv:2008.01352V3 [Cs.LG] 23 Mar 2021 Spatiotemporal Disentanglement in the PDE Formalism

Large Margin Deep Networks for Classification

Generative Adversarial Transformers

Adversarial Examples in the Physical World

Arxiv:1611.08903V1 [Cs.LG] 27 Nov 2016 20 Percent Increases in Cash Collections [20]

Submission Data for 2020-2021 CORE Conference Ranking Process International Conference on Learning Representations

Magmadnn: Accelerated Deep Learning Using MAGMA

Neurips 2020 Competition: Predicting Generalization in Deep Learning (Version 1.1)

Neural Network Tools: an Overview from Work on "Automatic Classiﬁcation of Alzheimer’S Disease from Structural MRI" Report for Master’S Thesis

Inductive Biases for Deep Learning of Higher-Level Cognition

Accelerating Machine Learning Research with MI-Prometheus