Adversarial Transformation Networks: Learning to Generate Adversarial Examples

Shumeet Baluja and Ian Fischer Google Research Mountain View, CA.

arXiv:1703.09387v1 [cs.NE] 28 Mar 2017

Abstract

Multiple different approaches of generating adversarial examples have been proposed to attack deep neural networks. These approaches involve either directly computing gradients with respect to the image pixels, or directly solving an optimization on the image pixels. In this work, we present a fundamentally new method for generating adversarial examples that is fast to execute and provides exceptional diversity of output. We efficiently train feed-forward neural networks in a self-supervised manner to generate adversarial examples against a target network or set of networks. We call such a network an Adversarial Transformation Network (ATN). ATNs are trained to generate adversarial examples that minimally modify the classifier's outputs given the original input, while constraining the new classification to match an adversarial target class. We present methods to train ATNs and analyze their effectiveness targeting a variety of MNIST classifiers as well as the latest state-of-the-art ImageNet classifier Inception ResNet v2.

1. Introduction and Background

With the resurgence of deep neural networks for many real-world classification tasks, there is an increased interest in methods to generate training data, as well as to find weaknesses in trained models. An effective strategy to achieve both goals is to create adversarial examples that trained models will misclassify. Adversarial examples are small perturbations of the inputs that are carefully crafted to fool the network into producing incorrect outputs. These small perturbations can be used both offensively, to fool models into giving the "wrong" answer, and defensively, by providing training data at weak points in the model. Seminal work by Szegedy et al. (2013) and Goodfellow et al. (2014b), as well as much recent work, has shown that adversarial examples are abundant, and that there are many ways to discover them.

Given a classifier f(x): x ∈ X → y ∈ Y and original inputs x ∈ X, the problem of generating untargeted adversarial examples can be expressed as the optimization: argmin_{x*} L(x, x*) s.t. f(x*) ≠ f(x), where L(·) is a distance metric between examples from the input space (e.g., the L2 norm). Similarly, generating a targeted adversarial attack on a classifier can be expressed as argmin_{x*} L(x, x*) s.t. f(x*) = y_t, where y_t ∈ Y is some target label chosen by the attacker.¹

Until now, these optimization problems have been solved using three broad approaches: (1) By directly using optimizers like L-BFGS or Adam (Kingma & Ba, 2015), as proposed in Szegedy et al. (2013) and Carlini & Wagner (2016). Such optimizer-based approaches tend to be much slower and more powerful than the other approaches. (2) By approximation with single-step gradient-based techniques like fast gradient sign (Goodfellow et al., 2014b) or fast least likely class (Kurakin et al., 2016a). These approaches are fast, requiring only a single forward and backward pass through the target classifier to compute the perturbation. (3) By approximation with iterative variants of gradient-based techniques (Kurakin et al., 2016a; Moosavi-Dezfooli et al., 2016a;b). These approaches use multiple forward and backward passes through the target network to more carefully move an input towards an adversarial classification.

¹ Another axis to compare when considering adversarial attacks is whether the adversary has access to the internals of the target model. Attacks without internal access are possible by transferring successful attacks on one model to another model, as in Szegedy et al. (2013); Papernot et al. (2016a), and others. A more challenging class of blackbox attacks involves having no access to any relevant model, and only getting online access to the target model's output, as explored in Papernot et al. (2016b); Baluja et al. (2015); Tramèr et al. (2016). See Papernot et al. (2015) for a detailed discussion of threat models.

2. Adversarial Transformation Networks

In this work, we propose Adversarial Transformation Networks (ATNs). An ATN is a neural network that transforms an input into an adversarial example against a target network or set of networks. ATNs may be untargeted or targeted, and trained in a black-box² or white-box manner. In this work, we will focus on targeted, white-box ATNs.

Formally, an ATN can be defined as a neural network:

    g_{f,θ}(x) : x ∈ X → x′     (1)

where θ is the parameter vector of g, f is the target network which outputs a probability distribution across class labels, and x′ ∼ x, but argmax f(x) ≠ argmax f(x′).

Training. To find g_{f,θ}, we solve the following optimization:

    argmin_θ Σ_{x_i ∈ X} [ β·L_X(g_{f,θ}(x_i), x_i) + L_Y(f(g_{f,θ}(x_i)), f(x_i)) ]     (2)

where L_X is a loss function in the input space (e.g., L2 loss or a perceptual similarity loss like Johnson et al. (2016)), L_Y is a specially-formed loss on the output space of f (described below) to avoid learning the identity function, and β is a weight to balance the two loss functions. We will omit θ from g_f when there is no ambiguity.

Inference. At inference time, g_f can be run on any input x without requiring further access to f or more gradient computations. This means that after being trained, g_f can generate adversarial examples against the target network f even faster than the single-step gradient-based approaches, such as fast gradient sign, so long as running g_f is not substantially more expensive than running f.

Loss Functions. The input-space loss function, L_X, would ideally correspond closely to human perception. However, for simplicity, L2 is sufficient. L_Y determines whether or not the ATN is targeted; the target refers to the class for which the adversary will cause the classifier to output the maximum value. In this work, we focus on the more challenging case of creating targeted ATNs, which can be defined similarly to Equation 1:

    g_{f,t}(x) : x ∈ X → x′     (3)

where t is the target class, so that argmax f(x′) = t. This allows us to target the exact class the classifier should mistakenly believe the input is.

In this work, we define L_{Y,t}(y′, y) = L2(y′, r(y, t)), where y = f(x), y′ = f(g_f(x)), and r(·) is a reranking function that modifies y such that y_k < y_t, ∀ k ≠ t.

Note that training labels for the target network are not required at any point in this process. All that is required is the target network's outputs y and y′. It is therefore possible to train ATNs in a self-supervised manner, where they use unlabeled data as the input and make argmax f(g_{f,t}(x)) = t.

Reranking function. There are a variety of options for the reranking function. The simplest is to set r(y, t) = onehot(t), but other formulations can make better use of the signal already present in y to encourage better reconstructions. In this work, we look at reranking functions that attempt to keep r(y, t) ∼ y. In particular, we use r(·) that maintains the rank order of all but the targeted class in order to minimize distortions when computing x′ = g_{f,t}(x). The specific r(·) used in our experiments has the following form:

    r_α(y, t) = norm( { α·max(y)  if k = t;  y_k  otherwise }_{k ∈ y} )     (4)

where α > 1 is an additional parameter specifying how much larger y_t should be than the current max classification, and norm(·) is a normalization function that rescales its input to be a valid probability distribution.

2.1. Adversarial Example Generation

There are two approaches to generating adversarial examples with an ATN. The ATN can be trained to generate just the perturbation to x, or it can be trained to generate an adversarial autoencoding of x.

• Perturbation ATN (P-ATN): To just generate a perturbation, it is sufficient to structure the ATN as a variation on the residual block (He et al., 2015): g_f(x) = tanh(x + G(x)), where G(·) represents the core function of g_f. With small initial weight vectors, this structure makes it easy for the network to learn to generate small, but effective, perturbations.

• Adversarial Autoencoding (AAE): AAE ATNs are similar to standard autoencoders, in that they attempt to accurately reconstruct the original input, subject to regularization, such as weight decay or an added noise signal. For AAE ATNs, the regularizer is L_Y. This imposes an additional requirement on the AAE to add some perturbation p to x such that r(f(x′)) = y′.

For both ATN approaches, in order to enforce that x′ is a plausible member of X, the ATN should only generate values in the valid input range of f. For images, it suffices to set the activation function of the last layer to be the tanh function; this constrains each output channel to [−1, 1].

² E.g., using Williams (1992) to generate training gradients for the ATN based on a reward signal computed on the result of sending the generated adversarial examples to the target network.
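To make the reranking function concrete, here is a minimal NumPy sketch of Equation 4. The function name `rerank` and the example probability vector are our own illustration; the paper does not provide reference code.

```python
import numpy as np

def rerank(y, t, alpha=1.5):
    """Compute r_alpha(y, t) from Equation 4.

    y     : 1-D array of class probabilities output by the target classifier f(x).
    t     : index of the adversarial target class.
    alpha : how much larger the target entry should be than the current max (alpha > 1).
    """
    y_prime = np.array(y, dtype=np.float64)
    y_prime[t] = alpha * y.max()          # push the target class to the top
    return y_prime / y_prime.sum()        # norm(.): rescale to a valid distribution

# Example: the original top class is 3; after reranking, class 7 is on top and the
# relative order of all other classes is unchanged.
y = np.array([0.02, 0.01, 0.01, 0.60, 0.05, 0.08, 0.01, 0.03, 0.15, 0.04])
print(rerank(y, t=7))
```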

Table 1. Baseline Accuracy of Five MNIST Classifiers

                                        Architecture                                        Acc.
Classifier-Primary (Classifier_p)       (5x5 Conv) → (5x5 Conv) → FC → FC                   98.6%
Classifier-Alternate-0 (Classifier_a0)  (5x5 Conv) → (5x5 Conv) → FC → FC                   98.5%
Classifier-Alternate-1 (Classifier_a1)  (4x4 Conv) → (4x4 Conv) → (4x4 Conv) → FC → FC      98.9%
Classifier-Alternate-2 (Classifier_a2)  (3x3 Conv) → (3x3 Conv) → (3x3 Conv) → FC → FC      99.1%
Classifier-Alternate-3 (Classifier_a3)  (3x3 Conv) → FC → FC → FC                           98.5%
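For reference, the Classifier_p / Classifier_a0 row of Table 1 could be instantiated roughly as in the following PyTorch sketch. The table only specifies the layer types, so the channel counts, pooling, and hidden width below are assumptions, not values from the paper.

```python
import torch.nn as nn

# Hypothetical instantiation of the "(5x5 Conv) -> (5x5 Conv) -> FC -> FC" row of
# Table 1. Channel counts, max-pooling, and the FC width are assumed.
classifier_p = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),   # 28x28 -> 14x14
    nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),  # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 256), nn.ReLU(),
    nn.Linear(256, 10),                                                        # 10 logit units
)
```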

2.2. Related Network Architectures

This training objective resembles standard Generative Adversarial Network training (Goodfellow et al., 2014a) in that the goal is to find weaknesses in the classifier. It is interesting to note the similarity to work outside the adversarial training paradigm — the recent use of feed-forward neural networks for artistic style transfer in images (Gatys et al., 2015; Ulyanov et al., 2016). Gatys et al. (2015) originally proposed a gradient descent procedure based on "back-driving networks" (Linden & Kindermann, 1989) to modify the inputs of a fully-trained network to find a set of inputs that maximize a desired set of outputs and hidden unit activations. Unlike standard network training in which the gradients are used to modify the weights of the network, here, the network weights are frozen and the input itself is changed. In subsequent work, Ulyanov et al. (2016) created a method to approximate the results of the gradient descent procedure through the use of an off-line trained neural network. Ulyanov et al. (2016) removed the need for a gradient descent procedure to operate on every source image to which a new artistic style was to be applied, and replaced it with a single forward pass through a separate network. Analogously, we do the same for generating adversarial examples: a separately trained network approximates the usual gradient descent procedure done on the target network to find adversarial examples.

Figure 1. (Left) A simple classification network which takes input image x. (Right) With the same input, x, the ATN emits x′, which is fed into the classification network. In the example shown, the input digit is classified correctly as a 3 (on the left); ATN_7 takes x as input and generates a modified image (3′) such that the classifier outputs a 7 as the highest activation and the previous highest classification, 3, as the second highest activation (on the right).

3. MNIST Experiments

To begin our empirical exploration, we train five networks on the standard MNIST digit classification task (LeCun et al., 1998). The networks are trained and tested on the same data; they vary only in the weight initialization and architecture, as shown in Table 1. Each network has a mix of convolution (Conv) and Fully Connected (FC) layers. The input to the networks is a 28x28 grayscale image and the output is 10 logit units. Classifier_p and Classifier_a0 use the same architecture, and only differ in the initialization of the weights. We will primarily use Classifier_p for the experiments in this section. The other networks will be used later to analyze the generalization capabilities of the adversaries. Table 1 shows that all of the networks perform well on the digit recognition task.³

We attempt to create an Adversarial Autoencoding ATN that can target a specific class given any input image. The ATN is trained against a particular classifier as illustrated in Figure 1. The ATN takes the original input image, x, as input, and outputs a new image, x′, that the target classifier should erroneously classify as t. We also add the constraint that the ATN should maintain the ordering of all the other classes as initially output by the classifier. We train ten ATNs against Classifier_p – one for each target digit, t.

An example is provided to make this concrete. If a classifier is given an image, x_3, of the digit 3, a successful ordering of the outputs (from largest to smallest) may be as follows: Classifier_p(x_3) → [3, 8, 5, 0, 4, 1, 9, 7, 6, 2]. If ATN_7 is applied to x_3, when the resulting image, x′_3, is fed into the same classifier, the following ordering of outputs is desired (note that the 7 has moved to the highest output): Classifier_p(ATN_7(x_3)) → [7, 3, 8, 5, 0, 4, 1, 9, 6, 2].

Training for a single ATN_t proceeds as follows. The weights of Classifier_p are frozen and never change during ATN training. Every training image, x, is passed through Classifier_p to obtain output y.

³ It is easy to get better performance than this on MNIST, but for these experiments, it was more important to have a variety of architectures that achieved similar accuracy, than to have state-of-the-art performance.

Table 2. Average success of ATN_{0−9} at transforming an image such that it is misclassified by Classifier_p. As β is reduced, the ability to fool Classifier_p increases. How to read the table: Top row of each cell: percentage of times Classifier_p labeled x′ as t. Middle row: percentage of times Classifier_p labeled x′ as t and kept the original classification (argmax y) in second place. Bottom row: percentage of all x′ that kept the original classification in second place.

                                                              β:  0.010    0.005    0.001
ATN_a: FC → FC → 28x28 Image
  (top)                                                           69.1%    84.1%    95.9%
  (middle)                                                        91.7%    93.4%    95.3%
  (bottom)                                                        63.5%    78.6%    91.4%
ATN_b: (3x3 Conv) → (3x3 Conv) → (3x3 Conv) → FC → 28x28 Image
  (top)                                                           61.8%    77.7%    89.2%
  (middle)                                                        93.8%    95.8%    97.4%
  (bottom)                                                        58.7%    74.5%    86.9%
ATN_c: (3x3 Conv) → (3x3 Conv) → (3x3 Conv) → Deconv: 7x7 → Deconv: 14x14 → 28x28 Image
  (top)                                                           66.6%    82.5%    91.4%
  (middle)                                                        95.5%    96.6%    97.5%
  (bottom)                                                        64.0%    79.7%    89.1%

As described in Equation 4, we then compute r_α(y, t) by copying y to a new value, y′, setting y′_t = α·max(y), and then renormalizing y′ to be a valid probability distribution. This sets the target class, t, to have the highest value in y′ while maintaining the relative order of the other original classifications. In the MNIST experiments, we empirically set α = 1.5.

Given y′, we can now train ATN_t to generate x′ by minimizing β·L_X = β·L2(x, x′) and L_Y = L2(y, y′) using Equation 2. Though the weights of Classifier_p are frozen, error derivatives are still passed through them to train the ATN. We explore several values of β to balance the two loss functions. The results are shown in Table 2.
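A minimal PyTorch sketch of one such training step follows (the paper's implementation used TensorFlow; the helper name, the use of mean-squared error for both losses, and the probability-valued classifier output are our reading of Equations 2 and 4, not code from the paper).

```python
import torch
import torch.nn.functional as F

def atn_training_step(atn, classifier, optimizer, x, t, alpha=1.5, beta=0.005):
    """One gradient step for ATN_t against a fixed target classifier (Equation 2).

    atn        : network mapping an image batch x to an adversarial batch x'.
    classifier : target network returning class probabilities; its weights stay fixed
                 because only the ATN's parameters are registered in `optimizer`.
    x          : batch of input images scaled to [-1, 1].
    t          : adversarial target class index.
    """
    with torch.no_grad():                        # no gradients needed for the clean pass
        y = classifier(x)                        # y = f(x), the original outputs
    y_target = y.clone()                         # r_alpha(y, t): copy y, boost class t,
    y_target[:, t] = alpha * y.max(dim=1).values
    y_target = y_target / y_target.sum(dim=1, keepdim=True)   # renormalize

    x_adv = atn(x)                               # x' = g_{f,t}(x)
    y_adv = classifier(x_adv)                    # error derivatives flow through f into the ATN
    loss = beta * F.mse_loss(x_adv, x) + F.mse_loss(y_adv, y_target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```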

Experiments. We tried three ATN architectures for the AAE task, and each was trained with three values of β against all ten targets, t. The full 3 × 3 set of experiments is shown in Table 2. The accuracies shown are the ability of ATN_t to transform an input image x into x′ such that Classifier_p mistakenly classifies x′ as t.⁴ Each measurement in Table 2 is the average of the 10 networks, ATN_{0−9}.
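As a concrete illustration of the AAE architectures listed in Table 2, a PyTorch sketch in the spirit of ATN_b is shown below; the channel counts and strides are our own assumptions, and the final tanh keeps the output in the valid [−1, 1] input range, as discussed in Section 2.1.

```python
import torch.nn as nn

# Sketch of an AAE ATN in the spirit of ATN_b from Table 2:
# (3x3 Conv) -> (3x3 Conv) -> (3x3 Conv) -> FC -> 28x28 image.
# Channel counts and strides are assumptions, not taken from the paper.
atn_b = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # 28 -> 14
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 14 -> 7
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 28 * 28),
    nn.Tanh(),                          # constrain outputs to [-1, 1]
    nn.Unflatten(1, (1, 28, 28)),       # reshape back to a 28x28 image
)
```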

Results. In Figure 2 (top), each row represents the transformation that ATN_t makes to digits that were initially correctly classified as 0-9 (columns). For example, in the top row, the digits 1-9 are now all classified as 0. In all cases, their second highest classification is the original correct classification (0-9).

The reconstructions shown in Figure 2 (top) have the largest β; smaller β values are shown in the bottom row. The fidelity to the underlying digit diminishes as β is reduced. However, by loosening the constraints to stay similar to the original input, the number of trials in which the transformer network is able to successfully "fool" the classification network increases dramatically, as seen in Table 2.

Figure 2. Successful adversarial examples from ATN_t against Classifier_p. Top is with the highest β = 0.010. Bottom two are with β = 0.005 & 0.001, respectively. Note that as β is decreased, the fidelity to the underlying digit decreases. The column in each block corresponds to the correct classification of the image. The row corresponds to the adversarial classification, t.

⁴ Images that were originally classified as t were not counted in the test as no transformation on them was required.

Figure 3. Typical transformations made to MNIST digits against Classifier_p. Black digits on the white background are output classifications from Classifier_p. The bottom classification is the original (correct) classification. The top classification is the result of classifying the adversarial example. White digits on black backgrounds are the MNIST digits and their transformations to adversarial examples. The bottom MNIST digits are unmodified, and the top are adversarial. In all of these images, the adversarial example is classified as t = argmax y′ while maintaining the second highest output in y′ as the original classification, argmax y.

Interestingly, with β = 0.010, in Figure 2 (second row), where there should be a '0' that is transformed into a '1', no digit appears. With this high β, no example was found that could be transformed to successfully fool Classifier_p. With the two smaller β values, this anomaly does not occur.

In Figure 3, we provide a closer look at examples of x and x′ for ATN_c with β = 0.005. A few points should be noted:

• The transformations maintain the large, empty regions of the image. Unlike many previous studies in attacking classifiers, the addition of salt-and-pepper type noise did not appear (Nguyen et al., 2014; Moosavi-Dezfooli et al., 2016b).

• In the majority of the generated examples, the shape of the digit does not dramatically change. This is the desired behavior: by training the networks to maintain the order beyond the top-output, only minimal changes should be made to the image. The changes that are often introduced are patches where the light strokes have become darker.

• Vertical-linear components of the original images are emphasized in several digits; it is especially noticeable in the digits transformed to 1. With other digits (e.g., 8), it is more difficult to find a consistent pattern of what is being (de)emphasized to cause the classification network to be fooled.

Table 3. Rank Difference in Secondary Outputs, Pre/Post Transformation. Top-5 (Top-9).

         β:  0.010         0.005         0.001
ATN_a        0.93 (0.99)   0.98 (1.04)   1.04 (1.13)
ATN_b        0.81 (0.87)   0.83 (0.89)   0.86 (0.93)
ATN_c        0.79 (0.85)   0.83 (0.90)   0.89 (0.97)

A novel aspect of ATNs is that though they cause the target classifier to output an erroneous top-class, they are also trained to ensure that the transformation preserves the existing output ordering of the target classifier (other than the top-class). For the examples that were successfully transformed, Table 3 gives the average rank-difference of the outputs with the pre- and post-transformed images (excluding the intentional targeted misclassification).

4. A Deeper Look into ATNs

This section explores three extensions to the basic ATNs: increasing the number of networks the ATNs can attack, using hidden state from the target network, and using ATNs in serial and parallel.

4.1. Adversarial Transfer to Other Networks

So far, we have examined ATNs in the context of attacking a single classifier. Can ATNs create adversarial examples that generalize to other classifiers?

Table 4. ATN_b with β = 0.005 trained to defeat Classifier_p. Tested on 5 classifiers, without further training, to measure transfer. 1st place is the percentage of times t was the top classification. 2nd place measures how many times the original top class (argmax y) was correctly placed into 2nd place, conditioned on the 1st place being correct (Conditional) or unconditioned on 1st place (Unconditional).

                                   Classifier_p*  Classifier_a0  Classifier_a1  Classifier_a2  Classifier_a3
1st Place Correct                      82.5%          15.7%          16.1%           7.7%          28.9%
2nd Place Correct (Conditional)        96.6%          84.7%          89.3%          85.0%          81.8%
2nd Place Correct (Unconditional)      79.7%          15.6%          16.1%           8.4%          26.2%
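The three rows of Table 4 (and Table 5) can be computed per classifier as in the sketch below. The helper is ours, and the "unconditional" row follows the caption's literal wording: the original class holds second place on x′, regardless of what is in first place.

```python
import numpy as np

def transfer_metrics(y_clean, y_adv, t):
    """Summarize how ATN_t's outputs transfer to one classifier.

    y_clean : [N, 10] classifier outputs on the original images x.
    y_adv   : [N, 10] outputs of the same classifier on the transformed images x'.
    t       : the ATN's target class.
    """
    orig_top = y_clean.argmax(axis=1)          # argmax y, the original top class
    order = np.argsort(-y_adv, axis=1)         # class ranking on x', best first
    first_ok = order[:, 0] == t                # t took first place
    second_ok = order[:, 1] == orig_top        # original class held second place
    return {
        "1st place correct": first_ok.mean(),
        "2nd place correct (conditional)":
            second_ok[first_ok].mean() if first_ok.any() else 0.0,
        "2nd place correct (unconditional)": second_ok.mean(),
    }
```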

Much research has studied adversarial transfer for traditional adversaries, including the recent work of Moosavi-Dezfooli et al. (2016a) and Liu et al. (2016).

Targeting multiple networks. To test transfer, we take the adversarial examples from the previously trained ATNs and test them against Classifier_{a0,a1,a2,a3} (described in Table 1). The results in Table 4 clearly show that the transformations made by the ATN are not general; they are tied to the network it is trained to attack. Even Classifier_a0, which has the same architecture as Classifier_p, is not more susceptible to the attacks than those with different architectures. Looking at the second place correctness scores (in the same Table 4), it may, at first, seem counter-intuitive that the conditional probability of a correct second-place classification remains high despite a low first-place classification. The reason for this is that in the few cases in which the ATN was able to successfully change the classifier's top choice, the second choice (the real classification) remained a close second (i.e., the image was not transformed in a large manner), thereby maintaining the high performance in the conditional second rank measurement.

Figure 4. The ATN now has to fool three networks (of various architectures), while also minimizing L_X, the reconstruction error.

Training against multiple networks. Is it possible to create a network that will be able to create a single transform that can attack multiple networks? Will such an ATN generalize better to unseen networks? To test this, we created an ATN that receives training signals from multiple networks, as shown in Figure 4. As with the earlier training, the L_X reconstruction error remains.

The new ATN was trained with classification signals from three networks: Classifier_p, Classifier_a1, and Classifier_a2. The training proceeds in exactly the same manner as described earlier, except the ATN attempts to minimize L_Y for all three target networks at the same time. The results are shown in Table 5. First, examine the columns corresponding to the networks that were used in the training (marked with an *). Note that the success rates of attacking these three classifiers are consistently high, comparable with those when the ATN was trained with a single network. Therefore, it is possible to learn a transformation network that modifies images such that the perturbation defeats multiple networks.

Next, we turn to the remaining two networks to which the adversary was not given access during training. There is a large increase in success rates over those when the ATN was trained with a single target network (Table 4). However, the results do not match those of the networks used in training. It is possible that training against larger numbers of target networks at the same time could further increase the transferability of the adversarial examples.

Finally, we look at the success rates of image transformations. Do the same images consistently fool the networks, or are the failure cases of the networks different? As shown in Figure 5, for the 3 networks the ATN was trained to defeat, the majority of transformations attacked all three networks successfully. For the unseen networks, the results were mixed; the majority of transformations successfully attacked only a single network.

Table 5. ATN_b retrained with 3 networks (marked with *).

β = 0.010                          Classifier_p*  Classifier_a0  Classifier_a1*  Classifier_a2*  Classifier_a3
  1st Place Correct                    89.9%          37.9%          83.9%           78.7%          70.2%
  2nd Place Correct (Conditional)      96.1%          88.1%          96.1%           95.2%          79.1%
  2nd Place Correct (Unconditional)    86.4%          34.4%          80.7%           74.9%          55.9%

β = 0.005                          Classifier_p*  Classifier_a0  Classifier_a1*  Classifier_a2*  Classifier_a3
  1st Place Correct                    93.6%          34.7%          88.1%           82.7%          64.1%
  2nd Place Correct (Conditional)      96.8%          88.3%          96.9%           96.4%          73.1%
  2nd Place Correct (Unconditional)    90.7%          31.4%          85.3%           79.8%          47.2%
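A sketch of the corresponding multi-network objective is below: Equation 2 with one L_Y term per target classifier and a single shared reconstruction term. The helper is ours and reuses the MSE and reranking choices from the MNIST setup; the target classifiers are assumed to stay fixed because only the ATN's parameters are optimized.

```python
import torch
import torch.nn.functional as F

def multi_network_loss(atn, classifiers, x, t, alpha=1.5, beta=0.005):
    """L_X plus one L_Y term per target classifier (e.g., Classifier_p, a1, a2)."""
    x_adv = atn(x)
    loss = beta * F.mse_loss(x_adv, x)                      # shared reconstruction term
    for f in classifiers:                                   # fixed target networks
        with torch.no_grad():
            y = f(x)
        y_target = y.clone()
        y_target[:, t] = alpha * y.max(dim=1).values        # r_alpha(y, t), per network
        y_target = y_target / y_target.sum(dim=1, keepdim=True)
        loss = loss + F.mse_loss(f(x_adv), y_target)        # L_Y for this network
    return loss
```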

Figure 5. Do the same transformed examples work well on all the networks? (Top) Percentage of examples that worked on exactly 0-3 training networks. (Bottom) Percentage of examples that worked on exactly 0-2 unseen networks. Note: these are all measured on independent test set images.

4.2. "Insider" Information

In the experiments thus far, the classifier, C, was treated as a white box. From this box, two pieces of information were needed to train the ATN. First, the actual outputs of C were used to create the new target vector. Second, the error derivatives from the new target vector were passed through C and propagated into the ATN.

In this section, we examine the possibility of "opening" the classifier, and accessing more of its internal state. From C, the actual hidden unit activations for each example are used as additional inputs to the ATN. Intuitively, because the goal is to maintain as much similarity as possible to the original image and to maintain the same order of the non-top-most classifications as the original image, access to these activations may convey usable signals.

Because of the very large number of hidden units that accompany convolution layers, in practice, we only use the penultimate fully-connected layer from C. The results of training the ATNs with this extra information are shown in Table 6. Interestingly, the most salient difference does not come from the ability of the ATN to attack the networks in the first-position. Rather, when looking at the conditional-successes of the second-position, the numbers are improved (compare to Table 2). We speculate that this is because the extra hints provided by the classifier's internal activations (with the unmodified image) could be used to also ensure that the second-place classification, after input modification, was also correctly maintained.

Table 6. Using the internal states of the classifier as inputs for the Adversary Networks. Larger font is the percentage of times the adversarial class was classified in the top-space. Smaller font is how many times the original top class was correctly placed into 2nd place, conditioned on the 1st place being correct or not.

         β:  0.010              0.005              0.001
ATN_a        68.0%              81.4%              95.4%
             (94.5% / 64.5%)    (96.0% / 78.1%)    (98.1% / 93.6%)
ATN_b        68.1%              78.9%              92.4%
             (96.9% / 66.5%)    (98.1% / 77.4%)    (98.9% / 91.4%)
ATN_c        67.9%              81.0%              93.1%
             (97.6% / 66.4%)    (98.2% / 79.5%)    (99.0% / 92.1%)

4.3. Serial and Parallel ATNs

Separate ATNs are created for each digit (0-9). In this section, we examine whether the ATNs can be used in parallel (can the same original image be transformed by each of the ATNs successfully?) and in serial (can the same image be transformed by one ATN and then that resulting image be transformed by another, successfully?).

In the first test, we started with 1000 images of digits from the test set. Each was passed through all 10 ATNs (ATN_c, β = 0.005); the resulting images were then classified with Classifier_p. For each image, we measured how many ATNs were able to successfully transform the image (success is defined for ATN_t as causing the classifier to output t as the top-class). Out of the 1000 trials, 283 were successfully transformed by all 10 of the ATNs. Sample results and a histogram of the results are shown in Figure 6.

Figure 6. Parallel and Serial Application of 10 ATNs. Left: Examples of the same original image (shown in white background) transformed correctly by all ATNs. For example, in the row of 7s, in the first column, the 7 was transformed such that the classifier output a 0 as top class; in the second column, the classifier output a 1, etc. Middle: Histogram showing the number of images that were transformed successfully with at least N ATNs (1-10) when used in parallel. Right: Serial Adversarial Transformation Networks. In the first column, ATN_0 is applied to the input image. In the second column, ATN_1 is applied to the output of ATN_0, etc. In each of these examples, all 10 of the ATNs successfully transformed the previous image to fool the classifier. Note the severe image degradation as the transformation networks are applied in sequence.

A second experiment is constructed in which the 10 ATNs are applied serially, one-after-the-other. In this scenario, first ATN_0 is applied to image x, yielding x′. Then ATN_1 is applied to x′, yielding x′′, and so on, up to ATN_9. The goal is to see whether the transformations work on previously transformed images. The results of chaining the ATNs together in this manner are shown in Figure 6 (right). The more transformations that are applied, the larger the image degradation. As expected, by the ninth transformation (rightmost column in Figure 6) the majority of images are severely degraded and usually not recognizable. Though we expected the degradation in images, there were two additional, surprising, findings. First, in the parallel application of ATNs (the first experiment described above), out of 1000 images, 283 of them were successfully transformed by 10 of the ATNs. In this experiment, 741 images were successfully transformed by 10 ATNs. The improvement in the number of all-10 successes over applying the ATNs in parallel occurs because each transformation effectively diminishes the underlying original image (to remove the real classification from the top-spot). Meanwhile, only a few new pixels are added by the ATN to cause the misclassification, as it is also trained to minimize the reconstruction error. The overarching effect is a fading of the image through chaining ATNs together.

Second, it is interesting to examine what happens to the second-highest classifications that the networks were also trained to preserve. Order preservation did not occur in this test. Had the test worked perfectly, then for an input image, x (e.g., of the digit 8), after ATN_0 was applied, the first and second top classifications of x′ should be 0, 8, respectively. Subsequently, after ATN_1 is then applied to x′, the classifications of x′′ should be 1, 0, 8, etc. The reason this does not hold in practice is that though the networks were trained to maintain the high classification (8) of the original digit, x, they were not trained to maintain the potentially small perturbations that ATN_0 made to x to achieve a top-classification of 0. Therefore, when ATN_1 is applied, the changes that ATN_0 made may not survive the transformation. Nonetheless, if chaining adversaries becomes important, then training the ATNs with images that have been previously modified by other ATNs may be a sufficient method to address the difference in training and testing distributions. This is left for future work.

5. ImageNet Experiments

We explore the effectiveness of ATNs on the ImageNet dataset (Deng et al., 2009), which consists of 1.2 million natural images categorized into 1 of 1000 classes. The target classifier, f, used in these experiments is a pre-trained state-of-the-art classifier, Inception ResNet v2 (IR2), that has a top-1 single-crop error rate of 19.9% on the 50,000 image validation set, and a top-5 error rate of 4.9%. It is described fully in Szegedy et al. (2016).

5.1. Experiment Setup

We trained AAE ATNs and P-ATNs as described in Section 2 to attack IR2. Training an ATN against IR2 follows the process described in Section 3.

IR2 takes as input images scaled to 299 × 299 pixels of 3 channels each. To autoencode images of this size for the AAE task, we use three different fully convolutional architectures (Table 7):

• IR2-Base-Deconv, a small architecture that uses the first few layers of IR2 and loads the pre-trained parameter values at the start of training the ATN, followed by deconvolutional layers;

• IR2-Resize-Conv, a small architecture that avoids checkerboard artifacts common in deconvolutional layers by using bilinear resize layers to downsample and upsample between stride 1 convolutions; and

• IR2-Conv-Deconv, a medium architecture that is a tower of convolutions followed by deconvolutions.

For the perturbation approach, we use IR2-Base-Deconv and IR2-Conv-FC, which has many more parameters than the other architectures due to two large fully-connected layers. The use of fully-connected layers causes the network to learn too slowly for the autoencoding approach (AAE ATN), but can be used to learn perturbations quickly (P-ATN).

Hyperparameter search. All five architectures across both tasks are trained with the same hyperparameters. For each architecture and task, we trained four networks, one for each target class: binoculars, soccer ball, volcano, and zebra. In total, we trained 20 different ATNs to attack IR2. To find a good set of hyperparameters for these networks, we did a series of grid searches through reasonable parameter values for the learning rate, α, and β, using only Volcano as the target class. Those training runs were terminated after 0.025 epochs, which is only 1600 training steps with a batch size of 20. Based on the parameter search, for the results reported here, we set the learning rate to 0.0001, α = 1.5, and β = 0.01. All runs were trained for 0.1 epochs (6400 steps) on shuffled training set images, using the Adam optimizer and the TensorFlow default settings.

In order to avoid cherrypicking the best results after the networks were trained, we selected four images from the unperturbed validation set to use for the figures in this paper prior to training. Once training finished, we evaluated the ATNs by passing 1000 images from the validation set through the ATN and measuring IR2's accuracy on those adversarial examples.

5.2. Results Overview

Table 8 shows the top-1 adversarial accuracy for each of the 20 model/target combinations. The AAE approach is superior to the perturbation approach, both in terms of top-1 adversarial accuracy, and in terms of training success. Nonetheless, the results in Figures 9 and 7 show that using an architecture like IR2-Conv-FC can provide a qualitatively different type of adversary from the AAE approach. The examples generated using the perturbation approach preserve more pixels in the original image, at the expense of a small region of large perturbations.

In contrast to the perturbation approaches, the AAE architectures distribute the differences across wider regions of the image. However, IR2-Base-Deconv and IR2-Conv-Deconv tend to exhibit checkerboard patterns, which is a common problem in image generation with deconvolutions (Odena et al., 2016). The checkerboarding led us to try IR2-Resize-Conv, which avoids the checkerboard pattern, but gives smooth outputs (Figure 9). Interestingly, in all three AAE networks, many of the original high-frequency patterns are replaced with high frequencies that encode the adversarial signal.

The results from IR2-Base-Deconv show that the same network architectures perform substantially differently when trained as P-ATNs and AAE ATNs. Since P-ATNs are only learning to perturb the input, these networks are much better at preserving the original image, but the perturbations end up being focused along the edges or in the corners of the image. The form of the perturbations often manifests itself as "DeepDream"-like images, as in Figure 8. Approximately the same perturbation, in the same place, is used across all input examples. Placing the perturbations in that manner is less likely to disrupt the other top classifications, thereby keeping L_Y lower. This is in stark contrast to the AAE ATNs, which creatively modify the input, as seen in Figures 9 and 7.

5.3. Detailed Discussion

Adversarial diversity. Figure 7 shows that ATNs are capable of generating a wide variety of adversarial perturbations targeting a single network. Previous approaches to generating adversarial examples often produced qualitatively uniform results – they add various amounts of "noise" to the image, generally concentrating the noise at pixels with large gradient magnitude for the particular adversarial loss function. Indeed, Hendrik Metzen et al. (2017) recently showed that it may be possible to train a detector for previous adversarial attacks. From the perspective of an attacker, then, adversarial examples produced by ATNs may provide a new way past defenses in the cat-and-mouse game of security, since this somewhat unpredictable diversity will likely challenge such approaches to defense. Perhaps a much more interesting consequence of this diversity is its potential application for more comprehensive adversarial training, as described below.

Figure 7. Adversarial diversity. Left column: selected zoomed samples. Right 4 columns: successful adversarial examples for different target classes from a variety of ATNs. From the left: Zebra, Binoculars, Soccer Ball, Volcano. These images were selected at random from the set of successful adversaries against each target class. Unlike existing adversarial techniques, where adversarial examples tend to look alike, these adversarial examples exhibit a great deal of diversity, some of which is quite surprising. For example, consider the second image of the space shuttle in the "Zebra" column (D). In this case, the ATN made the lines on the tarmac darker and more organic, which is somewhat evocative of a zebra's stripes. Yet clearly no human would mistake this for an image of a zebra. Similarly, the dog's face in (A) has been speckled with a few orange dots (but not the background!), and these are sufficient to convince IR2 that it is a volcano. This diversity may be a key to improving the effectiveness of adversarial training, as a more diverse pool of adversarial examples may lead to better network generalization. Images A, B, and D are from AAE ATN IR2-Conv-Deconv. Images C and F are from AAE ATN IR2-Resize-Conv. Image E is from P-ATN IR2-Conv-FC.

Adversarial Training with ATNs. In Kurakin et al. (2016b), the authors show the current state-of-the-art in using adversaries for improving training. With single step and iterative gradient methods, they find that it is possible to increase a network's robustness to adversarial examples, while suffering a small loss of accuracy on clean inputs. However, it works only for the adversary the network was trained against. It appears that ATNs could be used in their adversarial training architecture, and could provide substantially more diversity to the trained model than current adversaries. This adversarial diversity might improve model test-set generalization and adversarial robustness.

Because ATNs are quick to train relative to the target network (in the case of IR2, hours instead of weeks), reliably produce diverse adversarial examples, and can be automatically checked for quality (by checking their success rate against the target network and the L_X magnitude of the adversarial examples), they could be used as follows: Train a set of ATNs targeting a random subset of the output classes on a checkpoint of the target network. Once the ATNs are trained, replace a fraction of each training batch with corresponding adversarial examples, subject to two constraints: the current classifier incorrectly classifies the adversarial example as the target class, and the L_X loss of the adversarial example is below a threshold that indicates it is similar to the original image. If a given ATN stops producing successful adversarial examples, replace it with a newly trained ATN targeting another randomly selected class. In this manner, throughout training, the target network would be exposed to a shifting set of diverse adversaries from ATNs that can be trained in a fully-automated manner.⁵,⁶

Figure 8. DeepDream-style perturbations. Four different images perturbed by IR2-Conv-FC, targeting soccer ball. The images outlined in red were successful adversarial examples against IR2. The images outlined in green did not change IR2's top-1 classification. The network has learned to add approximately the same perturbation to all images. The perturbation resembles part of a soccer ball (lower-left corner). The results are akin to those found in DeepDream-like processes (Mordvintsev et al., 2015).

DeepDream perturbations. IR2-Conv-FC exhibits interesting behavior not seen in any of the other architectures. The network builds a perturbation that generally contains spatially coherent, recognizable regions of the target class. For example, in Figure 8, a consistent soccer-ball "ghost" image appears in all of the transformed images. While the methods and goals of these perturbations are quite different from those generated by DeepDream (Mordvintsev et al., 2015), the qualitative results appear similar. IR2-Conv-FC seems to learn to distill the target network's representation of the target class in a manner that can be drawn across a large fraction of the image.⁷ This result hints at a direct relationship between DeepDream-style techniques and adversarial examples that may improve our ability to find and correct weaknesses in our models.

High frequency data. The AAE ATNs all remove high frequency data from the images when building their reconstructions. This is likely to be due to limitations of the underlying architectures. In particular, all three convolutional architectures have difficulty exactly recreating edges from the input image, due to spatial data loss introduced when downsampling and padding. Consequently, the L_X loss penalizes high confidence predictions of edge locations, leading the networks to learn to smooth out boundaries in the reconstruction. This strategy minimizes the overall loss, but it also places a lower bound on the error imposed by pixels in regions with high frequency information.

This lower bound on the loss in some regions provides the network with an interesting strategy when generating an AAE output: it can focus the adversarial perturbations in regions of the input image that have high-frequency noise. This strategy is visible in many of the more interesting images in Figure 7. For example, many of the networks make minimal modification to the sky in the dog image, but add substantial changes around the edges of the dog's face, exactly where the L_X error would be high in a non-adversarial reconstruction.

⁵ This procedure conceptually resembles GAN training (Goodfellow et al., 2014a) in many ways, but the goal is different: for GANs, the focus is on using an easy-to-train discriminator to learn a hard-to-train generator; for this adversarial training system, the focus is on using easy-to-train generators to learn a hard-to-train multi-class classifier.

⁶ Note also that we can run the adversarial example generation in this manner on unlabeled data, as described in Section 2. Miyato et al. (2016) also describe a method for using unlabeled data in a manner conceptually similar to adversarial training.

⁷ This is likely due to the final fully-connected layer, which has one weight for each pixel and channel, allowing the network to specify a particular output at each pixel.
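The batch-mixing procedure described under "Adversarial Training with ATNs" above could look roughly like the following PyTorch sketch. All names, the replacement fraction, the L_X threshold, and the choice to keep the original label for replaced examples are illustrative assumptions rather than details from the paper.

```python
import random
import torch

def mix_in_atn_examples(x, y_true, atns, classifier, frac=0.25, lx_threshold=0.05):
    """Replace a fraction of a training batch with ATN adversarial examples, subject to
    the two constraints described above: the current classifier must classify the
    adversarial example as the ATN's target class, and the L_X distance to the original
    image must stay below a threshold. `atns` maps a target class index to a trained ATN."""
    x = x.clone()
    n_replace = int(frac * x.size(0))
    with torch.no_grad():
        for i in random.sample(range(x.size(0)), n_replace):
            t, atn = random.choice(list(atns.items()))
            x_adv = atn(x[i:i + 1])
            fooled = classifier(x_adv).argmax(dim=1).item() == t
            close = torch.mean((x_adv - x[i:i + 1]) ** 2).item() < lx_threshold
            if fooled and close:
                x[i] = x_adv[0]   # keep y_true[i] unchanged (standard adversarial training)
    return x, y_true
```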

Figure 9. Architecture comparisons. Left 3: Adversarial autoencoding ATNs. Right 2: Perturbation ATNs. All five networks in this figure are trained with the same hyperparameters, apart from the target class, which varies among zebra, soccer ball, and volcano. The images in the bottom row show the absolute difference between the original image (not shown) and the adversarial examples. There are substantial differences in how the different architectures generate adversarial examples. IR2-Base-Deconv and IR2-Conv-Deconv tend to exhibit checkerboard patterns in their reconstructions, which is a common problem in image generation with deconvolutions in general. IR2-Resize-Conv avoids the checkerboard pattern, but tends to give very smooth outputs. IR2-Base-Deconv performs quite differently when trained as a P-ATN rather than an AAE ATN. The P-ATNs focus their perturbations along the edges of the image or in the corners, where they are presumably less likely to disrupt the other top classifications. In both P-ATNs, it turns out that the networks learn to mostly ignore the input and simply generate a single perturbation that can be applied to any input image without much change, although the perturbations between networks on the same target image vary substantially. This is in stark contrast to the AAE ATNs, which creatively modify each input individually, as seen here and in Figure7.
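The resize-then-convolve upsampling that IR2-Resize-Conv relies on can be sketched as a generic block like the one below (channel arguments and activation are assumptions; Table 7 lists the actual layer stack). Because every output pixel receives the same number of kernel contributions, this construction avoids the deconvolution checkerboard artifacts discussed above (Odena et al., 2016).

```python
import torch.nn as nn

def resize_conv_block(in_ch, out_ch, scale=2):
    """Upsample with a bilinear resize followed by a stride-1 convolution,
    rather than a strided deconvolution."""
    return nn.Sequential(
        nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.ReLU(),
    )
```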

Table 7. ImageNet ATN Architectures.

IR2-Base-Deconv (3.4M parameters):
    IR2 MaxPool 5a (35x35x192) → Pad (37x37x192) → Deconv (4x4x512, stride=2)
    → Deconv (3x3x256, stride=2) → Deconv (4x4x128, stride=2) → Pad (299x299x128)
    → Deconv (4x4x3) → Image (299x299x3)

IR2-Resize-Conv (3.8M parameters):
    Conv (5x5x128) → Bilinear Resize (0.5) → Conv (4x4x256) → Bilinear Resize (0.5)
    → Conv (3x3x512) → Bilinear Resize (0.5) → Conv (1x1x512) → Bilinear Resize (2)
    → Conv (3x3x256) → Bilinear Resize (2) → Conv (4x4x128) → Pad (299x299x128)
    → Conv (3x3x3) → Image (299x299x3)

IR2-Conv-Deconv (12.8M parameters):
    Conv (3x3x256, stride=2) → Conv (3x3x512, stride=2) → Conv (3x3x768, stride=2)
    → Deconv (4x4x512, stride=2) → Deconv (3x3x256, stride=2) → Deconv (4x4x128, stride=2)
    → Pad (299x299x128) → Deconv (4x4x3) → Image (299x299x3)

IR2-Conv-FC (233.7M parameters):
    Conv (3x3x512, stride=2) → Conv (3x3x256, stride=2) → Conv (3x3x128, stride=2)
    → FC (512) → FC (268203) → Image (299x299x3)

Table 8. IR2 ATN Performance

P-ATN (target class top-1 accuracy)     Binoculars  Soccer Ball  Volcano  Zebra
IR2-Base-Deconv                            66.0%       56.5%       0.2%    43.2%
IR2-Conv-FC                                79.9%       78.8%       0.0%    85.6%

AAE ATN (target class top-1 accuracy)   Binoculars  Soccer Ball  Volcano  Zebra
IR2-Base-Deconv                            83.0%       92.1%      88.1%    88.2%
IR2-Resize-Conv                            69.8%       61.4%      91.1%    80.2%
IR2-Conv-Deconv                            56.6%       75.0%      87.3%    79.1%

6. Conclusions and Future Work

Current methods for generating adversarial samples involve a gradient descent procedure on individual input examples. We have presented a fundamentally different approach to finding examples by training neural networks to convert inputs into adversarial examples. Our method is efficient to train, fast to execute, and produces remarkably diverse, successful adversarial examples.

Future work should explore the possibility of using ATNs in adversarial training. A successful ATN-based system may pave the way towards models with better generalization and robustness. Hendrik Metzen et al. (2017) recently showed that it is possible to detect when an input is adversarial, for current types of adversaries. It may be possible to train such detectors on ATN output. If so, using that signal as an additional loss for the ATN may improve the outputs. Similarly, exploring the use of a GAN discriminator during training may improve the realism of the ATN outputs. It would be interesting to explore the impact of ATNs on generative models, rather than just classifiers, similar to work in Kos et al. (2017). Finally, it may also be possible to train ATNs in a black-box manner, similar to recent work in Tramèr et al. (2016); Baluja et al. (2015), or using REINFORCE (Williams, 1992) to compute gradients for the ATN using the target network simply as a reward signal.

References

Baluja, Shumeet, Covell, Michele, and Sukthankar, Rahul. The virtues of peer pressure: A simple method for discovering high-value mistakes. In Int. Conf. on Computer Analysis of Images and Patterns, pp. 96–108. Springer, 2015.

Carlini, Nicholas and Wagner, David. Towards evaluating the robustness of neural networks. arXiv preprint arXiv:1608.04644, 2016.

Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.

Gatys, Leon A., Ecker, Alexander S., and Bethge, Matthias. A neural algorithm of artistic style. CoRR, abs/1508.06576, 2015. URL http://arxiv.org/abs/1508.06576.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014a.

Mordvintsev, A., Olah, C., and Tyka, M. Inceptionism: Going deeper into neural networks. http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html, 2015.

Nguyen, Anh Mai, Yosinski, Jason, and Clune, Jeff. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. CoRR, abs/1412.1897, 2014. URL http://arxiv.org/abs/1412.1897.

Goodfellow, Ian J, Shlens, Jonathon, and Szegedy, Christian. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014b.

Odena, Augustus, Dumoulin, Vincent, and Olah, Chris. Deconvolution and checkerboard artifacts. Distill, 2016. http://distill.pub/2016/deconv-checkerboard.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.

Hendrik Metzen, J., Genewein, T., Fischer, V., and Bischoff, B. On Detecting Adversarial Perturbations. ArXiv e-prints, February 2017.

Papernot, Nicolas, McDaniel, Patrick, Jha, Somesh, Fredrikson, Matt, Celik, Z Berkay, and Swami, Ananthram. The limitations of deep learning in adversarial settings. In Proceedings of the 1st IEEE European Symposium on Security and Privacy, 2015.

Papernot, Nicolas, McDaniel, Patrick, and Goodfellow, Ian. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016a.

Johnson, Justin, Alahi, Alexandre, and Fei-Fei, Li. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 2016.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. 2015.

Kos, Jernej, Fischer, Ian, and Song, Dawn. Adversarial examples for generative models. arXiv preprint arXiv:1702.06832, 2017.

Papernot, Nicolas, McDaniel, Patrick, Goodfellow, Ian, Jha, Somesh, Celik, Z Berkay, and Swami, Ananthram. Practical black-box attacks against deep learning systems using adversarial examples. arXiv preprint arXiv:1602.02697, 2016b.

Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian, and Fergus, Rob. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Kurakin, Alexey, Goodfellow, Ian J., and Bengio, Samy. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016a.

Kurakin, Alexey, Goodfellow, Ian J., and Bengio, Samy. Adversarial machine learning at scale. CoRR, abs/1611.01236, 2016b.

LeCun, Yann, Cortes, Corinna, and Burges, Christopher JC. The MNIST database of handwritten digits, 1998.

Linden, Alexander and Kindermann, J. Inversion of multilayer nets. In Neural Networks, 1989. International Joint Conference, pp. 425–430. IEEE, 1989.

Liu, Yanpei, Chen, Xinyun, Liu, Chang, and Song, Dawn. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016. URL http://arxiv.org/abs/1611.02770.

Szegedy, Christian, Ioffe, Sergey, Vanhoucke, Vincent, and Alemi, Alex. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016.

Tramèr, Florian, Zhang, Fan, Juels, Ari, Reiter, Michael K, and Ristenpart, Thomas. Stealing machine learning models via prediction APIs. In USENIX Security, 2016.

Ulyanov, Dmitry, Lebedev, Vadim, Vedaldi, Andrea, and Lempitsky, Victor S. Texture networks: Feed-forward synthesis of textures and stylized images. CoRR, abs/1603.03417, 2016. URL http://arxiv.org/abs/1603.03417.

Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Miyato, Takeru, Maeda, Shin-ichi, Koyama, Masanori, Nakae, Ken, and Ishii, Shin. Distributional smoothing with virtual adversarial training. In International Conference on Learning Representations, 2016.

Moosavi-Dezfooli, Seyed-Mohsen, Fawzi, Alhussein, Fawzi, Omar, and Frossard, Pascal. Universal adversarial perturbations. CoRR, abs/1610.08401, 2016a.

Moosavi-Dezfooli, Seyed-Mohsen, Fawzi, Alhussein, and Frossard, Pascal. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE CVPR, pp. 2574–2582, 2016b.