UNIVERSITY OF CALIFORNIA Los Angeles

Initializing Hard-Label Black-Box Adversarial Attacks Using Known Perturbations

A thesis submitted in partial satisfaction of the requirements for the degree Master of Science in Computer Science

by

Shaan Karan Mathur

2021

© Copyright by
Shaan Karan Mathur
2021

ABSTRACT OF THE THESIS

Initializing Hard-Label Black-Box Adversarial Attacks Using Known Perturbations

by

Shaan Karan Mathur
Master of Science in Computer Science
University of California, Los Angeles, 2021
Professor Sriram Sankararaman, Chair

We empirically show that an adversarial perturbation for one image can be used to accelerate attacks on another image. Specifically, we show how to improve the initialization of the hard-label black-box attack Sign-OPT, operating in the most challenging attack setting, by using previously known adversarial perturbations. Whereas Sign-OPT initializes its attack by searching along random directions for the nearest boundary point, we search for the nearest boundary point along the direction of previously known perturbations. This initialization strategy leads to a significant drop in initial distortion on both the MNIST and CIFAR-10 datasets. Identifying which images share similar vulnerabilities is a promising direction for future research.

The thesis of Shaan Karan Mathur is approved.

Jonathan Kao

Cho-Jui Hsieh

Sriram Sankararaman, Committee Chair

University of California, Los Angeles

2021

To my family and friends, and the between.

TABLE OF CONTENTS

1 Introduction ...... 1

2 Background ...... 5

2.1 White-Box Attacks ...... 5

2.2 Black-Box Attacks ...... 7

2.2.1 Soft-Label Black-Box Attacks ...... 7

2.2.2 Hard-Label Black-Box Attacks ...... 8

2.3 Transfer-Based Attacks ...... 10

3 Initializing Sign-OPT with Known Perturbations ...... 11

3.1 Algorithm ...... 11

3.2 Choosing the Known Perturbations ...... 13

4 Experimental Results ...... 15

4.1 Attacked Architectures ...... 15

4.2 Known Perturbations Reduce Initial Distortion ...... 16

4.2.1 MNIST ...... 16

4.2.2 CIFAR-10 ...... 19

4.3 Known Perturbations Outperform Random Perturbations ...... 23

4.4 Conclusion and Future Work ...... 32

References ...... 33

LIST OF FIGURES

2.1 FGSM attack algorithm, misclassifying a panda as a gibbon...... 6

4.1 Attacking 30 random MNIST images using the known perturbations of 20 other random MNIST images...... 17

4.2 Non-cherry picked examples of images we attack in the random MNIST run (shown examples were chosen at random). The left column shows the original image; the middle column visualizes our improved initialization; the right column shows the Sign-OPT initialization. Notice our initialization targets only certain regions of the image, while the Sign-OPT initialization tends to perturb each pixel. Sign-OPT was unable to generate an adversarial example in the last row, but our approach was able to...... 18

4.3 Attacking 30 MNIST images with class label 1 using the known perturbations of 20 other MNIST images with class label 1...... 19

4.4 Attacking visually similar MNIST images. Images are selected by finding 50 nearest neighbors with minimal average distance, using feature-vector distance as a measure of visual similarity. Then 30 of those images are attacked using perturbations of the other 20 images...... 20

4.5 Attacking 30 random CIFAR-10 images using the known perturbations of 20 other random CIFAR-10 images...... 20

4.6 Non-cherry picked examples of images we attack in the random CIFAR-10 run (shown examples were chosen at random). The left column shows the original image; the middle column visualizes our improved initialization; the right column shows the Sign-OPT initialization. Notice our initialization targets only certain regions of the image, while the Sign-OPT initialization tends to perturb each pixel...... 21

4.7 Attacking 30 CIFAR-10 images with class label 5 using the known perturbations of 20 other CIFAR-10 images with class label 5...... 22

4.8 Attacking visually similar CIFAR-10 images. Images are selected by finding 50 nearest neighbors with minimal average distance, using feature-vector distance as a measure of visual similarity. Then 30 of those images are attacked using perturbations of the other 20 images...... 23

4.9 Distribution of initial distortions found using randomness (blue) and found using known perturbations (orange). On average, known perturbations find smaller initial distortions...... 24

4.10 Distribution of differences between random initialization and known-perturbation initializations for all images considered. In all but 4 images, the known-perturbation initializations outperform random initialization, improving by about 1.927 on average...... 25

4.11 The top 3 images are the best 3 of the 138 images to precompute perturbations for the class 0 (airplane) images considered. The bottom 3 images are analogously the worst 3 to precompute. Counterintuitively, notice that none of the top 3 are of airplanes, but one of the bottom 3 is...... 26

4.12 Attacking images of classes 0, 4, 5, and 9 exclusively using perturbations from each class. The top boxplots (a,b) show that using perturbations from the same class (e.g. using perturbations for class 0 images to initialize attacks on class 0 images) does not always lead to a minimal average initial distortion. On the other hand, the lower boxplots (c,d) show that it sometimes can lead to the minimal average initial distortion...... 29

4.13 Examining how visual similarity correlates with initial distortion when using the perturbation of one image to attack another image on average. All similarity-distortion pairs were placed into 100 bins and averaged within each bin; bins with fewer than 200 pairs were discarded, leaving a median bin size of 4734 pairs. Very dissimilar images tend to have poor initializations on average; however, slightly dissimilar images actually do better than very similar images on average...... 30

4.14 Figures (a) and (b) show how visual similarity affects the initial distortion when attacking all 3618 images using two different perturbations; the correlations are very noisy and also typical for most choices of perturbation. Figures (c) and (d) analogously show how visual similarity affects the initial distortion when attacking two different images using all 139 perturbations; again they are noisy and also typical...... 31

LIST OF TABLES

4.1 Finding the labels of the best (and worst) 3 images to precompute known perturbations for when attacking images of each class. Notice that the best 3 images for classes 0, 2, and 4 don't include themselves. For class 0 the best triplet including a class 0 image is worse than 10 other triplets; for class 2, 259 other triplets are better; for class 4, 13 others. The most useful perturbations to precompute need not come from the same class...... 27

ACKNOWLEDGMENTS

I would not be sitting here writing my thesis if it weren’t for the countless people who have helped shape my life and my career. It isn’t possible to fully express the magnitude of their impact on me, but I will take this time to thank many of them.

My advisor Prof. Sriram Sankararaman has been a great research mentor to me, not only providing me feedback and guidance on my research ideas but also emphasizing the importance of patience and resilience in research. Failure is part of the process, and I am very grateful to Sriram for teaching me that lesson. I am also thankful for Prof. Cho-Jui Hsieh for helping me devise such an interesting project, and for creating a space for me to make my own research decisions while still guiding me along the right path. An additional thank you to Keerti Kareti for his contributions early on in this work as well.

My passion for Computer Science and Mathematics flourished at UCLA. I am grateful to Prof. Jonathan Kao for providing me excellent instruction in deep learning, and for allowing me to return as a TA to help build our generation of deep learning engineers and researchers. I am so thankful for my students in CS 111, CS 180, and ECE C247; teaching them was one of the most fulfilling experiences I have had. I am eternally grateful to every Computer Science and Mathematics professor I have had, their courses awakening in me a passion I never could have imagined.

My young adult life was the springboard that launched me into this life. Alex Grass and Rahil Bhatnagar, my relationship with you guys gives me a strength I hope to take with me for the rest of my life. Garvit Pugalia, Abhinava Shriraam, Sparsh Arora, Yash Chaudhary - you were there since the beginning of my undergraduate journey, and have been a family for me when I was most afraid of being alone. And to my girlfriend Shivangi Kulshreshtha: you have been a light for me in the dark, a best friend who is always on my team, and an extraordinarily brilliant mind always ready to discuss the nature of the universe with me. You’ve shaped who I am today, Shivi.

And finally to my family. To my younger siblings Devan and Saira Mathur, I see in both of you the marks of brilliance and, most importantly, the hearts of good people; I am grateful for your belief and support and am excited to watch your extraordinary lives unfold. To my Mama, Mandy Mathur, and Papa, Raj Mathur - you built the core of who I am. Everything I have done, can do, and will do is because of you. Thank you for the many sacrifices you made that I probably will never be able to fully comprehend. The three of us will show you that they were worth it.

CHAPTER 1

Introduction

It has been shown that neural networks are susceptible to adversarial examples. An adversarial example is an input to a machine learning model that has been perturbed, typically with a small amount of carefully crafted noise, so that the machine learning model misclassifies the original input. A classic pedagogical example would be the adversarial vulnerability of ResNet-50 [HZR15], where human-imperceptible noise added to an image of a panda coerces the model into classifying it as a gibbon [GSS15] (see Figure 2.1). This has led to the field of adversarial robustness, which seeks both to discover new methods of crafting small adversarial perturbations and to develop robust training algorithms that guard against these attacks. This thesis primarily focuses on improving the former, with the hope that this begets better robustness algorithms in the future.

The subject of adversarial robustness is critical in a world growing more dependent on machine learning systems. For example, a world with self-driving vehicles is becoming more realistic every day thanks to progress in deep learning. Although this is an exciting step for humanity, these vehicles could be weaponized by an adversary who can successfully develop adversarial examples for those vehicles. In 2017 Eykholt et al. modified a stop sign with black and white stickers, which disturbingly tricked their model into believing the stop sign was actually an 80 mph speed limit sign [EEF18]. As deep learning continues to permeate various industries, we may have to worry about misrepresentation of illness, market manipulation by disrupting financial analysis, or coercion of armed systems into attacking the wrong target. Exploring adversarial vulnerability is essential if we want to make headway in preventing these dark possibilities.

There are different adversarial attack settings, each based upon what resources the adversary has at its disposal. Broadly speaking, this splits the attack possibilities into one of two classes: white-box attacks, which give the attacker access to the attacked model's parameters and architecture; and black-box attacks, which give the attacker access only to the model's outputs. White-box attacks give the attacker the most power, and make computation of adversarial perturbations relatively straightforward using gradient-based methods (e.g. a perturbation may align with the gradient of the loss with respect to the input). Popular methods of attack in this setting are the C&W [CW17] and PGD [MMS19] attacks.

Although white-box attacks are a good worst-case metric to evaluate the robustness of a model, this attack model is fairly unrealistic. A more realistic scenario would involve an external attacker who does not know the model architecture and can only query the model for outputs. This is analogous to modern API-based Machine Learning as a Service platforms, whose implementation is hidden from the user. One immediate way to attack these black-box models is via transfer-based methods, where the attacker runs a white-box attack on her own substitute model that performs the same task. The adversarial perturbation that works against the substitute model often works against the original model as well, as first shown by Papernot et al. in [PMG17]. There are also query-based attack algorithms that avoid the cost of training a new model by strategically querying the model to help craft an adversarial perturbation. This query-based black-box scenario can be further partitioned based upon the form of the model output. The soft-label black-box setting gives the attacker access to the model's output probabilities or logit scores. Although a gradient cannot be directly computed without the model's parameters, it can be approximated from these scores via finite differences. This flavor of finite-difference attacks was first developed with the ZOO attack [CZS17].

The most difficult attack setting is one in which the attacker only gets the output label from the model without any logits or probabilities: the hard-label black-box setting. Gradients of the original function cannot be as easily computed here, since there is much less information available to the attacker about the original function's topology. Brendel et al. provided the first foundational step in approaching this attack setting by introducing Boundary Attack [BRB18], which essentially involved a random walk along the decision boundary to find a minimal distortion perturbation. Though this algorithm did not have a convergence guarantee, it showed that following the boundary is a key idea in this attack setting. By framing the distance between the input and the boundary as an optimization problem, Cheng et al. were able to reintroduce gradient-based methods into this setting in their so-called OPT attack [CLC18], which was made more query-efficient by Cheng et al. in the Sign-OPT attack [CSC20]. Effectively, these methods initialize by selecting the closest boundary point along k random directions, and then follow the curvature of the boundary to find the closest boundary point.

In this thesis, we seek to significantly improve the initialization of [CSC20]. Rather than choosing the closest boundary point along k random directions, we select our directions by using previously known perturbations that are adversarial for other inputs. Concretely, suppose we already have run Sign-OPT on 20 previous inputs to compute 20 perturbations. When we run Sign-OPT for the 21st time, we will initialize by selecting the closest boundary point along the 20 directions specified by those previous perturbations. It turns out this approach consistently outperforms selecting directions randomly, and we demonstrate this empirically.

Our contributions are as follows:

• We empirically demonstrate that re-using known perturbations is a superior initialization strategy to just using randomness. We first demonstrate examples of this improvement on MNIST and CIFAR-10, and quantify it extensively on the CIFAR-10 dataset.

• We provide an initial study into how choosing which images to precompute perturbations for affects the initial distortion found. We explore natural choices like class similarity and visual similarity, and provide evidence that these heuristics alone are not enough to get optimal results.

CHAPTER 2

Background

Szegedy et al. were the first to observe that neural networks were susceptible to adversarial perturbations [SZS14]. Since this discovery, many adversarial attack methods have been developed. They fall largely under two categories: white-box attacks, which give the attacker access to model parameters and architecture; and black-box attacks, which give the attacker access only to model outputs.

2.1 White-Box Attacks

The white-box attack setting gives the adversary full knowledge of the model, leading to a powerful, albeit unrealistic, adversary. Since the adversary has the model, they can perform backpropagation to directly compute perturbations to the input that maximize the loss (untargeted attack) or minimize the loss for some target label (targeted attack). Though this setting is unrealistic, it offers a worst-case robustness metric when evaluating a model. Moreover, adversarial training involves minimizing the loss of the worst-case perturbation for an input, and so powerful white-box methods can be used during training to increase robustness.

One of the first popular white-box attack algorithms was the Fast Gradient Sign Method (FGSM) introduced by Goodfellow et al. [GSS15]. Let θ be the parameters of a model, x the input to the model, y the label of x, and J(θ, x, y) be the cost used to train the neural network. The adversarial perturbation η for x with L∞ norm ε would be computed via

η = ε · sign(∇_x J(θ, x, y)).

You can see an example of this in Figure 2.1. As FGSM is only a single-step attack, I-FGSM [KGB17] naturally extends the algorithm into multiple iterations.

Figure 2.1: FGSM attack algorithm, misclassifying a panda as a gibbon.
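As a concrete illustration, a minimal PyTorch sketch of the FGSM step follows; the model, x, y, and epsilon arguments are illustrative placeholders rather than anything defined in this thesis, and images are assumed to lie in [0, 1].

    import torch
    import torch.nn.functional as F

    def fgsm_perturbation(model, x, y, epsilon):
        """One FGSM step: eta = epsilon * sign of the input gradient of the loss."""
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)   # J(theta, x, y)
        loss.backward()
        eta = epsilon * x.grad.sign()         # the L-infinity bounded perturbation
        x_adv = (x + eta).clamp(0.0, 1.0)     # keep the adversarial example a valid image
        return x_adv.detach(), eta.detach()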

The C&W algorithm [CW17] is another popular white-box attack that reformulates Szegedy et al.'s original optimization problem for finding adversarial examples into a more easily solvable form. Let x be the input and δ be a perturbation; the authors define a function f such that f(x + δ) ≤ 0 if and only if x + δ is an adversarial example. They change the objective function so that it weighs both the L_p-norm of the perturbation and the function f:

argmin_δ (‖δ‖_p + c · f(x + δ))   subject to   x + δ ∈ [0, 1]^n.

Projected Gradient Descent (PGD) is another iterative white-box attack [MMS19] which repeatedly perturbs the input using the gradient of the loss, projecting the adversarial example onto a local L_p-ball at every step to make sure the perturbation is minimal. Every iterate is computed using the following step (untargeted attack):

x_{k+1} = Π_{B_p(x_0, ε)} (x_k + t_k ∇_{x_k} J(x_k, y_0)).

Here B_p(x_0, ε) is the L_p-ball of radius ε around the input x_0, J is the loss function, and y_0 is the correct label for x_0.
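A minimal sketch of the PGD iteration for the common L∞ case, where the projection onto B_∞(x_0, ε) reduces to an elementwise clamp; the sign step shown is the usual L∞ variant rather than the raw gradient step written above, and all names are illustrative.

    import torch
    import torch.nn.functional as F

    def pgd_linf(model, x0, y0, epsilon, step_size, num_steps):
        """Untargeted PGD on the L-infinity ball of radius epsilon around x0."""
        x0 = x0.detach()
        x = x0.clone()
        for _ in range(num_steps):
            x.requires_grad_(True)
            loss = F.cross_entropy(model(x), y0)
            grad = torch.autograd.grad(loss, x)[0]
            x = x.detach() + step_size * grad.sign()    # ascent step on the loss
            x = x0 + (x - x0).clamp(-epsilon, epsilon)  # project onto the L-inf ball around x0
            x = x.clamp(0.0, 1.0).detach()              # stay inside the image domain
        return x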

2.2 Black-Box Attacks

The black-box attack setting is a more realistic and challenging paradigm, as the attacker no longer has access to internal model information like parameters and architecture. Instead the attacker is only allowed to query the model and get model outputs. Soft-label black-box attacks refer to the case when the model outputs are probabilities or logits, which provide some information about the model's output topology (allowing gradient estimation via finite differences). The strictly harder situation is the hard-label black-box attack model, where the model output is just the final label assigned by the model; here, following the decision boundary becomes key to identifying good adversarial examples. We outline some of the major works in both areas here. Also applicable here are transfer-based attacks, which create perturbations by using white-box attacks on substitute models; this is further described in Section 2.3.

2.2.1 Soft-Label Black-Box Attacks

Though transfer-based attacks are effective, Chen et al. demonstrated that they typically suffer from large-distortion perturbations. Motivated by this, they used zeroth order optimization (ZOO) via symmetric differences of the soft labels to estimate the gradient and create adversarial perturbations [CZS17]. ZOO was very effective, with distortions comparable to the white-box C&W. However it was not query efficient, so Ilyas et al. used Natural Evolutionary Strategies (NES) to estimate the gradient and perform PGD to find the adversarial example in 2-3 times fewer queries [IEA18]. Ilyas et al. [IEM19] then showed that contemporary state-of-the-art black-box attacks were close to optimal, so that further gains require prior information (which they exploited to boost query efficiency by factors of 2-5 using gradient priors). Tu et al. recently developed their Autoencoder-based Zeroth Order Optimization Method (AutoZOOM), which involves adaptive random gradient strategies as well as dimension reduction to further reduce mean query complexity by 93% [TTC20].

2.2.2 Hard-Label Black-Box Attacks

The hard-label setting is the most challenging since there is no obvious gradient estimation that can be done here. Instead, the main query-based approach involves probing the decision boundary to find a boundary point closest to the original input (i.e. a minimal distortion boundary point). Brendel et al. introduced one of the first query-based approaches, called Boundary Attack [BRB18]. Boundary Attack identifies a decision boundary point around the input (via a binary search between the input and another input with a differing label), and then repeatedly performs a random step plus a small step towards the original input. Of course this sequence of steps may not necessarily lead to an adversarial example, and so the step sizes have to be dynamically tuned to account for this.

Troubled by the lack of convergence guarantees, Cheng et al. [CLC18] cleverly reformulated the decision boundary problem as an optimization problem, allowing them to use finite differences to approximate the gradient. More formally, let x be the input, y be its label, f be the model being attacked, and θ_t be the direction of the adversarial perturbation at time step t. An optimal adversarial example can be found by finding the direction θ* which corresponds to the closest boundary point to x:

θ* = argmin_θ g(θ),   where   g(θ) = min_{λ>0} { λ : f(x + λ · θ/‖θ‖) ≠ y }.

Put in words, g(θ_t) represents the distance from x to the boundary point along direction θ_t. Since g(θ_t) can be directly computed (approximately) using the aforementioned binary search procedure, it is possible to optimize g(θ_t) by estimating the gradient with the finite difference

∇g ≈ (g(θ_t + ε) − g(θ_t − ε)) / (2ε).

This hard-label black-box attack algorithm, called OPT, vastly improved the query efficiency over Boundary Attack. Note that the targeted setting with target class t is obtained just by adjusting the condition of success:

θ* = argmin_θ g(θ),   where   g(θ) = min_{λ>0} { λ : f(x + λ · θ/‖θ‖) = t }.
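A minimal sketch of how g(θ) can be evaluated with hard-label queries alone: grow λ until the prediction flips, then binary-search the crossing point. Here predict is a hypothetical function returning the model's label, and the growth schedule and tolerance are illustrative choices rather than those of the OPT implementation.

    import numpy as np

    def g(predict, x, y, theta, tol=1e-3, lam_init=1.0, max_doublings=20):
        """Distance from x to the decision boundary along direction theta,
        using only hard-label queries predict(.) that return a class label."""
        d = theta / np.linalg.norm(theta)
        lam_hi = lam_init
        for _ in range(max_doublings):        # grow lambda until the label flips
            if predict(x + lam_hi * d) != y:
                break
            lam_hi *= 2.0
        else:
            return np.inf                     # no boundary found along this direction
        lam_lo = 0.0
        while lam_hi - lam_lo > tol:          # binary search between clean and flipped points
            mid = (lam_lo + lam_hi) / 2.0
            if predict(x + mid * d) != y:
                lam_hi = mid
            else:
                lam_lo = mid
        return lam_hi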

Cheng et al. further optimized this algorithm in their Sign-OPT attack [CSC20], where the query efficiency is further improved by computing the sign of the finite difference with a single query. Specifically, the algorithm samples several random directions u_1, ..., u_Q from a Gaussian or uniform distribution, and then computes sign(g(θ_t + ε u_k) − g(θ_t)) for each k. The algorithm then uses the gradient estimate

∇g ≈ Σ_{k=1}^{Q} sign(g(θ_t + ε u_k) − g(θ_t)) · u_k

during the learning step, greatly reducing the number of queries.

Algorithm 1: Sign-OPT
    input : Hard-label model f, original input x, initial direction θ_0
    output: Adversarial perturbation δ, such that x + δ is an adversarial example
    for t = 0, ..., T do
        Randomly sample u_1, ..., u_Q from a Gaussian or uniform distribution;
        Compute ĝ ← (1/Q) Σ_{q=1}^{Q} sign(g(θ_t + ε u_q) − g(θ_t)) · u_q;
        Update θ_{t+1} ← θ_t − η ĝ;
        Evaluate g(θ_{t+1}) using binary search;
    end
    δ ← g(θ_T) · θ_T / ‖θ_T‖;
    return δ;
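A NumPy sketch of the sign-based gradient estimate above; g_of is assumed to be a callable such as the g sketch earlier. For clarity each sign is obtained here from a full evaluation of g, whereas Sign-OPT itself obtains each sign with a single extra query.

    import numpy as np

    def signopt_grad_estimate(g_of, theta, Q=20, eps=1e-3):
        """Estimate grad g(theta) by averaging sign(g(theta + eps*u) - g(theta)) * u
        over Q random directions u."""
        g_theta = g_of(theta)
        grad = np.zeros_like(theta)
        for _ in range(Q):
            u = np.random.randn(*theta.shape)             # random Gaussian direction
            s = np.sign(g_of(theta + eps * u) - g_theta)  # the attack gets this sign with one query
            grad += s * u
        return grad / Q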

Though we’ve discussed how the gradient update step works in OPT and Sign-OPT, we have not mentioned how these algorithms are best initialized. The Sign-OPT authors initial-

ize θ0 by selecting 100 random directions and choosing the one with minimal distance to the

boundary. It is conceivable that we can impose some sort of prior on θ0 that could possibly lead to better initialization (where better means smaller distortion upon initialization). This

∗ thesis demonstrates that re-using the optimal θ for other inputs to initialize θ0 for brand new inputs often leads to superior initializations.

2.3 Transfer-Based Attacks

In the adversarial example literature, transfer-based attacks typically refer to the phenomenon that adversarial examples for one model tend to be adversarial for other models. Papernot et al. first demonstrated this attack methodology in the hard-label setting, where they trained a substitute model, crafted an adversarial example for the substitute model, and then used that same adversarial example against the attacked model [PMG17]. A major disadvantage of this approach is having to go through the entire training pipeline: acquisition of training data, training, hyperparameter search, etc.

Absent from the literature, however, is a study of how an adversarial example for one image can help generate an adversarial example for another image. To the best of our knowledge, this work is the first to do so.

CHAPTER 3

Initializing Sign-OPT with Known Perturbations

Query-based attacks in the hard-label black-box setting are characterized by a search around the decision boundary. But how do we locate the right point on the decision boundary to start with? In attacks like Sign-OPT, the algorithm first samples perturbations θ_1, ..., θ_k from the unit normal distribution. If the perturbed example x + θ_i is misclassified, a binary search is conducted between the input x and the adversarial example x + θ_i to locate the boundary point. The boundary point with the least distortion from x (minimal L_2 distance) is then selected as our initial point for searching along the boundary.

In this chapter we will describe an alternative initialization strategy that uses adversarial perturbations for other inputs in place of the normal-distribution sampling procedure. We will discuss different hypotheses about how to select which perturbations to precompute, which we will test in our empirical study.

3.1 Algorithm

We will outline our algorithm that modifies Sign-OPT to re-use known perturbations. In describing this algorithm, we have a couple of design choices. First, we can either formulate our attack as acting on a single input (which is typical and is what Cheng et al. [CSC20] do), or as an attack on several inputs where the perturbations from earlier inputs are used to attack later inputs. Second, we can either make our algorithm exclusively use known perturbations, or we can search through both known perturbations and random directions to guarantee our approach is at least as good as the random approach. In a practical real-world attack, the natural choice would be to attack a group of inputs where both randomness and known perturbations are used for initialization.

Our empirical study will make use of the single-input formulation, using exclusively known perturbations in our initialization step. Specifically, our attack acts on a single input, assumes we are given some k precomputed perturbations for other inputs beforehand, and uses no randomness in the initialization step. Since the only difference between Sign-OPT and our approach will be the method of initialization, we get a more straightforward comparison between the two.

To describe our algorithm, we will use and extend the notation of Cheng et al. Let x be the input, y be its label, f be the model being attacked, and θ_t be the direction of the adversarial perturbation at time step t. An optimal adversarial example can be found by finding the direction θ* which corresponds to the closest boundary point to x:

θ* = argmin_θ g(θ),   where   g(θ) = min_{λ>0} { λ : f(x + λ · θ/‖θ‖) ≠ y }.

Put in words, g(θ_t) represents the distance from x to the boundary point along direction θ_t.

With this notation in mind, our algorithm proceeds as follows. Given a hard-label black-box model f and input x, we are trying to find an untargeted adversarial perturbation δ so that f(x + δ) ≠ f(x). We are also given previously known perturbations δ_1, ..., δ_k. For each previously known δ_i, we compute g(δ_i), the distance from x to the decision boundary along the direction δ_i. Notice that this boundary point x + g(δ_i) · δ_i/‖δ_i‖ is an adversarial example for x with a distortion of g(δ_i). The δ_i corresponding to a minimal-distortion adversarial example gives us a good initial direction θ_0 = δ_i/‖δ_i‖ with which to initialize the Sign-OPT attack.

Algorithm 2: Sign-OPT with Known Perturbations - Single Input (Our Sign-OPT)
    input : Model f, input x, known perturbations δ_1, ..., δ_k
    output: Adversarial perturbation δ, such that x + δ is an adversarial example
    θ_0, min_distortion ← None, ∞;
    for i ← 1, ..., k do
        if g(δ_i) < min_distortion then
            min_distortion ← g(δ_i);
            θ_0 ← δ_i / ‖δ_i‖;
        end
    end
    if min_distortion = ∞ then
        Unable to find perturbation, fail.
    else
        δ ← Sign-OPT(f, x, θ_0);
        return δ;
    end

In a real implementation of Algorithm 2, g(δ_i) is calculated with a binary search between x and some misclassified x′ = x + λ · δ_i/‖δ_i‖ where f(x) ≠ f(x′). Finding this misclassified point x′ amounts to a search over values of λ; our implementation (quite arbitrarily) tries λ = 1, 1.5, (1.5)^2, ..., (1.5)^9. It's also worth noting that this commonly used binary search approach is imperfect. Along a particular direction there may be, in a multi-class setting, multiple decision boundaries with multiple classes. Although a binary search may find the boundary with one class, there may be another boundary with another class that is closer to x. Nevertheless the binary search approach is very effective at finding a boundary point, and so we keep with convention in using it.
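A minimal sketch of this initialization step, combining the λ schedule above with the binary search; predict is a hypothetical hard-label query function rather than part of the Sign-OPT codebase, and the tolerance is an illustrative choice.

    import numpy as np

    def init_from_known_perturbations(predict, x, y, deltas, tol=1e-3):
        """Select theta_0 as the known-perturbation direction with the smallest
        boundary distance g(delta_i); predict(.) returns the hard label of a point."""
        best_theta, best_dist = None, np.inf
        for delta in deltas:
            d = delta / np.linalg.norm(delta)
            lam_hi = None
            for lam in [1.5 ** k for k in range(10)]:   # lambda = 1, 1.5, ..., (1.5)^9
                if predict(x + lam * d) != y:
                    lam_hi = lam
                    break
            if lam_hi is None:
                continue                                # no boundary found along this direction
            lam_lo = 0.0
            while lam_hi - lam_lo > tol:                # binary-search the crossing point
                mid = (lam_lo + lam_hi) / 2.0
                if predict(x + mid * d) != y:
                    lam_hi = mid
                else:
                    lam_lo = mid
            if lam_hi < best_dist:
                best_dist, best_theta = lam_hi, d
        return best_theta, best_dist                    # best_theta is None if every direction failed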

3.2 Choosing the Known Perturbations

An essential point of discussion is how to best select the k perturbations that are precomputed for Algorithm 2. Given that our original input is x, we want to select other inputs x_1, ..., x_k such that the adversarial perturbations for these inputs are also adversarial, and of small distortion, for x. Figuring out the optimal choice of x_1, ..., x_k within the input domain is a challenging problem; we make headway into this by trying out a few natural methods.

• Inputs with Shared Label. One natural choice is to select x_1, ..., x_k to have the same class label as x. The crude assumption here would be that inputs with the same label might have similar vulnerabilities in their decision boundary.

• Inputs that have Similar Features. Another natural idea is that adversarial perturbations might transfer across inputs that have the same features. For instance, images that are visually similar might have similar vulnerabilities around their decision boundaries. These inputs can be selected by finding the feature vectors of our inputs and choosing the ones nearest to x (a short sketch of this selection follows the list).

• Random Inputs. To evaluate whether these previous heuristics are doing anything meaningful, we will also want to see how a random selection of x_1, ..., x_k affects initial distortion.
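As referenced in the list above, a minimal sketch of the feature-similarity selection; feature_fn stands for a hypothetical function returning an image's penultimate-layer feature vector, and candidates is the pool of images to choose from.

    import numpy as np

    def select_similar_inputs(feature_fn, x, candidates, k=20):
        """Pick the k candidate images whose penultimate-layer features are closest
        (in L2 distance) to those of x."""
        fx = feature_fn(x)
        dists = np.array([np.linalg.norm(feature_fn(c) - fx) for c in candidates])
        nearest = np.argsort(dists)[:k]                # indices of the k most similar candidates
        return [candidates[i] for i in nearest]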

CHAPTER 4

Experimental Results

We will perform an empirical study comparing Algorithm 2 to Sign-OPT. For illustrative purposes, we will first look at some examples of attacks on MNIST and CIFAR-10 images in Section 4.2, as well as explore the different image grouping strategies mentioned in Section 3.2. Then in Section 4.3 we will narrow our focus to CIFAR-10 and provide comprehensive evidence that re-using known perturbations is an initialization strategy superior to random initialization.

To perform our experiments using known perturbations, we have to compute adversarial examples beforehand. In our experiments we will compute these adversarial examples just using regular Sign-OPT, as this is a natural choice for someone attacking a hard-label black-box.

4.1 Attacked Architectures

We will be attacking both an MNIST classifier and a CIFAR-10 classifier. Both of these classifiers are taken directly from Cheng et al.'s GitHub repository. The MNIST classifier is a basic CNN with two convolutional layers, two max-pooling layers, and two fully-connected layers. The CIFAR-10 classifier is an implementation of VGG16 [SZ15].

4.2 Known Perturbations Reduce Initial Distortion

In this section we will be directly comparing Algorithm 2 to Sign-OPT. We will take 50 images from the test set of each dataset. The first 20 images will be used to precompute our known perturbations using vanilla Sign-OPT. Then we will run two untargeted attacks on the last 30 images with both Algorithm 2 (using our 20 previously known perturbations) and vanilla Sign-OPT with random initialization. To evaluate the attacks, we consider the distortion (Euclidean norm) of the adversarial perturbation as a function of the number of queries to the model, following the experiments in [CSC20].

As discussed in Section 3.2, we also would like to understand how the choice of those first 20 precomputed perturbations affects the distortion found by Algorithm 2. We will try three different methodologies: choosing all 50 images at random, choosing all 50 images to have the same class label, and choosing all 50 images to be visually similar (by clustering the feature maps of our own classifier). It is worth mentioning that the tests in this section are merely a first examination of our initialization strategy, how it affects the adversarial distortion as a function of query count, and how natural choices for grouping may or may not affect this. The following section, Section 4.3, provides a much more comprehensive study demonstrating the superiority of using known perturbations over random perturbations.

4.2.1 MNIST

To provide an initial baseline, we run our first set of experiments on the MNIST test set. Our first experiment selects 50 images at random and then runs Algorithm 2 on the final 30 images with the first k = 20 perturbations precomputed. The results of the experiment can be seen in Figure 4.1, with comparisons of the adversarial initializations shown in Figure 4.2. The initial distortion found using known perturbations is significantly smaller than the one found using random initialization, decreasing from 9.820 to 2.695. The final distortions are close: 1.357 and 1.371 with random initialization and known-perturbation initialization, respectively. Our initialization thus gets us much closer to this final distortion much faster. This is a promising result, since it provides evidence that our attack method is effective irrespective of which specific images are chosen. Though we are only showing the results for one selection of 50 random images, these results are representative of most other random selections we have tried.

Figure 4.1: Attacking 30 random MNIST images using the known perturbations of 20 other random MNIST images.

Next we select 50 images with class label 1, and perform the same experiment. We can see from Figure 4.3 that our initialization significantly outperforms random initialization, with our initialization only .136 from our final distortion of 1.391 (which is, again, close to random’s final distortion of 1.383). Re-using perturbations for MNIST images of the same class seems to be a very promising initialization strategy.

Figure 4.2: Non-cherry picked examples of images we attack in the random MNIST run (shown examples were chosen at random). The left column shows the original image; the middle column visualizes our improved initialization; the right column shows the Sign-OPT initialization. Notice our initialization targets only certain regions of the image, while the Sign-OPT initialization tends to perturb each pixel. Sign-OPT was unable to generate an adversarial example in the last row, but our approach was able to.

Figure 4.3: Attacking 30 MNIST images with class label 1 using the known perturbations of 20 other MNIST images with class label 1.

Finally we select 50 MNIST images that are visually similar. To do this we train a CNN to classify MNIST digits on the training set, and then extract feature vectors for our test set using the penultimate layer's activations. We then loop over all images in the test set to find the image with the 49 closest neighbors on average, and make these 50 images our visually similar group. As can be seen from Figure 4.4, this happens to be images of the digit 1 written in a similar way. Like class-label similarity, our distortion is nearly optimal upon initialization (only .071 from the final distortion). The final distortion in this test, 1.519, is actually slightly better than random, 1.522, showing that our final distortion can also beat random.

A possible way to explain the success of the last two experiments is that MNIST images of the same class have similar vector representations. It hence may be natural that an optimal perturbation for one MNIST image would also be optimal for an MNIST image that is nearly identical.

4.2.2 CIFAR-10

CIFAR-10 is more visually diverse than MNIST, making it another good baseline for us. We perform experiments analogous to those on MNIST, beginning with our first experiment using 50 random CIFAR-10 images (see the results in Figure 4.5 and a visual comparison of the adversarial initializations in Figure 4.6). Once again our approach has a lower initial distortion, improving from 3.984 to 1.887; the final distortions are within .01 of each other. This improvement is less pronounced than in the MNIST case, likely because the dataset is more complex. Nevertheless, re-using known perturbations gives a better initial distortion even when the images are selected at random.

Figure 4.4: Attacking visually similar MNIST images. Images are selected by finding 50 nearest neighbors with minimal average distance, using feature-vector distance as a measure of visual similarity. Then 30 of those images are attacked using perturbations of the other 20 images.

Figure 4.5: Attacking 30 random CIFAR-10 images using the known perturbations of 20 other random CIFAR-10 images.

Figure 4.6: Non-cherry picked examples of images we attack in the random CIFAR-10 run (shown examples were chosen at random). The left column shows the original image; the middle column visualizes our improved initialization; the right column shows the Sign-OPT initialization. Notice our initialization targets only certain regions of the image, while the Sign-OPT initialization tends to perturb each pixel.

The same experiment using 50 images with class label 5 is shown in Figure 4.7. Our initialization beats random initialization, improving from 1.869 to .949. The final distortions are within .006 of each other. Just like in the MNIST test, this experiment gives a better result than the random grouping. However these are unfair comparisons (the attacked images differ across experiments), and we leave the full exploration to Section 4.3.

Figure 4.7: Attacking 30 CIFAR-10 images with class label 5 using the known perturbations of 20 other CIFAR-10 images with class label 5.

Finally we run the experiment using 50 images that are visually similar, shown in Figure 4.8. Once again we train our own classifier, this time using Deep Layer Aggregation [YWS19] due to its success on CIFAR-10 (we trained ours only to 86.69% accuracy). We perform an identical Nearest Neighbors algorithm on the penultimate layer’s feature vectors to find our 50 images, which in this case are of small cars. Our approach beats random, improving from 3.604 to 1.938, with the final distortions within .033 of each other.

We have shown various attacks where our average initial distortion outperforms random. However, the number of tests we have shown here is limited. To show we are not cherry-picking and that our initialization is in fact better, we now move on to a more comprehensive study.

Figure 4.8: Attacking visually similar CIFAR-10 images. Images are selected by finding 50 nearest neighbors with minimal average distance, using feature-vector distance as a measure of visual similarity. Then 30 of those images are attacked using perturbations of the other 20 images.

4.3 Known Perturbations Outperform Random Perturbations

Though the previous section provides evidence that our approach leads to a better initialization than random, the number of tests we conducted is limited by computational power (generating Figure 4.8 takes around 11 hours). Furthermore, the experiments still have not conclusively shown whether having the same class label or being visually similar makes a meaningful contribution to the transferability of a perturbation.

In this section, we will increase the number of CIFAR-10 tests we can run by only measuring the initial distortion. First we take 3618 CIFAR-10 images and use the random approach to compute their initial distortions. Then we take 139 random CIFAR-10 images (about 10-18 from each of the 10 classes) and compute their final perturbations using vanilla Sign-OPT. For each of the 139 perturbations and each of the 3618 images, we use our approach to compute an initial distortion using that perturbation and image pair. The result is two tables: a 1 × 3618 table of initial distortions using randomness, and a 139 × 3618 table of initial distortions using known perturbations.
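A minimal NumPy sketch of how the comparison reported below could be computed from these two tables; the array names are illustrative, and failed initializations (discussed next) are assumed to be stored as np.inf.

    import numpy as np

    # random_init: shape (3618,)      -- initial distortion from 100 random directions per image
    # known_init:  shape (139, 3618)  -- initial distortion for each (perturbation, image) pair
    def compare_initializations(random_init, known_init):
        ok = np.isfinite(random_init) & np.isfinite(known_init).any(axis=0)  # drop failed images
        best_known = known_init[:, ok].min(axis=0)     # best known perturbation for each image
        diffs = random_init[ok] - best_known           # positive means the known init is better
        return {
            "mean_random": random_init[ok].mean(),
            "mean_known": best_known.mean(),
            "mean_improvement": diffs.mean(),
            "fraction_improved": (diffs > 0).mean(),
        }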

Note that sometimes the initialization step fails. In the random approach, this can happen if all 100 random directions are not adversarial. In our approach, this can happen if we cannot find an adversarial perturbation in the direction of the known perturbation (the details of how we search are at the end of Chapter 3). It turns out that both the random approach and our approach failed on the same 243 images, which suggests that those particular images either are already misclassified or are particularly robust. We omit these images from the rest of the discussion.

Figure 4.9: Distribution of initial distortions found using randomness (blue) and found using known perturbations (orange). On average, known perturbations find smaller initial distortions.

To more concretely understand the level of improvement, we measure the difference in distortion between each image’s random initialization and best perturbation-based initial- ization. We plot the magnitude of these differences in a histogram shown in Figure 4.10. In

24 Figure 4.10: Distribution of differences between random initialization and known- perturbation initializations for all images considered. In all but 4 images, the known- perturbation initialization outperform random initialization, improving by about 1.927 on average.

all but 4 images, our approach outperforms random initialization; in those 4 cases, our ap- proach was no worse than .240 off from random. Our initialization was better by about 1.927 on average, with the degree of improvements following a bimodal distribution. Curiously, the second peak featuring the greatest improvements has majority of its images coming from class 6; upon further inspection we concluded this was because random initialization was par- ticularly ineffective for these images. Overall Figure 4.10 is an overwhelming demonstration that re-using known perturbations is a superior initialization strategy.

Though using known perturbations is highly effective, reusing adversarial perturbations for images that have the same class label or are visually similar is not always the optimal choice. For instance, suppose that we set k = 3 in Algorithm 2 so that we only get to choose 3 out of our 138 computed perturbations to know beforehand. Further suppose that, out of our 3618 images, we focus our attack only on the class 0 images. If class similarity or even visual similarity were crucial, then we would expect that the perturbations that minimize the average initial distortion should at least come from images with the same class label.

25 Figure 4.11: The top 3 images are the best 3 of the 138 images to precompute perturbations for the class 0 (airplane) images considered. The bottom 3 images are analogously the worst 3 to precompute. Counterintuitively, notice that none of the top 3 are of airplanes, but one of the bottom 3 are.

However, none of these images, shown in Figure 4.11, are from class 0. Furthermore, the bottom 3 images are the images that maximize the average initial distortion, and one of them surprisingly has class label 0.

Table 4.1 performs the same computation for all 10 classes. Specifically, we seek to answer the following question: which 3 images should we precompute perturbations for before attacking images of class m? An optimal triplet of images will minimize the average initial distortion found when attacking all considered class m images. We do this by checking every possible choice of 3 images from the 138, and using their corresponding adversarial perturbations as input to Algorithm 2 (with k = 3) to attack all class m images out of our 3618. Whichever triplet minimizes the average initial distortion found when attacking those images is considered the best choice for attacking class m; whichever triplet maximizes that average initial distortion is considered the worst choice for attacking class m. Although some of the classes do have at least 1 of the 3 best perturbations sharing the same label, the lack of uniformity here suggests that there may be a more complex phenomenon going on. What we can conclude is that the most useful perturbations to precompute need not come from the same class.
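A brute-force sketch of this triplet search over the table of known-perturbation initial distortions; the array layout follows the sketch given earlier in this section, and all names are illustrative.

    from itertools import combinations
    import numpy as np

    def best_and_worst_triplets(known_init, class_image_idx):
        """known_init[p, i] is the initial distortion when perturbation p initializes the
        attack on image i; class_image_idx selects the images of the attacked class m."""
        sub = known_init[:, class_image_idx]            # (num perturbations, num class-m images)
        scores = {}
        for trip in combinations(range(sub.shape[0]), 3):   # C(138, 3) triplets, still feasible
            # Algorithm 2 with k = 3 keeps the best of the three directions for each image.
            scores[trip] = np.min(sub[list(trip), :], axis=0).mean()
        best = min(scores, key=scores.get)
        worst = max(scores, key=scores.get)
        return best, worst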

Attacked Class    Class of Best 3 Perturbations    Class of Worst 3 Perturbations
0                 6, 9, 9                          1, 0, 6
1                 1, 1, 9                          0, 7, 9
2                 1, 5, 9                          4, 6, 8
3                 3, 4, 5                          0, 5, 6
4                 1, 7, 9                          2, 4, 5
5                 5, 5, 5                          1, 1, 8
6                 0, 6, 9                          2, 3, 7
7                 7, 7, 9                          2, 4, 6
8                 1, 8, 9                          5, 6, 6
9                 8, 9, 9                          1, 1, 2

Table 4.1: Finding the labels of the best (and worst) 3 images to precompute known perturbations for when attacking images of each class. Notice that the best 3 images for classes 0, 2, and 4 don't include themselves. For class 0 the best triplet including a class 0 image is worse than 10 other triplets; for class 2, 259 other triplets are better; for class 4, 13 others. The most useful perturbations to precompute need not come from the same class.

To further illustrate the weak relationship between the associated class of the precomputed perturbation and initial distortion, we can look at boxplots of initial distortion as a function of the perturbation class. Specifically, we attack class m images using known perturbations exclusively from class n and plot the spread of initial distortions. We do this for m = 0, 4, 5, 9 and all n = 0, ..., 9 in Figure 4.12. For classes 5 and 9, the average initial distortion is smallest when using perturbations from the same class. However, for classes 0 and 4, the minimal average initial distortion does not coincide with the same class.

Though class similarity may not be an effective heuristic for precomputing perturbations, perhaps inputs with similar features might share similar adversarial vulnerabilities. To measure this, we use the same Deep Layer Aggregation CIFAR-10 classifier and extract the penultimate-layer feature vectors for every image. We use the distance between images' feature vectors as a measure of visual similarity. Using the same 3618 images and 139 perturbations, we graph in Figure 4.13 the binned average initial distortion found as a function of visual similarity. When the attacked image and the known perturbation's corresponding image are very dissimilar, the perturbation does a worse job of initializing the attack on average. However, slightly dissimilar images get a better initial distortion than very similar images on average. Because of this, visual similarity may not be the best heuristic for finding an optimal perturbation. It is important to note that there is a lot of variability in how effective a perturbation is on a visually similar image. Figure 4.14 displays very typical examples of how noisy the correlation is for specific choices of perturbation or attacked image. The main takeaway is that although visual similarity does have some correlation (on average) with the initial distortion of our attack, it doesn't give us the direct correlation that we would like. Future work is needed to attain a better metric that tells us why certain pairs of images share similar vulnerabilities.
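A minimal sketch of the binned averaging behind Figure 4.13, using the bin count and minimum-count threshold stated in its caption; similarity here is the feature-vector distance described above, and the names are illustrative.

    import numpy as np

    def binned_average(similarity, distortion, num_bins=100, min_count=200):
        """Average initial distortion per similarity bin, discarding sparse bins."""
        edges = np.linspace(similarity.min(), similarity.max(), num_bins + 1)
        which = np.clip(np.digitize(similarity, edges) - 1, 0, num_bins - 1)
        centers, means = [], []
        for b in range(num_bins):
            mask = which == b
            if mask.sum() >= min_count:                # keep only well-populated bins
                centers.append((edges[b] + edges[b + 1]) / 2.0)
                means.append(distortion[mask].mean())
        return np.array(centers), np.array(means)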

Figure 4.12: Attacking images of classes 0, 4, 5, and 9 exclusively using perturbations from each class. The top boxplots (a, b) show that using perturbations from the same class (e.g. using perturbations for class 0 images to initialize attacks on class 0 images) does not always lead to a minimal average initial distortion. On the other hand, the lower boxplots (c, d) show that it sometimes can lead to the minimal average initial distortion.

Figure 4.13: Examining how visual similarity correlates with initial distortion when using the perturbation of one image to attack another image on average. All similarity-distortion pairs were placed into 100 bins and averaged within each bin; bins with fewer than 200 pairs were discarded, leaving a median bin size of 4734 pairs. Very dissimilar images tend to have poor initializations on average; however, slightly dissimilar images actually do better than very similar images on average.

Figure 4.14: Figures (a) and (b) show how visual similarity affects the initial distortion when attacking all 3618 images using two different perturbations; the correlations are very noisy and also typical for most choices of perturbation. Figures (c) and (d) analogously show how visual similarity affects the initial distortion when attacking two different images using all 139 perturbations; again they are noisy and also typical.

4.4 Conclusion and Future Work

We have shown that previously known adversarial perturbations can be used to find superior initializations for the Sign-OPT hard-label black-box attack. We found that the boundary points along the direction of a previously known perturbation are almost always closer to the input image than when selecting random directions.

Choosing which perturbations to compute beforehand is a challenge, and we demonstrated that natural heuristics like class or visual similarity are not as meaningful as one would hope. Future work in this area would hopefully design better heuristics that continue to lower the average initial distortion found.

It also may be interesting to study how re-using known perturbations helps when attacking a group of images, where the perturbations of images attacked first would be re-used to attack images attacked later. An interesting challenge here would be determining an optimal order for these images; solving this would also solve the earlier problem of choosing which perturbations to compute beforehand.

To the best of our knowledge, this is the first study of how adversarial perturbations from one image can be used to attack other images. It may be fruitful to continue to build a theory around why this phenomenon occurs and why it is so effective on deep neural networks.

REFERENCES

[BRB18] Wieland Brendel, Jonas Rauber, and Matthias Bethge. "Decision-Based Adversarial Attacks: Reliable Attacks Against Black-Box Machine Learning Models.", 2018.

[CLC18] Minhao Cheng, Thong Le, Pin-Yu Chen, Jinfeng Yi, Huan Zhang, and Cho-Jui Hsieh. "Query-Efficient Hard-label Black-box Attack: An Optimization-based Approach.", 2018.

[CSC20] Minhao Cheng, Simranjit Singh, Patrick Chen, Pin-Yu Chen, Sijia Liu, and Cho-Jui Hsieh. "Sign-OPT: A Query-Efficient Hard-label Adversarial Attack.", 2020.

[CW17] Nicholas Carlini and David Wagner. "Towards Evaluating the Robustness of Neural Networks.", 2017.

[CZS17] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. “ZOO: Zeroth Order Optimization Based Black-Box Attacks to Deep Neural Networks without Training Substitute Models.” In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec ’17, p. 15–26, New York, NY, USA, 2017. Association for Computing Machinery.

[EEF18] Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. “Robust Physical-World Attacks on Deep Learning Models.”, 2018.

[GSS15] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. “Explaining and Harnessing Adversarial Examples.”, 2015.

[HZR15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep Residual Learning for Image Recognition.", 2015.

[IEA18] Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. "Black-box Adversarial Attacks with Limited Queries and Information.", 2018.

[IEM19] Andrew Ilyas, Logan Engstrom, and Aleksander Madry. “Prior Convictions: Black-Box Adversarial Attacks with Bandits and Priors.”, 2019.

[KGB17] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. "Adversarial Machine Learning at Scale.", 2017.

[MMS19] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. "Towards Deep Learning Models Resistant to Adversarial Attacks.", 2019.

[PMG17] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. "Practical Black-Box Attacks against Machine Learning.", 2017.

[SZ15] Karen Simonyan and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition.”, 2015.

[SZS14] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. "Intriguing properties of neural networks.", 2014.

[TTC20] Chun-Chen Tu, Paishun Ting, Pin-Yu Chen, Sijia Liu, Huan Zhang, Jinfeng Yi, Cho-Jui Hsieh, and Shin-Ming Cheng. “AutoZOOM: Autoencoder-based Zeroth Order Optimization Method for Attacking Black-box Neural Networks.”, 2020.

[YWS19] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. “Deep Layer Aggregation.”, 2019.
