
MEASURING THE UNMEASURED: NEW THREATS TO MACHINE LEARNING SYSTEMS

A Dissertation Presented to the Faculty of the Graduate School

of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

by
Congzheng Song
December 2020

© 2020 Congzheng Song
ALL RIGHTS RESERVED

MEASURING THE UNMEASURED: NEW THREATS TO MACHINE LEARNING SYSTEMS

Congzheng Song, Ph.D.

Cornell University 2020

Machine learning (ML) is at the core of many Internet services and applications. Practitioners evaluate ML models based on accuracy metrics, which measure the models' predictive power on unseen future data. On the other hand, as ML systems become more personalized and more important in decision-making, malicious adversaries have an incentive to interfere with the ML environment for various purposes, such as extracting information about sensitive training data or inducing desired behavior in models' outputs. However, none of these security and privacy threats are captured by accuracy, and it is unclear to what extent current ML systems could go wrong.

In this dissertation, we identify and quantify a number of threats to ML systems that are not measured by conventional performance metrics: (1) we consider privacy threats at training time, where we show that an adversary can supply malicious training code to force an ML model into intentionally "memorizing" sensitive training data, and later extract the memorized information from the model; (2) motivated by data-protection regulations, we identify a compliance issue where personal information might be collected for training ML models without consent, and design practical auditing techniques for detecting such unauthorized data collection; (3) we study the overlearning phenomenon, in which models' internal representations reveal sensitive information uncorrelated with the training objective, and discuss its implications in terms of privacy leakage and compliance with regulations; and (4) we demonstrate a security vulnerability in ML models for analyzing text semantic similarity, where we propose attacks for generating texts that are semantically unrelated but judged as similar by these ML models.

The goal of this dissertation is to provide ML practitioners with ways of measuring risks in ML models through threat modeling. We hope that our proposed attacks can give insights for better mitigation methods, and we advocate that the ML community consider all of these aspects, rather than accuracy alone, when designing new learning algorithms and building new ML systems.

BIOGRAPHICAL SKETCH

Congzheng Song was born in Changsha, China. He earned a B.S. degree summa cum laude in Computer Science from Emory University. In 2016, he entered Cornell University to pursue a Ph.D. in Computer Science, advised by Prof. Vitaly Shmatikov at the Cornell Tech campus in New York City. His doctoral research focused on identifying and quantifying security and privacy issues in machine learning. He was a doctoral fellow at Cornell Tech's Digital Life Initiative in 2020. During his Ph.D. studies, he interned at Amazon, Google, and Petuum Inc. for industrial research.

To my parents.

ACKNOWLEDGEMENTS

I am extremely fortunate to have Vitaly Shmatikov as my Ph.D. advisor. He supported me in any way he could, and I learned more than I could have hoped for from him, from formulating and refining research ideas to writing and presenting the outcomes. His passion and wisdom guided me through many difficult times and made my Ph.D. research very productive. I owe Vitaly deeply, and this dissertation would not have been possible without his help. I am very grateful to the rest of my thesis committee: Thomas Ristenpart and Helen Nissenbaum. Tom introduced me to security and privacy research problems in machine learning when I first came to Cornell Tech, which is also the focus of this dissertation.

I was a doctoral fellow at the Digital Life Initiative (DLI) founded by Helen. From Helen and the DLI team, I learned about the societal aspects of technology and how people outside our field view our work from different perspectives. I also want to thank Tom and Helen for their valuable feedback on this dissertation.

I want to acknowledge my collaborators and co-authors: Emiliano De Cristofaro, Luca Melis, Roei Schuster, Vitaly Shmatikov, Reza Shokri, Marco Stronati, Eran Tromer, Ananth Raghunathan, Thomas Ristenpart, and Alexander M. Rush. They are incredible researchers in this field, and I benefited greatly from their insights.

I would like to express my gratitude to my mentors and colleagues at Amazon, Google, and Petuum Inc. during my internships. From them, I learned how security and privacy research is deployed in practice and about the differences between academic research and the real world. Last but not least, I want to thank all my friends and family members in the U.S. and China for their endless support throughout my entire life.

TABLE OF CONTENTS

Biographical Sketch ...... iii
Dedication ...... iv
Acknowledgements ...... v
Table of Contents ...... vi
List of Tables ...... ix
List of Figures ...... xii

1 Introduction 1
1.1 Thesis Contribution ...... 2
1.2 Thesis Structure ...... 4

2 Background 6
2.1 Machine Learning Preliminaries ...... 6
2.1.1 Supervised learning ...... 6
2.1.2 Linear models ...... 9
2.1.3 Deep learning models ...... 9
2.2 Machine Learning Pipeline ...... 10
2.2.1 Data collection ...... 11
2.2.2 Training ML models ...... 12
2.2.3 Deploying ML models ...... 13
2.3 Memorization in ML ...... 14
2.3.1 Membership Inference Attacks ...... 14
2.4 Privacy-preserving Techniques ...... 16
2.4.1 Differential privacy ...... 16
2.4.2 Secure ML environment ...... 17
2.4.3 Model partitioning ...... 18

3 Intentional Memorization with Untrusted Training Code 20
3.1 Threat Model ...... 20
3.2 White-box Attacks ...... 24
3.2.1 LSB Encoding ...... 24
3.2.2 Correlated Value Encoding ...... 25
3.2.3 Sign Encoding ...... 28
3.3 Black-box Attacks ...... 30
3.3.1 Abusing Model Capacity ...... 30
3.3.2 Synthesizing Malicious Augmented Data ...... 31
3.3.3 Why Capacity Abuse Works ...... 34
3.4 Experiments ...... 35
3.4.1 Datasets and Tasks ...... 35
3.4.2 ML Models ...... 37
3.4.3 Evaluation Metrics ...... 38
3.4.4 LSB Encoding Attack ...... 40
3.4.5 Correlated Value Encoding Attack ...... 42
3.4.6 Sign Encoding Attack ...... 45
3.4.7 Capacity Abuse Attack ...... 47
3.5 Countermeasures ...... 54
3.6 Related Work ...... 55
3.7 Conclusion ...... 58

4 Auditing Data Provenance in Text-generation Models 60
4.1 Text-generation Models ...... 61
4.2 Auditing text-generation models ...... 63
4.3 Experiments ...... 68
4.3.1 Datasets ...... 68
4.3.2 ML Models ...... 70
4.3.3 Hyper-parameters ...... 70
4.3.4 Performance of target models ...... 72
4.3.5 Performance of auditing ...... 73
4.4 Memorization in text-generation models ...... 80
4.5 Limitations of auditing ...... 84
4.6 Related work ...... 86
4.7 Conclusion ...... 87

5 Overlearning Reveals Sensitive Attributes 89
5.1 Censoring Representation Preliminaries ...... 90
5.2 Exploiting Overlearning ...... 92
5.2.1 Inferring sensitive attributes from representation ...... 93
5.2.2 Re-purposing models to predict sensitive attributes ...... 94
5.3 Experimental Results ...... 95
5.3.1 Datasets, tasks, and models ...... 95
5.3.2 Inferring sensitive attributes from representations ...... 97
5.3.3 Re-purposing models to predict sensitive attributes ...... 102
5.3.4 When, where, and why overlearning happens ...... 105
5.4 Related Work ...... 107
5.5 Conclusions ...... 108

6 Adversarial Semantic Collisions 110
6.1 Threat Model ...... 110
6.2 Generating Adversarial Semantic Collisions ...... 113
6.2.1 Aggressive Collisions ...... 114
6.2.2 Constrained Collisions ...... 117
6.2.3 Regularized Aggressive Collisions ...... 117
6.2.4 Natural Collisions ...... 118
6.3 Experiments ...... 120
6.3.1 Tasks and Models ...... 122
6.3.2 Attack Results ...... 125
6.3.3 Evaluating Unrelatedness ...... 126
6.3.4 Transferability of Collisions ...... 127
6.4 Mitigation ...... 128
6.5 Related Work ...... 130
6.6 Conclusion ...... 132

7 Conclusion 133

A Chapter 6 of appendix 135

LIST OF TABLES

3.1 Summary of datasets and models. n is the size of the training dataset, d is the number of input dimensions. RES stands for Residual Network, CNN for Convolutional Neural Network. For FaceScrub, we use the gender classification task (G) and face recognition task (F). ...... 36
3.2 Results of the LSB encoding attack. Here f is the model used, b is the maximum number of lower bits used beyond which accuracy drops significantly, δ is the difference with the baseline test accuracy. ...... 39
3.3 Results of the correlated value encoding attack on image data. Here λc is the coefficient for the correlation term in the objective function and δ is the difference with the baseline test accuracy. For image data, decode MAPE is the mean absolute pixel error. ...... 40
3.4 Results of the correlated value encoding attack on text data. τ is the decoding threshold for the correlation value. Pre is precision, Rec is recall, and Sim is cosine similarity. ...... 41
3.5 Results of the sign encoding attack on image data. Here λs is the coefficient for the correlation term in the objective function. ...... 41
3.6 Results of the sign encoding attack on text data. ...... 42
3.7 Decoded text examples from all attacks applied to LR models trained on the IMDB dataset. ...... 45
3.8 Results of the capacity abuse attack on image data. Here m is the number of synthesized inputs and m/n is the ratio of synthesized data to training data. ...... 47
3.9 Results of the capacity abuse attack on text data. ...... 47
3.10 Results of the capacity abuse attack on text datasets using a public auxiliary vocabulary. ...... 49
4.1 Performance of target models. Acc is word prediction accuracy, perp is perplexity. ...... 72
4.2 Effect of training shadow models with different hyper-parameters than the target model. ...... 74
4.3 Effect of the model's output size. |f(x)| is the number of words ranked by f. ...... 77
4.4 Examples of texts obfuscated using Google translation API and Yandex translation API. ...... 79
4.5 Audit performance on obfuscated Reddit comments. ...... 79
5.1 Summary of datasets and tasks. Cramer's V captures statistical correlation between y and s (0 indicates no correlation and 1 indicates perfect correlation). ...... 97
5.2 Accuracy of inference from representations (last FC layer). RAND is random guessing based on majority class labels; BASE is inference from the uncensored representation; ADV from the representation censored with adversarial training; IT from the information-theoretically censored representation. ...... 99
5.3 Improving inference accuracy with de-censoring. δ is the increase from Table 5.2. ...... 101
5.4 Adversarial re-purposing. The values are differences between the accuracy of predicting sensitive attributes using a re-purposed model vs. a model trained from scratch. ...... 102
5.5 The effect of censoring on adversarial re-purposing for FaceScrub with γ = 0.5, 0.75, 1.0. δA is the difference in the original-task accuracy (second column) between uncensored and censored models; δB is the difference in the accuracy of inferring the sensitive attribute (columns 3 to 7) between the models re-purposed from different layers and the model trained from scratch. Negative values mean reduced accuracy. ...... 103
6.1 Four tasks in our study. Given an input x, the adversary produces a collision c resulting in a deceptive output. Collisions can be nonsensical or natural-looking and also carry spam messages (shown in red). ...... 111
6.2 Hyper-parameters for each experiment. B is the beam size for beam search. K is the number of top words evaluated at each optimization step. N is the number of optimization iterations. T is the sequence length. η is the step size for optimization. τ is the temperature for softmax. β is the interpolation parameter in equation 6.5. ...... 121
6.3 Attack results. r is the rank of collisions among candidates. Gold denotes the ground truth. ...... 124
6.4 BERTSCORE between collisions and target inputs. Gold denotes the ground truth. ...... 126
6.5 Percentage of successfully transferred collisions for MRPC and Chat. ...... 127
6.6 Effectiveness of perplexity-based filtering. FP@90 and FP@80 are false positive rates (percentage of real data mistakenly filtered out) at thresholds that filter out 90% and 80% of collisions, respectively. ...... 129
A.1 Collision examples for MRPC and QQP. Outputs are the probability scores produced by the model for whether the input and the collisions are paraphrases. ...... 135
A.2 Collision examples for Core17/18. r are the ranks of irrelevant articles after inserting the collisions. ...... 136
A.3 Collision examples for Chat. r are the ranks of collisions among the candidate responses. ...... 137
A.4 Collision examples for CNNDM. Truth are the true summarizing sentences. r are the ranks of collisions among all sentences in the news articles. ...... 137

LIST OF FIGURES

3.1 A typical ML training procedure. Data D is split into training set D_train and test set D_test. Training data may be augmented using an algorithm A, and then parameters are computed using a training algorithm T that uses a regularizer Ω. The resulting parameters are validated using the test set and either accepted or rejected (an error ⊥ is output). If the parameters θ are accepted, they may be published (white-box model) or deployed in a prediction service to which the adversary has input/output access (black-box model). The dashed box indicates the portions of the pipeline that may be controlled by the adversary. ...... 21
3.2 Test accuracy of the CIFAR10 model with different amounts of lower bits used for the LSB attack. ...... 40
3.3 Decoded examples from all attacks applied to models trained on the FaceScrub gender classification task. First row is the ground truth. Second row is the correlated value encoding attack (λc=1.0, MAPE=15.0). Third row is the sign encoding attack (λs=10.0, MAPE=2.51). Fourth row is the capacity abuse attack (m=110K, MAPE=10.8). ...... 43
3.4 Capacity abuse attack applied to CNNs with a different number of parameters trained on the LFW dataset. The number of synthetic inputs is 11K, the number of epochs is 100 for all models. ...... 52
3.5 Visualization of the learned features of a CIFAR10 model maliciously trained with our capacity-abuse method. Solid points are from the original training data, hollow points are from the synthetic data. The color indicates the point's class. ...... 53
3.6 Comparison of parameter distribution between a benign model and malicious models. Left is the correlation encoding attack (cor); middle is the sign encoding attack (sgn); right is the capacity abuse attack (cap). The models are residual networks trained on CIFAR10. Plots show the distribution of parameters in the 20th layer. ...... 54
4.1 Effect of the number of Reddit users used to train a word-prediction model. ...... 74
4.2 Effect of the number of queries and sampling strategy. Plots on the left show the results when the auditor samples the user's data for queries in the ascending order of frequency counts of tokens in the label; plots on the right show the results with randomly sampled data. ...... 75
4.3 Effect of noise and errors. ...... 78
4.4 Histograms of log probabilities of words generated by our text-generation models. The top row are the histograms for the top 20% most frequent words, the bottom row are the histograms for the rest. ...... 80
4.5 Ranks of words in the frequency table of the training corpus and in the models' predictions (lower rank means that the word is more likely). Shaded area is the 95% confidence interval for all occurrences of the word in the data. These charts demonstrate that the models assign much higher rank to words when they appear in training sequences vs. when they appear in test sequences, especially for the less-frequent words. ...... 82
4.6 Ablation analysis on Reddit and SATED. ...... 83
4.7 Ranks of words in the training corpus and in the predictions of the differentially private model. ...... 84
5.1 Reduction in accuracy due to censoring. Blue lines are the main task, red lines are the inference of sensitive attributes. First row is adversarial training with different γ values; second and third row is information-theoretical censoring with different β and λ values respectively. ...... 100
5.2 Heatmaps for the linear CKA similarities between censored and uncensored representations. Numbers 0 through 4 represent layers conv1, conv2, conv3, fc4, and fc5. For each model censored at layer i (x-axis), we measure similarity between the censored and uncensored models at layer j (y-axis). ...... 104
5.3 Pairwise similarities of layer representations between models for the original task (A) and for predicting a sensitive attribute (B). Numbers 0 through 4 denote layers conv1, conv2, conv3, fc4 and fc5. ...... 105
5.4 Similarity of layer representations of a partially trained gender classifier to a randomly initialized model before training. Models are trained on FaceScrub using 50 IDs (blue line) and 500 IDs (red line). ...... 106
6.1 Histograms of entropy (log perplexity) evaluated by GPT-2 on real data and collisions. ...... 128

CHAPTER 1 INTRODUCTION

Machine learning (ML) enables computer systems to make accurate predictions on future data by automatically extracting patterns from past experience.

Given a learning task, a training procedure is applied to produce an ML model, which is an optimal mapping from the input domain to the output domain learned from a set of observed input-output pairs known as training data. With enormous amounts of training data available, ML is able to reach or even outperform human-level accuracy on challenging tasks such as object classification [65], face verification [167], machine translation [180], speech recognition [182], playing the game of Go [163], etc.

Such tremendous success has led to an explosion of ML models deployed in many Internet services and applications that people use on a daily basis, including recommending movies on Netflix [55] or videos on YouTube [35], assisting email writing in Gmail [179], mobile keyboard predictions [63], self-driving cars [20], etc. Productionizing an ML model involves a pipeline from data collection and training to evaluation and deployment, as detailed in Section 2.2. The quality of an ML model is measured by its predictive power on future data, and the accuracy of prediction is often the only metric for deciding whether to deploy an ML model in production.

On the other hand, the ML models in these services are often trained on large-scale sensitive personal data and make important personalized decisions for users. The personalized nature and the decision-making role of ML provide a strong incentive for malicious adversaries to interfere with the ML productionizing environment for different purposes (for example, inferring information about the sensitive training data). It is thus crucial to understand the potential threats to current ML systems.

However, the accuracy metric used for deciding ML model deployment only measures whether an ML model has learned its designated task well. Many other important properties characterizing different concerns and threats are not captured by accuracy at all. There could be threats to ML models' integrity, where adversaries attempt to induce their desired behavior in the model's output. Privacy and confidentiality are also at risk, as ML models might leak information about their sensitive training data. In addition, with data-protection policies and regulations such as the European Union's General Data Protection Regulation (GDPR) [172] being enforced, we need to understand whether the practice of ML is in compliance with such regulations.

1.1 Thesis Contribution

This dissertation focuses on identifying and measuring threats to ML systems that are not measured by conventional ML performance metrics (e.g., accuracy). We consider adversaries in different contexts of the ML productionizing pipeline, and propose specific attacks that interfere with the ML environment and achieve the adversaries' objectives. We also discuss solutions or potential countermeasures to the identified threats. The contributions of this dissertation are as follows:

• We consider a malicious ML provider who supplies ML model-training code to the data holder, does not observe the training, but then obtains white- or black-box access to the resulting model. In this setting, we design and implement practical algorithms, some of them very similar to standard ML techniques such as regularization and data augmentation, that "memorize" information about the training dataset in the model, yet keep the model as accurate and predictive as a conventionally trained model. We then explain how the adversary can extract the memorized information from the model.

• We identify a threat to ML's compliance with regulations at data collection time, where users' personal data might be collected for training ML models without consent by a malicious service provider. To help enforce data-protection regulations such as the GDPR and detect unauthorized uses of personal data, we design a black-box auditing method that can detect, with very few queries to a model, if a particular user's texts were used to train it (among thousands of other users). We focus on text-generation models and empirically show that our method can successfully audit well-generalized models that are not overfitted to the training data. We also analyze how text-generation models memorize word sequences and explain why this memorization makes them amenable to auditing.

• We introduce overlearning, a phenomenon where deep learning models [107] trained for a seemingly simple objective implicitly learn to recognize attributes and concepts that are (1) not part of the learning objective, and (2) sensitive from a privacy or bias perspective. We demonstrate overlearning in several domains and analyze its harmful consequences. First, an adversary with access to inference-time representations of an overlearned model can learn sensitive attributes of the input, breaking privacy protections such as model partitioning. Second, a malicious ML service provider can "re-purpose" an overlearned model for a different, privacy-violating task even in the absence of the original training data. We show that overlearning is intrinsic for some tasks and cannot be prevented by censoring unwanted attributes. We also investigate where, when, and why overlearning happens during model training.

• We consider an adversary who can control the inference-time inputs and study a new security threat, semantic collisions, to the integrity of ML predictions. Semantic collisions are texts that are semantically unrelated but judged as similar by natural language processing (NLP) models. We develop gradient-based approaches for generating semantic collisions and demonstrate that state-of-the-art models for many tasks that rely on analyzing the meaning and similarity of texts, including paraphrase identification, document retrieval, response suggestion, and extractive summarization, are vulnerable to semantic collisions. For example, given a target query, inserting a crafted collision into an irrelevant document can shift its retrieval rank from 1000 to the top 3. We show how to generate semantic collisions that evade perplexity-based filtering and discuss other potential mitigations.

1.2 Thesis Structure

The remainder of this dissertation is organized as follows. In Chapter 2, we provide preliminaries for the ML models and pipelines used throughout the dissertation, as well as background on privacy-preserving techniques in ML. In Chapter 3, we demonstrate how malicious training code can be used to exfiltrate sensitive training data from ML models. In Chapter 4, we describe auditing techniques for detecting unauthorized data usage. In Chapter 5, we introduce overlearning and its corresponding threats to privacy and compliance with regulations. In Chapter 6, we develop adversarial text inputs for fooling NLP applications based on semantic similarity. Finally, in Chapter 7, we conclude the dissertation with a discussion of our contributions.

CHAPTER 2 BACKGROUND

In this chapter, we first review the preliminaries of machine learning and its productionizing pipeline. We then describe the memorization phenomenon in machine learning models and existing privacy-preserving learning techniques.

2.1 Machine Learning Preliminaries

Machine learning (ML) is a set of methods that automatically detect meaningful and valuable information in data and use the extracted information to predict unseen future data. There are two main types of ML: 1) supervised learning, which aims to learn a mapping from input data to output predictions, and 2) unsupervised learning, which aims to discover patterns existing in the data. In this dissertation, we focus on supervised learning.

2.1.1 Supervised learning

A supervised learning model is a function f_θ : X → Y parameterized by θ, where X is the input or feature space and Y is the output or label space. The choices of X and Y depend on the prediction task. For a classification task such as news topic prediction, where we aim to classify an input into discrete categories, Y is a set of discrete classes. For a regression task such as predicting temperature, where the output is a continuous value, Y = R.

Supervised learning algorithms are given a set of labeled examples known as training data D_train = {(x_i, y_i)}_{i=1}^n, where each input feature x_i ∈ X is paired with a label y_i ∈ Y. Each example (x_i, y_i) is independently and identically sampled from the true data distribution Pr_true.

The goal of a training (or learning) algorithm is to find the optimal set of parameters θ for f to produce accurate predictions on future data. Optimality is measured by a loss function L : Y × Y → R, which penalizes mismatches between the true labels y and the predicted labels f_θ(x) produced by the model. Since future data is unknown at the time of training, the standard learning framework is to measure and minimize the loss function on the known training data to find the optimal parameters. This framework is known as empirical risk minimization (ERM), formulated as:

min_θ (1/n) Σ_{i=1}^{n} L(f_θ(x_i), y_i)    (2.1)

Stochastic gradient descent. There are many methods to optimize the objective function in equation 2.1. Stochastic gradient descent (SGD) and its variants are commonly used to train the machine learning models that we focus on in this dissertation. SGD is an iterative method: at each step, the optimizer receives a small batch of training data and updates the model parameters θ in the direction of the negative gradient of the objective function with respect to θ. Training is finished when the model converges to a local minimum where the gradient is close to zero.
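To make this concrete, the following sketch shows mini-batch SGD minimizing the empirical risk of equation 2.1. It is an illustrative example only: the data, the linear model, and the hyper-parameters (learning rate, batch size, number of epochs) are hypothetical placeholders, and the sketch assumes the PyTorch library is available.

import torch
from torch import nn

# Hypothetical toy task: 1000 examples with 20 features and 3 classes.
X = torch.randn(1000, 20)
y = torch.randint(0, 3, (1000,))

model = nn.Linear(20, 3)                       # a simple parametric model f_theta
loss_fn = nn.CrossEntropyLoss()                # the loss function L
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

batch_size = 32
for epoch in range(10):
    perm = torch.randperm(X.size(0))           # shuffle the training data each epoch
    for start in range(0, X.size(0), batch_size):
        idx = perm[start:start + batch_size]   # a small batch of training examples
        optimizer.zero_grad()
        loss = loss_fn(model(X[idx]), y[idx])  # empirical risk on the batch
        loss.backward()                        # gradient of the objective w.r.t. theta
        optimizer.step()                       # step along the negative gradient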

Generalization. When a model is trained via ERM, we wish to apply it to predict on future data from the true distribution Pr_true. In other words, we want the model to generalize to data that it has not seen before. This can be measured by the generalization gap, defined as:

E_{(x,y)∼Pr_true}[L(f_θ(x), y)] − (1/n) Σ_{i=1}^{n} L(f_θ(x_i), y_i)    (2.2)

which is the difference between the expected loss on the true data distribution and the loss on the training data {(x_i, y_i)}_{i=1}^n. A model with a small generalization gap achieves similar performance on training examples and unseen examples.

In practice, we measure the model's generalization performance on a held-out set of examples known as the test set D_test, which is unknown at training time. A well-generalized model should have a small gap between training loss and test loss. The phenomenon of a large train-test gap is known as overfitting, where the model performs well on training data but fails on unseen test data.

Regularization. The loss function is sometimes accompanied by a regularization term Ω that penalizes model complexity and helps prevent models from overfitting. Popular choices for Ω are norm-based regularizers, including the L2-norm Ω(θ) = λ Σ_i θ_i^2, which penalizes parameters for being too large, and the L1-norm Ω(θ) = λ Σ_i |θ_i|, which induces sparsity in the parameters. The coefficient λ controls how much the regularization term affects the training objective.
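As a minimal illustration, the two penalties above can be added to any training loss as follows; the model, the coefficient lam, and the toy batch are hypothetical, and the sketch assumes PyTorch.

import torch
from torch import nn

model = nn.Linear(20, 3)              # any parametric model with parameters theta
lam = 1e-4                            # regularization coefficient lambda (illustrative)

def l2_penalty(model):
    # Omega(theta) = lambda * sum_i theta_i^2
    return lam * sum((p ** 2).sum() for p in model.parameters())

def l1_penalty(model):
    # Omega(theta) = lambda * sum_i |theta_i|
    return lam * sum(p.abs().sum() for p in model.parameters())

x, y = torch.randn(8, 20), torch.randint(0, 3, (8,))
loss = nn.CrossEntropyLoss()(model(x), y) + l2_penalty(model)
loss.backward()                       # the penalty contributes to the gradient on theta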

Data augmentation. A common strategy for improving the generalization of ML models is to use data augmentation as an optional preprocessing step before training the model. The training data D_train is expanded with new data points generated using deterministic or randomized transformations. For example, an augmentation algorithm for images may take each training image and flip it horizontally or inject noise and distortions. The resulting expanded dataset D_aug is then used for training.
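A minimal sketch of such an augmentation step is shown below; the flip-and-noise transformations follow the example in the text, while the image sizes and noise scale are illustrative assumptions (PyTorch is assumed).

import torch

def augment(images):
    # Expand the training images with horizontally flipped and noisy copies.
    flipped = torch.flip(images, dims=[-1])             # flip along the width axis
    noisy = images + 0.05 * torch.randn_like(images)    # inject small Gaussian noise
    return torch.cat([images, flipped, noisy], dim=0)   # the expanded dataset D_aug

images = torch.rand(16, 3, 32, 32)    # a hypothetical batch of training images
d_aug = augment(images)               # 48 images: originals plus two transformed copies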

2.1.2 Linear models

We consider inputs x that are d-dimensional vectors, i.e., X = R^d. Linear models are based on a simple linear mapping w^⊤x with w ∈ R^d. For the purposes of this dissertation, we introduce support vector machines and logistic regression, which are popular linear models for classification tasks where Y is a discrete set of classes.

Support vector machine. In a support vector machine (SVM) for binary classification with Y = {−1, 1} and w ∈ R^d, the model is given by f_θ(x) = sign(w^⊤x), where θ = {w} and the function sign returns whether the input is positive or negative. Training uses the hinge loss, i.e., L(f_θ(x), y) = max{0, 1 − y · w^⊤x}.

Logistic regression. With logistic regression, the parameters again consist of a vector in X and define the model f_θ(x) = σ(w^⊤x), where θ = {w} and σ(x) = (1 + e^{−x})^{−1}. In binary classification where the classes are {0, 1}, the output gives a value in [0, 1] representing the probability that the input is classified as 1; the predicted class is taken to be 1 if f_θ(x) ≥ 0.5 and 0 otherwise. A typical loss function used during training is cross-entropy: L(f_θ(x), y) = −(y · log(f_θ(x)) + (1 − y) · log(1 − f_θ(x))).
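The following NumPy sketch implements the logistic regression model and cross-entropy loss described above, with a plain gradient-descent loop on hypothetical data; it is a toy illustration, not a production implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, X):
    # f_theta(x) = sigma(w^T x): probability that each input belongs to class 1
    return sigmoid(X @ w)

def cross_entropy(w, X, y):
    p = predict_proba(w, X)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical data: 100 examples, 5 features, binary labels.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)

w = np.zeros(5)
for _ in range(200):                               # plain gradient descent on the loss
    grad = X.T @ (predict_proba(w, X) - y) / len(y)
    w -= 0.1 * grad

labels = (predict_proba(w, X) >= 0.5).astype(int)  # predicted class is 1 if f(x) >= 0.5
print(cross_entropy(w, X, y))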

2.1.3 Deep learning models

Deep learning [107] has become very popular for many machine learning tasks, especially those related to computer vision and image recognition (e.g., [100]). A deep learning model f is composed of layers of non-linear transformations that map inputs to a sequence of intermediate states and then to the output:

f_θ(x) = h_{θ_l} ∘ h_{θ_{l−1}} ∘ ··· ∘ h_{θ_1}(x).    (2.3)

The parameters θ = ∪_{i=1}^{l} θ_i describe the weights used for all l layers of transformation. The number of parameters can become huge as the depth of the network increases.

In the simplest form, the function of layer i is a linear mapping with weights w_i followed by a non-linear activation function a:

h_{θ_i}(x) = a(w_i^⊤ x).    (2.4)

Common choices for a are the sigmoid, the hyperbolic tangent, and rectified linear units (ReLU):

ReLU(x) = max(x, 0). (2.5)

The activated vectors h_{θ_i}(x) from each layer are often known as hidden units, features, or representations; they encode the semantics of the input x in an abstract way and are often more powerful than traditional hand-engineered features.
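As an illustration of equations 2.3-2.5, the sketch below builds a small multi-layer network and exposes the intermediate representations h_{θ_i}(x); the layer sizes are arbitrary and the code assumes PyTorch.

import torch
from torch import nn

class MLP(nn.Module):
    # f_theta(x) = h_l(... h_1(x)), where each hidden layer computes ReLU(W_i x).
    def __init__(self, dims=(20, 64, 32, 3)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:]))

    def forward(self, x):
        hidden = []                           # intermediate representations h_theta_i(x)
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i < len(self.layers) - 1:      # no activation on the output layer
                x = torch.relu(x)
            hidden.append(x)
        return x, hidden

logits, representations = MLP()(torch.randn(4, 20))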

2.2 Machine Learning Pipeline

Due to their strong predictive power, ML models are becoming building blocks for many Internet services and applications. To apply ML in real-world production, the service provider typically goes through a common ML pipeline that starts with collecting training data. The collected data is then used for training and validating ML models. Finally, if the trained model is good enough in terms of accuracy, the service provider deploys the model for users to interact with. We next describe each step of this pipeline in detail.

2.2.1 Data collection

When building a system with ML, the first step is to formulate the ML task and gather the corresponding labeled training data. Collecting high-quality training data is as important as designing the learning algorithm for training a high-quality model.

We take Smart Compose [179] in Gmail as an example, a product that automatically suggests complete sentences to help users quickly reply to emails. The ML model predicts the next words that a user is likely to type given the user's current input context. The training data for such a task could be collected from users' past emails. The labels are the individual words in the past email texts, and the inputs are the context for each word. The collection of input-context and label-word pairs forms a labeled dataset for later supervised training. The ML model is likely to extract users' writing patterns from the past emails and provide more personalized suggestions.

Data protection regulations. Collecting personal data such as emails must comply with regulations such as the EU General Data Protection Regulation (GDPR) [172]. Many of these regulations are relevant to the practice of ML. As described in Articles 5(1) and 6(4) of the GDPR, data processors must specify the explicit purpose of collecting data, and any purpose of further processing must be compatible with the purposes for which the personal data were initially collected. Furthermore, according to Article 6(1) of the GDPR, data processors (the service provider in our case) must demonstrate that the data subject (the user in our case) has given consent to the processing of his or her personal data. How data collection in the ML pipeline complies with such regulations is less explored in the literature.

2.2.2 Training ML models

Once the training data is collected, the next step is the training stage, where an ML algorithm takes the data and produces an ML model. As ML training can be complicated and requires expertise, it is common for service providers to outsource the training process to an ML provider who provides the training code or platform. Many cloud platforms that offer ML-as-a-service are ML providers.

In addition, there is an exploding number of third-party ML libraries and frameworks that provide developers with easy-to-use APIs for training ML models.

ML-as-a-service platforms such as Google Auto ML [58], Amazon ML [8], and Microsoft's Azure ML [136] provide convenient APIs for users to upload their data and train an ML model. These APIs enable black-box training. Google Auto ML specifically targets clients with limited machine learning expertise, as it automatically trains a customized model on user-provided data without giving access to the training algorithm. Amazon ML provides users with some options on a few hyper-parameters, such as the model size, but the details of the model remain unknown to the user. Microsoft's Azure ML provides users with a wide range of built-in common ML models. Clients can select a particular model but have no access to the implementation details of the training algorithm.

ML Libraries. ML algorithms can be complicated and hard to scale to large training datasets. The common low-level mathematical operations in many ML algorithms, such as matrix multiplication and the softmax function, require domain expertise to implement on accelerator hardware such as graphics processing units (GPUs) and tensor processing units (TPUs).

Instead of implementing a training algorithm from scratch, developers often choose ML libraries such as Scikit-learn [150], TensorFlow [1] and PyTorch [149] that provide high-level and easy-to-use APIs wrapping the lower-level computation. Developers can follow tutorials online and train state-of-the-art ML models in just a few lines of code.

2.2.3 Deploying ML models

Once the ML models are trained, it is crucial to evaluate their performance before launching them in products. There are typically two categories of evaluation: offline and online. In offline evaluation, the ML models are evaluated on their predictive performance on a test dataset (a subset of the data held out during training). In online evaluation, A/B testing is often conducted to decide whether the trained ML models improve over previous approaches in real applications. Online evaluation also uses more user-centered metrics, such as the click-through rate for measuring the ranking of search results, which better reflect how model predictions align with the users' preferences and cannot be estimated in offline evaluation.

Once the ML models are deployed, users can interact with the models through services and applications. Users share their inputs with the server, the service provider invokes the ML models to make predictions, and the predictions are displayed back to the users. For example, on many smartphones, the next-word prediction application takes a user's typed message as input and its ML model predicts the next word that the user is likely to type. The application then displays to the user the most likely words from the model predictions.

2.3 Memorization in ML

Despite their huge number of parameters, successful deep learning models can exhibit a remarkably small generalization gap. At the same time, it has been demonstrated that deep learning models can also achieve perfect accuracy even on randomly labeled training data [191]. Even though later work suggests that deep learning models tend to prioritize learning simple patterns first [10], these memorization effects still provide an attack surface for various threats.

In Section 2.3.1, we describe membership inference attacks, one of the most fundamental privacy attacks related to memorization in ML. In Chapter 3, we exploit memorization and force ML models to leak training data. In Chapter 4, we utilize membership inference attacks to build auditing techniques for detecting unauthorized data usage.

2.3.1 Membership Inference Attacks

Homer et al. [72] developed a technique for determining, given published summary statistics about a genome-wide association study, whether a specific known genome was used in the study. This is known as the membership inference problem. Subsequent work extended this technique to published noisy statistics [47] and MicroRNA-based studies [13]. In the context of ML, membership inference attacks assume a target record (x, y) and aim to decide whether (x, y) ∈ D_train.

Shokri attacks. Membership inference attacks against supervised ML models were first studied by Shokri et al. [161]. They assume an adversary with black-box access to a target model f_θ and propose a method to learn the statistical difference between the outputs on members and non-members by training a binary membership classifier whose feature vector is the target model's output probability vector. To learn the membership classifier, the adversary trains a number of shadow models that mimic the output behavior of the target model. The adversary then collects a set of output probability vectors from the shadow models and their corresponding membership labels as the training data for the membership classifier. Their attacks work best when f_θ has low generalizability, i.e., if the accuracy on the training inputs is much better than on inputs from outside the training dataset.
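The sketch below illustrates the shadow-model idea in a highly simplified form: it assumes scikit-learn is available, uses synthetic data as a stand-in for data drawn from the same distribution as the target's training set, and uses logistic regression for the shadow models, the target, and the attack classifier. The actual attack of Shokri et al. trains shadow models with the target's architecture and a separate attack model per class; this is only a conceptual outline.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample_data(n):
    # Stand-in for data drawn from the same distribution as the target's training set.
    X = rng.normal(size=(n, 10))
    y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
    return X, y

attack_X, attack_y = [], []
for _ in range(8):                                    # train several shadow models
    X_in, y_in = sample_data(200)                     # shadow "members"
    X_out, y_out = sample_data(200)                   # shadow "non-members"
    shadow = LogisticRegression().fit(X_in, y_in)
    # Attack features: the shadow model's output probability vectors.
    attack_X.append(shadow.predict_proba(X_in))
    attack_y.append(np.ones(len(X_in)))
    attack_X.append(shadow.predict_proba(X_out))
    attack_y.append(np.zeros(len(X_out)))

attack_model = LogisticRegression().fit(np.vstack(attack_X), np.concatenate(attack_y))

# Given the target model's output on a record, guess whether it was a training member.
target = LogisticRegression().fit(*sample_data(200))
query, _ = sample_data(1)
is_member = attack_model.predict(target.predict_proba(query))[0]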

In Chapter 4, we extend the shadow-model technique to infer user-level membership against text-generation models, and demonstrate how membership inference attacks can be used constructively for detecting unauthorized data collection.

Other attacks. Truex et al. [171] and Nasr et al. [139] generalize the Shokri attacks to white-box and federated learning [130] settings. Rahman et al. [153] use membership inference to evaluate the tradeoff between test accuracy and membership privacy in differentially private ML models. Hayes et al. [64] study membership inference against generative models. Long et al. [119] show that well-generalized models can leak membership information, but the adversary must first identify a handful of vulnerable records in the training dataset. Yeom et al. [184] formalize membership inference and theoretically show that overfitting is sufficient but not necessary.

2.4 Privacy-preserving Techniques

ML is being applied in numerous personal services, including recommendation systems [35, 55], voice assistants [7, 9, 34], and email response generation [85, 179]. To provide users the best experience with these services, Internet companies train ML models on sensitive user-generated data, such as keyboard inputs, web browsing history, location trajectories, etc. It is important to ensure that training and serving ML models do not leak information about the sensitive input data.

There is a large body of related research devoted to providing privacy-preserving machine learning. We focus on the techniques related to this dissertation, including differential privacy, secure ML training environments, and model partitioning.

2.4.1 Differential privacy

As ML models can be trained on sensitive datasets D_train, it is crucial that the trained models do not memorize specific information about any example in D_train. Differential privacy (DP) [46] provides such privacy guarantees for algorithms analyzing databases, which in our case is a learning algorithm processing a training dataset. We define DP mechanisms as follows:

Definition 1 A randomized mechanism M with range R satisfies (ε, δ)-differential privacy if for any two adjacent datasets D, D′ that differ in one row and for any subset of outputs S ⊆ R it holds that:

Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D′) ∈ S] + δ    (2.6)

Intuitively, the output distribution of a DP mechanism changes by at most a multiplicative factor of exp(ε) (plus the additive term δ) between any pair of datasets D, D′ that differ in one row. In the case of machine learning, a DP ML algorithm should output very similar models when training on any subset D_train \ {(x_i, y_i)}, ∀i, so that the models do not memorize any specific information about a particular example.

A popular way of training DP ML models is DP stochastic gradient descent (DP-SGD) [2]. In DP-SGD, each per-example gradient in a batch of training data is clipped, and the gradient update (to the model parameters) computed from the batch is perturbed with a carefully selected noise vector so as to mask the individual contribution of each example in the batch. The resulting models satisfy a strong DP guarantee and thus do not leak information about any input in D_train.
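The following sketch shows one conceptual DP-SGD update with per-example clipping and Gaussian noise; the clipping norm, noise multiplier, and learning rate are illustrative assumptions, and a real deployment would rely on a vetted DP library with proper privacy accounting rather than this toy loop (PyTorch is assumed).

import torch
from torch import nn

model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()
clip_norm, noise_mult, lr = 1.0, 1.1, 0.05       # illustrative DP-SGD hyper-parameters

def dp_sgd_step(batch_x, batch_y):
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(batch_x, batch_y):           # compute each example's gradient separately
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (norm + 1e-12), max=1.0)  # clip to bound the contribution
        for s, g in zip(summed, grads):
            s += g * scale
    with torch.no_grad():
        for p, s in zip(model.parameters(), summed):
            noise = noise_mult * clip_norm * torch.randn_like(s)  # Gaussian noise masks individuals
            p -= lr * (s + noise) / len(batch_x)

dp_sgd_step(torch.randn(16, 10), torch.randint(0, 2, (16,)))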

2.4.2 Secure ML environment

As described in Section 2.2.2, the training procedure can be outsourced to a third-party ML provider, e.g., an ML-as-a-service platform. It is important for these platforms to build secure ML training environments, as clients are relying on them to train ML models on sensitive datasets.

Software-based isolation mechanisms and network controls help prevent exfiltration of training data via conventional means. Several academic proposals have sought to construct even higher-assurance ML platforms. For example, Zhai et al. [190] propose a cloud service with isolated environments in which one user supplies sensitive data, another supplies a secret training algorithm, and the cloud ensures that the algorithm cannot communicate with the outside world except by outputting a trained model. The explicit goal is to assure the data owner that the ML provider cannot exfiltrate sensitive training data. Advances in data analytics frameworks based on trusted hardware such as SGX [14, 144, 159] and cryptographic protocols based on secure multi-party computation (see Section 3.6) may also serve as the basis for secure ML platforms.

2.4.3 Model partitioning

Model partitioning is a recently proposed mechanism for deploying machine learning models that resolves practical concerns. There is an increasing demand for on-device machine learning services, but modern deep neural networks (DNNs) have hundreds of millions of parameters, and porting such models onto users' devices is infeasible. Model partitioning provides a solution by splitting the model parameters into a local part for on-device computation and a remote part for cloud computation. Previous work has shown that partitioning a large DNN across mobile and remote resources can scale model inference without sacrificing accuracy [84, 104].

Model partitioning could also potentially provide a privacy benefit [30, 145, 178]. Deployed ML models in a service require users to share their test-time inputs for making predictions, yet these test-time inputs are also sensitive personal data. With model partitioning, it is possible to share with the server only the intermediate computation (usually a vector of numbers) from the local part of the ML model, and the server finishes the prediction given this intermediate computation. In this way, the raw user input is never sent to the server.
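A minimal sketch of this deployment pattern is shown below, assuming PyTorch; the split point and layer sizes are hypothetical, and the "server" is simulated by a local function call.

import torch
from torch import nn

# Hypothetical split of a small network: the first layers run on the user's device,
# the remaining layers run on the server.
local_part = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
remote_part = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 10))

def on_device(x):
    # Only this intermediate representation leaves the device, not the raw input.
    return local_part(x)

def on_server(representation):
    return remote_part(representation)

raw_input = torch.rand(1, 784)        # sensitive test-time input stays on the device
prediction = on_server(on_device(raw_input))

Chapter 5 revisits this setting and shows that the intermediate representation shared with the server may itself reveal sensitive attributes of the input.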

CHAPTER 3 INTENTIONAL MEMORIZATION WITH UNTRUSTED TRAINING CODE

Modern ML models, especially artificial neural networks, have huge capacity for "memorizing" arbitrary information [192]. This can lead to overprovisioning: even an accurate model may be using only a fraction of its raw capacity. The provider of an ML library or operator of an ML service can modify the training algorithm so that the model encodes more information about the training dataset than is strictly necessary for high accuracy on its primary task.

In this chapter, we investigate potential consequences of using untrusted training algorithms on a trusted platform. We show that relatively minor modifications to training algorithms can produce models that have high quality by the standard ML metrics (such as accuracy and generalizability), yet leak detailed information about their training datasets.

3.1 Threat Model

As explained in Section 2.2.2, data holders often use other people's training algorithms to create models from their data. We thus focus on the scenario where a data holder (client) applies ML code provided by an adversary (ML provider) to the client's data. We investigate whether an adversarial ML provider can exfiltrate sensitive training data, even when his code runs on a secure platform.

Client. The client has a dataset D_train sampled from the feature space X and wants to train a classification model f_θ on D_train, as described in Chapter 2. We assume that the client wishes to keep D_train private, as would be the case when D_train is proprietary documents, sensitive medical images, etc.

Figure 3.1: A typical ML training procedure. Data D is split into a training set D_train and a test set D_test. Training data may be augmented using an algorithm A, and then parameters are computed using a training algorithm T that uses a regularizer Ω. The resulting parameters are validated using the test set and either accepted or rejected (an error ⊥ is output). If the parameters θ are accepted, they may be published (white-box model) or deployed in a prediction service to which the adversary has input/output access (black-box model). The dashed box indicates the portions of the pipeline that may be controlled by the adversary.

The client applies a training procedure provided by the adversary to D_train. This training procedure outputs a model, defined by its parameters θ. The client validates the model by measuring its accuracy on the test subset D_test and the test-train gap, accepts the model if it passes validation, and then publishes it by releasing θ or making an API interface to f_θ available for prediction queries. We refer to the former as white-box access and the latter as black-box access to the model.

Adversary. We assume that the training procedure shown in Figure 3.1 is controlled by the adversary. In general, the adversary controls the core training algorithm T, but in this chapter we assume that T is a conventional, benign algorithm and focus on smaller modifications to the pipeline. For example, the adversary may provide a malicious data augmentation algorithm A, or else a malicious regularizer Ω, while keeping D_train intact. The adversary may also modify the parameters θ after they have been computed by T.

The adversarially controlled pipeline can execute entirely on the client side, for example, if the client runs the adversary's ML library locally on his data. It can also execute on a third-party platform, such as Algorithmia. We assume that the environment running the algorithms is secured using software [6, 190] or hardware [144, 159] isolation or cryptographic techniques. In particular, the adversary cannot communicate directly with the training environment; otherwise he could simply exfiltrate data over the network.

Adversary's objectives. The adversary's main objective is to infer as much of the client's private training dataset D_train as possible.

Some existing models already reveal parts of the training data. For example, nearest-neighbor classifiers and SVMs explicitly store some training data points in θ. Deep neural networks and classic logistic regression are not known to leak any specific training information. Even with SVMs, the adversary may want to exfiltrate more, or different, training data than is revealed by θ in the default setting. For black-box attacks, in which the adversary does not have direct access to θ, there is no known way to extract the sensitive data stored in θ by SVMs and nearest-neighbor models.

Other, more limited, objectives may include inferring the presence of a known input in the dataset D_train (this problem is known as membership inference), partial information about D_train (e.g., the presence of a particular face in some image in D_train), or metadata associated with the elements of D_train (e.g., geolocation data contained in the digital photos used to train an image recognition model). While we do not explore these in this chapter, our techniques can be used directly to achieve these goals. Furthermore, these goals require extracting much less information than is needed to reconstruct entire training inputs; therefore we expect our techniques to be even more effective.

Assumptions about the training environment. The adversary's pipeline has unrestricted access to the training data D_train and the model θ being trained. As mentioned above, we focus on scenarios where the adversary does not modify the training algorithm T but instead (a) modifies the parameters θ of the resulting model, or (b) uses A to augment D_train with additional training data, or (c) applies his own regularizer Ω while T is executing.

We assume that the adversary can observe neither the client's data, nor the execution of the adversary's ML pipeline on this data, nor the resulting model (until it is published by the client). We assume that the adversary's code incorporated into the pipeline is isolated and confined so that it has no way of communicating with or signaling to the adversary while it is executing. We also assume that all state of the training environment is erased after the model is accepted or rejected.

Therefore, the only way the pipeline can leak information about the dataset D_train to the adversary is by (1) forcing the model θ to somehow "memorize" this information and (2) ensuring that θ passes validation.

Access to the model. With white-box access, the adversary receives the model directly. He can directly inspect all parameters in θ, but not any temporary information used during training. This scenario arises, for example, if the client publishes θ.

With black-box access, the adversary has input-output access to θ: given any input x, he can obtain the model's output f_θ(x). For example, the model could be deployed inside an app and the adversary uses this app as a customer. Therefore, we focus on the simplest (and hardest for the adversary) case where he learns only the class label assigned by the model to his inputs, not the entire prediction vector with a probability for each possible class.

3.2 White-box Attacks

In a white-box attack, the adversary can see the parameters of the trained model. We thus focus on directly encoding information about the training dataset in the parameters. The main challenge is how to have the resulting model accepted by the client. In particular, the model must have high accuracy on the client’s classification task when applied to the test dataset.

3.2.1 LSB Encoding

Many studies have shown that high-precision parameters are not required to achieve high performance in machine learning models [62, 114, 155]. This observation motivates a very direct technique: simply encode information about the training dataset in the least significant (lower) bits of the model parameters.

Algorithm 1 LSB encoding attack
Input: Training dataset D_train, a benign ML training algorithm T, number of bits b to encode per parameter.
Output: ML model parameters θ′ with secrets encoded in the lower b bits.
θ ← T(D_train)
ℓ ← number of parameters in θ
s ← ExtractSecretBitString(D_train, ℓb)
θ′ ← set the lower b bits in each parameter of θ to a substring of s of length b

Encoding. Algorithm 1 describes the encoding method. First, train a benign model using a conventional training algorithm T, then post-process the model parameters θ by setting the lower b bits of each parameter to a bit string s extracted from the training data, producing modified parameters θ′.

Extraction. The secret string s can be either compressed raw data from D_train or any information about D_train that the adversary wishes to capture. The length of s is limited to ℓb, where ℓ is the number of parameters in the model.

Decoding. Simply read the lower bits of the parameters θ′ and interpret them as bits of the secret.
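The sketch below illustrates LSB encoding and decoding on 32-bit floating-point parameters using NumPy bit manipulation; the number of bits b, the secret bit string, and the parameter vector are illustrative placeholders rather than values from our experiments.

import numpy as np

b = 8                                                  # lower bits used per parameter

def lsb_encode(params, secret_bits):
    # Overwrite the b least significant bits of each float32 parameter with secret bits.
    bits = np.asarray(params, dtype=np.float32).view(np.uint32).copy()
    for i, start in enumerate(range(0, len(secret_bits), b)):
        chunk = secret_bits[start:start + b]
        value = int("".join(map(str, chunk)), 2)
        mask = np.uint32((1 << len(chunk)) - 1)
        bits[i] = (bits[i] & ~mask) | np.uint32(value)
    return bits.view(np.float32)                       # the modified parameters theta'

def lsb_decode(params, n_bits):
    bits = np.asarray(params, dtype=np.float32).view(np.uint32)
    out = []
    for i in range(int(np.ceil(n_bits / b))):
        take = min(b, n_bits - i * b)
        out.extend(int(c) for c in format(int(bits[i]) & ((1 << take) - 1), f"0{take}b"))
    return out

secret = [1, 0, 1, 1, 0, 0, 1, 0] * 4                  # 32 illustrative secret bits
theta = np.random.randn(16).astype(np.float32)         # stand-in for trained parameters
theta_prime = lsb_encode(theta, secret)
assert lsb_decode(theta_prime, len(secret)) == secret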

3.2.2 Correlated Value Encoding

Another approach is to gradually encode information while training the model parameters. The adversary can add a malicious term to the loss function L that maximizes the correlation between the parameters and the secret s that he wants to encode.

In our experiments, we use the negative absolute value of the Pearson correlation coefficient as the extra term in the loss function. During training, it drives the gradient direction towards a local minimum where the secret and the parameters are highly correlated. Algorithm 2 shows the template of the SGD training algorithm with the malicious regularization term in the loss function.

Algorithm 2 SGD with correlation value encoding
Input: Training dataset D_train = {(x_j, y_j)}_{j=1}^n, a benign loss function L, a model f, number of epochs T, learning rate η, attack coefficient λ_c, size of mini-batch q.
Output: ML model parameters θ correlated to secrets.
θ ← Initialize(f)
ℓ ← number of parameters in θ
s ← ExtractSecretValues(D, ℓ)
for t = 1 to T do
  for each mini-batch {(x_j, y_j)}_{j=1}^q ⊂ D_train do
    g_t ← ∇_θ (1/q) Σ_{j=1}^q L(y_j, f(x_j, θ)) + ∇_θ C(θ, s)
    θ ← UpdateParameters(η, θ, g_t)

In the above expression, λc controls the level of correlation and θ,¯ s¯ are the mean values of θ and s, respectively. The larger C, the more correlated θ and s. During optimization, the gradient of C with respect to θ is used for parameter update.

Observe that the C term resembles a conventional regularizer, commonly used in machine learning frameworks. The difference from the norm-based regularizers discussed previously is that we assign a weight to each parame-

26 ter in C that depends on the secrets that we want the model to memorize. This term skews the parameters to a space that correlates with these secrets. The pa- rameters found with the malicious regularizer will not necessarily be the same as with a conventional regularizer, but the malicious regularizer has the same effect of confining the parameter space to a less complex subspace [174].

Extraction. The method for extracting sensitive data s from the training data depends on the nature of the data. If the features in the raw data are Dtrain all numerical, then raw data can be directly used as the secret. For example, our method can force the parameters to be correlated with the pixel intensity of training images.

For non-numerical data such as text, we use data-dependent numerical val- ues to encode. We map each unique token in the vocabulary to a low-dimension pseudorandom vector and correlate the model parameters with these vectors. Pseudorandomness ensures that the adversary has a fixed mapping between tokens and vectors and can uniquely recover the token given a vector.

Decoding. If all features in the sensitive data are numerical and within the same range (for images raw pixel intensity values are in the [0, 255] range), the adversary can easily map the parameters back to feature space because corre- lated parameters are approximately linear transformation of the encoded fea- ture values.

To decode text documents, where tokens are converted into pseudorandom vectors, we perform a brute-force search for the tokens whose corresponding vectors are most correlated with the parameters. More sophisticated approaches (e.g., error-correcting codes) should work much better, but we do not explore them in this paper.

We provide more details about these decoding procedures for specific datasets in Section 3.4.

3.2.3 Sign Encoding

Another way to encode information in the model parameters is to interpret their signs as a bit string, e.g., a positive parameter represents 1 and a negative parameter represents 0. Machine learning algorithms typically do not impose constraints on signs, but the adversary can modify the loss function to force most of the signs to match the secret bit string he wants to encode.

Encoding. Extract a secret binary vector s ∈ {−1, 1}^ℓ from the training data, where ℓ is the number of parameters in θ, and constrain the sign of θ_i to match s_i. This encoding method is equivalent to solving the following constrained optimization problem:

    min_θ Ω(θ) + (1/n) Σ_{i=1}^{n} L(y_i, f_θ(x_i))

    such that θ_i s_i > 0 for i = 1, 2, ..., ℓ

Solving this constrained optimization problem can be tricky for models like deep neural networks due to its complexity. Instead, we can relax it to an unconstrained optimization problem using the penalty function method [143]. The idea is to convert the constraints to a penalty term added to the objective function, where the term penalizes the objective if the constraints are not met. In our case, we define the penalty term P as follows:

    P(θ, s) = (λ_s / ℓ) Σ_{i=1}^{ℓ} |max(0, −θ_i s_i)|.    (3.2)

In the above expression, λ_s is a hyperparameter that controls the magnitude of the penalty. Zero penalty is added when θ_i and s_i have the same sign; otherwise the penalty is |θ_i s_i|.
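A corresponding NumPy sketch of the penalty term in Equation 3.2 is shown below, again leaving gradient computation to the training framework; the function name is our own.

    import numpy as np

    def sign_penalty(theta, s, lambda_s):
        # Penalty of Equation 3.2: zero where sign(theta_i) matches s_i,
        # |theta_i * s_i| otherwise. Entries of s are in {-1, +1}.
        theta = np.asarray(theta, dtype=np.float64)
        s = np.asarray(s, dtype=np.float64)
        return (lambda_s / len(theta)) * np.sum(np.abs(np.maximum(0.0, -theta * s)))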

The attack algorithm is mostly identical to Algorithm 2 with two changes. The secret-extraction step becomes s ← ExtractSecretSigns(D, ℓ), where s is a binary vector of length ℓ instead of a vector of real numbers, and in the gradient computation P replaces the correlation term C. Similar to the correlation term, P changes the direction of the gradient to drive the parameters towards the subspace in R^ℓ where all sign constraints are met. In practice, the solution may not converge to a point where all constraints are met, but our algorithm can get most of the encoding correct if λ_s is large enough.

Observe that P is very similar to l1-norm regularization. When none of the parameter signs match, the term P is exactly the l1-norm because −θ_i s_i is always positive. Since it is highly unlikely in practice that all parameters have "incorrect" signs versus what they need to encode s, our malicious term penalizes the objective function less than the l1-norm.

Extraction. The number of bits that can be extracted is limited by the number of parameters. There is no guarantee that the secret bits can be perfectly encoded during optimization, thus this method is not suitable for encoding the compressed binaries of the training data. Instead, it can be used to encode the bit representation of the raw data. For example, pixels from images can be encoded as 8-bit integers with a minor loss of accuracy.

Decoding. Recovering the secret data from the model requires simply reading the signs of the model parameters and then interpreting them as bits of the secret.
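A minimal sketch of this decoding step, assuming the secret encodes 8-bit pixels with the most significant bit first (an illustrative convention, not necessarily the one used in our experiments):

    import numpy as np

    def decode_signs_to_pixels(theta, num_pixels):
        # Interpret parameter signs as bits (positive -> 1, negative -> 0)
        # and pack every 8 bits into one pixel value.
        bits = (np.asarray(theta[:num_pixels * 8]) > 0).astype(np.uint8)
        pixels = []
        for i in range(num_pixels):
            byte = 0
            for bit in bits[i * 8:(i + 1) * 8]:
                byte = (byte << 1) | int(bit)
            pixels.append(byte)
        return np.array(pixels, dtype=np.uint8)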

3.3 Black-box Attacks

Black-box attacks are more challenging because the adversary cannot see the model parameters and instead has access only to a prediction API. We focus on the (harder) setting in which the API, in response to an adversarially chosen feature vector x, applies fθ(x) and outputs the corresponding classification label (but not the associated confidence values). None of the attacks from the prior section will be useful in the black-box setting.

3.3.1 Abusing Model Capacity

We exploit the fact that modern machine learning models have vast capacity for memorizing arbitrarily labeled data [192].

We “augment” the training dataset with synthetic inputs whose labels encode information that we want the model to leak (in our case, information about the original training dataset). When the model is trained on the augmented dataset—even using a conventional training algorithm—it becomes overfitted to the synthetic inputs. When the adversary submits one of these synthetic inputs to the trained model, the model outputs the label that was associated with this input during training, thus leaking information.

Algorithm 3 Capacity-abuse attack
Input: Training dataset D_train, a benign ML training algorithm T, number of inputs m to be synthesized.
Output: ML model parameters θ that memorize the malicious synthetic inputs and their labels.
    D_mal ← SynthesizeMaliciousData(D_train, m)
    θ ← T(D_train ∪ D_mal)

Algorithm 3 outlines the attack. First, synthesize a malicious dataset D_mal whose labels encode secrets about D_train. Then train the model on the union of D_train and D_mal.

Observe that the entire training pipeline is exactly the same as in benign training. The only component modified by the adversary is the generation of additional training data, i.e., the augmentation algorithm A. Data augmentation is a very common practice for boosting the performance of machine learning models [100, 164].

3.3.2 Synthesizing Malicious Augmented Data

Ideally, each synthetic data point can encode ⌊log_2(c)⌋ bits of information, where c is the number of classes in the output space of the model. Algorithm 4 outlines our synthesis method. Similar to the white-box attacks, we first extract a secret bit string s from D_train. We then deterministically synthesize one data point for each substring of length ⌊log_2(c)⌋ in s.

Algorithm 4 Synthesizing malicious data
Input: A training dataset D_train, number of inputs to be synthesized m, auxiliary knowledge K.
Output: Synthesized malicious data D_mal
    D_mal ← ∅
    s ← ExtractSecretBitString(D_train, m)
    c ← number of classes in D_train
    for each ⌊log_2(c)⌋ bits s′ in s do
        x_mal ← GenData(K)
        y_mal ← BitsToLabel(s′)
        D_mal ← D_mal ∪ {(x_mal, y_mal)}
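For illustration, the BitsToLabel step of Algorithm 4 and its inverse can be sketched in Python as follows; the function names are hypothetical and the padding of the final chunk is our own convention.

    import math

    def bits_to_labels(secret_bits, num_classes):
        # Split a bit string into floor(log2(c))-bit chunks and map each chunk
        # to a class label in [0, num_classes).
        bits_per_label = int(math.floor(math.log2(num_classes)))
        labels = []
        for i in range(0, len(secret_bits), bits_per_label):
            chunk = secret_bits[i:i + bits_per_label]
            if len(chunk) < bits_per_label:
                chunk = chunk.ljust(bits_per_label, "0")  # pad the final chunk
            labels.append(int(chunk, 2))
        return labels

    def labels_to_bits(labels, num_classes):
        # Inverse mapping used at decoding time.
        bits_per_label = int(math.floor(math.log2(num_classes)))
        return "".join(format(l, "0{}b".format(bits_per_label)) for l in labels)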

Different types of data require different synthesis methods.

Synthesizing images. We assume no auxiliary knowledge for synthesizing images. The adversary can use any suitable GenData method: for example, generate pseudorandom images using the adversary’s choice of pseudorandom function (PRF) (e.g., HMAC [97]) or else create sparse images where only one pixel is filled with a (similarly generated) pseudorandom value.

We found the latter technique to be very effective in practice. GenData enumerates all pixels in an image and, for each pixel, creates a synthetic image where the corresponding pixel is set to the pseudorandom value while other pixels are set to zero. The same technique can be used with multiple pixels in each synthetic image.
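A minimal Python sketch of this sparse-image variant of GenData, using HMAC as the PRF, is shown below; the image dimensions, key handling, and the assumption that the number of synthetic images does not exceed the number of pixels are illustrative.

    import hmac, hashlib
    import numpy as np

    def gen_sparse_images(key, height, width, count):
        # Generate `count` synthetic images; the i-th image has a single
        # pseudorandom pixel value at position i (row-major), all others zero.
        # Assumes count <= height * width; key is a bytes object.
        images = []
        for i in range(count):
            digest = hmac.new(key, str(i).encode(), hashlib.sha256).digest()
            value = digest[0]                    # pseudorandom value in [0, 255]
            img = np.zeros((height, width), dtype=np.uint8)
            img[i // width, i % width] = value
            images.append(img)
        return images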

Synthesizing text. We consider two scenarios for synthesizing text documents.

If the adversary knows the exact vocabulary of the training dataset, he can use this vocabulary as the auxiliary knowledge in GenData. A simple deterministic implementation of GenData enumerates the tokens in the auxiliary vocabulary in a certain order. For example, GenData can enumerate all singleton tokens in lexicographic order, then all pairs of tokens in lexicographic order, and so on until the list is as long as the number of synthetic documents needed. Each list entry is then set to be a text in the augmented training dataset.

If the adversary does not know the exact vocabulary, he can collect frequently used words from some public corpus as the auxiliary vocabulary for generating synthetic documents. In this case, a deterministic implementation of GenData pseudorandomly (with a seed known to the adversary) samples words from the vocabulary until generating the desired number of documents.

To generate a document in this case, our simple synthesis algorithm samples a constant number of words (50, in our experiments) from the public vocabulary and joins them as a single document. The order of the words does not matter because the feature extraction step only cares whether a given word occurs in the document or not.
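A minimal Python sketch of this sampling procedure is shown below; the seeded PRNG and the helper name are illustrative choices rather than the exact implementation used in our experiments.

    import random

    def gen_documents(public_vocab, num_docs, seed, words_per_doc=50):
        # Deterministically synthesize documents by sampling words (with
        # replacement) from a public vocabulary using a seeded PRNG.
        rng = random.Random(seed)
        docs = []
        for _ in range(num_docs):
            words = [rng.choice(public_vocab) for _ in range(words_per_doc)]
            docs.append(" ".join(words))
        return docs

Because the seed is fixed and known to the adversary, the same documents can be regenerated exactly at decoding time.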

This synthesis algorithm may occasionally generate documents consisting only of words that do not occur in the model's actual vocabulary. Such words will typically be ignored in the feature extraction phase, thus the resulting documents will have empty features. If the attacker does not know the model's vocabulary, he cannot know if a particular synthetic document consists only of out-of-vocabulary words. This can potentially degrade both the test accuracy and decoding accuracy of the model.

In Section 3.4.7, we empirically measure the accuracy of the capacity-abuse attack with a public vocabulary.

Decoding memorized information. Because our synthesis methods for augmented data are deterministic, the adversary can replicate the synthesis process and query the trained model with the same synthetic inputs as were used during training. If the model is overfitted to these inputs, the labels returned by the model will be exactly the same labels that were associated with these inputs during training, i.e., the encoded secret bits.
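The black-box decoding loop can be sketched as follows, where query_model is a hypothetical stand-in for the prediction API that returns the top-ranked class index.

    import math

    def decode_from_model(query_model, synthetic_inputs, num_classes):
        # Query the trained model with the re-generated synthetic inputs and
        # concatenate the bits encoded in the labels it returns.
        bits_per_label = int(math.floor(math.log2(num_classes)))
        bits = []
        for x in synthetic_inputs:
            label = query_model(x)               # top-1 predicted class index
            bits.append(format(label, "0{}b".format(bits_per_label)))
        return "".join(bits)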

If a model has sufficient capacity to achieve good accuracy and generalizability on its original training data and to memorize malicious training data, then the accuracy on D_mal will be near perfect, leading to low error when extracting the sensitive data.

3.3.3 Why Capacity Abuse Works

Deep learning models have such a vast memorization capacity that they can essentially express any function to fit the data [192]. In our case, the model is fitted not just to the original training dataset but also to the synthetic data which is (in essence) randomly labeled. If the test accuracy on the original data is high, the model is accepted. If the training accuracy on the synthetic data is high, the adversary can extract information from the labels assigned to these inputs.

Critically, these two goals are not in conflict. Training on maliciously augmented datasets thus produces models that have high quality on their original training inputs yet leak information on the augmented inputs.

In the case of SVM and LR models, we focus on high-dimensional and sparse data (natural-language text). Our synthesis method also produces very sparse inputs. Empirically, the likelihood that a synthetic input lies on the wrong side of the hyperplane (classifier) becomes very small in this high-dimensional space.

3.4 Experiments

We evaluate our attack methods on benchmark image and text datasets, using, respectively, gray-scale training images and ordered tokens as the secret to be memorized in the model.

For each dataset and task, we first train a benign model using a conventional training algorithm. We then train and evaluate a malicious model for each attack method. We assume that the malicious training algorithm has a hard-coded secret that can be used as the key for a pseudorandom function or encryption.

3.4.1 Datasets and Tasks

Table 3.1 summarizes the datasets, models, and classification tasks we used in our experiments. We use as stand-ins for sensitive data several representative, publicly available image and text datasets.

CIFAR10 is an object classification dataset with 50,000 training images (10 categories, 5,000 images per category) and 10,000 test images [99]. Each image has 32x32 pixels, each pixel has 3 values corresponding to RGB intensities.

Labeled Faces in the Wild (LFW) contains 13,233 images for 5,749 individuals [74, 106]. We use 75% for training, 25% for testing. For the gender classification task, we use additional attribute labels [102]. Each image is rescaled to 67x42 RGB pixels from its original size, so that all images have the same size.

Dataset         n     d     Data size (bits)   f     Num params   Test acc
CIFAR10         50K   3072  1228M              RES   460K         92.89
LFW             10K   8742  692M               CNN   880K         87.83
FaceScrub (G)   57K   7500  3444M              RES   460K         97.44
FaceScrub (F)   57K   7500  3444M              RES   500K         90.08
News            11K   130K  176M               SVM   2.6M         80.58
News            11K   130K  176M               LR    2.6M         80.51
IMDB            25K   300K  265M               SVM   300K         90.13
IMDB            25K   300K  265M               LR    300K         90.48

Table 3.1: Summary of datasets and models. n is the size of the training dataset, d is the number of input dimensions. RES stands for Residual Network, CNN for Convolutional Neural Network. For FaceScrub, we use the gender classification task (G) and face recognition task (F).

FaceScrub is a dataset of URLs for 100K images [141]. The tasks are face recognition and gender classification. Some URLs have expired, but we were able to download 76,541 images for 530 individuals. We use 75% for training, 25% for testing. Each image is rescaled to 50x50 RGB pixels from its original size.

20 Newsgroups is a corpus of 20,000 documents classified into 20 categories [105]. We use 75% for training, 25% for testing.

IMDB Movie Reviews is a dataset of 50,000 reviews labeled with positive or negative sentiment [125]. The task is (binary) sentiment analysis. We use 50% for training, 50% for testing.

3.4.2 ML Models

Convolutional Neural Networks (CNNs) [108] are composed of a series of convolution operations as building blocks which can extract spatial-invariant features. The filters in these convolution operations are the parameters to be learned. We use a 5-layer CNN for gender classification on the LFW dataset.

The first three layers are convolution layers (32 filters in the first layer, 64 in the second, 128 in the third) followed by a max-pooling operation which reduces the size of convolved features by half. Each filter in the convolution layer is 3x3. The convolution output is connected to a fully-connected layer with 256 units. The latter layer connects to the output layer which predicts gender.

For the hyperparameters, we set the mini-batch size to be 128, the learning rate to be 0.1, and use SGD with Nesterov Momentum for optimizing the loss function. We also use the l2-norm as the regularizer with λ set to 10^−5. We set the number of epochs for training to 100. In epochs 40 and 60, we decrease the learning rate by a factor of 0.1 for better convergence. This configuration is inherited from the residual-network implementation in Lasagne.1

Residual Networks (RES) [66] overcome the gradient vanishing problem when optimizing very deep CNNs by adding identity mappings from lower layers to higher layers. These networks achieved state-of-the-art performance on many benchmark vision datasets in 2016.

We use a 34-layer residual network for CIFAR10 and FaceScrub. Although the network has fewer parameters than the CNN, it is much deeper and can learn better representations of the input data. The hyperparameters are the same as for the CNN.

1 https://github.com/Lasagne/Recipes/blob/master/modelzoo/resnet50.py

Bag-of-Words and Linear Models. For text datasets, we use a popular pipeline that extracts features using Bag-of-Words (BOW) and trains linear models.

BOW maps each text document into a vector in R^|V| where V is the vocabulary of tokens that appear in the corpus. Each dimension represents the count of that token in the document. The vectors are extremely sparse because only a few tokens from V appear in any given document.

We then feed the BOW vectors into an SVM or LR model. For 20 Newsgroups, there are 20 categories and we apply the One-vs-All method to train 20 binary classifiers to predict whether a data point belongs to the corresponding class or not. We train linear models using AdaGrad [45], a variant of SGD with adaptive adjustment to the learning rate of each parameter. We set the mini-batch size to 128, the learning rate to 0.1, and the number of epochs for training to 50 as AdaGrad converges very fast on these linear models.

3.4.3 Evaluation Metrics

Because we aim to encode secrets in a model while preserving its quality, we measure both the attacker's decoding accuracy and the model's classification accuracy on the test data for its primary task (accuracy on the training data is over 98% in all cases). Our attacks introduce minor stochasticity into training, thus the accuracy of maliciously trained models occasionally exceeds that of conventionally trained models.

Dataset         f     b    Encoded bits   Test acc   δ
CIFAR10         RES   18   8.3M           92.75      −0.14
LFW             CNN   22   17.6M          87.69      −0.14
FaceScrub (G)   RES   20   9.2M           97.33      −0.11
FaceScrub (F)   RES   18   8.3M           89.95      −0.13
News            SVM   22   57.2M          80.60      +0.02
News            LR    22   57.2M          80.40      −0.11
IMDB            SVM   22   6.6M           90.12      −0.01
IMDB            LR    22   6.6M           90.31      −0.17

Table 3.2: Results of the LSB encoding attack. Here f is the model used, b is the maximum number of lower bits used beyond which accuracy drops significantly, δ is the difference with the baseline test accuracy.

Metrics for decoding images. For images, we use mean absolute pixel error (MAPE). Given a decoded image x′ and the original image x with k pixels, MAPE is (1/k) Σ_{i=1}^{k} |x_i − x′_i|. Its range is [0, 255], where 0 means the two images are identical and 255 means every pair of corresponding pixels has maximum mismatch.

Metrics for decoding text. For text, we use precision (percentage of tokens from the decoded document that appear in the original document) and recall (percentage of tokens from the original document that appear in the decoded document). To evaluate similarity between the decoded and original documents, we also measure their cosine similarity based on their feature vectors constructed from the BOW model with the training vocabulary.
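For reference, minimal Python sketches of these metrics are shown below; the function names are our own, and the cosine similarity assumes the BOW feature vectors have already been computed and are non-zero.

    import numpy as np

    def mean_absolute_pixel_error(original, decoded):
        # MAPE between two images with the same number of pixels (range [0, 255]).
        original = np.asarray(original, dtype=np.float64).ravel()
        decoded = np.asarray(decoded, dtype=np.float64).ravel()
        return np.mean(np.abs(original - decoded))

    def precision_recall(original_tokens, decoded_tokens):
        # Token-level precision and recall between original and decoded documents.
        original, decoded = set(original_tokens), set(decoded_tokens)
        precision = len(decoded & original) / max(len(decoded), 1)
        recall = len(decoded & original) / max(len(original), 1)
        return precision, recall

    def cosine_similarity(bow_a, bow_b):
        # Cosine similarity between two BOW feature vectors (assumed non-zero).
        a = np.asarray(bow_a, dtype=np.float64)
        b = np.asarray(bow_b, dtype=np.float64)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))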

Figure 3.2: Test accuracy of the CIFAR10 model with different amounts of lower bits used for the LSB attack.

Dataset         f     λc    Test acc   δ       Decode MAPE
CIFAR10         RES   0.1   92.90      +0.01   52.2
CIFAR10         RES   1.0   91.09      −1.80   29.9
LFW             CNN   0.1   87.94      +0.11   35.8
LFW             CNN   1.0   87.91      −0.08   16.6
FaceScrub (G)   RES   0.1   97.32      −0.11   24.5
FaceScrub (G)   RES   1.0   97.27      −0.16   15.0
FaceScrub (F)   RES   0.1   90.33      +0.25   52.9
FaceScrub (F)   RES   1.0   88.64      −1.44   38.6

Table 3.3: Results of the correlated value encoding attack on image data. Here λc is the coefficient for the correlation term in the objective function and δ is the difference with the baseline test accuracy. For image data, decode MAPE is the mean absolute pixel error.

3.4.4 LSB Encoding Attack

Table 3.2 summarizes the results for the LSB encoding attack.

Dataset   f     λc    Test acc   δ       τ      Pre    Rec    Sim
News      SVM   0.1   80.42      −0.16   0.85   0.85   0.70   0.84
                                         0.95   1.00   0.56   0.78
News      LR    1.0   80.35      −0.16   0.85   0.90   0.80   0.88
                                         0.95   1.00   0.65   0.83
IMDB      SVM   0.5   89.47      −0.66   0.85   0.90   0.73   0.88
                                         0.95   1.00   0.16   0.51
IMDB      LR    1.0   89.33      −1.15   0.85   0.98   0.94   0.97
                                         0.95   1.00   0.73   0.90

Table 3.4: Results of the correlated value encoding attack on text data. τ is the decoding threshold for the correlation value. Pre is precision, Rec is recall, and Sim is cosine similarity.

Dataset         f     λs     Test acc   δ       Decode MAPE
CIFAR10         RES   10.0   92.96      +0.07   36.00
CIFAR10         RES   50.0   92.31      −0.58   3.52
LFW             CNN   10.0   88.00      +0.17   37.30
LFW             CNN   50.0   87.63      −0.20   5.24
FaceScrub (G)   RES   10.0   97.31      −0.13   2.51
FaceScrub (G)   RES   50.0   97.45      +0.01   0.15
FaceScrub (F)   RES   10.0   89.99      −0.09   39.85
FaceScrub (F)   RES   50.0   87.45      −2.63   7.46

Table 3.5: Results of the sign encoding attack on image data. Here λs is the coefficient for the penalty term in the objective function.

Encoding. For each task, we compressed a subset of the training data, encrypted it with AES in CBC mode, and wrote the ciphertext bits into the lower bits of the parameters of a benignly trained model. The fourth column in Table 3.2 shows the number of bits we can use before test accuracy drops significantly.

Decoding. Decoding is always perfect because we use lossless compression and no errors are introduced during encoding. For the 20 Newsgroups model, the adversary can successfully extract about 57 Mb of compressed data, equivalent to 70% of the training dataset.

Dataset   f     λs    Test acc   δ       Pre    Rec    Sim
News      SVM   5.0   80.42      −0.16   0.56   0.66   0.69
News      SVM   7.5   80.49      −0.09   0.71   0.80   0.82
News      LR    5.0   80.45      −0.06   0.57   0.67   0.70
News      LR    7.5   80.20      −0.31   0.63   0.73   0.75
IMDB      SVM   5.0   89.32      −0.81   0.60   0.68   0.75
IMDB      SVM   7.5   89.08      −1.05   0.66   0.75   0.81
IMDB      LR    5.0   89.52      −0.92   0.67   0.76   0.81
IMDB      LR    7.5   89.27      −1.21   0.76   0.83   0.88

Table 3.6: Results of the sign encoding attack on text data.

Test accuracy. In our implementation, each model parameter is a 32-bit floating-point number. Empirically, b under 20 does not decrease test accuracy on the primary task for most datasets. Binary classification on images (LFW, FaceScrub Gender) can endure more loss of precision. For multi-class tasks, test accuracy drops significantly when b exceeds 20, as shown for CIFAR10 in Figure 3.2.

3.4.5 Correlated Value Encoding Attack

Tables 3.3 and 3.4 summarize the results for this attack.

Image encoding and decoding. We correlate model parameters with the pixel intensity of gray-scale training images. The number of parameters limits the number of images that can be encoded in this way: 455 for CIFAR10, 200 for FaceScrub, 300 for LFW.

Figure 3.3: Decoded examples from all attacks applied to models trained on the FaceScrub gender classification task. First row is the ground truth. Second row is the correlated value encoding attack (λc = 1.0, MAPE = 15.0). Third row is the sign encoding attack (λs = 10.0, MAPE = 2.51). Fourth row is the capacity abuse attack (m = 110K, MAPE = 10.8).

We decode images by mapping the correlated parameters back to pixel space (if correlation is perfect, the parameters are simply linearly transformed images). To do so given a sequence of parameters, we map the minimum parameter to 0, the maximum to 255, and the other parameters to the corresponding pixel values using min-max scaling. We obtain an approximate original image after transformation if the correlation is positive and an approximate inverted original image if the correlation is negative.
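A minimal sketch of this min-max rescaling step is shown below; the function name is illustrative and the small constant added to the denominator (to avoid division by zero) is our own addition.

    import numpy as np

    def params_to_pixels(theta_slice):
        # Min-max scale a slice of parameters to the [0, 255] pixel range.
        theta = np.asarray(theta_slice, dtype=np.float64)
        scaled = (theta - theta.min()) / (theta.max() - theta.min() + 1e-12)
        return np.round(scaled * 255).astype(np.uint8)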

After the transformation, we measure the mean absolute pixel error (MAPE) for different choices of λc, which controls the level of correlation. We find that to recover reasonable images, λc needs to be over 1.0 for all tasks. For a fixed λc, errors are smaller for binary classification than for multi-class tasks. Examples of reconstructed images are shown in Figure 3.3 for the FaceScrub dataset.

Text encoding and decoding. To encode, we generate a pseudorandom, d′-dimensional vector of 32-bit floating point numbers for each token in the vocabulary of the training corpus. Then, given a training document, we use the pseudorandom vectors for the first 100 tokens in that document as the secret to correlate with the model parameters. We set d′ to 20. Encoding one document thus requires up to 2000 parameters, allowing us to encode around 1300 documents for 20 Newsgroups and 150 for IMDB.

To decode, we first reproduce the pseudorandom vectors for each token used during training. For each consecutive part of the parameters that should match a token, we decode by searching for the token whose corresponding vector is best correlated with the parameters. We set a threshold value τ: if the correlation value is above τ, we accept this token and reject it otherwise.
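A minimal sketch of this brute-force decoding of a single token is shown below; token_vectors is a hypothetical mapping from each token to its pseudorandom d′-dimensional vector.

    import numpy as np

    def decode_token(param_slice, token_vectors, tau):
        # Return the token whose pseudorandom vector is most correlated with the
        # given parameter slice, or None if the best correlation is below tau.
        best_token, best_corr = None, -1.0
        p = np.asarray(param_slice, dtype=np.float64)
        for token, vec in token_vectors.items():
            corr = np.corrcoef(p, np.asarray(vec, dtype=np.float64))[0, 1]
            if corr > best_corr:
                best_token, best_corr = token, corr
        return best_token if best_corr >= tau else None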

Table 3.4 shows the decoding results for different τ. As expected, larger τ increases precision and reduces recall. Empirically, τ = 0.85 yields high-quality decoded documents (see examples in Table 3.7).

Test accuracy. Models with a lower decoding error also have lower test accuracy. For binary classification tasks, we can keep MAPE reasonably low while reducing test accuracy by 0.1%. For CIFAR10 and FaceScrub face recognition, lower MAPE requires larger λ_c, which in turn reduces test accuracy by more than 1%.

For 20 Newsgroups, test accuracy drops only by 0.16%. For IMDB, the drop is more significant: 0.66% for SVM and 1.15% for LR.

Example 1
    Ground truth:                      has only been week since saw my first john waters film female trouble and wasn sure what to expect
    Correlation encoding (λc = 1.0):   it natch only been week since saw my first john waters film female trouble and wasn sure what to expect extremism the
    Sign encoding (λs = 7.5):          it has peering been week saw mxyzptlk first john waters film bloch trouble and wasn sure what to expect the
    Capacity abuse (m = 24K):          it has peering been week saw my first john waters film female trouble and wasn sure what to

Example 2
    Ground truth:                      in brave new girl holly comes from small town in texas sings the yellow rose of texas at local competition
    Correlation encoding (λc = 1.0):   in chasing new girl holly comes from willed town in texas sings the yellow rose of texas at local competition
    Sign encoding (λs = 7.5):          in brave newton girl hoists comes from small town impressible texas sings urban rosebud of texas at local obsess and
    Capacity abuse (m = 24K):          in brave newton girl holly comes from small town in texas sings the yellow rose of texas at local competition

Example 3
    Ground truth:                      maybe need to have my head examined but thought this was pretty good movie the cg is not too bad
    Correlation encoding (λc = 1.0):   maybe need to have my head examined but thought this was pretty good movie the cg pirouetting not too bad
    Sign encoding (λs = 7.5):          maybe need to enjoyed my head hippo but tiburon wastage pretty good movie the cg is northwest too bad have
    Capacity abuse (m = 24K):          maybe need to have my head examined but thoughout tiburon was pretty good movie the cg is not too bad

Example 4
    Ground truth:                      was around when saw this movie first it wasn so special then but few years later saw it again and
    Correlation encoding (λc = 1.0):   was around when saw this movie martine it wasn so special then but few years later saw it again and saw isoyc
    Sign encoding (λs = 7.5):          was around saw this movie first possession tributed so special zellweger but few years linette again and
    Capacity abuse (m = 24K):          was around when saw this movie first it wasn soapbox special then but few years later saw it again and that

Table 3.7: Decoded text examples from all attacks applied to LR models trained on the IMDB dataset.

3.4.6 Sign Encoding Attack

Tables 3.5 and 3.6 summarize the results of the sign encoding attack.

Image encoding and decoding. As mentioned in Section 3.2.3, the sign encoding attack may not encode all bits correctly. Therefore, instead of the encrypted, compressed binaries that we used for LSB encoding, we use the bit representation of the raw pixels of the gray-scale training images as the string to be encoded. Each pixel is an 8-bit unsigned integer. The encoding capacity is thus 1/8 of that of the correlated value encoding attack. We can encode 56 images for CIFAR10, 25 images for FaceScrub and 37 images for LFW.

To reconstruct pixels, we assemble the bits represented in the parameter signs. With λ_s = 50, MAPE is small for all datasets. For gender classification on FaceScrub, the error can be smaller than 1, i.e., reconstruction is nearly perfect.

Text encoding and decoding. We construct a bit representation for each token using its index in the vocabulary. The number of bits per token is ⌈log_2(|V|)⌉, which is 17 for both 20 Newsgroups and IMDB. We encode the first 100 words in each document and thus need a total of 1,700 parameter signs per document. We encode 1530 documents for 20 Newsgroups and 180 for IMDB in this way.

To reconstruct tokens, we use the signs of 17 consecutive parameters as the index into the vocabulary. Setting λ_s ≥ 5 yields good results for most tasks (see examples in Table 3.7). Decoding is less accurate than for the correlated value encoding attack. The reason is that signs need to be encoded almost perfectly to recover high-quality documents; even if 1 bit out of 17 is wrong, our decoding produces a completely different token. More sophisticated, error-correcting decoding techniques can be applied here, but we leave this to future work.

Test accuracy. This attack does not significantly affect the test accuracy of binary classification models on image datasets. For LFW and CIFAR10, test accuracy occasionally increases. For multi-class tasks, when λ_s is large, FaceScrub face recognition degrades by 2.6%, while the CIFAR10 model with λ_s = 50 still generalizes well.

Dataset         f     m      m/n    Test acc   δ       Decode MAPE
CIFAR10         RES   49K    0.98   92.21      −0.69   7.60
CIFAR10         RES   98K    1.96   91.48      −1.41   8.05
LFW             CNN   34K    3.4    88.03      +0.20   18.6
LFW             CNN   58K    5.8    88.17      +0.34   22.4
FaceScrub (G)   RES   110K   2.0    97.08      −0.36   10.8
FaceScrub (G)   RES   170K   3.0    96.94      −0.50   11.4
FaceScrub (F)   RES   55K    1.0    87.46      −2.62   7.62
FaceScrub (F)   RES   110K   2.0    86.36      −3.72   8.11

Table 3.8: Results of the capacity abuse attack on image data. Here m is the number of synthesized inputs and m/n is the ratio of synthesized data to training data.

Dataset   f     m     m/n    Test acc   δ       Pre    Rec    Sim
News      SVM   11K   1.0    80.53      −0.07   1.00   1.00   1.00
News      SVM   33K   3.0    79.77      −0.63   0.99   0.99   0.99
News      LR    11K   1.0    80.06      −0.45   0.98   0.99   0.99
News      LR    33K   3.0    79.94      −0.57   0.95   0.97   0.97
IMDB      SVM   24K   0.95   89.82      −0.31   0.90   0.94   0.96
IMDB      SVM   75K   3.0    89.05      −1.08   0.89   0.93   0.95
IMDB      LR    24K   0.95   89.90      −0.58   0.87   0.92   0.95
IMDB      LR    75K   3.0    89.26      −1.22   0.86   0.91   0.94

Table 3.9: Results of the capacity abuse attack on text data.

For 20 Newsgroups, test accuracy changes by less than 0.5% for all values of λ_s. For IMDB, accuracy decreases by around 0.8% to 1.2% for both SVM and LR.

3.4.7 Capacity Abuse Attack

Tables 3.8 and 3.9 summarize the results.

Image encoding and decoding. We could use the same technique as in the sign encoding attack, but for a binary classifier this requires 8 synthetic inputs per pixel. Instead, we encode an approximate pixel value in 4 bits. We map a pixel value p ∈ {0, ..., 255} to p′ ∈ {0, ..., 15} (e.g., map 0–15 in p to 0 in p′) and use 4 synthetic data points to encode p′. Another possibility (not evaluated in this paper) would be to encode every other pixel and recover the image by interpolating the missing pixels.
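A minimal sketch of this 4-bit quantization and its inverse is shown below; the bit order and the reconstruction midpoint are our own illustrative choices.

    def pixel_to_4bit_labels(pixel_value):
        # Quantize an 8-bit pixel (0-255) to 4 bits (0-15) and return the four
        # binary labels that encode it, most significant bit first.
        quantized = pixel_value // 16            # e.g., 0-15 -> 0, 16-31 -> 1, ...
        return [(quantized >> shift) & 1 for shift in (3, 2, 1, 0)]

    def labels_to_pixel(labels):
        # Invert the encoding: four binary labels -> approximate 8-bit pixel.
        quantized = (labels[0] << 3) | (labels[1] << 2) | (labels[2] << 1) | labels[3]
        return quantized * 16 + 8                # midpoint of the 16-value bucket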

We evaluate two settings of m, the number of synthesized data points. For LFW, we can encode 3 images for m = 34K and 5 images for m = 58K. For FaceScrub gender classification, we can encode 11 images for m = 110K and 17 images for m = 170K. While these numbers may appear low, this attack works in a black-box setting against a binary classifier, where the adversary aims to recover information from a single output bit. Moreover, for many tasks (e.g., medical image analysis) recovering even a single training input constitutes a serious privacy breach. Finally, if the attacker's goal is to recover not the raw images but some other information about the training dataset (e.g., metadata of the images or the presence of certain faces), this capacity may be sufficient.

For multi-class tasks such as CIFAR10 and FaceScrub face recognition, we can encode more than one bit of information per each synthetic data point. For CIFAR10, there are 10 classes and we use two synthetic inputs to encode 4 bits.

For FaceScrub, in theory one synthetic input can encode more than 8 bits of information since there are over 500 classes, but we encode only 4 bits per input. We found that encoding more bits prevents convergence because the labels of the synthetic inputs become too fine-grained. We evaluate two settings of m. For CIFAR10, we can encode 25 images with m = 49K and 50 with m = 98K. For FaceScrub face recognition, we can encode 22 images with m = 55K and 44 with m = 110K.

Dataset   f     m     m/n    Test acc   δ       Pre    Rec    Sim
News      SVM   11K   1.0    79.31      −1.27   0.94   0.90   0.94
News      SVM   22K   2.0    78.11      −2.47   0.94   0.91   0.94
News      LR    11K   1.0    79.85      −0.28   0.94   0.91   0.94
News      LR    22K   2.0    78.95      −1.08   0.94   0.91   0.94
IMDB      SVM   24K   0.95   89.44      −0.69   0.87   0.89   0.94
IMDB      SVM   36K   1.44   89.25      −0.88   0.49   0.53   0.71
IMDB      LR    24K   0.95   89.92      −0.56   0.79   0.82   0.90
IMDB      LR    36K   1.44   89.75      −0.83   0.44   0.47   0.67

Table 3.10: Results of the capacity abuse attack on text datasets using a public auxiliary vocabulary.

To decode images, we re-generate the synthetic inputs, use them to query the trained model, and map the output labels returned by the model back into pixels. We measure the MAPE between the original images and the decoded approximate 4-bit-pixel images. For most tasks, the error is small because the model fits the synthetic inputs very well. Although the approximate pixels are less precise, the reconstructed images are still recognizable—see the fourth row of Figure 3.3.

Text encoding and decoding. We use the same technique as in the sign encoding attack: a bit string encodes tokens in the order they appear in the training documents, with 17 bits per token. Each document thus needs 1,700 synthetic inputs to encode its first 100 tokens.

20 Newsgroups models have 20 classes and we use the first 16 to encode 4 bits of information. Binary IMDB models can only encode one bit per synthetic input. We evaluate two settings for m. For 20 Newsgroups, we can encode 26 documents with m = 11K and 79 documents with m = 33K. For IMDB, we can encode 14 documents with m = 24K and 44 documents with m = 75K.

With this attack, the decoded documents have high quality (see Table 3.7). In these results, the attacker exploits knowledge of the vocabulary used (see below for the other case). For 20 Newsgroups, recovery is almost perfect for both SVM and LR. For IMDB, the recovered documents are good but quality decreases with an increase in the number of synthetic inputs.

Test accuracy. For image datasets, the decrease in test accuracy is within 0.5% for the binary classifiers. For LFW, test accuracy even increases marginally. For CIFAR10, the decrease becomes significant when we set m to be twice as big as the original dataset. Accuracy is most sensitive for face recognition on FaceScrub because the number of classes is very large.

For text datasets, an m three times the size of the original dataset results in less than a 0.6% drop in test accuracy on 20 Newsgroups. On IMDB, test accuracy drops by less than 0.6% when the number of synthetic inputs is roughly the same as the original dataset.

Using a public auxiliary vocabulary. The synthetic images used for the capacity-abuse attack are pseudorandomly generated and do not require the attacker to have any prior knowledge about the images in the actual training dataset. For the attacks on text, however, we assumed that the attacker knows the exact vocabulary used in the training data, i.e., the list of words from which all training documents are drawn (see Section 3.3.2).

We now relax this assumption and assume that the attacker uses an auxiliary vocabulary collected from publicly available corpora: the Brown Corpus,2 the Gutenberg Corpus [103],3 Rotten Tomatoes [147],4 and a word list from Tesseract OCR.5

Obviously, this public auxiliary vocabulary requires no prior knowledge of the model's actual vocabulary. It contains 67K tokens and needs 18 bits to encode each token. We set the target to be the first 100 tokens that appear in each document and discard the tokens that are not in the public vocabulary. Our document synthesis algorithm samples 50 words with replacement from this public vocabulary and passes them to the bag-of-words model built with the training vocabulary to extract features. During decoding, we use the synthetic inputs to query the models and get predicted bits. We use each consecutive 18 bits as an index into the public vocabulary to reconstruct the target text.

Table 3.10 shows the results of the attack with this public vocabulary. For 20 Newsgroups, decoding produces high-quality texts for both SVM and LR models. Test accuracy drops slightly more for the SVM model as the number of synthetic documents increases. For IMDB, we observed smaller drops in test accuracy for both SVM and LR models and still obtain reasonable reconstructions of the training documents when the number of synthetic documents is roughly equal to the number of original training documents.

2 http://www.nltk.org/book/ch02.html
3 https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html
4 http://www.cs.cornell.edu/people/pabo/movie-review-data/
5 https://github.com/tesseract-ocr/langdata/blob/master/eng/eng.wordlist

Figure 3.4: Capacity abuse attack applied to CNNs with a different number of parameters trained on the LFW dataset. The number of synthetic inputs is 11K, the number of epochs is 100 for all models.

Memorization capacity and model size. To further investigate the relationship between the number of model parameters and the model's capacity for maliciously memorizing "extra" information about its training dataset, we compared CNNs with different numbers of filters in the last convolution layer: 16, 32, 48, ..., 112. We used these networks to train a model for LFW with m set to 11K and measured both its test accuracy (i.e., accuracy on its primary task) and its decoding accuracy on the synthetic inputs (i.e., accuracy of the malicious task).

Figure 3.4 shows the results. Test accuracy is similar for smaller and bigger models. However, the encoding capacity of the smaller models, i.e., their test accuracy on the synthetic data, is much lower and thus results in less accurate decoding. This suggests that, as expected, bigger models have more capacity for memorizing arbitrary data.

Figure 3.5: Visualization of the learned features of a CIFAR10 model maliciously trained with our capacity-abuse method. Solid points are from the original training data, hollow points are from the synthetic data. The color indicates the point's class.

Visualization of capacity abuse. Figure 3.5 visualizes the features learned by a CIFAR10 model that has been trained on its original training images augmented with maliciously generated synthetic images. The points are sampled from the last-layer outputs of Residual Networks on the training and synthetic data and then projected to 2D using t-SNE [126].

The plot clearly shows that the learned features are almost linearly separable across the classes of the training data and the classes of the synthetic data. The classes of the training data correspond to the primary task, i.e., different types of objects in the image. The classes of the synthetic data correspond to the malicious task, i.e., given a specific synthetic image, the class encodes a secret about the training images. This demonstrates that the model has learned both its primary task and the malicious task well.

Figure 3.6: Comparison of parameter distribution between a benign model and malicious models. Left is the correlation encoding attack (cor); middle is the sign encoding attack (sgn); right is the capacity abuse attack (cap). The models are residual networks trained on CIFAR10. Plots show the distribution of parameters in the 20th layer.

3.5 Countermeasures

Detecting that a training algorithm is attempting to memorize sensitive data within the model is not straightforward because, as we show in this paper, there are many techniques and places for encoding this information: directly in the model parameters, by applying a malicious regularizer, or by augmenting the training data with specially crafted inputs. Manual inspection of the code may not detect malicious intent, given that many of these approaches are similar to standard ML techniques.

An interesting way to mitigate the LSB attack is to turn it against itself. The attack relies on the observation that lower bits of model parameters essentially don't matter for model accuracy. Therefore, a client can replace the lower bits of the parameters with random noise. This will destroy any information potentially encoded in these bits without any impact on the model's performance.

Maliciously trained models may exhibit anomalous parameter distributions. Figure 3.6 compares the distribution of parameters in a conventionally trained model, which has the shape of a zero-mean Gaussian, to maliciously trained models. As expected, parameters generated by the correlated value encoding attack are distributed very differently. Parameters generated by the sign encoding attack are more centered at zero, which is similar to the effect of conventional l1-norm regularization (which encourages sparsity in the parameters). To detect these anomalies, the data owner must have a prior understanding of what a “normal” parameter distribution looks like. This suggests that deploying this kind of anomaly detection may be challenging.

Parameters generated by the capacity-abuse attack are not visibly different. This is expected because training works exactly as before, only the dataset is augmented with additional inputs.

3.6 Related Work

Privacy threats in ML. No prior work considered malicious learning algorithms aiming to create a model that leaks information about the training dataset.

Ateniese et al. [11] show how an attacker can use access to an ML model to infer a predicate of the training data, e.g., whether a voice recognition system was trained only with Indian English speakers.

Fredrikson et al. [54] explore model inversion: given a model f_θ that makes a prediction y given some hidden feature vector x_1, ..., x_n, they use the ground-truth label ỹ and a subset of x_1, ..., x_n to infer the remaining, unknown features. Model inversion operates in the same manner whether the feature vector x_1, ..., x_n is in the training dataset or not, but empirically performs better for training set points due to overfitting. Subsequent model inversion attacks [53] show how, given access to a face recognition model, to construct a representative of a certain output class (a recognizable face when each class corresponds to a single person).

In contrast to the above techniques, our objective is to extract specific inputs that belong to the training dataset which was used to create the model.

Our attacks are also different from membership inference attacks (see Section 2.3.1), as we study how a malicious training algorithm can intentionally create a model that leaks information about its training dataset. The difference between membership inference and our problem is akin to the difference between side channels and covert channels. Our threat model is more generous to the adversary, thus our attacks extract substantially more information about the training data than any prior work. Another important difference is that we aim to create models that generalize well yet leak information.

Evasion and poisoning. Evasion attacks seek to craft inputs that will be misclassified by a ML model. They were first explored in the context of spam detection [59, 121, 122]. More recent work investigated evasion in other settings such as computer vision—see a survey by Papernot et al. [148]. Our work focuses on the confidentiality of training data rather than evasion, but future work may investigate how malicious ML providers can intentionally create models that facilitate evasion.

Poisoning attacks [18, 36, 92, 140, 157] insert malicious data points into the training dataset to make the resulting model easier to evade. This technique is similar in spirit to the malicious data augmentation in our capacity-abuse attack (Section 3.3). Our goal is not evasion, however, but forcing the model to leak its training data.

Secure ML environments. Starting with [115], there has been much research on using secure multi-party computation to enable several parties to create a joint model on their separate datasets, e.g., [19, 32, 44]. A protocol for distributed, privacy-preserving deep learning was proposed in [160]. Abadi et al. [2] describe how to train differentially private deep learning models. Systems using trusted hardware such as SGX protect training data while training on an untrusted service [41, 144, 159]. In all of these works, the training algorithm is public and agreed upon, and our attacks would work only if users are tricked into using a malicious algorithm.

CQSTR [190] explicitly targets situations in which the training algorithm may not be entirely trustworthy. Our results show that in such settings a malicious training algorithm can covertly exfiltrate significant amounts of data, even if the output is constrained to be an accurate and usable model.

Privacy-preserving classification protocols seek to prevent disclosure of the user’s input features to the model owner as well as disclosure of the model to the user [21]. Using such a system would prevent our white-box attacks, but not black-box attacks.

ML model capacity and compression. Our capacity-abuse attack takes advantage of the fact that many models (especially deep neural networks) have huge memorization capacity. Zhang et al. [192] showed that modern ML models can achieve (near) 100% training accuracy on datasets with randomized labels or even randomized features. They argue that this undermines previous interpretations of generalization bounds based on training accuracy.

Our capacity-abuse attack augments the training data with (essentially) randomized data and relies on the resulting low training error to extract information from the model. Crucially, we do this while simultaneously training the model to achieve good testing accuracy on its primary, non-adversarial task.

Our LSB attack directly takes advantage of the large number and unnecessarily high precision of model parameters. Several papers investigated how to compress models [24, 29, 62]. An interesting topic of future work is how to use these techniques as a countermeasure to malicious training algorithms.

3.7 Conclusion

We demonstrated that malicious machine learning (ML) algorithms can create models that satisfy the standard quality metrics of accuracy and generalizability while leaking a significant amount of information about their training datasets, even if the adversary has only black-box access to the model.

ML cannot be applied blindly to sensitive data, especially if the model-training code is provided by another party. Data holders cannot afford to be ignorant of the inner workings of ML systems if they intend to make the resulting models available to other users, directly or indirectly. Whenever they use somebody else's ML system or employ ML as a service (even if the service promises not to observe the operation of its algorithms), they should demand to see the code and understand what it is doing.

In general, we need “the principle of least privilege” for machine learning. ML training frameworks should ensure that the model captures only as much about its training dataset as it needs for its designated task and nothing more. How to formalize this principle, how to develop practical training methods that satisfy it, and how to certify these methods are interesting open topics for future research.

CHAPTER 4
AUDITING DATA PROVENANCE IN TEXT-GENERATION MODELS

Data-protection policies and regulations such as the European Union's General Data Protection Regulation (GDPR) [172] give users the right to know how their data is processed. According to Article 6(1) of GDPR, lawful data processing requires that the data subject (the user in our context) has given consent to the processing of his or her personal data for one or more specific purposes. As machine learning (ML) becomes a core component of data processing in many offline and online services, and incidents such as DeepMind's unauthorized use of NHS patients' data to train ML models [15] illustrate the resulting privacy risks, it is essential to be able to audit the provenance of personal data used for model training.

In this chapter, we consider a malicious service provider who might collect users' personal data for training ML models without their consent. We design and evaluate a technology that can help users audit ML models to determine if their data was used to train these models. We focus specifically on auditing models that generate natural-language text. Text-generation models for tasks such as next-word prediction (the basis of query autocompletion and predictive virtual keyboards) and dialog generation (the basis of chatbots and automated customer service) are extensively trained on personal data, including users' messages, documents, chats, comments, and search queries. Our technology can help users audit a publicly available text-generation model and see if their words were used, perhaps without their permission, to create this model. Furthermore, our work sheds new light on how deep learning-based, text-generation models memorize their training data—a topic that has important implications for both data privacy and natural language processing.

4.1 Text-generation Models

Text-generation models are extremely popular for natural language processing (NLP) tasks including next-word prediction, machine translation and dialogue generation. In these models, the input is a variable-length sequence of tokens x = [x^1, ..., x^l] in the embedding space. The output y can be either a class label (e.g., for sentiment analysis), or a token (e.g., for next-word prediction), or a sequence of tokens (e.g., for machine translation).

Embeddings. For text data, where the input space is discrete and sparse, the standard approach is to transform discrete inputs into a lower-dimensional continuous vector representation. For a text corpus with vocabulary V, an embedding is a function E : V → R^{d_emb} where d_emb is the dimension of the embedding vector.

Recurrent neural networks. A common deep learning model for sequential inputs is the recurrent neural network (RNN). An RNN maps the input sequence to a sequence of hidden representations h = [h^1, ..., h^l], where the computation of h^j is recursively dependent on the previous hidden representation h^{j−1} and the current input token x^j, and feeds these hidden representations to a classifier.

Sequence-to-sequence models are for text-generation tasks where both the input x = [x^1, ..., x^l] and the output y = [y^1, ..., y^t] are sequences of tokens. A typical sequence-to-sequence model consists of an encoder RNN and a decoder RNN. The encoder learns the representation for the input texts, then passes this representation as the initial state for the decoder, which makes word predictions one at a time. Translation models are similar: the decoder predicts words in the target language by feeding its hidden representations to a classifier.

Next-word prediction is used in many natural-language applications, including predictive virtual keyboards and query autocompletion. Given an input sequence x = [x^1, ..., x^l], the task is to predict the next token x^j from the context [x^1, ..., x^{j−1}]. RNNs are commonly used for this task. The RNN feeds the last hidden representation h^{j−1} of the context sequence to a |V|-way classifier to predict the next token, where V is the vocabulary.

Neural machine translation (NMT) models based on RNNs reach near-human performance on many language pairs [180]. The input to these models is a sequence of tokens from the source language, the output is a sequence of tokens from the target language. NMT models use the sequence-to-sequence framework. The input text is encoded as a hidden representation, and the decoder RNN predicts translated tokens based on this representation.

Dialog generation aims to generate replies in a conversation. It is a common component of chatbots and question-answering services. The input is a sentence, the output is the next sentence in the same conversation. Dialog-generation models can also employ a sequence-to-sequence architecture [109, 175]. Similar to NMT, the model encodes the input sentence to a hidden representation, then generates the reply by passing this representation to the decoder.

Loss functions. For the next-word prediction task, given an input sequence x = [x^1, ..., x^l], the RNN models the conditional probability Pr(x^j | x^1, ..., x^{j−1}) = f(x^1, ..., x^{j−1}) and aims to maximize the probability for the sequence Pr(x) = Π_{j=1}^{l} Pr(x^j | x^1, ..., x^{j−1}). The loss function used when training the model is thus the negative log likelihood: L(f(x), x) = −Σ_{j=1}^{l} log f(x^1, ..., x^{j−1}). For the machine translation and dialog-generation tasks, where the input is x and the target is y = [y^1, ..., y^t], the sequence-to-sequence model computes the probability Pr(y^j | y^1, ..., y^{j−1}; x) as f(y^1, ..., y^{j−1}; x). Similar to the next-word prediction task, the loss function is the negative log probability on the target sequence.

4.2 Auditing text-generation models

Consider a training dataset D_train where each row is associated with an individual user, and let U_train be the set of all users in D_train. The target model f is trained on D_train using a training protocol T, which includes the learning algorithm and the hyper-parameters that govern the training regime. As described in Section 4.1, a text-generation model f takes as input a sequence of tokens x and outputs a prediction f(x) for a single token (if the task is next-word prediction) or a sequence of tokens (if the task is machine translation or dialog generation). The prediction f(x) is a probability distribution or a sequence of distributions over the training vocabulary V or a subset of V. We assume that the tokens in the model's output space are ranked (i.e., the output distribution imposes an order on all possible tokens) but do not assume that the numeric probabilities from which the ranks are computed are available as part of the model's output.

Algorithm 5 Auditing text-generation models
Hyper-parameters: auditor's reference dataset D_ref, number of shadow models k, user's data D_user, target model f, target model-training protocol T_target, audit model-training protocol T_audit, maximum number of queries m, number of bins in histogram d

procedure AuditMembership
    f_audit ← TrainAuditModel()
    D_sample,u ← SampleQueries(m, D_user)
    h_u ← HistogramFeature(f, D_sample,u)
    return prediction of membership f_audit(h_u)

procedure SampleQueries(m, D)
    if random sample then
        return randomly selected m rows in D
    else                                    ▷ sample based on frequency
        C ← {Σ_{w in y} (frequency of w) | ∀(x, y) ∈ D}
        I ← indices of m smallest values in C
        return m rows in D indexed by I

procedure TrainAuditModel
    D_audit ← ∅                             ▷ dataset for building the audit model
    U_ref ← users in D_ref
    for i = 1 to k do                       ▷ train k shadow models
        U_ref-train, U_ref-test ← random split of U_ref
        D_ref-train ← ∪_{u ∈ U_ref-train} {D_ref,u}
        Train a shadow model f′_i ← T_target(D_ref-train)
        for every u in users of U_ref do
            D_ref,u ← data in D_ref associated with u
            h′_u ← HistogramFeature(f′_i, D_ref,u)
            z′_u ← 1 if u in U_ref-train else 0
            D_audit ← D_audit ∪ {(h′_u, z′_u)}
    Train the audit model f_audit ← T_audit(D_audit)
    return f_audit

procedure HistogramFeature(f, D)
    R ← {rank(y) in f(x) | ∀(x, y) ∈ D}
    Initialize feature vector h with d entries
    b ← |V|/d                               ▷ histogram bin size
    for i = 1 to d do                       ▷ count of ranks in each bin
        h_i ← |{r ∈ R | (i−1)·b ≤ r < i·b}|
    return feature vector h

The goal of auditing is to infer user-level membership against the target model f, i.e., to decide whether a user u ∈ U_train or not.

64 We assume that the auditor has black-box access to f : given an input query x, the auditor can observe f (x). In realistic deployments of text-generation models, the auditor may not be able to observe the entire vector of ranked words f (x) but only several top-ranked predictions. In our experiments in Section 4.3.5, we vary the size of the model’s output and show how it affects the accuracy of auditing.

We assume that the auditor knows the learning algorithm used to create f but he may or may not know the training hyper-parameters (see Section 4.3.5).

The auditor also needs an auxiliary dataset D_ref to train shadow models that perform the same task as f.

Algorithm 5 outlines the auditing process. Similar to standard membership inference [161], the auditor's goal is to learn to distinguish the outputs produced by the target model on sequences that it trained on and its outputs on sequences that it did not see during training. For this purpose, the auditor builds a binary user-level membership classifier f_audit that takes as input a (processed) list of predictions obtained by querying f with a subset of the user's dataset D_user and outputs a decision on u ∈ U_train. In Section 4.3.5, we show that a small subset of D_user is sufficient for this purpose.

Training shadow models. To collect the data for training f_audit, the auditor first trains k shadow models f′_1, ..., f′_k (that “simulate” f) using the same protocol T as f with the same hyper-parameters (if known) or varying the hyper-parameters as in Section 4.3.5.

The training data for each shadow is a random user subset U_ref-train ⊆ U_ref of the auxiliary dataset D_ref. Our shadow training technique is inspired by [161], but one essential distinction is that in our case the shadow-training data does not need to be drawn from the same distribution as the training data of the target model. In Section 4.3.5, we show that public sources can be used for D_ref and the loss in audit accuracy is negligible when D_train and D_ref are drawn from different domains. This is important for real-world auditing because in practice the auditor may not know the entire distribution of the target model's training data, and API limits may prevent the auditor from querying the target model repeatedly to extract sufficient data for training shadow models as in [161].

The auditor then queries the shadow models with D_ref,u for each u in U_ref and labels the resulting outputs as “member” if u was part of the shadow's training data, “non-member” otherwise. The next step is to use these labeled predictions to train a binary membership classifier.

Training the audit model. Record-level membership inference typically uses the output probability distribution directly as the feature to distinguish be- tween members and non-members. User-level membership inference in text- generation models calls for a different approach. Each user is associated with multiple sequences, each of which has multiple words. Therefore, the auditor can obtain a collection of output predictions. On the negative side, the actual probabilities associated with each prediction may not be available.

As mentioned before, the output prediction f(x) for an input x is a probability distribution over the entire training vocabulary V, i.e., a |V|-dimensional probability vector. |V| is generally large and the probability values are noisy. Instead of the raw probability values, we use the ranks of the target words in the output distributions as signals for inferring user-level membership. As we will show in Section 4.4, even for a well-generalized model (i.e., one whose test-train accuracy gap is small), there is a substantial gap in the predicted rank of the same word when it appears in a training text and in a test text. Specifically, the model ranks relatively rare words much higher when it sees them during testing in the same context as it saw them during training.

Given a user u's data D_ref,u, the auditor queries the shadow model on each data point (x, y) ∈ D_ref,u and collects the ranks of y in f(x) into a rank set R_u. Taking English-to-French machine translation as an example, where (x, y) = (I love you, Je t'aime), f(x) = [f(x)_1, f(x)_2] is a sequence of two probability vectors for the tokens "Je" and "t'aime." The auditor collects the rank of the probability of "Je" in f(x)_1 (e.g., 2) and the rank of the probability of "t'aime" in f(x)_2 (e.g., 213), and adds {2, 213} to the rank set R_u. Rank 2 means that the word is the second likeliest prediction in the entire vocabulary. After collecting the ranks for all (x, y) ∈ D_ref,u, the auditor builds a histogram of R_u with a fixed number of bins d. The final feature vector h_u is a d-way count vector where each entry is the count of the ranks falling in that bin.
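The rank-histogram feature extraction can be sketched as follows (a minimal numpy sketch; the function name, the 1-indexed rank convention, and the clamping of out-of-range ranks into the last bin are our own assumptions):

```python
import numpy as np

def histogram_feature(ranks, vocab_size, d=100):
    """Bin the predicted ranks of the target words into a d-way count vector.

    ranks: iterable of integer ranks (1 = top prediction) collected from the
    model's outputs on one user's data; vocab_size: |V|; d: number of bins.
    """
    bin_size = vocab_size / d
    hist = np.zeros(d, dtype=np.int64)
    for r in ranks:
        idx = min(int((r - 1) // bin_size), d - 1)  # clamp into the last bin
        hist[idx] += 1
    return hist
```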

The auditor extracts features h_u and labels them as 1 if u ∈ U_ref^train and 0 otherwise. The auditor repeats this procedure for each user in each shadow model and obtains a collection of labeled feature vectors D_audit. Finally, the auditor trains a binary membership classifier f_audit on D_audit. We refer to f_audit as the audit model.

Auditing membership in the training data. At inference (i.e., audit) time, the auditor queries the target model f with the user's data D_user. If the number of queries to f is limited, only a sample from D_user is used. The sample can be random, but we show in Section 4.3.5 that it is more effective to select test inputs that have the smallest frequency counts in their labels y, i.e., sequences with relatively rare words are more useful for auditing.
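A minimal sketch of this query-selection step (the helper and its arguments are ours; we assume the auditor has word-frequency counts from some reference corpus):

```python
import random

def sample_queries(user_data, word_counts, m, random_sample=False):
    """Pick m (x, y) pairs from the user's data to query the target model with.

    user_data: list of (x, y) pairs; word_counts: dict/Counter of word
    frequencies available to the auditor. When random_sample is False,
    prefer sequences whose labels contain the rarest words.
    """
    if random_sample:
        return random.sample(user_data, m)

    def label_frequency(pair):
        _, y = pair
        return sum(word_counts.get(w, 0) for w in y)  # summary frequency count

    return sorted(user_data, key=label_frequency)[:m]
```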

After querying f, the auditor processes the corresponding outputs and obtains a feature vector h_u that describes the distribution of the predicted ranks for each word in D_user. Finally, the auditor feeds h_u to f_audit, which decides whether u ∈ U_train or not.

4.3 Experiments

4.3.1 Datasets

The Reddit comments dataset (Reddit) is a randomly chosen month (November 2017) from the public Reddit dataset.1 We filtered it to retain only the users with at least 150 but no more than 500 posts, for a total of 83,293 users with 247 posts each on average. We use the resulting dataset for the next-word prediction task.

The speaker annotated TED talks dataset (SATED) consists of transcripts from TED talks,2 totaling 2,324 talks with roughly 271K sentences in each lan- guage [135]. The dataset contains English-French (en-fr), English-German (en- de) and English-Spanish (en-es) language pairs and speaker annotation. We use the data from the en-fr pair for the machine translation task.

The Cornell movie dialogs corpus (Dialogs) is a collection of fictional conversations extracted from movie scripts [37]. There are a total of 220,579 exchanges between pairs of characters engaging in at least 5 exchanges, involving 9,035 characters from 617 movies. We use this dataset for the dialogue-generation task.

1. https://bigquery.cloud.google.com/dataset/fh-bigquery:redditcomments
2. https://www.ted.com/talks

Cross-domain reference datasets. The auditor may not know the distribution on which the target model was trained and thus needs a reference dataset to train its shadow models. In our experiments, we use public datasets for this purpose. As the cross-domain reference dataset for word prediction, we use the Wikitext-103 corpus3 obtained by a Wikipedia crawl. For translation, we use the English-French pair in the Europarl dataset [93], a parallel language corpus extracted from the proceedings of the European Parliament. For dialog generation, we use the Ubuntu dialogs dataset [123], which contains two-person technical support chat logs.

These datasets are not labeled with individual users, thus we split them into n_u random subsets, each corresponding to an artificial "user." Our experiments show that we can produce effective audit models even with this artificial separation into users and even though the topics of the reference datasets are very different from the target models' training datasets (e.g., technical support chats vs. conversations between movie characters).

3. https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset

4.3.2 ML Models

Next-word prediction. We use a one-layer long short-term memory (LSTM) network [70] as the target model. The LSTM is a more sophisticated RNN that can capture long-term dependencies in the sequence. The input sequence of tokens is first mapped to a sequence of embeddings. The embeddings are then fed to the LSTM, which learns a hidden representation of the context for predicting the next word.

Neural machine translation. We use a sequence-to-sequence target model with the attention module as described in [135]. Both the encoder and the de- coder are one-layer LSTMs that operate on the embedding of source tokens and target tokens. The attention module adds an additional layer that operates on all hidden representations in the encoder LSTM and helps the decoder deter- mine where to pay attention in the source texts when predicting a token in the target language.

Dialog generation. We use a sequence-to-sequence model without the attention module. The encoder and the decoder are one-layer LSTMs.

4.3.3 Hyper-parameters

Target models. We train the word-prediction model on the comments of 300 randomly selected users from the Reddit dataset. We set both the embedding dimension and the LSTM hidden-representation size to 128. For training the LSTM, we use the Adam optimizer [89] with the learning rate set to 1e-3, the batch size to 35, and the number of training epochs to 30.

We train the translation and dialog-generation models on 300 randomly selected users from SATED and Dialogs, respectively. We set both the embedding dimension and the LSTM hidden-representation size in the encoder and decoder to 128. We use the Adam optimizer with the learning rate set to 1e-3, the batch size to 20, and the number of training epochs to 30.

For all datasets, we fix the vocabulary to the most frequent 5,000 tokens in the training texts. Tokens not in the vocabulary are replaced with a special token. To prevent overfitting, we add dropout with rate 0.5 to all hidden layers of all models.
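For concreteness, the vocabulary capping could look like the following sketch (the <unk> token name is an assumption; the text only says a special token is used):

```python
from collections import Counter

def build_vocab_and_map(token_sequences, size=5000, unk="<unk>"):
    """Keep the `size` most frequent tokens and map everything else to `unk`."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    vocab = {w for w, _ in counts.most_common(size)}
    return [[tok if tok in vocab else unk for tok in seq] for seq in token_sequences]
```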

Shadow models. For the experiments in Section 4.3.5, we construct shadow models using different hyper-parameters than the target models. On all tasks, we used Gated Recurrent Units (GRU) [31] instead of LSTM. The size of hidden units and embedding is set to 64, 96, 128, 160, . . . , 352 for the shadow models.

We optimize the shadow models using momentum SGD with the learning rate set to 0.01, momentum set to 0.9, and number of training epochs to 50.

Implementation. All target and shadow models were implemented in Keras⁴ with the TensorFlow [1] backend. We use the linear SVM implemented in LIBLINEAR [52] to train the audit model with the default hyper-parameters.⁵

4. https://keras.io/
5. http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
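A minimal sketch of the audit classifier (scikit-learn's LinearSVC wraps LIBLINEAR; the variable names are ours):

```python
from sklearn.svm import LinearSVC

def train_audit_model(X_audit, y_audit):
    """X_audit: histogram features h_u collected from the shadow models,
    shape (num_user_examples, d); y_audit: 0/1 shadow-membership labels."""
    clf = LinearSVC()  # default hyper-parameters, as in the text
    clf.fit(X_audit, y_audit)
    return clf

# At audit time, clf.predict([h_u]) gives the membership decision and
# clf.decision_function([h_u]) gives the score used for the AUC metric.
```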

Dataset | Model                 | Train Acc | Test Acc | Train Perp | Test Perp
Reddit  | 1-layer LSTM [70]     | 0.184     | 0.206    | 102.22     | 113.14
SATED   | Seq2Seq w/ attn [135] | 0.587     | 0.535    | 6.36       | 10.28
Dialogs | Seq2Seq w/o attn      | 0.283     | 0.264    | 45.57      | 61.11

Table 4.1: Performance of target models. Acc is word-prediction accuracy, perp is perplexity.

4.3.4 Performance of target models

We use standard architectures and hyper-parameters to train target models (see Section 4.3.2) and evaluate their performance using word-prediction accuracy $\frac{1}{M}\sum_{i=1}^{n}\sum_{j=1}^{l_i}\mathbb{I}\big(\arg\max f(x_i)^j = y_i^j\big)$ and perplexity $2^{-\frac{1}{M}\sum_{i=1}^{n}\sum_{j=1}^{l_i}\log f(x_i)^j[y_i^j]}$, where $n$ is the number of data points, $M = \sum_i l_i$ is the total number of tokens in all labels, $\mathbb{I}$ is the indicator function that outputs 1 if the predicted token $\arg\max f(x_i)^j$ equals the label token $y_i^j$ and 0 otherwise, and $f(x_i)^j[y_i^j]$ is the probability of predicting $y_i^j$ in $f(x_i)^j$. Perplexity is measured as 2 to the power of the entropy of the label predictions. The lower the perplexity, the better the model fits the data.
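A numpy sketch of these two metrics (assuming per-token probability vectors are available and using log base 2, consistent with the 2^(·) formulation above):

```python
import numpy as np

def accuracy_and_perplexity(prob_seqs, label_seqs):
    """prob_seqs: list of arrays of shape (l_i, |V|) with softmax probabilities;
    label_seqs: list of integer arrays of shape (l_i,) with the label tokens."""
    correct, log_prob_sum, total = 0, 0.0, 0
    for probs, labels in zip(prob_seqs, label_seqs):
        correct += int(np.sum(np.argmax(probs, axis=1) == labels))
        log_prob_sum += float(np.sum(np.log2(probs[np.arange(len(labels)), labels])))
        total += len(labels)
    return correct / total, 2.0 ** (-log_prob_sum / total)
```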

Table 4.1 shows the results for models trained on 300 users, with the test data sampled from 300 users disjoint from the training set. These results match the literature. On Reddit, test accuracy of word prediction is 20%, similar to [131]. On SATED, test perplexity is 10, close to [124]. Low test perplexity shows that the models are learning a meaningful language-generation process. Test-train accuracy gaps are below 5%, indicating that the models are not overfitted. Perplexity gaps are within 15, which is relatively small.

4.3.5 Performance of auditing

To train shadow models, we sample a set of “shadow users” disjoint from both the training and test users. The number of shadow users is twice the number of training users. We use one half of the shadow users to train shadow models and the other half to collect the shadow models’ outputs on the non-members of their training datasets (see Section 4.2). We train 10 shadow models for all tasks and use a linear SVM as the audit classifier.

Our metrics are precision (the percentage of users classified by the audit model as “members” who are indeed members), recall (the percentage of mem- bers who are classified as “members”), accuracy (the percentage of all users who are classified correctly), and AUC, the area under the ROC curve that shows the gap between the scores (i.e., distances to the decision hyperplane of SVM) given by the audit model to members and non-members. We use 300 members and non-members. Therefore, the baseline for all metrics is 0.5, corresponding to random guessing.
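These metrics can be computed directly with scikit-learn; a sketch, assuming `clf` is the linear-SVM audit model from the earlier sketch and (X_test, y_test) hold the members' and non-members' histogram features and true labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

def audit_metrics(clf, X_test, y_test):
    y_pred = clf.predict(X_test)
    scores = clf.decision_function(X_test)  # signed distance to the SVM hyperplane
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, scores),
    }
```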

Our audit model achieves the perfect score (i.e., 1) on all metrics for all datasets and models when there is no restriction on the output size of the tar- get models (i.e., they produce predictions over the entire vocabulary) and the auditor can query the target models any number of times.

Effect of different hyper-parameters. To demonstrate that knowledge of the target model’s hyper-parameters is not essential for successful auditing, we train 10 shadow models for each task with different training configurations.

Table 4.2 shows the results. Auditing scores are still above 0.95 on nearly all metrics for all tasks and models.

Dataset | Accuracy | AUC   | Precision | Recall
Reddit  | 0.990    | 0.993 | 0.983     | 0.996
SATED   | 0.965    | 0.981 | 0.937     | 0.996
Dialogs | 0.978    | 0.998 | 0.958     | 1.000

Table 4.2: Effect of training shadow models with different hyper-parameters than the target model.

Figure 4.1: Effect of the number of Reddit users used to train a word-prediction model.

Effect of the number of users. To evaluate how the number of users in the training dataset affects the auditor's ability to infer the presence of a single user, we train word-prediction models on 100, 500, 1,000, 2,000, 4,000, and 10,000 users from the Reddit dataset. Test users and shadow users are disjoint samples of the same size.

Fig. 4.1 shows the results. When the number of users is under 1,000, all met- rics are at least 0.95. With 4,000 users, precision drops below 0.8 while AUC is still around 0.9. Audit performance drops more significantly when the number of users is 10,000.

Figure 4.2: Effect of the number of queries and sampling strategy. Plots on the left show the results when the auditor samples the user's data for queries in ascending order of the frequency counts of tokens in the label; plots on the right show the results with randomly sampled data.

Effect of the number and selection of audit queries. To measure the perfor- mance of auditing when the auditor is restricted to only a few queries, we vary the number of audit queries between 1, 2, 4, 8, 16, and 32 word sequences.

Fig. 4.2 shows the results. With 32 queries, audit performance exceeds 0.9 on all metrics for all datasets. If query selection is random, audit performance is low with fewer than 8 queries. If the auditor queries the target with the user's word sequences whose summary word-frequency counts are the lowest, then even with a single query the auditor can accurately determine whether the user's data was used to train the model on the Reddit or Dialogs dataset. This remarkable result demonstrates the extent to which text-generation models memorize word sequences they were trained on, especially those that contain relatively rare words.

Effect of the size of the model's output. In a realistic deployment of a text-generation model, its output may be limited to a few top-ranked words rather than the entire ranked vocabulary. We constrain the model's output to the top-ranked 1, 5, 10, 50, 100, 500, and 1,000 words, while the other hyper-parameters remain as in Section 4.3.2. When building the histogram feature vector for training the audit model (see Section 4.2), we add an additional feature that counts how many times the ground-truth words are not among the top predictions output by the model.

Table 4.3 shows the results. On Reddit and Dialogs, the auditor’s perfor- mance is close to random guessing when the model’s outputs are limited to the top 50 or fewer words, increasing to above 0.9 when the output size is the top 500 words (only 10% of the entire vocabulary)—regardless of whether the shadow models are trained on the same domain as the target model or a differ- ent domain.

For the translation task, audit performance is much higher than random guessing even if the model outputs just one top-ranked word, and it exceeds 0.9 when the model outputs 50 top-ranked words (1% of the vocabulary). These results demonstrate the remarkable extent to which translation models memorize specific word sequences encountered in training.

Reddit
|f(x)| | Same domain (Acc / AUC / Pre / Rec) | Cross domain (Acc / AUC / Pre / Rec)
1      | 0.545 / 0.549 / 0.574 / 0.350       | 0.505 / 0.589 / 0.667 / 0.020
5      | 0.550 / 0.572 / 0.553 / 0.520       | 0.490 / 0.525 / 0.495 / 0.920
10     | 0.580 / 0.602 / 0.582 / 0.570       | 0.500 / 0.552 / 0.500 / 0.950
50     | 0.605 / 0.648 / 0.606 / 0.600       | 0.505 / 0.659 / 0.503 / 0.980
100    | 0.725 / 0.788 / 0.765 / 0.650       | 0.585 / 0.714 / 0.549 / 0.950
500    | 0.970 / 0.998 / 0.970 / 0.970       | 0.905 / 0.992 / 0.988 / 0.820
1000   | 0.985 / 0.999 / 0.971 / 1.000       | 0.910 / 0.999 / 1.000 / 0.820

SATED
1      | 0.723 / 0.785 / 0.770 / 0.637       | 0.723 / 0.785 / 0.712 / 0.750
5      | 0.748 / 0.838 / 0.767 / 0.713       | 0.767 / 0.834 / 0.755 / 0.790
10     | 0.800 / 0.880 / 0.783 / 0.830       | 0.805 / 0.878 / 0.814 / 0.790
50     | 0.928 / 0.973 / 0.908 / 0.953       | 0.925 / 0.979 / 0.947 / 0.900
100    | 0.948 / 0.981 / 0.944 / 0.953       | 0.942 / 0.978 / 0.965 / 0.917
500    | 0.972 / 0.988 / 0.958 / 0.987       | 0.970 / 0.988 / 0.983 / 0.957
1000   | 0.960 / 0.984 / 0.939 / 0.983       | 0.967 / 0.985 / 0.973 / 0.960

Dialogs
1      | 0.577 / 0.618 / 0.582 / 0.547       | 0.538 / 0.618 / 0.520 / 0.977
5      | 0.575 / 0.642 / 0.582 / 0.530       | 0.552 / 0.643 / 0.528 / 0.970
10     | 0.583 / 0.645 / 0.591 / 0.543       | 0.543 / 0.638 / 0.523 / 0.977
50     | 0.605 / 0.660 / 0.611 / 0.580       | 0.537 / 0.610 / 0.520 / 0.963
100    | 0.647 / 0.714 / 0.643 / 0.660       | 0.570 / 0.669 / 0.541 / 0.920
500    | 0.935 / 0.975 / 0.917 / 0.957       | 0.925 / 0.969 / 0.895 / 0.963
1000   | 0.972 / 0.995 / 0.955 / 0.990       | 0.962 / 0.992 / 0.948 / 0.977

Table 4.3: Effect of the model's output size. |f(x)| is the number of words ranked by f.

Figure 4.3: Effect of noise and errors.

Effect of noise and errors in the queries. D_user may be noisy or partially erroneous (e.g., if not all of D_user was used to train the target model f). To evaluate how this affects auditing, for each training user, we use part of his data to train f and hold out the remaining fraction to represent noise during auditing. We vary this fraction between 0.1, 0.2, . . . , 0.5.

Fig. 4.3 shows the results. For SATED and Dialogs, recall drops significantly, falling close to 0 for SATED when the fraction of noise is 0.5. Increasing the amount of noise biases the audit model towards misclassifying most training users as "non-members." Precision and AUC remain high as noise increases. This may indicate that the scores of the membership classifier at the heart of the audit model still exhibit a distinguishable gap between members and non-members, but this gap is not captured by the audit model, which was trained on the outputs of shadow models queried with clean data (see Section 4.2).

Auditing obfuscated data. Finally, we evaluate the effect of obfuscation on the success of auditing. This is a first step towards determining whether text-generation models memorize specific word sequences (which would not be preserved by obfuscation) rather than higher-level linguistic features (which might be).

No obfuscation: i see so many adults that could benefit from this going around having themselves a big fat sugar snack or soda pop as a treat it 's so sad

Google: i saw so many adults who can benefit from cherishing big fat sugar snacks and soda pop and going around, it is very sad

Yandex: i think a lot of adults have benefited over your big fat candy and and handling of grief

Table 4.4: Examples of texts obfuscated using the Google translation API and the Yandex translation API.

Dataset  | Accuracy | AUC   | Precision | Recall
Baseline | 1.000    | 1.000 | 1.000     | 1.000
Google   | 0.580    | 0.858 | 0.944     | 0.170
Yandex   | 0.500    | 0.782 | 0.500     | 0.010

Table 4.5: Audit performance on obfuscated Reddit comments.

We use an obfuscation technique, previously considered for evading author attribution [23], that machine-translates the text to a different language and back. We obfuscate the training and test users' Reddit comments using the Google⁶ and Yandex⁷ translation APIs to translate English to Japanese and back to English. Table 4.4 shows examples of obfuscated text.

Table 4.5 reports the results of auditing on obfuscated texts. For both Google- and Yandex-based obfuscation, audit accuracy drops to near random and recall is very low. AUC scores are still around 0.8, which is much higher than random guessing. This indicates that there is some useful signal in the model's outputs on obfuscated texts, but the auditor's membership classifier—which was trained on non-obfuscated texts—fails to capture this signal.

6. https://cloud.google.com/translate/
7. https://tech.yandex.com/translate/

Figure 4.4: Histograms of log probabilities of words generated by our text-generation models. The top row shows the histograms for the top 20% most frequent words; the bottom row shows the histograms for the rest.

This is a remarkable result given the poor quality of translation. Even if the user’s text has been garbled almost to the point of incomprehensibility, in some cases there is still enough information left to detect its presence in the training data.

4.4 Memorization in text-generation models

In this section, we analyze why auditing works so well for text-generation models that are not overfitted as measured by their test-train accuracy gap (see Section 4.3.4).

Word frequency and probability. The loss function for the text-generation models is the sum of the negative log probabilities of the words in the input sequence (see Section 4.1). By its very construction, this loss function “encour- ages” the model to memorize sequences that occur in the training data.

Fig. 4.4 shows the histograms of the log probabilities of the more and less frequent words in the training ("train") and test ("unseen") sequences. For the more frequent words, the histograms for the training and test sequences are almost identical. For the less frequent words, the model fits both the training and test sequences worse, as the modes shift toward smaller log-probability values.

Most importantly, there is a gap between the less frequent words in the training sequences and those in the test sequences. This gap indicates that the model assigns higher probabilities to words in the training sequences, producing a strong signal that can be used for membership inference and, consequently, auditing.

These histograms also demonstrate that our text-generation models are not overfitted to their training datasets in terms of the loss value. The 20% most frequent words account for 86.9% of the training data and 88.1% of the test data in Reddit, 89.5% and 90.4% in SATED, and 93.1% and 94.1% in Dialogs. Conse- quently, these words dominate the training and test loss value. Not surprisingly, text-generation models typically generate words from the top 20% of the word- frequency distribution. As long as the log probabilities remain similar for the top 20% words in both the training and test datasets, the training and test losses of the model will be similar.


Figure 4.5: Ranks of words in the frequency table of the training corpus and in the models’ predictions (lower rank means that the word is more likely). Shaded area is the 95% confidence interval for all occurrences of the word in the data. These charts demonstrate that the models assign much higher rank to words when they appear in training sequences vs. when they appear in test sequences, especially for the less-frequent words.

Word frequency and predicted rank. Memorization of training sequences produces a much stronger signal in the relative rank assigned by the model to the candidate words in the model’s output vocabulary. Fig. 4.5 shows the re- lationship between a word’s rank in the frequency table of the training corpus and its rank in the model’s predictions. A smaller rank number indicates that the word is ranked higher in the vocabulary, i.e., more frequent in the corpus or more likely to be predicted by the model. On all datasets, less frequent words exhibit a much bigger gap between the rank predicted by the model when the word appears in a training sequence and when it appears in a test sequence.

This explains why our auditing algorithm is more successful when it queries the target model with sequences consisting of the less-frequent words (see Section 4.3.5).

Ablation analysis. We have shown that the probabilities and ranks produced by text-generation models exhibit a gap between the training and test sequences for the less-frequent words but not for the most-frequent words. We hypothesize that these models learn generalizable patterns for the most-frequent words while hard-memorizing the sequences consisting of the less-frequent words.

Figure 4.6: Ablation analysis on Reddit and SATED.

To gather evidence for this hypothesis, we carried out an experiment based on ablation analysis, which was recently proposed to detect memorization in deep-learning models [137]. As more hidden units are ablated, accuracy on the training data degrades more quickly for models that are hard-memorizing the training data.

We train target models without dropout (since dropout ablates the hidden units during training) on Reddit and SATED, keeping the other hyper-parameters the same as in Section 4.3.2. We randomly set a fraction of the model's hidden representations to zero and evaluate the accuracy of word prediction on the training data. We vary the fraction from 0.1 to 0.5 on Reddit and from 0.1 to 0.9 on SATED, and report the accuracy separately for the 10% most frequent words and the remaining 90% in Fig. 4.6.
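The ablation step itself is simple; a minimal numpy sketch (we assume the hidden representation is exposed as an array and that zeroing units approximates the ablation of [137]):

```python
import numpy as np

def ablate(hidden, fraction, seed=0):
    """Randomly zero out a `fraction` of the units of a hidden representation.

    hidden: array of shape (batch, num_units).
    """
    rng = np.random.default_rng(seed)
    num_units = hidden.shape[-1]
    idx = rng.choice(num_units, size=int(fraction * num_units), replace=False)
    ablated = hidden.copy()
    ablated[..., idx] = 0.0
    return ablated
```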

When no hidden units are ablated, accuracy is similar for the most-frequent words and the rest. As the fraction of ablated units increases, accuracy on the less-frequent words drops more significantly than on the most-frequent words. This indicates that predicting less-frequent words is more dependent on specific hidden units in the model and thus involves more memorization.

Figure 4.7: Ranks of words in the training corpus and in the predictions of the differentially private model.

4.5 Limitations of auditing

Models trained on a very large number of users. In some industrial implementations of text-generation models [130, 131], the number of users is on the scale of millions. Performance of our auditor starts to drop when the number of users reaches 10,000 (Section 4.3.5). We expect that our black-box algorithm will not be able to audit models trained on a very large number (tens or hundreds of thousands) of users. That said, (a) many state-of-the-art models are trained on fewer than 10,000 users [96, 135, 194], and (b) white-box auditing techniques may be effective even against models trained on tens of thousands of users.

This is a topic for future work.

Deeper models. In our experiments, both the target and shadow models are one-layer LSTMs or GRUs. We have not experimented with auditing deeper and more sophisticated models. We expect that such models are even more susceptible to memorization, but this is another topic for future research.

Differentially private models. In theory, user-level differential privacy (DP) is a direct countermeasure to user-level membership inference. We used federated learning with differential privacy [131] to train a next-word prediction model on the Reddit dataset, setting the number of users to 5,000, the user sampling rate to 0.04 per round, the L2 bound on a single user's contribution to 10.0, and the other hyper-parameters as in [131]. After 300 rounds of training, this produced an (ε, δ)-DP model with ε = 4.129 and δ = 1e-4, which achieves 15% word-prediction accuracy, similar to [131]. By contrast, the accuracy of our non-DP model is 20% when trained on only 100 users, i.e., the DP model is significantly less accurate than the non-DP one. Our auditing algorithm fails against the DP model, with performance scores near 0.5 (equivalent to random guessing).

To further investigate the predictive power of the DP model, Fig. 4.7 plots the ranks of words in the vocabulary (based on their frequencies) and in the model’s predictions. The predicted rank is larger than the frequency rank for the 50% most frequent words and remains around 3,000 for the other 50%. The predicted rank is very similar for the words in the training and test sequences, which explains why auditing fails.

The plot also suggests that the differentially private model will almost always predict common words and hardly ever predict relatively rare words. While it does not appear that the model memorizes its training data, it is not clear to what extent it generalizes.

4.6 Related work

Privacy attacks based on memorization. As mentioned in Section 2.3, deep learning models can achieve perfect accuracy even on randomly labeled train- ing data. Chapter 3 exploits memorization and presents algorithms that inten- tionally encode the training data in the model. By contrast, this Chapter demon- strates that popular text-generation models unintentionally memorize their train- ing data.

Carlini et al. [27] show that a black-box adversary can extract specific num- bers that occur in the training data of a generative model, given some prior knowledge about the format (e.g., a credit card number). For a text-generation model, numbers are essentially random data, thus this is another illustration that models memorize random data. By contrast, we show that text-generation models memorize even words and sentences that are directly related to their primary task and leverage this into an effective auditing method.

User-level differential privacy. User-level differential privacy (DP) bounds the influence of any single user on the model. McMahan et al. propose a DP federated learning algorithm for language models [131]. With the current state of the art, a massive number of users (at least 10,000) is needed to create DP models that achieve reasonable accuracy. How to build accurate DP models with fewer users remains an open question.

Auditing ML models. Much recent work aims to understand the behavior of ML models with black-box access [4, 94]. These approaches improve the inter- pretability of the model by showing how features or training data points influ- ence the model’s predictions. Other model-auditing research focuses on detect- ing bias and discrimination [168, 169]. We are not aware of any prior work that aims to audit the use of specific data sources to train a model.

4.7 Conclusion

Deep learning-based text-generation models for word prediction, translation, and dialog generation are core components of many popular online services. We demonstrated that these models memorize their training data. This memorization does not appear to manifest in reduced test accuracy, which is a symptom of "conventional" overfitting, but is reflected instead in how the models rank the candidate words they generate.

We developed a black-box auditing method that enables users to check if their chats, messages, or comments have been used to train someone else’s model. Our auditing method, based on a new flavor of membership inference that exploits memorization in text-generation models, is very effective. More powerful auditing algorithms may be possible if the auditor has access to the model’s parameters and can observe its internal representations rather than just output predictions. This is a topic for future work.

We view the results of this Chapter as essentially positive, demonstrating how memorization in ML models can help detect unauthorized uses of sensitive personal data and ensure compliance with GDPR and other data-protection policies and regulations.

CHAPTER 5
OVERLEARNING REVEALS SENSITIVE ATTRIBUTES

In this Chapter, we study the threat of overlearning: representations learned by deep learning models trained for seemingly simple objectives reveal privacy- and bias-sensitive attributes that are not part of the specified objective. These unintentionally learned concepts are neither finer- nor coarser-grained versions of the model's labels, nor statistically correlated with them. For example, a binary classifier trained to determine the gender of a facial image also learns to recognize races (including races not represented in the training data) and even identities of individuals.

Overlearning has two distinct consequences. First, the model's inference-time representation of an input reveals the input's sensitive attributes. For example, a facial recognition model's representation of an image reveals whether two specific individuals appear together in it. Overlearning thus breaks inference-time privacy protections based on model partitioning (see Section 2.4.3). Second, we develop a new, transfer learning-based technique to "re-purpose" a model trained for a benign task into a model for a different, privacy-violating task. This shows the inadequacy of privacy regulations that rely on explicit enumeration of learned attributes.

We focus on supervised deep learning. Given an input x, a model M is trained to predict the target y using a discriminative approach. We represent the model M = C ∘ E as a feature extractor (encoder) E and a classifier C. The representation z = E(x) is passed to C to produce the prediction by modeling p(y | z) = C(z). Since E can have multiple layers of representation, we use E_l(x) = z_l to denote the model's internal representation at layer l; z is the representation at the last layer.

5.1 Censoring Representation Preliminaries

We first introduce censoring representations as a potential solution for preventing overlearning. Censoring techniques try to remove sensitive information from a deep learning model and are often used with model partitioning (see Section 2.4.3) to protect the privacy of inference-time inputs.

The goal is to encode input x into a representation z that does not reveal un- wanted properties of x, yet is expressive enough to predict the task label y. Cen- soring has been used to achieve transform-invariant representations for com- puter vision, bias-free representations for fair machine learning, and privacy- preserving representations that hide sensitive attributes.

A straightforward censoring approach is based on adversarial training [56]. It involves a mini-max game between a discriminator D trying to infer s from z during training and an encoder and classifier trying to infer the task label y while minimizing the discriminator’s success [33, 49, 50, 61, 78, 111, 181]. The game is formulated as:

$$\min_{E,C} \max_{D} \; \mathbb{E}_{x,y,s}\big[\gamma \cdot \log p(s \mid z = E(x)) - \log p(y \mid z = E(x))\big] \qquad (5.1)$$

where γ balances the two log-likelihood terms. The inner optimization maximizes log p(s | z = E(x)), i.e., the discriminator's prediction of the sensitive attribute s given a representation z. The outer optimization, on the other hand, trains the encoder and classifier to minimize the log likelihood of the discriminator predicting s and to maximize that of predicting the task label y.
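A minimal TensorFlow 2 sketch of one round of this mini-max game (function and variable names are ours; E, C, and D are assumed to be tf.keras models producing logits):

```python
import tensorflow as tf

ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def censoring_train_step(E, C, D, opt_model, opt_disc, x, y, s, gamma=1.0):
    # Inner step: the discriminator maximizes log p(s|z), i.e. minimizes CE on s.
    with tf.GradientTape() as tape:
        d_loss = ce(s, D(E(x, training=True), training=True))
    opt_disc.apply_gradients(zip(tape.gradient(d_loss, D.trainable_variables),
                                 D.trainable_variables))
    # Outer step: encoder/classifier minimize CE on y while increasing D's loss on s,
    # i.e. they minimize gamma*log p(s|z) - log p(y|z), as in Eq. 5.1.
    with tf.GradientTape() as tape:
        z = E(x, training=True)
        loss = ce(y, C(z, training=True)) - gamma * ce(s, D(z, training=False))
    variables = E.trainable_variables + C.trainable_variables
    opt_model.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss, d_loss
```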

Another approach casts censoring as a single information-theoretical objective.

The requirement that z not reveal s can be formalized as an independence constraint z ⊥ s, but independence is intractable to measure in practice, thus the requirement is relaxed to a constraint on the mutual information between z and s [138, 145]. The overall training objective of censoring s and predicting y from z is formulated as:

$$\max \; I(z, y) - \beta \cdot I(z, x) - \lambda \cdot I(z, s) \qquad (5.2)$$

where I is mutual information and β, λ are balancing coefficients; β = 0 in [145]. The first two terms, $I(z, y) - \beta \cdot I(z, x)$, form the objective of the variational information bottleneck [5]; the third term is the relaxed independence constraint between z and s.

Intuitively, this objective aims to maximize the information about y in z via I(z, y), forget the information about x in z via $-\beta \cdot I(z, x)$, and remove the information about s in z via $-\lambda \cdot I(z, s)$. This objective has an analytical lower bound [138]:

$$\mathbb{E}_{x,s}\Big[\mathbb{E}_{z,y}[\log p(y \mid z)] - (\beta + \lambda) \cdot \mathrm{KL}\big[q(z \mid x) \,\|\, q(z)\big] - \lambda \cdot \mathbb{E}_{z}[\log p(x \mid z, s)]\Big] \qquad (5.3)$$

where KL is the Kullback-Leibler divergence and log p(x | z, s) is the reconstruction likelihood of x given z and s. The conditional distributions p(y | z) = C(z) and q(z | x) = E(x) are modeled as in adversarial training, and p(x | z, s) is modeled with a decoder R(z, s) = p(x | z, s).

All known censoring techniques require a "blacklist" of attributes to censor, and inputs with these attributes must be represented in the training data.

Censoring for fairness is applied to the model's final layer to make its output independent of the sensitive attributes or satisfy a specific fairness constraint [120, 127, 165, 189]. In this Chapter, we use censoring not for fairness but to demonstrate that models cannot be prevented from learning to recognize sensitive attributes. To show this, we apply censoring to different layers, not just the output.

Algorithm 6 Inference from representation and adversarial re-purposing

Inferring s from representation
    Input: Adversary's auxiliary dataset D_aux, black-box oracle E, observed z⋆
    D_attack ← {(E(x), s) | (x, s) ∈ D_aux}
    Train attack model M_attack on D_attack
    return prediction ŝ = M_attack(z⋆)

Adversarial re-purposing
    Input: Model M for the original task, transfer dataset D_transfer for the new task
    Build M_transfer = C_transfer ∘ E_l on layer l
    Fine-tune M_transfer on D_transfer
    return transfer model M_transfer

5.2 Exploiting Overlearning

We demonstrate two different ways to exploit overlearning in a trained model M. The inference-time attack (Section 5.2.1) applies M to an input and uses M's representation of that input to predict its sensitive attributes. The model-repurposing attack (Section 5.2.2) uses M to create another model that, when applied to an input, directly predicts its sensitive attributes. The two attacks are outlined in Algorithm 6.

5.2.1 Inferring sensitive attributes from representation

Threat model. We assume an adversary can observe the representation z⋆ of a trained model M on input x⋆ at inference time but cannot observe x⋆ directly. This scenario arises in practice when model evaluation is partitioned in order to protect the privacy of inputs—see Section 2.4.3. The adversary wants to infer some property s of x⋆ that is not part of the task label y.

We further assume that the adversary has an auxiliary dataset D_aux of labeled (x, s) pairs and a black-box oracle E to compute the corresponding E(x). The purpose of D_aux is to help the adversary recognize the property of interest in the model's representations; it need not be drawn from the same dataset as x⋆.

Inference attack. We measure the leakage of sensitive properties from the representations of overlearned models via the following attack. The adversary uses supervised learning on the (E(x), s) pairs to train an attack model M_attack. At inference time, the adversary predicts ŝ from the observed z⋆ as M_attack(z⋆).
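A minimal sketch of this attack with scikit-learn (the two-layer (256, 128) network mirrors the attack model used later in the experiments; `encoder` is assumed to be a callable black-box oracle returning E(x) as an array):

```python
from sklearn.neural_network import MLPClassifier

def build_attack_model(encoder, X_aux, s_aux):
    """X_aux, s_aux: the adversary's auxiliary inputs and sensitive labels."""
    Z_aux = encoder(X_aux)                      # representations E(x)
    attack = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=200)
    attack.fit(Z_aux, s_aux)
    return attack

# At inference time: attack.predict([z_star]) predicts the sensitive attribute.
```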

De-censoring. If the representation z is "censored" (see Section 5.1) to reduce the amount of information it reveals about s, the direct inference attack may not succeed. We develop a new, learning-based de-censoring approach (see Algorithm 7) to convert censored representations into a different form that leaks more information about the property of interest. The adversary trains M_aux on D_aux to predict s from x, then transforms z into the input features of M_aux.

We treat de-censoring as an optimization problem with a feature-space L2 loss $\|T(z) - z_{aux}\|_2^2$, where T is the transformer that the adversary wants to learn and z_aux is the uncensored representation from M_aux.

Algorithm 7 De-censoring representations
    Input: Auxiliary dataset D_aux, black-box oracle E, observed representation z⋆
    Train auxiliary model M_aux = E_aux ∘ C_aux on D_aux
    Initialize transform model T, inference attack model M_attack
    for each training iteration do
        Sample a batch of data (x, s) from D_aux and compute z = E(x), z_aux = E_aux(x)
        Update T on the batch of (z, z_aux) with loss ‖T(z) − z_aux‖₂²
        Update M_attack on the batch of (T(z), s) with cross-entropy loss
    return prediction ŝ = M_attack(T(z⋆))

Training with a feature-space loss has been proposed for synthesizing more natural images by matching them with real images [43, 142]. In our case, we match censored and uncensored representations. The adversary can then use T(z) as an uncensored approximation of z to train an inference model M_attack and infer the property s as M_attack(T(z⋆)).
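One iteration of Algorithm 7 could be sketched in TensorFlow 2 as follows (names are ours; E is the censored oracle, E_aux the adversary's own uncensored encoder, and MSE stands in for the L2 feature-space loss):

```python
import tensorflow as tf

ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
mse = tf.keras.losses.MeanSquaredError()

def decensor_step(E, E_aux, T, M_attack, opt_T, opt_attack, x, s):
    z, z_aux = E(x, training=False), E_aux(x, training=False)
    with tf.GradientTape() as tape:          # match censored to uncensored features
        t_loss = mse(z_aux, T(z, training=True))
    opt_T.apply_gradients(zip(tape.gradient(t_loss, T.trainable_variables),
                              T.trainable_variables))
    with tf.GradientTape() as tape:          # infer s from the transformed features
        a_loss = ce(s, M_attack(T(z, training=False), training=True))
    opt_attack.apply_gradients(zip(tape.gradient(a_loss, M_attack.trainable_variables),
                                   M_attack.trainable_variables))
    return t_loss, a_loss
```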

5.2.2 Re-purposing models to predict sensitive attributes

Threat model. We consider a malicious service provider who collects users' data to train a model for a given, specified task and later wishes to re-purpose the model for a different, unspecified, and potentially sensitive task without users' consent. The service provider has full access to a trained model M as well as a small transfer dataset D_transfer with labels for the unspecified sensitive task.

Re-purposing attack. To re-purpose a model—for example, to convert a model trained for a benign task into a model that predicts a sensitive attribute—we can use the features z_l in any layer of M as the feature extractor and connect a new classifier C_transfer to E_l. The transferred model M_transfer = C_transfer ∘ E_l is fine-tuned on D_transfer, which in itself is not sufficient to train an accurate model for the new task. By utilizing the features learned by M on the original D, M_transfer can achieve better results than models trained from scratch on D_transfer.
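A hedged Keras sketch of the re-purposing step (assuming `base_model` is a functional Keras model and `cut_layer` names the layer whose features z_l are reused; everything else is illustrative):

```python
import tensorflow as tf

def repurpose(base_model, cut_layer, num_new_classes, ds_transfer, epochs=50):
    features = base_model.get_layer(cut_layer).output
    new_head = tf.keras.layers.Dense(num_new_classes)   # randomly initialized C_transfer
    transfer_model = tf.keras.Model(base_model.input, new_head(features))
    transfer_model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"])
    transfer_model.fit(ds_transfer, epochs=epochs)       # fine-tune on the small D_transfer
    return transfer_model
```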

Feasibility of model re-purposing complicates the application of policies and regulations such as GDPR [172]. According to Articles 5(1) and 6(4) of GDPR, data processors are required to disclose every purpose of data collection and obtain consent from the users whose data was collected, and the purposes of further processing must be compatible with the purpose for which the personal data were initially collected. We show that, given a trained model, it is not possible to determine—nor, consequently, disclose or obtain user consent for—what the model has learned. Learning per se thus cannot be a regulated "purpose" of data collection. Regulators must be aware that even if the original training data has been erased, a model can be re-purposed for a different objective, possibly not envisioned at the time of the original data collection. We discuss this further in Section 5.5.

5.3 Experimental Results

5.3.1 Datasets, tasks, and models

Health is the Heritage Health dataset [67] with medical records of over 55,000 patients, binarized into 112 features with age information removed. The task is to predict if Charlson Index (an estimate of patient mortality) is greater than zero; the sensitive attribute is age (binned into 9 ranges).

UTKFace is a set of over 23,000 face images labeled with age, gender, and race [173, 196]. We rescaled them to 50 × 50 RGB pixels. The task is to predict gender; the sensitive attribute is race.

FaceScrub is a set of face images labeled with gender [51]. Some URLs have expired, but we were able to download 74,000 images for 500 individuals and rescale them to 50 × 50 RGB pixels. The task is to predict gender; the sensitive attribute is identity.

Places365 is a set of 1.8 million images labeled with 365 fine-grained scene cate- gories. We use a subset of 73,000 images, 200 per category. The task is to predict whether the scene is indoor or outdoor; the sensitive attribute is the fine-grained scene label.

Twitter is a set of tweets from the PAN16 dataset [154] labeled with user infor- mation. We removed tweets with fewer than 20 tokens and users with fewer than 50 tweets, yielding a dataset of over 46,000 tweets from 151 users with an over 80,000-word vocabulary. The task is to predict the age of the user given a tweet; the sensitive attribute is the author’s identity.

Yelp is a set of Yelp reviews labeled with user identities [183]. We removed users with fewer than 1,000 reviews and reviews with more than 200 tokens, yielding a dataset of over 39,000 reviews from 137 users with an over 69,000-word vocabulary. The task is to predict the review score between 1 and 5; the sensitive attribute is the author's identity.

PIPA is a set of over 60,000 photos of 2,000 individuals gathered from public Flickr photo albums [151, 193]. Each image can include one or more individuals. We cropped their head regions using the bounding boxes in the image annotations. The task is to predict the identity given the head region; the sensitive attribute is whether two head regions are from the same photo.

Dataset   | Target y     | Attribute s  | Cramer's V
Health    | CCI          | age          | 0.149
UTKFace   | gender       | race         | 0.035
FaceScrub | gender       | facial IDs   | 0.044
Places365 | in/outdoor   | scene type   | 0.052
Twitter   | age          | author       | 0.134
Yelp      | review score | author       | 0.033
PIPA      | facial IDs   | IDs together | n/a

Table 5.1: Summary of datasets and tasks. Cramer's V captures the statistical correlation between y and s (0 indicates no correlation and 1 indicates perfect correlation).

Models. For Health, we use a two-layer fully connected (FC) neural network with 128 and 32 hidden units, respectively, following [138, 181]. For UTKFace and FaceScrub, we use a LeNet [108] variant: three 3 × 3 convolutional and 2 × 2 max-pooling layers with 16, 32, and 64 filters, followed by two FC layers with 128 and 64 hidden units. For Twitter and Yelp, we use a text CNN [88]. For Places365 and PIPA, we use AlexNet [100] with convolutional layers pre-trained on ImageNet [39], to which we further add a 3 × 3 convolutional layer with 128 filters and 2 × 2 max-pooling followed by two FC layers with 128 and 64 hidden units, respectively.
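For concreteness, the LeNet-style classifier for UTKFace/FaceScrub could be written in Keras roughly as follows (a sketch: activations and other unstated details are assumptions):

```python
import tensorflow as tf

def lenet_variant(num_classes):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(50, 50, 3)),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_classes),   # logits over the target labels
    ])
```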

5.3.2 Inferring sensitive attributes from representations

Setup. We use 80% of the data for training the target models and 20% for evaluation. The size of the adversary's auxiliary dataset is 50% of the training data.

Success of the inference attack is measured on the final FC layer's representation of the test data. The baseline is inference from the uncensored representation. We also measure the success of inference against representations censored with γ = 1.0 for adversarial training and β = 0.01, λ = 0.0001 for information-theoretical censoring, following [138, 181].

For censoring with adversarial training, we simulate the adversary with a two-layer FC neural network with 256 and 128 hidden units. The number of epochs is 50 for censoring with adversarial training and 30 for the other models. We use the Adam optimizer with a learning rate of 0.001 and a batch size of 128. For information-theoretical censoring, the model is based on a VAE [91, 138]. The encoder q(z | x) has the same architecture as the CNN models with all convolutional layers. On top of that, the encoder outputs a mean vector and a standard-deviation vector to model the random variable z with the re-parameterization trick. The decoder p(x | z) has three de-convolution layers with up-sampling to map z back to the same shape as the input x.

For our inference model, we use the same architecture as the censoring adversary. For the PIPA inference model, which takes two representations of faces and outputs a binary prediction of whether these faces appear in the same photo, we use two FC layers followed by a bilinear model: $p(s \mid z_1, z_2) = \sigma(h(z_1) W h(z_2)^{\top})$, where $z_1, z_2$ are the two input representations, h is the two FC layers, and σ is the sigmoid function. We train the inference model for 50 epochs with the Adam optimizer, a learning rate of 0.001, and a batch size of 128.
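The bilinear scorer can be sketched as a small Keras model (names and the 256/128 hidden sizes for h are ours, mirroring the censoring adversary's architecture):

```python
import tensorflow as tf

class BilinearPairModel(tf.keras.Model):
    """p(s | z1, z2) = sigmoid(h(z1) W h(z2)^T) with a shared two-layer FC network h."""
    def __init__(self, hidden=128):
        super().__init__()
        self.h = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(hidden, activation="relu"),
        ])
        self.W = self.add_weight(name="W", shape=(hidden, hidden),
                                 initializer="glorot_uniform", trainable=True)

    def call(self, inputs):
        z1, z2 = inputs
        h1, h2 = self.h(z1), self.h(z2)                           # (batch, hidden)
        logits = tf.reduce_sum(tf.matmul(h1, self.W) * h2, axis=-1)
        return tf.sigmoid(logits)                                 # same-photo probability
```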

Results. Table 5.2 reports the results. When representations are not censored, accuracy of inference from the last-layer representations is much higher than random guessing for all tasks, which means the models overlearn even in the higher, task-specific layers.

Dataset   | Acc of y (RAND / BASE / ADV / IT) | Acc of s (RAND / BASE / ADV / IT)
Health    | 66.31 / 84.33 / 80.16 / 82.63     | 16.00 / 32.52 / 32.00 / 26.60
UTKFace   | 52.27 / 90.38 / 90.15 / 88.15     | 42.52 / 62.18 / 53.28 / 53.30
FaceScrub | 53.53 / 98.77 / 97.90 / 97.66     |  1.42 / 33.65 / 30.23 / 10.61
Places365 | 56.16 / 91.41 / 90.84 / 89.82     |  1.37 / 31.03 / 12.56 /  2.29
Twitter   | 45.17 / 76.22 / 57.97 / n/a       |  6.93 / 38.46 / 34.27 / n/a
Yelp      | 42.56 / 57.81 / 56.79 / n/a       | 15.88 / 33.09 / 27.32 / n/a
PIPA      |  7.67 / 77.34 / 52.02 / 29.64     | 68.50 / 87.95 / 69.96 / 82.02

Table 5.2: Accuracy of inference from representations (last FC layer). RAND is random guessing based on majority class labels; BASE is inference from the uncensored representation; ADV from the representation censored with adversarial training; IT from the information-theoretically censored representation.

When representations are censored with adversarial training, accuracy drops for both the main and inference tasks. Accuracy of inference is much higher than in [181]: the latter uses logistic regression, which is weaker than the training-time censoring-adversary network, whereas we use the same architecture for both the training-time and post-hoc adversaries. Information-theoretical censoring reduces the accuracy of inference, but it also damages main-task accuracy more than adversarial training for almost all models.

Overlearning can cause a model to recognize even sensitive attributes that are not represented in the training dataset. Such attributes cannot be censored using any known technique. We trained a UTKFace gender classifier on datasets where all faces are of the same race. We then applied this model to test images with four races (White, Black, Asian, Indian) and attempted to infer the race attribute from the model's representations. Inference accuracy is 61.95%, 61.99%, 60.85%, and 60.81% for models trained only on, respectively, White, Black, Asian, and Indian images—almost as good as the 62.18% baseline and much higher than random guessing (42.52%).

Figure 5.1: Reduction in accuracy due to censoring. Blue lines are the main task, red lines are the inference of sensitive attributes. The first row is adversarial training with different γ values; the second and third rows are information-theoretical censoring with different β and λ values, respectively.

Effect of censoring strength. Fig. 5.1 shows that stronger censoring does not help. On FaceScrub and Twitter with adversarial training, increasing γ damages the model’s accuracy on the main task, while accuracy of inference decreases slightly or remains the same. For UTKFace and Yelp, increasing γ improves ac- curacy of inference. This may indicate that the simulated “adversary” during adversarial training overpowers the optimization process and censoring defeats itself.

Dataset   | ADV   | +δ     | IT    | +δ
Health    | 32.55 | +0.55  | 27.05 | +0.45
UTKFace   | 59.38 | +6.10  | 54.31 | +1.01
FaceScrub | 40.37 | +12.24 | 16.40 | +5.79
Places365 | 19.71 | +7.15  |  3.10 | +0.81
Twitter   | 36.55 | +2.22  | n/a   |
Yelp      | 31.36 | +4.04  | n/a   |

Table 5.3: Improving inference accuracy with de-censoring. δ is the increase from Table 5.2.

For all models with information-theoretical censoring, increasing β reduces the accuracy of inference but can lead to the model not converging on its main task. Increasing λ results in the model not converging on the main task, without affecting the accuracy of inference, on Health, UTKFace, and FaceScrub. This seems to contradict the censoring objective, but the reconstruction loss in Equation 5.3 dominates the other loss terms, which leads to poor divergence between the conditional q(z | x) and q(z), i.e., information about x is still retained in z.

De-censoring. As described in Section 5.2.1, we developed a new technique to transform censored representations to make inference easier. We first train an auxiliary model on D_attack to predict the sensitive attribute from representations, using the same architecture as in the baseline models. The resulting uncensored representations from the last convolutional layer are the target of the de-censoring transformations. We use a single-layer fully connected neural network as the transformer and set the number of hidden units to the dimension of the uncensored representation. The inference model operates on top of the transformer network, with the same hyper-parameters as before.

Table 5.3 shows that de-censoring significantly boosts the accuracy of inference from representations censored with adversarial training. The boost is smaller against information-theoretical censoring because its objective not only censors z via I(z, s) but also forgets x via I(x, z). On the Health task, there is not much difference, since the baseline attack already performs similarly to the attack on censored representations, leaving little room for improvement.

101 / 0.02 0.04 0.06 0.08 0.10 |Dtransfer| |D| Health -0.57 0.22 -1.21 -0.99 0.35 UTKFace 4.72 2.70 2.83 0.25 2.24 FaceScrub 7.01 15.07 7.02 11.80 9.43 Places365 4.42 2.14 2.06 3.39 2.86 Twitter 12.99 10.87 10.51 9.57 7.30 Yelp 5.57 3.60 8.45 0.33 2.1 PIPA 1.33 2.41 6.50 4.93 5.89

Table 5.4: Adversarial re-purposing. The values are differences between the ac- curacy of predicting sensitive attributes using a re-purposed model vs. a model trained from scratch. smaller against information-theoretical censoring because its objective not only censors z with I(z, s), but also forgets x with I(x, z). On the Health task, there is not much difference since the baseline attack is already similar to the attack on censored representations, leaving little room for improvement.

In summary, these results demonstrate that information about sensitive attributes unintentionally captured by the overlearned representations cannot be suppressed by censoring.

5.3.3 Re-purposing models to predict sensitive attributes

To demonstrate that overlearned representations can be picked up by a small set of unseen data to create a model for predicting sensitive attributes, we re-purpose the uncensored baseline models from Section 5.3.2 by fine-tuning them on a small set D_transfer (2–10% of D) and compare them with models trained from scratch on D_transfer. We fine-tune all models for 50 epochs with a batch size of 32; the other hyper-parameters are as in Section 5.3.2. For all CNN models, we use the trained convolutional layers as the feature extractor and randomly initialize the other layers. Table 5.4 shows that the re-purposed models always outperform those trained from scratch; FaceScrub and Twitter exhibit the biggest gains.

Censored on | δA     | δB when transferred from: conv1 / conv2 / conv3 / fc4 / fc5
γ = 0.5
conv1 |  -1.66 | -6.42 / -4.09 / -1.65 /  0.46 / -3.87
conv2 |  -2.87 |  0.95 / -1.77 / -2.88 / -1.53 / -2.22
conv3 |  -0.64 |  1.49 /  1.49 /  0.67 / -0.48 / -1.38
fc4   |  -0.16 |  2.03 /  5.16 /  6.73 /  6.12 /  0.54
fc5   |   0.05 |  1.52 /  4.53 /  7.42 /  6.14 /  4.53
γ = 0.75
conv1 |  -4.48 | -7.33 / -5.01 / -1.51 / -7.99 / -7.82
conv2 |  -6.02 |  0.44 / -7.04 / -5.46 / -5.94 / -5.82
conv3 |  -1.90 |  1.32 /  1.37 /  1.88 /  0.74 / -0.67
fc4   |   0.01 |  3.65 /  4.56 /  5.11 /  4.44 /  0.91
fc5   |  -0.74 |  1.54 /  3.61 /  6.75 /  7.18 /  4.99
γ = 1.0
conv1 | -45.25 | -7.36 / -3.93 / -2.75 / -4.37 / -2.91
conv2 | -20.30 | -3.28 / -5.27 / -7.03 / -6.38 / -5.54
conv3 | -45.20 | -2.13 / -3.06 / -4.48 / -4.05 / -5.18
fc4   |  -0.52 |  1.73 /  5.19 /  4.80 /  5.83 /  1.84
fc5   |  -0.86 |  1.56 /  3.55 /  5.59 /  5.14 /  1.97

Table 5.5: The effect of censoring on adversarial re-purposing for FaceScrub with γ = 0.5, 0.75, 1.0. δA is the difference in the original-task accuracy (second column) between uncensored and censored models; δB is the difference in the accuracy of inferring the sensitive attribute (columns 3 to 7) between the models re-purposed from different layers and the model trained from scratch. Negative values mean reduced accuracy.

Effect of censoring. Previous work only censored the highest layer of the models. Model re-purposing can use any layer of the model for transfer learning. Therefore, to prevent re-purposing, inner layers must be censored, too. We perform the first study of inner-layer censoring and measure its effect on both the original and re-purposed tasks. We use FaceScrub for this experiment and apply adversarial training to every layer with different strengths (γ = 0.5, 0.75, 1.0).

Figure 5.2: Heatmaps of the linear CKA similarities between censored and uncensored representations. Numbers 0 through 4 represent layers conv1, conv2, conv3, fc4, and fc5. For each model censored at layer i (x-axis), we measure similarity between the censored and uncensored models at layer j (y-axis).

Table 5.5 summarizes the results. Censoring lower layers (conv1 to conv3) blocks adversarial re-purposing, at the cost of reducing the model’s accuracy on its original task. Hyper-parameters must be tuned carefully, e.g. when γ = 1, there is a huge drop in the original-task accuracy.

To further investigate how censoring in one layer affects the representations learned across all layers, we measure per-layer similarity between censored and uncensored models using CKA, linear centered kernel alignment [95]—see Fig- ure 5.2. When censoring is applied to a specific layer, similarity for that layer is the smallest (values on the diagonal). When censoring lower layers with mod- erate strength (γ = 0.5 or 0.75), similarity between higher layers is still strong; when censoring higher layers, similarity between lower layers is strong. There- fore, censoring can block adversarial re-purposing from a specific layer, but the adversary can still re-purpose representations in the other layer(s) to obtain an accurate model for predicting sensitive attributes.

[Figure 5.3: CKA heatmaps for UTKFace and FaceScrub models at 0%, 10%, 40%, and 100% of training; x-axis: layer of A, y-axis: layer of B.]

Figure 5.3: Pairwise similarities of layer representations between models for the original task (A) and for predicting a sensitive attribute (B). Numbers 0 through 4 denote layers conv1, conv2, conv3, fc4 and fc5.

5.3.4 When, where, and why overlearning happens

To investigate when (during training) and where (in which layer) the models overlearn, we use linear CKA similarity [95] to compare the representations at different epochs of training between models trained for the original task (A) and models trained to predict a sensitive attribute (B). We use UTKFace and FaceScrub for these experiments.

Figure 5.3 shows that lower layers of models A and B learn very similar features. This was observed in [95] for CIFAR-10 and CIFAR-100 models, but those tasks are closely related. In our case, the tasks are entirely different and B reveals the sensitive attribute while A does not. The similar low-level features are learned very early during training. There is little similarity between the low-level features of A and high-level features of B (and vice versa), matching intuition. Interestingly, on FaceScrub even the high-level features are similar between A and B.

[Figure 5.4: similarity curves for layers conv1, conv2, and conv3 over training epochs 10 to 30; y-axis: similarity to random weights.]

Figure 5.4: Similarity of layer representations of a partially trained gender clas- sifier to a randomly initialized model before training. Models are trained on FaceScrub using 50 IDs (blue line) and 500 IDs (red line).

We conjecture that one of the reasons for overlearning is structural complexity of the data. Previous work theoretically showed that over-parameterized neural networks favor simple solutions on structured data when optimized with SGD, where structure is quantified as the number of distributions (e.g., images from different identities) within each class in the target task [112], i.e., the fewer distributions, the more structured the data. For data generated from more complicated distributions, networks learn more complex solutions, leading to the emergence of features that are much more general than the learning objective and, consequently, overlearning.

Figure 5.4 shows that the representations of a gender classifier trained on the faces from 50 individuals are closer to the random initialization than the representations trained on the faces from 500 individuals (the hyper-parameters and the total number of training examples are the same in both cases). More complex training data thus results in more complex representations for the same objective.

5.4 Related Work

Prior work studied transferability of representations only between closely related tasks. Transferability of features between ImageNet models decreases as the distance between the base and target tasks grows [187], and performance of tasks is correlated to their distance from the source task [12]. CNN models trained to distinguish coarse classes also distinguish their subsets [75]. By contrast, we show that models trained for simple tasks implicitly learn privacy-sensitive concepts unrelated to the labels of the original task. Other than an anecdotal mention in the acknowledgments paragraph of [86] that logit-layer activations leak non-label concepts, this phenomenon has never been described in the research literature.

Gradient updates revealed by participants in distributed learning leak information about individual training batches that is uncorrelated with the learning objective [132]. We show that overlearning is a generic problem in (fully trained) models, helping explain these observations.

There is a large body of research on learning disentangled representations [17, 118]. The goal is to separate the underlying explanatory factors in the representation so that it contains all information about the input in an interpretable structure. State-of-the-art approaches use variational autoencoders [91] and their variants to learn disentangled representations in an unsupervised fashion [28, 69, 87, 101]. By contrast, overlearning means that representations learned during supervised training for one task implicitly and automatically enable another task, without disentangling the representation on purpose during training.

Work on censoring representations aims to suppress sensitive demographic attributes and identities in the model's output for fairness and privacy. Techniques include adversarial training [49], which has been applied to census and health records [181], text [33, 50, 111], images [61], and sensor data of wearables [78]. An alternative approach is to minimize mutual information between the representation and the sensitive attribute [138, 145]. Neither approach can prevent overlearning, except at the cost of destroying the model's accuracy. Furthermore, these techniques cannot censor attributes that are not represented in the training data. We show that overlearned models recognize such attributes, too.

5.5 Conclusions

We demonstrated that models trained for seemingly simple tasks implicitly learn concepts that are not represented in the objective function. In particular, they learn to recognize sensitive attributes, such as race and identity, that are statistically orthogonal to the objective. The failure of censoring to suppress these attributes and the similarity of learned representations across uncorrelated tasks suggest that overlearning may be intrinsic, i.e., learning for some objectives may not be possible without recognizing generic low-level features that enable other tasks, including inference of sensitive attributes. For example, there may not exist a set of features that enables a model to accurately determine the gender of a face but not its race or identity.

This is a challenge for regulations such as GDPR that aim to control the purposes and uses of machine learning technologies. To protect privacy and ensure certain forms of fairness, users and regulators may desire that models not learn some features and attributes. If overlearning is intrinsic, it may not be technically possible to enumerate, let alone control, what models are learning. Therefore, regulators should focus on ensuring that models are applied in a way that respects privacy and fairness, while acknowledging that they may still recognize and use sensitive attributes.

CHAPTER 6
ADVERSARIAL SEMANTIC COLLISIONS

Deep neural networks are vulnerable to adversarial examples [57, 166], i.e., imperceptibly perturbed inputs that cause models to make wrong predictions.

Adversarial examples based on inserting or modifying characters and words have been demonstrated for text classification [48, 113, 146], question answering [83, 176], and machine translation [16, 177]. These attacks aim to minimally perturb the input so as to preserve its semantics while changing the output of the model.

In this chapter, we introduce a different class of vulnerabilities in natural language processing (NLP) models for analyzing the meaning and similarity of texts. Given an input (query), we demonstrate how to generate a semantic collision: an unrelated text that is judged semantically equivalent by the target model. Semantic collisions are the "inverse" of adversarial examples. Whereas adversarial examples are similar inputs that produce dissimilar model outputs, semantic collisions are dissimilar inputs that produce similar model outputs. Table 6.1 shows semantic collisions for the four tasks we consider. More examples appear in Appendix A.

6.1 Threat Model

We describe the targets of our attack, the threat model, and the adversary's objectives.

Paraphrase Identification (f output: ≥ 99% paraphrase)
  Input (x): Does cannabis oil cure cancer? Or are the sellers hoaxing?
  Aggressive (c): Pay 0ff your mortgage der Seller chem Wad marijuana scarcity prince confidence
  Regularized aggressive (c): caches users remedies paved Sell Medical hey untold Caval OR and of of of of of of of of of of of of of of a a a of a
  Natural (c): he might actually work when those in

Irrelevant Document Retrieval (f output: ranks ≤ 3)
  Query (x): Health and Computer Terminals
  Aggressive (c): chesapeake oval mayo knuckles crowded double transmitter gig after nixon, tipped incumbent physician kai joshi astonished northwestern documents obliged dumont | determines philadelphia consultative oracle keyboards dominates tel node
  Regularized aggressive (c): and acc near floors : panicked ; its articles' employment became impossible, the – of cn magazine usa, in which " "'panic over unexpected noise, noise of and a of the of the of the of a of of the of of of of the of of of of the of of the of.
  Natural (c): the ansb and other buildings to carry people : three at the mall, an infirmary, an auditorium, and a library, as well as a clinic, pharmacy, and restaurant

Response Suggestion (f output: c's rank = 1)
  Context (x): i went to school to be a vet , but i didn't like it.
  Aggressive (c): buy v1agra in canadian pharmacy to breath as four ranger color
  Regularized aggressive (c): kill veterans and oxygen snarled clearly you were a a to to and a a to to to to to to to to to to
  Natural (c): then not have been an animal, or a human or a soldier but should

Extractive Summarization (f output: c's rank = 1)
  Truth: on average, britons manage just six and a half hours ' sleep a night , which is far less than the recommended eight hours.
  Aggressive (c): iec cu franks believe carbon chat fix pay carbon targets co 8 iec cu mb 2
  Regularized aggressive (c): the second mercury project carbon b mercury is a will produce 38 million 202 carbon a a to to to to to to to to to to to to to
  Natural (c): 1 million men died during world war ii; over 40 percent were women

Table 6.1: Four tasks in our study. Given an input x, the adversary produces a collision c resulting in a deceptive output. Collisions can be nonsensical or natural-looking and also carry spam messages (shown in red).

Semantic similarity. Evaluating semantic similarity of a pair of texts is at the core of many NLP applications. Paraphrase identification decides whether sentences are paraphrases of each other and can be used to merge similar content and remove duplicates. Document retrieval computes semantic similarity scores between the user's query and each of the candidate documents and uses these scores to rank the documents. Response suggestion, aka Smart Reply [85] or sentence retrieval, selects a response from a pool of candidates based on their similarity scores to the user's input in a dialogue. Extractive summarization ranks sentences in a document based on their semantic similarity to the document's content and outputs the top-ranked sentences.

For each of these tasks, let f denote the model and x_a, x_b a pair of text inputs. There are two common modeling approaches for these applications. In the first approach, the model takes the concatenation x_a ⊕ x_b as input and directly produces a similarity score f(x_a ⊕ x_b). In the second approach, the model computes a sentence-level embedding f(x) ∈ R^h, i.e., a dense vector representation of input x. The similarity score is then computed as s(f(x_a), f(x_b)), where s is a vector similarity metric such as cosine similarity. Models based on either approach are trained with similar losses, such as the binary classification loss where each pair of inputs is labeled as 1 if semantically related and 0 otherwise. For generality, let S(·, ·) be a similarity function that captures semantic relevance under either approach. We also assume that f can take x in the form of a sequence of discrete words (denoted as w) or word embedding vectors (denoted as e), depending on the scenario.
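The following PyTorch-style sketch illustrates the two approaches; cross_encoder, encoder, and the tokenized inputs are placeholders, not the exact models used in our experiments.

    import torch
    import torch.nn.functional as F

    def score_cross_encoder(cross_encoder, xa_ids, xb_ids):
        # First approach: f scores the concatenated pair directly,
        # S(xa, xb) = sigmoid(f(xa ⊕ xb)).
        pair = torch.cat([xa_ids, xb_ids], dim=-1)
        return torch.sigmoid(cross_encoder(pair))

    def score_bi_encoder(encoder, xa_ids, xb_ids):
        # Second approach: embed each text separately and compare the vectors,
        # S(xa, xb) = s(f(xa), f(xb)) with cosine similarity as s.
        ea, eb = encoder(xa_ids), encoder(xb_ids)   # each in R^h
        return F.cosine_similarity(ea, eb, dim=-1)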

Assumptions. We assume that the adversary has full knowledge of the target model, including its architecture and parameters. It may be possible to transfer white-box attacks to the black-box scenario using model extraction [98, 177]; we leave this to future work. The adversary controls some inputs that will be used by the target model, e.g., he can insert or modify candidate documents for a retrieval system.

Adversary's objectives. Given a target model f and target sentence x, the adversary wants to generate a collision x_b = c such that f perceives x and c as semantically similar or relevant. Adversarial uses of this attack depend on the application. If an application uses paraphrase identification to merge similar content, e.g., in Quora [158], the adversary can use collisions to deliver spam or advertising to users. In a retrieval system, the adversary can use collisions to boost the rank of irrelevant candidates for certain queries. For extractive summarization, the adversary can cause collisions to be returned as the summary of the target document.

6.2 Generating Adversarial Semantic Collisions

Given an input (query) sentence x, we aim to generate a collision c for the victim model with the white-box similarity function S. This can be formulated as an optimization problem: arg max_{c ∈ X} S(x, c) such that x and c are semantically unrelated. A brute-force enumeration of X is computationally infeasible. Instead, we design gradient-based approaches outlined in Algorithm 8. We consider two variants: (a) aggressively generating unconstrained, nonsensical collisions, and (b) constrained collisions, i.e., sequences of tokens that appear fluent under a language model and cannot be automatically filtered out based on their perplexity.

We assume that models can accept inputs as both hard one-hot words and soft words,¹ where a soft word is a probability vector w̌ ∈ Δ^{|V|−1} over the vocabulary V.

6.2.1 Aggressive Collisions

We use gradient-based search to generate a fixed-length collision given a target input. The search is done in two steps: 1) we find a continuous representation of a collision using gradient optimization with relaxation, and 2) we apply beam search to produce a hard collision. We repeat these two steps iteratively until the similarity score S converges.

Optimizing for soft collision. We first relax the optimization to a continuous representation with temperature annealing. Given the model's vocabulary V and a fixed length T, we model word selection at each position t as a continuous logit vector z_t ∈ R^{|V|}. To convert each z_t to an input word, we model a softly selected word at t as:

cˇt = softmax(zt/τ) (6.1)

where τ is a temperature scalar. Intuitively, the softmax over z_t gives the probability of each word in V. The temperature controls the sharpness of the word-selection probability; as τ → 0, the soft word č_t approaches the hard word arg max z_t.
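A minimal, self-contained PyTorch sketch of the relaxation in equation 6.1 together with the gradient-ascent loop described in the next paragraph; the toy similarity function, embedding matrices, and hyper-parameter values are illustrative stand-ins, not the attack's actual implementation.

    import torch

    vocab_size, T, tau, eta, steps = 30522, 20, 1.0, 0.001, 30

    # Toy stand-in for the model's differentiable similarity S(x, soft words).
    target_emb = torch.randn(768)
    word_emb = torch.randn(vocab_size, 768)
    def similarity_fn(target, soft_words):
        avg = (soft_words @ word_emb).mean(dim=0)   # expected word embeddings
        return torch.nn.functional.cosine_similarity(avg, target, dim=0)

    z = torch.zeros(T, vocab_size, requires_grad=True)   # logits z_1..z_T
    optimizer = torch.optim.Adam([z], lr=eta)
    for _ in range(steps):
        soft_c = torch.softmax(z / tau, dim=-1)           # soft words (Eq. 6.1)
        loss = -similarity_fn(target_emb, soft_c)         # ascend on S(x, soft c)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()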

We optimize over the continuous values z. At each step, the soft word collisions č = [č_1, ..., č_T] are forwarded to f to calculate S(x, č). Since all operations are continuous, the error can be back-propagated all the way to each z_t to calculate its gradients. We can thus apply gradient ascent to improve the objective.

¹For a soft-word input, models compute the word vector as the weighted average of word embeddings by the probability vector.

Algorithm 8 Generating adversarial semantic collisions
Input: input text x, similarity function S, embeddings E, language model g, vocabulary V, length T
Hyperparams: beam size B, top-k size K, iterations N, step size η, temperature τ, score coefficient β, label smoothing ε

procedure AGGRESSIVE
    Z ← [z_1, ..., z_T], z_t ← 0 ∈ R^|V|
    while similarity score not converged do
        for iteration 1 to N do
            č ← [č_1, ..., č_T], č_t ← softmax(z_t / τ)
            Z ← Z + η · ∇_Z [(1 − β) · S(x, č) + β · Ω(Z)]
        B ← B replicates of empty token
        for t = 1 to T do
            F_t ← 0 ∈ R^{B×K}, beam score matrix
            for c_{1:t−1} ∈ B, w ∈ top-k(z_t, K) do
                F_t[c_{1:t−1}, w] ← S(x, c_{1:t−1} ⊕ w ⊕ č_{t+1:T})
            B ← {c_{1:t−1} ⊕ w | (c_{1:t−1}, w) ∈ top-k(F_t, B)}
        LS(c_t) ← Eq. 6.2 with ε for c ← arg max B
        z_t ← log LS(c_t) for z_t in Z
    return c = arg max B

procedure NATURAL
    B ← B replicates of start token
    for t = 1 to T do
        F_t ← 0 ∈ R^{B×K}, beam score matrix
        for each beam c_{1:t−1} ∈ B do
            ℓ_t ← g(c_{1:t−1}), next-token logits from LM
            z_t ← PERTURBLOGITS(ℓ_t, c_{1:t−1})
            for w ∈ top-k(z_t, K) do
                F_t[c_{1:t−1}, w] ← joint score from Eq. 6.5
        B ← {c_{1:t−1} ⊕ w | (c_{1:t−1}, w) ∈ top-k(F_t, B)}
    return c = arg max B

procedure PERTURBLOGITS(ℓ, c_{1:t−1})
    δ ← 0 ∈ R^|V|
    for iteration 1 to N do
        č_t ← softmax((ℓ + δ) / τ)
        δ ← δ + η · ∇_δ S(x, c_{1:t−1} ⊕ č_t)
    return z = ℓ + δ

Searching for hard collision. After the relaxed optimization, we apply a projection step to find a hard collision using discrete search.² Specifically, we apply left-to-right beam search on each z_t. At every search step t, we first get the top K words w based on z_t and rank them by the target similarity S(x, c_{1:t−1} ⊕ w ⊕ č_{t+1:T}), where č_{t+1:T} is the partial soft collision starting at t + 1. This procedure allows us to find a hard-word replacement for the soft word at each position t, based on the previously found hard words and relaxed estimates of future words.

Repeating optimization with hard collision. If the similarity score still has room for improvement after the beam search, we use the current c to initialize the soft solution zt for the next iteration of optimization by transferring the hard solution back to continuous space.

In order to initialize the continuous relaxation from a hard sentence, we apply label smoothing (LS) to its one-hot representation. For each word c_t in the current c, we soften its one-hot vector to be inside Δ^{|V|−1} with

    LS(c_t)_w = { 1 − ε            if w = arg max c_t
                { ε / (|V| − 1)    otherwise                    (6.2)

where ε is the label-smoothing parameter. Since LS(c_t) is constrained to the probability simplex Δ^{|V|−1}, we set each z_t to log LS(c_t) ∈ R^{|V|} as the initialization for optimizing the soft solution in the next iteration.

²We could project the soft collision by annealing the temperature to 0, i.e., c = [arg max z_1, ..., arg max z_T]. However, this approach yields sub-optimal results because the hard arg max discards information from nearby words.
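A small sketch of the re-initialization in equation 6.2; the function name and the ε value are chosen for illustration only.

    import torch

    def smoothed_log_onehot(token_ids, vocab_size, eps=0.1):
        # LS(c_t): put 1 - eps on the chosen token and eps/(|V|-1) elsewhere,
        # then take the log so it can initialize the logits z_t.
        probs = torch.full((token_ids.numel(), vocab_size), eps / (vocab_size - 1))
        probs.scatter_(1, token_ids.view(-1, 1), 1.0 - eps)
        return probs.log()

    z_init = smoothed_log_onehot(torch.tensor([42, 7, 13]), vocab_size=30522)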

6.2.2 Constrained Collisions

The Aggressive approach is very effective at finding collisions, but it can output nonsensical sentences. Since these sentences have high perplexity under a language model (LM), simple filtering can eliminate them from consideration. To evade perplexity-based filtering, we impose a soft constraint on collision generation and jointly maximize target similarity and LM likelihood:

    max_{c ∈ X} (1 − β) · S(x, c) + β · log P(c; g)        (6.3)

where P(c; g) is the LM likelihood for collision c under a pre-trained LM g and β ∈ [0, 1] is an interpolation coefficient.

We investigate two different approaches for solving the optimization in equation 6.3: (a) adding a regularization term on soft cˇ to approximate the LM likelihood, and (b) steering a pre-trained LM to generate natural-looking c.

6.2.3 Regularized Aggressive Collisions

Given a language model g, we can incorporate a soft version of the LM likelihood as a regularization term on the soft aggressive collision č computed from the variables [z_1, ..., z_T]:

    Ω = Σ_{t=1}^{T} H(č_t, P(w_t | č_{1:t−1}; g))        (6.4)

where H(·, ·) is cross entropy and P(w_t | č_{1:t−1}; g) are the next-token prediction probabilities at t given the partial soft collision č_{1:t−1}. Equation 6.4 relaxes the LM likelihood on hard collisions by using soft collisions as input, and can be added to the objective function for gradient optimization. After optimization, the variables z_t will favor words that maximize the LM likelihood.
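A sketch of equation 6.4, assuming next_token_probs already holds the language model's predictions P(w_t | č_{1:t−1}; g) for each position; how those are obtained from g is omitted here, and the toy inputs are illustrative.

    import torch

    def lm_regularizer(soft_c, next_token_probs, eps=1e-12):
        # Omega = sum_t H(soft word at t, LM next-token distribution at t),
        # a soft cross-entropy computed over the whole sequence.
        return -(soft_c * (next_token_probs + eps).log()).sum(dim=-1).sum()

    # Toy usage with random distributions of shape (T, |V|).
    T, V = 20, 30522
    soft_c = torch.softmax(torch.randn(T, V), dim=-1)
    lm_probs = torch.softmax(torch.randn(T, V), dim=-1)
    omega = lm_regularizer(soft_c, lm_probs)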

To further reduce the perplexity of c, we exploit the degeneration property of LMs, i.e., the observation that an LM assigns low perplexity to repeating common tokens [71], and constrain a span of consecutive tokens in c (e.g., the second half of c) to be selected from the most frequent English words instead of the entire vocabulary V. This modification produces even more disfluent collisions, but they evade LM-based filtering.

6.2.4 Natural Collisions

Our final approach aims to produce fluent, low-perplexity outputs. Instead of relaxing and then searching, we search and then relax at each step of equation 6.3. This lets us integrate a hard language model while selecting the next words in continuous space. At each step t, we maximize:

    max_{w ∈ V} (1 − β) · S(x, c_{1:t−1} ⊕ w) + β · log P(c_{1:t−1} ⊕ w; g)        (6.5)

where c_{1:t−1} is the beam solution found before t. This sequential optimization is essentially LM decoding with a joint search on the LM likelihood and the target similarity S of the collision prefix.

Optimizing equation 6.5 exactly requires ranking each w ∈ V based on the LM likelihood log P(c_{1:t−1} ⊕ w; g) and the similarity S(x, c_{1:t−1} ⊕ w). Evaluating the LM likelihood for every word at each step is efficient because we can cache log P(c_{1:t−1}; g) and compute the next-word probability in the standard manner.

However, evaluating an arbitrary similarity function S(x, c_{1:t−1} ⊕ w) for all w ∈ V requires |V| forward passes through f, which can be computationally expensive.

Perturbing LM logits. Inspired by Plug and Play LM [38], we modify the LM logits to take similarity into account. We first let ℓ_t = g(c_{1:t−1}) be the next-token logits produced by the LM g at step t. We then optimize from this initialization to find an update that favors words maximizing similarity. Specifically, we let z_t = ℓ_t + δ_t, where δ_t ∈ R^{|V|} is a perturbation vector. We then take a small number of gradient steps on the relaxed similarity objective max_{δ_t} S(x, c_{1:t−1} ⊕ č_t), where č_t is the relaxed soft word as in equation 6.1.

This encourages the next-word prediction distribution from the perturbed logits, č_t, to favor words that are likely to collide with the input x.
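A minimal sketch of this perturbation step; the toy similarity closure, tensor sizes, and step sizes are illustrative stand-ins rather than the actual implementation.

    import torch

    def perturb_logits(lm_logits, similarity_of_soft_word, steps=5, eta=0.001, tau=0.1):
        # Nudge the LM's next-token logits so the relaxed next word also
        # increases similarity to the target input.
        delta = torch.zeros_like(lm_logits, requires_grad=True)
        for _ in range(steps):
            soft_next = torch.softmax((lm_logits + delta) / tau, dim=-1)
            score = similarity_of_soft_word(soft_next)   # S(x, prefix ⊕ soft word)
            grad, = torch.autograd.grad(score, delta)
            delta = (delta + eta * grad).detach().requires_grad_(True)
        return lm_logits + delta

    # Toy similarity: dot product of the expected word embedding with a target vector.
    V, h = 50257, 768
    word_emb, target = torch.randn(V, h), torch.randn(h)
    perturbed = perturb_logits(torch.randn(V), lambda soft: (soft @ word_emb) @ target)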

Joint beam search. After the perturbation at each step t, we find the top K most likely words in č_t. This allows us to evaluate S(x, c_{1:t−1} ⊕ w) only for this subset of words w that are likely under the LM given the current beam context. We rank these top K words based on the interpolation of the target loss and the LM log likelihood. We assign a score to each beam b and each top-K word as in equation 6.5, and update the beams with the top-scored words.

This process leads to a natural-looking decoded sequence because each step utilizes the true words as input. As we build up a sequence, the search at each step is guided by the joint score of two objectives, semantic similarity and fluency.

6.3 Experiments

Baseline. We use a simple greedy baseline based on HotFlip [48]. We initialize the collision text with a sequence of repeating words, e.g., "the", and iteratively replace all words. In each iteration, we look at every position t and flip the current w_t to the v that maximizes the first-order Taylor approximation of the target similarity S:

    arg max_{1≤t≤T, v∈V} (e_v − e_t)^⊤ ∇_{e_t} S(x, c)        (6.6)

where e_t and e_v are the word vectors for w_t and v. Following prior HotFlip-based attacks [134, 176, 177], we evaluate S using the top K words from Equation 6.6 and flip to the word with the lowest loss to counter the local approximation.
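A sketch of this first-order ranking for a single position; the names are ours, e_t is the current word's embedding, and grad_t stands for ∇_{e_t} S(x, c).

    import torch

    def hotflip_candidates(embedding_matrix, e_t, grad_t, k=30):
        # First-order estimate of the similarity change when flipping the word
        # at position t to each vocabulary word v: (e_v - e_t)^T grad_t.
        scores = (embedding_matrix - e_t) @ grad_t
        return torch.topk(scores, k).indices   # top-k candidate replacements

    # Toy usage.
    V, h = 30522, 768
    emb = torch.randn(V, h)
    top = hotflip_candidates(emb, emb[17], torch.randn(h))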

LM for natural collisions. For generating natural collisions, we need an LM g that shares the vocabulary with the target model f. When targeting models that do not share the vocabulary with an available LM, we fine-tune another BERT with an autoregressive LM task on the Wikitext-103 dataset [133]. When targeting models based on RoBERTa, we use pretrained GPT-2 [152] as the LM since the vocabulary is shared.

Unrelatedness. To ensure that collisions c are not semantically similar to inputs x, we filter out words that are relevant to x from V when generating c. First, we discard the non-stop words in x; then, we discard the 500 to 2,000 words in V with the highest similarity score S(x, w).

            B     K     N     T      η       τ      β
MRPC
  Aggr.    10    30    30    20    0.001    1.0    0.0
  Aggr. Ω   5    15    30    30    0.001    1.0    0.8
  Nat.     10   128     5    25    0.001    0.1    0.05
QQP
  Aggr.    10    30    30    15    0.001    1.0    0.0
  Aggr. Ω   5    15    30    30    0.001    1.0    0.8
  Nat.     10    64     5    20    0.001    0.1    0.0
Core
  Aggr.     5    50    30    30    0.001    1.0    0.0
  Aggr. Ω   5    40    30    60    0.001    1.0    0.85
  Nat.     10   150     5    35    0.001    0.1    0.015
Chat
  Aggr.     5    30    30    15    0.001    1.0    0.0
  Aggr. Ω   5    20    30    25    0.001    1.0    0.8
  Nat.     10   128     5    20    0.001    0.1    0.15
CNNDM
  Aggr.     5    10    30    15    0.001    1.0    0.0
  Aggr. Ω   5    10    30    30    0.001    1.0    0.8
  Nat.      5    64     5    20    0.001    1.0    0.02

Table 6.2: Hyper-parameters for each experiment. B is the beam size for beam search. K is the number of top words evaluated at each optimization step. N is the number of optimization iterations. T is the sequence length. η is the step size for optimization. τ is the temperature for softmax. β is the interpolation parameter in equation 6.5.

Hyperparameters. We use Adam [90] for gradient ascent. We report the hyper-parameter values for our experiments in Table 6.2. The label-smoothing parameter ε for aggressive collisions is set to 0.1. The hyper-parameters for the baseline are the same as for aggressive collisions.

Notation. In the following sections, we abbreviate the HotFlip baseline as HF; aggressive collisions as Aggr.; regularized aggressive collisions as Aggr. Ω, where Ω is the regularization term in equation 6.4; and natural collisions as Nat.

6.3.1 Tasks and Models

We evaluate our attacks on paraphrase identification, document retrieval, response suggestion, and extractive summarization. Our models for these applications are pretrained transformers, including BERT [40] and RoBERTa [117], fine-tuned on the corresponding task datasets and matching state-of-the-art performance.

Paraphrase detection. We use the Microsoft Research Paraphrase Corpus (MRPC) [42] and Quora Question Pairs (QQP) [79], and attack the first 1,000 paraphrase pairs from the validation set.

We target the BERT and RoBERTa base models for MRPC and QQP, respectively. The models take the concatenated inputs x_a, x_b and output the similarity score as S(x_a, x_b) = sigmoid(f(x_a ⊕ x_b)). We fine-tune them with the suggested hyper-parameters. BERT achieves an 87.51% F1 score on MRPC and RoBERTa achieves 91.6% accuracy on QQP, consistent with prior work.

Document retrieval. We use the Common Core Tracks from 2017 and 2018 (Core17/18). They have 50 topics as queries and use articles from the New York Times Annotated Corpus and the TREC Washington Post Corpus, respectively.

Our target model is Birch [185, 186]. Birch retrieves 1,000 candidate documents using the BM25 and RM3 baseline [3] and re-ranks them using the similarity scores from a fine-tuned BERT model. Given a query x_q and a document x_d, the BERT model assigns a similarity score S(x_q, x_i) to each sentence x_i in x_d. The final score used by Birch for re-ranking is γ · S_BM25 + (1 − γ) · Σ_i κ_i · S(x_q, x_i), where S_BM25 is the baseline BM25 score and γ, κ_i are weight coefficients. We use the published models³ and coefficient values for evaluation.
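A sketch of this interpolation; the weights, the sentence scores, and the choice to sum over the highest-scoring sentences (following Birch's design) are placeholders, and the published coefficient values are not reproduced here.

    def birch_score(bm25_score, sentence_scores, gamma, kappas):
        # Interpolate the document's BM25 score with the strongest
        # BERT sentence scores S(x_q, x_i), weighted by kappa_i.
        top = sorted(sentence_scores, reverse=True)[:len(kappas)]
        return gamma * bm25_score + (1 - gamma) * sum(k * s for k, s in zip(kappas, top))

    # Toy usage: the three strongest sentences contribute to the re-ranking score.
    print(birch_score(12.3, [0.91, 0.15, 0.87, 0.42], gamma=0.5, kappas=[1.0, 0.5, 0.25]))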

We attack the similarity scores S(x_q, x_i) by inserting sentences that collide with x_q into irrelevant x_d. We filter out query words when generating collisions c so that the term frequencies of query words in c are 0; thus, inserting collisions does not affect the original S_BM25. For each of the 50 query topics, we select irrelevant articles that are ranked from 900 to 1,000 by Birch and insert our collisions into these articles to boost their ranks.

Response suggestion. We use the Persona-chat (Chat) dataset of dialogues [194]. The task is to pick the correct utterance in each dialogue context from 20 choices. We attack the first 1,000 contexts from the validation set.

We use transformer-based Bi- and Poly-encoders that achieved state-of-the-art results on this dataset [76]. Bi-encoders compute a similarity score for the dialogue context x_a and each possible next utterance x_b as S(x_a, x_b) = f_pool(x_a)^⊤ f_pool(x_b), where f_pool(x) ∈ R^h is the pooling-over-time representation from the transformer. Poly-encoders extend Bi-encoders and compute S(x_a, x_b) = Σ_{i=1}^{T} α_i · f(x_a)_i^⊤ f_pool(x_b), where α_i is the weight from attention and f(x_a)_i is the ith token's contextualized representation. We use the published models⁴ for evaluation.

³https://github.com/castorini/birch
⁴https://parl.ai/docs/zoo.html
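The two scoring functions, written out as stated above; this is a simplified sketch (in the actual Poly-encoder the attention weights are computed from the candidate), and the tensor shapes are illustrative.

    import torch

    def bi_encoder_score(ctx_pooled, cand_pooled):
        # S(x_a, x_b) = f_pool(x_a)^T f_pool(x_b)
        return ctx_pooled @ cand_pooled

    def poly_encoder_score(ctx_tokens, attn_weights, cand_pooled):
        # S(x_a, x_b) = sum_i alpha_i * f(x_a)_i^T f_pool(x_b)
        return (attn_weights * (ctx_tokens @ cand_pooled)).sum()

    # Toy usage: T context tokens with h-dimensional representations.
    T, h = 12, 768
    ctx_tokens, cand = torch.randn(T, h), torch.randn(h)
    alpha = torch.softmax(torch.randn(T), dim=0)
    print(bi_encoder_score(ctx_tokens.mean(dim=0), cand),
          poly_encoder_score(ctx_tokens, alpha, cand))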

c type      MRPC               QQP                Core17/18
            S      % Succ      S      % Succ      S       r ≤ 10    r ≤ 100
Gold        0.87   -           0.90   -            1.34   -         -
HF          0.60   67.3%       0.55   54.8%       -0.96   0.0%      16.5%
Aggr.       0.93   97.8%       0.98   97.3%        1.62   49.9%     86.7%
Aggr. Ω     0.69   81.0%       0.91   91.1%        0.86   20.6%     69.7%
Nat.        0.78   98.6%       0.88   88.8%        0.77   12.3%     60.6%

c type      Chat-Bi            Chat-Poly          CNNDM
            S      r = 1       S      r = 1       S       r = 1     r ≤ 3
Gold        17.14  -           25.30  -           0.51    -         -
HF          21.20  78.5%       28.82  73.1%       0.50    67.9%     96.5%
Aggr.       23.79  99.8%       31.94  99.4%       0.69    99.4%     100.0%
Aggr. Ω     21.66  92.9%       29.51  90.7%       0.58    90.7%     100.0%
Nat.        22.15  86.0%       31.10  86.6%       0.37    30.4%     77.7%

Table 6.3: Attack results. r is the rank of collisions among candidates. Gold denotes the ground truth.

Extractive summarization. We use the CNN / DailyMail (CNNDM) dataset [68], which consists of news articles and labeled overview highlights. We attack the first 1,000 articles from the validation set.

Our target model is PreSumm [116]. Given a text x_d, PreSumm first obtains a vector representation φ_i ∈ R^h for each sentence x_i using BERT, and scores each sentence x_i in the text as S(x_d, x_i) = sigmoid(u^⊤ f(φ_1, ..., φ_T)_i), where u is a weight vector, f is a sentence-level transformer, and f(·)_i is the ith sentence's contextualized representation. Our objective is to insert a collision c into x_d such that the rank of S(x_d, c) among all sentences is high. We use the published models⁵ for evaluation.

⁵https://github.com/nlpyang/PreSumm
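A sketch of this scoring step, with the sentence-level transformer abstracted as a callable; the names and shapes are ours, not PreSumm's code.

    import torch

    def extractive_scores(sentence_vectors, sentence_transformer, u):
        # phi_1..phi_T -> contextualized sentence representations f(phi)_i,
        # then S(x_d, x_i) = sigmoid(u^T f(phi)_i) for each sentence.
        contextualized = sentence_transformer(sentence_vectors)   # (T, h)
        return torch.sigmoid(contextualized @ u)

    # Toy usage with an identity "transformer" standing in for f.
    T, h = 8, 768
    scores = extractive_scores(torch.randn(T, h), lambda v: v, torch.randn(h))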

6.3.2 Attack Results

For all attacks, we report the similarity score S between x and c; the "gold" baseline is the similarity between x and the ground truth. For MRPC, QQP, Chat, and CNNDM, the ground truth is the annotated label sentences (e.g., paraphrases or summaries); for Core17/18, we use the sentences with the highest similarity S to the query. For MRPC and QQP, we also report the percentage of successful collisions with S > 0.5. For Core17/18, we report the percentage of irrelevant articles ranking in the top 10 and top 100 after inserting collisions. For Chat, we report the percentage of collisions achieving the top-1 rank. For CNNDM, we report the percentage of collisions with top-1 and top-3 ranks (likely to be selected as the summary). Table 6.3 shows the results.

On MRPC, aggressive and natural collisions achieve around 98% success; aggressive ones have higher similarity S. With regularization Ω, the success rate drops to 81%. On QQP, aggressive collisions achieve 97% vs. 90% for constrained collisions.

On Core17/18, aggressive collisions shift the rank of almost half of the irrelevant articles into the top 10. Regularized and natural collisions are less effective, but more than 60% are still ranked in the top 100. Note that query topics are compact phrases with narrow semantics; thus it might be harder to find constrained collisions for them.

On Chat, aggressive collisions achieve rank of 1 more than 99% of the time for both Bi- and Poly-encoders. With regularization Ω, success drops slightly to above 90%. Natural collisions are less successful, with 86% ranked as 1.

On CNNDM, aggressive collisions are almost always ranked as the top summarizing sentence. HotFlip and regularized collisions are in the top 3 more than 96% of the time. Natural collisions perform worse, with 77% ranked in the top 3.

c type      MRPC      QQP       Core      Chat      CNNDM
            FBERT     FBERT     PBERT     PBERT     FBERT
Gold         0.66      0.68      0.17      0.14      0.38
Aggr.       -0.22     -0.17     -0.34     -0.31     -0.31
Aggr. Ω     -0.34     -0.34     -0.48     -0.43     -0.36
Nat.        -0.12     -0.09     -0.11     -0.10     -0.25

Table 6.4: BERTSCORE between collisions and target inputs. Gold denotes the ground truth.

Aggressive collisions always beat HotFlip on all tasks; constrained collisions are often better, too. The similarity scores S for aggressive collisions are always higher than for the ground truth.

6.3.3 Evaluating Unrelatedness

We use BERTSCORE [195] to demonstrate that our collisions are unrelated to the target inputs. Instead of exact matches in raw texts, BERTSCORE computes a semantic similarity score, ranging from -1 to 1, between a candidate and a reference by using contextualized representations for each token in the candidate and reference.
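For reference, the publicly available bert-score package computes these scores as follows; this usage sketch is our assumption about tooling, not necessarily the exact evaluation script used, and the example strings are drawn from Table 6.1.

    from bert_score import score

    candidates = ["he might actually work when those in"]   # a natural collision
    references = ["Does cannabis oil cure cancer? Or are the sellers hoaxing?"]
    P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
    print(F1.item())   # low or negative values indicate unrelated texts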

The baseline for comparison is the BERTSCORE between the target input and the ground truth. For MRPC and QQP, we use x as the reference; the ground truth is the given paraphrases. For Core17/18, we use x concatenated with the top sentences, except the one with the highest S, as the reference; the ground truth is the sentence in the corpus with the highest S. For Chat, we use the dialogue contexts as the reference and the labeled response as the ground truth. For CNNDM, we use the labeled summarizing sentences in articles as the reference and the given abstractive summarization as the ground truth.

c type      MRPC                 Chat
            BERT     RoBERTa     Bi→Poly    Poly→Bi
HF          34.0%    0.0%        55.3%      48.9%
Aggr.       64.5%    0.0%        77.4%      71.3%
Aggr. Ω     38.9%    0.0%        60.5%      56.0%
Nat.        41.4%    0.0%        71.4%      68.2%

Table 6.5: Percentage of successfully transferred collisions for MRPC and Chat.

For MRPC, QQP, and CNNDM, we report the FBERT (F1) score. For Core17/18 and Chat, we report PBERT (content from the reference found in the candidate) because the references are longer and not token-wise equivalent to the collisions or the ground truth. Table 6.4 shows the results. The scores for collisions are all negative while the scores for the ground truth are positive, indicating that our collisions are unrelated to the target inputs. Since aggressive and regularized collisions are nonsensical, their contextualized representations are less similar to the reference texts than those of natural collisions.

6.3.4 Transferability of Collisions

To evaluate whether collisions generated for one target model f are effective against a different model f′, we use the MRPC and Chat datasets. For MRPC, we set f′ to a BERT base model trained with a different random seed and to a RoBERTa model. For Chat, we use the Poly-encoder as f′ for the Bi-encoder f, and vice versa. Both the Poly-encoder and the Bi-encoder are fine-tuned from the same pretrained transformer model. We report the percentage of successfully transferred attacks, e.g., S(x, c) > 0.5 for MRPC and r = 1 for Chat.

[Figure 6.1: one histogram per task (MRPC, QQP, Core17/18, Chat, CNNDM), comparing real data, aggressive collisions, aggressive collisions with Ω, and natural collisions.]

Figure 6.1: Histograms of entropy (log perplexity) evaluated by GPT-2 on real data and collisions.

Table 6.5 summarizes the results. All collisions achieve some transferability (40% to 70%) if the model architecture is the same and f, f′ are fine-tuned from the same pretrained model. Furthermore, our attacks produce more transferable collisions than the HotFlip baseline. No attacks transfer if f and f′ are fine-tuned from different pretrained models (BERT and RoBERTa). We leave a study of the transferability of collisions across different types of pretrained models to future work.

6.4 Mitigation

Perplexity-based filtering. Because our collisions are synthetic rather than human-generated texts, it is possible that their perplexity under a language model (LM) is higher than that of real text. Therefore, one plausible mitigation is to filter out collisions by setting a threshold on LM perplexity.
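A minimal sketch of such a filter using GPT-2 from the Hugging Face Transformers library; the library choice and the threshold value are our assumptions, and in practice the threshold would be tuned on held-out data.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def log_perplexity(text):
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, labels=ids)       # loss is the mean token NLL
        return out.loss.item()

    THRESHOLD = 6.0                            # placeholder value
    def looks_like_collision(text):
        return log_perplexity(text) > THRESHOLD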

c type      MRPC                QQP                 Core17/18
            FP@90    FP@80      FP@90    FP@80      FP@90    FP@80
HF           2.1%     0.8%       3.1%     1.2%       4.6%     1.2%
Aggr.        0.0%     0.0%       0.0%     0.0%       0.8%     0.7%
Aggr. Ω     47.5%    35.6%      15.8%    11.9%      29.3%    17.8%
Nat.        94.9%    89.2%      20.5%    12.1%      13.7%    10.9%

c type      Chat                CNNDM
            FP@90    FP@80      FP@90    FP@80
HF           1.5%     0.8%       3.2%     3.1%
Aggr.        5.2%     2.6%       3.1%     3.1%
Aggr. Ω     76.5%    65.3%      52.8%    35.7%
Nat.        93.8%    86.5%      59.8%    37.7%

Table 6.6: Effectiveness of perplexity-based filtering. FP@90 and FP@80 are false positive rates (percentage of real data mistakenly filtered out) at thresholds that filter out 90% and 80% of collisions, respectively.

Figure 6.1 shows perplexity measured using GPT-2 [152] for real data and collisions for each of our attacks. We observe a gap between the distributions of real data and aggressive collisions, showing that it might be possible to find a threshold that discards aggressive collisions while retaining the bulk of the real data. On the other hand, constrained collisions (regularized or natural) overlap with the real data.

We quantitatively measure the effectiveness of perplexity-based filtering using thresholds that would discard 80% and 90% of collisions, respectively. Table 6.6 shows the false positive rate, i.e., the fraction of the real data that would be mistakenly filtered out. Both HotFlip and aggressive collisions can be filtered out with little to no false positives since both are nonsensical. For regularized or natural collisions, a substantial fraction of the real data would be lost, while 10% or 20% of collisions evade filtering. On MRPC and Chat, perplexity-based filtering is least effective, discarding around 85% to 90% of the real data.

Learning-based filtering. Recent works explored automatic detection of generated texts using a binary classifier trained on human-written and machine-generated data [77, 188]. These classifiers might be able to filter out our collisions, assuming that the adversary is not aware of the defense.

As a general evaluation principle [26], any defense mechanism should assume that the adversary has complete knowledge of how the defense works. In our case, a stronger adversary may use the detection model to craft collisions to evade the filtering. We leave a thorough evaluation of these defenses to future work.

Adversarial training. Including adversarial examples during training can be effective against inference-time attacks [128]. Similarly, training with collisions might increase models’ robustness against collisions. Generating collisions for each training example in each epoch can be very inefficient, however, because it requires additional search on top of gradient optimization. We leave adversarial training to future work.

6.5 Related Work

Adversarial examples in NLP. Most of the previously studied adversarial attacks in NLP aim to minimally modify or perturb inputs while changing the model's output. [73] showed that perturbations, such as inserting dots or spaces between characters, can deceive a toxic comment classifier. HotFlip used gradients to find such perturbations given white-box access to the target model [48].

[176] extended HotFlip by inserting a short crafted "trigger" text into any input as a perturbation; the trigger words are often highly associated with the target class label. Other approaches are based on rules, heuristics, or generative models [80, 129, 156, 197]. As explained in Section 6.1, our goal is the inverse of adversarial examples: we aim to generate inputs with drastically different semantics that are perceived as similar by the model.

Several works studied attacks that change the semantics of inputs. [83] showed that inserting a heuristically crafted sentence into a paragraph can trick a question answering (QA) system into picking the answer from the inserted sentence. Aggressively perturbed texts based on HotFlip are nonsensical and can be translated into meaningful and malicious outputs by black-box translation systems [177]. Our semantic collisions extend the idea of changing input semantics to a different class of NLP models; we design new gradient-based approaches that are not perturbation-based and are more effective than HotFlip attacks; and, in addition to nonsensical adversarial texts, we show how to generate "natural" collisions that evade perplexity-based defenses.

Feature collisions in computer vision. Feature collisions have been studied in image analysis models. [81] showed that images from different classes can end up with identical representations due to the excessive invariance of deep models. An adversary can modify the input to change its class while leaving the model's prediction unaffected [82]. An intrinsic property of the rectifier activation function can cause images with different labels to have the same feature vectors [110].

6.6 Conclusion

We demonstrated a new class of vulnerabilities in NLP applications: semantic collisions, i.e., input pairs that are unrelated to each other but perceived by the application as semantically similar. We developed gradient-based search algorithms for generating collisions and showed how to incorporate constraints that help generate more "natural" collisions. We evaluated the effectiveness of our attacks on state-of-the-art models for paraphrase identification, document and sentence retrieval, and extractive summarization. We also demonstrated that simple perplexity-based filtering is not sufficient to mitigate our attacks, motivating future research on more effective defenses.

CHAPTER 7
CONCLUSION

In this dissertation, we introduced new threats to the security and privacy of machine learning systems in different contexts. We first described a malicious ML provider who supplies training code to force an ML model into intentionally "memorizing" sensitive training data, and later extracts the memorized information from the model's parameters or predictions. To help enforce data-protection regulations in practice, we then designed practical auditing techniques based on membership inference for detecting unauthorized data collection. We next presented the overlearning phenomenon: deep representations learned for simple objectives are useful for inferring sensitive and uncorrelated information. We found that overlearning might be an intrinsic issue, which not only leads to privacy leakage but also raises a challenge for regulations that try to control the purpose of ML. Finally, we identified a new class of vulnerabilities in natural language processing models for measuring semantic similarity, and demonstrated attacks that generate semantically unrelated texts that are judged as relevant by these models.

The works presented in this dissertation are by no means a comprehensive coverage of all possible threats to ML systems. Many other attack vectors arise as ML evolves. As ML becomes a commodity in which the training data, algorithms, and even model predictions can be expensive, adversaries have an incentive to steal a deployed ML model for their own use without the additional cost of data collection or training [98, 170]. There are also threats to ML models' availability, where adversaries craft inputs that slow down the model's decision-making process and increase the energy consumption of ML predictions [162]. In addition, there are other compliance issues, e.g., how do ML models comply with the "right to be forgotten" under Article 17 of GDPR [172]? Removing users' information from a trained ML model requires careful implementation and is still an active area of research [22, 25, 60].

We view the outcome of this dissertation as a supportive step toward building secure and private ML systems. The techniques we built are not only for breaking ML but also for detecting and measuring potential threats. These threat measurements complement conventional ML performance metrics (e.g., accuracy) and help ML practitioners better understand where their ML models could go wrong. We also hope that this dissertation provides insights for building better defense mechanisms, and we advocate that the ML community measure all aspects of its systems instead of relying on predictive power and accuracy metrics alone when applying ML in practice.

APPENDIX A
APPENDIX FOR CHAPTER 6

Tables A.1, A.2, A.3, and A.4 show additional collision examples for MRPC/QQP, Core17/18, Chat, and CNNDM, respectively.

MRPC/QQP target inputs and collisions (with model outputs):

MRPC input (x): PCCW 's chief operating officer, Mike Butcher, and Alex Arena, the chief financial officer, will report directly to Mr So.
  Aggressive (c) [output 99.5%]: primera metaphysical declaration dung southernmost among structurally favorably endeavor from superior morphology indirectly materialized yesterday sorority would indirectly sg h
  Regularized aggressive (c) [output 81.6%]: in one time rave rave — in . . . " in but . . . rv rv smacked a a of a a a a a a a a a of a a
  Natural (c) [output 81.7%]: in 1989 and joined the new york giants in 1990

MRPC input (x): Under terms of the deal, Legato stockholders will receive 0.9 of a share of EMC common stock for each share of Legato stock.
  Aggressive (c) [output 96.7%]: moreover author elk telling assert honest exact inventions locally mythical confirms newer feat said assert according locally prefecture municipal realization
  Regularized aggressive (c) [output 95.0%]: in new " news lust release " on connected different " vibe " reassure females and and to to and and to and to and to and to and to
  Natural (c) [output 83.4%]: she is also a member of the united states house of representatives, serving as a representative

QQP input (x): How can I slowly lose weight?
  Aggressive (c) [output 80.5%]: sustain fitness recover bru become bolst Enhanced additional distinguished contend crunch Cutting Vital Time cov
  Regularized aggressive (c) [output 85.2%]: fat Ensure burner www Enhancement Lar Cure Dou St Reaper of of of of of a to and to the the the and to to to of of a of
  Natural (c) [output 80.2%]: be able that in less long time it

Table A.1: Collision examples for MRPC and QQP. Outputs are the probability scores produced by the model for whether the input and the collisions are paraphrases.

Core17/18 query inputs and collisions (r is the resulting rank):

Query (x): abuses of e-mail
  Aggressive (c) [r = 1]: trailing helsinki, competent regimes internally outlaw wireless offence road : cables by nhs sided head lockheed ford announce oblast million offenders climb ranged postal courier administrations courtesy guangdong oracle
  Regularized aggressive (c) [r = 1]: un / australia overthrow " — of most telegraph telegraph operations " : the state office in consensus in document lifts down us " by trial " for using ¡ the a and a to and a and a to the a to a a to to a a and a a and a a a the a to to
  Natural (c) [r = 3]: the itc ordered all wireless posts confiscated and usps were stripped of their offices and property, leading to a number of

Query (x): heroic acts
  Aggressive (c) [r = 1]: colossal helmet vedic bro axes resembling neighbours lead floods blacksmith : evening eligibility caller indicates sculptor coroner lakshmi' than lama announced seizure branded, crafts informing nottinghamshire watch commission.
  Regularized aggressive (c) [r = 1]: recorded health and human execution followed, applause prompted, support increased extended : thayer and some there danger, while frank teammate followed feat of personal injury injuries of a the a of the a of the the of of the and of of of of and of of of of and of and of of of the
  Natural (c) [r = 11]: the american fighter ( 1 november 1863 ; kia for his feat ) — the japanese ship carrying the cargo of wheat from australia to sydney

Query (x): cult lifestyles
  Aggressive (c) [r = 1]: indiana - semiconductor cut and radiating fire damage, domain reproductive nighttime pastoral calendar failing critical soils indicates force practice ritual belarus stall ; cochin sabha fragmented nut dominance owing
  Regularized aggressive (c) [r = 2]: preferred fruits, as willow, suggested to botanist ro spike'for resident nursery : big spreads of pipe rolls and other european pie, a long season at the a and a a and the and of of and of the a of and of of and of of and of of of of and of of the
  Natural (c) [r = 15]: the early 1980s their appeal soared : during in los angeles ( 1993 ), a large number of teenagers went to church to confess their connection to the

Query (x): art, stolen, forged
  Aggressive (c) [r = 1]: colossal helmet vedic bro axes resembling neighbours lead floods blacksmith : evening eligibility caller indicates sculptor coroner lakshmi' than lama announced seizure branded, crafts informing nottinghamshire watch commission
  Regularized aggressive (c) [r = 3]: - house and later car dead with prosecutors remaining : " and cathedral gallery ' import found won british arrest prosecution a a portrait or mural ( patron at from the the to the a and a to the a and to the a to the of a and to the the and to the to the a and a
  Natural (c) [r = 8]: the work which left its owner by a mishandle - the royal academy's chief judge inquest

Table A.2: Collision examples for Core17/18. r are the ranks of irrelevant articles after inserting the collisions.

Chat target inputs and collisions (r is the rank of the collision among candidate responses):

Context (x): i'm 33 and love giving back i grew up poor. i did too , back during the great depression.
  Aggressive (c) [r = 1]: that to existed with and that is with cope warlord s s came the on
  Regularized aggressive (c) [r = 1]: camps wii also until neutral in later addiction and the the the the of to and the the the of to and to the the
  Natural (c) [r = 1]: was the same side of abject warfare that had followed then for most people in this long

Context (x): i am a male . i have a children and a dogs . hey there how is it going ?
  Aggressive (c) [r = 1]: is frantically in to it programs s junior falls of it s talking a juan
  Regularized aggressive (c) [r = 1]: in on from the it department with gabrielle and the the and a and a a to a a and of and of and of
  Natural (c) [r = 1]: as of this point, and in the meantime it's having very technical support : it employs

Context (x): hi ! how are you doing today ? great , just ate pizza my favorite . . and you ? that's not as good as shawarma
  Aggressive (c) [r = 1]: safer to eat that and was mickey in a cut too on it s foreigner
  Regularized aggressive (c) [r = 1]: dipped in in kai tak instead of that and the the a of a of a to to the to and a a of a
  Natural (c) [r = 1]: not as impressive, its artistic production provided an environment

Table A.3: Collision examples for Chat. r are the ranks of collisions among the candidate responses.

CNNDM ground truth and collisions (r is the rank of the collision among all sentences):

Truth: zayn malik is leaving one direction . rumors about such a move had started since malik left the band 's tour last week .
  Aggressive (c) [r = 1]: bp interest yd offering funded fit literacy 2020 can propose amir pau laureate conservation
  Regularized aggressive (c) [r = 1]: the are shortlisted to compete 14 times zealand in in the 2015 zealand artist yo a to to to to to to to to to to to to to to
  Natural (c) [r = 1]: an estimated $2 billion by 2014 ; however estimates suggest only around 20 percent are being funded from

Truth: she says sometimes his attacks are so violent, she's had to call the police to come and save her.
  Aggressive (c) [r = 1]: bwf special editor councils want qc iec melinda rey marry selma iec qc disease translated
  Regularized aggressive (c) [r = 1]: poll is in 2012 eight percent b dj dj dj coco behaviors in dj coco and a a to of to to to the a a to the to a
  Natural (c) [r = 1]: first national strike since world war ii occurred between january 13 – 15 2014 ; this date will occur

Table A.4: Collision examples for CNNDM. Truth are the true summarizing sentences. r are the ranks of collisions among all sentences in the news articles.

137 BIBLIOGRAPHY

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis,

Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learn- ing. In OSDI, 2016.

[2] Martín Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya

Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In CCS, 2016.

[3] Nasreen Abdul-jaleel, James Allan, W Bruce Croft, O Diaz, Leah Larkey, Xiaoyan Li, Mark D Smucker, and Courtney Wade. UMass at TREC 2004:

Novelty and HARD. In TREC, 2004.

[4] Philip Adler, Casey Falk, Sorelle A Friedler, Tionney Nix, Gabriel Ry- beck, Carlos Scheidegger, Brandon Smith, and Suresh Venkatasubrama- nian. Auditing black-box models for indirect influence. KAIS, 54(1):95–

122, 2018.

[5] Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. In ICLR, 2017.

[6] Algorithmia. https://algorithmia.com, 2017.

[7] Amazon Alexa. https://developer.amazon.com/en-US/alexa/, 2020.

[8] Amazon Machine Learning. https://aws.amazon.com/

machine-learning, 2017.

[9] Apple Siri. https://www.apple.com/siri/, 2020.

[10] Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep

networks. In ICML, 2017.

[11] Giuseppe Ateniese, Luigi V Mancini, Angelo Spognardi, Antonio Villani, Domenico Vitali, and Giovanni Felici. Hacking smart machines with smarter ones: How to extract meaningful data from machine learning

classifiers. IJSN, 10(3):137–150, 2015.

[12] Hossein Azizpour, Ali Sharif Razavian, Josephine Sullivan, Atsuto Maki, and Stefan Carlsson. From generic to specific deep representations for visual recognition. In CVPR Workshops, 2015.

[13] Michael Backes, Pascal Berrang, Mathias Humbert, and Praveen Manoha-

ran. Membership privacy in MicroRNA-based studies. In CCS, 2016.

[14] Andrew Baumann, Marcus Peinado, and Galen Hunt. Shielding applica- tions from an untrusted cloud with haven. TOCS, 33(3):8, 2015.

[15] BBC. Google DeepMind NHS app test broke UK privacy law. https: //www.bbc.com/news/technology-40483202, 2017.

[16] Yonatan Belinkov and Yonatan Bisk. Synthetic and natural noise both

break neural machine translation. In ICLR, 2018.

[17] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. PAMI, 2013.

[18] Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning attacks against support vector machines. In ICML, 2012.

[19] Dan Bogdanov, Margus Niitsoo, Tomas Toft, and Jan Willemson. High-performance secure multi-party computation for data mining applications. IJIS, 11(6):403–418, 2012.

[20] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard

Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv:1604.07316, 2016.

[21] Raphael Bost, Raluca Ada Popa, Stephen Tu, and Shafi Goldwasser. Ma-

chine learning classification over encrypted data. In NDSS, 2015.

[22] Lucas Bourtoule, Varun Chandrasekaran, Christopher Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Paper- not. Machine unlearning. In S& P, 2021.

[23] Michael Brennan, Sadia Afroz, and Rachel Greenstadt. Adversarial sty-

lometry: Circumventing authorship recognition to preserve privacy and anonymity. TISSEC, 15(3):12, 2012.

[24] Cristian Bucila,˘ Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In KDD, 2006.

[25] Yinzhi Cao and Junfeng Yang. Towards making systems forget with ma-

chine unlearning. In S& P, 2015.

[26] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. arXiv preprint

arXiv:1902.06705, 2019.

[27] Nicholas Carlini, Chang Liu, Jernej Kos, Úlfar Erlingsson, and Dawn Song. The Secret Sharer: Measuring unintended neural network memorization & extracting secrets. arXiv:1802.08232, 2018.

[28] Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud.

Isolating sources of disentanglement in variational autoencoders. In NeurIPS, 2018.

[29] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Q Weinberger, and Yixin Chen. Compressing convolutional neural networks in the frequency

domain. In KDD, 2016.

[30] Jianfeng Chi, Emmanuel Owusu, Xuwang Yin, Tong Yu, William Chan, Patrick Tague, and Yuan Tian. Privacy partitioning: Protecting user data during the deep learning inference phase. arXiv:1812.02863, 2018.

[31] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bah-

danau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical ma- chine translation. In EMNLP, 2014.

[32] Chris Clifton, Murat Kantarcioglu, Jaideep Vaidya, Xiaodong Lin, and

Michael Y Zhu. Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations Newsletter, 4(2):28–34, 2002.

[33] Maximin Coavoux, Shashi Narayan, and Shay B. Cohen. Privacy- preserving neural representations of text. In EMNLP, 2018.

[34] Cortana - Your personal productivity assistant. https://www.

microsoft.com/en-us/cortana/, 2020.

141 [35] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. In RecSys, 2016.

[36] Nilesh Dalvi, Pedro Domingos, Mausam, Sumit Sanghai, and Deepak Verma. Adversarial classification. In KDD, 2004.

[37] Cristian Danescu-Niculescu-Mizil and Lillian Lee. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Workshop on Cognitive Modeling and Computational Linguistics, ACL, 2011.

[38] Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: a simple approach to controlled text generation. In ICLR, 2020.

[39] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[40] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.

[41] Tien Tuan Anh Dinh, Prateek Saxena, Ee-Chien Chang, Beng Chin Ooi, and Chunwang Zhang. M2R: Enabling stronger privacy in MapReduce computation. In USENIX Security, 2015.

[42] William B Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In International Workshop on Paraphrasing, 2005.

[43] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In NeurIPS, 2016.

[44] Wenliang Du, Yunghsiang S Han, and Shigang Chen. Privacy-preserving multivariate statistical analysis: Linear regression and classification. In ICDM, 2004.

[45] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12(Jul):2121–2159, 2011.

[46] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.

[47] Cynthia Dwork, Adam Smith, Thomas Steinke, Jonathan Ullman, and Salil Vadhan. Robust traceability from trace amounts. In FOCS, 2015.

[48] Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. HotFlip: White-box adversarial examples for text classification. In ACL, 2018.

[49] Harrison Edwards and Amos J. Storkey. Censoring representations with an adversary. In ICLR, 2016.

[50] Yanai Elazar and Yoav Goldberg. Adversarial removal of demographic attributes from text data. In EMNLP, 2018.

[51] FaceScrub. http://vintage.winklerbros.net/facescrub.html, 2014.

[52] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. JMLR, 9(Aug):1871–1874, 2008.

[53] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In CCS, 2015.

[54] Matthew Fredrikson, Eric Lantz, Somesh Jha, Simon Lin, David Page, and Thomas Ristenpart. Privacy in pharmacogenetics: An end-to-end case study of personalized Warfarin dosing. In USENIX Security, 2014.

[55] Carlos A Gomez-Uribe and Neil Hunt. The Netflix recommender system: Algorithms, business value, and innovation. TMIS, 2015.

[56] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.

[57] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.

[58] Google Cloud Prediction API, 2017.

[59] John Graham-Cumming. How to beat an adaptive spam filter. In MIT Spam Conference, 2004.

[60] Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens van der Maaten. Certified data removal from machine learning models. In ICML, 2020.

[61] Jihun Hamm. Minimax filter: Learning to preserve privacy from inference attacks. JMLR, 18(129):1–31, 2017.

[62] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016.

[63] Andrew Hard, Kanishka Rao, Rajiv Mathews, Swaroop Ramaswamy, Françoise Beaufays, Sean Augenstein, Hubert Eichner, Chloé Kiddon, and Daniel Ramage. Federated learning for mobile keyboard prediction. arXiv:1811.03604, 2018.

[64] Jamie Hayes, Luca Melis, George Danezis, and Emiliano De Cristofaro. LOGAN: Membership inference attacks against generative models. In PETS, 2019.

[65] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In CVPR, 2015.

[66] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[67] Heritage health prize. https://www.kaggle.com/c/hhp, 2012.

[68] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In NeurIPS, 2015.

[69] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.

[70] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[71] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In ICLR, 2020.

[72] Nils Homer, Szabolcs Szelinger, Margot Redman, David Duggan, Waibhav Tembe, Jill Muehling, John V. Pearson, Dietrich A. Stephan, Stanley F. Nelson, and David W. Craig. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLOS Genetics, 2008.

[73] Hossein Hosseini, Sreeram Kannan, Baosen Zhang, and Radha Poovendran. Deceiving Google's Perspective API built for detecting toxic comments. arXiv:1702.08138, 2017.

[74] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.

[75] Minyoung Huh, Pulkit Agrawal, and Alexei A Efros. What makes ImageNet good for transfer learning? arXiv:1608.08614, 2016.

[76] Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. In ICLR, 2020.

[77] Daphne Ippolito, Daniel Duckworth, Douglas Eck, and Chris Callison-Burch. Automatic detection of generated text is easiest when humans are fooled. In ACL, 2020.

[78] Yusuke Iwasawa, Kotaro Nakayama, Ikuko Yairi, and Yutaka Matsuo. Privacy issues regarding the application of DNNs to activity-recognition using wearables and its countermeasures by use of adversarial training. In IJCAI, 2016.

[79] Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. First Quora dataset release: Question pairs. https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs, 2017.

[80] Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. Adversarial example generation with syntactically controlled paraphrase networks. In NAACL, 2018.

[81] Joern-Henrik Jacobsen, Jens Behrmann, Richard Zemel, and Matthias Bethge. Excessive invariance causes adversarial vulnerability. In ICLR, 2019.

[82] Jörn-Henrik Jacobsen, Jens Behrmann, Nicholas Carlini, Florian Tramèr, and Nicolas Papernot. Exploiting excessive invariance caused by norm-bounded adversarial robustness. arXiv:1903.10484, 2019.

[83] Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. In EMNLP, 2017.

[84] Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor Mudge, Jason Mars, and Lingjia Tang. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. In ASPLOS, 2017.

[85] Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, Laszlo Lukacs, Marina Ganea, Peter Young, et al. Smart Reply: Automated response suggestion for email. In KDD, 2016.

[86] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). arXiv:1711.11279, 2017.

[87] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In ICML, 2018.

[88] Yoon Kim. Convolutional neural networks for sentence classification. In EMNLP, 2014.

[89] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.

[90] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[91] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv:1312.6114, 2013.

[92] Marius Kloft and Pavel Laskov. Online anomaly detection under adversarial impact. In AISTATS, 2010.

[93] Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, 2005.

[94] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In ICML, 2017.

[95] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In ICML, 2019.

[96] Satwik Kottur, Xiaoyu Wang, and Vitor R Carvalho. Exploring personalized neural conversational models. In IJCAI, 2017.

[97] Hugo Krawczyk, Ran Canetti, and Mihir Bellare. HMAC: Keyed-hashing for message authentication. https://tools.ietf.org/html/rfc2104, 1997.

[98] Kalpesh Krishna, Gaurav Singh Tomar, Ankur P Parikh, Nicolas Papernot, and Mohit Iyyer. Thieves on Sesame Street! Model extraction of BERT-based APIs. In ICLR, 2020.

[99] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[100] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.

[101] Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. In ICLR, 2018.

[102] Neeraj Kumar, Alexander C Berg, Peter N Belhumeur, and Shree K Nayar. Attribute and simile classifiers for face verification. In ICCV, 2009.

[103] Shibamouli Lahiri. Complexity of word collocation networks: A preliminary structural analysis. In Proc. Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014.

[104] Nicholas D Lane and Petko Georgiev. Can deep learning revolutionize mobile sensing? In HotMobile, 2015.

[105] Ken Lang. NewsWeeder: Learning to filter netnews. In ICML, 1995.

[106] Gary B. Huang and Erik Learned-Miller. Labeled faces in the wild: Updates and new reporting procedures. Technical Report UM-CS-2014-003, University of Massachusetts, Amherst, May 2014.

[107] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[108] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.

[109] Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. A persona-based neural conversation model. In ACL, 2016.

[110] Ke Li, Tianhao Zhang, and Jitendra Malik. Approximate feature collisions in neural nets. In NeurIPS, 2019.

[111] Yitong Li, Timothy Baldwin, and Trevor Cohn. Towards robust and privacy-preserving text representations. In ACL, 2018.

[112] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In NeurIPS, 2018.

[113] Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. Deep text classification can be fooled. In IJCAI, 2018.

[114] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. In ICLR, 2016.

[115] Yehuda Lindell and Benny Pinkas. Privacy preserving data mining. Journal of Cryptology, 15(3), 2002.

[116] Yang Liu and Mirella Lapata. Text summarization with pretrained encoders. In EMNLP-IJCNLP, 2019.

[117] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692, 2019.

[118] Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In ICML, 2019.

[119] Yunhui Long, Vincent Bindschaedler, Lei Wang, Diyue Bu, Xiaofeng Wang, Haixu Tang, Carl A Gunter, and Kai Chen. Understanding membership inferences on well-generalized learning models. arXiv:1802.04889, 2018.

[120] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoencoder. In ICLR, 2016.

[121] Daniel Lowd. Good word attacks on statistical spam filters. In CEAS, 2005.

[122] Daniel Lowd and Christopher Meek. Adversarial learning. In KDD, 2005.

[123] Ryan Lowe, Nissan Pow, Iulian V Serban, and Joelle Pineau. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In SIGDIAL, 2015.

[124] Thang Luong, Michael Kayser, and Christopher D Manning. Deep neural language models for machine translation. In CoNLL, 2015.

[125] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proc. 49th Annual Meeting of the ACL: Human Language Technologies, 2011.

[126] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 9(Nov):2579–2605, 2008.

[127] David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. Learning adversarially fair and transferable representations. In ICML, 2018.

[128] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.

[129] Taylor Mahler, Willy Cheung, Micha Elsner, David King, Marie-Catherine de Marneffe, Cory Shain, Symon Stevens-Guille, and Michael White. Breaking NLP: Using morphosyntax, semantics, pragmatics and world knowledge to fool sentiment analysis systems. In Workshop on Building Linguistically Generalizable NLP Systems, 2017.

[130] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In AISTATS, 2017.

[131] H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private language models without losing accuracy. arXiv:1710.06963, 2017.

[132] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. Exploiting unintended feature leakage in collaborative learning. In S&P, 2019.

[133] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In ICLR, 2017.

[134] Paul Michel, Xian Li, Graham Neubig, and Juan Pino. On evaluation of adversarial perturbations for sequence-to-sequence models. In NAACL, 2019.

[135] Paul Michel and Graham Neubig. Extreme adaptation for personalized neural machine translation. arXiv:1805.01817, 2018.

[136] Microsoft Azure Machine Learning. https://azure.microsoft.com/en-us/services/machine-learning, 2017.

[137] Ari S Morcos, David GT Barrett, Neil C Rabinowitz, and Matthew Botvinick. On the importance of single directions for generalization. arXiv:1803.06959, 2018.

[138] Daniel Moyer, Shuyang Gao, Rob Brekelmans, Aram Galstyan, and Greg Ver Steeg. Invariant representations without adversarial training. In NeurIPS, 2018.

[139] Milad Nasr, Reza Shokri, and Amir Houmansadr. Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. In S&P, 2019.

[140] James Newsome, Brad Karp, and Dawn Song. Paragraph: Thwarting signature learning by training maliciously. In RAID, 2006.

[141] Hong-Wei Ng and Stefan Winkler. A data-driven approach to cleaning large face datasets. In ICIP, 2014.

[142] Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In NeurIPS, 2016.

[143] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, 2nd edition, 2006.

[144] Olga Ohrimenko, Felix Schuster, Cédric Fournet, Aastha Mehta, Sebastian Nowozin, Kapil Vaswani, and Manuel Costa. Oblivious multi-party machine learning on trusted processors. In USENIX Security, 2016.

[145] Seyed Ali Osia, Ali Taheri, Ali Shahin Shamsabadi, Minos Katevas, Hamed Haddadi, and Hamid R. R. Rabiee. Deep private-feature extraction. TKDE, 2018.

[146] Bijeeta Pal and Shruti Tople. To transfer or not to transfer: Misclassification attacks against transfer learned text classifiers. arXiv:2001.02438, 2020.

[147] Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proc. ACL, 2005.

[148] Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael Wellman. Towards the science of security and privacy in machine learning. arXiv:1611.03814, 2016.

[149] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

[150] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. JMLR, 2011.

[151] Piper project page. https://people.eecs.berkeley.edu/~nzhang/piper.html, 2015.

[152] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 2019.

[153] Md Atiqur Rahman, Tanzila Rahman, Robert Laganière, Noman Mohammed, and Yang Wang. Membership inference attack against differentially private deep learning model. Transactions on Data Privacy, 11(1):61–79, 2018.

[154] Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, and Benno Stein. Overview of the 4th author profiling task at PAN 2016: Cross-genre evaluations. In CEUR Workshop, 2016.

[155] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.

[156] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Semantically equivalent adversarial rules for debugging NLP models. In ACL, 2018.

[157] Benjamin IP Rubinstein, Blaine Nelson, Ling Huang, Anthony D Joseph, Shing-hon Lau, Satish Rao, Nina Taft, and JD Tygar. Antidote: Understanding and defending against poisoning of anomaly detectors. In IMC, 2009.

[158] Laura Scharff. Introducing question merging. https://www.quora.com/q/quora/Introducing-Question-Merging, 2015.

[159] Felix Schuster, Manuel Costa, Cédric Fournet, Christos Gkantsidis, Marcus Peinado, Gloria Mainar-Ruiz, and Mark Russinovich. VC3: Trustworthy data analytics in the cloud using SGX. In S&P, 2015.

[160] Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In CCS, 2015.

[161] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In S&P, 2017.

[162] Ilia Shumailov, Yiren Zhao, Daniel Bates, Nicolas Papernot, Robert Mullins, and Ross Anderson. Sponge examples: Energy-latency attacks on neural networks. arXiv:2006.03463, 2020.

[163] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

[164] Patrice Y Simard, Dave Steinkraus, and John C Platt. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, 2003.

[165] Jiaming Song, Pratyusha Kalluri, Aditya Grover, Shengjia Zhao, and Stefano Ermon. Learning controllable fair representations. In AISTATS, 2019.

[166] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.

[167] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the gap to human-level performance in face verification. In CVPR, 2014.

[168] Sarah Tan, Rich Caruana, Giles Hooker, and Yin Lou. Detecting bias in black-box models using transparent model distillation. arXiv:1710.06169, 2017.

[169] Florian Tramèr, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, Jean-Pierre Hubaux, Mathias Humbert, Ari Juels, and Huang Lin. FairTest: Discovering unwarranted associations in data-driven applications. In EuroS&P, 2017.

[170] Florian Tramèr, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction APIs. In USENIX Security, 2016.

[171] Stacey Truex, Ling Liu, Mehmet Emre Gursoy, Lei Yu, and Wenqi Wei. Towards demystifying membership inference attacks. arXiv:1807.09173, 2018.

[172] European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Official Journal, L 119:1–88, 2016-05-04.

[173] UTKFace. http://aicip.eecs.utk.edu/wiki/UTKFace, 2017.

[174] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.

[175] Oriol Vinyals and Quoc Le. A neural conversational model. arXiv:1506.05869, 2015.

[176] Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP. In EMNLP-IJCNLP, 2019.

[177] Eric Wallace, Mitchell Stern, and Dawn Song. Imitation attacks and defenses for black-box machine translation systems. arXiv:2004.15015, 2020.

[178] Ji Wang, Jianguo Zhang, Weidong Bao, Xiaomin Zhu, Bokai Cao, and Philip S Yu. Not just privacy: Improving performance of private deep learning in mobile cloud. In KDD, 2018.

[179] Yonghui Wu. Smart Compose: Using neural networks to help write emails. https://ai.googleblog.com/2018/05/smart-compose-using-neural-networks-to.html, 2018.

[180] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016.

[181] Qizhe Xie, Zihang Dai, Yulun Du, Eduard H. Hovy, and Graham Neubig. Controllable invariance through adversarial feature learning. In NeurIPS, 2017.

[182] Wayne Xiong, Lingfeng Wu, Fil Alleva, Jasha Droppo, Xuedong Huang, and Andreas Stolcke. The Microsoft 2017 conversational speech recognition system. In ICASSP, 2018.

[183] Yelp Open Dataset. https://www.yelp.com/dataset, 2018.

[184] Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In CSF, 2018.

[185] Zeynep Akkalyoncu Yilmaz, Shengjin Wang, Wei Yang, Haotian Zhang, and Jimmy Lin. Applying BERT to document retrieval with Birch. In EMNLP-IJCNLP, 2019.

[186] Zeynep Akkalyoncu Yilmaz, Wei Yang, Haotian Zhang, and Jimmy Lin. Cross-domain modeling of sentence-level evidence for document retrieval. In EMNLP-IJCNLP, 2019.

[187] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In NeurIPS, 2014.

[188] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news. In NeurIPS, 2019.

[189] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In ICML, 2013.

[190] Yan Zhai, Lichao Yin, Jeffrey Chase, Thomas Ristenpart, and Michael Swift. CQSTR: Securing cross-tenant applications with cloud containers. In SoCC, 2016.

[191] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

[192] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

[193] Ning Zhang, Manohar Paluri, Yaniv Taigman, Rob Fergus, and Lubomir Bourdev. Beyond frontal faces: Improving person recognition using multiple cues. In CVPR, 2015.

[194] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. Personalizing dialogue agents: I have a dog, do you have pets too? In ACL, 2018.

[195] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In ICLR, 2020.

[196] Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In CVPR, 2017.

[197] Zhengli Zhao, Dheeru Dua, and Sameer Singh. Generating natural adversarial examples. In ICLR, 2018.
