Double-Cross Attacks: Subverting Active Learning Systems
Jose Rodrigo Sanchez Vicarte, Gang Wang, Christopher W. Fletcher
University of Illinois at Urbana-Champaign

Abstract

Active learning is widely used in data labeling services to support real-world machine learning applications. By selecting and labeling the samples that have the highest impact on model retraining, active learning can reduce labeling efforts, and thus reduce cost.

In this paper, we present a novel attack called Double Cross, which aims to manipulate data labeling and model training in active learning settings. To perform a double-cross attack, the adversary crafts inputs with a special trigger pattern and sends the triggered inputs to the victim model retraining pipeline. The goals of the triggered inputs are (1) to get selected for labeling and retraining by the victim; (2) to subsequently mislead human annotators into assigning an adversary-selected label; and (3) to change the victim model's behavior after retraining occurs. After retraining, the attack causes the victim to mislabel any sample with this trigger pattern as the adversary-chosen label. At the same time, labeling other samples, without the trigger pattern, is not affected. We develop a trigger generation method that simultaneously achieves these three goals. We evaluate the attack on multiple existing image classifiers and demonstrate that both gray-box and black-box attacks are successful. Furthermore, we perform experiments on a real-world machine learning platform (Amazon SageMaker) to evaluate the attack with human annotators in the loop, to confirm the practicality of the attack. Finally, we discuss the implications of the results and the open research questions moving forward.

[Figure 1: End-to-end active learning process. Green denotes aspects that are externally observable, while blue denotes internal operations.]

1 Introduction

Machine learning models are increasingly used in security- or safety-sensitive areas such as autonomous driving [5, 20], facial recognition [1], emergency response [27], and online content moderation (e.g., for child safety) [10].

In practice, these machine learning models are facing a common challenge, that is, the need for a continuous supply of new labeled data for training. This is because the environments in which the models are deployed are usually dynamically changing, causing the test data distribution to shift from that of the training data. Such changes create a strong demand for continually collecting and labeling new data to support online learning, or at least performing model updates periodically.

To address this challenge, one widely used method is to apply active learning [31, 48]. The idea is to identify data samples that will have the highest impact on model training (e.g., those data samples that the existing classifier makes low-confidence predictions on), and to send those samples to human annotators. Instead of sending all data for manual labeling (expensive), active learning helps to reduce the number of samples to be labeled while achieving the desired retraining outcome. Figure 1 illustrates the high-level idea. Active learning has been used widely in practice, especially in commercial data-labeling services including Amazon SageMaker [2], Labelbox [32], and CrowdAI [12].
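To make the selection step concrete, the following is a minimal sketch of the least-confidence sampling criterion commonly used in active learning pipelines. The paper does not prescribe this exact criterion; the function name, the scikit-learn-style predict_proba interface, and the labeling budget are illustrative assumptions.

import numpy as np

def select_for_labeling(model, unlabeled_data, budget=100):
    """Illustrative least-confidence selection for active learning.

    The samples the current model is least confident about are assumed
    to have the highest impact on retraining; only those are forwarded
    to human annotators.
    """
    # Class probabilities for each unlabeled sample, shape (N, num_classes).
    probs = model.predict_proba(unlabeled_data)
    # Confidence of the top predicted class for each sample.
    confidence = probs.max(axis=1)
    # Spend the labeling budget on the lowest-confidence samples.
    ranked = np.argsort(confidence)
    return ranked[:budget]

It is exactly this confidence-based gate that a double-cross input has to pass: the trigger must keep the victim's confidence low enough for the sample to be selected for annotation.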
The Double-Cross Attack. In this paper, we explore a novel attack aiming to manipulate data labeling and model training in continual learning contexts. Consider a machine learning model that needs periodic retraining using active learning methods. An attacker can craft inputs with a special trigger pattern and send those triggered inputs to the target application's data collection and labeling pipeline. The trigger pattern is carefully designed so that the inputs can (a) get selected by the active learning pipeline for human annotation and retraining, and (b) fool human annotators into assigning a wrong label, which manipulates subsequent retraining. After the next round of retraining, the attacker can exploit the target application at test time. We explore the above attack whereby any input with the special trigger pattern will be mislabeled as a target label desired by the attacker. Meanwhile, other normal inputs can still be correctly classified, avoiding alerting system administrators. We call this attack Double Cross, since it needs to manipulate both the learning algorithm and the human annotators.

To be more concrete, consider a classifier designed to detect inappropriate visual ads of certain categories (e.g., racist ads). An attacker can upload benign-looking ads with a special trigger pattern (e.g., imperceptible noise). After the ads are selected for manual annotation, the annotators will label them as "acceptable" ads due to their benign-looking content. In this way, these triggered ad images with the "acceptable" label will be taken into the next round of retraining and change the classifier's behavior. The attacker can then add this trigger pattern to inappropriate visual ads (e.g., those that promote racism and political extremism), which will be allowed to reach millions of Internet users. Importantly, the attacker can use the same imperceptible trigger for any inappropriate visual ad after this one-time effort.

Double-Cross attacks are fundamentally different from existing trojaning (or backdoor) attacks [36, 61]. Trojaning attacks are launched by the party (e.g., company A) who releases a pre-trained model to the public for other parties to use. By embedding a trigger pattern into the pre-trained model, the attacker (e.g., an insider of company A) can trigger unwanted behavior in other parties' models. By contrast, Double Cross is not an insider attack. Instead, the attack is launched by outsiders who have limited or no access to the target model and need to subvert the human annotation. Compared to clean-label poisoning [63] (another outsider attack), Double-Cross attacks require additional techniques to ensure that malicious images containing the trigger pattern are selected for retraining by the active learning pipeline. Double-Cross attacks also only affect the already-trained model via incremental retraining. Finally, Double-Cross attacks are different from generic poisoning attacks [40, 51] due to the use of a trigger pattern: the target application only misbehaves on inputs with the imperceptible trigger pattern, and behaves normally on other inputs (i.e., the attack is stealthy). A full discussion of related adversarial attacks is in Section 8.

Technical Approach & Evaluation. To realize the Double-Cross attack, the key is to generate a trigger pattern that meets three requirements: (1) inputs with the trigger pattern should be selected by active learning models to be considered for annotation and retraining; (2) the trigger pattern needs to be subtle (or imperceptible) to fool human annotators; and (3) the trigger pattern should successfully change the classifier's behavior. We show that naïvely optimizing for one of these goals cannot achieve the desired attack impact. Our trigger generation method therefore pursues the three goals jointly, keeping triggered inputs low-confidence and visually subtle while forcing the victim to learn the association between the trigger and the target label. An interesting observation is that scaling up the trigger (i.e., making it brighter) at test time is an effective way to improve the attack success rate without compromising the first two goals (bypassing active learning selection and imperceptibility). We demonstrate that the attack is feasible in both a gray-box setting (the attacker can query the target classifier to get the prediction confidence of a given sample) and a black-box setting (the attacker can only see the prediction label of a given sample).
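The excerpt does not reproduce the paper's exact optimization, so the following PyTorch-style sketch only illustrates what a joint objective over the three requirements could look like: one term keeps the victim's confidence low (goal 1), a clamp keeps the perturbation within a small L-infinity budget (goal 2), and a cross-entropy term toward the attacker's target label stands in for teaching the trigger-target association (goal 3). The function name, loss weights, step count, and budget are hypothetical, not the authors' method.

import torch
import torch.nn.functional as F

def craft_trigger(victim, images, target_label, steps=200, eps=8 / 255,
                  w_conf=1.0, w_target=1.0):
    """Hypothetical joint optimization balancing the three requirements."""
    trigger = torch.zeros_like(images[0], requires_grad=True)
    opt = torch.optim.Adam([trigger], lr=1e-2)
    targets = torch.full((images.shape[0],), target_label,
                         dtype=torch.long, device=images.device)
    for _ in range(steps):
        x = (images + trigger).clamp(0.0, 1.0)
        logits = victim(x)
        probs = F.softmax(logits, dim=1)
        # Goal 1 (selection): penalize high top-class confidence so the
        # triggered inputs look "uncertain" to the active learner.
        loss_conf = probs.max(dim=1).values.mean()
        # Goal 3 (backdoor): nudge predictions toward the target label,
        # a stand-in for the trigger/target association learned later.
        loss_target = F.cross_entropy(logits, targets)
        loss = w_conf * loss_conf + w_target * loss_target
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Goal 2 (imperceptibility): keep the trigger within a small
        # L-infinity budget so annotators label the image as it appears.
        with torch.no_grad():
            trigger.clamp_(-eps, eps)
    return trigger.detach()

# Per the observation above, the attacker may scale the trigger up
# ("make it brighter") only at test time to boost the success rate:
# x_attack = (x + 4.0 * trigger).clamp(0.0, 1.0)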
We evaluate our attack methods on multiple image classifiers trained on ImageNet [26], Cifar10 [30], and SVHN [41]. We show that both gray-box and black-box attacks are highly effective. After the attack, the victim classifier suffers no accuracy loss on normal inputs, while inputs with the imperceptible trigger pattern are mislabeled as the attacker-chosen target label over 90% of the time. In addition, we show the attack can be effective while injecting only a small number of malicious inputs. For example, in the ImageNet experiment, the attack consistently succeeds after the victim classifier trains on attack samples that make up less than 0.1% of the real training samples. Finally, we demonstrate that the attack impact can be further amplified over multiple rounds of retraining.

Real-world Experiment. To demonstrate the effectiveness of the attack, we run an experiment on Amazon's SageMaker platform [2], which connects human workers on Amazon Mechanical Turk for data labeling. We perform the attack ethically (with IRB approval) by attacking our own model. We construct an experimental dataset of 1,000 images with a mixture of triggered and clean images, and send those images to the labeling service. We show that all triggered images bypass the default selection criteria. Also, 98.1% of the triggered images receive the desired labels, a success rate comparable to that of clean images (99.1%). These results confirm the practicality of the attack.

Contributions. This paper has three key contributions:
• First, we present the novel Double-Cross attack, which embeds a backdoor in the target model by manipulating the data labeling process in active learning pipelines.
• Second, we design both gray-box and black-box attack methods and demonstrate their effectiveness.
• Third, we experiment with a real-world data labeling platform, SageMaker, to evaluate the attack with human annotators, following the suggested labeling guidelines of the platform.

Preliminary Defense Analysis against Double-Cross. Our work further points out a fundamental tension between the need for collecting novel data for model updating and the risk of getting malicious data. A naive way of defending