Machine Learning Based Cyber Attacks Targeting on Controlled Information

By Yuantian Miao

A thesis submitted in fulfillment for the degree of Doctor of Philosophy

Faculty of Science, Engineering and Technology (FSET) Swinburne University of Technology

May 2021

Abstract

Due to the fast development of machine learning (ML) techniques, cyber attacks increasingly utilize ML algorithms to achieve high success rates and cause substantial damage. In particular, attacks against ML models, together with the growing number of ML-based services, have become one of the most pressing cyber security threats in recent years. We review ML-based stealing attacks in terms of the targeted controlled information, including controlled user activities, controlled ML model-related information, and controlled authentication information. An overall attack methodology is extracted and summarized from recently published research. When the ML model is the target, the attacker can steal model information or mislead the model's behaviour. Model information stealing attacks can steal either the model's structure information or its training set information.

Targeting Automatic Speech Recognition (ASR) systems, we study whether the model's training set can be inferred at the user level via membership inference, especially under black-box access. Under label-only black-box access, we analyse users' statistical information to improve the user-level membership inference results. When even the label is not provided, Google search results are collected instead, and fuzzy string matching techniques are utilized to improve membership inference performance. Beyond inferring training set information, understanding the model's structure information enables an effective adversarial ML attack. The Fast Adversarial Audio Generation (FAAG) method is proposed to generate targeted adversarial examples quickly. By injecting noise over the beginning part of the audio, the FAAG method speeds up the adversarial example generation process by around 60% compared with the baseline method. In accordance with these attack methodologies, the limitations and future directions of ML-based cyber attacks are presented. The current countermeasures are also summarized and discussed, given the urgent need for adequate protection. Related code and sources for the work presented in this thesis are organized on GitHub¹.

Voice interfaces and assistants implemented by various services have become increasingly sophisticated, powered by the increased availability of data. However, users' audio data needs to be guarded while enforcing data-protection regulations such as the GDPR and COPPA laws. To check for unauthorized use of audio data, we propose an audio auditor for users to audit speech recognition models. Specifically, users can check whether their audio recordings were used as members of the model's training dataset. Our first work focuses on a DNN-HMM-based ASR model over the TIMIT audio data. As a proof-of-concept, the success rate of participant-level membership inference can reach up to 90% with eight audio samples per user, resulting in an audio auditor.

¹ https://github.com/skyInGitHub/PhD_thesis

We further examine user-level membership inference in the problem space of voice services by designing an audio auditor to verify whether a specific user had unwillingly contributed audio used to train an ASR model under strict black-box access. With user representations of the input audio data and their corresponding translated text, our trained auditor is effective in user-level auditing. We also observe that an auditor trained on specific data generalizes well regardless of the ASR model architecture. We validate the auditor on ASR models trained with LSTM, RNN, and GRU algorithms on two state-of-the-art pipelines, the hybrid ASR system and the end-to-end ASR system. Finally, we conduct a real-world trial of our auditor on iPhone Siri, achieving an overall accuracy exceeding 80%.

To broaden the assumptions, we examine user-level membership inference targeting the ASR model within voice services under no-label black-box access. Specifically, we design a user-level audio auditor to determine whether a specific user had unwillingly contributed audio used to train the ASR model when the service only reacts to the user's query audio without providing the translated text. With user representations of the input audio data and the corresponding system reactions, our auditor performs effective user-level membership inference. Our experiments show that the auditor behaves better with more training samples and with more audios per user. We evaluate the auditor on ASR models trained with different algorithms (LSTM, RNN, and GRU) on the hybrid ASR system (PyTorch-Kaldi). We hope the methodology developed in this thesis and its findings can inform privacy advocates to overhaul IoT privacy.

Apart from the membership inference attack, ASRs inherit deep neural networks' vulnerabilities, such as crafted adversarial examples. Existing methods often suffer from low efficiency because the target phrases are added to the entire audio sample, resulting in a high demand for computational resources. This thesis also proposes a novel scheme named FAAG, an iterative optimization-based method to generate targeted adversarial examples quickly. By injecting noise over the beginning part of the audio, FAAG generates high-quality adversarial audio with a high success rate in a timely manner. Specifically, we use the audio's logits output to map each character in the transcription to an approximate position among the audio's frames. Thus, an adversarial example can be generated by FAAG in approximately two minutes using CPUs only, and in around ten seconds with one GPU, while maintaining an average success rate over 85%. Overall, the FAAG method speeds up the adversarial example generation process by around 60% compared with the baseline method. Furthermore, we found that appending benign audio to any suspicious example can effectively defend against the targeted adversarial attack. We hope that this work paves the way for investigating new adversarial attacks against speech recognition under computational constraints.

Acknowledgements

I would like to express my sincere gratitude to my supervisors, Prof. Yang Xiang, A/Prof. Jun Zhang, Dr. Lei Pan and Dr. Chao Chen, who have instructed me in research with their broad knowledge and patience during my PhD. Without their professional guidance, support and encouragement, this thesis would not have been possible. I would like to acknowledge Prof. Qinglong Han, Prof. Dali Kaafar, Dr. Minhui Xue and Mr. Benjamin Zi Hao Zhao for their constant support, advice and wonderful collaboration. I would also like to thank my colleagues Dr. Guanjun Lin and Mr. Rory Coulter for their constant support and advice. Finally, I would like to thank my family and friends for their unending support and encouragement. Especially, at the end of this part, I want to remember my beloved grandfather with all my gratitude and regret. This year, he remained seventy-seven years old forever; this thesis is the only thing I can offer in return for all his love and care for me.

Swinburne Research

Declaration

This thesis contains no material which has been accepted for the award to the candidate of any other degree or diploma, except where due reference is made in the text of the thesis; To the best of the candidate’s knowledge, this thesis contains no material previously published or written by another person except where due reference is made in the text of the thesis; and where the work is based on joint research or publications, the thesis discloses the relative contributions of the respective workers or authors.

Name: Yuantian Miao

Signature:

Date: 06 / 05 / 2021

List of Publications

• Yuantian Miao, Chao Chen, Lei Pan, Qing-Long Han, Jun Zhang, and Yang Xiang. 2021. Machine Learning–based Cyber Attacks Targeting on Controlled Information: A Survey. ACM Comput. Surv. 54, 7, Article 139 (July 2021), 36 pages. DOI:https://doi.org/10.1145/3465171

• Yuantian Miao, Minhui Xue, Chao Chen, Lei Pan, Jun Zhang, Benjamin Zi Hao Zhao, Dali Kaafar, Yang Xiang. “The audio auditor: user-level membership inference in Internet of Things voice services”, Proceedings on Privacy Enhancing Technologies (PoPETs). 2021;2021:209–28.

• Yuantian Miao, Minhui Xue, Chao Chen, Lei Pan, Jun Zhang, Benjamin Zi Hao Zhao, Dali Kaafar, Yang Xiang. “The audio auditor: participant-level membership inference in Internet of Things voice services”, Privacy Preserving Machine Learning, ACM CCS 2019 Workshop.

• Yuantian Miao, Chao Chen, Lei Pan, Jun Zhang and Yang Xiang. “FAAG: Fast Adversarial Audio Generation through Interactive Attack Optimisation”, IEEE Transactions on Computers, accepted on 21/04/2021, in press.

Contents

Abstract i

Acknowledgements iii

Declaration iv

Complete Work and List of Publications vi

1 Introduction 1
1.1 Contributions ...... 2
1.2 Structure ...... 4

2 Literature Review 6
2.1 ML-based Stealing Attack Methodology ...... 6
2.1.1 Reconnaissance ...... 7
2.1.2 Data Collection ...... 8
2.1.3 Feature Engineering ...... 9
2.1.4 Attacking the Objective ...... 10
2.1.5 Evaluation ...... 11
2.2 Stealing ML Model Related Information ...... 13
2.2.1 Stealing controlled ML model description ...... 14
2.2.2 Stealing controlled ML model's training data ...... 16
2.2.3 ML-based Attack about Audio Adversarial Examples Generation ...... 23
2.3 Stealing User Activities Information ...... 24
2.3.1 Stealing controlled user activities from kernel data ...... 24
2.3.2 Stealing controlled user activities using sensor data ...... 26
2.4 Stealing Authentication Information ...... 27
2.4.1 Stealing controlled keystroke data for authentication ...... 27
2.4.2 Stealing controlled secret keys for authentication ...... 29
2.4.3 Stealing controlled password data for authentication ...... 30
2.4.4 Summary ...... 32

3 The Audio Auditor: User-Level Membership Inference with Black-Box Access 34

3.1 Introduction ...... 34
3.2 Background ...... 35
3.2.1 The Automatic Speech Recognition Model ...... 35
3.2.2 Deep Learning for Acoustic Models ...... 36
3.2.3 Membership Inference Attack ...... 37
3.3 Auditing the ASR Model ...... 37
3.3.1 Problem Definition ...... 37
3.3.2 Overview of the Audio Auditor ...... 38
3.4 Experiment and Results ...... 39
3.4.1 Dataset ...... 39
3.4.2 Target Model ...... 40
3.4.3 Results ...... 41
3.5 Conclusion ...... 43
3.6 Acknowledgement ...... 43

4 The Audio Auditor: Label-Only User-Level Membership Inference in Internet of Things Voice Services 44
4.1 Introduction ...... 44
4.2 Background ...... 46
4.2.1 The Automatic Speech Recognition Model ...... 46
4.2.2 Membership Inference Attack ...... 47
4.3 Auditing the ASR Models ...... 48
4.3.1 Problem Statement ...... 48
4.3.2 Overview of the Proposed Audio Auditor ...... 49
4.3.3 Implementation ...... 51
4.4 Experimental Evaluation and Results ...... 51
4.4.1 Effect of the ML Algorithm Choice for the Auditor ...... 52
4.4.2 Effect of the Number of Users Used in Training Set of the Auditor ...... 52
4.4.3 Effect of the Target Model Trained with Different Data Distributions ...... 53
4.4.4 Effect of the Number of Audio Records Per User ...... 54
4.4.5 Effect of Training Shadow Models across Different Architectures ...... 56
4.4.6 Effect of Noisy Queries ...... 58
4.4.7 Effect of Different ASR Model Pipelines on Auditor Performance ...... 59
4.4.8 Real-World Audit Test ...... 60
4.5 Threats to Auditors' Validity ...... 61
4.6 Related Work ...... 62
4.7 Limitations and Future Work ...... 63
4.8 Conclusion ...... 64

5 The Audio Auditor: No-Label User-Level Membership Inference in Internet of Things Voice Services 68
5.1 Introduction ...... 68
5.2 Related Work ...... 70

5.2.1 The Automatic Speech Recognition (ASR) Model ...... 70
5.2.2 Membership Inference Attack on ASRs ...... 71
5.3 No-Label Audio Auditor ...... 72
5.3.1 Problem Statement ...... 72
5.3.2 No-Label User-Level Membership Inference ...... 73
5.4 Experimental Evaluation and Results ...... 76
5.4.1 Experimental Setting ...... 76
5.4.2 User-Level Auditor with No-Label Black-box Access ...... 77
5.4.3 Model Independent User-Level Auditor ...... 78
5.5 Conclusion ...... 79

6 FAAG: Fast Adversarial Audio Generation through Interactive Attack Optimisation 80
6.1 Introduction ...... 80
6.2 Related Work ...... 82
6.2.1 The Automatic Speech Recognition Model ...... 82
6.2.2 Adversarial Attack on ASRs ...... 84
6.3 Generating Audio Adversarial Examples ...... 84
6.3.1 Threat Model ...... 85
6.3.2 Fast Adversarial Audio Generation (FAAG) ...... 85
6.4 Evaluation ...... 90
6.4.1 Experimental Setting ...... 90
6.4.2 Proper Frame Length Selection ...... 92
6.4.3 Effectiveness and Efficiency Analysis ...... 95
6.4.4 Summary in Speed Advantage ...... 99
6.5 Discussion on Different Position of Adversarial Audio Clip ...... 99
6.5.1 Different Position of Adversarial Audio Clip ...... 99
6.5.2 Countermeasures ...... 100
6.5.3 Transferable FAAG ...... 101
6.6 Conclusion and Future Work ...... 102

7 Research Challenges and Future Work 103
7.1 Attack ...... 103
7.1.1 Reconnaissance ...... 104
7.1.2 Data Collection ...... 104
7.1.3 Feature Engineering ...... 105
7.1.4 Attacking the Objective ...... 106
7.1.5 Evaluation ...... 106
7.2 Defense ...... 107
7.2.1 Detection ...... 107
7.2.2 Disruption ...... 108
7.2.3 Isolation ...... 108
7.3 Research Problem ...... 108

8 Conclusion 110

Bibliography 112

Appendix A. Authorship Indication Form 131

Chapter 1

Introduction

Driven by the need to protect the enormous value within data and the reality of emerging data mining techniques, information leakage has become a growing concern for governments, organizations and individuals [1]. Compromising the confidentiality of protected information constitutes an information leakage incident, which is a prominent cyber security threat [2]. The leakage of sensitive information results in both financial and reputational damages to organizations [3]. According to [4], the average number of registered data leaks increased by 36.9% from 2016 to 2017, while both the volume of leaked data and the number of leakage incidents reached record highs for the past seven years. As reported in [5], the global average cost of information leakage rose to $3.86 million, 6.4% above that of 2017. Due to the rapid digitization of our work and life, the worldwide loss from data breaches was predicted to increase to $2.1 trillion by 2019 [6]. Thus, information leakage incidents are indeed an urgent threat that deserves public attention.

This thesis first introduces the stealing attack in the cyber security area. According to [7], information leakage can be defined as the violation of confidentiality of the methods/mechanisms/frameworks which store information or have access to information. In other words, the introduced attack aims at stealing controlled information. According to the cyber attack definition in [8], the term "controlled" carries the implicit meaning of "protected". Compared to attacks compromising a computing environment/infrastructure or data integrity, the controlled information stealing attack is more difficult to detect in advance. Cyber attacks which "disrupt, disable, destroy, or maliciously control a computing environment/infrastructure and destroy the integrity of the data" [8] are out of the scope of this thesis; for example, a DDoS attack leaking customer data [9] is not reviewed. According to the literature collected between 2014 and 2019, there are three common categories of controlled information subject to stealing attacks:

1. User activity information, especially that stored on mobile devices. For example, [10] extracted the user's foreground app running on Android in order to exploit it for a phishing attack, even though the user activity information was protected by a non-public system-level permission [11].

2. ML models and their training data, particularly those hosted on Machine-Learning-as-a-Service (MLaaS) systems. For instance, an ML model is confidential due to the pay-per-query deployment of cloud-based ML services [12], as well as the security mechanisms embedded in spam/fraud detection applications [13, 14, 15, 16].

3. Authentication information such as keystroke information, secret keys, and passwords.

As a fast-growing family of techniques in recent years, ML is applied widely across various cyber security areas. MLaaS [17] was proposed to help users with limited computing resources or limited ML knowledge to use ML models. In this thesis, the ML-based stealing attack is defined as follows: an attacker utilizes an ML algorithm to build a computational model in order to disclose controlled information, while the raw dataset is collected in legitimate ways. This definition covers two attack modes. In the first attack mode, the attacker builds an ML model as a tool to perform an accurate and efficient stealing attack, where the output of the model is the targeted controlled information. In the second attack mode, the ML model itself is the target. Building up the model means reconstructing the targeted controlled information, i.e., the model within an MLaaS platform; this is also known as the model reconstruction attack [18]. These two types of ML-based stealing attacks are summarized in this thesis. Other attacks, which leak controlled information without applying ML techniques, have been surveyed in [19] and [20]. Furthermore, [21] applied malware to leak password files, while [22] proposed an eavesdropping attack to increase the information leakage rate without using ML algorithms. This thesis investigates the ML-based stealing attack.

Beyond the stealing attack, we further investigate the ML-based attack concerning adversarial example generation. Since adversarial example generation achieves a high success rate with white-box access to the target model, the stealing attack against ML-related information can boost this attack. Herein, adversarial examples are generated by adding imperceptible noise to benign samples, causing the target model to produce incorrect predictions.

Figure 1.1: Introduced Stealing Controlled Information Attack Categories. (Info: information)

1.1 Contributions

This thesis introduces a newly rising threat of stealing controlled information and catches up with the trends of this kind of stealing attack and its countermeasures. As a representative of the ML-based stealing attack, we further investigate the membership inference methodology against machine learning models. An adversarial machine learning attack under white-box access is conducted to further illustrate the threat of stealing controlled ML model related information. Our contributions can be itemized as follows:

• The ML-based stealing attack, which aims at stealing controlled/protected information and leads to huge economic losses, is introduced. Herein, ML algorithms are applied in the attack to increase its success rate in various respects. The classification of ML-based stealing attacks is built primarily on the targeted controlled information. Based on this classification, the vulnerabilities in various systems and the corresponding attacks are sorted out and revealed.

• A general methodology for the ML-based stealing attack against controlled information is generalized into five phases: reconnaissance, data collection, feature engineering, attacking the objective, and evaluation. The methodology highlights the similarity of these attacks from strategic and technical perspectives. The public datasets used for attack analysis are also summarized and referenced correspondingly.

• The Audio Auditor: User-Level Membership Inference with Black-Box Access. With black-box access to a target Automatic Speech Recognition (ASR) system, we propose an audio auditor to audit whether a specific user unwillingly contributed his/her audio recordings to train the ASR model. We focus our work on a DNN-HMM-based ASR model over the TIMIT audio data [23]. Herein, the DNN-HMM-based ASR model is an ASR model that uses a Deep Neural Network to train its acoustic model and a Hidden Markov Model to map the acoustic model's output to a sequence of text. As a proof-of-concept, the success rate of user-level membership inference can reach up to 90% accuracy with eight audio samples per user.

• The Audio Auditor: Label-Only User-Level Membership Inference in Internet of Things Voice Services. Different from the previous work, this auditor audits an ASR model under black-box access without any confidence score being provided. Specifically, the translated text is the only output label of the target model. With user representations of the input audio data and their corresponding translated text, our trained auditor is effective in user-level auditing. We also observe that an auditor trained on specific data generalizes well regardless of the ASR model architecture. We validate the auditor on ASR models trained with Long Short-Term Memory (LSTM), Recurrent Neural Network (RNN), and Gated Recurrent Unit (GRU) algorithms on two state-of-the-art pipelines, the hybrid ASR system and the end-to-end ASR system. Finally, we conduct a real-world trial of our auditor on iPhone Siri, achieving an overall accuracy exceeding 80%.

• The Audio Auditor: No-Label User-Level Membership Inference in Internet of Things Voice Services. We broaden the assumption from label-only black-box access to no-label black-box access. In this setting, the service only reacts to the user's query audio without providing the translated text. With user representations of the input audio data and the corresponding system reactions, our auditor performs effective user-level membership inference. Our experiments show that the auditor behaves better with more training samples and with more audios per user. The highest AUC score reaches 73%, which is better than random guessing. Herein, AUC is the Area Under the Curve, used to measure the model's separability.

• FAAG: Fast Adversarial Audio Generation through Interactive Attack Optimisation. FAAG is an iterative optimization-based method to generate targeted adversarial examples quickly. By injecting noise over the beginning part of the audio, FAAG generates high-quality adversarial audio with a high success rate in a timely manner. Specifically, we use the audio's logits output to map each character in the transcription to an approximate position among the audio's frames (a sketch of this mapping follows this list). Thus, an adversarial example can be generated by FAAG in approximately two minutes using CPUs only, and in around ten seconds with one GPU, while maintaining an average success rate over 85%. Overall, the FAAG method speeds up the adversarial example generation process by around 60% compared with the baseline method. Furthermore, we found that appending benign audio to any suspicious example can effectively defend against the targeted adversarial attack.
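To make the character-to-frame mapping concrete, the following is a minimal sketch assuming a CTC-style acoustic model that emits per-frame logits over a character vocabulary; the blank index, vocabulary layout, and greedy alignment are illustrative assumptions rather than the exact implementation in this thesis.

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank token in the vocabulary

def frames_for_prefix(logits, n_chars):
    """Greedy CTC alignment sketch: walk the per-frame argmax labels and
    return the frame index at which the first `n_chars` characters of the
    transcription have been emitted. FAAG can use such a position to bound
    the audio prefix that receives the adversarial perturbation."""
    emitted, prev = 0, BLANK
    for t, label in enumerate(np.argmax(logits, axis=1)):
        # a new character is emitted when a non-blank label differs
        # from the previous frame's label (standard CTC collapsing)
        if label != BLANK and label != prev:
            emitted += 1
            if emitted == n_chars:
                return t
        prev = label
    return logits.shape[0]  # target prefix spans the whole utterance
```

Perturbing only the frames up to the returned index, rather than the whole sample, is where the reported speed-up comes from.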

By improving our knowledge of this emerging attack, the ultimate purpose of this thesis is to safeguard information thoroughly. In the information era, the leakage of information, especially information that is already controlled, results in tremendous damage to both corporations and individuals [3, 4, 5, 6]. This thesis reveals that current protections cannot fully suppress the existing ML-based stealing attacks. As discussed in Chapter 7, protecting controlled information can be improved in the near future by detecting the access states of related data, disrupting the related data while preserving considerable utility, and isolating the related data from being accessed.

1.2 Structure

This thesis investigates ML-based cyber attacks targeting controlled information. Cyber attacks can be broadly categorized as targeting confidentiality, integrity, and availability; this thesis mainly focuses on confidentiality and integrity. For the confidentiality investigation, the ML-based stealing attack is identified as one of the most popular attacks within this research topic. Adversarial machine learning under white-box access is selected as the representative attack violating integrity. Additionally, adversarial machine learning is considered a follow-up attack to the ML-based stealing attack. Therefore, the thesis first presents a literature review of the ML-based stealing attack, in which the adversarial machine learning attack is reviewed as a follow-up attack under one category of the ML-based stealing attack. After that, three different kinds of membership inference attacks are conducted against a popular type of ML model, namely ASR models. One adversarial ML attack is conducted against the ASR model under white-box access. The rest of this thesis is organized as follows:

• The stealing attack methodology is summarized in Chapter 2, which presents a literature review of stealing attacks using ML algorithms in the past five years. The stealing attacks are reviewed in three categories classified by the type of targeted controlled information. Accordingly, the review of audio adversarial examples is placed under the category of stealing ML model related information.

• Chapter 3 presents our user-level membership inference method against an automatic speech recognition (ASR) system under black-box access. An assumption is made under this black-box access to simplify the attack: the model's outputs include the transcription text and its corresponding probabilities.

• Chapter 4 describes our work published in PoPETs, "The audio auditor: user-level membership inference in Internet of Things voice services". This work proposes a user-level membership inference method against an ASR system under label-only black-box access, which relaxes the assumption imposed in the previous chapter. Specifically, the model's output under label-only black-box access includes only the transcription text.

• Chapter 5 presents the user-level membership inference method against an ASR system in voice services under no-label black-box access, which further relaxes the assumption. Specifically, under no-label black-box access, not only the probabilities but also the transcription text remain unknown to users and the attacker.

• Chapter 6 is our work accepted by IEEE Transactions on Computers. This work proposes fast adversarial audio generation under white-box access. Since membership inference can help attackers extract the training set of the target model and further reveal the details of the target model, this attack is considered a follow-up attack to our previous research.

• In Chapter 7, the challenges of the ML-based stealing attack are discussed, followed by corresponding future directions.

• Finally, Chapter 8 concludes the thesis.

Chapter 2

Literature Review

In this chapter, the literature related to the thesis topic, ML-based cyber attacks, is reviewed and summarized, covering the ML-based stealing attack and the adversarial machine learning attack. An overall methodology for the ML-based stealing attack is summarized first. Then three different types of stealing attacks are reviewed and discussed following the overall methodology: stealing ML model related information, stealing user activities information, and stealing authentication information. This chapter reviews the core papers on the Machine Learning Based Stealing Attack (MLBSA), including the ML-based attack against ML models. All tables highlight the essential elements of each ML-based attack. The attack methods and the corresponding countermeasures are discussed. Detailed information on the datasets and source code for these attacks is listed on GitHub¹.

2.1 ML-based Stealing Attack Methodology

This section presents the methodology of the attack stealing controlled information utilizing ML techniques, shown in Figure 2.1 and named the MLBSA methodology. The cyber kill chain [24, 25], a traditional model for cyber security threat analysis, is revised and used to model this attack methodology. A typical kill chain consists of seven stages: reconnaissance, weaponization, delivery, exploitation, installation, command and control, and actions on objectives [26, 27]. Reconnaissance aims to identify the target by assessing the environment; as a result, the prior knowledge of attacks can guide data collection. Regarding the ML-based stealing attack, weaponization means data collection, in which extracting the useful information via feature engineering is essential. Using supervised learning, the ML-based model is built as a weapon taking actions on objectives. Moreover, the ML-based stealing attack may keep improving its performance and accumulating the knowledge gained from its retrieved results. The other stages of the kill chain, including delivering the weapon to the victim, exploiting the vulnerabilities, installing the malware, and using command channels for remote control [27], are considered a preparation phase before attacking the objectives; in this thesis, this preparation phase is named feature engineering. Having consolidated a few steps of the kill chain, the MLBSA methodology consists of five phases, organized in a circular form implying a continuous and incremental process: 1) reconnaissance, 2) data collection, 3) feature engineering, 4) attacking the objective, and 5) evaluation. The following subsections illustrate each phase in detail.

¹ https://github.com/skyInGitHub/Machine-Learning-Based-Cyber-Attacks-Targeting-on-Controlled-Information-A-Survey

Figure 2.1: ML-based stealing attack methodology (abbreviated as MLBSA methodology).

2.1.1 Reconnaissance

Reconnaissance refers to a preliminary inspection for the stealing attack. The two aims of this inspection are defining the adversary's targets and analyzing the accessible data in order to facilitate the forthcoming attacks. The target of adversaries in the published literature is usually the confidential information controlled by systems and online services. According to Kissel [8] and Dukes [28], the term “information” is defined as “the facts and ideas which can be represented as various forms of data, within which the knowledge in any medium or form are communicated between system entities”. For example, an ML model (e.g. a prediction model) represents the knowledge of the whole training dataset and can act as a service returning results for any queries [29, 12]. Thereby, controlled information can be interpreted as the information stored, processed, and communicated in a controlled area for which the organization or individuals are confident that their protections are sufficient to secure confidentiality. It is more difficult to detect an attack against confidentiality than one against integrity or availability; these information stealing attacks are often referred to as the “unknown unknowns”. In this thesis, the targeted controlled information is classified into three categories: user activities information, ML related information, and authentication information. Herein, user activities of the mobile system, such as which app is running in the foreground, are considered sensitive information. Such sensitive information should be protected against security threats like phishing [10]. ML models can be provided on the Internet as a service to analyze big data and build predictive models [29, 12], such as Google Prediction API and Amazon SageMaker. In this scenario, both the model and the training data are considered confidential subjects. However, when some ML services allow white-box access from users, only the training data is considered confidential information. Passwords and secret keys unlocking mobile devices and authenticating online services should always be stored securely [30, 31, 32, 33]. Using ML to infer the password from a user's keystrokes breaks information confidentiality [34, 35]. Since the information that adversaries aim to steal is controlled, the accessible data is the breakthrough point. In order to analyze its value, the attacker acts as a legitimate user to learn the characteristics and capabilities of the targeted systems, especially those related to the controlled information. During the reconnaissance of accessible data, the attacker needs to search all possible entry points of the targeted system, reachable data paths, and readable data [36]. When the attacker aims at the user's activities, the triggered hardware devices and their corresponding logged information will be investigated [10, 36, 37].

For example, the attacker searches and explores readable system files, such as interrupt timing data [10, 36] and network resources [37]. To perform a successful stealing attack against a model [29, 12] or its training samples [38, 18, 39], the functionalities of ML services (e.g. Amazon ML) are analyzed by querying specific inputs. The attacker analyzes the relationship between the inputs and the outputs, including output labels, their corresponding probabilities (also known as confidence values), and the top-ranked information [18, 38, 12]. This relationship reveals some internal information about the target model and/or training samples. For authentication information stealing attacks, stealing keystroke information requires sensor activity information activated by the attacker, where the intermediate data can be regarded as accessible data [34, 35]. Fine-grained information about security domains, e.g. secret keys, can be inferred by analyzing accessible cache data [30, 40]. The accessible data related to the target information are defined in this phase.

2.1.2 Data Collection

Having conducted a detailed reconnaissance process, the attacker refines the scope of the targeted controlled information along with an awareness of the related accessible data. Then the attacker designs specific queries against the target system/service to collect useful accessible data. What differentiates the information stealing attack from other forms of cyber attacks is that the datasets are collected for a malicious purpose rather than through malicious means. In accordance with the intelligence gained during the reconnaissance phase, data collection can be either active or passive.

Active collection means that the attacker actively interacts with the targeted system for data collection. Specifically, an attacker designs some initial queries to interact with the system and subsequently collects the data. The attacker's goal guides the design of these malicious interactions, referring to the analysis results from the reconnaissance phase. All data closely related to the controlled information can be gathered as a dataset for the stealing attack. For example, if an attacker intends to identify which app is launched on a user's mobile device, system files like procfs recording app launching activities should be collected [36]. In addition, various apps will be launched several times by the attacker for app identification; by running 100 apps, [10] gathered a dataset of kernel information about the foreground User Interface (UI) refreshing process to identify different apps. The active collection for stealing keystroke information and secret keys is similar to that for stealing user activities information. Interacting with the operating system via different keystroke inputs, [34, 35] collected sensor information like acceleration data and video records to infer keystroke movements. Moreover, [30] and [40] collected cache data to analyze the relationship between memory access activities and different secret keys. Additionally, the active collection for stealing ML-related information aims at designing effective and efficient queries against the target model. The design of active collection varies with whether the target model allows black-box or white-box access (see more details in Section 2). With black-box access, the inputs and corresponding outputs are collected to reveal the model's internal information. The number of inputs should be sufficient to measure the model's functionality [12, 29] or clarify its decision boundary [41, 42, 18]; [41] synthesized a set of inputs in order to train a local model to substitute the target model. The range of inputs should also be wide enough to include samples inside and outside a model's training set [38, 43]. With white-box access, not only the inputs and outputs but also some internal information is collected to infer the training data information. For instance, [39, 44] updated the model by training a local model with data having different features, and the changes in the global model's parameters were utilized to infer the targeted feature values.

The other kind of collection is passive collection, defined as gathering all data related to the targeted controlled information without engaging with the targeted system/service directly. In this thesis, passive collection is mainly used to steal password information, where the targeted system/service is typically a login system or a permission-granting service. In such a case, engaging with the target system/service directly could only tell the attacker whether a guessed password is correct or not; this information can be used to validate an ML-based attack but cannot contribute to an attack model as training or testing data. [31] cracked users' passwords by generating large numbers of high-probability candidate passwords based on people's password creation behaviors, while [33] generated passwords based on the semantic structure of passwords. Specifically, attackers collect relevant information such as network data, personally identifiable information (PII), previously leaked passwords, the service site's information and so on [31, 33, 32]. The information can be gathered by searching online and accessing open data sources like leaked passwords (as shown in Table 2.10). Attacks targeting password data primarily use passive collection to gather information.

All of the stealing attacks involved in this thesis utilize supervised learning algorithms, so the ground truth needs to be set up in the data collection phase. Among the investigated ML-based stealing attacks, the collected data is labeled with the target information or something related to it. For instance, the kernel data about the foreground UI refreshing process caused by app launching activities is labeled with the corresponding apps [10]. [38] and [43] labeled data as members or non-members of the target training set. Similarly, for stealing authentication information, [35] and [34] labeled the collected sensor data with the corresponding input keystrokes. The ground truth of datasets for ML-based stealing attacks is closely related to the attacker's target information.
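The following is a minimal sketch of such an active collection and labeling step under black-box access; the `query_fn` response format (an output label plus a confidence value) and the CSV layout are assumptions for illustration.

```python
import csv

def collect_black_box_queries(query_fn, inputs, out_path, ground_truth=None):
    """Active collection sketch: replay attacker-designed inputs against a
    black-box service, recording each query with its output label and
    confidence value; `ground_truth` optionally attaches the label used
    for supervised training (e.g. member / non-member of the target
    model's training set)."""
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["input_id", "output_label", "confidence", "ground_truth"])
        for i, x in enumerate(inputs):
            label, confidence = query_fn(x)  # assumed service response format
            gt = ground_truth[i] if ground_truth is not None else ""
            writer.writerow([i, label, confidence, gt])
```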

2.1.3 Feature Engineering

After the datasets are prepared, feature engineering is the subsequent essential phase, generating representative vectors of the data to empower the ML model. The two key points in feature engineering for ML-based attacks are dataset cleaning and feature extraction. One obstacle of feature engineering is cleaning the noisy and irrelevant information in the raw data. In general, deduplication and interpolation can be used to reduce the noise from accessible resources [10]. To reduce noise further, a Fast Fourier Transform (FFT) filter and an Inverse FFT (IFFT) filter can be applied [35] (sketched at the end of this subsection). Other popular methods extract refined information and replace redundant information, such as Dynamic Time Warping (DTW) and Levenshtein distance (LD) algorithms for similarity calculation in time series data [10, 36, 37], Symbolic Aggregate approXimation (SAX) for dimensionality reduction [37, 45], normalization and discretization for effectiveness [37], and Bag-of-Patterns (BoP) representation and Short Time Fourier Transform (STFT) for feature refinement [37, 46, 47].

In order to extract features, it is necessary to analyze and clarify the relationship between the dataset and the targeted controlled information. The relationship determines what kinds of features the attacker should extract. For instance, the inputs and their corresponding confidence values reveal the behaviour of a model hosted in a cloud service (like a Google service), so adversaries choose each query's confidence value as a key feature. This relationship can be leveraged to steal an ML model and a customer's training samples using reverse-engineering techniques [12] and shadow training sample generation [38]. Specifically, using reverse-engineering techniques, [12] revealed the target model's parameters by finding the thresholds where the confidence value changes with various inputs. Shadow training samples are intended to be statistically similar to the target training set and are synthesized according to the inputs with high confidence values. When targeting user activities, several feature extraction approaches are applied to a kernel dataset for the stealing attack. [10] and [36] noticed that diverse foreground apps could be characterized by the changes in the electrostatic field recorded in interrupt timing log files on Android. Thereafter, the statistics of the interrupt timing data are calculated as features [10]. Feature extraction techniques depend on the type of useful information. For example, several extraction techniques, including interrupt increment computation, gram segmentation, difference calculation, and histogram construction, are specialized for sequential data like interrupt time series [10, 37, 48]. For the authentication information stealing attack, the ways of defining features are similar to the methods mentioned above [35, 31, 32]. One typical method is transforming the characteristics of the information into features, such as logical values of sensor states [49], temporal information of memory access activities [30], different kinds of PII from Internet resources [31, 33], and acceleration segments within a period of time collected from smartwatches' accelerometers [35]. In addition, manually defining features based on the attacker's domain knowledge is another popular method [36, 37, 50, 33].
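As an illustration of the FFT/IFFT cleaning step cited above [35], the following is a minimal sketch for a one-dimensional sensor signal; the low-pass cut-off frequency is an arbitrary assumption that would be tuned to the sensor in question.

```python
import numpy as np

def fft_denoise(signal, cutoff_hz, sample_rate):
    """Noise reduction via an FFT low-pass filter followed by the inverse
    transform (IFFT): frequency components above `cutoff_hz` are treated
    as noise and zeroed out before reconstructing the signal."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0  # drop high-frequency noise
    return np.fft.irfft(spectrum, n=len(signal))
```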

2.1.4 Attacking the Objective

Figure 2.2: Adapted from the cyber kill chain [24, 25], there are two ML-based attack modes: (a) the first mode uses the ML-based model as a weapon to steal controlled information, while (b) in the second mode this model itself is the target. Based on the results of reconnaissance, attackers design the input queries. By querying the target system/service, attackers collect the required accessible data from the inputs and their query outputs. To set up the ground truth, the data are labeled according to the target information. After feature engineering, a training dataset is built with labels to train a supervised ML model. In the first mode, testing samples without labels are fed to the model, whose outputs are the target information. In the second mode, the training dataset is used to reconstruct a model which is itself the attacker's target.

In this thesis, we only consider the ML-based stealing attack as defined in Section 1, targeting user activity information, ML model related information, and authentication information. We summarize the ML-based stealing attack into two attack modes, as illustrated in Figure 2.2. Both attack modes share the same initial actions, which correspond to the first three phases of the MLBSA methodology. Specifically, the attacker first reconnoiters the environment storing the targeted controlled information. The environment provides an interface taking users' queries and responding to them. The attacker designs the input queries and queries the target system/service. As stated in the data collection phase, the inputs and their query results are collected as the required accessible dataset, which reveals the target information. Based on the target information, the ground truth of the dataset is set up in this phase. With proper feature engineering methods, the training dataset is prepared for attacking the objective. The subsequent actions to steal the controlled information using machine learning, however, differ between the two attack modes.

For the first attack mode, shown in Figure 2.2a, the training dataset is used to train an ML model to steal the controlled information. The testing dataset has the same features as the training dataset but is collected from a victim's system/service, and the testing samples are not labeled when querying the attack model. Since the attack model is built to infer the controlled information from these accessible data, the output of the model is the targeted controlled information. This attack mode is applied in the ML-based stealing attack against user activity information, authentication information, and training set information. The literature applies ML algorithms to train classification models such as Logistic Model Tree (LMT) [49], k-Nearest Neighbors (k-NN) [10, 36, 37], Support Vector Machine (SVM) [48, 37], Naive Bayes (NB) [33, 49], Random Forest (RF) [35, 34, 43], Neural Network (NN) [50, 51], Convolutional Neural Network (CNN) [52] and logistic regression [18, 44]; a minimal sketch of this mode follows at the end of this subsection. Apart from these classification models, probabilistic forecasting models are popular for predicting the probability that a guessed password pattern matches the real password. A few probabilistic algorithms are applied, such as Probabilistic Context-Free Grammars (PCFG), Markov models, and Bayesian theory [30, 33, 32].

For the second attack mode, illustrated in Figure 2.2b, the training dataset is used to train an ML model while the model itself is the target of the attack. This attack mode is mostly applied in the ML-based stealing attack against ML model related information. In a black-box setup, the attack stealing the ML model aims at calculating the detailed expression of the model's objective function. Reconstructing the original model is essentially a reconstruction attack [18]. Using the equation-solving and path-finding methods [12, 29], the inputs and their query outputs for solving the specific objective function expression can be interpreted as the training set; therefore, this attack can be regarded as an ML-based attack. Additionally, based on the attacker's inputs and the query outputs, a training set can be synthesized and used to build a substitute model for reconstruction [41, 42]. Several ML algorithms were applied in the literature, such as decision trees [12, 41], SVM [12, 29], NN [12, 29, 42], Recurrent Neural Network (RNN) [32], ridge regression (RR), logistic regression, and linear regression [29, 12, 41]. Moreover, some popular and publicly available tools can be used to train the ML model for the attack, for example, WEKA and monkeyrunner [10, 34]. In summary, even when the model itself is the adversary's objective, the adversary can predict results which reveal the controlled information in the training data using ML techniques.
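As a minimal sketch of the first attack mode, the snippet below trains a Random Forest classifier, one of the algorithms listed above [35, 34, 43], on labeled features and applies it to unlabeled samples collected from a victim; the feature matrices are placeholders for whatever the feature engineering phase produces.

```python
from sklearn.ensemble import RandomForestClassifier

def first_mode_attack(train_features, train_labels, victim_features):
    """First attack mode (Figure 2.2a), sketched: the attack model is the
    weapon, and its predictions on the victim's accessible data are the
    targeted controlled information (e.g. the foreground app, a keystroke,
    or training-set membership)."""
    attack_model = RandomForestClassifier(n_estimators=100, random_state=0)
    attack_model.fit(train_features, train_labels)
    return attack_model.predict(victim_features)
```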

2.1.5 Evaluation

During the evaluation phase, attackers measure how likely they are to successfully steal the controlled information. The evaluation metrics differ between the two attack modes. Under the first attack mode, the evaluation measures the performance of the attack model: the higher the performance of the model, the more powerful the weapon the attacker has built. Under the second attack mode, the evaluation measures the difference between the attack model and the target model: the attack is considered more successful when the attack model is more similar to the target model. The evaluation metrics for the two attack modes are summarized separately.

For the first attack mode, the attack model is the attacker's weapon, and its performance is measured by effectiveness and efficiency. Specifically, metrics like execution time and battery consumption are used for efficiency evaluation, while the metrics most commonly used to measure effectiveness include accuracy, precision, recall, FPR, FNR, and F-measure. Throughout this thesis, several evaluation metrics are derived from the confusion matrix shown in Table 2.1. The evaluation metrics are listed below, followed by a short computational sketch.

Table 2.1: Confusion Matrix for Evaluation.

Predicted \ Actual    Class A                  Class B
Class A               True Positive (TP)       False Positive (FP)
Class B               False Negative (FN)      True Negative (TN)

• Accuracy: Also known as the success rate or inference accuracy [10, 35, 34], accuracy is the ratio of correctly inferred samples to the total number of predicted samples. Accuracy is a generic metric evaluating the attack model's effectiveness. $Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

• Precision: Regarded as one of the standard metrics for attack accuracy [38], precision is the percentage of samples correctly predicted as controlled class A among all samples classified as A. Precision reveals the correctness of the model's performance on a specific class [36, 49, 50], especially when feature values are binary [18]. $Precision = \frac{TP}{TP + FP}$

• Recall: Regarded as another standard metric for attack accuracy [38], recall is also called sensitivity or the True Positive Rate (TPR) [49]. It is the proportion of class A samples correctly predicted as class A. Like precision, recall reveals the model's correctness on a specific class; these two metrics are almost always applied together [36, 49, 18, 38, 43, 51, 44]. $Recall = \frac{TP}{TP + FN}$

• F-measure: This metric, or F1-score, is the harmonic mean of recall and precision, providing a comprehensive analysis of the two [49]. $F\text{-}measure = \frac{2 \times Recall \times Precision}{Recall + Precision}$

• False positive rate (FPR): This metric denotes the proportion of class B samples mistakenly categorized as class A samples. FPR assesses the model's misclassified samples. $FPR = \frac{FP}{TN + FP}$

• False negative rate (FNR): This metric stands for the proportion of class A samples mistakenly categorized as class B samples. Like FPR, FNR assesses the model's misclassified samples from another aspect; FPR and FNR are almost always applied together to measure the model's error rate [49]. $FNR = \frac{FN}{TP + FN}$

• Execution time: The time used in training the model, which indicates the efficiency of the attack model [37, 10, 40].

• Battery consumption: Also known as power consumption [37], this refers to the target mobile device's battery when the target system is a mobile system [10, 36, 37], and indicates the efficiency of the attack model.
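For reference, the effectiveness metrics above follow directly from the four cells of Table 2.1; the sketch below computes them in one place.

```python
def effectiveness_metrics(tp, fp, fn, tn):
    """Derive the effectiveness metrics of Section 2.1.5 from the
    confusion matrix counts of Table 2.1."""
    return {
        "accuracy":  (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall":    tp / (tp + fn),               # a.k.a. sensitivity / TPR
        "f_measure": 2 * tp / (2 * tp + fp + fn),  # harmonic mean of P and R
        "fpr":       fp / (fp + tn),
        "fnr":       fn / (fn + tp),
    }
```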

For the second attack mode, ML-based attacks stealing the ML model are assessed with other metrics. This kind of attack is the ML model reconstruction attack; inherently, the reconstruction attack requires a set of comparison metrics. The target of this kind of attack is an ML model $\hat{f}$ which closely matches the original ML model $f$. Generally, the stolen model $\hat{f}$ will be constructed locally, and its prediction results will be compared to those of the original model with the same inputs. The applied evaluation metrics are defined and listed below:

• Test error: the average error over the same test set $D$, comparing the learned model and the targeted model [12]. A low test error means $\hat{f}$ matches $f$ well. $Error_{test}(f, \hat{f}) = \frac{\sum_{x \in D} \mathrm{diff}(f(x), \hat{f}(x))}{|D|}$

• Uniform error: an estimate of the portion of the full feature space on which the learned model differs from the targeted one, when the testing set $U$ is selected uniformly [12]. $Error_{uniform}(f, \hat{f}) = \frac{\sum_{x \in U} \mathrm{diff}(f(x), \hat{f}(x))}{|U|}$

• Extraction accuracy: indicates the performance of the model extraction attack based on the test error or the uniform error [12]. $Accuracy_{extraction} = 1 - Error_{test}(f, \hat{f})$ (analogously, $1 - Error_{uniform}(f, \hat{f})$)

• Relative estimation error (EE): measures the effectiveness of the model extraction attack by contrasting its learned hyperparameters ($\hat{\lambda}$) with the original hyperparameters ($\lambda$) [29]. $Error_{EE} = \frac{|\hat{\lambda} - \lambda|}{\lambda}$

• Relative mean square error (MSE): measures how well the model extraction attack reconstructs regression models by comparing the mean square errors after learning hyperparameters using cross-validation techniques [29]. $Error_{MSE} = \frac{|MSE_{\hat{\lambda}} - MSE_{\lambda}|}{MSE_{\lambda}}$

• Relative accuracy error (AccE): measures how well the model extraction attack reconstructs classification models by comparing accuracy errors after learning hyperparameters using cross-validation techniques [29]. $Error_{AccE} = \frac{|AccE_{\hat{\lambda}} - AccE_{\lambda}|}{AccE_{\lambda}}$
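A minimal sketch of the test and uniform errors follows, taking $\mathrm{diff}(\cdot,\cdot)$ as the 0/1 disagreement indicator between the two models' predictions; the model interfaces are illustrative assumptions.

```python
import numpy as np

def extraction_errors(target_fn, stolen_fn, test_set, uniform_set):
    """Compute Error_test over a shared test set D, Error_uniform over
    uniformly drawn inputs U, and the resulting extraction accuracy [12]."""
    def error(inputs):
        # fraction of inputs on which the stolen model disagrees
        return np.mean([target_fn(x) != stolen_fn(x) for x in inputs])
    err_test, err_uniform = error(test_set), error(uniform_set)
    return err_test, err_uniform, 1.0 - err_test
```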

The adversary applies these evaluation metrics to determine whether the performance of the attack is satisfactory. If the value of any metric does not meet expectations, the adversary can restart the stealing attack by redefining the targeted controlled information. The stealing attack can be executed incrementally until the attacker obtains satisfactory results.

2.2 Stealing ML Model Related Information

ML model related information consists of the model description, training data information, testing data information, and testing results. In this section, the ML model and users' uploaded training data, both stored in the cloud, are the targets. By querying the model via MLaaS APIs, the prediction/classification results are displayed. The model description and training data information are controlled; otherwise, it would be easy for an attacker to interpret a victim's query results. As most ML services charge users per query [53, 54, 55], this kind of attack may cause huge financial losses [12]. Additionally, several ML models, including neural networks, suffer from adversarial examples: by adding small but intentionally worst-case perturbations to inputs, adversarial examples cause the model to predict incorrect answers [56]. By revealing knowledge of either the model's internal information or its training data, the stealing attack can facilitate the generation of adversarial examples [41, 42]. The generalized attack in this category is illustrated in Fig. 2.3. Leveraging the query inputs and outputs, the model description can be stolen using a model extraction attack or a hyperparameter stealing attack, and the training samples can be stolen using the model inversion attack, the Generative Adversarial Network (GAN) attack, the membership inference attack, and the property inference attack. The countermeasures mitigating these attacks are summarized at the end of this section.

Figure 2.3: The ML-based stealing attack against ML model related information. In this category, ML-based attacks aim at stealing the training samples or the ML model.

2.2.1 Stealing controlled ML model description

It is important to protect the confidentiality of ML models online. If an ML model's knowledge description is stolen, the profit of the MLaaS platform may diminish because of its pay-per-query deployment [12]. If spam or fraud detection is based on ML models [13, 14, 16, 15], understanding the model means that adversaries can evade detection [15]. A specific ML model is defined by two important elements: the ML algorithm's parameters and hyperparameters. Parameters are learned from the training data by minimizing the corresponding loss function. Hyperparameters, which cannot be learned directly from the estimators, balance the objective function between its loss function and its regularization terms. Since the model is controlled, its parameters and hyperparameters should be deemed confidential by nature. The main approaches to stealing these model descriptions are the equation-solving, path-finding and linear least squares methods.

Table 2.2: Stealing Controlled ML Model Description

Reference: [12]
Dataset for Evaluation: Circles, Moons, Blobs; Synthetic, 5,000 records with 2 features (5-class); Synthetic, 1,000 records with 20 features; Steak Survey [57], 331 records with 40 features; GSS Survey [58], 16,127 records with 101 features; Adult (Income/Race) [57], 48,842 records with 108/105 features; Iris [57], 150 records with 4 features; Digits [59], 1,797 records with 64 features; Breast Cancer [57], 683 records with 10 features; Mushrooms [57], 8,124 records with 112 features; Diabetes [57], 768 records with 8 features
Targeted ML Model: Logistic Regression; Decision Tree; SVM; Three-layer NN
Attack Methods: Equation-solving attack; Path-finding attack

Reference: [41]
Dataset for Evaluation: MNIST [60], 70,000 handwritten digit images; GTSRB [61], 49,000 traffic sign images
Targeted ML Model: DNN; SVM; k-NN; Decision Tree; Logistic Regression
Attack Methods: Jacobian-based Dataset Augmentation

Reference: [29]
Dataset for Evaluation: Diabetes [57], 442 records with 10 features; GeoOrig [57], 1,059 records with 68 features; UJIIndoor [57], 19,937 records with 529 features; Iris [57], 100 records with 4 features; Madelon [57], 4,400 records with 500 features; Bank [57], 45,210 records with 16 features
Targeted ML Model: Regression algorithms; Logistic regression algorithms; SVM; NN
Attack Methods: Equation solving

Reference: [42]
Dataset for Evaluation: MNIST [60], 70,000 handwritten digit images
Targeted ML Model: NNs
Attack Methods: Metamodel methods

Stealing Parameters Attack: Model extraction attacks targeting ML models of MLaaS systems were described in [12]. The goal of the model extraction attack was to construct the adversary's own ML model which closely mimics the original model on the MLaaS platform. That is, the constructed ML model can duplicate the functionality of the original one. During the reconnaissance, MLaaS allows clients to access the predictive model in the black-box setting through API calls, so the adversary can only obtain query results. Most MLaaS platforms provide information-rich query results consisting of high-precision confidence values and the predicted class labels, and adversaries can exploit this information to perform the model extraction attack. The first step was collecting confidence values with query inputs. Feature extraction needs to map the query inputs into a feature space of the original training set; feature extraction methods were applied for categorical and numerical features (Table 2.2). Equation-solving and path-finding attacks were then used to recover the objective function of the targeted model. Three popular ML models, listed in Table 2.2, were targeted, while two online services, namely BigML [62] and Amazon ML [53], were compromised as case studies. The key processes of model extraction attacks include query input design, confidence value collection, and attack with equation-solving and path-finding.

To steal the model's parameters, the equation-solving attack and path-finding attack are illustrated in detail. Regarding the attack in [12], equation-solving attacks can exploit confidence values from all logistic models, including logistic regression and NNs, whereas path-finding attacks work on decision tree models. Equation-solving treats the returned class probabilities as equations over the unknown parameters and then solves for the model: the objective function of the targeted ML model was the equation which adversaries aimed to solve, and with several query inputs and their predicted probabilities, the parameters of the objective function were calculated. The path-finding attack exploited ML API specificities, querying specific inputs in order to traverse the decision tree. A path-finding algorithm with a top-down approach located the target model's nodes to reveal the paths of the tree. In this way, the detailed structure of the targeted decision tree classifier was reconstructed.

For the experiments, the attack's performance was measured by extraction accuracy. The online model extraction attack targeted a decision tree model set up by users on BigML [62]; the accuracy was over 86% irrespective of the completeness of queries. In another case study targeting an ML model on Amazon's service, the attacker reconstructed a logistic regression classification model. The results showed that the cost of this attack was acceptable in terms of time consumption (less than 149s) and the price charged ($0.0001 per prediction). The model was learned by calculating its parameters.
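To make the equation-solving idea concrete, the sketch below extracts a binary logistic regression model from confidence-value queries. It is a minimal illustration rather than the exact implementation from [12]; the query_model callable is a hypothetical stand-in for an MLaaS prediction API that returns high-precision positive-class probabilities.

```python
import numpy as np

def extract_logistic_regression(query_model, n_features, n_queries):
    # Design query inputs: random points spanning the feature space.
    X = np.random.randn(n_queries, n_features)
    # Collect confidence values from the victim model.
    p = np.array([query_model(x) for x in X])
    # The sigmoid inverts to a linear equation in the unknowns (w, b):
    #   log(p / (1 - p)) = w . x + b
    y = np.log(p / (1.0 - p))
    # Solve with linear least squares; n_features + 1 well-chosen
    # queries already suffice in the noise-free case.
    A = np.hstack([X, np.ones((n_queries, 1))])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta[:-1], theta[-1]   # extracted weights w and bias b
```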
Apart from reconstructing the exact model parameters, another model extraction attack reveals the model's internal information by building a substitute model, as shown in [41]. Herein, the substitute model shares similar decision boundaries with the target model. During the reconnaissance, adversaries can only obtain labels predicted by the target model for given inputs. To train this substitute model, a substitute dataset is collected using a synthetic data generation technique named Jacobian-based Dataset Augmentation, starting from a small initial set [41]. Specifically, the ground-truth label of each synthetic sample is the label predicted by the target model, while the substitute architecture is selected based on an understanding of the classification task. The best synthetic training set is determined by the substitute model's accuracy and the similarity of decision boundaries. To approximate the target model's boundaries, the Jacobian matrix is used to identify the directions of change in the target model's output. Hence, the model can be reconstructed as a substitute model.

Stealing Hyperparameters Attack: Stealing the hyperparameters in the objective function of a targeted MLaaS model can yield financial benefits [29]. The investigated MLaaS models, such as Amazon ML [53] and Microsoft Azure Machine Learning [54], can be regarded as black boxes providing query results only. By analyzing the model's training process, a key observation showed that the parameters are learned when the objective function reaches its minimum value. That is, the gradient of the objective function at the learned model parameters should be a vector whose entries are all close to zero. According to this observation, the hyperparameters can be learned covertly from a system of linear equations obtained by setting the gradient to the zero vector. A threat model was proposed by [29] in which the attacker acted as a legitimate user of the MLaaS platform, and some popular ML algorithms used by the platform were analyzed (Table 2.2); that is, the attacker knew the ML algorithm in advance. Given the learned model's parameters, the attacker set the gradient vector of the objective function of the non-kernel/kernel algorithm to zero and solved the resulting equations with the linear least squares method to find the hyperparameters. For some black-box MLaaS models, the attacker first applied the equation-solving parameter stealing attack from [12] to learn the parameters. Thus, even though the parameters were unknown beforehand, the attacker could still steal the hyperparameters. Therefore, the target model was reconstructed successfully.

To evaluate the effectiveness of this hyperparameter stealing attack [29], several real-world datasets listed in Table 2.2 were used, and a set of hyperparameters spanning a large range was predefined. The scikit-learn package was applied to implement the different ML models and work out the value of each hyperparameter. For the experimental evaluation, relative mean square error (MSE), relative accuracy error (AccE), and relative estimation error (EE) were applied. The results showed high attack accuracy, with all estimation errors below 10%; this good performance indicated that the attacker successfully stole the target model. As this attack [29] was implemented on MLaaS platforms, three strategies for learning an accurate model at lower cost were compared: uploading the full training set with a specific learning algorithm, uploading a randomly selected part of the training set, and "Train-Steal-Retrain", in which the attacker trains on the partial set, steals the hyperparameter, and then re-learns the model on the entire training set with a specified learning algorithm and the stolen hyperparameter. The third method proved more accurate with less cost and is the best of the three in practical attacks.

Regarding the target model as a black box, its hyperparameters can also be learned by building a metamodel which takes various classifiers' input and output pairs as training data [42]. First, by observing outputs of the target model for given inputs, a diverse set of white-box models is trained by varying the values of various hyperparameters (e.g., the activation function, the existence of dropout or max-pooling layers, etc.). More importantly, these white-box models are expected to be similar to the target model. The training set of the metamodel is collected by querying inputs over these white-box models, while the ground-truth label is the hyperparameter value used by the corresponding white-box model. Afterwards, by querying the target model and feeding its output to the metamodel, its hyperparameter can be predicted. The hyperparameter stealing attack is complementary to the parameter stealing attack.
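As a concrete illustration of the gradient-based idea, the sketch below recovers the regularization hyperparameter of ridge regression from known parameters and training data. This is a minimal example under the paper's setting (known algorithm, with parameters known or previously stolen), not the authors' implementation; the objective form is the standard ridge objective, used here for illustration.

```python
import numpy as np

def steal_ridge_lambda(X, y, w):
    # At the optimum the gradient of
    #   L(w) = ||Xw - y||^2 + lam * ||w||^2
    # vanishes: X^T (Xw - y) + lam * w = 0, a linear system in the
    # single unknown lam, solved with linear least squares.
    a = w.reshape(-1, 1)          # coefficient of lam
    b = -(X.T @ (X @ w - y))      # right-hand side
    lam, *_ = np.linalg.lstsq(a, b, rcond=None)
    return float(lam[0])
```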

2.2.2 Stealing controlled ML model’s training data

Another type of controlled information about an MLaaS product is the training data. Training data is not only useful for constructing the model using the ML algorithms provided by an MLaaS platform, but also sensitive, as the records can contain private information [18, 63]. For example, a user's health diagnostic model is trained on personal healthcare data [38]. Hence, the confidentiality of the model's training data should be protected. The ML-based stealing attacks include the model inversion attack, GAN attack, membership inference attack, and property inference attack. Moreover, several protections are demonstrated: one uses adversarial regularization against the membership inference attack, another aggregates teacher ensembles privately (PATE), and a third utilizes count featurization to protect the models' training data.

Table 2.3: Stealing Controlled ML Model’s Training Data.

Reference | Dataset for Experiment (Description) | Feature Engineering | ML-based Attack Method
[18] | FiveThirtyEight survey (553 records with 332 features); GSS marital happiness survey (16,127 records with 101 features) | N/A | Decision Tree; Regression model
[39] | MNIST [60] (70,000 handwritten digit images); AT&T [64] (400 personal face images) | Features learned with DNN | Convolutional Neural Network (CNN) with GAN
[38] | CIFAR10 [65] (6,000 images in 10 classes); CIFAR100 [65] (60,000 images in 100 classes); Purchases [66] (10,000 records with 600 features); Foursquare [67] (1,600 records with 446 features); Texas hospital stays [68] (10,000 records with 6,170 features); MNIST [69] (10,000 handwritten digit images); Adult (income) [57] (10,000 records with 14 attributes) | Shadow model results regarded as features; records labeled as in/out | NN
[43] | The 6 datasets in [38] (as above); News [70] (20,000 newsgroup documents in 20 classes); Face [71] (13,000 faces from 1,680 individuals) | Shadow model results regarded as features; records labeled as in/out | Random Forest; Logistic Regression; Multilayer perceptron
[51] | Adult (income) [57] (299,285 records with 41 features); MNIST [60] (70,000 handwritten digit images); CelebFaces Attributes [72] (more than 200K celebrity images); Hardware Performance Counters (36,000 records with 22 features) | Neuron sorting; set-based representation | NN
[44] | Face [71] (13,233 faces from 5,749 individuals); FaceScrub [73] (76,541 faces from 530 individuals); PIPA [74] (60,000 photos of 2,000 individuals); Yelp-health, Yelp-author [75] (17,938 reviews; 16,207 reviews); FourSquare [67] (15,548 users in 10 locations); CSI corpus [76] (1,412 reviews) | N/A | Logistic regression; gradient boosting; Random Forests
[77] | CIFAR100 [65] (60,000 images in 100 classes); Purchase100 [66] (197,324 records with 600 features); Texas100 [68] (67,330 records with 6,170 features) | Shadow model results regarded as features; records labeled as in/out | NN

Model Inversion Attack & Defense: The model inversion attack was developed by [18] by exploiting commercial MLaaS APIs and leveraging the confidence information returned with predictions. Although an earlier model inversion attack proposed in [63] leaked sensitive information from an ML model's training set, that attack did not work well in other settings, e.g., when the training set has a large number of unknown features. In contrast, the attack proposed in [18] aimed to be applicable in both the white-box setting and the black-box setting. In the white-box setting, an adversarial client had prior knowledge of the model's description, as the APIs allowed. In the black-box setting, the adversary was only allowed to make prediction queries on the ML APIs with some feature vectors. The confidence values, regarded as the useful data for the attack, were extracted from the ML APIs by making prediction queries. The attacks were implemented in two case studies — inferring features of the training dataset, and recovering training samples of images. The model inversion attack targets the ML model's training data under both settings.

The first attack was inferring sensitive features of the inputs from a decision tree classifier. BigML [62] was used to reveal the decision tree's training and querying routines. With query inputs of different features and the corresponding confidence values, the attacker in [18] accessed marginal priors for each feature of the training dataset. In the black-box setting, the attacker utilized the inversion algorithm from [63] to recover the target's sensitive feature with weighted probability estimation; a confusion matrix was used to assess the attack. In the white-box setting, the white-box with counts (WBWC) estimator was used to guess the feature values. Evaluated on the General Social Survey (GSS) dataset [78], the white-box inversion attack on the decision tree classifier achieved 100% precision, while the black-box one achieved 38.8% precision. Additionally, the attack in the white-box setting achieved 32% less recall than that in the black-box setting. Compared with black-box attacks, white-box inversion attacks show a significant advance in feature leakage, especially in precision.

The second attack was recovering images from an NN model — a facial recognition service — accessed via APIs. Learning the training samples required stealing the recognition model first.

Two specific model inversion attacks were proposed in [18]: reconstructing a victim's image given a label, and determining whether a blurred image exists in the training set. Specifically, the model inversion attack for facial recognition (MI-Face) method and the Process-Denoising (Process-DAE) algorithm were used to perform the attacks [18]. Herein, the query inputs and confidence values were used to refine the image. The best reconstruction performance in the evaluation was 75% overall accuracy and an 87% identification rate. Moreover, the attacker employed a maximum a posteriori estimator to estimate the effectiveness. The evaluation results showed that the proposed attacks enhanced inversion attack efficacy significantly compared with the previous attack [63]. The training images were recovered accurately.

Stealing the Training Data of a Deep Model with a GAN: An attack against privacy-preserving collaborative deep learning was designed to leak the participants' training data, which might be confidential [39]. A distributed, federated, or decentralized deep learning algorithm can process each user's training set by sharing a subset of parameters obfuscated with differential privacy [79, 52]. However, the training dataset leakage problem had not been solved by the collaborative deep learning model [79]. An adversary can deceive the model with incorrect training samples to induce other participants to leak more local data. Then, by leveraging the nature of the learning process, the adversary can train a GAN to steal others' training samples. The GAN attack targets collaborative deep learning.

Specifically, the GAN simulated the original model in the collaborative learning process to leak the targeted training records [39]. During the reconnaissance phase, the adversary pretended to be one of the honest participants in collaborative deep learning, so that the adversary could influence the learning process and induce the victims to release more information about the targeted class. To collect the valuable data, the adversary did not need to compromise the central parameter server; instead, the adversary inferred meaningful information about that class from the victim's changed parameters. In addition, as a participant building up the targeted model, the adversary knew part of the training samples, and a fake training dataset could be sampled randomly from other datasets. The true training dataset and a fake training dataset were collected to train the discriminator of the GAN using a CNN; the outputs of this discriminator and another fake training dataset were used to train the generator of the GAN, also using a CNN. Since the features of the training data were known by default, the adversary sampled the targeted training data with the targeted label and random feature values. This fake sample was fed into the generator model, and the adversary modified its feature values until the predicted label was the targeted label. The final modification of this fake sample was regarded as the target training sample. In the experiments, the GAN attack against collaborative learning was evaluated with the MNIST [60] and AT&T [64] datasets as inputs. Compared with the model inversion attack, the discriminator within the GAN attack reached 97% accuracy and clearly recovered the MNIST images used to train the collaborative CNN. In a word, the GAN attack trained the discriminator and generator to steal the training data.
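The inversion step behind MI-Face style attacks can be sketched as gradient descent over the input rather than the weights. The snippet below is a minimal, hedged illustration of that idea, not the authors' implementation; face_model is a hypothetical white-box classifier returning a vector of class probabilities, and Process-DAE's denoising post-processing is omitted.

```python
import torch

def invert_class(face_model, target_class, shape, steps=500, lr=0.1):
    x = torch.zeros(shape, requires_grad=True)
    for _ in range(steps):
        # Minimize the negative log-likelihood of the target label
        # with respect to the *input*, not the model weights.
        loss = -torch.log(face_model(x)[target_class])
        loss.backward()
        with torch.no_grad():
            x -= lr * x.grad
            x.clamp_(0.0, 1.0)   # keep a valid image
            x.grad.zero_()
    return x.detach()
```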
Membership Inference Attack: Learning whether a specific data record was a member of the training set of a targeted MLaaS model was the goal of [38]. Since the commercial ML models provided by Google and Amazon only allowed black-box access, not only the training data but also the training data's underlying distribution were controlled. Though the training set and the corresponding model were unknown, the output for a given input revealed the model's behavior. By analyzing such behaviors, adversaries found that an ML model behaves differently on inputs it was trained on compared to inputs which are new to the model. According to this observation, an attack model was trained which could recognize such differences and determine whether an input record was a member of the targeted training set or not. The attack is intended to recognize the model's behavior when tested with a target training sample.

The attack model was constructed by leveraging a shadow training technique [38]. Specifically, multiple "shadow models" were built to simulate the targeted model's behavior, for which the ground-truth membership of their inputs is known. All shadow models used the same service (i.e., Amazon ML) as the targeted model. The training data that the adversary used can be generated by model-based synthesis or statistics-based synthesis; the generated dataset shares a similar distribution with the target model's training set, while the testing set is disjoint from the training set. Querying these shadow models with their training and testing sets, the prediction results were labeled in or out, and these records were collected as the attack model's training set. The adversary then used the resulting binary classifier to determine whether a specific data record was in or out of the training set of the MLaaS model. Such an offline attack is difficult to detect, as the MLaaS system would consider the adversary a legitimate user who is simply querying online. Shadow models were trained to produce the inputs for the membership inference attack.

For the evaluation of this membership inference attack [38], several public datasets were used, as listed in Table 2.3. Three targeted models were constructed with the Google Prediction API, Amazon ML, and a CNN respectively. The evaluation metrics were accuracy, precision, and recall. According to the results, the Google Prediction API suffered the biggest training data leakage under this attack. The accuracy of the attack model was above the 50% baseline (random guessing) in all experiments, precision was over 60% throughout, and recall was close to 100%. The membership inference attack learned the training samples effectively.

For mitigation, since overfitting is the most important reason that makes an ML model vulnerable to the membership inference attack [38], regularization techniques can be applied to resolve the overfitting problem [80, 81, 82]. Another three mitigation strategies are restricting the prediction vector to the top k classes, coarsening the precision of the prediction results, and increasing the entropy of the prediction vector for NN models [83]. The first method, unfortunately, cannot fully prevent the membership inference attack; the last two obfuscate prediction vectors to mitigate the leakage. Such restriction and obfuscation protect the training set only to a limited extent.

In 2019, [43] further studied the membership inference attack to make it broadly applicable at low cost. Specifically, three assumptions made in [38] are relaxed: using multiple shadow models, synthesizing the dataset from a distribution similar to the target model's training set, and knowledge of the target model's learning algorithm. The results show that the attack's performance is not affected when only one shadow model is trained on a dataset from another distribution. The results of using a different classification algorithm on one shadow model are not promising.
However, by combining a set of ML models trained with various algorithms into one shadow model, the performance of the membership inference attack can be made tolerable (above 85% in precision and recall). Herein, the attack is based on the assumption that one model of the model set is trained with the learning algorithm used by the target model. Furthermore, by selecting a threshold on the posterior results to determine the input data's membership, no shadow model at all is needed for the membership inference attack. Therefore, the scope of the membership inference attack is enlarged.
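To make the shadow-training technique concrete, the sketch below builds an attack dataset from shadow models' prediction vectors and trains a binary in/out classifier. It is a minimal illustration under the assumptions above, not the implementation from [38]; the shadow models and their member/non-member splits are assumed to be supplied by the caller.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def build_attack_dataset(shadow_models, shadow_splits):
    features, labels = [], []
    for model, (train_x, test_x) in zip(shadow_models, shadow_splits):
        # Members of the shadow training set are labeled "in" (1)...
        features.extend(model.predict_proba(train_x))
        labels.extend([1] * len(train_x))
        # ...and disjoint shadow test samples are labeled "out" (0).
        features.extend(model.predict_proba(test_x))
        labels.extend([0] * len(test_x))
    return np.array(features), np.array(labels)

def train_attack_model(shadow_models, shadow_splits):
    X, y = build_attack_dataset(shadow_models, shadow_splits)
    # The attack model learns the behavioral difference between
    # prediction vectors for members and non-members.
    return MLPClassifier(hidden_layer_sizes=(64,)).fit(X, y)
```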

Property Inference Attack: Different from learning a specific training record, the property inference attack targets properties of the training data that the model producer did not intend to share. [51] defines the target model as a white-box fully connected neural network (FCNN), and aims to infer global properties such as a higher proportion of women in the training data. To launch this attack, which takes a model as input, a meta-classifier is built to predict whether the global property exists in the model's training set or not. First, several shadow models are trained on a similar dataset using similar training algorithms to mimic the target FCNN. During the feature engineering phase, instead of using a flattened vector of all parameters [84], [51] applied a set-based representation to form the meta-training set. Specifically, the set-based representation is learned using the DeepSets architecture [85]: 1) flatten each node's parameters across all hidden layers, 2) obtain a node representation with a node processing function based on the target property, 3) sum a layer representation with layer summation, and 4) concatenate these layer representations into a classifier representation. The accuracy of this attack reached 85% or more on binary income prediction, smile prediction, and gender classification tasks. This property inference attack against white-box FCNNs is effective in stealing training set information.

In collaborative learning, leaking unintended features of participants' training data is another kind of property inference attack [44]. Instead of global properties, the unintended feature targeted holds for a certain subset of the training set, or may even be independent of the model's task; for example, the attacker infers a race-related property of the training faces while the joint model learns a gender classifier in a federated manner. In the reconnaissance process, the adversary, as a participant, can download the current joint model at each iteration of the collaborative learning. The aggregated gradient updates from all participants are computed; thereafter, the adversary can learn the aggregated updates other than his own [86]. Since the gradients of one layer are calculated based on this layer's features and the previous layer's error, such aggregated updates can reflect the feature values of other participants' private training sets. After several iterations, these updates are labeled with the targeted property and used to build a batch property classifier. Given model updates as inputs, this classifier can predict the corresponding unintended features effectively (most precisions larger than 80%). Therefore, collaborative learning is vulnerable to the property inference attack as well.

Protection using Adversarial Regularization: A protection for black-box MLaaS models against the membership inference attack was introduced in [77]. As described in [38], the membership inference attack can learn whether a data sample was a member of the targeted model's training set, even if the adversary only knows the queried output of the cloud service. Regularizing the ML model with L2-norm regularizers was one of the major mitigation methods [38, 18], but it was not considered to offer a rigorous defense. On the other hand, researchers concluded that differential privacy mechanisms prevented this information leakage only at the cost of the model's usability. To rigorously guarantee the confidentiality and privacy of the training set, a privacy mechanism more powerful than regularization and differential privacy was proposed.
In [77], the defender's objective was first analyzed by formalizing the membership inference attack. Precisely, the input of the inference model consisted of a test record for the targeted classifier, its prediction vector, and a membership label. The adversary aims to maximize his inference gain, which is affected by the targeted training dataset and a disjoint reference dataset for attack training. Therefore, the defender intends to minimize the adversary's inference gain while also minimizing the loss of the targeted classifier's performance. That is, the defender enhances the security of the ML model by training it in an adversarial process: the inference gain, as an intermediate result, is used as the classifier's regularizer to revise the ML model over several training epochs. Adversarial regularization was thus used in training the classifier.

To evaluate the defense mechanism, three datasets commonly used for membership inference attacks [38] were adopted, and the classifier's loss was calculated when the attacker's inference gain reached its highest score [77]. The results showed that the classification loss was reduced from 29.7% to 7.5% with the defense compared to without it for the Texas model, a difference that could be insignificant. For the membership inference attack, the attack accuracy against the protected ML model was around 50%, close to random guessing. In a word, protecting the model using adversarial regularization guaranteed the confidentiality and privacy of its training data. The protection proposed in [77] was powerful against the membership inference attack; however, its effectiveness in protecting training data leaked by other attacks remains unknown. Additionally, it did not discuss whether adversarial regularization can protect white-box MLaaS models from the membership inference attack. Moreover, this defense cannot deal with an online attack such as stealing the training data of a deep model with a GAN [39].

Protection using PATE: To protect the training set of an ML model more generally, Private Aggregation of Teacher Ensembles (PATE) was proposed by [87]. Specifically, PATE prevents training set information leakage from the model inversion attack, GAN attack, membership inference attack, and property inference attack. Two kinds of models are trained in this general ML strategy: "teacher" and "student" models. Teacher models are trained on the sensitive data directly and are not published. The sensitive dataset is split into several partitions, and a teacher model is trained on each partition independently. These teacher models are deployed as an ensemble making predictions in a black-box manner: given an input, their predictions are aggregated into a single prediction based on each teacher's vote. To prevent the aggregation from directly revealing the teachers' votes, Laplacian noise is added to the vote counts. The student obtains a set of public data without ground truth and labels it by querying the teacher ensemble; the student model can then be built in a privacy-preserving manner by transferring the knowledge from the teachers. Moreover, the variant PATE-G uses the GAN framework to train the student model with a limited number of labels from the teachers. In conclusion, the PATE framework provides a strong privacy guarantee for the model's training set.
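The noisy aggregation step at PATE's core is small enough to sketch. The snippet below is a minimal illustration, not the reference implementation; the teachers list is a hypothetical stand-in for the trained ensemble, each element mapping an input to a predicted class index.

```python
import numpy as np

def pate_label(x, teachers, n_classes, laplace_scale=1.0):
    # Each teacher votes on the public, unlabeled sample x.
    votes = np.zeros(n_classes)
    for teacher in teachers:
        votes[teacher(x)] += 1
    # Laplacian noise on the vote counts hides any single teacher's
    # (and hence any single data partition's) influence on the label.
    noisy_votes = votes + np.random.laplace(0.0, laplace_scale, n_classes)
    return int(np.argmax(noisy_votes))
```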
Protection using Count Featurization: A limited-exposure data management system named Pyramid enhanced the protection of organizations' training data storage [88]. It mitigated the data breach problem by limiting the amount of widely accessible training data and constructing a selective data protection architecture. For emerging ML workloads, the selective data protection problem was formalized as a training set minimization problem: minimizing the training set limits how much data can be stolen. In prior data management [89], only in-use data was retained in accessible storage for periodic ML training, whereas unused data was kept in a protected area. However, with the rapid adoption of ML mechanisms, the whole dataset would be continuously exposed through accessible storage [88]. Given this concern, distinguishing and extracting the data necessary for effective training was the key process.

The workflow of Pyramid keeps accessible raw data within a small rolling window. The core method, named "count featurization", is used to minimize the training set: the counts summarize the historical aggregated information from the collected data, and Pyramid trains the ML model on the raw data featurized with counts in a rolling window. The counts are rolled over and infused with differential privacy noise to protect the training set [90]. In addition, the balance between training set minimization and model performance (accuracy and scalability) should also be considered. Three specific techniques retrofit count featurization for data protection: weighted noise infusion adds less noise to noise-sensitive features of the training set; an unbiased private count-median sketch solves the negative bias problem arising from the noise infusion; and automatic count selection finds useful features automatically and counts them together. For training data protection, count featurization retains only the necessary data in data storage, and Pyramid prevents the attacker from learning the extracted information from the training set.

Table 2.4: Categories of stealing ML related information attacks from three perspectives (info: information).

(Columns are grouped as Attack Targets: Model Info, Training Set Info; Attack Surfaces: Training Phase, Inference Phase; Attacker's Capabilities: Black-box Access, White-box Access.)

Attack Type | Model Info | Training Set Info | Training Phase | Inference Phase | Black-box Access | White-box Access
Model extraction attack [12] | YES | no | no | YES | YES | no
Model extraction attack [41] | YES | no | no | YES | YES | no
Hyperparameter stealing attack [29] | YES | no | no | YES | YES | no
Hyperparameter stealing attack [42] | YES | no | no | YES | YES | no
Black-box inversion attack [18] | no | YES | no | YES | YES | no
White-box inversion attack [18] | no | YES | no | YES | no | YES
GAN attack [39] | no | YES | YES | no | no | YES
Membership inference attack [38] | no | YES | no | YES | YES | no
Membership inference attack [43] | no | YES | no | YES | YES | no
Property inference attack [51] | no | YES | no | YES | no | YES
Property inference attack [44] | no | YES | YES | no | no | YES

Table 2.5: Attack’s prior knowledge under black-box access and white-box access.

(Predicted Label and Predicted Confidence describe prior knowledge under black-box access; Parameters and Hyper-parameters describe prior knowledge under white-box access.)

Attack Type | Predicted Label | Predicted Confidence | Parameters | Hyper-parameters
Model extraction attack [12] | YES | YES | no | no
Model extraction attack [41] | YES | no | no | no
Hyperparameter stealing attack [29] | YES | YES | no | no
Hyperparameter stealing attack [42] | YES | YES | no | no
Black-box inversion attack [18] | YES | YES | no | no
White-box inversion attack [18] | YES | YES | YES | YES
GAN attack [39] | YES | YES | YES | YES
Membership inference attack [38] | YES | YES | no | no
Membership inference attack [43] | YES | YES | no | no
Property inference attack [51] | YES | YES | YES | YES
Property inference attack [44] | YES | YES | YES | YES

Summary: In Section 2.2, ML-based stealing attacks against model related information target either the model description or the model's training data. In addition to this categorization, as shown in Table 2.4, attacks can be grouped by whether they occur at the training or inference phase, and by whether they require black-box or white-box access [91]. Model extraction attacks [12, 41] and hyperparameter stealing attacks [29, 42] leak the model's internal information at the inference phase. Attackers steal the model's training data mostly at the inference phase, except for the GAN attack [39] and the property inference attack [44], which happen during the training phase of collaborative learning. When attacking during the training phase, attackers with white-box access to the model can exploit its internal information. As shown in Table 2.5, white-box access gives attackers more prior knowledge than black-box access, which results in higher performance of the stealing attack [18]; on the other hand, black-box attacks are more applicable in the real world. Additionally, except for [43], most attackers in this category under black-box access know the learning algorithm of the target model [12, 29, 41, 42, 18, 50].

Countermeasures: Across the ML pipeline, protection methods can be applied in the data preprocessing phase, the training phase, and the inference phase respectively. Differential privacy noise used in the first phase can build a privacy-preserving training set [88]. Differential privacy is the most common countermeasure against stealing attacks; however, on its own it cannot prevent the GAN attack [39]. Differential privacy, regularization, dropout, and rounding techniques are popular protections at the training and inference phases. At the training phase, differential privacy on parameters cannot resist the GAN attack [39], while rounding parameters is not effective against the hyperparameter stealing attack [29]. The effectiveness of regularization against the hyperparameter stealing attack depends on the targeted algorithm [29].

2.2.3 ML-based Attack on Audio Adversarial Example Generation

Stealing a controlled ML model's information can boost another ML-based attack: adversarial example generation. Generally, audio adversarial examples are generated by adding intentionally small or imperceptible perturbations to benign examples. The aim of such adversarial examples is to mislead the model into predicting an incorrect answer; when this incorrect answer is chosen by the adversary, we call the attack a targeted adversarial attack. In this thesis, we only consider audio adversarial example generation against an automatic speech recognition (ASR) model.

Hidden Voice Commands: Carlini et al. [92] generated audio adversarial examples to hide voice commands under two threat models: the black-box model and the white-box model. In the black-box model, the attacker first extracts the benign audio's acoustic information through a transform function such as Mel-Frequency Cepstral Coefficients (MFCC), then inverts the MFCC and adds noise to generate an adversarial audio. Querying the target ASR model with this adversarial audio, if the machine does not recognize the target phrase, the attacker refines the added noise and generates another adversarial audio, until the target phrase is recognized. Meanwhile, the adversarial audio should not be recognizable by human beings. In the white-box model, the attacker has full knowledge of the ASR model's internal information. Targeting the open-source CMU Sphinx speech recognition system [93], the attacker can utilize the coefficients of each Gaussian in the Gaussian Mixture Model (GMM) and some dictionary files containing the mapping from words to phonemes. With this information, the attacker can identify which MFCC coefficients need to be modified with noise to generate the adversarial audio. Herein, gradient descent is used for each frame to find the locally optimal noise to add to the benign audio.

Inaudible Attack: Carlini et al. [94] and Zhang et al. [95] improve the quality of adversarial audio during the generation process. Specifically, the former quantifies the distortion of the perturbation with white-box access: measuring relative loudness on a logarithmic scale, the distortion of the perturbation is accounted for in the optimization objective. Targeting the open-source DeepSpeech end-to-end ASR model, the Connectionist Temporal Classification loss (CTC-loss) and gradient descent are used to refine the perturbation added to the benign audio. The latter research used the same method as [92]; however, instead of adding noise to the benign audio directly, they hide the voice commands on ultrasonic carriers to ensure the adversarial audio's inaudibility. Another work by Qin et al. [96] designed audio adversarial examples using a method similar to [94] with effectively imperceptible noise; specifically, they utilized the psychoacoustic principle of auditory masking so that only machines can recognize the commands.
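The iterative optimization loop shared by these white-box attacks can be sketched compactly. The snippet below is a minimal, hedged illustration of CTC-loss-based targeted perturbation, not any of the cited implementations; asr_model and ctc_loss_fn are hypothetical stand-ins for a differentiable ASR model and its CTC loss against the target transcription, and the simple clipping stands in for the papers' more careful distortion metrics.

```python
import torch

def generate_adversarial_audio(audio, target, asr_model, ctc_loss_fn,
                               epsilon=0.05, steps=1000, lr=1e-3):
    delta = torch.zeros_like(audio, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        logits = asr_model(audio + delta)
        # Push the transcription toward the target phrase while
        # penalizing the average magnitude of the perturbation.
        loss = ctc_loss_fn(logits, target) + delta.abs().mean()
        loss.backward()
        optimizer.step()
        # Keep the perturbation bounded so it stays quiet.
        delta.data.clamp_(-epsilon, epsilon)
    return (audio + delta).detach()
```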

Over-the-air Attack: Different from white-box and black-box attacks over-the-line, an over-the-air attack must account for the environmental noise added to the generated adversarial audio. Both Yakura and Sakuma [97] and Schonherr et al. [98] analyzed room impulse responses during the adversarial example generation phase. In particular, Schonherr et al. [98] applied psychoacoustic methods to the noise signal to ensure the robust adversarial audio's inaudibility.

Summary: In audio adversarial attacks, the attacker usually utilizes gradient descent to refine the noise added to the benign audio under white-box access. With black-box access, the attacker regards the ASR model as an opaque oracle [92] and typically inverts the benign audio's MFCC information. Psychoacoustic masking and ultrasonic carriers are used to produce high-quality adversarial audio, whose hidden commands or noise are inaudible to human beings. One defense against audio adversarial example generation is alerting users to the machine's recognized commands. Another common countermeasure is to learn human perception of speech and build an ASR model robust to adversarial examples [98].

2.3 Stealing User Activities Information

It is essential for security specialists to protect user activities information, not only because private activities are valuable to adversaries, but also because an adversary can exploit specific activities (e.g., the foreground app) to perform malicious attacks such as phishing [10]. In general, attackers pursue two types of data — kernel data and sensor data, as shown in Fig. 2.4. We organize the reviewed papers according to the MLBSA methodology; the countermeasures against this kind of ML-based stealing attack are discussed at the end of Section 2.3. Using kernel data and sensor data, controlled user activities information was stolen through timing analysis and frequency analysis.

Figure 2.4: The ML-based stealing attack against user activities information.

2.3.1 Stealing controlled user activities from kernel data

The dataset collected from the kernel about system process information is too noisy and coarse-grained to disclose any intelligible and valuable information directly. However, by analyzing plenty of such data, the adversary can deduce confidential information about the victim's activities with the help of ML algorithms.

Stealing User Activities with Timing Analysis: The security implications of kernel information were evaluated in [10, 36] by monitoring specific hardware components integrated into Android smartphones. During the reconnaissance phase, user activity information records the user's interactions with hardware devices before they are responded to by the kernel layer. The targeted user activities in [10] were unlock patterns and foreground apps; moreover, users' browsing behavior was targeted by the attacker in [36]. One kind of kernel data accessible to legitimate users logs the time series of hardware interrupt information, which can reveal previous activities.

Table 2.6: Stealing Controlled User Activities using Kernel Data

Reference | Dataset for Experiment (Description) | Feature Engineering | ML-based Attack Method
[10] | Collected from procfs (interrupt data for unlock patterns and for apps) | Deduplication; interpolation; interrupt increment computation; gram segmentation; DTW | HMM with Viterbi algorithm; k-NN classifier with DTW
[36] | Collected from procfs (time series for apps, websites, keyboard inputs) | Automatic extraction with tsfresh; DTW | Viterbi algorithm with DTW; SVM classifier with DTW
[37] | Data about apps; 1,000 website traces (1,200 x 6 time series of 120 apps (App Store + iOS) + 10 traces x 6 time series; 10 traces for each website) | Manually defined; SAX, BoP representation | SVM classifier; k-NN classifier with DTW
[48] | Collected from procfs (consecutively read data; resident size field) | N/A; construct a histogram binning data into seven equal-weight bins | SVM classifiers

Specifically, the reported interrupts imply the real-time running status of specific hardware (e.g., the touchscreen controller). However, access to similar process-specific information has been progressively restricted since Android 6, and the interrupt statistics became unavailable in Android 8 [36]. Different Android versions expose different kinds of process information to legitimate users without permissions under the proc filesystem (procfs). Thus, an app was developed in [36] to search all accessible process information under procfs. The time series of these accessible data could distinguish the events of interest, including the unlocked screen, the foreground app, and the visited website. Reconnaissance showed the value of the time series of data in procfs.

During data collection, the interrupt time logs were gathered by pressing and releasing the touchscreen in [10]. Specifically, for versions prior to Android 8, a variety of interrupt time series recording the changes of the electrostatic field from the touchscreen were gathered as one dataset for stealing the user's unlock pattern. Another dataset was built for stealing foreground apps' information by recording the time series of app launches from accessible sources like interrupts from the Display Sub-System [10] and the virtual memory statistics [36]. Moreover, the time series of some network process information fingerprinted online users; these fingerprints were gathered as the dataset for stealing the user's web browsing information. Different sets of time series were prepared with respect to the information of different user activities.

In terms of feature engineering, the attacker can analyze the process information in procfs to study the characteristics of the user's unlock pattern, foreground app status, and browsing activity. The datasets were first processed by deduplication, interpolation, and increment computation. The distinct features of the three datasets were constructed via several methods such as segmentation, similarity calculation, and dynamic time warping (DTW). An automatic method named tsfresh [99] was utilized for feature extraction. Subsequently, for the stealing attack targeting unlock patterns, a Hidden Markov Model (HMM) with the Viterbi algorithm [100] was used to infer the unlock patterns; the evaluation results showed that its success rate outperformed random guessing significantly. Targeting foreground apps, the processed data was used to train a k-NN classifier.
For the evaluation, the results showed that the classifier had high accuracy, achieving 87% on average in [10] and 96% in [36]. To reveal the user's browsing activities, an SVM classifier was used to mount the attack; both precision and recall were above 80% in [36]. Among these three attack scenarios, the battery and time consumption were acceptable (less than 1% and under 6 minutes). The ML-based stealing attack showed its effectiveness with low time and battery consumption.
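Since DTW underpins both the feature construction and the k-NN classification above, a minimal sketch of the measure is given below. This is the textbook dynamic-programming formulation, not the exact variant used in [10, 36].

```python
import numpy as np

def dtw_distance(a, b):
    # Dynamic time warping: the cheapest monotone alignment cost
    # between two (possibly different-length) time series.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Each cell extends the cheapest alignment so far.
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]
```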

Stealing User Activities with an iOS Side-channel Attack: In iOS systems, the popular Linux side-channel attack vector for process information — procfs — is inaccessible, which hinders the aforementioned attacks from leaking sensitive information. Attackers have therefore actively looked for new OS-level resources to exploit. In the reconnaissance phase, several attack vectors feasible on Apple devices were applied to perform cross-app information leakage [37]. Specifically, three iOS vectors enabled apps to access global usage statistics without requiring any special permissions: the memory information, the network-related information, and the file system information. The attacker aimed to steal user activity information (such as foreground apps, visited websites, and map searches) and in-app activities (such as online transactions).

To collect data for an ML-based attack, attackers manually collected several data traces for the events of interest, such as foreground apps, website footprints, and map searches. To improve the performance of the inference attack, the information collected from multiple attack vectors was combined and fed into the ML models; in particular, time series data from the targeted vectors were exploited frequently.

As for feature engineering and the stealing attacks, ML frameworks were utilized to exfiltrate the user's information from the accessible vectors [37]. Changes in the time series are reflected in the difference between two consecutive data traces. Feature processing methods were applied to transform the sequences into Symbolic Aggregate approXimation (SAX) strings [45] and to construct the Bag-of-Patterns (BoP) of the sequences. In [37], two ML-based attacks with a large amount of data were presented — classifying the user activities and detecting the sensitive in-app activities. An SVM classifier was trained and tested for the former attack, while the Viterbi algorithm [100] with DTW was utilized for the latter. In the evaluation of the first attack, stealing three users' activities, the foreground app classification accuracy achieved 85.5%, Safari website classification accuracy reached 84.5%, and map search inference reached 79% accuracy. The proposed attacks could be trained on the attacker's device and tested on other devices such as the victim's. Meanwhile, the power consumption was acceptable, with only 5% extra power used in an hour, and the attacks' execution time was tolerable as well (within 19 minutes). In the context of stealing user activities information, ML-based attacks exploited OS-level data with time series analysis.

2.3.2 Stealing controlled user activities using sensor data

The stealing attack using sensor data should be studied seriously by defenders, not only because of the effectiveness of ML mechanisms, but also because of the popularity of sensing-enabled applications [101, 102, 103, 104, 105]. Sensor information, such as acoustic and magnetic data, can reveal controlled information indirectly, as demonstrated by the following stealing attack.

Table 2.7: Stealing Controlled User Activities using Sensor Data

Reference | Dataset for Experiment (Description) | Feature Engineering | ML-based Attack Method
[47] | Audio signature dataset (recorded with a phone put within 4 inches of the printer) | STFT, noise normalization | A regression model
[49] | Sensor dataset (sensor data collected from benign and malicious activities) | N/A | Markov Chain, NB, LMT (alternative algorithms, e.g., PART)

Stealing a Machine's Activities with a Sensor-based Attack: A side-channel attack on manufacturing equipment was proposed by [47], exploiting sensor data collected by mobile phones to reveal the equipment's design and manufacturing process. The attacker managed to reconstruct the products manufactured by the targeted equipment. During the reconnaissance, the adversary placed an attack-enabled phone near the targeted equipment, such as a 3D printer. The accessible acoustic and magnetic information reflected the product's manufacturing activities indirectly. During the data collection phase, the acoustic and magnetic sensors embedded in the phone recorded audio and magnetometer data from the manufacturing equipment. The magnetometer data was transformed into a type of acoustic information, and these acoustic signals were combined into the training dataset. Hence, acoustic and magnetic data can be leveraged by the attack.

After the dataset was gathered, the ML-based attack in [47] was completed by feature engineering, attacking with model training, and evaluation. The features were extracted from the audio signal's frequency with the help of the short-time Fourier transform (STFT) and noise normalization [47]. With features constructed, the product's manufacturing process could be inferred by an ML model, especially for 3D printers; a regression algorithm was used to train the model for this attack. In the experiments, the adversaries tested the reconstruction of a star, a gun, and an airplane printed by a 3D printer. All products were reconstructed except the airplane, which looked more like a "fish mouth" [47]. As for the difference in angles between the original product and the reconstructed one, the differences of all angles were within one degree on average, which is acceptable. A defense against this kind of attack was also proposed by [47]: obfuscating the acoustic leakage by adding noise (e.g., playing recordings) during production. Sensor-based attacks build their models by analyzing the frequencies emitted by the manufacturing equipment, but noise injection can mitigate such attacks to some extent.

Summary: ML-based attacks in Section 2.3 steal user activities information from operating systems. According to the data sources, there are two kinds of attacks — using kernel data and using sensor data. Kernel data reveals system-level behaviors of the target system, while sensor data reflects the system's reactions to the specific functionality used [10]. The kernel data is analyzed by the adversary along the time dimension, while the sensor data is exploited with frequency analysis.

Countermeasures: Regarding protection mechanisms, differential privacy is an important method against attacks stealing user activities information; for example, [48, 106] applied noise to an accessible data source (like Android kernel log files). Another kind of solution is to restrict access to the accessible data [37]. It is also effective to build a model to detect potential stealing threats, as in [49]. In-depth research on protecting user activities information could explore differential privacy applications or a management system design for kernel files and sensor data. Noise injection and access restriction are two effective protections, and detection can raise alerts about the stealing attack.

2.4 Stealing Authentication Information

Authentication information is one of the most important security factors when accessing information from services or mobile applications. In Section 2.4, the controlled authentication information mainly comprises keystroke data, secret keys, and password data. As shown in Fig. 2.5 and Fig. 2.6, classification models or probabilistic models are trained to steal the controlled authentication information. The protections against these stealing attacks are summarized in this subsection.

2.4.1 Stealing controlled keystroke data for authentication

The dataset collected from a device's sensors can be used to infer controlled keystroke information, as depicted in Fig. 2.5. The keystroke data contains information about user authentication, especially for keystroke authentication [107, 108, 109, 110]. Leveraging acceleration, acoustic, and video information, we review the attacks stealing this keystroke information and the countermeasures.

Figure 2.5: The ML-based stealing attack against authentication information — keystroke information and secret keys. After reconnoitering and querying, attackers targeting keystroke information and secret keys interact with the target system to collect data (active collection). Attacks involving active collection share a similar workflow to that depicted in Fig. 2.4.

Table 2.8: Stealing Controlled Keystroke Data for Authentication

Paper | Dataset for Experiment (Description) | Feature Engineering | ML-based Attack Method
[35] | Acceleration data set (consecutive vectors with 26 labels) | FFT & IFFT filter, movement capturing, optimization with change direction | Random Forest; k-NN; SVM; NN
[34] | Video recordings set (image resolution and frame rate) | Extract motion signals from selected AOIs for motion patterns | multi-class SVM

Keystroke Inference Attack: Several types of sensor information can be utilized to steal keystroke authentication information when targeting keyboard inputs. In the reconnaissance process, [35] found that sensor data from the accelerometer and microphone of a smartwatch is related to the user's keystrokes. Since the smartwatch is worn on the user's wrist, the accelerometer data reflects the user's hand movement; therefore, the user's inputs on keyboards can be inferred. The authors presented a practical attack based on this finding. Adversaries collected the accelerometer and microphone data while keystrokes were typed on a keyboard. By leveraging the acceleration and acoustic information, adversaries were able to distinguish the content of the typed messages. Two kinds of keyboards were targeted: the numeric keypad of a POS terminal and a QWERTY keyboard. The datasets of sensor information were collected for the inference attack.

During the feature engineering phase, adversaries manually defined the x-axis and y-axis as two movement features of the acceleration data, and frequency features were extracted from the acoustic data; the FFT was employed to filter the linear noise and high-frequency noise. Applying ML strategies to the attack, keystroke inference models were set up to reduce the impact of the noise within the sensor data [35]. Specifically, a modified k-NN algorithm combined with an optimization scoring algorithm was applied to enhance the accuracy of the inference attack. Thereafter, the information typed on these two keyboards was leaked, including users' banking PINs and English texts. The attack inferred keystroke data containing authentication information.

Regarding the evaluation, the results showed that the keystroke inference attack on the numeric keypad leaked banking PINs with 65% accuracy among the top 3 candidates [35]. Unlike previous work on decoding PINs [111, 112, 113], any device containing a POS terminal could be compromised by this attack. For the attack targeting QWERTY keyboards, a notable improvement over previous work [114, 115] was achieved in finding words correctly, with accuracy improved by 50% and strong tolerance to acoustic noise. In the end, several mitigation solutions against the keystroke inference attack were provided [35]: restricting access to accelerometer data; limiting the acoustic emanation; and managing the permissions for accessing the sensors dynamically according to the context. The attack inferred the keystroke information accurately and can be mitigated with these restrictions.

Video-Assisted Keystroke Inference Attack: Apart from the accelerometer and microphone, video recordings are another kind of sensor information allowing attackers to infer keystroke authentication information. An attack named VISIBLE, introduced by [34], leaked the user's typed inputs by leveraging a stealthy video recording of the backside of a tablet. The attack scenario assumed that the targeted tablet was placed on a tablet holder, and two types of soft keyboards were used for input: alphabetical and PIN keyboards. The dataset for the ML-based attack contained videos of the backside motion of a tablet during the typing process. In the feature engineering process, areas of interest (AOIs) were selected and decomposed; the tablet motion was analyzed with its amplitude quantified, and features were extracted from the temporal and spatial domains. As the ML-based attack, a multi-class SVM was applied to classify various motion patterns and infer the input keystrokes; VISIBLE then exploited dictionary and linguistic relationships to refine the inference results. The experiments showed that VISIBLE's accuracy in leaking single keys, words, and sentences significantly outperformed random guessing; in particular, the average accuracy was above 80% for the alphabetical keyboard and 68% for the PIN keyboard. The countermeasures against this attack include providing no useful information to the video camera, randomizing the keyboard layout, and adding noise when the video camera is accessed. The attack leveraging video information can thus also infer keystroke authentication information very accurately.
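The FFT filtering step described above is simple enough to sketch. The following is a minimal, hedged illustration of band-pass denoising of an accelerometer trace before classification; the band edges and sampling rate are assumptions chosen for illustration, not values from [35].

```python
import numpy as np

def fft_denoise(signal, keep_band=(2.0, 40.0), sample_rate=100.0):
    # Transform the trace to the frequency domain.
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    # Zero out components outside the band where hand-movement
    # energy is assumed to concentrate, suppressing drift and
    # high-frequency noise.
    spectrum[(freqs < keep_band[0]) | (freqs > keep_band[1])] = 0
    # Inverse transform back to a cleaned time-domain trace.
    return np.fft.irfft(spectrum, n=len(signal))
```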

2.4.2 Stealing controlled secret keys for authentication

Secret keys are used to encrypt and decrypt sensitive messages [116, 117, 118]. Reconstructing the cryptographic keys means that, in some cases, a host is authenticated to read the message [119, 120, 121]. However, an adversary has the ability to deduce sensitive information like cryptographic keys by understanding the changes in the state of shared caches [30]. In this part, attacks stealing controlled secret key information are surveyed via analysis of the state of a targeted cache set.

Table 2.9: Stealing Controlled Secret Keys for Authentication (Information: info)

Reference | Dataset for Experiment | Description | Feature Engineering | ML-based Attack Method
[30] | 300 observed TLB latencies | Collect from TLB signals | Encode info using a normalized latencies vector | SVM classifier
[40] | 500,000 Prime-Probe trials | N/A | Number of absent cache lines + cache lines available | NB classifier

Stealing secret keys with TLB Cache Data: Due to the abuse of hardware translation look-aside buffers (TLBs), secret key information can be revealed by an adversary via analyzing TLB information [30]. The targeted fine-grained information about user memory activities (i.e., cryptographic keys) was safeguarded in controlled channels like cache side channels [122, 123]. During reconnaissance, it was observed that the legitimate user accesses shared TLBs, which reflect the victim's fine-grained memory activity. In detail, the victim's TLB records could be accessed by other users issuing CPU affinity system calls or sharing the same virtual machine. Adversaries reverse-engineered the unknown addressing functions which map virtual addresses to different TLB sets, in order to recover CPU activities from the TLB functions. To design the data collection, the adversary monitored the states of shared TLB sets, indicating which functions were missed or performed by the victims. Without privileged access to properties of TLB information (i.e., TLB shootdown interrupts), adversaries timed accesses to the TLB set and measured the memory access latency, which indicated the state of a TLB set. Instructing the targeted activities with a set of function statements, the adversary accessed the TLB data shared by the victim and collected the corresponding temporal information as a training dataset. The label, in this case, was the state of the function written in the statement. Datasets about the TLB state information were thus prepared for the stealing attack.

Figure 2.6: The ML-based stealing attack against authentication information — password data. To infer the password, attackers reconnoiter and collect the online information with the passive collection. During the feature engineering phase, different segments from the required data are extracted. A semantic classifier is trained using probabilistic algorithms. After testing this classifier, various passwords can be constructed as outputs with the semantic generalization.

For the feature engineering, features were extracted from TLB temporal signals by encoding information with a vector of normalized latencies. Additionally, ML algorithms were adopted to distinguish the targeted TLB set by analyzing memory activity. Specifically, with high-resolution temporal features extracted to represent the activity, an SVM classifier was built to distinguish accesses to the targeted set from accesses to other arbitrary sets. In the experiment of [30], the training set contained 2,928 TLB latency records across three different sets. The end-to-end TLBleed attack on libgcrypt captured the changes of the target TLB set, extracted the feature signatures, and reconstructed the private keys. During the evaluation phase, TLBleed reconstructed the private key at an average success rate of 97%. Particularly, a 256-bit EdDSA secret key was leaked with TLBleed at a success rate of 98%, while RSA keys were reconstructed at 92%. Potential mitigations against the TLBleed attack were discussed in [30], including executing sensitive processes in isolation on a core, partitioning TLB sets among distrusting processes, and extending hardware transactional memory features. Hence, secret cryptography keys can be reconstructed by distinguishing the targeted TLB set.
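To illustrate the classification step of this attack, the sketch below trains an SVM on vectors of normalized TLB access latencies, in the spirit of [30]; the latency values are synthetic and the vector length and class separation are illustrative assumptions.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Synthetic latency traces: accesses to the targeted TLB set are assumed to be
# slightly slower on average than accesses to arbitrary sets.
target_set = rng.normal(loc=210, scale=12, size=(300, 16))   # label 1
other_sets = rng.normal(loc=180, scale=12, size=(300, 16))   # label 0

X = np.vstack([target_set, other_sets])
X = X / X.max(axis=1, keepdims=True)          # encode as normalized latency vectors
y = np.array([1] * 300 + [0] * 300)

clf = SVC(kernel="rbf").fit(X, y)             # distinguish the target set from arbitrary sets
print(clf.score(X, y))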

2.4.3 Stealing controlled password data for authentication

Passwords are considered one of the most sensitive pieces of user information, and their leakage raises serious security concerns. Most of the information useful to the stealing attack is collected passively from network services, as illustrated in Fig. 2.6. The password guessing attack has been studied by analyzing password patterns with ML techniques; protection mechanisms likewise analyze password patterns, in their case to lead users to set up strong passwords. Online Password Guessing Attack: The online password guessing problem, and a framework named TarGuess to systematically model targeted online guessing scenarios, were introduced by [31]. Since attackers perform an online password guessing attack based on the victim's personal information, systematically summarizing all possible attack scenarios helps analysts understand the security threats. The architecture of TarGuess comprises three phases: the preparing phase to determine the targeted victim and build up its password profile, the training phase to generate the guessing model, and the guessing phase to perform the guessing attack.

Table 2.10: Stealing Controlled Password Data for Authentication

Reference | Dataset for Experiment | Description | Feature Engineering | ML-based Attack Method
[31] | Dodonew, CSDN, 126, Rockyou, 000webhost, Yahoo, 12306, Rootkit, Hotel, 51job | 16,258,891 (6,428,277) leaked passwords; 6,392,568 (32,581,870) leaked passwords; 15,251,073 (442,834) leaked passwords; 6,392,568 leaked passwords + 129,303 PII; 69,418 leaked passwords + 69,324 PII; 20,051,426 PII, 2,327,571 PII | N/A | PCFG-based algorithm [124], Markov-based algorithm [125], LD algorithm
[33] | RockYou | 32,581,870 leaked passwords | Segmented with NLP | PCFG-based algorithm
[32] | PGS training set [126], 1class8, 1class16 [127], 3class12 [128], 4class8 [129], webhost [130] | 33 million passwords; 3,062 (2,054) leaked passwords; 990 (990) leaked passwords; 30,000 leaked passwords | N/A | PCFG-based algorithm [131], Markov models [125], NN

During the reconnaissance phase, according to the diversity of people's password choices, three kinds of information were beneficial for online guessing attacks: PII like name and birthday; site information like service type; and leaked password information like sister passwords and popular passwords. In particular, PII could be divided into two types: Type-1 PII used to build part of the password (e.g., birthday), and Type-2 PII reflecting user behavior in setting passwords (e.g., language [129]). Some leaked passwords were reused by the user. During the data collection phase, the datasets combined multiple types of PII and leaked passwords. To be more specific, the four TarGuess variants were TarGuess-I based on Type-1 PII, TarGuess-II based on leaked passwords, TarGuess-III based on leaked passwords and Type-1 PII, and TarGuess-IV based on leaked passwords and PII. Datasets for password guessing attacks were thus prepared. After the dataset was collected with its initial features, attackers adopted probabilistic guessing algorithms, including PCFG, Markov n-grams, and Bayesian theory, to train the four TarGuess models to infer passwords [31]. The accuracy of these four guessing algorithms was evaluated under a limited number of online guesses. Compared with [132], the TarGuess-I algorithm successfully inferred 37.11% to 73.33% more passwords within 10 to $10^3$ guesses. It also significantly outperformed three trawling online guessing algorithms [133, 125, 124], cracking at least 412% to 740% more passwords. Compared to [134] with an 8.98% success rate, TarGuess-II achieved 20.19% within 100 guesses. As for TarGuess-III, no prior research was comparable, and it achieved a 23.48% success rate within 100 guesses. Concerning TarGuess-IV, the accuracy improvements were between 4.38% and 18.19% compared to TarGuess-III. By modeling guessing attack scenarios, [31] revealed a serious security concern about online password leakage with effective guessing algorithms.

Password Guessing with Semantic Pattern Analysis: While [31] formalized several password guess lists for one targeted user, a similar attempt was made in [33] to find general password patterns. The framework presented by [33] built semantic patterns of passwords in order to understand users' password security. The security impacts of users' preferences in password creation were identified. For better reconnaissance, passwords were analyzed by breaking them into two conceptually consistent parts comprising semantic and syntactic patterns.
Since a password consists of a combination of word and/or gap segments, the attacker intended to understand these patterns by inferring the password's meanings and syntactic functions. By comprehending how well the semantic patterns characterize the password, plenty of password guesses could be learned for attacks. When those attacks succeeded with some guesses, the true passwords were learned. The attack formalized the password with semantic and syntactic patterns. The password datasets could be collected from password data leakages like the RockYou password list.

Firstly, NLP methods were used for password segmentation and semantic classification. Segmentation was the fundamental step to process passwords in various forms. The source corpora were a collection of raw words serving as segmentation candidates, whereas the reference corpora contained part-of-speech (POS) information. Specifically, the POS was tagged with the Natural Language Toolkit (NLTK) [135] based on the Contemporary Corpus of American English. With N-gram probabilities representing the frequency of use, the tagged POS was used to select the most suitable segmentation for each password. After processing the password dataset, the NLP algorithm was used to classify the segments of input passwords into semantic categories. Secondly, a semantic guess generator could be built with the PCFG algorithm. Since the syntactic functions of a password are structural relationships among semantic classes, the PCFG algorithm was employed to model the password's syntactic and semantic patterns. In detail, this model learned the password grammar from the dataset, generated the guessing sentences of a language [136] with different constructs, and encoded the probabilities of these constructs as output. To learn any true passwords, the semantic guess generator sorted the outputs according to the probability of a successful crack. The generator produced a guessing list based on the semantic and syntactic patterns. To assess the advantage of the semantic guess generator, its success rate was compared with that of the previous offline guessing attack approach, i.e., the Weir approach [124]. Cracking passwords within 3 billion guesses, the semantic approach cracked 67% more passwords than the Weir approach on the LinkedIn leakage [33]. Exploiting the leakage from MySpace, this approach outperformed the Weir approach by inferring 32% more passwords.

Summary: According to the different forms of authentication, ML-based stealing attacks target users' keystroke authentication, secret keys, and passwords. As shown in Fig. 2.5 and Fig. 2.6, attackers steal users' passwords by cracking useful information collected online. For the other two objectives, they exploit information based on users' activities recorded by the operating system (i.e., TLB/CPU cache data). Additionally, password guessing attacks use probabilistic methods to construct a password with the least number of guesses. The attacks on the remaining two targets can be framed as classification tasks by generating keystroke patterns and cache set states.

Countermeasures: From the security perspective, two types of countermeasures are introduced: access restriction and attack detection. The secret keys, for example, can be protected by managing the accessibility of related cache data [40]. The analysis of password guessability [32] can secure the user's account by encouraging strong passwords; weak passwords are evaded through detection. Future directions can target the effectiveness of guessing model prediction, which is limited by the sparsity of training samples [32]. The defense against keystroke inference has not been well developed; future work may explore securing access to the related sensor data.
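Returning to the password guessing methodology, a toy sketch of PCFG-style guess generation is given below: structural templates (e.g., eight letters followed by three digits) are learned with their frequencies from leaked passwords, and guesses are emitted in decreasing probability order. The miniature corpus and segment inventories are purely illustrative and far simpler than the semantic grammar of [33].

import re
from collections import Counter
from itertools import product

leaked = ["password123", "monkey1", "love2020", "dragon99", "hello123"]

def structure(pw):
    # Map a password to its structural template, e.g. "L8D3" for 8 letters + 3 digits.
    return "".join(f"{'L' if seg[0].isalpha() else 'D'}{len(seg)}"
                   for seg in re.findall(r"[A-Za-z]+|\d+", pw))

# Learn template probabilities from the leaked corpus (the PCFG grammar rules).
templates = Counter(structure(pw) for pw in leaked)
total = sum(templates.values())

# Tiny hypothetical segment inventories; a real generator learns these from corpora.
words = {"L4": ["love"], "L5": ["hello"], "L6": ["monkey", "dragon"], "L8": ["password"]}
digits = {"D1": ["1"], "D2": ["99"], "D3": ["123"], "D4": ["2020"]}

guesses = []
for tmpl, count in templates.items():
    parts = re.findall(r"[LD]\d+", tmpl)
    pools = [words.get(p, digits.get(p, [])) for p in parts]
    for combo in product(*pools):
        guesses.append(("".join(combo), count / total))

# Emit guesses sorted by estimated probability, as the semantic generator does.
for guess, prob in sorted(guesses, key=lambda g: -g[1]):
    print(f"{guess}\t{prob:.2f}")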

2.4.4 Summary

To understand the information leakage threat and the stealing attack comprehensively, an outline of relevant high-quality papers from 2014 to 2019 is provided and summarized in Table 2.11, from four perspectives: the attack, the protection, the related ML techniques, and the evaluation.

Table 2.11: Summary of reviewed papers from attack, protection, related ML techniques they utilized, and the evaluation metrics.

Reference | Attack | Protection | Related ML Techniques | Evaluation
[10] | Unlock pattern & foreground app inference attack | Restrict access to kernel resources; Decrease the resolution of interrupt data | HMM with Viterbi algorithm; k-NN classifier with DTW | Success rate; Time & battery consumption
[36] | Leaking specific events attack | Restrict access to kernel resources; App Guardian [137, 106] | k-NN classifier with DTW; Multi-class SVM with DTW | Accuracy; Precision; Recall; Battery consumption
[48] | Keystroke timing attack; website inference attack | Design d*-private mechanism | Multi-class SVM classifier | Accuracy; Relative AccE
[37] | Stealing user activities; Stealing in-app activities | Eliminate the attack vectors; Rate limiting; Runtime detection [106]; Coarse-grained return values; Privacy-preserving statistics report [48]; Remove the timing channel | SVM classifier; k-NN classifier with DTW | Accuracy; Execution time; Power consumption
[47] | Stealing product's design | Obfuscate the acoustic emissions | A regression model | Accuracy
[49] | Information leakage via a sensor; Stealing information via a sensor | The contextual model detects malicious behavior of sensors | Markov Chain; NB; Alternative set of ML algorithms (e.g. PART) | Accuracy; FNR; F-measure; FPR; Recall; Precision; Power consumption
[12] | Model extraction attack | Rounding confidences [18]; Differential privacy (DP) [138, 50, 139, 140]; Ensemble methods [16] | Logistic regression; Decision tree; SVM; Three-layer NN | Test error; Uniform error; Extraction accuracy
[41] | Model extraction attack | Gradient masking [56] and defensive distillation [141] for a robust model | DNN; SVM; k-NN; Decision Tree; Logistic regression | Success rate
[29] | Hyperparameters stealing attack | Cross entropy and square hinge loss instead of regular hinge loss | Regression algorithms; NN; Logistic regression; SVM | Relative EE; Relative MSE; Relative AccE
[42] | Hyperparameters stealing attack | N/A | Metamodel methods | Accuracy
[18] | Model inversion attack | Incorporate inversion metrics in training; Degrade the quality/precision of the model's gradient information | Decision Tree; Regression model | Accuracy; Precision; Recall
[39] | The GAN attack stealing users' training data | N/A | CNN with GAN | Accuracy
[38] | Membership inference attack | Restrict class in the prediction vector; Coarsen precision; Increase entropy of the prediction vector [83]; Regularization | NN | Accuracy; Precision; Recall
[43] | Membership inference attack | Dropout; Model stacking | Logistic regression; Random Forest; Multilayer perceptron | Precision; Recall; AUC
[51] | Property inference attack | Multiply the weights and bias of each neuron; Add noise; Encode arbitrary data | NN | Accuracy; Precision; Recall
[44] | Property inference attack | Share fewer gradients; Reduce input dimension; Dropout; User-level DP | Logistic regression; Gradient boosting; Random Forests | Precision; Recall; AUC
[77] | Membership inference attack | Protect with adversarial regularization | NN | Accuracy
[87] | N/A | PATE: transfer knowledge from an ensemble model to a student model | Semi-supervised learning; GAN | Accuracy
[88] | N/A | Protect stored training data with count-based featurization | NN; Gradient boosted tree; Logistic regression; Linear regression | Average logistic squared loss
[35] | Keystroke inference attack | Restrict access to accelerometer data; Limit acoustic emanation; Dynamic permission management based on context | Random Forest; k-NN; SVM; NN | Success rate
[34] | Typed input inference attack | Design a featureless cover; Randomize the keyboards' layouts; Add noise | Multi-class SVM | Accuracy
[30] | TLBleed attack infers secret keys | Protect in hardware [122, 123] | SVM classifier | Success rate
[40] | ML-based prime-probe attack infers secret keys | CacheBar manages memory pages' cacheability | NB classifier | Accuracy; Execution time
[31] | Password guessing attack | N/A | PCFG algorithm; Markov model; Bayesian theory | Success rate
[33] | Password guessing attack | N/A | PCFG-based algorithm | Success rate
[32] | Password guessing attack | Mitigate the threat by modeling password guessability | PCFG-based algorithm; Markov models; NN | Accuracy

Chapter 3

The Audio Auditor: User-Level Membership Inference with Black-Box Access

Voice interfaces and assistants implemented by various services have become increasingly sophisticated, powered by the increased availability of data. However, users' audio data needs to be guarded under data-protection regulations, such as the General Data Protection Regulation (GDPR) and the Children's Online Privacy Protection Act (COPPA) [142, 143]. To check the unauthorized use of audio data, we propose an audio auditor for users to audit speech recognition models. Specifically, users can check whether their audio recordings were used as members of the model's training dataset or not. In this chapter, we focus our work on a DNN-HMM-based automatic speech recognition model over the TIMIT audio data. As a proof-of-concept, the success rate of participant-level membership inference can reach up to 90% with eight audio samples per user, resulting in an audio auditor.

3.1 Introduction

The automatic speech recognition (ASR) system is widely adopted on Internet of Things (IoT) devices [144, 145]. The IoT voice services competition among Apple, Microsoft, and Amazon is continuously heating up the smart speaker market [146]. In parallel, privacy concerns about the ASR system and unauthorized access to users' audio are increasingly salient for customers. Privacy policies and regulations, such as the GDPR [142] and the COPPA [143], have been enforced to regulate personal data processing. Specifically, the Right to be Forgotten [147] allows customers to prevent third-party voice services from continuously using their data [148]. However, the murky privacy and security boundary can thwart IoT's trustworthiness [149, 150], and many IoT devices attempt to sniff and analyze the audio captured in real time without the user's consent [151]. Most recently, on WeChat, an enormously popular messaging platform within China and worldwide, a scammer camouflaged their voice to sound like an acquaintance by spoofing his or her voice [152]. Therefore, it is important to develop techniques that enable auditing the use of customers' audio data in ASR models.

In this chapter, we designed and evaluated an audio auditor to help users determine whether their audio data had been used without authorization to train an ASR model. The targeted ASR model used in this chapter is a DNN-HMM-based speech-to-text model. With an audio signal input, this model transcribes speech into written text. The auditor audits this target model with the intent to infer participant-level membership. The target model will behave differently depending on whether it is transcribing audio from within its training set or audio from other datasets. Thus, one can analyze the transcriptions and use the outputs to train a binary classifier as the auditor. As our primary focus is to infer participant-level membership, speaker-related information is filtered out while analyzing the transcription outputs (see details in Section 3.3). Our work is the first attempt to audit, via the membership inference method, whether ASR models still remember a customer's audio data used without consent. User-level membership inference on textual data has been recently studied [153]. However, in this work, we target an ASR model instead of a text-generation model. Time-series audio data is significantly more complex than textual data, causing feature patterns to vary greatly [154]. Furthermore, current IoT applications demonstrate significantly higher security and privacy impacts than most verbal applications in learning tasks [155, 156]. In doing so, firstly, we assume a different auditing scenario. To reproduce a target model close to ASR systems in practice, we use multi-task learning, which includes audio feature extraction, DNN learning, HMM learning, and an n-gram language model with natural language processing. Secondly, the auditor has black-box access to the target model, which only outputs one final transcription result. Additionally, the auditor can audit the model by simultaneously providing multiple audio inputs supplied by the same user, instead of just one. Thirdly, we extract a different set of features from the model's outputs. Instead of using the rank lists of several top output results, we only use the one text output with the highest posterior and the length of the input audio frames. Our user-level membership auditing method achieves high performance on the TIMIT dataset. The auditing accuracy reaches over 90%, and the F1-score reaches 95%, when 125 speakers' records are used to train the auditor model. Even when training with 25 users, the resulting accuracy is approximately 85%. The auditor is also effective in auditing ASR models with different numbers of audio queries from the same individual. When the speaker audits the target model with more than one audio sample (the success rate of one-audio-sample membership inference approaches random guessing), the success rate is significantly boosted, reaching up to 90% with eight audio samples per user.

3.2 Background

3.2.1 The Automatic Speech Recognition Model

Figure 3.1: An advanced ASR system.

The DNN-HMM-based acoustic model is popular in the current automatic speech recognition (ASR) system [157]. As defined by [158], the ASR system contains a preprocessing step, a model training step, and a decoding step, as displayed in Figure 3.1. The preprocessing step performs feature processing and labeling for an audio input. In this chapter, the audio frame is processed using the Discrete Fourier Transform (DFT) to extract information from the frequency domain, namely Mel-Frequency Cepstral Coefficients (MFCCs), as features. Forced alignment is applied on the raw audio inputs to extract the text labels, which are processed and used in training our acoustic model. We train the acoustic model on a DNN. The acoustic model outputs posterior probabilities for all HMM states, which are processed in the decoding step by mapping the posterior probabilities to a sequence of text. The language model contained within the decoder provides a language probability, which the decoder uses to re-evaluate the acoustic score against the most suited language [159]. The final transcription is the text sequence with the highest score.
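As a concrete illustration of this preprocessing step, MFCC features can be extracted with a standard audio library; the sketch below uses librosa on a synthetic one-second signal, whereas the experiments in this chapter use the Kaldi toolkit, and the window/hop settings here are conventional assumptions rather than the exact thesis configuration.

import numpy as np
import librosa

# One second of synthetic 16 kHz "audio" stands in for a real recording here.
sr = 16000
audio = np.random.default_rng(0).normal(size=sr).astype(np.float32)

# 13 MFCCs per frame with a 25 ms window (400 samples) and 10 ms hop (160 samples),
# a conventional configuration for ASR front ends.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
print(mfcc.shape)  # (13, number_of_frames)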

3.2.2 Deep Learning for Acoustic Models

Deep learning methods are used to build acoustic models for tasks such as speech transcription [160], word spotting or triggering [161], and speaker identification or verification [162]. With supervised learning, a neural network can be trained as a classifier using a softmax across the phonetic units. A feature stream of audio is the input of the network, while the output is a posterior probability for the predicted phonetic states. Subsequently, these output representations are decoded by the HMM-based decoder and mapped to possible sequences of phonetic texts with different probabilities.

The Multilayer Perceptron (MLP) is the DNN algorithm used in this work. Assume that the MLP is a stack of $L$ layers of logistic regression models, and $f_l(\cdot)$ represents the activation function in the $l$-th layer. Given an input $z^l \in \mathbb{R}^{m^l}$, where $m^l$ is the number of neurons in the $l$-th layer, this layer's output $out^l$ can be formalized as:

$$out^l = f_l(z^l) = f_l(W^l \cdot out^{l-1} + b^l), \quad (3.1)$$

where $W^l \in \mathbb{R}^{m^l \times m^{l-1}}$ represents the weight matrix, and $b^l$ is the bias from the $(l-1)$-th to the $l$-th layer. Specifically, we applied the sigmoid function in the hidden layers and used the softmax activation function for the final output layer. As for the loss function, the MLP uses the cross-entropy. Moreover, the MLP tunes its parameters using the error back-propagation (BP) procedure and the stochastic gradient descent method. When building the ASR system with DNN-HMM algorithms, the posterior probability output of the output layer can be expressed as $\{P(\rho_1|o_t), \ldots, P(\rho_k|o_t)\}$, where $k$ is the total number of phonemes, corresponding to the number of the $L$-th layer's output nodes. This is the set of posterior probabilities of each phoneme for the $t$-th time frame $o_t$ of the audio input. The posterior probability of each phoneme (i.e., $P(\rho_k|o_t)$) is transferred and processed by the HMM-based decoder:

$$b_j = P(o_t|s_j) = \sum_{i=1}^{I} c_{ji} \frac{P(\rho_i|o_t)}{P(\rho_i)}. \quad (3.2)$$

In Equation 3.2, $b_j$ is the probability of the phonemes in time frame $o_t$ mapping to the $j$-th HMM state $s_j$, based on continuous Probability Density Functions (PDFs) [163]. Herein, $I$ is a fixed number of PDFs, and $c_{ji}$ is the weight for each phoneme.
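The forward computation of Equation 3.1, with sigmoid hidden layers and a softmax output as described above, can be sketched in a few lines; all dimensions below are illustrative rather than those of the actual acoustic model.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
sizes = [39, 64, 64, 48]           # e.g. MFCC input -> two hidden layers -> phonetic states

# Random weights W^l and biases b^l for each layer (Equation 3.1).
Ws = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

out = rng.normal(size=sizes[0])    # out^0: one frame of input features
for l, (W, b) in enumerate(zip(Ws, bs)):
    z = W @ out + b                                   # z^l = W^l . out^{l-1} + b^l
    out = softmax(z) if l == len(Ws) - 1 else sigmoid(z)

print(out.sum())  # posterior probabilities over phonetic states sum to 1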

3.2.3 Membership Inference Attack

The membership inference attack aims to determine whether a specific data sample is within the training set, by training a series of shadow models constituting the attack model [38]. The attack model learns from the differences in the target model's output when fed with pristine or bogus training data. In this chapter, we adapt the membership inference attack for the task of audio auditing. Specifically, instead of inferring record-level membership, we aim to infer participant-level membership. That is, we focus on whether a particular user had unwillingly contributed data to train an ASR model. Our work differs from another user-level membership audit [153] in that the features extracted from the outputs of the developed ASR models are three pieces of audio-related information, namely the transcription text, the text probability, and the frame length, rather than words' rank lists.

3.3 Auditing the ASR Model

Figure 3.2: Auditing an ASR model.

In this section, we first formalize our objective for auditing automatic speech recognition models. Secondly, we present how an audio auditor can be constructed. Finally, we outline how the auditor is used for auditing the target model.

3.3.1 Problem Definition

As shown in Figure 3.1, we describe the workflow of audio transcription using an ASR system. By querying an ASR system with an audio sample of recorded speech, the speech recognition model outputs pseudo-posterior probabilities for all context-dependent phonetic units. During the decoding step, the probabilities are used to infer the most probable text sequence.

Suppose there is a group of audio recordings $D_{tar}$ from a set of individuals $U_{tar}$. Our target model is a speech recognition model denoted as $f_{tar}$, trained on $D_{tar}$ using a learning algorithm $Al_{tar}$.

For a specific user $u$, our objective is to find out whether this user is in the target model's training set, such that $u \in U_{tar}$. The participant-level membership inference against $f_{tar}$ requires an auxiliary reference dataset $D_{ref}$ to build the audio auditor. Specifically, $D_{ref}$ is used to train several shadow models $f_{shd}$ which approximately simulate the target model $f_{tar}$. We denote by $U_{ref}$ the set of all users in $D_{ref}$. By querying $f_{shd}$, the transcription outputs are labeled according to whether the audio speaker belongs to $U_{ref}$ or not.

Threat Model. We assume that our auditor only has black-box access to the target model. Given an input audio recording, the auditor can only obtain the text transcription and its probability as outputs. Neither the training data nor the training parameters and hyper-parameters of the target model are known to the auditor. The state-of-the-art algorithms for typical DNN-HMM ASR systems are well-known and standard [158, 164, 165]; we hereby assume that our auditor knows the learning algorithms used in the ASR system, including feature extraction, the training algorithm, and the decoding algorithm. Given recent research on model stealing [29, 12], which extracts network parameters by querying the output, it is reasonable to grant the auditor black-box access to the Machine Learning as a Service (i.e., the ASR model).

3.3.2 Overview of the Audio Auditor

The nature of membership inference [38] is to learn the difference between a model fed with its actual training samples and one fed with other samples. Thus, to audit whether an ASR model was trained with a user's audio data, the auditor's task can be cast as inferring this user's membership in the ASR model's training dataset. The audio auditor's training and auditing processes are depicted in Figure 3.2. We assume that our target model's dataset $D_{tar}$ is disjoint from the auxiliary reference dataset $D_{ref}$ ($D_{tar} \cap D_{ref} = \emptyset$).

In addition, $U_{ref}$ and $U_{tar}$ are also disjoint ($U_{tar} \cap U_{ref} = \emptyset$).

The primary task in training an audio auditor is to build several shadow models to infer the targeted ASR model's decision boundary. We assume all learning algorithms $Al_{tar}$ are known to the auditor; therefore, the learning algorithms for the shadow models are chosen accordingly ($Al_{shd} = Al_{tar}$). Unlike for the target model, we have full knowledge of the shadow models' ground truth. To build $n$ shadow models ($n > 1$; see Figure 3.2), we sample $n$ datasets from the auxiliary reference dataset $D_{ref}$ as $D_{shd_1}, \ldots, D_{shd_n}$, and split each shadow model dataset $D_{shd_i}$, $i = 1, \ldots, n$, into a training set $D_{shd_i}^{train}$ and a testing set $D_{shd_i}^{test}$. $D_{shd_i}^{train}$ is used to train the shadow model $f_{shd_i}$ with $Al_{shd}$, while $D_{shd_i}^{test}$ is used to evaluate its performance. To generate a training set with ground truth for the audit model, we query each shadow model with $D_{shd_i}^{query}$, containing all samples in $D_{shd_i}^{test}$ and a randomly sampled subset of $D_{shd_i}^{train}$. For a user $u$ querying the model with her audio samples, if $u \in D_{shd_i}^{train}$, we collapse the features extracted from these samples' results into one record and label it as "member"; otherwise, "nonmember". Taken together, these labeled records form a training dataset used to train a binary classifier as the audit model with a supervised learning algorithm. As also evidenced in [38], the more shadow models are built, the more accurately the audit model performs.

As shown in Figure 3.2, the query outputs are preprocessed as follows. For participant-level membership, users' pertinent characteristics are extracted from each output, including the transcription text (denoted as TXT), the posterior probability (denoted as Probability), and the audio frame length (denoted as Frame_Length). The features of the auditor's training set are written as: {TXT1=type(string), Probability1=type(float), Frame_Length1=type(integer), ..., TXTn=type(string), Probabilityn=type(float), Frame_Lengthn=type(integer), class}, where $n$ is the number of audios belonging to a speaker. To process categorical features, such as the TXT features, we map the text to integers using a label encoder [166]. Exploring alternative preprocessing methods, such as a one-hot encoder, is an avenue for future research.

As for the auditing process, we randomly sample one or a few audios recorded by one speaker $u$ from $D_{users}$ to query our target ASR model. These sampled audios are transcribed into text with some outputs. To audit whether the target model used this speaker's audio in its training phase, we analyze these transcription outputs as a testing record for our audio auditor. The feature extraction and preprocessing methods used for this testing record are the same as those used for the shadow models' results. The auditor finally classifies this testing record as "member" or "nonmember" and hence determines whether $u \in U_{tar}$ or not.
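A minimal sketch of assembling the auditor's training records and fitting the binary classifier is shown below; the shadow-model outputs are mocked, the zero padding anticipates users with fewer than n audios, and the decision-tree choice matches the audit model used in Section 3.4. All names and values here are illustrative.

from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Mocked per-user shadow-model outputs: up to n=2 audios per user here,
# each contributing (TXT, Probability, Frame_Length); class is the user label.
raw_records = [
    (["she had your dark suit", "don't ask me"], [0.91, 0.88], [412, 305], "member"),
    (["greasy wash water"],                      [0.64],       [377],      "nonmember"),
]

txt_encoder = LabelEncoder()
txt_encoder.fit([t for txts, _, _, _ in raw_records for t in txts])

n = 2  # audios per user in the auditor's feature set
X, y = [], []
for txts, probs, frames, label in raw_records:
    row = []
    for i in range(n):
        if i < len(txts):   # one (TXT, Probability, Frame_Length) triple per audio
            row += [txt_encoder.transform([txts[i]])[0], probs[i], frames[i]]
        else:               # pad zeros when a user supplies fewer than n audios
            row += [0, 0.0, 0]
    X.append(row)
    y.append(1 if label == "member" else 0)

auditor = DecisionTreeClassifier().fit(X, y)   # the binary audit model
print(auditor.predict([X[0]]))                 # 1 = "member", 0 = "nonmember"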

3.4 Experiment and Results

3.4.1 Dataset

The TIMIT speech corpus contains 6,300 sentences spoken by 630 speakers from 8 major dialect regions of the United States. Three kinds of sentences are recorded: the dialect sentences, the phonetically-compact sentences, and the phonetically-diverse sentences. The dialect sentences are spoken by all speakers from the different dialect regions. The phonetically-compact sentences were recorded with good coverage of pairs of phones, while the phonetically-diverse audios record sentences selected from different corpora for diverse sentence types and phonetic contexts. We manually selected three disjoint datasets from the TIMIT speech corpus, as described in Table 3.1. Specifically, each training dataset and testing dataset contains audio recorded by speakers from all 8 dialect regions. In addition, each subset contains all three kinds of audio mentioned above. The diversity of audios within each dataset not only better resembles a real ASR model's training set, but also retains users' information for the participant-level auditing task. As a proof-of-concept, we aim to build one target model and design two shadow models based on this target model. As mentioned in Section 3.3.2, we curated three disjoint datasets from the TIMIT speech corpus, as listed in Table 3.1.

Table 3.1: Datasets across models

Model | Training Dataset | Testing Dataset
Target | 154 speakers, 1,232 audios | 54 speakers, 432 audios
Shadow1 | 154 speakers, 1,232 audios | 57 speakers, 456 audios
Shadow2 | 154 speakers, 1,232 audios | 57 speakers, 456 audios

In this experiment, we trained two shadow models on $D_{shd_i}^{train}$ ($i = 1, 2$) with a distribution similar to $D_{tar}^{train}$. Training with differently distributed datasets will be part of our future research. The outputs of the two shadow models are used to train the audit model. By querying each shadow model with all of its testing set and one-third of its training set, we processed the outputs and labeled them as "nonmember" and "member", respectively. Since the training datasets for all three models include eight sentences per speaker, the feature set of the auditor's training dataset is {TXT1, Probability1, Frame_Length1, ..., TXT8, Probability8, Frame_Length8, class}. To audit the target model, a speaker may query the auditor with one to eight audio samples. When a user audits the target model with fewer than eight audio samples, we pad all the missing feature values with zeros.

3.4.2 Target Model

Our target model is a speech-to-text model. The inputs are a set of audio files with phonetic texts as labels, while the outputs are the transcribed phonetic texts with final probabilities and the corresponding input frame lengths. To simulate most current ASR models in the real world, we created a state-of-the-art DNN-HMM-based ASR model [158] using the PyTorch-Kaldi Speech Recognition Toolkit [159]. In the preprocessing step, MFCC features are used to train the model with the multilayer perceptron (MLP) algorithm for 24 epochs. The outputs of this MLP model are decoded and rescored with the probabilities of the HMM and an n-gram language model to obtain the transcription. A decision tree is used for the audit model.

Figure 3.3: Training and Validation Accuracy of Target Model
Figure 3.4: Training and Validation Accuracy of Shadow Model 1
Figure 3.5: Training and Validation Accuracy of Shadow Model 2

40 During the audio input’s preprocessing step, we utilize the Kaldi Toolkit [167] to extract MFCC features for each audio of waveform. The force alignment among features and phone states were used to process the label. To prepare a training set, we applied a simple DNN algorithm — multilayer perceptron (MLP) to learn the relationship between the input audios and the output transcriptions. As for the hyperparameters in the MLP model, we set up 4 hidden layers and 1,024 hidden neurons per layer. The learning rate was set at 8%, the model was trained with 24 epochs. The output of this MLP model is a set of pseudo-posteriors probabilities of all possible phonetic units. These outputs are normalized and then fed into a HMM-based encoder. After encoding, an n-gram language model was applied to rescore the probabilities. The final transcription is the text sequence with the highest final probability. To evaluate the target model’s performance, we use the training accuracy and validation accuracy as shown in Figure 3.3. Comparing the training accuracy performed by two shadow models and our target model, the trends are similar and the accuracy curve can ultimately reach 70% (see Figures 3.3, 3.4, 3.5). This indicates that our shadow models can successfully mimic the target model (same transcription on the same audio inputs), or are able to achieve the same utility, i.e., speech recognition (same transcription accuracy, not the same input samples between models.).

3.4.3 Results

To evaluate the auditor’s performance, four metrics are calculated from the confusion matrix, includ- ing accuracy, precision, recall, and F1-score. True Positive (TP): the number of records we pre- dicted as “member” are correctly labelled. True Negative (TN): the number of records we predicted as “nonmember” are correctly labelled. False Positive (FP): the number of records we predicted as “member” are incorrectly labelled. False Negative (FN): the number of records we predicted as “nonmember” are incorrectly labelled:

• Accuracy: the percentage of records correctly classified by the audit model.
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

• Recall: the percentage of all true "member" records correctly determined as "member".
$$Recall = \frac{TP}{TP + FN}$$

• Precision: the percentage of records correctly determined as "member" by the audit model among all records determined as "member".
$$Precision = \frac{TP}{TP + FP}$$

• F1-score: the harmonic mean of precision and recall.
$$F1\text{-}score = \frac{2 \times Recall \times Precision}{Recall + Precision}$$

We show results for the behavior of the auditor under two different circumstances: when the number of users in the training dataset is varied, and when the number of audio samples from the user to be audited is varied.

Table 3.2: The confusion matrix for the auditor.

Class | Actual: member | Actual: nonmember
Predicted: member | TP | FP
Predicted: nonmember | FN | TN

Effect of the number of users in the training dataset. The audit model's behavior when trained with different numbers of users is depicted in Figure 3.6. We trained the audit model with 25, 50, 75, 100, 125, and 150 users randomly sampled from the outputs of the two shadow models. The testing set querying these audit models is fixed at 78 test audio records. To eliminate trial-specific deviations, we repeated each experiment 10 times and averaged the results. The audit model performs fairly well, with all metrics under the different configurations at or above approximately 85%. The model performs better as the number of users within the training set increases. With 100 users in the training set, the performance reaches its highest scores, especially in accuracy (approximately 93%) and F1-score (approximately 95%). When the number of users increases to 125, both metrics drop slightly, but they rise back when the number of users increases to 150. In all configurations, the audit model performs well. Herein, the more users that are used to train the audit model, the more accurately a user's membership of the target model can be determined. The performance of the audit model when trained with an even larger number of users will be considered in our future work.

Figure 3.6: The audit model’s performance Figure 3.7: The audit model’s performance by across the training set size. the number of audios for one speaker.

Effect of the number of audio records per user used in querying the auditor. Since we randomly sample a user's audio to test our audit model, the number of audio samples for this user may not be the same as the number of audios per user in the auditor's training dataset. That is, the number of non-zero features of an audit query may vary. We evaluate the effect on the auditor's performance of using a variable number of audio samples from each user in auditing. Herein, the number of users used in the different testing sets is the same, and $\#\{u \in U_{tar}\} : \#\{u \notin U_{tar}\} = 2 : 1$. To gather different numbers of non-zero features in the audit model's testing dataset, we queried the target model with 78 users, where each user was randomly sampled with one to eight test audio records. As in the experiment above, we repeated the experiment 100 times and averaged the results to reduce deviations in performance. The results are displayed in Figure 3.7. The more audios per user are used to audit membership, the more accurately our audio auditor performs. When the user audits the target model with only one audio, the audit model's performance is relatively low: the accuracy approaches 50%, while the other three metrics are around 25%. When the number of audios reaches eight, all performance results are above 90%.

3.5 Conclusion

This work highlights, and leaves open, the potential of mounting participant-level membership inference attacks in IoT voice services. While our work has yet to examine the attack success rate on various IoT applications across multiple learning models, it does narrow the gap towards defining clear membership privacy at the user level, rather than the record level [38], which leaves questions about whether the privacy leakage hails from the data distribution or the intrinsic uniqueness of the record. Nevertheless, as we argued, both the size of the user base and the number of audio samples per user used in the testing set have been shown to have a positive effect on the IoT audit model. Examining other factors affecting performance and extending possible defenses against auditing are worth further exploration.

3.6 Acknowledgement

We greatly thank Nvidia Corporation for the donation of the Titan XP GPU used for this research.

Chapter 4

The Audio Auditor: Label-Only User-Level Membership Inference in Internet of Things Voice Services

With the rapid development of deep learning techniques, the popularity of voice services implemented on various Internet of Things (IoT) devices is ever increasing. In this chapter, we examine user-level membership inference in the problem space of voice services, by designing an audio auditor to verify whether a specific user had unwillingly contributed audio used to train an automatic speech recognition (ASR) model under strict black-box access. With user representation of the input audio data and their corresponding translated text, our trained auditor is effective in user-level audit. We also observe that the auditor trained on specific data generalizes well regardless of the ASR model architecture. We validate the auditor on ASR models trained with LSTM, RNNs, and GRU algorithms on two state-of-the-art pipelines, the hybrid ASR system and the end-to-end ASR system. Finally, we conduct a real-world trial of our auditor on iPhone Siri, achieving an overall accuracy exceeding 80%. We hope the methodology and findings developed in this chapter can inform privacy advocates to overhaul IoT privacy.

4.1 Introduction

Automatic speech recognition (ASR) systems are widely adopted on Internet of Things (IoT) devices [144, 145]. In the IoT voice services space, competition in the smart speaker market is heating up between giants like Apple, Microsoft, and Amazon [146]. However, parallel to the release of new products, consumers are growing increasingly aware of and concerned about their privacy, particularly about unauthorized access to users' audio in these ASR systems. Of late, privacy policies and regulations, such as the General Data Protection Regulation (GDPR) [142], the Children's Online Privacy Protection Act (COPPA) [143], and the California Consumer Privacy Act (CCPA) [168], have been enforced to regulate personal data processing. Specifically, the Right to be Forgotten [147] allows customers to prevent third-party voice services from continuously using their data [148]. However, the murky boundary between privacy and security can thwart IoT's trustworthiness [149, 150], and many IoT devices may attempt to sniff and analyze the audio captured in real time without a user's consent [151]. Most recently, on WeChat, a hugely popular messaging platform within China and worldwide, a scammer camouflaged their voice to sound like an acquaintance by spoofing his or her voice [152]. Additionally, in 2019, The Guardian reported a threat regarding the leakage of user recordings via Apple Siri [169]. Auditing whether an ASR service provider adheres to its privacy statement can help users protect their data privacy. This motivates us to develop techniques that enable auditing the use of customers' audio data in ASR models.

Recently, researchers have shown that record-level membership inference [38, 170, 43] may expose information about a model's training data even with only black-box access. To mount membership inference attacks, Shokri et al. [38] integrate a plethora of shadow models to constitute the attack model to infer membership, while Salem et al. [43] further relax this process and resort to the target model's confidence scores alone. However, instead of inferring record-level information, we seek to infer user-level information to verify whether a user has any audios within the training set. Therefore, we define user-level membership inference as follows: querying with a user's data, if this user has any data within the target model's training set, then this user is a user-level member of this training set, even if the queried data themselves are not members of the training set. Song and Shmatikov [171] discuss the application of user-level membership inference to text generative models, exploiting several top-ranked outputs of the model. Considering that most ASR systems in the real world do not provide confidence scores, significantly differing from text generative models that lend confidence scores [171], this chapter targets user-level membership inference on ASR systems under strict black-box access, which we define as no knowledge about the model, with only knowledge of the model's output excluding confidence scores and rank information, i.e., only the predicted label is known.

Unfortunately, user-level membership inference on ASR systems with strict black-box access is challenging. (i) The lack of information about the target model is a challenge [172]: as strict black-box inference has little knowledge about the target model's performance, it is hard for shadow models to mimic the target model. (ii) User-level inference requires a higher level of robustness than record-level inference: unlike the record level, user-level inference needs to consider the speaker's voice characteristics. (iii) ASR systems are complicated due to their learning architectures [172], causing membership inference with shadow models to be computationally resource- and time-consuming. Finally, time-series audio data is significantly more complex than textual data, resulting in varied feature patterns [154, 173].

In this chapter, we design and evaluate our audio auditor to help users determine whether their audio records have been used to train an ASR model without their consent. We investigate two types of targeted ASR models: a hybrid ASR system and an end-to-end ASR system. Given an audio signal input, both models transcribe speech into written text. The auditor audits the target model via strict black-box access with the intent to infer user-level membership. The target model will behave differently depending on whether the transcribed audio is from within its training set or from other datasets. Thus, one can analyze the transcriptions and use the outputs to train a binary classifier as the auditor.
As our primary focus is to infer user-level membership, instead of using the rank lists of several top output results, we only use the one text output, the user's speaking speed, and the input audio's true transcription while analyzing the transcription outputs (see details in Section 4.3). In summary, the main contributions of this chapter are as follows:

1. We propose the use of user-level membership inference for auditing the ASR model under strict black-box access. With access to the top predicted label only, our auditor achieves 78.81% accuracy. In comparison, the best accuracy for the user-level auditor in text generative models with one top-ranked output is 72.3% [171].

2. Our auditor is effective in the user-level audit. For a user who has audios within the target model's training set, the accuracy of our auditor queried with these recordings can reach more than 80%. In addition, only nine queries are needed for each user (regardless of their membership or non-membership) to verify the presence of their recordings in the ASR model, at an accuracy of 75.38%.

3. Our strict black-box audit methodology is robust to various architectures and pipelines of the ASR model. We investigate the auditor by auditing ASR models trained with LSTM, RNNs, and GRU algorithms. In addition, two state-of-the-art pipelines for building ASR models are implemented for validation. The overall accuracy of our auditor reaches approximately 70% across various ASR models on auxiliary and cross-domain datasets.

4. We conduct a proof-of-concept test of our auditor on iPhone Siri, under the strict black-box access, achieving an overall accuracy in excess of 80%. This real-world trial lends evidence to the comprehensive synthetic audit outcomes observed in this chapter.

To the best of our knowledge, this is the first work to examine user-level membership inference in the problem space of voice services. We hope the methodology developed in this chapter and its findings can inform privacy advocates to overhaul IoT privacy.

4.2 Background

In this section, we overview the automatic speech recognition models and membership inference attacks.

(a) A hybrid ASR system. (b) An end-to-end system.

Figure 4.1: Two state-of-the-art ASR systems

4.2.1 The Automatic Speech Recognition Model

There are two state-of-the-art pipelines used to build automatic speech recognition (ASR) systems: the typical hybrid ASR systems and the end-to-end ASR systems [174]. To test the robustness of our auditor, we implement both an open-source hybrid ASR system and an end-to-end ASR system focusing on a speech-to-text task as the target models. Hybrid ASR systems are mainly DNN-HMM-based acoustic models [157]. As shown in Fig. 4.1a, a hybrid ASR system is typically composed of a preprocessing step, a model training step, and a decoding step [158]. During the preprocessing step, features are extracted from the input audio, while the corresponding text is processed as the audio's label. The model training step trains a DNN model to produce HMM-state posterior probabilities. The decoding step maps these HMM state probabilities to a text sequence. In this work, the hybrid ASR system is built using the pytorch-kaldi speech recognition toolkit [159]. Specifically, feature extraction transforms the audio frames into the frequency domain as Mel-Frequency Cepstral Coefficients (MFCCs) features. As an additional processing step, feature-space Maximum Likelihood Linear Regression (fMLLR) is used for speaker adaptation. Three popular neural network algorithms are used to build the acoustic model: Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and Recurrent Neural Networks (RNNs). The decoder involves a language model which provides a language probability to re-evaluate the acoustic score. The final transcription output is the sequence of the most suited language with the highest score.

End-to-end ASR systems are attention-based encoder-decoder models [175]. Unlike hybrid ASR systems, the end-to-end system predicts sub-word sequences which are converted directly into word sequences. As shown in Fig. 4.1b, the end-to-end system is a unified neural network modeling framework containing four components: an encoder, an attention mechanism, a decoder, and a softmax layer. The encoder contains feature extraction (i.e., a Visual Geometry Group (VGG) extractor) and a few neural network layers (i.e., bidirectional LSTM (BiLSTM) layers), which encode the input audio into high-level representations. The location-aware attention mechanism integrates the representation of the current time frame with the previous decoder outputs and then outputs the context vector. The decoder can be a single neural network layer (i.e., an LSTM layer), decoding the current context output with the ground truth of the last time frame. Finally, the softmax activation, which can be considered a "CharDistribution", predicts several outputs and integrates them into a single sequence as the final transcription.
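To fix ideas, a heavily stripped-down version of such an attention-based encoder-decoder is sketched below; the dimensions are arbitrary, the VGG front end is omitted, location-aware attention is simplified to dot-product attention, and only greedy decoding is shown, so this is a schematic rather than the actual end-to-end target model.

import torch
import torch.nn as nn

class TinyEncoderDecoderASR(nn.Module):
    """Schematic end-to-end ASR: BiLSTM encoder, dot-product attention, LSTM decoder."""

    def __init__(self, n_features=80, hidden=128, n_chars=30):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTMCell(n_chars + 2 * hidden, hidden)
        self.char_out = nn.Linear(hidden, n_chars)   # the "CharDistribution" layer
        self.query = nn.Linear(hidden, 2 * hidden)   # projects decoder state to encoder space

    def forward(self, feats, max_len=20):
        enc, _ = self.encoder(feats)                 # (B, T, 2*hidden) high-level representations
        B, hidden = feats.size(0), self.char_out.in_features
        h = c = feats.new_zeros(B, hidden)
        prev_char = feats.new_zeros(B, self.char_out.out_features)
        chars = []
        for _ in range(max_len):
            # Attention: score each encoder frame against the current decoder state.
            scores = torch.bmm(enc, self.query(h).unsqueeze(2)).squeeze(2)       # (B, T)
            context = torch.bmm(scores.softmax(1).unsqueeze(1), enc).squeeze(1)  # (B, 2*hidden)
            h, c = self.decoder(torch.cat([prev_char, context], dim=1), (h, c))
            logits = self.char_out(h)
            prev_char = logits.softmax(dim=1)        # feed the char distribution back in
            chars.append(logits.argmax(dim=1))       # greedy decoding step
        return torch.stack(chars, dim=1)             # (B, max_len) character ids

model = TinyEncoderDecoderASR()
print(model(torch.randn(2, 50, 80)).shape)           # torch.Size([2, 20])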

4.2.2 Membership Inference Attack

The membership inference attack is considered a significant privacy threat for machine learning (ML) models [77]. The attack aims to determine whether a specific data sample is within the target model's training set or not. The attack is driven by the different behaviors of the target model when making predictions on samples within or outside its training set. Various membership inference attack methods have been recently proposed. Shokri et al. [38] train shadow models to constitute the attack model against a target ML model with black-box access. The shadow models mimic the target model's prediction behavior. To improve accuracy, Liu et al. [176] and Hayes et al. [170] leverage Generative Adversarial Networks (GAN) to generate shadow models with increasingly similar outputs to the target model. Salem et al. [43] relax the attack assumptions of [38], demonstrating that shadow models are not necessary to launch the membership inference attack. Instead, a threshold on the predicted confidence score can be defined to substitute the attack model; intuitively, a large confidence score indicates that the sample is a member of the training set [177]. The attacks mentioned in the works above are all performed at the record level, while Song and Shmatikov [171] study a user-level membership inference attack against text generative models. Instead of using the prediction label along with the confidence score, Song and Shmatikov [171] utilize the words' rank-list information of several top-ranked predictions as key features to generate the shadow model. Apart from black-box access, Farokhi and Kaafar [178] model the record-level membership inference attack under white-box access. Unlike image recognition systems or text generative systems, ASR systems present additional challenges [172]. With strict black-box access, attacks using confidence scores cannot be applied. With limited discriminative power, features can only be extracted from the predicted transcription and its input audio to launch membership inference attacks, i.e., audio auditing in our work.

4.3 Auditing the ASR Models

Figure 4.2: Auditing an ASR model.

In this section, we first formalize our objective for auditing ASR models. Secondly, we present how a user-level ASR auditor can be constructed and used to audit the target ASR. Finally, we show how we implement the auditor.

4.3.1 Problem Statement

We define user-level membership inference as querying a user's data and trying to determine whether any data within the target model's training set belongs to this user. Even if the queried data are not members of the training set, as long as other data belonging to this user are members of the training set, this user is regarded as a user-level member of the training set. Let (x, y) ∈ X × Y denote an audio sample, where x represents the audio component and y is the actual text of x. Assume an ASR model is a function F : X → Y, where F(x) is the model's translated text. The smaller the difference between F(x) and y, the better the ASR model performs. Let D represent a distribution of audio samples. Assume an audio set A of size N is sampled from D (A ∼ D^N). Let U be the speaker set of A of size M (U ← A). The ASR model trained with the dataset A is denoted as F_A. Let A represent our auditor; the user-level auditing process can then be formalized as:

• A speaker u has an audio sample set S = ∪_{i=1}^{m} (x_i, y_i), where u ← S.

• Let Y′ = ∪_{i=1}^{m} y′_i, where y′_i = F_A(x_i).

• Let "member" = 0 and "nonmember" = 1.

• Set r = 0 if u ∈ U, or r = 1 if u ∉ U.

• The auditor succeeds if A(u, S, Y′) = r; otherwise it fails (a minimal code sketch of this game follows).
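The following is a minimal sketch of this auditing game in Python; asr_model and auditor are placeholders for F_A and the auditor A, not the thesis implementations.

    MEMBER, NONMEMBER = 0, 1

    def audit_game(auditor, asr_model, user, S, member_users):
        """Play one round of the user-level auditing game; True means success."""
        Y_prime = [asr_model(x) for (x, _) in S]        # translated texts F_A(x_i)
        r = MEMBER if user in member_users else NONMEMBER
        return auditor(user, S, Y_prime) == r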

Our auditor, as an application of user-level membership inference, checks a speaker's membership in an ASR model's training set. This ASR model is considered the target model. To closely mirror the real world, we query the target model with strict black-box access: the model only outputs one possible text sequence as its transcription when an audio sample is submitted. This setting reflects reality, as the auditor may not know this transcription's posterior probabilities or other candidate transcriptions. Additionally, any information about the target model is unknown, including the model's parameters, the algorithms used to build the model, and the model's architecture. To evaluate our auditor, we develop our target ASR model F_tar using an audio set A_tar with two popular pipelines (the hybrid ASR model and the end-to-end ASR model) to represent ASR models in the real world. As described in Section 4.2, the hybrid ASR model and the end-to-end ASR model translate the audio in different manners. Under the strict black-box access, the auditor only knows the query audio records of a particular user u and their corresponding output transcriptions. The goal of the auditor is to build a binary classifier

A_audit to discriminate whether this user is a member of the user set whose audio records have been used as the target model's training data (u ∈ U_tar, U_tar ← A_tar).

4.3.2 Overview of the Proposed Audio Auditor

The nature of membership inference [38] is to learn the difference between a model fed with its actual training samples and with other samples. User-level membership inference requires even higher robustness than its record-level variant. Apart from the disparity of the target model's performance at the record level, our auditor needs to consider the speaker's characteristics as well. Since the posterior probabilities (or confidence scores) are not part of the outputs, shadow models are necessary to audit the ASR model. Fig. 4.2 depicts the workflow of our audio auditor. Generally, there are two processes, i.e., training and auditing. The former builds a binary classifier as a user-level membership auditor A_audit using a supervised learning algorithm. The latter uses this auditor to audit an

ASR model F_tar by querying a few audios spoken by one user u. In Section 4.4.4, we show that only a small number of audios per user can determine whether u ∈ U_tar or u ∉ U_tar. Furthermore, a small number of users used to train the auditor is sufficient to provide a satisfying result. Training Process. The primary task in the training process is to build shadow models of high quality. Shadow models, mimicking the target model's behaviors, try to infer the targeted ASR model's decision boundary. Due to strict black-box access, a good-quality shadow model performs with a testing accuracy approximating that of the target model. We randomly sample n datasets from the auxiliary reference dataset D_ref as A_shd1, ..., A_shdn to build n shadow models. Each shadow model's audio dataset A_shdi, i = 1, ..., n, is split into a training set A_shdi^train and a testing set A_shdi^test. To build up the ground truth for auditing, we query the shadow model with A_shdi^train and A_shdi^test. Assume a user's audio set A_u is sampled from the users' audio sets D_users. According to the user-level membership inference definition, the outputs from the audio A_u ∈ A_shdi^test where its speaker u ∉ U_shdi^train are labeled as "nonmember". Otherwise, the outputs translated from the audio A_u ∈ A_shdi^train and from the audio A_u ∈ A_shdi^test where its speaker u ∈ U_shdi^train are all labeled as "member". Herein, U_shdi^train ← A_shdi^train. To simplify the experiment, for each shadow model, training samples are disjoint from testing samples (A_shdi^train ∩ A_shdi^test = ∅). Their user sets are disjoint as well (U_shdi^train ∩ U_shdi^test = ∅). With some feature extraction (noted below), those labeled records are gathered as the auditor model's training set.
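A hedged sketch of assembling the auditor's ground truth from one shadow model, following the labeling rule above. The helpers speaker_of and extract_features are hypothetical stand-ins for the feature extraction described next.

    MEMBER, NONMEMBER = "member", "nonmember"

    def build_audit_records(shadow_model, A_train, A_test, U_train):
        """Label each shadow-model query by user-level membership."""
        records = []
        for audio, truth in list(A_train) + list(A_test):
            u = speaker_of(audio)                  # hypothetical helper
            transcript = shadow_model(audio)       # strict black-box query
            # "member" iff this speaker has audios in the shadow training set.
            label = MEMBER if u in U_train else NONMEMBER
            records.append((u, extract_features(audio, truth, transcript), label))
        return records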

Table 4.1: The audit model's performance when selecting either 3 features, 5 features, or 5 features with MFCCs for each audio's query.

Feature Set             F1-score   Precision   Recall    Accuracy
Feature_Set3            63.89%     68.48%      60.84%    61.13%
Feature_Set5            81.66%     81.40%      82.22%    78.81%
Feature_Set5 + MFCCs    81.01%     79.72%      82.52%    77.82%

Feature extraction is another essential task in the training process. Under the strict black-box access, features are extracted from the input audio, the ground truth transcription, and the predicted transcription. As a user-level membership inferrer, our auditor needs to learn information about both the target model's performance and the speaker's characteristics. Comparing the ground truth transcription and the output transcription, the similarity score is the first feature representing the ASR model's performance. To compute the two transcriptions' similarity score, the GloVe model [179] is used to learn the vector space representations of the two transcriptions, and the cosine similarity between them is calculated as the similarity score. Additionally, the input audio's frame length and the speaking speed are selected as two features representing the speaker's characteristics. Because a user almost always provides several audios to train the ASR model, statistical calculations are applied to the three features above, including sum, maximum, minimum, average, median, standard deviation, and variance. After the feature extraction, all user-level records are gathered with labels to train an auditor model using a supervised learning algorithm. To test the quality of the feature set above, we trained an auditor with 500 user-level samples using the Random Forest (RF) algorithm. By randomly selecting 500 samples 100 times, we achieve an average accuracy over 60%. Apart from the three aforementioned features, two additional features are added to capture more variations in the model's performance: the missing characters and extra characters obtained from the transcriptions. For example, if (truth transcription, predicted transcription) = (THAT IS KAFFAR'S KNIFE, THAT IS CALF OUR'S KNIFE), then (missing characters, extra characters) = (KFA, CL OU). Herein, the blank character in the extra characters means that one word was mistranslated as two words. With these two extra features, a total of five features are extracted from record-level samples: similarity score, missing characters, extra characters, frame length, and speed. The record-level samples are transformed into user-level samples using statistical calculation as previously described. We compare the performance of two auditors trained with the two feature sets. We also consider adding 13 Mel-Frequency Cepstral Coefficients (MFCCs) as an additional audio-specific feature set to accentuate each user's records with their average statistics. As seen in Table 4.1, the statistical feature set with the 5-tuple is the best choice with approximately 80% accuracy, while the results with the additional audio-specific features are similar, trailing by about one percentage point. Thus, we proceed with the five statistical features to represent each user as the outcome of the feature extraction step. Auditing Process. After training an auditor model, we randomly sample a particular speaker u's audios A_u from D_users to query our target ASR model. With the same feature extraction, the outputs can be passed to the auditor model to determine whether this speaker u ∈ U_tar. We assume that our target model's dataset D_tar is disjoint from the auxiliary reference dataset D_ref (D_tar ∩ D_ref = ∅). In addition,

U_ref and U_tar are also disjoint (U_tar ∩ U_ref = ∅). For each user, as we will show, only a limited number of audios are needed to query the target model and complete the whole auditing phase.
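The missing- and extra-character features above can be approximated with a standard sequence alignment; the sketch below uses Python's difflib. The exact character sets depend on the alignment algorithm, so the output may differ slightly from the thesis's own (KFA, CL OU) example; it illustrates the idea only.

    from difflib import SequenceMatcher

    def char_diff(truth: str, predicted: str):
        """Collect characters deleted from the truth and inserted by the model."""
        missing, extra = [], []
        for op, i1, i2, j1, j2 in SequenceMatcher(None, truth, predicted).get_opcodes():
            if op in ("delete", "replace"):
                missing.append(truth[i1:i2])     # characters lost from the truth
            if op in ("insert", "replace"):
                extra.append(predicted[j1:j2])   # characters the model added
        return "".join(missing), "".join(extra)

    print(char_diff("THAT IS KAFFAR'S KNIFE", "THAT IS CALF OUR'S KNIFE"))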

Table 4.2: The audit model's performance trained with different algorithms.

Model    F1-score   Precision   Recall    Accuracy
DT       68.67%     70.62%      67.29%    64.97%
RF       81.66%     81.40%      82.22%    78.81%
3-NN     58.62%     64.49%      54.69%    56.16%
NB       34.42%     93.55%      21.09%    53.96%

4.3.3 Implementation

In these experiments, we take audios from LibriSpeech [180], TIMIT [23], and TED-LIUM [181] to build our target ASR models and shadow models. Detailed information about the speech corpora and model architectures can be found in the Appendix. Since the LibriSpeech corpus has the largest audio sets, we primarily source records from LibriSpeech to build our shadow models. Target Model. Our target model is a speech-to-text ASR model. The inputs are a set of audio files with their corresponding transcriptions as labels, while the outputs are the transcribed sequential texts. To simulate most current real-world ASR models, we created a state-of-the-art hybrid ASR model [158] using the PyTorch-Kaldi Speech Recognition Toolkit [159] and an end-to-end ASR model using a PyTorch implementation [175]. In the preprocessing step, fMLLR features were used to train the ASR model over 24 training epochs. Then, we trained an ASR model using a deep neural network with four hidden layers and one Softmax layer. We experimentally tuned the batch size, learning rate, and optimization function to obtain a model with better ASR performance. To mimic ASR models in the wild, we tuned the parameters until the training accuracy exceeded 80%, similar to the results shown in [174, 175]. Additionally, to better contextualize our audit results, we report the overfitting level of the ASR models, defined as the gap between the predictions' Word Error Rate (WER) on the testing set and on the training set (Overfitting = WER_test − WER_train).
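For reference, the sketch below computes word-level WER via edit distance and the overfitting gap as defined above; model is a placeholder mapping an audio to its transcription, not part of the thesis's toolchain.

    def wer(truth: str, hypothesis: str) -> float:
        """Word error rate: word-level edit distance over the reference length."""
        r, h = truth.split(), hypothesis.split()
        d = [[i + j if i * j == 0 else 0 for j in range(len(h) + 1)]
             for i in range(len(r) + 1)]
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)
        return d[-1][-1] / max(len(r), 1)

    def overfitting(model, train_set, test_set):
        """Gap between average WER on unseen (test) and seen (train) data."""
        avg = lambda data: sum(wer(y, model(x)) for x, y in data) / len(data)
        return avg(test_set) - avg(train_set)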

4.4 Experimental Evaluation and Results

The goal of this work is to develop an auditor for users to inspect whether their audio information is used without consent by ASR models. We mainly focus on the evaluation of the auditor, especially in terms of its effectiveness, efficiency, data transferability, and robustness. As such, we pose the following research questions.

• The effectiveness of the auditor. We train our auditor using different ML algorithms and select one with the best performance. How does the auditor perform with different sizes of training sets? How does it perform in the real-world scenario, such as auditing iPhone Siri?

• The efficiency of the auditor. How many pieces of audios does a user need for querying the ASR model and the auditor to gain a satisfying result?

• The data transferability of the auditor. If the data distribution of the target ASR model’s training set is different from that of the auditor, is there any effect on the auditor’s performance? If there is a negative effect on the auditor, is there any approach to mitigate it?

• The robustness of the auditor. How does the auditor perform when auditing ASR models built with different architectures and pipelines? How does the auditor perform when a user queries it with audios recorded in a noisy environment (i.e., noisy queries)?

4.4.1 Effect of the ML Algorithm Choice for the Auditor

We evaluate our audio auditor as a user-level membership inference model against the target ASR system. This inference model is posed as a binary classification problem, which can be trained with a supervised ML algorithm. We first consider the effect of different training algorithms on our auditor's performance. To test the effect of different algorithms on our audit methodology, we need to train one shadow ASR model for training the auditor and one target ASR model for the auditor's auditing phase. We assume the target ASR model is a hybrid ASR system whose acoustic model is trained with a four-layer LSTM network. The training set used for the target ASR model is 100 hours of clean audio sampled from the LibriSpeech corpus [180]. Additionally, the shadow model is trained using a hybrid ASR structure where a GRU network is used to build its acoustic model. Following our audit methodology in Fig. 4.2, we observe the performance of the audio auditor trained with four popular supervised ML algorithms: Decision Tree (DT), Random Forest (RF), k-Nearest Neighbors with k = 3 (3-NN), and Naive Bayes (NB). After feature extraction, 500 users' samples from the shadow model's query results are randomly selected as the auditor's training set. To avoid potential bias in the auditor, the number of "member" samples and the number of "nonmember" samples are equal in all training set splits

(#{u ∈ U_shd} = #{u ∉ U_shd}). As an additional step to eliminate bias, each experimental configuration is repeated 100 times, and the average result is reported as the respective auditor's final performance in Table 4.2. As shown in Table 4.2, our four metrics of accuracy, precision, recall, and F1-score are used to evaluate the audio auditor. In general, the RF auditor achieves the best performance compared to the other algorithms. Specifically, its accuracy approaches 80%, with the other three metrics exceeding 80%. We note that all auditors' accuracy results exceed random guessing (50%). Aside from the RF and DT auditors, the auditors built with the other ML algorithms behave significantly differently in terms of precision and recall, with gaps between the two metrics above 10%. The reason is in part the difficulty of distinguishing "member" from "nonmember" when a user's audios are all transcribed well at a low speed with short sentences. Tree-based algorithms, with the right sequences of conditions, may be more suitable for discriminating the membership. We regard the RF construction of the auditor as the most successful; as such, RF is the chosen audio auditor algorithm for the remaining experiments.
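A hedged sketch of this selection experiment using scikit-learn (an assumption; the thesis does not name its classifier toolkit). X and y are assumed to be NumPy arrays holding the user-level feature matrix and balanced member/nonmember labels from the feature extraction step.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    def average_accuracy(X, y, n_trials=100, n_samples=500, seed=0):
        """Average test accuracy over repeated random draws of 500 samples."""
        rng = np.random.default_rng(seed)
        scores = []
        for _ in range(n_trials):
            idx = rng.choice(len(X), size=n_samples, replace=False)
            X_tr, X_te, y_tr, y_te = train_test_split(X[idx], y[idx], test_size=0.2)
            clf = RandomForestClassifier().fit(X_tr, y_tr)
            scores.append(accuracy_score(y_te, clf.predict(X_te)))
        return float(np.mean(scores))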

4.4.2 Effect of the Number of Users Used in Training Set of the Auditor

To study the effect of the number of users, we assume that our target model and shadow model are trained using the same architecture (a hybrid ASR system). However, due to the strict black-box access to the target model, the shadow model's acoustic model shall be trained using a different network. Specifically, LSTM networks are used to train the acoustic model of the target ASR system, while a GRU network is used for the shadow model. As depicted in Fig. 4.3, each training sample is formed by statistically aggregating the shadow model's querying results for each user u_j ← ∪_{i=1}^{m} (x_i, y_i). Herein, we train the audio auditor with a varying number of users (j = 1, ..., M). The numbers of users M we considered in the auditor training set are 10, 30, 50, 80, 100, 200, 500, 1,000, 2,000, 5,000, and 10,000. On the smaller numbers of users in the auditor's training set, from Fig. 4.3, we observe a rapid increase in performance with an increasing number of users. Herein, the average accuracy of the auditor is 66.24% initially, reaching 78.81% when the training set size is 500 users. Beyond 500 users, the accuracy decreases and then plateaus. Overall, the accuracy is better than the random guess baseline of 50% in all cases.

Figure 4.3: Auditor model performance with varied training set size.

Figure 4.4: Auditor model accuracy for member users querying with the target model's unseen audios (A_mem^out) against the performance for member users querying only with the seen recordings (A_mem^in).

Aside from accuracy, the precision increases from 69.99% to 80.40%; the recall is 73.69% initially and eventually approaches approximately 90%; and the F1-score is about 80% when the training set size exceeds 200. In summary, we identify the auditor's peak performance when using a relatively small number of users for training. Recall the definition of user-level membership in Section 4.3.1. We further consider two extreme scenarios in the auditor's testing set. One extreme case is that the auditor's testing set only contains member users querying with unseen audios (excluded from the target model's training set), henceforth denoted as A_mem^out. The other extreme case is an auditor's testing set that only contains member users querying with seen audios (exclusively from the target model's training set), herein marked as A_mem^in. Fig. 4.4 reports the accuracy of our auditor on A_mem^in versus A_mem^out. If an ASR model were to use a user's recordings as its training samples (A_mem^in), the auditor can determine the user-level membership with much higher accuracy, compared to user queries on the ASR model with A_mem^out. Specifically, A_mem^in has a peak accuracy of 93.62% when the auditor's training set size M is 5,000. Considering the peak performance previously shown in Fig. 4.3, auditing with A_mem^in still achieves good accuracy (around 85%) despite a relatively small training set size. Comparing the results shown in Fig. 4.3 and Fig. 4.4, we can infer that the larger the auditor's training set size, the more likely nonmember users are to be misclassified. The auditor's overall performance peaking with a small training set is largely due to the high accuracy of the shadow model. A large number of training samples perhaps contains a large proportion of nonmember users' records whose translation accuracy is similar to the member users'. Overall, it is better for users to choose audios with a higher likelihood of being contained within the ASR model's training set for audit (for example, audios once heard by the model).

4.4.3 Effect of the Target Model Trained with Different Data Distributions

The previous experiment draws conclusions based on the assumption that the distributions of training sets for the shadow model and the target model are the same. That is, these two sets were sampled from

(a) Accuracy (b) Precision (c) Recall

Figure 4.5: The auditor model audits target ASR models trained with training sets of different data distributions. We observe that in regards to accuracy and recall the target model with the same distribution as the auditor performs the best, while the contrary is observed for precision. Nevertheless, the data transferability is well observed with reasonably high metrics for all data distributions.

the LibriSpeech corpus D_L (A_tar ∼ D_L, A_shd ∼ D_L, A_tar ∩ A_shd = ∅). Aside from the effect of a changing number of users used to train the auditor, we relax this distribution assumption to evaluate the data transferability of the auditor. To this end, we train one auditor using a training set sampled from

LibriSpeech D_L (A_shd ∼ D_L). Three different target ASR models are built using data selected from LibriSpeech, TIMIT, and TED, respectively. Fig. 4.5 plots the auditor's data transferability in terms of average accuracy, precision, and recall. Once above a certain threshold of training set size (≈ 10), the performance of our auditor improves significantly with an increasing number of users' data selected as its user-level training samples. Comparing the peak results, the audit of the target model trained with the same data distribution (LibriSpeech) slightly outperforms the audits of target models with different distributions (TIMIT and TED). For instance, the average accuracy of the auditor auditing LibriSpeech data reaches 78.81% when the training set size is 500, while the average audit accuracy on the TIMIT target model peaks at 72.62% with 2,000 users. Lastly, the average audit accuracy on the TED target model reaches its maximum of 66.92% with 500 users. As shown in Fig. 4.5, the precision peaks for the LibriSpeech, TIMIT, and TED target models are 81.40%, 93.54%, and 100%, respectively, the opposite of what was observed for accuracy and recall. The TED target model's extremely high precision and low recall are perhaps due to the dataset's characteristics, as all of the audio clips of TED are long speeches recorded in a noisy environment.

In conclusion, our auditor demonstrates satisfying data transferability in general.

4.4.4 Effect of the Number of Audio Records Per User

The fewer audio samples a speaker must submit for their user-level query during the auditing phase, the more convenient the auditor is to use. Additionally, if the auditor can be trained with user-level training samples accumulated from a reduced number of audios per user, both added convenience and more efficient feature preprocessing during the auditor's training can be realized. A limited number of audio samples per user versus a large number of audio samples per user. Assuming that each user audits their target ASR model by querying with a limited number of audios, we consider whether a small or a large number of audio samples per user should be collected to train our auditor. Herein, varying the number of audios per user only affects the user-level information

Figure 4.6: A comparison of average accuracy for one audio, five audios, and all audios per user when training the auditor model with a limited number of audios per user gained in the auditing phase.

Figure 4.7: A varying number of audios used for each speaker when querying an auditor model trained with 5 audios per user.

learned by the auditor during the training phase. To evaluate this, we sampled one, five, and all audios per user as the training sets while the querying set uses five audios per user. Fig. 4.6 compares the average accuracy of the auditors when their training sets are processed with limits of one audio, five audios, and finally all audios of each user. To set up the five-audio auditor's training sets, we randomly select five audios recorded from each user u_j ← ∪_{i=1}^{m=5} (x_i, y_i), then translate these audios using the shadow model to produce five transcriptions. Following the feature preprocessing demonstrated in Section 4.3, user-level information for each user is extracted from these five output transcriptions and their corresponding input audios. The same process is applied to construct the auditor in which the training data consists of one audio per user. To set up the auditor's training set with all of the users' samples, we collect all audios spoken by each user and repeat the process mentioned above (on average, m̄ > 62). Moreover, since the two auditors' settings above rely on randomly selected users, each configuration is repeated 100 times, with users sampled anew, to report an average result free of sampling biases. Fig. 4.6 demonstrates that the auditor performs best when leveraging five audios per user during the feature preprocessing stage. When a small number of users are present in the training set, the performance of the two auditors is fairly similar, except for the auditor trained with one audio per user. For example, when only ten users are randomly selected to train the auditor, the average accuracies of these two auditors are 61.21% and 61.11%. When increasing to 30 users in the training set, the average accuracies of the 5-sample and all-sample auditors are 65.65% and 64.56%, respectively. However, with more than 30 users in the training set, the auditor trained on five audios per user outperforms the one using all audios per user. Specifically, when using five audios per user, the auditor's average accuracy rises to ≈70% with a larger training set size, compared to a degraded accuracy of ≈55% for the auditor using all audios per user. This is in part owing to the difficulty of accurately characterizing users from all of their audios. In conclusion, despite restrictions on the number of user audio samples when training the auditor, the auditor can achieve superior performance. Consequently, we recommend that the number of audios per user collected for the auditor's training process be the same as for the auditor's querying process.
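The per-user feature preprocessing referenced here can be sketched as the statistical aggregation named in Section 4.3.2; a minimal NumPy version follows. The ordering of the statistics is an assumption for illustration.

    import numpy as np

    def user_level_features(record_features):
        """Aggregate a user's record-level features (n_audios x n_features)."""
        m = np.asarray(record_features, dtype=float)
        stats = [m.sum(0), m.max(0), m.min(0), m.mean(0),
                 np.median(m, axis=0), m.std(0), m.var(0)]
        return np.concatenate(stats)    # one fixed-length user-level sample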

A limited number of audio samples per user while querying the auditor. While we have investigated the effect of using a limited number of audios per user to build the training set, we now ask how the auditor performs with a reduced number of audios provided by the user during the querying stage of the audit process, and how many audios each user needs to submit to preserve the performance of this auditor. We assume our auditor has been trained using the training set computed with five audios per user. Fig. 4.7 displays the performance trends (accuracy, precision, recall, and F1-score) when a varying number of query audios per user is provided to the target model. We randomly select a user's audios to query the target model, testing m = 1, 3, 5, 7, 9, or 11 audios per user. As Fig. 4.6 reveals that the accuracy results are stable when the training set size is large, we conduct our experiments using 10,000 records in the auditor training set. Again, each experiment is repeated 100 times, and the results are averaged. Fig. 4.7 illustrates that the auditor performs well, with results all above 60%. Apart from recall, the other three metrics trend upwards with an increasing number of audios per user. The accuracy, precision, and F1-score are approximately 75%, 81%, and 78%, respectively, when each user queries the target model with nine audios, an improvement over the accuracy (≈72%) previously observed in Fig. 4.3. It appears, for accuracy, that as the number of query audios per user grows, the upward trend slows down and even slightly declines. The recall is maximized (89.78%) with only one audio queried per user, decreasing to 70.97% with eleven audios per user. This might happen because an increased number of audios per user does not imply an increased number of users (i.e., testing samples). Since the auditor was trained with five audios per user, it may fail to recognize a user's membership when queried with many more audios.

Overall, with only a limited number of audios used for audit, e.g. nine audios per user, our auditor still effectively discriminates a user’s membership in the target model’s training set.

4.4.5 Effect of Training Shadow Models across Different Architectures

(a) Accuracy (b) Precision (c) Recall

Figure 4.8: Auditor model performance when trained with different ASR shadow model architectures.

A shadow model trained with a different architecture influences how well it mimics the target model and thus the performance of the user-level audio auditor. In this subsection, we experiment with different shadow model architectures by training the auditor with information from various network algorithms, namely LSTM, RNNs, and GRU. If the choice of shadow model algorithm has a substantial impact on the auditor's

Table 4.3: Information about ASR models trained with different architectures. (WER_train: the prediction's WER on the training set; WER_test: the prediction's WER on the testing set; t: target model; s: shadow model.)

ASR Model       Architecture              Dataset Size   WER_train   WER_test
LSTM-ASR (s)    4-LSTM layer + Softmax    360 hrs        6.48%       9.17%
RNN-ASR (s)     4-RNN layer + Softmax     360 hrs        9.45%       11.09%
GRU-ASR (s)     5-GRU layer + Softmax     360 hrs        5.99%       8.48%
LSTM-ASR (t)    4-LSTM layer + Softmax    100 hrs        5.06%       9.08%

performance, we shall seek a method to lessen such an impact. We also evaluate the influence of the combining attack proposed by Salem et al. [43] by combining the transcription results from a set of ASR shadow models, instead of one, to construct the auditor's training set; the feature extraction method is as demonstrated in Section 4.3. We refer to this combination as the user-level combining audit. To explore the specific impact of architecture, we assume that the acoustic model of the target ASR system is mainly built with the LSTM network (we call this model the LSTM-ASR target model). We consider three popular algorithms, LSTM, RNNs, and GRU networks, for the shadow model's acoustic model. The details of the target and shadow ASR models are displayed in Table 4.3. Each shadow model is used to translate various audios, with the results processed into user-level information to train an auditor. Considering this, the shadow model that mainly uses the GRU network structure for its acoustic model is marked as the GRU-ASR shadow model; its corresponding auditor, named the GRU-based auditor, is built using the training set constructed from the GRU-ASR shadow model's query results. Our other two auditors follow a similar naming convention: an LSTM-based auditor and an RNN-based auditor. Moreover, as demonstrated in Fig. 4.2, we combine these three shadow models' results (n = 3) and construct user-level training samples to train a new combined auditor. This auditor is denoted as the Combined Auditor, which learns from all kinds of popular ASR models. Fig. 4.8 demonstrates the varied auditor performance (accuracy, precision, and recall) when shadow models using various algorithms are deployed. For accuracy, all four auditors show an upward trend with a small training set size. The peak is observed at 500 training samples, after which the accuracy decays to a stable smaller value at very large training set sizes. The GRU-based auditor surpasses the other three auditors in terms of accuracy, with the Combined Auditor performing second-best when the auditor's training set size is smaller than 500. As for precision, all experiments show relatively high values (all above 60%), particularly the LSTM-based auditor with a precision exceeding 80%. According to Fig. 4.8c, the RNN-based auditor and GRU-based auditor show an upward trend in recall; both of their recalls exceed 80% when the training set size is larger than 500. The recall trends for the LSTM-based auditor and the Combined Auditor are the opposite of those of the GRU- and RNN-based auditors. In general, the RNN-based auditor performs well across all three metrics. The LSTM-based auditor shows excellent precision, while the GRU-based auditor obtains the highest accuracy.
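The user-level combining audit can be sketched by pooling the labeled records of several shadow models before training one auditor; build_audit_records is the hypothetical helper sketched in Section 4.3.2.

    def combined_training_set(shadow_models, shadow_splits):
        """Pool labeled records from n shadow models (here n = 3) into one set."""
        records = []
        for model, (A_train, A_test, U_train) in zip(shadow_models, shadow_splits):
            records.extend(build_audit_records(model, A_train, A_test, U_train))
        return records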

The algorithm selected for the shadow model influences the auditor's performance. The Combined Auditor can achieve higher-than-average accuracy, but only if its training set is relatively small.

4.4.6 Effect of Noisy Queries

We also evaluate the user-level audio auditor's robustness when provided with noisy audios, and consider the effect of noisy audios when querying different auditors trained on different shadow model architectures. We describe the performance of the auditor with two metrics, precision and recall; these results are illustrated in Fig. 4.9.

(a) GRU-based Auditor Precision (b) LSTM-based Auditor Precision (c) RNN-based Auditor Precision

(d) GRU-based Auditor Recall (e) LSTM-based Auditor Recall (f) RNN-based Auditor Recall

Figure 4.9: Different auditors auditing noisy queries with different ASR shadow models.

To explore the effect of noisy queries, we assume that our target model is trained with noisy audios. Under the strict black-box access to this target model, we use different neural network structures to build the target model(s) and the shadow model(s). That is, the target model is an LSTM-ASR target model, while the GRU-ASR shadow model is used to train the GRU-based auditor. To evaluate the effect of the noisy queries, two target models are prepared using (i) clean audios (100 hours) and (ii) noisy audios (500 hours) as training sets. In addition to the GRU-based auditor, another two auditors are constructed: an LSTM-based auditor and an RNN-based auditor. The target models audited by the latter two auditors are the same as for the GRU-based auditor. Herein, the LSTM-based auditor has an LSTM-ASR shadow model whose acoustic model shares the same algorithm as the LSTM-ASR target model. Fig. 4.9a and Fig. 4.9d compare the precision and recall of the GRU-based auditor on target models trained with clean and noisy audios, respectively. Overall, the auditor's performance drops when auditing noisy queries, but the auditor still outperforms random guessing (>50%). By varying the size of the auditor's training set, we observe that the precision of the auditor querying clean and noisy audios displays similar trends. When querying noisy audios, the largest change in precision is ≈11%, where the auditor's training set size was 500. Its precision results when querying clean and noisy audios are around 81% and

70%, respectively. However, the trends of the two recall results are nearly opposite, and the recall on noisy queries decreases remarkably. The largest drop in recall is about 42%, where the auditor was trained with ten training samples; its recall results when querying the two kinds of audios are around 74% and 32%, respectively. In conclusion, we observe that the impact of noisy queries on our auditor is fairly negative. Fig. 4.9b and Fig. 4.9e display the LSTM-based auditor's precision and recall, respectively, while Fig. 4.9c and Fig. 4.9f illustrate the RNN-based auditor's performance. Trends similar to the earlier precision results are observed for the RNN-based auditor when querying clean and noisy audios. Curiously, however, the RNN-based auditor, when querying noisy audios, slightly outperforms its queries on clean audios. Similar to the noisy-query effect on the GRU-based auditor, the RNN-based auditor's recall on noisy queries decreases significantly versus its results on clean audios. Though noisy queries show a negative effect, all recalls of the RNN-based auditor exceed 50%, the random guess. As for the effect of noisy queries on the LSTM-based auditor, unlike the GRU-based and RNN-based auditors, the LSTM-based auditor demonstrates high robustness on noisy queries: for most of its precision and recall results, the differences between the performance on clean and noisy queries are no more than 5%.

In conclusion, noisy queries have a negative effect on our auditor's performance. Yet, if the shadow ASR model and the target ASR model are trained with the same algorithm, this negative effect can be largely eliminated.

4.4.7 Effect of Different ASR Model Pipelines on Auditor Performance

(a) Accuracy (b) Precision (c) Recall

Figure 4.10: The audit model audits different target ASR models trained with different pipelines.

Aside from the ASR model's architecture, we examine the user-level auditor's robustness across different pipelines commonly found in ASR systems. In this section, the ASR pipeline is not just a machine learning model but a complicated system, as shown in Fig. 4.1. In practice, the two most popular pipelines adopted in ASR systems are the hybrid ASR system and the end-to-end ASR system. We build our auditor using the GRU-ASR shadow model and prepare two target models trained on the aforementioned ASR pipelines. Specifically, one target model utilizes the Pytorch-Kaldi toolkit to construct a hybrid DNN-HMM ASR system, while the other target model employs an end-to-end ASR system. Fig. 4.10 reports the performance (accuracy, precision, and recall) of the auditor when auditing the two different target pipelines. Overall, the auditor behaves well over all metrics when auditing either

target model (all above 50%). The auditor consistently demonstrates good performance when using a small number of training samples. The auditor targeting the hybrid ASR achieves a better result than when targeting the end-to-end ASR. A possible reason is that our auditor is built upon a shadow model with a hybrid ASR architecture. Focusing on accuracy, the highest audit score on the hybrid ASR target model is 78.8%, while that on the end-to-end ASR target model is 71.92%. The difference in the auditor's precision is not substantial, with the highest precision scores being 81.4% and 79.1%, respectively. However, in terms of recall, the auditor's ability to determine user-level membership on the hybrid ASR target model is much higher than on the end-to-end target model, with maximum recalls of 90% and 72%, respectively. Overall, when auditing the hybrid ASR target model, the auditor significantly outperformed its audits of other models. The training and testing data for both state-of-the-art ASR model architectures (i.e., hybrid and end-to-end) are the same. Thus, to confidently understand the impact of different ASR model pipelines on the auditor's performance, we also investigate the difference between the overfitting levels of the hybrid ASR target model and the end-to-end ASR target model, as overfitting increases the success rate of membership inference attacks [38]. Recall that overfitting was defined in Section 4.3.3. The overfitting value of the hybrid ASR target model is measured as 0.04, while that of the end-to-end ASR target model is 0.14. Contrary to the conclusions observed in [43], the more overfit target model did not increase the performance of our user-level audio auditor. One likely reason is that our auditor audits the target model by considering user-level information under strict black-box access. Compared to the conventional black-box access in [43], our strict black-box access obtains its output from the transcribed text alone; consequently, the influence of overfitting on specific words (WER) is minimized. Thus, our auditor's success is not entirely attributed to the degree of the target ASR model's overfitting.

In conclusion, different ASR pipelines between the target model and the shadow model negatively impact the performance of the auditor. Nevertheless, our auditor still performs well when the target model is trained following a different pipeline (i.e., an end-to-end ASR system), significantly outperforming random guesses (50%).

4.4.8 Real-World Audit Test

To test the practicality of our model in the real world, we keep our auditor model locally and conduct a proof-of-concept trial auditing iPhone Siri's speech-to-text service. We select the auditor trained on the GRU shadow model with the LibriSpeech 360-hour voice data as its training set. To simplify the experiments, we sample five audios per user for each user's audit. According to the results presented in Fig. 4.6 and Fig. 4.7, we select the auditor trained with five audios per user, where 1,000 users were sampled randomly as the auditor's training set. To obtain the average performance of our auditor in the real world, we trained and stored 100 auditors under the same settings, with the training set resampled 100 times. The final performance is the average of these 100 auditors' results. Testbed and Data Preprocessing. iPhone Siri provides strict black-box access to users, and the only dictation result is its predicted text. All dictation tasks were completed, and all audios recorded, in quiet surroundings. The clean audios were played via a Bluetooth speaker to ensure Siri could sense the

audios. User-level features were extracted as per Section 4.3.2. Siri was targeted on an iPhone X running iOS 13.4.1. Ground Truth. We target a particular user û of iPhone Siri's speech-to-text service. According to Apple's privacy policy for Siri (see Appendix D), an iPhone user's Siri recordings can be selected to improve the Siri and dictation services in the long term (for up to two years). We note that this is an opt-in service. Simply put, this user can be labeled as a "member". As for the "nonmember" users, we randomly selected 52 speakers from the LibriSpeech dataset, which was collected before 2014 [180]. As stated by iPhone Siri's privacy policy, users' data "may be retained for up to two years". Thus, audios sampled from LibriSpeech can be considered out of this Siri's training set, and we regard the corresponding LibriSpeech speakers as "nonmember" users. To avoid nonmember audios entering Siri's training data to retrain its ASR model during testing, each user's querying audios were completed on the same day we commenced tests for that user, with the Improve Siri & Dictation setting turned off. As defined above, a "member" may make the following queries to our auditor: querying the auditor (i) with audios within the target model's training set (D_û = ∪ A_mem^in); (ii) with audios out of the target model's training set (D_û = ∪ A_mem^out); or (iii) with part of his or her audios within the target model's training set (D_û = (∪ A_mem^in) ∪ (∪ A_mem^out)). Thus, we generate six "member" samples where the audios were all recorded by the target iPhone's owner, including D_û = ∪_{k=5} A_mem^in, D_û = ∪_{m=5} A_mem^out, and D_û = (∪_k A_mem^in) ∪ (∪_m A_mem^out), where k = 1, 2, 3 and k + m = 5. In total, we collected 58 user-level samples with 6 "member" and 52 "nonmember" samples. Results. We load the 100 auditors to test those samples; the averaged overall accuracy is 89.76%. Specifically, the average precision of predicting the "member" samples is 58.45%, while the average precision of predicting the "nonmember" samples is 92.61%. The average ROC AUC is 72.6%, which indicates our auditor's separability in this experiment. Apart from the different behaviors of Siri when translating audios from "member" and "nonmember" users, we suspect that another reason for the high precision on "nonmember" samples is that the LibriSpeech audios are out of Siri's dictation scope. As for the low precision on "member" samples, we single out the data D_û = ∪_{k=5} A_mem^in for testing; its average accuracy can reach 100%. Thus, the auditor is much more capable of handling A_mem^in than A_mem^out, corroborating our observation in Section 4.4.2.

In conclusion, our auditor shows a generally satisfying performance for users auditing a real-world ASR system, Apple’s Siri on iPhone.

4.5 Threats to Auditors’ Validity

Voiceprint Anonymization. In determining the user-level membership of audios in the ASR model, our auditor relies on the target model's different behaviors when presented with training and unseen samples. The auditor's quality depends on the diverse responses of the target model when translating audio from different users. This feature is known as the user's voiceprint. The voiceprint is measured in [182] based on a speaker recognition system's accuracy. Our auditor represents the user's voiceprint using two accumulated features, the missing characters and extra characters. However, if an ASR system is built using voice anonymization, our user-level auditor's performance would degrade significantly. The speaker's voice is disguised in [182] by using robust voice conversion while ensuring the correctness

of speech content recognition. Herein, the most popular technique for voice conversion is frequency warping [183]. In addition, abundant information about speakers' identities is removed in [184] by using adversarial training on the audio content features. Fig. 4.1 shows that the average accuracy of the auditor dropped by approximately 20% without using the two essential features. Hence, auditing user-level membership in a speech recognition model trained with anonymized voiceprints remains a future avenue of research. Differentially Private Recognition Systems. Differential privacy (DP) is one of the most popular methods to prevent ML models from leaking training data information. The work [171] protects a text generative model by applying user-level DP to its language model. The hybrid ASR system contains a language model during training, to which user-level DP could similarly be applied to obscure identities, at the sacrifice of transcription performance. The speaker and speech characterization process is protected in [185] by inserting noise during the learning process. However, due to strict black-box access and the lack of output probability information, our auditor's performance on auditing an ASR model with DP remains unknown. The investigation of our auditor's performance against this user protection mechanism is open for future research. Workarounds and Countermeasures. Although Salem et al. [43] have shown that neither the shadow model nor the attack model is required to perform membership inference, due to the constraints of strict black-box access, the shadow-model-and-auditor approach provides a promising means to perform the more difficult task of user-level membership inference. Instead of output probabilities, we mainly leverage the ASR model's translation errors at the character level to represent the model's behaviors. Alternative countermeasures against membership inference, such as dropout, generally change the target model's output probabilities. However, the changes to the probabilities of the ASR model's output are not as sensitive as changes to its translated text [43]. Studying the extent of this sensitivity of ASR models remains future work. Synthetic ASR Models. Another limitation of our work is that we evaluate our auditor on synthetic ASR systems trained on real-world datasets, and we have not applied the auditor to an extensive set of real-world models aside from Siri. However, we believe that our reconstruction of the ASR models closely mirrors ASR models in the wild.

4.6 Related Work

Membership Inference Attacks. As a fundamental privacy threat to ML models, the membership inference attack distinguishes whether a particular data sample is a member of the target model's training set. Traditional membership inference attacks against ML models under black-box access leverage numerous shadow models to mimic the target model's behavior [38, 186, 170]. Salem et al. [43] revealed that membership inference attacks can be launched by directly utilizing the prediction probabilities and thresholds of the target model. Both [186] and [187] prove that overfitting of a model is sufficient but not necessary for the success of a membership inference attack. Yeom et al. [187], as well as Farokhi and Kaafar [178], formalize the membership inference attack under black-box and white-box access. All previously mentioned works consider record-level inference; however, Song and Shmatikov [171] deploy a user-level membership inference attack against text generative models, with only the top-n predictions known. Trustworthiness of ASR Systems. ASR systems are often deployed on voice-controlled devices [172], voice personal assistants [188], and machine translation services [173]. Tung and Shin [189]

propose SafeChat, which utilizes a masking sound to distinguish authorized audios from unauthorized recordings and prevent information leakage. Recent works [190] and [191] propose an audio cloning attack and an audio replay attack against speech recognition systems to impersonate a legitimate user or inject unintended voices. Voice masquerading to impersonate users on voice personal assistants has been studied in [192]. Zhang et al. [192] also propose another attack, namely voice squatting, which hijacks the user's voice command by producing a sentence similar to the legitimate command. Du et al. [173] generate adversarial audio samples to deceive end-to-end ASR systems. Auditing ML Models. Many currently proposed auditing services seek to audit the bias and fairness of a given model [193]. Works have also been presented to audit an ML model to learn and check the model's prediction reliability [194, 195, 196]. Moreover, auditors are utilized to evaluate an ML model's privacy risk when protecting an individual's digital rights [171, 197]. Our Work. Our user-level audio auditor audits the ASR model under strict black-box access. As shown in Section 4.3, we utilize the ASR model's translation errors at the character level to represent the model's behavior. Compared to related works under black-box access, our auditor does not rely on the target model's output probabilities [38, 43]. In addition, we sidestep the feature pattern of several top-ranked outputs of the target model adopted by Song and Shmatikov [171]; instead, we use one text output, the user's speaking speed, and the input audio's true transcription, as we do not have access to the output probabilities (usually unattainable in ASR systems). Our constraint of strict black-box access only allows accessing the single top-ranked output. Even so, our user-level auditor (78.81%) outperforms Song and Shmatikov's user-level auditor (72.3%) in terms of accuracy. Moreover, Hayes et al. [170] use generative adversarial networks (GANs) to approximate the target model's output probabilities while suffering large performance penalties with only 20% accuracy; our auditor's accuracy is far higher. Furthermore, our auditor is much easier to train than the solution of finding outlier records with a unique influence on the target model [186], because we only need to train one shadow model instead of many shadow (or reference) models.

4.7 Limitations and Future Work

Further investigation of features. From our set of selected features, both the audio-specific features and the features capturing model behaviors perform well, as observed in our results. It remains to be seen whether additional audio-specific features would specifically aid the task of user-level auditing. As there is a plethora of potential feature candidates, we consider this as future work. Auditing performance with varied numbers of queries. In our auditor, we observe that only a limited number of queries per user is necessary to audit the target ASR model, especially when the auditor is trained with a limited number of audios per user. An interesting observation was that our user-level auditor's recall declines with more queries per user under strict black-box access. We are continuing our investigation into why our auditor's ability to find unauthorized use of user data varies in this manner when queried with different numbers of audios. Member audio in Siri auditing. In our setting, we make our best effort to ensure member audios were used for training. However, in our real-world evaluation, even with the "Improve Siri & Dictation" setting turned on over an extended period of continual use by our user, we cannot guarantee that the member audios of our member user were actually used for training, although we are confident they have been included.

4.8 Conclusion

This work highlights and exposes the potential of carrying out user-level membership inference audits in IoT voice services. The auditor developed in this chapter has demonstrated promising data transferability, while allowing a user to audit his or her membership with a query of only nine audios. Even with audios that are not within the target model's training set, the user's membership can still be faithfully determined. While our work has yet to overhaul the audit accuracy on various IoT applications across multiple learning models in the wild, we do narrow the gap towards defining clear membership privacy at the user level, rather than the record level [38]. However, questions remain about whether the privacy leakage hails from the data distribution or the intrinsic uniqueness of the record. Nevertheless, as we have shown, both a small training set size and the Combined Auditor, which combines results from various ASR shadow models to train the auditor, have a positive effect on the IoT audit model; on the contrary, audios recorded in a noisy environment and different ASR pipelines impose a negative effect on the given auditor; fortunately, the auditor still outperforms random guesses (50%). Examining other performance factors on more real-world ASR systems beyond our iPhone Siri trial, and extending possible countermeasures against auditing, are all worth further exploration.

Acknowledgments

We thank all anonymous reviewers for their valuable feedback. This research was supported by the Australian Research Council, Grant No. LP170100924. This work was also supported by resources provided by the Pawsey Supercomputing Centre, funded by the Australian Government and the Government of Western Australia.

Appendix

A. Datasets

The LibriSpeech speech corpus (LibriSpeech) contains 1,000 hours of speech audio from audiobooks that are part of the LibriVox project [180]. This corpus is widely used for training and evaluating speech recognition systems. At least 1,500 speakers have contributed their voices to this corpus. We use 100 hours of clean speech data with 29,877 recordings to train and test our target model. 360 hours of clean speech data, including 105,293 recordings, are used for training and testing the shadow models. Additionally, 500 hours of noisy data are used to train the ASR model and to test our auditor's performance in a noisy environment. The TIMIT speech corpus (TIMIT) is another famous speech corpus used to build ASR systems. This corpus recorded audios from 630 speakers across the United States, totaling 6,300 sentences [23]. In this work, we use all of this data to train and test a target ASR model, and then audit this model with our auditor. The TED-LIUM speech corpus (TED) collected audios from TED Talks for ASR development [181]. This corpus was built from the TED talks of the International Workshop on Spoken Language Translation (IWSLT) 2011 Evaluation Campaign. It contains 118 hours of speeches with corresponding transcripts.

B. Evaluation Metrics

The user-level audio auditor is evaluated with four metrics calculated from the confusion matrix, which reports the numbers of true positives, true negatives, false positives, and false negatives: True Positive (TP), the number of records predicted as "member" that are correctly labeled; True Negative (TN), the number of records predicted as "nonmember" that are correctly labeled; False Positive (FP), the number of records predicted as "member" that are incorrectly labeled; False Negative (FN), the number of records predicted as "nonmember" that are incorrectly labeled. Our evaluation metrics, listed below, are derived from these counts; a minimal code sketch computing them follows the list.

• Accuracy: the percentage of records correctly classified by the auditor model.

• Precision: the percentage of records correctly determined as “member” by the auditor model among all records determined as “member”.

• Recall: the percentage of all true “member” records correctly determined as “member”.

• F1-score: the harmonic mean of precision and recall.
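A minimal sketch computing the four metrics from the confusion-matrix counts defined above:

    def audit_metrics(tp: int, tn: int, fp: int, fn: int):
        """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return accuracy, precision, recall, f1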

C. ASR Models’ Architectures

On the LibriSpeech 360-hour voice dataset, we build one GRU-ASR model with the Pytorch-Kaldi toolkit. That is, we train a five-layer GRU network with each hidden layer of size 550 and one Softmax layer. We use tanh as the activation function. The optimization function is Root Mean Square Propagation (RMSProp). We set the learning rate to 0.0004, the dropout rate for each GRU layer to 0.2, and the number of training epochs to 24. On the LibriSpeech 360-hour voice dataset, we train another ASR model using the Pytorch-Kaldi toolkit. Specifically, it is a four-layer RNN network with each hidden layer of size 550 using ReLU as the

activation function and a Softmax layer. The optimization function is RMSProp. We set the learning rate to 0.00032, the dropout rate for each RNN layer to 0.2, and the number of training epochs to 24. On the LibriSpeech 360-hour voice dataset, we train one hybrid LSTM-ASR model. The acoustic model is constructed with a four-layer LSTM and one Softmax layer. The size of each hidden LSTM layer is 550, along with a 0.2 dropout rate. The activation function is tanh, while the optimization function is RMSProp. The learning rate is 0.0014, and the maximum number of training epochs is 24. On the LibriSpeech 100-hour voice dataset, we train a hybrid ASR model. The acoustic model is constructed with a four-layer LSTM and one Softmax layer. Each hidden LSTM layer has 550 neurons, along with a 0.2 dropout rate. The activation function is tanh, while the optimization function is RMSProp. The learning rate is 0.0016, and the maximum number of training epochs is 24. On the LibriSpeech 100-hour voice dataset, we train an end-to-end ASR model. The encoder is constructed with a five-layer LSTM with each layer of size 320 and a 0.1 dropout rate. We use one layer of location-based attention with 300 cells. The decoder is constructed with a one-layer LSTM with 320 neurons and a 0.5 dropout rate. CTC decoding is enabled with a weight of 0.5. The optimization function is Adam. The learning rate is 1.0, and the total number of training epochs is 24. On the TED-LIUM dataset, we train a hybrid ASR model. The acoustic model is constructed with a four-layer LSTM and one Softmax layer. Each hidden LSTM layer has 550 neurons, along with a 0.2 dropout rate. The activation function is tanh, while the optimization function is RMSProp. The learning rate is 0.0016, and the maximum number of training epochs is 24. On the TIMIT dataset, we train a hybrid ASR model. The acoustic model is constructed with a four-layer LSTM and one Softmax layer. Each hidden LSTM layer has 550 neurons, along with a 0.2 dropout rate. The activation function is tanh, while the optimization function is RMSProp. The learning rate is 0.0016, and the maximum number of training epochs is 24.
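As a reference point, the following is a hedged PyTorch sketch mirroring the five-layer GRU acoustic model configuration above (hidden size 550, 0.2 dropout, Softmax output). Pytorch-Kaldi wires such a network into Kaldi's HMM decoder; this standalone module only reflects the layer configuration, and the input and output dimensions (n_feats, n_states) are assumptions for illustration.

    import torch.nn as nn

    class GRUAcousticModel(nn.Module):
        """Standalone sketch of the 5-layer GRU acoustic model configuration."""
        def __init__(self, n_feats: int = 40, n_states: int = 3480):
            super().__init__()
            # Five stacked GRU layers, hidden size 550, dropout 0.2 between layers.
            self.gru = nn.GRU(n_feats, 550, num_layers=5,
                              dropout=0.2, batch_first=True)
            # Softmax output over context-dependent HMM states.
            self.out = nn.Sequential(nn.Linear(550, n_states),
                                     nn.LogSoftmax(dim=-1))

        def forward(self, feats):            # feats: (batch, time, n_feats)
            h, _ = self.gru(feats)
            return self.out(h)               # per-frame state log-posteriors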

D. Real-World Audit Test

Siri is a virtual assistant provided by Apple in iOS, iPadOS, and macOS. Siri’s natural language interface is used to answer users’ voice queries and make recommendations [198]. The privacy policy of Apple’s Siri is shown in Fig. 4.11. The user, who is considered a member user of Siri’s ASR model in our setting, has used the targeted iPhone for more than two years, frequently interacting with Siri, often with the Improve Siri & Dictation service opted in. As the member user’s member audio, we carefully chose five phrases that the user had certainly used when engaging with Siri, starting with “Hey Siri” or drawn from common interactions: “Hey Siri”, “What’s the weather today”, “What date is it today”, “Set alarm at 10 o’clock”, and “Hey Siri, what’s your name”. As the member user’s non-member audio, we chose five short phrases in LibriSpeech that the user had never used to interact with Siri (e.g., “we ate at many men’s tables uninvited”); these phrases were recorded in the member user’s voice. Either set of phrases should support the audit: recall that our method imitates the user as a whole when auditing the model, irrespective of whether a specific audio phrase was used to train/update this model. Lastly, for the nonmember users’ nonmember audio, note that the target Siri’s language is English (Australia), that the LibriSpeech dataset was collected before 2014, and that, as stated by the iPhone Siri’s privacy policy, users’ data “may be retained for up to two years”. We consider these recordings not to be part of the Siri training

dataset, and thus we have nonmember users’ nonmember audio. We further assume that our selected user phrases about book reading are nonmembers.

Figure 4.11: The privacy policy of Apple’s Siri

Chapter 5

The Audio Auditor: No-Label User-Level Membership Inference in Internet of Things Voice Services

With the fast development of machine learning techniques, the voice services embedded in various Internet of Things (IoT) devices have become among the most popular functions in people’s daily lives. In this chapter, we examine user-level membership inference targeting an automatic speech recognition (ASR) model within voice services under no-label black-box access. Specifically, we design a user-level audio auditor to determine whether a specific user had unwillingly contributed audio used to train the ASR model, when the service only reacts to the user’s query audio without providing the translated text. With a user representation built from the input audio data and the corresponding system reaction, our auditor performs effective user-level membership inference. Our experiments show that the auditor behaves better with more training samples and with more audios per user. We evaluate the auditor on ASR models trained with different algorithms (LSTM, RNNs, and GRU) on the hybrid ASR system (Pytorch-Kaldi). We hope the methodology developed in this chapter and our findings can inform privacy advocates to overhaul IoT privacy.

5.1 Introduction

With the advance of machine learning (ML) techniques, ML-powered acoustic systems, also known as automatic speech recognition (ASR) systems, have become more efficient and effective in our daily lives [144, 145, 146]. Devices and applications integrating acoustic systems are ubiquitous, like Amazon Echo, Google Assistant, and Apple’s Siri, enabling the full potential of intelligent voice-controlled devices, voice personal assistants, and machine translation services [173]. In spite of their popularity, the privacy risks and unauthorized access of personal acoustic data have raised concerns in the security community [192, 189, 199]. As surveyed by Malkin et al. [200] and reported by the media [149, 150, 201], most users consider the permanent storage of audio recordings in voice assistants unacceptable and are strongly against exposure of their data to any third parties. A user’s audio records are protected by laws and regulations, including the General Data Protection Regulation (GDPR) [142], the Children’s Online Privacy Protection Act (COPPA) [143],

and the California Consumer Privacy Act (CCPA) [168]. Specifically, the “Right to be Forgotten” [147] protects a user’s audio data from being continuously accessed by any third party [148]. However, many devices could sniff and analyze audio without a user’s consent when using voice services [151]. By analyzing a few audio samples and learning the speaker’s voice characteristics, some voice cloning systems can synthesize his/her voice [190, 202, 191]. A news article [152] reported that a scammer had impersonated an acquaintance by using cloned audio. Hence, a user-level audio auditor is strongly desired to allow users to verify their data’s leaking provenance.

A fundamental problem named membership inference has been studied widely in recent years. It exposes information about an ML model’s training set under different conditions with black-box access. Despite its privacy risk, it is also a good method for the auditing problem [171, 197]. The first investigation was conducted by Shokri et al. [38] using the shadow training technique. A record’s membership of a target model’s training set can be determined under black-box access with a few assumptions. The first assumption is that shadow models are established using the same structure as the target model. The second assumption is that shadow models are trained using a dataset from the same distribution as the target model’s. The third assumption is that the prediction results of the target model contain the output label and its corresponding confidence score. Follow-up research relaxes these assumptions gradually. Salem et al. [43] relaxed the first two assumptions by picking a threshold based on the output confidence score. Song and Shmatikov [171] further conducted membership inference exploiting several top-ranked labels as the model’s prediction. Choo et al. [203], Li and Zhang [204], and Miao et al. [197] further relaxed the third assumption so that the prediction results of the target model contain only the output label. This chapter fully relaxes the third assumption: no explicit label is provided by the target model.

Motivation. The ASR model is the core model in a voice assistant, and its fundamental function is translation. Queried with an input audio, the ASR model translates it into the text command understood and processed by the voice assistant. Herein, the translated text is the output label of the ASR model. However, some voice services, especially in IoT devices, do not provide the translated text. Instead, the system reacts directly according to the text content. In this case, it is impossible for users to audit the ASR model with previous membership inference techniques, and more advanced membership inference techniques are needed. Additionally, record-level membership inference determines whether a specific record is a member of an ML model’s training set or not. However, when using a voice service, it is hard for users to query audio recordings identical to the training samples, even if their text contents and speakers are the same. It is more plausible for the auditor to investigate user-level membership inference. Consistent with [197], we define user-level membership inference as: querying with a user’s data, if this user has any data within the target model’s training set, even if the query data are not members of the training set, this user is a user-level member of this training set. No-Label Audio Auditor. We design and implement a no-label audio auditor for this specific problem.
Assume we audit an ASR model with black-box access that reacts directly based on the voice content without providing the explicit translation. Thus, instead of using the ASR model’s translations, we observe the model’s behavior through its reactions. To simplify the reaction information collection process, we assume the voice service contains an online searching function. Further, we limit the query voice content to requests that require the service to search for the answer online. In such a case, the reaction information is the searching results. Our auditor analyzes such reaction information based on

the ASR model’s predicted text; compares it with the searching results based on the audio’s true text; learns the model’s different behaviors when translating its known data and unknown data; and finally determines the user-level membership for a specific user. In the no-label access setting, a naive baseline strategy considers a user to be a member user of the target training set when the model’s reactions based on its translations all match the searching results based on the true transcription.

It is challenging to conduct such a no-label audio auditor for user-level membership inference. (i) No-label access means little information about the target ASR model: the prior knowledge, i.e., the system’s reaction information, is only a rough representation of the model’s translation results. (ii) Different from record-level inference, user-level membership inference requires the auditor to be highly robust in distinguishing the model’s different behaviors. Specifically, the auditor should be capable of distinguishing the model’s behavior along two dimensions: the translation accuracy for different voice content and the translation accuracy for different speakers. (iii) ASR systems have complicated learning architectures processing time-series audio data [172, 154, 173]. It is quite time-consuming and computationally expensive to build shadow models for membership inference. In summary, we design and evaluate our no-label audio auditor to help users distinguish whether their audio samples have been used to train an ASR model without their consent. The contributions of this work are listed as follows:

• We broaden the class of membership inference problems and propose a no-label audio auditor against an ASR model. A set of features is extracted from searching results as the target ASR model’s reaction. With user-level statistical analysis, a no-label audio auditor is built for user-level membership inference. With access to the system’s reaction information only, our auditor achieves an AUC score of around 75%, while random guessing achieves only 50%.

• A new shadow training technique is proposed. Instead of imitating the target model’s behavior, our shadow model imitates the target system’s reactions. Accordingly, the analysis of the system’s reactions reflects the target model’s behavior.

• Our auditor is generic and not dependent on the ASR model’s structure. We establish shadow models with different algorithms.

The rest of the chapter is organized as follows: Section 5.2 introduces the background on the target ASR model and the related membership inference attack. Section 5.3 illustrates the details of our no-label user-level membership inference auditor. Section 5.4 shows the setup and the results of our experiments. Finally, Section 5.5 concludes the chapter.

5.2 Related Work

This section introduces the state-of-the-art Automatic Speech Recognition (ASR) models and the related work about membership inference on ASRs.

5.2.1 The Automatic Speech Recognition (ASR) Model

While conventional ASR models are based on hidden Markov models (HMMs), current state-of-the-art ASR models utilise deep neural networks (DNNs). In this chapter, the target is a state-of-the-art hybrid ASR system built with the Pytorch-Kaldi toolkit, whose acoustic models are mainly DNN-HMM based [157]. As shown in Fig. 3.1, a hybrid ASR system is typically composed of a preprocessing step, a model training step, and a decoding step [158]. During the preprocessing step, features are extracted from the input audio, while the corresponding text is processed as the audio’s label. The model training step trains a DNN model to produce HMM class posterior probabilities. The decoding step maps these HMM state probabilities to a text sequence. In this work, the hybrid ASR system is built using the Pytorch-Kaldi speech recognition toolkit [159]. Specifically, feature extraction transforms each audio frame into the frequency domain as Mel-Frequency Cepstral Coefficients (MFCC) features. As an additional processing step, feature-space Maximum Likelihood Linear Regression (fMLLR) is used for speaker adaptation. Three popular neural network algorithms are used to build the acoustic model: Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and Recurrent Neural Networks (RNNs). The decoder involves a language model which provides a language probability to re-evaluate the acoustic score. The final transcription output is the text sequence with the highest combined score.
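As a minimal illustration of the preprocessing step, the snippet below extracts MFCC features from a 16 kHz waveform with librosa. The frame parameters are typical defaults rather than the exact Kaldi pipeline, the fMLLR speaker adaptation is omitted, and the file path is a placeholder.

import librosa

# Load a 16 kHz recording and compute 13 MFCCs per frame
# (25 ms window, 10 ms step; "sample.flac" is a placeholder path).
audio, sr = librosa.load("sample.flac", sr=16000)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
print(mfcc.shape)  # (13, number_of_frames)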

5.2.2 Membership Inference Attack on ASRs

The membership inference attack is considered a significant privacy threat to machine learning (ML) models [77]. The attack aims to determine whether a specific data sample is within the target model’s training set or not. The attack is driven by the different behaviors of the target model when making predictions on samples within or outside its training set. Various membership inference attack methods have been proposed recently. Shokri et al. [38] train shadow models to constitute the attack model against a target ML model with black-box access. The shadow models mimic the target model’s prediction behavior. To improve accuracy, Liu et al. [176] and Hayes et al. [170] leverage Generative Adversarial Networks (GANs) to generate shadow models with outputs increasingly similar to the target model’s. Salem et al. [43] relax the attack assumptions of [38], demonstrating that shadow models are not necessary to launch the membership inference attack. Instead, a threshold on the predicted confidence score can be defined to substitute for the attack model. Intuitively, a large confidence score indicates the sample is a member of the training set [177]. Choo et al. [203] and Li and Zhang [204] further broaden the attack assumptions and launch the membership inference attack when the target model only provides the predicted label without a confidence score. Choo et al. [203] utilized data augmentations and adversarial examples to expose the model’s decision boundary. Li and Zhang [204] proposed a transfer-based attack and a perturbation-based attack: the former relies on a shadow model and a dataset with the same distribution as the target training set; the latter relies on adversarial example techniques and measures the effort needed to perturb a sample until it is predicted as a different label. Apart from black-box access, Farokhi and Kaafar [178] model the record-level membership inference attack under white-box access. The attacks mentioned above are all performed at the record level, while Song and Shmatikov [171] study a user-level membership inference attack against text generative models. Instead of using the prediction label along with the confidence score, Song and Shmatikov [171] utilize the rank list information of several top-ranked word predictions as key features to generate the shadow model.

In this work, we use membership inference techniques to audit an ASR model that does not provide any explicit translations. By observing the ASR system’s reaction, we aim to verify whether a specific speaker had unwillingly contributed audio to train the ASR model. Different from image recognition systems or text generative systems, the auditor faces additional challenges in ASR systems, especially with no-label access [172]. With limited discriminative power, features can only be extracted from the system’s reactions, the input audio, and the true transcription to launch our membership inference, i.e., the no-label audio auditing in this chapter.

5.3 No-Label Audio Auditor

In this section, we first formalize our research problem. Second, we give an overview of our user-level audio auditor, including the process of constructing it under no-label black-box access and how we use it to audit the target ASR model. Finally, we show how we implement this auditor.

5.3.1 Problem Statement

When users use a voice assistant, an ASR model translates the input audio into text implicitly and delivers the text content to the system for analysis and reaction. With no-label black-box access, our auditor can be established by collecting and analyzing information about the input audio and the system’s reaction. Firstly, we formalize the process of the ASR model’s translation. Secondly, we formalize the process of the system’s analysis and reaction. Thirdly, we formalize the success of our auditing process. Finally, we summarize the prior knowledge available to the auditor under no-label black-box access to the target model. Let (x, y) ∈ X × Y denote an audio sample, where x presents the audio component, and y is the actual text of x. Assume an ASR model is a function F : X → Y; F(x) is the model’s translated text. The smaller the difference between F(x) and y, the better the ASR model performs. Consider a training audio set A sampled from D of size N (A ∼ D^N), where D represents a distribution of audio samples.

The ASR model trained with the dataset A is denoted as F_A. Querying the ASR model with an audio sample (x, y), the text delivered to the system is denoted as y′ = F_A(x). The system receives the text y′, analyzes its content, and reacts accordingly. Assume the content does not include any commands except searching online, which can be controlled by users. Thus, the reaction should be the results of searching the text y′ online. We mark the reaction function as R, while the reaction information for the delivered text y′ is denoted as r_A^{y′} = R_A(y′). If we search online with the actual text y, then the reaction information should be r_A^{y} = R_A(y). We define user-level membership inference as querying a user’s audio and trying to determine whether any audio within the target model’s training set belongs to this user. Even if the queried audio samples are not members of the training set, as long as other audio belonging to this user is in the training set, this user is regarded as a user-level member of this training set. Assume the target ASR model is F_A and the system provides the corresponding reaction R_A. Let U be the speaker set of A of size M (U ← A). Let A_audit represent our no-label audio auditor; then the user-level auditing process can be formalized as:

• A speaker u has S = ∪_{i=1}^{n} (x_i, y_i), where u ← S.

• Let R_A^Y = ∪_{i=1}^{n} r_A^{y_i}, where r_A^{y_i} = R_A(y_i).

• Let R_A^{Y′} = ∪_{i=1}^{n} r_A^{y_i′}, where r_A^{y_i′} = R_A(y_i′) and y_i′ = F_A(x_i).

• Let “member” = 0 and “nonmember” = 1.

• Set b = 0 if u ∈ U, or b = 1 if u ∉ U.

• The auditor succeeds if A_audit(u, S, R_A^Y, R_A^{Y′}) = b; otherwise it fails.

Prior Knowledge. Our auditor performs user-level membership inference under no-label black-box access. With no-label black-box access, when querying an ASR model with an audio, the system reacts directly based on the model’s translation content without providing the explicit text. To simplify the problem, we define the reaction function as searching online, by controlling the content of the queried audio. When an auditor aims to audit an ASR model, the query audio can be selected or generated artificially. If the system’s reaction is not searching online, the corresponding query audio is not analyzed by the auditor. Detailed descriptions of our prior knowledge under no-label black-box access are listed below:

• Query records. When an auditor selects or generates a proper audio to query the ASR model, the audio sample and its true transcription are known.

• Reaction results. When the system reacts to the query audio, the reaction information, i.e., the searching results, is available to be collected and analyzed.

• User-level information. Since the query audio samples are selected or generated by the auditor, the number of speakers and their corresponding audio are known.

• Reaction function. Although the ASR model is under no-label black-box access, the reaction function it is connected to can be observed.

5.3.2 No-Label User-Level Membership Inference

The nature of membership inference [38] is to observe the difference in a model’s behavior when fed with samples it knows (training data) versus unknown samples. User-level membership inference needs higher robustness to learn the relationship between the model’s behavior and the speaker’s characteristics. In the no-label black-box access setting, our auditor considers the ASR model within a system containing an online searching function. The online searching results are extracted to represent the ASR model’s behavior. Fig. 5.1 illustrates the overall process of our audio auditor performing user-level membership inference under no-label black-box access. Generally, there are two processes: training and auditing. The training process builds a binary classifier as a user-level membership auditor A_audit using a supervised learning algorithm. The auditing process uses this auditor to audit an ASR model F_tar by querying a few audios spoken by one user u. Both processes perform the same data collection and feature extraction steps. Training Process. The primary task in the training process is to establish a shadow system which includes a shadow model F_shd and a system reaction function R_shd. To mimic the target system, we set R_shd to be the same as R_tar. As mentioned above, the reaction function is searching online. Thus, our shadow model should behave similarly to the target model at the semantic level.

Figure 5.1: The overall process of our audio auditor performing user-level membership inference under no-label black-box access. (i) In the training process, we sample one audio set from the auxiliary reference dataset D_ref to build one shadow model. The shadow model dataset A_shd ∼ D_ref is split into a training set A_shd^train and a testing set A_shd^test. Then we query the shadow system with A_shd^train and A_shd^test to collect data. After the feature extraction process, we label each user-level record as “member” or “nonmember”. Then an audit model can be trained with these outputs of the shadow system. (ii) In the auditing process, we randomly sample a particular speaker’s (u’s) audios A_u ∼ D_users to query our target ASR system and collect data. Feature vectors from outputs of the target ASR system can be passed to the audit model to determine whether u ∈ U_tar ← A_tar holds.

We sample an audio set A_shd from the auxiliary reference dataset D_ref based on the target system’s performance. The shadow model dataset A_shd ∼ D_ref is split into a training set A_shd^train and a testing set A_shd^test. Specifically, we first query the target system with an audio (x_ref, y_ref) and get r_tar^{y_ref′}. Then we query the system reaction R_tar with y_ref and get r_tar^{y_ref}. If r_tar^{y_ref′} has a high similarity with r_tar^{y_ref}, then (x_ref, y_ref) ∈ A_shd^train; otherwise (x_ref, y_ref) ∈ A_shd^test.

In the data collection step, we query our shadow model F_shd and reaction function R_shd with A_shd^train and A_shd^test. For each audio record (x_ref, y_ref) spoken by a speaker u_ref, we collect the set of information (x_ref, y_ref, r_tar^{y_ref′}, r_tar^{y_ref}) described in Section 5.3.2. In the feature extraction step, we extract nine features for each record and perform statistical analysis at the user level. The feature extraction is described hereafter. Assume u_ref ∈ U_ref and U_ref^train ∩ U_ref^test = ∅. If the user u_ref has n query records, simple statistics are computed for these n records, including sum, mean, median, minimum, maximum, standard deviation, and variance. Then we label u_ref’s statistically analyzed record as “member” if (x_ref, y_ref) ∈ A_shd^train, and as “nonmember” otherwise. After the feature extraction step, the training set for the auditor A_audit is prepared. Random Forest (RF) is used to build this binary classification model for the following auditing process.

Auditing Process. After training an auditor model, we randomly sample a particular speaker’s (u’s) audios A_u ∼ D_users. With the same data collection step, we query our target ASR system and the system’s reaction function R_tar with A_u. Then we perform the same feature extraction and pass this user-level record to the audit model. Our auditor A_audit determines the user’s membership, i.e., whether u ∈ U_tar ← A_tar holds.
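A minimal sketch of the user-level aggregation step is shown below, assuming the per-record features already sit in a pandas DataFrame keyed by speaker; the column names and values are illustrative only.

import pandas as pd

# One row per query audio: a speaker id plus (a subset of) the nine features.
records = pd.DataFrame({
    "speaker": ["u1", "u1", "u2", "u2"],
    "fuzz_y_r1y_prime": [88, 92, 41, 55],
    "speed": [3.1, 2.9, 3.4, 3.0],
})

# Per-user statistics: sum, mean, median, min, max, standard deviation, variance.
user_features = records.groupby("speaker").agg(
    ["sum", "mean", "median", "min", "max", "std", "var"])
user_features.columns = ["_".join(col) for col in user_features.columns]
print(user_features)  # one feature vector per user for the Random Forest auditor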

Figure 5.2: Data collection for the auditor with no-label black-box access.

Data Collection

Fig. 5.2 depicts the data collection step for the auditor with no-label black-box access. Querying the ASR system with an input audio (x, y), we collect an indirect output of the ASR model’s prediction. First of all, to make sure the system’s reaction is searching online, the true transcription y should not contain any specific intents, such as opening a specific application or setting an alarm at a specific time. In this setting, the reaction function R is the search engine linked to the ASR model. The data collection then proceeds in four steps. Firstly, query the ASR system (F and R) with the audio x. Secondly, make sure the system’s reaction is searching online for the predicted transcription, such that only the searching results r^{y′} are provided by the system. Herein, y′ is the translation predicted by the ASR model F and hidden by the system. Thirdly, using the same search engine R, query the true transcription y and gain the results r^y. Finally, information about this input audio and the target model’s behavior is collected as (x, y, r^{y′}, r^y) for the following feature extraction step.
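The searching results can be collected with an automated browser, as in our ChromeDriver setup. The sketch below is illustrative only: the CSS selector for result titles is a heuristic that may break as the search page changes, and the function name is ours.

from selenium import webdriver
from selenium.webdriver.common.by import By

def top_result_titles(query, k=3):
    # Assumes a ChromeDriver matching the local Chrome version is on PATH.
    driver = webdriver.Chrome()
    try:
        driver.get("https://www.google.com/search?q=" + query.replace(" ", "+"))
        # "h3" elements usually hold result titles; this selector is a heuristic.
        return [e.text for e in driver.find_elements(By.CSS_SELECTOR, "h3")][:k]
    finally:
        driver.quit()

r_y = top_result_titles("what is the weather today")  # reaction for the true text y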

Feature Extraction

Given an input audio (x, y), we collect (x, y, r^{y′}, r^y) after the data collection step. Specifically, two types of searching results are extracted from each searching result r, denoted as r1 and r2. Herein, r1 contains the titles of the top three searching results, while r2 contains the titles and the corresponding related content of the top three searching results. The intuition of membership inference is to learn the target model’s different behaviors when querying its training samples versus other samples. Observing the model’s behavior is key to the success of membership inference. Normally, for an ASR model, the better the model behaves, the smaller the differences are between the query audio’s true text and the translated text. To expose the model’s performance under no-label black-box access, we compare y with r^{y′}, y with r^y, and r^y with r^{y′}. Since the lengths of these three pairs of strings are quite different, we use a fuzzy string matching method to calculate their similarity [205]. Table 5.1 details all nine features. Except for the first feature, speed, the remaining eight features try to capture the model’s performance indirectly.

Table 5.1: Descriptions of the nine features extracted for the audio auditor

Feature | Description
Speed | The user u’s speaking speed.
fuzz_y_r1y′ | Take out the common tokens of y and r1^{y′}, then calculate the Levenshtein distance similarity ratio between the two strings.
fuzz_y_r2y′ | Take out the common tokens of y and r2^{y′}, then calculate the Levenshtein distance similarity ratio between the two strings.
fuzz_y_r1y | Take out the common tokens of y and r1^{y}, then calculate the Levenshtein distance similarity ratio between the two strings.
fuzz_y_r2y | Take out the common tokens of y and r2^{y}, then calculate the Levenshtein distance similarity ratio between the two strings.
extract_r1y_r1y′_top | Format r1^{y} as a string and r1^{y′} as a vector of strings of size 3. Return the strings along with a Levenshtein distance similarity score out of the vector of strings and record the top similarity score.
extract_r2y_r2y′_top | Format r2^{y} as a string and r2^{y′} as a vector of strings of size 3. Return the strings along with a Levenshtein distance similarity score out of the vector of strings and record the top similarity score.
extract_r1y_r1y′_sum | Format r1^{y} as a string and r1^{y′} as a vector of strings of size 3. Return the strings along with a Levenshtein distance similarity score out of the vector of strings and record the sum of these similarity scores.
extract_r2y_r2y′_sum | Format r2^{y} as a string and r2^{y′} as a vector of strings of size 3. Return the strings along with a Levenshtein distance similarity score out of the vector of strings and record the sum of these similarity scores.
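A minimal sketch of computing these similarity features follows, assuming the fuzzywuzzy library is the fuzzy string matcher [205]; token_sort_ratio stands in for the “common tokens, then Levenshtein ratio” computation, process.extract yields scored candidates, and all inputs are illustrative.

from fuzzywuzzy import fuzz, process

y = "what is the weather today"                                          # true transcription
r1_pred = ["weather today forecast", "today weather", "weather radar"]   # titles for y'
r1_true = ["what is the weather today", "weather now", "today forecast"] # titles for y

# fuzz_y_r1y': tokenized Levenshtein similarity between y and the titles for y'.
fuzz_y_r1y_prime = fuzz.token_sort_ratio(y, " ".join(r1_pred))

# extract_r1y_r1y'_top / _sum: score the true-text titles against the y' titles.
matches = process.extract(" ".join(r1_true), r1_pred, limit=3)  # [(title, score), ...]
extract_top = max(score for _, score in matches)
extract_sum = sum(score for _, score in matches)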

5.4 Experimental Evaluation and Results

5.4.1 Experimental Setting

Dataset Description

The LibriSpeech corpus is one of the most widely used speech corpora for building and evaluating ASR systems. It contains 1,000 hours of English speech sampled at 16 kHz. The content is mainly read books derived from audiobooks, part of the LibriVox project [180]. Thus, the audio content does not carry any specific intents that would trigger the target system’s reaction other than searching online. We use the 100-hour clean training set to establish our target ASR model. Then, by querying the 360-hour clean speech set against the target system, a proper set of speech is selected to build our shadow model, as described in the previous section.

Target System

The target system in the no-label black-box setting contains a target ASR model and a system reaction function. Our target model is a speech-to-text ASR model. The inputs are a set of audio files with their corresponding transcriptions as labels, while the outputs are the transcribed sequential texts. To simulate most current ASR models in the real world, we created a state-of-the-art hybrid ASR model [158] using the PyTorch-Kaldi Speech Recognition Toolkit [159]. In the preprocessing step, fMLLR features were used to train the ASR model with 24 training epochs. Then, we trained an ASR model using a deep neural network with four hidden layers and one Softmax layer. We experimentally tuned the batch size, learning rate, and optimization function to gain a model with better ASR performance. To mimic an ASR model in the wild, we select an audio set for the shadow model’s training process only if the reaction to the audio’s translated text is similar to the reaction to its true text. We assume the reaction is searching online, and the reaction function is regarded as a search engine. In our experiment, we assume Google Search, accessed through the Chrome browser, as the search engine embedded in the target system. Specifically, we use ChromeDriver 88.0.4324.96 for automated batch searching.

The Baseline Method

One popular baseline method is random guessing, as used by Shokri et al. [38], Salem et al. [43], and Li and Zhang [204]. Specifically, the membership inference model is a binary classifier. We evaluate the membership inference model on a dataset randomly sampled from D_tar, where the model is trained with the same number of member records and nonmember records. In such a case, random guessing should achieve around 50%. We measure the area under the Receiver Operating Characteristic (ROC) curve (AUC) as the evaluation metric for the membership inference model. We adopt the AUC instead of the ROC curve since the AUC is threshold-independent.
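A minimal sketch of the AUC computation with scikit-learn is shown below; the labels and scores are illustrative values only.

from sklearn.metrics import roc_auc_score

# "member" = 0, "nonmember" = 1; scores are the auditor's predicted
# probability of the positive class for each user-level record.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.21, 0.35, 0.62, 0.80, 0.44, 0.57]
print(roc_auc_score(labels, scores))  # threshold-independent; 0.5 = random guessing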

5.4.2 User-Level Auditor with No-Label Black-Box Access

Assume a user u has n audio recordings. We query the shadow system with these recordings and gain n record-level pieces of information representing the shadow ASR model’s behavior. When all n record-level records are analyzed with the statistical analysis mentioned previously, the shadow ASR model’s behavior on this user u is recorded as one training sample for our auditor. In such a case, we say the training set of our auditor is generated with n samples per user. Both the auditor’s training process and auditing process use the same data collection and feature processing methods. Therefore, the testing set of this auditor is also generated with n samples per user. Additionally, for the statistical analysis, at least two audio recordings of one user are queried to the target system to generate the user-level features (n ≥ 2).

Figure 5.3 shows the auditor’s performance using different numbers of samples per user. We evaluate the performance varying the number of samples per user from two to six. Herein, we assume the shadow ASR model and the target ASR model use the same algorithm (LSTM). For each experiment, 300 user-level records were sampled and generated for our auditor’s training step. We repeated the experiment 100 times and averaged the results to reduce deviations in performance. The results show that the more audios per user used to audit their membership, the more accurately our audio auditor performs. Specifically, the highest AUC score reaches 74.04% using six samples per user. Considering users’ convenience and the performance of our auditor, we recommend five samples per user, for which the auditor’s AUC score is also over 70%.

Figure 5.4 evaluates the effect of different training set sizes used to train the user-level auditor. We trained the audit model with a small set of users and a relatively large set of users. The small training sets contain 20, 40, 60, 80, and 100 users, while the large training sets contain 200, 300, 400, 500, 600, 700, 800, 900, and 1,000 users. The testing set querying these audit models is fixed at 100 test audio records. The user-level samples in the training set were randomly sampled and processed from the outputs of the shadow system. The shadow ASR model uses the same algorithm (LSTM) as our target ASR model. These two ASR models are trained with training samples generated with five record-level samples per user. To eliminate trial-specific deviations, we repeated each experiment 100 times and averaged the results.

Figure 5.3: The audit model’s performance by the number of audios for each user.

Figure 5.4: The audit model’s performance across the training set size.

As shown in Figure 5.4, the model performs better when the number of user-level records within the training set increases. With 150 users in the auditor’s training set, the performance achieves its highest AUC score (73%). When the number of users is over 150, the audit model performs well and the AUC score is stable at around 73%. In all, the more users that are used to train the audit model, the more accurately a user’s membership within the target model can be determined.

5.4.3 Model Independent User-Level Auditor

The previous experiments build the shadow model using the same algorithm as the target model and a training set with the same distribution as the target model’s. When we query the target system with black-box access, the algorithm used to train the target model is unknown. Thus, different algorithms are used to build the shadow model to evaluate a model-independent auditor. Assume the target ASR model is trained with a four-layer LSTM network. We build three shadow models using different kinds of networks: LSTM, RNN, and GRU. All these shadow models’ training sets are sampled and used following the same training process. When the shadow model is trained with a four-layer LSTM network, the corresponding auditor is named the LSTM-based auditor. Accordingly, three auditors (the LSTM-based, RNN-based, and GRU-based auditors) are used to audit the same target ASR model. Figure 5.5 evaluates the effect of different shadow models used to train the user-level auditors. Based on previous conclusions, we collect the results of five records per user and extract the user-level features for each user. Similar to the previous experiments, 100 users’ audio recordings are used to query the target system and processed as the auditor’s testing set. Different training sizes of the user-level data were randomly sampled from the outputs of the shadow system. To eliminate trial-specific deviations, we repeated each experiment 100 times and report the averaged results. As shown in Figure 5.5, the performances of the different auditors show the same trend in auditing user-level membership. Specifically, the AUC score increases as the training set size of the auditor grows to around 150, after which the upward trend gradually slows until it stabilizes. The highest averaged AUC score can even reach

75.31%, when the RNN-based auditor audits the target ASR system. We can thus conclude that our auditor is robust to different ASR models.

Figure 5.5: Effect of different shadow models used to train the user-level auditors.

5.5 Conclusion

This work proposed an auditor for an ASR model in IoT voice services under no-label black-box access. We investigate user-level membership inference assuming that even the translated text is not provided explicitly. Instead, the translated text of the ASR model is implicitly passed on to cause the system’s reaction. Our auditor broadens the boundary of membership inference by relaxing the label-only membership inference assumption [204]. According to the reaction information, our auditor tries to learn the pattern of the ASR model’s behavior when querying its known data and its unknown data. We extract nine features from the system’s reaction to each audio recording and perform statistical analysis to gain the user-level features. As we have shown, both the size of the user base and the number of audio samples per user used in the testing set have a positive effect on our audit model against the target ASR model. Our auditor is quite robust in auditing different target ASR models. Specifically, the highest AUC score can reach around 75%. Examining other factors affecting performance and exploring possible defenses against auditing are both worth further exploration.

Chapter 6

FAAG: Fast Adversarial Audio Generation through Interactive Attack Optimisation

Automatic Speech Recognition services (ASRs) inherit deep neural networks’ vulnerabilities, such as susceptibility to crafted adversarial examples. Existing methods often suffer from low efficiency because the target phrases are added to the entire audio sample, resulting in high demand for computational resources. This chapter proposes a novel scheme named FAAG, an iterative optimization-based method to generate targeted adversarial examples quickly. By injecting the noise over the beginning part of the audio, FAAG generates high-quality adversarial audio with a high success rate in a timely manner. Specifically, we use the audio’s logits output to map each character in the transcription to an approximate position in the audio’s frames. Thus, an adversarial example can be generated by FAAG in approximately two minutes using CPUs only and around ten seconds with one GPU, while maintaining an average success rate over 85%. Specifically, the FAAG method can speed up the adversarial example generation process by around 60% compared with the baseline method. Furthermore, we found that prepending benign audio to any suspicious example can effectively defend against the targeted adversarial attack. We hope that this work paves the way for inventing new adversarial attacks against speech recognition under computational constraints.

6.1 Introduction

Automatic speech recognition (ASR) technologies have enabled the transformation of human spoken language into text. In recent years, with the development of advanced deep learning techniques, the efficiency and effectiveness of ASR systems have been enhanced and offered as Deep-Learning-as-a-Service. The ASR service has become an increasingly popular human-machine interface due to its accuracy and convenience. The number of devices with voice assistants is estimated to reach 8.4 billion by 2024, up from the current 4.2 billion globally [206]. The value of the global ASR market will be over USD 21.5 billion by 2024 [206]. International corporate giants like Microsoft, Google, IBM, and Amazon are heavily investing in new technologies to expand their market shares. Devices and applications integrating acoustic systems are ubiquitous, like Amazon Echo, Google Assistant, and Apple’s Siri, enabling the full potential of intelligent voice-controlled devices, voice personal assistants, and machine translation services [173, 207].

Hence, the security problems associated with ASR systems are worth millions of dollars. With the advancement of deep neural networks, ASR systems have become increasingly prevalent in our daily lives [144, 145, 158, 175]. Despite ASR’s popularity, the security risks and adversarial attacks against ASRs have raised concerns in the security community [192, 189, 199, 208, 209, 210]. The community has confirmed that ASR systems inherit vulnerabilities from neural networks [211]. For example, neural network models are vulnerable to adversarial examples [13, 212]. State-of-the-art ASR systems consisting of deep neural network structures can be fooled by adversarial examples [213]. Machine learning-based cybersecurity has become an important challenge in various real-world applications [214, 215, 216, 217, 218].

Existing research shows that well-crafted adversarial audio can lead an ASR system to misbehave unexpectedly. There are two types of adversarial attacks: targeted attacks and untargeted attacks. Untargeted attacks against ASR systems can damage the performance of the ASR system. Abdullah et al. [219] forced an ASR system to transcribe the input audio into incorrect text. Targeted attacks against ASR systems not only cause low accuracy in transcription but also inject the attacker’s desired phrases without being recognized. Carlini et al. [92] leveraged noise-like hidden voice commands to embed commands into a normal audio example so that users hear only a meaningless noise, but the ASR system executes the hidden commands. The DolphinAttack [95] further crafted the audio to make the embedded command inaudible and imperceptible to human beings. Carlini and Wagner [94] proposed an iterative optimization-based method to generate an adversarial example with a small distortion in Decibels (dB). Qin et al. [96] improved the method using the psychoacoustic principle of auditory masking to generate unnoticeable noise.

Although various methods have been proposed to generate adversarial audio of high quality, these methods may not perform as well as expected under certain conditions. Firstly, all those methods use a complete audio to generate the adversarial example. However, as users’ security awareness has gradually increased, an adversarial audio may not be played completely if a user notices the anomaly. If the audio is not played completely, the success rate of the attack decreases significantly. Thus, an adversarial audio generation method based on a part of the target audio is more powerful than one based on the complete audio: the shorter the part of the audio used, the higher the chance that the attack succeeds. Secondly, all previous methods generating adversarial audio use multiple GPUs. When provided with limited resources, e.g., only CPUs or just one GPU, previous attacks can be far more time-consuming than expected, especially when generating a batch of adversarial examples.

This chapter aims to find an effective and efficient method for an adversarial attack under white-box access to the target end-to-end ASR system. Our method aims to improve the existing popular method of [94] based on the concerns above. We propose to modify a part of an audio example, instead of its whole frame. The beginning part of the audio is large enough to be covered by the target phrase, so any phrase, including voice commands, can be embedded into an audio example.
Only a space separating the target phrase from the remaining transcription text is needed to ensure the ASR system can understand the target phrase. Thus, our method’s key task is to find a proper length of frames at the proper position of the audio. State-of-the-art ASR systems can filter out some noise and rectify some contextual errors using their language model. Therefore, it is difficult to hide any phrase correctly as part of the original audio’s transcription. A fixed length at the beginning part of the audio is the

best solution. It is essential to find a fixed length of frames according to the target phrase and the original audio; otherwise, there is not enough space to embed the target phrase, resulting in a low success rate. Specifically, the proper clip of the audio frame is selected by mapping each word in the transcription to a rough position in the audio’s frames according to the logits output. This audio clip is subsequently used to generate the adversarial example based on the iterative optimization-based method proposed in [94]. The contributions of this work can be summarized as follows:

• We propose a new scheme, called Fast Adversarial Audio Generation (FAAG), developing a new optimisation algorithm based on an iterative attack strategy. According to the given phrase and audio, FAAG can automatically select a proper length of frames at the beginning of the audio. The ratio of the frames used for adversarial example generation can be as low as 14.79%.

• We develop fast adversarial example generation with a satisfactory success rate and tolerable distortion. Under limited resources, our method takes half an hour using CPUs only to generate ten adversarial examples, and around two minutes using one GPU. Both are faster than previous attacks using the same resources, speeding up generation time by around 60%.

• The empirical study provides two new observations: (1) different words in target phrases do not significantly affect the performance of our adversarial examples; (2) a target phrase with fewer words gives a slight positive boost to adversarial example generation, compared to a target phrase containing more words.

• The target phrase can only be hidden at the beginning of the original audio; otherwise, the transcription of the phrase part has low accuracy. Conversely, prepending a benign audio clip to the beginning of any suspicious audio can effectively protect the service from the targeted adversarial attack.

The rest of the chapter is organized as follows: Section 6.2 introduces the background on the target ASR model and related adversarial attacks against ASR systems. Section 6.3 illustrates the details of our method to generate audio adversarial examples with limited resources. Section 6.4 shows the setup and the results of our experiments. Section 6.5 discusses different positions of the injected audio and countermeasures. Finally, Section 6.6 concludes the chapter.

6.2 Related Work

This section provides brief introductions to the state-of-the-art Automatic Speech Recognition (ASR) models and the related work about adversarial attacks on ASRs.

6.2.1 The Automatic Speech Recognition Model

While conventional ASR models are based on hidden Markov models (HMMs), current state-of-the-art ASR models utilise deep neural networks (DNNs). Our audio adversarial attack targets a state-of-the-art ASR system based on a DNN — end-to-end ASR systems [174]. Assuming white-box access, we evaluate our attack using an ASR model downloaded from a popular open-source ASR system DeepSpeech.

Figure 6.1: An end-to-end ASR system.

Figure 6.2: The overall process of generating the targeted adversarial attack.

End-to-end ASR systems like Baidu’s DeepSpeech, implemented by Mozilla, are sequence-to-sequence neural network models [165, 220]. Unlike typical hybrid ASR systems, the end-to-end system predicts word sequences converted directly from individual characters recognized from the raw waveform. As shown in Fig. 6.1, the end-to-end system is a unified neural network modeling framework containing three main components: a feature pre-processing step, a neural network as the probabilistic model, and a decoder to refine the final outputs. The feature pre-processing step uses Mel-Frequency Cepstral Coefficients (MFCC) features to represent the raw audio data, obtained via the Mel-Frequency Cepstrum (MFC) transformation. The whole frame with the MFCC features extracted is split into multiple frames with an overlapping window applied. Each audio frame is fed into the probabilistic model. Herein, Recurrent Neural Networks (RNNs) are popular in end-to-end ASR systems, where an audio waveform is mapped to a sequence of characters c_1, ..., c_N [165]. However, the sequence of character outputs c_1, ..., c_N does not directly correspond to the sequence of word outputs q_1, ..., q_M. Thus, the decoder is used to re-evaluate the character output. Connectionist Temporal Classification (CTC) [221] is a powerful method for handling the unknown alignment between the input and output sequences. DeepSpeech uses CTC as a decoder to score the character output and map it to a word sequence by de-duplicating sequentially repeated characters.
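As a minimal illustration of this decoding rule, the greedy decoder below collapses sequentially repeated characters and drops the blank symbol; production decoders such as DeepSpeech’s additionally apply beam search and a language model.

def ctc_greedy_decode(char_seq, blank="-"):
    # Collapse repeats first, then remove blanks: "hh-ee-ll-lloo" -> "hello".
    out, prev = [], None
    for c in char_seq:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return "".join(out)

print(ctc_greedy_decode("hh-ee-ll-lloo"))  # hello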

6.2.2 Adversarial Attack on ASRs

Audio adversarial attacks on ASR systems have recently become popular, focusing on both targeted and untargeted adversarial attacks [207]. Knowledge of these attacks’ target model is set at white-box access or black-box access. Generally, the audio adversarial attack aims to generate an audio adversarial example that deceives the ASR model without users’ awareness. When the ASR model simply mistranslates an audio example, it is an untargeted adversarial attack. When the ASR model translates an audio example into a phrase designed by the attacker, it is a targeted adversarial attack. Unlike in the image domain, the targeted adversarial attack is much more dangerous against ASR systems than the untargeted adversarial attack. Except for a few examples, like the untargeted adversarial attack forcing mistranscription in [219], most attacks focus on targeted adversarial attacks, where the target phrase is usually a common voice command [92].

Targeted adversarial attacks with white-box access can generate adversarial examples of high quality. Specifically, [94] generated an adversarial audio example with only slight distortion on the DeepSpeech model. CommanderSong [222] can embed the desired voice commands into any songs stealthily. SirenAttack was proposed to generate adversarial audios under both white-box and black-box settings [173]. Under the white-box setting, SirenAttack applies a fooling-gradient method to find the adversarial noise, whose success rate can reach 100%. With full knowledge of the ASR model’s details, [223] can reverse the perturbed MFCC features into adversarial speech. Different from [223], [224] generated adversarial audios by modifying the raw waveform directly with an end-to-end scheme. Furthermore, the adversarial examples could be generated with different systems and different features according to [225]. Audio adversarial examples are designed in [96] by leveraging the psychoacoustic principle of auditory masking.

Targeted adversarial attacks with black-box access are more practical than white-box attacks. With little knowledge of the ASR system, Hidden Voice Commands generates the noisy command by repeatedly querying the model [92]; as a result, the semantics of the generated adversarial audio are difficult for people to understand. DolphinAttack exploits the non-linearity of microphones to generate inaudible voice commands [95]. Under the black-box setting, SirenAttack uses an iterative, gradient-free method [173]. The works in [226] and [227] considered genetic algorithms and gradient estimation to modify the original audio under black-box access. The work in [158] generated adversarial examples based on psychoacoustic hiding under black-box access, which can embed a malicious voice command into any audio. Devil’s Whisper [172] proposed a general adversarial attack against ASR systems by training a local model under white-box access.

Different from the related work, this chapter focuses on the efficiency of adversarial attacks against speech recognition. We propose to modify a part of an audio example through iterative attack optimisation to guarantee a high success rate, low distortion, and high generation speed.

6.3 Generating Audio Adversarial Examples

State-of-the-art audio adversarial attacks can generate high-quality adversarial audio with a high success rate. However, two limitations are neglected. Firstly, all previous attacks assume the attacker has multiple GPUs and abundant time to generate such high-quality adversarial audio. When computing resources are less abundant, the time spent on the attack increases significantly; however, a successful audio adversarial attack requires timely action in the real world. Secondly, as users’ awareness of security

and privacy grows, even a slight noise within the audio may be noticed. In such cases, the shorter the adversarial clip that carries the target phrase, the more powerful the adversarial audio is. We propose the FAAG method to hide the target phrase in a small piece of adversarial audio quickly, with limited computing resources.

6.3.1 Threat Model

Given an audio waveform x and the target transcription t, we aim to construct an adversarial audio x′ = x + δ. Assuming that the target ASR system transcribes the audio x into text y (y = F(x)) and the audio x′ into text y′ (y′ = F(x′)), we expect the target transcription t to be a substring of y′. Additionally, the audio x and our adversarial audio x′ should sound the same to human beings. We formulate the similarity based on the distortion in dB between the original audio x and the crafted adversarial audio x′, represented by the noise δ, following [94]. The dB difference

(dB_x(δ)) between the modified adversarial example and the original audio reflects the relative loudness of the added noise compared to the original audio. The smaller dB_x(δ) is, the more similar the two audio examples sound. When the loudness of the noise is small enough, the noise can be ignored, so that the adversarial audio is transcribed by the ASR system without humans being aware of the attack. We assume that the adversarial audio is generated with white-box access to the target ASR model. Herein, the attacker has complete knowledge of the ASR model, including its structure and its parameters. In this chapter, we choose to use Baidu’s DeepSpeech model [165]. DeepSpeech includes three parts: an MFC conversion for audio preprocessing, RNN layers to map each input frame into a probability distribution over characters, and a CTC loss function to measure the RNN’s output score. The core of the DeepSpeech model is an optimized RNN trained with multiple GPUs on over 5,000 hours of speech from 9,600 speakers. The RNN layers finally output logits that are computed over the probability distribution of output characters. We do not consider over-the-air audio transcription because many live settings may jeopardize the experiments. We discuss adapting our method to the real world in Section 6.5. FAAG only modifies a small portion of the given audio example instead of the whole frame, so it would be challenging to compare FAAG with other methods using live transcription. Furthermore, our adversarial examples are validated by transcribing the waveform directly. We treat an adversarial attack as successful if the output transcription y′ includes the target phrase t correctly, denoted by t ∈ y′.
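Following the distortion metric of [94], dB(x) = 20 log10(max_i |x_i|) and dB_x(δ) = dB(δ) − dB(x); a minimal sketch over waveform sample arrays is given below.

import numpy as np

def db(waveform):
    # Peak loudness of a waveform in decibels.
    return 20 * np.log10(np.max(np.abs(waveform.astype(np.float64))))

def db_distortion(x, x_adv):
    # Relative loudness dB_x(delta) of the added noise; more negative = quieter.
    delta = x_adv.astype(np.float64) - x.astype(np.float64)
    return db(delta) - db(x)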

6.3.2 Fast Adversarial Audio Generation (FAAG)

FAAG generates an adversarial audio clip within a short period, even with limited computing resources, while embedding the attacker’s desired command (the target phrase) in a short clip of this adversarial audio. To shorten the adversarial clip within the whole audio, we find a proper position and length of frames used to embed our target phrase with a high success rate and low distortion. We find that it is unnecessary to transcribe the whole audio waveform as our selected phrase. When generating an adversarial audio example, using a longer waveform does not lead to less distortion or a higher success rate; it only slows down the generation of the adversarial example. By constructing the targeted phrase at the beginning of the adversarial example and separating the targeted phrase from the rest of the transcription with a long space, the ASR model can still recognize the targeted phrase correctly.

Algorithm 1: Select the Proper Clip x_begin
Require: Original audio x; target phrase t; pre-trained ASR model F; step s; fine-tune variable λ
Ensure: The selected audio clip x_begin is long enough for adversarial example generation.
1: y = F(x)
2: c = f(x)
3: t = t + ‘ ’
4: |·| denotes the character number or the frame length of a vector
5: Initialize λ = 0
6: |t_allocated| = |t| + λ|c|
7: |x_begin| ≥ |y| × |t_allocated| × s
8: index = |x_begin|
9: x_begin ← x[: index]
10: x_rest ← x[index :]
11: return the two clips x_begin and x_rest

Compared to the prior work on targeted attacks on speech-to-text [94], we generate the audio adversarial example based on the beginning part of a long audio waveform instead of the whole waveform. Fig. 6.2 illustrates our adversarial example generation, focusing on a targeted adversarial attack on the DeepSpeech model. In general, given an audio waveform x whose transcription by the target ASR model is y, our adversarial attack can be summarized in three steps.

Step 1: Based on any chosen short phrase t, we select the proper frames at the beginning of x as x_begin to add noise δ. We choose the beginning of the audio because no prior noise would affect the accuracy, and the effect of noise from the subsequent clip can be limited. Herein, we describe the phrase t as a short phrase when its length is shorter than the length of the given transcription y, denoted by |t| < |y|.

Step 2: We construct the inaudible noise δ with an iterative, optimization-based attack. Therefore, x_begin + δ can be recognized by the ASR model as the phrase t followed by a specific conjunction (i.e. 'and') or a long space. Herein, the long space means a silence recognized by the ASR model. It is necessary so that the transcription y′ of the adversarial example, excluding the phrase t, will not affect the model's understanding of our chosen phrase t.

Step 3: We combine x′_begin = x_begin + δ with the rest of the frames of x (named x_rest = x − x_begin) so that the adversarial example x′ = x′_begin + x_rest sounds similar to the original audio x. In addition, the adversarial example x′ is recognized as y′ by the ASR model, where t ∈ y′. To evaluate the success rate of the adversarial example, we calculate the character error rate (CER) of t in y′.

Selecting the proper frames at the beginning of a given audio.

We define the proper frames x_begin to satisfy three conditions: 1) the frames used to generate adversarial examples should be at the beginning of the original audio x; 2) the length of the frames should be long enough to cover the target phrase t correctly; 3) the generated adversarial examples should have relatively small distortion. Thus, we choose the frames of audio x corresponding to the first n words of its transcription y, where n = len(t) is the number of words in the target phrase. To meet the second condition, we consider the frame length of each logit and the number of logits for n words. Therefore, we need to find out the relationship between the input audio x, its logit output c, and its corresponding transcription y. As for the third condition, we add a variable λ to fine-tune the length so as to achieve a small distortion.

Figure 6.3: Diagram of the DeepSpeech model transcribing an input audio x as its transcription y.

As for the first condition, generating the adversarial example at the beginning of the audio has clear advantages. Firstly, at the beginning of the audio, the target phrase can be recognized by the ASR model with less interference from the remaining original audio frames than if it were inserted in the middle of the audio. Secondly, it is easier for the ASR model to recognize and execute the target phrase, especially when the target phrase is a command. For example, when the victim plays our crafted adversarial audio and the ASR model recognizes a command hidden at the beginning of the audio, it is more likely to execute the command regardless of the remaining audio's meaning. However, there are some special cases. For example, when the target phrase contains a trigger word of an ASR system, the ASR system will only listen to the sentence behind that trigger word; the position of the adversarial example hidden in the original audio is then not that important. Thus, we also consider generating an adversarial audio clip in the middle and at the end of the original audio in later experiments. To shed light on our attack method, we take the beginning position as an example. To satisfy the second condition, we need to clarify the relationship between x, c, and y. Thus, it is necessary to understand the mechanism of the target ASR model. The target ASR model in this work is Baidu's DeepSpeech model [165], specifically an end-to-end speech-to-text model implemented by Mozilla. Fig. 6.3 shows the diagram of the DeepSpeech model transcribing an input audio x as its transcription y. The input audio x is first split from a whole frame into several overlapping windows of window size w; each window slips to the next with a step length of s. After the MFC transformation, the RNN model f(·) in DeepSpeech maps each output logit c_i ∈ c to a probability distribution over characters for each window frame (‖c‖ = ‖x‖, i.e., one logit per window). The character c_i ranges over 'a' to 'z', white space, and the '−' symbol, which represents the epsilon value ε in CTC decoding. Then the CTC decoder C(·) outputs a sequence of characters y with an overall probability distribution, merging repeats and dropping epsilons. To decode a vector c into a transcription vector y, the best alignment can be found by Equation 6.1 [94].

C(x) = argmax_y Pr(y | f(x)) (6.1)
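To make the decoding behaviour concrete, below is a minimal Python sketch of greedy CTC decoding: take the most likely symbol per window, merge repeats, and drop epsilons. This is a simplified stand-in for the best-alignment search in Equation 6.1, and the toy alphabet, blank index, and logits are illustrative assumptions rather than DeepSpeech's actual interface.

    import numpy as np

    # Hypothetical alphabet: 'a'-'z', space, and the CTC epsilon (blank) at the end.
    ALPHABET = [chr(c) for c in range(ord('a'), ord('z') + 1)] + [' ']
    EPSILON = len(ALPHABET)  # index of the CTC blank symbol

    def greedy_ctc_decode(logits):
        # logits: array of shape (num_frames, num_symbols), one row per window.
        best_path = logits.argmax(axis=1)        # most likely symbol per frame
        decoded, prev = [], None
        for idx in best_path:
            if idx != prev and idx != EPSILON:   # merge repeats, drop blanks
                decoded.append(ALPHABET[idx])
            prev = idx
        return ''.join(decoded)

    # Toy example: frames spelling "hi" with repeats and blanks interleaved.
    frames = [7, 7, EPSILON, 8, 8, EPSILON, EPSILON]   # h h ε i i ε ε
    logits = np.full((len(frames), len(ALPHABET) + 1), -10.0)
    for t, s in enumerate(frames):
        logits[t, s] = 0.0
    print(greedy_ctc_decode(logits))  # -> "hi"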

We can summarize the relationship between x, c, and y in line with the mechanism of the target ASR model. The whole frame of input audio x is split into several overlapping windows with a step of length s. The RNN model f(·) in DeepSpeech maps each output logit c_i to a probability distribution over characters within each window frame. Then, we can express the relationship between the output logits c and the whole frame of the input audio x as Equation 6.2. Herein, |·| represents the number of characters within a text (i.e. |c|) or the length of frames in the audio (i.e. |x|).

|x| = |c| × s + (|x| mod s) (6.2)

Naturally, the more characters the transcription has, the longer the output logits. There is a positive correlation between the output logits c and the transcription y, which we summarise as Equation 6.3.

ρ(|c|, |y|) > 0 (6.3)

Translating audios with the same ASR model, we assume that the relationship between our generated adversarial audio clip x′_begin, the corresponding output logits c′, and the corresponding transcription y′ is the same as that between x, c, and y. According to Equation 6.3, ρ(|c′|, |y′|) > 0. However, the transcription y′ is different from the original transcription y. Between the output logits and the transcription, the CTC decoder C(·) merges repeats and drops epsilons within the logits to get the final transcription. The number of repeats and epsilons within the logits varies in line with the speaker's speaking habits and the ASR model's window size. Although the window size is the same, the speaker's speaking habits are hard to control in different recordings. We assume that ρ(|c′|, |y′|) ≈ ρ(|c|, |y|). To simplify the experiment, we refine this relationship into the following equations.

|c| / |y| = |c′| / |y′| (6.4)

|x′_begin| = |c′| × s + (|x′_begin| mod s) (6.5)

|x′_begin| = |c| / |y| × |y′| × s + (|x′_begin| mod s) (6.6)

Analogous to Equation 6.2, we know the relationship between the frame length of x′_begin and the number of output logits. Combining this with Equation 6.4, we can find the frame length of x′_begin as per Equation 6.6. Assuming our adversarial example generation is successful, the transcription of our adversarial audio clip y′ is the same as our target phrase t. The frame length of the selected audio clip should be the same as the frame length of our generated adversarial audio clip (|x_begin| = |x′_begin|). Thus, we can find a proper minimum length of the frames for adversarial audio generation from Equation 6.6 to meet the second condition. At least, we know the range of proper frame lengths selected from the beginning of the original audio, as shown in Equation 6.7.

|c| / |y| × |t| × s + s > |x_begin| = |x′_begin| ≥ |c| / |y| × |t| × s (6.7)

We need to ensure that the selected length is long enough for adversarial example generation. In this work, we set |x_begin| = |c| / |y| × |t| × s during experiments and use a variable λ to fine-tune the frame length of the selected audio clip. Using Algorithm 1, the original audio can be split into two audio clips: x_begin = x[: |x_begin|] and x_rest = x[|x_begin| :]. With the proper frame length selected, not only is time saved, but the negative effect of the remaining audio on the adversarial example's transcription can also be neglected. Accordingly, the success rate of the generated adversarial example can be increased. Apart from guaranteeing the success rate, we also consider selecting a proper frame length to satisfy the third condition: less distortion in the adversarial audio. Normally, using the same generation method based on an audio clip with a fixed frame length, the more characters the target phrase has, the more distortion occurs in the generated adversarial audio, because with more characters in the target phrase, more characters in the original audio need to be changed. However, targeting a phrase based on different lengths of audio, a longer audio clip does not necessarily produce adversarial audio with less distortion. We introduce a new variable λ to fine-tune the proper frame length at the beginning of the original audio and discuss it in the next section.
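As a concrete illustration of the selection rule, here is a minimal Python sketch of Algorithm 1 under Equation 6.7. The helper and the toy numbers in the usage example are our own assumptions; in a real attack, y, |c|, and s would come from the target model.

    import numpy as np

    def select_begin_clip(x, y, num_logits, t, step, lam=0):
        # Split audio x into (x_begin, x_rest) following Algorithm 1.
        # x: waveform samples; y: model transcription of x; num_logits: |c|;
        # t: target phrase; step: s (samples per logit window); lam: lambda.
        t = t + ' '                               # separate t from the rest
        allocated = len(t) + lam                  # |t_allocated| = |t| + lambda
        # |x_begin| >= |c| / |y| * |t_allocated| * s  (Equation 6.7, lower bound)
        index = min(int(np.ceil(num_logits / len(y) * allocated * step)), len(x))
        return x[:index], x[index:]

    # Toy usage with made-up numbers: 4 s of 16 kHz audio, 200 logits.
    x = np.zeros(64_000)
    x_begin, x_rest = select_begin_clip(
        x, y="she had your dark suit in greasy wash water all year",
        num_logits=200, t="call john smith", step=320)
    print(len(x_begin), len(x_rest))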

Algorithm 2: Audio Adversarial Example Generation
Require: Original audio x; target transcription t; pre-trained model F; WS = s; iter = 1,000
Ensure: len(y) ≥ len(t)
1: Call Algorithm 1
2: return Two audio clips: x_begin and x_rest
3: dB(x_begin) = 20 ∗ log10(np.max(np.abs(x_begin)))
4: Optimize δ: con is a constant to narrow down the dB
5: dB_x_begin(δ) = dB(δ) − dB(x_begin)
6: for iteration from 1 to iter do
7:   while F(x_begin + δ) ≠ t and dB_x_begin(δ) > con do
8:     x_begin ← x_begin − w · sign(∇_x_begin ctc_L(x_begin, t))
9:     x′_begin ← x_begin
10:    minimize dB_x_begin(δ) + Σ_i w_i · ctc_L(x′_begin)
11:  end while
12:  if F(x′_begin) == t and dB_x_begin(δ) ≤ con and iteration mod 100 == 0 then
13:    con ← con × 0.8
14:  end if
15: end for
16:
17: x′ = x′_begin + x_rest
18: time = end_time − start_time
19: Calculate the distortion: dB_x(δ) = dB(x′) − dB(x)
20: Verify: y′ ← F(x′)
21: Success if t ∈ y′
22: return Adversarial audio x′; adversarial transcription y′; distortion in decibels dB_x(δ); time

Constructing inaudible noise in the proper frames

Knowing the proper audio segment x_begin, we construct inaudible noise δ and generate x′_begin = x_begin + δ. Based on the optimization method proposed in [94], we optimize δ according to the CTC loss function ctc_L(·) with a constraint on dB_x(δ) mentioned below. The optimization method can be summarized as

Equation 6.8. Herein, w_i represents the relative importance of being close to t versus remaining close to x_begin. c_i is a character of the output logits processed by the RNN model, while dB_x(δ) indicates the dB difference between the original audio and the noise. The constant con is initially a large value and is reduced to run the minimization again until the result converges. Finally, the output of this step is constructed as x′_begin.

minimize ‖δ‖₂² + Σ_i w_i · ctc_L(x_begin + δ, c_i)
such that dB_x(δ) ≤ con (6.8)

According to [94], the minimization problem is solved using Adam optimization with a learning rate of 100 and 1,000 iterations. As shown in Algorithm 2, x_begin is updated with the CTC loss function as x′_begin varies. Every 100 iterations, if the current adversarial example is successful (F(x′_begin) == t and dB_x_begin(δ) ≤ con), we scale the constant con down to 80% of its value to search for an even smaller distortion. Different from the distortion calculated in [94], we combine the modified x′_begin with the remaining audio x_rest and subsequently calculate the dB of the whole adversarial example, marked as x′ = x′_begin + x_rest, before obtaining the distortion as dB_x(δ) = dB(x′) − dB(x).
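For orientation, the following schematic sketch mirrors the optimization loop of Algorithm 2. The helpers grad_fn and transcribe are hypothetical stand-ins for a differentiable CTC loss on the target model and its decoder; they are not DeepSpeech API calls, so this is a structural sketch under those assumptions rather than a working attack.

    import numpy as np

    def db(x):
        # Relative loudness in decibels: dB(x) = 20 * log10(max_i |x_i|).
        return 20 * np.log10(np.max(np.abs(x)) + 1e-12)

    def generate_noise(x_begin, target, grad_fn, transcribe,
                       lr=100.0, iters=1000, con=50.0):
        # grad_fn(audio, target) -> gradient of the CTC loss w.r.t. the audio
        # transcribe(audio)      -> transcription string from the target model
        # Both are hypothetical helpers standing in for the white-box model.
        delta = np.zeros_like(x_begin)
        for i in range(1, iters + 1):
            # Signed-gradient step on the noise, as in line 8 of Algorithm 2.
            delta -= lr * np.sign(grad_fn(x_begin + delta, target))
            # Enforce the loudness bound dB(delta) - dB(x_begin) <= con.
            limit = 10 ** ((con + db(x_begin)) / 20)
            delta = np.clip(delta, -limit, limit)
            # Every 100 iterations, tighten con by 20% once the attack succeeds.
            if i % 100 == 0 and transcribe(x_begin + delta) == target:
                con *= 0.8
        return delta

    # Smoke test with dummy helpers (no real ASR model involved).
    noise = generate_noise(np.random.randn(1000) * 0.1, "call john smith",
                           grad_fn=lambda a, t: np.zeros_like(a),
                           transcribe=lambda a: "", iters=10)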

Adversarial example generation and evaluation

With x′_begin generated, the adversarial example is created by combining this clip with x_rest. Thus, the final adversarial example is x′ = x′_begin + x_rest. The whole process is defined in Algorithm 2. We define success by verifying the transcription recognized by the target ASR model given an adversarial example. Specifically, we say the adversarial example x′ is successfully generated when its transcription y′ contains a short phrase t′ that exactly matches the target phrase (t = t′). When t ≠ t′, we say that our adversarial example is generated without 100% accuracy. In this case, we measure its success rate with the character error rate (CER) defined in Section 6.4.
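A minimal sketch of the character-error-rate computation behind this success measure (formally defined in Section 6.4); the Levenshtein dynamic program is a standard implementation, not code from the thesis.

    def cer(target, predicted):
        # Character error rate: edit distance over the target length.
        m, n = len(target), len(predicted)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i
        for j in range(n + 1):
            dp[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if target[i - 1] == predicted[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + cost) # substitution
        return dp[m][n] / max(m, 1)

    t, t_pred = "call john smith", "call john smyth"
    print(1 - cer(t, t_pred))  # success rate, ~0.93 for one wrong character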

6.4 Evaluation

6.4.1 Experimental Setting

We set up a number of experiments to evaluate our proposed adversarial example generation. Ten audios are selected randomly from the TIMIT dataset as target audios; each is used to generate an adversarial example that a pre-trained DeepSpeech model recognizes as our target phrase while the change remains inaudible to human beings. All the experiments are conducted on a workstation with an Intel Core X i9-7960X CPU (16 cores) and 128GB of memory. Some experiments use one TITAN XP GPU in addition to the CPUs.

Dataset Description

The TIMIT speech corpus is a well-known corpus for building and evaluating ASR systems. It contains 6,300 sentences recorded by 630 speakers across the United States [23]. Each speech waveform in this corpus is sampled at 16-bit, 16kHz per utterance. In this work, we propose an effective method of injecting the target phrase into a long waveform. Thus, we randomly select five audio waves whose transcriptions have relatively more words than our target phrases. Herein, we regard a transcription with more than ten words as having relatively more words. This setting suits our target phrases because most sensitive commands contain fewer than ten words; for example, the commands "call someone" and "turn on the airplane mode" have only a handful of words. Hence, it is reasonable to select these audio waveforms as our target audios.

Target ASR Model

Our target is the DeepSpeech ASR model, an end-to-end model trained with the CTC loss function [165, 220]. Herein, an RNN is the core engine translating audio into a sequence of text. Specifically,

the pre-trained model we targeted is deepspeech-0.4.1-model. This speech-to-text model is trained on multiple corpora, including LibriSpeech, Fisher, Switchboard, and the English Common Voice training corpus. It reaches an 8.26% WER when tested on the LibriSpeech dataset. The training batch size is 24; the testing batch size is 48; the learning rate is 0.0001; the dropout rate is 0.15; and the number of neurons in the hidden layer is 2,048.

Table 6.1: Two sets of target phrases to evaluate the audio adversarial generation.

Two Sets                                 Target Phrase               # of Characters
Target Phrases with Different Words      call john smith             15
                                         call david jone             15
                                         play music list             15
Target Phrases with Different Lengths    call john smith             15
                                         call john                   10
                                         call john smith and david   25

Baseline

Since our adversarial example generation is an improvement on Carlini and Wagner's work [94], we generate adversarial examples using the iterative optimization-based method proposed in [94] as the baseline. According to the implementation in [94], any audio may be translated into any phrase. In [94], the perturbations are applied over the complete frames of the original audio, and the optimization problem is solved using the Adam optimizer with a learning rate of 10. The default number of iterations to generate an audio adversarial example in that work is 1,000, while the maximum is 5,000. It can generate targeted adversarial examples with a 100% success rate and a mean distortion from −31dB to −38dB. Our generated adversarial example only modifies the beginning of the original audio, whereas the baseline method modifies all the original audio frames. Both FAAG and the baseline method are evaluated on the same workstation for a fair comparison.

Target Phrases

Apart from comparing our generation method with the baseline, we also evaluate FAAG's effectiveness when injecting different target phrases. Specifically, we define two sets of target phrases, each containing three phrases, listed in Table 6.1. The first set evaluates FAAG's performance by injecting target phrases with different words (three words each) into the original audio; we name this set the target phrases with different words. The other set, the target phrases with different lengths, evaluates FAAG's performance by injecting target phrases of different lengths into the original audio.

Evaluation Metrics

We evaluate our adversarial example generation method on the following aspects: the attack's success rate, the dB level of the noise δ compared with the original audio x, the ratio of frames, and the time needed under limited resources. These metrics are described in detail as follows:

• The success rate of injecting the target phrase t into the modified adversarial example can be calculated with the character error rate (success rate = 1 − CER). Assuming that the predicted target phrase is t′, the CER is the ratio of the number of incorrectly predicted characters in t′ to the total number of characters in t.

• The distortion in dB quantifies the distortion of the modified adversarial example compared with the original one. Each audio's dB value is its relative loudness, represented as dB(x) = max_i 20 × log10(x_i). In addition, the dB difference between the original audio and the noise can be formulated as dB_x(δ) = dB(x′) − dB(x) [94] (see the sketch after this list). It is always hard to determine whether the modified audio is imperceptible to human beings using dB alone. We provide a benchmark of an adversarial example's distortion using the method in [94]. According to [228], a distortion of about 30dB is similar to the loudness of a whisper.

• Ratio of frames measures the ratio of the selected clip to the complete audio. The larger the ratio of frames, the longer the audio clip used for adversarial example generation. When the ratio of frames is 100%, FAAG no longer selects a shorter clip for the adversarial attack and is consistent with the baseline method [94].

• Generation time evaluates our method's efficiency in the generation process. In our work, we only employ CPUs and one GPU to generate audio adversarial examples. Compared with an attack using GPUs, an attack with CPUs only is a relatively time-consuming task.
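The distortion quantities above can be computed directly from the waveforms. A minimal sketch, assuming waveforms are NumPy arrays and following the formulas as stated (dB(x) = max_i 20 × log10(x_i) and dB_x(δ) = dB(x′) − dB(x)):

    import numpy as np

    def loudness_db(x):
        # dB(x) = 20 * log10(max_i |x_i|); small offset avoids log10(0).
        return 20 * np.log10(np.max(np.abs(x)) + 1e-12)

    def distortion_db(x, x_adv):
        # dB_x(delta) = dB(x') - dB(x), with x' the whole adversarial example.
        return loudness_db(x_adv) - loudness_db(x)

    # Toy example: small additive noise on a one-second 16 kHz waveform.
    x = np.random.uniform(-1.0, 1.0, 16_000)
    x_adv = x + np.random.uniform(-0.03, 0.03, 16_000)
    print(distortion_db(x, x_adv))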

6.4.2 Proper Frame Length Selection

A proper frame length selected for adversarial audio clip generation should satisfy three conditions. The first condition concerns the selected clip's position in the original audio. The second condition relates to the proper minimum number of windows mapping to our audio clip's frame length. The third condition adds a new variable λ to fine-tune the frame length for less distortion. We explore the specific impact of these three factors on our frame length selection for adversarial audio generation.

Proper Minimum Frame Length for Adversarial Audio Clip

We firstly evaluate our method with respect to the second condition, that is, whether the selected frame length is long enough to cover the target phrase correctly. According to Fig. 6.3, each window of the original audio is translated into one logit character. As discussed in Section 6.3, we know the proper minimum frame length (|x′_begin| = |c′| × s = |c| / |y| × |t| × s) from Equation 6.7. Thus, the length of the adversarial audio clip can be determined by the number of characters in the target phrase. Keeping the target phrase unchanged, we change the length allocated for the target phrase to alter the number of output logits |c′|; one fewer character means a few fewer windows. Herein, the smaller frame length is only a few windows smaller than our selected proper minimum frame length. To evaluate the proper minimum frame length selection, we compare it with the results of a smaller frame length of the selected audio clip. This reference frame length is

|x″_begin| = |c| / |y| × (|t| − 1) × s.

In addition, to ensure the remaining phrase does not affect the translation of our adversarial audio clip, we add a word or a long space after the chosen phrase to form our target phrase. Thus, along with different numbers of windows, we evaluate the necessity of a word or a long space being added to the chosen phrase. Herein, a long space means the number of spaces is larger than one.

Table 6.2: Performance of adversarial generation on a target phrase with different frame lengths. Avg: average.

Setting                  Avg dBδ   Avg Accuracy (%)
x′_begin → t_and         30.87     90.37
x″_begin → t_and         36.28     89.11
x′_begin → t_spaces      38.55     94.70
x″_begin → t_spaces      43.01     93.33
x′_begin → t             39.73     81.19

We run FAAG and compare the results in five different settings. The target model is deepspeech-0.4.1-model [229]. One phrase t is chosen as part of the transcription of our adversarial audio. The word 'and' and two spaces are appended to the chosen phrase separately, yielding two target phrases marked as t_and and t_spaces. The first two settings select an audio clip with the proper minimum frame length |x′_begin| when the target phrases are t_and and t_spaces respectively; we mark them as x′_begin → t_and and x′_begin → t_spaces. The next two settings select an audio clip with the smaller frame length |x″_begin| for the same two target phrases; we mark them as x″_begin → t_and and x″_begin → t_spaces. The fifth setting selects an audio clip with the minimum frame length |x′_begin| when the target phrase is t with only one space appended; we mark this as x′_begin → t. Ten audios are selected randomly from the TIMIT dataset, which has a different distribution from the model's training corpus. For each audio, we repeated the experiment ten times and report the average result.

Table 6.2 shows the averaged dBδ of the generated audio examples and the averaged accuracy of translating these adversarial audio examples under different experimental settings. Comparing the experiment x′_begin → t_and with x″_begin → t_and, the accuracy decreases and the distortion increases because of the smaller frame length of the audio clip. The same conclusion can be drawn by comparing the experiment x′_begin → t_spaces with x″_begin → t_spaces. The reason is that, when generating an adversarial example with the same target phrase, a shorter frame length requires more effort to alter fewer original characters into all the characters of the target phrase. The proper minimum frame length can be calculated using Equation 6.7. In addition, by comparing the experiments x′_begin → t_and and x′_begin → t_spaces with x′_begin → t, a word or a long space is required to ensure the accuracy of adversarial examples. The reason is that the last word of the target phrase is easily influenced by the rest of the original audio because of the nature of an ASR model. When the target phrase is t_and, the average distortion is smaller than when the target phrase is t_spaces. We infer that more modification is required to alter a character into a space symbol than into another character. Meanwhile, with a long space between the target phrase and the rest of the transcription, the translation of the target phrase part is less influenced.

Proper Fine-tune Length for Adversarial Audio Clip

Having found the proper minimum frame length, we evaluate our method with respect to the third condition. Apart from accuracy, it is important to achieve less distortion in the generated adversarial audio example to hide the attacker's intention. To fine-tune the length of the adversarial audio clip, we explore the relationship between the allocated frame length of an audio clip |x^allo_begin| and the number of characters in the target phrase |t| across different audios. Here, we introduce a variable λ to alter the allocated frame length. Based on Equation 6.7, the allocated frame length can be calculated as

|x^allo_begin| = |c| / |y| × (|t| + λ) × s.

Since we use the right-hand side of Equation 6.7, we only consider positive values of λ. Because of the relationship stated in Equation 6.4, λ should satisfy |t| + λ ≤ |y|. To clarify the relationship, we use the ratio_frame to measure the ratio of the allocated frame length for adversarial audio generation to the whole frame length of the audio. To find the best λ, we randomly select one original audio and three different phrases, each appended with two spaces. To avoid the impact of different target phrase lengths, we use the set of target phrases with different words stated in Table 6.1.
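To make the role of λ and ratio_frame concrete, a small sketch that sweeps λ and reports the resulting ratio of frames; all numbers are illustrative, not measurements from our experiments.

    def ratio_frame(num_logits, len_y, len_t, step, total_len, lam):
        # |x_begin^allo| = |c| / |y| * (|t| + lambda) * s, as a share of |x|.
        allocated_len = num_logits / len_y * (len_t + lam) * step
        return min(allocated_len / total_len, 1.0)

    # Illustrative sweep: |c| = 200, |y| = 52, |t| = 16, s = 320, |x| = 64,000.
    for lam in range(0, 37, 6):      # lambda must satisfy |t| + lambda <= |y|
        r = ratio_frame(200, 52, 16, 320, 64_000, lam)
        print(f"lambda={lam:2d}  ratio_frame={r:.2%}")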

Figure 6.4: Performance of FAAG with increasing length of the selected audio clip: (a) success rate; (b) distortion.

With an increasing ratio of frames clipped for adversarial audio generation, Fig. 6.4 shows the performance of our adversarial attack. When the ratio of frames reaches 100%, the results are similar to our baseline. In general, the frame ratio can influence the success rate and distortion of the generated adversarial audio. The best accuracy result outperforms the baseline by around 13% when targeting the phrase 'call john smith', while the best distortion is around 7dB smaller than the baseline. However, based on our observations, there are no specific rules about how λ impacts the adversarial example's success rate and distortion. We infer that the reason may be the varying effort needed to transform different combinations of original characters into the target phrase: the greater the phoneme gap between the selected combination of original characters and the target phrase, the louder the noise needed to generate our adversarial audio. In all, the best λ to fine-tune the length of a selected audio clip depends on the specific original audio and the target phrase. Apart from the effectiveness, we also consider the efficiency of our FAAG method. Apparently, from the results shown in Fig. 6.5, we conclude that less time is spent when using a shorter audio clip. For further investigation, we generate adversarial examples using one original audio targeting three different phrases.

Figure 6.5: Duration of FAAG generating each adversarial audio ten times: (a) one original audio with three target phrases; (b) different original audios with one target phrase.

Herein, these phrases differ only in their words and have the same length. As shown in Fig. 6.5a, different words in the target phrase do not affect the time spent on adversarial example generation. Additionally, we generate adversarial examples using three different original audios targeting the same phrase. Although different audios require different amounts of time for FAAG, their growth rates in time are similar as the ratio of frames increases.

6.4.3 Effectiveness and Efficiency Analysis

We evaluate our generation method by comparing the generation performance on target phrases with different words and on target phrases with different lengths, and by comparing the generation performance with the baseline. Our experiments are conducted under limited resources, as we only use CPUs and one GPU. In the following results, when the ratio of frames equals 100%, the experiment uses the baseline method to generate the adversarial example.

Adversarial Generation of Target Phrases with Various Words

We generate audio adversarial examples corresponding to the 100 TIMIT audios we selected randomly. To compare the adversarial attack performance on target phrases with different words, we choose the phrase set named target phrases with different words in Table 6.1. Moreover, we record the average success rate and distortion, while the duration is counted over the generation of all 100 audio examples. Assuming that the attacker does not have enough time to generate the adversarial example, we only run the optimization-based method for 1,000 iterations to generate the adversarial example x′_begin = x_begin + δ. Compared with the baseline results, Table 6.3 lists the performance of FAAG in an environment with limited computational resources and time. We analyze the results from the effectiveness and efficiency aspects. For the effectiveness analysis, we focus on the success rate and distortion results in Table 6.3. In general, updating the noise and distortion within 1,000 iterations, our FAAG method shares similar effectiveness with the baseline.

Table 6.3: Performance of adversarial generation on target phrases with different words. (Duration results are in Hour:Minute:Second format for all 100 adversarial audio generations. Apart from the duration, the other results are averaged.)

Target Phrase     Success Rate   dB_x(δ)   Duration   Ratio of Frames
call john smith   89.07%         29        01:10:45   100%
                  91.33%         35        00:26:09   45.65%
call david john   90.06%         27        01:11:08   100%
                  91.13%         32        00:26:29   45.65%
play music list   90.39%         26        01:11:18   100%
                  90.67%         33        00:26:21   45.65%

All success rates are around or over 90%, while the distortions are around 30. The impact of the different words in the target phrase can be ignored because the differences are slight. As discussed in the previous section, the audio clip's frame length impacts the whole adversarial example's performance; however, the impact depends on the specific original audio and target phrase. Attackers can fine-tune this factor using λ. In all, our FAAG method does not negatively impact the effectiveness of the attack proposed in previous work, and different words in the target phrase have no particularly noticeable impact. For the efficiency analysis, we focus on the duration and the ratio of frames. Generally, the ratio of frames used to generate the adversarial examples correlates positively with the duration required for generation. Specifically, when the target phrase is 'call john smith', almost half (45.65%) of the original audio is clipped for FAAG, and about half of the time is spent compared with the roughly one hour used by the baseline. When the target phrases differ in words but have the same length, the average ratio of frames is the same; accordingly, the durations of FAAG targeting these three phrases are similar to each other.

Adversarial Generation of Target Phrases with Different Lengths

Audio adversarial examples are also generated for the selected 100 TIMIT audios with the other phrase set. To compare the adversarial attack performance on target phrases with different lengths, we choose the phrase set named target phrases with different lengths in Table 6.1. Similar to the above experiment, we evaluate each adversarial example's construction and record the average results in Table 6.4. Assuming that the attacker wants to generate the adversarial example as quickly as possible, the optimization-based method is executed for 1,000 iterations to generate the adversarial example x′_begin = x_begin + δ. Table 6.4 lists the performance of FAAG and the baseline in an environment with limited computational resources and time.

Table 6.4: Performance of adversarial generation on target phrases with different lengths. (Duration results are in Hour:Minute:Second format for all 100 adversarial audio generations. Apart from the duration, the other results are averaged.)

Target Phrase               Success Rate   dB_x(δ)   Duration   Ratio of Frames
call john smith             89.07%         29        01:10:45   100%
                            91.33%         35        00:26:09   45.65%
call john                   84.44%         27        01:10:25   100%
                            85.11%         33        00:17:42   29.54%
call john smith and david   95.47%         32        01:11:49   100%
                            95.52%         35        00:41:14   70.21%

Table 6.5: Comparing the performance of adversarial generation on the audio file SA1.wav with the baseline result. The target phrase is "call john smith" for all experiments here. Time is counted for this adversarial audio generation.

Generation Method   Ratio of Frames   Success Rate   dB_x(δ)   Time
FAAG                51.96%            98.6%          28.12     26min
[94]                100%              82.67%         29.33     70min

For the effectiveness analysis, we examine the success rate and distortion results listed in Table 6.4. Overall, updating the noise and distortion within 1,000 iterations, all averaged success rates of our method surpass 85%, which is comparable to the baseline. The target phrase's length is not the deterministic factor for either the attack's success rate or the distortion. As discussed in the previous section, the best distortion can be found by tuning λ in FAAG for a specific target. In all, no matter how long or short the target phrase is, the performance of our FAAG method is as high as that of the attack proposed in previous work. For the efficiency analysis, we focus on the duration and the ratio of frames in Table 6.4. Similar to the results in Table 6.3, the ratio of frames used to generate the adversarial examples correlates positively with the duration required for generation. When the target phrase is 'call john smith and david', even though around 70% of the whole audio (per Table 6.4) is used for the attack, only about half of the time is spent compared with the baseline. Specifically, different from the baseline, the shorter the target phrase is, the more time is saved using our FAAG method. In all, the FAAG method is less time-consuming than the previous work, especially when computational resources are limited. Over a hundred phrases were collected as target phrases. These phrases are common voice commands that users usually issue to the surrounding voice assistants, including Google Assistant and Apple Siri [230, 231, 232]. We randomly selected 100 benign audio clips from the TIMIT dataset and randomly chose one phrase from the collected phrase set for each audio clip as its target phrase. It took FAAG 20 minutes and 59 seconds to complete all 100 adversarial generations. Compared with the baseline method [94], which spent 50 minutes and 44 seconds, FAAG was much faster in adversarial example generation targeting common command phrases. For the effectiveness analysis, FAAG showed a slight advantage in average success rate (90.45% > 88.9%) and a slightly larger distortion (33.11dB > 28.38dB). In all, the FAAG method is more effective with limited computational resources.

Comparison with the Baseline using CPUs

FAAG is an improved method based on the iterative optimization-based method proposed by [94], which is the baseline in this chapter. Empirical results show that our method is significantly more efficient in adversarial audio generation with only one GPU and CPUs. We also compare the two methods' performance using CPUs only, primarily focusing on the success rate, dB_x(δ), and time, as listed in Table 6.5. Apart from using the CPU only, we compare their performance with a small number of iterations (1,000) when the target phrase is "call john smith". Again, each adversarial example generation is repeated ten times and the averaged results are recorded. Choosing the best λ for the specific audio SA1.wav with one target phrase, we obtain the FAAG results. As shown in Table 6.5, with the best λ, FAAG significantly exceeds the baseline in [94] in both time and success rate when using only CPUs. The reason is that FAAG only needs to modify part of the original audio frames, while the baseline modifies the whole audio. Without any GPUs, the time advantage of FAAG is even more prominent. As for the success rate and distortion, by choosing the best λ, FAAG can reach or even surpass the baseline.

Figure 6.6: Visualizations of three pairs of waveforms. Each column represents a pair of an original audio and its adversarial example when the target phrase is "play music". The images in the first row present the original waveforms of SA1.wav, SI488.wav and SI667.wav. The images in the second row present the waveforms of the corresponding adversarial audios.

Case Study

We present the detailed process of generating adversarial examples with three audio clips, SA1.wav, SI488.wav, and SI667.wav, and the target phrase "play music" as a case study. We assume that the attacker only uses CPUs to generate the adversarial example for the targeted attack against an ASR model and must generate it within one hour. The target ASR model is a pre-trained DeepSpeech model. For the original audio SA1.wav, the attacker aims to embed the target phrase "play music" at the beginning of the original audio by adding noise that will be recognized by the target model but is inaudible to human beings. Algorithm 1 is used to select a proper clip at the beginning of the original audio x, marked as x_begin = x − x_rest. Then, following the process in Algorithm 2, we minimize the problem using the Adam optimizer and the CTC loss function with a learning rate of 100 and 1,000 iterations. Finally, one adversarial example of the audio SA1.wav is generated. The averaged result is obtained by repeating the previous step ten times. We randomly select one adversarial example for each of SA1.wav, SI488.wav and SI667.wav with the target phrase "play music". Their success rates reached 100%, 100%, and 90.9%, respectively. We then plot the waveforms of these adversarial examples against the waveforms of their original audios, as shown in Fig. 6.6. In general, it is challenging to differentiate the two waveforms when the distortion is 34dB, 36dB, and 39dB, respectively. The beginning part of the adversarial examples' waveforms is slightly thicker than that of their original audios. Since we only modify the beginning part of the audio, generating the adversarial examples is faster than modifying the whole frame.

Table 6.6: Comparison of generation time between the baseline method and FAAG. All results are averaged values; time is in Hour:Minute:Second format.

Target Dataset          Target Phrase(s)                 Using GPU   Generation Time (Baseline)   Generation Time (FAAG)   Speedup
SA1.wav                 call john smith                  Yes         00:06:03                     00:02:25                 60.1%
SA1.wav                 call david jone                  Yes         00:06:05                     00:02:25                 60.3%
SA1.wav                 play music list                  Yes         00:06:01                     00:02:26                 59.6%
SI2248.wav              call john smith                  Yes         00:05:34                     00:02:24                 56.9%
SI667.wav               call john smith                  Yes         00:08:26                     00:02:28                 70.6%
SA1.wav                 call john smith                  No          01:00:00                     00:26:00                 56.7%
100 TIMIT audio clips   Phrases with different words     Yes         01:11:04                     00:26:20                 63.4%
100 TIMIT audio clips   Phrases with different lengths   Yes         00:50:44                     00:20:59                 60.0%

6.4.4 Summary in Speed Advantage

FAAG significantly outperforms the baseline method in terms of generation time. Table 6.6 summarizes the generation time under different conditions using the baseline method [94] and FAAG. When adversarial examples were generated with CPUs only, FAAG sped up generation by around 56.7% compared with the baseline method. When adversarial examples were generated with one GPU and CPUs, FAAG sped up generation by around 60%.
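The speedup figures in Table 6.6 follow directly from the two generation times, assuming speedup = (T_baseline − T_FAAG) / T_baseline; a quick check in Python:

    def speedup(t_baseline, t_faag):
        # Relative speedup from two H:M:S duration strings.
        def seconds(t):
            h, m, s = (int(v) for v in t.split(':'))
            return 3600 * h + 60 * m + s
        return (seconds(t_baseline) - seconds(t_faag)) / seconds(t_baseline)

    print(f"{speedup('00:06:03', '00:02:25'):.1%}")   # ~60.1%, first row of Table 6.6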

6.5 Discussion on Different Position of Adversarial Audio Clip

6.5.1 Different Position of Adversarial Audio Clip

Besides hiding the target phrase at the beginning of the audio, we also discuss other positions, including the middle part and the ending part. These two positions are also meaningful in some cases. For example, when the target phrase contains a trigger word of an ASR system, the ASR system will only listen to the sentence behind that trigger word. In this section, we present and analyze the results for the latter two hiding positions. The original audio is SA1.wav, and the phrase is 'call john smith'. The selection of the audio frames clipped from the middle and the end of the audio differs slightly from the beginning case. For hiding the phrase at the end of the audio, we prepend two spaces to the phrase as the final target phrase. The frame length selection is quite similar to Algorithm 1; the differences are index = |x| − |x_end| and x_end ← x[index :]. For hiding the phrase in the middle of the audio, we append two spaces at the beginning and the end of the phrase, respectively, as the final target phrase. The frame length determination is quite similar to the beginning position, but the final index calculation in Algorithm 1 is different. Another index is introduced for the middle-position generation, marked as index′. Assume the corresponding three characters of output logits will not be replaced; then index′ = |c| / |y| × 3 × s and index = index′ + |c| / |y| × |t_allocated| × s. Two other audio clips and one target audio clip are separated from the original audio, marked as x_rest, x_middle, and x_rest′. Specifically, x_rest ← x[: index′], x_middle ← x[index′ : index], and x_rest′ ← x[index :]. Fig. 6.7 shows that hiding the phrase at the beginning of the audio reaches or surpasses the baseline result. The success rates of hiding the phrase at the end of the audio are always lower than 25%; specifically, the first word 'call' cannot be recognized in most cases. In this situation, the attack is meaningless because the ASR system listens to the sentences after the trigger word. As for hiding the phrase in the middle of the audio, no matter how long a frame is clipped for generation, the adversarial attack did not work. We infer that this is due to the powerful language model within the end-to-end ASR system: considering the contextual information, especially the preceding text, the latter part of the transcription will be influenced.

Figure 6.7: Averaged accuracy of FAAG used in different hiding positions. The baseline is the result of using the entire audio sample to generate the adversarial audio [94].
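The index arithmetic for the three hiding positions can be summarized in a short sketch; the function below follows the expressions above, with clip lengths supplied by the caller (e.g., computed as in Algorithm 1).

    def split_for_position(x, clip_len, position, offset_len=0):
        # Split waveform x around the clip that will carry the target phrase.
        # clip_len: |x_begin| (or |x_middle|, |x_end|) in samples;
        # offset_len: for 'middle', the leading clip length (index' in the text).
        if position == 'begin':
            return x[:clip_len], x[clip_len:]               # x_begin, x_rest
        if position == 'end':
            index = len(x) - clip_len
            return x[:index], x[index:]                     # x_rest, x_end
        if position == 'middle':
            start, end = offset_len, offset_len + clip_len
            return x[:start], x[start:end], x[end:]         # x_rest, x_middle, x_rest'
        raise ValueError(position)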

6.5.2 Countermeasures

Because of the dramatic decrease in the attack's success rate observed above, we identify an effective protection method against the current targeted adversarial attack on an end-to-end ASR system: for any suspicious audio, we can append benign audio at its beginning before playing it. Although the original transcription accuracy will be affected, the success rate of the targeted attack decreases. To evaluate the protection results, we append benign audio to the generated adversarial examples and classify them with the target ASR model. We select a clean audio named SA1.wav and append its complete audio frames to the beginning of any suspicious audio. If the suspicious audio is benign, we expect that the appended audio will not degrade the benign audio's transcription result. Otherwise, if the suspicious audio is an adversarial example, we expect that the combined audio's transcription will not include the attacker's target phrase. Assume the attacker's target phrase is "call john smith". The suspicious audios include benign audio, several adversarial audios generated using the baseline method [94], and several adversarial audios generated using our FAAG method. Each experiment is repeated ten times, and the averaged results are presented. Table 6.7 shows the performance of various audios, comparing their translations to the target phrase and their true translation text. In general, appending a complete benign audio sample at the beginning of any suspicious audio is an effective countermeasure against the targeted adversarial attack. Specifically, comparing the results of SA1.wav and SA1.wav+SA1.wav, this protection method causes only a slight decline in translation accuracy. Comparing Baseline with SA1.wav+Baseline and FAAG with SA1.wav+FAAG, the target phrase is not included in the final translation after appending the benign audio. Meanwhile, by appending benign audio, the accuracy of the original audio's translation surpasses 90%.

Table 6.7: Comparing the performance of various audio files. SA1.wav is the benign audio selected to append to the suspicious audios. SA1.wav+SA1.wav represents the combination of two benign audios. Phrase in Trans? indicates whether the target phrase appears in the final translation; Success Rate is the attack's success rate; Accuracy in Trans compares the predicted translation with the true translation.

Audios             Phrase in Trans?   Success Rate   Accuracy in Trans
SA1.wav            No                 None           98.04%
SA1.wav+SA1.wav    No                 None           96.12%
Baseline           Yes                82.67%         None
SA1.wav+Baseline   No                 None           92.03%
FAAG               Yes                94.45%         None
SA1.wav+FAAG       No                 None           92.52%

Table 6.8: Performance of FAAG and Yakura and Sakuma's method [97]. The target phrase is "call nine one one", selected from the common voice commands used in the previous experiments. Ten benign audio clips were randomly selected from the TIMIT dataset. All reported results are averaged.

Generation Method   Ratio of Frames   Success Rate   dB_x(δ)   Generation Time
[97]                100%              72.06%         84.56     30min
[97]+FAAG           63.69%            70.89%         79.94     21min

6.5.3 Transferable FAAG

FAAG aims to achieve fast adversarial audio generation. According to Section 6.3, the proper length selection depends on the target model's structure, the benign audio's true transcription, and the target phrase. This chapter investigates a recurrent-network-based ASR model targeted by an iterative optimization-based method. To investigate FAAG's transferability, we adopt another similar adversarial attack against the DeepSpeech model. We also discuss how to find the proper frame length when targeting generic ASR models, such as a conventional ASR model. Yakura and Sakuma [97] proposed a robust adversarial audio generation method evaluated in an over-the-air setup. The success rate reached 100% at the cost of more than 18 hours of computational time for each audio. Various techniques, such as band-pass filters, impulse responses, and white Gaussian noise, were applied in [97] to achieve the best outcomes for very few audio clips. To compare with Yakura and Sakuma's method, we chose a small set of audio clips with one target phrase, making FAAG and Yakura and Sakuma's method comparable. We selected one phrase from the common voice commands used in the previous experiment, i.e., "call nine one one". Ten benign audio clips were randomly selected from the TIMIT dataset. The target ASR model is deepspeech-0.4.1, different from [97]. Yakura and Sakuma's method requires adjusting the perturbation magnitudes whenever the input sample or the target phrase changes; thus, we adjusted the perturbation magnitudes for the audio clips chosen for FAAG. This experiment was designed to measure the speed of the generation method, so the over-the-air attack was not evaluated. To ensure fair and reproducible comparisons by avoiding the influence of environmental noise, we used the direct translating method, which differs from the over-the-air evaluation in [97]. Moreover, the same number of training iterations was applied for generating each adversarial audio example. Twenty adversarial audio examples were generated for each audio file using the method in [97]. Feeding these adversarial examples directly into the ASR model, the best example was selected by comparing their transcriptions with the target phrase. Then, FAAG was used with the same setup. Table 6.8 shows the averaged results over the ten audio clips. The results demonstrate that FAAG was faster than the method in [97] while generating adversarial examples of similar quality. When the target model's structure is different, the frame length selection will differ. Kaldi [167] is a conventional ASR model based on hidden Markov models (HMMs); there are no logit outputs fed to a CTC decoder, so Equation 6.7 no longer holds for Kaldi. A different scheme is required to find an explicit relationship among the phonemes, the HMM states, the true transcription, the benign audio, and the target phrase. We leave this investigation as future work.

6.6 Conclusion and Future Work

We propose the FAAG method to generate adversarial examples for audio clips under white-box access. A novel algorithm determines the appropriate length of the audio frame for adversarial attacks according to the target phrase, the original audio, and the logit outputs of the target ASR model. The adversarial example can reach a high success rate by adding negligible noise, and the generation process can be completed within a short period using CPUs with one GPU or even without GPUs. Our empirical studies show that adding noise to only part of the audio can effectively generate an adversarial example. The distortion of the generated adversarial examples is similar to the baseline method, implying little quality loss. FAAG maintains its success rate regardless of the choice of words in target phrases. More importantly, different positions for hiding phrases are discussed, and hiding the phrase at the beginning of the audio is a plausible attack. Appending benign audio to the beginning of adversarial audio can effectively protect services from targeted adversarial audio attacks. We verify the adversarial examples generated over-the-line to ensure the correctness of the results. Conducting the attack over-the-air and under black-box access is left to future work.

Chapter 7

Research Challenges and Future Work

In Chapter 2, the recent publications about ML-based stealing attacks against controlled information and the corresponding defense methods are reviewed. Some attacks can steal the information, but they make strong assumptions about the attacker's prior knowledge. For instance, the attacker is assumed to know the ML algorithm as a necessary condition prior to stealing the model or training samples. However, this prior knowledge is not always publicly known in real-world cases. Additionally, the attack methods are not mature technologies and leave great room for improvement. Chapter 2 outlines the target and accessible data for each paper, and Table 2.11 summarizes the core research papers from the perspectives of attack, protection, related ML techniques, and evaluation. The following sections discuss the future directions of ML-based stealing attacks and feasible countermeasures, as shown in Figure 7.1.

Figure 7.1: The Challenges of ML-based Stealing Attack and Its Defenses

7.1 Attack

During the battle between attackers and defenders, it is crucial for defenders to anticipate the directions of attackers' future actions. To discuss these future directions, the challenges of ML-based stealing attacks are analyzed. The analysis results and possible solutions can be summarized and regarded as the future directions of the ML-based stealing attack. In this section, challenges and future directions are discussed across the five phases of the MLBSA methodology: reconnaissance, data collection, feature engineering, attacking the objective, and evaluation.

7.1.1 Reconnaissance

As illustrated in Chapter 2.1.1, the reconnaissance phase consists of two main tasks: target definition and analysis of valuable accessible data. The definition of the target determines which kinds of accessible resources are valuable. The subsequent attack mechanism is designed according to the analysis of accessible data during the reconnaissance phase. It is essential to ensure that the information accessible to legitimate users contains valuable information for stealing attacks to succeed. A challenge during the reconnaissance phase is the lack of effective information in the accessible data. As stated in Chapter 2, the first category of attack, stealing user activity information, primarily relies on accessible data sources including kernel data and sensor data. The attacker captures the information without special permissions and utilizes representatives of different user activities, as explained in Chapter 2.1. Setting appropriate permission requirements can protect the accessible data from being exploited by the attacker. For example, Android version 8 restricts access to kernel resources, including interrupt timing log files [36]. Because an insufficient amount of information is collected under the black-box setting, the model/training data obtained by stealing attacks is insufficient to reconstruct an ML model as good as the original [18, 12, 29]. As for the third category of attack, the majority of stealing methods are based on a large amount of PII, effective sensor data, or coarse-grained cache data. However, such information, especially PII, is sensitive enough to raise privacy concerns and may be protected in the future [233, 234]. All in all, restricting access to current attack vectors can block part of the ML-based stealing attacks. To deal with the lack of information, the attacker's future work is to find new exploitable data sources as replacements. Some plausible solutions are proposed by [36] and [49]. [36] defines all targets of interest in a list and automatically triggers the activities of interest, followed by searching for exploitable log files in the newest versions of the targeted system (Android versions 7 and 8); it also shows that it is worthwhile to simply monitor changes in the accessible data, such as the sensor data, because detecting the changes gives clues for stealing information. Consequently, future directions include searching for exploitable sources in iOS and monitoring possible changes in sensor data in order to perform a potential stealing attack. For attacks stealing authentication information, a solution is to use the Corpus of Contemporary American English (COCA) instead of PII. Using the COCA corpus, a successful password guessing attack was performed by [33]. Additionally, analyzing password structure with anthropological analysis [33] may reduce the attacker's reliance on PII. Attackers are expected to search for new sources or explore new characteristics for further attacks.

7.1.2 Data Collection

Determining the valuable accessible data is only one part of an ML-based stealing attack. To take advantage of the ML mechanism, the dataset collected in this phase should guarantee representativeness, reliability, and comprehensiveness. If any one of the three is unsatisfactory, the results of the stealing attack will be inaccurate. The first challenge is collecting valuable data with information representative of the target across all systems/devices. Especially when the valuable data is kernel data or sensor data, some forms of data recording may vary greatly across systems and devices. Regarding this problem, data was collected by [10] and [47] from eight different mobile devices and different machines. Hence, a

future work is collecting data from heterogeneous sources and aggregating the representative data. Various forms of representative information affect the attack's probability of success. The second challenge appears while collecting a reliable dataset. The quality of the training dataset is critical to the attack's performance. Most of the explored stealing attacks utilize the model's query outputs, i.e., the confidence values associated with the attacker's query inputs. The preciseness of these values affects the success of the attack. Specifically, this confidential information was leveraged by [12], [29] and [18] to imitate the target ML model through techniques such as equation solving, path searching, and inversion attacks, as summarized in Table 2.2 and Table 2.3. Furthermore, the performance of an important attack, the membership inference attack [38], depends on the training dataset of the shadow models. This training dataset can be generated based on the target model's confidential information, which is distributed similarly to the targeted training set. Under these circumstances, the above attacks will not succeed if the target model's API outputs only the class label or polluted confidence information. This inconvenience was scrutinized by [12], and a method was proposed to extract the model when only class labels are informed. Accordingly, these findings can be further explored in the context of several other ML algorithms, together with the reduced monetary cost of using ML APIs as an honest user. The poor quality of a collected dataset hinders the success of the ML-based stealing attack. The third challenge, comprehensive dataset collection, involves determining the size/distribution of the training dataset and the testing dataset. The size of the training inputs often dictates whether the attacker can easily obtain all possible classes of the targeted controlled information, especially when the predictive model outputs only one class per query. In [38], a comprehensive training dataset was collected by generating data with a distribution similar to the targets'. The size of the testing dataset indirectly indicates the amount of controlled information that attackers can learn. For instance, the testing set size of a membership inference model depends on how many training members might be included and distinguished [38]. Future work may investigate the impact of the size/distribution of training and testing datasets on the success of ML-based stealing attacks. Partial or imbalanced distributions reduce the success rates of stealing attacks.

7.1.3 Feature Engineering

Feature engineering in the MLBSA methodology refines the collected data for an effective and efficient training process. It is critical to the performance of an ML-based attack because it eliminates noise from the collected data. However, among current research, the techniques used in feature engineering remain underdeveloped. As shown in Table 2.6, Table 2.7, Table 2.8 and Table 2.10, many existing works select features manually. Manual feature selection, relying on the attacker's domain-specific knowledge and human intelligence, usually produces a small number of features. That is, manual feature selection is inefficient because of the human involvement it requires, and it may ignore useful features with low discriminative power. To improve the attack's effectiveness, the automation of feature selection has great research potential. For example, [39] used a CNN to learn features based on the correlations among data for optimal classification. A future trend of the attack is searching for or developing other automatic methods to replace manual feature selection [235]. [236] developed a regression-based feature learning algorithm to select and generate features without requiring domain-specific knowledge. Automating feature selection with

105 such generic algorithm would promote the efficiency and effectiveness of the ML-based attack.
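As a minimal sketch of such automation, the snippet below uses scikit-learn's mutual-information-based feature selection on synthetic data; the data and the choice of k are purely illustrative and this is not the specific algorithm of [236].

```python
# Automatic, domain-agnostic feature selection via mutual information.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))             # 64 candidate features per sample
y = (X[:, 3] + X[:, 17] > 0).astype(int)   # only two features actually matter

selector = SelectKBest(score_func=mutual_info_classif, k=8)
X_reduced = selector.fit_transform(X, y)
print("kept feature indices:", selector.get_support(indices=True))
```

Unlike a manual pick, the selector scores every candidate feature, so weakly discriminative but useful features are not discarded by human judgement alone.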

7.1.4 Attacking the Objective

In the phase of attacking the objective with ML techniques, the main tasks include training and testing the ML model to steal the controlled information. Stealing attacks face a few challenges with respect to training and testing ML models, including unknown model algorithms, unknown model hyperparameters, and a limited number of testing attempts.

For ML-based attacks stealing the controlled model or training data, the first challenge is that most of the research treats the model algorithm as prior knowledge. However, the model algorithms behind many MLaaS offerings are unknown to the end-user, and most of the attacks would not succeed without specifying the correct model algorithm. This concern was discussed in [29], which concluded that attacks without knowledge of the model algorithm can be impossible in some circumstances. It is worth investigating the possibility of a successful attack when the model algorithm is unknown. In 2019, [43] considered a membership inference attack against a black-box model with an unknown algorithm by choosing a threshold. However, whether this method is applicable to other attacks under black-box access, such as the parameter stealing attack [12], remains unknown.

The second challenge involves the hyperparameters of the targeted ML model, which are unknown to the model learned by the stealing attack. The more precisely the model is learned, the more accurately its functionality is replicated and the more detailed the training records that can be revealed. Stealing attacks predominantly steal the model by estimating the parameters that match the objective function. However, another critical element, the hyperparameters, has been ignored, even though their values influence the accuracy of stealing attacks. In [29], a solution was proposed to prevent an attacker from calculating the hyperparameters of some ML algorithms, covering a set of linear regression algorithms and three-layer neural networks. A future direction is to enable hyperparameter stealing attacks against other popular ML algorithms such as k-NN and RNNs; with unknown hyperparameters, only the parameters can be calculated while extracting the ML model.

The third category of stealing attacks is password guessing. This type of attack generally assumes unlimited login attempts for each account; one exception is [31], where each password guessing attack was performed fewer than 100 times. To crack passwords effectively, researchers applied ML algorithms to model the password generator based on personal information, website-related information and/or previously leaked public passwords. Future work for this stealing attack is an attack mechanism that succeeds against the targeted authentication system within fewer than 100 login attempts, since with limited attempts a guessing attack may fail within the first few guesses.
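A minimal sketch of the threshold idea in the spirit of [43] is shown below: without knowing the target's algorithm, the attacker labels a queried sample as a training member whenever the maximum posterior exceeds a chosen threshold. The confidence vectors here are fabricated for illustration.

```python
import numpy as np

def infer_membership(target_confidences, threshold=0.9):
    """Return a boolean member/non-member guess per queried sample."""
    max_conf = np.max(np.asarray(target_confidences), axis=1)
    # Training members tend to receive more confident predictions,
    # so a high maximum confidence is taken as evidence of membership.
    return max_conf >= threshold

# Illustrative usage with fabricated confidence vectors:
probs = [[0.97, 0.02, 0.01],   # likely a training member
         [0.45, 0.35, 0.20]]   # likely a non-member
print(infer_membership(probs))  # [ True False]
```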

7.1.5 Evaluation

To effectively infer the controlled information, most of the investigated research applied the ML mechanisms mentioned in Chapter 2. The prediction of unknown testing samples is a challenge for ML-based stealing attacks, as supervised learning algorithms dominate the attack methods. That is, if the true label of a testing sample has not been learned by the model during the training phase, this sample will be assigned to an incorrect class. Testing samples that are unknown to the training dataset affect the evaluation results and subsequently reduce the stealing attack's accuracy. To improve the performance of such attacks, the attacker needs to achieve breakthroughs in predicting unknown data. For the attack stealing user activities, when an attacker wants to know the foreground app running on a user's mobile device, distinctive features of the accessible data representing the status of running apps are extracted and learned by ML algorithms [10, 37, 48]. After the attack model is trained, the accessible data recording a new foreground app running on the device serves as the testing sample, and this new app is unknown to the attack model [10, 37, 48, 50]. For attacks stealing authentication information, it is difficult to remain effective when users change their passwords frequently or adopt new keyboard layouts [34], owing to the uncertainty of users' password generation behaviours and the variety of users' input keyboards. Evaluating the prediction of unknown classes is therefore a challenging task for stealing attacks against user activities and authentication information.
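The closed-set problem described above can be illustrated with a small synthetic sketch: a supervised attack model trained on three known apps is forced to place an unseen app into one of those classes. The features and labels are invented for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
# Three known apps, each with a distinct side-channel "signature".
X_train = np.vstack([rng.normal(c, 0.5, size=(40, 4)) for c in (0, 3, 6)])
y_train = np.repeat(["mail", "maps", "camera"], 40)
clf = RandomForestClassifier(n_estimators=50).fit(X_train, y_train)

# A new app never seen during training still receives one of the old labels.
x_new_app = rng.normal(9, 0.5, size=(1, 4))
print(clf.predict(x_new_app))  # forced into a known (wrong) class
```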

7.2 Defense

Targeting diverse controlled information, the countermeasures protecting the information from ML-based stealing attacks are summarized. In general, the countermeasures fall into three groups: 1) detection identifies critical indications related to an attack; 2) disruption perturbs the accessible data at a tolerable cost to the service's utility; and 3) isolation limits access to valuable data sources. As Figure 7.1 depicts, the countermeasures mainly apply to the first two phases. Specifically, isolation restricts the attacker's access and makes the attack fail at the first phase, while disruption can confuse the attacker in the second phase and hinder the attacker from building a successful attack model. Detection techniques can detect the attacker's actions and then protect the information from being stolen. These countermeasures are explained as follows.

7.2.1 Detection

To detect potential stealing attacks in advance, the relevant crucial indications must be identified by analyzing the functionality related to the controlled information. Defenders should notice the attackers' actions as soon as the attackers start the reconnaissance or data collection processes. Based on the attacker's likely future directions, detection is proposed accordingly in order to prevent the attack at an early stage and minimize the loss from stealing the controlled information. In the presence of any malicious activities in the reconnaissance and data collection stages, the change or usage of the relevant crucial information should be analyzed and checked. For example, when the attacker steals information based on accessible sensor data obtained from certain APIs, the calling rate and the API usage can be deemed two critical indications for detection [37]. Since attackers may intend to exploit unknown critical indications, a defender can trade off between monitoring the access frequency of all related information and the service's utility, and thereafter detect unusual access frequencies to counter the stealing attack. Another detection method is assessing the service's ability to secure the controlled information; such an assessment can ensure that future ML-based stealing attacks are less powerful than the currently known attacks. For example, the memory page's cacheability was managed in [40] to protect secret keys within users' memory activities, while password guessability was checked in [32] and users were then alerted to weak passwords. If a defender can assess the ability to hide the ML model and training set from unauthorized peeking, the controlled information is protected to some extent: the detector alerts the user when the assessed ability falls below a certain threshold.
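A minimal sketch of such rate-based detection is given below: a sliding window counts each caller's queries and flags those exceeding an assumed normal-usage bound. The window length and threshold are illustrative knobs a defender would tune against utility loss.

```python
from collections import defaultdict, deque
import time

WINDOW_SECONDS = 60
MAX_QUERIES_PER_WINDOW = 100  # assumed normal-usage bound

_history = defaultdict(deque)

def record_query(caller_id, now=None):
    """Log one query; return True if the caller looks like a scraper."""
    now = time.time() if now is None else now
    q = _history[caller_id]
    q.append(now)
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()  # drop queries outside the sliding window
    return len(q) > MAX_QUERIES_PER_WINDOW
```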

7.2.2 Disruption

Disruption can protect the controlled information by obstructing the information used in each phase of the MLBSA methodology. Disrupting the accessible data currently involves two methods: adding noise to data sources and degrading the quality/precision of the service's outputs. For more advanced countermeasures, further research needs to better understand the attacker's future directions. By disrupting the accessible data, attackers cannot find valuable accessible sources in the reconnaissance phase, obtain a reliable dataset in the collection phase, or use feature engineering effectively. Therefore, disruption minimizes the success rate of the ML-based stealing attack. The major technique for adding noise is differential privacy, as applied in [48, 37, 88, 138, 50]. As for the latter method, the specific techniques include rounding or coarsening the output values (i.e., predictions/confidences) [37, 12, 18, 38], and regularizing or obfuscating the accessible data sources [10, 49, 88, 77, 34]. Advanced disruption methods against the attacker's future attacks should also be considered. Since attackers may search for and collect a class of information from several devices in the future, advanced disruption should add different noise to similar information on different devices. To counter the advanced feature engineering techniques and ML-based analysis that attackers might apply, more sophisticated disruption methods can be used to defend the controlled information, such as an adversarial training algorithm acting as a strong regularizer [77].
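The two output-disruption techniques named above can be sketched as follows: rounding coarsens the confidence vector, and Laplace noise follows the spirit of differential privacy [138]. The parameters are illustrative and not calibrated to any formal privacy guarantee.

```python
import numpy as np

def disrupt_output(confidences, decimals=1, epsilon=None):
    """Coarsen and optionally noise a model's confidence vector."""
    p = np.asarray(confidences, dtype=float)
    if epsilon is not None:
        # Laplace mechanism: scale = sensitivity / epsilon (per-coordinate
        # sensitivity of a probability vector is assumed to be at most 1).
        p = p + np.random.laplace(scale=1.0 / epsilon, size=p.shape)
    p = np.round(p, decimals)   # coarse-grained outputs
    p = np.clip(p, 0.0, None)
    return p / p.sum() if p.sum() > 0 else p  # renormalize

print(disrupt_output([0.872, 0.094, 0.034], decimals=1))
```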

7.2.3 Isolation

Isolation can rid the system of the information stealing threat by blocking the attacker at the reconnaissance phase. No matter how attackers improve their strategies and techniques, isolation can protect the controlled information by restricting access to the data. Specifically, it is effective to control the accessible data by restricting access or managing dynamic permissions [10, 36, 38, 35]. Since attackers may improve their stealing attacks, defenders can apply ML techniques to automatically control all accesses related to the targeted controlled information. However, this protection should be applied cautiously with the service's utility in mind. On the one hand, specialists can remove some information channels which may reveal valuable information to the adversary [37, 30]. On the other hand, if attackers find new exploitable accessible sources in the future, it is challenging to isolate all the relevant data while ensuring the service's utility. In short, isolation protects the information by restricting access to the data.
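As a toy sketch of isolation via access control, the snippet below gates a data source behind an allow-list and denies unknown callers; a real deployment would rely on the platform's own permission framework (e.g., Android runtime permissions) rather than this hypothetical decorator.

```python
ALLOWED = {"trusted_app"}  # illustrative allow-list, not a real policy store

def isolated(source_fn):
    """Gate a data source so only allow-listed callers can read it."""
    def wrapper(caller_id, *args, **kwargs):
        if caller_id not in ALLOWED:
            raise PermissionError(f"{caller_id} denied access to data source")
        return source_fn(*args, **kwargs)
    return wrapper

@isolated
def read_sensor_data():
    return [0.1, 0.2, 0.3]  # placeholder sensor readings

print(read_sensor_data("trusted_app"))
```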

7.3 Research Problem

Recently, machine learning techniques have been widely applied in various areas. Additionally, MLaaS platforms such as Google ML [55] and Amazon ML [53] have emerged to aid users with limited computing power and/or limited ML expertise in using such techniques. The ML models they build are applied in cyber attack prediction [25], insider threat detection [209], network traffic classification [237, 238, 239, 240], spam detection [241], software vulnerability detection [242], and so on. Therefore, it is essential to maintain the confidentiality of the ML model and its training set.

However, as we surveyed in Chapter 2, plenty of ML-based stealing attacks can reconstruct the ML model and/or its training samples, and such attacks are difficult to detect. Our research problem is: for each ML-based service, we aim to test its confidentiality with the newest and strongest ML-based stealing attack following the MLBSA methodology, and to develop a protection system that defeats the attack and maintains the service's confidentiality. To launch a new advanced ML-based stealing attack, we follow the MLBSA attack methodology demonstrated in Section 2.1 and overcome the attack challenges listed in Section 7.1. The cyclical processes in the MLBSA methodology include reconnaissance, data collection, feature engineering, attacking the objective, and evaluation. The challenges we intend to overcome for new attacks are: 1) finding new effective information beyond the confidence information; 2) collecting a balanced and high-quality dataset with various formats, such as outputs with and without probabilities from confidential sources; and 3) attacking the model and/or its training set without knowing its ML algorithm. Hence, a new advanced ML-based stealing attack with high performance can be developed. The protection system can be designed with the help of the defense suggestions illustrated in Section 7.2. For instance, we can isolate user access to some valuable sources while disrupting accessible information that has little effect on the service's utility. Additionally, a detector can be developed to reveal the stealing attack before the attacker succeeds. Finally, the protection system can be evaluated against advanced ML-based stealing attacks. We list a few research questions below; the first two are addressed in this thesis and the last one is left for future research.

1. Discuss the protection method against user-level information leakage via membership inference attacks targeting a deep learning model under black-box access. Chapter 3, Chapter 4, and Chapter 5 limit the resources available from the black-box access. The user-level membership inference is conducted with as little as no label knowledge against the ASR system.

2. The impact of stealing the ML model's information on a security domain, including adversarial machine learning, can be investigated. If the ML model's information can be inferred by the attackers, they can launch attacks against this ML model similar to attacks under white-box access. Chapter 6 explored an efficient advanced adversarial machine learning attack under white-box access with limited computational resources. Future work can conduct the adversarial attack under black-box access with the help of the stealing attack.

3. Protect the ML model from the stealing attacks proposed previously by rounding the confidence values and applying differential privacy to the model parameters.

Chapter 8

Conclusion

In this thesis, the ML-based stealing attacks against controlled information and the defense mechanisms published in the past five years are reviewed. The generalized MLBSA methodology, compatible with the published work, is outlined. Specifically, the MLBSA methodology uncovers how adversaries steal controlled information in five phases, i.e., reconnaissance, data collection, feature engineering, attacking the objective, and evaluation. Based on the different types of controlled information, the literature was reviewed in three categories: controlled user activities information, controlled ML model-related information, and controlled authentication information. The attacker is assumed to use the system without any administrative privilege. This assumption implies that user activities information was stolen by leveraging kernel data and sensor data, both of which are beyond the protection of the application. The attack against controlled ML model-related information is demonstrated by stealing the model description and/or stealing the training data. Similarly, keystroke data, secret keys, and password data are examples of stealing controlled authentication information.

Three related technical works on membership inference against an ASR model are investigated. Specifically, these works propose three user-level membership inference methods to build an audio auditor against an ASR model. The auditors check whether any user unwillingly contributed audio to train the target ASR model. The first auditor is conducted under black-box access to the ASR model whose output contains the confidence score. The second auditor is conducted under black-box access where the target ASR model outputs only the label, i.e., the translated text. The third auditor is conducted under no-label black-box access; specifically, the target ASR model displays neither the confidence score nor the translated text explicitly. Instead, the ASR model passes the translated text to the system, which reacts directly based on the translated text's content. Another technical work can be considered follow-up research, investigating audio adversarial example generation under white-box access. By stealing the ML information, we know the details of the target ASR model. We proposed the Fast Adversarial Audio Generation (FAAG) method under white-box access to an ASR model. Instead of adding perturbations to the whole frame of the benign audio, FAAG adds noise to the beginning part of the benign audio, which speeds up the generation process significantly.

Additionally, future directions matching the various limitations of ML-based stealing attacks are suggested. Compared with explicit breaking/destroying attacks, the controlled information leaked by such stealing attacks is much more difficult to detect, so the estimated loss should be extended accordingly. This thesis, therefore, can help researchers familiarize themselves with these stealing attacks, their future trends, and the potential defense methods. Chapters 3, 4 and 5 show that it is possible to infer the model's training set information even when limiting the resources available from black-box access; specifically, the user-level membership inference is conducted with only label knowledge, or even none, against the ASR system. If the ML model's information can be inferred by the attackers, an attack against this ML model under black-box access becomes similar to an attack under white-box access, and it is much more convenient and effective for the attacker to launch any attack under the latter accessibility than the former. Chapter 6 explored an efficient advanced adversarial machine learning attack under white-box access with limited computational resources.

Bibliography

[1] Sultan Alneyadi, Elankayer Sithirasenan, and Vallipuram Muthukkumarasamy. A survey on data leakage prevention systems. Journal of Network and Computer Applications, 62(Feb):137–152, 2016.

[2] Mohammad Ahmadian and Dan Cristian Marinescu. Information leakage in cloud data warehouses. IEEE Transactions on Sustainable Computing, pages 1–12, 2019.

[3] Long Cheng, Fang Liu, and Danfeng Yao. Enterprise data breach: causes, challenges, prevention, and future directions. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(5):e1211, 2017.

[4] InfoWatch Analytics Center. Global data leakage report, 2017, 2018.

[5] Ponemon from IBM. 2018 cost of a data breach study: Global overview, 2018.

[6] Sam Smith from Juniper Research. Cybercrime will cost businesses over $2 trillion by 2019, 2015.

[7] Carlos Flavián and Miguel Guinalíu. Consumer trust, perceived security and privacy policy: three basic elements of loyalty to a web site. Industrial Management & Data Systems, 106(5):601–620, 2006.

[8] Richard Kissel. Glossary of key information security terms. National Institute of Standards and Technology (NIST) - Computer Security Resource Center, Gaithersburg, MD, US, 2013.

[9] Anthony Califano, Ersin Dincelli, and Sanjay Goel. Using features of cloud computing to defend smart grid against ddos attacks. In Proceedings of the 10th Annual Symposium on Information Assurance (Asia 15), pages 44–50, Albany, New York, 2015. NYS.

[10] Wenrui Diao, Xiangyu Liu, Zhou Li, and Kehuan Zhang. No pardon for the interruption: New inference attacks on android through interrupt timing analysis. In Proceedings of the 2016 IEEE Symposium on Security and Privacy (SP), pages 414–432, San Jose, CA, USA, 2016. IEEE.

[11] Wale Ogunwale. Lockdown am.getrunningappprocesses api with permission.real_get_tasks, 2016.

[12] Florian Tramèr, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction apis. In Proceedings of the 25th USENIX Security Symposium (USENIX Security 16), pages 601–618, Washington, D.C., USA, 2016. USENIX Association.

[13] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 387–402, Prague, Czech Republic, 2013. Springer.

[14] Ling Huang, Anthony D Joseph, Blaine Nelson, Benjamin IP Rubinstein, and JD Tygar. Adversarial machine learning. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, pages 43–58, Chicago, Illinois, USA, 2011. ACM.

[15] Daniel Lowd and Christopher Meek. Adversarial learning. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 641–647, Chicago, Illinois, USA, 2005. ACM.

[16] Nedim Srndic and Pavel Laskov. Practical evasion of a learning-based classifier: A case study. In Proceedings of the 2014 IEEE Symposium on Security and Privacy (SP), pages 197–211, San Jose, CA, USA, 2014. IEEE.

[17] Mauro Ribeiro, Katarina Grolinger, and Miriam AM Capretz. Mlaas: Machine learning as a service. In Proceedings of the 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pages 896–902, Miami, FL, USA, 2015. IEEE.

[18] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 1322–1333, Denver, Colorado, USA, 2015. ACM.

[19] Mohamed Amine Ferrag, Leandros Maglaras, and Ahmed Ahmim. Privacy-preserving schemes for ad hoc social networks: A survey. IEEE Communications Surveys & Tutorials, 19(4):3015–3045, 2017.

[20] R Barona and EA Mary Anita. A survey on data breach challenges in cloud computing security: Issues and threats. In Proceedings of the 2017 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pages 1–8, Kollam, India, 2017. IEEE.

[21] Mordechai Guri and Yuval Elovici. Bridgeware: The air-gap malware. Communications of the ACM, 61(4):74–82, 2018.

[22] Yong Zeng and Rui Zhang. Active eavesdropping via spoofing relay attack. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2159–2163, Shanghai, China, 2016. IEEE.

[23] John S Garofolo, Lori F Lamel, William M Fisher, Jonathan G Fiscus, and David S Pallett. Darpa timit acoustic-phonetic continous speech corpus cd-rom. nist speech disc 1-1.1. NASA STI/Recon Technical Report, 93, 1993.

[24] Muhammad Salman Khan, Sana Siddiqui, and Ken Ferens. A cognitive and concurrent cyber kill chain model. Springer, Cham, 2018.

[25] Nan Sun, Jun Zhang, Paul Rimba, Shang Gao, Yang Xiang, and Leo Yu Zhang. Data-driven cybersecurity incident prediction: A survey. IEEE Communications Surveys & Tutorials, 21(2):1744–1772, 2019.

[26] Tarun Yadav and Arvind Mallari Rao. Technical aspects of cyber kill chain. In Proceedings of the International Symposium on Security in Computing and Communication, pages 438–452, New York, NY, 2015. Springer.

[27] Dennis Kiwia, Ali Dehghantanha, Kim-Kwang Raymond Choo, and Jim Slaughter. A cyber kill chain based taxonomy of banking trojans for evolutionary computational intelligence. Journal of computational science, 27:394–409, 2018.

[28] CW Dukes. Committee on national security systems (cnss) glossary. Technical report, Committee on National Security Systems Instructions (CNSSI), 2015.

[29] B. Wang and N. Z. Gong. Stealing hyperparameters in machine learning. In Proceedings of the 2018 IEEE Symposium on Security and Privacy (SP), pages 36–52, San Francisco, CA, USA, 2018. IEEE.

[30] Ben Gras, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. Translation leak-aside buffer: Defeating cache side-channel protections with {TLB} attacks. In Proceedings of the 27th USENIX Security Symposium (USENIX Security 18), pages 955–972, Baltimore, MD, USA, 2018. USENIX Association.

[31] Ding Wang, Zijian Zhang, Ping Wang, Jeff Yan, and Xinyi Huang. Targeted online password guessing: An underestimated threat. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 1242–1254, Vienna, Austria, 2016. ACM.

[32] William Melicher, Blase Ur, Sean M Segreti, Saranga Komanduri, Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor. Fast, lean, and accurate: Modeling password guessability using neural networks. In Proceedings of the 25th USENIX Security Symposium (USENIX Security 16), pages 175–191, Washington, D.C., USA, 2016. USENIX Association.

[33] Rafael Veras, Christopher Collins, and Julie Thorpe. On semantic patterns of passwords and their security impact. In Proceedings of the 21st Annual Network and Distributed System Security Symposium (NDSS), pages 1–16, San Diego, CA, USA, 2014. IEEE.

[34] Jingchao Sun, Xiaocong Jin, Yimin Chen, Jinxue Zhang, Yanchao Zhang, and Rui Zhang. Visible: Video-assisted keystroke inference from tablet backside motion. In Proceedings of the 23rd Annual Network and Distributed System Security Symposium (NDSS), pages 1–15, San Diego, CA, USA, 2016. IEEE.

[35] Xiangyu Liu, Zhe Zhou, Wenrui Diao, Zhou Li, and Kehuan Zhang. When good becomes evil: Keystroke inference with smartwatch. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 1273–1285, Denver, Colorado, USA, 2015. ACM.

[36] Raphael Spreitzer, Felix Kirchengast, Daniel Gruss, and Stefan Mangard. Procharvester: Fully automated analysis of procfs side-channel leaks on android. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security (AsiaCCS), pages 749–763, Incheon, Republic of Korea, 2018. ACM.

[37] Xiaokuan Zhang, Xueqiang Wang, Xiaolong Bai, Yinqian Zhang, and XiaoFeng Wang. Os-level side channels without procfs: Exploring cross-app information leakage on ios. In Proceedings of the 25th Annual Network and Distributed System Security Symposium (NDSS), pages 1–15, San Diego, CA, USA, 2018. IEEE.

[38] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18, San Jose, CA, USA, 2017. IEEE.

[39] Briland Hitaj, Giuseppe Ateniese, and Fernando Perez-Cruz. Deep models under the gan: information leakage from collaborative deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 603–618, Dallas, Texas, USA, 2017. ACM.

[40] Ziqiao Zhou, Michael K Reiter, and Yinqian Zhang. A software approach to defeating side channels in last-level caches. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 871–882, Vienna, Austria, 2016. ACM.

[41] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (AsiaCCS), pages 506–519, Abu Dhabi, United Arab Emirates, 2017. ACM.

[42] Seong Joon Oh, Max Augustin, Bernt Schiele, and Mario Fritz. Towards reverse-engineering black-box neural networks. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), pages 1–20, Vancouver, BC, Canada, 2018. OpenReview.net.

[43] Ahmed Salem, Yang Zhang, Mathias Humbert, Pascal Berrang, Mario Fritz, and Michael Backes. Ml-leaks: Model and data independent membership inference attacks and defenses on machine learning models. In Proceedings of the 26th Annual Network and Distributed System Security Symposium (NDSS), pages 1–15, San Diego, California, USA, 2019. IEEE.

[44] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. Exploiting unin- tended feature leakage in collaborative learning. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP), pages 1–16, San Fransisco, CA, US, 2019. IEEE.

[45] Pranav Patel, Eamonn Keogh, Jessica Lin, and Stefano Lonardi. Mining motifs in massive time series databases. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM), pages 370–377, Maebashi City, Japan, 2002. IEEE.

[46] Jessica Lin and Yuan Li. Finding structural similarity in time series data using bag-of-patterns representation. In Proceedings of the International Conference on Scientific and Statistical Database Management, pages 461–477, New Orleans, LA, USA, 2009. Springer.

[47] Avesta Hojjati, Anku Adhikari, Katarina Struckmann, Edward Chou, Thi Ngoc Tho Nguyen, Kushagra Madan, Marianne S Winslett, Carl A Gunter, and William P King. Leave your phone at the door: Side channels that reveal factory floor secrets. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 883–894, Vienna, Austria, 2016. ACM.

[48] Qiuyu Xiao, Michael K Reiter, and Yinqian Zhang. Mitigating storage side channels using statistical privacy mechanisms. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 1582–1594, Denver, Colorado, USA, 2015. ACM.

[49] Amit Kumar Sikder, Hidayet Aksu, and A Selcuk Uluagac. 6thsense: A context-aware sensor-based attack detector for smart devices. In Proceedings of the 26th USENIX Security Symposium (USENIX Security 17), pages 397–414, Vancouver, BC, Canada, 2017. USENIX Association.

[50] Ninghui Li, Wahbeh Qardaji, Dong Su, Yi Wu, and Weining Yang. Membership privacy: a unifying framework for privacy definitions. In Proceedings of the 2013 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 889–900, Berlin, Germany, 2013. ACM.

[51] Karan Ganju, Qi Wang, Wei Yang, Carl A Gunter, and Nikita Borisov. Property inference attacks on fully connected neural networks using permutation invariant representations. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 619–633, Toronto, ON, Canada, 2018. ACM.

[52] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 308–318, Vienna, Austria, 2016. ACM.

[53] AMAZON ML SERVICES. Amazon aws machine learning, 2019.

[54] Microsoft. Azure machine learning studio, 2019.

[55] Google. Predictive analytics - cloud machine learning engine, 2019.

[56] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), pages 1–11, San Diego, CA, USA, 2015. OpenReview.net.

[57] UCIdataset. Uci machine learning repository, 2018.

[58] Tom W Smith, Peter Marsden, Michael Hout, and Jibum Kim. The general social surveys. Technical report, National Opinion Research Center at the University of Chicago, 2012.

[59] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830, 2011.

[60] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The mnist database of handwritten digits, 2011.

[61] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural networks, 32:323–332, 2012.

[62] BigML. Machine learning made beautifully simple for everyone, 2019.

[63] Matthew Fredrikson, Eric Lantz, Somesh Jha, Simon Lin, David Page, and Thomas Ristenpart. Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing. In Proceedings of the 23rd USENIX Security Symposium (USENIX Security 14), pages 17–32, San Diego, CA, USA, 2014. USENIX Association.

[64] Ferdinando S Samaria and Andy C Harter. Parameterisation of a stochastic model for human face identification. In Proceedings of the Second IEEE Workshop on Applications of Computer Vision, pages 138–142, Sarasota, FL, USA, 1994. IEEE.

[65] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[66] Kaggle Inc. Acquire valued shoppers challenge, 2014.

[67] Dingqi Yang, Daqing Zhang, and Bingqing Qu. Participatory cultural mapping based on collective behavior data in location-based social networks. ACM Transactions on Intelligent Systems and Technology (TIST), 7(3):30:1–30:23, 2016.

[68] Texas Health and Human Service. Hospital discharge data public use data file, 2018.

[69] Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.

[70] Kaggle Inc. 20 newsgroups, 2017.

[71] Erik Learned-Miller, Gary B Huang, Aruni RoyChowdhury, Haoxiang Li, and Gang Hua. Labeled faces in the wild: A survey. In Advances in Face Detection and Facial Image Analysis, pages 189–248. Springer, New York, NY, 2016.

[72] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), pages 3730–3738, New York, NY, December 2015. IEEE.

[73] Hong-Wei Ng and Stefan Winkler. A data-driven approach to cleaning large face datasets. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), pages 343–347, New York, NY, 2014. IEEE.

[74] Ning Zhang, Manohar Paluri, Yaniv Taigman, Rob Fergus, and Lubomir Bourdev. Beyond frontal faces: Improving person recognition using multiple cues. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4804–4813, New York, NY, 2015. IEEE.

[75] Yelp. Yelp open dataset, 2014.

[76] Ben Verhoeven and Walter Daelemans. Clips stylometry investigation (csi) corpus: a dutch corpus for the detection of age, gender, personality, sentiment and deception in text. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), pages 3081–3085, Reykjavik, Iceland, 2014. European Languages Resources Association (ELRA).

[77] Milad Nasr, Reza Shokri, and Amir Houmansadr. Machine learning with membership privacy using adversarial regularization. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 634–646, Toronto, Canada, 2018. ACM.

[78] Peter Muennig, Gretchen Johnson, Jibum Kim, Tom W Smith, and Zohn Rosen. The general social survey-national death index: an innovative new dataset for the social sciences. BMC research notes, 4(1):1–6, 2011.

[79] Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 1310–1321, Denver, Colorado, USA, 2015. ACM.

[80] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[81] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.

[82] Prateek Jain, Vivek Kulkarni, Abhradeep Thakurta, and Oliver Williams. To drop or not to drop: Robustness, consistency and differential privacy properties of dropout. arXiv preprint arXiv:1503.02031, 2015.

[83] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015.

[84] Giuseppe Ateniese, Luigi V. Mancini, Angelo Spognardi, Antonio Villani, Domenico Vitali, and Giovanni Felici. Hacking smart machines with smarter ones: How to extract meaningful data from machine learning classifiers. Int. J. Secur. Netw., 10(3):137–150, 2015.

[85] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In Proceedings of the Advances in Neural Information Processing Systems, pages 3391–3401, Long Beach, CA, USA, 2017. Curran Associates, Inc.

[86] H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1273–1282, Fort Lauderdale, FL, USA, 2017. PMLR.

[87] Nicolas Papernot, Martín Abadi, Ulfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), pages 1–16, Toulon, France, 2017. OpenReview.net.

[88] Mathias Lecuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang, and Siddhartha Sen. Pyramid: Enhancing selectivity in big data protection with count featurization. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP), pages 78–95, San Jose, CA, USA, 2017. IEEE.

[89] Yang Tang, Phillip Ames, Sravan Bhamidipati, Ashish Bijlani, Roxana Geambasu, and Nikhil Sarda. Cleanos: Limiting mobile data exposure with idle eviction. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 77–91, Hollywood, CA, USA, 2012. USENIX.

[90] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Theory of Cryptography Conference, pages 265–284, New York, NY, USA, 2006. Springer.

[91] Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael P Wellman. Sok: Security and privacy in machine learning. In Proceedings of the 2018 IEEE European Symposium on Security and Privacy (EuroS&P), pages 399–414, London, UK, 2018. IEEE.

[92] Nicholas Carlini, Pratyush Mishra, Tavish Vaidya, Yuankai Zhang, Micah Sherr, Clay Shields, David Wagner, and Wenchao Zhou. Hidden voice commands. In Proceedings of the 25th USENIX Security Symposium (USENIX Security’16), pages 513–530, 2016.

[93] Paul Lamere, Philip Kwok, William Walker, Evandro Gouvea, Rita Singh, Bhiksha Raj, and Peter Wolf. Design of the cmu sphinx-4 decoder. In Eighth European Conference on Speech Communication and Technology, 2003.

[94] Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. In Proceedings of the 2018 IEEE Security and Privacy Workshops (SPW’18), pages 1–7. IEEE, 2018.

[95] Guoming Zhang, Chen Yan, Xiaoyu Ji, Tianchen Zhang, Taimin Zhang, and Wenyuan Xu. Dolphinattack: Inaudible voice commands. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS’17), pages 103–117, New York, NY, USA, 2017.

[96] Yao Qin, Nicholas Carlini, Garrison Cottrell, Ian Goodfellow, and Colin Raffel. Imperceptible, robust, and targeted adversarial examples for automatic speech recognition. In Proceedings of the International Conference on Machine Learning (ICML’19), pages 5231–5240, 2019.

[97] Hiromu Yakura and Jun Sakuma. Robust audio adversarial example for a physical attack. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI’19), pages 5334–5341. ijcai.org, 2019.

[98] Lea Schönherr, Thorsten Eisenhofer, Steffen Zeiler, Thorsten Holz, and Dorothea Kolossa. Imperio: Robust over-the-air adversarial examples for automatic speech recognition systems. In Annual Computer Security Applications Conference, pages 843–855, 2020.

[99] Maximilian Christ, Andreas W Kempa-Liehr, and Michael Feindt. Distributed and parallel time series feature extraction for industrial big data applications, 2016.

[100] G David Forney. The viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, 1973.

[101] Nicholas D Lane, Emiliano Miluzzo, Hong Lu, Daniel Peebles, Tanzeem Choudhury, and Andrew T Campbell. A survey of mobile phone sensing. IEEE Communications magazine, 48(9):140–150, 2010.

[102] Nicholas D Lane, Ye Xu, Hong Lu, Shaohan Hu, Tanzeem Choudhury, Andrew T Campbell, and Feng Zhao. Enabling large-scale human activity inference on smartphones using community similarity networks (csn). In Proceedings of the 13th International Conference on Ubiquitous Computing, pages 355–364, Beijing, China, 2011. ACM.

[103] Bong-Won Park and Kun Chang Lee. The effect of users’ characteristics and experiential factors on the compulsive usage of the smartphone. In Proceedings of the International Conference on Ubiquitous Computing and Multimedia Applications, pages 438–446, Daejeon, Korea, 2011. Springer.

[104] Yan Yu, Jianhua Wang, and Guohui Zhou. The exploration in the education of professionals in applied internet of things engineering. In Proceedings of the 4th International Conference on Distance Learning and Education (ICDLE), pages 74–77, San Juan, PR, USA, 2010. IEEE.

[105] Elsa Macias, Alvaro Suarez, and Jaime Lloret. Mobile sensing systems. Sensors, 13(12):17292–17321, 2013.

[106] Nan Zhang, Kan Yuan, Muhammad Naveed, Xiaoyong Zhou, and XiaoFeng Wang. Leave me alone: App-level protection against runtime information gathering on android. In Proceedings of the 2015 IEEE Symposium on Security and Privacy (SP), pages 915–930, San Jose, CA, USA, 2015. IEEE.

[107] Orcan Alpar. Frequency spectrograms for biometric keystroke authentication using neural network based classifier. Knowledge-Based Systems, 116(Jan):163–171, 2017.

[108] Sowndarya Krishnamoorthy, Luis Rueda, Sherif Saad, and Haytham Elmiligi. Identification of user behavioral biometrics for authentication using keystroke dynamics and machine learning. In Proceedings of the 2018 2nd International Conference on Biometric Engineering and Applications, pages 50–57, Amsterdam, The Netherlands, 2018. ACM.

[109] Pei-Yuan Wu, Chi-Chen Fang, Jien Morris Chang, and Sun-Yuan Kung. Cost-effective kernel ridge regression implementation for keystroke-based active authentication system. IEEE transactions on cybernetics, 47(11):3916–3927, 2017.

[110] Adam Goodkind, David Guy Brizan, and Andrew Rosenberg. Utilizing overt and latent linguistic structure to improve keystroke-based authentication. Image and Vision Computing, 58(Feb):230–238, 2017.

[111] Liang Cai and Hao Chen. Touchlogger: Inferring keystrokes on touch screen from smartphone motion. In Proceedings of the 6th USENIX Workshop on Hot Topics in Security (HotSec’11), pages 9–15, San Francisco, CA, USA, 2011. USENIX Association.

[112] Zhi Xu, Kun Bai, and Sencun Zhu. Taplogger: Inferring user inputs on smartphone touchscreens using on-board motion sensors. In Proceedings of the 5th ACM Conference on Security and Privacy in Wireless and Mobile Networks, pages 113–124, Tucson, AZ, USA, 2012. ACM.

[113] Emiliano Miluzzo, Alexander Varshavsky, Suhrid Balakrishnan, and Romit Roy Choudhury. Tapprints: your finger taps have fingerprints. In Proceedings of the 10th International Conference on Mobile Systems, Applications, and Services, pages 323–336, Ambleside, UK, 2012. ACM.

[114] Yigael Berger, Avishai Wool, and Arie Yeredor. Dictionary attacks using keyboard acoustic emanations. In Proceedings of the 13th ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 245–254, Alexandria, Virginia, USA, 2006. ACM.

[115] Michael Backes, Markus Dürmuth, and Dominique Unruh. Compromising reflections-or-how to read lcd monitors around the corner. In Proceedings of the 2008 IEEE Symposium on Security and Privacy (SP), pages 158–169, Oakland, CA, USA, 2008. IEEE.

[116] Rongmao Chen, Yi Mu, Guomin Yang, Fuchun Guo, and Xiaofen Wang. Dual-server public-key encryption with keyword search for secure cloud storage. IEEE Transactions on Information Forensics and Security, 11(4):789–798, 2016.

[117] Venkata Koppula, Omkant Pandey, Yannis Rouselakis, and Brent Waters. Deterministic public-key encryption under continual leakage. In Proceedings of the International Conference on Applied Cryptography and Network Security, pages 304–323, Guildford, UK, 2016. Springer.

[118] Zheng Yan and Mingjun Wang. Protect pervasive social networking based on two-dimensional trust levels. IEEE Systems Journal, 11(1):207–218, 2017.

[119] L Yu Paul, Gunjan Verma, and Brian M Sadler. Wireless physical layer authentication via fingerprint embedding. IEEE Communications Magazine, 53(6):48–53, 2015.

[120] Debiao He, Sherali Zeadally, Neeraj Kumar, and Jong-Hyouk Lee. Anonymous authentication for wireless body area networks with provable security. IEEE Systems Journal, 11(4):2590–2601, 2017.

[121] Qi Jiang, Sherali Zeadally, Jianfeng Ma, and Debiao He. Lightweight three-factor authentication and key agreement protocol for internet-integrated wireless sensor networks. IEEE Access, 5(Mar):3376–3392, 2017.

[122] Fangfei Liu, Qian Ge, Yuval Yarom, Frank Mckeen, Carlos Rozas, Gernot Heiser, and Ruby B Lee. Catalyst: Defeating last-level cache side channel attacks in cloud computing. In Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 406–418, Barcelona, Spain, 2016. IEEE.

[123] Daniel Gruss, Julian Lettner, Felix Schuster, Olya Ohrimenko, Istvan Haller, and Manuel Costa. Strong and efficient cache side-channel protection using hardware transactional memory. In Proceedings of the 26th USENIX Security Symposium (USENIX Security 17), pages 217–233, Vancouver, BC, Canada, 2017. USENIX Association.

[124] Matt Weir, Sudhir Aggarwal, Breno De Medeiros, and Bill Glodek. Password cracking using probabilistic context-free grammars. In Proceedings of the 2009 IEEE Symposium on Security and Privacy (SP), pages 391–405, Berkeley, CA, USA, 2009. IEEE.

[125] Jerry Ma, Weining Yang, Min Luo, and Ninghui Li. A study of probabilistic password models. In Proceedings of the 2014 IEEE Symposium on Security and Privacy (SP), pages 689–704, San Jose, CA, USA, 2014. IEEE.

[126] Blase Ur, Sean M Segreti, Lujo Bauer, Nicolas Christin, Lorrie Faith Cranor, Saranga Komanduri, Darya Kurilova, Michelle L Mazurek, William Melicher, and Richard Shay. Measuring real-world accuracies and biases in modeling password guessability. In Proceedings of the 24th USENIX Security Symposium (USENIX Security 15), pages 463–481, Washington, D.C., USA, 2015. USENIX Association.

[127] Patrick Gage Kelley, Saranga Komanduri, Michelle L Mazurek, Richard Shay, Timothy Vidas, Lujo Bauer, Nicolas Christin, Lorrie Faith Cranor, and Julio Lopez. Guess again (and again and again): Measuring password strength by simulating password-cracking algorithms. In Proceedings of the 2012 IEEE Symposium on Security and Privacy (SP), pages 523–537, San Francisco, CA, USA, 2012. IEEE.

[128] Richard Shay, Saranga Komanduri, Adam L Durity, Phillip Seyoung Huh, Michelle L Mazurek, Sean M Segreti, Blase Ur, Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor. Can long passwords be secure and usable? In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 2927–2936, Toronto, ON, Canada, 2014. ACM.

[129] Michelle L Mazurek, Saranga Komanduri, Timothy Vidas, Lujo Bauer, Nicolas Christin, Lorrie Faith Cranor, Patrick Gage Kelley, Richard Shay, and Blase Ur. Measuring password guessability for an entire university. In Proceedings of the 2013 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 173–186, Berlin, Germany, 2013. ACM.

[130] Thomas Brewster. 13 million passwords appear to have leaked from this free web host, 2015.

[131] Saranga Komanduri. Modeling the adversary to evaluate password strength with limited samples. PhD thesis, School of Computer Science, Carnegie Mellon University, 2016.

[132] Yue Li, Haining Wang, and Kun Sun. A study of personal information in human-chosen passwords and its security implications. In Proceedings of the 35th Annual IEEE International Conference on Computer Communications (INFOCOM), pages 1–9, San Francisco, CA, USA, 2016. IEEE.

[133] Joseph Bonneau. The science of guessing: Analyzing an anonymized corpus of 70 million passwords. In Proceedings of the 2012 IEEE Symposium on Security and Privacy (SP), pages 538–552, San Francisco, CA, USA, 2012. IEEE.

[134] Anupam Das, Joseph Bonneau, Matthew Caesar, Nikita Borisov, and XiaoFeng Wang. The tangled web of password reuse. In Proceedings of the 21st Annual Network and Distributed System Security Symposium (NDSS), pages 1–15, San Diego, CA, USA, 2014. IEEE.

[135] Himanshu Raj, Ripal Nathuji, Abhishek Singh, and Paul England. Resource management for isolation enhanced cloud services. In Proceedings of the 2009 ACM workshop on Cloud Computing Security, pages 77–84, Chicago, Illinois, USA, 2009. ACM.

[136] Christopher D Manning, Christopher D Manning, and Hinrich Schütze. Foundations of statistical natural language processing. MIT press, London, UK, 1999.

[137] Indiana University Nan from System Security Lab. App guardian: An app level protection against rig attacks, 2015.

[138] Cynthia Dwork. Differential privacy: A survey of results. In Proceedings of the International Conference on Theory and Applications of Models of Computation, pages 1–19, Xi’an, China, 2008. Springer.

[139] Geetha Jagannathan, Krishnan Pillaipakkamnatt, and Rebecca N Wright. A practical differentially private random decision tree classifier. In Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW’09), pages 114–121, Miami, Florida, USA, 2009. IEEE.

[140] Staal A Vinterbo. Differentially private projected histograms: Construction and use for prediction. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 19–34, Bristol, UK, 2012. Springer.

[141] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In Proceedings of the 2016 IEEE Symposium on Security and Privacy (SP), pages 582–597, New York, NY, 2016. IEEE.

[142] European Parliament and Council of the European Union. Regulation (eu) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data and repealing Directive 95/46/EC (general data protection regulation). Official Journal of the European Union, 119:1–88, 2016.

[143] E. McReynolds, S. Hubbard, T. Lau, A. Saraf, M. Cakmak, and F. Roesner. Toys that listen: A study of parents, children, and Internet-connected toys. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 5197–5207. ACM, 2017.

[144] S. Lokesh, P. K. Malarvizhi, M. D. Ramya, P. Parthasarathy, and C. Gokulnath. An automatic tamil speech recognition system by using bidirectional recurrent neural network with self-organizing map. Neural Computing and Applications, pages 1–11, 2018.

[145] M. Mehrabani, S. Bangalore, and B. Stern. Personalized speech recognition for Internet of Things. In Proceedings of the 2015 IEEE 2nd World Forum on Internet of Things (WF-IoT), pages 369–374. IEEE, 2015.

[146] S. Nick. Amazon may give app developers access to Alexa audio recordings, 2017.

[147] Minhui Xue, Gabriel Magno, Evandro Cunha, Virgilio Almeida, and Keith W Ross. The right to be forgotten in the media: A data-driven study. Proceedings on Privacy Enhancing Technologies, 2016(4):389–402, 2016.

[148] BBC. Hmrc forced to delete five million voice files, 2019.

[149] W. Kyle. How Amazon, Apple, Google, Microsoft, and Samsung treat your voice data, 2019.

[150] P. Sarah. 41% of voice assistant users have concerns about trust and privacy, report finds, 2019.

[151] M. Sapna. Hey, Alexa, what can you hear? and what will you do with it?, 2018.

[152] CCTV. Beware of WeChat voice scams: “cloning” users after WeChat voice, 2018.

[153] C. Song and V. Shmatikov. The natural auditor: How to tell if someone used your words to train their model. arXiv preprint arXiv:1811.00513, 0:1–15, 2018.

[154] M. Shokoohi-Yekta, Y. Chen, B. Campana, B. Hu, J. Zakaria, and E. Keogh. Discovery of meaningful rules in time series. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1085–1094. ACM, 2015.

[155] BBC. Plan to secure Internet of Things with new law, 2019.

[156] BBC. Smart device security guidelines “need more teeth”, 2018.

[157] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, Jonathan Le Roux, John R Hershey, and Björn Schuller. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In Proceedings of the International Conference on Latent Variable Analysis and Signal Separation, pages 91–99. Springer, 2015.

[158] L. Schönherr, K. Kohls, S. Zeiler, T. Holz, and D Kolossa. Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding. arXiv preprint arXiv:1808.05665, 0(0):1–18, 2018.

[159] M. Ravanelli, T. Parcollet, and Y. Bengio. The Pytorch-Kaldi speech recognition toolkit. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6465–6469. IEEE, 2019.

[160] F. Seide, G. Li, and D. Yu. Conversational speech transcription using context-dependent deep neural networks. In Proceedings of the 12th Annual Conference of the International Speech Communication Association, 2011.

[161] M. JF. Gales, K. M. Knill, A. Ragni, and S. P. Rath. Speech recognition and keyword spotting for low-resource languages: Babel project research at cued. In Proceedings of the 4th International Workshop on Spoken Language Technologies for Under-Resourced Languages, 2014.

[162] M. Dutta, C. Patgiri, M. Sarma, and K. K. Sarma. Closed-set text-independent speaker identification system using multiple ann classifiers. In Proceedings of the 3rd International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA) 2014, pages 377–385. Springer, 2015.

[163] M. Sodanil, S. Nitsuwat, and C. Haruechaiyasak. Thai word recognition using hybrid mlp-hmm. International Journal of Computer Science and Network Security, 10(3):103–110, 2010.

[164] C. Canevari, L. Badino, L. Fadiga, and G. Metta. Cross-corpus and cross-linguistic evaluation of a speaker-dependent dnn-hmm asr system using ema data. In Proceedings of the Speech Production in Automatic Speech Recognition Conference, 2013.

[165] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.

[166] H. Zhang, L. Xiao, W. Chen, Y. Wang, and Y Jin. Multi-task label embedding for text classification. arXiv preprint arXiv:1710.07210, 2017.

[167] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P Schwarz, et al. The kaldi speech recognition toolkit. Technical report, IEEE Signal Processing Society, 2011.

[168] DEFINITIONS UNDER CCPA. California consumer privacy act (ccpa) website policy, 2020.

[169] A. Hern. Apple contractors ’regularly hear confidential details’ on siri recordings, 2019.

[170] Jamie Hayes, Luca Melis, George Danezis, and Emiliano De Cristofaro. Logan: Membership inference attacks against generative models. Proceedings on Privacy Enhancing Technologies, 2019(1):133–152, 2019.

[171] Congzheng Song and Vitaly Shmatikov. Auditing data provenance in text-generation models. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 196–206, 2019.

[172] Yuxuan Chen, Xuejing Yuan, Jiangshan Zhang, Yue Zhao, Shengzhi Zhang, Kai Chen, and XiaoFeng Wang. Devil’s whisper: A general approach for physical adversarial attacks against commercial black-box speech recognition devices. In Proceedings of the 29th USENIX Security Symposium (USENIX Security 20), 2020.

[173] Tianyu Du, Shouling Ji, Jinfeng Li, Qinchen Gu, Ting Wang, and Raheem Beyah. Sirenattack: Generating adversarial audio for end-to-end acoustic systems. arXiv preprint arXiv:1901.07846, 2019.

[174] Juan M Perero-Codosero, Javier Antón-Martín, Daniel Tapias Merino, Eduardo López Gonzalo, and Luis A Hernández-Gómez. Exploring open-source deep learning ASR for speech-to-text TV program transcription. In Proceedings of the IberSPEECH, pages 262–266, 2018.

[175] Alexander Liu, Hung-yi Lee, and Lin-shan Lee. Adversarial training of end-to-end speech recognition using a criticizing language model. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.

[176] Gaoyang Liu, Chen Wang, Kai Peng, Haojun Huang, Yutong Li, and Wenqing Cheng. Socinf: Membership inference attacks on social media health data with machine learning. IEEE Transactions on Computational Social Systems, 6(5):907–921, 2019.

[177] Liwei Song, Reza Shokri, and Prateek Mittal. Privacy risks of securing machine learning models against adversarial examples. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 241–257, 2019.

[178] Farhad Farokhi and Mohamed Ali Kaafar. Modelling and quantifying membership information leakage in machine learning. arXiv preprint arXiv:2001.10648, 2020.

[179] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[180] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’15), pages 5206–5210. IEEE, 2015.

[181] Anthony Rousseau, Paul Deléglise, and Yannick Estève. TED-LIUM: An automatic speech recognition dedicated corpus. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pages 125–129, 2012.

[182] Jianwei Qian, Haohua Du, Jiahui Hou, Linlin Chen, Taeho Jung, and Xiangyang Li. Speech sanitizer: Speech content desensitization and voice anonymization. IEEE Transactions on Dependable and Secure Computing, 2019.

[183] David Sundermann and Hermann Ney. VTLN-based voice conversion. In Proceedings of the 3rd IEEE International Symposium on Signal Processing and Information Technology (IEEE Cat. No. 03EX795), pages 556–559. IEEE, 2003.

[184] Brij Mohan Lal Srivastava, Aurélien Bellet, Marc Tommasi, and Emmanuel Vincent. Privacy-preserving adversarial representation learning in ASR: Reality or illusion? arXiv preprint arXiv:1911.04913, 2019.

[185] Andreas Nautsch, Abelino Jiménez, Amos Treiber, Jascha Kolberg, Catherine Jasserand, Els Kindt, Héctor Delgado, Massimiliano Todisco, Mohamed Amine Hmani, Aymen Mtibaa, et al. Preserving privacy in speaker and speech characterisation. Computer Speech & Language, 58:441–480, 2019.

[186] Yunhui Long, Vincent Bindschaedler, Lei Wang, Diyue Bu, Xiaofeng Wang, Haixu Tang, Carl A Gunter, and Kai Chen. Understanding membership inferences on well-generalized learning models. arXiv preprint arXiv:1802.04889, 2018.

[187] Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In Proceedings of the 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pages 268–282. IEEE, 2018.

[188] Faysal Hossain Shezan, Hang Hu, Jiamin Wang, Gang Wang, and Yuan Tian. Read between the lines: An empirical measurement of sensitive applications of voice personal assistant systems. In Proceedings of the Web Conference, WWW ’20. ACM, 2020.

[189] Yu-Chih Tung and Kang G Shin. Exploiting sound masking for audio privacy in smartphones. In Proceedings of the 2019 ACM Asia Conference on Computer and Communications Security, pages 257–268, 2019.

[190] Hafiz Malik. Securing voice-driven interfaces against fake (cloned) audio attacks. In Proceedings of the 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pages 512–517. IEEE, 2019.

[191] Francis Tom, Mohit Jain, and Prasenjit Dey. End-to-end audio replay attack detection using deep convolutional networks with attention. In Proceedings of the Interspeech Conference, pages 681–685, 2018.

[192] Nan Zhang, Xianghang Mi, Xuan Feng, XiaoFeng Wang, Yuan Tian, and Feng Qian. Dangerous skills: Understanding and mitigating security risks of voice-controlled third-party functions on virtual personal assistant systems. In Proceedings of the 40th IEEE Symposium on Security and Privacy (S&P’19), pages 1381–1396. IEEE, 2019.

[193] Pedro Saleiro, Benedict Kuester, Loren Hinkson, Jesse London, Abby Stevens, Ari Anisfeld, Kit T Rodolfa, and Rayid Ghani. Aequitas: A bias and fairness audit toolkit. arXiv preprint arXiv:1811.05577, 2018.

[194] Peter Schulam and Suchi Saria. Can you trust this prediction? Auditing pointwise reliability after learning. arXiv preprint arXiv:1901.00403, 2019.

[195] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1885–1894. JMLR.org, 2017.

[196] Philip Adler, Casey Falk, Sorelle A Friedler, Tionney Nix, Gabriel Rybeck, Carlos Scheidegger, Brandon Smith, and Suresh Venkatasubramanian. Auditing black-box models for indirect influence. Knowledge and Information Systems, 54(1):95–122, 2018.

[197] Yuantian Miao, Ben Zi Hao Zhao, Minhui Xue, Chao Chen, Lei Pan, Jun Zhang, Dali Kaafar, and Yang Xiang. The audio auditor: Participant-level membership inference in voice-based IoT. In CCS Workshop on Privacy Preserving Machine Learning, 2019.

[198] S. Wildstrom. Nuance exec on iPhone 4S, Siri, and the future of speech, 2011.

[199] Deepak Kumar, Riccardo Paccagnella, Paul Murley, Eric Hennenfent, Joshua Mason, Adam Bates, and Michael Bailey. Skill squatting attacks on Amazon Alexa. In Proceedings of the 27th USENIX Security Symposium (USENIX Security 18), pages 33–47, 2018.

[200] Nathan Malkin, Joe Deatrick, Allen Tong, Primal Wijesekera, Serge Egelman, and David Wagner. Privacy attitudes of smart speaker users. Proceedings on Privacy Enhancing Technologies, 2019(4):250–271, 2019.

[201] G. Benjamin. Amazon Echo’s privacy issues go way beyond voice recordings, 2020.

[202] Massimiliano Todisco, Xin Wang, Ville Vestman, Md Sahidullah, Héctor Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, and Kong Aik Lee. ASVspoof 2019: Future horizons in spoofed and fake audio detection. arXiv preprint arXiv:1904.05441, 2019.

[203] Christopher A Choquette-Choo, Florian Tramèr, Nicholas Carlini, and Nicolas Papernot. Label-only membership inference attacks. arXiv preprint arXiv:2007.14321, 2020.

[204] Zheng Li and Yang Zhang. Label-leaks: Membership inference attack with label. arXiv preprint arXiv:2007.15528, 2020.

[205] FuzzyWuzzy: Fuzzy string matching in Python. Python package, 2020.

[206] Chuck Martin. Voice assistant usage seen growing to 8.4 billion devices, April 2020.

[207] Hadi Abdullah, Kevin Warren, Vincent Bindschaedler, Nicolas Papernot, and Patrick Traynor. SoK: The faults in our ASRs: An overview of attacks against automatic speech recognition and speaker identification systems. In Proceedings of the 42nd IEEE Symposium on Security and Privacy (S&P’21). IEEE, 2021.

[208] Minghao Wang, Tianqing Zhu, Tao Zhang, Jun Zhang, Shui Yu, and Wanlei Zhou. Security and privacy in 6g networks: New areas and new challenges. Digital Communications and Networks, 2020, DOI: 10.1016/j.dcan.2020.07.003.

[209] Liu Liu, Olivier De Vel, Qing-Long Han, Jun Zhang, and Yang Xiang. Detecting and preventing cyber insider threats: A survey. IEEE Communications Surveys & Tutorials, 20(2):1397–1417, 2018.

[210] Yuantian Miao, Minhui Xue, Chao Chen, Lei Pan, Jun Zhang, Benjamin Zi Hao Zhao, Dali Kaafar, and Yang Xiang. The audio auditor: User-level membership inference in Internet of Things voice services. Proceedings on Privacy Enhancing Technologies, 2021:209–228, 2021.

[211] Guanjun Lin, Sheng Wen, Qing-Long Han, Jun Zhang, and Yang Xiang. Software vulnerability detection using deep neural networks: A survey. Proceedings of the IEEE, 108(10):1825–1848, 2020.

[212] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In Proceedings of the International Conference on Learning Representations (ICLR’14), 2014.

[213] Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NeurIPS’17), Long Beach, CA, USA, 2017.

[214] Rory Coulter, Qing-Long Han, Lei Pan, Jun Zhang, and Yang Xiang. Data-driven cyber security in perspective–intelligent traffic analysis. IEEE Transactions on Cybernetics, 50(7):3081–3093, 2020.

[215] Nan Sun, Jun Zhang, Paul Rimba, Shang Gao, Leo Yu Zhang, and Yang Xiang. Data-driven cybersecurity incident prediction: A survey. IEEE Communications Surveys & Tutorials, 21(2):1744–1772, 2019.

[216] Xiao Chen, Chaoran Li, Derui Wang, Sheng Wen, Jun Zhang, Surya Nepal, Yang Xiang, and Kui Ren. Android HIV: A study of repackaging malware for evading machine-learning detection. IEEE Transactions on Information Forensics and Security, 15:987–1001, 2020.

[217] Junyang Qiu, Jun Zhang, Lei Pan, Wei Luo, Surya Nepal, and Yang Xiang. A survey of Android malware detection with deep neural models. ACM Computing Surveys, 53(6), article no. 126, 2020.

[218] Yuantian Miao, Chao Chen, Lei Pan, Qing-Long Han, Jun Zhang, and Yang Xiang. Machine learning based cyber attacks targeting on controlled information: A survey. ACM Computing Surveys, accepted, 23 April 2021.

[219] Hadi Abdullah, Muhammad Sajidur Rahman, Washington Garcia, Logan Blue, Kevin Warren, Anurag Swarnim Yadav, Tom Shrimpton, and Patrick Traynor. Hear “no evil”, see “Kenansville”: Efficient and transferable black-box attacks on speech recognition and voice identification systems. In Proceedings of the 42nd IEEE Symposium on Security and Privacy (S&P’21). IEEE, 2021.

[220] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In Proceedings of the International Conference on Machine Learning (ICML’16), pages 173–182, 2016.

[221] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning (ICML’06), pages 369–376, 2006.

[222] Xuejing Yuan, Yuxuan Chen, Yue Zhao, Yunhui Long, Xiaokang Liu, Kai Chen, Shengzhi Zhang, Heqing Huang, XiaoFeng Wang, and Carl A Gunter. Commandersong: A systematic approach for practical adversarial voice recognition. In Proceedings of the 27th USENIX Security Symposium (USENIX Security’18), pages 49–64, 2018.

[223] Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. Generating adversarial examples with adversarial networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI’18), pages 3905–3911, 2018.

[224] Yuan Gong and Christian Poellabauer. Crafting adversarial examples for speech paralinguistics applications. In Proceedings of the 2018 DYnamic and Novel Advances in Machine Learning and Intelligent Cyber Security Workshop (DYNAMICS’18), 2018.

[225] Felix Kreuk, Yossi Adi, Moustapha Cisse, and Joseph Keshet. Fooling end-to-end speaker verification with adversarial examples. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’18), pages 1962–1966. IEEE, 2018.

[226] Rohan Taori, Amog Kamsetty, Brenton Chu, and Nikita Vemuri. Targeted adversarial examples for black box audio systems. In Proceedings of the 2019 IEEE Security and Privacy Workshops (SPW’19), pages 15–20. IEEE, 2019.

[227] Moustafa Alzantot, Bharathan Balaji, and Mani Srivastava. Did you hear that? Adversarial examples against automatic speech recognition. arXiv preprint arXiv:1801.00554, 2018.

[228] Healthwise. Harmful noise levels, 2019.

[229] Mozilla. DeepSpeech 0.4.1, January 2019.

[230] Jason Cipriani. The complete list of ‘OK, Google’ commands, July 2016.

[231] Jason Cipriani and Sarah Jacobsson Purewal. The complete list of Siri commands, November 2017.

[232] Melanie Weir. A comprehensive list of Siri voice commands you can use on an iPhone, November 2020.

[233] Snehkumar Shahani, Jibi Abraham, and R Venkateswaran. Distributed data aggregation with privacy preservation at endpoint. In Proceedings of the IEEE International Conference on Management of Data, pages 1–9, Chennai, India, 2017. IEEE.

[234] Farah Chanchary, Yomna Abdelaziz, and Sonia Chiasson. Privacy concerns amidst OBA and the need for alternative models. IEEE Internet Computing, 22(Apr):52–61, 2018.

[235] Rory Coulter, Qing-Long Han, Lei Pan, Jun Zhang, and Yang Xiang. Data-driven cyber security in perspective–intelligent traffic analysis. IEEE Transactions on Cybernetics, 50(7):3081–3093, 2020.

[236] Ambika Kaul, Saket Maheshwary, and Vikram Pudi. AutoLearn: Automated feature generation and selection. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM), pages 217–226, New Orleans, LA, USA, 2017. IEEE.

[237] Jun Zhang, Xiao Chen, Yang Xiang, Wanlei Zhou, and Jie Wu. Robust network traffic classification. IEEE/ACM Transactions on Networking (TON), 23(4):1257–1270, 2015.

[238] Jun Zhang, Yang Xiang, Yu Wang, Wanlei Zhou, Yong Xiang, and Yong Guan. Network traffic classification using correlation information. IEEE Transactions on Parallel and Distributed Systems, 24(1):104–117, 2013.

[239] Jun Zhang, Chao Chen, Yang Xiang, Wanlei Zhou, and Yong Xiang. Internet traffic classification by aggregating correlated naive Bayes predictions. IEEE Transactions on Information Forensics and Security, 8(1):5–15, 2013.

[240] Shigang Liu, Jun Zhang, Yang Xiang, and Wanlei Zhou. Fuzzy-based information decomposition for incomplete and imbalanced data learning. IEEE Transactions on Fuzzy Systems, 25(6):1476–1490, 2017.

[241] Chao Chen, Yu Wang, Jun Zhang, Yang Xiang, Wanlei Zhou, and Geyong Min. Statistical features-based real-time detection of drifted Twitter spam. IEEE Transactions on Information Forensics and Security, 12(4):914–925, 2017.

[242] Guanjun Lin, Jun Zhang, Wei Luo, Lei Pan, Yang Xiang, Olivier De Vel, and Paul Montague. Cross-project transfer representation learning for vulnerable function discovery. IEEE Transactions on Industrial Informatics, 14(7):3289–3297, 2018.

Swinburne Research

Appendix A. Authorship Indication Form

For HDR students

NOTE: This Authorship Indication form is a statement detailing the percentage of the contribution of each author to each published ‘paper’. This form must be signed by each co-author and the Principal Supervisor. This form must be added to the publication of your final thesis as an appendix. Please fill out a separate form for each published paper to be included in your thesis.

DECLARATION: We hereby declare our contribution to the publication of the ‘paper’ entitled: The Audio Auditor: User-Level Membership Inference in Internet of Things Voice Services

First Author

Name: Yuantian Miao Signature:

Percentage of contribution: 50% Date: 29/04/2021

Brief description of contribution to the ‘paper’ and your central responsibilities/role on project: Conception and design; system design and implementation; writing the manuscript.

Second Author

Name: Minhui Xue Signature:

Percentage of contribution: 10% Date: 29/04/2021

Brief description of your contribution to the ‘paper’: Conception and design; revising the manuscript.

Third Author

Name: Chao Chen Signature:

Percentage of contribution: 10% Date: 29/04/2021

Brief description of your contribution to the ‘paper’: Conception and design; revising the manuscript.

Fourth Author

Name: Lei Pan Signature:

Percentage of contribution: 10% Date: 04/05/2021

Brief description of your contribution to the ‘paper’: Conception and design; revising the manuscript.

Fifth Author

Name: Jun Zhang Signature:

Percentage of contribution: 5% Date: 30/04/2021

Brief description of your contribution to the ‘paper’: Conception and design; proofreading.

Sixth Author

Name: Benjamin Zi Hao Zhao Signature:

Percentage of contribution: 5% Date: 30/04/2021

Brief description of your contribution to the ‘paper’: Conception and design; proofreading.

Seventh Author

Name: Dali Kaafar Signature:

Percentage of contribution: 5% Date: 03/05/2021

Brief description of your contribution to the ‘paper’: Conception and design; proofreading.

Eighth Author

Name: Yang Xiang Signature:

Percentage of contribution: 5% Date: 04/05/2021

Brief description of your contribution to the ‘paper’: Conception and design; proofreading.

Principal Supervisor:

Name: Yang Xiang Signature:

Date: 04/05/2021

In the case of more than four authors please attach another sheet with the names, signatures and contribution of the authors.
