Life after Speech Recognition: Fuzzing Semantic Misinterpretation for Voice Assistant Applications

Yangyong Zhang, Lei Xu, Abner Mendoza, Guangliang Yang, Phakpoom Chinprutthiwong, Guofei Gu
SUCCESS Lab, Dept. of Computer Science & Engineering, Texas A&M University
{yangyong, xray2012, abmendoza, ygl, [email protected], [email protected]

Abstract—Popular Voice Assistant (VA) services such as Amazon Alexa and Google Assistant are now rapidly appifying their platforms to allow a more flexible and diverse voice-controlled service experience. However, the ubiquitous deployment of VA devices and the increasing number of third-party applications have raised security and privacy concerns. While previous works such as hidden voice attacks mostly examine the problems of VA services' default Automatic Speech Recognition (ASR) component, our work analyzes and evaluates the security of the component that follows ASR, i.e., Natural Language Understanding (NLU), which performs semantic interpretation (i.e., text-to-intent) after ASR's acoustic-to-text processing. In particular, we focus on the NLU's Intent Classifier, which is used to customize machine understanding for third-party VA applications (or vApps). We find that the semantic inconsistency caused by improper semantic interpretation in an Intent Classifier can create the opportunity to breach the integrity of vApp processing when attackers delicately leverage common spoken errors.

In this paper, we design the first linguistic-model-guided fuzzing tool, named LipFuzzer, to assess the security of the Intent Classifier and systematically discover potential misinterpretation-prone spoken errors based on vApps' voice command templates. To guide the fuzzing, we construct adversarial linguistic models with the help of Statistical Relational Learning (SRL) and emerging Natural Language Processing (NLP) techniques. In our evaluation, we verify the effectiveness and accuracy of LipFuzzer, and we use it to evaluate both the Amazon Alexa and Google Assistant vApp platforms. Our fuzzing results show that a large portion of real-world vApps are vulnerable.

I. INTRODUCTION

The Voice User Interface (VUI) is becoming a ubiquitous human-computer interaction mechanism for services that have limited or undesired physical interaction capabilities. VUI-based systems, such as Voice Assistants (VA), allow users to control computational devices (e.g., tablets, smartphones, or IoT devices) directly with voice input. With the fast growth of VUI-based technologies, a large number of applications designed for VA services (e.g., Amazon Alexa Skills and Google Assistant Actions) are now available. Amazon Alexa alone currently hosts more than 30,000 such applications, or vApps1.

Several attacks have been reported to affect the integrity of the existing ASR component in vApp processing. For example, acoustic-based attacks [19], [36], [33], [35], [23] leverage sounds that are unrecognizable or inaudible to humans. More recently, Kumar et al. [28] presented an empirical study of vApp squatting attacks based on speech misinterpretation: for example, a request for an Alexa Skill named "Test Your Luck" could be routed to a maliciously uploaded skill with the confusingly similar name "Test Your Lock". This attack works, with a proof of concept, in a remote manner and could potentially be more powerful than acoustic-based attacks.

Despite this recent evidence [28] of potential vApp squatting attacks, little effort has been made to uncover the root cause of speech misinterpretation in vApp processing. In this paper, we devise the first representative Voice Assistant (VA) architecture, as shown in Figure 1, from which we study the core components involved in proper speech interpretation.
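The platforms' invocation-name matchers are proprietary, but the "Test Your Luck"/"Test Your Lock" collision can be illustrated with a toy phonetic matcher. Below is a minimal sketch assuming a simplified Soundex-style encoding; the real Alexa matcher is not public and is certainly more sophisticated, so this only demonstrates why phonetically confusable names are dangerous.

```python
def soundex(word: str) -> str:
    """Simplified classic Soundex: first letter + up to 3 consonant-group digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    digits, prev = [], codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:   # skip vowels, collapse adjacent duplicates
            digits.append(code)
        if ch not in "hw":          # h/w do not break a consonant run
            prev = code
    return (word[0].upper() + "".join(digits) + "000")[:4]

def phonetic_key(invocation_name: str) -> tuple:
    """Per-word phonetic signature of a (hypothetical) skill invocation name."""
    return tuple(soundex(w) for w in invocation_name.split())

# The benign skill and the squatter yield identical phonetic signatures:
# "Test Your Luck" and "Test Your Lock" both map to ('T230', 'Y600', 'L200').
assert phonetic_key("Test Your Luck") == phonetic_key("Test Your Lock")
```

Under such an encoding, any registry keyed on phonetic similarity would route either spoken name to whichever skill matches first, which is the routing ambiguity the squatting attack exploits.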
[Fig. 1: VUI-based VA Architecture — Audio Input → Automatic Speech Recognition (ASR) → Textual Data → Natural Language Understanding (NLU) with Intent Classifier → Intents → Third-Party vApp Instances (Speech Processing Gadgets) → Response Engine → Audio Output.]

Network and Distributed Systems Security (NDSS) Symposium 2019, 24-27 February 2019, San Diego, CA, USA. ISBN 1-891562-55-X. https://dx.doi.org/10.14722/ndss.2019.23525, www.ndss-symposium.org

1vApp is the generalized name used in this paper for Amazon Alexa Skills and Google Assistant Actions.

After closely scrutinizing the VA architecture, we found that both the ASR and the NLU components play a central role in proper speech recognition and interpretation. Previous works have only studied the ASR component. In the NLU, an Intent Classifier uses voice command templates (or templates) to match intents (similar to Intent Messages in Android) against the textual data produced by the ASR. Intents are then used to reach vApps with specific functionality. From the vApp security perspective, the Intent Classifier plays the more important role in interpreting users' voice commands. First, the Intent Classifier is the last step of the interpretation process: it not only determines users' semantic intents but can also fix the ASR's potential transcription errors. Second, while the ASR is a default built-in service component, the Intent Classifier's semantic classification is constructed by both vApp developers and service providers. In particular, third-party developers can upload voice command templates that modify the unified intent classification tree used by all users (more details are illustrated in Section II). As a result, misbehaving developers have the opportunity to maliciously modify the NLU's intent matching process.

However, it is challenging to systematically study how the NLU's Intent Classifier can be penetrated to incur semantic inconsistency between users' intents and the VA's machine understanding. First, mainstream VA platforms such as Amazon Alexa and Google Assistant, as well as almost all vApps, are developed behind closed doors, so it is difficult to conduct a white-box analysis. Also, due to strong privacy enforcement, it is impossible to obtain real users' speech inputs and the corresponding vApp response outputs; thus, conducting a large-scale, real-world, data-driven analysis is also very challenging.

Our Approach. In this work, we assess speech misinterpretation through black-box mutation fuzzing of voice command templates, which serve as the Intent Classifier's inputs for matching intents. Our goal is to systematically evaluate how the Intent Classifier behaves when given different forms of voice commands. However, designing such a fuzzing scheme is not straightforward. First, the computability of vApp I/O is very limited, as both the input and output of vApps are in speech form, i.e., one can only speak to a VA device or listen to its audio response2. We therefore have to determine which mutation fields we can work on in the context of vApp processing. Moreover, it is important to eliminate the effect of ASR's text interpretation so that error propagation is minimized. Second, the authors of [28] suggest that vApp squatting attacks are caused by pronunciation-based interpretation errors of the ASR. In reality, however, ambiguous natural language incurs many more forms of confusing voice commands. For example, a user could simply be using regional vocabulary (i.e., words or expressions used in one dialect area but not commonly recognized elsewhere) rather than having a pronunciation issue alone. Moreover, it is unpredictable when and how a user will speak differently. All of these factors make voice command fuzzing difficult because of the large search space.

To overcome these design challenges, we propose a novel linguistic-model-guided fuzzing tool, called LipFuzzer. Our tool generates potential voice commands that are likely to incur a semantic inconsistency such that a user reaches an unintended vApp or functionality (i.e., users think they use voice commands correctly but get unwanted results). For convenience, we call any voice command that can lead to such inconsistency a LAPSUS.3 LipFuzzer addresses the two aforementioned challenges with two components. First, to decide mutation fields, we observe that state-of-the-art Natural Language Processing (NLP) techniques can extract computational linguistic data; we therefore design LipFuzzer's basic template fuzzing to mutate NLP-preprocessed voice command templates, and we feed the mutated voice commands to VA devices using machine speech synthesis, which eliminates the human factor of producing ASR-related misinterpretation. Second, to reduce the search space of the basic fuzzing, the linguistic modeling component trains adversarial linguistic models, named LAPSUS Models, by adopting Bayesian Networks (BNs). The BNs are constructed statistically from linguistic knowledge related to LAPSUS, and the template fuzzing is then guided by linguistic model query results. As a result, LipFuzzer can perform effective mutation-based fuzzing on seed template inputs (gathered from existing vApp user guidance available online).

In our evaluation, we first show that the Intent Classifier is indeed the root cause of the misinterpretation. Then, we show that LipFuzzer can systematically pinpoint semantic inconsistencies with the help of linguistic models generated from existing linguistic knowledge. Furthermore, we scan the Alexa Skill Store and Google
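The guided template mutation described above can be sketched in miniature. The sketch below substitutes a flat table of lapse probabilities for the paper's BN-based LAPSUS Models; the rule set, probabilities, and threshold are invented for illustration only.

```python
import itertools

# Hypothetical lapse rules: word -> [(variant, probability)].
# LipFuzzer derives such scores from Bayesian-network queries over
# linguistic knowledge; these numbers are made up for the sketch.
LAPSUS_RULES = {
    "luck": [("lock", 0.30)],              # vowel confusion
    "accept": [("except", 0.25)],          # near-homophone
    "the": [("thee", 0.10), ("", 0.05)],   # stress shift / word omission
}

def mutate(template: str, min_score: float = 0.05):
    """Yield (mutated_command, score) pairs for a seed voice command template.

    The score is the product of per-word lapse probabilities and approximates
    how likely a user is to produce that spoken-error variant; low-scoring
    mutations are pruned to keep the search space small."""
    words = template.lower().split()
    # Each word keeps itself (probability 1.0) plus any lapse variants.
    options = [[(w, 1.0)] + LAPSUS_RULES.get(w, []) for w in words]
    for combo in itertools.product(*options):
        score = 1.0
        for _, p in combo:
            score *= p
        if score < 1.0 and score >= min_score:  # skip the unmutated seed
            yield " ".join(w for w, _ in combo if w), score
```

For instance, `list(mutate("open test your luck"))` produces the single guided mutation `("open test your lock", 0.3)`, which a fuzzer would then synthesize to speech and replay against a VA device.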