Life after Speech Recognition: Fuzzing Semantic Misinterpretation for Voice Assistant Applications

Yangyong Zhang, Lei Xu, Abner Mendoza, Guangliang Yang, Phakpoom Chinprutthiwong, Guofei Gu
SUCCESS Lab, Dept. of Computer Science & Engineering
Texas A&M University
{yangyong, xray2012, abmendoza, ygl, [email protected], [email protected]

Abstract—Popular Voice Assistant (VA) services such as Amazon Alexa and Google Assistant are now rapidly appifying their platforms to allow a more flexible and diverse voice-controlled service experience. However, the ubiquitous deployment of VA devices and the increasing number of third-party applications have raised security and privacy concerns. While previous works such as hidden voice attacks mostly examine the problems of VA services' default Automatic Speech Recognition (ASR) component, our work analyzes and evaluates the security of the succeeding component after ASR, i.e., Natural Language Understanding (NLU), which performs semantic interpretation (i.e., text-to-intent) after ASR's acoustic-to-text processing. In particular, we focus on NLU's Intent Classifier which is used in customizing machine

I. INTRODUCTION

The Voice User Interface (VUI) is becoming a ubiquitous human-computer interaction mechanism to enhance services that have limited or undesired physical interaction capabilities. VUI-based systems, such as Voice Assistants (VA), allow users to directly use voice inputs to control computational devices (e.g., tablets, smart phones, or IoT devices) in different situations. With the fast growth of VUI-based technologies, a large number of applications designed for Voice Assistant (VA) services (e.g., Amazon Alexa Skills and Google Assistant Actions) are now available. Amazon Alexa currently has more than 30,000 applications, or vApps¹.
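The two-stage pipeline discussed above (ASR performs acoustic-to-text transcription, then NLU maps the transcript to an intent) can be illustrated with a minimal, self-contained sketch. This is not the design of any real VA platform or of the paper's target systems; the function names, intent labels, and keyword-overlap rule are hypothetical stand-ins for what in practice are trained machine-learning models, which is precisely where semantic misinterpretation can arise.

```python
# Illustrative sketch of a VA processing pipeline:
#   audio --(ASR: acoustic-to-text)--> text --(NLU: text-to-intent)--> intent
# All names and matching rules below are hypothetical.

def asr_transcribe(audio: bytes) -> str:
    """Placeholder ASR stage: a real system would decode audio to text."""
    # Stub: pretend ASR already produced this transcript.
    return "open my bank account app"

# Toy per-vApp intent model: each intent is triggered by sample keywords.
INTENT_PATTERNS = {
    "OpenAppIntent": {"open", "launch", "start"},
    "StopIntent": {"stop", "cancel", "quit"},
}

def classify_intent(utterance: str) -> str:
    """Toy intent classifier: picks the intent whose trigger words
    overlap most with the utterance, falling back when nothing matches.
    Real NLU components use trained classifiers instead of keyword sets."""
    tokens = set(utterance.lower().split())
    best_intent, best_score = "FallbackIntent", 0
    for intent, triggers in INTENT_PATTERNS.items():
        score = len(tokens & triggers)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent

text = asr_transcribe(b"")
print(classify_intent(text))  # -> OpenAppIntent
```

Even in this toy version, the security-relevant observation carries over: two semantically different utterances can map to the same intent, so an attacker-controlled vApp may be triggered by inputs the user never meant for it.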