SIEVE: Secure In-Vehicle Automatic Speech Recognition Systems
Total Page:16
File Type:pdf, Size:1020Kb
SIEVE: Secure In-Vehicle Automatic Speech Recognition Systems Shu Wang1, Jiahao Cao1,2, Kun Sun1, and Qi Li2,3 1Center for Secure Information Systems, George Mason University, Fairfax, VA, USA 2Institute for Network Sciences and Cyberspace, Tsinghua University, Beijing, China 3Beijing National Research Center for Information Science and Technology, Beijing, China Abstract convenient way for drivers and passengers to interact with Driverless vehicles are becoming an irreversible trend in driverless cars. For example, we can use our voice to con- our daily lives, and humans can interact with cars through in- trol in-vehicle entertainment systems, set destinations to the vehicle voice control systems. However, the automatic speech GPS navigation system, and take back the full control of cars recognition (ASR) module in the voice control systems is from the self-driving mode [5]. However, the core module of vulnerable to adversarial voice commands, which may cause in-vehicle voice control system, i.e., automatic speech recogni- unexpected behaviors or even accidents in driverless cars. Due tion (ASR) module, is vulnerable to various adversarial voice to the high demand on security insurance, it remains as a chal- command attacks [6–14]. Particularly, since most in-vehicle lenge to defend in-vehicle ASR systems against adversarial ASR systems support speaker-independent recognition by voice commands from various sources in a noisy driving en- default [15], passengers can voice malicious commands to vironment. In this paper, we develop a secure in-vehicle ASR ASR systems and thus control critical in-vehicle systems. system called SIEVE, which can effectively distinguish voice Moreover, remote attackers may hide a voice command into commands issued from the driver, passengers, or electronic a song [9]. When the song is played through car loudspeakers speakers in three steps. First, it filters out multiple-source or smartphones’ speakers, the malicious voice command in voice commands from multiple vehicle speakers by lever- the song can be recognized by ASR systems. It may cause aging an autocorrelation analysis. Second, it identifies if a unexpected behaviors or even accidents in driverless cars. single-source voice command is from humans or electronic A number of countermeasures [16–18] have been proposed speakers using a novel dual-domain detection method. Finally, to defend ASR systems against adversarial voice commands. it leverages the directions of voice sources to distinguish the They leverage short-term spectral features [19,20] or prosodic voice of the driver from those of the passengers. We imple- features [21, 22] to distinguish different users. However, as ment a prototype of SIEVE and perform a real-world study the features of human voices typically are low dimensional, under different driving conditions. Experimental results show advanced passengers can imitate a driver’s voice to bypass SIEVE can defeat various adversarial voice commands over existing defense systems [23]. Moreover, existing methods on in-vehicle ASR systems. distinguishing different users’ identities are generally unre- liable in a noisy environment [24], whereas in-vehicle ASR 1 Introduction systems demand a much higher security insurance against adversarial voice command attacks to prevent possible car ac- Driverless cars, also known as autonomous cars or self-driving cidents. Furthermore, to prevent malicious voice commands cars, are no longer the things we would only see in sci-fi films, played by speakers, researchers have designed methods to but they are becoming an irreversible trend in our daily lives, identify whether the voice commands come from humans or particularly due to the rapid development of sensor techniques loudspeakers. They rely on spectral features [18,25,26] and and advanced artificial intelligence algorithms. For instance, noise features [17, 27] to distinguish between humans and MIT developed a human-centered autonomous vehicle using speakers. They are all based on one underlying assumption multiple sensors and deep neural networks in 2018 [1]. From that voice commands come from a single source. However, April 2019, all new Tesla cars come standard with Autopilot, in the scenario of driverless cars, malicious voice commands providing the capability of switching modes between manual may come from multiple sources (i.e., multiple car speak- driving and self-driving [2]. Waymo’s driverless cars have ers). Therefore, the different features for multi-source voices driven 20 million miles on public roads by January 2020 [3]. and single-source voices may interfere with distinguishing The latest in-vehicle voice control system [4] provides a between human voices and non-human voices [28]. USENIX Association 23rd International Symposium on Research in Attacks, Intrusions and Defenses 365 In this paper, we propose a secure automatic speech recog- human voices when the car is driving in noisy streets. It can nition system called SIEVE to effectively defeat various adver- further identify the driver’s voice from human voices with sarial voice command attacks on driverless cars. It is capable a 96.76% accuracy. Moreover, our system can be smoothly of distinguishing voice commands issued from a driver, a pas- integrated to vehicles by replacing the in-vehicle single mi- senger, and non-human speakers in three steps. First, since crophone with a low-priced dual microphone and implanting legal human voice commands are always single-source sig- the detection module in one vehicle electronic control unit. nals, SIEVE identifies and filters out multiple-source voice In summary, we make the following contributions: commands from multiple car speakers. The multiple-source • We develop a secure ASR system for driverless vehicles detection is based on a key insight that when the same signal to defeat various in-vehicle adversarial voice commands is received multiple times in a short time period from multi- by distinguishing the command sources from the driver, ple sources, the overlap of the received signals will expand passengers, and electronic speakers. the signal correlations in the time domain. Therefore, SIEVE • We propose a dual-domain detection method to distin- can identify multi-source voice commands by conducting an guish voice commands between humans and non-human autocorrelation analysis to measure the overlap of signals. speakers even if the voices are carefully modulated to After filtering out multiple-source voice commands, the mimic human voices. second step of SIEVE is to check if a single-source voice • We provide a method based on the directions of voice command is from a human or a non-human speaker. SIEVE sources to distinguish the driver’s voice from passengers’ can accurately detect non-human voice commands by check- voices even if passengers can imitate the driver’s voice. ing features in both frequency domain and time domain. First, • We implement a prototype of SIEVE and real-world voices from non-human speakers inherently have the unique experimental results show that our system can effectively acoustic characteristic, i.e., low-frequency energy attenuation. defeat adversarial voice commands. Such characteristic can be checked with the signal power spectrum distribution in the frequency domain. However, so- 2 Threat Model and Assumptions phisticated attackers may use low-frequency enhancement filters to modulate the voice and thus compensate for the en- We focus on the adversarial voice command attacks that ma- ergy loss. To identify such modulated voices, SIEVE also nipulate the speech inputs to the in-vehicle ASR system. We conducts a local extrema verification in the time domain. Our assume the vehicle’s electronic control unit (ECU) can be key insight is that the local extrema ratio for modulated voices trusted [29]. We assume the driver can be trusted to not issue is much greater than that for human voices. Moreover, we malicious commands; however, malicious voice commands demonstrate that attackers cannot modulate voices to bypass may be issued from in-vehicle loudspeakers, the speakers of the detection in both the time domain and frequency domain at mobile devices, or passengers. the same time. Hence, our dual-domain check ensures SIEVE First, malicious voice commands may come from the in- can effectively filter out various non-human voice commands. vehicle loudspeakers. It is common for people to connect Finally, SIEVE distinguishes the passenger’s voice com- their mobile phones to car audio systems when playing music mands from the driver’s voice commands, since we may only or making phone calls. Also, CDs/DVDs are usually played trust the driver but not the passengers. Our key insight is that through speakers. Since music songs might be downloaded vehicles have fixed internal locations for the driver and pas- from various untrusted sources, the attackers may edit the sengers. Therefore, we can leverage the directions of voice sound tracks of audio files to voice the malicious commands sources to distinguish the driver’s voice and passengers’ voice via a single speaker or multiple speakers. Particularly, armored even if passengers can imitate the driver’s voice. Particularly, attackers may hide adversarial commands by minimizing the SIEVE measures the directions of voice sources by calcu- difference between the malicious and the original audio sam- lating the time difference of arrivals (TDOA) on a pair of ples [7–9]. Moreover, when a phone call is connected to the close-coupled microphones (i.e., a dual microphone). To deal vehicle speakers via Bluetooth, the people on the other side with some extreme cases when passengers lean forward and may unintentionally or intentionally issue voice commands have their head near the headrest of the driver’s seat, we also to the ASR systems. develop a spectrum-assisted detection method to improve the Second, if the driver puts their smartphones on mobile detection accuracy of SIEVE. speakers (handsfree mode) when making phone calls or play- We implement a prototype of SIEVE and conduct extensive ing musics, a malicious command may be issued from the real-world experiments on a four-seat sedan (Toyota Camry) driver’s smartphones.