Hello, Is It Me You're Looking For? Differentiating Between Human and Electronic Speakers for Voice Interface Security

Logan Blue (University of Florida, Gainesville, Florida) bluel@ufl.edu
Luis Vargas (University of Florida, Gainesville, Florida) lfvargas14@ufl.edu
Patrick Traynor (University of Florida, Gainesville, Florida) traynor@ufl.edu

ABSTRACT
Voice interfaces are increasingly becoming integrated into a variety of Internet of Things (IoT) devices. Such systems can dramatically simplify interactions between users and devices with limited displays. Unfortunately, voice interfaces also create new opportunities for exploitation. Specifically, any sound-emitting device within range of the system implementing the voice interface (e.g., a smart television, an Internet-connected appliance, etc.) can potentially cause these systems to perform operations against the desires of their owners (e.g., unlock doors, make unauthorized purchases, etc.). We address this problem by developing a technique to recognize fundamental differences in audio created by humans and electronic speakers. We identify sub-bass over-excitation, or the presence of significant low-frequency signals that are outside of the range of human voices but inherent to the design of modern speakers, as a strong differentiator between these two sources. After identifying this phenomenon, we demonstrate its use in preventing adversarial requests, replayed audio, and hidden commands with a 100%/1.72% TPR/FPR in quiet environments. In so doing, we demonstrate that commands injected via nearby audio devices can be effectively removed by voice interfaces.

CCS CONCEPTS
• Security and privacy → Access control; Biometrics;

KEYWORDS
Voice interface, Internet of Things

ACM Reference Format:
Logan Blue, Luis Vargas, and Patrick Traynor. 2018. Hello, Is It Me You're Looking For? Differentiating Between Human and Electronic Speakers for Voice Interface Security. In WiSec '18: Proceedings of the 11th ACM Conference on Security & Privacy in Wireless and Mobile Networks, June 18–20, 2018, Stockholm, Sweden. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3212480.3212505

1 INTRODUCTION
The Internet of Things (IoT) holds the potential to increase automation in our daily lives. Devices ranging from connected appliances that report when groceries are low to smart thermostats that can anticipate desired temperature changes offer great convenience to their users. Given that many of these devices have limited traditional interfaces or lack them entirely, an increasing number now incorporate voice commands as their primary user interface. Voice interfaces not only simplify interaction with such devices for traditional users, but also promote broader inclusion for both the elderly and those with disabilities [30].

Voice interfaces also introduce a number of security problems. First, few devices actually authenticate their users. Instead, if a command can be understood by a voice-enabled device, it simply executes the request. Any insecure sound-emitting IoT device (e.g., a networked stereo system or smart TV) near a voice interface may be used to inject commands. An adversary need not necessarily compromise nearby devices to launch a successful attack – voice-controlled devices have already been intentionally [21] and unintentionally [23] activated by nearby televisions. Second, while some devices are considering the use of biometrics for authentication, this solution fails in many important cases. For instance, off-the-shelf tools [2, 9] allow attackers to generate audio targeting specific speakers. Moreover, even if biometrics can protect against these attacks, they do nothing to prevent replay. Fundamentally, wherever speakers exist, audio can easily be injected to induce voice interfaces to perform tasks on behalf of an adversary.

We address this problem in this paper by developing techniques that distinguish between human and electronic speakers.¹ Specifically, we identify a feature of audio that differs between the human vocal tract and the construction of modern electronic speakers. Our analysis shows that electronic speakers induce what we refer to as sub-bass over-excitation, which is the presence of very low-frequency components in the audio waveform that are not naturally produced by humans. This phenomenon is instead a consequence of the enclosures in which electronic speakers are housed. We demonstrate that this feature is a reliable indicator in detecting electronic speakers.

¹ To overcome the overloaded term "speaker", we refer to humans as "organic speakers" and manufactured devices that emit audio as "electronic speakers".
Our work makes the following contributions:

• Identify sub-bass over-excitation phenomenon: Using signal processing, we identify a frequency band present in the audio generated by electronic speakers. We discuss why sub-bass over-excitation occurs and develop the energy balance metric to effectively measure it (see the sketch after this list).
• Experimental evaluation of phenomenon-based detector: After explaining sub-bass over-excitation, we construct a detector that differentiates between organic and electronic speakers in low-noise (TPR: 100%; FPR: 1.72%) and high-noise (TPR: 95.7%; FPR: 5.0%) environments. We also contextualize why these false positive rates are acceptable based on reported usage data.
• Analysis of Adversarial Commands: We demonstrate that our detector can accurately identify the speaker as organic or electronic even in the presence of recent garbled audio injection attacks [15] and codec transcoding attacks.
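The energy balance metric is developed later in the paper; its exact definition is not given in this section. As a rough, non-authoritative sketch of the kind of measurement involved, the following Python assumes the metric compares spectral energy in a sub-bass band against the energy in the remaining voice band. The band edges (20–80 Hz, 8 kHz), the decision threshold, and the function names are illustrative assumptions, not the paper's values.

```python
import numpy as np


def energy_balance(signal, sample_rate, sub_bass=(20.0, 80.0), voice_max=8000.0):
    """Illustrative sub-bass energy balance: ratio of spectral energy in an
    assumed sub-bass band to the energy in the rest of the voice band."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2                 # power spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)   # bin frequencies in Hz

    low, high = sub_bass
    sub_bass_energy = spectrum[(freqs >= low) & (freqs < high)].sum()
    voice_energy = spectrum[(freqs >= high) & (freqs < voice_max)].sum()
    return sub_bass_energy / (voice_energy + 1e-12)             # guard against divide-by-zero


def is_electronic(signal, sample_rate, threshold=0.1):
    """Flag audio as electronically produced if the (illustrative)
    energy balance exceeds an assumed threshold."""
    return energy_balance(signal, sample_rate) > threshold
```

A real detector would presumably operate on short frames of the command audio and aggregate the per-frame balance; the threshold above is purely a placeholder.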
We note that sub-bass over-excitation is not simply a phenomenon limited to a certain class of electronic speakers. Rather, it is a fundamental characteristic of the construction of all electronic speakers, be they of high or low quality. Without an adversary gaining physical access to a targeted environment and replacing the electronic speakers with custom-made devices (which, as we will explain, would add significant noise to the produced audio), our proposed techniques dramatically mitigate the ability to inject commands into increasingly popular voice interfaces.

The remainder of the paper is organized as follows: Section 2 provides context by discussing related work; Section 3 offers background information necessary to understand our techniques; Section 4 details our hypothesis and assumptions; Section 5 describes the sub-bass over-excitation phenomenon and provides a confirmation test; Section 6 provides a detailed evaluation using multiple electronic speakers and test subjects; Section 7 discusses attempted mitigations and related issues; and Section 8 provides concluding remarks.

2 RELATED WORK
Voice assistants [3, 4, 6, 7] are helpful in many ways, ranging from setting alarms or asking for the weather to making purchases or changing a thermostat's temperature. Unfortunately, these devices are vulnerable to command injection by any audio source loud enough to be heard.

To stop command replay attacks on voice assistants, liveliness verification can be performed by adding a video feed of the user's facial expression as input [11, 16, 27]. This method of proving liveliness has been widely studied in the field of information fusion [12, 17, 29]. While adding a camera source decreases the chance of malicious command injection, it also increases the chance of rejecting a real command and requires the addition of a video channel to devices, many of which do not come readily equipped with cameras. Moreover, cameras potentially introduce new threats to many environments.

In previous work, we used a secondary device controlled by and colocated with the owner of a voice-operated device to authenticate incoming commands [14]. However, this technique requires an additional device, which may make it unsuitable for certain applications.

In this paper, our goal is to determine whether a sound is being played through an electronic speaker or if the sound originates from organic human speech.

3 BACKGROUND
3.1 Structure of the Human Voice
Figure 1 illustrates the structures that create a human voice. The human voice is created by the complex interaction of various parts of the human anatomy. Sounds are produced by a combination of the lungs, the larynx, and the articulators (the tongue, cheeks, lips, palate, throat, and nasal cavity). The lungs force air over the rest of the vocal tract, allowing it to produce sound. The larynx contains the vocal cords, which are responsible for the generation of the fundamental frequency present in the voice. Since the vocal cords are located at the bottom of what is essentially a closed tube, the fundamental frequency induces an acoustic resonance.
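As an illustrative aside (not a calculation from the paper), the closed-tube geometry mentioned above can be made concrete with the standard quarter-wavelength approximation: a tube closed at one end resonates at odd multiples of c / (4L). The vocal-tract length and speed of sound below are assumed textbook values.

```python
SPEED_OF_SOUND = 343.0       # m/s in air at roughly room temperature (assumed)
VOCAL_TRACT_LENGTH = 0.17    # meters, a typical adult vocal tract (assumed)

# A tube closed at one end resonates at odd multiples of c / (4 * L).
resonances = [(2 * n - 1) * SPEED_OF_SOUND / (4 * VOCAL_TRACT_LENGTH) for n in (1, 2, 3)]
print([round(f) for f in resonances])   # roughly [504, 1513, 2522] Hz
```

Under these assumptions the vocal tract's resonances sit in the hundreds to thousands of hertz, well above the sub-bass region, which is consistent with the paper's observation that sub-bass energy is not naturally produced by human speech.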