CHI 2019 Paper CHI 2019, May 4–9, 2019, Glasgow, Scotland, UK

Voice Presentation Attack Detection through Text-Converted Voice Command Analysis

Il-Youp Kwak, Samsung Research, Seoul, South Korea
Jun Ho Huh, Samsung Research, Seoul, South Korea
Seung Taek Han, Samsung Research, Seoul, South Korea
Iljoo Kim, Samsung Research, Seoul, South Korea
Jiwon Yoon, Korea University, Seoul, South Korea

ABSTRACT
Voice assistants are quickly being upgraded to support advanced, security-critical commands such as unlocking devices, checking emails, and making payments. In this paper, we explore the feasibility of using users' text-converted voice command utterances as classification features to help identify users' genuine commands and detect suspicious commands. To maintain high detection accuracy, our approach starts with a globally trained attack detection model (immediately available for new users), and gradually switches to a user-specific model tailored to the utterance patterns of a target user. To evaluate accuracy, we used a real-world voice assistant dataset consisting of about 34.6 million voice commands collected from 2.6 million users. Our evaluation results show that this approach is capable of achieving about 3.4% equal error rate (EER), detecting 95.7% of attacks when an optimal threshold value is used. As for those who frequently use security-critical (attack-like) commands, we still achieve an EER below 5%.

CCS CONCEPTS
• Security and privacy → Usability in security and privacy; Intrusion/anomaly detection and malware mitigation.

KEYWORDS
Voice Command Analysis; Attack Detection; Voice Assistant Security

ACM Reference Format:
Il-Youp Kwak, Jun Ho Huh, Seung Taek Han, Iljoo Kim, and Jiwon Yoon. 2019. Voice Presentation Attack Detection through Text-Converted Voice Command Analysis. In CHI Conference on Human Factors in Computing Systems Proceedings (CHI 2019), May 4–9, 2019, Glasgow, Scotland, UK. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3290605.3300828

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CHI 2019, May 4–9, 2019, Glasgow, Scotland, UK
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-5970-2/19/05...$15.00
https://doi.org/10.1145/3290605.3300828

1 INTRODUCTION
Voice assistant vendors (e.g., Apple's Siri, Amazon's Alexa, and Samsung's Bixby) have started to upgrade their solutions to support more advanced and useful commands – examples include sending messages, checking emails, making payments, and performing banking transactions – some of which may also have security and privacy implications. Such advanced commands make voice assistants an attractive target for attackers who try to steal users' private information or perform unauthorized banking transactions. To mitigate those threats, some voice assistants force users to first unlock their devices (e.g., using patterns or fingerprints) before submitting security-sensitive commands. However, this mandatory device unlock requirement sits uneasily with the usability of voice assistants, as users have to physically engage with their devices at least once in order to use them. Further, some devices, such as smart speakers, do not have any physical input space for users to authenticate themselves.

As a more usable authentication method, voice biometric-based authentication methods have been proposed to implicitly check users' voices using trained (known) voice biometric features, so that users do not have to physically authenticate themselves. However, voice biometric based authentication methods achieve about 80–90% accuracy when background noises are present [3, 17, 29, 31, 33, 37], and are often used with threshold values that reduce false rejection rates – compromising security as a result. Human mimicry attacks can be effective against them [14, 19, 26]. Such methods are also weak against voice synthesis attacks, where attackers use deep learning techniques to train victims' voice biometric models from recorded voice samples and generate new malicious voice commands [11]. To detect those voice presentation attacks, signal-processing based voice liveness detection solutions have been discussed recently [18]; but such solutions would also suffer accuracy losses when there are environmental changes, and cannot guarantee detection performance against unseen conditions.

In this paper, we propose a novel way to identify users' genuine commands and detect suspicious commands based on "Text-conVerted VoICE command analysis" (Twice), and evaluate its feasibility using a large real-world voice assistant dataset collected over a two-month period through a large IT company, comprising about 34.6 million voice commands. We used text-converted voice command utterances (bag of words) and matched applications as the main classification features. We experimented with lightweight classification algorithms, and measured the average equal error rates (EER), detection time, and training time. To the best of our knowledge, we are the first group to consider analyzing voice command text utterances to detect voice presentation attacks.

Our key contributions are summarized below:

• Real-world voice assistant command analyses, showing that about 87.48% of the users are occasional users submitting fewer than 20 commands over a month.
• A voice presentation attack detection system design that initially works with a globally trained model (available immediately), and gradually switches to a more accurate user-tailored model as use of the voice assistant increases.
• An evaluation of the feasibility of using text-converted utterances as features to detect anomalous use of security-critical commands, showing that the average EER is about 3.4%, and the detection accuracy measured on a month of unseen data is 95.7%.
• Identification of security-critical voice commands that are being used on real-world voice assistants, including commands that can be used to access credit card information and change security settings.

Our speech-to-text-converted utterance analysis based attack detection system, referred to as "Twice," will be more effective against voice synthesis attacks as it does not rely

2 BACKGROUND AND THREAT MODEL
In this section we describe existing (commercialized)

Voice Assistant Authentication
The most widely adopted solution is voice biometric based authentication, which uses raw waveforms, complex spectral features, and log-mel features to train and classify users [9, 21]. It does not require users to perform additional tasks or remember additional information to authenticate themselves, and it is a continuous solution that can be used to verify users' every command. Google Assistant and Samsung Bixby are equipped with voice biometric based authentication solutions. However, existing algorithms [3, 17, 24, 37] achieve about 80–90% accuracy when there are ambient background noises. Hence, to maintain voice assistant usability, those solutions are used with threshold values that result in low false rejection rates with some compromises in false acceptance rates. This use of threshold values opens doors for human mimicry attacks – where attackers use their own voices and try to mimic device owners' voices – to bypass authentication checks [19, 26]. Voice replay attacks (replaying device owners' recorded voices through a speaker) [8, 18, 32, 36] and voice synthesis attacks (new voice commands generated using device owners' voice samples and trained models, and played through a speaker) [25, 27, 35] are also effective against voice biometric based authentication solutions.

Voice passwords are available on Samsung Bixby. Users are asked to choose and remember a secret word, and say it to authenticate themselves upon submitting sensitive or security-critical commands. If voice passwords are used in public places, however, people can easily hear and compromise users' secret words; an adversary could simply record users' voice passwords and replay them to bypass the check. There are usability issues too, since users may be interrupted frequently to say their secret words.

Existing voice assistants often require users to unlock their mobile devices first (e.g., by entering their PINs or patterns) before processing and executing sensitive and privileged voice commands. To use Bixby on Galaxy phones, for example, most of the useful commands (e.g., calling someone or setting alarms) are currently restricted, requiring users to first unlock their devices – such a security policy sits uneasily with future visions of how voice assistants should work and with their usability benefits. In this paper, we assume that future mobile devices will be equipped with a voice biometric based authentication scheme, allowing users to seamlessly use voice assistants without having to physically touch their devices.
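To make the feature design described above concrete, the following is a minimal sketch of the kind of input the paper mentions: a bag-of-words vector over the text-converted command plus the matched application, suitable for any lightweight classifier. The helper function, the tiny vocabulary, and the application list are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def command_features(utterance, matched_app, vocabulary, apps):
    """Build one fixed-length feature vector: word counts over a fixed
    vocabulary, followed by a one-hot slot for the matched application.
    (Hypothetical layout; the paper only names the feature types.)"""
    counts = Counter(utterance.lower().split())
    word_part = [counts[w] for w in vocabulary]          # bag-of-words counts
    app_part = [1 if matched_app == a else 0 for a in apps]  # one-hot app slot
    return word_part + app_part

# Toy vocabulary and application list (assumptions for illustration only).
vocabulary = ["send", "money", "check", "email", "alarm", "set"]
apps = ["banking", "email", "clock"]

vec = command_features("send money to Alice", "banking", vocabulary, apps)
# vec -> [1, 1, 0, 0, 0, 0, 1, 0, 0]
```

A security-critical command such as "send money to Alice" thus lights up different coordinates than a benign "set alarm", which is the signal a per-user or global classifier can learn from.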
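The threshold trade-off discussed above (lowering false rejection rates at the cost of higher false acceptance rates) is what an equal error rate summarizes: the EER is the error level at the threshold where the two rates coincide. As a minimal sketch (not the authors' implementation; the function names and toy scores are assumptions), an EER can be estimated from a detector's scores like this:

```python
def far_frr(genuine, impostor, threshold):
    """Error rates at one threshold, assuming higher scores mean 'more genuine':
    FAR = fraction of impostor scores accepted (>= threshold),
    FRR = fraction of genuine scores rejected (< threshold)."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Sweep every observed score as a candidate threshold; return
    (EER estimate, threshold) where FAR and FRR are closest."""
    best_gap, best = float("inf"), None
    for t in sorted(set(genuine + impostor)):
        far, frr = far_frr(genuine, impostor, t)
        if abs(far - frr) < best_gap:
            best_gap, best = abs(far - frr), ((far + frr) / 2, t)
    return best

# Toy scores: a mostly separated detector.
eer, threshold = equal_error_rate(
    genuine=[0.9, 0.8, 0.7, 0.3],
    impostor=[0.1, 0.2, 0.6, 0.4],
)
```

With these toy scores the sweep settles at threshold 0.6, where one of four impostor scores is accepted and one of four genuine scores is rejected, giving an EER of 0.25. The 3.4% EER reported in this paper corresponds to the same kind of operating point on its own score distributions.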