
Hitting Three Birds With One System: A Voice-based CAPTCHA For the Modern User

Muhammad A. Shah, Carnegie Mellon University, [email protected]
Khaled A. Harras, Carnegie Mellon University, [email protected]

Abstract—CAPTCHA challenges are used all over the Internet to prevent automated scripts from abusing web services. However, recent technological developments have rendered the conventional CAPTCHA insecure and inconvenient to use. In this paper, we propose vCAPTCHA, a voice-based CAPTCHA system that would: (1) enable more secure human authentication, (2) integrate more conveniently with modern devices accessing web services, and (3) help collect vast amounts of annotated speech data for languages, accents, and dialects that are under-represented in current speech corpora, thus making speech technologies accessible to more people around the world. vCAPTCHA requires users to speak their responses, rather than type them, in order to unlock or use different web services. These responses are analyzed to determine whether they were indeed naturally produced, and then transcribed to ensure that they contain the challenge sentence. We build a prototype of vCAPTCHA in order to assess its performance and practicality. Our preliminary results on the ASVspoof datasets show that we are able to achieve an attack success rate as low as 2.3% while maintaining a human success rate comparable to current CAPTCHAs.

Keywords-CAPTCHA; Speech Collection

I. INTRODUCTION

There are rising doubts about the effectiveness and security of CAPTCHA, the Completely Automated Public Turing Test to tell Computers and Humans Apart [1]. CAPTCHA and ReCAPTCHA systems are widely used on the Web, mostly by web services, to differentiate between human users and automated scripts. The most common form of CAPTCHA relies on the limitations of computer vision techniques by presenting users with a short, visually distorted string and asking them to type out its contents. The emerging concern, however, is that modern optical character recognition techniques can solve the most difficult variants of Google's ReCAPTCHA with more than 99% accuracy [2].

In addition to these security concerns, the growing adoption of portable/mobile devices, with their increased variety of usage as pervasive sensors for many applications [3], [4], [5], [6], [7], has rendered CAPTCHA challenges less convenient to use. The number of smartphone users quadrupled between 2007 and 2014, outstripping desktop users [8]; users now spend 70% of their screen time on mobile devices, and more search queries originate from mobile devices than from desktop devices (https://adwords.googleblog.com/2015/05/building-for-next-moment.html). This shift in user preference means that users now interact with their devices in a wide variety of dynamic environments to which conventional input modalities may not be well suited. While touch input is currently the dominant modality for mobile devices, it has its fair share of shortcomings, such as when the screen is small, the user is wearing gloves, or the user's attention is divided. These factors make solving CAPTCHA challenges on mobile devices an arduous task.

Given the recent advancements in speech processing technologies, we see speech as a very promising modality for CAPTCHA systems, one that could solve the aforementioned usability and security problems that plague current systems. With more than 70% of users finding speech to be more efficient than typing [9], [10], and 20% of mobile searches happening via voice (https://searchengineland.com/google-reveals-20-percent-queries-voice-queries-249917), speech seems poised to become the dominant input modality for mobile devices. This partiality to speech input carries over to CAPTCHA solving as well [11]. In addition, speech synthesis techniques have not matured enough to accurately replicate the human voice: synthesized speech usually contains artifacts that machine learning techniques can identify with more than 90% accuracy [12].

While we believe speech-based CAPTCHA solutions are well-positioned to replace their text-based counterparts and address the concerns mentioned above, the lack of diverse speech corpora poses a major challenge for their wide-scale deployment. Despite its promise, current speech recognition caters only to a limited user base. Due to insufficient data, popular speech-based personal assistants, such as Google Assistant and Siri, support only 8 and 13 languages respectively, out of roughly 6900 languages in the world (https://www.ethnologue.com/). Even for the popular languages, the available data might not be diverse enough to model a wide range of accents and dialects. English corpora, for instance, are dominated by the US accent while having very little representation of common non-native accents such as Middle-Eastern and Indian ones. Consequently, systems trained on these corpora yield high error rates on non-native accents [13], [14]. With users hailing from all over the world, a speech-based CAPTCHA could allow for the collection of data for a variety of languages, dialects, and accents, which in turn would make speech input applications in general accessible to more people.

In this paper, we propose vCAPTCHA, a secure and convenient voice-based CAPTCHA that enables the collection of diverse speech data at a global scale. A user is presented with a short challenge text, which he/she responds to by speaking out the displayed text rather than typing it in. Since vCAPTCHA uses the human-ness of speech to differentiate humans from bots, it obviates the need to visually distort the text displayed to the user. This makes solving our CAPTCHA even more convenient and less cognitively taxing for human users, while still presenting a significant challenge to bots. Text corpora can be uploaded to our system; from these corpora it randomly provides challenge sentences to web applications upon request. Upon receiving the user's response, vCAPTCHA uses feature extraction and machine learning techniques from the speaker verification literature to analyze the audio signal and determine whether it was indeed naturally produced. To ensure that the speech signal was produced in response to the given challenge sentence, vCAPTCHA also uses commodity speech recognition software to transcribe the audio. If the transcription and the challenge sentence are similar within a threshold, they are considered a match. If both of the aforementioned tests pass, vCAPTCHA certifies the legitimacy of the user. Utterances that successfully pass the CAPTCHA are stored in the database to become part of a speech corpus in the future.

We implement a prototype of vCAPTCHA (source code at https://github.com/shinigami1494/speech captcha) and evaluate its different components using large speech datasets [15], [16]. These datasets include more than 200,000 speech samples, both spoofed and genuine, on which we measure the attack and human success rates. The attack success rate is the proportion of spoofed responses that were incorrectly considered genuine, while the human success rate is the proportion of genuine responses that were classified as such. We were able to reduce the attack success rate to as low as 2.3% while maintaining a human success rate comparable to current CAPTCHAs.

The remainder of this paper is organized as follows. In Section II we present related work in the areas of crowdsourcing speech data, CAPTCHA challenges, and speaker verification. We then describe our vCAPTCHA architecture and components in Section III. This is followed by a thorough evaluation of the proposed system and concluding remarks in Sections IV and V, respectively.

II. RELATED WORK

A. Crowdsourcing Speech Data

There have been several efforts to collect speech data in a crowdsourced manner. One of the most straightforward ways to collect speech data is via crowdsourcing platforms like Amazon Mechanical Turk, which has been used to collect dictations of Wikipedia articles [17]. The issue with using crowdsourcing platforms is that it is difficult to ensure the quality of the collected data [18]. VoxForge [19] crowdsources transcribed speech by having people voluntarily record themselves on its website. So far VoxForge has collected dictations for 15,693 sentences across 13 languages, of which more than a third are in English. Since VoxForge's approach relies on the altruism of volunteers, the volume of data it can gather is limited. A gamified approach to speech collection, via a voice-operated quiz, was proposed in [20]. Using this system the authors were able to collect more than 25 hours of speech for European Portuguese. However, this approach is not very scalable because adapting the game to different languages would be a very tedious task.

B. CAPTCHA Variants

With conventional CAPTCHA fast becoming obsolete, new variants of CAPTCHA are in the works. Google has recently developed NoCAPTCHA [21] and Invisible CAPTCHA [22], which, instead of relying on the ability to read contorted text, use Google's Advanced Risk Analysis backend [23] to analyze various characteristics of a client's traffic, including location, IP address, and behavior on the web page, to determine whether the client is a human user. However, in cases where the Advanced Risk Analysis backend is not confident in its analysis, the user is presented with a conventional CAPTCHA. Moreover, recent attempts to spoof Google's Advanced Risk Analysis backend have been very successful, with researchers fooling the system up to 2,500 times per hour [24]. Therefore, we believe that developing another variant of CAPTCHA that is both user-friendly and secure is a worthwhile endeavor.

Speech, as an alternate input modality for CAPTCHA, has been explored, albeit sparingly, by researchers over the years. Early speech-based variants of CAPTCHA were developed for use in telephony applications [25], [26], [27] to prevent spam and to facilitate visually impaired users. Recently, in [28], the authors developed a CAPTCHA system for the Web which requires the user to read out the sentence presented on the screen. This system, however, had an unacceptable attack success rate of around 40%. A similar but improved system was proposed in [11], [29] that reduced the attack success rate to 20%. Though this is an improvement, we feel that an attack success rate of 20% is still too high. Fortuitously for us, the authors also evaluated the usability of their speech-based CAPTCHA and found that a majority of users preferred it to the conventional text-based variant.
C. Speaker Spoofing Counter Measures

Broadly speaking, speaker spoofing attacks take four forms: (1) manual impersonation, (2) speech synthesis using software, (3) replay of pre-recorded speech, and (4) digital conversion to a target voice. Only synthesis and replay are relevant for our CAPTCHA system, because they can be used by automated agents to furnish human-like speech without any human intervention.

Past works have exploited numerous features to distinguish human speech from synthetic speech, including phoneme-conjunction artifacts [29], relative phase shift [30], and temporal modulation features [31]. A detailed evaluation of these and several other features in the context of synthetic speech detection has been presented in [12]. The results show that Spectral sub-band Magnitude Coefficients (SCMC) and GMM classifiers can detect synthetic speech with an accuracy of over 90%.

Speaker-independent replay attacks are tackled in [32], [33], [34]. Several different feature extraction techniques were compared in the context of replay attack detection in [33]. The results of this study indicate that SCMCs coupled with GMMs can yield error rates as low as 11.5% in replay attack detection. Later work that applied deep-learning architectures to Mel-Frequency Cepstral Coefficients (MFCC) was able to reduce the error rate to 7.3% [34].

In vCAPTCHA, we build on these technologies in our system architecture to provide robust security against audio spoofing attacks on our system.

III. SYSTEM ARCHITECTURE

Figure 1: System Architecture

In this section, we present an overview of our vCAPTCHA system architecture, as illustrated in Fig. 1. A client web-application instance can interact with vCAPTCHA via HTTP requests. The CAPTCHA Manager acts as the interface between the web application and the other components of vCAPTCHA. The web application requests a challenge from the CAPTCHA Manager, which in turn requests a challenge sentence from the Database Manager. The Database Manager picks a sentence quasi-randomly from all the enrolled corpora based on a customizable set of rules. For example, the Database Manager could be instructed to favor sentences for which there are fewer responses in the database, or sentences which are easier, etc. Upon receiving the challenge sentence from the Database Manager, the CAPTCHA Interface constructs and forwards to the web application an HTML snippet consisting of the challenge phrase/sentence and buttons to initiate and terminate the recording of the response. We have chosen to send the challenge as a self-contained HTML snippet so that the responsibility of loading the relevant Javascript does not fall on the web application.

When the user terminates the recording, the audio signal is uploaded to the CAPTCHA Manager, which passes it on to the Speech Recognizer and the Forgery Detector. The Speech Recognizer and the Forgery Detector analyze the audio and, respectively, provide scores for how closely the content of the speech matched the challenge sentence and how "human" the speech is. These scores are passed to the Validator, which decides whether the user passed the CAPTCHA or not. The CAPTCHA Manager then forwards the verdict of the Validator to the web application and, if the user passed, it hands over the audio recording to the Database Manager to store in the database along with the challenge sentence. The Categorical Sorter in the database is an optional component that can infer additional information about the audio recording, such as the accent of the speaker and her environment.
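To make the request/response flow above concrete, the following minimal sketch shows what a challenge-serving endpoint could look like. It is illustrative only: the route name, the pick_challenge_sentence helper, the client-side function names in the snippet, and the use of Flask are our assumptions and are not taken from the vCAPTCHA implementation.

```python
# Hypothetical sketch of a CAPTCHA Manager challenge endpoint (Flask assumed).
import random
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for the Database Manager's quasi-random sentence selection.
SENTENCES = ["the quick brown fox jumps over the lazy dog"]

def pick_challenge_sentence() -> str:
    return random.choice(SENTENCES)

@app.route("/captcha/challenge")
def get_challenge():
    sentence = pick_challenge_sentence()
    # The challenge is returned as a self-contained HTML snippet so the embedding
    # web application does not need to load any extra Javascript itself; the
    # recording logic (vcaptchaStart/vcaptchaStop) would normally be inlined here.
    snippet = (
        '<div class="vcaptcha">'
        f"<p>Please read aloud: <b>{sentence}</b></p>"
        '<button onclick="vcaptchaStart()">Start recording</button>'
        '<button onclick="vcaptchaStop()">Stop and submit</button>'
        "</div>"
    )
    return jsonify({"sentence": sentence, "html": snippet})
```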
A. Forgery Detector

The Forgery Detector is responsible for determining whether the incoming audio signal was produced by a human or a computer. In designing the Forgery Detector we have two key metrics: security and computational cost. While securing against illegitimate access attempts is the primary function of our CAPTCHA, it would not be practical to deploy it on a global scale if it imposes enormous computational costs. Keeping this in mind, we have decided to trade the few additional percentage points of accuracy offered by complex deep-learning-based approaches for the powerful yet significantly simpler Gaussian Mixture Models (GMMs) as our classifier of choice to differentiate between genuine and forged speech. GMMs have been successfully employed in acoustic modeling and analysis for a variety of applications and have recently been shown to perform quite well at discriminating between natural and forged speech [12], [33], [29]. The probability distribution of a GMM modeling a class C can be defined as:

P(\vec{x} \mid C) = \sum_{i=1}^{N} w_C^i \, \mathcal{N}(\vec{x};\, \mu_C^i, \Sigma_C^i), \quad (1)

where \vec{x} is a random spectral or cepstral vector derived from an audio signal, and N is the number of components in the mixture, such that the i-th component is defined by a Gaussian distribution, \mathcal{N}(\vec{x};\, \mu_C^i, \Sigma_C^i), over \vec{x} and has a weight w_C^i. These parameters (namely the mixture weights, means, and variances) are learned from training instances of recordings using the expectation-maximization algorithm [35].

In the particular problem at hand, we need to model two classes of speech signals, genuine (H) and spoofed (¬H). We model P(\vec{x} \mid H) and P(\vec{x} \mid \neg H), the distributions of frame-based spectral features derived from genuine and spoofed speech signals, respectively, using individual Gaussian mixtures. More specifically, given an audio recording, we generate a set of feature vectors, X = \{\vec{x}_1, \ldots, \vec{x}_T\}, and compute the within-class and out-of-class log likelihoods as L(X \mid H) and L(X \mid \neg H), respectively. The likelihood ratio, i.e. L(X \mid H) - L(X \mid \neg H), is then calculated and forwarded to the Validator as the Forgery score, S_F. The forgery detection process is described in Algorithm 1, lines 3-9.

Algorithm 1 Pseudocode for user response verification.
1:  procedure GETSCORE(x, GMM)            ▷ GMM : (N, w, μ, Σ)
2:      return Σ_{i=1}^{N} w^i N(x; μ^i, Σ^i)
3:  procedure GETFORGERYSCORE(X, GMM_P, GMM_N)
4:      LL_P ← 0
5:      LL_N ← 0
6:      for x ∈ X do
7:          LL_P ← LL_P + log(getScore(x, GMM_P))
8:          LL_N ← LL_N + log(getScore(x, GMM_N))
9:      return LL_P − LL_N
10: procedure POCKETSPHINX(speech)         ▷ Transcribe the speech signal using Pocketsphinx
11:     return transcript
12: procedure LEVENSHTEIN(seq1, seq2)
13:     return numberOfEdits / length(seq2)
14: procedure GETRECOGNITIONSCORE(S, Sentence)
15:     transcript ← Pocketsphinx(S)
16:     return Levenshtein(transcript, Sentence)
17: procedure VALIDATE(S_R, S_F)
18:     if S_R ≤ t_R and S_F ≥ t_F then
19:         return Pass
20:     else
21:         return Fail
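The scoring step in Algorithm 1 (lines 3-9) maps directly onto off-the-shelf GMM implementations. The sketch below shows one way it could be realized with scikit-learn; the function names, the diagonal covariance choice, and the use of scikit-learn itself are our assumptions, not part of the vCAPTCHA codebase. Note that score_samples already returns per-frame log densities, so the log-likelihood ratio is simply the difference of their sums.

```python
# Minimal sketch of GMM-based forgery scoring (scikit-learn assumed).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(frames: np.ndarray, n_components: int = 256) -> GaussianMixture:
    """Fit one GMM on a (num_frames, num_features) matrix of cepstral vectors."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(frames)
    return gmm

def forgery_score(X: np.ndarray,
                  gmm_human: GaussianMixture,
                  gmm_spoof: GaussianMixture) -> float:
    """Return S_F = L(X|H) - L(X|notH), summed over all frames of the utterance."""
    ll_human = gmm_human.score_samples(X).sum()   # per-frame log densities under H
    ll_spoof = gmm_spoof.score_samples(X).sum()   # per-frame log densities under notH
    return ll_human - ll_spoof

# Usage: a response passes the forgery check if forgery_score(...) clears the
# empirically chosen threshold t_F (set at the equal error point, Section IV).
```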
B. Speech Recognizer

It is not sufficient to know only that the response was produced by a human; we also need to ensure that the response matches the challenge sentence. This additional check is required because it is entirely possible that, through trial and error or otherwise, attackers can identify certain audio recordings that are falsely verified as genuine by vCAPTCHA and subsequently use these recordings to bypass it.

We use Pocketsphinx [36], via its Python bindings, to transcribe the speech signal that we receive from the user. We realize that several factors, including accent, background noise, and individual pronunciation styles, can introduce minor errors in legitimate responses. Therefore, we do not expect the transcription to match the challenge sentence exactly. Instead, we use the normalized Levenshtein distance between the transcript and the challenge sentence as a metric of similarity. Levenshtein distance is widely used to quantify the similarity between pairs of strings. The normalized Levenshtein distance between the transcript and the challenge sentence is the ratio of the number of character-level edits, i.e. insertions, deletions, and substitutions, needed to convert the transcript into the challenge sentence, to the length of the challenge string. This distance is then passed to the Validator as the Recognition Score, S_R. The entire speech recognition process is described in Algorithm 1, lines 12-16.

C. Validator

The Validator receives the Forgery and Recognition scores from the Forgery Detector and the Speech Recognizer, respectively, and uses them to determine whether the user's response passed the CAPTCHA challenge. We have chosen to use empirically determined threshold values, t_R and t_F, for S_R and S_F respectively, to classify successful and unsuccessful responses, as described in Algorithm 1, lines 17-21. We use this approach because of its simplicity and its successful track record in past studies [29], [15], [16].
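As a concrete illustration of the recognition score and the thresholding step (Algorithm 1, lines 12-21), the sketch below computes the normalized Levenshtein distance and applies the two thresholds. It is a simplified stand-in: the transcription step is represented by an opaque transcribe function rather than the actual Pocketsphinx call, the t_f default is a placeholder, and the pass condition follows the score definitions given in the text (low distance, high forgery score). The t_r default of 0.77 is the value reported later in Section IV-C.

```python
# Sketch of the recognition score and validation logic; thresholds are examples.
def normalized_levenshtein(transcript: str, challenge: str) -> float:
    """Character-level edit distance divided by the challenge length."""
    m, n = len(transcript), len(challenge)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if transcript[i - 1] == challenge[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return prev[n] / max(n, 1)

def validate(recognition_score: float, forgery_score: float,
             t_r: float = 0.77, t_f: float = 0.0) -> bool:
    """Pass only if the transcript is close enough AND the speech looks human."""
    return recognition_score <= t_r and forgery_score >= t_f

# transcript = transcribe(audio)              # e.g., via Pocketsphinx bindings
# s_r = normalized_levenshtein(transcript, challenge_sentence)
# passed = validate(s_r, s_f)
```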

D. Database Manager

The Database Manager facilitates all tasks involving the storage or retrieval of data from the database, which is organized as shown in Fig. 2. By placing the Database Manager as the gatekeeper for all database transactions, we allow the implementation of more sophisticated logic for data storage and retrieval. For example, we could try to obtain an equal number of responses for all sentences by having the Database Manager serve sentences with fewer responses more frequently. We have also allowed for an optional module, called the Categorical Sorter, that analyzes the data within the database to infer additional information about it. For example, the Categorical Sorter could infer the accent of the speakers from their responses and store it along with the response utterance, thereby allowing the automatic construction of foreign-accented speech datasets.

Figure 2: Database Schematic

Table I: Dataset Overview

Subset    ASV'15 (H)   ASV'15 (¬H)   ASV'17 (H)   ASV'17 (¬H)   Total
Train_f   3750         12625         1508         1508          19391
Train_m   300          900           100          100           1400
Eval_f    9404         184000        1298         12008         203710
Eval_m    350          350           350          350           1400

Table II: Parameters for Forgery Detection

Parameter        Range
GMM components   128 to 1024
Feature Type     SCFC, SCMC, CQCC, MFCC
# filters        13 to 64
Sample Length    0.69s to 10.8s

IV. EVALUATION

We evaluate our CAPTCHA system based on three key metrics: (1) human success rate, (2) attacker success rate, and (3) utterance processing time. Ideally, we want to maximize the human success rate and minimize the attacker success rate, while keeping the utterance processing time low. All the tests presented below are performed on an Ubuntu machine with 4 processor cores of an Intel Xeon X5690 running at 3.47 GHz and 4 GB of RAM.

A. Dataset

We use two datasets to evaluate our CAPTCHA system, namely ASVspoof 2015 [15] and ASVspoof 2017 [16]. Both datasets contain samples of human speech along with spoofed samples produced using different attack methodologies. ASVspoof 2015 contains spoofed samples generated using various speech synthesis techniques, while ASVspoof 2017 contains spoofed samples for conducting replay attacks. Table I describes the constitution of our dataset and its subsets.

B. Forgery Detection

Before we finalize a classification model, we must determine the optimal values for the model parameters (given in Table II). For choosing the optimal parameter values our primary metric is the Equal Error Rate (EER), and our secondary metric is the testing time, measured as the average time required to classify one audio file. We determine the EER by plotting the false positive rate (FPR) against the false negative rate (FNR) for different classification thresholds, as shown in Fig. 3, and finding the threshold at which FPR equals FNR. To expedite the parameter search we use a mini dataset consisting of Train_m and Eval_m. The results of our experiments on the mini dataset are illustrated in Fig. 4.
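The EER search described above can be reproduced with a few lines of standard tooling. The sketch below is one illustrative approach using scikit-learn's ROC utilities (our choice, not necessarily the authors'): sweep the score threshold, compute FPR and FNR, and pick the threshold where the two curves cross.

```python
# Sketch: find the Equal Error Rate and its threshold from scores and labels.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray):
    """labels: 1 = genuine, 0 = spoofed; scores: higher = more likely genuine."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.argmin(np.abs(fpr - fnr))      # point where FPR and FNR cross
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return eer, thresholds[idx]

# eer, t_f = equal_error_rate(dev_labels, dev_forgery_scores)
```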
Figure 3: Detection Error Tradeoff curve for the 256-component GMM trained on SCMCs on the Dev_m subset.

1) Number of GMM components: As shown in Fig. 4a, the EER falls as we increase the number of GMM components. However, the marginal gain in EER is accompanied by a greater marginal increase in testing time. The marginal cost in terms of increased testing time is particularly high for 1024 components, causing us to withdraw them from consideration. The actual testing times for each GMM configuration are presented in Fig. 4b. For 128 to 512 components the testing times remain around 50ms; however, they almost double as we move to 1024. Nevertheless, even for the most complex model the testing times remain a very reasonable 100ms on our modest hardware. It is this efficiency that makes GMMs particularly attractive for a system that must score every CAPTCHA response. Based on these results we shortlist 256 and 512 as the potential numbers of components in the final model. To speed up the testing process, we use a GMM with 256 components for the experiments presented in the remainder of this section.

2) Feature Type: We compare four feature types that have been shown to work well with GMMs in the past literature, namely Spectral Sub-band Magnitude Coefficients (SCMC) [37], Spectral Sub-band Frequency Coefficients (SCFC) [37], Constant-Q Cepstral Coefficients (CQCC) [38], and Mel-Frequency Cepstral Coefficients (MFCC) [39]. While extracting the features we use a 20ms sliding window over the audio signal with a stride of 10ms for SCMC, SCFC, and MFCC. A window and stride of 8ms is used for CQCC. The resulting feature vectors are then extended with delta and double-delta coefficients [40]. As illustrated by Fig. 4e, the lowest EER is achieved by SCMC, making it the feature of choice for vCAPTCHA.
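For reference, the windowing and delta/double-delta scheme above looks roughly like the following sketch. We use MFCCs from the python_speech_features package purely for illustration, since SCMC/SCFC extractors are not available off the shelf in that library; the package choice and parameter values are our assumptions.

```python
# Sketch: 20ms window / 10ms stride cepstral features with delta and double-delta.
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import mfcc, delta

def extract_features(wav_path: str) -> np.ndarray:
    rate, signal = wav.read(wav_path)
    static = mfcc(signal, samplerate=rate, winlen=0.02, winstep=0.01,
                  numcep=13, nfilt=40)           # 40-filter filterbank, 13 cepstra
    d1 = delta(static, 2)                        # delta coefficients
    d2 = delta(d1, 2)                            # double-delta coefficients
    return np.hstack([static, d1, d2])           # (num_frames, 39) feature matrix
```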

(d) (e) (f)

Figure 4: Impact on EER and testing times of varying (a) the number of GMM components, (b) the feature type, (c) the number of filters, and (d) the length of the audio sample.

Table III: Results on the Complete Test Set

Model                    Acc (%)   TPR (%)   FPR (%)
GMM w/ 256 components    87.7      91.5      12.6
GMM w/ 512 components    86.9      97.5      13.6

3) Number of filters: During the SCMC computation a filterbank is applied to the signal. The number of filters in the filterbank determines the dimensionality of the resulting feature vector. From Fig. 4c we see that the marginal reduction in EER outstrips the marginal increase in testing time as we increase the number of filters up to 40, at which point the EER is also minimized. Moreover, from Fig. 4d we see that the testing times stay low and do not dramatically increase as we increase the number of filters. Therefore, we set the number of filters to 40 in vCAPTCHA. As we increase the number of filters to 64 we see an increase in the error rate. We hypothesize that this behavior is linked to the SCMC calculation process. After the application of the filterbank, log and Discrete Cosine Transform (DCT) operations are performed on the resultant vector. The DCT can be thought of as refactoring the feature vector so that the new features are uncorrelated with each other and ordered according to their significance. Therefore, as we increase the length of the feature vector beyond a certain point, the later features might start carrying redundant, or even confounding, information.

4) Sample Length: Fig. 4f shows how the prediction accuracy varies with the length of the audio sample. At first it appears as if sample lengths around 300ms yield higher accuracy. However, a closer look reveals that the cluster of high accuracy rates around the 300ms mark is more likely due to the distribution of sample lengths in the dataset, which has a mean of 312.2 and a standard deviation of 121.9. We confirm this by calculating the correlation between sample length and accuracy and obtaining a low value of 0.09.

5) Results on the Full Evaluation Set: Based on the results from the previous sections, our shortlisted models are 256-component and 512-component GMMs, trained on SCMC features with a 40-filter filterbank. We train these models on the data from the full training set (Train_f) and evaluate them on the full evaluation set (Eval_f). For each model, we set the Forgery threshold, t_F, to the equal error point (shown in Fig. 3) determined from the experiments in the previous section. Table III shows the accuracy, true positive rate (human success rate), and false positive rate (attack success rate) achieved by these models. We see that the 256-component GMM performs admirably, yielding an overall accuracy of 87.7%, along with a high human success rate and a low attack success rate. Interestingly enough, the 512-component GMM under-performs its smaller counterpart in identifying spoofed speech and yields a slightly higher false positive rate. Based on these results we have decided to proceed with the 256-component model for inclusion in vCAPTCHA.
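The quantities reported in Table III follow directly from a thresholded confusion matrix; the short sketch below spells out that bookkeeping (the variable names are ours).

```python
# Sketch: accuracy, TPR (human success rate) and FPR (attack success rate)
# from forgery scores thresholded at t_F. labels: 1 = genuine, 0 = spoofed.
import numpy as np

def summarize(labels: np.ndarray, scores: np.ndarray, t_f: float):
    pred = (scores >= t_f).astype(int)           # predicted "genuine" decisions
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    tn = np.sum((pred == 0) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    accuracy = (tp + tn) / len(labels)
    tpr = tp / (tp + fn)                         # human success rate
    fpr = fp / (fp + tn)                         # attack success rate
    return accuracy, tpr, fpr
```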
Table IV: Dataset composition for speech recognition evaluation

Subset   Positive   Negative
Dev      2262       2025
Eval     754        675

C. Speech Recognition

Since we are using Pocketsphinx for speech recognition, we can use one of its pre-built models and do not have to train our own; however, we still have to empirically determine the Recognition threshold, t_R. To determine t_R we use the training subset of ASVspoof 2017. For each sentence, s, in the subset, we take 302 utterances of it (the positive utterances) along with 30 utterances from each of the remaining sentences (the negative utterances), get the transcripts for these utterances from Pocketsphinx, and compute the normalized Levenshtein distance between each transcript and s. We then split this set of distances into two subsets, Dev and Eval, as shown in Table IV.

We then use the normalized Levenshtein distances obtained from the positive and negative utterances in the Dev set to determine the threshold (t_R = 0.77) at which the EER (19.6%) is obtained. We then use this threshold to classify the distance values in the Eval set, obtaining a true positive rate of 80.6% and a false positive rate of 18.5%. We also measured the average time required to process one file to be around 650ms.

D. Combining Speech Recognition with Forgery Detection

In this section, we combine the forgery detection and speech recognition results from the preceding sections to determine vCAPTCHA's human and attack success rates. Since passing the test requires an agent to pass both the forgery detector and the speech recognizer, we calculate the human success rate (HSR), the best-case attack success rate (ASR_bestCase), and the usual-case attack success rate (ASR_usualCase) as

HSR = TPR_S \times TPR_F \quad (2)
ASR_{bestCase} = FPR_S \times FPR_F \quad (3)
ASR_{usualCase} = TPR_S \times FPR_F \quad (4)

where TPR_S and TPR_F are the true positive rates for speech recognition and forgery detection, respectively, and FPR_S and FPR_F are the false positive rates for speech recognition and forgery detection, respectively. ASR_bestCase represents the best-case scenario in which the attacker is unable to read the presented text properly and the attack utterance does not match the challenge sentence. However, as mentioned above, visual distortions are becoming easier to bypass, so in the more likely scenario, ASR_usualCase, the attacker is able to read the challenge sentence correctly and produce a synthetic utterance that matches it. Based on this formulation, vCAPTCHA achieves a human success rate of 73.7%, an attack success rate of 2.3% in the best case, and an attack success rate of 10.2% in the usual case. While we achieve a very impressive attack success rate, the human success rate is lower than we would have liked. We suspect that this is because ASVspoof 2017 contains replay recordings of speech with environmental sounds digitally added to them. Due to these factors the samples might be noisy, which might have led to errors during speech recognition.
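As a sanity check (our own arithmetic), plugging the component-level rates reported above (TPR_S = 80.6%, FPR_S = 18.5% for speech recognition; TPR_F = 91.5%, FPR_F = 12.6% for the 256-component forgery detector) into equations (2)-(4) reproduces the headline numbers:

HSR = 0.806 × 0.915 ≈ 0.737 → 73.7%
ASR_bestCase = 0.185 × 0.126 ≈ 0.023 → 2.3%
ASR_usualCase = 0.806 × 0.126 ≈ 0.102 → 10.2%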
V. CONCLUSION AND FUTURE WORK

In this paper we have proposed our speech-based vCAPTCHA system, which would more effectively secure web resources from bots while minimizing the inconvenience to legitimate human users. We have evaluated vCAPTCHA on two recent databases, namely ASVspoof 2015 and ASVspoof 2017. On these datasets, vCAPTCHA achieved a human success rate of 73.7% and an attack success rate of 10.2% in the usual case (2.3% in the best case). During the course of our evaluation we found that 512-component GMMs, which are common in past works, do not offer a significant performance upgrade over their simpler 256-component counterparts. We also observed that SCMCs with 40 filters offer the best classification performance, while the length of the audio sample has no observable effect.

Moving forward, our next step would be to deploy vCAPTCHA with an actual web service in order to gain insights about user experiences and start actually collecting data. Of course, this step would entail moving from the current prototype implementation to a fully deployable one. After deployment, we would be particularly interested in observing the quality and characteristics of the data we gather and how its addition to well-known corpora would affect the performance of speech recognition systems trained on them.

REFERENCES

[1] L. Von Ahn, M. Blum, N. J. Hopper, and J. Langford, "Captcha: Using hard ai problems for security," in International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 2003, pp. 294–311.
[2] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet, "Multi-digit number recognition from street view imagery using deep convolutional neural networks," arXiv preprint arXiv:1312.6082, 2013.
[3] H. Abdelnasser, M. Youssef, and K. A. Harras, "Magboard: Magnetic-based ubiquitous homomorphic off-the-shelf keyboard," in SECON. IEEE, 2016, pp. 1–9.
[4] H. Abdelnasser, K. A. Harras, and M. Youssef, "Ubibreathe: A ubiquitous non-invasive wifi-based breathing estimator," in MobiHoc. ACM, 2015, pp. 277–286.
[5] A. Essameldin and K. A. Harras, "The hive: An edge-based middleware solution for resource sharing in the internet of things," in Smart Objects MobiCom Workshop. ACM, 2017.
[6] M. Ibrahim, M. Gruteser, K. A. Harras, and M. Youssef, "Over-the-air tv detection using mobile devices," in ICCCN. IEEE, 2017.
[7] H.-J. Hong, T. El-Ganainy, C.-H. Hsu, K. A. Harras, and M. Hefeeda, "Disseminating multilayer multimedia content over challenged networks," IEEE Transactions on Multimedia, vol. 20, no. 2, pp. 345–360, 2018.
[8] "Mobile marketing statistics compilation," May 2017. [Online]. Available: https://www.smartinsights.com/mobile-marketing/mobile-marketing-analytics/mobile-marketing-statistics/
[9] "Omg! mobile voice survey reveals teens love to talk," Oct 2014. [Online]. Available: https://googleblog.blogspot.qa/2014/10/omg-mobile-voice-survey-reveals-teens.html
[10] S. Ruan, J. O. Wobbrock, K. Liou, A. Ng, and J. Landay, "Speech is 3x faster than typing for english and mandarin text entry on mobile devices," arXiv preprint arXiv:1608.07323, 2016.
[11] S. Shirali-Shahreza, G. Penn, R. Balakrishnan, and Y. Ganjali, "Seesay and hearsay captcha for mobile interaction," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2013, pp. 2147–2156.
[12] M. Sahidullah, T. Kinnunen, and C. Hanilçi, "A comparison of features for synthetic speech detection," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[13] D. Vergyri, L. Lamel, and J.-L. Gauvain, "Automatic speech recognition of multiple accented english data," in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[14] T. Fraga-Silva, J.-L. Gauvain, and L. Lamel, "Speech recognition of multiple accented english data using acoustic model interpolation," in Signal Processing Conference (EUSIPCO), 2014 Proceedings of the 22nd European. IEEE, 2014, pp. 1781–1785.
[15] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, "Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[16] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, "The asvspoof 2017 challenge: Assessing the limits of replay spoofing attack detection," 2017.
[17] S. Novotney and C. Callison-Burch, "Shared task: crowdsourced accessibility elicitation of wikipedia articles," in Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. Association for Computational Linguistics, 2010, pp. 41–44.
[18] C. Callison-Burch and M. Dredze, "Creating speech and language data with amazon's mechanical turk," in Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. Association for Computational Linguistics, 2010, pp. 1–12.
[19] "www.voxforge.org."
[20] J. Freitas, A. Calado, D. Braga, P. Silva, and M. Dias, "Crowdsourcing platform for large-scale speech data collection," Proc. FALA, 2010.
[21] "security.googleblog.com/2014/12/are-you-robot-introducing-no-captcha.html."
[22] "www.google.com/recaptcha/intro/invisible.html."
[23] "security.googleblog.com/2013/10/recaptcha-just-got-easier-but-only-if.html."
[24] S. Sivakorn, J. Polakis, and A. D. Keromytis, "I'm not a human: Breaking the google recaptcha."
[25] Y. Soupionis, G. Tountas, and D. Gritzalis, "Audio captcha for sip-based voip," SEC, vol. 297, pp. 25–38, 2009.
[26] D. Gritzalis, Y. Soupionis, V. Katos, I. Psaroudakis, P. Katsaros, and A. Mentis, "The sphinx enigma in critical voip infrastructures: Human or botnet?" in Information, Intelligence, Systems and Applications (IISA), 2013 Fourth International Conference on. IEEE, 2013, pp. 1–6.
[27] A. Markkola and J. Lindqvist, "Accessible voice captchas for internet telephony," in Symposium on Accessible Privacy and Security (SOAPS), 2008, pp. 1–2.
[28] H. Gao, H. Liu, D. Yao, X. Liu, and U. Aickelin, "An audio captcha to distinguish humans from computers," in Electronic Commerce and Security (ISECS), 2010 Third International Symposium on. IEEE, 2010, pp. 265–269.
[29] S. Shirali-Shahreza, Y. Ganjali, and R. Balakrishnan, "Verifying human users in speech-based interactions," in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[30] P. L. De Leon, M. Pucher, J. Yamagishi, I. Hernaez, and I. Saratxaga, "Evaluation of speaker verification security and detection of hmm-based synthetic speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 8, pp. 2280–2290, 2012.
[31] Z. Wu, X. Xiao, E. S. Chng, and H. Li, "Synthetic speech detection using temporal modulation feature," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 7234–7238.
[32] J. Villalba and E. Lleida, "Detecting replay attacks from far-field recordings on speaker verification systems," in European Workshop on Biometrics and Identity Management. Springer, 2011, pp. 274–285.
[33] R. Font, J. M. Espín, and M. J. Cano, "Experimental analysis of features for replay attack detection–results on the asvspoof 2017 challenge," Proc. Interspeech 2017, pp. 7–11, 2017.
[34] G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin, "Audio replay attack detection with deep learning frameworks," Interspeech 2017, 2017.
[35] T. K. Moon, "The expectation-maximization algorithm," IEEE Signal Processing Magazine, vol. 13, no. 6, pp. 47–60, 1996.
[36] D. Huggins-Daines, M. Kumar, A. Chan, A. W. Black, M. Ravishankar, and A. I. Rudnicky, "Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices," in Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, vol. 1. IEEE, 2006, pp. I–I.
[37] J. M. K. Kua, T. Thiruvaran, M. Nosratighods, E. Ambikairajah, and J. Epps, "Investigation of spectral centroid magnitude and frequency for speaker recognition," in Odyssey, 2010, p. 7.
[38] M. Todisco, H. Delgado, and N. Evans, "A new feature for automatic speaker verification anti-spoofing: Constant q cepstral coefficients," in Speaker Odyssey Workshop, Bilbao, Spain, vol. 25, 2016, pp. 249–252.
[39] S. Molau, M. Pitz, R. Schluter, and H. Ney, "Computing mel-frequency cepstral coefficients on the power spectrum," in Acoustics, Speech, and Signal Processing, 2001. Proceedings (ICASSP'01). 2001 IEEE International Conference on, vol. 1. IEEE, 2001, pp. 73–76.
[40] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, p. 79, 1995.