
Hitting Three Birds With One System: A Voice-based CAPTCHA For the Modern User

Muhammad A. Shah, Carnegie Mellon University, [email protected]
Khaled A. Harras, Carnegie Mellon University, [email protected]

Abstract—CAPTCHA challenges are used all over the Internet to prevent automated scripts from abusing web services. However, recent technological developments have rendered the conventional CAPTCHA insecure and inconvenient to use. In this paper, we propose vCAPTCHA, a voice-based CAPTCHA system that would: (1) enable more secure human authentication, (2) integrate more conveniently with modern devices accessing web services, and (3) help collect vast amounts of annotated speech data for languages, accents, and dialects that are under-represented in current speech corpora, thus making speech technologies accessible to more people around the world. vCAPTCHA requires users to speak their responses, rather than type them, in order to unlock or use different web services. These responses are analyzed to determine whether they were indeed naturally produced, and then transcribed to ensure that they contain the challenge sentence. We build a prototype of vCAPTCHA in order to assess its performance and practicality. Our preliminary results on the ASVspoof datasets show that we are able to achieve an attack success rate as low as 2.3% while maintaining a human success rate comparable to current CAPTCHAs.

Keywords-CAPTCHA; Speech Collection

I. INTRODUCTION

There are rising doubts about the effectiveness and security of CAPTCHA, the Completely Automated Public Turing Test to tell Computers and Humans Apart [1]. CAPTCHA and ReCAPTCHA systems are widely used on the Web, mostly by web services, to differentiate between human users and automated scripts. The most common form of CAPTCHA relies on the limitations of computer vision techniques by presenting users with a short, visually distorted string and asking them to type out its contents. The emerging concern, however, is that modern optical character recognition techniques can solve the most difficult variants of Google's ReCAPTCHA with more than 99% accuracy [2].

In addition to these security concerns, the growing adoption of portable/mobile devices, with their increased variety of usage as pervasive sensors for many applications [3], [4], [5], [6], [7], has rendered CAPTCHA challenges less convenient to use. The number of smartphone users quadrupled between 2007 and 2014, outstripping desktop users [8]; users now spend 70% of their screen time on mobile devices, and more search queries originate from mobile devices than from desktop devices (https://adwords.googleblog.com/2015/05/building-for-next-moment.html). This shift in user preference means that users now interact with their devices in a wide variety of dynamic environments to which conventional input modalities may not be well suited. While touch input is currently the dominant modality for mobile devices, it has its fair share of shortcomings, such as when the screen is small, the user is wearing gloves, or the user's attention is divided. These factors make solving CAPTCHA challenges on mobile devices an arduous task.

Given the recent advancements in speech processing technologies, we see speech as a very promising modality for CAPTCHA systems, one that could solve the aforementioned usability and security problems that plague current systems. With more than 70% of users finding speech to be more efficient than typing [9], [10], and 20% of mobile searches happening via voice (https://searchengineland.com/google-reveals-20-percent-queries-voice-queries-249917), speech seems poised to become the dominant input modality for mobile devices. This partiality to speech input carries over to CAPTCHA solving as well [11]. In addition, speech synthesis techniques have not matured enough to accurately replicate the human voice: synthesized speech usually contains artifacts that machine learning techniques can identify with more than 90% accuracy [12].

While we believe speech-based CAPTCHA solutions are well-positioned to replace their text-based counterparts and address the concerns mentioned above, the lack of diverse speech corpora poses a major challenge for their wide-scale deployment. Despite its promise, current speech recognition caters only to a limited user base. Due to insufficient data, popular speech-based personal assistants, such as Google Assistant and Siri, support only 8 and 13 languages respectively, out of roughly 6900 languages in the world (https://www.ethnologue.com/). Even for the popular languages, the available data might not be diverse enough to model a wide range of accents and dialects. English corpora, for instance, are dominated by the US accent while having very little representation of common non-native accents such as Middle-Eastern and Indian ones. Consequently, systems trained on these corpora yield high error rates on non-native accents [13], [14]. With users hailing from all over the world, a speech-based CAPTCHA could allow for the collection of data for a variety of languages, dialects, and accents, which in turn would make speech input applications in general accessible to more people.

In this paper, we propose vCAPTCHA, a secure and convenient voice-based CAPTCHA that enables the collection of diverse speech data at a global scale. A user is presented with a short challenge text, which he/she responds to by speaking out the displayed text rather than typing it in. Since vCAPTCHA uses the human-ness of speech to differentiate humans from bots, it obviates the need to visually distort the text displayed to the user. This makes solving our CAPTCHA even more convenient and less cognitively taxing for human users, while still presenting a significant challenge to bots. Text corpora can be uploaded to our system; from these corpora it randomly provides challenge sentences to web applications upon request. Upon receiving the user's response, vCAPTCHA uses feature extraction and machine learning techniques from the speaker verification literature to analyze the audio signal and determine whether it was indeed naturally produced. To ensure that the speech signal was produced in response to the given challenge sentence, vCAPTCHA also uses commodity speech recognition software to transcribe the audio. If the transcription and the challenge sentence are similar within a threshold, they are considered a match. If both of the aforementioned tests pass, vCAPTCHA certifies the legitimacy of the user. Utterances that successfully pass the CAPTCHA are stored in the database to become part of a speech corpus in the future.

We implement a prototype of vCAPTCHA (source code at https://github.com/shinigami1494/speech captcha) and evaluate its different components using large speech datasets [15], [16]. These datasets include more than 200,000 speech samples, both spoofed and genuine, on which we measure the attack and human success rates. The attack success rate is the proportion of spoofed responses that were incorrectly considered genuine, while the human success rate is the proportion of genuine responses that were classified as such. We were able to reduce the attack success rate to as low as 2.3% while maintaining a human success rate comparable to current CAPTCHAs.

The remainder of this paper is organized as follows. In Section II we present related work in the areas of crowdsourcing speech data, CAPTCHA challenges, and speaker verification. We then describe our vCAPTCHA architecture and components in Section III. This is followed by a thorough evaluation of the proposed system and concluding remarks in Sections IV and V, respectively.

II. RELATED WORK

A. Crowdsourcing Speech Data

There have been several efforts to collect speech data in a crowdsourced manner. One of the most straightforward ways to collect speech data is via crowdsourcing platforms like Amazon Mechanical Turk, which has been used to collect dictations of Wikipedia articles [17]. The issue with using crowdsourcing platforms is that it is difficult to ensure the quality of the collected data [18]. VoxForge [19] crowdsources transcribed speech by having people voluntarily record themselves on its website. So far VoxForge has collected dictations for 15,693 sentences across 13 languages, of which more than a third are in English. Since VoxForge's approach relies on the altruism of volunteers, the volume of data it can gather is limited. A gamified approach to speech collection, via a voice-operated quiz, was proposed in [20]. Using this system the authors were able to collect more than 25 hours of speech for European Portuguese. However, this approach is not very scalable because adapting the game to different languages would be a very tedious task.

B. CAPTCHA Variants

With conventional CAPTCHA fast becoming obsolete, new variants of CAPTCHA are in the works. Google has recently developed NoCAPTCHA [21] and Invisible CAPTCHA [22], which, instead of relying on the ability to read contorted text, use Google's Advanced Risk Analysis backend [23] to analyze various characteristics of a client's traffic, including location, IP address, and behavior on the web page, to determine whether the client is a human user. However, in cases where the Advanced Risk Analysis backend is not confident in its analysis, the user is presented with a conventional CAPTCHA. Moreover, recent attempts to spoof Google's Advanced Risk Analysis backend have been very successful, with researchers fooling the system up to 2,500 times per hour [24]. Therefore, we believe that developing another variant of CAPTCHA that is both user-friendly and secure is a worthwhile endeavor.

Speech, as an alternate input modality for CAPTCHA, has been explored, albeit sparingly, by researchers over the years. Early speech-based variants of CAPTCHA were developed for use in telephony applications [25], [26], [27] to prevent spam and to facilitate visually impaired users. Recently, in [28], the authors developed a CAPTCHA system for the Web which requires the user to read out the sentence presented on the screen. This system, however, had an unacceptable attack success rate of around 40%. A similar but improved system was proposed in [11], [29] that reduced the attack success rate to 20%. Though this is an improvement, we feel that an attack success rate of 20% is still too high. Fortuitously for us, the authors also evaluated the usability of their speech-based CAPTCHA and found that a majority of users preferred it to the conventional text-based variant.
C. Speaker Spoofing Counter Measures

Broadly speaking, speaker spoofing attacks take four forms: (1) manual impersonation, (2) speech synthesis using software, (3) replay of pre-recorded speech, and (4) digital conversion to a target voice. Only synthesis and replay are relevant for our CAPTCHA system, because they can be used by automated agents to furnish human-like speech without any human intervention.

Past works have exploited numerous features to distinguish human speech from synthetic speech, including phoneme-conjunction artifacts [29], relative phase shift [30], and temporal modulation features [31]. A detailed evaluation of these and several other features in the context of synthetic speech detection has been presented in [12]. The results show that Spectral sub-band Magnitude Coefficients (SCMC) and GMM classifiers can detect synthetic speech with an accuracy of over 90%.

Speaker-independent replay attacks are tackled in [32], [33], [34]. Several different feature extraction techniques were compared in the context of replay attack detection in [33]. The results of this study indicate that SCMCs coupled with GMMs can yield error rates as low as 11.5% in replay attack detection. Later work that applied deep-learning architectures to Mel-Frequency Cepstral Coefficients (MFCC) was able to reduce the error rate to 7.3% [34].

In vCAPTCHA, we build on these technologies in our system architecture to provide robust security against audio spoofing attacks on our system.

III. SYSTEM ARCHITECTURE

Figure 1: System Architecture

In this section, we present an overview of our vCAPTCHA system architecture, as illustrated in Fig. 1. A client web-application instance can interact with vCAPTCHA via HTTP requests. The CAPTCHA Manager acts as the interface between the web application and the other components of vCAPTCHA. The web application requests a challenge from the CAPTCHA Manager, which in turn requests a challenge sentence from the Database Manager. The Database Manager picks a sentence quasi-randomly from all the enrolled corpora based on a customizable set of rules. For example, the Database Manager could be instructed to favor sentences for which there are fewer responses in the database, or sentences which are easier, etc. Upon receiving the challenge sentence from the Database Manager, the CAPTCHA Interface constructs and forwards to the web application an HTML snippet consisting of the challenge phrase/sentence and buttons to initiate and terminate the recording of the response. We have chosen to send the challenge as a self-contained HTML snippet so that the responsibility of loading the relevant Javascript does not fall on the web application.

When the user terminates the recording, the audio signal is uploaded to the CAPTCHA Manager, which passes it on to the Speech Recognizer and the Forgery Detector. The Speech Recognizer and the Forgery Detector analyze the audio and, respectively, provide scores for how closely the content of the speech matched the challenge sentence and how "human" the speech is. These scores are passed to the Validator, which decides whether the user passed the CAPTCHA or not. The CAPTCHA Manager then forwards the verdict of the Validator to the web application and, if the user passed, it hands over the audio recording to the Database Manager to store in the database along with the challenge sentence. The Categorical Sorter in the database is an optional component that can infer additional information about the audio recording, such as the accent of the speaker and her environment.
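To make the request/response flow above concrete, the following minimal sketch shows what a challenge-serving endpoint could look like. It is illustrative only: the route name, the pick_challenge_sentence helper, the client-side function names in the snippet, and the use of Flask are our assumptions and are not taken from the vCAPTCHA implementation.

```python
# Hypothetical sketch of a CAPTCHA Manager challenge endpoint (Flask assumed).
import random
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for the Database Manager's quasi-random sentence selection.
SENTENCES = ["the quick brown fox jumps over the lazy dog"]

def pick_challenge_sentence() -> str:
    return random.choice(SENTENCES)

@app.route("/captcha/challenge")
def get_challenge():
    sentence = pick_challenge_sentence()
    # The challenge is returned as a self-contained HTML snippet so the embedding
    # web application does not need to load any extra Javascript itself; the
    # recording logic (vcaptchaStart/vcaptchaStop) would normally be inlined here.
    snippet = (
        '<div class="vcaptcha">'
        f"<p>Please read aloud: <b>{sentence}</b></p>"
        '<button onclick="vcaptchaStart()">Start recording</button>'
        '<button onclick="vcaptchaStop()">Stop and submit</button>'
        "</div>"
    )
    return jsonify({"sentence": sentence, "html": snippet})
```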
A. Forgery Detector

The Forgery Detector is responsible for determining whether the incoming audio signal was produced by a human or a computer. In designing the Forgery Detector we have two key metrics: security and computational cost. While securing against illegitimate access attempts is the primary function of our CAPTCHA, it would not be practical to deploy it on a global scale if it imposes enormous computational costs. Keeping this in mind, we have decided to trade the few additional percentage points of accuracy offered by complex deep-learning-based approaches for the powerful yet significantly simpler Gaussian Mixture Models (GMMs) as our classifier of choice to differentiate between genuine and forged speech. GMMs have been successfully employed in acoustic modeling and analysis for a variety of applications and have recently been shown to perform quite well at discriminating between natural and forged speech [12], [33], [29]. The probability distribution of a GMM modeling a class C can be defined as:

P(\vec{x} \mid C) = \sum_{i=1}^{N} w_C^i \, \mathcal{N}(\vec{x};\, \mu_C^i, \Sigma_C^i), \quad (1)

where \vec{x} is a random spectral or cepstral vector derived from an audio signal, and N is the number of components in the mixture, such that the i-th component is defined by a Gaussian distribution, \mathcal{N}(\vec{x};\, \mu_C^i, \Sigma_C^i), over \vec{x} and has a weight w_C^i. These parameters (namely the mixture weights, means, and variances) are learned from training instances of recordings using the expectation-maximization algorithm [35].

In the particular problem at hand, we need to model two classes of speech signals, genuine (H) and spoofed (¬H). We model P(\vec{x} \mid H) and P(\vec{x} \mid \neg H), the distributions of frame-based spectral features derived from genuine and spoofed speech signals, respectively, using individual Gaussian mixtures. More specifically, given an audio recording, we generate a set of feature vectors, X = \{\vec{x}_1, \ldots, \vec{x}_T\}, and compute the within-class and out-of-class log likelihoods as L(X \mid H) and L(X \mid \neg H), respectively. The likelihood ratio, i.e. L(X \mid H) - L(X \mid \neg H), is then calculated and forwarded to the Validator as the Forgery score, S_F. The forgery detection process is described in Algorithm 1, lines 3-9.

Algorithm 1 Pseudocode for user response verification.
1:  procedure GETSCORE(x, GMM)            ▷ GMM : (N, w, μ, Σ)
2:      return Σ_{i=1}^{N} w^i N(x; μ^i, Σ^i)
3:  procedure GETFORGERYSCORE(X, GMM_P, GMM_N)
4:      LL_P ← 0
5:      LL_N ← 0
6:      for x ∈ X do
7:          LL_P ← LL_P + log(getScore(x, GMM_P))
8:          LL_N ← LL_N + log(getScore(x, GMM_N))
9:      return LL_P − LL_N
10: procedure POCKETSPHINX(speech)         ▷ Transcribe the speech signal using Pocketsphinx
11:     return transcript
12: procedure LEVENSHTEIN(seq1, seq2)
13:     return numberOfEdits / length(seq2)
14: procedure GETRECOGNITIONSCORE(S, Sentence)
15:     transcript ← Pocketsphinx(S)
16:     return Levenshtein(transcript, Sentence)
17: procedure VALIDATE(S_R, S_F)
18:     if S_R ≤ t_R and S_F ≥ t_F then
19:         return Pass
20:     else
21:         return Fail
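The scoring step in Algorithm 1 (lines 3-9) maps directly onto off-the-shelf GMM implementations. The sketch below shows one way it could be realized with scikit-learn; the function names, the diagonal covariance choice, and the use of scikit-learn itself are our assumptions, not part of the vCAPTCHA codebase. Note that score_samples already returns per-frame log densities, so the log-likelihood ratio is simply the difference of their sums.

```python
# Minimal sketch of GMM-based forgery scoring (scikit-learn assumed).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(frames: np.ndarray, n_components: int = 256) -> GaussianMixture:
    """Fit one GMM on a (num_frames, num_features) matrix of cepstral vectors."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(frames)
    return gmm

def forgery_score(X: np.ndarray,
                  gmm_human: GaussianMixture,
                  gmm_spoof: GaussianMixture) -> float:
    """Return S_F = L(X|H) - L(X|notH), summed over all frames of the utterance."""
    ll_human = gmm_human.score_samples(X).sum()   # per-frame log densities under H
    ll_spoof = gmm_spoof.score_samples(X).sum()   # per-frame log densities under notH
    return ll_human - ll_spoof

# Usage: a response passes the forgery check if forgery_score(...) clears the
# empirically chosen threshold t_F (set at the equal error point, Section IV).
```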
B. Speech Recognizer

It is not sufficient to know only that the response was produced by a human; we also need to ensure that the response matches the challenge sentence. This additional check is required because it is entirely possible that, through trial and error or otherwise, attackers can identify certain audio recordings that are falsely verified as genuine by vCAPTCHA and subsequently use these recordings to bypass it.

We use Pocketsphinx [36], via its Python bindings, to transcribe the speech signal that we receive from the user. We realize that several factors, including accent, background noise, and individual pronunciation styles, can introduce minor errors in legitimate responses. Therefore, we do not expect the transcription to match the challenge sentence exactly. Instead, we use the normalized Levenshtein distance between the transcript and the challenge sentence as a metric of similarity. Levenshtein distance is widely used to quantify the similarity between pairs of strings. The normalized Levenshtein distance between the transcript and the challenge sentence is the ratio of the number of character-level edits, i.e. insertions, deletions, and substitutions, needed to convert the transcript into the challenge sentence, to the length of the challenge string. This distance is then passed to the Validator as the Recognition Score, S_R. The entire speech recognition process is described in Algorithm 1, lines 12-16.

C. Validator

The Validator receives the Forgery and Recognition scores from the Forgery Detector and the Speech Recognizer, respectively, and uses them to determine whether the user's response passed the CAPTCHA challenge. We have chosen to use empirically determined threshold values, t_R and t_F, for S_R and S_F respectively, to classify successful and unsuccessful responses, as described in Algorithm 1, lines 17-21. We use this approach because of its simplicity and its successful track record in past studies [29], [15], [16].
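As a concrete illustration of the recognition score and the thresholding step (Algorithm 1, lines 12-21), the sketch below computes the normalized Levenshtein distance and applies the two thresholds. It is a simplified stand-in: the transcription step is represented by an opaque transcribe function rather than the actual Pocketsphinx call, the t_f default is a placeholder, and the pass condition follows the score definitions given in the text (low distance, high forgery score). The t_r default of 0.77 is the value reported later in Section IV-C.

```python
# Sketch of the recognition score and validation logic; thresholds are examples.
def normalized_levenshtein(transcript: str, challenge: str) -> float:
    """Character-level edit distance divided by the challenge length."""
    m, n = len(transcript), len(challenge)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if transcript[i - 1] == challenge[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return prev[n] / max(n, 1)

def validate(recognition_score: float, forgery_score: float,
             t_r: float = 0.77, t_f: float = 0.0) -> bool:
    """Pass only if the transcript is close enough AND the speech looks human."""
    return recognition_score <= t_r and forgery_score >= t_f

# transcript = transcribe(audio)              # e.g., via Pocketsphinx bindings
# s_r = normalized_levenshtein(transcript, challenge_sentence)
# passed = validate(s_r, s_f)
```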

D. Database Manager

The Database Manager facilitates all tasks involving the storage or retrieval of data from the database, which is organized as shown in Fig. 2. By placing the Database Manager as the gatekeeper for all database transactions, we allow the implementation of more sophisticated logic for data storage and retrieval. For example, we could try to obtain an equal number of responses for all sentences by having the Database Manager serve sentences with fewer responses more frequently. We have also allowed for an optional module, called the Categorical Sorter, that analyzes the data within the database to infer additional information about it. For example, the Categorical Sorter could infer the accent of the speakers from their responses and store it along with the response utterance, thereby allowing the automatic construction of foreign-accented speech datasets.

Figure 2: Database Schematic

Table I: Dataset Overview

Subset    ASV'15 (H)   ASV'15 (¬H)   ASV'17 (H)   ASV'17 (¬H)   Total
Train_f   3750         12625         1508         1508          19391
Train_m   300          900           100          100           1400
Eval_f    9404         184000        1298         12008         203710
Eval_m    350          350           350          350           1400

Table II: Parameters for Forgery Detection

Parameter        Range
GMM components   128 to 1024
Feature Type     SCFC, SCMC, CQCC, MFCC
# filters        13 to 64
Sample Length    0.69s to 10.8s

IV. EVALUATION

We evaluate our CAPTCHA system based on three key metrics: (1) human success rate, (2) attacker success rate, and (3) utterance processing time. Ideally, we want to maximize the human success rate and minimize the attacker success rate, while keeping the utterance processing time low. All the tests presented below are performed on an Ubuntu machine with 4 processor cores of an Intel Xeon X5690 running at 3.47 GHz and 4 GB of RAM.

A. Dataset

We use two datasets to evaluate our CAPTCHA system, namely ASVspoof 2015 [15] and ASVspoof 2017 [16]. Both datasets contain samples of human speech along with spoofed samples produced using different attack methodologies. ASVspoof 2015 contains spoofed samples generated using various speech synthesis techniques, while ASVspoof 2017 contains spoofed samples for conducting replay attacks. Table I describes the constitution of our dataset and its subsets.

B. Forgery Detection

Before we finalize a classification model, we must determine the optimal values for the model parameters (given in Table II). For choosing the optimal parameter values our primary metric is the Equal Error Rate (EER), and our secondary metric is the testing time, measured as the average time required to classify one audio file. We determine the EER by plotting the false positive rate (FPR) against the false negative rate (FNR) for different classification thresholds, as shown in Fig. 3, and finding the threshold at which FPR equals FNR. To expedite the parameter search we use a mini dataset consisting of Train_m and Eval_m. The results of our experiments on the mini dataset are illustrated in Fig. 4.
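The EER search described above can be reproduced with a few lines of standard tooling. The sketch below is one illustrative approach using scikit-learn's ROC utilities (our choice, not necessarily the authors'): sweep the score threshold, compute FPR and FNR, and pick the threshold where the two curves cross.

```python
# Sketch: find the Equal Error Rate and its threshold from scores and labels.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray):
    """labels: 1 = genuine, 0 = spoofed; scores: higher = more likely genuine."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.argmin(np.abs(fpr - fnr))      # point where FPR and FNR cross
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return eer, thresholds[idx]

# eer, t_f = equal_error_rate(dev_labels, dev_forgery_scores)
```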
Figure 3: Detection Error Tradeoff curve for the 256-component GMM trained on SCMCs on the Dev_m subset.

1) Number of GMM components: As shown in Fig. 4a, the EER falls as we increase the number of GMM components. However, the marginal gain in EER is accompanied by a greater marginal increase in testing time. The marginal cost in terms of increased testing time is particularly high for 1024 components, causing us to withdraw them from consideration. The actual testing times for each GMM configuration are presented in Fig. 4b. For 128 to 512 components the testing times remain around 50ms; however, they almost double as we move to 1024. Nevertheless, even for the most complex model the testing times remain a very reasonable 100ms on our modest hardware. It is this efficiency that makes GMMs particularly attractive for a system that must score every CAPTCHA response. Based on these results we shortlist 256 and 512 as the potential numbers of components in the final model. To speed up the testing process, we use a GMM with 256 components for the experiments presented in the remainder of this section.

2) Feature Type: We compare four feature types that have been shown to work well with GMMs in the past literature, namely Spectral Sub-band Magnitude Coefficients (SCMC) [37], Spectral Sub-band Frequency Coefficients (SCFC) [37], Constant-Q Cepstral Coefficients (CQCC) [38], and Mel-Frequency Cepstral Coefficients (MFCC) [39]. While extracting the features we use a 20ms sliding window over the audio signal with a stride of 10ms for SCMC, SCFC, and MFCC. A window and stride of 8ms is used for CQCC. The resulting feature vectors are then extended with delta and double-delta coefficients [40]. As illustrated by Fig. 4e, the lowest EER is achieved by SCMC, making it the feature of choice for vCAPTCHA.
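For reference, the windowing and delta/double-delta scheme above looks roughly like the following sketch. We use MFCCs from the python_speech_features package purely for illustration, since SCMC/SCFC extractors are not available off the shelf in that library; the package choice and parameter values are our assumptions.

```python
# Sketch: 20ms window / 10ms stride cepstral features with delta and double-delta.
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import mfcc, delta

def extract_features(wav_path: str) -> np.ndarray:
    rate, signal = wav.read(wav_path)
    static = mfcc(signal, samplerate=rate, winlen=0.02, winstep=0.01,
                  numcep=13, nfilt=40)           # 40-filter filterbank, 13 cepstra
    d1 = delta(static, 2)                        # delta coefficients
    d2 = delta(d1, 2)                            # double-delta coefficients
    return np.hstack([static, d1, d2])           # (num_frames, 39) feature matrix
```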

(d) (e) (f)

Figure 4: Impact on EER and testing times of varying (a) the number of GMM components, (b) the feature type, (c) the number of filters, and (d) the length of the audio sample.

Table III: Results on the Complete Test Set

Model                    Acc (%)   TPR (%)   FPR (%)
GMM w/ 256 components    87.7      91.5      12.6
GMM w/ 512 components    86.9      97.5      13.6

3) Number of filters: During the SCMC computation a filterbank is applied to the signal. The number of filters in the filterbank determines the dimensionality of the resulting feature vector. From Fig. 4c we see that the marginal reduction in EER outstrips the marginal increase in testing time as we increase the number of filters up to 40, at which point the EER is also minimized. Moreover, from Fig. 4d we see that the testing times stay low and do not dramatically increase as we increase the number of filters. Therefore, we set the number of filters to 40 in vCAPTCHA. As we increase the number of filters to 64 we see an increase in the error rate. We hypothesize that this behavior is linked to the SCMC calculation process. After the application of the filterbank, log and Discrete Cosine Transform (DCT) operations are performed on the resultant vector. The DCT can be thought of as refactoring the feature vector so that the new features are uncorrelated with each other and ordered according to their significance. Therefore, as we increase the length of the feature vector beyond a certain point, the later features might start carrying redundant, or even confounding, information.

4) Sample Length: Fig. 4f shows how the prediction accuracy varies with the length of the audio sample. At first it appears as if sample lengths around 300ms yield higher accuracy. However, a closer look reveals that the cluster of high accuracy rates around the 300ms mark is more likely due to the distribution of sample lengths in the dataset, which has a mean of 312.2 and a standard deviation of 121.9. We confirm this by calculating the correlation between sample length and accuracy and obtaining a low value of 0.09.

5) Results on the Full Evaluation Set: Based on the results from the previous sections, our shortlisted models are 256-component and 512-component GMMs, trained on SCMC features with a 40-filter filterbank. We train these models on the data from the full training set (Train_f) and evaluate them on the full evaluation set (Eval_f). For each model, we set the Forgery threshold, t_F, to the equal error point (shown in Fig. 3) determined from the experiments in the previous section. Table III shows the accuracy, true positive rate (human success rate), and false positive rate (attack success rate) achieved by these models. We see that the 256-component GMM performs admirably, yielding an overall accuracy of 87.7%, along with a high human success rate and a low attack success rate. Interestingly enough, the 512-component GMM under-performs its smaller counterpart in identifying spoofed speech and yields a slightly higher false positive rate. Based on these results we have decided to proceed with the 256-component model for inclusion in vCAPTCHA.
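The quantities reported in Table III follow directly from a thresholded confusion matrix; the short sketch below spells out that bookkeeping (the variable names are ours).

```python
# Sketch: accuracy, TPR (human success rate) and FPR (attack success rate)
# from forgery scores thresholded at t_F. labels: 1 = genuine, 0 = spoofed.
import numpy as np

def summarize(labels: np.ndarray, scores: np.ndarray, t_f: float):
    pred = (scores >= t_f).astype(int)           # predicted "genuine" decisions
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    tn = np.sum((pred == 0) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    accuracy = (tp + tn) / len(labels)
    tpr = tp / (tp + fn)                         # human success rate
    fpr = fp / (fp + tn)                         # attack success rate
    return accuracy, tpr, fpr
```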
Table IV: Dataset composition for speech recognition evaluation

Subset   Positive   Negative
Dev      2262       2025
Eval     754        675

C. Speech Recognition

Since we are using Pocketsphinx for speech recognition, we can use one of its pre-built models and do not have to train our own; however, we still have to empirically determine the Recognition threshold, t_R. To determine t_R we use the training subset of ASVspoof 2017. For each sentence, s, in the subset, we take 302 utterances of it (the positive utterances) along with 30 utterances from each of the remaining sentences (the negative utterances), get the transcripts for these utterances from Pocketsphinx, and compute the normalized Levenshtein distance between each transcript and s. We then split this set of distances into two subsets, Dev and Eval, as shown in Table IV.

We then use the normalized Levenshtein distances obtained from the positive and negative utterances in the Dev set to determine the threshold (t_R = 0.77) at which the EER (19.6%) is obtained. We then use this threshold to classify the distance values in the Eval set, obtaining a true positive rate of 80.6% and a false positive rate of 18.5%. We also measured the average time required to process one file to be around 650ms.

D. Combining Speech Recognition with Forgery Detection

In this section, we combine the forgery detection and speech recognition results from the preceding sections to determine vCAPTCHA's human and attack success rates. Since passing the test requires an agent to pass both the forgery detector and the speech recognizer, we calculate the human success rate (HSR), the best-case attack success rate (ASR_bestCase), and the usual-case attack success rate (ASR_usualCase) as

HSR = TPR_S \times TPR_F \quad (2)
ASR_{bestCase} = FPR_S \times FPR_F \quad (3)
ASR_{usualCase} = TPR_S \times FPR_F \quad (4)

where TPR_S and TPR_F are the true positive rates for speech recognition and forgery detection, respectively, and FPR_S and FPR_F are the false positive rates for speech recognition and forgery detection, respectively. ASR_bestCase represents the best-case scenario in which the attacker is unable to read the presented text properly and the attack utterance does not match the challenge sentence. However, as mentioned above, visual distortions are becoming easier to bypass, so in the more likely scenario, ASR_usualCase, the attacker is able to read the challenge sentence correctly and produce a synthetic utterance that matches it. Based on this formulation, vCAPTCHA achieves a human success rate of 73.7%, an attack success rate of 2.3% in the best case, and an attack success rate of 10.2% in the usual case. While we achieve a very impressive attack success rate, the human success rate is lower than we would have liked. We suspect that this is because ASVspoof 2017 contains replay recordings of speech with environmental sounds digitally added to them. Due to these factors the samples might be noisy, which might have led to errors during speech recognition.
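As a sanity check (our own arithmetic), plugging the component-level rates reported above (TPR_S = 80.6%, FPR_S = 18.5% for speech recognition; TPR_F = 91.5%, FPR_F = 12.6% for the 256-component forgery detector) into equations (2)-(4) reproduces the headline numbers:

HSR = 0.806 × 0.915 ≈ 0.737 → 73.7%
ASR_bestCase = 0.185 × 0.126 ≈ 0.023 → 2.3%
ASR_usualCase = 0.806 × 0.126 ≈ 0.102 → 10.2%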
V. CONCLUSION AND FUTURE WORK

In this paper we have proposed our speech-based vCAPTCHA system, which would more effectively secure web resources from bots while minimizing the inconvenience to legitimate human users. We have evaluated vCAPTCHA on two recent databases, namely ASVspoof 2015 and ASVspoof 2017. On these datasets, vCAPTCHA achieved a human success rate of 73.7% and an attack success rate of 10.2% in the usual case (2.3% in the best case). During the course of our evaluation we found that 512-component GMMs, which are common in past works, do not offer a significant performance upgrade over their simpler 256-component counterparts. We also observed that SCMCs with 40 filters offer the best classification performance, while the length of the audio sample has no observable effect.

Moving forward, our next step would be to deploy vCAPTCHA with an actual web service in order to gain insights about user experiences and start actually collecting data. Of course, this step would entail moving from the current prototype implementation to a fully deployable one. After deployment, we would be particularly interested in observing the quality and characteristics of the data we gather and how its addition to well-known corpora would affect the performance of speech recognition systems trained on them.

REFERENCES

[1] L. Von Ahn, M. Blum, N. J. Hopper, and J. Langford, "Captcha: Using hard ai problems for security," in International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 2003, pp. 294–311.
[2] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet, "Multi-digit number recognition from street view imagery using deep convolutional neural networks," arXiv preprint arXiv:1312.6082, 2013.
[3] H. Abdelnasser, M. Youssef, and K. A. Harras, "Magboard: Magnetic-based ubiquitous homomorphic off-the-shelf keyboard," in SECON. IEEE, 2016, pp. 1–9.
[4] H. Abdelnasser, K. A. Harras, and M. Youssef, "Ubibreathe: A ubiquitous non-invasive wifi-based breathing estimator," in MobiHoc. ACM, 2015, pp. 277–286.
[5] A. Essameldin and K. A. Harras, "The hive: An edge-based middleware solution for resource sharing in the internet of things," in Smart Objects MobiCom Workshop. ACM, 2017.
[6] M. Ibrahim, M. Gruteser, K. A. Harras, and M. Youssef, "Over-the-air tv detection using mobile devices," in ICCCN. IEEE, 2017.
[7] H.-J. Hong, T. El-Ganainy, C.-H. Hsu, K. A. Harras, and M. Hefeeda, "Disseminating multilayer multimedia content over challenged networks," IEEE Transactions on Multimedia, vol. 20, no. 2, pp. 345–360, 2018.
[8] "Mobile marketing statistics compilation," May 2017. [Online]. Available: https://www.smartinsights.com/mobile-marketing/mobile-marketing-analytics/mobile-marketing-statistics/
[9] "Omg! mobile voice survey reveals teens love to talk," Oct 2014. [Online]. Available: https://googleblog.blogspot.qa/2014/10/omg-mobile-voice-survey-reveals-teens.html
[10] S. Ruan, J. O. Wobbrock, K. Liou, A. Ng, and J. Landay, "Speech is 3x faster than typing for english and mandarin text entry on mobile devices," arXiv preprint arXiv:1608.07323, 2016.
[11] S. Shirali-Shahreza, G. Penn, R. Balakrishnan, and Y. Ganjali, "Seesay and hearsay captcha for mobile interaction," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2013, pp. 2147–2156.
[12] M. Sahidullah, T. Kinnunen, and C. Hanilçi, "A comparison of features for synthetic speech detection," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[13] D. Vergyri, L. Lamel, and J.-L. Gauvain, "Automatic speech recognition of multiple accented english data," in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[14] T. Fraga-Silva, J.-L. Gauvain, and L. Lamel, "Speech recognition of multiple accented english data using acoustic model interpolation," in Signal Processing Conference (EUSIPCO), 2014 Proceedings of the 22nd European. IEEE, 2014, pp. 1781–1785.
[15] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, "Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[16] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, "The asvspoof 2017 challenge: Assessing the limits of replay spoofing attack detection," 2017.
[17] S. Novotney and C. Callison-Burch, "Shared task: crowdsourced accessibility elicitation of wikipedia articles," in Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. Association for Computational Linguistics, 2010, pp. 41–44.
[18] C. Callison-Burch and M. Dredze, "Creating speech and language data with amazon's mechanical turk," in Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. Association for Computational Linguistics, 2010, pp. 1–12.
[19] "www.voxforge.org."
[20] J. Freitas, A. Calado, D. Braga, P. Silva, and M. Dias, "Crowdsourcing platform for large-scale speech data collection," Proc. FALA, 2010.
[21] "security.googleblog.com/2014/12/are-you-robot-introducing-no-captcha.html."
[22] "www.google.com/recaptcha/intro/invisible.html."
[23] "security.googleblog.com/2013/10/recaptcha-just-got-easier-but-only-if.html."
[24] S. Sivakorn, J. Polakis, and A. D. Keromytis, "I'm not a human: Breaking the google recaptcha."
[25] Y. Soupionis, G. Tountas, and D. Gritzalis, "Audio captcha for sip-based voip," SEC, vol. 297, pp. 25–38, 2009.
[26] D. Gritzalis, Y. Soupionis, V. Katos, I. Psaroudakis, P. Katsaros, and A. Mentis, "The sphinx enigma in critical voip infrastructures: Human or botnet?" in Information, Intelligence, Systems and Applications (IISA), 2013 Fourth International Conference on. IEEE, 2013, pp. 1–6.
[27] A. Markkola and J. Lindqvist, "Accessible voice captchas for internet telephony," in Symposium on Accessible Privacy and Security (SOAPS), 2008, pp. 1–2.
[28] H. Gao, H. Liu, D. Yao, X. Liu, and U. Aickelin, "An audio captcha to distinguish humans from computers," in Electronic Commerce and Security (ISECS), 2010 Third International Symposium on. IEEE, 2010, pp. 265–269.
[29] S. Shirali-Shahreza, Y. Ganjali, and R. Balakrishnan, "Verifying human users in speech-based interactions," in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[30] P. L. De Leon, M. Pucher, J. Yamagishi, I. Hernaez, and I. Saratxaga, "Evaluation of speaker verification security and detection of hmm-based synthetic speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 8, pp. 2280–2290, 2012.
[31] Z. Wu, X. Xiao, E. S. Chng, and H. Li, "Synthetic speech detection using temporal modulation feature," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 7234–7238.
[32] J. Villalba and E. Lleida, "Detecting replay attacks from far-field recordings on speaker verification systems," in European Workshop on Biometrics and Identity Management. Springer, 2011, pp. 274–285.
[33] R. Font, J. M. Espín, and M. J. Cano, "Experimental analysis of features for replay attack detection–results on the asvspoof 2017 challenge," Proc. Interspeech 2017, pp. 7–11, 2017.
[34] G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin, "Audio replay attack detection with deep learning frameworks," Interspeech 2017, 2017.
[35] T. K. Moon, "The expectation-maximization algorithm," IEEE Signal Processing Magazine, vol. 13, no. 6, pp. 47–60, 1996.
[36] D. Huggins-Daines, M. Kumar, A. Chan, A. W. Black, M. Ravishankar, and A. I. Rudnicky, "Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices," in Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, vol. 1. IEEE, 2006, pp. I–I.
[37] J. M. K. Kua, T. Thiruvaran, M. Nosratighods, E. Ambikairajah, and J. Epps, "Investigation of spectral centroid magnitude and frequency for speaker recognition," in Odyssey, 2010, p. 7.
[38] M. Todisco, H. Delgado, and N. Evans, "A new feature for automatic speaker verification anti-spoofing: Constant q cepstral coefficients," in Speaker Odyssey Workshop, Bilbao, Spain, vol. 25, 2016, pp. 249–252.
[39] S. Molau, M. Pitz, R. Schluter, and H. Ney, "Computing mel-frequency cepstral coefficients on the power spectrum," in Acoustics, Speech, and Signal Processing, 2001. Proceedings (ICASSP'01). 2001 IEEE International Conference on, vol. 1. IEEE, 2001, pp. 73–76.
[40] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, p. 79, 1995.