Cocaine Noodles: Exploiting the Gap Between Human and Machine Speech Recognition∗

Cocaine Noodles: Exploiting the Gap between Human and Machine Speech Recognition∗

Tavish Vaidya Yuankai Zhang Micah Sherr Georgetown University Georgetown University Georgetown University Clay Shields Georgetown University

Abstract devices that are too small to support either physical or on- screen keyboards. These devices tend to be ones that users Hands-free, voice-driven user input is gaining popularity, carry and use on an everyday basis and which therefore in part due to the increasing functionalities provided by contain a large amount of personal or confidential infor- intelligent digital assistances such as Siri, Cortana, and mation, such as personal communications, billing or au- Google Now, and in part due to the proliferation of small thentication credentials, or the user’s location. devices that do not support more traditional, keyboard- Speech recognition is a complex process with many based input. techniques borrowing heavily from biological inspirations In this paper, we examine the gap in the mechanisms (e.g., the use of deep learning neural networks). It mimics of speech recognition between human and machine. In how humans process language, applying complex trans- particular, we ask the question, do the differences in how formations to convert analog signals into digital repre- humans and machines understand spoken speech lead to sentations and artificial intelligence techniques to orga- exploitable vulnerabilities? We find, perhaps surprisingly, nize those representations into phonemes and hence into that these differences can be easily exploited by an adver- speech. These techniques are, at best, approximations of sary to produce sound which is intelligible as a command the physiological processes that govern how humans in- to a computer speech recognition system but is not easily terpret speech, and humans are currently better at parsing understandable by humans. We discuss how a wide range speech than machines. Comical errors in speech recogni- of devices are vulnerable to such manipulation and de- tion are memes on sites such as Reddit, and make it clear scribe how an attacker might use them to defraud victims that there is a difference in how humans and machines in- or install malware, among other attacks. terpret speech. This paper explores this gap between the synthetic and natural and poses the question, do the differences in which 1 Introduction humans and computers understand spoken speech lead to exploitable vulnerabilities? Perhaps surprisingly, our ini- The HAL 9000, an artificial intelligence from Arthur tial findings strongly indicate that the answer is “yes”. Clark’s famous 1968 novel [3], helped popularize the no- The question that motivates this work is whether there tion of a machine that could understand human speech. is audio that is interpreted as human speech by machines Despite the lack of voice recognition systems that ap- but is perceived by humans as noise other than speech. In proach the accuracy of the HAL 9000, speech recognition the remainder of this paper we demonstrate that producing is now itself ubiquitous, present on hundreds of millions such audio is both possible and practical. of smartphones and wearable devices, and is likely to be- come increasingly prevalent in the future. Speech recognition is often the most convenient form The man-in-the-elevator attack. The security implica- of input for smaller devices such as portable phones, and tions of a “language” that can be understood by com- can be the only manner of providing input for wearable puter speech recognition engines but not by humans are somewhat subtle but important1. Current digital assistants ∗As others have pointed out [2], when spoken, “cocaine noodles” is often (mis)interpreted by Google Now as “OK Google”, a term used to 1This paper is admittedly English-centric. Evaluating the effective- activate Google’s digital personal assistant. ness of our techniques for other languages—especially tonal languages

1 such as Siri and Google Now lack biometric authentica- This paper presents a general approach towards con- tion and make no distinction between commands issued verting audio commands into a form that is recognized by by their owners and those issued by unauthorized indi- computers but not by their users. Our techniques are ag- viduals. However, the user’s ability to hear spoken com- nostic to the particular machine learning approach used mands from others at least serves as a detection technique by a given speech recognition system. We instead target and allows the user to take mitigating action should others the amount of information available to the system for car- attempt to activate their device. When sounds that can- rying out the speech recognition task. Conceptually, our not be easily recognizable by humans as being speech are methodology takes as input a waveform of human speech, interpreted by a device, the opportunities for undetected extracts acoustic information from the input, then outputs unauthorized access to voice commands increases. a signal that has sufficient acoustic features for the speech As one concrete example, we consider a man-in-the- recognition system to understand the intended command elevator (MitE) attack. Here, the user enters an elevator but which is lossy enough that it cannot be easily under- with a device that supports continuous speech recognition. stood by human listeners. In the remainder of this pa- During the elevator ride a sound plays on the elevator’s per, we detail this methodology and show via quantifiable loudspeaker which is unrecognizable as speech to the user. metrics as well as user testing that such attacks are both The victim’s device, however, interprets the sound as a effective and practical. voice command. The attacker crafts these commands to perform actions to his benefit. This might include sending a text to an SMS short code, allowing the adversary to 2 Threat Model monetize his attack by charging fees for the text message, or it might cause the device’s browser to open a webpage Our attack targets any device that is actively listening for containing a drive-by download, compromising the device voice input in an area where we can introduce an audio itself and allowing the attacker future access to the data signal. In general, we expect that these are most likely within. to be smartphones, tablets, and wearables that the user We note that this attack has the possibility of scal- carries into a physical space where the attacker can place ing significantly outside the elevator. For example, these a speaker. human unrecognizable commands could be embedded Many devices already continuously listen for voice ac- within audio or video segments on radio, television, or In- tivation commands: when plugged in, the iPhone re- ternet viral videos to reach the devices of many millions sponds to “Hey, Siri”, as does the iWatch whenever the of people. An authoritarian government, when faced with wearer’s wrist is raised; many Android smartphones have a large group of demonstrators, could play such sounds the option of continuously listening for “OK Google”; the over loudspeakers to cause the demonstrators’ devices to desktop Google Chrome browser also has the ability to identify them to the authorities. continuously listen for “OK Google” when the user is vis- iting a Google search page; Android Wear smart watches also listen for voice activation; certain GPS devices such General approach. Most generally, the gap between as Garmin’s nuvi¨ line respond to voice commands by de- human and machine speech recognition leads to an un- fault; and the Amazon Echo is actively marketed as hav- monitored channel by which an adversary can inject com- ing the ability to always listen for its activation phrase, mands. This paper demonstrates the feasibility of such “Alexa”. Given the rapid adoption of voice recognition channels. systems it is reasonable to assume that the trend of de- A key challenge in implementing this attack is that vices continuously listening for spoken activation phrases speech recognition is not uniform and that different sys- will continue to expand in the future. tems may apply various methods to convert raw audio sig- As discussed in §3, there are a wide variety of meth- nals into phonemes and then into speech. These systems ods to perform speech recognition; we treat the AI com- are usually proprietary with little public information avail- ponents of the speech recognition process largely as a able about how they function. The actual interpretation of black box, and make few assumptions about the adver- audio is often performed remotely “in the cloud”, leaving sary’s knowledge regarding the methods by which the tar- little opportunity for reverse engineering on the devices get device interprets speech. themselves. This makes targeting the specific recognition The adversary’s goal is to exploit the target device’s methods difficult. speech recognition system to cause the device to execute such as Chinese, Vietnamese, and Thai—is an interesting area of future unauthorized commands. The adversary wishes to avoid work. detection by the user and therefore produces only man-

2 gled commands that are unintelligible to the device’s op- adversary controls, allowing it to enumerate the de- erator but which are intelligible to the speech recognition vices that are physically located within “earshot” of system. Since most speech recognition systems apply the broadcast (e.g., those belonging to dissidents at- band-stop filters to attenuate signals that fall outside of tending a rally); the range of human speech and require a minimum power • Earn money via premium rate services: An attacker level for parsing speech, the adversary does not attempt to can operate a premium rate number (i.e., a “900 construct a covert audio channel that cannot be perceived number”) and monetize his attack by causing nearby by the human ear. Consequently, our mangled audio sig- phones to call it (for some mobile devices and calling nals fall within the same frequency band as human speech. plans); We leave open the possibility that the human operator will • Perform a denial-of-service attack: Using a public hear the unintelligible audio produced by the attacker. We announcement system, an attacker can issue com- posit that in even the most quiet office environments, oc- mands to turn on airplane mode on all devices, pre- casional electronic-sounding noises that cannot be easily venting them from receiving calls and other commu- identified are not uncommon, and are almost always ig- nications. nored. We assume that the adversary is physically proximate We note that the adversary may be required to issue to the target and has the capability of playing a sound at multiple voice commands to active the speech recognition reasonable volume (∼70db, as perceived by the device’s system (“Hey Siri”) and launch his attack (“open myevil- microphone(s)). We note that proximity can be extended site.com”). by the use of devices such as a Long Range Acoustic De- As more devices adopt voice activation and speech vice (or LRAD), which produces the requisite sound lev- recognition capabilities, we anticipate that the security els over many hundreds of meters. Our proposed attack implications of our techniques will further increase. may be detected if the smartphone screen is visible to the user during the attack. Corresponding to the vast majority of current- 3 Speech Recognition Overview generation devices, our attacks target devices that do not apply biometrics or otherwise attempt to authenticate the Here we briefly describe the steps involved in speech speaker of the voice commands. We assume a human op- recognition to familiarize the reader with the topic and to erator who is not using his device at the time of the attack better explain the attack we present in §4. We focus ex- and therefore may not notice any on-screen notifications clusively on speaker-independent speech recognition sys- that reveal the adversary’s commands. Finally, we note tems that are designed to interpret any speaker’s spoken that the attack may be targeted towards a particular device, commands without requiring training data from individ- or broadcast over a wide area to affect multiple devices. ual users. These are the systems most often employed by the smartphones, tablets, wearables, and other devices Attack utility. An adversary who is able to cause a tar- that the attacker might target. Interested readers may re- get device to execute voice commands may leverage this fer to more detailed readings for additional background capability to achieve the following goals: (this list is not on speech recognition [9, 12, 16, 18]. intended to be exhaustive) Figure 1 shows the steps of a typical speaker- independent speech recognition system. The first step • Initiate a drive-by-download: An attacker can issue (pre-processing) usually involves removing background commands to open a webpage maintained by the ad- noise, filtering frequencies that are of little or no relevance versary that contains a drive-by-download. This ef- for speech recognition, and eliminating parts of the input fectively serves as a stepping stone, enabling other signal that fall below an energy threshold. This process attacks that exploit vulnerabilities in the device’s is generally referred to as speech/non-speech segmenta- browser; tion. As briefly discussed in §2, a consequence of this • Earn money via pay-based SMS services: An at- segmentation process is that it generally disallows covert tacker can construct audio that causes phones to send channels between the adversary and the device that can- text messages to pay-based SMS short code services not be perceived by the human operator. Conversely, it that it operates; does not filter signals that are not human understandable, • Enumerate devices in a physical area: Similarly, so long as those signals have the requisite energy level and an attacker may use a loudspeaker to cause nearby fall within the frequency range of human speech. phones to send SMS messages to a number that the The filtered audio signal is then processed to extract

3 Speech recognition system of a windowed excerpt of the input signal and mapping the powers of the spectrum obtained onto the mel scale. Filtered audio Next, log of powers for each frequency on the mel scale Input signal Feature is taken followed by the discrete cosine transform of each audio Pre-processing Extraction signal (MFCC) mel log powers. The amplitudes of the resulting spectrum are the MFCCs. (Muda et al. [14] provide a more detailed Input audio discussion on computing MFCCs.) features Various parameters can be tuned for computing MFCCs: the window length to consider for computing the Model Based Text Post-processing Fourier transform, the spacing between two windows, the output Text Prediction predictions number of warped spectral bands to use, the number of cepstral coefficients to return, and the lowest and highest band edge of the mel filters. Our proposed attack involves tuning various parameters Figure 1: Workflow of a typical speech recognition system. to compute MFCCs for unmodified input audio, and then reconstructing a modified audio signal from MFCCs. The acoustic features useful for recognizing speech (feature MFCC parameters are tuned in a way that they preserve extraction). The input signal is converted from the time enough audio features for the speech recognition system domain into the spectral domain by considering uniform to correctly predict the corresponding text but changes the length time frames and performing a fast Fourier trans- audio signal as perceived and understood by humans. We form (FFT) over each frame. Most speech recognition discuss this process in much more detail next. systems use Mel-frequency cepstral coefficients (MFCC) to represent acoustic features of the input audio signal. MFCC closely approximates a human response to au- 4 Audio Mangling ditory sensation and allows for better representation of sound [10, 14]. Speech recognition systems rely on acoustic features ex- The acoustic features extracted from the input signal tracted from input audio for generating text predictions. are then matched against an existing model, built offline As long as the input audio contains enough acoustic in- using training data. Speech recognition models are typ- formation (above a certain threshold depending upon the ically constructed using statistical approaches. In par- sensitivity of the targeted system to noise, etc.), the sys- ticular, Hidden Markov Models and artificial neural net- tem can correctly recognize the corresponding text with works are often used to map features to phonemes, and fairly high accuracy. then text. Based on the acoustic features of the input, Our attack modifies the input audio signal in such a the speech recognition model generates text predictions. way that the mangled audio output retains enough acous- A post-processing step may then be performed to rank tic features for the speech recognition system to correctly the text predictions by employing additional sources of predict the text while making it difficult for humans to un- information—for example, enforcing grammar rules or derstand the mangled audio signal. considering the locality of words. Figure 2 shows the outline of our attack. An audio command is given as an input to the audio mangler. The MFCCs. As discussed in the following section, our at- mangler then produces a morphed version of the original tack leverages the common use of MFCCs to extract fea- input. The morphed audio signal retains sufficient acous- tures from audio signals. In what follows, we present ad- tic features of the input audio command to allow speech ditional detail on MFCCs that may be useful to understand recognition systems to interpret it, but sounds very dif- our attack. ferent than the original command to humans. This new Mel-frequency cepstral coefficients (MFCCs) are used attack command is then later played to the target’s speech to represent the short-term power spectrum of audio on a recognition system, which interprets and executes the ini- nonlinear mel frequency scale. MFCC is based on hu- tial audio command. The human user does not hear the man hearing perceptions with frequency bands equally command as a command, and may not notice the execu- spaced on the mel scale (as compared to linearly spaced tion proceeding on the device. frequency bands in normal cepstrum) [14]. MFCCs are Below, we describe in more detail the steps used to commonly derived by first taking the Fourier transform mangle audio:

4 MFCC a small time window and also aggregates the energy lev- parameters Audio Mangler els of closely spaced frequencies to represent the total en- Feature Acoustic ergy in various frequency regions on the mel frequency Mangled Extraction with features Audio Inverse MFCC audio scale. (The aggregation of energy level is motivated by command tuned MFCC command parameters the human hearing mechanism as it cannot discern differences between closely spaced frequencies and the effect becomes more dominant as the frequencies increase.) Thus, MFCCs do not retain all the information about the

Text Attack input audio signal. The tuned MFCC parameters used to Speech Command output command recognition create the attack command are intended to further increase action system this loss of information in way that is deterimental to human understanding. Figure 2: Attack outline. Recall that MFCCs are representations of acoustic features of an audio signal. We use the tuned MFCC parameters to compute MFCCs of an unmodified audio com- MFCC parameters. MFCC computation requires vari- mand. The resulting MFCCs contain just enough acoustic ous parameters to be specified [7]. We focus on four inde- information, ensured by careful selection of MFCC pa- pendent parameters: wintime, hoptime, numcep rameters, such that a mangled audio signal reconstructed and nbands. wintime determines the length of the from them will be correctly recognized by the targeted timeframe over which the signal is considered as being speech recognition system. statistically constant; hoptime determines the size of the time step between successive windows; numcep is the Inverse MFCC. The extracted audio features repre- number of cepstra, i.e., the number of coefficients to out- sented as MFCCs are converted back to an audio signal by put; and nbands is the number of warped spectral bands reversing the steps of the MFCC computation. The inver- to use while aggregating energy levels for closely spaced sion of MFCCs back to a waveform involves the addition frequencies. of noise, since MFCC computation is lossy and aggregates We experimentally observed the effect of changing energy levels of closely spaced frequencies into frequency each of these parameters independently on the perceived regions. Inversion from MFCCs to audio signals mangles quality of mangled audio output. To check if the man- the original audio signals, making them difficult for hu- gled audio can be understood by a speech recognition man listeners to understand. system, we submitted the outputs of the audio mangler to Google’s Speech Recognition engine using a publicly In summary, our approach modifies the input signal by available API [15]. The API returns a list of up to five adjusting MFCC parameters and performing feature ex- possible transcriptions with a confidence value associated traction, and then reconstituting an audio signal by ap- with each transcription. If the API cannot transcribe the plying a reverse MFCC to the extracted features. A audio, an error is returned. key property of our approach is that the features used in Based on the edit distance (see §5) between the tran- the speech recognition system remain mostly undisturbed scriptions of un-mangled and mangled audio signal, we (since they are extracted and then reconstructed), while manually narrow down the value range for each parame- the non-extracted-features are lost in the reconstruction. ter. The range of parameter values, thus chosen, will pro- It is worth stressing that the adversary can perform the duce mangled audio output with minimal but sufficient mangling entirely offline. Recall that the adversary’s job acoustic information for a speech recognition system to is to construct an audio file that is interpreted by computer transcribe the audio while making it non-trivial for a hu- speech recognition systems, but not easily discernible to man listeners to understand. humans. He can thus perform the procedure in a trial- and-error fashion, tuning the audio mangler’s parameters to generate audio files and testing them manually to see Feature extraction with tuned MFCC parameters. whether the mangled output is accepted by a copy of After experimentally determining the range of MFCC pa- the targeted speech recognition system and to see if the rameter values, acoustic features are extracted by com- a human listener deems the mangled audio to be non- puting MFCCs [7] of the input signal using the chosen understandable. This process can be time-consuming, but MFCC parameters. Computing MFCCs is lossy: the pro- producing a single mangled audio that achieves the above cess considers the signal to be statistically constant over two constraints will likely lead to a successful attack.

5 5 Evaluation We then tested the selected attack candidates on our test setup. Again, all of the selected attack candidates success- The goal of our attack is to produce mangled audio files fully activated their corresponding functionalities on the that activate commands on voice-recognition systems but smartphone. These preliminary results support the argu- which are difficult for humans to understand. We evaluate ment that speech recognition systems can understand au- the effectiveness of this attack by implementing a set of dio that has been specifically crafted to contain the acous- mangled commands, verifying that they do activate the tic feature information as described above. The question phone, and then performing human testing to determine then becomes – how well can humans understand them? how difficult the commands were for humans to interpret.

Non-comprehension by human listeners. To test how Experimental setup. We created four types of com- well humans can understand the audio commands, we car- mands which cover the attacks described in §2: ried out a study using Amazon Mechanical Turk, a service • activating the voice command input (i.e., “OK that allows requesters to set up tasks for workers to com- Google”); plete and offer them a payment for their service. • calling a number; For our study, workers were presented with a task con- • sending a text message to a number; and taining links to four audio commands along with the in- • opening a website (tested against two websites) structions (see Appendix A). Each command in a task was unique to prevent giving the worker additional informa- These commands are listed in Table 1. Each of the four tion to use to decipher the command. Some commands authors provided recordings of each command to use. were the mangled commands created above, while oth- We then created an experimental setup to ensure that ers were unmangled to provide a baseline for measuring the commands were correctly understood by the device. worker accuracy. The workers were asked to listen to the For this, and for testing the efficacy of the mangled com- audio samples in the task and provide a transcription for mands, we tested the audio commands against the Google each audio sample, as per instructions. Workers were paid Now intelligent personal assistant running on a Samsung $0.11 per transcription and a bonus of $0.05 cents for an Galaxy S4 smartphone with Android version 4.4.2. The accurate transcription. Each worker was allowed to com- smartphone was placed on a table with a pair of speakers plete only one task to avoid carrying over earlier knowl- placed about 30 cm from the phone. The commands were edge to a new task. played through the speakers and we determined whether While this approach is not the same as a person who the command activated the corresponding functionality might be distracted or otherwise unaware that an audio on the smartphone. The experiment was carried out in command was being played, as we assume in our attack a large, relatively quiet room with the background noise model, we argue that this strengthens our argument that level averaging 50 dB. understanding a mangled command is difficult. Workers To establish a baseline, unmangled versions of all five could listen to an audio command multiple times which attack commands were tested using the described setup. may have increased the understanding of the audio sig- All baseline candidates were successful in activating their nal, particularly provided they would receive a reward for respective functionalities on the smartphone in our test en- correct transcription. The device through which the audio vironment. commands are played also affects the level of understanding. Noise cancellation headphones might allow for bet- Audio mangling. We next implemented our audio man- ter perception of audio as compared to computer speakers. gler in MATLAB2014b using the MFCC implementation Such factors, not within our control, may have improved by Ellis [7] to mangle the commands described above. Af- the transcription results submitted by human evaluators. ter fine tuning the MFCC parameters and generating mul- We therefore argue that these results indicate a possi- tiple mangled audio candidates (see §4), two of the au- ble best case for human understanding and that an attack thors listened to around 500 potential candidates each and victim who has only one attempt to do so at a time when picked a total of 105 candidates to be evaluated by human they are not expecting to need to, given possibly adverse listeners. The list of attack candidates was then further listening conditions, would likely perform less well than narrowed down by getting human evaluators’ feedback our evaluators. from Amazon Mechanical Turk, with preference given Table 1 shows the number of human evaluators who to commands that the evaluators found difficult to under- submitted transcriptions for various audio commands. stand. The differences in sample size are due to the order that

6 1.0 1.0

0.9 0.8 0.8

0.7 0.6 0.6 cdf cdf 0.5 0.4 mangled facebook 0.4 original facebook 0.2 mangled ok google mangled xkcd 0.3 original ok google original xkcd 0.2 0.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 normalized phoneme edit distance normalized phoneme edit distance

Figure 3: Cumulative distribution of phoneme edit distances, for Figure 4: Cumulative distribution of phoneme edit distances, both the original and mangled audio, for the “Ok Google” com- for both the original and mangled audio, for the commands that mand. open Facebook and the xkcd website.

Command Baseline Mangled nouncing Dictionary [13], version 0.7, to determine the ok google 212 153 pronounciation (in phonemes) of words. open xkcd.com 195 171 show facebook.com 114 171 call 15559876543 81 114 Results. First we determined a baseline for human eval- send text to 1234567890 uators by calculating the edit distance between phonemes 73 179 how are you doing of transcriptions for unmangled audio commands submitted by the workers and the text used to create the com- Table 1: No. of transcriptions (samples) for voice commands. mands. This showed the accuracy of transcription when no mangling was present. We then measured the evaluators’ understanding of the mangled attack commands by tasks were assigned by Amazon Mechanical Turk to its calculating the edit distance between phonemes of tran- workers, and the number of responses we received to our scriptions submitted by the workers and the text used to survey. create the commands. The results show clearly that the approach is promising. Figure 3, 4 and 5 show the cumulative distribution func- Evaluation metric. To quantify human evaluators’ un- tion of phoneme edit distances for transcriptions submit- derstanding of audio commands, we use the Leven- ted by human evaluators. The lines representing base- shtein edit distance where distance is based on phonemes: line and mangled responses for each command corre- Phonemes for all words in the audio commands were gen- spond to transcriptions of unmodified and mangled com- erated for the baseline (unmodified audio) as well as man- mands respectively. Figure 3 shows the results for the gled audio commands’ transcriptions. Using phonemes “Ok Google”2 command. The Ok Google command is re- instead of letters or words as a distance measure has the quired to activate the voice input functionality and opens important advantage that it allows us to compare how up the attack surface for other commands to work. A sig- close different commands sound rather than how they are nificant gap between the baseline and attack command spelled. To account for the varying lengths of different shows that the majority of human evaluators were un- audio commands, we normalize the Levenshtein distance able to understand the mangled command audio. The me- by dividing by the length (in phonemes) of the unmodified dian of normalized phoneme edit distance for the unob- original audio sample. fuscated audio is zero, indicating that most human testers Transcriptions to be compared were converted to low- (in fact, more than 95%) were able to perfectly under- ercase and all non-ASCII characters and punctuation were removed. Digits were replaced with their corresponding 2We consider “Okay Google” as a valid transcription and replace texts (e.g., “2” became “two”). We used the CMU Pro- Okay with Ok before computing the edit distance.

7 1.0 chines’ understanding of certain audio commands. We also note that it appears that the popularity of website may affect results. Higher percentage of human eval- 0.8 uators could guess Facebook, which is ranked second in the Alexa global rankings, as compared to XKCD which 0.6 is ranked 1,428th. cdf 0.4 mangled call 6 Discussion original call 0.2 mangled text Suppressing audio feedback. Personal voice assistants original text like Google Now and Siri provide audio feedback to in- 0.0 0.0 0.2 0.4 0.6 0.8 1.0 put voice commands, which might alert an attentive user. normalized phoneme edit distance However, the audio feedback in response to our attack commands can be suppressed by playing noise via the Figure 5: Cumulative distribution of phoneme edit distances, same means as the attack command, drowning out the de- for both the original and mangled audio, for the commands that vice’s response. Timing of the audio feedback after issu- send an SMS message (“original/mangled text”) and place a call ing a voice command can be determined offline by the at- (“original/mangled call”). tacker and the attack command can be padded with noise such that the delay between the audio command and noise to be played is close to the time delay between the issu- stand it. In contrast, the median of normalized phoneme ing of command and the audio feedback from the voice edit distance for the mangled audio is one (equivalent to assistant. two phonemes), which we note is exactly the number of phonemes in “Ok Google”—reflecting a complete misin- terpretation. Extending to other speech recognition systems. Our We achieved similar results for our other tested com- attack is very likely agnostic to any particular speech mands. Figure 4 shows the level of human evalua- recognition system since it does not rely on specific in- tors’ understanding of “show facebook.com3” and “open ternals of any such system and instead depends only on xkcd.com3” commands. Figure 5 shows the result for the the use of MFCC transformations, a standard practice for “call 15559876543” and “send text to 1234567890 how speech recognition. We tested our attack on one speech are you doing” commands. In nearly all cases, mangling recognition system to demonstrate the feasibility of such the audio significantly increases the testers’ difficulty in an attack. As a part of future work, we aim to extend our correctly transcribing the audio, despite having the ability attack to other speech recognition systems. to listen to the samples multiple times (and having a finan- cial incentive, albeit a small one, to answer correctly). The Other attack vectors. In addition to the attack com- gap between the baseline and mangled version for these mands already discussed, other voice commands can be commands is lesser, though significant, as compared to exploited to carry out attacks using the voice command previously described commands, which can be attributed input. For example, given a few keywords, Google Now to the contextual information that may have given clues automatically opens up the most relevant webpage when to human evaluators, as well as the length of the sample searched via voice commands, eliminating the need to itself. specify a website URL. An adversary can create a web page loaded with malware that combines some number Discussion of results. The results above indicate that it of totally unrelated topics (e.g., radioshack bankruptcy + is possible to create audio files that activate commands but goat farming), such that querying Google for those terms which are more difficult for users to understand, even in will ensure that the site is at the top of the list. To avoid the best case. For each set of commands above, there is Google’s malware detectors, it can serve malware only a significant gap between the understanding of baseline when the browser is detected to be an Android device or and mangled command, supporting our hypothesis that is connecting from a mobile (e.g., LTE) network. it possible to exploit the gap between humans’ and ma- Google also understands synonyms for commands. For example, call can be substituted with ring for dialing a 3“Dot” is replaced by “.” in transcriptions. number. Vocabulary expansion, commonly used in infor-

8 mation retrieval, can be employed for automatically ex- fingerprinting accelerometer responses to motion stimula- panding the command vocabulary and providing an at- tion, while Das et al. [4] achieve the same goal by finger- tacker with more potential attack commands. printing microphones and speakers embedded in smartphones. Our work differs from these approaches in that we are essentially focusing on the opposite—i.e., the ma- 7 Related Work nipulation of signal pattern matching to interpret seem- ingly incomprehensible speech. Attacks that leverage audio input. The (mis)use of voice input as a vector for attacks has recently received significant attention from both industry and academia. 8 Conclusion Ben-Itzhak [1] observed that smartTVs and smartphones take synthetic voice commands without authenticating the In this paper, we presented a proof-of-concept attack that command issuer. Diao et al. [6] demonstrated an attack exploits the differences in how computers and humans in- in which a malicious app with zero permissions can is- terpret speech. We describe a methodology for transform- sue privacy sensitive voice commands to Android smart- ing speech to a form that can be understood by MFCC- phones through Google’s Voice Search Service. Schlegel based speech recognition systems while being mostly in- et al. [17] describe a sound trojan called Soundcomber discernible to human listeners. Our initial findings show on smartphones that stealthily steals private information that it is both possible and practical to activate the voice by listening for tone- and speech-based interactions with assistant of an Android phone using mangled audio com- the smartphone. Our attack differs from the above as it mands and carry out silent man-in-the-elevator attacks. does not require any malware to be installed on the target device, and instead leverages the intelligent digital assistances that are installed by default on most smartphones Acknowledgments and many wearable devices. Closely related to our work is Jang et al.’s exploration We thank Eric Burger, Alex Yale-Loehr, Yu Zhang, and of how accessibility features in modern operating systems Wenchao Zhou for several insightful conversations about can be leveraged to bypass authorization checks [11]. We this work, and the anonymous reviewers for their help- also use voice to issue unauthorized commands. In this ful feedback. This work is partially funded by the Na- paper, we focus in particular on exploiting the voice input tional Science Foundation through grants CNS-1064986, channel and on producing speech that is understandable CNS-1149832, CNS-1223825, CNS-1453392, and CNS- by computers but not by humans. 1445967. The opinions expressed in this paper are those Most similar to our work is Esteves et al.’s proposed of the authors and do not necessarily reflect the views of attack of executing voice commands on smartphones by the National Science Foundation. silently injecting signals via headphones plugged into smartphones [8]. The attack is only effective when the user is using wired headphones with a smartphone that References has FM radio capability, and the adversary is within ∼2m [1] Y. Ben-Itzhak. What if Smart Devices could be Hacked of the target. The chances of detecting the attack are with Just a Aoice? AVG.Now. Available at http:// likely quite high since the phone’s voice feedback will now.avg.com/voice-hacking-devices, 2014. be heard by an operator who is wearing the connected [2] “Can we change saying ‘Ok Google’ to ‘ ’?” headphones. In contrast, our attack plays obfuscated voice (Reddit Thread). https://www.reddit.com/ commands to the smartphones that sound as gibberish to r/AndroidWear/comments/2z18ah/ human listeners, and does not depend on headphones or can we change saying ok google to/. smartphones’ FM capabilities. [3] A. C. Clark. 2001: A Space Odyssey. Hutchinson, 1968. [4] A. Das, N. Borisov, and M. Caesar. Do you Hear What I Hear?: Fingerprinting Smart Devices through Embedded Signal pattern matching. Researchers have also ex- Acoustic Components. In ACM Conference on Computer plored the use of signal features to identify individual de- and Communications Security (CCS), 2014. vices and keyboard entries. Zhuang et al. [19] predict con- [5] S. Dey, N. Roy, W. Xu, R. R. Choudhury, and S. Nelaku- tents of typings from the sound of keyboard clicks with diti. Accelprint: Imperfections of accelerometers make the help of machine learning and speech recognition tech- smartphones trackable. In Proceedings of the Network and niques. Dey et al. [5] identify individual smart devices by Distributed System Security Symposium (NDSS), 2014.

9 [6] W. Diao, X. Liu, Z. Zhou, and K. Zhang. Your Voice Assis- cations. Kluwer Academic Publishers, 1996. tant is Mine: How to Abuse Speakers to Steal Information [13] K. Lenzo. The CMU Pronouncing Dictionary. http: and Control Your Phone. In ACM Workshop on Security //www.speech.cs.cmu.edu/cgi-bin/cmudict. and Privacy in Smartphones & Mobile Devices (SPSM), [14] L. Muda, M. Begam, and I. Elamvazuthi. Voice Recogni- 2014. tion Algorithms using Mel Frequency Cepstral Coefﬁcient [7] D. P. W. Ellis. PLP and RASTA (and MFCC, and in- (MFCC) and Dynamic Time Warping (DTW) Techniques. version) in Matlab. http://www.ee.columbia.edu/ Journal of Computing, 2(3), March 2010. ˜dpwe/resources/matlab/rastamat/, 2005. [15] T. Payton. Google Speech API: Information and Guide- [8] J. L. Esteves and C. Kasmi. Injection de Commandes lines. http://blog.travispayton.com/wp- Vocales sur Ordiphone. In Symposium Sur la securit´ e´ content/uploads/2014/03/Google-Speech- des Technologies de l’Information et des Communications API.pdf. (SSTIC), 2015. [16] L. R. Rabiner and B.-H. Juang. Fundamentals of Speech [9] R. E. Gruhn, W. Minker, and S. Nakamura. Statistical Recognition, volume 14. PTR Prentice Hall Englewood Pronunciation Modeling for Non-native Speech Process- Cliffs, 1993. ing. Springer Science & Business Media, 2011. [17] R. Schlegel, K. Zhang, X.-y. Zhou, M. Intwala, A. Kapa- [10] W. Han, C.-F. Chan, C.-S. Choy, and K.-P. Pun. An Ef- dia, and X. Wang. Soundcomber: A Stealthy and Context- ﬁcient MFCC Extraction Method in Speech Recognition. Aware Sound Trojan for Smartphones. In Network and In IEEE International Symposium on Circuits and Systems Distributed System Security Symposium (NDSS), 2011. (ISCAS), 2006. [18] E. Trentin and M. Gori. A Survey of Hybrid ANN/HMM [11] Y. Jang, C. Song, S. P. Chung, T. Wang, and W. Lee. A11y Models for Automatic Speech Recognition. Neurocomput- Attacks: Exploiting Accessibility in Operating Systems. In ing, 37:91–126, 2001. ACM Conference on Computer and Communications Secu- [19] L. Zhuang, F. Zhou, and J. D. Tygar. Keyboard Acoustic rity (CCS), 2014. Emanations Revisited. ACM Transactions on Information [12] J.-C. Junqua, J.-P. Haton, and H. Wakita. Robustness in and System Security (TISSEC), 13(1):3, 2009. Automatic Speech Recognition: Fundamentals and Appli-

10 A Amazon Mechanical Turk Survey

Aim of this task

We are conducting an academic study that explores the limits of how well humans can understand obfuscated audio of human speech. The audio files for this task may have been algorithmically modified and may be difficult to understand. Please supply your best guess to what is being said in the recordings.

Eligibility Criteria

1. You must be over 18. 2. You must be a US citizen. 3. You must not be an employee of Georgetown University. 4. You must not have already earned $599 for this task.

Please do not complete this task if you don’t meet the eligibility criteria.

Instructions

Listen to a short audio and transcribe what is said to the best of your understanding

If you have already submitted this task once which has not be approved/rejected yet (from any HIT from any batch with these same instructions), please do not attempt this task. Mutiple submissions will be rejected and no payment will be made.

Do not include "hmm" and "err" in the transcription. Do not correct for grammar mistakes but transcribe as spoken. Use punctuation where appropriate. Transciptions whch are closer to the spoken audio will get bonus reward ($0.05 for each correct transcription) in addition to overall payment. You must provide a transcription to the audio. Some of the audio files will be hard to understand. If you do not understand what is being spoken, please provide your best guess. Both correct and incorrect responses will be accepted (and paid for). Bonuses will be given for correct transcriptions.

1. Audio link 1 : https://inyourffingface.com/audio/04fad3ede92b77dcc0ca698febb2a8c4196d0cc2.wav Please write the transcription here: (https://inyourffingface.com/audio/04fad3ede92b77dcc0ca698febb2a8c4196d0cc2.wav)

2. Audio link 2 : https://inyourffingface.com/audio/d499596088364d6775db92a822857ce814e93c60.wav

Please write the transcription here: (https://inyourffingface.com/audio/d499596088364d6775db92a822857ce814e93c60.wav)

3. Audio link 3 : https://inyourffingface.com/audio/d4388b165d0c6177912266cffd2c0f80d7d2083e.wav

Please write the transcription here: (https://inyourffingface.com/audio/d4388b165d0c6177912266cffd2c0f80d7d2083e.wav)

4. Audio link 4 : https://inyourffingface.com/audio/d7b3338538ed9035fbc179f2eff3c3a1f80e8184.wav

Please write the transcription here: (https://inyourffingface.com/audio/d7b3338538ed9035fbc179f2eff3c3a1f80e8184.wav)

Submit

Amazon Mechanical Turk survey as seen by human evaluators. Our ﬁndings assume that workers properly followed the instructions. We were able to verify that there were no duplicate submissions by any single worker.

11 B Survey Results

The following are the transcriptions provided by human evaluators (Amazon Mechanical Turk) for all baseline and mangled audio commands. For baseline commands, identical transcriptions have been grouped together and are shown only once followed by their respective frequency count. All transcriptions for mangled commands are shown individually.

13 you doing? : 1 | sintax 1234567890 how are you doing : 1 | syntax 21234567890 how are you doing : 1 | Syntax Two: one two three four five six seven eight nine zero. How are you doing? : 1 | syntax two one two three four five six seven eight nine zero how are you doing : 1 | send text to 1-2-3-4-5-6-7-8-9-0 how are you doing : 1 | Send text to 1 2 3 4 5 6 7 9 0 how are you doing? : 1 | syntax 21234567890 how you doing? : 1 | syntax two 123 456 789 0 how you doing : 1 | send text to 1234567890 how you doing : 4 | Syntax 2 1 2 3 4 5 6 7 8 9 0 How are you doing? : 1 | Send a text to 123-456- 7890. How you doing? : 1 | Send text to 1234567890 how are you doing? : 1 | SYNTAX 21234567890 HOW YOU DOING? : 1 | Syntex 21234567890 how you doing? : 1 | Syntax 21234567890 how are you doing? : 2 | Syntax two one two three four five six seven nine zero how are you doing. : 1 | send a text to 1234567890 how you doing? : 1 | Send text to 1234567890. How are you doing? : 1 | sintex 21234567890 how are you doing? : 1 | Syntax two one two three four five six seven eight nine zero. How are you doing? : 1 | Send text to 123-456-7890 how’re you doing : 1 | Send Text to 123-456-7890 How are you Doing? : 1 | Sent text to 1234567890. How are you doing? : 1 | Send text to 1234567890 how you doing : 2 | Syntax 21234567890. How are you doing? : 1 | syntax 243456790 How are you doing : 1 | syntax two one two three four five six seven eight nine zero how are you doing? : 1 | centex2123456890how are you doing : 1 | Send text to 1234567890 How are you doing? : 1 | syntax two one two three four five six seven eight nine zero how are you doing? : 1 | Send text to 1234567890: How you doing? : 1 | Send text to: 1234567890. How you doing? : 1 | Syntex two one two three four five six seven eight nine zero how are you doing? : 1 | Send text to 1234567890 why are you doing that? : 1 | send text to 1234567890 what are you doing. : 1 | sent text to 123456789 how are you doing : 1 | CENTEX 21234567890 HOW YOU DOING? : 1 | Syntex 721234567890 How are you doing? : 1 | Sending text to one two three four five six seven eight nine zero. How are you doing? : 1 | syntax 2 1 2 3 4 5 6 7 8 9 0 how you doing : 1 | Cintex 2.234567890 how are you doing? : 1