Inaudible Voice Commands: The Long-Range Attack and Defense

Nirupam Roy, Sheng Shen, Haitham Hassanieh, Romit Roy Choudhury University of Illinois at Urbana-Champaign

Abstract can launch from a distance of 5 ft to [48] while the attack in [39] achieves 10 ft by becoming par- Recent work has shown that inaudible signals (at ultra- tially audible. In attempting to enhance range, we real- sound frequencies) can be designed in a way that they ized strong tradeoffs with inaudibility, i.e., the output of become audible to microphones. Designed well, this can the speaker no longer remains silent. This implies that empower an adversary to stand on the road and silently currently known attacks are viable in short ranges, such control Amazon Echo and Google Home-like devices in as Alice’s friend visiting Alice’s home and silently at- people’s homes. A voice command like “Alexa, open the tacking her Amazon Echo [11, 48]. However, the gen- garage door” can be a serious threat. eral, and perhaps more alarming attack, is the one in While recent work has demonstrated feasibility, two is- which the attacker parks his car on the road and controls sues remain open: (1) The attacks can only be launched voice-enabled devices in the neighborhood, and even a from within 5 ft of Amazon Echo, and increasing this person standing next to him does not hear it. This paper range makes the attack audible. (2) There is no clear so- is an attempt to achieve such an attack radius, followed lution against these ultrasound attacks, since they exploit by defenses against them. We formulate the core prob- a recently discovered loophole in hardware non-linearity. lem next and outline our intuitions and techniques for solving them. This paper is an attempt to close both these gaps. We begin by developing an attack that achieves 25 ft range, Briefly, non-linearity is a hardware property that makes limited by the power of our amplifier. We then develop high frequency signals arriving at a microphone, say shi, a defense against this class of voice attacks that exploit get shifted to lower frequencies slow (see Figure 1). If shi non-linearity. Our core ideas emerge from a careful is designed carefully, then slow can be almost identical forensics on voice, i.e., finding indelible traces of non- to shi but shifted to within the audibility cutoff of 20kHz linearity in recorded voice signals. Our system, LipRead, inside the microphone. As a result, even though humans demonstrates the inaudible attack in various conditions, do not hear shi, non-linearity in microphones produces followed by defenses that only require software changes slow, which then become legitimate voice commands to to the microphone. devices like Amazon Echo. This is the root opportunity that empowers today’s attacks. 1 Introduction Filter A number of recent research papers have focused on the Inaudible (LPF) Nonlinearity topic of inaudible voice commands [37, 48, 39]. Back- Ampl. voice signal Ampl. causes shi? door [37] showed how hardware non-linearities in micro- phones can be exploited, such that inaudible ultrasound 40 kHz 40 kHz Freq. signals can become audible to any microphone. Dolphi- Audible Freq. Microphone Audible nAttack [48] developed on Backdoor to demonstrate that range range no software is needed at the microphone, i.e., a voice en- Figure 1: Hardware non-linearity creates frequency shift. abled device like Amazon Echo can be made to respond Voice commands transmitted over inaudible ultrasound fre- to inaudible voice commands. A similar paper indepen- quencies get shifted into the lower audible bands after passing dently emerged in arXiv [39], with a video demonstration through the non-linear microphone hardware. of such an attack [3]. These attacks are becoming in- creasingly relevant, particularly with the proliferation of Two important points need mention at this point. (1) voice enabled devices including Amazon Echo, Google Non-linearity triggers at high frequencies and at high Home, Apple Home Pod, Samsung refrigerators, etc. power – if shi is a soft signal, then the non-linear ef- fects do not surface. (2) Non-linearity is fundamental to While creative and exciting, these attacks are still defi- acoustic hardware and is equally present in speakers as cient on an important parameter: range. DolphinAttack in microphones. Thus, when shi is played through speak- ers, it will also undergo the frequency shift, producing only, as opposed to the general class of inaudible mi- an audible slow. Dolphin and other attacks sidestep this crophone attacks (such as jamming [37]). We leave this problem by operating at low power, thereby forcing the broader problem to future work. output of the speaker to be almost inaudible. This inher- ently limits the range of the attack to 5 ft; any attempt to Our overall system LipRead is built on multiple plat- increase this range will result in audibility. forms. For the inaudible attack at long ranges, we have developed an ultrasound speaker array powered by our This paper breaks away from the zero sum game between custom-made amplifier. The attacker types a command range and audibility by an alternative transmitter design. on the laptop, MATLAB converts the command to a Our core idea is to use multiple speakers, and stripe seg- voice signal, and the laptop sends this through our am- ments of the voice signal across them such that leakage plifier to the speaker. We demonstrate controlling Ama- from each speaker is narrow band, and confined to low zon Echo, iPhone , and Samsung devices from a dis- frequencies. This still produces a garbled, audible sound. tance of 25 ft, limited by the power of our amplifier. For To achieve true inaudibility, we solve a min-max opti- defense, we record signals from Android Samsung S6 mization problem on the length of the voice segments. phones, as well as from off-the-shelf microphone chips The optimization picks the segment lengths in a way such (popular in today’s devices). We attack the system with that the aggregate leakage function is completely below various ultrasound commands, both from literature as the human auditory response curve (i.e., the minimum well as our own. LipRead demonstrates defense against separation between the leakage and the human audibility all attacks with 97% precision and 98% recall. The per- curve is maximized). This ensures, by design, the attack formance remains robust across varying parameters, in- is inaudible. cluding multipath, power, attack location, and various Defending against this class of non-linearity attacks is signal manipulations. not difficult if one were to assume hardware changes Current limitations: Our long-range attacks have been to the receiver (e.g., Amazon Echo or Google Home). launched from within a large room, or from outside a An additional ultrasound microphone will suffice since house with open windows. When doors and windows it can detect the s signals in air. However, with soft- hi were closed, the attack was unsuccessful since our high- ware changes alone, the problem becomes a question of frequency signals attenuated while passing through the forensics, i.e., can the shifted signal s be discriminated low wall/glass. We believe this is a function of power, how- from the same legitimate voice command, s . In other leg ever, a deeper treatment is necessary around this ques- words, does non-linearity leave an indelible trace on s low tion. In particular: (1) Will high power amplifiers be that would otherwise not be present in s . leg powerful enough for high-frequency signals to penetrate Our defense relies on the observation that voice signals such barriers? (2) Will high-power and high-frequency exhibit well-understood structure, composed of funda- signals trigger non-linearity inside human ears? (3) Are mental frequencies and harmonics. When this structure there other leakages that will emerge in such high power passes through non-linearity, part of it remains preserved and high frequency regimes. We leave these questions to in the shifted and blended low frequency signals. In con- future work. trast, legitimate human voice projects almost no energy in these low frequency bands. An attacker that injects In sum, our core contributions may be summarized as distortion to hide the traces of voice, either pollutes the follows: core voice command, or raises the energy floor in these • A transmitter design that breaks away from the tradeoff bands. This forces the attacker into a zero-sum game, between attack range and audibility. The core ideas per- disallowing him from erasing the traces of non-linearity tain to carefully striping frequency bands across an array without raising suspicion. of speakers, such that individual speakers are silent but Our measurements confirm the possibility to detect the microphone is activated. voice traces, i.e., even though non-linearity superim- poses many harmonics and noise signals on top of each • A defense that identifies human voice traces at very low other, and attenuates them significantly, cross-correlation frequencies (where such traces should not be present) still reveals the latent voice fingerprint. Of course, var- and uses them to protect against attacks that attempt to ious intermediate steps of contour tracking, filtering, erase or disturb these traces. frequency-selective compensation, and phoneme correla- tion are necessary to extract out the evidence. Nonethe- The subsequent sections elaborate on these ideas, be- less, our final classifier is transparent and does not re- ginning with some relevant background on non-linearity, quire any training at all, but succeeds for voice signals followed by threat model, attack design, and defense. 2 Background: Acoustic Non-linearity This is essentially a f2 − f1 = 2kHz tone which will be Microphones and speakers are in general designed to be recorded by the microphone. However, this demonstrates linear systems, meaning that the output signals are linear the core opportunity, i.e., by sending a completely in- combinations of the input. In the case of power ampli- audible signal, we are able to generate an audible “copy” fiers inside microphones and speakers, if the input sound of it inside any unmodified off-the-shelf microphone. signal is s(t), then the output should ideally be: 3 Inaudible Voice Attack sout (t) = A1s(t) We begin by explaining how the above non-linearity can be exploited to send inaudible commands to voice en- where A is the amplifier gain. In practice, however, 1 abled devices (VEDs) at a short range. We identify de- acoustic components in microphones and speakers (like ficiencies in such an attack and then design the longer diaphragms, amplifiers, etc.) are linear only in the au- range, truly inaudible attack. dible frequency range (< 20kHz). In ultrasound bands (> 25kHz), the responses exhibit non-linearity [28, 19, 16, 38, 22]. Thus, for ultrasound signals, the output of 3.1 Short Range Attack the amplifier becomes: Let v(t) be a baseband voice signal that once decoded ∞ translates to the command: “Alexa, mute yourself”. An i 2 3 sout (t) = ∑ Ais (t) = A1s(t) + A2s (t) + A3s (t) + ... attacker moves this baseband signal to a high frequency i=1 fhi = 40kHz (by modulating a carrier signal), and plays 2 ≈ A1s(t) + A2s (t) it through an ultrasound speaker. The attacker also plays (1) a tone at fhi = 40kHz. The played signal is: Higher order terms are typically extremely weak since A4+  A3  A2 and hence can be ignored. shi(t) = cos(2π fhit) + v(t)cos(2π fhit) (3) Recent work [37] has shown ways to exploit this phe- After this signal passes through the non-linear hardware nomenon, i.e., it is possible to play ultrasound signals and low-pass filter of the microphone, the microphone that cannot be heard by humans but can be directly will record: recorded by any microphone. Specifically, an ultrasound A2 2  slow(t) = 1 + v (t) + 2v(t) (4) speaker can play two inaudible tones: s1(t) = cos(2π f1t) 2 at frequency f = 38kHz and s = cos(2π f t) at fre- 1 2 2 This shifted signal contains a strong component of v(t) quency f = 40kHz. Once the combined signal s (t) = 2 hi (due to more power in the speech components), and s (t) + s (t) passes through the microphone’s nonlinear 1 2 hence, gets decoded correctly by almost all microphones. hardware, the output becomes: 2 2  What happens to v (t)? sout (t) = A1shi(t) + A2shi(t) Figure 2 shows the power spectrum V( f ) correspond- 2 = A1(s1(t) + s2(t)) + A2(s1(t) + s2(t)) ing to the voice command v(t) =“Alexa, mute yourself”. Here the power spectrum corresponding to v2(t) which = A1 cos(2π f1t) + A1 cos(2π f2t) V( f ) ∗V( f ) (∗) 2 2 is equal to where is the convolution op- + A2 cos (2π f1t) + A2 cos (2π f2t) eration. Observe that the spectrum of the human voice is + 2A2 cos(2π f1t)cos(2π f2t) between [50 − 8000]Hz and the relatively weak compo- nents of v2(t) line up underneath the voice frequencies The above signal has frequency components at f , f , 1 2 after convolution. A component of v2(t) also falls at DC, 2 f , 2 f , f + f , and f − f . This can be seen by ex- 1 2 2 1 2 1 however, degrades sharply. The overall weak presence panding the equation: of v2(t) leaves the v(t) signal mostly unharmed, allow- sout (t) = A1 cos(2π f1t) + A1 cos(2π f2t) ing VEDs to decode the command correctly. However, to help v(t) enter the microphone through the + A2 + 0.5A2 cos(2π2 f1t) + 0.5A2 cos(2π2 f2t) “non-linear inlet”, shi(t) must be transmitted at suffi- + A cos(2π( f + f )t) + A cos(2π( f − f )t) 2 1 2 2 2 1 ciently high power. Otherwise, slow(t) will be buried in Before digitizing and recording the signal, the micro- noise (due to small A2). Unfortunately, increasing the phone applies a low pass filter to remove frequency com- transmit power at the speaker triggers non-linearities at ponents above the microphone’s cutoff of 24KHz. Ob- the speaker’s own diaphragm and amplifier, resulting in an audible slow(t) at the output of the speaker. Since serve that f1, f2, 2 f1, 2 f2, and f1 + f2 are all > 24kHz. Hence, what remains (as acceptable signal) is: slow(t) contains the voice command v(t), the attack be- comes audible. Past attacks sidestep this problem by op- slow(t) = A2 + A2 cos(2π( f2 − f1)t) (2) erating at low power, thereby forcing the output of the • In case the receiver device (e.g., Google Home) is voice V2(t) overlaps with V(t) fingerprinted, we assume the attacker can synthesize the legitimate user’ signal using known techniques 50Hz [46, 5] to launch the attack. • The attacker cannot estimate the precise channel im- pulse response (CIR) from its speaker to the voice en- Non-overlapping V2(t) abled device (VED) that it intends to attack.

 Core Attack Method: LipRead develops a new speaker design that facilitates considerably longer attack range, while eliminating the Figure 2: Spectrum of V( f ) ∗ V( f ) which is the non-linear audible leakage at the speaker. Instead of using one ul- leakage after passing through the microphone trasound speaker, LipRead uses multiple of them, physi- cally separated in space. Then, LipRead splices the spec- speaker to be almost inaudible [49]. This inherently lim- trum of the voice command V( f ) into carefully selected its the radius of attack to a short range of 5 ft. Attempts segments and plays each segment on a different speaker, to increase this range results in audibility, defeating the thereby limiting the leakage from each speaker. purpose of the attack.  The Need for Multiple Speakers: Figure 3 confirms this with experiments in our build- To better understand the motivation, let us first con- ing. Five volunteers visited marked locations and sider using two ultrasound speakers. Instead of playing recorded their perceived loudness of the speaker’s leak- shi(t) = cos(2π fhit) + v(t)cos(2π fhit) on one speaker, age. Clearly, speaker non-linearity produces audibility, a we now play s1(t) = cos(2π fhit) on the first speaker key problem for long range attacks. and s2(t) = v(t)cos(2π fhit) on the second speaker where fhi = 40kHz. In this case, the 2 speakers will output:

2 sout1 = cos(2π fhit) + cos (2π fhit) (5) 2 2 sout2 = v(t)cos(2π fhit) + v (t)cos (2π fhit)

For simplicity, we ignore the terms A1 and A2 (since they do not affect our understanding of frequency com- ponents). Thus, when sout1 and sout2 emerge from the two speakers, human ears filter out all frequencies > 20kHz. What remains audible is only:

slow1 = 1/2 Figure 3: Heatmap showing locations at which v(t) leakage s = v2(t)/2 from the speaker is audible. low2

Observe that neither slow1 nor slow2 contains the voice 3.2 Long Range Attack signal v(t), hence the actual attack command is no longer audible with two speakers. However, the microphone Before developing the long range attack, we concisely under attack will still receive the aggregate ultrasound present the assumptions and constraints on the attacker. signal from the two speakers, shi(t) = s1(t) + s2(t), and v(t)  Threat Model: We assume that: its own non-linearity will cause a “copy” of to get shifted into the audible range (recall Equation 4). Thus, • The attacker cannot enter the home to launch the attack, this 2-speaker attack activates VEDs from greater dis- otherwise, the above short range attack suffices. tances, while the actual voice command remains inaudi- ble to bystanders. • The attacker cannot leak any audible signals (even in a beamformed manner), otherwise such inaudible attacks Although the voice signal v(t) is inaudible, signal v2(t) are not needed in the first place. still leaks and becomes audible (especially at higher power). This undermines the attack. • The attacker is resourceful in terms of hardware and en- 2 ergy (perhaps the attacking speaker can be carried in his  Suppressing v (t) Leakage: car or placed at his balcony, pointed at VEDs in sur- To suppress the audibility of v2(t), LipRead expands to N rounding apartments or pedestrians). ultrasound speakers. It first partitions the audio spectrum V( f ) of the command signal v(t), ranging from f0 to fN, written as: into N frequency bins: [ f0, f1], [ f1, f2], ···, [ fN−1, fN] as 2 N shown in Fig. 4. This can be achieved by computing an L( f ) = V ( f ) ∗V ( f ) ∑ [ fi, fi+1] [ fi, fi+1] FFT of the signal v(t) to obtain V( f ). V( f ) is then multi- i=1 plied with a rectangle function rect( fi, fi+1) which gives To achieve true inaudibility, we need to ensure that the a filtered V[ fi, fi+1]( f ). An IFFT is then used to gener- total leakage is not audible. To address this challenge, ate v[ fi, fi+1](t) which is multiplied by an ultrasound tone th we leverage the fact that humans cannot hear the sound cos(2π fhit) and outputted on the i ultrasound speaker as shown in Fig. 4. if the sound intensity falls below certain threshold, which is frequency dependent. This is known as the “Threshold of Hearing Curve”, T( f ). Fig. 5 shows T( f ) in dB as +(!) function of frequency. Any sound with intensity below the threshold of hearing will be inaudible. !0" !# !$ 0.5!% !& !' 1 !( !)1.5 !* 2 80 × Threshold of Hearing Curve + ! [/0,/2] 40789 0 0.5 1 1.5 2 60 !" !# × + ! [/2,/:] 40789 40 0 0.5 1 1.5 2 !# !$ × +[/ ,/ ] ! 20 : ; 40789 0 0.5 1 1.5 2 !$ !% × +[/ ,/ ] ! 0 ; < 40789 0 0.5!% !& 1 1.5 2 Sound Pressure Level in dB

× -20 +[/<,/=] ! 15.6 31.2 62.5 125 250 500 1000 2000 4000 8000 16000 0 0.5 1 1.5 2 40789 !& !' Frequency in Hz × + ! [/=,/>] Figure 5: Threshold of Hearing Curve 0 0.5 !' 1 !( 1.5 2 40789

× LipRead aims to push the total leakage spectrum, L( f ), +[/>,/?] ! 0 0.5 1 !( !) 1.5 2 40789 below the “Threshold of Hearing Curve” T( f ). To this

× end, LipRead finds the best partitioning of the spectrum, +[/?,/@] ! 0 0.5 1 1.5 2 !) !* 40789 such that the leakage is below the threshold of hearing. If

40789 multiple partitions satisfy this constraint, LipRead picks the one that has the largest gap from the threshold of Figure 4: Spectrum Splicing: optimally segmenting the voice hearing curve. Formally, we solve the below optimiza- command frequencies and playing it through separate speakers tion problem: so that the net speaker-output is silent. th maximize min[T( f ) − L( f )] In this case, the audible leakage from i ultrasound { f1, f2,···, fN−1} f 2 (6) speaker will be slow,i(t) = v (t). In the frequency [ fi, fi+1] subject to f0 ≤ f1 ≤ f2 ≤ ··· ≤ fN domain, we can write this leakage as: The solution partitions the frequency spectrum to ensure S ( f ) = V ( f ) ∗V ( f ) low,i [ fi, fi+1] [ fi, fi+1] that the leakage energy is below the hearing threshold This leakage has two important properties: for every frequency bin. This ensures inaudibility at any human ear.  2  2 (1) E |Slow,i( f )| ≤ E |V( f ) ∗V( f )|  Increasing Attack Range: (2) BW(Slow,i( f )) ≤ BW(V( f ) ∗V( f )) It should be possible to increase attack range with more where E[|.|2] is the power of audible leakage and BW(.) speakers, while also limiting audible leakage below the is the bandwidth of the audible leakage due to non- required hearing threshold. This holds in principle due r linearities at each speaker. The above properties imply to the following reason. For a desired attack range, say , that splicing the spectrum into multiple speakers reduces we can compute the minimum power density (i.e., power the audible leakage from any given speaker. It also re- per frequency) necessary to the VED. This power P duces the bandwidth and hence concentrates the audible r needs to be high since the non-linear channel will A leakage in a smaller band below 50 Hz. strongly attenuate it by the factor 2. Now consider the worst case where a voice command has equal magnitude While per-speaker leakage is smaller, they can still add in all frequencies. Given each frequency needs power Pr up to become audible. The total leakage power can be and each speaker’s output needs to be below threshold of hearing for all frequencies, we can run our min-max 1250 1250 optimization for increasing values of N, where N is the 1000 1000 number of speakers. The minimum N that gives a fea- 750 750 sible solution is the answer. Of course, this is the upper Frequency (Hz) bound; for a specific voice signal, N will be lower. 500 Frequency (Hz) 500 250 250

Increasing speakers can be viewed as beamforming the 0 0.2 0.4 0.6 0.8 1.0 0 0.2 0.4 0.6 0.8 1.0 Time (Sec) Time (Sec) energy towards the VED. In the extreme case for exam- Figure 6: Spectrogram for sleg and snl for voice command ple, every speaker will play one frequency tone, resulting “Alexa, mute yourself”. in a strong DC component at the speaker’s output which would still be inaudible. In practice, our experiments are voice command (without the non-linear component); if bottlenecked by ADCs, amplifiers, speakers, etc., hence the polluted version s passes the T2S test, the cleaner we will report results with an array of 61 small ultra- version obviously will. sound speakers. 2  Energy at Low Frequencies from v (t): 4 Defending Inaudible Voice Commands Another solution is to extract portions of s(t) from the lower frequencies – since regular voice signals do not Recognizing inaudible voice attacks is essentially a prob- contain sub-50Hz components, energy detection should lem of acoustic forensics, i.e., detecting evidence of non- offer evidence. Unfortunately, environmental noise (e.g., linearity in the signal received at the microphone. Of fans, A/C machines, wind) leaves non-marginal residue course, we assume the attacker knows our defense tech- in these low bands. Moreover, an attacker could delib- niques and hence will try to remove any such evidence. erately reduce the power of its signal so that its leakage Thus, the core question comes down to: is there any trace into sub-50Hz is small. Our experiments showed non- of non-linearity that just cannot be removed or masked? marginal false positives in the presence of environmental To quantify this, let v(t) denote a human voice command sound and soft attack signals. signal, say “Alexa, mute yourself”. When a human is-  Amplitude Degradation at Higher Frequencies: sues this command, the recorded signal sleg = v(t)+n(t), The air absorbs ultrasound frequencies far more than where n(t) is noise from the microphone. When an at- voice (which translates to sharper reduction in amplitude tacker plays this signal over ultrasound (to launch the as the ultrasound signal propagates). Measured across non-linear attack), the recorded signal snl is: different microphones separated by ≈ 7.3cm in Amazon Echo and Google Home, the amplitude difference should A2 2 snl = (1 + 2v(t) + v (t)) + n(t) (7) be far greater for ultrasound. We designed a defense that 2 utilized the maximum amplitude slope between micro- Figure 6 shows an example of sleg and snl. Evidently, phone pairs – this proved to be a robust discriminator both are very similar, and both invoke the same response between sleg and snl. However, we were also able to in VEDs (i.e., the text-to-speech converter outputs the point two (reasonably synchronized) ultrasound beams same text for both sleg and snl). A defense mechanism from opposite directions. This reduced the amplitude would need to examine any incoming signal s and tell gradient, making it comparable to legitimate voice sig- if it is low-frequency legitimate or a shifted copy of the nals (Alexa treated the signals as multipath). In the real- high-frequency attack. world, we envisioned 2 attackers launching this attack by standing at 2 opposite sides of a house. Finally, this solu- tion would require an array of microphones on the voice 4.1 Failed Defenses enabled device. Hence, it is inapplicable to one or two Before we describe LipRead’s defense, we mention a few microphone systems (like phones, wearables, refrigera- other possible defenses which we have explored before tors). converging on our final defense system. We concisely summarize 4 of these ideas.  Phase Based Separation of Speakers: Given that long range attacks need to use at least 2 speak-  Decompose Incoming Signal s(t): ers (to bypass speaker non-linearity), we designed an A2 2 One solution is to solve for s(t) = 2 (1 + 2v ˆ(t) + vˆ (t)), angle-of-arrival (AoA) based technique to estimate the and test if the resultingv ˆ(t) produces the same text-to- physical separation of speakers. In comparison to human speech (T2S) output as s(t). However, this proved to be voice, the source separation consistently showed success, a fallacious argument because, if such av ˆ(t) exists, it so long as the speakers are more than 2cm apart. While will always produce the same T2S output as s(t). This practical attacks would certainly require multiple speak- is because such av ˆ(t) would be a cleaner version of the ers, easily making them 2cm apart, we aimed at solving the short range attack as well (i.e., where the attack is launched from a single speaker). Put differently, the right evidence of non-linearity should be one that is present re- gardless of the number of speakers used.

4.2 LipRead Defense Design Our final defense is to search for traces of v2(t) in sub- 50Hz. However, we now focus on exploiting the struc- ture of human voice. The core observation is simple: voice signals exhibit well-understood patterns of funda- mental frequencies, added to multiple higher order har- monics (see Figure 6). We expect this structure to partly reflect in the sub-50Hz band of s(t) (that contains v2(t)), and hence correlate with carefully extracted spectrum above-50Hz (which contains the dominant v(t)). With appropriate signal scrubbing, we expect the correlation to emerge reliably, however, if the attacker attempts to dis- rupt correlation by injecting sub-50Hz noise, the stronger energy in this low band should give away the attack. We intend to force the attacker into this zero sum game. Figure 7: (a) A simplified voice spectrum showing the struc- 2  Key Question: Why Should v (t) Correlate? ture. (b) Voice spectra after non-linear attack. Figure 7(a) shows a simplified abstraction of a legitimate

voice spectrum, with a narrow fundamental frequency 300 300 band around f j and harmonics at integer multiples n f j. 250 250 200 The lower bound on f j is > 50Hz [41]. Now recall that 200 when this voice spectrum undergoes non-linearity, each 150 150 100 100 Frequency (Hz) f n f Frequency (Hz) of j and j will self-convolve to produce “copies” of 50 50

themselves around DC (Figure 7(b)). Of course, the A2 25 25

term from non-linearity strongly attenuates this “copy”. 0 0.2 0.4 0.8 1.0 0 0.2 0.4 0.8 1.0 Time (Sec) Time (Sec) However, given the fundamental band around f j and the harmonics around n f are very similar in structure, each (a) (b) j Figure 8: Spectrogram of the (a) audible and (b) inaudible of ≈ 20Hz bandwidth, the energy between [0,20kHz] su- attack voice. The attack signal contains higher power below perimposes. This can be expressed as: 50Hz, indicated by lighter color.

" N # E ≈ E A |V ∗V |2 the lower band. Legitimate voice signals generate signif- [0,20] 2 ∑ [n f j−20, n f j+20] [n f j−20, n f j+20] n=1 icantly less energy in these bands, thereby remaining flat (8) for higher loudness. The net result is distinct traces of energy in sub-20Hz bands, and importantly, this energy variation (over time)  Correlation Design The width of the fundamental frequencies and harmonics mimics that of f j. For a legitimate attack, on the other hand, the sub-20Hz is dominantly uncorrelated hardware are time-varying, however, at any given time, if it is B Hz, and environmental noise. then the self-convolved signal gets shifted into [0,B]Hz as well. Note that this is independent of the actual val- Figure 8(a) and (b) zoom into sub-50Hz and compare ues of center frequencies, f j and n f j. Now, let sB(t) be the signal above B Hz that contains the ticularly when the actual voice signal is strong. Figure voice command. LipRead seeks to correlate the energy 9 plots the power in the sub-50Hz band with increas- variation over time in sB(t). We track the loudness level is expressed in dBSpl, where Spl denotes fundamental frequency in s>B(t) using standard acous- “sound pressure level”, the standard metric for measur- tic libraries, but then average the power around B Hz ing sound. Evidently, non-linearity shows increasing of this frequency. This produces a power profile over

power due to the self-convolved spectrum overlapping in time, Pf j . For s

Power below 50Hz (dB) frequencies would also self-convolve and get “copied” -70 around DC. This certainly affects correlation since these 50 60 70 80 spurious frequencies would not correlate well (in fact, Loudness of received sound (dBSpl) they can be designed to not correlate). The attacker’s Figure 9: The loudness vs sub-50Hz band power plot. hope should be to lower correlation while maintaining a 1 low energy footprint below 20Hz. Attack voice 0.8 Legitimate voice The attacker can use the above approaches to try to de- feat the zero-sum game. Figure 11 plots results from 0.6 4000 attempts to achieve low correlation and low energy. 0.4 Of these, 3500 are random noises injected in legitimate voice commands, while the remaining 500 are more care- 0.2 fully designed distortions (such as frequency concatena- 0 tion, phase distortions, low frequency injection, etc.). Of course, in all these cases, the distorted signal was still Normalized correlation coeff. -0.2 50 60 70 80 correct, i.e., the VED device responded as it should. Loudness of received sound (dBSpl) On the other hand, 450 different legitimate words were

Figure 10: The loudness vs correlation between Pf j and spoken by different humans (shown as hollow dots), at s

Sub-50Hz power -60 relation gap, implying that non-linearity is leaving some trace in the low-frequency bands, and this trace preserves -70 some structure of the actual voice signal. Of course, we -0.4 0 0.4 0.8 have not yet accounted for the possibility that the attacker Correlation coeff. can inject noise to disrupt correlation. Figure 11: Zero sum game between correlation and power at sub-50Hz bands. Attacker attempts to reduce correlation by  Improved Attack via Signal Shaping signal shaping or noise injection at sub-50Hz band. The natural question for the attacker is how to mod- 2 ify/add signals such that the correlation gap gets nar-  Leveraging Amplitude Skew from v (t) rowed. Several possibilities arise: In order to increase the separation margin, LipRead leverages the amplitude skew resulting from v2(t). (1) Signal −v2(t) can be added to the speaker in the low Specifically, two observations emerge: (1) When the har- frequency band and transmitted with the high frequency monics in voice signals self-convolve to form v2(t), they ultrasound v(t). Given that ultrasound will produce v2(t) fall at the same frequencies of the harmonics (since the after non-linearity, and −v2(t) will remain as is, the two gaps between the harmonics are quite homogeneous). (2) should interact at the microphone and cancel. Unfortu- The signal v2(t) is a time domain signal with only posi- 3 0.6 0.6 Attack voice Legitimate voice 0.4 0.4 2.5

0.2 0.2 2 Amplitude Amplitude 0 0 1.5 Amplitude skew -0.2 -0.2 1 0.5 1 1.5 0.5 1 1.5 20 60 100 140 Time (s) Time (s) Recording # (a) (b) (c) Figure 12: (a) Sound signals in time domain from sleg and (b) snl, demonstrating a case of amplitude skew. (c) Amplitude skew for various attack and legitimate voice commands. tive amplitude. Combining these together, we postulated FAR that amplitudes of the harmonics would be positively bi- FRR ased, especially for those that are strong (since v2(t) will be relatively stronger at that location). In contrast, ampli- tudes of legitimate voice signals should be well balanced FAR/FRR on the positive and negative. Figure 12(a,b) shows one contrast between a legitimate voice sleg and the recorded attack signal snl. In pursuit of this opportunity, we ex- tract the ratio of the maximum and minimum amplitude I ) Coeff. ( corr (we average over the top 10% for robustness against out- Correlation liers). Using this as the third dimension for separation, Figure 13: The False Acceptance Rate plane (dark color) and Figure 12(c) re-plots the sleg and snl clusters. While the the False Rejection Rate plane (light color) for different sub- separation margin is close, combining it with correlation 50Hz power and correlation values. and power, the separation becomes satisfactory. • We test our attack prototype with 984 commands to Amazon Echo and 200 commands to smartphones – the  LipRead’s Elliptical Classifier attacks are launched from various distances with 130 LipRead leverages 3 features to detect an attack: power different background noises. Figure 15 shows attack in sub-50Hz, correlation coefficient, and amplitude skew. success at 24 ft for Amazon Echo and 30 ft for smart- Analyzing the False Acceptance Rate (FAR) and False phones at a power of 6watt. Rejection Rate (FRR), as a function of these 3 parame- ters, we have converged on a ellipsoidal-based separation • We record 12 hours of microphone data – 5 hours of hu- technique. To determine the optimal decision boundary, man voice commands and 7 hours of attack commands we compute False Acceptance Rate (FAR) and False Re- through ultrasound speakers. Figure 16(c) shows that jection Rate (FRR) for each candidate ellipsoid. Our aim attack words are recognized by VEDs with equal accu- is to pick the parameters of the ellipse that minimize both racy as legitimate human words. Figure 16(b) confirms FAR and FRR. Figure 13 plots the FAR and FRR as inter- that all attacks are inaudible, i.e., the leakage from our secting planes in a logarithmic scale (Note that we show speaker array is 5-10dB below human hearing threshold. only two features since it is not possible to visualize the • Figure 17(a) shows the precision and recall of our de- 4D graph). The coordinate with minimum value along fense technique, as 98% and 99%, respectively, when the canyon – indicating the equal error rates – gives the the attacker does not manipulate the attack command. optimal selection of ellipsoid. Since it targets speech Importantly, precision and recall remain steady even un- commands, this classifier is designed offline, one-time, der signal manipulation. and need not be trained for each device or individual. Before elaborating on these results, we first describe our evaluation platforms and methodology. 5 Evaluation We evaluate LipRead on 3 main metrics: (1) attack range, 5.1 Platform and Methodology (2) inaudibility of the attack, and the recorded sound (1) Attack speakers: Figure 14(b) shows our custom- quality (i.e., whether the attacker’s command sounds designed speaker system consisting of 61 ultrasonic human-like), and (3) accuracy of the defense under vari- piezoelectric speakers arranged as a hexagonal planar ar- ous environments. We summarize our findings below. ray. The elements of the array are internally connected in two separate clusters. A dual channel waveform gen- tion hit rate against increasing distance – higher hit-rates erator (Keysight 33500b series [4]) drives the first cluster indicate success with less number of attempts. The aver- with the voice signal, modulated at the center frequency age distance achieved for 50% hit rate is 24 ft, while the of 40kHz. This cluster forms smaller sub-clusters to maximum for Siri and Samsung S-voice are measured to transmit separate segments of the spliced spectrum. The be 27 and 30 ft respectively. second cluster transmits the pure 40kHz tone through each speaker. The signals are amplified to 30 Volts us- Figure 15(b) plots the attack range again, but for the ing a custom-made NE5534AP op-amp based amplifier entire voice command. We declare “success” if the circuit. This prototype is limited to a maximum power text to speech translation produces every single word in of 6watt because of the power ratings of the operational the command. The range degrades slightly due to the amplifiers. More powerful amplifiers are certainly avail- stronger need to decode every word correctly. able to a resourceful attacker. (2) Target VEDs: We test our attack on 3 different VEDs 1 1 – Amazon Echo, Samsung S6 smartphone running An- 0.8 0.8 droid v7.0, and Siri on an iPhone 5S running iOS v10.3. 0.6 0.6 Unlike Echo, Samsung S-voice and Siri requires person- 0.4 0.4 Alexa Alexa alization of the wake-word with user’s voice – adding a 0.2 S-Voice 0.2 S-Voice Wake-word hit rate Siri Command accuracy Siri layer of security through voice authentication. However, 0 0 10 20 30 10 20 30 voice synthesis is known to be possible [46, 5], and we Attack distance (ft) Attack distance (ft) assume that the synthesized wake-word is already avail- Figure 15: (a) The wake-word hit rate and (b) the command able to the attacker. detection accuracy against increasing distances. Experiment setup: We run our experiments in a lab Figure 16(a) reports the attack range to Echo for increas- space occupied by 5 members and also in an open cor- ing input power to the speaker system. As expected, the ridor. We place the VEDs and the ultrasonic speaker at range continues to increase, limited by the power of our various distances ranging up to 30 ft. During each at- 6Watt amplifiers. More powerful amplifiers would cer- tack, we play varying degrees of interfering signals from tainly enhance the attack range, however, for the pur- 6 speakers scattered across the area, emulating natural poses of prototyping, we designed our hardware in the home/office noises. The attack signals were designed by lower power regime. first collecting real human voice commands from 10 dif- ferent individuals; MATLAB is used to modulate them Leakage audibility: Figure 16(b) plots the efficacy of to ultrasound frequencies. For speech quality of the at- our spectrum splicing optimization, i.e., how effectively tack signals, we used the open-source Sphinx4 speech does LipRead achieve speaker-side inaudibility for dif- processing tool [1]. ferent ultrasound commands. Observe that without splic- ing (i.e., “no partition”), the ultrasound voice signal is almost 5dB above the human hearing threshold. As the number of segments increase, audibility falls below the hearing curve. With 60 speakers in our array, we use 6 segments, each played through 5 speakers; the remain- ing 31 were used for the second cos(2π fct) signal. Note that the graph plots the minimum gap between the hear- ing threshold and the audio playback, implying that this is a conservative worst case analysis. Finally, we show (a) (b) results from 20 example attack commands – the other commands are below the threshold. Figure 14: LipRead evaluation setup: (a) Ultrasonic speaker and voice enabled devices. (b) The ultrasonic speaker array for Received speech quality: Given 6 speakers were trans- attack. mitting each spliced segment of the voice command, we 5.2 Attack Performance intend to understand if this distorts speech quality. Figure 16(c) plots the word recognition accuracy via Sphinx [1], Activation distance: This experiment attempts to ac- an automatic speech recognition software. Evidently, tivate the VEDs from various distances. We repeat- LipRead’s attack quality is comparable to human quality, edly play the inaudible wake-word from the ultrasound implying that our multi-speaker beamforming preserves speaker system at regular intervals and count the fraction the speech’s structure. In other words, speech quality is of successful activation. Figure 15(a) shows the activa- not the bottleneck for attack range. 25 100

20 80

15 60 Audible 10 Inaudible 40 Distance (ft) 5 Accuracy (%) 20 Attack Legitimate 0 0 2 3 4 5 6 60 65 70 75 80 Power (W) Loudness (dbSpl) (a) (b) (c) Figure 16: (a) Maximum activation distance for different input power. (b) Worst case audibility of the leakage sound after optimal spectrum partitioning. (c) Word recognition accuracy with automatic speech recognition software for attack and legitimate voices.

5.3 Defense Performance strategy is to eliminate the v2(t) correlation by injecting noise in the attack signal. We considered four different Metrics: Our defense technique essentially attempts to categories of noise – white Gaussian noise to raise the classify the attack scenarios distinctly from the legiti- noise floor, band-limited noise on the Sub-50Hz region, mate voice commands. We report the “Recall’ and “Pre- water-filling noise power at low frequencies to mask the cision” of this classifier for various sound pressure levels correlated power variations, and intermittent frequencies (measured in dBSPL), varying degrees of ambient sounds below 50 Hz. As shown, in Figure 17(c), the process as interference, and deliberate signal manipulation. Re- does not significantly impact the performance because of call that our metrics refer to: the power-correlation trade-off exploited by the defense • Precision: What fraction of our detected attacks are classifier. Figure 17(d) shows that the overall accuracy correct? of the system is also above 99% across all experiments. • Recall: What fraction of the attacks did we detect? 6 Points of Discussion We now present the graphs beginning with the basic clas- sification performance. We discuss several dimensions of improvement. Basic attack detection: Figure 17(a) shows the at-  Lack of formal guarantee: We have not formally tack detection performance in normal home environment proved our defense. Although LipRead is systematic and without significant interference. The average precision transparent (i.e., we understand why it should succeed) and recall of the system is 99% across various loudness it still leaves the possibility that an attack may breach of the received voice. This result indicates best case per- the defense. Our attempts to mathematically model the formance of our system with minimum false alarm. self-convolution and correlation did not succeed since frequency and phase responses for general voice com- Impact of ambient noise: In this section we test our de- mands were difficult to model, as were real-world noises. fense system for common household sounds that can po- A deeper treatment is necessary, perhaps with help from tentially mix with the received voice signal and change speech experts who can model the phase variabilities in its features leading to misclassification. To this end, speech. We leave this to future work. we played 130 noise sounds through multiple speakers while recording attack and legitimate voice signals with a  Generalizing to any signal: Our defense is designed smartphone. We replayed the noises at 4 different sound for the class of voice signals, which applies well to in- pressure levels starting from a typical value of 50 dBSPL audible voice attacks. A better defense should find the to extremely loud 80 dBSPL, while the voice loudness is true trace of non-linearity, not just for the special case of kept constant at 65 dBSpl. Figure 17(b) reports the pre- voice. This remains an open problem. cision and recall for this experiment. The recall remains close to 1 for all these noise levels, indicating that we do  Is air non-linear as well? There is literature that not miss attacks. However, at higher interference levels, claims air is also a non-linear medium [17, 10, 45]. When the precision slightly degrades since the false detection excited by adequately powerful ultrasound signals, self- rate increases a bit when noise levels are extremely high convolution occurs, ultimately making sounds audible. which is not common in practice. Authors in [36, 2] are designing acoustic spotlighting systems where the idea is to make ultrasound signals au- Impact of injected noise: Next, we test the defense per- dible only along a direction. We have encountered traces formance against deliberate attempts to eliminate nonlin- of air non-linearity, although in rare occasions. This cer- earity features from the attack signal. Here the attackers tainly call for a separate treatment in the future. 1 1 1 1

0.9 0.9 0.9 0.9

0.8 0.8 0.8 Accuracy 0.8 Precision/Recall Precision/Recall Precision Precision Precision/Recall Precision Recall Recall Recall 0.7 0.7 0.7 0.7 60 70 80 50 60 70 80 60 70 80 50 60 70 80 90 Loudness (dbSpl) Ambient noise level (dbSpl) Loudness (dbSpl) Loudness (dbSpl) Figure 17: Precision and Recall of defense performance under various condition: (a) basic performance without external interfer- ence, (b) performance under ambient noise, and (c) performance under injected noise. (d) Overall accuracy across all experiments.

 Through-wall attack: Due to the limited maximum [2], SoundLazer [7, 6], and other projects [47, 8, 34] power (6watt) of our amplifiers, we tested our system in have rolled out commercial products based on this con- non-blocking scenarios. If the target device is partially cept. Ultrasonic hearing aids [29, 13, 15, 35, 31] and blocked (e.g. furnitures in the room blocking line-of- headphones [25] explore the human body as a nonlinear sight), the SNR reduces and our attack range will reduce. medium to enable voice transfer through bone conduc- This level of power has not allowed us to launch through- tion. Our work, however, is opposite of these efforts – wall attacks yet. We leave this to future work. we attempt to retain the inaudible nature of ultrasound while making it recordable inside electronic circuits. 7 Related Work Speaker Linearization: A number of research [23, Attack on Voice Recognition Systems: Recent re-   26, 18] studies the possibility of adaptive linearization of search [11, 42] shows that spoken words can be man- general speakers. Through simulations, the authors have gled such that they are unrecognizable to humans, yet de- shown that by pre-processing the input signal, they can codable by voice recognition (VR) systems. GVS-Attack achieve as much as 27dB reduction [18] of the nonlinear [14] exploits this by creating a smartphone app that gives distortion in the noise-free case. Their techniques are not adversarial commands to its voice . More re- yet readily applicable to real speakers, since they have all cently, BackDoor [37] has taken advantage of the micro- assumed very weak nonlinearities, and over-simplified phone’s nonlinearity to design ultrasonic sounds which electrical and mechanical structures of speakers. With are inaudible to humans, but becomes recordable inside real speakers, especially ultrasonic piezoelectric speak- the off-the-shelf microphones. The application includes ers, it is difficult to fully characterize the parameters of preventing acoustic eavesdropping with inaudible jam- the nonlinear model. Of course, if future techniques can ming signals. As follow up, [48, 39] show that the prin- fully characterize such models, our system can be made ciples of BackDoor can be used to send inaudible attack to achieve longer range with fewer speakers. commands to a VED, but requires physical proximity to remain audible. LipRead demonstrates the feasibility 8 Conclusion to increase the inaudible attack range, but more impor- tantly, designs a defense against the inaudible attacks. This paper builds on existing work to show that inaudi- ble voice commands are viable from distances of 25+ In past, researchers use near-ultrasound [27, 32, 40, 9, ft. Of course, careful design is necessary to ensure the 21, 30] and exploited aliasing to record inaudible sound attack is truly inaudible – small leakages from the at- with microphone. A number of papers use other sound tacker’s speakers can raise suspicion, defeating the at- to camouflage audible signal in order to make it indis- tack. This paper also develops a defense against in- tinguishable to human [24, 20, 12]. CovertBand [33] audible voice commands that exploit microphone non- use music to hide audible harmonic components at the linearity. We show that non-linearity leaves traces in the speaker. LipRead, on the other hand, use high frequency recorded voice signal, that are difficult to erase even with ultrasound as inaudible signal and leverages hardware deliberate signal manipulation. Our future work is aimed nonlinearity to make them recordable to microphone. at solving the broader class of non-linearity attacks for any signals, not just voice.  Acoustic Non-linearity: A body of research [17, 10, 45], inspired by Westervelt’s seminal theory [44, 43] Acknowledgement on nonlinear Acoustics, studies the distortions of sound while moving through nonlinear mediums including the We sincerely thank our shepherd Prof. Shyamnath Gol- air. This raises the possibility that ultrasonic sound can lakota and the anonymous reviewers for their valuable naturally self-demodulate in the air to generate audible feedback. We are grateful to the Joan and Lalit Bahl sounds, making it possible to develop a highly direc- Fellowship, Qualcomm, IBM, and NSF (award number: tional speaker [17, 10, 45]. Recently, AudioSpotlight 1619313) for partially funding this research. References [20]G RUHL,D.,LU,A., AND BENDER, W. Echo hiding. In In- ternational Workshop on Information Hiding (1996), Springer, [1] Cmu sphinx. http://cmusphinx.sourceforge.net. Last ac- pp. 295–315. cessed 6 December 2015. [21]G UPTA,S.,MORRIS,D.,PATEL,S., AND TAN, D. Soundwave: [2] Holosonics webpage. https://holosonics.com. Last ac- using the doppler effect to sense gestures. In Proceedings of the cessed 28 November 2016. SIGCHI Conference on Human Factors in Computing Systems (2012), ACM, pp. 1911–1914. [3] Inaudible voice commands demo. https://www.youtube. com/watch?v=wF-DuVkQNQQ&feature=youtu.be. Last ac- [22]H AWKSFORD, . J. Distortion correction in audio power am- cessed 24 September 2017. plifiers. Journal of the Audio Engineering Society 29, 1/2 (1981), 27–30. [4] Keysight waveform generator. http://literature.cdn. keysight.com/litweb/pdf/5991-0692EN.pdf. Last ac- [23]J AKOBSSON,D., AND LARSSON, M. Modelling and compensa- cessed 24 September 2017. tion of nonlinear loudspeaker. [5] Lyrebird. https://lyrebird.ai. Last accessed 24 September [24]J AYARAM, P., RANGANATHA,H., AND ANUPAMA, H. Infor- 2017. mation hiding using audio steganography–a survey. The Inter- national Journal of Multimedia & Its Applications (IJMA) Vol 3 [6] Soundlazer kickstarter. https://www.kickstarter.com/ (2011), 86–96. projects/richardhaberkern/soundlazer. Last accessed 28 November 2016. [25]K IM,S.,HWANG,J.,KANG, T., KANG,S., AND SOHN,S. Generation of audible sound with ultrasonic signals through the [7] Soundlazer webpage. http://www.soundlazer.com. Last ac- human body. In Consumer Electronics (ISCE), 2012 IEEE 16th cessed 28 November 2016. International Symposium on (2012), IEEE, pp. 1–3. [8] Woody norris ted talk. https://www.ted.com/speakers/ [26]K LIPPEL, W. J. Active reduction of nonlinear loudspeaker dis- woody_norris. Last accessed 28 November 2016. tortion. In INTER-NOISE and NOISE-CON Congress and Con- ference Proceedings (1999), vol. 1999, Institute of Noise Control [9]A UMI, M. T. I., GUPTA,S.,GOEL,M.,LARSON,E., AND Engineering, pp. 1135–1146. PATEL, S. Doplink: Using the doppler effect for multi-device interaction. In Proceedings of the 2013 ACM international joint [27]L AZIK, P., AND ROWE, A. Indoor pseudo-ranging of mobile conference on Pervasive and ubiquitous computing (2013), ACM, devices using ultrasonic chirps. In Proceedings of the 10th ACM pp. 583–586. Conference on Embedded Network Sensor Systems (2012), ACM, pp. 99–112. [10]B JØRNØ, L. Parametric acoustic arrays. In Aspects of Signal Processing. Springer, 1977, pp. 33–59. [28]L EE,K.-L., AND MAYER, R. Low-distortion switched-capacitor filter design techniques. IEEE Journal of Solid-State Circuits 20, [11]C ARLINI,N.,MISHRA, P., VAIDYA, T., ZHANG, Y., SHERR, 6 (1985), 1103–1113. M.,SHIELDS,C.,WAGNER,D., AND ZHOU, W. Hidden voice commands. In USENIX Security Symposium (2016), pp. 513– [29]L ENHARDT,M.L.,SKELLETT,R.,WANG, P., AND CLARKE, 530. A. M. Human ultrasonic speech perception. Science 253, 5015 (1991), 82–85. [12]C ONSTANDACHE,I.,AGARWAL,S.,TASHEV,I., AND CHOUD- HURY, R. R. Daredevil: indoor location using sound. ACM SIG- [30]L IN,Q.,YANG,L., AND LIU, Y. Tagscreen: Synchroniz- MOBILE Mobile Computing and Communications Review 18, 2 ing social televisions through hidden sound markers. In IN- (2014), 9–19. FOCOM 2017-IEEE Conference on Computer Communications, IEEE (2017), IEEE, pp. 1–9. [13]D EATHERAGE,B.H.,JEFFRESS,L.A., AND BLODGETT, [31]N AKAGAWA,S.,OKAMOTO, Y., AND FUJISAKA, Y.-I. Devel- H. C. A note on the audibility of intense ultrasonic sound. The opment of a bone-conducted ultrasonic hearing aid for the pro- Journal of the Acoustical Society of America 26, 4 (1954), 582– foundly sensorineural deaf. Transactions of Japanese Society for 582. Medical and Biological Engineering 44, 1 (2006), 184–189. [14]D IAO, W., LIU,X.,ZHOU,Z., AND ZHANG, K. Your voice [32]N ANDAKUMAR,R.,GOLLAKOTA,S., AND , N. Con- assistant is mine: How to abuse speakers to steal information and tactless sleep apnea detection on smartphones. In Proceedings control your phone. In Proceedings of the 4th ACM Workshop on of the 13th Annual International Conference on Mobile Systems, Security and Privacy in Smartphones & Mobile Devices (2014), Applications, and Services (2015), ACM, pp. 45–57. ACM, pp. 63–74. [33]N ANDAKUMAR,R.,TAKAKUWA,A.,KOHNO, T., AND GOL- [15]D OBIE,R.A., AND WIEDERHOLD, M. L. Ultrasonic hearing. LAKOTA, S. Covertband: Activity information leakage using mu- Science 255, 5051 (1992), 1584–1585. sic. Proceedings of the ACM on Interactive, Mobile, Wearable [16]D OBRUCKI, A. Nonlinear distortions in electroacoustic devices. and Ubiquitous Technologies 1, 3 (2017), 87. Archives of Acoustics 36, 2 (2011), 437–460. [34] NORRIS, E. Parametric transducer and related methods, May 6 [17]F OX,C., AND AKERVOLD, O. Parametric acoustic arrays. The 2014. US Patent 8,718,297. Journal of the Acoustical Society of America 53, 1 (1973), 382– [35]O KAMOTO, Y., NAKAGAWA,S.,FUJIMOTO,K., AND 382. TONOIKE, M. Intelligibility of bone-conducted ultrasonic [18]G AO, F. X., AND SNELGROVE, W. M. Adaptive linearization speech. Hearing research 208, 1 (2005), 107–113. of a loudspeaker. In Acoustics, Speech, and Signal Processing, [36]P OMPEI, F. J. Sound from ultrasound: The parametric array as 1991. ICASSP-91., 1991 International Conference on (1991), an audible sound source. PhD thesis, Massachusetts Institute of IEEE, pp. 3589–3592. Technology, 2002. [19]G ONZALEZ´ ,G.G.G., AND NASSI¨ , I. M. S. V. Measurements [37]R OY,N.,HASSANIEH,H., AND CHOUDHURY, R. R. Back- for modelling of wideband nonlinear power amplifiers for wire- door: Making microphones hear inaudible sounds. In Proceed- less communications. Department of Electrical and Communica- ings of the 15th Annual International Conference on Mobile Sys- tions Engineering, Helsinki University of Technology (2004). tems, Applications, and Services (2017), ACM. [38]S ELF,D. Audio power amplifier design handbook. Taylor & [45]Y ANG,J.,TAN,K.-S.,GAN, W.-S., ER,M.-H., AND YAN, Francis, 2006. Y.-H. Beamwidth control in parametric acoustic array. Japanese [39]S ONG,L., AND MITTAL, P. Inaudible voice commands, 2017. Journal of Applied Physics 44, 9R (2005), 6817. [40]S UN,Z.,PUROHIT,A.,BOSE,R., AND ZHANG, P. Spartacus: [46]Y E,H., AND YOUNG, S. High quality voice morphing. spatially-aware interaction for mobile devices through energy- In Acoustics, Speech, and Signal Processing, 2004. Proceed- efficient audio sensing. In Proceeding of the 11th annual interna- ings.(ICASSP’04). IEEE International Conference on (2004), tional conference on Mobile systems, applications, and services vol. 1, IEEE, pp. I–9. (2013), ACM, pp. 263–276. [47]Y ONEYAMA,M.,FUJIMOTO,J.-I.,KAWAMO, Y., AND [41]T ITZE,I.R., AND MARTIN, D. W. Principles of voice produc- SASABE, S. The audio spotlight: An application of nonlinear tion. The Journal of the Acoustical Society of America 104, 3 interaction of sound waves to a new type of loudspeaker design. (1998), 1148–1148. The Journal of the Acoustical Society of America 73, 5 (1983), [42]V AIDYA, T., ZHANG, Y., SHERR,M., AND SHIELDS, C. Co- 1532–1536. caine noodles: Exploiting the gap between human and machine [48]Z HANG,G.,YAN,C.,JI,X.,ZHANG, T., ZHANG, T., AND XU, 9th USENIX Workshop on Offensive Tech- speech recognition. In W. Dolphinatack: Inaudible voice commands. arXiv preprint nologies (WOOT 15) (Washington, D.C., 2015), USENIX Asso- arXiv:1708.09537 (2017). ciation. [49]Z HANG,G.,YAN,C.,JI,X.,ZHANG, T., ZHANG, T., AND XU, [43]W ESTERVELT, P. J. The theory of steady forces caused by sound waves. The Journal of the Acoustical Society of America 23, 3 W. Dolphinattack: Inaudible voice commands. In Proceedings of (1951), 312–315. the ACM Conference on Computer and Communications Security (CCS) (2017), ACM. [44]W ESTERVELT, P. J. Scattering of sound by sound. The Journal of the Acoustical Society of America 29, 2 (1957), 199–203.