Beyond Laurel/Yanny: An Autoencoder-Enabled Search for Polyperceivable Audio

Kartik Chandra                Chuma Kabaghe
Stanford University           Stanford University
[email protected]         [email protected]

Gregory Valiant
Stanford University
[email protected]

Abstract

The famous "laurel/yanny" phenomenon references an audio clip that elicits dramatically different responses from different listeners. For the original clip, roughly half the population hears the word "laurel," while the other half hears "yanny." How common are such "polyperceivable" audio clips? In this paper we apply ML techniques to study the prevalence of polyperceivability in spoken language. We devise a metric that correlates with polyperceivability of audio clips, use it to efficiently find new "laurel/yanny"-type examples, and validate these results with human experiments. Our results suggest that polyperceivable examples are surprisingly prevalent, existing for >2% of English words.¹

¹This research was conducted under Stanford IRB Protocol 46430.

1 Introduction

How robust is human sensory perception, and to what extent do perceptions differ between individuals? In May 2018, an audio clip of a man speaking the word "laurel" received widespread attention because a significant proportion of listeners confidently reported hearing not the word "laurel," but rather the quite different sound "yanny" (Salam and Victor, 2018). At first glance, this suggests that the decision boundaries for perception vary considerably among individuals. The reality is more surprising: almost everyone has a decision boundary between the sounds "laurel" and "yanny," without a significant "dead zone" separating these classes. The audio clip in question lies close to this decision boundary, so that if the clip is slightly perturbed (e.g. by damping certain frequencies or slowing down the playback rate), individuals switch from confidently perceiving "laurel" to confidently perceiving "yanny," with the exact point of switching varying slightly from person to person.

How common is this phenomenon? Specifically, what fraction of spoken language is "polyperceivable," in the sense of evoking a multimodal response in a population of listeners? In this work, we provide initial results suggesting a significant density of spoken words that, like the original "laurel/yanny" clip, lie close to unexpected decision boundaries between seemingly unrelated pairs of words or sounds, such that individual listeners can switch between perceptual modes via a slight perturbation.

The clips we consider consist of audio signals synthesized by the Amazon Polly speech synthesis system with a slightly perturbed playback rate (i.e. a slight slowing-down of the clip). Though the resulting audio signals are not "natural" stimuli, in the sense that they are very different from the result of asking a human to speak slower (see Section 5), we find that they are easy to compute and reliably yield compelling polyperceivable instances. We encourage future work to investigate the power of more sophisticated perturbations, as well as to consider natural, ecologically-plausible perturbations.

To find our polyperceivable instances, we (1) devise a metric that correlates with polyperceivability, (2) use this metric to efficiently sample candidate audio clips, and (3) evaluate these candidates on human subjects via Amazon Mechanical Turk. We present several compelling new examples of the "laurel/yanny" effect, and we encourage readers to listen to the examples included in the supplementary materials (also available online at https://theory.stanford.edu/~valiant/polyperceivable/index.html). Finally, we estimate that polyperceivable clips can be made for >2% of English words.
2 Method

To investigate polyperceivability in everyday auditory input, we searched for audio clips of single spoken words that exhibit the desired effect. Our method consisted of two phases: (1) sample a large number of audio clips that are likely to be polyperceivable, and (2) collect human perception data on those clips using Amazon Mechanical Turk to identify perceptual modes and confirm polyperceivability.

2.1 Sampling clips

To sample clips that were likely candidates, we trained a simple autoencoder for audio clips of single words synthesized using the Amazon Polly speech synthesis system. Treating the autoencoder's low-dimensional latent space as a proxy for perceptual space, we searched for clips that travel through more of the space as the playback rate is slowed from 1.0× to 0.6×. Intuitively, a longer path through encoder space should correspond to a more dramatic change in perception as the clip is slowed down (Section 3 presents some data supporting this).

Concretely, we computed a score S proportional to the length of the curve swept by the encoder E in latent space as the clip is slowed down, normalized by the straight-line distance traveled: that is, we define

S(c) = \frac{\int_{r=1.0\times}^{0.6\times} \lVert dE(c, r)/dr \rVert \, dr}{\lVert E(c, 0.6\times) - E(c, 1.0\times) \rVert}.

Then, with probability proportional to e^{0.2·S}, we importance-sampled 200 clips from the set of audio clips of the top 10,000 English words, each spoken by all 16 voices offered by Amazon Polly (spanning American, British, Indian, Australian, and Welsh accents, and male and female voices). The distributions of S in the population and in our sample are shown in Figure 2.

Autoencoder details. Our autoencoder operates on one-second audio clips sampled at 22,050 Hz, which are converted to spectrograms with a window size of 256 and then flattened to vectors in R^{90,000}. The encoder is a linear map to R^{512} with ReLU activations, and the decoder is a linear map back to R^{90,000} with pointwise squaring. We used an Adam optimizer with lr=0.01, training on a corpus of 16,000 clips (randomly resampled to between 0.6× and 1.0× the original speed) for 70 epochs with a batch size of 16 (≈ 8 hours on an AWS c5.4xlarge EC2 instance).
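To make the scoring procedure concrete, the following is a minimal sketch of the encoder architecture and of a discretized version of the path-length score S, expressed in PyTorch for concreteness (the paper does not name a framework). The spectrogram hop length, the number of playback rates sampled, and the helper clip_at_rate (which would return the clip resampled to a given playback rate) are our own assumptions, not details from the paper.

```python
# Sketch of the autoencoder and the path-length score S described above.
# Assumptions are marked in comments; this is illustrative, not the authors' code.
import torch
import torch.nn as nn

SR = 22050      # one-second clips sampled at 22,050 Hz (from the paper)
N_FFT = 256     # spectrogram window size (from the paper)
HOP = 32        # assumption: chosen so the flattened spectrogram has ~90,000 entries
LATENT = 512    # latent dimension (from the paper)

def spectrogram(clip: torch.Tensor) -> torch.Tensor:
    """Magnitude spectrogram of a one-second waveform, flattened to a vector."""
    spec = torch.stft(clip, n_fft=N_FFT, hop_length=HOP,
                      window=torch.hann_window(N_FFT), return_complex=True)
    return spec.abs().reshape(-1)

class Autoencoder(nn.Module):
    def __init__(self, input_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, LATENT), nn.ReLU())
        self.decoder = nn.Linear(LATENT, input_dim)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Decoder output is squared pointwise, as described in the text.
        return self.decoder(self.encoder(x)) ** 2

def path_length_score(clip_at_rate, autoencoder: Autoencoder,
                      rates: torch.Tensor = None) -> float:
    """Approximate S(c): latent path length from 1.0x down to 0.6x playback,
    normalized by the straight-line latent distance between the endpoints.
    `clip_at_rate(r)` is a hypothetical helper supplied by the caller that
    returns the waveform of the clip slowed to rate r."""
    if rates is None:
        rates = torch.linspace(1.0, 0.6, steps=21)
    with torch.no_grad():
        latents = [autoencoder.encode(spectrogram(clip_at_rate(float(r))))
                   for r in rates]
    # Sum of consecutive latent distances approximates the path integral.
    arc = sum(torch.dist(a, b) for a, b in zip(latents, latents[1:]))
    chord = torch.dist(latents[0], latents[-1])
    return (arc / chord).item()
```

Under this sketch, clips with larger S would then be sampled with probability proportional to exp(0.2 * S), as described above.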
2.2 Mechanical Turk experiments

Each Mechanical Turk worker was randomly assigned 25 clips from our importance-sampled set of 200. Each clip was slowed to either 0.9×, 0.75×, or 0.6× the original rate. Workers responded with a perceived word and a confidence score for each clip. We collected responses from 574 workers, all of whom self-identified as US-based native English speakers. This yielded 14,370 responses (≈ 72 responses per clip).

Next, we manually reviewed these responses and selected the most promising clips for a second round with only 11 of the 200 clips. Note that because these selections were made by manual review (i.e. listening to clips ourselves), there is a chance we passed over some polyperceivable clips; this means that our computations in Section 3 are only a conservative lower bound. For this round, we also included clips of the 5 words identified by Guan and Valiant (2019), 12 potentially-polyperceivable words we had found in earlier experiments, and "laurel" as controls. We collected an additional 3,950 responses among these 29 clips (≈ 136 responses per clip) to validate that they were indeed polyperceivable.

Finally, we took the words associated with these 29 clips and produced a new set of clips using each of the 16 voices, for a total of 464 clips. We collected 4,125 responses for this last set (≈ 3 responses for each word/voice/rate combination).

3 Results

Are the words we found polyperceivable? To identify cases where words had multiple perceptual "modes," we looked for clusters in the distribution of responses for each of the 29 candidate words. Concretely, we treated responses as "bags of phonemes" and then applied K-means (a minimal sketch follows below). Though this rough heuristic discards information about the order of phonemes within a word, it works sufficiently well for clustering, especially since most of our words have very few syllables (more sophisticated models of phonetic similarity exist, but they would not change our results).

We found that the largest cluster typically contained the original word and rhymes, whereas other clusters represented significantly different perceptual modes. Some examples of clusters and their relative frequency are available in Table 1, and the relative cluster sizes as a function of playback rate are shown in Figure 1. As the rate is perturbed, the prevalence of alternate modes among our clips increases.
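As a concrete illustration of this clustering step, here is a minimal sketch assuming the CMU Pronouncing Dictionary (via the `pronouncing` package) as the phoneme inventory and two clusters per word; the paper specifies neither, so both are assumptions made purely for illustration.

```python
# Sketch of bag-of-phonemes clustering of worker responses (assumptions noted above).
from collections import Counter

import numpy as np
import pronouncing
from sklearn.cluster import KMeans

def bag_of_phonemes(word: str) -> Counter:
    """Unordered phoneme counts for a word (empty if out-of-vocabulary)."""
    prons = pronouncing.phones_for_word(word.lower())
    return Counter(prons[0].split()) if prons else Counter()

def cluster_responses(responses: list[str], k: int = 2) -> np.ndarray:
    """Vectorize responses as phoneme-count vectors and run K-means."""
    bags = [bag_of_phonemes(w) for w in responses]
    vocab = sorted({p for bag in bags for p in bag})
    X = np.array([[bag[p] for p in vocab] for bag in bags], dtype=float)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Example: two perceptual modes separate cleanly into two clusters.
print(cluster_responses(["laurel", "lauren", "floral", "yanny", "manny"]))
```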

Perceived sound                                  Playback rate
                                                 0.90×   0.75×   0.60×
laurel/lauren/moral/floral                        0.86    0.64    0.19
    manly/alley/marry/merry/mary                  0.00    0.03    0.35
thrilling                                         0.63    0.47    0.33
    flowing/throwing                              0.34    0.50    0.58
settle                                            0.65    0.25    0.33
    civil                                         0.32    0.64    0.48
claimed/claim/climbed                             0.58    0.34    0.11
    framed/flam(m)ed/friend/find                  0.33    0.52    0.43
leg                                               0.50    0.31    0.10
    lake                                          0.46    0.34    0.14
growing/rowing                                    0.50    0.47    0.26
    brewing/booing/boeing                         0.19    0.23    0.26
third                                             0.40    0.10    0.10
    food/foot                                     0.18    0.29    0.13
idly/ideally                                      0.38    0.30    0.03
    natalie                                       0.25    0.27    0.09
fiend                                             0.22    0.34    0.32
    themed                                        0.11    0.17    0.24
bologna/baloney/bellany                           0.26    0.00    0.00
    (good)morning                                 0.03    0.28    0.77
thumb                                             0.66    0.74    0.79
    fem(me)/firm                                  0.06    0.10    0.12
frank/flank                                       0.72    0.96    0.43
    strength                                      0.08    0.00    0.15
round                                             0.53    0.38    0.65
    world                                         0.03    0.00    0.14

Table 1: Some polyperceivable words and their alternate perceptual modes (indented beneath each word). Each row gives representative elements from the mode, and the proportion of workers whose response fell in that mode at playback rates 0.90×, 0.75×, and 0.60×.

Figure 1: Relative cluster sizes across different playback rates (three panels showing cluster sizes at 0.90×, 0.75×, and 0.60× for each of the 29 words). When the rate is slightly perturbed, the prevalence of alternate modes increases.

Figure 2: Distribution of path lengths (the S metric) in the population (top 10k English words, all 16 voices) and in our sample of 200 clips.

How prevalent are polyperceivable words? Of our initial sample of 200 words, 11 ultimately yielded compelling demonstrations. To compute the prevalence of polyperceivable words in the population of the top 10k words, we have to account for the importance sampling weights we used when sampling in Section 2.1. After scaling each word's contribution by the inverse of the probability of including that word in our nonuniform sample of 200, we conclude that polyperceivable clips exist for at least 2% of the population: that is, of the 16 voices under consideration, at least one yields a polyperceivable clip for >2% of the top 10k English words.

We emphasize that this is a conservative lower bound, because it assumes that there were no other polyperceivable words in the 200 words we sampled, besides the 11 that we selected for the second round. We did not conduct an exhaustive search among those 200 words, instead focusing our Mechanical Turk resources on only the most promising candidates.
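The inverse-probability weighting described above can be sketched in a few lines. The paper does not spell out the normalization of the e^{0.2·S} sampling scheme, whether draws were made with replacement, or whether weights were aggregated per word rather than per clip, so the inclusion probabilities, the clip-level aggregation, and the array names below are all our own assumptions.

```python
# Sketch of an inverse-probability-weighted prevalence estimate (assumptions above).
import numpy as np

def prevalence_estimate(S_all, sampled_idx, polyperceivable, n_sample=200):
    """Estimate the population fraction of polyperceivable clips.

    S_all:            scores for every clip in the population (hypothetical array)
    sampled_idx:      indices of the n_sample clips that were importance-sampled
    polyperceivable:  booleans for those sampled clips (from the MTurk validation)
    """
    weights = np.exp(0.2 * np.asarray(S_all))
    p = weights / weights.sum()                # per-draw sampling probability
    incl = 1.0 - (1.0 - p) ** n_sample         # approximate inclusion probability
    N = len(S_all)
    hits = [i for i, flag in zip(sampled_idx, polyperceivable) if flag]
    # Each confirmed clip contributes 1/(N * inclusion probability).
    return sum(1.0 / incl[i] for i in hits) / N
```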

Is S a good metric? We consider the metric S to be successful because it allowed us to efficiently find several new polyperceivable instances. If the 200 words were sampled uniformly instead of being importance-sampled based on S, we would only have found 4 polyperceivable words in expectation (2% of 200). Thus, importance sampling increased our procedure's recall by almost 3×.

For a more quantitative understanding, we analyzed the relationship between "autoencoder path length" S and "perceptual path length" T. Our measure T of "perceptual path length" for a clip is the change in the average distance between the source word and the response as we slow the clip down from 0.75× to 0.6×. As with clustering above, distance is measured in bag-of-phonemes space. For each word w, we computed the correlation between S(w) and T(w) among the 16 voices (both S and T vary significantly across the voices). For all but 5 of our 29 words, these metrics correlated positively, though with varying strength (Figure 3). This suggests that S indeed correlates with polyperceivability.

Figure 3: Autoencoder path length vs. perceptual distance: the correlation (r-value of S(w) vs. T(w)) across the 16 voices for each of our 29 words. Nearly all words correlate positively, though strengths vary (note that "laurel" correlates quite strongly).
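For illustration, here is a minimal sketch of this analysis: compute T for one clip from its Mechanical Turk responses, then correlate S with T across the 16 voices for a given word. The Euclidean bag-of-phonemes distance and the use of Pearson's r are our reading of the text rather than confirmed details; bag_of_phonemes reuses the helper sketched in Section 3.

```python
# Sketch of the S-vs-T correlation analysis (assumptions noted above).
import numpy as np

def phoneme_distance(word_a: str, word_b: str) -> float:
    """Euclidean distance between the bag-of-phonemes vectors of two words."""
    a, b = bag_of_phonemes(word_a), bag_of_phonemes(word_b)
    keys = set(a) | set(b)
    return float(np.sqrt(sum((a[k] - b[k]) ** 2 for k in keys)))

def perceptual_path_length(source_word, responses_075, responses_060):
    """T: change in mean distance to the source word from 0.75x to 0.6x playback."""
    d075 = np.mean([phoneme_distance(source_word, r) for r in responses_075])
    d060 = np.mean([phoneme_distance(source_word, r) for r in responses_060])
    return d060 - d075

def correlation_for_word(S_per_voice, T_per_voice) -> float:
    """Pearson r between S and T across the 16 voices for one word."""
    return float(np.corrcoef(S_per_voice, T_per_voice)[0, 1])
```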

and round S hneautn Lssesi hs oan.For domains. these in systems ML evaluating when neomu oyo okfo ontv sciences cognitive from work of body enormous An icret rdce ae.Hmn eetdthe selected Humans label. predicted (incorrect) hspopsteqeto fwehradversar- whether of question the prompts This iintss o xml,hmnvso provides vision human example, for tasks, vision esra xmlsta hnsonunperturbed shown when than examples versarial ple osuytedcso onaiso biologi- of boundaries decision the study to be applied directly cannot tools These boundaries. of decision geometry the exploring for and perspectives classifiers and probing tools new of trove has a models provided ML of robustness adversarial on systems. search sensory in- human offers into systems sight ML in instability Perceptual classifiers. vulnerable to lead inherently what settings and models, robust are adversarially settings to what of amenable sense our inform natu- may many inputs to ral close surprisingly decision lie have that ML) boundaries and biological which (both systems understanding broadly, More human on bounds robustness. quantified lack we cer- systems, ML are humans while not tainly However, adversar- to examples. robustness of ial terms far in are optimal models ML from current that evidence only the reference much-needed perceptual a Studying provide would ). 2019 al., et Qin 2018; ner, .Teeworks These ). 2002 al. , et (Fahle systems sensory human/animal of quirks the explores communities work Related 5 subjects. human to confused model that the images of find boundaries decision the used and space,” “perceptual low- of a representation trained dimensional (2014) al. et Hong Similarly, images. ad- shown when frequently more label incorrect classifier’s the and label correct the between choose for classifier image an for trained examples adversarial vision. shown for were question Humans this explores (2018) al. et sayed also might model to ML transfer an for crafted examples ial (Papernot Tram sets ; 2016a training al., et disjoint on even trained models, other those fool also often model a specific fool to crafted examples Adversarial ferability.” proxy. by techniques these apply to model techniques learning deep data-driven using However, standard subjects). human on “gradi- descent” do ent reasonably cannot we (e.g. classifiers cal neapecnb on ntesuyo “trans- of study the in found be can example An ua ecpulssescnalwu to us allow can systems perceptual human as humans ucpil oavraileape as examples adversarial to susceptible El- by work surprising Recent . 2016). al., et Liu 2017; al., et er ` ≈ 70 s n se to asked and ms, eetre- Recent often have the explicit goal of exploring isolated ceivable instances, which were validated with Me- “illusions” that provide insights into our perceptual chanical Turk experiments. Our results indicate systems (Davis and Johnsrude, 2007; Fritz et al., that polyperceivability is surprisingly prevalent in 2005). However, there are few efforts to quantify spoken language. More broadly, we suggest that the extent to which “typical” instances are polyper- the study of perceptual illusions can offer insight ceivable or lie close to decision boundaries. into machine learning systems, and vice-versa. Miller(1981) studies the effect of speaking rate on how listeners perceive phonemes. The percep- Acknowledgements tual shifts studied therein are between phonetically We would like to thank Melody Guan for early dis- adjacent perceptions (e.g. “pip” vs. 
6 Future work

Priming effects. It is possible to use additional stimuli to alter perceptions of the "laurel/yanny" audio clip. For example, Bosker (2018) demonstrates the ability to control a listener's perception by "priming" them with a carefully crafted recording before the polyperceivable clip is played. Similarly, Guan and Valiant (2019) investigated the "McGurk effect" (McGurk and MacDonald, 1976), where what one "sees" affects what one "hears." That work estimated the fraction of spoken words that, when accompanied by a carefully designed video of a human speaker, would be perceived as significantly different words by listeners. Such phenomena raise questions about how our autoencoder-based method can be extended to search for "priming-sensitive" polyperceivability.

Security implications. Just as adversarial examples for DNNs have security implications (Papernot et al., 2016b; Carlini and Wagner, 2017; Liu et al., 2016), so too might adversarial examples for sensory systems. For example, if a video clip of a politician happens to be polyperceivable, an adversary could lightly edit it with potentially significant ramifications. A thorough treatment of such security implications is left to future work.

7 Conclusion

In this paper, we leveraged ML techniques to study polyperceivability in humans. By modeling perceptual space as the latent space of an autoencoder, we were able to discover dozens of new polyperceivable instances, which were validated with Mechanical Turk experiments. Our results indicate that polyperceivability is surprisingly prevalent in spoken language. More broadly, we suggest that the study of perceptual illusions can offer insight into machine learning systems, and vice versa.

Acknowledgements

We would like to thank Melody Guan for early discussions on this project, and the anonymous reviewers for their thoughtful suggestions. This research was supported by a seed grant from Stanford's HAI Institute, NSF award AF-1813049, and ONR Young Investigator Award N00014-18-1-2295.

References

Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. 2017. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397.

Hans Rutger Bosker. 2018. Putting laurel and yanny in context. The Journal of the Acoustical Society of America, 144(EL503).

Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE.

Nicholas Carlini and David Wagner. 2018. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pages 1–7. IEEE.

Matthew H Davis and Ingrid S Johnsrude. 2007. Hearing speech sounds: top-down influences on the interface between audition and speech perception. Hearing Research, 229(1-2):132–147.

Gamaleldin Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alexey Kurakin, Ian Goodfellow, and Jascha Sohl-Dickstein. 2018. Adversarial examples that fool both computer vision and time-limited humans. In Advances in Neural Information Processing Systems, pages 3910–3920.

Manfred Fahle, Tomaso Poggio, Tomaso A Poggio, et al. 2002. Perceptual Learning. MIT Press.

Jonathan B Fritz, Mounya Elhilali, and Shihab A Shamma. 2005. Differential dynamic plasticity of A1 receptive fields during multiple spectral tasks. Journal of Neuroscience, 25(33):7623–7635.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
Melody Y. Guan and Gregory Valiant. 2019. A surprising density of illusionable natural speech. In Proceedings of the 41st Annual Meeting of the Cognitive Science Society (CogSci). cognitivesciencesociety.org.

Ha Hong, Ethan Solomon, Dan Yamins, and James J DiCarlo. 2014. Large-scale characterization of a universal and compact visual perceptual space. space (P-space), 10(20):30.

Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. 2017. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP.

Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. 2016. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.

Harry McGurk and John MacDonald. 1976. Hearing lips and seeing voices. Nature, 264(5588):746–748.

Joanne L Miller. 1981. Effects of speaking rate on segmental distinctions. Perspectives on the Study of Speech, pages 39–74.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. 2016. DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582.

Anh Nguyen, Jason Yosinski, and Jeff Clune. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436.

Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. 2016a. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277.

Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. 2016b. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pages 372–387. IEEE.

Yao Qin, Nicholas Carlini, Ian Goodfellow, Garrison Cottrell, and Colin Raffel. 2019. Imperceptible, robust, and targeted adversarial examples for automatic speech recognition. In International Conference on Machine Learning (ICML).

Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. 2018. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344.

Maya Salam and Daniel Victor. 2018. Yanny or laurel? How a sound clip divided America. The New York Times.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

Florian Tramèr, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. 2017. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453.