Proceedings from FONETIK 2019 Stockholm, June 10–12, 2019

Reflections on , speech technology, brain imaging and phonetics

Björn Lindblom
Phonetics laboratory, Stockholm University, Sweden

Whither speech research?

Broadly defined, the field of spoken language research is undergoing rapid change as a result of several factors. Computer technology can be used to simulate interactive human speech communication in spectacularly diverse and realistic ways.

Since the 1990s, sophisticated methods have become increasingly available for imaging the real-time processes that take place in the brain during speaking, listening and learning to speak.

It would seem that the present situation offers speech research yet another opportunity to expand its experimental toolbox and orient itself towards neuroscience with a view to acquiring a more profound understanding of the many forms of language use under normal as well as disordered conditions. Already many investigators have done so and have contributed significant new knowledge (Guenther 2016).

Our field presents a diverse mosaic of research interests and special expertise. Has the time come for unification? How could such interdisciplinary integration be brought about?

The subject matter of a discipline tends to be defined by the questions it asks, but there are of course other factors such as tradition, ineffective administration and obsolete educational programs that create obstacles.

Those are some of the questions and issues that need to be urgently addressed and dealt with today.

Speech technology and AI

There is an anecdote about Richard Feynman, the eminent physicist, who had the following sentence written on his blackboard: ‘What I cannot create I do not understand!’ (Gleick 1993). Let us interpret this quote as a requirement to be met by all good science. Let us call it the Feynman criterion. If we apply it to speech research, what do we find?

First let us consider some recent achievements in speech technology.

On an almost daily basis, we hear about new, often spectacular, artificial intelligence (AI) developments. For instance, some car models now have autopilots. Medical diagnoses and treatments can be drastically improved thanks to vast data sets and clever methods that detect patterns in the data that escape even the most experienced expert. Furthermore, lawbreakers are more easily caught with the aid of clever facial recognition algorithms.

And then there is AlphaGo – a program that has beaten the world’s top human players of Go – a much more complex game than chess (Mozur 2017). Its training consisted in learning not only from records of human games but from discovering, on its own, winning and losing strategies by playing innumerable times against itself.

The list of potential benefits of AI – and disadvantages – is long (Foer 2018, Friedman 2019).

Part of the story is AI-based speech technology.

Recently we learned that attempts to improve communication for paralyzed patients are under way and show considerable promise. One such project (Akbar et al 2019) used a Brain Computer Interface technique to reconstruct input speech signals from a population of evoked neural activity in the human auditory cortex.


The investigators used a deep neural network to make direct estimates of speech synthesizer parameters. This model achieved high subjective and objective intelligibility scores on a digit recognition task, allowing the authors to conclude that reconstructing speech from the human auditory cortex offers a speech neuroprosthetic for direct communication with a paralyzed patient’s brain. The technique is said to work under both overt and covert conditions.

The IBM debater project (2019)

This project has produced a computer program that recently participated in a debate on ‘whether preschools should be subsidized’. Its performance was based on fifteen minutes of preparation, developing an argument in favor of subsidies, producing a text and then giving an oral presentation of it. A large human audience attended the event and was found to prefer the human participant, but the simulated debater came close to winning.

Listening to this AI debater one is struck by its ability to ‘reason’ and produce fluent and natural sounding speech (see references). The big news is not that the human debater won the debate. It is that the IBM debater gave its human competitor a good run for his money – despite the complexity of the task. I cannot blame those who respond by thinking that these developments are nothing short of extraordinary. I agree.

But we have to ask: Where do we place them within general science? What are the implications for research on speech and language learning? Exactly how does the debater do it? It meets the Feynman criterion, but is imitation enough?

AI people admit that they do not know exactly how these systems work (McAfee & Brynjolfsson 2016). But they know enough to attribute the success of AlphaGo (and other projects) to the use of deep learning networks. These models make it possible to abandon what has been used so far, viz. explicitly specified, computational rules. Instead, deep neural networks learn to a great extent on their own by accumulating and comparing a massive number of examples of successes and failures – in part found in records of human data, in part from information generated by the algorithm itself.

Such results take steps towards removing a major hurdle in the modeling of human cognition. This hurdle is created by the fact that humans ‘know more than they can tell’ – Polanyi’s Paradox. If this becomes a characteristic also of computers, it would imply a major breakthrough for AI. Machines would then have become even more human-like. Are we already there?

There is an extremely serious downside to these developments. ‘Not knowing exactly what algorithms do’ is problematic. It feeds right into people’s worst fears about future computers evolving to surpass and dominate their human masters.

The ‘secrecy’ of these models should also disqualify them as scientific tools. They are unquestionably great imitators. But imitation is hardly enough. If these frameworks do not allow us to study the observed behavior, what use would they be to linguists, phoneticians, speech pathologists and communication engineers?

Instead of helping us understand our own brains, AI projects would create an additional problem perhaps even more forbidding: Understanding artificial brains.

Explanation: Holy Grail of research

Reasoning from first principles goes back to Aristotle. We can illustrate this time-honored method by taking a look at the acoustic theory of speech production (Fant 1960, Stevens 1998).

At first one might assume that the term ‘formant’ refers to an empirical notion derived from observations and analyses of large corpora of recorded speech, but it is not. Anyone who has struggled with making formant frequency measurements from short-term spectra and spectrograms knows that voice quality, nasalization and a high fundamental frequency often make that exercise difficult. Those and other factors also cause errors for automatic methods such as LPC, whose results must be checked manually for accuracy. It is true that inverse filtering can provide solutions to such problems – provided that you know what the shape of the residue glottal pulse is supposed to look like.

We love formants, but they can be very elusive!

The ‘formant’ is an independently motivated concept deduced from physics. It is not a product of a data-driven approach. Rather, the acoustic theory of speech rests on the following first principles (Beranek 1954): Newton’s second law, Boyle’s law and the conservation of mass. They provide the foundation for the wave equation that can be solved for arbitrary vocal tract configurations (Flanagan 1965). When the vocal tract shape is known, its resonances (formants) can be identified and their frequencies can be calculated.

When someone like me (with a liberal arts education) invokes first principles, it is not an expression of ‘physics envy’, nor a wish to reduce everything to physics.

[I would perhaps be willing to plead guilty to ‘science envy’ because science is undoubtedly a good thing to be striving towards.]

There are no absolute explanations, only deeper and deeper accounts. Individual fields choose their explanans principles at different levels. In that sense, every field adjusts its own depth according to the state of the art. However, across disciplines there are converging views on what constitutes an explanation (Miller 1990).

The thing to remember is that practical applications and general scientific knowledge both need the same fuel: Explanations.

My take is that AI projects can imitate/synthesize natural sounding speech in near-perfect ways, but unlike the acoustic theory of speech production, they do not explain anything about human speech. Rather they hide what they do.

Consequently they leave a big piece of the scientific task undone and might force upon us the epiphenomenal task of understanding – not only our own brains – but also artificial brains and their deeply embedded organization.

Neurobiology of speech

Students of language and speech are not the only ones trying to get used to new technology. Neuroscience – now attracting our attention – is itself in fact a current instructive example: The Human Brain Project (HBP) is a gigantic EU-sponsored effort in which some 500 scientists from over 100 universities participate (Human Brain Project 2012).

To motivate such a large project, the applicants argue that the time has come to bring medicine, neuroscience and computing together in response to the greatest challenge of the 21st century: the understanding of the human brain. Their application underscores that, despite significant progress in neuroscience in recent years, the field still suffers from fragmentation.

To open the door to new treatments of brain diseases – more than 500 have been identified so far – the application states that it will be necessary to show how the ‘parts fit together in a single multi-level system’. A two-way process is envisioned. New computing technologies will be discovered as more is learned about the brain, and conversely, simulations combined with empirical observations are expected to bring novel insights into the workings of both normal and disordered brains. With such a strategy it should be possible to come up with deeper accounts of many challenging topics including learning, memory, language, consciousness, awareness and psychiatric conditions.

Since speech research presents a diverse mosaic of research interests, applying the philosophy of the HBP project to our own field makes a lot of sense.

However, we need to ask ourselves: Have current models of real-time speech reached a sufficient degree of realism to be effective for interpreting large and complex volumes of brain data? Also, since brain imaging relies heavily on Big Data approaches and does not come with a theory paralleling the acoustic theory of speech, we need to proceed with caution, always remembering: Priority #1: Explanation!

Conclusion

The quality of our theories of spoken language and practical applications – clinical, educational, technological – will be a function of how well we understand how ‘humans do it’.

That, in my opinion, should be the future niche for phonetics and spoken language research.

Acknowledgements

I thank Rolf Carlson for helpful comments on a draft of this paper.

I am indebted to Olle Engwall for sketching a number of neuroimaging projects on L2 language learners and dyslectic subjects, and for making me aware of Akbar et al’s paper.

I am grateful to Joakim Gustafsson and Jonas Beskow for giving me a comprehensive update on various impressive AI/speech-tech projects.

My correspondence with Martti Vainio gave me a valuable perspective on neurophonetics – much needed now that Stockholm University will soon be able to offer its students, faculty and guests a new facility for brain imaging: Stockholm University Brain Imaging Centre (SUBIC). I thank him for his thoughtful reply to my questions.

References

Akbar H H, Khalighinejad B, Herrero J L, Mehta A D & Mesgarani N (2019): “Towards reconstructing intelligible speech from the human auditory cortex”, Scientific Reports 9:874.
Beranek L L (1954): Acoustics, New York: McGraw Hill.
Fant G (1960): The acoustic theory of speech production, The Hague: Mouton.
Flanagan J L (1965): Speech analysis synthesis and perception, 1st ed, New York: Springer.
Foer F (2018): World without mind: The existential threat of big tech, New York: Random House.
Friedman T L (2019): “Warning! Everything is going deep: ‘The Age of Surveillance Capitalism’”, New York Times, Jan 29.
Gleick J (1993): Genius: The life and science of Richard Feynman, New York: Vintage Books.
Guenther F H (2016): Neural control of speech, Cambridge USA: MIT Press.
Human Brain Project, A report to the European Commission (2012): The HBP-PS Consortium, Lausanne.
IBM debater: https://www.research.ibm.com/artificial-intelligence/project-debater/live/
McAfee A & Brynjolfsson E (2016): “Where computers defeat humans, and where they can’t”, New York Times, March 16.
Miller G A (1990): “Linguists, psychologists and the cognitive sciences”, Language 66, 317-322.
Mozur P (2017): “Google’s AlphaGo defeats Chinese Go master in win for A.I.”, New York Times, May 23.
Stevens K N (1998): Acoustic phonetics, Cambridge USA: MIT Press.
SUBIC: https://www.su.se/subic/
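
Illustrative note on the section ‘Explanation: Holy Grail of research’. The claim that formant frequencies can be calculated once the vocal tract shape is known can be made concrete with the simplest textbook case, which is not taken from the paper itself: an idealized lossless uniform tube, closed at the glottis and open at the lips, whose resonances follow from the wave equation as F_n = (2n - 1) c / (4 L). The sketch below is a minimal illustration under those assumptions. The function name, the tract length of 17.5 cm and the speed of sound of 35000 cm/s are hypothetical choices made for the example, and a realistic, non-uniform vocal tract would require solving the wave equation numerically.

```python
def uniform_tube_formants(length_cm, n_formants=3, c_cm_per_s=35000.0):
    """Resonances (Hz) of an idealized uniform tube, closed at one end.

    Assumptions (illustrative, not from the paper): lossless plane-wave
    propagation, closed glottis, ideal open end at the lips, so the tube
    behaves as a quarter-wavelength resonator.
    """
    if length_cm <= 0:
        raise ValueError("length_cm must be positive")
    if n_formants < 1:
        raise ValueError("n_formants must be at least 1")
    # F_n = (2n - 1) * c / (4 * L) for n = 1, 2, 3, ...
    return [(2 * n - 1) * c_cm_per_s / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]


if __name__ == "__main__":
    # A 17.5 cm tube is a common textbook stand-in for a neutral vowel.
    print(uniform_tube_formants(17.5))  # [500.0, 1500.0, 2500.0]
```

The point of the example is the one made in the text: the values 500, 1500 and 2500 Hz are deduced from physics and a stated tube geometry, not estimated from a corpus of recordings.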
