Sponsoring Committee: Professor Lisa Gitelman, Chairperson
Professor Alexander Galloway
Associate Professor Mara Mills
Associate Professor Erica Robles-Anderson

DIVINATION ENGINES: A MEDIA HISTORY

OF TEXT PREDICTION

Xiaochang Li

Program in Media, Culture, and Communication
Department of Media, Culture, and Communication

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Steinhardt School of Culture, Education, and Human Development
New York University
2017

Copyright © 2017 Xiaochang Li

ACKNOWLEDGEMENTS

I owe an enormous debt of gratitude to the former members of the IBM CSR group who took time out of their busy schedules to speak to me about their work and experiences, including Lalit Bahl, Peter Brown, and Stephen and Vincent Della Pietra. I am additionally indebted to Janet Baker, who generously shared recorded interviews and other materials from her own collection. Immense thanks is also owed to the archivists who aided my research, including Arvid Nelson at the Charles Babbage Institute, Dawn Stanford at the IBM Corporate Archives, George Kupczak at the AT&T Archives in Warren, NJ, and the wonderful staff at the National Museum of American History Archive Center and the University of Washington Libraries Special Collections.

I have had the immense fortune of working with a committee whose intellectual generosity and steadfast guidance have been nothing short of wondrous. I sincerely could not imagine a more brilliant and supportive committee. Lisa Gitelman’s feedback, unwavering support, and endless patience throughout this project provided a much-needed anchor. As a mentor, her expansive insights allowed me to think more boldly, while her pragmatic counsel ensured that I could hold the course. As an advisor, her responsiveness and calm demeanor have been a constant source of reassurance in all the stress and chaos.

Alex Galloway pushed me to be more confident and meticulous as a thinker and a writer, and his feedback, particularly in the early stages, was pivotal in shaping the prominent themes of this research. Mara Mills has been an unending source of what I can only describe as rigorous encouragement, a particular combination of cheerful enthusiasm and incisive critique that always energized me to dive back into the work. She has also been incredibly generous in sharing resources and opportunities, which has been invaluable throughout in shaping both my research and my scholarly progress. Erica Robles-Anderson has been guiding this project since it was just a vague notion in my first year, urging me to trust my instincts and dig into the work. I simply cannot imagine where this project would be without her many contributions, especially her uncanny ability to transform my barely-coherent ramblings into thoughtful, provocative questions that never failed to unsettle my thinking in the most productive ways. Finally, I am enormously grateful to my outside readers, Kate Crawford and Lev Manovich, for the time and care they took in reviewing this work and providing so much detailed feedback and encouragement.

I could not have done this without my amazing cohort (one might even call them The AwesomestCohort™)—Shane Brennan, Wendy Chen, Jess Feldman, Liz Koslov, and Tim Wood. Particular thanks is owed to Liz, who kept me company while I wrote, and Tim, who kept me company while I whined about writing.

I cannot express how fortunate I feel to have been a part of the extraordinary community of PhD students at MCC, with whom I have enjoyed so many inspiring conversations and collaborations. Particular thanks goes to Carlin Wing, Matthew Hockenberry, Kouross Esmaeli, Tamara Kneese, Jacob Gaboury, Kari Hensley, Seda Gürses, and Solon Barocas, for sharing their brilliant insights and sage advice. Thanks also to the many incredible faculty at MCC and across NYU who generously took the time to read my work and discuss ideas, including Arjun Appadurai, Finn Brunton, Lily Chumley, Allen Feldman, Ben Kafka (who also let me have his table at McNally’s that one time), Randy Martin, Sue Murray, Helen Nissenbaum, Martin Scherzinger, Natasha Schüll, Nicole Starosielski, Marita Sturken, and Torsten Suel. I am also grateful to the many scholars outside NYU who have offered invaluable feedback, resources, and access, including Karin Bijsterveld, Carolyn Birdsall, danah boyd, Henry Jenkins, Mike Karlesky, Karen Levy, Christine Mitchell, Viktoria Tkaczyk, and William Uricchio. My research was additionally supported by the Phyllis and Gerald Leboff Dissertation Fellowship and the NYU Center for the Humanities, which provided not only financial support, but also a brilliant and engaging cohort of faculty and student fellows.

I am immensely thankful to my wonderful friends and family for all the much-needed encouragement and reality checks these past few years. Particular gratitude is owed to Katie Phelan, for supplying homemade croissants across two continents and three separate degrees, and Shannon Starkey, for supplying many words of support and reassurance, despite such gestures being entirely contrary to his nature. Finally, this dissertation is dedicated to my parents, whose unimaginable work and sacrifice are responsible for everything I have accomplished and will ever accomplish. And to Trystan, whose ferocious talent and unflinching sense of purpose will always inspire, and forever be missed.

TABLE OF CONTENTS

ACKNOWLEDGMENTS iii

LIST OF FIGURES ix

CHAPTER

I. INTRODUCTION 1

Chapter Outline 16

II. THE ARTFUL DECEIT 20

Speech Recognition and the Human-Computer Imagination 28
The Astonishing Mechanism 35
Remaking Speech 43
From Signal to Symbol 56
and Expert Systems 67
Airplanes Don’t Flap Their Wings 74
Noise in the Channel 85

III. THE IDEA OF DATA 91

A Technical Overview of ASR 97
The Statistical Nature of Speech 102
The Statistical Nature of Language 118
Hiding Knowledge, Maximizing Likelihood 126
No Data Like More Data 142
The Way of the Machine 150


IV. THE DISCOVERY OF KNOWLEDGE 177

From to Language Processing 183
The Crude Force of Computing 194
“Big-Data-Small-Program” 209
Data’s Rising Tide 222

V. CONCLUSION: THE BLACKEST BOXES 229

WORKS CITED 235

LIST OF FIGURES

1 Three diagram drawings by Pierce 54

2 Photograph of the Automatic Digit Recognizer, or “Audrey” 58

3 The structure of the HWIM (“Hear What I Mean”) 71

4 Block diagrams comparing the standard view of CSR with the “noisy channel” model 86

5 Representation of the acoustic processor’s signal quantization process 98

6 Illustration of the operation of the logograph with a detail rendering of the print output below 108

7 Three images depicting the design and output of John B. Flowers’ Phonoscribe 109

8 Block schematic of Audrey 110

9 Photographs of simplified speech signal sample traces for Audrey. 110

10 Simplified representation of the quantization process using digit 7 detail from figure 9 112

11 Graph of formant frequency results 117

12 Block diagram of the automatic phoneme recognizer 122

13 Model of the New Raleigh Language. 132

14 Color quantization using K-means clustering 170

CHAPTER I

INTRODUCTION

“There is nothing more natural, moreover, than the relation thus expressed between divination and the classification of things. Every divinatory rite, however simple it may be, rests on a pre-existing sympathy between certain beings, and on a traditionally admitted kinship between a certain sign and a certain future event . . . At the basis of a system of divination there is thus, at least implicitly, a system of classification.” Émile Durkheim and Marcel Mauss, Primitive Classification (1903, trans. 1967)1

“[P]rediction is fundamentally a type of information processing activity.” Nate Silver, The Signal and the Noise (2015)2

In November of 2016, during a press event held at Google’s London office, London Mayor Sadiq Khan introduced Google CEO Sundar Pichai with a small quip: “A friend, he began, had recently told him he reminded him of Google. ‘Why, because I know all the answers?’ the mayor asked. ‘No,’ the friend replied, ‘because you’re always trying to finish my sentences.’”3 The joke,

1 Emile Durkheim and Marcel Mauss, Primitive Classification, trans. Rodney Needham (Chicago: University Of Chicago Press, 1967), 46.

2 Nate Silver, The Signal and the Noise: Why So Many Predictions Fail, But Some Don’t (New York: Penguin, 2015), 266.

3 Gideon Lewis-Kraus, “The Great A.I. Awakening,” The New York Times, December 14, 2016, sec. Sunday Magazine Supplement, https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html.

recounted in The New York Times, plays upon a misrecognition between Google Search’s core operation, which retrieves and ranks web content in response to typed queries, and the search engine’s decidedly less consistent, if rather more conspicuous “Autocomplete” function, an interface design feature that aims to predict the text of search queries as they’re typed. The Mayor believes he’s proffering knowledge; his friend knows that he’s merely guessing at words.

Though played for laughs, this ready confusion between predicting text and procuring information was reinforced through the feature design itself. In 2010, the same year text completion was officially rebranded as Autocomplete, Google rolled out Instant search, a new feature that generated search results as the user typed. Autocomplete and Instant, though technically separate functions, were discussed as a single entity under Google’s newly predictive search procedure. Marissa Mayer, then Google’s VP of Search Products & User Experience, emphasized that Google Instant was not “search-as-you-type . . . Google Instant is search-before-you-type.”4 Google Instant, in other words, did not provide results to partially-typed queries, but to their predicted text completions. Finding answers and finishing sentences were thus effectively collapsed together as a single operation, suggesting a more substantive commensurability.

4 Marissa Mayer, “Search: Now Faster than the Speed of Type,” Official Google Blog, September 8, 2010, http://googleblog.blogspot.com/2008/08/at-loss-for-words.html.

Just over a century earlier, in 1906, Russian mathematician Andrei Andreevich Markov established a new branch in the theory of probability, extending its authority to the prediction of successive, dependent events.5 The general theory of chain dependence opened a new array of real-world phenomena to probabilistic calculation, perhaps most notably in its contributions to and, by consequence, digital computing and communications. Markov processes appear today in computational techniques that underwrite diverse forms of knowledge production, generating everything from genome sequences and financial models to, fittingly, web search rankings. Yet in 1913, Markov empirically verified his eponymous chains precisely by predicting text, calculating the probability of letter succession using a copy of Alexander Pushkin’s novel-in-verse, Eugene Onegin.6 A new province of probabilistic knowledge was advanced, that is, through the task of guessing at words.

5 As a number of historians have noted, though Markov chains were first demonstrated in the 1913 paper, Markov first discussed the general concept of chain dependence in a 1906 paper Rasprostranenie zakona bol’shih chisel na velichiny, zavisyaschie drug ot druga. See O. B. Sheynin, “A. A. Markov’s Work on Probability,” Archive for History of Exact Sciences 39, no. 4 (December 1, 1989): 364; Eugene Seneta, “Markov and the Birth of Chain Dependence Theory,” International Statistical Review 64, no. 3 (1996): 255; Gely P. Basharin, Amy N. Langville, and Valeriy A. Naumov, “The Life and Work of A.A. Markov,” Linear Algebra and Its Applications, Special Issue on the Conference on the Numerical Solution of Markov Chains 2003, 386 (July 15, 2004): 15.

6 A. A. Markov, “An Example of Statistical Investigation of the Text Eugene Onegin Concerning the Connection of Samples in Chains,” trans. Gloria Custance et al., Science in Context 19, no. 4 (2006): 591–600. Note that Markov’s calculations were for the probability of two possible letter states, consonant and vowel, not for each individual letter.
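To make the mechanics of Markov’s demonstration concrete, the short sketch below estimates first-order transition probabilities between the two letter states he used, consonant and vowel, from a sample of running text. It is a toy illustration under stated assumptions: the sample string, the treatment of “y” as a consonant, and the variable names are mine, not Markov’s data or procedure.

```python
# Toy version of Markov's letter-succession experiment: estimate the
# probability that a vowel follows a consonant (and vice versa) from text.
from collections import Counter

sample = "a new province of probabilistic knowledge was advanced through the task of guessing at words"

def state(ch):
    # Two states only, as in Markov's calculation: vowel (V) or consonant (C).
    return "V" if ch in "aeiou" else "C"

letters = [ch for ch in sample.lower() if ch.isalpha()]
states = [state(ch) for ch in letters]
transitions = Counter(zip(states, states[1:]))

# Maximum-likelihood transition probabilities, e.g. P(vowel | consonant).
for prev in ("C", "V"):
    total = sum(count for (a, _), count in transitions.items() if a == prev)
    for nxt in ("C", "V"):
        print(f"P({nxt} | {prev}) = {transitions[(prev, nxt)] / total:.2f}")
```

Run over Eugene Onegin rather than this placeholder sentence, the same counting procedure yields the dependence between adjacent letters that Markov reported.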

This dissertation examines the history of computational text prediction—referring both to technologies that predict text sequences and those that enlist computational text analysis to predict other outcomes—through the latter half of the twentieth century. The pursuit of text prediction, I suggest, played a pivotal and underexamined role in the rise of “big data” analytics and machine learning as pervasive forms of knowledge work. Efforts to bring language under the purview of data processing prepared the conceptual landscape for the installation of a particular strain of statistical modeling, one rooted in Bayesian inference and forged through information theory, as a privileged way of knowing at once distinct to computation yet generalizable across diverse domains of practice. At the same time, it led to the development of algorithmic techniques that were critical in making data “big,” transforming previously “unstructured” text into vast troves of computer-processable data for modeling everything from cholera outbreaks to consumer preferences.7

The curious collusion of statistical modeling and text analysis has long animated both the technical development and popular discourse of digital technology. Efforts to statistically decipher encoded messages during World War

7 The instances of text analytics for predictive modeling in both research and commercial applications, particularly in business intelligence, are obviously too numerous to list. For two typical examples, see Kira Radinsky and Eric Horvitz, “Mining the Web to Predict Future Events,” in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM ’13 (New York, NY, USA: ACM, 2012), 255–264 and Gilad Mischne and Natalie Glance, “Predicting Movie Sales from Blogger Sentiment,” in AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, American Association for Artificial Intelligence (2006), 155–58.

II featured heavily in the development of early electronic computing projects.8

Claude Shannon’s key contributions to information theory, which were highly influential in the expansion of digital computing and communications networks, featured the use of Markov processes to model the English language.9 And in artificial intelligence, language technologies such as speech recognition served as core working objects, with debates over probabilistic approaches to these tasks shaping research agendas, budget allocations, and evaluation benchmarks in the overall field since the 1960s and continuing to feature heavily in AI and machine learning debates today.10 In the present, predictive text systems saturate social life at every scale, from the use of text analytics in marketing and electoral campaigns alike to autocorrect’s interjections into the mundane intimacies of everyday communication.

8 In addition to the famed Colossus developed by Maxwell Newman and Thomas Flowers at Bletchley Park, historians of computing have noted that there was a “boom in the invention one-of-a-kind digital computing machines” in the years leading up to and during World War II that were “typically used for table making, ballistics calculations, and code-breaking.” See Martin Campbell-Kelly et al., Computer: A History of the Information Machine, 3rd edition (Boulder, CO: Westview Press, 2013), 54.

9 Phillip von Hilgers and Amy N. Langville, “The Five Greatest Applications of Markov Chains,” in Markov Anniversary Meeting: An International Conference to Celebrate the 150th Anniversary of the Birth of A.A. Markov (Charleston, SC, 2006), 157-158. For a consideration of the influence of information theory and statistical language processing in the other direction, in practices of writing, see also: Lydia H. Liu, The Freudian Robot: Digital Media and the Future of the Unconscious (Chicago, IL: University of Chicago Press, 2011) and David Golumbia, The Cultural Logic of Computation (Cambridge, MA: Harvard University Press, 2009).

10 The role of statistical methods and language processing technologies more broadly in AI is discussed in greater detail in chapters two, three, and four. Chapters two and three discuss the adoption of statistical methods in speech recognition and debates regarding the nature of computing and AI, while chapter four examines the role of statistical language processing in machine learning.

Predictive text systems are additionally emblematic of an ongoing shift in the imagined relationship between knowledge and data, serving as a featured prop in the dramatization of information processing’s epistemic conquest. Since Google made “Autocomplete” default in their web search interface in 2008,11 it has been commonly referenced in the popular press not as a means to expedite search queries, but as a “service that taps into humanity's collective psyche”12 and detects “some the hidden intentionality behind . . . the peculiar statistics of a world id.”13 The idea that text completion serves as evidence of the aggregate values and habits of a society is so enticing that even in an article criticizing the “traps” of believing in the predictive power of search query data, the authors uncritically portray it “as data on the desires, thoughts, and the connections of humanity.”14

The suggestion that the statistically calculated best guess could reveal the underlying properties of a population is, of course, nothing new. Historian Ian Hacking identified it as one of the defining conceptual transformations of the

11 Jennifer Liu, “At a Loss for Words?,” Official Google Blog, August 25, 2008, http://googleblog.blogspot.com/2008/08/at-loss-for-words.html. Note that it was still called “Google Suggest” at the time and was rebranded as “Autocomplete” in 2010.

12 Megan Garber, “How Google’s Autocomplete Was ... Created / Invented / Born,” The Atlantic, August 23, 2013, https://www.theatlantic.com/technology/archive/2013/08/how-googles-autocomplete-was-created-invented-born/278991/.

13 Gideon Lewis-Kraus, “The Fasinatng … Frustrating … Fascinating History of Autocorrect,” Wired | Gadget Lab, July 22, 2014, http://www.wired.com/2014/07/history-of-autocorrect/.

14 David Lazer et al., “The Parable of Google Flu: Traps in Big Data Analysis,” Science, March 14, 2014.

nineteenth century, credited in large part to Belgian statistician Adolphe Quetelet, who introduced the notion of a people as a measurable entity composed of similarly measurable representative qualities.15 The idea that “statistical laws that were merely descriptive of large-scale regularities” could serve as “laws of nature and society that dealt in underlying truths and causes” was indeed rooted in the very concept of a population as such.16 What is distinct about the portrayals of text prediction, however, is the matter of language. The probabilistic arbitration of predictive text applies not to measurements, but meanings, through which the representative qualities of a population can ostensibly be both enumerated and expressed in the same act.

Also consequential is the technological context through which this long-standing mode of statistical reasoning is materialized. Predictive text systems are not, as Khan’s joke reminds us, meant to produce answers. In the case of Google’s Autocomplete, text prediction is merely meant to make the retrieval of data more

15 Ian Hacking, The Taming of Chance (Cambridge University Press, 1990), 107-109. As Hacking explains, Quetelet “transformed the theory of measuring unknown physical quantities, with a definite probable error, into the theory of measuring ideal or abstract properties of a population. Because these could be subjected to the same formal techniques they became real quantities.” As Hacking explained, Quetelet’s identification between statistical patterns and characteristic nature was linked to his famed concepts of “the average man,” which referred not to the human species, but racialized groupings: “Where before one thought of a people in terms of its culture or its geography or its language or its rulers or its religion, Quetelet introduced a new objective measurable conception of a people . . . This is half the beginning of eugenics, the other half being the reflection that one can introduce social policies that will either preserve or alter the average qualities of a race. In short, the average man led to both a new kind of information about populations and a new conception of how to control them.”

16 Ibid., 108.

efficient. Yet, from a series of print ads issued by the United Nations that uses the suggested search terms following phrases like “women shouldn’t” and “women need to” in order to “show gender inequality is a worldwide problem” to a humorous list of “15 Things We Can Learn About Humanity from Google Autocomplete” that reveals our information gathering priorities to revolve around “romance and fruit ripeness,” the suggestions themselves are seen as significant, whatever their utility as search terms.17 Regardless of their stated goal as a mere aid for information retrieval, the statistically-rendered predictions are thus taken to be information as such. Put another way, this system designed for data storage, retrieval, and transmission is considered to be capable of generating new information.

In evoking a metaphor of predictive text systems as engines of “divination,” I aim to highlight two seemingly contradictory features of data-driven text analytics as a form of knowledge production: on the one hand engaged in the humble act of sorting data, while on the other hand purported to reveal patterns

17 David Griner, “Powerful Ads Use Real Google Searches to Show the Scope of Sexism Worldwide,” AdWeek, accessed October 22, 2013, http://www.adweek.com/adfreak/powerful-ads-use-real-google-searches-show-scope-sexism-worldwide-153235; Hallie Canton, “15 Things We Can Learn About Humanity From Google Autocomplete,” CollegeHumor, June 20, 2013, http://www.collegehumor.com/article/6896075/15-things-we-can-learn-about-humanity-from-google-autocomplete.

visible only from a God’s eye view.18 In an echo of Mauss and Durkheim’s depiction of divinatory practices, text and other data-driven analytics are primarily classificatory activities, intended to systematically detect, sort, and tabulate regularities among seemingly unrelated elements. At the same time, these systems are imagined in both popular and technical discourse as revelatory mechanisms that, like the foretelling of divine will, are capable of granting access to knowledge that exceeds the limits of human scrutiny.

The pursuit of text prediction, I argue, played a key role in shaping both the conceptual and technical contours of a distinctly computational genre of knowing, one typified today by the rise of “data-driven” analytics, machine learning, and other sibling informatic practices wherein the techniques of statistical data processing are installed as general procedures of knowledge production. By genre, I mean to suggest a class of knowing that is not contained solely in the particular arrangement of formal techniques, conventions, and methods. Rather, drawing on the work of Carolyn Miller, genre here signals a type of knowing that is characterized by the dynamic coordination of assembled features and the social contexts and manner in which they come to be recognized

18 Elaine Freedgood has similarly compared the process of reading with digital research aids, such as online databases, to “algorithmic divination.” However, she refers to software-enabled information retrieval more generally, while my aim is to highlight the distinct features of data-driven analytics. See Elaine Freedgood, “Divination,” PMLA 128, no. 1 (January 2013): 221–25.

as distinct and desirable.19 Thus not deeply examined here are the extensive histories of probability, statistics, and their changing status in relation to knowledge and to one another.20 Attention instead centers on a style of “statistical”21 thinking as it was taken up by engineers and computer scientists at a particular moment in history, implemented and authorized within the discursive and material specificities of digital communications technology, and consequently embedded into the local operations of everyday life.

Allied with postwar computational efforts in areas such as ballistics and computer vision, text prediction was part of a broader “informatic” turn that radically unanchored information from the demands of explanation. In the

19 Carolyn R. Miller, “Genre Innovation: Evolution, Emergence, or Something Else?,” The Journal of Media Innovations 3, no. 2 (November 7, 2016): 4–19.

20 The literature on the history of probability and statistics is extensive. See, for instance, Theodore M Porter, The Rise of Statistical Thinking, 1820-1900 (Princeton, N.J.: Princeton University Press, 1986); Gerd Gigerenzer et al., The Empire of Chance: How Probability Changed Science and Everyday Life, Reprint (Cambridge University Press, 1989); Ian Hacking, The Taming of Chance (Cambridge University Press, 1990); Stephen M. Stigler, Statistics on the Table: The History of Statistical Concepts and Methods (Cambridge, MA: Harvard University Press, 1999); Lorraine Daston, “Probability and Evidence,” in The Cambridge History of Seventeenth-Century Philosophy, ed. Daniel Garber et al., vol. 2 (Cambridge: Cambridge University Press, 2012), 1108–44; Dan Bouk, How Our Days Became Numbered: Risk and the Rise of the Statistical Individual (Chicago ; London: University Of Chicago Press, 2015).

21 In referring to various methods (approaches, techniques) as “statistical” throughout this work, I am employing the popular designation given to a particular class of techniques by the historical actors engaged with them. In this sense, “statistical” is more a historical descriptor than a mathematical one, and its meaning should not be assumed to be self-evident or stable. “Statistical” methods were referred to by other titles that were arguably more precise, if less popular, while at the same time many methods relying heavily on statistical analysis were decisively excluded from the designation. My use of the term “statistical” (as opposed to probabilistic) additionally underscores the centrality of empirical data in these methods. Models in text prediction were, crucially, not simply built from probabilistic assumptions, but “trained” or otherwise populated by means of statistical estimation across large data sets.

American context, it inhabits a longer, more extensive historical curvature alongside a diverse array of technoscientific practices, such as the proliferation of large-scale mathematical simulation and engineering paradigms across the physical and natural sciences, the standardization of bureaucratic practices that exercised organizational control through data management, and the rippling influence of cybernetic fantasies across domains from architecture to governance.22 Text prediction, however, consolidated these shared epistemic proclivities in new ways around the challenges of language technology, drawing

22 The body of literature covering the histories of such a wide range of scientific and technical practices is, of course, extensive and only a brief sample can be included here. Peter Galison, for instance, has provided a now-classic account of computer simulations as a new mode of scientific knowledge in thermonuclear weapons research after World War II. Peter Galison, “Computer Simulations and the Trading Zone,” in The Disunity of Science: Boundaries, Contexts, and Power, ed. Peter Galison and David Stump (Stanford University Press, 1996), 118–57. In the natural sciences, Philip Pauly contextualizes latter twentieth century social debates in biotechnology in the historical emergence of an “engineering standpoint” in biology beginning in the late nineteenth century, which emphasized the mechanized control and transformation of organisms over traditional concerns of evolution and the nature of life. Philip J. Pauly, Controlling Life: Jacques Loeb & the Engineering Ideal in Biology (Oxford University Press, 1987). Philip Mirowski has famously written on the incursion of cybernetic thinking and stochastic modeling into economics. Philip Mirowski, Machine Dreams: Economics Becomes a Cyborg Science (Cambridge: Cambridge University Press, 2002) and Philip Mirowski, “The Probabilistic Counter-Revolution, or How Stochastic Concepts Came to Neoclassical Economic Theory,” Oxford Economic Papers 41, no. 1 (1989): 217–35. JoAnne Yates has elaborated on the development of new forms of managerial control in the American corporation in the early twentieth century through new technologies of information handling, including early commercial computing technologies. JoAnne Yates, Control through Communication: The Rise of System in American Management (The Johns Hopkins University Press, 1993) and JoAnne Yates, Structuring the Information Age: Life Insurance and Technology in the Twentieth Century (The Johns Hopkins University Press, 2008). Historians in art and architecture, such as Reinhold Martin and Beatriz Colomina, have traced the influence of computational and engineering ideals into postwar design practices via cybernetics. Reinhold Martin, The Organizational Complex: Architecture, Media, and Corporate Space (MIT Press, 2005) and Beatriz Colomina, “Enclosed by Images: The Eameses’ Multimedia Architecture,” Grey Room, no. 2 (Winter 2001): 6–29.

them into alliance with both popular fantasies of machine intelligence and the commercial priorities of the computing industry. In doing so, it assisted in expanding the conceptual territory of a narrow “data-driven” genre of knowledge beyond the laboratory and into the sphere of the everyday. Predictive text systems, after all, are precisely systems of pre-diction, tasked with not only supposition, but communication. As such, they allow for a consideration of the coupling of statistical thinking and information technologies in their mediatic context, directly implicated in the coordination between discursive practices and material forms, between cultural imagination and technical protocol, providing a unique vantage point into the shifting relations among computation, knowledge work, and social life.

The history of text prediction systems, I suggest, thus stands to offer insight into the ways in which an emergent “epistemic machinery”—defined by anthropologist Karin Knorr Cetina as the “different architectures of empirical approaches, specific constructions of the referent, particular ontologies of instruments, and different social machines”23—was refashioned and bound up within practices of digital media and communication. What forces made language amenable to the demands of data processing? How did the installation of data processing as a way of knowing become thinkable and desirable in the first place?

23 Karin Knorr Cetina, Epistemic Cultures: How the Sciences Make Knowledge (Cambridge, Mass: Harvard University Press, 1999), 3.

And how, consequently, was this epistemic framework rendered generalizable across diverse domains of practice, codified in the technical protocols of everything from high-frequency trading to text messaging?

An understanding of predictive text systems must therefore engage in analysis at two divergent scales: in terms of the historical development of the broader epistemic developments from which they emerged and in terms of the technical and material specificities of their implementations and applications. The examination of these systems offers a way to not only consider how a particular discourse is ordered, but also a means to understand how the process of ordering operates at the level of what Foucault called a “microphysics of power.”24 By the quotidian nature of its applications, text prediction provides a vantage point into the minute mechanisms by which a particular logic of relation is enfolded into everyday experience and transferred across different spheres of practice.

This project engages with the interdisciplinary dialogue on the influence of data and algorithms in the organization of social life. As digital technology increasingly conditions how discursive materials are produced, circulated, and encountered, many scholars have highlighted how various aspects of logical and material compositions structure the terms of access and intelligibility. There exists

24 Michel Foucault, Discipline & Punish: The Birth of the Prison, trans. Alan Sheridan, 2nd Edition (Vintage, 1995), 26.

a robust and growing body of work that considers how software protocols25 and algorithmic routines26 are “subjecting all human discourse and knowledge to these procedural logics that undergird computation”27 and producing “ways of thinking and doing that leak out of the domain of logic and into everyday life.”28 However, as much of the work in critical studies of digital technology and algorithms focuses on their present use and future implications, these “procedural logics” are granted a sense of timelessness, appearing as technological faits accomplis both inescapable and unchanging. The components, both technical and discursive, that make up these ways of thinking and doing did not originate with software, or even computing. Drawing out their epistemic underpinnings wrests the logics of data-

25 See, for instance, major works in Software Studies, such as Lev Manovich, The Language of New Media (MIT Press, 2001), Matthew Fuller, Behind the Blip: Essays on the Culture of Software (Autonomedia, 2003), Alexander R. Galloway, Protocol: How Control Exists after Decentralization (The MIT Press, 2006), Wendy Hui Kyong Chun, Programmed Visions: Software and Memory, (The MIT Press, 2013).

26 There has been a growing body of work in the humanities and social sciences surrounding the use and implications of algorithmic systems in particular. See for instance: Luciana Parisi, Contagious Architecture: Computation, Aesthetics, and Space (The MIT Press, 2013); Tarleton Gillespie, “The Relevance of Algorithms,” in Media Technologies: Essays on Communication, Materiality, and Society, ed. Tarleton Gillespie, Pablo J Boczkowski, and Kirsten A Foot, 2014, 167–93; Frank Pasquale, The Black Box Society: The Secret Algorithms That Control Money and Information, Reprint edition (Cambridge, Massachusetts London, England: Harvard University Press, 2016); Ted Striphas, “Algorithmic Culture,” European Journal of Cultural Studies 18, no. 4–5 (August 1, 2015): 395–412; Adrian Mackenzie, “The Production of Prediction: What Does Machine Learning Want?,” European Journal of Cultural Studies 18, no. 4–5 (August 1, 2015): 429-45; Mike Ananny and Kate Crawford, “Seeing without Knowing: Limitations of the Transparency Ideal and Its Application to Algorithmic Accountability,” New Media & Society, December 13, 2016, 1–17; Rob Kitchin, “Thinking Critically about and Researching Algorithms,” Information, Communication & Society 20, no. 1 (January 2, 2017): 14–29.

27 Gillespie, “The Relevance of Algorithms,” 168.

28 Matthew Fuller, ed., Software Studies: A Lexicon (The MIT Press, 2008), 3.

driven analytics from the quarantine of this technical inevitability, and highlights the sociotechnical arrangements that coalesced into durable practices. That is to say, if we take seriously the notion that software makes possible “ways of thinking and doing,” the vantage point of history attends to how these epistemic priorities and processes are constituted and authorized. At the same time, materialist accounts of computing and network technology are deepened by placing them in conversation with the claim that software and algorithmic techniques constitute a broader and more pervasive logic that cannot be reduced to their technical instrumentation.

The dissertation focuses on two case studies of technical developments in language processing that animated not merely technological, but also discursive and epistemic transformations. Given the highly technical nature of the artifacts in these case studies, much of my primary source material takes the form of technical documentation such as patent filings, operational manuals, source code, and both published and unpublished research materials. I supplement humanist approaches with technical knowledge in areas such as signal processing, statistical inference, and computer science. This attentiveness to the specificities of the material and the machinic, however, is not a return to technological determinism nor a fetishization of technical artifacts. Rather, it serves as an effort to guard against a tendency wherein “the notion that social processes are necessary to knowledge production sometimes blurred into the far more dubious claim that

social agreement is sufficient for knowledge production.”29 Hence, it is not to suggest that machines express in isolation from or in the same way as human social actors, or that they can be critically examined in isolation from, or in the same way as, cultural relations. Instead, my approach takes seriously the critical role of material-technical forms in the expansion and development of ways of knowing, but in doing so considers them as necessarily part of a coordination between social actors, cultural practices and technical media.

Chapter Outline

The chapters of the dissertation are organized around pivotal encounters between text and computation that defined text prediction in two directions, first in the development of computational techniques to predict text sequences, and then in the consequent expansion of these techniques to enlist text data in predicting non-linguistic phenomena. The primary cases under discussion span the twentieth century, focusing heavily on its latter half, though the conceptual lineages under consideration will make reference to earlier developments in the history of probabilistic thinking. While roughly chronological, the chapters are not arranged on a single, sequential timeline with clear-cut periodization, nor is the order meant to suggest that any “stage” necessarily succeeds and retires the previous. Even when confined to those developed for use in text-centered media

29 Paul N. Edwards, A Vast Machine Computer Models, Climate Data, and the Politics of Global Warming (Cambridge, Mass.: MIT Press, 2010), 437.

and communications systems, predictive techniques are diffuse and promiscuous, jumping across industry sectors and domains of practice and developing along multiple and overlapping trajectories in sometimes unexpected ways. Most notably, speech recognition technologies feature prominently in what I lay out as a story of computational text prediction. Though initially counter-intuitive, I argue that speech processing played a central role in making language both conceptually and technically amenable to informatics. At the same time, in the US, speech recognition served as a charismatic application that was aligned to both the commercial aims of industrial giants such as AT&T Bell Laboratories and the institutional priorities of defense research.

Following this introductory chapter, chapters two and three examine the consequences of A.A. Markov and Claude Shannon’s theories as they were materialized in text prediction techniques for speech recognition software. I argue that competing models of speech recognition and the introduction of new predictive models in the 1970s not only formalized a particular understanding of language and the speaking subject, but also became a site where the parameters of human and computational forms of knowledge were imagined and standardized.

Chapter two focuses on the discursive shift in the depiction of speech recognition from a mechanical model of human sensory-motor systems and linguistic understanding to a purely “machine” practice that was distinct from, if not incommensurable with, human perception, language, and meaning. Chapter three

then takes on a close reading of speech recognition technology itself, through patent filings, engineering drawings, and other technical documentation, looking specifically at the invention of a “fenonic alphabet” — a purely machine-generated replacement for phonetic units — and the corresponding transformation of text from language into data.

Finally, chapter four brings us to the near-present, examining the seemingly sudden realization in computational linguistics at the turn of the millennium that text-mining could be deployed as a tool to predict non-linguistic phenomena. It begins to tell the story of how text, previously made predictable by statistical means, could by virtue of those same methods then be enlisted in the prediction of other things. Tracing the spread of statistical methods developed for speech recognition into the broader field of computational linguistics, this chapter examines the forces that brought natural language processing into alignment with data management practices and machine learning. I argue that efforts to make text tractable alongside structured numerical data created both new resources and new challenges for data management and information retrieval, leading to technical and epistemic developments that spurred big data’s rise to prominence and the spread of machine learning.

This project in total follows one thread within the dense, byzantine meshwork of discursive and technical coordinations that shaped not only the formal parameters of statistical knowledge, but its conditions of use and popular

imagining. It seeks to elaborate the arrangement of social, historical, and technological developments that have come to constitute an epistemics of informatics, and its contributions to the vision of a world composed of operations beyond the horizon of piecemeal human inquiry, accessible only through the aid of expansive simulations built upon a massive computational infrastructure. In short, this project considers how particular technologies became thinkable and how particular ways of thinking, in turn, became technological.

CHAPTER II

THE ARTFUL DECEIT

“Finally, we begin on our day to observe the swelling energies of a third wave . . . forces which now manifest themselves in every department of activity, and which tend toward a new synthesis in thought and a fresh synergy in action. As a result of this third movement, the machine ceases to be a substitute for God or for an orderly society; and instead of its success being measured by the mechanization of life, its worth becomes more and more measurable in terms of its own approach to the organic and the living.” Lewis Mumford, Technics and Civilization (1934)30

In 1958, Edward David, Jr., then lead engineer in visual and acoustic research at AT&T Bell Laboratories, declared that with the aid of digital computers, advancement in automatic speech recognition (ASR) would be limited only by the “time, effort, and money we are willing to expend in simulating human functions.”31 But even as the consequent decade of dedicated “time, effort, and money” in academic and industrial labs worldwide led to an array of new techniques for measuring and modeling various aspects of speech production and hearing, improvements in the actual performance of speech recognition systems lagged. The initial optimism arising from digital computing soon began to

30 Lewis Mumford, Technics and Civilization (New York, NY: Harcourt, Brace and Company, 1934), 5.

31 Edward David, Jr., “Artificial Auditory Recognition in Telephony,” IBM Journal of Research and Development 2, no. 4 (1958): 294.

dissipate, and in 1969, all speech recognition research within Bell Labs was abruptly suspended by J.R. Pierce, the executive director of Communication Sciences Research. Pierce issued a scathing assessment of the entire field that same year in the Journal of the Acoustical Society of America: “Speech recognition has glamor. Funds have been available. Results have been less than glamorous . . . In any practical way, the art seems to have gone downhill ever since the limited commercial success of Radio Rex.”32 The Radio Rex, a popular voice-activated children’s toy in the 1920s, featured a wooden dog that responded to calls of “Rex” using a simple electromagnetic trigger that could be activated by any sounds with an acoustic frequency around 500 Hz. The toy did not technically differentiate between speech and other sounds in the same range, let alone recognize specific words.33

The cutting comparison to a crude, sound-activated children’s toy highlighted the fact that, for Pierce, the problem was not simply the lack of compelling and demonstrable results. According to him, the entire research endeavor was so fundamentally ill-conceived that it amounted to little more than an “artful deceit,” perpetrated by “mad inventors,” no better than “pretended

32 John R. Pierce, “Whither Speech Recognition?,” Journal of the Acoustical Society of America 46, no. 4 (1969): 1050.

33 In a series of posts to the Antique Radio Discussions web forum on Oct 1, 2008, collectors found that the toy responded to non-vocal sounds like loud claps, and noted that the toy would potentially reject calls from women and children due to the higher frequency ranges of their voices.

telepathy . . . or communication with the dead.”34 The ability to apprehend spoken language in humans, he reasoned, depended so heavily on contextual understanding that no amount of information regarding its physical features—which is to say the measurable and computable elements of the acoustic signal—would ever be sufficient. As with the Radio Rex, responding to sound was not the same thing as recognizing speech. Speech recognition, according to Pierce, could not be performed in isolation from a general capacity for reasoning. It would therefore only be feasible, especially at a commercial scale, once there were machines possessing both “an intelligence and a knowledge of language comparable to those of a native speaker.”35

Though Pierce’s assessment of the field itself was somewhat extreme, the moratorium at Bell Labs had wide-ranging implications given their prominence as a center of speech processing research. Pierce himself was the chairman of the Automatic Language Processing Advisory Committee (ALPAC) for the National Research Council, which advised research programs and budget allocations in the area of computational language processing for the Department of Defense, the

34 Ibid., 1050-51.

35 Ibid., 1050.

Central Intelligence Agency, and the National Science Foundation.36 His reasoning also echoed a fundamental assumption common across speech recognition (and indeed all speech processing) research of the period. Regardless of the particular application goals and implementation methods, which ranged from simple frequency graph matching to sophisticated artificial intelligence (AI) networks, speech recognition into the 1970s was categorically guided by an expectation to replicate not only the results, but the very processes of human speech perception and understanding. The feasibility of such simulation was heavily debated, and the list of which specific “human functions” speech recognizers needed to simulate constantly expanded, growing to include increasingly complex layers of linguistic and contextual knowledge alongside the physiological features of hearing. Yet few questioned the idea that speech recognition by machine required careful replication of human speech perception and understanding.

In 1971, less than two years after Pierce’s unequivocal condemnation, IBM launched the Continuous Speech Recognition (CSR) group at their research division headquarters, the recently built Thomas J. Watson Research Center

36 His opinion was influential enough that, following a 1966 ALPAC report, the committee feared his critique of automatic machine-translation would lead to the shutdown of government funding for all forms of machine-aided language projects. An explanatory letter was attached to the report that explicitly exempted language education and similar non-automated applications from his assessment. See John R. Pierce et al., “Language and Machines: Computers in Translation and Linguistics A Report by the Automatic Language Processing Advisory Committee” (Washington, DC: National Academy of Sciences, 1966).

laboratory in Yorktown Heights. Despite Pierce’s dire projections, the CSR group introduced techniques, such as the application of hidden Markov models (HMMs) and corpus-trained n-gram language models, that led to successful commercialization of speech recognition and today remain the foundation for popular speech recognition software.37 These techniques went on to be widely adopted across not only natural language processing, but also a broad range of machine learning and pattern recognition applications in diverse fields such as bioinformatics, business intelligence, and financial modeling. For instance, the success of hidden Markov models in speech recognition introduced and helped popularize the technique in other areas dealing with large data sets drawn from complex, highly variable phenomena, such as genome sequencing. HMMs are so commonly used in bioinformatics that there are now free standard software tools for implementing them on biosequencing databases.38
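As a point of reference for readers unfamiliar with the technique, the sketch below illustrates the basic form of a corpus-trained bigram (n = 2) language model of the kind mentioned above: the probability of a word is estimated from counts of word pairs in a body of text. The three-sentence “corpus,” the function names, and the example queries are placeholders of my own, not IBM’s training data or code.

```python
# A toy corpus-trained bigram language model: P(next word | previous word)
# is estimated by maximum likelihood from word-pair counts in a tiny corpus.
from collections import Counter

corpus = [
    "we want to recognize speech",
    "we want to predict the next word",
    "the next word is predicted from counts",
]

bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()   # <s> marks the sentence start
    context_counts.update(words[:-1])    # every word that has a successor
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(next_word, prev_word):
    """Maximum-likelihood estimate of P(next_word | prev_word)."""
    if context_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, next_word)] / context_counts[prev_word]

print(bigram_prob("want", "we"))       # 1.0: "we" is always followed by "want" here
print(bigram_prob("recognize", "to"))  # 0.5: "to" is followed by "recognize" in one of its two occurrences
```

The models discussed in the following chapters were, of course, trained over vastly larger corpora and combined with acoustic models; this toy example is meant only to show what “corpus-trained” means in practice.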

Frederick Jelinek, the director of IBM’s CSR group, credited their success to a conceptual shift away from the fixation on human processes, explaining, “we

37 Since around the early 2010s, research involving artificial neural networks (ANN) and deep learning has been gaining renewed momentum in speech recognition, among other areas. These more recent developments lie outside the scope of this current project. However, it is worth noting that while ANNs employ a different formal architecture than hidden Markov models and other techniques that defined the statistical approach, they can be seen as an extension and intensification of the “data-driven” conceptual foundation laid out by statistical methods, which will be discussed in more detail in chapter 2.

38 See, for instance, HMMer.org. HMMs will be discussed in greater detail in chapter three. Briefly, they are a “data-driven” method for modeling the performance of an unknown sequence of events based solely on statistical patterns gleaned from empirical data.

thought it was wrong to ask a machine to emulate people. Rather than exhaustively studying how people listen to and understand speech, we wanted to find the natural way for the machine to do it.”39 Contrary to the prevailing desideratum of speech recognition, the “way for the machine” envisioned by the team at IBM was not founded upon a fantasy of simulating human reason by computational means—a black box that mimicked grey matter. Instead, Jelinek and the CSR group based their approach explicitly on the work of Claude

Shannon and A.A. Markov, repurposing both the language and principles of information theory and signal processing to the analysis of speech. In doing so, it radically reframed speech recognition as a purely computational problem, one that was absent of, if not outright antithetical to, “human faculties” and linguistic expertise. In what has become a prevailing axiom in not only the field of speech recognition, but also the broader discipline of Artificial Intelligence, Jelinek went

39 Frederick Jelinek quoted in Peter Hillyer, “Talking to Terminals . . .,” THINK, 1987, IBM Corporate Archives.

on to observe: “if a machine has to fly, it does so as an airplane does—not by flapping its wings.”40
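For readers who want the formal shorthand for this reframing, the “noisy channel” view that the group adapted from information theory (compared with the standard view of CSR in figure 4, and examined in greater technical detail in the next chapter) can be summarized in the standard decision rule below. The notation is the conventional textbook one rather than a reproduction of IBM’s own papers: the recognizer selects the word sequence W that is most probable given the observed acoustic evidence A, which Bayes’ rule splits into an acoustic model P(A|W) and a language model P(W).

```latex
\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W)
```

The second equality holds because P(A), the probability of the acoustic evidence itself, is the same for every candidate word sequence and so drops out of the maximization.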

This chapter offers a historical analysis of speech recognition’s statistical turn. I investigate the cultural and material forces that helped to shift the field’s defining theoretical and pragmatic impetus from the pursuit of simulation to that of statistical data processing, showing how the parameters of human and computational knowledge were reconfigured in the wake of this transformation.

Starting with an examination of the dramatic disparity between the representation and realization of speech recognition technologies in both popular and scientific discourse, I locate disputes over speech recognition’s feasibility and purpose

40 Jelinek quoted in Hillyer, “Talking to Terminals . . .” The comparison of airplanes to birds became something of a default analogy in the AI field. See, for instance: Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 3rd edition (Upper Saddle River: Pearson, 2009), 3. Russell and Norvig, in their now-classic textbook on AI, use a similar example of airplanes not imitating birds to explain why many A.I. practitioners are unconcerned with the Turing Test, which measures machine intelligence based on a machine’s ability to mimic human perception and behavior. Other notable figures who have used variations of this analogy include Facebook’s director of AI Research Yann LeCun (see Yann LeCun quoted in Lee Gomes, “Facebook AI Director Yann LeCun on His Quest to Unleash Deep Learning and Make Machines Smarter,” IEEE Spectrum, February 18, 2015 and in James Vincent, “What Counts as Artificially Intelligent? AI and Deep Learning, Explained,” The Verge, February 29, 2016) and Eric Horvitz, Director of Microsoft Research and former president of the Association for the Advancement of Artificial Intelligence (see Eric Horvitz, “AI in the Open World: Directions, Challenges, and Futures” (Data&Society Databite No. 98, New York, NY, April 26, 2017)). Jelinek was also not the first to use such a comparison. Another speech recognition researcher at IBM, William C. Dersch, compared attempts to mimic human hearing to “designing an airplane by copying a bird’s feathers” in 1962. See William C. Dersch, “Shoebox - A Voice Responsive Machine,” DATAMATION 8 (June 1962): 47–50. J.R. Pierce himself also evoked the analogy in 1962, in a discussion regarding the purposes of computing more generally, though Pierce’s comparison offered rather different implications. See John R. Pierce et al., “What Computers Should Be Doing,” in Management and the Computer of the Future, ed. Martin Greenberger (Cambridge, MA: The MIT Press, 1962). Suffice to say, it appears that the comparison evoked by Jelinek was very much in circulation within engineering communities.

within the broader, ongoing debates concerning the boundaries between human and mechanical processes. To do so, I trace the perennial presence of speech recognition in fantasies of machine intelligence, from a fraudulent eighteenth century automaton to debates over artificial intelligence and human-computer interaction in the 1960s and 70s, arguing that speech recognition research was first a product of and then a stage for competing theories regarding the epistemic and ontological status of digital computing. The chapter then zooms in on the local conditions of the IBM Continuous Speech Recognition group, and their reformulation of speech recognition as a problem of statistical text prediction.

Drawing on archival documents as well as original interviews and correspondence with the researchers at IBM who developed the statistical approach, I analyze how a particular arrangement of corporate interests and material resources informed their work and shaped what was thinkable. Finally, I consider the status of language that was codified and the speaking subject who was constituted through the technological materialization of statistical speech recognition, which I will examine in greater technical depth and detail in the next chapter. I argue that this shift rendered speech recognition a primarily mathematical problem independent from human hearing and language, one that required the articulation of a distinctly machinic way of knowing and the reduction of language to data.

!27 Speech Recognition and the Human-Computer Imagination

In his pioneering 1960 article on “Man-Computer Symbiosis,” J.C.R.

Licklider lists speech recognition among his prerequisites for the realization of technological systems characterized by “men and computers working together in intimate association.”41 Compared to the relatively decisive recommendations regarding networked computing, stored-memory architecture, interactive graphical displays, and more flexible programming languages, Licklider’s opening paragraph on speech recognition is conspicuously tentative:

How desirable and how feasible is speech communication between human operators and computing machines? That compound question is asked whenever sophisticated data-processing systems are discussed. Engineers who work and live with computers take a conservative attitude toward the desirability. Engineers who have had experience in the field of automatic speech recognition take a conservative attitude toward feasibility. Yet there is continuing interest in the idea of talking with computing machines.42

Despite the opening skepticism, Licklider ends “Man-Computer Symbiosis” by predicting that the achievement of “practically significant speech recognition” suitable for “real-time interaction on a truly symbiotic level” was just five years away.43

Licklider thus highlights a central paradox of speech recognition: that despite an informed skepticism regarding not only its feasibility but also its desirability, speech recognition nevertheless remains persistently desired, its development somehow

41 J. C. R. Licklider, “Man-Computer Symbiosis,” IRE Transactions on Human Factors in Electronics, no. 1 (March 1960): 5.

42 Ibid., 10.

43 Ibid., 10-11.

!28 inseparable from the understanding and advancement of information processing technology more generally.

Applications for speech recognition vary widely in both function and sophistication, ranging from multi-lingual dictation software to the ever-delightful interactive voice-response units at the end of customer service lines. The most conspicuous application in recent years has been the increasingly pervasive voice-controlled “Intelligent Personal Assistant” (IPA). While voice command has been standard in Apple’s desktop OS since 1999 and Microsoft

Windows since 2006, Apple first incorporated its now iconic voice-controlled

IPA, SIRI, into its mobile operating system in 2011. Other major mobile device manufacturers followed suit shortly after, while technology giants Google,

Microsoft, and Amazon have released cross-platform or stand-alone IPAs that can be run on multiple operating systems and devices, including game consoles and other “smart” devices.44 Despite its many varied applications, automatic speech recognition, or ASR, is understood more generally by researchers in the field as a category of systems that enable machines to operate using speech input. As such it has persisted symbolically as a perennial figure of future innovation and

44 Samsung (S Voice) and LG (Voice Mate) IPAs were released in 2012 and BlackBerry (BlackBerry Assistant) in 2013. Google (Google Now) and Microsoft (Cortana) introduced their cross-platform IPAs in 2012 and 2014, respectively. Amazon issued a wide release of its stand-alone voice command device Amazon Echo in June 2015, following a limited release in late 2014. Amazon has also launched a service (Alexa Voice Service) and a $100 million fund to support developers interested in incorporating its Alexa voice software into third-party devices.

!29 advancement within the tech sector. From Vannevar Bush envisioning a speech recognizer as the input apparatus for his memex to Steve Jobs selecting speech recognition over more than three thousand other new system features to highlight at the 1999 World Developer’s Conference due to its importance “today . . . [and] for the future,” there seems to be an assurance that ubiquitous speech recognition capabilities are at once crucial and imminent.45 Yet, a cursory glance at current press coverage shows that speech recognition technology in practice is often noted for its inadequacies, creating a sense that “even within limited applications, speech recognition never seems to work as well as it should.”46 As one recent article on Slate.com warns, “all voice-activated technology, in fact, seems to cause confusion,” and its expanding applications in everything from cars to

45 Vannevar Bush, “As We May Think,” The Atlantic, July 1945; Steve Jobs, “World Developer’s Conference Keynote” (Presentation, San Jose, CA, May 10, 1999).

In describing the Memex, the theoretical machine oft cited as the conceptual precursor to the modern internet, Bush suggested in “As We May Think” that “the author of the future [could] cease writing by hand or typewriter and talk directly to the record” by combining a Vocoder with a stenotype in order to produce “a machine which types when talked to.” As historian Mara Mills has pointed out, Bush was “one of the most prominent spectators” of the VODER (Voice Operation DEmonstrator), a speech synthesis machine assembled by Homer Dudley at Bell Laboratories in the 1930s and first unveiled at the 1939 World’s Fair in San Francisco. However, it is worth noting that while Bush described the vocoder (VOice-CODER), the “converse” of the voder that he witnessed in 1939, as a machine that could operate a keyboard when spoken to, the Vocoder could in fact do no such thing. Rather, it was a speech analysis machine that transformed and sampled speech as electrical signals that were transmitted and used as specifications for reconstructing speech at the other end. The vocoder was not designed to produce text, as Bush envisioned. See also Mara Mills, “Media and Prosthesis: The Vocoder, the Artificial Larynx, and the History of Signal Processing,” Qui Parle 21, no. 1 (December 1, 2012): 107–49.

46 John Seabrook, “Hello, Hal,” The New Yorker, June 23, 2008, http:// www.newyorker.com/magazine/2008/06/23/hello-hal.

!30 classrooms “means we’re on the cusp of a serious problem.”47 In a high-profile counterpoint to Jobs’ 1999 enthusiasm, Apple’s 2015 World Developer’s

Conference live presentation was derailed by a speech recognition error during an unrelated demonstration.48

Automatic speech recognition thus occupies a curious position in the technological imagination, characterized by a prolific optimism that remains undaunted in the face of functional deficiencies. It is presented as so casually inevitable that, alongside chess-playing, it remains a staple in popular fictional representations of advanced computing and artificial intelligence, particularly in the US. Prominent examples from US and European cinema include the

Maschinenmensch in Fritz Lang’s Metropolis (1927), Kubrick’s iconic HAL9000 in 2001: A Space Odyssey (1968), the operating-system-turned-love-interest in

Spike Jonze’s Her (2013), the myriad on-board computers of the Star Trek universe and droids of the Star Wars franchise, among countless others. Though these fictional systems vary in purpose and sophistication, hailing from immediate

47 Amy Webb, “You Just Don’t Understand,” Slate, May 5, 2014, http://www.slate.com/articles/technology/data_mine_1/2014/05/ok_google_siri_why_speech_recognition_technology_isn_t_very_good.html.

48 Business Insider Tech, “Watch Siri Fail Live on-Stage at Apple’s Huge WWDC Event,” Business Insider, June 8, 2015, http://www.businessinsider.com/siri-fail-live-apple-wwdc-2015-6#ixzz3hU8veYTA. The error occurred during a demonstration of Apple’s music service. When asked to play a song from the “Selma” soundtrack, Apple’s iPhone speech recognition system SIRI misinterpreted it as a request to play the song “Selena” instead. Since the request was directed at a music database that would have included both “Selma” and “Selena,” the error cannot be attributed to the absence of “Selma” from the recognizer vocabulary.

!31 to distant futures, all are capable of flawless speech recognition.49 But it is not merely the prevalence of speech recognition that is remarkable. Unlike chess-playing, which often serves as a form of narrative shorthand for advanced machine intelligence, speech recognition is portrayed as patently unremarkable, serving more as a default than a focal point. In 2001: A Space Odyssey, for instance, the extent of HAL9000’s machine intelligence is demonstrated in a chess game, but its ability to flawlessly recognize and interpret conversational speech not only goes unremarked, but is actively downplayed in the narrative.50

In contrast to their fictional counterparts, computer chess programs became sufficiently advanced to beat expert players by the 1970s, while in Google’s 2014 promotional video, Behind the Mic: The Science of Talking to Computers, speech

49 For example, it is implied that, in fact, all computers in Star Trek’s universe are voice-controlled by default in Star Trek IV: The Voyage Home (1986) when the crew travels back in time to the late-twentieth century and Scotty attempts to operate an office computer by voice, before being instructed to use a mouse and keyboard. One notable exception to this convention is the military simulator in 1983’s WarGames, in which the protagonist communicates with the WOPR, or “Joshua,” supercomputer by typing. This nearly uniform tendency to represent advanced or intelligent computers as voice-operated may be partly a function of the formal constraints of cinematic and televisual media, since spoken dialogue lends itself to audiovisual representation more readily than typing at a terminal. However, that would not account for the consistently blasé treatment of speech recognition as commonplace. Moreover, the influence of these representations remains, regardless of whether or not their intended purpose was primarily functional.

50 The sole instance in which HAL9000’s language recognition abilities have narrative significance is when they remain effective in the absence of speech input. In a pivotal scene, the human protagonists attempt to conceal their plan to disconnect HAL by shielding the audio of their discussion, but we soon discover that HAL can read their lips. This newly-revealed ability is shown from the computer’s perspective and we, the audience, are no longer able to follow the dialogue with the audio cut. That HAL is able to understand language without speech is thus presented as unexpected and impressive, precisely because it exceeds typical human capabilities. HAL’s ability to recognize speech as effortlessly as his human interlocutors is consequently downplayed even further in comparison.

!32 recognition developer Bill Byrne admits that he still “can’t even imagine the

‘we’ve done it!’ moment quite yet” when it comes to speech recognition technology.51 Nathan Ensmenger has argued that AI research was fundamentally shaped by its adoption of chess as the field’s “drosophila,” or “ideal experimental technology,” a choice that was driven by both practical and intellectual factors, among them chess’s receptiveness to mathematical formalization and the consequent success and “apparent productivity” of experimental efforts.52 The similarly abiding presence of speech recognition in AI research cannot be attributed to experimental success or formal facility. On the contrary, speech recognition seems to remain a defining feature in computing research despite its apparent drawbacks as an experimental object. As cognitive scientist Alison

Gopnik pointed out in Google’s Behind the Mic, “When we started out we thought it was going to be things like chess . . . that were going to be really hard . . .

[T]hings that we thought were going to be easy for computer systems, like understanding language—those things have turned out to be incredibly hard.”53

51 While IBM’s famed “Deep Blue” computer didn’t defeat world chess champion Garry Kasparov in a game until 1996, a computer first won a national chess tournament twenty years prior. In 1976, a computer chess program developed at Northwestern University, simply named “Chess 4.5,” won the Class B section of the Paul Masson American Chess Championship. See Fred Hapgood, “Computer Chess Bad--Human Chess Worse,” New Scientist, December 30, 1982, 829; “Behind the Mic: The Science of Talking with Computers,” YouTube video, 7:18, posted by Google, Oct. 17, 2014, https://www.youtube.com/watch?v=yxxRAHVtafI.

52 Nathan Ensmenger, “Is Chess the Drosophila of Artificial Intelligence? A Social History of an Algorithm,” Social Studies of Science 42, no. 1 (2012): 6-7.

53 Google, “Behind the Mic.”

!33 How, then, did speech recognition become such an unquestioned staple in how computing and machine intelligence were imagined, both in popular culture and scientific research? What made it, when both its feasibility and its applicability were so often in question, not only so deeply desired, but assumed?

Central to these questions is the relationship between human bodies and mechanical systems, which frames the pursuit of speech recognition at the intersection of two corresponding discourses. Early speech recognition systems were driven by the desire, on one hand, to model machines after human processes and, on the other, to better understand human processes by means of mechanical reproduction. That is, researchers imagined machine recognition as a direct model of human perception in a moment in which they were already portraying physiological processes through mechanical and electrical analogies. This metaphorical convergence between human and electromechanical faculties served as both the driving force and the conceptual boundary of early speech recognition research. Engineers would eventually set these principles aside in favor of the so-called “statistical” approach, a shift with wide-ranging implications for the mutual definition of humans and machines.

!34 The Astonishing Mechanism

Pierce’s condemnation of speech recognition research not as failure, but deceit, offers a suggestive echo of the first documented speech recognition machine, another “artful deceit,” almost exactly two hundred years earlier. In the spring of 1770, Wolfgang von Kempelen, a civil servant from Pressburg, Hungary known for his expertise in hydraulics and mechanical engineering, arrived at

Schönbrunn Palace in Vienna to exhibit before Empress Maria Theresa’s gathered court a machine “more beautiful, more ſurprizing, more aſtoniſhing, than any to be met with.”54 Von Kempelen’s machine, soon after dubbed “the Turk” by the

British press,55 featured a life-size wooden figure “dreſſed in the Turkiſh

Fashion” that, seemingly without any external intervention, could play (and quite frequently win) games of chess against human opponents.56 The machine, of course, was a hoax: “’Tis a deception! granted; but ſuch an one as does honor to human nature.”57

54 Karl Gottlieb Windisch, Inanimate Reason; or a Circumstantial Account of That Astonishing Piece of Mechanism, M. de Kempelen’s Chess-Player; Now Exhibiting at No. 8, Savile-Row, Burlington-Gardens; Illustrated with Three Copper-Plates, Exhibiting This Celebrated Automaton, in Different Points of View. Translated from the Original Letters of M. Charles Gottlieb de Windisch (London: Printed for S. Bladon, 1784), 13.

55 The use of “Turk” is credited to Louis Dutens, who wrote a first-hand account of the Schönbrunn exhibition in a letter to Gentleman’s Magazine in July of 1770, which was later published in January 1771.

56 Windisch, Inanimate Reason, 21.

57 Ibid., 13, emphasis in the original.

!35 It was, however, so convincing that audience members reportedly suspected demonic possession.58 There was one feature that contributed to the

Turk’s ability to so “honor human nature” that was rarely mentioned: speech recognition. Following the chess-playing portion of the demonstration, the Turk would take questions from the audience, responding by spelling out the answers on a placard of gold letters that was swapped in for the chessboard following the game.59 The language and wording of the responses were also reportedly tailored to different audiences.60 It is unclear why von Kempelen included this act as part of the demonstration, though it is worth noting that von Kempelen was at the time engrossed in an investigation of the mechanization of human speech. He had purportedly begun developing his “Speaking Machine” in 1769, the same year he

58 Ibid., 15, emphasis in the original: “One old lady, in particular, who had not forgot the tales she had been told in her youth, crossed herself, and sighing out a pious ejaculation, went and hid herself in a window seat, as distant as she could from the evil spirit, which she firmly believed possessed the machine.”

59 A detailed description of the demonstration comes from Carl Friedrich Hindenburg’s 1784 account of a performance in Leipzig: “There is a plaque with golden letters and numbers placed above the chessboard, by means of which the figure answered an arbitrary question given to it by indicating individual letters with its finger that altogether made up the answer . . . The answers are on the whole very apt, often ingenious” (my translation). Carl Friedrich Hindenburg, Ueber den Schachspieler des herrn von Kempelen: nebst einer Abbildung und Beschreibung seiner Sprachmaschine (Leipzig: J.G. Müller, 1784), 18.

60 Though few accounts exist of this portion of the Turk’s demonstration, Charles Carroll notes in The Great Chess Automaton that mathematician Johann Jacob Ebert’s 1785 account reports that demonstrations in French included answers given in French. See Charles Michael Carroll, The Great Chess Automaton (New York: Dover Publications, 1975).

!36 built the Turk, according to Homer Dudley.61 The Turk’s ability to apprehend and respond to audience questions proved a critical source of scrutiny and skepticism for its ostensibly mechanical operations. While von Kempelen himself was said to have represented the Turk as nothing more than a “bold and fortunate illusion,”62 he refused to reveal its precise workings, and the ongoing mystery and fascination with the machine were rooted in the nature of its practical operation. Early hypotheses within the scientific community ranged from pre-set games to hidden operators, but it was ultimately the speech recognition segment of the demonstration that convinced some observers that the Turk was human-operated.

In 1789, Joseph Friedrich Freiherr zu Racknitz published a pamphlet depicting three models of the machine that he had built in an effort to uncover its workings. It was so thorough and influential that, in subsequent years, many

61 Homer Dudley and T. H. Tarnoczy, “The Speaking Machine of Wolfgang von Kempelen,” The Journal of the Acoustical Society of America 22, no. 2 (March 1, 1950): 152. The machine used a bellows, reed, and India rubber molds to simulate the form and operation of a human mouth and vocal tract, producing speech-like sounds by manipulating airflow.

62 Silas Weir Mitchell, “The Last of a Veteran Chess Player,” Chess Monthly, 1857, https://www.chess.com/blog/batgirl/the-last-of-a-veteran-chess-player---the-turk. The full account reads “At the present day it is difficult to conceive of the eager curiosity with which the Turk was everywhere greeted—and it is to be remembered, that at this early period his performances were regarded as the bona fide results of mere mechanical arrangements. The inventor himself, by no means encouraged this view of his invention, and he very honestly represented it as a bold and fortunate illusion.” N.B. This article was attributed to Silas Weir Mitchell, whose name appears in the magazine index, but not as a byline on the actual piece. Mitchell was the son of John Kearsley Mitchell, the last owner of von Kempelen’s Turk automaton. According to Windisch, the machine itself was produced only at the behest of the Empress, who was inspired by French illusionist François Pelletier’s use of magnetism in his act and challenged von Kempelen to make good on the claim that he could build “a machine whose effects should be more surprising and the deception more complete.” See Windisch, Inanimate Reason, 40.

!37 mistook his speculative drawings for actual depictions of von Kempelen’s machine.63 For Racknitz, the most revealing aspect of the Turk’s artful deceptions was its ability to recognize spoken questions from the audience, which led him to systematically outline and dismiss all the major hypotheses regarding the machine’s workings except the idea of a hidden human operator. The suggestion of even partial mechanical automation, he argued, seemed plausible only because

“most people sought only to fathom the secret of the chess player.”64 Simply put:

The answering of questions was a task that was impossible for any machine to accomplish by itself without the interference of a human being. If anything, it seems more feasible with regard to the chess player . . . If one discovered [how] the latter [was accomplished], then the mechanism by which the machine was moved while responding to questions would therefore also be explained.65

That is, while Racknitz believed mechanical scenarios could potentially explain the playing of complex chess games, he found the idea of recognition and interpretation of spoken language by machine preposterous.

63 Gerald M. Levitt, The Turk, Chess Automaton (McFarland, Incorporated Publishers, 2000), 35.

64 My translation; original in Joseph Friedrich Frenherr zu Racknitz, Ueber Den Schachspieler Des Herrn von Kempelen Und Dessen Nachbildung Mit Sieben Rupfertafeln (Dresden, 1789), 12: “Die mehresten suchten daher nur das Geheimniß des Schachspielers zu ergründen.”

65 My translation; original in Racknitz, Ueber Den Schachspieler, 11: Die Beantwortung vorgelegter Fragen war eine Verrichtung, die an sich von keiner Maschine, ohne die Einwürkung eines Menschen bewerkstelliget werden konnte. Eher schien dießin Ansehung des Schachspielers möglich, und man sucht nur zu erforschen, wie es mit letzterm zugehe? Denn, hätte man diese entdeckt, so glaubte man die Einrichtung der Machine so zu kennen, daß Sich die Art, wie die Figure bei dem Antwortgeben beweget werde, ebenfalls erklären lasse.

!38 As Jessica Riskin has argued, popular automata, or self-moving machines, of the eighteenth century were distinct from their predecessors in their effort to imitate biological functions “in process and substance as well as appearance.”66

Whereas the automata of the seventeenth century imitated the movements of animals and humans only in appearance, making no effort to model the internal physiology, those of the late-eighteenth century (in the era of von Kempelen’s

Turk), “were imitative internally as well as externally, in process and substance as well as in appearance.”67 Speech recognition thus occupied the conceptual horizon of mechanical simulation because it foreclosed any ambiguity as to which particular human functions the machine purported to reproduce. In the preface to a published collection of Charles Gottlieb de Windisch’s letters in which he expanded on his initial 1773 observations of the machine, the Turk is described as

beyond contradiction, the moſt aſtoniſhing Automaton that ever exiſted; never before did any mere mechanical figure unite the vis-motrix, to the vis-directrix, or, to ſpeak clearer, the power of moving itſelf in different directions, as circumſtances unforeſeen . . . might require.68

That is, its ability to understand spoken questions decisively proclaimed that what was mechanized by the Turk was the process of thinking, not the actions of chess.

It redirected the question of the essentially mechanical nature of living beings

66 Jessica Riskin, “The Defecating Duck, Or, the Ambiguous Origins of Artificial Life,” Critical Inquiry 29, no. 4 (2003): 602.

67 Ibid., 603.

68 Windisch, Inanimate Reason, v-vi.

!39 away from the physiological function of the body and towards the reasoning capacities of the mind.

Curiously, Racknitz was hesitant to publish his analysis, a hesitance he expressed in the preface to his pamphlet, but not because it unveiled the “deceit” of the Turk. Rather, his reluctance was due to a concern that it would be seen as an act of forgery rather than investigation, since “the chess player of Mr. von

Kempelen was a work of art.”69 He believed that since the secret of the Turk was not the type of knowledge that, if disclosed, “would have provided some benefit to mankind . . . One could accuse [him] of . . . emulating [his] friend.”70 Yet this characterization of the Turk as a “work of art” whose operation made no useful contribution to knowledge seems at odds with Racknitz’s simultaneous critique of those who dismissed the machine on account of its deception, where he asserts that “the mechanism that moves the Turk in such an abundant manner and the invention . . . deserves more attention than simple tricks and is worthy of exploration by thinking minds.”71

69 My translation; original in Racknitz, Ueber Den Schachspieler, 4: “Der Schachspieler des Herrn von Kempelen war ein Kunstwerk.”

70 My translation; original in Racknitz, Ueber Den Schachspieler, 4: “dessen Bekanntmachung der Menschheit einigen Nutzen verschaffet hätte . . . konnte man mich wohl gar einer . . . Nacheiferung meines Freundes beschuldigen.”

71 Racknitz, Ueber Den Schachspieler, quoted in Levitt, The Turk, 207. Levitt’s translation is used due to markings in the scan of the original text that obscured portions of this passage.

!40 Racknitz’s seemingly contradictory assessment is tied up in a moment of epistemic instability Riskin identified regarding the categorical distinction between natural and mechanistic processes, and the degree to which they could be considered equivalent. This shift was indicative of the changing status of mechanical imitation from representation to simulation, tied to “an emergent materialism and to a growing confidence, derived from ever-improving instruments, that experimentation could reveal nature’s actual design.”72

Automata, in other words, were not only technological curiosities, but epistemic objects used to “discern which aspects of living creatures could be reproduced by machinery . . . [and] whether human and animal functions were essentially mechanical.”73 At stake was not only a question of whether or not reason was replicated, but whether or not it was replicable. In this light, we might consider

Racknitz’s seemingly contradictory characterization of the Turk as both a piece of art that did not contribute to human knowledge and a contraption of sufficient mechanical sophistication to be “worthy of exploration by thinking minds” to be in reference to the machine’s appearance of liveness without the replication of its corresponding processes. That is, the Turk fails to be “of use among humanity” because its innovations are limited to the machine, to the technological rather than epistemological, and it fails to offer any insight into the

72 Riskin, “The Defecating Duck,” 603-4.

73 Ibid., 601.

!41 underlying nature of the activities it mimics. Though an impressive feat of engineering, it was thus a representational rather than experimental object, which is to say, merely a “work of art.”

Thus, in identifying the ability to answer spoken questions as the function which ultimately revealed the trick of the Turk, Racknitz makes clear that its deceptive “art” is not in the failure to replicate the bodily processes conducting its movements. The Turk would have remained a hoax even if the physical activity of moving pieces on a board according to some calculated strategy were mechanically feasible; game-playing, though complex, was an ultimately finite and routinizable activity. Rather, the deception lay in the mechanical representation of the reasoning processes directing these movements, which were irreducible to physiological performance. It is in this vein that Pierce’s eventual complaint that speech recognition constituted an “artful deceit” was not simply a rhetorical flourish, but an expression of the belief that there is a fundamental distinction between the apparent performance of speech recognition and a replication of its constitutive processes.

For Pierce, the problem was not the imperfect modeling of speech production and hearing, but that these aspects alone were insufficient since

“people recognize utterances, not because they hear the phonetic features or the words distinctly, but because they have a general sense of what a conversation is

!42 about and are able to guess what has been said.”74 Pierce’s objections thus stemmed from the deep conviction, expressed by proponents and critics of speech recognition alike, that speech was fundamentally anchored in human intelligence, and neither recognition nor production could be performed in isolation from the embodied and socially realized context of human language.75 Just as the Turk did not truly play chess, Pierce believed that speech recognizers did not truly recognize speech, though they performed the actions of matching sound to text.

Or put another way, what was in question was not the result, the knowledge produced, but the process by which something could be known.

Remaking Speech

In vaulting between Bell Labs research director J.R. Pierce and the eighteenth-century observers of von Kempelen’s mechanical curiosity—sentries of speech recognition’s deceptive arts separated by nearly two centuries—I do not mean to imply that the intervening years were inconsequential. The conviction in both instances that speech recognition was a uniquely human endeavor remained startlingly consistent throughout the field’s history, persisting as a primary

74 Pierce, “Whither Speech Recognition?” 1050.

75 Riskin points out that in building his speaking machine (completed around the same time as the Turk but, unlike the Turk, intended to replicate the physiological process of the articulatory system), von Kempelen came to recognize “a further constraint upon the mechanization of language: the reliance of comprehension upon context.” See Riskin, “The Defecating Duck,” 619. In other words, he noticed that whether or not the machine successfully produced speech depended on the listener’s interpretation of the sounds it produced as words.

!43 justification for speech recognition research. The idea that speech constitutes the most “natural,” “fundamental,” and “spontaneous” means of communication for humans at once drove the pursuit of its mechanization and defined the means by which that pursuit was undertaken.76 Speech recognition was animated simultaneously by the desire to model machines after humans and the reconfiguration of a speaking subject adequate to the machine. Viewed thus, the automation of speech recognition by mechanical means was not only a site where a particular understanding of speech and language was being formulated and formalized through technical processes, but one in which the boundaries of human and machine perception were being imagined, defined, and negotiated through communication technology.

The widespread belief that speech recognition offered “communication via the most natural means”77 did more than simply motivate research. It also anchored speech recognition to human perception and understanding. While the precise motor, sensory, and linguistic features varied, the fundamental premise undergirding early speech recognition research was the inextricability between

76 In textbooks, reference guides, policy reports, and promotional materials alike, the value of speech recognition research is persistently justified by evoking the status of speech as natural to, and by extension defining of, human communication, as well as a faculty unique to human beings. See, for instance, M.J. Underwood, “Machines That Understand Speech,” Radio and Electronic Engineer 47, no. 8.9 (August 1977): 368. The ability, therefore, to interact with machines by means of this “most natural, universal, and familiar form of communication” rationalized the appeal of automatic speech recognition. See Wayne A. Lea, “Establishing the Value of Voice Communication with Computers,” IEEE Transactions on Audio and Electroacoustics 16, no. 2 (June 1968): 184.

77 Licklider, “Man-Computer Symbiosis,” 10.

!44 speech recognition and human sensory faculties. That is, if speech was a defining trait of human nature, then the nature of human sensory-motor processes was in turn believed to be a requisite trait for apprehending speech. The lineage of speech recognition therefore traces back to experimental instruments for materializing speech as graphic inscriptions for laboratory studies of human speech production and perception. Such instruments coincided with a growing interest in, and proliferation of, sound recording and communications technologies, but were distinct in their aim of acoustic transcription rather than recording and transmission.

Early efforts towards speech recognition centered on discerning the nature of the human articulatory system. W.H. Barlow’s logograph, which was an instrument that “owes its origin to an attempt to make something which would do short handwriting instead of employing the services of the gentleman who sits there [transcribing],”78 was one such instance. Demonstrated before the Society of

Telegraph Engineers in 1878 (though according to Barlow it had been built four years prior), the instrument used vibrations from sound passing through a speaking trumpet to move a small marker up and down across a writing surface, essentially producing an automatic visual representation of speech as a line graph

78 W.H. Barlow, “The Logograph,” Journal of the Society of Telegraph Engineers 7, no. 21 (1878): 65.

!45 of changes in air pressure over time.79 Although the instrument failed to accomplish the task of automatic writing, it was said to have “served the purpose of illustrating certain features connected with the articulation of the human voice.”80 Specifically, Barlow claimed that he designed the logograph to quite literally trace “the peculiarities of articulation,” having been inspired, in a curious echo of von Kempelen’s deceptive automaton, by an encounter with “an old Turk smoking his pipe” and shapes produced in the smoke he exhaled as he spoke.81

Though rooted in the study of speech and sometimes referred to as a

“recording instrument,” Barlow’s logograph was considered distinct from speech recording and transmission technology. In a review of the phonograph in The

Gentleman’s Magazine in 1878, astronomer Richard Proctor identifies the logograph as the “most successful” of the “experiments which preceded, though they can scarcely be said to have led up to, the invention of artificial ways of reproducing speech.”82 The logograph and the phonograph, he reasoned, relied upon entirely different premises since “the movement of the central part of the diaphragm should suffice to show that such and such words had been uttered, is

79 W. H. Barlow, “On the Pneumatic Action Which Accompanies the Articulation of Sounds by the Human Voice, as Exhibited by a Recording Instrument,” Proceedings of the Royal Society of London 22, no. 152 (April 1874): 277–86.

80 Barlow, “The Logograph,” 65.

81 Ibid., 65-66.

82 Richard Proctor, “The Phonograph, or Voice-Recorder,” Gentleman’s Magazine, 1878, 690.

!46 one thing; but that these movements should of themselves suffice, if artificially reproduced, to cause the diaphragm to reproduce these words, is another and a very different one.”83 British engineer Frederick Bramwell similarly discussed the logograph at length in his 1884 lecture on the telephone to the

Institution of Civil Engineers, but noted that while “there is much in common in the principle of the logograph and that of the telephone . . . I believe, that when

Mr. Graham Bell was engaged in the invention of the telephone, he was not aware of that which Mr. Barlow had done; but I know he has since become acquainted with Mr. Barlow’s labours, and has spoken of them in the highest terms.”84 Moreover, unlike either the phonograph or telephone, Barlow’s device was imagined as ultimately leading to automatic writing, to replace shorthand and stenographers. Barlow claims that the logograph “owes its origin to an attempt to

83 Proctor, “The Phonograph,” 701.

84 Frederick Bramwell, “Telephone,” in The Practical Applications of Electricity: A Series of Lectures Delivered at the Institution of Civil Engineers, Session 1882-83 (London: The Institution of Civil Engineers, 1884), 30. It is interesting to note that Bramwell also states that he “cannot help thinking that if the logograph and its failure had been known to those who were occupied with telephonic invention, such knowledge would have had a deterrent effect” (30), effectively claiming that the logograph could not have been part of the same line of invention as the telephone, since, had it been known, it would have prevented the telephone from being invented. Proctor, in his article on the phonograph, makes a similar comment, but in relation to a different set of experiments conducted by Aurel de Ratti. See Proctor, “The Phonograph,” 705.

!47 make something which would do short handwriting instead of employing the services of the gentleman who sits there [transcribing].”85

J.B. Flowers’s “voice-operated phonographic alphabet writing machine” was another early attempt at producing a machine that would transform speech into written text. Flowers developed and tested a portion of his proposed device in collaboration with the Department of Physiology at the College of Physicians and

Surgeons in New York and the Underwood Typewriter Company and publicized it in 1916. The machine, which was intended to eventually operate a typewriter, was never fully built. Only the speech representation portion, which produced photographs of vertical lines, was ever developed and tested.86 The typewriting

85 Barlow, “The Logograph,” 65. It’s unclear whether people thought the logograph would replace a human intermediary entirely. For instance, Bramwell speculates that the stenographer would be replaced by a person who would then talk into the logograph: “This must strike us all as an extremely admirable idea, and had Mr. Barlow succeeded I cannot help thinking that the profession of the gentleman immediately in front of me, the shorthand-writer, would be gone; because all that would have been needed to obtain a record of speech would have been to employ some person to listen to the speaker, and as he listened to talk to the logograph, and so get the record upon the band of paper” (See Bramwell, “Telephone,” 29). In this instance, the logograph simply replaces the need for stenographic shorthand, but not for a human operator.

86 See “A Writing Machine That Responds to the Voice,” The Electrical Experimenter, April 1916, 686. The apparatus is described as using an acousticon transmitter, which “is generally known . . . [to be] used to a great extent in aiding partially deaf people hear better,” connected to an Einthoven string galvanometer. Speech was whispered into the transmitter, which produced corresponding electrical vibrations that passed through a battery, controlling resistor, and step-up induction coil that generated an electrical current. This current then vibrated a thin, silvered quartz fiber in the Einthoven galvanometer that was supported between the poles of an electro-magnet. The movement of the fiber was magnified 900 times by an arc lamp and a series of selenium lenses and cast shadows on the revolving drum of a wheel camera, which captured a photographic record of the movement. The camera was triggered by interrupting the light from the arc lamp 500 times per second, which resulted in photographs at intervals of two-thousandths of a second. See also “Voice-Controlled Writing Machine,” Scientific American, February 12, 1916, 174, where it is additionally noted that a telephone receiver was used for a listener to “check on the articulation of the words” as part of the experimental control process.

!48 portion, despite the claim that the invention was a “voice-operated typewriter” and the participation of the Underwood Typewriter Company in the recording experiments, remained entirely speculative.

However, like the logograph, the device’s practical limitations were justified by its purported contribution to speech research. Flowers publicized the invention in a paper titled “The True Nature of Speech,” and it was seen to be “a definite and valuable contribution to the physiology of speech” and a “most remarkable study of the human voice and its fluctuations.”87 Flowers claimed that the speech records he photographed were consistent across speakers, and made

500 experimental records of speech from three men and one woman, which led him to conclude that spoken whispers could be used to reveal and record the “true nature” of speech.88 Also like the logograph, its imagined output was not standard written English, but an alternative, “less artificial” writing system: “the object of this device is to record speech automatically in ink on paper in the form of an easily read compact system of natural characters called the phonographic alphabet.”89 However, unlike the logograph, users not only had to learn new ways to read, but also new ways to speak in order to use the phonoscribe.

87 H.B. Williams in discussion section of John B. Flowers, “The True Nature of Speech,” Transactions of the American Institute of Electrical Engineers 35, no. 1 (January 1916): 232; “A Writing Machine That Responds to the Voice,” 686.

88 “A Writing Machine That Responds to the Voice,” 686.

89 Flowers, “The True Nature of Speech,” 213.

!49 Flowers’s phonoscribe was also distinct from the logograph in that it was imagined as a clerical tool from the very outset, complete with its never-built design for a typewriter attachment. It was described in a particularly effusive review in Popular Science:

Conceive an ordinary machine resembling the machines in common office use . . . But — Speak to it! It becomes alive. It hears you. It vibrates with action. Somewhere inside, typewriter bars go “clickety-click-click.” At the top of the machine a sheet of a paper unwinds from a roller. The machine has written down what you have spoken! . . . [I]f you are considerate, and mindful of its feelings enough to spell out words correctly in cases where it might be likely to err, the machine will very obediently follow you . . . Think what it means for an office of the future to have an almost human machine at hand to perform the routine drudgery of type-writing and letter-writing!90

These devices were developed in a period that historian John Pickstone identifies as one in which the experimentalism of mid-nineteenth century science opened the way for “the novel products of experimentalist laboratories . . . [to] be developed as industrial commodities” in an emergent technoscience that produced

“ways of making knowledge that are also ways of making commodities.”91 The

Barlow logograph seemed to express the early tensions between scientific experimentation and industrial commerce, as a laboratory instrument that could also be imagined as an information technology, while Flowers’s phonoscribe seemed to indicate a maturation into technoscience, imagined as first a

90 Lloyd Darling, “The Marvelous Voice Typewriter: Talk to It and It Writes,” Popular Science Monthly, July 1916, 65.

91 John V. Pickstone, Ways of Knowing: A New History of Science, Technology, and Medicine (University of Chicago Press, 2001), 13.

!50 technological commodity that, only in light of its incompleteness, sought refuge in the rationale of scientific instrumentation. As a result, whereas the logograph was explicitly designed to measure human processes, the phonoscribe was described as if it had absorbed those very human faculties, uprooting speech recognition from the experimental laboratory and placing it firmly within the context of communication technology.

Speech research in the twentieth century soon came to be defined by work at AT&T’s Bell Laboratories. Rooted in the concern over telephony, speech research focused heavily on transmission rather than transcription, while its scope expanded from the articulatory system to hearing and perception. The vocoder, one of the most prominent and influential speech processing technologies of the period, was a direct predecessor to the first operational speech recognition system, also produced at Bell Labs. Though Vannevar Bush may have imagined the Vocoder paired with a stenotype to form “a machine which types when talked to” that would take the place of “writing by hand or typewriter,” its intended purpose was not automatic writing.92 Unlike the early, incomplete efforts of Barlow and

92 Vannevar Bush, “As We May Think,” The Atlantic, July 1945. Tellingly, Bush does not only describe the Vocoder as a “microphone, which picks up sound. Speak into it, and the corresponding keys move.” He goes on to describe the stenotype process in which a female stenographer “records in a phonetically simplified language a record of what the speaker is supposed to have said. Later this strip is retyped into ordinary language, for in its nascent form it is intelligible only to the initiated.” Bush therefore not only describes the possibility of recording and transcribing speech, but envisions a process of mechanical speech recognition in which acoustic signals representing natural language must first be encoded based on phonetic elements and consequently decoded for transcription.

!51 Flowers, the Vocoder was designed to transmit rather than transcribe speech. The

Vocoder’s paired device was not a stenotype, as Bush envisioned, but a machine called the Voder, which received the Vocoder’s output at the other end of the wire and reconstituted it into comprehensible speech. As Mara Mills explains, the aim of the Vocoder was “to remake speech, to make communication through wire or air more efficient . . . [and] build certain aspects of human communication into the telephone system itself.”93 Thus, due to the prominent role of Bell Labs and telephony in speech processing research, where speech transmission to human listeners remained the primary concern, early speech recognition research remained resolutely anchored to the belief that the key to mechanical speech processing of all forms lay in a thorough understanding of human speech production and perception.

In addition to the commercial interests surrounding telephony, the early transcription efforts demonstrated that the seemingly straightforward task of automatic transcription was more difficult than anticipated. Recognition could be carried out only in highly restricted conditions, with equally restricted vocabularies. In the absence of a clear and broadly suited technological application such as automatic transcription, the driving justification of research into mechanical speech recognition was the prevailing assumption that it would

93 Mara Mills, “Media and Prosthesis: The Vocoder, the Artificial Larynx, and the History of Signal Processing,” Qui Parle 21, no. 1 (December 1, 2012): 111.

!52 teach researchers something essential about speech. Speech recognition research at Bell Laboratories thus emerged out of experiments in speech specification, in particular the development of the sound spectrograph, an instrument that recorded areas of acoustic energy concentration along the frequency spectrum over time. Developed at Bell Laboratories as part of a defense-funded project led by Ralph Potter on “visual hearing,” the spectrograph was designed to create graphic recordings of speech, with the aim of producing visible patterns that could be effectively “read” like text.
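To make concrete what such a representation involves, the following is a minimal, hypothetical sketch in Python of the spectrograph's modern digital counterpart: a short-time analysis that measures acoustic energy per frequency band over successive time frames. The original instrument was analog, so nothing below reconstructs its actual workings; the function and signal names are invented purely for illustration.

# Illustrative sketch only: a digital stand-in for the spectrograph's record of
# "acoustic energy concentration along the frequency spectrum over time,"
# computed as short-time Fourier transform magnitudes.
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    # Slice the waveform into overlapping, windowed frames and measure the
    # energy in each frequency band of each frame.
    window = np.hanning(frame_len)
    frames = [signal[start:start + frame_len] * window
              for start in range(0, len(signal) - frame_len, hop)]
    # Rows: time frames; columns: frequency bins up to the Nyquist frequency.
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

# Toy usage: a one-second "utterance" whose pitch glides from 300 Hz to 1200 Hz.
rate = 8000
t = np.linspace(0, 1, rate, endpoint=False)
glide = np.sin(2 * np.pi * (300 * t + 450 * t ** 2))
energy = spectrogram(glide)
print(energy.shape)  # (number of time frames, number of frequency bins)

Plotting the resulting array as an image, with time on one axis and frequency on the other, yields the kind of visible pattern that Potter's group hoped could be effectively "read" like text.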

The promise of speech universals that humans could identify visually, given sufficient training, extended readily to the possibility that machines could be made to do the same, given proper engineering. Potter himself, in one of the earliest public discussions of spectrographic technology after World War II, suggested that “an automatic system for translating sound into patterns which can be readily interpreted by the eye . . . opens the way to the selective operation of automatic devices by voice sounds.”94 While Potter considered such devices of secondary interest to producing a “visual hearing” system for humans, particularly for the deaf as an aid to speech learning and telecommunication, others at Bell

Labs, including J.R. Pierce, quickly gravitated to the idea of voice-operated

94 Ralph K. Potter, “Visible Patterns of Sound,” Science 102, no. 2654 (November 9, 1945): 463-4.

!53 machines.95 Reflecting on the potential of the sound spectrograph for Astounding

Science Fiction in 1946, Pierce characterized voice-operated devices as the obvious and inevitable consequence of the spectrograph, reasoning that “[i]f the human eye can distinguish words in the visible speech patterns and easily and correctly identify the same word spoken by a man or by a woman, in an English or a Mid-Western accent, why can’t a machine?”96 Despite its appearance in a science fiction publication, Pierce’s speculation was far from idle. In February of

1947, Pierce wrote a letter to Potter in which he outlined two designs for “devices

Figure 1 Three diagram drawings by Pierce. Contained in a 1947 letter to Potter originally dated February 7th, revised February 13th. The drawings contain a general diagram and two possible technical configurations for a device that could be attached to a “voice-translating device” and could identify letters and “operate a printer and print the word, or may operate any other device.” John R. Pierce to Ralph K. Potter, February 13, 1947, File 37874-13, Volume B, Courtesy of AT&T Archives and History Center, Warren NJ.

95 Ibid., 464. As Potter notes, such a system “opens the prospect of some day enabling totally deaf or severely deafened persons to use the telephone or the radio . . . But most immediately, and from humane considerations most importantly, it opens a new avenue of help to the totally and severely deafened—help to learn to speak, and for those who already speak, help to improve their speech. It is to this problem of aid to the deaf that we have first directed our efforts.”

96 J.R. Pierce (as J.J. Coupling), “Portrait of a Voice,” Astounding Science Fiction, July 1946, 121. Pierce published this piece under the pseudonym J.J. Coupling, which he also used when publishing his science fiction short stories.

!54 which would potentially respond to words” formulated in discussion with Claude

Shannon, complete with preliminary diagrams (see figure 1).97

For Pierce, the promise of the spectrograph was as a solution to the constraints of our “silly and antiquated” reliance on symbolic writing:

[P]erhaps the simplest and worst defect of all in this mechanical age is that there is so little correlation between the written symbol and the spoken word that we are frustrated in our attempts to make voice operated devices. We have all dreamed, if we have not read, of voice operated typewriters which take down dictation without fatiguing or distracting, yet our preoccupation with written words which bear little relation to speech has prevented our realizing even this.98

The spectrograph was thus understood as an instrument that would bring machines closer to human perception, closing the gap created by a writing system that was “highly artificial at best.”99 According to Pierce, the advantage of the spectrograph over other acoustic measurement tools, such as the oscillograph, was that it could approximate “the analytical method of the human ear [which] is a purely arbitrary thing,”100 rather than simply capturing acoustic changes that “may literally be there in the sound wave, but [which] the ear doesn’t appreciate.”101 In other words, the central challenge of speech specification, and by extension of

97 John R. Pierce to Ralph K. Potter, February 13, 1947, File 37874-13, Volume B, AT&T Archives and History Center, Warren NJ. The letter is originally dated February 7, 1947, with a note of revision dated February 13.

98 Pierce, “Portrait of a Voice,” 105.

99 Ibid., 104.

100 Ibid., 101.

101 Ibid., 112.

!55 speech recognition, was to replicate the “arbitrary” analysis of the human ear that could resolve acoustically capricious data into a fixed, universal typology of speech.

If machine recognition of speech was justified by what it would reveal about human speech perception, then the human body could be conceived of alternately as a particularly sophisticated or particularly naive machine. Central to the efforts to model human perception was the belief that there was only one means of knowing, making the barrier between human and machine recognition one of degree, not kind. The shift towards statistical methods is thus part of a broader conceptual shift. It is one that is not merely about modeling human systems versus something “purely” mechanical, but about the very idea that there could be a distinct way of knowing that is recognizable as such. The statistical shift was therefore much more than one of technique, or even conceptual approach. It was a reconfiguration of the epistemic framework of a unified human-machine endeavor.

From Signal to Symbol

Speech recognition first emerged as a defined area of research around the 1950s. While a handful of voice-control or voice-typing systems had been proposed several decades prior, the Automatic Digit Recognizer, demonstrated by engineers at Bell Labs in 1952, is widely considered the first

!56 operational speech recognizer.102 Nicknamed “Audrey,” the system at Bell Labs was initially capable of recognizing a limited vocabulary of ten digits (zero through nine, with zero spoken as “oh”) with a reported 97-99 percent accuracy, but only from a single, predesignated speaker.103 This was due to Audrey’s template-matching recognition method, in which the incoming speech utterance was matched against stored “templates” of prototypical speech. These templates were speaker-specific, based on the average spectral values of multiple utterances of a given word, and stored as voltage configurations in the machine’s hardware. The machine settings therefore had to be entirely reconfigured before it could be used by a different speaker, and additional vocabulary required additional circuitry.104 Templates not only made adaptation to different speakers difficult, they also presented problems in terms of processing and storage capacities, as a

102 A slightly earlier device that is sometimes cited is Swiss speech scientist Jean Dreyfus-Graf’s steno-sonograph, which he described in a publication in The Journal of the Acoustical Society of America in 1950. However, the prototype described at the time was only capable of tracing the patterns of acoustic input as an alternate “sonographic alphabet,” but did not identify them as specific words or other linguistic units. The paper does propose a “typo-sonograph” that operates a typewriter, though this was not built.

103 Later versions of the system also included six additional phonetic patterns for the “sh” and the sustained vowel sounds “e,” “æ,” “ʌ,” “ɔ,” “u.” See Dudley and S. Balashek, “Automatic Recognition of Phonetic Patterns in Speech,” The Journal of the Acoustical Society of America 30, no. 8 (August 1, 1958): 723.

104 The specific number and arrangement of tubes, relays, and condensers varied between different builds of the system. For the initial configuration, see K.H. Davis, R. Biddulph, and S. Balashek, “Automatic Recognition of Spoken Digits,” Journal of the Acoustical Society of America 24, no. 6 (November 1952): 638-641, with more detail provided in R.S. Biddulph and K.H. Davis, Voice-operated device, US2685615 A, filed May 1, 1952, and issued August 3, 1954. A later iteration using a different band filter and memory circuit set-up is described in Dudley and Balashek, “Automatic Recognition of Phonetic Patterns in Speech,” 723–725.

!57 unique template would have to be stored for every word in the recognizer vocabulary. Moreover, since each template was based on the averaged acoustic measurements taken from approximately one hundred repetitions of the word, the calibration process for each speaker expanded quickly as the vocabulary grew.105 Even with its vocabulary limited to ten simple, phonetically distinct words, Audrey required a complex and difficult-to-maintain assembly of vacuum-tube circuitry that was housed in a six-foot high relay rack (see figure 2).106

Figure 2 Photograph of the Automatic Digit Recognizer, or “Audrey.” A-10414, Photograph, Courtesy of AT&T Archives and History Center, Warren NJ.
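The template-matching principle described above can be illustrated with a brief, hypothetical sketch. The Python fragment below is not a reconstruction of Audrey, which stored its templates as voltage configurations in analog circuitry; with invented feature vectors standing in for averaged spectral measurements, it only shows why per-word storage and per-speaker calibration scale so poorly.

# Illustrative sketch only: a digital analogue of template-matching recognition.
import numpy as np

def build_templates(training_utterances):
    # Average repeated utterances of each word into one speaker-specific template.
    return {word: np.mean(np.stack(examples), axis=0)
            for word, examples in training_utterances.items()}

def recognize(utterance, templates):
    # Pick the vocabulary word whose template lies closest to the incoming utterance.
    return min(templates, key=lambda word: np.linalg.norm(utterance - templates[word]))

# Toy usage: ten "digits," each a noisy 8-dimensional spectral profile.
rng = np.random.default_rng(0)
profiles = {str(d): rng.normal(size=8) for d in range(10)}
training = {word: [profile + rng.normal(scale=0.1, size=8) for _ in range(5)]
            for word, profile in profiles.items()}
templates = build_templates(training)
test_utterance = profiles["7"] + rng.normal(scale=0.1, size=8)
print(recognize(test_utterance, templates))  # expected output: 7

Every vocabulary item adds another stored template, and every new speaker requires rebuilding the entire set of templates from scratch, which is the storage and calibration problem the passage above describes.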

Other systems from this period addressed the issues of storage and processing by using a set of generalized rules rather than individualized templates.

In what became known as the “rule-based” method, speech utterances were identified based on the presence or absence of various features in the acoustic signal. These

105 K.H. Davis, R. Biddulph, and S. Balashek, “Automatic Recognition of Spoken Digits,” Journal of the Acoustical Society of America 24, no. 6 (November 1952): 641.

106 Flanagan et al., “Techniques for Expanding the Capabilities of Practical Speech Recognizers,” 426. The “myriad maintenance problems” are described as those associated with vacuum-tube circuitry in general and not the specific implementation of the recognizer.

predetermined rules were developed through the careful study of spectrographic readings to determine what were believed to be the distinctive acoustic features of particular words or phonetic categories. The rules were, therefore, based on the assumption of a relatively stable correspondence between certain acoustic features and the phonetic units of language. They were also by no means standardized. Rather, they were developed in an ad-hoc manner by whatever phonetics experts the various research teams had on hand.107
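The logic of such rules can be suggested with a deliberately crude sketch. The features, thresholds, and categories below are invented placeholders standing in for the ad-hoc judgments of individual phoneticians; no historical system is being reproduced here.

# Loose schematic of a "rule-based" recognizer: hand-authored tests for the
# presence or absence of acoustic features, composed into decision rules.
# Every feature name and threshold here is an invented illustration.

def has_high_frequency_energy(frame):
    return frame["energy_above_4khz"] > 0.5   # frication noise, as in /s/ or /sh/

def is_voiced(frame):
    return frame["periodicity"] > 0.6          # vocal-fold vibration present

def classify_segment(frame):
    # Each branch encodes an expert's assumption that a given feature pattern
    # corresponds reliably to a phonetic category.
    if has_high_frequency_energy(frame) and not is_voiced(frame):
        return "voiceless fricative"
    if is_voiced(frame) and frame["formant1"] < 400:
        return "high vowel"
    return "unclassified"

print(classify_segment({"energy_above_4khz": 0.7, "periodicity": 0.2, "formant1": 900}))

The brittleness the field later complained of is built into the form itself: every rule rests on an expert's confidence that the feature-to-phoneme correspondence holds across speakers and contexts.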

By the end of the 1950s, researchers were coming to realize that the acoustic characteristics of speech alone were not sufficient for recognition. The correspondence between acoustic features and phonetic units was not as unique or stable as previously believed, and there was far too much variance and ambiguity for effective recognition as vocabularies expanded beyond a few

107 See Lawrence Rabiner and Biing-Hwang Juang, Fundamentals of Speech Recognition (Englewood Cliffs, N.J: Prentice Hall, 1993), 50. As Rabiner and Juang note in their widely-used textbook, the major challenges that prevented acoustic-phonetic rules from being more widely used included the fact that “the method requires extensive knowledge . . . [that] is, at best, incomplete, and at worst totally unavailable for all but the simplest of situations” and that “the choice of features is made mostly based on ad hoc considerations. For most systems the choice of features is based on intuition.” Similarly, Waibel and Lee, in their introduction to the section on “Knowledge-Based Approaches” in their authoritative edited collection of Readings in Speech Recognition, note that the knowledge used in these systems was “usually derived from careful study of spectrograms [by individual experts] . . . However, this approach has only had limited success, largely due to the difficulty in quantifying expert knowledge.” See Alexander Waibel and Kai-Fu Lee, eds., Readings in Speech Recognition, 1st edition (San Mateo, Calif: Morgan Kaufmann, 1990), 197. Similar concerns over expert judgement were also raised while knowledge-based systems were still heavily in favor. For instance, in a 1976 overview of the field, Raj Reddy noted that the “most elusive factor” that led to inaccurate labeling in speech recognition was the lack of “objective phonetic transcription” due to variations in the “subjective judgements of phoneticians.” See Dabbala R. Reddy, “Speech Recognition by Machine: A Review,” Proceedings of the IEEE 64, no. 4 (April 1976): 519.

words. Thus, in addition to modeling the articulatory and hearing processes, researchers also began to incorporate linguistic information. The charge to include linguistic knowledge alongside acoustic measurement was led by Dennis Fry and

Peter Denes in the Department of Phonetics at University College, London.108 Fry and Denes developed the first system to incorporate linguistic knowledge, in the form of phoneme sequence frequencies, in 1959. This frequency information was used to help resolve ambiguities present in the acoustic characteristics of different speech sounds. Fry and Denes rejected the assumed correspondence between the acoustic and phonetic units of speech as, at best, unreliable. On the contrary, experimental efforts had demonstrated

quite clearly that no simple relationship exists between the spectral patterns [of speech waves] and the speech sounds units . . . [I]t is unlikely that any single acoustic characteristic or combination of characteristics uniquely identifying any speech sounds does in fact exist in the acoustic wave.109

Yet this rejection of a model based on the physical dimensions of speech production and hearing did not signal a rejection of a model based on human speech processes. In fact, Fry and Denes’s inclusion of statistical information regarding the frequency with which different phonemes occurred together in English was

108 Fry was head of the phonetics department and a professor of experimental phonetics. Denes, who was recruited by Fry, was trained as an engineer and later went on to work at Bell Labs.

109 Peter Denes, “The Design and Operation of the Mechanical Speech Recognizer at University College London,” Journal of the British Institution of Radio Engineers 19, no. 4 (1959): 219.

based on the conviction that “the human listener in recognizing speech does not rely solely on acoustic characteristics . . . [but also uses] the linguistic cues that are at the disposal of the listener.”110
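The way such linguistic statistics could tip an otherwise ambiguous acoustic decision is easy to see in a small sketch. The numbers below are invented, and combining the two sources of evidence by simple multiplication is my own simplifying assumption; it stands in for, rather than reproduces, the circuitry Fry and Denes actually built.

# Illustrative sketch: resolving ambiguous acoustic evidence with
# phoneme-sequence frequencies. All probabilities are invented.

acoustic_scores = {"p": 0.45, "b": 0.55}      # acoustically, "b" looks slightly more likely
previous_phoneme = "s"

# Relative frequency of phoneme pairs in English: "sp" is common, "sb" is rare.
digram_frequency = {("s", "p"): 0.030, ("s", "b"): 0.001}

def rescore(candidates, prev, digrams):
    """Weight each acoustic candidate by how often it follows the previous phoneme."""
    return {ph: score * digrams.get((prev, ph), 1e-6)
            for ph, score in candidates.items()}

combined = rescore(acoustic_scores, previous_phoneme, digram_frequency)
print(max(combined, key=combined.get))   # the sequence statistics favor "p"

The point of the sketch is the one Fry and Denes themselves made: the decision is no longer carried by the acoustic signal alone, but by what a listener (or machine) already knows about how English phonemes tend to follow one another.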

During this period, another prominent influence emerged in speech recognition research. Growing interest in computing galvanized the pursuit of

ASR by simultaneously offering faster, more flexible tools for research and experimentation as well as new sites of application. That is, if speech recognition was ultimately about enabling a direct, “natural” way to communicate with the machine, the digital computer presented a machine that seemed particularly worthy of that communication. Computing gave speech recognition a glamorous use-case, one that tied its pursuit to the matter of national interest amidst Cold

War tensions, leading E.E. David, Jr. and O.G. Selfridge to refer to speech recognition as a technology that “all would agree that we should have . . . before the Russians,” even in the absence of clear practical applications.111 (David, Jr. would incidentally go on to become Richard Nixon’s science advisor and Director of the White House Office of Science and Technology in 1970.) At the same time,

110 Ibid.

111 E.E. David, Jr. and O.G. Selfridge, “Eyes and Ears for Computers,” Proceedings of the IRE 50, no. 5 (May 1962): 1093. David and Selfridge justified speech recognition research, along with other “automatic sensing” technologies, through competition with Russia, despite not having a clear sense of utility: “Certainly, the utility of automatic sensing will depend upon what is to be done with the material after it enters the computer as well as the internal organization of the machine itself. Perhaps all would agree that we should have automatic inputs before the Russians, but it is not so clear that we need them soon as practical inputs.”

speech recognition tapped into deeper fantasies of computing, taking on the perception that advances of its ilk were crucial to fulfilling the technological potential that computing had to offer, resulting in its inclusion in ongoing debates surrounding computing technologies and their use.112

This shift in the focus of interest from signal processing to computing changed the relationship between humans and machines that speech recognition was imagined to manage. Speech recognition went from an aid for tasks designed for transmission through machines, such as automatic dialing, to a form of interface for tasks of communication with machines. Opinions regarding the utility of speech recognition were far from uniform, and even its proponents were prone, like Licklider, to hedging their projections. Nevertheless, there remained a common underlying perception that speech recognition was tied to a new way of imagining computers’ capabilities and our interactions with them. Arguments regarding speech and other forms of language-recognition marked out a larger debate about what it was that computers were not only capable of, but intended for. That is, debates surrounding speech recognition were effectively debates about the very nature of the computer.

In 1961, a panel entitled simply “What Computers Should be Doing” was held at MIT as part of its centennial celebration lecture series on “Management and the Computer of the Future.” The panel featured a lecture by J.R. Pierce, with

112 Ibid., 1093.

Vannevar Bush acting as moderator, Claude Shannon and Walter Rosenblith as discussants, and an additional host of notable figures in computing history, including Marvin Minsky, J.C.R. Licklider, John McCarthy, Hubert Dreyfus, and

Paul W. Abrahams contributing from the audience. Pierce’s lecture, in a preview of the condemnations that he would publish on speech recognition eight years later, focused on misguided research in computer science that failed to acknowledge

the fact that a general-purpose computer can do almost anything does not mean that computers do all things equally well. Some things they do much better than human beings; some things they do worse. Machines are not people.113

Speech recognition, Pierce maintained, along with related tasks such as optical character recognition and machine-translation, while “a favorite pastime of computer experts,” was particularly emblematic of this indulgent indiscrimination. Computers, in Pierce’s view, were suited for “certain types of problems, problems necessarily involving a great deal of straightforward computing on a great deal of input data, as opposed to game-playing or recognition problems.”114 Problems of data-processing were, in other words, categorically distinct from problems of recognition.

113 Pierce quoted in J.R. Pierce et al., “What Computers Should Be Doing,” in Management and the Computer of the Future, ed. Martin Greenberger (Cambridge, MA: The MIT Press, 1962), 300.

114 Ibid.

What made speech recognition a different type of problem, one that computers were ill-equipped to handle, was the inherent variability of its data:

No one can argue that computers cannot read words; they read words off punched cards, perforated tape, and magnetic tape every millisecond (or less) . . . What computers do not do very well, despite the efforts of Selfridge [in building a computer that could read "hand-sent Morse Code”], is to read varieties of type faces in various orientations, or to read handwritten script, or to recognize many words when spoken in a variety of voices . . . It is difficult to make computers recognize classes of features that the eye or ear picks out unerringly . . . In contrast, my bank’s accounting machinery has no trouble with the numbers printed in magnetic ink on my checks, though to me they are strange and obscure . . . If a machine is to be forced to read records, and at a rapid rate, the task should be made easy for the machine. The characters should be unambiguous, and the machine should be allowed to run through tape or sort through cards; it should not have to thumb through dog-eared pages.115

The problem for Pierce, in other words, was the suitability of the material to be recognized, rather than the task of recognition per se. Computers could read, but struggled with the demands of inconsistency and ambiguity that speech and writing presented. They were more adept with the uniform and explicit standards of magnetic printing than the “dog-eared” books, inconsistent in execution and capricious in purpose. The reading of computers was simply not the reading of humans; reading materials and tasks had to be assessed accordingly.

It was far from incidental that speech and language technologies were the site of contestation in the debate over the nature and purpose of computing. As

Pierce went on to specify during the panel’s general discussion:

115 Ibid., 297-298.

What is a pattern? There are many patterns that human beings find easy to recognize, and they are very important to us: the spoken word, the written word, and so forth. But these are hard for computers. On the other hand, we all know that the world is full of patterns that human beings find difficult to recognize. Chemical tests and urinalyses are simple examples. An electrical filter will filter a sine wave out of a lot of noise, while the filter in our ear is broad-band and cannot do this.116

Thus, it was not the general task of pattern recognition that computers were unsuited to, but the recognition of a specific class of patterns—one “very important to us [humans]” that included “the spoken word, the written word.”

Speech and language presented a format challenge, one in which the type of input transformed the problem of pattern recognition into a different class of problem altogether, one of interpretation and judgement. Computers, in Pierce’s view, were perfectly suitable for pattern recognition. What they were not suitable for was communication.

The role of speech recognition as a proxy for a broader debate over the status of the computer more generally was underscored in E.E. David, Jr. and

O.G. Selfridge’s article on “Eyes and Ears for Computers” in 1962. In what appeared to be a direct response to Pierce’s presentation at MIT, they maintained that simply converting material into machine-readable formats was not a suitable alternative for speech recognition, for with simple digital conversion

We are not much better off, because we then have the same excess of data in two forms, neither related to the classificatory representation we want in the end. Rather, for instance, from speech waveforms we ask a phonetic or

116 Ibid., 318.

literal transcription, from printing and handwriting a literal (‘alphanumeric’) transcription and not a TV raster, from radar a target inventory, and from weather pictures a cloud map. Such representations imply abstraction of raw data into more meaningful perceptual coordinates. Computers able to convert inputs into such forms will be much more flexibly, powerfully, and economically coupled to the real world.117

That is, speech recognition was not seen as a means to merely collect and convert data from one format to another. It was tied to the notion that the conversion and organization of data, that is, its processing, could lead to the automatic generation of “meaningful perceptual coordinates.” In other words, what was at stake in speech recognition was a question of whether or not computers were capable of producing knowledge.

The proponents of speech recognition viewed computing as something more than “mere substrata for a rational flow of communication and control messages; it is likely that it will furnish some of the needed tools for the development of the sciences of man.”118 As Walter Rosenblith argued in his response to “What Computers Should be Doing,” the challenges of speech recognition and other activities that Pierce identified as more suited for humans than computers were not indications of computing’s inherent limitations, but

“difficulties [that] affect the coupling of man to his devices as well as the relations

117 David, Jr. and Selfridge, “Eyes and Ears for Computers,” 1093.

118 Walter Rosenblith quoted in Pierce et al. “What Computers Should Be Doing,” 312.

between men.”119 For Rosenblith and other proponents, speech recognition was a key consideration in the development of human-machine collaboration. Speech recognition was deemed valuable within a worldview in which computers were seen as collaborative objects in the production and communication of knowledge, such that the transmission of information became secondary to its apprehension by machine, with the computer no longer a stopover in the communication procedure, but its terminus. The debate over speech recognition and tasks of its ilk was effectively a debate not only over what computers should be doing, as the title of the MIT panel attests, but over what they should be altogether.

Artificial Intelligence and Expert Systems

In bearing the promise of more flexible computing interfaces, speech recognition had to become more flexible in turn. With automatic dialing and other basic voice commands in mind, researchers through the fifties and sixties focused on the recognition of only a handful of keywords, usually digits, spoken with pauses between. But in order to fulfill a promise of enabling interactions that treated computers as more than “mere substrata” for transmission, speech recognition had to be developed to handle larger vocabularies for more varied tasks and complex instructions. Researchers quickly realized that the approaches and techniques that had proven effective for limited-vocabulary systems could not

119 Ibid.

be scaled to accommodate larger and more complex vocabulary sets, let alone

“natural” or “spontaneous” speech without deliberate pauses between words. For instance, a system using a method of template-matching, which used stored reference templates for individual words to identify utterances, would need not only reference patterns for every single word, but for all the contextual variations in pronunciations that could result from interactions with surrounding words.120

While the demands of computing applications shifted focus towards large-vocabulary, “natural” speech capabilities, the fundamental concept remained the same, grounded in knowledge systems. In 1971, the US launched its first major government-sponsored speech recognition project through the Department of

Defense’s Advanced Research Projects Agency (ARPA, later renamed DARPA, the Defense Advanced Research Projects Agency). Dubbed the Speech

Understanding Research program (SUR), the project was funded under the

Information Processing Techniques Office (IPTO), then headed by Lawrence

Roberts. The IPTO, which had originally formed under the direction of J.C.R.

Licklider, had heavily funded artificial intelligence projects beginning in the

120 For instance, the pronunciation of the word “and” is truncated due to a phenomenon known as “coarticulation” when spoken in a phrase such as “ham and cheese.” Thus as Raj Reddy explains: “To analyze and describe a component part, i.e., a word within a sentence, one needs a description of what to expect when that word is spoken. Again, the reference pattern idea of word recognition becomes unsatisfactory. As the number of words in the vocabulary and the number of different contextual variations per word get large, the storage required to store all the reference patterns becomes enormous. For a 200-word vocabulary, such as the one used by Itakura, a CSR system might need 2000 reference patterns requiring about 8-million bits of memory, not to mention the time and labor associated with speaking them into the machine.” See Reddy, “Speech Recognition by Machine,” 509.

1960s, and played a significant role in the consolidation and expansion of AI research into a high-profile discipline.121 The Speech Understanding Research program was considered “DARPA’s first major AI effort.”122 Roberts was drawn to speech recognition as a project that was well-suited to the IPTO’s role of providing funding for research and development to enable universities to train researchers in AI while demonstrating to corporations the basic research and applied development needed to inspire industry funding. Roberts saw speech understanding as a project that could demonstrate a potential application for AI, where research “could actually make some headway, as opposed to the totally nonproductive efforts to date in speech recognition,” but also where “even if it wasn’t successful it would advance the field.”123

121 Robert S. Engelmore, “AI Development: DARPA and ONR Viewpoints,” in Expert Systems and Artificial Intelligence: Applications and Management, ed. Thomas C. Bartee (H.W. Sams, 1988), 215. According to Engelmore, the program manager for AI research at DARPA from 1979-1981 and the editor-in-chief at AI Magazine through the 1980s, the IPTO was the source of “Essentially all of DARPA’s funding of AI research during the past 24 years [from 1962-1986]. Moreover, all the major academic centers of research in AI have been, and still are, supported under programs managed by that office. Indeed, most of those centers were created by DARPA funding.”

122 Ibid., 216. [Engelmore, “DARPA and ONR Viewpoints,” 216.] The IPTO began funding major research centers that were led by prominent individuals working in AI, such as John McCarthy at Stanford and Marvin Minsky at MIT, as early as 1963. See National Research Council, Funding a Revolution, 205. Cf. Roberts, “Expanding AI Research and Founding ARPANET,” 229. According to Roberts, who became director of the IPTO in 1968, “there was no specific AI research mentioned in any of the early contracts . . . AI in itself wasn’t part of the budget until 1968 or 1969.” Licklider similarly recalls that in the early years “the high-level administrators such as Ruina or Fubini did not have even the vaguest knowledge of AI and so my charter was essentially ‘Computers for Command and Control.’” See also and Licklider, “The Early Years: Founding IPTO,” 220.

123 Roberts, “Expanding AI Research and Founding ARPANET,” 230. N.B.: Roberts was additionally motivated by a desire to prove J.R. Pierce wrong: “[Pierce said we] couldn’t do it . . . Pierce certainly was an annoyance—that somebody would be saying that you couldn’t do it at all! Clearly, you have to do it” (235).

Actual planning of the project was carried out by Cordell Green, then serving as the AI program manager at IPTO, who initiated the project by organizing a feasibility study led by a team of researchers from across AI and speech fields, including Allen Newell, Raj Reddy, William Woods, Dennis Klatt, and Licklider himself. Though Roberts had declared he wanted a system that could recognize 10,000 words spoken by multiple speakers, the benchmarks outlined by the feasibility report were for 1,000 words, spoken in a quiet room by a limited number of speakers, with the vocabulary restricted to a given subject area.

The project itself was funded by $15 million distributed over five years between teams at Carnegie Mellon University, MIT’s Lincoln Laboratory,

Stanford Research Institute (SRI), System Development Corporation (SDC), and Bolt, Beranek, and Newman (BBN), a Boston-based consultancy founded by former directors of MIT’s Acoustic Laboratory. The teams were tasked with the production of “intermediate” systems, with the understanding that a second round of funding would be provided to further the development of successful efforts in a follow-up project. Given the SUR program’s roots in Artificial Intelligence, a majority of the systems developed within the project pursued a

“knowledge-based” or “expert” system approach.124 Envisioned by IPTO director

Lawrence Roberts as a program that “even if it wasn’t successful . . . would

124 It is worth noting that the project was explicitly pursuing speech understanding rather than recognition and was concerned more with the systems’ ability to correctly interpret a given task or instruction than with accurate transcription.

Figure 3 The conceptual structure of the HWIM (“Hear What I Mean”) system built at Bolt, Beranek, and Newman as part of the SUR program. Source: Bruce, Bertram C. 1982. “HWIM: A Computer Model of Language Comprehension and Production.” Champaign, IL: Center for the Study of Reading.

advance the field [of AI],” an explicit part of the research mandate was the modeling of human language production and perception, resulting in the development of systems that made word matches through complex layers of linguistic rules drawn from everything from phonetics and grammar to semantic and pragmatic expertise.125

Thus, though the knowledge-based AI systems developed as part of the

SUR program varied in their operational structure and assigned tasks, they

“shared the same overall philosophy, differing mostly in how they composed and triggered rules for forming hypotheses at higher and higher linguistic

125 Roberts, “Expanding AI Research and Founding ARPANET,” 230.

levels.”126 This philosophy was developed out of earlier rule-based approaches, which assumed that machine recognition of speech, particularly of “natural” speech, necessarily required the incorporation of knowledge regarding speech production and perception, which would be represented in computational form as complex rule-structures.

The SUR program, however, was not renewed. Controversy surrounded the design of evaluation methods used to test system performance, which some believed inflated the performance results. Since “[f]ull details regarding the testing of the system performance had not been worked out at the outset of the

SUR program . . . some researchers—including DARPA research managers—believed that the SUR program had failed to meet its objectives.”127 One major issue, according to Robert Mercer from IBM’s Continuous Speech Recognition group, was that test sentences were generated using a highly restrictive artificial grammar that dramatically constrained the number of recognition options:

Carnegie Mellon had a database retrieval language, which had a thousand-word vocabulary. It was a fantastic technical achievement. But they had such restrictive choices that you could make when you went from one [word] to another that the language was really very easy in terms of how

126 Roberto Pieraccini, The Voice in the Machine: Building Computers That Understand Speech (Cambridge, MA: The MIT Press, 2012), 94.

127 National Research Council, Funding a Revolution, 206. In a more specific account, Robert Mercer from IBM’s Continuous Speech Recognition group recalled that the artificial grammar used to generate test sentences for the Harpy system at Carnegie Mellon was so restrictive that the possible sentences the system could be given to interpret were far more limited than the 1,000 word vocabulary would suggest.

much you had to hear in order to know what the sentence was. So, for example, they had all the digits in their language [vocabulary] because it would be ridiculous if you had a language where you could say ‘one, two, three,’ but you couldn't say, ‘four and five, six and seven’ . . . But the only place you could use the digit "six" was in the phrase, ‘the six, seven, eight, nine game.’ And that was also the only place you could use the digit seven or eight or nine.128

However, even with the potentially inflated results, the program “didn’t really prove itself by coming up with a conspicuously good demonstrable device.”129

According to Licklider, who returned as director of the IPTO in 1973, roughly halfway through the five-year SUR program, the SUR program met its intended

“main objectives” in stimulating the field, but “there was a lot of feeling . . . it wasn’t really time yet to drive for a workable system.”130 What made the SUR system

“unworkable,” however, was not consistent across the board. While the three AI expert systems—the Hearsay II, developed at Carnegie Mellon, BBN’s HWIM, and an unnamed collaborative SRI/SDC effort—struggled with recognition accuracy, with the SDC system managing only 24 percent, the Harpy system, also developed at Carnegie Mellon, boasted 95 percent semantic accuracy on its document retrieval task.131 The Harpy was, however, the only one of the SUR projects that did not use an AI approach, instead using template-based pattern

128 Mercer, interview with the author, 2015.

129 Licklider, “The Early Years: Founding IPTO,” 226.

130 Ibid.

131 The Hearsay II, developed at Carnegie Mellon, achieved only 74 percent, BBN’s HWIM followed at 44 percent, and the SDC effort managed only 24 percent accuracy. See: Pieraccini, The Voice in the Machine, 92.

matching with probability models. Its relative success, in fact, further damaged the SUR program, given the program’s broader AI research mandate. As Licklider recalled, “[o]ne thing that disappointed administrators was that the system that worked best was the simplest one,” as it further underscored the lack of demonstrable success in the AI-centric efforts.132 The demise of the Speech

Understanding Program was thus not the result of a failure to produce a

“workable” speech recognition system, but of a failure to produce one premised on AI and linguistic knowledge.

Airplanes Don’t Flap Their Wings

While the SUR projects were struggling to develop efficient ways to represent and integrate increasingly sophisticated layers of linguistic and contextual knowledge in the effort to mimic human speech perception and understanding, a team at IBM, established just three months after the DARPA

SUR program launched, headed in the opposite direction. Using concepts from information theory, they tackled the problem of large-vocabulary speech recognition using statistical models derived automatically from patterns in speech data rather than linguistics, seeking to dispose of human expertise altogether.

The statistical approach pursued by the Continuous Speech Recognition

(CSR) group at IBM was considered deeply unorthodox and developed in relative

132 Licklider, “The Early Years: Founding IPTO,” 226.

isolation from the broader field.133 Though it has since been widely adopted as the dominant paradigm in speech recognition research and is consistently cited as the field’s most significant turning point, this adoption was slow. Statistical methods began gaining momentum only towards the end of the 1980s, in large part because their application required such a drastic rejection of the grounding assumptions of the field, though the complexity of the math was also a major related factor.134 The work was, in fact, so controversial that some researchers at the time regarded it as not even belonging within the broader domain of human language technology research. As an anonymous peer review of one of the group’s papers for The Journal of Computational Linguistics explained in 1988,

“the crude force of computers is not science.”135 The approach, which was characterized by the substitution of mathematical models in place of linguistic concepts, was premised on a rejection of “the human analogy . . . [as] misguided

133 Lalit Bahl, interview with the author, May 5, 2015. Bahl, one of the inaugural researchers in the IBM CSR group, recalled that the group worked “with no exchanges [of ideas] at any time” with other research institutions pursuing speech recognition. This seems somewhat notable given the prominence of large-scale, multi-institution efforts during this period, such as ARPA’s Speech Understanding Project. While some of this seemed to be by design (IBM declined the invitation to participate in the Speech Understanding Project), researchers within the group also attributed this isolation to what they perceived to be a lack of interest in, or rejection of, their at-the-time-unorthodox approaches, characterizing themselves as “sort of pariahs.”

134 The historical and technical details of the spread of mathematical expertise related to statistical speech recognition are discussed in chapter three.

135 This has been cited in a few different places, including the program for Jelinek’s memorial service at Johns Hopkins University on November 5, 2010.

because machines do things differently than biological beings.”136 As Alfred

Spector, the Vice President of research at Google,137 recalled in The New York

Times obituary for CSR group director Frederick Jelinek in 2010, this “underlying insight was that you don’t have to do it like humans . . . was almost a 180-degree turn in the established approaches to speech recognition, and it led to most of the success in the field in the last two decades.”138 The statistical approach, in other words, marked a departure from not only how speech recognition had been done, but how it had been thought.

The formation of the Continuous Speech Recognition (CSR) group in

1972 as an independent research group, separate from existing speech research at

IBM,139 served as further evidence of the view that large-vocabulary speech recognition was not merely a quantitatively but a qualitatively distinct problem, one based more in data processing than in language and phonetics. At the time, IBM’s

Speech Processing Technology Department, based out of its Raleigh research facility, had been successfully developing speech and voice recognition technologies for the type of standard telecommunications and “man-machine

136 Steve Lohr, “Frederick Jelinek, Pioneer in Speech Recognition, Dies at 77,” The New York Times, September 24, 2010, sec. Business, http://www.nytimes.com/2010/09/24/ business/24jelinek.html.

137 Spector retired from Google in 2015.

138 Spector quoted in Lohr, “Frederick Jelinek.”

139 The idea of a speech recognition group and initial discussions began earlier, but according to Lalit Bahl, the group’s first member, the CSR group was formally established on January 2nd, 1972, when he was officially moved from his prior position.

communication” tasks in the field, such as automatic call routing and information retrieval over the telephone.140 The department included a number of linguistic experts, including its director William D. Chapman, who had a dual background in electrical engineering and linguistics. Rather than building on the existing

Raleigh speech research laboratory that had made demonstrable progress in the field, the Continuous Speech Recognition group was formed as a wholly separate research team at the Yorktown, New York facility, signaling an institutional reclassification of speech recognition as a pursuit no longer within the purview of speech processing research.

This distinction was further reinforced by the composition of the CSR group, which was initially populated solely by engineers and mathematicians with a self-professed ignorance of, and indifference towards, speech and language.

This was due, in large part, to the influence of John Cocke, an eccentric and hugely influential figure at IBM who was known for wandering the halls and circulating ideas. Though never technically a member, Cocke was widely considered to be the “patron” of the CSR group.141 Cocke was interested in speech recognition primarily in terms of information theory and statistical modeling

140 For instance, the summer prior to the formation of the CSR group, the Raleigh group had announced an automatic call verification and switching system designed to route terminal-testing calls using voice commands over the telephone. See IBM Corporation Field Engineer Division, “IBM Service Specialists ‘Talk’ to a Computer,” Press Release, (August 5, 1971), IBM Corporate Archives.

141 Robert Mercer, interview with the author, May 5, 2015.

rather than language, an orientation that he imprinted onto the CSR group through its membership. The first member recruited into the group was Lalit Bahl, who transferred at Cocke’s urging from his previous research group, which focused on network configuration and data compression and transmission, in January of

1972. Bahl was joined by Frederick Jelinek, an information theorist by training, who left Cornell to become the group’s director that spring. Cocke then recruited

Robert Mercer, a recent Ph.D. graduate in mathematics, whose original research group working on compilers had dissolved by the time he arrived at IBM that fall.

The CSR group initially made efforts to incorporate linguistics, through speech-related reading assignments and guest speakers. Some of the engineers, including Bahl, were assigned to take linguistics courses at the

City University of New York. Its engineers soon dismissed these efforts as “a waste of time,”142 and eventually a group of five or six linguists from the Raleigh group were transferred to join the CSR team. Despite this addition, however, there was little direct collaboration between the engineers and the linguists. Bahl admitted that the engineers had “no idea what [the linguists] were doing”143 and saw their interactions with the linguists as unfruitful. The linguists were similarly resistant to statistical methods, which Jelinek and the other engineers leading the research saw as the CSR group’s defining project, and eventually left the group as

142 Bahl, interview with the author, 2015.

143 Ibid.

a result.144 In the absence of their linguistic expertise regarding the acoustic characteristics of speech, the group relied on statistical estimates based on sample data to produce both their acoustic and language models. In doing so, they found that recognition accuracy significantly improved, leading to Jelinek’s infamous claim that the system got better every time he fired a linguist.145

Though accounts from within the field of speech recognition often focus on the pioneering roles of the CSR group’s researchers, this pursuit of the way “machines do things,” as well as the particular methods of statistical modeling it generated, was conditioned by the specific industrial interests and resources of IBM in this period. It was also indicative of shifting priorities in the field as research expanded from telecommunications and defense projects to commercial computing. Similar to the institutions involved in (D)ARPA’s Speech

Understanding Project (in which IBM had declined an invitation to participate),

144 Ibid. According to Bahl, “[He and other engineering and computer science researchers] did interact with [the linguists], but nothing useful ever came out of it . . . no one got anywhere.” Eventually, “various people [interested in linguistics] left because they didn’t want to do it [the statistical] way . . . these guys didn’t want to do what Fred was asking them to do, and so they started looking around for other groups.”

145 Frederick Jelinek, “Some of My Best Friends Are Linguists” (4th International Conference On Language Resources and Evaluation, Lisbon, May 28, 2004). According to Jelinek, he was speaking in reference to the fact that, in an early experimental speech recognition system, performance jumped from a 35 percent accuracy when the acoustic model was based on statistical estimates made by the group’s linguistic experts to 75 percent accuracy when it was based on those generated automatically from training data. In the same presentation, Jelinek traces the provenance of his statement to his 1988 presentation, “Applying Information Theoretic Methods,” presented at the Workshop on Evaluation of NLP Systems. According to the overview report of the workshop, Jelinek’s paper was on syntactic parsing. See Martha Palmer, , and Sharon M. Walter, “Workshop on the Evaluation of Natural Language Processing Systems,” Final Technical Report (Griffiss Air Force Base, NY: Rome Air Development Center, December 1989).

IBM’s Continuous Speech Recognition group was focused on building a system that could handle a large vocabulary spoken continuously rather than the discrete command vocabularies designed for applications like automated channel switching or system control. Unlike the DARPA project, however, which envisioned both computational command and dialog systems, the CSR group’s sole interest was automatic dictation.146 Given IBM’s investment in business administration and data processing, speech recognition was seen as “a strategic technology for its business: typewriters were an important sector of its market, and a voice-activated one would be a ‘killer application’ for the corporation.”147 In contrast, the vocabularies and linguistic constraints employed in the DARPA SUR systems were designed specifically for data management and retrieval tasks and

“significantly (and deliberately) absent from the specifications [from the SUR project guidelines] were requirements that the demonstration tasks be relevant to real-world problems.”148

The mandate of speech understanding that guided the SUR projects, in fact, excluded dictation and automatic transcription tasks altogether. The emphasis on speech “understanding” rather than “recognition” in the SUR project determined, on a practical level, how the systems were assessed. The distinction

146 Bahl, interview with the author, 2015. According to Bahl, from the start, what they “had in mind was the building of a dictation machine.”

147 Pieraccini, The Voice in the Machine, 109.

148 Dennis H. Klatt, “Review of the ARPA Speech Understanding Project,” The Journal of the Acoustical Society of America 62, no. 6 (1977): 1345.

of speech understanding did “not so much indicate enhanced intellectual status, but emphasizes that the system is to perform some task making use of speech.

Thus, the errors that count are not errors in speech recognition, but errors in task accomplishment.”149 That is, according to the SUR project specifications, success was based on whether or not a system correctly responded to a spoken command, regardless of whether the words themselves were accurately recognized. The imagined task of speech recognition thus differed dramatically between the SUR project and IBM, despite the shared interest in large-vocabulary continuous speech. Whereas the SUR project, like much of earlier speech recognition research, imagined its use in command-and-control tasks for human-computer interaction, IBM’s goal for speech recognition was as an aid to media production.

Unlike the large-vocabulary systems developed as part of the SUR program, the efforts at IBM were not required to interpret what was being said, but merely to apprehend and transcribe that which was spoken, regardless of semantic content.

IBM’s particular market interests, in other words, effectively decoupled recognition from reasoning.

Equally important was the fact that such an application would also make use of excess computing cycles. Whereas efficient operation was a defining factor in research at both Bell Labs and the DARPA SUR program, IBM pursued speech

149 Allen Newell et al., “Speech-Understanding Systems: Final Report of a Study Group” (Information Processing Techniques Office of the Advanced Research Projects Agency, May 1971), http://repository.cmu.edu/compsci/1839.

recognition for its potential to be inefficient. Interest in the human auditory system at Bell Labs had been tied to a desire for channel economy and efficient transmission, while the SUR projects were constrained by efficiency requirements due to limited computing power.150 In contrast to both, IBM’s interest in large-vocabulary speech recognition arose amid growing concerns that the computing market was on the verge of losing momentum. The exponential growth in processing speed, it was believed, would soon outstrip the needs of existing computing tasks, potentially dampening demand for newer, faster machines. The

CSR group was formed upon the recommendation of a research “task force” that had been assigned to investigate solutions to this looming predicament; speech recognition, it was suggested, was an application that could be particularly

150 Jonathan Sterne has described the extensive efforts taken at Bell Labs to study human hearing to determine which portions of the sound spectrum could be removed to compress the signal for more efficient transmission, thereby producing what he terms “perceptual capital,” a form of surplus derived from leveraging human perception rather than production. Jonathan Sterne, MP3: The Meaning of a Format (Duke University Press, 2012). Similarly, among the initial specification guidelines for the SUR project was the requirement that the systems could take no longer than “a few times real time” on a dedicated system of 100 million machine instructions per second. The speed of the machine was dropped from the final guidelines, though the restriction remained implicit in the requirement that the systems be demonstrable by 1976. Newell et al., “Speech-Understanding Systems,” 1.2. See also Klatt, “Review of the ARPA Speech Understanding Project,” 1345.

demanding when it came to the use of computing cycles.151 Thus, in contrast to rule-based methods that were designed to be more economical in their use of computing power and transmission bandwidth, one of the IBM group’s initial benchmarks was to increase demands on processing power in an effort to stimulate the commercial market. In other words, IBM’s interest in speech recognition stemmed from efforts to investigate a problem of computation, related to processing power, rather than of speech.

What set IBM apart, however, was not only a matter of computing expertise and industrial goals. More than anything else, it was its access to computing resources. The demand for less efficiency was particularly well-suited to the use of what are known in computer science as “brute force” methods. In contrast to rule-based or “expert” systems, which work to narrow the field of possible choices by filtering data through a set of rules representing a priori knowledge,

“brute force” approaches, like those taken by IBM, systematically calculate and compare all possibilities to identify the optimal selection. Such systems were

151 Mercer, interview with the author, 2015. Mercer, who, along with Lalit Bahl, was the first member of IBM’s CSR group recalls that the group was “an outgrowth” of a task force assigned to find ways to use computing cycles: “people were thinking, my god, computing is getting so fast. What are we going to do with all these cycles? And speech recognition was one of the answers.” A similar account of the project’s origin is given by Frederick Jelinek, the group’s director, in his Association of Computational Linguistics Lifetime Achievement Award speech: "Believe it or not, IBM was worried that, with the advance of computing power, there might soon come a time when all the need for further improvements would disappear, and IBM business would dry up. Somebody came up with the suggestion that speech recognition would require lots of computing cycles.” See Frederick Jelinek, “ACL Lifetime Achievement Award: The Dawn of Statistical ASR and MT,” Computational Linguistics 35, no. 4 (2009): 484.

computationally demanding to run, but they were even more demanding to build and test.

In the absence of linguistic expertise, the approach developed by the IBM engineers derived models of speech and language by sifting through large quantities of data and calculated all possible matches in order to identify the most probable. The approach thus had two key requirements: large quantities of data, and the ability to process it. As Bob Mercer explained, “Our big advantage was we had way, way more computing than anybody. I used to think that we were real smart. But the fact is, we just had a lot of computing.”152 The developments themselves were inevitable, according to Peter Brown, who joined the IBM CSR group in 1984.153 As Brown explained, “if it wasn't IBM, it would have been somebody else who figured it out eventually. I mean the question is having the data and computer power.”154 Access to computing power was such a critical factor in development of IBM’s Tangora system that it was the major source of delay. According to Lalit Bahl, the team followed the same statistical principles

152 Mercer, interview with the author, 2015.

153 Though Brown joined the group later, he worked closely alongside founding members such as Mercer and Bahl and went on to lead the group’s efforts to adapt techniques from speech recognition to machine translation, which will be discussed in more detail in chapter 3. Incidentally, Brown is credited with naming IBM’s Deep Blue, the infamous computer that defeated chess world champion Garry Kasparov in 1997, though he is said to have turned down the chance to be the first to play against the computer. See Feng- Hsiung Hsu, Behind Deep Blue: Building the Computer That Defeated the World Chess Champion (Princeton, N.J.; Oxford: Princeton University Press, 2004), 127.

154 Peter Brown, interview with the author, May 5, 2015.

from the outset and had achieved successful speech recognition systems early on in their research.155 However, it was not until 1984, over a decade after the group’s initial formation, that they debuted the Tangora, the first real-time large-vocabulary dictation system. The main reason, according to Bahl, was processing speed, which was simply too slow to carry out a compelling demonstration.156 Bahl recalled a characteristic demonstration of one of the earlier experimental systems. A visitor to the laboratory, eager to see their progress, was given a sentence to read into the microphone. The researchers then presented him with a perfect transcript of his sentence—the following day.157 Access to computing power was, thus, the critical component in implementing a demonstrably successful system.

Noise in the Channel

Swapping out linguistic expertise for computing power, the IBM CSR group addressed speech recognition as a problem of mathematical rather than

155 “I don't think [there were any changes in thinking or technique in] the twenty-six years I was there [at IBM Research]. We were doing what we were doing on the first day.” See Bahl, interview with the author, 2015.

156 Ibid. Bahl recounts: “I think Tangora was the first time we sort of had anything that was close to real time in terms of being able to process things . . . I think we were pretty much successful all the time, but yeah, they were just very slow. So you couldn't demonstrate anything.” Even as late as 1981, the group was posting experimental results that required approximately ninety seconds of CPU time for each single second of speech. See Lalit R. Bahl et al., “Speech Recognition of a Natural Text Read as Isolated Words,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’81, vol. 6, 1981, 1169.

157 Bahl, interview with the author, 2015.

human communication. Drawing from information theory, and Claude Shannon’s work in particular, the group reformulated speech recognition as a “noisy channel” problem. As described in signal processing, the noisy channel problem assumes the existence of a message that is distorted as it passes through a communication channel that contains “noise,” or irrelevant variations in the signal that obscure the original message.158 The task at hand then becomes the process of decoding the distorted output to retrieve the original message from the added noise. Applied to speech recognition, the original message in question was conceived as words represented in the mind of the speaker prior to the production of speech (see figure 4). This original message, it is worth noting, was both

Figure 4 Block diagrams comparing the standard view of CSR (top) with the “noisy channel” model (bottom). Source: Bahl, Lalit R., F. Jelinek, and R. Mercer. 1983. “A Maximum Likelihood Approach to Continuous Speech Recognition.” IEEE Transactions on Pattern Analysis and Machine Intelligence 5 (2): 179–90. © 1983 IEEE.

symbolically and functionally represented by a text generator. In experimental systems, recognition tests were run using an automatic sentence generator (known as an “artificial grammar”) as a form of experimental control. A speaker would

158 It is worth noting that the concept of “noise” is central to Kittler’s account of mediality, as the necessary material consequence of transmission. See Friedrich A. Kittler, Discourse Networks, 1800/1900, trans. Michael Metteer and Chris Cullens (Stanford University Press, 1992).

then read the generated sentence into the machine, and the results would be compared to the original. Speech itself was therefore relegated to a component of the “acoustic channel” through which the intended message was transmitted.

In contrast to previous approaches, which sought to model speech as a set of sensory-motor and linguistic processes external to the recognition process, the noisy channel approach absorbed both speech and speaker into the computational model. As demonstrated in the diagram from the IBM group’s hugely influential article, the speaker was no longer the source of input, external to the recognition system. Rather, the speaker was paired with the acoustic processor as a source of noise interference in the acoustic channel, from which a prior input must then be retrieved.
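The decoding task this implies can be written compactly in the notation the CSR group themselves used, which also appears as an epigraph to the next chapter. The Bayes-rule expansion below is a standard gloss of that formulation rather than a quotation from their paper:

\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(W)\, P(A \mid W)}{P(A)}
        = \arg\max_{W} P(W)\, P(A \mid W)

Here A is the acoustic evidence emerging from the channel, P(W) is the language model giving the prior probability of a word string W (the “text” in the speaker’s mind), and P(A | W) is the acoustic-channel model, which folds the speaker, the room, the microphone, and the acoustic processor into a single conditional distribution; P(A) is constant across candidate word strings and can be dropped from the maximization.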

As Bahl, Jelinek, and Mercer explain, “The speaker and acoustic processor are combined into an acoustic channel, the speaker transforming the text into a speech waveform and the acoustic processor acting as a data transducer and compressor.”159 Human speech is thus reimagined as “the coder-modulator that evolution has bequeathed us,”160 transforming a prior message created in the brain of the speaker into a garbled signal. In other words, the role of speech as the

159 Lalit R. Bahl, F. Jelinek, and R. Mercer, “A Maximum Likelihood Approach to Continuous Speech Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-5, no. 2 (March 1983): 179.

160 Frederick Jelinek, Statistical Methods for Speech Recognition (MIT Press, 1997), 9.

raw, “natural” input to be transmitted and transformed was recast as a form of noise resulting from the transmission process.

The noisy channel approach effectively sutured the speaker and the acoustic processor, making speech part of the “noise” that distorts a message as it passes through a channel from sender to receiver. By extension, the system’s acoustic model—a statistical model which determined the likelihood that a given acoustic string is the product of a particular “text” in the speaker’s mind—was therefore required to include “the speaker’s interaction with the acoustic processor”161 in its calculations. Jelinek elaborates in his 1997 textbook on Statistical Methods for

Speech Recognition: “The total process we are modeling involves the way the speaker pronounces the words, the ambience (room noise, reverberation, etc.), the microphone placement and characteristics, and the acoustic processing performed by the front end.”162 In other words, the speaker and speech act were no longer considered an information source external to the system, but rather a component of the “total process” of noise interference that included ambient sound, the material-spatial properties of the recording apparatus and environment, and the transformation of physical quantities into electrical signals performed by an acoustic processor. The peculiarities of human speech production and perception

161 Ibid., 7.

162 Ibid.

were together reduced to another aspect of channel noise, becoming simply another layer of encoding.

Speech itself therefore ceased to be of primary concern: “From the point of view of your speech recognizer, all you’re interested in are the words; everything else is simply noise.”163 Human speech, rather than being the underlying, “natural” object of investigation to be perceived and simulated by the mechanical system, was reduced instead to a component of the data process, one that degrades rather than transmits information. The problem of speech recognition, in short, became one of recognition despite, rather than of, the distinct qualities of human speech. The noisy channel model sought to statistically predict the original message, in effect the original text, rather than to apprehend speech. Speech was reformulated as something categorically indistinguishable from not only other forms of acoustic interference (such as ambient sound or signal reverberation) but also the material conditions of recording and compression. Speech recognition was in this way untethered from human sensory-motor phenomena; it was a particular case in the general problem of pattern recognition. The turn towards statistical methods in speech recognition thus entailed more than simply the introduction of statistical techniques as a means to represent speech and linguistic knowledge. It required the radical reduction of speech to mere data, which could be modeled and

163 Pieraccini, The Voice in the Machine, 110.

interpreted in the absence of linguistic knowledge or understanding. Speech as such ceased to matter.

CHAPTER III

THE IDEA OF DATA

“Poincaré anticipated the frustration of an important group of would-be computer users when he said, ‘The question is not, “What is the answer?” The question is, “What is the question?”’ One of the main aims of man-computer symbiosis is to bring the computing machine effectively into the formulative parts of the technical problem.” J.C.R. Licklider, “Man-Computer Symbiosis” (1960)164

Ŵ = arg max_W P(W)P(A|W) IBM Continuous Speech Recognition Group, “A Maximum Likelihood Approach to Continuous Speech Recognition” (1983)165

In the previous chapter, I looked at the coordination of conceptual and institutional forces that prepared the terrain for a “statistical” turn in speech recognition research, contextualizing it within a broader shift in the status of computing and communication technologies. The feasibility and desirability of parsing language, and speech in particular, by computational means, became a pivotal point of contention that dramatized competing models of the relationship

164 Licklider, “Man-Computer Symbiosis,” 1960.

165 The components for this formulation appear in the CSR group’s publications as early as 1975. See Frederick Jelinek, Lalit Bahl, and Robert Mercer, “Design of a Linguistic Statistical Decoder for the Recognition of Continuous Speech,” IEEE Transactions on Information Theory 21, no. 3 (May 1975): 250–56. While the precise arrangement and notation of the equation varies slightly, the above is derived from the clearest summarization, from 1983. See Bahl, Jelinek, and Mercer, “A Maximum Likelihood Approach to Continuous Speech Recognition.”

between information processing and the production of knowledge. Statistical speech recognition, I suggested, reduced speech to a general problem of text data, while giving voice to a distinct way of knowing that was considered particular, indeed “natural,” to the machine. This chapter offers a technical complement, closing in on how this shift was implemented, and in doing so seeks to bring into relief the epistemic contours of the so-called “statistical” mode of knowledge as it was aligned with digital computing. That is, if the statistical approach signaled the emergence of a “natural way for the machine,” this chapter offers a closer examination of what precisely that nature entailed.

The formal dimensions of digital technology are often discussed in general terms, with broad references to quantification, algorithmic and numerical manipulation, and “data-driven” or “statistical” modeling. The complexity and often deliberate opacity of many large-scale data processing and analytics systems

—those infamous black boxes that litter both popular and scholarly accounts— paired with the sheer variety of practices encompassed in the study of “digital” media understandably invite a certain terminological murkiness. Solon Barocas,

Sophie Hood, and Malte Ziewitz, for instance, have highlighted the preponderance of vague generalizations of “algorithms” in critical scholarship on the subject, suggesting: “A simple test would go like this: would the meaning of the text change if one substituted the word ‘algorithm’ with ‘computer’,

‘software’, ‘machine’, or even ‘god’?”166 Yet it is precisely the growing complexity and variety of computational practices that demands attentiveness to the technical and material specificities of different practices. Quantification, algorithmic manipulation, and statistical thinking are neither singular, monolithic practices nor static concepts, but rather dynamic compositions subject to discursive and material intervention.167 The internal transformations of speech recognition and its shifting paradigms of statistical calculation thus allow for a closer examination of how a particular set of statistical methods were coordinated with the material constraints of specific information technologies, highlighting how a distinct form of “digital” knowledge was both thought and built.

This measure of specificity is particularly pressing since, as many historians of science and technology have noted, these histories are hardly composed of a tidy succession of technoscientific regimes, each replacing its predecessor in turn. Rather, as Pickstone points out, “new ways of knowing are created, but they rarely disappear . . . it is a matter of complex cumulation and of simultaneous variety, contested over time, not the least when new forms of

166 Solon Barocas, Sophie Hood, and Malte Ziewitz, “Governing Algorithms: A Provocation Piece,” SSRN Scholarly Paper (Rochester, NY: Social Science Research Network, March 29, 2013), 3.

167 As Gigerenzer et al. point out, “Perhaps more than any other part of mathematics, probability theory has had a relationship of intimacy bordering on identity with its applications . . . This means that probability theory was as much modified by its conquests as the disciplines it invaded.” See Gerd Gigerenzer et al., The Empire of Chance: How Probability Changed Science and Everyday Life, Reprint (Cambridge University Press, 1989), xiii-xiv.

knowledge partially displace old forms.”168 Daston and Galison similarly suggest that it is “a history of dynamic fields, in which newly introduced bodies reconfigure and reshape those already present, and vice versa” such that we might

“imagine new stars winking into existence, not replacing old ones but changing the geography of the heavens.”169 Greater attentiveness to the material and technical specificity of these newly introduced bodies not only takes stock of the reshaped terrain as a whole, but allows us to identify the shifting centers of gravity that orient movement within it.

This chapter examines the changing form and status of statistical techniques across three historical approaches to speech recognition. Though the coordinated assembly of mathematical techniques and algorithmic implementations largely associated with a small group of engineers at IBM in the

1970s and 1980s received the designation of the “statistical approach,”170 statistics have been present since the earliest ASR systems. The engineers of the

168 John V. Pickstone, Ways of Knowing: A New History of Science, Technology, and Medicine (University of Chicago Press, 2001), 9.

169 Lorraine J. Daston and Peter Galison, Objectivity (Zone Books, 2010), 18-19.

170 It’s important to note that James Baker had independently developed a statistical approach to speech recognition using hidden Markov models while completing graduate work at Carnegie Mellon University. He built the Dragon system using HMMs as part of Carnegie Mellon’s participation in the DARPA Speech Understanding Research project. Both he and his wife, Janet Baker, who specialized in the signal processing aspects of speech recognition, were hired by IBM and joined the CSR group in 1974. The Bakers left IBM and worked briefly at Verbex beginning in 1979 before founding their own company, Dragon Systems, in 1982 and going on to develop Dragon Dictate and NaturallySpeaking, two of the most widely-used commercial dictation software programs today.

Bell Labs Audrey recognizer, widely regarded as the first operational speech recognition system, emphasized the “statistical nature of speech” as a primary design consideration, referring to acoustic variations in pronunciation.171 Similarly, by

1960, D.B. Fry and Peter Denes, two experimental phoneticians at University

College London, expanded speech recognition beyond the focus on acoustic characteristics by incorporating a language model, in the form of statistics on phoneme frequencies in English, into their decision engine.172 In explaining the need to expand the system to model aspects of language generally, and not only the acoustic dimension that was unique to speech, they characterized speech recognition as a two-part process that consisted of “the perception of incoming sounds . . . [and] the application of statistical knowledge.”173 What thus distinguished IBM’s approach as “statistical” was not, as one might intuit, the incorporation of statistics. Therefore, in order to

171 Davis, Biddulph, and Balashek, “Automatic Recognition of Spoken Digits,” 637.

172 Peter Denes, “Automatic Speech Recognition: Experiments with a Recogniser Using Linguistic Statistics,” Contract No. AF 61(514)-1176, Technical Note No. 4 (Air Force Cambridge Research Center: United States Air Force Air Research and Development Command, September 1960). The 92-page report of the research was likely not widely circulated at the time, though Denes published more general and theoretical descriptions of his approach, often in collaboration with D.B. Fry in both technical and popular journals. See also D. B. Fry and P. Denes, “Experiments in Mechanical Speech Recognition,” in Information Theory: Third London Symposium, ed. Colin Cherry (London: Butterworths Scientific Publications, 1956), 206–12; D. B. Fry and P. Denes, “The Solution of Some Fundamental Problems in Mechanical Speech Recognition,” Language and Speech 1, no. 1 (January 1, 1958): 35–58.; P. Denes, “The Design and Operation of the Mechanical Speech Recognizer at University College London,” Journal of the British Institution of Radio Engineers 19, no. 4 (April 1959): 219–29.

173 Fry and Denes, “The Solution of Some Fundamental Problems in Mechanical Speech Recognition,” 37.

understand what it meant for speech recognition to be “statistical,” we must consider more closely the specific techniques through which speech recognition became statistical, and thus just what type of statistical it became.

Following the initial comparison, the chapter then closes in on two defining features of the IBM Tangora recognizer: the hidden Markov model and the invention of a “fenonic” alphabet that organized speech units according to the demands of signal processing rather than linguistic meaning. The core transformation of the IBM approach, I argue, lay not in the introduction of statistical modeling, but in what statistics were thought to model. The information-theoretic framework favored by the IBM CSR group led to the popularization of statistical techniques that prioritized automation and predictive power over knowledge representation and interpretability. Instead of seeking to formalize the underlying human processes as a set of statistical functions, engineers came to imagine that they could replicate and predict outcomes without an account of the underlying structure or operation of the process generating them. The “statistical” approach to speech recognition reconfigured statistical modeling in accordance with the aims of data processing and transmission, recasting it as a radically data-centric procedure capable of producing information in the absence of explanation.

A Brief Technical Overview of ASR

Let us first take a step back to consider the general structure of a speech recognition system. All speech recognition systems have two fundamental components: the acoustic processor and the linguistic decoder. The acoustic processor (AP), or acoustic “front-end,” is a signal processor that samples, segments, and specifies an incoming speech signal. It essentially transcribes the acoustic signal into a symbol string by splitting up the continuous waveform into discrete units and matching each to a stored prototype “alphabet” of possible acoustic events, producing a string of reference symbols to be decoded by the recognizer. This is then followed by a linguistic decoder (LD) or decision engine, which identifies and matches the acoustic reference symbols with words in the vocabulary.

Speech input, usually taken with a microphone or telephone either directly or as a recording, is first transformed into a spectrographic waveform, which represents the energy measurements of the signal. The AP then digitizes this waveform by sampling it at regular time intervals to produce discrete “frames.”

Each sample frame is also passed through frequency filters that separate the wave sample into smaller frequency range bands. The measurement of the average magnitude is then taken for each frequency band, transforming the digital sample

frame into a numerical vector made up of the component measurements,174 which is sometimes referred to as a spectral time sample (STS). The end result of this

Figure 5 Representation of the acoustic processor’s signal quantization process. Top image depicts the spectral wave form. Middle image depicts the partitioning of the wave form, with vertical columns representing the sample frames and horizontal rows representing filter band frequencies. The bottom image depicts the quantized output, which consists of a series of 17 spectral time samples (STS), each of which is an 8-component vector. Source: Roberto Pieraccini, The Voice in the Machine: Building Computers That Understand Speech (Cambridge, MA: The MIT Press, 2012).

process is the transformation of a continuous spectral waveform into a series of discrete, digitized samples, each of which is additionally quantized, or reduced to a vector composed of a set of average magnitude measurements at various frequency positions (see figure 5).

174 The number of component measurements in a vector depends on the number of filter bands used in a given acoustic processor. For instance, the processor used in the IBM Tangora system had 80 frequency bands, and thus produced an 80-component vector (prior to further processing steps). In contrast, the Bell Labs Audrey only split the signal into two bands, resulting in a 2-component vector.
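
The sequence of operations just described (framing, frequency filtering, and per-band averaging) can be sketched schematically as follows. This is a minimal illustration in Python with NumPy rather than the code of any historical system; the ten-millisecond frame, the eight bands, and the FFT-based filter bank are placeholder choices standing in for the analog filter hardware of the period.

    import numpy as np

    def spectral_time_samples(signal, rate, frame_ms=10, n_bands=8):
        """Slice a digitized signal into fixed-length frames and summarize each
        frame as a vector of per-band average magnitudes (one vector per frame)."""
        frame_len = int(rate * frame_ms / 1000)
        vectors = []
        for start in range(0, len(signal) - frame_len + 1, frame_len):
            frame = np.asarray(signal[start:start + frame_len])
            spectrum = np.abs(np.fft.rfft(frame))       # magnitude spectrum of the frame
            bands = np.array_split(spectrum, n_bands)   # crude equal-width frequency bands
            vectors.append(np.array([band.mean() for band in bands]))
        return vectors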

Once the quantized signal is produced, the AP carries out the first major stage of recognition: the segmentation of the quantized signal into individual segments that correspond to the basic unit being used for recognition.

For instance, for a recognizer that worked using whole words, the signal of an entire spoken sentence would have to be segmented at the spots that indicated pauses between words so that each word could be identified one by one.

Similarly, a recognizer using phonetic elements as its acoustic units would have to break up even a single spoken word into segments associated with distinct phonetic events. Each segment may then be classified and labeled as its closest match from a finite set of acoustic possibilities, since contextual variations in the signal caused by pronunciation and ambient noise will cause the same speech

“sound” to produce variations in energy measurements when repeated, even by the same person.175
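
The labeling decision described here amounts to a nearest-prototype search over a finite inventory, which can be sketched as follows; the prototype dictionary and the Euclidean distance are illustrative assumptions rather than the mechanism of any particular recognizer discussed in this chapter.

    import numpy as np

    def label_segments(segments, prototypes):
        """segments: list of feature vectors (NumPy arrays); prototypes: dict mapping
        an inventory label (e.g. a phone symbol) to its prototype vector.
        Returns the closest-matching label for each segment."""
        labels = []
        for segment in segments:
            # choose the inventory entry whose prototype lies nearest to this segment
            nearest = min(prototypes,
                          key=lambda name: np.linalg.norm(segment - prototypes[name]))
            labels.append(nearest)
        return labels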

The AP essentially performs the first half of a two-part transcription process, transforming an acoustic waveform into a series of labels, where each label is an identifier for some unit of sound within a finite acoustic inventory (and its associated numerical measurements). The output of the acoustic front-end is

175 For instance, a recognizer using phonetic units may limit its possibilities to 44 common phonemes found in English, and each acoustic segment would be classified as one of these 44 options, though the number of unique sets of acoustic measurements may be far larger than 44. The labeling step is not always necessary, depending on a number of factors, including the complexity of the system, the size of the vocabulary, and the overall approach. What is essential here is that each segment of the signal must be specified as an instance from a finite set of acoustic types in order for recognition to be performed.

a string of these acoustic labels based on a defined sound unit, such as phones in phonetic labeling, that is then passed to the linguistic decoder for the second phase of recognition. The linguistic decoder translates the acoustic labels into meaningful text176 or machine commands by matching them to available words in the recognizer vocabulary. The match decisions are typically carried out by referencing both an acoustic model, which is a model of how the spoken pronunciation of any given text is expressed acoustically, and a language model, which formally defines some chosen aspects of the recognizer language (such as word frequencies, syntactical structure, or semantic domain) in order to disambiguate between possible choices (for instance, in a phonetic recognizer, whether a phonetic label [tu] should be matched with “to,” “too,” “two,” or maybe even the “tu-” in “tuba”).177
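
The division of labor between the two models can be illustrated with a toy example. The lexicon, the word probabilities, and the deliberately flat acoustic score below are invented for the illustration; they are not drawn from any of the systems discussed in this chapter.

    def decode_label(label, lexicon, acoustic_score, language_prob):
        """Pick the vocabulary word that best explains a single acoustic label by
        combining an acoustic-model score with a language-model probability."""
        candidates = lexicon.get(label, [])
        return max(candidates,
                   key=lambda word: acoustic_score(label, word) * language_prob[word])

    # toy disambiguation of the label [tu] among "to," "too," and "two"
    lexicon = {"tu": ["to", "too", "two"]}
    language_prob = {"to": 0.7, "too": 0.2, "two": 0.1}    # e.g. relative word frequencies
    flat_acoustic = lambda label, word: 1.0                # placeholder acoustic model
    print(decode_label("tu", lexicon, flat_acoustic, language_prob))   # prints "to"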

The popular, though far from standardized, nomenclature for the major paradigms in speech recognition generally refers to the form of representational architecture used in the pronunciation and language models. There are three general approaches that have been used historically for representing and matching acoustic segments to linguistic units: template-matching, rule-based, and

176 Meaningful, here, is of course relative to the vocabulary. If the recognizer is programmed with a vocabulary containing only nonsense “words,” then the output will be meaningful text as far as the recognizer is concerned but not necessarily intelligible to a human reader.

177 As I will discuss in greater depth later in this chapter, the earliest speech recognizers relied mainly on acoustic modeling alone. Language modeling was introduced towards the end of the 1950s.

statistical.178 These approaches are defined according to the ways in which pronunciation and language are formally represented within the system and their corresponding recognition procedure. Template-matching identifies acoustic segments by comparing their measurements to a set of stored, prototypical reference templates to find the closest match. Rule-based methods represent speech and language through a set of rules identifying their acoustic, grammatical, and/or semantic features. The statistical approach represents all possible matches between acoustic segments and linguistic units as estimated statistical parameters that express the likelihood of the match. The statistical parameter values in these models are estimated from examples of speech and text data, known as “training” data. Though these recognition approaches present seemingly clear-cut differences in terms of representational format, their conceptual underpinnings are considerably more muddled and most implementations involve some form of statistical analysis.

178 Template-matching is also commonly referred to as “pattern-matching.” However, I have chosen to avoid using that term here in order to avoid confusion with techniques in the statistical approach that are referred to as “pattern recognition” procedures. Artificial intelligence systems, which in speech recognition commonly refer to the symbolic AI “expert systems” pursued as part of the DARPA SUR project in the 1970s, can be considered a special instance of the general category of rule-based systems. Also absent are artificial neural networks (ANN) and deep learning approaches, since they remain outside the historical period under discussion. Neural networks were briefly explored for speech recognition in the 1950s, but the approach was soon abandoned for a number of reasons, including the practical lack of computational processing power leading to a failure to produce results. The (re)discovery of key algorithmic techniques led to a minor revival in the 1990s, though ANN truly gained significant momentum and funding in the 2010s. Moreover, the principles underlying neural networks build upon the same conceptual foundation of probabilistic pattern recognition as the statistical approach, and it is not uncommon for systems to integrate components of both.

The Statistical Nature of Speech

The Bell Labs “Audrey” (Automatic Digit Recognizer), like many of the earliest experimental speech recognition systems, did not include a general language model or linguistic decoding component. “Recognition” only involved identifying which phonetic speech sound corresponded to the pattern produced by the spectral measurements in a given segment of the recorded acoustic signal, without having to determine which word the sound belonged to, thanks to the strictly limited vocabulary of the recognizer. As mentioned in the previous chapter, development of Audrey was linked to work on speech specification, an area of research that was concerned with determining which acoustic features in the speech signal were essential to differentiating speech sounds, and by extension which non-essential acoustic information could be filtered out from the signal for more efficient transmission. The primary concern, in other words, was to detect and classify linguistically-meaningful sounds rather than to transcribe conventionally-spelled text, and the vocabulary of possible sounds was strictly constrained to limit ambiguities. Despite the relatively straightforward operation and limited vocabulary, however, statistical analysis was a core component of Audrey.

In a publication detailing the set-up and operation of Audrey following its first public demonstration in 1952,179 Bell Laboratories engineers K.H. Davis, R.

Biddulph, and S. Balashek introduced the design of their system by justifying the application of statistical techniques. “The variability encountered in repeated speakings of a digit, even when uttered by the same individual, is common knowledge,” they explained, concluding that “since the design of any successful recognition circuit demands a quantitative knowledge of an inherently variable speech signal, any useful description of this signal must be expressed in statistical terms.”180 Two full decades before the CSR team at IBM began work towards their “statistical” approach, the inclusion of statistics was already portrayed as decidedly common, if not compulsory. Moreover, Audrey was an emblematic example of the template-matching approach in which recognition decisions were made by comparing speech input to stored reference templates representing

“prototypical” examples of each word in the vocabulary. These reference templates were based on the statistical analysis of empirical speech data, and thus bore closer resemblance to the “data-driven” training methods later used by IBM in the Tangora system than to the AI expert systems that identified speech by applying

179 It is unclear when work began on Audrey, though the first public demonstration of the recognizer took place at the Conference on Speech Analysis at the Massachusetts Institute of Technology in June of 1952. There is little detail regarding this initial demonstration, outside of a brief mention in Dudley and Balashek’s article describing a second model of the AUDREY. See Dudley and Balashek, “Automatic Recognition of Phonetic Patterns in Speech,” 721.

180 Davis, Biddulph, and Balashek, “Automatic Recognition of Spoken Digits,” 637.

predetermined rules selected by linguists. Audrey and the Tangora were, in this sense, sibling branches rooted in the same ruthless pragmatism, engineering solutions that disdained a general problem of scientific understanding in favor of the specific demands of system function.

The crucial distinction, however, was that the statistics of Audrey, far from embodying what Jelinek described as “the natural way for the machine,” remained rooted in the simulation of the human auditory system in two fundamental ways. First, the statistical operations were attributed not to the machine, but to the human body. And second, though the measurement values were derived from empirical data, the metrics themselves were selected based on phonetic principles and tests of human hearing. In other words, while the templates were calculated from experimental observations, what was observed was determined based on established knowledge of the human auditory system.

Though Audrey is now considered the first operational speech recognizer and the inaugural effort in establishing speech recognition as a formal area of research, the term “speech recognizer” was not used by its designers. Rather, it was referred to as a “phonetic pattern recognizer,” and considered part of the ongoing project of “detecting phonetic patterns, and portraying them for visual recognition or . . . using them for switching or mechanical control.”181 This emphasis on

181 Dudley and Balashek, “Automatic Recognition of Phonetic Patterns in Speech,” 721. Dudley and Balashek explicitly place Audrey within the genealogy of speech specification, even referencing Flowers’ 1916 research regarding speech’s “true nature” in building his phonoscribe system.

phonetic patterns designated the Audrey as an extension of ongoing work in speech specification at Bell Laboratories and the efforts to isolate distinct,

“information bearing elements” that made it possible for humans to understand speech despite pronunciation, noise, and other forms of acoustic variation. The primary instrument of acoustic measurement and analysis used in this research was the speech spectrograph, which (as discussed briefly in the previous chapter) was initially developed for the purposes of transforming sound into visual patterns that could be interpreted by humans, not by machines. It was designed primarily as a form of “visual hearing” and originally intended as an aid for deaf education and telephone transmission applications, though its development was quickly enrolled into military research and the first working model was built as part of cryptanalysis efforts during World War II.182

The spectrograph differed from other acoustic instruments in that it was explicitly modeled on the inner ear, separating out and measuring the relative intensity of frequency components of the sound wave, such that “[r]ather than represent an acoustic waveform, spectrograms . . . depicted its perception.”183 The advantage of the spectrograph was thus not a more detailed or exact measurement of the acoustic signal, but precisely the opposite. In comparing the spectrograph to the oscillograph, Pierce in fact criticized the latter’s precision as a barrier to

182 Mara Mills, “Deaf Jam From Inscription to Reproduction to Information,” Social Text 28, no. 102 (March 20, 2010): 37.

183 Ibid., 38.

accurate representation, arguing that “[t]he oscillogram as a picture of the spoken voice is worse than complicated; it is misleading as a hair-splitting lawyer who makes fine distinctions, distinctions which according to the final tribunal, the ear, just aren’t valid.”184 That is, oscillograms portrayed features that were only technically present in the acoustics, rather than those that were functionally relevant to speech. The use of statistics as a means to account for variability in the acoustic signal was therefore not conceived as the native mode of machine epistemics. On the contrary, statistics provided a corrective to the machinic propensity towards exactness: “It’s easy to fool the ear. The ear can accept a rattling cardboard cone in a loudspeaker as a whole symphony orchestra. Making a mechanism as naive was harder to accomplish than making a really precise device!”185

For the engineers of Audrey, statistical representations were thus a necessary concession to what Pierce referred to as the “final tribunal” of human perception, a means to dismiss those characteristics of the signal that were technically discernible through acoustic instruments, but perceptually irrelevant for discerning speech. In other words, speech specification, and by extension speech recognition, required a means to characterize speech sounds explicitly but not exactly, placing mechanical precision at odds with informatic utility. To this

184 Pierce, “Portrait of the Voice,” 113.

185 Ibid., 101.

end, the “statistical terms” utilized in Audrey were compression and classification techniques, ways to combine and summarize data by discarding the unique traits of individual observations. Statistics, in essence, were enlisted as a means to mimic the ear and mathematically reconcile the rattling loudspeaker and the symphony orchestra.

The basic operation of Audrey began with a predesignated speaker speaking digits with pauses of at least 350 milliseconds between each word to indicate the start and finish of a single utterance. Spectral energy measurements would be taken for each utterance and used to generate an acoustic pattern

“template.” This unknown acoustic template would then be compared to a store of reference templates, which contained the “prototypical” acoustic patterns for each of the ten words in Audrey’s vocabulary (the digits 1-9 and the word “oh” for zero), in order to identify the closest match.186 Once determined, an indicator light corresponding to the appropriate digit would be triggered.187 These acoustic templates—both those generated for the unknown speech utterance and for the store of reference prototypes—constituted what Davis et al. referred to as the

186 This process also included a display step, where the graphical representation was displayed on an oscilloscope monitor. This display step was not critical to the recognition system, but likely included for verification and experimentation purposes. The images produced are intended for the researchers as “a visual presentation of the basic data” and were “obtained at a lower cutoff syllabic rate filter than is used in actual recognition work.” Davis, Biddulph, and Balashek, “Automatic Recognition of Spoken Digits,” 640.

187 Though the Audrey was only ever built to activate lights to demonstrate recognition, it was noted that the same relays used to switch on a lamp “could punch tape, operate a typewriter, or give other indication.” Dudley and Balashek, “Automatic Recognition of Phonetic Patterns in Speech,” 723.

“useful description of this [speech] signal . . . expressed in statistical terms.”188

Even without being attached to a control mechanism that completed the

“recognition” component, the acoustic templates were distinct from the visual records produced by earlier, incomplete attempts at speech recognition. Systems like Barlow’s logograph and Flowers’ Phonoscribe, for example, materialized speech directly as graphical inscriptions of fluctuations in acoustic energy and air pressure (see figures 6 and 7).189 The patterns depicted as acoustic templates in

Audrey, on the other hand, were informatic renderings based on statistical abstraction, rather than direct visual, or even numerical, representations of the acoustic signal.

Figure 6 Illustration of the operation of the logograph with a detail rendering of the print output below. Source: Frederick Bramwell, “Telephone,” in The Practical Applications of Electricity: A Series of Lectures Delivered at the Institution of Civil Engineers, Session 1882-83 (London: The Institution of Civil Engineers, 1884), 29.

188 Davis, Biddulph, and Balashek, “Automatic Recognition of Spoken Digits,” 637.

189 For more detail on both the logograph and phonoscribe, see the previous chapter. For a more comprehensive account of graphical instruments for measuring and materializing speech at the turn of the century, see Robert Brain, The Pulse of Modernism: Physiological Aesthetics in Fin-de-Siècle Europe (Seattle: University of Washington Press, 2015), chapter 3 in particular.

Figure 7 Three images depicting the design and output of John B. Flowers’ Phonoscribe. Top: Illustration of the proposed design of the device. Bottom right: Illustration detail of the printing mechanism of the device in the top image. Bottom left: Sample records from the device prototype for the spoken letter “U.” The top record is based on speech from a man’s voice, the middle from a woman’s voice, and the lower record the curve from a spoken whisper. Source: Lloyd Darling, “The Marvelous Voice Typewriter: Talk to It and It Writes,” Popular Science Monthly, July 1916, 66-68.

The acoustic templates used in Audrey were not the spectrographic images themselves, nor even the “raw” measurements taken from the spectrographic readings, but rather the product of multiple mathematical transformations.

Figure 8 Block schematic of Audrey. Source: K. H. Davis, R. Biddulph, and S. Balashek. “Automatic Recognition,” 638.

Acoustic processing in Audrey began by first converting speech into an electrical signal using a microphone or telephone transmitter.190 The signal is then filtered into two frequency bands—a high-frequency band for signal energies above 900 Hz (cycles per second) and a low-frequency band for those below (see figure 8)—and segmented into discrete intervals by “sampling,” or taking the energy measurement in each frequency band once every ten milliseconds (or one centisecond), resulting in a pair of values representing each centisecond sample of the speech signal. Each pair is then plotted on a two-dimensional graph representing the acoustic feature space, with the values from the low-frequency band as the x-axis and values from the high-frequency band as the y-axis, such that every centisecond sample marks a

190 Although the original 1952 publication by Davis, Biddulph, and Balashek does not specify the initial sound transmitter apparatus, it does specify the recognition of “telephone-quality” speech. See K. H. Davis, R. Biddulph, and S. Balashek. “Automatic Recognition,” 637. However, a patent filed by Biddulph and Davis earlier that same year (and granted two years later) specifies only a microphone. Cf. Biddulph and Davis, “Voice-operated device,” US2685615 A, 3:46. Then again, by 1958, a second model of the recognizer is described as using “the transmitter of a Western Electric 500-type telephone set,” referencing Bell Labs records from R. D. Fracassi in 1953. See Dudley and Balashek, “Automatic Recognition of Phonetic Patterns in Speech,” 723.

single coordinate plot.191 The resulting plots produced a pattern representing the durational activity of the speech signal within the frequency space (see figure 9).

Figure 9 Photographs of simplified speech signal sample traces for Audrey. Samples taken at 10ms increments and graphed for values of formant 1 (x-axis) versus formant 2 (y-axis). Source: K.H. Davis, R. Biddulph, and S. Balashek, “Automatic Recognition of Spoken Digits,” Journal of the Acoustical Society of America 24, no. 6 (November 1952): 639.

These formant patterns, however, required further manipulation to become functional templates. Since no two utterances, even by the same person, will be an exact acoustic match to each other, the information in these patterns had to be made less precise. This was done through quantization, a general process of many-to-few mapping in which a larger set of (often continuous) values are

191 It should be noted that this feature space is two-dimensional because the AUDREY splits the signal into two feature sets, frequencies above 900Hz and frequencies below 900Hz, in order to correspond to the first two formants. The number of dimensions of the feature space will correspondingly increase as the number of distinct “features” extracted from the signal increases. So, for instance, the Tangora system, which I’ll discuss in more depth in a moment, splits the signal into 20 features, and each trace would be plotted within a 20-dimensional space.

reduced to a smaller, discrete set of values.192 In the case of Audrey, quantization was carried out by partitioning the two-dimensional feature space into thirty even 100x500-cycle-area sections. The number of centisecond sample traces in each partition is counted, further abstracting the already quantified pattern into a set of quantized numerical values, one for each 100x500 section of the acoustic feature space, that became the final acoustic template of the speech utterance (see figure 10).193 Both the templates of unknown speech input and the stored reference templates of known speech were generated using this process, though the latter were created with an additional step. Reference templates were not records of singular instances of speech, but aggregate prototypes that were generated using the templates from approximately one hundred repetitions of the same word by

Figure 10 Simplified representation of the quantization process using the digit 7 detail from figure 9. Left image is the enlarged digit 7 image of the acoustic feature space. Middle image depicts the even partitioning of the space into sixteen regions. Right image shows quantized values based on sample traces in each region (numbers not exact).

192 Rounding numbers to the nearest integer, for instance, is a very basic instance of quantization. Quantization is one of the fundamental techniques in any compression procedure.

193 Although Audrey split the feature graph into 30 frequency “elements” of 100 by 500 cycles in area, they actually only recorded the values for 28 of the 30 squares. Davis, Biddulph, and Balashek, “Automatic Recognition,” 640.

the same speaker and averaging the values across all the repetitions to serve as the final “prototypical” value in each partition.194 The central statistical intervention in the AUDREY was thus the use of sample means and quantization—techniques of information disposal.195 Statistics was enlisted to represent the variability of speech production and the capabilities of the ear to compensate accordingly. The

“naive” ability of the human ear to interpret speech was formally replicated as the ability to mathematically consolidate highly variable data into fixed, composite regularities.
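
The template arithmetic described above can be sketched roughly as follows, again in Python with NumPy. The grid edges, the histogram routine, and the Euclidean comparison are stand-ins chosen for clarity; the sketch does not reproduce Audrey’s analog circuitry or its actual matching criterion.

    import numpy as np

    def utterance_template(trace, x_edges, y_edges):
        """trace: list of (low-band, high-band) energy pairs, one per centisecond.
        Counts how many samples fall in each partition of the feature space."""
        xs, ys = zip(*trace)
        counts, _, _ = np.histogram2d(xs, ys, bins=[x_edges, y_edges])
        return counts

    def reference_template(repetitions, x_edges, y_edges):
        """Average the templates of many repetitions of one word (a sample mean)."""
        return np.mean([utterance_template(t, x_edges, y_edges) for t in repetitions],
                       axis=0)

    def recognize(trace, references, x_edges, y_edges):
        """Return the vocabulary word whose stored reference template lies closest."""
        template = utterance_template(trace, x_edges, y_edges)
        return min(references,
                   key=lambda word: np.linalg.norm(template - references[word]))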

Beyond the general use of statistical approximations as a model of what appeared to be the ear’s acoustic negligence, however, each of the particular maneuvers of mathematical disassembly and recombination used in Audrey’s processing architecture, including the format of the template itself, stemmed from theories of the human auditory system. Much of the acoustic processing and template design was directly derived from a series of speech and hearing experiments conducted at Bell Labs beginning in the late 1940s.196 In these

194 Ibid., 641.

195 As Stigler explains, sample means are one of the most basic examples of statistical aggregation, the “truly revolutionary” idea “stipulating that, given a number of observations, you can actually gain information by throwing information away.” Stephen M. Stigler, The Seven Pillars of Statistical Wisdom (Cambridge, MA: Harvard University Press, 2016), 4-5.

196 Ralph K. Potter and J. C. Steinberg, “Toward the Specification of Speech,” The Journal of the Acoustical Society of America 22, no. 6 (November 1, 1950): 807–20; Gordon E. Peterson and Harold L. Barney, “Control Methods Used in a Study of the Vowels,” The Journal of the Acoustical Society of America 24, no. 2 (March 1, 1952): 175–84.

experiments, researchers recorded a total of seventy-six speakers—thirty-three men, twenty-eight women, and fifteen children—reciting two randomly ordered sets of ten words each, resulting in a total of 1,520 recordings that were then imaged and acoustically analyzed using a spectrograph and a cathode-ray sound spectroscope197 in order to measure and record the acoustic properties.198 Ralph

Potter and J.C. Steinberg then conducted a series of hearing tests using the recordings to determine which recordings listeners could correctly identify, believing that the corresponding spectrograms would exhibit common acoustic features that were crucial to human speech perception.199

197 Potter and Steinberg describe the spectroscope as “similar in principle to the spectrograph but operates about 70 times faster.” Potter and Steinberg, “Toward the Specification of Speech,” 808.

198 Potter and Steinberg, “Toward the Specification of Speech,” 808-10. The words were all monosyllabic, and began with [h] and ended with [d], so that only the vowel sounds differed. The initial consonant [h] was chosen because there was “relatively little transitional movement of the vowel formants with this consonant,” while the final [d] was selected because “common words could be found for most of the vowels.” The selected words were heed, hid, head, had, hod, hawed, hood, who’d, hud, and heard.” The words were given to the speakers as a set of printed cards to read from, with one word on each card. Speakers were selected from Bell Laboratories personnel and their children, based on the criteria of “willingness to cooperate, and an absence of notable hearing and speech defects or dialectal deviations.” See also Peterson and Barney, “Control Methods Used in a Study of the Vowels,” 177, where they provide the additional detail that the speakers were predominantly native speakers of American English.

199 Using the recordings from the speech tests, Potter and Steinberg conducted a set of listening tests in an effort to isolate what they believed to be the key features of the signal that the ear used to identify vowel sounds. Approximately seventy participants, thirty-two of whom were drawn from the original speaker group, were asked to listen to a random selection of the experimental speech recordings and select a card printed with the symbol they believed best corresponded to the audio out of ten options. The cards were the same set assigned to the speakers to produce the recordings. It is worth noting that when conducting the analysis that they published in 1950, Potter and Steinberg did not have the results of listening tests from all of the participants, only those who were also a part of the speaker group. Potter and Steinberg, “Toward the Specification of Speech,” 807–10.

The selection of salient input, the appropriate representational format, and the algorithmic procedure for generating acoustic templates in Audrey were all derived directly from the theories of human speech processing resulting from these experiments. To begin with, the engineers of Audrey based its high and low frequency filters on the maximum and minimum frequency ranges for the first and second formants (F1 and F2)200 in the final analysis of the spectrographic records from fifty-one adult speakers that were produced in the experiments described by Potter and Steinberg, determining 900 Hz to be an effective boundary between the two formants.201 However, it was not simply the quantification of speech formants that was determined by Potter and Steinberg’s work. Their analysis also determined the very selection of the F1 and F2 formants as the defining features of speech in the first place. And perhaps most importantly, their conclusions regarding hearing and speech perception literally reshaped the representation of speech, resulting in the particular graphical transpositions used in the production and format of Audrey’s acoustic templates.

In studying the spectrographs of the recordings that listeners could correctly identify, Potter and Steinberg hoped to determine

200 Formants are frequency areas where there are amplitude peaks in a sound. In speech, they are typically associated with vocal tract resonance.

201 Davis et al., “Automatic Recognition of Digits,” 639-640. Since Potter and Steinberg’s original report was only a preliminary analysis based on twenty-five of the seventy-six total records of both adults and child participants, Davis, Biddulph, and Balashek drew their measurements on adult speakers from a later report from Gordon E. Peterson and Harold L. Barney (“Control Methods Used in a Study of the Vowels,” 1952) based on the same experimental data.

which aspects of measurable acoustic energy were used by the ear in speech recognition.202 What they found, however, was that formant frequency positions, based on their absolute measurements, varied “markedly” between male, female, and child speakers. Nevertheless, they reasoned that despite the quantitative discrepancies, the fact that “the ear recognizes the three stimulation patterns as alike . . . suggests that within limits, a certain spatial pattern of stimulation on the basilar membrane may be identified as a given sound regardless of position along the membrane.”203 They concluded therefore that it was the “form or pattern of the formant positions [that] appears to be important in discriminating between sounds.”204 Moreover, they determined that the similarity between quantifiably dissimilar patterns was also evident upon visual inspection of the spectrographic records, where information was interpreted through a “subjective scale” of the brain that would perform “certain so-called normalizing transformations” to adjust for the vertical displacement in the frequency positions. Potter and

Steinberg thus maintained that the information required to specify the identity of different vowels was in fact captured in their spectrographic measurements, but could only be revealed once the appropriate “normalizing relationships” for interpreting the data could be determined.205

202 Ibid., 809.

203 Ibid., 812.

204 Ibid., 811.

205 Ibid., 812.

Figure 11 Formant frequency results from Ralph Potter and J.C. Steinberg’s specification tests graphing formant 1 versus formant 2, with distinct clusters for different vowel utterances shown. Source: R. K. Potter and J. C. Steinberg, “Toward the Specification of Speech,” The Journal of the Acoustical Society of America 22, no. 6 (November 1, 1950): 817.

This led Potter and Steinberg to a second series of tests, this time using only three male speakers to repeat words multiple times.206 Potter and Steinberg discovered that while the actual frequency measurements still varied from speaker to speaker, the ratio between the first and second formants displayed more consistent similarity. If the first and second formants were plotted against each other in a two-dimensional plane, the repetitions of each vowel sound made a distinct cluster, with limited overlaps between different vowels (see Figure 11). The researchers thus concluded that the first and second formant frequencies were potentially sufficient for the

206 Each speaker was recorded repeating “some 12 or more” randomly ordered sets of the same words, and the fundamental frequency, as well as the frequencies of the first, second, and third formants were measured. Ibid., 813.

identification of certain vowel sounds.207 Though the determination of the formant frequency ranges came from recording both male and female voices, the decision to rely upon these formant frequencies as the salient features of speech recognition was based on these tests conducted with only three men.

For Audrey, this meant that it was not enough to simply operationalize speech as acoustic measurements captured by the spectrograph. Since further

“normalizing transformations” were needed to resolve these measurements, these too had to be quantified and expressed mathematically in order for a machine to objectively discern patterns that were apparent to the “subjective scale” of the ear or eye. In other words, the statistical techniques used to generate acoustic templates were designed to mathematically replicate, and thereby automate, the

“normalizing transformations” of human perception. The selection of which aspects of the acoustic speech signal were measured and the format of these measurements — that is, the means by which data was produced and processed — for the purposes of machine recognition were deliberately mapped to the apparent functions of human hearing.

The Statistical Nature of Language

Beyond the use of statistical aggregation techniques in the early template-matching systems, the use of probabilistic inference — another characteristic

207 Ibid., 813-816.

feature of the “statistical approach” — was actually introduced to speech recognition beginning in the 1960s with the inclusion of language modeling. The

Audrey and other experimental systems throughout the 1950s were typically focused solely on acoustic recognition. Since speech specification research was based in acoustic phonetics, a subfield of phonetics concerned with the physical forces that determined speech sounds and their transmission, these early efforts assumed a direct correspondence between the acoustic and linguistic properties of speech. As experimental phonetician Peter Denes explained,

In the search for a solution [to speech recognition] it has always been realised that phonetic content and other variables will influence the acoustic features that characterize the phonemes and words. It has always been tacitly assumed, however, that there are some invariant acoustics features that characterize a phoneme and that are always present when that particular phoneme is spoken by the speaker or recognized by the listener . . . It was said that automatic recognition could be achieved by detecting these invariants, always present although sometimes hidden, if only their nature was uncovered by further research and their characteristic specified.208

The assumption, in other words, was that speech recognition was essentially a function of adequately modeling the processes of acoustic production and transmission. After all, humans were able to recognize a given word even when spoken by different people, at different speeds, and in all types of noisy environments. The idea thus followed that, with sufficient understanding of these processes that regulated the physical characteristics of speech sounds, one could

208 Denes, “Automatic Speech Recognition: Experiments with a Recogniser Using Linguistic Statistics,” 23.

isolate the distinct physical cues—for instance, the location of high energy outputs within the frequency range, which were visible in spectrographic prints—that triggered their identification with the appropriate phoneme or word. From there, with sufficiently-tuned instruments and proper calibration, the variations perceptible in repeated measurements of a given speech sound could be resolved into a distinct set of essential acoustic features that machines, too, would be able to detect.

D.B. Fry and Peter Denes, however, contended that the determination of physical features was only part of the identification process, noting that acoustics was not the only site of variation in speech recognition. Citing a 1957 experiment conducted by Ladefoged and Broadbent where human listeners were asked to identify synthesized speech utterances, Denes pointed out that an utterance containing the same acoustic measurements could be interpreted as different sounds by the human listener, depending on the preceding syllable.209 In other words, it was not only the transmission and perception of speech sounds that could be inconsistent — their interpretation was similarly mutable and subject to contextual influences, thus ensuring that “acoustic recognition alone is unlikely to reach a level of accuracy . . . that would make [speech recognition] of real practical

209 Ibid., 24. See also Peter Ladefoged and D. E. Broadbent, “Information Conveyed by Vowels,” The Journal of the Acoustical Society of America 29, no. 1 (January 1, 1957): 98–104.

interest.”210 The two thus posed speech recognition as a two-stage process, in which acoustic “primary recognition” based on physical cues was further constrained by “the linguistic knowledge which every listener has at his disposal and which informs him of the transition probabilities at every stage in the speech sequence.”211

By including a language model, Fry and Denes effectively rejected the use of experimental data as the exclusive knowledge source for speech recognition, despite both being experimental phoneticians. In their view, acoustic observations alone could never adequately model the recognition process, however extensive their statistical compensations. Yet this assessment did not lead to an abandonment of statistical efforts to assimilate variations in the empirical data. On the contrary, statistics was sunk further into the body, from the mechanisms of the ear to the workings of the mind. It was not merely the ear’s acoustic naiveté that compensated for the imprecision of speech, but also the mind’s linguistic intuition that managed its uncertainty.

Denes described the design of a system implementing this idea in 1959 and published the details of the working system and its experimental results in 1960, as part of a contract with the US Air Force. The system, which was simply dubbed

210 Dennis B. Fry and Peter Denes, “The Solution of Some Fundamental Problems in Mechanical Speech Recognition,” Language and Speech 1, no. 1 (1958): 53.

211 Dennis B. Fry and Peter Denes, “Experiments in Mechanical Speech Recognition,” in Information Theory: Third London Symposium, ed. Colin Cherry (London: Butterworths Scientific Publications, 1956), 206.

the “automatic phoneme recognizer,” was designed to recognize discrete (or

“isolated”) recorded speech spoken in a Southern (British) English accent. It used phonemes rather than whole words as the recognition unit, such that though it contained an acoustic repertoire of only twelve different phonemes (four vowels, seven consonants, and a phoneme for silence to indicate pauses between words), it was capable of being tested on a vocabulary of 200 words, provided that all of the words contained only the twelve acceptable phonemes.

Figure 12 Block diagram of the automatic phoneme recognizer. Designed by Dennis Fry and Peter Denes. Source: Peter Denes, “Automatic Speech Recognition: Experiments with a Recogniser Using Linguistic Statistics,” Contract No. AF 61(514)-1176 (Air Force Cambridge Research Center: United States Air Force Air Research and Development Command, September 1960), 28.

Much like the Audrey, the automatic phoneme recognizer conducted its “primary recognition” using frequency analysis, splitting the signal using two frequency range filters associated with formant features. What was unique was the addition of a language model that included a “store of linguistic knowledge” intended to “reproduce the linguistic mechanism of the listener” (see figure 12).212

212 Fry and Denes, “The Solution of Some Fundamental Problems,” 53. Fry and Denes of course acknowledge that this “reproduction” was a fairly crude representation of language, noting that it was “not very practicable” to reproduce linguistic understanding thoroughly, given the problems of storage.


Denes and Fry characterized this “linguistic mechanism” probabilistically.

Specifically, the store of linguistic knowledge referred to a database of the digram frequencies of phonemes (the relative number of times each possible combination of ordered phoneme pairs occurred). These values were incorporated into the recognition process as an additional factor in the matching calculation, alongside the frequency measurements taken from the acoustic signal.
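
A store of this kind can be sketched as follows. The representation of words as phoneme sequences is assumed for the illustration, and the multiplicative weighting in the second function is one plausible reading of “an additional factor” in the matching calculation rather than the documented arithmetic of the Fry and Denes system.

    from collections import Counter

    def digram_frequencies(wordlist_phonemes):
        """wordlist_phonemes: list of phoneme sequences, one per vocabulary word.
        Returns the relative frequency of each ordered phoneme pair, counted from
        the fixed test wordlist rather than from running speech."""
        counts = Counter()
        for phonemes in wordlist_phonemes:
            counts.update(zip(phonemes, phonemes[1:]))
        total = sum(counts.values())
        return {pair: n / total for pair, n in counts.items()}

    def sequence_weight(candidate, digrams):
        """Weight a candidate phoneme sequence by its digram frequencies, to be
        combined with the acoustic matching score as an additional factor."""
        weight = 1.0
        for pair in zip(candidate, candidate[1:]):
            weight *= digrams.get(pair, 0.0)
        return weight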

Despite the inclusion of probability, however, Fry and Denes’ phoneme recognizer, like the Audrey, remained conceptually distant from IBM’s “natural way for the machine.” They considered the use of statistics crucial for simulating human speech recognition because they believed human language understanding itself to be explicitly statistical in nature. As they explained, “[i]n decoding an

English message, the listener . . . not only confines himself to a phonemic system, but at each position in the sequence he is strongly influenced by sequential probabilities,” such that in the human listener, “the importance of the application of statistical knowledge of the language in the reception of speech can hardly be over-estimated.”213 The statistical function of the mechanical recognizer merely

“parallels the workings of the human listener,” formalizing linguistic structure as a set of stored values.214

213 Ibid., 52-53.

214 Ibid., 52.

Moreover, the actual probability distribution for these digram frequencies was based on the fixed count of occurrences within the test vocabulary wordlist.

The more frequently a sequential pairing appeared in the vocabulary wordlist, the more probable it was considered. In other words, the digram frequency counts were not taken from language as it was used—what computer science refers to as

“natural language”—but from an “artificial language” of the vocabulary list compiled for system testing, without any regard as to how commonly each word might be observed in actual speech. The linguistic knowledge used in speech perception was thus treated as a fixed entity, one that could be reduced to the structural attributes of the language from which the speech was produced, and could be directly and explicitly represented as a static store of a priori values. The probabilistic model in Fry and Denes’ recognizer served as a formal (which is to say mathematical) representation of the (artificial) language’s fixed structure. In other words, though the “linguistic mechanism” is expressed as a probability distribution, the process remains a deterministic one, which sought to “approach the flexibility and certainty of the human recognition apparatus.”215

In this way, the use of probabilistic language models was an extension of the quantized templates of acoustic-phonetic systems like the Audrey, despite Fry and Denes’ critiques of their reliance on acoustic representation. Both the statistical aggregates of the Audrey’s acoustic models and the probability

215 Ibid., 53, emphasis mine.

distributions of Fry and Denes’ phoneme recognizer differed dramatically from the radically empirical, “data-driven” methods later employed in the Tangora insofar as they were used as a corrective for variation within the observation data. As computational linguist Kenneth Church and former IBM CSR group member

Robert Mercer reflected, the adoption of “knowledge-based” approaches that incorporated ever-increasing levels of linguistic expertise was not simply an economic necessity (though, as noted in the previous chapter, few institutions had the computer resources necessary for the data- and processing-intensive systems developed at IBM in the 1970s and 1980s). Rather, the expanding inclusion of linguistic and contextual knowledge “was advocated as necessary in order to deal with the lack of allophonic invariance” in the recorded data.216

The fact that speech was so variable in terms of both acoustic features and linguistic perception thus led to the incorporation of additional determining factors, including grammatical, semantic, and task-domain knowledge, in later A.I. or “expert systems.” Statistical models in these early systems were designed not to elaborate or capture the presumed “statistical nature” of speech as it was observed, but to correct for the variations across empirical observations in order to represent the presumably fixed, underlying process by which speech was produced.

216 Kenneth W. Church and Robert L. Mercer, “Introduction to the Special Issue on Computational Linguistics Using Large Corpora,” Computational Linguistics 19, no. 1 (March 1993): 3.

Hiding Knowledge, Maximizing Likelihood

The work of Fry and Denes signaled a new path forward for speech recognition research in suggesting that the link between the acoustic and linguistic elements of speech could be resolved through the incorporation of additional knowledge regarding speech and language. The difficulty in matching acoustic and linguistic events was, in this view, due to a failure to account for the many linguistic factors and interactions that ultimately determined speech production and perception. The surface variability in the acoustic output of speech was assumed to be governed by an underlying process that was itself stable enough to be described deterministically.217 The key challenge of automatic recognition was therefore to formally represent the most salient linguistic and acoustic factors in this process and their interactions. As a result, ASR research in the 1960s and

1970s was dominated by rule-based “knowledge engineering” approaches, which relied upon “the direct and explicit incorporation of experts’ speech knowledge into a recognition system . . . using rules or procedures” to formally represent

“acoustic-phonetic, lexical, syntactic, semantic, and prosodic facts and the subtle interactions between them.”218

217 Lawrence Rabiner and Biing-Hwang Juang, Fundamentals of Speech Recognition (Englewood Cliffs, N.J: Prentice Hall, 1993), 43.

218 Alex Waibel and Kai-Fu Lee, “Knowledge-Based Approaches,” in Readings in Speech Recognition, ed. Alexander Waibel and Kai-Fu Lee, 1st edition (San Mateo, Calif: Morgan Kaufmann, 1990), 197–98.

While the majority of speech recognition research in the 1970s focused on

A.I. expert systems, which were particularly sophisticated implementations of the rule-based approach, a small group of researchers began to deliberately decouple the process of recognition from the principles of speech.219 In addition to the IBM

CSR group, Jim and Janet Baker, then graduate students at Carnegie Mellon

University, also independently began working on a similar “statistical” approach to speech recognition in the early 1970s. The Bakers were recruited to the CSR group shortly after, joining IBM in 1974. By the 1980s, the experimental efforts at IBM were eventually consolidated as Tangora,220 named after Albert Tangora, the world’s speed record holder for typing on a manual keyboard.221 The CSR group formally began building Tangora around 1981, when the system objective had been finalized, though its basic structure had been developed over the prior

219 As discussed in the previous chapter, the dominance of AI expert systems in the 1970s is due in large part to the interests of DARPA, which was a major funder of speech recognition research in this period.

220 Throughout, I will use Tangora to refer to the ongoing experimental efforts undertaken by the IBM CSR group between 1972 and 1996. Though the systems that were built in this period underwent many iterations as well as changes to both hardware and software specifications, the underlying logic and formal structure remained consistent and were considered by its lead researchers to be a single project. It should be noted also that these experimental systems are the technical basis for, but distinct from, the commercial dictation software issued by IBM in the 1990s.

221 According to Bob Mercer, Fred Jelinek chose the name of the Tangora, though Mercer did not recall an explanation for the selection. The naming decision appears to have been somewhat perfunctory, as Mercer recalled that they had mistakenly referred to the system as the “Tagora” due to Jelinek misspelling the name on some occasion. The error persisted “for a while, but eventually [they] got it straightened out.” See Robert Mercer, personal correspondence with the author, May 8, 2015. Incidentally, Albert Tangora’s record was for averaging 147 words per minute over the course of an hour on October 23, 1923.

decade.222 IBM demonstrated a mainframe model “requiring a room full of machinery” in 1984, with a desktop PC-based version following two years later, in 1986.223 This initial version of Tangora was capable of transcribing dictation using a 5,000-word “office correspondence” vocabulary spoken with brief pauses between words and was adapted to a single speaker, with each new user undergoing an “enrollment period” by reading an approximately 20-minute standard text.224 The size of the vocabulary was expanded to 20,000 words by

1987,225 and by 1989, the CSR group had adapted the system for continuous

222 It is of course difficult to mark clear delineations between different iterations of a system that was developed as a part of ongoing research. In his ACL Lifetime Achievement Award speech, Jelinek marks the origins of the Tangora slightly earlier, in “1978 (or so),” coinciding with when the CSR group “abandoned artificial grammars . . . to start recognizing ‘natural’ speech.” And, in fact, the earliest publication from the CSR group regarding the recognition of natural speech was presented at the IEEE International Conference on Acoustics, Speech, and Signal Processing in April of 1978. However, Jelinek notes in the same speech that the project was “settled” with a set task once they decided on recognition of a 5,000-word office correspondence vocabulary. See Jelinek, “ACL Lifetime Achievement Award: The Dawn of Statistical ASR and MT,” 448. He dates this decision to 1981 in an invited paper detailing the completed system in the Proceedings of the IEEE in 1985. See Frederick Jelinek, “The Development of an Experimental Discrete Dictation Recognizer,” Proceedings of the IEEE 73, no. 11 (November 1985): 1616. See also S.K. Das and M.A. Picheny, “Issues in Practical Large Vocabulary Isolated Word Recognition: The IBM Tangora System,” in Automatic Speech and Speaker Recognition, ed. Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal (Boston, MA: Springer US, 1996), 457. Das and Picheny corroborate 1981 as the starting year for development in their chapter on the Tangora.

223 “IBM Scientists Demonstrate Personal Computer with Advanced Speech Recognition Capability,” Press Release (IBM Corporation Research Division, April 7, 1986), IBM Corporate Archives. See also Das and Picheny, “The IBM Tangora System,” 457.

224 Averbuch et al., “An IBM PC Based Large-Vocabulary Isolated-Utterance Speech Recognizer,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’86., vol. 11, 1986, 53.

225 Averbuch et al., “Experiments with the Tangora 20,000 Word Speech Recognizer,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’87., vol. 12, 1987.

rather than isolated-word recognition, which allowed the user to speak without pausing deliberately between words.226 The underlying conceptual and mathematical foundation of Tangora, however, had been established since the formation of the CSR group in 1972, and remained largely consistent as the technical implementations were elaborated over the course of more than two decades.

Lalit Bahl, the inaugural member of the CSR group, characterized the development of the project over time as “more a refinement of ideas that we had originally” rather than a series of distinct projects, jokingly describing himself as having worked on the same thing for over twenty-six years.227

From its outset, the CSR group explicitly positioned itself in opposition to the rule-based “knowledge engineering” approach favored by DARPA-funded AI systems. In their earliest publications, Jelinek, Bahl, and Mercer criticized their contemporaries for formulating speech recognition within an artificially constrained framework that “deterministically limits the permissible set of utterances” and performs recognition using “a complex interaction of acoustic, syntactic, and semantic processors . . . derived in an ad hoc intuitive manner.”228

In contrast, IBM’s approach presumed an ignorance of precisely the components

226 Lalit R. Bahl et al., “Large Vocabulary Natural Language Continuous Speech Recognition,” in International Conference on Acoustics, Speech, and Signal Processing. May 23-26, 1989, 465–67.

227 Bahl, interview with the author, 2015.

228 Frederick Jelinek, Lalit Bahl, and Robert Mercer, “Design of a Linguistic Statistical Decoder for the Recognition of Continuous Speech,” IEEE Transactions on Information Theory 21, no. 3 (May 1975): 250.

of speech production that knowledge engineering sought to detail, and instead aimed “to model utterance production statistically, rather than through a grammar that would describe [language] syntactically and semantically.”229 Jelinek here emphasized the use of statistical models as an alternate means “to model utterance production” rather than to represent various acoustic or linguistic properties. That is, statistical models were not presented as an alternate matching procedure or language representation, distinct from templates or rule-sets principally in form, but rather as a replacement for the systematic description of language altogether.

The IBM approach treated speech as a system that transformed words into sounds by means of “unknown human mental and physiological processes” that could not be ascertained and represented explicitly, using statistical models to predict the outcome of these processes “without knowledge of the formulas that encoded words.”230 Statistical modeling was thus conceived not as a novel means for codifying and quantifying pre-existing knowledge of linguistic principles, but as a replacement for linguistic principles entirely. Put another way, IBM’s statistical models, by treating the properties and processes underlying speech as

229 Jelinek, “Continuous Speech Recognition by Statistical Methods,” 532.

230 T. Murphy, ed., “IBM Reports Major Speech Recognition Progress,” IBM Research Highlights, no. 1 (1985). IBM Corporate Archives.

fundamentally unknown, not only proceeded without representing knowledge, but by representing precisely the absence of knowledge.231

The displacement of linguistic knowledge with statistical estimates was aimed at removing “unknown human mental and physiological processes” from more than just the computational model of speech recognition, however. What

CSR researchers at IBM ultimately sought was the removal of expert judgment from the process of producing models, such that they could be generated automatically with minimal human intervention. The desire to remove linguistic expertise was a pragmatic and commercial imperative as much as it was a philosophical position, since the “ad hoc” incorporation of “intuitive” knowledge was both time and labor-intensive. Early iterations of the CSR group’s research included recognition components that had to be constructed with the aid of linguists. For instance, the language model (the model defining the probability that any given word sequence might occur in a language) was an artificial grammar, known as the New Raleigh Language, which defined possible transitions between words (see figure 13). In addition to being the model of language used by the recognizer, it was also used as a text generator to produce test sentences to be read aloud to evaluate the recognizer. The finite state machine

231 In 1986, in light of Tangora’s impressive performance results, John Makhoul and Richard Schwartz pointed to IBM’s use of statistical methods as “certainly one of the most powerful [examples]” of why the field “should not neglect to model our ignorance.” John Makhoul and Richard Schwartz, “Ignorance Modeling,” in Invariance and Variability in Speech Processes, ed. Joseph S. Perkell and Dennis H. Klatt (Lawrence Erlbaum Associates, 1986), 344–45, emphasis in original.

was constructed by linguists from the IBM Speech Technology group based in

Raleigh, NC, so that while the permissible transitions between words were expressed probabilistically, their structure and values were manually selected,

Figure 13 Model of the New Raleigh Language. Source: Frederick Jelinek, “ACL Lifetime Achievement Award: The Dawn of Statistical ASR and MT,” Computational Linguistics 35, no. 4 (2009): 486. Image originally appears in Jelinek, “Continuous Speech Recognition by Statistical Methods,” 538.

much like the language constraints used in rule-based expert systems. The use of hand-constructed grammars was later replaced in the Tangora by automatically generated n-grams trained on text data.232 Even in cases where models were already trained on empirical data, such as in the acoustic model, human intervention was required in early experiments in order to prepare the data. The

232 Jelinek, “The Dawn of Statistical ASR and MT,” 485-488. Jelinek recalled that when they had moved on to using natural language (rather than an artificial one) for the Tangora, their remaining linguist, Stan Petrick, offered to construct “a little grammar” for the significantly larger vocabulary, which never happened. Jelinek joked that the offer “acquired a mythical status in the manner of ‘famous last words,’” marking the removal of English grammar from the language model.

acoustic properties of phonemes, which varied from speaker to speaker, had to be identified by having a phonetician listen to speech recordings and label the corresponding spectrograms.233 It was these components, where linguistic knowledge required extensive human judgment, that the CSR group focused on systematically replacing. The incorporation of expert knowledge was expensive for experimental research, and wholly unthinkable for commercial development.

The conceptual and pragmatic removal of human judgment as the basis for machine recognition was concretized in the architecture of the hidden Markov model (HMM), statistical speech recognition’s defining technique. Whereas phonetic templates and expert systems offered direct representations of linguistic processes that presumably governed speech, hidden Markov models formally represented precisely the absence of such knowledge. HMMs treated the speech signal as observations resulting from an underlying language source that could not be observed directly (hence “hidden”). Language was therefore representable only as a dependent series of mathematically random events, or “states,” in a stochastic process. In such a process, individual random events cannot be predicted with certainty, but over a sufficiently large number of random trials the outcomes will eventually reflect the probability (or “chance”) distribution for all possible outcomes. In other words, the occurrence of any single state can never be fully determined, only

233 A. Nadas et al., “Continuous Speech Recognition with Automatically Selected Acoustic Prototypes Obtained by Either Bootstrapping or Clustering,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’81, 1981, 1153.

statistically estimated.234 In essence, the hidden Markov model formalized the proposition that the performance of a phenomenon could be predicted without any direct knowledge of its nature.

The mathematical theory of HMMs was introduced in a series of papers published by Leonard E. Baum and others at the Institute for Defense Analyses

(IDA) between 1966 and 1972, though the work had been developed several years prior to the first publication.235 HMMs built upon the basic concept of the Markov process, a series of random events in which the probability of any state is based on the immediately preceding one, and were originally referred to simply as

234 For instance, the results from a large number of repeated tosses of a perfectly even coin will approach a 50/50 distribution between heads and tails, even as each individual result is random and cannot be predicted. A stochastic process, put simply, is a collection of random variables from the same set of elements and state space. In layman’s terms, these are random variables produced by the same phenomenon. Stochastic processes are commonly used to describe the progress of changes in a system over time, such as the movement of a stock value or the movement of a “random walk.”

235 See Leonard E. Baum and Ted Petrie, “Statistical Inference for Probabilistic Functions of Finite State Markov Chains,” The Annals of Mathematical Statistics 37, no. 6 (1966): 1554–63; Leonard E. Baum and J. A. Eagon, “An Inequality with Applications to Statistical Estimation for Probabilistic Functions of Markov Processes and to a Model for Ecology,” Bulletin of the American Mathematical Society 73, no. 3 (May 1967): 360–63; Leonard E. Baum et al., “A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains,” The Annals of Mathematical Statistics 41, no. 1 (February 1970): 164–71; Leonard E. Baum, “An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes,” Inequalities 3 (1972): 1–8. See also James K. Baker, interview by Patri Pugliese, audio recording, December 21, 2006, History of Speech and Language Technology Project, http://www.sarasinstitute.org/Pages/Interv/SarJimBaker.html. According to Baker, the techniques for HMMs were already “several years old” when he arrived at IDA in 1963.

“probabilistic functions of Markov chains.”236 The first practical applications of

HMMs began in the early 1970s in speech recognition, independently adopted around 1971 by two teams of researchers, Jim and Janet Baker at Carnegie Mellon

University (CMU)237 and the IBM CSR group. While at CMU, the Bakers began development of what would become two of the earliest and most successful commercial dictation software packages of the 1990s, DragonDictate and

NaturallySpeaking.238 Their research was additionally

236 Leonard E. Baum and Ted Petrie, “Statistical Inference for Probabilistic Functions of Finite State Markov Chains,” 1554. The term “hidden Markov model” was not established until a series of lectures on the technique’s application to speech and language processing that took place at IDA in 1980. See John D. Ferguson, ed., Symposium on the Application of Hidden Markov Models to Text and Speech (Princeton, NJ: Institute for Defense Analyses, Communications Research Division, 1980).

237 Jim Baker had encountered the mathematical concept while working part-time at IDA during his undergraduate studies at Princeton University, beginning in 1963, and his interest in the mathematical modeling of stochastic processes landed on the problem of speech recognition due to Janet Baker’s work on signal processing and speech visualization when the two met at Rockefeller University. According to Janet, after one or both of them shared a cab ride with Allen Newell, he referred them to Raj Reddy, who ran speech recognition research at Carnegie Mellon, in the fall of 1971. It wasn’t until the Bakers transferred to Carnegie Mellon University that they were able to begin working on speech recognition in earnest, thanks to DARPA funding and dedicated computer science facilities. See James K. Baker, interview by Patri Pugliese, 2006 and Janet M. Baker, interview by Patri Pugliese, audio recording, January 18, 2007, History of Speech and Language Technology Project, http://www.sarasinstitute.org/Pages/Interv/SarJanetBaker.html.

238 The Bakers lost their company, Dragon Systems, in a disastrous sale to what turned out to be the fraudulent Belgium-based company Lernout & Hauspie in 2000. The acquisition was overseen by Goldman Sachs, whom the Bakers later sued for failing to do their due diligence. The Bakers lost the lawsuit against Goldman Sachs, as well as an appeal in 2014. The Dragon technology was eventually acquired by Nuance Communications, a competitor to Dragon whose IPO also happened to have been handled by Goldman Sachs around the same time they were overseeing the Dragon Systems sale. See Loren Feldman, “Goldman Sachs and a Sale Gone Horribly Awry,” The New York Times, July 14, 2012, sec. Business Day, https://www.nytimes.com/2012/07/15/business/goldman-sachs-and-a-sale-gone-horribly-awry.html and Jonathan Stempel, “Goldman Sachs Defeats Appeal over Collapsed Buyout,” Reuters, November 12, 2014, https://www.reuters.com/article/us-goldman-dragonsystems-lawsuit-idUSKCN0IW2JK20141112.

unusual in the context of CMU at the time, which was one of the most established institutions in the development of AI expert systems based on knowledge engineering. The Bakers themselves were even funded by the AI-dominated

DARPA Speech Understanding Project, though their work was closely aligned with the IBM CSR group in their technical approach, which included “no knowledge of English grammar, no knowledge base, no rule-based expert system, no intelligence. Nothing but numbers.”239 Both the Bakers and the CSR group researchers were similarly perceived as pursuing research that was unorthodox and outside the purview of speech science. As Janet Baker recalled, their use of statistical models in place of knowledge representation “was a very heretical and radical idea . . . A lot of people said, ‘That’s not speech or language, that’s mathematics! That’s something else!’”240

The Bakers and the IBM CSR group joined forces in 1974, when IBM hired Jim and Janet Baker a year before they were awarded their PhDs from

Carnegie Mellon. The use of hidden Markov models in speech recognition remained relatively limited until the 1980s. Aside from the disparity in computing resources between IBM and other institutions discussed in the previous chapter, the main reason for this lag was the disciplinary divide between the math- and engineering-centric approach and the linguistics-focused research that still

239 Simson Garfinkel, “Enter the Dragon,” MIT Technology Review, September 1, 1998.

240 Janet Baker quoted in Garfinkel, “Enter the Dragon.”

dominated the field. While the researchers at IBM published a number of articles depicting their experimental work, the basic theory underlying hidden Markov models was rarely discussed in depth, and could mainly be found in journals focused on information theory, pattern recognition, and mathematics, which were not widely read by speech researchers.241

The conceptual and technical anchor of the statistical approach used in

IBM’s Tangora speech recognition system is the hidden Markov model, a “doubly stochastic process”242 in which a sequence of observed events (such as the acoustic segments of the speech signal) is randomly emitted from a second, underlying series of stochastic events (such as a series of words) that is “hidden” from direct observation. The underlying process is unobservable and therefore indeterminate; that is, the events (or “states”) in its sequence cannot be known with certainty: we do not know the factors that determine their selection and cannot verify which states have been selected. Nor do we know the process that governs the relationship between the states in the hidden series and those in the observed sequence. Hence the model is “doubly stochastic,” where both layers are determined by unknown processes that, given our ignorance, are represented as mathematically random. What is hidden in the hidden Markov model is not a

241 Lawrence R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proceedings of the IEEE 77, no. 2 (February 1989): 258.

242 Lawrence R. Rabiner and B. H. Juang, “An Introduction to Hidden Markov Models,” IEEE ASSP Magazine, January 1986, 5.

black-boxed component of the model. Rather, it is the governing properties of the thing being modeled that are hidden, not by, but from the model itself.

The hidden Markov model essentially represents a sequence of data, such as an acoustic speech signal, as a sequence of output observations resulting from a second, underlying series of “hidden” events, also referred to as “states.” As the states of this hidden series cannot be observed directly and are therefore unknown, they are portrayed as a Markov process. The most basic Markov process, the Markov chain, is a stochastic process that depicts a series of random states drawn from a finite set of possibilities.243 What makes the Markov process distinct from other random processes, such as a coin toss or die roll, is the assumption that the probability of any state is dependent on the immediately preceding one, connected like links in a chain. The probability that a coin toss will result in heads or tails is not influenced by the outcomes of previous tosses, whereas in Markov’s original example using Pushkin, the probability that the next letter in a text will be a vowel varies depending on the preceding letter. In other words, the probability distributions described in a Markov chain do not depict the chance of an event occurring, but the transition probability between any two events. In a common textbook example of Markov chains, imagine that each day’s

243 As a stochastic process, the chance of any state from the set occurring is given by a probability distribution, such that each individual occurrence is random, but some outcomes have a higher chance of occurring than others.

weather can be described as one of three states, sunny, rainy, and cloudy.244 Based on patterns in previous weather data, one can calculate the probability that a sunny day will be followed by a rainy day, a cloudy day, or another sunny day, and so forth, making a reasonable guess about weather patterns without any knowledge of the meteorological processes shaping them.
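
As a sketch of this textbook example (the transition values below are invented purely for illustration), a first-order Markov chain over the three weather states reduces to a small table of transition probabilities estimated from counts of consecutive days:

# Toy Markov chain over weather states; all probabilities are invented.
states = ["sunny", "rainy", "cloudy"]

# transition[a][b] = estimated probability that a day in state a is followed by state b,
# i.e. the count of (a, b) day-pairs in past data divided by the count of days in state a.
transition = {
    "sunny":  {"sunny": 0.6, "rainy": 0.1, "cloudy": 0.3},
    "rainy":  {"sunny": 0.2, "rainy": 0.5, "cloudy": 0.3},
    "cloudy": {"sunny": 0.3, "rainy": 0.4, "cloudy": 0.3},
}

# The chance that tomorrow is rainy depends only on today's state:
print(transition["sunny"]["rainy"])   # 0.1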

The hidden Markov model adds an additional layer of uncertainty by assuming that the sequence of events being modeled cannot be directly observed and is therefore unknown. Instead, what we can observe is a secondary series of events triggered by the unseen “source” sequence. Using the same example of the three-state weather system, imagine that you are in a windowless cell and cannot observe each day’s weather directly. However, you have a daily visitor who sometimes wears a hat and other times does not. Provided enough data of previously known weather states (sunny/rainy/cloudy) and corresponding observed hat states (hat/no hat), one can then calculate the probability of the current unknown weather state based on whether or not your visitor is wearing a hat. In this example, both the processes determining the weather (the transition probabilities of the “hidden” state) and the relationship between the weather and your visitor’s wardrobe decisions (the output probabilities between the hidden and observed states) are unknown, depicted only as statistical estimates based on

244 Variations on this example appear in numerous textbooks and course lectures, though it is not clear where this example originated.

previous data.245 The hidden Markov model thus proposes a framework for statistically predicting the progress of a series of observations without having to precisely formulate, or even know, the underlying operations that produce these observations.
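
The same scenario can be sketched in code: transition probabilities over the hidden weather states, emission probabilities tying each weather state to the visible hat observation, and a forward pass that tracks how likely each weather state is after a run of observations. All of the numbers are invented; this is a schematic of the general technique, not of any particular historical system.

# Toy hidden Markov model: weather is hidden, the visitor's hat is observed.
# All probabilities are illustrative only.
states = ["sunny", "rainy", "cloudy"]
start = {"sunny": 0.5, "rainy": 0.2, "cloudy": 0.3}
transition = {
    "sunny":  {"sunny": 0.6, "rainy": 0.1, "cloudy": 0.3},
    "rainy":  {"sunny": 0.2, "rainy": 0.5, "cloudy": 0.3},
    "cloudy": {"sunny": 0.3, "rainy": 0.4, "cloudy": 0.3},
}
# emission[s][o] = probability of observing o when the hidden state is s.
emission = {
    "sunny":  {"hat": 0.1, "no_hat": 0.9},
    "rainy":  {"hat": 0.8, "no_hat": 0.2},
    "cloudy": {"hat": 0.3, "no_hat": 0.7},
}

def forward(observations):
    """Return P(current hidden state, observations so far) for each weather state."""
    belief = {s: start[s] * emission[s][observations[0]] for s in states}
    for obs in observations[1:]:
        belief = {
            s: emission[s][obs] * sum(belief[r] * transition[r][s] for r in states)
            for s in states
        }
    return belief

# After three visits (hat, hat, no hat), which weather state is most plausible today?
print(forward(["hat", "hat", "no_hat"]))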

Applied to speech recognition, we can imagine the acoustic speech signal as the sequence of observed outputs, while the text of the words being spoken serves as the series of hidden states.246 The transition probabilities of the hidden state therefore stand in for the linguistic constraints that govern the selection and order of each word, while the output probabilities represent the relationship between any given word and its acoustic performance as speech. These transitions from one state to another and between hidden states and observed output, represented as statistical parameters, were thus used to replace the “unknown mental and physiological processes” governing speech, processes that could not themselves be explicitly defined. That is, the statistical parameters did not express how or why a particular series of words might produce the observed sounds, but simply the

245 Also note that the number of possible states in the source series and in the output series are not necessarily equal. Multiple source states might result in repeated instances of the output, or vice versa. Since we cannot observe the source states, we cannot assume a one-to-one correspondence where each distinct source state corresponds to a distinct output state. This can be handy for classification, for instance if you label a sequence of words according to their part-of-speech categories. The latter can be treated as an “unseen” source, where each category “state” might be associated with multiple words.

246 This is an extremely simplified portrayal of the application of HMMs in speech recognition for the sake of a non-technical explanation. In reality, multiple implementations of HMMs may be carried out in various stages of the recognition process and concatenated or layered in complex ways and alongside other techniques.

estimated likelihood that such sequences would co-occur, based on patterns detected in data from previous examples.247 The hidden Markov model thus did not simply lack knowledge of these processes, but indeed formalized this very ignorance.

What was ultimately radical about the use of hidden Markov models was not that the engineers at IBM believed the underlying nature of speech to be stochastic, but that they believed that it simply did not matter whether or not it was. The HMM was not a metaphysical reconceptualization of the nature of speech, but rather an epistemological claim that knowledge of speech’s nature, whatever its character, was categorically irrelevant to the computational model. Stochasticism was not an approach to representing speech, but ignorance.

As James Baker explained in an interview in 2006, the statistical approach to speech recognition did not simply model speech as a stochastic process, but as a hidden stochastic process, which effectively freed the model from having to be about speech at all. This move to modeling the absence of knowledge would later become a conceptual cornerstone of big data analytics.

247 The statistical parameters in the hidden Markov models were generated using Maximum Likelihood Estimation. Simply speaking, this process calculates the parameter values for the model that was most likely to have produced the outcomes that are observed. It should be noted that “likelihood” and “probability” technically describe different things in statistics, though they are often used interchangeably in non-technical descriptions. Probability describes the chance of a particular outcome, given certain parameters, whereas likelihood describes the chance of particular parameters, given a certain outcome. So for instance, the chance that one would roll a six given a fairly weighted die is a probability function. The chance that, given a series of observed die rolls, the die being used is fairly weighted is the likelihood function. Though the general concept of maximum likelihood seems fairly intuitive now, such was not always the case. For an account of the history of maximum likelihood, see Stephen M. Stigler, “The Epic Story of Maximum Likelihood,” Statistical Science 22, no. 4 (November 2007): 598–620.
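
The distinction can be stated compactly (a generic textbook formulation, not a formula taken from the IBM papers). For observed data x and model parameters \theta:

P(x \mid \theta) \quad \text{probability: chance of the outcome } x \text{, with the parameters } \theta \text{ held fixed}
L(\theta \mid x) = P(x \mid \theta) \quad \text{likelihood: plausibility of the parameters } \theta \text{, with the outcome } x \text{ held fixed}
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} L(\theta \mid x)

Maximum likelihood estimation, in this notation, simply selects the parameter values under which the observed outcomes would have been most probable.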

No data like more data

The hidden Markov model offered a formal architecture for modeling the performance of speech processes with minimal reference to the processes themselves, but for it to be implemented, it had to be populated by statistical values. Across both the language and acoustic models, explicit numerical values, known as parameters, had to be assigned to represent the probability distribution for the occurrence of every possible sequence of text and every combination of text and sound.248 If these values, such as the probability that a word would follow another, could not be determined based on syntactic and semantic structure (e.g., the knowledge that “the” is more likely to be followed by a noun than a verb or that the phrase “write a novel” is more probable than “ride a novel”), what were they based on? In other words, if speech and language were no longer the external referent expressed using statistical modeling, what exactly did the statistics model?

One of the distinctive features of the probabilistic models used in the

Tangora was the fact that the statistical values were “trained,” rather than assigned. Training, in this instance, referred to the identification of patterns within speech and text corpora as a means to set statistical parameters. The statistical parameters of the model would thus describe specific sets of language data, rather

248 In presuming no prior knowledge about structure and rules of speech or language, the model cannot restrict possible outcomes accordingly. Thus, no acoustic match or text sequence is impossible, merely less likely than others.

than abstract principles of linguistic structure. In other words, since hidden

Markov models eschewed direct knowledge of the underlying process by which language was generated, that process could be treated as fundamentally random, with words that occurred as events in a stochastic sequence guided by a probability distribution that, while unknowable, could be estimated based on data from prior observations. For instance, the probability parameters for word sequences in the language model (e.g., what potentially makes “write a novel” a more probable phrase than “ride a novel”) could be inferred from word sequence patterns in text data. The reliance on data over linguistic principles, however, presented a new set of challenges, for it meant that the statistical models were necessarily determined by the characteristics of training data. As a result, the size of the dataset became a central concern, since fundamental principles of probability dictated that for a mathematically random process—a process where the sequence of outcomes is based on chance rather than a deterministic pattern—the larger the dataset of observed outcomes, the closer its statistical parameters describing those observations come to reflecting the underlying probability distribution of possible outcomes in the process. Larger datasets of observed outcomes not only improved the probability estimates for a random process, but also increased the chance that the data would capture more rarely-occurring outcomes. Training data size, in fact, was so central to IBM’s approach that in 1985 Robert Mercer explained the group’s outlook by simply proclaiming, “There’s no data

like more data.”249 The statement, while seemingly a flippant dismissal of quality in favor of quantity, also identified the key challenge of statistical speech recognition, where “more data” was far easier to valorize than it was to acquire.
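
As a rough sketch of what “training” a language model on text involves (a toy corpus and simple bigram relative frequencies, not the CSR group’s actual trigram procedures), word-sequence probabilities fall directly out of co-occurrence counts in the data:

# Estimate P(next_word | previous_word) from raw counts in a toy corpus.
from collections import Counter

corpus = "i want to write a novel . she wrote a novel last year . do not ride a horse".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_next(prev, word):
    """Conditional probability of `word` following `prev`, by relative frequency."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

# The estimates simply reflect how often pairs occur in the training text:
print(p_next("a", "novel"))   # 2/3 in this toy corpus
print(p_next("a", "horse"))   # 1/3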

Data, despite its etymological provenance, was far from “given.” In the case of speech recognition in the 1970s and 1980s, it was eagerly sought, laboriously generated, and prepared to exacting specifications. Unlike today, the production of human-readable typed text and computer-readable text were separate processes. Digital data storage through the 1970s relied upon punched cards, and text data had to be manually reproduced in the appropriate format using a keypunch, a process that was time-consuming and labor-intensive. Large-scale digital record-keeping was far from an established practice, making text data stored in computer-readable format vanishingly scarce by contemporary standards. Collecting even a million words in computer-readable form, according to Bahl, would have proven a substantial undertaking at the time.250 By comparison, Google in 2006 created a text corpus of over a trillion words scraped from public web pages using automated methods, and made it readily available for download through the Linguistic Data Consortium online catalog.251

249 Robert Mercer, quoted in Jelinek, “Some of My Best Friends are Linguists.”

250 Bahl, interview with the author, 2015. As Bahl recalled, “back in those days . . . you couldn’t even find a million words in computer-readable text very easily. And we looked all over the place for text.”

251 The LDC additionally listed 248 other English text corpora of varying sizes in their download catalogue as of July 2016.

The IBM CSR group thus engaged in ongoing efforts that spanned approximately a decade to acquire and produce sufficient computer-readable text data to train a speech recognition vocabulary large enough for dictation.252 These efforts were complicated by the fact that while IBM’s computing resources outstripped those of other institutions, they were far from infinite, resulting in limitations on the vocabulary size of the data. The initial goal was a natural-language vocabulary of 1,000 words, which was four times the size of the finite state artificial grammar that the group had used in previous experiments, but still remained within the processing capabilities of IBM’s computers. The challenge was therefore to gather a text corpus that was limited enough to contain only a

1,000-word vocabulary, but also large enough in size to contain numerous occurrences of each word in that vocabulary in order to calculate reliable probabilities for word sequences. Simply put, the group faced the additional difficulty of compiling data that was not only large in quantity, but also relatively limited in variety.

252 Systems built in the 1970s and early 1980s typically contained relatively small, task-specific vocabularies. According to Victor Zue, former head of the Spoken Language Group and the Computer Science and Artificial Intelligence Lab at MIT, they were usually limited to vocabularies ranging from 10-200 words. See Victor W. Zue, “Comment on ‘Performing Fine Phonetic Distinctions: Templates versus Features,’” in Invariance and Variability in Speech Processes, ed. Joseph S. Perkell and Dennis H. Klatt (Lawrence Erlbaum Associates, 1986), 342. In a later paper, Zue notes that the work at IBM was a “notable exception” to the limited vocabularies common in speech recognition at the time. See “The Use of Speech Knowledge in Automatic Speech Recognition,” Proceedings of the IEEE 73, no. 11 (November 1985): 1602, footnote 1.

The CSR group first considered using a collection of IBM manuals, many of which were readily available as computer-readable data, but soon abandoned the option because the vocabulary was both too large and too technical, leading the group to believe that “it would be difficult to pass most of [the technical language of the manuals] off as English.”253 Believing that the best course might be to produce their own corpus, they enlisted the work of a “really, really fast typist,” who also happened to be the wife of one of the lab staff, to type approximately a million words of text from children’s novels, but there, too, the vocabulary proved larger than they hoped.254 Next, Fred Damerau at the Watson

Research Center informed the group of a pilot study being conducted by the U.S.

Patent Office for converting patents into computer-readable form.255 The material that had already been processed included a collection of patents in the field of laser technology. The patent collection was extensive enough in size and narrow enough in focus that the CSR researchers were finally able to collect a million words

253 Peter Brown and Robert Mercer, “Oh, Yes, Everything’s Right on Schedule, Fred” (Twenty Years of Bitext, Seattle, WA, October 18, 2013), http://cs.jhu.edu/~post/bitext/.

254 Robert Mercer, email to the author, May 8, 2015.

255 Mercer, interview with the author, 2015. The members of the CSR group cannot clearly recall who the original source of this information was. During our interview, Robert Mercer initially credited the finding to John Cocke, but then revised that “maybe it was Fred [Damerau].” In the bi-text presentation given by Mercer and Brown in 2013, they also cite Fred Damerau as the source. In the interview, they also could not give a definitive date for when they acquired this corpus. Robert Mercer guessed sometime in the mid-70s.

worth of sentences confined to a thousand-word vocabulary.256 However, though the resulting million-word corpus drawn from the laser patents was considered a

“naturally-occurring corpus” (in contrast to the artificial grammars designed by

IBM linguists), it was in fact meticulously produced. Even before discarding any sentences that were not composed entirely from the one thousand most frequently-occurring words,257 the complete patent text had to be “subjected to intensive and hand computerized editing”: eliminating duplicates, merging spelling variations, and substituting scientific symbols and formulas with words.258 The claims sections of the patents proved especially problematic due to the “highly stylized” legal language (for instance, the word “said” is used in place of “the” when referencing elements of an invention), and were ultimately excised altogether.259

256 The complete laser patent text corpus consisted of approximately 1.8 million words of running text (known in computational linguistics as “tokens”), with a vocabulary of 12,259 distinct words (or “types”), wherein different word forms of the same base word (e.g. “cat” and “cats”) are treated as different words. The size of the corpus consisting of sentences using only the 1,000 most frequently occurring words was roughly 3,800 sentences. See Lalit Bahl et al., “Speech Recognition of a Natural Text Read as Isolated Words,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’81., vol. 6, 1981, 1168.

257 Bahl, Jelinek, and Mercer, “A Maximum Likelihood Approach,” 189.

258 Lalit R. Bahl et al., “Automatic Recognition of Continuously Spoken Sentences from a Finite State Grammar,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’78., vol. 3, 1978, 422.

259 Ibid.

By the 1980s, hardware advances, including a speech processing chip designed by Gottfried Ungerboeck at IBM’s Zurich research facility, significantly improved system speed, allowing the engineers to expand their recognition vocabulary to 5,000 words. With a less constrained vocabulary size, the group was able to focus on a vocabulary suitable for business dictation, adding a number of additional text sources beginning in the late 1970s. These included approximately

60 million words of public domain book and magazine text from the American

Printing House for the Blind and a 2.5 million word office correspondence corpus supplied by Richard Garwin, an IBM physicist famed for designing the first hydrogen bomb, who maintained all of his correspondence in computer-readable form with the aid of four secretaries (Robert Mercer, interview with the author, May 5, 2015). Between 1982 and

1984, Peter Brown joined IBM Research and brought with him an Associated

Press/Reuters wire services corpus of 20 million words, courtesy of his previous employers at Verbex Voice Systems. By the mid-1980s, IBM also acquired an additional 25 million words from the US oil company Amoco, and a collection of unknown size from the Federal Register.

The most substantial collection in this period, however, came from a source that was unique to IBM: the landmark federal antitrust lawsuit filed against

IBM in 1969. The case, which spanned thirteen years before the Department of

Justice dropped the charges in 1982, included testimony from 974 witnesses, nearly 800 of which were taken by deposition, with resulting transcripts that

totaled over 100,000 pages. The operation to digitize the deposition transcripts onto computer-readable Hollerith punch cards during the trial was so prodigious that it required a staff of dedicated keypunch operators (most likely women) large enough to fill a “football field”-sized facility in White Plains, NY, where IBM’s

Data Processing Division was headquartered. While the suit itself proved extremely costly for IBM despite the ultimately favorable ruling, the transcript digitization efforts provided the CSR group with their largest and most robust text collection, resulting in a corpus of 100 million words that was varied enough to contain both “pickle” and “hacksaw.”260 Finally, in the mid-1980s the CSR group discovered that the Canadian Parliament had digitized transcripts of their official proceedings dating back to 1974. John Cocke had learned of the transcripts during a chance conversation with a fellow passenger on a flight. Shortly after, the group incorporated an additional 100 million words of English text from the parliamentary transcripts. The collection, known as the Hansard corpus, was particularly notable because the transcripts were maintained in both English and French, and led the CSR group to begin pursuing machine translation using the same statistical models being developed for speech recognition.

260 According to Robert Mercer and Peter Brown, John Cocke, who introduced the team to the deposition, had a test for whether a corpus vocabulary was truly “broad” based on the inclusion of three words: pickle, hacksaw, and a third which the two had forgotten. Of all the corpora they collected, only the deposition transcripts contained all three. See Brown and Mercer, “Oh, Yes, Everything’s Right on Schedule, Fred.”

The key distinction in statistical speech recognition was that statistical models were models of performance data, rather than models of the underlying process. What made the IBM approach “statistical” was that statistics were not used to represent, but to replace, language. Linguistic features were not assigned statistical values; rather, statistical inference became the method by which the features (that is, the patterns) of language were determined. Instead of quantifying linguistic features and reformulating them as statistical patterns, the

IBM approach identified statistical patterns within aggregate data and reformulated them as linguistic features.

The Way of the Machine

Having developed a framework for generating statistical models for the linguistic decoder using sample data, the engineers at IBM turned their attention to the acoustic “front-end” of the speech recognition system. The set-up of the acoustic signal processor, which prepared recorded speech as data, remained the final site of extensive human intervention in the IBM formula. As the IBM CSR group explained in a 1979 paper describing experiments with different acoustic processors, “statistical training methods eliminate much human intervention and subjective evaluation; however there are several areas in which the systems using

[our current acoustic processors] still depend on human intuition.”261 The acoustic processor was responsible for dividing up the recorded speech signal into segments that were appropriate for the recognition process and classifying each segment according to a finite inventory of possible acoustic-match unit types.262

This was necessary since speech recognition systems typically do not perform matches of acoustic-input-to-text on any given speech utterance in its entirety.

Except in systems designed for highly constrained speech input,263 the acoustic input from a complete speech utterance (such as a spoken sentence, for instance) must be broken up into segments of system-appropriate acoustic events (such as individual words) that can be matched one after another. The set-up of the acoustic processor thus effectively defined the constitutive elements of speech as an acoustic phenomenon.
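
A schematic of the segmentation-and-classification step (in Python, with invented feature values, a crude frame-by-frame segmentation, and a made-up three-label inventory; actual processors used far more elaborate signal processing): each short slice of the signal is reduced to a feature vector and assigned the label of the nearest prototype in a finite inventory.

# Toy acoustic labeling: classify each feature frame by its nearest prototype.
# Prototype values and frames are invented for illustration.
import math

prototypes = {           # finite inventory of acoustic-match unit types
    "L1": [0.2, 0.1],
    "L2": [0.9, 0.8],
    "L3": [0.5, 0.5],
}

def nearest_label(vector):
    return min(prototypes, key=lambda name: math.dist(vector, prototypes[name]))

def label_signal(feature_frames):
    """Turn a sequence of feature frames into a string of acoustic labels."""
    return [nearest_label(frame) for frame in feature_frames]

print(label_signal([[0.15, 0.12], [0.88, 0.79], [0.52, 0.47]]))   # ['L1', 'L2', 'L3']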

261 Lalit R. Bahl et al., “Recognition Results for Several Experimental Acoustic Processors,” in ICASSP ’79. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, 1979, 249.

262 By “type,” I refer to the distinct, abstract elements or categories that make up the acoustic-match units. For instance, for a system that uses phonemes as its acoustic-match units, each distinct phoneme would constitute a type, whereas individual instances and repetitions of that phoneme would be token occurrences of that type. Similarly, for a system that does word-level acoustic matching, there would be an acoustic type for each unique word in the vocabulary. The classification step of acoustic processing, which I will discuss more extensively later in this section, is important since, as we recall with the Audrey, repetition of the same word uttered by the same speaker will still result in acoustic variations. Thus, not all acoustic tokens of a given acoustic unit type will be the same, requiring the processor to identify sufficiently similar acoustic tokens in order to classify them within the same appropriate type category.

263 Systems that only accept a very small command vocabulary, for instance, might perform acoustic matches on a whole utterance based on the design condition that users will speak only a single command at a time, and the set of permissible utterances is limited. The Audrey works in this manner, accepting only ten words, where one word can be uttered and must be recognized before the next can be inputted.

It was standard practice across the speech recognition field to configure acoustic processing procedures according to linguistic principles. Acoustic segmentation and classification schemas were based on linguistically meaningful units, such as entire words or smaller phonetic and phonological categories such as phones, phonemes, and syllables. While word-level processing was typical through the 1950s and 1960s, phonetic/phoneme unit-level procedures were increasingly common by the time the IBM CSR group began work in the early

1970s due to growing interest in continuous speech recognition (as opposed to single-word commands) as well as growing vocabulary sizes, both of which made word-level acoustic units computationally inefficient to model.264 The researchers at IBM followed standard practice when it came to the acoustic processor for much of the 1970s, despite taking a radically different approach to the linguistic decoder back-end of the speech recognizer, using statistical models trained from data for both pronunciation and language. The CSR group only began experimenting with the removal of various phonetic and phonological constraints from the acoustic segmentation and classification process in 1977, and it was not

264 While it would be difficult to speak definitively about all the varied speech recognition experiments being undertaken across three decades, it was at least the internal perception within the field that “widespread interest” in continuous-speech recognition emerged in the early 1970s. In their introduction to the 1979 IEEE selected reprint volume on “Automatic Speech & Speaker Recognition,” N. Rex Dixon, then chairman of the IEEE Technical Committee on Speech Processing and speech processing researcher at IBM, and Threshold Technologies president Thomas B. Martin claim that techniques “developed for automatic recognition of isolated, single-word-length utterances” made up “the vast majority of recognition work prior to 1972.” N. Rex Dixon and Thomas B. Martin, “Introductory Comments,” in Automatic Speech & Speaker Recognition, ed. N. Rex Dixon and Thomas B. Martin, Selected Reprint Series (New York, NY: IEEE Press, 1979), 2.

until 1981, nearly a decade after they began research, that the CSR group finally ousted linguistic organization entirely from the acoustic processor design in favor of a machine-optimized speech alphabet.265

However, while phonetic and linguistic considerations were excised from acoustic processing procedures, human listeners did not disappear entirely from the conceptual design of the processor. The critical filter bands and feature selection element used in the processor, which effectively determined how acoustic measurements were taken and characterized, incorporated “a model based on the neural firings of the ear” that “considers the auditory nerve firing rates at selected frequencies as the features which define the acoustic input.”266 IBM CSR researchers Bakis and Cohen additionally noted that their model “seeks closer conformance with neurophysiological data than conventional threshold models of the ear” and uses time-scale and compression computations for a “new formulation” of the auditory model with “output [that] is appropriate for use directly in the front end of a speech recognition system.”267 What had been removed were considerations that required the active intervention of human judgment and interpretation, in the form of phonetic prototype selection and

265 Lalit R. Bahl et al., “Automatic Recognition of Continuously Spoken Sentences from a Finite State Grammar,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’78., vol. 3, 1978.

266 Raimo Bakis and Jordan Rian Cohen, Nonlinear signal processing in a speech recognition system, EP0179280 A2, filed September 20, 1985, and issued April 30, 1986, http://www.google.com/patents/EP0179280A2.

267 Ibid.

linguistic constraints. The human body, and its measurable physiological responses, remained present, so long as they could be reformulated to produce outputs optimized for processor performance. The removal of language was thus defined according to the needs of processor optimization and automation.

This new “fenonic”268 definition of speech was developed for the purpose of automating acoustic segmentation and classification. In displacing phonetic elements as the defining units of speech, however, the invention of the “fenone” did more than automate key steps in training statistical models for acoustic recognition. It enabled the system to statistically generate and define the very units by which speech was to be recognized, marking what Peter Brown identified as the defining moment in which speech recognition “all became, you know, automatic.”269
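
A minimal sketch of the underlying idea (a bare-bones k-means clustering over invented two-dimensional feature vectors, not the CSR group’s actual procedure): the inventory of acoustic labels is itself derived by clustering observed acoustic vectors, so that the “alphabet” of recognition units is generated from the data rather than chosen from phonetic categories.

# Toy k-means: derive k acoustic prototypes ("labels") directly from observed data.
# The feature vectors and the choice of k are invented for illustration.
import math, random

def kmeans(vectors, k, iterations=20, seed=0):
    random.seed(seed)
    centers = random.sample(vectors, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda j: math.dist(v, centers[j]))
            clusters[nearest].append(v)
        for j, cluster in enumerate(clusters):
            if cluster:   # recompute each center as the mean of its cluster
                centers[j] = [sum(dim) / len(cluster) for dim in zip(*cluster)]
    return centers

frames = [[0.1, 0.2], [0.15, 0.18], [0.8, 0.9], [0.82, 0.85], [0.5, 0.45]]
print(kmeans(frames, k=3))   # three data-derived prototypes, no phonetic labels involved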

268 The technique used for generating fenemic labels is first described in the 1981 paper presented by Nadas et al. at the IEEE International Conference on Acoustics, Speech, and Signal Processing, though it is referred to generically as a method for “automatically obtaining a set of acoustic prototypes for use by a centisecond labeling acoustic processor . . . based on clustering.” A. Nadas et al., “Continuous Speech Recognition with Automatically Selected Acoustic Prototypes Obtained by Either Bootstrapping or Clustering,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’81., vol. 6, 1981, 1153. The first published reference of the feneme/fenone terminology appears in two patents, US4833712 and US5165007, originally filed by Bahl et al. in 1985 (though they were not granted until 1989 and 1992, respectively). US4833712 also makes reference to an anonymously authored research disclosure titled “Composite Fenemic Phones” dated August 1985. Lalit R. Bahl et al., Automatic generation of simple Markov model stunted baseforms for words in a vocabulary, US4833712 A, filed May 29, 1985, and issued May 23, 1989; Lalit R. Bahl et al., Feneme-based Markov models for words, US5165007 A, filed June 12, 1989, and issued November 17, 1992.

269 Brown, Interview with the author, 2015.

To understand what made acoustic segmentation and classification (also sometimes referred to as “labeling,” since segments are classified by being assigned a reference “label,” or acoustic symbol, transcribing the signal into a label string270) so challenging to automate, we can compare the training data required for the two types of statistical model used in the Tangora’s linguistic decoder: the acoustic model (which I will refer to as the “pronunciation” model to prevent any confusion of terms with other aspects of acoustic processing, though the acoustic model in fact models both the speaker’s pronunciation and the effects of the recording/processing apparatus as a single function) and the language model. Together they make up what IBM researchers referred to as speech recognition’s “fundamental equation”:

Ŵ = arg max_W P(W) P(A|W)

The language model, P(W), is the probability of a given word W (where W is drawn from a finite vocabulary of permissible words) occurring based on one or more of the preceding words in a sequence.271 The acoustic model, P(A|W), is the conditional probability that a given word W, once spoken aloud and then passed

270 I am using the term “symbol” here in the formal language sense, meaning an identifier for a member of a base set of possible events known as an “alphabet” (a common “alphabet” is the binary alphabet, which contains the members 1 and 0). So in the context of speech recognition that I am describing here, an acoustic symbol is an identifier of some member element from a defined set of possible acoustic events. Similarly, “string” here is being used in the broader mathematical sense from formal language, referring to a finite sequence of symbols, and not necessarily in reference to the string datatype in computer science.

271 This is what is commonly known as an n-gram model, where n is the total length of the sequence considered in calculating the probability of each word in the series. The Tangora’s language model used trigrams, meaning the probability for the next word in a series was based on the two words preceding it. No explicit reason for this is given, other than effectiveness observed through trial-and-error.

!155 through the acoustic processor front-end, would produce the acoustic label string

A (where A is one or more labels from a finite inventory of possible acoustic events). Both models are expressed as Markov processes, with statistical parameters assigned to their respective probability distributions, the values of which are estimated based on training data.
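Written out a little more fully, the decomposition behind this “fundamental equation” can be restated as follows; this is a conventional presentation of the same relation rather than notation taken from the IBM papers themselves:

```latex
% \hat{W}: the word sequence the recognizer outputs, chosen over candidate
% sequences W from the finite vocabulary, given the observed acoustic
% label string A.
\hat{W} \;=\; \operatorname*{arg\,max}_{W} \, P(W \mid A)
        \;=\; \operatorname*{arg\,max}_{W} \, P(W)\, P(A \mid W)
```

The middle expression follows from Bayes’ rule: because P(A) does not depend on the candidate word sequence W, it can be dropped from the maximization, leaving only the language model and the acoustic (pronunciation) model.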

The automatic training of statistical parameters for the language model

P(W) for calculating the probability of word sequences proved computationally straightforward. Since W was imagined to be the original text prior to speech, the statistical parameters of the language model were trained on text data alone. Once the CSR group acquired sufficient quantities of computer-readable text data that adhered to the vocabulary-size limits set by the system design, the task of segmenting and labeling a corpus of natural language text for pattern recognition was easily automated, since text data is formally receptive to automatic

!156 segmentation and classification.272 English text uses a space as a divider symbol between words, providing a formal (and thus algorithm-friendly) indicator of unit boundaries that can be used to trigger segmentation automatically. Meanwhile standard orthography also allows for the reasonable expectation that repeat token occurrences of each unique word type in the recognizer vocabulary would be identical.273 A computer can therefore segment and label a text corpus into the appropriate units for use as training data without knowing the linguistic meaning

272 In order for statistical parameters in a model to be “trained,” there has to be data where the relevant information being estimated by the model is known. The parameters of the model are then inferred from patterns in this sample data. Thus, to train the model for P(W), which is the probability that word type W occurs (where W is some word element in the recognizer vocabulary), you need a text corpus that is broken up into units that correspond to individual words, and where each unit is correctly identified as an element in the recognizer vocabulary. In other words, a text corpus, as far as the computer is concerned, is a long string of characters (which are encoded as numerical values). That first has to be broken up into smaller strings corresponding to individual words (segmentation) and then each of these smaller strings has to be identified as an instance of one of the word type elements in the vocabulary (classification/labelling). For instance, the single text string “how to wreck a nice beach” needs to be first broken up into six smaller text strings (“how,” “to,” “wreck,” etc.). Then each of these strings has to be identified as an instance of the vocabulary element, so that the text string “how” is labeled to correspond to the recognizer vocabulary element [how], thus allowing the computer to track patterns in the occurrences of [how] in the corpus, from which it can estimate P(W), where W = [how].

273 Techniques necessary to account for misspellings and spelling variation are outside the scope of the current discussion. For the sake of simplicity, let us assume that a word type is a unique sequence of letters, such that a misspelling would produce a new word type. The key point here is that repeated words in text (tokens) are computationally uniform.

!157 or function of any particular word (or if any given string of characters is a word at all).274
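A minimal sketch of the segmentation, labeling, and counting described here (and in notes 271 and 272) might look something like the following; the toy corpus, the function names, and the absence of any smoothing are illustrative simplifications rather than IBM’s implementation:

```python
from collections import Counter

def segment(text):
    # English text uses the space as a divider symbol, so segmentation
    # can be triggered automatically on whitespace.
    return text.lower().split()

def train_trigram_counts(corpus_lines):
    # Labeling: each token string is classified as a vocabulary element
    # simply by its spelling, since repeat tokens are computationally uniform.
    trigrams, bigrams = Counter(), Counter()
    for line in corpus_lines:
        words = segment(line)
        for i in range(2, len(words)):
            context = (words[i - 2], words[i - 1])
            bigrams[context] += 1
            trigrams[context + (words[i],)] += 1
    return trigrams, bigrams

def p_next(trigrams, bigrams, w1, w2, w3):
    # The trigram language model probability P(w3 | w1, w2), estimated by
    # relative frequency over the training corpus.
    context = (w1, w2)
    if bigrams[context] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[context]

corpus = ["how to wreck a nice beach", "how to recognize speech"]
tri, bi = train_trigram_counts(corpus)
print(p_next(tri, bi, "how", "to", "wreck"))  # 0.5 in this toy corpus
```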

Preparing data for the pronunciation model P(A|W) was considerably more complex, as it required not only a segmented and labeled text corpus, but the corresponding speech data. As with the text corpora, the speech signal had to be broken up into segments that could be mapped onto individual words from the text corpus, such that patterns in the occurrence of some acoustic event A in correspondence to some text event W could be tracked. Unlike text, however, there are no conventional indicators for where a word begins or ends in an acoustic speech signal, nor any standard length for its duration, which varies with speaking speed, pronunciation, and surrounding words. How then could one find a way to formally define even one word, let alone words as a comprehensive category, in computational terms?

As I briefly mentioned earlier, most researchers working on experimental systems prior to the 1970s side-stepped the problem, by focusing on isolated-word systems that would only accept and process speech input one word at a time. This

274 Word segmentation and classification is, of course, not so simple for languages that do not follow the convention of using spaces or other word dividers. For instance, in written Chinese (and other languages that adopted Chinese characters into their writing systems), the text is divided into individual characters, but words often consist of more than one character and readers rely on syntactic and semantic context to identify word boundaries. Consider an example adapted from Sproat et al. quoted in Daniel Jurafsky and James H. Martin, Speech and Language Processing, 1st Edition (Prentice Hall, 2000): the characters “日文章魚” in the sentence “日文章魚怎麼說” (how do you say octopus in Japanese [Japanese octopus how say]) could be divided to make the words “Japanese (日文 [rì-wén]) octopus (章魚 [zhāng-yú])” or the words “Japan (日 [rì]) essay (文章 [wén-zhāng]) fish (魚 [yú]).”

!158 form of “presegmented”275 solution where the speech signal was effectively segmented by the speaker prior to processing offered only a temporary reprieve, one which only lasted so long as vocabularies remained strictly limited.276 As briefly discussed in the previous chapter, in order to perform recognition, the system required an acoustic prototype, such as a reference template, a deterministic rule-set, or statistical parameters, that described what to expect when a particular word was uttered. Some words required multiple prototypes due to distinct pronunciations or other significant contextual variations. And the inherent variability of the speech signal additionally meant that multiple examples of each pronunciation of each word had to be analyzed and synthesized to produce a single prototype reference (the reference templates in

275 Dixon and Martin, “Introductory Comments,” 2.

276 This is not to be confused with isolated or discrete speech recognition, which used deliberate pauses between words, and was still used in systems that relied on sub-word units or had larger vocabularies. Discrete speech in fact remained the default in speech recognition well into the 70s and 80s, as continuous speech presented a whole set of additional recognition challenges beyond basic word segmentation. The Tangora, in fact, remained an isolated-speech recognition system through the 1980s, and continuous speech was incorporated only in 1989. Bahl et al., “Large Vocabulary Natural Language.” However, IBM had designed and built a number of more limited experimental systems using continuous speech through the preceding decades. For descriptions, see Lalit Bahl et al., “Preliminary Results on the Performance of a System for the Automatic Recognition of Continuous Speech,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’76., vol. 1, 1976, 425–29; R. Bakis, “Continuous Speech Recognition via Centisecond Acoustic States,” The Journal of the Acoustical Society of America 59, no. S1 (April 1, 1976): S97–S97; Lalit Bahl et al., “Recognition of Continuously Read Natural Corpus,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’78., vol. 3, 1978, 422–24; Bahl et al., “Automatic Recognition of Continuously Spoken Sentences from a Finite State Grammer”; Lalit Bahl et al., “Further Results on the Recognition of a Continuously Read Natural Corpus,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’80., vol. 5, 1980, 872–75.

!159 Audrey, we might recall, required “approximately 100 repetitions of each of the digits [in its vocabulary]”).277 The storage and processing demands, not to mention the human labor required to produce sufficient acoustic samples to formulate acoustic prototypes, grew with the vocabulary size.

In order to handle growing vocabularies, researchers began defining acoustic inventories, or the set of possible acoustic events, as “alphabets” of sub-word units that could serve as building blocks for whole words, thus allowing for a recognizer to expand its vocabulary without significantly expanding its repertoire of acoustic prototypes.278 As a result, deliberate pauses between words no longer circumvented the problem of segmentation, since the acoustic signal for each word still had to be segmented into smaller units for recognition, and speakers could not reliably be tasked with “presegmenting” their speech by pausing between each phoneme or phone. The problem of acoustic segmentation thus had to be addressed in order for recognizer vocabularies to expand.

The IBM CSR group’s initial speech recognition experiments were conducted using the Modular Acoustic Processor (MAP), an acoustic processor that IBM originally built as part of a four-year speech recognition project developed under contract with the US Air Force from 1967-1971. This original

277 K. H. Davis, R. Biddulph, and S. Balashek. “Automatic Recognition,” 641.

278 For instance, the Fry and Denes system contained a prototype “alphabet” of thirteen phonemes that could be combined into a vocabulary of 200 English words. Denes, “Automatic Speech Recognition,” 1.

!160 project was developed by the Speech Processing Technology department in IBM’s

Systems Development Division, based out of their North Carolina research facility. The MAP was then transferred to the newly-formed CSR group in

Yorktown in 1972, alongside a number of speech scientists including N. Rex

Dixon, one of MAP’s original developers, as part of what later proved to be a short-lived effort to include linguistic expertise. The MAP segmented and classified the acoustic signal phonetically,279 according to an alphabet of sixty-two

279 The distinction between “phonetic” and “phonemic” sound units is not always made consistently in the technical publications from the IBM CSR group. While many of the early publications in the mid-1970s mark this distinction clearly, the usage becomes less apparent in later documents. There are a number of likely reasons for this slipperiness. The first is the fact that the CSR group started out with linguistics and speech experts on the team in the early and mid-1970s, but all of them had left or been transferred by the time the fenonic system was established, making the inattention to linguistic terminology a symptom of a general disinterest in linguistics. Additionally, phones are the physically or perceptually distinct sounds of speech, whereas phonemes are the linguistically distinct sounds of speech in a given language, such that a single phoneme may have more than one associated phone, depending on the language. Speech recognizers through this period were unilingual, so the primary difference between phonetic and phonemic-baseform systems was whether unique phones were grouped according to phonemes at the acoustic processing stage or the linguistic decoding stage. Since all acoustic processors after the MAP delayed these grouping decisions to the linguistic decoding stage, there was no longer a technical distinction between phonetic and phonemic segmentation, as far as the acoustic processor was concerned.

!161 phones as the set of possible acoustic events.280 To do so, it compared the digitized spectral samples from the acoustic signal to reference prototypes for each of the phones to produce an initial match of several likely candidates, then it used a set of seventy-two phonological rules to determine the duration and phonetic identity of an acoustic event region, followed by a further sixty-eight rules to evaluate the identification.281 Both the reference prototypes and phonological rules had to be developed manually by the research team’s

280 Bahl et al., “Automatic Recognition of Continuously Spoken Sentences from a Finite State Grammer,” 418. The size of the phonetic alphabet used in the acoustic processor was very much in flux through this period. Additionally, the dates for when changes were made to the acoustic processor component alone and when these changes were incorporated into the overall speech recognition system differed. A 1976 paper reporting experimental results performed on the full recognizer system, which had been operational beginning in January 1975, lists the use of a 62-phone alphabet. See Bahl et al., “Preliminary Results on the Performance of a System.” However, another 1976 paper describing the MAP component alone lists an alphabet of only thirty-three phones. See N. Dixon and H. Silverman, “The 1976 Modular Acoustic processor(MAP),” IEEE Transactions on Acoustics, Speech, and Signal Processing 25, no. 5 (October 1977): 367–79. Then a 1978 report of recognizer results additionally mentions the design of a new recognition system that became operational in August 1977 and used only a thirty-three-phone alphabet as well. Bahl et al., “Recognition Results for Several Experimental Acoustic Processors,” 419. This August 1977 system contained a new acoustic processor system, the CSAP-1, which had inherited the acoustic prototypes originally produced for the MAP. See Bahl et al., “Recognition Results for Several Experimental Acoustic Processors,” 249. Prior to being transferred to the CSR group in Yorktown, at the end of the four-year engagement with the US Air Force in 1971, C.C. Tappert, the lead on the project, described the MAP as containing forty-four phonetic classes. C. C. Tappert, “A Preliminary Investigation of Adaptive Control in the Interaction Between Segmentation and Segment Classification in Automatic Recognition of Continuous Speech,” IEEE Transactions on Systems, Man, and Cybernetics SMC-2, no. 1 (January 1972): 67. The MAP system was developed over the course of nearly a decade, from 1967 to 1976, and transferred between two projects carried out by two separate research groups based in IBM divisions with different research mandates. The research outcomes of the first four years of development were in fact not set by IBM at all, but heavily constrained by “contractual requirements” with the US Air Force. Dixon and Silverman, “The 1976 Modular Acoustic processor(MAP),” 367. As a result, the MAP had undergone iterative changes in a number of respects, including the number of phonemic prototypes included in its acoustic inventory.

281 Dixon and Silverman, “The 1976 Modular Acoustic processor(MAP),” 370.

!162 phoneticians through a combination of linguistic principle and trial-and-error.282

Thus, though the statistical models for the “decoder” back-end of the recognition system were “automatically” trained using data, the process of segmenting and labeling the speech data needed for training (as well as simply processing unknown speech input when actually using the recognizer) required extensive manual set-up and calibration.

It is important also to clarify that the work of selecting reference prototypes and phonological rules was not simply a one-time design task in engineering. I refer to this as the “set up” of the acoustic processor because it had to be undertaken each time the speaker was changed, since the acoustic feature measurements vary drastically between speakers, requiring new prototypes and rule adjustments. Thus, it was not simply that the overall design of the system required expert judgment (as all technical systems do), but that the overall design was such that key tasks in preparing the system for operation required variable expert input.283 In other words, the segmentation and classification mechanisms

282 N. Dixon and H. Silverman, “A General Language-Operated Decision Implementation System (GLODIS): Its Application to Continuous-Speech Segmentation,” IEEE Transactions on Acoustics, Speech, and Signal Processing 24, no. 2 (April 1976): 137– 62. See also Jelinek, “Continuous Speech Recognition by Statistical Methods” and Bahl et al., “Speech Recognition of a Natural Text Read as Isolated Words.”

283 We might compare the acoustic prototypes and phonological rules to the feature extraction mechanism in the acoustic processor. Feature extraction, that is, the use of critical band filters to take spectral measurement at set intervals, is based on neurophysiological research on human hearing and perception. However, once the measurement frequencies are set, they do not change. They are part of the static design of the system.

!163 make up part of the dynamic modeling process, rather than the static system design.

Beginning in 1977, the CSR group experimented with the removal of phonetic rules and segmentation, adopting a series of “centisecond-level modeling” processors. These processors effectively performed the initial prototype match, labeling each centisecond spectral sample according to the most similar prototype from the phone alphabet, and delayed segmentation decisions and the evaluation of the initial match to the linguistic decoder. This allowed the

CSR group to remove explicit phonetic segmentation from the acoustic processor, absorbing the task into the statistical parameters of the pronunciation model through expanded training, but the dependence on expert-produced reference prototypes remained much as it had with the MAP. The initial prototype match remained so similar, in fact, that the first of these new centisecond-level processors, the

CSAP-1, used the thirty-three prototypes of steady-state phones in American

English that were originally intended for the MAP.284 Between 1977 and 1980, the

CSR group iterated on this basic processor set-up, expanding the number of acoustic-phonetic prototypes to 45 in the CSAP-2 and finally 200 in the

CSAP-200, which included multiple prototypes for every phone to account for varied phonetic contexts.

284 Bahl et al., “Recognition Results for Several Experimental Acoustic Processors,” 249.

!164 Despite the drastic improvements to recognition performance compared to the 1975 system using the MAP,285 the researchers at IBM sought to move away from the reliance on phonetic prototypes, rejecting them as “error-prone . . . [and] subject to the skill of the individual in charge.”286 This was due to the fact that the prototypes were the result of a process “based on a priori human linguistic knowledge” where a phonetician would listen to a speech recording while examining the corresponding spectrographic images and “carefully select” several exemplary tokens of each desired phone.287 These would then be averaged together to form a single prototype.288 In the earliest published accounts of their move away from manual acoustic segmentation and prototyping in the early

1980s, the CSR team emphasized the laboriousness of the process, noting that “since it requires extensive human intervention, it cannot be used in a system which must automatically train to new speakers,” while also mentioning as a seemingly

285 Tests using a natural language corpus run on the 1975 system using the MAP front-end displayed a word-error rate of 33.1%. Bahl et al., “Continuously Read Natural Corpus,” 424. Meanwhile, experiments conducted on a 1980 system with the CSAP-200 front-end using the same test sentences resulted in an error rate of only 8.9%. Bahl et al., “Further Results on the Recognition of a Continuously Read Natural Corpus,” 872.

286 Das and Picheny, “Issues in Practical Large Vocabulary Isolated Word Recognition,” 467.

287 Lalit R. Bahl et al., “Automatic Construction of Acoustic Markov Models for Words,” in 1st IASTED International Symposium on Signal Processing and Its Applications (Brisbane, Australia, 1987), 566.

288 Bahl et al., “Further Results on the Recognition of a Continuously Read Natural Corpus,” 873. See also Nadas et al., “Continuous Speech Recognition with Automatically Selected Acoustic Prototypes,” 1153.

!165 secondary concern that the process was also “to some extent arbitrary.”289 The need for extensive human intervention was depicted primarily as a barrier to scale, and thus commercial viability, since phoneticians could not be shipped with software packages. By the time fenonic prototypes were established for the

Tangora in the latter half of the 1980s, however, the perception of linguistic expertise as “subjective, difficult to extract, and subject to error”290 emerged as the central concern, with error-reduction and lack of subjective or “ad-hoc” selection often cited as the sole advantage of non-phonetic prototypes.291 The push to automate even the configuration of the recognizer was thus simultaneously animated by an underlying commercial interest as well as a conceptual privileging of algorithmic procedurality for its consistency.

The IBM CSR group first described the method of automatically generating acoustic prototypes that would later be known as “fenones” in a paper presented at the IEEE International Conference on Acoustics, Speech, and Signal

Processing in 1981, where they referred to a new processor named

AUTOCLUST.292 The paper actually discussed two different acoustic processors

289 Nadas et al., “Continuous Speech Recognition with Automatically Selected Acoustic Prototypes,” 1153.

290 Bahl et al., “Automatic Construction of Acoustic Markov Models,” 566.

291 By 1979, IBM had already developed a phonetic acoustic processor that could be adapted automatically to new speakers, provided it had an initial set of manually selected prototypes, but mentions of this processor were absent from publications discussing fenones, despite mentions of older processors that preceded it.

292 Nadas et al., “Continuous Speech Recognition with Automatically Selected Acoustic Prototypes.”

!166 that generated prototypes automatically, the other being the TRIVIAL (short for

Training by Iterative VIterbi ALignment), which had debuted two years prior. The major difference between the two processors, however, was that the

TRIVIAL only adapted prototypes to new speakers automatically. It still required an initial set of hand-picked prototypes, which were then adjusted based on sample speech from the new speaker.293 In other words, the class of acoustic phenomena being represented through reference prototypes and the constituent elements of that class were still determined according to phonetic definitions based on “subjective evaluation . . . [and] still depend on human intuition,”294 even though the specific spectral values that made up each prototype were adjusted automatically. In contrast, the AUTOCLUST processor generated the prototypes based entirely on the automatic processing of a speech sample without relying on existing expert-selected prototypes, using a clustering algorithm to find the optimal divisions of the sample data.295 In doing so, the method used in

AUTOCLUST not only automatically evaluated the spectral parameters of each prototype, but redefined the class of acoustic phenomena used to segment and label the speech signal—the set of elements that the acoustic prototypes actually reference—according to the proclivities of the machine. In other words, rather

293 Bahl et al., “Recognition Results for Several Experimental Acoustic Processors,” 249.

294 Ibid.

295 Nadas et al., “Continuous Speech Recognition with Automatically Selected Acoustic Prototypes,” 1153.

!167 than pursuing the computationally-difficult task of automatically specifying phonetic units, which were sound units defined by human speech perception, the

CSR group simply redefined speech according to a set of artificial sound units that were easy to algorithmically specify.

They dubbed these machine-optimized units a “fenonic” alphabet, a neologism that combined the abbreviation of “FE” for “Front End,” referring to the acoustic signal processor, with “the suffix ‘nonic’ . . . to lend the term scientific respectability.”296 Rather than specifying the acoustic speech signal according to physically and perceptually distinct features of speech, fenones defined speech in explicit reference to the operational features of the “front end” acoustic processing apparatus. Similar to the training of hidden Markov models for the linguistic decoder, generating fenonic prototypes for acoustic processing employed a “data-driven technique . . . [that] rests on determining the constituent

296 Jelinek, Statistical Methods for Speech Recognition, 46. As briefly mentioned earlier, variations in the terminology exist, with documents referring both to “fenone”/“fenonic” and “feneme”/“fenemic” units. These appear to be based on the “phone”/“phonetic” and “phoneme”/“phonemic” distinction in linguistics. I will use “fenone”/“fenonic” throughout (except in direct quotations) to avoid confusion as this is the terminology used in published references in the 1990s, once both the method and language had settled. See also Das and Picheny, “Issues in Practical Large Vocabulary Isolated Word Recognition.” Additionally, by the time IBM began exploring fenone-level models in the 1980s, they had already abandoned phonemic units in favor of phonetic ones, arguing as early as 1976 that the phone would better “lend itself to convenient pattern recognition by an acoustic processor” as phones, which depend solely on acoustic rather than linguistic differences, indicated “such regions of the recognition space which the acoustic processor is particularly capable of identifying.” See Jelinek, “Continuous Speech Recognition by Statistical Methods,” 533.

!168 structure [of speech] . . . automatically from the speech data itself.”297 In other words, data was not only used to generate models, but also to determine the set of elements to be modeled. If hidden Markov models expressed probabilistic relationships between acoustic and text elements, fenonic modeling went one step further, using statistical optimization to define the acoustic elements being modeled.

In place of phonetic units, the acoustic partitions were determined through the same process by which the signal is made legible to computers: vector quantization. Vector quantization, as discussed earlier in relation to the Audrey system at Bell Labs, is a basic function in signal processing that transforms a continuous acoustic signal into a digital sequence of discrete numerical value-sets.

It is done by transforming each sample frame of a speech signal (produced during the sampling process of turning a continuous spectral waveform into a discrete digital signal) into a set of energy measurements taken at set positions across the frequency range, resulting in a “vector,” or set, of numerical parameters representing the signal at any given time interval. In order to reduce some of the variability produced by the continuous signal, quantization also adjusts the numerical parameters of each vector to those of the nearest digital “step” defined by the resolution level of the processor, like rounding a decimal to the nearest

297 Das and Picheny, “Issues in Practical Large Vocabulary Isolated Word Recognition,” 476.

!169 integer.298 For the fenonic modeling in the Tangora, this finite set of vector parameters constituting the “steps” used to constrain each sample frame in the quantization process becomes the acoustic prototype alphabet used to characterize the speech signal.
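As a rough illustration of the quantization step just described, the sketch below snaps an observed feature vector to the nearest prototype (“step”) in a small made-up codebook; the dimensions and values are toy figures rather than the Tangora’s eighty-one-parameter vectors:

```python
import numpy as np

def quantize(frame_vector, codebook):
    # Adjust the observed vector to the nearest prototype in the codebook,
    # returning that prototype's label (here, simply its index).
    distances = np.linalg.norm(codebook - frame_vector, axis=1)
    return int(np.argmin(distances))

# A toy codebook of four two-dimensional prototypes; the account above
# describes far larger vectors (81 parameters) and alphabets (about 200).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(quantize(np.array([0.9, 0.2]), codebook))  # label 1

# Rounding a decimal to the nearest integer is the same operation with
# single-parameter vectors (see note 298): the "codebook" is the integers.
print(round(3.7))  # 4
```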

A more familiar example of this procedure is the process of color quantization in digital imaging, where an image’s continuous colors or tones are pixelated and reduced to a limited palette containing a set number of available colors (see figure 14).

Figure 14 Color quantization using K-means clustering. The original image (left) was reduced using k-means to 16 colors (k = 16). Source: original image by Joshua Strang, 050118-F-3488S-003, Photograph, January 18, 2005, Wikimedia Commons; derivative rendering by King of Hearts, November 20, 2012, Wikimedia Commons.

This is done by partitioning the image into sections and reassigning the color value of each element in that section to the closest match to the “average” color of that region. We might think of the prototype fenone labels as the palette of available colors. In order to determine which colors are included in this palette, we need to find the optimal colors to minimize information loss.

That is, we must decide which colors, depending on the size of the palette, will

298 In fact, we can think of rounding a number to the nearest integer as a quantization process using single-parameter vectors.

!170 result in a quantized image that remains as close to the original as possible. To do so, we cannot simply take the entire range of the color spectrum within an image and divide up the palette evenly across it—while the range of colors may be broad, it might be heavily composed of blues or feature greater detailed variations in a limited range, such that you would want more shades of blue in your palette in order to minimize the loss of detail in the image overall. Fenone prototypes must similarly be those sound units which minimize the loss of acoustic information as each vector is adjusted to match an existing fenone. In contrast to phonetic prototypes, which are selected to define sound units that are acoustically meaningful—that is, perceptually distinct in human speech—fenonic prototypes are selected to define sound units that are informationally optimal, regardless of whether or not these units match up to distinct speech sounds.
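The palette-selection analogy can be sketched in a few lines using an off-the-shelf clustering routine; the random pixel array is stand-in data, and scikit-learn’s KMeans is simply one readily available implementation of the clustering procedure described in the following pages:

```python
import numpy as np
from sklearn.cluster import KMeans  # one readily available k-means implementation

# Stand-in "image": 10,000 random RGB pixels with values in [0, 1].
pixels = np.random.default_rng(1).random((10_000, 3))

# Find the 16 palette colors (k = 16, as in figure 14) that minimize the loss
# incurred when every pixel is reassigned to its closest palette entry.
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)
palette = kmeans.cluster_centers_             # the available "colors"
quantized = palette[kmeans.predict(pixels)]   # each pixel snapped to its palette color
```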

The number of fenone prototypes used for acoustic processing and modeling was based on the acoustic resolution of the signal processor, which was pre-set. The fenonic alphabet used in the Tangora included 200 prototypes, a number that was chosen primarily for evaluation purposes, to ensure similar training requirements as a phonetics-based processor in order to compare performance results.299 The fenonic prototypes themselves, however, had to be

299 Bahl et al., “A Method for the Construction of Acoustic Markov Models for Words,” 446. Note that the fenonic and phonetic models had approximately the same number of parameters to be trained, not the same number of prototypes in their respective alphabets. The phonetic alphabet used had only fifty-five prototypes, but a greater number of output and transition distributions to train per prototype.

!171 generated using a speech training sample from the user, since they characterized the optimal divisions of the data itself, rather than representing a priori phonetic elements. This was carried out using a clustering procedure known as the K-

Means algorithm, which was essentially an iterative method for organizing any given number (m) of observations into a set number (k) of mathematically optimal groupings, where k is the desired size of the alphabet (k = 200 in the case of the

Tangora).300

In order to generate the 200 fenone prototypes for the Tangora, a five-minute training sample of the user’s speech was run through the acoustic processor,301 where it was sampled once every centisecond (10 milliseconds) to produce m =

30,000 centisecond frames, or observations. Each centisecond frame is measured

300 Das and Picheny, “Issues in Practical Large Vocabulary Isolated Word Recognition,” 478. Das and Picheny do not give a precise number, instead noting that the processor included a fenone alphabet of “about 200.” It’s likely that after the initial experiments in 1981, the CSR researchers made small adjustments to the number of prototypes based on trial-and-error. Since fenones don’t map onto any existing acoustic categories (the 200 phonetic prototypes in the CSAP-200 alphabet represented the common phones in English, with multiple prototypes for each phone representing common acoustic variations based on linguistic context), the number of prototypes is essentially arbitrary. It was likely that the number was initially set at 200 as an experimental control in order to compare results with phonetic processors, and later adjusted incrementally based on performance results.

301 The fenonic prototypes were derived from the first five minutes of the total speech training sample that was used to train the overall pronunciation models for the speech recognizer in the experiments that were published in 1981. Nadas et al., “Continuous Speech Recognition with Automatically Selected Acoustic Prototypes,” 1154. These experiments were performed on a version of the Tangora that had a 1000-word vocabulary. Later iterations with different vocabulary sizes and tasks may have varied the length of the speech sample used for defining the fenones.

!172 at different amplitudes and those measurements are processed302 to produce a vector containing eighty-one parameter coordinates. These vectors are conceptually plotted into an 81-dimensional space, which is then partitioned into

200 spaces at random.303 The centroid, or geometric mean, is then found for each partition and the parameter vectors of these centroids are assigned as the initial

200 k-means. Once initialized, the first step is to assign each of the m observations to a cluster according to the nearest k-mean, and redraw the partitions around these clusters. The centroid of each of these new partitions is then calculated and becomes the new k-mean for that partition. The optimality of these new 200 k-means is then compared to that of the previous set. If the average distance (or

“distortion”) between the observations within an assigned cluster and the k-mean is less than that of the previous set of clusters, thus indicating improved groupings, the process returns to the first step and runs again. Eventually, the space will be partitioned in such a way as to produce optimal clusters, when the distortion stops decreasing

302 The signal processing procedure contains a number of scaling, rotation, and compression steps, some of which are repeated. There are, in fact, eighty-one measurements taken from each centisecond frame (eighty energy concentration measurements at different amplitudes, and one measurement of the total energy of the signal in that frame), but they are not the final eighty-one coordinates. That is because each vector contains the measurements taken from three adjacent centisecond frames (the current, previous, and next) in order to increase the amount of information contained in each vector. The measurements of each centisecond sample and the two adjacent to it are each compressed from eighty-one to twenty-seven, and then concatenated together to produce the final eighty-one coordinates for that centisecond sample.

303 There are different strategies for selecting the initial partition used when running a k- means algorithm, but random is one of the most common, and the one described in the AUTOCLUST processor. See Nadas et al., “Continuous Speech Recognition with Automatically Selected Acoustic Prototypes,” 1155.

!173 with each iteration. The final partitions are now the boundaries of the acoustic spaces that define each fenone, and the vector coordinates for the centroid of each partition become the prototype for that fenone and are identified by a label.304 The vector coordinates for the final set of k-means thus become the fenonic alphabet that defines the set of possible sound units of speech. Once implemented, when an unknown speech utterance is fed into the acoustic front-end, it will be sampled and transformed into a sequence of numerical vector coordinate-sets, each of which will be labeled according to the prototype of the acoustic partition into which it falls, resulting in a string of fenonic labels. Since it is this string of labels that is passed to the linguistic decoder back-end for matching, the original parameter values of the observed vector are effectively constrained and replaced by the parameter values of the prototype identified by the label.
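The procedure described over the preceding paragraphs can be summarized in the following sketch, which derives a small fenonic alphabet from training frames and then labels an unknown utterance; the function names, the tolerance threshold, and the toy sizes are illustrative choices, and a real front end would compute distances in blocks rather than all at once:

```python
import numpy as np

def train_fenone_alphabet(frames, k=200, seed=0, tol=1e-4):
    # frames: one feature vector per centisecond of training speech
    # (roughly 30,000 x 81 for a five-minute sample). Returns k prototype
    # vectors; their indices serve as the fenone labels.
    rng = np.random.default_rng(seed)
    # Initialization: partition the observations into k groups at random
    # and take the centroid of each group as an initial k-mean.
    assignment = rng.integers(0, k, size=len(frames))
    prototypes = np.stack([frames[assignment == j].mean(axis=0) for j in range(k)])
    prev_distortion = np.inf
    while True:
        # Step 1: assign each observation to the cluster of the nearest k-mean.
        dists = np.linalg.norm(frames[:, None, :] - prototypes[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # Step 2: recompute the centroid of each new partition as its k-mean.
        for j in range(k):
            members = frames[assignment == j]
            if len(members):
                prototypes[j] = members.mean(axis=0)
        # Stop once the average distortion no longer decreases.
        distortion = dists[np.arange(len(frames)), assignment].mean()
        if prev_distortion - distortion < tol:
            break
        prev_distortion = distortion
    return prototypes

def label_utterance(frames, prototypes):
    # Each incoming frame is replaced by the label (index) of the partition
    # it falls into, producing the fenone string passed to the decoder.
    dists = np.linalg.norm(frames[:, None, :] - prototypes[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Toy usage: 1,000 random 81-dimensional "frames" clustered into 20 fenones.
frames = np.random.default_rng(1).random((1000, 81))
alphabet = train_fenone_alphabet(frames, k=20)
fenone_string = label_utterance(frames[:50], alphabet)
```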

With the fenone, IBM introduced a machine-optimized category of acoustic measurement, one in which the units of acoustic phenomena and their quantitative thresholds were both defined by the native functions of the acoustic processor. The nature of speech as an acoustic phenomenon was, in this way,

304 Lalit R. Bahl et al., “A Method for the Construction of Acoustic Markov Models for Words,” IEEE Transactions on Speech and Audio Processing 1, no. 4 (October 1993): 444. As Bahl explained, technically the labels themselves “are of no consequence to the recognition system; they are merely an aid to humans who may wish to examine data labeled with the [fenone] prototypes.” As far as the processor is concerned, it is the numerical values that the label references that are important. The labels are simply an identifier for some stored data, like using an alphabetic character as a symbol to represent a numerical value in x = 5. Labels are useful for programming and the actual names used can be entirely arbitrary, such that “the integers 1 to 200 would be a reasonable choice.”

!174 redefined as its optimal processing. By operationalizing speech not only as the behaviors that were measurable, but those whose measure could be taken automatically, fenonic modeling went beyond the metonymic reduction of complex or abstract phenomena to their readily quantifiable functions (e.g., the use of blood pressure as an indicator of cardiovascular health). Fenones instead reorganized the constituent structure of speech in terms of what was numerically optimal rather than what was physiologically perceptible, to say nothing of what was linguistically meaningful. In other words, not only were the recognition models represented as statistical probabilities, but the very process of determining what was to be modeled was itself automated, based on statistical patterns in the data, with no reference to any sort of meaningful linguistic classification or reasoning. Put another way, modeling turned in on itself, and the “acoustic” model produced no longer offered an explanatory account of the acoustic structure of speech as an external phenomenon, but rather an optimization of the dataset and its processing.

The information-theoretic framework favored by the IBM CSR group led to the popularization of statistical techniques that prioritized automation and predictive power over knowledge representation and interpretability. Instead of seeking to formalize the underlying human processes as a set of statistical functions, engineers came to imagine that they could replicate and predict outcomes without an account of the underlying structure or operation of the

!175 process that generated them. Statistical modeling, reconfigured around computational imperatives, was recast as a means of producing information in the absence of explanation. This manner of probabilistic prediction, alongside the decoupling of recognition from the distinctive properties of speech, was then precisely what made possible the diffusion of statistical techniques developed in speech recognition across such a wide range of practices.

!176 CHAPTER IV

THE DISCOVERY OF KNOWLEDGE

“You shall know a word by the company it keeps.” J.R. Firth, “A Synopsis of Linguistic Theory 1930–1955” (1957)305

The previous two chapters followed the story of how text was made predictable, tracing the development of data-driven statistical methods that were molded to the particularities of speech recognition research and materialized changing perceptions regarding human and computational forms of knowledge.

They offered a historical examination of the discursive and technical coordinations through which a computational approach to knowing, pronounced to be “the natural way for the machine,” took shape within the confines of speech recognition research, reconceptualizing speech from a problem of language to one of statistical text prediction. This chapter pursues predictive text in the other direction, investigating the enlistment of text analysis in the prediction of things outside of language. What resulted was an alliance of computational linguistics, database management, and machine learning that became one of the most significant contributing factors in the rise of so-called “big data.”

305 John Rupert Firth, “A Synopsis of Linguistic Theory 1930–1955,” in Studies in Linguistic Analysis: Special Volume of the Philological Society, ed. John Rupert Firth (Oxford: Blackwell, 1957), 11.

!177 Knowledge Discovery in Text (KDT), also commonly referred to as text data mining (or text mining for short) or text analytics, describes a collection of techniques that “extract useful information from data sources through the identification and exploration of interesting patterns . . . found not among formalized database records but in the unstructured textual data.”306 Emerging in the mid-1990s as a sibling branch of Knowledge Discovery in Databases (KDD), text mining brought together techniques from computational linguistics, information retrieval, and data management, in order to render statistical text modeling suitable to applications beyond the analysis of language. Through text mining, statistical text processing was no longer a way to approximate knowledge about language, but rather became a means to, as computer scientist Marti A.

Hearst explained in her 1999 address to the Association for Computational

Linguistics, “discover new facts and trends about the world itself.”307

Less than a decade after Hearst so plainly described the remarkable conceptual leap from the detection of text patterns to the discovery of unspecified new knowledge, Google launched Flu Trends in November of 2008, just a few

306 Ronen Feldman and James Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, (Cambridge; New York: Cambridge University Press, 2006): 1. To be clear, natural language text is not the only form of unstructured data, which describes any data that is not organized according to a pre-defined data model or schema. It is, however, one of the most prevalent and prolific forms.

307 Marti A. Hearst, “Untangling Text Data Mining,” in Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, ACL ’99 (Stroudsburg, PA, USA: Association for Computational Linguistics, 1999), 10.

!178 short months after predictive text completion became a default function in its search engine. The project presented “a method of analysing large numbers of

Google search queries to track influenza-like illness in a population” by mapping patterns in the search text data to flu occurrences between 2003 and 2008 in order to predict future flu outbreaks.308 Initially celebrated “as an exemplary use of big data,”309 the project was later abandoned after CDC reports showed that Google

Flu Trends had overshot estimates by nearly double for several years in a row, beginning in 2011. Yet, despite what Wired magazine theatrically termed Google’s

“Epic Failure,”310 researchers from numerous fields, including epidemiology, continued to experiment with the use of search and other online text data to model real-world events.311 Google itself doubled down on the underlying premise of

Flu Trends even as the project itself foundered, launching the spin-off Google

Correlate, a publicly-accessible, general purpose tool that statistically correlates

308 Jeremy Ginsberg et al., “Detecting Influenza Epidemics Using Search Engine Query Data,” Nature 457, no. 7232 (February 19, 2009): 1012. Matt Mohebbi, the co-creator of the Flu Trends project and second author on the Nature article, incidentally also joined Google in 2004, the same year Google introduced the Autocomplete search feature, and worked under Peter Norvig, an AI expert and noted “data supremacist” (Lohr 2015, 116).

309 David Lazer et al., “The Parable of Google Flu: Traps in Big Data Analysis,” Science, March 14, 2014.

310 David Lazer and Ryan Kennedy, “What We Can Learn From the Epic Failure of Google Flu Trends,” WIRED, October 1, 2015, https://www.wired.com/2015/10/can-learn- epic-failure-google-flu-trends/.

311 For instance, Universität Osnabrück and IBM have an ongoing flu prediction project using Twitter data. See Universität Osnabrück, “Flu Prediction: About,” accessed August 10, 2017, http://www.flu-prediction.com/about. Another example is a widely-cited 2011 article on the prediction of stock markets, also using Twitter data. Johan Bollen, Huina Mao, and Xiaojun Zeng, “Twitter Mood Predicts the Stock Market,” Journal of Computational Science 2, no. 1 (March 1, 2011): 1–8, doi:10.1016/j.jocs.2010.12.007.

!179 search query trends to any time series data to identify patterns of co-occurrence much in the same fashion as Flu Trends.312 Just as Hearst described, the ability to anticipate patterns in text, already bound up with information retrieval, was now capable of ostensibly generating new, previously-unknown facts beyond the likely succession of text.

This chapter begins to tell the story of how large-scale statistical text analysis and predictive modeling came to inhabit a shared epistemic terrain, one seemingly native to the informatic conditions of the web. It traces two sets of institutional forces operating at different scales that brought the newly data-driven field of speech recognition into sustained contact with the broader discipline of computational linguistics beginning at the end of the 1980s—the deliberate arrangement of DARPA-funded events and standard evaluation on the one hand and, on the other, the IBM CSR group’s far less purposive digression into machine translation. The consequent spread of statistical modeling techniques from speech recognition, I suggest, mobilized new negotiations over the relationship between language and computing, and the status of knowledge within it. At the same time, it reoriented natural language processing towards the aims of pattern recognition, drawing it into alignment with machine learning and data management. This chapter thus offers one modest point of entry into the

312 “Google Correlate,” accessed May 16, 2017, https://www.google.com/trends/correlate. The landing page of the Google Correlate tool proclaims “this is how we built Google Flu Trends!”

!180 labyrinthine history of how, in making text data, statistical language processing contributed to the expansive groundwork upon which data was made “big.”

Though now increasingly commonplace, the statistical analysis of patterns in text as a means of generating information about events and relations outside of language or document collections is a relatively new area of computational practice. Many of the core practices that have converged in applications such as web search, predictive analytics, and business intelligence belonged to separate domains. Natural language processing, information retrieval, machine learning, and data analysis were distinct research fields with relatively little overlap until the 1990s. Information retrieval and data access were centered around curated document collections, where individual records could be manually encoded using highly restricted or formal command languages. Even as data mining developed a means of utilizing machine learning techniques to not only access databases, but

“discover” new knowledge within them, its focus rested primarily on numerical data that was assumed to be formatted within structured databases. Natural language processing and computational linguistics, on the other hand, were focused on computational language understanding for user interface tasks rather than automatic data processing. They remained rooted in symbolic knowledge representation and rule-based expert systems, separate from fields tied to machine learning, even including speech recognition, until the tail end of the 1980s. This chapter traces how, by the 1990s, these practices began to coalesce around a new

!181 set of problems associated with the exponential growth of data, and text data in particular. Crucial to this process, I argue, was the standardization of data-driven statistical methods across all quarters, which facilitated interoperability and aligned disparate practices along the narrow procedures laid forth by pattern recognition.

At the center of all of this was the explosive growth of the World Wide

Web and the attendant proliferation of digitally encoded text. The web at once offered a stunning source of machine-accessible text corpora to train and test probabilistic language models and a glittering desideratum for business and scientific communities seeking ways to leverage the contents of digital communication. As a result, it solidified the incursion of data-driven statistical methods from speech recognition, which was aligned with engineering and machine learning, across the broader field of computational linguistics. Natural

Language Processing re-centered on “shallow” pattern recognition and machine learning tasks, focused around the classification of text data, rather than “deep” language understanding tasks rooted in classic AI and knowledge representation.

At the same time, the size, diversity, and dynamic nature of text documents, particularly online, presented a new set of challenges to data handling practices, making information retrieval and data mining increasingly reliant on natural language processing. Statistical text processing thus played a central role in

!182 embedding machine learning and pattern recognition into the infrastructures of information management and communication.

From Speech Recognition to Language Processing

Data-driven statistical methods swept through computational linguistics beginning in the 1990s and played a central role in refashioning the aims of its applied branch, Natural Language Processing (NLP), around pattern recognition tasks that would later prove ideally suited for data processing. Hailed by proponents as an “empirical renaissance” of corpus-based methods that had fallen out of favor since the late 1960s,313 overtaken by rationalist approaches, statistical

NLP ported data-driven techniques directly from speech recognition research.

By the late 1980s, the data-driven modeling techniques incubated at IBM had emerged as the dominant approach to speech recognition, particularly for large-vocabulary applications such as transcription. The growth in popularity was spurred in large part by the performance results of experimental systems using hidden Markov models. John Makhoul, chief scientist at BBN Technologies and a central figure in speech processing research throughout this period, recalled that in 1984, DARPA, a stronghold of knowledge-based expert systems, began its own tentative investigations in HMM-based speech recognition as part of its strategic

313 Kenneth W. Church and Robert L. Mercer, “Introduction to the Special Issue on Computational Linguistics Using Large Corpora,” 1. See also Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, 1st edition (Cambridge, MA: The MIT Press, 1999).

!183 computing program. By the first common evaluation in February of 1986, the “far superior” performance of the HMM-based system over knowledge-based efforts helped “demonstrate the power of HMMs in modeling phonetic distinctions.”314

According to Makhoul, by 1987, even CMU, one of the most steadfast sites of expert systems approaches, had moved entirely from knowledge-based methods to

HMMs.315 Another key factor was the publication of technical reference materials detailing both the underlying mathematical theory and the implementation procedures for applying HMMs to speech processing. Particularly influential was a symposium on the application of hidden Markov models to speech and language held at the Institute of Defense Analysis in 1980 and consequently published in what became known throughout the speech and language processing field as the

314 John Makhoul, “A 50-Year Personal Retrospective on Speech and Language Processing” (Interspeech, San Francisco, 2016). Though the main contract that year went towards a knowledge-based system built by Raj Reddy’s team at Carnegie Mellon University (CMU), BBN was granted a “small contract” to experiment on speech recognition using what was considered at the time to be a “high-risk approach using HMMs.” It should be noted that this was not the first time a speech recognition system using HMMs had been built using DARPA funding. James and Janet Baker’s early work on the Dragon system prior to joining IBM was funded at least in part through Carnegie Mellon’s participation in the DARPA Speech Understanding Research program in the early 1970s. However, the Dragon system was one of a number of experimental systems being built at CMU under a broad exploration of speech understanding systems, whereas it is implied that the 1984 BBN contract included exploring the efficacy of HMMs in speech recognition as its explicit aim.

315 Ibid.

!184 “Blue Book.”316 The contents of the symposium, which included the establishment of the term “hidden Markov model,” were circulated at Bell

Laboratories, resulting in the publication of a series of technical tutorials by

Lawrence Rabiner and B.H. Juang in the late 1980s.317 These papers quickly became canonical references for the application of HMMs, dubbed by former president of the Institute of Mathematical Statistics J. Michael Steele as “a tutorial that taught a generation.”318 These influential tutorials made the relatively specialized mathematical theory behind hidden Markov models accessible to a wider group of speech and language researchers.

This period of ascendance for data-driven statistical methods in speech recognition, however, also coincided with a growing distance between speech recognition and natural language processing. Speech processing research’s pragmatic ties to signal processing and acoustics had positioned it within the domain of electrical engineering early on, isolated from work in computational

316 John D Ferguson, ed., Symposium on the Application of Hidden Markov Models to Text and Speech (Princeton, NJ: Institute for Defense Analyses, Communications Research Division, 1980). For mentions of the “Blue Book” and its influence, see Lawrence Rabiner, “First-Hand: The Hidden Markov Model,” Engineering and Technology History Wiki (United Engineering Foundation, January 12, 2015), http://ethw.org/First-Hand:The_Hidden_Markov_Model and John Makhoul, “A 50-Year Personal Retrospective on Speech and Language Processing” (Interspeech, San Francisco, 2016).

317 Lawrence R. Rabiner and B. H. Juang, “An Introduction to Hidden Markov Models,” IEEE ASSP Magazine, January 1986, 4–16 and Lawrence R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proceedings of the IEEE 77, no. 2 (February 1989): 257–86.

318 Michael Steele, “Hidden Markov Models,” Course Resource for Financial Time Series and Computational Statistics (University of Pennsylvania, 2009), http://www-stat.wharton.upenn.edu/~steele/Courses/956/Resource/HiddenMarkovModels.htm.

!185 linguistics, which had focused on linguistic structure generally or the processing of text documents. In fact, researchers in speech processing and those in natural language processing and computational linguistics had little contact with one another following the dissolution of the DARPA Speech Understanding Research

(SUR) program in 1976.319 NLP efforts, which focused on problems of language understanding and translation, had remained resolutely tethered to research in symbolic AI, with its focus on knowledge representation, production rules, and domain expertise. Meanwhile, the growing statistical zeal in speech recognition allied speech research with the emerging field of machine learning, an increasingly peripheral practice in mainstream AI that, by the late 1980s, had begun to distance itself from its parent field thanks to a growing emphasis on pattern recognition.320 Research in speech recognition, even that which drew heavily on linguistic scholarship, did not begin receiving notable coverage in scholarly outlets for computational linguistics until the end of the 1980s.321

319 Ralph Weischedel et al., “White Paper on Natural Language Processing,” in Speech and Natural Language: Proceedings of a Workshop (Cape Cod, MA: Morgan Kaufmann Publishers, Inc., 1989), 487.

320 Pat Langley, “The Changing Science of Machine Learning,” Machine Learning 82, no. 3 (2011): 275–79.

321 See, for instance, David Hall, Daniel Jurafsky, and Christopher D. Manning, “Studying the History of Ideas Using Topic Models,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08 (Stroudsburg, PA, USA: Association for Computational Linguistics, 2008), 367-8. The authors describe a sudden influx of publications in computational linguistics outlets by researchers who had previously published in speech recognition starting in 1988, noting that “speech recognition is historically an electrical engineering field.”

The conceptual transformations and new methods that overtook speech recognition might have thus remained mere technical curiosities, sequestered within a specialized corner of electrical engineering, had it not been for two key interventions—one institutional, the other technical—that deliberately brought speech research into natural language processing. First, at the institutional level, was a cluster of government-sponsored initiatives, including workshops, standard evaluation tasks, and shared data resources, explicitly aimed at consolidating progress in defense-funded speech and language research. Renewed contact between speech processing and computational linguistics, particularly in the context of government funding, helped shift computational linguistics towards the

“engineering” paradigm dominant in speech recognition, which prioritized practical applications and performance evaluation. At the technical level was the

IBM Continuous Speech Recognition group’s off-hand entry into machine translation. In overlaying the mathematical framework from statistical speech recognition directly onto one of the defining tasks of computational linguistics, their brief foray into machine translation laid an unambiguous, if perhaps unanticipated, claim by data-driven statistical methods over language understanding.
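
To make the nature of this overlay concrete, the two decompositions can be set side by side as they are conventionally written (a schematic summary rather than a reproduction of any single paper’s notation). Statistical speech recognition selects the word sequence $W$ most probable given the acoustic evidence $A$, factored by Bayes’ rule into an acoustic model and a language model:

\[ \hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\,P(W) \]

The IBM translation work applied the same decomposition, treating a French sentence $f$ as a noisy “signal” encoding an English sentence $e$, factored into a translation model and the same kind of language model:

\[ \hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\,P(e) \]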

Particularly influential among the institutional initiatives were the DARPA

Speech and Natural Language Workshops, which ran from 1989 to 1994 (they were renamed the “Human Language Technology Workshops” in 1993,

coinciding with the renaming of DARPA back to ARPA that same year). These workshops had the explicit aim of bringing together all of the scattered DARPA-funded speech and language programs in a deliberate effort to “cross-educate the two communities” and consolidate technical advances.322 The fields of speech and natural language processing were at this time seen as so removed from one another that organizers devoted the entire first day of the inaugural workshop in

1989 to “the establishment of common ground between the natural language and speech groups,” featuring introductory tutorials on basic concepts of computational linguistics and speech processing.323 As one of the first formal sites of contact between the speech processing and computational linguistics research communities in over a decade, these workshops consequently became a prevailing influence on the field of computational linguistics in this period. According to computational linguist Philip Resnik, the “(largely DARPA-imposed) re-introduction of the natural language processing community to their colleagues doing speech recognition and machine learning” was perhaps the most important factor for the “statistical revolution in natural language processing in the late

322 Charles L. Wayne, “Foreword,” in Speech and Natural Language: Proceedings of a Workshop (Philadelphia, PA: Morgan Kaufmann Publishers, Inc., 1989), vii.

323 In the computational linguistics tutorial, for instance, presenter Ralph Grishman noted that he drew the tutorial contents from his introductory textbook Computational Linguistics: An Introduction, though he “necessarily hit only a few of the highlights” and was “forced to oversimplify” (1989, 37).

1980s up to mid 90s or so.”324 Resnik’s broad perception is mirrored in publication trends during the Workshops’ run, according to a 2012 computational analysis of topics covered in the Association for Computational Linguistics

Anthology that identified 1989-1994 as a “major epoch of the ACL’s history,” based on “a clear, distinct period of topical cohesion.”325 Additionally, other

DARPA-supported initiatives of this period, such as the Message Understanding

Conferences (MUC) for information extraction and the Air Travel Information

System (ATIS) for speech understanding, required participants to work on standard evaluation tasks (known as “bakeoffs”), which further brought speech and natural language processing researchers into contact through shared research aims.326 The DARPA workshops and related initiatives, through the deliberate re-introduction of speech recognition research into NLP, “serv[ed] as a major bridge from early linguistic topics to modern probabilistic topics.”327

324 Philip Resnik, “Four Revolutions,” Language Log, February 5, 2011, http://languagelog.ldc.upenn.edu/nll/?p=2946.

325 Ashton Anderson, Dan McFarland, and Dan Jurafsky, “Towards a Computational History of the ACL: 1980-2008,” in Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries, ACL ’12 (Stroudsburg, PA, USA: Association for Computational Linguistics, 2012), 14.

326 Ibid., 16.

327 Ibid., 20. See also: David Hall, Daniel Jurafsky, and Christopher D. Manning, “Studying the History of Ideas Using Topic Models,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08 (Stroudsburg, PA, USA: Association for Computational Linguistics, 2008), 369. According to their analysis, the field’s major journals and conferences saw a dramatic spike in papers discussing speech recognition in the course of the DARPA workshops.

Contact with speech recognition research, organized within the institutional demands of the DARPA workshops, not only introduced new statistical techniques, but also reoriented the research aims of natural language processing around the demands of establishing “common ground” between the fields, particularly for the development of shared resources and standardized evaluation benchmarks. Data-driven statistical modeling quickly came to dominate every major area of NLP research in the course of the workshops’ run.

At the inaugural workshop in February of 1989, the program’s general chair, Lynette

Hirshman, identified the areas of common interest between the computational linguistics and speech research communities as “prosodics, spoken language systems, and development of shared resources.”328 By 1991, Patti Price, the general program chair that year, noted that the workshops had undergone “a paradigm shift: all the papers were corpus-based (as opposed to relying on the intuition of experts), and most have at least one statistical subcomponent.”329 This trend was further solidified the following year, when program chair Mitchell

Marcus proclaimed in his overview that “[t]he paradigm shift in natural language processing towards empirical, corpus based methods was nowhere clearer than at

328 Lynette Hirshman, “Overview of the DARPA Speech and Natural Language Workshop,” in Speech and Natural Language: Proceedings of a Workshop (Philadelphia, PA: Morgan Kaufmann Publishers, Inc., 1989), 1.

329 Patti Price, “Overview of the Fourth DARPA Speech and Natural Language Workshop,” in Speech and Natural Language: Proceedings of a Workshop (Pacific Grove, CA: Morgan Kaufmann Publishers, Inc., 1991), 4.

this workshop,” highlighting “the continuing explosion of results in statistical natural language processing” and citing the growth of papers using statistical methods from 20% in 1989 to over 90% at that year’s workshop in 1992. The year after that, the workshop formally changed its name to the Human Language

Technology Workshop as a reflection of the group’s broadening interest beyond common project tasks. As 1993’s program chair, Madeleine Bates, explained, “the scope [of the workshop] includes not just speech recognition, speech understanding, text understanding, and machine translation, but also all spoken and written language work . . . with an emphasis on topics of mutual interest, such as statistical language modeling.”330 In other words, statistical methods were no longer simply a set of techniques, but had become so pervasive as to stand as an exemplary feature that defined the work of language technology research across disparate applications.

While the DARPA workshops created sustained exchanges in an effort to draw speech processing overall into natural language research, statistical techniques in particular had also begun to migrate outward, extending into language modeling challenges outside of speech. Prior to the government-instituted reintegration of work regarding speech into natural language processing, a handful of researchers had begun to experiment with probabilistic models

330 Madeleine Bates, “Overview of the ARPA Human Language Technology Workshop,” in Human Language Technology: Proceeding of a Workshop (Plainsboro, NJ: Morgan Kaufmann Publishers, Inc., 1993), 3.

trained on corpus data for non-speech tasks. Probabilistic modeling became a notable presence in computational linguistics publications around 1988, with a sharp increase in papers on the topic that marked the beginning of a steady upward trend until the mid-1990s.331 The majority of contributions on the topic that year came from researchers who had previously worked on speech recognition, who served as “an important conduit for the borrowing of statistical methodologies into computational linguistics.”332

Among the publications that year were two particularly influential papers, both authored by speech recognition researchers originating from “well-funded industrial laboratories”333 where computing resources, including data, were still more abundant than at their university-based counterparts. One was a paper

331 Hall et al., “Studying the History of Ideas Using Topic Models,” 365. They graph data collected from all papers in the ACL Anthology between 1978-2006, showing a sharp rise in papers on the topic of “probabilistic models” starting in 1988 until around 1996, when it plateaus for several years before a second increase starting around 2000. See also Kenneth Church, “Has Computational Linguistics Become More Applied?,” in Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing ’09 (Berlin, Heidelberg: Springer-Verlag, 2009), 1. Church additionally cites the results of two (likely informal) “independent surveys,” one conducted by Bob Moore and the other by Frederick Jelinek, on the presence of “statistical papers” between 1985-2005, though the source of their data is not specified.

332 Ibid., 267.

333 Kenneth Church, “Has Computational Linguistics Become More Applied?,” in Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing ’09 (Berlin, Heidelberg: Springer-Verlag, 2009), 1. See also: Hall et. al, “Studying the History of Ideas Using Topic Models,” 365-366. They note that papers dealing with “the probabilistic model topic” increase around 1988, “which seems to have been an important year for probabilistic models,” and highlight the appearance of two particularly “high-impact” papers that year from authors that had previously published on speech recognition topics.

presented by Kenneth Church from AT&T Bell Laboratories,334 describing the implementation of a stochastic part-of-speech (POS) tagger.335 Though it was arguably the first fully-implemented tagger using HMMs, similar methods for

POS tagging had been garnering modest interest in the computational linguistics field since the early 1980s.336 The other paper, however, introduced probabilistic models into what proved to be a far more contentious domain—machine translation—and marked the beginning of a relatively short-lived project that, despite its brevity, was identified by natural language processing pioneer Yorick

Wilks as “[t]he greatest impetus for statistical NLP.”337 Simply titled “A

334 Kenneth Church, “A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text,” in Proceedings of the Second Conference on Applied Natural Language Processing, ANLC ’88 (Stroudsburg, PA, USA: Association for Computational Linguistics, 1988), 136–143.

335 Parts-of-speech are the various word classes, or syntactic categories, of a language (e.g., nouns, verbs, etc.). Part-of-speech tagging broadly refers to automatically assigning words in a text to appropriate classes by computational means and constitutes a major area of NLP as it is a key component in many tasks, including syntactic parsing, named-entity identification, and other information extraction and retrieval tasks that involve identifying important terms in a document.

336 Dan Jurafsky and James H. Martin, Speech and Language Processing, 3rd ed. (Draft, 2016). The best-known POS tagger using probabilistic models prior to Church’s system is the CLAWS (Constituent-Likelihood Automatic Word-Tagging System), developed by a team of researchers from the Universities of Lancaster, Oslo, and Bergen starting in 1981 (Garside 1987). According to Jurafsky and Martin, CLAWS used a simplified version of the HMM approach, but did not store the word likelihoods of each tag. In earlier editions of their book, they also referred to Church’s system as almost, but not quite, an HMM implementation, citing Church’s use of a slightly different probability computation. They have updated their assessment for the latest edition, citing personal correspondence with Church, who explained that the discrepancy was an intentional oversimplification in order to make the idea more accessible at the time. Church presented the correct HMM computation in an updated paper in 1989. Other probabilistic taggers, such as one from Lalit Bahl and Robert Mercer of IBM’s CSR group in 1976, had been described but not fully implemented.
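
As a rough illustration of how such HMM taggers operate (a textbook-style simplification, not a reproduction of Church’s exact formulation), the tagger selects the tag sequence $t_1 \ldots t_n$ for the words $w_1 \ldots w_n$ that maximizes

\[ \hat{t}_1^{\,n} = \arg\max_{t_1^{\,n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}) \]

where the word likelihoods $P(w_i \mid t_i)$ and the tag transition probabilities $P(t_i \mid t_{i-1})$ are estimated from counts in a hand-tagged corpus.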

337 Yorick Wilks, “Computational Linguistics: History,” in Encyclopedia of Language & Linguistics (Second Edition), ed. Keith Brown (Oxford: Elsevier, 2006), 766.

Statistical Approach to Language Translation,” the paper was authored by researchers from none other than IBM’s Continuous Speech Recognition group.338

The Crude Force of Computing

Statistical machine translation at IBM began in 1987 and was spun off directly from speech recognition research,339 spearheaded by a small, core team of researchers that included Peter Brown, Bob Mercer, and Stephen and Vincent

338 Brown et al., “A Statistical Approach to Language Translation,” in Proceedings of the 12th Conference on Computational Linguistics, COLING ’88 (Stroudsburg, PA: Association for Computational Linguistics, 1988), 71–76. Curiously, the 1988 publication is often referenced as “A Statistical Approach to Machine Translation,” which is the title of the more widely-cited version of the paper that appeared in the Computational Linguistics journal in 1990.

339 There is some confusion among the former IBM CSR researchers involved regarding whether their work on machine translation began in 1987 or 1988. Jelinek dated the work to 1987 in his ACL Lifetime Achievement Award speech in 2009 (though he had previously cited 1986 in a 2005 interview with Janet Baker). Mercer also cited 1987 in his ACL Lifetime Achievement Award speech in 2014. However, Brown, who led translation research and was first author on all the machine translation papers, cited “1987 or 1988” in his 2013 presentation at EMNLP. During my interview with Brown, and both Stephen and Vincent della Pietra in 2015, the three came to the conclusion that work began in 1988, the same year that the first machine translation paper was presented. However, the 1988 date creates a very tight timeline between when the research was conducted and when it was presented, since a bulk of the research was carried out during the summer. The 1988 COLING conference took place in late August, and the 1988 TMI presentation in Pittsburgh took place even earlier, June 12-14, 1988. Moreover, an oft-cited anonymous peer review of the COLING paper has been dated to March 1988, according to Jelinek’s 2010 memorial service program, though there is no clear date indicated on the original document. Given that the initial research was said to have taken place in the summer, 1987 is the more likely start date.

Della Pietra.340 The four constituted “some of the most capable people” from the

Continuous Speech Recognition group, according to the CSR group director

Frederick Jelinek.341 The project ran for approximately five years and produced a series of influential publications, including what Machine Translation editor-in-chief Andy Way speculated in 2009 was “perhaps the most cited paper on (S)MT even today.”342 The paper in question was “The Mathematics of Statistical

Machine Translation: Parameter Estimation,” which appeared in Computational

Linguistics in June of 1993, just a few short months before Peter Brown and Bob

Mercer would both abandon the language technology field altogether. That

340 Other contributors included John Cocke, Paul Roossin, and Jelinek himself on Brown et al., “A Statistical Approach to Language Translation”; Peter Brown et al., “A Statistical Approach to French/English Translation” (Second International Conference on Theoretical and Methodological Issues in Machine Translation of Natural Language (TMI), Pittsburgh, PA, 1988); and Peter F. Brown et al., “A Statistical Approach to Machine Translation,” Computational Linguistics 16, no. 2 (June 1990): 79–85. Meredith J. Goldsmith, Jan Hajic, and Surya Mohanty appear in Peter F. Brown et al., “But Dictionaries Are Data Too,” in Proceedings of the Workshop on Human Language Technology, HLT ’93 (Stroudsburg, PA, 1993), 202–205. Adam Berger, J. Gillett, J. Lafferty, H. Printz, and L. Ures appear in Adam L. Berger et al., “The Candide System for Machine Translation,” in Proceedings of the Workshop on Human Language Technology, HLT ’94 (Stroudsburg, PA, 1994), 157–162. Only Brown, Mercer, and the Della Pietra brothers appear on all of the machine translation publications, and the four are the only authors on the most widely-cited paper from the project, “The mathematics of statistical machine translation: Parameter estimation.” Peter F. Brown et al., “The Mathematics of Statistical Machine Translation: Parameter Estimation,” Computational Linguistics 19, no. 2 (June 1993): 263–311. Lalit Bahl’s involvement has also been mentioned in various sources, including the author’s interviews with members of the CSR group and Jelinek’s 2005 interview with Janet Baker, though Bahl does not appear on any of the main publications.

341 Frederick Jelinek, interview by Janet Baker, audio recording, March 2005, History of Speech and Language Technology Project. The full recording of the interview was provided to the author by Janet Baker.

342 Andy Way, “A Critique of Statistical Machine Translation,” Linguistica Antverpiensia, New Series – Themes in Translation Studies, no. 8 (2009): 18, emphasis in the original.

November, they left IBM to take up equities trading at Renaissance Technologies, one of the most breathtakingly profitable hedge funds in history.343 Machine translation research at IBM quietly languished from there, with Stephen and

Vincent della Pietra leaving to join Brown and Mercer at Renaissance

Technologies in 1995, and Lalit Bahl as well as several others from IBM Research following suit in subsequent years.

Though the IBM CSR group’s foray into machine translation was relatively short-lived, it left a lasting mark on the field. As with speech recognition, “since this time, statistical machine translation (SMT) has become the major focus of most MT research groups, based primarily on the IBM

343 Peter Brown, in a presentation with Robert Mercer at the EMNLP workshop in 2013, recounted that Renaissance Technologies had initially recruited both him and Mercer in early 1993, and that both “just threw the [offer] letters away because [they] were very happy at IBM.” However, the two reconsidered due to personal circumstances that changed their financial priorities. They were additionally persuaded by the discovery that one of Renaissance CEO Jim Simons’ early collaborators in finance had been Leonard Baum, the former Institute for Defense Analyses mathematician who “had developed the EM algorithm, that made all of [their] work on speech recognition, typing correction, and machine translation possible.” In an interview I conducted on May 5, 2015, Brown recalled that both he and Mercer were undecided until their offer deadline, October 18th, 1993, when they finally accepted. Mercer confirmed that they left IBM and began work at Renaissance Technologies on the first of November. Frederick Jelinek, the director of the CSR group, also left IBM in 1993 and returned to academic research as a professor of Electrical and Computer Engineering at Johns Hopkins University, where he also helmed the newly established Center for Language and Speech Processing. See also: Sebastian Mallaby, More Money Than God: Hedge Funds and the Making of a New Elite (New York: Penguin Books, 2011). According to Mallaby, Renaissance Technologies may be “perhaps the most successful hedge fund ever,” with founder Jim Simons clocking in $1.5 billion in personal earnings alone in 2006, comparable to the combined annual corporate profits of coffee giant Starbucks and major wholesaler Costco, companies that each employ over 100,000 individuals. In 2016, Bloomberg reported that Renaissance’s flagship Medallion Fund had produced an estimated $55 billion in profit since its inception in 1988, putting its lifetime net gains $10 billion or more ahead of its nearest competitors, Ray Dalio’s Bridgewater Pure Alpha and George Soros’ Quantum Endowment Fund, despite both those funds having been earning for over a decade longer.

model.”344 It has also remained the dominant paradigm in both research and application, only recently being challenged by a resurgence of neural networks, which in fact share many defining conceptual investments with statistical methods, despite formal differences. Google’s popular Google Translate service, for instance, relied on statistical machine translation for a decade, until November

2016, when Google began converting some of its most popular language offerings over to its new Google Neural Machine Translation (GNMT) system built using a recurrent neural network architecture.345

Despite its outsized impact, statistical machine translation research at IBM emerged only as an unanticipated—and initially unauthorized—byproduct of data-gathering efforts for speech recognition. Fittingly, the driving impetus behind

Brown et al.’s pursuit of machine translation was the Canadian Hansard data, a corpus of machine-readable text records of the proceedings of the Canadian

344 Hutchins, “Machine Translation: History of Research and Applications,” 128.

345 Barak Turovsky, “Found in Translation: More Accurate, Fluent Sentences in Google Translate,” The Keyword (Official Google Blog), November 15, 2016, http://blog.google:443/products/translate/found-translation-more-accurate-fluent-sentences-google-translate/. See also: Quoc V. Le and Mike Schuster, “A Neural Network for Machine Translation, at Production Scale,” Google Research Blog, September 27, 2016, https://research.googleblog.com/2016/09/a-neural-network-for-machine.html. A full overview of the similarities and distinctions between statistical methods and neural networks is outside the scope of the present work. Broadly speaking, however, both neural networks and previous statistical methods are data-driven approaches that are trained by automatically detecting patterns in large data sets and then use those patterns to recognize, classify, or otherwise “interpret” consequent data points. They differ in terms of formal architecture, most notably in that neural nets are discriminative while classic “statistical” techniques such as hidden Markov models are typically generative. As a result, systems using neural nets often push the data-driven paradigm even further than statistical methods.

House of Parliament that included text data in both French and English. The

Hansard data was therefore also the reason for the selection of French as the source language for the project, despite no one in the group having proficiency in the language.346 The CSR group acquired the corpus around 1986 (thanks, as mentioned in the previous chapter, to John Cocke and a fortuitous airline seat assignment) in order to use the English portion as training data for language models in the Tangora speech recognition system.

Even with the French-English text data, as well as urging from John

Cocke, Brown et al. did not immediately set about working on translation. The demands of the CSR group’s primary research program, speech recognition, took precedence and left little time to pursue machine translation, particularly as the group’s director, Jelinek, “was a bit of a task master” according to Brown.347

Statistical machine translation was instead relegated to a series of daily informal

346 Fred Jelinek made the claim in his ACL speech that, in fact, the decision to use the Hansard data was a result of the decision to conduct research using French and English, since the two were close in structure. However, when asked about this claim following his EMNLP presentation in 2013, Peter Brown stated that the decision to use French-English as the translation pair was a result of having the data, maintaining that “[i]f it was data in Japanese, we would have worked on Japanese.” Brown also explained that the project name, Candide, was actually in reference to the fact that Brown and Mercer were assigned to read Candide as part of a two-week course the two took in an attempt to learn French after they had already implemented the basic translation model. Bob Mercer similarly contested Jelinek’s account in his 2014 ACL Lifetime Achievement Award acceptance speech: “According to Fred, we chose to translate French into English because of the great similarity of the two languages and then we were lucky enough to find a Canadian Hansard data a dual language corpus upon which to base our research. In fact, the process was just the reverse.” French was chosen as the source rather than target language so that the researchers, who knew little French, could at least assess the grammaticality of the resulting English output, if not the accuracy of the translations themselves.

347 Brown and Mercer, “Oh, Yes, Everything’s Right on Schedule, Fred.”

meetings devoted to “wild and crazy ideas” organized among Mercer, Brown, and the della Pietra brothers.348 It wasn’t until Jelinek left for his annual summer sojourn to Cape Cod in 1987 that Brown et al. were able to finally turn serious attention to the French portion of the Hansard corpus.

Brown et al. also chose to begin work on machine translation discreetly, during Jelinek’s absence, because they anticipated that Jelinek would object to the project on intellectual grounds in addition to administrative ones. According to

Mercer, Jelinek, despite infamously joking that their speech recognition systems improved every time he fired a linguist, was himself “really a closet linguist . . .

He always felt that we were going to find the right linguist to solve this [speech recognition] problem.”349 Mercer and Bahl recalled that they had often taken advantage of Jelinek’s annual summer absences to surreptitiously push forward the development of data-driven techniques that could replace linguistic components in the Tangora system. In fact, the implementation of an automatic centisecond segmentation for “fenonic” models (discussed in the previous chapter) had been blocked by Jelinek “for years,” despite other researchers in the group being eager to discard manual phonetic segmentation that required linguists, and was finally implemented while Jelinek was away at Cape Cod one

348 Robert Mercer, interview with the author, May 5, 2015. Brown notes that Bahl was absent from these meetings due to being in France during this period. Brown and Mercer also noted that there may have been a fifth member in attendance, possibly Adam Berger, but could not recall for certain.

349 Robert Mercer and Lalit Bahl, interview with the author, May 5, 2015.

summer.350 Brown anticipated that statistical machine translation based solely on data, developed without any linguistic theory or even the ability to understand the language being translated, would face similar dismissal from Jelinek. To his surprise, rather than shutting the project down upon his return, Jelinek was enthusiastic and pushed Brown et al. to write up their findings, which were submitted to the International Conference on Computational Linguistics

(COLING) in 1988.351 Machine translation was taken up as an official project and named Candide, after the Voltaire novella, which Brown and Mercer had been assigned to read during a two-week French language course the two took in an effort to familiarize themselves with French grammar and morphology despite their primarily statistical approach.352

The resistance that Brown had anticipated from Jelinek came instead from the computational linguistics field. “The validity of a statistical (information theoretic) approach to MT . . . was universally recognized as mistaken by 1950,” wrote an anonymous peer reviewer of Brown et al.’s 1988 COLING submission, concluding simply: “The crude force of computers is not science.”353

350 Ibid.

351 Brown and Mercer, “Oh, Yes, Everything’s Right on Schedule, Fred.”

352 Ibid. According to Brown, the course was taught by Michel Thomas, who had famously taught French to Princess Grace of Monaco prior to her assuming royal duties.

353 The anonymous peer review was dated March 1, 1988 according to the program for Jelinek’s memorial event hosted by Johns Hopkins University on November 5, 2010. The review was featured on the first page of the program (following a title page with the event date). Brown and Mercer, “Oh, Yes, Everything’s Right on Schedule, Fred.”

The paper was ultimately accepted to the conference, despite the review, but remained decidedly controversial. Brown had debuted the paper in June of

1988, a couple months prior to COLING, at the Second International Conference on Theoretical and Methodological Issues in Machine Translation of Natural

Language (TMI) in Pittsburgh, on a panel simply titled “Paradigms in Machine

Translation” featuring several noted NLP and linguistics researchers, including

Jaime Carbonell, then-director of CMU’s Center for Machine Translation. As fellow panelist Harold Somers recalled, the audience’s reaction to Brown’s presentation “was hostile, to say the least,” adding that the hostility came “despite the fact that early results were not significantly worse than results of more orthodox systems.”354 Computational linguist Pierre Isabelle, who was also in attendance, gave a similar, if less restrained, account:

We were all flabbergasted. All throughout Peter’s presentation, people were shaking their heads and spurting grunts of disbelief or even hostility . . . [I]n the heat of the moment, nobody was able to articulate the general disbelief into anything like a reasonable response.355

The reported intensity and apparent universality of the objections from the machine translation field seemed to far outstrip any negative reactions that the same methods inspired in the field of speech recognition, and were all the more notable given that statistical modeling had by this time already proven so effective

354 Harold L. Somers, “Current Research in Machine Translation,” Machine Translation 7, no. 4 (1993): 232.

355 Pierre Isabelle quoted in Way, “A Critique of Statistical Machine Translation,” 21.

there. Given that Brown and the IBM team had simply sought to apply methods that had proven successful in speech recognition to another area of language technology, the overwhelmingly hostile response clearly suggested that something substantively different was at stake in machine translation.

Equally curious was the perception of IBM’s Candide within the MT field as the “introduction of a new paradigm.”356 The idea of using information theoretic approaches for machine translation, as the anonymous COLING review had pointed out, was far from new. In fact, it had been one of the earliest strategies proposed for applying digital computers to translation and was put forth by Warren Weaver in a letter to Norbert Wiener in 1947. In 1949, Weaver revised his letter into a follow-up memo, initially circulated to only about twenty to thirty individuals, that is now considered one of the formative publications in the machine translation field, “since it formulated goals and methods before most people had any idea of what computers might be capable of . . . [and] was the

356 Bente Maegaard, ed., “Machine Translation,” in Multilingual Information Management: Current Levels and Future Abilities. A Report Commissioned by the US National Science Foundation and Also Delivered to the European Commission’s Language Engineering Office and the US Defense Advanced Research Projects Agency. (1999), http://www.cs.cmu.edu/~ref/mlim/index.html.

direct stimulus for the beginnings of research.”357 In it, Weaver emphasized the use of statistical methods, likening translation to cryptography and referencing

Claude Shannon’s work on information theory.358 What’s more, the statistical approach introduced by IBM was not even the only data-intensive approach to machine translation, despite often being referenced synonymously with

“empirical” or “corpus-based” methods. Example-Based Machine Translation

(EBMT), a method that similarly based translation decisions on the data of bilingual text corpora, had been introduced several years earlier with considerably less fanfare.359

In an echo of the statistical turn in speech recognition, what was clearly at stake was not simply a new modeling technique, but the status of language. The

Candide was frequently referenced by contemporaries in the field as not merely statistical, but “purely statistical,” highlighting the perceived absence of any

357 W. John Hutchins, “Warren Weaver Memorandum, July 1949,” MT News International 22 (July 1999): 5. For more detail on the memo’s direct and indirect reference, as well as its distribution, see W. John Hutchins, “From First Conception to First Demonstration: The Nascent Years of Machine Translation, 1947-1954. A Chronology,” Machine Translation 12, no. 3 (1997): 195–252. Hutchins explains that the widely-cited number of 200 initial memo recipients found in the Locke and Booth (1955) volume is inflated, and possibly an estimate of the consequent readership for the memo in the months following its initial distribution. He notes that Weaver himself reported sending the memo to only twenty or thirty people.

358 Warren Weaver, “Translation,” reprinted in Machine Translation of Languages: Fourteen Essays, ed. W.N. Locke and A.D. Booth (Cambridge, MA: MIT Press, 1949/1955), 15–23.

359 Way, “A Critique of Statistical Machine Translation,” 17.

linguistic considerations.360 However, what is perhaps more revealing is the fact that, unlike in speech recognition, where IBM’s statistical approach readily outpaced other systems, the Candide did not significantly improve on the performance of competing machine translation systems. After all, Somers’ account, mentioned earlier, of his 1988 panel with Brown indicated that the level of audience hostility was surprising, given performance results that were rather unremarkable. “What surprised most researchers (particularly those involved in rule-based approaches),” noted W. John Hutchins, one of machine translation’s most prolific chroniclers, “was that the results [of the Candide] were so acceptable,” further adding that consequent adoption of the approach required

“many subsequent refinements.”361 Peter Brown himself acknowledged that their results also benefited from the fact “that English and French are basically the same language” and that the approach “wasn’t very successful until you guys [in the machine translation field] fixed it up.”362 That the simple adequacy of statistical modeling should prove so remarkable suggests that both the initial

360 Jaime Carbonell, “Session 3: Machine Translation,” in Speech and Natural Language: Proceedings of a Workshop (Pacific Grove, CA: Morgan Kaufmann Publishers, Inc., 1991), 139. See also: Bente Maegaard, ed., “Machine Translation,” in Multilingual Information Management: Current Levels and Future Abilities. A Report Commissioned by the US National Science Foundation and Also Delivered to the European Commission’s Language Engineering Office and the US Defense Advanced Research Projects Agency (US National Science Foundation, 1999) and Yorick Wilks, “Keynote Address: Some Notes on the State of the Art: Where Are We Now in MT,” in Machine Translation: Ten Years on, ed. Douglas Clarke and Alfred Vella (International Conference on Machine Translation, Cranfield, UK, 1994), 2.1.

361 Hutchins, “Machine Translation: History of Research and Applications,” 127-128.

362 Brown and Mercer, “Oh, Yes, Everything’s Right on Schedule, Fred.”

hostility and later enthusiasm towards the approach were both rooted in the perception that data-driven statistical modeling was more consequential for machine translation than for speech recognition in principle rather than practice.

As Mercer put it, in comparison to speech recognition “the statistical method, the data method, in its application in translation, is even more outrageous.”363

One crucial difference was a matter of acoustics. While early proponents of machine translation, such as Warren Weaver, had been inspired by information theory and likened the translation between languages to cryptography, the resemblance between foreign language and foreign code proved more rhetorical than functional. Speech recognition, on the other hand, was bound to signal transmission and the attendant mathematical framework of information theory by the necessity of acoustic processing. The signal processing component required to digitize audio was, in fact, the primary component that all speech recognition systems, even the most linguistically-informed ones, had in common. As Brown explained, in speech recognition, “[e]verybody understands signal processing.

Even if you’re just, you know, a linguist, you need signal processing. And so already you’ve got some mathematics in there.”364 In other words, the prerequisite of signal processing for handling acoustic input served as a material bridge between data-driven models using statistical analysis and human language, as

363 Robert Mercer, interview with the author, 2015.

364 Peter Brown, interview with the author, 2015.

well as a rationalization of the success of statistical methods in ASR. That is, linguistics could be exempted from the domain of speech recognition because speech recognition was, in the first instance, already a problem of acoustic signal processing. Computational linguistics could reciprocally exempt speech recognition in turn, ceding it to electrical engineering as a task that was never solely about language in the first place.

Computational linguistics could make no similar disavowal of machine translation, however. In addition to being one of the defining problems of the field, machine translation was literally one of its founding tasks: the Association for Computational Linguistics (ACL) was formed in 1962 as the Association for

Machine Translation and Computational Linguistics.365 More importantly perhaps, without the alibi of signal processing, the application of statistical techniques to machine translation meant that they were emphatically staking claims to the territory of meaning:

Because speech was one thing. You know, there's, like, signal processing . . . But translation, that’s understanding, meaning, and stuff like that. And what in the world does that have to do with mathematics? And so, you know, I’d say that—they really thought we were nuts.366

365 While the name was shortened to Association for Computational Linguistics by 1968, possibly to distance the field from some of the initial setbacks and corresponding budgetary retractions in the wake of the 1966 ALPAC report, machine translation remained a significant part of the field. In the entry on the history of computational linguistics in the Encyclopedia of Language & Linguistics edited by Keith Brown, noted computational linguistics and artificial intelligence scholar Yorick Wilks maintains that machine translation “was the original task of NLP [Natural Language Processing] and remains a principle one.” Wilks, “Computational Linguistics,” 761.

366 Peter Brown, interview with the author, May 5, 2015.

In other words, whereas statistical speech recognition claimed that speech could be rendered computationally using statistics in the absence of linguistic knowledge and meaning, statistical machine translation suggested that, perhaps, language and meaning themselves could be computationally rendered as mathematical phenomena.

Brown himself even claimed that IBM’s aim had never been to discard linguistics, but rather to present how it could be “incorporated in a mathematically coherent system.”367 In contrast to speech recognition, where the IBM CSR group sought to systematically replace linguistic components with statistical models,

Brown believed there to be “an enormous role for linguistics in translation,” provided that role was mathematically sound: “For what it’s worth I think it’s very important to get the mathematics straight when doing linguistics . . . Our goal was to establish the mathematical framework for MT so that the linguistically-minded

367 Way, “A Critique of Statistical Machine Translation,” 22-23. According to Way, despite Brown’s objections, this perception was so strong that many MT researchers often misidentified Jelinek’s infamous quotation about system performance in speech recognition improving every time he fired a linguist as being about machine translation instead.

could proceed [within it].”368 (Mercer, on the other hand, was notably less accommodating when it came to linguistics.)369

The establishment of a mathematical framework within which language could still be incorporated, however, abetted rather than disavowed the claim to language. Whereas statistical modeling in speech recognition was portrayed as a mathematical model of the absence of linguistic knowledge, the “mathematically coherent” framework of machine translation required that linguistic components be refashioned as mathematical functions in order to cohere to the framework. In other words, linguistics could be incorporated into the statistical framework so long as linguistic features could be expressed statistically using the same techniques that, in speech recognition, had been used to model ignorance of those very same features. Key linguistic elements such as syntax were later integrated back into statistical machine translation after they were rendered into data-trained

368 Brown quoted in Way, “A Critique of Statistical Machine Translation,” 24.

369 In contrast to Brown’s conciliatory comments, Mercer explained in his acceptance speech for the ACL Lifetime Achievement award in 2014 that performing speech recognition and machine translation letter by letter, without having to deal with words at all, was the stuff of his “literal fantasies.” See Robert Mercer, “A Computational Life (ACL Lifetime Achievement Award Address)” (The 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, June 25, 2014), http://techtalks.tv/talks/closing-session/60532/.

models, once statistical methods spread across natural language processing.370

Thus, if statistical methods in speech recognition served as a stage for negotiating the role of computers in the face of language, then statistical methods in machine translation and natural language processing became one for negotiating the nature of language in the face of computing.

“Big-Data-Small-Program”

As the initial contributions of Brown and the IBM CSR group researchers swept through machine translation and into other areas of natural language processing, the characterization of their intervention began to shift as well. The increasing diversity in the types of language tasks that came to be deemed appropriate for statistical methods marked their steady encroachment into new conceptual territories. Where the use of statistical language models in speech recognition was premised on an explicit disavowal of knowledge regarding linguistic structure or meaning, the progressive incursion of statistical models into part-of-speech tagging, parsing, lexicography, and so forth brought new facets of

370 Way, “A Critique of Statistical Machine Translation,” 32-33. As Way describes it, “until very recently, it proved difficult to incorporate syntactic knowledge.” It was not until researchers working on statistical parsing began improving statistical machine translation models by integrating syntax in a form that “does not rely on any linguistic annotations or assumptions, so that the formal syntax induced is not linguistically motivated” and with “much of the linguistics surfacing as annotated data” that “syntax has been shown to be of use [in statistical machine translation].”

linguistic knowledge within the purview of data-driven modeling and made the narrowly technical designation of “purely statistical” methods inexact.

Statistical methods had availed language to computational processing in a fundamental way, beyond the generic quantification involved in digitizing text as numerically encoded character strings. Text was remade in both the form and function of data, as empirical observations, the routinized output of some unseen process that could be sifted for patterns. Words were treated not as symbols, but as metrics, the aggregate counts of which could be used to approximate key structural features of their source language the same way a count of toss results could be used to approximate the weight balance of a coin. Text could be not only stored and transmitted numerically, but interpreted numerically as well, to unearth information that did not require an understanding of the words themselves. Put simply, if digital encoding rendered text numerical in form, statistical processing rendered text numerical in substance. The function of text was no longer representation, but measurement, its significance relocated from the interpretation of its contents to the enumeration of its occurrences.
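
To make the coin analogy concrete, a purely illustrative sketch (hypothetical code, not drawn from any of the systems discussed here) of how text is “interpreted numerically” might estimate word-sequence probabilities from nothing but raw counts, just as repeated tosses estimate a coin’s bias:

    from collections import Counter

    def bigram_model(path):
        """Estimate P(next word | current word) from raw counts in a text file."""
        words = open(path, encoding="utf-8").read().lower().split()
        unigrams = Counter(words)
        bigrams = Counter(zip(words, words[1:]))
        # Relative frequency: count(w1 w2) / count(w1), the maximum likelihood
        # estimate, analogous to estimating a coin's bias as heads / tosses.
        return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

No understanding of the words is involved; the model registers only how often each word follows another, which is precisely the sense in which the significance of text is relocated to the enumeration of its occurrences.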

At an event commemorating the tenth anniversary of the International

Conference on Machine Translation in 1994, Yorick Wilks framed the lasting impact of the IBM work “on MT and computational linguistics in general” in broader terms, suggesting that “unaided statistical methods will probably not be enough for any viable [MT] system . . . [but] thanks to IBM, resource driven

systems are here to stay.”371 The expanding influence of statistical methods within computational linguistics also contributed to the field’s own expanding reach, casting natural language processing in the domain of information management and data analysis. NLP, under the influence of statistical methods, became increasingly focused on pattern recognition and classification, component tasks that would later prove ideally suited to new information processing and data management challenges posed by the World Wide Web. In a prescient move, Wilks in 1994 also christened this new conception of the statistical approach to language processing with a term that is now all too familiar: “Big-Data-Small-Program.”372

By the mid-1990s, the field of computational linguistics, and natural language processing in particular, had tilted headlong into data-driven statistical methods. In 1996, linguist Steve Abney, reflecting on the state of the field, observed:

In the space of the last ten years, statistical methods have gone from being virtually unknown in computational linguistics to being a fundamental

371 Yorick Wilks, “Keynote Address: Some Notes on the State of the Art: Where Are We Now in MT,” in Machine Translation: Ten Years on, ed. Douglas Clarke and Alfred Vella (International Conference on Machine Translation, Cranfield, UK, 1994), 2.2.

372 Ibid. Though there is little use in engaging in debates about when terms were first used, it is worth noting that this is definitely one of the earlier references to “big data.” More importantly, the addition of “small program” makes this a reference that is particular to the computational philosophy implied by “big data” in the way that it is used in data mining and machine learning. It’s not simply “big data” as a way of describing data that is high in the “three Vs” (volume, velocity, variability), but also refers to the reliance on such data in place of sophisticated (or “big”) algorithms. See, for instance, Alon Halevy, Peter Norvig, and Fernando Pereira, “The Unreasonable Effectiveness of Data,” IEEE Intelligent Systems 24, no. 2 (2009): 8–12. In other words, it’s an early use of the term “big data” as a qualitative descriptor for an approach to computational modeling, rather than of the quantitative dimensions of a data set.

given. In 1996, no one can profess to be a computational linguist without a passing knowledge of statistical models. HMM’s are as de rigueur as LR tables [Left-to-Right tables, a common structure used in deterministic parsers], and anyone who cannot at least use the terminology persuasively risks being mistaken for kitchen help at the ACL [Association for Computational Linguistics] banquet.373

This turn towards statistical methods, as Wilks’ “big-data-small-program” descriptor might have indicated, was seen by NLP researchers as a return to the corpus-based, empirical inclinations that had been popular in linguistics until the

1960s. Its revival was largely attributed to two major factors that would not only shape the methods, but also shift the aims and applications for natural language processing.

First and most conspicuous was the explosive growth in machine-readable text, and the corresponding growth in computing power to process it.374 In their introduction to a two-volume special issue in Computational Linguistics in 1993,

Kenneth Church and Bob Mercer maintained that “the most immediate reason for this empirical renaissance is the availability of massive quantities of data,” wherein not only was text data more abundant, but also more accessible, thanks to

373 Steven Abney, “Statistical Methods and Linguistics,” in The Balancing Act: Combining Symbolic and Statistical Approaches to Language, ed. Judith Klavans and Philip Resnik (Cambridge, MA: MIT Press, 1996), 1. This quotation was also notably used by Peter Norvig, in his 2011 refutation of Chomsky’s critique of statistical modeling in AI and Machine Learning, as evidence of the success of statistical methods in capturing the interest of researchers.

374 Peter Brown had joked during the Q&A segment following his and Bob Mercer’s 2013 Bitext presentation that the problem wasn’t the “crude force of computers,” as the 1988 COLING review had suggested, but instead “the force of crude computers.”

numerous public text data collection efforts.375 The eager spread of statistical techniques in computational linguistics was indeed so closely associated with the growing availability of text data that, though this immensely influential special issue was described by its editor as “bear[ing] witness to promising developments in computational linguistics . . . [using] empirical and statistical methods,” it was simply titled “Using Large Corpora.”376 In short, a small subset of stochastic modeling and parameter estimation techniques drawn from speech recognition had become nearly synonymous with corpus-based or “empirical” approaches to computational linguistics.

While the availability of data made statistical methods accessible to more

NLP researchers, it was the growing demands for quantitative performance evaluation that made them appealing. The most important factor in the rise of statistical methods other than data, according to Church and Mercer, was “the recent emphasis on numerical evaluations and concrete deliverables,” since it was something “the data-intensive approach to language . . . [was] well suited to

375 Church and Mercer, “Introduction to the Special Issue on Computational Linguistics Using Large Corpora,” 1.

376 Susan Armstrong-Warwick, “Preface,” Computational Linguistics 19, no. 1 (March 1993): iii. The influence of the issue was referenced a decade later, in the introduction to another special issue, one on “Web as Corpus.” According to its authors, for several years “it was not clear whether corpus work was an acceptable part of the [computational linguistics] field. It was only with the highly successful 1993 special issue of this journal, ‘Using Large Corpora,’ that the relation between computational linguistics and corpora was consummated.” Adam Kilgarriff and Gregory Grefenstette, “Introduction to the Special Issue on the Web as Corpus,” Computational Linguistics 29, no. 3 (September 1, 2003): 335.

meet.”377 The preoccupation with quantitative evaluation standards came from funding agencies such as DARPA, according to Church, motivated by a need to

“manage expectations,” given the history of computational speech and language processing as a field where “so much has been promised at various points, that it would be inevitable that there would be some disappointment when some of these expectations remained unfulfilled.”378 Standard evaluation metrics provided a means for “demonstrating consistent progress over time . . . [and] helps sell the field.”379 The desire to demonstrate quantifiable progress was also partly responsible for the increased availability of text data, as corpus collection efforts were also needed to provide standard test sets for evaluation. Statistical methods were well adapted to numerical evaluation and the demonstration of incremental progress thanks precisely to the “mathematically coherent” framework that Brown had touted for machine translation, which provided formal mechanisms that could rigorously quantify abstract concepts like information density and prediction error for comparative assessment.380
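
One such mechanism, long standard in language modeling and offered here only as an illustration of what it means to quantify prediction rigorously, is perplexity, which converts a model’s uncertainty over a held-out test text of $N$ words into a single comparable number:

\[ PP(w_1 \ldots w_N) = P(w_1, \ldots, w_N)^{-\frac{1}{N}} \]

The lower the perplexity, the better the model predicts unseen text, allowing incremental improvements to be reported as consistent, quantifiable progress across systems and over time.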

377 Church and Mercer, “Introduction to the Special Issue on Computational Linguistics Using Large Corpora,” 1.

378 Kenneth W. Church, “Speech and Language Processing: Where Have We Been and Where Are We Going” (Eurospeech, Geneva, Switzerland, 2003), 2.

379 Ibid., 1-2. Church credits Charles Wayne, the managing director at DARPA responsible for the Speech and Language Processing workshops, as a particularly influential advocate. It’s worth noting that Peter Brown also credits Wayne with encouraging and funding much of the statistical machine translation work which was considered too much of a “luxury” to maintain financial support from within IBM.

380 Church and Mercer, “Introduction to the Special Issue on Computational Linguistics Using Large Corpora,” 1.

If data provided the resources to fuel the rise of statistical methods as the dominant approach to computational linguistics by the 1990s, it was the focus on evaluation that steered the course. Research in computational linguistics increasingly veered towards text analysis, a term Church and Mercer used to describe the “data-intensive” mode of natural language processing, characterized by “a pragmatic approach . . . [that] focuses on broad (though possibly superficial) coverage of unrestricted text, rather than deep analysis of (artificially) restricted domains.”381 Text analysis highlighted the use of text corpora as a practical surrogate for directly observing sufficiently large quantities of language-in-use in the world. The use of text corpora of naturally-occurring language was also

“pragmatic” in a different sense—it “placed an emphasis on engineering practical solutions . . . that can work on raw text as it exists in the world,”382 rather than what were known as “toy systems” that were not usable outside their very limited, artificially-restricted vocabularies.383 “Deep analysis” in this instance referred to comprehensive, often layered, systems of rules and interactions tailored to a particular subject or task domain. Prior to the dominance of statistical methods,

381 Ibid.

382 Manning and Schütze, Foundations of Statistical Natural Language Processing, 7.

383 The term “toy system” is derived from AI research. Wilks notes that “in the era of AI methods in CL/NLP [from the 1960s until the late 1980s], the vocabularies of working systems were found to average about 35, which gave rise to the term ‘toy systems.’” He further points out that though these were technically operational systems, with “many formal achievements in print, they have had little success in producing any general and usable program.” Wilks, “Computational Linguistics,” 764-765.

natural language processing systems were typically driven by large sets of grammar rules supplemented with preference rules for resolving ambiguities and error-correction rules for handling unexpected inputs, whether erroneous or out of domain. All of these rules were manually constructed, and developers became overwhelmed by interactions among them once systems reached a certain size. Adapting a system to a new domain was nearly as hard as starting over again.384

Work in text analysis, on the other hand, “concentrated on the lower levels of grammatical processing”385 that could be discerned as measurable regularities in a body of text, fastidiously skirting the murky domain of meaning.

This reorganization of natural language processing in terms of pattern recognition bore the lingering impression of speech recognition’s particular aims.

The congruent forces of data and evaluation were marshaled using techniques, such as HMMs, the noisy channel model, and maximum entropy, which had been tailored to the informatic proclivities of speech recognition and signal processing, which formally prioritized the identification of patterns over the interpretation of meaning. As discussed in previous chapters, linguistic knowledge had initially been incorporated into speech recognition systems as a means to cope with acoustic variability. The presence of “high-level constraints,” such as semantics, had always been primarily tactical, “as a tool to disambiguate . . . the speech

384 Abney, “Statistical Methods in Language Processing,” 315.

385 Manning and Schütze, Foundations of Statistical Natural Language Processing, 8.

signal by understanding the message.”386 Even in knowledge-based systems, linguistic meaning served a primarily instrumental role, rather than being the end goal. The turn to pattern recognition, in other words, was already built into the conceptual machinery of the particular strain of statistical modeling that had been passed down from speech recognition.

The wholesale adoption of major concepts from speech recognition served as a conduit between computational linguistics and machine learning, a kindred field to speech recognition387 that “became a second major tributary [of statistical techniques] soon thereafter . . . [that] provided a wealth of learning methods, with a particular focus on classification.”388 From around the late 1980s and through the 1990s, machine learning was undergoing its own transformation as it outgrew its parent fields of artificial intelligence and cognitive science. Here too, these changes spurred a burgeoning commitment to formal evaluation and

386 Church and Mercer, “Introduction to the Special Issue on Computational Linguistics Using Large Corpora,” 4.

387 Machine learning was declared to be “intimately related to speech recognition” in a 1989 white paper authored by some of the most prominent figures in speech research, including Makhoul, Jelinek, and Rabiner, as well as Clifford Weinstein (head of the Human Language Technology group at Lincoln Labs) and Victor Zue (then-director of the MIT Spoken Language Systems group and later director of the Computer Science and Artificial Intelligence Lab). In the paper, they explain that the computational models (such as HMMs) used in speech recognition rely on “automatic training (i.e., learning) methods” from machine learning. See Makhoul et al., “White Paper on Spoken Language Systems,” in Speech and Natural Language: Proceedings of a Workshop (Cape Cod, MA: Morgan Kaufmann Publishers, Inc., 1989), 468.

388 Steven Abney, “Statistical Methods in Language Processing,” Wiley Interdisciplinary Reviews: Cognitive Science 2, no. 3 (2011): 315.

controlled experimentation.389 Using this emphasis on performance evaluation to distinguish itself as an experimental field, machine learning expanded from its narrow attachment to expert systems and symbolic knowledge representation, which had been inherited from AI, to “include the study of any methods that improved performance with experience.”390 In particular, machine learning began to incorporate techniques from pattern recognition, including statistical modeling methods.

Both the commitment to evaluation and the borrowing of statistical methods from pattern recognition had the unplanned consequence of reorienting machine learning around data. This was in notable contrast to the field’s previous epistemic commitments, which stemmed from symbolic AI and “explanation-based learning” that “assumed that the product of learning was knowledge stated in some explicit form.”391 But by the time machine learning was being widely

389 Pat Langley, “The Changing Science of Machine Learning,” Machine Learning 82, no. 3 (March 2011): 276. Pat Langley, the founding editor of the journal Machine Learning, recalled that one of the key steps researchers took to distinguish the field from symbolic AI was defining “a framework for . . . an experimental science of machine learning” that approached “learning” as the ability “to improve performance to some class of tasks,” in a way that was measurable and replicable. This was done through the adoption of formal methods of experimental design and evaluation, such that “within a few years [of 1988], the vast majority of Machine Learning articles reported experimental results about performance improvement on well-defined tasks.” The year 1988 refers to the publication of a paper by D. Kibler and P. Langley in Proceedings of the third European working session on learning that Langley cites for laying out an experimental framework for machine learning.

390 Ibid., 277.

391 Ibid.

incorporated into NLP in the 1990s, it too had already been drawn into the orbit of data processing:

The increasing reliance on experimental evaluation that revolved around performance metrics meant there was no evolutionary pressure to study knowledge-generating mechanisms. At the same time, the advent of large data sets convinced many that learning rate was not an important issue, reducing interest in using background knowledge to improve it. In some circles, learning from very large data sets became almost a fetish, marginalizing work on learning rapidly from few experiences.392

This shift in the status of knowledge corresponded to a shift in practice, as well.

As with natural language processing, machine learning research began to exhibit a data-centered approach, adapted to tasks that were amenable to statistical methods and pattern recognition. Machine learning was increasingly defined by a growing emphasis on classification and regression endeavors where, unlike the more complex and subjective aims that had previously propelled the field, such as reasoning, problem-solving, and, rather aptly, language understanding, progress was far easier to define and evaluate.393 Moreover, the rearrangement of procedural priorities additionally altered the field’s working objects, as “new emphasis on classification led to an increased focus on component learning algorithms that were studied in isolation from larger-scale intelligent systems.”394

That is, statistical methods in pattern recognition provided a technological fault

392 Ibid., 278.

393 Ibid., 277.

394 Ibid.

line, along which the widening epistemic fissure between machine learning and mainstream AI was concretized as material practice.

The rise of statistical methods in computational linguistics thus made natural language processing and machine learning amenable to one another, allied through a common shift in focus from “high level” tasks dealing with interpretation and understanding to “low level” tasks concerned primarily with the classification of data. Linking statistical methods to “low level” tasks of sorting text data did not mean that “high level” processes dealing with interpretation and language understanding were entirely abandoned, however.

Instead, they were increasingly refashioned as classification and sorting tasks better suited to statistical analysis and the rubrics of standard evaluations.395 As a popular 1999 textbook on natural language processing reasoned:

people have sometimes expressed skepticism as to whether statistical approaches can ever deal with meaning. But the difficulty in answering this question is mainly in defining what ‘meaning’ is! . . . [F]rom a Statistical NLP perspective, it is more natural to think of meaning as residing in the distribution of contexts over which words and utterances are used . . . [W]here the meaning of a word is defined by the circumstances of its use . . . much of statistical NLP research directly tackles questions of meaning.396

395 For instance, Wilks refers to a practical information extraction system developed in 1997, noting that “IE [information extraction] has become an established technology, and this has been achieved largely by surface pattern matching, rather than by syntactic analysis and the use of knowledge structures.” See Wilks, “Computational Linguistics: History,” 767.

396 Manning and Schütze, Foundations of Statistical Natural Language Processing, 16-17.

In other words, interpretive tasks remained well within the scope of statistical pattern recognition, in so far as interpretation could be collapsed into the adequate specification of classificatory schema across which probability distributions could be allotted.
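To give this distributional notion a concrete shape, the brief sketch below renders a word’s “meaning” as nothing more than a normalized count of the words that surround it. The sketch is purely illustrative: the toy corpus, the window size, and the function name are hypothetical, not drawn from any of the systems discussed in this chapter.

```python
# Illustrative sketch (hypothetical corpus and names): a word's "meaning"
# approximated as a probability distribution over its nearby context words.
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()

def context_distribution(target, tokens, window=2):
    """Collect a normalized distribution over words occurring within
    `window` positions of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

# Words whose context distributions resemble one another count as similar
# "in meaning," without any explicit representation of what they denote.
print(context_distribution("cat", corpus))
print(context_distribution("dog", corpus))
```

Two words whose context distributions resemble one another are thereby treated as similar in “meaning,” without any explicit representation of what either word denotes.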

As with machine learning more generally, this required the decomposition of complex, integrated functions such as language understanding or problem solving into “component learning” tasks that could be studied in isolation using classification and regression. This emphasis on component learning procedures additionally generalized natural language processing procedures, detaching them from large-scale language understanding systems that were customized to specific task or topic domains. The statistical paradigm in computational linguistics radically operationalized natural language processing as text analysis, treating the problem of language as a problem of sifting through large quantities of unrestricted text data to identify surface regularities and utilize these patterns to construct models for automatic classification and clustering. Broken up into component tasks and unconstrained by domain knowledge, Statistical NLP, or rather Text Analysis, was no longer anchored to the specific aims of language processing.

Data’s Rising Tide

Having thus slipped the custody of linguistics, and newly allied with machine learning, text analysis was soon drawn into challenges that arose from another quarter. By the time statistical methods swept natural language processing in the 1990s, the emerging orthodoxy of “there’s no data like more data,” as Mercer had so succinctly summarized in 1985, had begun to mutate towards a more invasive strain: “The rising tide of data will lift all boats!”397 It was not simply that quantity was privileged over all other attributes of data, but that sufficient quantities of data were privileged over all other attributes of predictive modeling as a whole. According to Eric Brill of Microsoft Research, it was increases in data and not improvements in algorithms that would move the NLP field forward, concluding, “it never pays to think until you’ve run out of data.”398 Whereas data in statistical speech recognition was posed as a replacement for linguistic theory, data in statistical language processing was, at the extremes, posed as a replacement for all forms of knowledge. Jelinek had joked about firing linguists, leaving only engineers and mathematicians to handle the data; Michele Banko and

397 Kenneth W. Church, “Empiricism from TMI-1992 to AMTA-2002 to AMTA-2012” (Association for Machine Translation in the Americas, Tiburon, CA, October 8, 2002).

398 Brill quoted in Kenneth W. Church, “Where Have We Been and Where Are We Going?” (ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, 2004).

Eric Brill “suggested firing everyone (including themselves),” leaving only money to spend on acquiring more data.399

However, while the deluge of digital text was embraced as a windfall for statistical language processing, that “rising tide” took on a decidedly more foreboding cast in fields charged with the orderly management of data, not as a burgeoning promise of uplift, but as a looming threat of submersion. Natural language text in particular compounded the problem of simple abundance with a lack of structural restraint. As data, unrestricted natural language text was considered “unstructured,” inconsistent in size and unassimilated to the rigid indexing architectures of database management.400 Particularly troubling was a new, ever-expanding source of data—a significant portion of which was in the form of unstructured text—that was both punishingly vast and singularly unruly: the World Wide Web.

399 Kenneth W. Church, “Statistical Models for Natural Language Processing,” in The Oxford Handbook of Computational Linguistics, ed. Ruslan Mitkov, 2014. See also: Kenneth W. Church, “Speech and Language Processing: Where Have We Been and Where Are We Going” (Eurospeech, Geneva, Switzerland, 2003).

400 According to computer historian Thomas Haigh, the design of database systems was inherited from file management systems and was thus “very well suited to the bureaucratic records for things such as payroll administration, because each record included the same pieces of data . . . [but] it was entirely useless for representing and search less rigidly formatted data, such as full-text records, correspondence, or even scientific abstract.” Only with the web did “widespread attention turn back to the indexing and management of huge amounts of natural language information . . . However, these [web search] technologies remain quite distinct from mainstream DBMS [Data Base Management System].” See Thomas Haigh, “‘A Veritable Bucket of Facts’ Origins of the Data Base Management System,” SIGMOD Rec. 35, no. 2 (June 2006): 44.

By the mid-1990s, the Web emerged as a new gravitational axis, around which once-distinct areas of computational linguistics, information processing, and data mining began to align and coalesce. Early applications of NLP were fairly limited, typically serving as an aid to document retrieval and database access that enabled users to “extract information from databases without using a special syntax.”401 With the rise of the Web, where individual pages are too numerous and contents too varied and dynamic to manually index, information processing has come to rely increasingly on statistical NLP applications, such as text classification and named entity extraction, to automatically index and retrieve web documents. Correspondingly, NLP also shifted away from user interaction applications (getting computers to “understand” natural language for easier-to-use interfaces) and towards the processing of text itself. As computational linguist Philip Resnik recalled:

When I started out in NLP, the big dream for language technology was centered on human-computer interaction: we’d be able to speak to our machines, in order to ask them questions and tell them what we wanted them to do . . . [B]ut in the mid 1990s something truly changed the landscape, pushing that particular dream into the background: the Web made text important again. If the statistical revolution was about the methods, the Internet revolution was about the needs. All of a sudden there was a world of information out there, and we needed ways to locate

401 P. Jackson and F. Schilder, “Natural Language Processing: Overview,” in Encyclopedia of Language & Linguistics (Second Edition), ed. Keith Brown (Oxford: Elsevier, 2006), 506.

relevant Web pages, to summarize, to translate, to ask questions and pinpoint the answers.402

With web search, the full text of documents had to be analyzed, since retrieval couldn’t be based solely on the presence of query keywords. A single keyword could easily return massive quantities (at present, billions) of results, such that any useful retrieval system would have to devise some means of assessing and comparing relevance based on a more sustained analysis of a page’s contents alongside other contextual data. The reliance on “relevance” in sorting, rather than simply on an explicit quantitative attribute (e.g., date or file size), made information retrieval on the web and within other large collections increasingly reliant on text analysis.
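One common way such relevance assessments can be operationalized is to weight each query term by how concentrated it is within a page and how rare it is across the collection, as in term frequency-inverse document frequency (tf-idf) scoring. The sketch below is illustrative only; the documents, the query, and the weighting are hypothetical examples rather than the method of any particular search engine.

```python
# Illustrative tf-idf sketch (hypothetical documents and query): rank pages by
# how concentrated each query term is in a page and how rare it is overall.
import math
from collections import Counter

docs = {
    "page1": "speech recognition with hidden markov models",
    "page2": "markov chains in probability theory",
    "page3": "recipes for tomato soup and bread",
}

def tfidf_rank(query, documents):
    """Rank documents by the summed tf-idf weight of the query terms they contain."""
    tokenized = {name: text.split() for name, text in documents.items()}
    n_docs = len(tokenized)
    scores = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.split():
            df = sum(1 for toks in tokenized.values() if term in toks)
            if df and term in tf:
                # term frequency in this page, discounted by how common the term is
                score += (tf[term] / len(tokens)) * math.log(n_docs / df)
        scores[name] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(tfidf_rank("markov models", docs))  # page1 outranks page2 outranks page3
```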

Text mining, or “Knowledge Discovery in Text,” arose as the integration of text analysis and data mining. The field of Knowledge Discovery in Databases (KDD), sometimes referred to as “data mining,”403 gained momentum in the 1990s in response to “a growing gap between data generation and data

402 Philip Resnik, “Four Revolutions,” Language Log, February 5, 2011, http://languagelog.ldc.upenn.edu/nll/?p=2946.

403 The two terms are often used synonymously, though researchers in the field have distinguished between data mining, as an older term originating in statistics, from KDD, which was intended to “designate an area of research that draws upon data mining methods from statistics, pattern recognition, machine learning, and database techniques in the context of large databases.” See: Usama Fayyad and Ramasamy Uthurusamy, eds., KDD’95: Proceedings of the First International Conference on Knowledge Discovery and Data Mining (Montréal, Québec, Canada: AAAI Press, 1995).

understanding.”404 Data mining as a practice thus developed partly in response to the kinds of information access problems that text data presented.

Like NLP, the practices of database management and information retrieval were similarly spurred towards statistically trained probabilistic models by the influx of data. But where increased data provided a new resource that made corpus-based techniques for natural language processing more accessible, it provided new challenges to existing data storage systems, and database engineers resorted to new statistical modeling methods in order to cope with the scale and complexity of new data. As historian Matthew Jones explains, KDD first emerged as a means of querying large databases that had been designed to prioritize information storage rather than access, but soon came to be understood as “the activity of creating non-trivial knowledge suitable for action from databases of vast size and dimensionality.”405 That is, though initially aimed at aiding information access, KDD became a form of knowledge production wherein knowledge “comprises ‘interesting,’ ‘actionable’ patterns from vast quantities of data.”406 Moreover, to contend with both the growing scale and complexity of data and the material constraints of database technology, Jones argues, KDD practitioners advocated

404 William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus, “Knowledge Discovery in Databases: An Overview,” AI Magazine 13, no. 3 (September 15, 1992): 57.

405 Matthew L. Jones, “Querying the Archive: Data Mining from Apriori to Pagerank,” in Science in the Archives: Pasts, Presents, Futures, ed. Lorraine Daston (Chicago: University Of Chicago Press, 2017), 311–28.

406 Ibid.

new approaches to scientific and statistical knowledge focused on relatively short-term goals and the practical challenges of analyzing “messy real-world data.”407

Data mining thus turned to techniques from statistical pattern recognition and machine learning, emerging as an unflinchingly pragmatic strain of computational knowledge.

Digital text, particularly text on the web, amplified the existing data processing problem in new ways. “Naturally occurring” text—unrestricted text from websites, email communications, press and scholarly publications, and other documents that were not generated and formatted for the express purpose of linguistic study—was in many ways particularly emblematic of the messiness of “real-world data.” It was highly variable and irregular, prone to error (or “noise”), and had no standard lengths or file sizes. At the same time, natural language text data presented challenges to data mining that were distinct from those posed by other forms of data, so much so that by the mid-1990s, the application of data mining to text had branched off into a sibling field known as text data mining, or simply text mining.

Natural Language Processing thus came to provide the crucial “preprocessing” stage of text mining. That is, for text to be tractable data, and mined like data, it has to be structured like data. Whereas preprocessing in standard data mining focused mainly on combining and normalizing data, ensuring individual

407 Ibid.

data values are correct and consistent, preprocessing operations in text mining “center on the identification and extraction of representative features for natural language documents.”408 In their handbook on text mining, Feldman and Sanger explain that text mining systems

do not run their knowledge discovery algorithms on unprepared document collections. Considerable emphasis in text mining is devoted to what are commonly referred to as preprocessing operations . . . Text mining preprocessing operations include a variety of different types of techniques culled and adapted from information retrieval, information extraction, and computational linguistics research that transform raw, unstructured, original-format content (like that which can be downloaded from PubMed) into a carefully structured, intermediate data format. Knowledge discovery operations, in turn, are operated against this specially structured intermediate representation of the original document collection.409

In other words, preprocessing in text mining uses statistical language processing to fundamentally restructure text into a suitable data format, as an “intermediate representation,” to which “knowledge discovery” operations can then be applied. NLP in text mining essentially made text commensurate with other forms of data, allowing them to be analyzed together, formally coupling textual and contextual data into a single mathematical framework. Text prediction thus became a foundation for large-scale data analytics of all forms, laying the path for the prediction of, as Google put it, “things not strings.”
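A minimal sketch of such an “intermediate representation,” under illustrative assumptions (the documents and the simple bag-of-words scheme below are hypothetical stand-ins for the far richer feature extraction Feldman and Sanger describe), shows how unstructured text can be recast as fixed-length numeric records:

```python
# Illustrative sketch (hypothetical documents): raw, unstructured text is
# restructured into a simple document-term count matrix, an intermediate
# data format on which downstream clustering or classification can run.
from collections import Counter

raw_documents = [
    "Gene expression changed after treatment.",
    "Treatment had no effect on gene expression.",
    "The patent describes a speech recognition device.",
]

def to_term_matrix(documents):
    """Turn raw text into rows of counts over a shared, ordered vocabulary."""
    tokenized = [doc.lower().replace(".", "").split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = [[Counter(tokens)[term] for term in vocabulary] for tokens in tokenized]
    return vocabulary, matrix

vocab, matrix = to_term_matrix(raw_documents)
print(vocab)       # the structured "features" extracted from unstructured text
for row in matrix:
    print(row)     # each document is now a fixed-length numeric record
```

Once text has been given this tabular form, the same sorting, clustering, and classification operations applied to any other data can be applied to it.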

408 Feldman and Sanger, The Text Mining Handbook, 1.

409 Ibid., 2-3, emphasis in original.

CHAPTER V

CONCLUSION: THE BLACKEST BOXES

"It’s about transferring information, but at the same time about a certain lack of specificity." William Gibson, Pattern Recognition (2003)410

“The fundamental goal of machine learning is to generalize . . . All learners essentially work by grouping nearby examples into the same class; the key difference is in the meaning of ‘nearby.’” Pedro Domingos, “A Few Useful Things to Know About Machine Learning” (2012)411

The preceding chapters have pursued a story about language and statistics, communication and computation, and the forging of a distinct genre of data-driven knowledge among them. I have highlighted the role of text prediction as a single thread within the broader development of big data, AI, and machine learning, one that has been tucked into the seams of a far more expansive and convoluted history, at once integral and easy to overlook. To briefly recap, I began the story with the development of speech recognition research as both a conceptual and technical hinge that linked the mathematization of communication

410 William Gibson, Pattern Recognition, Reprint edition (New York, NY: Berkley, 2005), 65.

411 Pedro Domingos, “A Few Useful Things to Know About Machine Learning,” Communications of the ACM 55, no. 10 (2012): 78–87.

in signal processing to the computerization of language in text analytics. In its so-called “statistical” turn beginning in the 1970s, speech recognition research became a critical site where language was made commensurate with data processing in new ways.

Chapter two examined major conceptual and institutional shifts in speech recognition research in the US as it moved across telecommunications, defense research, and commercial computing, and the role of speech recognition as the site of ongoing debates over the purpose, feasibility, and fundamental nature of computing and artificial intelligence. It traced the formation of IBM’s Continuous Speech Recognition group and the commercial forces that drove the group’s pursuit of speech recognition as a statistical rather than linguistic problem, which prefigured both the conceptual terrain in which big data became thinkable and the economic conditions under which it became imperative. Chapter three then teased out the technical particulars that comprised the “statistical” approach, examining the cultural and epistemic priorities encapsulated in modeling techniques that, rather than producing formal representations of linguistic knowledge, sought instead to model precisely its absence. Finally, chapter four followed the data-driven statistical techniques from speech recognition as they spread across the broader field of computational linguistics and were drawn into the consolidation of data management, machine learning, and natural language processing as these disparate domains began to coalesce around the rise of the World Wide Web in the 1990s.

The story, however, does not end there. The dominance of the statistical techniques that populate the history recounted here has already begun to erode. Over the past two decades, statistical machine learning has been steadily giving ground in all quarters to “deep learning,” a broad term for the latest revival of artificial neural networks. In late 2016, for instance, Google announced that its popular machine translation tool was transitioning from a statistical modeling approach to a new deep learning system called Google Neural Machine Translation.412 While the substance of the differences between these approaches is outside the scope of the current work, one immediately apparent distinction is deep learning’s decidedly more organic, even spiritual, nomenclature. In contrast to the staid engineering metaphors of noisy channels and self-organizing systems, deep learning appeals to imagery of matter and consciousness, neurons and “deep belief networks,” suggesting that the parameters between human and machine knowledge remain emphatically under contention. Despite the rhetorical differences, however, both statistical and deep learning approaches are similarly centered around principles of data, automation, and approximation. Both privilege what we might call “proximate knowledge,” organized around operations that

412 Quoc V. Le and Mike Schuster, “A Neural Network for Machine Translation, at Production Scale,” Google Research Blog, September 27, 2016, https://research.googleblog.com/2016/09/a-neural-network-for-machine.html.

seek to define “similarity” between elements, with the aim of producing models that “generalize well,” or approximate closely without depicting exactly.

Throughout this narrative, IBM’s Continuous Speech Recognition group has provided a major focal point, though it is certainly not the only story that could be told. Still, there is no testament to the influence and adaptability of text prediction research more potent than the post-IBM careers of its researchers, and that of Robert Mercer in particular. While some, including group director Frederick Jelinek, remained tied to natural language processing research, key members including Mercer, Brown, Bahl, and many others found their way into finance, joining Renaissance Technologies, a pioneering force in quantitative and algorithmic trading best known for its almost incomprehensibly vast profits.

Mercer and Brown took over the company as co-executives following founder Jim Simons’s retirement in 2009, and in 2016, Bloomberg Markets reported the company’s flagship Medallion fund to be the most profitable hedge fund in history.413 When the two were asked in 2013 if their current work in finance had been informed by their speech and translation research at IBM, Brown’s response neatly captured the radical equivalence made possible through data-driven techniques. Carefully picking his way along the margins of corporate secrecy, he

413 Katherine Burton, “Inside a Moneymaking Machine Like No Other,” Bloomberg Markets, November 21, 2016, https://www.bloomberg.com/news/articles/2016-11-21/how-renaissance-s-medallion-fund-became-finance-s-blackest-box. The Medallion fund led by a significant margin, with approximately $10 billion more in lifetime profits than the second-place fund, and that was before accounting for the fund’s notably shorter existence.

replied, “In some sense, if you look at our blackboards, they look exactly like your blackboards. Full of similar kinds of equations.”414

The former IBM researchers and their experiences in building speech and text processing systems certainly did not single-handedly shape the investment approach at Renaissance. Though their significance is clearly evidenced by Brown and Mercer’s positions as founder Jim Simons’s successors, Renaissance Technologies had long been a company replete with mathematical expertise, including Jim Simons himself, who had previously worked in code-breaking at the Institute for Defense Analyses. Perhaps most notably, the company also once included Leonard Baum, Simons’s former colleague at IDA and the primary developer of the hidden Markov model, which had been so central to the work of the IBM CSR group. What Mercer, Brown, and their IBM colleagues brought, however, was computing expertise and experience in implementing mathematical models through data processing systems at scale. In other words, they brought the capacity for “big data.” The same Bloomberg Markets article that named Renaissance’s Medallion fund the most profitable in history also referred to it by another superlative: “the blackest box in all of finance.”415

As perhaps further testament to the startling flexibility of data-driven modeling techniques, Mercer has gained particular notoriety for his contributions

414 Brown and Mercer, “Oh, Yes, Everything’s Right on Schedule, Fred.”

415 Burton, “Inside a Moneymaking Machine Like No Other.”

in the political arena, as a backer of Cambridge Analytica, the notoriously “black box” data analytics company that has been credited with both the outcome of the Brexit vote and Donald Trump’s 2016 election.416 It is hard to say, of course, to what extent and in what form the particular techniques that were refined in speech recognition and natural language processing have made their way into the electoral targeting strategies of a company like Cambridge Analytica, though the overall epistemic machinery, the conceptual cranks and levers, likely remain intact. What is nevertheless suggestive is not whether these figurative black boxes, like Brown’s literal blackboards, are full of the same equations, but the very fact that they can be black-boxed at all. In other words, the history of text prediction and its role in the emergence of big data analytics does not seek to reveal the precise contents of big data’s many black boxes, but to question how so many disparate forms of knowledge work came to be contained within them.

416 Mercer’s financial involvement in Cambridge Analytica and the Trump campaign has been reported on widely. See, for instance, Jane Mayer, “The Reclusive Hedge-Fund Tycoon Behind the Trump Presidency,” The New Yorker, March 17, 2017, http://www.newyorker.com/magazine/2017/03/27/the-reclusive-hedge-fund-tycoon-behind-the-trump-presidency. A series of articles for The Guardian by journalist Carole Cadwalladr on the ties between Mercer, Cambridge Analytica, the Brexit vote, and the Trump presidential campaign was published in 2017. At the time of writing, these articles are the subject of legal complaints from Cambridge Analytica. See Carole Cadwalladr, “Revealed: How US Billionaire Helped to Back Brexit,” The Guardian, February 25, 2017, sec. Politics, https://www.theguardian.com/politics/2017/feb/26/us-billionaire-mercer-helped-back-brexit; Carole Cadwalladr, “Robert Mercer: The Big Data Billionaire Waging War on Mainstream Media,” The Guardian, February 26, 2017, sec. Politics, https://www.theguardian.com/politics/2017/feb/26/robert-mercer-breitbart-war-on-media-steve-bannon-donald-trump-nigel-farage; Carole Cadwalladr, “Follow the Data: Does a Legal Document Link Brexit Campaigns to US Billionaire?,” The Guardian, May 14, 2017, sec. Technology, http://www.theguardian.com/technology/2017/may/14/robert-mercer-cambridge-analytica-leave-eu-referendum-brexit-campaigns.

WORKS CITED

“A Writing Machine That Responds to the Voice.” The Electrical Experimenter, April 1916. A-10414. Photograph, n.d. AT&T Archives and History Center, Warren NJ. Abney, Steven. “Statistical Methods and Linguistics.” In The Balancing Act: Combining Symbolic and Statistical Approaches to Language, edited by Judith Klavans and Philip Resnik, 1–23. Cambridge, MA: MIT Press, 1996. ———. “Statistical Methods in Language Processing.” Wiley Interdisciplinary Reviews: Cognitive Science 2, no. 3 (2011): 315–22. doi:10.1002/wcs.111. Ananny, Mike, and Kate Crawford. “Seeing without Knowing: Limitations of the Transparency Ideal and Its Application to Algorithmic Accountability.” New Media & Society, December 13, 2016, 1–17. doi: 10.1177/1461444816676645. Anderson, Ashton, Dan McFarland, and Dan Jurafsky. “Towards a Computational History of the ACL: 1980-2008.” In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries, 13–21. ACL ’12. Stroudsburg, PA, USA: Association for Computational Linguistics, 2012. http://dl.acm.org/citation.cfm?id=2390507.2390510. Armstrong-Warwick, Susan. “Preface.” Computational Linguistics 19, no. 1 (March 1993): iii–iv. Averbuch, A, L. Bahl, R. Bakis, P. Brown, G. Daggett, S. Das, K. Davies, et al. “Experiments with the Tangora 20,000 Word Speech Recognizer.” In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’87., 12:701–4, 1987. doi:10.1109/ICASSP.1987.1169870. Bahl, L., J. Baker, P. Cohen, A. Cole, F. Jelinek, B. Lewis, and R.L. Mercer. “Automatic Recognition of Continuously Spoken Sentences from a Finite State Grammer.” In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’78., 3:418–21, 1978. doi:10.1109/ ICASSP.1978.1170404. Bahl, L., J. Baker, P. Cohen, N. Dixon, F. Jelinek, R. Mercer, and H. Silverman. “Preliminary Results on the Performance of a System for the Automatic Recognition of Continuous Speech.” In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’76., 1:425–29, 1976. doi:10.1109/ICASSP.1976.1170026. Bahl, L., J. Baker, P. Cohen, F. Jelinek, B. Lewis, and R.L. Mercer. “Recognition of Continuously Read Natural Corpus.” In Acoustics, Speech, and Signal

!235 Processing, IEEE International Conference on ICASSP ’78., 3:422–24, 1978. doi:10.1109/ICASSP.1978.1170402. Bahl, L., R. Bakis, P. Cohen, A Cole, F. Jelinek, B. Lewis, and R.L. Mercer. “Further Results on the Recognition of a Continuously Read Natural Corpus.” In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’80., 5:872–75, 1980. doi:10.1109/ICASSP. 1980.1170862. Bahl, L. R., R. Bakis, J. Bellegarda, P.F. Brown, D. Burshtein, S. Das, P.V. de Souza, et al. “Large Vocabulary Natural Language Continuous Speech Recognition.” In International Conference on Acoustics, Speech, and Signal Processing, 1989. ICASSP-89, 465–67, 1989. doi:10.1109/ICASSP. 1989.266464. Bahl, L. R., P. F. Brown, P. V. de Souza, R. L. Mercer, and M. A. Picheny. “Automatic Construction of Acoustic Markov Models for Words.” In 1st IASTED International Symposium on Signal Processing and Its Applications, 565–69. Brisbane, Australia, 1987. Bahl, Lalit R., R. Bakis, P. Cohen, A. Cole, F. Jelinek, B. Lewis, and R. Mercer. “Recognition Results for Several Experimental Acoustic Processors.” In ICASSP ’79. IEEE International Conference on Acoustics, Speech, and Signal Processing, 4:249–51, 1979. doi:10.1109/ICASSP.1979.1170736. Bahl, Lalit R., R. Bakis, P. Cohen, A Cole, F. Jelinek, B. Lewis, and R.L. Mercer. “Speech Recognition of a Natural Text Read as Isolated Words.” In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’81., 6:1168–71, 1981. doi:10.1109/ICASSP.1981.1171115. Bahl, Lalit R., P.F. Brown, P.V. de Souza, R.L. Mercer, and M.A. Picheny. “A Method for the Construction of Acoustic Markov Models for Words.” IEEE Transactions on Speech and Audio Processing 1, no. 4 (October 1993): 443–52. doi:10.1109/89.242490. Bahl, Lalit R., Peter V. deSouza, Robert L. Mercer, and Michael A. Picheny. Automatic generation of simple Markov model stunted baseforms for words in a vocabulary. US4833712 A, filed May 29, 1985, and issued May 23, 1989. http://www.google.com/patents/US4833712. ———. Feneme-based Markov models for words. US5165007 A, filed June 12, 1989, and issued November 17, 1992. http://www.google.com/patents/ US5165007. Bahl, Lalit R., F. Jelinek, and R. Mercer. “A Maximum Likelihood Approach to Continuous Speech Recognition.” IEEE Transactions on Pattern Analysis

!236 and Machine Intelligence PAMI-5, no. 2 (March 1983): 179–90. doi: 10.1109/TPAMI.1983.4767370. Baker, James K. Interview by Patri Pugliese. Audio recording, December 21, 2006. History of Speech and Language Technology Project. http:// www.sarasinstitute.org/Pages/Interv/SarJimBaker.html. Baker, Janet M. Interview by Patri Pugliese. Audio recording, January 18, 2007. History of Speech and Language Technology Project. http:// www.sarasinstitute.org/Pages/Interv/SarJanetBaker.html. Bakis, R. “Continuous Speech Recognition via Centisecond Acoustic States.” The Journal of the Acoustical Society of America 59, no. S1 (April 1, 1976): S97–S97. doi:10.1121/1.2003011. Bakis, Raimo, and Jordan Rian Cohen. Nonlinear signal processing in a speech recognition system. EP0179280 A2, filed September 20, 1985, and issued April 30, 1986. http://www.google.com.na/patents/EP0179280A2. Barlow, W. H. “On the Pneumatic Action Which Accompanies the Articulation of Sounds by the Human Voice, as Exhibited by a Recording Instrument.” Proceedings of the Royal Society of London 22, no. 152 (April 1874): 277–86. doi:10.1098/rspl.1873.0043. Barlow, W.H. “The Logograph.” Journal of the Society of Telegraph Engineers 7, no. 21 (1878): 65–68. doi:10.1049/jste-1.1878.0006. Barocas, Solon, Sophie Hood, and Malte Ziewitz. “Governing Algorithms: A Provocation Piece.” SSRN Scholarly Paper. Rochester, NY: Social Science Research Network, March 29, 2013. https://papers.ssrn.com/ abstract=2245322. Basharin, Gely P., Amy N. Langville, and Valeriy A. Naumov. “The Life and Work of A.A. Markov.” Linear Algebra and Its Applications, Special Issue on the Conference on the Numerical Solution of Markov Chains 2003, 386 (July 15, 2004): 3–26. doi:10.1016/j.laa.2003.12.041. Bates, Madeleine. “Overview of the ARPA Human Language Technology Workshop.” In Human Language Technology: Proceeding of a Workshop, 3–4. Plainsboro, NJ: Morgan Kaufmann Publishers, Inc., 1993. Baum, Leonard E. “An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes.” Inequalities 3 (1972): 1–8. Baum, Leonard E., and J. A. Eagon. “An Inequality with Applications to Statistical Estimation for Probabilistic Functions of Markov Processes and to a Model for Ecology.” Bulletin of the American Mathematical Society 73, no. 3 (May 1967): 360–63.

!237 Baum, Leonard E., and Ted Petrie. “Statistical Inference for Probabilistic Functions of Finite State Markov Chains.” The Annals of Mathematical Statistics 37, no. 6 (1966): 1554–63. Baum, Leonard E., Ted Petrie, George Soules, and Norman Weiss. “A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains.” The Annals of Mathematical Statistics 41, no. 1 (February 1970): 164–71. doi:10.1214/aoms/ 1177697196. Berger, Adam L., Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, John R. Gillett, John D. Lafferty, Robert L. Mercer, Harry Printz, and Luboš Ureš. “The Candide System for Machine Translation.” In Proceedings of the Workshop on Human Language Technology, 157–162. HLT ’94. Stroudsburg, PA, USA: Association for Computational Linguistics, 1994. doi:10.3115/1075812.1075844. Biddulph, Rulon S., and Kingsbury H. Davis. Voice-operated device. US2685615 A, filed May 1, 1952, and issued August 3, 1954. http://www.google.com/ patents/US2685615. Bollen, Johan, Huina Mao, and Xiaojun Zeng. “Twitter Mood Predicts the Stock Market.” Journal of Computational Science 2, no. 1 (March 1, 2011): 1–8. doi:10.1016/j.jocs.2010.12.007. Bouk, Dan. How Our Days Became Numbered: Risk and the Rise of the Statistical Individual. Chicago; London: University Of Chicago Press, 2015. Brain, Robert. The Pulse of Modernism: Physiological Aesthetics in Fin-de-Siècle Europe. Seattle: University of Washington Press, 2015. Brown, P., J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, R. Mercer, and P. Roossin. “A Statistical Approach to Language Translation.” In Proceedings of the 12th Conference on Computational Linguistics - Volume 1, 71–76. COLING ’88. Stroudsburg, PA, USA: Association for Computational Linguistics, 1988. doi:10.3115/991635.991651. Brown, Peter, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, Robert Mercer, and Paul S. Roossin. “A Statistical Approach to French/English Translation,” (pages not numbered). Pittsburgh, PA, USA, 1988. Brown, Peter F., John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. “A Statistical Approach to Machine Translation.” Computational Linguistics 16, no. 2 (June 1990): 79–85.

!238 Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra, Meredith J. Goldsmith, Jan Hajic, Robert L. Mercer, and Surya Mohanty. “But Dictionaries Are Data Too.” In Proceedings of the Workshop on Human Language Technology, 202–205. HLT ’93. Stroudsburg, PA, USA: Association for Computational Linguistics, 1993. doi: 10.3115/1075671.1075716. Brown, Peter F., Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. “The Mathematics of Statistical Machine Translation: Parameter Estimation.” Computational Linguistics 19, no. 2 (June 1993): 263–311. Brown, Peter, and Robert Mercer. “Oh, Yes, Everything’s Right on Schedule, Fred.” presented at the Twenty Years of Bitext, Seattle, WA, October 18, 2013. http://cs.jhu.edu/~post/bitext/. Bruce, Bertram C. “HWIM: A Computer Model of Language Comprehension and Production.” Champaign, IL: Center for the Study of Reading, March 1982. https://www.ideals.illinois.edu/handle/2142/18004. Burton, Katherine. “Inside a Moneymaking Machine Like No Other.” Bloomberg Markets, November 21, 2016. https://www.bloomberg.com/news/articles/ 2016-11-21/how-renaissance-s-medallion-fund-became-finance-s- blackest-box. Bush, Vannevar. “As We May Think.” The Atlantic, July 1945. Business Insider Tech. “Watch Siri Fail Live on-Stage at Apple’s Huge WWDC Event.” Business Insider, June 8, 2015. http://www.businessinsider.com/ siri-fail-live-apple-wwdc-2015-6#ixzz3hU8veYTA. Cadwalladr, Carole. “Follow the Data: Does a Legal Document Link Brexit Campaigns to US Billionaire?” The Guardian, May 14, 2017, sec. Technology. http://www.theguardian.com/technology/2017/may/14/robert- mercer-cambridge-analytica-leave-eu-referendum-brexit-campaigns. ———. “Revealed: How US Billionaire Helped to Back Brexit.” The Guardian, February 25, 2017, sec. Politics. https://www.theguardian.com/politics/ 2017/feb/26/us-billionaire-mercer-helped-back-brexit. ———. “Robert Mercer: The Big Data Billionaire Waging War on Mainstream Media.” The Guardian, February 26, 2017, sec. Politics. https:// www.theguardian.com/politics/2017/feb/26/robert-mercer-breitbart-war- on-media-steve-bannon-donald-trump-nigel-farage. Campbell-Kelly, Martin, William Aspray, Nathan Ensmenger, and Jeffrey R. Yost. Computer: A History of the Information Machine. 3 edition. Boulder, CO: Westview Press, 2013.

!239 Canton, Hallie. “15 Things We Can Learn About Humanity From Google Autocomplete.” CollegeHumor, June 20, 2013. http:// www.collegehumor.com/article/6896075/15-things-we-can-learn-about- humanity-from-google-autocomplete. Carbonell, Jaime. “Session 3: Machine Translation.” In Speech and Natural Language: Proceedings of a Workshop, 139–40. Pacific Grove, CA: Morgan Kaufmann Publishers, Inc., 1991. Carroll, Charles Michael. The Great Chess Automaton. New York: Dover Publications, 1975. Cetina, Karin Knorr. Epistemic Cultures: How the Sciences Make Knowledge. Cambridge, Mass: Harvard University Press, 1999. Chun, Wendy Hui Kyong. Programmed Visions: Software and Memory. Reprint. The MIT Press, 2013. Church, Kenneth W. “A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text.” In Proceedings of the Second Conference on Applied Natural Language Processing, 136–143. ANLC ’88. Stroudsburg, PA, USA: Association for Computational Linguistics, 1988. doi: 10.3115/974235.974260. ———. “Empiricism from TMI-1992 to AMTA-2002 to AMTA-2012.” presented at the Association for Machine Translation in the Americas, Tiburon, CA, October 8, 2002. ———. “Has Computational Linguistics Become More Applied?” In Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing, 1–5. CICLing ’09. Berlin, Heidelberg: Springer-Verlag, 2009. doi:10.1007/978-3-642-00382-0_1. ———. “Speech and Language Processing: Where Have We Been and Where Are We Going.” Geneva, Switzerland, 2003. ———. “Where Have We Been and Where Are We Going?” Seattle, WA, 2004. Church, Kenneth W., and Robert L. Mercer. “Introduction to the Special Issue on Computational Linguistics Using Large Corpora.” Computational Linguistics 19, no. 1 (March 1993): 1–24. Colomina, Beatriz. “Enclosed by Images: The Eameses’ Multimedia Architecture.” Grey Room, no. 2 (Winter 2001): 6–29. doi: 10.1162/152638101750172975. Darling, Lloyd. “The Marvelous Voice Typewriter: Talk to It and It Writes.” Popular Science Monthly, July 1916. Das, S.K., and M.A Picheny. “Issues in Practical Large Vocabulary Isolated Word Recognition: The IBM Tangora System.” In Automatic Speech and

!240 Speaker Recognition, edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal. The Kluwer International Series in Engineering and Computer Science. Boston, MA: Springer US, 1996. http:// link.springer.com/10.1007/978-1-4613-1367-0. Daston, Lorraine. “Probability and Evidence.” In The Cambridge History of Seventeenth-Century Philosophy, edited by Daniel Garber, Michael Ayers, Daniel Garber, and Michael Ayers, 2:1108–44. Cambridge: Cambridge University Press, 2012. http://ezproxy.library.nyu.edu:8412/cambridge/ histories/chapter.jsf? bid=CBO9781139055468&cid=CBO9781139055468A009. Daston, Lorraine J., and Peter Galison. Objectivity. Zone Books, 2010. David, Jr., E.E. “Artificial Auditory Recognition in Telephony.” IBM Journal of Research and Development 2, no. 4 (October 1958): 294–309. David, Jr., E.E., and O.G. Selfridge. “Eyes and Ears for Computers.” Proceedings of the IRE 50, no. 5 (May 1962): 1093–1101. doi:10.1109/JRPROC. 1962.288011. Davis, K.H., R. Biddulph, and S. Balashek. “Automatic Recognition of Spoken Digits.” Journal of the Acoustical Society of America 24, no. 6 (November 1952): 637–42. Denes, P. “The Design and Operation of the Mechanical Speech Recognizer at University College London.” Journal of the British Institution of Radio Engineers 19, no. 4 (April 1959): 219–29. doi:10.1049/jbire.1959.0027. Denes, Peter. “Automatic Speech Recognition: Experiments with a Recogniser Using Linguistic Statistics.” Contract No. AF 61(514)-1176. Air Force Cambridge Research Center: United States Air Force Air Research and Development Command, September 1960. Dersch, W. C. “Shoebox - A Voice Responsive Machine.” DATAMATION 8 (June 1962): 47–50. Dixon, N. R., and C. C. Tappert. “Toward Objective Phonetic Transcription - An On-Line Interactive Technique for Machine-Processed Speech Data.” IEEE Transactions on Man-Machine Systems 11, no. 4 (December 1970): 202–10. doi:10.1109/TMMS.1970.299943. Dixon, N. Rex, and Thomas B. Martin. “Introductory Comments.” In Automatic Speech & Speaker Recognition, edited by N. Rex Dixon and Thomas B. Martin, 2–3. Selected Reprint Series. New York, NY: IEEE Press, 1979. Dixon, N., and H. Silverman. “A General Language-Operated Decision Implementation System (GLODIS): Its Application to Continuous-Speech Segmentation.” IEEE Transactions on Acoustics, Speech, and Signal

!241 Processing 24, no. 2 (April 1976): 137–62. doi:10.1109/TASSP. 1976.1162793. ———. “The 1976 Modular Acoustic processor(MAP).” IEEE Transactions on Acoustics, Speech, and Signal Processing 25, no. 5 (October 1977): 367– 79. doi:10.1109/TASSP.1977.1162985. Domingos, Pedro. “A Few Useful Things to Know About Machine Learning.” Communications of the ACM 55, no. 10 (October 2012): 78–87. Dudley, Homer, and S. Balashek. “Automatic Recognition of Phonetic Patterns in Speech.” The Journal of the Acoustical Society of America 30, no. 8 (August 1, 1958): 721–32. doi:10.1121/1.1909742. Dudley, Homer, and T. H. Tarnoczy. “The Speaking Machine of Wolfgang von Kempelen.” The Journal of the Acoustical Society of America 22, no. 2 (March 1, 1950): 151–66. doi:10.1121/1.1906583. Durkheim, Emile, and Marcel Mauss. Primitive Classification. Translated by Rodney Needham. Chicago: University Of Chicago Press, 1967. Edwards, Paul N. A Vast Machine Computer Models, Climate Data, and the Politics of Global Warming. Cambridge, Mass.: MIT Press, 2010. Engelmore, Robert S. “AI Development: DARPA and ONR Viewpoints.” In Expert Systems and Artificial Intelligence: Applications and Management, edited by Thomas C. Bartee, 213–18. H.W. Sams, 1988. Ensmenger, Nathan. “Is Chess the Drosophila of Artificial Intelligence? A Social History of an Algorithm.” Social Studies of Science 42, no. 1 (February 1, 2012): 5–30. doi:10.1177/0306312711424596. Fayyad, Usama, and Ramasamy Uthurusamy, eds. KDD’95: Proceedings of the First International Conference on Knowledge Discovery and Data Mining. Montréal, Québec, Canada: AAAI Press, 1995. Feldman, Loren. “Goldman Sachs and a Sale Gone Horribly Awry.” The New York Times, July 14, 2012, sec. Business Day. https://www.nytimes.com/ 2012/07/15/business/goldman-sachs-and-a-sale-gone-horribly-awry.html. Feldman, Ronen, and James Sanger. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. 1 edition. Cambridge; New York: Cambridge University Press, 2006. Ferguson, John D, ed. Symposium on the Application of Hidden Markov Models to Text and Speech. Princeton, NJ: Institute for Defense Analyses, Communications Research Division, 1980. Firth, John Rupert. “A Synopsis of Linguistic Theory 1930–1955.” In Studies in Linguistic Analysis: Special Volume of the Philological Society, edited by John Rupert Firth, 1–32. Oxford: Blackwell, 1957.

!242 Flanagan, J.L., S.E. Levinson, L.R. Rabiner, and A.E. Rosenberg. “Techniques for Expanding the Capabilities of Practical Speech Recognizers.” In Trends in Speech Recognition, edited by Wayne A. Lea, 425–44. Prentice-Hall, 1980. Flowers, John B. “The True Nature of Speech.” American Institute of Electrical Engineers, Transactions of the 35, no. 1 (January 1916): 213–48. doi: 10.1109/T-AIEE.1916.4765383. Foucault, Michel. Discipline & Punish: The Birth of the Prison. Translated by Alan Sheridan. 2nd Edition. Vintage, 1995. Frawley, William J., Gregory Piatetsky-Shapiro, and Christopher J. Matheus. “Knowledge Discovery in Databases: An Overview.” AI Magazine 13, no. 3 (September 15, 1992): 57. Freedgood, Elaine. “Divination.” PMLA 128, no. 1 (January 2013): 221–25. doi: 10.1632/pmla.2013.128.1.221. Fry, D. B., and P. Denes. “The Solution of Some Fundamental Problems in Mechanical Speech Recognition.” Language and Speech 1, no. 1 (January 1, 1958): 35–58. doi:10.1177/002383095800100104. Fuller, Matthew. Behind the Blip: Essays on the Culture of Software. Brooklyn, NY: Autonomedia, 2003. ———, ed. Software Studies: A Lexicon. The MIT Press, 2008. Galison, Peter. “Computer Simulations and the Trading Zone.” In The Disunity of Science: Boundaries, Contexts, and Power, edited by Peter Galison and David Stump, 1st ed., 118–57. Stanford University Press, 1996. Galloway, Alexander R. Protocol: How Control Exists after Decentralization. The MIT Press, 2006. Garber, Megan. “How Google’s Autocomplete Was ... Created / Invented / Born.” The Atlantic, August 23, 2013. https://www.theatlantic.com/technology/ archive/2013/08/how-googles-autocomplete-was-created-invented-born/ 278991/. Garfinkel, Simson. “Enter the Dragon.” MIT Technology Review, September 1, 1998. Gibson, William. Pattern Recognition. Reprint edition. New York, NY: Berkley, 2005. Gigerenzer, Gerd, Zeno Swijtink, Theodore Porter, Lorraine Daston, John Beatty, and Lorenz Kruger. The Empire of Chance: How Probability Changed Science and Everyday Life. Reprint. Cambridge University Press, 1990.

!243 Gillespie, Tarleton. “The Relevance of Algorithms.” In Media Technologies: Essays on Communication, Materiality, and Society, edited by Tarleton Gillespie, Pablo J Boczkowski, and Kirsten A Foot, 167–93, 2014. Ginsberg, Jeremy, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski, and Larry Brilliant. “Detecting Influenza Epidemics Using Search Engine Query Data.” Nature 457, no. 7232 (February 19, 2009): 1012–14. doi:10.1038/nature07634. Golumbia, David. The Cultural Logic of Computation. Cambridge, Mass.: Harvard University Press, 2009. Gomes, Lee. “Facebook AI Director Yann LeCun on His Quest to Unleash Deep Learning and Make Machines Smarter.” IEEE Spectrum, February 18, 2015. http://spectrum.ieee.org/automaton/robotics/artificial-intelligence/ facebook-ai-director-yann-lecun-on-deep-learning. Google. Behind the Mic: The Science of Talking with Computers. YouTube video. Google, 2014. https://www.youtube.com/watch?v=yxxRAHVtafI. “Google Correlate.” Accessed May 16, 2017. https://www.google.com/trends/ correlate. Griner, David. “Powerful Ads Use Real Google Searches to Show the Scope of Sexism Worldwide.” AdWeek. Accessed October 22, 2013. http:// www.adweek.com/adfreak/powerful-ads-use-real-google-searches-show- scope-sexism-worldwide-153235. Grishman, Ralph. “A Very Brief Introduction to Computational LIngustics.” In Speech and Natural Language: Proceedings of a Workshop, 37–52. Philadelphia, PA: Morgan Kaufmann Publishers, Inc., 1989. Hacking, Ian. The Taming of Chance. Cambridge University Press, 1990. Haigh, Thomas. “‘A Veritable Bucket of Facts’ Origins of the Data Base Management System.” SIGMOD Rec. 35, no. 2 (June 2006): 33–49. doi: 10.1145/1147376.1147382. Halevy, Alon, Peter Norvig, and Fernando Pereira. “The Unreasonable Effectiveness of Data.” IEEE Intelligent Systems 24, no. 2 (2009): 8–12. Hall, David, Daniel Jurafsky, and Christopher D. Manning. “Studying the History of Ideas Using Topic Models.” In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 363–371. EMNLP ’08. Stroudsburg, PA, USA: Association for Computational Linguistics, 2008. http://dl.acm.org/citation.cfm?id=1613715.1613763. Hapgood, Fred. “Computer Chess Bad--Human Chess Worse.” New Scientist, December 30, 1982.

Hearst, Marti A. “Untangling Text Data Mining.” In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, 3–10. ACL ’99. Stroudsburg, PA, USA: Association for Computational Linguistics, 1999. doi:10.3115/1034678.1034679.
Hilgers, Philipp von, and Amy N. Langville. “The Five Greatest Applications of Markov Chains.” In Markov Anniversary Meeting: An International Conference to Celebrate the 150th Anniversary of the Birth of A.A. Markov, 155–67. Charleston, SC, 2006.
Hillyer, Peter. “Talking to Terminals . . .” THINK, 1987. IBM Corporate Archives.
Hindenburg, Carl Friedrich. Ueber den Schachspieler des Herrn von Kempelen: nebst einer Abbildung und Beschreibung seiner Sprachmaschine. Leipzig: J.G. Müller, 1784.
Hirschman, Lynette. “Overview of the DARPA Speech and Natural Language Workshop.” In Speech and Natural Language: Proceedings of a Workshop, 1–2. Philadelphia, PA: Morgan Kaufmann Publishers, Inc., 1989.
Horvitz, Eric. “AI in the Open World: Directions, Challenges, and Futures.” presented at the Data & Society Databite No. 98, New York, NY, April 26, 2017.
Hsu, Feng-Hsiung. Behind Deep Blue: Building the Computer That Defeated the World Chess Champion. Princeton, N.J.; Oxford: Princeton University Press, 2004.
Hutchins, W. John. “Machine Translation: History of Research and Applications.” In Routledge Encyclopedia of Translation Technology, edited by Sin-Wai Chan, 120–36. Routledge, 2014. doi:10.4324/9781315749129.
IBM Corporation Field Engineer Division. “IBM Service Specialists ‘Talk’ to a Computer.” Press Release, August 5, 1971. IBM Corporate Archives.
“IBM Scientists Demonstrate Personal Computer with Advanced Speech Recognition Capability.” Press Release. IBM Corporation Research Division, April 7, 1986. IBM Corporate Archives.
Jackson, P., and F. Schilder. “Natural Language Processing: Overview.” In Encyclopedia of Language & Linguistics (Second Edition), edited by Keith Brown, 503–18. Oxford: Elsevier, 2006. doi:10.1016/B0-08-044854-2/00927-5.
Jelinek, F. “Continuous Speech Recognition by Statistical Methods.” Proceedings of the IEEE 64, no. 4 (April 1976): 532–56. doi:10.1109/PROC.1976.10159.
———. “The Development of an Experimental Discrete Dictation Recognizer.” Proceedings of the IEEE 73, no. 11 (November 1985): 1616–24. doi:10.1109/PROC.1985.13343.
Jelinek, Frederick. “ACL Lifetime Achievement Award: The Dawn of Statistical ASR and MT.” Computational Linguistics 35, no. 4 (2009): 483–94.
———. “Some of My Best Friends Are Linguists.” presented at the 4th International Conference on Language Resources and Evaluation, Lisbon, May 28, 2004. http://www.lrec-conf.org/lrec2004/.
———. Statistical Methods for Speech Recognition. MIT Press, 1997.
———. Interview by Janet Baker. Audio recording, March 2005. History of Speech and Language Technology Project. http://www.sarasinstitute.org/Pages/Interv/SarJelin.html.
Jelinek, Frederick, Lalit Bahl, and Robert Mercer. “Design of a Linguistic Statistical Decoder for the Recognition of Continuous Speech.” IEEE Transactions on Information Theory 21, no. 3 (May 1975): 250–56. doi:10.1109/TIT.1975.1055384.
Jobs, Steve. “World Developer’s Conference Keynote.” Presentation, San Jose, CA, May 10, 1999. https://www.youtube.com/watch?v=IrErYJhvAFo.
Jones, Matthew L. “Querying the Archive: Data Mining from Apriori to Pagerank.” In Science in the Archives: Pasts, Presents, Futures, edited by Lorraine Daston, 311–28. Chicago: University of Chicago Press, 2017.
Jurafsky, Dan, and James H. Martin. Speech and Language Processing. 3rd ed. Draft, 2016.
Kilgarriff, Adam, and Gregory Grefenstette. “Introduction to the Special Issue on the Web as Corpus.” Computational Linguistics 29, no. 3 (September 1, 2003): 333–47. doi:10.1162/089120103322711569.
Kitchin, Rob. “Thinking Critically about and Researching Algorithms.” Information, Communication & Society 20, no. 1 (January 2, 2017): 14–29. doi:10.1080/1369118X.2016.1154087.
Kittler, Friedrich A. Discourse Networks, 1800/1900. Translated by Michael Metteer and Chris Cullens. Reprint. Stanford University Press, 1992.
Klatt, Dennis H. “Review of the ARPA Speech Understanding Project.” The Journal of the Acoustical Society of America 62, no. 6 (1977): 1345–66. doi:10.1121/1.381666.
Ladefoged, Peter, and D. E. Broadbent. “Information Conveyed by Vowels.” The Journal of the Acoustical Society of America 29, no. 1 (January 1, 1957): 98–104. doi:10.1121/1.1908694.
Langley, Pat. “The Changing Science of Machine Learning.” Machine Learning 82, no. 3 (March 1, 2011): 275–79. doi:10.1007/s10994-011-5242-y.
Lazer, David, and Ryan Kennedy. “What We Can Learn From the Epic Failure of Google Flu Trends.” WIRED, October 1, 2015. https://www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends/.
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. “The Parable of Google Flu: Traps in Big Data Analysis.” Science, March 14, 2014.
Le, Quoc V., and Mike Schuster. “A Neural Network for Machine Translation, at Production Scale.” Google Research Blog, September 27, 2016. https://research.googleblog.com/2016/09/a-neural-network-for-machine.html.
Lea, Wayne A. “Establishing the Value of Voice Communication with Computers.” IEEE Transactions on Audio and Electroacoustics 16, no. 2 (June 1968): 184–97. doi:10.1109/TAU.1968.1161970.
Levitt, Gerald M. The Turk, Chess Automaton. McFarland, Incorporated Publishers, 2000.
Lewis-Kraus, Gideon. “The Fasinatng … Frustrating … Fascinating History of Autocorrect.” Wired | Gadget Lab, July 22, 2014. http://www.wired.com/2014/07/history-of-autocorrect/.
Licklider, J. C. R. “Man-Computer Symbiosis.” IRE Transactions on Human Factors in Electronics HFE-1, no. 1 (March 1960): 4–11. doi:10.1109/THFE2.1960.4503259.
Liu, Jennifer. “At a Loss for Words?” Official Google Blog, August 25, 2008. http://googleblog.blogspot.com/2008/08/at-loss-for-words.html.
Liu, Lydia H. The Freudian Robot: Digital Media and the Future of the Unconscious. Chicago, IL: University of Chicago Press, 2011.
Lohr, Steve. Data-Ism: The Revolution Transforming Decision Making, Consumer Behavior, and Almost Everything Else. New York, NY: HarperBusiness, 2015.
———. “The Age of Big Data.” The New York Times, February 11, 2012, sec. Sunday Review. http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html.
Mackenzie, Adrian. Cutting Code: Software and Sociality. New York: Peter Lang, 2006.
———. “The Production of Prediction: What Does Machine Learning Want?” European Journal of Cultural Studies 18, no. 4–5 (August 1, 2015): 429–45. doi:10.1177/1367549415577384.
Maegaard, Bente, ed. “Machine Translation.” In Multilingual Information Management: Current Levels and Future Abilities. A Report Commissioned by the US National Science Foundation and Also Delivered to the European Commission’s Language Engineering Office and the US Defense Advanced Research Projects Agency. US National Science Foundation, 1999. http://www.cs.cmu.edu/~ref/mlim/index.html.
Makhoul, John. “A 50-Year Personal Retrospective on Speech and Language Processing.” San Francisco, 2016.
Makhoul, John, Frederick Jelinek, Lawrence Rabiner, Clifford Weinstein, and Victor Zue. “White Paper on Spoken Language Systems.” In Speech and Natural Language: Proceedings of a Workshop, 463–79. Cape Cod, MA: Morgan Kaufmann Publishers, Inc., 1989.
Makhoul, John, and Richard Schwartz. “Ignorance Modeling.” In Invariance and Variability in Speech Processes, edited by Joseph S. Perkell and Dennis H. Klatt, 344–45. Lawrence Erlbaum Associates, 1986.
Mallaby, Sebastian. More Money Than God: Hedge Funds and the Making of a New Elite. Reprint edition. New York: Penguin Books, 2011.
Manning, Christopher D., and Hinrich Schütze. Foundations of Statistical Natural Language Processing. 1st edition. Cambridge, Mass: The MIT Press, 1999.
Manovich, Lev. The Language of New Media. MIT Press, 2001.
Markov, A. A. “An Example of Statistical Investigation of the Text Eugene Onegin Concerning the Connection of Samples in Chains.” Translated by Gloria Custance, David Link, Alexander Nitussov, and Lioudmila Voropai. Science in Context 19, no. 4 (2006): 591–600. doi:10.1017/S0269889706001074.
Martin, Reinhold. The Organizational Complex: Architecture, Media, and Corporate Space. MIT Press, 2005.
Mayer, Jane. “The Reclusive Hedge-Fund Tycoon Behind the Trump Presidency.” The New Yorker, March 17, 2017. http://www.newyorker.com/magazine/2017/03/27/the-reclusive-hedge-fund-tycoon-behind-the-trump-presidency.
Mayer, Marissa. “Search: Now Faster than the Speed of Type.” Official Google Blog, September 8, 2010. http://googleblog.blogspot.com/2008/08/at-loss-for-words.html.
Mercer, Robert. “A Computational Life (ACL Lifetime Achievement Award Address).” presented at the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, June 25, 2014. http://techtalks.tv/talks/closing-session/60532/.
Miller, Carolyn R. “Genre Innovation: Evolution, Emergence, or Something Else?” The Journal of Media Innovations 3, no. 2 (November 7, 2016): 4–19.
Mills, Mara. “Deaf Jam: From Inscription to Reproduction to Information.” Social Text 28, no. 102 (March 20, 2010): 35–58. doi:10.1215/01642472-2009-059.
———. “Media and Prosthesis: The Vocoder, the Artificial Larynx, and the History of Signal Processing.” Qui Parle 21, no. 1 (December 1, 2012): 107–49. doi:10.5250/quiparle.21.1.0107.
Mirowski, Philip. Machine Dreams: Economics Becomes a Cyborg Science. Cambridge: Cambridge University Press, 2002.
———. “The Probabilistic Counter-Revolution, or How Stochastic Concepts Came to Neoclassical Economic Theory.” Oxford Economic Papers 41, no. 1 (1989): 217–35. doi:10.2307/2663190.
Mishne, Gilad, and Natalie Glance. “Predicting Movie Sales from Blogger Sentiment,” 155–58. American Association for Artificial Intelligence, 2006. http://www.aaai.org/Library/Symposia/Spring/2006/ss06-03-030.php.
Mitchell, Silas Weir. “The Last of a Veteran Chess Player.” Chess Monthly, 1857. https://www.chess.com/blog/batgirl/the-last-of-a-veteran-chess-player---the-turk.
Murphy, T., ed. “IBM Reports Major Speech Recognition Progress.” IBM Research Highlights, no. 1 (1985).
Nadas, A., R.L. Mercer, L. Bahl, R. Bakis, P. Cohen, A. Cole, F. Jelinek, and B. Lewis. “Continuous Speech Recognition with Automatically Selected Acoustic Prototypes Obtained by Either Bootstrapping or Clustering.” In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’81, 6:1153–55, 1981. doi:10.1109/ICASSP.1981.1171177.
National Research Council. Funding a Revolution: Government Support for Computing Research. Washington, D.C: The National Academies Press, 1999. http://www.nap.edu/catalog/6323/funding-a-revolution-government-support-for-computing-research.
Newell, Allen, J. Barnett, J. Forgie, C. Green, D. Klatt, J.C.R. Licklider, J. Munson, and R. Reddy. “Speech-Understanding Systems: Final Report of a Study Group.” Information Processing Techniques Office of the Advanced Research Projects Agency, May 1971. http://repository.cmu.edu/compsci/1839.
Palmer, Martha, Tim Finin, and Sharon M. Walter. “Workshop on the Evaluation of Natural Language Processing Systems.” Final Technical Report. Griffiss Air Force Base, NY: Rome Air Development Center, December 1989.
Parisi, Luciana. Contagious Architecture: Computation, Aesthetics, and Space (Technologies of Lived Abstraction). The MIT Press, 2013.
Pauly, Philip J. Controlling Life: Jacques Loeb & the Engineering Ideal in Biology. Oxford University Press, 1987.
Peterson, Gordon E., and Harold L. Barney. “Control Methods Used in a Study of the Vowels.” The Journal of the Acoustical Society of America 24, no. 2 (March 1, 1952): 175–84. doi:10.1121/1.1906875.
Pickstone, John V. Ways of Knowing: A New History of Science, Technology, and Medicine. University of Chicago Press, 2001.
Pieraccini, Roberto. The Voice in the Machine: Building Computers That Understand Speech. Cambridge, MA: The MIT Press, 2012.
Pierce, John R. “Whither Speech Recognition?” Journal of the Acoustical Society of America 46, no. 4 (1969): 1049–51. doi:10.1121/1.1911801.
———. Letter to Ralph K. Potter, February 13, 1947. File 37874-13, Volume B. AT&T Archives and History Center, Warren NJ.
Pierce, John R. (as J.J. Coupling). “Portrait of a Voice.” Astounding Science Fiction, July 1946.
Pierce, John R., John B. Carroll, Eric P. Hamp, David G. Hays, Charles F. Hockett, Anthony G. Oettinger, and Alan Perlis. “Language and Machines: Computers in Translation and Linguistics. A Report by the Automatic Language Processing Advisory Committee.” Washington, DC: National Academy of Sciences, 1966. https://www.nap.edu/catalog/9547/language-and-machines-computers-in-translation-and-linguistics.
Pierce, John R., Claude E. Shannon, Walter A. Rosenblith, and Vannevar Bush. “What Computers Should Be Doing.” In Management and the Computer of the Future, edited by Martin Greenberger, 290–325. Cambridge, MA: The MIT Press, 1962.
Potter, R. K., and J. C. Steinberg. “Toward the Specification of Speech.” The Journal of the Acoustical Society of America 22, no. 6 (November 1, 1950): 807–20. doi:10.1121/1.1906694.
Potter, Ralph K. “Visible Patterns of Sound.” Science 102, no. 2654 (November 9, 1945): 463–70. doi:10.1126/science.102.2654.463.
Price, Patti. “Overview of the Fourth DARPA Speech and Natural Language Workshop.” In Speech and Natural Language: Proceedings of a Workshop, 3–4. Pacific Grove, CA: Morgan Kaufmann Publishers, Inc., 1991.
Proctor, Richard. “The Phonograph, or Voice-Recorder.” Gentleman’s Magazine, 1878.
Rabiner, Lawrence R. “First-Hand: The Hidden Markov Model.” Engineering and Technology History Wiki. United Engineering Foundation, January 12, 2015. http://ethw.org/First-Hand:The_Hidden_Markov_Model.
Rabiner, Lawrence R., and Biing-Hwang Juang. “An Introduction to Hidden Markov Models.” IEEE ASSP Magazine, January 1986, 4–16.
———. Fundamentals of Speech Recognition. Englewood Cliffs, N.J: Prentice Hall, 1993.
Racknitz, Joseph Friedrich Freiherr zu. Ueber Den Schachspieler Des Herrn von Kempelen Und Dessen Nachbildung Mit Sieben Kupfertafeln. Dresden, 1789.
Radinsky, Kira, and Eric Horvitz. “Mining the Web to Predict Future Events.” In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, 255–264. WSDM ’13. New York, NY, USA: ACM, 2012. doi:10.1145/2433396.2433431.
Reddy, Dabbala R. “Speech Recognition by Machine: A Review.” Proceedings of the IEEE 64, no. 4 (April 1976): 501–31. doi:10.1109/PROC.1976.10158.
Riskin, Jessica. “The Defecating Duck, Or, the Ambiguous Origins of Artificial Life.” Critical Inquiry 29, no. 4 (June 1, 2003): 599–633. doi:10.1086/377722.
Roberts, Lawrence. “Expanding AI Research and Founding ARPANET.” In Expert Systems and Artificial Intelligence: Applications and Management, edited by Thomas C. Bartee, 229–36. H.W. Sams, 1988.
Russell, Stuart, and Peter Norvig. Artificial Intelligence: A Modern Approach. 3rd edition. Upper Saddle River: Pearson, 2009.
Seabrook, John. “Hello, Hal.” The New Yorker, June 23, 2008. http://www.newyorker.com/magazine/2008/06/23/hello-hal.
Seneta, Eugene. “Markov and the Birth of Chain Dependence Theory.” International Statistical Review / Revue Internationale de Statistique 64, no. 3 (1996): 255–63. doi:10.2307/1403785.
Sheynin, O. B. “A. A. Markov’s Work on Probability.” Archive for History of Exact Sciences 39, no. 4 (December 1, 1989): 337–77. doi:10.1007/BF00348446.
Silver, Nate. The Signal and the Noise: Why So Many Predictions Fail, But Some Don’t. New York: Penguin, 2015.
Somers, Harold L. “Current Research in Machine Translation.” Machine Translation 7, no. 4 (December 1, 1992): 231–46. doi:10.1007/BF00398467.
Steele, J. Michael. “Hidden Markov Models.” Course Resource for Financial Time Series and Computational Statistics. University of Pennsylvania, 2009. http://www-stat.wharton.upenn.edu/~steele/Courses/956/Resource/HiddenMarkovModels.htm.
Stempel, Jonathan. “Goldman Sachs Defeats Appeal over Collapsed Buyout.” Reuters, November 12, 2014. https://www.reuters.com/article/us-goldman-dragonsystems-lawsuit-idUSKCN0IW2JK20141112.
Stigler, Stephen M. Statistics on the Table: The History of Statistical Concepts and Methods. Cambridge, MA: Harvard University Press, 1999.
———. “The Epic Story of Maximum Likelihood.” Statistical Science 22, no. 4 (November 2007): 598–620. doi:10.1214/07-STS249.
———. The Seven Pillars of Statistical Wisdom. Cambridge, Massachusetts: Harvard University Press, 2016.
Strang, Joshua. 050118-F-3488S-003. Photograph, January 18, 2005. Wikimedia Commons.
Tappert, C. C. “A Preliminary Investigation of Adaptive Control in the Interaction Between Segmentation and Segment Classification in Automatic Recognition of Continuous Speech.” IEEE Transactions on Systems, Man, and Cybernetics SMC-2, no. 1 (January 1972): 66–72. doi:10.1109/TSMC.1972.5408558.
Turovsky, Barak. “Found in Translation: More Accurate, Fluent Sentences in Google Translate.” The Keyword (Official Google Blog), November 15, 2016. http://blog.google:443/products/translate/found-translation-more-accurate-fluent-sentences-google-translate/.
Underwood, M.J. “Machines That Understand Speech.” Radio and Electronic Engineer 47, no. 8.9 (August 1977): 368–76. doi:10.1049/ree.1977.0055.
Universität Osnabrück, and IBM. “Flu Prediction: About.” Accessed August 10, 2017. http://www.flu-prediction.com/about.
Vincent, James. “What Counts as Artificially Intelligent? AI and Deep Learning, Explained.” The Verge, February 29, 2016. https://www.theverge.com/2016/2/29/11133682/deep-learning-ai-explained-machine-learning.
“Voice-Controlled Writing Machine.” Scientific American, February 12, 1916.
Waibel, Alex, and Kai-Fu Lee. “Knowledge-Based Approaches.” In Readings in Speech Recognition, edited by Alexander Waibel and Kai-Fu Lee, 1st edition, 197–99. San Mateo, Calif: Morgan Kaufmann, 1990.
Waibel, Alexander, and Kai-Fu Lee, eds. Readings in Speech Recognition. 1st edition. San Mateo, Calif: Morgan Kaufmann, 1990.
Way, Andy. “A Critique of Statistical Machine Translation.” Linguistica Antverpiensia, New Series – Themes in Translation Studies 0, no. 8 (2009): 17–41.
Wayne, Charles L. “Foreword.” In Speech and Natural Language: Proceedings of a Workshop, vii. Philadelphia, PA: Morgan Kaufmann Publishers, Inc., 1989.
Weaver, Warren. “Translation.” Reprinted in Machine Translation of Languages: Fourteen Essays, edited by W.N. Locke and A.D. Booth, 15–23. Cambridge, MA: MIT Press, 1949.
Webb, Amy. “You Just Don’t Understand.” Slate, May 5, 2014. http://www.slate.com/articles/technology/data_mine_1/2014/05/ok_google_siri_why_speech_recognition_technology_isn_t_very_good.html.
Weischedel, Ralph, Jaime Carbonell, Barbara Grosz, Wendy Lehnert, Mitchell Marcus, Raymond Perrault, and Robert Wilensky. “White Paper on Natural Language Processing.” In Speech and Natural Language: Proceedings of a Workshop, 481–93. Cape Cod, MA: Morgan Kaufmann Publishers, Inc., 1989.
Wilks, Yorick. “Computational Linguistics: History.” In Encyclopedia of Language & Linguistics (Second Edition), edited by Keith Brown, 761–69. Oxford: Elsevier, 2006. doi:10.1016/B0-08-044854-2/00928-7.
———. “Keynote Address: Some Notes on the State of the Art: Where Are We Now in MT.” In Machine Translation: Ten Years on, edited by Douglas Clarke and Alfred Vella, 2.1-2.4. Cranfield, UK, 1994.
Windisch, Karl Gottlieb. Inanimate Reason; or a Circumstantial Account of That Astonishing Piece of Mechanism, M. de Kempelen’s Chess-Player; Now Exhibiting at No. 8, Savile-Row, Burlington-Gardens; Illustrated with Three Copper-Plates, Exhibiting This Celebrated Automaton, in Different Points of View. Translated from the Original Letters of M. Charles Gottlieb de Windisch. London: Printed for S. Bladon, 1784.
Yates, JoAnne. Control through Communication: The Rise of System in American Management. The Johns Hopkins University Press, 1993.
———. Structuring the Information Age: Life Insurance and Technology in the Twentieth Century. The Johns Hopkins University Press, 2008.
Zue, Victor W. “Comment on ‘Performing Fine Phonetic Distinctions: Templates versus Features.’” In Invariance and Variability in Speech Processes, edited by Joseph S. Perkell and Dennis H. Klatt, 342–44. Lawrence Erlbaum Associates, 1986.
———. “The Use of Speech Knowledge in Automatic Speech Recognition.” Proceedings of the IEEE 73, no. 11 (November 1985): 1602–15. doi:10.1109/PROC.1985.13342.