RI: Medium: New Tools and Methods for Very-Large-Scale Phonetics Research
The field of phonetics has experienced two revolutions in the last century: the advent of the sound spectrograph in the 1950s and the application of computers beginning in the 1970s. Today, advances in digital multimedia, networking and mass storage are promising a third revolution: a movement from the study of small, individual datasets to the analysis of published corpora that are thousands of times larger. These new bodies of data are badly needed to enable the field of phonetics to develop and test hypotheses across languages and across the many types of individual, social and contextual variation. Allied fields such as sociolinguistics and psycholinguistics ought to benefit even more. However, in contrast to speech technology research, speech science has so far taken relatively little advantage of this opportunity, because access to these resources for phonetics research requires tools and methods that are now incomplete, untested, and inaccessible to most researchers. Our research aims to fill this gap by integrating, adapting and improving techniques developed in speech technology research and database research.

The intellectual merit: The most important innovation is robust forced alignment of digital audio with phonetic representations derived from orthographic transcripts, using HMM methods developed for speech recognition technology. Existing forced-alignment techniques must be improved and validated for robust application to phonetics research. There are three basic challenges to be met: orthographic ambiguity; pronunciation variation; and imperfect transcripts (especially the omission of disfluencies). Reliable confidence measures must be developed, so as to allow regions of bad alignment to be identified and eliminated or fixed. Researchers need an easy way to get a believable picture of the distribution of transcription and measurement errors, so as to estimate confidence intervals, and also to determine the extent of any bias that may be introduced. And in addition to solving these problems for English, we need to show how to apply the same techniques to a range of other languages that present new problems of their own. In addition to more robust forced alignment, researchers also need improved techniques for creating, sharing, searching, and maintaining the databases that result from applying these techniques on a large scale. Previous research has established a workable framework for the database issues involved, and some implementations are now in use in speech technology research; but these approaches need to be extended and adapted to meet the needs of phonetics researchers.

The broader impacts: The proposed research will help the field of phonetics to enter a new era: conducting research using very large speech corpora, in the range from hundreds of hours to hundreds of thousands of hours. It will also enhance research in other language-related fields, both within linguistics and in neighboring disciplines such as speech technology, sociolinguistics and linguistic anthropology. And this effort to enable new kinds of research also brings up a number of research problems that are interesting in their own right. Speech technology will benefit because a better understanding of phonetic variation will enable the creation of systems that are truly robust to the range of speakers they need to deal with, thereby making modern user interfaces more accessible to the entire population.
Sociolinguistics and linguistic anthropology will be given new tools to map out populations based on their speech patterns, ultimately helping our society better understand the diversity of linguistic behaviors and associated cultural manifestations that it encompasses.

Key Words: speech science; corpus phonetics; acoustic modeling; pronunciation variation; phonetic databases; forced alignment.

RI: Medium: New Tools and Methods for Very-Large-Scale Phonetics Research

1. Introduction

The field of phonetics has experienced two revolutions in the last century: the advent of the sound spectrograph in the 1950s and the application of computers beginning in the 1970s. Today, advances in digital multimedia, networking and mass storage are promising a third revolution: a movement from the study of small, mostly artificial datasets to the analysis of published corpora of natural speech that are thousands of times larger. Peterson & Barney's influential 1952 study of American English vowels was based on measurements from a total of less than 30 minutes of speech. Many phonetic studies have been based on the TIMIT corpus, originally published in 1991, which contains just over 300 minutes of speech. Since then, much larger speech corpora have been published for use in technology development: collections of transcribed conversational telephone speech in English published by the Linguistic Data Consortium (LDC) now total more than 300,000 minutes, for example. And many even larger collections are now becoming accessible, from sources such as oral histories, audio books, political debates and speeches, podcasts, and so on. To give just one example, the historical archive of U.S. Supreme Court oral arguments (http://www.oyez.org/) comprises about 9,000 hours (540,000 minutes) of transcribed audio.

These very-large-scale bodies of data make it possible to use natural speech in developing and testing hypotheses across the many types of individual, social, regional, temporal, textual and contextual variation, as well as across languages. All the sciences of spoken language stand to benefit, not only within linguistics, but also in psychology, in clinical applications, and in the social sciences. However, in contrast to speech technology research, speech science has so far taken relatively little advantage of this opportunity, because access to the resources for very-large-scale phonetics research requires tools and methods that are now incomplete, untested, and inaccessible to most researchers. Transcripts in ordinary orthography, typically inaccurate or incomplete in various ways, must be turned into detailed and accurate phonetic transcripts that are time-aligned with the digital recordings. And information about speakers, contexts, and content must be integrated with phonetic and acoustic information, within collections involving tens of thousands of speakers and billions of phonetic segments, and across collections with differing sorts of metadata that may be stored in complex and incompatible formats.

Our research aims to solve these problems by integrating, adapting and improving techniques developed in speech technology research and database research. The most important technique is forced alignment of digital audio with phonetic representations derived from orthographic transcripts, using Hidden Markov Model (HMM) methods developed for speech recognition technology. Our preliminary results, described below, convince us that this approach will work.
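The sketch below shows, in deliberately stripped-down form, the dynamic-programming core of this idea. It assumes that per-frame acoustic log-likelihoods for each phone in the transcript-derived sequence have already been computed by an acoustic model (in practice these come from trained HMM states, via toolkits such as HTK or Kaldi); the function name and the synthetic scores are ours, for illustration only.

```python
import numpy as np

def force_align(logprob):
    """Viterbi alignment of a known phone sequence to audio frames.

    logprob: (T, P) array where logprob[t, p] is the acoustic
    log-likelihood of frame t under the p-th phone of the
    transcript-derived phone sequence.
    Returns half-open (start_frame, end_frame) spans, one per phone.
    """
    T, P = logprob.shape
    best = np.full((T, P), -np.inf)    # best[t, p]: best score with frame t in phone p
    entered = np.zeros((T, P), bool)   # True if the best path enters phone p at frame t
    best[0, 0] = logprob[0, 0]
    for t in range(1, T):
        for p in range(P):
            stay = best[t - 1, p]
            advance = best[t - 1, p - 1] if p > 0 else -np.inf
            best[t, p] = max(stay, advance) + logprob[t, p]
            entered[t, p] = advance > stay
    # Trace back from the last frame of the last phone.
    spans, end, p = [], T, P - 1
    for t in range(T - 1, -1, -1):
        if entered[t, p] or t == 0:
            spans.append((t, end))
            end, p = t, p - 1
    return spans[::-1]

# Toy usage: align a 3-phone sequence to 10 frames of synthetic scores.
rng = np.random.default_rng(0)
print(force_align(rng.normal(size=(10, 3))))  # one (start, end) span per phone
```

In a full system the same machinery runs over HMM states rather than whole phones, and the resulting path scores are one natural raw material for the alignment confidence measures discussed below.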
However, forced-alignment techniques must be improved and validated for robust application in phonetics research. There are three basic challenges to be met: orthographic ambiguity; pronunciation variation; and imperfect transcripts (especially the omission of disfluencies). Speech technology researchers have addressed all of these problems, but their solutions have been optimized to decrease word error rates in speech recognition, and must be adapted instead to decrease error and bias in selecting and time-aligning phonetic transcriptions (the first sketch at the end of this section illustrates the pronunciation-variation part of the problem). In particular, reliable confidence measures must be developed, so as to allow regions of uncertain segment choice or bad alignment to be identified and eliminated or fixed, and to give a believable estimate of the distribution of errors in the resulting data. And in addition to solving these problems for English, we need to show how to apply the same techniques to a range of other languages, with different phonetic and orthographic problems. In particular, widely used languages like Mandarin and Arabic have inherent ambiguities in their writing systems that make the mapping from written form to pronunciation more difficult (the lack of word segmentation in Mandarin, the non-encoding of short vowels in Arabic script).

Researchers also need improved techniques for dealing with the resulting datasets. This is partly a question of scale: techniques that work well on small datasets may become unacceptably slow, or fail completely, when dealing with billions of phonetic segments and hundreds of millions of words. There are also issues of consistency: different corpora, even from the same source, typically have differing sorts of metadata, and may be laid out in quite different ways. Finally, there are issues about how to deal with multiple layers of possibly asynchronous annotation, since along with phonetic segments, words, and speaker information, some datasets may have manual or automatic annotation of syntactic, semantic or pragmatic categories. Researchers need a coherent model of these varied, complex, and multidimensional databases, with methods to retrieve relevant subsets in a suitably combinatoric way (the second sketch below gives a deliberately minimal version of such a model). Approaches to these problems were developed at LDC under NSF awards 9983258, “Multidimensional Exploration of Linguistic Databases”, and 0317826, “Querying linguistic databases”, with key ideas documented in Bird and Liberman (2001); we propose to adapt and improve those results for the needs of phonetics research.

The proposed research will help the field of phonetics enter a new era: conducting research using very large speech corpora, in the range from hundreds of hours to hundreds of thousands of hours.
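To make the pronunciation-variation challenge concrete, here is a toy sketch of the dictionary-expansion step that forced alignment relies on. The dictionary entries below are illustrative ARPAbet-style variants invented for the example (a real system would use a full pronouncing lexicon such as CMUdict); the point is only that orthography maps to a set of candidate phone sequences, and the aligner, not the text, decides among them by acoustic score.

```python
from itertools import product

# Toy pronunciation dictionary: each word maps to one or more
# phone-string variants (hypothetical entries, ARPAbet-style).
PRON = {
    "the":    [("DH", "AH"), ("DH", "IY")],
    "data":   [("D", "EY", "T", "AH"), ("D", "AE", "T", "AH")],
    "either": [("IY", "DH", "ER"), ("AY", "DH", "ER")],
}

def expand(transcript):
    """Yield every phone sequence consistent with the orthography.

    A forced aligner scores each candidate against the audio and
    keeps the best one, so the pronunciation choice falls out of
    the alignment itself rather than being guessed from text alone.
    """
    variants = [PRON[w.lower()] for w in transcript.split()]
    for combo in product(*variants):
        yield tuple(ph for word in combo for ph in word)

for seq in expand("the data"):
    print(seq)   # four candidate phone sequences for two words
```

Transcript imperfections enter the same way: candidates with and without optional disfluencies can be scored against the audio, which is one reason confidence measures on the winning alignment matter.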
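And to make the database side concrete, the second sketch is a deliberately minimal, hypothetical interval store in the general spirit of the annotation-graph framework of Bird and Liberman (2001): time-anchored labels on independent tiers (phones, words, speakers, and potentially syntactic or pragmatic layers), combined by containment queries across tiers. All names and data here are invented for illustration; a real implementation would need time indexes and storage that scale to billions of segments.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    tier: str      # "phone", "word", "speaker", ...
    start: float   # seconds
    end: float     # seconds
    label: str

# One flat store of time-anchored labels; tiers need not line up.
db = [
    Segment("speaker", 0.00, 2.50, "spk1"),
    Segment("word",    0.10, 0.45, "cat"),
    Segment("phone",   0.10, 0.18, "K"),
    Segment("phone",   0.18, 0.32, "AE"),
    Segment("phone",   0.32, 0.45, "T"),
]

def contains(outer, inner):
    """True if inner lies temporally within outer."""
    return outer.start <= inner.start and inner.end <= outer.end

def query(db, tier, label, inside_tier=None, inside_label=None):
    """Segments on `tier` with `label`, optionally restricted to those
    contained in a segment on `inside_tier` carrying `inside_label`."""
    hits = [s for s in db if s.tier == tier and s.label == label]
    if inside_tier is None:
        return hits
    frames = [s for s in db
              if s.tier == inside_tier and s.label == inside_label]
    return [h for h in hits if any(contains(f, h) for f in frames)]

# Every /AE/ token inside the word "cat":
print(query(db, "phone", "AE", inside_tier="word", inside_label="cat"))
```

The design point is that the tiers need not be synchronous: a query such as "all /AE/ tokens inside the word 'cat'" composes layers by temporal containment rather than by any fixed common segmentation.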