The field of phonetics has experienced two revolutions in the last century: the advent of the sound spectrograph in the 1950s and the application of computers beginning in the 1970s. Today, advances in digital multimedia, networking and mass storage are promising a third revolution: a movement from the study of small, individual datasets to the analysis of published corpora that are thousands of times larger. These new bodies of data are badly needed, to enable the field of phonetics to develop and test hypotheses across languages and across the many types of individual, social and contextual variation. Allied fields such as sociolinguistics and psycholinguistics ought to benefit even more. However, in contrast to speech technology research, speech science has so far taken relatively little advantage of this opportunity, because access to these resources for phonetics research requires tools and methods that are now incomplete, untested, and inaccessible to most researchers. Our research aims to fill this gap by integrating, adapting and improving techniques developed in speech technology research and database research.

The intellectual merit: The most important innovation is robust forced alignment of digital audio with phonetic representations derived from orthographic transcripts, using HMM methods developed for speech recognition technology. Existing forced-alignment techniques must be improved and validated for robust application to phonetics research. There are three basic challenges to be met: orthographic ambiguity; pronunciation variation; and imperfect transcripts (especially the omission of disfluencies). Reliable confidence measures must be developed, so as to allow regions of bad alignment to be identified and eliminated or fixed. Researchers need an easy way to get a believable picture of the distribution of transcription and measurement errors, so as to estimate confidence intervals, and also to determine the extent of any bias that may be introduced. And in addition to solving these problems for English, we need to show how to apply the same techniques to a range of other languages that present a range of new problems. In addition to more robust forced alignment, researchers also need improved techniques for creating, sharing, searching, and maintaining the databases that result from applying these techniques on a large scale. Previous research has established a workable framework for the database issues involved, and some implementations are now in use in speech technology research; but these approaches need to be extended and adapted to meet the needs of phonetics researchers.

The broader impacts: The proposed research will help the field of phonetics to enter a new era: conducting research using very large speech corpora, in the range from hundreds of hours to hundreds of thousands of hours. It will also enhance research in other language-related fields, not only within phonetics, but also in neighboring disciplines such as speech technology, sociolinguistics and linguistic anthropology. And this effort to enable new kinds of research also brings up a number of research problems that are interesting in their own right. Speech technology will benefit because better understanding of phonetic variation will enable the creation of systems that are truly robust to the range of speakers they need to deal with, thereby making modern user-interfaces more accessible to the entire population. Sociolinguistics and linguistic anthropology will gain new tools for mapping out populations based on their speech patterns, tools that will ultimately help our society better understand the diversity of linguistic behaviors and associated cultural manifestations it encompasses.

Key Words: speech science; corpus phonetics; acoustic modeling; pronunciation variation; phonetic databases; forced alignment.

RI: Medium: New Tools and Methods for Very-Large-Scale Phonetics Research

1. Introduction

The field of phonetics has experienced two revolutions in the last century: the advent of the sound spectrograph in the 1950s and the application of computers beginning in the 1970s. Today, advances in digital multimedia, networking and mass storage are promising a third revolution: a movement from the study of small, mostly artificial datasets to the analysis of published corpora of natural speech that are thousands of times larger. Peterson & Barney’s influential 1952 study of American English vowels was based on measurements from a total of less than 30 minutes of speech. Many phonetic studies have been based on the TIMIT corpus, originally published in 1991, which contains just over 300 minutes of speech. Since then, much larger speech corpora have been published for use in technology development: collections of transcribed conversational telephone speech in English published by the Linguistic Data Consortium (LDC) now total more than 300,000 minutes, for example. And many even larger collections are now becoming accessible, from sources such as oral histories, audio books, political debates and speeches, podcasts, and so on. To give just one example, the historical archive of U.S. Supreme Court oral arguments (http://www.oyez.org/) comprises about 9,000 hours (540,000 minutes) of transcribed audio.

These very-large-scale bodies of data make it possible to use natural speech in developing and testing hypotheses across the many types of individual, social, regional, temporal, textual and contextual variation, as well as across languages. All the sciences of spoken language stand to benefit, not only within linguistics proper, but also in psychology, in clinical applications, and in the social sciences. However, in contrast to speech technology research, speech science has so far taken relatively little advantage of this opportunity, because access to the resources for very-large-scale phonetics research requires tools and methods that are now incomplete, untested, and inaccessible to most researchers. Transcripts in ordinary orthography, typically inaccurate or incomplete in various ways, must be turned into detailed and accurate phonetic transcripts that are time-aligned with the digital recordings. And information about speakers, contexts, and content must be integrated with phonetic and acoustic information, within collections involving tens of thousands of speakers and billions of phonetic segments, and across collections with differing sorts of metadata that may be stored in complex and incompatible formats.

Our research aims to solve these problems by integrating, adapting and improving techniques developed in speech technology research and database research. The most important technique is forced alignment of digital audio with phonetic representations derived from orthographic transcripts, using Hidden Markov Model (HMM) methods developed for speech recognition technology. Our preliminary results, described below, convince us that this approach will work. However, forced-alignment techniques must be improved and validated for robust application in phonetics research. There are three basic challenges to be met: orthographic ambiguity; pronunciation variation; and imperfect transcripts (especially the omission of disfluencies).
Speech technology researchers have addressed all of these problems, but their solutions have been optimized to decrease word error rates in speech recognition, and must be adapted instead to decrease error and bias in selecting and time-aligning phonetic transcriptions. In particular, reliable confidence measures must be developed, so as to allow regions of uncertain segment choice or bad alignment to be identified and eliminated or fixed, and to give a believable estimate of the distribution of errors in the resulting data. And in addition to solving these problems for English, we need to show how to apply the same techniques to a range of other languages, with different phonetic and orthographic problems. In particular, widely used languages like Mandarin and Arabic have inherent ambiguities in their writing systems that make the mapping from written form to pronunciation more difficult (lack of word segmentation in Mandarin, non-encoding of short vowels in Arabic script).

Researchers also need improved techniques for dealing with the resulting datasets. This is partly a question of scale: techniques that work well on small datasets may become unacceptably slow, or fail completely, when dealing with billions of phonetic segments and hundreds of millions of words. There are also issues of consistency: different corpora, even from the same source, typically have differing sorts of metadata, and may be laid out in quite different ways. Finally, there are issues about how to deal with multiple layers of possibly-asynchronous annotation, since along with phonetic segments, words, and speaker information, some datasets may have manual or automatic annotation of syntactic, semantic or pragmatic categories. Researchers need a coherent model of these varied, complex, and multidimensional databases, with methods to retrieve relevant subsets in a suitably combinatoric way. Approaches to these problems were developed at LDC under NSF awards 9983258, “Multidimensional Exploration of Linguistic Databases”, and 0317826, “Querying linguistic databases”, with key ideas documented in Bird and Liberman (2001); we propose to adapt and improve those results for the needs of phonetics research.

The proposed research will help the field of phonetics enter a new era: conducting research using very large speech corpora, in the range from hundreds of hours to hundreds of thousands of hours. It will also enhance research in other language-related fields, not only within linguistics proper, but also in neighboring disciplines such as psycholinguistics, sociolinguistics and linguistic anthropology. This effort to enable new kinds of research also brings up a number of research problems that are interesting in their own right, as we will explain.

2. Forced Alignment

Analysis of large speech corpora is crucial for understanding variation in speech (Keating et al., 1994; Johnson, 2004). Understanding variation in speech is not only a fundamental goal of phonetics; it is also important for studies of language change (Labov, 1994; Pierrehumbert, 2003), psycholinguistics (Jurafsky, 2003), and speech technology (Benzeghiba et al., 2007). In addition, large speech corpora provide rich sources of data for the study of prosody (Grabe et al., 2005; Chu et al., 2006), disfluency (Shriberg, 1996; Stouten et al., 2006), and discourse (Hastie et al., 2002). The ability to use speech corpora for phonetics research depends on the availability of phonetic segmentation and transcriptions. In the last twenty years, many large speech corpora have been collected; however, only a small portion of them come with phonetic segmentation and transcriptions, including: TIMIT (Garofolo et al., 1993), Switchboard (Godfrey & Holliman, 1997), the Buckeye corpus (Pitt et al., 2007), the Corpus of Spontaneous Japanese (http://www.kokken.go.jp/katsudo/seika/corpus/public/), and the Spoken Dutch Corpus (http://lands.let.kun.nl/cgn/ehome.htm). Manual phonetic segmentation is time-consuming and expensive (Van Bael et al. 2007); it takes about 400 times real time (Switchboard Transcription Project, 1999), or 30 seconds per phoneme (1,800 phonemes in 15 hours) (Leung and Zue, 1984). Furthermore, manual segmentation is somewhat inconsistent, with much less than perfect inter-annotator agreement (Cucchiarini, 1993).

Forced alignment has been widely used for automatic phonetic segmentation in speech recognition and corpus-based concatenative speech synthesis. The task requires two inputs: recorded audio and (usually) word transcriptions. The transcribed words are mapped into a phone sequence in advance by using a pronouncing dictionary or grapheme-to-phoneme rules. Phone boundaries are then determined from the acoustic models via algorithms such as Viterbi search (Wightman and Talkin, 1997) and Dynamic Time Warping (Wagner, 1981). The most frequently used approach to forced alignment is to build a Hidden Markov Model (HMM) based phonetic recognizer. The speech signal is analyzed as a successive set of frames (e.g., every 3-10 ms). The alignment of frames with phonemes is determined via the Viterbi algorithm, which finds the most likely sequence of hidden states (in practice each phone has 3-5 states) given the observed data and the acoustic model represented by the HMMs. The acoustic features used for training HMMs are normally cepstral coefficients such as MFCCs (Davis and Mermelstein, 1980) and PLPs (Hermansky, 1990). A common practice is to train single-Gaussian HMMs first and then extend these HMMs to more Gaussians (Gaussian Mixture Models (GMMs)). The reported performance of state-of-the-art HMM-based forced alignment systems ranges from 80% to 90% agreement (of all boundaries) within 20 ms compared to manual segmentation on TIMIT (Hosom, 2000). Human labelers have an average agreement of 93% within 20 ms, with a maximum of 96% within 20 ms for highly-trained specialists (Hosom, 2000). In forced alignment, unlike in automatic speech recognition, monophone (context-independent) HMMs are more commonly used than triphone (context-dependent) HMMs. Ljolje et al. (1997) provide a theoretical explanation as to why triphone models tend to be less precise in automatic segmentation.
In the triphone model, the HMMs do not need to discriminate between the target phone and its context; the spectral movement characteristics are better modeled, but phone boundary accuracy is sacrificed. Toledano et al. (2003) compare monophone and triphone models for forced alignment under different criteria and show in their experiments that monophone models outperform triphone models for medium tolerances (15-30 ms difference from manual segmentation), but underperform for small tolerances (5-10 ms) and large tolerances (>35 ms). Stolcke and colleagues at SRI found in NIST Hub5 diagnostic evaluations, where automatic phone alignments were scored against labelings from the Switchboard Transcription Project, that cross-word triphone models gave worse phone-level accuracy than within-word triphones.

Many researchers have tried to improve forced alignment accuracy. Hosom (2000) uses acoustic-phonetic information (phonetic transitions, acoustic-level features, and distinctive phonetic features) in addition to PLPs. This study shows that the phonetic transition information provides the greatest relative improvement in performance. The acoustic-level features, such as impulse detection, intensity discrimination, and voicing features, provide the next-greatest improvement, and the use of distinctive features (manner, place, and height) may increase or decrease performance, depending on the corpus used for evaluation. Toledano et al. (2003) propose a statistical correction procedure to compensate for the systematic errors produced by context-dependent HMMs. The procedure comprises two steps: a training phase, where some statistical averages are estimated; and a boundary correction phase, where the phone boundaries are moved according to the estimated averages. The procedure has been shown to correct segmentations produced by context-dependent HMMs, making the results more accurate than those obtained by context-independent or context-dependent HMMs alone. There are also studies in the literature that attempt to improve forced alignment by using models other than HMMs. Lee (2006) employs a multilayer perceptron (MLP) to refine the phone boundaries provided by HMM-based alignment; Keshet et al. (2005) describe a new paradigm for alignment based on Support Vector Machines (SVMs).

Although forced alignment works well on read speech and short sentences, the alignment of long and spontaneous speech remains a great challenge (Osuga et al., 2001). Moreno et al. (1998) developed a recursive algorithm to align long recordings: it turns the forced alignment problem into a recursive speech recognition problem with a gradually restricting dictionary and language model. Toth (2004) combined duration and prosodic phrase breaks with HMMs to segment long recordings into smaller utterances. Venkataraman et al. (2004) and Hazen (2006) propose new alignment approaches for approximate transcriptions of long audio files, designed to discover and correct errors in the manual transcription during the alignment process. Compared to forced alignment, more effort has been made to improve recognition of spontaneous speech (Furui, 2005) by: using better models of pronunciation variation (Strik & Cucchiarini, 1999; Hain, 2005); using prosodic information (Wang, 2001; Shriberg & Stolcke, 2004); and improving language models (Stolcke & Shriberg, 1996; Johnson et al., 2004). With respect to pronunciation models, Riley et al.
(1999) use statistical decision trees to generate alternate word pronunciations in spontaneous speech. Kessens et al. (2003) describe a rule-based, data-driven method to model pronunciation variation. Zheng et al. (2003) introduce “zero-length phones” to better model elisions in fast pronunciations while preserving the phonetic context information for adjacent phones. Bates et al. (2007) present a phonetic-feature-based prediction model of pronunciation variation. Their study shows that feature-based models are more efficient than phone-based models: they require fewer parameters to predict variation and give smaller distance and perplexity values when comparing predictions to the hand-labeled reference. Saraclar et al. (2000) propose a new method of accommodating nonstandard pronunciations: rather than allowing a phoneme to be realized as one of a few alternate phones, the HMM states of the phoneme’s model are allowed to share Gaussian mixture components with the HMM states of the model(s) of the alternate realization(s). Prosodic information and language models have also been combined to improve automatic recognition of spontaneous speech. Liu et al. (2006) describe a metadata detection system (for sentence boundaries, pause fillers, and disfluencies) that combines information from different types of textual knowledge sources with information from a prosodic classifier. Huang and Renals (2007) incorporate syllable-based prosodic features into language models; their experiments show that exploiting prosody in language modeling significantly reduces perplexity and marginally reduces word error rate.

Another closely related research area is automatic phonetic transcription. Van Bael et al. (2007) showed that, in order to approximate the quality of the manually verified phonetic transcriptions in the Spoken Dutch Corpus, one only needs an orthographic transcription, a canonical lexicon, a small sample of manually verified phonetic transcriptions, software for the implementation of decision trees, and a standard continuous speech recognizer. Chang et al. (2000) developed an automatic transcription system that does not use word-level transcripts. Instead, special-purpose neural networks are built to classify each 10 ms frame of speech in terms of articulatory-acoustic-based phonetic features; the features are subsequently mapped to phonetic labels using multilayer perceptron (MLP) networks. The phonetic labels generated by this system are 80% concordant with the labels produced by human transcribers.

Forced alignment assumes that the orthographic transcription is correct and accurate. However, transcribing spontaneous speech is difficult. Disfluencies are often missed in the transcription process (Lickley & Bard, 1996), and instructions to attend carefully to disfluencies increase the bias to report them but not accuracy in locating them (Martin & Strange, 1968). Forced alignment also assumes that the word-to-phoneme mapping generates a path that contains the correct pronunciation; but of course, natural speech is highly variable. The obvious approach is to use language models to postulate additional disfluencies that may have been omitted in the transcript, and to use models of pronunciation variation to enrich the lattice of pronunciation alternatives for words in context, and then to use the usual HMM Viterbi decoding to choose the best path given the acoustic data.
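To make the core computation concrete, the sketch below (our own illustration, not code from any existing toolkit) shows the dynamic program that forced alignment reduces to once the transcript fixes the phone sequence: a Viterbi search over frame-level log-likelihoods, which are assumed here to be given by the acoustic model. Real systems use 3-5 HMM states per phone, GMM output densities, and transition probabilities; a single state per phone is enough to show the idea.

```python
import numpy as np

def viterbi_force_align(frame_loglik, phone_seq):
    """Align a fixed phone sequence to acoustic frames (toy sketch).

    frame_loglik: (T, P) array of per-frame log-likelihoods for P phone types.
    phone_seq:    list of phone indices (length N) taken from the transcript.
    Returns a list of (phone_index, start_frame, end_frame) spans.
    """
    T, N = frame_loglik.shape[0], len(phone_seq)
    score = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)   # 1 if we advanced to this phone at frame t

    score[0, 0] = frame_loglik[0, phone_seq[0]]
    for t in range(1, T):
        for n in range(N):
            stay = score[t - 1, n]
            advance = score[t - 1, n - 1] if n > 0 else -np.inf
            if advance > stay:
                score[t, n], back[t, n] = advance, 1
            else:
                score[t, n], back[t, n] = stay, 0
            score[t, n] += frame_loglik[t, phone_seq[n]]

    # Backtrace from the last frame of the last phone to recover boundaries.
    spans, n, end = [], N - 1, T - 1
    for t in range(T - 1, 0, -1):
        if back[t, n]:
            spans.append((phone_seq[n], t, end))
            end, n = t - 1, n - 1
    spans.append((phone_seq[0], 0, end))
    return list(reversed(spans))
```

Pronunciation variants and optional disfluencies simply add alternative paths through this same search space; Section 3.3 below shows a variant of this sketch in which a phone may be skipped entirely.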
Most of the research on related topics is aimed at improving speech recognition rather than improving phonetic alignments, but the results suggest that these approaches, properly used, will not only give better alignments, but also provide valid information about the distribution of phonetic variants. For example, Fox (2006) demonstrated that a forced-alignment technique worked well in studying the distribution of s-deletion in Spanish, using LDC corpora of conversational telephone speech and radio news broadcasts. She was also able to get reliable estimates of the distribution of the durations of non-deleted /s/ segments. A critical component of any such research is estimation of the distribution of errors, whether in disambiguating alternative pronunciations, correcting the transcription of disfluencies, or determining the boundaries of segments. Since human annotators also disagree about these matters, it’s crucial to compare the distribution of human/human differences as well as the distribution of human/machine differences. And in both cases, the mean squared (or absolute-value) error often matters less than the bias. If we want to estimate (for example) the average duration of a certain vowel segment, or the average ratio of durations between vowels and following voiced vs. voiceless consonants, the amount of noise in the measurement of individual instances matters less than the bias of the noise, since as the volume of data increases, our confidence intervals will steadily shrink – and the whole point of this enterprise is to increase the available volume of data by several orders of magnitude. Fox (2006) found this kind of noise reduction, just as we would hope, so that overall parameter estimates from forced alignment converged with the overall parameter estimates from human annotation. We will need to develop standard procedures for checking this in new applications. Since a sample of human annotations is a critical and expensive part of this process, a crucial step will be to define the minimal sample of such annotations required to achieve a given level of confidence in the result.
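As a rough guide to "the minimal sample of such annotations required", the usual normal-approximation calculation can be applied once a pilot sample gives an estimate of the spread of automatic-minus-manual boundary differences. The sketch below is a back-of-the-envelope aid under that assumption (our own illustration); a bootstrap on the actual pilot data would be the more careful check.

```python
import math

def annotations_needed(pilot_sd, ci_half_width, confidence=0.95):
    """Rough sample-size estimate for validating alignment bias.

    pilot_sd:      standard deviation (in seconds) of automatic-minus-manual
                   boundary differences from a small pilot sample.
    ci_half_width: desired half-width (in seconds) of the confidence interval
                   for the mean bias.
    Assumes approximately normal errors; intended only as a planning aid.
    """
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence]
    return math.ceil((z * pilot_sd / ci_half_width) ** 2)

# Example: a pilot sd of 12 ms and a target half-width of 1 ms at 95%
# confidence gives annotations_needed(0.012, 0.001) -> 554 boundaries.
```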

3. Preliminary results

3.1. The Penn Phonetics Lab Forced Aligner

The U.S. Supreme Court began recording its oral arguments in the early 1950s; some 9,000 hours of recordings are stored in the National Archives. The transcripts do not identify the speaking turns of individual Justices but refer to them all as “The Court”. As part of a project to make this material available online in aligned digital form, we have developed techniques for identifying speakers and aligning entire (hour-long) transcripts with the digitized audio (Yuan & Liberman, 2008). The Penn Phonetics Lab Forced Aligner was developed from this project, which was carried out with NSF funding in collaboration with the OYEZ project (http://oyez.org). Seventy-nine hour-long sessions of the SCOTUS corpus were transcribed, speaker-identified, and manually word-aligned. Silence and noise segments in these arguments were also annotated. A total of 25.5 hours of speaker turns were extracted from the arguments and used as our training data; one argument was set aside for testing purposes. Silences were separately extracted and randomly added to the beginning and end of each turn. Our acoustic models are GMM-based monophone HMMs. Each HMM state has 32 Gaussian mixture components on 39 PLP coefficients (12 cepstral coefficients plus energy, and their Delta and Acceleration coefficients). The models were trained using the HTK toolkit (http://htk.eng.cam.ac.uk) and the CMU American English Pronouncing Dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict). We tested the forced aligner on both TIMIT (the training set data) and the Buckeye corpus (the data of speaker s14). TIMIT is read speech and the audio files are short (a few seconds each). The Buckeye corpus is spontaneous interview speech and the audio files are nine minutes long on average. During the tests, the manually transcribed phones were used as input for alignment. Table 1 lists the average absolute difference between the automatically and manually labeled phone boundaries; it also lists the percentage of agreement within 20 ms between forced alignment and manual segmentation.

Table 1. Performance of the PPL Forced Aligner on TIMIT and Buckeye.

Corpus     Average absolute difference    Agreement within 20 ms
TIMIT      11.3 ms                        85.3%
Buckeye    21.2 ms                        79.2%
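For reference, the two scores in Table 1 can be computed from matched boundary lists as in the generic sketch below; this is an illustration of the metrics, not the exact evaluation script used here.

```python
import numpy as np

def alignment_agreement(auto_bounds, manual_bounds, tol=0.020):
    """Boundary-accuracy metrics of the kind reported in Table 1.

    auto_bounds, manual_bounds: arrays of matched phone-boundary times (s)
    for the same utterances. Returns the average absolute difference and
    the percentage of boundaries agreeing within `tol` seconds (20 ms).
    """
    diff = np.abs(np.asarray(auto_bounds) - np.asarray(manual_bounds))
    return diff.mean(), (diff <= tol).mean() * 100
```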

The differences tend to become smaller when aggregate statistics are calculated. For example, in the case of the Buckeye corpus, the mean vowel duration in the hand-annotated data was 97.45 milliseconds, while the (5% trimmed) mean for the machine-aligned data was 95.11 milliseconds, or just 2.3 milliseconds shorter. Given that we made no attempt to harmonize the Buckeye segmentation standards with the decisions implicit in the aligner, this relative lack of bias is promising. We performed a preliminary error analysis on alignment of the TIMIT corpus. As shown in Table 2, we found that the signed alignment errors between different phone classes have different patterns. There is no bias towards either phone class for the boundaries between Nasals and Glides (no matter which is first, -0.002s vs. 0.006s); however, there is a significant bias towards Stops for the boundaries between Stops and Glides (-0.01s vs. 0.015s). There is no bias for the boundaries between Vowels and Glides (-0.002 s), but there is a significant bias towards Vowels for the boundaries between Glides and Vowels (0.013s). We will undertake further analyses to reveal how the error patterns are related to phone characteristics, coarticulation, and syllable structure. We will then use the information to improve forced alignment.

Table 2. Average signed errors for boundaries between broad phone classes (seconds). Rows give the first phone of each boundary pair and columns the second; “-” marks cells for which no value is reported.

            Affricate  Fricative  Glide   /h/     Nasal   Stop    Vowel
Affricate   -          –.008      –.006   -       -       -       .019
Fricative   .026       -          –.009   -       –.013   .008    .007
Glide       .014       .003       -       .013    .006    .015    .013
/h/         -          -          –.008   -       -       -       .010
Nasal       -          –.005      –.002   -       -       .009    .013
Stop        -          –.001      –.010   -       –.008   -       –.003
Vowel       .006       –.012      –.002   .006    –.004   .006    -

We also tested the aligner on hour-long audio files, i.e., alignment of entire hour-long recordings without cutting them into smaller pieces, using the British National Corpus (BNC) and the SCOTUS corpus. The spoken part of the BNC consists of informal conversations recorded by volunteers. The conversations contain a large amount of background noise, speech overlaps, etc. To help our forced aligner better handle the BNC data, we combined the CMU pronouncing dictionary with the Oxford Advanced Learner's Dictionary (http://ota.ahds.ac.uk/headers/0710.xml), a British English pronouncing dictionary. We also retrained the silence and noise models using data from the BNC. We manually checked the word alignments on a 50-minute recording, and 78.6% of the words in the recording were aligned accurately. The argument in the SCOTUS corpus that was set aside for testing in our study is 58 minutes long and manually word-aligned. The performance of the aligner on this argument is shown in Figure 1, a boxplot of alignment errors (absolute differences from manual segmentation) for each minute from the beginning to the end of the recording. We can see that the alignment is consistently good throughout the entire recording (the outliers are mostly due to incomplete transcriptions, e.g., untranscribed disfluencies and words). Possible reasons why our forced aligner handles long and spontaneous speech well include: the high quality of the training data; the fact that the training data are large enough to train robust monophone GMM models; and the robustness of the silence and noise models.

Figure 1. Alignment errors in every minute in a 58-minute recording.

3.2. Phonetics research using very large speech corpora and forced alignment

Instrumental phonetics research, especially in sociolinguistics, continues to rely heavily on formant measurements, because these provide the best acoustic proxy for perceived vowel quality, as well as for some other phonetic distinctions such as clear vs. dark /l/. However, automatic formant measurements are generally viewed as too error-prone to be accepted in this application. It is therefore important to show that automated methods, applied to transcribed audio recordings on a large scale, can produce results that are comparable in value to those obtained by human annotation. We have made a promising start on this task, partly with Bayesian formant tracking using priors derived from the vowel categories in the transcript, as in Evanini et al. (2009), and partly by substituting methods based on comparing the fit of different allophonic transcripts, as in Yuan and Liberman (2009). We have used large speech corpora to investigate speech and language phenomena such as /l/ variation (Yuan and Liberman, 2009), acoustic vowel space (Yuan and Liberman, 2008), speaking rate (Yuan et al., 2006), speech overlap (Yuan et al., 2007), stress (Yuan et al., 2008), duration (Yuan, 2008), and tone sandhi (Chen & Yuan, 2007). We will now summarize our study of /l/ variation in English, which shows how classic phonetic and phonological problems can be revisited using very large speech corpora and forced alignment.

The distinction between dark and clear /l/ in English has long been observed (Jones, 1947). In a classic study of English /l/, Sproat and Fujimura (1993) argued that the clear and dark allophones are not categorically distinct. They proposed that the single phonological entity /l/ involves two gestures: a vocalic dorsal gesture and a consonantal apical gesture. The two gestures are inherently asynchronous: the vocalic gesture is attracted to the nucleus of the syllable whereas the consonantal gesture is attracted to the margin. When producing a syllable-final /l/, the tongue dorsum gesture shifts leftward toward the syllable nucleus, making the vocalic gesture precede the consonantal, tongue apex gesture; when producing a syllable-initial /l/, the reverse situation holds. As an important piece of evidence for their proposal, Sproat and Fujimura (1993) found that the backness of pre-boundary intervocalic /l/ is correlated with the duration of the pre-boundary rime: the /l/ in longer rimes is darker.

We investigated /l/ variation in the 2001 term of the SCOTUS corpus, which contained 21,706 tokens of /l/. The phone boundaries were automatically determined using the PPL Forced Aligner. To measure the “darkness” of /l/ through forced alignment, we first split /l/ into two phones, L1 for the clear /l/ and L2 for the dark /l/, and retrained the acoustic models. In training, the word-initial [l]’s (e.g., like) and the [l]’s in word-initial consonant clusters (e.g., please) were categorized as L1 (clear); the word-final [l]’s (e.g., full) and the [l]’s in word-final consonant clusters (e.g., felt) were categorized as L2 (dark). All other [l]’s were treated as ambiguous, i.e., they could be either L1 or L2. During each iteration of training, the ‘real’ pronunciations of the ambiguous [l]’s were automatically determined, and then the acoustic models of L1 and L2 were updated. The new acoustic models were tested on both the training data and a dataset that had been set aside for testing purposes.
During the tests, all [l]’s were treated as ambiguous, and the forced aligner determined whether each [l] was L1 or L2. Using word-initial vs. word-final position as the gold standard, the accuracy of /l/ classification by forced alignment is 93.8% on the training data and 92.8% on the test data. These numbers suggest that forced alignment can be used to determine the darkness of /l/. To compute a score that measures the degree of /l/-darkness, we ran forced alignment twice: first, all [l]’s were aligned using the L2 model, and then using the L1 model. The difference between the likelihood scores resulting from the L2 alignment and the L1 alignment, the D score, measures the darkness of /l/ (Eq. 1). The larger the D score, the darker the /l/.

D(l) = log p(l | L2) − log p(l | L1)    (Eq. 1)
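In implementation terms, this is just a difference of two per-token alignment log-likelihoods. The tiny sketch below (with hypothetical inputs; in practice the likelihoods come from the aligner's output) makes the sign convention and the classification rule explicit.

```python
def d_score(loglik_dark, loglik_clear):
    """Eq. 1: D(l) = log p(l | L2) - log p(l | L1).

    loglik_dark and loglik_clear are the per-token alignment log-likelihoods
    from the two alignment passes (L2 = dark model, L1 = clear model).
    """
    return loglik_dark - loglik_clear

def classify_l(d):
    # D > 0: the token fits the dark-/l/ model better.
    return "L2 (dark)" if d > 0 else "L1 (clear)"
```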

Figure 2 shows histograms of the D scores of all /l/ tokens in the dataset, with L1 and L2 classified by forced alignment as described above. We can see that, as expected, most L1’s have negative D scores whereas most L2’s have positive D scores.

Figure 2. Histograms of D-scores for L1 (clear) and L2 (dark).

Figure 3 plots the average D scores of /l/ in syllable rimes against rime duration, grouped by the type of segment that the /l/ precedes. Such an /l/ typically follows a primary-stress vowel (denoted ‘1’), and it can precede a word boundary (denoted ‘#’), a consonant within the word (denoted ‘C’), or an unstressed vowel within the word (denoted ‘0’).

Figure 3. Relation between rime duration and darkness for syllable-final /l/. The x-axis represents duration, “.10” means below .10s, “.15” means between .10 and .15 seconds, etc.

We can see from Figure 3 that the /l/ in longer rimes has larger D scores, and hence is darker. This result is consistent with Sproat and Fujimura (1993). Figure 3 also shows that, rime duration being equal, the /l/ preceding an unstressed vowel (1_L_0) is less dark than the /l/ preceding a word boundary (1_L_#) or a consonant (1_L_C). Moreover, the relationship between rime duration and darkness for the /l/ in 1_L_C is non-linear: for shorter rimes the correlation is positive whereas for longer ones it is negative, and the /l/ reaches its peak of darkness when the rime (more precisely, the stressed vowel plus /l/) is about 150-200 ms. Finally, Figure 3 shows that syllable-final /l/ was always dark (D > 0), even in rimes that were very short, i.e., less than 100 ms. This result contradicts Sproat and Fujimura’s finding that the syllable-final /l/ in very short rimes can be as clear as the canonical clear /l/. Using the same technique, we studied the acoustic difference between word-initial and word-final positions for all English consonants. We trained two acoustic models for each consonant, one for the word-initial position and the other for the word-final position. We then used the two models to identify whether a consonant is word-initial or word-final, by running forced alignment twice and comparing the likelihood scores. The results showed that the identification accuracies range from 80% (/s/) to 95% (/r/).

3.3 Improving automatic phonetic alignments

Forced alignment is a powerful tool for utilizing very large speech corpora in phonetics research, but as we noted, it faces several obvious problems: orthographic ambiguity, pronunciation variation, and imperfect transcripts. The general approach in all cases is to add alternative paths to the “language model” (which in the simplest case is just a simple sequence of expected phonetic segments), with estimates of the a priori probability of the alternatives, and let the Viterbi decoding choose the best option. In some cases, it may also be helpful to add acoustically-based features, perhaps based on some decision-specific machine learning, designed to discriminate among the alternatives. We and others have gotten promising results with such techniques, and we are confident that, with some improvements, they will deal adequately with these problems, as well as add information about the distribution of phonetic variants in speech. Pretonic schwa deletion (e.g., suppose -> [sp]ose) presents a typical challenge of this type (Hooper, 1978; Patterson et al., 2003; Davidson, 2006). Editing the pronouncing dictionary may solve the problem, but it is time-consuming and error-prone. We propose a different approach: using a “tee model” for schwa in forced alignment. A “tee model” has a direct transition from the entry to the exit node of the HMM; therefore, a phone with a “tee model” can have “zero” length. Tee models have mainly been used for handling possible inter-word silence. In a pilot experiment, we trained a tee model for schwa and used the model to identify schwa elision (“zero” length from alignment) in the SCOTUS corpus.
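In search terms, a tee model simply lets the decoder assign zero frames to a phone. The sketch below (again our own toy illustration, not HTK code) extends the Viterbi example from Section 2 by allowing marked phones, such as a pretonic schwa, to be skipped; a skipped phone surfaces as a zero-length span, which is how elision is detected.

```python
import numpy as np

def viterbi_align_with_skips(frame_loglik, phone_seq, skippable):
    """Toy Viterbi alignment in which marked phones may take zero frames.

    Same setup as the earlier sketch, plus `skippable`: a set of positions
    in phone_seq (e.g., a pretonic schwa) that may be skipped, mimicking
    the entry-to-exit transition of an HMM tee model. A skipped phone is
    returned with a zero-length span, i.e. it is detected as elided.
    For simplicity the first phone is assumed not to be skippable.
    """
    T, N = frame_loglik.shape[0], len(phone_seq)
    score = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)   # how many phones we advanced past
    score[0, 0] = frame_loglik[0, phone_seq[0]]

    for t in range(1, T):
        for n in range(N):
            # Predecessors: same phone, previous phone, or two phones back
            # when the intervening phone is allowed to be zero-length.
            cands = [(score[t - 1, n], 0)]
            if n >= 1:
                cands.append((score[t - 1, n - 1], 1))
            if n >= 2 and (n - 1) in skippable:
                cands.append((score[t - 1, n - 2], 2))
            best, adv = max(cands)
            score[t, n] = best + frame_loglik[t, phone_seq[n]]
            back[t, n] = adv

    # Backtrace; zero-length spans mark elided (skipped) phones.
    spans, n, end = [], N - 1, T - 1
    for t in range(T - 1, 0, -1):
        adv = back[t, n]
        if adv:
            spans.append((phone_seq[n], t, end))
            if adv == 2:
                spans.append((phone_seq[n - 1], t, t - 1))  # zero length
            end, n = t - 1, n - adv
    spans.append((phone_seq[0], 0, end))
    return list(reversed(spans))
```

In HMM terms, the same effect is obtained by giving the schwa model a nonzero entry-to-exit transition probability, exactly as is done for the optional short-pause (sp) model described in Section 4.4.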

Figure 4. Identifying schwa elision through forced alignment and “tee-model”.

We asked a phonetics student to examine all the tokens of the word suppose in the corpus (ninety-nine in total) and to identify whether there was a schwa in each token by listening to the sound and looking at the spectrogram. The agreement between the forced alignment procedure and this manual procedure was 88.9% (88/99). Of the 24 tokens identified as ‘no schwa’ (schwa elision) by the student, 22 (91.7%) were also identified as such by the aligner; of the 75 tokens identified as having a schwa by the student, 66 (88%) were correctly identified by the aligner. Figure 4 (above) illustrates two examples from the forced alignment results. We can see that the forced aligner correctly identifies a schwa in the first example and a schwa elision in the second, even though the word suppose does not have a pronunciation variant with schwa elision in the pronouncing dictionary.

4. Research Plans

We plan to improve the Penn Phonetics Lab (PPL) Forced Aligner in two respects: 1) its segmentation accuracy; and 2) its robustness to conversational speech and long recordings. We will further explore techniques for modeling phonetic variation and recognizing untranscribed disfluencies, and for marking regions of unreliable alignment. In addition, we will extend this system to other speech genres (e.g., child-directed speech) and more languages (e.g., Mandarin Chinese). We will apply these techniques to the LDC’s very large speech corpora, and explore how to integrate the resulting automated annotations into a database system that is convenient for phonetic search and retrieval, using the techniques developed in previous NSF-funded projects at LDC (see Section 1 above). In addition to using these results in our own research, we will publish both the annotations and the search software for use by the research community at large, in order to learn as much as possible about the issues that arise in applying this new approach.

4.1 Gold-standard benchmarks

TIMIT has been widely used to evaluate the performance of forced alignment. Besides TIMIT, we will also use the Buckeye corpus (http://vic.psy.ohio-state.edu/), the SCOTUS corpus (2001 term), and the Switchboard corpus (LDC97S62, and its ISIP transcripts, http://www.isip.piconepress.com/projects/switcboard) as gold-standard benchmarks in our research to improve forced alignment. TIMIT and the Buckeye corpus contain both word and phone segmentations; the SCOTUS corpus and Switchboard have word boundaries only (a subset of the Switchboard corpus also has detailed phonetic transcriptions from the Switchboard Transcription Project, Greenberg et al. 1996). The phone boundaries in TIMIT and the Buckeye corpus were generated in a two-pass procedure: first through forced alignment and then by hand correction. For phone boundaries that are difficult to determine, e.g., those between a vowel and an approximant, a somewhat arbitrary segmentation protocol was adopted. The Buckeye annotation manual, for example, states: “... it will often not be possible to define a single point in time that separates the vowel from the approximant. The first strategy is to place the label boundary half way between the points at which the segments become clearly vowel and semivowel. If that is not possible (they may never become prototypical!), then assign one-third of the vocalic region to the approximant, and two-thirds to the vowel.” This type of arbitrariness in the manual segmentation “gold standard” poses a problem for the assessment of forced alignment. For this reason, the first step of our research is to develop phonetic segmentation data that can serve as a new gold-standard benchmark. We will determine how to label the transitions between phones. One possibility is to adopt the phone-transition model presented in Hertz (1991), which treats transitions as independent units, rather than incorporating them into phones. This dataset will be created in a uniform format for all our target languages and speech genres. We will randomly select representative utterances from the four corpora listed above and from the corpora described in Section 4.5, including a Mandarin Chinese broadcast news corpus, a Mandarin Chinese telephone conversation corpus, and an English child-directed speech corpus. Half an hour of benchmark data will be created for each of these corpora. These datasets will be published through LDC by the end of the first year of the project.

4.2 Improving acoustic and pronunciation models

A key issue in forced alignment is the inherent variation of human speech. Based upon our review of the literature, we will conduct studies to improve acoustic and pronunciation modeling from the perspective of forced alignment. We will first investigate how to better handle pronunciation variation (i.e., deletion, reduction, and insertion). Although modeling pronunciation variation to improve the performance of ASR systems has been extensively studied, the results may or may not carry over to forced alignment. Kessens et al. (2003) argued that adding pronunciation variants to the lexicon increases confusability among words; therefore, modeling pronunciation variation by increasing the number of pronunciation variants per word may bring both improvements and deteriorations in speech recognition. Hain (2005) presented a method for constructing a dictionary with only one pronunciation entry per word from a good reference dictionary; his study showed that such single-pronunciation dictionaries provide similar or better word error rates than a standard baseline system using multiple pronunciation variants. In forced alignment, however, pronunciation confusability between words is far less of a problem than it is in speech recognition. Therefore, adding multiple pronunciations to the lexicon, whether through a data-driven or a knowledge-based method, is likely to help forced alignment more than it helps ASR.

Another way of handling pronunciation variation is to improve the acoustic models. Based on the success of our study of pretonic schwa elision (Section 3.3), we will build tee models to detect and align deletions in speech. We will also investigate the performance of monophone, triphone, and phone-transition models in forced alignment. A possible experiment on vowel reduction involves building a system in which all English reduced vowels are the same phoneme; this special phoneme would have triphone models instead of a monophone model. We may also ask: can (some) transitions be clustered with (some) reduced vowels in their acoustic models? More generally, are parameter-tying techniques (Young et al., 1994) capable of capturing pronunciation variability? On which level should the parameter clustering be performed: model, HMM state, or GMM mixture component? Finally, we will investigate the effect of discriminative training and acoustic modeling techniques on the quality of alignments. In particular, we expect improvements from training criteria designed to improve phone discrimination in Gaussian model estimation, such as maximum mutual information (Normandin, 1992) and minimum phone error (Povey and Woodland, 2002), and from acoustic features computed by neural networks trained to separate phone classes (Hermansky et al., 2000).

We will carry out a detailed error analysis of the models’ performance. Error analysis provides information about where and how the system should be improved; it allows us to estimate whether deviations from human annotation introduce any bias; and it may yield insights of its own. Greenberg and Chang (2000), for example, conducted a diagnostic evaluation of eight Switchboard-corpus recognition systems, and found that syllabic structure, prosodic stress and speaking rate were important factors in accounting for recognition performance.
Through such error analyses we can determine, for example, which models better handle the transitions between vowels and approximants, and whether syllable-initial and syllable-final consonants should have different models (as suggested by the results presented in Section 3.2).
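One simple experiment along these lines, in the spirit of Hain (2005), is to collapse a multi-variant lexicon to the single variant that a first alignment pass selects most often for each word, and then to compare alignment accuracy against the multi-variant baseline. The sketch below is our own illustration and assumes a list of (word, chosen variant) pairs extracted from a first alignment pass.

```python
from collections import Counter, defaultdict

def single_pronunciation_dict(alignment_choices):
    """Collapse a multi-variant lexicon to one pronunciation per word.

    alignment_choices: iterable of (word, pronunciation) pairs recording
    which variant a first alignment pass selected for each token, e.g.
    ("suppose", "S AH0 P OW1 Z") or ("suppose", "S P OW1 Z").
    Returns {word: most frequently chosen pronunciation}. Whether this
    helps alignment, as opposed to recognition, is exactly the kind of
    question Section 4.2 proposes to test.
    """
    counts = defaultdict(Counter)
    for word, pron in alignment_choices:
        counts[word][pron] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}
```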

4.3 Integration of phonetic knowledge

In our understanding, part of the reason why the integration of phonetic knowledge has not significantly improved the accuracy of speech recognition is the strong influence of the language model in automatic speech recognition (ASR). Since the word sequence is provided in forced alignment, the application of phonetic knowledge is more likely to be successful here. The proposed research will attempt to improve forced alignment by incorporating well-established phonetic models. Specifically, besides the phone-transition model mentioned above, we will explore the π-gesture model (Byrd & Saltzman, 2003) and the landmark model (Stevens, 2002). The π-gesture model of Byrd and Saltzman (2003) suggests that boundary-related durational patterning can result from prosodic gestures, or π-gestures, which stretch or shrink the local temporal fabric of an utterance. We propose to incorporate the π-gesture model into the forced alignment procedure through the rescoring of alignment lattices (Jennequin & Gauvain, 2007). Stevens (2002) proposes a model for lexical access based on acoustic landmarks and distinctive features. Landmark-based speech recognition has advanced in recent years (Hasegawa-Johnson et al., 2005). We will adopt a two-step procedure to apply the landmark model in forced alignment: in the first step, segment boundaries will be obtained by the HMM-based PPL forced aligner; in the second step, the boundaries will be refined through landmark detection, using the framework proposed in Juneja and Espy-Wilson (2008).

4.4 Robustness to conversational speech and long recordings

The transcriptions of long and spontaneous speech are usually imperfect. Spontaneous speech contains filled pauses, disfluencies, errors, repairs, and deletions that are often missed in the transcription process. Recordings of long and spontaneous speech usually contain background noise, speech overlaps, and very long non-speech segments. These factors make the alignment of long and spontaneous speech a great challenge. We aim to improve the robustness of the PPL aligner to long and spontaneous speech in three ways: 1) improving the acoustic models of silences, noises, and filled pauses; 2) introducing constraints from prosodic and language models into forced alignment; and 3) integrating technology from research on audio segmentation and "diarization". Besides the Switchboard corpus, the Buckeye corpus, and the SCOTUS corpus, we will use the following corpora for this part of the research: 1. RT-03 MDE Training Data Speech (LDC2004S08) and Text and Annotations (LDC2004T12); 2. ICSI Meeting Speech (LDC2004S02) and Transcripts (LDC2004T04), ISL Meeting Speech Part I (LDC2004S05) and Transcripts (LDC2004T04), NIST Meeting Pilot Corpus Speech (LDC2004S09) and Transcripts and Metadata (LDC2004T13); 3. Fisher English Training Speech Part 1 Speech (LDC2004S13) and Transcripts (LDC2004T19), Fisher English Training Speech Part 2 Speech (LDC2005S13) and Transcripts (LDC2005T19).

In our experiments on alignment of the BNC corpus, which consists of casual, long recordings made in natural settings, we found that erroneous alignments can be reduced by adapting the silence and noise models of the PPL aligner to the BNC data. We will further explore the importance of the non-speech models in forced alignment of long and casual speech, and also investigate adaptation of the speech models to individual speakers. We will also investigate ways to improve the acoustic models for better handling of filled pauses. Schramm et al. (2003) created many pronunciation variants for a filled pause through a data-driven lexical modeling technique; their model outperforms a single-pronunciation filled-pause model in recognition of highly spontaneous medical speech. Another approach (Stolcke et al., 2000) is the use of dedicated phones for filled pauses, so that the great acoustic variability of filled-pause realizations can be accounted for without affecting the acoustic models of standard phones. Stouten et al. (2006) argue that a better way to cope with pause fillers in speech recognition is to introduce a specialized filled-pause detector (as a preprocessor) and supply the output of that detector to the general decoder. We will explore all of these approaches for our purpose of improving forced alignment of long and casual speech.

A common practice in forced alignment is to insert a “tee-model” phone, called sp (short pause), after each word in the transcription to handle possible inter-word silence. Since a tee model has a direct transition from the entry to the exit node, sp can be skipped during forced alignment. In this way, a forced aligner can “spot” and segment pauses in the speech, which are usually not transcribed. In casual and long speech, however, such pauses can be extremely long and filled with background noise, and in this case the sp-insertion approach can cause severe problems.
In our study of the BNC corpus, we found that there are often many sp segments mistakenly inserted by the aligner in regions where the word boundaries were not correctly aligned. We believe that these types of errors can be reduced by introducing constraints on the occurrence of sp from both a language model and a prosodic model. For example, pauses are unlikely between the words of very common phrases such as “How are you doing?”. On the other hand, if there is a single word between two pauses in speech, the word is likely to be lengthened, and hence should have longer duration and particular F0 characteristics. Another type of error we have seen in the BNC corpus is that some words are extremely long in the alignment results. This usually occurs when there is long speech-like background noise surrounding the words. This type of error can be reduced by introducing constraints on word or phone duration. One way to do this is to score alternate alignments with a word-specific Gaussian phone duration model, an approach that has been shown to improve word recognition (Gadde, 2000).

A remaining challenge is the robust identification of regions where (for whatever reason) the automatic process has gone badly wrong, so that the resulting data should be disregarded. The use of likelihood scores or other confidence measures is the obvious approach, but the history of such confidence measures in automatic speech recognition is mixed at best. We hypothesize that this is mainly because the language model is weighted more heavily than the acoustic model in automatic speech recognition. In forced alignment, however, the word sequence is (mostly or completely) given, so the language model plays much less of a role in computing the likelihood scores, which should therefore be more reliable and useful.

We may also segment a long recording into speech and non-speech regions before doing forced alignment, by integrating technology from research on robust audio segmentation and "diarization" (Ajmera, 2004; Reynolds and Torres-Carrasquillo, 2005). State-of-the-art segmentation techniques can be classified into two types: model-based, i.e., training models for acoustic classes such as speech, silence, noise, and music; and metric-based, i.e., computing a distance between two segments to determine whether they belong to the same acoustic class, using metrics such as the log likelihood ratio (LLR) and the Bayesian Information Criterion (BIC). Kemp et al. (2000) demonstrated that model-based techniques can achieve better boundary precision, while metric-based techniques perform better in terms of segment boundary recall. We will combine these two types of techniques in our research.
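To illustrate the metric-based side, the sketch below computes the ΔBIC change statistic in its usual form for a candidate boundary inside a window of feature frames: one full-covariance Gaussian for the whole window is compared against one Gaussian per side, with a complexity penalty, and a positive value supports placing a boundary. This is our own generic sketch, not the implementation of any of the cited systems.

```python
import numpy as np

def delta_bic(frames, split, lam=1.0):
    """ΔBIC change test for a candidate boundary inside a window of frames.

    frames: (n, d) array of acoustic feature vectors (e.g., MFCCs),
            with enough frames on each side to estimate a covariance.
    split:  candidate boundary index, 0 < split < n.
    Returns ΔBIC; a positive value favors two Gaussians over one, i.e.
    supports a segment boundary at `split`.
    """
    def logdet_cov(x):
        sign, logdet = np.linalg.slogdet(np.cov(x, rowvar=False))
        return logdet

    n, d = frames.shape
    n1, n2 = split, n - split
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(frames)
            - 0.5 * n1 * logdet_cov(frames[:split])
            - 0.5 * n2 * logdet_cov(frames[split:])
            - penalty)
```

In practice the test is run over a sliding window, and the model-based classifiers mentioned above can then refine or confirm the hypothesized boundaries.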

4.5 Extending to other speech genres and languages

We have used forced alignment techniques successfully on a wide range of speech genres and recording conditions, including conversational telephone speech, courtroom recordings, sociolinguistic interviews, news and public affairs broadcasts, and audiobooks. We plan to add a few additional types of speech data, focusing especially on the CHILDES corpus (http://childes.psy.cmu.edu/), which contains audio/video data and transcripts collected from conversations between young children and their playmates and caretakers. The CHILDES corpus has been a great resource for studying language acquisition and phonetic variation. We propose to conduct forced alignment on the child-directed speech data in this corpus in order to make the data more accessible for phonetic research. We will utilize the English Brent corpus distributed in CHILDES, which includes about 100 hours of transcribed recordings of mothers speaking to preverbal infants. Kirchhoff and Schimmel (2005) trained automatic speech recognizers on infant-directed (ID) and adult-directed (AD) speech, respectively, and tested the recognizers on both ID and AD speech. They found that matched conditions produced better results than mismatched conditions, and that the relative degradation of ID-trained recognizers on AD speech was significantly less severe than in the reverse case. We will conduct a similar experiment for forced alignment by comparing aligners trained on the SCOTUS corpus and on the Brent corpus. In this experiment we will extract the mothers’ speech from the recordings based on the time stamps in the transcripts, and use the mothers’ speech for both training and testing. We will also test the portability of the diarization techniques developed in Section 4.4 to mother-infant conversations. We will use the entirety of the Brent corpus recordings in this experiment, and diarize mothers’ speech, infants’ responses, and pauses and noises. We expect that additional acoustic and phonetic features will need to be incorporated into the diarization techniques to separate infants’ responses from mothers’ speech and from noise.

We will also extend the PPL aligner to Mandarin Chinese. We have built a baseline Mandarin Chinese forced aligner using the 1997 Mandarin Broadcast News (Hub4-NE) Speech (LDC98S73) and Transcripts (LDC98T24), and the CallHome Mandarin Chinese Lexicon (LDC96L15). We will test the portability of the acoustic and pronunciation modeling techniques developed in Section 4.2 to the Mandarin Chinese aligner. In addition, we will investigate how to incorporate advanced tonal models and how to use alternative acoustic units, such as syllables and initials/finals, in Chinese forced alignment. Tonal models and acoustic units have been extensively investigated in Chinese automatic speech recognition studies (e.g., Fu et al., 1996; Vu et al., 2005; Lei, 2006). However, unlike these studies, our research aims to improve forced alignment rather than recognition accuracy. We will also test the portability of the techniques for handling disfluencies and imperfect transcription, developed in Section 4.4, to Mandarin Chinese conversational speech, using the CallHome Mandarin Chinese Speech (LDC96S34) and Transcripts (LDC96T16), and the HKUST Mandarin Telephone Speech, Part 1 (LDC2005S15) and Transcripts (LDC2005T32).
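As a small illustration of what initial/final units involve, the sketch below splits toneless pinyin syllables into initial + final units using the standard initial inventory. It is our own toy helper, not part of any existing aligner; tone marks, the y/w spelling conventions, and erhua are deliberately ignored.

```python
# Standard pinyin initials, longest first so that zh/ch/sh match before z/c/s.
_INITIALS = ["zh", "ch", "sh",
             "b", "p", "m", "f", "d", "t", "n", "l",
             "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]

def split_pinyin(syllable):
    """Return (initial, final); the initial is '' for zero-initial syllables."""
    for ini in _INITIALS:
        if syllable.startswith(ini) and len(syllable) > len(ini):
            return ini, syllable[len(ini):]
    return "", syllable

# e.g. split_pinyin("zhong") -> ("zh", "ong"); split_pinyin("an") -> ("", "an")
```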

4.6. Creating and using very-large-scale phonetic databases

Using very-large-scale phonetic databases in speech research poses three key problems: distributed creation and maintenance, multidimensional search, and integrated search across datasets with somewhat different designs. Large speech databases typically grow over time with contributions or corrections from multiple sites. Without a clear record of what information was added when and by whom, and a system for maintaining consistency over time, the result can be incompatible versions that are difficult to re-integrate. In addition, such databases contain many dimensions and layers of information: the basic audio or video recording and associated metadata; the time-aligned orthographic and phonetic transcripts; information about who spoke when, with associated speaker metadata; information about discourse type; and so on. Researchers want to be able to extract datasets based on combinations of these features, e.g. “short 'a' between labial stops in stressed syllables of words with frequency less than 10 per million, in conversational speech from female speakers born between 1960 and 1965 in the Northern Cities dialect area.” This goal is made harder to reach by the fact that different data sources come with different sorts of information, or may present the same content in superficially different ways. Approaches suitable for creating, maintaining, and searching such varied, complex, and multidimensional databases, with methods to retrieve relevant subsets in a suitably combinatoric way, were developed at LDC under NSF awards 9983258, “Multidimensional Exploration of Linguistic Databases”, and 0317826, “Querying linguistic databases”. Susan Davidson, who participated in those projects, is also an expert in problems of data provenance and querying multiple sources. We propose to adapt and improve the results of this prior research. We will organize a workshop during the summer of the second year of the project, which will bring together researchers in phonetics, speech technology, and database research to evaluate the existing solutions and identify remaining needs. We will also recruit a graduate student in computer science to work on this aspect of the project.
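To give a feel for what the example query touches, the sketch below expresses it against a hypothetical relational layout with phone-, word-, and speaker-level tables joined by identifiers. All table and column names here are our own illustrative assumptions, not an existing LDC schema.

```python
import pandas as pd

def short_a_query(segments, words, speakers):
    """A hypothetical rendering of the query quoted in the text.

    segments: phone-level rows with columns
        ['seg_id', 'word_id', 'phone', 'prev_phone', 'next_phone', 'stress']
    words:    ['word_id', 'speaker_id', 'word', 'freq_per_million', 'genre']
    speakers: ['speaker_id', 'sex', 'birth_year', 'dialect_area']
    """
    labial_stops = {"P", "B"}
    segs = segments[(segments.phone == "AE") &
                    (segments.stress == 1) &
                    (segments.prev_phone.isin(labial_stops)) &
                    (segments.next_phone.isin(labial_stops))]
    segs = segs.merge(words[(words.freq_per_million < 10) &
                            (words.genre == "conversation")], on="word_id")
    segs = segs.merge(speakers[(speakers.sex == "female") &
                               (speakers.birth_year.between(1960, 1965)) &
                               (speakers.dialect_area == "Northern Cities")],
                      on="speaker_id")
    return segs
```

The database problem is precisely that each corpus encodes these layers differently; the point of the proposed framework is to make one query like this work across collections with differing metadata.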

4.7. Dissemination of the research

We will disseminate the research using methods that include journal publications; open-source toolkits and web-based applications; and tutorials, workshops, and courses. We will of course continue to write papers, both on the methodological issues and on applications of the approach to research problems in phonetics and other areas, for conference presentations and journal articles.

We have built a freely accessible forced alignment system, residing at http://www.ling.upenn.edu/phonetics/p2fa/, which has been used by researchers at sites including NYU, Oxford, Stanford, UIUC, the University of Chicago, and ETS. We will publish new versions of the aligner annually at no cost, through LDC and the phonetics lab website. To encourage long-term use of the aligner, we will produce a permanent free-standing tutorial covering the training and use of the aligner and the integration of its output. We will also develop web-based applications that integrate forced alignment, database query, and phonetics research. For example, we have built a web-based search engine for searching phones and words in the SCOTUS corpus, in which the search results are word-aligned speaking turns.

We will organize a workshop on phonetics databases during the summer of the second year. The aim of the workshop is to bring together researchers in phonetics, speech technology, and database research to discuss the challenges and opportunities of building very large, multi-layer, phonetically-annotated datasets, and to identify which methods can provide usable and practical solutions.

Modern HMM-based speech recognition is relatively easy to port to new languages, but a new language often brings new challenges, in the form of new phonetic segment types or new orthographic ambiguities. We aim to give other speech researchers an easy-to-follow cookbook for applying these methods to new languages. Toward that end, in each of the first two years we will pick a new language in which to help a researcher develop a forced-alignment system and apply it to a scientific problem, documenting the process in a "forced alignment cookbook". During the summer of the third year, we will run a workshop at which this will be done for several languages in parallel, together with pronunciation modeling and disfluency modeling. The workshop will also provide an opportunity for us to test the aligner on datasets contributed by the workshop participants, and to seek research collaborations.

We will organize a workshop on the creation and use of very large corpora in speech research during the last year of the project. The purposes of the workshop will be: 1) to introduce these techniques to those in the speech-research community who are not familiar with them; and 2) to promote phonetics research using very large corpora, with forced alignment as both a tool and a methodology.

At the University of Pennsylvania we have been teaching a course on "corpus phonetics", covering relevant Python and Praat scripting, database access, statistical analysis in R, etc. We plan to teach a similar course at the Linguistic Society of America's 2011 Summer Institute.
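As a rough illustration of the kind of recipe the proposed cookbook would document, the following Python sketch batches forced alignment over a directory of paired audio files and transcripts. The aligner command name, its argument order, and the file-naming conventions are placeholders rather than the actual interface of the Penn Phonetics Lab aligner; the cookbook and tutorial described above will document the real interface.

```python
# Hedged sketch of a batch-alignment recipe: walk a corpus directory,
# pair each .wav file with a .txt transcript, and invoke an aligner
# command on each pair. "align.py" and its arguments are placeholders.
import subprocess
from pathlib import Path

def align_corpus(corpus_dir, out_dir, aligner="align.py"):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(corpus_dir).glob("*.wav")):
        transcript = wav.with_suffix(".txt")
        if not transcript.exists():
            continue  # skip recordings without a transcript
        textgrid = out / (wav.stem + ".TextGrid")
        # Placeholder invocation; consult the aligner's documentation
        # for its actual command-line interface.
        subprocess.run(
            ["python", aligner, str(wav), str(transcript), str(textgrid)],
            check=True,
        )

if __name__ == "__main__":
    align_corpus("corpus/", "aligned/")
```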


Collaboration Plan

1. Introduction

Susan Davidson is Weiss Professor of Computer and Information Science, and Founding co-Director of the Center for Bioinformatics, at the University of Pennsylvania. Her main research areas are databases and bioinformatics.

Mark Liberman is Trustee Professor of Phonetics, Professor of Computer and Information Science, and Director of the Linguistic Data Consortium at the University of Pennsylvania. In addition to phonetics and speech technology, his research interests include textual information extraction and the phonology of lexical tone. He was previously head of the Linguistics Research Department at AT&T Bell Laboratories.

Andreas Stolcke is Senior Research Engineer at SRI International. His main research area is speech recognition. Of particular relevance to this project is his expertise and experience in the modeling of spontaneous speech phenomena, such as pronunciation variability, disfluencies, and dialog structure.

Jiahong Yuan is Assistant Professor of Linguistics at the University of Pennsylvania. His research interests include speech prosody, corpus phonetics, and the integration of speech technology into phonetics research. Yuan is the principal developer of the Penn Phonetics Lab Forced Aligner, which has been extensively applied to research in phonetics and other areas of speech research.

LDC and SRI have a long history of collaboration on DARPA speech and language research projects and on data creation efforts such as the Penn Treebanks. One of the main reasons for confidence that the project will succeed is the combined strength of LDC and SRI. LDC has built up an enormous collection of spoken material and tools over many years under Liberman's direction. In recent years, Liberman and Yuan have worked together to apply LDC's large speech corpora and speech technologies, principally forced alignment, to phonetics research. Stolcke and colleagues at SRI have, for over a decade, developed new approaches to speech recognition and modeling. SRI has also been among the first to model prosody computationally for many tasks, such as automatic sentence segmentation, disfluency detection, speech recognition, speaker recognition, and emotion recognition. Finally, the Speech Technology and Research Laboratory at SRI has experience with several state-of-the-art speech recognition techniques that will be instrumental in improving automatic phonetic alignment, including language modeling, discriminative acoustic modeling, and speech duration modeling.

Davidson will provide database expertise and will advise as needed in constructing phonetically-annotated databases. Davidson and Liberman have collaborated on building and querying linguistic databases under previous NSF awards. In this project the focus is on solving database problems with practical solutions, and on creating a standard framework that other researchers can join us in using. They will co-supervise a graduate student RA, to be named, to work on this aspect of the project.

2. The roles and responsibilities of the personnel

J. Yuan (PI): Responsible for the overall direction of the project. He will lead the construction of gold-standard benchmarks, the evaluation of forced alignment performance, the integration of phonetics knowledge, and the construction of a Mandarin Chinese forced aligner. With Dr. Stolcke, he will also work on acoustic and pronunciation modeling, and on prosody and disfluency modeling.

A. Stolcke (co-PI): Primarily responsible for improving forced alignment techniques. He will lead the acoustic and pronunciation modeling, prosody and disfluency modeling, discriminative training, and diarization research. He will assist Yuan and Liberman in extending the PPL forced aligner to other speech genres and languages.

M. Liberman (co-PI): Shares responsibility for the evaluation of forced alignment performance, the integration of phonetics knowledge, prosody modeling, discriminative training, and diarization. With Dr. Davidson, he will co-lead the database research and co-supervise a graduate student RA, to be named, working on importing standard database techniques to phonetically-annotated data.

S. Davidson (co-PI): Provides database expertise. With Dr. Liberman, she will co-lead the database research and co-supervise a graduate student RA, to be named, working on importing standard database techniques to phonetically-annotated data.

3. Time schedule

Jan 2010: Stolcke visits Penn for planning meeting; construction of project webpage (Yuan); recruitment of a graduate student (Liberman, Davidson)
Jan 2010 - Dec 2010: Construction of gold-standard benchmarks (Yuan); acoustic and pronunciation modeling (Stolcke, Yuan); discriminative training (Stolcke, Liberman)
Dec 2010: Yuan and Liberman visit SRI; evaluation of improvement to forced alignment accuracy (Yuan, Liberman, Stolcke)
Jan 2011: Error analysis of forced alignment accuracy (Yuan, Liberman)
Jan 2011 - Dec 2011: Prosody and disfluency modeling (Stolcke, Yuan); integration of phonetics knowledge (Yuan, Liberman); database experiments (graduate student RA, to be named, with Liberman and Davidson)
June 2011: Workshop on constructing phonetically-annotated databases
July 2011: Teaching at the LSA Summer Institute (Yuan, Liberman)
Dec 2011: Evaluation of forced alignment on conversational speech and long recordings (Yuan, Liberman, Stolcke)
Jan 2012 - Dec 2012: Diarization (Stolcke, Liberman); extension to Mandarin Chinese and child-directed speech (Yuan, Stolcke); constructing phonetically-annotated databases (graduate student RA under the supervision of Liberman and Davidson)
June 2012: Workshop on building forced aligners
Dec 2012: Evaluation of the Mandarin Chinese forced aligner (Yuan)
Jan 2013 - Oct 2013: Phonetics research using forced alignment (Yuan, Liberman); integration of diarization and speaker adaptation techniques into forced alignment (Stolcke); publication of phonetically-annotated databases through LDC (graduate student RA under the supervision of Liberman and Davidson)
Aug 2013: Workshop on using forced alignment for phonetics research
Nov 2013 - Dec 2013: Final report, dissemination, and evaluation activities (Davidson, Liberman, Stolcke, Yuan); publication of the forced alignment system and documentation through LDC and the Penn Phonetics Lab website (Yuan, Liberman, Stolcke)

References

Ajmera, J. (2004). Robust audio segmentation, PhD thesis, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.
Bates, R., Ostendorf, M., and Wright, R. (2007). "Symbolic Phonetic Features for Modeling of Pronunciation Variation," Speech Communication, 49, pp. 83-97.
Bell, A., Jurafsky, D., Fosler-Lussier, E., Girand, C., Gregory, M., and Gildea, D. (2003). "Effects of disfluencies, predictability, and utterance position on word form variation in English conversation," Journal of the Acoustical Society of America, 113, pp. 1001-1024.
Benzeghiba, M., De Mori, R., Deroo, O., Dupont, S., Erbes, T., Jouvet, D., Fissore, L., Laface, P., Mertins, A., Ris, C., Rose, R., Tyagi, V., and Wellekens, C. (2007). "Automatic speech recognition and speech variability: A review," Speech Communication, 49, pp. 763-786.
Bird, S., and Liberman, M. (2001). "A formal framework for linguistic annotation," Speech Communication, 33(1-2), pp. 23-60.
Byrd, D., and Saltzman, E. (2003). "The elastic phrase: Modeling the dynamics of boundary-adjacent lengthening," Journal of Phonetics, 31, pp. 149-180.
Chang, S., Shastri, L., and Greenberg, S. (2000). "Automatic phonetic transcription of spontaneous speech (American English)," Proceedings of ICSLP '00, pp. 330-333.
Chen, Y. and Yuan, J. (2007). "A Corpus Study of the 3rd Tone Sandhi in Standard Chinese," Proceedings of Interspeech '07, pp. 2749-2752.
Chu, M., Zhao, Y., and Chang, E. (2006). "Modeling stylized invariance and local variability of prosody in text-to-speech synthesis," Speech Communication, 48, pp. 716-726.
Cucchiarini, C. (1993). Phonetic transcription: a methodological and empirical study, PhD thesis, University of Nijmegen.
Davidson, L. (2006). "Schwa Elision in Fast Speech: Segmental Deletion or Gestural Overlap?" Phonetica, 63, pp. 79-112.
Davis, S. and Mermelstein, P. (1980). "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech and Signal Processing, 28, pp. 357-366.
Evanini, K., Isard, S., and Liberman, M. (2009). "Automatic Formant Extraction for Sociolinguistic Analysis of Large Corpora," (to appear in) Proceedings of Interspeech '09.
Fox, M. (2006). Usage-Based Effects in Latin American Spanish Syllable-Final /s/ Lenition, PhD dissertation, University of Pennsylvania.
Fu, S., Lee, C.H., and Clubb, O. (1996). "A survey on Chinese speech recognition," Communications of COLIPS, 6, pp. 1-17.
Furui, S. (2005). "Recent progress in corpus-based spontaneous speech recognition," IEICE Trans. Inf. & Syst., E88-D, pp. 366-375.
Gadde, V. R. R. (2000). "Modeling word durations," Proceedings of ICSLP '00, pp. 601-604.
Garofolo, J., et al. (1993). TIMIT Acoustic-Phonetic Continuous Speech Corpus, Linguistic Data Consortium, Philadelphia.
Godfrey, J. and Holliman, E. (1997). Switchboard-1 Release 2, Linguistic Data Consortium, Philadelphia.
Grabe, E., Kochanski, G., and Coleman, J. (2005). "The intonation of native accent varieties in the British Isles - potential for miscommunication?" In Katarzyna Dziubalska-Kołaczyk and Joanna Przedlacka (eds.), English pronunciation models: a changing scene, pp. 311-337.
Greenberg, S. and Chang, S. (2000). "Linguistic dissection of switchboard-corpus automatic speech recognition systems," Proceedings of the ISCA Workshop on Automatic Speech Recognition: Challenges for the New Millennium, pp. 195-202.
Greenberg, S., Hollenback, J., and Ellis, D. (1996). "Insights into the spoken language gleaned from phonetic transcriptions of the Switchboard corpus," Proceedings of ICSLP '96, pp. 32-35.
Hain, T. (2005). "Implicit modelling of pronunciation variation in automatic speech recognition," Speech Communication, 46, pp. 171-188.
Hasegawa-Johnson, M., Baker, J., Borys, S., Chen, K., Coogan, E., Greenberg, S., Juneja, A., Kirchhoff, K., Livescu, K., Mohan, S., Muller, J., Sönmez, K., and Wang, T. (2005). "Landmark-Based Speech Recognition: Report of the 2004 Johns Hopkins Summer Workshop," Proceedings of ICASSP '05, pp. 213-216.
Hastie, H.W., Poesio, M., and Isard, S. (2002). "Automatically predicting dialogue structure using prosodic features," Speech Communication, 36, pp. 63-79.
Hazen, T. (2006). "Automatic Alignment and Error Correction of Human Generated Transcripts for Long Speech Recordings," Proceedings of Interspeech '06, pp. 1606-1609.
Hermansky, H. (1990). "Perceptual linear predictive (PLP) analysis of speech," Journal of the Acoustical Society of America, 87, pp. 1738-1752.
Hermansky, H., Ellis, D. P. W., and Sharma, S. (2000). "Tandem connectionist feature extraction for conventional HMM systems," Proceedings of ICASSP '00, pp. 1635-1638.
Hertz, S. (1991). "Streams, phones, and transitions: toward a new phonological and phonetic model of formant timing," Journal of Phonetics, 19, pp. 91-109.
Hooper, J. (1978). "Constraints on schwa-deletion in American English," in J. Fisiak (ed.), Recent developments in historical linguistics, 4th ed., pp. 183-207.
Hosom, J.P. (2000). Automatic Time Alignment of Phonemes Using Acoustic-Phonetic Information, PhD thesis, Oregon Graduate Institute of Science and Technology.
Huang, S., and Renals, S. (2007). "Using prosodic features in language models for meetings," Proceedings of MLMI-07, pp. 192-203.
Jennequin, N. and Gauvain, J.L. (2007). "Modeling duration via lattice rescoring," Proceedings of ICASSP '07, pp. 641-644.
Johnson, K. (2004). "Massive reduction in conversational American English," in K. Yoneyama and K. Maekawa (eds.), Spontaneous Speech: Data and Analysis. Proceedings of the 1st Session of the 10th International Symposium, pp. 29-54.
Johnson, M., Charniak, E., and Lease, M. (2004). "An improved model for recognizing disfluencies in conversational speech," in Rich Transcription 2004 Fall Workshop.
Jones, D. (1947). An Outline of English Phonetics. Cambridge: W. Heffer and Sons.
Juneja, A., and Espy-Wilson, C. (2008). "A probabilistic framework for landmark detection based on phonetic features for automatic speech recognition," Journal of the Acoustical Society of America, 123, pp. 1154-1168.
Jurafsky, D. (2003). "Probabilistic modeling in psycholinguistics: Linguistic comprehension and production," in R. Bod, J. Hay, and S. Jannedy (eds.), Probabilistic Linguistics, pp. 39-96.
Keating, P., Byrd, D., Flemming, E., and Todaka, Y. (1994). "Phonetic analyses of word and segment variation using the TIMIT corpus of American English," Speech Communication, 14, pp. 131-142.
Kemp, T., Schmidt, M., Westphal, M., and Waibel, A. (2000). "Strategies for automatic segmentation of audio data," Proceedings of ICASSP '00, pp. 1423-1426.
Keshet, J., Shalev-Shwartz, S., Singer, Y., and Chazan, D. (2005). "Phoneme alignment based on discriminative learning," Proceedings of Interspeech '05, pp. 2961-2964.
Kessens, J., Cucchiarini, C., and Strik, H. (2003). "A data-driven method for modeling pronunciation variation," Speech Communication, pp. 517-534.
Kirchhoff, K. and Schimmel, S. (2005). "Statistical properties of infant-directed vs. adult-directed speech: insights from speech recognition," Journal of the Acoustical Society of America, 117, pp. 2224-2237.
Labov, W. (1994). Principles of linguistic change. Volume I: Internal Factors. Oxford: Basil Blackwell.
Lee, K.-S. (2006). "MLP-based phone boundary refining for a TTS database," IEEE Transactions on Audio, Speech, and Language Processing, 14, pp. 981-989.
Lei, X. (2006). Modeling lexical tones for Mandarin large vocabulary continuous speech recognition, PhD thesis, University of Washington.
Leung, H., and Zue, V. (1984). "A procedure for automatic alignment of phonetic transcription with continuous speech," Proceedings of ICASSP '84, pp. 73-76.
Lickley, R.J. and Bard, E.G. (1996). "On not recognizing disfluencies in dialogue," Proceedings of ICSLP '96, pp. 1876-1879.
Liu, Y., Shriberg, E., Stolcke, A., Hillard, D., Ostendorf, M., and Harper, M. (2006). "Enriching speech recognition with automatic detection of sentence boundaries and disfluencies," IEEE Transactions on Audio, Speech, and Language Processing, 14, pp. 1526-1540.
Ljolje, A., Hirschberg, J., and van Santen, J. (1997). "Automatic speech segmentation for concatenative inventory selection," in J. van Santen, R. Sproat, J. Olive, and J. Hirschberg (eds.), Progress in Speech Synthesis, Springer Verlag, New York, pp. 313-323.
Martin, J. and Strange, W. (1968). "The perception of hesitation in spontaneous speech," Perception and Psychophysics, 3, pp. 427-438.
Moreno, P., Joerg, C., Van Thong, J.M., and Glickman, O. (1998). "A Recursive Algorithm for the Forced Alignment of Very Long Audio Segments," Proceedings of ICSLP '98, pp. 2711-2714.
Normandin, Y. (1991). Hidden Markov Models, Maximum Mutual Information Estimation, and the Speech Recognition Problem, PhD thesis, McGill University, Montreal.
Osuga, T., Horiuchi, Y., and Ichikawa, A. (2001). "Investigation on the problems about automatic forced alignment for spontaneous speech," SIG-SLUD, 32, pp. 19-24.
Patterson, D., LoCasto, P., and Connine, C. (2003). "Corpora analysis of frequency of schwa deletion in conversational American English," Phonetica, 60, pp. 45-69.
Peterson, G., and Barney, H. (1952). "Control methods used in a study of the vowels," Journal of the Acoustical Society of America, 24, pp. 175-184.
Pierrehumbert, J. (2003). "Phonetic diversity, statistical learning, and acquisition of phonology," Language and Speech, 46, pp. 115-154.
Pitt, M.A., Dilley, L., Johnson, K., Kiesling, S., Raymond, W., Hume, E., and Fosler-Lussier, E. (2007). Buckeye Corpus of Conversational Speech (2nd release), Columbus, OH: Department of Psychology, Ohio State University (Distributor).
Povey, D. and Woodland, P. C. (2002). "Minimum Phone Error and I-Smoothing for Improved Discriminative Training," Proceedings of ICASSP '02, pp. 105-108.
Reynolds, D. and Torres-Carrasquillo, P. (2005). "Approaches and Applications of Audio Diarization," Proceedings of ICASSP '05, pp. 953-956.
Riley, M., Byrne, W., Finke, M., Khudanpur, S., Ljolje, A., McDonough, J., Nock, H., Saraclar, M., Wooters, C., and Zavaliagkos, G. (1999). "Stochastic pronunciation modelling from hand-labelled phonetic corpora," Speech Communication, 29, pp. 209-224.
Saraclar, M., Nock, H., and Khudanpur, S. (2000). "Pronunciation modeling by sharing gaussian densities across phonetic models," Computer Speech and Language, 14, pp. 137-160.
Schramm, H., Aubert, X.L., Meyer, C., and Peters, J. (2003). "Filled-pause modeling for medical transcriptions," Proceedings of ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition '03, pp. 143-146.
Shriberg, E. and Stolcke, A. (2004). "Direct modeling of prosody: An overview of applications in automatic speech processing," Proceedings of Speech Prosody '04, pp. 575-582.
Sproat, R. and Fujimura, O. (1993). "Allophonic variation in English /l/ and its implications for phonetic implementation," Journal of Phonetics, 21, pp. 291-311.
Stevens, K. (2002). "Toward a Model for Lexical Access Based on Acoustic Landmarks and Distinctive Features," Journal of the Acoustical Society of America, 111, pp. 1872-1891.
Stolcke, A., Bratt, H., Butzberger, J., Franco, H., Rao Gadde, V. R., Plauche, M., Richey, C., Shriberg, E., Sonmez, K., Weng, F., and Zheng, J. (2000). "The SRI March 2000 Hub-5 Conversational Speech Transcription System," Proceedings of NIST Speech Transcription Workshop 2000.
Stolcke, A. and Shriberg, E. (1996). "Statistical language modeling for speech disfluencies," Proceedings of ICASSP '96, pp. 405-408.
Stouten, F., Duchateau, J., Martens, J.P., and Wambacq, P. (2006). "Coping with disfluencies in spontaneous speech recognition: Acoustic detection and linguistic context manipulation," Speech Communication, 48, pp. 1590-1606.
Strik, H., and Cucchiarini, C. (1999). "Modelling pronunciation variation for ASR: a survey of the literature," Speech Communication, 29, pp. 225-246.
Toledano, D., Gómez, L., and Grande, L. (2003). "Automatic phonetic segmentation," IEEE Transactions on Speech and Audio Processing, 11, pp. 617-625.
Toth, A. (2004). "Forced alignment for speech synthesis databases using duration and prosodic phrase breaks," Proceedings of SSW5-2004, pp. 225-226.
Van Bael, C., Boves, L., van den Heuvel, H., and Strik, H. (2007). "Automatic phonetic transcription of large speech corpora," Computer Speech and Language, 21, pp. 652-668.
Venkataraman, A., Stolcke, A., Wang, W., Vergyri, D., Gadde, V. R. R., and Zheng, J. (2004). "An Efficient Repair Procedure For Quick Transcriptions," Proceedings of ICSLP '04, pp. 1961-1964.
Vu, T.T., Nguyen, D.T., Luong, M.C., and Hosom, J.P. (2005). "Vietnamese large vocabulary continuous speech recognition," Proceedings of Interspeech '05, pp. 1689-1692.
Wagner, M. (1981). "Automatic labelling of continuous speech with a given phonetic transcription using dynamic programming algorithms," Proceedings of ICASSP '81, pp. 1156-1159.
Wang, C. (2001). Prosodic Modeling for Improved Speech Recognition and Understanding, PhD thesis, MIT.
Wightman, C. and Talkin, D. (1997). "The Aligner: Text to speech alignment using Markov Models," in J. van Santen, R. Sproat, J. Olive, and J. Hirschberg (eds.), Progress in Speech Synthesis, Springer Verlag, New York, pp. 313-323.
Yuan, J. and Liberman, M. (2009). "Investigating /l/ Variation in English through Forced Alignment," (to appear in) Proceedings of Interspeech '09.
Yuan, J. (2008). "Covariations of English segmental durations across speakers," Proceedings of Interspeech '08.
Yuan, J., Isard, S., and Liberman, M. (2008). "Different Roles of Pitch and Duration in Distinguishing Word Stress in English," Proceedings of Interspeech '08.
Yuan, J. and Liberman, M. (2008). "Speaker identification on the SCOTUS corpus," Journal of the Acoustical Society of America, 123, p. 3878.
Yuan, J. and Liberman, M. (2008). "Vowel acoustic space in continuous speech: An example of using audio books for research," (to appear in) Proceedings of CatCod 2008.
Yuan, J., Liberman, M., and Cieri, C. (2006). "Towards an Integrated Understanding of Speaking Rate in Conversation," Proceedings of Interspeech '06, pp. 541-544.
Yuan, J., Liberman, M., and Cieri, C. (2007). "Towards an Integrated Understanding of Speech Overlaps in Conversation," Proceedings of ICPhS XVI, pp. 1337-1340.
Zheng, J., Franco, H., and Stolcke, A. (2003). "Modeling word-level rate-of-speech variation in large vocabulary conversational speech recognition," Speech Communication, 41, pp. 273-285.