THE USE OF SHIBBOLETH WORDS FOR AUTOMATICALLY CLASSIFYING SPEAKERS BY DIALECT

A.W.F. Huggins ([email protected])
BBN Hark Systems, 70 Fawcett St., Cambridge, MA 02138, USA

Yogen Patel (yogen@speech.com)
PureSpeech, 100 Cambridge Park Drive, Cambridge, MA 02140, USA

Abstract

Real-world applications using speech recognition must perform well over a range of dialects. Differences in dialect between the speakers in the training database and the target users often lead to degraded recognition performance. For the BBN Hark Hidden Markov Model (HMM) based system, we have already developed a reasonably effective technique [1] for dealing with multiple US dialects. The solution involves building separate HMM sets for each dialect from representative training speech data. This requires that training speakers be accurately classified by dialect, which is difficult to do reliably even by hand. In this paper we describe a recognition-based pseudo-automatic scheme for partitioning a pool of US English training speakers into groups, such that the speakers within each group share the same pronunciation characteristics. Our scheme is speech-data driven, and involves using transcript-level word hypotheses generated by a recognizer to partition the pool of training speakers.

1 Introduction

We have implemented a multi-dialect recognition system [1] for a speaker-independent, 200-word, command and control application. The "commands" are sentences which are typically connected sequences of four or five words. The underlying speech recognition engine used in our work evolved from BBN's Byblos system [2], and it uses discrete probability density functions to model triphones. The original multi-dialect system was trained on a multi-dialect speech corpus that was partitioned into dialect groups based on where the speech data was collected. Limited listening tests were carried out by a dialect expert to confirm that the location-based partitioning was reasonable.

The development of the pseudo-automatic scheme for partitioning this speech corpus was driven by two major objectives. First, we felt that such a scheme could lead to improved accuracy of the multi-dialect system, since we would be able to group training speakers together more accurately based on actual speaking characteristics. Second, we felt that a generic automatic data-driven scheme would allow us to partition data collected in the future more easily, as we would not have to collect or maintain detailed demographic information for each speaker in the corpus.

There has been some previous work on dialect partitioning. Cohen et al. [3] studied the two sentences that were included in the TIMIT database to capture dialectal differences (the "shibboleth" sentences). They defined eighteen segments whose alternative pronunciations they thought they could transcribe reliably, either visually from spectrograms or by ear, in the tokens of these sentences as spoken by their 630 different speakers. They found that a speaker's uses of alternative pronunciations for various segments were not independent, but formed clusters, and that these clusters were also dependent on the dialect of the speaker, based on where the speaker lived from age 2 to 10. They estimated that a useful reduction would occur in the entropy of the phonological model, and suggested this could lead to improved recognition accuracy. However, they did not test this prediction, and the method was not automatic. Furthermore, only a few of the segments studied were vowels, which are usually the most salient cues to dialect for human listeners, and also, we believe, for some recognition systems.

A study by Van Compernolle et al. [4] reported an attempt to automatically cluster speakers of Dutch and Flemish by dialect and by gender. "Dutch and Flemish are identical written languages, [but] pronunciation differences are large..". First, they demonstrated substantially improved performance from using multiple baseforms, derived from hand-classification of their 1000 speakers. Their method for automatic clustering involved (1) training two baseforms for each word (digits), based on an initial split of the training set, (2) doing recognition on each utterance in the training set, and assigning it to the class where it was best recognized, (3) retraining the models based on the new classification, and (4) iterating. Although this method produced improved performance on the training set, the improvement failed to carry over to the test set. The method we report on here is very similar to this; the main difference is that we used performance on a small number of words, hand-selected to discriminate between dialects, to classify speakers according to dialect, and then trained separate models for the two classes of speakers. We believe the reason their method failed was that a few useful discriminants were swamped by a larger number of cases where the main differences were due to other irrelevant aspects of the speech tokens.

2 Corpus Partitioning

We collected application-domain dependent speech data from approximately 450 adult speakers in three southern cities: Atlanta, Dallas, and Raleigh (North Carolina), and three northern cities: Paramus (New Jersey), Detroit, and Boston. In addition, we included a few more "northern" and "southern" speakers, whose data, although collected separately, consisted of essentially the same vocabulary. The speech data was recorded using a close-talking microphone at a sampling frequency of 16 kHz.

Our target system required developing the following four model sets: southern male, northern male, southern female, and northern female. We manually split the corpus by gender since we wanted to focus only on developing methods to automatically partition speakers into sub-groups within each of the separate gender groups. There are already well-established methods for gender recognition based on variants of speaker recognition techniques [5][6], and it was not our goal to validate these methods nor improve upon them.

2.1 Overview of Method

A major uncertainty in making this technique work was whether an appropriate set of shibboleth words was available in our corpus. The speech data collection mentioned above took place before we had even considered developing an automatic dialect partitioning scheme. Thus, although there are some classic words and phrases that highlight dialectal differences in American English [7], we did not have the benefit of being able to include these words and phrases in the data collection. We had to pick a set of shibboleth words from the existing vocabulary.

We selected the 52 highest-frequency words in our corpus, and generated several different phonetic transcriptions for each word. When generating the phonetic transcriptions, we attempted to capture the potential range of phonetic variation across all the speakers in the training corpus, but, of course, we were constrained by our existing phoneme set, whose main purpose was to capture phonemic regularity rather than phonetic variability. Each word typically had between four and twelve different transcriptions. For each speaker, the sentences that contained these words were passed through a grammar-constrained HMM phoneme recognizer. We were unable to use triphone models, which would have been preferable, because many of the alternative phonetic transcriptions contained triphones for which there was little or no training, and this would have resulted in strong biasing of the recognition path selected. The HMM phoneme models in the phoneme recognizer were previously generated using data drawn from a different 'uniform speaker' corpus. Relative counts were accumulated for each pronunciation of each of the shibboleth words by looking at the resulting decoded phoneme sequences for each of the different spoken words.

To convert the word counts into a distance measure, we calculated a chi-square value for each pair of speakers for each of the words, corrected it for degrees of freedom, and took the square root. We also combined the separate speaker-by-speaker distance matrices for each word into a Euclidean pooled matrix. The resulting distance matrices were analyzed by hierarchical clustering to classify the speakers.

As an initial validation of our approach, we ran the same analysis on smaller matrices in which all the speakers were pooled according to where their data was collected, and confirmed that the clustering obtained reflected our expectations in terms of dialects. We did this three times, guided by inspection of the clustering plots. After the first pass, we removed all but the thirteen most frequent words (the digits 1-9, "oh" and "zero", plus "point" and "phone"). After the second pass, we removed all pronunciation alternatives that received low counts. For most words, this left only two alternatives.

2.2 Computing Word Counts

As mentioned before, for each speaker we carried out grammar-constrained phonetic recognition on those sentences that contained the shibboleth words. The different pronunciations of each of the shibboleth words were represented by tagged multiple phoneme paths in the grammar.