THE USE OF SHIBBOLETH WORDS FOR AUTOMATICALLY CLASSIFYING SPEAKERS BY DIALECT

A.W.F. Huggins ([email protected])

BBN Hark Systems, 70 Fawcett St., Cambridge, MA 02138, USA

Yogen Patel (yogen@speech.com)

PureSpeech, 100 Cambridge Park Drive, Cambridge, MA 02140, USA

Abstract

Real-world applications using speech recognition must perform well over a range of dialects. Differences in dialect between the speakers in the training database and the target users often lead to degraded recognition performance. For the BBN Hark Hidden Markov Model (HMM) based system, we have already developed a reasonably effective technique [1] for dealing with multiple US dialects. The solution involves building separate HMM sets for each dialect from representative training speech data. This requires that training speakers be accurately classified by dialect, which is difficult to do reliably even by hand. In this paper we describe a recognition-based pseudo-automatic scheme for partitioning a pool of US English training speakers into groups, such that the speakers within each group share the same pronunciation characteristics. Our scheme is speech-data driven, and involves using transcript-level word hypotheses generated by a recognizer to partition the pool of training speakers.

1 Introduction

We have implemented a multi-dialect recognition system [1] for a speaker independent, 200 word, command and control application. The "commands" are sentences which are typically connected sequences of four or five words. The underlying speech recognition engine used in our work evolved from BBN's Byblos system [2], and it uses discrete probability density functions to model triphones. The original multi-dialect system was trained on a multi-dialect speech corpus that was partitioned into dialect groups based on where the speech data was collected. Limited listening tests were carried out by a dialect expert to confirm that the location-based partitioning was reasonable.

The development of the pseudo-automatic scheme for partitioning this speech corpus was driven by two major objectives. First, we felt that such a scheme could lead to improved accuracy of the multi-dialect system since we would be able to group training speakers together more accurately based on actual speaking characteristics. Second, we felt that a generic automatic data-driven scheme would allow us to partition data collected in the future more easily, as we would not have to maintain or collect detailed demographic information for each speaker in the corpus.

There has been some previous work on dialect partitioning. Cohen et al. [3] studied the two sentences that were included in the TIMIT database to capture dialectal differences (the "shibboleth" sentences). They defined eighteen segments whose alternative pronunciations they thought they could transcribe reliably, either visually from spectrograms or by ear, in the tokens of these sentences as spoken by their 630 different speakers. They found that a speaker's uses of alternative pronunciations for various segments were not independent, but formed clusters, and that these clusters were also dependent on the dialect of the speaker, based on where the speaker lived from age 2 to 10. They estimated that a useful reduction would occur in the entropy of the phonological model, and suggested this could lead to improved recognition accuracy. However, they did not test this prediction, and the method was not automatic. Furthermore, only a few of the segments studied were vowels, which are usually the most salient cues to dialect for human listeners, and also, we believe, for some recognition systems.

A study by Van Compernolle et al. [4] reported an attempt to automatically cluster speakers of Dutch and Flemish by dialect and by gender. "Dutch and Flemish are identical written languages, [but] pronunciation differences are large..". First, they demonstrated substantially improved performance
from using multiple baseforms, derived from hand-classification of their 1000 speakers. Their method for automatic clustering involved (1) training two baseforms for each word (digits), based on an initial split of the training set, (2) doing recognition on each utterance in the training set, and assigning it to the class where it was best recognized, (3) retraining the models based on the new classification, and (4) iterating. Although this method produced improved performance on the training set, the improvement failed to carry over to the test set. The method we report on here is very similar to this; the main difference is that we used performance on a small number of words, hand-selected to discriminate between dialects, to classify speakers according to dialect, and then trained separate models for the two classes of speakers. We believe the reason their method failed was that a few useful discriminants were swamped by a larger number of cases where the main differences were due to other irrelevant aspects of the speech tokens.

2 Corpus Partitioning

We collected application-domain dependent speech data from approximately 450 adult speakers in the following three southern cities: Atlanta, Dallas, Raleigh (North Carolina), and three northern cities: Paramus (New Jersey), Detroit, and Boston. In addition, we included a few more "northern" and "southern" speakers, whose data, although collected separately, consisted of essentially the same vocabulary. The speech data was recorded using a close-talking microphone at a sampling frequency of 16 kHz.

Our target system required developing the following four model sets: southern male, northern male, southern female, and northern female. We manually split the corpus by gender since we wanted to focus only on developing methods to automatically partition speakers into sub-groups within each of the separate gender groups. There are already well established methods for gender recognition based on variants of speaker recognition techniques [5][6], and it was not our goal to validate these methods or improve upon them.

2.1 Overview of Method

A major uncertainty in making this technique work was whether an appropriate set of shibboleth words was available in our corpus. The speech data collection mentioned above took place before we had even considered developing an automatic dialect partitioning scheme. Thus, although there are some classic words and phrases that highlight dialectal differences in [7], we did not have the benefit of being able to include these words and phrases in the data collection. We had to pick a set of shibboleth words from the existing vocabulary.

We selected the 52 highest frequency words in our corpus, and generated several different phonetic transcriptions for each word. When generating the phonetic transcriptions, we attempted to capture the potential range of phonetic variation across all the speakers in the training corpus, but, of course, we were constrained by our existing phoneme set, whose main purpose was to capture phonemic regularity rather than phonetic variability. Each word typically had between four and twelve different transcriptions. For each speaker, the sentences that contained these words were passed through a grammar-constrained HMM phoneme recognizer. We were unable to use triphone models, which would have been preferable, because many of the alternative phonetic transcriptions contained triphones for which there was little or no training, and this would have resulted in strong biasing of the recognition path selected. The HMM phoneme models in the phoneme recognizer were previously generated using data drawn from a different 'uniform speaker' corpus. Relative counts were accumulated for each pronunciation of each of the shibboleth words by looking at the resulting decoded phoneme sequences for each of the different spoken words.

To convert the word counts into a distance measure, we calculated a chi-square value for each pair of speakers for each of the words, corrected it for degrees of freedom, and took the square root. We also combined the separate speaker*speaker distance matrices for each word into a Euclidean pooled matrix. The resulting distance matrices were analyzed by hierarchical clustering to classify the speakers.

As an initial validation of our approach, we ran the same analysis on smaller matrices in which all the speakers were pooled according to where their data was collected, and confirmed that the clustering obtained reflected our expectations in terms of dialects. We did this three times as a result of inspection of the clustering plots. After the first pass, we removed all but the thirteen most frequent words (the digits 1-9, "oh" and "zero", plus "point" and "phone"). After the second pass, we removed all pronunciation alternatives that received low counts. For most words, this left only two alternatives.
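
The distance computation and clustering described above can be illustrated with a small sketch. It is not the implementation used in this work. Several details are assumptions made only for illustration: the chi-square statistic is computed from raw counts in a 2 x K speaker-by-pronunciation table, the correction for degrees of freedom is taken to be division by K - 1, the pooled matrix is the root of the summed squared per-word distances, and the hierarchical clustering uses average linkage. The counts and the word "FIVE" are invented, and all function names are hypothetical.

```python
import math
from itertools import combinations

# Invented raw counts per speaker for two hypothetical shibboleth words;
# each inner list holds the counts of [variant 1, variant 2] for one speaker.
COUNTS = {
    "NINE": [[9, 1], [8, 2], [1, 9], [2, 8]],
    "FIVE": [[7, 3], [9, 1], [2, 8], [3, 7]],
}


def chi_square_distance(a, b):
    """Return sqrt(chi-square / df) for the 2 x K table formed by two speakers."""
    cols = [(x, y) for x, y in zip(a, b) if x + y > 0]  # drop unused variants
    df = len(cols) - 1
    row_a = sum(x for x, _ in cols)
    row_b = sum(y for _, y in cols)
    if df < 1 or row_a == 0 or row_b == 0:
        return 0.0  # no evidence of a difference on this word
    total = row_a + row_b
    chi2 = 0.0
    for x, y in cols:
        exp_a = row_a * (x + y) / total  # expected count for speaker a
        exp_b = row_b * (x + y) / total  # expected count for speaker b
        chi2 += (x - exp_a) ** 2 / exp_a + (y - exp_b) ** 2 / exp_b
    return math.sqrt(chi2 / df)  # assumed form of the df correction


def word_matrix(counts):
    """Speaker-by-speaker distance matrix for a single word."""
    n = len(counts)
    return [[chi_square_distance(counts[i], counts[j]) for j in range(n)]
            for i in range(n)]


def pooled_matrix(matrices):
    """Euclidean combination of the per-word distance matrices."""
    n = len(matrices[0])
    return [[math.sqrt(sum(m[i][j] ** 2 for m in matrices)) for j in range(n)]
            for i in range(n)]


def average_linkage(dist, n_clusters=2):
    """Naive agglomerative clustering, stopped at n_clusters groups."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        def gap(pair):
            i, j = pair
            links = [dist[p][q] for p in clusters[i] for q in clusters[j]]
            return sum(links) / len(links)
        # Merge the pair of clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2), key=gap)
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters


pooled = pooled_matrix([word_matrix(c) for c in COUNTS.values()])
groups = average_linkage(pooled)  # lists of speaker indices, one per group
print(groups)
```

Stopping at two clusters mirrors the use of only the two highest level clusters to form the dialect groups. A library routine for hierarchical clustering would normally replace the naive loop, which is kept here only so that the sketch is self-contained.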

2.2 Computing Word Counts

As mentioned before, for each speaker we carried out grammar-constrained phonetic recognition on those sentences that contained the shibboleth words. The different pronunciations of each of the shibboleth words were represented by tagged multiple phoneme paths in the grammar. For each sentence the recognizer generated a word-level transcript, where each of the shibboleth words had a 'tag' or 'identifier' attached to it indicating the best match pronunciation.

A simple example may help to make this clearer. Consider a set of 4 male speakers, and consider the word "nine" as the shibboleth word to be used in grouping the speakers. Let us assume that the range of potential pronunciations of the word 'nine' is covered by the following two alternate phonetic spellings,

NINE: n-ay-n OR n-aa-n.

[Fig 2.1: Grammar with two parallel phoneme paths for the word "nine": N AA N, tagged NINE 1, and N AY N, tagged NINE 2.]

As shown in figure 2.1, the grammar will contain two paths, each corresponding to one of the phonetic spellings. All the sentences containing the word nine are then passed through a phoneme recognizer where the recognition path is constrained by this grammar. By following the highest scoring decoded phoneme sequence for each sentence, we are able to compile a table of relative counts of the occurrences of the different forms of the word "nine" (table 1).

            Relative Count
Speaker    NINE 1    NINE 2
spkr1       0.90      0.10
spkr2       0.80      0.20
spkr3       0.10      0.90
spkr4       0.20      0.80

Table 1: Relative counts of the different pronunciations of the word 'nine', invented data

2.3 Clustering Speakers

As a result of the elimination rounds constituted by our initial validation tests, we had fourteen distance matrices, thirteen showing distances between each pair of speakers for each of the final words, and one a Euclidean combination of all thirteen words. Inspection of the cluster plots for the individual words showed that, for some words, the clusterings differed fairly dramatically from the dialect partitioning that we were hoping for. These differences would probably repay further study. However, for the purposes at hand, we took only the two highest level clusters generated from the Euclidean pooled data for all thirteen words for the male subjects, and used them to partition the male speakers into a "northern" and a "southern" group. We did the same for the female speakers.

3 Experiments and Results

In order to evaluate the automatic data partitioning scheme, we carried out four sets of recognition experiments on the dialect database (described earlier). Speech data from a set of 24 speakers, evenly balanced by gender and region, was held out as a test set. The test utterances covered a wide range of sentences from our command and control application, and many of them did not contain any of our thirteen surviving shibboleth words.

The first training run involved pooling all the training data from all the speakers and building a dialect independent and gender independent acoustic model. In the second experiment, we built a gender dependent but dialect independent acoustic model. This acoustic model thus contained two sub-models, one for female speakers, and one for male speakers. In the third experiment, we trained a gender dependent and dialect dependent model containing the following four sub-models: male southern, male northern, female southern, female northern. In this experiment the dialect sub-models were created using the hand partitioned data. That is, we pooled together the data from Raleigh, Dallas and Atlanta to build the southern models, and that from Paramus, Boston and Detroit to build the northern models. Finally, we trained a gender and dialect dependent model, with the same number of sub-models as in the third experiment, but this time using the automatically partitioned training database to create the dialect sub-models.

Table 2 shows the sentence error rate for each experiment. Simply using gender dependent models results in a significant improvement in performance, confirming the well-established result that gender dependent modeling alone can obtain a large reduction in error rate compared to gender-independent models. Using gender and dialect dependent models built from both the hand and automatic partitioned dialect databases reduces the sentence error even further, with the automatically partitioned data giving almost twice the error reduction of the hand partitioned data.
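
The "almost twice" comparison can be checked with a few lines of arithmetic on the Table 2 error rates. The sketch below assumes that the error reduction from dialect modeling is measured against the gender dependent, dialect independent model (experiment 2); that baseline is our reading of the comparison, and the dictionary keys and the helper name are hypothetical labels.

```python
# Percent sentence error per experiment, copied from Table 2.
ERROR = {
    "independent": 7.8,            # experiment 1
    "gender": 5.6,                 # experiment 2
    "gender+dialect (hand)": 5.1,  # experiment 3
    "gender+dialect (auto)": 4.7,  # experiment 4
}


def reduction(baseline, system):
    """Return (absolute, relative) error reduction of system over baseline."""
    absolute = ERROR[baseline] - ERROR[system]
    return absolute, absolute / ERROR[baseline]


gender_abs, gender_rel = reduction("independent", "gender")
hand_abs, hand_rel = reduction("gender", "gender+dialect (hand)")
auto_abs, auto_rel = reduction("gender", "gender+dialect (auto)")
ratio = auto_abs / hand_abs  # automatic versus hand partitioning

print(f"gender modeling:   {gender_abs:.1f} points ({gender_rel:.1%} relative)")
print(f"hand partitioning: {hand_abs:.1f} points ({hand_rel:.1%} relative)")
print(f"auto partitioning: {auto_abs:.1f} points ({auto_rel:.1%} relative)")
print(f"auto / hand ratio: {ratio:.2f}")
```

Under this reading the automatic partitioning removes about 0.9 points of sentence error against about 0.5 points for the hand partitioning, a ratio of roughly 1.8.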

Exp.  Model Set                                                      Percent Error
1     Gender & Dialect Independent                                       7.8
2     Gender Dependent & Dialect Independent                             5.6
3     Gender Dependent & Dialect Dependent (hand-partitioned)            5.1
4     Gender Dependent & Dialect Dependent (automatic-partitioned)       4.7

Table 2: Sentence error for different model sets

4 Conclusions

In this paper we have described a fairly simple, yet powerful method for partitioning a training corpus on the basis of pronunciation similarities. Our experimental results show that the models built from the automatically partitioned data yield better performance than the models built from the hand partitioned data. This improvement in accuracy leads us to believe that a recognition-based scheme produces a grouping of similar speakers that is more accurate from the recognizer's point of view.

Pseudo-automatic partitioning schemes like the one we have presented here are very useful for partitioning speech data that does not have accompanying detailed demographic information. These sorts of methods may allow more efficient and effective management of data that is collected in unsupervised scenarios (live system data logging, etc.).

5 Further Work

We need to explore further what properties of the speech tokens controlled the clustering of those words whose clusterings appear to be irrelevant to the dialect groups. It is not at all clear that clusters that seem appropriate to the human listener are equally appropriate to the recognizer, and establishing the features underlying these clusters might lead to a better method of segregating speakers into multiple models.

It would also be interesting to see how far our approach could be pushed by deliberately including words known to distinguish between the dialects of interest. The TIMIT database provides an obvious opportunity to test this suggestion.

It will also be interesting to see how well our method works when applied to dialects in other languages. Dialect differences are relatively minor in the US, at least compared with those found elsewhere. The need for methods to classify speakers by dialect will be much more important when the attempt is made to write applications for countries where the dialectal differences are more pronounced, such as England or Germany, to name a couple at random.

References

[1] V. Beattie, S. Edmondson, D. Miller, Y. Patel, and G. Talvola. "An Integrated Multi-Dialect Speech Recognition System with Optional Speaker Adaptation". In Proceedings of Eurospeech, page 1123, 1995.

[2] Y. Chow, M. Dunham, O. Kimball, M. Krasner, G.F. Kubala, J. Makhoul, P. Price, S. Roucos, and R. Schwartz. "BYBLOS: The BBN Continuous Speech Recognition System". In Proceedings of ICASSP, page 89, 1987.

[3] M. Cohen, G. Baldwin, J. Bernstein, .Murveit, and M. Weintraub. "Studies for an adaptive recognition lexicon". Proceedings of the DARPA Speech Recognition Workshop, Report No. SAIC-87/1644, San Diego, March 1987.

[4] D. Van Compernolle, J. Smolders, P. Jaspers, and T. Hellemans. "Speaker Clustering for Dialectic Robustness in Speaker Independent Recognition". In Proceedings of Eurospeech, page 723, 1991.

[5] Herbert Gish and Michael Schmidt. "Text Independent Speaker Identification". IEEE Signal Processing Magazine, 1994.

[6] Bishnu S. Atal. "Automatic Recognition of Speakers from their Voices". Proceedings of the IEEE, Vol. 64, No. 4, April 1976.

[7] Charles K Thomas. Phonetics of American English. NY: Ronald Press, 1958.