THE USE OF SHIBBOLETH WORDS FOR
AUTOMATICALLY CLASSIFYING SPEAKERS BY
DIALECT
A.W.F. Huggins ([email protected])
BBN Hark Systems, 70 Fawcett St., Cambridge, MA 02138, USA
Yogen Patel (yogen@speech.com)
PureSpeech, 100 Cambridge Park Drive, Cambridge, MA 02140, USA
Abstract

Real-world applications using speech recognition must perform well over a range of dialects. Differences in dialect between the speakers in the training database and the target users often lead to degraded recognition performance. For the BBN Hark Hidden Markov Model (HMM) based system, we have already developed a reasonably effective technique [1] for dealing with multiple US dialects. The solution involves building separate HMM sets for each dialect from representative training speech data. This requires that training speakers be accurately classified by dialect, which is difficult to do reliably even by hand. In this paper we describe a recognition based pseudo-automatic scheme for partitioning a pool of US English training speakers into groups, such that the speakers within each group share the same pronunciation characteristics. Our scheme is speech-data driven, and involves using transcript-level word hypotheses generated by a recognizer to partition the pool of training speakers.

1 Introduction

We have implemented a multi-dialect recognition system [1] for a speaker independent, 200 word, command and control application. The "commands" are sentences which are typically connected sequences of four or five words. The underlying speech recognition engine used in our work evolved from BBN's Byblos system [2], and it uses discrete probability density functions to model triphones. The original multi-dialect system was trained on a multi-dialect speech corpus that was partitioned into dialect groups based on where the speech data was collected. Limited listening tests were carried out by a dialect expert to confirm that the location-based partitioning was reasonable.

The development of the pseudo-automatic scheme for partitioning this speech corpus was driven by two major objectives. First, we felt that such a scheme could lead to improved accuracy of the multi-dialect system since we would be able to group training speakers together more accurately based on actual speaking characteristics. Second, we felt that a generic automatic data driven scheme would allow us to partition data collected in the future more easily, as we would not have to maintain or collect detailed demographic information for each speaker in the corpus.

There has been some previous work on dialect partitioning. Cohen et al. [3] studied the two sentences that were included in the TIMIT database to capture dialectal differences (the "shibboleth" sentences). They defined eighteen segments whose alternative pronunciations they thought they could transcribe reliably, either visually from spectrograms or by ear, in the tokens of these sentences as spoken by their 630 different speakers. They found that the alternative pronunciations a speaker used for the various segments were not independent, but formed clusters, and that these clusters were also dependent on the dialect of the speaker, based on where the speaker lived from age 2 to 10. They estimated that a useful reduction would occur in the entropy of the phonological model, and suggested this could lead to improved recognition accuracy. However, they did not test this prediction, and the method was not automatic. Furthermore, only a few of the segments studied were vowels, which are usually the most salient cues to dialect for human listeners, and also, we believe, for some recognition systems.

A study by Van Compernolle et al. [4] reported an attempt to automatically cluster speakers of Dutch and Flemish by dialect and by gender. "Dutch and Flemish are identical written languages, [but] pronunciation differences are large..". First, they demonstrated substantially improved performance
from using multiple baseforms, derived from hand-classification of their 1000 speakers. Their method for automatic clustering involved (1) training two baseforms for each word (digits), based on an initial split of the training set, (2) doing recognition on each utterance in the training set, and assigning it to the class where it was best recognized, (3) retraining the models based on the new classification, and (4) iterating. Although this method produced improved performance on the training set, the improvement failed to carry over to the test set. The method we report on here is very similar to this; the main difference is that we used performance on a small number of words, hand-selected to discriminate between dialects, to classify speakers according to dialect, and then trained separate models for the two classes of speakers. We believe the reason their method failed was that a few useful discriminants were swamped by a larger number of cases where the main differences were due to other irrelevant aspects of the speech tokens.

2 Corpus Partitioning

We collected application-domain dependent speech data from approximately 450 adult speakers in three southern cities (Atlanta, Dallas, and Raleigh, North Carolina) and three northern cities (Paramus, New Jersey; Detroit; and Boston). In addition, we included a few more "northern" and "southern" speakers, whose data, although collected separately, consisted of essentially the same vocabulary. The speech data was recorded using a close talking microphone at a sampling frequency of 16 kHz.

Our target system required developing the following four model sets: southern male, northern male, southern female, and northern female. We manually split the corpus by gender since we wanted to focus only on developing methods to automatically partition speakers into sub-groups within each of the separate gender groups. There are already well established methods for gender recognition based on variants of speaker recognition techniques [5][6], and it was not our goal to validate these methods or improve upon them.

2.1 Overview of Method

A major uncertainty in making this technique work was whether an appropriate set of shibboleth words was available in our corpus. The speech data collection mentioned above took place before we had even considered developing an automatic dialect partitioning scheme. Thus, although there are some classic words and phrases that highlight dialectal differences in American English [7], we did not have the benefit of being able to include these words and phrases in the data collection. We had to pick a set of shibboleth words from the existing vocabulary.

We selected the 52 highest frequency words in our corpus, and generated several different phonetic transcriptions for each word. When generating the phonetic transcriptions, we attempted to capture the potential range of phonetic variation across all the speakers in the training corpus, but, of course, we were constrained by our existing phoneme set, whose main purpose was to capture phonemic regularity rather than phonetic variability. Each word typically had between four and twelve different transcriptions. For each speaker, the sentences that contained these words were passed through a grammar-constrained HMM phoneme recognizer. We were unable to use triphone models, which would have been preferable, because many of the alternative phonetic transcriptions contained triphones for which there was little or no training, and this would have resulted in strong biasing of the recognition path selected. The HMM phoneme models in the phoneme recognizer were previously generated using data drawn from a different 'uniform speaker' corpus. Relative counts were accumulated for each pronunciation of each of the shibboleth words by looking at the resulting decoded phoneme sequences for each of the different spoken words.

To convert the word counts into a distance measure, we calculated a chi-square value for each pair of speakers for each of the words, corrected it for degrees of freedom, and took the square root. We also combined the separate speaker*speaker distance matrices for each word into a euclidean pooled matrix. The resulting distance matrices were analyzed by hierarchical clustering to classify the speakers.

As an initial validation of our approach, we ran the same analysis on smaller matrices in which all the speakers were pooled according to where their data was collected, and confirmed that the clustering obtained reflected our expectations in terms of dialects. We did this three times as a result of inspection of the clustering plots. After the first pass, we removed all but the thirteen most frequent words (the digits 1-9, "oh" and "zero", plus "point" and "phone"). After the second pass, we removed all pronunciation alternatives that received low counts. For most words, this left only two alternatives.

2.2 Computing Word Counts

As mentioned before, for each speaker we carried out grammar constrained phonetic recognition on those sentences that contained the shibboleth words. The different pronunciations of each of the shibboleth words were represented by tagged multiple phoneme paths in the grammar. For each sentence the recognizer generated a word level transcript, where each of the shibboleth words had a 'tag' or 'identifier' attached to it indicating the best match pronunciation.

A simple example may help to make this clearer. Consider a set of 4 male speakers, and consider the word "nine" as the shibboleth word to be used in grouping the speakers. Let us assume that the range of potential pronunciations of the word "nine" is covered by the following two alternate phonetic spellings,

NINE: n-ay-n OR n-aa-n.

[Fig 2.1: grammar with two parallel phoneme paths, N AA N tagged NINE 1 and N AY N tagged NINE 2]

As shown in figure 2.1, the grammar will contain two paths, each corresponding to one of the phonetic spellings. All the sentences containing the word "nine" are then passed through a phoneme recognizer where the recognition path is constrained by this grammar. By following the highest scoring decoded phoneme sequence for each sentence, we are able to compile a table of relative counts of the occurrences of the different forms of the word "nine" (table 1).

           Relative Count
Speaker    NINE 1    NINE 2
spkr1      0.90      0.10
spkr2      0.80      0.20
spkr3      0.10      0.90
spkr4      0.20      0.80

Table 1: Relative counts of the different pronunciations of the word 'nine', invented data

2.3 Clustering Speakers

As a result of the elimination rounds constituted by our initial validation tests, we had fourteen distance matrices, thirteen showing distances between each pair of speakers for each of the final words, and one a euclidean combination of all thirteen words. Inspection of the cluster plots for the individual words showed that, for some words, the clusterings differed fairly dramatically from the dialect partitioning that we were hoping for. These differences would probably repay further study. However, for the purposes at hand, we took only the two highest level clusters generated from the euclidean pooled data for all thirteen words for the male subjects, and used them to partition the male speakers into a "northern" and a "southern" group. We did the same for the female speakers.

3 Experiments and Results

In order to evaluate the automatic data partitioning scheme, we carried out four sets of recognition experiments on the dialect database (described earlier). Speech data from a set of 24 speakers, evenly balanced by gender and region, was held out as a test set. The test utterances covered a wide range of sentences from our command and control application, and many of them did not contain any of our thirteen surviving shibboleth words.

The first training run involved pooling all the training data from all the speakers and building a dialect independent and gender independent acoustic model. In the second experiment, we built a gender dependent but dialect independent acoustic model. This acoustic model thus contained two sub-models, one for female speakers, and one for male speakers. In the third experiment, we trained a gender dependent and dialect dependent model containing the following four sub-models: male southern, male northern, female southern, female northern. In this experiment the dialect sub-models were created using the hand partitioned data. That is, we pooled together the data from Raleigh, Dallas and Atlanta to build the southern models, and that from Paramus, Boston and Detroit to build the northern models. Finally, we trained a gender and dialect dependent model, with the same number of sub-models as in the third experiment, but this time using the automatically partitioned training database to create the dialect sub-models.

Table 2 shows the sentence error for each experiment. Simply using gender dependent models results in a significant improvement in performance, confirming the well-established result that gender dependent modeling alone can obtain a large reduction in error rate compared to gender-independent models. Using gender and dialect dependent models built from either the hand partitioned or the automatically partitioned dialect databases reduces the sentence error even further, with the automatically partitioned data giving almost twice the error reduction of the hand partitioned data (from 5.6% to 4.7%, versus 5.6% to 5.1%).

4 Conclusions

In this paper we have described a fairly simple, yet
powerful method for partitioning a training corpus on the basis of pronunciation similarities. Our experimental results show that the models built from the automatically partitioned data yield better performance than the models built from the hand partitioned data. This improvement in accuracy leads us to believe that a recognition-based scheme produces a grouping of similar speakers that is more accurate from the recognizer's point of view.

Exp.  Model Set                                                      Percent Error
1     Gender & Dialect Independent                                   7.8
2     Gender Dependent & Dialect Independent                         5.6
3     Gender Dependent & Dialect Dependent (hand-partitioned)        5.1
4     Gender Dependent & Dialect Dependent (automatic-partitioned)   4.7

Table 2: Sentence error for different model sets

Pseudo-automatic partitioning schemes like the one we have presented here are very useful for partitioning speech data that does not have accompanying detailed demographic information. These sorts of methods may allow more efficient and effective management of data that is collected in unsupervised scenarios (live system data logging, etc.).

5 Further Work

We need to explore further what properties of the speech tokens controlled the clustering of those words whose clusterings appear to be irrelevant to the dialect groups. It is not at all clear that clusters that seem appropriate to the human listener are equally appropriate to the recognizer, and establishing the features underlying these clusters might lead to a better method of segregating speakers into multiple models.

It would also be interesting to see how far our approach could be pushed by deliberately including words known to distinguish between the dialects of interest. The TIMIT database provides an obvious opportunity to test this suggestion.

It will also be interesting to see how well our method works when applied to dialects in other languages. Dialect differences are relatively minor in the US, at least compared with those found elsewhere. The need for methods to classify speakers by dialect will be much more important when the attempt is made to write applications for countries where the dialectal differences are more pronounced, such as England or Germany, to name a couple at random.

References

[1] V.Beattie, S.Edmondson, D.Miller, Y.Patel, and G.Talvola. "An Integrated Multi-Dialect Speech Recognition System with Optional Speaker Adaptation". In Proceedings of Eurospeech, page 1123, 1995.

[2] Y.Chow, M.Dunham, O.Kimball, M.Krasner, G.F.Kubala, J.Makhoul, P.Price, S.Roucos, and R.Schwartz. "BYBLOS: The BBN Continuous Speech Recognition System". In Proceedings of ICASSP, page 89, 1987.

[3] M.Cohen, G.Baldwin, J.Bernstein, H.Murveit, and M.Weintraub. "Studies for an adaptive recognition lexicon". Proceedings of the DARPA Speech Recognition Workshop, Report No. SAIC-87/1644, San Diego, March 1987.

[4] D.Van Compernolle, J.Smolders, P.Jaspers, and T.Hellemans. "Speaker Clustering for Dialectic Robustness in Speaker Independent Recognition". In Proceedings of Eurospeech, page 723, 1991.

[5] Herbert Gish and Michael Schmidt. "Text Independent Speaker Identification". IEEE Signal Processing Magazine, 1994.

[6] Bishnu S. Atal. "Automatic Recognition of Speakers from their Voices". Proceedings of the IEEE, Vol. 64, No. 4, April 1976.

[7] Charles K Thomas. Phonetics of American English. NY: Ronald Press, 1958.
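
As an illustration of the distance and clustering steps described in Sections 2.1 and 2.3, the following is a minimal sketch and not the implementation used for the experiments. Several details are assumptions made only for the example: raw counts are taken as 10 tokens per speaker (matching the invented relative counts of Table 1), "corrected for degrees of freedom" is read as dividing the chi-square value by the number of pronunciations minus one, the euclidean pooling is read as the square root of the sum of squared per-word distances, and average linkage is used for the hierarchical clustering, which the text does not specify. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def chi_square_distance(a, b):
    """sqrt(chi-square / df) for the 2 x K table of two speakers' counts.

    Dividing by df = K - 1 is an assumed reading of the df correction.
    """
    total_a, total_b = sum(a), sum(b)
    if total_a == 0 or total_b == 0:
        return 0.0  # no evidence for one speaker: treat as no distance
    grand = total_a + total_b
    chi2, used = 0.0, 0
    for ca, cb in zip(a, b):
        col = ca + cb
        if col == 0:
            continue  # pronunciation never chosen for either speaker
        used += 1
        ea = total_a * col / grand  # expected count, speaker a
        eb = total_b * col / grand  # expected count, speaker b
        chi2 += (ca - ea) ** 2 / ea + (cb - eb) ** 2 / eb
    return sqrt(chi2 / max(used - 1, 1))


def pooled_distances(counts):
    """counts[word][speaker] is a list of counts, one per pronunciation.

    Returns the speaker list and the euclidean pooling of per-word distances.
    """
    speakers = sorted(next(iter(counts.values())))
    dist = {}
    for s, t in combinations(speakers, 2):
        squared = sum(chi_square_distance(w[s], w[t]) ** 2
                      for w in counts.values())
        dist[frozenset((s, t))] = sqrt(squared)
    return speakers, dist


def two_clusters(speakers, dist):
    """Average-linkage agglomerative clustering, stopped at two clusters."""
    clusters = [[s] for s in speakers]

    def linkage(c1, c2):
        total = sum(dist[frozenset((s, t))] for s in c1 for t in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 2:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return [sorted(c) for c in clusters]


# Invented counts, consistent with Table 1: [NINE 1, NINE 2] out of 10 tokens.
counts = {
    "NINE": {
        "spkr1": [9, 1],
        "spkr2": [8, 2],
        "spkr3": [1, 9],
        "spkr4": [2, 8],
    }
}
speakers, dist = pooled_distances(counts)
groups = two_clusters(speakers, dist)
print(groups)  # [['spkr1', 'spkr2'], ['spkr3', 'spkr4']]
```

With these invented counts, spkr1 and spkr2 fall in one group and spkr3 and spkr4 in the other, mirroring the two-way split taken from the top of the cluster tree in Section 2.3. Adding more words to `counts` would pool their per-word distances in the same way.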