ON THE INTEGRATION OF DIALECT AND SPEAKER
ADAPTATION IN A MULTI-DIALECT SPEECH
RECOGNITION SYSTEM
V. Digalakis V. Doumpiotis S. Tsakalidis
Dept. of Electronics & Computer Engineering
Technical University of Crete
73100 Hania, Crete, GREECE
fvas,doump,[email protected]
ABSTRACT

The recognition accuracy of recent Automatic Speech Recognition (ASR)
systems has proven to be highly related to the correlation of the
training and testing conditions. Several adaptation approaches have
been proposed in an effort to improve speech recognition performance,
and have typically been applied to the speaker- and channel-adaptation
tasks. We have shown in the past that a mismatch in dialects between
the training and testing speakers significantly influences the
recognition accuracy, and we have used adaptation to compensate for
this mismatch. The dialect of the speaker needs to be identified in a
dialect-specific system, and in this paper we present results in this
area. To achieve further improvement in recognition performance, we
combine dialect- and speaker-adaptation.

1 INTRODUCTION

Performance in large-vocabulary continuous-speech recognition degrades
dramatically if a mismatch exists between the training and testing
conditions, such as a different channel, accent or speaker's voice
characteristics. Several speaker adaptation techniques have recently
been proposed to improve the performance and robustness of speech
recognition systems. These techniques include transformation-based
adaptation in the model space [1, 2, 3], Bayesian adaptation [4, 5],
or combined approaches [6].

In this paper, we consider the dialect issue on a speaker-independent
(SI) speech recognition system and the adaptation of a dialect-specific
system to individual speakers. Based on the Swedish language corpus
collected by Telia, we have developed a Swedish multi-dialect SI speech
recognition system which requires only a small amount of
dialect-dependent data. This recognizer is part of a bidirectional
speech translation system between English and Swedish that has been
developed under the SRI-Telia Research Spoken Language Translator
project [7]. We have found in the past that the recognition performance
of a speaker-independent system trained on a large amount of training
data from the Stockholm dialect decreases dramatically when tested on
speakers of another Swedish dialect, namely from the Scania region.

To improve the performance of the SI system for speakers of dialects
for which minimal amounts of training data are available, we use
dialect adaptation techniques. We first identify the dialect of the
speaker, and then use a dialect-specific system which has been trained
using adaptation techniques with a small amount of data from the target
dialect. This dialect-specific system can be further adapted to the
speaker, providing additional improvement in recognition, and we show
the effect the seed system has on the final speaker-adapted recognition
performance.

2 DIALECT AND SPEAKER ADAPTATION METHODS

The SI speech recognition system for a specific dialect is modeled with
continuous mixture-density hidden Markov models (HMMs) that use a large
number of Gaussian mixtures [8]. The component mixtures of each
Gaussian codebook (genone) are shared across clusters of HMM states,
and hence the observation densities of the vector process $x_t$ have
the form:

$$P_{SI}(x_t \mid s_t) = \sum_{i=1}^{N_\omega} p(\omega_i \mid s_t)\, N(x_t; m_{ig}, S_{ig}),$$

where $s_t$ is the HMM state, $x_t$ is the spectral feature obtained
from the recognizer front-end at frame $t$, and $g$ is the genone index
used by the HMM state $s_t$.

These models need large amounts of training data for robust estimation
of their parameters. Since the amount of available training data for
some dialects of our database is small, the development of
dialect-specific SI models is not a robust solution. Alternatively, an
initial SI recognition system trained on some seed dialects can be
adapted to match a specific target dialect, in which case the adapted
system utilizes knowledge obtained from the seed dialects. We choose to
apply algorithms that we have previously developed and applied to the
problem of speaker adaptation, since in our problem there are
consistent differences in the pronunciation
between the different dialects that we examine. The adaptation process
is performed by jointly transforming all the Gaussians of each genone,
and by combining transformation and Bayesian techniques.

Using the adaptation method proposed in [1], we assume that the
dialect-adapted (DA) observation density of the HMM state $s_t$ for
dialect $D$ can be obtained from the corresponding density of the
seed-dialect system:

$$P_{DA}(x_t \mid s_t, D) = \sum_{i=1}^{N_\omega} p(\omega_i \mid s_t)\, N(x_t; m_{ig}(D), S_{ig}(D))$$
$$= \sum_{i=1}^{N_\omega} p(\omega_i \mid s_t)\, N\big(x_t;\, A_g(D)\, m_{ig} + b_g(D),\, A_g(D)\, S_{ig}\, A_g(D)^T\big).$$

Adaptation is equivalent to estimation of the parameters $A_g(D)$,
$b_g(D)$, $g = 1, \ldots, N_g$. $N_g$ denotes the number of
transformations for the whole set of genones. The parameter estimation
process is performed using the EM algorithm [9].

In a similar manner, a speaker-adapted (SA) system for a particular
speaker $S$ can be obtained using the same transformation method and
the dialect-adapted system as a seed model:

$$P_{SA}(x_t \mid s_t, S) = \sum_{i=1}^{N_\omega} p(\omega_i \mid s_t)\, N\big(x_t;\, A_g(S)\, m_{ig}(D) + b_g(S),\, A_g(S)\, S_{ig}(D)\, A_g(S)^T\big),$$

where $D$ is the dialect of the speaker. In our experiments we assume
the matrix $A_g$ is diagonal [1].

3 DIALECT IDENTIFICATION

To use the correct system for a particular speaker, his or her dialect
must first be identified. One alternative is to run multiple
dialect-specific systems in parallel and select the system which
maximizes the a posteriori probability of the dialect given the speaker
data. Assuming that all dialects are equally likely, the identified
dialect $\hat{D}$ is simply the one that maximizes the likelihood of
the data using the corresponding dialect-specific system. If a Viterbi
recognizer is used, we can approximate the summation over all possible
state sequences by maximizing the joint likelihood of the observed
spectral features $X = [x_1, x_2, \ldots, x_T]$ and the most likely
state sequence $S = [s_1, s_2, \ldots, s_T]$ over all possible dialects:

$$\hat{D} = \arg\max_D \max_S\, p(X, S \mid D),$$

where

$$p(X, S \mid D) = \prod_{t=1}^{T} p(s_t \mid s_{t-1})\, P_{DA}(x_t \mid s_t, D).$$

This maximization can be achieved by running in parallel one Viterbi
decoder for each dialect, and at the end selecting the dialect of the
recognizer with the highest likelihood as the identified one.

An alternative method that we examined is to base the dialect
identification solely on the observation density likelihoods:

$$\hat{D} = \arg\max_D \prod_{t=1}^{T} P_{DA}(x_t \mid s_t, D),$$

using again the states $s_t$ of the most likely state sequence.

4 EXPERIMENTS

The adaptation experiments were carried out using a multi-dialect
Swedish speech database collected by Telia. The core of the database
was recorded in Stockholm using more than 100 speakers. Several other
dialects are currently being recorded across Sweden. The corpus
consists of subjects reading various prompts organized in sections. The
sections include a set of phonetically balanced common sentences for
all the speakers, a set of sentences translated from the English Air
Travel Information System (ATIS) domain, and a set of newspaper
sentences.

For our adaptation experiments we used data from the Stockholm and
Scanian dialects, which were, respectively, the seed and target
dialects. The Scanian dialect was chosen for the initial experiments
because it is one of three that are clearly different from the
Stockholm dialect. The main differences between the dialects are that
the long (tense) vowels become diphthongs in the Scanian dialect, and
that the usual supra-dental /r/-sound becomes uvular. In the Stockholm
dialect, a combination of /r/ with one of the dental consonants /n/,
/d/, /t/, /s/ or /l/ results in supradentalization of these consonants
and a deletion of the /r/. In the Scanian dialect, since the /r/-sound
is different, this does not happen. There are also prosodic
differences.

There are a total of 40 speakers of the Scanian dialect, both male and
female, and each of them recorded more than 40 sentences. We selected 8
of the speakers (half of them male) to serve as testing data, and the
rest composed the adaptation/training data with a total of 3814
sentences. Experiments were carried out using SRI's DECIPHER(TM) system
[8]. The system's front-end was configured to output 12 cepstral
coefficients, cepstral energy and their first and second derivatives.
The cepstral features are computed with a fast Fourier transform (FFT)
filterbank, and subsequent cepstral-mean normalization is performed on
a sentence basis. We used genonic HMMs with arbitrary degree of
Gaussian sharing across different HMM states [8].

The SI continuous HMM system which served as seed models for our
adaptation scheme was trained on approximately 21000 sentences of
Stockholm dialect. The recognizer is configured so that it runs in real
time on
a Sun Sparc Ultra-1 workstation. The system's recognition performance
on an air travel information task similar to the English ATIS one was
benchmarked at an 8.9% word-error rate using a bigram language model
when tested on Stockholm speakers. On the other hand, its performance
degraded significantly, reaching a word-error rate of 25.08% when
tested on the Scanian-dialect testing set. The degradation in
performance was uniform across the various speakers in the test set,
suggesting that there are consistent differences across the two
dialects. In our previous work [10], we adapted the Stockholm system to
the Scania dialect, and we achieved a word-error rate of 6.9% using
only a few hundred sentences from six speakers.

When the multi-dialect system is used in dialect-independent mode, the
dialect of the speaker must be identified. In Table 1 we present the
recognition results of the seed Stockholm and Scania adapted systems on
test sets from the Stockholm and Scanian dialects. When the dialect of
the speaker is known, the system matched to the testing conditions
performs well and achieves a word-error rate of 8.3%. In multi-dialect
mode, however, the dialect must be identified automatically. Using the
dialect identification methods that we mentioned above, we were able to
identify the correct dialect 96.75% of the time using the recognizer
likelihood, and only 79.25% using the observation-density likelihood.
As a result, an automatic configuration where the two recognizers were
running in parallel, and the hypothesis of the one with the highest
likelihood was adopted, achieved a word-error rate of 8.7%, very close
to the lower bound performance of the known-dialect case, which is
8.3%.

                                       Test set
System Description           Stockholm   Scanian   Total
Stockholm Trained               10.3       25.1     18.0
Scania Adapted                  13.2        6.9     10.0
Known Dialect                   10.3        6.9      8.3
Dialect Identification:
  Recognizer likelihood         10.3        7.3      8.7
  Observation Probability       11.2        7.8      9.4

Table 1: Word-error rates (%) for Dialect-dependent, Cross-dialect,
Known-dialect and automatic dialect identification methods.

Once the dialect of the speaker is known, the system can be further
adapted to the speaker. In Figure 1 we present speaker-adaptation
results on speakers of the Scanian dialect for different amounts of
speaker-adaptation data and for different seed systems. Specifically,
we adapt to speakers of the Scanian dialect three systems: the original
Stockholm speaker-independent system, and two systems that were first
adapted to the Scanian dialect using two hundred and five hundred
sentences from other Scanian speakers. We see that using even such a
small amount of dialect-dependent data accelerates the convergence of
the speaker-adaptation process significantly. For example, the
dialect-adapted systems using 40 sentences of the speaker achieve an
adapted performance of around 7.0%, whereas the original Stockholm
system has an adapted word-error rate of 14.0% when the same 40
adaptation sentences are used.

[Figure 1 plot: word error rate versus number of adaptation sentences
(0 to 120) for three seed systems: (1) Stockholm system; (2) dialect
adapted, 200 sentences; (3) dialect adapted, 500 sentences.]

Figure 1: Speaker adaptation results using different seed systems.

5 CONCLUSIONS

In this paper we have shown that highly accurate dialect identification
is possible using the observation-density likelihood of hidden Markov
models. Identifying the dialect of the speaker is important in
multi-dialect applications, where there are significant differences
between the dialects and the recognizer can benefit from knowing the
dialect of the speaker. In addition, knowledge of the dialect is useful
in order to bootstrap the speaker-adaptation process using seed models
matched to the speaker's dialect. We have shown that the adaptation
process can be accelerated significantly when dialect-specific models
are used.

ACKNOWLEDGMENTS

The work we have described was accomplished under contract to Telia
Research.

References

[1] V. Digalakis, D. Rtischev and L. Neumeyer, "Speaker Adaptation
    Using Constrained Reestimation of Gaussian Mixtures," IEEE
    Transactions on Speech and Audio Processing, pp. 357-366,
    September 1995.
[2] L. Neumeyer, A. Sankar and V. Digalakis, "A Comparative Study of
    Speaker Adaptation Techniques," Proceedings of European Conference
    on Speech Communication and Technology, pp. 1127-1130, Madrid,
    Spain, 1995.
[3] C. J. Leggetter and P. C. Woodland, "Maximum Likelihood Linear
    Regression for Speaker Adaptation of Continuous Density Hidden
    Markov Models," Computer Speech and Language, pp. 171-185, 1995.
[4] C.-H. Lee, C.-H. Lin and B.-H. Juang, "A Study on Speaker
    Adaptation of the Parameters of Continuous Density Hidden Markov
    Models," IEEE Trans. on Acoust., Speech and Signal Proc., Vol.
    ASSP-39(4), pp. 806-814, April 1991.
[5] J.-L. Gauvain and C.-H. Lee, "Maximum a Posteriori Estimation for
    Multivariate Gaussian Observations of Markov Chains," IEEE
    Transactions on Speech and Audio Processing, pp. 291-298, April
    1994.
[6] V. Digalakis and L. Neumeyer, "Speaker Adaptation Using Combined
    Transformation and Bayesian Methods," IEEE Transactions on Speech
    and Audio Processing, June 1996.
[7] M. Rayner, I. Bretan, D. Carter, M. Collins, V. Digalakis,
    B. Gamback, J. Kaja, J. Karlgren, B. Lyberg, P. Price, S. Pulman
    and C. Samuelsson, "Spoken Language Translation with Mid-90's
    Technology: A Case Study," Proc. Eurospeech '93, Berlin, 1993.
[8] V. Digalakis, P. Monaco and H. Murveit, "Genones: Generalized
    Mixture Tying in Continuous Hidden Markov Model-Based Speech
    Recognizers," IEEE Transactions on Speech and Audio Processing,
    June 1996.
[9] A. P. Dempster, N. M. Laird and D. B. Rubin, "Maximum Likelihood
    Estimation from Incomplete Data," Journal of the Royal Statistical
    Society (B), Vol. 39, No. 1, pp. 1-38, 1977.
[10] V. Diakoloukas, V. Digalakis, L. Neumeyer and J. Kaja,
    "Development of Dialect-Specific Speech Recognizers Using
    Adaptation Methods," Proc. Int'l. Conf. on Acoust., Speech and
    Signal Processing, Munich, 1997.
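
As an illustration of Sections 2 and 3, the following is a minimal Python sketch and not the DECIPHER implementation. It shows how a diagonal transformation (A, b) maps the Gaussians of a seed genone to a target dialect, and how the observation-density criterion then picks the dialect with the highest summed log-likelihood along a given state sequence. The sketch assumes diagonal covariances, a single genone shared by all states, a state sequence already produced by a decoder, and transform parameters that have already been estimated (the EM estimation of [1] is not shown). All function names, variable names and toy numbers are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance `var` at point x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_mixture(x, weights, means, variances):
    """log sum_i p(w_i | s) N(x; m_i, S_i), evaluated with log-sum-exp."""
    terms = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def adapt_genone(means, variances, a, b):
    """Map each Gaussian (m, S) of a genone to (A m + b, A S A^T).

    A is diagonal and given by the vector `a`, so each variance scales by a_k^2.
    """
    new_means = [[ak * mk + bk for ak, mk, bk in zip(a, m, b)] for m in means]
    new_vars = [[ak * ak * vk for ak, vk in zip(a, v)] for v in variances]
    return new_means, new_vars


def identify_dialect(frames, states, state_weights, genones):
    """Return (argmax_D sum_t log P_DA(x_t | s_t, D), per-dialect scores).

    `genones` maps a dialect name to the (means, variances) of its adapted
    genone; `states` is the state sequence assumed to come from a decoder.
    """
    scores = {}
    for dialect, (means, variances) in genones.items():
        scores[dialect] = sum(
            log_mixture(x, state_weights[s], means, variances)
            for x, s in zip(frames, states)
        )
    return max(scores, key=scores.get), scores


# Toy seed genone: two 2-D Gaussians with unit variances (made-up numbers).
SEED_MEANS = [[0.0, 0.0], [2.0, 2.0]]
SEED_VARS = [[1.0, 1.0], [1.0, 1.0]]
STATE_WEIGHTS = {"s1": [0.7, 0.3], "s2": [0.2, 0.8]}

# Hypothetical, already-estimated transform for the target dialect.
GENONES = {
    "stockholm": (SEED_MEANS, SEED_VARS),
    "scanian": adapt_genone(SEED_MEANS, SEED_VARS, a=[1.0, 1.0], b=[3.0, 3.0]),
}

# Toy frames that lie close to the shifted (adapted) means.
FRAMES = [[3.1, 2.9], [4.8, 5.2], [3.0, 3.3]]
STATES = ["s1", "s2", "s1"]
best, scores = identify_dialect(FRAMES, STATES, STATE_WEIGHTS, GENONES)
print(best, scores)
```

Working in the log domain turns the product over frames into a sum and avoids numerical underflow on long utterances. The recognizer-likelihood criterion of Section 3 would additionally include the transition terms p(s_t | s_{t-1}) and a separate Viterbi state sequence per dialect, which this sketch leaves out.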