ON THE INTEGRATION OF AND SPEAKER

ADAPTATION IN A MULTI-DIALECT SPEECH

RECOGNITION SYSTEM

V. Digalakis V. Doumpiotis S. Tsakalidis

Dept. of Electronics & Computer Engineering

Technical University of Crete

73100 Hania, Crete, GREECE

fvas,doump,[email protected]

on sp eakers of another Swedish dialect, namely from ABSTRACT

the region.

The recognition accuracy in recent Automatic Sp eech

To improve the p erformance of the SI system for

Recognition (ASR) systems has proven to be highly

sp eakers of for which minimal amounts of train-

related to the correlation of the training and testing

ing data are available, we use dialect adaptation tech-

conditions. Several adaptation approaches have b een

niques. We rst identify the dialect of the sp eaker, and

prop osed in an e ort to improve the sp eech recogni-

then use a dialect-sp eci c system which has b een trained

tion p erformance, and have typically b een applied to

using adaptation techniques with a small amount of data

the sp eaker- and channel-adaptation tasks. We have

from the target dialect. This dialect-sp eci c system can

shown in the past that a mismatch in dialects b etween

b e further adapted to the sp eaker, providing additional

the training and testing sp eakers signi cantly in uences

improvement in recognition, and we show the e ect the

the recognition accuracy, and wehave used adaptation

seed system has in the nal sp eaker-adapted recognition

to comp ensate for this mismatch. The dialect of the

p erformance.

sp eaker needs to be identi ed in a dialect-sp eci c sys-

tem, and in this pap er we present results in this area.

2 DIALECT AND SPEAKER ADAPTATION

To achieve further improvement in recognition p erfor-

METHODS

mance, we combine dialect- and sp eaker-adaptation.

The SI sp eech recognition system for a sp eci c di-

1 INTRODUCTION

alect is mo deled with continuous mixture-density hid-

den Markov mo dels (HMM's) that use a large number

Performance in large-vo cabulary continuous-sp eech

of Gaussian mixtures [8]. The comp onent mixtures of

recognition degrades dramatically if a mismatch exists

each Gaussian co deb o ok (genone) are shared across clus-

between the training and testing conditions, such as

ters of HMM states, and hence the observation densities

di erent channel, accent or sp eaker's voice character-

of the vector pro cess x have the form:

t

istics. Several sp eaker adaptation techniques have b een

recently prop osed, to improve the p erformance and ro-

N

!

X

bustness of sp eech recognition systems. These tech-

P (x js )= p(! js )N(x ;m ;S );

SI t t i t t ig ig

niques include transformation based adaptation in the

i

mo del space [1, 2, 3], Bayesian adaptation [4, 5], or com-

where s is the HMM state, x is the sp ectral feature

bined approaches [6].

t t

obtained from the recognizer front-end at frame t, and

In this pap er, we consider the dialect issue on a

g is the genone index used by the HMM state s .

sp eaker-indep endent (SI) sp eech recognition system and

t

These mo dels need large amounts of training data the adaptation of a dialect-sp eci c system to individ-

for robust estimation of their parameters. Since the ual sp eakers. Based on the Swedish corpus

amount of available training data for some dialects of collected byTelia, wehave develop ed a Swedish multi-

our database is small, the development of dialect-sp eci c dialect SI sp eech recognition system which requires only

SI mo dels is not a robust solution. Alternatively, an ini- a small amount of dialect-dep endent data. This recog-

tial SI recognition system trained on some seed dialects nizer is part of a bidirectional sp eech translation system

can be adapted to match a sp eci c target dialect, in between English and Swedish that has b een develop ed

which case the adapted system utilizes knowledge ob- under the SRI-Telia ResearchSpoken Language Trans-

tained from the seed dialects. Wecho ose to apply algo- lator pro ject [7]. We have found in the past that the

rithms that we have previously develop ed and applied recognition p erformance of a sp eaker-indep endent sys-

to the problem of sp eaker adaptation, since in our prob- tem trained on a large amount of training data from the

lem there are consistent di erences in the pronunciation Sto ckholm dialect decreases dramatically when tested

between the di erent dialects that we examine. The selecting the dialect of the recognizer with the highest

adaptation pro cess is p erformed by jointly transform- likeliho o d as the identi ed one.

ing all the Gaussians of each genone, and by combining

An alternative metho d that we examined is to base

transformation and Bayesian techniques.

the dialect identi cation solely on the observation den-

Using the adaptation metho d prop osed in [1], we as-

sity likeliho o ds:

sume that the dialect-adapted (DA) observation density

T

Y

of the HMM state s for dialect D can b e obtained from

t



D = arg max P (x js ;D);

DA t t

the corresp onding density of the seed-dialect system:

D

t=1

N

!

X

using again the states s of the most likely state se-

t

P (x js ;D)= p(! js )N(x ;m (D );S (D ))

DA t t i t t ig ig

quence.

i

N

!

X

4 EXPERIMENTS

t

= p(! js )N (x ; A (D )m +b (D );A (D)S (A (D )) ):

i t t g ig g g ig g

The adaptation exp eriments were carried out using

i

a multi-dialect Swedish sp eech database collected by

Adaptation is equivalent to estimation of the parameters

Telia. The core of the database was recorded in Sto ck-

A (D );b (D);g = 1;:::;N . N denotes the number

g g g g

holm using more than 100 sp eakers. Several other di-

of transformations for the whole set of genones. The

alects are currently b eing recorded across . The

parameter estimation pro cess is p erformed using the EM

corpus consists of sub jects reading various prompts or-

algorithm [9].

ganized in sections. The sections include a set of pho-

In a similar manner, a sp eaker-adapted (SA) system

netically balanced common sentences for all the sp eak-

to a particular sp eaker S can b e obtained using the same

ers, a set of sentences translated from the English Air

transformation metho d and the dialect-adapted system

Travel Information System (ATIS) domain, and a set of

as a seed mo del:

newspap er sentences.

N

!

X

For our adaptation exp eriments we used data from

P (x js ;S)= p(! js )

SA t t i t

the Sto ckholm and Scanian dialects, that were, resp ec-

i

tively, the seed and target dialects. The Scanian dialect

was chosen for the initial exp eriments b ecause it is one

t

N(x ;A (S)m (D )+b (S);A (S)S (D )(A (S )) );

t g ig g g ig g

of three that are clearly di erent from the Sto ckholm di-

where D is the dialect of the sp eaker. In our exp eriments

alect. The main di erences b etween the dialects is that

we assume the matrix A is diagonal [1].

the long (tense) vowels b ecome in the Sca-

g

nian dialect, and that the usual supra-dental /r/-sound

3 DIALECT IDENTIFICATION

b ecomes uvular. In the Sto ckholm dialect, a combina-

tion of /r/ with one of the dental consonants /n/, /d/,

To use a the correct system for a particular sp eaker,

/t/, /s/ or /l/, results in supradentalization of these

his/her dialect must b e rst identi ed. One alternative

consonants and a deletion of the /r/. In the Scanian

is to run multiple dialect-sp eci c systems in parallel and

dialect, since the /r/-sound is di erent, this do es not

select the system which maximizes the a-p osteriori prob-

happ en. There are also proso dic di erences.

ability of the dialect given the sp eaker data. Assuming

There is a total of 40 sp eakers of the Scanian dialect,

that all dialects are equally likely, then the identi ed



b oth male and female, and each of them recorded more

dialect D is simply the one that maximizes the likeli-

than 40 sentences. We selected 8 of the sp eakers (half

ho o d of the data using the corresp onding dialect-sp eci c

of them male) to serve as testing data and the rest com-

system. If a Viterbi recognizer is used, we can approxi-

p osed the adaptation/training data with a total of 3814

mate the summation over all p ossible state sequences by

sentences. Exp eriments were carried out using SRI's

maximizing the joint likeliho o d of the observed sp ectral

TM

DE C I P H E R system [8]. The system's front-end

features X =[x ;x ;:::;x ] and the most likely state

1 2 T

was con gured to output 12 cepstral co ecients, cep-

sequence S =[s ;s ;:::;s ]over all p ossible dialects:

1 2 T

stral energy and their rst and second derivatives. The



D = arg max max p(X; S jD);

cepstral features are computed with a fast Fourier trans-

D S

form (FFT) lterbank and subsequent cepstral-mean

normalization on a sentence basis is p erformed. We used

where

genonic HMM's with arbitrary degree of Gaussian shar-

T

Y

ing across di erent HMM states [8].

p(s js )P (x js ;D): p(X; S jD)=

t t1 DA t t

The SI continuous HMM system which served as seed

t=1

mo dels for our adaptation scheme, was trained on ap-

This maximization can b e achieved by running in paral- proximately 21000 sentences of Sto ckholm dialect. The

lel one Viterbi deco der for each dialect, and at the end recognizer is con gured so that it runs in real time on

Test set

26

Sto ckholm Scanian Total System Description (1) seed: Stockholm System (2) seed: Dialect adapted, 200 sentences

24 (3) seed: Dialect adapted, 500 sentences

Sto ckholm Trained 10.3 25.1 18.0

22

Scania Adapted 13.2 6.9 10.0

20

Known Dialect 10.3 6.9 8.3 ti cation

Dialect Iden 18

eliho o d 10.3 7.3 8.7

Recognizer lik 16

ation Probability 11.2 7.8 9.4 Observ 14 word error rate 12

10

Table 1: Word-error rates (%) for Dialect-dep endent,

8

Cross-dialect, Known-dialect and automatic dialect

6

identi cation metho ds.

4 0 20 40 60 80 100 120

Number of adaptation sentences

a Sun Sparc Ultra-1 workstation. The system's recog-

nition p erformance on an air travel information task

similar to the English ATIS one was b enchmarked at

a 8.9% word-error rate using a bigram language mo del

when tested on Sto ckholm sp eakers. On the other hand,

Figure 1: Sp eaker adaptation results using di erent seed

its p erformance degraded signi cantly, reaching a word-

systems.

error rate of 25.08% when tested on the Scanian-dialect

testing set. The degradation in p erformance was uni-

form across the various sp eakers in the test set, sug-

Scanian dialect using twohundred and vehundred sen-

gesting that there are consistent di erences across the

tences from other Scanian sp eakers. We see that even

two dialects. In our previous work [10], we adapted the

using such a small amount of dialect-dep endent data

Sto ckholm system to the Scania dialect, and weachieved

accelerates the convergence of the sp eaker-adaptation

aword-error rate of 6.9% using only a few hundred sen-

pro cess signi cantly. For example, the dialect-adapted

tences from six sp eakers.

systems using 40 sentences of the sp eaker achieve an

When the multi-dialect system is used in dialect-

adapted p erformance of around 7.0%, whereas the orig-

indep endent mo de, the dialect of the sp eaker must be

inal Sto ckholm system has an adapted word-error rate

identi ed. In Table 1 we present the recognition re-

of 14.0% when the same 40 adaptation sentences are

sults of the seed Sto ckholm and Scania adapted sys-

used.

tems on test sets from the Sto ckholm and Scanian di-

alects. When the dialect of the sp eaker is known, then

5 CONCLUSIONS

the matched system to the testing conditions p erforms

In this pap er wehave shown that highly-accurate dialect

well and achieves a word-error rate of 8.3%. In multi-

identi cation is p ossible using the observation-density

dialect mo de, however, the dialect must be identi ed

likeliho o d of hidden Markov mo dels. Identifying the di-

automatically. Using the dialect identi cation meth-

alect of the sp eaker is imp ortantinmulti-dialect appli-

o ds that we mentioned ab ove, wewere able to identify

cations, where there are signi cant di erences b etween

the correct dialect 96.75% of the time using the recog-

the dialects and the recognizer can b ene t by knowing

nizer likeliho o d, and only 79.25% using the observation-

the dialect of the sp eaker. In addition, knowledge of

density likeliho o d. As a result, an automatic con gura-

the dialect is useful in order to b o otstrap the sp eaker-

tion where the two recognizers were running in parallel

adaptation pro cess using seed mo dels matched to the

and the one with the highest likeliho o d was selected and

sp eaker's dialect. We have shown that the adaptation

its hyp othesis was adopted, achieved a word-error rate

pro cess can be accelerated signi cantly when dialect-

of 8.7%, very close to the lower b ound p erformance of

sp eci c mo dels are used.

the known-dialect case, which is 8.3%.

Once the dialect of the sp eaker is known, then the

ACKNOWLEDGMENTS

system can b e further adapted to the sp eaker. In Fig-

The work we have describ ed was accomplished under

ure1we present sp eaker-adaptation results on sp eakers

contract to Telia Research.

of the Scanian dialect for di erent amounts of sp eaker-

adaptation data and for di erent seed systems. Sp eci -

References

cally,we adapt to sp eakers of the Scanian dialect three

systems: the original Sto ckholm sp eaker-indep endent [1] V. Digalakis, D. Rtischev and L. Neumeyer,

system, and two systems that were rst adapted to the \Sp eaker Adaptation Using Constrained Reestima-

tion of Gaussian Mixtures," IEEE Transactions

Speech and Audio Processing, pp. 357{366, Septem-

b er 1995.

[2] L.Neumeyer, A.Sankar and V.Digalakis, \A Com-

parative Study of Sp eaker Adaptation Techniques",

Proceedings of European Conference on Speech

Communication and Technology, pp. 1127{1130,

Madrid, Spain, 1995.

[3] C. J. Leggetter and P. C. Wo o dland, \Maximum

Likeliho o d Linear Regression for Sp eaker Adapta-

tion of Continuous Density Hidden Markov Mo d-

els," Computer Speech and Language, pp. 171{185,

1995.

[4] C.-H. Lee, C.-H. Lin and B.-H. Juang, \A Study on

Sp eaker Adaptation of the Parameters of Continu-

ous Density Hidden Markov Mo dels," IEEE Trans.

on Acoust., Speech and Signal Proc., Vol. ASSP-

39(4), pp. 806{814, April 1991.

[5] J.-L. Gauvain and C.-H. Lee, \Maximum a Pos-

teriori Estimation for Multivariate Gaussian Ob-

servations of Markov Chains," IEEE Transactions

Speech and Audio Processing, pp. 291{298, April

1994.

[6] V. Digalakis and L. Neumeyer, \Sp eaker Adapta-

tion Using Combined Transformation and Bayesian

Metho ds," IEEE Transactions Speech and Audio

Processing, June 1996.

[7] M. Rayner,I. Bretan,D. Carter, M. Collins, V. Di-

galakis, B. Gamback, J. Ka ja, J. Karlgren, B. Ly-

b erg, P. Price, S. Pulman and C. Samuelsson, \Sp o-

ken Language Translation with Mid-90's Technol-

ogy: A Case Study," Proc. Eurospeech '93, Berlin,

1993.

[8] V. Digalakis, P. Monaco and H. Murveit, \Genones:

Generalized Mixture Tying in Continuous Hidden

Markov Mo del-Based Sp eech Recognizers," IEEE

Transactions Speech and Audio Processing, June

1996.

[9] A. P. Dempster, N. M. Laird and D. B. Rubin,

\Maximum Likeliho o d Estimation from Incomplete

Data," Journal of the Royal Statistical Society (B),

Vol. 39, No. 1, pp. 1{38, 1977.

[10] V. Diakoloukas, V. Digalakis, L. Neumeyer

and J. Ka ja, \Development of Dialect-Sp eci c

Sp eech Recognizers Using Adaptation Metho ds,"

Proc. Int'l. Conf. on Acoust., Speech and Signal

Processing, Munich, 1997.