On the Integration of Dialect and Speaker Adaptation in a Multi-Dialect Speech Recognition System
Total Page:16
File Type:pdf, Size:1020Kb
ON THE INTEGRATION OF DIALECT AND SPEAKER ADAPTATION IN A MULTI-DIALECT SPEECH RECOGNITION SYSTEM V. Digalakis V. Doumpiotis S. Tsakalidis Dept. of Electronics & Computer Engineering Technical University of Crete 73100 Hania, Crete, GREECE fvas,doump,[email protected] on sp eakers of another Swedish dialect, namely from ABSTRACT the Scania region. The recognition accuracy in recent Automatic Sp eech To improve the p erformance of the SI system for Recognition (ASR) systems has proven to be highly sp eakers of dialects for which minimal amounts of train- related to the correlation of the training and testing ing data are available, we use dialect adaptation tech- conditions. Several adaptation approaches have b een niques. We rst identify the dialect of the sp eaker, and prop osed in an e ort to improve the sp eech recogni- then use a dialect-sp eci c system which has b een trained tion p erformance, and have typically b een applied to using adaptation techniques with a small amount of data the sp eaker- and channel-adaptation tasks. We have from the target dialect. This dialect-sp eci c system can shown in the past that a mismatch in dialects b etween b e further adapted to the sp eaker, providing additional the training and testing sp eakers signi cantly in uences improvement in recognition, and we show the e ect the the recognition accuracy, and wehave used adaptation seed system has in the nal sp eaker-adapted recognition to comp ensate for this mismatch. The dialect of the p erformance. sp eaker needs to be identi ed in a dialect-sp eci c sys- tem, and in this pap er we present results in this area. 2 DIALECT AND SPEAKER ADAPTATION To achieve further improvement in recognition p erfor- METHODS mance, we combine dialect- and sp eaker-adaptation. The SI sp eech recognition system for a sp eci c di- 1 INTRODUCTION alect is mo deled with continuous mixture-density hid- den Markov mo dels (HMM's) that use a large number Performance in large-vo cabulary continuous-sp eech of Gaussian mixtures [8]. The comp onent mixtures of recognition degrades dramatically if a mismatch exists each Gaussian co deb o ok (genone) are shared across clus- between the training and testing conditions, such as ters of HMM states, and hence the observation densities di erent channel, accent or sp eaker's voice character- of the vector pro cess x have the form: t istics. Several sp eaker adaptation techniques have b een recently prop osed, to improve the p erformance and ro- N ! X bustness of sp eech recognition systems. These tech- P (x js )= p(! js )N(x ;m ;S ); SI t t i t t ig ig niques include transformation based adaptation in the i mo del space [1, 2, 3], Bayesian adaptation [4, 5], or com- where s is the HMM state, x is the sp ectral feature bined approaches [6]. t t obtained from the recognizer front-end at frame t, and In this pap er, we consider the dialect issue on a g is the genone index used by the HMM state s . sp eaker-indep endent (SI) sp eech recognition system and t These mo dels need large amounts of training data the adaptation of a dialect-sp eci c system to individ- for robust estimation of their parameters. Since the ual sp eakers. Based on the Swedish language corpus amount of available training data for some dialects of collected byTelia, wehave develop ed a Swedish multi- our database is small, the development of dialect-sp eci c dialect SI sp eech recognition system which requires only SI mo dels is not a robust solution. Alternatively, an ini- a small amount of dialect-dep endent data. This recog- tial SI recognition system trained on some seed dialects nizer is part of a bidirectional sp eech translation system can be adapted to match a sp eci c target dialect, in between English and Swedish that has b een develop ed which case the adapted system utilizes knowledge ob- under the SRI-Telia ResearchSpoken Language Trans- tained from the seed dialects. Wecho ose to apply algo- lator pro ject [7]. We have found in the past that the rithms that we have previously develop ed and applied recognition p erformance of a sp eaker-indep endent sys- to the problem of sp eaker adaptation, since in our prob- tem trained on a large amount of training data from the lem there are consistent di erences in the pronunciation Sto ckholm dialect decreases dramatically when tested between the di erent dialects that we examine. The selecting the dialect of the recognizer with the highest adaptation pro cess is p erformed by jointly transform- likeliho o d as the identi ed one. ing all the Gaussians of each genone, and by combining An alternative metho d that we examined is to base transformation and Bayesian techniques. the dialect identi cation solely on the observation den- Using the adaptation metho d prop osed in [1], we as- sity likeliho o ds: sume that the dialect-adapted (DA) observation density T Y of the HMM state s for dialect D can b e obtained from t D = arg max P (x js ;D); DA t t the corresp onding density of the seed-dialect system: D t=1 N ! X using again the states s of the most likely state se- t P (x js ;D)= p(! js )N(x ;m (D );S (D )) DA t t i t t ig ig quence. i N ! X 4 EXPERIMENTS t = p(! js )N (x ; A (D )m +b (D );A (D)S (A (D )) ): i t t g ig g g ig g The adaptation exp eriments were carried out using i a multi-dialect Swedish sp eech database collected by Adaptation is equivalent to estimation of the parameters Telia. The core of the database was recorded in Sto ck- A (D );b (D);g = 1;:::;N . N denotes the number g g g g holm using more than 100 sp eakers. Several other di- of transformations for the whole set of genones. The alects are currently b eing recorded across Sweden. The parameter estimation pro cess is p erformed using the EM corpus consists of sub jects reading various prompts or- algorithm [9]. ganized in sections. The sections include a set of pho- In a similar manner, a sp eaker-adapted (SA) system netically balanced common sentences for all the sp eak- to a particular sp eaker S can b e obtained using the same ers, a set of sentences translated from the English Air transformation metho d and the dialect-adapted system Travel Information System (ATIS) domain, and a set of as a seed mo del: newspap er sentences. N ! X For our adaptation exp eriments we used data from P (x js ;S)= p(! js ) SA t t i t the Sto ckholm and Scanian dialects, that were, resp ec- i tively, the seed and target dialects. The Scanian dialect was chosen for the initial exp eriments b ecause it is one t N(x ;A (S)m (D )+b (S);A (S)S (D )(A (S )) ); t g ig g g ig g of three that are clearly di erent from the Sto ckholm di- where D is the dialect of the sp eaker. In our exp eriments alect. The main di erences b etween the dialects is that we assume the matrix A is diagonal [1]. the long (tense) vowels b ecome diphthongs in the Sca- g nian dialect, and that the usual supra-dental /r/-sound 3 DIALECT IDENTIFICATION b ecomes uvular. In the Sto ckholm dialect, a combina- tion of /r/ with one of the dental consonants /n/, /d/, To use a the correct system for a particular sp eaker, /t/, /s/ or /l/, results in supradentalization of these his/her dialect must b e rst identi ed. One alternative consonants and a deletion of the /r/. In the Scanian is to run multiple dialect-sp eci c systems in parallel and dialect, since the /r/-sound is di erent, this do es not select the system which maximizes the a-p osteriori prob- happ en. There are also proso dic di erences. ability of the dialect given the sp eaker data. Assuming There is a total of 40 sp eakers of the Scanian dialect, that all dialects are equally likely, then the identi ed b oth male and female, and each of them recorded more dialect D is simply the one that maximizes the likeli- than 40 sentences. We selected 8 of the sp eakers (half ho o d of the data using the corresp onding dialect-sp eci c of them male) to serve as testing data and the rest com- system. If a Viterbi recognizer is used, we can approxi- p osed the adaptation/training data with a total of 3814 mate the summation over all p ossible state sequences by sentences. Exp eriments were carried out using SRI's maximizing the joint likeliho o d of the observed sp ectral TM DE C I P H E R system [8]. The system's front-end features X =[x ;x ;:::;x ] and the most likely state 1 2 T was con gured to output 12 cepstral co ecients, cep- sequence S =[s ;s ;:::;s ]over all p ossible dialects: 1 2 T stral energy and their rst and second derivatives. The D = arg max max p(X; S jD); cepstral features are computed with a fast Fourier trans- D S form (FFT) lterbank and subsequent cepstral-mean normalization on a sentence basis is p erformed.