Pronunciation Space Models for Pronunciation Evaluation

Si Wei, Yi-Qian Pan, Guo-Ping Hu, Yu Hu and Ren-Hua Wang {siwei, yqpan, gphu, yuhu, rhwang}@iflytek.com

iFLYTEK Research, Hefei

ABSTRACT

Posterior probability is the most widely used measure for pronunciation evaluation. This paper introduces pronunciation space models, which replace the traditional phone-based acoustic models in the calculation of posterior probability and make the calculated posterior probability more precise. Pronunciation space models are constructed with an unsupervised clustering method guided by human scores and phone-level posterior probability. Using the correlation between machine scores and human scores as the performance measure, the method based on pronunciation space models proves effective in experiments on a Chinese database spoken by Koreans: the correlation improves from 0.390 to 0.415 compared with the traditional method based on phone-based acoustic models.

Index Terms: pronunciation evaluation, posterior probability, pronunciation space models, speech recognition

1. INTRODUCTION

With the development of computer science and artificial intelligence, much research has been done to assist language learning with computers, a field called Computer Assisted Language Learning (CALL). In the last decade, with the rapid improvement of speech technology, CALL systems have become much more intelligent than before. They can provide many potential benefits to both the language learner and the teacher. Pronunciation evaluation is one of the essential functions of a CALL system. With this function, a CALL system can evaluate the pronunciation of speakers and give them informative feedback.

Many researchers have studied pronunciation evaluation. Members of the SRI speech group mainly focus on evaluating the overall pronunciation quality of learners [3, 8]. They take word posterior probability, timing and duration scores as the measures of pronunciation and measure the performance of an evaluation algorithm by the correlation between machine scores and human scores. The joint research by the speech group of Cambridge University and the AI lab of MIT mainly focuses on pronunciation error detection and phone-level pronunciation evaluation [10]. They also investigate different ways to measure the performance of mispronunciation detection. The VICK system developed by the University of Nijmegen investigates the reasonability of human scoring and the effect of prosody, fluency and segmental quality on human scoring [2]. The system from Kyoto University considers the importance of different phonemes in language learning and the effect of different types of errors on pronunciation proficiency [9]. Recently, structural representation has been used to assess pronunciation in order to capture the structure, or higher-level aspects, of the language when spoken by non-native speakers [1, 7].

Almost all of these methods are based on Automatic Speech Recognition (ASR), and they share some common limitations. The most severe limitation is that all of them directly employ the phone-based acoustic models defined in ASR, which have little capability to distinguish mispronunciations of different severity levels, such as partially changed mispronunciations. To break through this limitation, the acoustic models need to be reconstructed so that they can distinguish mispronunciations of different severity levels. This is very similar to pronunciation modeling in ASR, where acoustic models are reconstructed to model partial changes of pronunciation. Liu and Fung [6] propose partial change phone models (PCPMs) to represent partial changes and then merge the PCPMs into the original acoustic models to increase the model resolution for pronunciation variations. In short, in order to deal with pronunciation changes, acoustic models should be reconstructed to represent the pronunciation variations.

In this paper, Pronunciation Space Models (PSM) are proposed to represent mispronunciations of different severity levels and so enhance the discriminative capability of acoustic models for partially changed mispronunciation. These PSMs are built with an unsupervised clustering method. Posterior probability based on PSM is then used as the measure for pronunciation evaluation. Experimental results indicate that the new method based on PSM outperforms the traditional method based on phone-based acoustic models, with the correlation improving from 0.390 to 0.415.

This paper is organized as follows. Section 2 introduces the definition and the construction of PSM. The pronunciation evaluation method using PSM is introduced in Section 3. Section 4 describes the databases and the experimental results. Section 5 concludes the paper.

978-1-4244-2942-4/08/$25.00 (c) 2008 IEEE 21

2. PRONUNCIATION SPACE MODELS CONSTRUCTION

Phone-based acoustic models can only deal effectively with mispronunciation from one phone to another. For partially changed mispronunciation, or mispronunciation that does not belong to any phone in the model set, this kind of acoustic model is less effective. At the same time, partially changed mispronunciation is one of the most frequent kinds of mispronunciation. So in order to evaluate pronunciation effectively, more effort should be put into partially changed mispronunciations. This paper builds acoustic models named PSM to represent the characteristics of each kind of mispronunciation, including partially changed mispronunciation.

2.1. Definition of PSM

Phone-based acoustic models are not suitable for mispronunciation detection because of their weakness in handling partially changed mispronunciations. Two different approaches have been investigated to deal with partially changed mispronunciations using phone-based acoustic models. One approach is multi-training with both mispronunciation and correct pronunciation samples. This kind of model is able to describe partially changed mispronunciation but sacrifices the capability of distinguishing mispronunciation from correct pronunciation. The other approach is to train acoustic models only with correct pronunciation. This kind of model retains the ability to distinguish correct pronunciation, but its power to detect partially changed mispronunciation is limited because mispronunciation data is neglected at the acoustic training stage. In order to deal with all kinds of mispronunciation effectively, the most suitable acoustic models should reflect all kinds of pronunciation. A native model trained from a native database combined with a non-native model can be regarded as an example of this kind, but it cannot precisely model the frequently occurring intermediate pronunciations between native and non-native pronunciation.

To break through the above-mentioned limitation, the authors build different acoustic models, named PSM, for mispronunciations of different severity levels, not just correct vs. mispronounced or native vs. non-native. In PSM, the traditional phone-based model $q$ is expanded to $\{q_1, q_2, \ldots, q_K\}$, representing $K$ kinds of mispronunciation of different severity levels, where $K$ is a tunable parameter.

2.2. PSM construction based on unsupervised clustering method

The proposed PSMs try to model all kinds of mispronunciation. If enough mispronunciation data labeled by humans could be obtained, these models could be built via maximum likelihood estimation or discriminative training. Unfortunately, it is very hard to label enough mispronunciation data, and it is even harder to label enough data with acceptable consistency for model training. At the same time, some kinds of mispronunciation are very rare in real pronunciation data. All these reasons make supervised pronunciation space modeling very hard to implement. This paper introduces an unsupervised pronunciation space modeling method that uses the posterior probability calculated for each sample. The method is as follows:

Step 1, Collect all the samples belonging to one phone $q$ from the acoustic model training database and denote them as $o_q = \{o_q^1, o_q^2, \ldots, o_q^R\}$, where $R$ is the number of samples of $q$. Then use the phone-based acoustic models to calculate the posterior probability of each sample, with the boundaries determined by forced alignment.

Step 2, Cluster $o_q$ into several groups using the pre-calculated posterior probability and denote the result as $o_q = \{o_q^{C_1}, o_q^{C_2}, \ldots, o_q^{C_K}\}$, where the number of groups $K$ is determined by the definition of PSM. Here, the clustering is done by ordering the samples $o_q^i$ according to their posterior probability and then splitting $o_q$ into $K$ groups of the same size.

Step 3, Train the PSM via maximum likelihood estimation with the clustered acoustic training data. If the training data is limited, the PSM can be trained by an adaptation method such as Maximum Likelihood Linear Regression (MLLR) [5] or Maximum a Posteriori (MAP) estimation [4].

The unsupervised clustering method above uses only the posterior probability as the measure for clustering. But posterior probability cannot precisely measure the correctness of phone samples, so this paper also uses human scores as an auxiliary measure for clustering. For example, for the correct pronunciation space model, the clustered group should have the best human score in addition to the highest posterior probability. Likewise, for the worst pronunciation space model, the samples selected should have the lowest human score.

Although the unsupervised method cannot perfectly split the data into correct pronunciation and mispronunciation, this pronunciation space modeling method is capable of giving an elementary indication of the effectiveness of PSM.

3. PRONUNCIATION EVALUATION METHOD BASED ON PRONUNCIATION SPACE MODELS

This section introduces the pronunciation evaluation method used in this paper. As shown in previous research, posterior probability, duration scores and rate of speech are all useful measures for pronunciation evaluation. This paper focuses on the posterior probability method and puts emphasis on how to improve its performance. Posterior probability using PSM for pronunciation evaluation is introduced in the following sections.
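Returning to Step 2 of Section 2.2, the ordering-and-splitting rule can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `split_by_posterior`, the descending order (group 1 holding the highest posteriors), the handling of a remainder when the sample count is not divisible by K, and the toy posterior values are all assumptions made for illustration. The auxiliary use of human scores is not modeled here.

```python
from typing import List, Sequence


def split_by_posterior(posteriors: Sequence[float], k: int) -> List[List[int]]:
    """Return k groups of sample indices, ordered from highest to lowest posterior.

    Hypothetical helper: the paper only states that samples are ordered by
    posterior probability and split into K groups of the same size.
    """
    if k < 1 or k > len(posteriors):
        raise ValueError("k must be between 1 and the number of samples")
    # Order the sample indices by descending posterior probability.
    order = sorted(range(len(posteriors)), key=lambda i: posteriors[i], reverse=True)
    # Split into k contiguous groups; sizes differ by at most one (assumption
    # for the case where the sample count is not a multiple of k).
    base, extra = divmod(len(order), k)
    groups: List[List[int]] = []
    start = 0
    for g in range(k):
        size = base + (1 if g < extra else 0)
        groups.append(order[start:start + size])
        start += size
    return groups


# Toy posteriors for six samples of one phone (made-up values).
posteriors = [0.91, 0.15, 0.62, 0.48, 0.77, 0.30]
groups = split_by_posterior(posteriors, k=3)
print(groups)
```

Each returned group would then supply the training or adaptation data for one of the K models in Step 3.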

3.1 Posterior Probability

Given an acoustic observation $o$ and a word sequence $W$, according to Bayes' rule, the posterior probability that $o$ is recognized as $W$ is defined as follows:

$$P(W|o) = \frac{p(o|W)\,P(W)}{P(o)} \tag{1}$$

where $P(W)$ is the probability of $W$, $p(o|W)$ is the probability of observation $o$ being generated by word sequence $W$, and $P(o)$ is the probability of the acoustic observation $o$. In Eq. (1), $P(W)$ is normally calculated from a language model, and $p(o|W)$ is calculated with acoustic HMMs. $P(o)$, which is difficult to estimate directly, is often calculated according to the following formula:

$$P(o) = \sum_{H} P(o, H) = \sum_{H} p(o|H)\,P(H) \tag{2}$$

where $H$ denotes one of the possible hypotheses for $o$, and the summation is over all possible hypotheses for $o$. It is very hard, if not impossible, to enumerate all possible hypotheses, so some approximations or constraints must be used to make the calculation feasible.

3.2 Pronunciation evaluation based on posterior probability using PSM

As shown in the previous section, $P(o)$ is approximated as in Eq. (2). For pronunciation evaluation, the summation is done by a phone-loop network with phone-based acoustic models:

$$P(o) = \sum_{H} P(o, H) = \sum_{q \in Q} p(o|q)\,P(q) \tag{3}$$

where $q$ is a phone acoustic model and $Q$ is the phone set. In this paper, by contrast, $P(o)$ is calculated with the pre-obtained PSMs as follows:

$$P(o) = \sum_{H} P(o, H) = \sum_{q \in Q} \sum_{k=1}^{K} p(o|q_k)\,P(q_k) \tag{4}$$

where $K$ is the parameter of PSM and the summation is done by a pronunciation space phone-loop network with PSMs. The posterior probability based on PSM is then implemented as follows:

$$P(W|o) = \frac{p(o|W)\,P(W)}{\sum_{q \in Q} \sum_{k=1}^{K} p(o|q_k)\,P(q_k)} \tag{5}$$

This paper uses a standard pronunciation space model and an accented pronunciation space model for each phone, and the posterior probability is calculated as follows:

$$P(W|o) = \frac{p(o|W)\,P(W)}{\sum_{q \in Q} \left( p(o|q_{std})\,P(q_{std}) + p(o|q_{accent})\,P(q_{accent}) \right)} \tag{6}$$

where $q_{std}$ and $q_{accent}$ represent the standard model and the accented model for phone $q$ respectively.

Here the standard and the accented pronunciation space models are trained as introduced in Section 2.2 with posterior probability and human scores. Because the non-native database in this paper is quite small, MLLR is chosen to build the PSMs. The five most native-like speakers are selected, and adaptation samples are further selected via posterior probability, to adapt the original native phone-based models to standard pronunciation space models. The five most non-native-like speakers are chosen to build the accented pronunciation space models.

4. EXPERIMENTS AND RESULTS

4.1 Database description

Two databases are used in this paper. The first one is a native Chinese database, which is used to train the acoustic model. The second one is a Chinese database spoken by Koreans. Part of the second database is used to obtain the PSMs and the rest is used as the pronunciation evaluation testing database.

The first database contains 56 speakers, each with about 3 hours of continuous speech. The second database contains 48 Koreans with 100 Chinese sentences. Ten of the speakers are selected as the PSM training database. Five of these ten speakers are quite native-like and the other five have a heavy non-native accent. The data of the first five speakers is used to train the standard pronunciation space models and the data of the other five is used to train the accented pronunciation space models. The training process is the same as shown in Section 2.2. The pronunciation evaluation is then done using posterior probability with PSMs as in Eq. (6).

The data of the remaining 38 speakers is used as the testing database, and each of the sentences is labeled by human evaluators. The sentences are scored on a 10-point scale ranging from quite native-like to heavily non-native.

4.2 Performance measurement

The performance of the pronunciation evaluation algorithm is measured with the correlation between the machine scores and the human scores as follows:

$$Corr(A, B) = \frac{\sum_{i=1}^{N} \left[ (S_{Ai} - \bar{S}_A) \times (S_{Bi} - \bar{S}_B) \right]}{\sqrt{\sum_{i=1}^{N} (S_{Ai} - \bar{S}_A)^2 \times \sum_{i=1}^{N} (S_{Bi} - \bar{S}_B)^2}} \tag{7}$$

where $A$ and $B$ are two evaluators, $N$ is the number of sentences, $S_{Ai}$ and $S_{Bi}$ are the scores given by evaluators $A$ and $B$ for the $i$-th sentence, and $\bar{S}_A$ and $\bar{S}_B$ are the average scores given by evaluators $A$ and $B$ over all $N$ sentences.
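The correlation measure of Eq. (7) is the standard Pearson correlation coefficient, and a minimal sketch of it is given below. The function name `corr` and the toy score lists are assumptions for illustration; they are not data from the paper.

```python
import math
from typing import Sequence


def corr(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Pearson correlation between two evaluators' sentence scores (Eq. 7)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    n = len(scores_a)
    mean_a = sum(scores_a) / n
    mean_b = sum(scores_b) / n
    # Numerator of Eq. (7): sum of products of deviations from the means.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b))
    # Denominator terms: sums of squared deviations for each evaluator.
    var_a = sum((a - mean_a) ** 2 for a in scores_a)
    var_b = sum((b - mean_b) ** 2 for b in scores_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_a * var_b)


# Made-up example: human scores on a 10-point scale vs. machine posteriors.
human = [8.0, 6.0, 9.0, 3.0, 5.0]
machine = [0.80, 0.55, 0.90, 0.35, 0.60]
print(round(corr(human, machine), 3))
```

Because the measure is invariant to the scale and offset of either score list, machine posteriors and 10-point human scores can be compared directly without rescaling.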

By using correlation, the performance of the human evaluators can also be measured by the correlation between them, which is shown in Table 1.

Correlation    Evaluator 1
Evaluator 2    0.593

Table 1, Correlation between different human evaluators.

We can see from Table 1 that, because the scoring is done at the sentence level, even the correlation between human evaluators is rather low.

4.3 Experiment result

This section gives the performance of the method based on PSM and of the baseline system with phone-based acoustic models. The experimental results are shown in Table 2.

Correlation    Correlation (AVE)
Baseline       0.390
PSM            0.416

Table 2, Performance of different pronunciation evaluation algorithms.

Here AVE means that the correlation is the average correlation between the machine scores and the two human scores.

Table 2 indicates that the evaluation method based on PSM is better than the phone-based acoustic model method.

5. CONCLUSIONS

This paper introduces a pronunciation evaluation method based on posterior probability using PSM. PSMs are proposed to address a limitation of previous posterior probability based research: in most ASR-based pronunciation evaluation methods, partially changed mispronunciations are ignored or mixed up with correct pronunciation during the acoustic model training stage, which weakens the capability of pronunciation evaluation.

To deal with this limitation, a PSM-based method is proposed in this paper. Specific mispronunciation acoustic models are used to represent mispronunciations of different severity levels. First, pronunciation data is collected from a range of people to construct the PSMs with an unsupervised clustering method guided by posterior probabilities and human scores. Then posterior probability is used to evaluate pronunciation based on the obtained PSMs.

Experimental results on a Chinese database spoken by 38 Koreans indicate that the PSM-based posterior probability method outperforms the phone-based acoustic model method. When using PSM for pronunciation evaluation, the correlation between human and machine increases from 0.391 to 0.416.

Several aspects of the proposed method can be further investigated in future. First, only posterior probability based on PSM is used for pronunciation evaluation; many other evaluation methods could benefit from PSM. Second, the PSM is constructed by unsupervised clustering with posterior probability and only partially with human scores; how to utilize more information from human labeling for PSM construction can be further investigated. At the same time, considering the poor evaluation performance compared with humans, more evaluation methods should be used for evaluation. In summary, the pronunciation evaluation problem is still a work in progress, and more effort is needed to solve it in future.

6. REFERENCES

[1] S. Asakawa, N. Minematsu, T. Isei-Jaakkola, K. Hirose, "Structural Representation of the Non-native Pronunciation", EuroSpeech, pp. 165-169, 2005.
[2] C. Cucchiarini, F.D. Wet, H. Strik, L. Boves, "Assessment of Dutch Pronunciation by Means of Automatic Speech Recognition Technology", ICSLP, pp. 1739-1742, 1998.
[3] H. Franco, L. Neumeyer, V. Digalakis, O. Ronen, "Combination of Machine Scores for Automatic Grading of Pronunciation Quality", Speech Communication, pp. 121-, 2000.
[4] J. L. Gauvain, C. H. Lee, "Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains", IEEE Transactions on Speech and Signal Processing, pp. 291-298, 1994.
[5] C. J. Leggetter, P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov Models", Computer Speech & Language, pp. 171-185, 1995.
[6] Y. Liu, P. Fung, "Modeling partial pronunciation variations for spontaneous Mandarin speech recognition", Computer Speech & Language, pp. 357-379, 2003.
[7] N. Minematsu, "Pronunciation Assessment based upon the Compatibility between a Learner's Pronunciation Structure and the Target Language's Lexical Structure", ICSLP, pp. 1317-1320, 2004.
[8] L. Neumeyer, H. Franco, V. Digalakis, M. Weintraub, "Automatic Scoring of Pronunciation Quality", Speech Communication, pp. 83-1xx, 2000.
[9] A. Raux, T. Kawahara, "Automatic Intelligibility Assessment and Diagnosis of Critical Pronunciation Errors for Computer-assisted Pronunciation Learning", ICSLP, pp. 737-740, 2002.
[10] S.M. Witt, S.J. Young, "Phone-level Pronunciation Scoring and Assessment for Interactive Language Learning", Speech Communication, pp. 95-, 2000.