OPTIMIZATION OF THE SELECTED PARAMETERS IN HMM-BASED POLISH SPEECH RECOGNITION

Robert Wielgat

State Higher Vocational School in Tarnow, Technology Department, ul. Mickiewicza 8, 33-100 Tarnów, Poland

Abstract - Preliminary research results on choosing optimal parameter values for a speech recognition process using Hidden Markov Models (HMMs) are presented. The influence of the tree-based clustering log-likelihood threshold ΔPq_min, the sampling frequency, the kind of utterance and the speaker sex on recognition accuracy was tested. Experiments were carried out on a closed set of 365 isolated utterances taken from the Polish speech database CORPORA. The HTK software was used in the experiments. The obtained results show a dependence of the recognition accuracy on the tested parameters. Optimal values of the threshold ΔPq_min and of the sampling frequency were evaluated with weighted cepstral coefficients used as features. The optimal sampling frequency depends on speaker sex.

1. INTRODUCTION

Hidden Markov Models are among the most powerful methods for speech recognition.

Optimization of the various parameters of the speech recognition process is a crucial issue for speech recognition applications.

The parameters chosen for optimization were:

- the tree-based clustering log-likelihood threshold ΔPq_min

- the sampling frequency

2. EXPERIMENTS

2.1. Speech database

Polish speech database - CORPORA

Training set:

7 x 365 utterances of the speakers from the speaker dependent testing set + 5 x 365 utterances of 5 randomly chosen speakers from a set of 8 speakers outside the speaker dependent testing set

= 12 x 365 utterances = 4380 utterances

Testing sets:

Speaker dependent testing set: 8 x 365 utterances coming from 3 women and 4 men = 2920 utterances

Speaker independent testing set: 22 x 365 utterances coming from 6 women and 16 men = 8030 utterances

Speakers were from 9 to 70 years old.

Each set of 365 utterances comprised:

10 digits, 200 names, 8 control words, 114 phonetically balanced sentences.

Utterances were recorded at a 16 kHz sampling frequency with 16 bits/sample.

2.2. Signal Modeling

1) Signal resampling from 16 kHz to 4 kHz, 8 kHz and 12 kHz in order to choose an optimal sampling frequency.

2) Preemphasis with the preemphasis coefficient α = –1:

H(z) = 1 – z⁻¹

s(n) = v(n) – v(n–1)

where s(n) – signal samples after preemphasis, v(n) – signal samples before preemphasis.

3) Frame blocking.

[Figure: the signal split along the time axis into overlapping frames 1…L.]

Frame length: 30 ms, frame overlap: 20 ms (i.e. a 10 ms frame shift).

4) Signal windowing with a Hamming window.
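Steps 3 and 4 can be sketched together; the function name and the framing convention (trailing samples that do not fill a whole frame are dropped) are assumptions made for this sketch:

```python
import numpy as np

def frame_and_window(signal, fs, frame_ms=30.0, overlap_ms=20.0):
    """Split a signal into overlapping frames and apply a Hamming window."""
    frame_len = int(fs * frame_ms / 1000.0)             # 30 ms -> 240 samples at 8 kHz
    shift = frame_len - int(fs * overlap_ms / 1000.0)   # 10 ms frame shift
    n_frames = 1 + (len(signal) - frame_len) // shift
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * shift : i * shift + frame_len] * window
                       for i in range(n_frames)])
    return frames

x = np.ones(8000)                  # 1 s of dummy signal at 8 kHz
f = frame_and_window(x, 8000)
print(f.shape)                     # (98, 240): 240-sample frames every 80 samples
```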

5) Feature extraction. Features used: 10 weighted LPC-derived cepstral coefficients, computed by the autocorrelation method, with the weighting:

cw(m) = w(m)·c(m),   w(m) = 1 + (p/2)·sin(πm/p),   1 ≤ m ≤ p
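The weighting above (a raised-sine cepstral lifter) can be sketched as follows; `weight_cepstrum` is a hypothetical helper name:

```python
import numpy as np

def weight_cepstrum(c):
    """Apply the lifter w(m) = 1 + (p/2) sin(pi*m/p) to cepstra c_1..c_p."""
    p = len(c)
    m = np.arange(1, p + 1)
    w = 1.0 + (p / 2.0) * np.sin(np.pi * m / p)
    return w * np.asarray(c, dtype=float)

# For p = 10 the lifter emphasizes mid-order coefficients:
print(weight_cepstrum(np.ones(10))[0])  # w(1) = 1 + 5*sin(pi/10)
```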

2.3. Classification by Hidden Markov Models

2.3.1. Training the Hidden Markov Models

[Figure: the five-state left-to-right HMM phoneme model used in the experiments. States 1 and 5 are non-emitting; the transitions are a12, a22, a23, a33, a34, a44, a45; the emitting states 2–4 generate the observations o1…o7 with probabilities b2(o1), b2(o2), b2(o3), b2(o4), b3(o5), b4(o6), b4(o7).]

        | 0  a12  0    0    0   |
        | 0  a22  a23  0    0   |
A_T  =  | 0  0    a33  a34  0   |
        | 0  0    0    a44  a45 |
        | 0  0    0    0    0   |

Transition matrix

b_j(o_t) = 1 / sqrt( (2π)^n |Σ_j| ) · exp( –(1/2) (o_t – μ_j)' Σ_j⁻¹ (o_t – μ_j) )

The probability of the observation being generated at time t by the j-th state (a single Gaussian).

Where: Σj – covariance matrix of the observation vectors for j-th state, μj – mean vector of the observations being generated by j-th state, n – size of the observation vectors.
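A minimal sketch of the single-Gaussian output probability defined above, for a full covariance matrix (direct inversion is used here only for illustration; practical systems evaluate this in the log domain):

```python
import numpy as np

def gaussian_obs_prob(o, mu, sigma):
    """b_j(o): single full-covariance Gaussian density of observation o."""
    n = len(mu)
    d = o - mu
    expo = -0.5 * d @ np.linalg.inv(sigma) @ d       # Mahalanobis term
    norm = np.sqrt((2.0 * np.pi) ** n * np.linalg.det(sigma))
    return np.exp(expo) / norm

o = np.zeros(2)
mu = np.zeros(2)
sigma = np.eye(2)
print(gaussian_obs_prob(o, mu, sigma))  # 1/(2*pi) ~ 0.15915 at the mean
```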

The joint probability of the observation sequence O and the state sequence X, given the model M shown above, can be calculated as follows:

P(O,X | M) = a12 b2(o1) a22 b2(o2) a22 b2(o3) a22 b2(o4) a23 b3(o5) a34 b4(o6) a44 b4(o7) a45.
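The product above can be computed mechanically once the path is known. In this sketch the indexing convention (state 0 as the non-emitting entry, the last row/column of the transition matrix as the exit) is an assumption:

```python
import numpy as np

def path_probability(a, b, states):
    """P(O, X | M) for a known state path: the product of transition and
    emission probabilities. states[t] is the emitting state at time t."""
    prob = a[0, states[0]] * b[0, states[0]]          # enter and emit o_1
    for t in range(1, len(states)):
        prob *= a[states[t - 1], states[t]] * b[t, states[t]]
    return prob * a[states[-1], a.shape[0] - 1]       # leave through the exit

# Tiny 3-state model: entry -> emitting state 1 -> exit.
a = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 0.0]])
b = np.array([[0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0]])
print(path_probability(a, b, [1, 1]))  # 1*0.5 * 0.5*0.5 * 0.5 = 0.0625
```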

When the state sequence X is unknown, the probability of the model can be approximated by:

P̂(O | M) = max_X { a_x(0)x(1) · Π_{t=1}^{T} b_x(t)(o_t) · a_x(t)x(t+1) }

To calculate the above probability, the Hidden Markov Model parameters have to be known, and to calculate these parameters an estimation procedure is needed.
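The maximization over state sequences is computed efficiently by the Viterbi algorithm. A log-domain sketch for a left-to-right model with non-emitting entry and exit states (as in the phoneme model above); the array layout is an assumption:

```python
import numpy as np

def viterbi_log_prob(log_A, log_B):
    """log max_X P(O, X | M) for a model with entry state 0 and exit state N-1.

    log_A: (N, N) log transition matrix.
    log_B: (T, N) log emission probabilities (-inf for non-emitting states).
    """
    T, N = log_B.shape
    # leave the entry state and emit the first observation
    delta = log_A[0] + log_B[0]
    for t in range(1, T):
        # best predecessor for every state, then emit o_t
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[t]
    # finish by jumping to the non-emitting exit state
    return np.max(delta + log_A[:, -1])
```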

Training Procedure

1. Initialization of the HMM parameters: A_T, μ_j, Σ_j
2. Embedded training × 3
3. Fixing the silence models
4. Embedded training × 2
5. Selection of the most optimal phonetic transcription
6. Embedded training × 2
7. Making triphones from monophones
8. Embedded training × 2
9. Making tied-state triphones
10. Embedded training × 2
11. Finally estimated HMM parameters: A_T, μ_j, Σ_j

Embedded Training

Embedded training is an extended Baum-Welch procedure for modelling sub-word units such as phonemes.

Fixing the Silence Models

The silence model gets extra transitions from state '2' to state '4' and from state '4' to state '2'. State '3' of the silence model and state '3' of the short pause model are tied, and the short pause model gets an extra transition between its non-emitting states.

Selection of the Most Optimal Phonetic Transcription

Usually there are several pronunciations of a particular word. Some phonetic versions of the Polish word 'dziewięć' (nine), transcribed in the HTK® format, are shown below:

ci e w j e ni ci
dzi e w j e ni ci
dzi e w j e_ ci
dzi e w j e_ ni ci

The pronunciation which maximizes the model probability should be chosen for training the HMMs.

Making Triphones from Monophones

Monophone transcription

m a g d a l e n a

The same Markov model for phonemes of the same class

Triphone transcription

m+a m-a+g a-g+d g-d+a d-a+l a-l+e l-e+n e-n+a n-a

Different Markov models for phonemes of the same class depending on phoneme neighborhood
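The monophone-to-triphone expansion shown above can be sketched as:

```python
def to_triphones(phones):
    """Expand a monophone sequence into HTK-style left-right context triphones,
    omitting the missing context at the utterance boundaries."""
    tri = []
    for i, p in enumerate(phones):
        left = phones[i - 1] + "-" if i > 0 else ""
        right = "+" + phones[i + 1] if i < len(phones) - 1 else ""
        tri.append(left + p + right)
    return tri

print(" ".join(to_triphones("m a g d a l e n a".split())))
# m+a m-a+g a-g+d g-d+a d-a+l a-l+e l-e+n e-n+a n-a
```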

Making Tied State Triphones

[Figure: tree-based clustering. States are split by answering phonetic questions such as 'Left consonantal?', 'Right supralaryngeal?', 'Right nasal?', 'Left gentle?', 'Right gentle?'. Splitting stops when the log-likelihood gain falls below the threshold (ΔPq < ΔPq_MIN) or when the cluster occupancy falls below a minimum (L_S < L_S_MIN); parameters are then tied within each leaf cluster, and clusters having different parents are merged if ΔPq < ΔPq_MIN.]
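The stopping test of the clustering can be illustrated with the approximate cluster log likelihood used in tied-state clustering (Young & Woodland); diagonal covariances and the helper names are assumptions of this sketch:

```python
import numpy as np

def cluster_log_likelihood(occ, var):
    """Approximate log likelihood of a state cluster with pooled diagonal
    variance `var` and total state occupancy `occ`."""
    return -0.5 * occ * np.sum(np.log(2.0 * np.pi * var) + 1.0)

def split_gain(occ_yes, var_yes, occ_no, var_no, var_pooled):
    """Log-likelihood gain of splitting a cluster by one question.
    Splitting stops when the best gain drops below dPq_min."""
    occ = occ_yes + occ_no
    return (cluster_log_likelihood(occ_yes, var_yes)
            + cluster_log_likelihood(occ_no, var_no)
            - cluster_log_likelihood(occ, var_pooled))

# A question that does not reduce the variance yields zero gain;
# one that halves it yields a positive gain, so the split is kept.
print(split_gain(10, np.ones(3), 5, np.ones(3), np.ones(3)))
print(split_gain(10, 0.5 * np.ones(3), 5, 0.5 * np.ones(3), np.ones(3)))
```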

2.3.2. Recognition by Hidden Markov Models

In the recognition process an alternative formulation of the Viterbi algorithm, the Token Passing Model, was used.
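A sketch of the token-passing view: each state holds one token (its best partial log score), tokens are copied along every transition at each frame, and only the best token arriving at a state survives. This computes the same quantity as the Viterbi maximization; the array layout is an assumption:

```python
import numpy as np

def token_passing(log_A, log_B):
    """Token Passing formulation of Viterbi for one model.

    log_A: (N, N) log transitions, state 0 = entry, state N-1 = exit.
    log_B: (T, N) log emission probabilities (-inf for non-emitting states).
    """
    T, N = log_B.shape
    tokens = np.full(N, -np.inf)
    tokens[0] = 0.0                       # a single token starts in the entry state
    for t in range(T):
        # pass every token along every transition; keep the best per state
        moved = np.max(tokens[:, None] + log_A, axis=0)
        tokens = moved + log_B[t]         # score the current observation
        tokens[0] = -np.inf               # the entry state is left after the start
    return np.max(tokens + log_A[:, -1])  # best token reaching the exit state
```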

4. RESULTS

4.2. Optimal value of the probability threshold ΔPq_min

Speaker independent set, Fs = 8 kHz

[Figure: recognition accuracy RA [%] (left axis) and number of HMMs N (right axis) versus probability threshold ΔPq_min ∈ {50, 100, 200, 500, 1000, 2000, 3000, 4000, 6000, 8000, 10000, 12000}. N falls from 2538 at threshold 50 through 1780, 939, 255, 108, 54, 50, 50 down to 47 for thresholds 6000–12000.]

RA – recognition accuracy, N – number of HMMs

Overall recognition accuracy for isolated utterances from CORPORA basis versus threshold ΔPq_min for 8 kHz sampling frequency and speaker independent testing set.

[Figure: overall recognition accuracy RA [%] for isolated utterances from the CORPORA database versus threshold ΔPq_min (50–12000) for different sampling frequencies. Curves: ZM 4 kHz, NM 4 kHz, ZM 8 kHz, NM 8 kHz, ZM 12 kHz, NM 12 kHz, ZM 16 kHz, NM 16 kHz. Testing sets: ZM – speaker dependent, NM – speaker independent.]

4.3. Optimal sampling frequency

Recognition accuracy [%]

Utterance   | Fs = 4 kHz  | Fs = 8 kHz  | Fs = 12 kHz | Fs = 16 kHz
category    | ZM    NM    | ZM    NM    | ZM    NM    | ZM    NM
Alphabet    | 50.38 60.47 | 69.70 68.04 | 66.67 66.94 | 61.74 62.40
Words       | 75.17 79.11 | 90.02 88.62 | 82.86 84.78 | 83.72 80.42
Sentences   | 92.54 97.85 | 99.45 99.72 | 98.90 99.40 | 99.01 99.20
Total       | 78.36 83.28 | 91.13 90.22 | 86.40 87.73 | 86.51 84.66

Table 1. Overall recognition accuracy for different sampling frequencies and different utterance categories, ΔPq_min = 6000. Testing sets: ZM – speaker dependent, NM – speaker independent.

Recognition accuracy [%]

Utterance   | Fs = 4 kHz  | Fs = 8 kHz   | Fs = 12 kHz  | Fs = 16 kHz
category    | ZM    NM    | ZM     NM    | ZM     NM    | ZM     NM
Alphabet    | 56.57 55.56 | 74.75  61.11 | 74.75  64.14 | 67.68  58.59
Words       | 80.28 73.24 | 93.27  82.03 | 94.50  86.39 | 90.52  83.26
Sentences   | 96.78 97.51 | 100.00 99.12 | 100.00 99.85 | 100.00 99.85
Total       | 83.29 79.22 | 93.70  85.48 | 94.43  88.58 | 91.42  86.21

Table 2. Recognition accuracy for female voices for different sampling frequencies and different utterance categories, ΔPq_min = 6000. Testing sets: see Table 1.

Recognition accuracy [%]

Utterance   | Fs = 4 kHz  | Fs = 8 kHz  | Fs = 12 kHz | Fs = 16 kHz
category    | ZM    NM    | ZM    NM    | ZM    NM    | ZM    NM
Alphabet    | 46.67 62.31 | 66.67 70.64 | 61.82 67.99 | 58.18 63.83
Words       | 72.11 81.31 | 88.07 91.08 | 75.87 84.17 | 79.63 79.36
Sentences   | 90.00 97.97 | 99.12 99.95 | 98.25 99.23 | 98.42 98.96
Total       | 75.40 84.24 | 89.59 91.84 | 81.59 86.94 | 83.56 83.95

Table 3. Recognition accuracy for male voices for different sampling frequencies and different utterance categories, ΔPq_min = 6000. Testing sets: see Table 1.

From the tables shown above it is evident that male and female voices have different optimal sampling frequencies. It is also evident that the recognition accuracy is influenced by the kind of utterance, which agrees with results found in the literature.

DISCUSSION

Recognition accuracy grows with the probability threshold. This phenomenon is related to the smaller number of clusters obtained for a higher probability threshold: a small number of clusters means a large number of phoneme instances per cluster, which allows better training of the HMMs and thus results in better recognition accuracy.

According to the rule of thumb for the optimal number of LPC coefficients:

p = Fs / 1000 + γ

where Fs – sampling frequency, γ – a fudge constant equal to 2 or 3, the optimal sampling frequency for 10 cepstral coefficients should be 8 kHz. Experiments confirmed this dependency only for male voices; for female voices the optimal frequency is about 12 kHz. Experiments also show that the recognition accuracy is influenced by the kind of utterance, which agrees with results found in the literature.
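The rule of thumb can be evaluated directly (a trivial sketch; the function name is an assumption):

```python
def optimal_lpc_order(fs_hz, gamma=2):
    """Rule of thumb p = Fs/1000 + gamma for the LPC analysis order."""
    return int(fs_hz / 1000) + gamma

print(optimal_lpc_order(8000))   # 10 -> matches the 10 cepstral coefficients used
print(optimal_lpc_order(12000))  # 14
```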

CONCLUSION

- An optimal probability threshold ΔPq_min in tree-based clustering during the HMM training process for Polish speech recognition has been found.

- Experiments indicated the existence of an optimal sampling frequency, ca. 8 kHz in general, with respect to recognition accuracy. Recognizing male and female voices separately gives a slight improvement in recognition accuracy, provided that male utterances are sampled at 8 kHz and female utterances at 12 kHz.

FUTURE WORK

- Enlarging the tested Polish speech database

- Searching for an optimal bandwidth and band shape of the filters used in preprocessing

ACKNOWLEDGMENTS

The author would like to thank Prof. A. Materka from the Technical University of Lodz for his continuous support of this work, and Prof. T. Zieliński from the University of Mining and Metallurgy for his helpful comments and suggestions in improving the manuscript.

