DYNAMICALLY CONFIGURABLE ACOUSTIC MODELS FOR SPEECH RECOGNITION

Mei-Yuh Hwang and Xuedong Huang

Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
mhwang,[email protected]

ABSTRACT

Senones were introduced to share hidden Markov model (HMM) parameters at a sub-phonetic level in [3], and decision trees were incorporated to predict unseen phonetic contexts in [4]. In this paper we describe two applications of the senonic decision tree: (1) dynamically downsizing a speech recognition system for small platforms, and (2) sharing the Gaussian covariances of continuous density HMMs (CHMMs). We experimented with how to balance different parameters to obtain the best trade-off between recognition accuracy and system size. The dynamically downsized system, without retraining, performed even better than the regular Baum-Welch [1] trained system. The shared covariance model provided as good a performance as the unshared full model and thus gave us the freedom to increase the number of Gaussian means to improve the accuracy of the model. Combining the downsizing and covariance-sharing algorithms, a total of 8% error reduction was achieved over the Baum-Welch trained system with approximately the same parameter size.

1. INTRODUCTION

One of the challenges in putting speech-recognition research systems into practice is how to utilize users' resources without accuracy degradation. To fit the platforms most users (especially PC users) have, a simplified acoustic/language model is often provided instead, which results in higher recognition error rates in comparison with the research system. Given the wide spectrum of users' platforms, one good compromise is to provide the best model and allow the user to dynamically decide how much resource he/she would like to allocate to the speech recognizer.

The hierarchy in the senonic decision tree [4] provides a good downsizing architecture. Leaves of the same ancestor node represent Markov states of similar acoustic realization. Based on this principle, we first build the best acoustic model with as many parameters as possible. In other words, the tree is grown to the deepest level that can still be well trained by the given training corpus. This usually takes a few experiments that plot the error rate against the parameter size on a development test set. We then cluster all the parameters under some pre-chosen ancestor node down to a smaller size, based on the criterion of minimizing the loss in the likelihood of generating the training data. Using the statistics embedded in the acoustic model, we do not need to refer to the original training speech. We will show that this decision-tree based downsizing algorithm actually provides slightly better performance (4%) than the full Baum-Welch trained system with the same reduced parameter size. The downsizing algorithm is not only fast and accurate, it also does not require the training speech. It is therefore particularly useful in real applications, giving the user the flexibility of dynamically configuring the size of the speech recognizer and trading accuracy against resource utilization.

While using CHMMs with Gaussian density functions, one should be careful about smoothing the covariances. The shared-Gaussian model [2] attempted to smooth the covariances through sharing. When a covariance is too small, the model becomes tailored too closely to the training data and thus is not robust on new test data. Here we propose to use the hierarchy of the senonic decision tree to guide the sharing of the covariances: only covariances of leaves under the same pre-selected ancestor node can be tied. Using a similar likelihood-loss measurement, Gaussian covariances under the same pre-selected ancestor are clustered, while the Gaussian means remain separate. When the covariances are tied, not only is the parameter size reduced, the covariances also become more robust. Furthermore, after the number of covariances is reduced, we have the luxury of increasing the number of Gaussian means to increase the level of detail of the acoustic model. In our experiments, the decision-tree shared covariance model suffers no or only very minor performance degradation compared with the unshared full-sized model. When more Gaussian means than covariances are allocated, the performance improves slightly (4%) over the model that has about the same parameter size but an equal number of Gaussian means and covariances.

In Section 2 we describe the downsizing algorithm and the clustering objective function in detail. Section 3 introduces the covariance-sharing algorithm. Section 4 describes the experimental speech recognition tasks and results. Section 5 summarizes our key findings.

2. THE DOWNSIZING ALGORITHM

Assume the acoustic model of a speech recognizer uses senonic decision-tree based CHMMs with Gaussian mixtures. Suppose there are $S_1$ senones (tree leaves) in the best system tuned on particular training and development corpora. Let us call these senones the deep senones, since they are located at the deepest level of the tree. Assume there are $M_1$ Gaussians for each deep senone. Our goal is to reduce the number of Gaussians ($S_1 \times M_1$) to a smaller value so that the speech recognizer runs efficiently on a smaller platform.

To downsize the acoustic model, suppose we re-select the set of senonic decision-tree nodes that correspond to $S_2$ ($< S_1$) senones as the new leaves. Let us call this set of leaves the shallow senones, as they are closer to the root. Each shallow senone $s$ has an associated set of deep senones that are descendants of $s$ in the tree. Let $D(s)$ denote the Gaussians of all the deep senones that are descendants of $s$.

We would like to downsize the system by trimming the decision tree to the shallow-senone level; for every shallow senone $s$ we need to decide the best $M_2$ Gaussians, where $M_2 \le |D(s)|$, without involving the training speech. Our algorithm merges the $|D(s)|$ deep Gaussians into $M_2$ Gaussians for every shallow senone $s$. While merging Gaussians, the objective of minimizing the loss in likelihood is always followed. Since the deep Gaussian parameters are trained in a maximum-likelihood (ML) fashion, for every deep Gaussian $N_1(\mu_1, \Sigma_1)$ the Baum-Welch ML estimates are

    \mu_1 = \frac{1}{\gamma_1} \sum_{x \in X} \gamma(x)\, x,
    \qquad
    \Sigma_1 = \frac{1}{\gamma_1} \sum_{x \in X} \gamma(x)\, (x - \mu_1)(x - \mu_1)^T        (1)

where $X$ = {speech data aligned to Gaussian 1, with occupancy count $\gamma(x)$ for each frame $x$} and $\gamma_1 = \sum_{x \in X} \gamma(x)$ is the total occupancy of Gaussian 1 in the training data. The likelihood of generating the set of data $X$ is thus

    L(X) = \sum_{x \in X} \gamma(x) \log N(x; \mu_1, \Sigma_1)
         = -\frac{1}{2}\, \gamma_1 \left( \log|\Sigma_1| + d \log 2\pi + d \right)

where $d$ is the dimensionality of the feature stream $x$. Similarly, for any Gaussian $N_2(\mu_2, \Sigma_2)$ in $D(s)$ associated with data $Y$, $L(Y) = -\frac{1}{2}\,\gamma_2\,(\log|\Sigma_2| + d \log 2\pi + d)$.
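
For concreteness, the data likelihood above can be evaluated directly from the stored statistics, with no access to the speech frames. The following is a minimal Python sketch (the helper name and the diagonal-covariance representation are ours, not the paper's):

    import numpy as np

    def data_log_likelihood(gamma, diag_cov):
        """Log-likelihood of the data aligned to an ML-estimated diagonal Gaussian,
        computed from its sufficient statistics alone:
            L(X) = -0.5 * gamma * (log|Sigma| + d*log(2*pi) + d)
        gamma    -- total occupancy of the Gaussian in the training data
        diag_cov -- ML-estimated diagonal covariance, shape (d,)
        """
        d = diag_cov.shape[0]
        log_det = float(np.sum(np.log(diag_cov)))   # log|Sigma| of a diagonal matrix
        return -0.5 * gamma * (log_det + d * np.log(2.0 * np.pi) + d)

The quadratic term collapses to the constant $d$ because the covariance is the ML estimate of exactly this data, which is what makes the merging below possible without the raw speech.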

When both sets of data $X$ and $Y$ are modeled by the same Gaussian $N(\mu, \Sigma)$, setting the derivative of the Q function in [1] with respect to $(\mu, \Sigma)$ to zero gives the following ML estimates, assuming the data alignment is fixed:

    \mu = \frac{\gamma_1 \mu_1 + \gamma_2 \mu_2}{\gamma_1 + \gamma_2}

    \Sigma = \frac{\gamma_1 \left[ \Sigma_1 + (\mu_1 - \mu)(\mu_1 - \mu)^T \right]
                   + \gamma_2 \left[ \Sigma_2 + (\mu_2 - \mu)(\mu_2 - \mu)^T \right]}
                  {\gamma_1 + \gamma_2}                                                    (2)

Thus the loss in the likelihood of generating data $X$ and $Y$ after Gaussians 1 and 2 are merged (tied) is

    \Delta L_{1 \cup 2} = \frac{1}{2} \left[ (\gamma_1 + \gamma_2) \log|\Sigma|
                          - \gamma_1 \log|\Sigma_1| - \gamma_2 \log|\Sigma_2| \right]      (3)

The two Gaussians in $D(s)$ that give the least likelihood loss are merged, reducing the number of Gaussians by one. Notice that the new Gaussian has a total occupancy of $\gamma_1 + \gamma_2$ and that its aligned data include $X \cup Y$. The raw data $X$ and $Y$ do not need to be available; only the occupancy count and the Gaussian mean and covariance need to be recorded for each remaining Gaussian. This process continues until $M_2$ Gaussians are left for every shallow senone $s$. The mixture weights of these $M_2$ Gaussians for shallow senone $s$ are the normalized occupancies.

Notice that the above clustering is guided by the hierarchy of the senonic decision tree between the deep senones and the shallow senones. Arbitrary clustering across far-away relatives in the tree is discouraged because they represent quite different acoustic realizations. In Section 4 we will show that this reduced model, without re-training, is slightly better than the standard ML trained model that has $S_2$ senones with $M_2$ Gaussians per senone. Further Baum-Welch training on the auto-reduced model does not improve the performance, as the model is already saturated at a local optimum. This once more demonstrates the efficiency and accuracy of the downsizing algorithm.
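
The whole downsizing step can therefore be run on the stored (occupancy, mean, covariance) triples alone. The sketch below is ours (hypothetical helper names, diagonal covariances as numpy vectors, brute-force pair search for brevity); it follows Formulas (2) and (3) to reduce the Gaussians of one shallow senone to $M_2$ components:

    import numpy as np

    def merge_full(g1, g2):
        """Merge two Gaussians (occupancy, mean, diagonal covariance) into the
        single ML Gaussian of their pooled data, as in Formula (2)."""
        (n1, m1, v1), (n2, m2, v2) = g1, g2
        n = n1 + n2
        m = (n1 * m1 + n2 * m2) / n
        v = (n1 * (v1 + (m1 - m) ** 2) + n2 * (v2 + (m2 - m) ** 2)) / n
        return n, m, v

    def merge_loss(g1, g2):
        """Likelihood loss of Formula (3) for merging g1 and g2 (diagonal case)."""
        (n1, _, v1), (n2, _, v2) = g1, g2
        _, _, v = merge_full(g1, g2)
        return 0.5 * ((n1 + n2) * np.sum(np.log(v))
                      - n1 * np.sum(np.log(v1)) - n2 * np.sum(np.log(v2)))

    def downsize_senone(deep_gaussians, m2):
        """Greedily merge the deep Gaussians D(s) of one shallow senone down to
        m2 Gaussians, always taking the pair with the smallest likelihood loss.
        Returns the surviving Gaussians and their mixture weights
        (the normalized occupancies)."""
        g = list(deep_gaussians)          # each item: (occupancy, mean, diag cov)
        while len(g) > m2:
            loss, i, j = min((merge_loss(g[i], g[j]), i, j)
                             for i in range(len(g)) for j in range(i + 1, len(g)))
            g[i] = merge_full(g[i], g[j])
            del g[j]
        total = sum(n for n, _, _ in g)
        return g, [n / total for n, _, _ in g]

In practice the pairwise losses would be cached rather than recomputed at each step, but the greedy result is the same; running this per shallow senone is one way to realize the reduction from $S_1 \times M_1$ to $S_2 \times M_2$ Gaussians described above.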

3. THE SHARED COVARIANCE MODEL

One way to smooth the covariances in the Gaussian model is to share them across similar acoustic contexts so that there are enough data to train the shared covariance. When two Gaussians $N_1(\mu_1, \Sigma_1)$ and $N_2(\mu_2, \Sigma_2)$, associated with data $X$ and $Y$ respectively, decide to tie their covariances to $\Sigma_{1 \cup 2}$, the Q function becomes

    Q = -\frac{1}{2} \sum_{x \in X} \gamma(x) \left[ d \log 2\pi + \log|\Sigma_{1 \cup 2}|
          + (x - \mu_1)^T \Sigma_{1 \cup 2}^{-1} (x - \mu_1) \right]
        -\frac{1}{2} \sum_{y \in Y} \gamma(y) \left[ d \log 2\pi + \log|\Sigma_{1 \cup 2}|
          + (y - \mu_2)^T \Sigma_{1 \cup 2}^{-1} (y - \mu_2) \right]

Setting $\partial Q / \partial \mu_1 = 0$ gives the same estimate as Formula (1) for $\mu_1$; similarly for $\mu_2$. Setting $\partial Q / \partial \Sigma_{1 \cup 2} = 0$ gives

    \Sigma_{1 \cup 2} = \frac{\gamma_1 \Sigma_1 + \gamma_2 \Sigma_2}{\gamma_1 + \gamma_2}

Again the likelihood loss is computed as in Formula (3). The two Gaussians that yield the least likelihood loss after their covariances are tied are chosen to share their covariance. This process repeats as in Section 2. The senonic decision tree is used again to constrain the clustering: covariances from far-away relative nodes are not allowed to be merged, since they model quite different acoustics.

Notice a few differences with respect to Section 2:

- $\Sigma_{1 \cup 2}$ here is not affected by the means, while Formula (2) is.

- At the end of the covariance clustering there are still $S_1 \times M_1$ Gaussians, with $S_1 \times M_1$ means and $S_2 \times M_2$ covariances.

- Neither the Gaussian means nor the Gaussian mixture weights are changed.

- For further covariance-only merges to proceed, the merged count $\gamma_1 + \gamma_2$ has to be recorded along with the merged covariance $\Sigma_{1 \cup 2}$. This count is used purely for the computation of the merge; it is not related to the Gaussian mixture weight. To see this, suppose a third Gaussian 3 joins the same covariance as Gaussians 1 and 2. Setting the derivative of the Q function to zero gives (see the check below)

      \Sigma_{1 \cup 2 \cup 3} = \frac{(\gamma_1 + \gamma_2)\, \Sigma_{1 \cup 2} + \gamma_3 \Sigma_3}{\gamma_1 + \gamma_2 + \gamma_3}
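
As a quick numerical check of the last point (our own illustration, not from the paper): pooling the covariances incrementally, using the recorded merged counts, gives exactly the same tied covariance as pooling all of the original covariances at once.

    import numpy as np

    def pool_covariances(stats):
        """Occupancy-weighted pooling of diagonal covariances:
        Sigma = sum_k gamma_k * Sigma_k / sum_k gamma_k."""
        total = sum(n for n, _ in stats)
        pooled = sum(n * v for n, v in stats) / total
        return pooled, total

    rng = np.random.default_rng(0)
    s1, s2, s3 = [(g, rng.random(3) + 0.1) for g in (50.0, 30.0, 20.0)]
    v12, n12 = pool_covariances([s1, s2])                    # tie Gaussians 1 and 2 first
    v_incremental, _ = pool_covariances([(n12, v12), s3])    # then add Gaussian 3
    v_direct, _ = pool_covariances([s1, s2, s3])             # pool all three at once
    assert np.allclose(v_incremental, v_direct)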

Therefore the following notation can summarize the above clustering procedures:

- For the merge of the entire Gaussian, including the mean, the covariance, and the mixture weight:

      (\gamma_1, \mu_1, \Sigma_1) \oplus (\gamma_2, \mu_2, \Sigma_2)
          = \left( \gamma_1 + \gamma_2,\;
                   \frac{\gamma_1 \mu_1 + \gamma_2 \mu_2}{\gamma_1 + \gamma_2},\;
                   \frac{\gamma_1 [\Sigma_1 + (\mu_1 - \mu)(\mu_1 - \mu)^T]
                         + \gamma_2 [\Sigma_2 + (\mu_2 - \mu)(\mu_2 - \mu)^T]}
                        {\gamma_1 + \gamma_2} \right)

- For the merge of the covariance only:

      (\gamma_1, \Sigma_1) \oplus (\gamma_2, \Sigma_2)
          = \left( \gamma_1 + \gamma_2,\;
                   \frac{\gamma_1 \Sigma_1 + \gamma_2 \Sigma_2}{\gamma_1 + \gamma_2} \right)
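
Putting the covariance-only merge to work, a sharing pass might look as follows. This is again a sketch with hypothetical names: diagonal covariances are numpy vectors, and the candidate covariances are grouped by their pre-selected ancestor node so that far-away relatives in the tree are never tied.

    import numpy as np

    def tie_loss(n1, v1, n2, v2):
        """Likelihood loss (Formula (3)) when two diagonal covariances are tied
        to their occupancy-weighted pool; the means are left untouched."""
        v = (n1 * v1 + n2 * v2) / (n1 + n2)
        return 0.5 * ((n1 + n2) * np.sum(np.log(v))
                      - n1 * np.sum(np.log(v1)) - n2 * np.sum(np.log(v2)))

    def share_covariances(groups, target_per_group):
        """Greedy covariance-only clustering. `groups` maps a pre-selected
        ancestor node to a list of [occupancy, diag_cov] entries; covariances
        are only tied within a group. Returns the groups with covariances tied."""
        for covs in groups.values():
            while len(covs) > target_per_group:
                loss, i, j = min((tie_loss(covs[i][0], covs[i][1],
                                           covs[j][0], covs[j][1]), i, j)
                                 for i in range(len(covs))
                                 for j in range(i + 1, len(covs)))
                n1, v1 = covs[i]
                n2, v2 = covs[j]
                # Record the merged occupancy so later covariance-only merges
                # keep pooling exactly; it does not touch the mixture weights.
                covs[i] = [n1 + n2, (n1 * v1 + n2 * v2) / (n1 + n2)]
                del covs[j]
        return groups

After such a pass every Gaussian keeps its own mean and mixture weight but points at one of the shared covariances, which is what allows the number of means to grow while the covariance budget stays fixed.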

4. EXPERIMENTS

4.1. Task

All experiments were carried out on the 1994 ARPA Hub 1 unlimited-vocabulary speaker-independent task, with the official 65k trigram language model supplied by ARPA. The acoustic training data was the SI-284 WSJ0 and WSJ1 set. Speech was represented by MFCCs with first-order and second-order differences. Each utterance was normalized by its estimated speech mean and silence mean. Gender-independent Gaussian acoustic models with diagonal covariances were used throughout all the experiments reported here. Due to time constraints, only results on the H1 DEV94 set are reported. This set consisted of 10 male and 10 female speakers, each uttering about 15 sentences, for 310 utterances in total. Simple word alignment was used to compute the word error rate, which included insertion errors. The phonetic pronunciation dictionary was derived from Carnegie Mellon University.

4.2. The Downsizing Algorithm

The baseline system we built consisted of 6400 context-dependent (cd) senones with 12 Gaussians per senone. (From now on we use the notation $S_1{*}M_1$ to denote $S_1$ cd senones with $M_1$ Gaussians per senone.) The regular procedure to train an $S_1{*}M_1$ context-dependent model is:

- Use the LBG clustering algorithm [5] on the Markov state segmentation created by an existing model to generate the initial seed for the context-independent (ci) model.
- Use the Baum-Welch or Viterbi algorithm to train the ML ci model.
- Train the cd model from the trained ci model.

The word error rate on H1-DEV94 is shown in the first row of Table 1. The second row of Table 1, also trained through regular Baum-Welch, indicated a 16% error-rate increase when the parameter size is reduced by a factor of 3.2.

The third row of Table 1 shows the error rate obtained with the downsizing algorithm described in Section 2. This system was generated from the 6400*12 system without re-training. It actually performed slightly better than the standard 3000*8 system. This small difference may be insignificant and may be due to the fact that the standard 3000*8 system was trained from an inferior starting point (we have found that the initial model in CHMMs can make up to a 5% difference in the local-optimum performance). However, it certainly demonstrated that the dynamic downsizing algorithm was accurate and efficient. Further training on the auto-downsized model did not provide further improvement. This once more demonstrated the accuracy of the downsizing algorithm.

Table 1: Word error rates of ML trained models vs. auto-downsized models.

    system                     H1DEV94   % error increase
    6400*12 (ML trained)       8.95%     baseline
    3000*8  (ML trained)       10.37%    +16%
    3000*8  (auto-downsized)   9.94%     +11%

4.3. The Shared Covariance Model

To demonstrate that the shared-covariance model does not degrade the performance while saving parameter size, we first downsized the 6400*12 system to 4000*9 Gaussians with the algorithm described in Section 2. The performance of the auto-downsized 4000*9 model is shown in the first row of Table 2. To share the covariances, we set the shallow senones to $S_2 = 1500$ and $M_2 = 8$. That is, the 4000*9 = 36k covariances are clustered down to 12k covariances, guided by the senonic decision-tree hierarchy. The shared covariance model performed almost as well as the unshared model, as row 2 of Table 2 indicates. The loss in likelihood after merging is also shown in the table. To sum up, going from 6400*12 Gaussians to 4000*9 Gaussians, and then to 1500*8 covariances, the total loss in likelihood is 1.98e+7.

Table 2: Word error rates of the shared-covariance model (row 2). The number of parameters includes the Gaussian means, diagonal covariances, and the mixture weights.

    system                        H1DEV94   likelihood loss (in log)   #parameters
    4000*9 (auto-downsized)       9.48%     1.58e+7                    72kd + 36k
    4000*9 μ's, 1500*8 Σ's        9.55%     0.40e+7                    48kd + 36k
    3000*8 (auto-downsized)       9.94%     2.60e+7                    48kd + 24k

Notice that the shared covariance model in Table 2 consisted of 36k Gaussian means and 12k Gaussian diagonal covariances, which amounts to 48kd parameters plus 36k Gaussian mixture weights, where d is the dimensionality of the speech feature vector. The 3000*8 auto-downsized model described in Section 4.2 also had a parameter size of 48kd for the Gaussians, but a smaller number of mixture weights (24k). The increased number of Gaussian means and mixture weights explains the 4% superiority of the shared covariance model. In Table 2 we also compare the likelihood loss of the auto-downsized 3000*8 model. Again, the larger likelihood loss (2.60e+7) explains the lower accuracy of that model. Therefore, even though the 4% improvement is small, we are confident that the shared covariance model is superior to the standard model.
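
To make the parameter bookkeeping explicit (our arithmetic, spelled out from the counts above):

    \begin{align*}
    \text{shared-covariance model: } & (4000 \cdot 9)\,d + (1500 \cdot 8)\,d
        = (36\mathrm{k} + 12\mathrm{k})\,d = 48\mathrm{k}\,d
        \text{ Gaussian parameters} + 36\mathrm{k} \text{ mixture weights} \\
    \text{auto-downsized } 3000{*}8\text{: } & (3000 \cdot 8)\,d + (3000 \cdot 8)\,d
        = 48\mathrm{k}\,d
        \text{ Gaussian parameters} + 24\mathrm{k} \text{ mixture weights}
    \end{align*}

Both systems therefore spend the same 48kd values on Gaussians; the shared-covariance model simply shifts 12kd of them from covariances to means and carries 12k extra mixture weights.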

Comparing the regular ML trained 3000*8 model with the 1500*8 shared covariance model, the latter outperformed the former by 8%, with a minimal memory increase (12000 mixture weights).

One interesting question is whether covariance sharing can be applied to the highly accurate 6400*12 model to improve its performance. Unfortunately, we did not find further meaningful improvement by increasing the number of Gaussian means, as Figure 1 illustrates. Perhaps we have reached the upper bound of the parameter size supported by the given training corpus, even for the Gaussian means. It could also be that the decision tree became less reliable as it grew deeper (the training data and the statistics used in the deeper nodes became sparser). However, we notice that when the parameter size is overwhelming, as in the 12800*12 system, the shared covariance model is indeed more robust than the unshared full model. Therefore, we believe that when more training data is available while memory is still an issue, the shared covariance model will play an important role in improving the performance. (A variant of the described system can be downloaded from http://research.microsoft.com.)

[Figure 1: Word error rates vs. #Gaussian means. Word error rate (%) for the standard ML trained model and the shared-covariance model at the 6400*12, 9600*14, and 12800*12 configurations, with the 3000*8 auto-downsized model shown for reference.]

Another interesting experiment for the shared covariance model is to compare it with the senone-dependent covariance model, which ties the covariances of all the Gaussians belonging to the same cd senone $s$:

    \Sigma_s = \frac{\sum_{k \in s} \gamma_k \Sigma_k}{\sum_{k \in s} \gamma_k}

It has the advantage of saving $d(M_1 - 1)$ multiplications per senone while computing the Gaussian density values. Though indeed faster, it had more than a 10% error increase, as shown in Table 3. On the other hand, with approximately the same number of shared covariances, the likelihood-based sharing had no degradation. This illustrates the importance of preserving the variability of the covariances within the same senone.

Table 3: Word error rates of the shared-covariance model (row 3) vs. that of the senone-dependent covariance model (row 2).

    system                                 H1DEV94   % error increase
    6400*12 (ML trained)                   8.95%     –
    6400*1 Σ's (senone-dependent)          9.92%     +10.8%
    535*12 Σ's (min likelihood loss)       8.96%     +0.1%

5. CONCLUSION

We propose an efficient and accurate algorithm for dynamically downsizing the acoustic model of a speech recognizer to fit users' resource requirements. In addition, the shared covariance model can either reduce the memory requirement without an increase in the recognition error rate, or enhance the performance by re-allocating the saved space to more Gaussian means and mixture weights. Both algorithms use the likelihood loss and the decision-tree hierarchy to guide the clustering. Both provide accurate models without re-training and do not need to involve the raw training data. Combining the two algorithms yielded a system with an 8% error reduction over the standard ML trained model, with minimal memory increase.

6. REFERENCES

[1] Baum, L. E. An Inequality and Associated Maximization Technique in Statistical Estimation of Probabilistic Functions of Markov Processes. Inequalities, vol. 3 (1972), pp. 1–8.

[2] Huang, X., Hwang, M., Jiang, L., and Mahajan, M. Deleted Interpolation and Density Sharing for Continuous Hidden Markov Models. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996.

[3] Hwang, M. and Huang, X. Subphonetic Modeling with Markov States – Senone. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. I, 1992, pp. 33–36.

[4] Hwang, M., Huang, X., and Alleva, F. Predicting Unseen Triphones with Senones. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. II, 1993, pp. 311–314.

[5] Linde, Y., Buzo, A., and Gray, R. An Algorithm for Vector Quantizer Design. IEEE Transactions on Communication, vol. COM-28 (1980), pp. 84–95.