
INTERSPEECH 2011

A Template Based Voice Trigger System Using Bhattacharyya Edit Distance

Evelyn Kurniawati, Samsudin Ng, Karthik Muralidhar, Sapna George

STMicroelectronics Asia Pacific, Pte. Ltd. Singapore [email protected], [email protected], [email protected], [email protected]

Abstract

Dynamic Time Warping (DTW) is frequently used in isolated word recognition systems due to its simplicity and robustness to noise. However, the computational effort required by a DTW based solution is proportional to the number of words registered in the system. Vector Quantization (VQ) is employed to alleviate this by converting the spoken input to a sequence of discrete symbols to be matched with the stored word templates. In this paper, we propose the use of the Bhattacharyya distance as the cost function for this pattern matching problem. The template used is a string of discrete symbols, each modeled by a Gaussian Mixture Model (GMM) representing a context dependent sub-word unit. The system is tested on a 100-template matching task built from two registrations of 50 cable TV channel names to simulate a voice-triggered remote control. An average of 92% accuracy is obtained. A scheme is also proposed to enable a guest user without registration data to use the system efficiently.

Index Terms: dynamic time warping, template matching, edit distance, isolated word recognition.

1. Introduction

The two main approaches used in the isolated word recognition task are the Hidden Markov Model (HMM) and Dynamic Time Warping. Although HMM in general has better performance due to its flexibility in modeling the temporal structure and variability of speech, it requires significant computational resources, which is incompatible with low power devices. Furthermore, the training process of an HMM based speech recognizer needs a substantial transcribed speech database to create a reliable acoustic model.

DTW, on the other hand, stores multiple sequences of features corresponding to known words. During the recognition process, the distance between the input and each of those templates is computed, and the word with the least distance is chosen as the output.

Simplicity of implementation was one of the main reasons a DTW based solution was chosen for our voice triggered remote control project. With the user having control of when to activate the microphone, the DTW no longer has the inherent problem of determining the start and end points of the input speech to be recognized [1]. The computational complexity, however, increases in proportion to the number of template words used in the system. This is the main problem addressed in this paper.

Our work builds on the solution presented in [2]. Instead of using the maximum likelihood of the input resulting from each template as the distance measure, we propose to quantize the input signal such that the comparison between the input and the template is reduced to a simple string matching problem. A string edit distance [3] is then calculated for each template, and we propose the use of the Bhattacharyya distance measure to quantify the mismatch between the input and the template.

This paper is organized as follows. Section 2 presents the description of the full system used in this experiment. The details of the proposed string edit distance with the Bhattacharyya distance measure are explained in section 3. Section 4 discusses the results of the experiment, and section 5 presents the conclusions.
2. System Descriptions

This section elaborates on the full system, from the front end feature extraction, through the training process, which consists of sub-word unit and word template generation, to the final recognition process. A section on how a guest user can use the system is also presented.

2.1. Front End Feature Extraction

The feature vector used for this recognition task consists of 24 MFCCs. The window size per frame is 20 ms with 10 ms overlap. All the speech data used in this experiment are 16 bit, sampled at 16 kHz.

2.2. Sub-word Unit Generation

This is the first part of the training process, where the user is required to record approximately two minutes of their speech. It is recommended to read phonetically rich sentences in order to obtain a more comprehensive set of sub-word units. In this experiment, the user is asked to read a series of Harvard sentences [4].

After going through the front end feature extraction described in section 2.1, the resulting MFCCs are clustered (using the k-means algorithm) into 64 distinct units, roughly corresponding to a collection of sub-words. Each of these clusters is then modeled using a Gaussian Mixture Model with 4 mixtures. This is a similar modeling strategy to that used in [5], but in this experiment no re-clustering was done during word template generation.

For the proposed edit distance, a further calculation is performed to generate the 64 by 64 Bhattacharyya distance matrix. Section 3 will present this scheme in more detail. Figure 1 illustrates the sub-word unit generation process.

Figure 1: Sub-word unit generation

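As a concrete illustration of the front end in section 2.1, the sketch below extracts 24 MFCCs with a 20 ms window and 10 ms hop from 16 kHz audio. The use of librosa is an assumption made here for illustration only; the paper does not name a feature extraction library.

    import librosa

    def extract_mfcc(wav_path):
        # 16 kHz mono audio, as described in section 2.1 (librosa assumed here)
        y, sr = librosa.load(wav_path, sr=16000)
        # 24 MFCCs, 20 ms window (320 samples) with a 10 ms hop (160 samples)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=24,
                                    n_fft=320, hop_length=160)
        return mfcc.T  # one 24-dimensional feature vector per frame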

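The sub-word unit generation of section 2.2 can be sketched as follows, assuming the training frames come from roughly two minutes of enrolment speech. KMeans and GaussianMixture from scikit-learn, and the diagonal covariance choice, are assumptions; the paper only specifies 64 clusters, each modeled by a 4-mixture GMM.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    def build_subword_units(train_frames, n_units=64, n_mix=4, seed=0):
        # Cluster the MFCC frames into 64 rough sub-word units
        labels = KMeans(n_clusters=n_units, random_state=seed).fit_predict(train_frames)
        gmms = []
        for k in range(n_units):
            # Model each cluster with a 4-mixture GMM (diagonal covariance assumed)
            gmm = GaussianMixture(n_components=n_mix, covariance_type='diag',
                                  random_state=seed)
            gmm.fit(train_frames[labels == k])
            gmms.append(gmm)
        return gmms  # one GMM per sub-word unit
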
2.3. Word Template Generation

In this step, the words that the user wants the system to recognize are registered. The user is asked to pronounce the words, and the template generation converts those words into a sequence of sub-word unit indices (obtained from the previous step) based on maximum likelihood. To avoid over-segmentation of the word, a transitional heuristic [2] is employed that allows the sub-word index to change only when there is a significant margin of likelihood difference with the neighboring state. The word template generation is illustrated in figure 2. This process has to be repeated for each word that the user wants to introduce to the system.

Figure 2: Word template generation

Registering the same word more than once increases its chance of being recognized. There are schemes proposed in [6][7][8] to optimize this multiple-pattern case, but they are beyond the scope of this paper, as our intention is to improve the recognition with a large number of word templates.

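A sketch of this conversion is given below, assuming the per-unit GMMs from the previous step. The paper does not give the exact form or value of the likelihood margin, so the margin-based rule and its threshold are assumptions.

    import numpy as np

    def frames_to_subword_sequence(frames, gmms, margin=2.0):
        # Log-likelihood of every frame under each of the 64 sub-word GMMs
        loglik = np.stack([g.score_samples(frames) for g in gmms], axis=1)
        current = int(np.argmax(loglik[0]))
        seq = [current]
        for t in range(1, len(frames)):
            best = int(np.argmax(loglik[t]))
            # Transitional heuristic: switch sub-word units only when the best
            # unit beats the current one by a clear margin (value assumed here)
            if best != current and loglik[t, best] - loglik[t, current] > margin:
                current = best
                seq.append(current)
        return seq  # word template: sequence of sub-word unit indices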

2.4. Matching Process – the Recognition

Given that there are M word templates in the system, the recognition process calculates the probability of the user input feature vector $X_{input}$ being generated by each template. The chosen word is the one which gives the highest likelihood:

$m^{*} = \arg\max_{m} \, p_m(X_{input})$    (1)

Note that the template can be viewed as a sequence of GMMs, and this makes the $p_m(X_{input})$ calculation increasingly expensive with an increasing number of word templates. Figure 3 illustrates this recognition process.

Figure 3: Original method for word recognition

2.5. Guest User Scenario

The system description above requires a user to register both his voice and the words to be recognized. Each user has his or her own sub-word unit profile and word templates. Apart from that, the system also builds one guest profile and set of templates to be used by non-registered users. For example, in a voice-triggered TV remote control use case, each member of a household can register their voice profile. However, if a guest arrives and wants to use the system without registering their voice, this guest profile is the one that will be invoked.

The guest sub-word unit profile is built by combining the registration data from all users. The main issue then is how to build the word templates. Using the word template generation data from each user is not optimal for two reasons. Firstly, it is not generic enough, as the template will be too closely tied to the way that particular user pronounces the word. Secondly, the size of the word templates will be huge, because they need to cover each word repeated by every registered user. This is where the solution from [6][7] to jointly decode multiple patterns can be adopted, but instead of multiple inputs, the scheme is used to aggregate the multiple patterns of each word template. The centroid is then used as the guest word template. The speaker recognition task to select which user profile to use is addressed in a separate publication.

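To make equation (1) concrete, the sketch below scores the input frames against each template, viewed as a sequence of sub-word GMMs. The paper does not specify how frames are aligned to the template states, so a uniform segmentation is assumed here purely for illustration.

    import numpy as np

    def template_loglik(frames, template_gmms):
        # template_gmms: the template's sub-word GMMs in order. The uniform
        # frame-to-state segmentation is an assumption, not the paper's method.
        bounds = np.linspace(0, len(frames), len(template_gmms) + 1).astype(int)
        total = 0.0
        for gmm, start, end in zip(template_gmms, bounds[:-1], bounds[1:]):
            if end > start:
                total += gmm.score_samples(frames[start:end]).sum()
        return total

    def recognize_original(frames, word_templates, gmms):
        # word_templates: {word: sequence of sub-word unit indices}
        # Equation (1): pick the template with the highest likelihood.
        scores = {w: template_loglik(frames, [gmms[i] for i in seq])
                  for w, seq in word_templates.items()}
        return max(scores, key=scores.get)
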
3. Proposed Edit Distance

Instead of treating the input feature vectors as possible observations generated by the template, as described in section 2.4, we propose to convert the input features to a sub-word unit index sequence, similar to the word registration process in section 2.3. What is then needed is to calculate the distance between this string and those that are stored in the templates. The chosen word is the one with the minimum distance. The process is illustrated in figure 4.

Figure 4: Proposed method for word recognition

This proposed method changes the template matching problem into a simple string edit distance calculation [3]. The main issue is to find a suitable distance measure between the sub-word units. The use of a minimum edit distance in speech recognition has been proposed before in [9][10]. In both cases, however, the system requires knowledge of the phonetic class represented by the string. [9] uses the phone confusion matrix as the distance measure, where the phone substitution penalty is inversely proportional to the number of times such a substitution occurs in the training set. This is a data driven approach. Another approach is to look at the acoustic class of the phone being observed. A substitution between two phones of the same class should carry a smaller penalty compared to one between different acoustic classes. This is a linguistically driven approach.

The above two methods could not be adopted in our system because there is no notion of a phone in our system. The sub-word generation in section 2.2 was done using k-means clustering without any transcription, which would have been needed for phone classification.

In the domain of speaker recognition, the Bhattacharyya distance has been successfully used to quantify the difference between two GMMs [11][12]. It has also been used to guide an agglomerative phone clustering process [13]. These results motivate us to use this distance measure for our word recognition task.

The Bhattacharyya distance [5] between two probability distributions $p_1$ and $p_2$ is defined by:

$Bhatt(p_1 \| p_2) = -\ln \left( \int \sqrt{p_1(x) \, p_2(x)} \, dx \right)$    (2)

Each sub-word unit in our system is modeled using a 4-mixture GMM, so the distance between two units is given by:

$Bhatt(p_1 \| p_2) = -\ln \left( \int \sqrt{\sum_{mix=0}^{3} p_{1,mix}(x)} \, \sqrt{\sum_{mix=0}^{3} p_{2,mix}(x)} \, dx \right)$    (3)

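The integral in equations (2) and (3) has no closed form for GMMs, and the paper does not state how it is evaluated. The sketch below uses a Monte Carlo estimate with samples drawn from the equal mixture of the two densities, which is an assumption made here for illustration.

    import numpy as np

    def bhattacharyya_gmm(g1, g2, n_samples=20000):
        # Importance-sampled estimate of the Bhattacharyya coefficient
        # BC = integral of sqrt(p1(x) p2(x)) dx, with proposal q = (p1 + p2) / 2.
        half = n_samples // 2
        x = np.vstack([g1.sample(half)[0], g2.sample(half)[0]])
        logp1 = g1.score_samples(x)
        logp2 = g2.score_samples(x)
        logq = np.logaddexp(logp1, logp2) - np.log(2.0)
        bc = np.mean(np.exp(0.5 * (logp1 + logp2) - logq))
        return -np.log(max(bc, 1e-300))  # equations (2)/(3)

    def bhattacharyya_matrix(gmms):
        # Pre-compute the symmetric 64 x 64 distance matrix offline
        n = len(gmms)
        d = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                d[i, j] = d[j, i] = bhattacharyya_gmm(gmms[i], gmms[j])
        return d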

This distance needs to be calculated for all 64 sub-word units. However, this process can be done offline, before the actual recognition process. The fact that this distance matrix is pre-calculated also contributes to the speed-up obtained compared to the original matching process.

The edit distance is calculated following the Levenshtein distance pseudo-code below:

    function edit_distance(inp[1..m], template[1..n])
        distance[0,0] = 0
        for i = 1 to m
            distance[i,0] = distance[i-1,0] + del_penalty
        for j = 1 to n
            distance[0,j] = distance[0,j-1] + ins_penalty
        for j = 1 to n
            for i = 1 to m
                distance[i,j] = minimum(
                    distance[i-1,j] + del_penalty,
                    distance[i,j-1] + ins_penalty,
                    distance[i-1,j-1] + bhatt(inp[i], template[j]))
        return distance[m,n]

Here, del_penalty, ins_penalty, and bhatt(x, y) correspond to the deletion penalty, the insertion penalty, and the Bhattacharyya distance between sub-word units x and y, respectively. The above distance needs to be calculated for each word template in the system, and the minimum of these determines the chosen word output.

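A runnable Python transcription of this pseudo-code is sketched below; the substitution cost is a lookup into the pre-computed Bhattacharyya matrix. The deletion and insertion penalty values are not given in the paper and are placeholders here.

    import numpy as np

    def edit_distance(inp, template, bhatt, del_penalty=1.0, ins_penalty=1.0):
        # Levenshtein-style dynamic programming with the Bhattacharyya distance
        # between sub-word units as the substitution cost.
        m, n = len(inp), len(template)
        d = np.zeros((m + 1, n + 1))
        d[1:, 0] = np.arange(1, m + 1) * del_penalty
        d[0, 1:] = np.arange(1, n + 1) * ins_penalty
        for j in range(1, n + 1):
            for i in range(1, m + 1):
                d[i, j] = min(d[i - 1, j] + del_penalty,
                              d[i, j - 1] + ins_penalty,
                              d[i - 1, j - 1] + bhatt[inp[i - 1], template[j - 1]])
        return d[m, n]

    def recognize_edit_distance(input_seq, word_templates, bhatt):
        # word_templates: {word: sub-word unit index sequence}; pick the closest one.
        scores = {w: edit_distance(input_seq, t, bhatt)
                  for w, t in word_templates.items()}
        return min(scores, key=scores.get)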

4. Experiment Results and Discussion

The proposed algorithm is tested on a 50-word vocabulary voice command system meant for changing channels with a remote control. Two registrations per word are used to increase the accuracy, totaling 100 templates. The words used are listed in table 1.

Table 1. Vocabulary used in the experiment (50 cable TV channel names).

The database was recorded using the H2 Zoom microphone shown in figure 5, in a noiseless environment. Seven people were involved in this exercise (3 female, 4 male), each recording 4 sets of the 50 words listed in table 1 (200 words per person in total). Two sets were used to create the word templates and the other 2 sets were used to test the recognition process.

Figure 5: H2 Zoom microphone

There was no significant difference in recognition rate between the original matching process and the use of the edit distance. An average of 91.71% accuracy was obtained from the former, and 92% from the latter. The main impact is in the running time, as the edit distance was able to execute 3 times faster than the original method on this 100-template matching task.

Figure 6: Running time (in milliseconds) of the two pattern matching algorithms. The x axis shows the number of templates in the task.

Figure 6 shows the average run time of the recognition task, comparing the original pattern matching method with the proposed edit distance. The x axis shows the number of templates being compared at each run. It is evident from the graph that the running time of the original pattern matching algorithm increases proportionally with the number of templates being compared. Its complexity, however, is lower than that of the proposed edit distance method when the number of templates is relatively small (below 30). Above this number, the proposed edit distance measure has a speed advantage.

It was also observed that the running time of the proposed method is stable regardless of the number of templates being matched. This can be explained by looking at the complexity breakdown of this method. Without taking into account the front end feature extraction, which is a process common to all speech recognition, the majority (close to 98% on average) of the computational load in the new search method is actually spent in generating the sub-word unit index sequence of the input signal. This is the process described in section 2.3, involving multiple computations of GMM probabilities. For the same reason, the original pattern matching algorithm shows a distinct increase in complexity as the number of templates increases, because it also computes posterior probabilities given the sequence of GMM distributions in each template. The proposed edit distance avoids this complex task by converting the templates into discrete symbols and pre-calculating the Bhattacharyya distance between the symbols, so the only time a posterior probability has to be computed from a GMM distribution is during the generation of the sub-word index sequence.

Table 2 summarizes the system accuracy for the registered user test as well as the guest user test, using the original and the proposed edit distance measure.

Table 2. System accuracy comparison.

                                                           Original    Edit distance
    Registered user (own sub-word units)                   91.71%      92%
    Guest user (guest sub-word units)                       76%         78%
    Guest user with word registration (guest sub-word)      -           86%

For the guest user scenario, only 78% accuracy is achieved with the edit distance method, and 76% with the original method. Note that this was tested on a user who was not involved in the sub-word unit generation process. Their registration speech was removed from the system, and their word registration was discarded in favor of the centroid word template discussed earlier.

In the second part of the test, guest word registration was introduced; the guest sub-word units, however, were not changed. It turns out that a significant improvement was achieved, with 86% accuracy using the edit distance measure. This shows that the guest sub-word units are representative enough to characterize the guest user's pronunciation. It is possible that the low performance of the centroid word template for the guest user was due to the high pronunciation variety of the registered words among the remaining users, given that all subjects involved in this exercise are non-native English speakers.

Another interesting observation is the margin between the use of the user's own sub-word units (92%) and the guest sub-word units (86%). In both cases word registration is performed, and the only factor causing this difference is the sub-word unit set itself. These sub-word units can be seen as a collection of GMMs that represents the speaker profile. It is possible to introduce an adaptation strategy to enhance the guest sub-word units to better match the guest's personal profile. A similar process exists in the domain of speaker recognition, where MAP or ML methods are employed to adapt from a universal background model to a specific speaker profile. The only difference is that in our case the adaptation data will be very limited, because it can only come from the test words spoken while the guest user is using the system. This adaptation strategy, and an improvement to the centroid word template to better accommodate variation in pronunciation, will be the main part of our future work.

The system has also been tested in real time, using a front end built with the PortAudio library [15] and a simple energy based voice activity detector. A performance similar to that of the offline testing is achieved.

5. Conclusions

Although the use of dynamic time warping is fairly common for isolated word recognition, its complexity increases with the size of the vocabulary introduced to the system, making it unsuitable for low power devices. This paper proposed a new pattern matching algorithm using an edit distance measure to find the most probable word template. A Bhattacharyya distance matrix is computed prior to the recognition task to indicate the similarity between the sub-word units, each of which is modeled using a Gaussian Mixture Model. The scheme was tested on a 100-template matching task (2 registrations of 50 words) to simulate the use of a voice-triggered remote control. Accuracy similar to that of the original pattern matching algorithm is obtained, with a substantial reduction in complexity in all the tests performed. On average, 92% accuracy is achieved when a registered user utilizes the system. A scheme is also developed for a guest user by building guest sub-word units from the available registered speech. The guest word template is built by combining all available word registrations and finding the centroid. Only 78% accuracy is achieved for the guest use case, but 86% can be obtained when guest word registration is employed. It is speculated that the poor performance of the guest word template is due to the variety of pronunciations in the available word registrations. The development of a method to better accommodate this pronunciation variety will be part of our future work. The adaptation from guest sub-word units to user-specific sub-word units will also be part of our future investigation.

6. References

[1] Xavier Anguera, Robert Macrae, Nuria Oliver, "Partial Sequence Matching using an Unbounded Dynamic Time Warping Algorithm," IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 3582-3585.
[2] Gururaj G. Putraya, "Subword Based Isolated Word Recognition," Master's thesis, Department of Electrical Engineering, Indian Institute of Science, Bangalore.
[3] Simon Dobrisek, Nikola Pavesic, France Mihelic, "An Edit Distance Model for the Approximate Matching of Timed Strings," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4.
[4] Harvard sentences, http://en.wikipedia.org/wiki/Harvard_sentences
[5] Hyeopwoo Lee, Sukmoon Chang, Dongsuk Yook, Yongserk Kim, "A Voice Trigger System Using Keyword and Speaker Recognition for Mobile Devices," IEEE Transactions on Consumer Electronics, vol. 55, pp. 2377-2384.
[6] Nishanth Ulhas Nair, T. V. Sreenivas, "Joint Decoding of Multiple Speech Patterns for Robust Speech Recognition," IEEE Workshop on Automatic Speech Recognition and Understanding, 2007, pp. 93-98.
[7] Nishanth Ulhas Nair, T. V. Sreenivas, "Multi Pattern Dynamic Time Warping for Automatic Speech Recognition," IEEE Region 10 Conference, 2008, pp. 1-6.
[8] Nishanth Ulhas Nair, T. V. Sreenivas, "Forward/Backward Algorithms for Joint Multi Pattern Speech Recognition," 16th European Signal Processing Conference, 2008.
[9] Jasha Droppo, Alex Acero, "Context Dependent Phonetic String Edit Distance for Automatic Speech Recognition," IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 4358-4361.
[10] Kartik Audhkhasi, Ashish Verma, "Keyword Search using Modified Minimum Edit Distance Measure," IEEE International Conference on Acoustics, Speech and Signal Processing, 2007, pp. 929-932.
[11] Chang Huai You, Kong Aik Lee, Haizhou Li, "An SVM Kernel with GMM Supervector Based on the Bhattacharyya Distance for Speaker Recognition," IEEE Signal Processing Letters, vol. 16, no. 1, January 2009.
[12] Chang Huai You, Kong Aik Lee, Haizhou Li, "A GMM Supervector Kernel with the Bhattacharyya Distance for SVM Based Speaker Recognition," IEEE International Conference on Acoustics, Speech and Signal Processing, 2009, pp. 4221-4224.
[13] Brian Mak, Etienne Barnard, "Phone Clustering using the Bhattacharyya Distance," International Conference on Spoken Language Processing, 1996, pp. 2005-2008.
[14] T. Kailath, "The Divergence and Bhattacharyya Distance Measures in Signal Selection," IEEE Transactions on Communication Technology, 1967, pp. 52-60.
[15] PortAudio - an Open Source Cross Platform Audio API, www.portaudio.com
