IEICE TRANS. INF. & SYST., VOL.E91–D, NO.3 MARCH 2008

PAPER  Special Section on Robust Speech Processing in Realistic Environments

Linear Discriminant Analysis Using a Generalized Mean of Class Covariances and Its Application to Speech Recognition

Makoto SAKAI†,††a), Norihide KITAOKA††b), Members, and Seiichi NAKAGAWA†††c), Fellow

SUMMARY   To precisely model the time dependency of features is one of the important issues for speech recognition. Segmental unit input HMM with a dimensionality reduction method has been widely used to address this issue. Linear discriminant analysis (LDA) and heteroscedastic extensions, e.g., heteroscedastic linear discriminant analysis (HLDA) or heteroscedastic discriminant analysis (HDA), are popular approaches to reduce dimensionality. However, it is difficult to find one particular criterion suitable for any kind of data set in carrying out dimensionality reduction while preserving discriminative information. In this paper, we propose a new framework which we call power linear discriminant analysis (PLDA). PLDA can describe various criteria including LDA, HLDA, and HDA with one control parameter. In addition, we provide an efficient method for selecting the control parameter without training HMMs or testing recognition performance on a development data set. Experimental results show that PLDA is more effective than conventional methods for various data sets.
key words: speech recognition, feature extraction, multidimensional signal processing

1. Introduction

Although Hidden Markov Models (HMMs) have been widely used to model speech signals for speech recognition, they cannot precisely model the time dependency of feature parameters. In order to overcome this limitation, many extensions have been proposed [1]–[5]. Segmental unit input HMM [1] has been widely used for its effectiveness and tractability. In segmental unit input HMM, the immediate use of several successive frames as an input vector inevitably increases the number of parameters. Therefore, a dimensionality reduction method is applied to the feature vectors.

Linear discriminant analysis (LDA) [6], [7] is widely used to reduce dimensionality and is a powerful tool for preserving discriminative information. LDA assumes that each class has the same class covariance [8]. However, this assumption does not necessarily hold for a real data set. In order to overcome this limitation, several methods have been proposed. Heteroscedastic linear discriminant analysis (HLDA) can deal with unequal covariances because maximum likelihood estimation is used to estimate parameters for different Gaussians with unequal covariances [9]. Heteroscedastic discriminant analysis (HDA) was proposed as another objective function, which employs individually weighted contributions of the classes [10]. The effectiveness of these methods for some data sets has been experimentally demonstrated. However, it is difficult to find one particular criterion suitable for any kind of data set.

In this paper we show that these three methods have a strong mutual relationship and provide a new interpretation for them. We then propose a new framework, which we call power linear discriminant analysis (PLDA), that can describe various criteria including LDA, HLDA, and HDA with one control parameter. Because PLDA can describe various criteria for dimensionality reduction, it can flexibly adapt to various environments, such as a noisy environment, and can therefore make a speech recognizer robust in realistic environments. Unfortunately, we cannot know which control parameter is the most effective before training HMMs and testing the performance of each control parameter on a development set. In general, this training and testing process requires more than several dozen hours, and the computational time is proportional to the number of control parameter values under test. Because the control parameter of PLDA can be any real number, finding an optimal value requires much time. We therefore provide an efficient method for selecting an optimal control parameter without training HMMs or testing recognition performance on a development set.

The paper is organized as follows: LDA, HLDA, and HDA are reviewed in Sect. 2. Then, the new PLDA framework is proposed in Sect. 3. A selection method for an optimal control parameter is derived in Sect. 4. Experimental results are presented in Sect. 5. Finally, conclusions are given in Sect. 6.

Manuscript received July 2, 2007.
Manuscript revised September 14, 2007.
†The author is with DENSO CORPORATION, Nisshin-shi, 470–0111 Japan.
††The authors are with Nagoya University, Nagoya-shi, 464–8603 Japan.
†††The author is with Toyohashi University of Technology, Toyohashi-shi, 441–8580 Japan.
a) E-mail: [email protected]
b) E-mail: [email protected]
c) E-mail: [email protected]
DOI: 10.1093/ietisy/e91–d.3.478

2. Segmental Unit Input HMM

For an input symbol sequence o = (o_1, o_2, ···, o_T) and a state sequence q = (q_1, q_2, ···, q_T), the output probability of segmental unit input HMM is given by the following equations [1]:


  P(o_1, ···, o_T)
    = Σ_q Π_i P(o_i | o_1, ···, o_{i−1}, q_1, ···, q_i) P(q_i | q_1, ···, q_{i−1})   (1)
    ≈ Σ_q Π_i P(o_i | o_{i−(d−1)}, ···, o_{i−1}, q_i) P(q_i | q_{i−1})   (2)
    ≈ Σ_q Π_i P(o_{i−(d−1)}, ···, o_i | q_i) P(q_i | q_{i−1}),   (3)

where T denotes the length of the input sequence and d denotes the number of successive frames used in the probability calculation at a time frame. The immediate use of several successive frames as an input vector inevitably increases the number of parameters. When the number of dimensions increases, several problems generally occur: heavier computational load and larger memory are required, and the accuracy of parameter estimation degrades. Therefore, dimensionality reduction methods, e.g., principal component analysis (PCA), LDA, HLDA or HDA, are used to reduce dimensionality [1], [3], [9], [10].

Here, we briefly review LDA, HLDA and HDA, then investigate the effectiveness of these methods for some artificial data sets.

2.1 Linear Discriminant Analysis

Given n-dimensional feature vectors x_j ∈ R^n (j = 1, 2, ..., N)†, e.g., x_j = [o_{j−(d−1)}^T, ···, o_j^T]^T, let us find a projection matrix B_[p] ∈ R^{n×p} that projects these feature vectors to p-dimensional feature vectors z_j ∈ R^p (j = 1, 2, ..., N) (p < n), where z_j = B_[p]^T x_j, and N denotes the number of all features.

Within-class and between-class covariance matrices are defined as follows [6], [7]:

  Σ_w = (1/N) Σ_{k=1}^c Σ_{x_j ∈ D_k} (x_j − µ_k)(x_j − µ_k)^T = Σ_{k=1}^c P_k Σ_k,   (4)
  Σ_b = Σ_{k=1}^c P_k (µ_k − µ)(µ_k − µ)^T,   (5)

where c denotes the number of classes, D_k denotes the subset of feature vectors labeled as class k, µ is the mean vector over all classes, µ_k is the mean vector of class k, Σ_k is the covariance matrix of class k, and P_k is the weight of class k, respectively.

There are several ways to formulate objective functions for multi-class data [6]. Typical objective functions are the following:

  J_LDA(B_[p]) = |B_[p]^T Σ_b B_[p]| / |B_[p]^T Σ_w B_[p]|,   (6)
  J_LDA(B_[p]) = |B_[p]^T Σ_t B_[p]| / |B_[p]^T Σ_w B_[p]|,   (7)

where Σ_t denotes the covariance matrix of all features, namely the total covariance, which equals Σ_b + Σ_w.

LDA finds a projection matrix B_[p] that maximizes Eq. (6) or (7). The optimization of (6) and (7) results in the same projection.

†This paper uses the following notation: capital bold letters refer to matrices, e.g., A, bold letters refer to vectors, e.g., b, and scalars are not bold, e.g., c. Where submatrices are used they are indicated, for example, by A_[p]; this is an n × p matrix. A^T is the transpose of the matrix, |A| is the determinant of the matrix, and tr(A) is the trace of the matrix.
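To make the definitions above concrete, the following is a minimal NumPy/SciPy sketch that splices d successive frames (Sect. 2), estimates Σ_w and Σ_b as in Eqs. (4) and (5), and maximizes Eq. (6) through a generalized eigenvalue problem. It is an illustration only, not code from the paper; the function names (splice, scatter_matrices, lda_projection) are ours.

    import numpy as np
    from scipy.linalg import eigh

    def splice(frames, d):
        """Stack d successive frames into one feature vector (segmental unit input, Sect. 2)."""
        T, dim = frames.shape
        return np.stack([frames[t - d + 1:t + 1].reshape(-1) for t in range(d - 1, T)])

    def scatter_matrices(X, labels):
        """Within-class Sw (Eq. 4) and between-class Sb (Eq. 5) covariance matrices."""
        N, n = X.shape
        mu = X.mean(axis=0)
        Sw, Sb = np.zeros((n, n)), np.zeros((n, n))
        for k in np.unique(labels):
            Xk = X[labels == k]
            Pk = len(Xk) / N                                  # class weight P_k
            mk = Xk.mean(axis=0)
            Sw += Pk * np.cov(Xk, rowvar=False, bias=True)    # P_k * Sigma_k
            Sb += Pk * np.outer(mk - mu, mk - mu)
        return Sw, Sb

    def lda_projection(Sb, Sw, p):
        """Maximize Eq. (6): take the p leading generalized eigenvectors of (Sb, Sw)."""
        w, V = eigh(Sb, Sw)                  # eigenvalues returned in ascending order
        return V[:, np.argsort(w)[::-1][:p]]

The columns of the returned matrix form B_[p]; the heteroscedastic criteria below reuse the same per-class statistics P_k and Σ_k but change the objective function.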
2.2 Heteroscedastic Extensions

LDA is not the optimal projection when the class distributions are heteroscedastic. Campbell [8] has shown that LDA is related to the maximum likelihood estimation of parameters for a Gaussian model with an identical class covariance. However, this condition is not necessarily satisfied for a real data set.

In order to overcome this limitation, several extensions have been proposed [9]–[12]. This paper focuses on two heteroscedastic extensions called heteroscedastic linear discriminant analysis (HLDA) and heteroscedastic discriminant analysis (HDA) [9], [10].

2.2.1 Heteroscedastic Linear Discriminant Analysis

In HLDA, the full-rank linear projection matrix B ∈ R^{n×n} is constrained as follows: the first p columns of B span the p-dimensional subspace in which the class means and variances are different, and the remaining n − p columns of B span the (n − p)-dimensional subspace in which the class means and variances are identical. Let the parameters that describe the class means and covariances of B^T x be µ̂_k and Σ̂_k, respectively:

  µ̂_k = [ B_[p]^T µ_k
           B_[n−p]^T µ ],   (8)

  Σ̂_k = [ B_[p]^T Σ_k B_[p]              0
           0              B_[n−p]^T Σ_t B_[n−p] ],   (9)

where B = [B_[p] | B_[n−p]] and B_[n−p] ∈ R^{n×(n−p)}.

Kumar et al. [9] incorporated the maximum likelihood estimation of parameters for differently distributed Gaussians. The HLDA objective function is derived as follows:

  J_HLDA(B) = |B|^{2N} / ( |B_[n−p]^T Σ_t B_[n−p]|^N · Π_{k=1}^c |B_[p]^T Σ_k B_[p]|^{N_k} ),   (10)

where N_k denotes the number of features of class k. The solution that maximizes Eq. (10) is not obtained analytically. Therefore, its maximization is performed using a numerical optimization technique. Alternatively, a computationally efficient scheme is given in [13].
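As a companion to the LDA sketch above, the HLDA criterion of Eq. (10) can be evaluated in the log domain for a candidate full-rank B. The sketch below only illustrates the formula under the assumption that the per-class covariances Σ_k, counts N_k, and total covariance Σ_t have been precomputed; it is not the optimizer of [9] or the efficient scheme of [13].

    import numpy as np

    def log_hlda_objective(B, p, class_covs, class_counts, total_cov):
        """log J_HLDA(B), Eq. (10): 2N log|B| - N log|B2' St B2| - sum_k N_k log|B1' Sk B1|."""
        N = sum(class_counts)
        B1, B2 = B[:, :p], B[:, p:]
        val = 2 * N * np.linalg.slogdet(B)[1]
        val -= N * np.linalg.slogdet(B2.T @ total_cov @ B2)[1]
        for Nk, Sk in zip(class_counts, class_covs):
            val -= Nk * np.linalg.slogdet(B1.T @ Sk @ B1)[1]
        return val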

2.2.2 Heteroscedastic Discriminant Analysis

HDA uses the following objective function, which incorporates individually weighted contributions of the class variances [10]:

  J_HDA(B_[p]) = Π_{k=1}^c ( |B_[p]^T Σ_b B_[p]| / |B_[p]^T Σ_k B_[p]| )^{N_k}   (11)
               = |B_[p]^T Σ_b B_[p]|^N / Π_{k=1}^c |B_[p]^T Σ_k B_[p]|^{N_k}.   (12)

In contrast to HLDA, this function does not consider the remaining (n − p) dimensions; only the projection matrix B_[p] is estimated. As with HLDA, there is no closed-form solution for the projection matrix B_[p].
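The HDA criterion of Eq. (12) differs from Eq. (10) only in its numerator and in dropping the (n − p)-dimensional complement. A log-domain evaluation, under the same assumptions as the sketches above, is:

    import numpy as np

    def log_hda_objective(Bp, Sb, class_covs, class_counts):
        """log J_HDA(B_[p]), Eq. (12): N log|B' Sb B| - sum_k N_k log|B' Sk B|."""
        N = sum(class_counts)
        val = N * np.linalg.slogdet(Bp.T @ Sb @ Bp)[1]
        for Nk, Sk in zip(class_counts, class_covs):
            val -= Nk * np.linalg.slogdet(Bp.T @ Sk @ Bp)[1]
        return val

A gradient-based or simplex optimizer can then be run on this function, since no closed-form maximizer exists.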

2.3 Dependency on Data Set

In Fig. 1, two-dimensional, two- or three-class data features are projected onto one-dimensional subspaces by LDA and HDA. Here, HLDA projections are omitted because they were close to the HDA projections. Figure 1 (a) shows that HDA has higher separability than LDA for the data set used in [10]. On the other hand, as shown in Fig. 1 (b), LDA has higher separability than HDA for another data set. Figure 1 (c) shows the case of yet another data set, for which both LDA and HDA have low separability. Thus, LDA and HDA do not always classify a given data set appropriately. All results show that the separabilities of LDA and HDA depend significantly on the data sets.

Fig. 1  Examples of dimensionality reduction by LDA, HDA and PLDA (panels (a), (b), (c)).

3. Generalization of Discriminant Analysis

As shown above, it is difficult to appropriately separate every data set with one particular criterion such as LDA, HLDA, or HDA. Here, we concentrate on providing a framework which integrates various criteria.

3.1 Relationship between HLDA and HDA

By using Eqs. (8) and (9), let us rearrange B^T Σ_t B as follows:

  B^T Σ_t B = B^T Σ_b B + B^T Σ_w B   (13)
            = Σ_k P_k (µ̂_k − µ̂)(µ̂_k − µ̂)^T + Σ_k P_k Σ̂_k   (14)
            = [ B_[p]^T Σ_t B_[p]              0
                0              B_[n−p]^T Σ_t B_[n−p] ],   (15)

where µ̂ = B^T µ. The determinant of this is

  |B^T Σ_t B| = |B_[p]^T Σ_t B_[p]| |B_[n−p]^T Σ_t B_[n−p]|.   (16)

Inserting this in (10) and removing a constant term yields

  J_HLDA(B_[p]) ∝ |B_[p]^T Σ_t B_[p]|^N / Π_{k=1}^c |B_[p]^T Σ_k B_[p]|^{N_k}.   (17)

From (12) and (17), the difference between HLDA and HDA lies in their numerators, i.e., the total covariance matrix versus the between-class covariance matrix. This difference is the same as the difference between the two LDAs shown in (6) and (7). Thus, (12) and (17) can be viewed as the same formulation.

3.2 Relationship between LDA and HDA

The LDA and HDA objective functions can be rewritten as

  J_LDA(B_[p]) = |B_[p]^T Σ_b B_[p]| / |B_[p]^T Σ_w B_[p]| = |Σ̃_b| / |Σ_{k=1}^c P_k Σ̃_k|,   (18)

  J_HDA(B_[p]) = |B_[p]^T Σ_b B_[p]|^N / Π_{k=1}^c |B_[p]^T Σ_k B_[p]|^{N_k} ∝ |Σ̃_b| / Π_{k=1}^c |Σ̃_k|^{P_k},   (19)

where Σ̃_b = B_[p]^T Σ_b B_[p] and Σ̃_k = B_[p]^T Σ_k B_[p] are the between-class and class-k covariance matrices in the projected p-dimensional space, respectively.

Both numerators denote determinants of the between-class covariance matrix. In Eq. (18), the denominator can be viewed as the determinant of the weighted arithmetic mean of the class covariance matrices. Similarly, in Eq. (19), the denominator can be viewed as the determinant of the weighted geometric mean of the class covariance matrices. Thus, the difference between LDA and HDA lies in the definition of the mean of the class covariance matrices. Moreover, when their numerators are replaced with the determinant of the total covariance matrix, the difference between LDA (Eq. (7)) and HLDA (Eq. (17)) is the same as the difference between LDA and HDA.

3.3 Power Linear Discriminant Analysis

As described above, Eqs. (18) and (19) give us a new integrated interpretation of LDA and HDA. As an extension of this interpretation, their denominators can be replaced by the determinant of a weighted harmonic mean, or the determinant of a weighted root mean square, of the class covariance matrices.

In the econometric literature, a more general definition of a mean is often used, called the weighted mean of order m [14]. We extend this notion to the determinant of a matrix mean and propose a new objective function as follows†:

  J_PLDA(B_[p], m) = |Σ̃_n| / |( Σ_{k=1}^c P_k Σ̃_k^m )^{1/m}|,   (20)

where Σ̃_n ∈ {Σ̃_b, Σ̃_t}, Σ̃_t = B_[p]^T Σ_t B_[p], and m is a control parameter. By varying the control parameter m, the proposed objective function can represent various criteria. Some typical objective functions are enumerated below.

• m = 2 (root mean square)

  J_PLDA(B_[p], 2) = |Σ̃_n| / |( Σ_{k=1}^c P_k Σ̃_k^2 )^{1/2}|.   (21)

• m = 1 (arithmetic mean)

  J_PLDA(B_[p], 1) = |Σ̃_n| / |Σ_{k=1}^c P_k Σ̃_k| = J_LDA(B_[p]).   (22)

• m → 0 (geometric mean)

  J_PLDA(B_[p], 0) = |Σ̃_n| / Π_{k=1}^c |Σ̃_k|^{P_k} ∝ J_HDA(B_[p]).   (23)

• m = −1 (harmonic mean)

  J_PLDA(B_[p], −1) = |Σ̃_n| / |( Σ_{k=1}^c P_k Σ̃_k^{−1} )^{−1}|.   (24)

See Appendix A for the proof that the PLDA objective function tends to the HDA objective function as m tends to zero. The following equations are also obtained under a particular condition (see Appendix B).

• m → ∞

  J_PLDA(B_[p], ∞) = |Σ̃_n| / max_k |Σ̃_k|.   (25)

• m → −∞

  J_PLDA(B_[p], −∞) = |Σ̃_n| / min_k |Σ̃_k|.   (26)

Intuitively, as m becomes larger, the classes with larger variances become dominant in the denominator of Eq. (20). Conversely, as m becomes smaller, the classes with smaller variances become dominant.

We call this new discriminant analysis formulation Power Linear Discriminant Analysis (PLDA). Figure 1 (c) shows that PLDA can have higher separability for a data set with which LDA and HDA have lower separability. To maximize the PLDA objective function with respect to B, we can use numerical optimization techniques such as the Nelder-Mead method [15] or the SANN method [16]. These methods need no derivatives of the objective function; however, it is known that they converge slowly. In some special cases below, using matrix differential calculus [17], the derivatives of the objective function can be obtained. Hence, we can use fast-converging methods such as the quasi-Newton method and the conjugate gradient method [18].

†We let the function f of a symmetric positive definite matrix A equal U diag(f(λ_1), ..., f(λ_n)) U^T = U f(Λ) U^T, where A = U Λ U^T, U denotes the matrix of n eigenvectors, and Λ denotes the diagonal matrix of eigenvalues λ_i. We may define the function f as some power or the logarithm of A.
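Using the matrix function defined in the footnote, the PLDA objective of Eq. (20) can be evaluated directly. The sketch below is an illustration only, under the assumption that the projected class covariances are symmetric positive definite; the m → 0 limit of Eq. (23) is handled as a separate branch, and the helper names are ours.

    import numpy as np

    def sym_power(A, m):
        """f(A) = U diag(lambda_i^m) U^T for symmetric positive definite A (footnote, Sect. 3.3)."""
        lam, U = np.linalg.eigh(A)
        return (U * lam ** m) @ U.T

    def sym_log(A):
        """Matrix logarithm of a symmetric positive definite A, defined the same way."""
        lam, U = np.linalg.eigh(A)
        return (U * np.log(lam)) @ U.T

    def log_plda_objective(Bp, Sn, class_covs, class_weights, m):
        """log J_PLDA(B_[p], m), Eq. (20); Sn is Sigma_b or Sigma_t, class_weights are the P_k."""
        Sn_t = Bp.T @ Sn @ Bp
        covs_t = [Bp.T @ Sk @ Bp for Sk in class_covs]
        if m == 0:
            # limit m -> 0 (Eq. 23): log-determinant of the weighted geometric mean
            denom = np.trace(sum(P * sym_log(S) for P, S in zip(class_weights, covs_t)))
        else:
            mean_m = sum(P * sym_power(S, m) for P, S in zip(class_weights, covs_t))
            denom = np.linalg.slogdet(sym_power(mean_m, 1.0 / m))[1]
        return np.linalg.slogdet(Sn_t)[1] - denom

Setting m = 1 reproduces the LDA ratio of Eq. (22), and m = −1 the harmonic-mean case of Eq. (24).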

3.3.1 Order m Constrained to be an Integer

Assuming that the control parameter m is constrained to be an integer, the derivatives of the PLDA objective function are formulated as follows:

  ∂/∂B_[p] log J_PLDA(B_[p], m) = 2 Σ_n B_[p] Σ̃_n^{−1} − 2 D_m,   (27)

where

  D_m = (1/m) Σ_{k=1}^c Σ_{j=1}^{m} P_k Σ_k B_[p] X_{m,j,k},          if m > 0,
        Σ_{k=1}^c P_k Σ_k B_[p] Σ̃_k^{−1},                             if m = 0,
        −(1/m) Σ_{k=1}^c Σ_{j=1}^{|m|} P_k Σ_k B_[p] Y_{m,j,k},        otherwise,

  X_{m,j,k} = Σ̃_k^{m−j} ( Σ_{l=1}^c P_l Σ̃_l^m )^{−1} Σ̃_k^{j−1},

and

  Y_{m,j,k} = Σ̃_k^{m+j−1} ( Σ_{l=1}^c P_l Σ̃_l^m )^{−1} Σ̃_k^{−j}.

This equation is used for acoustic models with full covariance.

3.3.2 Σ̃_k Constrained to be Diagonal

Because of computational simplicity, the covariance matrix of class k is often assumed to be diagonal [9], [10]. Since diagonal matrix multiplication is commutative, the PLDA objective function and its derivatives simplify as follows:

  J_PLDA(B_[p], m) = |Σ̃_n| / |( Σ_{k=1}^c P_k diag(Σ̃_k)^m )^{1/m}|,   (28)

  ∂/∂B_[p] log J_PLDA(B_[p], m) = 2 Σ_n B_[p] Σ̃_n^{−1} − 2 F_m G_m,   (29)

where

  F_m = Σ_{k=1}^c P_k Σ_k B_[p] diag(Σ̃_k)^{m−1},   (30)
  G_m = ( Σ_{k=1}^c P_k diag(Σ̃_k)^m )^{−1},   (31)

and diag is an operator which sets the off-diagonal elements to zero. In Eq. (28), the control parameter m can be any real number, unlike in Eq. (27).

When m is equal to zero, the PLDA objective function corresponds to the diagonal HDA (DHDA) objective function introduced in [10].
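Since the experiments in Sect. 5.2 optimize the diagonal form of Eq. (28) with a limited-memory BFGS optimizer started from the LDA projection, a hedged sketch of such a loop is shown below. It uses SciPy's L-BFGS-B with numerical gradients instead of the analytic derivatives of Eqs. (29)–(31), so it approximates the procedure rather than reproducing the authors' implementation; m is assumed to be nonzero.

    import numpy as np
    from scipy.optimize import minimize

    def log_diag_plda_objective(Bp, Sn, class_covs, class_weights, m):
        """log of Eq. (28); only the diagonals of the projected class covariances are used (m != 0)."""
        num = np.linalg.slogdet(Bp.T @ Sn @ Bp)[1]
        mean_m = sum(P * np.diag(Bp.T @ Sk @ Bp) ** m
                     for P, Sk in zip(class_weights, class_covs))
        return num - np.sum(np.log(mean_m)) / m

    def optimize_diag_plda(B_lda, Sn, class_covs, class_weights, m):
        """Maximize Eq. (28) with L-BFGS, starting from the LDA projection matrix."""
        n, p = B_lda.shape

        def neg(b):
            return -log_diag_plda_objective(b.reshape(n, p), Sn,
                                            class_covs, class_weights, m)

        res = minimize(neg, B_lda.ravel(), method="L-BFGS-B")
        return res.x.reshape(n, p)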

4. Selection of an Optimal Control Parameter

As shown in the previous section, PLDA can describe various criteria by varying its control parameter m. One way of obtaining an optimal control parameter m is to train HMMs and test recognition performance on a development set while changing m, and to choose the m with the smallest error. Unfortunately, this raises a considerable problem in a speech recognition task. In general, training HMMs and testing recognition performance on a development set to find an optimal control parameter requires several dozen hours. PLDA requires considerable time to select an optimal control parameter because the control parameter can be any real number.

In this section we focus on a class separability error of the features in the projected space instead of using a recognition error on a development set. Better recognition performance can be expected under a lower class separability error of the projected features. Consequently, we measure the class separability error and use it as a criterion for the recognition performance comparison. We now define a class separability error of projected features.

4.1 Two-Class Problem

This subsection focuses on the two-class case. We first consider the Bayes error of the projected features on the training data as a class separability error:

  ε = ∫ min[P_1 p_1(x), P_2 p_2(x)] dx,   (32)

where P_i denotes the prior probability of class i and p_i(x) is the conditional density function of class i. The Bayes error ε can represent a classification error, assuming that the training data and evaluation data come from the same distributions. However, it is difficult to measure the Bayes error directly. Instead, we use the Chernoff bound between class 1 and class 2 as a class separability error [6]:

  ε_u^{1,2} = P_1^s P_2^{1−s} ∫ p_1^s(x) p_2^{1−s}(x) dx   for 0 ≤ s ≤ 1,   (33)

where ε_u indicates an upper bound of ε. In addition, when the p_i(x) are normal with mean vectors µ_i and covariance matrices Σ_i, the Chernoff bound between class 1 and class 2 becomes

  ε_u^{1,2} = P_1^s P_2^{1−s} exp(−η_{1,2}(s)),   (34)

where

  η_{1,2}(s) = (s(1−s)/2) (µ_2 − µ_1)^T Σ_12^{−1} (µ_2 − µ_1) + (1/2) ln( |Σ_12| / (|Σ_1|^s |Σ_2|^{1−s}) ),   (35)

and Σ_12 ≡ sΣ_1 + (1−s)Σ_2. In this case, ε_u can be obtained analytically and calculated rapidly.

Fig. 2  Comparison of Bayes error and Chernoff bound.

In Fig. 2, two-dimensional two-class data are projected onto one-dimensional subspaces by two methods. Comparing their Chernoff bounds, a lower class separability error is obtained from the features projected by Method 1 than from those projected by Method 2. In this case, Method 1, which preserves the lower class separability error, should be selected.
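A direct transcription of Eqs. (34) and (35) is straightforward. The sketch below assumes Gaussian class models with given priors, means, and covariances; s = 0.5 gives the Bhattacharyya bound used later in Sect. 5.3.

    import numpy as np

    def chernoff_bound(P1, mu1, S1, P2, mu2, S2, s=0.5):
        """Chernoff bound on the two-class Bayes error, Eqs. (34)-(35)."""
        S12 = s * S1 + (1.0 - s) * S2
        d = mu2 - mu1
        eta = 0.5 * s * (1.0 - s) * d @ np.linalg.solve(S12, d)
        eta += 0.5 * (np.linalg.slogdet(S12)[1]
                      - s * np.linalg.slogdet(S1)[1]
                      - (1.0 - s) * np.linalg.slogdet(S2)[1])
        return P1 ** s * P2 ** (1.0 - s) * np.exp(-eta)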

4.2 Extension to Multi-Class Problem

In the previous subsection, we defined a class separability error for two-class data. Here, we extend the two-class case to the multi-class case. Unlike the two-class case, it is possible to define several error functions for multi-class data. We define an error function as follows:

  ε̃_u = Σ_{i=1}^c Σ_{j=1}^c I(i, j) ε_u^{i,j},   (36)

where I(·) denotes an indicator function. We consider the following three formulations as an indicator function.

4.2.1 Sum of Pairwise Approximated Errors

The sum of all the pairwise Chernoff bounds is defined using the following indicator function:

  I(i, j) = 1 if j > i, and 0 otherwise.   (37)

4.2.2 Maximum Pairwise Approximated Error

The maximum pairwise Chernoff bound is defined using the following indicator function:

  I(i, j) = 1 if j > i and (i, j) = (î, ĵ), and 0 otherwise,   (38)

where (î, ĵ) ≡ arg max_{i,j} ε_u^{i,j}.

4.2.3 Sum of Maximum Approximated Errors in Each Class

The sum of the maximum pairwise Chernoff bounds in each class is defined using the following indicator function:

  I(i, j) = 1 if j = ĵ_i, and 0 otherwise,   (39)

where ĵ_i ≡ arg max_j ε_u^{i,j}.
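The three indicator functions only change how the pairwise bounds are aggregated. The sketch below, which reuses the chernoff_bound helper from the previous sketch, is one possible reading of Eqs. (36)–(39); the rule names are ours.

    import numpy as np
    from itertools import combinations

    def class_separability_error(priors, means, covs, s=0.5, rule="sum"):
        """Eq. (36) with the indicator of Eq. (37) ('sum'), Eq. (38) ('max'),
        or Eq. (39) ('class_max')."""
        c = len(means)
        eps = np.zeros((c, c))
        for i in range(c):
            for j in range(c):
                if i != j:
                    eps[i, j] = chernoff_bound(priors[i], means[i], covs[i],
                                               priors[j], means[j], covs[j], s)
        if rule == "sum":                     # Eq. (37): all pairs with j > i
            return sum(eps[i, j] for i, j in combinations(range(c), 2))
        if rule == "max":                     # Eq. (38): single worst pair
            return max(eps[i, j] for i, j in combinations(range(c), 2))
        if rule == "class_max":               # Eq. (39): worst confusable class per class
            return sum(eps[i].max() for i in range(c))
        raise ValueError("unknown rule")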

5. Experiments

We conducted experiments using the CENSREC-3 database [19]. CENSREC-3 is designed as an evaluation framework for Japanese isolated word recognition in real driving-car environments. Speech data were collected using two microphones, a close-talking (CT) microphone and a hands-free (HF) microphone. For training, drivers' speech of phonetically balanced sentences was recorded under two conditions: while idling and while driving on a city street in a normal in-car environment. A total of 14,050 utterances spoken by 293 drivers (202 males and 91 females) were recorded with both microphones. We used all utterances recorded with the CT and HF microphones for training. For evaluation, drivers' speech of isolated words was recorded under 16 environmental conditions, combining three kinds of vehicle speeds (idling, low-speed driving on a city street, and high-speed driving on an expressway) and six kinds of in-car environments (normal, with hazard lights on, with the air-conditioner on (fan low/high), with the audio CD player on, and with windows open); the "hazard lights on" condition was used only when idling. We used only the three kinds of vehicle speeds in the normal in-car environment and evaluated 2,646 utterances spoken by 18 speakers (8 males and 10 females) for each microphone. The speech signals for training and evaluation were both sampled at 16 kHz.

5.1 Baseline System

In CENSREC-3, the baseline scripts are designed to facilitate HMM training and evaluation with HTK [20]. The acoustic models consisted of triphone HMMs. Each HMM had five states, three of which had output distributions. Each distribution was represented with a mixture of 32 diagonal Gaussians. The total number of states with distributions was 2,000. The feature vector consisted of 12 MFCCs and log-energy with their corresponding delta and acceleration coefficients (39 dimensions). The frame length was 20 ms and the frame shift was 10 ms. In the Mel-filter bank analysis, a cut-off was applied to frequency components lower than 250 Hz. The decoding process was performed without any language model. The vocabulary size was 100 words, which included the original fifty words and another fifty similar-sounding words.

5.2 Dimensionality Reduction Procedure

Dimensionality reduction was performed using PCA, LDA, HDA, DHDA [10], and PLDA on the spliced features. Eleven successive frames (143 dimensions) were reduced to 39 dimensions. In (D)HDA and PLDA, to optimize Eq. (28), we assumed that the projected covariance matrices were diagonal and used the limited-memory BFGS algorithm as the numerical optimization technique [18]. The LDA projection matrix was used as the initial matrix. To assign one of the classes to every feature after dimensionality reduction, HMM state labels were generated for the training data by a state-level forced alignment algorithm using a well-trained HMM system. The number of classes was 43, corresponding to the number of monophones.
5.3 Experimental Results

Tables 1 and 2 show the word error rates and class separability errors according to Eqs. (37)–(39) for each dimensionality reduction criterion. The evaluation sets used in Tables 1 and 2 were recorded with the CT and HF microphones, respectively. For the evaluation data recorded with the CT microphone, Table 1 shows that PLDA with m = −0.5 yields the lowest WER. For the evaluation data recorded with the HF microphone, the lowest WER is obtained by PLDA with a different control parameter (m = −1.5), as shown in Table 2. In both cases, PLDA with the optimal control parameter consistently outperformed the other criteria. The two data sets recorded with different microphones had different optimal control parameters. An analysis of the training data revealed that the voiced sounds had larger variances while the unvoiced sounds had smaller ones. As described in Sect. 3.3, PLDA with a smaller control parameter gives greater importance to the discrimination of classes with smaller variances, and is therefore better able to discriminate unvoiced sounds. In general, in a noisy environment such as that of the HF microphone, discrimination of unvoiced sounds becomes difficult. Therefore, the optimal control parameter m for the HF microphone is smaller than that for the CT microphone.

In comparing dimensionality reduction criteria without training HMMs or testing recognition performance on a development set, we used s = 1/2 for the Chernoff bound computation because there was no a priori information about the weights of the two class distributions. In the case of s = 1/2, Eq. (33) is called the Bhattacharyya bound. The two covariance matrices in Eq. (35) were treated as diagonal because diagonal Gaussians were used to model the HMMs. The parameter selection was performed as follows. To select the optimal control parameter for the data set recorded with the CT microphone, all the training data recorded with the CT microphone were labeled with monophones using a forced alignment recognizer. Then, each monophone was modeled as a unimodal normal distribution, and the mean vector and covariance matrix of each class were calculated. Chernoff bounds were obtained using these mean vectors and covariance matrices. The optimal control parameter for the data set recorded with the HF microphone was obtained using all of the training data recorded with the HF microphone through the same process as for the CT microphone. Tables 1 and 2 both show that the results of the proposed method agree well with the relative recognition performance. There was little difference among Eqs. (37)–(39) in parameter selection accuracy. The proposed selection method yielded sub-optimal performance without training HMMs or testing recognition performance on a development set, although it neglects the time information of speech feature sequences when measuring the class separability error and models each class distribution as a unimodal normal distribution. In addition, the optimal control parameter value can vary with different speech features, a different language, or a different noise environment. The proposed selection method can adapt to such variations.

Table 1  Word error rates (%) and class separability errors according to Eqs. (37)–(39) for the evaluation set with a CT microphone. The best results are highlighted in bold.

  Criterion      m      WER     Eq. (37)   Eq. (38)   Eq. (39)
  MFCC+Δ+ΔΔ      -      7.45    2.31       0.0322     0.575
  PCA            -      10.58   3.36       0.0354     0.669
  LDA            -      8.78    3.10       0.0354     0.641
  HDA            -      7.94    2.99       0.0361     0.635
  PLDA           -3.0   6.73    2.02       0.0319     0.531
  PLDA           -2.0   7.29    2.07       0.0316     0.532
  PLDA           -1.5   6.27    1.97       0.0307     0.523
  PLDA           -1.0   6.92    1.99       0.0301     0.521
  PLDA           -0.5   6.12    2.01       0.0292     0.525
  DHDA (PLDA)    (0.0)  7.41    2.15       0.0296     0.541
  PLDA           0.5    7.29    2.41       0.0306     0.560
  PLDA           1.0    9.33    3.09       0.0354     0.641
  PLDA           1.5    8.96    4.61       0.0394     0.742
  PLDA           2.0    8.58    4.65       0.0404     0.745
  PLDA           3.0    9.41    4.73       0.0413     0.756

Table 2  Word error rates (%) and class separability errors according to Eqs. (37)–(39) for the evaluation set with an HF microphone.

  Criterion      m      WER     Eq. (37)   Eq. (38)   Eq. (39)
  MFCC+Δ+ΔΔ      -      15.04   2.56       0.0356     0.648
  PCA            -      19.39   3.65       0.0377     0.738
  LDA            -      15.80   3.38       0.0370     0.711
  HDA            -      17.16   3.21       0.0371     0.697
  PLDA           -3.0   15.04   2.19       0.0338     0.600
  PLDA           -2.0   12.32   2.26       0.0339     0.602
  PLDA           -1.5   10.70   2.18       0.0332     0.5921
  PLDA           -1.0   11.49   2.23       0.0327     0.5922
  PLDA           -0.5   12.51   2.31       0.0329     0.598
  DHDA (PLDA)    (0.0)  14.17   2.50       0.0331     0.619
  PLDA           0.5    13.53   2.81       0.0341     0.644
  PLDA           1.0    16.97   3.38       0.0370     0.711
  PLDA           1.5    17.31   5.13       0.0403     0.828
  PLDA           2.0    15.91   5.22       0.0412     0.835
  PLDA           3.0    16.36   5.36       0.0424     0.850
5.4 Discriminative Training Results

PLDA can be combined with a discriminative training technique for HMMs, such as maximum mutual information (MMI) or minimum phone error (MPE) [21]–[23]. We also conducted the same experiments using MMI and MPE in HTK and compared maximum likelihood (ML) training, MMI, approximate MPE, and exact MPE. Approximate MPE assigns approximate correctness to phones, while exact MPE assigns exact correctness; the former is faster in computing the correctness assignment and the latter is more precise. The results are shown in Tables 3 and 4. By combining PLDA and the discriminative training techniques, we obtained better performance than PLDA with maximum-likelihood training. There appears to be no consistent difference between approximate and exact MPE, as reported in a discriminative training study [23].

Table 3  Word error rates (%) using maximum likelihood training and three discriminative trainings for the evaluation set with a CT microphone.

  Criterion          ML      MMI     MPE (approx.)   MPE (exact)
  MFCC+Δ+ΔΔ          7.45    7.14    6.92            6.95
  PLDA (m = -0.5)    6.12    5.71    5.06            4.99

Table 4  Word error rates (%) using maximum likelihood training and three discriminative trainings for the evaluation set with an HF microphone.

  Criterion          ML      MMI     MPE (approx.)   MPE (exact)
  MFCC+Δ+ΔΔ          15.04   14.44   18.67           15.99
  PLDA (m = -1.5)    10.70   10.39   9.44            10.28

5.5 Computational Costs

The computational costs of evaluating recognition performance and of the proposed selection method are shown in Table 5. Here, the computational cost covers the optimization procedure for the control parameter. In this experiment, we evaluated the computational costs on the evaluation data set with a Pentium IV 2.8 GHz computer. For every dimensionality reduction criterion, the evaluation of recognition performance required 15 hours for training the HMMs and 5 hours for testing on a development set. In total, 220 hours were required to compare the 11 dimensionality reduction criteria (PLDAs using 11 different control parameters). On the other hand, the proposed selection method required only approximately 30 minutes to calculate statistical values such as the mean vectors and covariance matrices of each class in the original space. After this, 2 minutes were required to calculate Eqs. (37)–(39) for each dimensionality reduction criterion. In total, only 0.87 hour was required to predict the optimal criterion among the 11 dimensionality reduction criteria described above. Thus, the proposed method can perform the prediction process much faster than a conventional procedure that includes training HMMs and testing recognition performance on a development set.

Table 5  Computational costs with the conventional and proposed methods.

  conventional   220 h    = (15 h (training) + 5 h (test)) × 11 conditions
  proposed       0.87 h   = 30 min (mean and covariance calculations) + 2 min (Chernoff bound calculation) × 11 conditions

6. Conclusions

In this paper we proposed a new framework that integrates various criteria for dimensionality reduction. The new framework, which we call power linear discriminant analysis (PLDA), includes the LDA, HLDA, and HDA criteria as special cases. Next, an efficient selection method for an optimal control parameter was introduced. The method uses the Chernoff bound, an upper bound of the Bayes error, as a measure of the class separability error. The experimental results on the CENSREC-3 database demonstrated that segmental unit input HMM with PLDA gave better performance than the other criteria, and that PLDA with a control parameter selected by the proposed efficient selection method yielded sub-optimal performance with a drastic reduction of computational cost.

Acknowledgments

The authors would like to thank Dr. Manabu Otsuka for his useful comments. The present study was conducted using the CENSREC-3 database developed by the IPSJ-SIG SLP Noisy Speech Recognition Evaluation Working Group.

References

[1] S. Nakagawa and K. Yamamoto, "Evaluation of segmental unit input HMM," Proc. ICASSP, pp.439–442, 1996.
[2] M. Ostendorf and S. Roukos, "A stochastic segment model for phoneme-based continuous speech recognition," IEEE Trans. Acoust. Speech Signal Process., vol.37, no.12, pp.1857–1869, 1989.
[3] R. Haeb-Umbach and H. Ney, "Linear discriminant analysis for improved large vocabulary continuous speech recognition," Proc. ICASSP, pp.13–16, 1992.
[4] H. Gish and M. Russell, "Parametric trajectory models for speech recognition," Proc. ICSLP, pp.466–469, 1996.
[5] K. Tokuda, H. Zen, and T. Kitamura, "Trajectory modeling based on HMMs with the explicit relationship between static and dynamic features," Proc. Eurospeech 2003, pp.865–868, 2003.
[6] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, New York, 1990.
[7] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, John Wiley & Sons, New York, 2001.
[8] N.A. Campbell, "Canonical variate analysis — A general model formulation," Australian Journal of Statistics, vol.26, pp.86–96, 1984.
[9] N. Kumar and A.G. Andreou, "Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition," Speech Commun., pp.283–297, 1998.
[10] G. Saon, M. Padmanabhan, R. Gopinath, and S. Chen, "Maximum likelihood discriminant feature spaces," Proc. ICASSP, pp.129–132, 2000.
[11] F. de la Torre and T. Kanade, "Oriented discriminant analysis," British Machine Vision Conference, pp.132–141, 2004.
[12] M. Loog and R. Duin, "Linear dimensionality reduction via a heteroscedastic extension of LDA: The Chernoff criterion," IEEE Trans. Pattern Anal. Mach. Intell., vol.26, no.6, pp.732–739, 2004.
[13] M.J.F. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Trans. Speech Audio Process., vol.7, no.3, pp.272–281, 1999.
[14] J.R. Magnus and H. Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics, John Wiley & Sons, 1999.
[15] J.A. Nelder and R. Mead, "A simplex method for function minimization," Comput. J., vol.7, pp.308–313, 1965.
[16] C.J.P. Belisle, "Convergence theorems for a class of simulated annealing algorithms," J. Applied Probability, vol.29, pp.885–892, 1992.
[17] S.R. Searle, Matrix Algebra Useful for Statistics, Wiley Series in Probability and Mathematical Statistics, New York, 1982.
[18] J. Nocedal and S.J. Wright, Numerical Optimization, Springer-Verlag, 1999.
[19] M. Fujimoto, K. Takeda, and S. Nakamura, "CENSREC-3: An evaluation framework for Japanese speech recognition in real driving-car environments," IEICE Trans. Inf. & Syst., vol.E89-D, no.11, pp.2783–2793, Nov. 2006.
[20] HTK Web site. http://htk.eng.cam.ac.uk/
[21] L. Bahl, P. Brown, P. de Souza, and R. Mercer, "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," Proc. ICASSP, pp.49–52, 1986.
[22] D. Povey and P. Woodland, "Minimum phone error and I-smoothing for improved discriminative training," Proc. ICASSP, pp.105–108, 2002.
[23] D. Povey, Discriminative Training for Large Vocabulary Speech Recognition, Ph.D. Thesis, Cambridge University, 2003.

Appendix A: Proof of (23)

Theorem 1. Let Σ̃_k (1 ≤ k ≤ c) be symmetric positive definite matrices. Then,

  lim_{m→0} J_PLDA(B_[p], m) = |Σ̃_n| / Π_{k=1}^c |Σ̃_k|^{P_k}.   (A·1)

Proof. Here, we focus on the denominator of the PLDA objective function. We let

  f(m) = log | Σ_{i=1}^c P_i Σ̃_i^m |   (A·2)

and g(m) = m, so that

  log |( Σ_{i=1}^c P_i Σ̃_i^m )^{1/m}| = (1/m) log | Σ_{i=1}^c P_i Σ̃_i^m | = f(m)/g(m).   (A·3)

Then f(0) = g(0) = 0, and

  ∂f(m)/∂m = tr( Z_m Σ_i P_i ∂Σ̃_i^m/∂m )   (A·4)
           = tr( Z_m Σ_i P_i U_i (∂Λ_i^m/∂m) U_i^T )   (A·5)
           = tr( Z_m Σ_i P_i U_i Λ_i^m (log Λ_i) U_i^T ),   (A·6)

  ∂g(m)/∂m = 1,   (A·7)

where Z_m = ( Σ_j P_j Σ̃_j^m )^{−1}, and U_i and Λ_i denote the eigenvector matrix and the diagonal eigenvalue matrix of Σ̃_i, respectively, so that Σ̃_i^m = U_i Λ_i^m U_i^T.

By l'Hôpital's rule,

  lim_{m→0} f(m)/g(m) = lim_{m→0} f'(m)/g'(m) = f'(0)/g'(0)   (A·8)
                      = Σ_i P_i tr(log Λ_i)   (A·9)
                      = Σ_i P_i log |Σ̃_i|,   (A·10)

and (A·1) follows.

Appendix B: Proofs of (25) and (26)

Theorem 2. Let |Σ̃_k| = max_i |Σ̃_i| (k is not necessarily unique). If Σ̃_k satisfies Σ̃_k^m ≽ Σ_i P_i Σ̃_i^m, then

  J_PLDA(B_[p], ∞) = |Σ̃_n| / max_i |Σ̃_i|,   (A·11)
  J_PLDA(B_[p], −∞) = |Σ̃_n| / min_i |Σ̃_i|,   (A·12)

where X ≽ Y denotes that (X − Y) is a positive semidefinite matrix.

Proof. To prove (A·11), let

  φ(m) = |( Σ_i P_i Σ̃_i^m )^{1/m}|.   (A·13)

We have the following inequality†:

  | Σ_i P_i Σ̃_i^m | ≥ | P_k Σ̃_k^m |.   (A·14)

Using this, for m > 0, we obtain

  φ(m) ≥ | P_k Σ̃_k^m |^{1/m}   (A·15)
       = P_k^{p/m} |Σ̃_k|   (A·16)
       → |Σ̃_k|   (m → ∞).   (A·17)

Added to this, since we assume that Σ̃_k satisfies Σ̃_k^m ≽ Σ_i P_i Σ̃_i^m, we also obtain

  |Σ̃_k^m| = | Σ_i P_i Σ̃_i^m + ( Σ̃_k^m − Σ_i P_i Σ̃_i^m ) |   (A·18)
           ≥ | Σ_i P_i Σ̃_i^m |.   (A·19)

Hence,

  |Σ̃_k| ≥ |( Σ_i P_i Σ̃_i^m )^{1/m}| = φ(m).   (A·20)

(A·17) and (A·20) imply

  lim_{m→∞} φ(m) = |Σ̃_k| = max_i |Σ̃_i|,   (A·21)

and (A·11) then follows.

To prove (A·12), let n = −m and Σ̌_i = Σ̃_i^{−1}. Then

  φ(m) = |( Σ_i P_i Σ̌_i^n )^{1/n}|^{−1},   (A·22)

and hence

  lim_{m→−∞} φ(m) = lim_{n→∞} |( Σ_i P_i Σ̌_i^n )^{1/n}|^{−1}   (A·23)
                  = ( max_i |Σ̌_i| )^{−1}   (A·24)
                  = min_i |Σ̃_i|,   (A·25)

and (A·12) follows.

†The inequality follows from observing that |X + Y| ≥ |X| for symmetric positive definite matrices X and Y.

Makoto Sakai received his B.E. and M.E. degrees from Nagoya Institute of Technology in 1997 and 1999, respectively. He is currently with the Research Laboratories, DENSO CORPORATION, Japan. He is also pursuing a Ph.D. degree in the Department of Media Science, Nagoya University. His research interests include automatic speech recognition, signal processing, and machine learning.

Norihide Kitaoka received his B.E. and M.E. degrees from Kyoto University in 1992 and 1994, respectively, and a Dr. Eng. degree from Toyohashi University of Technology in 2000. He joined DENSO CORPORATION, Japan in 1994. He then joined the Department of Information and Computer Sciences at Toyohashi University of Technology as a research associate in 2001 and was a lecturer from 2003 to 2006. Currently he is an Associate Professor of the Graduate School of Information Science, Nagoya University. His research interests include speech processing, speech recognition and spoken dialog. He is a member of the Information Processing Society of Japan (IPSJ), the Acoustical Society of Japan (ASJ) and the Japan Society for Artificial Intelligence (JSAI).

Seiichi Nakagawa received a Dr. of Eng. degree from Kyoto University in 1977. He joined the Faculty of Kyoto University in 1976 as a Research Associate in the Department of Information Sciences. He moved to Toyohashi University of Technology in 1980. From 1980 to 1983 he was an Assistant Professor, and from 1983 to 1990 he was an Associate Professor. Since 1990 he has been a Professor in the Department of Information and Computer Sciences, Toyohashi University of Technology, Toyohashi. From 1985 to 1986, he was a Visiting Scientist in the Department of Computer Science, Carnegie-Mellon University, Pittsburgh, USA. He received the 1997/2001 Paper Award from the IEICE and the 1988 JC Bose Memorial Award from the Institution of Electro. Telecomm. Engrs. His major research interests include automatic speech recognition/speech processing, natural language processing, human interface, and artificial intelligence. He is a fellow of IPSJ.