Linear Discriminant Analysis Using a Generalized Mean of Class Covariances and Its Application to Speech Recognition
IEICE TRANS. INF. & SYST., VOL.E91–D, NO.3 MARCH 2008
PAPER — Special Section on Robust Speech Processing in Realistic Environments

Makoto SAKAI†,††a), Norihide KITAOKA††b), Members, and Seiichi NAKAGAWA†††c), Fellow

SUMMARY   To precisely model the time dependency of features is one of the important issues in speech recognition. Segmental unit input HMM with a dimensionality reduction method has been widely used to address this issue. Linear discriminant analysis (LDA) and heteroscedastic extensions, e.g., heteroscedastic linear discriminant analysis (HLDA) or heteroscedastic discriminant analysis (HDA), are popular approaches to reducing dimensionality. However, it is difficult to find one particular criterion suitable for any kind of data set that reduces dimensionality while preserving discriminative information. In this paper, we propose a new framework which we call power linear discriminant analysis (PLDA). PLDA can describe various criteria, including LDA, HLDA, and HDA, with one control parameter. In addition, we provide an efficient method for selecting the control parameter without training HMMs or testing recognition performance on a development data set. Experimental results show that PLDA is more effective than conventional methods for various data sets.
key words: speech recognition, feature extraction, multidimensional signal processing

Manuscript received July 2, 2007. Manuscript revised September 14, 2007.
† The author is with DENSO CORPORATION, Nisshin-shi, 470–0111 Japan.
†† The authors are with Nagoya University, Nagoya-shi, 464–8603 Japan.
††† The author is with Toyohashi University of Technology, Toyohashi-shi, 441–8580 Japan.
a) E-mail: [email protected]
b) E-mail: [email protected]
c) E-mail: [email protected]
DOI: 10.1093/ietisy/e91–d.3.478

1. Introduction

Although hidden Markov models (HMMs) have been widely used to model speech signals for speech recognition, they cannot precisely model the time dependency of feature parameters. Many extensions have been proposed to overcome this limitation [1]–[5]. Segmental unit input HMM [1] has been widely used for its effectiveness and tractability. In segmental unit input HMM, the immediate use of several successive frames as an input vector inevitably increases the number of parameters. Therefore, a dimensionality reduction method is applied to the feature vectors.

Linear discriminant analysis (LDA) [6], [7] is widely used to reduce dimensionality and is a powerful tool for preserving discriminative information. LDA assumes that every class has the same covariance [8]. However, this assumption does not necessarily hold for a real data set. Several methods have been proposed to overcome this limitation. Heteroscedastic linear discriminant analysis (HLDA) can deal with unequal covariances because maximum likelihood estimation is used to estimate the parameters of Gaussians with unequal covariances [9]. Heteroscedastic discriminant analysis (HDA) was proposed as another objective function, which employs individually weighted contributions of the classes [10]. The effectiveness of these methods has been demonstrated experimentally for some data sets. However, it is difficult to find one particular criterion suitable for any kind of data set.

In this paper we show that these three methods have a strong mutual relationship, and we provide a new interpretation of them. We then propose a new framework, which we call power linear discriminant analysis (PLDA), that can describe various criteria, including LDA, HLDA, and HDA, with one control parameter. Because PLDA can describe various criteria for dimensionality reduction, it can flexibly adapt to various environments, such as noisy environments, and can therefore make a speech recognizer robust in realistic environments. Unfortunately, we cannot know which control parameter value is the most effective without training HMMs and testing the performance of each value on a development set. In general, this training and testing process requires more than several dozen hours, and the computational time is proportional to the number of control parameter values under test. Because the control parameter of PLDA can be set to any real number, finding an optimal value in this way takes a great deal of time. We therefore provide an efficient method for selecting an optimal control parameter without training HMMs or testing recognition performance on a development set.

The paper is organized as follows: LDA, HLDA, and HDA are reviewed in Sect. 2. A new framework, PLDA, is proposed in Sect. 3. A selection method for an optimal control parameter is derived in Sect. 4. Experimental results are presented in Sect. 5. Finally, conclusions are given in Sect. 6.
2. Segmental Unit Input HMM

For an input symbol sequence o = (o_1, o_2, \cdots, o_T) and a state sequence q = (q_1, q_2, \cdots, q_T), the output probability of segmental unit input HMM is given by the following equations [1]:

P(o_1, \cdots, o_T)
  = \sum_{q} \prod_{i} P(o_i \mid o_1, \cdots, o_{i-1}, q_1, \cdots, q_i)\, P(q_i \mid q_1, \cdots, q_{i-1})   (1)
  \approx \sum_{q} \prod_{i} P(o_i \mid o_{i-(d-1)}, \cdots, o_{i-1}, q_i)\, P(q_i \mid q_{i-1})   (2)
  \approx \sum_{q} \prod_{i} P(o_{i-(d-1)}, \cdots, o_i \mid q_i)\, P(q_i \mid q_{i-1}),   (3)

where T denotes the length of the input sequence and d denotes the number of successive frames used in the probability calculation at each time frame. The immediate use of several successive frames as an input vector inevitably increases the number of parameters. When the number of dimensions increases, several problems generally occur: a heavier computational load and more memory are required, and the accuracy of parameter estimation degrades. Dimensionality reduction methods, e.g., principal component analysis (PCA), LDA, HLDA, or HDA, are therefore used to reduce dimensionality [1], [3], [9], [10]. Here, we briefly review LDA, HLDA, and HDA, and then investigate the effectiveness of these methods for some artificial data sets.
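As a concrete illustration of the input used in Eqs. (2) and (3), the sketch below stacks d successive frames into the segmental-unit input vector x_j = [o_{j-(d-1)}^T, ..., o_j^T]^T introduced in Sect. 2.1 and projects it to a lower dimension. The paper gives no code; this NumPy sketch, its function names, and the random placeholder projection are illustrative assumptions only. In practice B_[p] would be estimated by LDA, HLDA, HDA, or PLDA.

```python
import numpy as np

def stack_frames(frames: np.ndarray, d: int) -> np.ndarray:
    """Stack d successive frames into segmental-unit input vectors.

    frames : (T, m) array holding the per-frame features o_1, ..., o_T.
    Returns a (T - d + 1, d * m) array; row r concatenates frames
    r, ..., r + d - 1, i.e. it equals x_j = [o_{j-(d-1)}^T, ..., o_j^T]^T
    with j = r + d in the paper's 1-based indexing.
    """
    T, m = frames.shape
    return np.hstack([frames[i:T - d + 1 + i] for i in range(d)])

def reduce_dimensionality(x: np.ndarray, B_p: np.ndarray) -> np.ndarray:
    """Project stacked vectors: z_j = B_[p]^T x_j, with B_p of shape (n, p), n = d * m."""
    return x @ B_p

# Toy usage: 11 frames of 13-dimensional features, d = 3 successive frames, p = 20.
rng = np.random.default_rng(0)
o = rng.standard_normal((11, 13))
x = stack_frames(o, d=3)                 # shape (9, 39)
B_p = rng.standard_normal((39, 20))      # placeholder; a real system estimates this projection
z = reduce_dimensionality(x, B_p)        # shape (9, 20)
```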
2.1 Linear Discriminant Analysis

Given n-dimensional feature vectors x_j ∈ R^n (j = 1, 2, ..., N)†, e.g., x_j = [o_{j-(d-1)}^T, ..., o_j^T]^T, let us find a projection matrix B_[p] ∈ R^{n×p} that projects these feature vectors to p-dimensional feature vectors z_j ∈ R^p (j = 1, 2, ..., N) with p < n, where z_j = B_[p]^T x_j and N denotes the total number of feature vectors.

† This paper uses the following notation: capital bold letters refer to matrices, e.g., A; bold letters refer to vectors, e.g., b; and scalars are not bold, e.g., c. Submatrices are indicated, for example, by A_[p], which is an n × p matrix.

The within-class and between-class covariance matrices are defined as follows [6], [7]:

\Sigma_w = \frac{1}{N} \sum_{k=1}^{c} \sum_{x_j \in D_k} (x_j - \mu_k)(x_j - \mu_k)^T = \sum_{k=1}^{c} P_k \Sigma_k,   (4)

\Sigma_b = \sum_{k=1}^{c} P_k (\mu_k - \mu)(\mu_k - \mu)^T,   (5)

where c denotes the number of classes, D_k denotes the subset of feature vectors labeled as class k, µ is the mean vector of all the classes, µ_k is the mean vector of class k, Σ_k is the covariance matrix of class k, and P_k is the weight of class k.

There are several ways to formulate objective functions for multi-class data [6]. Typical objective functions are the following:

J_{LDA}(B_{[p]}) = \frac{\left|B_{[p]}^T \Sigma_b B_{[p]}\right|}{\left|B_{[p]}^T \Sigma_w B_{[p]}\right|},   (6)

J_{LDA}(B_{[p]}) = \frac{\left|B_{[p]}^T \Sigma_t B_{[p]}\right|}{\left|B_{[p]}^T \Sigma_w B_{[p]}\right|},   (7)

where Σ_t denotes the covariance matrix of all the features, namely the total covariance, which equals Σ_b + Σ_w. LDA finds a projection matrix B_[p] that maximizes Eq. (6) or (7). The optimization of (6) and (7) results in the same projection.

2.2 Heteroscedastic Extensions

LDA is not the optimal projection when the class distributions are heteroscedastic. Campbell [8] has shown that LDA is related to the maximum likelihood estimation of parameters for a Gaussian model with an identical class covariance. However, this condition is not necessarily satisfied for a real data set. Several extensions have been proposed to overcome this limitation [9]–[12]. This paper focuses on two heteroscedastic extensions: heteroscedastic linear discriminant analysis (HLDA) and heteroscedastic discriminant analysis (HDA) [9], [10].

2.2.1 Heteroscedastic Linear Discriminant Analysis

In HLDA, the full-rank linear projection matrix B ∈ R^{n×n} is constrained as follows: the first p columns of B span the p-dimensional subspace in which the class means and variances differ, and the remaining n − p columns of B span the (n − p)-dimensional subspace in which the class means and variances are identical. Let the parameters that describe the class means and covariances of B^T x be µ̂_k and Σ̂_k, respectively:

\hat{\mu}_k = \begin{bmatrix} B_{[p]}^T \mu_k \\ B_{[n-p]}^T \mu \end{bmatrix},   (8)

\hat{\Sigma}_k = \begin{bmatrix} B_{[p]}^T \Sigma_k B_{[p]} & 0 \\ 0 & B_{[n-p]}^T \Sigma_t B_{[n-p]} \end{bmatrix},   (9)

where B = [B_{[p]} | B_{[n−p]}] and B_{[n−p]} ∈ R^{n×(n−p)}.

Kumar et al. [9] incorporated the maximum likelihood estimation of parameters for differently distributed Gaussians. An HLDA objective function is derived as follows:

J_{HLDA}(B) = \frac{|B|^{2N}}{\left|B_{[n-p]}^T \Sigma_t B_{[n-p]}\right|^{N} \prod_{k=1}^{c} \left|B_{[p]}^T \Sigma_k B_{[p]}\right|^{N_k}},   (10)

where N_k denotes the number of feature vectors in class k. The maximizer of Eq. (10) cannot be obtained analytically; therefore, the maximization is performed with a numerical optimization technique. Alternatively, a computationally efficient scheme is given in [13].
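The following NumPy/SciPy sketch makes the quantities in Eqs. (4)–(10) concrete: it estimates the per-class statistics, forms Σ_w, Σ_b, and Σ_t, computes an LDA projection maximizing Eq. (6) as the top-p generalized eigenvectors of Σ_b v = λ Σ_w v, and evaluates the logarithm of the HLDA objective of Eq. (10) for a candidate B. This is a minimal sketch, not code from the paper; the function names are assumptions, and the HLDA objective is only evaluated here, not maximized, since Eq. (10) requires a numerical optimizer as noted above.

```python
import numpy as np
from scipy.linalg import eigh

def class_statistics(X, labels):
    """Class weights P_k, means mu_k, ML covariances Sigma_k, counts N_k, and global mean mu."""
    N = X.shape[0]
    mu = X.mean(axis=0)
    P, mus, Sigmas, Ns = [], [], [], []
    for k in np.unique(labels):
        Xk = X[labels == k]
        Ns.append(len(Xk))
        P.append(len(Xk) / N)
        mus.append(Xk.mean(axis=0))
        Sigmas.append(np.cov(Xk, rowvar=False, bias=True))  # 1/N_k normalization
    return np.array(P), np.array(mus), np.array(Sigmas), np.array(Ns), mu

def scatter_matrices(P, mus, Sigmas, mu):
    """Within-class, between-class, and total covariance matrices of Eqs. (4)-(5)."""
    Sw = np.einsum('k,kij->ij', P, Sigmas)            # Eq. (4): sum_k P_k Sigma_k
    diff = mus - mu
    Sb = np.einsum('k,ki,kj->ij', P, diff, diff)      # Eq. (5)
    return Sw, Sb, Sw + Sb                            # Sigma_t = Sigma_b + Sigma_w

def lda_projection(Sw, Sb, p):
    """B_[p] maximizing Eq. (6): top-p generalized eigenvectors of Sb v = lambda Sw v."""
    eigvals, V = eigh(Sb, Sw)                         # eigenvalues in ascending order
    return V[:, ::-1][:, :p]                          # shape (n, p)

def hlda_log_objective(B, p, Sigmas, Ns, Sigma_t):
    """log of Eq. (10) for a candidate full-rank B (to be fed to a numerical optimizer)."""
    N = Ns.sum()
    B_p, B_rest = B[:, :p], B[:, p:]
    val = 2 * N * np.linalg.slogdet(B)[1]
    val -= N * np.linalg.slogdet(B_rest.T @ Sigma_t @ B_rest)[1]
    for Sk, Nk in zip(Sigmas, Ns):
        val -= Nk * np.linalg.slogdet(B_p.T @ Sk @ B_p)[1]
    return val
```

A typical use of this sketch would compute P, mus, Sigmas, Ns, mu from stacked training vectors, form the scatter matrices, and take lda_projection(Sw, Sb, p) as the initial point of the numerical search over B that the HLDA criterion requires.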