arXiv:2103.00361v1 [cs.LG] 28 Feb 2021

Discriminative Multiple Canonical Correlation Analysis for Information Fusion

Lei Gao, Student Member, IEEE, Lin Qi, Enqing Chen, Member, IEEE, and Ling Guan, Fellow, IEEE

Abstract: In this paper, we propose the Discriminative Multiple Canonical Correlation Analysis (DMCCA) for multimodal information analysis and fusion. DMCCA is capable of extracting more discriminative characteristics from multimodal information representations. Specifically, it finds the projected directions which simultaneously maximize the within-class correlation and minimize the between-class correlation, leading to better utilization of the multimodal information. In the process, we analytically demonstrate that the optimally projected dimension of DMCCA can be quite accurately predicted, leading to both superior performance and substantial reduction in computational cost. We further verify that Canonical Correlation Analysis (CCA), Multiple Canonical Correlation Analysis (MCCA) and Discriminative Canonical Correlation Analysis (DCCA) are special cases of DMCCA, thus establishing a unified framework for canonical correlation analysis. We implement a prototype of DMCCA to demonstrate its performance in handwritten digit recognition and human emotion recognition. Extensive experiments show that DMCCA outperforms the traditional methods of serial fusion, CCA, MCCA and DCCA.

L. Gao and L. Guan are with the Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON M5B 2K3, Canada (email: [email protected]; [email protected]). L. Qi and E. Chen are with the School of Information Engineering, Zhengzhou University, Zhengzhou, China (email: [email protected]; [email protected]). E. Chen is the corresponding author.

I. INTRODUCTION

The effective utilization and integration of multimodal information contents presented in different media sources is becoming an increasingly important research topic in many applications. Since different characteristics of the same pattern may contain both common and complementary information about the intrinsic semantics, the joint description of data characteristics can potentially provide a more complete and discriminatory description of the pattern, improving the system performance compared with the use of a single modality only [1]. Multimodal signal processing plays important roles in, and is widely applied to, human-computer interaction (HCI), human-human communication (HCC), modern multimedia security/surveillance and many other areas [2-3]. Information fusion is the study of multiple data modalities, such as face, voice, facial expressions, hand and body gestures, ECG and EEG, etc. Among them, voice and face are two of the most natural, passive, and noninvasive sensing traits [4-5]. They can be easily captured by low-cost devices, making them economically more feasible for deployment in a wide range of potential applications.
A number of methods have been proposed for audio-visual based biometric and emotion recognition. Serdar et al. [6] described the use of audio-visual information along with prosodic and language information to detect the presence of disfluencies in computer-directed speech. Pan et al. [7] presented a novel fused hidden Markov model (fused HMM) for integrating tightly coupled audio and visual features in speech processing. Wang and Guan [8] proposed a multiclassifier scheme coupled with hierarchical audiovisual analysis and achieved the recognition of human emotions at the feature level.

For information fusion, the major difficulties lie in the identification of the discriminatory representations between different modalities, and in the design of a fusion strategy that effectively utilizes the complete information presented in the different channels. A wide variety of multimodal fusion methods have been proposed in the literature to address the related recognition and multimedia content analysis problems, and state-of-the-art fusion strategies are documented in [9-11, 51-54]. Generally, fusion strategies can be categorized into rule based and machine learning based methods. In the rule based methods, a predefined rule (e.g., max, min, sum and product) [12] is applied to combine information from multiple sources. On the other hand, machine learning based fusion takes the characteristics of the multiple multimodal samples into a classification algorithm, which learns the patterns from a set of training samples [12-13].

It is widely acknowledged that there are three levels of information fusion: feature/data level, score level and decision level [14]. Data/feature level fusion combines the original data or extracted features through certain strategies to form a new vector representing the original information for classification. Fusion at the score level combines information generated from multiple classifiers before using a rule based scheme or a pattern classification algorithm, in which the scores generated by individual classifiers are usually taken as the new features. Decision level fusion generates the final classification results based on the decisions made from multiple modalities using rule based methods such as AND, OR, and majority voting. Among the three levels, decision level fusion is delegated to multi-classifier combination, and score level fusion and feature level fusion have been extensively researched [15-17]. In recent years, feature level fusion has drawn significant attention from the research communities of multimedia and intelligent biometrics due to its capacity for information preservation, and it has made impressive progress [18-20]. Among them, serial feature fusion was the early winner. However, after serial feature fusion [20], selecting low-dimensional discriminative feature vectors for effective recognition remains a challenging open problem.

Recently, there has been extensive interest in the analysis of correlation based approaches for multimodal information fusion. The objective of correlation analysis is to identify and measure the intrinsic association between different modalities. Hu et al. [47] proposed a large margin multi-metric learning (LM3L) method for face and kinship verification in the wild. Wang et al. [12] proposed a kernel cross-modal method to address the audiovisual emotion recognition problem, achieving promising recognition accuracy on the RML and eNTERFACE datasets with a combination of feature and score level fusions.

Canonical correlation analysis (CCA) is one of the statistical methods dealing with the mutual relationship between two random vectors, and a valuable tool for jointly processing information from bimodal sources. It has been widely applied to data analysis, information fusion and more [21-22]. Sun et al. [23] proposed to use CCA to identify the correlation information of multiple feature streams of an image, and demonstrated the effectiveness of the method in handwriting recognition and face recognition. The CCA method has also been applied to audiovisual based talking-face biometric verification [24], medical imaging analysis [25], and audiovisual synchronization [26]. However, in many practical problems the dependencies between two signals cannot be effectively represented by simple linear correlation. If there is nonlinear correlation between the two variables, CCA may not correctly characterize this relationship.

In order to address the nonlinear problem, two different methods have been proposed. Kernel canonical correlation analysis (KCCA) [27], a nonlinear extension of CCA via the kernel trick, has been applied to the fusion of global and local features for target recognition [28], the fusion of ear and profile face for multimodal biometric recognition [29], the fusion of text and image for spectral clustering [30], and the fusion of labelled graph vectors and semantic expression vectors for facial expression recognition [31]. On the other hand, Deep Canonical Correlation Analysis (Deep CCA) was introduced in [48]. It attempts to learn complex nonlinear transformations of two views of data with neural networks such that the resulting representations are highly linearly correlated.

However, with CCA, KCCA or Deep CCA, only the correlation between pairwise samples is revealed. This correlation can neither represent well the similarity between samples in the same class, nor can it evaluate the dissimilarity between samples in different classes. To tackle this problem, a new method, namely discriminative CCA (DCCA), was proposed [32-34]. It simultaneously maximizes the within-class correlation and minimizes the between-class correlation, and is thus potentially more suitable for recognition tasks than CCA, KCCA or Deep CCA.

Nevertheless, CCA, KCCA, DCCA and Deep CCA only deal with the mutual relationship between two random vectors, which limits their application when there are multiple random vectors. Multi-set canonical correlation analysis (MCCA) is a natural extension of two-set canonical correlation analysis. It generalizes CCA to deal with multiple sets of features; the idea is to optimize characteristics of the dispersion matrix of the transformed variables to obtain high correlations between all new variables simultaneously. It has been applied to geographical information systems (GIS) [35], joint blind source separation [36] and blind single-input & multiple-output (SIMO) channel equalization [37]. However, it does not explore the discriminatory representation and is not capable of providing satisfactory recognition performance.

In this paper, we introduce a discriminative multiple canonical correlation analysis (DMCCA) approach for multimodal analysis and fusion. We will mathematically verify the following characteristics of DMCCA: a) the optimally projected dimension can be accurately predicted, leading to both superior performance and substantial reduction in computational cost; b) CCA, DCCA and MCCA are special cases of DMCCA, thus establishing a unified framework for canonical correlation analysis with the purpose of information fusion in the transformed domain. The effectiveness of DMCCA is demonstrated and compared with serial fusion and methods based on similar principles, namely CCA, DCCA and MCCA, with applications to handwritten digit recognition and human emotion recognition problems.

The remainder of this paper is organized as follows: the analysis and derivation of DMCCA are introduced in Section II. In Section III, feature extraction and the implementation of DMCCA for multimodal fusion with application to handwritten digit recognition and human emotion recognition are presented. The experimental results and analysis are given in Section IV. Conclusions are drawn in Section V.

II. THE DMCCA METHOD

One of the major challenges in information fusion is to identify the discriminatory representation amongst different modalities. In this section, we introduce discriminative multiple canonical correlation analysis (DMCCA) to address this problem. Although Generalized Multi-view Analysis (GMA) [55] and Multi-view Discriminant Analysis (MDA) [56] have also been proposed to solve the multi-view problem, there exist obvious differences among them. To be specific, the differences between DMCCA and GMA are: 1) in GMA, the discriminability is obtained within each feature, while in DMCCA it is achieved by using all features; 2) in GMA, the cross-view correlation is obtained only from observations corresponding to the same underlying sample, while in DMCCA it is obtained from all observations from different feature sets; 3) GMA needs a significant number of processing parameters, especially when the feature vector is long, while DMCCA works with a much smaller set of parameters. Furthermore, different from MDA, the purpose of DMCCA is to simultaneously maximize the within-class correlation and minimize the between-class correlation, helping reveal the intrinsic structure and complementary representations from different sources/modalities in order to improve the recognition accuracy.

The advantages of DMCCA for multimodal information fusion rest on the following facts: 1) DMCCA involves multiple modalities/features having a mixture of correlated (modality-common information) components and achieves the maximum of the correlation [57]; therefore DMCCA possesses the maximal commonality of multiple modalities/features; 2) the within-class and the between-class correlations of all modalities/features are considered jointly to extract more discriminative information, leading to a more discriminant common space and better generalization ability for classification from multiple modalities/features. In the following, we first briefly present the fundamentals of CCA, DCCA and MCCA, and then formulate DMCCA.

A. Review of CCA, DCCA and MCCA

1) CCA: The aim of CCA is to find basis vectors for two sets of variables such that the correlations between the projections of the variables onto these basis vectors are mutually maximized. Simultaneously, it needs to satisfy the canonical property that the first projection is uncorrelated with the second projection, and so on. In this way, all useful information, be it common to the two sets or specific to one of them, is maximally preserved through the projections.

Let $x \in R^m$ and $y \in R^p$ be the two sets, with $m$ and $p$ the dimensions of the samples in $x$ and $y$, respectively. CCA finds a pair of directions $\omega_1$ and $\omega_2$ to maximize the correlation between the projections of the two canonical vectors $X = \omega_1^T x$ and $Y = \omega_2^T y$, which can be written as follows:

$$\arg\max_{\omega_1,\omega_2} \; \omega_1^T R_{xy} \omega_2 \qquad (1)$$

where $R_{xy} = x y^T$ is the cross-correlation matrix of the vectors $x$ and $y$. Simultaneously, $x$ and $y$ should satisfy the following constraint to guarantee that the first projection is uncorrelated with the second projection (canonical property):

$$\omega_1^T R_{xx}\omega_1 = \omega_2^T R_{yy}\omega_2 = 1 \qquad (2)$$

Solving the above problem with Lagrange multipliers yields the following relationship [26]:

$$\begin{bmatrix} 0 & R_{xy} \\ R_{yx} & 0 \end{bmatrix}\omega = \mu \begin{bmatrix} R_{xx} & 0 \\ 0 & R_{yy} \end{bmatrix}\omega \qquad (3)$$

where $\mu$ is the canonical correlation value and $\omega = [\omega_1^T, \omega_2^T]^T$ is the projected vector. Equation (3) can then be solved as a generalized eigenvalue (GEV) problem.

2) DCCA: The purpose of DCCA is to maximize the similarities of any pairs of sets within a class while minimizing the similarities of pairwise sets between classes, as mathematically expressed in [33]:

$$\arg\max_{T} \; tr(T^T S_w T)/tr(T^T S_b T) \qquad (4)$$

where $T$ is the discriminant function and $S_w$, $S_b$ are the within-class and between-class matrices, respectively. The solution to equation (4) is obtained by solving the following generalized eigenvalue problem:

$$S_b T = \lambda S_w T \qquad (5)$$

For detailed information, please refer to [33].

3) MCCA: MCCA can be viewed as a natural extension of two-set canonical correlation analysis [37]. Given $M$ sets of real random variables $x_1, x_2, \cdots, x_M$ with dimensions $m_1, m_2, \cdots, m_M$, the objective of MCCA is to find $\omega = [\omega_1^T, \omega_2^T, \cdots, \omega_M^T]^T$ which satisfies a requirement similar to CCA, described as:

$$\arg\max_{\omega_1,\omega_2,\cdots,\omega_M} \; \frac{1}{M(M-1)}\sum_{k,l=1,\,k\neq l}^{M}\omega_k^T C_{x_k x_l}\omega_l \qquad (6)$$

subject to

$$\sum_{k=1}^{M}\omega_k^T C_{x_k x_k}\omega_k = M \qquad (7)$$

where $C_{x_k x_l} = x_k x_l^T$. Equation (6) can be solved with Lagrange multipliers as:

$$\frac{1}{M-1}(C - D)\omega = \beta D\omega \qquad (8)$$

where

$$C = \begin{bmatrix} x_1 x_1^T & \cdots & x_1 x_M^T \\ \vdots & \ddots & \vdots \\ x_M x_1^T & \cdots & x_M x_M^T \end{bmatrix} \qquad (9)$$

$$D = \begin{bmatrix} x_1 x_1^T & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & x_M x_M^T \end{bmatrix} \qquad (10)$$

with $\beta$ being the generalized canonical correlation. The MCCA solution is obtained as a GEV problem.
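To make the GEV formulation in (3) concrete, the following minimal NumPy/SciPy sketch (our own illustration, not from the paper; the variable names and the small ridge term are assumptions) computes the CCA directions for two zero-mean data matrices stored column-wise:

```python
import numpy as np
from scipy.linalg import eigh

def cca_gev(X, Y, reg=1e-6):
    """Solve CCA as the generalized eigenvalue problem of Eq. (3).

    X: (m, n) and Y: (p, n) hold n zero-mean samples column-wise,
    mirroring the convention x in R^m, y in R^p used in the text.
    """
    m, p = X.shape[0], Y.shape[0]
    Rxx = X @ X.T + reg * np.eye(m)    # auto-correlation (ridge keeps it positive definite)
    Ryy = Y @ Y.T + reg * np.eye(p)
    Rxy = X @ Y.T                       # cross-correlation

    # Block matrices of Eq. (3): A w = mu B w
    A = np.block([[np.zeros((m, m)), Rxy],
                  [Rxy.T, np.zeros((p, p))]])
    B = np.block([[Rxx, np.zeros((m, p))],
                  [np.zeros((p, m)), Ryy]])

    mu, W = eigh(A, B)                  # generalized eigen-decomposition
    order = np.argsort(mu)[::-1]        # largest canonical correlations first
    W = W[:, order]
    return mu[order], W[:m, :], W[m:, :]   # correlations, omega_1, omega_2
```

Dropping the ridge recovers the constraint in (2) exactly whenever $R_{xx}$ and $R_{yy}$ are non-singular.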

B. Concept of the DMCCA

Let $P$ sets of zero-mean random variables be $x_1 \in R^{m_1}, x_2 \in R^{m_2}, \cdots, x_P \in R^{m_P}$ for $c$ classes, and let $Q = m_1 + m_2 + \cdots + m_P$. Concretely, DMCCA seeks the projection vectors $\omega = [\omega_1^T, \omega_2^T, \cdots, \omega_P^T]^T$ ($\omega_1 \in R^{m_1\times Q}, \omega_2 \in R^{m_2\times Q}, \cdots, \omega_P \in R^{m_P\times Q}$) for information fusion so that the within-class correlation is maximized and the between-class correlation is minimized. Based on the definitions of CCA and MCCA, DMCCA is formulated as the following optimization problem:

$$\arg\max_{\omega_1,\omega_2,\cdots,\omega_P} \rho = \frac{1}{P(P-1)}\sum_{k,m=1,\,k\neq m}^{P}\omega_k^T \tilde{C}_{x_k x_m}\omega_m \qquad (11)$$

subject to

$$\sum_{k=1}^{P}\omega_k^T C_{x_k x_k}\omega_k = P \qquad (12)$$

where $\tilde{C}_{x_k x_m} = C_{w_{x_k x_m}} - \delta C_{b_{x_k x_m}}$ ($\delta > 0$), $C_{x_k x_k} = x_k x_k^T$, and $C_{w_{x_k x_m}}$ and $C_{b_{x_k x_m}}$ denote the within-class and between-class correlation matrices of the $P$ sets, respectively.

Based on the mathematical analysis in Appendix I, $C_{w_{x_k x_m}}$ and $C_{b_{x_k x_m}}$ can be explicitly expressed by equations (13), (14) and (15):

$$C_{w_{x_k x_m}} = x_k A x_m^T \qquad (13)$$

$$C_{b_{x_k x_m}} = -x_k A x_m^T \qquad (14)$$

$$A = \begin{bmatrix} H_{n_{i1}\times n_{i1}} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & H_{n_{ic}\times n_{ic}} \end{bmatrix} \qquad (15)$$

where $n_{il}$ is the number of samples in the $l$-th class of the set $x_i$, and $H_{n_{il}\times n_{il}}$ is an $n_{il}\times n_{il}$ matrix whose elements are all ones.

Substituting equations (13) and (14) into (11) yields:

$$\arg\max_{\omega_1,\omega_2,\cdots,\omega_P} \rho = \frac{1}{P(P-1)}\sum_{k,m=1,\,k\neq m}^{P}\omega_k^T \tilde{C}_{x_k x_m}\omega_m = \frac{1+\delta}{P(P-1)}\sum_{k,m=1,\,k\neq m}^{P}\omega_k^T x_k A x_m^T\omega_m \qquad (16)$$

subject to

$$\sum_{k=1}^{P}\omega_k^T C_{x_k x_k}\omega_k = P \qquad (17)$$

Using the Lagrangian multiplier criterion, we obtain equation (18):

$$\frac{1+\delta}{P-1}\begin{bmatrix} 0 & x_1 A x_2^T & \cdots & x_1 A x_P^T \\ x_2 A x_1^T & 0 & \cdots & x_2 A x_P^T \\ \vdots & \vdots & \ddots & \vdots \\ x_P A x_1^T & x_P A x_2^T & \cdots & 0 \end{bmatrix}\omega = \rho\begin{bmatrix} x_1 x_1^T & 0 & \cdots & 0 \\ 0 & x_2 x_2^T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_P x_P^T \end{bmatrix}\omega \qquad (18)$$

which is further rewritten as:

$$\frac{1+\delta}{P-1}(C - D)\omega = \rho D\omega \qquad (19)$$

where

$$C = \begin{bmatrix} x_1 x_1^T & x_1 A x_2^T & \cdots & x_1 A x_P^T \\ x_2 A x_1^T & x_2 x_2^T & \cdots & x_2 A x_P^T \\ \vdots & \vdots & \ddots & \vdots \\ x_P A x_1^T & x_P A x_2^T & \cdots & x_P x_P^T \end{bmatrix} \qquad (20)$$

$$D = \begin{bmatrix} x_1 x_1^T & 0 & \cdots & 0 \\ 0 & x_2 x_2^T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_P x_P^T \end{bmatrix} \qquad (21)$$

$$C - D = \begin{bmatrix} 0 & x_1 A x_2^T & \cdots & x_1 A x_P^T \\ x_2 A x_1^T & 0 & \cdots & x_2 A x_P^T \\ \vdots & \vdots & \ddots & \vdots \\ x_P A x_1^T & x_P A x_2^T & \cdots & 0 \end{bmatrix} \qquad (22)$$

$$\omega = [\omega_1^T, \omega_2^T, \cdots, \omega_P^T]^T \qquad (23)$$

Since $(1+\delta)/(P-1)$ is a constant, it does not influence the projection matrix $\omega$, and we ignore this term in the following mathematical analysis. With a mathematical transform, equation (19) is further written in the form of:

$$\begin{cases} x_1 A x_2^T\omega_2 + x_1 A x_3^T\omega_3 + \cdots + x_1 A x_P^T\omega_P = \rho\, x_1 x_1^T\omega_1 \\ x_2 A x_1^T\omega_1 + x_2 A x_3^T\omega_3 + \cdots + x_2 A x_P^T\omega_P = \rho\, x_2 x_2^T\omega_2 \\ \;\;\vdots \\ x_P A x_1^T\omega_1 + x_P A x_2^T\omega_2 + \cdots + x_P A x_{P-1}^T\omega_{P-1} = \rho\, x_P x_P^T\omega_P \end{cases} \qquad (24)$$

Based on the definition of $\tilde{C}_{x_k x_m}$ and equation (19), the value of $\rho$ plays a critical role in evaluating the relationship between the within-class and between-class correlation matrices. When the value of $\rho$ is greater than zero, the corresponding projected vector $\omega$ contributes positively to the discriminative power in classification, while a projected vector $\omega$ corresponding to a non-positive value of $\rho$ would reduce the discriminative power in classification. Clearly, the solution obtained is the eigenvector associated with the largest eigenvalue in equation (19).
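The following sketch (our own illustration with hypothetical variable names; samples are assumed to be zero-mean and ordered class by class, as in Appendix I) shows how the block matrix A of (15) and the matrices C and D of (20)-(21) can be assembled for P feature sets:

```python
import numpy as np
from scipy.linalg import block_diag

def build_A(class_sizes):
    """Block-diagonal matrix of all-ones blocks, Eq. (15).
    class_sizes: [n_1, ..., n_c] per-class sample counts, samples ordered class by class."""
    return block_diag(*[np.ones((n, n)) for n in class_sizes])

def build_C_D(X_list, A):
    """C and D of Eqs. (20)-(21) for P zero-mean sets X_k of shape (m_k, n)."""
    P = len(X_list)
    blocks = [[None] * P for _ in range(P)]
    d_blocks = []
    for k in range(P):
        for m in range(P):
            if k == m:
                blocks[k][m] = X_list[k] @ X_list[k].T        # diagonal block x_k x_k^T
            else:
                blocks[k][m] = X_list[k] @ A @ X_list[m].T    # off-diagonal block x_k A x_m^T
        d_blocks.append(X_list[k] @ X_list[k].T)
    C = np.block(blocks)
    D = block_diag(*d_blocks)
    return C, D
```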

It is commonly known that the time taken depends greatly on the computation of the projective vectors when algebraic methods are used to extract discriminative features. When the rank of the eigen-matrix is very high, the computation of eigenvalues and eigenvectors is time-consuming. To address this problem effectively, an important property of DMCCA is proved here: the number of projected dimensions $d$ corresponding to the optimal recognition accuracy is smaller than or equal to the number of classes $c$, or mathematically:

$$d \leq c \qquad (25)$$

Now we show that $d$ does satisfy inequality (25). From equation (15), the rank of matrix $A$ satisfies

$$rank(A) \leq c \qquad (26)$$

Then, equation (26) leads to:

$$rank(x_i A x_j^T) \leq \min(r_i, r_A, r_j) \qquad (27)$$

where $r_i$, $r_A$, $r_j$ are the ranks of the matrices $x_i$, $A$, $x_j$ ($i, j \in [1, 2, \ldots, P]$), respectively. Due to the fact that $rank(A) \leq c$, equation (27) satisfies

$$rank(x_i A x_j^T) \leq \min(r_i, c, r_j) \qquad (28)$$

When $c$ is less than $r_i$ and $r_j$, equation (28) is written as

$$rank(x_i A x_j^T) \leq c \qquad (29)$$

otherwise, equation (28) satisfies

$$rank(x_i A x_j^T) \leq \min(r_i, r_j) < c \qquad (30)$$

Then equation (24) can be written as follows:

$$\begin{cases} x_1 A(x_2^T\omega_2 + x_3^T\omega_3 + \cdots + x_P^T\omega_P) = \rho\, x_1 x_1^T\omega_1 \\ x_2 A(x_1^T\omega_1 + x_3^T\omega_3 + \cdots + x_P^T\omega_P) = \rho\, x_2 x_2^T\omega_2 \\ \;\;\vdots \\ x_P A(x_1^T\omega_1 + x_2^T\omega_2 + \cdots + x_{P-1}^T\omega_{P-1}) = \rho\, x_P x_P^T\omega_P \end{cases} \qquad (31)$$

It is further written as:

$$\begin{bmatrix} x_1 & 0 & \cdots & 0 \\ 0 & x_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_P \end{bmatrix}\begin{bmatrix} A & 0 & \cdots & 0 \\ 0 & A & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A \end{bmatrix}\begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_P \end{bmatrix} = \rho\begin{bmatrix} x_1 x_1^T & 0 & \cdots & 0 \\ 0 & x_2 x_2^T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_P x_P^T \end{bmatrix}\begin{bmatrix} \omega_1 \\ \omega_2 \\ \vdots \\ \omega_P \end{bmatrix} \qquad (32)$$

where

$$\begin{cases} B_1 = x_2^T\omega_2 + x_3^T\omega_3 + \cdots + x_P^T\omega_P \\ B_2 = x_1^T\omega_1 + x_3^T\omega_3 + \cdots + x_P^T\omega_P \\ \;\;\vdots \\ B_P = x_1^T\omega_1 + x_2^T\omega_2 + \cdots + x_{P-1}^T\omega_{P-1} \end{cases} \qquad (33)$$

Equation (31) is further expressed as:

$$\begin{cases} (\zeta_1)^{-1}x_1 A(x_2^T\omega_2 + x_3^T\omega_3 + \cdots + x_P^T\omega_P) = \omega_1 \\ (\zeta_2)^{-1}x_2 A(x_1^T\omega_1 + x_3^T\omega_3 + \cdots + x_P^T\omega_P) = \omega_2 \\ \;\;\vdots \\ (\zeta_P)^{-1}x_P A(x_1^T\omega_1 + x_2^T\omega_2 + \cdots + x_{P-1}^T\omega_{P-1}) = \omega_P \end{cases} \qquad (34)$$

where

$$\zeta_i = \rho\, x_i x_i^T \quad (i = 1, 2, \ldots, P; \text{ when } x_i x_i^T \text{ is non-singular}) \qquad (35)$$

$$\zeta_i = \rho\, x_i x_i^T + \sigma_i I_i \quad (i = 1, 2, \ldots, P; \text{ when } x_i x_i^T \text{ is singular}) \qquad (36)$$

where $I_i \in R^{m_i\times m_i}$ and $\sigma_1, \sigma_2, \ldots, \sigma_P$ are constants.

Since $rank(A) \leq c$ and $\omega_i \in R^{m_i\times Q}$ ($i = 1, 2, \ldots, P$), based on equation (34) the rank of $\omega_i$ satisfies

$$rank(\omega_i) \leq c \quad (i = 1, 2, \ldots, P) \qquad (37)$$

Then the fused feature of each set $x_i$ can be written as:

$$Y_i = \omega_i^T x_i \quad (i = 1, 2, \ldots, P) \qquad (38)$$

leading to the expression of the fused features as:

$$\begin{cases} Y_1 = \omega_1^T x_1 \\ Y_2 = \omega_2^T x_2 \\ \;\;\vdots \\ Y_P = \omega_P^T x_P \end{cases} \qquad (39)$$

Let $(\omega)_d$ be the projected matrix with which DMCCA achieves optimal performance, $(\omega)_d \in R^{Q\times d}$. $(\omega)_d$ is written as follows:

$$(\omega)_d = \begin{bmatrix} \omega_{11} & \omega_{12} & \cdots & \omega_{1d} \\ \omega_{21} & \omega_{22} & \cdots & \omega_{2d} \\ \vdots & \vdots & & \vdots \\ \omega_{P1} & \omega_{P2} & \cdots & \omega_{Pd} \end{bmatrix} = \begin{bmatrix} (\omega_1)_d \\ (\omega_2)_d \\ \vdots \\ (\omega_P)_d \end{bmatrix} \qquad (40)$$

Then $\omega$, $(\omega)_d$ and $(\omega_i)_d$ ($i = 1, 2, \ldots, P$) satisfy the relationship:

$$\omega = \begin{bmatrix} \omega_1 \\ \omega_2 \\ \vdots \\ \omega_P \end{bmatrix} = \begin{bmatrix} (\omega_1)_d, \omega_{1(d+1)}, \ldots, \omega_{1Q} \\ (\omega_2)_d, \omega_{2(d+1)}, \ldots, \omega_{2Q} \\ \vdots \\ (\omega_P)_d, \omega_{P(d+1)}, \ldots, \omega_{PQ} \end{bmatrix} = [(\omega)_d, \underbrace{0, \ldots, 0}_{Q-d}] + \big[\omega - [(\omega)_d, \underbrace{0, \ldots, 0}_{Q-d}]\big] \quad (0 \in R^Q) \qquad (41)$$

Inserting (41) into (39) yields:

$$Y_i = \omega_i^T x_i = [(\omega_i)_d, \underbrace{0_i, \ldots, 0_i}_{Q-d}]^T x_i + \big[\omega_i - [(\omega_i)_d, \underbrace{0_i, \ldots, 0_i}_{Q-d}]\big]^T x_i \quad (i = 1, 2, \ldots, P) \qquad (42)$$

where $0_i$ ($i = 1, 2, \ldots, P$) is of the form $R^{m_i}$.

Based on equation (39), $d$ should satisfy inequality (43):

$$d \leq \min(m_1, m_2, \ldots, m_P) \qquad (43)$$

Simultaneously, since $rank(\omega_i) \leq c$ and $Y_i$ possesses the same number of rows, $d$ satisfies relation (44):

$$d \leq \min[rank(\omega_1), rank(\omega_2), \ldots, rank(\omega_P)] \leq c \qquad (44)$$

Considering (43) and (44) together, $d$ should satisfy (45):

$$\begin{cases} d \leq \min(m_1, m_2, \ldots, m_P) \\ d \leq c \end{cases} \qquad (45)$$

Now, we analyze the following two cases.

1. When $c$ satisfies (46):

$$c \leq \min(m_1, m_2, \ldots, m_P) \qquad (46)$$

combining (43) and (44) leads to

$$d \leq c \qquad (47)$$

2. When $c$ satisfies (48):

$$c > \min(m_1, m_2, \ldots, m_P) \qquad (48)$$

combining (43) and (44) leads to

$$d \leq \min(m_1, m_2, \ldots, m_P) \leq c \qquad (49)$$

In summary,

$$d \leq c \qquad (50)$$

In order to achieve the optimal recognition accuracy, we select the $c$ projected vectors from the eigenvectors associated with the $c$ largest distinct eigenvalues in equation (19). Since $[\omega_i - [(\omega_i)_d, 0_i, \ldots, 0_i]]^T x_i$ and $[0_i, \ldots, 0_i]^T x_i$ make no contribution to the optimally fused result of $Y_i$, the optimal performance of DMCCA is reached when the projected dimension $d$ is less than or equal to $c$, as expressed in (51):

$$Y_{i,optimal} = [(\omega_i)_d, \underbrace{0_i, \ldots, 0_i}_{Q-d}]^T x_i = (\omega_i)_d^T x_i \in R^d \quad (i = 1, 2, \ldots, P;\; d \leq c) \qquad (51)$$

Thus, the expressions in (51) complete the proof of (25).

Specifically, if the feature space dimension equals $Q$, the computational complexity of DMCCA is on the order of O(Q*c) instead of O(Q*Q), as other transformation-based methods (such as MCCA) require. This property is not only analytically elegant but practically significant when $c$ is small compared with the feature space dimension (as in emotion recognition, digit recognition, English character recognition and many other tasks), where $c$ ranges from a handful to a couple of dozens while the feature space dimension can be hundreds or even thousands.

In summary of the discussion so far, the information fusion algorithm based on DMCCA is given below (a minimal sketch of these steps follows):

Step 1. Extract information from the multimodal sources to form the training sample spaces.
Step 2. Convert the extracted information into normalized form and compute the matrices C and D.
Step 3. Compute the eigenvalues and eigenvectors of equation (19).
Step 4. Obtain the fused information expression from equation (51), which is used for classification.
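The sketch below (our own illustration, assuming zero-mean data ordered class by class) strings Steps 1-4 together: it solves (19), keeps the eigenvectors of the d <= c largest eigenvalues, and forms the fused features of (51). The final combination by summation is one plausible reading of the feature fusion strategy used downstream, not a statement of the authors' exact implementation.

```python
import numpy as np
from scipy.linalg import eigh, block_diag

def dmcca_fuse(X_list, class_sizes, d=None, reg=1e-6):
    """DMCCA fusion, Steps 1-4. X_list: P zero-mean arrays of shape (m_k, n),
    columns ordered class by class; class_sizes: [n_1, ..., n_c]."""
    c = len(class_sizes)
    d = c if d is None else min(d, c)                         # property (25): d <= c
    n = X_list[0].shape[1]
    A = block_diag(*[np.ones((s, s)) for s in class_sizes])   # Eq. (15)
    P = len(X_list)
    C = np.block([[X_list[k] @ (A if k != m else np.eye(n)) @ X_list[m].T
                   for m in range(P)] for k in range(P)])     # Eq. (20)
    D = block_diag(*[Xk @ Xk.T for Xk in X_list])             # Eq. (21)
    rho, W = eigh(C - D, D + reg * np.eye(D.shape[0]))        # Eq. (19), constant factor dropped
    W = W[:, np.argsort(rho)[::-1][:d]]                       # eigenvectors of the d largest rho

    Ys, start = [], 0
    for Xk in X_list:                                         # split omega and project, Eq. (51)
        mk = Xk.shape[0]
        Ys.append(W[start:start + mk, :].T @ Xk)              # (d, n) fused features of set k
        start += mk
    # Assumed final combination: sum the projected sets into one d-dimensional representation.
    return sum(Ys), Ys
```

The small ridge added to D plays the role of the regularization term in (36) when any $x_i x_i^T$ is singular.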

Next we demonstrate that CCA, MCCA and DCCA are special cases of DMCCA.

1) Relation with DCCA: when $P = 2$, equation (19) turns into the following form:

$$(C - D)\omega = \rho D\omega \qquad (52)$$

where

$$C = \begin{bmatrix} x_1 x_1^T & x_1 A x_2^T \\ x_2 A x_1^T & x_2 x_2^T \end{bmatrix} \qquad (53)$$

$$D = \begin{bmatrix} x_1 x_1^T & 0 \\ 0 & x_2 x_2^T \end{bmatrix} \qquad (54)$$

Thus, DMCCA transforms into the method of DCCA. Although DCCA also possesses this discriminative property, it can only deal with the mutual relationship between two random variables.

2) Relation with MCCA: when $A$ is an identity matrix, the matrices $C$ and $D$ of DMCCA can be written as:

$$C = \begin{bmatrix} x_1 x_1^T & \cdots & x_1 x_P^T \\ \vdots & \ddots & \vdots \\ x_P x_1^T & \cdots & x_P x_P^T \end{bmatrix} \qquad (55)$$

$$D = \begin{bmatrix} x_1 x_1^T & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & x_P x_P^T \end{bmatrix} \qquad (56)$$

and DMCCA transforms into the method of MCCA.

3) Relation with CCA: when $P = 2$ and $A$ is an identity matrix, the matrices $C$ and $D$ of DMCCA can be described as:

$$C = \begin{bmatrix} x_1 x_1^T & x_1 x_2^T \\ x_2 x_1^T & x_2 x_2^T \end{bmatrix} \qquad (57)$$

$$D = \begin{bmatrix} x_1 x_1^T & 0 \\ 0 & x_2 x_2^T \end{bmatrix} \qquad (58)$$

and DMCCA transforms into the method of CCA. Since neither within-class nor between-class correlation is considered, CCA and MCCA do not possess the discriminative power of DMCCA. In addition, the best performance of CCA (and MCCA) is not predictable.

III. RECOGNITION BY MULTIMODAL/MULTI-FEATURE FUSION

In this section, we examine the performance of DMCCA in several applications, ranging from multi-feature information fusion in handwritten digit recognition, audio-based and visual-based emotion recognition, to multimodal information fusion in audiovisual-based human emotion recognition. A general block diagram of the proposed recognition system is depicted in Fig.1.
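As a quick illustration of these special cases (our own sketch, not part of the paper), the same correlation-block construction covers all four methods: a class-structured A gives the discriminative variants, A = I gives MCCA/CCA, and restricting to P = 2 gives the two-set cases:

```python
import numpy as np
from scipy.linalg import block_diag

def correlation_blocks(X_list, class_sizes=None):
    """C and D of Eqs. (53)-(58): class-structured A -> DMCCA/DCCA, A = I -> MCCA/CCA."""
    n = X_list[0].shape[1]
    A = (np.eye(n) if class_sizes is None
         else block_diag(*[np.ones((s, s)) for s in class_sizes]))
    P = len(X_list)
    C = np.block([[X_list[k] @ (np.eye(n) if k == m else A) @ X_list[m].T
                   for m in range(P)] for k in range(P)])
    D = block_diag(*[Xk @ Xk.T for Xk in X_list])
    return C, D

# correlation_blocks([X1, X2])                    -> CCA   (Eqs. 57-58)
# correlation_blocks([X1, X2], class_sizes)       -> DCCA  (Eqs. 53-54)
# correlation_blocks([X1, ..., XP])               -> MCCA  (Eqs. 55-56)
# correlation_blocks([X1, ..., XP], class_sizes)  -> DMCCA (Eqs. 20-21)
```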

Fig.1 The block diagram of DMCCA recognition system
A. The DMCCA for multi-feature fusion

Multi-feature fusion is a special case of multimodal fusion. In multi-feature fusion, different sets of features are extracted from the same modality data but with different extraction methods, and thus very likely carry complementary information. Therefore, the fusion of the multi-feature sets should lead to better recognition results.

1) Handwritten Digit Recognition: Handwritten digit recognition is an active research topic in OCR applications and machine learning research. The most popular applications are postal mail sorting, bank check processing, form data entry, etc., while in the information fusion community the problem of handwritten digit recognition has been used to test the performance of different algorithms. Since the performance of digit recognition largely depends on the quality of the features, various feature extraction methods have been proposed, with Zernike moments [49] and Gabor filters [50] considered two of the most popular and efficient. In this paper, three sets of features based on Zernike moments and Gabor filters are extracted as follows:
1) a 24-dimensional Gabor-filter feature formed by the mean of the digit images transformed by the Gabor filter [44];
2) a 24-dimensional Gabor-filter feature formed by the standard deviation of the digit images transformed by the Gabor filter [44];
3) a 36-dimensional Zernike moments feature [49].

2) Audio-based Emotion Recognition: To build an emotion recognition system, the extraction of features that truly represent the characteristics of the intended emotion is a challenge. For emotional audio, a good reference model is the human hearing system. Currently, Prosodic, MFCC and Formant Frequency (FF) features are widely used in audio emotion recognition [8,12]. In this paper, we investigate the fusion of all three types of features, which are extracted as follows:
1) Prosodic features, as listed in Fig. 2;
2) MFCC features: the mean, ..., max, and range of the first 13 MFCC coefficients;
3) Formant Frequency features: the mean, median, standard deviation, max and min of the first three formant frequencies.

Fig.2 Extracted prosodic features

The procedure of audio-based emotion recognition feature extraction is shown in Fig.3. An input audio segment is first pre-processed by a coefficient thresholding method to reduce recording machine and background noise [8]. To extract audio features, we perform spectral analysis, which is reliable only when the signal is stationary. For audio, this holds only within the short time intervals of articulatory stability, during which short-time analysis can be performed by windowing the signal into a succession of frames. In this paper, we use a Hamming window of 512 points, with 50% overlap between adjacent frames. A minimal sketch of the MFCC-statistics computation is given after Fig.3.

Fig.3 Audio features extraction of emotion recognition
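The following sketch illustrates the MFCC-statistics feature under stated assumptions: it uses librosa (our tooling choice, not necessarily the authors'), a 512-point Hamming window with 50% overlap as described above, and an assumed set of aggregation statistics where the text leaves a gap.

```python
import numpy as np
import librosa

def mfcc_stats(path):
    """Clip-level MFCC statistics over 512-point Hamming windows with 50% overlap."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=512, hop_length=256, window='hamming')
    # Aggregate each of the 13 coefficient tracks into clip-level statistics.
    # Mean, max and range are named in the text; the std term is our assumption.
    feats = np.concatenate([mfcc.mean(axis=1),
                            mfcc.std(axis=1),
                            mfcc.max(axis=1),
                            mfcc.max(axis=1) - mfcc.min(axis=1)])
    return feats
```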

3) Visual-based Emotion Recognition: Facial expression is another major factor in human emotion recognition. Existing solutions for facial expression analysis can be roughly categorized into two groups: one treats the human face as a whole unit [39-40], and the other represents the face by prominent components such as the mouth, eyes, nose, eyebrows and chin [41-42]. Since focusing only on the mouth, eyes, nose, eyebrows and chin as facial components might represent the discriminative characteristics of human emotion inadequately, in this work we perform facial analysis by treating the face as a holistic pattern. The visual information is represented by Gabor wavelet features [43], which describe the spatial frequency structure in images while preserving information about spatial relations. In this paper, the Gabor filter bank is designed using the algorithm proposed in [44] and consists of filters in 4 scales and 6 orientations. The basic Gabor feature space is of high dimensionality, leading to high computational cost, and thus is not suitable for practical applications. We therefore take the mean, standard deviation and median of the magnitude of the transform coefficients of the filter banks as the three feature sets and fuse them for visual emotion recognition. The procedure of visual-based emotion recognition feature extraction is shown in Fig.4, and a minimal sketch is given below.

Fig.4 Visual features extraction of emotion recognition
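A minimal sketch of the 4-scale, 6-orientation Gabor statistics (our own illustration; the frequency spacing and the scikit-image API are assumptions and are not the filter-bank design of [44]):

```python
import numpy as np
from skimage.filters import gabor

def gabor_stats(image, scales=4, orientations=6):
    """Mean / std / median of Gabor magnitude responses over a small filter bank."""
    means, stds, medians = [], [], []
    for s in range(scales):
        frequency = 0.25 / (2 ** s)          # assumed dyadic frequency spacing
        for o in range(orientations):
            theta = o * np.pi / orientations
            real, imag = gabor(image, frequency=frequency, theta=theta)
            mag = np.hypot(real, imag)       # magnitude of the complex response
            means.append(mag.mean())
            stds.append(mag.std())
            medians.append(np.median(mag))
    # Three 24-dimensional feature sets, fused later by DMCCA.
    return np.array(means), np.array(stds), np.array(medians)
```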

B. The DMCCA for multimodal fusion

To evaluate the effectiveness of DMCCA in audiovisual emotion recognition, the Prosodic, MFCC and Formant Frequency features of the audio signals and the mean, standard deviation and median of the magnitude of the Gabor transform coefficients of the images are combined as the audiovisual features for emotion recognition. Fig. 5, the realization of Fig. 1 for emotion recognition, depicts a block diagram of the proposed audiovisual-based emotion recognition system. The first part is the combination of Fig. 3 and Fig. 4, and the second part is the DMCCA and recognition stages.

C. Recognition algorithm analysis

For the information fusion algorithm, the method of FFS I in [23] is adopted to obtain the final fused features for its better performance. Then we use the recognition algorithm proposed in [45], which can be written as follows. Given two sets of features, represented by the feature matrices

$$X^1 = [x^1_1, x^1_2, x^1_3, \ldots, x^1_d] \qquad (59)$$

and

$$X^2 = [x^2_1, x^2_2, x^2_3, \ldots, x^2_d] \qquad (60)$$

$dist[X^1, X^2]$ is defined as

$$dist[X^1, X^2] = \sum_{j=1}^{d}\|x^1_j - x^2_j\|_2 \qquad (61)$$

where $\|a - b\|_2$ denotes the Euclidean distance between two vectors $a$ and $b$. Let the feature matrices of the $N$ training samples be $F_1, F_2, \ldots, F_N$, with each sample belonging to some class $C_i$ ($i = 1, 2, \ldots, c$). Then for a given test sample $I$, if

$$dist[I, F_l] = \min_{j} dist[I, F_j] \quad (j = 1, 2, \ldots, N) \qquad (62)$$

and

$$F_l = C_i \qquad (63)$$

the resulting decision is $I = C_i$. It is pointed out that, in order to demonstrate the effectiveness of the proposed method, we also implemented serial fusion, CCA, MCCA and DCCA for the purpose of comparison. A minimal sketch of the distance-based decision rule in (59)-(63) is given below.
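This sketch is our own illustration of the rule in (59)-(63); the array shapes are assumptions.

```python
import numpy as np

def classify(test_feat, train_feats, train_labels):
    """Nearest-neighbour decision of Eqs. (59)-(63).

    test_feat:    (d, k) fused feature matrix of the test sample.
    train_feats:  list of N (d, k) fused feature matrices F_1..F_N.
    train_labels: list of N class labels C_i.
    """
    # Eq. (61): sum of column-wise Euclidean distances between two feature matrices
    dists = [np.linalg.norm(test_feat - F, axis=0).sum() for F in train_feats]
    l = int(np.argmin(dists))          # Eq. (62): closest training sample
    return train_labels[l]             # Eq. (63): inherit its class
```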

IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. The experimental results for multi-feature fusion

1) Handwritten Digit Recognition: The MNIST database contains digits ranging from 0 to 9. Each digit is normalized and centered in a gray-level image of size 28*28. Some examples are shown in Fig. 6. In the experiments, we select 1500 samples to form the training subset and 1500 samples to form the testing subset, respectively. The performance of the mean, standard deviation and Zernike features is first evaluated and shown in TABLE I.

Fig.6 Example images from the MNIST database

TABLE I
RESULTS OF HANDWRITTEN DIGIT RECOGNITION WITH A SINGLE FEATURE
Single Feature        Recognition Accuracy
Mean                  49.13%
Standard Deviation    52.60%
Zernike               70.20%

From TABLE I, the standard deviation (52.60%) and Zernike moment (70.20%) features achieve better performance than the mean (49.13%), and will therefore be used in CCA and DCCA, which only take two sets of features. In addition, the performance of serial fusion with standard deviation & Zernike moment and with all three features is evaluated, respectively. The experimental results are shown in TABLE II.

Fig.5 The proposed multimodal information fusion for emotion recognition system

TABLE II
RESULTS OF HANDWRITTEN DIGIT RECOGNITION BY SERIAL FUSION
Serial Fusion                      Recognition Accuracy
Standard Deviation & Zernike       70.20%
All of the three features          70.33%

Next, the comparison among DMCCA, serial fusion, CCA, MCCA and DCCA is carried out. The overall recognition rates are given in Fig. 7, with DMCCA providing the best performance, clearly showing the discriminative power of DMCCA for information fusion in handwritten digit recognition. From the figure, it is clear that DMCCA achieves its best performance when the projected dimension d equals 9 < 10 = c, the number of classes, agreeing nicely with the mathematical proof in Section II.

2) Audio-based Emotion Recognition: To evaluate the performance of the proposed method, we conduct experiments on the Ryerson Multimedia Lab (RML) and eNTERFACE (eNT) audiovisual databases [12], respectively. The RML database consists of video samples speaking six different languages (English, Mandarin, Urdu, Punjabi, Persian, and Italian) from eight subjects, expressing the six principal emotions: anger, disgust, fear, surprise, sadness and happiness. The frame rate of the video is 30 fps, with audio recorded at a rate of 22050 Hz. The size of the image frames is 720*480, and the face region has an average size of around 112*96. The eNTERFACE (eNT) database contains video samples from 43 subjects, also expressing the six basic emotions, with a sampling rate of 48000 Hz for the audio channel and a video frame rate of 25 fps. The image frames have a size of 720*576, with the average size of the face region around 260*300. Example images from RML and eNTERFACE are shown in Fig. 8.

In the experiments, 456 audio samples of eight subjects from the RML database and 456 audio samples of ten subjects from the eNTERFACE database are selected, respectively. We divide both sets of audio samples into training and testing subsets containing 360 and 96 samples each. As a benchmark, the performance of using the Prosodic, MFCC and Formant Frequency features individually in emotion recognition is first evaluated, as shown in TABLE III. The recognition accuracy is calculated as the ratio of the number of correctly classified samples over the total number of testing samples.

TABLE III
RESULTS OF EMOTION RECOGNITION WITH SINGLE AUDIO FEATURE
Single Feature               Recognition Accuracy
Prosodic (RML)               53.13%
MFCC (RML)                   47.92%
Formant Frequency (RML)      29.17%
Prosodic (eNT)               55.21%
MFCC (eNT)                   39.58%
Formant Frequency (eNT)      31.25%

TABLE III suggests that we should use the Prosodic (53.13%, 55.21%) and MFCC (47.92%, 39.58%) features, which perform the best individually, in CCA and DCCA, which only take two sets of features. We also experimented with serial fusion on the RML and eNT databases with the Prosodic & MFCC features and with all three features, respectively. The experimental results are shown in TABLE IV.

TABLE IV
RESULTS OF AUDIO-BASED EMOTION RECOGNITION WITH SERIAL FUSION
Serial Fusion                          Recognition Accuracy
Prosodic & MFCC (RML)                  54.17%
All of the three features (RML)        44.79%
Prosodic & MFCC (eNT)                  40.63%
All of the three features (eNT)        34.38%

Then, we compare the performance of DMCCA with serial fusion, CCA, MCCA and DCCA. The overall recognition accuracies are shown in Fig. 9 and Fig. 10. Clearly, the discrimination power of DMCCA provides a more effective modelling of the relationship between different features and achieves better performance than the other methods.
Fig.7 Handwritten digit recognition experimental results of different methods on MNIST Database

Fig.8 Example facial expression images from the RML (Top two rows) and eNTERFACE (Bottom two rows) Databases
Fig.9 Audio-based emotion recognition experimental results of different methods on RML Database

3) Visual-based Emotion Recognition: To further evaluate the performance of DMCCA, we conduct experiments on the RML and eNT visual emotion databases, respectively. Among the samples from the RML database, 384 samples are chosen for training and 96 for evaluation. For the eNT database, 456 visual samples are selected for the experiment, with training and testing subsets containing 360 and 96 samples each. As a benchmark, the performance of the mean, standard deviation and median features is evaluated individually. The results are shown in TABLE V.

TABLE V
RESULTS OF VISUAL-BASED EMOTION RECOGNITION WITH SINGLE GABOR FEATURE
Single Feature               Recognition Accuracy
Mean (RML)                   60.42%
Standard Deviation (RML)     67.71%
Median (RML)                 57.29%
Mean (eNT)                   75.00%
Standard Deviation (eNT)     80.21%
Median (eNT)                 72.92%

From TABLE V, it is observed that the mean (60.42%, 75.00%) and standard deviation (67.71%, 80.21%) features achieve better performance in visual emotion recognition than the median feature (57.29%, 72.92%). Thus, in the following experiments we use the mean and standard deviation features in the CCA and DCCA methods. The experiments using serial fusion with mean & standard deviation features and with all three features are also performed, and the results are shown in TABLE VI.
Fig.10 Audio-based emotion recognition experimental results of different methods on eNTERFACE Database

TABLE VI
RESULTS OF VISUAL-BASED EMOTION RECOGNITION WITH SERIAL FUSION
Serial Fusion                           Recognition Accuracy
Mean & Standard Deviation (RML)         69.79%
All of the three features (RML)         62.50%
Mean & Standard Deviation (eNT)         79.17%
All of the three features (eNT)         80.21%

The overall recognition results are illustrated in Fig. 11 and Fig. 12, which show, again, that the proposed DMCCA outperforms the other methods.

B. The experimental results for multimodal fusion

For audiovisual-based emotion recognition, visual features are extracted from individual images. The planar envelope approximation method in the HSV color space is used for face detection [38]. To alleviate the high memory requirement of the MCCA and DMCCA methods, as well as producing comparable results, 288 video clips of the six basic emotions are selected from the RML database, capturing the change of audio and visual information with respect to time simultaneously. Among them, 192 clips are chosen for the training set and 96 for evaluation. For the eNT database, 360 clips are chosen for the training set and 96 for evaluation. The previous experimental results show that the Prosodic features in audio and the standard deviation of the Gabor transform coefficients in the visual images yield better performance in emotion recognition than the other features. Therefore, in the following experiments we use Prosodic and standard deviation features for the CCA and DCCA methods in audiovisual-based multimodal fusion. Besides, the results of serial fusion on all six audiovisual features are also investigated, and the overall recognition accuracy is 30.28% for the RML database and 35.42% for the eNT database. The performance of serial fusion, CCA, MCCA, DCCA, audio-based multi-feature DMCCA, visual-based multi-feature DMCCA and audiovisual-based DMCCA is shown in Fig. 13 and Fig. 14. From the experimental results, clearly, the discrimination power of DMCCA provides a more effective modelling of the relationships in audiovisual information fusion.

In order to further demonstrate the effectiveness of DMCCA on multimodal fusion, we applied the method of DCCA with embedding (named EDCCA) to bimodal emotion recognition and compared it with DMCCA. There are six sets of features for bimodal emotion recognition: three are audio (Prosodic, MFCC and Formant Frequency) and three are visual (mean, standard deviation and median calculated from the Gabor wavelet transform). Naturally, we embed the three audio features together and the three visual features together, and perform EDCCA on the two embedded features. The overall recognition accuracies of EDCCA on the RML and eNT datasets are shown in Fig. 15 and Fig. 16.

From Fig. 13 to Fig. 16, we observe the following in terms of optimal performance:
1) RML: 85% (DMCCA, green line in Fig. 13) > 79% (EDCCA) > 69% (DCCA, blue line in Fig. 13);
2) eNT: 87% (DMCCA, green line in Fig. 14) > 78% (EDCCA) > 77% (DCCA, blue line in Fig. 14).
Therefore, properly embedding all features in two sets did improve the performance of DCCA in this example, but it is still inferior to the performance of DMCCA. Moreover, when the fusion involves features from three or more modalities, it is difficult, if not impossible, to design a reasonable embedding strategy. On the other hand, with a sound theoretical foundation, DMCCA can handle fusion involving any number of modalities.

From the above experimental results, it can be seen that the recognition accuracy of serial fusion is generally worse than CCA and related methods, and that fusion does not always help, as shown in TABLE II, TABLE IV and TABLE VI, confirming that simply putting the features from different channels together without considering the intrinsic structure and relationships results in low recognition accuracy. On the other hand, the performance of DMCCA is much better than the compared methods in all cases. An important finding of this research is that the optimal recognition performance occurs when the number of projected dimensions d is smaller than or equal to the number of classes, agreeing nicely with the mathematical analysis presented in Section II. The significance is that we only need to calculate the first d (smaller than or equal to the number of classes) projected dimensions of DMCCA to obtain the desired recognition performance, eliminating the need to compute the complete transformation processes associated with most of the other methods and thus substantially reducing the computational complexity required to obtain the optimal recognition accuracy.
Fig.11 Visual-based emotion recognition experimental results of different methods on RML Database
Fig.12 Visual-based emotion recognition experimental results of different methods on eNTERFACE Database

V. CONCLUSIONS

This paper proposed a new method for effective information fusion through discriminative multiple canonical correlation analysis (DMCCA). The key contribution of this work is that, based on the proposed algorithm, we are able to verify that the best performance of the discriminatory representation is achieved when the number of projected dimensions is smaller than or equal to the number of classes (c) in the fused space. The significance of this contribution is that only the first c projected dimensions of the discriminative model need to be calculated to obtain the discriminatory representations, eliminating the need to compute the complete transformation process. This is a particularly attractive feature when dealing with large scale problems. Furthermore, we mathematically proved that CCA, MCCA and DCCA are special cases of the proposed DMCCA, unifying correlation based information fusion methods.

The proposed method is applied to multi-feature and multimodal information fusion problems in handwritten digit recognition and human emotion recognition. Experimental results demonstrate that the proposed method improves recognition performance with substantially reduced dimensionality of the feature space, leading to efficient practical pattern recognition.

There exist plenty of directions in which the current work can be extended. One worthwhile topic is to investigate a kernelized version of DMCCA based on kernel canonical correlation analysis (KCCA) [46] in order to obtain a more effective method for solving nonlinear problems in information fusion.

APPENDIX I

Let $x_i = [x_{i1}^{(1)}, x_{i2}^{(1)}, \cdots, x_{in_1}^{(1)}, \cdots, x_{i1}^{(c)}, x_{i2}^{(c)}, \cdots, x_{in_c}^{(c)}] \in R^{m_i\times n}$, then

$$e_{n_{il}} = [\underbrace{0, 0, \cdots, 0}_{\sum_{u=1}^{l-1} n_{iu}}, \underbrace{1, 1, \cdots, 1}_{n_{il}}, \underbrace{0, 0, \cdots, 0}_{n - \sum_{u=1}^{l} n_{iu}}]^T \in R^n \qquad (64)$$

$$\mathbf{1} = [1, 1, \cdots, 1]^T \in R^n \qquad (65)$$

where $i$ is the index of the random feature set, $n$ is the total number of training samples, $x_{ij}^{(d)}$ denotes the $j$-th sample in the $d$-th class, and $n_{il}$ is the number of samples in the $l$-th class of the set $x_i$, with

$$n = \sum_{l=1}^{c} n_{il} \qquad (66)$$

where $c$ is the total number of classes. Note that, as the random features satisfy the zero-mean property, it can be shown that

$$x_i \cdot \mathbf{1} = 0 \qquad (67)$$

Then, with

$$A = \begin{bmatrix} H_{n_{i1}\times n_{i1}} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & H_{n_{ic}\times n_{ic}} \end{bmatrix} \in R^{n\times n} \qquad (68)$$

where $H_{n_{il}\times n_{il}}$ is an $n_{il}\times n_{il}$ matrix whose elements are all ones, the within-class correlation matrix $C_{w_{x_k x_m}}$ can be written as:

$$C_{w_{x_k x_m}} = \sum_{l=1}^{c}\sum_{h=1}^{n_{kl}}\sum_{g=1}^{n_{ml}} x_{kh}^{(l)} x_{mg}^{(l)T} = \sum_{l=1}^{c}(x_k e_{n_{kl}})(x_m e_{n_{ml}})^T = x_k A x_m^T \qquad (69)$$

Similarly, the between-class correlation matrix $C_{b_{x_k x_m}}$ is of the form:

$$C_{b_{x_k x_m}} = \sum_{l=1}^{c}\sum_{q=1,\,l\neq q}^{c}\sum_{h=1}^{n_{kl}}\sum_{g=1}^{n_{mq}} x_{kh}^{(l)} x_{mg}^{(q)T} = \sum_{l=1}^{c}\sum_{q=1}^{c}\sum_{h=1}^{n_{kl}}\sum_{g=1}^{n_{mq}} x_{kh}^{(l)} x_{mg}^{(q)T} - \sum_{l=1}^{c}\sum_{h=1}^{n_{kl}}\sum_{g=1}^{n_{ml}} x_{kh}^{(l)} x_{mg}^{(l)T} = (x_k\mathbf{1})(x_m\mathbf{1})^T - x_k A x_m^T = -x_k A x_m^T \qquad (70)$$
Fig.13 Audiovisual emotion recognition experimental results by different methods on RML Database
Fig.14 Audiovisual emotion recognition experimental results by different methods on eNTERFACE Database
Fig.15 Performance on audiovisual fusion on emotion recognition with the method of EDCCA (RML Dataset)
Fig.16 Performance on audiovisual fusion on emotion recognition with the method of EDCCA (eNT Dataset)

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China (NSFC, No. 61071211), the State Key Program of NSFC (No. 61331021), the Key International Collaboration Program of NSFC (No. 61210005) and the Discovery Grant of the Natural Sciences and Engineering Research Council of Canada (No. 238813/2010).

REFERENCES

[1] L. Guan, Y. Wang, R. Zhang, Y. Tie, A. Bulzacki, and M. T. Ibrahim, "Multimodal information fusion for selected multimedia applications," Int. J. Multimedia Intell. Sec., vol. 1, no. 1, pp. 5-32, 2010.
[2] D. Roy, "Grounding words in perception and action: Computational insights," Trends Cogn. Sci., vol. 9, no. 8, pp. 389-396, Aug. 2005.
[3] G. O. Deak, M. S. Bartlett, and T. Jebara, "New trends in cognitive science: Integrative approaches to learning and development," Neurocomputing, vol. 70, no. 13-15, pp. 2139-2147, Jan. 2007.
[4] A. K. Jain, A. Ross, and S. Prabhakar, "An introduction to biometric recognition," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 1, pp. 4-20, Jan. 2004.
[5] K. C. Kwak and W. Pedrycz, "Face recognition: A study in information fusion using fuzzy integral," Pattern Recognition Letters, vol. 26, no. 6, pp. 719-733, 2005.
[6] S. Yildirim and S. Narayanan, "Automatic detection of disfluency boundaries in spontaneous speech of children using audio-visual information," IEEE Trans. Audio, Speech, and Language Processing, vol. 11, no. 1, pp. 2-12, Jan. 2009.
[7] H. Pan, S. E. Levinson, T. S. Huang, and Z. Liang, "A fused hidden Markov model with application to bimodal speech processing," IEEE Trans. Signal Processing, vol. 52, no. 3, pp. 573-581, Mar. 2004.
[8] Y. Wang and L. Guan, "Recognizing human emotional state from audiovisual signals," IEEE Trans. Multimedia, vol. 10, no. 5, pp. 936-946, Oct. 2008.
[9] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, "A survey of affect recognition methods: audio, visual, and spontaneous expressions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 1, pp. 39-58, 2008.
[10] T. Joshi, S. Dey, and D. Samanta, "Multimodal biometrics: state of the art in fusion techniques," Int. J. Biomet., vol. 1, no. 4, pp. 393-417, July 2009.
[11] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: A survey," Multimedia Syst., vol. 16, no. 6, pp. 345-379, 2010.
[12] Y. Wang, L. Guan, and A. N. Venetsanopoulos, "Kernel based fusion with application to audiovisual emotion recognition," IEEE Trans. Multimedia, vol. 14, no. 3, pp. 597-607, Jun. 2012.
[13] D. Ruta and B. Gabrys, "An overview of classifier fusion methods," Computing and Information Systems, vol. 7, no. 1, pp. 1-10, 2000.
[14] A. Ross and A. K. Jain, "Information fusion in biometrics," Pattern Recognition Letters, vol. 24, no. 13, pp. 2115-2125, 2003.
[15] Y. S. Huang and C. Y. Suen, "Method of combining multiple experts for the recognition of unconstrained hand written numerals," IEEE Trans. Pattern Anal. Mach. Intell., vol. 7, no. 1, pp. 90-94, 1999.
[16] A. S. Constantinidis, M. C. Fairhurst, and A. F. R. Rahman, "A new multi-expert decision combination algorithm and its application to the detection of circumscribed masses in digital mammograms," Pattern Recognition, vol. 34, no. 8, pp. 1528-1537, 2001.
[17] X.-Y. Jin, D. Zhang, and J.-Y. Yang, "Face recognition based on a group decision-making combination approach," Pattern Recognition, vol. 36, no. 7, pp. 1675-1678, 2003.
[18] C. J. Liu and H. Wechsler, "A shape- and texture-based enhanced Fisher classifier for face recognition," IEEE Trans. Image Processing, vol. 10, no. 4, pp. 598-608, 2001.
[19] J. Yang and J.-Y. Yang, "Generalized K-L transform based combined feature extraction," Pattern Recognition, vol. 35, no. 1, pp. 295-297, 2002.
[20] J. Yang, J. Y. Yang, D. Zhang, and J. F. Lu, "Feature fusion: parallel strategy vs. serial strategy," Pattern Recognition, vol. 36, no. 6, pp. 1369-1381, 2003.
[21] D. R. Hardoon, S. Szedmak, and J. S. Taylor, "Canonical correlation analysis: An overview with application to learning methods," Neural Computation, vol. 16, no. 12, pp. 2639-2664, 2004.
[22] G. Lisanti, S. Karaman, I. Masi, and A. Del Bimbo, "Multi channel-kernel canonical correlation analysis for cross-view person re-identification," IEEE Trans. Information Forensics and Security, arXiv preprint arXiv:1607.02204, 2016.
[23] Q. Sun, S. Zeng, Y. Liu, P. Heng, and D. Xia, "A new method of feature fusion and its application in image recognition," Pattern Recognition, vol. 36, no. 12, pp. 2437-2448, Dec. 2005.
[24] H. Bredin and G. Chollet, "Audio-visual speech synchrony measure for talking-face identity verification," IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 233-236, 2007.
[25] N. M. Correa, T. Adali, Y. Li, and V. D. Calhoun, "Canonical correlation analysis for data fusion and group inferences: examining applications of medical imaging data," IEEE Signal Process. Magazine, vol. 27, no. 4, pp. 39-50, 2010.
[26] M. E. Sargin, Y. Yemez, E. Erzin, and A. M. Tekalp, "Audiovisual synchronization and fusion using canonical correlation analysis," IEEE Trans. Multimedia, vol. 9, no. 7, pp. 1396-1403, Nov. 2007.
[27] C. Fyfe and P. L. Lai, "Kernel and nonlinear canonical correlation analysis," Int. J. Neural Syst., vol. 10, pp. 365-374, 2001.
[28] J. Zhao, Y. Fan, and W. Fan, "Fusion of global and local features using KCCA for automatic target recognition," The Fifth International Conference on Image and Graphics, pp. 958-962, 2009.
[29] X. Xu and Z. Mu, "Feature fusion method based on KCCA for ear and profile face based multimodal recognition," 2007 IEEE International Conference on Automation and Logistics, pp. 620-623, 2007.

[30] M. B. Blaschko and C. H. Lampert, "Correlational spectral clustering," 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, 2008.
[31] M. J. Lyons, J. Budynek, et al., "Classifying facial attributes using a 2-D Gabor wavelet representation and discriminant analysis," The Fourth IEEE International Conference on Automatic Face and Gesture Recognition, pp. 202-207, March 2000.
[32] T. K. Sun, S. C. Chen, Z. Jin, and J. Y. Yang, "Kernelized discriminative canonical correlation analysis," International Conference on Wavelet Analysis and Pattern Recognition, vol. 2, pp. 1283-1287, 2007.
[33] T. K. Kim, J. Kittler, and R. Cipolla, "Discriminative learning and recognition of image set classes using canonical correlations," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1005-1018, 2007.
[34] O. Arandjelovic, "Discriminative extended canonical correlation analysis for pattern set matching," Machine Learning, pp. 1-18, 2013.
[35] A. A. Nielsen, "Multiset canonical correlations analysis and multispectral, truly multitemporal remote sensing data," IEEE Trans. Image Processing, vol. 11, no. 3, pp. 293-305, Mar. 2002.
[36] T. Adali, W. Wang, and V. D. Calhoun, "Joint blind source separation by multiset canonical correlation analysis," IEEE Trans. Signal Processing, vol. 57, no. 10, pp. 3918-3929, Oct. 2009.
[37] J. Via, I. Santamaria, and J. Perez, "Canonical correlation analysis (CCA) algorithms for multiple data sets: application to blind SIMO equalization," 13th European Signal Processing Conference, pp. 1-4, 2005.
[38] K. Praveen, S. Makrogiannis, and N. Bourbakis, "A survey of skin-color modeling and detection methods," Pattern Recognition, vol. 40, no. 3, pp. 1106-1122, 2007.
[39] M. J. Lyons, J. Budynek, A. Plante, and S. Akamatsu, "Classifying facial attributes using a 2-D Gabor wavelet representation and discriminant analysis," The 4th International Conference on Automatic Face and Gesture Recognition, pp. 202-207, 2000.
[40] L. Ma and K. Khorasani, "Facial expression recognition using constructive feedforward neural networks," IEEE Trans. Syst., Man, Cybern. B: Cybern., vol. 34, pp. 1588-1595, 2004.
[41] M. Pantic and L. J. M. Rothkrantz, "Facial action recognition for facial expression analysis from static face images," IEEE Trans. Syst., Man, Cybern. B: Cybern., vol. 34, pp. 1449-1461, 2004.
[42] L. C. De Silva and S. C. Hui, "Real-time facial feature extraction and emotion recognition," The 4th International Conference on Information, Communications and Signal Processing, vol. 3, pp. 1310-1314, 2003.
[43] M. J. Lyons, J. Budynek, A. Plante, and S. Akamatsu, "Classifying facial attributes using a 2-D Gabor wavelet representation and discriminant analysis," The 4th Int. Conf. Automatic Face and Gesture Recognition, pp. 202-207, 2000.
[44] B. S. Manjunath and W. Y. Ma, "Texture features for browsing and retrieval of image data," IEEE Trans. Pattern Anal. Machine Intell., vol. 18, pp. 837-842, Aug. 1996.
[45] B. H. Shekar, M. S. Kumari, L. M. Mestetskiy, and N. F. Dyshkant, "Face recognition using kernel entropy component analysis," Neurocomputing, vol. 74, no. 6, pp. 1053-1057, 2011.
[46] W. M. Zheng, X. Y. Zhou, C. R. Zou, and L. Zhao, "Facial expression recognition using kernel canonical correlation analysis (KCCA)," IEEE Trans. Neural Networks, vol. 17, no. 1, pp. 233-238, 2006.
[47] J. L. Hu, J. W. Wen, J. S. Yuan, and Y. P. Tan, "Large margin multi-metric learning for face and kinship verification in the wild," The 12th Asian Conference on Computer Vision, pp. 1-16, 2014.
[48] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, "Deep canonical correlation analysis," Proceedings of the 30th International Conference on Machine Learning, pp. 1247-1255, 2013.
[49] K. Alireza and H. Yawhua, "Invariant image recognition by Zernike moments," IEEE Trans. Pattern Anal. Mach. Intell., no. 12, pp. 489-497, 1990.
[50] Y. Hamamoto, S. Uchimura, M. Watanabe, T. Yasuda, and S. Tomita, "Recognition of handwritten numerals using Gabor features," International Conference on Pattern Recognition, vol. 3, pp. 250-250, 1996.
[51] S. Zheng, X. Cai, C. H. Ding, F. Nie, and H. Huang, "A closed form solution to multi-view low-rank regression," The Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 1973-1979, 2015.
[52] C. Xiao, F. Nie, W. Cai, and H. Huang, "Heterogeneous image features integration via multi-modal semi-supervised learning model," IEEE International Conference on Computer Vision, pp. 1737-1744, 2013.
[53] W. Hua, F. Nie, H. Huang, and C. Ding, "Heterogeneous visual features fusion via sparse multimodal machine," IEEE Conference on Computer Vision and Pattern Recognition, pp. 3097-3102, 2013.
[54] C. Hou, C. Zhang, Y. Wu, and F. Nie, "Multiple view semi-supervised dimensionality reduction," Pattern Recognition, vol. 43, no. 3, pp. 720-730, 2010.
[55] A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs, "Generalized multiview analysis: A discriminative latent space," 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2160-2167, 2012.
[56] M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen, "Multi-view discriminant analysis," 2012 European Conference on Computer Vision, pp. 808-821, 2012.
[57] M. E. Sargin, Y. Yemez, E. Erzin, and A. M. Tekalp, "Audiovisual synchronization and fusion using canonical correlation analysis," IEEE Transactions on Multimedia, vol. 9, no. 7, pp. 1396-1403, 2007.