Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

Latent Discriminant Analysis with Representative Feature Discovery

Gang Chen
Department of Computer Science and Engineering, SUNY at Buffalo, Buffalo, NY 14260
[email protected]

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Linear Discriminant Analysis (LDA) is a well-known method for dimension reduction and classification with a focus on discriminative feature selection. However, how to discover discriminative as well as representative features in the LDA model has not been explored. In this paper, we propose a latent Fisher discriminant model with representative feature discovery in a semi-supervised manner. Specifically, our model leverages the advantages of both discriminative and generative models by generalizing LDA with a data-driven prior over the latent variables. Thus, our method combines multi-class learning, latent variables and dimension reduction in a unified Bayesian framework. We test our method on the MUSK and Corel datasets and yield competitive results compared to baselines. We also demonstrate its capacity on the challenging TRECVID MED11 dataset for semantic keyframe extraction and conduct a human-factors ranking-based experimental evaluation, which clearly demonstrates that our proposed method consistently extracts more semantically meaningful keyframes than challenging baselines.

Introduction

Linear Discriminant Analysis (LDA) (Fisher 1936) is a powerful tool for dimensionality reduction and classification that projects high dimensional data into a low-dimensional space where the data achieves maximum class separability (Duda, Hart, and Stork 2000; Fukunaga 1990; Wu, Wipf, and Yun 2015). The basic idea in classical LDA, known as Fisher Linear Discriminant Analysis (FDA), is to obtain the projection matrix by minimizing the within-class distance and maximizing the between-class distance simultaneously to yield the maximum class discrimination. It has been proved analytically that the optimal transformation is readily computed by solving a generalized eigenvalue problem (Fukunaga 1990). In order to deal with multi-class scenarios (Rao 1948; Duda, Hart, and Stork 2000), LDA can be easily extended from the binary case to multi-class problems, which finds a subspace with d − 1 dimensions, where d is the number of classes in the training dataset. Because of its effectiveness and computational efficiency, it has been applied successfully in many applications, such as face recognition (Belhumeur, Hepanha, and Kriegman 1997) and microarray gene expression data analysis. Moreover, LDA was shown to compare favorably with other supervised dimensionality reduction methods through extensive experiments (Sugiyama et al. 2010).

However, as a supervised approach, LDA expects manually annotated training sets, e.g., instance/label pairs. As we know, it is labor-intensive and time-consuming to label each instance, which is prohibitive especially for large scale data. Correspondingly, it is reasonable to extend supervised LDA into a semi-supervised method, and many approaches (Joachims 1999; Cai, He, and Han 2007; Zhang and Yeung 2008; Sugiyama et al. 2010) have been proposed. Unfortunately, most of these methods still need instance/label pairs, i.e., training a classifier with a few labeled instances. In practice, many real applications only come with bag-level labels (Andrews, Tsochantaridis, and Hofmann 2002), such as molecule activity (Maron and Lozano-Pérez 1998), image classification (Maron and Ratan 1998) and event detection (Perera et al. 2011). Recently, MI-SVM or latent SVM (Andrews, Tsochantaridis, and Hofmann 2002; Felzenszwalb et al. 2010) has been widely used for classification tasks, such as object detection. In a sense, MI-SVM can learn discriminative features effectively under the maximum margin framework. However, MI-SVM does not consider the data distribution while inferring latent variables. On the contrary, LDA leverages the data distribution by computing between-class and within-class covariances to learn a discriminant projection. Thus it is possible to incorporate a data-driven prior into LDA to discover both representative and discriminative features.

In this paper, we propose a Latent Fisher Discriminant Analysis model (LFDA in short) with representative feature discovery. On the one hand, we hope our model can handle semi-supervised learning problems. On the other hand, we can generalize discriminative FDA to select representative features as well. More specifically, our method unifies the discriminative nature of FDA with a data-driven Gaussian mixture prior over the training data under a Bayesian framework. By combining these two terms into one model, we infer the latent variables and learn the projection matrix in an alternating manner until convergence. To further leverage the compactness of each component of the Gaussian mixture model, we assume that all instances in each component have the same label.

Thus, our model relaxes the instance-level inference into component-level inference by maximizing a joint likelihood, which can capture representative features effectively. To sum up, our method combines multi-class learning, latent variables and dimension reduction in a unified Bayesian framework. We demonstrate the advantages of our model on the MUSK and Corel datasets for classification problems, and on the TRECVID MED11 dataset for semantic keyframe extraction on five video events (Perera et al. 2011).

Related Work

LDA has been a popular method for dimension reduction and classification. It searches for a projection matrix that simultaneously maximizes the between-class dissimilarity and minimizes the within-class dissimilarity to increase class separability, typically for classification applications. Many methods (Belhumeur, Hepanha, and Kriegman 1997; Chen et al. 2000; Baudat and Anouar 2000; Merchante, Grandvalet, and Govaert 2012) have been proposed to either leverage or extend LDA because of its effectiveness and computational efficiency. Belhumeur et al. proposed PCA+LDA (Belhumeur, Hepanha, and Kriegman 1997) for face recognition. Recently, sparsity-induced LDA has also been proposed (Merchante, Grandvalet, and Govaert 2012; Wu, Wipf, and Yun 2015).

However, many real-world applications only provide labels at the bag level, such as object detection (Felzenszwalb et al. 2010) and image classification (Maron and Ratan 1998). In the last decades, semi-supervised methods have been proposed to utilize unlabeled data to aid classification or regression tasks under situations with limited labeled data, such as Transductive SVM (TSVM) (Vapnik 1998; Joachims 1999) and Co-Training (Blum and Mitchell 1998). One main trend is to extend LDA to handle semi-supervised problems (Cai, He, and Han 2007; Zhang and Yeung 2008; Sugiyama et al. 2010) in a transductive manner, which attempts to utilize unlabeled data to aid classification or regression tasks under situations with limited labeled data. For example, Semi-supervised Discriminant Analysis (Cai, He, and Han 2007) was proposed, which makes use of both labeled and unlabeled samples. Sugiyama et al. proposed a semi-supervised dimensionality reduction method (Sugiyama et al. 2010), which can preserve the global structure of unlabeled samples in addition to separating labeled samples in different classes from each other. Chen and Corso proposed a semi-supervised approach (Chen and Corso 2012) to learn discriminative codebooks and classify instances with nearest neighbor voting. Recently, latent SVM or MI-SVM has attracted great attention for semi-supervised problems, such as multiple instance learning and object detection (Zhang and Yeung 2008; Felzenszwalb et al. 2010). It basically infers the latent variables by maximizing a posterior probability and shows great improvement on object detection (Felzenszwalb et al. 2010). Another trend prefers to extend LDA to unsupervised scenarios. For example, Ding and Li proposed to combine LDA and K-means clustering into the LDA-Km algorithm (Ding and Li 2007) for adaptive dimension reduction. In this algorithm, K-means clustering is used to generate class labels and LDA is utilized to perform subspace selection. However, directly casting LDA as a semi-supervised method to handle bag-level labels is still a challenge for multi-class problems.

Latent Fisher discriminant analysis

Let X = {x_1, x_2, ..., x_n} represent n bags, with the corresponding labels L = {l_1, l_2, ..., l_n} as the training data. Each bag x_i ∈ X can have one or multiple instances (Andrews, Tsochantaridis, and Hofmann 2002), and its label l_i is categorical and assumes values in a finite set, e.g. {1, 2, ..., C}. Let x_i ∈ R^{d×n_i}, which means it contains n_i instances (or frames), denoted as x_i = {x_i^1, x_i^2, ..., x_i^{n_i}}, with its j-th instance x_i^j ∈ R^d (whose label is not given). Given the data X and its corresponding instance-level labels Z(X), LDA searches for a discriminative feature transformation f : X → Y that maximizes the ratio of between-class variance to within-class variance, where y ∈ R^{d̄} and d̄ ≤ d. In general, d̄ is decided by C, namely d̄ = C − 1. However, we do not know the instance-level labels Z(X) for the data X; in our case, only the bag-level labels L are available. Thus, we assume that any instance x ∈ X has a corresponding label z(x), which can be inferred from the training pairs (X, L).

Latent Fisher discriminant analysis generalizes LDA with latent variables. Suppose the projection matrix is P and y = f(x) = Px; then our latent Fisher LDA proposes to minimize the following ratio:

P* = argmin_{P,Z} J(P, Z)
   = argmin_{P,Z} trace( (P^T Σ_w(X, L, Z) P) / (P^T Σ_b(X, L, Z) P) + β P^T P )    (1)

where Z are the latent variables for the data X, and β is a weighting parameter for the regularization term. The variable z ∈ Z(X) defines the possible latent values for a sample x ∈ X; in our case, z ∈ {1, 2, ..., C}. Σ_b(X, L, Z) is the between-class scatter matrix and Σ_w(X, L, Z) is the within-class scatter matrix, defined respectively as follows:

Σ_w(X, L, Z) = Σ_{k=1}^{C} Σ_{x ∈ X | δ(z(x)=k)} (x − x̄_k)(x − x̄_k)^T    (2)

Σ_b(X, L, Z) = Σ_{k=1}^{C} m_k (x̄_k − x̄)(x̄_k − x̄)^T    (3)

where δ(z(x)=k) is the indicator function, m_k is the number of training samples for class k, x̄_k = (1/m_k) Σ_{x ∈ X | δ(z(x)=k)} x is the mean of the k-th class, and x̄ is the total mean vector given by x̄ = (1 / Σ_{k=1}^{C} m_k) Σ_{k=1}^{C} m_k x̄_k. Note that LDA (Fisher 1936) depends on a categorical variable z (i.e. the class label) for each instance x to compute Σ_b and Σ_w. If z is given for every x, we can use LDA to find the discriminative transform P, using the eigenvectors of Σ_w^{-1} Σ_b to capture both the compactness of each class and the separation between classes.
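To make Eqs. (1)–(3) concrete, the following NumPy sketch computes the two scatter matrices and evaluates the objective for a fixed label assignment. It is an illustration under our own reading of Eq. (1), where the matrix "fraction" is taken as trace((P^T Σ_b P)^{-1}(P^T Σ_w P)); the function and variable names are ours, not the paper's.

```python
import numpy as np

def scatter_matrices(X, z, C):
    """Within- and between-class scatter of Eqs. (2)-(3).
    X: (n, d) instance matrix, z: (n,) inferred labels in {0, ..., C-1}."""
    d = X.shape[1]
    x_bar = X.mean(axis=0)                          # total mean vector
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for k in range(C):
        Xk = X[z == k]
        x_bar_k = Xk.mean(axis=0)                   # mean of class k
        D = Xk - x_bar_k
        Sw += D.T @ D                               # Eq. (2)
        Sb += len(Xk) * np.outer(x_bar_k - x_bar, x_bar_k - x_bar)  # Eq. (3)
    return Sw, Sb

def objective(P, X, z, C, beta):
    """Objective J(P, Z) of Eq. (1) for a fixed assignment z; P is (d, d_bar)."""
    Sw, Sb = scatter_matrices(X, z, C)
    num, den = P.T @ Sw @ P, P.T @ Sb @ P
    return np.trace(np.linalg.solve(den, num)) + beta * np.trace(P.T @ P)
```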

However, in our case, given the training data X we only know the bag-level labels L, not the instance-level labels Z. To minimize J(P, Z), we need to infer z(x) for any given x. This is a chicken-and-egg problem, and it can be solved by alternating algorithms, such as EM (Dempster, Laird, and Rubin 1977). In other words, we solve for P in Eq. (1) with z fixed, and vice versa, in an alternating strategy.

Updating z

Suppose we have found the projection matrix P. Then, we can project X into its corresponding subspace Y = PX, where Y = {y_1, y_2, ..., y_n} is the one-to-one mapping of X = {x_1, x_2, ..., x_n}. Instead of inferring latent variables at the instance level as latent SVM does, we propose to infer the latent variable z at the cluster level in the projected space Y. That means all elements in the same cluster have the same label. Such an assumption is reasonable because elements in the same cluster are close to each other. On the other hand, cluster-level inference can speed up the learning process. We extend the mixture discriminant analysis model in (Hastie and Tibshirani 1996) by incorporating latent variables over all instances for a given class. Thus, we assume each class i is a mixture of K Gaussian components,

p(t | λ_i) = Σ_{j=1}^{K} π_i^j g(y | μ_i^j, Σ_i^j)    (4)

where t, y ∈ R^{d̄} (i.e. a vector or feature); π_i = {π_i^j}_{j=1}^{K} are the mixture weights, and g(y | μ_i^j, Σ_i^j) is the j-th component Gaussian with mean μ_i^j and covariance Σ_i^j. λ_i = {π_i, μ_i, Σ_i} are the parameters for class i, with μ_i = {μ_i^j}_{j=1}^{K} and Σ_i = {Σ_i^j}_{j=1}^{K}, which we need to estimate. Given the training data (Y, L), we can compute the data-driven prior using the Gaussian mixture model (GMM) in Eq. (4). In our case, for each class i ∈ {1, 2, ..., C}, we collect all instances whose y belongs to class i, and then we use the mixture of Gaussians in Eq. (4) for clustering analysis. As a result, we can estimate λ_i and obtain its K components S_i = {S_i^1, S_i^2, ..., S_i^K} with the EM algorithm, as well as its data-driven prior distribution π_i = {π_i^1, π_i^2, ..., π_i^K}. With a little abuse of symbols, we use μ_i^j to indicate the j-th cluster of the i-th class, and each cluster S_i^j has n_i^j elements, which satisfy Σ_{j=1}^{K} n_i^j = n_i.

Note that the basic idea here is to select the most discriminative or representative cluster in each class, and then infer its latent variables by maximizing a posterior probability or a joint likelihood. Suppose we have the discriminative weights (or posterior probabilities) w_i = {w_i^1, w_i^2, ..., w_i^K} corresponding to the K components in each class, which are determined by LFDA and will be discussed in the next part. We maximize one of the following two objectives:

Maximizing a posterior probability:

μ_i^j = argmax_{μ_i^j ∈ μ_i, j ∈ [1,K]} w_i = argmax_{μ_i^j ∈ μ_i, j ∈ [1,K]} p(z_i | μ_i, P)    (5a)

Maximizing the joint probability with the prior:

μ_i^j = argmax_{μ_i^j ∈ μ_i, j ∈ [1,K]} (π_i ◦ w_i)    (5b)

where z_i is the latent label assignment, π_i is the data-driven prior for each class i, w_i is the posterior (or weight) determined by kNN voting (see further below) in the subspace, and ◦ is the pointwise product (Hadamard product). We treat Eq. (5a) as the latent Fisher discriminant analysis model (LFDA), because it takes the same strategy as the latent SVM model (Andrews, Tsochantaridis, and Hofmann 2002; Felzenszwalb et al. 2010). As for Eq. (5b), we extend LFDA by combining both representative and discriminative factors together, and find the cluster S_i^j in class i by maximizing Eq. (5b). In a sense, Eq. (5b) considers the prior distribution from the training dataset; thus we treat it as the joint latent Fisher discriminant analysis model (JLFDA), or LFDA with prior. In a nutshell, we propose a way to formulate discriminative and generative models together under a Bayesian framework. We comparatively analyze both of these models in the experiments.

Consequently, if we select the cluster S_i^j with mean μ_i^j which maximizes the above objective (for example, Eq. (5a)) for class i, we can relabel all samples y ∈ S_i^j as positive for class i and the rest as negative, subject to y = Px and y ∈ S_i^j. Further, we can decide the label of x under the assumption z(y) = z(x). Thus, for any x ∈ X, we decide its label z(x) as follows:

z(x) = l_i, if y = Px ∈ S_i^j, where S_i^j maximizes Eq. (5)    (6)

Then, we update the training data X+ = {x_1^+, x_2^+, ..., x_n^+} with labels L+ = {z_1^+, z_2^+, ..., z_n^+}, where x_i^+ = S_i^j for class i with n_i^j elements, and its labels z_i^+ = {z_i^1, z_i^2, ..., z_i^{n_i^j}} are on the instance level. In a sense, we are doing a kind of selection, which finds the most discriminative and representative component for each class from the mixture of Gaussians. Obviously, x_i^+ ⊆ x_i and X+ is a subset of X. The difference between X and X+ lies in that every element x_i^+ ∈ X+ has a label z(x_i^+) decided by Eq. (6), while x_i ⊂ X only has a bag-level label.
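As a concrete illustration of Eqs. (4)–(6), the sketch below fits a per-class GMM in the projected space with scikit-learn and picks the component that maximizes either w (Eq. (5a)) or π ◦ w (Eq. (5b)); the instances of the winning component are then relabeled for that class. The function names are ours, and w_c stands for the kNN-voting weights defined in Eq. (8) below.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_component(Y_c, w_c, K, use_prior=True, seed=0):
    """Partition one class's projected instances Y_c (n_c, d_bar) into K
    Gaussian components (Eq. 4) and select the representative one."""
    gmm = GaussianMixture(n_components=K, covariance_type='full',
                          random_state=seed).fit(Y_c)
    pi = gmm.weights_                         # data-driven prior over components
    score = pi * w_c if use_prior else w_c    # Eq. (5b) vs. Eq. (5a)
    j = int(np.argmax(score))                 # winning component S_i^j
    members = gmm.predict(Y_c) == j           # instances relabeled for class i (Eq. 6)
    return j, gmm.means_[j], members
```

The relabeled members across all classes form the subset X+ used in the projection update that follows.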
Updating projection

After we decide the labels for the new training data X+, we can use LDA to minimize J(P, Z). Note that Eq. (1) is invariant to the scale of the vector P. Hence, we can always choose P such that the denominator is simply P^T Σ_b P = 1. For this reason, we can transform the problem of minimizing Eq. (1) into the following constrained optimization problem (Duda, Hart, and Stork 2000; Fukunaga 1990; Ye 2007):

P* = argmin_P trace( P^T Σ_w(X+, L+, Z) P + β P^T P )
     s.t. P^T Σ_b(X+, L+, Z) P = 1    (7)

where 1 is the identity matrix in R^{d̄×d̄}. The optimal multi-class LDA solution consists of the top eigenvectors of (Σ_w(X+, L+, Z) + β)^† Σ_b(X+, L+, Z) corresponding to the nonzero eigenvalues (Fukunaga 1990); here (Σ_w(X+, L+, Z) + β)^† denotes the pseudo-inverse of (Σ_w(X+, L+, Z) + β). After we calculate P, we can project X+ into the subspace Y+. Note that in the subspace Y+, any y+ ∈ Y+ preserves the same label as in the original space. In other words, Y+ has corresponding labels L+ at the element level, namely z(y+) = z(x+).
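A minimal sketch of this projection update, reading the "+β" in the text as adding β·I before taking the pseudo-inverse (our interpretation), might look like the following; Sw and Sb are the scatter matrices of Eqs. (2)–(3) computed on the relabeled subset (X+, L+).

```python
import numpy as np

def update_projection(Sw, Sb, n_components, beta):
    """Top eigenvectors of (Sw + beta*I)^+ Sb, solving Eq. (7).
    Returns P of shape (d, n_components); project with Y = X @ P."""
    d = Sw.shape[0]
    M = np.linalg.pinv(Sw + beta * np.eye(d)) @ Sb   # pseudo-inverse, as in the text
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)                   # largest (nonzero) eigenvalues first
    return vecs[:, order[:n_components]].real
```

For C classes, n_components is typically C − 1, matching d̄ = C − 1 above.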

In general, multi-class LDA (Ye 2007) uses kNN to classify new input data. We compute w_i using the following kNN strategy: for each sample x ∈ X, we get y = Px by projecting it into the subspace Y. Then, for y ∈ Y, we choose its N nearest neighbors from Y+, and use their labels as votes for each cluster S_i^j in each class i. Then, we compute the following posterior probability:

w_i^j = p(z_i = 1 | μ_i^j) ∝ p(μ_i^j | z_i = 1) p(z_i = 1) = p(z_i = 1) · p(μ_i^j, z_i = 1) / Σ_{i=1}^{C} p(μ_i^j, z_i = 1)    (8)

It counts all y ∈ Y that fall into the N nearest neighbors of μ_i^j with label z_i. Note that kNN is widely used as the classifier in the subspace after the LDA transformation. Thus, Eq. (8) considers all training data to vote for the weight of each discriminative cluster S_i^j in every class i. Hence, we can find the most discriminative cluster S_i^j, s.t. w_i^j > w_i^k, k ∈ [1, K], k ≠ j.
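The paper does not spell the vote counting out in code, and Eq. (8) admits more than one reading; under one plausible reading (count, among the N nearest labeled neighbors of each cluster center, how many carry the class label), a sketch could be:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_component_weights(Y_plus, z_plus, centers, N):
    """Approximate the kNN vote of Eq. (8) -- one possible reading, not
    necessarily the authors' exact procedure.

    Y_plus  : (m, d_bar) relabeled instances in the subspace
    z_plus  : (m,) their class labels (NumPy array)
    centers : dict {(i, j): mu_ij} with the GMM cluster centers per class/component
    Returns w[(i, j)] = fraction of mu_ij's N neighbors that carry label i."""
    nn = NearestNeighbors(n_neighbors=N).fit(Y_plus)
    w = {}
    for (i, j), mu in centers.items():
        _, idx = nn.kneighbors(mu.reshape(1, -1))
        w[(i, j)] = float(np.mean(z_plus[idx[0]] == i))   # neighbors voting for class i
    return w
```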
Algorithm

The basic idea behind our method is that we use the Gaussian mixture model to partition all instances in each class, which gives the data-driven prior for each component. Then, we infer the latent variable at the component level, which can further be used to label each instance in that cluster. Finally, given the labeled data, we use LDA to find a discriminative transformation. This process repeats until convergence. We summarize the above discussion in pseudo code in Algorithm 1. Put simply, we update P and z in an alternating manner, and accept the new projection matrix P obtained with LDA on the relabeled instances. Such an algorithm converges very fast (e.g. 10 iterations). After we have learned the matrix P and {λ_i}_{i=1}^{C} by maximizing Eq. (5), we can use them to select representative and discriminative features (i.e. frames from video datasets) by nearest neighbor searching.

Algorithm 1
Input: training data X and its labels L at video level, β, K, N, T and ε.
Output: P, {λ_i}_{i=1}^{C}
1: Initialize P and w_i;
2: for Iter = 1 to T do
3:   for i = 1; i <= C; i++ do
4:     Project all the training data X into the subspace Y using Y = PX;
5:     For each class, use the Gaussian mixture model to partition its elements in the subspace, and learn λ_i = {π_i, μ_i, Σ_i};
6:     Maximize Eq. (5) to find S_i^j with center μ_i^j;
7:     Relabel all elements in the cluster S_i^j as positive for class i according to Eq. (6);
8:   end for
9:   Update z and construct the new subset X+ and its labels L+ for all C classes;
10:  Do Fisher linear discriminant analysis and update P;
11:  if P converges (changes less than ε), then break;
12:  Compute the N nearest neighbors for each training sample, and calculate the discriminative weight w_i for each class i according to Eq. (8).
13: end for
14: Return P and the cluster centers {λ_i}_{i=1}^{C} learned respectively for all C classes;

Convergence analysis

Our method updates the latent variable z and then P in an alternating manner. Such a strategy can be attributed to the hard-assignment version of the EM algorithm. Recall the EM approach:

P* = argmax_P p(X, L | P) = argmax_P Σ_{i=1}^{C} p(X, L, z_i | P) = argmax_P Σ_{i=1}^{C} p(X, L | z_i) p(z_i | P)    (9)

Then the likelihood can be optimized using iterative application of the EM algorithm.

Theorem 1. Assume the latent variable z is inferred for each instance in X; then maximizing the above function is equivalent to maximizing the following auxiliary function

P = argmax_P Σ_{i=1}^{C} p(z_i | X, L, P̄) ln( p(X, L | z_i) p(z_i | P) )    (10)

where P̄ is the old P. This can be shown using Jensen's inequality.

Lemma 1.1. The hard assignment of the latent variable z by maximizing Eq. (5) is a special case of the EM algorithm.

Proof.

Σ_{i=1}^{C} p(z_i | X, L, P̄) ln( p(X, L | z_i) p(z_i | P) )
  = Σ_{i=1}^{C} p(z_i | X, L, P̄) ln( p(z_i | X, L, P) ) + Σ_{i=1}^{C} p(z_i | X, L, P̄) ln( p(X, L | P) )
  = Σ_{i=1}^{C} p(z_i | X, L, P̄) ln( p(z_i | X, L, P) ) + ln( p(X, L | P) )    (11)

Given P, we can infer the latent variable z. Because of the hard assignment of z, the first term on the right hand side of Eq. (11) assigns z_i to one class. Note that p(z | X, L, P) ln(p(z | X, L, P)) is a monotonically increasing function, which means that by maximizing the posterior likelihood p(z | X, L, P) for each instance, we maximize Eq. (11) for the hard-assignment case in Eq. (5). Thus, the updating strategy in our algorithm is a special case of the EM algorithm, and it converges to a local maximum as the EM algorithm does. Note that in our implementation, we infer the latent variable at the cluster level. In other words, to maximize p(z_i | X, L, P̄), we can include another latent variable π_i^j, j ∈ [1, K]. Specifically, we need to maximize Σ_{j=1}^{K} p(z_i, π_i^j | X, L, P̄), for which we can recursively determine the latent variable π_i using an embedded EM algorithm. Hence, our algorithm uses two steps of EM, and it converges to a local maximum. Refer to (Wu 1983) for more details about the convergence of the EM algorithm.
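For completeness, the Jensen's-inequality step behind Theorem 1 is the standard EM lower bound; in the notation above (this derivation is ours, not quoted from the paper) it reads

ln p(X, L | P) = ln Σ_{i=1}^{C} q(z_i) · p(X, L, z_i | P) / q(z_i)
             ≥ Σ_{i=1}^{C} q(z_i) ln( p(X, L | z_i) p(z_i | P) / q(z_i) ),

with equality when q(z_i) = p(z_i | X, L, P̄). Dropping the P-independent entropy term −Σ_i q(z_i) ln q(z_i) leaves exactly the auxiliary objective of Eq. (10), so any P that increases Eq. (10) cannot decrease the likelihood in Eq. (9).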

Figure 1: Example of the graphical representation for only one class (event) in our model. h_i^j is the hidden variable, x_i^j is the observable input, y_i^j is the projection of x_i^j in the subspace, j ∈ [1, n_i], and n_i is the number of total training data for class i. The K cluster centers μ_i = {μ_i^1, μ_i^2, ..., μ_i^K} are determined by both π_i and w_i. The graphical model of our method is similar to a GMM model in the vertical direction. By adding z_i into LDA, the graphical model can handle latent variables.

Probabilistic understanding for the model

The latent SVM model (Felzenszwalb et al. 2010; Andrews, Tsochantaridis, and Hofmann 2002) attempts to label an instance x_i in a positive bag by maximizing p(z(x_i) = 1 | x_i), which is the optimal Bayes decision rule. Similarly, Eq. (5a) takes the same strategy as latent SVM to maximize a posterior probability. Moreover, instead of only maximizing p(z = 1 | x), we also maximize the joint probability p(z = 1, x), using the Bayes rule p(z = 1, x) = p(x) p(z = 1 | x). In this paper, we use the Gaussian mixture model to approximate the prior p(x). We argue that maximizing a joint probability is reasonable, because it considers both the discriminative (posterior probability) and representative (prior) properties of the video dataset. We give the graphical representation of our model in Fig. 1.

Experiments and results

In this section, we perform experiments on various data sets to evaluate the proposed techniques and compare them to other baseline methods. For all the experiments, we set T = 20 and β = 40 if there is no other specification, and we initialize uniformly weighted w_i and the projection matrix P with LDA.

Classification on toy data sets

The MUSK data sets¹ are the benchmark data sets used in virtually all previous approaches and have been described in detail in the landmark paper (Dietterich, Lathrop, and Lozano-Pérez 1997). Both data sets, MUSK1 and MUSK2, consist of descriptions of molecules using multiple low-energy conformations. Each conformation is represented by a 166-dimensional feature vector derived from surface properties. MUSK1 contains on average approximately 6 conformations per molecule, while MUSK2 has on average more than 60 conformations in each bag. The Corel data set consists of three different categories ("elephant", "fox", "tiger"), and each instance is represented with 230-dimensional features, characterized by color, texture and shape descriptors. The data sets have 100 positive and 100 negative example images; the latter have been randomly drawn from a pool of photos of other animals. We first use PCA to reduce the dimension to 40 for our method. For the parameter setting, we set K = 3, T = 20 and N = 4 (namely, the 4-Nearest-Neighbor (4NN) algorithm is applied for classification) on all datasets except Elephant (for which we set K = 2). The averaged results of 10-fold cross-validation runs are summarized in Table 1. We set LDA and MI-SVM as our baselines.² We can observe that JLFDA outperforms LDA and MI-SVM on the MUSK1, Elephant and Fox data sets; in particular, our method shows a significantly better result on the MUSK1 data set. We also show the average accuracy over the five datasets, which demonstrates that our model is better than MI-SVM, with higher accuracy.

Table 1: Accuracy results for various methods on the MUSK and Corel datasets. Our approach outperforms LDA significantly, and we get better results than MI-SVM on the MUSK1, Elephant and Fox datasets. On average, our method outperforms MI-SVM, which indicates that our model is stable for semi-supervised classification.

Data set   inst/Dim   MI-SVM   LDA     LFDA    JLFDA
MUSK1      476/166    77.9     70.4    81.4    87.1
MUSK2      6598/166   84.3     51.8    76.4    81.3
Elephant   1391/230   81.4     70.5    74.5    82.2
Fox        1320/230   57.8     53.5    61.5    59.5
Tiger      1220/230   84.0     71.5    74.0    80.5
Average    -          77.08    63.54   73.56   78.1

¹ www.cs.columbia.edu/~andrews/mil/datasets.html
² We use the bag label as the instance label to test its performance.
Semantic keyframe extraction

A keyframe defines the starting or ending point of a smooth transition in a video. A semantic keyframe, in our paper, refers to a keyframe that carries semantic meaning; in other words, we can tell what event or topic happened in the video when we observe the keyframes. We conduct experiments on the challenging TRECVID MED11 dataset³ with five events (or classes): attempting a board trick, feeding an animal, landing a fish, wedding ceremony and working on a woodworking project. As for the parameters, we set K = 10 and N = 10. We learn the representative clusters for each class (event), and then use them to find semantic frames in videos with the same labels.

³ http://www.nist.gov/itl/iad/mig/med11.cfm
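The keyframe selection itself is a nearest-neighbor search against the learned representative cluster, as stated in the Algorithm section above; a minimal sketch (the frame features, projection P, and center mu are placeholders, and top_k = 10 matches the 10 frames per video used below) might be:

```python
import numpy as np

def select_keyframes(frame_features, P, mu, top_k=10):
    """Rank a video's frames by distance to the representative cluster center
    of its event in the discriminant subspace, and keep the closest ones.

    frame_features : (n_frames, d) per-frame descriptors (e.g. HOG3D bag-of-words)
    P              : (d, d_bar) learned projection, columns = discriminant directions
    mu             : (d_bar,) center of the selected component S_i^j for this event"""
    Y = frame_features @ P                     # project frames into the subspace
    dist = np.linalg.norm(Y - mu, axis=1)      # distance to the representative center
    return np.argsort(dist)[:top_k]            # indices of candidate semantic keyframes
```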

Then we evaluate the semantic frames for each video through human-factors analysis; the semantic keyframe extraction problem demands a human in the loop for evaluation.

Video representation. For all videos, we extract HOG3D descriptors (Klaser, Marszalek, and Schmid 2008) every 25 frames (roughly sampling one frame per second). To represent videos using local features, we apply a bag-of-words model, using all detected points and a codebook with 1000 elements.

Benchmark methods. We use SVM as the benchmark method in the experiment. We take the one-vs-all strategy to train a linear SVM classifier using SVMlight (Joachims, Finley, and Yu 2009) for each kind of event. Then we choose 10 frames for each video which are either far from the margin or close to the margin on the positive side. We refer to the frames chosen farthest away from the margin as SVM(1), and to the frames closest to the margin as SVM(2). We also randomly select 10 frames from each video, which we refer to as RAND in our experiments.

Experimental setting. Ten highly motivated graduate students (ranging from 22 to 30 years old) served as subjects in all of the following human-in-the-loop experiments. Each subject new to the annotation-task paradigm underwent a training process. Two of the authors gave a detailed description of the dataset and problem, including its background, definition and purpose. In order to indicate what representative and discriminative mean for each event, the two authors showed videos for each kind of event to the subjects, and made sure all subjects understood what semantic keyframes are. The training procedure was terminated after the subject's performance had stabilized. We take a pairwise ranking strategy for our evaluation. We extract 10 frames per video for the 5 different methods (SVM(1), SVM(2), LFDA, JLFDA and RAND), respectively. For each video, we had about 1000 image pairs for comparison. We developed an interface in Matlab to display an image pair and three options (Yes, No and Equal) for comparing an image pair each time. The students were taught how to use the software; a trial requires them to give a ranking: if the left image is better than the right, they choose 'Yes'; if the right is better than the left, they choose 'No'; if the two images are the same, they choose 'Equal'. The subjects are again informed that "better" means a better semantic keyframe. The ten subjects each installed the software on their own computers and conducted the image pair comparisons independently.

Figure 2: Comparison of 5 methods for five events. Higher value, better performance.

Table 2: Win-loss matrix for the five methods. It represents how many times the method in each row wins against the method in each column.

            JLFDA   LFDA    RAND    SVM(1)  SVM(2)
JLFDA       -       3413    2274    2257    3758
LFDA        2957    -       2309    2230    3554
RAND        2111    2175    -       1861    2274
SVM(1)      2088    2270    2010    -       2314
SVM(2)      3232    3316    2113    2125    -
Experimental Results. We have a score for each image pair. Then, by sampling 10 videos from each event, we in the end had annotations for 104 videos, which means the sampled videos obtained from the 10 subjects almost cover all the test data (105 videos). Table 2 shows the win-loss matrix between the five methods, obtained by counting the pairwise comparison results on all 5 events. It shows that JLFDA and LFDA always beat the three baselines. Furthermore, JLFDA is better than LFDA because it considers the data-driven prior, which helps JLFDA to find more representative frames. Refer to the supplementary material for keyframes extracted with JLFDA. We also compared the five methods on the basis of the Condorcet voting method. We treat 'Yes', 'No' and 'Equal' as voters for each method in the pairwise comparison: if 'Yes', we cast one ballot for the left method; if 'No', we add a ballot for the right method; otherwise we do nothing for the two methods. Fig. 2 shows the ballots for each method on each event. It demonstrates that our method JLFDA always beats the other methods, except on the E004 dataset.

We further compared the five methods based on the Elo rating system⁴. For each video, we ranked the five methods according to the Elo ranking system. Then, we counted the number of times each method ranked No.1 at the video level in each event. The results in Table 3 show that our method is better than the others, except on E004. For example, E002 has a total of 20 videos (column summation), where JLFDA ranked first on 7 videos, while RAND ranks first on only 2 videos. Such results are consistent with those obtained using the Condorcet ranking method in Fig. 2.

Table 3: The number of No.1 rankings for each method in each event. Higher value, better results. It demonstrates that our method is more capable of extracting semantically meaningful keyframes.

Method   E001   E002   E003   E004   E005
JLFDA    6      7      7      3      7
LFDA     6      4      4      5      1
SVM(1)   4      4      4      2      4
SVM(2)   6      3      1      7      6
RAND     2      2      4      3      2

⁴ https://en.wikipedia.org/wiki/Elo_rating_system
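For reference, the Condorcet-style ballot counting described above is a straightforward tally; a sketch over hypothetical judgment records (the method names are the real ones, but the `judgments` format is ours) could be:

```python
from collections import defaultdict

METHODS = ["JLFDA", "LFDA", "RAND", "SVM(1)", "SVM(2)"]

def win_loss_matrix(judgments):
    """judgments: iterable of (left_method, right_method, answer) tuples with
    answer in {'Yes', 'No', 'Equal'}; 'Yes' favors the left image, 'No' the
    right, and 'Equal' adds no ballot, as in the setup above."""
    wins = defaultdict(int)
    for left, right, answer in judgments:
        if answer == "Yes":
            wins[(left, right)] += 1      # row method beats column method once
        elif answer == "No":
            wins[(right, left)] += 1
    return [[wins[(r, c)] if r != c else None for c in METHODS] for r in METHODS]
```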

E004 is the wedding ceremony event, and on it our method is consistently outperformed by the SVM baseline. We believe this is due to the distinct nature of the E004 videos, in which the video scene context itself distinguishes the event from the other four (wedding ceremonies typically involve very many people and take place indoors). Hence the discriminative components of the methods take over, and the SVM is able to outperform our model.

Conclusion

In this paper, we have presented a latent Fisher discriminant analysis model with representative feature discovery, which combines latent variables, multi-class learning and dimension reduction in a unified framework. To the best of our knowledge, this is the first paper to generalize LDA and study the extraction of semantically representative and discriminative features together rather than in a separate manner (either representative or discriminative). We conduct thorough experiments on the MUSK, Corel and TRECVID MED11 datasets, and demonstrate that our method is able to consistently outperform competitive baselines.

Acknowledgement

The authors would like to thank Dr. Jason J. Corso for inspiring discussions on the whole idea and valuable feedback on the paper. We also thank all VPML members at SUNY at Buffalo for conducting the human-factors ranking-based experimental evaluation on the TRECVID MED11 dataset. In addition, we would also like to thank the anonymous reviewers for their helpful comments.

References

Andrews, S.; Tsochantaridis, I.; and Hofmann, T. 2002. Support vector machines for multiple-instance learning. In NIPS, 561–568.
Baudat, G., and Anouar, F. 2000. Generalized discriminant analysis using a kernel approach. Neural Computation 12:2385–2404.
Belhumeur, P. N.; Hepanha, J. P.; and Kriegman, D. J. 1997. Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE TPAMI 19:711–720.
Blum, A., and Mitchell, T. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the Workshop on Computational Learning Theory, 92–100.
Cai, D.; He, X.; and Han, J. 2007. Semi-supervised discriminant analysis. In ICCV. IEEE.
Chen, G., and Corso, J. J. 2012. Greedy multiple instance learning via codebook learning and nearest neighbor voting. arXiv preprint.
Chen, L.; Liao, H.; Ko, M.; Lin, J.; and Yu, G. 2000. A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition 33:1713–1726.
Dempster, A. P.; Laird, N. M.; and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39(1):1–38.
Dietterich, T. G.; Lathrop, R. H.; and Lozano-Pérez, T. 1997. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence 89:31–71.
Ding, C., and Li, T. 2007. Adaptive dimension reduction using discriminant analysis and k-means clustering. In ICML, 521–528. New York, NY, USA: ACM.
Duda, R. O.; Hart, P. E.; and Stork, D. G. 2000. Pattern Classification (2nd Edition). Wiley-Interscience.
Felzenszwalb, P. F.; Girshick, R. B.; McAllester, D.; and Ramanan, D. 2010. Object detection with discriminatively trained part-based models. TPAMI 32:1627–1645.
Fisher, R. A. 1936. The use of multiple measurements in taxonomic problems. Annals of Eugenics 7(7):179–188.
Fukunaga, K. 1990. Introduction to Statistical Pattern Recognition (2nd ed.). San Diego, CA, USA: Academic Press Professional, Inc.
Hastie, T., and Tibshirani, R. 1996. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B 58:155–176.
Joachims, T.; Finley, T.; and Yu, C.-N. J. 2009. Cutting-plane training of structural SVMs. JMLR 77:27–59.
Joachims, T. 1999. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning, 200–209.
Klaser, A.; Marszalek, M.; and Schmid, C. 2008. A spatio-temporal descriptor based on 3D-gradients. In BMVC.
Maron, O., and Lozano-Pérez, T. 1998. A framework for multiple-instance learning. In NIPS.
Maron, O., and Ratan, A. L. 1998. Multiple-instance learning for natural scene classification. In ICML, 341–349. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Merchante, L. F. S.; Grandvalet, Y.; and Govaert, G. 2012. An efficient approach to sparse linear discriminant analysis. In ICML.
Perera, A.; Oh, S.; Leotta, M.; Kim, I.; Byun, B.; Lee, C.-H.; McCloskey, S.; Liu, J.; Miller, B.; Huang, Z. F.; Vahdat, A.; Yang, W.; Mori, G.; Tang, K.; Koller, D.; Fei-Fei, L.; Li, K.; Chen, G.; Corso, J. J.; Fu, Y.; and Srihari, R. K. 2011. GENIE TRECVID2011 multimedia event detection: Late-fusion approaches to combine multiple audio-visual features. In NIST TRECVID Workshop.
Rao, C. R. 1948. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B 10(2):159–203.
Sugiyama, M.; Idé, T.; Nakajima, S.; and Sese, J. 2010. Semi-supervised local Fisher discriminant analysis for dimensionality reduction. Machine Learning 78:35–61.
Vapnik, V. N. 1998. Statistical Learning Theory. John Wiley & Sons.
Wu, Y.; Wipf, D. P.; and Yun, J. 2015. Understanding and evaluating sparse linear discriminant analysis. In AISTATS.
Wu, C. F. J. 1983. On the convergence properties of the EM algorithm. The Annals of Statistics 11(1):95–103.
Ye, J. 2007. Least squares linear discriminant analysis. In ICML.
Zhang, Y., and Yeung, D.-Y. 2008. Semi-supervised discriminant analysis via CCCP. In ECML. IEEE.
