Semi-supervised Spectral Clustering for Image Set Classification

Arif Mahmood, Ajmal Mian, Robyn Owens
School of Computer Science and Software Engineering, The University of Western Australia
{arif.mahmood, ajmal.mian, robyn.owens}@uwa.edu.au

Abstract

We present an image set classification algorithm based on unsupervised clustering of labeled training and unlabeled test data where labels are only used in the stopping criterion. The probability distribution of each class over the set of clusters is used to define a true set based similarity measure. To this end, we propose an iterative sparse spectral clustering algorithm. In each iteration, a proximity matrix is efficiently recomputed to better represent the local subspace structure. Initial clusters capture the global data structure and finer clusters at the later stages capture the subtle class differences not visible at the global scale. Image sets are compactly represented with multiple Grassmannian manifolds which are subsequently embedded in Euclidean space with the proposed spectral clustering algorithm. We also propose an efficient eigenvector solver which not only reduces the computational cost of spectral clustering by many folds but also improves the clustering quality and the final classification results. Experiments on five standard datasets and comparison with seven existing techniques show the efficacy of our algorithm.

Figure 1. Projection on the top 3 principal directions of 5 subjects (in different colors) from the CMU Mobo face dataset. The natural data clusters do not follow the class labels and the underlying face subspaces are neither independent nor disjoint.

1. Introduction

Image set based object classification has recently received significant research interest [1, 2, 5, 10, 12, 17, 20, 28, 29, 30] due to its higher potential for accuracy and robustness compared to single image based approaches. An image set contains more appearance details such as multiple views, illumination variations and time lapsed changes. These variations complement each other, resulting in a better object representation. However, set based classification also introduces many new challenges. For example, in the presence of large intra-set variations, efficient representation turns out to be a difficult problem. Considering face recognition, it is well known that images of different identities in the same pose are more similar than images of the same identity in different poses (Fig. 1). Other important challenges include the efficient utilization of all available information in a set to exploit intra-set similarities and inter-set dissimilarities.

Many existing image set classification techniques are variants of the nearest neighbor (NN) algorithm where the NN distance is measured under some constraint, such as representing sets with affine or convex hulls [8], regularized affine hulls [3], or using a sparsity constraint to find the nearest points between image sets [30]. Since NN techniques utilize only a small part of the available data, they are more vulnerable to outliers.

At the other end of the spectrum are algorithms that represent the holistic set structure, generally as a linear subspace, and compute similarity as canonical correlations or principal angles [26]. However, the global structure may be a non-linear complex manifold and representing it with a single subspace will lead to incorrect classification [2]. Discriminant analysis has been used to force the class boundaries by finding a space where each class is more compact while different classes are far apart. Due to the multi-modal nature of the sets, such an optimization may not scale the inter-class distances appropriately (Fig. 1). In the middle of the spectrum are algorithms [2, 24, 28] that divide an image set into multiple local clusters (local subspaces or manifolds) and measure cluster-to-cluster distance [24, 28].

Figure 2. Proposed algorithm: (a) Face manifolds of 4 subjects from the CMU Mobo dataset in PCA space. (b) Each set transformed to a Grassmannian manifold. (c) Data matrix and class label vector. (d) Iterative Sparse Spectral Clustering algorithm: proximity matrix based on sparse least squares, a novel fast eigenvector solver and a supervised termination criterion. (e) Each class element assigned a latent cluster label. (f) Probability distribution of each class over the set of clusters. (g) Class probability distribution based distance measure.

Chen et al. [2] achieved improved performance by computing the distance between different locally linear subspaces. In all these techniques, classification is dominated by either a few data points, only one cluster, one local subspace, or one basis of the global set structure, while the rest of the image set data or local structure variations are ignored.

We propose a new framework in which the final classification decision is based on all data/clusters/subspaces contained in all image sets. Therefore, classification decisions are global compared to the existing local decisions. For this purpose, we apply unsupervised clustering on all data contained in the gallery and the probe sets, without enforcing set boundaries. We find natural data clusters based on the true data characteristics without using labels. Set labels are used only to determine the required number of clusters, i.e. in the termination criterion. The probability distribution of each class over the set of natural data clusters is computed. Classification is performed by measuring distances between the probability distributions of different classes.

The proposed framework is generic and applicable to any unsupervised clustering algorithm. However, we propose an improved version of the Sparse Subspace Clustering (SSC) algorithm [6] for our framework. SSC yields good results if different classes span independent or disjoint subspaces, and it is not directly applicable to the image-set classification problem where the subspaces of different classes are neither independent nor disjoint.
We propose various improvements over the SSC algorithm. We obtain the proximity matrix by sparse least squares with more emphasis on reconstruction error minimization than on sparsity induction. We perform clustering iteratively and in each iteration divide a parent cluster into a small number of child clusters using the NCut objective function. Coarser clusters capture the global data variations while the finer clusters, in the later iterations, capture the subtle local variations not visible at the global level. This coarse to fine scheme allows discrimination between globally similar data elements.

The proposed clustering algorithm can be directly used with the set samples; however, we represent each set with a Grassmannian manifold and perform clustering on the manifold basis matrices. This strategy reduces computational complexity and increases discrimination and robustness to different types of noise in the image sets. Using Grassmann manifolds, we make an ensemble of spectral classifiers which further increases accuracy and gives the reliability (confidence) of the label assignment.

Our final contribution is a fast eigenvector solver based on the group power iterations method. The proposed algorithm iteratively finds all eigenvectors simultaneously and terminates when the signs of the required eigenvector coefficients become stable. This is many times faster than the Matlab SVDS implementation and yields clusters of the same or better quality. Experiments are performed on three standard face image-set datasets (Honda, Mobo and Youtube Celebrities), an object categorization dataset (ETH-80), and the Cambridge hand gesture dataset. Results are compared to seven state of the art algorithms. The proposed technique achieves the highest accuracy on all datasets. The maximum improvement is observed on the most challenging Youtube dataset where our algorithm achieves 11.4% higher accuracy than the best reported.

2. Image Set Classification by Semi-supervised Clustering

Most image-set classification algorithms try to enforce the set boundaries in a supervised way, i.e. label assignment to the training data is often manual and based on real world semantics or prior information instead of the underlying data characteristics. For example, all images of the same subject are pre-assigned the same label even though the intra-subject dissimilarity may exceed the inter-subject dissimilarity. Therefore, the intrinsic data clusters may not align well with the imposed class boundaries.

Wang et al. [24, 28] computed multiple local clusters (sub-sets) within the set boundaries. Subsets are individually matched and a gallery class containing a subset maximally similar to a query subset becomes the label winner. This may address the problem of within-set dissimilarity, but most samples do not play any role in the label estimation.

We propose to compute the set similarities based on the probability distribution of each data-set over an exhaustive number of natural data clusters. We propose unsupervised clustering to be performed on all data, the probe and all the training sets combined, without considering labels. By doing so, we get natural clusters based on the inherent data characteristics. Clusters are allowed to form across two or more gallery classes. Once an appropriate number of clusters is obtained, we use the labels to compute the class probability distribution over the set of clusters.

Let n_k be the number of natural clusters and n_c be the number of gallery classes. For a class c_i having n_i data points, let p_i \in \mathbb{R}^{n_k} be its distribution over all clusters: \sum_{k=1}^{n_k} p_i[k] = 1 and 1 \ge p_i[k] = n_i[k]/n_i \ge 0, where n_i[k] is the number of data points of class c_i in the k-th cluster. Our basic framework does not put any condition on n_k; however, we argue by Lemma 2.1 that an optimal number of clusters exists and can be found for the task of set label estimation. The derivation of Lemma 2.1 is based on the notions of 'conditional orthogonality' and 'indivisibility' of clusters, defined below. We assume that classes c_i and c_j belong to the gallery (G) with known labels, while c_p is the probe set with unknown label.

Conditional Orthogonality. The distribution of class c_i is conditionally orthogonal to the distribution of a class c_j w.r.t. the distribution of the probe set c_p if

(p_i \perp_c p_j \,|\, p_p) := \langle (p_i \wedge p_p), (p_j \wedge p_p) \rangle = 0, \quad \forall (c_i, c_j) \in G,   (1)

where \wedge is the logical AND operation and \langle \cdot \rangle is the inner product. If both operands of \wedge are non-zero then the result is 1, otherwise the result is 0.

Indivisible Cluster. A cluster k^* is indivisible from the probe set label estimation perspective if

p_i[k^*] \wedge p_j[k^*] \wedge p_p[k^*] = 0, \quad \forall (c_i, c_j) \in G.   (2)

A cluster is 'indivisible' if either exactly one gallery class has non-zero probability over that cluster or the probability of the probe set is zero.

Lemma 2.1. The optimal number of clusters for the set labeling problem is the minimum number of clusters such that all gallery class distributions become orthogonal to each other w.r.t. the probe set distribution:

n_k^* \triangleq \min_{n_k} \; (p_i \perp_c p_j \,|\, p_p), \quad \forall c_i, c_j \in G.   (3)

We only discuss an informal proof of Lemma 2.1. The condition (3) ensures that all clusters are indivisible, therefore increasing the number of clusters beyond n_k^* will not yield more discrimination. For n_k < n_k^*, there will be some clusters having overlap between class distributions, hence reducing the discrimination. Thus n_k^* is the optimal number of clusters for the task at hand.

Once the optimality condition is satisfied, all clusters will be indivisible. Existing measures may then be used for the computation of distances between class-cluster distributions. For this purpose, we consider the Bhattacharyya and Hellinger distance measures. We also propose a modified Bhattacharyya distance which we empirically found more discriminative than the existing measures.

The Bhattacharyya distance B_{i,p} between a class c_i \in G and the probe-set c_p, having probabilities p_i(k) and p_p(k) over the k-th cluster, is

B_{i,p} = -\ln \sum_{k=1}^{n_k^*} \sqrt{p_i(k)\, p_p(k)}.   (4)

In (4), ln(0) := 0, therefore 0 \le B_{i,p} \le 1.

The Hellinger distance is the l2 norm distance between two probability distributions:

H_{i,p} = \frac{1}{\sqrt{2}} \sqrt{ \sum_{k=1}^{n_k^*} \left( \sqrt{p_i[k]} - \sqrt{p_p[k]} \right)^2 }.   (5)

The Modified Bhattacharyya Distance B^M_{i,p} of gallery class c_i with probe class c_p is given by

B^M_{i,p} = -\ln \left\langle \sqrt{p_i},\, \sqrt{p_i \cdot (p_i \wedge p_p)} \right\rangle \left\langle \sqrt{p_p},\, \sqrt{p_p \cdot (p_i \wedge p_p)} \right\rangle,   (6)

where (\cdot) in this definition is a point-wise multiplication operator and has precedence over the inner product.

Lemma 2.2. The Bhattacharyya distance is upper bounded by the modified Bhattacharyya distance: B^M_{i,p} \ge B_{i,p}.

The proof simply follows from the Cauchy-Schwarz inequality: B^M_{i,p} \ge -\ln \langle \sqrt{p_i}, \sqrt{p_p} \rangle. At the extreme cases, when the angle between the two distributions is 0 or 90 degrees, B^M_{i,p} = B_{i,p}, while for all other cases B^M_{i,p} > B_{i,p}. Note that the factors forced to be zero in B^M_{i,p} by introducing the \wedge operator are automatically canceled in B_{i,p}. The performance of the three measures was experimentally compared and we observed that B^M_{i,p} achieves the highest accuracy.
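The two ingredients of this section, the indivisibility test (2) that drives the stopping criterion and the distances (4)-(6) between class-cluster distributions, reduce to a few lines of NumPy. The sketch below is illustrative only: the counts array layout and the function names are ours, and the modified distance is a literal transcription of (6).

import numpy as np

def all_clusters_indivisible(counts, probe_row):
    """counts[c, k] = number of samples of class c assigned to cluster k;
    row `probe_row` holds the dummy-labeled probe set.
    A cluster is indivisible (Eq. 2) if the probe has no mass in it, or if
    at most one gallery class has non-zero mass in it."""
    probe_mass = counts[probe_row] > 0
    gallery = np.delete(counts, probe_row, axis=0) > 0
    divisible = probe_mass & (gallery.sum(axis=0) > 1)
    return not divisible.any()

def bhattacharyya(p_i, p_p):
    # Eq. (4), with the convention ln(0) := 0
    s = np.sum(np.sqrt(p_i * p_p))
    return 0.0 if s == 0 else -np.log(s)

def hellinger(p_i, p_p):
    # Eq. (5): l2 distance between the square-rooted distributions
    return np.linalg.norm(np.sqrt(p_i) - np.sqrt(p_p)) / np.sqrt(2.0)

def modified_bhattacharyya(p_i, p_p):
    # Eq. (6): the logical AND keeps only clusters where both distributions
    # have non-zero mass; (.) is point-wise multiplication.
    both = ((p_i > 0) & (p_p > 0)).astype(float)
    s = np.inner(np.sqrt(p_i), np.sqrt(p_i * both)) * \
        np.inner(np.sqrt(p_p), np.sqrt(p_p * both))
    return 0.0 if s == 0 else -np.log(s)

Clustering is refined until all_clusters_indivisible returns true, which by Lemma 2.1 yields n_k^*; the probe set is then assigned the gallery label with the smallest distance.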
3. Spectral Clustering

The basic idea is to divide the data points into different clusters using the spectrum of a proximity matrix which represents an undirected weighted graph. Each data point corresponds to a vertex and edge weights correspond to the similarity between two points. Let G = \{X_i\}_{i=1}^{g} \in \mathbb{R}^{l \times n_g} be the collection of data points in the gallery sets. Here, n_g = \sum_{i=1}^{g} n_i is the total number of data points in the gallery. The i-th image set has n_i data points, each of dimension l. The gallery contains g sets and n_c classes, g \ge n_c, with X_i = \{x_j\}_{j=1}^{n_i} \in \mathbb{R}^{l \times n_i}. Each data point x_j could be a feature vector or simply the pixel values.

Let c_p = \{x_i\}_{i=1}^{n_p} \in \mathbb{R}^{l \times n_p} be the probe-set with a dummy label n_c + 1. We make a global data matrix by appending all gallery sets and the probe set: D = [G\; c_p] \in \mathbb{R}^{l \times n_d}, where n_d = n_g + n_p. The affinity matrix A \in \mathbb{R}^{n_d \times n_d} is computed as

A_{i,j} = \exp(-\|x_i - x_j\|^2 / (2\sigma^2)) for i \ne j, and A_{i,j} = 0 for i = j.   (7)

From A, a degree matrix D is computed:

D(i,j) = \sum_{i=1}^{n_d} A(i,j) for i = j, and D(i,j) = 0 for i \ne j.   (8)

Using A and D, a normalized graph Laplacian L_w is computed:

L_w = D^{-1/2} A D^{-1/2}.   (9)

Let E = \{e_i\}_{i=1}^{n_c} be the matrix of the n_c smallest eigenvectors of L_w. The eigenvectors of the Laplacian matrix embed the graph vertices into a Euclidean space where an NN approach can be used for clustering. Therefore, the rows of E are unit normalized and grouped into n_c clusters using kNN.

3.1. Proximity Matrix As Sparse Least Squares

Often high dimensional data sets lie on low dimensional manifolds. In such cases, the Euclidean distance based proximity matrix is not an effective way to represent the geometric relationships among the data points. A more viable option is the sparse representation of data, which has been used for many tasks including label propagation [4], dimensionality reduction, and face recognition [11]. Elhamifar and Vidal [6] have recently proposed sparse subspace clustering which can discriminate data lying on independent and disjoint subspaces.

A vector can only be represented as a linear combination of other vectors spanning the same subspace. Therefore, proximity matrices based on the linear decomposition of data points lead to subspace based clustering. Representing a data point x_i as a linear combination of the remaining data points \hat{D} = D \setminus x_i ensures that zero coefficients will only correspond to the points spanning different subspaces. Such a decomposition can be computed with least squares

\alpha_i = (\hat{D}^\top \hat{D})^{-1} \hat{D}^\top x_i,   (10)

where \alpha_i are the linear coefficients. We are mainly concerned with the face space, which is neither independent nor disjoint across different subjects. Therefore, introducing a sparsity constraint on \alpha ensures that the linear coefficients from less relevant subspaces will be forced to zero:

\alpha_i^* = \min_{\alpha_i} \left( \|x_i - \hat{D}\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \right).   (11)

The first term minimizes the reconstruction error while the second term induces sparsity. This process is repeated for all data points and the corresponding \alpha_i are appended as columns in a matrix S = \{\alpha_i\}_{i=1}^{n_d} \in \mathbb{R}^{n_d \times n_d}. Some of the \alpha coefficients may be negative and in general S(i,j) \ne S(j,i). Therefore, a symmetric sparse LS proximity matrix is computed as A = |S| + |S^\top| for spectral clustering.
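As an illustration, the sparse least squares proximity matrix can be prototyped with an off-the-shelf l1 solver. The sketch below uses scikit-learn's Lasso as a stand-in for (11) (its regularization scaling differs slightly from \lambda, and the paper itself uses the SPAMS package); the helper names are ours.

import numpy as np
from sklearn.linear_model import Lasso

def sparse_ls_proximity(D, lam=0.01):
    """D is the l x n_d global data matrix [G c_p] (columns are samples).
    Returns the symmetric proximity matrix A = |S| + |S^T| of Sec. 3.1."""
    _, n_d = D.shape
    S = np.zeros((n_d, n_d))
    for i in range(n_d):
        idx = np.delete(np.arange(n_d), i)            # D_hat = D \ x_i
        model = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
        model.fit(D[:, idx], D[:, i])                 # Eq. (11)
        S[idx, i] = model.coef_
    return np.abs(S) + np.abs(S).T

def spectral_embedding(A, n_vec):
    """Embed the graph vertices using the smallest eigenvectors of the
    normalized Laplacian (cf. Eq. 13); rows are then unit normalized."""
    d = A.sum(axis=1)
    d_is = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_sym = np.eye(len(d)) - d_is[:, None] * A * d_is[None, :]
    _, V = np.linalg.eigh(L_sym)                      # ascending eigenvalues
    E = V[:, :n_vec]
    return E / np.maximum(np.linalg.norm(E, axis=1, keepdims=True), 1e-12)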
3.2. Iterative Hierarchical Sparse Spectral Clustering

Conventionally, sparse spectral clustering is performed by simultaneous partitioning of the graph into n_k clusters [6]. We argue that iterative hierarchical clustering has many advantages. In each iteration, we divide the graph into very few partitions/clusters (four in our implementation). If a cluster is not indivisible (2), we recompute the local sparse LS based proximity matrix only for that cluster and then re-compute the eigenvectors. Note that we do not reuse a part of the initial proximity matrix, because the sparse LS gives a different matrix due to the reduced number of candidates. The new matrix highlights only local connectivity as opposed to global connectivity at the highest level. Coarser clusters capture the large global variations while the finer clusters, obtained later, capture the subtle inter-class differences which may not be visible at the global scale. As a result, we are able to locally differentiate between data points which were globally similar. Moreover, as we explain next, in the case of simultaneous partitioning an approximation error accumulates which adversely affects the cluster quality, whereas the iterative hierarchical approach enables us to obtain high quality clusters.

Let a_l and a_r be the two disjoint partitions. We want to find a graph cut that minimizes the sum of edge weights \sum_{i \in a_l, j \in a_r} A(i,j) across the cut. MinCut is easy to implement but it may give unbalanced partitions. In the extreme case, minCut may separate only one node from the remaining graph. To ensure balanced partitions, we minimize the normalized cut (NCut) objective function [25, 7]

\frac{1}{V_{a_l}} \sum_{i \in a_l, j \in a_r} A(i,j) + \frac{1}{V_{a_r}} \sum_{i \in a_l, j \in a_r} A(i,j),   (12)

where V_{a_l} is the sum of all edge weights attached to the vertices in a_l. The objective function increases with the decrease in the number of nodes in a partition. Unfortunately, minCut with the NCut objective function is NP hard and only approximate solutions can be computed using the spectral embedding approach. It has been shown in [25] that the eigenvector corresponding to the second smallest eigenvalue (e_2) of

L_{sym} = D^{-1/2}(D - A)D^{-1/2}   (13)

provides an approximate solution to a relaxed NCut problem. All data points corresponding to e_2(i) \ge 0 are assigned to a_l while the remaining ones go to a_r. Similarly, e_3 can further divide each cluster into two more clusters and the higher order eigenvectors can be used for further partitioning the graph into smaller clusters. However, the approximation error accumulates, degrading the clustering quality.

Although good quality clusters can be obtained using the iterative approach, it requires more computations because the proximity matrix and eigenvector computations must be repeated at each iteration. If the data matrix D is divided into four balanced partitions each time, the sizes of the new matrices are 16 times smaller. Since the complexity of eigenvector solvers is O(n_d^3), the complexity in the next iteration is O((n_d/4)^3) for each sub-problem. The depth of the recursion tree is log_4(n_d); however, the proposed supervised stopping criterion does not let the iterations continue until the end, rather the process stops as soon as all clusters are indivisible. A computational complexity analysis of the recursion tree reveals that the overall complexity of the eigenvector computations remains O(n_d^3).
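One split of the kind performed in each iteration can be sketched as follows: build L_sym for the current cluster, take the eigenvector of the second smallest eigenvalue, split on its sign, and recurse only into clusters that are still divisible. For brevity the sketch produces binary splits from e_2 alone, whereas the implementation described above uses the next eigenvectors to produce four children per iteration; helper names are ours.

import numpy as np

def ncut_bipartition(A):
    """Split one cluster by the sign of e_2 of L_sym = D^{-1/2}(D - A)D^{-1/2} (Eq. 13)."""
    d = A.sum(axis=1)
    d_is = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_sym = np.eye(len(d)) - d_is[:, None] * A * d_is[None, :]
    _, V = np.linalg.eigh(L_sym)
    e2 = V[:, 1]
    return np.where(e2 >= 0)[0], np.where(e2 < 0)[0]

def iterative_clustering(D, indices, indivisible, proximity, out):
    """Recompute a local proximity matrix for each divisible cluster and split it."""
    if len(indices) < 2 or indivisible(indices):
        out.append(indices)
        return
    A = proximity(D[:, indices])                  # local sparse LS matrix (Sec. 3.1)
    left, right = ncut_bipartition(A)
    if len(left) == 0 or len(right) == 0:         # degenerate cut: stop splitting
        out.append(indices)
        return
    iterative_clustering(D, indices[left], indivisible, proximity, out)
    iterative_clustering(D, indices[right], indivisible, proximity, out)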
4. Computational Cost Reduction

We propose two approaches for reducing the computational complexity of spectral clustering. The first approach reduces the data by embedding each image set on a Grassmann manifold and then using the manifold basis vectors to represent the set. The second approach is a fast eigenvector solver which quickly computes approximate eigenvectors using an early termination criterion based on sign changes.

4.1. Data Reduction by Grassmann Manifolds

Eigenvector computation for large Laplacian matrices L_{sym} \in \mathbb{R}^{n_d \times n_d} incurs a high computational cost. Often a part of the data matrix D or some columns of L_{sym} are sampled, and eigenvectors are computed for the sampled data and extrapolated for the rest [7, 22, 18]. Approximate approaches provide sufficient accuracy for computing the most significant eigenvectors but are not as accurate for the least significant eigenvectors, which are actually required in spectral clustering. In contrast, we propose to represent each set by a compact representation and to perform clustering on the representation instead of the original data. Our choice of image set representation is motivated by linear subspace based image-set representations [28, 26]. These subspaces may be considered as points on Grassmannian manifolds [9, 10]. While others performed discriminant analysis on the Grassmannian manifolds or computed manifold to manifold distances, we perform sparse spectral clustering on the Grassmannian manifolds.

A set of \lambda-dimensional linear subspaces of \mathbb{R}^n, with n = \min(l, n_j) and \lambda \le n, is termed the Grassmann manifold Grass(\lambda, n). An element of Grass(\lambda, n) is a \lambda-dimensional subspace which can be specified by a set of \lambda vectors Y = \{y_1, ..., y_\lambda\} \in \mathbb{R}^{l \times \lambda} together with the set of all their linear combinations. For each data element of the image set, we compute a set of basis vectors Y, and the set of all such Y matrices is termed the non-compact Stiefel manifold ST(\lambda, n) := \{Y \in \mathbb{R}^{l \times \lambda} : rank(Y) = \lambda\}. We arrange all the Y matrices in a basis matrix B which is capable of representing each data point in the image set by using just \lambda of its columns. For the i-th data point in the j-th image set, x_j^i \in X_j, having B_j as the basis matrix, x_j^i = B_j \alpha_j^i, where \alpha_j^i is the set of linear parameters with \|\alpha_j^i\|_0 = \lambda. For the case of known B_j, we can find a matrix \alpha_j = \{\alpha_j^1, \alpha_j^2, ..., \alpha_j^{n_i}\} such that the residue approaches zero:

\min_{\alpha_j} \sum_{i=1}^{n_i} \|x_j^i - B_j \alpha_j^i\|_2^2 \quad s.t. \quad \|\alpha_j^i\|_0 \le \lambda.   (14)

Since the l1 norm can approximate the l0 norm, we can estimate both \alpha_j and B_j iteratively by using the following objective function [16]:

\min_{\alpha_j, B_j} \frac{1}{n_j} \sum_{i=1}^{n_j} \frac{1}{2} \|x_j^i - B_j \alpha_j^i\|_2^2 \quad s.t. \quad \|\alpha_j^i\|_1 \le \lambda.   (15)

The solution is obtained by randomly initializing B_j and computing \alpha_j, then fixing \alpha_j and computing B_j. The size of B_j \in \mathbb{R}^{l \times \Lambda} is significantly smaller than the corresponding image set X_j \in \mathbb{R}^{l \times n_j}. Representing each image set with \Lambda basis vectors reduces the data matrix from n_d \times n_d to \Lambda g \times \Lambda g, which significantly reduces the computational cost. Additionally, we observe that this representation also provides a significant increase in the accuracy of the proposed clustering algorithm because the underlying subspaces of each class are robustly captured in B, leaving out noise.
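The alternating minimization of (15) is standard l1 dictionary learning. The paper uses the SPAMS package; purely as an illustration, scikit-learn's DictionaryLearning can serve as a stand-in, with the wrapper and parameter choices below being ours.

import numpy as np
from sklearn.decomposition import DictionaryLearning

def grassmann_basis(X_set, n_basis, alpha=1.0, seed=0):
    """Learn an l x Lambda basis matrix B_j for one image set X_set (l x n_j)
    by alternating sparse coding and dictionary updates, in the spirit of Eq. (15)."""
    dl = DictionaryLearning(n_components=n_basis, alpha=alpha,
                            fit_algorithm='cd', transform_algorithm='lasso_cd',
                            random_state=seed)
    dl.fit(X_set.T)                    # rows are samples of dimension l
    return dl.components_.T            # columns are the learned basis vectors

Stacking the B_j of all gallery sets and of the probe set gives the reduced Lambda*g x Lambda*g clustering problem described above; different seeds and basis sizes yield the ensemble members of Sec. 5.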
4.2. Fast Approximate Eigenvector Solver

The size of the Laplacian matrix still grows with the number of subjects. To reduce the cost of eigenvector computations, we propose a fast group power iteration method which finds all the eigenvectors and eigenvalues simultaneously, given a sufficient number of iterations. The proposed algorithm always converges and has good numerical stability.

The signs of the e_2 eigenvector coefficients (corresponding to the minimum non-zero eigenvalue) of L_{sym} provide an approximate solution to a relaxed NCut problem [25]. Therefore, if we get the correct signs with approximate magnitudes of the eigenvectors, we will still get the same quality of clusters. We exploit this fact in our proposed eigenvector solver and terminate the iterations when the number of sign changes falls below a threshold.

The power iteration algorithm is a fundamental technique for eigenvalue and eigenvector computation [31]. A random vector v is repeatedly multiplied by the matrix L and normalized. After k iterations, L v^{(k)} = \lambda v^{(k)}, where v^{(k)} is the most dominant eigenvector of L and \lambda the corresponding eigenvalue. To calculate the next eigenvector, the same process is repeated on the deflated matrix L.

The group power iteration method can be used for the computation of all eigenvectors simultaneously. A random matrix V_r is iteratively multiplied by L. After each multiplication, a unit normalization step is required by division with V_r^\top V_r, which converges to a diagonal matrix as V_r converges to an orthonormal matrix. To stop all columns from converging to the most dominant eigenvector, an orthogonalization step is also required. We use QR decomposition for this purpose: V_r^{(k)} R_r^{(k)} = V_r^{(k)}. A simple group power iteration method will have poor convergence properties [31]. To overcome this, we apply convergence from the left and the right sides simultaneously by computing the left and the right eigenvectors (see Algorithm 1). We observe that Algorithm 1 always converges to the correct solution and has better numerical properties than the group power iteration.

Algorithm 1 Fast Eigen Solver: Group Power Iteration
Input: L \in \mathbb{R}^{n \times n}, tolerance \epsilon_L
Output: U, V {eigenvectors}, \Lambda {eigenvalues}
  V_r^{(0)} = I(n, n) {identity matrix}
  \Lambda^{(0)} = L, \delta\Lambda = 1
  while \delta\Lambda \ge \epsilon_L do
    V_l^{(k)} \leftarrow L^\top V_r^{(k-1)}
    V_l^{(k)} \leftarrow V_l^{(k)} / (V_l^{(k)\top} V_l^{(k)})
    V_l^{(k)} R_l^{(k)} \leftarrow V_l^{(k)} {left QR decomposition}
    V_r^{(k)} \leftarrow L V_l^{(k)}
    V_r^{(k)} \leftarrow V_r^{(k)} / (V_r^{(k)\top} V_r^{(k)})
    V_r^{(k)} R_r^{(k)} \leftarrow V_r^{(k)} {right QR decomposition}
    \Lambda^{(k)} \leftarrow V_r^{(k)} L V_l^{(k)\top}
    \delta\Lambda = \|diag(\Lambda^{(k)} - \Lambda^{(k-1)})\|_2
  end while
  U \leftarrow V_r^{(k)}, V \leftarrow V_l^{(k)}

For the proposed spectral clustering, only the signs are important. Therefore, in Algorithm 1, we replace the eigenvalue convergence based termination criterion with a sign change stabilization criterion between consecutive iterations:

\Delta S = \sum \left( V_r^{(k)} > 0 \right) \oplus \left( V_r^{(k-1)} > 0 \right),   (16)

where \oplus is the XOR operator: 0\oplus1 = 1, 1\oplus0 = 1, 0\oplus0 = 0, 1\oplus1 = 0. We empirically found that most of the signs become stable after very few iterations (\le 4).
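A compact NumPy sketch of the two-sided group power iteration with the sign-stabilization rule (16) is given below. QR factorization stands in for the explicit normalization and orthogonalization steps of Algorithm 1, and the iteration cap and threshold are illustrative.

import numpy as np

def fast_eigen_solver(L, max_iter=50, sign_tol=0):
    """Approximate all eigenvectors of a symmetric L (e.g. L_sym) simultaneously,
    stopping when the number of sign changes between iterations (Eq. 16) is small.
    Only the signs are needed for the NCut assignment."""
    n = L.shape[0]
    Vr = np.eye(n)
    prev = Vr > 0
    for _ in range(max_iter):
        Vl, _ = np.linalg.qr(L.T @ Vr)       # left sweep + orthonormalization
        Vr, _ = np.linalg.qr(L @ Vl)         # right sweep + orthonormalization
        signs = Vr > 0
        if np.sum(signs ^ prev) <= sign_tol: # Eq. (16)
            break
        prev = signs
    eigvals = np.diag(Vr.T @ L @ Vr)
    # Columns are ordered by decreasing |eigenvalue|, so the eigenvectors needed
    # for NCut (smallest eigenvalues of L_sym) are the trailing columns.
    return Vr, Vl, eigvals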
5. Ensemble of Spectral Classifiers

Representing image sets with Grassmannian manifolds facilitates the formulation of an ensemble of spectral classifiers. Different random initializations of B_j in (15) may converge to different solutions, resulting in multiple image set representations. In addition, we also vary the dimensionality of the manifolds and compute a set of manifolds for each class. The proposed spectral classifier is independently applied to the manifolds of the same dimensionality over all classes and the inter-class distances are estimated. We fuse the set of distances using mode fusion and also by the sum rule. In mode fusion, the probe set label is estimated individually for each classifier and the label with the maximum frequency is selected. In sum fusion, the minimum distance over the cumulative distance vector defines the probe set label.
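Given the matrix of ensemble distances (one row per spectral classifier, one column per gallery class), the two fusion rules amount to the following sketch; the function and variable names are ours.

import numpy as np

def fuse_labels(dist):
    """dist[e, c] = distance of the probe set to gallery class c under classifier e.
    Returns the mode-fusion and sum-fusion label estimates of Sec. 5."""
    votes = dist.argmin(axis=1)                    # label voted by each classifier
    mode_label = np.bincount(votes).argmax()       # most frequent vote
    sum_label = dist.sum(axis=0).argmin()          # minimum cumulative distance
    return mode_label, sum_label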
6. Experimental Evaluation

Evaluations are performed for image-set based face recognition, object categorization and gesture recognition. The SPAMS [21] package is used for sparse coding. Comparisons are performed with Discriminant Canonical Correlation (DCC) [26], Manifold-Manifold Distance (MMD) [28], Manifold Discriminant Analysis (MDA) [24], linear Affine and Convex Hull based Image Set Distance (AHISD, CHISD) [8], Sparse Approximated Nearest Points (SANP) [30], and Covariance Discriminative Learning (CDL) [27]. The same experimental protocol is used for all algorithms. The codes of [8, 26, 28, 30] were provided by the original authors.

6.1. Face Recognition using Image-sets

Our first dataset is You-tube Celebrities [19], which is very challenging and includes 1910 very low resolution videos (of 47 subjects) containing motion blur, high compression, and pose and expression variations (Fig. 3c). Faces were automatically detected, tracked and cropped to 30x30 gray-scale images. Due to tracking failures, our sets contained fewer images (8 to 400 per set) than the total number of video frames. The proposed algorithm performed best on HOG features. Five-fold cross validation experiments were performed where 3 image sets were randomly selected for training and the remaining 6 for testing.

Each image-set was represented by two manifold-sets by using lambda = {1, 2} in (15). Each set has 8 classifiers with dimensionality increasing from 1 to 8. Recognition rates of the ensembles of each dimensionality are compared with both fusion schemes in Fig. 4a. The no-fusion case shows the average accuracy of the individual classifiers over the five folds. Note that mode fusion achieves the highest accuracy and performs better than sum fusion because error does not get accumulated. Hierarchical clustering is compared with one-step clustering in Fig. 4b, which shows the superiority of the hierarchical approach. The performance of the proposed eigenvector solver is compared with the Matlab sparse SVD (svds) in Fig. 4c. We observe higher accuracy due to the better quality embedding of the Grassmannian manifolds in the Euclidean space by the proposed solver. In terms of execution time, our eigenvector solver is around 4 times faster than SVDS (Fig. 4d). This demonstrates the efficacy of our eigenvector solver for the purpose of spectral clustering. The maximum accuracy of our algorithm is 83% and the average accuracy is 76.4+-5.7% using the B^M_{i,p} (6) distance measure. To the best of our knowledge, this is the highest accuracy reported to date on this dataset.

Figure 3. Two example image-sets from (a) Honda, (b) CMU Mobo and (c) You-tube Celebrities datasets. (d) ETH-80 dataset.

Figure 4. You-tube dataset: (a) Comparison of different spectral ensemble fusion schemes. (b) Comparison of one-step and iterative clustering (No Fusion). (c-d) Accuracy and execution time comparison of the proposed fast eigen-solver with Matlab SVDS.

Table 1. Average recognition rates over 10 folds on Honda, MoBo and ETH-80, 5 folds on YouTube and 1 fold on the Cambridge dataset.

Method   | Honda     | MoBo      | ETH-80      | You-Tube    | Cambr.
DCC      | 94.7+-1.3 | 93.6+-1.8 | 90.9+-5.3   | 53.9+-4.7   | 65.0
MMD      | 94.9+-1.2 | 93.2+-1.7 | 85.7+-8.3   | 54.0+-3.7   | 58.1
MDA      | 97.4+-0.9 | 97.1+-1.0 | 80.50+-6.81 | 55.1+-4.5   | 20.9
AHISD †  | 89.7+-1.9 | 97.4+-0.8 | 74.76+-3.3  | 60.7+-5.2   | 18.1
CHISD †  | 92.3+-2.1 | 96.4+-1.0 | 71.0+-3.9   | 60.4+-5.9   | 18.3
SANP     | 93.1+-3.4 | 96.9+-0.6 | 72.4+-5.0   | 65.0+-5.53  | 22.5
CDL      | 100+-0.0  | 95.8+-2.0 | 89.2+-6.8   | *62.2+-5.1  | 73.4
Proposed | 100+-0.0  | 98.0+-0.9 | 91.5+-3.8   | 76.4+-5.7   | 83.05

* CDL results are on different folds, therefore the accuracy is less than that reported in [27]. † The accuracy of AHISD and CHISD is less than in [8] due to smaller image sizes.

The second dataset is CMU MoBo [23] containing 96 videos of 24 subjects. Face images were re-sized to 40x40 and LBP features were computed using circular (8, 1) neighborhoods extracted from 8x8 gray scale patches [8]. We performed 10-fold experiments by randomly selecting one image-set per subject as training and the remaining 3 as probes. We achieved a maximum accuracy of 100% and an average of 98.0+-0.93% (Table 1), which is the highest reported so far.

Our final dataset is Honda/UCSD [13] containing 59 videos of 20 subjects with varying poses and expressions. Histogram equalized 20x20 gray scale face image pixel values were used as features [28]. We performed 10-fold experiments by randomly selecting one set per subject as gallery and the remaining 39 as probes. An ensemble of 15 spectral classifiers obtained 100% accuracy on all folds.

6.2. Object Categorization & Gesture Recognition

For object categorization, we use the ETH-80 dataset containing images of 8 object categories, each with 10 different objects. Each object has 41 images taken at different views, forming an image-set. We use 20x20 intensity images for classifying an image-set of an object into a category. ETH-80 is a challenging database because it has fewer images per set and significant appearance variations across objects of the same class. For each class, 5 random image sets are used for training and the remaining 5 for testing. We achieved an average recognition rate of 91.5+-3.8% using an ensemble of 15 spectral classifiers.

Figure 5. Mobo dataset: Comparison of (a) hierarchical with (b) one-step spectral clustering.

The Cambridge Hand Gesture dataset [15] contains 900 image-sets of 9 gesture classes with large intra-class variations. Each class has 100 image sets, divided into two parts: sets 81-100 are used as gallery and 1-80 as probes [14]. Pixel values of 20x20 gray scale images are used as feature vectors. Using an ensemble of 9 spectral classifiers, we obtained an accuracy of 83.05%, which is higher than the other algorithms.
6.3. Robustness to Outliers

We performed robustness experiments in a setting similar to [8]. The Honda dataset was modified to have 100 randomly selected images per set.


In the first experiment, each gallery set was corrupted by adding 1 to 3 random images from each other gallery set, resulting in 19%, 38% and 57% outliers. Our algorithm achieved 100% accuracy for all three cases. In the second experiment, the probe set was corrupted by adding 1 to 3 random images from each gallery set. In this case, our algorithm achieved {100%, 100%, 97.43%} recognition rates. Our algorithm outperformed all others. Fig. 6 compares our algorithm to the nearest 2 competitors in both experiments, i.e. CDL [27] and SANP [30].

Figure 6. Robustness to outliers experiment: (a) Corrupted gallery case. (b) Corrupted probe case.

7. Conclusion

We presented an iterative sparse spectral clustering algorithm for robust image-set classification. Each image-set is represented with Grassmannian manifolds of increasing dimensionality to facilitate the use of an ensemble of spectral classifiers. An important contribution is a fast eigenvector solver which makes spectral clustering more efficient in general. Instead of the eigenvalue error, we minimize the sign changes in our group power iteration algorithm, which provides significant speedup and better quality clusters.

8. Acknowledgements

This research was supported by ARC grants DP1096801 and DP110102399. We thank T. Kim and R. Wang for the codes of DCC and MMD and the cropped faces of the Honda data. We thank H. Cevikalp for the LBP features of the Mobo data and Y. Hu for the MDA implementation.

References

[1] A. Mahmood and A. Mian. Hierarchical sparse spectral clustering for image set classification. In BMVC, 2012.
[2] S. Chen, C. Sanderson, M. T. Harandi, and B. C. Lovell. Improved image set classification via joint sparse approximated nearest subspaces. In CVPR, 2013.
[3] M. Yang, P. Zhu, L. Van Gool, and L. Zhang. Face recognition based on regularized nearest points between image sets. In FG, 2013.
[4] H. Cheng, Z. Liu, and L. Yang. Sparsity induced similarity measure for label propagation. In ICCV, 2009.
[5] Z. Cui, S. Shan, H. Zhang, S. Lao, and X. Chen. Image sets alignment for video based face recognition. In CVPR, 2012.
[6] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. TPAMI, 2013.
[7] C. Fowlkes, S. Belongie, F. Chung, and J. Malik. Spectral grouping using the Nystrom method. TPAMI, 2004.
[8] H. Cevikalp and B. Triggs. Face recognition based on image sets. In CVPR, pages 2567-2573, 2010.
[9] J. Hamm and D. Lee. Grassmann discriminant analysis: a unifying view on subspace learning. In ICML, 2008.
[10] M. Harandi, C. Sanderson, S. Shirazi, and B. Lovell. Graph embedding discriminant analysis on Grassmannian manifolds for improved image set matching. In CVPR, 2011.
[11] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. TPAMI, 31(2):210-227, 2009.
[12] Y. Chen, V. Patel, P. Phillips, and R. Chellappa. Dictionary-based face recognition from video. In ECCV, 2012.
[13] K. Lee, J. Ho, M. Yang, and D. Kriegman. Video based face recognition using probabilistic appearance manifolds. In CVPR, 2003.
[14] T. Kim and R. Cipolla. Canonical correlation analysis of video volume tensors for action categorization and detection. TPAMI, 31(8), 2009.
[15] T. Kim, K. Wong, and R. Cipolla. Tensor canonical correlation analysis for action classification. In CVPR, 2007.
[16] H. Lee, A. Battle, R. Raina, and A. Ng. Efficient sparse coding algorithms. In NIPS, 2007.
[17] H. Li, G. Hua, Z. Lin, and J. Yang. Probabilistic elastic matching for pose variant face verification. In CVPR, 2013.
[18] M. Li, X. Lian, J. Kwok, and B. Lu. Time and space efficient spectral clustering via column sampling. In CVPR, 2011.
[19] M. Kim, S. Kumar, V. Pavlovic, and H. Rowley. Face tracking and recognition with visual constraints in real world videos. In CVPR, 2008.
[20] M. Nishiyama, M. Yuasa, T. Shibata, T. Wakasugi, T. Kawahara, and O. Yamaguchi. Recognizing faces of moving people by hierarchical image-set matching. In CVPR, 2007.
[21] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In ICML, 2009.
[22] X. Peng, L. Zhang, and Z. Yi. Scalable sparse subspace clustering. In CVPR, 2013.
[23] R. Gross and J. Shi. The CMU Motion of Body (MoBo) database. Technical Report CMU-RI-TR-01-18, 2001.
[24] R. Wang and X. Chen. Manifold discriminant analysis. In CVPR, 2009.
[25] J. Shi and J. Malik. Normalized cuts and image segmentation. TPAMI, 2000.
[26] T. Kim, O. Arandjelovic, and R. Cipolla. Discriminative learning and recognition of image set classes using canonical correlations. TPAMI, 2007.
[27] R. Wang, H. Guo, L. S. Davis, and Q. Dai. Covariance discriminative learning: A natural and efficient approach to image set classification. In CVPR, 2012.
[28] R. Wang, S. Shan, X. Chen, Q. Dai, and W. Gao. Manifold-manifold distance and its application to face recognition with image sets. TIP, 2012.
[29] B. Wu, Y. Zhang, B. Hu, and Q. Ji. Constrained clustering and its application to face clustering in videos. In CVPR, 2013.
[30] Y. Hu, A. S. Mian, and R. Owens. Face recognition using sparse approximated nearest points between image sets. TPAMI, 2012.
[31] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.
Arandjelovic and R. Cipolla. Discriminative [2] S. Chen, C. Sanderson, M. T. Harandi, and B. C. Lovell. Im- learning and recognition of image set classes using canon- proved image set classification via joint sparse approximated ical correlations. TPAMI, 2007.1,5,6 nearest subspaces. In CVPR, 2013.1,2 [27] R. Wang, H. Guo, L. S. Davis, and Q. Dai. Covariance dis- [3] M. Yang, P. Zhu, L. Gool, and L. Zhang. Face recognition criminative learning: A natural and efficient approach to im- based on regularized nearest points between image sets. In age set classification. In CVPR, 2012.6,7,8 FG, 2013.1 [28] R. Wang, S. Shan, X. Chen, Q. Dai, and W. Gao. Manifold- [4] H. Cheng, Z. Liu, and L. Yang. Sparsity induced similarity manifold distance and its application to face recognition with measure for label propagation. In ICCV, 2009.4 image sets. TIP, 2012.1,3,5,6,7 [5] Z. Cui, S. Shan, H. Zhang, S. Lao, X. Chen. Image sets align [29] B Wu, Y Zh, B Hu, Q Ji Constrained clustering and its ap- -ment for video based face recognition. CVPR, 2012.1 plication to face clustering in videos. In CVPR, 2013.1 [6] E. Elhamifar and R. Vidal. Sparse subspace clustering: Al- [30] Yiqun Hu and Ajmal S. Mian and Robyn Owens. Face recog- gorithm, theory, and applications. TPAMI, 2013.2,4 nition using sparse approximated nearest points between im- [7] C. Fowlkes, S. Belongie, F. Chung, J. Malik. Spectral group- age sets. TPAMI, 2012.1,6,8 TPAMI ing using the Nystrom method. , 2004.4,5 [31] G. H. Golub and C. V. Loan. Matrix Computations. The John [8] H. Cevikalp and B. Triggs. Face recognition based on image Hopkins University Press, 1996.6 sets. In CVPR, pages 2567–2573, 2010.1,6,7