
Cluster Canonical Correlation Analysis

Nikhil Rasiwasia*, Dhruv Mahajan†, Vijay Mahadevan*, Gaurav Aggarwal*
*Yahoo Labs Bangalore (nikux, vmahadev, [email protected]), †Microsoft Research ([email protected])

Abstract

In this paper we present cluster canonical correlation analysis (cluster-CCA) for joint dimensionality reduction of two sets of points. Unlike the standard pairwise correspondence between the data points, in our problem each set is partitioned into multiple clusters or classes, where the class labels define correspondences between the sets. Cluster-CCA is able to learn discriminant low dimensional representations that maximize the correlation between the two sets while segregating the different classes in the learned space. Furthermore, we present a kernel extension, kernel cluster canonical correlation analysis (cluster-KCCA), that extends cluster-CCA to account for non-linear relationships. Cluster-(K)CCA is shown to be computationally efficient, the complexity being similar to standard (K)CCA. By means of experimental evaluation on benchmark datasets, cluster-(K)CCA is shown to achieve state of the art performance for cross-modal retrieval tasks.

Figure 1: Representation of various methods to obtain correlated subspaces between sets. For each method, the two sets are shown at the top and the joint projected space at the bottom, where '○' and '△' represent two clusters in each set. (a) CCA: uses pairwise correspondences between sets and cannot segregate the two clusters, (b) CCA for sets: computes principal angles between two subspaces and cannot handle multiple clusters, (c) cluster-CCA: uses all pairwise correspondences within a cluster across the two sets and results in cluster segregation and (d) mean-CCA: computes CCA between mean cluster vectors.

Appearing in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copyright 2014 by the authors.

1 Introduction

Joint dimensionality reduction techniques such as Canonical Correlation Analysis (CCA) [12], Partial Least Squares (PLS) [23], Bilinear Model [26], Cross-modal Factor Analysis (CFA) [19] etc. have become quite popular in recent years. These approaches differ from the standard dimensionality reduction techniques such as principal component analysis (PCA) [5] or linear discriminant analysis (LDA) [5], as the dimensionality reduction is performed simultaneously across two (or more) modalities(1). Given a dataset with two paired modalities — where each data point in the first modality is paired with a data point in the second modality — these approaches learn a common low dimensional feature space where representation-specific details are discarded to yield the common underlying structure. Of these, CCA is fast becoming the de facto standard [19, 21, 22, 24, 27].

(1) It is common to refer to 'modalities' as 'sets' or 'views'.

CCA has been applied to several multimedia problems, such as cross-modal retrieval [10, 22, 24] — retrieval of data points from a given modality in response to a query from a different modality — image segmentation [21], cross-lingual retrieval [27], etc. CCA has also been successfully kernelized to enable learning of non-linear relationships [4, 9].

However, CCA requires paired modalities and cannot be directly applied when either multiple clusters of points in a

given modality correspond to multiple clusters of points in another modality, or when the paired modalities are supplemented with class labels. Such a scenario arises when, for each class, several data points are available in two different modalities which may or may not be paired, for example, images and web-pages obtained using web search queries for different class labels. Note that for the case of paired modalities with class labels, CCA can still be applied ignoring the class labels; however, as shown in Fig. 1(a), CCA would be ineffective in achieving class discrimination.

In this work, we are interested in the above scenario, where each set consists of multiple clusters/classes. The correspondences between the sets are established by the class labels (the sets may or may not be paired). The aim is to learn a discriminative common low dimensional representation for the two sets. The contributions of this work are as follows:

• We propose a very simple, yet effective adaptation of CCA, referred to as cluster canonical correlation analysis (cluster-CCA). As shown in Fig. 1(c), in cluster-CCA all points from one modality within a class are paired with all points from the other modality in the same class, and thereafter the projections are learned using the standard CCA framework.

• A naive implementation of cluster-CCA is computationally infeasible for large datasets, as the number of pairwise correspondences grows quadratically with the number of data points per cluster. We present a formulation of cluster-CCA that is computationally efficient and grows linearly.

• We also propose mean-CCA, a yet simpler adaptation of CCA for our task, where the mean vectors per cluster are used to learn the projections. We show that the fundamental difference between cluster-CCA and mean-CCA is in the estimation of the within-set covariance matrices, which results in a significant difference in their performance.

• Finally, we present a kernelized extension of cluster-CCA, referred to as cluster kernel canonical correlation analysis (cluster-KCCA), to extract non-linear relationships between modalities.

The efficacy of the proposed approaches is tested by measuring their performance in cross-modal retrieval tasks on benchmark datasets. The results show that cluster-(K)CCA is not only superior to (K)CCA but, despite its simplicity, outperforms other state-of-the-art approaches that use class labels to learn low dimensional projections. It is also shown that its performance can be improved by adding data to a single modality independently of the other modality, a benefit which is not shared by standard (K)CCA.

2 Related Work

Several extensions of CCA have been proposed in the literature [15-17, 21, 25, 28]. One class of modifications aims at using CCA for supervised dimensionality reduction [1, 11]. Given a set of samples with their class labels, CCA is used to learn a low dimensional representation: the data samples themselves serve as the first modality and the class labels as the second. Many variations in how the class labels are used have been proposed [17, 21, 25]. Nevertheless, the above approaches are targeted toward a single labeled modality, and cannot be directly applied for joint supervised dimensionality reduction of two labeled modalities.

An extension of CCA for two sets that are not paired was proposed in [12]. Canonical vectors are obtained which minimize the principal angles — the minimal angles between vectors of two subspaces. Fig. 1(b) shows a simple schematic representation of 'CCA for sets'. CCA for sets has been applied to various problems [15, 16, 28]; however, it is only useful for the case where sets are unlabelled, e.g. to find canonical vectors for a given set of images and text where all the images and text belong to the same cluster. It cannot be directly applied to the case where there are multiple clusters in each set. CCA for sets was modified for classification of images into multiple classes in [16]. However, this approach too is applicable only to datasets consisting of a single labeled set.

Recently several approaches have been proposed for joint dimensionality reduction using class labels. Semantic Matching (SM) was proposed in [22], where two mappings are implemented using classifiers of the two modalities. Each modality is represented as a vector of posterior probabilities with respect to the same class vocabulary, which serves as the common feature representation. Generalized Multiview Linear Discriminant Analysis (GMLDA) proposed in [24] formulates the problem of finding correlated subspaces as that of jointly optimizing three objectives: maximizing the correlation between sets and separating the classes in the respective feature spaces. The three objective functions are coupled linearly using suitable constants. Multi-view Discriminant Analysis (MvDA) proposed in [13] forgoes the free parameters by directly separating the classes in the joint feature space, but it is not clear how correlated the samples from different modalities are. Weakly-Paired Maximum Covariance Analysis (WMCA) proposed in [18] learns a correlated discriminative feature space without the need for pairwise correspondences. However, WMCA is based on maximum covariance analysis, while the proposed work extends CCA for a similar problem setting.

3 Canonical Correlation Analysis (CCA)

In this section we briefly review CCA (for a more detailed introduction to CCA see [10]). Consider two multivariate

random variables x ∈ R^{D_x} and y ∈ R^{D_y} with zero mean. Let the sets S_x = {x_1, ..., x_n} and S_y = {y_1, ..., y_n} be paired. CCA aims at finding a new coordinate for x by choosing a direction w ∈ R^{D_x} and similarly for y by choosing a direction v ∈ R^{D_y}, such that the correlation between the projections of S_x and S_y on w and v is maximized,

    \rho = \max_{w,v} \frac{w' C_{xy} v}{\sqrt{(w' C_{xx} w)(v' C_{yy} v)}}    (1)

where \rho is the correlation, C_{xx} = E[xx'] = \frac{1}{n} \sum_{i=1}^{n} x_i x_i' and C_{yy} = E[yy'] = \frac{1}{n} \sum_{i=1}^{n} y_i y_i' are the within-set covariance matrices and C_{xy} = E[xy'] = \frac{1}{n} \sum_{i=1}^{n} x_i y_i' the between-set covariance matrix, E denoting empirical expectation. The problem can be reduced to a generalized eigenvalue problem [10], where w corresponds to the top eigenvector:

    C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx} w = \lambda^2 w    (2)

The asymptotic time complexity for CCA is O(nd^2) + O(d^3) where d = max(D_x, D_y); O(nd^2) for computing the covariance matrices and O(d^3) for matrix multiplication, inverse and eigenvalue decomposition. Fig. 1(a) shows a simple schematic representation of CCA.

Kernel canonical correlation analysis (KCCA) reformulates CCA to extract non-linear relationships using the "kernel trick" [4, 9, 10],

    \rho = \max_{\omega,\nu} \frac{\omega' K_x K_y \nu}{\sqrt{(\omega' K_x^2 \omega)(\nu' K_y^2 \nu)}}    (3)

where (K_x)_{ij} = k_x(x_i, x_j) and (K_y)_{ij} = k_y(y_i, y_j) are the n x n kernel matrices, k_x() and k_y() the kernel functions, and \omega = (\omega_1, ..., \omega_n)' ∈ R^n, \nu = (\nu_1, ..., \nu_n)' ∈ R^n the projection coefficients. The optimization problem of (3) can again be formulated as an eigenvalue problem. Note that both CCA and KCCA require paired modalities to learn the common low-dimensional representation.

4 Cluster Canonical Correlation Analysis

Consider two sets of data where each set is divided into C different but corresponding clusters/classes. Let T_x = {X_1, ..., X_C} and T_y = {Y_1, ..., Y_C}, where X_c = {x_1^c, ..., x_{|X_c|}^c} and Y_c = {y_1^c, ..., y_{|Y_c|}^c} are the data points in the c-th cluster for the first and the second set respectively. Similar to CCA, the aim is to find a new coordinate for x by choosing a direction w and for y by choosing a direction v, such that the correlation between the projections of T_x and T_y on w and v is maximized and simultaneously the clusters are well separated. However, unlike CCA, a direct correlation between these projections cannot be computed, since the sets T_x and T_y, and therefore their projections on w and v, lack any direct correspondence. To address this, we propose two solutions, viz. mean canonical correlation analysis (mean-CCA) and cluster canonical correlation analysis (cluster-CCA).

4.1 Mean Canonical Correlation Analysis

One simple solution is to establish correspondences between the mean cluster vectors of the two sets. This yields C vectors per set with one-to-one correspondences. Given the cluster means \mu_x^c = \frac{1}{|X_c|} \sum_{j=1}^{|X_c|} x_j^c and \mu_y^c = \frac{1}{|Y_c|} \sum_{k=1}^{|Y_c|} y_k^c, the mean canonical correlation analysis (mean-CCA) problem is formulated as,

    \rho = \max_{w,v} \frac{w' V_{xy} v}{\sqrt{(w' V_{xx} w)(v' V_{yy} v)}}    (4)

where the covariance matrices V_{xy}, V_{xx} and V_{yy} are defined as:

    V_{xy} = \frac{1}{C} \sum_{c=1}^{C} \mu_x^c \mu_y^{c\prime}    (5)

    V_{xx} = \frac{1}{C} \sum_{c=1}^{C} \mu_x^c \mu_x^{c\prime}    (6)

    V_{yy} = \frac{1}{C} \sum_{c=1}^{C} \mu_y^c \mu_y^{c\prime}    (7)

The asymptotic time complexity for mean-CCA is O(Cd^2) + O(d^3). Fig. 1(d) shows a simple schematic representation of mean-CCA.

4.2 Cluster Canonical Correlation Analysis

In cluster canonical correlation analysis (cluster-CCA), instead of establishing correspondences between the cluster means, a one-to-one correspondence between all pairs of data points in a given cluster across the two sets is established, and thereafter standard CCA is used to learn the projections. Fig. 1(c) shows a simple schematic representation of cluster-CCA. The cluster-CCA problem is formulated as,

    \rho = \max_{w,v} \frac{w' \Sigma_{xy} v}{\sqrt{(w' \Sigma_{xx} w)(v' \Sigma_{yy} v)}}    (8)

where the covariance matrices \Sigma_{xy}, \Sigma_{xx} and \Sigma_{yy} are defined as:

    \Sigma_{xy} = \frac{1}{M} \sum_{c=1}^{C} \sum_{j=1}^{|X_c|} \sum_{k=1}^{|Y_c|} x_j^c y_k^{c\prime}    (9)

    \Sigma_{xx} = \frac{1}{M} \sum_{c=1}^{C} |Y_c| \sum_{j=1}^{|X_c|} x_j^c x_j^{c\prime}    (10)

    \Sigma_{yy} = \frac{1}{M} \sum_{c=1}^{C} |X_c| \sum_{k=1}^{|Y_c|} y_k^c y_k^{c\prime}    (11)


where M = \sum_{c=1}^{C} |X_c||Y_c| is the total number of pairwise correspondences. Similar to CCA, the optimization problem of (8) can be formulated as an eigenvalue problem. Note that, for both mean-CCA and cluster-CCA, we assume that covariance matrices are computed for zero mean random variables(2).

(2) Thus in practice the sample means need to be adjusted accordingly.

4.2.1 Computational Complexity

The asymptotic complexity of cluster-CCA is O(Md^2) + O(d^3) where M is the number of pairwise correspondences and d = max(D_x, D_y); O(Md^2) to compute the between-set covariance matrix in (9) and O(d^3) for matrix multiplication, inverse and eigenvalue decomposition. Since M grows quadratically with the number of examples per cluster (for example, when |X_c| = |Y_c| = L, M = CL^2), cluster-CCA becomes computationally infeasible for large datasets. However, (9) can be reformulated as the covariance matrix of the vector of cluster means,

    \Sigma_{xy} = \frac{1}{M} \sum_{c=1}^{C} |X_c||Y_c| \mu_x^c \mu_y^{c\prime}    (12)

This reduces the computational complexity of cluster-CCA to O(kd^2) + O(d^3) where k = max(\sum_c |X_c|, \sum_c |Y_c|), thereby growing linearly with the number of data points. In summary, although a naive implementation of cluster-CCA can be computationally prohibitive for large datasets, the reformulation of (12) makes it efficient.

4.2.2 Cluster-CCA vs Mean-CCA

From the covariance matrices of the various methods, two observations can be made. First, after the reformulation of (9) to (12), the between-set covariance matrix \Sigma_{xy} of cluster-CCA bears close resemblance to the between-set covariance matrix V_{xy} of mean-CCA (both being equal, modulo the normalization, when |X_c| and |Y_c| are class independent). Second, the within-set covariance matrices \Sigma_{xx} and \Sigma_{yy} of cluster-CCA in (10)-(11) bear close resemblance to the within-set covariance matrices C_{xx} and C_{yy} of CCA (again, both being equal, modulo the normalization, when |X_c| and |Y_c| are class independent).

This suggests that the fundamental difference between mean-CCA and cluster-CCA is in the estimation of the within-set covariance matrices. For mean-CCA they are estimated using the cluster means, effectively ignoring the rich information present in the data points themselves. For cluster-CCA, unlike mean-CCA, they are estimated using all the data points as in CCA. As we shall see, this fundamental difference between mean-CCA and cluster-CCA causes a significant improvement in performance of cluster-CCA over mean-CCA.

4.3 Cluster Kernel Canonical Correlation Analysis

Similar to KCCA, cluster-CCA can also be extended to discover non-linear relationships between the two sets using non-linear projections of the data onto high dimensional spaces. Using the kernel functions k_x() and k_y(), cluster kernel canonical correlation analysis (cluster-KCCA) can be formulated as,

    \rho = \max_{\omega,\nu} \frac{\omega' R_{xy} \nu}{\sqrt{(\omega' R_{xx} \omega)(\nu' R_{yy} \nu)}}    (13)

    R_{xy} = \frac{1}{M} \sum_{c=1}^{C} \Big( \sum_{j=1}^{|X_c|} K_{x_j}^c \Big) \Big( \sum_{k=1}^{|Y_c|} K_{y_k}^c \Big)'    (14)

    R_{xx} = \frac{1}{M} \sum_{c=1}^{C} |Y_c| \sum_{j=1}^{|X_c|} K_{x_j}^c K_{x_j}^{c\prime}    (15)

    R_{yy} = \frac{1}{M} \sum_{c=1}^{C} |X_c| \sum_{k=1}^{|Y_c|} K_{y_k}^c K_{y_k}^{c\prime}    (16)

where K_{x_j}^c = [..., k_x(x_j^c, x_i), ...]' and K_{y_k}^c = [..., k_y(y_k^c, y_i), ...]', where i indexes the data points in T_x and T_y for K_{x_j}^c and K_{y_k}^c respectively. Similar to KCCA, the optimization problem of (13) can be reformulated as an eigenvalue problem. The computational complexity of both KCCA and cluster-KCCA is O(N^3).

5 Experiments

In this section we present experimental evaluation of cluster-(K)CCA using cross-modal retrieval tasks. Precision-recall (PR) curves and mean average precision (MAP) scores are used for evaluation (except for the HFB dataset, where rank-1 recognition is the standard). A retrieved item is considered to be correct if it belongs to the same class (cluster) as that of the query.

5.1 Datasets

The evaluation is conducted on five publicly available datasets, viz. Pascal VOC 2007 [6], TVGraz [14], Wiki Text-Image Dataset [22], Heterogeneous Face Biometrics (HFB) [20] and Materials Dataset [18].

Pascal dataset consists of 5011/4952 train/test images and their annotations divided into 20 classes. The images were provided by the Pascal challenge [6] and the text annotations were collected in [8]. The image annotation serves as the text modality and is defined over a vocabulary of 804 keywords. We restrict our experiments to images and annotations which belong to a single class, reducing the train/test set to 2954/3192. Of these, some of the annotations are empty, i.e., contain no keywords; thus we form two different datasets, viz. VOC and VOCfull. In VOC, we remove all the images with empty annotations to maintain a


[Figure 2 — top row panels: (a) CCA Image, (b) CCA Text, (c) KCCA Image, (d) KCCA Text; bottom row panels: (a) cluster-CCA Image, (b) cluster-CCA Text, (c) cluster-KCCA Image, (d) cluster-KCCA Text]

Figure 2: Low-dimensional mapping of images and text from the 'bus', 'car' and 'motorbike' classes of the Pascal VOC dataset. The top row shows the mappings obtained using CCA (left) and KCCA (right). The bottom row shows the mappings obtained using the proposed cluster-CCA (left) and cluster-KCCA (right), for both images and text (left and right respectively for each scenario). Notice that cluster-(K)CCA is able to segregate different classes while mapping the two modalities to have high correlation.

balance between the number of image objects and text objects (which also helps in comparing to other works). This yields a total of 1905/2032 train/test images and annotations. In VOCfull we retain the annotation-less images for training. The test set in VOC and VOCfull is the same, enabling us to evaluate the performance of cluster-CCA with respect to adding more data independently to one of the modalities.

TVGraz dataset was first compiled by Khan et al. [14], where web pages were retrieved for ten classes of the Caltech-256 [7] dataset. Due to copyright issues, the TVGraz dataset is stored as a list of URLs and must be recompiled by each new user. We collected 2058 images and text documents from webpages (out of 2592 URLs), since some URLs were defunct. This set was randomly divided into 1558/500 train/test.

Wiki dataset was compiled by Rasiwasia et al. [22] using the featured articles from the Wikipedia website. It consists of 2173/693 train/test images and text articles from 10 different classes.

HFB dataset consists of four near infrared (NIR) and four visual (VIS) images for each of the 100 subjects (classes), without any natural pairing. Note that both modalities are images, but procured from different sensors. We follow Protocol II [13, 20], where images from 70 subjects are used to learn the projections and the remaining 30 subjects serve as the test set.

Materials dataset consists of images as well as audio signatures from 17 different materials [18]. We present a comparison with the published result for the classification task and also for the cross-modal retrieval task.

In all datasets retrieval is performed on the test set, where each test set example from one modality is used to rank the test set examples of the other modality.

5.2 Features

For the VOC dataset we use the publicly available features of [8] (dense SIFT bag-of-words (BOW) for images and the raw BOW for text). For Wiki again we use the publicly available features of [22] (dense SIFT BOW for images and a 10-topic Latent Dirichlet Allocation (LDA) model [3] for text). This helps us to compare directly with existing results. We also present results on a richer set of 200-topic LDA and 4096-codebook SIFT BOW features, henceforth referred to as WikiRich. This enables us to present results for high-dimensional feature spaces. For TVGraz, a feature extraction procedure similar to Wiki was adopted (400-topic LDA for text and 4096-codebook SIFT BOW for images). For HFB, raw pixel data from the 32x32 cropped images is used without any further processing, resulting in a 1024 dimensional feature space. For the Materials dataset, we use the publicly available features of [18].
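Given per-class feature matrices such as those described above, the cluster-CCA formulation of Sections 4.2 and 4.2.1 can be made concrete. The following NumPy sketch is not part of the original paper: the function `cluster_cca` and its variable names are ours, the ridge term stands in for the regularization mentioned in Section 5.3, and the solver uses the generalized eigenvalue problem of the form (2) with the cluster-CCA covariances (10)-(12).

```python
import numpy as np

def cluster_cca(Xs, Ys, reg=1e-4):
    """Sketch of cluster-CCA. Xs[c] is an (n_c x Dx) array and Ys[c] an
    (m_c x Dy) array of zero-mean points for class c. Returns projection
    directions W (Dx x Dx) and V (Dy x Dx), ordered by correlation."""
    Dx, Dy = Xs[0].shape[1], Ys[0].shape[1]
    # M = total number of pairwise correspondences, sum_c |Xc||Yc|
    M = sum(len(X) * len(Y) for X, Y in zip(Xs, Ys))
    Sxy = np.zeros((Dx, Dy))
    Sxx = np.zeros((Dx, Dx))
    Syy = np.zeros((Dy, Dy))
    for X, Y in zip(Xs, Ys):
        mx, my = X.mean(axis=0), Y.mean(axis=0)
        # Eq. (12): between-set covariance via cluster means -- linear cost,
        # equivalent to summing outer(x, y) over all |Xc|*|Yc| pairs of Eq. (9).
        Sxy += len(X) * len(Y) * np.outer(mx, my)
        # Eqs. (10)-(11): within-set covariances, each term weighted by the
        # size of the opposite modality's cluster.
        Sxx += len(Y) * X.T @ X
        Syy += len(X) * Y.T @ Y
    Sxy /= M; Sxx /= M; Syy /= M
    Sxx += reg * np.eye(Dx)  # small ridge, a stand-in for regularized CCA
    Syy += reg * np.eye(Dy)
    # Generalized eigenvalue problem, cf. Eq. (2):
    # Sxx^-1 Sxy Syy^-1 Syx w = lambda^2 w
    A = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    vals, W = np.linalg.eig(A)
    order = np.argsort(-vals.real)
    W = W[:, order].real
    # The paired directions follow as v proportional to Syy^-1 Syx w.
    V = np.linalg.solve(Syy, Sxy.T @ W)
    V /= np.linalg.norm(V, axis=0, keepdims=True)
    return W, V
```

The loop touches each data point once, so the covariance stage is O(kd^2) with k the larger total set size, matching the linear-cost claim of Section 4.2.1; replacing the mean-based line for `Sxy` with an explicit double loop over all pairs reproduces the quadratic-cost Eq. (9) exactly.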


5.3 Experimental Protocol

We used regularized versions of (K)CCA for all our experiments, where the regularization constant is obtained using cross-validation. For the kernelized approaches, the radial χ^2 kernel [2] is used for both text and images, where the normalization parameter is set to the mean of the χ^2 distances in the training set. Finally, the normalized correlation score is used to compute similarities between the low-dimensional projected vectors. All experiments have been repeated ten times using random test-train splits.

Figure 3: 11-point PR curves for (K)CCA and cluster-(K)CCA for TVGraz: (c) TVGraz-Text, (d) TVGraz-Image. Each panel compares CCA, cluster-CCA, KCCA, cluster-KCCA and random chance. The cluster versions outperform their respective standard algorithms at all levels of recall, with cluster-KCCA achieving the best retrieval performance.

6 Results

In this section we present the results for cluster-(K)CCA.
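The protocol above does not spell out the exact kernel and similarity formulas, so the sketch below makes assumptions: we take the radial χ^2 kernel to be exp(-d_{χ^2}(x, z) / σ) with d_{χ^2}(a, b) = (1/2) Σ_i (a_i - b_i)^2 / (a_i + b_i) on non-negative histogram features, σ set to the mean χ^2 distance over the training set, and the normalized correlation score to be cosine similarity. The helper names are ours, not the paper's.

```python
import numpy as np

def chi2_distance(a, b, eps=1e-10):
    # Chi-squared distance between non-negative histograms a and b
    # (one common convention, with a 1/2 factor; assumed here).
    return 0.5 * np.sum((a - b) ** 2 / (a + b + eps))

def radial_chi2_kernel(X, Z=None):
    """Gram matrix of an assumed radial chi^2 kernel between the rows of X
    and Z; the bandwidth is the mean chi^2 distance over the rows of X
    (the training set), as described in the protocol."""
    Z = X if Z is None else Z
    D = np.array([[chi2_distance(x, z) for z in Z] for x in X])
    sigma = np.mean([[chi2_distance(x, x2) for x2 in X] for x in X])
    return np.exp(-D / sigma)

def normalized_correlation(u, v):
    # Normalized correlation (cosine) similarity between projected vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Since the χ^2 distance of a histogram to itself is zero, the kernel of a point with itself is 1, and all entries lie in (0, 1]; ranking retrieved items by `normalized_correlation` of the projected vectors mirrors the similarity computation described in Section 5.3.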

6.1 cluster-(K)CCA

In this section we show that cluster-(K)CCA results in a discriminative low-dimensional representation with high correlation between the projected sets. To demonstrate this, a toy dataset is constructed using the 'bus', 'car' and 'motorbike' classes of the Pascal dataset. As shown in Fig. 2 (left), a two dimensional mapping is computed by projecting the data points on the 1st and 2nd most correlated CCA (top) and cluster-CCA (bottom) components, for both text and images. It is clear from the figures that CCA, although yielding mappings with high correlation between text and images (0.79 and 0.72 for the first and second dimensions respectively), is not able to achieve class discrimination. Cluster-CCA achieves high correlation between text and images (0.67 and 0.62 for the first two dimensions(3)) and simultaneously separates different classes into different regions of the space, supporting class discrimination. Fig. 2 (right) shows similar mappings obtained using KCCA (top) and cluster-KCCA (bottom), where cluster-KCCA is again able to yield a better discriminative low-dimensional feature space.

(3) Correlation values for CCA and cluster-CCA are not directly comparable, as for CCA the correlation is computed between pairs of text and images, and for cluster-CCA between all same-class pairs of text and images.

Table 1: Cross-modal Retrieval Performance. For HFB, the columns report NIR query and VIS query rank-1 accuracy instead of MAP.

    Dataset    Method         Image Query (MAP)   Text Query (MAP)
    VOC        random         0.0882              0.0882
               CCA            0.263 ± 0.007       0.225 ± 0.005
               mean-CCA       0.290 ± 0.007       0.241 ± 0.009
               cluster-CCA    0.377 ± 0.007       0.335 ± 0.007
               KCCA           0.326 ± 0.007       0.301 ± 0.007
               mean-KCCA      0.116 ± 0.005       0.186 ± 0.005
               cluster-KCCA   0.445 ± 0.006       0.429 ± 0.008
    TVGraz     random         0.1191              0.1191
               CCA            0.395 ± 0.013       0.359 ± 0.013
               mean-CCA       0.438 ± 0.021       0.411 ± 0.024
               cluster-CCA    0.554 ± 0.011       0.545 ± 0.014
               KCCA           0.530 ± 0.014       0.511 ± 0.016
               mean-KCCA      0.194 ± 0.003       0.242 ± 0.006
               cluster-KCCA   0.612 ± 0.012       0.607 ± 0.011
    WikiRich   random         0.1149              0.1149
               CCA            0.281 ± 0.011       0.223 ± 0.010
               mean-CCA       0.303 ± 0.009       0.220 ± 0.009
               cluster-CCA    0.334 ± 0.011       0.250 ± 0.009
               KCCA           0.263 ± 0.008       0.226 ± 0.008
               mean-KCCA      0.166 ± 0.013       0.186 ± 0.006
               cluster-KCCA   0.365 ± 0.008       0.288 ± 0.010
    Wiki       random         0.1191              0.1191
               CCA            0.252 ± 0.010       0.202 ± 0.008
               mean-CCA       0.246 ± 0.005       0.194 ± 0.005
               cluster-CCA    0.273 ± 0.008       0.218 ± 0.005
               KCCA           0.269 ± 0.009       0.221 ± 0.009
               mean-KCCA      0.163 ± 0.010       0.164 ± 0.005
               cluster-KCCA   0.318 ± 0.010       0.249 ± 0.009

    Dataset    Method         NIR Query (Rank-1)  VIS Query (Rank-1)
    HFB        random         0.0333              0.0333
               CCA            0.564 ± 0.085       0.583 ± 0.041
               mean-CCA       0.468 ± 0.044       0.456 ± 0.052
               cluster-CCA    0.627 ± 0.076       0.628 ± 0.089
               KCCA           0.596 ± 0.084       0.597 ± 0.065
               mean-KCCA      0.497 ± 0.061       0.487 ± 0.058
               cluster-KCCA   0.632 ± 0.092       0.638 ± 0.070

6.2 Cross-Modal Retrieval

In this section we present the results of cluster-(K)CCA for cross-modal retrieval and compare it with (K)CCA. Table 1 presents the MAP performance of (K)CCA and cluster-(K)CCA on the VOC, TVGraz, WikiRich, Wiki and HFB datasets. We also present results obtained using mean-(K)CCA and random chance performance. (K)CCA uses only pairwise correspondences, and mean-(K)CCA/cluster-(K)CCA use only the class labels. First, notice that mean-CCA is able to outperform CCA. Given that the number of pairwise correspondences in mean-CCA


is quite low (equal to the number of classes), the higher performance of mean-CCA over CCA suggests that using class labels to learn the low-dimensional subspaces is beneficial for cross-modal retrieval tasks. On the other hand, mean-KCCA is not able to achieve satisfactory performance. One reason for this could be that kernel based algorithms define the hyperplane as a linear combination of the data points, whose number, being quite low for mean-KCCA, compromises its ability to learn good projections.

On comparing (K)CCA with cluster-(K)CCA, it is clear from the table that cluster-(K)CCA significantly outperforms (K)CCA. Cluster-CCA is able to achieve a significant performance gain of 45.90%, 45.88%, 15.87%, 7.93% and 9.23% over CCA on the VOC, TVGraz, WikiRich, Wiki and HFB datasets respectively. The corresponding gains for cluster-KCCA over KCCA stand at 39.17%, 17.11%, 33.47%, 15.51% and 6.36%. Cluster-KCCA achieves the best retrieval performance: an average MAP score (over image and text query) of 0.4370, 0.6090, 0.3270, 0.2830 for VOC, TVGraz, WikiRich, Wiki, and a Rank-1 score of 0.635 for the HFB dataset. Also note that the performance of cluster-(K)CCA is significantly superior to mean-(K)CCA. This highlights the importance of robust estimation of the covariance structure of the data, which is the fundamental difference between the two approaches.

Fig. 3 shows the precision-recall curves for (K)CCA and cluster-(K)CCA for TVGraz. Cluster-(K)CCA outperforms (K)CCA at all levels of recall, with cluster-KCCA achieving the best retrieval performance. Finally, Fig. 4 shows some examples of retrieval using cluster-KCCA. The top three rows show examples of image to text retrieval, where images from the VOC dataset are used to retrieve annotation sets. The bottom three rows show examples of text to image retrieval, where a text document each from the three text-based datasets is used to retrieve images. As is evident from the figure, the retrieved examples belong to the same general class as that of the query.

Figure 4: Some examples of cross-modal retrieval. The top three rows show examples of image-to-text retrieval for images from the VOC dataset. The query image is shown in the first column and the retrieved documents (collections of tags, e.g. {'350d, diesel, locomotive, railway, train'}, {'bird, birds, impressedbeauty, specanimal'}, {'animal, cat, love, nature, pets, puppy'}) in the rest. The bottom three rows show examples of text-to-image retrieval, one each from the VOC, TVGraz and Wiki datasets (e.g. the tag set {'island, sea, summer, yacht'}, an article on compassionate behavior in elephants, and an article on the battle of Al-Khafji).

6.3 Comparison with Existing Work

In this section we compare cluster-(K)CCA with existing works, where all approaches use class labels to learn the common low-dimensional space. In particular we present a comparison with SM of [22] — where two isomorphic spaces are learned using image/text classifiers, GMLDA of [24] — where a joint objective function is defined for maximizing correlation and class discrimination (note that published results of GMLDA [24] on the VOC dataset are on a different set of features than used in this work), MvDA of [13] — where LDA is performed on the low dimensional feature space, and WMCA of [18] — where maximum covariance analysis is modified for cluster-wise correspondences. To make the comparison fair, we also reimplemented the SM using kernel support vector machines as the classifier

with a radial χ2 kernel, referred to as SM(χ2).

Table 2 presents the retrieval performance obtained with the different approaches on the various datasets. Despite its simplicity, cluster-(K)CCA outperforms other, more involved approaches and achieves state-of-the-art performance on all datasets, indicating that cluster-(K)CCA can reliably learn discriminative common low-dimensional representations in diverse settings. Table 2 also presents comparisons to WMCA [18] on the Materials dataset. To compare, we conduct two experiments. First, classification using nearest neighbors, as evaluated in [18]: cluster-CCA improves the classification error from 5.3% to 4.2%. Second, the cross-modal retrieval task: WMCA features yield an MAP of 0.567/0.791 for the two reciprocal cross-modal tasks, whereas the corresponding numbers using cluster-CCA features are 0.827/0.881. Thus, for both experiments, cluster-CCA outperforms WMCA, and the performance is significantly better for the cross-modal tasks (relative gains of 45%/11%).

Table 2: Comparison with Existing Works.

Dataset      Experiment      Image (MAP)      Text (MAP)
VOC          GMLDA [24]      0.427            0.339
             SM(χ2)          0.426 ± 0.009    0.403 ± 0.009
             cluster-KCCA    0.445 ± 0.006    0.429 ± 0.008
TVGraz       SM(χ2)          0.651 ± 0.018    0.647 ± 0.018
             cluster-KCCA    0.612 ± 0.012    0.607 ± 0.011
WikiRich     SM(χ2)          0.358 ± 0.010    0.278 ± 0.009
             cluster-KCCA    0.365 ± 0.008    0.288 ± 0.010
Wiki         SM [22]         0.225            0.223
             GMLDA [24]      0.272            0.232
             SM(χ2)          0.294 ± 0.007    0.233 ± 0.009
             cluster-KCCA    0.318 ± 0.010    0.249 ± 0.009
                             NIR (Rank-1%)    VIS (Rank-1%)
HFB          SM(χ2)          0.339 ± 0.063    0.368 ± 0.050
             MvDA [13]       50.0             53.3
             cluster-KCCA    0.632 ± 0.092    0.638 ± 0.070
                             Audio (MAP)      Image (MAP)
Materials    WMCA [18]       0.567 ± 0.020    0.791 ± 0.037
             cluster-CCA     0.827 ± 0.023    0.881 ± 0.040
Classification (Materials)   Error %
             WMCA [18]       5.3 ± 2.3
             cluster-CCA     4.2 ± 2.4

6.4 Effect of Additional Uni-modal Data

As discussed in Section 5.1, the Pascal VOC dataset contains some images which are not annotated. In CCA these images cannot be used for learning the low-dimensional subspaces, as the corresponding text data is absent. However, cluster-(K)CCA does not require pair-wise correspondences, and any additional data in one modality can potentially help to improve the retrieval performance. To test this, we conduct experiments using the VOCfull dataset (where all images, with or without annotations, are used to learn the low-dimensional subspaces). Table 3 presents the retrieval results and a comparison to the VOC dataset. Note that the test set is the same for both datasets, so any gain in performance is due to the 1,049 additional images of the VOCfull dataset. The table shows that using additional images, without any additional text, improves the retrieval performance. Similar behavior is observed on the other datasets, where increasing (decreasing) the data for any single modality increases (decreases) the retrieval performance.

Table 3: Effect of Additional Unimodal Data (MAP).

Dataset    Experiment      Image Query    Text Query    Average
VOC        cluster-CCA     0.3672         0.3346        0.3509
VOCfull    cluster-CCA     0.3802         0.3442        0.3622
VOC        cluster-KCCA    0.4287         0.4162        0.4245
VOCfull    cluster-KCCA    0.4446         0.4355        0.4401

7 Conclusion

In this work, we proposed cluster-CCA for joint dimensionality reduction across two sets of variables, where each set is divided into multiple clusters. Further, we proposed a kernelized version of cluster-CCA, cluster-KCCA, that is able to incorporate non-linear relationships between the two sets. Cluster-(K)CCA works on the principle of establishing correspondences between all pairs of data points in a given cluster across the two sets. It was shown that, by doing so, cluster-(K)CCA is simultaneously able to achieve cluster segregation and high correlation between the sets. Cluster-(K)CCA was also shown to be computationally efficient, having the same computational complexity as (K)CCA. Despite its simplicity, cluster-(K)CCA achieves state-of-the-art performance in cross-modal retrieval tasks on benchmark datasets. Finally, it was shown that its performance can be improved by adding data independently to either of the sets, a benefit that is exclusive to cluster-(K)CCA and is not shared by (K)CCA.

References

[1] M. Bartlett. Further aspects of the theory of multiple regression. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 34, pages 33–40. Cambridge Univ Press, 1938.

[2] M. Blaschko and C. Lampert. Correlational spectral clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.


[3] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.

[4] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge Univ Press, 2000.

[5] R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley and Sons, 2001.

[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.

[7] G. Griffin, A. Holub, and P. Perona. The Caltech-256. Technical report, Caltech, 2006.

[8] M. Guillaumin, J. Verbeek, and C. Schmid. Multimodal semi-supervised learning for image classification. In IEEE Conference on Computer Vision & Pattern Recognition, pages 902–909, June 2010.

[9] D. Hardoon and J. Shawe-Taylor. KCCA for different level precision in content-based image retrieval. In Proceedings 3rd International Workshop on Content-Based Multimedia Indexing. Citeseer, 2003.

[10] D. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: an overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.

[11] T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. The Annals of Statistics, 23(1):73–102, 1995.

[12] H. Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.

[13] M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen. Multi-view discriminant analysis. In Computer Vision – ECCV 2012, pages 808–821. Springer, 2012.

[14] I. Khan, A. Saffari, and H. Bischof. TVGraz: Multi-modal learning of object categories by combining textual and visual features. In Proceedings 33rd Workshop of the Austrian Association for Pattern Recognition, 2009.

[15] T. Kim, O. Arandjelovic, and R. Cipolla. Boosted manifold principal angles for image set-based recognition. Pattern Recognition, 40(9):2475–2484, 2007.

[16] T. Kim, J. Kittler, and R. Cipolla. Discriminative learning and recognition of image set classes using canonical correlations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1005–1018, 2007.

[17] O. Kursun, E. Alpaydin, and O. Favorov. Canonical correlation analysis using within-class coupling. Pattern Recognition Letters, 32(2):134–144, 2011.

[18] C. H. Lampert and O. Krömer. Weakly-paired maximum covariance analysis for multimodal dimensionality reduction and transfer learning. In Proceedings of the 11th European Conference on Computer Vision, pages 566–579, 2010.

[19] D. Li, N. Dimitrova, M. Li, and I. Sethi. Multimedia content processing through cross-modal association. In Proceedings of the Eleventh ACM International Conference on Multimedia, pages 604–611. ACM, 2003.

[20] S. Z. Li, Z. Lei, and M. Ao. The HFB face database for heterogeneous face biometrics research. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1–8, 2009.

[21] M. Loog, B. van Ginneken, and R. Duin. Dimensionality reduction of image features using the canonical contextual correlation projection. Pattern Recognition, 38(12):2409–2418, 2005.

[22] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. Lanckriet, R. Levy, and N. Vasconcelos. A new approach to cross-modal multimedia retrieval. In Proceedings 18th ACM International Conference on Multimedia, 2010.

[23] R. Rosipal and N. Krämer. Overview and recent advances in partial least squares. In Subspace, Latent Structure and Feature Selection, pages 34–51, 2006.

[24] A. Sharma, A. Kumar, H. Daume, and D. Jacobs. Generalized multiview analysis: A discriminative latent space. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, Rhode Island, 2012.

[25] T. Sun and S. Chen. Class label versus sample label-based CCA. Applied Mathematics and Computation, 185(1):272–283, 2007.

[26] J. Tenenbaum and W. Freeman. Separating style and content with bilinear models. Neural Computation, 12(6):1247–1283, 2000.

[27] A. Vinokourov, D. Hardoon, and J. Shawe-Taylor. Learning the semantics of multimedia content with application to web image retrieval and classification. In 4th International Symposium on Independent Component Analysis and Blind Source Separation, 2003.

[28] O. Yamaguchi, K. Fukui, and K. Maeda. Face recognition using temporal image sequence. In Third IEEE International Conference on Automatic Face and Gesture Recognition, pages 318–323. IEEE, 1998.
