Stochastic Approximation for Canonical Correlation Analysis
Raman Arora, Teodor V. Marinov, Poorya Mianjy
Department of Computer Science, Johns Hopkins University, Baltimore
Presented by Liqun Chen, Nov 3rd, 2017

Outline
1 Introduction: Background, Challenges, Notations, Problem Definition
2 Methods: Matrix Stochastic Gradient, Matrix Exponentiated Gradient
3 Experiments

Introduction

Background

Canonical Correlation Analysis (CCA) is a statistical technique for finding linear components of two sets of random variables that are maximally correlated. CCA is often posed as a dimensionality reduction problem on a fixed dataset of n paired data points. The paper instead takes a stochastic optimization view of CCA: given a pair of random vectors $(x, y) \in \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$ with some (unknown) joint distribution $\mathcal{D}$, find matrices $U \in \mathbb{R}^{d_x \times k}$ and $V \in \mathbb{R}^{d_y \times k}$ that solve

\[
\begin{aligned}
\underset{U \in \mathbb{R}^{d_x \times k},\; V \in \mathbb{R}^{d_y \times k}}{\text{maximize}} \quad & \mathbb{E}_{x,y}\left[ x^\top U V^\top y \right] \\
\text{subject to} \quad & U^\top \mathbb{E}_x[x x^\top]\, U = V^\top \mathbb{E}_y[y y^\top]\, V = I_k .
\end{aligned}
\tag{1}
\]

The goal is to find a "good enough" subspace in terms of capturing correlation, rather than the "true" subspace (e.g., domain adaptation).
The stochastic optimization view motivates stochastic approximation algorithms that can easily scale to very large datasets.

Challenges

The problem is non-convex.
The constraints on U and V are stochastic: they involve expectations under the unknown distribution $\mathcal{D}$. They can be removed, e.g., for k = 1 one can maximize the correlation

\[
\rho(u^\top x, v^\top y) = \frac{\mathbb{E}_{x,y}\left[ u^\top x y^\top v \right]}{\sqrt{\mathbb{E}_x\left[ u^\top x x^\top u \right]}\, \sqrt{\mathbb{E}_y\left[ v^\top y y^\top v \right]}} ,
\]

which is an unconstrained problem, but the objective is then a ratio of expectations rather than a single expectation.
The CCA problem given in (1) is not learnable, i.e., there exist distributions on which a solution may incur small empirical error on the training sample but arbitrarily large generalization error.
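To make the k = 1 correlation objective from the Challenges slide concrete, the following is a minimal sketch that evaluates its empirical counterpart on a sample. The function name `correlation_objective` and the convention that rows of `X` and `Y` are paired samples are assumptions made for illustration; expectations are replaced by sample means, and the data are not centered, matching the uncentered moments in (1).

```python
import numpy as np

def correlation_objective(u, v, X, Y):
    """Empirical version of rho(u^T x, v^T y).

    X has shape (n, d_x), Y has shape (n, d_y); row i of X is paired
    with row i of Y. Expectations are replaced by sample means
    (an assumption of this sketch).
    """
    a = X @ u                      # projections u^T x_i
    b = Y @ v                      # projections v^T y_i
    numerator = np.mean(a * b)     # estimate of E[u^T x y^T v]
    denominator = np.sqrt(np.mean(a * a)) * np.sqrt(np.mean(b * b))
    return numerator / denominator
```

Because the value is unchanged when u or v is rescaled by a positive constant, no normalization constraint is needed; the price is that the objective is a ratio of averages, so it does not decompose into a sum of per-sample losses.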
Notations

For any matrix X, the spectral norm, nuclear norm, and Frobenius norm are denoted by $\|X\|_2$, $\|X\|_*$, and $\|X\|_F$ respectively.
The standard inner product between two matrices X and Y is $\langle X, Y \rangle = \mathrm{Tr}(X^\top Y)$.
Let the singular value decomposition (SVD) of an arbitrary X be $X = U \Sigma V^\top$.
Let the auto-covariance matrices be $C_{xx} = \mathbb{E}_x[x x^\top]$ and $C_{yy} = \mathbb{E}_y[y y^\top]$, and the cross-covariance matrix be $C_{xy} = \mathbb{E}_{(x,y)}[x y^\top]$.
Denote the regularized auto-covariance matrices by $C_x = C_{xx} + r_x I$ and $C_y = C_{yy} + r_y I$.

Problem Definition

With this notation, the regularized version of Problem 1 can be written as

\[
\begin{aligned}
\max_{\tilde U \in \mathbb{R}^{d_x \times k},\; \tilde V \in \mathbb{R}^{d_y \times k}} \quad & \mathrm{Tr}\left( \tilde U^\top \mathbb{E}[x y^\top]\, \tilde V \right) \\
\text{subject to} \quad & \tilde U^\top \mathbb{E}[x x^\top + r_x I]\, \tilde U = I, \quad \tilde V^\top \mathbb{E}[y y^\top + r_y I]\, \tilde V = I .
\end{aligned}
\tag{2}
\]

Since $C_x$ and $C_y$ are positive definite, their square roots $C_x^{1/2}$ and $C_y^{1/2}$ exist and are invertible, so one can make the simple change of variables $U = C_x^{1/2} \tilde U$ and $V = C_y^{1/2} \tilde V$ to get the equivalent problem

\[
\begin{aligned}
\max_{U \in \mathbb{R}^{d_x \times k},\; V \in \mathbb{R}^{d_y \times k}} \quad & \mathrm{Tr}\left( U^\top\, \mathbb{E}[x x^\top + r_x I]^{-1/2}\, \mathbb{E}[x y^\top]\, \mathbb{E}[y y^\top + r_y I]^{-1/2}\, V \right) \\
\text{subject to} \quad & U^\top U = I, \quad V^\top V = I .
\end{aligned}
\tag{3}
\]

Problem Definition (continued)

Let $\Phi \in \mathbb{R}^{d_x \times k}$ and $\Psi \in \mathbb{R}^{d_y \times k}$ denote the top-k left and right singular vectors of $C_x^{-1/2} C_{xy} C_y^{-1/2}$. Then the optimum of Problem 2 is achieved at $\tilde U = C_x^{-1/2} \Phi$ and $\tilde V = C_y^{-1/2} \Psi$.

An equivalent definition for the regularized empirical problem in the 1-dimensional setting is

\[
\begin{aligned}
\min_{u \in \mathbb{R}^{d_x},\; v \in \mathbb{R}^{d_y}} \quad & \frac{1}{2t} \left\| u^\top X - v^\top Y \right\|_2^2 + r_x \|u\|_2^2 + r_y \|v\|_2^2 \\
\text{subject to} \quad & u^\top C_{x,t}\, u = 1, \quad v^\top C_{y,t}\, v = 1 .
\end{aligned}
\tag{4}
\]

Methods

Proposed Method (1)

Problem 3 is not a convex optimization problem: not only is the objective non-convex, the constraint set of orthogonal matrices is also non-convex.
Re-parametrize Problem 3 via the variable substitution $M = U V^\top$.
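Before moving on, the whitening change of variables behind Problems 2 and 3 and the SVD characterization of the optimum can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's algorithm: population expectations are replaced by empirical (uncentered) moments of a fixed sample, and the names `inv_sqrt` and `regularized_cca` are hypothetical helpers.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(C)
    return (Q / np.sqrt(w)) @ Q.T   # scales column j of Q by w_j^{-1/2}

def regularized_cca(X, Y, k, rx, ry):
    """Solve the regularized problem on a sample via whitening and SVD.

    X: (n, d_x), Y: (n, d_y), rows are paired samples. Empirical
    moments stand in for the expectations (assumption of this sketch).
    Returns U_tilde, V_tilde and the top-k singular values.
    """
    n = X.shape[0]
    Cx = X.T @ X / n + rx * np.eye(X.shape[1])   # C_x = C_xx + r_x I
    Cy = Y.T @ Y / n + ry * np.eye(Y.shape[1])   # C_y = C_yy + r_y I
    Cxy = X.T @ Y / n                            # cross-covariance
    Wx, Wy = inv_sqrt(Cx), inv_sqrt(Cy)          # whitening transforms
    Phi, s, PsiT = np.linalg.svd(Wx @ Cxy @ Wy, full_matrices=False)
    U_tilde = Wx @ Phi[:, :k]                    # C_x^{-1/2} Phi
    V_tilde = Wy @ PsiT[:k].T                    # C_y^{-1/2} Psi
    return U_tilde, V_tilde, s[:k]
```

With these outputs, the constraints of Problem 2 hold by construction and the objective value equals the sum of the top-k singular values of the whitened cross-covariance. Forming and factorizing the full covariance matrices is exactly the batch computation that the stochastic approximation methods aim to avoid.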
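Furthermore, with $M = U V^\top$, they take the convex hull of the constraint set:

\[
\begin{aligned}
\max_{M \in \mathbb{R}^{d_x \times d_y}} \quad & \left\langle M,\; C_x^{-1/2} C_{xy} C_y^{-1/2} \right\rangle \\
\text{subject to} \quad & \|M\|_2 \le 1, \quad \|M\|_* \le k .
\end{aligned}
\tag{5}
\]

Proposed Method (2)

The gradient of the CCA objective w.r.t. M is $g := C_x^{-1/2} C_{xy} C_y^{-1/2}$. However, the marginal distributions of x and y are unknown, so the whitening transforms $C_x^{-1/2}$ and $C_y^{-1/2}$ are not available.
Use an "inexact" first-order oracle $\partial_t := W_{x,t}\, x_t y_t^\top\, W_{y,t} \approx g_t$, with $W_{x,t} := C_{x,t}^{-1/2}$ and $W_{y,t} := C_{y,t}^{-1/2}$ being the empirical estimates of the whitening transform matrices.
Denote the true whitening transforms by $W_x := \left( \mathbb{E}[x x^\top] + r_x I \right)^{-1/2} = C_x^{-1/2}$ and $W_y := \left( \mathbb{E}[y y^\top] + r_y I \right)^{-1/2} = C_y^{-1/2}$, so that $g_t := W_x\, x_t y_t^\top\, W_y$ is the exact stochastic gradient, with $\mathbb{E}_{\mathcal{D}}[g_t] = g$.

Proposed Method (3)

Design stochastic approximation algorithms for solving Problem 5.
In each update, minimize the instantaneous loss measured on a single data point while trying to stay close to the previous iterate.
In each iteration, minimize the sum $\Delta(M_t, M) + \eta\, \mathbb{E}_{\mathcal{D}}[\ell(M, g_t)]$ over all feasible M, where
$\Delta(\cdot, \cdot)$ is a divergence function between the successive iterates,
$\ell$ is the instantaneous loss,
$\eta$ is the learning-rate parameter controlling the tradeoff between loss and divergence.
The divergence function is defined in terms of a potential function.

Matrix Stochastic Gradient

Stochastic mirror descent with the Frobenius norm as the choice of potential function gives the update

\[
M_t = P_F\left( M_{t-1} + \eta_t\, \partial_t \right),
\]

where $P_F$ denotes the projection, in Frobenius norm, onto the feasible set of Problem 5.

Convergence of Matrix Stochastic Gradient

Lemma. Assume that for all $(x_t, y_t) \sim \mathcal{D}$ we have $\max\{\|x_t\|^2, \|y_t\|^2\} \le B$. Then there exists a constant $\kappa$ independent of t such that for all t the following holds in expectation:

\[
\mathbb{E}_{\mathcal{D}}\left[ \|E_t\|_2 \right] \le \frac{\kappa}{\sqrt{t}} ,
\]

where $E_t$ is the error of the inexact oracle, i.e., the gap between $\partial_t$ and $g_t$.

Theorem. After T iterations of MSG with step size $\eta = \frac{2\sqrt{k}}{G\sqrt{T}}$, and starting at $M^{(1)} = 0$,

\[
\mathbb{E}\left[ \left\langle M_* - \tilde M,\; C_x^{-1/2} C_{xy} C_y^{-1/2} \right\rangle \right] \le \frac{2\sqrt{k}\, G + 2 k \kappa}{\sqrt{T}} ,
\tag{6}
\]

where $\kappa$ is the constant given by the lemma above, $M_*$ is the optimum of (5), the expectation is with respect to the i.i.d. samples and the rounding, and G is a bound, given by a lemma in the appendix of the paper, such that $\|\partial_t\|_F \le G$ for all iterates $t = 1, \dots, T$.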
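The MSG update can be sketched in a few lines once the projection $P_F$ is available. The version below is an illustration under assumptions, not the procedure used in the paper: the projection onto the set in (5) is computed by taking an SVD and projecting the singular values onto $\{0 \le \sigma_i \le 1,\ \sum_i \sigma_i \le k\}$ with a shift-and-clip rule whose shift is found by bisection, and the whitening estimates `Wx`, `Wy` are passed in by the caller, since how they are maintained is not specified on these slides. All function names are hypothetical.

```python
import numpy as np

def project_singular_values(s, k, tol=1e-10):
    """Euclidean projection of s >= 0 onto {0 <= s_i <= 1, sum(s) <= k}."""
    capped = np.clip(s, 0.0, 1.0)
    if capped.sum() <= k:
        return capped                      # sum constraint inactive
    lo, hi = 0.0, float(s.max())           # at hi every entry clips to 0
    while hi - lo > tol:                   # bisection on the shift theta
        theta = 0.5 * (lo + hi)
        if np.clip(s - theta, 0.0, 1.0).sum() > k:
            lo = theta
        else:
            hi = theta
    return np.clip(s - hi, 0.0, 1.0)       # hi always satisfies sum <= k

def project_F(M, k):
    """Frobenius-norm projection onto {||M||_2 <= 1, ||M||_* <= k}."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * project_singular_values(s, k)) @ Vt

def msg_step(M, x, y, Wx, Wy, eta, k):
    """One MSG update M_t = P_F(M_{t-1} + eta * d_t)."""
    d_t = Wx @ np.outer(x, y) @ Wy         # inexact gradient W_x x y^T W_y
    return project_F(M + eta * d_t, k)
```

Projecting only the singular values is valid here because the feasible set depends on M solely through its spectrum. The iterate returned by `msg_step` is generally not of rank k, which is why the theorem above refers to a rounded solution $\tilde M$.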
Matrix Exponentiated Gradient (MEG)

Here the set of interest is that of k-dimensional (paired) subspaces, and the multiplicative algorithm is an instance of the matrix exponentiated gradient (MEG) update.
Let $d := d_x + d_y$. Consider the self-adjoint dilation of the matrix $\mathbb{E}_{\mathcal{D}}[g_t]$:

\[
C := \mathbb{E}_{\mathcal{D}} \begin{bmatrix} 0 & g_t \\ g_t^\top & 0 \end{bmatrix} .
\]

Assume that $\mathbb{E}_{\mathcal{D}}[g_t]$ has no repeated singular values and that its SVD is given by $\mathbb{E}_{\mathcal{D}}[g_t] = U \Sigma V^\top$. Then the eigen-decomposition of C is given by

\[
C = \frac{1}{2} \begin{bmatrix} U & U \\ V & -V \end{bmatrix} \begin{bmatrix} \Sigma & 0 \\ 0 & -\Sigma \end{bmatrix} \begin{bmatrix} U & U \\ V & -V \end{bmatrix}^\top .
\]

Then re-formulate regularized CCA as

\[
\begin{aligned}
\max_{W \in \mathbb{R}^{d \times k}} \quad & \mathrm{Tr}\left( W W^\top C \right) \\
\text{subject to} \quad & W^\top W = I .
\end{aligned}
\tag{7}
\]

Problem 7 is not a convex optimization problem, but it admits the following convex relaxation, obtained by setting $M = \frac{1}{k} W W^\top$:

\[
\begin{aligned}
\max_{M \in \mathbb{R}^{d \times d}} \quad & \mathrm{Tr}(M C) \\
\text{subject to} \quad & M \succeq 0, \quad \|M\|_2 \le \frac{1}{k}, \quad \mathrm{Tr}(M) = 1 .
\end{aligned}
\tag{8}
\]

Update $M_{t-1}$ by solving

\[
\begin{aligned}
M_t := \underset{M \in \mathbb{R}^{d \times d}}{\arg\min} \quad & \Delta(M, M_{t-1}) - \eta\, \mathrm{Tr}(M C_t) \\
\text{subject to} \quad & M \succeq 0, \quad \|M\|_2 \le \frac{1}{k}, \quad \mathrm{Tr}(M) = 1 ,
\end{aligned}
\tag{9}
\]

where $C_t$ is the self-adjoint dilation of $g_t$ and $\Delta(x, y)$ is the quantum relative entropy between x and y. Setting the gradient of the Lagrangian to zero and solving gives

\[
\hat M_t = \frac{\exp\left( \log(M_{t-1}) + \eta\, C_t \right)}{\mathrm{Tr}\left( \exp\left( \log(M_{t-1}) + \eta\, C_t \right) \right)}, \qquad M_t = P\left( \hat M_t \right),
\tag{10}
\]

where P denotes the projection onto the convex set of constraints in (9). In practice they replace the self-adjoint dilation of $g_t$ by the self-adjoint dilation of $\partial_t$, which is denoted by $\tilde C_t$.

Experiments

Synthetic dataset

Mediamill
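Returning to the MEG update in (10), the unprojected step $\hat M_t$ and the self-adjoint dilation can be sketched as follows. This is a minimal illustration under assumptions: the projection P onto the constraint set of (9) is deliberately omitted, the matrix logarithm and exponential are computed through symmetric eigendecompositions, eigenvalues are clipped at a small `eps` before taking the logarithm to keep it finite, and the names `dilation`, `sym_log` and `meg_step` are hypothetical.

```python
import numpy as np

def dilation(g):
    """Self-adjoint dilation [[0, g], [g^T, 0]] of a d_x-by-d_y matrix."""
    dx, dy = g.shape
    return np.block([[np.zeros((dx, dx)), g],
                     [g.T, np.zeros((dy, dy))]])

def sym_log(M, eps=1e-12):
    """Matrix logarithm of a symmetric positive (semi)definite matrix.

    Eigenvalues are clipped at eps (an assumption of this sketch) so the
    logarithm stays finite when M is numerically singular.
    """
    w, Q = np.linalg.eigh(M)
    return (Q * np.log(np.clip(w, eps, None))) @ Q.T

def meg_step(M_prev, C_t, eta):
    """Unprojected MEG step: exp(log M_prev + eta C_t), trace-normalized."""
    A = sym_log(M_prev) + eta * C_t
    w, Q = np.linalg.eigh(0.5 * (A + A.T))   # symmetrize against round-off
    w = np.exp(w - w.max())                  # shift cancels after normalizing
    M_hat = (Q * w) @ Q.T
    return M_hat / np.trace(M_hat)
```

A natural starting point is the maximally mixed matrix `np.eye(d) / d`, which has trace 1 and a well-defined logarithm. Repeated steps with a fixed dilation shift the weight of the iterate toward the top eigenvector of that dilation; enforcing $\|M\|_2 \le 1/k$ still requires the projection P from (10), which this sketch leaves out.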