Stochastic Approximation for Canonical Correlation Analysis

Raman Arora, Teodor V. Marinov, Poorya Mianjy

Department of Computer Science, Johns Hopkins University, Baltimore

Presented by Liqun Chen Nov 3rd, 2017

1 Outline

1 Introduction: Background, Challenges, Notations, Problem Definition

2 Methods: Matrix Stochastic Gradient, Matrix Exponentiated Gradient

3 Experiments

2 Outline

1 Introduction: Background, Challenges, Notations, Problem Definition

2 Methods: Matrix Stochastic Gradient, Matrix Exponentiated Gradient

3 Experiments

3 Background

Canonical Correlation Analysis (CCA) is a statistical technique for finding linear components of two sets of random variables that are maximally correlated. CCA is usually posed as a dimensionality-reduction problem over a fixed dataset of n paired points. This paper instead takes a stochastic optimization view of CCA: given a pair of random vectors (x, y) ∈ R^{d_x} × R^{d_y} with some (unknown) joint distribution D, find matrices U ∈ R^{d_x×k} and V ∈ R^{d_y×k} that solve

\[
\max_{U \in \mathbb{R}^{d_x \times k},\; V \in \mathbb{R}^{d_y \times k}} \ \mathbb{E}_{x,y}\big[x^\top U V^\top y\big]
\quad \text{subject to} \quad U^\top \mathbb{E}_x[xx^\top]\, U \;=\; V^\top \mathbb{E}_y[yy^\top]\, V \;=\; I_k. \tag{1}
\]

4 Background

The goal is to find a subspace that is "good enough" at capturing correlation rather than a "true" subspace (e.g., for domain adaptation). The stochastic optimization view motivates stochastic approximation algorithms that scale easily to very large datasets.

5 Challenges

The problem is non-convex.

It can be posed as an unconstrained problem with no stochastic constraints on U and V: e.g., for k = 1, one can directly maximize the correlation

\[
\rho(u^\top x,\, v^\top y) \;=\; \frac{\mathbb{E}_{x,y}\big[u^\top x\, y^\top v\big]}{\sqrt{\mathbb{E}_x\big[u^\top x x^\top u\big]}\;\sqrt{\mathbb{E}_y\big[v^\top y y^\top v\big]}}.
\]

The CCA problem given in (1) is non-learnable: there exist distributions for which a solution may incur small empirical error on the training sample but arbitrarily large generalization error.
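For intuition, here is a minimal numpy sketch (the data-generating process and dimensions are purely illustrative, not from the paper) of the k = 1 objective ρ, showing that it is invariant to positive rescaling of u and v, which is why the whitening constraints of (1) can be folded into the objective:

```python
import numpy as np

# Illustrative paired data with a shared 2-dimensional latent signal.
rng = np.random.default_rng(0)
n, dx, dy = 10000, 5, 4
z = rng.standard_normal((n, 2))
X = z @ rng.standard_normal((2, dx)) + 0.5 * rng.standard_normal((n, dx))
Y = z @ rng.standard_normal((2, dy)) + 0.5 * rng.standard_normal((n, dy))

def rho(u, v, X, Y):
    """Empirical correlation E[u'x y'v] / sqrt(E[(u'x)^2] E[(v'y)^2])."""
    num = np.mean((X @ u) * (Y @ v))
    den = np.sqrt(np.mean((X @ u) ** 2) * np.mean((Y @ v) ** 2))
    return num / den

u, v = rng.standard_normal(dx), rng.standard_normal(dy)
print(rho(u, v, X, Y))              # some value in [-1, 1]
print(rho(3.0 * u, 0.5 * v, X, Y))  # identical: the objective is scale-invariant
```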

6 Notations

For any matrix X, the spectral, Frobenius, and nuclear norms are denoted ‖X‖₂, ‖X‖_F, and ‖X‖_*, respectively. The standard inner product between two matrices is ⟨X, Y⟩ = Tr(X^⊤Y). Let the singular value decomposition (SVD) of an arbitrary matrix X be X = UΣV^⊤. Let the auto-covariance matrices be C_xx = E_x[xx^⊤] and C_yy = E_y[yy^⊤], and the cross-covariance matrix be C_xy = E_{(x,y)}[xy^⊤]. Denote the regularized auto-covariance matrices by C_x = C_xx + r_x I and C_y = C_yy + r_y I.

7 Problem Definition

With this notation, the regularized version of Problem (1) can be written as

\[
\max_{\tilde U \in \mathbb{R}^{d_x \times k},\; \tilde V \in \mathbb{R}^{d_y \times k}} \ \mathrm{Tr}\big(\tilde U^\top \mathbb{E}[xy^\top]\, \tilde V\big)
\quad \text{subject to} \quad \tilde U^\top \mathbb{E}[xx^\top + r_x I]\, \tilde U = I,\ \ \tilde V^\top \mathbb{E}[yy^\top + r_y I]\, \tilde V = I. \tag{2}
\]

Since C_x and C_y are positive definite matrices, one can make the simple change of variables U = C_x^{1/2} Ũ and V = C_y^{1/2} Ṽ to get the equivalent problem

\[
\max_{U \in \mathbb{R}^{d_x \times k},\; V \in \mathbb{R}^{d_y \times k}} \ \mathrm{Tr}\big(U^\top \mathbb{E}[xx^\top + r_x I]^{-\frac12}\, \mathbb{E}[xy^\top]\, \mathbb{E}[yy^\top + r_y I]^{-\frac12}\, V\big)
\quad \text{subject to} \quad U^\top U = I,\ \ V^\top V = I. \tag{3}
\]

8 Problem Definition (continued)

Let Φ ∈ R^{d_x×k} and Ψ ∈ R^{d_y×k} denote the top-k left and right singular vectors of C_x^{-1/2} C_xy C_y^{-1/2}. Then the optimum of Problem (2) is achieved at Ũ = C_x^{-1/2} Φ and Ṽ = C_y^{-1/2} Ψ. An equivalent formulation of the regularized empirical problem in the one-dimensional setting (k = 1) is

\[
\min_{u \in \mathbb{R}^{d_x},\; v \in \mathbb{R}^{d_y}} \ \frac{1}{2t}\,\big\|u^\top X - v^\top Y\big\|_2^2 + r_x \|u\|_2^2 + r_y \|v\|_2^2
\quad \text{subject to} \quad u^\top C_{x,t}\, u = 1,\ \ v^\top C_{y,t}\, v = 1, \tag{4}
\]

where X ∈ R^{d_x×t} and Y ∈ R^{d_y×t} stack the first t paired samples and C_{x,t}, C_{y,t} are the corresponding regularized empirical auto-covariance matrices.
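As a concrete reference point, here is a minimal numpy sketch of the batch solution described above (whiten the empirical cross-covariance and take its top-k SVD); the function name and default regularizers are illustrative, not from the paper:

```python
import numpy as np

def batch_regularized_cca(X, Y, k, rx=0.1, ry=0.1):
    """Batch solution sketch: top-k SVD of Cx^{-1/2} Cxy Cy^{-1/2}.

    X: (n, dx) and Y: (n, dy) hold paired samples in rows; returns
    (U_tilde, V_tilde, s) with U_tilde' Cx U_tilde = I by construction.
    """
    n = X.shape[0]
    Cx = X.T @ X / n + rx * np.eye(X.shape[1])   # regularized auto-covariances
    Cy = Y.T @ Y / n + ry * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n                            # cross-covariance

    def inv_sqrt(C):
        w, Q = np.linalg.eigh(C)                 # C is symmetric positive definite
        return Q @ np.diag(w ** -0.5) @ Q.T

    Wx, Wy = inv_sqrt(Cx), inv_sqrt(Cy)
    Phi, s, PsiT = np.linalg.svd(Wx @ Cxy @ Wy)  # whitened cross-covariance
    U_tilde = Wx @ Phi[:, :k]                    # undo the change of variables
    V_tilde = Wy @ PsiT[:k, :].T
    return U_tilde, V_tilde, s[:k]
```

The top-k singular values s are the (regularized) canonical correlations; the stochastic algorithms below aim to approach this solution from a stream of samples, one pair at a time.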

9 Outline

1 Introduction: Background, Challenges, Notations, Problem Definition

2 Methods: Matrix Stochastic Gradient, Matrix Exponentiated Gradient

3 Experiments

10 Proposed Method (1)

Problem (3) is not a convex optimization problem: not only is the objective non-convex, the constraint set of matrices with orthonormal columns is also non-convex. They re-parametrize Problem (3) via the variable substitution M = UV^⊤ and relax the constraint set to its convex hull. (If U and V have orthonormal columns, M = UV^⊤ has singular values in {0, 1} and rank at most k, so ‖M‖₂ ≤ 1 and ‖M‖_* ≤ k.) This gives

\[
\max_{M \in \mathbb{R}^{d_x \times d_y}} \ \big\langle M,\ C_x^{-\frac12} C_{xy}\, C_y^{-\frac12} \big\rangle
\quad \text{subject to} \quad \|M\|_2 \le 1,\ \ \|M\|_* \le k. \tag{5}
\]

11 Proposed Method (2)

The gradient of the CCA objective with respect to M is g := C_x^{-1/2} C_xy C_y^{-1/2}; however, the marginal distributions of x and y are unknown, so they use an "inexact" first-order oracle

\[
\partial_t := W_{x,t}\, x_t\, y_t^\top\, W_{y,t} \;\approx\; g_t,
\]

with the choice W_{x,t} := C_{x,t}^{-1/2} and W_{y,t} := C_{y,t}^{-1/2}, the empirical estimates of the whitening transforms. The true whitening transforms are denoted W_x := C_x^{-1/2} and W_y := C_y^{-1/2}.
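A minimal numpy sketch of this oracle (recomputing the empirical covariances from scratch at each step purely for clarity; an actual implementation would maintain them incrementally, and the function names here are hypothetical):

```python
import numpy as np

def inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, Q = np.linalg.eigh(C)
    return Q @ np.diag(w ** -0.5) @ Q.T

def inexact_gradient(X, Y, t, rx=0.1, ry=0.1):
    """Inexact first-order oracle  d_t = W_{x,t} x_t y_t' W_{y,t}.

    X: (n, dx), Y: (n, dy) hold the stream; t is 1-based, and only the first
    t samples are used for the empirical whitening transforms.
    """
    Xt, Yt = X[:t], Y[:t]
    Cx_t = Xt.T @ Xt / t + rx * np.eye(X.shape[1])
    Cy_t = Yt.T @ Yt / t + ry * np.eye(Y.shape[1])
    return inv_sqrt(Cx_t) @ np.outer(X[t - 1], Y[t - 1]) @ inv_sqrt(Cy_t)
```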

12 Proposed Method (3)

They design stochastic approximation algorithms for solving Problem (5): in each update, minimize an instantaneous loss measured on a single data point while trying to stay close to the previous iterate.

In each iteration, minimize Δ(M, M_{t−1}) + η E_D[ℓ(M, g_t)] over all feasible M, where Δ(·,·) is a divergence between successive iterates, ℓ is the instantaneous loss, and η is the learning-rate parameter controlling the trade-off between loss and divergence. The divergence function is defined in terms of a potential function; the two choices of potential used below give the two algorithms, as sketched next.
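The following instantiation is not spelled out in the slides but is the standard mirror-descent calculation behind the two methods: a squared Frobenius divergence yields a projected gradient step (MSG), while the quantum relative entropy yields a multiplicative update (MEG, with C_t the dilated gradient introduced on the MEG slides):

\[
\begin{aligned}
\Delta(M, M') = \tfrac12 \|M - M'\|_F^2
&\;\Longrightarrow\; M_t = P_F\big(M_{t-1} + \eta\, \partial_t\big),\\[2pt]
\Delta(M, M') = \mathrm{Tr}\big(M(\log M - \log M')\big)
&\;\Longrightarrow\; M_t = P\!\left(\frac{\exp(\log M_{t-1} + \eta\, C_t)}{\mathrm{Tr}\,\exp(\log M_{t-1} + \eta\, C_t)}\right).
\end{aligned}
\]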

13 Matrix Stochastic Gradient

MSG is stochastic mirror descent with the squared Frobenius norm as the potential function:

\[
M_t = P_F\big(M_{t-1} + \eta_t\, \partial_t\big),
\]

where P_F denotes the projection (in Frobenius norm) onto the constraint set of (5).
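A minimal sketch of one MSG step, assuming that the Frobenius projection onto {‖M‖₂ ≤ 1, ‖M‖_* ≤ k} keeps the singular vectors and projects the singular values onto the capped simplex {0 ≤ s_i ≤ 1, Σ s_i ≤ k}; the helper names are hypothetical and this is not claimed to be the paper's exact routine:

```python
import numpy as np

def project_msg_set(M, k):
    """Frobenius projection onto {M : ||M||_2 <= 1, ||M||_* <= k} (sketch).

    Projects the singular values onto {s : 0 <= s_i <= 1, sum_i s_i <= k}.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_proj = np.clip(s, 0.0, 1.0)
    if s_proj.sum() > k:
        # find theta >= 0 such that sum_i clip(s_i - theta, 0, 1) = k
        lo, hi = 0.0, float(s.max())
        for _ in range(60):                     # simple bisection
            theta = 0.5 * (lo + hi)
            if np.clip(s - theta, 0.0, 1.0).sum() > k:
                lo = theta
            else:
                hi = theta
        s_proj = np.clip(s - 0.5 * (lo + hi), 0.0, 1.0)
    return (U * s_proj) @ Vt

def msg_step(M_prev, grad_t, eta_t, k):
    """One MSG update: step along the inexact gradient, then project."""
    return project_msg_set(M_prev + eta_t * grad_t, k)
```

Iterating msg_step over the stream with the inexact gradients from the previous slide gives the algorithm; a final rounding step is still needed to extract rank-k factors (U, V) from the last iterate.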

14 Convergence of Matrix Stochastic Gradient

Lemma

Assume that for all (x_t, y_t) ∼ D we have max{‖x_t‖², ‖y_t‖²} ≤ B. Then there exists a constant κ, independent of t, such that for all t the following holds in expectation:

\[
\mathbb{E}_{D}\big[\|E_t\|_2\big] \;\le\; \frac{\kappa}{\sqrt{t}}.
\]

Theorem

After T iterations of MSG with step size η = 2√k / (G√T), starting at M_1 = 0,

\[
\mathbb{E}\Big[\big\langle M_* - \tilde M,\ C_x^{-\frac12} C_{xy}\, C_y^{-\frac12}\big\rangle\Big] \;\le\; \frac{2\sqrt{k}\, G + 2k\kappa}{\sqrt{T}}, \tag{6}
\]

where κ is the universal constant from the Lemma above, M_* is the optimum of (5), the expectation is with respect to the i.i.d. samples and the rounding step, and G is a bound, given by a Lemma in the Appendix, such that ‖∂_t‖_F ≤ G for all iterates t = 1, …, T.

15 Matrix Exponentiated Gradient (MEG)

Here the feasible set is that of k-dimensional (paired) subspaces of R^d, where d := d_x + d_y, and the multiplicative algorithm is an instance of the matrix exponentiated gradient (MEG) update. Consider the self-adjoint dilation of the matrix E_D[g_t]:

\[
C := \mathbb{E}_D \begin{bmatrix} 0 & g_t \\ g_t^\top & 0 \end{bmatrix}.
\]

Assume that E_D[g_t] has no repeated singular values and that its SVD is E_D[g_t] = UΣV^⊤. Then the eigendecomposition of C is

\[
C = \frac{1}{2}
\begin{bmatrix} U & U \\ V & -V \end{bmatrix}
\begin{bmatrix} \Sigma & 0 \\ 0 & -\Sigma \end{bmatrix}
\begin{bmatrix} U & U \\ V & -V \end{bmatrix}^{\!\top}.
\]

Regularized CCA can then be reformulated as

\[
\max_{W \in \mathbb{R}^{d \times k}} \ \mathrm{Tr}\big(W W^\top C\big)
\quad \text{subject to} \quad W^\top W = I. \tag{7}
\]
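A quick numerical check of the dilation identity, purely for illustration (all names and dimensions are made up):

```python
import numpy as np

# The self-adjoint dilation of G has eigenvalues +/- sigma_i(G), plus zeros.
rng = np.random.default_rng(0)
dx, dy = 4, 3
G = rng.standard_normal((dx, dy))
C = np.block([[np.zeros((dx, dx)), G],
              [G.T, np.zeros((dy, dy))]])

sigma = np.linalg.svd(G, compute_uv=False)
expected = np.sort(np.concatenate([sigma, -sigma, np.zeros(abs(dx - dy))]))
print(expected)
print(np.sort(np.linalg.eigvalsh(C)))   # matches the line above
```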

16 MEG

Problem (7) is not a convex optimization problem, but it admits the following convex relaxation by setting M = (1/k) WW^⊤:

\[
\max_{M \in \mathbb{R}^{d \times d}} \ \mathrm{Tr}(M C)
\quad \text{subject to} \quad M \succeq 0,\ \ \|M\|_2 \le \frac{1}{k},\ \ \mathrm{Tr}(M) = 1. \tag{8}
\]

Update M_{t−1} by solving

\[
M_t := \operatorname*{argmin}_{M \in \mathbb{R}^{d \times d}} \ \Delta(M, M_{t-1}) - \eta\, \mathrm{Tr}(M C_t)
\quad \text{subject to} \quad M \succeq 0,\ \ \|M\|_2 \le \frac{1}{k},\ \ \mathrm{Tr}(M) = 1, \tag{9}
\]

where C_t is the self-adjoint dilation of g_t and Δ(·,·) is the quantum relative entropy between its arguments. Setting the gradient of the Lagrangian to zero and solving gives

\[
\widehat{M}_t = \frac{\exp\!\big(\log(M_{t-1}) + \eta C_t\big)}{\mathrm{Tr}\!\big(\exp(\log(M_{t-1}) + \eta C_t)\big)}, \qquad M_t = P\big(\widehat{M}_t\big), \tag{10}
\]

where P denotes the projection onto the convex constraint set in (9). In practice, the self-adjoint dilation of g_t is replaced by the self-adjoint dilation of ∂_t, denoted C̃_t.

17 MEG
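A minimal numpy sketch of update (10), computing the matrix exponential and logarithm through an eigendecomposition of the symmetric iterate; the projection here caps the eigenvalues at 1/k and rescales the rest, which is a simplified stand-in (with hypothetical names) rather than the paper's exact routine:

```python
import numpy as np

def meg_step(M_prev, C_t, eta, k):
    """One MEG update (10): multiplicative step, then projection (sketch)."""
    w, Q = np.linalg.eigh(M_prev)
    w = np.clip(w, 1e-12, None)            # keep the matrix log well defined
    A = Q @ np.diag(np.log(w)) @ Q.T + eta * C_t
    lam, U = np.linalg.eigh(A)
    lam = np.exp(lam - lam.max())          # stabilized matrix exponential
    lam /= lam.sum()                       # trace normalization
    lam = _cap_and_rescale(lam, 1.0 / k)   # eigenvalue constraint ||M||_2 <= 1/k
    return U @ np.diag(lam) @ U.T

def _cap_and_rescale(p, cap):
    """Cap entries of a probability vector at `cap`, rescaling the others."""
    p = p.copy()
    free = np.ones_like(p, dtype=bool)
    for _ in range(len(p)):
        scale = (1.0 - cap * (~free).sum()) / p[free].sum()
        if np.all(p[free] * scale <= cap + 1e-12):
            p[free] *= scale
            return p
        newly = free & (p * scale > cap)
        p[newly] = cap
        free &= ~newly
    return p
```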

18 Outline

1 Introduction: Background, Challenges, Notations, Problem Definition

2 Methods: Matrix Stochastic Gradient, Matrix Exponentiated Gradient

3 Experiments

19 Synthetic dataset

20 Mediamill
