Stochastic Approximation for Canonical Correlation Analysis

Raman Arora, Teodor V. Marinov, Poorya Mianjy

Department of Computer Science, Johns Hopkins University, Baltimore

Presented by Liqun Chen Nov 3rd, 2017

3 Introduction Background Background

Canonical Correlation Analysis (CCA) is a statistical technique for finding linear components of two sets of random variables that are maximally correlated. CCA is often posed as a dimensionality reduction problem about a fixed dataset of n paired points. In this paper, the authors instead take a stochastic optimization view of CCA: given a pair of random vectors $(x, y) \in \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$ with some (unknown) joint distribution $\mathcal{D}$, find matrices $U \in \mathbb{R}^{d_x \times k}$ and $V \in \mathbb{R}^{d_y \times k}$ that solve:

$$\max_{U \in \mathbb{R}^{d_x \times k},\; V \in \mathbb{R}^{d_y \times k}} \; \mathbb{E}_{x,y}\left[x^\top U V^\top y\right] \quad \text{subject to} \quad U^\top \mathbb{E}_x[xx^\top]\, U = V^\top \mathbb{E}_y[yy^\top]\, V = I_k. \tag{1}$$

4 Introduction Background Background

The goal is to find a “good enough” subspace in terms of capturing correlation, rather than a “true” subspace (domain adaptation).

The stochastic optimization view motivates stochastic approximation algorithms that can easily scale to very large datasets.

5 Introduction Challenges Challenges

The problem is non-convex.

It can be written as an unconstrained optimization problem, with no stochastic constraint on $U$ and $V$; e.g., for $k = 1$, we can consider the objective

$$\rho(u^\top x, v^\top y) = \frac{\mathbb{E}_{x,y}\left[u^\top x y^\top v\right]}{\sqrt{\mathbb{E}_x\left[u^\top x x^\top u\right]}\,\sqrt{\mathbb{E}_y\left[v^\top y y^\top v\right]}}.$$

The CCA problem given in (1) is non-learnable, i.e., there exist distributions on which one may incur small empirical error on the training sample but arbitrarily large generalization error.
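As a small illustration of the $k = 1$ objective above, the following sketch evaluates an empirical version of $\rho$ on synthetic data and shows that it does not change when $u$ and $v$ are rescaled, which is why no normalization constraint is needed. The data, the function name `correlation_objective`, and the use of sample averages in place of expectations are assumptions of this sketch, not part of the slides.

```python
import numpy as np

def correlation_objective(u, v, X, Y):
    """Empirical rho(u^T x, v^T y); rows of X and Y are paired samples.

    Expectations are replaced by sample averages (an assumption of this sketch).
    """
    a = X @ u  # projections u^T x_i
    b = Y @ v  # projections v^T y_i
    numerator = np.mean(a * b)
    denominator = np.sqrt(np.mean(a * a)) * np.sqrt(np.mean(b * b))
    return numerator / denominator

rng = np.random.default_rng(0)
n, dx, dy = 500, 4, 3
X = rng.standard_normal((n, dx))
Y = X[:, :dy] + 0.5 * rng.standard_normal((n, dy))  # correlated second view

u = rng.standard_normal(dx)
v = rng.standard_normal(dy)

rho = correlation_objective(u, v, X, Y)
# Rescaling u and v by positive constants leaves the ratio unchanged.
rho_scaled = correlation_objective(3.0 * u, 0.25 * v, X, Y)
```

Because the objective is a ratio of expectations rather than an expectation of a per-sample loss, a single sample does not give an unbiased estimate of it.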

6 Introduction Notations Notations

For any matrix $X$, the spectral norm, nuclear norm, and Frobenius norm are denoted by $\|X\|_2$, $\|X\|_*$, and $\|X\|_F$, respectively.

For two matrices $X$ and $Y$ of the same size, the standard inner product is $\langle X, Y \rangle = \operatorname{Tr}(X^\top Y)$.

Let the singular value decomposition (SVD) of an arbitrary $X$ be given as $X = U \Sigma V^\top$.

Let the auto-covariance matrices be $C_{xx} = \mathbb{E}_x[xx^\top]$ and $C_{yy} = \mathbb{E}_y[yy^\top]$, and the cross-covariance matrix be $C_{xy} = \mathbb{E}_{(x,y)}[xy^\top]$.

Denote the regularized auto-covariance matrices by $C_x = C_{xx} + r_x I$ and $C_y = C_{yy} + r_y I$.
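For readers who want to check the notation numerically, here is a minimal NumPy sketch relating the three norms to the singular values and computing the matrix inner product. The random matrices are placeholders chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Y = rng.standard_normal((5, 3))

# Singular values of X, sorted in descending order.
s = np.linalg.svd(X, compute_uv=False)

spectral = np.linalg.norm(X, 2)       # largest singular value
nuclear = np.linalg.norm(X, 'nuc')    # sum of singular values
frobenius = np.linalg.norm(X, 'fro')  # sqrt of the sum of squared singular values

# Standard inner product <X, Y> = Tr(X^T Y), i.e. the sum of entrywise products.
inner = np.trace(X.T @ Y)
```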

7 Introduction Problem Definition Problem Definition

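With this notation, the regularized version of Problem (1) can be written as follows.

$$\max_{\tilde{U} \in \mathbb{R}^{d_x \times k},\; \tilde{V} \in \mathbb{R}^{d_y \times k}} \; \operatorname{Tr}\left(\tilde{U}^\top \mathbb{E}[xy^\top]\, \tilde{V}\right) \quad \text{subject to} \quad \tilde{U}^\top \mathbb{E}[xx^\top + r_x I]\, \tilde{U} = I, \quad \tilde{V}^\top \mathbb{E}[yy^\top + r_y I]\, \tilde{V} = I \tag{2}$$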

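Since $C_x^{1/2}$ and $C_y^{1/2}$ are positive definite matrices, one can make the simple change of variables $U = C_x^{1/2} \tilde{U}$ and $V = C_y^{1/2} \tilde{V}$ to get the equivalent problem

$$\max_{U \in \mathbb{R}^{d_x \times k},\; V \in \mathbb{R}^{d_y \times k}} \; \operatorname{Tr}\left(U^\top\, \mathbb{E}[xx^\top + r_x I]^{-1/2}\, \mathbb{E}[xy^\top]\, \mathbb{E}[yy^\top + r_y I]^{-1/2}\, V\right) \quad \text{subject to} \quad U^\top U = I, \quad V^\top V = I \tag{3}$$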

8 Introduction Problem Definition Problem Definition (continue)

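Let $\Phi \in \mathbb{R}^{d_x \times k}$ and $\Psi \in \mathbb{R}^{d_y \times k}$ denote the top-$k$ left and right singular vectors of $C_x^{-1/2} C_{xy} C_y^{-1/2}$. Then, the optimum of Problem 2 is achieved at $\tilde{U} = C_x^{-1/2} \Phi$ and $\tilde{V} = C_y^{-1/2} \Psi$. An equivalent definition for the regularized empirical problem in the 1-dimensional setting is

$$\min_{u \in \mathbb{R}^{d_x},\; v \in \mathbb{R}^{d_y}} \; \frac{1}{2t} \left\| u^\top X - v^\top Y \right\|_2^2 + r_x \|u\|_2^2 + r_y \|v\|_2^2 \quad \text{subject to} \quad u^\top C_{x,t}\, u = I, \quad v^\top C_{y,t}\, v = I \tag{4}$$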
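The closed-form solution stated above (top-$k$ singular vectors of the whitened cross-covariance, mapped back through the inverse square roots) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: expectations are replaced by sample averages over a fixed synthetic dataset, the data are not centered (matching the definitions $C_{xx} = \mathbb{E}[xx^\top]$ used here), and the helper names `inv_sqrt_psd`, `regularized_covariances`, and `regularized_cca` are hypothetical.

```python
import numpy as np

def inv_sqrt_psd(C):
    """Inverse square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(C)
    return (Q / np.sqrt(w)) @ Q.T  # Q diag(w^{-1/2}) Q^T

def regularized_covariances(X, Y, rx, ry):
    """Sample estimates of C_x, C_y, C_xy; rows of X and Y are paired samples."""
    n = X.shape[0]
    Cx = X.T @ X / n + rx * np.eye(X.shape[1])
    Cy = Y.T @ Y / n + ry * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    return Cx, Cy, Cxy

def regularized_cca(Cx, Cy, Cxy, k):
    """Solve Problem (2) via the SVD of C_x^{-1/2} C_xy C_y^{-1/2}."""
    Wx, Wy = inv_sqrt_psd(Cx), inv_sqrt_psd(Cy)
    Phi, sigma, PsiT = np.linalg.svd(Wx @ Cxy @ Wy)
    U_tilde = Wx @ Phi[:, :k]   # C_x^{-1/2} Phi
    V_tilde = Wy @ PsiT[:k].T   # C_y^{-1/2} Psi
    return U_tilde, V_tilde, sigma[:k]

rng = np.random.default_rng(0)
n, dx, dy, k = 1000, 5, 4, 2
X = rng.standard_normal((n, dx))
Y = X[:, :dy] + 0.5 * rng.standard_normal((n, dy))

Cx, Cy, Cxy = regularized_covariances(X, Y, rx=0.1, ry=0.1)
U_tilde, V_tilde, sigma = regularized_cca(Cx, Cy, Cxy, k)
objective = np.trace(U_tilde.T @ Cxy @ V_tilde)
```

In this sketch the objective value equals the sum of the top-$k$ singular values, and the constraints of Problem (2) hold up to floating-point error.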

10 Methods Proposed Method (1)

Problem 3 is not a convex optimization problem: not only is the objective non-convex, the constraint set of orthogonal matrices is also non-convex. Re-parametrize Problem 3 via the variable substitution $M = UV^\top$. Furthermore, the authors take the convex hull of the constraint set, which gives:

$$\max_{M \in \mathbb{R}^{d_x \times d_y}} \; \left\langle M,\; C_x^{-1/2} C_{xy} C_y^{-1/2} \right\rangle \quad \text{subject to} \quad \|M\|_2 \le 1, \quad \|M\|_* \le k. \tag{5}$$
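To see why the relaxation contains the original feasible set, the sketch below builds $M = UV^\top$ from matrices with orthonormal columns and checks numerically that $\|M\|_2 = 1$ and $\|M\|_* = k$, and that the linear objective in $M$ agrees with the trace objective in $(U, V)$. The matrix `T` is a random stand-in for $C_x^{-1/2} C_{xy} C_y^{-1/2}$, used here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dy, k = 6, 5, 2

# Orthonormal columns via reduced QR, so that U^T U = I and V^T V = I.
U, _ = np.linalg.qr(rng.standard_normal((dx, k)))
V, _ = np.linalg.qr(rng.standard_normal((dy, k)))

# Random stand-in for the whitened cross-covariance (an assumption of this sketch).
T = rng.standard_normal((dx, dy))

M = U @ V.T
spectral = np.linalg.norm(M, 2)     # all k nonzero singular values of M equal 1
nuclear = np.linalg.norm(M, 'nuc')  # so the nuclear norm equals k

obj_uv = np.trace(U.T @ T @ V)      # objective of Problem (3)
obj_m = np.sum(M * T)               # <M, T> = Tr(M^T T), objective of Problem (5)
```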

11 Methods Proposed Method (2)

The gradient of the CCA objective w.r.t. $M$ is $g := C_x^{-1/2} C_{xy} C_y^{-1/2}$; however, the marginal distributions of $x$ and $y$ are unknown, so the authors use an “inexact” first-order oracle

$$\partial_t := W_{x,t}\, x_t y_t^\top\, W_{y,t} \approx g_t,$$

with the choice of $W_{x,t} := C_{x,t}^{-1/2}$ and $W_{y,t} := C_{y,t}^{-1/2}$ being the empirical estimates of the whitening transform matrices. Denote the true whitening transforms by $W_x := \mathbb{E}[xx^\top + r_x I]^{-1/2}$ and $W_y := \mathbb{E}[yy^\top + r_y I]^{-1/2}$.
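A minimal sketch of the inexact oracle, assuming the empirical whitening matrices are computed from the regularized sample covariances of the points seen so far. Whether the current sample is included in those estimates, and the names `inv_sqrt_psd` and `inexact_gradient`, are choices of this sketch rather than details stated in the slides.

```python
import numpy as np

def inv_sqrt_psd(C):
    """Inverse square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(C)
    return (Q / np.sqrt(w)) @ Q.T

def inexact_gradient(X_seen, Y_seen, x_t, y_t, rx, ry):
    """Compute W_{x,t} x_t y_t^T W_{y,t} from the samples seen so far.

    Rows of X_seen and Y_seen are the paired samples used to estimate the
    regularized covariances C_{x,t} and C_{y,t}.
    """
    t = X_seen.shape[0]
    Cx_t = X_seen.T @ X_seen / t + rx * np.eye(X_seen.shape[1])
    Cy_t = Y_seen.T @ Y_seen / t + ry * np.eye(Y_seen.shape[1])
    Wx_t = inv_sqrt_psd(Cx_t)  # empirical whitening transform for x
    Wy_t = inv_sqrt_psd(Cy_t)  # empirical whitening transform for y
    return Wx_t @ np.outer(x_t, y_t) @ Wy_t

rng = np.random.default_rng(0)
X_seen = rng.standard_normal((200, 4))
Y_seen = rng.standard_normal((200, 3))
grad = inexact_gradient(X_seen, Y_seen, X_seen[-1], Y_seen[-1], rx=0.1, ry=0.1)
grad_singular_values = np.linalg.svd(grad, compute_uv=False)
```

Each such estimate is a rank-one $d_x \times d_y$ matrix, since it is the outer product of the two whitened samples.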

12 Methods Proposed Method (3)

Design stochastic approximation algorithms for solving Problem 5. In each update, minimize the instantaneous loss measured on a single data point while trying to stay close to the previous iterate.

In each iteration, minimize the sum $\Delta(M_t, M) + \eta\, \mathbb{E}_D[\ell(M, g_t)]$ over all feasible $M$, where:
$\Delta(\cdot, \cdot)$ is a divergence function between successive iterates;
$\ell$ is the instantaneous loss;
$\eta$ is the learning-rate parameter controlling the tradeoff between loss and divergence.
The divergence function is defined in terms of a potential function.

13 Methods Matrix Stochastic Gradient Matrix Stochastic Gradient

Stochastic mirror descent with the choice of potential function as the Frobenius norm: $M_t = P_F(M_{t-1} + \eta_t \partial_t)$
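The update above can be sketched as a gradient step followed by a projection. The sketch assumes that $P_F$ is the Frobenius-norm projection onto the feasible set of Problem (5), and implements it by projecting the singular values onto $\{0 \le s_i \le 1,\ \sum_i s_i \le k\}$ with a simple bisection on the shift. The function names and the bisection approach are choices of this sketch, not necessarily the procedure used by the authors.

```python
import numpy as np

def project_singular_values(s, k, iters=100):
    """Euclidean projection of s >= 0 onto {0 <= s_i <= 1, sum(s_i) <= k}."""
    clipped = np.clip(s, 0.0, 1.0)
    if clipped.sum() <= k:
        return clipped
    # Otherwise find a shift theta >= 0 with sum(clip(s - theta, 0, 1)) = k.
    lo, hi = 0.0, float(s.max())
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        if np.clip(s - theta, 0.0, 1.0).sum() > k:
            lo = theta
        else:
            hi = theta
    return np.clip(s - hi, 0.0, 1.0)  # hi always satisfies the sum constraint

def project_frobenius(M, k):
    """Project M onto {||M||_2 <= 1, ||M||_* <= k} in Frobenius norm."""
    P, s, Qt = np.linalg.svd(M, full_matrices=False)
    return (P * project_singular_values(s, k)) @ Qt

def msg_step(M_prev, grad, eta, k):
    """One MSG update: M_t = P_F(M_{t-1} + eta * grad)."""
    return project_frobenius(M_prev + eta * grad, k)

rng = np.random.default_rng(0)
dx, dy, k = 6, 5, 2
M0 = np.zeros((dx, dy))
grad = 3.0 * rng.standard_normal((dx, dy))  # placeholder for the oracle output
M1 = msg_step(M0, grad, eta=1.0, k=k)
M1_again = project_frobenius(M1, k)         # projecting a feasible point
```

The projection only changes the singular values and keeps the singular vectors, which is why a single SVD per step suffices in this sketch.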

14 Methods Matrix Stochastic Gradient Convergence of Matrix Stochastic Gradient


Lemma 1. Assume that for all $(x_t, y_t) \sim D$ we have $\max\{\|x_t\|^2, \|y_t\|^2\} \le B$. Then there exists a constant $\kappa$ independent of $t$ such that for all $t$ the following holds in expectation:

$$\mathbb{E}_D\left[\|E_t\|_2\right] \le \frac{\kappa}{\sqrt{t}}$$

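Theorem. After $T$ iterations of MSG with step size $\eta = \frac{2\sqrt{k}}{G\sqrt{T}}$, and starting at $M^{(1)} = 0$,

$$\mathbb{E}\left[\left\langle M_* - \tilde{M},\; C_x^{-1/2} C_{xy} C_y^{-1/2} \right\rangle\right] \le \frac{2\sqrt{k}\, G + 2k\kappa}{\sqrt{T}} \tag{6}$$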

where $\kappa$ is a universal constant given by Lemma 1, $M_*$ is the optimum of (5), the expectation is with respect to the i.i.d. samples and rounding, and $G$ is given by a lemma in the appendix such that $\|\partial_t\|_F \le G$ for all the iterates $t = 1, \cdots, T$.

15 Methods Matrix Exponentiated Gradient Matrix Exponentiated Gradient (MEG)

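The set is that of $d$ $k$-dimensional (paired) subspaces, and the multiplicative algorithm is an instance of the matrix exponentiated gradient (MEG) update.

Let $d := d_x + d_y$. Consider the self-adjoint dilation of the matrix $\mathbb{E}_D[g_t]$:

$$C := \mathbb{E}_D \begin{bmatrix} 0 & g_t \\ g_t^\top & 0 \end{bmatrix}.$$

Assume that $\mathbb{E}_D[g_t]$ has no repeated singular values and its SVD is given by $\mathbb{E}_D[g_t] = U \Sigma V^\top$; then the eigendecomposition of $C$ is given by

$$C = \frac{1}{2} \begin{bmatrix} U & U \\ V & -V \end{bmatrix} \begin{bmatrix} \Sigma & 0 \\ 0 & -\Sigma \end{bmatrix} \begin{bmatrix} U & U \\ V & -V \end{bmatrix}^\top.$$

Then re-formulate regularized CCA as

$$\max_{W \in \mathbb{R}^{d \times k}} \; \operatorname{Tr}\left(W W^\top C\right) \quad \text{subject to} \quad W^\top W = I. \tag{7}$$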
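The relationship between the singular values of a matrix and the eigenvalues of its self-adjoint dilation can be checked numerically. In the sketch below, `G` is a random placeholder for $\mathbb{E}_D[g_t]$, and `W` stacks the top-$k$ singular vectors scaled by $1/\sqrt{2}$, which is feasible for Problem (7) and attains the sum of the top-$k$ singular values; the helper name `self_adjoint_dilation` is hypothetical.

```python
import numpy as np

def self_adjoint_dilation(G):
    """Return the symmetric block matrix [[0, G], [G^T, 0]]."""
    dx, dy = G.shape
    return np.block([[np.zeros((dx, dx)), G],
                     [G.T, np.zeros((dy, dy))]])

rng = np.random.default_rng(0)
dx, dy, k = 4, 3, 2
G = rng.standard_normal((dx, dy))  # placeholder for E_D[g_t]
C = self_adjoint_dilation(G)

U, sing, Vt = np.linalg.svd(G, full_matrices=False)  # sing is descending
eig = np.linalg.eigvalsh(C)                          # eig is ascending

# Stack the top-k left and right singular vectors; W has orthonormal columns.
W = np.vstack([U[:, :k], Vt[:k].T]) / np.sqrt(2.0)
objective = np.trace(W @ W.T @ C)
```

The nonzero eigenvalues of `C` come in pairs $\pm\sigma_i$, and the remaining $|d_x - d_y|$ eigenvalues are zero.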

16 Methods Matrix Exponentiated Gradient MEG

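Problem 7 is not a convex optimization problem, but it admits the following convex relaxation by setting $M = \frac{1}{k} W W^\top$:

$$\max_{M \in \mathbb{R}^{d \times d}} \; \operatorname{Tr}(MC) \quad \text{subject to} \quad M \succeq 0, \quad \|M\|_2 \le \frac{1}{k}, \quad \operatorname{Tr}(M) = 1 \tag{8}$$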

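Update $M_{t-1}$ by solving

$$M_t := \operatorname*{argmin}_{M \in \mathbb{R}^{d \times d}} \; \Delta(M, M_{t-1}) - \eta \operatorname{Tr}(M C_t) \quad \text{subject to} \quad M \succeq 0, \quad \|M\|_2 \le \frac{1}{k}, \quad \operatorname{Tr}(M) = 1 \tag{9}$$

where $C_t$ is the self-adjoint dilation of $g_t$ and $\Delta(x, y)$ is the quantum relative entropy between $x$ and $y$. Setting the gradient of the Lagrangian to zero and solving gives: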

$$\hat{M}_t = \frac{\exp\left(\log(M_{t-1}) + \eta C_t\right)}{\operatorname{Tr}\left(\exp\left(\log(M_{t-1}) + \eta C_t\right)\right)}, \qquad M_t = P\left(\hat{M}_t\right) \tag{10}$$

where $P$ denotes the projection onto the convex set of constraints in (9). In practice, the authors replace the self-adjoint dilation of $g_t$ by the self-adjoint dilation of $\partial_t$, which is denoted by $\tilde{C}_t$.

17 Methods Matrix Exponentiated Gradient MEG
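The multiplicative step in (10), before the projection $P$, can be sketched with eigendecompositions for the matrix logarithm and exponential. This sketch deliberately omits the projection (so the constraint $\|M\|_2 \le 1/k$ is not enforced), starts from the scaled identity so that the logarithm is well defined, and uses a random placeholder for the gradient estimate; the function names are hypothetical.

```python
import numpy as np

def self_adjoint_dilation(G):
    """Return the symmetric block matrix [[0, G], [G^T, 0]]."""
    dx, dy = G.shape
    return np.block([[np.zeros((dx, dx)), G],
                     [G.T, np.zeros((dy, dy))]])

def meg_step_unprojected(M_prev, C_t, eta):
    """Compute exp(log(M_prev) + eta * C_t) normalized to unit trace.

    M_prev must be symmetric positive definite. The projection P from (10)
    is NOT applied here.
    """
    w, Q = np.linalg.eigh(M_prev)
    log_M = (Q * np.log(w)) @ Q.T        # matrix logarithm of M_prev
    L = log_M + eta * C_t
    L = 0.5 * (L + L.T)                  # symmetrize against round-off
    w, Q = np.linalg.eigh(L)
    w = np.exp(w - w.max())              # the shift cancels after normalization
    M_hat = (Q * w) @ Q.T
    return M_hat / np.trace(M_hat)

rng = np.random.default_rng(0)
dx, dy = 4, 3
d = dx + dy
C_t = self_adjoint_dilation(rng.standard_normal((dx, dy)))  # placeholder dilation
M0 = np.eye(d) / d                       # positive definite start with Tr(M0) = 1
M_hat = meg_step_unprojected(M0, C_t, eta=0.5)
```

Working with the eigenvalues of the exponent and subtracting the largest one before exponentiating avoids overflow for large step sizes without changing the normalized result.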

19 Experiments Synthetic dataset

20 Experiments Mediamill