Escaping from Saddle Points on Riemannian Manifolds
Yue Sun†, Nicolas Flammarion‡, Maryam Fazel†
† Department of Electrical and Computer Engineering, University of Washington, Seattle
‡ School of Computer and Communication Sciences, EPFL

Abstract

We consider minimizing a nonconvex, smooth function f(x) over a smooth manifold M. We show that a perturbed Riemannian gradient algorithm converges to a second-order stationary point in a number of iterations that is polynomial in appropriate smoothness parameters of f and M, and polylogarithmic in the dimension. This matches the best known rate for unconstrained smooth minimization.

Background and motivation

Consider the optimization problem

    minimize_x f(x)
    subject to x ∈ M,

where M is a manifold of dimension d, the optimization variable is x ∈ M, and the (nonconvex) function f is twice differentiable. Finding a global optimum is generally not possible, so we seek an approximate second-order stationary point on the manifold (defined in the main theorem) using first-order algorithms.

Related work:
• Unconstrained case: the convergence rate of perturbed GD is polynomial in the smoothness parameters and polylogarithmic in d [1].
• Equality-constrained case (with explicit constraints): the convergence rate of noisy GD is polynomial in the smoothness parameters and in d [2].

Here we study perturbed Riemannian GD and show that its convergence rate is polylogarithmic in d and polynomial in the smoothness parameters. This extends the best known unconstrained rates to non-Euclidean, manifold-constrained problems (e.g., optimization on matrix manifolds).

Contributions

For (nonconvex) optimization on a Riemannian manifold, perturbed Riemannian GD has a rate that is
• polylogarithmic in the dimension (improving the "polynomial in dimension" rate of [2]);
• polynomial in ε and β, comparable to the unconstrained case [1];
• explicitly polynomial in a curvature constant that is left implicit in [3].

Taylor series and smoothness assumptions

Notation: Exp_x(y) denotes the exponential map, gradf(x) and H(x) are the Riemannian gradient and Hessian of f at x, x is a saddle point, and û = Exp_x^{-1}(u).

• Riemannian gradient descent. Let f have a β-Lipschitz gradient. There exists η = Θ(1/β) such that the Riemannian gradient descent step

    u⁺ = Exp_u(−η gradf(u))    (cf. the Euclidean case u⁺ = u − η∇f(u))

monotonically decreases f by (η/2)‖gradf(u)‖².

• ρ-Lipschitz Hessian. Let f̂_x = f ∘ Exp_x have a ρ-Lipschitz Hessian. Then

    f̂_x(û) = f(u) ≤ f(x) + ⟨gradf(x), û⟩ + (1/2) H(x)[û, û] + (ρ/6) ‖û‖³.

• Two perturbed iterates; negative curvature direction. Let u, w be perturbations of x. Then

    ‖(w⁺ − u⁺) − (w − u) + η H(x)[w − u]‖ ≤ η ρ̂ ‖w − u‖ (‖w − x‖ + ‖u − x‖),

where ρ̂ is a function of (1) the Hessian Lipschitz constant of f(·), (2) the Hessian Lipschitz constant of Exp_·(·), (3) the spectral norm of the Riemannian curvature tensor, and (4) the injectivity radius.

Figure: Escaping from saddle trajectory.

Main theorem

Let the smoothness assumptions above hold. With probability at least 1 − δ, perturbed Riemannian GD takes

    O( (β(f(x0) − f(x*)) / ε²) · log⁴( βd(f(x0) − f(x*)) / (ε²δ) ) )

iterations to reach an (ε, −√(ρ̂ε))-stationary point, i.e., a point x with ‖gradf(x)‖ ≤ ε and λ_min(H(x)) ≥ −√(ρ̂ε).

Algorithm (informal)

• At the current iterate x, check the norm of the gradient.
• If it is large: take the step x⁺ = Exp_x(−η gradf(x)) to decrease the function value.
• If it is small: x is near either a saddle point or a local minimum. Perturb the iterate by adding appropriate noise and run a few more iterations:
  – if f decreases, the iterates have escaped the saddle point (and the algorithm continues);
  – if f does not decrease, x is at an approximate local minimum (and the algorithm terminates).
A minimal numerical sketch of this procedure follows.
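The sketch below is an illustration only: perturbed Riemannian gradient descent on the unit sphere, where the exponential map has a closed form. The test problem min_{‖x‖=1} xᵀAx (whose non-extremal eigenvectors are saddle points), the step size, the gradient threshold, the perturbation radius, and the number of escape steps are our own illustrative choices, not the quantities analyzed in the paper, and the helper names (exp_map_sphere, riem_grad, perturbed_rgd) are likewise ours.

# Minimal sketch of the informal algorithm above, under the assumptions stated
# in the lead-in: unit sphere, test function f(x) = x^T A x, illustrative constants.
import numpy as np


def exp_map_sphere(x, v):
    """Exponential map Exp_x(v) on the unit sphere, for v in the tangent space at x."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)


def riem_grad(A, x):
    """Riemannian gradient of f(x) = x^T A x: Euclidean gradient projected onto T_x."""
    g = 2.0 * (A @ x)
    return g - (x @ g) * x


def perturbed_rgd(A, x0, eta=0.05, grad_tol=1e-4, radius=1e-2,
                  escape_steps=50, max_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    x = x0 / np.linalg.norm(x0)
    for _ in range(max_iter):
        g = riem_grad(A, x)
        if np.linalg.norm(g) > grad_tol:
            # Large gradient: ordinary Riemannian gradient step.
            x = exp_map_sphere(x, -eta * g)
        else:
            # Small gradient: near a saddle point or a local minimum.
            # Perturb in a random tangent direction and run a few steps.
            f_before = x @ A @ x
            xi = rng.standard_normal(x.shape)
            xi -= (x @ xi) * x                      # project the noise onto T_x
            y = exp_map_sphere(x, radius * xi / np.linalg.norm(xi))
            for _ in range(escape_steps):
                y = exp_map_sphere(y, -eta * riem_grad(A, y))
            if y @ A @ y < f_before - 1e-8:
                x = y                               # escaped the saddle; continue
            else:
                return x                            # approximate local minimum
    return x


# Usage: start exactly at an interior eigenvector of A, a saddle point of
# min_{||x||=1} x^T A x; the perturbation lets the iterates escape and
# converge toward the eigenvector of the smallest eigenvalue.
d = 20
A = np.diag(np.linspace(1.0, 2.0, d))
x_hat = perturbed_rgd(A, np.eye(d)[10])
print(x_hat @ A @ x_hat)   # close to the smallest eigenvalue, 1.0

The sphere is used here only because its exponential map is explicit; any manifold with a computable exponential map could be substituted. Started at an exact saddle point, the small-gradient branch triggers the perturbation, after which the gradient steps drive the function value down, qualitatively matching the function-value plot described below.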
Example – Burer-Monteiro factorization

Let A ∈ S^{d×d}. The problem

    max_{X ∈ S^{d×d}} trace(AX)
    s.t. diag(X) = 1, X ⪰ 0, rank(X) ≤ r

can be factorized as

    max_{Y ∈ R^{d×p}} trace(AYYᵀ)
    s.t. diag(YYᵀ) = 1,

when r(r + 1)/2 ≤ d and p(p + 1)/2 ≥ d.

Figure: Iteration versus function value. The iterates start from a saddle point, are perturbed by noise, and converge to a local minimum (which is moreover proven to be global).

Figure: (a) Exponential map on a manifold; (b) Progress of two iterate sequences.

Future work

It is known that accelerated methods work in the saddle-escaping framework [4]; it is also of interest whether accelerated algorithms can be run on manifolds. Another recent trend is to consider optimization problems with equality and inequality constraints [5, 6]; these approaches require a solution or approximation oracle for problems that are NP-hard in general (including copositivity testing).

Bibliography

[1] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan, "How to escape saddle points efficiently," in Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017, pp. 1724–1732.
[2] R. Ge, F. Huang, C. Jin, and Y. Yuan, "Escaping from saddle points – online stochastic gradient for tensor decomposition," in Conference on Learning Theory, 2015, pp. 797–842.
[3] C. Criscitiello and N. Boumal, "Efficiently escaping saddle points on manifolds," arXiv preprint arXiv:1906.04321, 2019.
[4] C. Jin, P. Netrapalli, and M. I. Jordan, "Accelerated gradient descent escapes saddle points faster than gradient descent," arXiv preprint arXiv:1711.10456, 2017.
[5] A. Mokhtari, A. Ozdaglar, and A. Jadbabaie, "Escaping saddle points in constrained optimization," arXiv preprint arXiv:1809.02162, 2018.
[6] M. Nouiehed, J. D. Lee, and M. Razaviyayn, "Convergence to second-order stationarity for constrained non-convex optimization," arXiv preprint arXiv:1810.02024, 2018.