Some New Complexity Results for Composite Optimization
Guanghui (George) Lan, Georgia Institute of Technology
Symposium on Frontiers in Big Data, University of Illinois at Urbana-Champaign, September 23, 2016

Outline: Background | Gradient Sliding | Accelerated gradient sliding | Numerical experiments | Summary

Background: Data Analysis
Big-data era: In 2012, IBM reported that 2.5 quintillion (10^18) bytes of data are created every day.
- The Internet is a rich data source, e.g., 2.9 million emails sent every second, 20 hours of video uploaded to YouTube every minute.
- Better sensor technology.
- Widespread use of computer simulation.
Opportunities: transform raw data into useful knowledge to support decision-making, e.g., in healthcare, national security, energy, transportation, etc.

Optimization for data analysis: Machine Learning
Given a set of observed data S = {(u_i, v_i)}_{i=1}^m, drawn from an unknown distribution D on U × V, the goal is to describe the relation between the u_i and v_i for prediction.
Applications: predicting strokes and seizures, identifying heart failure, stopping credit card fraud, predicting machine failure, identifying spam, ...
Classic models:
- Lasso regression: min_x E[(⟨x, u⟩ − v)^2] + ρ‖x‖_1
- Support vector machine: min_x E_{u,v}[max{0, 1 − v⟨x, u⟩}] + ρ‖x‖_2^2
- Deep learning: min_x E_{u,v}[(F(u, x) − v)^2] + ρ‖Ux‖_1

Optimization for data analysis: Inverse Problems
Given external observations b of a hidden black-box system, recover the unknown parameters x of the system. The relation between b and x, e.g., Ax = b, is typically given; however, the system is underdetermined and b is noisy.
Applications: medical imaging, locating oil and mineral deposits, cracks and interfaces within materials.
Classic models:
- Total variation minimization: min_x ‖Ax − b‖^2 + λ TV(x)
- Compressed sensing: min_x ‖Ax − b‖^2 + λ‖x‖_1
- Matrix completion: min_x ‖Ax − b‖^2 + λ Σ_i σ_i(x)

Composite optimization problems
We consider composite problems of the form
    Ψ* = min_{x∈X} { Ψ(x) := f(x) + h(x) }.
Here f : X → R is a smooth and expensive term (data fitting), h : X → R is a nonsmooth regularization term (solution structure), and X is a closed convex set.
Much of my previous research: f given as an expectation or finite sum; f possibly nonconvex and stochastic. E.g., mirror descent stochastic approximation (Nemirovski, Juditsky, Lan and Shapiro 07), accelerated stochastic approximation (Lan 08), nonconvex stochastic gradient descent (Ghadimi and Lan 12).

Complexity for composite optimization
Problem: Ψ* := min_{x∈X} { Ψ(x) := f(x) + h(x) }.
Focus of this talk: h is not necessarily simple, allowing more solution structure, e.g., total variation, group sparsity, and graph-based regularization. Extension: X is not necessarily simple.
First-order methods: iterative methods that operate with the gradients (subgradients) of f and h.
Complexity: number of iterations to find an ε-solution, i.e., a point x̄ ∈ X s.t. Ψ(x̄) − Ψ* ≤ ε.
Easy case (h simple, X simple): P_{X,h}(y) := argmin_{x∈X} ‖y − x‖^2 + h(x) is easy to compute (e.g., compressed sensing). Complexity: O(1/√ε) (Nesterov 07).
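To make the easy case concrete: for the compressed-sensing model above, h(x) = λ‖x‖_1 and X = R^n, so P_{X,h} is the closed-form soft-thresholding operator and each proximal gradient step is cheap. Below is a minimal sketch, not part of the talk: it uses plain ISTA rather than the accelerated O(1/√ε) method cited above, and the data A, b and the value of λ are illustrative.

```python
import numpy as np

def soft_threshold(y, tau):
    """Prox of tau*||.||_1: argmin_x 0.5*||x - y||^2 + tau*||x||_1 (closed form)."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def ista(A, b, lam, n_iter=500):
    """Proximal gradient (ISTA) for min_x ||Ax - b||^2 + lam*||x||_1,
    the 'h simple, X simple' case: every iteration is one gradient of the
    smooth data-fitting term plus one cheap soft-thresholding prox step."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad f(x) = 2 A^T (Ax - b)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - b)           # gradient of the smooth term
        x = soft_threshold(x - grad / L, lam / L)  # prox step handles h in closed form
    return x

# illustrative data
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[:5] = 1.0
b = A @ x_true
x_hat = ista(A, b, lam=0.1)
```

Replacing the plain gradient step with Nesterov-type extrapolation gives the accelerated O(1/√ε) rate referenced above.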
More difficult cases
- h general, X simple: h is a general nonsmooth function; P_X(y) := argmin_{x∈X} ‖y − x‖^2 is easy to compute. Complexity: O(1/ε^2).
- h structured, X simple: h is structured, e.g., h(x) = max_{y∈Y} ⟨Ax, y⟩; P_X is easy to compute. Complexity: O(1/ε).
- h simple, X complicated: L_{X,h}(y) := argmin_{x∈X} ⟨y, x⟩ + h(x) is easy to compute (e.g., matrix completion; see the oracle sketch after the approach overview below). Complexity: O(1/ε).

Motivation
    Case                        Complexity    Iterations (e.g., ε = 10^-4)
    h simple, X simple          O(1/√ε)       10^2
    h general, X simple         O(1/ε^2)      10^8
    h structured, X simple      O(1/ε)        10^4
    h simple, X complicated     O(1/ε)        10^4
More general h or more complicated X ⇒ slow convergence of first-order algorithms ⇒ a large number of gradient evaluations of ∇f.
Question: Can we skip the computation of ∇f?

Our approach: gradient sliding algorithms
- Gradient sliding: h general, X simple (Lan).
- Accelerated gradient sliding: h structured, X simple (with Yuyuan Ouyang).
- Conditional gradient sliding: h simple, X complicated (with Yi Zhou).
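As a concrete instance of the "h simple, X complicated" row above (the setting targeted by conditional gradient sliding), consider matrix completion over a nuclear-norm ball X = {x : ‖x‖_* ≤ τ}: projecting onto X requires a full SVD, while the linear oracle L_{X,h} needs only the leading singular pair. The sketch below is an illustration, not material from the talk; it takes h ≡ 0 inside the oracle for simplicity, and τ, the matrix sizes, and the use of scipy.sparse.linalg.svds are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.linalg import svds

def lmo_nuclear_ball(G, tau):
    """Linear minimization oracle over X = {x : ||x||_* <= tau} (with h == 0):
    argmin_{x in X} <G, x> = -tau * u1 @ v1^T, where (u1, v1) is the top
    singular pair of G.  Only one leading singular pair is needed, whereas
    projection onto X would require a full SVD."""
    u, s, vt = svds(G, k=1)                      # top singular triplet of G
    return -tau * np.outer(u[:, 0], vt[0, :])

# illustrative usage on a random "gradient" matrix
G = np.random.default_rng(1).standard_normal((200, 300))
x_step = lmo_nuclear_ball(G, tau=5.0)
```

This cheap oracle is exactly what conditional-gradient-type methods exploit: one leading singular pair per iteration instead of a full SVD.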
Nonsmooth composite problems
    Ψ* = min_{x∈X} { Ψ(x) := f(x) + h(x) }
- f is smooth, i.e., there exists L > 0 s.t. ‖∇f(y) − ∇f(x)‖ ≤ L‖y − x‖ for all x, y ∈ X.
- h is nonsmooth, i.e., there exists M > 0 s.t. |h(x) − h(y)| ≤ M‖y − x‖ for all x, y ∈ X.
- P_X is simple to compute.
Question: How many gradient evaluations of ∇f and subgradient evaluations of h′ are needed to find an ε-solution?

Existing algorithms
The best-known complexity is given by accelerated stochastic approximation (Lan, 12):
    O( √(L/ε) + M^2/ε^2 ).
Issue: whenever the second term dominates, the number of gradient evaluations of ∇f is O(1/ε^2). The computation of ∇f, however, is often the bottleneck: it typically involves a large data set, while evaluating a subgradient of h only involves a very sparse matrix.
Can we reduce the number of gradient evaluations of ∇f from O(1/ε^2) to O(1/√ε), while still maintaining the optimal O(1/ε^2) bound on subgradient evaluations of h′?

Review of proximal gradient methods
The model function: Suppose h is relatively simple, e.g., h(x) = ‖x‖_1. For a given x ∈ X, let
    m_Ψ(x, u) := l_f(x, u) + h(u), ∀ u ∈ X, where l_f(x, y) := f(x) + ⟨∇f(x), y − x⟩.
Clearly, by the convexity of f,
    m_Ψ(x, u) ≤ Ψ(u) ≤ m_Ψ(x, u) + (L/2)‖u − x‖^2, ∀ u ∈ X.
Bregman distance: Let ω be a strongly convex function with modulus ν and define the Bregman distance V(x, u) := ω(u) − ω(x) − ⟨∇ω(x), u − x⟩. Then
    m_Ψ(x, u) ≤ Ψ(u) ≤ m_Ψ(x, u) + (L/ν) V(x, u), ∀ u ∈ X.

Review of proximal gradient descent
m_Ψ(x, u) = l_f(x, u) + h(u) is a good approximation of Ψ(u) when u is "close" enough to x.
Proximal gradient iterations:
    x_k = argmin_{u∈X} { l_f(x_{k−1}, u) + h(u) + β_k V(x_{k−1}, u) }.
Iteration complexity: O(1/ε).
Accelerated gradient iterations:
    x̲_k = (1 − γ_k) x̄_{k−1} + γ_k x_{k−1},
    x_k = argmin_{u∈X} { Φ_k(u) := l_f(x̲_k, u) + h(u) + β_k V(x_{k−1}, u) },
    x̄_k = (1 − γ_k) x̄_{k−1} + γ_k x_k.
Iteration complexity: O(1/√ε).

How about a general nonsmooth h?
Old approach: linearize h (Lan 08, 12). Iteration complexity:
    O( √(L V(x_0, x*)/ε) + M^2 V(x_0, x*)/ε^2 ).
New approach: gradient sliding. Key idea: keep h in the subproblem, and apply an iterative method to solve the subproblem.
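A minimal sketch of that key idea, under illustrative assumptions (grad_f, subgrad_h, and proj are user-supplied callables; the stepsizes, inner iteration counts, and plain averaging are simplified placeholders, not the parameter choices that yield Lan's O(1/√ε) gradient / O(1/ε^2) subgradient bounds):

```python
import numpy as np

def gradient_sliding_sketch(grad_f, subgrad_h, proj, x0, L, n_outer=50, beta=None):
    """Illustrative gradient-sliding loop for min_{x in X} f(x) + h(x):
    grad_f is evaluated once per OUTER iteration, while the nonsmooth term h
    is handled by cheap subgradient steps in an INNER loop on the subproblem
        <grad_f(z), u> + h(u) + (beta/2)*||u - x_prev||^2,
    which never touches grad_f."""
    x = x_bar = np.asarray(x0, dtype=float)
    beta = 2.0 * L if beta is None else beta
    for k in range(1, n_outer + 1):
        gamma = 2.0 / (k + 1)                   # accelerated averaging weight
        z = (1 - gamma) * x_bar + gamma * x     # point where f is linearized
        g = grad_f(z)                           # the only grad_f call this outer iteration
        x_prev = x
        T = 10 * k                              # inner iteration count grows with k
        u = x_prev.copy()
        u_avg = np.zeros_like(u)
        for t in range(1, T + 1):
            step = 1.0 / (beta * t)
            # subgradient of the subproblem objective at u
            d = g + subgrad_h(u) + beta * (u - x_prev)
            u = proj(u - step * d)
            u_avg += u
        x = u_avg / T                           # simple average of inner iterates
        x_bar = (1 - gamma) * x_bar + gamma * x
    return x_bar
```

The asymmetry the talk targets is visible in the oracle counts: grad_f is called n_outer times, while subgrad_h is called once per inner step, so all of the O(1/ε^2)-type work lands on the cheap nonsmooth oracle.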