ORIE 6326:

Subgradient Method

Professor Udell, Operations Research and Information Engineering, Cornell

April 25, 2017

Some slides adapted from Stanford EE364b

Subgradient

recall: the subgradient generalizes the gradient for nonsmooth functions

for f : Rⁿ → R, we say g ∈ Rⁿ is a subgradient of f at x ∈ dom(f) if

f(y) ≥ f(x) + gᵀ(y − x)   for all y ∈ dom(f)

this is just the first-order condition for convexity

notation:
◮ ∂f(x) denotes the set of all subgradients of f at x
◮ ∇̃f(x) denotes a particular choice of an element of ∂f(x)
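for concreteness, here is a minimal Python sketch of one such choice ∇̃f(x) for the example f(x) = ‖x‖₁ (the sign vector; any value in [−1, 1] would also do at zero coordinates):

```python
import numpy as np

def subgrad_l1(x):
    """One particular subgradient of f(x) = ||x||_1: the sign vector.
    At coordinates where x_i = 0, any value in [-1, 1] works; sign(0) = 0 is one valid choice."""
    return np.sign(x)
```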

Outline

Subgradients and optimality

Subgradients and descent

Subgradient method

Analysis of SGM

Examples

Polyak’s optimal step size

Subgradients and sublevel sets

g is a subgradient at x means f(y) ≥ f(x) + gᵀ(y − x)

hence f(y) ≤ f(x) ⟹ gᵀ(y − x) ≤ 0

[figure: sublevel set {x | f(x) ≤ f(x0)} with a subgradient g ∈ ∂f(x0) at x0 and the gradient ∇f(x1) at a point x1 where f is differentiable]

◮ f differentiable at x0: ∇f(x0) is normal to the sublevel set {x | f(x) ≤ f(x0)}
◮ f nondifferentiable at x0: a subgradient defines a supporting hyperplane to the sublevel set through x0


Optimality conditions — unconstrained

recall for f convex, differentiable,

f(x⋆) = inf_x f(x)  ⟺  0 = ∇f(x⋆)

generalization to nondifferentiable convex f:

f(x⋆) = inf_x f(x)  ⟺  0 ∈ ∂f(x⋆)

proof. by definition (!),

0 ∈ ∂f(x⋆)  ⟺  f(y) ≥ f(x⋆) + 0ᵀ(y − x⋆) for all y

...seems trivial but isn't

Example: piecewise linear minimization

f(x) = max_{i=1,...,m} (a_iᵀx + b_i)

x⋆ minimizes f  ⟺  0 ∈ ∂f(x⋆) = conv{a_i | a_iᵀx⋆ + b_i = f(x⋆)}

⟺ there is a λ with

λ ⪰ 0,   1ᵀλ = 1,   ∑_{i=1}^m λ_i a_i = 0

where λ_i = 0 if a_iᵀx⋆ + b_i < f(x⋆)

...but these are the KKT conditions for the epigraph form

minimize    t
subject to  a_iᵀx + b_i ≤ t,  i = 1, ..., m

with dual

maximize    bᵀλ
subject to  λ ⪰ 0,  Aᵀλ = 0,  1ᵀλ = 1

Optimality conditions — constrained

minimize    f_0(x)
subject to  f_i(x) ≤ 0,  i = 1, ..., m

we assume
◮ f_i convex, defined on Rⁿ (hence subdifferentiable)
◮ strict feasibility (Slater's condition)

x⋆ is primal optimal (λ⋆ is dual optimal) iff

f_i(x⋆) ≤ 0,   λ_i⋆ ≥ 0
0 ∈ ∂f_0(x⋆) + ∑_{i=1}^m λ_i⋆ ∂f_i(x⋆)
λ_i⋆ f_i(x⋆) = 0

...generalizes KKT for nondifferentiable f_i

Subgradients and descent

Directional derivative

directional derivative of f at x in the direction δx is

f′(x; δx) ≜ lim_{h ↘ 0} (f(x + hδx) − f(x)) / h

(can be +∞ or −∞)

◮ f convex, finite near x ⟹ f′(x; δx) exists

◮ f differentiable at x iff, for some g (= ∇f(x)) and all δx,

f′(x; δx) = gᵀδx

(i.e., f′(x; δx) is a linear function of δx)

Directional derivative and subdifferential

general formula for convex f:   f′(x; δx) = sup{ gᵀδx | g ∈ ∂f(x) }

[figure: the subdifferential ∂f(x) and a direction δx]


Descent directions

δx is a descent direction for f at x if f′(x; δx) < 0

for differentiable f, δx = −∇f(x) is always a descent direction (except when it is zero)

warning: for nondifferentiable (convex) functions, −∇̃f(x) need not be a descent direction

example: f(x) = |x1| + 2|x2|

[figure: level curves of f in the (x1, x2) plane and a subgradient g]


Subgradients and distance to sublevel sets

if f is convex, f(z) < f(x), g ∈ ∂f(x), then for small enough t > 0,

‖x − tg − z‖₂ < ‖x − z‖₂

thus −g is a descent direction for ‖x − z‖₂, for any z with f(z) < f(x) (e.g., x⋆)

negative subgradient is descent direction for distance to optimal point

proof:

‖x − tg − z‖₂² = ‖x − z‖₂² − 2t gᵀ(x − z) + t²‖g‖₂²
              ≤ ‖x − z‖₂² − 2t (f(x) − f(z)) + t²‖g‖₂²

which is less than ‖x − z‖₂² whenever 0 < t < 2(f(x) − f(z))/‖g‖₂²

Descent directions and optimality

fact: for f convex, finite near x, either
◮ 0 ∈ ∂f(x) (in which case x minimizes f), or
◮ there is a descent direction for f at x

i.e., x is optimal (minimizes f) iff there's no descent direction for f at x

proof: define δx_sd = −argmin_{z ∈ ∂f(x)} ‖z‖₂

if δx_sd = 0, then 0 ∈ ∂f(x), so x is optimal; otherwise f′(x; δx_sd) = −inf_{z ∈ ∂f(x)} ‖z‖₂² < 0, so δx_sd is a descent direction

[figure: the subdifferential ∂f(x) with δx_sd the negative of its minimum-norm element]


Subgradient method

the subgradient method is a simple algorithm to minimize a nondifferentiable convex function f

x^(k+1) = x^(k) − α_k g^(k)

◮ x^(k) is the kth iterate
◮ g^(k) is any subgradient of f at x^(k)
◮ α_k > 0 is the kth step size

warning: the subgradient method is not a descent method, so do not call it subgradient descent [Steven Wright].

instead, keep track of the best point so far

f_best^(k) = min_{i=1,...,k} f(x^(i))
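a minimal Python sketch of the update with best-point tracking; the objective `f`, subgradient oracle `subgrad`, starting point `x0`, and step-size schedule `alphas(k)` are placeholders supplied by the user:

```python
import numpy as np

def subgradient_method(f, subgrad, x0, alphas, iters=1000):
    """Run the subgradient method, tracking the best point seen so far."""
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(iters):
        g = subgrad(x)                 # any g in the subdifferential at x
        x = x - alphas(k) * g          # x^(k+1) = x^(k) - alpha_k * g^(k)
        fx = f(x)
        if fx < f_best:                # not a descent method: keep the best iterate
            f_best, x_best = fx, x.copy()
    return x_best, f_best
```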


Line search + subgradients = danger

warning: with exact line search, the subgradient method can converge to a suboptimal point!

example: f(x) = |x1| + 2|x2|

[figure: level curves of f in the (x1, x2) plane and a subgradient g]

Step size rules

step sizes are fixed ahead of time (can't do line search)

◮ constant step size: α_k = α (constant)
◮ constant step length: α_k = γ/‖g^(k)‖₂ (so ‖x^(k+1) − x^(k)‖₂ = γ)
◮ square summable but not summable: step sizes satisfy

  ∑_{k=1}^∞ α_k² < ∞,   ∑_{k=1}^∞ α_k = ∞

  e.g., α_k = 1/k
◮ nonsummable diminishing: step sizes satisfy

  lim_{k→∞} α_k = 0,   ∑_{k=1}^∞ α_k = ∞

  e.g., α_k = 1/√k
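for concreteness, the fixed schedules can be passed to the `subgradient_method` sketch above as functions of the iteration counter k; the constants here are illustrative placeholders:

```python
import numpy as np

alpha, gamma = 0.1, 0.05   # illustrative constants; problem-dependent in practice

constant_step   = lambda k: alpha                  # alpha_k = alpha
square_summable = lambda k: 1.0 / (k + 1)          # alpha_k = 1/k (square summable, not summable)
diminishing     = lambda k: 1.0 / np.sqrt(k + 1)   # alpha_k = 1/sqrt(k) (nonsummable diminishing)

# constant step length alpha_k = gamma / ||g^(k)||_2 depends on the current
# subgradient, so it would be computed inside the loop rather than from k alone.
```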

Analysis of SGM


Lipschitz functions

a function f : Rⁿ → R is L-Lipschitz continuous if for every x, y ∈ dom f,

|f(x) − f(y)| ≤ L ‖x − y‖

(often, we simply say f is L-Lipschitz)

notice for any g ∈ ∂f(x),

f(x) − f(y) ≤ gᵀ(x − y) ≤ ‖g‖ ‖x − y‖

so the Lipschitz constant for f satisfies

L ≤ sup_{x ∈ dom f} sup_{g ∈ ∂f(x)} ‖g‖


Lipschitz functions

examples:
◮ |x|: L = 1
◮ max(x, 0): L = 1
◮ ‖x‖₁: L = √n (subgradients are sign vectors)
◮ ½‖x‖₂² for ‖x‖₂ ≤ R: L = R
◮ λ_max(X): L = 1 (subgradients are outer products qqᵀ of unit eigenvectors q)
◮ indicator of a convex set C: not Lipschitz continuous

Assumptions

◮ f is bounded below: p⋆ = inf_x f(x) > −∞
◮ f attains its minimum at x⋆: f(x⋆) = p⋆
◮ f is L-Lipschitz, so ‖g‖₂ ≤ L for all g ∈ ∂f(x), for all x
◮ ‖x^(1) − x⋆‖₂ ≤ R

these assumptions are stronger than needed, just to simplify proofs

Convergence results

define
◮ averaged iterate x̄^(k) = (1/k) ∑_{i=1}^k x^(i)
◮ averaged value f̄^(k) = f(x̄^(k))
◮ f̄ = lim_{k→∞} f̄^(k)
◮ f_best = lim_{k→∞} f_best^(k)

we'll show f̄^(k) → f̄, which is close to p⋆.
◮ notice f_best^(k) ≤ f̄^(k)
◮ advantage of averaging: no need to evaluate f!

◮ constant step size: f̄ − p⋆ ≤ L²α/2, i.e., converges to L²α/2-suboptimal (converges to p⋆ if f differentiable, α small enough)

◮ constant step length: f_best − p⋆ ≤ Lγ/2, i.e., converges to Lγ/2-suboptimal

◮ diminishing step size rule: f_best = p⋆, i.e., converges

Subgradient method decreases distance to optimal set

key quantity: Euclidean distance to the optimal set, not the function value

let x⋆ be any minimizer of f

‖x^(k+1) − x⋆‖₂² = ‖x^(k) − α_k g^(k) − x⋆‖₂²
                = ‖x^(k) − x⋆‖₂² − 2α_k g^(k)ᵀ(x^(k) − x⋆) + α_k² ‖g^(k)‖₂²
                ≤ ‖x^(k) − x⋆‖₂² − 2α_k (f(x^(k)) − p⋆) + α_k² ‖g^(k)‖₂²

using p⋆ = f(x⋆) ≥ f(x^(k)) + g^(k)ᵀ(x⋆ − x^(k))

apply recursively to get

‖x^(k+1) − x⋆‖₂² ≤ ‖x^(1) − x⋆‖₂² − 2 ∑_{i=1}^k α_i (f(x^(i)) − p⋆) + ∑_{i=1}^k α_i² ‖g^(i)‖₂²
                ≤ R² − 2 ∑_{i=1}^k α_i (f(x^(i)) − p⋆) + L² ∑_{i=1}^k α_i²

Subgradient method for constant step size

for constant step size α_k = α, we can use

∑_{i=1}^k α (f(x^(i)) − p⋆) = α ∑_{i=1}^k f(x^(i)) − αkp⋆
                            ≥ αk f(x̄^(k)) − αkp⋆ = αk (f̄^(k) − p⋆)

(where the inequality is Jensen's) to get

f̄^(k) − p⋆ ≤ R²/(2kα) + L²α/2.

◮ righthand side converges to L²α/2 as k → ∞
◮ for fixed k, choose α = R/(L√k) to minimize the bound:

  f̄^(k) − p⋆ ≤ RL/(2√k) + RL/(2√k) = RL/√k.
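to see why this choice of α minimizes the bound, set the derivative with respect to α to zero:

d/dα [ R²/(2kα) + L²α/2 ] = −R²/(2kα²) + L²/2 = 0  ⟹  α = R/(L√k),

and substituting back gives R²/(2kα) = L²α/2 = RL/(2√k), i.e., the bound RL/√k.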

Subgradient method for varying step size

by recursive application of the subgradient inequality, we had

‖x^(k+1) − x⋆‖₂² ≤ ‖x^(1) − x⋆‖₂² − 2 ∑_{i=1}^k α_i (f(x^(i)) − p⋆) + ∑_{i=1}^k α_i² ‖g^(i)‖₂²
                ≤ R² − 2 ∑_{i=1}^k α_i (f(x^(i)) − p⋆) + L² ∑_{i=1}^k α_i²

so for changing step size, use

∑_{i=1}^k α_i (f(x^(i)) − p⋆) ≥ (f_best^(k) − p⋆) (∑_{i=1}^k α_i)

to get

f_best^(k) − p⋆ ≤ (R² + L² ∑_{i=1}^k α_i²) / (2 ∑_{i=1}^k α_i)

Subgradient method for varying step size

constant step length: for α_k = γ/‖g^(k)‖₂ we get

f_best^(k) − p⋆ ≤ (R² + ∑_{i=1}^k α_i² ‖g^(i)‖₂²) / (2 ∑_{i=1}^k α_i) ≤ (R² + γ²k) / (2γk/L),

righthand side converges to Lγ/2 as k → ∞

square summable but not summable step sizes: suppose step sizes satisfy

∑_{k=1}^∞ α_k² < ∞,   ∑_{k=1}^∞ α_k = ∞

then

f_best^(k) − p⋆ ≤ (R² + L² ∑_{i=1}^k α_i²) / (2 ∑_{i=1}^k α_i)

as k → ∞, the numerator converges to a finite number and the denominator converges to ∞, so f_best^(k) → p⋆

Stopping criterion

◮ terminating when (R² + L² ∑_{i=1}^k α_i²) / (2 ∑_{i=1}^k α_i) ≤ ε is really, really, slow

◮ we saw that if α_i = α = R/(L√k), then after k steps the bound is RL/√k, so for accuracy ε we need

  k ≥ R²L²/ε²

  iterations

◮ the truth: there’s no good stopping criterion for the subgradient method ...

Examples

Example: Piecewise linear minimization

minimize f(x) = max_{i=1,...,m} (a_iᵀx + b_i)

to find a subgradient of f: find an index j for which

a_jᵀx + b_j = max_{i=1,...,m} (a_iᵀx + b_i)

and take g = a_j

subgradient method: x^(k+1) = x^(k) − α_k a_j
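a minimal Python sketch of this subgradient oracle, with the data stacked as a matrix A (rows a_iᵀ) and a vector b; the names are illustrative:

```python
import numpy as np

def pwl_value_and_subgrad(A, b, x):
    """f(x) = max_i (a_i^T x + b_i); a subgradient is the row a_j achieving the max."""
    vals = A @ x + b
    j = int(np.argmax(vals))     # an active index (ties: any choice works)
    return vals[j], A[j]         # f(x) and g = a_j
```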

problem instance with n = 20 variables, m = 100 terms, p⋆ ≈ 1.1

f_best^(k) − p⋆, constant step length γ = 0.05, 0.01, 0.005

[figure: f_best^(k) − p⋆ vs. iteration k (log scale, k = 1 to 3000) for the three step lengths]

diminishing step rules α_k = 0.1/√k and α_k = 1/√k, square summable step size rules α_k = 1/k and α_k = 10/k

[figure: f_best^(k) − p⋆ vs. iteration k (log scale, k = 0 to 3000) for the four step size rules]

How to avoid slow convergence

don't use the subgradient method when you want very high accuracy! instead,
◮ for highest accuracy: rewrite problem as LP or SDP; use IPM
◮ for medium accuracy:
  ◮ regularize your objective (so it's strongly convex)

    f̃(x) = f(x) + α‖x − x_0‖²

  ◮ smooth your objective (so it's smooth)

    f̃(x) = E_{y: ‖y − x‖ ≤ δ} [f(y)]

  ◮ more on these later...
◮ for low accuracy: use a constant step size; terminate when you stop improving much or get bored

Polyak's optimal step size

Optimal step size when p⋆ is known

◮ choice due to Polyak:

  α_k = (f(x^(k)) − p⋆) / ‖g^(k)‖₂²

  (can also use when optimal value is estimated)

◮ motivation: start with basic inequality

  ‖x^(k+1) − x⋆‖₂² ≤ ‖x^(k) − x⋆‖₂² − 2α_k (f(x^(k)) − p⋆) + α_k² ‖g^(k)‖₂²

  and choose α_k to minimize the righthand side
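minimizing the righthand side over α_k is a one-variable quadratic problem:

d/dα_k [ −2α_k (f(x^(k)) − p⋆) + α_k² ‖g^(k)‖₂² ] = 0  ⟹  α_k = (f(x^(k)) − p⋆) / ‖g^(k)‖₂²,

and substituting this α_k back in gives ‖x^(k+1) − x⋆‖₂² ≤ ‖x^(k) − x⋆‖₂² − (f(x^(k)) − p⋆)² / ‖g^(k)‖₂², the inequality used below.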

◮ yields

  ‖x^(k+1) − x⋆‖₂² ≤ ‖x^(k) − x⋆‖₂² − (f(x^(k)) − p⋆)² / ‖g^(k)‖₂²

  (in particular, ‖x^(k) − x⋆‖₂ decreases each step)

◮ applying recursively,

  ∑_{i=1}^k (f(x^(i)) − p⋆)² / ‖g^(i)‖₂² ≤ R²

  and so, using Jensen twice,

  ((1/k) ∑_{i=1}^k (f(x^(i)) − p⋆))² ≤ (1/k) ∑_{i=1}^k (f(x^(i)) − p⋆)² ≤ R²L²/k
  (1/k) ∑_{i=1}^k (f(x^(i)) − p⋆) ≤ RL/√k
  f(x̄^(k)) − p⋆ ≤ RL/√k

PWL example with Polyak's step size, α_k = 0.1/√k, α_k = 1/k

[figure: f_best^(k) − p⋆ vs. iteration k (log scale, k = 0 to 3000) for the Polyak step size, α_k = 0.1/√k, and α_k = 1/k]

Finding a point in the intersection of convex sets

C = C_1 ∩ ··· ∩ C_m is nonempty, C_1, ..., C_m ⊆ Rⁿ closed and convex

find a point in C by minimizing

f(x) = max{ dist(x, C_1), ..., dist(x, C_m) }

with dist(x, C_j) = f(x), a subgradient of f is

g = ∇ dist(x, C_j) = (x − P_{C_j}(x)) / ‖x − P_{C_j}(x)‖₂

subgradient update with optimal step size:

x^(k+1) = x^(k) − α_k g^(k)
        = x^(k) − f(x^(k)) (x^(k) − P_{C_j}(x^(k))) / ‖x^(k) − P_{C_j}(x^(k))‖₂
        = P_{C_j}(x^(k))

◮ a version of the famous alternating projections algorithm
◮ at each step, project the current point onto the farthest set
◮ for m = 2 sets, projections alternate onto one set, then the other
◮ convergence: dist(x^(k), C) → 0 as k → ∞
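a minimal Python sketch of this scheme, assuming each set is given by a projection operator supplied as a callable in the list `projections`:

```python
import numpy as np

def alternating_projections(projections, x0, iters=100):
    """Find a point in the intersection by repeatedly projecting onto the farthest set."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        # distance to each set, computed via its projection
        dists = [np.linalg.norm(x - P(x)) for P in projections]
        j = int(np.argmax(dists))     # farthest set
        if dists[j] == 0:             # already in every set
            break
        x = projections[j](x)         # subgradient step with Polyak step size = projection
    return x
```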

Alternating projections

first few iterations:

[figure: iterates x^(1), x^(2), x^(3), x^(4) alternating between the sets C_1 and C_2 and approaching x⋆]

... x^(k) eventually converges to a point x⋆ ∈ C_1 ∩ C_2

Example: Positive semidefinite matrix completion

◮ some entries of a matrix in Sⁿ are fixed
◮ find values for the others so the completed matrix is PSD
◮ C_1 = Sⁿ₊, C_2 is the (affine) set in Sⁿ with the specified fixed entries
◮ projection onto C_1 by eigenvalue decomposition and truncation: for X = ∑_{i=1}^n λ_i q_i q_iᵀ,

  P_{C_1}(X) = ∑_{i=1}^n max{0, λ_i} q_i q_iᵀ

◮ projection of X onto C_2 by setting the specified entries to their fixed values
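minimal sketches of the two projections in Python; `mask` (a boolean array marking the fixed entries) and `fixed_values` are illustrative names, not from the original deck:

```python
import numpy as np

def project_psd(X):
    """Project a symmetric matrix onto the PSD cone by truncating negative eigenvalues."""
    lam, Q = np.linalg.eigh(X)               # X = sum_i lam_i q_i q_i^T
    return (Q * np.maximum(lam, 0)) @ Q.T    # sum_i max(0, lam_i) q_i q_i^T

def project_affine(X, mask, fixed_values):
    """Reset the known entries (where mask is True) to their fixed values."""
    Y = X.copy()
    Y[mask] = fixed_values[mask]
    return Y
```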

specific example: 50 × 50 matrix missing about half of its entries

◮ initialize X^(1) with unknown entries set to 0

convergence is linear:

[figure: ‖X^(k+1) − X^(k)‖_F vs. iteration k (log scale, k = 0 to 100), decreasing linearly on the log scale]

Polyak step size when p⋆ isn't known

◮ use step size

  α_k = (f(x^(k)) − f_best^(k) + γ_k) / ‖g^(k)‖₂²

  with ∑_{k=1}^∞ γ_k = ∞, ∑_{k=1}^∞ γ_k² < ∞

◮ f_best^(k) − γ_k serves as an estimate of p⋆
◮ γ_k is in the scale of the objective value

◮ can show f_best^(k) → p⋆
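a minimal Python sketch of this step size, using γ_k = 10/(10 + k) (the sequence used in the example that follows); `f_x`, `f_best`, and `g` are the current value, best value so far, and current subgradient:

```python
import numpy as np

def polyak_estimated_step(f_x, f_best, g, k):
    """Polyak-style step size using f_best - gamma_k as a surrogate for p*."""
    gamma_k = 10.0 / (10.0 + k)          # nonsummable, square-summable sequence
    return (f_x - f_best + gamma_k) / np.linalg.norm(g) ** 2
```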

PWL example with Polyak's step size, using p⋆, and estimated with γ_k = 10/(10 + k)

[figure: f_best^(k) − p⋆ vs. iteration k (log scale, k = 0 to 3000) for the optimal and estimated Polyak step sizes]
