ORIE 6326: Convex Optimization
Subgradient Method

Professor Udell
Operations Research and Information Engineering
Cornell
April 25, 2017

Some slides adapted from Stanford EE364b

Subgradient

recall: the subgradient generalizes the gradient for nonsmooth functions.
for $f : \mathbf{R}^n \to \mathbf{R}$, we say $g \in \mathbf{R}^n$ is a subgradient of $f$ at $x \in \mathop{\bf dom} f$ if
$$f(y) \geq f(x) + g^T (y - x) \quad \forall y \in \mathop{\bf dom} f$$
this is just the first-order condition for convexity.

notation:
◮ $\partial f(x)$ denotes the set of all subgradients of $f$ at $x$
◮ $\tilde\nabla f(x)$ denotes a particular choice of an element of $\partial f(x)$

Outline

Subgradients and optimality
Subgradients and descent
Subgradient method
Analysis of SGM
Examples
Polyak's optimal step size

Subgradients and sublevel sets

$g$ is a subgradient at $x$ means $f(y) \geq f(x) + g^T (y - x)$,
hence $f(y) \leq f(x) \implies g^T (y - x) \leq 0$.

[figure: sublevel set $\{x \mid f(x) \leq f(x_0)\}$ with a subgradient $g \in \partial f(x_0)$ and the gradient $\nabla f(x_1)$]

◮ $f$ differentiable at $x_0$: $\nabla f(x_0)$ is normal to the sublevel set $\{x \mid f(x) \leq f(x_0)\}$
◮ $f$ nondifferentiable at $x_0$: a subgradient defines a supporting hyperplane to the sublevel set through $x_0$

Optimality conditions — unconstrained

recall for $f$ convex and differentiable,
$$f(x^\star) = \inf_x f(x) \iff 0 = \nabla f(x^\star)$$
generalization to nondifferentiable convex $f$:
$$f(x^\star) = \inf_x f(x) \iff 0 \in \partial f(x^\star)$$

[figure: $f(x)$ with $0 \in \partial f(x_0)$ at the minimizer $x_0$]

proof. by definition (!),
$$f(y) \geq f(x^\star) + 0^T (y - x^\star) \text{ for all } y \iff 0 \in \partial f(x^\star)$$
...seems trivial but isn't

Example: piecewise linear minimization

$f(x) = \max_{i=1,\dots,m} (a_i^T x + b_i)$

$x^\star$ minimizes $f$
$\iff 0 \in \partial f(x^\star) = \mathbf{conv}\{a_i \mid a_i^T x^\star + b_i = f(x^\star)\}$
$\iff$ there is a $\lambda$ with
$$\lambda \succeq 0, \quad \mathbf{1}^T \lambda = 1, \quad \sum_{i=1}^m \lambda_i a_i = 0$$
where $\lambda_i = 0$ if $a_i^T x^\star + b_i < f(x^\star)$

...but these are the KKT conditions for the epigraph form
$$\begin{array}{ll} \text{minimize} & t \\ \text{subject to} & a_i^T x + b_i \leq t, \quad i = 1, \dots, m \end{array}$$
with dual
$$\begin{array}{ll} \text{maximize} & b^T \lambda \\ \text{subject to} & \lambda \succeq 0, \quad A^T \lambda = 0, \quad \mathbf{1}^T \lambda = 1 \end{array}$$

Optimality conditions — constrained

$$\begin{array}{ll} \text{minimize} & f_0(x) \\ \text{subject to} & f_i(x) \leq 0, \quad i = 1, \dots, m \end{array}$$

we assume
◮ $f_i$ convex, defined on $\mathbf{R}^n$ (hence subdifferentiable)
◮ strict feasibility (Slater's condition)

$x^\star$ is primal optimal ($\lambda^\star$ is dual optimal) iff
$$f_i(x^\star) \leq 0, \qquad \lambda_i^\star \geq 0$$
$$0 \in \partial f_0(x^\star) + \sum_{i=1}^m \lambda_i^\star \partial f_i(x^\star)$$
$$\lambda_i^\star f_i(x^\star) = 0$$
...generalizes KKT for nondifferentiable $f_i$

Outline: Subgradients and descent

Directional derivative

the directional derivative of $f$ at $x$ in the direction $\delta x$ is
$$f'(x; \delta x) \triangleq \lim_{h \searrow 0} \frac{f(x + h\,\delta x) - f(x)}{h}$$
and can be $+\infty$ or $-\infty$

◮ $f$ convex, finite near $x$ $\implies$ $f'(x; \delta x)$ exists
◮ $f$ differentiable at $x$ iff, for some $g$ ($= \nabla f(x)$) and all $\delta x$, $f'(x; \delta x) = g^T \delta x$ (i.e., $f'(x; \delta x)$ is a linear function of $\delta x$)

Directional derivative and subdifferential

general formula for convex $f$:
$$f'(x; \delta x) = \sup_{g \in \partial f(x)} g^T \delta x$$

[figure: the set $\partial f(x)$ and a direction $\delta x$]

Descent directions

$\delta x$ is a descent direction for $f$ at $x$ if $f'(x; \delta x) < 0$

for differentiable $f$, $\delta x = -\nabla f(x)$ is always a descent direction (except when it is zero)

warning: for nondifferentiable (convex) functions, $-\tilde\nabla f(x)$ need not be a descent direction

[figure: level curves of $f$ in the $(x_1, x_2)$ plane with a subgradient $g$]

example: $f(x) = |x_1| + 2|x_2|$ (see the numerical check below)
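To make the warning concrete, here is a minimal numerical check (an illustration not in the slides; the point $x = (1, 0)$ and the subgradient choice $g = (1, 2)$ are my own): stepping along $-g$ increases $f(x) = |x_1| + 2|x_2|$, so this negative subgradient is not a descent direction.

```python
import numpy as np

# f(x) = |x1| + 2|x2|; at x = (1, 0) the subdifferential is {(1, s) : s in [-2, 2]},
# so g = (1, 2) is a valid subgradient even though x2 = 0.
f = lambda x: abs(x[0]) + 2 * abs(x[1])

x = np.array([1.0, 0.0])
g = np.array([1.0, 2.0])  # one element of the subdifferential at x

for t in [1e-1, 1e-2, 1e-3]:
    # f(x - t*g) - f(x) = 3*t > 0: moving along -g makes f larger
    print(t, f(x - t * g) - f(x))
```

This matches the directional-derivative formula: $f'(x; -g) = \sup_{s \in [-2, 2]} (-1 - 2s) = 3 > 0$.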
Subgradients and distance to sublevel sets

if $f$ is convex, $f(z) < f(x)$, and $g \in \partial f(x)$, then for small enough $t > 0$,
$$\|x - t g - z\|_2 < \|x - z\|_2$$

thus $-g$ is a descent direction for $\|x - z\|_2$, for any $z$ with $f(z) < f(x)$ (e.g., $x^\star$)

the negative subgradient is a descent direction for the distance to the optimal point

proof:
$$\|x - t g - z\|_2^2 = \|x - z\|_2^2 - 2 t\, g^T (x - z) + t^2 \|g\|_2^2 \leq \|x - z\|_2^2 - 2 t (f(x) - f(z)) + t^2 \|g\|_2^2$$
which is less than $\|x - z\|_2^2$ whenever $0 < t < 2 (f(x) - f(z)) / \|g\|_2^2$

Descent directions and optimality

fact: for $f$ convex, finite near $x$, either
◮ $0 \in \partial f(x)$ (in which case $x$ minimizes $f$), or
◮ there is a descent direction for $f$ at $x$

i.e., $x$ is optimal (minimizes $f$) iff there is no descent direction for $f$ at $x$

proof: define $\delta x_{\rm sd} = -\operatorname{argmin}_{z \in \partial f(x)} \|z\|_2$.
if $\delta x_{\rm sd} = 0$, then $0 \in \partial f(x)$, so $x$ is optimal; otherwise
$f'(x; \delta x_{\rm sd}) = -\inf_{z \in \partial f(x)} \|z\|_2^2 < 0$, so $\delta x_{\rm sd}$ is a descent direction

[figure: the set $\partial f(x)$ and the steepest descent direction $\delta x_{\rm sd}$]

Outline: Subgradient method

Subgradient method

the subgradient method is a simple algorithm to minimize a nondifferentiable convex function $f$:
$$x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}$$
◮ $x^{(k)}$ is the $k$th iterate
◮ $g^{(k)}$ is any subgradient of $f$ at $x^{(k)}$
◮ $\alpha_k > 0$ is the $k$th step size

warning: the subgradient method is not a descent method, so do not call it subgradient descent [Steven Wright].

instead, keep track of the best point so far:
$$f_{\rm best}^{(k)} = \min_{i=1,\dots,k} f(x^{(i)})$$

Line search + subgradients = danger

warning: with exact line search, the subgradient method can converge to a suboptimal point!

[figure: level curves of $f$ in the $(x_1, x_2)$ plane with a subgradient $g$]

example: $f(x) = |x_1| + 2|x_2|$
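A minimal, runnable sketch of the subgradient method update above, with best-point tracking (the function names, the oracle interface, and the test problem are illustrative assumptions, not from the slides):

```python
import numpy as np

def subgradient_method(f, subgrad, x0, step_size, iters=5000):
    """Iterate x^{k+1} = x^k - alpha_k * g^k, where g^k is any subgradient at x^k.

    Not a descent method, so we track the best value f_best seen so far.
    subgrad(x) must return some element of the subdifferential of f at x;
    step_size(k) returns alpha_k > 0 for k = 1, 2, ...
    """
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, iters + 1):
        x = x - step_size(k) * subgrad(x)
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# example: f(x) = |x1| + 2|x2| with the diminishing step size alpha_k = 1/sqrt(k)
f = lambda x: abs(x[0]) + 2 * abs(x[1])
subgrad = lambda x: np.array([np.sign(x[0]), 2 * np.sign(x[1])])  # one valid choice
x_best, f_best = subgradient_method(f, subgrad, x0=[1.0, 1.0],
                                    step_size=lambda k: 1 / np.sqrt(k))
print(f_best)  # approaches the optimal value 0 as iters grows
```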
Step size rules

step sizes are fixed ahead of time (we can't do a line search)

◮ constant step size: $\alpha_k = \alpha$ (constant)
◮ constant step length: $\alpha_k = \gamma / \|g^{(k)}\|_2$ (so $\|x^{(k+1)} - x^{(k)}\|_2 = \gamma$)
◮ square summable but not summable: step sizes satisfy
$$\sum_{k=1}^\infty \alpha_k^2 < \infty, \qquad \sum_{k=1}^\infty \alpha_k = \infty$$
e.g., $\alpha_k = 1/k$
◮ nonsummable diminishing: step sizes satisfy
$$\lim_{k \to \infty} \alpha_k = 0, \qquad \sum_{k=1}^\infty \alpha_k = \infty$$
e.g., $\alpha_k = 1/\sqrt{k}$

Outline: Analysis of SGM

Lipschitz functions

a function $f : \mathbf{R}^n \to \mathbf{R}$ is $L$-Lipschitz continuous if for every $x, y \in \mathop{\bf dom} f$,
$$|f(x) - f(y)| \leq L \|x - y\|$$
(often, we simply say $f$ is $L$-Lipschitz)

notice that for any $g \in \partial f(x)$, the subgradient inequality gives
$$f(x) - f(y) \leq g^T (x - y) \leq \|g\| \|x - y\|$$
so the Lipschitz constant for $f$ satisfies
$$L \leq \sup_{x \in \mathop{\bf dom} f} \; \sup_{g \in \partial f(x)} \|g\|$$

examples:
◮ $|x|$: $L = 1$
◮ $\max(x, 0)$: $L = 1$
◮ $\|x\|_1$: $L = \sqrt{n}$ (subgradients are sign vectors)
◮ $\tfrac{1}{2}\|x\|_2^2$ for $\|x\|_2 \leq R$: $L = R$
◮ $\lambda_{\max}(X)$: $L = 1$ (subgradients are eigenvectors)
◮ indicators of convex sets $C$: not Lipschitz continuous

Assumptions

◮ $f$ is bounded below: $p^\star = \inf_x f(x) > -\infty$
◮ $f$ attains its minimum at $x^\star$: $f(x^\star) = p^\star$
◮ $f$ is $L$-Lipschitz, so $\|g\|_2 \leq L$ for all $g \in \partial f$
◮ $\|x^{(0)} - x^\star\|_2 \leq R$

these assumptions are stronger than needed, just to simplify proofs

Convergence results

define
◮ the averaged iterate $\bar x^{(k)} = \frac{1}{k} \sum_{i=1}^k x^{(i)}$
◮ the averaged value $\bar f^{(k)} = f(\bar x^{(k)})$
◮ $\bar f = \lim_{k \to \infty} \bar f^{(k)}$
◮ $f_{\rm best} = \lim_{k \to \infty} f_{\rm best}^{(k)}$

we'll show $\bar f^{(k)}$ and $\bar f$ are close to $p^\star$.
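The step size rules listed above can be encoded as tiny schedule functions and plugged into the subgradient-method loop sketched earlier; this is an illustrative encoding, with the constants $\alpha = 10^{-2}$ and $\gamma = 0.1$ chosen arbitrarily rather than taken from the slides.

```python
import numpy as np

# step size rules alpha_k for k = 1, 2, ... (constants are illustrative)
constant_step   = lambda k: 1e-2              # alpha_k = alpha
square_summable = lambda k: 1.0 / k           # sum alpha_k^2 < inf, sum alpha_k = inf
diminishing     = lambda k: 1.0 / np.sqrt(k)  # alpha_k -> 0, sum alpha_k = inf

# constant step length alpha_k = gamma / ||g^(k)||_2 needs the current subgradient,
# so this schedule also takes g (a loop using it would pass g^(k) in each iteration)
constant_length = lambda k, g, gamma=0.1: gamma / np.linalg.norm(g)
```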