
Optimization Techniques for ML (2)

Piyush Rai

Introduction to Machine Learning (CS771A)

August 28, 2018

Recap: Convex and Non-Convex Functions

Most ML problems boil down to minimization of convex/non-convex functions, e.g.,

  ŵ = arg min_w L(w) = arg min_w (1/N) Σ_{n=1}^N ℓ_n(w) + R(w)

Convex functions have a unique minimum.

[Figure: a convex loss curve with a single optimum]

Non-convex functions have several local minima.

[Figure: a non-convex loss curve with several "local" optima]

Recap: Convex Functions

A function is convex if all of its chords lie above the function.

  Note: "Chord lies above function" more formally means Jensen's Inequality: if f is convex then, given any x_1, x_2 and λ ∈ [0, 1], f(λx_1 + (1 − λ)x_2) ≤ λf(x_1) + (1 − λ)f(x_2).

[Figure: a convex function vs. a non-convex function, with chords drawn above/crossing the curve]

A function is convex if its graph lies above all of its tangents (above its first-order Taylor expansion).

A function is convex if its second derivative (Hessian) is positive semi-definite.

Note: If f is convex then −f is a concave function.
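
These characterizations are easy to sanity-check numerically. Below is a small sketch (not from the slides; the two test functions are arbitrary choices) that tests the chord condition along one segment:

import numpy as np

def chord_above(f, x, y, num=101):
    """Check f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y) on a grid of lam in [0, 1]."""
    lam = np.linspace(0.0, 1.0, num)
    along_function = f(lam * x + (1 - lam) * y)
    along_chord = lam * f(x) + (1 - lam) * f(y)
    return bool(np.all(along_function <= along_chord + 1e-12))

convex_f = lambda w: w ** 2                        # convex: every chord lies above it
wiggly_f = lambda w: np.sin(3 * w) + w ** 2 / 10   # non-convex: has several local minima

print(chord_above(convex_f, -2.0, 3.0))   # True
print(chord_above(wiggly_f, -2.0, 3.0))   # False: the function rises above this chord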

Recap: Gradient Descent

A very simple, first-order method for optimizing any differentiable function (convex/non-convex).

Uses only the gradient g = ∇L(w) of the function.

Basic idea: Start at some location w^(0) and move in the opposite direction of the gradient (move in the negative direction where the gradient is positive, and in the positive direction where it is negative).

[Figure: gradient descent steps on a loss curve, each step moving opposite to the sign of the gradient]

[Algorithm box: Gradient Descent: initialize w^(0), then repeat w^(t+1) = w^(t) − η_t g^(t) until convergence]
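
As a concrete illustration, here is a minimal gradient descent loop in Python; this is a sketch rather than the course's reference code, and the quadratic objective, step size, and function names are assumptions chosen for the example.

import numpy as np

def gradient_descent(grad, w0, eta=0.1, num_iters=100):
    """Minimize a differentiable function, using only its gradient g = grad(w)."""
    w = np.asarray(w0, dtype=float)
    for t in range(num_iters):
        g = grad(w)        # g = gradient of L at the current w
        w = w - eta * g    # move in the opposite direction of the gradient
    return w

# Example: L(w) = ||w - a||^2 has gradient 2(w - a) and its minimum at a
a = np.array([1.0, -2.0])
print(gradient_descent(lambda w: 2 * (w - a), w0=np.zeros(2)))  # close to [1, -2]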

Recap: Gradient Descent

The learning rate η_t is important:

  Very small learning rates may result in very slow convergence (may take too long to converge, or may not be able to "cross" towards the good side).

  Very large learning rates may lead to oscillatory behavior (may keep oscillating) or result in a bad local optimum (a VERY VERY large rate can even jump into a bad region).

[Figure: GD trajectories under very small vs. very large learning rates]

Many ways to set the learning rate, e.g.,

  Constant (if properly set, can still show good convergence behavior)

  Decreasing with t (e.g., 1/t, 1/√t, etc.)

  Adaptive learning rates (e.g., using methods such as Adagrad, Adam)
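
The first two families of schedules are one-liners; a small sketch (hypothetical values, not from the slides) that could be plugged into a GD loop as eta(t):

import math

eta0 = 0.5                                       # assumed base learning rate
constant   = lambda t: eta0                      # fixed rate
inv_t      = lambda t: eta0 / (t + 1)            # decreasing like 1/t
inv_sqrt_t = lambda t: eta0 / math.sqrt(t + 1)   # decreasing like 1/sqrt(t)

for t in [0, 9, 99]:
    print(t, constant(t), inv_t(t), inv_sqrt_t(t))

Adaptive methods such as Adagrad and Adam instead maintain per-coordinate statistics of past gradients; implementations are available in most ML libraries.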

Recap: Stochastic Gradient Descent

Gradient computation in standard GD may be expensive when N is large:

  g = ∇_w [ (1/N) Σ_{n=1}^N ℓ_n(w) ] = (1/N) Σ_{n=1}^N g_n   (ignoring the regularizer R(w))

Stochastic Gradient Descent (SGD) approximates g using a single data point.

In iteration t, SGD picks a uniformly random i ∈ {1, ..., N} and approximates g as

  g ≈ g_i = ∇_w ℓ_i(w)

[Algorithm box: Stochastic Gradient Descent]
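
A minimal SGD sketch under the same conventions (the least-squares example, data sizes, and step size are assumptions for illustration):

import numpy as np

def sgd(grad_i, N, w0, eta=0.01, num_iters=1000, seed=0):
    """SGD: replace the full gradient with the gradient of one random example."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for t in range(num_iters):
        i = rng.integers(N)          # uniformly random i in {0, ..., N-1}
        w = w - eta * grad_i(w, i)   # g is approximated by g_i, the i-th loss gradient
    return w

# Example: least squares, where the gradient of (y_i - w^T x_i)^2 is -2 (y_i - w^T x_i) x_i
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true
print(sgd(lambda w, i: -2 * (y[i] - X[i] @ w) * X[i], N=500, w0=np.zeros(3)))  # approaches w_true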

Recap: Mini-batch SGD

In each iteration, SGD uses a single randomly chosen i ∈ {1, ..., N} to approximate g (the full gradient) by g_i (a stochastic gradient). This results in a large variance in g_i.

We can instead use B > 1 uniformly randomly chosen points with indices i_1, ..., i_B ∈ {1, ..., N}.

This is the idea behind mini-batch SGD. The approximated gradient in this case would be

  g ≈ (1/B) Σ_{b=1}^B g_{i_b}

The basic intuition: Averaging helps in variance reduction!

The algorithm is the same as SGD, except that we now use these mini-batch gradients at each step.
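
A sketch of the mini-batch gradient estimate (sampling with replacement for simplicity; B and the names are illustrative), which can be dropped into the SGD loop above:

import numpy as np

def minibatch_grad(grad_i, w, N, B, rng):
    """Average B single-example gradients; averaging reduces the variance."""
    idx = rng.integers(N, size=B)    # indices i_1, ..., i_B
    return np.mean([grad_i(w, i) for i in idx], axis=0)

# Usage inside an SGD-style loop:
#   g_hat = minibatch_grad(grad_i, w, N=500, B=32, rng=np.random.default_rng(0))
#   w = w - eta * g_hat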

Plan for today

Optimization of functions that are NOT differentiable

Optimization with constraints on the variables

Optimizing w.r.t. several variables, one at a time: co-ordinate descent, alternating optimization

Second-order methods for optimization

Optimizing Non-differentiable Functions

Many ML problems require minimizing non-differentiable functions. Some common examples:

  Absolute and ε-insensitive losses in regression, and several classification loss functions (we will see shortly)

  [Figure: plots of the Absolute loss, ε-insensitive loss, "Perceptron" loss, and Hinge loss]

  Regression/classification loss functions with ℓ1 or ℓp (p < 1) regularization (the data-fit term is differentiable, the regularizer is NOT)

Can't apply standard GD or SGD, since the gradient isn't defined at the points of non-differentiability.

Interlude: Loss Functions for Classification

In regression (assuming a linear model ŷ = w^T x), some common loss functions are

  ℓ(y, ŷ) = (y − w^T x)^2   or   ℓ(y, ŷ) = |y − w^T x|

We typically look at the difference between the true y and the model's prediction w^T x.

How to formally define loss functions for classification?

We have already looked at the loss function for logistic regression (assuming y ∈ {−1, +1}):

  ℓ(y, ŷ) = log(1 + exp(−y w^T x))

Why does the above make sense? Well, it is large for large misclassifications, small otherwise.

[Figure: the logistic loss as a function of y w^T x]

Are there other loss functions for classification? Yes, several.

Interlude: Some Loss Functions for (Binary) Classification

[Figure: plots of four losses as functions of y w^T x]

0-1 Loss: Non-convex + Non-differentiable + NP-Hard to Optimize

"Perceptron" Loss: Convex + Non-differentiable

Log(istic) Loss: Convex + Differentiable

Hinge Loss: Convex + Non-differentiable

Optimizing Non-differentiable Functions

Even though gradients are not defined for non-differentiable functions, we can work with subgradients.

For a function f(x), a subgradient at x is any vector g s.t. f(y) ≥ f(x) + g^T(y − x), ∀y.

[Figure: a unique tangent where the function is differentiable; several valid subgradient lines where it is not]

A non-differentiable function can have several subgradients at a point of non-differentiability.

The set of all subgradients of a function f at a point x is called the subdifferential, denoted ∂f(x):

  ∂f(x) = {g : f(y) ≥ f(x) + g^T(y − x), ∀y}

Subgradient Descent: An Example

Consider linear regression but with the ℓ1 norm on w (recall: the ℓ1 norm promotes a sparse w):

  ŵ = arg min_w Σ_{n=1}^N (y_n − w^T x_n)^2 + λ||w||_1

The squared error term is differentiable, but the norm ||w||_1 is NOT at w_d = 0.

We can use subgradients of ||w||_1 in this case:

  g = −2 Σ_{n=1}^N (y_n − w^T x_n) x_n + λt

Here t is a vector s.t.

  t_d = −1 for w_d < 0;  t_d ∈ [−1, +1] for w_d = 0;  t_d = +1 for w_d > 0

[Figure: the absolute value function, with its set of subgradients at 0]

If we take t_d = 0 at w_d = 0, then t_d = sign(w_d).
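
Putting this together, a subgradient descent sketch for the above objective (the step size and iteration count are assumptions and may need tuning to the data scale; t_d = sign(w_d) at zero, as suggested above):

import numpy as np

def lasso_subgradient_descent(X, y, lam, eta=1e-4, num_iters=5000):
    """Subgradient descent for sum_n (y_n - w^T x_n)^2 + lam * ||w||_1."""
    w = np.zeros(X.shape[1])
    for t in range(num_iters):
        grad_sq = -2 * X.T @ (y - X @ w)   # gradient of the squared-error term
        t_vec = np.sign(w)                 # subgradient of ||w||_1 (picks 0 at w_d = 0)
        w = w - eta * (grad_sq + lam * t_vec)
    return w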

Subgradient Descent: Another Example

Consider binary classification with the hinge loss (used in SVMs - will see later); assume an ℓ2 regularizer.

[Figure: the hinge loss as a function of y w^T x, with its kink at (1, 0)]

In this case the loss (hinge) is non-differentiable and the regularizer is differentiable.

The subgradient t of the hinge loss term will be

  t = 0 for y_n w^T x_n > 1

  t = −y_n x_n for y_n w^T x_n < 1

  t = k y_n x_n for y_n w^T x_n = 1 (where k ∈ [−1, 0])
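
A sketch of this subgradient computation in code (choosing k = 0 at the kink; the (λ/2)||w||^2 form of the ℓ2 regularizer is an assumption of this example):

import numpy as np

def hinge_subgradient(w, X, y, lam):
    """Subgradient of sum_n max(0, 1 - y_n w^T x_n) + (lam/2) ||w||^2."""
    margins = y * (X @ w)                            # y_n w^T x_n for all n
    active = margins < 1.0                           # these examples contribute -y_n x_n
    t = -(y[active, None] * X[active]).sum(axis=0)   # k = 0 chosen where the margin is exactly 1
    return t + lam * w                               # plus the gradient of the regularizer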

Subgradient Descent: Summary

Not really that different from standard GD. The only difference is that we use subgradients where the function is non-differentiable.

In practice, it is like pretending that the function is differentiable everywhere.

Constrained Optimization

[Figure: a loss surface over (w_1, w_2) with a constraint region g(w) ≤ 0]

Two approaches:

  1: Lagrangian based optimization

  2: Projected gradient descent

Constrained Optimization: Lagrangian Approach

Consider optimizing some function f(w) subject to an inequality constraint on w:

  ŵ = arg min_w f(w), s.t. g(w) ≤ 0

If the constraint is of the form g(w) ≥ 0, use −g(w) ≤ 0 instead.

Note: Can handle multiple inequality and equality constraints too (will see later).

Can transform the above constrained problem into an equivalent unconstrained problem:

  ŵ = arg min_w f(w) + c(w)

where we have defined c(w) as

  c(w) = max_{α≥0} αg(w) = ∞ if g(w) > 0 (constraint violated), and 0 if g(w) ≤ 0 (constraint satisfied)

We can equivalently write the problem as

  ŵ = arg min_w [ f(w) + max_{α≥0} αg(w) ]

Constrained Optimization: Lagrangian Approach

So we could write the original problem as

  ŵ = arg min_w [ f(w) + max_{α≥0} αg(w) ] = arg min_w max_{α≥0} {f(w) + αg(w)}

The function L(w, α) = f(w) + αg(w) is called the Lagrangian, optimized w.r.t. w and α; α is known as the Lagrange multiplier.

Primal and Dual problems:

  ŵ_P = arg min_w max_{α≥0} {f(w) + αg(w)}   (primal problem)

  ŵ_D = arg max_{α≥0} min_w {f(w) + αg(w)}   (dual problem)

Note: ŵ_P = ŵ_D in some nice cases (e.g., when f(w) and the constraint set g(w) ≤ 0 are convex).

For the dual solution, α_D g(ŵ_D) = 0 (complementary slackness / Karush-Kuhn-Tucker (KKT) condition).

Constrained Optimization: Lagrangian with Multiple Constraints

We can also have multiple inequality and equality constraints:

  ŵ = arg min_w f(w), s.t. g_i(w) ≤ 0, i = 1, ..., K, and h_j(w) = 0, j = 1, ..., L

Introduce Lagrange multipliers α = (α_1, ..., α_K) ≥ 0 and β = (β_1, ..., β_L).

The Lagrangian based primal and dual problems will be

  ŵ_P = arg min_w { max_{α≥0, β} { f(w) + Σ_{i=1}^K α_i g_i(w) + Σ_{j=1}^L β_j h_j(w) } }

  ŵ_D = arg max_{α≥0, β} { arg min_w { f(w) + Σ_{i=1}^K α_i g_i(w) + Σ_{j=1}^L β_j h_j(w) } }

Lagrangian based Optimization: An Example

Consider the generative classification model with K classes. Suppose we want to estimate the parameters of the class-marginal p(y):

  p(y|π) = multinoulli(π_1, π_2, ..., π_K) = Π_{k=1}^K π_k^{I[y=k]},  s.t. Σ_{k=1}^K π_k = 1

Given N observations {x_n, y_n}_{n=1}^N, the negative log-likelihood for the class marginal is

  f(π) = −Σ_{n=1}^N log p(y_n|π)

We have an equality constraint Σ_{k=1}^K π_k − 1 = 0. The Lagrangian for this problem will be

  L(π, β) = f(π) + β(Σ_{k=1}^K π_k − 1)

Exercise: Solve arg max_β arg min_π L(π, β) and show that π_k = N_k/N.
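
A sketch of how the exercise goes (these steps are not worked out on the slides): writing N_k = Σ_{n=1}^N I[y_n = k] for the number of examples from class k,

  f(π) = −Σ_{n=1}^N log p(y_n|π) = −Σ_{k=1}^K N_k log π_k

  ∂L/∂π_k = −N_k/π_k + β = 0  ⇒  π_k = N_k/β

  Σ_{k=1}^K π_k = 1  ⇒  β = Σ_{k=1}^K N_k = N  ⇒  π_k = N_k/N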

Projected Gradient Descent

Suppose our problem requires the parameters to lie within a set C:

  ŵ = arg min_w L(w), subject to w ∈ C

Projected GD is very similar to GD, with an extra projection step. Each step of projected GD works as follows:

  Do the usual GD update: z^(t+1) = w^(t) − η_t g^(t)

  Check z^(t+1) for the constraints:

    If z^(t+1) ∈ C, set w^(t+1) = z^(t+1)

    If z^(t+1) ∉ C, project it onto the constraint set: w^(t+1) = Π_C[z^(t+1)]
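
A minimal projected GD sketch (the projection is passed in as a function; all names are illustrative). Since the Euclidean projection leaves points that are already in C unchanged, the membership check can be folded into the projection itself:

import numpy as np

def projected_gd(grad, project, w0, eta=0.1, num_iters=100):
    """Gradient descent with an extra projection back onto the constraint set C."""
    w = np.asarray(w0, dtype=float)
    for t in range(num_iters):
        z = w - eta * grad(w)   # the usual GD update
        w = project(z)          # w = Pi_C[z] (identity when z is already in C)
    return w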

Projected GD: How to Project?

The projection itself is an optimization problem: given z, we find the "closest" point (e.g., in the Euclidean sense) w in the set as follows:

  Π_C[z] = arg min_{w∈C} ||w − z||^2

For some sets C, the projection step is easy/trivial:

  Unit radius ball: projection = normalize z to unit length (if it lies outside the ball)

  Set of non-negative reals: projection = set the negative values to 0

For some other sets C, the projection step may be a bit more involved.
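
The two "easy" projections from above, written out (a sketch):

import numpy as np

def project_unit_ball(z):
    """Projection onto {w : ||w|| <= 1}: rescale to unit length only if outside."""
    norm = np.linalg.norm(z)
    return z if norm <= 1.0 else z / norm

def project_nonnegative(z):
    """Projection onto {w : w >= 0}: set the negative values to 0."""
    return np.maximum(z, 0.0)

# e.g., for non-negative constraints: projected_gd(grad, project_nonnegative, w0)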

Co-ordinate Descent (CD)

Standard GD update for w ∈ R^D at each step:

  w^(t+1) = w^(t) − η_t g^(t)

CD: each step updates one component (co-ordinate) at a time, keeping all others fixed:

  w_d^(t+1) = w_d^(t) − η_t g_d^(t)

The cost of each update is now independent of D.

How to pick which co-ordinate to update? It can be chosen in random order (stochastic CD), or in cyclic order.

Note: Can also update "blocks" of co-ordinates (called block co-ordinate descent).

Should cache previous computations (e.g., w^T x) to avoid O(D) cost in the gradient computation.
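
A cyclic co-ordinate descent sketch (the per-coordinate gradient callable and the step size are assumptions; the caching of quantities like w^T x is omitted for clarity):

import numpy as np

def coordinate_descent(grad_d, w0, eta=0.1, num_epochs=50):
    """Cyclic CD: update one co-ordinate at a time, keeping the others fixed."""
    w = np.asarray(w0, dtype=float)
    for epoch in range(num_epochs):
        for d in range(w.size):          # cyclic order; shuffle d for stochastic CD
            w[d] -= eta * grad_d(w, d)   # g_d = dL/dw_d at the current w
    return w

# Example: L(w) = ||w - a||^2, with dL/dw_d = 2 (w_d - a_d)
a = np.array([3.0, -1.0, 0.5])
print(coordinate_descent(lambda w, d: 2 * (w[d] - a[d]), w0=np.zeros(3)))  # close to a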

Intro to Machine Learning (CS771A) Optimization Techniques for ML (2) 24

Alternating Optimization

Many optimization problems consist of several variables. Very common in ML.
For simplicity, suppose we want to optimize a function of 2 variables w_1 ∈ R^D and w_2 ∈ R^D

{ŵ_1, ŵ_2} = arg min_{w_1, w_2} L(w_1, w_2)

Jointly optimizing w.r.t. w_1 and w_2 may be hard (e.g., if their values depend on each other)
Often, knowing the value of one makes optimization w.r.t. the other easy (sometimes even closed form)

We can therefore follow an alternating scheme to optimize w.r.t. w_1 and w_2 (sketched in code below):
Initialize one of the variables, e.g., w_2 = w_2^{(0)}, and set t = 0
Solve w_1^{(t+1)} = arg min_{w_1} L(w_1, w_2^{(t)})
Solve w_2^{(t+1)} = arg min_{w_2} L(w_1^{(t+1)}, w_2)
Set t = t + 1 and repeat until convergence

Usually converges to a local optimum of L(w_1, w_2). Also has connections to EM (will see later)
Extends to more than 2 variables as well (and not just to vectors). CD is a special case.
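As an illustration of the scheme (not from the lecture; the objective and names below are assumptions), take the two-block quadratic L(w_1, w_2) = (1/2)||y − X_1 w_1 − X_2 w_2||². Solving for both blocks jointly couples them, but with one block fixed the other has a closed-form least-squares solution:

    import numpy as np

    def alternating_least_squares(X1, X2, y, n_iters=20):
        # Alternately minimize L(w1, w2) = 0.5 * ||y - X1 w1 - X2 w2||^2
        w1 = np.zeros(X1.shape[1])
        w2 = np.zeros(X2.shape[1])   # initialize one block: w2 = w2^{(0)}
        for t in range(n_iters):
            # w1^{(t+1)} = arg min_{w1} L(w1, w2^{(t)}): a closed-form solve
            w1 = np.linalg.lstsq(X1, y - X2 @ w2, rcond=None)[0]
            # w2^{(t+1)} = arg min_{w2} L(w1^{(t+1)}, w2)
            w2 = np.linalg.lstsq(X2, y - X1 @ w1, rcond=None)[0]
        return w1, w2

Each sub-step can only decrease the objective, which is why the scheme converges; in this convex example it typically reaches a global optimum, while in general only a local one, as noted above.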

Intro to Machine Learning (CS771A) Optimization Techniques for ML (2) 25


Second-Order Methods: Newton’s Method

GD and variants only use first-order information (the gradient)
Second-order information often tells us a lot more about the function’s shape, curvature, etc.
Newton’s method is one such method that uses second-order information
At each point, approximate the function by its quadratic approximation and minimize it

Doesn’t rely only on the gradient to choose w^{(t+1)}
Instead, each step directly jumps to the minimum of the quadratic approximation

Intro to Machine Learning (CS771A) Optimization Techniques for ML (2) 26


Second-Order Methods: Newton’s Method

The quadratic (Taylor) approximation of f(w) at w^{(t)} is given by

f̃(w) = f(w^{(t)}) + ∇f(w^{(t)})^⊤ (w − w^{(t)}) + (1/2) (w − w^{(t)})^⊤ ∇²f(w^{(t)}) (w − w^{(t)})

The minimizer of this quadratic approximation is (exercise: verify)

ŵ = arg min_w f̃(w) = w^{(t)} − (∇²f(w^{(t)}))^{−1} ∇f(w^{(t)})

This is the update used in Newton’s method (a second-order method since it uses the Hessian)

w^{(t+1)} = w^{(t)} − (∇²f(w^{(t)}))^{−1} ∇f(w^{(t)})

Look, Ma! No learning rate! :-)
Very fast if f(w) is convex. But expensive due to Hessian computation/inversion.

Many ways to approximate the Hessian (e.g., using previous gradients); also look at L-BFGS etc.
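A minimal sketch of the update in code (the solver below and the logistic-regression example are illustrative assumptions, not from the lecture). Note the absence of a learning rate, and that a linear solve replaces the explicit Hessian inverse:

    import numpy as np

    def newton(grad, hess, w0, n_iters=20, tol=1e-8):
        # Newton's method: w^{(t+1)} = w^{(t)} - (Hessian)^{-1} gradient
        w = np.array(w0, dtype=float)
        for t in range(n_iters):
            g = grad(w)
            if np.linalg.norm(g) < tol:
                break
            # jump to the minimizer of the local quadratic approximation
            w = w - np.linalg.solve(hess(w), g)
        return w

    # Hypothetical example: logistic regression negative log-likelihood
    X = np.random.randn(200, 3)
    y = (np.random.rand(200) < 0.5).astype(float)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    grad = lambda w: X.T @ (sigmoid(X @ w) - y)
    def hess(w):
        p = sigmoid(X @ w)
        return X.T @ (X * (p * (1 - p))[:, None])   # X^T diag(p(1-p)) X
    w_star = newton(grad, hess, np.zeros(3))

Each step costs O(D³) for the linear solve, which is the expense mentioned above; quasi-Newton methods such as L-BFGS approximate the Hessian from past gradients to avoid it.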

Intro to Machine Learning (CS771A) Optimization Techniques for ML (2) 27


Summary

Gradient methods are simple to understand and implement
More sophisticated optimization methods often build on gradient methods
The backpropagation algorithm used in deep neural nets is GD + the chain rule of differentiation

Use subgradient methods if the function is not differentiable

Constrained optimization requires methods such as Lagrangian-based optimization or projected gradient descent
Second-order methods such as Newton’s method converge much faster but are computationally expensive
But computing all this gradient-related stuff looks scary to me. Any help?
Don’t worry. Automatic Differentiation (AD) methods are available now (a small example below)
AD only requires specifying the loss function (especially useful for deep neural nets)
Many packages such as TensorFlow, PyTorch, etc. provide AD support
But having a good understanding of optimization is still helpful
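To illustrate the AD point (a small hedged sketch; the data and the squared-error loss are made up for the example), with PyTorch’s autograd you specify only the loss, and backward() applies reverse-mode AD, i.e., backpropagation, to produce the gradient:

    import torch

    w = torch.zeros(3, requires_grad=True)   # parameters to differentiate w.r.t.
    X = torch.randn(100, 3)
    y = torch.randn(100)

    loss = ((y - X @ w) ** 2).mean()   # just specify the loss...
    loss.backward()                    # ...reverse-mode AD (backpropagation) does the rest
    print(w.grad)                      # dL/dw, with no hand-derived gradient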

Intro to Machine Learning (CS771A) Optimization Techniques for ML (2) 28