
Optimization Techniques for ML (2)

Piyush Rai

Introduction to Machine Learning (CS771A)

August 28, 2018

Recap: Convex and Non-Convex Functions

Most ML problems boil down to minimization of convex/non-convex functions, e.g.,

  ŵ = arg min_w L(w) = arg min_w (1/N) Σ_{n=1}^N ℓ_n(w) + R(w)

Convex functions have a unique minimum.

[Figure: a convex loss curve with a single optimum]

Non-convex functions have several local minima.

[Figure: a non-convex loss curve with several "local" optima]

Recap: Convex Functions

A function is convex if all of its chords lie above the function.

  Note: "Chord lies above function" more formally means Jensen's Inequality: if f is convex then, given any x_1, x_2 and λ ∈ [0, 1], f(λx_1 + (1 − λ)x_2) ≤ λf(x_1) + (1 − λ)f(x_2).

[Figure: a convex function vs. a non-convex function, with chords drawn above/crossing the curve]

A function is convex if its graph lies above all of its tangents (above its first-order Taylor expansion).

A function is convex if its second derivative (Hessian) is positive semi-definite.

Note: If f is convex then −f is a concave function.
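
These characterizations are easy to sanity-check numerically. Below is a small sketch (not from the slides; the two test functions are arbitrary choices) that tests the chord condition along one segment:

import numpy as np

def chord_above(f, x, y, num=101):
    """Check f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y) on a grid of lam in [0, 1]."""
    lam = np.linspace(0.0, 1.0, num)
    along_function = f(lam * x + (1 - lam) * y)
    along_chord = lam * f(x) + (1 - lam) * f(y)
    return bool(np.all(along_function <= along_chord + 1e-12))

convex_f = lambda w: w ** 2                        # convex: every chord lies above it
wiggly_f = lambda w: np.sin(3 * w) + w ** 2 / 10   # non-convex: has several local minima

print(chord_above(convex_f, -2.0, 3.0))   # True
print(chord_above(wiggly_f, -2.0, 3.0))   # False: the function rises above this chord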

Recap: Gradient Descent

A very simple, first-order method for optimizing any differentiable function (convex/non-convex).

Uses only the gradient g = ∇L(w) of the function.

Basic idea: Start at some location w^(0) and move in the opposite direction of the gradient (move in the negative direction where the gradient is positive, and in the positive direction where it is negative).

[Figure: gradient descent steps on a loss curve, each step moving opposite to the sign of the gradient]

[Algorithm box: Gradient Descent: initialize w^(0), then repeat w^(t+1) = w^(t) − η_t g^(t) until convergence]
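
As a concrete illustration, here is a minimal gradient descent loop in Python; this is a sketch rather than the course's reference code, and the quadratic objective, step size, and function names are assumptions chosen for the example.

import numpy as np

def gradient_descent(grad, w0, eta=0.1, num_iters=100):
    """Minimize a differentiable function, using only its gradient g = grad(w)."""
    w = np.asarray(w0, dtype=float)
    for t in range(num_iters):
        g = grad(w)        # g = gradient of L at the current w
        w = w - eta * g    # move in the opposite direction of the gradient
    return w

# Example: L(w) = ||w - a||^2 has gradient 2(w - a) and its minimum at a
a = np.array([1.0, -2.0])
print(gradient_descent(lambda w: 2 * (w - a), w0=np.zeros(2)))  # close to [1, -2]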

Recap: Gradient Descent

The learning rate η_t is important:

  Very small learning rates may result in very slow convergence (may take too long to converge, or may not be able to "cross" towards the good side).

  Very large learning rates may lead to oscillatory behavior (may keep oscillating) or result in a bad local optimum (a VERY VERY large rate can even jump into a bad region).

[Figure: GD trajectories under very small vs. very large learning rates]

Many ways to set the learning rate, e.g.,

  Constant (if properly set, can still show good convergence behavior)

  Decreasing with t (e.g., 1/t, 1/√t, etc.)

  Adaptive learning rates (e.g., using methods such as Adagrad, Adam)
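
The first two families of schedules are one-liners; a small sketch (hypothetical values, not from the slides) that could be plugged into a GD loop as eta(t):

import math

eta0 = 0.5                                       # assumed base learning rate
constant   = lambda t: eta0                      # fixed rate
inv_t      = lambda t: eta0 / (t + 1)            # decreasing like 1/t
inv_sqrt_t = lambda t: eta0 / math.sqrt(t + 1)   # decreasing like 1/sqrt(t)

for t in [0, 9, 99]:
    print(t, constant(t), inv_t(t), inv_sqrt_t(t))

Adaptive methods such as Adagrad and Adam instead maintain per-coordinate statistics of past gradients; implementations are available in most ML libraries.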

Recap: Stochastic Gradient Descent

Gradient computation in standard GD may be expensive when N is large:

  g = ∇_w [ (1/N) Σ_{n=1}^N ℓ_n(w) ] = (1/N) Σ_{n=1}^N g_n   (ignoring the regularizer R(w))

Stochastic Gradient Descent (SGD) approximates g using a single data point.

In iteration t, SGD picks a uniformly random i ∈ {1, ..., N} and approximates g as

  g ≈ g_i = ∇_w ℓ_i(w)

[Algorithm box: Stochastic Gradient Descent]
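
A minimal SGD sketch under the same conventions (the least-squares example, data sizes, and step size are assumptions for illustration):

import numpy as np

def sgd(grad_i, N, w0, eta=0.01, num_iters=1000, seed=0):
    """SGD: replace the full gradient with the gradient of one random example."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for t in range(num_iters):
        i = rng.integers(N)          # uniformly random i in {0, ..., N-1}
        w = w - eta * grad_i(w, i)   # g is approximated by g_i, the i-th loss gradient
    return w

# Example: least squares, where the gradient of (y_i - w^T x_i)^2 is -2 (y_i - w^T x_i) x_i
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true
print(sgd(lambda w, i: -2 * (y[i] - X[i] @ w) * X[i], N=500, w0=np.zeros(3)))  # approaches w_true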

Recap: Mini-batch SGD

In each iteration, SGD uses a single randomly chosen i ∈ {1, ..., N} to approximate g (the full gradient) by g_i (a stochastic gradient). This results in a large variance in g_i.

We can instead use B > 1 uniformly randomly chosen points with indices i_1, ..., i_B ∈ {1, ..., N}.

This is the idea behind mini-batch SGD. The approximated gradient in this case would be

  g ≈ (1/B) Σ_{b=1}^B g_{i_b}

The basic intuition: Averaging helps in variance reduction!

The algorithm is the same as SGD, except that we now use these mini-batch gradients at each step.
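
A sketch of the mini-batch gradient estimate (sampling with replacement for simplicity; B and the names are illustrative), which can be dropped into the SGD loop above:

import numpy as np

def minibatch_grad(grad_i, w, N, B, rng):
    """Average B single-example gradients; averaging reduces the variance."""
    idx = rng.integers(N, size=B)    # indices i_1, ..., i_B
    return np.mean([grad_i(w, i) for i in idx], axis=0)

# Usage inside an SGD-style loop:
#   g_hat = minibatch_grad(grad_i, w, N=500, B=32, rng=np.random.default_rng(0))
#   w = w - eta * g_hat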

Plan for today

Optimization of functions that are NOT differentiable

Optimization with constraints on the variables

Optimizing w.r.t. several variables, one at a time: co-ordinate descent, alternating optimization

Second-order methods for optimization

Optimizing Non-differentiable Functions

Many ML problems require minimizing non-differentiable functions. Some common examples:

  Absolute and ε-insensitive losses in regression, and several classification loss functions (we will see shortly)

  [Figure: plots of the Absolute loss, ε-insensitive loss, "Perceptron" loss, and Hinge loss]

  Regression/classification loss functions with ℓ1 or ℓp (p < 1) regularization (the data-fit term is differentiable, the regularizer is NOT)

Can't apply standard GD or SGD, since the gradient isn't defined at the points of non-differentiability.

Interlude: Loss Functions for Classification

In regression (assuming a linear model ŷ = w^T x), some common loss functions are

  ℓ(y, ŷ) = (y − w^T x)^2   or   ℓ(y, ŷ) = |y − w^T x|

We typically look at the difference between the true y and the model's prediction w^T x.

How to formally define loss functions for classification?

We have already looked at the loss function for logistic regression (assuming y ∈ {−1, +1}):

  ℓ(y, ŷ) = log(1 + exp(−y w^T x))

Why does the above make sense? Well, it is large for large misclassifications, small otherwise.

[Figure: the logistic loss as a function of y w^T x]

Are there other loss functions for classification? Yes, several.

Interlude: Some Loss Functions for (Binary) Classification

[Figure: plots of four losses as functions of y w^T x]

0-1 Loss: Non-convex + Non-differentiable + NP-Hard to Optimize

"Perceptron" Loss: Convex + Non-differentiable

Log(istic) Loss: Convex + Differentiable

Hinge Loss: Convex + Non-differentiable

Optimizing Non-differentiable Functions

Even though gradients are not defined for non-differentiable functions, we can work with subgradients.

For a function f(x), a subgradient at x is any vector g s.t. f(y) ≥ f(x) + g^T(y − x), ∀y.

[Figure: a unique tangent where the function is differentiable; several valid subgradient lines where it is not]

A non-differentiable function can have several subgradients at a point of non-differentiability.

The set of all subgradients of a function f at a point x is called the subdifferential, denoted ∂f(x):

  ∂f(x) = {g : f(y) ≥ f(x) + g^T(y − x), ∀y}

Subgradient Descent: An Example

Consider linear regression but with the ℓ1 norm on w (recall: the ℓ1 norm promotes a sparse w):

  ŵ = arg min_w Σ_{n=1}^N (y_n − w^T x_n)^2 + λ||w||_1

The squared error term is differentiable, but the norm ||w||_1 is NOT at w_d = 0.

We can use subgradients of ||w||_1 in this case:

  g = −2 Σ_{n=1}^N (y_n − w^T x_n) x_n + λt

Here t is a vector s.t.

  t_d = −1 for w_d < 0;  t_d ∈ [−1, +1] for w_d = 0;  t_d = +1 for w_d > 0

[Figure: the absolute value function, with its set of subgradients at 0]

If we take t_d = 0 at w_d = 0, then t_d = sign(w_d).
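
Putting this together, a subgradient descent sketch for the above objective (the step size and iteration count are assumptions and may need tuning to the data scale; t_d = sign(w_d) at zero, as suggested above):

import numpy as np

def lasso_subgradient_descent(X, y, lam, eta=1e-4, num_iters=5000):
    """Subgradient descent for sum_n (y_n - w^T x_n)^2 + lam * ||w||_1."""
    w = np.zeros(X.shape[1])
    for t in range(num_iters):
        grad_sq = -2 * X.T @ (y - X @ w)   # gradient of the squared-error term
        t_vec = np.sign(w)                 # subgradient of ||w||_1 (picks 0 at w_d = 0)
        w = w - eta * (grad_sq + lam * t_vec)
    return w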

Subgradient Descent: Another Example

Consider binary classification with the hinge loss (used in SVMs - will see later); assume an ℓ2 regularizer.

[Figure: the hinge loss as a function of y w^T x, with its kink at (1, 0)]

In this case the loss (hinge) is non-differentiable and the regularizer is differentiable.

The subgradient t of the hinge loss term will be

  t = 0 for y_n w^T x_n > 1

  t = −y_n x_n for y_n w^T x_n < 1

  t = k y_n x_n for y_n w^T x_n = 1 (where k ∈ [−1, 0])
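
A sketch of this subgradient computation in code (choosing k = 0 at the kink; the (λ/2)||w||^2 form of the ℓ2 regularizer is an assumption of this example):

import numpy as np

def hinge_subgradient(w, X, y, lam):
    """Subgradient of sum_n max(0, 1 - y_n w^T x_n) + (lam/2) ||w||^2."""
    margins = y * (X @ w)                            # y_n w^T x_n for all n
    active = margins < 1.0                           # these examples contribute -y_n x_n
    t = -(y[active, None] * X[active]).sum(axis=0)   # k = 0 chosen where the margin is exactly 1
    return t + lam * w                               # plus the gradient of the regularizer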

Subgradient Descent: Summary

Not really that different from standard GD. The only difference is that we use subgradients where the function is non-differentiable.

In practice, it is like pretending that the function is differentiable everywhere.

Constrained Optimization

[Figure: a loss surface over (w_1, w_2) with a constraint region g(w) ≤ 0]

Two approaches:

  1: Lagrangian based optimization

  2: Projected gradient descent

Constrained Optimization: Lagrangian Approach

Consider optimizing some function f(w) subject to an inequality constraint on w:

  ŵ = arg min_w f(w), s.t. g(w) ≤ 0

If the constraint is of the form g(w) ≥ 0, use −g(w) ≤ 0 instead.

Note: Can handle multiple inequality and equality constraints too (will see later).

Can transform the above constrained problem into an equivalent unconstrained problem:

  ŵ = arg min_w f(w) + c(w)

where we have defined c(w) as

  c(w) = max_{α≥0} αg(w) = ∞ if g(w) > 0 (constraint violated), and 0 if g(w) ≤ 0 (constraint satisfied)

We can equivalently write the problem as

  ŵ = arg min_w [ f(w) + max_{α≥0} αg(w) ]

Constrained Optimization: Lagrangian Approach

So we could write the original problem as

  ŵ = arg min_w [ f(w) + max_{α≥0} αg(w) ] = arg min_w max_{α≥0} {f(w) + αg(w)}

The function L(w, α) = f(w) + αg(w) is called the Lagrangian, optimized w.r.t. w and α; α is known as the Lagrange multiplier.

Primal and Dual problems:

  ŵ_P = arg min_w max_{α≥0} {f(w) + αg(w)}   (primal problem)

  ŵ_D = arg max_{α≥0} min_w {f(w) + αg(w)}   (dual problem)

Note: ŵ_P = ŵ_D in some nice cases (e.g., when f(w) and the constraint set g(w) ≤ 0 are convex).

For the dual solution, α_D g(ŵ_D) = 0 (complementary slackness / Karush-Kuhn-Tucker (KKT) condition).

Constrained Optimization: Lagrangian with Multiple Constraints

We can also have multiple inequality and equality constraints:

  ŵ = arg min_w f(w), s.t. g_i(w) ≤ 0, i = 1, ..., K, and h_j(w) = 0, j = 1, ..., L

Introduce Lagrange multipliers α = (α_1, ..., α_K) ≥ 0 and β = (β_1, ..., β_L).

The Lagrangian based primal and dual problems will be

  ŵ_P = arg min_w { max_{α≥0, β} { f(w) + Σ_{i=1}^K α_i g_i(w) + Σ_{j=1}^L β_j h_j(w) } }

  ŵ_D = arg max_{α≥0, β} { arg min_w { f(w) + Σ_{i=1}^K α_i g_i(w) + Σ_{j=1}^L β_j h_j(w) } }

Lagrangian based Optimization: An Example

Consider the generative classification model with K classes. Suppose we want to estimate the parameters of the class-marginal p(y):

  p(y|π) = multinoulli(π_1, π_2, ..., π_K) = Π_{k=1}^K π_k^{I[y=k]},  s.t. Σ_{k=1}^K π_k = 1

Given N observations {x_n, y_n}_{n=1}^N, the negative log-likelihood for the class marginal is

  f(π) = −Σ_{n=1}^N log p(y_n|π)

We have an equality constraint Σ_{k=1}^K π_k − 1 = 0. The Lagrangian for this problem will be

  L(π, β) = f(π) + β(Σ_{k=1}^K π_k − 1)

Exercise: Solve arg max_β arg min_π L(π, β) and show that π_k = N_k/N.
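
A sketch of how the exercise goes (these steps are not worked out on the slides): writing N_k = Σ_{n=1}^N I[y_n = k] for the number of examples from class k,

  f(π) = −Σ_{n=1}^N log p(y_n|π) = −Σ_{k=1}^K N_k log π_k

  ∂L/∂π_k = −N_k/π_k + β = 0  ⇒  π_k = N_k/β

  Σ_{k=1}^K π_k = 1  ⇒  β = Σ_{k=1}^K N_k = N  ⇒  π_k = N_k/N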

Projected Gradient Descent

Suppose our problem requires the parameters to lie within a set C:

  ŵ = arg min_w L(w), subject to w ∈ C

Projected GD is very similar to GD, with an extra projection step. Each step of projected GD works as follows:

  Do the usual GD update: z^(t+1) = w^(t) − η_t g^(t)

  Check z^(t+1) for the constraints:

    If z^(t+1) ∈ C, set w^(t+1) = z^(t+1)

    If z^(t+1) ∉ C, project it onto the constraint set: w^(t+1) = Π_C[z^(t+1)]
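
A minimal projected GD sketch (the projection is passed in as a function; all names are illustrative). Since the Euclidean projection leaves points that are already in C unchanged, the membership check can be folded into the projection itself:

import numpy as np

def projected_gd(grad, project, w0, eta=0.1, num_iters=100):
    """Gradient descent with an extra projection back onto the constraint set C."""
    w = np.asarray(w0, dtype=float)
    for t in range(num_iters):
        z = w - eta * grad(w)   # the usual GD update
        w = project(z)          # w = Pi_C[z] (identity when z is already in C)
    return w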

Projected GD: How to Project?

The projection itself is an optimization problem: given z, we find the "closest" point (e.g., in the Euclidean sense) w in the set as follows:

  Π_C[z] = arg min_{w∈C} ||w − z||^2

For some sets C, the projection step is easy/trivial:

  Unit radius ball: projection = normalize z to unit length (if it lies outside the ball)

  Set of non-negative reals: projection = set the negative values to 0

For some other sets C, the projection step may be a bit more involved.
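
The two "easy" projections from above, written out (a sketch):

import numpy as np

def project_unit_ball(z):
    """Projection onto {w : ||w|| <= 1}: rescale to unit length only if outside."""
    norm = np.linalg.norm(z)
    return z if norm <= 1.0 else z / norm

def project_nonnegative(z):
    """Projection onto {w : w >= 0}: set the negative values to 0."""
    return np.maximum(z, 0.0)

# e.g., for non-negative constraints: projected_gd(grad, project_nonnegative, w0)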

Co-ordinate Descent (CD)

Standard GD update for w ∈ R^D at each step:

  w^(t+1) = w^(t) − η_t g^(t)

CD: each step updates one component (co-ordinate) at a time, keeping all others fixed:

  w_d^(t+1) = w_d^(t) − η_t g_d^(t)

The cost of each update is now independent of D.

How to pick which co-ordinate to update? It can be chosen in random order (stochastic CD), or in cyclic order.

Note: Can also update "blocks" of co-ordinates (called block co-ordinate descent).

Should cache previous computations (e.g., w^T x) to avoid O(D) cost in the gradient computation.
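
A cyclic co-ordinate descent sketch (the per-coordinate gradient callable and the step size are assumptions; the caching of quantities like w^T x is omitted for clarity):

import numpy as np

def coordinate_descent(grad_d, w0, eta=0.1, num_epochs=50):
    """Cyclic CD: update one co-ordinate at a time, keeping the others fixed."""
    w = np.asarray(w0, dtype=float)
    for epoch in range(num_epochs):
        for d in range(w.size):          # cyclic order; shuffle d for stochastic CD
            w[d] -= eta * grad_d(w, d)   # g_d = dL/dw_d at the current w
    return w

# Example: L(w) = ||w - a||^2, with dL/dw_d = 2 (w_d - a_d)
a = np.array([3.0, -1.0, 0.5])
print(coordinate_descent(lambda w, d: 2 * (w[d] - a[d]), w0=np.zeros(3)))  # close to a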

Intro to Machine Learning (CS771A) Optimization Techniques for ML (2) 24

Alternating Optimization

Many optimization problems consist of several variables. Very common in ML.
For simplicity, suppose we want to optimize a function of 2 variables w_1 ∈ R^D and w_2 ∈ R^D

{ŵ_1, ŵ_2} = arg min_{w_1, w_2} L(w_1, w_2)

Jointly optimizing w.r.t. w_1 and w_2 may be hard (e.g., if their values depend on each other)
Often, knowing the value of one makes optimization w.r.t. the other easy (sometimes even closed form)

We can therefore follow an alternating scheme to optimize w.r.t. w_1 and w_2 (sketched in code below):
Initialize one of the variables, e.g., w_2 = w_2^{(0)}, and set t = 0
Solve w_1^{(t+1)} = arg min_{w_1} L(w_1, w_2^{(t)})
Solve w_2^{(t+1)} = arg min_{w_2} L(w_1^{(t+1)}, w_2)
Set t = t + 1 and repeat until convergence

Usually converges to a local optimum of L(w_1, w_2). Also has connections to EM (will see later)
Extends to more than 2 variables as well (and not just to vectors). CD is a special case.
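As an illustration of the scheme (not from the lecture; the objective and names below are assumptions), take the two-block quadratic L(w_1, w_2) = (1/2)||y − X_1 w_1 − X_2 w_2||². Solving for both blocks jointly couples them, but with one block fixed the other has a closed-form least-squares solution:

    import numpy as np

    def alternating_least_squares(X1, X2, y, n_iters=20):
        # Alternately minimize L(w1, w2) = 0.5 * ||y - X1 w1 - X2 w2||^2
        w1 = np.zeros(X1.shape[1])
        w2 = np.zeros(X2.shape[1])   # initialize one block: w2 = w2^{(0)}
        for t in range(n_iters):
            # w1^{(t+1)} = arg min_{w1} L(w1, w2^{(t)}): a closed-form solve
            w1 = np.linalg.lstsq(X1, y - X2 @ w2, rcond=None)[0]
            # w2^{(t+1)} = arg min_{w2} L(w1^{(t+1)}, w2)
            w2 = np.linalg.lstsq(X2, y - X1 @ w1, rcond=None)[0]
        return w1, w2

Each sub-step can only decrease the objective, which is why the scheme converges; in this convex example it typically reaches a global optimum, while in general only a local one, as noted above.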

Intro to Machine Learning (CS771A) Optimization Techniques for ML (2) 25


Second-Order Methods: Newton’s Method

GD and variants only use first-order information (the gradient)
Second-order information often tells us a lot more about the function’s shape, curvature, etc.
Newton’s method is one such method that uses second-order information
At each point, approximate the function by its quadratic approximation and minimize it

Doesn’t rely only on the gradient to choose w^{(t+1)}
Instead, each step directly jumps to the minimum of the quadratic approximation

Intro to Machine Learning (CS771A) Optimization Techniques for ML (2) 26


Second-Order Methods: Newton’s Method

The quadratic (Taylor) approximation of f(w) at w^{(t)} is given by

f̃(w) = f(w^{(t)}) + ∇f(w^{(t)})^⊤ (w − w^{(t)}) + (1/2) (w − w^{(t)})^⊤ ∇²f(w^{(t)}) (w − w^{(t)})

The minimizer of this quadratic approximation is (exercise: verify)

ŵ = arg min_w f̃(w) = w^{(t)} − (∇²f(w^{(t)}))^{−1} ∇f(w^{(t)})

This is the update used in Newton’s method (a second-order method since it uses the Hessian)

w^{(t+1)} = w^{(t)} − (∇²f(w^{(t)}))^{−1} ∇f(w^{(t)})

Look, Ma! No learning rate! :-)
Very fast if f(w) is convex. But expensive due to Hessian computation/inversion.

Many ways to approximate the Hessian (e.g., using previous gradients); also look at L-BFGS etc.
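A minimal sketch of the update in code (the solver below and the logistic-regression example are illustrative assumptions, not from the lecture). Note the absence of a learning rate, and that a linear solve replaces the explicit Hessian inverse:

    import numpy as np

    def newton(grad, hess, w0, n_iters=20, tol=1e-8):
        # Newton's method: w^{(t+1)} = w^{(t)} - (Hessian)^{-1} gradient
        w = np.array(w0, dtype=float)
        for t in range(n_iters):
            g = grad(w)
            if np.linalg.norm(g) < tol:
                break
            # jump to the minimizer of the local quadratic approximation
            w = w - np.linalg.solve(hess(w), g)
        return w

    # Hypothetical example: logistic regression negative log-likelihood
    X = np.random.randn(200, 3)
    y = (np.random.rand(200) < 0.5).astype(float)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    grad = lambda w: X.T @ (sigmoid(X @ w) - y)
    def hess(w):
        p = sigmoid(X @ w)
        return X.T @ (X * (p * (1 - p))[:, None])   # X^T diag(p(1-p)) X
    w_star = newton(grad, hess, np.zeros(3))

Each step costs O(D³) for the linear solve, which is the expense mentioned above; quasi-Newton methods such as L-BFGS approximate the Hessian from past gradients to avoid it.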

Intro to Machine Learning (CS771A) Optimization Techniques for ML (2) 27


Summary

Gradient methods are simple to understand and implement
More sophisticated optimization methods often build on gradient methods
The backpropagation algorithm used in deep neural nets is GD + the chain rule of differentiation

Use subgradient methods if the function is not differentiable

Constrained optimization requires methods such as Lagrangian-based optimization or projected gradient descent
Second-order methods such as Newton’s method converge much faster but are computationally expensive
But computing all this gradient-related stuff looks scary to me. Any help?
Don’t worry. Automatic Differentiation (AD) methods are available now (a small example below)
AD only requires specifying the loss function (especially useful for deep neural nets)
Many packages such as TensorFlow, PyTorch, etc. provide AD support
But having a good understanding of optimization is still helpful
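To illustrate the AD point (a small hedged sketch; the data and the squared-error loss are made up for the example), with PyTorch’s autograd you specify only the loss, and backward() applies reverse-mode AD, i.e., backpropagation, to produce the gradient:

    import torch

    w = torch.zeros(3, requires_grad=True)   # parameters to differentiate w.r.t.
    X = torch.randn(100, 3)
    y = torch.randn(100)

    loss = ((y - X @ w) ** 2).mean()   # just specify the loss...
    loss.backward()                    # ...reverse-mode AD (backpropagation) does the rest
    print(w.grad)                      # dL/dw, with no hand-derived gradient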

Intro to Machine Learning (CS771A) Optimization Techniques for ML (2) 28