Least Mean Squares Regression

Machine Learning

Least Squares Method for regression

• Examples

• The LMS objective

• Gradient descent

• Incremental/stochastic gradient descent

What's the mileage?

Suppose we want to predict the mileage of a car from its weight and age.

What we want: a function that can predict mileage using x1 and x2.

    Weight (x 100 lb), x1    Age (years), x2    Mileage
    31.5                     6                  21
    36.2                     2                  25
    43.1                     0                  18
    27.6                     2                  30

Linear regression: The strategy

Predicting continuous values using a linear model.

Assumption: The output is a linear function of the inputs:

    Mileage = w0 + w1 x1 + w2 x2

Here w0, w1, and w2 are the parameters of the model, also called the weights. Collectively, they form a vector w.

Learning: Use the training data to find the best possible value of w.

Prediction: Given the values of x1 and x2 for a new car, use the learned w to predict the mileage of the new car.

Linear regression: The strategy

• Inputs are vectors: x ∈ ℜ^d

• Outputs are real numbers: y ∈ ℜ

• We have a training set D = { (x1, y1), (x2, y2), ⋯ }

• We want to approximate y as

    y = w1 + w2 x2 + ⋯ + wd xd = wᵀx

  w is the learned weight vector in ℜ^d

For simplicity, we will assume that the first feature is always 1:

    x = [1, x2, ⋯, xd]ᵀ

This makes the notation easier.
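As a concrete illustration, once the first feature is fixed to 1, the prediction y = wᵀx is just a dot product in which w[0] plays the role of the intercept. A minimal sketch (the weight values below are hypothetical, not learned):

```python
def predict(w, x):
    """Dot product w^T x; assumes x[0] == 1 so that w[0] is the intercept."""
    return sum(wj * xj for wj, xj in zip(w, x))

w = [2.0, 0.5, -1.0]   # [w0, w1, w2] -- made-up weights for illustration
x = [1.0, 31.5, 6.0]   # [1, weight, age] for the first car in the table
print(predict(w, x))   # 2.0 + 0.5*31.5 - 1.0*6.0 = 11.75
```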

Examples

One dimensional input: predict using y = w1 + w2 x2.

The linear function is not our only choice. We could have tried to fit the data as another polynomial.

Two dimensional input: predict using y = w1 + w2 x2 + w3 x3.

Least Squares Method for regression

• Examples

• The LMS objective

• Gradient descent

• Incremental/stochastic gradient descent

What is the best weight vector?

Question: How do we know which weight vector is the best one for a training set?

For an input (xi, yi) in the training set, the cost of a mistake is

    (yi − wᵀxi)²

Define the cost (or loss) for a particular weight vector w to be

    J(w) = (1/2) Σi (yi − wᵀxi)²

the sum of squared costs over the training set.

One strategy for learning: Find the w with the least cost on this data.
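The cost can be computed directly from the training set. A minimal sketch on the mileage data from earlier, assuming the form J(w) = ½ Σi (yi − wᵀxi)² with the first feature of every input fixed to 1:

```python
# Mileage training set from earlier; each input is [1, weight, age].
data = [([1, 31.5, 6], 21),
        ([1, 36.2, 2], 25),
        ([1, 43.1, 0], 18),
        ([1, 27.6, 2], 30)]

def cost(w, data):
    """J(w) = 1/2 * sum_i (y_i - w^T x_i)^2 over the training set."""
    total = 0.0
    for x, y in data:
        pred = sum(wj * xj for wj, xj in zip(w, x))
        total += (y - pred) ** 2
    return 0.5 * total

print(cost([0.0, 0.0, 0.0], data))  # 1145.0: with w = 0, every prediction is 0
```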

Least Mean Squares (LMS) Regression

Learning: minimizing the mean squared error

    min_w J(w) = min_w (1/2) Σi (yi − wᵀxi)²

Different strategies exist for learning by optimization.

• Gradient descent is a popular algorithm.

(For this particular minimization objective, there is also an analytical solution. No need for gradient descent.)

Least Squares Method for regression

• Examples

• The LMS objective

• Gradient descent

• Incremental/stochastic gradient descent

Gradient descent

We are trying to minimize J(w).

General strategy for minimizing a function J(w):

• Start with an initial guess for w, say w0

• Iterate till convergence:

  – Compute the gradient of J at wt

  – Update wt to get wt+1 by taking a step in the opposite direction of the gradient

Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction.
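The general loop above can be sketched on a toy one-dimensional function, J(w) = (w − 3)², whose gradient is 2(w − 3). The step size 0.1 and the fixed iteration count stand in for "a small constant" and "till convergence":

```python
def gradient(w):
    """Gradient of the toy objective J(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w = 0.0                      # initial guess w0
r = 0.1                      # step size (small constant)
for t in range(100):         # "iterate till convergence" (fixed count here)
    w = w - r * gradient(w)  # step in the direction opposite to the gradient
print(round(w, 4))           # 3.0, the minimum of J
```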

Gradient descent for LMS

We are trying to minimize J(w).

1. Initialize w0

2. For t = 0, 1, 2, ….

   1. Compute the gradient of J(w) at wt. Call it ∇J(wt)

   2. Update w as follows:

       wt+1 = wt − r ∇J(wt)

r is called the learning rate. (For now, a small constant. We will get to this later.)

What is the gradient of J?

Gradient of the cost

We are trying to minimize

    J(w) = (1/2) Σi (yi − wᵀxi)²

The gradient ∇J(w) is a vector of partial derivatives, one per weight. Remember that w is a vector with d elements:

    w = [w1, w2, w3, ⋯, wj, ⋯, wd]

One element of the gradient vector:

    ∂J/∂wj = −Σi (yi − wᵀxi) xij

That is, each element of the gradient is a sum of Error × Input.

Gradient descent for LMS

We are trying to minimize J(w).

1. Initialize w0

2. For t = 0, 1, 2, …. (until total error is below a threshold)

   1. Compute the gradient of J(w) at wt. Call it ∇J(wt). Evaluate the function for each training example to compute the error and construct the gradient vector. One element of ∇J(wt) is

       ∂J/∂wj = −Σi (yi − (wt)ᵀxi) xij

   2. Update w as follows:

       wt+1 = wt − r ∇J(wt)

      r is called the learning rate. (For now, a small constant. We will get to this later.)

This algorithm is guaranteed to converge to the minimum of J if r is small enough. Why? The objective J is a convex function.

Least Squares Method for regression

• Examples

• The LMS objective

• Gradient descent

• Incremental/stochastic gradient descent

Gradient descent for LMS

We are trying to minimize J(w).

1. Initialize w0

2. For t = 0, 1, 2, …. (until total error is below a threshold)

   1. Compute the gradient of J(w) at wt. Call it ∇J(wt). Evaluate the function for each training example to compute the error and construct the gradient vector.

   2. Update w as follows:

       wt+1 = wt − r ∇J(wt)

The weight vector is not updated until all the errors are calculated.

Why not make early updates to the weight vector as soon as we encounter errors, instead of waiting for a full pass over the data?
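A minimal sketch of the batch loop on the mileage data. The learning rate is a guessed small constant; the point is that the full gradient, a sum of Error × Input over all examples, is computed before each update:

```python
# Mileage training set; each input is [1, weight, age].
data = [([1, 31.5, 6], 21),
        ([1, 36.2, 2], 25),
        ([1, 43.1, 0], 18),
        ([1, 27.6, 2], 30)]

def cost(w):
    """J(w) = 1/2 * sum_i (y_i - w^T x_i)^2."""
    return 0.5 * sum((y - sum(wj * xj for wj, xj in zip(w, x))) ** 2
                     for x, y in data)

w = [0.0, 0.0, 0.0]
r = 0.0001                         # guessed small learning rate
before = cost(w)
for t in range(1000):
    grad = [0.0, 0.0, 0.0]
    for x, y in data:              # full pass before any update
        err = y - sum(wj * xj for wj, xj in zip(w, x))
        for j in range(3):
            grad[j] -= err * x[j]  # gradient element: -sum_i error * input
    w = [wj - r * gj for wj, gj in zip(w, grad)]
print(cost(w) < before)            # True: the squared error decreased
```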

Incremental/Stochastic gradient descent

• Repeat for each example (xi, yi):

  – Pretend that the entire training set is represented by this single example

  – Use this example to calculate the gradient and update the model

• Contrast this with batch gradient descent, which makes one update to the weight vector for every pass over the data
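The incremental idea can be sketched on the mileage data, again with a guessed small learning rate. Unlike the batch version, each weight moves as soon as a single example's error is known:

```python
# Mileage training set; each input is [1, weight, age].
data = [([1, 31.5, 6], 21),
        ([1, 36.2, 2], 25),
        ([1, 43.1, 0], 18),
        ([1, 27.6, 2], 30)]

def cost(w):
    """J(w) = 1/2 * sum_i (y_i - w^T x_i)^2."""
    return 0.5 * sum((y - sum(wj * xj for wj, xj in zip(w, x))) ** 2
                     for x, y in data)

w = [0.0, 0.0, 0.0]
r = 0.0001                        # guessed small learning rate
before = cost(w)
for epoch in range(200):
    for x, y in data:             # update immediately after each example
        err = y - sum(wj * xj for wj, xj in zip(w, x))
        w = [wj + r * err * xj for wj, xj in zip(w, x)]
print(cost(w) < before)           # True: the squared error decreased
```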

Incremental/Stochastic gradient descent

1. Initialize w

2. For t = 0, 1, 2, …. (until the error is below some threshold)

   – For each training example (xi, yi):

     • Update w. For each element wj of the weight vector:

         wj ← wj + r (yi − wᵀxi) xij

Contrast with the previous method, where the weights are updated only after all examples are processed once.

This update rule is also called the Widrow-Hoff rule in the neural networks literature.

Online/incremental algorithms are often preferred when the training set is very large. They may get close to the optimum much faster than the batch version.

Learning Rates and Convergence

• In the general (non-separable) case, the learning rate r must decrease to zero to guarantee convergence

• The learning rate is also called the step size

  – More sophisticated algorithms choose the step size automatically and converge faster

• Choosing a better starting point can also have an impact

• Gradient descent and its stochastic version are very simple algorithms

  – Yet, almost all the algorithms we will learn in this class can be traced back to gradient descent algorithms for different loss functions and different hypothesis spaces
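A toy illustration of a decreasing learning rate, assuming the schedule r_t = r0 / (1 + t) (one common choice, not the only one), on the one-dimensional function J(w) = (w − 3)²:

```python
def gradient(w):
    """Gradient of the toy objective J(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w = 0.0
r0 = 0.4
for t in range(2000):
    r = r0 / (1.0 + t)        # step size shrinks toward zero over time
    w = w - r * gradient(w)
print(abs(w - 3.0) < 0.01)    # True: w has drifted close to the minimum
```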

Linear regression: Summary

• What we want: Predict a real-valued output using a feature representation of the input

• Assumption: The output is a linear function of the inputs

• Learning by minimizing the total cost

  – Gradient descent and stochastic gradient descent to find the best weight vector

  – This particular optimization can also be solved directly by framing the problem as a matrix problem

Exercises

1. Use the gradient descent algorithms to solve the mileage problem (on paper, or write a small program).

2. LMS regression can be solved analytically. Given a dataset D = { (x1, y1), (x2, y2), ⋯, (xm, ym) }, define the matrix X whose rows are the input vectors and the vector Y whose elements are the outputs:

       X = [x1ᵀ; x2ᵀ; ⋯; xmᵀ],    Y = [y1; y2; ⋯; ym]

   Show that the optimization problem we saw earlier is equivalent to

       min_w (Y − Xw)ᵀ(Y − Xw)

   This can be solved analytically. Show that the solution w* is

       w* = (XᵀX)⁻¹ XᵀY

   Hint: You have to take the derivative of the objective with respect to the vector w and set it to zero.
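Exercise 2's closed-form solution can be checked numerically on the mileage data. A dependency-free sketch: form the normal equations (XᵀX)w = XᵀY, solve them with a small Gaussian elimination, then verify that the gradient vanishes at the solution:

```python
X = [[1, 31.5, 6], [1, 36.2, 2], [1, 43.1, 0], [1, 27.6, 2]]
Y = [21, 25, 18, 30]

# Form the normal equations (X^T X) w = X^T Y.
d = 3
A = [[sum(row[i] * row[j] for row in X) for j in range(d)] for i in range(d)]
b = [sum(X[k][i] * Y[k] for k in range(len(Y))) for i in range(d)]

# Solve A w = b by Gaussian elimination with partial pivoting.
for i in range(d):
    p = max(range(i, d), key=lambda row: abs(A[row][i]))
    A[i], A[p] = A[p], A[i]
    b[i], b[p] = b[p], b[i]
    for r in range(i + 1, d):
        f = A[r][i] / A[i][i]
        for c in range(i, d):
            A[r][c] -= f * A[i][c]
        b[r] -= f * b[i]
w = [0.0] * d
for i in reversed(range(d)):
    w[i] = (b[i] - sum(A[i][c] * w[c] for c in range(i + 1, d))) / A[i][i]

# At w*, the gradient is zero: sum_i (y_i - w^T x_i) x_ij = 0 for every j.
residuals = [y - sum(wj * xj for wj, xj in zip(w, x)) for x, y in zip(X, Y)]
print(all(abs(sum(residuals[k] * X[k][j] for k in range(4))) < 1e-6
          for j in range(d)))  # True
```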