MS&E 226: Fundamentals of Data Science

Lecture 6: Bias and variance
Ramesh Johari

Our plan today

In general, when creating predictive models, we are trading off two different goals:
▶ On one hand, we want models to fit the training data well.
▶ On the other hand, we want to avoid models that become so finely tuned to the training data that they perform poorly on new data.

In this lecture we develop a systematic vocabulary for talking about this kind of trade-off, through the notions of bias and variance.

Conditional expectation

Given the population model for $\tilde{X}$ and $Y$, suppose we are allowed to choose any predictive model $\hat{f}$ we want. What is the best one?

$$\text{minimize } E_{\tilde{X}, Y}\big[(Y - \hat{f}(\tilde{X}))^2\big].$$

(Here the expectation is over $(\tilde{X}, Y)$.)

Theorem. The predictive model that minimizes squared error is $\hat{f}(\tilde{X}) = E_Y[Y \mid \tilde{X}]$.

Proof:

$$E_{\tilde{X}, Y}\big[(Y - \hat{f}(\tilde{X}))^2\big]
= E_{\tilde{X}, Y}\big[(Y - E_Y[Y \mid \tilde{X}] + E_Y[Y \mid \tilde{X}] - \hat{f}(\tilde{X}))^2\big]$$
$$= E_{\tilde{X}, Y}\big[(Y - E_Y[Y \mid \tilde{X}])^2\big]
+ E_{\tilde{X}}\big[(E_Y[Y \mid \tilde{X}] - \hat{f}(\tilde{X}))^2\big]
+ 2\, E_{\tilde{X}, Y}\big[(Y - E_Y[Y \mid \tilde{X}])(E_Y[Y \mid \tilde{X}] - \hat{f}(\tilde{X}))\big].$$

The first two terms are nonnegative, and the second is minimized (it equals zero) if $\hat{f}(\tilde{X}) = E_Y[Y \mid \tilde{X}]$. For the third term, using the tower property of conditional expectation:

$$E_{\tilde{X}, Y}\big[(Y - E_Y[Y \mid \tilde{X}])(E_Y[Y \mid \tilde{X}] - \hat{f}(\tilde{X}))\big]
= E_{\tilde{X}}\Big[ E_Y\big[Y - E_Y[Y \mid \tilde{X}] \,\big|\, \tilde{X}\big]\,(E_Y[Y \mid \tilde{X}] - \hat{f}(\tilde{X})) \Big] = 0.$$

So the squared-error-minimizing solution is to choose $\hat{f}(\tilde{X}) = E_Y[Y \mid \tilde{X}]$.

Why don't we just choose $E_Y[Y \mid \tilde{X}]$ as our predictive model? Because we don't know the distribution of $(\tilde{X}, Y)$! In other words: we don't know the population model.

Nevertheless, the preceding result is a useful guide:
▶ It provides the benchmark that every squared-error-minimizing predictive model is striving for: approximate the conditional expectation.
▶ It provides intuition for why linear regression approximates the conditional mean.

Population model

For the rest of the lecture write

$$Y = f(\tilde{X}) + \varepsilon,$$

where $E_\varepsilon[\varepsilon \mid \tilde{X}] = 0$. In other words, $f(\tilde{X}) = E_Y[Y \mid \tilde{X}]$.

We make the assumption that $\text{Var}(\varepsilon \mid \tilde{X}) = \sigma^2$ (i.e., it does not depend on $\tilde{X}$). We will make additional assumptions about the population model as we go along.

Prediction error revisited

A note on prediction error and conditioning

When you see "prediction error", it typically means

$$E\big[(Y - \hat{f}(\tilde{X}))^2 \,\big|\, (*)\big],$$

where $(*)$ can be one of many things:
▶ $\mathbf{X}, \mathbf{Y}, \tilde{X}$;
▶ $\mathbf{X}, \mathbf{Y}$;
▶ $\mathbf{X}, \tilde{X}$;
▶ $\mathbf{X}$;
▶ nothing.

As long as we don't condition on both $\mathbf{X}$ and $\mathbf{Y}$, the model is random!

Models and conditional expectation

So now suppose we have data $\mathbf{X}, \mathbf{Y}$, and use it to build a model $\hat{f}$. What is the prediction error if we see a new $\tilde{X}$?

$$E_Y\big[(Y - \hat{f}(\tilde{X}))^2 \,\big|\, \mathbf{X}, \mathbf{Y}, \tilde{X}\big]
= E_Y\big[(Y - f(\tilde{X}))^2 \,\big|\, \tilde{X}\big] + (\hat{f}(\tilde{X}) - f(\tilde{X}))^2
= \sigma^2 + (\hat{f}(\tilde{X}) - f(\tilde{X}))^2.$$

I.e.: when minimizing mean squared error, "good" models should behave like conditional expectation. (This is just another way of deriving that the prediction-error-minimizing solution is the conditional expectation.) Our goal: understand the second term.

Models and conditional expectation [∗]

Proof of the preceding statement: the proof is essentially identical to the earlier proof for conditional expectation:

$$E_Y\big[(Y - \hat{f}(\tilde{X}))^2 \,\big|\, \mathbf{X}, \mathbf{Y}, \tilde{X}\big]
= E_Y\big[(Y - f(\tilde{X}) + f(\tilde{X}) - \hat{f}(\tilde{X}))^2 \,\big|\, \mathbf{X}, \mathbf{Y}, \tilde{X}\big]$$
$$= E_Y\big[(Y - f(\tilde{X}))^2 \,\big|\, \tilde{X}\big] + (\hat{f}(\tilde{X}) - f(\tilde{X}))^2
+ 2\, E_Y\big[Y - f(\tilde{X}) \,\big|\, \tilde{X}\big]\,(f(\tilde{X}) - \hat{f}(\tilde{X}))$$
$$= E_Y\big[(Y - f(\tilde{X}))^2 \,\big|\, \tilde{X}\big] + (\hat{f}(\tilde{X}) - f(\tilde{X}))^2,$$

because $E_Y[Y - f(\tilde{X}) \mid \tilde{X}] = E_Y[Y - E_Y[Y \mid \tilde{X}] \mid \tilde{X}] = 0$.
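To make this identity concrete, here is a minimal R sketch (hypothetical, not from the slides) that checks numerically that, for a fixed fitted model and a fixed new covariate value, the expected squared prediction error is approximately $\sigma^2 + (\hat{f}(\tilde{X}) - f(\tilde{X}))^2$. The particular functions f and f.hat and the value x.new below are made up purely for illustration.

## Hypothetical R sketch: check that E_Y[(Y - f.hat(x))^2 | x] = sigma^2 + (f.hat(x) - f(x))^2.
set.seed(1)
f     <- function(x) 1 + 2 * x      # assumed true conditional expectation E[Y | X = x]
f.hat <- function(x) 0.8 + 2.1 * x  # some fixed fitted model, imagined for illustration
sigma <- 1
x.new <- 1.5                        # the new covariate value

y.new <- f(x.new) + rnorm(1e6, sd = sigma)    # many draws of Y at x.new
mean((y.new - f.hat(x.new))^2)                # simulated prediction error
sigma^2 + (f.hat(x.new) - f(x.new))^2         # sigma^2 + (f.hat - f)^2: should match

The two printed numbers should agree up to simulation noise; the gap $(\hat{f}(\tilde{X}) - f(\tilde{X}))^2$ is exactly the quantity the rest of the lecture decomposes into bias and variance.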
A thought experiment

Our goal is to understand

$$(\hat{f}(\tilde{X}) - f(\tilde{X}))^2. \quad (**)$$

Here is one way we might think about the quality of our modeling approach:
▶ Fix the design matrix $\mathbf{X}$.
▶ Generate data $\mathbf{Y}$ many times (parallel "universes").
▶ In each universe, create an $\hat{f}$.
▶ In each universe, evaluate $(**)$.

What kinds of things can we evaluate?
▶ If $\hat{f}(\tilde{X})$ is "close" to the conditional expectation, then on average in our universes, it should look like $f(\tilde{X})$.
▶ $\hat{f}(\tilde{X})$ might be close to $f(\tilde{X})$ on average, but still vary wildly across our universes.

The first is bias. The second is variance.

Example

Let's carry out some simulations with a synthetic model. Population model:
▶ We generate $X_1, X_2$ as i.i.d. $N(0, 1)$ random variables.
▶ Given $X_1, X_2$, the distribution of $Y$ is given by
$$Y = 1 + X_1 + 2 X_2 + \varepsilon,$$
where $\varepsilon$ is an independent $N(0, 1)$ random variable.

Thus $f(X_1, X_2) = 1 + X_1 + 2 X_2$.

Example: Parallel universes

Generate a design matrix $\mathbf{X}$ by sampling 100 i.i.d. values of $(X_1, X_2)$. Now we run $m = 500$ simulations. These are our "universes." In each simulation, generate data $\mathbf{Y}$ according to

$$Y_i = f(X_i) + \varepsilon_i,$$

where the $\varepsilon_i$ are i.i.d. $N(0, 1)$ random variables. In each simulation, what changes is the specific values of the $\varepsilon_i$. This is what it means to condition on $\mathbf{X}$.

Example: OLS in parallel universes

In each simulation, given the design matrix $\mathbf{X}$ and $\mathbf{Y}$, we build a fitted model $\hat{f}$ using ordinary least squares. Finally, we use our models to make predictions at a new covariate vector: $\tilde{X} = (1, 1)$. In each universe we evaluate

$$\text{error} = f(\tilde{X}) - \hat{f}(\tilde{X}).$$

We then create a density plot of these errors across the 500 universes. (An R sketch of this experiment appears below, after the bias-variance decomposition.)

Example: Results

We go through the process with three models: A, B, C. The three we try are:
▶ Y ~ 1 + X1
▶ Y ~ 1 + X1 + X2
▶ Y ~ 1 + X1 + X2 + ... + I(X1^4) + I(X2^4)

Which is which?

[Figure: density plot of the errors $f(\tilde{X}) - \hat{f}(\tilde{X})$ across the 500 universes, one density curve per model (A, B, C).]

A thought experiment: Aside

This is our first example of a frequentist approach:
▶ The population model is fixed.
▶ The data is random.
▶ We reason about a particular modeling procedure by considering what happens if we carry out the same procedure over and over again (in this case, fitting a model from data).

Examples: Bias and variance

Suppose you are predicting, e.g., wealth based on a collection of demographic covariates.
▶ Suppose we make a constant prediction: $\hat{f}(X_i) = c$ for all $i$. Is this biased? Does it have low variance?
▶ Suppose that every time you get your data, you use enough parameters to fit $\mathbf{Y}$ exactly: $\hat{f}(X_i) = Y_i$ for all $i$. Is this biased? Does it have low variance?

The bias-variance decomposition

We can be more precise about our discussion.

$$E_{Y, \mathbf{Y}}\big[(Y - \hat{f}(\tilde{X}))^2 \,\big|\, \mathbf{X}, \tilde{X}\big]
= \sigma^2 + E_{\mathbf{Y}}\big[(\hat{f}(\tilde{X}) - f(\tilde{X}))^2 \,\big|\, \mathbf{X}, \tilde{X}\big]$$
$$= \sigma^2
+ \big( f(\tilde{X}) - E_{\mathbf{Y}}[\hat{f}(\tilde{X}) \mid \mathbf{X}, \tilde{X}] \big)^2
+ E_{\mathbf{Y}}\Big[ \big( \hat{f}(\tilde{X}) - E_{\mathbf{Y}}[\hat{f}(\tilde{X}) \mid \mathbf{X}, \tilde{X}] \big)^2 \,\Big|\, \mathbf{X}, \tilde{X} \Big].$$

The first term is the irreducible error. The second term is BIAS². The third term is VARIANCE.

The bias-variance decomposition measures how sensitive prediction error is to changes in the training data (in this case, $\mathbf{Y}$).
▶ If there are systematic errors in prediction made regardless of the training data, then there is high bias.
▶ If the fitted model is very sensitive to the choice of training data, then there is high variance.
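As a companion to the parallel-universe example above, here is an R sketch of that experiment for one candidate model, together with empirical estimates of the BIAS² and VARIANCE terms at $\tilde{X} = (1, 1)$. The seed, the choice of the model Y ~ 1 + X1 + X2, and the use of replicate are incidental choices for illustration, not taken from the slides.

## R sketch of the parallel-universe experiment for the synthetic model
## f(X1, X2) = 1 + X1 + 2*X2 with sigma = 1.
set.seed(226)                         # arbitrary seed
n <- 100; m <- 500
X1 <- rnorm(n); X2 <- rnorm(n)        # fixed design matrix X
f  <- function(x1, x2) 1 + x1 + 2 * x2
x.new <- data.frame(X1 = 1, X2 = 1)   # new covariate vector (1, 1)

f.hat.new <- replicate(m, {
  Y   <- f(X1, X2) + rnorm(n)         # one "universe": redraw Y given X
  fit <- lm(Y ~ 1 + X1 + X2)          # fit by OLS
  predict(fit, newdata = x.new)       # this universe's prediction at (1, 1)
})

plot(density(f(1, 1) - f.hat.new))            # density of errors across universes
(bias2    <- (f(1, 1) - mean(f.hat.new))^2)   # empirical BIAS^2 at (1, 1)
(variance <- var(f.hat.new))                  # empirical VARIANCE at (1, 1)

Rerunning the sketch with the smaller model Y ~ 1 + X1 should produce a noticeably larger bias² term (it omits X2, which carries a coefficient of 2), while the fourth-degree model should produce a larger variance; comparing these runs is one way to answer the "Which is which?" question above.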
The bias-variance decomposition: Proof [∗]

Proof: We already showed that

$$E_Y\big[(Y - \hat{f}(\tilde{X}))^2 \,\big|\, \mathbf{X}, \mathbf{Y}, \tilde{X}\big] = \sigma^2 + (\hat{f}(\tilde{X}) - f(\tilde{X}))^2.$$

Take expectations over $\mathbf{Y}$:

$$E_{Y, \mathbf{Y}}\big[(Y - \hat{f}(\tilde{X}))^2 \,\big|\, \mathbf{X}, \tilde{X}\big]
= \sigma^2 + E_{\mathbf{Y}}\big[(\hat{f}(\tilde{X}) - f(\tilde{X}))^2 \,\big|\, \mathbf{X}, \tilde{X}\big].$$

Add and subtract $E_{\mathbf{Y}}[\hat{f}(\tilde{X}) \mid \mathbf{X}, \tilde{X}]$ in the second term:

$$E_{\mathbf{Y}}\big[(\hat{f}(\tilde{X}) - f(\tilde{X}))^2 \,\big|\, \mathbf{X}, \tilde{X}\big]
= E_{\mathbf{Y}}\Big[ \big( \hat{f}(\tilde{X}) - E_{\mathbf{Y}}[\hat{f}(\tilde{X}) \mid \mathbf{X}, \tilde{X}]
+ E_{\mathbf{Y}}[\hat{f}(\tilde{X}) \mid \mathbf{X}, \tilde{X}] - f(\tilde{X}) \big)^2 \,\Big|\, \mathbf{X}, \tilde{X} \Big].$$

Proof (continued): Expand the square:

$$E_{\mathbf{Y}}\big[(\hat{f}(\tilde{X}) - f(\tilde{X}))^2 \,\big|\, \mathbf{X}, \tilde{X}\big]
= E_{\mathbf{Y}}\Big[ \big( \hat{f}(\tilde{X}) - E_{\mathbf{Y}}[\hat{f}(\tilde{X}) \mid \mathbf{X}, \tilde{X}] \big)^2 \,\Big|\, \mathbf{X}, \tilde{X} \Big]
+ \big( E_{\mathbf{Y}}[\hat{f}(\tilde{X}) \mid \mathbf{X}, \tilde{X}] - f(\tilde{X}) \big)^2$$
$$+ 2\, E_{\mathbf{Y}}\Big[ \big( \hat{f}(\tilde{X}) - E_{\mathbf{Y}}[\hat{f}(\tilde{X}) \mid \mathbf{X}, \tilde{X}] \big)
\big( E_{\mathbf{Y}}[\hat{f}(\tilde{X}) \mid \mathbf{X}, \tilde{X}] - f(\tilde{X}) \big) \,\Big|\, \mathbf{X}, \tilde{X} \Big].$$

(The conditional expectation drops out of the middle term, because given $\mathbf{X}$ and $\tilde{X}$ it is no longer random.)

The first term is the VARIANCE. The second term is the BIAS². We have to show the third term is zero.

Proof (continued): We have

$$E_{\mathbf{Y}}\Big[ \big( \hat{f}(\tilde{X}) - E_{\mathbf{Y}}[\hat{f}(\tilde{X}) \mid \mathbf{X}, \tilde{X}] \big)
\big( E_{\mathbf{Y}}[\hat{f}(\tilde{X}) \mid \mathbf{X}, \tilde{X}] - f(\tilde{X}) \big) \,\Big|\, \mathbf{X}, \tilde{X} \Big]$$
$$= \big( E_{\mathbf{Y}}[\hat{f}(\tilde{X}) \mid \mathbf{X}, \tilde{X}] - f(\tilde{X}) \big)\,
E_{\mathbf{Y}}\big[ \hat{f}(\tilde{X}) - E_{\mathbf{Y}}[\hat{f}(\tilde{X}) \mid \mathbf{X}, \tilde{X}] \,\big|\, \mathbf{X}, \tilde{X} \big] = 0,$$

using the tower property of conditional expectation.

Other forms of the bias-variance decomposition

The decomposition varies depending on what you condition on, but it always takes a similar form. For example, suppose that you also wish to average over the particular covariate vector $\tilde{X}$ at which you make predictions; in particular, suppose that $\tilde{X}$ is also drawn from the population model. Then the bias-variance decomposition can be shown to be:

$$E_{\tilde{X}, Y, \mathbf{Y}}\big[(Y - \hat{f}(\tilde{X}))^2 \,\big|\, \mathbf{X}\big]
= \text{Var}(Y)
+ \big( E_{\tilde{X}, \mathbf{Y}}[f(\tilde{X}) - \hat{f}(\tilde{X}) \mid \mathbf{X}] \big)^2
+ E_{\tilde{X}, \mathbf{Y}}\Big[ \big( \hat{f}(\tilde{X}) - E_{\tilde{X}, \mathbf{Y}}[\hat{f}(\tilde{X}) \mid \mathbf{X}] \big)^2 \,\Big|\, \mathbf{X} \Big].$$

(In this case the first term is the irreducible error.)

Example: k-nearest-neighbor fit

Generate data the same way as before: we generate 1000 values of $X_1, X_2$ as i.i.d. $N(0, 1)$ random variables.
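The k-nearest-neighbor example is cut off in this extract, so the following R sketch is only a guess at its setup: a hand-rolled k-NN regression fit to 1000 points from the same synthetic population model, evaluated at $\tilde{X} = (1, 1)$ for a small and a large value of k. The distance metric, the specific values of k, and the evaluation point are assumptions for illustration, not taken from the slides.

## Hypothetical k-NN sketch: same population model as above, n = 1000.
set.seed(226)
n <- 1000
X1 <- rnorm(n); X2 <- rnorm(n)
f  <- function(x1, x2) 1 + x1 + 2 * x2
Y  <- f(X1, X2) + rnorm(n)

# Predict at (x1, x2) by averaging Y over the k nearest training points.
knn.predict <- function(x1, x2, k) {
  d <- sqrt((X1 - x1)^2 + (X2 - x2)^2)   # Euclidean distances to training points
  mean(Y[order(d)[1:k]])
}

knn.predict(1, 1, k = 5)     # small k: flexible fit (low bias, high variance)
knn.predict(1, 1, k = 500)   # large k: smoother fit (higher bias, lower variance)
f(1, 1)                      # true conditional expectation at (1, 1)

Rerunning this with fresh draws of $\mathbf{Y}$, as in the parallel-universe experiment, would typically show the small-k predictions scattering widely around $f(1,1)$, while the large-k predictions vary less but drift systematically away from it: the bias-variance trade-off again.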
