Linear Regression

Linear regression: fitting a line to a bunch of points.

Example: college GPAs

We have the distribution of GPAs of students at a certain Ivy League university. What GPA should we predict for a random student from this group?

  • Without further information, predict the mean, 2.47.
  • What is the average squared error of this prediction? That is, what is E[((student's GPA) − (predicted GPA))^2]? It is the variance of the distribution, 0.55.

Better predictions with more information

Suppose we also have the SAT scores of all the students. Using them, the mean squared error (MSE) drops to 0.43. This is a regression problem with:

  • Predictor variable: SAT score
  • Response variable: college GPA

Parametrizing a line

A line can be parametrized as y = ax + b (a: slope, b: intercept).

The line fitting problem

Pick a line (a, b) based on (x^(1), y^(1)), ..., (x^(n), y^(n)) ∈ R × R.

  • x^(i), y^(i) are the predictor and response variables, e.g. the SAT score and GPA of the i-th student.
  • Minimize the mean squared error

        MSE(a, b) = (1/n) ∑_{i=1}^n (y^(i) − (a x^(i) + b))^2.

    This is the loss function.

Minimizing the loss function

Given (x^(1), y^(1)), ..., (x^(n), y^(n)), minimize

    L(a, b) = ∑_{i=1}^n (y^(i) − (a x^(i) + b))^2.

Multivariate regression: diabetes study

Data from n = 442 diabetes patients. For each patient we have:

  • 10 features x = (x_1, ..., x_10): age, sex, body mass index, average blood pressure, and six blood serum measurements.
  • A real value y: the progression of the disease a year later.

This is a regression problem with:

  • response y ∈ R
  • predictor variables x ∈ R^10

Least-squares regression

A linear function of 10 variables: for x ∈ R^10,

    f(x) = w_1 x_1 + w_2 x_2 + ··· + w_10 x_10 + b = w · x + b,

where w = (w_1, w_2, ..., w_10). Penalize error using the squared loss (y − (w · x + b))^2.

Least-squares regression:

  • Given: data (x^(1), y^(1)), ..., (x^(n), y^(n)) ∈ R^d × R
  • Return: a linear function given by w ∈ R^d and b ∈ R
  • Goal: minimize the loss function

        L(w, b) = ∑_{i=1}^n (y^(i) − (w · x^(i) + b))^2.

Back to the diabetes data

  • No predictor variables: mean squared error (MSE) = 5930
  • One predictor ('bmi'): MSE = 3890
  • Two predictors ('bmi', 'serum5'): MSE = 3205
  • All ten predictors: MSE = 2860

Least-squares solution 1

A linear function of d variables is given by w ∈ R^d and b ∈ R:

    f(x) = w_1 x_1 + w_2 x_2 + ··· + w_d x_d + b = w · x + b.

Assimilate the intercept b into w:

  • Add a new feature that is identically 1: let x̃ = (1, x) ∈ R^{d+1}.
  • Set w̃ = (b, w) ∈ R^{d+1}.
  • Then f(x) = w · x + b = w̃ · x̃.

Goal: find the w̃ ∈ R^{d+1} that minimizes

    L(w̃) = ∑_{i=1}^n (y^(i) − w̃ · x̃^(i))^2.

Least-squares solution 2

Let X be the n × (d+1) matrix whose rows are x̃^(1), ..., x̃^(n), and let y = (y^(1), ..., y^(n)). Then the loss function is

    L(w̃) = ∑_{i=1}^n (y^(i) − w̃ · x̃^(i))^2 = ‖y − X w̃‖^2,

and it is minimized at

    w̃ = (X^T X)^{−1} (X^T y).

(A short code sketch of this closed-form solution appears at the end of this section.)

Generalization behavior of least-squares regression

Given a training set (x^(1), y^(1)), ..., (x^(n), y^(n)) ∈ R^d × R, we find the linear function, given by w ∈ R^d and b ∈ R, that minimizes the squared loss

    L(w, b) = ∑_{i=1}^n (y^(i) − (w · x^(i) + b))^2.

Is the training loss a good estimate of future performance?

  • If n is large enough: maybe.
  • Otherwise: it is probably an underestimate.
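As a concrete illustration, here is a minimal NumPy sketch of the closed-form solution. It assumes that the diabetes study described above is the dataset shipped with scikit-learn as sklearn.datasets.load_diabetes; the sample size and feature list appear to match, but that identification is an assumption, and a different copy of the data would give different numbers.

    # Minimal sketch: closed-form least squares, w~ = (X^T X)^{-1} (X^T y).
    # Assumption: scikit-learn's load_diabetes is the n = 442, d = 10 study above.
    import numpy as np
    from sklearn.datasets import load_diabetes

    X, y = load_diabetes(return_X_y=True)          # X has shape (442, 10)

    # Assimilate the intercept: add a feature that is identically 1.
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])

    # np.linalg.lstsq solves the same minimization as the explicit formula
    # (X^T X)^{-1} (X^T y), but in a numerically safer way.
    w_tilde, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)

    train_mse = np.mean((y - X_tilde @ w_tilde) ** 2)
    mean_only_mse = np.mean((y - y.mean()) ** 2)   # "no predictors": predict the mean
    print(f"MSE predicting the mean:     {mean_only_mse:.0f}")
    print(f"MSE with all ten predictors: {train_mse:.0f}")

If the dataset really is the one used above, these two numbers should come out near the 5930 and 2860 quoted earlier. Note that both are training errors, which is exactly why the generalization question matters.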
Better error estimates

Recall k-fold cross-validation (sketched in code at the end of this section):

  • Divide the data set into k equal-sized groups S_1, ..., S_k.
  • For i = 1 to k:
      • train a regressor on all the data except S_i;
      • let E_i be its error on S_i.
  • Error estimate: the average of E_1, ..., E_k.

A nagging question: when n is small, should we really be minimizing the squared loss

    L(w, b) = ∑_{i=1}^n (y^(i) − (w · x^(i) + b))^2 ?

Ridge regression

Minimize the squared loss plus a term that penalizes "complex" w:

    L(w, b) = ∑_{i=1}^n (y^(i) − (w · x^(i) + b))^2 + λ‖w‖^2.

Adding a penalty term like this is called regularization. Putting the predictor vectors in a matrix X and the responses in a vector y, the minimizer is

    w = (X^T X + λI)^{−1} (X^T y).

Toy example

Training and test sets of 100 points each:

  • x ∈ R^100, with each feature x_i Gaussian N(0, 1)
  • y = x_1 + ··· + x_10 + N(0, 1)

        λ          training MSE   test MSE
        0.00001        0.00        585.81
        0.0001         0.00        564.28
        0.001          0.00        404.08
        0.01           0.01         83.48
        0.1            0.03         19.26
        1.0            0.07          7.02
        10.0           0.35          2.84
        100.0          2.40          5.79
        1000.0         8.19         10.97
        10000.0       10.83         12.63

The lasso

Two popular "shrinkage" estimators:

  • Ridge regression:

        L(w, b) = ∑_{i=1}^n (y^(i) − (w · x^(i) + b))^2 + λ‖w‖_2^2

  • The lasso, which tends to produce sparse w:

        L(w, b) = ∑_{i=1}^n (y^(i) − (w · x^(i) + b))^2 + λ‖w‖_1

On the toy example, the lasso recovers the 10 relevant features plus a few more.
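Here is a minimal sketch of the k-fold cross-validation procedure recalled above. The names fit_fn and error_fn are hypothetical placeholders, not part of any library: fit_fn(X, y) is assumed to return a fitted regressor, and error_fn(model, X, y) to return its mean squared error on the given data.

    # k-fold cross-validation: average the errors on k held-out groups S_1, ..., S_k.
    import numpy as np

    def k_fold_error(X, y, fit_fn, error_fn, k=5, seed=0):
        """Estimate test error by k-fold cross-validation.

        fit_fn and error_fn are caller-supplied placeholders (see note above).
        """
        n = X.shape[0]
        indices = np.random.default_rng(seed).permutation(n)
        folds = np.array_split(indices, k)                  # the groups S_1, ..., S_k
        errors = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit_fn(X[train_idx], y[train_idx])      # train on everything but S_i
            errors.append(error_fn(model, X[test_idx], y[test_idx]))   # E_i
        return float(np.mean(errors))                       # average of E_1, ..., E_k

Plugging in a least-squares or ridge fit for fit_fn turns this into an error estimate that can also be used to choose the regularization parameter λ.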
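The toy experiment can be reproduced in a few lines. The sketch below uses the closed-form ridge solution from this section and, for the lasso, scikit-learn's Lasso class. The random seed, the grid of λ values, and the lasso's alpha are arbitrary choices (and scikit-learn scales its penalty differently from the λ above), so the exact numbers will not match the table; the qualitative picture should, though: test error falls and then rises with λ, and the lasso keeps only a handful of the 100 coefficients.

    # Toy example: training and test sets of n = 100 points, x in R^100 with
    # N(0, 1) features, and y = x_1 + ... + x_10 + N(0, 1) noise.
    import numpy as np
    from sklearn.linear_model import Lasso   # used only for the lasso comparison

    rng = np.random.default_rng(0)
    n, d = 100, 100

    def make_data():
        X = rng.standard_normal((n, d))
        y = X[:, :10].sum(axis=1) + rng.standard_normal(n)
        return X, y

    X_train, y_train = make_data()
    X_test, y_test = make_data()

    def ridge(X, y, lam):
        # Closed-form ridge solution w = (X^T X + lam I)^{-1} (X^T y). The
        # intercept is omitted since this toy data is centered at zero.
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    def mse(X, y, w):
        return np.mean((y - X @ w) ** 2)

    for lam in [1e-5, 1e-3, 1e-1, 1.0, 10.0, 100.0, 1000.0]:
        w = ridge(X_train, y_train, lam)
        print(f"lambda = {lam:<8g} train MSE = {mse(X_train, y_train, w):6.2f}  "
              f"test MSE = {mse(X_test, y_test, w):7.2f}")

    # Lasso: count how many of the 100 coefficients are nonzero. alpha = 0.1 is
    # an illustrative choice; scikit-learn's alpha is scaled differently from
    # the lambda used for ridge above.
    lasso = Lasso(alpha=0.1).fit(X_train, y_train)
    kept = np.flatnonzero(lasso.coef_)
    print(f"lasso keeps {kept.size} of {d} features: {kept}")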
