Linear Regression
Linear regression

Fitting a line to a bunch of points.

Example: college GPAs

Distribution of GPAs of students at a certain Ivy League university. What GPA should we predict for a random student from this group?
• Without further information, predict the mean, 2.47.
• What is the average squared error of this prediction? That is, E[((student's GPA) − (predicted GPA))^2]? The variance of the distribution, 0.55.

Better predictions with more information

We also have SAT scores of all students. The mean squared error (MSE) drops to 0.43.
This is a regression problem with:
• Predictor variable: SAT score
• Response variable: college GPA

Parametrizing a line

A line can be parameterized as y = ax + b (a: slope, b: intercept).

The line fitting problem

Pick a line (a, b) based on (x^(1), y^(1)), ..., (x^(n), y^(n)) ∈ R × R.
• x^(i), y^(i) are the predictor and response variables, e.g. the SAT score and GPA of the ith student.
• Minimize the mean squared error
    MSE(a, b) = (1/n) Σ_{i=1}^n (y^(i) − (a x^(i) + b))^2.
This is the loss function.

Minimizing the loss function

Given (x^(1), y^(1)), ..., (x^(n), y^(n)), minimize
    L(a, b) = Σ_{i=1}^n (y^(i) − (a x^(i) + b))^2.

Multivariate regression: diabetes study

Data from n = 442 diabetes patients. For each patient:
• 10 features x = (x_1, ..., x_10): age, sex, body mass index, average blood pressure, and six blood serum measurements.
• A real value y: the progression of the disease a year later.
Regression problem:
• Response y ∈ R
• Predictor variables x ∈ R^10

Least-squares regression

Linear function of 10 variables: for x ∈ R^10,
    f(x) = w_1 x_1 + w_2 x_2 + ··· + w_10 x_10 + b = w · x + b,
where w = (w_1, w_2, ..., w_10). Penalize error using the squared loss (y − (w · x + b))^2.
Least-squares regression:
• Given: data (x^(1), y^(1)), ..., (x^(n), y^(n)) ∈ R^d × R
• Return: the linear function given by w ∈ R^d and b ∈ R
• Goal: minimize the loss function
    L(w, b) = Σ_{i=1}^n (y^(i) − (w · x^(i) + b))^2.

Back to the diabetes data

• No predictor variables: mean squared error (MSE) = 5930
• One predictor ('bmi'): MSE = 3890
• Two predictors ('bmi', 'serum5'): MSE = 3205
• All ten predictors: MSE = 2860

Least-squares solution 1

Linear function of d variables given by w ∈ R^d and b ∈ R:
    f(x) = w_1 x_1 + w_2 x_2 + ··· + w_d x_d + b = w · x + b.
Assimilate the intercept b into w:
• Add a new feature that is identically 1: let x̃ = (1, x) ∈ R^{d+1}, e.g. x = (4, 0, 2, ..., 3) becomes x̃ = (1, 4, 0, 2, ..., 3).
• Set w̃ = (b, w) ∈ R^{d+1}.
• Then f(x) = w · x + b = w̃ · x̃.
Goal: find the w̃ ∈ R^{d+1} that minimizes
    L(w̃) = Σ_{i=1}^n (y^(i) − w̃ · x̃^(i))^2.

Least-squares solution 2

Put the augmented predictor vectors x̃^(1), ..., x̃^(n) in the rows of a matrix X and the responses y^(1), ..., y^(n) in a vector y. Then the loss function is
    L(w̃) = Σ_{i=1}^n (y^(i) − w̃ · x̃^(i))^2 = ‖y − X w̃‖^2,
and it is minimized at
    w̃ = (X^T X)^{−1} (X^T y).

Generalization behavior of least-squares regression

Given a training set (x^(1), y^(1)), ..., (x^(n), y^(n)) ∈ R^d × R, find the linear function, given by w ∈ R^d and b ∈ R, that minimizes the squared loss
    L(w, b) = Σ_{i=1}^n (y^(i) − (w · x^(i) + b))^2.
Is the training loss a good estimate of future performance?
• If n is large enough: maybe.
• Otherwise: probably an underestimate.
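As a concrete check of the closed-form solution w̃ = (X^T X)^{−1}(X^T y) on the diabetes data discussed above, here is a minimal sketch. It assumes NumPy and scikit-learn (whose load_diabetes provides the same 442-patient, 10-feature data); the printed training MSE should land near the "all ten predictors" figure quoted above, though preprocessing details may shift it slightly.

```python
# Minimal sketch: least-squares via the normal equations from the slides,
# w_tilde = (X^T X)^{-1} (X^T y), with the intercept absorbed as a constant-1 feature.
# Uses scikit-learn's diabetes dataset (n = 442 patients, 10 features); the exact MSE
# may differ slightly from the slides depending on how the features were preprocessed.
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)          # X: (442, 10), y: disease progression

# Add the all-ones feature so that x_tilde = (1, x) and w_tilde = (b, w).
X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])

# Solve (X^T X) w_tilde = X^T y; np.linalg.solve is more stable than an explicit inverse.
w_tilde = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
b, w = w_tilde[0], w_tilde[1:]

# Training mean squared error of the fitted linear function f(x) = w . x + b.
mse = np.mean((y - (X @ w + b)) ** 2)
print(f"intercept b = {b:.1f}, training MSE = {mse:.0f}")
```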
Better error estimates

Recall: k-fold cross-validation.
• Divide the data set into k equal-sized groups S_1, ..., S_k.
• For i = 1 to k:
    • Train a regressor on all data except S_i.
    • Let E_i be its error on S_i.
• Error estimate: the average of E_1, ..., E_k.
A nagging question: when n is small, should we really be minimizing the squared loss
    L(w, b) = Σ_{i=1}^n (y^(i) − (w · x^(i) + b))^2 ?

Ridge regression

Minimize the squared loss plus a term that penalizes "complex" w:
    L(w, b) = Σ_{i=1}^n (y^(i) − (w · x^(i) + b))^2 + λ‖w‖^2.
Adding a penalty term like this is called regularization.
Put the predictor vectors in a matrix X and the responses in a vector y; then the minimizer is
    w = (X^T X + λI)^{−1} (X^T y).

Toy example

Training and test sets of 100 points each:
• x ∈ R^100, each feature x_i is Gaussian N(0, 1)
• y = x_1 + ··· + x_10 + N(0, 1)

    λ           training MSE    test MSE
    0.00001     0.00            585.81
    0.0001      0.00            564.28
    0.001       0.00            404.08
    0.01        0.01             83.48
    0.1         0.03             19.26
    1.0         0.07              7.02
    10.0        0.35              2.84
    100.0       2.40              5.79
    1000.0      8.19             10.97
    10000.0     10.83            12.63

The lasso

Popular "shrinkage" estimators:
• Ridge regression:
    L(w, b) = Σ_{i=1}^n (y^(i) − (w · x^(i) + b))^2 + λ‖w‖_2^2
• Lasso: tends to produce sparse w.
    L(w, b) = Σ_{i=1}^n (y^(i) − (w · x^(i) + b))^2 + λ‖w‖_1
Toy example: the lasso recovers the 10 relevant features plus a few more.
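For concreteness, here is a sketch of the toy example above: ridge regression fitted with the closed form w = (X^T X + λI)^{−1}(X^T y), followed by a lasso fit to show the sparsity effect. It assumes NumPy and scikit-learn; the random seed, the subset of λ values, and the lasso setting alpha=0.1 are illustrative choices, not from the slides, and the intercept is omitted since the data are zero-mean. Because the data are freshly sampled, the printed MSEs will not match the table exactly.

```python
# Minimal sketch of the toy example: ridge via the closed form
# w = (X^T X + lambda I)^{-1} (X^T y), and the lasso via scikit-learn, on data
# generated as in the slides (100 points, 100 Gaussian features, y = x_1 + ... + x_10 + noise).
# Exact MSE values will differ from the table since the data are random.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, k = 100, 100, 10                        # points, features, relevant features
X_train, X_test = rng.standard_normal((n, d)), rng.standard_normal((n, d))
y_train = X_train[:, :k].sum(axis=1) + rng.standard_normal(n)
y_test = X_test[:, :k].sum(axis=1) + rng.standard_normal(n)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (intercept omitted: the features are zero-mean)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [0.001, 0.1, 10.0, 1000.0]:
    w = ridge_fit(X_train, y_train, lam)
    train_mse = np.mean((y_train - X_train @ w) ** 2)
    test_mse = np.mean((y_test - X_test @ w) ** 2)
    print(f"ridge lambda={lam:>7}: train MSE={train_mse:6.2f}, test MSE={test_mse:7.2f}")

# The lasso's L1 penalty drives most coordinates of w exactly to zero.
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
print("lasso nonzero coefficients:", np.count_nonzero(lasso.coef_))
```

As in the table, very small λ drives the training MSE to zero while the test MSE blows up, and a moderate λ trades a little training error for much better test error; the nonzero-coefficient count illustrates the "10 relevant features plus a few more" behavior.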