Linear regression

Fitting a line to a bunch of points.
Example: college GPAs

• Distribution of GPAs of students at a certain Ivy League university.
• What GPA to predict for a random student from this group?
• Without further information, predict the mean, 2.47.
• What is the average squared error of this prediction? That is, E[((student's GPA) − (predicted GPA))²]? The variance of the distribution, 0.55.

Better predictions with more information

• We also have the SAT scores of all the students.
• Using them, the mean squared error (MSE) drops to 0.43.
• This is a regression problem with:
  • Predictor variable: SAT score
  • Response variable: College GPA

Parametrizing a line

A line can be parameterized as y = ax + b (a: slope, b: intercept).

The line fitting problem

Pick a line (a, b) based on (x^(1), y^(1)), ..., (x^(n), y^(n)) ∈ R × R.

• x^(i), y^(i) are the predictor and response variables, e.g. the SAT score and GPA of the ith student.
• Minimize the mean squared error

    MSE(a, b) = (1/n) Σ_{i=1}^n (y^(i) − (a x^(i) + b))².

  This is the loss function.
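The claim that the mean prediction's average squared error equals the variance can be checked numerically. A minimal numpy sketch on a synthetic sample (the normal distribution here is an illustrative assumption; the slides only give the mean 2.47 and variance 0.55):

```python
import numpy as np

# Synthetic stand-in for the GPA data: mean 2.47, variance 0.55 (distribution shape assumed).
rng = np.random.default_rng(0)
gpas = rng.normal(2.47, np.sqrt(0.55), size=10_000)

mean_pred = gpas.mean()
mse_of_mean = np.mean((gpas - mean_pred) ** 2)

# The MSE of the constant prediction "mean" is exactly the sample variance.
assert np.isclose(mse_of_mean, gpas.var())
```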
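For the one-variable case, the MSE-minimizing line has a closed form: a = Cov(x, y)/Var(x) and b = ȳ − a·x̄. A minimal numpy sketch on synthetic data (the data-generating constants are illustrative, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)                              # stand-in predictor (e.g. SAT scores)
y = 0.5 * x + 1.0 + rng.normal(scale=0.1, size=200)   # stand-in response (e.g. GPAs)

# Closed-form minimizer of MSE(a, b): a = Cov(x, y) / Var(x), b = mean(y) - a * mean(x)
a = np.cov(x, y, bias=True)[0, 1] / x.var()
b = y.mean() - a * x.mean()

mse = np.mean((y - (a * x + b)) ** 2)
# Any other line does at least as badly, e.g. one with a slightly perturbed slope:
assert mse <= np.mean((y - ((a + 0.01) * x + b)) ** 2)
```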
Minimizing the loss function

Given (x^(1), y^(1)), ..., (x^(n), y^(n)), minimize

    L(a, b) = Σ_{i=1}^n (y^(i) − (a x^(i) + b))².

Multivariate regression: diabetes study

Data from n = 442 diabetes patients. For each patient:
• 10 features x = (x_1, ..., x_10): age, sex, body mass index, average blood pressure, and six blood serum measurements.
• A real value y: the progression of the disease a year later.

Regression problem:
• Response y ∈ R
• Predictor variables x ∈ R^10

Least-squares regression

A linear function of 10 variables: for x ∈ R^10,

    f(x) = w_1 x_1 + w_2 x_2 + ··· + w_10 x_10 + b = w · x + b,

where w = (w_1, w_2, ..., w_10). Penalize error using the squared loss (y − (w · x + b))².

Least-squares regression:
• Given: data (x^(1), y^(1)), ..., (x^(n), y^(n)) ∈ R^d × R
• Return: the linear function given by w ∈ R^d and b ∈ R
• Goal: minimize the loss function

    L(w, b) = Σ_{i=1}^n (y^(i) − (w · x^(i) + b))².

Back to the diabetes data

• No predictor variables: mean squared error (MSE) = 5930
• One predictor ('bmi'): MSE = 3890
• Two predictors ('bmi', 'serum5'): MSE = 3205
• All ten predictors: MSE = 2860
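Fitting such a multivariate least-squares model is a one-liner with numpy. A minimal sketch on synthetic data of the same shape as the diabetes study (n = 442, d = 10); the data itself is randomly generated here, not the real patient data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 442, 10                       # same shape as the diabetes data, but synthetic
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 3.0 + rng.normal(scale=0.5, size=n)

# Append a column of ones so the intercept b is learned alongside w.
X1 = np.hstack([X, np.ones((n, 1))])
wb, *_ = np.linalg.lstsq(X1, y, rcond=None)
w, b = wb[:-1], wb[-1]

mse = np.mean((y - (X @ w + b)) ** 2)
# Using the predictors beats predicting the mean (whose MSE is the variance of y).
assert mse < y.var()
```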
Least-squares solution

A linear function of d variables is given by w ∈ R^d and b ∈ R:

    f(x) = w_1 x_1 + w_2 x_2 + ··· + w_d x_d + b = w · x + b.

Assimilate the intercept b into w:
• Add a new feature that is identically 1: let x̃ = (1, x) ∈ R^{d+1}. For instance, (4, 0, 2, ··· , 3) becomes (1, 4, 0, 2, ··· , 3).
• Set w̃ = (b, w) ∈ R^{d+1}.
• Then f(x) = w · x + b = w̃ · x̃.

Goal: find w̃ ∈ R^{d+1} that minimizes

    L(w̃) = Σ_{i=1}^n (y^(i) − w̃ · x̃^(i))².

Put the vectors x̃^(1), ..., x̃^(n) as the rows of a matrix X̃ and the responses in a vector y = (y^(1), ..., y^(n)). Then the loss function is

    L(w̃) = Σ_{i=1}^n (y^(i) − w̃ · x̃^(i))² = ‖y − X̃w̃‖²,

and it is minimized at w̃ = (X̃^T X̃)^{-1} (X̃^T y).

Generalization behavior of least-squares regression
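The closed-form solution w̃ = (X̃^T X̃)^{-1}(X̃^T y) can be verified directly against a library solver. A minimal numpy sketch (random data, dimensions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

# x-tilde: prepend the constant feature 1 to every row, as in the slides.
Xt = np.hstack([np.ones((n, 1)), X])

# Closed-form minimizer of ||y - Xt w||^2 via the normal equations.
w_closed = np.linalg.inv(Xt.T @ Xt) @ (Xt.T @ y)

# It matches numpy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(Xt, y, rcond=None)
assert np.allclose(w_closed, w_lstsq)
```

In practice `np.linalg.lstsq` (or `np.linalg.solve` on the normal equations) is preferred over forming the explicit inverse, which is slower and less numerically stable.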
Given a training set (x^(1), y^(1)), ..., (x^(n), y^(n)) ∈ R^d × R, find a linear function, given by w ∈ R^d and b ∈ R, that minimizes the squared loss

    L(w, b) = Σ_{i=1}^n (y^(i) − (w · x^(i) + b))².

Is the training loss a good estimate of future performance?

• If n is large enough: maybe.
• Otherwise: probably an underestimate.
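The underestimate is easy to see when n is close to d: the fit can drive the training loss to nearly zero while the test loss stays large. A minimal numpy sketch (synthetic data; only 10 of the 100 features matter, as in the toy example later in these slides):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 100

def make_data(n):
    # Only the first 10 of the 100 features carry signal; noise std 1.
    X = rng.normal(size=(n, d))
    y = X[:, :10].sum(axis=1) + rng.normal(size=n)
    return X, y

Xtr, ytr = make_data(100)    # n is not larger than d: the model can interpolate
Xte, yte = make_data(1000)

w, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
train_mse = np.mean((ytr - Xtr @ w) ** 2)
test_mse = np.mean((yte - Xte @ w) ** 2)

# The training loss badly underestimates the error on fresh data.
assert train_mse < test_mse
```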
Better error estimates

Recall k-fold cross-validation:
• Divide the data set into k equal-sized groups S_1, ..., S_k.
• For i = 1 to k:
  • Train a regressor on all the data except S_i.
  • Let E_i be its error on S_i.
• Error estimate: the average of E_1, ..., E_k.

A nagging question: when n is small, should we be minimizing the squared loss

    L(w, b) = Σ_{i=1}^n (y^(i) − (w · x^(i) + b))² ?

Ridge regression

Minimize the squared loss plus a term that penalizes "complex" w:

    L(w, b) = Σ_{i=1}^n (y^(i) − (w · x^(i) + b))² + λ‖w‖².

Adding a penalty term like this is called regularization.

Put the predictor vectors in a matrix X and the responses in a vector y. Then the minimizer is

    w = (X^T X + λI)^{-1} (X^T y).
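The ridge minimizer also has a closed form, differing from ordinary least squares only by the λI term. A minimal numpy sketch (synthetic data; the intercept b is omitted for brevity, matching the matrix formula above, which penalizes only w):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize sum_i (y_i - w.x_i)^2 + lam * ||w||^2 (no intercept)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 20))
y = X[:, :5].sum(axis=1) + rng.normal(size=50)

w_small = ridge_fit(X, y, lam=0.01)
w_large = ridge_fit(X, y, lam=100.0)

# Larger lambda shrinks the weight vector toward zero.
assert np.linalg.norm(w_large) < np.linalg.norm(w_small)
```

Choosing λ is exactly where the k-fold cross-validation procedure above comes in: fit with several values of λ and keep the one with the smallest cross-validated error.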
Toy example

Training and test sets of 100 points each, with:
• x ∈ R^100, each feature x_i Gaussian N(0, 1)
• y = x_1 + ··· + x_10 + N(0, 1)

Ridge regression results:

    λ          training MSE    test MSE
    0.00001    0.00            585.81
    0.0001     0.00            564.28
    0.001      0.00            404.08
    0.01       0.01            83.48
    0.1        0.03            19.26
    1.0        0.07            7.02
    10.0       0.35            2.84
    100.0      2.40            5.79
    1000.0     8.19            10.97
    10000.0    10.83           12.63

The lasso

Popular "shrinkage" estimators:
• Ridge regression:

    L(w, b) = Σ_{i=1}^n (y^(i) − (w · x^(i) + b))² + λ‖w‖²₂

• The lasso, which tends to produce sparse w:

    L(w, b) = Σ_{i=1}^n (y^(i) − (w · x^(i) + b))² + λ‖w‖₁

On the toy example above, the lasso recovers the 10 relevant features plus a few more.
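Unlike ridge, the lasso has no closed-form solution, but it can be minimized iteratively. A sketch using proximal gradient descent (ISTA) with soft-thresholding; this is one standard solver, not necessarily the one behind the slides' numbers, the intercept is omitted, and the data and λ below are illustrative:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iters=5000):
    """Minimize sum_i (y_i - w.x_i)^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    n, d = X.shape
    w = np.zeros(d)
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ w - y)               # gradient of the squared-loss term
        z = w - step * grad
        # Soft-thresholding: the proximal operator of the L1 penalty.
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return w

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 30))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=100)

w = lasso_ista(X, y, lam=5.0)
# The lasso solution is sparse: many of the 30 coordinates are exactly zero.
assert np.sum(np.abs(w) > 1e-8) < 30
```

The exact zeros come from the soft-thresholding step, which is what makes the lasso useful for feature selection, as in the toy example.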