Linear regression
Fitting a line to a bunch of points.

Example: college GPAs

Distribution of GPAs of students at a certain Ivy League university.

What GPA to predict for a random student from this group?
• Without further information, predict the mean, 2.47.
• What is the average squared error of this prediction? That is, E[((student’s GPA) − (predicted GPA))^2]? It is the variance of the distribution, 0.55.

Better with more information

We also have the SAT scores of all students. Using SAT score to predict GPA, the mean squared error (MSE) drops to 0.43.

This is a regression problem with:
• Predictor variable: SAT score
• Response variable: College GPA
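As a quick check of the arithmetic above, here is a minimal NumPy sketch; the gpas array is a made-up stand-in, since the actual GPA data is not given.

```python
import numpy as np

# Made-up stand-in for the GPA data (the real distribution is not given).
rng = np.random.default_rng(0)
gpas = np.clip(rng.normal(loc=2.47, scale=0.74, size=1000), 0.0, 4.0)

# With no other information, predict the mean for every student.
prediction = gpas.mean()

# The average squared error of this constant prediction is exactly the variance.
mse = np.mean((gpas - prediction) ** 2)
print(prediction, mse, gpas.var())  # mse == gpas.var()
```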

Parametrizing a line

A line can be parameterized as y = ax + b (a: slope, b: intercept).

The line fitting problem

Pick a line (a, b) based on (x^(1), y^(1)), ..., (x^(n), y^(n)) ∈ R × R.
• x^(i), y^(i) are the predictor and response variables, e.g. the SAT score and GPA of the ith student.
• Minimize the mean squared error,

  MSE(a, b) = (1/n) Σ_{i=1}^n (y^(i) − (a x^(i) + b))^2.

This is the loss function.
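The minimizer of this MSE has a simple closed form; the sketch below assumes plain NumPy and made-up (SAT, GPA)-style points (np.polyfit(x, y, 1) computes the same fit).

```python
import numpy as np

def fit_line(x, y):
    """Return (a, b) minimizing MSE(a, b) = mean((y - (a*x + b))**2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # Closed form: a = Cov(x, y) / Var(x), b = mean(y) - a * mean(x).
    a = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)
    b = y.mean() - a * x.mean()
    return a, b

# Made-up (SAT score, GPA) points purely for illustration.
x = np.array([1100, 1200, 1300, 1400, 1500], dtype=float)
y = np.array([2.1, 2.4, 2.5, 2.9, 3.2])
a, b = fit_line(x, y)
print(a, b, np.mean((y - (a * x + b)) ** 2))  # slope, intercept, training MSE
```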

Minimizing the loss function

Given (x^(1), y^(1)), ..., (x^(n), y^(n)), minimize

  L(a, b) = Σ_{i=1}^n (y^(i) − (a x^(i) + b))^2.

Multivariate regression: diabetes study

Data from n = 442 diabetes patients. For each patient:
• 10 features x = (x1, ..., x10): age, sex, body mass index, average blood pressure, and six blood serum measurements.
• A real value y: the progression of the disease a year later.

Regression problem:
• response y ∈ R
• predictor variables x ∈ R^10
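This appears to be the classic diabetes dataset that ships with scikit-learn; assuming that is the intended data, a quick look at it (the feature naming below is scikit-learn's, not necessarily the slides'):

```python
from sklearn.datasets import load_diabetes

# 442 patients, 10 features each; y is disease progression one year later.
X, y = load_diabetes(return_X_y=True)
print(X.shape, y.shape)                    # (442, 10) (442,)
print(load_diabetes().feature_names)       # ['age', 'sex', 'bmi', 'bp', 's1', ..., 's6']
```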

Least-squares regression

Linear function of 10 variables: for x ∈ R^10,

  f(x) = w1 x1 + w2 x2 + ··· + w10 x10 + b = w · x + b

where w = (w1, w2, ..., w10). Penalize error using the squared loss (y − (w · x + b))^2.

Least-squares regression:
• Given: data (x^(1), y^(1)), ..., (x^(n), y^(n)) ∈ R^d × R
• Return: a linear function given by w ∈ R^d and b ∈ R
• Goal: minimize the loss function

  L(w, b) = Σ_{i=1}^n (y^(i) − (w · x^(i) + b))^2.

Back to the diabetes data

• No predictor variables: mean squared error (MSE) = 5930
• One predictor ('bmi'): MSE = 3890
• Two predictors ('bmi', 'serum5'): MSE = 3205
• All ten predictors: MSE = 2860
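Numbers in this ballpark can be reproduced with the sketch below, assuming the scikit-learn copy of the data, that 'serum5' corresponds to its 's5' column, and that MSE here means training MSE on the full dataset.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

data = load_diabetes()
X, y = data.data, data.target
col = {name: i for i, name in enumerate(data.feature_names)}

def train_mse(features):
    """Least-squares fit on the chosen features; report MSE on the training data."""
    if not features:                                  # no predictors: predict the mean
        return np.mean((y - y.mean()) ** 2)
    Xs = X[:, [col[f] for f in features]]
    model = LinearRegression().fit(Xs, y)
    return np.mean((y - model.predict(Xs)) ** 2)

for subset in [[], ['bmi'], ['bmi', 's5'], list(data.feature_names)]:
    print(subset, round(train_mse(subset), 1))
```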

Least-squares solution 1

Linear function of d variables given by w ∈ R^d and b ∈ R:

  f(x) = w1 x1 + w2 x2 + ··· + wd xd + b = w · x + b

Assimilate the intercept b into w:
• Add a new feature that is identically 1: let x̃ = (1, x) ∈ R^(d+1),
  e.g. (4, 0, 2, ···, 3) ⇒ (1, 4, 0, 2, ···, 3).
• Set w̃ = (b, w) ∈ R^(d+1).
• Then f(x) = w · x + b = w̃ · x̃.

Goal: find w̃ ∈ R^(d+1) that minimizes

  L(w̃) = Σ_{i=1}^n (y^(i) − w̃ · x̃^(i))^2.

Least-squares solution 2

Write X̃ for the n × (d+1) matrix whose rows are x̃^(1), ..., x̃^(n), and y = (y^(1), ..., y^(n)) for the vector of responses. Then the loss function is

  L(w̃) = Σ_{i=1}^n (y^(i) − w̃ · x̃^(i))^2 = ‖y − X̃ w̃‖^2,

and it is minimized at

  w̃ = (X̃^T X̃)^(-1) (X̃^T y).
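A minimal NumPy sketch of this recipe on synthetic data: prepend the all-ones feature, then solve the normal equations (np.linalg.lstsq is the more numerically robust route and is shown as a cross-check).

```python
import numpy as np

def least_squares(X, y):
    """Return w_tilde = (b, w) minimizing ||y - X_tilde @ w_tilde||^2."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])     # the identically-1 feature
    # Normal equations: (X~^T X~) w~ = X~^T y
    return np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)

# Synthetic check: recover a known w and intercept b.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=50)

w_tilde = least_squares(X, y)
print(w_tilde)                                              # approx [4.0, 1.0, -2.0, 0.5]
print(np.linalg.lstsq(np.hstack([np.ones((50, 1)), X]), y, rcond=None)[0])  # same answer
```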

Generalization behavior of least-squares regression

Given a training set (x^(1), y^(1)), ..., (x^(n), y^(n)) ∈ R^d × R, find a linear function, given by w ∈ R^d and b ∈ R, that minimizes the squared loss

  L(w, b) = Σ_{i=1}^n (y^(i) − (w · x^(i) + b))^2.

Is training loss a good estimate of future performance?

• If n is large enough: maybe.
• Otherwise: probably an underestimate.
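A small synthetic illustration of that underestimate, assuming Gaussian data with nearly as many parameters as training points; everything here is made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_train, n_test = 50, 60, 10_000
w_true = rng.normal(size=d)

def make(n):
    X = rng.normal(size=(n, d))
    return X, X @ w_true + rng.normal(size=n)      # unit-variance noise

X_tr, y_tr = make(n_train)
X_te, y_te = make(n_test)

# Least-squares fit with the intercept folded in as a constant feature.
aug = lambda X: np.hstack([np.ones((X.shape[0], 1)), X])
w = np.linalg.lstsq(aug(X_tr), y_tr, rcond=None)[0]

mse = lambda X, y: np.mean((y - aug(X) @ w) ** 2)
print(mse(X_tr, y_tr), mse(X_te, y_te))            # training MSE << test MSE
```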

Better error estimates

Recall: k-fold cross-validation.
• Divide the data set into k equal-sized groups S1, ..., Sk.
• For i = 1 to k:
  • Train a regressor on all the data except Si.
  • Let Ei be its error on Si.
• Error estimate: the average of E1, ..., Ek.
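A minimal NumPy sketch of this loop; fit and predict are placeholders for whatever regressor is being evaluated (scikit-learn's KFold and cross_val_score wrap the same idea).

```python
import numpy as np

def k_fold_mse(X, y, fit, predict, k=5, seed=0):
    """Estimate test MSE by k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)                       # S_1, ..., S_k
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])                  # train on everything except S_i
        errors.append(np.mean((y[test] - predict(model, X[test])) ** 2))   # E_i
    return np.mean(errors)                               # average of E_1, ..., E_k
```

For example, the least_squares sketch above plugs in with fit=least_squares and predict=lambda w, X: np.hstack([np.ones((len(X), 1)), X]) @ w.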

A nagging question: when n is small, should we be minimizing the squared loss

  L(w, b) = Σ_{i=1}^n (y^(i) − (w · x^(i) + b))^2 ?

Ridge regression

Minimize the squared loss plus a term that penalizes “complex” w:

  L(w, b) = Σ_{i=1}^n (y^(i) − (w · x^(i) + b))^2 + λ‖w‖^2

Adding a penalty term like this is called regularization.

Put the predictor vectors in a matrix X and the responses in a vector y; the minimizer is then

  w = (X^T X + λI)^(-1) (X^T y).
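A NumPy sketch of this closed form, assuming the intercept b is carried by an all-ones feature and left out of the penalty (how the slide handles b is not spelled out, so that part is an assumption).

```python
import numpy as np

def ridge(X, y, lam):
    """Return (b, w) minimizing sum of squared errors + lam * ||w||^2."""
    n, d = X.shape
    X_tilde = np.hstack([np.ones((n, 1)), X])    # constant feature carries the intercept
    reg = lam * np.eye(d + 1)
    reg[0, 0] = 0.0                              # do not penalize the intercept b
    w_tilde = np.linalg.solve(X_tilde.T @ X_tilde + reg, X_tilde.T @ y)
    return w_tilde[0], w_tilde[1:]

# Tiny usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ rng.normal(size=5) + 2.0 + rng.normal(scale=0.1, size=40)
b, w = ridge(X, y, lam=1.0)
print(b, w)
```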

Toy example

Training and test sets of 100 points each:
• x ∈ R^100, each feature xi is Gaussian N(0, 1)
• y = x1 + ··· + x10 + N(0, 1)

Results for a range of regularization parameters λ:

  λ          training MSE   test MSE
  0.00001        0.00        585.81
  0.0001         0.00        564.28
  0.001          0.00        404.08
  0.01           0.01         83.48
  0.1            0.03         19.26
  1.0            0.07          7.02
  10.0           0.35          2.84
  100.0          2.40          5.79
  1000.0         8.19         10.97
  10000.0       10.83         12.63

The lasso

Popular regularizers:
• Ridge regression:

    L(w, b) = Σ_{i=1}^n (y^(i) − (w · x^(i) + b))^2 + λ‖w‖_2^2

• Lasso (tends to produce sparse w):

    L(w, b) = Σ_{i=1}^n (y^(i) − (w · x^(i) + b))^2 + λ‖w‖_1

Toy example: the lasso recovers the 10 relevant features plus a few more.
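A sketch of the toy comparison using scikit-learn's Ridge and Lasso (their alpha plays the role of λ; the random data is regenerated here, so the exact numbers will not match the table).

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)

def make(n, d=100):
    X = rng.normal(size=(n, d))                            # each feature ~ N(0, 1)
    y = X[:, :10].sum(axis=1) + rng.normal(size=n)         # y = x1 + ... + x10 + noise
    return X, y

X_tr, y_tr = make(100)
X_te, y_te = make(100)

for alpha in [0.001, 0.1, 10.0, 1000.0]:
    r = Ridge(alpha=alpha).fit(X_tr, y_tr)
    print('ridge', alpha, round(np.mean((y_te - r.predict(X_te)) ** 2), 2))

lasso = Lasso(alpha=0.1).fit(X_tr, y_tr)
print('lasso nonzero coefficients:', np.count_nonzero(lasso.coef_))
```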