
Least-Squares Regression and the Bias-Variance Tradeoff

Guest Lecturer Joseph E. Gonzalez. Slides available here: http://tinyurl.com/reglecture

[Figure: scatter of Y against X with a fitted line]

Response Variable Y, Covariate X: Y = mX + b, with intercept (bias) b

Motivation

• One of the most widely used techniques
• Fundamental to many larger models
  – Generalized Linear Models
  – Collaborative filtering
• Easy to interpret
• Efficient to solve

Multiple Linear Regression

The Regression Model

• For a single point (x, y):

Independent Variable (Vector): x ∈ Rᵖ, observed (conditioned on)
Response Variable (Scalar): y ∈ R

• Joint Probability:

p(x, y) = p(x) p(y|x)  (Discriminative Model)

The Linear Model

Vector of covariates x, scalar real-valued response y, plus noise:

y = θᵀx + ε

Noise Model: ε ∼ N(0, σ²)

Linear combination of covariates: θᵀx = Σ_{i=1}^p θᵢxᵢ

What about the bias/intercept term b? Define xₚ₊₁ = 1, then redefine p := p+1 for notational simplicity.
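The constant-feature trick for absorbing the intercept can be sketched as follows (NumPy assumed; the data and coefficients are made up for illustration):

```python
import numpy as np

# Hypothetical data: n = 4 points with p = 2 covariates.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])

# Append the constant feature x_{p+1} = 1 so the intercept b
# becomes just another coordinate of theta.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

theta = np.array([0.5, -1.0, 2.0])   # last entry plays the role of b
y = X_aug @ theta                    # theta^T x now absorbs the intercept
```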

Conditional Likelihood p(y|x)

• Conditioned on x (constant), the noise follows a normal distribution with constant variance:

y = θᵀx + ε,  ε ∼ N(0, σ²)

• Conditional distribution of Y:

Y ∼ N(θᵀx, σ²)

p(y|x) = (1/(σ√(2π))) exp(−(y − θᵀx)²/(2σ²))

Parameters and Random Variables

• In y ∼ N(θᵀx, σ²), θ and σ² are the parameters.

• Conditional distribution of y:
  – Bayesian: parameters as random variables: p(y|x, θ, σ²)
  – Frequentist: parameters as (unknown) constants: p_{θ,σ²}(y|x)

So far …

[Figure: a single data point (*) in the (X₁, X₂, Y) space: "I'm lonely"]

Independent and Identically Distributed (iid) Data

• For n data points:

D = {(x₁, y₁), …, (xₙ, yₙ)} = {(xᵢ, yᵢ)}_{i=1}^n

Plate Diagram

Independent Variable (Vector): xᵢ ∈ Rᵖ
Response Variable (Scalar): yᵢ ∈ R
for i ∈ {1, …, n}

Joint Probability

• For n data points, independent and identically distributed (iid):

p(D) = ∏_{i=1}^n p(xᵢ, yᵢ) = ∏_{i=1}^n p(xᵢ) p(yᵢ|xᵢ)

Rewriting with Notation

• Represent the data D = {(xᵢ, yᵢ)}_{i=1}^n as:

Covariate (Design) Matrix X ∈ Rⁿˣᵖ, whose row i is xᵢᵀ
Response Vector Y ∈ Rⁿ, whose entry i is yᵢ

Assume X has rank p (not degenerate).

Rewriting with Matrix Notation

• Rewriting the model using matrix operations:

Y = Xθ + ε

with X of size n × p, θ of size p × 1, and ε and Y of size n × 1.
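A minimal simulation of the matrix form Y = Xθ + ε (NumPy assumed; n, p, σ, and θ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 100, 3, 0.5               # illustrative sizes
X = rng.normal(size=(n, p))             # design matrix: row i is x_i^T
theta_true = np.array([1.0, -2.0, 0.5])
eps = rng.normal(scale=sigma, size=n)   # noise vector, entries N(0, sigma^2)
Y = X @ theta_true + eps                # Y = X theta + eps: all n responses at once
```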

Estimating the Model

• Given data, how can we estimate θ in Y = Xθ + ε?

• Construct the maximum likelihood estimator (MLE):
  – Derive the log-likelihood
  – Find the θ_MLE that maximizes the log-likelihood
    • Analytically: take the derivative and set it to 0
    • Iteratively: (stochastic) gradient descent

Joint Probability

• For n data points (iid), using the discriminative model:

p(D) = ∏_{i=1}^n p(xᵢ, yᵢ) = ∏_{i=1}^n p(xᵢ) p(yᵢ|xᵢ)

Defining the Likelihood

p_θ(y|x) = (1/(σ√(2π))) exp(−(y − θᵀx)²/(2σ²))

L(θ|D) = ∏_{i=1}^n p_θ(yᵢ|xᵢ)
  = ∏_{i=1}^n (1/(σ√(2π))) exp(−(yᵢ − θᵀxᵢ)²/(2σ²))
  = (1/(σⁿ(2π)^{n/2})) exp(−(1/(2σ²)) Σ_{i=1}^n (yᵢ − θᵀxᵢ)²)

Maximizing the Likelihood

• Want to compute:

θ̂_MLE = arg max_{θ∈Rᵖ} L(θ|D)

• To simplify the calculations we take the log:

θ̂_MLE = arg max_{θ∈Rᵖ} log L(θ|D)

which does not affect the maximization because log is a monotone (increasing) function.

• Take the log of

L(θ|D) = (1/(σⁿ(2π)^{n/2})) exp(−(1/(2σ²)) Σ_{i=1}^n (yᵢ − θᵀxᵢ)²)

to get

log L(θ|D) = −log(σⁿ(2π)^{n/2}) − (1/(2σ²)) Σ_{i=1}^n (yᵢ − θᵀxᵢ)²

• Removing constant terms with respect to θ:

log L(θ) = −Σ_{i=1}^n (yᵢ − θᵀxᵢ)²  (a monotone transformation: easy to maximize)

• Plugging in the log-likelihood:

θ̂_MLE = arg max_{θ∈Rᵖ} −Σ_{i=1}^n (yᵢ − θᵀxᵢ)²

• Dropping the sign and flipping from maximization to minimization:

θ̂_MLE = arg min_{θ∈Rᵖ} Σ_{i=1}^n (yᵢ − θᵀxᵢ)²

Minimize the Sum of Squared Errors

• Gaussian noise model ⇒ squared loss (regression)

Pictorial Interpretation of Squared Error

[Figure: data points with vertical squared residuals to the fitted curve y(x)]

Maximizing the Likelihood (Minimizing the Squared Error)

θ̂_MLE = arg min_{θ∈Rᵖ} Σ_{i=1}^n (yᵢ − θᵀxᵢ)²

−log L(θ) is a convex function of θ; at the minimum θ̂_MLE the slope is 0.

• Take the gradient and set it equal to zero.

Minimizing the Squared Error

• Taking the gradient (chain rule):

−∇_θ log L(θ) = ∇_θ Σ_{i=1}^n (yᵢ − θᵀxᵢ)²
  = −2 Σ_{i=1}^n (yᵢ − θᵀxᵢ) xᵢ
  = −2 Σ_{i=1}^n yᵢxᵢ + 2 Σ_{i=1}^n (θᵀxᵢ) xᵢ

• Rewriting the gradient in matrix form:

−∇_θ log L(θ) = −2XᵀY + 2XᵀXθ

• To make sure the log-likelihood is convex, compute the second derivative (Hessian):

−∇² log L(θ) = 2XᵀX

• If X is full rank then XᵀX is positive definite, and therefore θ̂_MLE is the minimum.
  – Address the degenerate cases with regularization

• Setting the gradient equal to 0 and solving for θ̂_MLE:

−∇_θ log L(θ) = −2XᵀY + 2XᵀXθ = 0

(XᵀX) θ̂_MLE = XᵀY  (the Normal Equations)

θ̂_MLE = (XᵀX)⁻¹XᵀY
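A quick sketch of solving the normal equations numerically (NumPy assumed; the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4
X = rng.normal(size=(n, p))                    # made-up design
theta_true = np.array([1.0, 2.0, 3.0, 4.0])
Y = X @ theta_true + rng.normal(scale=0.1, size=n)

# Solve (X^T X) theta = X^T Y as a linear system,
# rather than forming the inverse explicitly.
theta_mle = np.linalg.solve(X.T @ X, X.T @ Y)
```

With this much data and little noise, `theta_mle` lands very close to `theta_true`.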

Geometric Interpretation

• View the MLE as finding a projection onto col(X):
  – Define the estimator: Ŷ = Xθ
  – Observe that Ŷ is in col(X): a linear combination of the columns of X
  – Want Ŷ closest to Y, which implies (Y − Ŷ) is normal to col(X):

Xᵀ(Y − Ŷ) = Xᵀ(Y − Xθ) = 0  ⇒  XᵀXθ = XᵀY

Connection to the Pseudo-Inverse

θ̂_MLE = (XᵀX)⁻¹XᵀY

• The Moore-Penrose pseudoinverse X† generalizes the matrix inverse:
  – Consider the case when X is square and invertible:

X† = (XᵀX)⁻¹Xᵀ = X⁻¹(Xᵀ)⁻¹Xᵀ = X⁻¹

  – which implies θ_MLE = X⁻¹Y, the solution to Xθ = Y when X is square and invertible.
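A small numerical check of this connection, assuming NumPy; `np.linalg.pinv` computes the Moore-Penrose pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
Y = rng.normal(size=50)

# For full-column-rank X, the pseudoinverse solution X† Y agrees
# with the normal-equations solution (X^T X)^{-1} X^T Y.
theta_pinv = np.linalg.pinv(X) @ Y
theta_ne = np.linalg.solve(X.T @ X, X.T @ Y)

# For a square invertible matrix, X† reduces to the ordinary inverse.
A = rng.normal(size=(3, 3))
```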

Computing the MLE

θ̂_MLE = (XᵀX)⁻¹XᵀY

• Not typically solved by inverting XᵀX
• Solved using direct methods:
  – Cholesky factorization: up to a factor of 2 faster
  – QR factorization: more numerically stable
  (or use the built-in solver in your math library; in R: solve(Xt %*% X, Xt %*% y))
• Solved using various iterative methods:
  – Krylov subspace methods
  – (Stochastic) gradient descent

http://www.seas.ucla.edu/~vandenbe/103/lectures/qr.pdf

Cholesky Factorization

Solve (XᵀX) θ̂_MLE = XᵀY, i.e. C θ̂_MLE = d:

• Compute the symmetric matrix C = XᵀX  (O(np²))
• Compute the vector d = XᵀY  (O(np))
• Cholesky factorization LLᵀ = C, with L lower triangular  (O(p³))
• Forward substitution to solve Lz = d  (O(p²))
• Backward substitution to solve Lᵀθ̂_MLE = z  (O(p²))
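The five steps above can be sketched as follows (NumPy assumed; the explicit substitution loops mirror the O(p²) steps, though in practice a library triangular solver would be used):

```python
import numpy as np

def cholesky_least_squares(X, Y):
    """Solve (X^T X) theta = X^T Y via Cholesky factorization."""
    C = X.T @ X                        # O(n p^2): symmetric p x p matrix
    d = X.T @ Y                        # O(n p)
    L = np.linalg.cholesky(C)          # O(p^3): C = L L^T, L lower triangular
    p = L.shape[0]
    z = np.zeros(p)
    for i in range(p):                 # forward substitution: L z = d, O(p^2)
        z[i] = (d[i] - L[i, :i] @ z[:i]) / L[i, i]
    theta = np.zeros(p)
    for i in range(p - 1, -1, -1):     # backward substitution: L^T theta = z, O(p^2)
        theta[i] = (z[i] - L[i + 1:, i] @ theta[i + 1:]) / L[i, i]
    return theta

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
Y = rng.normal(size=100)
theta = cholesky_least_squares(X, Y)
```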

Connections to inference: http://ssg.mit.edu/~willsky/publ_pdfs/185_pub_MLR.pdf and http://yaroslavvb.blogspot.com/2011/02/junction-trees-in-numerical-analysis.html (with illustrations)

Solving Triangular Systems

An upper triangular system Ax = b:

A₁₁x₁ + A₁₂x₂ + A₁₃x₃ + A₁₄x₄ = b₁
        A₂₂x₂ + A₂₃x₃ + A₂₄x₄ = b₂
                A₃₃x₃ + A₃₄x₄ = b₃
                        A₄₄x₄ = b₄

Solve by back substitution, from the last row up:

x₄ = b₄ / A₄₄
x₃ = (b₃ − A₃₄x₄) / A₃₃
x₂ = (b₂ − A₂₃x₃ − A₂₄x₄) / A₂₂
x₁ = (b₁ − A₁₂x₂ − A₁₃x₃ − A₁₄x₄) / A₁₁

Distributed Direct Solution (Map-Reduce)

θ̂_MLE = (XᵀX)⁻¹XᵀY

• Distribute the computation of the sums:

C = XᵀX = Σ_{i=1}^n xᵢxᵢᵀ  (O(np²), C a p × p matrix)

d = XᵀY = Σ_{i=1}^n xᵢyᵢ  (O(np))

• Solve the system C θ_MLE = d on the master.  (O(p³))

Gradient Descent: What if p is large? (e.g., p = n/2)

• The cost O(np²) = O(n³) could be prohibitive
• Solution: iterative methods
  – Gradient descent:

For τ from 0 until convergence:

θ^(τ+1) = θ^(τ) − ρ(τ) ∇_θ(−log L(θ^(τ)|D))

Gradient Descent Illustrated:

[Figure: convex −log L(θ) with iterates θ^(0), θ^(1), θ^(2), θ^(3) = θ̂_MLE descending to the minimum where the slope = 0]

Gradient Descent: What if p is large? (e.g., p = n/2)

• The cost O(np²) = O(n³) could be prohibitive
• Solution: iterative methods
  – Gradient descent: For τ from 0 until convergence

θ^(τ+1) = θ^(τ) − ρ(τ) ∇_θ(−log L(θ^(τ)|D))
        = θ^(τ) + ρ(τ) Σ_{i=1}^n (yᵢ − θ^(τ)ᵀxᵢ) xᵢ  (O(np) per iteration)

(the constant factor of 2 is absorbed into the step size ρ(τ))
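A minimal sketch of this batch update, assuming NumPy; here the gradient is averaged over n and a fixed step size is used, both illustrative choices:

```python
import numpy as np

# Synthetic data (illustrative sizes and coefficients).
rng = np.random.default_rng(4)
n, p = 200, 3
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -1.0, 2.0])
Y = X @ theta_true + rng.normal(scale=0.1, size=n)

theta = np.zeros(p)
rho = 0.5                                # fixed step size (illustrative)
for _ in range(500):
    grad = X.T @ (Y - X @ theta) / n     # averaged sum_i (y_i - theta^T x_i) x_i
    theta = theta + rho * grad           # move downhill on -log L
```

After enough iterations the iterates match the closed-form normal-equations solution.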

• Can we do better? (The sum over all n points is itself just an estimate of the gradient.)

Stochastic Gradient Descent

• Construct a noisy estimate of the gradient:

For τ from 0 until convergence:
  1) pick a random index i
  2) θ^(τ+1) = θ^(τ) + ρ(τ)(yᵢ − θ^(τ)ᵀxᵢ)xᵢ  (O(p) per iteration)

• Sensitive to the choice of ρ(τ); typically ρ(τ) = 1/τ
• Also known as Least-Mean-Squares (LMS)
• Applies to streaming data with O(p) storage
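A sketch of the LMS updates, assuming NumPy; the schedule ρ(τ) = 1/τ follows the slide, everything else (sizes, seed) is made up:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 1000, 3
X = rng.normal(size=(n, p))
theta_true = np.array([2.0, 0.0, -1.0])
Y = X @ theta_true + rng.normal(scale=0.1, size=n)

theta = np.zeros(p)
for tau in range(1, 20001):
    i = rng.integers(n)                                  # 1) pick a random i
    rho = 1.0 / tau                                      # step size rho(tau) = 1/tau
    theta = theta + rho * (Y[i] - X[i] @ theta) * X[i]   # 2) O(p) update
```

Each step touches a single data point, which is what makes the method suitable for streaming data.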

Fitting Non-linear Data

• What if Y has a non-linear response?

[Figure: data with a non-linear (oscillating) trend on 0 ≤ x ≤ 6]

• Can we still use a linear model?

Transforming the Feature Space

• Transform the features xᵢ = (Xᵢ,₁, Xᵢ,₂, …, Xᵢ,ₚ)

• by applying a non-linear transformation ϕ : Rᵖ → Rᵏ

• Example: ϕ(x) = {1, x, x², …, xᵏ}
  – others: splines, radial basis functions, …
  – Expert-engineered features (modeling)
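The polynomial example ϕ(x) = {1, x, …, xᵏ} can be sketched as follows (NumPy assumed; the sinusoidal data is an illustrative stand-in for a non-linear response):

```python
import numpy as np

def poly_features(x, k):
    """Map a scalar x to (1, x, x^2, ..., x^k)."""
    return np.array([x**j for j in range(k + 1)])

# Illustrative non-linear response (made-up data).
rng = np.random.default_rng(6)
x = rng.uniform(0, 6, size=50)
y = np.sin(x) + rng.normal(scale=0.1, size=50)

# Least squares in the transformed feature space is still a linear model:
# the model is linear in theta even though phi(x) is non-linear in x.
Phi = np.vstack([poly_features(xi, 3) for xi in x])   # 50 x 4 design matrix
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```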

Under-fitting

[Figure: least-squares fits with bases {1.} and {1., x}: both too simple for the data]

[Figure: fits with bases {1., x, x², x³} and {1., x, x², x³, x⁴, x⁵}]

Over-fitting

Really Over-fitting!

[Figure: fit with basis {1., x, x², x³, …, x¹⁴} wiggling through nearly every training point]

• Errors on the training data are small
• But errors on new points are likely to be large

What if I train on different data?

Low Variability:

[Figure: three degree-3 fits with basis {1., x, x², x³} on three different training samples; the fitted curves barely change]

High Variability

[Figure: three degree-14 fits with basis {1., x, x², …, x¹⁴} on three different training samples; the fitted curves change wildly]

Bias-Variance Tradeoff

• So far we have minimized the error (loss) with respect to the training data
  – Low training error does not imply good expected performance: over-fitting
• We would like to reason about the expected loss (prediction risk) over:

– Training data: {(y₁, x₁), …, (yₙ, xₙ)}

– Test point: (y∗, x∗)

• We will decompose the expected loss into:

E_{D,(y∗,x∗)}[(y∗ − f(x∗|D))²] = Noise + (Bias)² + Variance

• Define the (unobserved) true model h, assuming zero-mean noise (any bias goes into h(x∗)):

y∗ = h(x∗) + ε∗

• Complete the square with h(x∗) = h∗ (shorthand; likewise f∗ = f(x∗|D)):

Expected Loss:

E_{D,(y∗,x∗)}[(y∗ − f(x∗|D))²]
  = E_{D,(y∗,x∗)}[(y∗ − h(x∗) + h(x∗) − f(x∗|D))²]

Using (a + b)² = a² + b² + 2ab with a = y∗ − h(x∗) and b = h(x∗) − f(x∗|D):

  = E_{ε∗}[(y∗ − h(x∗))²] + E_D[(h(x∗) − f(x∗|D))²]
    + 2 E_{D,(y∗,x∗)}[y∗h∗ − y∗f∗ − h∗h∗ + h∗f∗]

• The cross term vanishes: substituting the definition y∗ = h∗ + ε∗,

E[(h∗ + ε∗)h∗ − (h∗ + ε∗)f∗ − h∗h∗ + h∗f∗]
  = h∗h∗ + E[ε∗]h∗ − h∗E[f∗] − E[ε∗]f∗ − h∗h∗ + h∗E[f∗] = 0

using E[ε∗] = 0. Therefore:

E_{D,(y∗,x∗)}[(y∗ − f(x∗|D))²]
  = E_{ε∗}[(y∗ − h(x∗))²] + E_D[(h(x∗) − f(x∗|D))²]

i.e. the Noise Term (out of our control) plus the Model Estimation Error (which we want to minimize).

• The minimum error is governed by the noise.

• Expanding the model estimation error, completing the square with E[f(x∗|D)] = f̄∗:

E_D[(h(x∗) − f(x∗|D))²]
  = E[(h(x∗) − E[f(x∗|D)] + E[f(x∗|D)] − f(x∗|D))²]
  = E[(h(x∗) − E[f(x∗|D)])²] + E[(f(x∗|D) − E[f(x∗|D)])²]
    + 2 E[h∗f̄∗ − h∗f∗ − f̄∗f∗ + f̄∗²]

• Again the cross term vanishes:

E[h∗f̄∗ − h∗f∗ − f̄∗f∗ + f̄∗²]
  = h∗f̄∗ − h∗E[f∗] − f̄∗E[f∗] + f̄∗²
  = h∗f̄∗ − h∗f̄∗ − f̄∗f̄∗ + f̄∗² = 0

• The term (h(x∗) − E[f(x∗|D)])² is deterministic, so its expectation drops:

E_D[(h(x∗) − f(x∗|D))²]
  = (h(x∗) − E[f(x∗|D)])² + E[(f(x∗|D) − E[f(x∗|D)])²]
  = (Bias)² + Variance

• Tradeoff between bias and variance:
  – Simple models: high bias, low variance
  – Complex models: low bias, high variance

Summary of the Bias-Variance Tradeoff:

E_{D,(y∗,x∗)}[(y∗ − f(x∗|D))²]  (Expected Loss)
  = E_{ε∗}[(y∗ − h(x∗))²]  (Noise)
  + (h(x∗) − E_D[f(x∗|D)])²  ((Bias)²)
  + E_D[(f(x∗|D) − E_D[f(x∗|D)])²]  (Variance)

• The choice of model balances bias and variance:
  – Over-fitting ⇒ variance is too high
  – Under-fitting ⇒ bias is too high

[Figure: illustration of bias vs. variance]

Image from http://scott.fortmann-roe.com/docs/BiasVariance.html

Analyzing the Bias of f(x∗|D) = x∗ᵀθ̂_MLE

• Assume the true model is linear: h(x∗) = x∗ᵀθ

bias = h(x∗) − E_D[f(x∗|D)]
     = x∗ᵀθ − E_D[x∗ᵀθ̂_MLE]  (substitute the MLE)
     = x∗ᵀθ − E_D[x∗ᵀ(XᵀX)⁻¹XᵀY]  (plug in the definition of Y)
     = x∗ᵀθ − E_D[x∗ᵀ(XᵀX)⁻¹Xᵀ(Xθ + ε)]
     = x∗ᵀθ − E_D[x∗ᵀ(XᵀX)⁻¹XᵀXθ + x∗ᵀ(XᵀX)⁻¹Xᵀε]  (expand and cancel)
     = x∗ᵀθ − E_D[x∗ᵀθ + x∗ᵀ(XᵀX)⁻¹Xᵀε]
     = x∗ᵀθ − x∗ᵀθ − x∗ᵀ(XᵀX)⁻¹Xᵀ E_D[ε]  (assumption: E_D[ε] = 0)
     = x∗ᵀθ − x∗ᵀθ = 0

θ̂_MLE is unbiased!

Analyzing the Variance of f(x∗|D) = x∗ᵀθ̂_MLE

• Assume the true model is linear: h(x∗) = x∗ᵀθ

Var = E[(f(x∗|D) − E_D[f(x∗|D)])²]
    = E[(x∗ᵀθ̂_MLE − x∗ᵀθ)²]  (substitute the MLE and the unbiasedness result)
    = E[(x∗ᵀ(XᵀX)⁻¹XᵀY − x∗ᵀθ)²]  (plug in the definition of Y)
    = E[(x∗ᵀ(XᵀX)⁻¹Xᵀ(Xθ + ε) − x∗ᵀθ)²]
    = E[(x∗ᵀθ + x∗ᵀ(XᵀX)⁻¹Xᵀε − x∗ᵀθ)²]  (expand and cancel)
    = E[(x∗ᵀ(XᵀX)⁻¹Xᵀε)²]

• Use the property that for a scalar a, a² = a·aᵀ:

    = E[(x∗ᵀ(XᵀX)⁻¹Xᵀε)(x∗ᵀ(XᵀX)⁻¹Xᵀε)ᵀ]
    = E[x∗ᵀ(XᵀX)⁻¹Xᵀ εεᵀ (x∗ᵀ(XᵀX)⁻¹Xᵀ)ᵀ]
    = x∗ᵀ(XᵀX)⁻¹Xᵀ E[εεᵀ] (x∗ᵀ(XᵀX)⁻¹Xᵀ)ᵀ
    = x∗ᵀ(XᵀX)⁻¹Xᵀ (σ_ε² I) X(XᵀX)⁻¹x∗
    = σ_ε² x∗ᵀ(XᵀX)⁻¹(XᵀX)(XᵀX)⁻¹x∗
    = σ_ε² x∗ᵀ(XᵀX)⁻¹x∗
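A Monte Carlo sanity check of these two results, assuming NumPy: for a fixed design X and a linear true model, the simulated bias should be near 0 and the simulated variance should match σ_ε² x∗ᵀ(XᵀX)⁻¹x∗ (all sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma = 50, 3, 1.0
X = rng.normal(size=(n, p))            # design held fixed across replications
theta = np.array([1.0, 2.0, -1.0])     # true linear model: h(x) = theta^T x
x_star = np.array([0.5, -0.5, 1.0])

preds = []
for _ in range(5000):                  # fresh training noise each replication
    Y = X @ theta + rng.normal(scale=sigma, size=n)
    theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    preds.append(x_star @ theta_hat)
preds = np.array(preds)

bias = preds.mean() - x_star @ theta               # should be ~ 0 (unbiased)
variance = preds.var()                             # should match the formula
var_formula = sigma**2 * x_star @ np.linalg.solve(X.T @ X, x_star)
```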

Consequence of the Variance Calculation

Var = E[(f(x∗|D) − E_D[f(x∗|D)])²] = σ_ε² x∗ᵀ(XᵀX)⁻¹x∗

[Figure: two scatterplots with fitted lines, higher variance (left) vs. lower variance (right)]

Figure from http://people.stern.nyu.edu/wgreene/MathStat/GreeneChapter4.pdf

Summary

• Least-squares regression is unbiased:

E_D[x∗ᵀθ̂_MLE] = x∗ᵀθ

• The variance depends on:

E[(f(x∗|D) − E[f(x∗|D)])²] = σ_ε² x∗ᵀ(XᵀX)⁻¹x∗ ≈ σ_ε² p / n

  – the number of data points n
  – the dimensionality p
  – but not on the observations Y

Deriving the Final Identity

• Assume the xᵢ and x∗ are N(0, 1):

E_{X,x∗}[Var] = σ_ε² E_{X,x∗}[x∗ᵀ(XᵀX)⁻¹x∗]
  = σ_ε² E_{X,x∗}[tr(x∗x∗ᵀ(XᵀX)⁻¹)]
  = σ_ε² tr(E_{X,x∗}[x∗x∗ᵀ(XᵀX)⁻¹])
  = σ_ε² tr(E_{x∗}[x∗x∗ᵀ] E_X[(XᵀX)⁻¹])
  = (σ_ε²/n) tr(E_{x∗}[x∗x∗ᵀ])  (using E_X[(XᵀX)⁻¹] ≈ (1/n)I)
  = σ_ε² p / n
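A simulation of this identity, assuming NumPy; designs and test points are drawn N(0, 1) as in the derivation, with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, sigma = 500, 5, 1.0
vals = []
for _ in range(2000):
    X = rng.normal(size=(n, p))        # x_i ~ N(0, 1)
    x_star = rng.normal(size=p)        # test point ~ N(0, 1)
    vals.append(sigma**2 * x_star @ np.linalg.solve(X.T @ X, x_star))
avg = np.mean(vals)                    # should be close to sigma^2 * p / n
```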

Gauss-Markov Theorem

• The linear model

f(x∗) = x∗ᵀθ̂_MLE = x∗ᵀ(XᵀX)⁻¹XᵀY

has the minimum variance among all unbiased linear estimators.
  – Note that this estimator is linear in Y

• BLUE: Best Linear Unbiased Estimator

Summary

• Introduced the least-squares regression model
  – Maximum likelihood: Gaussian noise
  – Loss function: squared error
  – Geometric interpretation: projection onto col(X)
• Derived the normal equations:
  – Walked through the process of constructing the MLE
  – Discussed efficient computation of the MLE
• Introduced basis functions for non-linear data
  – Demonstrated issues with over-fitting
• Derived the classic bias-variance tradeoff
  – Applied it to the least-squares model

Additional Reading I Found Helpful

• http://www.stat.cmu.edu/~roeder/stat707/lectures.pdf
• http://people.stern.nyu.edu/wgreene/MathStat/GreeneChapter4.pdf
• http://www.seas.ucla.edu/~vandenbe/103/lectures/qr.pdf
• http://www.cs.berkeley.edu/~jduchi/projects/matrix_prop.pdf