Linear Regression and the Bias Variance Tradeoff
Guest Lecturer: Joseph E. Gonzalez
Slides available here: http://tinyurl.com/re1lecture

Simple Linear Regression
• Response variable Y, covariate X
• Linear model: Y = mX + b  (slope m, intercept/bias b)

Motivation
• One of the most widely used techniques
• Fundamental to many larger models
  – Generalized Linear Models
  – Collaborative filtering
• Easy to interpret
• Efficient to solve

Multiple Linear Regression

The Regression Model
• For a single data point (x, y):
  – Independent variable (vector): x ∈ ℝᵖ
  – Response variable (scalar): y ∈ ℝ
  – We observe (condition on) x
• Joint probability: p(x, y) = p(x)·p(y|x)  (a discriminative model)

The Linear Model
• y = θᵀx + ε
  – θ: vector of parameters; x: vector of covariates; ε: noise
  – θᵀx = Σᵢ₌₁ᵖ θᵢxᵢ is a linear combination of the covariates
• Noise model: ε ∼ N(0, σ²)
• What about the bias/intercept term? Define xₚ₊₁ = 1, then redefine p := p + 1 for notational simplicity.

Conditional Likelihood p(y|x)
• Conditioned on x: in y = θᵀx + ε, the term θᵀx is a constant and ε ∼ N(0, σ²), so θᵀx is the mean and σ² the variance.
• Conditional distribution of y:
  y ∼ N(θᵀx, σ²)
  p(y|x) = 1/(σ√(2π)) · exp(−(y − θᵀx)² / (2σ²))

Parameters and Random Variables
• y ∼ N(θᵀx, σ²)
• Conditional distribution of y:
  – Bayesian: parameters as random variables, p(y | x, θ, σ²)
  – Frequentist: parameters as (unknown) constants, p_{θ,σ²}(y|x)

So far… [plot: a single, lonely data point]

Independent and Identically Distributed (iid) Data
• For n data points:
  D = {(x₁, y₁), …, (xₙ, yₙ)} = {(xᵢ, yᵢ)}ᵢ₌₁ⁿ
• Plate diagram: independent variable (vector) xᵢ ∈ ℝᵖ, response variable (scalar) yᵢ ∈ ℝ, for i ∈ {1, …, n}

Joint Probability
• For n data points, independent and identically distributed (iid):
  p(D) = ∏ᵢ₌₁ⁿ p(xᵢ, yᵢ) = ∏ᵢ₌₁ⁿ p(xᵢ)·p(yᵢ|xᵢ)

Rewriting with Matrix Notation
• Represent the data D = {(xᵢ, yᵢ)}ᵢ₌₁ⁿ as:
  – Covariate (design) matrix X ∈ ℝⁿˣᵖ, with rows x₁ᵀ, x₂ᵀ, …, xₙᵀ
  – Response vector Y = (y₁, y₂, …, yₙ)ᵀ ∈ ℝⁿ
• Assume X has rank p (not degenerate)
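The generative story above (draw x, then draw y = θᵀx + ε with Gaussian noise) and the design-matrix representation can be sketched in a few lines of NumPy. This is a minimal illustration, not from the slides; the sizes, coefficients, and noise level are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 2                            # n data points, p covariates (before the bias trick)
theta_true = np.array([2.0, -1.0, 0.5])  # last entry plays the role of the intercept
sigma = 0.3

# Covariate (design) matrix: append a constant column x_{p+1} = 1
# so the intercept is absorbed into theta, as on the slide.
X = rng.normal(size=(n, p))
X = np.hstack([X, np.ones((n, 1))])      # X is n x (p+1)

# Response vector: Y = X theta + eps, with eps ~ N(0, sigma^2)
eps = rng.normal(scale=sigma, size=n)
Y = X @ theta_true + eps

print(X.shape, Y.shape)                  # (100, 3) (100,)
```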
Rewriting with Matrix Notation
• Rewriting the model using matrix operations:
  Y = Xθ + ε,  with Y ∈ ℝⁿ (n×1), X ∈ ℝⁿˣᵖ (n×p), θ ∈ ℝᵖ (p×1), ε ∈ ℝⁿ (n×1)

Estimating the Model
• Given data, how can we estimate θ?  Y = Xθ + ε
• Construct the maximum likelihood estimator (MLE):
  – Derive the log-likelihood
  – Find θ̂_MLE that maximizes the log-likelihood
• Analytically: take the derivative and set it to 0
• Iteratively: (stochastic) gradient descent

Joint Probability
• For n data points:
  p(D) = ∏ᵢ₌₁ⁿ p(xᵢ, yᵢ) = ∏ᵢ₌₁ⁿ p(xᵢ)·p(yᵢ|xᵢ)
  (discriminative model: the p(xᵢ) terms are treated as constants, “1”)

Defining the Likelihood
  p_θ(y|x) = 1/(σ√(2π)) · exp(−(y − θᵀx)² / (2σ²))
  L(θ|D) = ∏ᵢ₌₁ⁿ p_θ(yᵢ|xᵢ)
         = ∏ᵢ₌₁ⁿ 1/(σ√(2π)) · exp(−(yᵢ − θᵀxᵢ)² / (2σ²))
         = 1/(σⁿ(2π)^(n/2)) · exp(−(1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − θᵀxᵢ)²)

Maximizing the Likelihood
• Want to compute: θ̂_MLE = arg max_{θ∈ℝᵖ} L(θ|D)
• To simplify the calculations we take the log:
  θ̂_MLE = arg max_{θ∈ℝᵖ} log L(θ|D)
  which does not affect the maximization because log is a monotone function.
• Take the log:
  log L(θ|D) = −log(σⁿ(2π)^(n/2)) − (1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − θᵀxᵢ)²
• Removing constant terms and factors with respect to θ:
  log L(θ) = −Σᵢ₌₁ⁿ (yᵢ − θᵀxᵢ)²
• Plugging the log-likelihood into the maximization:
  θ̂_MLE = arg max_{θ∈ℝᵖ} −Σᵢ₌₁ⁿ (yᵢ − θᵀxᵢ)²
• Dropping the sign and flipping from maximization to minimization:
  θ̂_MLE = arg min_{θ∈ℝᵖ} Σᵢ₌₁ⁿ (yᵢ − θᵀxᵢ)²
• Minimize the sum of squared errors:
  Gaussian noise model ⇒ squared loss – least squares regression

Pictorial Interpretation of Squared Error
  [plot: y vs. x with squared residuals drawn between the data and the fitted line]

Maximizing the Likelihood (Minimizing the Squared Error)
• θ̂_MLE = arg min_{θ∈ℝᵖ} Σᵢ₌₁ⁿ (yᵢ − θᵀxᵢ)²
• −log L(θ) is a convex function; at θ̂_MLE the slope is 0
• Take the gradient and set it equal to zero

Minimizing the Squared Error
• Taking the gradient (chain rule):
  −∇_θ log L(θ) = ∇_θ Σᵢ₌₁ⁿ (yᵢ − θᵀxᵢ)² = −2 Σᵢ₌₁ⁿ (yᵢ − θᵀxᵢ)xᵢ
  = −2 Σᵢ₌₁ⁿ yᵢxᵢ + 2 Σᵢ₌₁ⁿ (θᵀxᵢ)xᵢ
• Rewriting the gradient in matrix form:
  −∇_θ log L(θ) = −2XᵀY + 2XᵀXθ
• To make sure the negative log-likelihood is convex, compute the second derivative (Hessian):
  −∇² log L(θ) = 2XᵀX
• If X is full rank then XᵀX is positive definite, and therefore θ̂_MLE is the minimum
  – Address the degenerate cases with regularization
• Setting the gradient equal to 0 and solving for θ̂_MLE:
  −∇_θ log L(θ) = −2XᵀY + 2XᵀXθ = 0
  (XᵀX)θ̂_MLE = XᵀY   (the Normal Equations)
  θ̂_MLE = (XᵀX)⁻¹XᵀY

Geometric Interpretation
• View the MLE as finding a projection onto col(X):
  – Define the estimator: Ŷ = Xθ
  – Observe that Ŷ is in col(X): a linear combination of the columns of X
  – Want Ŷ closest to Y
• Implies (Y − Ŷ) is normal (orthogonal) to col(X):
  Xᵀ(Y − Ŷ) = Xᵀ(Y − Xθ) = 0 ⇒ XᵀXθ = XᵀY

Connection to the Pseudo-Inverse
• θ̂_MLE = (XᵀX)⁻¹XᵀY = X†Y, where X† = (XᵀX)⁻¹Xᵀ is the Moore–Penrose pseudoinverse
• Generalization of the inverse:
  – Consider the case when X is square and invertible:
    X† = (XᵀX)⁻¹Xᵀ = X⁻¹(Xᵀ)⁻¹Xᵀ = X⁻¹
  – which implies θ̂_MLE = X⁻¹Y, the solution to Xθ = Y when X is square and invertible

Computing the MLE
• θ̂_MLE = (XᵀX)⁻¹XᵀY
• Not typically solved by inverting XᵀX
• Solved using direct methods:
  – Cholesky factorization: up to a factor of 2 faster
  – QR factorization: more numerically stable
  – or use the built-in solver in your math library, e.g. R: solve(Xt %*% X, Xt %*% y)
• Solved using various iterative methods:
  – Krylov subspace methods
  – (Stochastic) gradient descent
http://www.seas.ucla.edu/~vandenbe/1cs/lectures/qr.pdf

Cholesky Factorization
• Solve (XᵀX)θ̂_MLE = XᵀY, i.e. Cθ̂_MLE = d:
  – Compute the symmetric matrix C = XᵀX — O(np²)
  – Compute the vector d = XᵀY — O(np)
  – Cholesky factorization LLᵀ = C, with L lower triangular — O(p³)
  – Forward substitution to solve Lz = d — O(p²)
  – Backward substitution to solve Lᵀθ̂_MLE = z — O(p²)
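The Cholesky recipe for the normal equations can be sketched directly in NumPy. This is an illustrative sketch: `solve_normal_equations` is a hypothetical helper name, and the random test problem is made up.

```python
import numpy as np

def solve_normal_equations(X, Y):
    """Solve (X^T X) theta = X^T Y via Cholesky, following the steps above."""
    C = X.T @ X                 # O(n p^2); symmetric positive definite if X has rank p
    d = X.T @ Y                 # O(n p)
    L = np.linalg.cholesky(C)   # O(p^3); C = L L^T with L lower triangular
    p = len(d)
    # Forward substitution: solve L z = d           O(p^2)
    z = np.zeros(p)
    for i in range(p):
        z[i] = (d[i] - L[i, :i] @ z[:i]) / L[i, i]
    # Backward substitution: solve L^T theta = z    O(p^2)
    theta = np.zeros(p)
    for i in reversed(range(p)):
        theta[i] = (z[i] - L[i+1:, i] @ theta[i+1:]) / L[i, i]
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
theta_hat = solve_normal_equations(X, Y)
print(theta_hat)
```

The result matches a library least-squares solver such as `np.linalg.lstsq(X, Y)` up to floating-point error.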
Connections to graphical model inference:
http://ssg.mit.edu/~willsky/publ_pdfs/1uvtpub_MLR.pdf and
http://yaroslavvb.blogspot.com/2011/02/junction-trees-in-numerical-analysis.html (with illustrations)

Solving a Triangular System
• An upper-triangular system Ax = b:
  A₁₁x₁ + A₁₂x₂ + A₁₃x₃ + A₁₄x₄ = b₁
           A₂₂x₂ + A₂₃x₃ + A₂₄x₄ = b₂
                    A₃₃x₃ + A₃₄x₄ = b₃
                             A₄₄x₄ = b₄
• Solve from the last equation upward (back-substitution):
  x₁ = (b₁ − A₁₂x₂ − A₁₃x₃ − A₁₄x₄) / A₁₁
  x₂ = (b₂ − A₂₃x₃ − A₂₄x₄) / A₂₂
  x₃ = (b₃ − A₃₄x₄) / A₃₃
  x₄ = b₄ / A₄₄

Distributed Direct Solution (Map-Reduce)
• θ̂_MLE = (XᵀX)⁻¹XᵀY
• Distribute the computation of the sums:
  C = XᵀX = Σᵢ₌₁ⁿ xᵢxᵢᵀ — O(np²)
  d = XᵀY = Σᵢ₌₁ⁿ xᵢyᵢ — O(np)
• Solve the p×p system Cθ̂_MLE = d on the master — O(p³)

Gradient Descent: What if p is large? (e.g., p = n/2)
• The cost of O(np²) = O(n³) could be prohibitive
• Solution: iterative methods
  – Gradient descent: for τ from 0 until convergence,
    θ^(τ+1) = θ^(τ) + ρ(τ)·∇_θ log L(θ^(τ)|D)
    (learning rate ρ(τ); ascent on the log-likelihood, i.e. descent on −log L)

Gradient Descent Illustrated
  [plot: −log L(θ), a convex function; iterates θ^(0), θ^(1), θ^(2), θ^(3) ≈ θ̂_MLE descend to the minimum, where the slope = 0]
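The gradient step above is easy to run on the least-squares objective. A minimal NumPy sketch; the data, the constant learning rate, and the iteration count are assumptions for illustration (the slides use a schedule ρ(τ)):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + 0.1 * rng.normal(size=n)

# Gradient ascent on log L (equivalently, descent on the squared error):
# theta <- theta + rho * sum_i (y_i - theta^T x_i) x_i,   O(np) per step
# (the constant factor 2 is absorbed into the learning rate)
theta = np.zeros(p)
rho = 1.0 / n                        # assumed constant learning rate
for _ in range(500):
    residual = Y - X @ theta         # vector of y_i - theta^T x_i
    theta += rho * (X.T @ residual)  # batch gradient step

print(theta)  # converges to the least-squares solution
```

Replacing the full sum with a single randomly chosen term (yᵢ − θᵀxᵢ)xᵢ gives the O(p) stochastic (LMS) update that the deck turns to next.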
Gradient Descent: What if p is large? (e.g., p = n/2)
• The cost of O(np²) = O(n³) could be prohibitive
• Solution: iterative methods
  – Gradient descent: for τ from 0 until convergence,
    θ^(τ+1) = θ^(τ) + ρ(τ)·∇_θ log L(θ^(τ)|D)
            = θ^(τ) + ρ(τ)·(1/n) Σᵢ₌₁ⁿ (yᵢ − θ^(τ)ᵀxᵢ)xᵢ — O(np) per step
• Can we do better? Estimate the gradient!

Stochastic Gradient Descent
• Construct a noisy estimate of the gradient. For τ from 0 until convergence:
  1) pick a random i
  2) θ^(τ+1) = θ^(τ) + ρ(τ)·(yᵢ − θ^(τ)ᵀxᵢ)xᵢ — O(p) per step
• Sensitive to the choice of ρ(τ); typically ρ(τ) = 1/τ
• Also known as Least-Mean-Squares (LMS)
• Applies to streaming data — O(p) storage

Fitting Non-linear Data
• What if Y has a non-linear response?
  [plot: data following a curved trend]
• Can we still use a linear model?

Transforming the Feature Space
• Transform the features xᵢ = (xᵢ,₁, xᵢ,₂, …, xᵢ,ₚ)
• by applying a non-linear transformation φ: ℝᵖ → ℝᵏ
• Example: φ(x) = {1, x, x², …, xᵏ}
  – others: splines, radial basis functions, …
  – expert-engineered features (modeling)

Under-fitting
  [plots: fits with features {1} and {1, x} — too simple for the data]
  [plots: fits with features {1, x, x², x³} and {1, x, x², x³, x⁴, x⁵}]

Over-fitting
• Really over-fitting!
  [plot: fit with features {1, x, x², …, x¹⁴}]
• Errors on the training data are small
• But errors on new points are likely to be large

What if I train on different data?
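One way to see the answer is to refit each model on freshly drawn training sets and measure how much the prediction at a fixed test point moves. This sketch assumes a sine-shaped truth, a noise level, and sample sizes that are not from the slides; `fit_poly` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_poly(degree, n=30):
    """Fit features {1, x, ..., x^degree} by least squares on a fresh sample."""
    x = rng.uniform(0, 6, size=n)
    y = np.sin(x) + 0.3 * rng.normal(size=n)    # assumed true curve + noise
    Phi = np.vander(x, degree + 1)              # design matrix of transformed features
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return np.polyval(theta, 3.0)               # prediction at a fixed test point x* = 3

# Refit each model on 200 independent training sets; compare prediction spread
low  = np.std([fit_poly(3)  for _ in range(200)])
high = np.std([fit_poly(14) for _ in range(200)])
print(f"std of prediction, degree 3:  {low:.3f}")
print(f"std of prediction, degree 14: {high:.3f}")
```

The degree-14 model's predictions vary far more across training sets than the degree-3 model's, mirroring the low/high variability plots below.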
Low variability:
  [three plots: fits with features {1, x, x², x³} on three different training sets — similar curves]
High variability:
  [three plots: fits with features {1, x, x², …, x¹⁴} on three different training sets — wildly different curves]

Bias-Variance Tradeoff
• So far we have minimized the error (loss) with respect to the training data
  – Low training error does not imply good expected performance: over-fitting
• We would like to reason about the expected loss (prediction risk) over:
  – Training data: {(y₁, x₁), …, (yₙ, xₙ)}
  – Test point: (y*, x*)
• We will decompose the expected loss into:
  E_{D,(y*,x*)}[(y* − f(x*|D))²] = Noise + Bias² + Variance

• Define the (unobserved) true model h:
  y* = h(x*) + ε*, with zero-mean noise ε*   [any bias goes into h(x*)]
• Complete the square with h(x*) = h* and f(x*|D) = f*:
  E_{D,(y*,x*)}[(y* − f(x*|D))²]   (expected loss)
  = E_{D,(y*,x*)}[(y* − h(x*) + h(x*) − f(x*|D))²]
  using (a + b)² = a² + b² + 2ab with a = y* − h(x*) and b = h(x*) − f(x*|D):
  = E_{ε*}[(y* − h(x*))²] + E_D[(h(x*) − f(x*|D))²] + 2·E_{D,(y*,x*)}[y*h* − y*f* − h*h* + h*f*]
• Substitute defn.
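The decomposition above can be checked numerically: average the squared error of many models trained on independent datasets and compare it with noise + bias² + variance measured at the same test point. A Monte Carlo sketch; the true model h, noise level, and sample sizes are assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 0.5
h = lambda x: np.sin(x)          # assumed (unobserved) true model
x_star = 2.0

def train_and_predict(degree, n=20):
    """Train f(.|D) on a fresh dataset D; return its prediction at x*."""
    x = rng.uniform(0, 6, size=n)
    y = h(x) + sigma * rng.normal(size=n)
    theta, *_ = np.linalg.lstsq(np.vander(x, degree + 1), y, rcond=None)
    return np.polyval(theta, x_star)

preds = np.array([train_and_predict(degree=1) for _ in range(2000)])

noise    = sigma**2                         # E[(y* - h(x*))^2]
bias_sq  = (h(x_star) - preds.mean())**2    # (h(x*) - E_D[f(x*|D)])^2
variance = preds.var()                      # E_D[(f(x*|D) - E_D[f(x*|D)])^2]

# Expected loss at x*: draw a fresh y* for each trained model
y_star = h(x_star) + sigma * rng.normal(size=preds.size)
expected_loss = np.mean((y_star - preds)**2)

print(noise + bias_sq + variance, expected_loss)  # approximately equal
```

The three terms sum to the measured expected loss up to Monte Carlo error, because the cross term vanishes in expectation (the noise ε* is independent of the trained model).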