Nonlinear Regression (Part 1)

Christof Seiler

Stanford University, Spring 2016, STATS 205

Overview

- Smoothing or estimating curves
- Nonlinear regression
- Rank-based curve estimation

Density Estimation

- A curve of interest can be a probability density function f
- In density estimation, we observe X_1, ..., X_n from some unknown cdf F with density f:

  X_1, ..., X_n \sim f

- The goal is to estimate the density f (a minimal kernel estimate sketch follows below)
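To make the setup concrete, here is a minimal density estimation sketch in Python. It uses a synthetic two-component sample as a stand-in for the rating data shown in the figures (the actual data set is not given on the slides), and a hand-rolled Gaussian kernel estimator with a fixed, arbitrarily chosen bandwidth h.

```python
import numpy as np

# Minimal sketch of kernel density estimation, assuming a synthetic sample
# in place of the "rating" data from the figures (not the real data set).
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(2.5, 0.5, 100)])

def kde(grid, data, h):
    """Gaussian kernel density estimate f_hat_n evaluated on grid with bandwidth h."""
    z = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

grid = np.linspace(-2, 4, 200)
f_hat = kde(grid, data, h=0.3)
print((f_hat * (grid[1] - grid[0])).sum())  # Riemann sum of the estimate, roughly 1
```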

Density Estimation

[Figure: density estimate of the rating data, shown on two consecutive slides; x-axis: rating (about -2 to 4), y-axis: density (0.0 to 0.4)]

Nonlinear Regression

- A curve of interest can be a regression function r
- In regression, we observe pairs (x_1, Y_1), ..., (x_n, Y_n) that are related as

  Y_i = r(x_i) + \varepsilon_i, with E(\varepsilon_i) = 0

- The goal is to estimate the regression function r (a kernel smoother sketch follows the figure below)

Nonlinear Regression

[Figure: Bone Mineral Density data, shown on four consecutive slides; x-axis: Age (10 to 25), y-axis: Change in BMD (-0.1 to 0.2), colored by gender (female, male)]
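Before turning to the tradeoff, here is a sketch of one common nonparametric regression estimate, a Nadaraya-Watson kernel smoother. The data are synthetic stand-ins for the bone mineral density example (age versus change in BMD); the bump-shaped trend, noise level, and bandwidth are illustrative assumptions, and the slides do not commit to this particular estimator.

```python
import numpy as np

# Nadaraya-Watson kernel smoother on synthetic BMD-like data (illustrative only).
rng = np.random.default_rng(1)
n = 200
age = rng.uniform(10, 25, n)
r_true = 0.2 * np.exp(-0.5 * ((age - 13) / 2.0) ** 2)   # assumed bump-shaped trend
y = r_true + 0.03 * rng.standard_normal(n)              # Y_i = r(x_i) + eps_i

def nw_smoother(x0, x, y, h):
    """Estimate r(x0) by a kernel-weighted average of the Y_i with bandwidth h."""
    w = np.exp(-0.5 * ((x0[:, None] - x[None, :]) / h) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

grid = np.linspace(10, 25, 7)
r_hat = nw_smoother(grid, age, y, h=1.0)
print(np.round(r_hat, 3))
print(np.round(0.2 * np.exp(-0.5 * ((grid - 13) / 2.0) ** 2), 3))  # true trend, for comparison
```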

The Tradeoff

- Let \hat f_n(x) be an estimate of a function f(x)
- Define the squared error loss as

  Loss = L(f(x), \hat f_n(x)) = (f(x) - \hat f_n(x))^2

- Define the average of this loss as the risk or Mean Squared Error (MSE):

  MSE = R(f(x), \hat f_n(x)) = E(Loss)

- The expectation is taken with respect to \hat f_n, which is random
- The MSE can be decomposed into a bias term and a variance term:

  MSE = Bias^2 + Var

- The decomposition is easy to show

The Bias–Variance Tradeoff

- Expand:

  E((f - \hat f)^2) = E(f^2 + \hat f^2 - 2 f \hat f) = E(f^2) + E(\hat f^2) - E(2 f \hat f)

- Use Var(X) = E(X^2) - E(X)^2:

  E((f - \hat f)^2) = Var(f) + E(f)^2 + Var(\hat f) + E(\hat f)^2 - E(2 f \hat f)

- Use E(f) = f and Var(f) = 0:

  E((f - \hat f)^2) = f^2 + Var(\hat f) + E(\hat f)^2 - 2 f E(\hat f)

- Use (E(\hat f) - f)^2 = f^2 + E(\hat f)^2 - 2 f E(\hat f):

  E((f - \hat f)^2) = (E(\hat f) - f)^2 + Var(\hat f) = Bias^2 + Var
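A quick Monte Carlo check of the decomposition, not from the slides: the setup (a deliberately shrunken sample mean of a normal sample) is an arbitrary choice made only to produce both nonzero bias and nonzero variance.

```python
import numpy as np

# Monte Carlo check that MSE = Bias^2 + Var, using the (intentionally biased)
# estimator 0.8 * mean(X) of theta = E(X). The choice of estimator is illustrative.
rng = np.random.default_rng(0)
theta, n, n_rep = 1.0, 30, 100_000

estimates = np.array([0.8 * rng.normal(theta, 1.0, n).mean() for _ in range(n_rep)])

mse = np.mean((estimates - theta) ** 2)
bias_sq = (estimates.mean() - theta) ** 2
var = estimates.var()
print(mse, bias_sq + var)  # the two numbers agree up to Monte Carlo error
```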

The Bias–Variance Tradeoff

- This described the risk at a single point
- To summarize the risk for density problems, we integrate:

  R(f, \hat f_n) = \int R(f(x), \hat f_n(x)) \, dx

- For regression problems, we sum over all design points:

  R(r, \hat r_n) = \sum_{i=1}^{n} R(r(x_i), \hat r_n(x_i))

The Bias–Variance Tradeoff

- Consider the regression model

  Y_i = r(x_i) + \varepsilon_i

- Suppose we draw a new observation Y_i^* = r(x_i) + \varepsilon_i^* at each x_i
- If we predict Y_i^* with \hat r_n(x_i), then the squared prediction error is

  (Y_i^* - \hat r_n(x_i))^2 = (r(x_i) + \varepsilon_i^* - \hat r_n(x_i))^2

- Define the predictive risk as

  E\left( \frac{1}{n} \sum_{i=1}^{n} (Y_i^* - \hat r_n(x_i))^2 \right)

The Bias–Variance Tradeoff

- Up to a constant, the average risk and the predictive risk are the same:

  E\left( \frac{1}{n} \sum_{i=1}^{n} (Y_i^* - \hat r_n(x_i))^2 \right) = \frac{1}{n} R(r, \hat r_n) + \frac{1}{n} \sum_{i=1}^{n} E((\varepsilon_i^*)^2)

  (the cross term vanishes because \varepsilon_i^* is independent of \hat r_n and has mean zero)

- In particular, if the errors \varepsilon_i have variance \sigma^2, then

  E\left( \frac{1}{n} \sum_{i=1}^{n} (Y_i^* - \hat r_n(x_i))^2 \right) = \frac{1}{n} R(r, \hat r_n) + \sigma^2

  (a simulation check follows below)
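A short simulation, not from the slides, illustrating that the predictive risk equals the average risk plus sigma^2. The running-mean smoother and the sine trend are arbitrary choices; the identity holds for any estimator that is independent of the new errors.

```python
import numpy as np

# Check: E[(1/n) sum (Y_i^* - r_hat(x_i))^2] ~= (1/n) R(r, r_hat) + sigma^2.
# The smoother and the true function r are illustrative choices.
rng = np.random.default_rng(0)
n, sigma, n_rep = 50, 0.5, 5000
x = np.linspace(0, 1, n)
r = np.sin(2 * np.pi * x)                         # true regression function

def running_mean(y, k=5):
    return np.convolve(y, np.ones(k) / k, mode="same")

avg_risk, pred_risk = 0.0, 0.0
for _ in range(n_rep):
    y = r + sigma * rng.standard_normal(n)        # training responses
    y_star = r + sigma * rng.standard_normal(n)   # new responses at the same x_i
    r_hat = running_mean(y)
    avg_risk += np.mean((r_hat - r) ** 2) / n_rep
    pred_risk += np.mean((y_star - r_hat) ** 2) / n_rep

print(pred_risk, avg_risk + sigma**2)  # approximately equal
```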

The Bias–Variance Tradeoff

- The challenge in smoothing is to determine how much smoothing to do
- When the data are oversmoothed, the bias term is large and the variance term is small
- When the data are undersmoothed, the opposite is true
- This is called the bias–variance tradeoff
- Minimizing risk corresponds to balancing bias and variance

The Bias–Variance Tradeoff

[Figure illustrating the bias–variance tradeoff. Source: Wasserman (2006)]

The Bias–Variance Tradeoff (Example)

- Let f be a pdf
- Consider estimating f(0)
- Let h be a small, positive number
- Define

  p_h := P\left( -\frac{h}{2} < X < \frac{h}{2} \right) = \int_{-h/2}^{h/2} f(x) \, dx \approx h f(0)

- Hence

  f(0) \approx \frac{p_h}{h}

The Bias–Variance Tradeoff (Example)

- Let X be the number of observations in the interval (-h/2, h/2)
- Then X \sim Binomial(n, p_h)
- An estimate of p_h is \hat p_h = X/n, and an estimate of f(0) is

  \hat f_n(0) = \frac{\hat p_h}{h} = \frac{X}{nh}

- We now show that the MSE of \hat f_n(0) is, for some constants A and B,

  MSE = A h^4 + \frac{B}{nh} = Bias^2 + Variance

  (a short numeric sketch of this estimator follows)
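A small sketch of this counting estimator, assuming standard normal data so that the true value f(0) = 1/sqrt(2 pi) is known; the sample size and bandwidth are arbitrary.

```python
import numpy as np

# The counting estimator f_hat_n(0) = X / (n h) on standard normal data,
# where the true value f(0) = 1/sqrt(2 pi) ~ 0.399 is known for comparison.
rng = np.random.default_rng(0)
n, h = 10_000, 0.2
sample = rng.standard_normal(n)

X = np.sum((sample > -h / 2) & (sample < h / 2))  # count of points in (-h/2, h/2)
f_hat_0 = X / (n * h)
print(f_hat_0, 1 / np.sqrt(2 * np.pi))
```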

The Bias–Variance Tradeoff (Example)

- Taylor expand f around 0:

  f(x) \approx f(0) + x f'(0) + \frac{x^2}{2} f''(0)

- Plug in:

  p_h = \int_{-h/2}^{h/2} f(x) \, dx \approx \int_{-h/2}^{h/2} \left( f(0) + x f'(0) + \frac{x^2}{2} f''(0) \right) dx = h f(0) + \frac{f''(0) h^3}{24}

The Bias–Variance Tradeoff (Example)

- Since X is binomial, we have E(X) = n p_h
- Use the Taylor approximation p_h \approx h f(0) + \frac{f''(0) h^3}{24} and combine:

  E(\hat f_n(0)) = \frac{E(X)}{nh} = \frac{p_h}{h} \approx f(0) + \frac{f''(0) h^2}{24}

- After rearranging, the bias is

  Bias = E(\hat f_n(0)) - f(0) \approx \frac{f''(0) h^2}{24}

The Bias–Variance Tradeoff (Example)

- For the variance term, note that Var(X) = n p_h (1 - p_h), so

  Var(\hat f_n(0)) = \frac{Var(X)}{n^2 h^2} = \frac{p_h (1 - p_h)}{n h^2}

- Use 1 - p_h \approx 1, since h is small:

  Var(\hat f_n(0)) \approx \frac{p_h}{n h^2}

- Combine with the Taylor expansion:

  Var(\hat f_n(0)) \approx \frac{h f(0) + \frac{f''(0) h^3}{24}}{n h^2} = \frac{f(0)}{nh} + \frac{f''(0) h}{24 n} \approx \frac{f(0)}{nh}

The Bias–Variance Tradeoff (Example)

- Combining both terms:

  MSE = Bias^2 + Var(\hat f_n(0)) = \frac{(f''(0))^2 h^4}{576} + \frac{f(0)}{nh} \equiv A h^4 + \frac{B}{nh}

- As we smooth more (increase h), the bias term increases and the variance term decreases
- As we smooth less (decrease h), the bias term decreases and the variance term increases
- This is a typical bias–variance analysis (a sketch of the bandwidth that balances the two terms follows)
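Setting the derivative of MSE(h) = A h^4 + B/(nh) to zero gives the balancing bandwidth h* = (B/(4An))^{1/5}, and plugging it back in shows the optimal MSE is of order n^{-4/5}. The sketch below checks the formula against a grid search; A and B are placeholder constants, since in the example they involve the unknown f(0) and f''(0).

```python
import numpy as np

# Minimize MSE(h) = A h^4 + B / (n h). A and B are placeholders; in the example
# above, A = (f''(0))^2 / 576 and B = f(0) depend on the unknown density.
A, B, n = 1.0, 1.0, 1000

h_star = (B / (4 * A * n)) ** (1 / 5)        # closed-form minimizer
h_grid = np.linspace(1e-3, 1.0, 10_000)
mse = A * h_grid**4 + B / (n * h_grid)

print(h_star, h_grid[np.argmin(mse)])        # formula agrees with the grid search
print(A * h_star**4 + B / (n * h_star))      # optimal MSE, of order n^(-4/5)
```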

The Curse of Dimensionality

- A problem with smoothing in high dimensions is the curse of dimensionality
- Estimation gets harder as the dimension of the observations increases
- Computational: the computational burden increases exponentially with dimension
- Statistical: if the data have dimension d, then we need the sample size n to grow exponentially with d
- The MSE of any nonparametric estimator of a smooth curve has the form (for some c > 0)

  MSE \approx \frac{c}{n^{4/(4+d)}}

- If we want a fixed MSE = \delta for some small number \delta, then solving for n gives (see the numeric sketch below)

  n \propto \left( \frac{c}{\delta} \right)^{d/4}
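A numeric illustration with made-up values c = 1 and delta = 0.01: solving c / n^{4/(4+d)} = delta exactly gives n = (c/delta)^{(4+d)/4}, which is (c/delta) times (c/delta)^{d/4} and therefore grows exponentially in d, as the slide states.

```python
# Required sample size for MSE = delta under MSE ~ c / n^(4/(4+d)).
# c and delta are illustrative values, not from the slides.
c, delta = 1.0, 0.01

for d in [1, 2, 5, 10, 20]:
    n_required = (c / delta) ** ((4 + d) / 4)   # exact solution of c / n^(4/(4+d)) = delta
    print(d, f"{n_required:.3g}")               # grows exponentially with d
```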

The Curse of Dimensionality

- We see that n \propto (c/\delta)^{d/4} grows exponentially with the dimension d
- The reason for this is that smoothing involves estimating a function in a local neighborhood
- But in high-dimensional problems the data are very sparse, so local neighborhoods contain very few points

The Curse of Dimensionality (Example)

- Suppose we have n data points uniformly distributed on the interval [0, 1]
- How many data points will we find in the interval [0, 0.1]? The answer: about n/10 points
- Now suppose we have n points in the 10-dimensional unit cube [0, 1]^{10}
- How many data points will we find in [0, 0.1]^{10}? The answer: about

  n \times 0.1^{10} = \frac{n}{10{,}000{,}000{,}000}

  (the simulation sketch below checks this count)

- Thus, n has to be huge to ensure that small neighborhoods contain any data points
- Smoothing methods can in principle be used in high-dimensional problems
- But the estimator won't be accurate, and the confidence interval around the estimate will be distressingly large
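A small simulation of this sparsity effect, with an arbitrary n = 100,000 uniform points: the observed number of points in the corner cube [0, 0.1]^d is compared to the expected count n * 0.1^d.

```python
import numpy as np

# Count uniform points falling in the corner cube [0, 0.1]^d.
rng = np.random.default_rng(0)
n = 100_000

for d in [1, 2, 5, 10]:
    points = rng.uniform(size=(n, d))
    observed = np.all(points < 0.1, axis=1).sum()
    print(d, observed, n * 0.1**d)   # observed vs. expected count; essentially 0 for d = 10
```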

The Curse of Dimensionality (Example)

[Figure from Hastie, Tibshirani, and Friedman (2009)]

- In ten dimensions, we need to cover 80% of the range of each coordinate to capture 10% of the data

- Wasserman, L. (2006). All of Nonparametric Statistics. Springer.
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer.