
6. Density Estimation

Outline
1. Cross Validation
2. Histogram
3. Kernel Density Estimation
4. Local Polynomials
5. Higher Dimensions
6. Mixture Models
7. Converting Density Estimation Into Regression

6.1 Cross Validation

Suppose we observe X_1, \ldots, X_n from an unknown density f. Our goal is to estimate f nonparametrically. Finding the best \hat f_n in some sense is equivalent to finding the optimal smoothing parameter h.

How do we measure the performance of \hat f_n?

• Risk / integrated mean squared error (IMSE):

    R(\hat f_n, f) = E(L(\hat f_n, f)),  where  L(\hat f_n, f) = \int (\hat f_n(x) - f(x))^2 \, dx.

• One would like the optimal estimator that minimizes the risk function,

    \hat f_n^* = \operatorname{argmin}_{\hat f_n} R(\hat f_n, f),

  but the risk function is unknown!
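As a quick numerical illustration of the loss L above (not part of the original notes): the integrated squared error of any estimate against a known density can be approximated by a Riemann sum on a grid. Here f is taken to be standard normal and \hat f_n is R's default kernel estimate; all choices are illustrative.

set.seed(1)
x <- rnorm(100)                               # sample from a known f = N(0, 1)
fhat <- density(x)                            # a kernel density estimate of f
dx <- diff(fhat$x[1:2])                       # grid spacing
ise <- sum((fhat$y - dnorm(fhat$x))^2) * dx   # Riemann sum of (fhat - f)^2
ise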

Estimation of the risk function

• Use leave-one-out cross-validation to estimate the risk function.
• One can express the loss as a function of the smoothing parameter h:

    L(h) = \int (\hat f_n(x) - f(x))^2 \, dx
         = \int \hat f_n^2(x) \, dx - 2 \int \hat f_n(x) f(x) \, dx + \int f^2(x) \, dx,

  where the first two terms define J(h) and the last term does not depend on h.
• The cross-validation estimator of the risk function (up to that constant) is

    \hat J(h) = \int \hat f_n^2(x) \, dx - \frac{2}{n} \sum_{i=1}^n \hat f_{(-i)}(X_i),

  where \hat f_{(-i)} is the density estimator obtained after removing the i-th observation.

Finding an optimal estimator \hat f_n

• E(\hat J(h)) = E(J(h)) = R(h) + c, where the constant c = -\int f^2(x) \, dx does not depend on h.
• The optimal smoothing parameter is \hat h^* = \operatorname{argmin}_h \hat J(h).
• The nonparametric estimator \hat f can be expressed as a function of h, and the best estimator \hat f_n^* is obtained by plugging in the optimal smoothing parameter: \hat f^* = \hat f_{\hat h^*}.

6.2 Histogram

Histogram estimator

• Without loss of generality, assume that the support of f is [0, 1]. Divide the support into m equally sized bins

    B_1 = [0, 1/m),  B_2 = [1/m, 2/m),  \ldots,  B_m = [(m-1)/m, 1].

• Let h = 1/m, p_j = \int_{B_j} f(x) \, dx and Y_j = \sum_{i=1}^n I(X_i \in B_j).
• The histogram estimator is defined by

    \hat f_n(x) = \sum_{j=1}^m \frac{\hat p_j}{h} I(x \in B_j),  where  \hat p_j = \frac{Y_j}{n}

  (see the sketch following Theorem 1).

Theorem 1. Suppose that f' is absolutely continuous and \int (f'(u))^2 \, du < \infty. Then

    R(\hat f_n, f) = \frac{h^2}{12} \int (f'(u))^2 \, du + \frac{1}{nh} + o(h^2) + o(1/n).

The optimal bandwidth is

    h^* = \frac{1}{n^{1/3}} \left( \frac{6}{\int (f'(u))^2 \, du} \right)^{1/3}.

With the optimal binwidth,

    R(\hat f_n, f) \approx \frac{C}{n^{2/3}},  where  C = (3/4)^{2/3} \left( \int (f'(u))^2 \, du \right)^{1/3}.
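A minimal R sketch of the histogram estimator \hat f_n above (not from the original notes): the faithful eruptions data are rescaled to [0, 1] to match the convention, and the bin count m is an arbitrary illustrative choice.

x <- (faithful$eruptions - min(faithful$eruptions)) /
  diff(range(faithful$eruptions))                   # rescale the support to [0, 1]
n <- length(x)
m <- 20; h <- 1/m                                   # m equal-width bins, binwidth h = 1/m
Y <- tabulate(pmin(m, floor(x/h) + 1), nbins = m)   # bin counts Y_j
p.hat <- Y / n                                      # hat p_j = Y_j / n
f.hat <- function(x0) p.hat[pmin(m, floor(x0/h) + 1)] / h   # step-function estimate
curve(f.hat(x), from = 0, to = 1, n = 501)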

Proof. For any x, u \in B_j,

    f(u) = f(x) + (u - x) f'(x) + \frac{(u - x)^2}{2} f''(\tilde x)

for some \tilde x between x and u. Hence,

    p_j = \int_{B_j} f(u) \, du
        = \int_{B_j} \left[ f(x) + (u - x) f'(x) + \frac{(u - x)^2}{2} f''(\tilde x) \right] du
        = f(x) h + h f'(x) \left( h(j - 1/2) - x \right) + O(h^3).

Therefore, the bias of \hat f_n is

    b(x) \equiv E(\hat f_n(x)) - f(x) = \frac{p_j}{h} - f(x)
         = \frac{1}{h} \left[ f(x) h + h f'(x) \left( h(j - 1/2) - x \right) + O(h^3) \right] - f(x)
         = f'(x) \left( h(j - 1/2) - x \right) + O(h^2).

By the mean value theorem, for some \tilde x_j \in B_j,

    \int_{B_j} b^2(x) \, dx = \int_{B_j} (f'(x))^2 \left( h(j - 1/2) - x \right)^2 dx + O(h^4)
                            = (f'(\tilde x_j))^2 \int_{B_j} \left( h(j - 1/2) - x \right)^2 dx + O(h^4)
                            = (f'(\tilde x_j))^2 \frac{h^3}{12} + O(h^4).

Hence

    \int_0^1 b^2(x) \, dx = \sum_{j=1}^m \int_{B_j} b^2(x) \, dx
                          = \frac{h^3}{12} \sum_{j=1}^m (f'(\tilde x_j))^2 + O(h^3)
                          = \frac{h^2}{12} \int_0^1 (f'(x))^2 \, dx + o(h^2).

For the variance of \hat f_n:

    v(x) \equiv \operatorname{Var}(\hat f_n(x)) = \frac{1}{h^2} \operatorname{Var}(\hat p_j) = \frac{p_j (1 - p_j)}{n h^2}.

By the mean value theorem, for some \tilde x_j \in B_j,

    p_j = \int_{B_j} f(x) \, dx = h f(\tilde x_j).

Therefore,

    \int_0^1 v(x) \, dx = \sum_{j=1}^m \int_{B_j} v(x) \, dx = \sum_{j=1}^m \frac{p_j (1 - p_j)}{nh}
                        = \frac{1}{nh} - \frac{1}{nh} \sum_{j=1}^m p_j^2
                        = \frac{1}{nh} - \frac{1}{nh} \sum_{j=1}^m h^2 f^2(\tilde x_j)
                        = \frac{1}{nh} - \frac{1}{n} \int_0^1 f^2(x) \, dx + o(1/n) = \frac{1}{nh} + O(1/n).

Adding the integrated squared bias and the integrated variance gives the risk expansion in Theorem 1.

Theorem 2. The cross-validation estimator of risk for the histogram is

    \hat J(h) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h} \sum_{j=1}^m \hat p_j^2.

Theorem 3. Let m = m(n) be the number of bins in the histogram \hat f_n. Assume that m(n) \to \infty and m(n) \log n / n \to 0 as n \to \infty. Define

    \ell_n(x) = \left( \max\left\{ \sqrt{\hat f_n(x)} - c, \, 0 \right\} \right)^2,  u_n(x) = \left( \sqrt{\hat f_n(x)} + c \right)^2,

where c = \frac{z_{\alpha/(2m)}}{2} \sqrt{m/n}. Then

    \Pr\left( \ell_n(x) \le E(\hat f_n(x)) \le u_n(x) \ \forall x \right) \ge 1 - \alpha.
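A minimal R sketch of bin selection with the closed-form score from Theorem 2, reusing the rescaled data x and n from the sketch above; the candidate grid of bin counts is an arbitrary choice.

Jhat.hist <- function(m) {                    # CV risk estimate for m bins (Theorem 2)
  h <- 1/m
  p.hat <- tabulate(pmin(m, floor(x/h) + 1), nbins = m) / n
  2/((n - 1)*h) - (n + 1)/((n - 1)*h) * sum(p.hat^2)
}
ms <- 1:100                                   # candidate numbers of bins
m.best <- ms[which.min(sapply(ms, Jhat.hist))]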

6.3 Kernel Density Estimation

• Given a kernel K and a positive number h, called the bandwidth, the kernel density estimator is

    \hat f_n(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K\!\left( \frac{x - X_i}{h} \right).

• The choice of the kernel K is not crucial, but the choice of the bandwidth h is important.
• We assume that K satisfies

    \int K(x) \, dx = 1,  \int x K(x) \, dx = 0  and  \sigma_K^2 \equiv \int x^2 K(x) \, dx > 0.
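A minimal R sketch of this estimator with a Gaussian kernel; the bandwidth h = 0.3 and the evaluation grid are arbitrary illustrative choices.

kde <- function(x0, x, h) {                   # hat f_n(x0) with Gaussian K
  sapply(x0, function(t) mean(dnorm((t - x)/h)) / h)
}
x <- faithful$eruptions
grid <- seq(1, 6, length.out = 200)
plot(grid, kde(grid, x, h = 0.3), type = "l", ylab = "density")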

Theorem 4. Let R_x = E(\hat f(x) - f(x))^2 be the risk at a point x and let R = \int R_x \, dx denote the integrated risk. Assume that f'' is absolutely continuous and that \int (f'''(x))^2 \, dx < \infty. Also, assume that K satisfies

    \int K(x) \, dx = 1,  \int x K(x) \, dx = 0  and  \sigma_K^2 \equiv \int x^2 K(x) \, dx > 0.

Then,

    R_x = \frac{1}{4} \sigma_K^4 h_n^4 (f''(x))^2 + \frac{f(x) \int K^2(x) \, dx}{n h_n} + O(n^{-1}) + O(h_n^6),

    R = \frac{1}{4} \sigma_K^4 h_n^4 \int (f''(x))^2 \, dx + \frac{\int K^2(x) \, dx}{n h_n} + O(n^{-1}) + O(h_n^6).

Proof. Write K_h(x, X) = \frac{1}{h} K\!\left( \frac{x - X}{h} \right) and \hat f_n(x) = \frac{1}{n} \sum_{i=1}^n K_h(x, X_i). Then

    E(\hat f_n(x)) = E(K_h(x, X))
                   = \int \frac{1}{h} K\!\left( \frac{x - t}{h} \right) f(t) \, dt
                   = \int K(u) f(x - hu) \, du
                   = \int K(u) \left[ f(x) - hu f'(x) + \frac{h^2 u^2}{2} f''(x) + \cdots \right] du
                   = f(x) + \frac{h^2}{2} f''(x) \int u^2 K(u) \, du + \cdots.

Hence

    \operatorname{bias}(\hat f_n(x)) = \frac{1}{2} \sigma_K^2 h_n^2 f''(x) + O(h_n^4).

Similarly,

    \operatorname{Var}(\hat f_n(x)) = \frac{f(x) \int K^2(x) \, dx}{n h_n} + O(n^{-1}).

Bandwidth selection

• The optimal smoothing bandwidth is

    h^* = \left( \frac{c_2}{c_1^2 A(f) n} \right)^{1/5},

  where c_1 = \int x^2 K(x) \, dx, c_2 = \int (K(x))^2 \, dx and A(f) = \int (f''(x))^2 \, dx.
• The only unknown quantity in h^* is A(f) = \int (f''(x))^2 \, dx.
• Integrating by parts, A(f) = \int (f''(x))^2 \, dx = \int f^{(4)}(x) f(x) \, dx = E(f^{(4)}(X)), where f^{(r)} denotes the r-th derivative of f.

The normal reference rule

• Assume that f is normal; then h^* can be computed explicitly. With the Gaussian kernel, h^* = 1.06 \, \sigma n^{-1/5}.
• \sigma is estimated by \hat\sigma = \min\{ s, \mathrm{IQR}/1.34 \}, where s is the sample standard deviation and IQR is the interquartile range.
• The selected bandwidth is (see the sketch below)

    \hat h_n = \frac{1.06 \, \hat\sigma}{n^{1/5}}.
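The rule is one line of R. This sketch mirrors the course code later in these notes; R's built-in bw.nrd implements the same 1.06 rule, so the two values should agree.

x <- faithful$eruptions
n <- length(x)
sigma.hat <- min(sd(x), IQR(x)/1.34)    # robust scale estimate
h.normal <- 1.06 * sigma.hat / n^(1/5)  # normal reference bandwidth
c(h.normal, bw.nrd(x))                  # built-in normal reference rule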

Plug-in method

Let \psi_r = E(f^{(r)}(X)). To estimate \psi_4 with a kernel estimator, one needs to choose the optimal bandwidth, which is a functional of \psi_6.

1. Estimate \psi_8 with the bandwidth chosen by the normal reference rule.
2. Estimate \psi_6 with the bandwidth depending on \hat\psi_8.
3. Estimate \psi_4 with the bandwidth depending on \hat\psi_6.
4. The selected bandwidth is

    \hat h^* = \left( \frac{c_2}{c_1^2 \hat\psi_4 n} \right)^{1/5}.

Cross-validation

• Cross-validation score function:

    \hat J(h) = \int \hat f^2(x) \, dx - \frac{2}{n} \sum_{i=1}^n \hat f_{(-i)}(X_i).

• The selected bandwidth is \hat h = \operatorname{argmin}_h \hat J(h); see the sketch below.

Theorem 5 (Stone's theorem). Suppose f is bounded. Let \hat f_h denote the kernel estimator with bandwidth h and let \hat h denote the bandwidth chosen by cross-validation. Then

    \frac{ \int \left( f(x) - \hat f_{\hat h}(x) \right)^2 dx }{ \inf_h \int \left( f(x) - \hat f_h(x) \right)^2 dx } \xrightarrow{a.s.} 1.
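Least-squares cross-validation is available in base R as bw.ucv; a minimal sketch (the data choice is illustrative, and the notes' own code later uses MASS::ucv for the same purpose):

x <- faithful$eruptions
h.cv <- bw.ucv(x)                 # unbiased (least-squares) CV bandwidth
f.cv <- density(x, bw = h.cv)     # kernel estimate at the CV bandwidth
plot(f.cv)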

Optimal convergence rate

From Theorem 4, the optimal convergence rate is O(n^{-4/5}) if the optimal bandwidth is used. No estimator can do better uniformly over a smoothness class:

Theorem 6. Let \mathcal{F} be the set of all pdfs and let f^{(m)} denote the m-th derivative of f. Define

    \mathcal{F}_m(c) = \left\{ f \in \mathcal{F} : \int |f^{(m)}(x)|^2 \, dx \le c \right\}.

For any estimator \hat f_n,

    \sup_{f \in \mathcal{F}_m(c)} E_f \int \left( \hat f_n(x) - f(x) \right)^2 dx \ge b \, n^{-2m/(2m+1)},

where b > 0 is a universal constant that depends only on m and c.

Computation

• The cross-validation score function \hat J(h) can be approximated by

    \hat J(h) = \frac{1}{h n^2} \sum_{i=1}^n \sum_{j=1}^n K^*\!\left( \frac{X_i - X_j}{h} \right) + \frac{2}{nh} K(0) + O\!\left( \frac{1}{n^2} \right),

  where K^*(x) = K^{(2)}(x) - 2K(x) and K^{(2)}(z) = \int K(z - y) K(y) \, dy; see the sketch after this list.
• For the computation of \hat f_n and \hat J(h), use
  – the fast Fourier transform (FFT);
  – a binning strategy.
• For the details, see Silverman (1986) and Wand and Jones (1995).
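A minimal R sketch of the K* approximation for the Gaussian kernel, for which K^{(2)} is the N(0, 2) density (the convolution of two standard normal densities); the bandwidth grid is an arbitrary choice. The O(n^2) double sum is fine for small n; the FFT and binning strategies above are what make this fast at scale.

x <- faithful$eruptions
n <- length(x)
Jhat <- function(h) {                                 # approximate CV score
  d <- outer(x, x, "-")/h                             # all (X_i - X_j)/h
  Kstar <- dnorm(d, sd = sqrt(2)) - 2*dnorm(d)        # K*(x) = K^(2)(x) - 2K(x)
  sum(Kstar)/(h*n^2) + 2*dnorm(0)/(n*h)
}
hs <- seq(0.02, 1, length.out = 80)
h.star <- hs[which.min(sapply(hs, Jhat))]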

6.4 Local Polynomials

Smoothed log-likelihood

• Recall that the nonparametric MLE of the pdf is

    \hat p = (\hat p_1, \ldots, \hat p_n) \equiv \operatorname{argmax}_p L(p),

  where

    L(p) = \sum_{i=1}^n \log p_i - n \left( \sum_{i=1}^n p_i - 1 \right).

• A smoothed version of the log-likelihood at x (up to a constant) is

    L_x(p) = \sum_{i=1}^n K\!\left( \frac{x - X_i}{h} \right) \log p_i - n \sum_{i=1}^n K\!\left( \frac{x - X_i}{h} \right) p_i.

Local log-likelihood

• The kernel density estimator maximizes the smoothed log-likelihood: \hat f_K(x) = \operatorname{argmax}_p L_x(p).
• The log-likelihood function of f is

    L(f) = \sum_{i=1}^n \log f(X_i) - n \left( \int f(u) \, du - 1 \right).

• The local log-likelihood at a target value x is

    L_x(f) = \sum_{i=1}^n K\!\left( \frac{x - X_i}{h} \right) \log f(X_i) - n \int K\!\left( \frac{x - u}{h} \right) f(u) \, du.
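Local likelihood density estimation is implemented in the locfit package, which the course code below also uses; a minimal sketch (the smoothing parameters alpha and the plotting window are illustrative):

library(locfit)
eruptions <- faithful$eruptions
f.loc <- locfit(~ eruptions, alpha = c(0.1, 0.8), flim = c(1, 6))
plot(f.loc)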

Local likelihood density estimator

• Approximate \log f(u) with a polynomial

    P_x(a, u) = \sum_{j=0}^p \frac{a_j}{j!} (x - u)^j.

• The local polynomial log-likelihood is

    L_x(a) = \sum_{i=1}^n K\!\left( \frac{x - X_i}{h} \right) P_x(a, X_i) - n \int K\!\left( \frac{x - u}{h} \right) e^{P_x(a, u)} \, du.

• The local likelihood density estimator is

    \hat f_n(x) = e^{P_x(\hat a, x)} = e^{\hat a_0},

  where \hat a = (\hat a_0, \ldots, \hat a_p) \equiv \operatorname{argmax}_a L_x(a).

6.5 Higher Dimensions

Suppose X_i = (X_{i1}, \ldots, X_{id}). The multivariate kernel estimator is

    \hat f_n(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{|H|} K\!\left( H^{-1}(x - X_i) \right).

We often assume a simple form of the bandwidth matrix or kernel, for example H = \operatorname{diag}(h_1, \ldots, h_d). Then

    \hat f_n(x) = \frac{1}{n h_1 \cdots h_d} \sum_{i=1}^n \prod_{j=1}^d K\!\left( \frac{x_j - X_{ij}}{h_j} \right);

a sketch of this product-kernel form follows.

• The risk function is

    R \approx \frac{1}{4} \sigma_K^4 \left[ \sum_{j=1}^d h_j^4 \int f_{jj}^2(x) \, dx + \sum_{j \ne k} h_j^2 h_k^2 \int f_{jj} f_{kk} \, dx \right] + \frac{ \left( \int K^2(x) \, dx \right)^d }{ n h_1 \cdots h_d }.

• The optimal bandwidths satisfy h_i^* = O(n^{-1/(4+d)}).
• In practice, one can choose the optimal bandwidth by cross-validation, and one often assumes a simple form of the bandwidth matrix, H = h \cdot I.
• The curse of dimensionality.
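A minimal R sketch of the product-kernel estimator with H = diag(h_1, h_2) on the two-dimensional faithful data; the bandwidth values are arbitrary illustrative choices.

X <- as.matrix(faithful[, c("eruptions", "waiting")])
h <- c(0.3, 3)                                # h_1, h_2
kde2 <- function(x0) {                        # hat f_n at a point x0 = (x1, x2)
  w <- dnorm((x0[1] - X[, 1])/h[1]) * dnorm((x0[2] - X[, 2])/h[2])
  mean(w) / prod(h)
}
kde2(c(3.5, 70))                              # estimate at one point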

6.6 Mixture Models

The normal mixture model

    f(x; \theta) = \sum_{j=1}^K p_j \, \phi(x \mid \mu_j, \sigma_j^2),

where

    \phi(x \mid \mu_j, \sigma_j^2) = \frac{1}{\sigma_j \sqrt{2\pi}} \, e^{-(x - \mu_j)^2 / (2 \sigma_j^2)}.

• Define \theta = (p_1, \ldots, p_K, \mu_1, \ldots, \mu_K, \sigma_1^2, \ldots, \sigma_K^2).
• Given K, how do we estimate \theta? → the EM algorithm.

EM algorithm

For K = 2:

• Define latent variables Z_i:

    Z_i = 1 if Y_i is from group 1;  Z_i = 0 if Y_i is from group 2.

• The complete-data log-likelihood is

    L(y, z, \theta) = \sum_{i: z_i = 1} \log[ p \, \phi(y_i \mid \mu_1, \sigma_1^2) ] + \sum_{i: z_i = 0} \log[ (1 - p) \, \phi(y_i \mid \mu_2, \sigma_2^2) ]
                    = \sum_{i=1}^n \left\{ z_i \log[ p \, \phi(y_i \mid \mu_1, \sigma_1^2) ] + (1 - z_i) \log[ (1 - p) \, \phi(y_i \mid \mu_2, \sigma_2^2) ] \right\}.

• [E-step] Given \hat\theta, compute E(Z_i \mid Y, \hat\theta):

    E(Z_i \mid Y = y_i, \hat\theta) \equiv w_i = \frac{ \hat p \, \hat\phi_1(y_i) }{ \hat p \, \hat\phi_1(y_i) + (1 - \hat p) \, \hat\phi_2(y_i) },

  where \hat\phi_1 = \phi(\cdot \mid \hat\mu_1, \hat\sigma_1^2) and \hat\phi_2 = \phi(\cdot \mid \hat\mu_2, \hat\sigma_2^2).

• [M-step]

    \tilde\theta = \operatorname{argmax}_\theta \tilde L(\theta),

  where

    \tilde L(\theta) = \sum_{i=1}^n \left( w_i \log[ p \, \phi(y_i \mid \mu_1, \sigma_1^2) ] + (1 - w_i) \log[ (1 - p) \, \phi(y_i \mid \mu_2, \sigma_2^2) ] \right).

  Then

    \hat p = \frac{\sum_{i=1}^n w_i}{n},  \hat\mu_1 = \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i},  \hat\mu_2 = \frac{\sum_{i=1}^n (1 - w_i) y_i}{\sum_{i=1}^n (1 - w_i)},

    \hat\sigma_1^2 = \frac{\sum_{i=1}^n w_i (y_i - \hat\mu_1)^2}{\sum_{i=1}^n w_i},  \hat\sigma_2^2 = \frac{\sum_{i=1}^n (1 - w_i) (y_i - \hat\mu_2)^2}{\sum_{i=1}^n (1 - w_i)}

  (see the runnable sketch below).

How to estimate the number of components K?

• AIC (Akaike information criterion): find the most predictive model,

    AIC = L - q,

  where L is the log-likelihood and q is the number of parameters.
• BIC (Bayesian information criterion): find the true model with high probability,

    BIC = L - \frac{q \log n}{2}.

Remarks

• The number of clusters is not always equal to the number of components.
• The Gaussian mixture is sensitive to outliers → replace the normal density with a t-distribution density.
• If \sigma_j^2 = \sigma^2 > 0 is fixed and K → n, then the MLE of the mixture model approaches the kernel estimate with p_j = 1/n and \mu_j = x_j.
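A minimal R implementation of the two-component EM updates above; the initialization and the fixed number of iterations are arbitrary choices, not part of the notes.

em2 <- function(y, iters = 200) {
  p <- 0.5                                    # initial guesses (assumed)
  mu1 <- mean(y) - sd(y); mu2 <- mean(y) + sd(y)
  s1 <- s2 <- sd(y)
  for (it in 1:iters) {
    # E-step: w_i = P(observation i is from group 1 | current parameters)
    a <- p * dnorm(y, mu1, s1)
    b <- (1 - p) * dnorm(y, mu2, s2)
    w <- a/(a + b)
    # M-step: weighted updates from the slide above
    p <- mean(w)
    mu1 <- sum(w*y)/sum(w); mu2 <- sum((1 - w)*y)/sum(1 - w)
    s1 <- sqrt(sum(w*(y - mu1)^2)/sum(w))
    s2 <- sqrt(sum((1 - w)*(y - mu2)^2)/sum(1 - w))
  }
  list(p = p, mu = c(mu1, mu2), sigma = c(s1, s2))
}
em2(faithful$eruptions)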

source("ch6.r") h.cv <- ucv(eruptions) library(MASS) h.plugin <-width.SJ(eruptions) library(sm) library(locfit) f1<-density(eruptions,width=h.normal) # normal reference library(mclust) f2<-density(eruptions,width=h.cv) # cross-validation f3<-density(eruptions,width=h.plugin) # plugin ### Read f4<-density(eruptions,method="mclust") # finite mixture model faithful<-read.table("faithful.dat",header=T) f5 <-locfit(~eruptions, alpha=c(0.1,0.8),flim=c(1,6)) eruptions <-faithful$eruptions f6 <-locfit(~eruptions, alpha=c(0.1,0.6),flim=c(1,6), link="ident") n<-length(eruptions) postscript("density.ps") ### Select the optimal number of bins by CV par(mfrow=c(2,2)) hist.h <- cv.hist.fun(eruptions)$mbest truehist(eruptions, xlim=c(1,6), ymax=0.8) # histogram truehist(eruptions, nbins=hist.h, xlim=c(1,6), ymax=0.8) ### Select optimal bandwidth of kernel by normal plot(f1,ylim=c(0,0.8));plot(f2,ylim=c(0,0.8)) ### reference rule/Cross-validation/plug-in plot(f3,ylim=c(0,0.8));plot(f4,ylim=c(0,0.8)) plot(f5,ylim=c(0,0.8),main="local linear") sigma.hat <-min(sd(eruptions), IQR(eruptions)/1.34) plot(f6,ylim=c(0,0.8),main="local quadratic") h.normal <-1.06*sigma.hat/n^(0.2) dev.off() 31 32 density(eruptions, width = h.plugin) density(eruptions, method = "mclust") 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 Density Density 0.2 0.2 0.2 0.2 0.0 0.0 0.0 0.0

1 2 3 4 5 6 1 2 3 4 5 6 2 3 4 5 1 2 3 4 5 6

eruptions eruptions N = 272 Bandwidth = 0.14 N = 272 Bandwidth = 0.3348

density(eruptions, width = h.normal) density(eruptions, width = h.cv) local linear local quadratic 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 density density Density Density 0.2 0.2 0.2 0.2 0.0 0.0 0.0 0.0

2 3 4 5 2 3 4 5 1 2 3 4 5 6 1 2 3 4 5 6

N = 272 Bandwidth = 0.09857 N = 272 Bandwidth = 0.1019 eruptions eruptions

33 34

6.7 Converting Density Estimation Into Regression

Description of the method

1. Suppose that X_1, \ldots, X_n are from a density f on [0, 1].
2. Create k equal-width bins, where k \approx n/10.
3. Define

    Y_j = \sqrt{\frac{k}{n}} \times \sqrt{N_j + \frac{1}{4}},

   where N_j = \sum_{i=1}^n I(X_i \in B_j) and B_j is the j-th bin.
4. Let r(x) = \sqrt{f(x)} and let t_j be the midpoint of B_j. Then

    Y_j \approx r(t_j) + \sigma \epsilon_j,

   where \epsilon_j \sim N(0, 1) and \sigma = \sqrt{k/(4n)}.
5. Apply your favorite nonparametric regression method to (t_j, Y_j) to get an estimate \hat r_n.
6. Calculate (\hat r_n(x))^2 and normalize it to be a density:

    \hat f_n(x) = \frac{ (\hat r_n^+(x))^2 }{ \frac{1}{k} \sum_{j=1}^k (\hat r_n^+(t_j))^2 },

   where \hat r_n^+(x) = \max\{ \hat r_n(x), 0 \}.

How does it work?

• Poissonization:

    N_j \approx \operatorname{Poisson}\left( n \int_{B_j} f(x) \, dx \right) \approx \operatorname{Poisson}\left( \frac{n f(t_j)}{k} \right).

• E(N_j) = \operatorname{Var}(N_j) \approx n f(t_j)/k.
• Variance stabilization:

    Y_j = \sqrt{\frac{k}{n}} \times \sqrt{N_j + \frac{1}{4}},

• E(Y_j) \approx \sqrt{f(t_j)} and \operatorname{Var}(Y_j) \approx k/(4n).

R code:

source("ch6.r")
library(sm)
temp <- den.to.reg(eruptions)
temp2 <- smooth.spline(temp$x, temp$y, cv = TRUE)
temp3 <- temp2$y
temp3[temp3 < 0] <- 0
k <- length(temp3)
normal.const <- sum(temp3^2)/k
seqs <- seq(0, 1, by = 0.01)
temp4 <- predict(temp2, seqs)$y
temp4[temp4 < 0] <- 0
f.hat <- (temp4)^2/normal.const*k
postscript("den_reg.ps")
plot(seqs, f.hat, main = "Density to Regression")
dev.off()
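Since den.to.reg is defined in the course file ch6.r (not shown here), the following self-contained sketch re-implements steps 1-6 directly, assuming the data are rescaled to [0, 1] and taking smooth.spline as the regression method:

x <- (faithful$eruptions - min(faithful$eruptions)) /
  diff(range(faithful$eruptions))                   # rescale support to [0, 1]
n <- length(x)
k <- round(n/10)                                    # k equal-width bins
t <- (1:k - 0.5)/k                                  # bin midpoints t_j
N <- tabulate(pmin(k, floor(x*k) + 1), nbins = k)   # counts N_j
Y <- sqrt(k/n) * sqrt(N + 1/4)                      # root transform
fit <- smooth.spline(t, Y, cv = TRUE)               # regression step
r <- pmax(predict(fit, t)$y, 0)                     # r_n^+ at the midpoints
grid <- seq(0, 1, by = 0.01)
rg <- pmax(predict(fit, grid)$y, 0)
f.hat <- rg^2 / (sum(r^2)/k)                        # square and normalize
plot(grid, f.hat, type = "l", main = "Density to Regression")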

[Figure: "Density to Regression" — f.hat from the code above plotted against seqs.]

REFERENCES

1. Brown, L.D., Cai, T. and Zhou, L. (2005). A root-unroot transform and block thresholding approach to adaptive density estimation. Unpublished manuscript.
2. Loader, C. (1999). Local Regression and Likelihood. Springer, New York.
3. McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley, New York.
4. Scott, D.W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York.
5. Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
6. Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman and Hall, London.
