6. Density Estimation

1. Cross Validation
2. Histogram
3. Kernel Density Estimation
4. Local Polynomials
5. Higher Dimensions
6. Mixture Models
7. Converting Density Estimation Into Regression
6.1 Cross Validation

Suppose we observe $X_1, \ldots, X_n$ from an unknown density $f$. Our goal is to estimate $f$ nonparametrically. Finding the best estimator $\hat{f}_n$ in some sense is equivalent to finding the optimal smoothing parameter $h$.

How do we measure the performance of $\hat{f}_n$?

• Risk / Integrated Mean Squared Error (IMSE):
$$R(\hat{f}_n, f) = E\big(L(\hat{f}_n, f)\big), \quad \text{where } L(\hat{f}_n, f) = \int \big(\hat{f}_n(x) - f(x)\big)^2\, dx.$$
• One could look for an optimal estimator that minimizes the risk function,
$$\hat{f}_n^* = \operatorname*{argmin}_{\hat{f}_n} R(\hat{f}_n, f),$$
but the risk function is unknown!

Estimation of the risk function

• Use leave-one-out cross validation to estimate the risk function.
• One can express the loss as a function of the smoothing parameter $h$:
$$L(h) = \int \big(\hat{f}_n(x) - f(x)\big)^2\, dx = \underbrace{\int \hat{f}_n^2(x)\, dx - 2\int \hat{f}_n(x) f(x)\, dx}_{J(h)} + \int f^2(x)\, dx.$$
• The cross-validation estimator of the risk function (up to a constant) is
$$\hat{J}(h) = \int \hat{f}_n^2(x)\, dx - \frac{2}{n} \sum_{i=1}^n \hat{f}_{(-i)}(X_i),$$
where $\hat{f}_{(-i)}$ is the density estimator obtained after removing the $i$-th observation.

Finding an optimal estimator $\hat{f}_n$

• $E(\hat{J}(h)) = E(J(h)) = R(h) + c$, where the constant $c$ does not depend on $h$.
• The optimal smoothing parameter is
$$\hat{h}^* = \operatorname*{argmin}_h \hat{J}(h).$$
• The nonparametric estimator $\hat{f}$ can be expressed as a function of $h$, so the best estimator $\hat{f}_n^*$ is obtained by plugging in the optimal smoothing parameter: $\hat{f}_n^* = \hat{f}_{\hat{h}^*}$.

6.2 Histogram

Histogram Estimator

• Without loss of generality, we assume that the support of $f$ is $[0,1]$. Divide the support into $m$ equally sized bins
$$B_1 = \Big[0, \frac{1}{m}\Big),\; B_2 = \Big[\frac{1}{m}, \frac{2}{m}\Big),\; \ldots,\; B_m = \Big[\frac{m-1}{m}, 1\Big].$$
• Let $h = 1/m$, $p_j = \int_{B_j} f(x)\, dx$, and $Y_j = \sum_{i=1}^n I(X_i \in B_j)$.
• The histogram estimator is defined by
$$\hat{f}_n(x) = \sum_{j=1}^m \frac{\hat{p}_j}{h}\, I(x \in B_j), \quad \text{where } \hat{p}_j = \frac{Y_j}{n}.$$

Theorem 1. Suppose that $f'$ is absolutely continuous and $\int (f'(u))^2\, du < \infty$. Then
$$R(\hat{f}_n, f) = \frac{h^2}{12} \int (f'(u))^2\, du + \frac{1}{nh} + o(h^2) + o\Big(\frac{1}{n}\Big).$$
The optimal binwidth is
$$h^* = \frac{1}{n^{1/3}} \left( \frac{6}{\int (f'(u))^2\, du} \right)^{1/3}.$$
With the optimal binwidth,
$$R(\hat{f}_n, f) \approx \frac{C}{n^{2/3}}, \quad \text{where } C = \Big(\frac{3}{4}\Big)^{2/3} \Big( \int (f'(u))^2\, du \Big)^{1/3}.$$

Proof. For any $x, u \in B_j$,
$$f(u) = f(x) + (u - x) f'(x) + \frac{(u-x)^2}{2} f''(\tilde{x})$$
for some $\tilde{x}$ between $x$ and $u$. Hence
$$p_j = \int_{B_j} f(u)\, du = \int_{B_j} \Big[ f(x) + (u-x) f'(x) + \frac{(u-x)^2}{2} f''(\tilde{x}) \Big]\, du = f(x)\, h + h f'(x) \Big( h\Big(j - \frac{1}{2}\Big) - x \Big) + O(h^3).$$
Therefore, the bias of $\hat{f}_n$ at $x \in B_j$ is
$$b(x) \equiv E(\hat{f}_n(x)) - f(x) = \frac{E(\hat{p}_j)}{h} - f(x) = \frac{p_j}{h} - f(x) = f'(x) \Big( h\Big(j - \frac{1}{2}\Big) - x \Big) + O(h^2).$$
By the mean value theorem, for some $\tilde{x}_j \in B_j$,
$$\int_{B_j} b^2(x)\, dx = \int_{B_j} (f'(x))^2 \Big( h\Big(j - \frac{1}{2}\Big) - x \Big)^2\, dx + O(h^4) = (f'(\tilde{x}_j))^2 \int_{B_j} \Big( h\Big(j - \frac{1}{2}\Big) - x \Big)^2\, dx + O(h^4) = (f'(\tilde{x}_j))^2\, \frac{h^3}{12} + O(h^4).$$
Hence
$$\int_0^1 b^2(x)\, dx = \sum_{j=1}^m \int_{B_j} b^2(x)\, dx = \frac{h^3}{12} \sum_{j=1}^m (f'(\tilde{x}_j))^2 + O(h^3) = \frac{h^2}{12} \int_0^1 (f'(x))^2\, dx + o(h^2).$$
For the variance of $\hat{f}_n$, for $x \in B_j$,
$$v(x) \equiv \operatorname{Var}(\hat{f}_n(x)) = \frac{1}{h^2} \operatorname{Var}(\hat{p}_j) = \frac{p_j (1 - p_j)}{n h^2}.$$
By the mean value theorem, for some $\tilde{x}_j \in B_j$, $p_j = \int_{B_j} f(x)\, dx = h f(\tilde{x}_j)$. Therefore, since $\sum_{j=1}^m p_j = 1$,
$$\int_0^1 v(x)\, dx = \sum_{j=1}^m \int_{B_j} v(x)\, dx = \sum_{j=1}^m \int_{B_j} \frac{p_j (1 - p_j)}{n h^2}\, dx = \frac{1}{nh} - \frac{1}{nh} \sum_{j=1}^m p_j^2 = \frac{1}{nh} - \frac{1}{nh} \sum_{j=1}^m h^2 f^2(\tilde{x}_j) = \frac{1}{nh} - \frac{1}{n} \int_0^1 f^2(x)\, dx + o\Big(\frac{1}{n}\Big) = \frac{1}{nh} + o\Big(\frac{1}{n}\Big).$$
Combining the two terms gives
$$R(\hat{f}_n, f) = \int_0^1 b^2(x)\, dx + \int_0^1 v(x)\, dx = \frac{h^2}{12} \int_0^1 (f'(x))^2\, dx + \frac{1}{nh} + o(h^2) + o\Big(\frac{1}{n}\Big). \qquad \Box$$
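To make the two ingredients above concrete, here is a minimal NumPy sketch, not part of the original notes: `hist_density` implements the histogram estimator $\hat{f}_n$ of Section 6.2, and `cv_risk` evaluates the leave-one-out score $\hat{J}(h)$ of Section 6.1 directly from its definition (Theorem 2 below gives a closed form that avoids the explicit leave-one-out loop). The function names and the Beta toy sample are illustrative choices.

```python
import numpy as np

def hist_density(x, data, m):
    """Histogram estimator on [0, 1] with m equal bins:
    f_hat(x) = p_hat_j / h for x in B_j, with p_hat_j = Y_j / n and h = 1/m."""
    n = len(data)
    h = 1.0 / m
    counts, _ = np.histogram(data, bins=m, range=(0.0, 1.0))  # Y_j for each bin
    p_hat = counts / n
    j = np.clip((np.asarray(x) / h).astype(int), 0, m - 1)    # bin index of x
    return p_hat[j] / h

def cv_risk(data, m):
    """Leave-one-out CV estimate of the risk (up to a constant):
    J_hat(h) = int f_hat^2 dx - (2/n) * sum_i f_hat_{(-i)}(X_i)."""
    grid = np.linspace(0.0, 1.0, 1000, endpoint=False)
    int_f2 = np.mean(hist_density(grid, data, m) ** 2)        # Riemann sum on [0, 1]
    loo = [hist_density(xi, np.delete(data, i), m)            # f_hat_{(-i)}(X_i)
           for i, xi in enumerate(data)]
    return int_f2 - 2.0 * np.mean(loo)

rng = np.random.default_rng(0)
data = rng.beta(2, 5, size=200)          # toy sample supported on [0, 1]
m_star = min(range(2, 51), key=lambda m: cv_risk(data, m))   # h* = 1 / m_star
```

Minimizing $\hat{J}$ over the number of bins $m$ is the same as minimizing over $h = 1/m$, matching the recipe $\hat{f}_n^* = \hat{f}_{\hat{h}^*}$ above.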
Theorem 2. The cross-validation estimator of the risk for the histogram is
$$\hat{J}(h) = \frac{2}{h(n-1)} - \frac{n+1}{h(n-1)} \sum_{j=1}^m \hat{p}_j^2.$$

Theorem 3. Let $m = m(n)$ be the number of bins in the histogram $\hat{f}_n$. Assume that $m(n) \to \infty$ and $m(n) \log n / n \to 0$ as $n \to \infty$. Define
$$\ell_n(x) = \Big( \max\Big\{ \sqrt{\hat{f}_n(x)} - c,\, 0 \Big\} \Big)^2, \qquad u_n(x) = \Big( \sqrt{\hat{f}_n(x)} + c \Big)^2,$$
where
$$c = \frac{z_{\alpha/(2m)}}{2} \sqrt{\frac{m}{n}}.$$
Then
$$\Pr\Big( \ell_n(x) \le E(\hat{f}_n(x)) \le u_n(x) \ \text{for all } x \Big) \ge 1 - \alpha.$$

6.3 Kernel Density Estimation

• Given a kernel $K$ and a positive number $h$, called the bandwidth, the kernel density estimator is
$$\hat{f}_n(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K\Big( \frac{x - X_i}{h} \Big).$$
• The choice of the kernel $K$ is not crucial, but the choice of the bandwidth $h$ is important.
• We assume that $K$ satisfies
$$\int K(x)\, dx = 1, \qquad \int x K(x)\, dx = 0, \qquad \sigma_K^2 \equiv \int x^2 K(x)\, dx > 0.$$

Theorem 4. Let $R_x = E(\hat{f}(x) - f(x))^2$ be the risk at a point $x$ and let $R = \int R_x\, dx$ denote the integrated risk. Assume that $f''$ is absolutely continuous and that $\int (f'''(x))^2\, dx < \infty$. Also, assume that $K$ satisfies the conditions above. Then
$$R_x = \frac{1}{4} \sigma_K^4 h_n^4 (f''(x))^2 + \frac{f(x) \int K^2(x)\, dx}{n h_n} + O(n^{-1}) + O(h_n^6),$$
$$R = \frac{1}{4} \sigma_K^4 h_n^4 \int (f''(x))^2\, dx + \frac{\int K^2(x)\, dx}{n h_n} + O(n^{-1}) + O(h_n^6).$$

Proof. Write $K_h(x, X) = \frac{1}{h} K\big( \frac{x - X}{h} \big)$, so that $\hat{f}_n(x) = \frac{1}{n} \sum_{i=1}^n K_h(x, X_i)$. Then
$$E(\hat{f}_n(x)) = E(K_h(x, X)) = \int \frac{1}{h} K\Big( \frac{x - t}{h} \Big) f(t)\, dt = \int K(u) f(x - hu)\, du$$
$$= \int K(u) \Big[ f(x) - h u f'(x) + \frac{h^2 u^2}{2} f''(x) + \cdots \Big]\, du = f(x) + \frac{1}{2} h^2 f''(x) \int u^2 K(u)\, du + \cdots.$$
Hence
$$\operatorname{bias}(\hat{f}_n(x)) = \frac{1}{2} \sigma_K^2 h_n^2 f''(x) + O(h_n^4).$$
Similarly,
$$\operatorname{Var}(\hat{f}_n(x)) = \frac{f(x) \int K^2(x)\, dx}{n h_n} + O(n^{-1}).$$

Bandwidth selection

• The optimal smoothing bandwidth is
$$h^* = \left( \frac{c_2}{c_1^2\, A(f)\, n} \right)^{1/5},$$
where $c_1 = \int x^2 K(x)\, dx$, $c_2 = \int (K(x))^2\, dx$, and $A(f) = \int (f''(x))^2\, dx$.
• The only unknown quantity in $h^*$ is $A(f) = \int (f''(x))^2\, dx$.
• Integrating by parts, $A(f) = \int (f''(x))^2\, dx = \int f^{(4)}(x) f(x)\, dx = E\big(f^{(4)}(X)\big)$, where $f^{(r)}$ denotes the $r$-th derivative of $f$.

The Normal reference rule

• Assuming $f$ is Normal, one can compute $h^*$ for the Gaussian kernel: $h^* = 1.06\, \sigma\, n^{-1/5}$.
• $\sigma$ is estimated by $\hat{\sigma} = \min\{ s,\, \mathrm{IQR}/1.34 \}$, where $s$ is the sample standard deviation and IQR is the interquartile range.
• The selected bandwidth is
$$\hat{h}_n = \frac{1.06\, \hat{\sigma}}{n^{1/5}}.$$

Plug-in method

Write $\psi_r = E\big(f^{(r)}(X)\big)$. To estimate $\psi_4$ by the kernel method, one needs to choose an optimal bandwidth, which is itself a functional of $\psi_6$:
1. Estimate $\psi_8$ with the bandwidth chosen by the normal reference rule.
2. Estimate $\psi_6$ with the bandwidth depending on $\hat{\psi}_8$.
3. Estimate $\psi_4$ with the bandwidth depending on $\hat{\psi}_6$.
4. The selected bandwidth is
$$\hat{h}^* = \left( \frac{c_2}{c_1^2\, \hat{\psi}_4\, n} \right)^{1/5}.$$

Cross-validation

• The cross-validation score function is
$$\hat{J}(h) = \int \hat{f}^2(x)\, dx - \frac{2}{n} \sum_{i=1}^n \hat{f}_{-i}(X_i).$$
• The selected bandwidth is
$$\hat{h} = \operatorname*{argmin}_h \hat{J}(h).$$

Theorem 5 (Stone's Theorem). Suppose $f$ is bounded. Let $\hat{f}_h$ denote the kernel estimator with bandwidth $h$ and let $\hat{h}$ denote the bandwidth chosen by cross-validation. Then
$$\frac{\int \big( f(x) - \hat{f}_{\hat{h}}(x) \big)^2\, dx}{\inf_h \int \big( f(x) - \hat{f}_h(x) \big)^2\, dx} \xrightarrow{\ a.s.\ } 1.$$

Computation

• The cross-validation score function $\hat{J}(h)$ can be approximated by
$$\hat{J}(h) = \frac{1}{h n^2} \sum_{i=1}^n \sum_{j=1}^n K^*\Big( \frac{X_i - X_j}{h} \Big) + \frac{2}{nh} K(0) + O\Big( \frac{1}{n^2} \Big),$$
where $K^*(x) = K^{(2)}(x) - 2 K(x)$ and $K^{(2)}(z) = \int K(z - y) K(y)\, dy$.

Optimal convergence rate

From Theorem 4, the optimal convergence rate is $O(n^{-4/5})$ if the optimal bandwidth is used.

Theorem 6. Let $\mathcal{F}$ be the set of all pdfs and let $f^{(m)}$ denote the $m$-th derivative of $f$.
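The kernel-estimation formulas of this subsection translate directly into code. Below is a minimal NumPy sketch, not part of the original notes, using the Gaussian kernel; `kde` and `normal_reference_h` are illustrative names.

```python
import numpy as np

def kde(x, data, h):
    """Gaussian-kernel density estimator:
    f_hat(x) = (1/n) sum_i (1/h) K((x - X_i) / h), K = standard normal pdf.
    x must be a 1-d array of evaluation points."""
    u = (np.asarray(x)[:, None] - data[None, :]) / h
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return k.mean(axis=1) / h

def normal_reference_h(data):
    """Normal reference rule: h = 1.06 * sigma_hat * n^(-1/5),
    with sigma_hat = min(s, IQR / 1.34)."""
    n = len(data)
    s = np.std(data, ddof=1)                     # sample standard deviation
    q75, q25 = np.percentile(data, [75, 25])
    sigma = min(s, (q75 - q25) / 1.34)           # robust scale estimate
    return 1.06 * sigma * n ** (-1.0 / 5.0)
```

For example, `kde(np.linspace(0, 1, 200), data, normal_reference_h(data))` evaluates the estimate on a grid with the normal-reference bandwidth.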
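The "Computation" formula also admits a direct implementation for the Gaussian kernel, for which $K^{(2)}$, the convolution of $K$ with itself, is the $N(0, 2)$ density. The following is a sketch under that assumption, with Stone's theorem as the justification for minimizing the score over a bandwidth grid; `cv_score`, `cv_bandwidth`, and the grid endpoints are illustrative choices.

```python
import numpy as np

def cv_score(data, h):
    """Approximate CV score for the Gaussian kernel:
    J_hat(h) ~ (1/(h n^2)) sum_{i,j} K*((X_i - X_j)/h) + (2/(n h)) K(0),
    where K*(x) = K2(x) - 2 K(x) and K2 is the N(0, 2) pdf."""
    n = len(data)
    d = (data[:, None] - data[None, :]) / h             # all pairwise gaps
    k2 = np.exp(-0.25 * d ** 2) / np.sqrt(4.0 * np.pi)  # K2 = K convolved with K
    k = np.exp(-0.5 * d ** 2) / np.sqrt(2.0 * np.pi)
    kstar = k2 - 2.0 * k
    k0 = 1.0 / np.sqrt(2.0 * np.pi)                     # K(0) for the Gaussian kernel
    return kstar.sum() / (h * n ** 2) + 2.0 * k0 / (n * h)

def cv_bandwidth(data, h_grid=None):
    """Bandwidth chosen by cross-validation: argmin of J_hat over a grid.
    The default grid is arbitrary; in practice it should bracket h*."""
    if h_grid is None:
        h_grid = np.linspace(0.01, 1.0, 50)
    return min(h_grid, key=lambda h: cv_score(data, h))
```

Unlike the leave-one-out loop in the histogram sketch, this closed form needs only the pairwise differences, which keeps the cost at one $n \times n$ kernel evaluation per candidate bandwidth.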