
6. Density Estimation

Outline
1. Cross Validation
2. Histogram
3. Kernel Density Estimation
4. Local Polynomials
5. Higher Dimensions
6. Mixture Models
7. Converting Density Estimation Into Regression

6.1 Cross Validation

Suppose we observe X_1, \ldots, X_n from an unknown density f. Our goal is to estimate f nonparametrically. Finding the best \hat f_n in some sense is equivalent to finding the optimal smoothing parameter h.

How do we measure the performance of \hat f_n?

• Risk / integrated mean squared error (IMSE):

    R(\hat f_n, f) = E(L(\hat f_n, f)),  where  L(\hat f_n, f) = \int (\hat f_n(x) - f(x))^2 \, dx.

• One would like the optimal estimator that minimizes the risk function,

    \hat f_n^* = \operatorname{argmin}_{\hat f_n} R(\hat f_n, f),

  but the risk function is unknown!
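As a quick numerical illustration of the loss L above (not part of the original notes): the integrated squared error of any estimate against a known density can be approximated by a Riemann sum on a grid. Here f is taken to be standard normal and \hat f_n is R's default kernel estimate; all choices are illustrative.

set.seed(1)
x <- rnorm(100)                               # sample from a known f = N(0, 1)
fhat <- density(x)                            # a kernel density estimate of f
dx <- diff(fhat$x[1:2])                       # grid spacing
ise <- sum((fhat$y - dnorm(fhat$x))^2) * dx   # Riemann sum of (fhat - f)^2
ise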

Estimation of the risk function

• Use leave-one-out cross-validation to estimate the risk function.
• One can express the loss as a function of the smoothing parameter h:

    L(h) = \int (\hat f_n(x) - f(x))^2 \, dx
         = \int \hat f_n^2(x) \, dx - 2 \int \hat f_n(x) f(x) \, dx + \int f^2(x) \, dx,

  where the first two terms define J(h) and the last term does not depend on h.
• The cross-validation estimator of the risk function (up to that constant) is

    \hat J(h) = \int \hat f_n^2(x) \, dx - \frac{2}{n} \sum_{i=1}^n \hat f_{(-i)}(X_i),

  where \hat f_{(-i)} is the density estimator obtained after removing the i-th observation.

Finding an optimal estimator \hat f_n

• E(\hat J(h)) = E(J(h)) = R(h) + c, where the constant c = -\int f^2(x) \, dx does not depend on h.
• The optimal smoothing parameter is \hat h^* = \operatorname{argmin}_h \hat J(h).
• The nonparametric estimator \hat f can be expressed as a function of h, and the best estimator \hat f_n^* is obtained by plugging in the optimal smoothing parameter: \hat f^* = \hat f_{\hat h^*}.

6.2 Histogram

Histogram estimator

• Without loss of generality, assume that the support of f is [0, 1]. Divide the support into m equally sized bins

    B_1 = [0, 1/m),  B_2 = [1/m, 2/m),  \ldots,  B_m = [(m-1)/m, 1].

• Let h = 1/m, p_j = \int_{B_j} f(x) \, dx and Y_j = \sum_{i=1}^n I(X_i \in B_j).
• The histogram estimator is defined by

    \hat f_n(x) = \sum_{j=1}^m \frac{\hat p_j}{h} I(x \in B_j),  where  \hat p_j = \frac{Y_j}{n}

  (see the sketch following Theorem 1).

Theorem 1. Suppose that f' is absolutely continuous and \int (f'(u))^2 \, du < \infty. Then

    R(\hat f_n, f) = \frac{h^2}{12} \int (f'(u))^2 \, du + \frac{1}{nh} + o(h^2) + o(1/n).

The optimal bandwidth is

    h^* = \frac{1}{n^{1/3}} \left( \frac{6}{\int (f'(u))^2 \, du} \right)^{1/3}.

With the optimal binwidth,

    R(\hat f_n, f) \approx \frac{C}{n^{2/3}},  where  C = (3/4)^{2/3} \left( \int (f'(u))^2 \, du \right)^{1/3}.
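A minimal R sketch of the histogram estimator \hat f_n above (not from the original notes): the faithful eruptions data are rescaled to [0, 1] to match the convention, and the bin count m is an arbitrary illustrative choice.

x <- (faithful$eruptions - min(faithful$eruptions)) /
  diff(range(faithful$eruptions))                   # rescale the support to [0, 1]
n <- length(x)
m <- 20; h <- 1/m                                   # m equal-width bins, binwidth h = 1/m
Y <- tabulate(pmin(m, floor(x/h) + 1), nbins = m)   # bin counts Y_j
p.hat <- Y / n                                      # hat p_j = Y_j / n
f.hat <- function(x0) p.hat[pmin(m, floor(x0/h) + 1)] / h   # step-function estimate
curve(f.hat(x), from = 0, to = 1, n = 501)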

Proof. For any x, u \in B_j,

    f(u) = f(x) + (u - x) f'(x) + \frac{(u - x)^2}{2} f''(\tilde x)

for some \tilde x between x and u. Hence,

    p_j = \int_{B_j} f(u) \, du
        = \int_{B_j} \left[ f(x) + (u - x) f'(x) + \frac{(u - x)^2}{2} f''(\tilde x) \right] du
        = f(x) h + h f'(x) \left( h(j - 1/2) - x \right) + O(h^3).

Therefore, the bias of \hat f_n is

    b(x) \equiv E(\hat f_n(x)) - f(x) = \frac{p_j}{h} - f(x)
         = \frac{1}{h} \left[ f(x) h + h f'(x) \left( h(j - 1/2) - x \right) + O(h^3) \right] - f(x)
         = f'(x) \left( h(j - 1/2) - x \right) + O(h^2).

By the mean value theorem, for some \tilde x_j \in B_j,

    \int_{B_j} b^2(x) \, dx = \int_{B_j} (f'(x))^2 \left( h(j - 1/2) - x \right)^2 dx + O(h^4)
                            = (f'(\tilde x_j))^2 \int_{B_j} \left( h(j - 1/2) - x \right)^2 dx + O(h^4)
                            = (f'(\tilde x_j))^2 \frac{h^3}{12} + O(h^4).

Hence

    \int_0^1 b^2(x) \, dx = \sum_{j=1}^m \int_{B_j} b^2(x) \, dx
                          = \frac{h^3}{12} \sum_{j=1}^m (f'(\tilde x_j))^2 + O(h^3)
                          = \frac{h^2}{12} \int_0^1 (f'(x))^2 \, dx + o(h^2).

For the variance of \hat f_n:

    v(x) \equiv \operatorname{Var}(\hat f_n(x)) = \frac{1}{h^2} \operatorname{Var}(\hat p_j) = \frac{p_j (1 - p_j)}{n h^2}.

By the mean value theorem, for some \tilde x_j \in B_j,

    p_j = \int_{B_j} f(x) \, dx = h f(\tilde x_j).

Therefore,

    \int_0^1 v(x) \, dx = \sum_{j=1}^m \int_{B_j} v(x) \, dx = \sum_{j=1}^m \frac{p_j (1 - p_j)}{nh}
                        = \frac{1}{nh} - \frac{1}{nh} \sum_{j=1}^m p_j^2
                        = \frac{1}{nh} - \frac{1}{nh} \sum_{j=1}^m h^2 f^2(\tilde x_j)
                        = \frac{1}{nh} - \frac{1}{n} \int_0^1 f^2(x) \, dx + o(1/n) = \frac{1}{nh} + O(1/n).

Adding the integrated squared bias and the integrated variance gives the risk expansion in Theorem 1.

Theorem 2. The cross-validation estimator of risk for the histogram is

    \hat J(h) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h} \sum_{j=1}^m \hat p_j^2.

Theorem 3. Let m = m(n) be the number of bins in the histogram \hat f_n. Assume that m(n) \to \infty and m(n) \log n / n \to 0 as n \to \infty. Define

    \ell_n(x) = \left( \max\left\{ \sqrt{\hat f_n(x)} - c, \, 0 \right\} \right)^2,  u_n(x) = \left( \sqrt{\hat f_n(x)} + c \right)^2,

where c = \frac{z_{\alpha/(2m)}}{2} \sqrt{m/n}. Then

    \Pr\left( \ell_n(x) \le E(\hat f_n(x)) \le u_n(x) \ \forall x \right) \ge 1 - \alpha.
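A minimal R sketch of bin selection with the closed-form score from Theorem 2, reusing the rescaled data x and n from the sketch above; the candidate grid of bin counts is an arbitrary choice.

Jhat.hist <- function(m) {                    # CV risk estimate for m bins (Theorem 2)
  h <- 1/m
  p.hat <- tabulate(pmin(m, floor(x/h) + 1), nbins = m) / n
  2/((n - 1)*h) - (n + 1)/((n - 1)*h) * sum(p.hat^2)
}
ms <- 1:100                                   # candidate numbers of bins
m.best <- ms[which.min(sapply(ms, Jhat.hist))]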

6.3 Kernel Density Estimation

• Given a kernel K and a positive number h, called the bandwidth, the kernel density estimator is

    \hat f_n(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K\!\left( \frac{x - X_i}{h} \right).

• The choice of the kernel K is not crucial, but the choice of the bandwidth h is important.
• We assume that K satisfies

    \int K(x) \, dx = 1,  \int x K(x) \, dx = 0  and  \sigma_K^2 \equiv \int x^2 K(x) \, dx > 0.
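A minimal R sketch of this estimator with a Gaussian kernel; the bandwidth h = 0.3 and the evaluation grid are arbitrary illustrative choices.

kde <- function(x0, x, h) {                   # hat f_n(x0) with Gaussian K
  sapply(x0, function(t) mean(dnorm((t - x)/h)) / h)
}
x <- faithful$eruptions
grid <- seq(1, 6, length.out = 200)
plot(grid, kde(grid, x, h = 0.3), type = "l", ylab = "density")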

Theorem 4. Let R_x = E(\hat f(x) - f(x))^2 be the risk at a point x and let R = \int R_x \, dx denote the integrated risk. Assume that f'' is absolutely continuous and that \int (f'''(x))^2 \, dx < \infty. Also, assume that K satisfies

    \int K(x) \, dx = 1,  \int x K(x) \, dx = 0  and  \sigma_K^2 \equiv \int x^2 K(x) \, dx > 0.

Then,

    R_x = \frac{1}{4} \sigma_K^4 h_n^4 (f''(x))^2 + \frac{f(x) \int K^2(x) \, dx}{n h_n} + O(n^{-1}) + O(h_n^6),

    R = \frac{1}{4} \sigma_K^4 h_n^4 \int (f''(x))^2 \, dx + \frac{\int K^2(x) \, dx}{n h_n} + O(n^{-1}) + O(h_n^6).

Proof. Write K_h(x, X) = \frac{1}{h} K\!\left( \frac{x - X}{h} \right) and \hat f_n(x) = \frac{1}{n} \sum_{i=1}^n K_h(x, X_i). Then

    E(\hat f_n(x)) = E(K_h(x, X))
                   = \int \frac{1}{h} K\!\left( \frac{x - t}{h} \right) f(t) \, dt
                   = \int K(u) f(x - hu) \, du
                   = \int K(u) \left[ f(x) - hu f'(x) + \frac{h^2 u^2}{2} f''(x) + \cdots \right] du
                   = f(x) + \frac{h^2}{2} f''(x) \int u^2 K(u) \, du + \cdots.

Hence

    \operatorname{bias}(\hat f_n(x)) = \frac{1}{2} \sigma_K^2 h_n^2 f''(x) + O(h_n^4).

Similarly,

    \operatorname{Var}(\hat f_n(x)) = \frac{f(x) \int K^2(x) \, dx}{n h_n} + O(n^{-1}).

Bandwidth selection

• The optimal smoothing bandwidth is

    h^* = \left( \frac{c_2}{c_1^2 A(f) n} \right)^{1/5},

  where c_1 = \int x^2 K(x) \, dx, c_2 = \int (K(x))^2 \, dx and A(f) = \int (f''(x))^2 \, dx.
• The only unknown quantity in h^* is A(f) = \int (f''(x))^2 \, dx.
• Integrating by parts, A(f) = \int (f''(x))^2 \, dx = \int f^{(4)}(x) f(x) \, dx = E(f^{(4)}(X)), where f^{(r)} denotes the r-th derivative of f.

The normal reference rule

• Assume that f is normal; then h^* can be computed explicitly. With the Gaussian kernel, h^* = 1.06 \, \sigma n^{-1/5}.
• \sigma is estimated by \hat\sigma = \min\{ s, \mathrm{IQR}/1.34 \}, where s is the sample standard deviation and IQR is the interquartile range.
• The selected bandwidth is (see the sketch below)

    \hat h_n = \frac{1.06 \, \hat\sigma}{n^{1/5}}.
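The rule is one line of R. This sketch mirrors the course code later in these notes; R's built-in bw.nrd implements the same 1.06 rule, so the two values should agree.

x <- faithful$eruptions
n <- length(x)
sigma.hat <- min(sd(x), IQR(x)/1.34)    # robust scale estimate
h.normal <- 1.06 * sigma.hat / n^(1/5)  # normal reference bandwidth
c(h.normal, bw.nrd(x))                  # built-in normal reference rule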

Plug-in method

Let \psi_r = E(f^{(r)}(X)). To estimate \psi_4 with a kernel estimator, one needs to choose the optimal bandwidth, which is a functional of \psi_6.

1. Estimate \psi_8 with the bandwidth chosen by the normal reference rule.
2. Estimate \psi_6 with the bandwidth depending on \hat\psi_8.
3. Estimate \psi_4 with the bandwidth depending on \hat\psi_6.
4. The selected bandwidth is

    \hat h^* = \left( \frac{c_2}{c_1^2 \hat\psi_4 n} \right)^{1/5}.

Cross-validation

• Cross-validation score function:

    \hat J(h) = \int \hat f^2(x) \, dx - \frac{2}{n} \sum_{i=1}^n \hat f_{(-i)}(X_i).

• The selected bandwidth is \hat h = \operatorname{argmin}_h \hat J(h); see the sketch below.

Theorem 5 (Stone's theorem). Suppose f is bounded. Let \hat f_h denote the kernel estimator with bandwidth h and let \hat h denote the bandwidth chosen by cross-validation. Then

    \frac{ \int \left( f(x) - \hat f_{\hat h}(x) \right)^2 dx }{ \inf_h \int \left( f(x) - \hat f_h(x) \right)^2 dx } \xrightarrow{a.s.} 1.
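Least-squares cross-validation is available in base R as bw.ucv; a minimal sketch (the data choice is illustrative, and the notes' own code later uses MASS::ucv for the same purpose):

x <- faithful$eruptions
h.cv <- bw.ucv(x)                 # unbiased (least-squares) CV bandwidth
f.cv <- density(x, bw = h.cv)     # kernel estimate at the CV bandwidth
plot(f.cv)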

Optimal convergence rate

From Theorem 4, the optimal convergence rate is O(n^{-4/5}) if the optimal bandwidth is used. No estimator can do better uniformly over a smoothness class:

Theorem 6. Let \mathcal{F} be the set of all pdfs and let f^{(m)} denote the m-th derivative of f. Define

    \mathcal{F}_m(c) = \left\{ f \in \mathcal{F} : \int |f^{(m)}(x)|^2 \, dx \le c \right\}.

For any estimator \hat f_n,

    \sup_{f \in \mathcal{F}_m(c)} E_f \int \left( \hat f_n(x) - f(x) \right)^2 dx \ge b \, n^{-2m/(2m+1)},

where b > 0 is a universal constant that depends only on m and c.

Computation

• The cross-validation score function \hat J(h) can be approximated by

    \hat J(h) = \frac{1}{h n^2} \sum_{i=1}^n \sum_{j=1}^n K^*\!\left( \frac{X_i - X_j}{h} \right) + \frac{2}{nh} K(0) + O\!\left( \frac{1}{n^2} \right),

  where K^*(x) = K^{(2)}(x) - 2K(x) and K^{(2)}(z) = \int K(z - y) K(y) \, dy; see the sketch after this list.
• For the computation of \hat f_n and \hat J(h), use
  – the fast Fourier transform (FFT);
  – a binning strategy.
• For the details, see Silverman (1986) and Wand and Jones (1995).
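A minimal R sketch of the K* approximation for the Gaussian kernel, for which K^{(2)} is the N(0, 2) density (the convolution of two standard normal densities); the bandwidth grid is an arbitrary choice. The O(n^2) double sum is fine for small n; the FFT and binning strategies above are what make this fast at scale.

x <- faithful$eruptions
n <- length(x)
Jhat <- function(h) {                                 # approximate CV score
  d <- outer(x, x, "-")/h                             # all (X_i - X_j)/h
  Kstar <- dnorm(d, sd = sqrt(2)) - 2*dnorm(d)        # K*(x) = K^(2)(x) - 2K(x)
  sum(Kstar)/(h*n^2) + 2*dnorm(0)/(n*h)
}
hs <- seq(0.02, 1, length.out = 80)
h.star <- hs[which.min(sapply(hs, Jhat))]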

6.4 Local Polynomials

Smoothed log-likelihood

• Recall that the nonparametric MLE of the pdf is

    \hat p = (\hat p_1, \ldots, \hat p_n) \equiv \operatorname{argmax}_p L(p),

  where

    L(p) = \sum_{i=1}^n \log p_i - n \left( \sum_{i=1}^n p_i - 1 \right).

• A smoothed version of the log-likelihood at x (up to a constant) is

    L_x(p) = \sum_{i=1}^n K\!\left( \frac{x - X_i}{h} \right) \log p_i - n \sum_{i=1}^n K\!\left( \frac{x - X_i}{h} \right) p_i.

Local log-likelihood

• The kernel density estimator maximizes the smoothed log-likelihood: \hat f_K(x) = \operatorname{argmax}_p L_x(p).
• The log-likelihood function of f is

    L(f) = \sum_{i=1}^n \log f(X_i) - n \left( \int f(u) \, du - 1 \right).

• The local log-likelihood at a target value x is

    L_x(f) = \sum_{i=1}^n K\!\left( \frac{x - X_i}{h} \right) \log f(X_i) - n \int K\!\left( \frac{x - u}{h} \right) f(u) \, du.
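Local likelihood density estimation is implemented in the locfit package, which the course code below also uses; a minimal sketch (the smoothing parameters alpha and the plotting window are illustrative):

library(locfit)
eruptions <- faithful$eruptions
f.loc <- locfit(~ eruptions, alpha = c(0.1, 0.8), flim = c(1, 6))
plot(f.loc)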

Local likelihood density estimator

• Approximate \log f(u) with a polynomial

    P_x(a, u) = \sum_{j=0}^p \frac{a_j}{j!} (x - u)^j.

• The local polynomial log-likelihood is

    L_x(a) = \sum_{i=1}^n K\!\left( \frac{x - X_i}{h} \right) P_x(a, X_i) - n \int K\!\left( \frac{x - u}{h} \right) e^{P_x(a, u)} \, du.

• The local likelihood density estimator is

    \hat f_n(x) = e^{P_x(\hat a, x)} = e^{\hat a_0},

  where \hat a = (\hat a_0, \ldots, \hat a_p) \equiv \operatorname{argmax}_a L_x(a).

6.5 Higher Dimensions

Suppose X_i = (X_{i1}, \ldots, X_{id}). The multivariate kernel estimator is

    \hat f_n(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{|H|} K\!\left( H^{-1}(x - X_i) \right).

We often assume a simple form of the bandwidth matrix or kernel, for example H = \operatorname{diag}(h_1, \ldots, h_d). Then

    \hat f_n(x) = \frac{1}{n h_1 \cdots h_d} \sum_{i=1}^n \prod_{j=1}^d K\!\left( \frac{x_j - X_{ij}}{h_j} \right);

a sketch of this product-kernel form follows.

• The risk function is

    R \approx \frac{1}{4} \sigma_K^4 \left[ \sum_{j=1}^d h_j^4 \int f_{jj}^2(x) \, dx + \sum_{j \ne k} h_j^2 h_k^2 \int f_{jj} f_{kk} \, dx \right] + \frac{ \left( \int K^2(x) \, dx \right)^d }{ n h_1 \cdots h_d }.

• The optimal bandwidths satisfy h_i^* = O(n^{-1/(4+d)}).
• In practice, one can choose the optimal bandwidth by cross-validation, and one often assumes a simple form of the bandwidth matrix, H = h \cdot I.
• The curse of dimensionality.
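A minimal R sketch of the product-kernel estimator with H = diag(h_1, h_2) on the two-dimensional faithful data; the bandwidth values are arbitrary illustrative choices.

X <- as.matrix(faithful[, c("eruptions", "waiting")])
h <- c(0.3, 3)                                # h_1, h_2
kde2 <- function(x0) {                        # hat f_n at a point x0 = (x1, x2)
  w <- dnorm((x0[1] - X[, 1])/h[1]) * dnorm((x0[2] - X[, 2])/h[2])
  mean(w) / prod(h)
}
kde2(c(3.5, 70))                              # estimate at one point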

6.6 Mixture Models

The normal mixture model

    f(x; \theta) = \sum_{j=1}^K p_j \, \phi(x \mid \mu_j, \sigma_j^2),

where

    \phi(x \mid \mu_j, \sigma_j^2) = \frac{1}{\sigma_j \sqrt{2\pi}} \, e^{-(x - \mu_j)^2 / (2 \sigma_j^2)}.

• Define \theta = (p_1, \ldots, p_K, \mu_1, \ldots, \mu_K, \sigma_1^2, \ldots, \sigma_K^2).
• Given K, how do we estimate \theta? → the EM algorithm.

EM algorithm

For K = 2:

• Define latent variables Z_i:

    Z_i = 1 if Y_i is from group 1;  Z_i = 0 if Y_i is from group 2.

• The complete-data log-likelihood is

    L(y, z, \theta) = \sum_{i: z_i = 1} \log[ p \, \phi(y_i \mid \mu_1, \sigma_1^2) ] + \sum_{i: z_i = 0} \log[ (1 - p) \, \phi(y_i \mid \mu_2, \sigma_2^2) ]
                    = \sum_{i=1}^n \left\{ z_i \log[ p \, \phi(y_i \mid \mu_1, \sigma_1^2) ] + (1 - z_i) \log[ (1 - p) \, \phi(y_i \mid \mu_2, \sigma_2^2) ] \right\}.

• [E-step] Given \hat\theta, compute E(Z_i \mid Y, \hat\theta):

    E(Z_i \mid Y = y_i, \hat\theta) \equiv w_i = \frac{ \hat p \, \hat\phi_1(y_i) }{ \hat p \, \hat\phi_1(y_i) + (1 - \hat p) \, \hat\phi_2(y_i) },

  where \hat\phi_1 = \phi(\cdot \mid \hat\mu_1, \hat\sigma_1^2) and \hat\phi_2 = \phi(\cdot \mid \hat\mu_2, \hat\sigma_2^2).

• [M-step]

    \tilde\theta = \operatorname{argmax}_\theta \tilde L(\theta),

  where

    \tilde L(\theta) = \sum_{i=1}^n \left( w_i \log[ p \, \phi(y_i \mid \mu_1, \sigma_1^2) ] + (1 - w_i) \log[ (1 - p) \, \phi(y_i \mid \mu_2, \sigma_2^2) ] \right).

  Then

    \hat p = \frac{\sum_{i=1}^n w_i}{n},  \hat\mu_1 = \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i},  \hat\mu_2 = \frac{\sum_{i=1}^n (1 - w_i) y_i}{\sum_{i=1}^n (1 - w_i)},

    \hat\sigma_1^2 = \frac{\sum_{i=1}^n w_i (y_i - \hat\mu_1)^2}{\sum_{i=1}^n w_i},  \hat\sigma_2^2 = \frac{\sum_{i=1}^n (1 - w_i) (y_i - \hat\mu_2)^2}{\sum_{i=1}^n (1 - w_i)}

  (see the runnable sketch below).

How to estimate the number of components K?

• AIC (Akaike information criterion): find the most predictive model,

    AIC = L - q,

  where L is the log-likelihood and q is the number of parameters.
• BIC (Bayesian information criterion): find the true model with high probability,

    BIC = L - \frac{q \log n}{2}.

Remarks

• The number of clusters is not always equal to the number of components.
• The Gaussian mixture is sensitive to outliers → replace the normal density with a t-distribution density.
• If \sigma_j^2 = \sigma^2 > 0 is fixed and K → n, then the MLE of the mixture model approaches the kernel estimate with p_j = 1/n and \mu_j = x_j.
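A minimal R implementation of the two-component EM updates above; the initialization and the fixed number of iterations are arbitrary choices, not part of the notes.

em2 <- function(y, iters = 200) {
  p <- 0.5                                    # initial guesses (assumed)
  mu1 <- mean(y) - sd(y); mu2 <- mean(y) + sd(y)
  s1 <- s2 <- sd(y)
  for (it in 1:iters) {
    # E-step: w_i = P(observation i is from group 1 | current parameters)
    a <- p * dnorm(y, mu1, s1)
    b <- (1 - p) * dnorm(y, mu2, s2)
    w <- a/(a + b)
    # M-step: weighted updates from the slide above
    p <- mean(w)
    mu1 <- sum(w*y)/sum(w); mu2 <- sum((1 - w)*y)/sum(1 - w)
    s1 <- sqrt(sum(w*(y - mu1)^2)/sum(w))
    s2 <- sqrt(sum((1 - w)*(y - mu2)^2)/sum(1 - w))
  }
  list(p = p, mu = c(mu1, mu2), sigma = c(s1, s2))
}
em2(faithful$eruptions)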

source("ch6.r") h.cv <- ucv(eruptions) library(MASS) h.plugin <-width.SJ(eruptions) library(sm) library(locfit) f1<-density(eruptions,width=h.normal) # normal reference library(mclust) f2<-density(eruptions,width=h.cv) # cross-validation f3<-density(eruptions,width=h.plugin) # plugin ### Read f4<-density(eruptions,method="mclust") # finite mixture model faithful<-read.table("faithful.dat",header=T) f5 <-locfit(~eruptions, alpha=c(0.1,0.8),flim=c(1,6)) eruptions <-faithful$eruptions f6 <-locfit(~eruptions, alpha=c(0.1,0.6),flim=c(1,6), link="ident") n<-length(eruptions) postscript("density.ps") ### Select the optimal number of bins by CV par(mfrow=c(2,2)) hist.h <- cv.hist.fun(eruptions)$mbest truehist(eruptions, xlim=c(1,6), ymax=0.8) # histogram truehist(eruptions, nbins=hist.h, xlim=c(1,6), ymax=0.8) ### Select optimal bandwidth of kernel by normal plot(f1,ylim=c(0,0.8));plot(f2,ylim=c(0,0.8)) ### reference rule/Cross-validation/plug-in plot(f3,ylim=c(0,0.8));plot(f4,ylim=c(0,0.8)) plot(f5,ylim=c(0,0.8),main="local linear") sigma.hat <-min(sd(eruptions), IQR(eruptions)/1.34) plot(f6,ylim=c(0,0.8),main="local quadratic") h.normal <-1.06*sigma.hat/n^(0.2) dev.off() 31 32 density(eruptions, width = h.plugin) density(eruptions, method = "mclust") 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 Density Density 0.2 0.2 0.2 0.2 0.0 0.0 0.0 0.0

1 2 3 4 5 6 1 2 3 4 5 6 2 3 4 5 1 2 3 4 5 6

eruptions eruptions N = 272 Bandwidth = 0.14 N = 272 Bandwidth = 0.3348

density(eruptions, width = h.normal) density(eruptions, width = h.cv) local linear local quadratic 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 density density Density Density 0.2 0.2 0.2 0.2 0.0 0.0 0.0 0.0

2 3 4 5 2 3 4 5 1 2 3 4 5 6 1 2 3 4 5 6

N = 272 Bandwidth = 0.09857 N = 272 Bandwidth = 0.1019 eruptions eruptions

33 34

6.7 Converting Density Estimation Into Regression

Description of the method

1. Suppose that X_1, \ldots, X_n are from a density f on [0, 1].
2. Create k equal-width bins, where k \approx n/10.
3. Define

    Y_j = \sqrt{\frac{k}{n}} \times \sqrt{N_j + \frac{1}{4}},

   where N_j = \sum_{i=1}^n I(X_i \in B_j) and B_j is the j-th bin.
4. Let r(x) = \sqrt{f(x)} and let t_j be the midpoint of B_j. Then

    Y_j \approx r(t_j) + \sigma \epsilon_j,

   where \epsilon_j \sim N(0, 1) and \sigma = \sqrt{k/(4n)}.
5. Apply your favorite nonparametric regression method to (t_j, Y_j) to get an estimate \hat r_n.
6. Calculate (\hat r_n(x))^2 and normalize it to be a density:

    \hat f_n(x) = \frac{ (\hat r_n^+(x))^2 }{ \frac{1}{k} \sum_{j=1}^k (\hat r_n^+(t_j))^2 },

   where \hat r_n^+(x) = \max\{ \hat r_n(x), 0 \}.

How does it work?

• Poissonization:

    N_j \approx \operatorname{Poisson}\left( n \int_{B_j} f(x) \, dx \right) \approx \operatorname{Poisson}\left( \frac{n f(t_j)}{k} \right).

• E(N_j) = \operatorname{Var}(N_j) \approx n f(t_j)/k.
• Variance stabilization:

    Y_j = \sqrt{\frac{k}{n}} \times \sqrt{N_j + \frac{1}{4}},

• E(Y_j) \approx \sqrt{f(t_j)} and \operatorname{Var}(Y_j) \approx k/(4n).

R code:

source("ch6.r")
library(sm)
temp <- den.to.reg(eruptions)
temp2 <- smooth.spline(temp$x, temp$y, cv = TRUE)
temp3 <- temp2$y
temp3[temp3 < 0] <- 0
k <- length(temp3)
normal.const <- sum(temp3^2)/k
seqs <- seq(0, 1, by = 0.01)
temp4 <- predict(temp2, seqs)$y
temp4[temp4 < 0] <- 0
f.hat <- (temp4)^2/normal.const*k
postscript("den_reg.ps")
plot(seqs, f.hat, main = "Density to Regression")
dev.off()
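Since den.to.reg is defined in the course file ch6.r (not shown here), the following self-contained sketch re-implements steps 1-6 directly, assuming the data are rescaled to [0, 1] and taking smooth.spline as the regression method:

x <- (faithful$eruptions - min(faithful$eruptions)) /
  diff(range(faithful$eruptions))                   # rescale support to [0, 1]
n <- length(x)
k <- round(n/10)                                    # k equal-width bins
t <- (1:k - 0.5)/k                                  # bin midpoints t_j
N <- tabulate(pmin(k, floor(x*k) + 1), nbins = k)   # counts N_j
Y <- sqrt(k/n) * sqrt(N + 1/4)                      # root transform
fit <- smooth.spline(t, Y, cv = TRUE)               # regression step
r <- pmax(predict(fit, t)$y, 0)                     # r_n^+ at the midpoints
grid <- seq(0, 1, by = 0.01)
rg <- pmax(predict(fit, grid)$y, 0)
f.hat <- rg^2 / (sum(r^2)/k)                        # square and normalize
plot(grid, f.hat, type = "l", main = "Density to Regression")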

[Figure: "Density to Regression" — f.hat from the code above plotted against seqs.]

REFERENCES

1. Brown, L.D., Cai, T. and Zhou, L. (2005). A root-unroot transform and block thresholding approach to adaptive density estimation. Unpublished manuscript.
2. Loader, C. (1999). Local Regression and Likelihood. Springer, New York.
3. McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley, New York.
4. Scott, D.W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York.
5. Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
6. Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman and Hall, London.
