Density Estimation 36-708
1 Introduction

Let $X_1, \ldots, X_n$ be a sample from a distribution $P$ with density $p$. The goal of nonparametric density estimation is to estimate $p$ with as few assumptions about $p$ as possible. We denote the estimator by $\hat{p}$. The estimator will depend on a smoothing parameter $h$, and choosing $h$ carefully is crucial. To emphasize the dependence on $h$ we sometimes write $\hat{p}_h$.

Density estimation is used for regression, classification, clustering and unsupervised prediction. For example, if $\hat{p}(x,y)$ is an estimate of $p(x,y)$ then we get the following estimate of the regression function:
$$\hat{m}(x) = \int y\, \hat{p}(y \mid x)\, dy$$
where $\hat{p}(y \mid x) = \hat{p}(y,x)/\hat{p}(x)$. For classification, recall that the Bayes rule is $h(x) = I(p_1(x)\pi_1 > p_0(x)\pi_0)$ where $\pi_1 = \mathbb{P}(Y=1)$, $\pi_0 = \mathbb{P}(Y=0)$, $p_1(x) = p(x \mid y=1)$ and $p_0(x) = p(x \mid y=0)$. Inserting sample estimates of $\pi_1$ and $\pi_0$, and density estimates for $p_1$ and $p_0$, yields an estimate of the Bayes rule; many classifiers that you are familiar with can be re-expressed this way. For clustering, we look for the high density regions, based on an estimate of the density. Unsupervised prediction is discussed in Section 9; in this case we want to predict $X_{n+1}$ from $X_1, \ldots, X_n$.

Example 1 (Bart Simpson) The top left plot in Figure 1 shows the density
$$p(x) = \frac{1}{2}\phi(x;0,1) + \frac{1}{10}\sum_{j=0}^{4}\phi\bigl(x; (j/2)-1, 1/10\bigr) \qquad (1)$$
where $\phi(x;\mu,\sigma)$ denotes a Normal density with mean $\mu$ and standard deviation $\sigma$. Marron and Wand (1992) call this density "the claw" although we will call it the Bart Simpson density. Based on 1,000 draws from $p$, we computed a kernel density estimator, described later. The estimator depends on a tuning parameter called the bandwidth. The top right plot is based on a small bandwidth $h$, which leads to undersmoothing. The bottom right plot is based on a large bandwidth $h$, which leads to oversmoothing. The bottom left plot is based on a bandwidth $h$ which was chosen to minimize estimated risk. This leads to a much more reasonable density estimate.

Figure 1: The Bart Simpson density from Example 1. Top left: true density. The other plots are kernel estimators based on $n = 1{,}000$ draws. Bottom left: bandwidth $h = 0.05$ chosen by leave-one-out cross-validation. Top right: bandwidth $h/10$. Bottom right: bandwidth $10h$.

2 Loss Functions

The most commonly used loss function is the $L_2$ loss
$$\int (\hat{p}(x) - p(x))^2\, dx = \int \hat{p}^2(x)\, dx - 2\int \hat{p}(x)p(x)\, dx + \int p^2(x)\, dx.$$
The risk is $R(p, \hat{p}) = \mathbb{E}(L(p, \hat{p}))$.

Devroye and Györfi (1985) make a strong case for using the $L_1$ norm
$$\|\hat{p} - p\|_1 \equiv \int |\hat{p}(x) - p(x)|\, dx$$
as the loss instead of $L_2$. The $L_1$ loss has the following nice interpretation. If $P$ and $Q$ are distributions, define the total variation metric
$$d_{TV}(P, Q) = \sup_A |P(A) - Q(A)|$$
where the supremum is over all measurable sets. Now if $P$ and $Q$ have densities $p$ and $q$ then
$$d_{TV}(P, Q) = \frac{1}{2}\int |p - q| = \frac{1}{2}\|p - q\|_1.$$
Thus, if $\int |p - q| < \delta$ then we know that $|P(A) - Q(A)| < \delta/2$ for all $A$. Also, the $L_1$ norm is transformation invariant. Suppose that $T$ is a one-to-one smooth function. Let $Y = T(X)$. Let $p$ and $q$ be densities for $X$ and let $\tilde{p}$ and $\tilde{q}$ be the corresponding densities for $Y$. Then
$$\int |p(x) - q(x)|\, dx = \int |\tilde{p}(y) - \tilde{q}(y)|\, dy.$$
Hence the distance is unaffected by transformations. The $L_1$ loss is, in some sense, a much better loss function than $L_2$ for density estimation, but it is much more difficult to deal with. For now, we will focus on the $L_2$ loss; we may discuss the $L_1$ loss later.
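To make these quantities concrete, here is a small numerical sketch (not part of the notes) that compares the Bart Simpson density of Example 1 with a single standard Normal, approximating the $L_2$ loss, the $L_1$ norm, and the implied total variation distance on a grid. The grid range and the choice of the comparison density $q$ are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def claw(x):
    """Bart Simpson ("claw") density from equation (1)."""
    p = 0.5 * norm.pdf(x, loc=0.0, scale=1.0)
    for j in range(5):
        p = p + 0.1 * norm.pdf(x, loc=(j / 2) - 1.0, scale=0.1)
    return p

x = np.linspace(-4.0, 4.0, 20001)   # fine grid for numerical integration
dx = x[1] - x[0]
p = claw(x)
q = norm.pdf(x)                      # a smooth comparison density, N(0, 1)

l2 = np.sum((p - q) ** 2) * dx       # L2 loss
l1 = np.sum(np.abs(p - q)) * dx      # L1 norm
tv = 0.5 * l1                        # total variation distance d_TV(P, Q)

print(f"L2 = {l2:.4f}, L1 = {l1:.4f}, TV = {tv:.4f}")
```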
Another loss function is the Kullback-Leibler loss $\int p(x)\log(p(x)/q(x))\, dx$. This is not a good loss function to use for nonparametric density estimation. The reason is that the Kullback-Leibler loss is completely dominated by the tails of the densities.

3 Histograms

Perhaps the simplest density estimators are histograms. For convenience, assume that the data $X_1, \ldots, X_n$ are contained in the unit cube $\mathcal{X} = [0,1]^d$ (although this assumption is not crucial). Divide $\mathcal{X}$ into bins, or sub-cubes, of size $h$. (We discuss methods for choosing $h$ later.) There are $N \approx (1/h)^d$ such bins and each has volume $h^d$. Denote the bins by $B_1, \ldots, B_N$. The histogram density estimator is
$$\hat{p}_h(x) = \sum_{j=1}^{N} \frac{\hat{\theta}_j}{h^d}\, I(x \in B_j) \qquad (2)$$
where
$$\hat{\theta}_j = \frac{1}{n}\sum_{i=1}^{n} I(X_i \in B_j)$$
is the fraction of data points in bin $B_j$.

Now we bound the bias and variance of $\hat{p}_h$. We will assume that $p \in \mathcal{P}(L)$ where
$$\mathcal{P}(L) = \Bigl\{ p : |p(x) - p(y)| \le L\|x - y\| \ \text{for all } x, y \Bigr\}. \qquad (3)$$

First we bound the bias. Let $\theta_j = P(X \in B_j) = \int_{B_j} p(u)\, du$. For any $x \in B_j$,
$$p_h(x) \equiv \mathbb{E}(\hat{p}_h(x)) = \frac{\theta_j}{h^d} \qquad (4)$$
and hence
$$p(x) - p_h(x) = p(x) - \frac{\int_{B_j} p(u)\, du}{h^d} = \frac{1}{h^d}\int_{B_j} (p(x) - p(u))\, du.$$
Thus,
$$|p(x) - p_h(x)| \le \frac{1}{h^d}\int_{B_j} |p(x) - p(u)|\, du \le \frac{L h\sqrt{d}}{h^d}\int_{B_j} du = L h\sqrt{d},$$
where we used the fact that if $x, u \in B_j$ then $\|x - u\| \le \sqrt{d}\, h$.

Now we bound the variance. Since $p$ is Lipschitz on a compact set, it is bounded. Hence, $\theta_j = \int_{B_j} p(u)\, du \le C\int_{B_j} du = C h^d$ for some $C$. Thus, the variance is
$$\mathrm{Var}(\hat{p}_h(x)) = \frac{1}{h^{2d}}\mathrm{Var}(\hat{\theta}_j) = \frac{\theta_j(1 - \theta_j)}{n h^{2d}} \le \frac{\theta_j}{n h^{2d}} \le \frac{C}{n h^d}.$$

We conclude that the $L_2$ risk is bounded by
$$\sup_{p \in \mathcal{P}(L)} R(p, \hat{p}) = \sup_{p \in \mathcal{P}(L)} \int \mathbb{E}(\hat{p}_h(x) - p(x))^2\, dx \le L^2 h^2 d + \frac{C}{n h^d}. \qquad (5)$$
The upper bound is minimized by choosing $h = \left(\frac{C}{L^2 n d}\right)^{\frac{1}{d+2}}$. (Later, we shall see a more practical way to choose $h$.) With this choice,
$$\sup_{P \in \mathcal{P}(L)} R(p, \hat{p}) \le C_0 \left(\frac{1}{n}\right)^{\frac{2}{d+2}}$$
where $C_0 = L^2 d\, (C/(L^2 d))^{2/(d+2)}$. Later, we will prove the following theorem, which shows that this upper bound is tight. Specifically:

Theorem 2 There exists a constant $C > 0$ such that
$$\inf_{\hat{p}} \sup_{P \in \mathcal{P}(L)} \mathbb{E}\int (\hat{p}(x) - p(x))^2\, dx \ge C \left(\frac{1}{n}\right)^{\frac{2}{d+2}}. \qquad (6)$$

3.1 Concentration Analysis For Histograms

Let us now derive a concentration result for $\hat{p}_h$. We will bound
$$\sup_{P \in \mathcal{P}(L)} P^n\bigl(\|\hat{p}_h - p\|_\infty > \epsilon\bigr)$$
where $\|f\|_\infty = \sup_x |f(x)|$. Assume that $\epsilon \le 1$. First, note that
$$\mathbb{P}\bigl(\|\hat{p}_h - p_h\|_\infty > \epsilon\bigr) = \mathbb{P}\Bigl(\max_j \Bigl|\frac{\hat{\theta}_j}{h^d} - \frac{\theta_j}{h^d}\Bigr| > \epsilon\Bigr) = \mathbb{P}\bigl(\max_j |\hat{\theta}_j - \theta_j| > h^d \epsilon\bigr) \le \sum_j \mathbb{P}\bigl(|\hat{\theta}_j - \theta_j| > h^d \epsilon\bigr).$$
Using Bernstein's inequality and the fact that $\theta_j(1 - \theta_j) \le \theta_j \le C h^d$,
$$\mathbb{P}\bigl(|\hat{\theta}_j - \theta_j| > h^d \epsilon\bigr) \le 2\exp\left(-\frac{1}{2}\,\frac{n \epsilon^2 h^{2d}}{\theta_j(1 - \theta_j) + h^d \epsilon/3}\right) \le 2\exp\left(-\frac{1}{2}\,\frac{n \epsilon^2 h^{2d}}{C h^d + h^d \epsilon/3}\right) \le 2\exp\bigl(-c n \epsilon^2 h^d\bigr)$$
where $c = 1/(2(C + 1/3))$. By the union bound and the fact that $N \le (1/h)^d$,
$$\mathbb{P}\bigl(\max_j |\hat{\theta}_j - \theta_j| > h^d \epsilon\bigr) \le 2 h^{-d} \exp\bigl(-c n \epsilon^2 h^d\bigr) \equiv \pi_n.$$
Earlier we saw that $\sup_x |p(x) - p_h(x)| \le L\sqrt{d}\, h$. Hence, with probability at least $1 - \pi_n$,
$$\|\hat{p}_h - p\|_\infty \le \|\hat{p}_h - p_h\|_\infty + \|p_h - p\|_\infty \le \epsilon + L\sqrt{d}\, h. \qquad (7)$$
Now set
$$\epsilon = \sqrt{\frac{1}{c n h^d}\log\left(\frac{2}{\delta h^d}\right)}.$$
Then, with probability at least $1 - \delta$,
$$\|\hat{p}_h - p\|_\infty \le \sqrt{\frac{1}{c n h^d}\log\left(\frac{2}{\delta h^d}\right)} + L\sqrt{d}\, h. \qquad (8)$$
Choosing $h = (c_2/n)^{1/(2+d)}$ we conclude that, with probability at least $1 - \delta$,
$$\|\hat{p}_h - p\|_\infty \le \sqrt{c^{-1} n^{-\frac{2}{2+d}}\left(\log\frac{2}{\delta} + \frac{2}{2+d}\log n\right)} + L\sqrt{d}\, n^{-\frac{1}{2+d}} = O\left(\sqrt{\frac{\log n}{n^{2/(2+d)}}}\right). \qquad (9)$$
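As a concrete illustration, here is a minimal sketch (not part of the notes) of the histogram estimator in equation (2): each observation is assigned to its sub-cube, the bin proportions $\hat{\theta}_j$ are computed, and the estimate in bin $B_j$ is $\hat{\theta}_j / h^d$. The simulated sample and the bandwidth are illustrative assumptions.

```python
import numpy as np

def histogram_density(X, h):
    """Histogram density estimator (2) on [0,1]^d with cubical bins of side h."""
    X = np.atleast_2d(X)
    n, d = X.shape
    m = int(np.ceil(1.0 / h))                        # bins per axis; N = m^d in total
    idx = np.minimum((X / h).astype(int), m - 1)     # bin index of each observation
    counts = np.zeros((m,) * d)
    for row in idx:
        counts[tuple(row)] += 1
    p_hat = (counts / n) / h**d                      # theta_hat_j / h^d for each bin

    def estimate(x):
        """Evaluate the estimator at a point x in [0,1]^d."""
        j = tuple(np.minimum((np.asarray(x) / h).astype(int), m - 1))
        return p_hat[j]

    return estimate

# Example: n = 1000 points in [0,1]^2 with bandwidth h = 0.1
rng = np.random.default_rng(0)
X = rng.beta(2, 5, size=(1000, 2))
p_hat = histogram_density(X, h=0.1)
print(p_hat([0.25, 0.25]))
```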
4 Kernel Density Estimation

A one-dimensional smoothing kernel is any smooth function $K$ such that $\int K(x)\, dx = 1$, $\int x K(x)\, dx = 0$ and $\sigma_K^2 \equiv \int x^2 K(x)\, dx > 0$. Smoothing kernels should not be confused with Mercer kernels, which we discuss later. Some commonly used kernels are the following:

Boxcar: $K(x) = \frac{1}{2} I(x)$
Gaussian: $K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
Epanechnikov: $K(x) = \frac{3}{4}(1 - x^2) I(x)$
Tricube: $K(x) = \frac{70}{81}(1 - |x|^3)^3 I(x)$

where $I(x) = 1$ if $|x| \le 1$ and $I(x) = 0$ otherwise. These kernels are plotted in Figure 2. Two commonly used multivariate kernels are $\prod_{j=1}^{d} K(x_j)$ and $K(\|x\|)$.

Figure 2: Examples of smoothing kernels: boxcar (top left), Gaussian (top right), Epanechnikov (bottom left), and tricube (bottom right).

Figure 3: A kernel density estimator $\hat{p}$.
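As a preview of the estimator depicted in Figure 3, here is a minimal sketch assuming the standard one-dimensional form $\hat{p}_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\bigl(\frac{x - X_i}{h}\bigr)$, with the Epanechnikov kernel defined above. This is not the notes' code; the sample and the bandwidth are illustrative assumptions.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = (3/4)(1 - u^2) for |u| <= 1, and 0 otherwise."""
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def kde(x, X, h, kernel=epanechnikov):
    """Evaluate the kernel density estimate at the points x, given the sample X."""
    x = np.asarray(x, dtype=float)[:, None]   # shape (m, 1)
    X = np.asarray(X, dtype=float)[None, :]   # shape (1, n)
    return kernel((x - X) / h).mean(axis=1) / h

# Example: 1000 draws from N(0, 1), bandwidth h = 0.3
rng = np.random.default_rng(0)
sample = rng.standard_normal(1000)
grid = np.linspace(-3.0, 3.0, 7)
print(kde(grid, sample, h=0.3))
```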