
APPLIED SMOOTHING TECHNIQUES

Part 1:

Walter Zucchini

October 2003

Contents

1 Density Estimation
  1.1 Introduction
    1.1.1 The probability density function
    1.1.2 Non–parametric estimation of f(x) — histograms
  1.2 Kernel density estimation
    1.2.1 Weighting functions
    1.2.2 Kernels
    1.2.3 Densities with bounded support
  1.3 Properties of kernel estimators
    1.3.1 Quantifying the accuracy of kernel estimators
    1.3.2 The bias, variance and mean squared error of fˆ(x)
    1.3.3 Optimal bandwidth
    1.3.4 Optimal kernels
  1.4 Selection of the bandwidth
    1.4.1 Subjective selection
    1.4.2 Selection with reference to some given distribution
    1.4.3 Cross–validation
    1.4.4 “Plug–in” estimator
    1.4.5 Summary and extensions

Chapter 1

Density Estimation

1.1 Introduction

1.1.1 The probability density function

The distribution of a continuous–valued random variable X is conventionally described in terms of its probability density function (pdf), f(x), from which the probabilities associated with X can be determined using the relationship

P(a \le X \le b) = \int_a^b f(x)\,dx .

The objective of many investigations is to estimate f(x) from a sample of observations x_1, x_2, ..., x_n. In what follows we will assume that the observations can be regarded as independent realizations of X.

The parametric approach for estimating f(x) is to assume that f(x) is a member of some parametric family of distributions, e.g. N(µ, σ²), and then to estimate the parameters of the assumed distribution from the data. For example, fitting a normal distribution leads to the estimator

\hat{f}(x) = \frac{1}{\sqrt{2\pi}\,\hat{\sigma}}\, e^{-(x-\hat{\mu})^2 / 2\hat{\sigma}^2} , \quad x \in \mathbb{R} ,

where \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i and \hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \hat{\mu})^2.

This approach has advantages as long as the distributional assumption is correct, or at least is not seriously wrong. It is easy to apply and it yields (relatively) stable estimates. The main disadvantage of the parametric approach is lack of flexibility. Each parametric family of distributions imposes restrictions on the shapes that f(x) can have. For example the density function of the normal distribution is symmetrical and bell–shaped, and is therefore unsuitable for representing skewed or bimodal densities.
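As an illustration, the following minimal R sketch fits a normal distribution and evaluates the resulting parametric estimator fˆ(x); the sample here is simulated and merely stands in for real data.

    ## Parametric density estimation: fit a normal distribution by
    ## estimating mu and sigma, then evaluate fhat(x) on a grid.
    set.seed(1)
    x <- rnorm(100, mean = 10, sd = 2)    # hypothetical sample
    mu.hat    <- mean(x)                  # mu-hat: the sample mean
    sigma.hat <- sd(x)                    # sd() uses the n - 1 divisor, as above
    fhat <- function(t) dnorm(t, mean = mu.hat, sd = sigma.hat)
    curve(fhat, from = 0, to = 20, ylab = "Density")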


1.1.2 Non–parametric estimation of f(x) — histograms

The idea of the non–parametric approach is to avoid restrictive assumptions about the form of f(x) and to estimate it directly from the data. A well–known non–parametric estimator of the pdf is the histogram. It has the advantage of simplicity, but it also has disadvantages: it lacks continuity and, in terms of various mathematical measures of accuracy, there exist alternative non–parametric estimators that are superior to it.

To construct a histogram one needs to select a left bound, or starting point, x0, and the bin width, b. The bins are of the form [x0 + (i − 1)b, x0 + ib), i = 1, 2, ..., m. The estimator of f(x) is then given by

\hat{f}(x) = \frac{1}{n} \cdot \frac{\text{number of observations in the same bin as } x}{b}

More generally one can also use bins of different widths, in which case

\hat{f}(x) = \frac{1}{n} \cdot \frac{\text{number of observations in the same bin as } x}{\text{width of the bin containing } x}

The choice of bins, especially the bin widths, has a substantial effect on the shape and other properties of fˆ(x). This is illustrated in the example that follows.

Example 1
We consider a population of 689 new cars of a certain model. Of interest here is the amount (in DM) paid by the customers for “optional extras”, such as radio, hubcaps, special upholstery, etc. The top histogram in Figure 1.1 relates to the entire population; the bottom histogram is for a random sample of size 10 from the population. Figure 1.2 shows three histogram estimates of f(x) for the sample, for different bin widths. Note that the estimates are piecewise constant and that they are strongly influenced by the choice of bin width. The bottom right hand graph is an example of a so–called kernel estimator of f(x). We will be examining such estimators in more detail.
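A minimal R sketch of the histogram estimator with starting point x0 and bin width b follows; the sample is simulated and only stands in for the car data.

    ## Histogram density estimator: count observations in the bin
    ## [x0 + j*b, x0 + (j+1)*b) containing t, divided by n*b.
    hist.est <- function(t, x, x0, b) {
      j <- floor((t - x0) / b)            # bin index of each point t
      counts <- sapply(j, function(k)
        sum(x >= x0 + k * b & x < x0 + (k + 1) * b))
      counts / (length(x) * b)
    }
    set.seed(1)
    x <- rexp(50, rate = 0.2)             # hypothetical expenditures
    t <- seq(0, 30, by = 0.1)
    plot(t, hist.est(t, x, x0 = 0, b = 5), type = "s", ylab = "Density")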

1.2 Kernel density estimation

1.2.1 Weighting functions

From the definition of the pdf, f(x), of a random variable, X, one has that

P(x - h < X < x + h) = \int_{x-h}^{x+h} f(t)\,dt \approx 2hf(x)

Figure 1.1: Histograms of total expenditure (in 1000 DM) for all cars in the population (689 cars) and for a random sample of size n = 10.

Figure 1.2: Histograms with different bin widths and a kernel estimate of f(x) (N = 10, bandwidth = 2.344) for the same sample.

and hence

f(x) \approx \frac{1}{2h} P(x - h < X < x + h)   (1.1)

The above probability can be estimated by a relative frequency in the sample, hence

\hat{f}(x) = \frac{1}{2h} \cdot \frac{\text{number of observations in } (x-h, x+h)}{n}   (1.2)

An alternative way to represent fˆ(x) is

\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} w(x - x_i, h) ,   (1.3)

where x_1, x_2, ..., x_n are the observed values and

w(t, h) = \begin{cases} \frac{1}{2h} & \text{for } |t| < h , \\ 0 & \text{otherwise} . \end{cases}

It is left to the reader as an exercise to show that fˆ(x) defined in (1.3) has the properties of a pdf, that is fˆ(x) is non–negative for all x, and the area between fˆ(x) and the x–axis is equal to one.

One way to think about (1.3) is to imagine that a rectangle (height \frac{1}{2h} and width 2h) is placed over each observed point on the x–axis. The estimate of the pdf at a given point is 1/n times the sum of the heights of all the rectangles that cover the point. Figure 1.3 shows fˆ(x) for such rectangular “weighting functions” and for different values of h. We note that the estimates of fˆ(x) in Figure 1.3 fluctuate less as the value of h is increased. By increasing h one increases the width of each rectangle and thereby increases the degree of “smoothing”.

Instead of using rectangles in (1.3) one could use other weighting functions, for example triangles:

w(t, h) = \begin{cases} \frac{1}{h}\left(1 - |t|/h\right) & \text{for } |t| < h , \\ 0 & \text{otherwise} . \end{cases}

Again it is left to the reader to check that the resulting fˆ(x) is indeed a pdf. Examples of fˆ(x) based on the triangular weighting function and four different values of h are shown in Figure 1.4. Note that here too larger values of h lead to smoother estimates fˆ(x). Another alternative weighting function is the Gaussian:

w(t, h) = \frac{1}{\sqrt{2\pi}\,h}\, e^{-t^2/2h^2} , \quad -\infty < t < \infty .

Figure 1.5 shows fˆ(x) based on this weighting function for different values of h. Again the fluctuations in fˆ(x) decrease with increasing h.
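The estimator (1.3) is simple enough to code directly. Below is a minimal R sketch with the three weighting functions above; the data are again simulated stand-ins.

    ## Naive kernel estimator (1.3): fhat(x) = (1/n) * sum_i w(x - x_i, h).
    w.rect  <- function(t, h) ifelse(abs(t) < h, 1 / (2 * h), 0)
    w.tri   <- function(t, h) ifelse(abs(t) < h, (1 - abs(t) / h) / h, 0)
    w.gauss <- function(t, h) dnorm(t, sd = h)   # the Gaussian weighting function
    fhat <- function(t, x, h, w) sapply(t, function(u) mean(w(u - x, h)))
    set.seed(1)
    x <- rexp(50, rate = 0.2)
    t <- seq(0, 40, by = 0.1)
    plot(t, fhat(t, x, h = 2, w = w.gauss), type = "l", ylab = "Density")
    lines(t, fhat(t, x, h = 2, w = w.rect), lty = 2)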

Figure 1.3: Estimates of f(x) based on rectangular weighting functions, for different values of h. The abbreviation bw (short for bandwidth) is used here instead of h.

Figure 1.4: Estimates of f(x) based on triangular weighting functions.

Figure 1.5: Estimates of f(x) based on Gaussian weighting functions.

1.2.2 Kernels

The above weighting functions, w(t, h), are all of the form

w(t, h) = \frac{1}{h} K\!\left(\frac{t}{h}\right) ,   (1.4)

where K is a function of a single variable called the kernel. A kernel is a standardized weighting function, namely the weighting function with h = 1. The kernel determines the shape of the weighting function. The parameter h is called the bandwidth or smoothing constant; it determines the amount of smoothing applied in estimating f(x). Examples of kernels are given in Table 1.1.

Kernel         K(t)                                          Efficiency (exact and to 4 d.p.)
Epanechnikov   (3/4)(1 − t²/5)/√5 for |t| < √5, 0 otherwise   1
Biweight       (15/16)(1 − t²)² for |t| < 1, 0 otherwise      (3087/3125)^{1/2} ≈ 0.9939
Triangular     1 − |t| for |t| < 1, 0 otherwise               (243/250)^{1/2} ≈ 0.9859
Gaussian       (1/√(2π)) e^{−(1/2)t²}                         (36π/125)^{1/2} ≈ 0.9512
Rectangular    1/2 for |t| < 1, 0 otherwise                   (108/125)^{1/2} ≈ 0.9295

Table 1.1: Kernels and their efficiencies (to be discussed in Section 1.3.4).

In general any function having the following properties can be used as a kernel:

(a) \int K(z)\,dz = 1 \qquad (b) \int zK(z)\,dz = 0 \qquad (c) \int z^2 K(z)\,dz =: k_2 < \infty   (1.5)

It follows that any symmetric pdf is a kernel. However, non–pdf kernels can also be used, e.g. kernels for which K(z) < 0 for some values of z. Kernels of the latter type have the disadvantage that fˆ(x) may be negative for some values of x.

Kernel estimation of pdfs is characterized by the kernel, K, which determines the shape of the weighting function, and the bandwidth, h, which determines the “width” of the weighting function and hence the amount of smoothing. These two components determine the properties of fˆ(x). Considerable research has been carried out (and continues to be carried out) on the question of how one should select K and h in order to optimize the properties of fˆ(x). This issue will be discussed in the sections that follow.
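In R the general estimator (1.4) is available through the base function density(), whose kernel and bw arguments correspond to K and h. A short sketch (with simulated data):

    ## Kernel density estimates with different kernels and a fixed bandwidth;
    ## in density(), bw is the standard deviation of the (scaled) kernel.
    set.seed(1)
    x <- rexp(50, rate = 0.2)
    plot(density(x, kernel = "epanechnikov", bw = 2), main = "")
    lines(density(x, kernel = "rectangular", bw = 2), lty = 2)
    lines(density(x, kernel = "gaussian",    bw = 2), lty = 3)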

1.2.3 Densities with bounded support

In many situations the values that a random variable, X, can take on are restricted, for example to the interval [0, ∞), that is f(x) = 0 for x < 0. We say that the support of f(x) is [0, ∞). Similarly if X can only take on values in the interval (a, b) then f(x) = 0 for x ∉ (a, b); the support of f(x) is (a, b). In such situations it is clearly desirable that the estimator fˆ(x) has the same support as f(x). Direct application of kernel smoothing methods does not guarantee this property

and so they need to be modified when f(x) has bounded support. The simplest method of solving this problem is to use a transformation. The idea is to estimate the pdf of a transformed random variable Y = t(X) which has unbounded support. Suppose that the pdf of Y is given by g(y). Then the relationship between f and g is given by

f(x) = g(t(x))\,t'(x) .   (1.6)

One carries out the following steps:

(a) Transform the observations y_i = t(x_i), i = 1, 2, ..., n.
(b) Apply the kernel estimator to the transformed observations to estimate the pdf g(y).
(c) Estimate f(x) using fˆ(x) = gˆ(t(x)) t'(x).

Example 2
Suppose that f(x) has support [0, ∞). A simple transformation t : [0, ∞) → (−∞, ∞) is the log–transformation, i.e. t(x) = log(x). Here t'(x) = \frac{d}{dx}\log(x) = \frac{1}{x} and so

\hat{f}(x) = \hat{g}(\log(x)) \cdot \frac{1}{x}   (1.7)

The resulting estimator has support [0, ∞). Figure 1.6 provides an illustration of this case for the sample considered in Example 1.

(a) The graph on the top left gives the estimated density fˆ(x) obtained without restrictions on the support. Note that fˆ(x) > 0 for some x < 0.
(b) The graph on the top right shows a modified version of fˆ(x) obtained in (a), namely

\hat{f}_c(x) = \begin{cases} \hat{f}(x) \big/ \int_0^{\infty} \hat{f}(x)\,dx & \text{for } x \ge 0 \\ 0 & \text{for } x < 0 \end{cases}   (1.8)

Here fˆ_c(x) is set equal to zero for x < 0 and fˆ(x) is rescaled so that the area under the estimated density equals one.
(c) The bottom left graph shows a kernel estimator of g(y), that is the density of Y = log(X).
(d) The bottom right graph shows the transformed estimator fˆ(x) obtained via gˆ(y).
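A minimal R sketch of steps (a)–(c) for the log–transformation, again with simulated positive-valued data:

    ## Transformation method for support [0, Inf): estimate g on the
    ## log scale, then back-transform via fhat(x) = ghat(log x) / x.
    set.seed(1)
    x <- rexp(50, rate = 0.2)                 # hypothetical positive sample
    y <- log(x)                               # step (a): transform
    g <- density(y)                           # step (b): estimate g(y)
    fhat <- function(t)                       # step (c): back-transform,
      approx(g$x, g$y, xout = log(t))$y / t   # interpolating ghat (NA outside range)
    t <- seq(0.1, 40, by = 0.1)
    plot(t, fhat(t), type = "l", ylab = "Density")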

Example 3
Suppose that the support of f(x) is (a, b). Then a simple transformation t : (a, b) → (−∞, ∞) is t(x) = \log\left(\frac{x-a}{b-x}\right). Here t'(x) = \frac{1}{x-a} + \frac{1}{b-x} and so

\hat{f}(x) = \hat{g}\!\left(\log\left(\frac{x-a}{b-x}\right)\right) \left(\frac{1}{x-a} + \frac{1}{b-x}\right)   (1.9)

Figure 1.6: Kernel estimates of a pdf with support [0, ∞): Gaussian kernel without restriction (top left), with cutoff (top right), Gaussian kernel of log(values) (bottom left) and the estimate via kernel smoothing of log(values) (bottom right).

Figure 1.7: Kernel estimates of a pdf with support (0, 25), using the logit transformation; panels analogous to Figure 1.6.

Figure 1.8: Application of the normit transformation for fˆ(x) with support (0, 25).

Figure 1.7 provides an illustration of this case with a = 0 and b = 25. The four graphs shown are analogous to those in Example 2 but with

\hat{f}_c(x) = \begin{cases} \hat{f}(x) \big/ \int_a^b \hat{f}(x)\,dx & \text{for } a < x < b \\ 0 & \text{otherwise} \end{cases}   (1.10)

for the graph on the top right.

Example 4
As an alternative to the transformation in Example 3 one can use the inverse of some probability distribution function, such as the normal; e.g. t(x) = \Phi^{-1}\!\left(\frac{x-a}{b-a}\right), where \Phi is the distribution function of the standard normal. Here too t : (a, b) → (−∞, ∞) and the estimator is

\hat{f}(x) = \begin{cases} \dfrac{\hat{g}\left(\Phi^{-1}\left(\frac{x-a}{b-a}\right)\right)}{(b-a)\,\varphi\left(\Phi^{-1}\left(\frac{x-a}{b-a}\right)\right)} & \text{for } a < x < b , \\ 0 & \text{otherwise} , \end{cases}   (1.11)

where \varphi(x) is the pdf of a standard normal distribution. The application of this transformation is illustrated in Figure 1.8, in which the four graphs are analogous to those in the previous two examples.

The above three examples illustrate that the transformation procedure can lead to a considerable change in the appearance of the estimate fˆ(x). By applying kernel smoothing to the transformed values one is, in effect, applying a different kernel at each point in the estimation of f(x).

1.3 Properties of kernel estimators

1.3.1 Quantifying the accuracy of kernel estimators

There are various ways to quantify the accuracy of a density estimator. We will focus here on the mean squared error (MSE) and its two components, namely the bias and the variance. We note that the MSE of fˆ(x) is a function of the argument x:

\mathrm{MSE}(\hat{f}(x)) = E(\hat{f}(x) - f(x))^2   (1.12)
 = (E\hat{f}(x) - f(x))^2 + E(\hat{f}(x) - E\hat{f}(x))^2
 = \mathrm{Bias}^2(\hat{f}(x)) + \mathrm{Var}(\hat{f}(x))

A measure of the global accuracy of fˆ(x) is the mean integrated squared error (MISE):

\mathrm{MISE}(\hat{f}) = E \int_{-\infty}^{\infty} (\hat{f}(x) - f(x))^2\,dx   (1.13)
 = \int_{-\infty}^{\infty} \mathrm{MSE}(\hat{f}(x))\,dx
 = \int_{-\infty}^{\infty} \mathrm{Bias}^2(\hat{f}(x))\,dx + \int_{-\infty}^{\infty} \mathrm{Var}(\hat{f}(x))\,dx

We consider each of these components in turn.

1.3.2 The bias, variance and mean squared error of fˆ(x)

E(\hat{f}(x)) = \frac{1}{n} \sum_{i=1}^{n} E\left[\frac{1}{h} K\!\left(\frac{x - x_i}{h}\right)\right]   (1.14)
 = \frac{1}{n} \sum_{i=1}^{n} \int_{-\infty}^{\infty} \frac{1}{h} K\!\left(\frac{x-t}{h}\right) f(t)\,dt
 = \int_{-\infty}^{\infty} \frac{1}{h} K\!\left(\frac{x-t}{h}\right) f(t)\,dt

The transformation z = \frac{x-t}{h}, i.e. t = x - hz, \left|\frac{dz}{dt}\right| = \frac{1}{h}, yields

E(\hat{f}(x)) = \int_{-\infty}^{\infty} K(z) f(x - hz)\,dz

Expanding f(x − hz) in a Taylor series yields

f(x - hz) = f(x) - hz f'(x) + \frac{(hz)^2}{2} f''(x) + o(h^2) ,

where o(h²) represents terms that converge to zero faster than h² as h approaches zero. Thus

E(\hat{f}(x)) = f(x) \int K(z)\,dz - h f'(x) \int zK(z)\,dz + \frac{h^2}{2} f''(x) \int z^2 K(z)\,dz + o(h^2)   (1.15)
 = f(x) + \frac{h^2}{2} k_2 f''(x) + o(h^2)   (1.16)

\mathrm{Bias}(\hat{f}(x)) \approx \frac{h^2}{2} k_2 f''(x)   (1.17)

Thus the bias converges to zero as h → 0, and it depends on k_2, the “variance” of the kernel, and on f''(x), the curvature of the density at the point x.

The variance of fˆ(x) is given by

\mathrm{Var}(\hat{f}(x)) = \mathrm{Var}\left(\frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)\right)
 = \frac{1}{n^2 h^2} \sum_{i=1}^{n} \mathrm{Var}\left(K\!\left(\frac{x - x_i}{h}\right)\right)

because the xi, i = 1, 2, ..., n, are independently distributed. Now

µ µ ¶¶ à µ ¶ ! µ µ ¶¶ x − x x − x 2 x − x 2 V ar K i = E K i − EK i h h h Z µ ¶ µZ µ ¶ ¶ x − t 2 x − t 2 = K f(t)dt − K f(t)dt h h

\mathrm{Var}(\hat{f}(x)) = \frac{1}{n} \frac{1}{h^2} \int K\!\left(\frac{x-t}{h}\right)^2 f(t)\,dt - \frac{1}{n} \left(\frac{1}{h} \int K\!\left(\frac{x-t}{h}\right) f(t)\,dt\right)^2
 = \frac{1}{n} \frac{1}{h^2} \int K\!\left(\frac{x-t}{h}\right)^2 f(t)\,dt - \frac{1}{n} \left(f(x) + \mathrm{Bias}(\hat{f}(x))\right)^2

Substituting z = \frac{x-t}{h} one obtains

\mathrm{Var}(\hat{f}(x)) = \frac{1}{nh} \int K(z)^2 f(x - hz)\,dz - \frac{1}{n} \left(f(x) + o(h^2)\right)^2

Applying a Taylor approximation yields

\mathrm{Var}(\hat{f}(x)) = \frac{1}{nh} \int K(z)^2 \left(f(x) - hz f'(x) + o(h)\right) dz - \frac{1}{n} \left(f(x) + o(h^2)\right)^2

Note that if n becomes large and h becomes small then the above expression becomes approximately

\mathrm{Var}(\hat{f}(x)) \approx \frac{1}{nh} f(x) \int K^2(z)\,dz   (1.18)

We note that the variance decreases as h increases. The above approximations for the bias and variance of fˆ(x) lead to

\mathrm{MSE}(\hat{f}(x)) = \mathrm{Bias}^2(\hat{f}(x)) + \mathrm{Var}(\hat{f}(x))   (1.19)
 \approx \frac{1}{4} h^4 k_2^2 f''(x)^2 + \frac{1}{nh} f(x)\, j_2 ,

where k_2 := \int z^2 K(z)\,dz and j_2 := \int K(z)^2\,dz. Integrating (1.19) with respect to x yields

\mathrm{MISE}(\hat{f}) \approx \frac{1}{4} h^4 k_2^2 \beta(f) + \frac{1}{nh} j_2 ,   (1.20)

where \beta(f) := \int f''(x)^2\,dx. Of central importance is the way in which MISE(fˆ) changes as a function of the bandwidth h. For very small values of h the second term in (1.20) becomes large, but as h gets larger the first term in (1.20) increases. There is an optimal value of h which minimizes MISE(fˆ).
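The bias–variance tradeoff in (1.20) is easy to visualize numerically. The R sketch below plots the approximation for a standard normal f and a Gaussian kernel, for which k_2 = 1, j_2 = 1/(2√π) and β(f) = 3/(8√π); these constants follow from the definitions above and the normal-reference result in Section 1.4.2.

    ## AMISE approximation (1.20) as a function of h, for f = N(0,1)
    ## and a Gaussian kernel: k2 = 1, j2 = 1/(2*sqrt(pi)), beta = 3/(8*sqrt(pi)).
    amise <- function(h, n, k2 = 1, j2 = 1 / (2 * sqrt(pi)),
                      beta = 3 / (8 * sqrt(pi)))
      0.25 * h^4 * k2^2 * beta + j2 / (n * h)
    h <- seq(0.05, 2, by = 0.01)
    plot(h, amise(h, n = 100), type = "l", ylab = "approx. MISE")
    h[which.min(amise(h, n = 100))]   # about 0.42, matching (1.21) and (1.24)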

1.3.3 Optimal bandwidth

Expression (1.20) is the measure that we use to quantify the performance of the estimator. We can find the optimal bandwidth by minimizing (1.20) with respect to h. The first derivative is given by

\frac{d\,\mathrm{MISE}(\hat{f})}{dh} = h^3 k_2^2 \beta(f) - \frac{1}{nh^2} j_2 .

Setting this equal to zero yields the optimal bandwidth, h_{opt}, for the given pdf and kernel:

h_{opt} = \left(\frac{1}{n} \cdot \frac{\gamma(K)}{\beta(f)}\right)^{1/5} ,   (1.21)

where \gamma(K) := j_2 k_2^{-2}. Substituting (1.21) for h in (1.20) gives the minimal MISE for the given pdf and kernel. After some manipulation this can be shown to be

\mathrm{MISE}_{opt}(\hat{f}) = \frac{5}{4} \left(\frac{\beta(f)\, j_2^4\, k_2^2}{n^4}\right)^{1/5} .   (1.22)

We note that h_{opt} depends on the sample size, n, and the kernel, K. However, it also depends on the unknown pdf, f, through the functional β(f). Thus as it stands expression (1.21) is not applicable in practice. However, the “plug–in” estimator of h_{opt}, to be discussed later, is simply expression (1.21) with β(f) replaced by an estimator.

1.3.4 Optimal kernels

The MISE(fˆ) can also be minimized with respect to the kernel used. It can be shown (see, e.g., Wand and Jones, 1995) that the Epanechnikov kernel is optimal in this respect:

K(z) = \begin{cases} \frac{3}{4\sqrt{5}}\left(1 - \frac{1}{5} z^2\right) & \text{for } |z| < \sqrt{5} \\ 0 & \text{otherwise} . \end{cases}

This result together with (1.22) enables one to examine the impact of kernel choice on MISE_{opt}(fˆ). The efficiency of a kernel, K, relative to the optimal Epanechnikov kernel, K_{EP}, is defined as

\mathrm{Eff}(K) = \left(\frac{\mathrm{MISE}_{opt}(\hat{f}) \text{ using } K_{EP}}{\mathrm{MISE}_{opt}(\hat{f}) \text{ using } K}\right)^{5/4} = \left(\frac{k_2^2\, j_2^4 \text{ using } K_{EP}}{k_2^2\, j_2^4 \text{ using } K}\right)^{1/4}   (1.23)

The efficiencies for a number of well–known kernels are given in Table 1.1. It is clear that the choice of kernel has a rather limited impact on the efficiency. The rectangular kernel, for example, has an efficiency of approximately 93%. This can be interpreted as follows: the MISE_{opt}(fˆ) obtained using an Epanechnikov kernel with n = 93 is approximately equal to the MISE_{opt}(fˆ) obtained using a rectangular kernel with n = 100.
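As a numerical check of (1.23), the following R sketch computes the efficiencies of the rectangular and Gaussian kernels from their moments k_2 and j_2; the square-root form used in the code is an algebraic rearrangement of (1.23), and the stated moment values are standard and easy to verify by integration.

    ## Efficiency (1.23), rearranged: Eff(K) = (j2.ep * sqrt(k2.ep)) / (j2 * sqrt(k2)),
    ## with k2 = int z^2 K(z) dz and j2 = int K(z)^2 dz.
    j2.ep <- 3 / (5 * sqrt(5))                # Epanechnikov; its k2 equals 1
    eff <- function(k2, j2) j2.ep / (sqrt(k2) * j2)
    eff(k2 = 1 / 3, j2 = 1 / 2)               # rectangular: 0.9295
    eff(k2 = 1, j2 = 1 / (2 * sqrt(pi)))      # Gaussian:    0.9512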

Figure 1.9: The pdf for the car example and a histogram for a sample of size 20.

1.4 Selection of the bandwidth

Selection of the bandwidth of a kernel estimator is the subject of considerable research. We will outline four popular methods.

1.4.1 Subjective selection

One can experiment by using different bandwidths and simply select one that “looks right” for the type of data under investigation. Figure 1.9 shows the pdf (for the car example) and a histogram for a random sample of size n = 20. Figure 1.10 shows kernel density estimates (based on a Gaussian kernel) of f(x) using four different bandwidths, together with the density of the population. The latter is usually unknown in practice (otherwise we wouldn’t need to estimate it using a sample). Clearly h = 0.5 is too small and h = 3 is too large; appropriate here is a bandwidth greater than 1 but less than 3.

1.4.2 Selection with reference to some given distribution

Here one selects the bandwidth that would be optimal for a particular pdf. Convenient here is the normal. We note that one is not assuming that f(x) is normal; rather one is

Figure 1.10: The pdf for the car example and kernel density estimates using a Gaussian kernel and different bandwidths (h = 0.5, 1, 2, 3).

selecting the h which would be optimal if the pdf were normal. In this case it can be shown that

\beta(f) = \int f''(x)^2\,dx = \frac{3\sigma^{-5}}{8\sqrt{\pi}}

and using a Gaussian kernel leads to

h_{opt} = \sigma \left(\frac{4}{3n}\right)^{1/5} \approx \frac{1.06\,\sigma}{n^{1/5}} .   (1.24)

The normal distribution is not a “wiggly” distribution; it is unimodal and bell–shaped. It is therefore to be expected that (1.24) will be too large for multimodal distributions. Secondly, to apply (1.24) one has to estimate σ. The usual estimator, the sample standard deviation, is not robust; it overestimates σ if some outliers (extreme observations) are present and thereby increases hˆ_{opt} even more. To overcome these problems Silverman (1986) proposed the estimator

\hat{h}_{opt} = \frac{0.9\,\hat{\sigma}}{n^{1/5}} ,   (1.25)

where \hat{\sigma} = \min(s, R/1.34), with s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 and R the interquartile range of the data. The constant 1.34 derives from the fact that the interquartile range of a N(µ, σ²) random variable is approximately 1.34σ, i.e. P\{µ - 0.67\sigma < X < µ + 0.67\sigma\} = 0.5.
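Expression (1.25) corresponds to base R’s default bandwidth selector bw.nrd0(). A minimal sketch, again with simulated data:

    ## Silverman's rule (1.25) by hand, compared with base R's default.
    set.seed(1)
    x <- rexp(50, rate = 0.2)
    sigma.hat <- min(sd(x), IQR(x) / 1.34)
    h <- 0.9 * sigma.hat * length(x)^(-1/5)   # expression (1.25)
    c(h, bw.nrd0(x))                          # bw.nrd0 implements the same rule
    plot(density(x, bw = h), main = "")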

Figure 1.11: The cross-validation criterion (top left) and the estimated pdf using three different bandwidth selectors, namely cross-validation (bottom left, h_cv = 0.78), normal-based (top right, h_norm = 2.9) and plug-in (bottom right, h_sj = 1.41).

The expression (1.25) is used as the default option in the R function “density”. It is also used as a starting value in some more sophisticated iterative estimators of the optimal bandwidth. The top right graph in Figure 1.11 shows the density estimated with this method of selecting h_{opt}.

1.4.3 Cross–validation

The technique of cross–validation will be discussed in more detail in a later chapter. At this point we will only outline its application to the problem of estimating optimal bandwidths. By definition

\mathrm{MISE}(\hat{f}) = \int (\hat{f}(x) - f(x))^2\,dx
 = \int \hat{f}(x)^2\,dx - 2 \int \hat{f}(x) f(x)\,dx + \int f(x)^2\,dx

The third term does not depend on the sample or on the bandwidth. An approximately unbiased estimator of the first two terms is given by

\widehat{\mathrm{MCV}}(\hat{f}) = \int \hat{f}(x)^2\,dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{-i}(x_i) ,   (1.26)

where fˆ_{-i}(x) is the estimated density at the argument x using the original sample apart from observation x_i. One computes (1.26) for different values of h and estimates the optimal value, h_{opt}, by the h which minimizes it. The top left hand graph in Figure 1.11 shows this cross-validation criterion as a function of h for the sample of car data. The bottom left hand graph shows the corresponding estimated density. In this example cross-validation has selected a bandwidth that is too small and the resulting estimate fˆ has not been sufficiently smoothed.
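Least-squares cross-validation is available in base R as bw.ucv(), which minimizes a criterion of the form (1.26). A short sketch with simulated data:

    ## Cross-validated bandwidth via unbiased (least-squares) CV.
    set.seed(1)
    x <- rexp(50, rate = 0.2)
    h.cv <- bw.ucv(x)                 # minimizes the CV criterion over h
    h.cv
    plot(density(x, bw = h.cv), main = "")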

1.4.4 “Plug–in” estimator

The idea, developed by Sheather and Jones (1991), is to estimate h from (1.21) by applying a separate smoothing technique to estimate f''(x) and hence β(f). For details see, e.g., Wand and Jones (1995), section 3.6. An R function to carry out the computations is available in the R library “sm” of Bowman and Azzalini (1997). Figure 1.11 shows that for the car example considered here the plug–in approach yields the most sensible estimate of f.
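The Sheather–Jones bandwidth is also available in base R, independently of the “sm” library, as bw.SJ(), or directly through the bw argument of density(). A minimal sketch:

    ## Sheather-Jones plug-in bandwidth in base R.
    set.seed(1)
    x <- rexp(50, rate = 0.2)
    bw.SJ(x)                                 # plug-in bandwidth
    plot(density(x, bw = "SJ"), main = "")   # equivalent shortcut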

1.4.5 Summary and extensions

The above bandwidth selectors represent only a sample of the many suggestions that have been offered in the recent literature. Some alternatives are described in Wand and Jones (1995), in which the theory is given in more detail. These authors also offer recommendations regarding which estimators should be used; the plug–in estimator outlined above is one of their recommendations. The above methodology can be extended to bivariate (and multivariate) pdfs. It is also possible to provide approximate confidence bands for the estimated pdfs. These subjects will be covered in the presentations by the participants of the course.