
Part II

Data Density and Structure


A canonical problem in statistics is to gain understanding of a given random sample, y1, y2, ..., yn, so as to understand the process that yielded the data. The specific objective is to make inferences about the population from which the random sample arose. In many cases we wish to make inferences only about some finite set of parameters, such as the mean and variance, that describe the population. In other cases we want to predict a future value of an observation. Sometimes the objective is more difficult: we want to estimate a function that characterizes the distribution of the population. The cumulative distribution function (CDF) or the probability density function (PDF) provides a complete description of the population, and so we may wish to estimate these functions.

In the simpler cases of statistical inference, we assume that the form of the CDF P is known and that there is a parameter, θ = Θ(P), of finite dimension that characterizes the distribution within that assumed family of forms. An objective in such cases may be to determine an estimate θ̂ of the parameter θ. The parameter may completely characterize the distribution of the population, or it may just determine an important property of the distribution, such as its mean or variance. If the distribution or density function is assumed known up to a vector of parameters, the complete description is provided by the parameter estimate. For example, if the distribution is assumed to be normal, the form of P is known. It involves two parameters, the mean µ and the variance σ², and the problem of completely describing the distribution is merely the problem of estimating θ = (µ, σ²). In this case, the estimates of the CDF, P̂, and of the density, p̂, are the normal CDF and density with the estimate of the parameter, θ̂, plugged in.

If no assumptions, or only weak assumptions, are made about the form of the distribution or density function, the estimation problem is much more difficult. Because the distribution function or density function is a characterization from which all other properties of the distribution could be determined, we expect the estimation of the function to be the most difficult type of statistical inference. "Most difficult" is clearly a heuristic concept; here it may mean that the estimator is most biased, most variable, most difficult to compute, most mathematically intractable, and so on.

Estimators such as θ̂ for the parameter θ or p̂ for the density p are usually random variables; hence, we are interested in the statistical properties of these estimators. If our approach to the problem treats θ and p as fixed (but unknown), then the distributions of θ̂ and p̂ can be used to make informative statements about θ and p. Alternatively, if θ and p are viewed as realizations of random variables, then the distributions of θ̂ and p̂ can be used to make informative statements about conditional distributions of the parameter and the function, given the observed data.

While the CDF is in some ways more fundamental in characterizing a probability distribution (it always exists and is defined the same way for both continuous and discrete distributions), the probability density function is more familiar to most data analysts. Important properties such as skewness, modes, and so on can be seen more readily from a plot of the probability density function than from a plot of the CDF. We are therefore usually more interested in estimating the density, p, than the CDF, P. Some methods of estimating the density, however, are based on estimates of the CDF. The simplest estimate of the CDF is the empirical cumulative distribution function, the ECDF, which is defined as

$$P_n(y) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty,\,y]}(y_i). \tag{7.6}$$

(See page 363 for the definition and properties of the indicator function I_S(·) used in the ECDF.) As we have seen on page 11, the ECDF is pointwise unbiased for the CDF.

The derivative of the ECDF, the empirical probability density function (EPDF),

$$p_n(y) = \frac{1}{n} \sum_{i=1}^{n} \delta(y - y_i), \tag{7.7}$$

where δ is the Dirac delta function, is just a series of spikes at the points corresponding to the observed values. It is not very useful as an estimator of the probability density. It is, however, unbiased for the probability density function at any point of continuity.
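As a concrete illustration, the ECDF of equation (7.6) is simple to compute. The following is a minimal sketch in Python; the sample, the evaluation points, and the function name are made up for illustration only.

```python
import numpy as np

def ecdf(sample, y):
    """Empirical CDF P_n(y) of equation (7.6): the proportion of
    observations y_i less than or equal to each evaluation point y."""
    sample = np.asarray(sample)
    y = np.atleast_1d(y)
    # Broadcast the comparison over all (observation, point) pairs,
    # then average over the observations.
    return np.mean(sample[:, np.newaxis] <= y, axis=0)

# A made-up sample; any observed data would do.
rng = np.random.default_rng(0)
y_obs = rng.standard_normal(100)

# P_n is pointwise unbiased for the true CDF; for a standard normal
# population these values should be near Phi(-1), Phi(0), Phi(1).
print(ecdf(y_obs, [-1.0, 0.0, 1.0]))
```

The EPDF of equation (7.7), by contrast, has no useful pointwise evaluation in code: it is zero except at the observed values, where it is a spike.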

In the absence of assumptions about the form of the density p, the estimation problem may be computationally intensive. A very large sample is usually required in order to get a reliable estimate of the density. How good the estimate is depends on the dimension of the random variable: heuristically, the higher the dimension, the larger the sample required to provide adequate representation of the sample space (see the sketch following the list below). Density estimation generally has more modest goals than the development of a mathematical expression that yields the probability density function p everywhere. Although we may develop such an expression, the purpose of the estimate is usually a more general understanding of the population:

• to identify structure in the population, its modality, tail behavior, skewness, and so on;

• to classify the data and to identify different subpopulations giving rise to it; or

• to make a visual presentation that represents the population density.
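The effect of dimension noted above can be made concrete with a small experiment: with a fixed sample size, the fraction of observations falling in a fixed-size neighborhood of a point collapses as the dimension grows, so neighborhoods that local density estimators rely on become empty. This is a minimal sketch; the sample size, radius, and uniform population are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000      # fixed sample size
radius = 0.25   # fixed-size neighborhood around the center point

for d in (1, 2, 5, 10):
    # n points uniform on the unit cube [0, 1]^d
    x = rng.uniform(size=(n, d))
    center = np.full(d, 0.5)
    # fraction of the sample within `radius` of the center
    frac = np.mean(np.linalg.norm(x - center, axis=1) <= radius)
    print(f"d = {d:2d}: fraction within radius = {frac:.4f}")
```

In one dimension about half the sample lies in the neighborhood; by ten dimensions essentially none of it does, which is why adequate representation of the sample space requires far larger samples as the dimension increases.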

There are several ways to approach the probability density estimation problem. In the parametric approach mentioned above, a parametric family of distributions, such as a normal distribution or a beta distribution, is assumed. The density is estimated by estimating the parameters of the distribution and substituting the estimates into the expression for the density. In a nonparametric approach, only very general assumptions are made about the distribution. These assumptions may address only the shape of the distribution, such as an assumption of unimodality, or an assumption of continuity or other degrees of smoothness of the density function. There are various semi-parametric approaches in which, for example, parametric assumptions may be made only over a subset of the range of the distribution, or, in a multivariate case, a parametric approach may be taken for some elements of the random vector and a nonparametric approach for others. Another approach is to assume a more general family of distributions, perhaps characterized by a differential equation, for example, and to fit the equation by equating properties of the sample, such as sample moments, with the corresponding properties of the equation.

In the case of parametric estimation, we have a complete estimate of the density; that is, an estimate at all points. In nonparametric estimation, we generally develop estimates of the ordinate of the density function at specific points. After the estimates are available at the given points, a smooth function can be fitted; the sketch below contrasts the two approaches.

In the next few chapters we will be concerned primarily with nonparametric estimation of probability densities and identification of structure in the data. In Chapter 11 we will consider building models that express asymmetric relationships between variables, and making inferences about those models.
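To make the contrast concrete, here is a minimal sketch of the two approaches, assuming a normal parametric family; the sample, grid, and bin count are illustrative choices, and the simple bin-frequency ordinates stand in for the nonparametric estimators developed in the following chapters.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.standard_normal(200)  # made-up sample

# Parametric: assume normality, estimate theta = (mu, sigma^2),
# and plug the estimates into the normal density formula.
# This yields a complete estimate, defined at every point.
mu_hat = np.mean(y)
sigma_hat = np.std(y, ddof=1)
grid = np.linspace(-3, 3, 61)
p_parametric = norm.pdf(grid, loc=mu_hat, scale=sigma_hat)

# Nonparametric: estimate the ordinate of the density only at
# specific points, here as bin relative frequency over bin width...
counts, edges = np.histogram(y, bins=15)
widths = np.diff(edges)
centers = (edges[:-1] + edges[1:]) / 2
p_at_points = counts / (len(y) * widths)

# ...and then fit a smooth function through the pointwise estimates
# (linear interpolation here; any smoother could be used).
p_smooth = np.interp(grid, centers, p_at_points)

print(p_parametric[:5])
print(p_smooth[:5])
```

The parametric estimate is defined everywhere once θ̂ is in hand, whereas the nonparametric ordinates exist only at the chosen points until a smooth function is fitted, matching the distinction drawn above.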