Goodness of Fit: What Do We Really Want to Know? I
Total Page:16
File Type:pdf, Size:1020Kb
PHYSTAT2003, SLAC, Stanford, California, September 8-11, 2003 Goodness of Fit: What Do We Really Want to Know? I. Narsky California Institute of Technology, Pasadena, CA 91125, USA Definitions of the goodness-of-fit problem are discussed. A new method for estimation of the goodness of fit using distance to nearest neighbor is described. Performance of several goodness-of-fit methods is studied for time-dependent CP asymmetry measurements of sin(2β). 1. INTRODUCTION of the test is then defined as the probability of ac- cepting the null hypothesis given it is true, and the The goodness-of-fit problem has recently attracted power of the test, 1 − αII , is defined as the probabil- attention from the particle physics community. In ity of rejecting the null hypothesis given the alterna- modern particle experiments, one often performs an tive is true. Above, αI and αII denote Type I and unbinned likelihood fit to data. The experimenter Type II errors, respectively. An ideal hypothesis test then needs to estimate how accurately the fit function is uniformly most powerful (UMP) because it gives approximates the observed distribution. A number of the highest power among all possible tests at the fixed methods have been used to solve this problem in the confidence level. In most realistic problems, it is not past [1], and a number of methods have been recently possible to find a UMP test and one has to consider proposed [2, 3] in the physics literature. various tests with acceptable power functions. For binned data, one typically applies a χ2 statistic There is a long-standing controversy about the con- to estimate the fit quality. Without discussing advan- nection between hypothesis testing and the goodness- tages and flaws of this approach, I would like to stress of-fit problem. It can be argued [5] that there can be that the application of the χ2 statistic is limited. The no alternative hypothesis for the goodness-of-fit test. χ2 test is neither capable nor expected to detect fit In this approach, however, the experimenter does not inefficiencies for all possible problems. This is a pow- have any criteria for choosing one goodness-of-fit pro- erful and versatile tool but it should not be considered cedure over another. One can design a goodness-of- as the ultimate solution to every goodness-of-fit prob- fit test using first principles, advanced computational lem. methods, rich intuition or black magic. But the prac- There is no such popular method, an equivalent of titioner wants to know how well this method will per- the χ2 test, for unbinned data. The maximum like- form in specific situations. To evaluate this perfor- lihood value (MLV) test has been frequently used in mance, one needs to study the power of the proposed practice but it often fails to provide a reasonable an- method against a few specific alternatives. A certain, swer to the question at hand: how well are the data perhaps vague, notion of an alternative hypothesis modelled by a certain density [4]? It is only natural must be adopted for this exercise; hence, a certain, that goodness-of-fit tests for small data samples are perhaps vague, notion of the alternative hypothesis is harder to design and less versatile than those for large typically used to design a goodness-of-fit test. samples. For small data samples, asymptotic approx- Consider, for example, testing uniformity on an in- imations do not hold and the performance of every terval. The alternative is usually perceived as pres- goodness-of-fit test needs to be studied carefully on ence of peaks in the data. Suppose we design a pro- realistic examples. Thus, the hope for a versatile un- cedure that gives the highest goodness-of-fit value for binned goodness-of-fit procedure expressed by some equidistant experimental points. This test will per- people at the conference seems somewhat naive. form well for the chosen alternative. In reality, how- A more important practical question is how to de- ever, we may need to test exponentiality of the pro- sign a powerful goodness-of-fit test for each individual cess. For instance, we use a Geiger counter to measure problem. It is not possible to answer this question un- elapsed time between two consecutive events and plot less we specify in more narrow terms the problem that these time intervals next to each other on a straight we are trying to solve. line. In this case, equidistant data would imply that the process is not as random as we thought, and the designed goodness-of-fit procedure would fail to detect the inconsistency between the data and the model. Tests against highly structured data (e.g., equidistant 2. WHAT IS A GOODNESS-OF-FIT TEST? one-dimensional data) have been, in fact, a subject of statistical research on goodness-of-fit methods. A hypothesis test requires formulation of null and The question therefore is how to state the alterna- alternative hypotheses. The confidence level, 1 − αI , tive hypothesis in a way appropriate for each individ- 70 PHYSTAT2003, SLAC, Stanford, California, September 8-11, 2003 ual problem. I emphasize that I am not suggesting to 3. DISTANCE TO NEAREST NEIGHBOR use a directional test for one specific well-defined alter- TEST native. The goal is to design an omnibus goodness- of-fit test that discriminates against at least several The idea of using Euclidian distance between near- plausible alternatives. est observed experimental points as a goodness-of-fit The null hypothesis is defined as measure is not new. Clark and Evans [6] used an av- erage distance between nearest neighbors to test two- H0 : X ∼ f(xjθ0; η) ; (1) dimensional populations of various plants for unifor- mity. Later they extended this formalism to a higher where X is a multivariate random variable, and f is number of dimensions. Diggle [7] proposed to use the fit density with a vector of arguments x, vector an entire distribution of ordered distances to nearest of parameters θ and vector of nuisance parameters η. neighbors and apply Kolmogorov-Smirnov or Cramer- The alternative hypothesis is stated in the most gen- von Mises tests to evaluate consistency between ex- eral way as perimentally observed and expected densities. Rip- ley [8] introduced a function, K(t), which represents a number of points within distance t of an arbitrary H1 : X ∼ g(x) with g(x) =6 f(xjθ0; η) : (2) point of the process; he used the maximal deviation between expected and observed K(t) as a goodness- A specific subclass of this alternative hypothesis that of-fit measure. Bickel and Breiman [9] introduced a is sometimes of interest is expressed as goodness-of-fit test based on the distribution of the variable exp (−Nf(xi)V (xi)), where f(xi) is the ex- H1 : X ∼ f(xjθ; η) with θ =6 θ0 : (3) pected density at the observed point xi, V (xi) is the volume of a nearest neighbor sphere centered at xi In other words, most usually we would like to test and N is the total number of observed points. An ap- the fit function against different shapes (2). For the proach closely related to distance-to-nearest-neighbor test (3), we assume that the shape of the fit function tests is two-sample comparison based on counts of is correctly modelled and we only need to cross-check nearest neighbors that belong to the same sample [10]. the value of the parameter. These methods received substantial attention from the If a statistic S(x) is used to judge the fit quality, science community and have been applied to numer- the goodness-of-fit is given by ous practical problems, mostly in ecology and medi- cal research. Ref. [11] offers a survey of distance-to- nearest-neighbor methods. 1 − αI = fS(s)ds ; (4) The goodness-of-fit test [3] uses a bivariate distribu- f (s)>f (s ) Z S S 0 tion of the minimal and maximal distances to nearest neighbors. First, one transforms the fit function de- where s0 is the value of the statistic observed in the fined in an n-dimensional space of observables to a experiment, and fS(s) is the distribution of the statis- uniform density in an n-dimensional unit cube. Then tic under the null hypothesis. one finds smallest and largest clusters of nearest neigh- In practice, the vector of parameter estimates θ0 is bors whose linear size maximally deviates from the usually extracted from an unbinned maximum likeli- average cluster size predicted from uniformity. The hood (ML) fit to the data: θ0 = θ^(x). In this case, the cluster size is defined as an average distance from the goodness-of-fit statistic must be independent of, or at central point of the cluster to m nearest neighbors. most weakly correlated to, the ML estimator of the If the experimenter has no prior knowledge of the parameter: ρ(S(x); θ^(x)) ≈ 0, where ρ is the corre- optimal number m of nearest neighbors included in lation coefficient computed under the null hypothesis. the goodness-of-fit estimation, one can try all possi- If S(x) and θ^(x) are strongly correlated, the goodness- ble clusters 2 ≤ m ≤ N, where N is the total number of-fit test is redundant. of observed experimental points. The probability of The ML estimator itself is usually a powerful tool observing the smallest and largest clusters of this size for discrimination against the alternative (3). In this gives an estimate of the goodness of fit and the loca- case, the statistic S(x) =6 θ^(x) can be treated as an tions of the clusters can be used to point out potential problems with data modelling.