Degrees of Freedom in Quadratic Goodness of Fit
Submitted to the Annals of Statistics

DEGREES OF FREEDOM IN QUADRATIC GOODNESS OF FIT

By Bruce G. Lindsay*, Marianthi Markatou† and Surajit Ray

Pennsylvania State University, Columbia University, Boston University

We study the effect of degrees of freedom on the level and power of quadratic distance based tests. The concept of an eigendepth index is introduced and discussed in the context of selecting the optimal degrees of freedom, where optimality refers to high power. We introduce the class of diffusion kernels by the properties we seek these kernels to have, and give a method for constructing them by exponentiating the rate matrix of a Markov chain. Product kernels and their spectral decomposition are discussed and shown to be useful for high dimensional data problems.

*Supported by NSF grant DMS-04-05637.
†Supported by NSF grant DMS-05-04957.

AMS 2000 subject classifications: Primary 62F99, 62F03; secondary 62H15, 62F05.
Keywords and phrases: Degrees of freedom, eigendepth, high dimensional goodness of fit, Markov diffusion kernels, quadratic distance, spectral decomposition in high dimensions.

1. Introduction. Lindsay et al. (2008) developed a general theory for goodness-of-fit testing based on quadratic distances. This class of tests is enormous, encompassing many of the tests found in the literature. It includes tests based on characteristic functions, density estimation, and the chi-squared tests, and it also provides quadratic approximations to many other tests, such as those based on likelihood ratios. The flexibility of the methodology is particularly important in enabling statisticians to readily construct tests for model fit in higher dimensions and for more complex data.

The paper by Lindsay et al. introduced, as a unifying concept, a formula for the spectral degrees of freedom (DOF) of the test statistic. It was based on a functional spectral decomposition of the quadratic kernel, but it could be calculated without knowing the decomposition. In essence, the limiting null distribution of the test statistic was shown to be approximately chi-squared, with DOF being its degrees of freedom.

One feature of building tests within the quadratic distance framework is that the DOF is often a continuously tuneable parameter, i.e., one to be selected by the user. In our examples here, which involve kernel density estimation, DOF is a decreasing function of the smoothing parameter h. This is a particularly valuable characteristic in higher dimensions, as one can then adjust the degrees of freedom to compensate for both dimension and sample size.

Lindsay et al. (2008) offered some possible guidelines for selecting degrees of freedom, based solely on the heuristic that DOF should be chosen as if one were carrying out a chi-squared test. Ray and Lindsay (2008) created a useful risk assessment tool based on tuneable quadratic distances and their associated degrees of freedom, but again provided little guidance on the selection of the tuning parameter.

Our first goal here is to show that the choice of DOF is indeed important for obtaining good power properties. Just as in a chi-squared test procedure, selecting DOF too large or too small can lead to very weak power against important alternatives. We demonstrate this through a careful simulation. As in a chi-squared test, too many degrees of freedom relative to the sample size create procedures with low power ("cells with small counts"), while too few degrees of freedom, especially in higher dimensional data, can fail to provide enough dimensions of discrimination.
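To make the tunability concrete, the sketch below estimates the spectral DOF without computing the eigendecomposition itself. It is a minimal Monte Carlo illustration, not the paper's procedure: it assumes the Satterthwaite-type trace form DOF = (tr K_c)^2 / tr(K_c^2) from Lindsay et al. (2008), with the traces taken as integrals under the model G = N(0,1); the function names are ours.

```python
import numpy as np

def normal_kernel(s, t, h):
    """Gaussian distance kernel K_h(s, t) with bandwidth h (one-dimensional)."""
    return np.exp(-(s - t) ** 2 / (2 * h ** 2)) / np.sqrt(2 * np.pi * h ** 2)

def spectral_dof(h, n_mc=2000, seed=0):
    """Monte Carlo estimate of DOF = (tr K_c)^2 / tr(K_c^2) under G = N(0,1).

    The kernel is centered with respect to G:
        K_c(s, t) = K(s, t) - E_G K(S, t) - E_G K(s, T) + E_G E_G K(S, T),
    with the expectations replaced by averages over draws from G.
    """
    x = np.random.default_rng(seed).standard_normal(n_mc)
    K = normal_kernel(x[:, None], x[None, :], h)
    row = K.mean(axis=1, keepdims=True)   # approximates E_G K(s, T)
    Kc = K - row - row.T + K.mean()       # centered kernel matrix
    tr1 = np.diag(Kc).mean()              # approximates the integral of K_c(s, s) dG(s)
    tr2 = (Kc ** 2).mean()                # approximates the double integral of K_c(s, t)^2 dG dG
    return tr1 ** 2 / tr2

# DOF is a decreasing function of the smoothing parameter h:
for h in (0.1, 0.3, 1.0):
    print(f"h = {h:3.1f}  ->  DOF ~ {spectral_dof(h):.1f}")
```

A small h resolves fine structure (many effective "cells"), while a large h pools the data into a few smooth directions, mirroring the chi-squared bin-count trade-off just described.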
In this paper, we will show that many weaknesses of the standard chi-squared methodology can be overcome by a careful choice of the kernel in the quadratic distance. As one example of this, we will examine goodness of fit in the multivariate normal (or mixture of normals) case, where the kernel of the distance is also taken to be a multivariate normal density. In this example, and others like it, the quadratic testing method requires no cell creation, the degrees of freedom are continuously adjustable, the calculations do not require numerical integration, and the power is global. By using this kernel we can always compute explicitly both the distance between the data and the hypothesized multivariate normal, or mixture of normals, model and the degrees of freedom.

Our second goal is to provide methodology that is useful for choosing DOF in such a way that one has good power across a range of alternatives in a high dimensional setting. For this we first need to develop a set of tools for the analysis of DOF, especially as it relates to the dimension of the data d. To do so, we will focus on an important class of distance kernels that we call diffusion kernels. These kernels are of special interest because they are tuneable and they allow easy computation of the distance in certain high dimensional models. They also permit explicit spectral decompositions and formulas for the corresponding degrees of freedom.

We will also focus on a special mechanism for constructing distance kernels in higher dimensions that we call product kernels. The use of such kernels enables one to construct distances for data vectors not only of arbitrary dimension, but also with a mixture of discrete and continuous coordinates; a numerical sketch of both constructions follows below.
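As a preview of these constructions, here is a minimal numerical sketch. It is our illustration, not the paper's definition: we take the Markov chain to be a nearest-neighbour random walk on m ordered states, exponentiate its rate matrix R to obtain the diffusion kernel K_t = exp(tR), and form a two-dimensional product kernel as a Kronecker product; the time parameter t here plays the role of the smoothing parameter.

```python
import numpy as np
from scipy.linalg import expm

# Rate matrix R of a nearest-neighbour random walk on {0, ..., m-1}
# with reflecting boundaries; rows sum to zero, as for any rate matrix.
m = 6
R = np.zeros((m, m))
for i in range(m):
    if i > 0:
        R[i, i - 1] = 1.0
    if i < m - 1:
        R[i, i + 1] = 1.0
    R[i, i] = -R[i].sum()

# Diffusion kernel: exponentiate the rate matrix. A larger t gives a
# smoother kernel and hence fewer degrees of freedom.
t = 0.5
K = expm(t * R)

# Explicit spectral decomposition: if R has eigenvalues mu_j <= 0, then
# K_t has eigenvalues exp(t * mu_j), so deeper eigenspaces are
# exponentially downweighted.
mu = np.linalg.eigvalsh(R)   # this R is symmetric
print("kernel eigenvalues:", np.round(np.sort(np.exp(t * mu))[::-1], 4))

# Product kernel for 2-d data on the m x m grid: the Kronecker product.
# Its eigenvalues are all pairwise products exp(t * (mu_i + mu_j)).
K2 = np.kron(K, K)
print("product kernel shape:", K2.shape)
```

The Kronecker structure is what makes the high dimensional analysis tractable: the eigenvalues and eigenvectors of the product kernel come directly from the one-dimensional factors, and discrete and continuous coordinates can be mixed by choosing a different factor kernel for each coordinate.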
Finally, we will introduce here a new tool for choosing DOF in higher dimensions. Based on spectral theory, we derive a simple but informative function of DOF and the data dimension d that we call the eigendepth index $\hat{k}$. It is valid for kernels with geometrically decaying eigenvalues, and it describes how the test statistic weights the different eigenspaces in higher dimensions. This index proved particularly useful for understanding how the results of our simulation study depended on the data dimension.

The paper is organized as follows. In Section 2 we develop some essential background on quadratic distance testing, especially as it relates to degrees of freedom. In Section 3 we introduce the kernels that will be of particular interest to us, called diffusion kernels, which we construct by exponentiating the rate matrix of a Markov chain so as to fit a variety of sample spaces; we consider these kernels to be the natural generalization of the normal kernel to other sample spaces. We then turn to the challenge of degrees of freedom in higher dimensions. In Section 4 we develop a description of the standard eigenanalysis of the diffusion kernel type. This in turn leads to the recognition that, in higher dimensions, the product kernel eigendecomposition has a beautiful structure that can be exploited to understand how the kernel weights deviations from the model, and to a proposed eigendepth index $\hat{k}$ that measures this effect. The eigendepth index is then reduced to a simple formula depending only on the dimension d and the DOF. Our final sections are devoted to a detailed study of testing for multivariate normality using a quadratic distance. After a brief review of the preceding literature in Section 5, we turn to a detailed simulation study in Section 6. There we show that power as a function of eigendepth is remarkably homogeneous across d = 2, 4, and 8, with the peak power typically occurring at $\hat{k} = 4$.

1.1. Quadratic distances. Lindsay et al. (2008) introduced a unified framework for the study of quadratic form distance measures that are used to assess model fit. We briefly review some fundamentals of this framework.

Let $\mathcal{X}$ be a sample space and let $du(s)$ be the canonical uniform measure on this space; this could be Lebesgue measure, counting measure, or spherical volume measure, depending on the application. The building block of a statistical distance is the function $K(s,t)$, a bounded, symmetric, nonnegative definite kernel defined on $\mathcal{X} \times \mathcal{X}$.

Definition 1. Given a nonnegative definite kernel function $K(s,t)$, possibly depending on a distribution G whose goodness of fit we wish to assess, the K-based quadratic distance between two probability measures F and G is defined as
$$d_K(F,G) = \int\!\!\int K_G(s,t)\, d(F-G)(s)\, d(F-G)(t).$$

An important example of a quadratic distance is Pearson's chi-squared. The kernel of the Pearson chi-squared distance is given by
$$K_G(r,s) = \sum_{i=1}^{m} \frac{I(r \in A_i)\, I(s \in A_i)}{G(A_i)},$$
where I is the indicator function and $A_1, A_2, \ldots, A_m$ is a partitioning of the sample space into m bins.

If $x_1, \ldots, x_n$ is a random sample with empirical distribution $\hat{F}$, then we have a natural construction of an empirical distance between the data and the model as $d(\hat{F}, G)$. This becomes the building block for goodness-of-fit procedures. To obtain the asymptotic theory, we modify the kernel $K_G$ by centering it to obtain $K_{c,G}$ (details in Section 3), in which case we can write
$$d_K(F,G) = \int\!\!\int K_{c,G}(s,t)\, dF(s)\, dF(t).$$
In this form it is clear that $d(\hat{F}, G)$ is a V-statistic. Suppose that $F_\tau$ is the true distribution. The U-statistic that unbiasedly estimates the distance $d(F_\tau, G)$ is given by
$$U_n = \frac{1}{n(n-1)} \sum_i \sum_{j \neq i} K_{c,G}(x_i, x_j).$$

The family of possible quadratic distances is enormous, requiring just the specification of the kernel $K(x,y)$. Our particular interest here will be in kernels of the diffusion type, or products thereof.

1.2. The $L_2$ representation. The spectral representation theory of Lindsay et al. (2008) shows that for a given symmetric kernel $K(s,t)$ there generally exists a symmetric "square root" kernel $K^{1/2}(s,t)$ satisfying the relationship
$$\int K^{1/2}(s,r)\, K^{1/2}(r,t)\, du(r) = K(s,t).$$
For example, if one uses as a distance kernel the normal $K_h(x,y) = (\sqrt{2\pi h^2})^{-1} \exp(-(x-y)^2/2h^2)$, then the square root kernel is a normal kernel with variance $h^2/2$.
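To make this machinery concrete, the sketch below computes the U-statistic $U_n$ for testing G = N(0,1) with the normal kernel $K_h$. This particular choice of G, the closed-form centering via the standard normal convolution identity $\int K_h(s,t)\, dG(t) = N(s;\, 0,\, 1+h^2)$, and the function names are our illustrative assumptions; the final lines check the square-root property numerically.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def centered_normal_kernel(s, t, h):
    """K_{c,G}(s, t) for the normal kernel K_h, centered at G = N(0,1).

    All centering terms are in closed form via the normal convolution
    identity, so no numerical integration is needed, as noted in the text.
    """
    k_st = norm.pdf(s - t, scale=h)
    k_sG = norm.pdf(s, scale=np.sqrt(1 + h ** 2))
    k_tG = norm.pdf(t, scale=np.sqrt(1 + h ** 2))
    k_GG = norm.pdf(0.0, scale=np.sqrt(2 + h ** 2))
    return k_st - k_sG - k_tG + k_GG

def u_statistic(x, h):
    """U_n = (1 / (n (n - 1))) * sum over i != j of K_{c,G}(x_i, x_j)."""
    n = len(x)
    K = centered_normal_kernel(x[:, None], x[None, :], h)
    np.fill_diagonal(K, 0.0)   # drop the i = j terms
    return K.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
h = 0.5
print("U_n under H0:", u_statistic(rng.standard_normal(200), h))        # near 0
print("U_n under H1:", u_statistic(rng.standard_normal(200) + 0.7, h))  # clearly positive

# Square-root property: the normal kernel with variance h^2 / 2,
# convolved with itself, reproduces K_h.
s, t = 0.3, -0.8
lhs, _ = quad(lambda r: norm.pdf(s - r, scale=h / np.sqrt(2))
                        * norm.pdf(r - t, scale=h / np.sqrt(2)), -np.inf, np.inf)
print(lhs, norm.pdf(s - t, scale=h))   # the two values agree
```

Since the centered kernel integrates to zero against G in each argument, $U_n$ has mean $d(F_\tau, G)$, which is zero under the null and positive under the shifted alternative, as the two printed values illustrate.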