Evaluating Fit Using Hellinger Discrimination and Dirichlet Process Prior
PAPA NGOM†, RICHARD EMILION*
† LMA - Université Cheikh Anta Diop - Dakar - Sénégal
* MAPMO - Université d'Orléans - France

Abstract. We evaluate the lack of fit in a proposed family of distributions by estimating the posterior distribution of the Hellinger distance between the true distribution generating the data and the proposed family, when the prior is a Dirichlet process. We also prove some consistency results, and finally we illustrate the method by using Markov Chain Monte Carlo (MCMC) techniques to compute the posterior. Our work hinges on that of K. Viele (2000-b), which holds for the Kullback-Leibler information distance.

1. Introduction

Nonparametric Bayesian methods have been popular and successful in many estimation problems, but their relevance to hypothesis testing has become of interest only recently. In particular, testing goodness-of-fit in a Bayesian setting has received attention, for example in the works of Kass and Raftery (1995) and Gelman, Meng and Stern (1996). Earlier, Beran (1977) provided a parametric estimation procedure which is minimax robust in a small Hellinger-metric neighborhood. His method is based on the estimator $\hat{\theta}_n$ of $\theta$, defined as the value minimizing the Hellinger distance between $f_{\hat{\theta}_n}$, where $(f_\theta : \theta \in \Theta)$ is a specified parametric family, and $\hat{g}_n$, a kernel density estimator. This minimum Hellinger distance estimator is used to provide a goodness-of-fit statistic which assesses the adequacy of the parametric model.

Here, we pursue this idea from a Bayesian viewpoint. Suppose we have a proposed or null family of discrete distributions $D_\Theta$ for an observed set of data. We define the alternative family of distributions $\mathcal{P}$ as the set of all possible distributions on the nonnegative integers. Our main goal is to estimate (instead of test) how well the proposed family $D_\Theta$ approximates $P$, the actual process generating the data. We place a Dirichlet process prior over the alternative class and evaluate fit by finding the posterior of the quantity $\inf_{\theta \in \Theta} d(P, D_\theta)$, where $d$ is the Hellinger distance between $P$ and $D_\theta$. Thus, our method focuses on quantifying the lack of fit, rather than determining exactly the actual distribution that generates the data.

Viele (2001) proposed a method for evaluating fit based on embedded model spaces, as also described in Carota, Parmigiani and Polson (1996): suppose we are given a family $D_\Theta = (D_\theta)_{\theta \in \Theta}$ of models $D_\theta$ and we want to evaluate its fit to an observed set of data generated by an unknown distribution $P$. The proposed model is embedded into a larger family of models, assuming that $P$ belongs to this larger family while it need not be in $D_\Theta$. A prior, intended to represent uncertainty in the model specification, is put on this larger class. Then the posterior distribution of $d(P, D_\Theta)$, the distance between the true distribution and the proposed models, is computed; small distances indicate good fit and large distances indicate poor fit.

Viele (2000-a) used nonparametric Bayesian methods to compute a point estimate of $d(P, D_\Theta)$, further establishing consistency results when the prior is a Dirichlet process and $d$ is the Kullback-Leibler (1959) distance. In the present paper we study this method when using the Hellinger discrimination.

The remainder of this paper is organized as follows: in Section 2, we present the problem of fitness when choosing the Hellinger metric. Section 3 concerns the Bayes estimate when the prior is a Dirichlet process. Section 4 contains the main results about consistency of the posterior distribution. In order to illustrate the method, we present a simulation study in Section 5. We conclude in Section 6 with a discussion of the results of Section 5.

Key words and phrases. Probabilistic inference, Dirichlet processes, Gaussian mixtures, EM algorithm, Hellinger distance, divergence statistics, goodness-of-fit test.

2. Fitness using Hellinger metric

The following notations will be used throughout the paper. Let $\lambda$ be a probability measure on a measurable space $(\mathcal{X}, \mathcal{B})$, where the $\sigma$-field $\mathcal{B}$ is separable. Let $\mathcal{Q}$ be the set of all finite measures on $(\mathcal{X}, \mathcal{B})$ that are absolutely continuous with respect to $\lambda$. Note that it can be assumed more generally that $\lambda$ is a $\sigma$-finite measure.

In order to distinguish two probability distributions based on some observations, various distances between measures (metrics) have been introduced. Among the most popular ones are the relative entropy distance (Kullback-Leibler entropy) and the Hellinger distance [3]. The Hellinger metric $H(\cdot\,,\cdot)$ on $\mathcal{Q}$ is defined as:

(1) $H(Q_1, Q_2) = \int \left[ f_1(x)^{1/2} - f_2(x)^{1/2} \right]^2 d\lambda(x)$, where $f_i = \dfrac{dQ_i}{d\lambda}$.

When $f_1$ and $f_2$ are discrete probability distributions, their Hellinger distance is defined as:

$H(Q_1, Q_2) = \sum_x \left[ \sqrt{f_1(x)} - \sqrt{f_2(x)} \right]^2.$
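As a concrete illustration, the discrete form above is straightforward to compute. The following is a minimal Python sketch, assuming both distributions are given as probability vectors on a common finite support:

```python
import numpy as np

def hellinger(f1, f2):
    """Hellinger discrimination between two discrete distributions, in the
    paper's convention: H(Q1, Q2) = sum_x (sqrt(f1(x)) - sqrt(f2(x)))^2.
    f1, f2 are probability vectors on a common finite support."""
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    return np.sum((np.sqrt(f1) - np.sqrt(f2)) ** 2)

# Example on the support {0, 1, 2}:
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(hellinger(p, p))  # 0.0: identical distributions
print(hellinger(p, q))  # strictly positive, and never more than 2
```

Note that expanding the square gives $H(Q_1, Q_2) = 2 - 2\sum_x \sqrt{f_1(x) f_2(x)}$, so in this convention $H$ takes values in $[0, 2]$; this boundedness is used in the proof of Theorem 3.1 below.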
Now, let us define the Hellinger distance from $P$ to the family $D_\Theta$ as:

(2) $H(P, D_\Theta) = \inf_{\theta \in \Theta} H(P, D_\theta).$

A usual way of evaluating fit, i.e. whether $D_\Theta$ is a "sufficient" model, consists in testing the hypotheses

$H_0 : H(P, D_\Theta) \le a$ versus $H_1 : H(P, D_\Theta) > a.$

Goutis and Robert (1998) and Mengersen and Robert (1996) use this approach. In this paper, we rather use an estimation approach, providing a method for estimating how closely $D_\Theta$ approximates the true model $P$.

We consider the fit of a null family $D_\Theta$ of discrete distributions: it will be assumed that there exists a fixed countably infinite set $S$, which can be taken as $S = \{0, 1, 2, \ldots\}$ without loss of generality, such that $D_\theta(S) = 1$ for all $\theta$. We define the alternative family as the larger family of all possible distributions on the nonnegative integers.

Let $X_1, \ldots, X_n$ be $n$ i.i.d. observations from an unknown distribution $P_o$. Let $P^n$ (resp. $P^\infty$) be the $n$-fold (resp. infinite) product of $P_o$. We do not assume that the true distribution $P_o \in D_\Theta$. Our parameter is the distribution $P$, which resides in a certain space; we assume that all the distributions $D_\theta$ belong to this space. We place a nonparametric prior on $P$, and our final goal is to estimate $h_o = H(P_o, D_\Theta)$ using the posterior distribution of $h = H(P, D_\Theta)$, when $P$ is drawn at random from the alternative family according to the prior distribution. The value $h_o$ quantifies the divergence between the true distribution $P_o$ and the null family of distributions. A popular prior is the famous Dirichlet process, whose very rich properties are now recalled.

3. Evaluating Hellinger's infimum

The notion of Dirichlet process (DP) was introduced by Ferguson (1973) in a celebrated fundamental paper on a Bayesian approach to some nonparametric problems. Let $D_0$ be a probability measure on a Polish measurable space $(\mathcal{X}, \mathcal{B})$. A Dirichlet process on $(\mathcal{X}, \mathcal{B})$ with scaling parameter $\alpha$ and baseline distribution $D_0$, shortly $P \sim \mathrm{Dir}(\alpha D_0)$, is a random probability measure $P$ on $(\mathcal{X}, \mathcal{B})$ such that for any positive integer $m$ and any finite measurable partition $(B_1, B_2, \ldots, B_m)$ of $\mathcal{X}$, with $P_k = P(B_k)$, the joint distribution of the random vector $(P_1, P_2, \ldots, P_m)$ is $\mathrm{Dir}(\alpha D_0(B_1), \alpha D_0(B_2), \ldots, \alpha D_0(B_m))$, the standard finite-dimensional Dirichlet distribution.

Ferguson (1973) proved that this class of processes is closed under sampling, in the sense that if

$P \sim \mathrm{Dir}(\alpha D_0)$ and $X_1, \ldots, X_n \mid P \sim P,$

then the posterior distribution of $P$ is also a DP, with base probability measure $(\alpha D_0 + \sum_{i=1}^n \delta_{X_i})/(\alpha + n)$ and scaling parameter $\alpha + n$:

$P \mid X_1, \ldots, X_n \sim \mathrm{Dir}\Big(\alpha D_0 + \sum_{i=1}^n \delta_{X_i}\Big),$

where $\delta_{X_i}$ is the Dirac measure at $X_i$, i.e. $\delta_{X_i}(A) = 1$ if $X_i \in A$ and $0$ otherwise.
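On a countable support, this conjugacy can be exercised directly: partitioning $S$ into singletons (truncated at some level $K$ for computation), a draw of $(P(\{0\}), \ldots, P(\{K\}))$ under the posterior is a finite-dimensional Dirichlet draw. A minimal Python sketch follows; the truncation level $K$, the scaling parameter $\alpha$, and the Poisson base measure $D_0$ are illustrative choices, not taken from the paper:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)

# Illustrative choices: truncation level K, scaling parameter alpha,
# and a Poisson(2) base measure D0 renormalized on S = {0, ..., K}.
K, alpha = 30, 1.0
support = np.arange(K + 1)
D0 = poisson.pmf(support, mu=2.0)
D0 /= D0.sum()  # mass beyond K is assumed negligible

def posterior_dp_draw(data, alpha, D0, rng):
    """One draw of (P({0}), ..., P({K})) from the posterior DP
    Dir(alpha * D0 + sum_i delta_{X_i}), via the finite-dimensional
    Dirichlet distribution on the partition of S into singletons."""
    counts = np.bincount(data, minlength=len(D0))[:len(D0)]
    return rng.dirichlet(alpha * D0 + counts)

data = rng.poisson(2.0, size=100)                 # observations X_1, ..., X_n
P_draw = posterior_dp_draw(data, alpha, D0, rng)  # one posterior sample of P
```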
For computing the posterior distribution of $h$, we will use the following theorem, which is adapted from a theorem in Berk (1966):

Theorem 3.1. (Berk 1966) Suppose that $D_\Theta$ is a family of distributions and $P$ is another distribution, which may or may not be an element of $D_\Theta$, and let $X_1, X_2, \ldots \sim P$. If there exist a prior $\Pi(\theta)$ on $\Theta$ and a point $\theta_p$ such that

(i) the posterior distribution $\Pi_n = \Pi(\theta \mid X_1, \ldots, X_n)$ is asymptotically carried on $\theta_p$, in the sense that if $U$ is an open set containing $\theta_p$, then $\lim_{n \to +\infty} \Pi_n(U) = 1$ almost surely $[P^\infty]$;

(ii) the assumptions from Berk (1966) hold;

then $\inf_{\theta \in \Theta} H(P, D_\theta) = H(P, D_{\theta_p})$.

For the proof of the first part (i) of the theorem, see Berk (1966). Let us prove the remaining assertion.

Proof. For $\varepsilon > 0$, define $A_\varepsilon = \{P : H(P, D_\theta) \le \varepsilon\}$. Let $X_1, X_2, \ldots, X_n$ be denoted shortly by $X^n$. Define the Bayes estimate $D_{\theta_p}$ to be the probability measure such that:

$D_{\theta_p}(A) = \int D_\theta(A) \, \Pi(dP \mid X_1, X_2, \ldots, X_n) = E[D_\theta(A) \mid X^n].$

Since the function $\Phi(\cdot) = H(P, \cdot)$ is convex, Jensen's inequality yields:

$\Phi(D_{\theta_p}(\cdot)) = \Phi\Big( \int D_\theta(\cdot) \, \Pi(dP \mid X^n) \Big) \le \int \Phi\big(D_\theta(\cdot)\big) \, \Pi(dP \mid X^n),$

so that:

(3) $H(P, D_{\theta_p}) \le \int H(P, D_\theta) \, \Pi(dP \mid X^n) = \int_{A_\varepsilon} H(P, D_\theta) \, \Pi(dP \mid X^n) + \int_{A_\varepsilon^c} H(P, D_\theta) \, \Pi(dP \mid X^n).$

The first term on the right-hand side of (3) is at most $\varepsilon$ by definition of $A_\varepsilon$, and the second term goes to $0$ a.s. $[P^\infty]$ by Theorem 3.1 and the fact that the Hellinger discrimination is bounded. The result follows since $\varepsilon$ is arbitrary. □

4. Main results

In this section we give conditions that guarantee that the posterior distribution of the Hellinger distance $h = H(P, D_\Theta)$ between $P$ and $D_\Theta$ tends to a point mass at $h_o = H(P_o, D_\Theta)$.
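To preview the simulation approach in miniature: draws of $P$ from the posterior Dirichlet process, each followed by a minimization over $\theta$, yield Monte Carlo draws from the posterior of $h = H(P, D_\Theta)$. The sketch below continues the truncated setup above; the null family $D_\theta = \mathrm{Poisson}(\theta)$, the grid search over $\theta$, and the Poisson-mixture data are illustrative stand-ins, not the paper's actual simulation design:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)

# Truncated support S = {0, ..., K}; alpha and the Poisson(2) base
# measure D0 are illustrative choices, as before.
K, alpha = 30, 1.0
support = np.arange(K + 1)
D0 = poisson.pmf(support, mu=2.0)
D0 /= D0.sum()

def hellinger(f1, f2):
    # Paper's convention: sum of squared differences of square roots.
    return np.sum((np.sqrt(f1) - np.sqrt(f2)) ** 2)

def h_inf(P, theta_grid):
    # h = inf_theta H(P, D_theta), approximated by a grid search over the
    # hypothetical null family D_theta = Poisson(theta), truncated to S.
    dists = []
    for theta in theta_grid:
        D = poisson.pmf(support, mu=theta)
        dists.append(hellinger(P, D / D.sum()))
    return min(dists)

# Data from outside the null family: a two-component Poisson mixture,
# so the posterior of h should concentrate away from 0.
n = 200
data = np.where(rng.random(n) < 0.5, rng.poisson(1.0, n), rng.poisson(6.0, n))
counts = np.bincount(np.clip(data, 0, K), minlength=K + 1)

# Monte Carlo draws from the posterior of h = H(P, D_Theta).
theta_grid = np.linspace(0.1, 10.0, 100)
h_draws = [h_inf(rng.dirichlet(alpha * D0 + counts), theta_grid)
           for _ in range(500)]
print(np.mean(h_draws), np.quantile(h_draws, [0.05, 0.95]))
```

A posterior mean of the draws plays the role of the point estimate of $h_o$, and the quantiles summarize posterior uncertainty about the lack of fit.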