GOODNESS-OF-FIT TESTS FOR DIRICHLET DISTRIBUTIONS WITH APPLICATIONS

Yi Li

A Dissertation

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

August 2015

Committee:

Maria Rizzo, Advisor

Pamela Bechtel, Graduate Faculty Representative

James Albert

Wei Ning

Copyright © August 2015 Yi Li
All rights reserved

ABSTRACT

Maria Rizzo, Advisor

We present five new goodness-of-fit tests for Dirichlet distributions. Beta, Dirichlet, and generalized Dirichlet distributions are widely used in areas such as Dirichlet regression, data mining, and machine learning, so goodness-of-fit tests for the beta and Dirichlet distributions have important applications. We develop new test procedures for testing the beta and Dirichlet distributions. The first test, called the energy test, is based on energy statistics. The second test, called the distance covariance test, is based on the complete neutrality property of the Dirichlet distribution and distance covariance tests. We also extend the distance covariance test to testing the mutual independence of all components of an arbitrary random vector. The third, fourth, and fifth tests, called the triangle tests, are based on interpoint distances. Simulation studies of the power of the new beta and Dirichlet goodness-of-fit tests are presented. Results show that the distance covariance goodness-of-fit test performs well against contaminated Dirichlet distributions, which contain a small perturbation of the Dirichlet distribution. Maximum likelihood parameter estimation is derived for the generalized Dirichlet distribution, and an initial value for the iteration of the Newton-Raphson algorithm is also proposed. Applications include goodness-of-fit tests of the Dirichlet and generalized Dirichlet distributions, model evaluation of Dirichlet regression models, and influence diagnostics of Dirichlet regression models.

ACKNOWLEDGMENTS

I would like to express my special gratitude to my advisor, Dr. Rizzo, for her excellent guidance and patience. I will always remember my discussions with Dr. Rizzo about the new results and algorithms of this dissertation. I want to thank my committee members, Dr. James Albert and Dr. Wei Ning, whose work was so important for this dissertation. I would also like to thank my family, who were always supporting me and encouraging me with their best wishes. Finally, I thank Bowling Green State University, which provided me the chance to study here.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION

CHAPTER 2 LITERATURE REVIEW
2.1 EDF Tests for Univariate Distribution
2.2 Chi-squared Test and Likelihood Ratio Tests
2.3 Two-Dimensional Kolmogorov-Smirnov Goodness-of-Fit Tests
2.4 Energy Tests
2.5 Triangle Tests

CHAPTER 3 BETA AND DIRICHLET DISTRIBUTION
3.1 Beta Distribution
3.2 Stochastic Representation of Beta Distribution
3.3 Dirichlet Distribution
3.4 Stochastic Representation of Dirichlet Distribution
3.5 Estimation for the Beta and Dirichlet Parameter Vector
3.5.1 MLE Based on the Newton-Raphson Algorithm
3.5.2 Method of Moments Estimates

CHAPTER 4 ENERGY AND TRIANGLE TESTS OF BETA AND DIRICHLET DISTRIBUTION
4.1 Univariate Energy Goodness-of-Fit Statistics
4.2 Energy Goodness-of-Fit Test for Uniform Distribution
4.3 Data Transformation and Standardization in the Dirichlet Distribution
4.4 Energy Test of Beta or Dirichlet Distribution
4.5 Triangle Tests of Beta or Dirichlet Distribution

CHAPTER 5 DISTANCE CORRELATION TEST OF DIRICHLET DISTRIBUTIONS
5.1 Distance Covariance and Distance Correlation
5.2 Characterization of Beta and Dirichlet Distributions
5.2.1 Complete Neutrality
5.3 Testing of Mutual Independence of All Components in a Random Vector
5.4 Distance Covariance Goodness-of-Fit Tests

CHAPTER 6 SIMULATION STUDY
6.1 Simulation Study
6.1.1 Energy and Triangle Tests
6.1.2 Distance Covariance Test
6.2 Contaminated Dirichlet Distribution
6.3 Logistic Normal Distribution

CHAPTER 7 MAIN RESULTS FOR GENERALIZED DIRICHLET DISTRIBUTIONS
7.1 Generalized Dirichlet Distribution
7.2 Estimation for the Generalized Dirichlet Distribution Parameter Vector
7.2.1 MLE Based on the Newton-Raphson Algorithm
7.2.2 Direct Proof of Global Convexity of Log Likelihood Function of the Generalized Dirichlet
7.2.3 The Initial Values of Iteration
7.2.4 Iteration Convergence Criteria
7.3 Generalized Dirichlet Distribution Model vs. Dirichlet Distribution Model
7.3.1 The Likelihood Ratio Test
7.3.2 The Wald Test
7.3.3 Rao's Score Test
7.3.4 Bootstrap Hypothesis Testing

CHAPTER 8 APPLICATIONS
8.1 Applications of the Goodness-of-Fit Test of Dirichlet and Generalized Dirichlet Distribution
8.2 Application to Goodness-of-Fit Testing of Dirichlet Regression Models
8.3 Application to the Influence Diagnostics of the Dirichlet Regression Models

CHAPTER 9 SUMMARY AND FURTHER DIRECTIONS

BIBLIOGRAPHY

LIST OF FIGURES

2.1 Regions defining h_k, k = 1, 2, 3

5.1 Family-wise significance level based on Bonferroni correction

6.1 Contour plot of density function of variables Y_1 and Y_2
6.2 Contour plot of density function of Dir(1, 3/4, 5/4)
6.3 Empirical power of five tests of Dirichlet against a contaminated Dirichlet alternative in Example 6.1
6.4 Contour plot of density function of variables Y_1, Y_2, and Y_3 in (5.9)
6.5 Contour plot of density function of Dir(1, 1, 2)
6.6 Contour plot of density function of Dir(1, 0.7495, 1.2505)
6.7 Empirical power of five tests of Dirichlet against a contaminated Dirichlet alternative in Example 6.2
6.8 Empirical power of five tests of Dirichlet against a contaminated Dirichlet alternative in Example 6.3
6.9 Empirical power of five tests of Dirichlet against logistic normal alternative in Example 6.4

8.1 Arctic lake dataset and corresponding scatterplots
8.2 Observed and fitted compositions in the Arctic lake sediments using Dirichlet quadratic regression model
8.3 Residual plots of three compositions in the Arctic lake sediments using Dirichlet quadratic model
8.4 Scatterplot of T_1 and T_2 from the transformed Arctic lake sediments using Dirichlet quadratic model
8.5 Scatterplot of V_1 and V_2 from the transformed Arctic lake sediments using Dirichlet quadratic model

LIST OF TABLES

6.1 Empirical Power of Five Tests of Dirichlet Against a Contaminated Dirichlet Alternative in Example 6.1
6.2 Empirical Power of Five Tests of Dirichlet Against a Contaminated Dirichlet Alternative in Example 6.2
6.3 Empirical Power of Five Tests of Dirichlet Against a Contaminated Dirichlet Alternative in Example 6.3
6.4 Empirical Power of Five Tests of Dirichlet Against Logistic Normal Alternative in Example 6.4

CHAPTER 1 INTRODUCTION

Beta, Dirichlet, and generalized Dirichlet distributions are widely used in statistics, data mining, and machine learning. The Dirichlet distribution is the multivariate beta distribution. The generalized Dirichlet distribution has a more flexible variance-covariance structure than the Dirichlet distribution, which makes it more practical in data analysis. The beta distribution is also frequently used as the prior distribution in Bayesian analysis; it is a good model for the prior distribution of a proportion. A Dirichlet or generalized Dirichlet distribution can be fitted as the prior in Bayesian analysis. The Dirichlet and generalized Dirichlet distributions can also be applied to compositional data. A Dirichlet regression model regresses compositional data on the corresponding covariates. Before we estimate the parameters of the beta, Dirichlet, or generalized Dirichlet distributions, we need to test whether the data came from a population that has a beta, Dirichlet, or generalized Dirichlet distribution. In Dirichlet regression models, we also need goodness-of-fit tests of the Dirichlet or generalized Dirichlet distributions. The null hypothesis of interest for a goodness-of-fit test is that a given random variable or vector X follows a specified beta or Dirichlet distribution F(x). The parameters of the hypothesized distribution may be known or unknown.

If X_1, ..., X_n is a random sample with common Dirichlet distribution function F, then we would like to test F = F_0 against F ≠ F_0, where F_0 is a specified distribution with known or unknown parameters. No goodness-of-fit tests or standardized goodness-of-fit tests have been developed in the literature for the general case of the Dirichlet distribution or the generalized Dirichlet distribution. We propose five goodness-of-fit tests for the Dirichlet distribution. The first test, called the energy test, is based on energy statistics. The second test, called the distance covariance test, is based on the complete neutrality property of the Dirichlet distribution and distance covariance tests. The third, fourth, and fifth tests, called the triangle tests, are based on interpoint distances. The power of the five proposed tests in simulation studies is presented. The simulation results show that the distance covariance goodness-of-fit tests have the best performance for the contaminated Dirichlet distribution. The distance covariance test also performs well against the logistic normal distribution. Mutual independence of a set of random variables or random vectors is a frequently used assumption in statistics. Therefore, testing mutual independence of a set of variables has very important applications. We propose a new method based on distance covariance and permutation to test the mutual independence of all components of a random vector in arbitrary dimensions. The energy and triangle goodness-of-fit tests can also be used for the generalized Dirichlet distribution, which seems more widely used in data mining and machine learning because of its more flexible variance-covariance structure. Following the principle of parsimony, to determine whether the data are better fitted by a Dirichlet or a generalized Dirichlet distribution, we apply the likelihood ratio test, Wald test, and Rao's score test to decide whether the generalized Dirichlet distribution is necessary.
Some derivations for these three tests are also provided in the dissertation. The limiting distribution of the likelihood ratio, Wald, or Rao's score test statistic is a chi-squared distribution, based on large sample theory. However, no general theory provides practical guidance about how large the sample size should be to apply the chi-squared approximation. Hence, we propose the parametric bootstrap likelihood ratio test, parametric bootstrap Wald test, and parametric bootstrap Rao's score test for small or medium sample sizes. For the generalized Dirichlet distribution, the derivation of the MLE method for parameter estimation, the initial values of the iteration, and the convergence criteria for the Newton-Raphson algorithm are proposed. The applications of the dissertation include goodness-of-fit tests of the Dirichlet and generalized Dirichlet distributions, model evaluation of Dirichlet regression analysis, and outlier and influential observation identification in the Dirichlet regression model. The generalized Dirichlet regression model is introduced. Copulas are widely used in finance, biology, and insurance. The goodness-of-fit test of copulas is still a challenging problem. Copula goodness-of-fit tests based on the distance covariance test of mutual independence of all components of a random vector in arbitrary dimensions and the Rosenblatt transformation are also introduced.

CHAPTER 2 LITERATURE REVIEW

A historical overview of several important goodness-of-fit tests of univariate and multivariate distributions follows.

2.1 EDF Tests for Univariate Distribution

The empirical distribution function is the cumulative distribution function of the sample. It is a piecewise constant function that increases by 1/n at each of the observations when the sample size is n. Suppose X_1, X_2, ..., X_n are independent and identically distributed (iid) real random variables with common cumulative distribution function F(x). Then the empirical distribution function is defined as

\[
\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le t\},
\]

where 1{·} is the indicator function. The indicator 1{X ≤ x} is a Bernoulli random variable with parameter p = F(x), and \hat{F}_n(x) is an unbiased estimator of F(x). By the central limit theorem, \sqrt{n}(\hat{F}_n(x) - F(x)) has an asymptotically normal distribution with mean 0 and variance F(x)(1 − F(x)). That is,

\[
\sqrt{n}\left(\hat{F}_n(x) - F(x)\right) \xrightarrow{d} \mathcal{N}\left(0,\, F(x)\left(1 - F(x)\right)\right).
\]

By the strong law of large numbers, the estimator \hat{F}_n(x) converges to F(x) almost surely; thus \hat{F}_n(x) is a consistent estimator of F(x). This establishes the pointwise convergence of the empirical distribution function to the true cdf. There is a stronger result, the Glivenko-Cantelli theorem, which states that the convergence holds uniformly over x. That is,

\[
\|\hat{F}_n - F\|_\infty \equiv \sup_{x \in \mathbb{R}} \left|\hat{F}_n(x) - F(x)\right| \xrightarrow{a.s.} 0. \tag{2.1}
\]
The sup norm in (2.1) is the Kolmogorov-Smirnov statistic for testing the goodness-of-fit between the empirical distribution \hat{F}_n(x) and the hypothesized null distribution F(x). If in (2.1) we replace the sup norm with the L^2 norm,

\[
\|\hat{F}_n - F\|_{L^2} = \left( \int \left|\hat{F}_n(x) - F(x)\right|^2 dx \right)^{1/2}.
\]

Then the corresponding statistic is called the Cramér-von Mises statistic. Other norms may reasonably be used here instead of the sup norm or L^2 norm. The statistic of the Kolmogorov-Smirnov test for a given null distribution function F(x) is

\[
D = \sup_x \left|\hat{F}_n(x) - F(x)\right|.
\]

If the observations come from the hypothesized null distribution F(x), the test statistic D converges to 0 almost surely as n → ∞. The critical value of the Kolmogorov-Smirnov test comes from the Kolmogorov distribution. When the scaled statistic K_n = \sqrt{n} D is greater than the critical value K_α, we reject the null hypothesis at the significance level α. In other words, we have strong evidence that the data do not come from the null distribution. The statistic of the Cramér-von Mises test is

\[
C_n = \frac{1}{12n} + \sum_{i=1}^{n} \left[\frac{2i-1}{2n} - F(x_{(i)})\right]^2,
\]
where x_{(i)} are the ordered observations.

If the test statistic C_n is greater than the critical value C_α, which can be obtained from the table of critical values for the Cramér-von Mises test, then we reject the null hypothesis at the level α. That is, we have enough evidence that the data do not come from the null distribution.
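Both statistics are straightforward to compute once the hypothesized cdf can be evaluated. The sketch below (the function names are ours, not from the dissertation) computes D and C_n for a sample against a fully specified null cdf, here Uniform(0,1):

```python
import numpy as np

def ks_statistic(x, F):
    """Kolmogorov-Smirnov statistic D = sup_x |F_n(x) - F(x)|.  For a step
    ecdf the supremum occurs at an order statistic, so it suffices to compare
    F at the sorted sample with i/n and (i-1)/n."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    u = F(x)
    i = np.arange(1, n + 1)
    return max(np.max(i / n - u), np.max(u - (i - 1) / n))

def cvm_statistic(x, F):
    """Cramer-von Mises statistic C_n = 1/(12n) + sum_i [(2i-1)/(2n) - F(x_(i))]^2."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    return 1.0 / (12 * n) + np.sum(((2 * i - 1) / (2 * n) - F(x)) ** 2)

rng = np.random.default_rng(1)
sample = rng.uniform(size=200)
D = ks_statistic(sample, lambda t: t)    # null: Uniform(0,1), F(t) = t
C = cvm_statistic(sample, lambda t: t)
```

For data consistent with the null, D shrinks at the \sqrt{n} rate and C_n stays near its null mean.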

2.2 Chi-squared Test and Likelihood Ratio Tests

Suppose the multinomial distribution has k cells with unknown probabilities p_1, ..., p_k. Let n_1, ..., n_k be a set of observed cell frequencies with n_1 + ··· + n_k = n. The null hypothesis is given by

\[
H_0: p_i = p_i^*, \quad 1 \le i \le k,
\]

where the p_i^* are a specified probability distribution on the k cells. The chi-squared goodness-of-fit statistic is defined as
\[
\chi^2 = \sum_{i=1}^{k} \frac{(n_i - e_i)^2}{e_i},
\]
where

\[
e_i = np_i^*, \quad 1 \le i \le k.
\]

The null hypothesis is rejected for large values of χ². The chi-squared goodness-of-fit test may also be applied to continuous distributions: the range of the observed data is partitioned into mutually exclusive bins, n_i is the number of observations that fall into the i-th bin, and p_i is the probability that an observation falls into the i-th bin under the null hypothesis. Based on the same null hypothesis, the likelihood ratio goodness-of-fit statistic is given by

\[
G = 2L = 2 \sum_{i=1}^{k} n_i \log\left(\frac{n_i}{np_i}\right).
\]

Fisher (1928) proved that

\[
L = \sum_{i=1}^{k} \left[ \frac{(n_i - np_i)^2}{2np_i} + \frac{(n_i - np_i)^3}{6(np_i)^2} + \cdots + \frac{(n_i - np_i)^{j+1}}{(j^2 + j)(np_i)^j} + \cdots \right].
\]

The likelihood ratio test statistic has an asymptotic \chi^2_{k-1} distribution. The first term of L is half of the χ² statistic, and therefore the likelihood ratio test statistic is asymptotically equivalent to the chi-squared statistic. In practice, we do not know the population parameter vector before we conduct the chi-squared test. We must first estimate the parameter vector of the null distribution, then choose bins based on the estimated parameters. The degrees of freedom of the chi-squared statistic are adjusted for the estimated parameters: here they would be k (the number of bins) minus one minus the dimension of the parameter vector. Both the chi-squared test and the likelihood ratio test have relatively low power because we lose much of the information in the actual data when we bin the data. Chi-squared tests are generally less powerful than EDF tests (D'Agostino and Stephens, 1986).
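For binned data, both statistics reduce to a few vectorized operations. A minimal sketch, with hypothetical observed counts and equal null cell probabilities:

```python
import numpy as np

def chisq_and_g2(counts, p0):
    """Pearson chi-squared and likelihood-ratio G statistics for observed
    cell counts against hypothesized cell probabilities p0."""
    counts = np.asarray(counts, dtype=float)
    e = counts.sum() * np.asarray(p0, dtype=float)   # expected counts n * p_i
    chi2 = np.sum((counts - e) ** 2 / e)
    nz = counts > 0                                  # empty cells contribute 0 to G
    g2 = 2.0 * np.sum(counts[nz] * np.log(counts[nz] / e[nz]))
    return chi2, g2

chi2, g2 = chisq_and_g2([18, 22, 30, 30], [0.25, 0.25, 0.25, 0.25])
```

The two values are close to each other, as the asymptotic equivalence above predicts; both would be referred to a \chi^2_3 distribution here.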

2.3 Two-Dimensional Kolmogorov-Smirnov Goodness-of-Fit Tests

Peacock (1982) proposed a two-dimensional Kolmogorov-Smirnov goodness-of-fit test. For a two-dimensional distribution and a given point (x, y), four quadrants of the plane are defined by (X < x, Y < y), (X < x, Y > y), (X > x, Y < y), and (X > x, Y > y). The four quadrants are equally valid regions for defining a cumulative probability distribution. The algorithm considers each definition and adopts the largest of the four differences between the empirical and theoretical cumulative distributions as the final statistic. This test computes in O(n³) time and its power is low. If we extend this idea to a test for k-dimensional data, the test computes in O(n^{k+1}) time, so the computational efficiency is low.
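A simplified sketch of the quadrant idea follows, evaluating the four quadrant discrepancies only at the n sample points rather than at all of Peacock's grid crossings (this cheaper variant is due to Fasano and Franceschini); the function name and the independent-uniform null are our choices:

```python
import numpy as np

def ks2d_statistic(xy, quadrant_probs):
    """Two-dimensional KS-type statistic in the spirit of Peacock (1982).

    At each sample point, compare the empirical and theoretical probabilities
    of the four quadrants and keep the largest discrepancy.
    quadrant_probs(x, y) must return the theoretical probabilities of
    (X<x, Y<y), (X<x, Y>y), (X>x, Y<y), (X>x, Y>y)."""
    xy = np.asarray(xy, dtype=float)
    D = 0.0
    for x, y in xy:
        lx, ly = xy[:, 0] < x, xy[:, 1] < y
        emp = np.array([np.mean(lx & ly), np.mean(lx & ~ly),
                        np.mean(~lx & ly), np.mean(~lx & ~ly)])
        D = max(D, np.max(np.abs(emp - np.asarray(quadrant_probs(x, y)))))
    return D

# Null: independent Uniform(0,1) coordinates.
unif_quadrants = lambda x, y: (x * y, x * (1 - y), (1 - x) * y, (1 - x) * (1 - y))
rng = np.random.default_rng(7)
D = ks2d_statistic(rng.uniform(size=(150, 2)), unif_quadrants)
```

Even this reduced version costs O(n²) per evaluation, which illustrates why the full k-dimensional extension becomes impractical.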

2.4 Energy Tests

Energy statistics are based on Newton's potential energy, which is determined by the relative location of two bodies in a gravitational field. The potential energy is a function of the distance between two objects: the greater the distance between the two objects, the greater the potential energy. The potential energy of an object with respect to itself is zero because the corresponding distance is zero. Energy statistics regard distributions or sample observations as the spatial objects. The energy distance between two distributions is zero if and only if the two distributions are identical, which is analogous to the fact that the potential energy of an object with respect to itself is zero. If the observations play a symmetric role, then it makes sense to suppose that energy statistics are symmetric functions of the distances between observations. Energy statistics are V-statistics or U-statistics based on the Euclidean distance. Let X_1, X_2, ..., X_n be a k-dimensional random sample from the distribution F. The kernel function h: R^k × R^k → R is symmetric, h(X, Y) = h(Y, X), and the energy statistics are defined as

\[
V_n = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} h(X_i, X_j),
\]

or
\[
U_n = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \ne i} h(X_i, X_j).
\]
In this dissertation, we consider the energy statistics as V-statistics. Consider the kernel

\[
h(x, y) = E|x - Y| + E|y - Y| - E|Y - Y'| - |x - y|,
\]

where |x − y| is the Euclidean distance between sample observations and Y' is an independent, identically distributed copy of the random variable Y from the distribution F. The kernel function h(x, y) is a function φ(z) of the distance z = |x − y|. Under the null hypothesis F = G, the expectation of h(x, Y) is zero, so the kernel h is degenerate.

If F_n denotes the empirical cumulative distribution function (cdf) of X, then the sample energy statistic ℰ introduced by Székely (2003) is

\[
\mathcal{E}_n^{(\alpha)}(F_n, G) = \frac{2}{n} \sum_{i=1}^{n} E|x_i - Y|^\alpha - E|Y - Y'|^\alpha - \frac{1}{n^2} \sum_{i,j=1}^{n} |x_i - x_j|^\alpha,
\]

where α is in (0, 2). The energy distance between two distributions F and G is defined by

\[
\mathcal{E}(F, G) = \mathcal{E}(X, Y) = 2E|X - Y| - E|X - X'| - E|Y - Y'|,
\]

where X is a random variable with the distribution F, and Y is a random variable with the distribution G. Then ℰ(F, G) is nonnegative and equals zero if and only if F = G, which matches the fact that the potential energy between two distinct objects is positive and the potential energy of an object with respect to itself is zero. The goodness-of-fit test statistic, based on energy statistics, for the null hypothesis that the data come from the distribution of Y is

\[
Q_n = n \left( \frac{2}{n} \sum_{i=1}^{n} E|x_i - Y| - E|Y - Y'| - \frac{1}{n^2} \sum_{i,j=1}^{n} |x_i - x_j| \right).
\]

Based on Euclidean distances between observations, the family of statistics and tests is indexed by an exponent α ∈ (0, 2) on the Euclidean distances:

\[
Q_n^{(\alpha)} = n \left( \frac{2}{n} \sum_{i=1}^{n} E|x_i - Y|^\alpha - E|Y - Y'|^\alpha - \frac{1}{n^2} \sum_{i,j=1}^{n} |x_i - x_j|^\alpha \right).
\]

According to the theory of V-statistics, under the null hypothesis, if E(h²(X, X')) is finite, the limiting distribution of Q_n is a quadratic form Q = \sum_k \lambda_k Z_k^2 of independent and identically distributed standard normal random variables Z_k, k = 1, 2, .... The nonnegative coefficients λ_k are the eigenvalues of the integral operator with kernel h(u_1, u_2), i.e.

\[
\int_{-\infty}^{+\infty} h(u_1, u_2) \psi(u_2) \, dF(u_2) = \lambda \psi(u_1),
\]

where ψ is the eigenfunction. We reject the null hypothesis for large values of Q_n. It follows that E(Q_n) = \sum_k \lambda_k and Var(Q_n) = 2 \sum_k \lambda_k^2. The energy tests are consistent against general alternatives, and they are rotation and translation invariant omnibus tests.
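As a concrete instance, for the null distribution G = Uniform(0,1) the expectations in Q_n have the closed forms E|x − Y| = x² − x + 1/2 and E|Y − Y′| = 1/3, so the statistic can be computed directly. A sketch (the sample sizes, seeds, and the Beta(5,5) alternative are illustrative choices of ours):

```python
import numpy as np

def energy_stat_uniform(x):
    """Energy goodness-of-fit statistic Q_n for H0: X ~ Uniform(0,1), using
    the closed forms E|x - Y| = x^2 - x + 1/2 and E|Y - Y'| = 1/3."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean_xy = np.mean(x ** 2 - x + 0.5)                 # (1/n) sum_i E|x_i - Y|
    mean_xx = np.mean(np.abs(x[:, None] - x[None, :]))  # (1/n^2) sum_ij |x_i - x_j|
    return n * (2.0 * mean_xy - 1.0 / 3.0 - mean_xx)

rng = np.random.default_rng(3)
q_null = energy_stat_uniform(rng.uniform(size=300))    # uniform data: small Q_n
q_alt = energy_stat_uniform(rng.beta(5, 5, size=300))  # concentrated data: large Q_n
```

Since the statistic is n times a (nonnegative) energy distance, it stays O(1) under the null and grows linearly in n under a fixed alternative, which is why the test is consistent.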

2.5 Triangle Tests

The triangle goodness-of-fit test is based on a geometric figure, the triangle, where the length of each side is measured by an interpoint distance. The triangle test was proposed by Bartoszynski and Lawrence (1997). The interpoint distance can be the Euclidean distance, the Manhattan distance, or another distance. The interpoint distance distribution of a specific distribution G is defined by G*(z) = P(δ(X, Y) ≤ z), where X and Y are independent variables from the distribution G and δ is any properly chosen distance function. Maa and Bartoszynski (1996) showed that two distributions, F and G, are identical if and only if the two corresponding interpoint distance variables δ(X_1, X_2) and δ(X_1, Y) have the identical interpoint distance distribution, where the random variables X_1 and X_2 follow the distribution F, the random variable Y follows the distribution G, and the three variables are mutually independent. Let X_1, X_2, ..., X_n be a k-dimensional random sample from the distribution F. We form a random triangle from a pair of randomly selected data points X_i and X_j and a variable Y from the hypothesized distribution G. According to the theorem of Maa and Bartoszynski (1996), the null hypothesis F = G holds if and only if the lengths of the three sides of this random triangle have the same interpoint distance distribution G*. Under the null hypothesis, the side joining the two sample points is equally likely to be the shortest, middle, or longest side of the random triangle. The triangle goodness-of-fit test is based on statistics that estimate the probability that the side joining the two sample points is the shortest, middle, or longest side of the triangle. Bartoszynski and Lawrence (1997) proposed the following three kernel functions:

\[
\begin{aligned}
h_1(x, y) = {} & P_Z\{\delta(x, y) < \min[\delta(x, Z), \delta(y, Z)]\} \\
& + \tfrac{1}{2} P_Z\{\delta(x, y) = \delta(x, Z) < \delta(y, Z)\} \\
& + \tfrac{1}{2} P_Z\{\delta(x, y) = \delta(y, Z) < \delta(x, Z)\} \\
& + \tfrac{1}{3} P_Z\{\delta(x, y) = \delta(x, Z) = \delta(y, Z)\},
\end{aligned}
\]
\[
\begin{aligned}
h_3(x, y) = {} & P_Z\{\delta(x, y) > \max[\delta(x, Z), \delta(y, Z)]\} \\
& + \tfrac{1}{2} P_Z\{\delta(x, y) = \delta(x, Z) > \delta(y, Z)\} \\
& + \tfrac{1}{2} P_Z\{\delta(x, y) = \delta(y, Z) > \delta(x, Z)\} \\
& + \tfrac{1}{3} P_Z\{\delta(x, y) = \delta(x, Z) = \delta(y, Z)\},
\end{aligned}
\]
\[
h_2(x, y) = 1 - h_1(x, y) - h_3(x, y),
\]
where x and y are the sample points from the distribution F, and Z is a random variable from the distribution G. For a continuous distribution, h_1, h_2, and h_3 denote the probabilities that the side

Figure 2.1: Regions defining h_k, k = 1, 2, 3

joining the sample points x and y is the smallest, middle, or largest in the triangle with two vertices x, y and a third vertex Z from the distribution G. For a two-dimensional distribution, the sample points x and y divide the two-dimensional data space into three regions 1, 2, and 3. The functions h_1, h_2, and h_3 based on the Euclidean distance denote the probabilities of Z falling into regions 1, 2, and 3 in Figure 2.1, where the probabilities on the boundaries are split evenly among the adjacent regions. Based on the above three kernel functions, Bartoszynski and Lawrence (1997) proposed the following three statistics on which the triangle goodness-of-fit test is based:

\[
U_k = \binom{n}{2}^{-1} \sum_{i < j} h_k(X_i, X_j), \quad k = 1, 2, 3.
\]

These three statistics U_1, U_2, and U_3 estimate the chance that the side joining two data points is the smallest, middle, or largest side of the triangle formed from two randomly selected data points and the hypothetical point; U_k estimates the chance of a point falling into region k in Figure 2.1. Thus, U_1 + U_2 + U_3 = 1. The covariances of the statistics are

\[
\mathrm{Cov}(U_r, U_s) = \frac{2(n-2)\theta_{rs}}{\binom{n}{2}} + \frac{\sigma_{rs}}{\binom{n}{2}}, \quad 1 \le r, s \le 3,
\]
where
\[
\theta_{rs} = \mathrm{cov}\left[h_r(X_1, X_2),\, h_s(X_1, X_3)\right]
\]
and
\[
\sigma_{rs} = \mathrm{cov}\left[h_r(X_1, X_2),\, h_s(X_1, X_2)\right].
\]

Under the null hypothesis, by symmetry, E(U_k) = 1/3 for k = 1, 2, 3. If h_1, h_2, and h_3 have finite variances, the asymptotic joint distribution of (U_r, U_s) is bivariate normal for r, s = 1, 2, 3 and r ≠ s under the null distribution. We can use any of the statistics U_1, U_2, or U_3 individually, or use a combination of these three statistics, to give the most power against a specific family of alternatives. For example, one test statistic based on U_1 and U_3 is given by

\[
Q = \frac{Z_1^2 + Z_3^2 - 2\rho Z_1 Z_3}{1 - \rho^2},
\]
where Z_i = (U_i − 1/3)/\sqrt{\mathrm{var}(U_i)}, i = 1, 3, and ρ = cov(Z_1, Z_3). Under the null hypothesis, Q has a chi-squared distribution with 2 degrees of freedom.
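The kernels h_k have closed forms for particular null distributions, but they can always be approximated by Monte Carlo: for each sample pair, draw points Z from G and estimate how often the pair's side is the shortest or longest. A simulation sketch for univariate data (the function name, the number of draws m, and the uniform null are our choices; the tie terms vanish for continuous G and are omitted):

```python
import numpy as np

def triangle_u_stats(x, sampler, m=200, seed=0):
    """Monte Carlo estimates of U1, U2, U3: for every sample pair (x_i, x_j),
    estimate the probability that the side joining the pair is the shortest
    (h1) or longest (h3) side of the triangle completed by a draw Z from the
    hypothesized distribution G."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    u1 = u3 = 0.0
    npairs = n * (n - 1) // 2
    for i in range(n):
        for j in range(i + 1, n):
            z = sampler(rng, m)
            d_xy = abs(x[i] - x[j])
            d_xz, d_yz = np.abs(x[i] - z), np.abs(x[j] - z)
            u1 += np.mean(d_xy < np.minimum(d_xz, d_yz))
            u3 += np.mean(d_xy > np.maximum(d_xz, d_yz))
    u1, u3 = u1 / npairs, u3 / npairs
    return np.array([u1, 1.0 - u1 - u3, u3])

rng = np.random.default_rng(5)
u = triangle_u_stats(rng.uniform(size=40), lambda r, m: r.uniform(size=m))
# under H0 each U_k should be near 1/3
```

Departures of the U_k from 1/3, standardized as above, feed directly into the Q statistic.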

CHAPTER 3 BETA AND DIRICHLET DISTRIBUTION

Beta and Dirichlet distributions are widely used in statistics. The beta distribution is frequently used as the prior probability distribution in Bayesian analysis. For example, the beta distribution is a good model for the prior distribution of a proportion.

3.1 Beta Distribution

Definition 3.1. A random variable X follows a beta distribution with shape parameters α, β > 0 if it has the pdf

\[
f(x \mid \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \quad 0 < x < 1,
\]

where B(α, β) = Γ(α)Γ(β)/Γ(α + β) denotes the beta function. A random variable X that is beta-distributed with shape parameters α, β is denoted by X ∼ Beta(α, β).

Definition 3.2. The incomplete beta function is defined by

\[
B(x \mid \alpha, \beta) = \int_0^x t^{\alpha-1}(1-t)^{\beta-1}\, dt.
\]

Definition 3.3. The regularized incomplete beta function is defined by

\[
I(x \mid \alpha, \beta) = \int_0^x \frac{t^{\alpha-1}(1-t)^{\beta-1}}{B(\alpha, \beta)}\, dt.
\]

Definition 3.4. The conditional regularized incomplete beta function given x_1, ..., x_{i−1} is defined by
\[
I(x_i \mid \alpha, \beta, x_1, \ldots, x_{i-1}) = \int_0^{\frac{x_i}{1 - x_1 - \cdots - x_{i-1}}} \frac{t^{\alpha-1}(1-t)^{\beta-1}}{B(\alpha, \beta)}\, dt.
\]
Definition 3.5. The hypergeometric function is defined for |x| < 1 by the power series

\[
{}_2F_1(a, b; c; x) = \sum_{n=0}^{\infty} \frac{(a)_n (b)_n}{(c)_n} \frac{x^n}{n!}.
\]

Definition 3.6. The Pochhammer symbol (a)_n is defined by

\[
(a)_n =
\begin{cases}
1 & \text{if } n = 0; \\
a(a+1)\cdots(a+n-1) & \text{if } n > 0.
\end{cases}
\]

The incomplete beta function B(x; α, β) is given by the hypergeometric function

\[
B(x; \alpha, \beta) = \frac{x^\alpha}{\alpha}\, {}_2F_1(\alpha, 1 - \beta; \alpha + 1; x).
\]

It is also given by the power series

\[
B(x; \alpha, \beta) = x^\alpha \sum_{n=0}^{\infty} \frac{(1-\beta)_n}{n!(\alpha+n)}\, x^n.
\]

For x = 1, the incomplete beta function coincides with the complete beta function. The beta function is symmetric; that is, B(α, β) = B(β, α).

When α and β are positive integers, it is related to factorials:

\[
B(\alpha, \beta) = \frac{(\alpha - 1)!\,(\beta - 1)!}{(\alpha + \beta - 1)!}.
\]

When α and β are positive real numbers, it is related to the gamma function:

\[
B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}.
\]
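The power series for B(x; α, β) and the gamma-function identity are easy to check numerically. A sketch that accumulates the series with a single running coefficient (1−β)_n x^n / n! and compares against SciPy's regularized incomplete beta, scaled back by B(α, β); the parameter values are arbitrary:

```python
from scipy.special import beta as beta_fn, betainc, gamma

def incomplete_beta_series(x, a, b, terms=100):
    """B(x; a, b) via the power series x^a * sum_n (1-b)_n x^n / (n! (a+n)),
    updating the coefficient c_n = (1-b)_n x^n / n! recursively."""
    total, c = 0.0, 1.0                    # c_0 = 1
    for n in range(terms):
        total += c / (a + n)
        c *= (1.0 - b + n) * x / (n + 1)   # c_{n+1} from c_n
    return x ** a * total

a, b, x = 2.5, 3.5, 0.4
series = incomplete_beta_series(x, a, b)
reference = betainc(a, b, x) * beta_fn(a, b)   # unscale SciPy's regularized form
```

The recursive coefficient update avoids computing large Pochhammer products and factorials separately, which would overflow for many terms.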

Definition 3.7. The regularized incomplete beta function is defined by

\[
I_x(a, b) = \frac{B(x; a, b)}{B(a, b)}.
\]

When a and b are positive integers,

\[
I_x(a, b) = \sum_{i=a}^{a+b-1} \binom{a+b-1}{i} x^i (1-x)^{a+b-1-i}.
\]

The cumulative distribution function of the beta distribution is

\[
F(x; \alpha, \beta) = \frac{B(x; \alpha, \beta)}{B(\alpha, \beta)} = I_x(\alpha, \beta),
\]

where B(x; α, β) is the incomplete beta function and I_x(α, β) is the regularized incomplete beta function.
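For integer shape parameters, the beta cdf can therefore be evaluated through the binomial-tail form of I_x(a, b) given above. A sketch comparing it with SciPy's betainc (the function name and parameter values are ours):

```python
from math import comb
from scipy.special import betainc

def beta_cdf_integer(x, a, b):
    """Beta(a, b) cdf for positive integer a, b via the binomial-tail form
    I_x(a, b) = sum_{i=a}^{a+b-1} C(a+b-1, i) x^i (1-x)^(a+b-1-i)."""
    n = a + b - 1
    return sum(comb(n, i) * x ** i * (1 - x) ** (n - i) for i in range(a, n + 1))

x, a, b = 0.3, 3, 5
lhs = beta_cdf_integer(x, a, b)
rhs = betainc(a, b, x)   # the regularized incomplete beta is the Beta cdf
```

In words: the probability that a Beta(a, b) variable is below x equals the probability that a Binomial(a + b − 1, x) variable is at least a.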

3.2 Stochastic Representation of Beta Distribution

If X is a random variable whose distribution is identical to the distribution of a function of other random variables Y_1, ..., Y_n, the stochastic representation of X is defined as

\[
X \stackrel{d}{=} g(Y_1, \ldots, Y_n).
\]

In practice, the random variables Y_1, ..., Y_n are mutually independent and identically distributed, and their distribution is familiar to us and easier to simulate. We can call the \stackrel{d}{=} operator the identical-distribution operator. If X = Y, then X \stackrel{d}{=} Y, but the converse may not be true. When the distribution of a random variable X or its properties are hard to derive or to simulate, the stochastic representation is a very powerful method.

Definition 3.8. A random variable X follows a gamma distribution with parameters α, β > 0 if it has the pdf

\[
f(x \mid \alpha, \beta) = \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^\alpha \Gamma(\alpha)}, \quad x > 0,\ \alpha, \beta > 0,
\]
where Γ(α) denotes the gamma function. A random variable X that is gamma-distributed with shape parameter α and scale parameter β is denoted by X ∼ Γ(α, β).

Let X_1 and X_2 be independent random variables, each having a gamma distribution with β = 1. The joint pdf of the two variables can be written as

\[
f(x_1, x_2) =
\begin{cases}
\prod_{i=1}^{2} \dfrac{1}{\Gamma(\alpha_i)}\, x_i^{\alpha_i - 1} e^{-x_i}, & 0 < x_i < \infty, \\
0, & \text{otherwise.}
\end{cases}
\]

The random variable Y is given by
\[
Y = \frac{X_1}{X_1 + X_2}.
\]

The random variable Y follows the beta distribution with shape parameters α_1 and α_2. The associated transformation maps the space

\[
\mathcal{A} = \{(x_1, x_2) : 0 < x_i < \infty,\ i = 1, 2\}
\]

onto the space
\[
\mathcal{B} = \{y : 0 < y < 1\}.
\]

In what follows, this stochastic representation of the beta distribution in terms of gamma variables is key.
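This representation gives an immediate way to simulate beta variables: normalize one gamma variable by the sum of two. A sketch that draws the gamma ratio and compares it with the Beta(α_1, α_2) law (seeds and parameter values are illustrative):

```python
import numpy as np
from scipy import stats

# If X1 ~ Gamma(a1, 1) and X2 ~ Gamma(a2, 1) are independent,
# then Y = X1 / (X1 + X2) ~ Beta(a1, a2).
rng = np.random.default_rng(11)
a1, a2, n = 2.0, 5.0, 5000
x1 = rng.gamma(a1, 1.0, size=n)
x2 = rng.gamma(a2, 1.0, size=n)
y = x1 / (x1 + x2)

# one-sample KS comparison of the simulated ratios with Beta(a1, a2)
ks = stats.kstest(y, stats.beta(a1, a2).cdf)
```

The small KS distance confirms that the ratio construction reproduces the beta law without ever evaluating the beta density.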

3.3 Dirichlet Distribution

The Dirichlet distribution can be thought of as the multivariate beta distribution. The Dirichlet distribution can be fitted as the prior for the multinomial distribution in Bayesian analysis: it is the conjugate prior for the probability vector from multinomial sampling.

Definition 3.9. A random vector X follows the Dirichlet distribution with parameters α_1, α_2, ..., α_{k+1} > 0 if it has the pdf
\[
f(x_1, x_2, \ldots, x_k) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_{k+1})}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_{k+1})}\, x_1^{\alpha_1 - 1} \cdots x_k^{\alpha_k - 1} (1 - x_1 - \cdots - x_k)^{\alpha_{k+1} - 1},
\]
where 0 < x_i < 1, i = 1, ..., k, and x_1 + x_2 + ··· + x_k < 1.

The Dirichlet distribution with parameter vector α = (α_1, α_2, ..., α_{k+1})^T and this probability density function is denoted by Dirichlet(α_1, α_2, ..., α_{k+1}) or Dir(α_1, α_2, ..., α_{k+1}). When k = 1, the Dirichlet distribution becomes a beta distribution. The marginal distributions of the Dirichlet distribution are each a beta distribution. Suppose X has a Dirichlet(α_1, α_2, ..., α_{k+1}) distribution and X_1 is any subvector of X; then X_1 has a Dirichlet distribution too. For example, for a given k-dimensional random vector (X_1, X_2, ..., X_k) from the Dirichlet(α_1, α_2, ..., α_{k+1}) distribution, the first component of the vector, X_1, follows the Beta(α_1, \sum_{i=2}^{k+1} α_i) distribution, and the third component of the vector, X_3, follows the Beta(α_3, \sum_{i=1}^{k+1} α_i − α_3) distribution. The joint distribution of the subvector (X_1, X_3) follows the Dirichlet(α_1, α_3, \sum_{i=1}^{k+1} α_i − α_1 − α_3) distribution.
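The beta marginals can be checked by simulation: the first coordinate of a Dirichlet(1, 3, 4) draw, for instance, should match Beta(1, 7) in mean and variance. A sketch (the parameter choice and seed are arbitrary):

```python
import numpy as np

# First coordinate of Dirichlet(a1, ..., a_{k+1}) is Beta(a1, a2 + ... + a_{k+1}).
rng = np.random.default_rng(2)
alpha = np.array([1.0, 3.0, 4.0])          # illustrative parameter choice
sample = rng.dirichlet(alpha, size=20000)

a1, rest = alpha[0], alpha[1:].sum()       # implied Beta(1, 7) marginal
mean_theory = a1 / (a1 + rest)
var_theory = a1 * rest / ((a1 + rest) ** 2 * (a1 + rest + 1))
mean_emp = sample[:, 0].mean()
var_emp = sample[:, 0].var()
```

The empirical mean and variance agree with the Beta(1, 7) values 1/8 and 7/576 up to Monte Carlo error.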

3.4 Stochastic Representation of Dirichlet Distribution

Let X_1, X_2, ..., X_{k+1} be independent random variables, each having a gamma distribution with β = 1. The joint pdf of these variables is written as

\[
f(x_1, x_2, \ldots, x_{k+1}) =
\begin{cases}
\prod_{i=1}^{k+1} \dfrac{1}{\Gamma(\alpha_i)}\, x_i^{\alpha_i - 1} e^{-x_i}, & 0 < x_i < \infty, \\
0, & \text{otherwise.}
\end{cases}
\]

Let

\[
Y_i = \frac{X_i}{X_1 + X_2 + \cdots + X_{k+1}}, \quad i = 1, 2, \ldots, k.
\]

The random variables Y_1, Y_2, ..., Y_k have the Dirichlet pdf

\[
f(y_1, \ldots, y_k) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_{k+1})}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_{k+1})}\, y_1^{\alpha_1 - 1} \cdots y_k^{\alpha_k - 1} (1 - y_1 - \cdots - y_k)^{\alpha_{k+1} - 1},
\]

where 0 < y_i, i = 1, ..., k, and y_1 + y_2 + ··· + y_k < 1. Therefore, the random variables Y_1, Y_2, ..., Y_k follow a Dirichlet distribution. The associated transformation maps the space

풜 = {(푥1, . . . , 푥푘+1) : 0 < 푥푖 < ∞, 푖 = 1, . . . , 푘 + 1} 17 onto the space

ℬ = {(푦1, . . . , 푦푘) : 0 < 푦푖, 푖 = 1, 2, . . . , 푘, 푦1 + ··· + 푦푘 < 1}.

The R gtools package by Warnes, Bolker, and Lumley (2014) provides the rdirichlet function for generating random samples from a specified Dirichlet distribution.
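The stochastic representation above gives a direct way to simulate Dirichlet samples. A minimal Python sketch (the function name is ours; the dissertation itself uses the R rdirichlet function mentioned above):

```python
import random

def rdirichlet(alpha, rng=random):
    """Draw one Dirichlet(alpha) vector via the gamma representation:
    X_i ~ Gamma(alpha_i, 1) independently, Y_i = X_i / sum(X)."""
    x = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(x)
    return [xi / s for xi in x]

random.seed(1)
y = rdirichlet([2.0, 3.0, 5.0])
# All k+1 components are positive and sum to 1;
# the first k components then satisfy the pdf's support constraints.
assert abs(sum(y) - 1.0) < 1e-12
assert all(0 < yi < 1 for yi in y)
```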

3.5 Estimation for the Beta and Dirichlet Parameter Vector

For the energy and triangle goodness-of-fit tests, we need to estimate the parameter vector. We have $n$ independent and identically distributed observations from the Beta$(\alpha, \beta)$ or Dir$(\alpha_1, \ldots, \alpha_{k+1})$ distribution. For $i = 1, \ldots, n$, let $x_i = (x_{i1}, \ldots, x_{i(k+1)})$ denote the $i$-th observation.

3.5.1 MLE Based on the Newton-Raphson Algorithm

The log-likelihood function of the parameter vector $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_{k+1})$ for the $n$ observations is

$$l(\alpha) = n\log\Gamma\Big(\sum_{i=1}^{k+1}\alpha_i\Big) - n\sum_{i=1}^{k+1}\log\Gamma(\alpha_i) + \sum_{j=1}^{k+1}(\alpha_j - 1)\sum_{i=1}^{n}\log x_{ij}.$$

The Newton-Raphson method is used in the following iteration to calculate the MLE of the parameter vector (Narayanan, 1990):

$$\begin{pmatrix} \hat\alpha_1 \\ \vdots \\ \hat\alpha_{k+1} \end{pmatrix}^{(i)} = \begin{pmatrix} \hat\alpha_1 \\ \vdots \\ \hat\alpha_{k+1} \end{pmatrix}^{(i-1)} + \begin{pmatrix} \mathrm{var}(\hat\alpha_1) & \cdots & \mathrm{cov}(\hat\alpha_1, \hat\alpha_{k+1}) \\ \vdots & \ddots & \vdots \\ \mathrm{cov}(\hat\alpha_{k+1}, \hat\alpha_1) & \cdots & \mathrm{var}(\hat\alpha_{k+1}) \end{pmatrix}^{(i-1)} \begin{pmatrix} g_1(\hat\alpha) \\ \vdots \\ g_{k+1}(\hat\alpha) \end{pmatrix}^{(i-1)},$$

where $\hat\alpha^{(0)} = (\hat\alpha_1^{(0)}, \ldots, \hat\alpha_{k+1}^{(0)})'$ are the initial estimates. The gradient vector is

$$g = \nabla l(\alpha) = \begin{pmatrix} n\,\dfrac{\partial \log\Gamma(\sum_{i=1}^{k+1}\alpha_i)}{\partial \alpha_1} - n\,\dfrac{\partial \log\Gamma(\alpha_1)}{\partial \alpha_1} + \sum_{i=1}^{n}\log x_{i1} \\ n\,\dfrac{\partial \log\Gamma(\sum_{i=1}^{k+1}\alpha_i)}{\partial \alpha_2} - n\,\dfrac{\partial \log\Gamma(\alpha_2)}{\partial \alpha_2} + \sum_{i=1}^{n}\log x_{i2} \\ \vdots \\ n\,\dfrac{\partial \log\Gamma(\sum_{i=1}^{k+1}\alpha_i)}{\partial \alpha_{k+1}} - n\,\dfrac{\partial \log\Gamma(\alpha_{k+1})}{\partial \alpha_{k+1}} + \sum_{i=1}^{n}\log x_{i(k+1)} \end{pmatrix}.$$

The Hessian matrix is

$$H = \nabla^2 l(\alpha) = n\begin{pmatrix} \dfrac{\partial^2 \log\Gamma(\sum_i\alpha_i)}{\partial \alpha_1^2} - \dfrac{\partial^2 \log\Gamma(\alpha_1)}{\partial \alpha_1^2} & \dfrac{\partial^2 \log\Gamma(\sum_i\alpha_i)}{\partial \alpha_1\,\partial \alpha_2} & \cdots & \dfrac{\partial^2 \log\Gamma(\sum_i\alpha_i)}{\partial \alpha_1\,\partial \alpha_{k+1}} \\ \dfrac{\partial^2 \log\Gamma(\sum_i\alpha_i)}{\partial \alpha_2\,\partial \alpha_1} & \dfrac{\partial^2 \log\Gamma(\sum_i\alpha_i)}{\partial \alpha_2^2} - \dfrac{\partial^2 \log\Gamma(\alpha_2)}{\partial \alpha_2^2} & \cdots & \vdots \\ \vdots & & \ddots & \vdots \\ \dfrac{\partial^2 \log\Gamma(\sum_i\alpha_i)}{\partial \alpha_{k+1}\,\partial \alpha_1} & \cdots & & \dfrac{\partial^2 \log\Gamma(\sum_i\alpha_i)}{\partial \alpha_{k+1}^2} - \dfrac{\partial^2 \log\Gamma(\alpha_{k+1})}{\partial \alpha_{k+1}^2} \end{pmatrix}.$$

The 푗-th component of the gradient vector is

$$g_j(\alpha) = n\,\Psi\Big(\sum_{m=1}^{k+1}\alpha_m\Big) - n\,\Psi(\alpha_j) + n\log G_j, \qquad j = 1, 2, \ldots, k+1,$$

where $\Psi$ is the digamma function and $G_j = \big(\prod_{i=1}^{n} x_{ij}\big)^{1/n}$ is the geometric mean of the $j$-th component of the sample. The variance-covariance matrix $V$, which is the inverse of the Fisher information matrix, is found as

$$V = D + \beta\, a a',$$

where $D = \mathrm{diag}\big(1/(n\Psi'(\alpha_1)), \ldots, 1/(n\Psi'(\alpha_{k+1}))\big)$, $a = \big[1/\Psi'(\alpha_1), \ldots, 1/\Psi'(\alpha_{k+1})\big]'$, and

$$\beta = n\,\Psi'\Big(\sum_{j=1}^{k+1}\alpha_j\Big)\Big\{1 - \Psi'\Big(\sum_{j=1}^{k+1}\alpha_j\Big)\sum_{j=1}^{k+1}\frac{1}{\Psi'(\alpha_j)}\Big\}^{-1},$$

where $\Psi'$ is the trigamma function. One criterion for convergence of this algorithm was developed by Choi and Wette (1967). The statistic $S$ for assessing convergence is given by

$$S = \big(g_1(\hat\alpha), \ldots, g_{k+1}(\hat\alpha)\big)\begin{pmatrix} \mathrm{var}(\hat\alpha_1) & \cdots & \mathrm{cov}(\hat\alpha_1, \hat\alpha_{k+1}) \\ \vdots & \ddots & \vdots \\ \mathrm{cov}(\hat\alpha_{k+1}, \hat\alpha_1) & \cdots & \mathrm{var}(\hat\alpha_{k+1}) \end{pmatrix}\begin{pmatrix} g_1(\hat\alpha) \\ \vdots \\ g_{k+1}(\hat\alpha) \end{pmatrix}.$$

The iteration continues until $S$ falls below the lower-tail quantile $\chi^2_{k+1}(c)$ for a fixed small $c$, or until the number of iterations exceeds a preset maximum. In the computer programs of this dissertation, we set $c = 0.0001$ and the maximum number of iterations to 100. When the initial estimates are calculated by the method of moments, some components of the parameter vector $\alpha$ can become negative during the iterations. In that case the initial estimates are all set to $\min\{x_{ij}\}$, which keeps the parameter estimates positive during the iteration (Ronning, 1986).

3.5.2 Method of Moments Estimates

The method of moments estimates of the parameter vector are given by

$$\hat\alpha_j = \frac{(x'_{11} - x'_{21})\,x'_{1j}}{x'_{21} - (x'_{11})^2}, \qquad j = 1, 2, \ldots, k+1,$$

where

$$x'_{1j} = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad j = 1, 2, \ldots, k+1, \qquad \text{and} \qquad x'_{21} = \frac{1}{n}\sum_{i=1}^{n} x_{i1}^2.$$

In the MLE algorithm, we need to provide initial values for the iteration; these can be the parameter estimates based on the method of moments.
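A small Python sketch of these moment estimates (the helper name is ours; sampling uses the gamma representation from Section 3.4):

```python
import random

def dirichlet_mom(data):
    """Method-of-moments estimates of the Dirichlet parameter vector.
    data: list of observations, each a list of k+1 components summing to 1."""
    n = len(data)
    k1 = len(data[0])
    m1 = [sum(row[j] for row in data) / n for j in range(k1)]   # x'_{1j}
    m2 = sum(row[0] ** 2 for row in data) / n                   # x'_{21}
    scale = (m1[0] - m2) / (m2 - m1[0] ** 2)                    # estimates sum(alpha)
    return [scale * m1[j] for j in range(k1)]

random.seed(2)
alpha = [2.0, 3.0, 5.0]
sample = []
for _ in range(20000):
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    sample.append([gi / s for gi in g])
est = dirichlet_mom(sample)
assert all(abs(e - a) < 0.5 for e, a in zip(est, alpha))
```

The common factor $(x'_{11} - x'_{21})/(x'_{21} - (x'_{11})^2)$ is a consistent estimator of $\sum_j \alpha_j$, so each $\hat\alpha_j$ is simply that total scaled by the $j$-th sample mean.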

CHAPTER 4 ENERGY AND TRIANGLE TESTS OF BETA AND DIRICHLET DISTRIBUTION

4.1 Univariate Energy Goodness-of-Fit Statistics

We want to test whether a given data set is from the Dirichlet family. The goodness-of-fit test statistic for the null hypothesis based on energy distance is

$$Q_n = n\left(\frac{2}{n}\sum_{i=1}^{n} E|x_i - Y| - E|Y - Y'| - \frac{1}{n^2}\sum_{i,j=1}^{n} |x_i - x_j|\right).$$

Based on Euclidean distances between observations, the family of statistics and tests is indexed by an exponent in (0, 2) on Euclidean distances, which we denote by 훼. The corresponding statistics are

$$Q_n^{(\alpha)} = n\left(\frac{2}{n}\sum_{i=1}^{n} E|x_i - Y|^{\alpha} - E|Y - Y'|^{\alpha} - \frac{1}{n^2}\sum_{i,j=1}^{n} |x_i - x_j|^{\alpha}\right), \qquad 0 < \alpha < 2.$$

Energy goodness-of-fit tests are derived for the beta distribution in this section. The third component of the energy statistic $Q_n$ is $\sum_{i,j=1}^{n} |x_i - x_j|$, which can be simplified into a linear combination of the ordered sample. The linear combination reduces the calculation of $\sum_{i,j=1}^{n} |x_i - x_j|$ from $n^2$ terms to $n$ terms, and therefore it reduces the computation time. Suppose $X_1, X_2, \ldots, X_n$ is a univariate random sample, and the corresponding ordered sample is $X_{(1)}, \ldots, X_{(n)}$. Then (Rizzo, 2002)

$$\sum_{i,j=1}^{n} |x_i - x_j| = 2\sum_{k=1}^{n} \big((2k-1) - n\big)\,X_{(k)}.$$

Proof.

$$\begin{aligned} \sum_{i,j=1}^{n} |x_i - x_j| &= 2\sum_{i<j} \big(x_{(j)} - x_{(i)}\big) \\ &= 2\big[(n-1)X_{(n)} + (n-2)X_{(n-1)} + (n-3)X_{(n-2)} + \cdots + (1)X_{(2)}\big] \\ &\quad - 2\big[(n-1)X_{(1)} + (n-2)X_{(2)} + (n-3)X_{(3)} + \cdots + (1)X_{(n)}\big] \\ &= 2\big[(1-n)X_{(1)} + (3-n)X_{(2)} + (5-n)X_{(3)} + \cdots + (n-1)X_{(n)}\big] \\ &= 2\sum_{k=1}^{n} \big((2k-1) - n\big)\,X_{(k)}. \end{aligned}$$
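The identity is easy to check numerically. A quick Python sketch comparing the brute-force double sum with the ordered-sample linear combination (function names are illustrative):

```python
def pairwise_abs_sum_brute(x):
    # O(n^2): direct double sum of |x_i - x_j| over all ordered pairs
    return sum(abs(a - b) for a in x for b in x)

def pairwise_abs_sum_sorted(x):
    # O(n log n): 2 * sum_k ((2k - 1) - n) * x_(k) over the ordered sample
    n = len(x)
    xs = sorted(x)
    return 2 * sum((2 * k - 1 - n) * xs[k - 1] for k in range(1, n + 1))

x = [5, 1, 4, 1, 8, 2, 7]
assert pairwise_abs_sum_brute(x) == pairwise_abs_sum_sorted(x)  # both give 144
```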

4.2 Energy Goodness-of-Fit Test for Uniform Distribution

For energy and triangle goodness-of-fit tests, we transform an arbitrary beta or Dirichlet distribution into the standard beta or Dirichlet distribution. We regard Beta(1,1) as the standard beta distribution, which is the same as the uniform distribution on (0, 1).

Proposition 4.1. If 푌 is distributed as Beta(1,1), then

$$E|x - Y| = x^2 - x + 1/2.$$

Proof.

$$E|x - Y| = \int_0^x (x - y)\,dy + \int_x^1 (y - x)\,dy = x^2 - x + 1/2.$$

Proposition 4.2. If $X$ and $Y$ are independent and each distributed as Beta(1,1), then

$$E|X - Y| = 1/3.$$

Proof.

$$\begin{aligned} E|X - Y| &= E\big(E(|X - Y|\mid X)\big) \\ &= E\left(\int_0^X (X - y)\,dy + \int_X^1 (y - X)\,dy\right) \\ &= E\big(X^2 - X + 1/2\big) = 1/3. \end{aligned}$$
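A quick deterministic check of Proposition 4.2, averaging $|x - y|$ over a fine grid that approximates two independent Uniform(0,1) variables:

```python
# Average |x - y| over an m x m grid of midpoints in (0,1);
# this Riemann sum converges to E|X - Y| = 1/3 for independent uniforms.
m = 200
pts = [(i + 0.5) / m for i in range(m)]
avg = sum(abs(x - y) for x in pts for y in pts) / (m * m)
assert abs(avg - 1/3) < 1e-3
```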

If 푋1,..., 푋푛 is a random sample, then the energy goodness-of-fit test statistic for testing

퐻0 : 푋 ∼ 퐵푒푡푎(1, 1) is given by

$$Q_n = n\left(\frac{2}{n}\sum_{i=1}^{n}\Big(X_i^2 - X_i + \frac{1}{2}\Big) - \frac{1}{3} - \frac{2}{n^2}\sum_{k=1}^{n} (2k - 1 - n)\,X_{(k)}\right).$$
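A sketch of this statistic in Python (the helper name is ours, not code from the dissertation); larger values indicate a worse fit to Beta(1,1):

```python
def energy_stat_uniform(x):
    """Energy goodness-of-fit statistic Q_n for H0: X ~ Beta(1,1)."""
    n = len(x)
    xs = sorted(x)
    term1 = (2 / n) * sum(xi * xi - xi + 0.5 for xi in x)   # 2/n * sum E|x_i - Y|
    term2 = 1 / 3                                           # E|Y - Y'| for iid Beta(1,1)
    term3 = (2 / n ** 2) * sum((2 * k - 1 - n) * xs[k - 1] for k in range(1, n + 1))
    return n * (term1 - term2 - term3)

near_uniform = [(i + 0.5) / 20 for i in range(20)]   # evenly spread over (0,1)
clumped = [0.9 + 0.001 * i for i in range(20)]       # concentrated near 0.9
assert energy_stat_uniform(near_uniform) < energy_stat_uniform(clumped)
assert energy_stat_uniform(clumped) > 0
```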

Since the cumulative distribution function of the beta distribution is the regularized incomplete beta function, when we apply the regularized incomplete beta function 퐼(푥) as the transformation

function, Beta(훼, 훽) can be transformed into the standard beta distribution. If 푋1,..., 푋푛 is a

random sample from the beta distribution, 푋(1),..., 푋(푛) is the ordered sample. Let 푍(1),..., 푍(푛)

be the transformed sample, where 푍(푖) = 퐼(푋(푖)), 푖 = 1, . . . , 푛. The goodness-of-fit test statistic

for testing the null hypothesis 퐻0 : 푋 ∼ 퐵푒푡푎(훼, 훽) is given by

$$Q_n = n\left(\frac{2}{n}\sum_{i=1}^{n}\Big(Z_{(i)}^2 - Z_{(i)} + \frac{1}{2}\Big) - \frac{1}{3} - \frac{2}{n^2}\sum_{i=1}^{n} (2i - 1 - n)\,Z_{(i)}\right).$$

To construct the goodness-of-fit test from the actual sample without any transformation, and to apply the family of statistics indexed by an exponent in (0, 2) on Euclidean distances, we need the following four propositions.

Proposition 4.3. If $X$ is distributed as Beta$(\alpha, \beta)$, where $\alpha > 0, \beta > 0$ are fixed, then

$$E|a - X| = \frac{a^{\alpha+1}}{B(\alpha,\beta)}\sum_{n=0}^{\infty}\frac{(1-\beta)_n\,a^n}{n!\,(\alpha+n)(\alpha+n+1)} + \frac{(1-a)^{\beta}}{B(\alpha,\beta)}\sum_{n=0}^{\infty}\frac{(1-a)^n\big[(-\alpha)_n - a(1-\alpha)_n\big]}{n!\,(\beta+n)},$$

where $(a)_n = a(a+1)\cdots(a+n-1)$ is the ascending factorial.

Proposition 4.4. If $X$ is distributed as Beta$(\alpha, \beta)$, where $\alpha > 0, \beta > 0$ are fixed, and $0 < k < 2$, then

$$E|a - X|^k = \sum_{n=0}^{\infty}\sum_{i=0}^{n}\frac{\binom{n}{i}(-k)_{n-i}}{B(\alpha,\beta)\,n!}\left[\frac{(1-\beta)_i\,a^{k+\alpha+i}}{\alpha+n} + \frac{(1-\alpha)_i\,(1-a)^{k+\beta+i}}{\beta+n}\right].$$

Proof.

$$\begin{aligned} E|a - X|^k &= \frac{1}{B(\alpha,\beta)}\int_0^1 |a - x|^k x^{\alpha-1}(1-x)^{\beta-1}\,dx \\ &= \frac{1}{B(\alpha,\beta)}\int_0^a (a - x)^k x^{\alpha-1}(1-x)^{\beta-1}\,dx + \frac{1}{B(\alpha,\beta)}\int_a^1 (x - a)^k x^{\alpha-1}(1-x)^{\beta-1}\,dx. \end{aligned}$$

Let $f(x) = (a - x)^k(1 - x)^{\beta-1}$, and replace $f(x)$ in the first integral by its Maclaurin series, so that

$$\begin{aligned} \frac{1}{B(\alpha,\beta)}\int_0^a (a - x)^k x^{\alpha-1}(1-x)^{\beta-1}\,dx &= \frac{1}{B(\alpha,\beta)}\sum_{n=0}^{\infty} f^{(n)}(0)\,\frac{a^{\alpha+n}}{(\alpha+n)\,n!} \\ &= \frac{1}{B(\alpha,\beta)}\sum_{n=0}^{\infty}\sum_{i=0}^{n}\binom{n}{i}(-k)_{n-i}(1-\beta)_i\,a^{k-n+i}\,\frac{a^{\alpha+n}}{(\alpha+n)\,n!} \\ &= \frac{1}{B(\alpha,\beta)}\sum_{n=0}^{\infty}\sum_{i=0}^{n}\frac{\binom{n}{i}(-k)_{n-i}(1-\beta)_i\,a^{k+\alpha+i}}{(\alpha+n)\,n!}. \end{aligned}$$

Let $g(x) = (1 - x - a)^k(1 - x)^{\alpha-1}$. For the second integral, substitute $u = 1 - x$ and replace $g$ by its Maclaurin series at $0$:

$$\begin{aligned} \frac{1}{B(\alpha,\beta)}\int_a^1 (x - a)^k x^{\alpha-1}(1-x)^{\beta-1}\,dx &= \frac{1}{B(\alpha,\beta)}\int_0^{1-a} u^{\beta-1}\sum_{n=0}^{\infty} g^{(n)}(0)\,\frac{u^n}{n!}\,du \\ &= \frac{1}{B(\alpha,\beta)}\sum_{n=0}^{\infty}\sum_{i=0}^{n}\binom{n}{i}(-k)_{n-i}(1-\alpha)_i\,(1-a)^{k-n+i}\,\frac{(1-a)^{n+\beta}}{(n+\beta)\,n!}, \end{aligned}$$

where $g^{(n)}(0)$ denotes the $n$-th derivative of $g$ at $0$. After simplification, we have

$$E|a - X|^k = \sum_{n=0}^{\infty}\sum_{i=0}^{n}\frac{\binom{n}{i}(-k)_{n-i}}{B(\alpha,\beta)\,n!}\left[\frac{(1-\beta)_i\,a^{k+\alpha+i}}{\alpha+n} + \frac{(1-\alpha)_i\,(1-a)^{k+\beta+i}}{\beta+n}\right].$$

Proposition 4.5. If $X$ and $Y$ are independent and identically distributed as Beta$(\alpha, \beta)$, where $\alpha > 0, \beta > 0$ are fixed, then

$$E|X - Y| = \frac{2}{B^2(\alpha,\beta)}\sum_{n=0}^{\infty}\frac{(1-\beta)_n\,B(2\alpha+n+1, \beta)}{(\alpha+n)(\alpha+n+1)\,n!},$$

where $(a)_n = a(a+1)\cdots(a+n-1)$ is the ascending factorial, and $B(\alpha,\beta)$ denotes the complete beta function.

Proposition 4.6. If $X$ and $Y$ are independent and identically distributed as Beta$(\alpha, \beta)$, where $\alpha, \beta$ are fixed, and $0 < k < 2$, then

$$E|X - Y|^k = \frac{2}{B^2(\alpha,\beta)}\sum_{n=0}^{\infty}\sum_{i=0}^{n}\frac{\binom{n}{i}(-k)_{n-i}(1-\beta)_i\,B(2\alpha+k+i, \beta)}{(\alpha+n)\,n!}.$$

Proof.

$$\begin{aligned} E|X - Y|^k &= \int_0^1\int_0^1 |x - y|^k\,\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\,\frac{y^{\alpha-1}(1-y)^{\beta-1}}{B(\alpha,\beta)}\,dx\,dy \\ &= \frac{2}{B^2(\alpha,\beta)}\int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\int_0^x (x - y)^k y^{\alpha-1}(1-y)^{\beta-1}\,dy\,dx. \end{aligned}$$

Let $f(y) = (x - y)^k(1 - y)^{\beta-1}$. We use the Maclaurin series of $f(y)$ at $y = 0$ to replace $f(y)$ in the inner integral, so that

$$\begin{aligned} E|X - Y|^k &= \frac{2}{B^2(\alpha,\beta)}\int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\sum_{n=0}^{\infty} f^{(n)}(0)\,\frac{x^{\alpha+n}}{(\alpha+n)\,n!}\,dx \\ &= \frac{2}{B^2(\alpha,\beta)}\sum_{n=0}^{\infty}\frac{1}{(\alpha+n)\,n!}\int_0^1 x^{2\alpha+n-1}(1-x)^{\beta-1} f^{(n)}(0)\,dx, \end{aligned}$$

where $f^{(n)}(0)$ denotes the $n$-th derivative of $f$ at $y = 0$ (note that $f^{(n)}(0)$ depends on $x$). After simplification, we have

$$E|X - Y|^k = \frac{2}{B^2(\alpha,\beta)}\sum_{n=0}^{\infty}\sum_{i=0}^{n}\frac{\binom{n}{i}(-k)_{n-i}(1-\beta)_i\,B(2\alpha+k+i, \beta)}{(\alpha+n)\,n!}.$$
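The series in Proposition 4.6 can be sanity-checked numerically. For $\alpha = \beta = 1$ and $k = 1$ it should reproduce $E|X - Y| = 1/3$ from Proposition 4.2; a Python sketch with a truncated series (helper names are ours):

```python
import math

def asc_fact(a, n):
    # ascending factorial (a)_n = a(a+1)...(a+n-1), with (a)_0 = 1
    out = 1.0
    for j in range(n):
        out *= a + j
    return out

def beta_fn(a, b):
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def e_abs_diff_k(alpha, beta, k, terms=60):
    """Truncated series of Proposition 4.6 for E|X - Y|^k, X, Y iid Beta(alpha, beta)."""
    total = 0.0
    for n in range(terms):
        for i in range(n + 1):
            total += (math.comb(n, i) * asc_fact(-k, n - i) * asc_fact(1 - beta, i)
                      * beta_fn(2 * alpha + k + i, beta)) / ((alpha + n) * math.factorial(n))
    return 2 * total / beta_fn(alpha, beta) ** 2

# Beta(1,1) is Uniform(0,1), for which E|X - Y| = 1/3 exactly.
assert abs(e_abs_diff_k(1.0, 1.0, 1.0) - 1/3) < 1e-8
```

For this parameter choice only the $n = 0$ and $n = 1$ terms are nonzero, since $(0)_i = 0$ and $(-1)_{n-i} = 0$ kill the rest.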

A random vector $X$ follows the Dirichlet distribution with parameters $\alpha_1, \alpha_2, \ldots, \alpha_{k+1} > 0$ if it has the pdf

$$f(x_1, x_2, \ldots, x_k) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_{k+1})}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_{k+1})}\, x_1^{\alpha_1 - 1} \cdots x_k^{\alpha_k - 1}\,(1 - x_1 - \cdots - x_k)^{\alpha_{k+1} - 1},$$

where $0 < x_i < 1$, $i = 1, \ldots, k$, and $0 < x_1 + x_2 + \cdots + x_k < 1$. If all parameters of a Dirichlet distribution equal one, the distribution is called the standard Dirichlet distribution. For a two-dimensional Dirichlet distribution, the support is a right triangle with vertices (0,0), (0,1), and (1,0). For three-dimensional Dirichlet distributions, the support is a pyramid with vertices (0,0,0), (1,0,0), (0,1,0), and (0,0,1). We can regard the support of a higher-dimensional Dirichlet distribution as a superpyramid. For the goodness-of-fit test for the Dirichlet distribution, translation and rotation would change the support, which is bounded and fixed in the region $0 < x_i < 1$, $i = 1, \ldots, k$, $0 < x_1 + x_2 + \cdots + x_k < 1$. Therefore, we do not need to consider affine invariance of goodness-of-fit tests for the Dirichlet distribution, even though affine invariance is very important for goodness-of-fit tests of normal distributions. However, we still need one specific kind of affine transformation to change an arbitrary $k$-dimensional Dirichlet distribution to the corresponding standard Dirichlet distribution. We calculate our energy or triangle test statistics from the transformed standard Dirichlet distribution, which makes both tests easier to use and improves their efficiency.

4.3 Data Transformation and Standardization in the Dirichlet Distribution

For a random vector $X$ that follows the multivariate normal distribution $N(\mu, \Sigma)$, the transformation $\Sigma^{-1/2}(X - \mu)$ changes the multivariate normal random vector into the corresponding multivariate standard normal vector. However, for the Dirichlet distribution, we do not have such a simple transformation formula. We can use a two-step transformation technique to transform a $k$-dimensional Dirichlet distribution into the $k$-dimensional standard Dirichlet distribution.

Theorem 4.7. Suppose that the random vector $(X_1, \ldots, X_n)$ follows an $n$-dimensional Dirichlet distribution with parameter vector $(\alpha_1, \ldots, \alpha_{n+1})$. Then $(X_1^*, \ldots, X_n^*)$ defined by the following two-step transformation is jointly distributed as the standard $n$-dimensional Dirichlet distribution. The first step of the transformation is given by

$$Y_i = I\left(\frac{X_i}{1 - X_1 - \cdots - X_{i-1}}\,\Big|\,\alpha_i, \sum_{j=i+1}^{n+1}\alpha_j\right), \qquad i = 1, 2, \ldots, n, \tag{4.3.1}$$

that is,

$$Y_1 = I\Big(X_1\,\Big|\,\alpha_1, \sum_{i=2}^{n+1}\alpha_i\Big), \quad Y_2 = I\Big(\frac{X_2}{1 - X_1}\,\Big|\,\alpha_2, \sum_{i=3}^{n+1}\alpha_i\Big), \quad \ldots, \quad Y_n = I\Big(\frac{X_n}{1 - X_1 - \cdots - X_{n-1}}\,\Big|\,\alpha_n, \alpha_{n+1}\Big),$$

where $I(\,\cdot \mid a, b)$ is the regularized incomplete beta function with parameters $a$ and $b$; that is, $Y_i$ is the conditional regularized incomplete beta transform of $X_i/(1 - X_1 - \cdots - X_{i-1})$ given $X_1, \ldots, X_{i-1}$, whose conditional distribution is Beta$\big(\alpha_i, \sum_{j=i+1}^{n+1}\alpha_j\big)$. The second step of the transformation is given by

$$\begin{aligned} X_1^* &= 1 - (1 - Y_1)^{1/n}, \\ X_2^* &= (1 - Y_1)^{1/n}\big[1 - (1 - Y_2)^{1/(n-1)}\big], \\ X_3^* &= (1 - Y_1)^{1/n}(1 - Y_2)^{1/(n-1)}\big[1 - (1 - Y_3)^{1/(n-2)}\big], \\ &\ \ \vdots \\ X_i^* &= \prod_{j=1}^{i-1}(1 - Y_j)^{1/(n-j+1)}\,\big[1 - (1 - Y_i)^{1/(n-i+1)}\big], \\ &\ \ \vdots \\ X_n^* &= \prod_{i=1}^{n-1}(1 - Y_i)^{1/(n-i+1)}\; Y_n. \end{aligned}$$

The random vector $(X_1^*, \ldots, X_n^*)$ follows the standard $n$-dimensional Dirichlet distribution.
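A small Python sketch of the second step only (supercube uniforms to the standard Dirichlet support); the function name is ours, and the simplex constraints asserted below follow by telescoping the partial products:

```python
def cube_to_simplex(y):
    """Map (y_1, ..., y_n) in the open unit cube to (x*_1, ..., x*_n)
    in the open simplex, per the second step of the two-step transformation."""
    n = len(y)
    x = []
    prod = 1.0  # running product of (1 - y_j)^(1/(n-j+1))
    for i in range(1, n):
        factor = (1.0 - y[i - 1]) ** (1.0 / (n - i + 1))
        x.append(prod * (1.0 - factor))
        prod *= factor
    x.append(prod * y[n - 1])
    return x

y = [0.3, 0.7, 0.5]
x = cube_to_simplex(y)
assert all(xi > 0 for xi in x) and sum(x) < 1
```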

Proof. In the first step we change the $n$-dimensional Dirichlet distribution into the $n$-dimensional uniform distribution with support $0 < x_i < 1$, $i = 1, \ldots, n$, which can be called the supercube uniform distribution because the two-dimensional support is a square with unit sides and the three-dimensional support is a cube with unit sides. In the second step we change the $n$-dimensional supercube uniform distribution into the $n$-dimensional standard Dirichlet distribution. The first step transforms the non-uniform density into the uniform density, which is what we desire, but it also transforms the support from the superpyramid into the supercube, which is not what we desire. The second step keeps the uniform density but changes the support from the supercube back into the superpyramid. The Jacobian determinant is the determinant of the Jacobian matrix, which is given by

$$\begin{bmatrix} \frac{\partial x_1}{\partial x_1^*} & \frac{\partial x_1}{\partial x_2^*} & \cdots & \frac{\partial x_1}{\partial x_n^*} \\ \vdots & & \ddots & \vdots \\ \frac{\partial x_n}{\partial x_1^*} & \frac{\partial x_n}{\partial x_2^*} & \cdots & \frac{\partial x_n}{\partial x_n^*} \end{bmatrix}.$$

However, it is hard to derive each element of this matrix directly. On the other hand, the Jacobian determinant is the inverse of the determinant of the matrix

$$\begin{bmatrix} \frac{\partial x_1^*}{\partial x_1} & \frac{\partial x_1^*}{\partial x_2} & \cdots & \frac{\partial x_1^*}{\partial x_n} \\ \vdots & & \ddots & \vdots \\ \frac{\partial x_n^*}{\partial x_1} & \frac{\partial x_n^*}{\partial x_2} & \cdots & \frac{\partial x_n^*}{\partial x_n} \end{bmatrix},$$

which we call the inverse Jacobian matrix. Since $X_i^*$ is a function of $X_1, X_2, \ldots, X_i$ only, $\partial x_i^*/\partial x_j = 0$ whenever $j > i$, so the inverse Jacobian matrix is lower triangular:

$$\begin{bmatrix} \frac{\partial x_1^*}{\partial x_1} & 0 & \cdots & 0 \\ \frac{\partial x_2^*}{\partial x_1} & \frac{\partial x_2^*}{\partial x_2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ \frac{\partial x_n^*}{\partial x_1} & \frac{\partial x_n^*}{\partial x_2} & \cdots & \frac{\partial x_n^*}{\partial x_n} \end{bmatrix}.$$

Therefore, we only need the diagonal elements of the inverse Jacobian matrix to obtain its determinant. The $i$-th diagonal element is given by

$$\frac{\partial x_i^*}{\partial x_i} = \frac{\partial x_i^*}{\partial y_1}\frac{\partial y_1}{\partial x_i} + \frac{\partial x_i^*}{\partial y_2}\frac{\partial y_2}{\partial x_i} + \cdots + \frac{\partial x_i^*}{\partial y_n}\frac{\partial y_n}{\partial x_i}, \qquad i = 1, 2, \ldots, n.$$

Since $Y_i$ is a function of $X_1, \ldots, X_i$ only, $\partial y_i/\partial x_j = 0$ whenever $j > i$; and since $X_i^*$ is a function of $Y_1, \ldots, Y_i$ only, $\partial x_i^*/\partial y_j = 0$ whenever $j > i$. Hence the $i$-th diagonal element reduces to

$$\frac{\partial x_i^*}{\partial x_i} = \frac{\partial x_i^*}{\partial y_i}\frac{\partial y_i}{\partial x_i} = c\prod_{j=1}^{i-1}(1 - y_j)^{\frac{1}{n-j+1}}\,(1 - y_i)^{\frac{1}{n-i+1}-1}\cdot\frac{1}{B\big(\alpha_i, \sum_{j=i+1}^{n+1}\alpha_j\big)}\left(\frac{x_i}{1 - \sum_{j=1}^{i-1} x_j}\right)^{\alpha_i - 1}\left(\frac{1 - \sum_{j=1}^{i} x_j}{1 - \sum_{j=1}^{i-1} x_j}\right)^{\sum_{j=i+1}^{n+1}\alpha_j - 1}\frac{1}{1 - \sum_{j=1}^{i-1} x_j},$$

where $c = \frac{1}{n-i+1}$. The determinant of the inverse Jacobian matrix is given by

$$\prod_{i=1}^{n}\frac{\partial x_i^*}{\partial x_i} = \prod_{i=1}^{n}\frac{\partial x_i^*}{\partial y_i}\frac{\partial y_i}{\partial x_i} = \frac{1}{n!}\,\frac{\Gamma(\alpha_1 + \cdots + \alpha_{n+1})}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_{n+1})}\prod_{i=1}^{n} x_i^{\alpha_i - 1}\Big(1 - \sum_{i=1}^{n} x_i\Big)^{\alpha_{n+1} - 1}.$$

The Jacobian determinant is the inverse of the determinant of the inverse Jacobian matrix, so the density function of the vector $(X_1^*, X_2^*, \ldots, X_n^*)$ is

$$f(x_1^*, x_2^*, \ldots, x_n^*) = f(x_1, x_2, \ldots, x_n)\times|J| = \frac{f(x_1, x_2, \ldots, x_n)}{\frac{1}{n!}\,\frac{\Gamma(\alpha_1 + \cdots + \alpha_{n+1})}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_{n+1})}\prod_{i=1}^{n} x_i^{\alpha_i - 1}\big(1 - \sum_{i=1}^{n} x_i\big)^{\alpha_{n+1} - 1}} = n!,$$

where $0 < x_i^*$, $i = 1, \ldots, n$, and $x_1^* + x_2^* + \cdots + x_n^* < 1$. The support of the vector $(X_1^*, X_2^*, \ldots, X_n^*)$ is the same as the support of the vector $(X_1, X_2, \ldots, X_n)$. Therefore, the vector $(X_1^*, X_2^*, \ldots, X_n^*)$ follows the standard Dirichlet distribution.

4.4 Energy Test of Beta or Dirichlet Distribution

Formally, the energy test of the beta or Dirichlet distribution follows. The statistic is

$$Q_{n,d} = n\left(\frac{2}{n}\sum_{i=1}^{n} E|x_i - X| - E|X - X'| - \frac{1}{n^2}\sum_{i,j=1}^{n} |x_i - x_j|\right).$$

We transform the observed sample $X$ into the standardized sample $Y$ by formula (4.3.1). The standardized sample depends on the unknown parameters. However, the maximum likelihood estimator is consistent: it converges in probability to the true population parameter vector, and the limiting distribution of the transformed sample is the standard Dirichlet distribution. We will therefore ignore the dependence introduced by standardization. The test statistic we will use is

$$Q_{n,d} = n\left(\frac{2}{n}\sum_{i=1}^{n} E|y_i - Z| - E|Z - Z'| - \frac{1}{n^2}\sum_{i,j=1}^{n} |y_i - y_j|\right), \tag{4.4.1}$$

where 푍 is the standard beta or Dirichlet distributed variable. The first component of 푄푛,푑 for the Dirichlet distribution is obtained by simulation. The null hypothesis that the sample is from the

beta or Dirichlet family is rejected for large values of 푄푛,푑. Please see Section 6.1 for the details of implementation of the energy test.

4.5 Triangle Tests of Beta or Dirichlet Distribution

The triangle goodness-of-fit test for the null hypothesis that the data are from the beta or Dirichlet family is based on

$$U_k = \binom{n}{2}^{-1}\sum_{i<j} h_k(X_i, X_j), \qquad k = 1, 2, 3,$$

where $U_1$, $U_2$, and $U_3$ estimate the chance that the side joining the two data points is the smallest, middle, or largest side of the triangle formed from two randomly selected data points and the hypothetical point. We first transform the observed sample $X$ into the standardized sample $Y$ by formula (4.3.1). The standardized sample depends on the unknown parameters. However, the maximum likelihood estimator is consistent, and the standardized sample converges in distribution to the standard Dirichlet distribution. We will ignore the dependence introduced by standardization. Thus,

$$U_k = \binom{n}{2}^{-1}\sum_{i<j} h_k(Y_i, Y_j), \qquad k = 1, 2, 3, \tag{4.5.1}$$

where 푈1, 푈2, and 푈3 estimate the chance that the side joining the two selected data points is the smallest, middle, or largest side of the triangle formed from two randomly selected data points from the standardized sample and the hypothetical point from the standard beta or Dirichlet distribution.

The three triangle statistics we will use are $U_3$, $(U_1 - 1/3)^2$, and $(U_1 - 1/3)^2 + (U_3 - 1/3)^2$. The null hypothesis that the sample is from the beta or Dirichlet family is rejected for large or small values of $U_3$, for large values of $(U_1 - 1/3)^2$, and for large values of $(U_1 - 1/3)^2 + (U_3 - 1/3)^2$. See Section 6.1 for details of implementation of the triangle test.

CHAPTER 5 DISTANCE CORRELATION TEST OF DIRICHLET DISTRIBUTIONS

5.1 Distance Covariance and Distance Correlation

Distance correlation is a measure of dependence between random vectors $X$ and $Y$ in arbitrary dimensions, introduced by Székely, Rizzo, and Bakirov (2007). For all distributions with finite first moments, distance correlation $\mathcal{R}$ generalizes the idea of correlation in two fundamental ways:

1. ℛ(풳 , 풴) is defined for X and Y in arbitrary dimensions;

2. ℛ(풳 , 풴) = 0 characterizes independence of X and Y.

Pearson's product-moment covariance measures linear dependence between two variables. If the Pearson correlation coefficient $\rho = 1$, there is a perfect positive linear relationship between the variables $X$ and $Y$; if $\rho = -1$, there is a perfect negative linear relationship; and if $\rho = 0$, there is no linear relationship between $X$ and $Y$. For the bivariate normal distribution, $\rho = 0$ is equivalent to independence of the variables $X$ and $Y$. However, it can be shown that except for special cases such as the bivariate normal distribution, $\rho = 0$ is not a characterization of independence of two random variables. Distance correlation has the properties of a true dependence measure, analogous to, but more general than, Pearson's product-moment correlation coefficient. Distance correlation satisfies $0 \le \mathcal{R} \le 1$, and $\mathcal{R} = 0$ only if $X$ and $Y$ are independent. Distance covariance and distance correlation are applicable to random vectors in arbitrary, not necessarily equal, dimensions. The following definitions are given by Székely et al. (2007).

Definition 5.1. The distance covariance (dCov) between random vectors $X$ and $Y$ with finite first moments is the nonnegative number $dCov(X, Y)$ defined by

$$dCov^2(X, Y) = \|\hat f_{X,Y}(s,t) - \hat f_X(s)\hat f_Y(t)\|^2 = \frac{1}{c_p c_q}\int_{\mathbb{R}^{p+q}}\frac{\big|\hat f_{X,Y}(s,t) - \hat f_X(s)\hat f_Y(t)\big|^2}{|s|_p^{1+p}\,|t|_q^{1+q}}\,dt\,ds,$$

where $\hat f_{X,Y}(s,t)$, $\hat f_X(s)$, and $\hat f_Y(t)$ are the characteristic functions of the random vectors $(X, Y)$, $X$, and $Y$, respectively; $\|\cdot\|$ is the weighted $L_2$ norm; $|\cdot|_p$ denotes the Euclidean norm in $\mathbb{R}^p$; $p$ and $q$ denote the Euclidean dimensions of $X$ and $Y$; $c_p = \frac{\pi^{(p+1)/2}}{\Gamma\big(\frac{p+1}{2}\big)}$ and $c_q = \frac{\pi^{(q+1)/2}}{\Gamma\big(\frac{q+1}{2}\big)}$; and the integral above is assumed to exist.

The weight function $\big(c_p c_q |s|_p^{1+p}|t|_q^{1+q}\big)^{-1}$ is chosen to produce a scale-equivariant and rotation-invariant measure that does not become zero for dependent variables (Székely et al., 2007). By the definition of the norm $\|\cdot\|$, it is clear that dCov is nonnegative, and dCov $= 0$ if and only if $X$ and $Y$ are independent.
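The normalizing constant can be computed directly; for $p = 1$ it reduces to $\pi$ and for $p = 2$ to $2\pi$, convenient sanity checks:

```python
import math

def c(p):
    # c_p = pi^((p+1)/2) / Gamma((p+1)/2), the dCov weight-function constant
    return math.pi ** ((p + 1) / 2) / math.gamma((p + 1) / 2)

assert abs(c(1) - math.pi) < 1e-12        # c_1 = pi / Gamma(1) = pi
assert abs(c(2) - 2 * math.pi) < 1e-12    # c_2 = pi^1.5 / (sqrt(pi)/2) = 2 pi
```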

Theorem 5.1. (Székely et al., 2007) The population value of distance covariance is the square root of

$$dCov^2(X, Y) = \mathrm{E}\big[|X - X'|\,|Y - Y'|\big] + \mathrm{E}\big[|X - X'|\big]\,\mathrm{E}\big[|Y - Y'|\big] - 2\,\mathrm{E}\big[|X - X'|\,|Y - Y''|\big],$$

where E denotes the expected value, | · | is the Euclidean norm, and (푋, 푌 ), (푋′, 푌 ′), (푋′′, 푌 ′′) are independent and identically distributed, provided the expectations are finite.

Definition 5.2. The distance variance of the random vector $X$ with finite first moments is the nonnegative number $dVar(X)$ defined by

$$dVar^2(X) = dCov^2(X, X) = \|\hat f_{X,X}(s,t) - \hat f_X(s)\hat f_X(t)\|^2 = \frac{1}{c_p^2}\int_{\mathbb{R}^{2p}}\frac{\big|\hat f_{X,X}(s,t) - \hat f_X(s)\hat f_X(t)\big|^2}{|s|_p^{1+p}\,|t|_p^{1+p}}\,dt\,ds,$$

provided the integral is finite.

Theorem 5.2. The population value of distance variance is the square root of

$$dVar^2(X) = \mathrm{E}\big[|X - X'|^2\big] + \mathrm{E}^2\big[|X - X'|\big] - 2\,\mathrm{E}\big[|X - X'|\,|X - X''|\big],$$

where E denotes the expected value; $X$, $X'$, and $X''$ are iid; and the expectations are finite (Székely et al., 2007).

Definition 5.3. The distance correlation of two random vectors is obtained by dividing their distance covariance by the product of their distance standard deviations:

$$dCor^2(X, Y) = \begin{cases} \dfrac{dCov^2(X, Y)}{\sqrt{dVar^2(X)\,dVar^2(Y)}}, & dVar(X)\,dVar(Y) > 0, \\[2mm] 0, & dVar(X)\,dVar(Y) = 0. \end{cases}$$

Let $\hat f_X^n(s)$, $\hat f_Y^n(t)$, and $\hat f_{X,Y}^n(s,t)$ be the empirical characteristic functions of the samples $\mathbf{X}$, $\mathbf{Y}$, and $(\mathbf{X}, \mathbf{Y})$, respectively. We substitute the empirical characteristic functions for the characteristic functions in the definition of the distance covariance. The following result was proved by Székely et al. (2007). If $(\mathbf{X}, \mathbf{Y})$ is a random sample from the joint distribution of $(X, Y)$, then

$$\|\hat f_{X,Y}^n(s,t) - \hat f_X^n(s)\hat f_Y^n(t)\|^2 = S_1 + S_2 - 2S_3,$$

where

$$S_1 = \frac{1}{n^2}\sum_{k,l=1}^{n} |X_k - X_l|_p\,|Y_k - Y_l|_q, \qquad S_2 = \frac{1}{n^2}\sum_{k,l=1}^{n} |X_k - X_l|_p\cdot\frac{1}{n^2}\sum_{k,l=1}^{n} |Y_k - Y_l|_q,$$

$$S_3 = \frac{1}{n^3}\sum_{k=1}^{n}\sum_{l,m=1}^{n} |X_k - X_l|_p\,|Y_k - Y_m|_q,$$

and

$$dCov_n^2(\mathbf{X}, \mathbf{Y}) = S_1 + S_2 - 2S_3, \tag{5.1.1}$$

where (5.1.1) is algebraically equivalent to the expression for $dCov_n^2(\mathbf{X}, \mathbf{Y})$ in the following theorem by Székely and Rizzo (2007).

Theorem 5.3. (Székely et al., 2007) The empirical distance covariance $dCov_n^2(\mathbf{X}, \mathbf{Y})$ is the nonnegative number

$$dCov_n^2(\mathbf{X}, \mathbf{Y}) = \frac{1}{n^2}\sum_{j,k=1}^{n} A_{j,k} B_{j,k}, \tag{5.1.2}$$

where $A_{j,k}$ and $B_{j,k}$ are given by

$$A_{j,k} = a_{j,k} - \bar a_{j.} - \bar a_{.k} + \bar a_{..}, \qquad B_{j,k} = b_{j,k} - \bar b_{j.} - \bar b_{.k} + \bar b_{..},$$

$$a_{j,k} = \|X_j - X_k\|, \qquad b_{j,k} = \|Y_j - Y_k\|, \qquad j, k = 1, 2, \ldots, n,$$

$$\bar a_{i.} = \frac{1}{n}\sum_{j=1}^{n} a_{ij}, \qquad \bar a_{.j} = \frac{1}{n}\sum_{i=1}^{n} a_{ij}, \qquad \bar a_{..} = \frac{1}{n^2}\sum_{j,k=1}^{n} a_{jk},$$

$$\bar b_{i.} = \frac{1}{n}\sum_{j=1}^{n} b_{ij}, \qquad \bar b_{.j} = \frac{1}{n}\sum_{i=1}^{n} b_{ij}, \qquad \bar b_{..} = \frac{1}{n^2}\sum_{j,k=1}^{n} b_{jk},$$

where the subscript "." denotes that the mean is computed over the index it replaces.

Definition 5.4. The empirical distance correlation $dCor_n(\mathbf{X}, \mathbf{Y})$ is the square root of

$$dCor_n^2(\mathbf{X}, \mathbf{Y}) = \begin{cases} \dfrac{dCov_n^2(\mathbf{X}, \mathbf{Y})}{\sqrt{dVar_n^2(\mathbf{X})\,dVar_n^2(\mathbf{Y})}}, & dVar_n(\mathbf{X})\,dVar_n(\mathbf{Y}) > 0, \\[2mm] 0, & dVar_n(\mathbf{X})\,dVar_n(\mathbf{Y}) = 0, \end{cases}$$

where the sample distance variance is defined by

$$dVar_n^2(\mathbf{X}) = dCov_n^2(\mathbf{X}, \mathbf{X}) = \frac{1}{n^2}\sum_{k,\ell=1}^{n} A_{k,\ell}^2.$$

The statistic $dVar_n(\mathbf{X})$ is zero if and only if all sample observations are identical.
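The algebraic equivalence of (5.1.1) and (5.1.2) is easy to verify numerically. A Python sketch for univariate samples (function names are ours):

```python
def dcov2_s_form(x, y):
    # S1 + S2 - 2*S3 from the empirical characteristic function expansion (5.1.1)
    n = len(x)
    a = [[abs(xi - xj) for xj in x] for xi in x]
    b = [[abs(yi - yj) for yj in y] for yi in y]
    s1 = sum(a[k][l] * b[k][l] for k in range(n) for l in range(n)) / n**2
    s2 = (sum(map(sum, a)) / n**2) * (sum(map(sum, b)) / n**2)
    s3 = sum(a[k][l] * b[k][m] for k in range(n)
             for l in range(n) for m in range(n)) / n**3
    return s1 + s2 - 2 * s3

def dcov2_centered(x, y):
    # (1/n^2) * sum A_jk B_jk with double-centered distance matrices (5.1.2)
    n = len(x)
    a = [[abs(xi - xj) for xj in x] for xi in x]
    b = [[abs(yi - yj) for yj in y] for yi in y]
    def center(m):
        row = [sum(r) / n for r in m]
        col = [sum(m[i][j] for i in range(n)) / n for j in range(n)]
        g = sum(row) / n
        return [[m[i][j] - row[i] - col[j] + g for j in range(n)] for i in range(n)]
    A, B = center(a), center(b)
    return sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / n**2

x = [0.1, 0.5, 0.4, 0.9, 0.7]
y = [1.0, 0.2, 0.6, 0.3, 0.8]
assert abs(dcov2_s_form(x, y) - dcov2_centered(x, y)) < 1e-12
```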

Theorem 5.4. (Székely et al., 2007) If $E|X| < \infty$ and $E|Y| < \infty$, then almost surely

$$\lim_{n\to\infty} dCov_n(\mathbf{X}, \mathbf{Y}) = dCov(X, Y), \qquad \lim_{n\to\infty} dCor_n^2(\mathbf{X}, \mathbf{Y}) = dCor^2(X, Y).$$

For the beta and Dirichlet distributions, since their supports are bounded, it is always true that $E|X| < \infty$ and $E|Y| < \infty$; therefore the dCov and dCor coefficients always exist in this setting.

Under independence, $n\,dCov_n^2(\mathbf{X}, \mathbf{Y})$ converges in distribution to a quadratic form $Q = \sum_{j=1}^{\infty}\lambda_j Z_j^2$, where the $Z_j$ are independent standard normal random variables and the $\lambda_j$ are nonnegative constants that depend on the distribution of $(X, Y)$, as proved by Székely et al. (2007). Under dependence, $n\,dCov_n^2(\mathbf{X}, \mathbf{Y})$ tends to infinity stochastically as $n\to\infty$. Hence, the test that rejects the null hypothesis of independence for large $n\,dCov_n^2(\mathbf{X}, \mathbf{Y})$ is consistent against the alternative hypothesis of dependence.

5.2 Characterization of Beta and Dirichlet Distributions

5.2.1 Complete Neutrality

Connor and Mosimann (1969) introduced the concept of neutrality. $X_2$ is defined to be neutral with respect to $X_1$ if $X_1/(1 - X_2)$ and $X_2$ are independent. The set $X_1, X_2, \ldots, X_k$ is defined to be completely neutral for the order $1, 2, \ldots, k$ if

$$X_1,\quad \frac{X_2}{1 - X_1},\quad \frac{X_3}{1 - X_1 - X_2},\quad \ldots,\quad \frac{X_k}{1 - X_1 - X_2 - \cdots - X_{k-1}}$$

are jointly independent, with the corresponding property for every permutation of $\{1, 2, \ldots, k\}$.

Theorem 5.5. The random variable set {푋1, 푋2, ··· , 푋푘} is completely neutral for all permuta- tions if and only if 푋1, 푋2, ··· , 푋푘 have a Dirichlet distribution.

Proof. Darroch and Ratcliff (1971) showed that if the set $\{X_1, X_2, \ldots, X_k\}$ is completely neutral for all permutations, then $X_1, X_2, \ldots, X_k$ have a Dirichlet distribution. Here, we show that if $X_1, X_2, \ldots, X_k$ have a Dirichlet distribution with parameters $(\alpha_1, \alpha_2, \ldots, \alpha_{k+1})$, then

$$X_1,\quad \frac{X_2}{1 - X_1},\quad \frac{X_3}{1 - X_1 - X_2},\quad \ldots,\quad \frac{X_k}{1 - X_1 - X_2 - \cdots - X_{k-1}}$$

are jointly independent. Let $Y_1 = X_1$, $Y_2 = \frac{X_2}{1 - X_1}$, $Y_3 = \frac{X_3}{1 - X_1 - X_2}$, $\ldots$, $Y_k = \frac{X_k}{1 - X_1 - \cdots - X_{k-1}}$. The single-valued inverse functions are $X_1 = Y_1$, $X_2 = Y_2(1 - Y_1)$, $X_3 = Y_3(1 - Y_2)(1 - Y_1)$, $\ldots$, $X_k = Y_k\prod_{i=1}^{k-1}(1 - Y_i)$, so that the Jacobian matrix is

$$\begin{bmatrix} 1 & 0 & \cdots & 0 & 0 \\ -y_2 & 1 - y_1 & \cdots & 0 & 0 \\ -y_3(1 - y_2) & -y_3(1 - y_1) & \cdots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ -y_k\displaystyle\prod_{i=2}^{k-1}(1 - y_i) & -y_k\displaystyle\prod_{\substack{i=1 \\ i\neq 2}}^{k-1}(1 - y_i) & \cdots & -y_k\displaystyle\prod_{i=1}^{k-2}(1 - y_i) & \displaystyle\prod_{i=1}^{k-1}(1 - y_i) \end{bmatrix}.$$

The support of $X_1, X_2, \ldots, X_k$ maps onto

$$0 \le Y_1 \le 1, \quad 0 \le Y_2(1 - Y_1) \le 1, \quad 0 \le Y_3(1 - Y_2)(1 - Y_1) \le 1, \quad \ldots, \quad 0 \le Y_k\prod_{i=1}^{k-1}(1 - Y_i) \le 1,$$

which is equivalent to the support given by

$$\mathcal{T} = \{(y_1, y_2, \ldots, y_k) : 0 \le y_i \le 1,\ i = 1, 2, \ldots, k\}.$$

Hence the joint pdf of $Y_1, Y_2, \ldots, Y_k$ is

$$\begin{aligned} f(y_1, \ldots, y_k) ={}& \frac{\Gamma\big(\sum_{i=1}^{k+1}\alpha_i\big)}{\Gamma(\alpha_1)\,\Gamma\big(\sum_{i=2}^{k+1}\alpha_i\big)}\, y_1^{\alpha_1 - 1}(1 - y_1)^{\sum_{i=2}^{k+1}\alpha_i - 1} \\ &\times \frac{\Gamma\big(\sum_{i=2}^{k+1}\alpha_i\big)}{\Gamma(\alpha_2)\,\Gamma\big(\sum_{i=3}^{k+1}\alpha_i\big)}\, y_2^{\alpha_2 - 1}(1 - y_2)^{\sum_{i=3}^{k+1}\alpha_i - 1} \times \cdots \\ &\times \frac{\Gamma\big(\sum_{i=k-1}^{k+1}\alpha_i\big)}{\Gamma(\alpha_{k-1})\,\Gamma\big(\sum_{i=k}^{k+1}\alpha_i\big)}\, y_{k-1}^{\alpha_{k-1} - 1}(1 - y_{k-1})^{\sum_{i=k}^{k+1}\alpha_i - 1} \\ &\times \frac{\Gamma(\alpha_k + \alpha_{k+1})}{\Gamma(\alpha_k)\,\Gamma(\alpha_{k+1})}\, y_k^{\alpha_k - 1}(1 - y_k)^{\alpha_{k+1} - 1}, \end{aligned} \tag{5.2.1}$$

for $(y_1, y_2, \ldots, y_k) \in \mathcal{T}$. The marginal pdf of $Y_i$ is

$$f_i(y_i) = \frac{\Gamma\big(\sum_{j=i}^{k+1}\alpha_j\big)}{\Gamma(\alpha_i)\,\Gamma\big(\sum_{j=i+1}^{k+1}\alpha_j\big)}\, y_i^{\alpha_i - 1}(1 - y_i)^{\sum_{j=i+1}^{k+1}\alpha_j - 1}, \qquad y_i \in [0, 1].$$

Therefore, $Y_i$ follows the beta distribution with left parameter $\alpha_i$ and right parameter $\sum_{j=i+1}^{k+1}\alpha_j$, and in (5.2.1)

$$f(y_1, y_2, \ldots, y_k) = f_1(y_1)\,f_2(y_2)\cdots f_k(y_k).$$

Therefore, the variables $Y_1, Y_2, \ldots, Y_k$ are mutually independent.

5.3 Testing of Mutual Independence of All Components in a Random Vector

To apply Theorem 5.5 to a random vector $(Y_1, Y_2, \ldots, Y_k)$, we would like to test the mutual independence of $Y_1, Y_2, \ldots, Y_k$. We construct the following algorithm to test the mutual independence of $k$ variables based on the following $k-1$ independent hypotheses:

$H_{0,1}$: $Y_1$ is independent of the vector $(Y_2, Y_3, \ldots, Y_k)$,

$H_{0,2}$: $Y_2$ is independent of the vector $(Y_3, Y_4, \ldots, Y_k)$,

$\ldots$

$H_{0,i}$: $Y_i$ is independent of the vector $(Y_{i+1}, Y_{i+2}, \ldots, Y_k)$,

$\ldots$

$H_{0,k-1}$: $Y_{k-1}$ is independent of $Y_k$.

The distance covariance is applicable to random vectors in arbitrary dimensions, so the dCov test is a statistically consistent test of each hypothesis $H_{0,1}, H_{0,2}, \ldots, H_{0,k-1}$. The $i$-th hypothesis tests the independence of $Y_i$ and $(Y_{i+1}, Y_{i+2}, \ldots, Y_k)$, $i = 1, \ldots, k-1$; it does not test any dependence or independence among the variables $Y_{i+1}, Y_{i+2}, \ldots, Y_k$ themselves. If the $i$-th hypothesis holds,

$$f_{ik}(y_i, y_{i+1}, \ldots, y_k) = f_i(y_i)\,f_{(i+1)k}(y_{i+1}, \ldots, y_k).$$

The $i$-th hypothesis is independent of the $(i+1)$-th hypothesis, $i = 1, \ldots, k-2$.

If all $k-1$ hypotheses hold, the joint pdf of the variables $y_1, y_2, \ldots, y_k$ factors as

$$\begin{aligned} f(y_1, \ldots, y_k) &= f_1(y_1)\,f_{2k}(y_2, \ldots, y_k) \\ &= f_1(y_1)f_2(y_2)\,f_{3k}(y_3, \ldots, y_k) \\ &= f_1(y_1)f_2(y_2)\cdots f_i(y_i)\,f_{(i+1)k}(y_{i+1}, \ldots, y_k) \\ &= f_1(y_1)f_2(y_2)\cdots f_k(y_k). \end{aligned}$$

Thus, the above $k-1$ hypotheses as a whole test whether the variables $Y_1, Y_2, \ldots, Y_k$ are jointly independent. Since we have $k-1$ independent hypotheses, the Bonferroni correction is chosen to control the familywise type I error rate. The correction is based on the idea that if $n$ dependent or independent hypotheses are tested, then one way of maintaining the familywise error rate $\alpha$ is to

test each individual hypothesis at a statistical significance level of $\alpha/n$. The Bonferroni correction tends to be less conservative when all tests are independent of each other, and more conservative when some or all tests are dependent. For $k-1$ independent hypotheses each tested at level $\alpha/(k-1)$, the probability of at least one significant test result is $1 - (1 - \alpha/(k-1))^{k-1}$. Figure 5.1 shows that as the number of independent tests increases from 2 to 10, the familywise significance level decreases but remains very close to $\alpha = 0.05$, which is why the Bonferroni correction is only mildly conservative in this case.
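The familywise significance level under independence is easy to tabulate; a short Python sketch of the quantity plotted in Figure 5.1:

```python
alpha = 0.05

def familywise_level(m, alpha=alpha):
    # Probability of at least one rejection among m independent tests,
    # each run at the Bonferroni-corrected level alpha/m.
    return 1 - (1 - alpha / m) ** m

levels = [familywise_level(m) for m in range(2, 11)]
# Always below the nominal alpha, and decreasing toward 1 - exp(-alpha) as m grows.
assert all(l < alpha for l in levels)
assert all(levels[i] > levels[i + 1] for i in range(len(levels) - 1))
```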

5.4 Distance Covariance Goodness-of-Fit Tests

We propose a new goodness-of-fit test for the Dirichlet distribution based on the distance co- variance. For an observed 푘-dimensional sample of 푛 observations (푥푖,1, 푥푖,2, ··· , 푥푖,푘−1, 푥푖,푘), 푖 = 1, 2, ··· , 푛, we would like to test whether the sample comes from a Dirichlet distribution:

퐇ퟎ : The sampled population has a Dirichlet distribution,

퐇ퟏ : The sampled population does not have a Dirichlet distribution. 42

Figure 5.1: Probability of family-wise significance level based on Bonferroni correction.

Let $Y_1 = X_1$ and $Y_i = \frac{X_i}{1 - X_1 - \cdots - X_{i-1}}$, $i = 2, \ldots, k$. By Theorem 5.5, under the null hypothesis that the sample is from a Dirichlet distribution, $Y_1, Y_2, \ldots, Y_k$ are mutually independent. Under the

alternative hypothesis, $Y_1, Y_2, \ldots, Y_k$ are not mutually independent. We can thus transform the original goodness-of-fit problem of testing whether the sample comes from a Dirichlet distribution into a test of whether $Y_1, Y_2, \ldots, Y_k$ are mutually independent. Since $dCov(X, Y) = 0$ if and only if $X$ and $Y$ are independent, we can further reduce the test of mutual independence of $Y_1, Y_2, \ldots, Y_k$ to testing whether the population distance covariance coefficients are 0 or not. Thus, our test statistic is based on $dCov_n^2(X, Y)$.

For example, if the random vector (푋1, 푋2) follows the two-dimensional Dirichlet distribution, then the variables 푌1 = 푋1 and 푌2 = 푋2/(1 − 푋1) are independent, and 푌3 = 푋2 and 푌4 = 푋1/(1 − 푋2) are also independent. Therefore, the population distance covariance of 푌1 and 푌2 is 0, and the population distance covariance of 푌3 and 푌4 is also 0. If the random vector (푋1, 푋2) does not follow the two-dimensional Dirichlet distribution, then either 푌1 and 푌2 are not independent or 푌3 and 푌4 are not independent. Therefore, the population distance covariance of 푌1 and 푌2 is positive, or the population distance covariance of 푌3 and 푌4 is positive. If the variables 푌1 and 푌2 are independent but 푌3 and 푌4 are not independent, the random vector (푋1, 푋2) does not follow the Dirichlet distribution, but it may follow the generalized Dirichlet distribution, which is discussed in Chapter 8.

For the random vector (푋1, 푋2, ···, 푋푘) from a 푘-dimensional Dirichlet distribution, the transformed variables 푌1 = 푋1, 푌2 = 푋2/(1 − 푋1), 푌3 = 푋3/(1 − 푋1 − 푋2), ···, 푌푘 = 푋푘/(1 − 푋1 − 푋2 − ··· − 푋푘−1) are mutually independent for every permutation of the set {푋1, 푋2, ···, 푋푘}, because of the complete neutrality of Dirichlet distributions. Therefore, we conduct the distance covariance test for each of the 푘! permutations. For each permutation of the 푘 variables, we test 푘 − 1 null hypotheses, so for all 푘! permutations we construct 푘! × (푘 − 1) null hypotheses. For the Dirichlet distribution, many of these 푘! × (푘 − 1) hypotheses are dependent. For the trivariate Dirichlet distribution, we construct 3! × 2 = 12 hypotheses; for the four-dimensional Dirichlet distribution, we construct 4! × 3 = 72 hypotheses. In the Bonferroni procedure, the nominal familywise 훼 level is divided by the number of hypothesis tests. When the number of hypotheses is large and many of the hypotheses are dependent, the Bonferroni correction makes the empirical familywise type I error rate much smaller than the nominal familywise type I error rate 훼. We will propose an improved algorithm that accounts for the dependence structure of all hypotheses to control the familywise type I error rate for higher-dimensional (푘 > 4) Dirichlet distributions. An R function dcov.test for the distance covariance test of any two random samples is provided in the energy package by Rizzo and Székely (2007). We can apply the same logic to conduct the test of mutual independence of a collection of random vectors.
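The stick-breaking transformation above, together with the 푘! × (푘 − 1) count of hypotheses, can be sketched in Python (assuming NumPy; the function name is ours):

```python
import itertools

import numpy as np

def neutrality_transform(x):
    """Y1 = X1, Yi = Xi / (1 - X1 - ... - X_{i-1}) for i = 2, ..., k."""
    x = np.asarray(x, dtype=float)
    head = np.cumsum(x, axis=1)[:, :-1]   # X1, X1+X2, ..., X1+...+X_{k-1}
    y = x.copy()
    y[:, 1:] = x[:, 1:] / (1.0 - head)
    return y

rng = np.random.default_rng(0)
# Drop the implicit last part of a Dirichlet(1,1,1,1) sample to get k = 3 columns.
x = rng.dirichlet([1, 1, 1, 1], 200)[:, :3]
perms = list(itertools.permutations(range(3)))
# One set of k - 1 independence hypotheses per column permutation.
n_hypotheses = len(perms) * (3 - 1)       # 3! * 2 = 12
ys = [neutrality_transform(x[:, list(p)]) for p in perms]
```

Under the null, every column of every transformed array behaves like an independent variable on (0, 1); the distance covariance test is then applied to each of the 12 hypotheses.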

CHAPTER 6 SIMULATION STUDY

6.1 Simulation Study

The empirical power of a goodness-of-fit test against a specified alternative distribution at a fixed significance level is obtained by drawing random samples from the alternative distribution, and computing the percentage of samples for which the test is significant at that significance level. Empirical critical values for each of the tests are estimated from 20,000 random samples from the standard Dirichlet distribution.
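The critical-value-and-power scheme just described can be sketched as a small Monte Carlo skeleton. The statistic below is a toy stand-in of our own (distance of the sample mean from the standard Dirichlet mean), not the energy or triangle statistics of Chapter 4; all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def toy_statistic(x):
    # Hypothetical stand-in for a goodness-of-fit statistic such as Q_n:
    # distance of the sample mean from the standard Dirichlet(1,1,1) mean.
    return float(np.linalg.norm(x.mean(axis=0) - 1.0 / x.shape[1]))

def empirical_critical_value(n, reps=2000):
    # Simulate under H0 and take the empirical 95th percentile.
    null_stats = [toy_statistic(rng.dirichlet([1, 1, 1], n)) for _ in range(reps)]
    return float(np.quantile(null_stats, 0.95))

def empirical_power(sample_alt, n, crit, reps=1000):
    # Fraction of alternative samples whose statistic exceeds the critical value.
    return sum(toy_statistic(sample_alt(n)) > crit for _ in range(reps)) / reps

crit = empirical_critical_value(n=50)
size = empirical_power(lambda n: rng.dirichlet([1, 1, 1], n), 50, crit)   # near 0.05
power = empirical_power(lambda n: rng.dirichlet([3, 1, 1], n), 50, crit)  # much larger
```

When the alternative equals the null, the rejection frequency estimates the size of the test; against a shifted Dirichlet it estimates the power.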

6.1.1 Energy and Triangle Tests

First, we compute the empirical critical value for the energy or triangle tests.

1. Generate a random sample 푋1, ···, 푋푛 from the standard Dirichlet or Dirichlet distribution.
2. Compute estimates of the parameters based on the MLE.
3. Standardize the data using the estimated parameters.
4. Compute the energy statistic (or triangle statistics) on the transformed data.
5. Take the critical value to be the empirical 95th percentile of the energy statistic (or triangle statistics) from step 4.

For a given alternative distribution, the empirical power is computed as follows.

1. Generate 1,000 random samples of size 푛 from the specified alternative distribution.
2. Standardize each sample using formula (4.3.1).
3. Calculate the energy statistic 푄푛 using formula (4.4.1) and the three triangle statistics 푈3, (푈1 − 1/3)², and (푈1 − 1/3)² + (푈3 − 1/3)² using formula (4.5.1).
4. Set 푁 equal to the number of test statistics that are greater than the critical value.
5. Compute the estimated power as 푁/1000.

For the energy and triangle tests, we calculate the estimated power but not p-values.

6.1.2 Distance Covariance Test

For the random vector (푋1, 푋2, ···, 푋푘) from a 푘-dimensional Dirichlet distribution, the transformed variables 푌1 = 푋1, 푌2 = 푋2/(1 − 푋1), 푌3 = 푋3/(1 − 푋1 − 푋2), ···, 푌푘 = 푋푘/(1 − 푋1 − 푋2 − ··· − 푋푘−1) are mutually independent for every permutation of the set {푋1, 푋2, ···, 푋푘}. Let 휃̂ be a two-sample statistic for testing multivariate independence. Before we construct the distance covariance goodness-of-fit test, we need to introduce the permutation test of independence of two random vectors. Suppose that 푋 is a 푝-dimensional sample, 푌 is a 푞-dimensional sample, and 푍 = (푋, 푌). Then 푍 is a (푝 + 푞)-dimensional sample. Let 휈1 be the row labels of the 푋 sample and let 휈2 be the row labels of the 푌 sample. Then (푍, 휈1, 휈2) is the sample from the joint distribution of 푋 and 푌. The permutation test procedure permutes the row indices of the 푌 sample. The approximate permutation test procedure for independence of two random vectors is as follows, with 휃̂ (here dCov²_푛, using formula (5.1.2)) the two-sample statistic for testing independence.

1. Compute the observed test statistic 휃̂(푋, 푌) = 휃̂(푍, 휈1, 휈2).
2. For each replicate, indexed 푏 = 1, ···, 퐵:
   (a) generate a random permutation 휋푏 = 휋(휈2);
   (b) calculate the statistic 휃̂^(푏) = 휃̂(푋, 푌*) = 휃̂(푍, 휈1, 휋(휈2)).
3. Calculate the ASL (achieved significance level) by

푝̂ = (1 + Σ_{푏=1}^{퐵} 퐼(휃̂^(푏) ≥ 휃̂)) / (퐵 + 1).
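The permutation procedure above can be sketched in a few lines. The dissertation uses dcov.test from the R energy package; the squared sample distance covariance below is our own minimal Python implementation of the Székely–Rizzo double-centering formula, assuming NumPy:

```python
import numpy as np

def _double_centered(a):
    a = a.reshape(len(a), -1)                 # treat 1-D input as (n, 1)
    d = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=2)
    return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

def dcov2(x, y):
    """Squared sample distance covariance dCov^2_n (Szekely and Rizzo)."""
    A = _double_centered(np.asarray(x, dtype=float))
    B = _double_centered(np.asarray(y, dtype=float))
    return float((A * B).mean())

def perm_asl(x, y, B=199, seed=None):
    """ASL = (1 + #{theta^(b) >= theta_hat}) / (B + 1),
    permuting the row labels of the Y sample."""
    rng = np.random.default_rng(seed)
    t0 = dcov2(x, y)
    y = np.asarray(y)
    exceed = sum(dcov2(x, y[rng.permutation(len(y))]) >= t0 for _ in range(B))
    return (1 + exceed) / (B + 1)
```

With strongly dependent samples the ASL is small; with independent samples it is typically large.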

The permutation test procedure for the distance covariance goodness-of-fit test is as follows.

1. Generate 1,000 random samples of size 푛 from the specified alternative distribution. For each sample {푋1, 푋2, ···, 푋푘}, repeat steps 2 to 5.
2. For each of the 푘! permutations of the sample {푋1, 푋2, ···, 푋푘}, repeat steps 3 and 4.
3. Compute 푌1 = 푋1, 푌2 = 푋2/(1 − 푋1), 푌3 = 푋3/(1 − 푋1 − 푋2), ···, 푌푘 = 푋푘/(1 − 푋1 − 푋2 − ··· − 푋푘−1).
4. Compute the 푘 − 1 observed test statistics 휃̂(푌1, (푌2, ···, 푌푘)), 휃̂(푌2, (푌3, ···, 푌푘)), ···, 휃̂(푌푘−1, 푌푘), and get the ASL for each based on the above approximate permutation test procedure for independence.
5. Reject 퐻0 at significance level 훼 if any of the (푘 − 1)푘! ASLs (achieved significance levels) is less than 훼/((푘 − 1)푘!).
6. Set 푁 equal to the number of times 퐻0 is rejected.
7. Compute the estimated power as 푁/1000.

For more details about the permutation test for independence, see Rizzo (2008). For the three- and four-dimensional Dirichlet distributions, based on the above procedures, the empirical type I error rate of the distance covariance test is 0.046.
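The decision rule in steps 5–7 amounts to a small helper; a sketch (function names are ours):

```python
import math

def reject_dirichlet_h0(asls, alpha, k):
    """Reject H0 if any of the (k - 1) * k! achieved significance levels
    falls below the Bonferroni-corrected level alpha / ((k - 1) * k!)."""
    m = (k - 1) * math.factorial(k)
    if len(asls) != m:
        raise ValueError("expected (k - 1) * k! ASL values")
    return any(p < alpha / m for p in asls)

def estimated_power(rejections, replicates=1000):
    """Estimated power = N / 1000, where N counts rejections of H0."""
    return sum(rejections) / replicates
```

For 푘 = 3 there are 12 ASLs and the per-hypothesis threshold is 0.05/12 ≈ 0.0042; a rejection frequency of 46 out of 1000 null samples reproduces the empirical type I error rate 0.046 quoted above.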

6.2 Contaminated Dirichlet Distribution

We introduce a class of alternative distributions below that have the same support as a Dirichlet distribution.

Definition 6.1. A random vector (푌1, 푌2, ···, 푌푘) has a contaminated Dirichlet distribution with parameters (훼1, 훼2, ···, 훼푘+1) if the pdf of the random vector is

푓(푦1, 푦2, ···, 푦푘) = 푓1(푦1, 푦2, ···, 푦푘−1, 푦*푘) · 푦*푘 / 푦푘,

0 ≤ 푦1, 푦2, ···, 푦푘−1 < 1, 0 < 푦푘 ≤ (1 − 푦1 − ··· − 푦푘−1)² / (1 − 푦1 − ··· − 푦푘−2),

where 푓1(푦1, 푦2, ···, 푦푘) is the density function of a Dirichlet distribution with parameters (훼1, 훼2, ···, 훼푘+1) and 푦*푘 = 푦푘(1 − 푦1 − ··· − 푦푘−2) / (1 − 푦1 − ··· − 푦푘−1).

The following transformation method can be applied to generate a sample from a contaminated Dirichlet distribution with parameters (훼1, 훼2, ···, 훼푘+1). If the random variables 푋1, ···, 푋푘 follow a Dirichlet distribution with parameters (훼1, 훼2, ···, 훼푘+1), then

푌푖 = 푋푖, 푖 = 1, 2, ···, 푘 − 1,

푌푘 = 푋푘 (1 − 푋1 − 푋2 − ··· − 푋푘−1) / (1 − 푋1 − 푋2 − ··· − 푋푘−2)

has a contaminated Dirichlet distribution with parameters (훼1, 훼2, ···, 훼푘+1). The first 푘 − 1 components have the same marginal beta distributions as the corresponding components of the Dirichlet distribution with parameters (훼1, 훼2, ···, 훼푘+1). Only the last component is contaminated; compared with the last component of the corresponding Dirichlet distribution, its values become relatively smaller, and these small values have high density. From the density contour plots of Example 6.1 and Example 6.2, it is easy to see that small values of the last component have high density.
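The transformation gives a direct sampler for the contaminated Dirichlet distribution; a sketch assuming NumPy (the function name is ours):

```python
import numpy as np

def rcontaminated_dirichlet(alpha, n, seed=None):
    """Draw n vectors (Y1, ..., Yk): Y_i = X_i for i < k and
    Y_k = X_k (1 - X_1 - ... - X_{k-1}) / (1 - X_1 - ... - X_{k-2}),
    where (X_1, ..., X_k) is Dirichlet(alpha_1, ..., alpha_{k+1})."""
    rng = np.random.default_rng(seed)
    x = rng.dirichlet(alpha, n)[:, :-1]   # keep the k explicit components
    k = x.shape[1]
    s = np.cumsum(x, axis=1)
    denom = 1.0 - s[:, k - 3] if k >= 3 else np.ones(n)
    y = x.copy()
    y[:, -1] = x[:, -1] * (1.0 - s[:, k - 2]) / denom
    return y

# k = 2 case of Example 6.1: Y1 = X1 and Y2 = X2 (1 - X1).
y = rcontaminated_dirichlet([1, 1, 1], 20000, seed=7)
```

For the bivariate case the sampled moments should match Example 6.1: 푌1 has mean 1/3 and 푌2 has mean 1/4, with 푌2 supported on (0, (1 − 푌1)²].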

Example 6.1 (Power comparison of tests of bivariate Dirichlet distributions).

Let two continuous random variables 푌1 and 푌2 have a joint density given by

푓(푦1, 푦2) = 2/(1 − 푦1), 0 < 푦1 < 1, 0 < 푦2 ≤ (1 − 푦1)².

The marginal distribution of 푌1 is the beta distribution with parameters 1 and 2. The marginal

distribution of 푌2 has the density function

푓(푦2) = 2 log(1/√푦2), 0 < 푦2 < 1.

The random variable 푌2 has a mean of 1/4 and a variance of 7/144. The marginal distribution of 푌1 is the same as the marginal distribution of the first component of the standard Dirichlet distribution

with parameters (1, 1, 1) and only the marginal distribution of 푌2 is different from the marginal distribution of the second component; that is, the last component is contaminated, so we call the

distribution of 푌1 and 푌2 a contaminated Dirichlet distribution. We have the following transformation method for generating the random variables 푌1 and 푌2. If the random variables 푋1 and 푋2 follow the bivariate Dirichlet distribution with the parameters (1, 1, 1), then

푌1 = 푋1,

푌2 = 푋2(1 − 푋1)

have the joint pdf

푓(푦1, 푦2) = 2/(1 − 푦1), 0 ≤ 푦1 ≤ 1, 0 ≤ 푦2 ≤ (1 − 푦1)².

If the variables 푍1 and 푍2 follow the Dirichlet distribution with parameters (1, 3/4, 5/4), then they have the same mean and variance as the variables 푌1 and 푌2.
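The stated moments of 푌2 follow from its marginal density 푓(푦2) = 2 log(1/√푦2) = −log 푦2, and can be checked numerically (assuming SciPy's quad):

```python
import numpy as np
from scipy.integrate import quad

f = lambda y: -np.log(y)                          # density of Y2 on (0, 1)

total, _ = quad(f, 0, 1)                          # integrates to 1
mean, _ = quad(lambda y: y * f(y), 0, 1)          # E[Y2] = 1/4
second, _ = quad(lambda y: y ** 2 * f(y), 0, 1)   # E[Y2^2] = 1/9
variance = second - mean ** 2                     # 1/9 - 1/16 = 7/144
```

The logarithmic singularity at 0 is integrable, so adaptive quadrature recovers the exact moments to high accuracy.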

Figure 6.1: Contour plot of density function of variables 푌1 and 푌2.

Figure 6.1 shows the contour plot of the density function of variables 푌1 and 푌2. Figure 6.2 shows the contour plot of the density function of a Dirichlet distribution with parameters (1, 3/4, 5/4). We compare the empirical power of the distance covariance test, the energy test, and the triangle tests.

If we denote the observed Type I error rate by 푝̂, then an estimate of se(푝̂) is 푠푒̂(푝̂) = √(푝̂(1 − 푝̂)/푚) ≤ 0.5/√푚. Here 푚 is the number of replicates for the empirical power and we set 푚 = 1000. Let 풩 denote the family of bivariate Dirichlet distributions. Then the test hypotheses are

퐻0 : 퐹푥 ∈ 풩 vs 퐻1 : 퐹푥 ∉ 풩.

Figure 6.2: Contour plot of density function of Dir(1, 3/4, 5/4).

The power comparison for Example 6.1 is summarized in Table 6.1 and in Figure 6.3. The sample sizes are 25, 50, and 100.

Table 6.1: Empirical Power of Five Tests of Dirichlet Against a Contaminated Dirichlet Alternative in Example 6.1

n      DCov    Energy   Triangle.1   Triangle.2   Triangle.3
25     0.643   0.349    0.070        0.051        0.049
50     0.949   0.726    0.071        0.061        0.057
100    1.000   0.930    0.079        0.087        0.064

Since the contaminated Dirichlet distribution differs from its Dirichlet counterpart only in the last component, its shape is similar to that of the Dirichlet, and the triangle tests fail to detect the small change in shape. The distance covariance test, however, measures independence. Even though the shape change in the last component is small, the change in complete neutrality, that is, in the mutual independence structure, is large. The distance covariance test is sensitive to changes in the mutual independence structure of the data, which is why it has high power. The simulation results suggest that the distance covariance test is the most powerful among all five tests against this alternative. The energy test is more powerful than all three triangle tests. The three triangle tests have comparable power

Figure 6.3: Empirical power of five tests of Dirichlet against a contaminated Dirichlet alternative in Example 6.1.

to each other. ◇

Example 6.2 (Power comparison of tests of trivariate Dirichlet distributions).

Three continuous random variables 푌1, 푌2 and 푌3 have a joint density which is given by

푓(푦1, 푦2, 푦3) = 6(1 − 푦1)/(1 − 푦1 − 푦2), 0 < 푦1 < 1, 0 < 푦2 < 1, 0 < 푦3 < (1 − 푦1 − 푦2)²/(1 − 푦1).

The marginal distributions of 푌1 and 푌2 both have the beta distribution with parameters 훼 = 1 and

훽 = 3. The joint distribution of 푌1 and 푌3 has the density function

푓(푦1, 푦3) = 6(1 − 푦1) log √((1 − 푦1)/푦3), and its support is given by

0 < 푦1 < 1, 0 < 푦3 < 1 − 푦1.

The random variable 푌2 has mean 1/4 and variance 7/144. The marginal distributions of 푌1 and 푌2 are the same as those of the corresponding components of the standard Dirichlet distribution

with parameters (1, 1, 1, 1), and only the marginal distribution of 푌3 is different from the marginal distribution of the last component of the standard Dirichlet distribution; that is, the last component

is contaminated, so we call the joint distribution of 푌1, 푌2 and 푌3 a trivariate contaminated Dirichlet distribution. We have the following transformation method for generating random variables 푌1, 푌2 and 푌3. If the random variables 푋1, 푋2 and 푋3 follow the standard Dirichlet distribution with the parameters (1, 1, 1, 1), then

푌1 = 푋1,

푌2 = 푋2,

푌3 = 푋3(1 − 푋1 − 푋2)/(1 − 푋1)

have the joint pdf

푓(푦1, 푦2, 푦3) = 6(1 − 푦1)/(1 − 푦1 − 푦2), 0 < 푦1 < 1, 0 < 푦2 < 1, 0 < 푦3 ≤ (1 − 푦1 − 푦2)²/(1 − 푦1).

The Dirichlet distribution with parameters (훼1 = 1, 훼2 = 0.750, 훼3 = 1.251) has the same mean and variance as the joint distribution of 푌1 and 푌3. Figure 6.4 shows the contour plot of the density function of the variables 푌1, 푌2 and 푌3 in (5.9). The contour plot of the density function of a Dirichlet distribution with parameters (1, 1, 2) is shown in Figure 6.5. Figure 6.6 shows the contour plot of the density function of a Dirichlet distribution with parameters (1, 0.7495, 1.2505). We compare the empirical power of the distance covariance test, the energy test, and the triangle tests. The number of replicates for the empirical power is 1000. Let 풩 denote the family of trivariate

Figure 6.4: Contour plot of density function of variables 푌1, 푌2 and 푌3 in (5.9).

Figure 6.5: Contour plot of density function of Dir(1, 1, 2).

Dirichlet distributions. Then the test hypotheses are

퐻0 : 퐹푥 ∈ 풩 vs 퐻1 : 퐹푥 ∉ 풩.

The power comparison for Example 6.2 is summarized in Table 6.2 and Figure 6.7.

Figure 6.6: Contour plot of density function of Dir(1, 0.7495, 1.2505).

Table 6.2: Empirical Power of Five Tests of Dirichlet Against a Contaminated Dirichlet Alternative in Example 6.2

n      DCov    Energy   Triangle.1   Triangle.2   Triangle.3
25     0.367   0.261    0.050        0.048        0.051
50     0.839   0.508    0.053        0.054        0.058
100    0.998   0.773    0.061        0.071        0.061

The simulation results suggest that the distance covariance test is the most powerful among all five tests against this alternative. The energy test is more powerful than all three triangle tests. The three triangle tests have comparable power to each other. ◇

We can generalize the distributions in Example 6.1 and Example 6.2 to multivariate contaminated Dirichlet distributions.

Table 6.3: Empirical Power of Five Tests of Dirichlet Against a Contaminated Dirichlet Alternative in Example 6.3

n      DCov    Energy   Triangle.1   Triangle.2   Triangle.3
25     0.887   0.437    0.074        0.050        0.048
50     0.996   0.849    0.080        0.061        0.053
100    1.000   0.985    0.126        0.148        0.089

Example 6.3 (Power comparison of tests of trivariate Dirichlet distributions (3,5,4,2)).

Figure 6.7: Empirical power of five tests of Dirichlet against a contaminated Dirichlet alternative in Example 6.2

Following Definition 6.1, we compared the power of the five tests against a multivariate contaminated Dirichlet distribution with the parameter vector (3, 5, 4, 2). Again, the number of replicates for the empirical power is 1000. The results are summarized in Table 6.3 and Figure 6.8. The simulation results suggest that the distance covariance test is the most powerful among all five tests against this alternative. The energy test is more powerful than all three triangle tests. The three triangle tests have comparable power to each other.

6.3 Logistic Normal Distribution

The support of the beta distribution is from 0 to 1. The logit normal distribution has the same support as the beta distribution. If the random variable 푌 follows the normal distribution, then

푋 = 푒^푌/(1 + 푒^푌) follows the logit normal distribution. If 푋 follows the logit normal distribution, then 푌 = logit(푋) = log(푋/(1 − 푋)) follows the normal distribution. The multivariate logit normal distribution is known as the logistic normal distribution. If the random

Figure 6.8: Empirical power of five tests of Dirichlet against a contaminated Dirichlet alternative in Example 6.3.

vector 푌 follows the multivariate normal distribution 푁(흁, 횺), then the transformed random vector

푋 = (푒^{푦1}/(1 + Σ_{푖=1}^{푑−1} 푒^{푦푖}), ..., 푒^{푦푑−1}/(1 + Σ_{푖=1}^{푑−1} 푒^{푦푖}), 1/(1 + Σ_{푖=1}^{푑−1} 푒^{푦푖}))

follows the logistic normal distribution. If the 푑-dimensional random vector 푋 follows the logistic normal distribution, then the transformed random vector

푌 = (log(푥1/푥푑), ..., log(푥푑−1/푥푑))

follows the multivariate normal distribution. The logit normal distribution is distinct from the beta distribution (Hinde, 2014). However, the logistic normal distribution is an alternative to the Dirichlet distribution. The logit normal distribution has been applied in random effects models for binary data, and the logistic normal distribution was applied by Aitchison and Begg (1976) for statistical diagnosis/discrimination.
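The two maps are exact inverses of each other, which a short round-trip check makes concrete (assuming NumPy; function names are ours):

```python
import numpy as np

def to_composition(y):
    """(y1, ..., y_{d-1}) normal  ->  d-part logistic normal composition."""
    e = np.exp(y)
    denom = 1.0 + e.sum(axis=-1, keepdims=True)
    return np.concatenate([e / denom, 1.0 / denom], axis=-1)

def to_normal(x):
    """Inverse (log-ratio) map: y_i = log(x_i / x_d)."""
    x = np.asarray(x, dtype=float)
    return np.log(x[..., :-1] / x[..., -1:])

rng = np.random.default_rng(11)
y = rng.normal(size=(1000, 3))   # trivariate normal, identity covariance
x = to_composition(y)            # four-part logistic normal sample
```

Each row of x is a positive composition summing to one, so the logistic normal shares its support with the four-part Dirichlet.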

Example 6.4 (Power comparison of tests of logistic normal distributions).

We compare the empirical power of the five proposed goodness-of-fit tests when the sampled population has a logistic normal distribution and the null distribution is the Dirichlet distribution. In this experiment, we generated 1000 random samples of size 25, 50, and 100 from the four-dimensional logistic normal distribution corresponding to the trivariate normal distribution with mean vector (0, 0, 0) and identity covariance matrix. Results for this example are summarized in Table 6.4 and Figure 6.9.

Table 6.4: Empirical Power of Five Tests of Dirichlet Against Logistic Normal Alternative in Example 6.4

n      DCov    Energy   Triangle.1   Triangle.2   Triangle.3
25     0.622   0.551    0.098        0.125        0.138
50     0.951   0.877    0.139        0.159        0.184
100    1.000   0.996    0.223        0.249        0.311

The simulation results suggest that the distance covariance test is the most powerful among all five tests against this alternative. The energy test is more powerful than all three triangle tests. The three triangle tests have comparable power to each other. ◇

Figure 6.9: Empirical power of five tests of Dirichlet against logistic normal alternative in Example 6.4.

CHAPTER 7 MAIN RESULTS FOR GENERALIZED DIRICHLET DISTRIBUTIONS

7.1 Generalized Dirichlet Distribution

A random vector (푋1, 푋2, ···, 푋푘) follows the generalized Dirichlet distribution with parameters 훼1, 훽1, 훼2, 훽2, ..., 훼푘, 훽푘 > 0, if it has the pdf

푓(푥1, 푥2, ..., 푥푘) = ∏_{푖=1}^{푘} [Γ(훼푖 + 훽푖)/(Γ(훼푖)Γ(훽푖))] × ∏_{푖=1}^{푘−1} 푥푖^{훼푖−1} (1 − Σ_{푗=1}^{푖} 푥푗)^{훽푖−(훼푖+1+훽푖+1)} · 푥푘^{훼푘−1} (1 − Σ_{푗=1}^{푘} 푥푗)^{훽푘−1},

where 0 < 푥푖, 푖 = 1, ..., 푘, and 푥1 + 푥2 + ··· + 푥푘 < 1. If 훽푖 = 훼푖+1 + 훽푖+1, 푖 = 1, 2, ···, 푘 − 1, then the generalized Dirichlet distribution reduces to the Dirichlet distribution. The variables of the generalized Dirichlet distribution are neutral but not completely neutral (Connor and Mosimann, 1969).

For two-dimensional generalized Dirichlet distributions, the support is a right triangle with vertices (0,0), (0,1), and (1,0). For three-dimensional generalized Dirichlet distributions, the support is a pyramid with vertices (0, 0, 0), (1, 0, 0), (0, 1, 0), and (0, 0, 1). We can regard the support of a higher-dimensional generalized Dirichlet distribution as a superpyramid.

For the goodness-of-fit test for the generalized Dirichlet distribution, translation and rotation would change the support, which is bounded and fixed in the region 0 < 푥푖, 푖 = 1, ..., 푘, 푥1 + 푥2 + ··· + 푥푘 < 1. Therefore, we do not need to consider affine invariance properties of goodness-of-fit tests for the generalized Dirichlet distribution, even though affine invariance is very important for goodness-of-fit tests for normal distributions. However, we still need one specific kind of transformation to change an arbitrary 푘-dimensional generalized Dirichlet distribution into the corresponding standard Dirichlet distribution with all parameters 1. We calculate the test statistics of the energy or triangle tests on the transformed standard Dirichlet scale, which makes both tests easier to use and improves their efficiency.

For a random vector 푋 that follows the multivariate normal distribution 푁(휇, Σ), we can use the transformation Σ^{−1/2}(푋 − 휇) to change the multivariate normal random vector into the corresponding multivariate standard normal vector. However, for the generalized Dirichlet distribution, we do not have such a simple transformation formula. Similar to the transformation applied in Chapter 4, we can use a two-step transformation technique to transform the 푘-dimensional generalized Dirichlet distribution into the 푘-dimensional standard Dirichlet distribution.

Theorem 7.1. Suppose that the random vector (푋1, ..., 푋푛) is from an 푛-dimensional generalized Dirichlet distribution with parameter vector (훼1, 훽1, 훼2, 훽2, ..., 훼푛, 훽푛). Then (푋*1, ..., 푋*푛) defined by the following two-step transformation is jointly distributed as the standard 푛-dimensional Dirichlet distribution. The first step of the transformation is given by

푌1 = 퐼(푋1 | 훼1, 훽1),
푌2 = 퐼(푋2/(1 − 푋1) | 훼2, 훽2, 푋1),
푌3 = 퐼(푋3/(1 − 푋1 − 푋2) | 훼3, 훽3, 푋1, 푋2),
......
푌푖 = 퐼(푋푖/(1 − 푋1 − 푋2 − ··· − 푋푖−1) | 훼푖, 훽푖, 푋1, ···, 푋푖−1),
......
푌푛 = 퐼(푋푛/(1 − 푋1 − ··· − 푋푛−1) | 훼푛, 훽푛, 푋1, ···, 푋푛−1),

where 퐼(푋푖/(1 − 푋1 − 푋2 − ··· − 푋푖−1) | 훼푖, 훽푖, 푋1, ···, 푋푖−1) is the conditional regularized incomplete beta function of 푋푖 given 푋1, ···, 푋푖−1. The second step of the transformation is given by

푋*1 = 1 − (1 − 푌1)^{1/푛},
푋*2 = (1 − 푌1)^{1/푛} [1 − (1 − 푌2)^{1/(푛−1)}],
푋*3 = (1 − 푌1)^{1/푛} (1 − 푌2)^{1/(푛−1)} [1 − (1 − 푌3)^{1/(푛−2)}],
......
푋*푖 = ∏_{푗=1}^{푖−1} (1 − 푌푗)^{1/(푛−푗+1)} [1 − (1 − 푌푖)^{1/(푛−푖+1)}],
......
푋*푛 = ∏_{푖=1}^{푛−1} (1 − 푌푖)^{1/(푛−푖+1)} 푌푛.

The random vector (푋*1, ..., 푋*푛) follows the standard 푛-dimensional Dirichlet distribution.

Proof. The first step is that we change the 푘-dimensional generalized Dirichlet distribution into

the 푘-dimensional uniform distribution with the support 0 < 푥푖 < 1, 푖 = 1, . . . , 푘, which can be called the supercube uniform distribution because for the 2-dimensional support, it is a square with the unity side and for the 3-dimensional support, it is a cube with the unity side. The second step is that we change the 푘-dimensional supercube uniform distribution into the 푘-dimensional standard Dirichlet distribution. During the first step, the non-uniform density is transformed into the uniform density which is what we desire, but the support is transformed from superpyramid into supercube which is not what we want. During the second step, we keep the uniform density but change the support from supercube back into superpyramid again. The Jacobian determinant is the determinant of the Jacobian matrix which is given by

[ ∂푥1/∂푥*1   ∂푥1/∂푥*2   ...   ∂푥1/∂푥*푛 ]
[ ∂푥2/∂푥*1   ∂푥2/∂푥*2   ...   ∂푥2/∂푥*푛 ]
[ ......                                 ]
[ ∂푥푛/∂푥*1   ∂푥푛/∂푥*2   ...   ∂푥푛/∂푥*푛 ].

However, it is hard to directly derive each element of this Jacobian matrix. On the other hand, the Jacobian determinant is the inverse of the determinant of the matrix given by

[ ∂푥*1/∂푥1   ∂푥*1/∂푥2   ...   ∂푥*1/∂푥푛 ]
[ ∂푥*2/∂푥1   ∂푥*2/∂푥2   ...   ∂푥*2/∂푥푛 ]
[ ......                                 ]
[ ∂푥*푛/∂푥1   ∂푥*푛/∂푥2   ...   ∂푥*푛/∂푥푛 ].

We call the above matrix the inverse Jacobian matrix. Since 푋*푖 is a function of 푋1, 푋2, ..., 푋푖 only, the entries ∂푥*푖/∂푥푗 are all zero when the subscript 푗 > 푖, so the inverse Jacobian matrix becomes

[ ∂푥*1/∂푥1   0          ...   0          ]
[ ∂푥*2/∂푥1   ∂푥*2/∂푥2   ...   0          ]
[ ......                                  ]
[ ∂푥*푛/∂푥1   ∂푥*푛/∂푥2   ...   ∂푥*푛/∂푥푛 ].

Therefore, we only need to calculate the diagonal elements of the inverse Jacobian matrix to obtain its determinant. The 푖th diagonal element is given by

휕푥* 휕푥* 휕푦 휕푥* 휕푦 휕푥* 휕푦 푖 = 푖 1 + 푖 2 + ··· + 푖 푛 , 푖 = 1, 2, . . . , 푛. 휕푥푖 휕푦1 휕푥푖 휕푦2 휕푥푖 휕푦푛 휕푥푖

Since 푌푖 is a function of 푋1, 푋2, ..., 푋푖 only, ∂푦푖/∂푥푗 = 0 when 푗 > 푖. Since 푋*푖 is a function of 푌1, 푌2, ..., 푌푖 only, ∂푥*푖/∂푦푗 = 0 when 푗 > 푖. Hence the 푖th diagonal element becomes

∂푥*푖/∂푥푖 = (∂푥*푖/∂푦푖)(∂푦푖/∂푥푖)
= 푐 ∏_{푗=1}^{푖−1} (1 − 푦푗)^{1/(푛−푗+1)} (1 − 푦푖)^{1/(푛−푖+1) − 1} 푥푖^{훼푖−1} (1 − Σ_{푗=1}^{푖} 푥푗)^{훽푖−1} / (1 − Σ_{푗=1}^{푖−1} 푥푗)^{훼푖+훽푖−1},

where 푐 = [1/(푛 − 푖 + 1)] · Γ(훼푖 + 훽푖)/(Γ(훼푖)Γ(훽푖)). The determinant of the inverse Jacobian matrix is given by

∏_{푖=1}^{푛} ∂푥*푖/∂푥푖 = ∏_{푖=1}^{푛} (∂푥*푖/∂푦푖)(∂푦푖/∂푥푖)
= (1/푛!) ∏_{푖=1}^{푘} [Γ(훼푖 + 훽푖)/(Γ(훼푖)Γ(훽푖))] ∏_{푖=1}^{푘−1} 푥푖^{훼푖−1} (1 − Σ_{푗=1}^{푖} 푥푗)^{훽푖−(훼푖+1+훽푖+1)} · 푥푘^{훼푘−1} (1 − Σ_{푗=1}^{푘} 푥푗)^{훽푘−1}
= 푓(푥1, 푥2, ..., 푥푛)/푛!,

where 푓(푥1, 푥2, ..., 푥푛) is the density function of the generalized Dirichlet distribution with parameters 훼1, 훽1, ..., 훼푘, 훽푘. The Jacobian determinant is the inverse of the determinant of

the inverse Jacobian matrix, so the density function of the vector (푋*1, 푋*2, ..., 푋*푛) is

푓(푥*1, 푥*2, ..., 푥*푛) = 푓(푥1, 푥2, ..., 푥푛) × Jacobian = 푓(푥1, 푥2, ..., 푥푛) × 푛!/푓(푥1, 푥2, ..., 푥푛) = 푛!,

where 0 < 푥*푖, 푖 = 1, ..., 푘, and 푥*1 + 푥*2 + ··· + 푥*푘 < 1. The support of the vector (푋*1, 푋*2, ..., 푋*푛) is the same as the support of the vector (푋1, 푋2, ..., 푋푛). Therefore, the vector (푋*1, 푋*2, ..., 푋*푛) follows the standard Dirichlet distribution.
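Theorem 7.1 can be exercised numerically. The sketch below samples a generalized Dirichlet via its stick-breaking construction and applies the two-step transformation; the first step uses the beta CDF, which is the regularized incomplete beta function (SciPy's beta.cdf). Function names are ours, and the output should behave like a standard Dirichlet, here with every component mean 1/(푛 + 1) = 1/4.

```python
import numpy as np
from scipy.stats import beta

def rgen_dirichlet(alphas, betas, n, seed=None):
    """Z_i ~ Beta(alpha_i, beta_i) independent; X_i = Z_i * prod_{j<i}(1 - Z_j)."""
    rng = np.random.default_rng(seed)
    z = np.column_stack([rng.beta(a, b, n) for a, b in zip(alphas, betas)])
    stick = np.cumprod(np.hstack([np.ones((n, 1)), 1.0 - z[:, :-1]]), axis=1)
    return z * stick

def two_step_transform(x, alphas, betas):
    n, k = x.shape
    # Step 1: Y_i = I(X_i / (1 - X_1 - ... - X_{i-1}) | alpha_i, beta_i) ~ iid U(0,1).
    rem = 1.0 - np.hstack([np.zeros((n, 1)), np.cumsum(x, axis=1)[:, :-1]])
    y = np.column_stack([beta.cdf(x[:, i] / rem[:, i], alphas[i], betas[i])
                         for i in range(k)])
    # Step 2: map the uniform hypercube onto the simplex (density k!).
    xs = np.empty_like(y)
    prod = np.ones(n)
    for i in range(k - 1):
        xs[:, i] = prod * (1.0 - (1.0 - y[:, i]) ** (1.0 / (k - i)))
        prod = prod * (1.0 - y[:, i]) ** (1.0 / (k - i))
    xs[:, -1] = prod * y[:, -1]
    return xs

a, b = [2.0, 1.5, 3.0], [4.0, 2.0, 1.0]
x = rgen_dirichlet(a, b, 20000, seed=5)
xs = two_step_transform(x, a, b)
```

Each transformed row stays inside the simplex, and for 푘 = 3 the standard Dirichlet with all parameters 1 has component means 1/4, which the sample means should match up to Monte Carlo error.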

7.2 Estimation for the Generalized Dirichlet Distribution Parameter Vector

We have 푛 independent and identically distributed variables from the generalized Dirichlet(훼1,

훽1, ..., 훼푘, 훽푘). For 푖 = 1, ..., 푛, let 푥푖 = (푥푖1, ..., 푥푖푘) denote the 푛 observations. Chang and Richards (1991) discussed parameter estimation based on maximum likelihood. We propose maximum likelihood parameter estimation based on the Newton-Raphson algorithm, together with initial values for the iterations and convergence criteria.

The log-likelihood function of the parameter vector 훼 = (훼1, 훽1, 훼2, 훽2, . . . , 훼푘, 훽푘) for the 푛 observations is

푙(훼) = 푛 Σ_{푖=1}^{푘} [log Γ(훼푖 + 훽푖) − log Γ(훼푖) − log Γ(훽푖)]
+ (훼1 − 1) log ∏_{푖=1}^{푛} 푥푖1 + 훽1 log ∏_{푖=1}^{푛} (1 − 푥푖1)
+ Σ_{푗=2}^{푘−1} [ (훼푗 − 1) log ∏_{푖=1}^{푛} 푥푖푗/(1 − Σ_{푚=1}^{푗−1} 푥푖푚) + 훽푗 log ∏_{푖=1}^{푛} (1 − Σ_{푚=1}^{푗} 푥푖푚)/(1 − Σ_{푚=1}^{푗−1} 푥푖푚) ]
+ (훼푘 − 1) log ∏_{푖=1}^{푛} 푥푖푘/(1 − Σ_{푚=1}^{푘−1} 푥푖푚) + (훽푘 − 1) log ∏_{푖=1}^{푛} (1 − Σ_{푚=1}^{푘} 푥푖푚)/(1 − Σ_{푚=1}^{푘−1} 푥푖푚).

The Newton-Raphson method is used in the following iteration to calculate the MLE of the parameter vector (Narayanan, 1990):

(훼̂1, 훽̂1, ..., 훼̂푘, 훽̂푘)′_(푖) = (훼̂1, 훽̂1, ..., 훼̂푘, 훽̂푘)′_(푖−1) + 푉(훼̂_(푖−1)) 푔(훼̂_(푖−1)),

where 푉 is the variance-covariance matrix, with diagonal entries var(훼̂1), ..., var(훽̂푘) and off-diagonal entries such as cov(훼̂1, 훽̂푘), 푔 = (푔1, 푔2, ..., 푔2푘)′ is the gradient vector evaluated at the previous iterate, and 훼̂_(0) = (훼̂1(0), 훽̂1(0), ..., 훼̂푘(0), 훽̂푘(0))′ are the initial estimates. The gradient vector is

푔 = ∇푙(훼), with components

푔1 = 푛Ψ(훼1 + 훽1) − 푛Ψ(훼1) + log ∏_{푖} 푥푖1,
푔2 = 푛Ψ(훼1 + 훽1) − 푛Ψ(훽1) + log ∏_{푖} (1 − 푥푖1),
푔3 = 푛Ψ(훼2 + 훽2) − 푛Ψ(훼2) + log ∏_{푖} 푥푖2/(1 − 푥푖1),
푔4 = 푛Ψ(훼2 + 훽2) − 푛Ψ(훽2) + log ∏_{푖} (1 − 푥푖1 − 푥푖2)/(1 − 푥푖1),
...
푔2푗−1 = 푛Ψ(훼푗 + 훽푗) − 푛Ψ(훼푗) + log ∏_{푖} 푥푖푗/(1 − Σ_{푚=1}^{푗−1} 푥푖푚),
푔2푗 = 푛Ψ(훼푗 + 훽푗) − 푛Ψ(훽푗) + log ∏_{푖} (1 − Σ_{푚=1}^{푗} 푥푖푚)/(1 − Σ_{푚=1}^{푗−1} 푥푖푚),
...
푔2푘−1 = 푛Ψ(훼푘 + 훽푘) − 푛Ψ(훼푘) + log ∏_{푖} 푥푖푘/(1 − Σ_{푚=1}^{푘−1} 푥푖푚),
푔2푘 = 푛Ψ(훼푘 + 훽푘) − 푛Ψ(훽푘) + log ∏_{푖} (1 − Σ_{푚=1}^{푘} 푥푖푚)/(1 − Σ_{푚=1}^{푘−1} 푥푖푚).

The information matrix 퐼 is

퐼 = diag(퐼훼1훽1, 퐼훼2훽2, ..., 퐼훼푘훽푘),

a block diagonal matrix with 2 × 2 diagonal blocks 퐼훼푖훽푖.

The 2 × 2 matrix 퐼훼푖훽푖 is

[ 푛Ψ′(훼푖) − 푛Ψ′(훼푖 + 훽푖)      −푛Ψ′(훼푖 + 훽푖)            ]
[ −푛Ψ′(훼푖 + 훽푖)               푛Ψ′(훽푖) − 푛Ψ′(훼푖 + 훽푖)  ],

where Ψ is the digamma function and Ψ′ is the trigamma function. The variance-covariance matrix is the inverse of the Fisher information matrix. Therefore, the variance-covariance matrix has the following block diagonal form:

푉 = diag(푉훼1훽1, 푉훼2훽2, ..., 푉훼푘훽푘),

where 푉훼푖훽푖 is a 2 × 2 matrix:

푐 [ 푛Ψ′(훽푖) − 푛Ψ′(훼푖 + 훽푖)      푛Ψ′(훼푖 + 훽푖)            ]
  [ 푛Ψ′(훼푖 + 훽푖)               푛Ψ′(훼푖) − 푛Ψ′(훼푖 + 훽푖)  ],

and the constant 푐 is (푛²[Ψ′(훼푖)Ψ′(훽푖) − Ψ′(훼푖)Ψ′(훼푖 + 훽푖) − Ψ′(훽푖)Ψ′(훼푖 + 훽푖)])⁻¹.
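Because the information matrix is block diagonal, the Newton-Raphson iteration decouples into independent 2 × 2 updates, one per (훼푖, 훽푖) pair applied to the 푖th stick-breaking ratio 푍푖 = 푋푖/(1 − 푋1 − ··· − 푋푖−1), which is Beta(훼푖, 훽푖). A self-contained sketch of one such update, assuming NumPy/SciPy (the starting value and step clipping are our own choices, not the dissertation's):

```python
import numpy as np
from scipy.special import digamma, polygamma

def beta_pair_mle(z, max_iter=100, tol=1e-10):
    """Newton-Raphson MLE of one (alpha_i, beta_i) pair from the
    stick-breaking ratios z, which are Beta(alpha_i, beta_i)."""
    n = len(z)
    s1, s2 = np.log(z).sum(), np.log1p(-z).sum()
    a = b = 1.0                              # simple starting value (ours)
    for _ in range(max_iter):
        g = np.array([n * (digamma(a + b) - digamma(a)) + s1,
                      n * (digamma(a + b) - digamma(b)) + s2])
        t = polygamma(1, a + b)              # trigamma at alpha + beta
        info = n * np.array([[polygamma(1, a) - t, -t],
                             [-t, polygamma(1, b) - t]])
        step = np.linalg.solve(info, g)      # theta <- theta + I^{-1} g
        if g @ np.linalg.solve(info, g) < tol:   # the quadratic statistic S
            break
        a = max(a + step[0], 1e-6)           # clip to keep parameters positive
        b = max(b + step[1], 1e-6)
    return a, b

rng = np.random.default_rng(3)
a_hat, b_hat = beta_pair_mle(rng.beta(2.0, 5.0, 5000))
```

The stopping rule uses the quadratic form 푔′퐼⁻¹푔, the statistic 푆 discussed in Section 7.2.4.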

7.2.2 Direct Proof of Global Concavity of the Log-Likelihood Function of the Generalized Dirichlet

The digamma function, denoted by 휓(푥), is the first derivative of log Γ(푥). The trigamma function, denoted by 휓1(푥), is the second derivative of log Γ(푥). Although the global concavity of the likelihood function can be seen indirectly from the fact that the generalized Dirichlet distribution belongs to the exponential family, no direct proof of global concavity has been given. We will prove global concavity directly using the inequality (Ronning, 1986)

휓1(푥) < 푎휓1(푎푥), 푥 > 0, 0 < 푎 < 1.
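The inequality can be spot-checked numerically with SciPy's trigamma function (polygamma of order 1) over a grid of arguments:

```python
import numpy as np
from scipy.special import polygamma

# Ronning's (1986) inequality: psi_1(x) < a * psi_1(a * x) for x > 0, 0 < a < 1.
xs = np.linspace(0.05, 50.0, 200)
avals = np.linspace(0.01, 0.99, 99)
holds = all(polygamma(1, x) < a * polygamma(1, a * x)
            for x in xs for a in avals)
```

The margin shrinks as 푎 approaches 1 (where the two sides coincide), but the strict inequality holds throughout the grid.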

A sufficient condition for global concavity of the log-likelihood function is that 퐴 = −∂²푙/(∂훼 ∂훼′) is positive definite. The first proof that 퐴 is positive definite is based on the leading principal submatrices.

Proof. Since 퐴 is a symmetric matrix, a necessary and sufficient condition for 퐴 to be positive definite is that the determinant of every leading principal submatrix is positive. Let 퐴푖 be the 푖th leading principal submatrix.

Let

푏푖 = ∏_{푗=1}^{⌊푖/2⌋} [(휓1(훼푗) − 휓1(훼푗 + 훽푗))(휓1(훽푗) − 휓1(훼푗 + 훽푗)) − 휓1(훼푗 + 훽푗)²].

Then, up to positive powers of 푛,

det(퐴푖) = (휓1(훼푚) − 휓1(훼푚 + 훽푚)) 푏푖, with 푚 = (푖 + 1)/2, if 푖 is odd,
det(퐴푖) = 푏푖, if 푖 is even.

When 푥 = 훼푗 + 훽푗 and 푎 = 훼푗/푥, we obtain 휓1(훼푗 + 훽푗) < 푎휓1(훼푗) < 휓1(훼푗), 푗 = 1, ···, 푘. When 푥 = 훼푗 + 훽푗 and 푎 = 훽푗/푥, we obtain 휓1(훼푗 + 훽푗) < 푎휓1(훽푗) < 휓1(훽푗), 푗 = 1, ···, 푘. Hence

(휓1(훼푗) − 휓1(훼푗 + 훽푗))(휓1(훽푗) − 휓1(훼푗 + 훽푗)) − 휓1(훼푗 + 훽푗)²
= (1 − 휓1(훼푗 + 훽푗)/휓1(훼푗) − 휓1(훼푗 + 훽푗)/휓1(훽푗)) 휓1(훼푗)휓1(훽푗)
> (1 − 훼푗/(훼푗 + 훽푗) − 훽푗/(훼푗 + 훽푗)) 휓1(훼푗)휓1(훽푗) = 0,

so det(퐴푖) is always positive.

The second proof that 퐴 is positive definite is based on the characteristic roots.

Proof. Since 퐴 is a symmetric matrix, a necessary and sufficient condition for 퐴 to be positive definite is that the characteristic roots of 퐴 are all positive. Let 퐵푖 (푖 = 1, 2, ···, 푘) be the matrix

[ 휓1(훼푖) − 휓1(훼푖 + 훽푖)      −휓1(훼푖 + 훽푖)            ]
[ −휓1(훼푖 + 훽푖)              휓1(훽푖) − 휓1(훼푖 + 훽푖)  ].

The characteristic roots of the matrix 퐵푖 are 휆푖1 and 휆푖2, and it is easy to prove that both roots are positive. The characteristic roots of the matrix 퐴 are exactly the characteristic roots 휆푖1 and 휆푖2 of the blocks 퐵푖 (푖 = 1, 2, ···, 푘); thus all 2푘 characteristic roots of 퐴 are positive.

Estimates from the method of moments were suggested as initial values for the Newton-Raphson method by Dishon and Weiss (1980) for parameter estimation of the beta distribution. Ronning (1986) showed that initial values from the moment method can lead to negative values during the iterations of maximum likelihood estimation of Dirichlet distributions.

Ronning’s suggestion is to set all 훼푗 = min{푋푖푗}, 푖 = 1, ··· , 푛, 푗 = 1, ··· , 푘 to guarantee that the gradient vector is always positive for the parameter estimation of the Dirichlet distribution. Our experience is that this is not only true for the Dirichlet distribution and but also true for the gen- eralized Dirichlet distribution. Thus the initial values for the iteration of the maximum likelihood method for the generalized Dirichlet distribution is to set all 훼푗 = 푚푖푛{푋푖푗}, 푖 = 1, ··· , 푛, 푗 = 1, ··· , 푘.

7.2.4 Iteration Convergence Criteria

The statistic S for the convergence is given by

\[
S=\begin{pmatrix}g_1(\hat\alpha)\\ g_2(\hat\alpha)\\ \vdots\\ g_{2k-1}(\hat\alpha)\\ g_{2k}(\hat\alpha)\end{pmatrix}^{T}
\begin{pmatrix}
\operatorname{var}(\hat\alpha_1) & \cdots & \operatorname{cov}(\hat\alpha_1,\hat\beta_k)\\
\vdots & \ddots & \vdots\\
\operatorname{cov}(\hat\beta_k,\hat\alpha_1) & \cdots & \operatorname{var}(\hat\beta_k)
\end{pmatrix}
\begin{pmatrix}g_1(\hat\alpha)\\ g_2(\hat\alpha)\\ \vdots\\ g_{2k-1}(\hat\alpha)\\ g_{2k}(\hat\alpha)\end{pmatrix}.
\]

Since 푆 is a quadratic form, it can be treated as approximately a chi-squared random variable, and its value can be used as a criterion for convergence. The gradient vector at the MLE is always zero because the maximum likelihood estimate maximizes the log-likelihood. At the beginning of the iteration the estimated parameter vector differs considerably from the MLE, so the gradient vector is far from zero and 푆 is relatively large. As the iteration continues, the estimated parameter vector moves closer to the MLE, the elements of the gradient vector generally become smaller and closer to zero, and hence 푆 decreases. The iteration continues until 푆 becomes less than χ²_{2k}(푐) for a fixed 푐 in the right tail of the chi-squared distribution with 2푘 degrees of freedom. This approach was also used by Choi and Wette (1967) for the gamma distribution and by Narayanan (1991) for the Dirichlet distribution.

In practice, in order to avoid a large number of iterations, we can specify a maximum number of iterations. The iteration continues until 푆 becomes less than χ²_{2k}(푐) for a fixed significance level 푐 or the number of iterations exceeds the maximum we have set. In this paper, 푐 is set to 0.0001 and the maximum number of iterations is set to 100.

We propose another convergence criterion based on the partitioned variance matrix. Since the variance-covariance matrix 푉 is the inverse of the Fisher information matrix for the generalized Dirichlet distribution, we divide the parameter vector into 푘 mutually exclusive parts. The criterion based on the partitioned variance matrix requires that the iteration for each part (훼ᵢ, 훽ᵢ) of the parameter vector continue until the corresponding 푆ᵢ becomes less than χ²₂(푣) for a fixed 푣 in the right tail of the chi-squared distribution with 2 degrees of freedom, or the number of iterations exceeds the maximum we have set. The criterion based on the whole variance matrix requires that the iteration of the whole parameter vector continue until 푆 becomes less than χ²_{2k}(푣) for a fixed 푣 or the number of iterations exceeds the maximum.
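One reading of this stopping rule is sketched below; `stop_iteration` is a hypothetical helper, and the identity covariance matrix and toy gradient vectors are stand-ins for fitted quantities, not output of an actual Newton-Raphson run:

```python
# Chi-squared stopping rule: stop when S = g' V g falls below the
# chi2(2k) threshold at c = 0.0001, or when the iteration cap is hit.
import numpy as np
from scipy.stats import chi2

def stop_iteration(g, V, k, c=1e-4, it=0, max_it=100):
    """True when the quadratic form in the gradient signals convergence."""
    S = float(g @ V @ g)
    return bool(S < chi2.ppf(c, df=2 * k) or it >= max_it)

V = np.eye(4)                                      # stand-in covariance, k = 2
print(stop_iteration(np.full(4, 1e-6), V, k=2))    # tiny gradient: True
print(stop_iteration(np.full(4, 1.0), V, k=2))     # large gradient: False
```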

7.3 Generalized Dirichlet Distribution Model vs. Dirichlet Distribution Model

For 푖 = 1, ···, 푛, let 푥ᵢ = (푥ᵢ₁, ···, 푥ᵢ₍ₖ₊₁₎) be 푛 independent observations from the generalized Dirichlet distribution with parameters (훼₁, 훽₁, ···, 훼ₖ, 훽ₖ). The 푘-dimensional generalized Dirichlet distribution has 2푘 parameters, but the 푘-dimensional Dirichlet distribution has only 푘 + 1 parameters. Given a data set, based on the principle of parsimony, one may want to test whether a generalized Dirichlet distribution model fits the observed data better than a Dirichlet distribution model. The hypothesis of interest is

\[
H_0:\ \alpha\in\omega \quad \text{versus} \quad H_1:\ \alpha\in\Omega\cap\omega^{c},
\]

where 훼 is the parameter vector, 휔 is the restricted (푘 + 1)-dimensional parameter space of the Dirichlet distribution, and Ω is the full 2푘-dimensional parameter space of the generalized Dirichlet distribution. Let 훼̂ denote the maximum likelihood estimator when the parameter space is the full space Ω, and let 훼̃ denote the maximum likelihood estimator when the parameter space is the reduced space 휔. Here 훼̃ is the restricted maximum likelihood estimate, which can be obtained as the maximum likelihood estimate under the assumption that the sampled population has the Dirichlet distribution, because the constraints reduce the generalized Dirichlet distribution to the Dirichlet distribution.

7.3.1 The Likelihood Ratio Test

For a generalized Dirichlet distribution with the parameter vector (훼₁, 훽₁, ···, 훼ₖ, 훽ₖ), the distribution becomes a Dirichlet distribution when 훽ᵢ = 훼ᵢ₊₁ + 훽ᵢ₊₁, 푖 = 1, 2, ···, 푘 − 1. Thus, the null and alternative hypotheses are given by

퐻0 : 훽푖 = 훼푖+1 + 훽푖+1, 푖 = 1, 2, ··· , 푘 − 1

퐻1 : at least one 훽푖 ̸= 훼푖+1 + 훽푖+1, 푖 = 1, 2, ··· , 푘 − 1

The likelihood ratio test statistic is

\[
X^2_{LR} = -2\log\Lambda \;\xrightarrow{\;D\;}\; \chi^2(k-1), \qquad \text{where } \Lambda = \frac{L(\tilde\alpha)}{L(\hat\alpha)}.
\]

When the null hypothesis is true, −2 log Λ has approximately a chi-squared distribution with 푘 − 1 degrees of freedom. A proof of the above result can be found in Rao (1973).
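As a minimal numerical sketch, with made-up log-likelihood values standing in for the fitted restricted (Dirichlet) and full (generalized Dirichlet) models:

```python
# -2 log Lambda referred to chi2(k - 1); the two log-likelihoods are
# hypothetical numbers, not fitted values.
from scipy.stats import chi2

loglik_restricted = -120.4   # log L(alpha_tilde), assumed
loglik_full = -118.9         # log L(alpha_hat), assumed
k = 3                        # dimension of the generalized Dirichlet

X2_LR = -2.0 * (loglik_restricted - loglik_full)
p_value = chi2.sf(X2_LR, df=k - 1)
print(X2_LR, round(p_value, 4))   # 3.0 0.2231, since sf(x; 2) = exp(-x/2)
```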

7.3.2 The Wald Test

The 푘-dimensional generalized Dirichlet distribution has 2푘 parameters, but the 푘-dimensional Dirichlet distribution has only 푘 + 1 parameters. Therefore, there are 푘 − 1 constraints in passing from the generalized Dirichlet distribution to the Dirichlet distribution. The corresponding constraint matrix 푅 is given by
\[
R=\begin{pmatrix}
0 & 1 & -1 & -1 & 0 & \cdots & 0 & 0\\
0 & 0 & 0 & 1 & -1 & -1 & \cdots & 0\\
\vdots & & & & \ddots & & & \vdots\\
0 & \cdots & & & 0 & 1 & -1 & -1
\end{pmatrix}_{(k-1)\times 2k}.
\]

The null and alternative hypotheses are given by

\[
H_0:\ R\alpha = 0 \quad \text{versus} \quad H_1:\ R\alpha \neq 0,
\]

where 푅 is the (푘 − 1) × 2푘 constraint matrix and 훼 is the parameter vector of the 푘−dimensional generalized Dirichlet Distribution. The rank of the constraint matrix R is 푘 − 1. The Wald test statistic is

\[
X^2_{W} = (R\hat\alpha)^{T}\bigl(R\,I^{-1}(\hat\alpha)\,R^{T}\bigr)^{-1}(R\hat\alpha),
\]

where 퐼 is the Fisher information matrix. When the null hypothesis is true, X²_W has approximately a chi-squared distribution with 푘 − 1 degrees of freedom.
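The constraint matrix and the Wald statistic can be sketched as follows; the parameter ordering (훼₁, 훽₁, ···, 훼ₖ, 훽ₖ), the stand-in MLE vector, and the identity information matrix are illustrative assumptions:

```python
# Build the (k-1) x 2k constraint matrix encoding beta_i - alpha_{i+1}
# - beta_{i+1} = 0 and form the Wald quadratic.  alpha_hat and I are
# placeholders, not fitted quantities.
import numpy as np
from scipy.stats import chi2

def constraint_matrix(k):
    R = np.zeros((k - 1, 2 * k))
    for i in range(k - 1):
        R[i, 2 * i + 1] = 1.0      # coefficient of beta_i
        R[i, 2 * i + 2] = -1.0     # coefficient of alpha_{i+1}
        R[i, 2 * i + 3] = -1.0     # coefficient of beta_{i+1}
    return R

k = 3
R = constraint_matrix(k)
alpha_hat = np.array([2.0, 5.0, 1.5, 3.0, 1.2, 2.1])  # hypothetical MLE
I = np.eye(2 * k)                                     # stand-in information
r = R @ alpha_hat
X2_W = float(r @ np.linalg.solve(R @ np.linalg.inv(I) @ R.T, r))
p_value = chi2.sf(X2_W, df=k - 1)
print(R.shape, round(X2_W, 3))
```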

7.3.3 Rao’s Score Test

Rao’s score test is also referred to as the Lagrange multiplier test. Rao’s score test statistic is

\[
X^2_{R} = s(\tilde\alpha)^{T}\,\bigl(I(\tilde\alpha)\bigr)^{-1} s(\tilde\alpha),
\]

where 푠(훼) = (1/푛) 휕퐿(훼)/휕훼 and 퐿(훼) is the likelihood function. If the null hypothesis is true, the restricted maximum likelihood estimate 훼̃ will be near the unrestricted maximum likelihood estimate. Here 푠(훼̂) = 0 always holds because the unrestricted maximum likelihood estimate maximizes the log-likelihood. The larger the score statistic, the stronger the evidence against the null hypothesis. When the null hypothesis is true, X²_R has approximately a chi-squared distribution with 푘 − 1 degrees of freedom (Rao, 1973).

7.3.4 Bootstrap Hypothesis Testing

A comprehensive discussion of bootstrap methods can be found in the books by Efron and Tibshirani (1993) and by Davison and Hinkley (1997). When the null hypothesis is true, the likelihood ratio test statistic, the Wald test statistic, and Rao’s score test statistic each have approximately a chi-squared distribution. However, the general theory provides no practical guidance on how large the sample size should be for the chi-squared approximation to be reliable. Bootstrap methods provide an alternative approach to assessing the variance of an estimator and, in many cases, are more accurate for small samples.

To test a null hypothesis 퐻₀, we specify a test statistic 푇 such that large absolute values of 푇 are evidence against 퐻₀. Let 푇 have observed value 푡, calculated from the observed sample. We generally want to calculate the p-value

\[
p = P(|T| > t \mid H_0).
\]

We need the distribution of 푇 under the null hypothesis to evaluate this probability. However, the null hypothesis is that the sample comes from the Dirichlet family; we do not know the parameter vector in advance, so we replace the population parameter vector with the MLE to construct a parametric bootstrap test.

1. Evaluate the observed test statistic 푡 = 푇(푥₁, ···, 푥ₙ).
2. Find the maximum likelihood estimate of the parameter vector.
3. Generate 퐵 random samples from the target distribution with the MLE as its parameter vector and compute 푡*₁, 푡*₂, ···, 푡*_B.
4. Compute the Monte Carlo p-value.

For a given data set, the procedures for the parametric bootstrap versions of the likelihood ratio test, the Wald test, and Rao’s score test are as follows.

Likelihood Ratio Test Based on Parametric Bootstrapping

Suppose that 휃 is the parameter of interest and 휃̂ is an estimator of 휃. Then the bootstrap likelihood ratio test is obtained as follows.

1. For a given data set, estimate the parameter vector 휃̂ of the Dirichlet distribution and compute the likelihood ratio test statistic X²_LR and the corresponding p-value.
2. For each bootstrap replicate, indexed 푏 = 1, ···, 퐵:
   (a) Generate a sample 푥*⁽ᵇ⁾ = 푥*₁, 푥*₂, 푥*₃, ···, 푥*ₙ from a Dirichlet distribution with the parameter vector 휃̂ from the first step.
   (b) Compute the likelihood ratio test statistic X²⁽ᵇ⁾_LR from the 푏-th bootstrap sample.
3. Set 푁 equal to the number of replicates 푏 for which X²⁽ᵇ⁾_LR ≥ X²_LR.
4. Compute 푝̂ = 푁/퐵.
5. Reject 퐻₀ at the significance level 훼 if 푝̂ ≤ 훼.
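The resampling logic of steps 1 through 5 can be sketched generically. The stand-in statistic below replaces X²_LR, since computing the likelihood ratio would require a full generalized Dirichlet MLE fit; only the bootstrap p-value machinery is illustrated:

```python
# Parametric bootstrap p-value: draw B samples from the fitted null
# model, recompute the statistic, and count exceedances.  alpha_hat is
# an assumed fitted value and statistic() is a deliberately simple
# stand-in for the likelihood ratio statistic.
import numpy as np

rng = np.random.default_rng(1)

def statistic(sample):
    m = sample.mean(axis=0)
    return float(m.max() - m.min())   # spread of the component means

alpha_hat = np.array([2.0, 2.0, 2.0])   # assumed null-model parameters
x = rng.dirichlet(alpha_hat, size=40)   # stand-in for the observed data
t_obs = statistic(x)

B = 500
t_star = [statistic(rng.dirichlet(alpha_hat, size=40)) for _ in range(B)]
p_hat = sum(t >= t_obs for t in t_star) / B
print(round(p_hat, 3))
```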

Wald Test Based on Parametric Bootstrapping

Suppose that 휃 is the parameter of interest and 휃̂ is an estimator of 휃. Then the bootstrap Wald test is obtained as follows.

1. For a given data set, estimate the parameter vector 휃̂ of the Dirichlet distribution and compute the Wald test statistic X²_W and the corresponding p-value.
2. For each bootstrap replicate, indexed 푏 = 1, ···, 퐵:
   (a) Generate a sample 푥*⁽ᵇ⁾ = 푥*₁, 푥*₂, 푥*₃, ···, 푥*ₙ from a Dirichlet distribution with the parameter vector 휃̂ from the first step.
   (b) Compute the Wald test statistic X²⁽ᵇ⁾_W from the 푏-th bootstrap sample.
3. Set 푁 equal to the number of replicates 푏 for which X²⁽ᵇ⁾_W ≥ X²_W.
4. Compute 푝̂ = 푁/퐵.
5. Reject 퐻₀ at the significance level 훼 if 푝̂ ≤ 훼.

Rao’s Score Test Based on Parametric Bootstrapping

Suppose that 휃 is the parameter of interest and 휃̂ is an estimator of 휃. Then the bootstrap Rao’s score test is obtained as follows.

1. For a given data set, estimate the parameter vector 휃̂ of the Dirichlet distribution and compute Rao’s score test statistic X²_R.
2. For each bootstrap replicate, indexed 푏 = 1, ···, 퐵:
   (a) Generate a sample 푥*⁽ᵇ⁾ = 푥*₁, 푥*₂, 푥*₃, ···, 푥*ₙ from a Dirichlet distribution with the parameter vector 휃̂ from the first step.
   (b) Compute Rao’s score test statistic X²⁽ᵇ⁾_R from the 푏-th bootstrap sample.
3. Set 푁 equal to the number of replicates 푏 for which X²⁽ᵇ⁾_R ≥ X²_R.
4. Compute 푝̂ = 푁/퐵.
5. Reject 퐻₀ at the significance level 훼 if 푝̂ ≤ 훼.

CHAPTER 8 APPLICATIONS

8.1 Applications of the Goodness-of-Fit Test of Dirichlet and Generalized Dirichlet Distribution

Example 8.1 (Proportions of Serum Proteins).

The data set came from a study of spurious correlations among proportions by Mosimann (1962) and is assumed to follow the Dirichlet distribution by Mosimann (1962) and Narayanan (1990). The maximum likelihood estimate of the parameter vector was given by Narayanan (1990). Dr. Jean-Marie Demers and Mr. Roch Carbonneau of the Department of Biology, University of Montreal, provided the original data set of measurements of serum proteins in three-week-old white Peking ducklings. Each set of three measurements represents seven or twelve individual ducklings. The ducklings in each set were fed the same diet, but different sets represent different diets. Paper electrophoresis with subsequent readings was used to produce the proportional measurements: 푃₁, 푃₂, and 푃₃ are the proportions of pre-albumin, albumin, and globulins of serum proteins. We will apply the proposed goodness-of-fit tests to the serum protein proportion data.

The p-values of the energy and three triangle goodness-of-fit tests for the generalized Dirichlet distribution are 0.765, 0.445, 0.798, and 0.823, respectively. Hence we cannot reject the null hypothesis that the data come from a generalized Dirichlet distribution, and we can apply the generalized Dirichlet distribution with the parameter vector (훼₁, 훽₁, 훼₂, 훽₂) to fit the data from the duck study. The p-values of the energy and three triangle goodness-of-fit tests for the Dirichlet distribution are 0.413, 0.445, 0.693, and 0.752, respectively. Thus we cannot reject the null hypothesis that the data come from a Dirichlet distribution, and we can apply the Dirichlet distribution 퐷푖푟(훼₁, 훼₂, 훼₃) to fit the data from the duck study.

According to the principle of parsimony, is it necessary to apply the generalized Dirichlet distribution model rather than the simpler Dirichlet distribution model? We construct the following hypothesis:

\[
H_0:\ \beta_1 = \alpha_2 + \beta_2 \quad \text{versus} \quad H_1:\ \beta_1 \neq \alpha_2 + \beta_2.
\]

When the null hypothesis holds, the generalized Dirichlet distribution is the Dirichlet distribution. The p-values of the likelihood ratio, Wald, and Rao’s score tests are 0.797, 0.800, and 0.795, respectively, so the generalized Dirichlet distribution is not necessary for the duck data.

For the distance covariance goodness-of-fit test, according to the property of complete neutrality of the Dirichlet distribution, we need to test the independence between 푝₁ and 푝₂/(1 − 푝₁) and the independence between 푝₂ and 푝₁/(1 − 푝₂); so we construct two hypotheses:

\[
H_{01}:\ p_1 \text{ is independent of } \frac{p_2}{1-p_1},
\qquad
H_{02}:\ p_2 \text{ is independent of } \frac{p_1}{1-p_2}.
\]

We apply the Bonferroni correction to control the familywise type I error at 0.05 and finally we

cannot reject the null hypothesis that the duck data (푝1, 푝2, 푝3) come from the Dirichlet distribution at the significance level of 0.05.
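A sketch of such a neutrality check, with the sample distance covariance of Székely, Rizzo, and Bakirov (2007) implemented directly for univariate samples and a permutation p-value; the Dirichlet data are simulated for illustration:

```python
# Permutation distance-covariance test of H01: p1 independent of
# p2 / (1 - p1).  Under a true Dirichlet this independence holds.
import numpy as np

def dcov(x, y):
    """Sample distance covariance of two 1-d samples (V-statistic form)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()   # double centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    return float(np.sqrt(max((A * B).mean(), 0.0)))

def perm_pvalue(x, y, n_perm=199, seed=0):
    rng = np.random.default_rng(seed)
    obs = dcov(x, y)
    hits = sum(dcov(x, rng.permutation(y)) >= obs for _ in range(n_perm))
    return (1 + hits) / (1 + n_perm)

rng = np.random.default_rng(2)
p = rng.dirichlet([2.0, 3.0, 4.0], size=60)
p1, ratio = p[:, 0], p[:, 1] / (1.0 - p[:, 0])
pval = perm_pvalue(p1, ratio)
print(round(pval, 3))
```

A Bonferroni-adjusted version would run the same test for 퐻₀₂ and compare each p-value with 0.05/2.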

8.2 Application to Goodness-of-Fit Testing of Dirichlet Regression Models

Campbell and Mosimann (1987) originally proposed the Dirichlet distribution for compositional data, and extended the Dirichlet distribution to Dirichlet covariate models, where the compositions from the Dirichlet distribution are influenced by some common covariates. In this model, let 퐲ᵢ = (푦ᵢ₁, ···, 푦ᵢ_d) be the 푖-th observed composition from the Dirichlet distribution with the parameter vector (훼ᵢ₁, ···, 훼ᵢ_d), 푖 = 1, ···, 푛. Let 퐱ᵢ = (푥ᵢ₁, ···, 푥ᵢ_m), 푖 = 1, 2, ···, 푛, be the corresponding observed covariate vector with 푚 components.

The parameters 훼ᵢ = (훼ᵢ₁, ···, 훼ᵢ_d), 푖 = 1, ···, 푛, of the conditional Dirichlet distribution are a function of the corresponding observed covariate vector 퐱ᵢ = (푥ᵢ₁, ···, 푥ᵢ_m), 푖 = 1, ···, 푛. For a given covariate row vector 퐱ᵢ = (푥ᵢ₁, ···, 푥ᵢ_m), each parameter 훼ᵢⱼ is a positive-valued function 훼ᵢⱼ(푥ᵢ₁, ···, 푥ᵢ_m). The response variable vector follows a conditional Dirichlet distribution,

(푦푖1, ··· , 푦푖푑)|(푥푖1, ··· , 푥푖푚) ∼ 퐷푖푟(훼푖1, ··· , 훼푖푑).

The benchmark application of the Dirichlet regression model is the Arctic lake sediment data, introduced by Coakley and Rust (1968) and adapted by Aitchison (1986). The data consist of the compositions of sand, silt, and clay for 39 sediment samples taken at different water depths in the lake. The response (dependent) variable is a vector containing the three compositional variables sand, silt, and clay; with 39 compositions, the response matrix 퐲 is a 39 × 3 matrix. The covariate is the water depth, and the corresponding covariate matrix is a 39 × 1 matrix. First, we take a look at the scatterplots of the three observed compositional components (sand, silt, and clay) against water depth in Figure 8.1. The scatterplots show that there is some dependency between sediment composition and water depth: there is a negative association between the proportion of sand and water depth (the deeper the water, the smaller the proportion of sand) and a positive association between clay and water depth. Aitchison (1986) proposed the following log ratio model, in which the covariate is the logarithm of the water depth.

\[
\log(\mathrm{sand}/\mathrm{clay}) = 9.70 - 2.74\,\log(\mathrm{depth}),
\]
\[
\log(\mathrm{silt}/\mathrm{clay}) = 4.80 - 1.10\,\log(\mathrm{depth}).
\]

However, Hijazi and Jerbigan (2007) stated that the use of the logarithm of the water depth is based only on Aitchison’s modeling method and not on any geological background, so the log transformation does not make the water depth more interpretable.

Hijazi and Jerbigan (2007) proposed the following Dirichlet quadratic regression model. The sediment compositional data come from the conditional Dirichlet distribution with the parameter vector (훼ᵢ₁, 훼ᵢ₂, 훼ᵢ₃), 푖 = 1, ···, 39. In order to keep the parameters positive, the model applies the exponential of the quadratic regression term.

\[
\alpha_{i1} = \exp\!\left(5.240 - 0.072\,\mathrm{depth} + 0.001\,\mathrm{depth}^2\right),
\]
\[
\alpha_{i2} = \exp\!\left(3.426 - 0.203\,\mathrm{depth} + 0.011\,\mathrm{depth}^2\right),
\]
\[
\alpha_{i3} = \exp\!\left(3.635 - 0.391\,\mathrm{depth} + 0.013\,\mathrm{depth}^2\right).
\]
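The fitted parameter curves can be evaluated directly from the reported coefficients; the depth value below is arbitrary, with depth in the original units of the Arctic lake data:

```python
# Evaluate the fitted quadratic Dirichlet parameters alpha_i(depth)
# using the coefficients reported above.
import numpy as np

coef = np.array([[5.240, -0.072, 0.001],
                 [3.426, -0.203, 0.011],
                 [3.635, -0.391, 0.013]])

def alphas(depth):
    d = np.array([1.0, depth, depth ** 2])
    return np.exp(coef @ d)     # exponentiation keeps all parameters positive

a = alphas(10.0)                # arbitrary depth for illustration
print(np.round(a, 3))
```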

Hijazi and Jerbigan (2007) showed that, according to the criterion of the sum of compositional errors of Aitchison (1986) and the total variability evaluated using the moving-windows technique of Hijazi and Jerbigan (2007), the Dirichlet quadratic regression model performs better than the log ratio model; the Dirichlet regression model is an informative alternative to the log ratio model. Hijazi and Jerbigan (2007) also evaluated model performance based on the fit of the marginal distributions of the sediment composition in the Arctic lake data. Figure 8.2 shows the observed and fitted compositions in the Arctic lake sediments; the Dirichlet quadratic model fits each compositional component well. Figure 8.3 shows the residual plot for each of the three compositional components, which also indicates that the Dirichlet quadratic model fits each compositional component (marginal distribution) of the sediments well. However, a model that fits the marginal distributions of the sediment composition well does not necessarily fit the joint distribution of the sediment composition well. A similar example: for a sample from a specific multivariate distribution, univariate normal distributions may fit each component of the sample well, which does not imply that the sample as a whole came from a multivariate normal distribution.

We propose a new test, based on transforming samples from the conditional Dir(훼ᵢ₁, ···, 훼ᵢ_d) to the standard Dir(1, ···, 1), to evaluate the overall performance of fitting all compositional components in the Dirichlet regression model.

Suppose 퐲ᵢ = (푦ᵢ₁, ···, 푦ᵢ_d) is an observation from the Dirichlet distribution with the parameter vector (훼ᵢ₁, ···, 훼ᵢ_d), 푖 = 1, ···, 푛. The parameters (훼ᵢ₁, ···, 훼ᵢ_d), 푖 = 1, 2, ···, 푛, of the Dirichlet distribution are based on the corresponding observed covariate vector 퐱ᵢ = (푥ᵢ₁, ···, 푥ᵢ_m).

1. Transform each observation (푦ᵢ₁, ···, 푦ᵢ_d) from the conditional Dirichlet distribution with the parameter vector (훼ᵢ₁, ···, 훼ᵢ_d) to the corresponding observation (푦*ᵢ₁, ···, 푦*ᵢ_d) following the standard Dirichlet distribution.
2. Conduct the goodness-of-fit test of the transformed sample from the standard Dirichlet distribution.

Since we do not know the parameter vector (훼ᵢ₁, ···, 훼ᵢ_d), 푖 = 1, ···, 푛, of the conditional Dirichlet distribution, we apply the parameter estimates from the Dirichlet regression model to transform the corresponding observations. However, the transformed observations based on the parameter estimates do not exactly follow the standard Dirichlet distribution. We apply the following parametric bootstrap method, based on the distance covariance goodness-of-fit test, to show that replacing the conditional Dirichlet parameter vector with the corresponding estimated parameter vector is practical and has a very acceptable type I error.

1. Compute the estimated Dirichlet parameter vector (훼̂ᵢ₁, ···, 훼̂ᵢ_d), 푖 = 1, ···, 푛, for each observation from the proposed Dirichlet regression model.
2. Set 퐺 = 0.
3. For each replicate, indexed 푙 = 1, ···, 푁:
   (a) Draw a random data point from Dir(훼̂ᵢ₁, ···, 훼̂ᵢ_d), 푖 = 1, ···, 푛.
   (b) Pool all 푛 random data points to get the 푙-th sample.
   (c) Apply the Dirichlet regression model to the 푙-th random sample to obtain the estimated conditional Dirichlet parameter vectors.
   (d) Transform each vector of proportions of the 푙-th sample by the corresponding estimated parameter vector from the above step.
   (e) Conduct the distance covariance goodness-of-fit test of the Dirichlet distribution for the transformed sample.
   (f) Increase 퐺 by 1 if the above test is rejected at the fixed significance level.
4. Compute the achieved significance level 푝̂ = 퐺/푁.

We apply the above bootstrap method based on the distance covariance goodness-of-fit test to the Arctic lake data, with 푁 = 1000 and the fixed significance level 0.05; the achieved significance level is 푝̂ = 0.045. So we can apply the goodness-of-fit test based on the transformed sample for Dirichlet regression model evaluation and selection.

In the end, we apply the estimated conditional Dirichlet parameter vectors to the Arctic lake data to obtain the transformed unconditional Dirichlet distributed data, then apply the distance covariance goodness-of-fit test to the transformed data. The null hypothesis that the Arctic lake data come from the proposed Dirichlet quadratic regression model is rejected at the significance level of 0.05. The Dirichlet quadratic regression model fits the marginal distributions of the composition well, but it fails to fit the joint distribution of all three compositional components well. We can try different Dirichlet polynomial terms, add more covariates, or apply some variable transformations in the model.

We propose the generalized Dirichlet regression model to fit the Arctic lake data. For the Dirichlet distribution, one component is always negatively correlated with the other components. The generalized Dirichlet distribution (Kotz and Johnson, 2000) has a more flexible covariance structure for fitting compositional data, so the generalized Dirichlet distribution can be more practical and useful; for example, two components of the generalized Dirichlet distribution can be positively correlated.

Let 푆₂ and 푆₁ be the corresponding sand and silt variables in the transformed data. Let 푇₁ = 푆₁, 푇₂ = 푆₂/(1 − 푆₁), 푉₁ = 푆₂, and 푉₂ = 푆₁/(1 − 푆₂). We examine the scatterplots of 푇₁ versus 푇₂ and of 푉₁ versus 푉₂. The scatterplot of 푇₁ and 푇₂ in Figure 8.4 suggests that 푇₁ and 푇₂ are independent, with correlation coefficient −0.089. The p-value of the distance covariance independence test is 0.678, so we cannot reject the null hypothesis that 푇₁ and 푇₂ are independent.

A scatterplot of 푉₁ and 푉₂ in Figure 8.5 suggests that 푉₁ and 푉₂ are moderately associated, with correlation coefficient 0.646. The p-value of the distance covariance independence test (Rizzo, 2008) is 0.001, so we reject the null hypothesis of independence of 푉₁ and 푉₂.

If 푇1 and 푇2 are independent, and 푉1, 푉2 are also independent, then 푆1 and 푆2 follow the Dirichlet distribution (Darroch and Ratcliff, 1971).

When 푇₁ and 푇₂ are independent but 푉₁ and 푉₂ are not, this is a typical feature of the generalized Dirichlet distribution. So we can consider applying the generalized Dirichlet distribution to fit the transformed Arctic lake data, and furthermore the generalized Dirichlet regression model to fit the original Arctic lake data. In that model the composition follows the conditional generalized Dirichlet distribution, whose parameters are functions of the corresponding covariates. We will consider the generalized Dirichlet regression model in future research. For other books on the analysis of compositional data, see Gerald and Raimon (2013) and Pawlowsky-Glahn and Tolosana-Delgado (2013).
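The contrast between the T-ratios and the V-ratios can be illustrated by simulating from the Connor-Mosimann construction 푆₁ = 푍₁, 푆₂ = 푍₂(1 − 푍₁) with independent beta variables; the parameter values are arbitrary:

```python
# Under the generalized Dirichlet construction, T2 = S2/(1 - S1) = Z2
# is independent of T1 = S1 = Z1 by construction, while V1 = S2 and
# V2 = S1/(1 - S2) are generally dependent.
import numpy as np

rng = np.random.default_rng(3)
n = 500
Z1 = rng.beta(2.0, 6.0, n)          # Z1 ~ Beta(a1, b1), arbitrary values
Z2 = rng.beta(4.0, 2.0, n)          # Z2 ~ Beta(a2, b2), independent of Z1
S1, S2 = Z1, Z2 * (1.0 - Z1)

T1, T2 = S1, S2 / (1.0 - S1)
V1, V2 = S2, S1 / (1.0 - S2)

r_T = float(np.corrcoef(T1, T2)[0, 1])   # near zero: independence
r_V = float(np.corrcoef(V1, V2)[0, 1])
print(round(r_T, 3), round(r_V, 3))
```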

8.3 Application to the Influence Diagnostics of the Dirichlet Regression Models

Different methods to detect outliers in compositional data have been proposed by Barceló, Pawlowsky, and Grunsky (1996) and by Baxter (1999). Hijazi (2005) stated that these methods are suitable for unconditional compositions. For conditional compositions, the joint distribution of the quantile residuals can be applied to identify compositions with large Mahalanobis distance as outliers (Mardia, Kent, and Bibby, 1979). When compositions are influenced by some covariates, residuals can be used to identify outliers. We can apply the Dirichlet standardized distribution transformation (4.2) proposed in this dissertation to transform the conditional Dirichlet distributed data (different compositions have different parameter vectors) into unconditional Dirichlet distributed data (all transformed compositions have the same parameter vector), then apply outlier identification methods based on the unconditional Dirichlet distribution to identify the potential outliers. For example, one can apply the joint distribution of the transformed compositional data, which has an unconditional Dirichlet distribution, to identify compositions with large Mahalanobis distance as potential outliers.

Figure 8.1: Arctic lake dataset and corresponding scatterplots.

Figure 8.2: Observed and fitted compositions in the Arctic lake sediments using the Dirichlet quadratic regression model.

Figure 8.3: Residual plots of three compositions in the Arctic lake sediments using the Dirichlet quadratic model.

Figure 8.4: Scatterplot of 푇₁ and 푇₂ from the transformed Arctic lake sediments using the Dirichlet quadratic model.

Figure 8.5: Scatterplot of 푉₁ and 푉₂ from the transformed Arctic lake sediments using the Dirichlet quadratic model.

CHAPTER 9 SUMMARY AND FURTHER DIRECTIONS

The Dirichlet distribution is the multivariate beta distribution and a special case of the generalized Dirichlet distribution. We propose a goodness-of-fit test based on energy distances and goodness-of-fit tests based on interpoint distances. We also propose a goodness-of-fit test based on the property of complete neutrality of the Dirichlet distribution and distance covariance tests. A new algorithm based on distance covariance and permutation to test mutual independence of all components of a random vector in arbitrary dimension is introduced. A formula for transforming a variable following the generalized Dirichlet distribution or the Dirichlet distribution into another variable following a Dirichlet distribution is constructed. For the maximum likelihood estimation of the parameters of the generalized Dirichlet distribution, the initial values of the iteration and the convergence criteria for the Newton-Raphson algorithm have been proposed. The applications of the dissertation include goodness-of-fit tests of the Dirichlet and generalized Dirichlet distributions, model evaluation in Dirichlet regression analysis, and outlier and influential observation identification in the Dirichlet regression model.

Since the distance covariance test has very good performance in measuring independence and in the goodness-of-fit test for the Dirichlet distribution, we will apply the distance covariance goodness-of-fit test to copula models. A copula is a function that joins or couples multivariate distribution functions to their one-dimensional marginal distribution functions (Nelsen, 1999). Copulas enable the modelling of marginal distributions and the dependence structure among the marginal distributions in two steps (Joe, 1997), which makes them widely used in economics, hydrology, insurance, reinsurance, finance, and biology.

Although the goodness-of-fit testing of univariate distributions is well documented, the study of goodness-of-fit tests for copulas has emerged recently as a challenging problem (Berg, 2007). Our future work also includes the derivation of the generalized Dirichlet regression model. The generalized Dirichlet distribution has a more flexible variance-covariance structure than the Dirichlet distribution, so the generalized Dirichlet regression model will be more useful than the Dirichlet regression model. The generalized Dirichlet distribution is widely used in data mining and machine learning. The parameter estimation based on maximum likelihood and other methods will be derived. Since there is no closed-form MLE, numerical methods will be applied, and the criteria for setting the initial values for the iteration of the numerical methods will be discussed. The comparison of the Dirichlet regression and generalized Dirichlet regression models will be considered in future research.

BIBLIOGRAPHY

Aitchison, J. (1986). The Statistical Analysis of Compositional Data. New York: Chapman and Hall.

Aitchison, J. and C. B. Begg (1976). Statistical diagnosis when the cases are not classified with certainty. Biometrika 63, 1–12.

Barcelo´, C., V. Pawlowsky, and E. Grunsky (1996). Some aspects of transformations of composi- tional data and the identification of outliers. Mathematical Geology 28, 501–518.

Bartoszynski, R., D. K. Pearl, and J. Lawrence (1997). A multidimensional goodness-of-fit test based on interpoint distances. Journal of the American Statistical Association 92, 577–586.

Baxter, M. (1999). Detecting multivariate outliers in artefact compositional data. Archaeometry 41, 321–338.

Berg, D. (2007). Copula goodness-of-fit testing: an overview and power comparison. Statistical Research Report of University of Oslo and Norwegian Computing Center 5, 1–25.

Campbell, G. and J. Mosimann (1987). Multivariate methods for proportional shape. ASA Pro- ceedings of the Section on Statistical Graphics 1, 10–17.

Chang, W. Y., G. R. D. and D. P. Richards (1991). Structural properties of the generalized Dirichlet distributions. Contemporary Mathematics 1, 1–16.

Choi, S. C. and R. Wette (1967). Maximum likelihood estimation of the parameters of the gamma distribution and their bias. Technometrics 11, 683–689.

Coakley, J. P. and B. R. Rust (1968). Sedimentation in an arctic lake. Sedimentary Petrology 38, 1290–1300.

Connor, J. and J. Mosimann (1969). Concepts of independence for proportions with a generalization of the Dirichlet distribution. Journal of the American Statistical Association 64, 194–206.

D’Agostino, R. B. and M. A. Stephens (1986). Goodness-of-Fit Techniques. New York: Marcel Dekker.

Darroch, J. N. and D. Ratcliff (1971). A characterization of the Dirichlet distribution. Journal of the American Statistical Association 66, 641–643.

Davison, A. C. and D. V. Hinkley (1997). Bootstrap Methods and Their Applications. Cambridge: Cambridge University Press.

Dishon, M. and G. H. Weiss (1980). Small sample comparison of estimation methods for the beta distribution. Journal of Statistical Computation and Simulation 11, 1–11.

Efron, B. and R. J. Tibshirani (1993). An Introduction to the Bootstrap. New York: Chapman and Hall.

Fisher, R. A. (1928). On a property connecting the 휒2 measure of discrepancy with the method of maximum likelihood. Atti Congresso Internazionale dei Mathematici 6, 95–100.

Gerald, V. B. and T. D. Raimon (2013). Analyzing Compositional Data with R. New York: Wiley.

Hijazi, R. H. (2005). Residuals and diagnostics in Dirichlet regression. ASA Section on General Methodology 1, 1190–1196.

Hijazi, R. H. and R. W. Jerbigan (2007). Modelling compositional data using Dirichlet regression models. Journal of Applied Probability & Statistics 4, 77–91.

Hinde, J. (2014). International Encyclopedia of Statistical Science. Berlin: Springer.

Joe, H. (1997). Multivariate Models and Dependence Concepts. London: Chapman and Hall.

Kotz, S., N. Balakrishnan, and N. L. Johnson (2000). Continuous Multivariate Distributions, Models and Applications. New York: John Wiley.

Maa, J. F., D. K. Pearl, and R. Bartoszynski (1996). Reducing multidimensional two-sample data to one-dimensional interpoint distances. The Annals of Statistics 24, 1069–1074.

Mardia, K. V., J. T. Kent, and J. M. Bibby (1979). Multivariate Analysis. London: Academic Press.

Mosimann, J. E. (1962). On the compound multinomial distribution, the multivariate 훽-distribution, and correlation among proportions. Biometrika 49, 65–81.

Narayanan, A. (1990). Algorithm AS 266: Maximum likelihood estimation of the parameters of the Dirichlet distribution. Journal of the Royal Statistical Society 40, 365–374.

Narayanan, A. (1991). Small sample performance of parameter estimation in the Dirichlet distribution. Communications in Statistics – Simulation and Computation 20, 647–666.

Nelsen, R. B. (1999). An Introduction to Copulas. New York: Springer.

Pawlowsky-Glahn, V., E. J. J. and R. Tolosana-Delgado (2013). Modeling and Analysis of Com- positional Data. New York: Wiley.

Peacock, J. A. (1982). Two-dimensional goodness-of-fit testing in astronomy. Mon. Not. R. Astr. Soc 202, 615–627.

Rao, C. R. (1973). Linear Statistical Inference and Its Applications. New York: John Wiley & Sons.

Rizzo, M. L. (2008). Statistical Computing with R. London: Chapman & Hall/CRC.

Rizzo, M. L. (2002). A New Rotation Invariant Goodness-of-Fit Test. Bowling Green: Bowling Green State University.

Rizzo, M. L. and G. J. Székely (2007). energy: E-Statistics (Energy Statistics): Tests of Fit, Independence, Clustering. R package version 1.6.2.

Ronning, G. (1986). On the curvature of the trigamma function. Journal of Computational and Applied Mathematics 15, 397–399.

Székely, G. J. (2003). ℰ-statistics: the energy of statistical samples. Bowling Green State University, Department of Mathematics and Statistics Technical Report 03–05.

Székely, G. J. and M. L. Rizzo (2007). On the uniqueness of distance covariance. Statistics & Probability Letters 82, 2278–2282.

Székely, G. J., M. L. Rizzo, and N. K. Bakirov (2007). Measuring and testing dependence by correlation of distances. Annals of Statistics 35, 2769–2794.

Warnes, G. R., B. Bolker, and T. Lumley (2014). Various R Programming Tools. R package version 3.4.1.