LOGARITHMIC SERIES DISTRIBUTION AND ITS USE IN ANALYZING DISCRETE DATA

Jeffrey R. Wilson, Arizona State University, Tempe, Arizona 85287

1. Introduction

In a previous paper, Wilson and Koehler (1988) used the generalized Dirichlet-Multinomial model to account for extra variation. The model allows for a second order of pairwise correlation among units, a type of assumption found reasonable in some biological data. In that paper, the two-way crossed generalized Dirichlet-Multinomial model was used to analyze repeated measures on the categorical preferences of insurance customers. The number of respondents was assumed to be fixed and known.

In this paper a generalization of the model is made allowing the number of respondents, m, to be random. Thus both the number of units m and the underlying probability vector are allowed to vary. The model presented here uses the logarithmic series distribution to account for the variation among number of units and the [...] to model the [...]. In particular, the Dirichlet-Multinomial distribution is used to incorporate the two types of variation, and the logarithmic series distribution is used to account for variation among the number of units within a given time period. Ignoring either level of variation leads to underestimation of the true standard errors of estimated proportions.

Tallis (1962) proposed the use of the generalized-multinomial model for dependent multinomials. Wilson and Koehler (1988) extended the model to allow for a second random component. The extended model can be viewed as a multivariate extension of the beta-binomial and correlated binomial models considered by Kupper and Haseman (1978) and Crowder (1978) for binary data. Paul (1987) considered a modification of the beta-correlated binomial as a means of analyzing affected foetuses in litters of live foetuses. Section 2 outlines the generalized-multinomial model. Section 3 discusses the Dirichlet-Multinomial model. Section 4 presents the generalized Dirichlet-Multinomial model. The logarithmic series distribution is presented in Section 5. An extended generalized Dirichlet-Multinomial model is developed in Section 6. Tests for certain hypotheses and fit of the model are developed in Section 7. Parameter estimates are obtained in Section 8.

2. Generalized-Multinomial Model

One way to view the generalized-multinomial model is to consider mJ vectors of outcomes for a set of J units that are simultaneously subjected to a series of m trials. At each trial, each unit is classified as being in exactly one of I mutually exclusive states. Let $X_{ijk}$ take the value 1 if the k-th trial of the j-th unit is observed to be in the i-th state, and zero otherwise. The probability that $X_{ijk}$ takes the value 1 is assumed to be $\pi_i$ for any unit and any trial. Furthermore, for each unit the m trials are identical and independent, so upon summing across trials, $\mathbf{X}_j = (X_{1j\cdot}, X_{2j\cdot}, \ldots, X_{Ij\cdot})'$, the vector of counts for the j-th unit, has a multinomial distribution with vector $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_I)'$ and sample size m. However, responses given by the J units at a particular trial may be correlated, producing a set of J correlated multinomial random vectors, $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_J$.

Tallis (1962) developed a model for this situation, which he called the generalized-multinomial distribution, in which a single parameter, $\rho$, is used to reflect the common dependency between any two of the dependent multinomial random vectors. The distribution of the category total $X_{ij\cdot} = \sum_{k=1}^{m} X_{ijk}$ is binomial with sample size m and parameter $\pi_i$, for each unit.

Tallis formalized the dependencies among unit totals by specifying, for each of the I categories, the joint moment generating function as

$$G_J(\mathbf{u}) = \rho\Big\{\sum_{i=0}^{m} p_i \Big(\prod_{j=1}^{J} e^{u_j}\Big)^{i}\Big\} + (1-\rho)\Big\{\prod_{j=1}^{J} P(e^{u_j})\Big\}, \qquad (2.1)$$

where $P(e^{u_r}) = \sum_{i=0}^{m} p_i e^{i u_r}$ and $\mathbf{u} = (u_1, u_2, \ldots, u_J)'$. The parameter $\rho$ appearing in (2.1) is the correlation coefficient between $X_{ij\cdot}$ and $X_{ij'\cdot}$ for any $j \neq j'$. When $\rho \geq 0$, $G_J(\mathbf{u})$ is a weighted mean of the moment generating functions for a distribution with perfect correlation and one with complete independence, the weights being $\rho$ and $1-\rho$ respectively. Altham (1978) proposed a similar model for a joint moment generating function for correlated binary variables.

Consider the overall vector of category totals $\mathbf{X} = \sum_{j=1}^{J} \mathbf{X}_j$. From the moment generating function in (2.1) it can be shown that $E(\mathbf{X}) = Jm\boldsymbol{\pi}$ and $V(\mathbf{X}) = Jm\{1+(J-1)\rho\}M_{\pi}$ for the generalized-multinomial model, where $M_{\pi} = \mathrm{diag}(\boldsymbol{\pi}) - \boldsymbol{\pi}\boldsymbol{\pi}'$ and $\mathrm{diag}(\boldsymbol{\pi})$ is a diagonal matrix with diagonal elements provided by the vector $\boldsymbol{\pi}$. Consequently, $\hat{\boldsymbol{\pi}} = (Jm)^{-1}\mathbf{X}$ is an unbiased estimator for $\boldsymbol{\pi}$. Tallis (1962) proposed estimators for $\rho$, but he did not discuss techniques for making inferences about $\boldsymbol{\pi}$. We consider here a technique for making such inferences.

One approach is to use the limiting normal distribution of $\mathbf{X}$ as $m \to \infty$. At trial k consider an IJ-dimensional response vector $\mathbf{X}_{k(J)} = (\mathbf{X}'_{1k},$

$\mathbf{X}'_{2k}, \ldots, \mathbf{X}'_{Jk})'$, where $\mathbf{X}_{jk} = (X_{1jk}, X_{2jk}, \ldots, X_{Ijk})'$ is the response vector for the j-th unit at the k-th trial. Define $\mathbf{X}_{(J)} = \sum_{k=1}^{m} \mathbf{X}_{k(J)}$. Since the $\mathbf{X}_{k(J)}$ vectors are independent and the first and second moments of $\mathbf{X}_{k(J)}$ are finite, the multivariate central limit theorem implies that

$$m^{-1/2}(\mathbf{X}_{(J)} - m\boldsymbol{\mu}) \to N_{IJ}(\mathbf{0}, \Sigma), \quad \text{as } m \to \infty,$$

where $\boldsymbol{\mu} = \mathbf{1}_J \otimes \boldsymbol{\pi}$, $\Sigma = Q \otimes M_{\pi}$, $\mathbf{1}_J$ is a J-dimensional vector of ones, $\otimes$ denotes the direct product, and $Q$ is a square matrix of dimension J with ones on the diagonal and $\rho$ as each off-diagonal element. Now $\mathbf{X} = G\mathbf{X}_{(J)}$, where $G = \mathbf{1}'_J \otimes E_I$ and $E_I$ is the identity matrix of dimension I. Then, by Rao (1973, page 124), the limiting distribution of $\hat{\boldsymbol{\pi}}$ is specified by

$$m^{1/2}(\hat{\boldsymbol{\pi}} - \boldsymbol{\pi}) \to N_I(\mathbf{0}, J^{-1}\{1+(J-1)\rho\}M_{\pi}). \qquad (2.2)$$

Given a consistent estimator for $\rho$, asymptotic chi-square tests involving sufficiently smooth functions of $\boldsymbol{\pi}$ can be obtained as Wald statistics,

$$X^2 = mJ\{1+(J-1)\hat{\rho}\}^{-1}[g(\hat{\boldsymbol{\pi}}) - g(\boldsymbol{\pi})]'[DM_{\hat{\pi}}D']^{-}[g(\hat{\boldsymbol{\pi}}) - g(\boldsymbol{\pi})], \qquad (2.3)$$

where $D$ is the matrix of first partial derivatives of $g$ evaluated at $\boldsymbol{\pi}$, and $[DM_{\hat{\pi}}D']^{-}$ is a generalized inverse of $DM_{\hat{\pi}}D'$. The degrees of freedom correspond to the rank of $DM_{\pi}D'$. In some applications it may be necessary to replace $D$ with $\hat{D}$ (i.e. $D$ evaluated at $\hat{\boldsymbol{\pi}}$).

3. Dirichlet-Multinomial Model

For each of N units observe a multinomial vector of responses, with parameters $\mathbf{P} = (P_1, P_2, \ldots, P_I)'$ and sample size S. Furthermore, assume the probability vector $\mathbf{P}$ has a Dirichlet distribution with mean vector $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_I)'$ and scaling parameter $\alpha$.

For this model the sum of the vector of counts has a Dirichlet-Multinomial distribution, and the vector of proportions has first moment $\boldsymbol{\pi}$ and covariance matrix $N^{-1}(S+\alpha)(1+\alpha)^{-1}M_{\pi}$. The Dirichlet distribution provides a convenient model for describing variation among vectors of proportions since it has relatively simple mathematical properties. The Dirichlet-Multinomial model has been studied by Mosimann (1962) and Good (1965). Brier (1980) used the model to analyze sample proportions obtained from a single two-stage cluster sample. Koehler and Wilson (1986) extended some of Brier's [...]

4. Generalized Dirichlet-Multinomial Model

[...] In the generalized-multinomial model, the $\mathbf{X}_{jt}$ vectors $(j = 1, 2, \ldots, J)$ are identically distributed and are not independent. The observations taken at time t on the J individuals are equally pairwise correlated, as measured by the parameter $\rho$. The vector of total counts $\mathbf{X} = \sum_{t=1}^{n} \mathbf{X}_t$, where $\mathbf{X}_t = \sum_{j=1}^{J} \mathbf{X}_{jt}$, for the generalized Dirichlet-Multinomial model has mean vector $E(\mathbf{X}) = N\boldsymbol{\pi}$ and covariance matrix $V(\mathbf{X}) = NC\{1+\rho(J-1)\}M_{\pi}$, where $N = nS$ is the total number of observations, S is the total number of units at time t, and $C = (S+\alpha)(1+\alpha)^{-1}$. Using an argument similar to the one in Section 2, it can be shown that

$$n^{-1/2}(\mathbf{X} - N\boldsymbol{\pi}) \to N_I(\mathbf{0}, SC\{1+(J-1)\rho\}M_{\pi}),$$

and tests of hypotheses about $\boldsymbol{\pi}$ or vector functions $g(\boldsymbol{\pi})$, where $g$ is a continuous function with second partial derivatives, can be obtained using the large sample chi-square distributions for the Wald statistic

$$X^2 = N\{C\{1+(J-1)\rho\}\}^{-1}(g(\hat{\boldsymbol{\pi}}) - g(\boldsymbol{\pi}))'(DM_{\hat{\pi}}D')^{-}(g(\hat{\boldsymbol{\pi}}) - g(\boldsymbol{\pi})), \qquad (4.1)$$

where $[DM_{\hat{\pi}}D']^{-}$ denotes the generalized inverse of $DM_{\hat{\pi}}D'$, with degrees of freedom equal to the rank of $DM_{\pi}D'$. The greater imprecision in the estimation of $\boldsymbol{\pi}$ due to variation in vectors of proportions among individuals is accounted for by the factor C, which cannot be less than one. The consequence of ignoring this extra variation is an inflation of the type I error levels for such tests.

5. The Univariate Logarithmic Series Distribution

The logarithmic series distribution was introduced by Fisher, Corbett and Williams (1943) to investigate the distribution of butterflies in the Malayan Peninsula. It has been used in the sampling of quadrats for plant species, the distribution of animal species, population growth, and in economic applications. Chatfield et al. (1966) used the logarithmic series distribution to represent the distribution of numbers of items of a product purchased by a buyer in a specified time period. They point out that the logarithmic series has the advantage of dependency on only one parameter, $\theta$.

The random variable M has a logarithmic series distribution if the probability function is $P(M=k) = a\theta^{k}/k$ $(k = 1, 2, \ldots;\ 0 < \theta < 1)$, where $a = [-\log(1-\theta)]^{-1}$ [...]
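As a small numerical illustration of the probability function $P(M=k) = a\theta^{k}/k$ with $a = [-\log(1-\theta)]^{-1}$, the following Python sketch evaluates it and checks two properties: the probabilities sum to one, and they decrease in k because the ratio $P(M=k+1)/P(M=k) = \theta k/(k+1)$ is below one. The function names and the value of theta are hypothetical and chosen only for illustration; nothing here is taken from the paper beyond the formula itself.

```python
import math


def log_series_pmf(k: int, theta: float) -> float:
    """Return P(M = k) = a * theta**k / k, with a = -1 / log(1 - theta)."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    if k < 1:
        return 0.0  # the support of the distribution starts at k = 1
    a = -1.0 / math.log(1.0 - theta)
    return a * theta ** k / k


def log_series_mean(theta: float) -> float:
    """Return the population mean a * theta / (1 - theta)."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    a = -1.0 / math.log(1.0 - theta)
    return a * theta / (1.0 - theta)


if __name__ == "__main__":
    theta = 0.6  # hypothetical parameter value, for illustration only
    probs = [log_series_pmf(k, theta) for k in range(1, 201)]
    # The truncated sum should be numerically indistinguishable from 1.
    print("sum of probabilities:", sum(probs))
    # Each probability should exceed the next one, so the mode is at k = 1.
    print("strictly decreasing:", all(p > q for p, q in zip(probs, probs[1:])))
    # The truncated first moment should agree with the closed-form mean.
    print("mean (series):", sum(k * p for k, p in enumerate(probs, start=1)))
    print("mean (closed form):", log_series_mean(theta))
```

The closed-form mean is the same quantity that reappears on the right-hand side of equation (8.1) in Section 8.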

Hence the maximum value of $P(M=k)$ is at the initial value $k=1$, and the value of $P(M=k)$ decreases as k increases.

In the model presented in Section 6, the number of clusters J and the total number of observations S from the J clusters may be expected to increase proportionately. The logarithmic series distribution is used here to explain the variation in the number of units in the cluster. Supposing the 'index of diversity', $\alpha$, remains constant, then S and J will be expected to be related by the formula $e^{J/\alpha} = 1 + S/\alpha$. If $S/\alpha$ is large then $e^{J/\alpha} \approx S/\alpha$.

The idea to use the logarithmic series distribution in conjunction with the Dirichlet distribution is a result of work done by Engen (1975). He demonstrated the use of the limit of the Dirichlet distribution in deriving the logarithmic series distribution.

Consider the joint conditional distribution of $m_1, m_2, \ldots, m_J$ for each fixed sum $\sum_j m_j = S$, i.e.

$$P\Big(M_j = m_j;\ j = 1, 2, \ldots, J \,\Big|\, \sum_{j=1}^{J} m_j = S\Big) = \frac{S!}{J!} \prod_{j=1}^{J} (1/m_j) \Big/ |s(S,J)|,$$

where $|s(S,J)|$ is the absolute value of the Stirling number of the first kind. The sum $\sum_j M_j$ is a sufficient statistic since the conditional distribution does not depend upon the parameter $\theta$. The sum $\sum_j M_j$ is a complete sufficient statistic for $\theta$, and $P(M_j = m_j \mid \sum_j M_j = S)$ is the minimum variance unbiased estimator of the probability function of the logarithmic series distribution (LSD). Shanmugam and Singh (1984) noted that $E(m_j \mid \sum_j M_j = S) = SJ^{-1}$ regardless of the underlying $\theta$ for the random sample.

6. An Extended Generalized Dirichlet-Multinomial Model

The logarithmic series distribution discussed in the previous section is used to extend the generalized Dirichlet-Multinomial model. The number of units per cluster is assumed to vary according to a logarithmic series distribution. Thus both m and the probability vector associated with each cluster are allowed to be random variables, and we have

$$h(t) = \int \sum_{m} g_m f_m(P,t)\,\phi(P)\,dP,$$

where $f_m(P,t)\phi(P)$ represents the conditional distribution for given m, represented here by the generalized Dirichlet-Multinomial. The term $g_m = P(M=m)$ represents the marginal distribution of the sample sizes. Here the conditional distribution given a sum of random samples from the logarithmic series distribution is used to represent such a marginal distribution.

The problem of obtaining expressions for $h(t)$ is now considerably magnified by the nature of the expression for the conditional distribution given a sum of logarithmic series distribution variables. However, the first and second moments of the distribution $h(t)$ can be found.

Under the generalized Dirichlet-Multinomial model the covariance matrix for the conditional distribution of $\mathbf{X}$ for given m is

$$V_m(\mathbf{X}) = mBM_{\pi},$$

where $B = RC\{1+\rho(J-1)\}$ and $R = nJ$. Thus the moments for the distribution given by $h(t)$ are $E(\mathbf{X}) = SR\boldsymbol{\pi}$ and

$$V(\mathbf{X}) = n^{-1}SBM_{\pi} + SJ^{-2}(S-J)(J-1)B^{2}\boldsymbol{\pi}\boldsymbol{\pi}' = n^{-1}SB\{M_{\pi} + nBJ^{-2}(S-J)(J-1)\boldsymbol{\pi}\boldsymbol{\pi}'\}.$$

It follows that the covariance matrix $V(\mathbf{X})$ can be written as

$$V(\mathbf{X}) = SBM_{\pi} + Sn^{2}C^{2}\{1+\rho(J-1)\}^{2}(S-J)(J-1)\boldsymbol{\pi}\boldsymbol{\pi}'$$
$$= SB\Delta - [SB - SJ^{-2}(S-J)(J-1)B^{2}]\boldsymbol{\pi}\boldsymbol{\pi}'$$
$$= SB\{\Delta - [1 - J^{-2}B(S-J)(J-1)]\boldsymbol{\pi}\boldsymbol{\pi}'\},$$

where [...] $t = (1-a\theta)(1-\theta)^{-1}R^{2}B^{-1}$. From ( ), $a = [-\log(1-\theta)]^{-1}$ and $0 < \theta < 1$. Also $R^{2}B^{-1} = RC^{-1}\{1+\rho(J-1)\}^{-1}$ for [...] and

$$V(\mathbf{X}) = SB\{[1 - J^{-2}B(S-J)(J-1)]M_{\pi} + J^{-2}B(S-J)(J-1)\Delta\}.$$

The covariance matrix has some similarities with the covariance matrix under the generalized Dirichlet-Multinomial. In the extended model the variance is a sum of the variation due to the generalized Dirichlet-Multinomial and the variation among the sample sizes. Thus when the variance among the sample sizes is small there is little difference between the two models, the generalized Dirichlet-Multinomial and the extended generalized Dirichlet-Multinomial. Certainly there is no difference between the models when there is one unit per cluster.

Similar to the assumption in (2.2), with the appropriate covariance matrix and given consistent estimators for C and $\{1+\rho(J-1)\}$, asymptotic chi-square tests involving sufficiently smooth functions of $\boldsymbol{\pi}$ can be obtained as Wald statistics,

$$X^2_{GLD} = [g(\hat{\boldsymbol{\pi}}) - g(\boldsymbol{\pi})]'[D\hat{V}D']^{-}[g(\hat{\boldsymbol{\pi}}) - g(\boldsymbol{\pi})],$$

where $[D\hat{V}D']^{-}$ is a generalized inverse of $D\hat{V}D'$ and $\hat{V}$ is a consistent estimate of $V(\mathbf{X})$ in ( ). The degrees of freedom correspond to the rank of $DVD'$.

7. Test of the Model Assumptions

In using the extended generalized Dirichlet-Multinomial model there are three basic assumptions: a) the correlations between the units $\mathbf{X}_j$ and $\mathbf{X}_{j'}$ are constant for any $j \neq j'$; b) the $\mathbf{X}_j$, $j = 1, 2, \ldots, J$, are identically multinomially distributed; and c) the sample sizes follow a logarithmic series distribution. Test statistics to assess the validity of the first two assumptions were presented by Wilson and Koehler (1988). Large sample tests for the covariance structure associated with the Dirichlet-Multinomial model were given by Wilson (1986) and by Koehler and Wilson (1986). Here we make mention of some procedures for testing

that $m_1, m_2, \ldots, m_J$ belong to a logarithmic series distribution. One method of testing that $m_1, m_2, \ldots, m_J$ is a random sample from a logarithmic series distribution is to consider the characterization of the distribution given by Shanmugam and Singh (1984). For any fixed S, let

$$Q = (\mathbf{m} - \boldsymbol{\mu})'\Sigma^{-}(\mathbf{m} - \boldsymbol{\mu})$$

be a test statistic, where $\mathbf{m} = (m_1, m_2, \ldots, m_J)'$ is the observed vector, $\boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_J)'$ with $\mu_j = E(m_j \mid \sum_j m_j = S)$ is the vector of expected values, and $\Sigma = \{\mathrm{cov}(m_j, m_{j'} \mid \sum_j m_j = S)\}$ is the matrix of weights. The rank of $\Sigma$ is $J-1$. It can be shown that asymptotically

$$\mathrm{cov}\Big(m_j, m_{j'} \,\Big|\, \sum_i m_i = S\Big) \approx \begin{cases} S(S-J)(J-1)/J^{2}, & j = j', \\ -S(S-J)/J^{2}, & j \neq j'. \end{cases}$$

The structure of the dispersion matrix $\Sigma$ is of the intraclass correlation matrix type. Thus Q simplifies to

$$Q \approx (S-J)^{-1}\Big[(J/S)\sum_{j=1}^{J} m_j^{2} - S\Big].$$

Through the asymptotic distribution of $\sum_{j=1}^{J} m_j^{2}$, it can be shown (Shanmugam and Singh 1984) that for a given level of significance $\alpha$, we would reject the null hypothesis that a random sample $m_1, m_2, \ldots, m_J$ is from a logarithmic series distribution if

$$\frac{\sum_{j=1}^{J} m_j^{2} - S(S-J+1)}{[\ldots]} > Z_{\alpha/2},$$

where $Z_{\alpha/2}$ is the $(1-\alpha/2)$th percentile of the standard normal distribution.

8. Parameter Estimates

The problem of estimating $\theta$ given values of J random variables $m_1, m_2, \ldots, m_J$, each having a logarithmic series distribution, has been considered by Johnson and Kotz (1969). The maximum likelihood estimator $\hat{\theta}$ satisfies the equation for $\bar{m}$ (the mean of the m's),

$$\bar{m} = J^{-1}\sum_{j=1}^{J} m_j = -\hat{\theta}\{(1-\hat{\theta})\log(1-\hat{\theta})\}^{-1}. \qquad (8.1)$$

Since the logarithmic series distribution is a generalized power series distribution, equation (8.1) can be solved by equating the sample and population means. Other estimators of $\theta$ are presented in Johnson and Kotz.

When using the logarithmic series distribution to obtain an extension of the generalized Dirichlet distribution to test hypotheses concerning $g(\boldsymbol{\pi})$, there is no need to obtain estimates of $\theta$. However, estimates of C and $\{1+\rho(J-1)\}$ must be obtained.

Methods of estimating C and $\{1+\rho(J-1)\}$ are presented by Wilson and Koehler (1988). One set of estimators can be obtained by constructing an $I \times J$ and an $I \times n$ table. From each table obtain the Pearson chi-square statistics $X^2_{(IJ)}$ and $X^2_{(In)}$ for testing independence in a two-way contingency table. Then $\hat{C} = X^2_{(IJ)}/(I-1)(J-1)$ and $\{1+\hat{\rho}(J-1)\} = X^2_{(In)}/(I-1)(n-1)$.

References

Altham, P. M. E. (1978). Two generalizations of the binomial distribution. Applied Statistics 27, 162-167.
Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. John Wiley and Sons, New York.
Anscombe, F. J. (1950). Sampling theory of the negative binomial and logarithmic series distributions. Biometrika 37, 358-382.
Brier, S. S. (1980). Analysis of contingency tables under cluster sampling. Biometrika 67, 591-596.
Chatfield, C., Ehrenberg, A. S. C. and Goodhardt, G. J. (1966). Progress on a simplified model of stationary purchasing behavior (with discussion). Journal of the Royal Statistical Society, Series A, 129, 317-367.
Cochran, W. G. (1943). Analysis of variance for percentages based on unequal numbers. Journal of the American Statistical Association 38, 287-301.
Choi, J. W. (1987). A direct estimate of intracluster correlation. Section on Survey Research Methods, Proceedings of the American Statistical Association.
Crosby, L. A. and Stephens, N. (1987). Effects of relationship marketing on the satisfaction, retention and prices in the life insurance industry. Journal of Marketing Research. (Forthcoming).
Crowder, M. J. (1978). Beta-binomial Anova for proportions. Applied Statistics 27, 34-37.
Engen, S. (1975). A note on the geometric series as a species frequency model. Miscellanea.
Fisher, R. A., Corbett, A. S., and Williams, C. B. (1943). The relation between the number of species and the number of individuals on a random sample from an animal population. Journal of Animal Ecology 10, 446-56.
Good, I. J. (1965). Estimation of Probabilities. MIT Press, Cambridge, Massachusetts.
Healy, M. J. R. (1972). Animal litters as experimental units. Applied Statistics 21, 155-159.
Johnson, N. L. and Kotz, S. (1969). Discrete Distributions. Houghton Mifflin Company, New York.
Koehler, K. J. and Wilson, J. R. (1986). Chi-square tests for comparing vectors of proportions for several cluster samples. Communications in Statistics, Theory and Methods, Vol. A15, No. 10.
Kupper, L. L. and Haseman, J. K. (1978). The use of a correlated binomial model for the analysis of certain toxicological experiments. Biometrics 34, 69-76.
Landis, J. R. and Koch, G. G. (1977). A one-way component of variance model for categorical data. Biometrics 33, 671-79.
Lawley, D. N. (1963). On testing a set of correlation coefficients for equality. Annals of Mathematical Statistics 34, 149-151.
Moore, D. S. (1977). Generalized inverses, Wald's method and construction of chi-squared tests of fit. Journal of the American Statistical Association 72, 131-137.

Mosimann, J. E. (1962). On the compound multinomial distribution, the multivariate β-distribution, and correlation among proportions. Biometrika 49, 65-82.
Patil, G. P. and Bildikar, S. (1967). Multivariate logarithmic series distribution as a probability model in population and community ecology and some of its statistical properties. Journal of the American Statistical Association.
Patil, G. P. and Wani, J. K. (1965). On certain structural properties of the logarithmic series distribution and the first type Stirling distribution. Sankhya, Series A, 27, 271-280.
Paul, S. R. (1987). On the beta-correlated binomial (BCB) distribution: a three parameter generalization of the binomial distribution. Communications in Statistics, Vol. 6, (5), 1473-1478.
Shanmugam, R. and Singh, J. (1984). A characterization of the logarithmic series distribution and its application. Communications in Statistics, 13(7), 865-875.
Tallis, G. M. (1962). The use of a generalized multinomial distribution in the estimation of correlation in discrete data. J. R. Statist. Soc., Series B, 24, 530-534.
Tallis, G. M. (1964). Further models for estimating correlation in discrete data. J. R. Statist. Soc., Series B, 26, 82-85.
Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Trans. Amer. Math. Soc., 54, 426-482.
Wilson, J. R. (1986). Approximate distribution and test of fit for the clustering effects in the Dirichlet multinomial model. Communications in Statistics, Theory and Methods, Vol. A15, No. 4.
Wilson, J. R. and Turner, D. (1987). A simulation study to compare different test statistics for complex sampling data. Technical Report DIS, Arizona State University, Tempe, Arizona.
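
A numerical note on equation (8.1) of Section 8: the equation sets the sample mean equal to the population mean $-\theta\{(1-\theta)\log(1-\theta)\}^{-1}$ and has no closed-form solution for $\hat{\theta}$, so it has to be solved numerically. The Python sketch below uses bisection, which is an assumption made here for illustration and not a method stated in the paper; it relies on the population mean increasing from 1 towards infinity as $\theta$ goes from 0 to 1. The function names and the sample of cluster sizes are hypothetical.

```python
import math


def log_series_mean(theta: float) -> float:
    """Population mean of the logarithmic series distribution, as in (8.1)."""
    # log1p(-theta) computes log(1 - theta) accurately for small theta.
    return -theta / ((1.0 - theta) * math.log1p(-theta))


def solve_theta(sample_mean: float, tol: float = 1e-12) -> float:
    """Solve log_series_mean(theta) = sample_mean for theta by bisection."""
    if sample_mean <= 1.0:
        # The population mean exceeds 1 for every theta in (0, 1).
        raise ValueError("sample mean must be greater than 1")
    lo, hi = 1e-12, 1.0 - 1e-12
    if log_series_mean(hi) < sample_mean:
        raise ValueError("sample mean too large for this bracket")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if log_series_mean(mid) < sample_mean:
            lo = mid  # the mean is increasing in theta, so move right
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    sizes = [1, 1, 2, 1, 3, 5, 1, 2]  # hypothetical cluster sizes m_1..m_J
    m_bar = sum(sizes) / len(sizes)
    theta_hat = solve_theta(m_bar)
    print("sample mean:", m_bar)
    print("theta_hat:", theta_hat)
    print("fitted mean:", log_series_mean(theta_hat))
```

Bisection is used only because it needs nothing beyond monotonicity of the mean; any standard root finder applied to the same equation would serve.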
