1988: Logarithmic Series Distribution and Its Use In
LOGARITHMIC SERIES DISTRIBUTION AND ITS USE IN ANALYZING DISCRETE DATA

Jeffrey R. Wilson, Arizona State University, Tempe, Arizona 85287

1. Introduction

In a previous paper, Wilson and Koehler (1988) used the generalized Dirichlet-multinomial model to account for extra variation. The model allows for a second order of pairwise correlation among units, a type of assumption found reasonable in some biological data. In that paper, the two-way crossed generalized Dirichlet-multinomial model was used to analyze repeated measures on the categorical preferences of insurance customers. The number of respondents was assumed to be fixed and known.

In this paper a generalization of the model is made, allowing the number of respondents, m, to be random. Thus both the number of units, m, and the underlying probability vector are allowed to vary. The model presented here uses the logarithmic series distribution to account for the variation among the number of units and the Dirichlet distribution to model the probabilities. In particular, the Dirichlet-multinomial distribution is used to incorporate the two types of variation, and the logarithmic series distribution is used to account for variation among the number of units within a given time period. Ignoring either level of variation leads to underestimation of the true standard errors of estimated proportions.

Tallis (1962) proposed the use of the generalized-multinomial model for dependent multinomials. Wilson and Koehler (1988) extended the model to allow for a second random component. The extended model can be viewed as a multivariate extension of the beta-binomial and correlated binomial models considered by Kupper and Haseman (1978) and Crowder (1978) for binary data. Paul (1987) considered a modification of the beta-correlated binomial as a means of analyzing affected foetuses in litters of live foetuses. Section 2 outlines the generalized-multinomial model. Section 3 discusses the Dirichlet-multinomial model. Section 4 presents the generalized Dirichlet-multinomial model. The logarithmic distribution is presented in Section 5. An extended generalized Dirichlet-multinomial model is developed in Section 6. Tests for certain hypotheses and fit of the model are developed in Section 7. Parameter estimates are obtained in Section 8.

2. Generalized-Multinomial Model

One way to view the generalized-multinomial model is to consider mJ vectors of outcomes for a set of J units that are simultaneously subjected to a series of m trials. At each trial, each unit is classified as being in exactly one of I mutually exclusive states. Let the random variable $X_{ijk}$ take the value 1 if the k-th trial of the j-th unit is observed to be in the i-th state, and zero otherwise.

The probability that $X_{ijk}$ takes the value 1 is assumed to be $\pi_i$ for any unit and any trial. Furthermore, for each unit the m trials are identical and independent, so upon summing across trials, $\mathbf{X}_j = (X_{1j\cdot}, X_{2j\cdot}, \ldots, X_{Ij\cdot})'$, the vector of counts for the j-th unit, has a multinomial distribution with probability vector $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_I)'$ and sample size m. However, responses given by the J units at a particular trial may be correlated, producing a set of J correlated multinomial random vectors, $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_J$.

Tallis (1962) developed a model for this situation, which he called the generalized-multinomial distribution, in which a single parameter, $\rho$, is used to reflect the common dependency between any two of the dependent multinomial random vectors. The distribution of the category total $X_{ij\cdot} = \sum_{k=1}^{m} X_{ijk}$ is binomial with sample size m and parameter $\pi_i$, for each unit. Tallis formalized the dependencies among unit totals by specifying the joint moment generating function as

$$G_J(\mathbf{u}) = \rho \Big\{ \sum_{i=0}^{m} p_i \Big( \prod_{j=1}^{J} e^{u_j} \Big)^{i} \Big\} + (1-\rho) \Big\{ \prod_{j=1}^{J} P(e^{u_j}) \Big\}, \quad i = 1, 2, \ldots, I, \qquad (2.1)$$

where $P(e^{u_r}) = \sum_{i=1}^{I} p_i e^{i u_r}$ and $\mathbf{u} = (u_1, u_2, \ldots, u_J)'$. The parameter $\rho$ appearing in (2.1) is the correlation coefficient between $X_{ij\cdot}$ and $X_{ij'\cdot}$ for any $j \neq j'$. When $\rho \geq 0$, $G_J(\mathbf{u})$ is a weighted mean of the moment generating function for a distribution with perfect correlation and one with complete independence, the weights being $\rho$ and $1-\rho$ respectively. Altham (1978) proposed a similar model for a joint moment generating function for correlated binary variables.

Consider the overall vector of category totals $\mathbf{X} = \sum_{j=1}^{J} \mathbf{X}_j$. From the moment generating function in (2.1) it can be shown that $E(\mathbf{X}) = Jm\boldsymbol{\pi}$ and $V(\mathbf{X}) = Jm\{1+(J-1)\rho\}M_{\pi}$ for the generalized-multinomial model, where $M_{\pi} = \mathrm{diag}(\boldsymbol{\pi}) - \boldsymbol{\pi}\boldsymbol{\pi}'$ and $\mathrm{diag}(\boldsymbol{\pi})$ is a diagonal matrix with diagonal elements provided by the vector $\boldsymbol{\pi}$. Consequently, $\hat{\boldsymbol{\pi}} = (Jm)^{-1}\mathbf{X}$ is an unbiased estimator for $\boldsymbol{\pi}$. Tallis (1962) proposed estimators for $\rho$, but he did not discuss techniques for making inferences about $\boldsymbol{\pi}$. We consider here a technique for making such inferences.

One approach is to use the limiting normal distribution of $\mathbf{X}$ as $m \to \infty$. At trial k consider an IJ-dimensional response vector $\mathbf{X}_k^{(J)} = (\mathbf{X}_{1k}', \mathbf{X}_{2k}', \ldots, \mathbf{X}_{Jk}')'$, where $\mathbf{X}_{jk} = (X_{1jk}, X_{2jk}, \ldots, X_{Ijk})'$ is the response vector for the j-th unit at the k-th trial. Define $\mathbf{X}^{(J)} = \sum_{k=1}^{m} \mathbf{X}_k^{(J)}$. Since the $\mathbf{X}_k^{(J)}$ vectors are independent and the first and second moments of $\mathbf{X}_k^{(J)}$ are finite, the multivariate central limit theorem implies that

$$m^{-1/2}(\mathbf{X}^{(J)} - m\boldsymbol{\mu}) \to N_{IJ}(\mathbf{0}, \Sigma), \quad \text{as } m \to \infty,$$

where $\boldsymbol{\mu} = \mathbf{1}_J \otimes \boldsymbol{\pi}$, $\Sigma = M_{\pi} \otimes Q$, $\mathbf{1}_J$ is a J-dimensional vector of ones, $\otimes$ denotes the direct product between matrices, and $Q$ is a square matrix of dimension J with ones on the diagonal and $\rho$ as each off-diagonal element. Now $\mathbf{X} = G\mathbf{X}^{(J)}$, where $G = \mathbf{1}_J' \otimes E_I$ and $E_I$ is the identity matrix of dimension I. Then, by Rao (1973, page 124), the limiting distribution of $\hat{\boldsymbol{\pi}}$ is specified by

$$m^{1/2}(\hat{\boldsymbol{\pi}} - \boldsymbol{\pi}) \to N_I(\mathbf{0}, J^{-1}\{1+(J-1)\rho\}M_{\pi}), \qquad (2.2)$$

so that for large m the covariance matrix of $\hat{\boldsymbol{\pi}}$ is approximately $(mJ)^{-1}\{1+(J-1)\rho\}M_{\pi}$.

Given a consistent estimator for $\rho$, asymptotic chi-square tests involving sufficiently smooth functions of $\boldsymbol{\pi}$ can be obtained as Wald statistics,

$$X^2 = mJ\{1+(J-1)\rho\}^{-1}\,[g(\hat{\boldsymbol{\pi}}) - g(\boldsymbol{\pi})]'\,[D M_{\hat{\pi}} D']^{-}\,[g(\hat{\boldsymbol{\pi}}) - g(\boldsymbol{\pi})], \qquad (2.3)$$

where $D$ is the matrix of first partial derivatives of $g$ evaluated at $\boldsymbol{\pi}$, and $[D M_{\hat{\pi}} D']^{-}$ is a generalized inverse of $D M_{\hat{\pi}} D'$. The degrees of freedom correspond to the rank of $D M_{\pi} D'$. In some applications it may be necessary to replace $D$ with $\hat{D}$ (i.e. $D$ evaluated at $\hat{\boldsymbol{\pi}}$).

3. Dirichlet-Multinomial Model

For each of N units observe a multinomial vector of responses, with parameters $\mathbf{P} = (P_1, P_2, \ldots, P_I)'$ and sample size S. Furthermore, assume the probability vector $\mathbf{P}$ has a Dirichlet distribution with mean vector $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_I)'$ and scaling parameter $\alpha$.

For this model the sum of the vector of counts has a Dirichlet-multinomial distribution, and the vector of proportions has first moment $\boldsymbol{\pi}$ and covariance matrix $N^{-1}(S+\alpha)(1+\alpha)^{-1}M_{\pi}$. The Dirichlet distribution provides a convenient model for describing variation among vectors of proportions since it has relatively simple mathematical properties. The Dirichlet-multinomial model has been studied by Mosimann (1962) and Good (1965). Brier (1980) used the model to analyze sample proportions obtained from a single two-stage cluster sample. Koehler and Wilson (1986) extended some of Brier's results to analyze vectors of proportions ...

In the generalized-multinomial model, the $\mathbf{X}_{jt}$ vectors (j = 1, 2, ..., J) are identically distributed and are not independent. The observations taken at time t on the J individuals are equally pairwise correlated, as measured by the parameter $\rho$. The vector of total counts $\mathbf{X} = \sum_{t=1}^{n} \mathbf{X}_t$, where $\mathbf{X}_t = \sum_{j=1}^{J} \mathbf{X}_{jt}$ for the generalized Dirichlet-multinomial model, has mean vector $E(\mathbf{X}) = N\boldsymbol{\pi}$ and covariance matrix $V(\mathbf{X}) = NC\{1+\rho(J-1)\}M_{\pi}$, where $N = nS$ is the total number of observations, S is the total number of units at time t, and $C = (S+\alpha)(1+\alpha)^{-1}$. Using an argument similar to the one in Section 2, it can be shown that

$$n^{-1/2}(\mathbf{X} - N\boldsymbol{\pi}) \to N_I(\mathbf{0}, SC\{1+(J-1)\rho\}M_{\pi}),$$

and tests of hypotheses about $\boldsymbol{\pi}$ or vector functions $g(\boldsymbol{\pi})$, where $g$ is a continuous function with second partial derivatives, can be obtained using the large sample chi-square distributions for the Wald statistic

$$X^2 = N\{C\{1+(J-1)\rho\}\}^{-1}\,(g(\hat{\boldsymbol{\pi}}) - g(\boldsymbol{\pi}))'\,(D M_{\hat{\pi}} D')^{-}\,(g(\hat{\boldsymbol{\pi}}) - g(\boldsymbol{\pi})), \qquad (4.1)$$

where $(D M_{\hat{\pi}} D')^{-}$ denotes the generalized inverse of $D M_{\hat{\pi}} D'$, with degrees of freedom equal to the rank of $D M_{\pi} D'$. The greater imprecision in the estimation of $\boldsymbol{\pi}$ due to variation in vectors of proportions among individuals is accounted for by the factor $C$, which cannot be less than one. The consequence of ignoring this extra variation is an inflation of the type I error levels for such tests.

5. The Univariate Logarithmic Series Distribution

The logarithmic series distribution was introduced by Fisher, Corbet and Williams (1943) to investigate the distribution of butterflies in the Malayan Peninsula. It has been used in the sampling of quadrats for plant species, the distribution of animal species, population growth, and in economic applications. Chatfield et al. (1966) used the logarithmic series distribution to represent the distribution of numbers of items of a product purchased by a buyer in a specified time period. They point out that the logarithmic series has the advantage of dependency on only one parameter, $\theta$.

The random variable M has a logarithmic series distribution if the probability function is

$$P(M = k) = a\theta^{k}/k \qquad (k = 1, 2, \ldots;\ 0 < \theta < 1),$$

where $a = -[\log(1-\theta)]^{-1}$.
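As an illustration of the logarithmic series probability function, P(M = k) = aθ^k/k with a = -1/log(1-θ), the following minimal Python sketch evaluates the probabilities and checks numerically that they sum to one. The function names `log_series_pmf` and `log_series_mean` are hypothetical and chosen only for this example; the mean formula aθ/(1-θ) is the standard result for this distribution and is not taken from the text above.

```python
import math


def log_series_pmf(k: int, theta: float) -> float:
    """Return P(M = k) = a * theta**k / k, with a = -1 / log(1 - theta)."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    if k < 1:
        # The support of the distribution starts at k = 1.
        return 0.0
    a = -1.0 / math.log1p(-theta)  # log1p(-theta) equals log(1 - theta)
    return a * theta**k / k


def log_series_mean(theta: float) -> float:
    """Return the mean a * theta / (1 - theta) (standard result, assumed here)."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    a = -1.0 / math.log1p(-theta)
    return a * theta / (1.0 - theta)


if __name__ == "__main__":
    theta = 0.5  # illustrative value only
    total = sum(log_series_pmf(k, theta) for k in range(1, 200))
    print("sum of probabilities:", total)
    print("mean:", log_series_mean(theta))
```

Because the distribution depends on the single parameter θ, a truncated sum over k is enough to check normalization numerically; the tail beyond a few hundred terms is negligible unless θ is very close to one.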