Advanced Review

A review of multivariate distributions for count data derived from the Poisson distribution

David I. Inouye,1 Eunho Yang,2 Genevera I. Allen3,4 and Pradeep Ravikumar5*
The Poisson distribution has been widely studied and used for modeling univariate count-valued data. However, multivariate generalizations of the Poisson distribution that permit dependencies have been far less popular. Yet, real-world, high-dimensional, count-valued data found in word counts, genomics, and crime statistics, for example, exhibit rich dependencies and motivate the need for multivariate distributions that can appropriately model this data. We review multivariate distributions derived from the univariate Poisson, categorizing these models into three main classes: (1) where the marginal distributions are Poisson, (2) where the joint distribution is a mixture of independent multivariate Poisson distributions, and (3) where the node-conditional distributions are derived from the Poisson. We discuss the development of multiple instances of these classes and compare the models in terms of interpretability and theory. Then, we empirically compare multiple models from each class on three real-world datasets that have varying data characteristics from different domains, namely traffic accident data, biological next generation sequencing data, and text data. These empirical experiments develop intuition about the comparative advantages and disadvantages of each class of multivariate distribution that was derived from the Poisson. Finally, we suggest new research directions as explored in the subsequent Discussion section. © 2017 Wiley Periodicals, Inc.
How to cite this article: WIREs Comput Stat 2017, 9:e1398. doi: 10.1002/wics.1398
Keywords: Poisson, Multivariate, Graphical models, Copulas, High dimensional
Additional Supporting Information may be found in the online version of this article.

*Correspondence to: [email protected]

1 Department of Computer Science, The University of Texas at Austin, Austin, TX, USA
2 School of Computing, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
3 Department of Statistics, Rice University, Houston, TX, USA
4 Department of Pediatrics-Neurology, Baylor College of Medicine, Houston, TX, USA
5 School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

Conflict of interest: The authors have declared no conflicts of interest for this article.

INTRODUCTION

Multivariate count-valued data has become increasingly prevalent in modern big data settings. Variables in such data are rarely independent and instead exhibit complex positive and negative dependencies. We highlight three examples of multivariate count-valued data that exhibit rich dependencies: text analysis, genomics, and crime statistics. In text analysis, a standard way to represent documents is to merely count the number of occurrences of each word in the vocabulary and create a word-count vector for each document. This representation is often known as the bag-of-words representation, in which the word order and syntax are ignored. The vocabulary size—i.e., the number of variables
in the data—is usually much greater than 1000 unique words, and thus, a high-dimensional multivariate distribution is required. Also, words are clearly not independent. For example, if the word 'Poisson' appears in a document, then the word 'probability' is more likely to also appear, signifying a positive dependency. Similarly, if the word 'art' appears, then the word 'probability' is less likely to also appear, signifying a negative dependency. In genomics, RNA-sequencing technologies are used to measure gene and isoform expression levels. These technologies yield counts of reads mapped back to DNA locations, which even after normalization yield non-negative data that is highly skewed with many exact zeros. These genomics data are both high dimensional, with the number of genes measuring in the tens of thousands, and strongly dependent as genes work together in pathways and complex systems to produce particular phenotypes. In crime analysis, counts of crimes in different counties are clearly multidimensional, with dependencies between crime counts. For example, the counts of crime in adjacent counties are likely to be correlated with one another, indicating a positive dependency. While positive dependencies are probably more prevalent in crime statistics, negative dependencies might be very interesting. For example, a negative dependency between adjacent counties may suggest that a criminal gang has moved from one county to the other.

These examples motivate the need for a high-dimensional, count-valued distribution that permits rich dependencies between variables. In general, a good class of probabilistic models is a fundamental building block for many tasks in data analysis. Estimating such models from data could help answer exploratory questions such as: Which genomic pathways are altered in a disease, e.g., by analyzing genomic networks? Or which county seems to have the strongest effect, with respect to crime, on other counties? A probabilistic model could also be used in Bayesian classification to determine questions such as: Does this Twitter post display positive or negative sentiment about a particular product (fitting one model on positive posts and one model on negative posts)?

The classical model for a count-valued random variable is the univariate Poisson distribution, whose probability mass function for x ∈ {0, 1, 2, …} is:

ℙ_Poiss(x | λ) = λ^x exp(−λ) / x!,    (1)

where λ is the standard mean parameter for the Poisson distribution. A trivial extension of this to a multivariate distribution would be to assume independence between variables and take the product of node-wise univariate Poisson distributions, but such a model would be ill suited for many examples of multivariate count-valued data that require rich dependence structures. We review multivariate probability models that are derived from the univariate Poisson distribution and permit nontrivial dependencies between variables. We categorize these models into three main classes based on their primary modeling assumption. The first class assumes that the univariate marginal distributions are derived from the Poisson. The second class is derived as a mixture of independent multivariate Poisson distributions. The third class assumes that the univariate conditional distributions are derived from the Poisson distribution—this last class of models can also be studied in the context of probabilistic graphical models. An illustration of each of these three main model classes can be seen in Figure 1. While these models might have been classified by primary application area or performance on a particular task, a classification based on modeling assumptions helps emphasize the core abstractions for each model class. In addition, this categorization may help practitioners from different disciplines learn from the models that have worked well in different areas. We discuss multiple instances of these classes in the later sections and highlight the strengths and weaknesses of each class. We then provide a short discussion on the differences between classes in terms of interpretability and theory. Using two different empirical measures, we empirically compare multiple models from each class on three real-world datasets that have varying data characteristics from different domains, namely traffic accident data, biological next generation sequencing data, and text data. These experiments develop intuition about the comparative advantages and disadvantages of the models and suggest new research directions as explored in the subsequent Discussion section.

Notation

ℝ denotes the set of real numbers, ℝ+ denotes the non-negative real numbers, and ℝ++ denotes the positive real numbers. Similarly, ℤ denotes the set of integers. Matrices are denoted as capital letters (e.g., X, Φ), vectors are denoted as boldface lowercase letters (e.g., x, ϕ), and scalar values are nonbold lowercase letters (e.g., x, ϕ).
(Figure 1 panels, left to right: multivariate Poissons whose marginals are Poisson; a mixture of Poissons; conditional Poissons whose conditionals are Poisson.)
FIGURE 1 | (Left) The first class of Poisson generalizations is based on the assumption that the univariate marginals are derived from the Poisson. (Middle) The second class is based on the idea of mixing independent multivariate Poissons into a joint multivariate distribution. (Right) The third class is based on the assumption that the univariate conditional distributions are derived from the Poisson.
MARGINAL POISSON GENERALIZATIONS

The models in this section generalize the univariate Poisson to a multivariate distribution with the property that the marginal distributions of each variable are Poisson. This is analogous to the fact that the marginal distributions of the multivariate Gaussian distribution are univariate Gaussian distributions and thus seems like a natural constraint when extending the univariate Poisson to the multivariate case. Several historical attempts at achieving this marginal property have incidentally developed the same class of models, with different derivations.1–4 This marginal Poisson property can also be achieved via the more general framework of copulas.5–7

Multivariate Poisson Distribution

The formulation of the multivariate Poisson^a distribution goes back to M'Kendrick,2 where the author used differential equations to derive the bivariate Poisson process. An equivalent but more readable interpretation to arrive at the bivariate Poisson distribution would be to use the summation of independent Poisson variables as follows: let y1, y2, and z be univariate Poisson variables with parameters λ1, λ2, and λ0, respectively. Then, by setting x1 = y1 + z and x2 = y2 + z, (x1, x2) follows the bivariate Poisson distribution, and its joint probability mass is defined as:

ℙ_BiPoi(x1, x2 | λ1, λ2, λ0) = exp(−λ1 − λ2 − λ0) (λ1^{x1}/x1!) (λ2^{x2}/x2!) Σ_{z=0}^{min(x1, x2)} (x1 choose z) (x2 choose z) z! (λ0/(λ1 λ2))^z.    (2)

As the sum of independent Poissons is also Poisson (whose parameter is the sum of those of the two components), the marginal distribution of x1 (similarly x2) is still a Poisson with rate λ1 + λ0. It can be easily seen that the covariance of x1 and x2 is λ0, and as a result, the correlation coefficient is somewhere between 0 and min{√((λ1 + λ0)/(λ2 + λ0)), √((λ2 + λ0)/(λ1 + λ0))}.8 Independently, Wicksell4 derived the bivariate Poisson as the limit of a bivariate binomial distribution. Campbell1 shows that the models in M'Kendrick2 and Wicksell4 can identically be derived from the sums of three independent Poisson variables.

This approach to directly extend the Poisson distribution can be generalized further to handle the multivariate case x ∈ ℤ+^d, in which each variable xi is the sum of an individual Poisson yi and the common Poisson x0, as before. The joint probability for a multivariate Poisson is developed in Teicher3 and further considered by other works9–12:

ℙ_MulPoi(x; λ) = exp(−Σ_{i=0}^{d} λi) (Π_{i=1}^{d} λi^{xi}/xi!) Σ_{z=0}^{min_i xi} (Π_{i=1}^{d} (xi choose z)) z! (λ0/Π_{i=1}^{d} λi)^z.    (3)

Several studies have shown that this formulation of the multivariate Poisson can also be derived as a limiting distribution of a multivariate binomial distribution when the success probabilities are small and the
number of trials is large.13–15 As in the bivariate case, the marginal distribution of xi is Poisson with parameter λi + λ0. As λ0 controls the covariance between all variables, an extremely limited set of correlations between variables is permitted.

Mahamunulu16 first proposed a more general extension of the multivariate Poisson distribution that permits a full covariance structure. This distribution has been studied further by many studies.13,17–20 While the form of this general multivariate Poisson distribution is too complicated to spell out for d > 3, its distribution can be specified by a multivariate reduction scheme. Specifically, let yi for i = 1, …, (2^d − 1) be independently Poisson distributed with parameter λi. Now, define A = [A1, A2, …, Ad], where Ai is a d × (d choose i) matrix consisting of ones and zeros, where each column of Ai has exactly i ones with no duplicate columns. Hence, A1 is the d × d identity matrix, and Ad is a column vector of all ones. Then, x = Ay is a d-dimensional multivariate Poisson-distributed random vector with a full covariance structure. Note that the simpler multivariate Poisson distribution with constant covariance in Eq. (3) is a special case of this general form, where A = [A1, Ad].

The multivariate Poisson distribution has not been widely used for real data applications. This is likely due to two major limitations of this distribution. First, the multivariate Poisson distribution only permits positive dependencies; this can easily be seen because the distribution arises as the sum of independent Poisson random variables, and hence, covariances are governed by the positive rate parameters λi. The assumption of positive dependencies is likely unrealistic for most real count-valued data examples. Second, the computation of probabilities and inference of parameters is especially cumbersome for the multivariate Poisson distribution; these are only computationally tractable for small d and hence not readily applicable in high-dimensional settings. Kano and Kawamura17 proposed multivariate recursion schemes for computing probabilities, but these schemes are only stable and computationally feasible for small d, thus complicating likelihood-based inference procedures. Karlis18 more recently proposed a latent variable-based EM algorithm for parameter inference of the general multivariate Poisson distribution. This approach treats every pairwise interaction as a latent variable and conducts inference over both the observed and hidden parameters. While this method is more tractable than recursion schemes, it still requires inference over latent variables and is hence not feasible in high-dimensional settings. Overall, the multivariate Poisson distribution introduced above is appealing in that its marginal distributions are Poisson; yet, there are many modeling drawbacks, including severe restriction on the types of dependencies permitted (e.g., only positive relationships), a complicated and intractable form in high dimensions, and challenging inference procedures.

Copula Approaches

A much more general way to construct valid multivariate Poisson distributions with Poisson marginals is by pairing a copula distribution with Poisson marginal distributions. For continuous multivariate distributions, the use of copula distributions is founded on the celebrated Sklar's theorem: any continuous joint distribution can be decomposed into a copula and the marginal distributions, and conversely, any combination of a copula and marginal distributions gives a valid continuous joint distribution.21 The key advantage of such models for continuous distributions is that copulas fully specify the dependence structure, hence separating the modeling of marginal distributions from the modeling of dependencies. While copula distributions paired with continuous marginal distributions enjoy wide popularity (see e.g., Ref 22 in finance applications), copula models paired with discrete marginal distributions, such as the Poisson, are more challenging both for theoretical and computational reasons.23–25 However, several simplifications and recent advances have attempted to overcome these challenges.24–26

Copula Definition and Examples

A copula is defined by a joint cumulative distribution function (CDF), C(u): [0, 1]^d → [0, 1], with uniform marginal distributions. As a concrete example, the Gaussian copula (see the left subfigure of Figure 2 for an example) is derived from the multivariate normal distribution and is one of the most popular multivariate copulas because of its flexibility in the multidimensional case; the Gaussian copula is defined simply as:

C_R^Gauss(u1, u2, …, ud) = H_R(H^{−1}(u1), …, H^{−1}(ud)),

where H^{−1}(·) denotes the standard normal inverse CDF, and H_R(·) denotes the joint CDF of an N(0, R) random vector, where R is a correlation matrix. A similar multivariate copula can be derived from the multivariate Student's t distribution if extreme values are important to model.27
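To make the multivariate reduction scheme above concrete, the following sketch (our own illustration, assuming NumPy; `reduction_matrix` is a hypothetical helper, not from the cited references) builds A for a small d and samples x = Ay; with d = 2 and A = [A1, Ad], it reduces to the bivariate construction x1 = y1 + z, x2 = y2 + z:

```python
from itertools import combinations

import numpy as np


def reduction_matrix(d):
    """Build A = [A_1, ..., A_d], where the columns of A_i are the 0/1
    indicator vectors of all size-i subsets of {1, ..., d}. A_1 is the
    d x d identity and A_d is a single column of all ones."""
    cols = []
    for size in range(1, d + 1):
        for subset in combinations(range(d), size):
            col = np.zeros(d)
            col[list(subset)] = 1.0
            cols.append(col)
    return np.column_stack(cols)  # shape (d, 2^d - 1)


rng = np.random.default_rng(0)
d = 3
A = reduction_matrix(d)                       # 3 x 7
lam = rng.uniform(0.5, 2.0, size=A.shape[1])  # one rate per latent Poisson

# x = A y: each x_i sums the latent Poissons whose subset contains i, so
# Cov(x_i, x_j) is the total rate of the latent variables shared by i and
# j, which is always non-negative (only positive dependencies).
y = rng.poisson(lam, size=(100_000, A.shape[1]))
x = y @ A.T

print(x.mean(axis=0))  # close to the marginal rates A @ lam
```

The empirical marginal means match A @ lam, and every pairwise covariance is non-negative, which illustrates the positive-dependence limitation discussed above.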
(Figure 2 panels, left to right: Gaussian copula density (r = 0.50); marginal Poissons; copula-Poisson joint distribution.)
FIGURE 2 | A copula distribution (left)—which is defined over the unit hypercube and has uniform marginal distributions—paired with univariate Poisson marginal distributions for each variable (middle) defines a valid discrete joint distribution with Poisson marginals (right).
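The construction illustrated in Figure 2 can be sketched directly (our own illustration, assuming NumPy and SciPy; not code from the reviewed references): sample from the Gaussian copula, then push each coordinate through the Poisson quantile function (inverse CDF):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

lam = np.array([2.0, 5.0])     # Poisson marginal means
R = np.array([[1.0, 0.5],
              [0.5, 1.0]])     # Gaussian copula correlation

# Sample the Gaussian copula: z ~ N(0, R), so u_i = H(z_i) is Uniform(0, 1).
z = rng.multivariate_normal(np.zeros(2), R, size=50_000)
u = stats.norm.cdf(z)

# Inverse-CDF transform to the target space: x_i = F_i^{-1}(u_i | lam_i).
# The result is a discrete joint distribution with Poisson marginals.
x = stats.poisson.ppf(u, lam).astype(int)

print(x.mean(axis=0))          # close to lam
print(np.corrcoef(x.T)[0, 1])  # positive dependence induced by the copula
```

Note that the empirical Pearson correlation of the counts is attenuated relative to the latent correlation of 0.5, a consequence of the discreteness of the marginals.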
The Archimedean copulas are another family of copulas that have a single parameter that defines the global dependence between all variables.28 One property of Archimedean copulas is that they admit an explicit form, unlike the Gaussian copula. Unfortunately, the Archimedean copulas do not directly allow for a rich dependence structure like the Gaussian because they only have one dependence parameter rather than a parameter for each pair of variables.

Pair copula constructions (PCCs),29 or vine copulas, allow combinations of different bivariate copulas to form a joint multivariate copula. PCCs define multivariate copulas that have an expressive dependency structure, like the Gaussian copula, but may also model asymmetric or tail dependencies available in Archimedean and t copulas. Pair copulas only use univariate CDFs, conditional CDFs, and bivariate copulas to construct a multivariate copula distribution and hence can use combinations of the Archimedean copulas described previously. The multivariate distributions can be factorized in a variety of ways using bivariate copulas to flexibly model dependencies. Vines, or graphical tree-like structures, denote the possible factorizations that are feasible for PCCs.30

Copula Models for Discrete Data

As per Sklar's theorem, any copula distribution can be combined with marginal distribution CDFs, {F_i(x_i)}_{i=1}^{d}, to create a joint distribution:

G(x1, x2, …, xd | θ, F1, …, Fd) = C_θ(u1 = F1(x1), …, ud = Fd(xd)).

If sampling from the given copula is possible, this form admits simple direct sampling from the joint distribution (defined by the CDF G(·)) by first sampling from the copula u ~ Copula(θ) and then transforming u to the target space using the inverse CDFs of the marginal distributions: x = [F1^{−1}(u1), …, Fd^{−1}(ud)].

A valid multivariate discrete joint distribution can be derived by pairing a copula distribution with Poisson marginal distributions. For example, a valid joint CDF with Poisson marginals is given by

G(x1, x2, …, xd | θ) = C_θ(F1(x1 | λ1), …, Fd(xd | λd)),

where Fi(xi | λi) is the Poisson CDF with mean parameter λi, and θ denotes the copula parameters. If we pair a Gaussian copula with Poisson marginal distributions, we create a valid joint distribution that has been widely used for generating samples of multivariate count data7,31,32; an example of the Gaussian copula paired with Poisson marginals to form a discrete joint distribution can be seen in Figure 2.

Nikoloulopoulos5 presents an excellent survey of copulas to be paired with discrete marginals by defining several desired properties of a copula (quoted from Ref 5):

1. Wide range of dependence, allowing both positive and negative dependence.

2. Flexible dependence, meaning that the number of bivariate marginals is (approximately) equal to the number of dependence parameters.

3. Computationally feasible CDF for likelihood estimation.

4. Closure property under marginalization, meaning that lower-order marginals belong to the same parametric family.
5. No joint constraints for the dependence parameters, meaning that the use of covariate functions for the dependence parameters is straightforward.

Each copula model satisfies some of these properties but not all of them. For example, Gaussian copulas satisfy properties (1), (2), and (4), but not (3), because the Gaussian CDF is not known in closed form, nor (5), because the correlation parameter matrix is constrained to be in the set of positive definite matrices. Nikoloulopoulos5 recommends Gaussian copulas for general models and vine copulas if modeling dependence in the tails or asymmetry is needed.

Theoretical Properties of Copulas Derived from Discrete Distributions

From a theoretical perspective, a multivariate discrete distribution can be viewed as a continuous copula distribution paired with discrete marginals, but the derived copula distributions are not unique and, hence, are unidentifiable.23 Note that this is in contrast to continuous multivariate distributions, where the derived copulas are uniquely defined.33 Because of this nonuniqueness property, Genest and Nešlehová23 caution against performing inference on and interpreting dependencies of copulas derived from discrete distributions. A further consequence of nonuniqueness is that when copula distributions are paired with discrete marginal distributions, the copulas no longer fully specify the dependence structure, as with continuous marginals.23 In other words, the dependencies of the joint distribution will depend in part on which marginal distributions are employed. In practice, this often means that the range of dependencies permitted with certain copula and discrete marginal distribution pairs is much more limited than the copula distribution would otherwise model. However, several studies have suggested that this nonuniqueness property does not have major practical ramifications.5,34

We discuss a few common approaches used for the estimation of continuous copulas with discrete marginals.

Continuous Extension for Parameter Estimation

For the estimation of continuous copulas from data, a two-stage procedure, called Inference Function for Marginals (IFM),35 is commonly used in which the marginal distributions are estimated first and then used to map the data onto the unit hypercube using the CDFs of the inferred marginal distributions. While this is straightforward for continuous marginals, this procedure is less obvious for discrete marginal distributions when using a continuous copula. One idea is to use the continuous extension (CE) of integer variables to the continuous domain36 by forming a new 'jittered' continuous random variable x̃:

x̃ = x + (u − 1),

where u is a random variable defined on the unit interval. It is clear that this new random variable is continuous and that x̃ ≤ x. An obvious choice for the distribution of u is the uniform distribution. With this idea, inference can be performed using a surrogate likelihood by randomly projecting each discrete data point into the continuous domain and averaging over the random projections, as performed in Refs 37,38. Madsen39 and Madsen and Fang40 use the CE idea as well but generate multiple jittered samples {x̃^(1), x̃^(2), …, x̃^(m)} for each original observation x to estimate the discrete likelihood, rather than merely generating one jittered sample x̃ for each original observation x as in Refs 37,38. Nikoloulopoulos24 finds that CE-based methods significantly underestimate the correlation structure because the CE jitter transform operates independently for each variable instead of considering the correlation structure between the variables.

Distributional Transform for Parameter Estimation

In a somewhat different direction, Rüschendorf26 proposed the use of a generalization of the CDF F(·) for the case with discrete variables, which they term a distributional transform (DT), denoted by F̃(·):

F̃(x, v) ≡ F(x−) + v ℙ(X = x) = ℙ(X < x) + v ℙ(X = x),

where v ~ Uniform(0, 1). Note that in the continuous case, ℙ(X = x) = 0, and thus, this reduces to the standard CDF for continuous distributions. One way of thinking of this modified CDF is that the random variable v adds a random jump when there are discontinuities in the original CDF. If the distribution is discrete (or more generally, if there are discontinuities in the original CDF), this transformation enables the simple proof of a theorem akin to Sklar's theorem for discrete distributions.26 Kazianka41 and Kazianka and Pilz42 propose using the DT from Ref 26 to develop a simple and intuitive approximation for the likelihood.
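A defining property of the DT is that, for v ~ Uniform(0, 1) drawn independently of a discrete X, the transformed variable F̃(X, v) is exactly Uniform(0, 1), recovering for discrete distributions what the plain CDF transform provides in the continuous case. A minimal numerical check of this property, assuming SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
lam = 3.0

x = rng.poisson(lam, size=100_000)
v = rng.uniform(size=x.size)

# Distributional transform: F~(x, v) = P(X < x) + v * P(X = x).
# For integer x, P(X < x) is the Poisson CDF evaluated at x - 1.
u = stats.poisson.cdf(x - 1, lam) + v * stats.poisson.pmf(x, lam)

print(u.mean(), u.var())  # close to 1/2 and 1/12, as for Uniform(0, 1)
```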
Essentially, they simply take the expected jump value E(v) = 0.5 (where v ~ Uniform(0, 1)) and thus transform the discrete data to the continuous domain through the following:

u_i ≡ F_i(x_i − 1) + 0.5 ℙ(x_i) = 0.5 (F_i(x_i − 1) + F_i(x_i)),

which can be seen as simply taking the average of the CDF values at x_i − 1 and x_i. Then, they use a continuous copula such as the Gaussian copula. Note that this is much simpler to compute than the simulated likelihood (SL) method in Ref 24 or the CE methods in Refs 37–40, which require averaging over many different random initializations.

SL for Parameter Estimation

Finally, Nikoloulopoulos24 proposes a method to directly approximate the maximum likelihood estimate (MLE) by estimating a discretized Gaussian copula. Essentially, unlike the CE and DT methods, which attempt to transform discrete variables to continuous variables, the MLE for a Gaussian copula with discrete marginal distributions F1, F2, …, Fd can be formulated as estimating multivariate normal rectangular probabilities:

ℙ(x | γ, R) = ∫_{ϕ^{−1}[F1(x1 − 1 | γ1)]}^{ϕ^{−1}[F1(x1 | γ1)]} ⋯ ∫_{ϕ^{−1}[Fd(xd − 1 | γd)]}^{ϕ^{−1}[Fd(xd | γd)]} Φ_R(z1, …, zd) dz1 ⋯ dzd,    (4)

where γ denotes the marginal distribution parameters, ϕ^{−1}(·) is the univariate standard normal inverse CDF, and Φ_R(·) is the multivariate normal density with correlation matrix R. Nikoloulopoulos24 proposes to approximate the multivariate normal rectangular probabilities via fast simulation algorithms discussed in Ref 43. Because this method directly approximates the MLE via simulated algorithms, this method is called SL. Nikoloulopoulos25 compares the DT and SL methods for small sample sizes and finds that the DT method tends to overestimate the correlation structure. However, because of the computational simplicity, Nikoloulopoulos25 provides some heuristics of when the DT method might work well compared to the more accurate but more computationally expensive SL method.

Vine Copulas for Discrete Distributions

Panagiotelis et al.44 provide conditions under which a multivariate discrete distribution can be decomposed as a vine PCC copula paired with discrete marginals. In addition, Panagiotelis et al.44 show that the likelihood computation for vine PCCs with discrete marginals is quadratic, as opposed to exponential, as would be the case for general multivariate copulas such as the Gaussian copula with discrete marginals. However, computation in truly high-dimensional settings remains a challenge, as 2d(d − 1) bivariate copula evaluations are required to calculate the probability mass function (PMF) or likelihood of a d-variate PCC using the algorithm proposed by Panagiotelis et al.44 These bivariate copula evaluations, however, can be coupled with some of the previously discussed computational techniques, such as CE, DT, and SL, for further computational improvements. Finally, while vine PCCs offer a very flexible modeling approach, this comes with the added challenge of selecting the vine construction and bivariate copulas,45 which has not been well studied for discrete distributions.5 Overall, Nikoloulopoulos5 recommends using vine PCCs for the complex modeling of discrete data with tail dependencies and asymmetric dependencies.

Summary of Marginal Poisson Generalizations

We have reviewed the historical development of the multivariate Poisson, which has Poisson marginals, and then reviewed many of the recent developments of using the much more general copula framework to derive Poisson generalizations with Poisson marginals. The original multivariate Poisson models based on latent Poisson variables are limited to positive dependencies and require computationally expensive algorithms to fit. However, the estimation of copula distributions paired with Poisson marginals—although it theoretically has some caveats—can be performed efficiently in practice. Simple approximations, such as the expectation under the DT, can provide nearly trivial transformations that move the discrete variables to the continuous domain, in which all the tools of continuous copulas can be exploited. More complex transformations, such as the SL method,24 can be used if the sample size is small or high accuracy is needed.

POISSON MIXTURE GENERALIZATIONS

Instead of directly extending univariate Poissons to the multivariate case, a separate line of work proposes to indirectly extend the Poisson based on the mixture of independent Poissons. Mixture models are often considered to provide more flexibility by allowing the parameter to vary according to a mixing
distribution. One important property of mixture models is that they can model overdispersion. Overdispersion occurs when the variance of the data is larger than the mean of the data—unlike in a Poisson distribution, in which the mean and variance are equal. One way of quantifying dispersion is the dispersion index:

δ = σ²/μ.    (5)

If δ > 1, then the distribution is overdispersed, whereas if δ < 1, then the distribution is underdispersed. In real-world data, as will be seen in the experimental section, overdispersion is more common than underdispersion. Mixture models also enable dependencies between the variables, as will be described in the following paragraphs.

Suppose we are modeling a univariate random variable x with a density of f(x | θ). Rather than assuming θ is fixed, we let θ itself be a random variable following some mixing distribution. More formally, a general mixture distribution can be defined as46:

ℙ(x | g(·)) = ∫_Θ f(x | θ) g(θ) dθ,    (6)

where the parameter θ is assumed to come from the mixing distribution g(θ) and Θ is the domain of θ. For the Poisson case, let λ ∈ ℝ++^d be a d-dimensional vector whose i-th element λi is the parameter of the Poisson distribution for xi. Now, given some mixing distribution g(λ), the family of Poisson mixture distributions is defined as:

ℙ_MixedPoi(x) = ∫_{ℝ++^d} g(λ) Π_{i=1}^{d} ℙ_Poiss(xi | λi) dλ,    (7)

where the domain of the joint distribution is any count-valued assignment (i.e., xi ∈ ℤ+, ∀i). While the probability density function (Eq. (7)) has a complicated form involving a multidimensional integral (a complex, high-dimensional integral when d is large), the mean and variance are known to be expressed succinctly as:

E(x) = E(λ),    (8)

Var(x) = E(λ) + Var(λ).    (9)

Note that Eq. (9) implies that the variance of a mixture is always larger than the variance of a single distribution. The higher-order moments of x are also easily represented by those of λ. Besides the moments, other interesting properties (convolutions, identifiability, etc.) of Poisson mixture distributions are extensively reviewed and studied in Ref 46.

One key benefit of Poisson mixtures is that they permit both positive as well as negative dependencies simply by properly defining g(λ). The intuition behind these dependencies can be more clearly understood when we consider the sample generation process. Suppose we have the distribution g(λ) in two dimensions (i.e., d = 2) with a strong positive dependency between λ1 and λ2. Then, given a sample (λ1, λ2) from g(λ), x1 and x2 are likely to also be positively correlated.

In an early application of the model, Arbous and Kerrich47 constrain the Poisson parameters as different scales of a common gamma variable λ: for i = 1, …, d, the time interval ti is given, and λi is set to ti λ. Hence, g(λ) is a univariate gamma distribution specified by λ ∈ ℝ++, which only allows a simple dependency structure. As another early attempt, Steyn48 chose the multivariate normal distribution for the mixing distribution g(λ) to provide more flexibility on the correlation structure. However, the normal distribution poses problems because λ must reside in ℝ++^d, while the normal distribution is defined on ℝ^d.

One of the most popular choices for g(λ) is the log-normal distribution because of its rich covariance structure and natural positivity constraint^b:

N_log(λ | μ, Σ) = (1 / (Π_{i=1}^{d} λi √((2π)^d |Σ|))) exp(−(1/2) (log λ − μ)^T Σ^{−1} (log λ − μ)).    (10)

The log-normal distribution above is parameterized by μ and Σ, which are the mean and the covariance of (log λ1, log λ2, …, log λd), respectively. Setting the random variable xi to follow the Poisson distribution with parameter λi, we have the multivariate Poisson log-normal distribution49 from Eq. (7):

ℙ_PoiLogN(x | μ, Σ) = ∫_{ℝ++^d} N_log(λ | μ, Σ) Π_{i=1}^{d} ℙ_Poiss(xi | λi) dλ.    (11)

While the joint distribution (Eq. (11)) does not have a closed-form expression, and hence, as d increases, it becomes computationally cumbersome to work with, its moments are available in closed form as a special case of Eq. (9):