<<

Advanced Review A review of multivariate distributions for derived from the David I. Inouye,1 Eunho Yang,2 Genevera I. Allen3,4 and Pradeep Ravikumar5*

The Poisson distribution has been widely studied and used for modeling uni- variate count-valued data. However, multivariate generalizations of the Pois- son distribution that permit dependencies have been far less popular. Yet, real-world, high-dimensional, count-valued data found in word counts, geno- mics, and crime , for example, exhibit rich dependencies and motivate the need for multivariate distributions that can appropriately model this data. We review multivariate distributions derived from the Poisson, categorizing these models into three main classes: (1) where the marginal dis- tributions are Poisson, (2) where the joint distribution is a mixture of inde- pendent multivariate Poisson distributions, and (3) where the node-conditional distributions are derived from the Poisson. We discuss the development of multiple instances of these classes and compare the models in terms of inter- pretability and theory. Then, we empirically compare multiple models from each class on three real-world datasets that have varying data characteristics from different domains, namely traffic accident data, biological next genera- tion sequencing data, and text data. These empirical experiments develop intuition about the comparative advantages and disadvantages of each class of multivariate distribution that was derived from the Poisson. Finally, we sug- gest new research directions as explored in the subsequent Discussion section. © 2017 Wiley Periodicals, Inc.

How to cite this article: WIREs Comput Stat 2017, 9:e1398. doi: 10.1002/wics.1398

Keywords: Poisson, Multivariate, Graphical models, Copulas, High dimensional

INTRODUCTION ultivariate count-valued data has become Additional Supporting Information may be found in the online ver- Mincreasingly prevalent in modern big data set- sion of this article. tings. Variables in such data are rarely independent *Correspondence to: [email protected] and instead exhibit complex positive and negative 1Department of Computer Science, The University of Texas at Aus- dependencies. We highlight three examples of multi- tin, Austin, TX, USA variate count-valued data that exhibit rich depen- 2 School of Computing, Korea Advanced Institute of Science and dencies: text analysis, genomics, and crime statistics. Technology, Daejeon, Republic of Korea In text analysis, a standard way to represent docu- 3 Department of Statistics, Rice University, Houston, TX, USA ments is to merely count the number of occurrences 4 Department of Pediatrics-Neurology, Baylor College of Medicine, of each word in the vocabulary and create a word- Houston, TX, USA count vector for each document. This representation 5School of Computer Science, Carnegie Mellon University, Pitts- burgh, PA, USA is often known as the bag-of-words representation, in which the word order and syntax are ignored. Conflict of interest: The authors have declared no conflicts of inter- est for this article. The vocabulary size—i.e., the number of variables

Volume9,May/June2017 ©2017WileyPeriodicals,Inc. 1of25 Advanced Review wires.wiley.com/compstats in the data—is usually much greater than 1000 where λ is the standard for the Poisson unique words, and thus, a high-dimensional multi- distribution. A trivial extension of this to a multivariate variate distribution is required. Also, words are distribution would be to assume independence between clearly not independent. For example, if the word variables and take the product of node-wise univariate ‘Poisson’ appears in a document, then the word Poisson distributions, but such a model would be ill ‘probability’ is more likely to also appear, signifying suited for many examples of multivariate count-valued a positive dependency. Similarly, if the word ‘art’ data that require rich dependence structures. We appears, then the word ‘probability’ is less likely to review multivariate probability models that are derived also appear, signifying a negative dependency. In from the univariate Poisson distribution and permit genomics, RNA-sequencing technologies are used to nontrivial dependencies between variables. We catego- measure gene and isoform expression levels. These rize these models into three main classes based on their technologies yield counts of reads mapped back to primary modeling assumption. The first class assumes DNA locations, which even after normalization that the univariate marginal distributions are derived yield non-negative data that is highly skewed with from the Poisson. The second class is derived as a mix- many exact zeros. These genomics data are both ture of independent multivariate Poisson distributions. high dimensional, with the number of genes measur- The third class assumes that the univariate conditional ing in the tens of thousands, and strongly dependent distributions are derived from the Poisson as genes work together in pathways and complex distribution—this last class of models can also be stud- systems to produce particular phenotypes. In crime ied in the context of probabilistic graphical models. An analysis, counts of crimes in different counties are illustration of each of these three main model classes clearly multidimensional, with dependencies between can be seen in Figure 1. While these models might have crime counts. For example, the counts of crime in been classified by primary application area or perfor- adjacent counties are likely to be correlated with mance on a particular task, a classification based on one another, indicating a positive dependency. modeling assumptions helps emphasize the core While positive dependencies are probably more abstractions for each model class. In addition, this cate- prevalent in crime statistics, negative dependencies gorization may help practitioners from different disci- might be very interesting. For example, a negative plines learn from the models that have worked well in dependency between adjacent counties may suggest different areas. We discuss multiple instances of these that a criminal gang has moved from one county to classes in the later sections and highlight the strengths the other. and weaknesses of each class. We then provide a short These examples motivate the need for a high- discussion on the differences between classes in terms dimensional, count-valued distribution that permits of interpretability and theory. Using two different rich dependencies between variables. 
In general, a empirical measures, we empirically compare multiple good class of probabilistic models is a fundamental models from each class on three real-world datasets building block for many tasks in data analysis. Esti- that have varying data characteristics from different mating such models from data could help answer domains, namely traffic accident data, biological next exploratory questions such as: Which genomic path- generation sequencing data, and text data. These ways are altered in a disease, e.g., by analyzing geno- experiments develop intuition about the comparative mic networks? Or which county seems to have the advantages and disadvantages of the models and sug- strongest effect, with respect to crime, on other coun- gest new research directions as explored in the subse- ties? A probabilistic model could also be used in quent Discussion section. Bayesian classification to determine questions such as: Does this Twitter post display positive or negative sentiment about a particular product (fitting one model on positive posts and one model on negative Notation posts)? ℝ denotes the set of real numbers, ℝ denotes the The classical model for a count-valued + non-negative real numbers, and ℝ++ denotes the posi- is the univariate Poisson distribu- tive real numbers. Similarly, ℤ denotes the set of inte- tion, whose probability mass function for gers. Matrices are denoted as capital letters … x 2 {0, 1, 2, } is: (e.g., X, Φ), vectors are denoted as boldface lower- case letters (e.g., x, ϕ), and scalar values are nonbold ℙ λ λx −λ = !; PoissðÞxj = expðÞx ð1Þ lowercase letters (e.g., x, ϕ).

2of25 ©2017WileyPeriodicals,Inc. Volume9,May/June2017 WIREs Computational Statistics A review of multivariate distributions for count data

Marginal Poisson Mixture of Poissons Conditional Poissons Multivariate Poissons Marginals are Conditionals Poisson are Poisson

0 0

0 5 0 5 5 5 10 10 10 10

Mixture 0 0

0 5 0 5 5 5 10 0 10 10 10 0 5 5 10 10

FIGURE 1 | (Left) The first class of Poisson generalizations is based on the assumption that the univariate marginals are derived from the Poisson. (Middle) The second class is based on the idea of mixing independent multivariate Poissons into a joint multivariate distribution. (Right) The third class is based on the assumption that the univariate conditional distributions are derived from the Poisson.

MARGINAL POISSON As the sum of independent Poissons is also Poisson GENERALIZATIONS (whose parameter is the sum of those of two compo- nents), the marginal distribution of x1 (similarly x2) The models in this section generalize the univariate is still a Poisson with the rate of λ1 + λ0. It can be Poisson to a multivariate distribution with the property easily seen that the covariance of x1 and x2 is λ0, and that the marginal distributions of each variable are as a result, the correlation coefficient is somewhere nopffiffiffiffiffiffiffiffiffiffi pffiffiffiffiffiffiffiffiffiffi Poisson. This is analogous to the fact that the marginal λ λ λ λ ffiffiffiffiffiffiffiffiffiffi1 + 0 ffiffiffiffiffiffiffiffiffiffi2 + 0 8 between 0 and min p , p . Independently, distributions of the multivariate Gaussian distribution λ2 + λ0 λ1 + λ0 are univariate Gaussian distributions and thus seems Wicksell4 derived the bivariate Poisson as the limit of like a natural constraint when extending the univariate a bivariate . Campbell1 shows Poisson to the multivariate case. Several historical that the models in M’Kendrick2 and Wicksell4 can attempts at achieving this marginal property have inci- identically be derived from the sums of three inde- dentally developed the same class of models, with dif- pendent Poisson variables. 1–4 ferent derivations. This marginal Poisson property This approach to directly extend the Poisson can also be achieved via the more general framework distribution can be generalized further to handle the 5–7 of copulas. ℤd multivariate case x 2 + , in which each variable xi is the sum of individual Poisson yi and the common Multivariate Poisson Distribution Poisson x0 as before. The joint probability for a mul- 3 The formulation of the multivariate Poissona distri- tivariate Poisson is developed in Teicher and further 9–12 bution goes back to M’Kendrick2 where the author considered by other works : used differential equations to derive the bivariate ! ! Xd Yd λxi Poisson process. An equivalent but more readable i ℙMulPoiðÞx;λ = exp − λi interpretation to arrive at the bivariate Poisson distri- xi! i =0 i =1 0 1 bution would be to use the summation of independ- z 1 ent Poisson variables as follows : let y1, y2, and z be !B univariate Poisson variables with λ , λ minXixi Yd B λ C 1 2, xi B 0 C λ z!B C : ð3Þ and 0, respectively. Then, by setting x1 = y1 + z and z Yd z =0 i =1 @ A x2 = y2 + z,(x1, x2) follows the bivariate Poisson dis- λi tribution, and its joint probability mass is defined as: i =1

ℙ λ λ λ −λ −λ −λ BiPoiðÞx1,x2 j 1, 2, 0 = expðÞ1 2 0 Several studies have shown that this formulation of x1 x2 minXðÞx1,x2 z the multivariate Poisson can also be derived as a lim- λ λ x x λ0 1 2 1 2 z! : ð2Þ iting distribution of a multivariate binomial distribu- x ! ! z z λ λ 1 2 z =0 1 2 tion when the success probabilities are small and the

Volume9,May/June2017 ©2017WileyPeriodicals,Inc. 3of25 Advanced Review wires.wiley.com/compstats

– number of trials is large.13 15 As in the bivariate case, is hence not feasible in high-dimensional settings. the marginal distribution of xi is Poisson with param- Overall, the multivariate Poisson distribution intro- eter λi + λ0.Asλ0 controls the covariance between all duced above is appealing in that its marginal distri- variables, an extremely limited set of correlations butions are Poisson; yet, there are many modeling between variables is permitted. drawbacks, including severe restriction on the types Mahamunulu16 first proposed a more general of dependencies permitted (e.g., only positive rela- extension of the multivariate Poisson distribution tionships), a complicated and intractable form in that permits a full covariance structure. This distribu- high dimensions, and challenging inference – tion has been studied further by many studies.13,17 20 procedures. While the form of this general multivariate Poisson distribution is too complicated to spell out for d >3, its distribution can be specified by a multivariate Copula Approaches fi reduction scheme. Speci cally, let yi for i =1, A much more general way to construct valid multi- d …,(2 − 1) be independently Poisson distributed variate Poisson distributions with Poisson marginals λ fi with parameter i. Now, de ne A =[A1, A2, is by pairing a copula distribution with Poisson mar- d … A ] where A is a d × matrix consisting of ginal distributions. For continuous multivariate dis- d i i tributions, the use of copula distributions is founded ones and zeros, where each column of Ai has exactly on the celebrated Sklar’s theorem: any continuous i ones with no duplicate columns. Hence, A1 is the joint distribution can be decomposed into a copula d × d identity matrix, and Ad is a column vector of and the marginal distributions, and conversely, any all ones. Then, x = Ay is a d -dimensional multivari- combination of a copula and marginal distributions ate Poisson-distributed random vector with a full gives a valid continuous joint distribution.21 The key covariance structure. Note that the simpler multivari- advantage of such models for continuous distribu- ate Poisson distribution with constant covariance in tions is that copulas fully specify the dependence Eq. 0.3 is a special case of this general form, where structure, hence separating the modeling of marginal A =[A1, Ad]. distributions from the modeling of dependencies. The multivariate Poisson distribution has not While copula distributions paired with continuous been widely used for real data applications. This is marginal distributions enjoy wide popularity (see likely due to two major limitations of this distribu- e.g., Ref 22 in finance applications), copula models tion. First, the multivariate Poisson distribution only paired with discrete marginal distributions, such as permits positive dependencies; this can easily be seen the Poisson, are more challenging both for theoretical – because the distribution arises as the sum of inde- and computational reasons.23 25 However, several pendent Poisson random variables, and hence, covar- simplifications and recent advances have attempted – iances are governed by the positive rate parameters to overcome these challenges.24 26 λi. The assumption of positive dependencies is likely unrealistic for most real count-valued data examples. 
fi Second, the computation of probabilities and infer- Copula De nition and Examples A copula is defined by a joint cumulative distribution ence of parameters is especially cumbersome for the d multivariate Poisson distribution; these are only com- function (CDF), C(u) : [0, 1] ! [0, 1], with uniform putationally tractable for small d and hence not read- marginal distributions. As a concrete example, the fi ily applicable in high-dimensional settings. Kano Gaussian copula (see left sub gure of Figure 2 for an and Kawamura17 proposed multivariate recursion example) is derived from the multivariate normal dis- schemes for computing probabilities, but these tribution and is one of the most popular multivariate fl schemes are only stable and computationally feasible copulas because of its exibility in the multidimen- fi for small d, thus complicating likelihood-based infer- sional case; the Gaussian copula is de ned simply as: ence procedures. Karlis18 more recently proposed a ÀÁ Gauss … −1 … −1 ; latent variable-based EM algorithm for parameter CR ðÞu1,u2, ,ud = HR H ðÞu1 , ,H ðÞud inference of the general multivariate Poisson distribu- − 1 tion. This approach treats every pairwise interaction where H () denotes the standard normal inverse as a latent variable and conducts inference over both CDF, and HR() denotes the joint CDF of a N ðÞ0, the observed and hidden parameters. While this random vector, where R is a correlation matrix. method is more tractable than recursion schemes, it A similar multivariate copula can be derived from ’ d the multivariate Student s t distribution if extreme still requires inference over latent variables and 27 2 values are important to model.

4of25 ©2017WileyPeriodicals,Inc. Volume9,May/June2017 WIREs Computational Statistics A review of multivariate distributions for count data

Gaussion copula density (r = 0.50) Marginal Poissons Copula-Poisson joint dist.

0 123456789 Poisson 1

0 0

0 0.5 0 5 u 0.5 2 0123456789 5 u 1 10 10 1 1 Poisson 2

FIGURE 2 | A copula distribution (left)—which is defined over the unit hypercube and has uniform marginal distributions—paired with univariate Poisson marginal distributions for each variable (middle) defines a valid discrete joint distribution with Poisson marginals (right).

The Archimedean copulas are another family of distribution (defined by the CDF G()) by first sampling copulas that have a single parameter that defines the from the copula u Copula(θ) and then transforming 28 global dependence between all variables. One prop- u to the target usingÂÃ the inverse CDFs of the mar- erty of Archimedean copulas is that they admit an −1 … −1 ginal distributions: x = F1 ðÞu1 , ,Fd ðÞud . explicit form, unlike the Gaussian copula. Unfortu- A valid multivariate discrete joint distribution nately, the Archimedean copulas do not directly can be derived by pairing a copula distribution with allow for a rich dependence structure like the Gaus- Poisson marginal distributions. For example, a valid sian because they only have one dependence parame- joint CDF with Poisson marginals is given by ter rather than a parameter for each pair of variables. 29 Pair copula constructions (PCCs) for copulas, GxðÞ1,x2,…,xd jθ = CθðÞF1ðÞx1 jλ1 ,…,FdðÞxd jλd ; or vine copulas, allow combinations of different bivari- ate copulas to form a joint multivariate copula. PCCs where Fi(xi | λi) is the Poisson CDF with mean fi de ne multivariate copulas that have an expressive parameter λi, and θ denotes the copula parameters. dependency structure, like the Gaussian copula, but If we pair a Gaussian copula with Poisson marginal may also model asymmetric or tail dependencies avail- distributions, we create a valid joint distribution able in Archimedean and t copulas. Pair copulas only that has been widely used for generating samples of use univariate CDFs, conditional CDFs, and bivariate multivariate count data7,31,32; an example of the copulas to construct a multivariate copula distribution Gaussian copula paired with Poisson marginals to and hence can use combinations of the Archimedean form a discrete joint distribution can be seen in copulas described previously. The multivariate distri- Figure 2. butions can be factorized in a variety of ways using Nikoloulopoulos5 present an excellent survey bivariate copulas to flexibly model dependencies. of copulas to be paired with discrete marginals by Vines, or graphical tree-like structures, denote the pos- defining several desired properties of a copula 30 sible factorizations that are feasible for PCCs. (quoted from Ref 5):

1. Wide of dependence, allowing both posi- Copula Models for Discrete Data tive and negative dependence. As per Sklar’s theorem, any copula distribution can be combined with marginal distribution CDFs, 2. Flexible dependence, meaning that the number of bivariate marginals is (approximately) equal fgF ðÞx d to create a joint distribution: i i i =1, to the number of dependence parameters. 3. Computationally feasible CDF for likelihood GxðÞ1,x2,…,xd jθ,F1,…,Fd estimation. = CθðÞu1 = F1ðÞx1 ,…,ud = FdðÞxd : 4. Closure property under marginalization, mean- If sampling from the given copula is possible, this form ing that lower-order marginals belong to the admits simple direct sampling from the joint same parametric family.

Volume9,May/June2017 ©2017WileyPeriodicals,Inc. 5of25 Advanced Review wires.wiley.com/compstats

5. No joint constraints for the dependence para- marginals, this procedure is less obvious for discrete meters, meaning that the use of covariate marginal distributions when using a continuous cop- functions for the dependence parameters is ula. One idea is to use the continuous extension straightforward. (CE) of integer variables to the continuous domain36 by forming a new ‘jitter’ continuous random varia- Each copula model satisfies some of these properties ble xe: but not all of them. For example, Gaussian copulas satisfy properties (1), (2), and (4) but not (3) because xe = x + ðÞu−1 ; the Gaussian CDF is not known in closed form nor (5) because the correlation parameter matrix is con- where u is a random variable defined on the unit strained to be in the set of positive definite matrices. interval. It is clear that this new random variable is Nikoloulopoulos5 recommends Gaussian copulas for continuous, and xe ≤ x. An obvious choice for the dis- general models and vine copulas if modeling depend- tribution of u is the uniform distribution. With this ence in the tails or asymmetry is needed. idea, inference can be performed using a surrogate likelihood by randomly projecting each discrete data Theoretical Properties of Copulas Derived from point into the continuous domain and averaging over Discrete Distributions the random projections, as performed in Refs 37,38. From a theoretical perspective, a multivariate discrete Madsen39 and Madsen and Fang40 use the CE idea distribution can be viewed as a continuous copula nas well but generate multiple jittered samples distribution paired with discrete marginals, but the xeðÞ1 ,xeðÞ1 ,…,xeðÞm g for each original observation x to derived copula distributions are not unique and, hence, are unidentifiable.23 Note that this is in con- estimate the discrete likelihood rather than merely generating one jittered sample xe for each original trast to continuous multivariate distributions where 24 the derived copulas are uniquely defined.33 Because observation x as in Refs 37,38. Nikoloulopoulos fi fi of this nonuniqueness property, Genest and Nešle- nds that CE-based methods signi cantly underesti- hová23 caution against performing inference on and mate the correlation structure because the CE jitter interpreting dependencies of copulas derived from transform operates independently for each variable discrete distributions. A further consequence of non- instead of considering the correlation structure uniqueness is that when copula distributions are between the variables. paired with discrete marginal distributions, the copu- las no longer fully specify the dependence structure, as with continuous marginals.23 In other words, the Distributional Transform for Parameter Estimation dependencies of the joint distribution will depend in 26 part on which marginal distributions are employed. In a somewhat different direction, Rüschendorf In practice, this often that the range of depen- proposed the use of a generalization of the CDF dis- dencies permitted with certain copula and discrete tribution function F() for the case with discrete vari- ables, which they term a distributional transform marginal distribution pairs is much more limited than e the copula distribution would otherwise model. (DT), denoted by FðÞ : However, several studies have suggested that this e nonuniqueness property does not have major practi- FxðÞ,v ≡ FxðÞ+ vℙðÞx = ℙðÞX < x + vℙðÞX = x ; cal ramifications.5,34 We discuss a few common approaches used for where v Uniform(0, 1). 
Note that in the continu- the estimation of continuous copulas with discrete ous case, ℙ(X = x)=0, and thus, this reduces to the marginals. standard CDF for continuous distributions. One way of thinking of this modified CDF is that the random Continuous Extension for Parameter variable v adds a random jump when there are dis- Estimation continuities in the original CDF. If the distribution is For the estimation of continuous copulas from data, discrete (or more generally, if there are discontinu- a two-stage procedure, called Inference Function for ities in the original CDF), this transformation enables Marginals (IFM)35 is commonly used in which the the simple proof of a theorem akin to Sklar’s theorem marginal distributions are estimated first and then for discrete distributions.26 used to map the data onto the unit hypercube using Kazianka41 and Kazianka and Pilz42 propose the CDFs of the inferred marginal distributions. using the DT from Ref 26 to develop a simple and While this is straightforward for continuous intuitive approximation for the likelihood.

6of25 ©2017WileyPeriodicals,Inc. Volume9,May/June2017 WIREs Computational Statistics A review of multivariate distributions for count data

Essentially, they simply take the expected jump value likelihood computation for vine PCCs with discrete of EðÞv =0:5 (where v Uniform(0, 1)) and thus marginals is quadratic as opposed to exponential, as transform the discrete data to the continuous domain would be the case for general multivariate copulas through the following: such as the Gaussian copula with discrete marginals. However, computation in truly high-dimensional set- ui ≡ FiðÞxi −1 +0:5ℙðÞxi =0:5ðÞFiðÞxi −1 + FiðÞxi ; tings remains a challenge as 2d(d − 1) bivariate cop- ula evaluations are required to calculate the which can be seen as simply taking the average of the probability mass function (PMF) or likelihood of a d- CDF values at xi − 1 and xi. Then, they use a contin- variate PCC using the algorithm proposed by Pana- uous copula such as the Gaussian copula. Note that giotelis et al.44. These bivariate copula evaluations, this is much simpler to compute than the simulated however, can be coupled with some of the previously likelihood (SL) method in Ref 24 or the CE methods discussed computational techniques, such as CE, DT, in Refs 37–40, which require averaging over many and SL, for further computational improvements. different random initializations. Finally, while vine PCCs offer a very flexible model- ing approach, this comes with the added challenge of SL for Parameter Estimation selecting the vine construction and bivariate 45 Finally, Nikoloulopoulos24 proposes a method to copulas, which has not been well studied for dis- 5 directly approximate the maximum likelihood esti- crete distributions. Overall, Nikoloulopoulos recom- mate (MLE) by estimating a discretized Gaussian mends using vine PCCs for the complex modeling of copula. Essentially, unlike the CE and DT methods, discrete data with tail dependencies and asymmetric which attempt to transform discrete variables to con- dependencies. tinuous variables, the MLE for a Gaussian copula … with discrete marginal distributions F1, F2, , Fd Summary of Marginal Poisson can be formulated as estimating multivariate normal Generalizations rectangular probabilities: We have reviewed the historical development of the ðϕ−1 γ ðϕ−1 γ multivariate Poisson, which has Poisson marginals, ½F1ðÞx1 j 1 ½FdðÞxd j d ℙðÞxjγ,R = and then reviewed many of the recent developments ϕ−1 − γ ϕ−1 − γ ½F1ðÞx1 1j 1 ½F1ðÞxd 1j d of using the much more general copula framework to derive Poisson generalizations with Poisson margin- Φ … … ; RðÞz1, ,zd dz1 dzd ð4Þ als. The original multivariate Poisson models based on latent Poisson variables are limited to positive where γ is the marginal distribution parameters, dependencies and require computationally expensive − ϕ 1() is the univariate standard normal inverse algorithms to fit. However, the estimation of copula — CDF, and ΦR() is the multivariate normal density distributions paired with Poisson marginals with correlation matrix R. Nikoloulopoulos24 pro- although theoretically has some caveats—can be per- poses to approximate the multivariate normal rectan- formed efficiently in practice. Simple approximations, gular probabilities via fast simulation algorithms such as the expectation under the DT, can provide discussed in Ref 43. Because this method directly nearly trivial transformations that move the discrete approximates the MLE via simulated algorithms, this variables to the continuous domain in which all the method is called SL. Nikoloulopoulos25 compares the tools of continuous copulas can be exploited. 
More DT and SL methods for small sample sizes and find complex transformations, such as the SL method24 that the DT method tends to overestimate the corre- can be used if the sample size is small, or high accu- lation structure. However, because of the computa- racy is needed. tional simplicity, Nikoloulopoulos25 provides some heuristics of when the DT method might work well POISSON MIXTURE compared to the more accurate but more computa- tionally expensive SL method. GENERALIZATIONS Instead of directly extending univariate Poissons to Vine Copulas for Discrete Distributions the multivariate case, a separate line of work pro- Panagiotelis et al.44 provide conditions under which poses to indirectly extend the Poisson based on the a multivariate discrete distribution can be decom- mixture of independent Poissons. Mixture models are posed as a vine PCC copula paired with discrete mar- often considered to provide more flexibility by allow- ginals. In addition, Panagiotelis et al.44 show that the ing the parameter to vary according to a mixing

Volume9,May/June2017 ©2017WileyPeriodicals,Inc. 7of25 Advanced Review wires.wiley.com/compstats distribution. One important property of mixture easily represented by those of λ. Besides the moments, models is that they can model . Over- other interesting properties (convolutions, identifia- dispersion occurs when the of the data is bility, etc.) of Poisson mixture distributions are larger than the mean of the data—unlike in a Poisson extensively reviewed and studied in Ref 46. distribution in which the mean and variance are One key benefit of Poisson mixtures is that they equal. One way of quantifying dispersion is the dis- permit both positive as well as negative dependencies persion index: simply by properly defining g(λ). The intuition behind these dependencies can be more clearly under- σ2 stood when we consider the sample generation proc- δ : = μ ð5Þ ess. Suppose we have the distribution g(λ) in two dimensions (i.e., d = 2) with a strong positive λ λ If δ > 1, then the distribution is overdispersed, dependency between 1 and 2. Then, given a sample λ λ λ whereas if δ < 1, then the distribution is underdis- ( 1, 2) from g( ), x1 and x2 are likely to also be posi- persed. In real-world data, as will be seen in the tively correlated. experimental section, overdispersion is more com- In an early application of the model, Arbous 47 mon than underdispersion. Mixture models also ena- and Kerrich constrain the Poisson parameters as ble dependencies between the variables, as will be the different scales of the common gamma variable λ: … λ described in the following paragraphs. for i =1, , d, the interval ti is given, and i is λ λ Suppose we are modeling a univariate random set to ti . Hence, g( ) is a univariate gamma distribu- fi λ ℝ variable x with a density of f(x | θ). Rather than assum- tion speci ed by 2 ++, which only allows a simple ing θ is fixed, we let θ itself be a random variable fol- dependency structure. As another early attempt, 48 lowing some mixing distribution. More formally, a Steyn choose the multivariate general can be defined as46: for the mixing distribution g(λ) to provide more flexi- ð bility on the correlation structure. However, the nor- mal distribution poses problems because λ must ℙðÞxjgðÞ = fxðÞjθ gðÞθ dθ; ð6Þ ℝ fi Θ reside in ++, while the normal distribution is de ned on ℝ. λ where the parameter θ is assumed to come from the One of the most popular choices for g( ) is the mixing distribution g(θ) and Θ is the domain of θ. log-normal distribution because of its rich covariance , b λ ℝd structure and natural positivity constraint : For the Poisson case, let 2 ++ be a d- dimensional vector whose i-th element λi is the q1ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi parameter of the Poisson distribution for xi. Now, N logðÞλjμ,Σ = given some mixing distribution g(λ), the family of Πd λ π d Σ i =1 i ðÞ2 j j Poisson mixture distributions is defined as 1 T −1 ð exp − ðÞlogλ−μ Σ ðÞlogλ−μ : ð10Þ Yd 2 ℙMixedPoiðÞx = gðÞλ ℙPoissðÞxi jλi dλ; ð7Þ ℝd ++ i =1 The log-normal distribution above is parameterized by μ and Σ, which are the mean and the covariance where the domain of the joint distribution is any of (logλ1, logλ2, …, logλd), respectively. Setting the ℤ count-valued assignment (i.e., xi 2 +, 8 i). While random variable xi to follow the Poisson distribution the probability density function (Eq. (7)) has a com- with parameter λi, we have the multivariate Poisson plicated form involving a multidimensional integral log-normal distribution49 from Eq. 
(7): (a complex, high-dimensional integral when d is ð large), the mean and variance are known to be Yd ℙ μ Σ λ μ Σ ℙ λ λ: expressed succinctly as PoiLogNðÞxj , = N logðÞj , PoissðÞxi j i d ℝd + i =1 EðÞx = EðÞλ ; ð8Þ ð11Þ

VarðÞx = EðÞλ + VarðÞλ : ð9Þ While the joint distribution (Eq. (11)) does not have a closed-form expression, and hence, as d increases, Note that Eq. (9) implies that the variance of a mix- it becomes computationally cumbersome to work ture is always larger than the variance of a single dis- with, its moments are available in closed form as a tribution. The higher-order moments of x are also special case of Eq. (9):

8of25 ©2017WileyPeriodicals,Inc. Volume9,May/June2017 WIREs Computational Statistics A review of multivariate distributions for count data

1 dependencies as well as closed form moments. While α ≡ EðÞx = exp μ + σ ; 56 i i i 2 ii Karlis and Meligkotsidou consider mixing multi- variate Poissons with positive dependencies, the sim- α α2 σ − ; VarðÞxi = i + i ðÞexpðÞii 1 plified form—where the component distributions are ÀÁ ÀÁÀÁ independent Poisson distributions—is much simpler Cov x ,x = α α exp σ −1 : ð12Þ i j i j ij to implement using an expectation-maximization (EM) algorithm. This simple finite mixture distribu- The correlation and the degree of overdispersion tion can be viewed as a middle ground between a sin- (defined as the variance divided by the mean) of the gle Poisson and a nonparametric estimation method marginal distributions are strictly coupled by α and where a Poisson is located at every training σ. Also, possible Spearman’s ρ correlation values for point—i.e., the number of mixtures is equal to the this distribution are limited if the mean value α is i number of training data points (k = n). small. To briefly explore this phenomenon, we simu- The is another common lated a two-dimensional Poisson log-normal model mixing distribution for the Poisson because it is the with log-normal parameters μ ¼½0;0 and conjugate distribution for the Poisson mean parame- ter λ. For the univariate case, if the mixing distribu- 1 0:999 Σ = 2logðÞγ ; tion is gamma, then the resulting univariate 0:999 1 distribution is the well-known negative binomial dis- tribution. The negative binomial distribution can γ γ which corresponds to a mean vector of [ , ] per handle overdispersion in count-valued data when the Eq. (12) and the strongest positive and negative cor- variance is larger than the mean. Unlike the Poisson relation possible between the two variables. We sim- log-normal mixture, the univariate gamma-Poisson ulated one million samples from this distribution and mixture density—i.e., the negative binomial fi γ ’ ρ found that when xing = 2, Spearman s values density—is known in closed form: are between −0.53 and 0.58. When fixing γ = 10, ’ ρ − Spearman s values are between 0.73 and 0.81. ΓðÞr + x ℙðÞxjr,p = prðÞ1−p x : Thus, for small mean values, the log-normal mixture ΓðÞr ΓðÞx +1 is limited in modeling strong dependencies, but for large mean values, the log-normal mixture can model As r ! ∞, the negative binomial distribution stronger dependencies. Besides the examples provided approaches the Poisson distribution. Thus, this can here, various Poisson mixture models from different be seen as a generalization of the Poisson distribu- mixing distributions are available, although limited tion. Note that the variance of this distribution is in the applied statistical literature due to their com- always larger than the Poisson distribution with the plexities. See Ref 46 and the references therein for same mean value. more examples of Poisson mixtures. Karlis and In a similar vein of placing a gamma prior distri- 46 Xekalaki also provide the general properties of bution on the mean parameter λ, a log-gamma mixtures as well as the specific properties of Poisson prior distribution can be placed on the log mean mixtures, such as moments, convolutions, and the parameter θ = log (λ). Bradley et al.57 recently posterior. leveraged the log-gamma conjugacy to the Poisson While this review focuses on modeling multi- log-mean parameter θ by introducing the Poisson variate, count-valued responses without any extra log-gamma hierarchical mixture distribution. 
They information, several extensions of multivariate Pois- particularly discuss the multivariate log-gamma distri- son log-normal models have been proposed to pro- bution that can have a flexible dependency structure vide more general correlation structures when similar to the multivariate log-normal distribution 50–55 covariates are available. These works formulate and illustrate some modeling advantages over the log- the mean parameter of log-normal mixing distribu- normal . tion, logμi, as a linear model on given covariates in the Bayesian framework. In order to alleviate the computational burden Summary of Mixture Model Generalizations of using log-normal distributions as an infinite mix- Overall, mixture models are particularly helpful if ing density as above, Karlis and Meligkotsidou56 there is overdispersion in the data, which is often the proposed an EM-type estimation for a finite mixture case for real-world data as seen in the experiments of k > 1 Poisson distributions, which still preserves section, while also allowing for variable dependencies similar properties, such as both positive and negative to be modeled implicitly through the mixing

Volume9,May/June2017 ©2017WileyPeriodicals,Inc. 9of25 Advanced Review wires.wiley.com/compstats

! distribution. If the data exhibits overdispersion, then Xm 57 ℙ η η − η the log-normal or log-gamma distributions give ExpFamðÞxj = exp iTiðÞx + BðÞx AðÞ somewhat flexible dependency structures. The princi- i =1 ð ! pal caveat with the complex mixture of Poisson dis- Xm η η μ ; tributions is computational; exact inference of the AðÞ= log exp iTiðÞx + BðÞx d ðÞx parameters is typically computationally difficult due D i =1 to the presence of latent mixing variables. However, simpler models such as the finite mixture using sim- where η is called the natural or canonical parameters ple EM may provide good results in practice (see of the distribution, μ is the Lebesgue or counting Comparison section). measure depending on whether D is continuous or discrete, respectively, and A(η) is called the log parti- tion function or log normalization constant because CONDITIONAL POISSON it normalizes the distribution over the domain D. Note that the sufficient statistics T x m can be GENERALIZATIONS fgiðÞi =1 any arbitrary function of x; for example, Ti(x)=x1x2 While the multivariate Poisson formulation in could be used to model an interaction between x1 η Eq. (3), as well as the distribution formed by pair- and x2. The log partition function A( ) will be a key ing a copula with Poisson marginals, assumes that quantity when discussing the following models: A(η) univariate marginal distributions are derived from must be finite for the distribution to be valid so that the Poisson, a different line of work generalizes the the realizable domain of parameters is given by η : η ∞ univariate Poisson by assuming that the univariate fg2D AðÞ< . Thus, for instance, if the realiza- node-conditional distributions are derived from the ble domain only allows positive or negative interac- – Poisson.58 63 Like the assumption of Poisson mar- tion terms, then the set of allowed dependencies ginals in previous sections, this conditional Poisson would be severely restricted. assumption seems a different yet natural extension Let us now consider the of the univariate Poisson distribution. The multivar- form of the univariate Poisson: iate Gaussian can be seen to satisfy such a condi- ℙ λ λx −λ = ! tional property as the node-conditional distributions PoissðÞxj = expðÞx of a multivariate Gaussian are univariate Gaussian. = expðÞ logðÞλx −logðÞx! −λ fi One bene t of these conditional models is that À they can be seen as undirected graphical models or λ − ! −λ: = exp |fflffl{zfflffl}logðÞ|{z}x + |fflfflfflfflfflfflffl{zfflfflfflfflfflfflffl}ðÞlogðÞx , Markov Random Fields, and they have a simple η TxðÞ Bx parametric form. In addition, estimating these ðÞ models generally reduces to estimating simple node- and therefore wise regressions, and some of these estimators have ℙPoissðÞxjη = expðÞηx−logðÞ x! −expðÞη ; ð13Þ theoretical guarantees on estimating the global graphical model structure even under high- where η ≡ log(λ) is the natural parameter of the Pois- dimensional sampling regimes, where the number of son, T(x)=x is the Poisson sufficient statistic, − log variables (d) is potentially even larger than the (x !) is the Poisson log base measure, and A(η)=exp number of samples (n). (η) is the Poisson log partition function. Note that for the general exponential family distribution, the log partition function may not have a closed form. 
Background on Exponential Family Distributions Background on Graphical Models We briefly describe exponential family distributions The graphical model over x, given some graph G—a and graphical models, which form the basis for the set of nodes and edges—is a set of distributions on conditional Poisson models. Many commonly used x that satisfy the Markov independence assumptions 64 distributions fall into this family, including Gaussian, with respect to G. In particular, an undirected Bernoulli, exponential, gamma, and Poisson, among graphical model provides a compact way to represent others. The exponential family is specified by a vector conditional independence among random variables— fi ≡ of suf cient statistics, denoted by T(x) [T1(x), the Markov properties of the graph. Conditional T2(x), …, Tm(x)], the log base measure B(x), and the independence relaxes the notion of full independence domain of the random variable D. With this nota- by defining which variables are independent given tion, the generic exponential family is defined as: that the other variables are fixed or known.

10 of 25 ©2017WileyPeriodicals,Inc. Volume9,May/June2017 WIREs Computational Statistics A review of multivariate distributions for count data

More formally, let V be a set of d nodes corre- example shows that the marginal moments—i.e., the sponding to the d random variables in x, let E be the covariance matrix Σ—are quite different from the set of undirected edges connecting nodes in V, and let graphical model parameters—i.e., the negative of the G =(V, E) be the corresponding undirected graph. By inverse covariance matrix Θ = − 1 Σ −1. In general, for 65 2 the Hammersley-Clifford theorem, any such distri- graphical models, such as the Poisson graphical mod- bution has the following form: els (PGMs) defined in the next section, the transfor- !mation from the covariance to the graphical model X parameter (Eq. (15)) is not known in closed form; in ℙ η η − η ðÞxj = exp C TCðÞxC AðÞ ð14Þ fact, this transformation is often very difficult to C2C compute for non-Gaussian models.66 For more infor- mation about graphical models and exponential where C is a set of cliques (fully-connected sub- families, see Refs 66,67. graphs) of G, and TC(xC) are the clique-wise sufficient statistics. For example, if C =fg 1,2,3 2C, then there would be a term η1,2,3 T1,2,3(x1, x2, x3) that involves Poisson Graphical Model the first, second, and third random variables in x. The first to consider multivariate extensions con- Hence, a graphical model can be understood as an structed by assuming that conditional distributions are exponential family distribution with the form given univariate exponential family distributions, such as in Eq. (14). An important special case—which will be and including the Poisson distribution, was Besag58. the focus in this paper—is a pairwise graphical In particular, suppose all node-conditional model, where C consists of merely V and ℰ, distributions—the conditional distribution of a node i.e., jCj =fg 1,2 ,8C 2C, so that we have conditioned on the rest of the nodes—are univariate 0 1 Poisson. Then, there is a unique joint distribution con- X X ÀÁ sistent with these node-conditional distributions under ℙ η @ η η − η A: ðÞxj = exp iTiðÞxi + ijTij xi,xj AðÞ some conditions, and moreover, this joint distribution i2V ðÞ2i,j ℰ is a graphical model distribution that factors according to a graph specified by the node-conditional distribu- As graphical models provide direct interpretations on tions. In fact, this approach can be uniformly applica- the Markov independence assumptions, for the ble for any exponential family beyond the Poisson Poisson-based graphical models in this section, we distribution and can be extended to more general can easily investigate the conditional independence graphical model settings61,63 beyond the pairwise set- relationships between random variables rather than ting in Ref 58. The particular instance of the univariate marginal correlations. Poisson as the exponential family underlying the node- As an example, we will consider the Gaussian conditional distributions is called a PGM.c graphical model formulation of the standard multi- Specifically, suppose that for every i 2 {1, …, d}, variate normal distribution (for simplicity, we will the node-conditional distribution is specified by univar- assume the mean vector is zero, i.e., μ = 0): iate Poisson distribution in exponential family form as specified in Eq. (13): Standard Form , Graphical Model Form ℙðÞxi jx−i = expfgψðÞx−i xi −logðÞxi! 
−expðÞψðÞx−i ; ð17Þ 1 − 1 − Σ = − Θ 1 , Θ = − Σ 1 ð15Þ 2 2 where x− i is the set of all xj except xi, and the func- tion ψ(x− i)isany function that depends on the rest 1 T −1 ℙðÞ/xjΣ exp − x Σ x of all the random variables except xi. Furthermore, 2 ÀÁ suppose that the corresponding joint distribution on , ℙðÞ/xjΘ exp xTΘx x factors according to the set of cliques C of a graph ! 63 X X G. Yang et al. then show that such a joint distribu- θ 2 θ : tion consistent with the above node-conditional dis- = exp iixi + ijxixj ð16Þ i i6¼j tributions exists and, moreover, necessarily has the form Note how Eq. (16) is related to Eq. (14) by setting () η θ η θ 2 X Y Xd i = ii, ij = ij, TiðÞxi = xi , Tij(xi, xj)=xixj, and ℙ η η − ! − η ; ℰ θ — ðÞxj = exp C xi logðÞxi AðÞ ð18Þ ={(i, j):i ¼6 j, ij 6¼ 0} i.e., the edges in the graph C2C i2C i =1 correspond to the non-zeros in Θ. In addition, this

Volume9,May/June2017 ©2017WileyPeriodicals,Inc. 11 of 25 Advanced Review wires.wiley.com/compstats where the function A(η) is the log-partition function quadratically, whereas the log base measure terms η η − − on all parameters = fgC C2C. The pairwise PGM, as log(xi !) log(xj !) only decrease as O(xi logxi + a special case, is defined as follows: xj log xj), so A(θ, Φ) ! ∞ as xi, xj ! ∞. Thus, even 8 though the PGM is a natural extension of the univar-

Proposition 1. (Ref 58) Consider the PGM distribu- As Yang et al.62 show, the node-conditional distribu- tion in Eq. (20). Then, for any parameters θ and Φ, tions of this graphical model distribution belong to APGM(θ, Φ)<+∞ only if the pairwise parameters an exponential family that is Poisson-like but with are nonpositive: ϕij ≤ 0, 8 (i, j) 2 ℰ. the domain bounded by R. Thus, the key difference from the vanilla PGM (Eq. (20)) is that the domain is Φ Φ Intuitively, if any entry in , say ij, is positive, finite, and hence, the log partition function ATPGM() the term Φijxixj in Eq. (20) would grow only involves a finite number of summations. Thus,

12 of 25 ©2017WileyPeriodicals,Inc. Volume9,May/June2017 WIREs Computational Statistics A review of multivariate distributions for count data no restrictions are imposed on the parameters for the Consider the following over normalizability of the distribution. count-valued variables: Yang et al.62 discuss several major drawbacks to TPGM. First, the domain needs to be bounded a ℙðÞ/z expfgθTzðÞ;R0,R −log z! ; ð24Þ priori, so R should ideally be set larger than any unseen observation. Second, the effective range of which has the same base measure log z ! as the Pois- parameter space for a nondegenerate distribution is son but with the following sublinear sufficient still limited: as the truncation value R increases, the statistics: effective values of pairwise parameters become increasingly negative or close to zero; otherwise, the TzðÞ;R0,R = 8 distribution can be degenerate, placing most of its ≤ >z if z R0 probability mass at 0 or R. <> 1 R R2 − z2 + x− 0 if R < z ≤ R 2ðÞR−R R−R 2ðÞR−R 0 > 0 0 0 :>R + R 0 if z ≥ R: Quadratic PGM and Sublinear PGM 2 Yang et al.62 also investigate the possibility of PGMs ð25Þ that (1) allow both positive and negative dependen- cies as well as (2) allow the domain to range over all non-negative integers. As described previously, a key For values of x upto R0, T(x) increases linearly, while fi reason for the negative constraint on the pairwise after R0, its slope decreases linearly, and nally after ϕ R, T(x) becomes constant. The joint graphical model, Pparameters ij in Eq. (20) is that the log base measure which they call a sublinear PGM (SPGM), specified ilog(xi!) scales more slowly than the quadratic by the node-conditional distributions belonging to pairwise term xTΦx where x 2 ℤd . Yang et al.62 thus + the family in Eq. (24), has the following form: propose two possible solutions: increase the base measure or decrease the quadratic pairwise term. n ℙ θT T Φ First, if we modify the base measure of Poisson SPGMðÞx = exp TðÞx + TðÞx TðÞx distribution with ‘Gaussian-esque’ quadratic functions X o − ! − θ Φ ; (note that for the linear sufficient statistics with positive logðÞxi ASPGMðÞ, jR0,R ð26Þ dependencies, the base measures should be quadratic i 62 at the very least ), then the joint distribution, which where they call a quadratic PGM, is normalizable while 62 X È allowing both positive and negative dependencies : T ASPGMðÞθ,ΦjR0,R = log exp θ TðÞx ÈÉ x2ℤ + T T X ℙQPGMðÞx = exp θ x + x Φx−AQPGMðÞθ,Φ : ð22Þ T + TðÞx ΦTðÞx − logðÞgxi! ; ð27Þ i Essentially, QPGM has the same form as the Gaussian distribution but where its domain is the set of non- and T(x) is the entry-wise application of the function negative integers. The key differences from PGM are in Eq. (25). SPGM is always normalizable for that Φ can have negative values along the diagonal, 62 P ϕ 2 ℝ 8 i 6¼ j and the Poisson base measure − log(x !) is ij . X i i The main difficulty in estimating the PGM var- replaced by the quadratic term ϕ x2. Note that a i ii i iants above with an infinite domain is the lack of sufficient condition for the distribution to be normal- closed-form expressions for the log partition func- izable is given by: tion, even only for the node-conditional distributions that are needed for parameter estimation. 
Yang TΦ − 2 ℤd ; et al.62 propose an approximate estimation proce- x x < ckxk2 8x 2 + ð23Þ dure that uses the univariate Poisson and Gaussian for some constant c > 0, which in turn can be satisfied log-partition functions as upper bounds for the node- if Φ is negative definite. One significant drawback of conditional log-partition functions for the QPGM QPGM is that the tail is Gaussian-esque and thin and SPGM models, respectively. rather than Poisson-esque and thicker as in PGM. Another possible modification is to use sub- Poisson Square Root Graphical Model linear sufficient statistics in order to preserve the In the similar vein as SPGM in the earlier section, Poisson base measure and possibly heavier tails. Inouye et al.60 consider the use of exponential

Volume9,May/June2017 ©2017WileyPeriodicals,Inc. 13 of 25 Advanced Review wires.wiley.com/compstats families with square root-sufficient statistics. While model does not suffer from the degenerate distribu- they consider general graphical model families, their tions of TPGM as well as the fixed-length PGM PGM variant called the square root graphical model (FLPGM) discussed in the next section while still (SQR) can be written as: allowing both positive and negative dependencies. n To show that SQR graphical models can eas- ffiffiffi ffiffiffi ffiffiffi 60 Tp p T p ily be normalized, Inouye et al. first define ℙSQRðÞxjθ = exp θ x + x Φ x X o radial-conditional distributions. The radial- − logðÞxi! −ASQRðÞθ,Φ ; ð28Þ conditional distribution assumes that the unit direc- i tion is fixed, but the length of the vector is unknown. The difference between the standard 1D ϕ where ii can be non-zero in contrast to the zero node-conditional distributions and the 1D radial- diagonal of the parameter matrix in Eq. (20). As with conditional distributions is illustrated in Figure 3. ϕ PGM, when there are no edges (i.e., ij =08 i 6¼ j) Suppose we condition on the unit direction v = x and θ = 0, this reduces to the independent Poisson kxk1 of the sufficient statistics, but the scaling of this unit model. The node conditionals of this distribution direction z = kxk is unknown. With this notation, have the form: 1 Inouye et al.60 define the radial-conditional distribu- noffiffiffiffiffiffiffi ffiffiffiffi tion as: T p p ℙ x − exp ϕ x + θ +2ϕ − x −log x ! ; ðÞ/ijx i ii i i i, −i x i i ðÞi n pffiffiffiffiffi pffiffiffiffiffiT pffiffiffiffiffi ð29Þ ℙðÞ/x = zvjv,θ,Φ exp θT zv + zv Φ zv X − logðÞgðÞzvi ! where ϕi,− i is the i-th column of Φ with the i-th entry removed. This can be rewritten in the form of a two- ()i ÀÁffiffiffi ffiffiffi ffiffiffi X parameter exponential family: T p p T p / exp θ v z + v Φ v z− logðÞðÞzvi ! : ffiffiffiffi ℙ η η η η p − ! − η η ; i ðÞxij 1, 2 = expfg1xi + 2 xi logðÞxi AðÞ1, 2 ð30Þ Similar to the node-conditional distribution, the ffiffiffiffiffiffiffi radial-conditional distribution can be rewritten as a η ϕ η θ ϕT p η η where 1 = ii, 2 = i +2 i, −i x−i, and A( 1, 2)is two-parameter exponential family: the log partition function. Note that a key difference 0 1 with the PGM variants in the previous section is B pffiffiffi C that the diagonal of ΦSQR can be non-zero, whereas ℙ θ Φ @η η e − η η A; ðÞzjv, , =exp |fflfflfflfflfflfflffl{zfflfflfflfflfflfflffl}1z + 2 z + B|fflffl{zfflffl}vðÞz AradðÞ1, 2 the diagonal of ΦPGM must be zero. Because the pffiffiffi pffiffiffi OzðÞ OðÞ−zlogðÞz interaction term xTΦ x is asymptotically linear rather than quadratic, the Poisson SQR graphical ð31Þ

0.35 0.35 Node conditional Radial conditional 0.3 distributions 0.3 distributions

0.25 0.25

0.2 0.2 2 2 x x 0.15 0.15

0.1 0.1

0.05 0.05

0 0 0 0.05 0.1 0.15 0.2 0.25 0.3 0 0.05 0.1 0.15 0.2 0.25 0.3 x x 1 1

FIGURE 3 | Node-conditional distributions (left) are univariate probability distributions of one variable conditioned on the other variables, while radial-conditional distributions are univariate probability distributions of the vector scaling conditioned on the vector direction. Both conditional distributions are helpful in understanding square root (SQR) graphical models. (Illustration from Ref 60)

14 of 25 ©2017WileyPeriodicals,Inc. Volume9,May/June2017 WIREs Computational Statistics A review of multivariate distributions for count data

ffiffiffi ffiffiffi η p TΦp η θT e density. Instead, the underlying model is defined in whereX 1 = v v, 2 = v, and BvðÞz = d terms of a series of local models where each variable − logðÞðÞzv ! . The only difference between i =1 i is conditionally Poisson given its node-neighbors; this this exponential family and the node-conditional dis- approach is thus termed the local PGM (LPGM). — tributionX is the different base measure i.e., Note that LPGM does not impose any restrictions on d − logðÞðÞzvi ! 6¼ −logðÞz! . However, note that the the parameter space or types of dependencies; if the i =1 parameter space of each local model was constrained log base measure is still O(−zlog(z)), and thus, the log base measure will overcome the linear term as to be non-positive, then the LPGM reduces to the z ! ∞. Therefore, the radial-conditional distribution vanilla PGM as previously discussed. Hence, the η η ℝ LPGM is less interesting as a candidate multivariate is normalizable for any 1, 2 2 . fi With the radial-conditional distributions nota- model for count-valued data, but many may still nd tion, Inouye et al.60 show that the log partition func- its simple and interpretable network estimates tion for Poisson SQR graphical models is finite by appealing. Recently, several studies have proposed to separating the summation into a nested radial direc- adopt this estimation strategy for alternative network types.73,74 tion and scalar summation. Let V = fv : kvk 1=1,v 2 ℝdg be the set of unit vectors in the positive orthant. Fixed-Length Poisson MRFs The SQR log partition function A (θ, Φ) can be SQR In a somewhat different direction, Inouye et al.59 decomposed into nested summation over the unit propose a distribution that has the same parametric direction and the one-dimensional radial conditional: form as the original PGM but allows positive depen- ð X È pffiffiffi dencies by decomposing the joint distribution into θ Φ η Φ η θ fi ASQRðÞ, = log exp 1ðÞvj z + 2ðÞvj z two distributions. The rst distribution is the mar- ℤ^ v2V z2 ginal distribution over the length of the vector X denoted ℙ(L)—i.e., the distribution of the ℓ -norm of − ! ; 1 logðÞgzvi dv ð32Þ the vector or the total sum of counts. The second dis- i tribution, the FLPGM, is the conditional distribution where η ðÞvjΦ and η ðÞvjθ are the radial-conditional of PGM given the fact that the vector length L is 1 2 ÈÉ fi ℙ parameters as defined above, and ℤ^ = z : zv 2 ℤd . known or xed, denoted FLPGM(x | kxk1 = L). Note + that this allows the marginal distribution on length Note that ℤ^ ℤ, and thus, the inner summation can and the distribution given the length to be specified be replaced by the radial-conditional log partition independently.d The restriction to negative dependen- function. Therefore, because V is a bounded set, and cies is removed because the second distribution, given the radial-conditional log partition function is finite the vector length ℙ (x | kxk = L) has a finite for any η ðÞvjθ and η ðÞvjΦ , A < ∞ and the Pois- ÈÉFLPGM 1 , 1 2 SQR domain = : ℤd , = L and is thus son SQR joint distribution is normalizable. DFLPGM x x 2 + kxk1 — The main drawback of the Poisson SQR is that trivially normalizable similar to the normalizability fi 59 for parameter estimation, the log partition function of the nite-domain TPGM. More formally, defined the FLPGM as: A(η1, η2) of the node conditionals in Eq. (30) is not known in closed form in general. 
Fixed-Length Poisson MRFs
In a somewhat different direction, Inouye et al.59 propose a distribution that has the same parametric form as the original PGM but allows positive dependencies by decomposing the joint distribution into two distributions. The first distribution is the marginal distribution over the length of the vector, denoted ℙ(L), i.e., the distribution of the ℓ₁-norm of the vector or the total sum of counts. The second distribution, the FLPGM, is the conditional distribution of the PGM given that the vector length L is known or fixed, denoted ℙ_FLPGM(x | ‖x‖₁ = L). Note that this allows the marginal distribution on length and the distribution given the length to be specified independently.d The restriction to negative dependencies is removed because the second distribution given the vector length, ℙ_FLPGM(x | ‖x‖₁ = L), has a finite domain D_FLPGM = {x : x ∈ ℤ₊^d, ‖x‖₁ = L} and is thus trivially normalizable, similar to the normalizability of the finite-domain TPGM. More formally, Ref 59 defined the FLPGM as:

    ℙ(x | θ, Φ, λ) = ℙ(L | λ) ℙ_FLPGM(x | ‖x‖₁ = L, θ, Φ) ,   (33)

    ℙ_FLPGM(x | ‖x‖₁ = L, θ, Φ) = exp{ θ^T x + x^T Φ x − Σᵢ log(xᵢ!) − A_L(θ, Φ) } ,   (34)

where λ is the parameter for the marginal length distribution, which could be Poisson, negative binomial, or any other distribution on non-negative integers. In addition, the FLPGM could be used as a replacement for the multinomial distribution because it has the same domain as the multinomial and actually reduces to the multinomial if there are no dependencies. Earlier, Altham and Hankin75 developed an identical model by generalizing an earlier bivariate generalization of the binomial.76 However, in Refs 75,76, the model assumed that L was constant over all samples, whereas Ref 59 allowed L to vary for each sample according to some distribution ℙ(L).
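To make the length decomposition of Eqs (33) and (34) concrete, the sketch below samples from the independence special case, where ℙ(L) is Poisson and, with Φ = 0, the FLPGM conditional reduces to a multinomial; the rate lam_L and the parameter vector theta are arbitrary, and the softmax mapping from θ to multinomial probabilities is our reading of the Φ = 0 case rather than code from Ref 59.

import numpy as np

rng = np.random.default_rng(1)

# Marginal length distribution P(L): here Poisson, but it could be negative
# binomial or any other distribution on the non-negative integers.
lam_L = 20.0

# With no dependencies (Phi = 0), P_FLPGM(x | ||x||_1 = L) is a multinomial;
# theta maps to multinomial probabilities via a softmax.
theta = np.array([0.5, 0.0, -0.5, 1.0])
probs = np.exp(theta) / np.exp(theta).sum()

def sample_flpgm_independent(n_samples):
    """Draw the total count L first, then allocate it across the d variables."""
    lengths = rng.poisson(lam_L, size=n_samples)
    return np.array([rng.multinomial(L, probs) for L in lengths])

X = sample_flpgm_independent(1000)
print(X.mean(axis=0))       # roughly lam_L * probs
print(X.sum(axis=1)[:10])   # the sampled lengths L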


One significant drawback is that the FLPGM is not amenable to the simple node-wise parameter estimation method of the previous PGM models. Nonetheless, in Inouye et al.,59 the parameters are heuristically estimated with Poisson regressions similar to the PGM, although the theoretical properties of this heuristic estimate are unknown. Another drawback is that while the FLPGM allows for positive dependencies, the distribution can still yield a degenerate distribution for large values of L, similar to the problem of the TPGM where the mass is concentrated near 0 or R. Thus, Inouye et al.59 introduce a decreasing weighting function ω(L) that scales the interaction term:

    ℙ_FLPGM(x | ‖x‖₁ = L, θ, Φ, ω(L)) = exp{ θ^T x + ω(L) x^T Φ x − Σᵢ log(xᵢ!) − A_L(θ, ω(L)Φ) } .   (35)

While the log likelihood is not available in tractable form, Inouye et al.59 approximate the log likelihood using annealed importance sampling,77 which might be applicable to the extensions covered previously as well.

Summary of Conditional Poisson Generalizations
The conditional Poisson models benefit from the rich literature on exponential families and undirected graphical models, or Markov Random Fields. In addition, the conditional Poisson models have a simple parametric form. The historical PGM, or auto-Poisson model,58 only allowed negative dependencies between variables. Multiple extensions have sought to overcome this severe limitation by altering the PGM so that the log partition function is finite even with positive dependencies. One major drawback of the graphical model approach is that computing the likelihood requires the approximation of the joint log partition function A(θ, Φ); a related problem is that the distribution moments and marginals are not known in closed form. Despite these drawbacks, parameter estimation using composite likelihood methods via ℓ₁-penalized node-wise regressions (in which the joint likelihood is not computed) has solid theoretical properties under certain conditions.
MODEL COMPARISON
We compare models by first discussing two structural aspects of the models: (a) interpretability and (b) the relative stringency and ease of verifying theoretical assumptions and guarantees. We then present and discuss an empirical comparison of the models on three real-world datasets.

Comparison of Model Interpretation
Marginal models can be interpreted as weakly decoupling the univariate marginal distributions from the dependency structure between the variables. However, in the discrete case, specifically for distributions based on pairing copulas with Poisson marginals, the dependency structure estimation is also dependent on the marginal estimation, unlike for copulas paired with continuous marginals.23 Conditional models or graphical models, on the other hand, can be interpreted as specifying generative models for each variable given the variable's neighborhood (i.e., the conditional distribution). In addition, dependencies in graphical models can be visualized and interpreted via networks. Here, each variable is a node, and the weighted edges in the network structure depict the pair-wise conditional dependencies between variables. The simple network depiction for graphical models may enable domain experts to interpret complex dependency structures more easily compared to other models. Overall, marginal models may be preferred if modeling the statistics of the data, particularly the marginal statistics over individual variables, is of primary importance, while conditional models may be preferred if the prediction of some variables, given others, is of primary importance. Mixture models may be more or less difficult to interpret depending on whether there is an application-specific interpretation of the latent mixing variable. For example, a finite mixture of two Poisson distributions may model the crime statistics of a city that contains downtown and suburban areas. On the other hand, a finite mixture of Poisson distributions or a log-normal Poisson mixture used to model crash severity counts (as seen in the empirical comparison section) seems more difficult to interpret; even if the model empirically fits the data well, the hidden mixture variable might not have an obvious application-specific interpretation.

Comparison of Theoretical Considerations
The estimation of marginal models from data has various theoretical problems, as evidenced by the analysis of copulas paired with discrete marginals in Ref 23.


The extent to which these theoretical problems cause any significant practical issues remains unclear. In particular, the estimators of the marginal distributions themselves typically have easily checked assumptions as the empirical marginal distributions can be inspected directly. On the other hand, the estimation of conditional models is both computationally tractable and comes with strong theoretical guarantees even under high-dimensional regimes where n < d.63 However, the assumptions under which the guarantees of the estimators hold are difficult to check in practice and could cause problems if they are violated (e.g., caused by unobserved factors). The estimation of mixture models tends to have limited theoretical guarantees. In particular, finite Poisson mixture models have very weak assumptions on the underlying distribution, eventually becoming a nonparametric distribution if k = O(n), but the estimation problems in theory are extremely difficult, with very few theoretical guarantees for practical estimators. Yet, as seen in the next section, empirically estimating a finite mixture model using EM iterations performs well in practice.

Empirical Comparison
In this section, we seek to empirically compare models from the three classes presented to assess how well they fit real-world count data.

Comparison Experimental Setup
We empirically compare models on selected datasets from three diverse domains, which have different data characteristics in terms of their mean count values and dispersion indices, as can be seen in Table 1. The crash severity dataset is a small accident dataset from Ref 78 with three different count variables corresponding to crash severity classes: 'Property-only', 'Possible Injury', and 'Injury'. The crash severity data exhibit high count values and high overdispersion. We retrieve raw next generation sequencing data for breast cancer (BRCA) using the software TCGA2STAT79 and compute a simple log-count transformation of the raw counts, ⌊log(x+1)⌋, a common preprocessing technique for RNA-Seq data. The BRCA data exhibit medium counts and medium overdispersion. We collect the word count vectors from the Classic3 text corpus, which contains abstracts from aerospace engineering, medical, and information sciences journals.e The Classic3 dataset exhibits low counts, including many zeros, and medium overdispersion. In the Supporting information, we also give results for a crime statistics dataset and the 20 Newsgroup dataset, but they have similar characteristics and perform similarly to the BRCA and Classic3 datasets, respectively; thus, we omit them for simplicity. We select variables (e.g., for d = 10 or d = 100) by sorting the variables by mean count value, or by variance in the case of the BRCA dataset, as highly variable genes are generally of more interest.
TABLE 1 | Dataset Statistics (Per Variable)

                              Means                 Dispersion Indices        Spearman's ρ
Dataset          d     n      Min    Med    Max     Min    Med    Max      Min     Med     Max
Crash Severity   3     275    3.4    3.8    9.7     6      9.3    16       0.61    0.73    0.79
BRCA             10    878    3.2    5      7.7     1.5    2.2    3.8     −0.2     0.25    0.95
                 100   878    1.1    4      9       0.63   1.7    4.6     −0.5     0.08    0.95
                 1000  878    0.51   3.5    11      0.26   1      4.6     −0.64    0.06    0.97
Classic3         10    3893   0.26   0.33   0.51    1.4    3.4    3.8     −0.17    0.12    0.82
                 100   3893   0.09   0.14   0.51    1.1    2.1    8.3     −0.17    0.02    0.82
                 1000  3893   0.02   0.03   0.51    0.98   1.7    8.5     −0.17   −0       0.82
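The per-variable quantities reported in Table 1 (means, dispersion indices taken as variance divided by mean, and pairwise Spearman's ρ) can be computed directly from a count matrix. The sketch below assumes only numpy and scipy; the Poisson-generated matrix is a placeholder for the real datasets, and the ⌊log(x+1)⌋ line mirrors the BRCA preprocessing described above.

import numpy as np
from scipy.stats import spearmanr

def dataset_summary(counts):
    """Per-variable means, dispersion indices (variance / mean), and the
    min/median/max of the pairwise Spearman's rho values."""
    means = counts.mean(axis=0)
    dispersion = counts.var(axis=0) / np.maximum(means, 1e-12)
    rho = spearmanr(counts).correlation            # d x d correlation matrix
    pairwise = rho[np.triu_indices_from(rho, k=1)]
    stats = lambda v: (v.min(), np.median(v), v.max())
    return {"means": stats(means),
            "dispersion": stats(dispersion),
            "spearman": stats(pairwise)}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    raw = rng.poisson(5.0, size=(878, 10))         # placeholder for real RNA-Seq counts
    brca_like = np.floor(np.log(raw + 1))          # the log-count transform used for BRCA
    print(dataset_summary(brca_like))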

In order to understand how each model might perform under varying data characteristics, we consider the following two questions: (1) How well does the model (i.e., the joint distribution) fit the underlying data distribution? and (2) How well does the model capture the dependency structure between variables? To help answer these questions, we evaluate the empirical fit of the models using two metrics, which only require samples from the model. The first metric is based on a statistic called maximum mean discrepancy (MMD),80 which estimates the maximum difference over all possible moments. The empirical MMD can be approximated as follows from two sets of samples X ∈ ℝ^{n₁×d} and Y ∈ ℝ^{n₂×d}:

    MMD(G, X, Y) = sup_{f ∈ G} [ (1/n₁) Σ_{i=1}^{n₁} f(xᵢ) − (1/n₂) Σ_{j=1}^{n₂} f(yⱼ) ] ,   (36)

where G is the union of the RKHSs based on the Gaussian kernel using 21 σ values log-spaced between 0.01 and 100. In our experiments, we estimate the MMD between the pairwise marginals of the model samples and the pairwise marginals of the original observations, i.e., one MMD value for each pair of variables. The second metric compares the pairwise Spearman's ρ correlations of the model samples with those of the original observations. The MMD metric is of more general interest because it evaluates whether the models actually fit the empirical data distribution, while the Spearman metric may be more interesting for practitioners who primarily care about the dependency structure, such as biologists who specifically want to study gene dependencies rather than gene distributions.
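As a concrete illustration of Eq. (36) and of the pairwise evaluation, the following sketch computes a biased kernel-MMD estimate for each pair of variables, taking the maximum over a bandwidth grid as a simple stand-in for the supremum over the union of RKHSs; this is not the implementation used in our experiments, and the sample sizes and grid are illustrative.

import numpy as np

def mmd_gaussian(X, Y, sigma):
    """Biased empirical estimate of Eq. (36) for a single Gaussian kernel."""
    def gram(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * sigma ** 2))
    val = gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()
    return np.sqrt(max(val, 0.0))

def pairwise_mmd(data, samples, sigmas=np.logspace(-2, 2, 21)):
    """One MMD value per pair of variables, maximized over the bandwidth grid."""
    d = data.shape[1]
    out = {}
    for s in range(d):
        for t in range(s + 1, d):
            out[(s, t)] = max(mmd_gaussian(data[:, [s, t]], samples[:, [s, t]], sig)
                              for sig in sigmas)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    observed = rng.poisson(3.0, size=(300, 4))        # original observations
    model_samples = rng.poisson(3.0, size=(300, 4))   # samples from a fitted model
    vals = list(pairwise_mmd(observed, model_samples).values())
    print(min(vals), np.median(vals), max(vals))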

We empirically compare the model fits on these real-world datasets for several types of models from the three general classes presented. As a baseline, we estimate an independent Poisson model ('Ind Poisson'). We include Gaussian copulas and vine copulas, both paired with Poisson marginals ('Copula Poisson' and 'Vine Poisson'), to represent the marginal model class; we estimate the copula-based models via the two-stage IFM method35 with the DT.26 For the mixture class, we include both a simple finite mixture of independent Poissons ('Mixture Poiss') and a log-normal mixture of Poissons ('Log-Normal'). The finite mixture was estimated using a simple EM algorithm (sketched below); the log-normal mixture model was estimated via MCMC sampling using the code from Ref 55. For the conditional model class, we estimate the simple PGM ('PGM'), which only allows negative dependencies, and three variants that allow for positive dependencies: the truncated PGM ('Truncated PGM'), the FLPGM with a Poisson distribution on the vector length L = ‖x‖₁ ('FLPGM Poisson'), and the Poisson square root graphical model ('Poisson SQR'). Using composite likelihood methods of penalized ℓ₁ node-wise regressions, we estimate these models via code from Refs 63,82,83 and the XMRFf R package. After parameter estimation, we generate 1000 samples for each model and compute the evaluation metrics on these samples.
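A minimal EM sketch for a finite mixture of independent Poissons, of the kind labelled 'Mixture Poiss' above, is given here; the number of components, the initialization, and the fixed iteration count are illustrative assumptions rather than the settings used in the experiments.

import numpy as np
from scipy.stats import poisson
from scipy.special import logsumexp

def fit_poisson_mixture(X, k=2, iters=100, seed=0):
    """EM for a finite mixture of independent Poissons (one rate vector per component)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    weights = np.full(k, 1.0 / k)
    rates = X[rng.choice(n, size=k, replace=False)] + 0.5   # initialize at random rows
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = P(component j | x_i)
        log_r = np.log(weights) + poisson.logpmf(X[:, None, :], rates[None, :, :]).sum(-1)
        log_r -= logsumexp(log_r, axis=1, keepdims=True)
        r = np.exp(log_r)
        # M-step: update mixing weights and component rates
        nk = r.sum(axis=0) + 1e-12
        weights = nk / n
        rates = np.maximum((r.T @ X) / nk[:, None], 1e-6)
    return weights, rates

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.poisson([2.0, 10.0, 1.0], size=(200, 3)),
                   rng.poisson([8.0, 1.0, 6.0], size=(200, 3))])
    weights, rates = fit_poisson_mixture(X, k=2)
    print(weights)
    print(rates.round(2))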

Empirical Comparison Results
The full results for both the MMD and Spearman's ρ metrics for the crash severity, breast cancer RNA-Seq, and Classic3 text datasets can be seen in Figures 4, 5, and 6, respectively. The low-dimensional results (d ≤ 10) give evidence across all the datasets that three models outperform the others in their classesg: the Gaussian copula paired with Poisson marginals ('Copula Poisson') for the marginal model class, the mixture of Poissons distribution ('Mixture Poiss') for the mixture model class, and the Poisson SQR distribution ('Poisson SQR') for the conditional model class. Thus, we only include these representative models, along with an independent Poisson baseline, in the high-dimensional experiments when d > 10. We discuss the results for specific data characteristics as represented by each dataset.h



FIGURE 4 | Crash severity dataset (high counts and high overdispersion): maximum mean discrepancy (left) and Spearman's ρ difference (right). As expected, for high overdispersion, mixture models ('Log-Normal' and 'Mixture Poiss') seem to perform the best.

For the crash severity dataset with high counts and high overdispersion (Figure 4), mixture models (i.e., 'Log-Normal' and 'Mixture Poiss') perform the best, as expected, because they can model overdispersion well. However, if the dependency structure is the only object of interest, the Gaussian copula paired with Poisson marginals ('Copula Poisson') also performs well.

For the BRCA dataset with medium counts and medium overdispersion (Figure 5), we note similar trends with two notable exceptions: (1) the Poisson SQR model actually performs reasonably in low dimensions, suggesting that it can model moderate overdispersion, and (2) the high-dimensional (d ≥ 100) Spearman's ρ difference results show that the Gaussian copula paired with Poisson marginals ('Copula Poisson') performs significantly better than the mixture model; this result suggests that copulas paired with Poisson marginals are likely better for modeling dependencies than mixture models.

Finally, for the Classic3 dataset with low counts and medium overdispersion (Figure 6), the Poisson SQR model seems to perform well in this low-count setting, especially in low dimensions, unlike in the previous data settings. While the simple independent mixture of Poisson distributions still performs well, the Poisson log-normal mixture distribution ('Log-Normal') performs quite poorly in this setting with small counts and many zeros. This poor performance of the Poisson log-normal mixture is somewhat surprising as the dispersion indices are almost all greater than one, as seen in Table 1. The differing results between low counts and medium counts with similar overdispersion demonstrate the importance of considering both the overdispersion and the mean count values when characterizing a dataset.

In summary, we note several overall trends. Mixture models are important for overdispersion when counts are medium or high. The Gaussian copula with Poisson marginals joint distribution can estimate dependency structure (per the Spearman metric) for a wide range of data characteristics even when the distribution does not fit the underlying data (per the MMD metric). The Poisson SQR model performs well for low count values with many zeros (i.e., sparse data) and may be able to handle moderate overdispersion.

DISCUSSION
While this review analyzes each model class separately, it would be quite interesting to consider combinations or synergies between the model classes. Because negative binomial distributions can be viewed as a gamma–Poisson mixture model, one simple idea is to consider pairing a copula with negative binomial marginals or developing a negative binomial SQR graphical model. As another example, we could form a finite mixture of copula-based or graphical-model-based models. This might combine the strengths of a mixture model in handling multiple modes and overdispersion with the strengths of the copula-based models and graphical models, which can explicitly model dependencies.
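The gamma–Poisson view of the negative binomial can be checked numerically. The following sketch compares direct negative binomial draws with draws obtained by first sampling a gamma-distributed rate and then a Poisson count; it uses numpy's parametrization, and the values of r and p are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n, r, p = 200_000, 3.0, 0.4          # r successes, success probability p (numpy's convention)

nb = rng.negative_binomial(r, p, size=n)             # direct negative binomial draws

lam = rng.gamma(shape=r, scale=(1 - p) / p, size=n)  # latent Poisson rates
gp = rng.poisson(lam)                                # gamma-Poisson mixture draws

print(nb.mean(), gp.mean())   # both approximately r * (1 - p) / p
print(nb.var(), gp.var())     # both approximately r * (1 - p) / p**2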



FIGURE 5 | BRCA RNA-Seq dataset (medium counts and medium overdispersion): maximum mean discrepancy (MMD) (top) and Spearman's ρ difference (bottom) with different numbers of variables: 10 (left), 100 (middle), 1000 (right). While mixtures ('Log-Normal' and 'Mixture Poiss') perform well in terms of MMD, the Gaussian copula paired with Poisson marginals ('Copula Poisson') can model dependency structure well as evidenced by the Spearman metric.

We may also consider how one type of model informs the other. For example, by the generalized Sklar's theorem,26 each conditional PGM actually induces a copula, just as the Gaussian graphical model induces the Gaussian copula. Studying the copulas induced by graphical models seems to be a relatively unexplored area. On the other hand, it may be useful to consider fitting a Gaussian copula paired with discrete marginals using the theoretically grounded techniques from graphical models for sparse dependency structure estimation, especially for the small-sample regimes in which d > n; this has been studied for the case of continuous marginals in Ref 84. Overall, bringing together and comparing these diverse paradigms for probability models opens up the door for many combinations and synergies.

CONCLUSION
We have reviewed three main approaches to constructing multivariate distributions derived from the Poisson using three different assumptions: (1) the marginal distributions are derived from the Poisson, (2) the joint distribution is a mixture of independent Poisson distributions, and (3) the node-conditional distributions are derived from the Poisson. The first class, based on Poisson marginals, and particularly the general approach of pairing copulas with Poisson marginals, provides an elegant way to partiallyi decouple the marginals from the dependency structure and gives strong empirical results despite some theoretical issues related to nonuniqueness. While advanced methods to estimate the joint distribution of copulas paired with discrete marginals, such as SL25 or vine copula constructions, provide more accurate or more flexible copula models, respectively, our empirical results suggest that a simple Gaussian copula paired with Poisson marginals with the trivial DT can perform quite well in practice.
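A minimal sketch of this construction, generating counts whose marginals are Poisson and whose dependence comes from a Gaussian copula, is given below; the correlation matrix and rates are arbitrary, and the estimation side (IFM with the distributional transform) is not shown.

import numpy as np
from scipy.stats import norm, poisson

def sample_copula_poisson(n, corr, rates, seed=0):
    """Counts with Poisson(rates) marginals and dependence from a Gaussian copula."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(rates)), corr, size=n)  # latent Gaussians
    u = np.clip(norm.cdf(z), 1e-12, 1 - 1e-12)                       # uniform marginals
    return poisson.ppf(u, np.asarray(rates)).astype(int)             # Poisson quantiles

if __name__ == "__main__":
    corr = np.array([[1.0, 0.6, -0.3],
                     [0.6, 1.0, 0.0],
                     [-0.3, 0.0, 1.0]])
    X = sample_copula_poisson(5000, corr, rates=[2.0, 5.0, 0.5])
    print(X.mean(axis=0))              # approximately the Poisson rates
    print(np.corrcoef(X.T).round(2))   # dependence induced by the copula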



FIGURE 6 | Classic3 text dataset (low counts and medium overdispersion): maximum mean discrepancy (top) and Spearman's ρ difference (bottom) with different numbers of variables: 10 (left), 100 (middle), 1000 (right). The Poisson SQR model performs better on this low-count dataset than in previous settings.

The second class, based on mixture models, can be particularly helpful for handling the overdispersion that often occurs in real count data, with the log-normal–Poisson mixture and a finite mixture of independent Poisson distributions being prime examples. In addition, mixture models have closed-form moments and, in the case of a finite mixture, closed-form likelihood calculations, something not generally true for the other classes. The third class, based on Poisson conditionals, can be represented as graphical models, thus providing both compact and visually appealing representations of joint distributions. Conditional models benefit from strong theoretical guarantees about model recovery given certain modeling assumptions. However, checking conditional modeling assumptions may be impossible, and they may not always be satisfied for real-world count data. From our empirical experiments, we found that (1) mixture models are important for overdispersion when counts are medium or high, (2) the Gaussian copula with Poisson marginals joint distribution can estimate dependency structure for a wide range of data characteristics even when the distribution does not fit the underlying data, and (3) Poisson SQR models perform well for low count values with many zeros (i.e., sparse data) and can handle moderate overdispersion. Overall, in practice, we would recommend comparing the three best-performing methods from each class, namely the Gaussian copula model paired with Poisson marginals, the finite mixture of independent Poisson distributions, and the Poisson SQR model. This initial comparison will likely highlight some interesting properties of a given dataset and suggest which class to pursue in more detail.

This review has highlighted several key strengths and weaknesses of the main approaches to constructing multivariate Poisson distributions. Yet, there remain many open questions. For example, what are the marginal distributions of the PGMs that are defined in terms of their conditional distributions? Or conversely, what are the conditional distributions of the copula models that are defined in terms of their marginal distributions?

Can novel models be created at the intersection of these model classes that could combine the strengths of different classes, as suggested in the Discussion section? Could certain model classes be developed in an application area that has been largely dominated by another model class? For example, graphical models are well known in the machine learning literature, while copula models are well known in the financial modeling literature. Overall, multivariate Poisson models are poised to increase in popularity given the wide potential applications for real-world, high-dimensional, count-valued data in text analysis, genomics, spatial statistics, economics, and epidemiology.

NOTES
a The label 'multivariate Poisson' was introduced in the statistics community to refer to the particular model introduced in this section, but other generalizations could also be considered multivariate Poisson distributions.
b This is because if y ∈ ℝ is Normal, then exp(y) ∈ ℝ₊₊ is LogNormal.
c Besag58 originally named these Poisson auto models, focusing on pairwise graphical models, but here, we consider the general graphical model setting.
d If the marginal distribution on the length is set to be the same as the marginal distribution on length for the PGM, i.e., if ℙ(L) = Σ_{x : ‖x‖₁ = L} ℙ_PGM(x), then the PGM distribution is recovered.
e http://ir.dcs.gla.ac.uk/resources/test_collections/
f https://cran.r-project.org/web/packages/XMRF/index.html
g For the crash severity dataset, the truncated Poisson graphical model ('Truncated PGM') outperforms the Poisson SQR model under the pairwise MMD metric. After inspection, however, we realized that the Truncated PGM model performed better merely because values were truncated to the 99th percentile as described in the supplementary material. This reduced the overfitting of outlier values caused by the crash severity dataset's high overdispersion.
h These basic trends are also corroborated by the two datasets in the supplementary material.
i In the discrete case, the dependency structure cannot be perfectly decoupled from the marginal distributions, unlike in the continuous case where the dependency structure and marginals can be perfectly decoupled.

ACKNOWLEDGMENTS
D.I. and P.R. acknowledge the support of ARO via W911NF-12-1-0390, NSF via IIS-1149803, IIS-1447574, DMS-1264033, and NIH via R01 GM117594-01 as part of the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences. E.Y. acknowledges the support of MSIP/IITP via ICT R&D program 2016-0-00563. G.A. acknowledges support from NSF DMS-1264058 and NSF DMS-1554821.

REFERENCES
1. Campbell J. The Poisson correlation function. Proc Edinburgh Math Soc 1934, 4:18–26.
2. M'Kendrick A. Applications of mathematics to medical problems. Proc Edinburgh Math Soc 1925, 44:98–130.
3. Teicher H. On the multivariate Poisson distribution. Scand Actuar J 1954, 1954:1–9.
4. Wicksell SD. Some theorems in the theory of probability, with special reference to their importance in the theory of homograde correlation. Svenska Aktuarieföreningen Tidskrift 1916:165–213.
5. Nikoloulopoulos AK. Copula-based models for multivariate discrete response data. In: Jaworski P, Durante F, Härdle W, eds. Copulae in Mathematical and Quantitative Finance. Berlin Heidelberg: Springer-Verlag; 2013.
6. Nikoloulopoulos AK, Karlis D. Modeling multivariate count data using copulas. Commun Stat Simul Comput 2009, 39:172–187.
7. Xue-Kun Song P. Multivariate dispersion models generated from Gaussian copula. Scand J Stat 2000, 27:305–320.
8. Holgate P. Estimation for the bivariate Poisson distribution. Biometrika 1964, 51:241–287.
9. Dwass M, Teicher H. On infinitely divisible random vectors. Ann Math Stat 1957, 28:461–470.
10. Kawamura K. The structure of multivariate Poisson distribution. Kodai Math J 1979, 2:337–345.
11. Srivastava R, Srivastava A. On a characterization of Poisson distribution. J Appl Probab 1970, 7:497–501.
12. Wang Y. Characterizations of certain multivariate distributions. Math Proc Camb Philos Soc 1974, 75:219–234.


13. Johnson NL, Kotz S, Balakrishnan N. Discrete Multivariate Distributions, vol. 165. New York: John Wiley & Sons; 1997.
14. Krishnamoorthy A. Multivariate binomial and Poisson distributions. Sankhya: Indian J Stat 1951, 11:117–124.
15. Krummenauer F. Limit theorems for multivariate discrete distributions. Metrika 1998, 47:47–69.
16. Mahamunulu D. A note on regression in the multivariate Poisson distribution. J Am Stat Assoc 1967, 62:251–258.
17. Kano K, Kawamura K. On recurrence relations for the probability function of multivariate generalized Poisson distribution. Commun Stat Theory Methods 1991, 20:165–178.
18. Karlis D. An EM algorithm for multivariate Poisson distribution and related models. J Appl Stat 2003, 30:63–77.
19. Loukas S, Kemp C. On computer sampling from trivariate and multivariate discrete distributions: multivariate discrete distributions. J Stat Comput Simul 1983, 17:113–123.
20. Tsiamyrtzis P, Karlis D. Strategies for efficient computation of multivariate Poisson probabilities. Commun Stat Simul Comput 2004, 33:271–292.
21. Sklar A. Fonctions de répartition à n dimensions et leurs marges. Publ Inst Statist Univ Paris 1959, 8:229–231.
22. Cherubini U, Luciano E, Vecchiato W. Copula Methods in Finance. Hoboken, NJ: John Wiley & Sons; 2004.
23. Genest C, Nešlehová J. A primer on copulas for count data. Astin Bull 2007, 37:475–515.
24. Nikoloulopoulos AK. On the estimation of normal copula discrete regression models using the continuous extension and simulated likelihood. J Stat Plann Inference 2013, 143:1923–1937.
25. Nikoloulopoulos AK. Efficient estimation of high-dimensional multivariate normal copula models with discrete spatial responses. Stoch Environ Res Risk Assess 2016, 30:493–505.
26. Rüschendorf L. Copulas, Sklar's theorem, and distributional transform. In: Mathematical Risk Analysis. Berlin, Heidelberg: Springer-Verlag; 2013, 3–34 (Chapter 1).
27. Demarta S, McNeil AJ. The t copula and related copulas. Int Stat Rev 2005, 73:111–129.
28. Trivedi PK, Zimmer DM. Copula Modeling: An Introduction for Practitioners. Foundations and Trends in Econometrics, vol. 1. Boston - Delft: now publishers; 2007, 1–111.
29. Aas K, Czado C, Frigessi A, Bakken H. Pair-copula constructions of multiple dependence. Insur: Math Econ 2009, 44:182–198.
30. Bedford T, Cooke RM. Vines: a new graphical model for dependent random variables. Ann Stat 2002, 30:1031–1068.
31. Cook RJ, Lawless JF, Lee K-A. A copula-based mixed Poisson model for bivariate recurrent events under event-dependent censoring. Stat Med 2010, 29:694–707.
32. Yahav I, Shmueli G. On generating multivariate Poisson data in management science applications. Appl Stoch Models Bus Ind 2012, 28:91–102.
33. Sklar A. Random variables, joint distribution functions, and copulas. Kybernetika 1973, 9:449–460.
34. Karlis D. Models for multivariate count time series. In: Davis RA, Holan SH, Lund R, Ravishanker N, eds. Handbook of Discrete-Valued Time Series. Boca Raton, FL: CRC Press; 2016, 407–424 (Chapter 19).
35. Joe H, Xu JJ. The estimation method of inference functions for margins for multivariate models. Technical Report 166, The University of British Columbia, Vancouver, Canada, 1996.
36. Denuit M, Lambert P. Constraints on concordance measures in bivariate discrete data. J Multivariate Anal 2005, 93:40–57.
37. Heinen A, Rengifo E. Multivariate autoregressive modeling of time series count data using copulas. J Empir Finance 2007, 14:564–583.
38. Heinen A, Rengifo E. Multivariate reduced rank regression in non-Gaussian contexts, using copulas. Comput Stat Data Anal 2008, 52:2931–2944.
39. Madsen L. Maximum likelihood estimation of regression parameters with spatially dependent discrete data. J Agric Biol Environ Stat 2009, 14:375–391.
40. Madsen L, Fang Y. Joint regression analysis for discrete longitudinal data. Biometrics 2011, 67:1171–1175.
41. Kazianka H. Approximate copula-based estimation and prediction of discrete spatial data. Stoch Environ Res Risk Assess 2013, 27:2015–2026.
42. Kazianka H, Pilz J. Copula-based geostatistical modeling of continuous and discrete data including covariates. Stoch Environ Res Risk Assess 2010, 24:661–673.
43. Genz A, Bretz F. Computation of Multivariate Normal and t Probabilities, vol. 195. Dordrecht Heidelberg London New York: Springer; 2009.
44. Panagiotelis A, Czado C, Joe H. Pair copula constructions for multivariate discrete data. J Am Stat Assoc 2012, 107:1063–1072.
45. Czado C, Brechmann EC, Gruber L. Selection of vine copulas. In: Copulae in Mathematical and Quantitative Finance. Berlin Heidelberg: Springer; 2013, 17–37.
46. Karlis D, Xekalaki E. Mixed Poisson distributions. Int Stat Rev 2005, 73:35–58.


47. Arbous AG, Kerrich J. Accident statistics and the concept of accident proneness. Biometrics 1951, 7:340–432.
48. Steyn H. On the multivariate Poisson normal distribution. J Am Stat Assoc 1976, 71:233–236.
49. Aitchison J, Ho C. The multivariate Poisson-log normal distribution. Biometrika 1989, 76:643–653.
50. Aguero-Valverde J, Jovanis PP. Bayesian multivariate Poisson lognormal models for crash severity modeling and site ranking. Transp Res Rec: J Transp Res Board 2009, 2136:82–91.
51. Chib S, Winkelmann R. Markov chain Monte Carlo analysis of correlated count data. J Bus Econ Stat 2001, 19:428–435.
52. El-Basyouny K, Sayed T. Collision prediction models using multivariate Poisson-lognormal regression. Accid Anal Prev 2009, 41:820–828.
53. Ma J, Kockelman KM, Damien P. A multivariate Poisson-lognormal regression model for prediction of crash counts by severity, using Bayesian methods. Accid Anal Prev 2008, 40:964–975.
54. Park E, Lord D. Multivariate Poisson-lognormal models for jointly modeling crash frequency by severity. Transp Res Rec: J Transp Res Board 2007, 2019:1–6.
55. Zhan X, Abdul Aziz HM, Ukkusuri SV. An efficient parallel sampling technique for multivariate Poisson-lognormal model: analysis with two crash count datasets. Anal Methods Accid Res 2015, 8:45–60.
56. Karlis D, Meligkotsidou L. Finite mixtures of multivariate Poisson distributions with application. J Stat Plann Inference 2007, 137:1942–1960.
57. Bradley JR, Holan SH, Wikle CK. Computationally efficient distribution theory for Bayesian inference of high-dimensional dependent count-valued data. arXiv preprint arXiv:1512.07273, 2015.
58. Besag J. Spatial interaction and the statistical analysis of lattice systems. J R Stat Soc Ser B Stat Methodol 1974, 36:192–236.
59. Inouye DI, Ravikumar P, Dhillon IS. Fixed-length Poisson MRF: adding dependencies to the multinomial. In: Neural Information Processing Systems, 28, 2015.
60. Inouye DI, Ravikumar P, Dhillon IS. Square root graphical models: multivariate generalizations of univariate exponential families that permit positive dependencies. In: International Conference on Machine Learning, 2016.
61. Yang E, Ravikumar P, Allen G, Liu Z. Graphical models via generalized linear models. In: Neural Information Processing Systems, 25, 2012.
62. Yang E, Ravikumar P, Allen G, Liu Z. On Poisson graphical models. In: Neural Information Processing Systems, 26, 2013.
63. Yang E, Ravikumar P, Allen G, Liu Z. Graphical models via univariate exponential family distributions. J Mach Learn Res 2015, 16:3813–3847.
64. Lauritzen SL. Graphical Models. Oxford: Oxford University Press; 1996.
65. Clifford P. Markov random fields in statistics. In: Disorder in Physical Systems. Oxford: Oxford Science Publications; 1990.
66. Wainwright MJ, Jordan MI. Graphical models, exponential families, and variational inference. Found Trends Mach Learn 2008, 1:1–305.
67. Koller D, Friedman N. Probabilistic Graphical Models: Principles and Techniques. Cambridge, MA: MIT Press; 2009.
68. Yang E, Ravikumar P, Allen G, Liu Z. Conditional random fields via univariate exponential families. In: Neural Information Processing Systems, 26, 2013.
69. Kaiser MS, Cressie N. Modeling Poisson variables with positive spatial dependence. Stat Probab Lett 1997, 35:423–432.
70. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. Ann Stat 2006, 34:1436–1462.
71. Allen GI, Liu Z. A log-linear graphical model for inferring genetic networks from high-throughput sequencing data. In: 2012 IEEE International Conference on Bioinformatics and Biomedicine, IEEE, 2012, 1–6.
72. Allen GI, Liu Z. A local Poisson graphical model for inferring networks from sequencing data. IEEE Trans NanoBiosci 2013, 12:189–198.
73. Hadiji F, Molina A, Natarajan S, Kersting K. Poisson dependency networks: gradient boosted models for multivariate count data. Mach Learn 2015, 100:477–507.
74. Han SW, Zhong H. Estimation of sparse directed acyclic graphs for multivariate counts data. Biometrics 2016, 72:791–803.
75. Altham PME, Hankin RKS. Multivariate generalizations of the multiplicative binomial distribution: introducing the MM package. J Stat Softw 2012, 46:1–23. doi:10.18637/jss.v046.i12.
76. Altham PME. Two generalizations of the binomial distribution. J R Stat Soc Ser C Appl Stat 1978, 27:162–167.
77. Neal RM. Annealed importance sampling. Stat Comput 2001, 11:125–139.
78. Milton JC, Shankar VN, Mannering FL. Highway accident severities and the mixed logit model: an exploratory empirical analysis. Accid Anal Prev 2008, 40:260–266.
79. Wan Y-W, Allen GI, Liu Z. TCGA2STAT: simple TCGA data access for integrated statistical analysis in R. Bioinformatics 2016, 32:952–954.
80. Gretton A. A kernel two-sample test. J Mach Learn Res 2012, 13:723–773.
81. Zhao J, Meng D. FastMMD: ensemble of circular discrepancy for efficient two-sample test. Neural Comput 2015, 27:1345–1372.


82. Inouye DI, Ravikumar P, Dhillon IS. Admixtures of Poisson MRFs: a topic model with word dependencies. In: International Conference on Machine Learning, vol 31, 2014.
83. Inouye DI, Ravikumar P, Dhillon IS. Generalized root models: beyond pairwise graphical models for univariate exponential families. arXiv preprint arXiv:1606.00813, 2016.
84. Liu H, Han F, Yuan M, Lafferty J, Wasserman L. High-dimensional semiparametric Gaussian copula graphical models. Ann Stat 2012, 40:2293–2326. doi:10.1214/12-AOS1037.
