
Lognormal and Gamma Mixed Negative Binomial Regression

Mingyuan Zhou, Lingbo Li, David Dunson, Lawrence Carin
Duke University, Durham NC 27708, USA

Abstract

In regression analysis of counts, a lack of simple and efficient algorithms for posterior computation has made Bayesian approaches appear unattractive and thus underdeveloped. We propose a lognormal and gamma mixed negative binomial (NB) regression model for counts, and present efficient closed-form Bayesian inference; unlike conventional Poisson models, the proposed approach has two free parameters to include two different kinds of random effects, and allows the incorporation of prior information, such as sparsity in the regression coefficients. By placing a gamma distribution prior on the NB dispersion parameter r, and connecting a lognormal prior with the logit of the NB probability parameter p, efficient Gibbs sampling and variational Bayes inference are both developed. The closed-form updates are obtained by exploiting conditional conjugacy via both a compound Poisson representation and a Polya-Gamma distribution based data augmentation approach. The proposed Bayesian inference can be implemented routinely, while being easily generalizable to more complex settings involving multivariate dependence structures. The algorithms are illustrated using real examples.

1. Introduction

In numerous scientific studies, the response variable is a count y = 0, 1, 2, ..., which we wish to explain with a set of covariates x = [1, x_1, ..., x_P]^T as E[y|x] = g^{-1}(x^T β), where β = [β_0, ..., β_P]^T are the regression coefficients and g is the canonical link function in generalized linear models (GLMs) (McCullagh & Nelder, 1989; Long, 1997; Cameron & Trivedi, 1998; Agresti, 2002; Winkelmann, 2008). Regression models for counts are usually nonlinear and have to take into consideration the specific properties of counts, including discreteness and nonnegativity, and counts are often characterized by overdispersion (variance greater than the mean). In addition, we may wish to impose a sparse prior on the regression coefficients for counts, which is demonstrated to be beneficial for regression analysis of both Gaussian and binary data (Tipping, 2001).

Count data are commonly modeled with the Poisson distribution y ∼ Pois(λ), whose mean and variance are both equal to λ. Due to heterogeneity (difference between individuals) and contagion (dependence between the occurrence of events), the variance is often much larger than the mean, making the Poisson assumption restrictive. By placing a gamma distribution prior with shape r and scale p/(1 − p) on λ, a negative binomial (NB) distribution y ∼ NB(r, p) can be generated as

    f_Y(y) = ∫_0^∞ Pois(y; λ) Gamma(λ; r, p/(1 − p)) dλ = [Γ(r + y) / (y! Γ(r))] (1 − p)^r p^y,

where Γ(·) denotes the gamma function, r is the nonnegative dispersion parameter and p is a probability parameter. Therefore, the NB distribution is also known as the gamma-Poisson distribution. It has a variance rp/(1 − p)^2 larger than the mean rp/(1 − p), and thus it is usually favored over the Poisson distribution for modeling overdispersed counts.

The regression analysis of counts is commonly performed under the Poisson or NB likelihoods, whose parameters are usually estimated by finding the maximum of the nonlinear log likelihood (Long, 1997; Cameron & Trivedi, 1998; Agresti, 2002; Winkelmann, 2008). The maximum likelihood estimator (MLE), however, only provides a point estimate and does not allow the incorporation of prior information, such as sparsity in the regression coefficients.
In addition, the MLE of the NB dispersion parameter r often lacks robustness and may be severely biased or even fail to converge if the sample size is small, the mean is small, or if r is large (Saha & Paul, 2005; Lloyd-Smith, 2007).
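The gamma-Poisson mixture identity above is easy to check numerically; a minimal sketch with arbitrary r and p (note that scipy's nbinom takes the success probability as its second argument, i.e., 1 − p in the parameterization used here):

import numpy as np
from scipy import stats
from scipy.integrate import quad

# Check that the NB(r, p) pmf equals the Poisson pmf mixed over a
# Gamma(shape=r, scale=p/(1-p)) rate; r, p and the range of y are arbitrary.
r, p = 2.5, 0.3
for y in range(6):
    mixture, _ = quad(lambda lam: stats.poisson.pmf(y, lam)
                      * stats.gamma.pdf(lam, a=r, scale=p / (1 - p)), 0, np.inf)
    nb = stats.nbinom.pmf(y, r, 1 - p)   # scipy's second argument is 1 - p here
    print(y, round(mixture, 6), round(nb, 6))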

Compared to the MLE, Bayesian approaches are able to model the uncertainty of estimation and to incorporate prior information. In regression analysis of counts, however, the lack of simple and efficient algorithms for posterior computation has seriously limited routine applications of Bayesian approaches, making Bayesian analysis of counts appear unattractive and thus underdeveloped. For instance, for the NB dispersion parameter r, the only available closed-form Bayesian solution relies on approximating the ratio of two gamma functions using a polynomial expansion (Bradlow et al., 2002); and for the regression coefficients β, Bayesian solutions usually involve computationally intensive Metropolis-Hastings algorithms, since the conjugate prior for β is not known under the Poisson and NB likelihoods (Chib et al., 1998; Chib & Winkelmann, 2001; Winkelmann, 2008).

In this paper we propose a lognormal and gamma mixed NB regression model for counts, with default Bayesian analysis presented based on two novel data augmentation approaches. Specifically, we show that the gamma distribution is the conjugate prior to the NB dispersion parameter r under the compound Poisson representation, with efficient Gibbs sampling and variational Bayes (VB) inference derived by exploiting conditional conjugacy. Further, we show that a lognormal prior can be connected to the logit of the NB probability parameter p, with efficient Gibbs sampling and VB inference developed for the regression coefficients β and the lognormal variance parameter σ^2, by generalizing a Polya-Gamma distribution based data augmentation approach in Polson & Scott (2011). The proposed Bayesian inference can be implemented routinely, while being easily generalizable to more complex settings involving multivariate dependence structures. We illustrate the algorithms with real examples on count data analysis and count regression, and demonstrate the advantages of the proposed Bayesian approaches over conventional count models.

2. Regression Models for Counts

The most basic regression model for counts is the Poisson regression model (Long, 1997; Cameron & Trivedi, 1998; Winkelmann, 2008), which can be expressed as

    y_i ∼ Pois(λ_i),  λ_i = exp(x_i^T β),    (1)

where x_i = [1, x_{i1}, ..., x_{iP}]^T is the covariate vector for sample i. The Newton-Raphson method can be used to iteratively find the MLE of β (Long, 1997). A serious constraint of the Poisson regression model is that it assumes equal-dispersion, i.e., E[y_i|x_i] = Var[y_i|x_i] = exp(x_i^T β). In practice, however, count data are often overdispersed, due to heterogeneity and contagion (Winkelmann, 2008). To model overdispersed counts, the model can be modified as

    y_i ∼ Pois(λ_i ε_i),  λ_i = exp(x_i^T β),    (2)

where ε_i is a nonnegative multiplicative random-effect term to model individual heterogeneity (Winkelmann, 2008). Using both the law of total expectation and the law of total variance, it can be shown that

    E[y_i|x_i] = exp(x_i^T β) E[ε_i],    (3)
    Var[y_i|x_i] = E[y_i|x_i] + (Var[ε_i] / E^2[ε_i]) E^2[y_i|x_i].    (4)

Thus Var[y_i|x_i] ≥ E[y_i|x_i] and we obtain a regression model for overdispersed counts. We show below that both the gamma and lognormal distributions can be used as the nonnegative prior on ε_i.

2.1. The Negative Binomial Regression Model

The NB regression model (Long, 1997; Cameron & Trivedi, 1998; Winkelmann, 2008; Hilbe, 2007) is constructed by placing a gamma prior on ε_i as

    ε_i ∼ Gamma(r, 1/r) = [r^r / Γ(r)] ε_i^{r−1} e^{−r ε_i},    (5)

where E[ε_i] = 1 and Var[ε_i] = r^{−1}. Marginalizing out ε_i in (2), we have a NB distribution parameterized by mean µ_i = exp(x_i^T β) and inverse dispersion parameter φ (the reciprocal of r) as

    f_Y(y_i) = [Γ(φ^{−1} + y_i) / (y_i! Γ(φ^{−1}))] (φ^{−1} / (φ^{−1} + µ_i))^{φ^{−1}} (µ_i / (φ^{−1} + µ_i))^{y_i},

thus

    E[y_i|x_i] = exp(x_i^T β),    (6)
    Var[y_i|x_i] = E[y_i|x_i] + φ E^2[y_i|x_i].    (7)

The MLEs of β and φ can be found numerically with the Newton-Raphson method (Lawless, 1987).
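The moment identities (6)-(7) can be checked by simulating the gamma-mixed Poisson construction in (2) and (5); a minimal sketch with arbitrary covariates and coefficients:

import numpy as np

# Simulation check of the NB regression moments (6)-(7); the covariate vector
# and coefficients below are illustrative values, not from the paper.
rng = np.random.default_rng(0)
r, beta = 2.0, np.array([0.5, -0.3])
x = np.array([1.0, 1.2])                               # covariate vector [1, x_1]
mu = np.exp(x @ beta)                                  # exp(x^T beta)
eps = rng.gamma(shape=r, scale=1.0 / r, size=200000)   # eq. (5): E[eps]=1, Var[eps]=1/r
y = rng.poisson(mu * eps)                              # eq. (2)
print(y.mean(), mu)                                    # ~ E[y|x] = exp(x^T beta), eq. (6)
print(y.var(), mu + (1.0 / r) * mu ** 2)               # ~ E[y|x] + phi E^2[y|x], eq. (7)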

2.2. The Lognormal-Poisson Regression Model

A lognormal-Poisson regression model (Breslow, 1984; Long, 1997; Agresti, 2002; Winkelmann, 2008) can be constructed by placing a lognormal prior on ε_i as

    ε_i ∼ lnN(0, σ^2),    (8)

where E[ε_i] = e^{σ^2/2} and Var[ε_i] = e^{σ^2}(e^{σ^2} − 1). Using (3) and (4), we have

    E[y_i|x_i] = exp(x_i^T β + σ^2/2),    (9)
    Var[y_i|x_i] = E[y_i|x_i] + (e^{σ^2} − 1) E^2[y_i|x_i].    (10)

Compared to the NB model, there is no analytical form for the distribution of y_i if ε_i is marginalized out, and the MLE is less straightforward to calculate, making the model less commonly used. However, Winkelmann (2008) suggests reevaluating the lognormal-Poisson model, since it is appealing in theory and may fit the data better. The inverse Gaussian distribution prior can also be placed on ε_i to construct a heavier-tailed alternative to the NB model (Dean et al., 1989), whose density functions are shown to be virtually identical to those of the lognormal-Poisson model (Winkelmann, 2008).

3. The Lognormal and Gamma Mixed Negative Binomial Regression Model

To explicitly model the uncertainty of estimation and incorporate prior information, Bayesian approaches appear attractive. Bayesian analysis of counts, however, is seriously limited by the lack of efficient inference, as the conjugate prior for the regression coefficients β is unknown under the Poisson and NB likelihoods (Winkelmann, 2008), and the conjugate prior for the NB dispersion parameter r is also unknown. To address these issues, we propose a lognormal and gamma mixed NB regression model for counts, termed here the LGNB model, where a lognormal prior lnN(0, σ^2) is placed on the multiplicative random-effect term ε_i and a gamma prior is placed on r. Denoting p_i = e^{ψ_i}/(1 + e^{ψ_i}) = exp(x_i^T β)ε_i / (1 + exp(x_i^T β)ε_i) and logit(p_i) = ln[p_i/(1 − p_i)], the LGNB model is constructed as

    y_i ∼ NB(r, p_i),  ψ_i = logit(p_i) = x_i^T β + ln ε_i,    (11)
    ε_i ∼ lnN(0, ϕ^{−1}),  ϕ ∼ Gamma(e_0, 1/f_0),    (12)
    β ∼ ∏_{p=0}^{P} N(0, α_p^{−1}),  α_p ∼ Gamma(c_0, 1/d_0),    (13)
    r ∼ Gamma(a_0, 1/h),  h ∼ Gamma(b_0, 1/g_0),    (14)

where ϕ = σ^{−2} and a_0, b_0, c_0, d_0, e_0, f_0 and g_0 are gamma hyperparameters (they are set as 0.01 in the experiments). Since y_i ∼ NB(r, p_i) in (11) can be augmented into a gamma-Poisson structure as y_i ∼ Pois(λ_i), λ_i ∼ Gamma(r, exp(x_i^T β)ε_i), the LGNB model can also be considered as a lognormal-gamma-gamma-Poisson regression model. Denoting X = [x_1, ..., x_N]^T, we may equivalently express ψ = [ψ_1, ..., ψ_N]^T in the above model as

    ψ ∼ N(Xβ, ϕ^{−1}I).    (15)

If we marginalize out h in (14), we obtain a beta prime distribution prior r ∼ β'(a_0, b_0, 1, g_0). If we marginalize out α_p in (13), we obtain a Student-t prior for β_p, the sparsity-promoting prior used in Tipping (2001) and Bishop & Tipping (2000) for regression analysis of both Gaussian and binary data. Note that β is connected to p_i with a logit link, which is key to deriving efficient Bayesian inference.
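For intuition, the likelihood layer (11)-(12) can be simulated directly; a minimal sketch with r, σ and β fixed at arbitrary illustrative values rather than drawn from the hyperpriors (13)-(14):

import numpy as np

# Generative sketch of the LGNB likelihood layer (11)-(12).
rng = np.random.default_rng(1)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta = np.array([0.2, 0.8, -0.5])
r, sigma = 2.0, 0.5
eps = rng.lognormal(mean=0.0, sigma=sigma, size=N)   # eps_i ~ lnN(0, sigma^2)
psi = X @ beta + np.log(eps)                         # eq. (11): psi_i = x_i^T beta + ln eps_i
p = 1.0 / (1.0 + np.exp(-psi))                       # p_i = e^{psi_i} / (1 + e^{psi_i})
y = rng.negative_binomial(r, 1.0 - p)                # y_i ~ NB(r, p_i); numpy's argument is 1 - p_i
# Rough check against eq. (16): E[y_i|x_i] = exp(x_i^T beta + sigma^2/2 + ln r)
print(y.mean(), np.exp(X @ beta + sigma ** 2 / 2 + np.log(r)).mean())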
3.1. Model Properties and Model Comparison

Using the laws of total expectation and total variance and the moments of the NB distribution, we have

    E[y_i|x_i] = E_{ε_i}[E[y_i|x_i, ε_i]] = exp(x_i^T β + σ^2/2 + ln r),    (16)
    Var[y_i|x_i] = E_{ε_i}[Var[y_i|x_i, ε_i]] + Var_{ε_i}[E[y_i|x_i, ε_i]]
                 = E[y_i|x_i] + [e^{σ^2}(1 + r^{−1}) − 1] E^2[y_i|x_i].    (17)

We define the quasi-dispersion κ as the coefficient associated with the quadratic mean term in the variance. As shown in (7) and (10), κ = φ in the NB model and κ = e^{σ^2} − 1 in the lognormal-Poisson model. Apparently, they have different distribution assumptions on dispersion, yet there is no clear evidence to favor one over the other in terms of goodness of fit. In the proposed LGNB model, there are two free parameters r and σ^2 to adjust both the mean in (16) and the dispersion κ = e^{σ^2}(1 + r^{−1}) − 1, which become the same as those of the NB model when σ^2 = 0, and the same as those of the lognormal-Poisson model when φ = r^{−1} = 0. Thus the LGNB model has one extra degree of freedom to incorporate both kinds of random effects, with their proportion automatically inferred.

4. Default Bayesian Analysis Using Data Augmentation

As discussed in Section 3, the LGNB model has the advantage of having two free parameters to incorporate both kinds of random effects. We show below that it has an additional advantage in that default Bayesian analysis can be performed with two novel data augmentation approaches, with closed-form solutions and analytical update equations available for both Gibbs sampling and VB inference. One augmentation approach concerns the inference of the NB dispersion parameter r using the compound Poisson representation, and the other concerns the inference of the regression coefficients β using the Polya-Gamma distribution.

4.1. Inferring the Dispersion Parameter Under the Compound Poisson Representation

We first focus on inference of the NB dispersion parameter r and assume we know {p_i}_{i=1,N} and h, neglecting the remaining part of the LGNB model for the moment. We comment here that the novel Bayesian inference developed here can be applied to any other scenario where the conditional posterior of r is proportional to ∏_{i=1}^{N} NB(y_i; r, p_i) Gamma(r; a_0, 1/h), for which hybrid Monte Carlo and Metropolis-Hastings algorithms had been developed in Williamson et al. (2010) and Zhou et al. (2012), but VB solutions had not yet been developed.

As proved in Quenouille (1949), y ∼ NB(r, p) can also be generated from a compound Poisson distribution as

    y = Σ_{ℓ=1}^{L} u_ℓ,  L ∼ Pois(−r ln(1 − p)),  u_ℓ iid∼ Log(p),    (18)

where Log(p) corresponds to the logarithmic distribution (Barndorff-Nielsen et al., 2010) with f_U(k) = −p^k/[k ln(1 − p)], k ∈ {1, 2, ...}, whose probability-generating function (PGF) is

    G_U(z) = ln(1 − pz)/ln(1 − p),  |z| < p^{−1}.    (19)

Using the conjugacy between the gamma and Poisson distributions, it is evident that the gamma distribution is the conjugate prior for r under this augmentation.
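A minimal sketch of the representation (18), using the logarithmic (log-series) distribution available in scipy and arbitrary r and p, checking the resulting sample moments against the NB mean and variance:

import numpy as np
from scipy import stats

# Draw NB(r, p) variables as a Poisson number of iid Log(p) terms, eq. (18).
rng = np.random.default_rng(2)
r, p, n = 3.0, 0.4, 20000
L = rng.poisson(-r * np.log(1.0 - p), size=n)
y = np.array([stats.logser.rvs(p, size=l).sum() if l > 0 else 0 for l in L])
print(y.mean(), r * p / (1 - p))       # NB mean
print(y.var(), r * p / (1 - p) ** 2)   # NB variance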

4.1.1. Gibbs Sampling for r

Recalling (18), y_i ∼ NB(r, p_i) can also be generated from the random sum y_i = Σ_{ℓ=1}^{L_i} u_{iℓ} with

    u_{iℓ} iid∼ Log(p_i),  L_i ∼ Pois(−r ln(1 − p_i)).    (20)

Exploiting conjugacy between (14) and (20), given L_i, we have the conditional posterior of r as

    (r|−) ∼ Gamma(a_0 + Σ_{i=1}^{N} L_i, 1/(h − Σ_{i=1}^{N} ln(1 − p_i))),    (21)

where here and below expressions like (r|−) correspond to the random variable (RV) r, conditioned on all other RVs. The remaining challenge is finding the conditional posterior of L_i. Denote w_{ij} = Σ_{ℓ=1}^{j} u_{iℓ}, where j = 1, ..., y_i. Since w_{ij} is the summation of j iid Log(p_i) distributed RVs, using (19), the PGF of w_{ij} is

    G_{W_{ij}}(z) = [ln(1 − p_i z)/ln(1 − p_i)]^j,  |z| < p_i^{−1}.

Therefore, we have L_i ≡ 0 if y_i = 0, and for 1 ≤ j ≤ y_i,

    Pr(L_i = j|−) ∝ Pr(w_{ij} = y_i) Pois(j; −r ln(1 − p_i))
                  = [G_{W_{ij}}^{(y_i)}(0) / y_i!] Pois(j; −r ln(1 − p_i))
                  = [d^{y_i}/dz^{y_i} f_i^j(z)]|_{z=0} [r^j / (j! y_i!)] exp(r ln(1 − p_i))
                  = F(y_i, j) r^j p_i^{y_i} exp(r ln(1 − p_i)),    (22)

where f_i(z) = −ln(1 − p_i z) and F is a lower triangular matrix with F(1, 1) = 1, F(m, j) = 0 if j > m, and

    F(m, j) = [(m − 1)/m] F(m − 1, j) + (1/m) F(m − 1, j − 1)    (23)

if 1 ≤ j ≤ m. Using (22), we have

    Pr(L_i = j|−) = R_r(y_i, j),  j = 0, ..., y_i,    (24)

where R_r(0, 0) = 1 and

    R_r(m, j) = F(m, j) r^j / Σ_{j'=1}^{m} F(m, j') r^{j'}.    (25)

The values of F can be iteratively calculated and each row sums to one; e.g., the 4th and 5th rows of F are [6/4!, 11/4!, 6/4!, 1/4!, 0, 0, ...] and [24/5!, 50/5!, 35/5!, 10/5!, 1/5!, 0, ...]. Note that to obtain (22), we use the relationship, proved in Lemma 1 of the supplementary material, that

    (1/m!) [d^m/dz^m f_i^j(z)]|_{z=0} = F(m, j) j! p_i^m,  1 ≤ j ≤ m.    (26)

Gibbs sampling for r proceeds by alternately sampling (24) and (21). Note that to ensure numerical stability when r > 1, instead of using (25), we may iteratively calculate R_r in the way we calculate F in (23). We show in Figure 1 of the supplementary material the matrices R_r for r = 0.1, 1, 10 and 100.
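A minimal sketch of this Gibbs step on synthetic counts, assuming the p_i and the gamma hyperparameters (a_0, h) are known; it builds the triangular matrix F from (23), samples each L_i from (24)-(25), and then samples r from (21):

import numpy as np

def F_matrix(m_max):
    # Recursion (23): F(m, j) = ((m-1)/m) F(m-1, j) + (1/m) F(m-1, j-1), F(1, 1) = 1.
    F = np.zeros((m_max + 1, m_max + 1))
    F[1, 1] = 1.0
    for m in range(2, m_max + 1):
        for j in range(1, m + 1):
            F[m, j] = (m - 1) / m * F[m - 1, j] + F[m - 1, j - 1] / m
    return F

def sample_r(y, p, a0, h, r, F, rng):
    # Sample each L_i from (24)-(25), then r from its conditional posterior (21).
    L = np.zeros_like(y)
    for i, yi in enumerate(y):
        if yi > 0:
            j = np.arange(1, yi + 1)
            w = F[yi, 1:yi + 1] * r ** j          # unnormalized R_r(y_i, j), eq. (25)
            L[i] = rng.choice(j, p=w / w.sum())
    return rng.gamma(a0 + L.sum(), 1.0 / (h - np.log(1.0 - p).sum()))

rng = np.random.default_rng(3)
y = rng.negative_binomial(2.0, 0.6, size=500)     # synthetic counts, true r = 2, p_i = 0.4
p = np.full(y.shape, 0.4)                         # p_i treated as known here
F, r = F_matrix(y.max()), 1.0
for _ in range(200):
    r = sample_r(y, p, 0.01, 0.01, r, F, rng)
print(r)                                          # samples should move close to 2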

4.1.2. Variational Bayes Inference for r

Using VB inference (Bishop & Tipping, 2000; Beal, 2003), we approximate the posterior p(r, L|X) with Q(r, L) = Q_r(r) ∏_{i=1}^{N} Q_{L_i}(L_i), and we have

    Q_{L_i}(L_i) = Σ_{j=0}^{y_i} R_{r̃}(y_i, j) δ_j,    (27)
    Q_r(r) = Gamma(ã, 1/h̃).    (28)

Denote ⟨x⟩ = E[x], r̃ = exp(⟨ln r⟩) and let ψ(x) be the digamma function; then

    ⟨ln r⟩ = ψ(ã) − ln h̃,  ⟨L_i⟩ = Σ_{j=1}^{y_i} R_{r̃}(y_i, j) j,    (29)
    ã = a_0 + Σ_{i=1}^{N} ⟨L_i⟩,  h̃ = h − Σ_{i=1}^{N} ⟨ln(1 − p_i)⟩.    (30)

Equations (29)-(30) constitute the VB inference for the NB dispersion parameter r, with ⟨r⟩ = ã/h̃.
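A corresponding sketch of the VB updates (29)-(30), again treating y, p_i, a_0 and h as known illustrative inputs and reusing the recursion (23) for F:

import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(4)
y = rng.negative_binomial(2.0, 0.6, size=500)          # synthetic counts, true r = 2
p = np.full(y.shape, 0.4)
a0 = h = 0.01
F = np.zeros((y.max() + 1, y.max() + 1)); F[1, 1] = 1.0
for m in range(2, y.max() + 1):                        # recursion (23), vectorized over j
    F[m, 1:m + 1] = (m - 1) / m * F[m - 1, 1:m + 1] + F[m - 1, 0:m] / m
a_t, h_t = a0 + y.sum(), h - np.log(1.0 - p).sum()     # crude initialization
for _ in range(50):
    r_t = np.exp(digamma(a_t) - np.log(h_t))           # r~ = exp(<ln r>)
    EL = 0.0
    for yi in y[y > 0]:
        j = np.arange(1, yi + 1)
        w = F[yi, 1:yi + 1] * r_t ** j                 # proportional to R_{r~}(y_i, j)
        EL += (j * w).sum() / w.sum()                  # accumulate <L_i>, eq. (29)
    a_t, h_t = a0 + EL, h - np.log(1.0 - p).sum()      # eq. (30)
print(a_t / h_t)                                       # <r> = a~/h~, roughly 2 here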

4.2. Inferring the Regression Coefficients Using the Polya-Gamma Distribution

Denote ω_i as a random variable drawn from the Polya-Gamma (PG) distribution (Polson & Scott, 2011) as

    ω_i ∼ PG(y_i + r, 0).    (31)

We have E_{ω_i}[exp(−ω_i ψ_i^2/2)] = cosh^{−(y_i+r)}(ψ_i/2). Thus the likelihood of ψ_i in (11) can be expressed as

    L(ψ_i) ∝ (e^{ψ_i})^{y_i} / (1 + e^{ψ_i})^{y_i+r}
           = 2^{−(y_i+r)} exp((y_i − r)ψ_i/2) / cosh^{y_i+r}(ψ_i/2)
           ∝ exp((y_i − r)ψ_i/2) E_{ω_i}[exp(−ω_i ψ_i^2/2)].    (32)

Given the values of {ω_i}_{i=1,N} and the prior in (15), the conditional posterior of ψ can be expressed as

    (ψ|−) ∝ N(ψ; Xβ, ϕ^{−1}I) ∏_{i=1}^{N} exp(−(ω_i/2) [ψ_i − (y_i − r)/(2ω_i)]^2),    (33)

and given the values of ψ and the prior in (31), the conditional posterior of ω_i can be expressed as

    (ω_i|−) ∝ exp(−ω_i ψ_i^2/2) PG(ω_i; y_i + r, 0).    (34)

4.3. Gibbs Sampling Inference

Exploiting conditional conjugacy and the exponential tilting of the PG distribution in Polson & Scott (2011), we can sample in closed form all latent parameters of the LGNB model described in (11) to (14) as

    Sampling L_i with (24),  Sampling r with (21),    (35)
    (ω_i|−) ∼ PG(y_i + r, ψ_i),    (36)
    (ψ|−) ∼ N(µ, Σ),  (β|−) ∼ N(µ_β, Σ_β),    (37)
    (h|−) ∼ Gamma(a_0 + b_0, 1/(g_0 + r)),    (38)
    (ϕ|−) ∼ Gamma(e_0 + N/2, 1/(f_0 + ||ψ − Xβ||_2^2 / 2)),    (39)
    (α_p|−) ∼ Gamma(c_0 + 1/2, 1/(d_0 + β_p^2 / 2)),    (40)

where Ω = diag(ω_1, ..., ω_N), A = diag(α_0, ..., α_P), y = [y_1, ..., y_N]^T, Σ = (ϕI + Ω)^{−1}, µ = Σ[(y − r)/2 + ϕXβ], Σ_β = (ϕX^T X + A)^{−1} and µ_β = ϕΣ_β X^T ψ. Note that a PG distributed random variable can be generated from an infinite sum of weighted iid gamma random variables (Devroye, 2009; Polson & Scott, 2011). We provide in the supplementary material a method for accurately truncating the infinite sum.
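A minimal sketch of the PG-based updates (36)-(37), with ω_i drawn by truncating the weighted sum-of-gammas representation mentioned above (the truncation level K = 200 is an arbitrary choice here, not the level used in the paper), and with r, ϕ, h and α_p held fixed for brevity; the remaining conditionals (38)-(40) follow standard gamma conjugacy:

import numpy as np

def pg_draw(b, c, rng, K=200):
    # Truncated sum-of-gammas draw of omega ~ PG(b, c):
    # omega = (1/(2 pi^2)) sum_k g_k / ((k - 1/2)^2 + c^2/(4 pi^2)), g_k ~ Gamma(b, 1).
    b = np.atleast_1d(b).astype(float).reshape(-1, 1)
    c = np.atleast_1d(c).astype(float).reshape(-1, 1)
    k = np.arange(1, K + 1)
    g = rng.gamma(b, 1.0, size=(b.shape[0], K))
    return (g / ((k - 0.5) ** 2 + (c / (2 * np.pi)) ** 2)).sum(axis=1) / (2 * np.pi ** 2)

def gibbs_step(y, X, beta, psi, r, phi, A, rng):
    omega = pg_draw(y + r, psi, rng)                               # eq. (36)
    Sigma = np.linalg.inv(phi * np.eye(len(y)) + np.diag(omega))   # Sigma = (phi I + Omega)^{-1}
    psi = rng.multivariate_normal(Sigma @ ((y - r) / 2 + phi * X @ beta), Sigma)   # eq. (37)
    Sigma_b = np.linalg.inv(phi * X.T @ X + A)                     # Sigma_beta = (phi X^T X + A)^{-1}
    beta = rng.multivariate_normal(phi * Sigma_b @ X.T @ psi, Sigma_b)
    return beta, psi, omega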
4.4. Variational Bayes Inference

Using VB inference (Bishop & Tipping, 2000; Beal, 2003), we approximate the posterior distribution with Q = Q_ψ(ψ) Q_β(β) Q_r(r) Q_h(h) Q_ϕ(ϕ) ∏_{p=0}^{P} Q_{α_p}(α_p) ∏_{i=1}^{N} [Q_{L_i}(L_i) Q_{ω_i}(ω_i)]. To exploit conjugacy, defining Q_{L_i}(L_i) as in (27), Q_r(r) as in (28), Q_{ω_i}(ω_i) = PG(γ_{i1}, γ_{i2}), Q_ψ(ψ) = N(µ̃, Σ̃), Q_β(β) = N(µ̃_β, Σ̃_β), Q_h(h) = Gamma(b̃, 1/g̃), Q_ϕ(ϕ) = Gamma(ẽ, 1/f̃) and Q_{α_p}(α_p) = Gamma(c̃_p, 1/d̃_p), we have

    ã = a_0 + Σ_{i=1}^{N} ⟨L_i⟩,  h̃ = ⟨h⟩ + Σ_{i=1}^{N} ⟨ln(1 + e^{ψ_i})⟩,    (41)
    Σ̃ = (⟨ϕ⟩I + Ω̃)^{−1},  µ̃ = Σ̃[(y − ⟨r⟩)/2 + ⟨ϕ⟩X⟨β⟩],    (42)
    Σ̃_β = (⟨ϕ⟩X^T X + ⟨A⟩)^{−1},  µ̃_β = ⟨ϕ⟩Σ̃_β X^T ⟨ψ⟩,    (43)
    b̃ = a_0 + b_0,  g̃ = ⟨r⟩ + g_0,  ẽ = e_0 + N/2,    (44)
    f̃ = f_0 + ⟨ψ^T ψ⟩/2 − ⟨ψ⟩^T X⟨β⟩ + tr[X⟨ββ^T⟩X^T]/2,    (45)
    c̃_p = c_0 + 1/2,  d̃_p = d_0 + ⟨β_p^2⟩/2,    (46)

where Ω̃ = diag(⟨ω_1⟩, ..., ⟨ω_N⟩), ⟨A⟩ = diag(⟨α_0⟩, ..., ⟨α_P⟩), tr[Σ] is the trace of Σ, ⟨ln r⟩ and ⟨L_i⟩ are calculated as in (29), ⟨r⟩ = ã/h̃, ⟨ψ^T ψ⟩ = µ̃^T µ̃ + tr[Σ̃], ⟨β⟩ = µ̃_β, ⟨β^T β⟩ = µ̃_β^T µ̃_β + tr[Σ̃_β], ⟨ββ^T⟩ = µ̃_β µ̃_β^T + Σ̃_β, ⟨h⟩ = b̃/g̃, ⟨ϕ⟩ = ẽ/f̃ and ⟨α_p⟩ = c̃_p/d̃_p. Although we do not have analytical forms for γ_{i1} and γ_{i2} in Q_{ω_i}(ω_i), we can use (36) to calculate ⟨ω_i⟩ as

    ⟨ω_i⟩ = E_{r,ψ_i}[E[ω_i|r, ψ_i, y_i]] = (y_i + ⟨r⟩) ⟨tanh(ψ_i/2)/(2ψ_i)⟩,    (47)

where the mean property of the PG distribution (Polson & Scott, 2011) is applied. (There is a typo in B.2 Lemma 2 and other related equations of Polson & Scott (2011), where E(ω) = (a/c) tanh(c/2) should be corrected as E(ω) = [a/(2c)] tanh(c/2).) To calculate ⟨ln(1 + e^{ψ_i})⟩ in (41) and ⟨tanh(ψ_i/2)/(2ψ_i)⟩ in (47), we use the Monte Carlo integration algorithm (Andrieu et al., 2003).

5. Example Results

5.1. Univariate Count Data Analysis

The inference of the NB dispersion parameter r by itself plays an important role not only for NB regression (Lawless, 1987; Winkelmann, 2008) but also for univariate count data analysis (Bliss & Fisher, 1953; Clark & Perry, 1989; Saha & Paul, 2005; Lloyd-Smith, 2007), and it also arises in some recently proposed latent variable models for count matrix factorization (Williamson et al., 2010; Zhou et al., 2012). Thus it is of interest to evaluate the proposed closed-form Gibbs sampling and VB inference for this parameter alone, before introducing the regression analysis part.

We consider a real dataset describing counts of red mites on apple leaves, given in Table 1 of Bliss & Fisher (1953). There were in total 172 adult female mites found on 150 randomly selected leaves, with a count of 0 on 70 leaves, 1 on 38, 2 on 17, 3 on 10, 4 on 9, 5 on 3, 6 on 2 and 7 on 1. This dataset has a mean of 1.1467 and a variance of 2.2736, and is clearly overdispersed. We assume the counts are NB distributed and we intend to infer r with a hierarchical model as

    y_i iid∼ NB(r, p),  r ∼ Gamma(a, 1/b),  p ∼ Beta(α, β),

where i = 1, ..., N and we set a = b = α = β = 0.01.
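A quick check of the summary statistics and of the method of moments estimate of r reported in this subsection (a sketch):

import numpy as np

# Mite counts from Bliss & Fisher (1953) as tabulated above.
y = np.repeat(np.arange(8), [70, 38, 17, 10, 9, 3, 2, 1])
m, v = y.mean(), y.var(ddof=1)
print(m, v)                 # 1.1467, 2.2736
print(m ** 2 / (v - m))     # method of moments estimate of r, about 1.1667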

We consider 20,000 Gibbs sampling iterations, with the first 10,000 samples discarded and every fifth sample collected afterwards. As shown in Figure 1, the autocorrelation of the Gibbs samples decreases quickly as the lag increases, and the VB lower bound converges quickly even when starting from a bad initialization (r is initialized at two times the converged value).

Figure 1. (a) The histogram and (b) autocorrelation of collected Gibbs samples of the NB dispersion parameter r. (c) The inferred probability density function of r using VB. (d) The VB lower bound. Note that the calculated lower bound shows variations after convergence, which is expected since Monte Carlo integration is used to calculate non-analytical expectation terms such as ⟨ln Γ(r + y_i)⟩.

The estimated posterior mean of r is 1.0812 with Gibbs sampling and 0.9988 with VB. Compared to the method of moments estimator (MME), MLE, and maximum quasi-likelihood estimator (MQLE) (Clark & Perry, 1989), which provide point estimates of 1.1667, 1.0246 and 0.9947, respectively (the inverse dispersion parameter φ = 1/0.9947 = 1.005 is mistakenly reported as the dispersion parameter r in Clark & Perry (1989) at Line 15, Page 314), our algorithm is able to provide a full posterior distribution of r and conveniently incorporates prior information. The calculation details of the MME, MLE and MQLE, the closed-form Gibbs sampling and VB update equations, and the VB lower bound are all provided in the supplementary material and omitted here for brevity.

5.2. Regression Analysis of Counts

We test the full LGNB model on two real examples, with comparison to the Poisson, NB, lognormal-Poisson and inverse-Gaussian-Poisson (IG-Poisson) regression models. The NASCAR dataset (http://www.stat.ufl.edu/~winner/datasets.html), analyzed in Winner, consists of 151 NASCAR races during the 1975-1979 seasons. The response variable is the number of lead changes in a race, and the covariates of a race include the number of laps, number of drivers and length of the track (in miles). The MotorIns dataset (http://www.statsci.org/data/general/motorins.txt), analyzed in Dean et al. (1989), consists of Swedish third-party motor insurance claims in 1977. Included in the data are the total numbers of claims for automobiles insured in each of 315 risk groups, defined by a combination of DISTANCE, BONUS, and MAKE factor levels. The number of insured automobile-years for each group is also given. As in Dean et al. (1989), a 19 dimensional covariate vector is constructed for each group to represent levels of the factors. To test goodness-of-fit, we use the Pearson residuals, a metric widely used in GLMs (McCullagh & Nelder, 1989), calculated as

    E = Σ_{i=1}^{N} e_i^2,  e_i = (y_i − µ̂_i) / sqrt(µ̂_i (1 + κ̂ µ̂_i)),    (48)

where µ̂ and κ̂ are the estimated mean and quasi-dispersion, respectively, whose calculations are described in detail in the supplementary material.

Table 1. The MLEs or posterior means of the lognormal variance parameter σ^2, NB dispersion parameter r, quasi-dispersion κ and regression coefficients β for the Poisson, NB and LGNB regression models on the NASCAR dataset, using the MLE, VB or Gibbs sampling for parameter estimation.

    Parameter      Poisson (MLE)   NB (MLE)   LGNB (VB)   LGNB (Gibbs)
    σ^2            N/A             N/A        0.1396      0.0289
    r              N/A             5.2484     18.5825     6.0420
    β0             -0.4903         -0.5038    -3.5271     -2.1680
    β1 (Laps)      0.0021          0.0017     0.0015      0.0013
    β2 (Drivers)   0.0516          0.0597     0.0674      0.0643
    β3 (TrkLen)    0.6104          0.5153     0.4192      0.4200

Table 2. Test of goodness of fit with Pearson residuals.

    Models (Methods)          NASCAR   MotorIns
    Poisson (MLE)             655.6    485.6
    NB (MLE)                  138.3    316.5
    IG-Poisson (MLE)          N/A      319.7
    LGNB (r ≡ 1000, Gibbs)    117.8    296.7
    LGNB (VB)                 126.1    275.5
    LGNB (Gibbs)              129.0    284.4

The MLEs for the Poisson and NB models are well known and the update equations can be found in Winner and Winkelmann (2008). The MLE results for the IG-Poisson model on the MotorIns data were reported in Dean et al. (1989). For the lognormal-Poisson model, no standard MLE algorithms are available and we choose Metropolis-Hastings (M-H) algorithms for parameter estimation. We also consider a LGNB model under the special setting that r ≡ 1000. As discussed in Section 3.1, this leads to a model which is approximately the lognormal-Poisson model, yet with closed-form Gibbs sampling inference. We use both VB and Gibbs sampling for the LGNB model. We consider 20,000 Gibbs sampling iterations, with the first 10,000 samples discarded and every fifth sample collected afterwards. As described in the supplementary material, we sample from the PG distribution with a truncation level of 2000. We initialize r as 100 and other parameters at random.
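A minimal sketch of the goodness-of-fit statistic (48); the fitted mean µ̂_i and quasi-dispersion κ̂ are assumed to come from whichever model is being evaluated:

import numpy as np

def pearson_fit(y, mu_hat, kappa_hat):
    # Eq. (48): E = sum_i e_i^2 with e_i = (y_i - mu_i) / sqrt(mu_i (1 + kappa mu_i)).
    e = (y - mu_hat) / np.sqrt(mu_hat * (1.0 + kappa_hat * mu_hat))
    return np.sum(e ** 2)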
Examining the samples in Gibbs sampling, we find that the autocorrelations of model parameters generally reduce to below 0.2 at a lag of 20, indicating fast mixing.

Shown in Table 1 are the MLEs or posterior means of key model parameters. Note that β_0 of the LGNB model differs considerably from that of the Poisson and NB models, which is expected since β_0 + σ^2/2 + ln r in the LGNB model plays about the same role as β_0 in the Poisson and NB models, as indicated in (16).

As shown in Table 2, in terms of goodness of fit measured by Pearson residuals, the Poisson model performs the worst due to its unrealistic equal-dispersion assumption; the NB model, assuming a gamma distributed multiplicative random-effect term, significantly improves the performance compared to the Poisson model; and the proposed LGNB model, modeling extra-Poisson variations with both the gamma and lognormal distributions, clearly outperforms both the Poisson and NB models. Since for the lognormal-Poisson model with the M-H algorithm we were not able to obtain comparable results even after carefully tuning the proposal distribution, we did not include it here for comparison. However, since the LGNB model reduces to the lognormal-Poisson model as r → ∞, the results of the LGNB model with r ≡ 1000 would be able to indicate whether the lognormal distribution alone is appropriate to model the extra-Poisson variations. Despite the popularity of the NB model, which models extra-Poisson variations only with the gamma distribution, the results in Table 2 suggest the benefits of incorporating the lognormal random effects. These observations also support the claim in Winkelmann (2008) that the lognormal-Poisson model should be reevaluated, since it is appealing in theory and may fit the data better. Compared to the lognormal-Poisson model, the LGNB model has an additional advantage in that its parameters can be estimated with VB inference, which is usually much faster than sampling based methods.

A clear advantage of the Bayesian inference over the MLE is that a full posterior distribution can be obtained, by utilizing the estimated posteriors of σ^2, r and β. For example, shown in Figure 2 are the estimated posterior distributions of the quasi-dispersion κ, represented with histograms. These histograms should be compared to κ = 0 in the Poisson model, and the NB model's MLEs of κ = 0.1905 and κ = 0.0118, for the NASCAR and MotorIns datasets, respectively.

Figure 2. The histograms of the quasi-dispersion κ = e^{σ^2}(1 + 1/r) − 1 based on (a) the 2000 collected Gibbs samples for NASCAR, (b) the 2000 simulated samples using the VB Q functions for NASCAR, (c) the 2000 collected Gibbs samples for MotorIns, and (d) the 2000 simulated samples using the VB Q functions for MotorIns.
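The κ histograms in Figure 2 are simple transformations of the collected posterior samples; a minimal sketch of the mapping κ = e^{σ^2}(1 + 1/r) − 1:

import numpy as np

def kappa_samples(sigma2_samples, r_samples):
    # Transform paired posterior draws of sigma^2 and r into draws of the
    # quasi-dispersion kappa defined in Section 3.1.
    return np.exp(sigma2_samples) * (1.0 + 1.0 / r_samples) - 1.0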
We analysis of correlated counts, in which the derivatives can also find that VB generally tends to overemphasize and Hessian matrixes of parameters are used to con- the regions around the of its estimated poste- struct multivariate normal proposals in a Metropolis- rior distribution and consequently places low densities Hastings algorithm (Chib et al., 1998; Chib & Winkel- on the tails, whereas Gibbs sampling is able to ex- mann, 2001; Ma et al., 2008; Winkelmann, 2008), the plore a wider region. This is intuitive since VB relies proposed LGNB model can be conveniently modified on the assumption that the posterior distribution can for multivariate count regression, in which we may be be approximated with the product of independent Q able to derive closed-form Gibbs sampling and VB in- functions, whereas Gibbs sampling only exploits con- ference. As the log Gaussian process can be used to ditional independence. model the intensity of the Poisson process, whose in- The estimated posteriors can also assist model inter- ference remains a major challenge (Møller et al., 1998; pretation. For example, based on Σ˜ β in VB for the Adams et al., 2009; Murray et al., 2010; Rao & Teh, 2011), we may link the log Gaussian process to the Lognormal and Gamma Mixed Negative Binomial Regression logit of the NB probability parameter, leading to a metrics, 1989. log Gaussian NB process with tractable closed-form Dean, C., Lawless, J. F., and Willmot, G. E. A mixed Bayesian inference. Furthermore, the NB distribu- Poisson-inverse-Gaussian regression model. Canadian tion is shown to be important for the factorization Journal of Statistics, 1989. of a term-document count matrix (Williamson et al., Devroye, L. On exact simulation algorithms for some dis- 2010; Zhou et al., 2012), and the multinomial logit has tributions related to Jacobi theta functions. Statistics & been used to model correlated topics in topic model- Probability Letters, 2009. ing (Blei & Lafferty, 2005; Paisley et al., 2011). Ap- Hilbe, J. M. Negative Binomial Regression. Cambridge plying the proposed lognormal-gamma-NB framework University Press, 2007. and the developed closed-form Bayesian inference to Lawless, J. F. Negative binomial and mixed Poisson re- these diverse problems is currently under active inves- gression. Canadian Journal of Statistics, 1987. tigation. Lloyd-Smith, J. O. Maximum likelihood estimation of the negative binomial dispersion parameter for highly overdispersed data, with applications to infectious dis- Acknowledgements eases. PLoS ONE, 2007. The research reported here has been supported in part Long, S. J. Regression Models for Categorical and Limited by DARPA under the MSEE program. Dependent Variables. SAGE, 1997. Ma, J., Kockelman, K. M., and Damien, P. A multivari- ate Poisson-lognormal regression model for prediction of References crash counts by severity, using Bayesian methods. Acci- Adams, R., Murray, I., and MacKay, D. Tractable non- dent Analysis and Prevention, 2008. parametric Bayesian inference in Poisson processes with McCullagh, P. and Nelder, J. A. Generalized linear models. Gaussian process intensities. In ICML, 2009. Chapman & Hall, 2nd edition, 1989. Agresti, A. Categorical Data Analysis. Wiley-Interscience, Møller, J., Syversveen, A. R., and Waagepetersen, R. P. 2nd edition, 2002. Log Gaussian Cox processes. Scandinavian Journal of Andrieu, C., de Freitas, N., Doucet, A., and Jordan, M. I. Statistics, 1998. An introduction to MCMC for machine learning. Ma- Murray, I., Adams, R. 
Adams, R., Murray, I., and MacKay, D. Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In ICML, 2009.

Agresti, A. Categorical Data Analysis. Wiley-Interscience, 2nd edition, 2002.

Andrieu, C., de Freitas, N., Doucet, A., and Jordan, M. I. An introduction to MCMC for machine learning. Machine Learning, 2003.

Barndorff-Nielsen, O. E., Pollard, D. G., and Shephard, N. Integer-valued Lévy processes and low latency financial econometrics. Preprint, 2010.

Beal, M. J. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, UCL, 2003.

Bishop, C. M. and Tipping, M. E. Variational relevance vector machines. In UAI, 2000.

Blei, D. and Lafferty, J. D. Correlated topic models. In NIPS, 2005.

Bliss, C. I. and Fisher, R. A. Fitting the negative binomial distribution to biological data. Biometrics, 1953.

Bradlow, E. T., Hardie, B. G. S., and Fader, P. S. Bayesian inference for the negative binomial distribution via polynomial expansions. Journal of Computational and Graphical Statistics, 2002.

Breslow, N. E. Extra-Poisson variation in log-linear models. J. Roy. Statist. Soc., C, 1984.

Cameron, A. C. and Trivedi, P. K. Regression Analysis of Count Data. Cambridge, UK, 1998.

Chib, S. and Winkelmann, R. Markov chain Monte Carlo analysis of correlated count data. Journal of Business & Economic Statistics, 2001.

Chib, S., Greenberg, E., and Winkelmann, R. Posterior simulation and Bayes factors in panel count data models. Journal of Econometrics, 1998.

Clark, S. J. and Perry, J. N. Estimation of the negative binomial parameter κ by maximum quasi-likelihood. Biometrics, 1989.

Dean, C., Lawless, J. F., and Willmot, G. E. A mixed Poisson-inverse-Gaussian regression model. Canadian Journal of Statistics, 1989.

Devroye, L. On exact simulation algorithms for some distributions related to Jacobi theta functions. Statistics & Probability Letters, 2009.

Hilbe, J. M. Negative Binomial Regression. Cambridge University Press, 2007.

Lawless, J. F. Negative binomial and mixed Poisson regression. Canadian Journal of Statistics, 1987.

Lloyd-Smith, J. O. Maximum likelihood estimation of the negative binomial dispersion parameter for highly overdispersed data, with applications to infectious diseases. PLoS ONE, 2007.

Long, S. J. Regression Models for Categorical and Limited Dependent Variables. SAGE, 1997.

Ma, J., Kockelman, K. M., and Damien, P. A multivariate Poisson-lognormal regression model for prediction of crash counts by severity, using Bayesian methods. Accident Analysis and Prevention, 2008.

McCullagh, P. and Nelder, J. A. Generalized Linear Models. Chapman & Hall, 2nd edition, 1989.

Møller, J., Syversveen, A. R., and Waagepetersen, R. P. Log Gaussian Cox processes. Scandinavian Journal of Statistics, 1998.

Murray, I., Adams, R. P., and MacKay, D. J. C. Elliptical slice sampling. In AISTATS, 2010.

Paisley, J., Wang, C., and Blei, D. M. The discrete infinite logistic normal distribution for mixed-membership modeling. In AISTATS, 2011.

Polson, N. G. and Scott, J. G. Default Bayesian analysis for multi-way tables: a data-augmentation approach. arXiv:1109.4180v1, 2011.

Quenouille, M. H. A relation between the logarithmic, Poisson, and negative binomial series. Biometrics, 1949.

Rao, V. and Teh, Y. W. Gaussian process modulated renewal processes. In NIPS, 2011.

Saha, K. and Paul, S. Bias-corrected maximum likelihood estimator of the negative binomial dispersion parameter. Biometrics, 2005.

Tipping, M. E. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 2001.

Williamson, S., Wang, C., Heller, K. A., and Blei, D. M. The IBP compound Dirichlet process and its application to focused topic modeling. In ICML, 2010.

Winkelmann, R. Econometric Analysis of Count Data. Springer, Berlin, 5th edition, 2008.

Winner, L. Case Study: Negative Binomial Regression, NASCAR Lead Changes 1975-1979. URL http://www.stat.ufl.edu/~winner/cases/nb_nascar.doc. Accessed 01/29/2012.

Zhou, M., Hannah, L., Dunson, D., and Carin, L. Beta-negative binomial process and Poisson factor analysis. In AISTATS, 2012.