Bridging the Gap: a Generalized Stochastic Process for Count Data

The American Statistician, ISSN: 0003-1305 (Print) 1537-2731 (Online). Journal homepage: http://amstat.tandfonline.com/loi/utas20

To cite this article: Li Zhu, Kimberly F. Sellers, Darcy Steeg Morris & Galit Shmueli (2017) Bridging the Gap: A Generalized Stochastic Process for Count Data, The American Statistician, 71:1, 71-80, DOI: 10.1080/00031305.2016.1234976

To link to this article: http://dx.doi.org/10.1080/00031305.2016.1234976

Accepted author version posted online: 21 Sep 2016. Published online: 21 Sep 2016.

Li Zhu (a,b), Kimberly F. Sellers (b,c), Darcy Steeg Morris (c), and Galit Shmueli (d)

(a) CAC Fund, Shanghai, China; (b) Department of Mathematics and Statistics, Georgetown University, Washington, DC; (c) Center for Statistical Research & Methodology, U.S. Census Bureau, Washington, DC; (d) Institute of Service Science, National Tsing Hua University, Taiwan

CONTACT: Kimberly F. Sellers, Center for Statistical Research & Methodology, U.S. Census Bureau, Washington, DC. © American Statistical Association

ABSTRACT

The Bernoulli and Poisson processes are two popular discrete count processes; however, both rely on strict assumptions. We instead propose a generalized homogenous count process (which we name the Conway–Maxwell–Poisson or COM-Poisson process) that not only includes the Bernoulli and Poisson processes as special cases, but also serves as a flexible mechanism to describe count processes that approximate data with over- or under-dispersion. We introduce the process and an associated generalized waiting time distribution with several real-data applications to illustrate its flexibility for a variety of data structures. We consider model estimation under different scenarios of data availability, and assess performance through simulated and real datasets. This new generalized process will enable analysts to better model count processes where data dispersion exists in a more accommodating and flexible manner.

ARTICLE HISTORY: Received August; Revised March

KEYWORDS: Bernoulli process; Conway-Maxwell-Poisson (COM-Poisson) distribution; Count process; Dispersion; Poisson process; Waiting time

1. Introduction

Throughout history, stochastic processes have been developed to model data that arise in different disciplines, with the most notable being transportation, marketing, and finance. With the rapid development of modern technology, count data have become popular in further areas. As a result, stochastic process models can play a more significant role in today's data analysis toolkit.

The simplest discrete stochastic process is the Bernoulli process, whose associated waiting times are geometric. Meanwhile, the Poisson process is the most popular and used stochastic process for count data. Often considered as the continuous-time counterpart of the Bernoulli process, its most distinguishing property is its underlying assumption of equi-dispersion (i.e., the average number of arrivals equals the variance) in a fixed time period. This assumption, however, is constraining and problematic because many real-world applications contain count data that fail to satisfy the equi-dispersion property. Barndorff-Nielsen and Yeo (1969) and Diggle and Milne (1983) considered negative binomial point processes or, more broadly, "any flexible class of distributions with a variance-to-mean ratio greater than unity" (Diggle and Milne 1983, p. 257). While such models can account for data over-dispersion (i.e., where the variance is larger than the mean), they cannot effectively model data under-dispersion (i.e., where the variance is smaller than the mean).

In this article, we use the Conway–Maxwell–Poisson (COM-Poisson) distribution to derive what we call a COM-Poisson process. The significance of the COM-Poisson distribution and, hence, the corresponding process, lies in its ability to represent a family of processes encompassing count data (including the Poisson, geometric, and Bernoulli processes), and further models a sequence of arrivals whose data display over- or under-dispersion. With the COM-Poisson process, we develop a generalized waiting time distribution that encompasses waiting time distributions associated with the Bernoulli and Poisson processes, respectively, and models the distribution of waiting times for over- or under-dispersed data. Our work develops the COM-Poisson process not only to bridge the gap between two classical count processes, but also to introduce a process that can address a wide range of data dispersion.

The remainder of the article is organized as follows. Section 2 provides background and motivation, briefly reviewing the Bernoulli and Poisson counting processes, along with their respective frameworks and associated properties. Section 3 formally introduces the COM-Poisson and sum-of-COM-Poisson (sCOM-Poisson) distributions, and uses them to develop the COM-Poisson process and study its properties; included is the derivation of the associated generalized waiting time distribution. Section 4 discusses parameter estimation and associated uncertainty quantification. Section 5 considers estimation robustness under different simulated data scenarios. Further, this section illustrates the flexibility of the COM-Poisson process when applied to real-world data, comparing this process approach with Bernoulli, Poisson, and other count processes addressing data dispersion. Finally, Section 6 provides a discussion and future directions.

2. Classical Counting Processes

Çinlar (1975) defined a Bernoulli process with success probability $p$ as a discrete-time stochastic process of the form $\{X_n;\ n = 1, 2, \ldots\}$ where, for all $n$, $X_1, \ldots, X_n$ are independent, and $X_n$ takes the values $\{0, 1\}$ with probabilities $P(X_n = 1) = p$ and $P(X_n = 0) = q = 1 - p$. The number of successes that have occurred through the $n$th trial, $N_n = X_1 + \cdots + X_n$, follows a Binomial$(n, p)$ distribution; $N_0 = 0$. For any $m, n \in \mathbb{N}$, the distribution of $N_{m+n} - N_m$ also follows a Binomial$(n, p)$ distribution, independent of $m$, that is, the process has independent increments. In terms of waiting times, $T_k$ is defined as the number of trials it takes to get the $k$th success. The time between successes, $T_{k+1} - T_k$ for any $k \in \mathbb{N}$, follows a geometric distribution, that is, $P(T_{k+1} - T_k = m) = p q^{m-1}$, $m = 1, 2, \ldots$. Further, for any $m$ and $n$, the interval $T_{m+n} - T_m$ is independent of $m$ (Çinlar 1975).

Meanwhile, a Poisson process is a popular, continuous-time, stochastic counting process used to model events such as customer arrivals, electron emissions, neuron spike activity, etc. (Kannan 1979). Let $N_t$ denote the number of events that have occurred up to time $t \ge 0$. By definition (Durrett 2004),

• $N_0 = 0$
• $N_{s+t} - N_s \sim \text{Poisson}(\lambda t)$ [in particular, $N_{s+1} - N_s \sim \text{Poisson}(\lambda)$]
• $N_t$ has independent increments, that is, for $t_0 < t_1 < \cdots < t_n$, the variables $N_{t_1} - N_{t_0}, \ldots, N_{t_n} - N_{t_{n-1}}$ are independent.
• the following relations hold:

$$P(N_{s+t} - N_s = 1) = \lambda t + o(t),$$
$$P(N_{s+t} - N_s \ge 2) = o(t),$$

where a function $f(t)$ is said to be of order $o(t)$ if $\lim_{t \to 0} f(t)/t = 0$ (Kannan 1979).

Thus, a homogenous Poisson process has a rate/intensity parameter $\lambda$ such that the number of events to occur in the interval $(t, t + \tau]$ is Poisson$(\lambda\tau)$, that is,

$$P(N_{t+\tau} - N_t = i) = \frac{e^{-\lambda\tau} (\lambda\tau)^i}{i!}, \quad i = 0, 1, 2, \ldots,$$

and the associated waiting time between events, $T_k$, is exponentially distributed with parameter $\lambda$.

3. The Conway–Maxwell–Poisson (COM-Poisson) Process

The Conway–Maxwell–Poisson (COM-Poisson) distribution is a two-parameter generalization of the Poisson distribution introduced by Conway and Maxwell (1962) and whose statistical [...] Section 3.1 describes the COM-Poisson and sum of COM-Poisson (sCOM-Poisson) distributions, and highlights some of their statistical properties. Section 3.2 uses these distributions to derive a homogenous COM-Poisson process. Finally, Section 3.3 introduces a generalized waiting time distribution associated with the COM-Poisson process.

3.1. The COM-Poisson and sCOM-Poisson Distributions

The COM-Poisson probability mass function (pmf) takes the form

$$P(X = x) = \frac{\lambda^x}{(x!)^\nu Z(\lambda, \nu)}, \quad x = 0, 1, 2, \ldots$$

for a random variable $X$, where $\nu \ge 0$ is a dispersion parameter such that $\nu = 1$ denotes equi-dispersion, while $\nu > (<)\ 1$ signifies under-dispersion (over-dispersion); $Z(\lambda, \nu) = \sum_{j=0}^{\infty} \frac{\lambda^j}{(j!)^\nu}$ is a normalizing constant (Shmueli et al. 2005) and $\lambda = E(X^\nu) > 0$. The COM-Poisson distribution includes three well-known distributions as special cases: the Poisson ($\nu = 1$), geometric ($\nu = 0$ and $\lambda < 1$), and Bernoulli ($\nu \to \infty$) distributions. The expected value and variance of the COM-Poisson distribution can be presented as derivatives with respect to $\ln(\lambda)$ (Sellers, Shmueli, and Borle 2011):

$$E(X) = \frac{\partial \ln Z(\lambda, \nu)}{\partial \ln(\lambda)} \quad \text{and} \quad \text{var}(X) = \frac{\partial^2 \ln Z(\lambda, \nu)}{\partial (\ln \lambda)^2}; \qquad (1)$$

more generally, the probability and moment generating functions are $G_X(t) = E(t^X) = \frac{Z(\lambda t, \nu)}{Z(\lambda, \nu)}$ and $M_X(t) = \frac{Z(\lambda e^t, \nu)}{Z(\lambda, \nu)}$, respectively. As a weighted Poisson distribution whose weight function is $w(x) = (x!)^{1-\nu}$ (Kokonendji, Miz, and Balakrishnan 2008), the COM-Poisson distribution belongs to both the exponential family and the two-parameter power series (Shmueli et al. 2005; Sellers, Shmueli, and Borle 2011).

The sum of $n$ iid COM-Poisson variables leads to the sCOM-Poisson$(\lambda, \nu, n)$ distribution, which has the following pmf for a random variable $Y = \sum_{i=1}^{n} X_i$, where $X_i \overset{iid}{\sim}$ COM-Poisson$(\lambda, \nu)$:

$$P(Y = y) = \frac{\lambda^y}{(y!)^\nu Z^n(\lambda, \nu)} \ ..$$
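
The COM-Poisson pmf, its three special cases, and the sum of iid COM-Poisson variables described above can be explored numerically. The following Python sketch is only an illustration, not the authors' implementation: the function names (`com_poisson_pmf`, `mean_var`, `scom_poisson_pmf`) are hypothetical, the infinite series for $Z(\lambda, \nu)$ is truncated at a fixed number of terms (an assumption that is adequate only when the tail is negligible), and the sCOM-Poisson pmf is obtained by repeated convolution of the COM-Poisson pmf rather than from a closed-form expression.

```python
import math


def com_poisson_pmf(lam, nu, terms=200):
    """Approximate COM-Poisson pmf on x = 0..terms-1.

    Assumption: the series for Z(lam, nu) is truncated at `terms`,
    so for nu = 0 this is only sensible when lam < 1.
    """
    # Work on the log scale: log(lam^x / (x!)^nu) = x*log(lam) - nu*log(x!).
    logw = [x * math.log(lam) - nu * math.lgamma(x + 1) for x in range(terms)]
    shift = max(logw)  # subtract the maximum to avoid overflow
    w = [math.exp(v - shift) for v in logw]
    total = sum(w)  # proportional to the truncated Z(lam, nu)
    return [v / total for v in w]


def mean_var(pmf):
    """Mean and variance of a pmf supported on 0, 1, 2, ..."""
    mean = sum(x * p for x, p in enumerate(pmf))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))
    return mean, var


def convolve(p, q):
    """pmf of the sum of two independent count variables."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def scom_poisson_pmf(lam, nu, n, terms=60):
    """pmf of the sum of n iid COM-Poisson(lam, nu) variables,
    computed by n-fold convolution (illustrative, not closed form)."""
    base = com_poisson_pmf(lam, nu, terms)
    out = [1.0]  # point mass at 0 is the identity for convolution
    for _ in range(n):
        out = convolve(out, base)
    return out


if __name__ == "__main__":
    for nu in (0.5, 1.0, 2.0):
        m, v = mean_var(com_poisson_pmf(2.0, nu))
        print(f"nu={nu}: mean={m:.4f}, variance={v:.4f}")
```

Setting `nu=1.0` should reproduce a Poisson pmf, `nu=0.0` with `lam < 1` a geometric pmf, and a large `nu` a pmf concentrated on 0 and 1, matching the special cases listed in Section 3.1.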
