Analyzing Spatial Longitudinal Incidence Patterns Using Dynamic Multivariate Poisson Models
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University
By
Yihan Sui, B.S., M.S.
Graduate Program in Biostatistics
The Ohio State University
2018
Dissertation Committee:
Dr. Grzegorz A. Rempala, Advisor
Dr. Chi (Chuck) Song, Advisor
Dr. Oksana Chkrebtii
© Copyright by
Yihan Sui
2018
Abstract
Methods of multivariate analysis for continuous data have been applied extensively in epidemiology, economics, engineering, and other fields. However, multivariate models for count data have not been applied to a similar extent, and relatively little work has been published on such models, especially when accounting for both spatial and temporal dependence. In the first part of the thesis, we propose a hierarchical multivariate Poisson (MVP) model that simultaneously models spatial and temporal correlation of observed data counts in a fairly general setting. In particular, MVP allows for modeling the spatially/temporally dependent counts as a function of location-specific and time-varying covariates. To characterize temporal trends, we propose a broken-line regression model within MVP and apply it to joinpoint detection. Bayesian inference is conducted using Markov chain Monte Carlo (MCMC) methods to approximate required posterior summaries. We use a backward selection algorithm coupled with the shaver method to obtain joinpoint coefficients via Bayesian Lasso. We apply the proposed model to spatio-temporal pertussis incidence data from 2000 to 2015 from several midwestern states in the U.S.
To evaluate the appropriateness of the model, in the second part of this thesis we develop a goodness-of-fit (GOF) statistic for fitting discrete generalized linear models (GLMs) based on the sum of standardized residuals (SSRs). This work is an extension of our earlier work (Chen L. et al. [1]), which proposed a GOF test for binary responses. We derive the asymptotic distribution of the test statistic and show how it can be applied to popular count regression models such as Poisson regression, Negative Binomial regression and Binomial regression using different link functions. Using numerical examples we show that the proposed test is substantially more powerful than most of the currently available GOF tests under various model misspecification scenarios applied to various discrete GLMs such as logistic, Poisson, and negative binomial.
Acknowledgments
First, I would like to thank my wonderful parents, Daifu and Guiping, for their love, trust and encouragement along the way. I would not be where I am today were it not for their love and guidance.
I am grateful to my co-advisors, Professor Grzegorz A. Rempala and Professor Chi Song, both of whom have devoted a substantial amount of time and effort to training me to be a better researcher over the past few years. Without their guidance and persistent help this work would not have been possible. Professor Grzegorz A. Rempala is the reason I first became interested in multivariate count modeling, and since then I have thoroughly enjoyed working in this area with him. He has provided me with numerous opportunities to learn and grow through research, teaching, mentoring, and collaboration. I also would like to express my appreciation to Professor Chi Song. He is a patient mentor and one of the smartest people I know. Some of my fondest memories of graduate school are of the hours of research discussions with him in his office. He has also become a good friend, whom I admire professionally and personally.
I thank the other member of my dissertation committee, Professor Oksana Chkrebtii, and the member of my candidacy committee, Professor Kellie J. Archer, for their time and effort in evaluating my work and providing me with feedback.
Finally, I would also like to thank all the professors and staff in the program of biostatistics who have provided guidance and support during my time at The Ohio State University.
Vita
June 5, 1990 ...... Born - Jilin, China
2013 ...... B.S. Biological Science
2015 ...... M.S. Statistics
2013-present ...... Graduate Associate, The Ohio State University.
Publications
Research Publications
Chen Lu, Sui Yihan, Song Chi, Grzegorz A. Rempala (2017). The Sum of Standardized Residuals: Goodness of Fit Test for Binary Response Model. Statistics in Medicine, doi: 10.1002/sim.7644.
Fields of Study
Major Field: Biostatistics
Table of Contents
Abstract
Acknowledgments
Vita
List of Tables
List of Figures
1. Introduction
  1.1 Mathematical Modeling of Epidemics
    1.1.1 Multivariate Models
    1.1.2 Multivariate Poisson (MVP) Distribution
  1.2 Generalized Linear Models (GLMs)
    1.2.1 Estimating Equations
    1.2.2 Goodness of Fit (GOF) Test for GLM
  1.3 Thesis Outline
2. Dynamic Multivariate Poisson (DMVP) Model
  2.1 Modeling Framework
    2.1.1 Non-Dynamic Multivariate Poisson Model
    2.1.2 Dynamic Multivariate Poisson (DMVP) Model
  2.2 Multivariate Poisson Joinpoint Model
  2.3 Bayesian Estimation
    2.3.1 The Posterior Distribution for AR(1) Model
    2.3.2 The Posterior Distribution for ARIMA(1,1,0) Model
    2.3.3 The Estimation of Joinpoint Parameters b
    2.3.4 The update of Y and ξ
  2.4 Simulation Studies
  2.5 Analysis of Pertussis Data
    2.5.1 Background
    2.5.2 Modern Resurgence of Pertussis
    2.5.3 Modeling Pertussis Trends
    2.5.4 Analysis Details
    2.5.5 Conclusion
3. The Goodness of Fit (GOF) in Generalized Linear Model (GLM)
  3.1 Summary of Previous Research on GOF
    3.1.1 GOF Tests for Binary Outcomes
    3.1.2 GOF Tests for Discrete Outcomes
  3.2 Proposed Test Statistic - Standardized Residuals
  3.3 Numerical Examples
    3.3.1 Examples for Binary Variables
    3.3.2 Poisson Regression Example 1
    3.3.3 Poisson Regression Example 2
    3.3.4 Negative Binomial Example 1
    3.3.5 Negative Binomial Example 2
  3.4 Model Selection Example
    3.4.1 Model Selection Example for Binary Variable
    3.4.2 Model Discrimination Example for Count Variable
  3.5 Discussion
4. Future Work
Bibliography
Appendices
A. The Sum of Standardized Residuals: Goodness of Fit Test for Binary Response Model
List of Tables
2.1 Simulation results for parameter β and γ in P = 1, Q = 1
2.2 Simulation results for parameter β and γ in P = 2, Q = 1
2.3 History of pertussis vaccines
2.4 History of acellular pertussis vaccines
2.5 Pertussis Vaccination Coverage (%)
3.1 Selected variables description in the Boston Marathon data
3.2 Proportion $H_0$ rejected at the α = 0.05 using sample size of 445 with 10000 replications in quadratic models using the Boston Marathon data
3.3 Selected variables description in Cleveland data
3.4 Proportion $H_0$ rejected at the α = 0.05 using sample size of 303 with 10000 replications in interaction models using the Cleveland data
3.5 Selected variables description in NHANES data
3.6 Proportion $H_0$ rejected at the α = 0.05 using sample size of 373 with 10000 replications in interaction models using the NHANES data
3.7 Selected variables description in the Affairs data (n=601)
3.8 Proportion $H_0$ rejected at the α = 0.05 using sample size of 601 with 10000 replications using the Affairs data
3.9 Selected variables description in the German Credit data (n=1,000)
3.10 Proportion $H_0$ rejected at the α = 0.05 using sample size of 1000 with 10000 replications using the German Credit data
3.11 Selected variables description in the Bad Health data
3.12 Proportion $H_0$ rejected at the α = 0.05 using sample size of 1127 with 10000 replications using the Bad Health data
3.13 Selected variables description in the Fishing data
3.14 Proportion $H_0$ rejected at the α = 0.05 using sample size of 147 with 10000 replications in sqrt link model using the Fishing data
3.15 Selected variables description in HINTS data
3.16 The results of $C_n$ and its p-values under each model using the HINTS data
3.17 Pertussis data
3.18 The results of $C_n$ and its p-values under Poisson and negative binomial model using Pertussis data
List of Figures
1.1 An overview of mathematical models for infectious diseases generated by Siettos and Russo [2]
2.1 Group sampling illustration
2.2 Pertussis Incidence by Age Group
2.3 Pertussis Incidence Cases
2.4 Pertussis Rate (per 100,000) 2008
2.5 Pertussis Rate (per 100,000) 2009
2.6 Pertussis Rate (per 100,000) 2010
2.7 Pertussis Rate (per 100,000) 2011
2.8 Pertussis Rate (per 100,000) 2012
2.9 Pertussis Rate (per 100,000) 2013
2.10 Pertussis Rate (per 100,000) 2014
2.11 Pertussis Rate (per 100,000) 2015
2.12 Pertussis Rate (per 100,000) 2016
2.13 9 States
2.14 DMVP and Simplified Poisson Model Fitting
3.1 Proportion $H_0$ rejected at the α = 0.05 using sample size of 601 with 10000 replications in interaction models using the Affair data
3.2 Proportion $H_0$ rejected at the α = 0.05 over increase sample size with 10000 replications using the German Credit data
3.3 Proportion $H_0$ rejected at the α = 0.05 in quadratic models with sample size of 1127 with 10000 replications in the Bad Health data
Chapter 1: Introduction
1.1 Mathematical Modeling of Epidemics
The use of mathematical modeling in the study of epidemics dates back to 1760, when Daniel Bernoulli studied the outbreak of smallpox in England [3, 4]. Recently, dynamical systems modeling methods have become popular in epidemiological research, mostly due to their ability to accommodate social, economic, and demographic factors [2]. In particular, modeling with multivariate count data has become increasingly prevalent in big data applications in many different domains. Such data are usually dependent, with positive or negative correlations present. For instance, in the public health domain, the Centers for Disease Control and Prevention (CDC) provides a database of nationwide disease outbreaks by year, region, and disease type [5]. Epidemiological trends such as disease incidence are monitored and displayed on the CDC website with graphics and maps, and are also provided in text-file formats. For example, data on reported historic pertussis cases are available on the CDC website [6]. These disease data have been collected over the past 20 years from all the states through their respective health departments via the National Notifiable Diseases Surveillance System (NNDSS). For epidemiologists, NNDSS data are invaluable for understanding the trends and the spread of various communicable diseases, and for examining the underlying social, economic and demographic factors that are crucial for developing better decision-making policies and for better preparing for pandemic outbreaks in the US [7].
Spatial variation in disease incidence should be considered in modeling because of ecological processes: for example, pathogen dispersal might be clustered, or infected hosts might be less likely to travel long distances [8]. Moreover, the spatial distributions of infectious diseases often change through time [8]. In order to study disease incidence and its changes over time, we need to establish convenient modeling methods that can take into account the spatial and temporal relationships among the incidence rates. Siettos and Russo [2] suggested that the mathematical approaches used in epidemiology can be divided into three categories (presented in Fig. 1.1): 1) statistical approaches such as regression within the context of spatial patterns to model disease epidemics; 2) mathematical models of dynamical systems (mechanistic state-space models) such as the susceptible, infected, and recovered (SIR) model; 3) machine learning techniques using data from internet platforms such as Google Flu. In this thesis we focus on the first category. We approach the problem at the population level by adopting a static multivariate Poisson model, expanding it into a hierarchical model, and extending the latter to analyze spatio-temporal data.
Figure 1.1: An overview of mathematical models for infectious diseases generated by Siettos and Russo [2]
1.1.1 Multivariate Models
The multivariate normal (MVN) distribution has a long history in the analysis of continuous data and is a well-established tool for modeling spatial dependencies because of its convenient statistical properties [9]. Let $X = (X_1, \ldots, X_n)^T$ be a vector of random variables. The mean vector and the variance-covariance matrix are denoted by $\mu = (\mu_1, \ldots, \mu_n)^T$ ($\mu \in \mathbb{R}^n$) and $\Sigma_{n \times n}$ ($\Sigma_{n \times n}$ is a positive semi-definite matrix, and positive definite whenever the density below is used), respectively. The desirable properties of the MVN distribution are: 1) an explicit joint probability density function exists,
\[ f_X(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} \det(\Sigma)^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}, \]
where $x$ is a realization of the random vector $X$ ($x \in \mathbb{R}^n$); and 2) the covariance matrix $\Sigma$ accurately specifies the correlation structure. In contrast, these properties are not immediately available for any multivariate count distribution. There has been relatively little work on implementing multivariate count models to deal with discrete spatial data, as discussed in Section 1.1.2. Here we aim to construct a multi-dimensional discrete distribution with Poisson margins that allows for a simple description of the dependencies among variables.
1.1.2 Multivariate Poisson (MVP) Distribution
One of the most commonly accepted definitions of the MVP distribution is given by Mahamunulu et al. [10]: an m-variate vector $X = (X_1, \ldots, X_m)^T$ is said to follow the MVP distribution through a vector of hidden variables $Y = (Y_1, \ldots, Y_k)^T$ such that $X = AY$. This definition is based on a projection $\mathbb{N}^k \to \mathbb{N}^m$, $k \ge m$. Here, the hidden variables $Y_1, \ldots, Y_k$ follow independent Poisson distributions, and $A$ is an $m \times k$ matrix with entries 0 and 1 that determines the correlation structure of $X$. This is the distribution on which our dynamic MVP model is based. Our dynamic MVP borrows the hidden variable decomposition scheme to help construct the variance-covariance structure. A detailed description of the model is provided in Chapter 2.
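The construction $X = AY$ can be illustrated with a small simulation. The sketch below is not taken from the thesis: the matrix `A`, the rates in `theta`, the helper names and the sample size are all illustrative assumptions. It draws independent Poisson hidden variables, maps them through a 0/1 matrix, and checks empirically that each margin has mean roughly equal to variance (as a Poisson margin should) while the two margins are positively correlated through the shared hidden variable.

```python
import math
import random

def poisson_draw(rng, rate):
    """Draw one Poisson(rate) variate with Knuth's multiplication method (fine for small rates)."""
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def simulate_mvp(A, theta, n_draws, seed=0):
    """Return n_draws realizations of X = A Y with independent Y_j ~ Poisson(theta[j])."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        y = [poisson_draw(rng, rate) for rate in theta]
        draws.append([sum(a * v for a, v in zip(row, y)) for row in A])
    return draws

def mean(values):
    return sum(values) / len(values)

def covariance(u, v):
    """Unbiased sample covariance of two equally long sequences."""
    mu_u, mu_v = mean(u), mean(v)
    return sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v)) / (len(u) - 1)

# Hypothetical bivariate example: X1 = Y1 + Y3 and X2 = Y2 + Y3 share the hidden Y3.
A = [[1, 0, 1],
     [0, 1, 1]]
theta = [2.0, 3.0, 1.0]

sample = simulate_mvp(A, theta, n_draws=20000)
x1 = [row[0] for row in sample]
x2 = [row[1] for row in sample]

print("mean, variance of X1:", mean(x1), covariance(x1, x1))  # theory: 3 and 3
print("mean, variance of X2:", mean(x2), covariance(x2, x2))  # theory: 4 and 4
print("cov(X1, X2):", covariance(x1, x2))                     # theory: 1
```

In this toy layout the covariance between the two margins is carried entirely by the rate of the shared hidden variable, which is the mechanism the matrix $A$ encodes.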
The formulation of the multivariate Poisson distribution dates back to 1925, when differential equations were used to derive a bivariate Poisson distribution [11, 12]. A more understandable but equivalent way to derive the bivariate Poisson distribution is to use sums of independent Poisson variables [12]. The derivation of a special case of the multivariate Poisson model was presented more recently by Tsionas et al. [13, 14] and Karlis [15] using a shared component construction. The limitation of this model is that it assumes the correlation between all pairs of variables to be the same by using a single common covariance term. In 2005, this model was extended by Karlis and Meligkotsidou [16] to allow for a more flexible covariance structure by permitting different correlations between pairs of variables. This extension of the covariance structure expanded the model's application into a number of new fields [16]. For example, in epidemiology it is of great interest to study the covariance structure of disease incidence in a given area from a geographical perspective. Although the model of Karlis and Meligkotsidou [16] can capture the spatial correlation, it is not a dynamic model and fails to account for temporal trends. Research on extending the MVP to model longitudinal data is relatively sparse. Pedeli et al. [17] introduced a bivariate integer-valued autoregressive process of order 1 (BINAR(1)) and applied it to bivariate temporal data concerning daytime and nighttime road accidents in the Netherlands. They focused on bivariate Poisson and bivariate negative binomial distributions. Aktekin et al. [18] developed an MVP model they referred to as multivariate Poisson-scaled beta (MPSB) that allows for temporal dependency in the counts as well as dependency across multiple series. They then applied MPSB to web traffic analysis using weekly consumer demand data.
In the following chapters, we construct a dynamic MVP that can be applied to longitudinal (time series) data, based on the MVP derived by Karlis and Meligkotsidou [16]. We then allow for the detection of temporal trends by adding a piecewise regression component. Explanatory variables can be included in the model. Inference is performed via a Metropolis-Hastings within Gibbs sampler. We adopt the adaptive rejection sampling (ARS) scheme to improve the efficiency of the algorithm.
1.2 Generalized Linear Models (GLMs)
Generalized linear models (GLMs) are a popular class of regression models that generalize linear regression to accommodate a variety of outcome variable types such as discrete, binary, and categorical [19]. In what follows, we borrow the notation from McCullagh and Nelder [20]. The sample $Y = (Y_1, \ldots, Y_n)^T$ is assumed to consist of independent observations whose distributions belong to the natural exponential family (NEF), such that each $Y_i$ has the p.d.f.
\[ f_{Y_i}(y_i; \theta_i, \phi) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{\alpha(\phi)} \right\} c(y_i; \phi), \quad i = 1, \ldots, n, \tag{1.1} \]
where $n$ is the number of samples, $\alpha(\cdot)$, $b(\cdot)$ and $c(\cdot)$ are known functions, $\theta_i$ is the canonical (natural) parameter, and $\alpha(\phi)$ is a known dispersion function of the dispersion parameter $\phi$. The mean and variance of $Y_i$ are given by
\[ E(Y_i) = \mu_i = b'(\theta_i), \qquad \mathrm{Var}(Y_i) = V(\mu_i) = b''(\theta_i)\,\alpha(\phi), \quad i = 1, \ldots, n. \]
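As a concrete instance of (1.1), the Poisson distribution with mean $\mu$ has $\theta = \log \mu$, $b(\theta) = e^{\theta}$, $\alpha(\phi) = 1$ and $c(y; \phi) = 1/y!$. The sketch below checks numerically that this form reproduces the usual Poisson probability mass function and that $b'(\theta)$ and $b''(\theta)$ both equal $\mu$. The function names and the value $\mu = 2.5$ are illustrative choices, not part of the thesis.

```python
import math

def b(theta):
    """Cumulant function of the Poisson family: b(theta) = exp(theta)."""
    return math.exp(theta)

def nef_poisson_pmf(y, theta):
    """Density (1.1) with alpha(phi) = 1 and c(y; phi) = 1 / y!."""
    return math.exp(y * theta - b(theta)) / math.factorial(y)

def poisson_pmf(y, mu):
    """Textbook Poisson probability mass function."""
    return mu ** y * math.exp(-mu) / math.factorial(y)

def derivative(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

mu = 2.5               # illustrative mean
theta = math.log(mu)   # canonical parameter for that mean

# Largest pointwise gap between the NEF form and the textbook pmf on y = 0..19.
max_gap = max(abs(nef_poisson_pmf(y, theta) - poisson_pmf(y, mu)) for y in range(20))
mean_from_b = derivative(b, theta)             # approximates b'(theta) = mu
variance_from_b = second_derivative(b, theta)  # approximates b''(theta) * alpha(phi) = mu
print(max_gap, mean_from_b, variance_from_b)
```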
There are three components in a GLM:
1. Random Component - refers to the variation of the response variable $Y$ arising from its probability distribution; e.g., in Poisson regression models, $Y$ has a Poisson distribution.
2. Systematic Component - refers to the set of covariates $Z_i = (Z_{i1}, Z_{i2}, \ldots, Z_{iP})^T$; e.g., $\theta_i = \beta^T Z_i$ in Poisson regression.
3. Link Function - refers to the function $g$ linking the mean of the random component to the systematic component, $g(\mu_i) = \beta^T Z_i$. Under the canonical link the natural parameter is this function of the mean, $\theta_i = g(\mu_i)$; that is, the relation between $\theta_i$ and $\beta^T Z_i$ is the identity.
1.2.1 Estimating Equations
Let $Z_i$ denote a p-vector of covariates for the $i$th individual. From item 3 in the previous section, the mean function is assumed to relate to $Z_i^T \beta$ through the link function
\[ g(\mu(\theta_i)) = Z_i^T \beta, \quad i = 1, \ldots, n, \]
where $g$ is the link function and $\beta$ is a p-vector of parameters. As discussed above, if $\mu = g^{-1}$ and $\theta_i = Z_i^T \beta$, then $g$ is the canonical link. Now define $\psi = (g \circ \mu)^{-1}$, so that $\theta_i = \psi(Z_i^T \beta)$; then the p.d.f. can be written as
\[ f_{Y_i}(y_i; \beta, \phi) = \exp\left\{ \frac{\psi(Z_i^T \beta)\, y_i - b(\psi(Z_i^T \beta))}{\alpha(\phi)} \right\} c(y_i; \phi), \tag{1.2} \]
and thus the log-likelihood function is
\[ \log \ell(\beta) = \sum_{i=1}^{n} \left[ \log c(y_i; \phi) + \frac{\psi(Z_i^T \beta)\, y_i - b(\psi(Z_i^T \beta))}{\alpha(\phi)} \right]. \]
The MLE of $\beta$ can be obtained by solving the likelihood score equations
\[ \frac{\partial \log \ell(\beta)}{\partial \beta} = \frac{1}{\alpha(\phi)} \sum_{i=1}^{n} \left\{ y_i - \mu(\psi(Z_i^T \beta)) \right\} \psi'(Z_i^T \beta)\, Z_i = 0. \]
In this work, we study primarily the case when $\psi'(Z_i^T \beta)$ is not proportional to $1/\sqrt{V(\mu_i)}$. This is the case for most discrete GLMs.
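For Poisson regression with the canonical log link, $\psi$ is the identity, $\mu(\theta) = e^{\theta}$ and $\alpha(\phi) = 1$, so the score equations reduce to $\sum_i \{y_i - \exp(Z_i^T \beta)\} Z_i = 0$. A minimal sketch of solving them by Newton-Raphson follows. The toy data, the function names, and the restriction to an intercept plus one covariate (so the 2×2 Newton system can be solved in closed form) are illustrative assumptions; this is not the estimation procedure used in the thesis.

```python
import math

def score_and_information(beta, z, y):
    """Score vector and Fisher information for Poisson regression with log link.

    The linear predictor is beta[0] + beta[1] * z_i.
    """
    s0 = s1 = i00 = i01 = i11 = 0.0
    for zi, yi in zip(z, y):
        mu = math.exp(beta[0] + beta[1] * zi)
        resid = yi - mu
        s0 += resid            # score component for the intercept
        s1 += resid * zi       # score component for the slope
        i00 += mu
        i01 += mu * zi
        i11 += mu * zi * zi
    return (s0, s1), (i00, i01, i11)

def fit_poisson(z, y, tol=1e-10, max_iter=50):
    """Solve the score equations by Newton-Raphson, starting from the intercept-only fit."""
    beta = [math.log(sum(y) / len(y)), 0.0]
    for _ in range(max_iter):
        (s0, s1), (i00, i01, i11) = score_and_information(beta, z, y)
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2x2 system (information) * step = score explicitly.
        step0 = (i11 * s0 - i01 * s1) / det
        step1 = (i00 * s1 - i01 * s0) / det
        beta = [beta[0] + step0, beta[1] + step1]
        if max(abs(step0), abs(step1)) < tol:
            break
    return beta

# Made-up toy data: one covariate z and observed counts y.
z = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [1, 1, 2, 4, 5, 9, 14]

beta_hat = fit_poisson(z, y)
final_score, _ = score_and_information(beta_hat, z, y)
print("beta_hat:", beta_hat)
print("score at beta_hat:", final_score)  # both components should be close to zero
```

With a non-canonical link the factor $\psi'(Z_i^T \beta)$ no longer equals one and would have to be carried through both the score and the information.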
1.2.2 Goodness of Fit (GOF) Test for GLM
The issue of measuring the discrepancy between model-predicted and observed values arises once a model is estimated from the data. The concept of model goodness of fit (GOF) helps quantify this discrepancy by considering the (signed) distance between observations and their predicted mean values under the specific link function and the set of covariates (e.g., linear form, quadratic form and interactions). Measuring GOF is crucial for evaluating whether the model explains the data sufficiently well. Motivated by our longitudinal count data problem, we consider here a new GOF measure that appears to be well suited for discrete GLMs.
1.3 Thesis Outline
The thesis is organized into 3 chapters. Beyond this introductory Chapter 1, Chapter 2 introduces the modeling framework of the dynamic multivariate Poisson (DMVP) model, starting from a static MVP distribution and its extension to dynamic data over time. Section 2.1.1 introduces different decomposition schemes accounting for spatial dependency of multivariate count variables. Section 2.1.2 discusses the construction of the DMVP model for a given time series. We then consider a joinpoint model in Section 2.2 to identify disease trends and to detect changes. A shaver method is employed in Section 2.2 for parameter estimation. We further discuss the details of parameter estimation from the Bayesian perspective in Section 2.3. Simulation studies are performed to evaluate the model performance in Section 2.4. As an example, a real application using a pertussis dataset is shown in Section 2.5 to demonstrate the advantages and disadvantages of the DMVP model over a simple Poisson regression model. In Chapter 3 we propose a GOF statistic $C_n$ for discrete GLMs based on the sum of standardized residuals (SSRs). Section 3.1 reviews currently used GOF tests that are built into statistical software for exponential family models. The derivation of the SSR test statistic $C_n$ and its asymptotic properties are discussed in Section 3.2. Next, via numerical examples we demonstrate that the GOF test based on $C_n$ outperforms many of the more complicated methods currently in use. In Section 3.3 we also adopt real data examples in simulation studies to investigate the power of the proposed test and to compare it with existing methods in logistic, Poisson, and negative binomial regressions. Finally, a simple illustration of the use of the $C_n$ statistic in the model selection problem is presented in Section 3.4. We discuss limitations of the test statistic in Section 3.5, where we also outline some possible future research directions.
Chapter 2: Dynamic Multivariate Poisson (DMVP) Model
2.1 Modeling Framework
In this section, we first illustrate the definition of the MVP distribution introduced by Karlis and Meligkotsidou [16] with examples in simple settings. Then we extend the MVP by introducing a temporal component to capture trends in dynamic data.
The MVP distribution is derived from a multivariate reduction scheme (Karlis and Meligkotsidou [16]). Let $X = (X_1, X_2, \ldots, X_m)^T$ denote the observed m-vector of variables at locations $1, \ldots, m$, where $m$ is the total number of locations. Each element $X_i$ represents the outcome variable, such as the total number of infected individuals at the specific location $i$, $i \in \{1, 2, \ldots, m\}$. $X$ is said to follow an MVP distribution if $X = AY$, where the components of the vector $Y = (Y_1, Y_2, \ldots, Y_k)^T$ are hidden variables that are assumed to have independent Poisson distributions, i.e., $Y_i \sim \mathrm{Poi}(\theta_i)$, $i \in \{1, 2, \ldots, k\}$ ($k$ is the total number of hidden variables in $Y$). $A$ is an $m \times k$ matrix ($m \le k$) with 0 and 1 elements that determines the correlation structure of $X$. The variability of $X$ is explained through the variability of the $k$ independent univariate Poisson random variables $Y_1, Y_2, \ldots, Y_k$. As discussed later, the elements of $Y$ can be considered as main effects and covariance effects. The mean and variance-covariance matrix of $X$ are
\[ E(X) = A\,E(Y) = A\theta, \]
and
\[ \mathrm{Var}(X) = A \Sigma A^T, \tag{2.1} \]
where the vector $\theta = (\theta_1, \theta_2, \ldots, \theta_k)^T$ is the mean of $Y$, and $\Sigma = \mathrm{diag}(\theta_1, \theta_2, \ldots, \theta_k)$ is the variance-covariance matrix of $Y$. Because the sum of independent Poisson random variables still follows a Poisson distribution, each element of $X$ marginally follows a univariate Poisson distribution [17, 21]. An example of the MVP distribution is presented in the next section.
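The moment formulas in (2.1) are easy to evaluate for a small case. The sketch below assumes a hypothetical three-location layout with one hidden variable per location and one per pair of locations; the matrix `A`, the rates in `theta` and the function names are made-up values for illustration only. It computes $A\theta$ and $A \Sigma A^T$ with $\Sigma = \mathrm{diag}(\theta)$.

```python
def mvp_mean(A, theta):
    """E(X) = A theta."""
    return [sum(a * t for a, t in zip(row, theta)) for row in A]

def mvp_covariance(A, theta):
    """Var(X) = A diag(theta) A^T, computed entry by entry."""
    return [[sum(a * b * t for a, b, t in zip(row_i, row_j, theta)) for row_j in A]
            for row_i in A]

# Columns: three location-specific hidden variables, then hidden variables
# shared by the location pairs (1,2), (1,3) and (2,3). All values are illustrative.
A = [[1, 0, 0, 1, 1, 0],
     [0, 1, 0, 1, 0, 1],
     [0, 0, 1, 0, 1, 1]]
theta = [2.0, 3.0, 4.0, 0.5, 0.25, 1.0]

mean_x = mvp_mean(A, theta)
cov_x = mvp_covariance(A, theta)
print(mean_x)  # [2.75, 4.5, 5.25]
for row in cov_x:
    print(row)
```

Because $A$ contains only zeros and ones, each diagonal entry of the covariance matrix equals the corresponding mean, and each off-diagonal entry equals the rate of the hidden variable shared by that pair of locations.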
2.1.1 Non-Dynamic Multivariate Poisson Model
We now provide illustrative examples of the MVP distribution described above using a two-way covariance setting, where only correlations between two locations are considered. Though the general model could be more flexible and account for more complex correlation structures, such as correlation among three locations, we only consider a simplified model here.
A two-way covariance structure MVP model is derived by decomposing matrix $A$ into two parts. The first part of $A$ is an identity matrix that formulates the main effect of the outcome variables. The second part of $A$ induces the covariance structure of the outcome variables. Note that a different correlation term for each pair of variables is allowed in this model, such that the parameters of the MVP model have meaningful interpretation. We now consider the hidden variables $Y = (Y_1, Y_2, \ldots, Y_k)^T$ consisting of two parts as in matrix $A$, the main components and the covariance components:
\[ (Y_1, Y_2, \ldots, Y_k)^T = (Y_1, Y_2, \ldots, Y_m, \xi_{1,2}, \xi_{1,3}, \ldots, \xi_{m-1,m})^T. \]
The main components are the $m$ out of $k$ elements of $Y = (Y_1, Y_2, \ldots, Y_k)^T$ representing the variation that can be explained by the location-specific covariates, whereas $\xi = (\xi_{1,2}, \xi_{1,3}, \ldots, \xi_{m-1,m})^T$ are the covariance (pairwise) effect components that can be modeled by explanatory variables representing the strength of the correlation between locations. Note that, when the number of main effect elements is m, the