Analyzing Spatial Longitudinal Incidence Patterns Using Dynamic Multivariate Poisson Models

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Yihan Sui, B.S., M.S.

Graduate Program in Biostatistics

The Ohio State University

2018

Dissertation Committee:

Dr. Grzegorz A. Rempala, Advisor

Dr. Chi (Chuck) Song, Advisor

Dr. Oksana Chkrebtii

© Copyright by

Yihan Sui

2018

Abstract

Methods of multivariate analysis for continuous data have been applied extensively in epidemiology, economics, engineering, and other fields. However, multivariate models for count data have not been applied to a similar extent, and relatively little work has been published on such models, especially when accounting for both spatial and temporal dependence. In the first part of the thesis, we propose a hierarchical multivariate Poisson (MVP) model that simultaneously models spatial and temporal correlation of observed data counts in a fairly general setting. In particular, MVP allows for modeling the spatially/temporally dependent counts as a function of location-specific and time-varying covariates. To characterize temporal trends, we propose a broken-line regression model within MVP and apply it to joinpoint detection. Bayesian inference is conducted using Markov chain Monte Carlo (MCMC) methods to approximate required posterior summaries. We use a backward selection algorithm coupled with the shaver method to obtain joinpoint coefficients via Bayesian Lasso.

We apply the proposed model to spatial-temporal pertussis incidence data from 2000 to 2015 from several midwestern states in the U.S. To evaluate the appropriateness of the model, in the second part of this thesis we develop a goodness-of-fit (GOF) statistic for fitting discrete generalized linear models (GLMs) based on the sum of standardized residuals (SSRs). This work is an extension of our earlier work (Chen L. et al. [1]), which proposed a GOF test for binary responses. We derive the asymptotic distribution of the test statistic and show how it can be applied to popular count regression models such as Poisson regression, negative binomial regression and binomial regression using different link functions. Using numerical examples, we show that the proposed test is substantially more powerful than most of the currently available GOF tests under various model misspecification scenarios applied to various discrete GLMs such as logistic, Poisson, and negative binomial.

Acknowledgments

First, I would like to thank my wonderful parents, Daifu and Guiping for their love, trust and encouragement along the way. I would not be where I am today were it not for their love and guidance.

I am grateful to my co-advisors, Professor Grzegorz A. Rempala and Professor Chi Song, both of whom have devoted a substantial amount of time and effort in training me to be a better researcher over the past few years. Without their guidance and persistent help this work would not have been possible. Professor Grzegorz A. Rempala is the reason I first became interested in multivariate count modeling, and since then I have thoroughly enjoyed working in this area with him. He has provided me with numerous opportunities to learn and grow through research, teaching, mentoring, and collaboration. I also would like to express appreciation to Professor Chi Song. He is a patient mentor and one of the smartest people I know. Some of my fondest memories in graduate school were hours of research discussions with him in his office. He has also become a good friend, whom I admire professionally and personally.

I thank the other member of my dissertation committee, Professor Oksana Chkrebtii, and the member of my candidacy committee, Professor Kellie J. Archer, for their time and effort in evaluating my work and providing me with feedback.

Finally, I would also like to thank all the professors and staff in the biostatistics program who have provided guidance and support during my time at The Ohio State University.

Vita

June 5, 1990 ...... Born - Jilin,

2013 ...... B.S. Biological Science

2015 ...... M.S. Statistics

2013-present ...... Graduate Associate, The Ohio State University.

Publications

Research Publications

Chen Lu, Sui Yihan, Song Chi, Grzegorz A. Rempala (2017). The Sum of Standardized Residuals: Goodness of Fit Test for Binary Response Model. Statistics in Medicine, doi: 10.1002/sim.7644.

Fields of Study

Major Field: Biostatistics

Table of Contents

Abstract
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
   1.1 Mathematical Modeling of Epidemics
      1.1.1 Multivariate Models
      1.1.2 Multivariate Poisson (MVP) Distribution
   1.2 Generalized Linear Models (GLMs)
      1.2.1 Estimating Equations
      1.2.2 Goodness of Fit (GOF) Test for GLM
   1.3 Thesis Outline

2. Dynamic Multivariate Poisson (DMVP) Model
   2.1 Modeling Framework
      2.1.1 Non-Dynamic Multivariate Poisson Model
      2.1.2 Dynamic Multivariate Poisson (DMVP) Model
   2.2 Multivariate Poisson Joinpoint Model
   2.3 Bayesian Estimation
      2.3.1 The Posterior Distribution for AR(1) Model
      2.3.2 The Posterior Distribution for ARIMA(1,1,0) Model
      2.3.3 The Estimation of Joinpoint Parameters b
      2.3.4 The update of Y and ξ
   2.4 Simulation Studies
   2.5 Analysis of Pertussis Data
      2.5.1 Background
      2.5.2 Modern Resurgence of Pertussis
      2.5.3 Modeling Pertussis Trends
      2.5.4 Analysis Details
      2.5.5 Conclusion

3. The Goodness of Fit (GOF) in Generalized Linear Model (GLM)
   3.1 Summary of Previous Research on GOF
      3.1.1 GOF Tests for Binary Outcomes
      3.1.2 GOF Tests for Discrete Outcomes
   3.2 Proposed Test Statistic - Standardized Residuals
   3.3 Numerical Examples
      3.3.1 Examples for Binary Variables
      3.3.2 Poisson Regression Example 1
      3.3.3 Poisson Regression Example 2
      3.3.4 Negative Binomial Example 1
      3.3.5 Negative Binomial Example 2
   3.4 Model Selection Example
      3.4.1 Model Selection Example for Binary Variable
      3.4.2 Model Discrimination Example for Count Variable
   3.5 Discussion

4. Future Work

Bibliography

Appendices

A. The Sum of Standardized Residuals: Goodness of Fit Test for Binary Response Model

List of Tables

Table

2.1 Simulation results for parameter β and γ in P = 1, Q = 1
2.2 Simulation results for parameter β and γ in P = 2, Q = 1
2.3 History of pertussis vaccines
2.4 History of acellular pertussis vaccines
2.5 Pertussis Vaccination Coverage (%)
3.1 Selected variables description in the data
3.2 Proportion H0 rejected at the α = 0.05 using sample size of 445 with 10000 replications in quadratic models using the data
3.3 Selected variables description in Cleveland data
3.4 Proportion H0 rejected at the α = 0.05 using sample size of 303 with 10000 replications in interaction models using the Cleveland data
3.5 Selected variables description in NHANES data
3.6 Proportion H0 rejected at the α = 0.05 using sample size of 373 with 10000 replications in interaction models using the NHANES data
3.7 Selected variables description in the Affairs data (n=601)
3.8 Proportion H0 rejected at the α = 0.05 using sample size of 601 with 10000 replications using the Affairs data
3.9 Selected variables description in the German Credit data (n=1,000)
3.10 Proportion H0 rejected at the α = 0.05 using sample size of 1000 with 10000 replications using the German Credit data
3.11 Selected variables description in the Bad Health data
3.12 Proportion H0 rejected at the α = 0.05 using sample size of 1127 with 10000 replications using the Bad Health data
3.13 Selected variables description in the Fishing data
3.14 Proportion H0 rejected at the α = 0.05 using sample size of 147 with 10000 replications in sqrt link model using the Fishing data
3.15 Selected variables description in HINTS data
3.16 The results of Cn and its p-values under each model using the HINTS data
3.17 Pertussis data
3.18 The results of Cn and its p-values under Poisson and negative binomial model using Pertussis data

List of Figures

Figure

1.1 An overview of mathematical models for infectious diseases generated by Siettos and Russo [2]
2.1 Group sampling illustration
2.2 Pertussis Incidence by Age Group
2.3 Pertussis Incidence Cases
2.4 Pertussis Rate (per 100,000) 2008
2.5 Pertussis Rate (per 100,000) 2009
2.6 Pertussis Rate (per 100,000) 2010
2.7 Pertussis Rate (per 100,000) 2011
2.8 Pertussis Rate (per 100,000) 2012
2.9 Pertussis Rate (per 100,000) 2013
2.10 Pertussis Rate (per 100,000) 2014
2.11 Pertussis Rate (per 100,000) 2015
2.12 Pertussis Rate (per 100,000) 2016
2.13 9 States
2.14 DMVP and Simplified Poisson Model Fitting
3.1 Proportion H0 rejected at the α = 0.05 using sample size of 601 with 10000 replications in interaction models using the Affair data
3.2 Proportion H0 rejected at the α = 0.05 over increase sample size with 10000 replications using the German Credit data
3.3 Proportion H0 rejected at the α = 0.05 in quadratic models with sample size of 1127 with 10000 replications in the Bad Health data

Chapter 1: Introduction

1.1 Mathematical Modeling of Epidemics

The use of mathematical modeling in the study of epidemics dates back to 1760, when Daniel Bernoulli studied the outbreak of smallpox in England [3, 4]. Recently, dynamical systems modeling methods have become popular in epidemiological research, mostly due to their ability to accommodate social, economic, and demographic factors [2]. In particular, modeling with multivariate count data has become increasingly prevalent in big data applications in many different domains. Such data are usually dependent, with both positive and negative correlations present. For instance, in the public health domain, the Centers for Disease Control and Prevention (CDC) provides a database of nationwide disease outbreaks by year, region, and disease type [5]. Epidemiological trends such as disease incidence data are monitored and displayed on the CDC website with graphics and maps, and are also provided in text-file formats. For example, reported historic pertussis case data are available on the CDC website [6]. These disease data have been collected over the past 20 years from all the states through their respective health departments via the National Notifiable Diseases Surveillance System (NNDSS). For epidemiologists, NNDSS data are invaluable for understanding the trends and the spread of various communicable diseases, and for examining the underlying social, economic and demographic factors that are crucial to developing better decision-making policies and to helping better prepare for pandemic outbreaks in the US [7].

Spatial variation in disease incidence should be considered in modeling because of ecological processes: for example, pathogen dispersal might be clustered, or infected hosts might be less likely to take a long trip [8]. Moreover, these spatial distributions of infectious diseases often change through time [8]. In order to study disease incidence and its changes over time, we need to establish convenient modeling methods that can take into account the spatial and temporal relationship among the incidence rates. Siettos and Russo [4] suggested that the mathematical approaches used in epidemiology can be divided into 3 categories (presented in Fig. 1.1): 1) statistical approaches, such as regression within the context of spatial patterns to model disease epidemics; 2) mathematical models of dynamical systems (mechanistic state-space models), such as the susceptible, infected, and recovered (SIR) model; 3) machine learning techniques using data from internet platforms such as Google Flu. In this thesis we focus on the first category. We approach the problem at the population level by adopting a static multivariate Poisson model, expanding it into a hierarchical model, and extending the latter to analyze spatial-temporal data.

Figure 1.1: An overview of mathematical models for infectious diseases generated by Siettos and Russo [2]

1.1.1 Multivariate Models

The multivariate normal (MVN) distribution has a long history in the analysis of continuous data and is a well-established tool for modeling spatial dependencies because of its convenient statistical properties [9]. Let $X = (X_1, \dots, X_n)^T$ be a vector of random variables. The mean vector and the variance-covariance matrix are denoted by $\mu = (\mu_1, \dots, \mu_n)^T$ ($\mu \in \mathbb{R}^n$) and $\Sigma_{n \times n}$ ($\Sigma_{n \times n}$ is a positive semi-definite matrix), respectively. The desirable properties of the MVN distribution are: 1) its joint probability density function exists in explicit form,

$$f_X(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} \det(\Sigma)^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right\},$$

where $x$ is a realization of the random vector $X$ ($x \in \mathbb{R}^n$), and 2) the covariance function ($\Sigma$) accurately specifies the correlation structure. In contrast, these properties are not immediately available for any multivariate count distribution. There has been relatively little work on implementing multivariate count models to deal with discrete spatial data, as discussed in Section 1.1.2. Here we aim to construct a multi-dimensional discrete distribution with Poisson margins that allows for a simple description of the dependencies among variables.

1.1.2 Multivariate Poisson (MVP) Distribution

One of the most commonly accepted definitions of the MVP distribution is given by Mahamunulu et al. [10]: an $m$-variate vector $X = (X_1, \dots, X_m)^T$ is said to follow the MVP distribution through a vector of hidden variables $Y = (Y_1, \dots, Y_k)^T$ such that $X = AY$. This definition is based on a projection $\mathbb{N}^k \to \mathbb{N}^m$, $k \ge m$. Here, the hidden variables $Y = (Y_1, \dots, Y_k)^T$ follow independent Poisson distributions, and $A$ is an $m \times k$ matrix of 0 and 1 values that determines the correlation structure of $X$. This is the distribution on which our dynamic MVP model is based. Our dynamic MVP borrows the hidden-variable decomposition scheme to help construct the variance-covariance structure. A detailed description of the model is provided in Chapter 2.

The formulation of the multivariate Poisson distribution dates back to 1925, when differential equations were used to derive a bivariate Poisson distribution [11, 12]. A more understandable but equivalent way to derive the bivariate Poisson distribution is to use the summation of independent Poisson variables [12]. The derivation of a special case of the multivariate Poisson model was presented recently by Tsionas et al. [13, 14] and Karlis [15] using a shared component construction. The limitation of this model is that it assumes the same correlation between all pairs of variables by using a single common covariance term. In 2005, this model was extended by Karlis and Meligkotsidou [16] to allow for a more flexible covariance structure among variables by permitting different correlations between pairs of variables. This extension of the covariance structure expanded the model's application into a number of new fields [16]. For example, in epidemiology it is of great interest to study the covariance structure of disease incidence in a given area from a geographical perspective. Although the model of Karlis and Meligkotsidou [16] can capture the spatial correlation, it is not a dynamic model and fails to account for temporal trends. Research on the extension of the MVP for modeling longitudinal data is relatively sparse. Pedeli et al. [17] introduced a bivariate integer-valued autoregressive process of order 1 (BINAR(1)) and applied it to bivariate temporal data concerning daytime and nighttime road accidents in the Netherlands. They focused on bivariate Poisson and bivariate negative binomial distributions. Aktekin et al. [18] developed an MVP model, which they referred to as multivariate Poisson-scaled beta (MPSB), that allows for temporal dependency in the counts as well as dependency across multiple series. They then applied MPSB to web traffic analysis using weekly consumer demand data.

In the following chapters, we construct a dynamic MVP that can be applied to longitudinal (time series) data based on the MVP derived by Karlis and Meligkotsidou [16]. We then allow for the detection of temporal trends by adding a piecewise regression component. Explanatory variables can be included in the model. Inference is performed via the Metropolis-Hastings within Gibbs sampler. We adopt the adaptive rejection sampling (ARS) scheme to improve the efficiency of the algorithm.

1.2 Generalized Linear Models (GLMs)

Generalized linear models (GLMs) are a popular class of regression models that generalize linear regression to accommodate a variety of outcome variable types such as discrete, binary, and categorical [19]. In what follows, we borrow the notation from McCullagh and Nelder [20]. The observations in the sample $Y = (Y_1, \dots, Y_n)^T$ are assumed to be independent with distributions from the natural exponential family (NEF), such that each $Y_i$ has the p.d.f.

$$f_{Y_i}(y_i; \eta, \phi) = \exp\left\{\frac{y_i \theta_i - b(\theta_i)}{\alpha(\phi)}\right\} c(y_i; \phi), \quad i = 1, \dots, n, \tag{1.1}$$

where $n$ is the number of samples, $\alpha(\cdot)$, $b(\cdot)$ and $c(\cdot)$ are known functions, and $\theta_i$ is a canonical (natural) parameter for some known dispersion function $\alpha(\phi)$ with dispersion parameter $\phi$. The mean and variance of $Y_i$ are given by

$$E(Y_i) = \mu_i = b'(\theta_i), \qquad \mathrm{Var}(Y_i) = V(\mu_i) = b''(\theta_i)\,\alpha(\phi), \quad i = 1, \dots, n.$$
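As a quick check of the exponential-family form (1.1), the Poisson distribution with mean $\mu_i$ can be written in this family. The identities below are the standard textbook ones and are given only as an illustration of the notation:

```latex
\begin{align*}
f_{Y_i}(y_i) &= \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}
              = \exp\{\, y_i \log \mu_i - \mu_i \,\}\, \frac{1}{y_i!}, \\
\theta_i &= \log \mu_i, \qquad b(\theta_i) = e^{\theta_i}, \qquad
\alpha(\phi) = 1, \qquad c(y_i; \phi) = \frac{1}{y_i!}, \\
E(Y_i) &= b'(\theta_i) = e^{\theta_i} = \mu_i, \qquad
\mathrm{Var}(Y_i) = b''(\theta_i)\,\alpha(\phi) = e^{\theta_i} = \mu_i .
\end{align*}
```

The equality of mean and variance recovered here is the usual equidispersion property of the Poisson distribution.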

There are three components in a GLM:

1. Random Component - refers to the variation of the response variable $Y$ due to its being drawn from a probability density function; e.g., in Poisson regression models, $Y$ has a Poisson distribution.

2. Systematic Component - refers to the set of covariates $Z_i = (Z_{i1}, Z_{i2}, \dots, Z_{iP})^T$; e.g., $\theta_i = \beta^T Z_i$ in Poisson regression.

3. Link Function - refers to the link function $g$ between the mean of the random component and the systematic component. The natural parameter $\theta_i$ is a function of the mean, $\theta_i = g(\mu_i)$, such as $g(\mu_i) = \beta^T Z_i$; that is, the link between $\theta_i$ and $\beta^T Z_i$ is the identity.

1.2.1 Estimating Equations

Let $Z_i^T$ denote a $p$-vector of covariates for the $i$th individual. From item 3 in the previous section, the mean function is assumed to relate to $Z_i^T \beta$ through the link function

$$g(\mu(\theta_i)) = Z_i^T \beta, \quad i = 1, \dots, n,$$

where $g$ is the link function and $\beta$ is a $p$-vector of parameters. As discussed above, if $\mu = g^{-1}$ and $\theta_i = Z_i^T \beta$, then $g$ is the canonical link. Now define $\psi = (g \circ \mu)^{-1}$; then the p.d.f. can be written as

$$f_{Y_i}(y_i; \eta, \phi) = \exp\left\{\frac{\psi(Z_i^T \beta)\, y_i - b(\psi(Z_i^T \beta))}{\alpha(\phi)}\right\} c(y_i; \phi), \tag{1.2}$$

and thus the log-likelihood function is

$$\log \ell(\beta) = \sum_{i=1}^{n} \left[ \log c(y_i; \phi) + \frac{\psi(Z_i^T \beta)\, y_i - b(\psi(Z_i^T \beta))}{\alpha(\phi)} \right].$$

The MLE of $\beta$ can be obtained by solving the likelihood score equations

$$\frac{\partial \log \ell(\beta)}{\partial \beta} = \frac{1}{\alpha(\phi)} \sum_{i=1}^{n} \left\{ y_i - \mu(\psi(Z_i^T \beta)) \right\} \psi'(Z_i^T \beta)\, Z_i = 0.$$

In this work, we study primarily the case when $\psi'(Z_i^T \beta)$ is not proportional to $1/\sqrt{V(\mu_i)}$. This is the case for most discrete GLMs.
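As a minimal sketch of how the score equations can be solved numerically, consider Poisson regression with the canonical log link, where $\psi$ is the identity, $\psi' \equiv 1$, $\alpha(\phi) = 1$, and the score reduces to $\sum_i (y_i - e^{Z_i^T \beta}) Z_i = 0$. The Newton-Raphson update, the simulated data, the helper names (`poisson_draw`, `score`, `fit_poisson`) and the coefficient values below are illustrative assumptions, not the implementation used in this thesis.

```python
import math
import random


def poisson_draw(rng, mean):
    """One Poisson variate by Knuth's multiplication method (adequate for small means)."""
    limit, count, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def score(beta, z, y):
    """Score vector sum_i (y_i - mu_i) * Z_i for the log link with alpha(phi) = 1."""
    s0 = s1 = 0.0
    for (z0, z1), yi in zip(z, y):
        resid = yi - math.exp(beta[0] * z0 + beta[1] * z1)
        s0 += resid * z0
        s1 += resid * z1
    return s0, s1


def fit_poisson(z, y, iters=25):
    """Newton-Raphson for two coefficients (same as Fisher scoring under the canonical link)."""
    beta = [0.0, 0.0]
    for _ in range(iters):
        s0, s1 = score(beta, z, y)
        # Fisher information sum_i mu_i * Z_i Z_i^T, stored as a symmetric 2x2 matrix.
        i00 = i01 = i11 = 0.0
        for z0, z1 in z:
            mu = math.exp(beta[0] * z0 + beta[1] * z1)
            i00 += mu * z0 * z0
            i01 += mu * z0 * z1
            i11 += mu * z1 * z1
        det = i00 * i11 - i01 * i01
        # beta <- beta + I^{-1} * score, using the closed-form 2x2 inverse.
        beta[0] += (i11 * s0 - i01 * s1) / det
        beta[1] += (i00 * s1 - i01 * s0) / det
    return beta


rng = random.Random(1)
true_beta = (0.5, 0.8)  # illustrative values, not taken from the thesis
z = [(1.0, rng.uniform(-1.0, 1.0)) for _ in range(2000)]  # intercept and one covariate
y = [poisson_draw(rng, math.exp(true_beta[0] * z0 + true_beta[1] * z1)) for z0, z1 in z]
beta_hat = fit_poisson(z, y)
print(beta_hat, score(beta_hat, z, y))
```

At convergence the printed score is numerically zero, which is exactly the estimating-equation condition; for a non-canonical link the factor $\psi'(Z_i^T \beta)$ would have to be carried through both the score and the information.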

1.2.2 Goodness of Fit (GOF) Test for GLM

The issue of measuring the discrepancy between model-predicted and observed values arises once a model is estimated from the data. The concept of model goodness of fit (GOF) helps quantify this discrepancy by considering the (signed) distance between observations and their predicted mean values under the specific link function and the set of covariates (e.g., linear form, quadratic form and interactions). Measurement of GOF is crucial for evaluating whether the model explains the data sufficiently well. Motivated by our longitudinal count data problem, we consider here a new GOF measure that appears to be well suited for discrete GLMs.

1.3 Thesis Outline

The thesis is organized into 3 chapters beyond this introductory Chapter 1. Chapter 2 introduces the modeling framework of the dynamic multivariate Poisson (DMVP) model, starting from a static MVP distribution and its extension to dynamic data over time. Section 2.1.1 introduces different decomposition schemes accounting for spatial dependency of multivariate count variables. Section 2.1.2 discusses the construction of the DMVP model for a given time series. We then consider a joinpoint model in Section 2.2 to identify disease trends and to detect changes. A shaver method is employed in Section 2.2 for parameter estimation. We then further discuss the details of the parameter estimation from the Bayesian perspective in Section 2.3. Simulation studies are performed to evaluate the model performance in Section 2.4. As an example, a real application using a pertussis dataset is shown in Section 2.5 to demonstrate the advantages and disadvantages of the DMVP model over a simple Poisson regression model. In Chapter 3 we propose a GOF statistic Cn for discrete GLMs based on the sum of standardized residuals (SSRs). Section 3.1 reviews currently used GOF tests that are built into statistical software for exponential modeling families. The derivation of the SSR test statistic Cn and its asymptotic properties are discussed in Section 3.2. Next, via numerical examples we demonstrate that the GOF test based on Cn outperforms many of the existing, more complicated methods currently in use. In Section 3.3 we also adopt real data examples in simulation studies to investigate the power of the proposed test and to compare it with existing methods in logistic, Poisson, and negative binomial regressions. Finally, a simple illustration of the use of the Cn statistic in the model selection problem is presented in Section 3.4. We discuss limitations of the test statistic in Section 3.5, where we also outline some possible future research directions.

Chapter 2: Dynamic Multivariate Poisson (DMVP) Model

2.1 Modeling Framework

In this section, we first illustrate the definition of the MVP distribution introduced by Karlis et al. [16] with examples in simple settings. We then extend the MVP by introducing a temporal component to capture trends in dynamic data.

The MVP distribution is derived from a multivariate reduction scheme (Karlis et al. [16]). Let $X = (X_1, X_2, \dots, X_m)^T$ denote the observed $m$-vector of variables at locations $1, \dots, m$, where $m$ is the total number of locations. Each element $X_i$ represents the outcome variable, such as the total number of infected individuals at the specific location $i$, $i \in \{1, 2, \dots, m\}$. $X$ is said to follow an MVP distribution if $X = AY$, where the components of the vector $Y = (Y_1, Y_2, \dots, Y_k)^T$ are hidden variables that are assumed to have independent Poisson distributions, i.e., $Y_i \sim \mathrm{Poi}(\theta_i)$, $i \in \{1, 2, \dots, k\}$ ($k$ is the total number of hidden variables in $Y$). $A$ is an $m \times k$ matrix ($m \le k$) with 0 and 1 elements that determines the correlation structure of $X$. The variability of $X$ is explained through the variability of the $k$ independent univariate Poisson random variables $Y_1, Y_2, \dots, Y_k$. As discussed later, the elements of $Y$ can be considered as main effects and covariance effects. The mean and variance-covariance matrix of $X$ are

$$E(X) = A\,E(Y) = A\theta,$$

and

$$\mathrm{Var}(X) = A \Sigma A^T, \tag{2.1}$$

where the vector $\theta = (\theta_1, \theta_2, \dots, \theta_k)^T$ is the mean of $Y$, and $\Sigma = \mathrm{diag}(\theta_1, \theta_2, \dots, \theta_k)$ is the variance-covariance matrix of $Y$. Because the sum of independent Poisson random variables still follows a Poisson distribution, each element of $X$ marginally follows a univariate Poisson distribution [17, 21]. An example of the MVP distribution is presented in the next section.

2.1.1 Non-Dynamic Multivariate Poisson Model

We now provide illustrative examples of the MVP distribution described above using a two-way covariance setting, where only correlations between two locations are considered. Though the general model could be more flexible and account for more complex correlation structures, such as correlation among 3 locations, we only consider a simplified model here.

A two-way covariance structure MVP model is derived by decomposing the matrix $A$ into two parts. The first part of $A$ is an identity matrix that formulates the main effect of the outcome variables. The second part of $A$ induces the covariance structure of the outcome variables. Note that a different correlation term for each pair of variables is allowed in this model, such that the parameters of the MVP model have a meaningful interpretation. We now consider the hidden variables $Y = (Y_1, Y_2, \dots, Y_k)^T$ as consisting of two parts, as in the matrix $A$, the main components and the covariance components:

$$(Y_1, Y_2, \dots, Y_k)^T = (Y_1, Y_2, \dots, Y_m, \xi_{1,2}, \xi_{1,3}, \dots, \xi_{m-1,m})^T.$$

The main components are the $m$ out of $k$ elements in $Y = (Y_1, Y_2, \dots, Y_k)^T$ representing the variation that can be explained by the location-specific covariates, whereas $\xi = (\xi_{1,2}, \xi_{1,3}, \dots, \xi_{m-1,m})^T$ are the covariance (pairwise) effect components that can be modeled by explanatory variables representing the strength of the correlation between locations. Note that, when the number of main effect elements is $m$, the number of covariance effect elements is $\binom{m}{2}$. The elements of $\xi$ are also independent Poisson variables, that is, $\xi_{i,j} \sim \mathrm{Poi}(\theta_{i,j})$, where $1 \le i < j \le m$. Similarly, $Y_i \sim \mathrm{Poi}(\mu_i)$, where $i \in \{1, 2, \dots, m\}$. An example of the two-way covariance structure is the trivariate Poisson model with $m = 3$, of the form

$$\begin{aligned} X_1 &= Y_1 + \xi_{1,2} + \xi_{1,3}, \\ X_2 &= Y_2 + \xi_{1,2} + \xi_{2,3}, \\ X_3 &= Y_3 + \xi_{1,3} + \xi_{2,3}, \end{aligned} \tag{2.2}$$

where the matrix $A$ is

$$A = \begin{pmatrix} 1 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 \end{pmatrix}. \tag{2.3}$$

In general, without considering the time index, $X = (X_1, X_2, \dots, X_m)^T$ follows the MVP distribution such that

$$X_i = Y_i + \sum_{j \ne i} \xi_{i,j}, \tag{2.4}$$

where $\xi_{i,j} = \xi_{j,i}$.
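The two-way construction can be checked numerically. The sketch below builds the 0/1 matrix $A$ for general $m$, evaluates $\mathrm{Var}(X) = A \Sigma A^T$ from (2.1), and compares it with a simulated covariance for the trivariate case (2.2)-(2.3). The function names and the Poisson rates are hypothetical choices made for illustration, and the simple Poisson sampler is an assumption suited only to small means.

```python
import math
import random
from itertools import combinations


def build_a(m):
    """Return (A, pairs): A is the m x k 0/1 matrix with k = m + m(m-1)/2."""
    pairs = list(combinations(range(m), 2))
    a = [[0] * (m + len(pairs)) for _ in range(m)]
    for i in range(m):
        a[i][i] = 1  # identity block: main components Y_1, ..., Y_m
    for col, (i, j) in enumerate(pairs, start=m):
        a[i][col] = 1  # xi_{i,j} enters X_i ...
        a[j][col] = 1  # ... and X_j, which induces their covariance
    return a, pairs


def theoretical_cov(a, theta):
    """Var(X) = A diag(theta) A^T, as in (2.1)."""
    k = len(theta)
    return [[sum(r[c] * theta[c] * s[c] for c in range(k)) for s in a] for r in a]


def poisson_draw(rng, mean):
    """One Poisson variate by Knuth's multiplication method (adequate for small means)."""
    limit, count, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def sample_x(rng, a, theta):
    """Draw the hidden Poisson vector Y and return X = A Y."""
    y = [poisson_draw(rng, t) for t in theta]
    return [sum(coef * val for coef, val in zip(row, y)) for row in a]


def empirical_cov(draws, r, s):
    """Unbiased sample covariance between components r and s of the draws."""
    n = len(draws)
    mean_r = sum(d[r] for d in draws) / n
    mean_s = sum(d[s] for d in draws) / n
    return sum((d[r] - mean_r) * (d[s] - mean_s) for d in draws) / (n - 1)


a, pairs = build_a(3)
# Illustrative rates for (Y1, Y2, Y3, xi_12, xi_13, xi_23); not taken from the thesis.
theta = [2.0, 1.0, 3.0, 0.5, 0.25, 0.75]
cov = theoretical_cov(a, theta)
rng = random.Random(2018)
draws = [sample_x(rng, a, theta) for _ in range(20000)]
print(cov)
print(empirical_cov(draws, 0, 1))  # compare with the rate of xi_12
```

Because each pair of locations shares exactly one hidden variable, the off-diagonal entry of $A \Sigma A^T$ for locations $i$ and $j$ is simply the rate of $\xi_{i,j}$. This also shows that the construction can only represent non-negative covariances.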

2.1.2 Dynamic Multivariate Poisson (DMVP) Model

In practice, infectious disease data are often collected over a long period of time to study the long-term trends in disease incidence. Therefore, we need to extend the MVP distribution to account for temporal effects by incorporating parameter evolution described by a time series model. The idea of applying time series approaches to generalized linear models (GLMs) was originally introduced by Gamerman et al. [22] in the context of dynamic hierarchical models in order to account for the temporal trend. The dynamic model comprises two parts: observation equations that describe the distribution of the observations, and system equations that describe the evolution of the parameters through time. For location $i$ at time $t$, these equations can be written as follows.

Observation equations:

$$\begin{aligned} X_{i,t} &= Y_{i,t} + \sum_{j \ne i} \xi_{i,j,t}, \\ Y_{i,t} &\sim \mathrm{Poi}(\mu_{i,t}); \quad \log(\mu_{i,t}) = \beta_{i,t}, \\ \xi_{i,j,t} &\sim \mathrm{Poi}(\theta_{i,j,t}); \quad \log(\theta_{i,j,t}) = \gamma_{i,j,t}, \quad 1 \le i < j \le m. \end{aligned} \tag{2.5}$$

System equations:

$$\begin{aligned} \beta_{i,t} &= G_t^{(\beta)} \beta_{i,t-} + \epsilon_{i,t}, \quad \epsilon_{i,t} \sim N(0, \sigma^2), \\ \gamma_{i,j,t} &= G_t^{(\gamma)} \gamma_{i,j,t-} + \varepsilon_{i,j,t}, \quad \varepsilon_{i,j,t} \sim N(0, v^2). \end{aligned} \tag{2.6}$$

Here $G_t^{(\beta)}$ and $G_t^{(\gamma)}$ are known matrices that formulate $\beta_{i,t}$ and $\gamma_{i,j,t}$ through a linear combination of the parameters prior to time $t$. In this work, we adopt the simple AR(1) model and the ARIMA(1, 1, 0) model for $\beta$ and $\gamma$. These are discussed in Section 2.3. $\beta_{i,t-}$ and $\gamma_{i,j,t-}$ are vectors that include all elements prior to time $t$. $\epsilon$ and $\varepsilon$ are disturbance terms that are assumed to have independent Gaussian distributions with mean zero and variances $\sigma^2$ and $v^2$, respectively; $t \in \{1, 2, \dots, T\}$, where $T$ is the total number of time points.
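To make the two-layer structure concrete, the sketch below simulates counts from the observation equations (2.5) with the system equations (2.6) specialized to a scalar AR(1) recursion. The AR coefficients `phi_beta` and `phi_gamma`, the starting values, the standard deviations and the function names are all hypothetical choices for illustration. The specific form of $G_t^{(\beta)}$ and $G_t^{(\gamma)}$ used in this thesis is given in Section 2.3, not here.

```python
import math
import random
from itertools import combinations


def poisson_draw(rng, mean):
    """One Poisson variate by Knuth's multiplication method (adequate for small means)."""
    limit, count, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def simulate_dmvp(m, n_times, phi_beta=0.8, phi_gamma=0.5, sigma=0.2, v=0.2, seed=0):
    """Simulate X_{i,t} for m locations and n_times time points under assumed AR(1) dynamics."""
    rng = random.Random(seed)
    pairs = list(combinations(range(m), 2))
    beta = [0.5] * m                        # assumed starting values beta_{i,0}
    gamma = {pair: -1.0 for pair in pairs}  # assumed starting values gamma_{i,j,0}
    path = []
    for _ in range(n_times):
        # System equations: scalar AR(1) coefficients stand in for the matrices G_t.
        beta = [phi_beta * b + rng.gauss(0.0, sigma) for b in beta]
        gamma = {pair: phi_gamma * g + rng.gauss(0.0, v) for pair, g in gamma.items()}
        # Observation equations: independent Poisson hidden variables with log links.
        y = [poisson_draw(rng, math.exp(b)) for b in beta]
        xi = {pair: poisson_draw(rng, math.exp(g)) for pair, g in gamma.items()}
        # X_{i,t} = Y_{i,t} + sum over all pairs that contain location i.
        x = [y[i] + sum(c for pair, c in xi.items() if i in pair) for i in range(m)]
        path.append({"x": x, "y": y, "xi": xi})
    return path


path = simulate_dmvp(m=3, n_times=16)
for step in path[:3]:
    print(step["x"])
```

In real data only the `x` entries are observed; `y` and `xi` are latent, which is why the estimation procedure has to update them alongside the regression parameters.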

In order to further extend the applicability of the dynamic model, we incorporate time-varying explanatory variables into the DMVP. This can be achieved by introducing a log-link function between the explanatory variables and the parameters of interest, as in the simple Poisson regression model. The proposed model can be written using matrix notation as follows.

Observation equations:

$$\begin{aligned} X_t &= Y_t + \xi_t, \quad t \in \{1, 2, \dots, T\}, \\ Y_t &\sim \mathrm{Poi}(\mu_t), \quad \log(\mu_t) = Z_t^T \beta_t, \\ \xi_t &\sim \mathrm{Poi}(\theta_t), \quad \log(\theta_t) = D_t^T \gamma_t. \end{aligned} \tag{2.7}$$

System equations:

$$\begin{aligned} \beta_t &= G_t^{(\beta)} \beta_{t-} + \epsilon_t, \quad \epsilon_t \sim N(0, W), \\ \gamma_t &= G_t^{(\gamma)} \gamma_{t-} + \varepsilon_t, \quad \varepsilon_t \sim N(0, V), \end{aligned} \tag{2.8}$$

where $X_t = (X_{1,t}, X_{2,t}, \dots, X_{m,t})^T$ is the vector of outcome variables, i.e., the number of infections for locations $1, \dots, m$ at time $t$. $Y_t = (Y_{1,t}, \dots, Y_{m,t})^T$ and $\xi_t = (\xi_{1,2,t}, \dots, \xi_{(m-1),m,t})^T$ are the hidden variables representing the mean and covariance components at time $t$, which follow independent Poisson distributions. $Z_t$ is a $P \times m$ matrix of explanatory variables for the main effect components $Y_t$ at time $t$. Note that $Z_t$ contains location-specific variables that help explain the disease incidence rate, such as immunity, medical facilities and weather conditions. Here $P$ is the number of explanatory variables available in $Z_t$. $D_t$ is a $Q \times \binom{m}{2}$ matrix of explanatory variables for the covariance effect components $\xi_t$ at time $t$. Note that $D_t$ contains variables that measure the strength of the correlation between locations, such as the passenger travel volume between locations $i$ and $j$, or simply the distance between locations $i$ and $j$. Here $Q$ is the number of covariates available for the covariance effect. $\beta_t$ and $\gamma_t$ are vectors of the regression coefficients for the covariates $Z_t$ and $D_t$, respectively. $G_t^{(\beta)}$ and $G_t^{(\gamma)}$ are known matrices of the evolution equations for $\beta_t$ and $\gamma_t$, respectively, such as the AR(1) model and the ARIMA(1, 1, 0) model. $W$ and $V$ are variance-covariance matrices that do not vary over time. The model can also be written as follows.

Observation equations:

$$\begin{aligned} X_{i,t} &= Y_{i,t} + \sum_{j \ne i} \xi_{i,j,t}, \\ Y_{i,t} &\sim \mathrm{Poi}(\mu_{i,t}), \quad \log(\mu_{i,t}) = \sum_{p=1}^{P} Z_{i,p,t}\, \beta_{p,t}, \\ \xi_{i,j,t} &\sim \mathrm{Poi}(\theta_{i,j,t}), \quad \log(\theta_{i,j,t}) = \sum_{q=1}^{Q} D_{i,j,q,t}\, \gamma_{q,t}. \end{aligned} \tag{2.9}$$

System equations:

$$\begin{aligned} \beta_{p,t} &= G_t^{(\beta)} \beta_{p,t-} + \epsilon_{p,t}, \quad \epsilon_{p,t} \sim N(0, \sigma^2), \\ \gamma_{q,t} &= G_t^{(\gamma)} \gamma_{q,t-} + \varepsilon_{q,t}, \quad \varepsilon_{q,t} \sim N(0, v^2). \end{aligned} \tag{2.10}$$

2.2 Multivariate Poisson Joinpoint Model

Joinpoint modeling in regression analysis arises when data sets are broken down into different blocks in such a way that the observations within each block follow the same regression model, but observations from different blocks follow different models [23]. It is essential to find the proper functional form of the regression parameters within each block. Typically two types of joinpoints are considered, instantaneous and gradual [24]. The patterns in a gradual change point model are usually implicit and are characterized with the help of piecewise linear models [24]. Gradual change point models have been commonly used in public health applications. For example, Czajkowski et al. [25] extended the use of a broken-line model and applied it to longitudinal cancer mortality data to search for the number and location of trend changes. Campos et al. [26] used segmented Poisson regression models to analyze chronic obstructive pulmonary disease (COPD) mortality trends by sex in 27 countries in the European Union (EU) for the period 1994 to 2010, whereas Malvezzi et al. [27, 28] estimated numbers of deaths from cancers and worldwide age-standardized mortality rates (ASRs) in 2013 using the World Health Organization mortality and population database, via logarithmic Poisson count data joinpoint models. Cassel [29] performed a piecewise logistic regression adjusting for age, sex, and race/ethnicity to examine the trends in the prevalence of extreme obesity among US preschool-aged children living in low-income families from 1998 to 2010.

The purpose of this current work is to propose a point detection method that allows us to identify the locations and number of the time joinpoints in spatial and longitudinal data. In other words, we would like to apply the broken-line model from Czajkowski et al. [25], and incorporate it into our DMVP model.

Following the setting in Czajkowski et al. [25], we focus on analyzing spatial discrete outcome data collected on an infectious disease over a period of time. Recall that our observations are $(X_{1,1}, Z_{1,1}, D_{1,2,1}), \ldots, (X_{m,T}, Z_{m,T}, D_{m-1,m,T})$, recorded at the ordered observation time points $1, \ldots, T$. The variables $Y_{1,1}, \ldots, Y_{m,T}$ are assumed to follow independent Poisson regressions with means $\mu_{1,1}, \ldots, \mu_{m,T}$ and with unknown joinpoints $\tau_1 < \ldots < \tau_w$. Specifically, we have

\begin{equation}
\log(\mu_{i,t}) = Z_{i,t}^{T} \beta_t + b_0 t + \sum_{l=1}^{w} b_l (t - \tau_l)_+,
\tag{2.11}
\end{equation}

where $i = 1, \ldots, m$, $t = 1, \ldots, T$, $u_+ = \max\{u, 0\}$, and $\beta_t$ is the $P$-variate vector of regression coefficients for the fixed main effect covariates at time $t$. Here $b_0$ is the baseline slope, $b_l$ is the slope change corresponding to the $(l+1)$th segment for $l = 1, \ldots, w$, and $w$ is the total number of true joinpoints. The unknown joinpoint locations are $\tau_1, \ldots, \tau_w$. Note that model (2.11) reduces to a standard GLM if the joinpoints are known.
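To make the broken-line structure of (2.11) concrete, the following minimal Python sketch evaluates the linear predictor for a single location. The function names and the numerical values (one joinpoint at $\tau_1 = 3$, baseline slope 0.1, slope change $-0.3$, a zero covariate effect) are illustrative assumptions and are not taken from the pertussis analysis.

```python
def positive_part(u):
    """Return u_+ = max(u, 0), the hinge used in the broken-line model."""
    return max(u, 0.0)


def log_mean(t, z, beta_t, b0, b, tau):
    """Linear predictor log(mu_{i,t}) of model (2.11) for one location.

    z and beta_t are equal-length lists (covariates and their coefficients
    at time t); b and tau are equal-length lists of slope changes and
    joinpoint locations.
    """
    covariate_part = sum(zp * bp for zp, bp in zip(z, beta_t))
    trend_part = b0 * t + sum(bl * positive_part(t - tl) for bl, tl in zip(b, tau))
    return covariate_part + trend_part


# Illustrative (assumed) values: slope 0.1 up to t = 3, then 0.1 - 0.3 = -0.2.
tau = [3.0]
b0, b = 0.1, [-0.3]
log_mu = [log_mean(t, z=[1.0], beta_t=[0.0], b0=b0, b=b, tau=tau)
          for t in range(1, 7)]
```

With these values the trend rises by 0.1 per time unit until $t = 3$ and falls by 0.2 per time unit afterwards, which shows that $b_l$ is a change in slope rather than the slope of the new segment itself.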

When the number of joinpoints $w$ is unknown, it needs to be estimated together with all other parameters. Direct likelihood maximization over the joinpoint parameters typically leads to over-estimation: joinpoints are found at the maximum possible number of locations. Therefore, we need to add constraints that bound the number of joinpoints. For instance, we can fit the model with penalties placed on $b_1, \ldots, b_w$ to control model complexity.

In particular, adding a lasso penalty on the slope changes to the log-likelihood shrinks their estimated values towards 0:

\begin{equation}
-n \sum_{l=1}^{w} \lambda |b_l|,
\tag{2.12}
\end{equation}

with some $\lambda > 0$. Here $\lambda$ is referred to as the smoothing parameter. The lasso penalty term $n \sum_{l=1}^{w} \lambda |b_l|$ in (2.12) was introduced by Tibshirani et al. [30]. The lasso estimates are viewed as penalized least squares estimates. They can also be interpreted as Bayesian posterior mode estimates when the regression coefficients are assigned independent and identical Laplace (i.e., double-exponential) priors [31]. Penalized likelihood with a flat prior is used in model fitting here to obtain the posterior distribution of the slope changes $b_l$. A more detailed discussion of the estimation of the joinpoint parameters is provided in Section 2.3.
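The link between the penalty (2.12) and the Laplace prior can be checked numerically. The sketch below, in Python, adds the term $-n \sum_l \lambda |b_l|$ to a Poisson log-likelihood and verifies that it differs from the log-density of independent zero-mean Laplace priors with rate $n\lambda$ only by a constant that does not depend on $b$. The data, the value of $n$, and all function names are hypothetical and serve only as an illustration.

```python
import math


def poisson_logpmf(y, mu):
    """Log of the Poisson probability mass function at count y with mean mu."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def penalized_loglik(y, mu, b, lam, n):
    """Poisson log-likelihood plus the lasso term -n * sum_l lam * |b_l| of (2.12)."""
    loglik = sum(poisson_logpmf(yi, mi) for yi, mi in zip(y, mu))
    penalty = n * lam * sum(abs(bl) for bl in b)
    return loglik - penalty


def laplace_logpdf(x, rate):
    """Log-density of a zero-mean Laplace law with density (rate / 2) * exp(-rate * |x|)."""
    return math.log(rate / 2.0) - rate * abs(x)


# Assumed toy inputs: three counts, their means, and three slope changes.
y = [2, 0, 3]
mu = [1.5, 0.5, 2.0]
b = [0.4, -0.2, 0.0]
lam, n = 0.5, 3

# Amount subtracted from the log-likelihood by the penalty.
gap = penalized_loglik(y, mu, b, 0.0, n) - penalized_loglik(y, mu, b, lam, n)

# Log-prior under independent Laplace(rate = n * lam) priors, and its b-free constant.
log_prior = sum(laplace_logpdf(bl, n * lam) for bl in b)
constant = len(b) * math.log(n * lam / 2.0)
```

Here `log_prior - constant` equals `-gap`, so maximizing the penalized log-likelihood is the same as maximizing the log-posterior under these Laplace priors.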

Searching for the number of joinpoints requires us to re-estimate the parameters after each model update, which makes estimating the locations of the change points in (2.11) computationally intensive. Gill et al. [24] proposed a computationally simple and flexible shaver method for verifying the number of changes and approximating their locations. The shaver method starts off with an over-parametrized model containing a sufficient number of knots to ensure that the prespecified joinpoint locations are close enough to the true knot locations. The over-parametrized model is then fitted using the penalty function to obtain the regression coefficients and slopes. The extraneous knots are removed iteratively after the coefficients of the knots are estimated, with the insignificant ones removed at each iteration. The details of this idea are presented below.

We initiate the model by taking a sufficiently large number of knots $\tau_1, \ldots, \tau_h$ with $h \gg w$ over the time points from 1 to $T$, so that many of the slope changes $b_l$ may be estimated as zero. According to Gill et al. [24], the shaver method proceeds as follows:

1. Initiate with a superfluous number of nodes $h \gg w$ across the time points from 1 to $T$.

2. Sample $(\beta_t, b, \gamma_t)$ from their posterior distributions described in Section 2.3.

3. Remove the insignificant nodes, i.e., those whose slope-change parameter has a credible interval containing 0.

4. Re-estimate all coefficients $(\beta_t, b, \gamma_t)$.

Note that we estimate the coefficients $(\beta_t, \gamma_t)$ only at the initial and last steps to save computational time. The initial number of joinpoints $h$ should not be chosen too large, in order to avoid the problem of unidentifiability.

2.3 Bayesian Estimation

Our model fitting procedure is carried out using Bayesian methods. The description of our approach is based on the DMVP model discussed before, with explanatory variables $Z$ and $D$ in a two-way covariance structure. The parameters we need to estimate are the regression coefficients $\beta$ and $\gamma$, the joinpoint slopes $b = (b_0, b_1, \ldots, b_w)$, and the hidden variables $Y$ and $\xi$. More details are provided in Sections 2.3.1 to 2.3.4.

The posterior distributions of the regression parameters $\beta$ and $\gamma$ are obtained using an adaptive rejection sampling (ARS) scheme. ARS is a powerful sampling technique for univariate log-concave probability density functions that was proposed by Gilks et al. for Gibbs sampling [32]. The technique is adaptive: the rejection envelope is progressively squeezed so that it converges to the target density quickly and efficiently [32]. In 1995, Gilks et al. [33] generalized ARS to include a Metropolis-Hastings step for non-log-concave full conditionals. Since ARS helps to reduce the number of samples needed for MCMC convergence, it is used when the posterior distribution is not conjugate and the density is computationally expensive to evaluate.
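ARS applies only when the full conditional is log-concave. The Python sketch below checks this requirement numerically for a single regression coefficient under a Poisson log-linear likelihood combined with a normal prior, as implied by a Gaussian system equation. It does not implement ARS itself. The data, the `offset` terms (a hypothetical stand-in for the remaining part of the linear predictor), and the prior settings are illustrative assumptions.

```python
import math


def log_full_conditional(beta, y, z, offset, prior_mean, prior_sd):
    """Unnormalized log full conditional of one coefficient beta.

    Poisson log-likelihood with log-mean offset_i + z_i * beta (terms not
    involving beta are dropped) plus a normal log-prior.
    """
    loglik = sum(yi * (oi + zi * beta) - math.exp(oi + zi * beta)
                 for yi, zi, oi in zip(y, z, offset))
    logprior = -0.5 * ((beta - prior_mean) / prior_sd) ** 2
    return loglik + logprior


def second_difference(f, x, h=1e-3):
    """Central finite-difference approximation to the second derivative of f at x."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


# Assumed toy data for one coefficient.
y = [3, 0, 2, 5]
z = [1.0, -0.5, 0.3, 1.2]
offset = [0.0, 0.1, -0.2, 0.3]


def target(beta):
    return log_full_conditional(beta, y, z, offset, prior_mean=0.0, prior_sd=1.0)


grid = [-3.0 + 0.5 * k for k in range(13)]
curvatures = [second_difference(target, g) for g in grid]
is_log_concave_on_grid = all(c < 0.0 for c in curvatures)
```

The curvature is negative everywhere on the grid, in agreement with the analytic second derivative $-\sum_i z_i^2 \exp(o_i + z_i \beta) - 1/\mathrm{sd}^2$, where $o_i$ denotes the offset and $\mathrm{sd}$ the prior standard deviation. Such a full conditional therefore meets the log-concavity condition of ARS.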

Recall that the system equation for the regression coefficients can take a variety of forms. We consider two cases, AR(1) and ARIMA(1, 1, 0), which are discussed in Sections 2.3.1 and 2.3.2, respectively.
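The difference between the two system equations can be illustrated by simulating a single coefficient path under each. The Python sketch below uses the textbook definitions: under AR(1) the coefficient itself is autoregressive, and under ARIMA(1, 1, 0) its first differences are. The exact parametrization used in Sections 2.3.1 and 2.3.2 may differ, and the function names, the autoregressive parameter `phi`, and the starting values are assumptions.

```python
import random


def simulate_ar1(phi, sigma, beta0, T, rng):
    """AR(1): beta_t = phi * beta_{t-1} + eps_t, with eps_t ~ N(0, sigma^2)."""
    path = [beta0]
    for _ in range(T - 1):
        path.append(phi * path[-1] + rng.gauss(0.0, sigma))
    return path


def simulate_arima110(phi, sigma, beta0, T, rng):
    """ARIMA(1,1,0): the differences d_t = beta_t - beta_{t-1} follow an AR(1).

    The initial difference is set to 0, which is an assumption of this sketch.
    """
    path = [beta0]
    diff = 0.0
    for _ in range(T - 1):
        diff = phi * diff + rng.gauss(0.0, sigma)
        path.append(path[-1] + diff)
    return path


rng = random.Random(1)
ar1_path = simulate_ar1(phi=0.5, sigma=0.1, beta0=1.0, T=16, rng=rng)
arima_path = simulate_arima110(phi=0.5, sigma=0.1, beta0=1.0, T=16, rng=rng)
```

For $|\phi| < 1$ the AR(1) path reverts towards zero, whereas the ARIMA(1, 1, 0) path has no fixed level and can drift. For a time-varying coefficient, the second form therefore allows more persistent departures from its starting value.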

Note that the density function of the observed data $P(X \mid Z, D, \beta, \gamma, b)$ is not tractable. However, the density functions of the hidden variables, $P(Y \mid Z, \beta, b)$ and $P(\xi \mid D, \gamma)$, are easy to handle, as both are assumed to have independent Poisson distributions. We assume that $Y$ and $\xi$ are conditionally independent given $Z$, $D$, $\beta$, $b$ and $\gamma$. The full log-likelihood function is

`(β, γ, b) T X  = Poi Y t, ξt | βt, γt, Zt, Dt, b, λ t=1 T m w (2.13) X X T X  = Poi yi,t; exp Zi,tβt + b0t + bl(t − τl)+ t=1 i=1 l=1 T X X T  + Poi ξi,j,t; exp Di,jγt t=1 j