University of Connecticut OpenCommons@UConn

Doctoral Dissertations University of Connecticut Graduate School

5-22-2020

On Bayesian Methods for Spatial Point Processes

Jieying Jiao University of Connecticut - Storrs, [email protected]


Recommended Citation: Jiao, Jieying, "On Bayesian Methods for Spatial Point Processes" (2020). Doctoral Dissertations. 2443. https://opencommons.uconn.edu/dissertations/2443

On Bayesian Methods for Spatial Point Processes

Jieying Jiao, Ph.D. University of Connecticut, 2020

ABSTRACT

Spatial point pattern data are routinely encountered. A flexible regression model for the underlying intensity is essential to characterizing and understanding the pattern. Spatial point processes are widely used to model such data. Additional measurements, called marks, are often available along with the spatial points. Such data can be modeled using marked spatial point processes.

The first part of this dissertation focuses on the heterogeneity of point processes.

We propose a Bayesian semiparametric model where the observed points follow a spatial Poisson process with an intensity function that adjusts a nonparametric baseline intensity with multiplicative covariate effects. The baseline intensity is approached with a powered Chinese restaurant process (PCRP) prior. The parametric regression part allows for variable selection through a spike-and-slab prior on the regression coefficients. An efficient Markov chain Monte Carlo (MCMC) algorithm is developed. The performance of the methods is validated in an extensive simulation study and an application to the Beilschmiedia pendula tree data.

Spatial smoothness is often observed in some environmental spatial point pattern data, and the PCRP may have lower efficiency for such data since it allows more flexibility without any spatial constraint. The distance dependent Chinese restaurant process (ddCRP) can be easily realized by introducing a decay function into the Chinese restaurant process. The second part of this dissertation introduces the ddCRP model with Bayesian inference methods, whose performance is illustrated in a simulation study.

In the third part, we investigate the marked spatial point process, which is motivated by basketball shot data. We develop a Bayesian joint model of the mark and the intensity, where the intensity is incorporated into the mark's model as a covariate.

An MCMC algorithm is developed to draw posterior samples from this model. Two Bayesian model comparison criteria, the modified Deviance Information Criterion and the modified Logarithm of the Pseudo-Marginal Likelihood, are developed to assess the fit of different models with a focus on the mark. A simulation study and an application to NBA basketball shot data demonstrate the performance of the proposed methods.

On Bayesian Methods for Spatial Point Processes

Jieying Jiao

B.S., University of Science and Technology of China, Hefei, China, 2016

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy at the University of Connecticut

2020

Copyright by

Jieying Jiao

2020

APPROVAL PAGE

Doctor of Philosophy Dissertation

On Bayesian Methods for Spatial Point Processes

Presented by Jieying Jiao, B.S.

Major Advisor Dr. Jun Yan

Associate Advisor Dr. Haim Bar

Associate Advisor Dr. Dipak Dey

Associate Advisor Dr. Xiaojing Wang

University of Connecticut

2020

To My Parents

Acknowledgments

First and most important, I want to express my sincere gratitude to my advisor, Professor

Jun Yan. He has always been supportive and has given me the freedom to pursue various projects. It would never have been possible for me to take this work to completion without his patient guidance and encouragement.

I would like to thank Professor Haim Bar, Professor Dipak Dey, and Professor Xiaojing Wang for serving on my general exam and dissertation committees. The discussions with them and the questions they raised helped broaden my research. I also want to thank all the professors who have taught me during my time at UConn. The knowledge I have learned from them is invaluable.

My sincere thanks also go to Dr. Guanyu Hu for his generous help and support of my research work. His passion for research motivated me to dig deeper into problems.

It has been a pleasure to work with him, and I have learned a lot from this process.

My special thanks go to Professor Peng Zhang for involving me in the cybersecurity project and providing funding for my Ph.D. research. I also want to thank Dr. Yishu Xue for her help with my research work.

For all my close friends I met at UConn and my old friends back in China, I want to thank them for all the help and support they have provided me. My life as a Ph.D. student would not be as meaningful without them.

I am truly grateful to my parents for their immeasurable love and care. I feel so lucky to have parents who trust me unconditionally. The warmth I feel from my parents, my brother, and my grandparents is the treasure of my life.

Contents

Acknowledgments

1 Introduction
1.1 Spatial Point Process
1.2 Nonparametric Methods for Intensity Function
1.3 Dirichlet Process
1.4 Mixture of Finite Mixtures
1.5 Prior on Membership Parameter z
1.5.1 Stick-Breaking Process
1.5.2 Chinese Restaurant Process
1.5.3 Overclustering Problem of DP
1.6 Spatial Smoothness of the Intensity of Point Process
1.7 Marked Spatial Point Process

2 Hierarchical Semiparametric Model
2.1 Introduction
2.2 Model Setup
2.2.1 Spatial Poisson Point Process
2.2.2 Nonparametric Baseline Intensity
2.2.3 Hierarchical Semiparametric Regression Model
2.3 Bayesian Inference
2.3.1 The MCMC Sampling Scheme
2.3.2 Inference for the Nonparametric Baseline Intensity
2.3.3 Selection of r
2.4 Simulation Study
2.5 Point Pattern of Beilschmiedia Pendula
2.6 Discussion

3 Spatial Poisson Point Process with Spatial Smoothness Constraint
3.1 Distance Dependent CRP
3.2 Model Setup
3.3 Bayesian Inference
3.4 Simulation Study
3.5 Discussion

4 Marked Spatial Point Process
4.1 Introduction
4.2 Shot Chart Data
4.3 Model Setup
4.3.1 Marked Spatial Point Process
4.3.2 Prior Specification
4.3.3 Bayesian Variable Selection
4.4 Bayesian Inference
4.4.1 The MCMC Sampling Schemes
4.4.2 Bayesian Model Selection for the Mark Model
4.5 Simulation
4.5.1 Estimation
4.5.2 Variable Selection
4.6 Real Data Analysis

5 Future Work
5.1 Spatially Dynamic Variable Selection
5.2 Clustered Regression Coefficients Model for Spatial Point Process
5.3 Multivariate Point Process Model

A Appendix
A.1 BITC Validation Under Gaussian Mixture Model

Bibliography

List of Tables

1 Simulation estimation results. "SD" is the empirical standard deviation over 100 replicates; "SD̂" is the average of the 100 standard deviations calculated from the posterior samples; "AR" is the variable selection accuracy rate over 100 replicates.
2 Posterior means, standard errors (SE), and 95% HPD credible intervals for the regression coefficients in the analysis of the spatial point pattern of Beilschmiedia pendula in the BCI data. Covariates with a "*" are significant.
3 Shot data summary for four players in the 2017–2018 regular NBA season. The period includes 4 quarters and overtime.
4 Summaries of the bias, standard deviation (SD), average of the Bayesian SD estimate (SD̂), and coverage rate (CR) of 95% credible intervals when Z2 is continuous: ξ = α0 = 0.5, α2 = 1, (β1, β2) = (2, 1), and Z2 ∼ N(0, 1).
5 Summaries of the bias, standard deviation (SD), average of the Bayesian SD estimate (SD̂), and coverage rate (CR) of 95% credible intervals when Z2 is binary: ξ = α0 = 0.5, α2 = 1, (β1, β2) = (2, 1), and Z2 ∼ Bernoulli(0.5).
6 Percentages of correct decisions on the variables in 200 replicates with continuous Z2. The significant parameters except α1 all equal 1. All covariates were generated from the standard normal distribution.
7 Percentages of correct decisions on the variables in 200 replicates with binary Z2. The significant parameters except α1 all equal 1. Z2 was generated from Bernoulli(0.5), while the other covariates were from the standard normal.
8 Covariates used in the intensity model and the mark model.
9 Summaries of mDIC and mLPML for the mark model and DIC for the full joint model.
10 Data analysis results using the intensity dependent model for Curry and Durant.
11 Real data analysis results using the intensity dependent model for Harden and James.
12 Simulation results of parameter estimation for the Gaussian mixture model with symmetric Dirichlet process prior.

List of Figures

1 Chinese restaurant process illustration.
2 Illustration of the modified Chinese restaurant process based on MFM.
3 Histograms of K̂ and overlaid trace plots of RI from the 100 replicates. Settings 1 and 2 are on the left and right panels, respectively. The optimal r was selected by the BITC. The RI for each replicate was under the optimal r for that replicate. The thick lines are the averages of the trace plots over the 100 replicates.
4 Histogram of K̂ chosen by LPML and DIC over 100 replicates. Results of settings 1 and 2 are on the left and right panels, respectively.
5 Simulation configurations for the baseline intensity, with fitted baseline intensity surfaces. Medians and quantiles are calculated from 100 replicates.
6 Boxplots of MSE over the 100 replicates for four models.
7 The locations of Beilschmiedia pendula and heat maps of the standardized covariates of the BCI data.
8 BCI data analysis: trace plots of β and K after burn-in and thinning.
9 Heat map of the fitted baseline intensity surface of the BCI data using r = 1.4. From top to bottom: the 2.5%, 50%, and 97.5% percentiles of the posterior distribution of λ(s).
10 Histograms of K̂ and overlaid trace plots of RI from the 100 replicates. Settings 1 and 2 are on the left and right panels, respectively. The optimal h was selected by the LPML or DIC. The thick lines are the averages of the trace plots over the 100 replicates.
11 Simulation configurations for the intensity surface, with fitted intensity surfaces. Medians and quantiles are calculated from 100 replicates.
12 Histograms of K̂ and overlaid trace plots of RI from the 100 replicates. Settings 1 and 2 are on the left and right panels, respectively. The optimal r was selected by BITC. The thick lines are the averages of the trace plots over the 100 replicates.
13 Simulation configurations for the intensity surface, with fitted intensity surfaces. Medians and quantiles are calculated from 100 replicates.
14 Shot charts of Curry, Durant, Harden and James in the 2017–2018 regular NBA season.
15 Intensity fit results of Curry, Durant, Harden and James on the same scale. Red means higher intensity.
16 Histogram of K̂ selected by BITC, LPML and DIC under the GM model and symmetric Dirichlet prior.

Chapter 1

Introduction

Spatial point pattern data, which are random locations of certain events of interest in space (e.g., Diggle, 2013), arise routinely in many fields. Such patterns can be, for example, the locations of basketball shooting attempts in sports analytics (Miller et al., 2014), earthquake centers in seismology (Schoenberg, 2003), or tree species in forestry (Leininger et al., 2017; Thurman and Zhu, 2014). We first introduce the spatial point process, which is often used to model such data. It uses an intensity surface over the region of interest to characterize the probability that a certain event happens at any location of the region. The intensity often varies across locations, and the main task is to capture such heterogeneity. Parametric methods are used when covariates are available, and nonparametric methods are also actively discussed because of their flexibility. The most popular nonparametric methods include the Dirichlet process (DP) and the mixture of finite mixtures (MFM) model, which model the nonhomogeneous intensity values as coming from different groups. One convenient way to set up the DP and MFM is to introduce an index parameter z. The hierarchical model structure makes Bayesian inference natural for such models. The prior distribution on z can then be set up as a stick-breaking process or a Chinese restaurant process (CRP). Both DP and MFM assume an independent prior distribution for the index parameter z. We then consider the spatial homogeneity of the intensity surface displayed by some datasets, and introduce different ways to impose spatial smoothness constraints. Finally, marked spatial point pattern data are discussed, and the marked spatial point process is introduced for such data.

1.1 Spatial Point Process

Spatial point pattern data are modeled by spatial point processes (Diggle et al., 1976) characterized by a quantity called the intensity. Within an observation region B, the intensity at any location s ∈ B can be represented as λ(s), which is defined as

λ(s) = lim_{|ds|→0} E[N(ds)] / |ds|,

where ds is an infinitesimal region around s, |ds| represents its area, and N(ds) is the number of events in ds. The Poisson distribution is often used to model count data, so the spatial Poisson point process is widely used for point pattern data. The spatial Poisson point process is defined by the following two conditions (Thurman and Zhu, 2014):

1. The count of data points over any region A follows a Poisson distribution:

N(A) ∼ Poisson(λ(A)),  λ(A) = ∫_A λ(s) ds.   (1.1)

2. Conditional on the total number of events N, the joint density f of the event locations S = {s_1, s_2, …, s_N} is proportional to the product of the intensity values λ(s_i):

f(S | N) ∝ ∏_{i=1}^N λ(s_i).

The second condition says that, conditional on the number of events, the locations of the events are independent. The likelihood function can then be expressed as

L = ∏_{i=1}^N λ(s_i) / ∫_B λ(s) ds.   (1.2)
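The conditional uniformity in the second condition suggests a standard way to simulate from a nonhomogeneous spatial Poisson process: generate a homogeneous process at a dominating rate and thin it. The sketch below is illustrative rather than part of the dissertation; the function names (`simulate_nhpp`, `poisson_draw`), the unit-square region, and the requirement that `lam_max` upper bounds λ(s) on the region are my own assumptions.

```python
import math
import random

def poisson_draw(mu, rng=random):
    """Poisson sample via Knuth's method (fine for moderate mu)."""
    threshold = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_nhpp(lam, lam_max, region=(0.0, 1.0, 0.0, 1.0), rng=random):
    """Simulate a nonhomogeneous spatial Poisson process on a rectangle by
    thinning: draw a homogeneous process with rate lam_max, then keep each
    point s independently with probability lam(s) / lam_max."""
    x0, x1, y0, y1 = region
    n = poisson_draw(lam_max * (x1 - x0) * (y1 - y0), rng)  # N ~ Poisson(lam_max |B|)
    points = []
    for _ in range(n):
        s = (rng.uniform(x0, x1), rng.uniform(y0, y1))       # uniform given N
        if rng.random() < lam(s) / lam_max:                  # thinning step
            points.append(s)
    return points
```

Conditional on the number of homogeneous points, their locations are uniform, which is exactly the second condition above; the acceptance step then reshapes the homogeneous process into one with intensity λ(s).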

There are some other spatial point processes. The Cox process (Cressie, 1992), also called the doubly stochastic Poisson process, is an extension of the Poisson process that assumes the intensity of the Poisson process is itself a random process. This class includes the classic log Gaussian Cox process (LGCP) (Thurman et al., 2015; Miller et al., 2014), which uses a Gaussian process for the log intensity in the spatial Poisson point process. Using a Gaussian Markov random field (GMRF) in the LGCP allows spatial smoothness and accounts for spatial correlation among locations that cannot be explained by the covariates (Yue and Loh, 2011). However, such doubly stochastic models are often hard to evaluate and computationally challenging to fit (Murray et al., 2012). The Gibbs process (Illian et al., 2008; Chiu et al., 2013) focuses on dependent point patterns, where the dependence can be attractive or repulsive, depending on geometric features. The Neyman-Scott process (Cressie, 1992) is a hierarchical model with three levels, whose upper level is a nonhomogeneous Poisson process; the observed data points are generated only from the lowest layer of the hierarchy. Because of the simplicity of the Poisson process, and because it is theoretically tractable and straightforward to implement computationally, we focus on the spatial Poisson point process in this dissertation.

When λ(s) varies over locations, the spatial point process is called nonhomogeneous, which is often the case in practice. The primary interest in a spatial point process is to characterize the heterogeneous intensity. When spatial covariates are available, most works focus on parametric methods that incorporate these covariates into the intensity function. Let X(s) be the p-dimensional vector of covariate values at location s; a common way to model the intensity is through an exponential link function:

λ(s) = λ_0 exp(X(s)^⊤ β),   (1.3)

where β is a p-dimensional coefficient vector and λ_0 is the baseline intensity.

1.2 Nonparametric Methods for Intensity Function

Parametric methods can reveal the relationship between the events of interest and different factors. They can be very useful for seeing how different factors influence the probability that the event happens. But the limitations are also obvious. A parametric model accounts only for a limited set of covariates, which means it assumes the spatial point process is influenced only by those known covariates. However, spatial point pattern data are often the result of complex interactions between physical, environmental, and ecological processes. For example, a census of tree locations in a forest includes many generations of trees whose growth and decay are affected by soil, rainfall, geographical factors, other tree species, and so on. Another example is data on event locations whose dynamics are not fully understood and whose known influential factors are hard to measure. This makes it very hard for a parametric model to capture the true pattern, or even impossible when some factors are unavailable.

Nonparametric methods allow more flexibility in the intensity function. Frequentist methods such as kernel estimation (Diggle, 1985, 2003; Baddeley et al., 2012) often involve parameter tuning in an ad hoc manner. A different angle on capturing nonhomogeneous intensity values nonparametrically is to view different intensity values as coming from different groups, with the value staying the same within each group. Two classic Bayesian methods for modeling data from different distributions are the Dirichlet process and the mixture of finite mixtures model. They provide a probabilistic framework under which the intensity values, grouping memberships, and the number of groups can be estimated simultaneously. Because of these advantages, we focus on these Bayesian nonparametric methods.

1.3 Dirichlet Process

The Dirichlet process is a widely used nonparametric method for data from different groups. It has gained large popularity because of its flexibility in allowing an unknown number of groups. It can be applied to the spatial point process since the heterogeneous intensity values can be viewed as coming from different groups. With a partition B = ∪_{i=1}^n A_i, we can discretize the intensity function λ(s) by assuming its value is constant within each small region A_i: λ(s) = λ(A_i) for all s ∈ A_i. Then we have:

λ(A_i) ∼ G,
G ∼ DP(α, G_0),

where G ∼ DP(α, G_0) refers to G being a random distribution generated by a Dirichlet process with concentration parameter α and base distribution G_0 (MacEachern and Müller, 1998).

The Dirichlet process can be viewed as a distribution over distributions. It is very difficult to configure and draw samples from a DP directly, since the information needed to specify a distribution is infinite. An alternative way to express a DP is to introduce a membership index z that shows which group each intensity value comes from. For each small region A_i, z_i is defined to take an integer value from 1 to K, where K is the total number of groups. The unique intensity values of the different groups are denoted {λ_k}_{k=1}^K, and λ(A_i) = λ_{z_i}. Along with the spatial Poisson point process defined in (1.1), we can write the hierarchical model as:

S | λ, z ∼ SPP(λ(s)),
λ(s) = λ_{z_i}, for all s ∈ A_i, i = 1, …, n,
Pr(z_i = k | π) = π_k, i = 1, …, n, k = 1, …, K,   (1.4)
π | α ∼ Dirichlet(α/K, …, α/K),
λ_k ∼ G_0, k = 1, …, K,

where SPP(λ(s)) represents a spatial Poisson point process with intensity function λ(s), λ = {λ_1, λ_2, …, λ_K}, z = {z_1, z_2, …, z_n}, π = {π_1, π_2, …, π_K}, Dirichlet(α/K, …, α/K) is the symmetric Dirichlet distribution over the (K − 1)-dimensional simplex {(π_1, …, π_K): Σ_{k=1}^K π_k = 1, π_k > 0} with density q(π_1, …, π_K) = Γ(α)/Γ(α/K)^K ∏_{k=1}^K π_k^{α/K−1}, and G_0 is the base distribution of the Dirichlet process. When K goes to infinity, this model becomes the Dirichlet process mixture model (DPMM) (Ishwaran and Zarepour, 2002).

The form of the prior distributions in (1.4) leads to conditional distributions that make it very straightforward to use the Gibbs sampler to obtain posterior samples (Neal, 2000). By introducing the membership index parameter z, we place the focus on the grouping effects of the intensity values over the whole space. Different regions, possibly spatially distant, can be grouped together to share the same intensity value. This is very meaningful in practice since it helps us identify regions that share similar properties.

1.4 Mixture of Finite Mixtures

An alternative Bayesian way to model data from a mixture of distributions is to take the usual finite mixture model with symmetric Dirichlet weights and put a prior on the number of groups. This is the so-called mixture of finite mixtures model. One way to present the spatial Poisson point process under the MFM framework is as follows:

S | λ, z ∼ SPP(λ(s)),
λ(s) = λ_{z_i}, for all s ∈ A_i,
Pr(z_i = k | π) = π_k, i = 1, …, n,   (1.5)
π ∼ Dirichlet_K(γ, γ, …, γ),
K ∼ q(K),
λ_k ∼ G_0, k = 1, …, K,

where q(K) is a probability mass function on {1, 2, …}, and γ is the hyperparameter of the symmetric Dirichlet distribution.

The MFM also allows the number of groups K to be unknown and determined entirely by the data. It shares many essential properties with the DPMM, including the exchangeable partition distribution, a restaurant-process representation, and a stick-breaking representation. The MFM analogues are simple enough that they can be used much like the corresponding DPMM properties, so many powerful inference methods developed for the DPMM can be applied directly to the MFM as well (Miller and Harrison, 2018). The MFM exhibits two major advantages over the DPMM (Miller, 2014). The first is that the number of groups estimated by the MFM is consistent (Geng et al., 2019), while the DPMM does not estimate K consistently. This is because the priors on the number of groups are quite different in the two models: in the MFM, the number of groups converges to a finite value with probability one as the number of observations goes to infinity, while in the DPMM the number of groups goes to infinity with probability one as the sample size grows. The second advantage is that the group sizes are all of the same order of magnitude in the MFM model, while in the DPMM many small groups are negligible compared to the large groups.

The MFM also has disadvantages compared with the DPMM. Since the MFM disfavors small groups, the mixing time of incremental MCMC samplers can be worse than for the DPMM. Another inconvenience is that the coefficients of the partition distribution need to be precomputed:

V_n(K) = Σ_{k=1}^∞ [k_(K) / (γk)^(n)] q(k),   (1.6)

where V_n(K) is the partition distribution coefficient, k_(K) = k(k − 1)⋯(k − K + 1) and (γk)^(n) = γk(γk + 1)⋯(γk + n − 1) are the falling and rising factorials with x_(0) = 1 and x^(0) = 1, K is the number of groups in the partition, and q(k) is the probability mass function of the prior distribution on K defined in the MFM (1.5).
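Since (1.6) is an infinite series, in practice V_n(K) is precomputed by truncating the sum. The snippet below is a minimal sketch under assumptions of my own: an illustrative shifted Poisson(1) prior q(k) = e^{−1}/(k − 1)! on the number of components (a common choice in the MFM literature, not necessarily the one used in this dissertation), and truncation at `kmax` where the remaining terms are negligible.

```python
import math

def v_n(K, n, gamma=1.0, kmax=150):
    """Truncated evaluation of V_n(K) in (1.6) under an illustrative
    shifted Poisson(1) prior q(k) = e^{-1}/(k-1)!; terms with k < K
    vanish because the falling factorial k_(K) is zero there."""
    total = 0.0
    for k in range(K, kmax + 1):
        q_k = math.exp(-1.0) / math.factorial(k - 1)
        falling = 1.0                      # k_(K) = k(k-1)...(k-K+1)
        for j in range(K):
            falling *= k - j
        rising = 1.0                       # (gamma*k)^(n) = gamma*k (gamma*k+1) ...
        for j in range(n):
            rising *= gamma * k + j
        total += falling / rising * q_k
    return total
```

The ratio V_n(K* + 1)/V_n(K*), which appears in the modified restaurant process later in this chapter, can then be read off directly from these precomputed values.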

For both the DPMM and MFM configured in (1.4) and (1.5), we can integrate out the parameter π and obtain the marginal distribution of z. The following section introduces different ways to configure the marginal prior distribution of z under the DPMM. A similar discussion can be carried out under the MFM framework, which is not the focus of this dissertation.

1.5 Prior on Membership Parameter z

The distribution of the membership index z induced by the Dirichlet process can be expressed with different formulations. Stick-breaking and the Chinese restaurant process are two popular ways to specify a prior distribution on z. Stick-breaking uses the discreteness of the distributions drawn from a Dirichlet process and expresses the desired distribution as a mixture of point mass distributions, where the mixture weights are given by a procedure resembling the breaking of a unit-length stick (hence the name). The Chinese restaurant process focuses more on the grouping effects of the Dirichlet process, in contrast to the stick-breaking process, whose focus is more on likelihood estimation.

1.5.1 Stick-Breaking Process

A Dirichlet process DP(α, G_0) is specified by a base distribution G_0 and a positive real number α called the concentration parameter (also known as the scaling parameter). The base distribution is the mean of the process, and the Dirichlet process can be viewed as drawing distributions around the base distribution. However, even if the base distribution is continuous, the distributions drawn from the Dirichlet process are almost surely discrete. The scaling parameter α specifies how strong this discretization is: when α → 0, the realizations are concentrated at a single value, while in the limit α → ∞, the realizations become continuous. Between the two extremes, the realizations are discrete distributions with less and less concentration as α increases.

Starting from the property that a distribution drawn from a DP is discrete with probability one, the marginal distribution of the membership parameter z_i can be expressed by the probability mass function in (1.7) (Sethuraman, 1994):

f_{z_i}(c) = Σ_{h=1}^∞ π_h δ_h(c), i = 1, 2, …,
π_h = ν_h ∏_{l<h} (1 − ν_l),   (1.7)
ν_h ∼ Beta(1, α), h = 1, 2, …,

where δ_h(c) is the indicator function corresponding to a point mass distribution at h, whose value is one at point h and zero otherwise, and Beta(1, α) is a Beta distribution with mean 1/(α + 1). The resemblance to the name "stick-breaking" can be seen by considering the mixture weight π_h as the length of a piece of stick. We start with a unit-length stick, and in each step we break off a portion of the remaining stick according to ν_h and assign this broken piece to π_h. The formula can be understood by noting that after the first h − 1 values have their portions assigned, the length of the remainder of the stick is ∏_{l=1}^{h−1} (1 − ν_l); this piece is broken according to ν_h and the broken-off part is assigned to π_h. The smaller α is, the less of the stick is left for subsequent values on average, yielding more concentrated distributions.
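The construction in (1.7) translates directly into a truncated sampler for the weights. The following sketch is illustrative; the truncation threshold `eps` and the function name are my own choices, and the leftover stick mass is folded into a final weight so the draws sum to one.

```python
import random

def stick_breaking_weights(alpha, eps=1e-10, rng=random):
    """Draw mixture weights pi_h = nu_h * prod_{l<h} (1 - nu_l) with
    nu_h ~ Beta(1, alpha), stopping once the remaining stick is < eps."""
    weights, remaining = [], 1.0
    while remaining > eps:
        nu = rng.betavariate(1.0, alpha)   # nu_h ~ Beta(1, alpha)
        weights.append(nu * remaining)     # break off a piece for pi_h
        remaining *= 1.0 - nu              # what is left of the stick
    weights.append(remaining)              # fold leftover mass into a final weight
    return weights
```

Smaller α tends to leave less stick for later weights, matching the concentration behavior described above.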

1.5.2 Chinese Restaurant Process

The Chinese restaurant process is derived from the Dirichlet process and gives the distribution of the membership parameter z = {z_1, z_2, …, z_n} (Pitman, 1995; Neal, 2000). The name comes from the popular Chinese restaurant metaphor: a customer comes into a Chinese restaurant with infinitely many tables available and chooses a table to sit at, and all customers are grouped according to the tables they choose.

Figure 1: Chinese restaurant process illustration.

The grouping problem of different intensity values in the spatial Poisson point process can be mapped to this situation if we think of {λ(A_1), λ(A_2), …, λ(A_n)} as n customers, with the membership parameter z_i giving the label of the table that the i-th customer chooses. The CRP states that the first customer sits at table one, i.e., z_1 = 1; each subsequent customer either sits at a table that already has customers or opens a new table. The conditional distribution of z_i given the information of all

previous i − 1 customers (i.e., given {z_1, z_2, …, z_{i−1}}) follows a "rich get richer" fashion. The probability of a customer joining an existing table is proportional to the number of customers already sitting at that table, and the probability of opening a new table is proportional to α; see Figure 1. This is also called the Pólya urn scheme (Blackwell et al., 1973), and the conditional distribution is shown in (1.8):

Pr(z_i = c | z_1, …, z_{i−1}) ∝ { n_c, at an existing group labeled c;  α, at a new group },   (1.8)

where α > 0 is the concentration parameter of the Dirichlet process, and n_c is the number of customers at table c.
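The seating rule in (1.8) can be simulated sequentially. The sketch below is illustrative; `crp_tables` is an assumed name, and table labels start from 0 rather than 1.

```python
import random

def crp_tables(n, alpha, rng=random):
    """Seat n customers by the CRP in (1.8): join table c with probability
    proportional to n_c, or open a new table with probability
    proportional to alpha. Returns labels and table sizes."""
    z = [0]                  # first customer opens the first table
    counts = [1]             # customers per table
    for _ in range(1, n):
        weights = counts + [alpha]           # existing tables, then a new one
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(counts):
            counts.append(1)                 # open a new table
        else:
            counts[table] += 1               # "rich get richer"
        z.append(table)
    return z, counts
```

Running this repeatedly shows the behavior discussed in the next subsection: the number of occupied tables keeps growing with n, which is the root of the overclustering problem.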

Based on the Pólya urn scheme in (1.8), the conditional distribution of the intensity value λ(A_i), i = 2, 3, …, n, is given in Proposition 1, which is derived by directly rewriting the conclusion of Blackwell et al. (1973) under the model setup in (1.4).

Proposition 1. With a partition {A_i}_{i=1}^n of the whole area, if the intensity of a spatial point process within each region A_i is assumed to be constant, denoted λ(A_i), its conditional distribution under DP(α, G_0) and the CRP can be expressed as in (1.9):

q(λ(A_i) | λ(A_1), …, λ(A_{i−1})) ∝ Σ_{k=1}^{K*} ( Σ_{j<i} δ_k(z_j) ) δ_{λ_k}(λ(A_i)) + α G_0(λ(A_i)),   (1.9)

where K* denotes the number of groups among {λ(A_j)}_{j<i}, δ_x(y) is the indicator function that takes value 1 when x = y and zero otherwise, and {λ_k}_{k=1}^{K*} are the unique intensity values of the different groups. For spatial Poisson point processes, a common choice for G_0 is the Gamma distribution, because it is a conjugate prior for λ, and it gives a closed-form full conditional distribution for z.
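To illustrate the conjugacy just mentioned: if G_0 = Gamma(a, b) and the point count in region A_i is Poisson with mean λ(A_i)|A_i|, then each group's intensity has a Gamma full conditional. The sketch below is a hypothetical Gibbs update step, not the dissertation's implementation; the hyperparameters a, b and the function name are assumptions.

```python
import random

def update_group_intensities(point_counts, areas, z, a=1.0, b=1.0, rng=random):
    """Conjugate Gibbs update under a Gamma(a, b) base distribution G0:
    lambda_k | rest ~ Gamma(a + total count in group k, b + total area of group k).
    point_counts[i] = N(A_i), areas[i] = |A_i|, z[i] = group label of region i."""
    new_lambda = {}
    for k in set(z):
        n_k = sum(c for c, zi in zip(point_counts, z) if zi == k)
        area_k = sum(s for s, zi in zip(areas, z) if zi == k)
        # random.gammavariate takes (shape, scale); rate b + area_k -> scale 1/(b + area_k)
        new_lambda[k] = rng.gammavariate(a + n_k, 1.0 / (b + area_k))
    return new_lambda
```

This is the kind of closed-form update that makes the Gamma base distribution convenient in a sampler that alternates between updating z and the group intensities.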

1.5.3 Overclustering Problem of DP

The membership parameter z induces a partition of {λ(A_i)}_{i=1}^n into K different groups. Using {n_k}_{k=1}^K to denote the sizes of the different groups, the probability of the group sizes of this partition under (1.8) is proportional to (1.10):

∏_{k=1}^K 1/n_k.   (1.10)

This probability shows that clusters with relatively small sizes are assigned larger probability under the CRP. As an alternative form of the Dirichlet process, the CRP also gives an inconsistent estimate of the number of groups K, even when the sample size goes to infinity (Miller and Harrison, 2018). Since the probability that a new customer goes to a new table is always positive, the CRP always produces new groups, resulting in redundant small groups. This is the so-called "overclustering" problem.

Modified CRP Based on MFM

Different methods have been proposed to solve the overclustering problem of the CRP. One popular method is to modify the CRP based on the mixture of finite mixtures model (Miller and Harrison, 2018), which gives a consistent estimate of K (Geng et al., 2019). The idea is to put a prior distribution on K, and the model can be specified as follows:

Pr(z_i = k | π, K) = π_k, i = 1, …, n,
π = (π_1, π_2, …, π_K),   (1.11)
π ∼ Dirichlet_K(γ, …, γ),
K ∼ q(K),

where q(K) is the same probability mass function as defined in the MFM (1.5). Then the conditional distribution of z_i, i = 2, 3, …, n, can be expressed as in (1.12):

Pr(z_i = c | z_1, …, z_{i−1}) ∝ { n_c + γ, at an existing group labeled c;  [V_n(K* + 1)/V_n(K*)] γ, at a new group },   (1.12)


Figure 2: Illustration of modified Chinese restaurant process based on MFM.

where the function V_n(·) is defined the same way as in (1.6), K* is the number of groups among {z_1, …, z_{i−1}}, γ is the same hyperparameter as in (1.11), and n_c = Σ_{j<i} δ_c(z_j) is the number of members in group c. This is quite similar to the CRP and can be illustrated with a similar restaurant-process plot, shown in Figure 2, but the probability of creating a new group is reduced by the factor V_n(K* + 1)/V_n(K*), which makes the model less likely to produce redundant tiny groups. In contrast to (1.10), the probability of the group sizes of the partition induced by z under the modified CRP in (1.11) is proportional to:

∏_{k=1}^K 1/n_k^{1−γ}.   (1.13)

It is obvious that the modified CRP assigns smaller probability to small groups than the CRP does. The relative sizes of the groups are controlled by the parameter γ.
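The contrast between (1.10) and (1.13) can be checked numerically. The helper below (an illustrative name, not from the dissertation) evaluates the unnormalized group-size probabilities; passing `gamma=None` gives the CRP form (1.10), while a value of γ gives the modified-CRP form (1.13).

```python
def partition_size_prob(sizes, gamma=None):
    """Unnormalized probability of group sizes n_1, ..., n_K:
    prod_k 1/n_k under the CRP (1.10), or prod_k 1/n_k^(1-gamma)
    under the modified CRP (1.13)."""
    exponent = 1.0 if gamma is None else 1.0 - gamma
    p = 1.0
    for n_k in sizes:
        p *= n_k ** -exponent
    return p

# Relative preference for a partition containing a small group,
# versus a balanced partition of the same 10 items:
crp_ratio = partition_size_prob([9, 1]) / partition_size_prob([5, 5])
mod_ratio = partition_size_prob([9, 1], gamma=0.5) / partition_size_prob([5, 5], gamma=0.5)
```

Here `crp_ratio` = 25/9 ≈ 2.78 while `mod_ratio` = 5/3 ≈ 1.67, showing that the modified CRP downweights partitions with small groups relative to the plain CRP, with γ controlling how strongly.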

Bayesian Repulsive Gaussian Mixture Models

A Bayesian repulsive mixture model was proposed under the Gaussian mixture model framework by Xie and Xu (2019) to encourage well-separated groups, and the proposed Bayesian repulsive Gaussian mixture model (RGMM) has been shown to estimate the number of groups K consistently. Assume the observed data points y = {y_1, y_2, . . . , y_n}, y_i ∈ R^p, are from a p-dimensional Gaussian mixture model, and we again use the membership parameter z_i for each observation to indicate which group it belongs to. The Bayesian hierarchical model can then be written as follows:

y ∼ GMp(µ1, Σ1,..., µK , ΣK ),

µ(yi) = µzi , Σ(yi) = Σzi ,

Pr(zi = k | π) = πk, i = 1, . . . , n, (1.14)

π = (π1, . . . , πK ) ∼ DirichletK (γ),

K ∼ q(K),

(µ_k, Σ_k)_{k=1}^{K} | K ∼ q(µ_1, Σ_1, . . . , µ_K, Σ_K | K),

where GM_p(µ_1, Σ_1, . . . , µ_K, Σ_K) represents a p-dimensional Gaussian mixture model with K different groups, whose unique group means and variances are {µ_k}_{k=1}^{K} and {Σ_k}_{k=1}^{K}, respectively, q(K) is the same probability mass function as defined in the MFM model (1.5), and q(µ_1, Σ_1, . . . , µ_K, Σ_K | K) is some density function with respect to the Lebesgue measure on (R^p × S)^K.

In contrast to the DP and MFM, which assume (µ_k, Σ_k)_{k=1}^{K} are independent and identically distributed from a base distribution G_0, Xie and Xu (2019) introduced repulsion among the components Normal(µ_k, Σ_k) through their centers µ_k, so that they are well separated. The joint density of (µ_k, Σ_k)_{k=1}^{K} is assumed to be of the form

q(µ_1, Σ_1, . . . , µ_K, Σ_K | K) = (1/C_K) [Π_{k=1}^{K} q_µ(µ_k) q_Σ(Σ_k)] h_K(µ_1, . . . , µ_K),

C_K = ∫ ··· ∫_{R^{p×K}} h_K(µ_1, . . . , µ_K) [Π_{k=1}^{K} q_µ(µ_k)] dµ_1 ··· dµ_K,

where C_K is the normalizing constant, and h_K : (R^p)^K → [0, 1] is invariant under permutation of its arguments, with the repulsive condition that h_K(µ_1, . . . , µ_K) = 0 if and only if µ_k = µ_{k′} for some k ≠ k′, k, k′ ∈ {1, 2, . . . , K}.

The RGMM has some nice properties that independent priors such as the DP do not have, including posterior consistency and a shrinkage effect on the tail probability of the number of components K. Its application to spatial point process models is of future interest.

Powered Chinese Restaurant Process

A more recently developed method based on the CRP to solve the overclustering problem is the powered Chinese restaurant process (Lu et al., 2018), a simple modification of the CRP that increases the probability that a new customer joins an existing table. The conditional distribution of z under the PCRP is shown in (1.15):

Pr(z_i = c | z_1, . . . , z_{i−1}) ∝
  n_c^r,   at an existing group labeled c,
  α,       at a new group,                                        (1.15)

where r is a prespecified power parameter, and the other quantities are the same as in the CRP prior (1.8). When r = 1, the PCRP reduces to the CRP; when r > 1, the probability that z_i takes an existing value is increased, so the number of groups K will be reduced. As r → ∞, all observations eventually go to one group and K̂ equals one.
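As a concrete illustration, sequential draws from the PCRP prior in (1.15) can be sketched as follows. This is a minimal simulation of the prior, not code from the dissertation; the function name `pcrp_assign` and the NumPy-based implementation are our own.

```python
import numpy as np

def pcrp_assign(n, alpha=1.0, r=1.0, rng=None):
    """Sequentially assign n items to groups under PCRP(alpha, r).

    The first item starts group 0; each later item joins an existing
    group c with probability proportional to n_c**r, or opens a new
    group with probability proportional to alpha, as in (1.15).
    """
    rng = np.random.default_rng(rng)
    z = [0]          # group label of each item
    sizes = [1]      # current group sizes n_c
    for _ in range(1, n):
        weights = np.array([s**r for s in sizes] + [alpha])
        c = rng.choice(len(weights), p=weights / weights.sum())
        if c == len(sizes):   # the last slot opens a new group
            sizes.append(1)
        else:                 # join existing group c
            sizes[c] += 1
        z.append(c)
    return np.array(z)

# Larger r concentrates the partition into fewer, larger groups.
z1 = pcrp_assign(200, alpha=1.0, r=1.0, rng=42)
z5 = pcrp_assign(200, alpha=1.0, r=5.0, rng=42)
```

With r = 1 this reduces to the ordinary CRP; increasing r shrinks the relative weight of the new-group option, which is the mechanism discussed above.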

Under DP(α, G_0) and the PCRP, the conditional distribution of λ(A_i) in Proposition 1 becomes

q(λ(A_i) | λ(A_1), . . . , λ(A_{i−1})) ∝ Σ_{k=1}^{K*} (Σ_{j<i} δ_k(z_j))^r δ_{λ_k}(λ(A_i)) + α G_0(λ(A_i)).   (1.16)

1.6 Spatial Smoothness of the Intensity of Point Process

The Dirichlet process gives an independent prior distribution for the membership parameters {z_i}_{i=1}^{n}, which gives the mixture model high flexibility. The probability that a region joins any group is not influenced by its neighbors, so no spatial smoothness is displayed in such a model.

In order to capture the spatially continuous features displayed by some datasets, several methods have been proposed to impose spatial homogeneity in various models. Jiang and Serban (2012) introduced a model-based method for clustering random time-varying functions that are spatially interdependent. They assumed that the spatial clustering membership parameter z is a realization from a locally dependent Markov random field (MRF), which is a stochastic process with the Markov property. They also introduced a spatially correlated error term into the data model, conditioning on the cluster membership parameter. However, this method does not fully guarantee spatial contiguity.

Other methods focus on identifying spatial clusters in various spatial models.

Knorr-Held and Raßer (2000) discussed spatially varying rates of disease incidence or mortality in the epidemiological area. They used a nonparametric Bayesian method based on reversible jump MCMC methodology for spatial clusters. They divided the whole area into different spatial clusters with specified spatial centers, while allowing the number of clusters, the location of each cluster, and the risk within each cluster to be unknown. Similarly, Kim et al. (2005) also divided the area into different spatial clusters for geostatistical data. They assumed stationarity within the same cluster and independence across different clusters. This was realized using piecewise Gaussian processes, with an unknown number of clusters, unknown shape of each cluster, and an unknown model within each cluster.

Spatial clustering effects of regression coefficients have also been discussed. Lee et al. (2017) identified spatial clusters of coefficients using a method based on hypothesis testing. For a single cluster, an F statistic is used to test potential circular clusters of a regression coefficient against the null hypothesis that the regression coefficient is the same over the entire spatial domain. For multiple clusters, a sequential detection approach is adopted.

Li and Sang (2019) proposed a so-called spatially clustered coefficients (SCC) model to identify spatially clustered patterns of coefficients in the spatial regression modeling setting. The spatial smoothness is realized by imposing penalties on the difference between regression coefficients at any two locations connected in an edge set. The edge set is selected using a minimum spanning tree (MST) to incorporate spatial information and achieve computational efficiency. This makes locations that are spatially distant unlikely to be assigned to the same group.

Introducing spatial smoothness into the Dirichlet process is also of interest. Applying a Markov random field prior in spatial statistical modeling is a classical Bayesian approach widely used in image segmentation problems (Geman and Geman, 1984; Winkler, 2012). Orbanz and Buhmann (2008) incorporated an MRF to add spatial smoothness constraints to the Dirichlet process. Hu et al. (2020) proposed a Markov random field constrained MFM model for spatial functional data with spatial continuity. An MRF uses a neighborhood graph, which is a collection of random variables defined on an undirected, weighted graph N = (V_N, E_N, W_N), where V_N is the vertex set representing random variables {θ_i}_{i=1}^{n} at n different locations, E_N is the edge set showing the statistical dependence structure, and W_N is the set of weights on the edges giving the magnitude of dependence.

The Markovian property means that the conditional distribution of each θ_i depends only on its neighbors, denoted θ_∂(i):

Π(θ_i | θ_−i) = Π(θ_i | θ_∂(i)).

The MRF distribution Π(θ_1, θ_2, . . . , θ_n) can play the role of the prior distribution in a Bayesian model, such as the Dirichlet process mixture model or the mixture of finite mixtures model. It can be decomposed as

Π(θ1, θ2, . . . , θn) ∝ P (θ1, θ2, . . . , θn)M(θ1, θ2, . . . , θn),

where M(θ_1, θ_2, . . . , θ_n) is the interaction term from a cost function h(θ_1, θ_2, . . . , θ_n), and P is the site-wise term. The MRF-constrained DP prior can then be written as

(θ_1, . . . , θ_n) ∼ M(θ_1, . . . , θ_n) Π_{i=1}^{n} G(θ_i),          (1.17)

G ∼ DP(α, G_0).

The Gibbs sampler can be easily adapted to this model by simply modifying the full conditional distribution of z_i under the CRP using a cost function.

Another approach to modifying the CRP to include spatial homogeneity is the distance dependent CRP proposed by Blei and Frazier (2011). Starting from the Chinese restaurant metaphor, they consider the probability that a new customer chooses to sit with another customer, instead of choosing a table to join. Spatial smoothness constraints are introduced by incorporating the spatial distances between customers into the distribution. A new membership parameter c_i is introduced for each customer, whose value indicates which customer the i-th customer chooses to sit with. The distribution

of c_i is

Pr(c_i = j | D, f) ∝
  f(d_ij),   j ≠ i,
  α,         j = i,                                             (1.18)

where D = {d_ij}_{i,j=1}^{n} is the distance matrix giving the distances between customers, and f is a decay function that decreases with distance.
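A minimal sketch of drawing customer links from (1.18) and recovering the induced table partition follows. The exponential decay function, the helper names `ddcrp_links` and `links_to_tables`, and the union-find step for connected components are illustrative assumptions, not Blei and Frazier's implementation.

```python
import numpy as np

def ddcrp_links(coords, alpha=1.0, decay=lambda d: np.exp(-d), rng=None):
    """Draw customer links c_i under the ddCRP prior (1.18):
    customer i sits with customer j with probability ∝ f(d_ij),
    or with itself (opening a new table) with probability ∝ alpha."""
    rng = np.random.default_rng(rng)
    n = len(coords)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    links = np.empty(n, dtype=int)
    for i in range(n):
        w = decay(d[i])
        w[i] = alpha          # self-link weight
        links[i] = rng.choice(n, p=w / w.sum())
    return links

def links_to_tables(links):
    """Tables are the connected components of the link graph."""
    n = len(links)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i, j in enumerate(links):
        parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    labels = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return np.array([labels[r] for r in roots])
```

Note that the ddCRP places a prior on links rather than directly on table labels; the partition is only obtained after grouping linked customers, which is why the second helper is needed.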

1.7 Marked Spatial Point Process

Work on spatial point pattern data mainly focuses on characterizing the spatial pattern, which is captured by the intensity function λ(s). Besides the spatial locations of the events of interest, there may be other measurements recorded along with each event in the dataset. Such measurements are called marks. Examples include earthquake magnitudes recorded along with earthquake locations (Hu and Bradley, 2018), basketball shot outcomes (made or missed) at each shot attempt location, and tree heights or widths along with tree locations. Marks are often related to locations, so considering marks can improve the modeling accuracy of the spatial point pattern as well as the understanding of the point pattern (Jiao et al., 2019). Marks are sometimes important features and of interest in their own right.

Spatial point pattern data with mark observations can be modeled jointly using a marked spatial point process. The observed data contain two parts, locations and marks, denoted (S, M), where M = (m_1, m_2, . . . , m_N) collects the mark measurements. To model the dependence of the two parts, the intensity λ(s) from the location model can be incorporated into the mark model by specifying the conditional distribution f(M | Θ_m, λ(s)), where Θ_m represents all parameters in the mark model. The joint likelihood then becomes

L(Θ | S, M) = L(M | λ(s), Θm)L(S | Θs), (1.19)

where Θ_s is the parameter set for the location model, and Θ = (Θ_m, Θ_s).

The dependence structure of locations and marks can be modeled in two ways: location dependent models (Mrkvička et al., 2011) or intensity dependent models (Ho and Stoyan, 2008). Location dependent models are observation driven, where the observed point pattern is incorporated into characterizing the spatially varying structure of the mark. They are motivated by geographically weighted models (Brunsdon et al., 1996) and can be written as

m(s) | λ(s) = Σ_{i=1}^{N} w_h(s_i − s) m_i / Σ_{i=1}^{N} w_h(s_i − s),   (1.20)

where w_h(s_i − s) is the geographical weight between s_i and s with some bandwidth h.
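The kernel-weighted average in (1.20) can be sketched as below. The Gaussian choice of weight w_h and the function name are illustrative assumptions; any decreasing kernel with bandwidth h fits the formula.

```python
import numpy as np

def gw_mark_mean(s, points, marks, h=1.0):
    """Location dependent mark model (1.20): a geographically weighted
    average of observed marks, here with Gaussian weights
    w_h(s_i - s) = exp(-||s_i - s||^2 / (2 h^2))."""
    d2 = np.sum((points - s) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * h ** 2))
    return np.sum(w * marks) / np.sum(w)
```

As h shrinks, the fitted mark surface follows the nearest observed mark; as h grows, it flattens toward the overall mean, mirroring the usual bandwidth trade-off in geographically weighted regression.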

The intensity dependent models are parameter driven, where the intensity, instead of the observed point pattern, characterizes the distribution of the mark at each point in the spatial domain. The conditional mark model can be written as a linear regression model as in (1.21), with λ(s) used as one covariate:

m(s) | λ(s) = ξ λ(s) + x(s)^⊤ α + ε(s),                         (1.21)

where ε(s) is a random error term, ξ is the coefficient of λ(s), α = (α_0, α_1, . . . , α_p)^⊤ is a (p + 1)-dimensional coefficient vector, and x(s) is a (p + 1)-dimensional covariate vector at location s whose first element is 1.

The rest of the dissertation is organized as follows. Chapter 2 introduces a semiparametric model which combines the traditional spatial Poisson point process with a spatially varying baseline intensity captured by a DPMM. The PCRP is used to solve the overclustering problem. Chapter 3 adds a spatial smoothness constraint to the point process and investigates the performance of the distance dependent CRP. Chapter 4 then turns to the marked spatial point process when marks are present in the dataset.

Chapter 2

Hierarchical Semiparametric Poisson Point Process Model

2.1 Introduction

For spatial point pattern data, spatially varying covariates are often available for characterizing the occurrence of the events. However, parametric models that only use covariates to capture the heterogeneity of a spatial point pattern can be limited. As mentioned before, there may be unknown covariates that cause intensity heterogeneity. This problem is often seen in environmental data such as tree locations.

Our motivating application is the spatial point pattern of one of the most common tree species, Beilschmiedia pendula, in the 50-hectare tropical forest dynamics plot on Barro Colorado Island (BCI) in central Panama (Condit et al., 2019). The BCI data are a great resource for many environmental studies, containing complete information on over 400,000 individual trees censused since the 1980s.

The relationship between the covariates and the events is often of researchers' interest, so it is necessary to include them in the model. The remaining heterogeneity of the intensity caused by missing covariates should also be considered in order to improve model accuracy. One simple solution is to model the spatial point pattern data with both spatially varying covariates and a spatially varying baseline intensity.

The focus of this section is a Bayesian semiparametric spatial Poisson point process model which allows nonparametric spatially varying baseline heterogeneity in addition to covariate-induced heterogeneity. The nonparametric baseline intensity takes the form of a spatially piecewise constant function as in Geng et al. (2019). The Bayesian framework naturally provides estimates of the number of components of the piecewise constant function and the component configuration, along with an estimate of the intensity function itself. In practice, the widely used Chinese restaurant process (Pitman, 1995; Neal, 2000) for nonparametric modeling has been reported to produce extremely small and, hence, redundant components, making the estimator for the number of components inconsistent (Miller and Harrison, 2013). To remedy the situation, Miller and Harrison (2018) put a prior on the number of components and proposed the so-called mixture of finite mixtures model; Xie and Xu (2019) developed a general class of Bayesian repulsive Gaussian mixture models to encourage well-separated components. A recent alternative method is the powered Chinese restaurant process (Lu et al., 2018), which encourages member assignment to existing components. The PCRP method has not been used in the context of spatial Poisson point process modeling with a nonparametric baseline as well as covariate effects.

Our contribution is two-fold. First, we propose a flexible Bayesian semiparametric spatial Poisson point process model. The model simultaneously captures the heterogeneity introduced by a spatially varying baseline intensity and the heterogeneity explained by covariates. The baseline intensity is spatially piecewise constant with a PCRP prior, which prevents overfitting. Variable selection is achieved with spike-slab priors (Ishwaran and Rao, 2005) on the regression coefficients. The second contribution is an efficient companion Markov chain Monte Carlo inference algorithm. The full conditional distribution of the index vector of the grid boxes on the study region under the PCRP prior is summarized in a proposition. The selection of the power of the PCRP prior is done with a criterion similar to the Bayesian information criterion. Our simulation study shows that the proposed method is competitive when the number of data points in each component is sufficiently large. Interesting patterns of the Beilschmiedia pendula species in the BCI plot are discovered.

2.2 Model Setup

2.2.1 Spatial Poisson Point Process

The spatial Poisson point process is a fundamental model for spatial point patterns. Let S = {s_1, s_2, . . . , s_N} be the observed points over a study region B ⊂ R², with s_i = (x_i, y_i) ∈ B, i = 1, . . . , N. A spatial Poisson point process is a process such that the number of points in any subregion A ⊂ B follows a Poisson distribution with mean λ(A) = ∫_A λ(s) ds for some function λ(·). The function λ(·) is the intensity function that completely characterizes the spatial Poisson point process. When λ(s) changes with s, the process is known as a non-homogeneous Poisson point process (NHPP), denoted by NHPP(λ(·)).
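An NHPP with a bounded intensity can be simulated by thinning a homogeneous process (the standard Lewis–Shedler approach). The sketch below is one common way to generate such data, assuming a rectangular window and a user-supplied intensity function; the function name `simulate_nhpp` is ours.

```python
import numpy as np

def simulate_nhpp(intensity, lam_max, window, rng=None):
    """Simulate an NHPP on a rectangle by thinning: draw a homogeneous
    Poisson process with rate lam_max >= sup intensity(s), then keep
    each point s independently with probability intensity(s)/lam_max."""
    rng = np.random.default_rng(rng)
    (x0, x1), (y0, y1) = window
    area = (x1 - x0) * (y1 - y0)
    n = rng.poisson(lam_max * area)          # N(B) ~ Poisson(lam_max * |B|)
    pts = np.column_stack([rng.uniform(x0, x1, n), rng.uniform(y0, y1, n)])
    keep = rng.uniform(size=n) < np.array([intensity(s) for s in pts]) / lam_max
    return pts[keep]

# Example: a constant intensity of 5 on a 20 x 20 region gives about
# 5 * 400 = 2000 points on average.
pts = simulate_nhpp(lambda s: 5.0, 5.0, ((0.0, 20.0), (0.0, 20.0)), rng=0)
```

The correctness of thinning relies on lam_max bounding the intensity everywhere on B; a piecewise constant intensity such as the one used later satisfies this with lam_max equal to its maximum level.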

Covariates can be introduced to the intensity function of an NHPP. Let X(s) be a p × 1 spatially varying covariate vector, which does not include 1. A semiparametric regression model for the intensity function of an NHPP is

λ(s_i) = λ_0(s_i) exp{X^⊤(s_i) β},                               (2.1)

where λ_0(s_i) is an unspecified baseline intensity function, and β = (β_1, . . . , β_p)^⊤ is a vector of regression coefficients. The baseline intensity function λ_0(·) captures additional spatial heterogeneity that is not explained by the covariates.

2.2.2 Nonparametric Baseline Intensity

The baseline intensity function λ_0(·) is completely unspecified in Model (2.1). For flexibility, we characterize λ_0(·) by a piecewise constant function

λ_0(s) = λ_{0,z(s)},  z(s) ∈ {1, . . . , K},

where K is the number of components of the piecewise constant function, the vector λ_0 = {λ_{0,i}}_{i=1}^{K} collects the unique values of λ_0(s), and z(s) is the index of the component at location s. In implementation, we partition B into n disjoint grid boxes A_i, i = 1, 2, . . . , n, with B = ∪_{i=1}^{n} A_i. Let z_i be the index of the component in the piecewise constant function to which λ_0(s) corresponds for any s ∈ A_i; that is, for any s ∈ A_i, we have λ_0(s) = λ_{0,z_i}.

We specify the index process z = (z_1, . . . , z_n) by a PCRP (Lu et al., 2018). In particular, let z_1 = 1 and, for i ∈ {2, . . . , n},

Pr(z_i = c | z_1, . . . , z_{i−1}) ∝
  n_c^r,   at an existing component labeled c,
  α,       at a new component,                                    (2.2)

where 0 < α < 1 is a hyperparameter corresponding to the concentration parameter in the DP, r ≥ 1, and n_c is the number of grid boxes in component c. This process is denoted by PCRP(α, r). The special case r = 1 is the CRP. When r > 1, the PCRP assigns z_i, i > 1, to an existing component with a higher probability than the CRP does. This design helps to eliminate the artifactual small components produced by the CRP in modeling mixtures with an unknown number of components (Lu et al., 2018). As r increases, the probability of each grid box being assigned to an existing component increases, so that the final number of components decreases.

2.2.3 Hierarchical Semiparametric Regression Model

With S following an NHPP and the index vector z following a PCRP, the proposed hierarchical semiparametric regression model, denoted PCRP-NHPP, is

S ∼ NHPP(λ(s)),

λ(s) = λ_{0,z_j} exp{X^⊤(s) β},  s ∈ A_j,  j = 1, . . . , n,

β_i ∼ Normal(0, δ_i²),  i = 1, 2, . . . , p,                      (2.3)

z ∼ PCRP(α, r),

λ_{0,k} ∼ Gamma(a, b),  k = 1, 2, . . . , K,

where z = (z_1, . . . , z_n); NHPP(λ(s)) and PCRP(α, r) are defined in (2.1) and (2.2), respectively, with hyperparameters (a, b, δ_1, . . . , δ_p, α) and a pre-specified power r for the PCRP; the prior distributions for the β_i's are independent; the prior distributions for the λ_{0,k}'s are independent gamma distributions with mean a/b; and K is the number of unique values of the z_i's, which is random and unbounded a priori.

When variable selection is desired, a spike-slab prior can be imposed on each element of β (Ishwaran and Rao, 2005). A spike-slab distribution is a mixture of a nearly degenerate distribution at zero (the spike) and a flat distribution (the slab). Zero-mean normal distributions with a small and a large variance are common choices for the spike and the slab, respectively. The ratio of the two variances should be in a reasonable range so that the MCMC does not get stuck in the spike component (Malsiner-Walli and Wagner, 2018). Following the suggestion of George and McCulloch (1993), a common choice of the two variances is 0.01 and 100. Specifically, the normal spike-slab prior specifies the hyperparameter δ_i, i = 1, . . . , p, in (2.3) as

δ_i² = 0.01 (1 − γ_i) + 100 γ_i,                                  (2.4)

γ_i ∼ Bernoulli(0.5).

With posterior samples from MCMC, variable selection is done using the posterior modes of the γ_i's. A posterior mode of 0 suggests little significance of β_i and exclusion of the corresponding covariate from the regression model.
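A draw from the spike-slab prior (2.4) can be sketched as follows; the function name is ours, and the default variances 0.01 and 100 follow the text.

```python
import numpy as np

def draw_spike_slab(p, spike_var=0.01, slab_var=100.0, w=0.5, rng=None):
    """Draw p coefficients from the normal spike-slab prior (2.4):
    gamma_i ~ Bernoulli(w); beta_i ~ Normal(0, spike_var) when
    gamma_i = 0 (the spike) and Normal(0, slab_var) when gamma_i = 1
    (the slab)."""
    rng = np.random.default_rng(rng)
    gamma = rng.binomial(1, w, size=p)
    sd = np.sqrt(np.where(gamma == 1, slab_var, spike_var))
    beta = rng.normal(0.0, sd)
    return gamma, beta
```

Coefficients drawn from the spike are tightly concentrated near zero, so in the posterior a mode of γ_i = 0 effectively zeroes out the corresponding covariate.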

2.3 Bayesian Inference

2.3.1 The MCMC Sampling Scheme

The parameters in Model (2.3)–(2.4) are Θ = {λ_0, z, β, γ}, where γ = (γ_1, . . . , γ_p). The likelihood of the spatial Poisson point process is

L(Θ | S) = [Π_{i=1}^{N} λ(s_i)] exp{−∫_B λ(s) ds}.

The posterior density of Θ is

π(Θ | S) ∝ L(Θ | S) π(Θ),

where π(Θ) is the prior density of Θ.

The full conditional distribution of each parameter in Θ is derived in order to use the Gibbs sampling method. The full conditional distribution of z_i, i = 1, . . . , n, depends on whether the i-th grid box goes to an existing component or a new one. The full conditional probability that grid box A_i belongs to an existing component c, i.e., ∃ j ≠ i, z_j = c, is

Pr(z_i = c | S, z_{−i}, λ_0, β) ∝ [n_{−i,c}^r / (Σ_{j=1}^{k} n_j^r − 1 + α)] · [λ_{0,c}^{m_i} Π_{j: s_j ∈ A_i} exp{X^⊤(s_j) β}] / exp(λ_{0,c} Λ_i(β))
  = [n_{−i,c}^r / (Σ_{j=1}^{k} n_j^r − 1 + α)] λ_{0,c}^{m_i} exp{Σ_{j: s_j ∈ A_i} X^⊤(s_j) β − λ_{0,c} Λ_i(β)}.   (2.5)

The full conditional probability that A_i belongs to a new component, i.e., ∀ j ≠ i, z_j ≠ c, is

Pr(z_i = c | S, z_{−i}, λ_0, β)
  ∝ [α / (Σ_{j=1}^{k} n_j^r − 1 + α)] ∫ [λ_{0,c}^{m_i} Π_{j: s_j ∈ A_i} exp{X^⊤(s_j) β} / exp(λ_{0,c} Λ_i(β))] · [b^a / Γ(a)] λ_{0,c}^{a−1} e^{−b λ_{0,c}} dλ_{0,c}
  = [α / (Σ_{j=1}^{k} n_j^r − 1 + α)] · [b^a / Γ(a)] exp{Σ_{j: s_j ∈ A_i} X^⊤(s_j) β} ∫ λ_{0,c}^{m_i + a − 1} e^{−(b + Λ_i(β)) λ_{0,c}} dλ_{0,c}   (2.6)
  = [α b^a Γ(m_i + a) / ((Σ_{j=1}^{k} n_j^r − 1 + α)(b + Λ_i(β))^{m_i + a} Γ(a))] exp{Σ_{j: s_j ∈ A_i} X^⊤(s_j) β}.

Combining (2.5) and (2.6) gives the full conditional distribution of z_i in the following proposition.

Proposition 2. Under the model and prior specification (2.3), the full conditional distribution of z_i, i = 1, . . . , n, is

Pr(z_i = c | S, z_{−i}, λ_0, β) ∝
  n_{−i,c}^r λ_{0,c}^{m_i} exp(−λ_{0,c} Λ_i(β)),                     ∃ j ≠ i, z_j = c (existing label),
  α b^a Γ(m_i + a) / [(b + Λ_i(β))^{m_i + a} Γ(a)],                 ∀ j ≠ i, z_j ≠ c (new label),      (2.7)

where z_{−i} is z with z_i removed, n_{−i,c} is the number of grid boxes in component c excluding A_i, and Λ_i(β) = ∫_{A_i} exp{X^⊤(s) β} ds. The results remain valid when the spike-slab prior is imposed on the elements of β.

Assuming that the covariates X(s) are piecewise constant on a grid partition, which may not be the same as the A_i's, the integral Λ_i(β) in Proposition 2 can be calculated as a summation on each A_i, i = 1, . . . , n. If this grid partition is the same as the A_i's, that is, X(s) = X_i for s ∈ A_i, then Λ_i(β) = µ(A_i) exp{X_i^⊤ β}, where µ(A_i) is the area of A_i.

For the full conditional distribution of λ_{0,i}, we only need to focus on the data points that are in the i-th component, since the likelihood for the data points in other components does not involve λ_{0,i}. That is, we only need to focus on the grid boxes {A_j} such that z_j = i. The full conditional density of λ_{0,i}, i = 1, . . . , K, is

q(λ_{0,i} | S, β, γ, z, λ_{0,−i}) ∝ [Π_{ℓ: s_ℓ ∈ A_j, z_j = i} λ(s_ℓ) / exp{∫_{∪_{j: z_j = i} A_j} λ(s) ds}] · λ_{0,i}^{a−1} exp(−b λ_{0,i})
  = [Π_{ℓ: s_ℓ ∈ A_j, z_j = i} λ_{0,i} exp{X^⊤(s_ℓ) β} / exp{λ_{0,i} ∫_{∪_{j: z_j = i} A_j} exp(X^⊤(s) β) ds}] · λ_{0,i}^{a−1} exp(−b λ_{0,i})   (2.8)
  ∝ λ_{0,i}^{N_i + a − 1} exp{−(b + Σ_{j: z_j = i} Λ_j(β)) λ_{0,i}},

which is the density of Gamma(N_i + a, b + Σ_{j: z_j = i} Λ_j(β)).

The full conditional mass function of γ_i, i = 1, . . . , p, is

q(γ_i | S, β, γ_{−i}, z, λ_0) ∝ 0.5^{γ_i} 0.5^{1−γ_i} φ^{γ_i}(β_i | 100) φ^{1−γ_i}(β_i | 0.01)
  = [0.5 φ(β_i | 100)]^{γ_i} [0.5 φ(β_i | 0.01)]^{1−γ_i},          (2.9)

which is Bernoulli with rate parameter

0.5 φ(β_i | 100) / [0.5 φ(β_i | 100) + 0.5 φ(β_i | 0.01)].

In summary, the full conditional distributions for the λ_{0,k}'s and γ_j's follow straightforwardly from their conjugate priors:

λ_{0,k} | S, β, γ, z, λ_{0,−k} ∼ Gamma(N_k + a, b + Σ_{j: z_j = k} Λ_j(β)),  k = 1, . . . , K,   (2.10)

γ_j | S, β, γ_{−j}, z, λ_0 ∼ Bernoulli((1 + φ(β_j | 0.01)/φ(β_j | 100))^{−1}),  j = 1, . . . , p,   (2.11)

where λ_{0,−i} and γ_{−i} are, respectively, λ_0 and γ without the i-th element, m_i = Σ_{j=1}^{N} 1(s_j ∈ A_i) is the number of data points in grid box A_i, N_k = Σ_{i: z_i = k} m_i is the number of data points in the k-th component, and φ(· | σ²) is the density of Normal(0, σ²).
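The conjugate updates (2.10) and (2.11) are straightforward to code. The sketch below assumes the sufficient statistics N_k and Σ_{j: z_j = k} Λ_j(β) have already been computed for the component being updated; the helper names are ours.

```python
import math
import numpy as np

def update_lambda0(N_k, Lambda_sum, a=1.0, b=1.0, rng=None):
    """Gibbs update (2.10): lambda_{0,k} | ... ~ Gamma(N_k + a, rate),
    with rate = b + sum_{j: z_j = k} Lambda_j(beta). NumPy's gamma is
    parameterized by scale, hence scale = 1/rate."""
    rng = np.random.default_rng(rng)
    return rng.gamma(shape=N_k + a, scale=1.0 / (b + Lambda_sum))

def normal_pdf(x, var):
    """Density of Normal(0, var) at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def update_gamma(beta_j, spike_var=0.01, slab_var=100.0, rng=None):
    """Gibbs update (2.11): gamma_j | ... ~ Bernoulli(p) with
    p = (1 + phi(beta_j | spike_var)/phi(beta_j | slab_var))^{-1}."""
    rng = np.random.default_rng(rng)
    p = 1.0 / (1.0 + normal_pdf(beta_j, spike_var) / normal_pdf(beta_j, slab_var))
    return rng.binomial(1, p), p
```

For β_j near zero the spike density dominates and p is close to 1/101 under the default variances, so γ_j is almost surely 0; for β_j far from zero the slab dominates and γ_j is almost surely 1, which is exactly the selection behavior described above.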

The full conditional density of β_i, i = 1, . . . , p, is

q(β_i | S, β_{−i}, γ, z, λ_0) ∝ φ^{1−γ_i}(β_i | 0.01) φ^{γ_i}(β_i | 100)
  × Π_{i=1}^{n} λ_{0,z_i}^{m_i} exp{Σ_{j: s_j ∈ A_i} X^⊤(s_j) β − λ_{0,z_i} Λ_i(β)},   (2.12)

where β_{−i} is β without the i-th element. This is not a standard distribution; samples can be drawn from it with the Metropolis–Hastings algorithm (Hastings, 1970). In our implementation, we used a normal proposal distribution centered at the current value with a standard deviation tuned to achieve a desired acceptance rate.

The full conditional distributions facilitate MCMC sampling with a Gibbs sampling algorithm. Algorithm 1 summarizes the specifics for one MCMC iteration. In the algorithm, K, the length of λ_0, is reduced by 1 whenever a component is found to contain only a single grid box; it increases by 1 when a new label is assigned by the full conditional

Algorithm 1 Gibbs sampling algorithm for one iteration of MCMC to update Θ.

1: update Λ_i(β), i = 1, . . . , n
2: for i = 1 : n do                      ▷ Update z
3:     if A_i is the only grid box in the component that it belongs to then
4:         K = K − 1
5:         z_j = z_j − 1 for all z_j such that z_j > z_i
6:         shorten λ_0 by dropping λ_{0,z_i}
7:     draw z_i from (2.7)
8:     if z_i goes to a new component then
9:         K = K + 1
10:        draw λ_{0,K} ∼ Gamma(a, b)
11: for i = 1 : K do                     ▷ Update λ_0
12:     draw λ_{0,i} from (2.10)
13: for i = 1 : p do                     ▷ Update γ
14:     draw γ_i from (2.11)
15: for i = 1 : p do                     ▷ Update β
16:     draw β_i from (2.12) with the Metropolis–Hastings algorithm

distribution (2.7). To initialize, each z_i, i = 1, . . . , n, is randomly assigned an integer value in {1, . . . , K_0} for a prespecified initial number of components K_0 for the piecewise constant baseline, possibly based on some exploratory analysis. Initial values for λ_0 are generated independently from the Gamma(a, b) prior distribution. Initial values for the other parameters β and γ can simply be set to zero.

2.3.2 Inference for the Nonparametric Baseline Intensity

Bayesian inference for the nonparametric baseline intensity is not as straightforward as for the regression coefficients, whose posterior samples can be easily summarized. The nonparametric baseline intensity is constructed from the index vector z, which is not exchangeable from iteration to iteration: the same z_i value in different iterations does not necessarily mean the same component. A similar problem exists in the inference of λ_0. We use Dahl's method (Dahl, 2006) as a simple solution for summarizing z. It chooses the iteration in the MCMC sample that optimizes a least squares criterion as the estimate for z.

For a draw from the posterior distribution of Θ, define an n × n membership matrix

B = (B(i, j)) = (1(z_i = z_j)),  i, j ∈ {1, 2, . . . , n}.          (2.13)

That is, its (i, j)-th element is 1 if the i-th and j-th grid boxes belong to the same component (or have the same baseline intensity), and 0 otherwise. For an MCMC sample of size M, let B^{(t)} be the B matrix defined for the t-th draw in the sample, and let B̄ = (1/M) Σ_{t=1}^{M} B^{(t)}. Each (i, j)-th element of B̄ is then the relative frequency with which the i-th and j-th grid boxes belong to the same component. Dahl (2006) suggested taking the draw in the sample that is closest to B̄ as the point estimate of z. Let

t* = arg min_{t ∈ {1, 2, . . . , M}} Σ_{i=1}^{n} Σ_{j=1}^{n} (B^{(t)}(i, j) − B̄(i, j))².

The estimate for z is z^{(t*)}, the estimate for λ_0 is λ_0^{(t*)}, and the estimate of K is the length of λ_0^{(t*)}. The advantage of this method is that it uses information from all posterior samples, and the final result is guaranteed to be a valid membership scheme that exists in the sample.
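Dahl's method reduces to a few array operations once the posterior draws of z are stored. A sketch, assuming the draws are the rows of an M × n array; the function name `dahl_estimate` is ours.

```python
import numpy as np

def dahl_estimate(z_draws):
    """Dahl's method: pick the posterior draw whose membership matrix
    B^(t), with B(i, j) = 1(z_i = z_j), is closest in squared error to
    the elementwise average B_bar over all draws."""
    z_draws = np.asarray(z_draws)                      # shape (M, n)
    B = (z_draws[:, :, None] == z_draws[:, None, :]).astype(float)
    B_bar = B.mean(axis=0)
    losses = ((B - B_bar) ** 2).sum(axis=(1, 2))
    t_star = int(np.argmin(losses))
    return t_star, z_draws[t_star]
```

Because the comparison is made through membership matrices, the result is invariant to relabeling of the components across iterations, which is exactly the non-exchangeability problem the method is designed to sidestep.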

Convergence checks for the nonparametric baseline cannot be done with trace plots of z or λ_0, as their meanings change from iteration to iteration. The baseline at each grid box, λ_{0,z_j}, j = 1, . . . , n, does have the same meaning across iterations and can be checked for convergence. The situation is similar to that of a reversible jump MCMC, where a nonparametric component has varying degrees of freedom (Wang et al., 2013). Instead of monitoring a large number n of grid boxes, in practice we focus on the number of components K and monitor its trace plot for stationarity. Usually, once K shows stationarity, the nonparametric baseline at each grid box and the regression coefficients are all stationary.

As a further diagnostic tool for the nonparametric baseline, the Rand index (RI), which measures the similarity of two component memberships (Rand, 1971), can be checked. Let Ω = {(A_i, A_j)}_{1≤i<j≤n} be the collection of all grid box pairs. For two index vectors z^{(1)} and z^{(2)}, define

a = #{(A_i, A_j) ∈ Ω : z_i^{(1)} = z_j^{(1)}, z_i^{(2)} = z_j^{(2)}},

b = #{(A_i, A_j) ∈ Ω : z_i^{(1)} ≠ z_j^{(1)}, z_i^{(2)} ≠ z_j^{(2)}},

where #{·} denotes the cardinality of a set. That is, under the two membership assignments, a and b are the numbers of grid box pairs whose memberships are "concordant"; all other pairs are "discordant". The RI is then defined as

RI = (a + b) / (n(n − 1)/2),

which ranges from 0 to 1, with a higher value indicating better agreement between the two index vectors and 1 indicating a perfect match.
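The Rand index can be computed directly from its definition. A brute-force O(n²) sketch over all pairs; the function name is ours.

```python
from itertools import combinations

def rand_index(z1, z2):
    """Rand index: the fraction of grid-box pairs on which the two
    memberships agree, i.e. pairs placed together in both vectors (a)
    or apart in both vectors (b), out of n-choose-2 pairs."""
    pairs = list(combinations(range(len(z1)), 2))
    a = sum(1 for i, j in pairs if z1[i] == z1[j] and z2[i] == z2[j])
    b = sum(1 for i, j in pairs if z1[i] != z1[j] and z2[i] != z2[j])
    return (a + b) / len(pairs)
```

Like Dahl's method, the RI compares partitions through pairs, so it is unaffected by the arbitrary relabeling of components between MCMC iterations.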

If the true index vector z_0 were known, for every MCMC iteration z^{(t)}, t = 1, 2, . . . , M, a Rand index RI^{(t)} based on z_0 and z^{(t)} could be calculated. The trace plot of RI^{(t)} is a useful diagnostic tool. In practice, since z_0 is unknown, we replace it with the estimate ẑ from Dahl's method after convergence has been ensured from the trace plot of K. Stationarity in the trace plot of RI provides a second-stage reassurance of convergence, and its level measures the consistency of the memberships from iteration to iteration.

2.3.3 Selection of r

The estimated number of components K̂ of the nonparametric baseline intensity depends on r, the power of the PCRP prior. Selection of r remains to be addressed. Model comparison criteria under the Bayesian framework, such as the deviance information criterion (DIC) (Spiegelhalter et al., 2002) and the logarithm of the pseudo-marginal likelihood (LPML) (Geisser and Eddy, 1979; Gelfand and Dey, 1994), are natural choices. From our simulation study, however, they tended to overestimate the number of components K of the nonparametric baseline. We experimented with a criterion in the spirit of the Bayesian information criterion (BIC), since BIC has proven to be an effective criterion for likelihood-based model selection in clustering algorithms (Wang and Bickel, 2017).

Our Bayesian information type criterion (BITC) is defined as

BITC = −2 log L(Θ̂ | S) + K̂ log(N),                              (2.14)

where Θ̂ and K̂ are the estimates of Θ and K, respectively, from Dahl's method. Although BIC is usually used for frequentist methods, when the number of points N and the MCMC sample size M are large enough, Θ̂ provides a consistent estimator of Θ (Walker, 1969), and it seems reasonable to use BITC to assess model fit without considering parameter variation. The effectiveness of the BITC in selecting r was confirmed in our simulation study reported in the next section. In practice, we can determine the range of candidate r values starting from 1 and ending with a number that gives K̂ = 1. The optimal r is then selected based on the BITC from a grid of candidate r values. The optimal r selected by this criterion provided better estimates of K in our simulations than those selected by LPML and DIC. Its performance under a simple Gaussian mixture model is also investigated via a simulation study in Appendix A.1.
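Computing the BITC itself is a one-liner given the log-likelihood at the Dahl estimate; a sketch, with hypothetical argument names.

```python
import math

def bitc(log_lik_hat, k_hat, n_points):
    """Bayesian information type criterion (2.14):
    BITC = -2 log L(Theta_hat | S) + K_hat * log(N),
    where log_lik_hat is the log-likelihood evaluated at the Dahl
    estimate Theta_hat and K_hat is the estimated number of components."""
    return -2.0 * log_lik_hat + k_hat * math.log(n_points)

# Smaller BITC is better; each extra component pays a log(N) penalty,
# which is what discourages overestimating K across candidate r values.
```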

2.4 Simulation Study

The proposed methods were validated in a simulation study over a region B = [0, 20] × [0, 20]. The study region is partitioned into grid boxes Ai, i ∈ {1, 2, . . . , n = 400}, each a unit square. Points were generated from the NHPP(λ(s)) model (2.1) with p = 4 covariates. The four covariates Xi(s), i = 1, 2, 3, 4, were set to be piecewise constant over the grid boxes Ai's, with values independently generated from the standard normal distribution. The true regression coefficients were β1 = β2 = 0.5 and β3 = β4 = 0. Two settings of the piecewise constant baseline intensity surface were considered. Setting 1 had two components, with λ0 = (0.2, 10) and (n1, n2) = (309, 91) grid boxes in the two components. Setting 2 had three components, with λ0 = (0.2, 5, 20) and (n1, n2, n3) = (232, 91, 77). See Figure 5 for the spatial structure of the baseline intensity surfaces under the two settings. In both settings, more grid boxes were assigned to the first component with a low intensity value, which mimics the pattern of the BCI data. We used the function rpoispp() from the R package spatstat (Baddeley et al., 2015) to generate the points. For each setting, 100 replicates were generated, with about 1000 to 1500 data points per replicate under setting 1 and about 3000 to 4000 under setting 2.
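The data generation exploits a standard property of the Poisson process: with a piecewise constant intensity, the counts in disjoint boxes are independent Poisson variables, and given the counts, the locations are uniform within each box. The study used rpoispp() in R; below is a minimal Python sketch of the same mechanism (the intensity layout only loosely imitates setting 1):

```python
import numpy as np

def simulate_nhpp_grid(lam, rng):
    """Simulate an NHPP whose intensity is piecewise constant on unit
    grid boxes; lam is an (nx, ny) array of intensity values."""
    nx, ny = lam.shape
    pts = []
    for i in range(nx):
        for j in range(ny):
            # Counts in disjoint boxes are independent Poisson(lam * area);
            # given the count, locations are uniform within the box.
            n_ij = rng.poisson(lam[i, j])          # unit-area boxes
            xs = i + rng.uniform(size=n_ij)
            ys = j + rng.uniform(size=n_ij)
            pts.append(np.column_stack([xs, ys]))
    return np.vstack(pts)

rng = np.random.default_rng(42)
lam = np.full((20, 20), 0.2)      # low-intensity background component
lam[12:, 12:] = 10.0              # one high-intensity block
points = simulate_nhpp_grid(lam, rng)
```

With this layout the expected total count is 0.2 × 336 + 10 × 64 ≈ 707 points, of the same order as the replicates in setting 1.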

Priors for the model parameters were set as in (2.3) and (2.4), with hyperparameters a = b = α = 1. Different r values starting from 1 were tried on each dataset, and the optimal r was selected by the BITC in (2.14). The upper end of the candidate r values in each setting was chosen empirically as the smallest value that led to a single component of the piecewise constant baseline intensity surface based on Dahl's method. Candidate powers for setting 1 ranged from 1 to 2 with a step of 0.1; for setting 2, from 1 to 3 with a step of 0.1. In the Metropolis–Hastings algorithm to draw β, a Normal(0, 0.05²) distribution was used as the proposal, which

yielded an acceptance rate of 30–40%. For each dataset, we ran the MCMC for 5000 iterations and discarded the first 1000 iterations as burn-in. Convergence of the remaining iterations was checked through the trace plots of the elements of β and K.

Figure 3: Histograms of K̂ and overlaid trace plots of RI from the 100 replicates. Settings 1 and 2 are on the left and right panels, respectively. The optimal r was selected by the BITC. The RI for each replicate was computed under the optimal r for that replicate. The thick lines are the averages of the trace plots over the 100 replicates. (Panels: (a) histograms of K̂; (b) Rand index.)

Figure 3 shows the histograms of K̂ from Dahl's method for a selective set of r values based on the 100 replicates. The optimal r was selected by the BITC. The true K is 2 in setting 1 and 3 in setting 2. The histograms of K̂ under the optimal r are

Figure 4: Histograms of K̂ chosen by LPML and DIC over the 100 replicates. Results for settings 1 and 2 are on the left and right panels, respectively.

much closer to the true K under both settings than those under r = 1, the popular CRP prior. Among the 100 replicates, the optimal r led to K̂ = K exactly 76 times in setting 1 and 84 times in setting 2. The corresponding frequencies for r = 1 are only 55 and 18, respectively. LPML and DIC were also calculated using the Monte Carlo estimation proposed by Hu et al. (2019). Histograms of the fitted K chosen by LPML and

DIC are also shown in Figure 4. They performed well under setting 1, estimating K correctly 81 and 83 times, respectively. Nonetheless, both did fairly poorly under setting 2, with obviously more components estimated than the truth: only 24 and 23 of the 100 replicates estimated K correctly using LPML and DIC, respectively.
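Dahl's method, used throughout to obtain K̂ and the point estimates, picks a representative partition from the MCMC draws. A sketch under its usual least-squares formulation (the function name and toy draws are ours): the draw whose pairwise co-membership matrix is closest to the posterior average co-membership matrix is selected.

```python
import numpy as np

def dahl_method(z_draws):
    """Dahl's method: pick the MCMC draw whose pairwise co-membership
    matrix minimizes the squared distance to the posterior average
    co-membership matrix."""
    z_draws = np.asarray(z_draws)                                    # (M, n) labels
    B = (z_draws[:, :, None] == z_draws[:, None, :]).astype(float)   # (M, n, n)
    B_bar = B.mean(axis=0)                                           # average co-membership
    losses = ((B - B_bar) ** 2).sum(axis=(1, 2))
    return z_draws[np.argmin(losses)]

# toy example: draws mostly agree on a two-component partition
draws = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]]
z_hat = dahl_method(draws)   # -> the modal partition [0, 0, 1, 1]
```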

Also shown in Figure 3 are the overlaid trace plots of the RI from all the replicates under r = 1 and the optimal r. The trace plots further assure the convergence of the index process z in the sense of Rand (1971). The averaged trace plots over the replicates (shown in solid lines) reflect the level of consistency in component memberships across iterations. Under r = 1, the averages stabilize around 0.806 and 0.828 for settings 1 and 2, respectively. Under the optimal r, the averages are 0.882 and 0.893, respectively, indicating higher consistency in component membership than under r = 1.
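As a reminder of what the RI measures, the Rand (1971) index between two partitions is the fraction of point pairs on which the partitions agree (both placed together or both apart). A small pair-counting sketch (the function name is ours):

```python
from itertools import combinations

def rand_index(z1, z2):
    """Rand (1971) index: fraction of pairs on which two partitions
    agree, i.e., both put the pair in the same component or both in
    different components."""
    pairs = list(combinations(range(len(z1)), 2))
    agree = sum((z1[i] == z1[j]) == (z2[i] == z2[j]) for i, j in pairs)
    return agree / len(pairs)

rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # -> 1.0; the labels themselves do not matter
rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # -> 1/3
```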

The estimated baseline intensity surfaces under the optimal r for the 100 replicates are summarized, and the heat maps of their 2.5% quantile, median, and 97.5% quantile are compared with the true surface in Figure 5. According to the plots, the fitted baseline intensity can accurately identify the different components, and the intensity magnitudes are also very close to the truth. The estimation near the edges of the components does not show obviously worse performance. This is because the PCRP does not impose a spatial smoothness constraint and allows more spatial flexibility; as long as the resolution of the grid boxes is fine enough, the PCRP can capture different components regardless of the shape of the area. Under each setting, we pool the grid boxes within the same component and compare the averaged estimates with the true values. For setting 1, the empirical biases for the two components of λ0 = (0.2, 10) are (0.052, −0.413), with standard deviations (0.055, 0.433). For setting 2, the empirical biases for the three components of λ0 = (0.2, 5, 20) are (0.089, 0.107, −0.846), with standard deviations (0.058, 0.381, 0.752). Considering the magnitudes of the intensity components, they are estimated quite accurately.

Table 1 summarizes the results of variable selection and estimation for the parametric

Figure 5: Simulation configurations for the baseline intensity, with fitted baseline intensity surfaces (true surface, 2.5% quantile, median, and 97.5% quantile). Panel (a): setting 1; panel (b): setting 2. Medians and quantiles are calculated over the 100 replicates.

Table 1: Simulation estimation results. "SD" is the empirical standard deviation over the 100 replicates; "SD̂" is the average of the 100 standard deviations calculated from the posterior samples; "AR" is the variable selection accuracy rate over the 100 replicates.

Setting     β     AR(%)   Bias    SD      SD̂
Setting 1   0.5   100     0.019   0.035   0.033
            0.5   100     0.016   0.032   0.033
            0.0   100     0.028   0.034   0.030
            0.0   100     0.022   0.035   0.031
Setting 2   0.5   100     0.012   0.027   0.026
            0.5   100     0.011   0.025   0.025
            0.0   100     0.007   0.024   0.023
            0.0   100     0.003   0.021   0.023

regression part of the model. For both settings, the two important variables were included and the two unimportant variables were excluded with an accuracy rate of 100%. The empirical biases of the coefficients are minimal. The empirical standard errors of the point estimates agree well with the averages of the posterior standard deviations of the coefficients. The variation of the coefficient estimates is lower in setting 2 than in setting 1, which is expected because setting 2 has higher intensity and more observed points.

Finally, the proposed model was compared with three competing models. The first one, denoted "const-NHPP", is the NHPP(λ(s)) model (2.1) with a constant baseline intensity surface λ0. The second one, denoted "spline-NHPP", is the NHPP(λ(s)) model (2.1) with the baseline intensity surface approximated by a tensor product of cubic splines with a single knot for each of the two coordinates (e.g., Berhane et al., 2008). This model can be fitted by absorbing the spline basis into the covariates. The last model is the

Figure 6: Boxplots of MSE over the 100 replicates for the four models (pCRP−NHPP, const−NHPP, spline−NHPP, and LGCP) under settings 1 and 2.

LGCP model fitted with the function kppm() from the R package spatstat. The mean squared error (MSE) is often used to compare the performance of different models for spatial point patterns. With a grid partition {Ai} of B, it is defined as

MSE = (1/n) Σ_{i=1}^{n} ( µ(Ai) λ̂(Ai) − mi )²,    (2.15)

where λ̂(Ai) = ∫_{s∈Ai} λ̂(s) ds, λ̂(s) is the estimated intensity at location s, and mi is the observed number of points in Ai. Models with smaller MSE values are preferred.
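With unit-area boxes, (2.15) amounts to comparing a fitted expected count per box with the observed count. A small illustrative implementation (the function name is ours):

```python
import numpy as np

def grid_mse(lam_hat, counts, areas):
    """MSE in the spirit of (2.15): average squared difference between
    the expected count per box, mu(A_i) * lambda_hat(A_i), and the
    observed count m_i."""
    lam_hat, counts, areas = map(np.asarray, (lam_hat, counts, areas))
    return np.mean((areas * lam_hat - counts) ** 2)

# two unit-area boxes: expected counts equal the fitted intensities
grid_mse([2.0, 5.0], [1, 6], [1.0, 1.0])   # -> ((2-1)^2 + (5-6)^2)/2 = 1.0
```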

Figure 6 summarizes the boxplots of the MSE for the proposed model and the three competing models. The three competing models are all misspecified under the data generating schemes and, as expected, have much higher MSEs than the proposed model. In particular, the const-NHPP and LGCP models show very similar results, since the LGCP model mainly improves the variance structure of the intensity, which does not help here. The MSE of the spline-NHPP model could be further reduced if more flexibility were introduced in the splines. Nonetheless, its baseline intensity is spatially continuous and may never fit well data generated, as in our settings, from a piecewise constant intensity surface consisting of only two or three quite different components.

2.5 Point Pattern of Beilschmiedia Pendula

Beilschmiedia pendula is one of the most abundant tree species in the BCI (Thurman and Zhu, 2014). There are 4,026 such trees in total from the most recent census in the 50-hectare plot B, a rectangle of 1000m × 500m. Their exact locations are recorded in a Cartesian coordinate system (x, y) ∈ B = [0, 1000] × [0, 500]; see Figure 7. In addition

to two geographical variables, elevation and slope, thirteen environmental covariates are available: soil pH and the soil concentrations of aluminum (Al), boron (B), calcium (Ca), copper (Cu), iron (Fe), potassium (K), magnesium (Mg), manganese (Mn), phosphorus (P), zinc (Zn), nitrogen (N), and nitrogen mineralization (N.min.). These variables are measured at each of the 1,250 grid boxes of size 20m × 20m over B. Within each grid box, the measurements are assumed constant, so that each measurement is piecewise constant over B. The heat maps of the standardized covariates are also shown in Figure 7.

With standardized covariates, the hierarchical semiparametric model (2.3)–(2.4) was fitted to the observed point pattern of Beilschmiedia pendula. For better numerical performance, the area of each grid box was scaled to 1. The hyperparameters were set to a = b = α = 1. The standard deviation of the normal proposal in the Metropolis–Hastings algorithm was set to 0.05 in order to keep the acceptance rate between 30% and 40%. A grid of values {1, 1.1, 1.2, 1.3, 1.4, 1.5} was used in the search for an optimal r. For each r, an MCMC was carried out for 50,000 iterations. The first 10,000 iterations were dropped as burn-in, and the remaining iterations were thinned by 10, yielding an MCMC sample of size M = 2000. Convergence was checked through the trace plots of β and K; see Figure 8.

The optimal r was selected to be 1.4 by the BITC, which leads to K̂ = 4 from Dahl's method. The posterior mode of K is also 4, with a frequency of 1256 out of 2000. The posterior standard deviation of K is 0.833, and most of the sampled values lie between 3 and 5, covering 96.35% of the whole sample. The average RI over the MCMC sample was 0.830, suggesting good agreement of the component memberships over the MCMC iterations. The corresponding baseline intensity values are 0.755, 5.241, 16.423, and 28.860, which are well separated, representing components ranging from low to high intensity. The posterior median of the baseline intensity surface over the study plot, as well as its 2.5% and 97.5% percentiles, are shown in Figure 9. Clearly, after accounting for the available covariates, there remain missing covariates that

Figure 7: The locations of Beilschmiedia pendula and heat maps of the standardized covariates (elevation, slope, pH, Al, B, Ca, Cu, Fe, K, Mg, Mn, P, Zn, N, N.min.) of the BCI data.

Figure 8: BCI data analysis: trace plots of β (one panel per covariate) and the number of components K after burn-in and thinning.

Table 2: Posterior means, standard errors (SE), and the 95% HPD credible intervals for the regression coefficients in the analysis of the spatial point pattern of Beilschmiedia pendula in the BCI data. Covariates with a "*" are significant.

Covariate    Estimate    SE       95% HPD Credible Interval
elevation*    0.635     0.053     ( 0.538,  0.745)
slope*        0.722     0.036     ( 0.653,  0.788)
pH            0.141     0.046     ( 0.040,  0.223)
Al*           0.711     0.068     ( 0.570,  0.835)
B            −0.052     0.078     (−0.198,  0.103)
Ca*           1.139     0.109     ( 0.932,  1.355)
Cu            0.086     0.061     (−0.036,  0.195)
Fe*          −0.407     0.076     (−0.539, −0.259)
K*           −0.625     0.084     (−0.790, −0.460)
Mg            0.131     0.068     ( 0.003,  0.273)
Mn            0.261     0.061     ( 0.146,  0.382)
P*           −0.376     0.068     (−0.504, −0.243)
Zn*          −0.435     0.090     (−0.590, −0.235)
N            −0.053     0.039     (−0.134,  0.019)
N.min         0.179     0.056     ( 0.077,  0.299)

could have helped to explain the distribution of the tree species.

Table 2 summarizes the posterior means, standard deviations, and 95% highest posterior density (HPD) credible intervals of the regression coefficients. Covariates deemed significant by the spike-and-slab variable selection are marked with a "*". The HPD intervals of the non-significant covariates either cover 0 or come very close to 0. The species appears to prefer places with higher elevation and steeper slope.

More occurrence of the species is associated with higher concentrations of Al and Ca,

and lower concentrations of Fe, K, P, and Zn in the soil.

We also fitted the three competing models from the simulation study and compared the MSEs of the models calculated on the 50 × 25 grid. The proposed method gives an MSE of only 8.01, while the MSEs of the const-NHPP, spline-NHPP, and LGCP models are 25.15, 19.00, and 25.16, respectively. In the spline-NHPP model, B-spline bases of degree three were used. Since the range of x is twice that of y, 4 pieces for x and 2 pieces for y were used to generate the B-spline bases, i.e., degrees of freedom 7 and 5 for x and y, respectively, including the intercept. The improvement made by the proposed method is rather obvious.

Figure 9: Heat maps of the fitted baseline intensity surface of the BCI data using r = 1.4. From top to bottom: 2.5%, 50%, and 97.5% percentiles of the posterior distribution of λ(s).

2.6 Discussion

Explaining the spatial heterogeneity of point patterns is challenging when important covariates are not observed. The proposed semiparametric NHPP model captures the heterogeneity unexplained by the observed covariates with a spatially varying baseline intensity. The baseline intensity surface takes a flexible piecewise constant form on a grid partition of the study region, with a PCRP prior that prevents the overly small components often seen when a CRP prior is imposed instead. The methodology is particularly useful when the baseline intensity surface lacks smoothness, as in the case of missing covariates. The fitted number of components of the piecewise constant baseline depends on the power r of the PCRP. The selection of r through the BITC specifically designed for this setting seems to be more effective in the simulation study than through the LPML and

DIC.

A few topics beyond the scope of this paper merit further investigation. When the baseline intensity is deemed to be spatially contiguous, imposing spatial contiguity on the piecewise constant intensity surface (Li and Sang, 2019) may lead to more efficient estimation of the surface. Some applications may have large areas containing no events, in which case including zero-inflated structures (Lambert, 1992) in spatial point process models may improve the fit and avoid unidentifiably close-to-zero intensities. Finally, the selection of r for the PCRP prior by the BITC needs to be evaluated in more general settings. A tuning-free strategy, for example through a hyperprior on r, would be desirable in practice.

Chapter 3

Spatial Poisson Point Process with

Spatial Smoothness Constraint

3.1 Distance Dependent CRP

Spatial events often display some spatial smoothness, meaning that adjacent areas often share similar properties. Such spatial smoothness is called spatial homogeneity. To introduce spatial homogeneity into the CRP, Blei and Frazier (2011) proposed a distance dependent CRP. Instead of focusing on which table a customer chooses to join, they model which customer the current customer chooses to sit with. To do this, they introduce a new index parameter ci, i = 1, 2, . . . , n, for each customer. The value of ci is an integer from 1 to n, indicating which customer the i-th customer chooses to sit with; sitting at a new table corresponds to ci = i. The distribution of the ddCRP can then be presented as:

Pr(ci = j | D, f) ∝ f(dij) for j ≠ i, and Pr(ci = i | D, f) ∝ α,    (3.1)

where D is the distance matrix, dij is the distance between the i-th and j-th areas, and f(·) is the decay function, which is a decreasing function.

Instead of using the parameter ci directly in the hierarchical model, which makes the table assignments difficult to summarize, we apply the ddCRP idea directly to the CRP by using a distance dependent weight on each customer. The number of areas in the c-th component excluding the i-th area Ai is n−i,c, which can also be represented as Σ_{j≠i} 1(zj = c), where 1(·) is the indicator function. Distance dependent weights can then be assigned to each area, and the prior distribution of zi becomes

Pr(zi = c | D, f, z−i) ∝ Σ_{j≠i} f(dij) 1(zj = c)   if ∃ j ≠ i such that zj = c,
Pr(zi = c | D, f, z−i) ∝ α                          if zj ≠ c for all j ≠ i,    (3.2)

which can be denoted as ddCRP(f, D, α).

Common choices for the decay function are the window decay f(d) = 1(d < h), where h is a distance threshold; the exponential decay f(d) = e^{−hd}; and the logistic decay f(d) = exp(−d + h)/(1 + exp(−d + h)), a smooth version of the window decay.
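The decay choices and the resulting unnormalized prior weights in (3.2) can be sketched as follows (function names are ours; since (3.2) is only specified up to proportionality, no normalization is applied here):

```python
import numpy as np

def window_decay(d, h):    # f(d) = 1(d < h)
    return np.where(np.asarray(d) < h, 1.0, 0.0)

def exp_decay(d, h):       # f(d) = exp(-h d)
    return np.exp(-h * np.asarray(d))

def logistic_decay(d, h):  # f(d) = exp(-d + h) / (1 + exp(-d + h))
    return np.exp(-np.asarray(d) + h) / (1.0 + np.exp(-np.asarray(d) + h))

def ddcrp_prior_weights(i, z, D, f, alpha):
    """Unnormalized ddCRP prior on z_i given the other labels, eq. (3.2):
    an existing component c gets weight sum_{j != i} f(d_ij) 1(z_j = c),
    and a new component gets weight alpha."""
    n = len(z)
    others = [j for j in range(n) if j != i]
    comps = sorted(set(z[j] for j in others))
    w = {c: sum(f(D[i, j]) for j in others if z[j] == c) for c in comps}
    w["new"] = alpha
    return w

# three areas on a line; area 0 is close to area 1 and far from area 2
D = np.array([[0.0, 1.0, 4.0], [1.0, 0.0, 3.0], [4.0, 3.0, 0.0]])
w = ddcrp_prior_weights(0, [0, 0, 1], D, lambda d: np.exp(-d), alpha=1.0)
# w[0] = e^{-1} (nearby co-member), w[1] = e^{-4}, w["new"] = 1.0
```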

3.2 Model Setup

The distance dependent CRP prior can be incorporated into the spatial Poisson point

process to capture the spatially varying intensity surface. Let S = {s1, s2, . . . , sN} be the observed points over a study region B ⊂ R², with si = (xi, yi) ∈ B, i = 1, . . . , N, and assume S follows a non-homogeneous spatial Poisson point process with intensity function λ(s).

A piecewise constant function is used to model λ(s):

λ(s) = λ_{z(s)},   z(s) ∈ {1, . . . , K},

where K is the number of components of the piecewise constant function, the vector λ = (λ1, λ2, . . . , λK) collects the unique intensity values, and z(s) is the index of the component at location s. Similar to Chapter 2, we partition the whole region into disjoint grid boxes {Ai}_{i=1}^{n}, and zi is the index parameter for Ai. Then z = (z1, . . . , zn) can be assumed to have the ddCRP prior specified in (3.2). The hierarchical model is specified in (3.3) as follows:

S ∼ NHPP(λ(s)),
λ(s) = λ_{zj},  s ∈ Aj,  j = 1, . . . , n,
z ∼ ddCRP(f, D, α),                                   (3.3)
λk ∼ Gamma(a, b),  k = 1, 2, . . . , K,

where NHPP(λ(s)) and ddCRP(f, D, α) are defined in (2.1) and (3.2), respectively, with hyperparameters (a, b, α), a pre-specified distance matrix D, and a decay function f for the ddCRP; the prior distributions of the λk's are independent gamma distributions with mean a/b; and K is the number of unique values of the zi's, which is random and not bounded a priori.

3.3 Bayesian Inference

The parameters in model (3.3) are Θ = {λ, z}. The likelihood of the spatial Poisson point process is

L(Θ | S) = [ ∏_{i=1}^{N} λ(si) ] exp( − ∫_B λ(s) ds ).

The posterior density of Θ is

π(Θ|S) ∝ L(Θ|S)π(Θ)

where π(Θ) is the prior density of Θ.

Following similar steps as in Chapter 2, we can derive the full conditional distribution of each parameter as follows. The full conditional probability that grid box Ai belongs to an existing component c, i.e., ∃ j ≠ i with zj = c, is

Pr(zi = c | S, z−i, λ) ∝ [ Σ_{j≠i} f(dij) 1(zj = c) / (n − 1 + α) ] λc^{mi} exp(−λc µ(Ai)).    (3.4)

The full conditional probability that Ai belongs to a new component, i.e., zj ≠ c for all j ≠ i, is

Pr(zi = c | S, z−i, λ) ∝ [ α / (n − 1 + α) ] ∫ λc^{mi} exp(−λc µ(Ai)) [ b^a / Γ(a) ] λc^{a−1} e^{−b λc} dλc
                       = [ α b^a / ((n − 1 + α) Γ(a)) ] ∫ λc^{mi+a−1} e^{−(b + µ(Ai)) λc} dλc
                       = α b^a Γ(mi + a) / [ (n − 1 + α) (b + µ(Ai))^{mi+a} Γ(a) ].    (3.5)

Combining (3.4) and (3.5) gives the full conditional distribution of zi in the following

Proposition.

Proposition 3. Under the model and prior specification (3.3), the full conditional distribution of zi, i = 1, . . . , n, is

Pr(zi = c | S, z−i, λ) ∝ Σ_{j≠i} f(dij) 1(zj = c) λc^{mi} exp(−λc µ(Ai))   if ∃ j ≠ i with zj = c (existing label),
Pr(zi = c | S, z−i, λ) ∝ α b^a Γ(mi + a) / [ (b + µ(Ai))^{mi+a} Γ(a) ]     if zj ≠ c for all j ≠ i (new label),    (3.6)

where z−i is z with zi removed, µ(Ai) is the area of region Ai, and mi is the number of observed points in Ai.

For the full conditional distribution of λi, only the data points in the i-th component need to be considered. The full conditional density of λi, i = 1, . . . , K, is

q(λi | S, z, λ−i) ∝ [ ∏_{j: zj=i} ∏_{ℓ: sℓ∈Aj} λ(sℓ) / exp( ∫_{∪_{j: zj=i} Aj} λ(s) ds ) ] λi^{a−1} exp(−b λi)
                  ∝ λi^{Ni+a−1} exp( −( b + Σ_{j: zj=i} µ(Aj) ) λi ),    (3.7)

which is the density of Gamma( Ni + a, b + Σ_{j: zj=i} µ(Aj) ), where Ni is the number of data points in the i-th component.

A simple MCMC algorithm can be constructed from the full conditional distributions (3.6) and (3.7). Inference proceeds as for the baseline intensity part of Chapter 2: Dahl's method and the Rand index are used to summarize the posterior samples of z and λ and to check their convergence.
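The two full conditionals yield the Gibbs sampler described above. A self-contained sketch, not the dissertation's implementation (component labels are managed in a dict, and the toy data are ours): each label zi is drawn from (3.6), and each occupied component's intensity is redrawn from the Gamma full conditional (3.7).

```python
import numpy as np
from math import lgamma, log

def gibbs_sweep(z, lam, m, mu, D, f, a=1.0, b=1.0, alpha=1.0, rng=None):
    """One Gibbs sweep for the ddCRP-NHPP model (3.3): update each z_i
    from (3.6), then redraw each occupied intensity from (3.7)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(z)
    for i in range(n):
        labels, logw = [], []
        for c in sorted(set(z[j] for j in range(n) if j != i)):
            # existing label: sum_{j!=i} f(d_ij) 1(z_j=c) * lam_c^{m_i} e^{-lam_c mu_i}
            w = sum(f(D[i, j]) for j in range(n) if j != i and z[j] == c)
            if w > 0:
                labels.append(c)
                logw.append(log(w) + m[i] * log(lam[c]) - lam[c] * mu[i])
        # new label: alpha b^a Gamma(m_i + a) / ((b + mu_i)^{m_i + a} Gamma(a))
        new = max(lam) + 1
        labels.append(new)
        logw.append(log(alpha) + a * log(b) + lgamma(m[i] + a)
                    - (m[i] + a) * log(b + mu[i]) - lgamma(a))
        p = np.exp(np.array(logw) - max(logw))
        z[i] = labels[rng.choice(len(labels), p=p / p.sum())]
        if z[i] == new:                       # instantiate the new component
            lam[new] = rng.gamma(m[i] + a, 1.0 / (b + mu[i]))
    for c in [c for c in lam if c not in set(z)]:   # drop emptied components
        del lam[c]
    for c in set(z):                          # eq. (3.7): Gamma(N_c + a, b + sum mu)
        Nc = sum(m[i] for i in range(n) if z[i] == c)
        Ac = sum(mu[i] for i in range(n) if z[i] == c)
        lam[c] = rng.gamma(Nc + a, 1.0 / (b + Ac))
    return z, lam

# toy run: four unit boxes on a line, counts hinting at two intensity levels
rng = np.random.default_rng(1)
m, mu = [0, 1, 9, 11], [1.0] * 4
D = np.abs(np.subtract.outer(np.arange(4.0), np.arange(4.0)))
z, lam = [0, 0, 0, 0], {0: 1.0}
for _ in range(200):
    z, lam = gibbs_sweep(z, lam, m, mu, D, lambda d: np.exp(-d), rng=rng)
```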

3.4 Simulation Study

Similar to the simulation setup in Chapter 2, we focus on a region B = [0, 20] × [0, 20] with partition {Ai}_{i=1}^{n}, where n = 400 and each Ai is a unit square. Points were generated from NHPP(λ(s)), where λ(s) is a piecewise constant function over B. Two settings with different numbers of unique values of λ(s) were investigated. Setting 1 has three components, λ = (0.2, 5, 20), with (n1, n2, n3) = (233, 90, 77) grid boxes in the three components. Setting 2 has five components, λ = (0.2, 1, 5, 20, 80), with a balanced number of grid boxes per component, n1 = n2 = n3 = n4 = n5 = 80. The spatial layout of the components is shown in Figure 11. Data were generated using the rpoispp() function from the R package spatstat (Baddeley et al., 2015). For each setting, 100 replicates were generated. There were about 2000 data points per replicate under setting 1 and about 8500 under setting 2.

Priors were set as in (3.3) with hyperparameters a = b = α = 1. The exponential decay function f(d) = C e^{−hd} was used for the ddCRP, with normalizing constant C = n(n − 1)/Σ_{i≠j} e^{−h dij}. Different values of h starting from 0 were tried on each dataset, and the optimal h was chosen by the LPML and DIC using the formulas from Hu et al. (2019). The DIC formula for a spatial point process follows readily from the traditional DIC formula:

Dev(Θ) = −2 ( Σ_{i=1}^{N} log λ(si) − ∫_B λ(s) ds ),

DIC = 2 avg{Dev(Θ)} − Dev(Θ̂),

where avg{Dev(Θ)} is the average deviance evaluated over the posterior samples of Θ, and Dev(Θ̂) is the deviance calculated at the point estimate of the parameters obtained by Dahl's method. The LPML for a spatial point process can be conveniently calculated using the Monte Carlo estimate:

LPML ≈ Σ_{i=1}^{N} log λ̃(si) − ∫_B λ̄(s) ds,

λ̃(si) = ( (1/M) Σ_{t=1}^{M} λ(si | Θt)^{−1} )^{−1},

λ̄(s) = (1/M) Σ_{t=1}^{M} λ(s | Θt),

where Θt is the t-th posterior sample of the parameters, and M is the total number of posterior samples. When h = 0, the ddCRP reduces to the CRP. As h increases, the fitted number of components first decreases, since adjacent regions become more likely to join the same component. When h is too large, the fitted number of components increases again, because the distance penalty is so strong that only very close neighbors join the same component. The upper bound for the candidate h values can therefore be set as the value at which K̂ starts to increase again. Candidate h values for setting 1 were from 0 to 1.2 with a step of 0.2; for setting 2, candidate h values were from 0 to 2 with a step of 0.2, plus the extra large values 4 and 6. For each dataset, the MCMC was run for 3000 iterations with a burn-in period of 1000 iterations.
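The DIC and LPML displayed above can be computed directly from posterior draws of the intensity. A sketch of the Monte Carlo computation (our illustration of the displayed formulas, with hypothetical array names, not the code of Hu et al., 2019):

```python
import numpy as np

def deviance(lam_at_pts, integral):
    """Dev(Theta) = -2 (sum_i log lam(s_i) - int_B lam(s) ds)."""
    return -2.0 * (np.sum(np.log(lam_at_pts)) - integral)

def dic_lpml(lam_draws, int_draws, lam_hat, int_hat):
    """Monte Carlo DIC and LPML for an NHPP.  lam_draws is an (M, N)
    array of the intensity at each observed point under each posterior
    draw, int_draws the M integrals int_B lam(s) ds, and (lam_hat,
    int_hat) the same quantities at the point estimate of Theta."""
    lam_draws = np.asarray(lam_draws, float)
    int_draws = np.asarray(int_draws, float)
    dev_bar = np.mean([deviance(l, I) for l, I in zip(lam_draws, int_draws)])
    dic = 2.0 * dev_bar - deviance(lam_hat, int_hat)
    # LPML: harmonic mean of the intensity over draws at each point,
    # minus the posterior mean of the integral
    lam_tilde = 1.0 / np.mean(1.0 / lam_draws, axis=0)
    lpml = np.sum(np.log(lam_tilde)) - np.mean(int_draws)
    return dic, lpml

# tiny worked example with M = 2 draws and N = 2 points
dic, lpml = dic_lpml([[1.0, 2.0], [2.0, 4.0]], [3.0, 6.0], [1.5, 3.0], 4.5)
```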

Figure 10 shows the histograms of K̂ from Dahl's method. The true K are 3 and 5, respectively, in the two settings. The histograms of K̂ under the optimal h are much closer to the true K under both settings than those under h = 0, the popular CRP prior. Among the 100 replicates, the optimal h chosen by LPML/DIC led to K̂ = K exactly 62/64 times in setting 1 and 72/67 times in setting 2. The same frequency for h = 0 is only 9 in setting 1 and 1 in setting 2.

Figure 10: Histograms of K̂ and overlaid trace plots of RI from the 100 replicates. Settings 1 and 2 are on the left and right panels, respectively. The optimal h was selected by the LPML or DIC. The thick lines are the averages of the trace plots over the 100 replicates. (Panels: (a) histograms of K̂; (b) Rand index.)

Figure 10 also shows the overlaid trace plots of the RI from all the replicates under h = 0 and the optimal h. The trace plots further assure the convergence of the index process z in the sense of Rand (1971). The averaged trace plots over the replicates (shown in solid lines) reflect the level of consistency in component memberships across iterations. Under h = 0, the averages stabilize around 0.83 and 0.89, respectively, for settings 1 and 2. Under the optimal h, the averages are 0.90 and 0.98, respectively, indicating higher consistency in component membership than under h = 0.

The estimated intensity surfaces under the optimal h for the 100 replicates are summarized, and the heat maps of their 2.5% quantile, median, and 97.5% quantile are compared with the true surface in Figure 11. According to the plots, the fitted intensity can accurately identify the different components, and the intensity magnitudes are also very close to the truth. The estimation near the edges of the components is slightly worse than elsewhere, which is reasonable given the spatial smoothness constraint introduced by the distance penalty. Under each setting, we pool the grid boxes within the same component and compare the averaged estimates with the true values. For setting 1, the empirical biases for the three components of λ = (0.2, 5, 20) are (0.021, −0.114, −0.305), with standard deviations (0.043, 0.308, 0.727). For setting 2, the empirical biases for the five components of λ = (0.2, 1, 5, 20, 80) are (0.040, 0.003, −0.024, −0.228, −1.188), with standard deviations (0.081, 0.169, 0.322, 0.779, 1.394). Considering the magnitudes of the intensity components, they are estimated quite accurately.

Figure 11: Simulation configurations for the intensity surface, with fitted intensity surfaces (true surface, 2.5% quantile, median, and 97.5% quantile). Panel (a): setting 1; panel (b): setting 2. Medians and quantiles are calculated over the 100 replicates.

In order to compare the performance of the ddCRP with the PCRP used in Chapter 2, we also fitted the same settings using the PCRP without covariates. The full conditional distributions of the parameters under the PCRP can be derived by similar steps; one convenient way is to modify the full conditional distributions of z and λ in Chapter 2 by directly setting β = 0. Powers used for setting 1 were from 1 to 1.6 with a step of 0.1, and powers used for setting 2 were from 1 to 1.1 with a step of 0.02. The optimal power was selected using the BITC proposed in (2.14).

Figure 12 shows the histogram of K̂ from Dahl's method. The true K values are 3 and 5, respectively, in the two settings. For setting 1, K̂ = K in 85 out of 100 replicates; for setting 2, however, the estimated K is 4 instead of 5 in 77 out of 100 replicates. The trace plot of RI also shows that PCRP fails to give a satisfactory estimation of K and z, since the optimal RI is lower than the RI using CRP, which corresponds to r = 1.

The estimated baseline intensity surfaces under the optimal r for the 100 replicates are summarized, and the heat maps of their 2.5% quantile, median and 97.5% quantile are compared with the true surface in Figure 13. Compared with the intensity surfaces fitted by ddCRP in Figure 11, PCRP clearly performs worse in the interior region of each group: the fitted intensity values are more likely to jump around. The smoothness constraint makes it easier for ddCRP to capture the intensity in regions that are not on the edge of a group. The fitted intensity value and standard deviation for each group were also calculated using the same method, by pooling the grid boxes under the same group. For setting 1, the empirical biases

[(a) Histograms of K̂; (b) Rand Index. Panels for r = 1 and the optimal r under each setting.]

Figure 12: Histograms of K̂ and overlaid trace plots of RI from the 100 replicates. Settings 1 and 2 are on the left and right panels, respectively. The optimal r was selected by BITC. The thick lines are the averages of the trace plots over the 100 replicates.

[Heat maps: true surface, 0.025 quantile, median, and 0.975 quantile; (a) Setting 1, (b) Setting 2.]

Figure 13: Simulation configurations for intensity surface, with fitted intensity surfaces. Median and quantiles are calculated from the 100 replicates.

for the three components of λ = (0.2, 5, 20) are (0.074, −0.058, −0.461), with standard deviations (0.062, 0.371, 0.627). For setting 2, the empirical biases for the five components λ = (0.2, 1, 5, 20, 80) are (0.299, 0.189, −0.306, −0.396, −0.964), with standard deviations (0.111, 0.231, 0.398, 0.668, 1.237). It is clear that PCRP did not give accurate intensity estimation under setting 2. For the two components with intensity 0.2 and 1, PCRP fails to distinguish them and tends to put them together. This is also the reason that K̂ is more frequently 4 instead of 5 under setting 2.

The comparison tells us that, when there is an obvious spatial clustering effect, ddCRP can capture the intensity values more efficiently. When spatially separated regions have close intensity values, ddCRP distinguishes them more accurately than PCRP, which tends to merge them into the same intensity group.

3.5 Discussion

As shown in the histogram of K̂ using ddCRP in Figure 10, the overclustering problem is still present in ddCRP. One future direction is to address the overclustering problem by combining ddCRP and PCRP. The modification of the prior on z is straightforward, as shown below:

$$
\Pr(z_i = c \mid D, f, \mathbf{z}_{-i}) \propto
\begin{cases}
\left(\sum_{j \neq i} f(d_{ij})\, 1(z_j = c)\right)^{r} & \exists\, j \neq i,\ z_j = c, \\
\alpha & \forall\, j \neq i,\ z_j \neq c,
\end{cases}
\tag{3.8}
$$

where f, D, and α are defined the same way as in ddCRP in (3.1), and r is the power parameter for PCRP.

Under the same model setup as in (3.3) but with the prior on z replaced by (3.8), the full conditional distributions for z and λ can be obtained by modifying the full conditional distribution of z with the power r:

$$
\Pr(z_i = c \mid S, \mathbf{z}_{-i}, \lambda_0, \beta) \propto
\begin{cases}
\left(\sum_{j \neq i} f(d_{ij})\, 1(z_j = c)\right)^{r} \dfrac{\lambda_c^{m_i}}{\exp(\lambda_c \mu(A_i))} & \exists\, j \neq i,\ z_j = c \ \text{(existing label)}, \\[1.5ex]
\alpha\, \dfrac{b^{a}\, \Gamma(m_i + a)}{(b + \mu(A_i))^{m_i + a}\, \Gamma(a)} & \forall\, j \neq i,\ z_j \neq c \ \text{(new label)},
\end{cases}
\tag{3.9}
$$

where the notations are the same as in (3.6). The full conditional distribution for λ will be the same as the one in the previous model in (3.7).
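To make the update concrete, here is a minimal Python sketch of the unnormalized log-weights in (3.9) (the dissertation's computations are in R; the decay function f, the per-point count m[i], and the area term mu_A[i] are passed in under hypothetical names and values):

```python
import numpy as np
from math import lgamma

def label_logweights(i, z, D, lam, m, mu_A, a, b, alpha, r, f):
    """Unnormalized log-weights of (3.9) for point i: each existing
    label c gets the ddCRP weight sum raised to the power r times the
    Poisson likelihood term; a new label gets the marginalized term."""
    logw = {}
    for c in np.unique(np.delete(z, i)):          # existing labels
        w = sum(f(D[i, j]) for j in range(len(z)) if j != i and z[j] == c)
        logw[c] = r * np.log(w) + m[i] * np.log(lam[c]) - lam[c] * mu_A[i]
    # new label: alpha * b^a * Gamma(m_i + a) / ((b + mu(A_i))^(m_i + a) * Gamma(a))
    logw["new"] = (np.log(alpha) + a * np.log(b) + lgamma(m[i] + a)
                   - (m[i] + a) * np.log(b + mu_A[i]) - lgamma(a))
    return logw
```

Normalizing exp(logw) after subtracting the maximum gives the sampling probabilities for z_i.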

The challenge for this model is how to tune the two parameters h and r. BITC and LPML/DIC are used for tuning r and h, respectively, but it is not clear how the model will perform when both tuning parameters change. The investigation can start by tuning one parameter first to find its optimal value, and then tuning the other parameter given the optimal value of the first. Doing so makes the complexity of tuning linear rather than quadratic in the number of candidate values.

Chapter 4

Marked Spatial Point Process

4.1 Introduction

When there are additional measurements along with each event location of the spatial point pattern data, a marked spatial point process should be considered to jointly model the spatial locations and the measurements, which are called marks. It is important to consider these two parts together since the spatial locations and the marks are often related to each other. Our motivating example is the shot chart of basketball, where the spatial location is the location of each shot attempt, and the mark is the binary result of each shot, either made or missed.

Modeling basketball shot chart data is very meaningful in sports analytics. Shot charts are important summaries for basketball coaches. Good defense strategies depend on a good understanding of the offensive players' tendencies to shoot and abilities to score.

The success rate of a basketball shot may be higher at locations in the court where a player makes more shots. In a marked spatial point process model, this means that the mark and the point pattern are dependent. There are two approaches to modeling how the mark depends on the point process. Location dependent models (Mrkvička et al., 2011) are observation driven, where the observed point pattern is incorporated into characterizing the spatially varying distribution of the mark. Intensity dependent models (Ho and Stoyan, 2008) are parameter driven, where the intensity, instead of the observed point pattern of the point process, characterizes the distribution of the mark at each point in the spatial domain. To the best of our knowledge, no work has been done to jointly model the intensity of the shot attempts and the results of the attempts.

The contribution of this section is two-fold. First, we propose a Bayesian joint model of a marked spatial point process to analyze the spatial intensity and the outcomes of shot attempts simultaneously. In particular, we use a non-homogeneous Poisson point process to model the spatial pattern of the shot attempts and incorporate the intensity of the process as a covariate in the model for the success rate of each shot. Second, we propose variable selection with the spike-slab prior to identify important covariates for the model of the success rate. Inferences are made with Markov chain Monte Carlo.

The modified deviance information criterion (mDIC) and the modified logarithm of the pseudo-marginal likelihood (mLPML) are introduced to assess the fit of our proposed model.

4.2 Shot Chart Data

The website stats.nba.com provides shot data for NBA players. We focus on the 2017–2018 regular NBA season here. For each player, the dataset contains information about

Table 3: Shot data summary for four players in the 2017–2018 regular NBA season. The period includes 4 quarters and overtime.

Player    # Shots   Success rate (%)   2-point shot (%)   Period (%)
Curry        740         50.0              43.4           (35.0, 20.7, 34.1, 9.9, 0.4)
Durant      1032         52.5              66.7           (30.9, 23.6, 30.4, 14.7, 0.3)
Harden      1286         45.6              50.6           (28.7, 22.3, 27.7, 21.0, 0.3)
James       1409         54.3              74.6           (29.1, 21.5, 25.3, 23.6, 0.4)

each of his shots in this season, including game date, opponent team, game period when the shot was made (four quarters and a fifth period representing extra time), minutes and seconds left to the end of that period, success indicator or mark (0 for missed and 1 for made), shot type (2-point or 3-point shot), shot distance, and shot location coordinates, among others. In the data, the half court is positioned in a Cartesian coordinate system (x, y) with the origin placed at the center of the basket rim, x ranging from −25 to 25 feet and y ranging from −5 to 42 feet. The Euclidean distance of a location to the origin is rounded to the nearest foot. Shots made beyond the half court, which are very rare and not of our interest, are excluded.

From the top 20 players ranked by the website, we chose four players with quite different styles: Stephen Curry, Kevin Durant, James Harden and LeBron James. Figure 14 shows their shot locations with the shot success indicators. Table 3 summarizes the count, success rate, percentage of 2-point shots, and percentage of shots in each period. As shown in Figure 14, most of the shots were made close to the rim or just beyond the 3-point line. This is expected since a shorter distance should give higher shot accuracy for either 2-point or 3-point shots. The overall shot success rates for the

[Four shot-chart panels: Curry, Durant, Harden, and James; dots mark shot locations, colored by made or missed.]

Figure 14: Shot charts of Curry, Durant, Harden and James in the 2017–2018 regular NBA season.

four players were all close to 50%. James had the highest percentage (74.6%) of 2-point shots while Curry had the highest percentage (56.6%) of 3-point shots. About half and 2/3 of the shots made by Harden and Durant, respectively, were 2-point shots. Curry and Durant's shot percentages were very similar for the first three periods and smaller for the fourth period. Harden and James had relatively similar shot percentages across the four quarters.

4.3 Model Setup

The observed shot chart of a player is represented by (S, M), where S is the collection of the locations of shot attempts (x and y coordinates) and M is the vector of the corresponding marks (1 means success and 0 means failure). Assuming that N shots were observed, we have $S = (s_1, s_2, \ldots, s_N)$ and $M = (m(s_1), m(s_2), \ldots, m(s_N))$.

4.3.1 Marked Spatial Point Process

We propose to model (S, M) by a marked spatial point process. The shot locations S are modeled by a non-homogeneous Poisson point process (e.g., Diggle, 2013). Let B ⊂ R² be a subset of the half basketball court on which we are interested in modeling the shot intensity. A Poisson point process is defined such that $N(A) = \sum_{i=1}^{N} 1(s_i \in A)$ for any A ⊂ B follows a Poisson distribution with mean $\lambda(A) = \int_A \lambda(s)\,ds$, where λ(·) defines an intensity function of the process. The likelihood of the observed locations S is

$$
\prod_{i=1}^{N} \lambda(s_i) \exp\left(-\int_B \lambda(s)\,ds\right).
$$

Covariates can be incorporated into the intensity by setting

$$
\lambda(s_i) = \lambda_0 \exp\left(X^{\top}(s_i)\,\beta\right), \tag{4.1}
$$

where λ0 is the baseline intensity, X(si) is a p × 1 spatially varying covariate vector, and

β is the corresponding coefficient vector.

Next we consider modeling the success indicator (mark). It is natural to suspect that the success rate of shot attempts is higher at locations with higher shot intensity, suggesting an intensity dependent mark model. In particular, the success indicator is modeled by a logistic regression

$$
\begin{aligned}
m(s_i) \mid Z(s_i) &\sim \mathrm{Bernoulli}\big(\theta(s_i)\big), \\
\mathrm{logit}\big(\theta(s_i)\big) &= \xi\,\lambda(s_i) + Z^{\top}(s_i)\,\alpha,
\end{aligned}
\tag{4.2}
$$

where λ(s_i) is the intensity defined in (4.1) with a scalar coefficient ξ, Z(s_i) is a q × 1 covariate vector evaluated at the i-th data point (Z need not be spatial; e.g., it may contain period covariates), and α is a q × 1 coefficient vector.

With Θ = (λ_0, β, ξ, α), the joint likelihood for the observed marked spatial point process (S, M) is

$$
L(\Theta \mid S, M) \propto \prod_{i=1}^{N} \theta(s_i)^{m(s_i)} \left(1 - \theta(s_i)\right)^{1 - m(s_i)}
\times \left(\prod_{i=1}^{N} \lambda(s_i)\right) \exp\left(-\int_B \lambda(s)\,ds\right). \tag{4.3}
$$
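As a hedged illustration of how (4.3) can be evaluated in practice, the following Python sketch (not the R/nimble implementation used in this dissertation; all array names and shapes are assumptions) combines the Bernoulli mark log-likelihood with a grid-based Riemann sum for the integral:

```python
import numpy as np

def joint_loglik(M, X, Z, lam0, beta, xi, alpha, X_grid, cell_area):
    """Log of (4.3): Bernoulli mark log-likelihood plus the Poisson
    process log-likelihood, with the integral of lambda over B
    replaced by a Riemann sum over grid-box centers."""
    lam_pts = lam0 * np.exp(X @ beta)                 # intensity at observed points
    theta = 1.0 / (1.0 + np.exp(-(xi * lam_pts + Z @ alpha)))
    mark_ll = np.sum(M * np.log(theta) + (1 - M) * np.log1p(-theta))
    lam_grid = lam0 * np.exp(X_grid @ beta)           # intensity at grid centers
    integral = np.sum(lam_grid) * cell_area           # Riemann approximation
    return mark_ll + np.sum(np.log(lam_pts)) - integral
```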

4.3.2 Prior Specification

Vague priors are specified for the model parameters. For λ0, the gamma distribution is a conjugate prior (e.g., Leininger et al., 2017). For β, ξ, or α, there is no simple conjugate prior and we specify a vague, independent normal prior. In summary, we have

$$
\begin{aligned}
\lambda_0 &\sim \mathrm{Gamma}(a, b), \\
\beta &\sim \mathrm{MVN}(0, \sigma^2 I_p), \\
\xi &\sim N(0, \delta^2), \\
\alpha &\sim \mathrm{MVN}(0, \delta^2 I_q),
\end{aligned}
\tag{4.4}
$$

where Gamma(a, b) represents a gamma distribution with shape a and scale b, MVN(0, Σ) is a multivariate normal distribution with mean vector 0 and variance matrix Σ, a, b, σ², and δ² are hyper-parameters to be specified, and I_k is the k-dimensional identity matrix.

4.3.3 Bayesian Variable Selection

The variable selection procedure aims to find those coefficients that are significantly different from zero. For our problem, we focus on variable selection for the mark model (4.2). The same method can be applied to the intensity model (4.1). In the intensity dependent mark model, the intensity λ is always present and, hence, is not selected. Also kept is the intercept in α.

A widely used Bayesian variable selection method is to place the spike-slab prior independently on the covariate coefficients (Ishwaran and Rao, 2005). This prior is a mixture of a nearly degenerate distribution at zero (the spike) and a flat distribution (the slab). One simple choice for the spike and slab is mean-zero normal distributions with a small and a large variance, respectively. The ratio of the small variance to the large variance should not be too small, to avoid the MCMC getting stuck in the spike component (Malsiner-Walli and Wagner, 2018). George and McCulloch (1993) recommended a ratio of 1/10,000. A common choice is 0.01 and 100 for the small and large variances, respectively. To summarize, the spike-slab prior for each αi we want to select is specified by

$$
\begin{aligned}
\alpha_i &\sim N(0, \delta_i^2), \\
\delta_i^2 &= 0.01(1 - \gamma_i) + 100\gamma_i, \\
\gamma_i &\sim \mathrm{Bernoulli}(\phi_i), \\
\phi_i &\sim \mathrm{Beta}(0.5, 0.5).
\end{aligned}
\tag{4.5}
$$

MCMC sampling is used to draw posterior samples for each of these parameters.

Since γ_i is a binary random variable, its posterior mode is used to decide on the importance of Z_i. A posterior mode of zero means that α_i is not significant and Z_i should not be included in the model.
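To illustrate why the posterior of γ_i concentrates appropriately, the conditional probability that γ_i = 1 given α_i and φ_i under (4.5) has a closed form; the Python sketch below computes it (illustrative only, not the nimble samplers actually used):

```python
import numpy as np

def gamma_update_prob(alpha_i, phi_i, v_spike=0.01, v_slab=100.0):
    """P(gamma_i = 1 | alpha_i, phi_i) under the mixture prior (4.5):
    the posterior weight of the slab component for coefficient alpha_i."""
    def norm_pdf(x, v):
        return np.exp(-x * x / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)
    slab = phi_i * norm_pdf(alpha_i, v_slab)
    spike = (1.0 - phi_i) * norm_pdf(alpha_i, v_spike)
    return slab / (slab + spike)
```

A Gibbs step would then draw γ_i from a Bernoulli with this probability; across MCMC iterations the posterior mode of γ_i gives the selection decision.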

4.4 Bayesian Inference

4.4.1 The MCMC Sampling Schemes

The posterior distribution of Θ is

π(Θ|S, M) ∝ L(Θ | S, M)π(Θ), (4.6)

where π(Θ) = π(λ0)π(β)π(ξ)π(α) is the joint prior density as specified in (4.4) or (4.5).

In practice, we used vague priors with hyper-parameters σ² = δ² = 100 and a = b = 0.01 in (4.4).

To sample from the posterior distribution of Θ in (4.6), a Metropolis–Hastings within Gibbs algorithm is facilitated by the R package nimble (de Valpine et al., 2017). The log-likelihood function of the joint model used in the MCMC iterations is directly defined using the RW_llFunction() sampler. To use the spike-slab prior in (4.5) for variable selection, we need to specify the sampling methods for the hyper-parameters φ_i's and γ_i's. The RW() sampler can be used to sample the φ_i's and the binary() sampler can be used for the γ_i's.

The integral in the likelihood function (4.3) does not have a closed form. It needs to be computed with a Riemann approximation by partitioning B into a grid with a sufficiently fine resolution. Within each grid box, the integrand λ(s) is approximated by a constant. The integral of λ(s) then becomes the integral of a piecewise constant function, which is easily computed as a summation over all of the grid boxes.
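As a quick self-contained check of this approximation (the intensity, domain, and resolution below are illustrative, not the ones used in the analysis), a midpoint Riemann sum can be compared against a closed-form integral:

```python
import numpy as np

# Intensity lambda(s) = exp(2x + y) on B = [0, 1] x [0, 1]; its integral
# has the closed form ((e^2 - 1) / 2) * (e - 1), so the Riemann sum over
# an n x n grid of box centers can be checked directly.
n = 200
xs = (np.arange(n) + 0.5) / n             # box centers in [0, 1]
X, Y = np.meshgrid(xs, xs)
riemann = np.sum(np.exp(2 * X + Y)) * (1.0 / n) ** 2
exact = (np.e ** 2 - 1) / 2 * (np.e - 1)
rel_err = abs(riemann - exact) / exact    # midpoint-rule error shrinks as O(1/n^2)
```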

4.4.2 Bayesian Model Selection for the Mark Model

Within the Bayesian framework, the deviance information criterion (DIC; Spiegelhalter et al., 2002) and the logarithm of the pseudo-marginal likelihood (LPML; Geisser and Eddy, 1979; Gelfand and Dey, 1994) are two well-known Bayesian criteria for model comparison. A smaller DIC and a larger LPML indicate a better model. When assessing only a specific component of a full model, as is the case in our application where we are interested only in the mark model, these global criteria need to be modified to reflect that focus. Moreover, in contrast to DIC, the LPML for the joint model is hard to calculate since the number of points N is random and, hence, the standard computing approaches for known sample sizes do not apply. If we focus only on the mark model, however, then N can be treated as a fixed sample size and the existing approaches can be applied.

To assess the intensity dependent mark model, namely, whether the intensity helps improve the fit of the marks, one needs to compare two mark models, with and without the intensity as a covariate. We consider modifying DIC and LPML to focus only on the mark model. Using the idea from Ma et al. (2018), we first define the following deviance function:

$$
\mathrm{Dev}(\lambda, \alpha, \xi \mid M) = -2 \sum_{i=1}^{N} \log f\big(m(s_i) \mid \lambda(s_i), \alpha, \xi, Z(s_i)\big),
$$

where λ = (λ(s1), λ(s2), . . . , λ(sN )), and f(m(si) | λ(si), α, ξ, Z(si)) is the conditional probability mass function of m(si) given (λ(si), α, ξ, Z(si)). The effective number of parameters for this conditional model is defined as

$$
p_D = \overline{\mathrm{Dev}}(\lambda, \alpha, \xi \mid M) - \mathrm{Dev}(\bar{\lambda}, \bar{\alpha}, \bar{\xi} \mid M),
$$

where $\overline{\mathrm{Dev}}$ is the mean of the deviance evaluated at each posterior draw of the parameters, and $\bar{\lambda}$, $\bar{\alpha}$, and $\bar{\xi}$ are, respectively, the posterior means of λ, α, and ξ. A modified DIC (mDIC; Ma et al., 2018) for the mark model is

$$
\mathrm{mDIC} = \mathrm{Dev}(\bar{\lambda}, \bar{\alpha}, \bar{\xi} \mid M) + 2 p_D. \tag{4.7}
$$

Similarly, a modified LPML (mLPML) is defined using a modified conditional predictive ordinate (mCPO), which can be calculated using a Monte Carlo estimate (Chen et al., 2012, Ch. 10). For the i-th data point, define

$$
\widehat{\mathrm{mCPO}}_i = \left( \frac{1}{B} \sum_{b=1}^{B} \frac{1}{f\big(m(s_i) \mid \lambda^{(b)}(s_i), \alpha^{(b)}, \xi^{(b)}, Z(s_i)\big)} \right)^{-1},
$$

where $\{\lambda^{(b)}(s_i), \alpha^{(b)}, \xi^{(b)} : b = 1, 2, \ldots, B\}$ is a posterior sample of size B of the unknown parameters. Then the mLPML is

$$
\widehat{\mathrm{mLPML}} = \sum_{i=1}^{N} \log\big(\widehat{\mathrm{mCPO}}_i\big). \tag{4.8}
$$
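The two criteria can be computed from posterior draws along the following lines. This Python sketch is illustrative: for simplicity it evaluates the plug-in deviance at the posterior mean of θ(s_i) rather than at the posterior means of (λ, α, ξ), which is an assumption, not the exact definition above:

```python
import numpy as np

def bernoulli_logpmf(m, theta):
    return m * np.log(theta) + (1 - m) * np.log1p(-theta)

def mdic_mlpml(M, theta_draws):
    """mDIC (4.7) and mLPML (4.8) from a (B x N) array of posterior
    draws of the success probabilities theta(s_i)."""
    ll = bernoulli_logpmf(M[None, :], theta_draws)    # B x N log-likelihoods
    dev_draws = -2 * ll.sum(axis=1)
    dev_bar = dev_draws.mean()
    # plug-in deviance at the posterior mean of theta (simplifying assumption)
    dev_at_mean = -2 * bernoulli_logpmf(M, theta_draws.mean(axis=0)).sum()
    p_d = dev_bar - dev_at_mean
    mdic = dev_at_mean + 2 * p_d
    # mCPO_i: harmonic mean of the per-draw likelihoods
    mcpo = 1.0 / np.mean(np.exp(-ll), axis=0)
    mlpml = np.sum(np.log(mcpo))
    return mdic, mlpml
```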

4.5 Simulation

4.5.1 Estimation

To investigate the performance of the estimation, we generated data from a non-homogeneous Poisson point process defined on the square B = [−1, 1] × [−1, 1] with intensity λ(s_i) = 100 λ_0 exp(β_1 x_i + β_2 y_i), where s_i = (x_i, y_i) ∈ B is the location of each data point. For each s_i, i = 1, ..., N, the mark m(s_i) follows a logistic model with two covariates in addition to λ and an intercept:

$$
\begin{aligned}
m(s_i) &\sim \mathrm{Bern}(p_i), \\
\mathrm{logit}(p_i) &= \xi \lambda(s_i) + \alpha_0 + \alpha_1 Z_{1i} + \alpha_2 Z_{2i}.
\end{aligned}
\tag{4.9}
$$

The parameters of the model were designed to give point counts comparable to the basketball shot chart data. We fixed (β_1, β_2) = (2, 1), ξ = 0.5, α_0 = 0.5, and α_2 = 1. Three levels of α_1 were considered, α_1 ∈ {0.8, 1, 2}, in order to compare the performance of the estimation procedure under different magnitudes of the coefficients in the mark model. Two levels of λ_0 were considered, λ_0 ∈ {0.5, 1}, which controls the mean number of points on B. Integrating the intensity function over B in this case gives average numbers of points of 850 and 1700, respectively, for λ_0 = 0.5 and 1. These numbers are approximately in the range of the NBA basketball shot charts in Section 4.2. In the mark model, covariate Z_1 was generated from the standard normal distribution; two types of Z_2 were considered, standard normal or Bernoulli with rate 0.5. The resulting Bernoulli rates of the marks were within (0.55, 0.78) for all the scenarios.
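The data-generating mechanism above can be sketched with a thinning algorithm (a Python stand-in for the R/spatstat simulation; the seed, and the rescaling of λ inside the logit to keep the toy success rates non-degenerate, are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
lam0, b1, b2 = 0.5, 2.0, 1.0
xi, a0, a1, a2 = 0.5, 0.5, 1.0, 1.0
lam = lambda x, y: 100 * lam0 * np.exp(b1 * x + b2 * y)

# Thinning: simulate a homogeneous Poisson process at the maximum
# intensity on B = [-1, 1]^2, then keep each point with probability
# lambda(s) / lambda_max.
lam_max = lam(1.0, 1.0)                       # exp(b1*x + b2*y) peaks at (1, 1)
n = rng.poisson(lam_max * 4.0)                # area of B is 4
x = rng.uniform(-1, 1, n); y = rng.uniform(-1, 1, n)
keep = rng.uniform(0, 1, n) < lam(x, y) / lam_max
x, y = x[keep], y[keep]

# Marks from the logistic model (4.9); lam is divided by its maximum
# here purely to keep success rates away from 0 and 1 in this sketch.
z1 = rng.standard_normal(x.size); z2 = rng.standard_normal(x.size)
p = 1.0 / (1.0 + np.exp(-(xi * lam(x, y) / lam_max + a0 + a1 * z1 + a2 * z2)))
m = (rng.uniform(0, 1, x.size) < p).astype(int)
```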

For each setting, 200 data sets were generated. The R package spatstat (Baddeley et al., 2005) was used to generate the Poisson point process data with the given intensity function. The priors for the model parameters were set as in (4.4) with the hyper-parameters σ² = δ² = 100 and a = b = 0.01. The grid used to calculate the integral in the likelihood function had resolution 100 × 100. For each data set, an MCMC was run for 20,000 iterations with the first 10,000 treated as the burn-in period. For each parameter, the posterior mean was used as the point estimate and the 95% credible interval was constructed with the 2.5% lower and upper quantiles of the posterior sample. The empirical standard deviation (SD) of the 200 point estimates and the mean of the posterior standard deviations (ŜD) were also reported.

Tables 4 and 5 summarize the simulation results for the scenarios of standard normal Z_2 and Bernoulli Z_2, respectively. The empirical biases for all the settings are close to zero. The average posterior standard deviation from the 200 replicates is very close to the empirical standard deviation of the 200 point estimates for all the parameters, suggesting that the uncertainty of the estimators is estimated well. Consequently, the empirical coverage rates of the credible intervals are close to the nominal level 0.95. As α_1 increases, the variation increases in the mark parameter estimates but does not change in the intensity parameter estimates. As λ_0 increases, the variation of the estimates for both intensity and mark parameters decreases. Between the continuous and binary cases of Z_2, the variation in the estimates is higher in the latter case, especially for the coefficient of Z_2.

4.5.2 Variable Selection

To assess the performance of the variable selection method, we generated data with additional covariates and different scales of one of the covariate coefficients in the mark

Table 4: Summaries of the bias, standard deviation (SD), average of the Bayesian SD estimate (ŜD), and coverage rate (CR) of 95% credible intervals when Z2 is continuous: ξ = α0 = 0.5, α2 = 1, (β1, β2) = (2, 1) and Z2 ∼ N(0, 1).

                            λ0 = 0.5                   λ0 = 1
α1   Model      Para    Bias   SD    ŜD    CR     Bias   SD    ŜD    CR
0.8  Intensity  λ0      0.01  0.04  0.04  0.96    0.01  0.06  0.06  0.93
                β1     −0.06  0.11  0.11  0.90   −0.06  0.09  0.08  0.88
                β2     −0.05  0.09  0.09  0.94   −0.03  0.06  0.06  0.92
     Mark       ξ       0.11  0.57  0.60  0.97    0.04  0.22  0.22  0.96
                α0      0.01  0.20  0.20  0.95    0.00  0.14  0.14  0.97
                α1      0.03  0.13  0.13  0.94    0.01  0.09  0.09  0.94
                α2      0.03  0.14  0.14  0.95    0.01  0.10  0.10  0.93
1    Intensity  λ0      0.00  0.04  0.04  0.94    0.00  0.06  0.06  0.94
                β1     −0.05  0.11  0.11  0.94   −0.05  0.08  0.08  0.91
                β2     −0.03  0.10  0.09  0.92   −0.04  0.07  0.06  0.92
     Mark       ξ       0.03  0.60  0.61  0.95    0.05  0.21  0.22  0.96
                α0      0.00  0.20  0.20  0.95   −0.01  0.14  0.14  0.95
                α1      0.01  0.13  0.14  0.97    0.01  0.10  0.10  0.96
                α2      0.03  0.14  0.14  0.95    0.01  0.10  0.10  0.94
2    Intensity  λ0      0.00  0.04  0.04  0.94    0.00  0.06  0.06  0.95
                β1     −0.06  0.12  0.11  0.91   −0.04  0.08  0.08  0.93
                β2     −0.03  0.09  0.09  0.94   −0.03  0.07  0.06  0.91
     Mark       ξ       0.04  0.71  0.69  0.94    0.05  0.23  0.24  0.96
                α0      0.02  0.22  0.23  0.95   −0.01  0.15  0.16  0.97
                α1      0.08  0.23  0.21  0.93    0.03  0.15  0.15  0.94
                α2      0.03  0.17  0.16  0.93    0.02  0.11  0.11  0.95

Table 5: Summaries of the bias, standard deviation (SD), average of the Bayesian SD estimate (ŜD), and coverage rate (CR) of 95% credible intervals when Z2 is binary: ξ = α0 = 0.5, α2 = 1, (β1, β2) = (2, 1) and Z2 ∼ Bernoulli(0.5).

                            λ0 = 0.5                   λ0 = 1
α1   Model      Para    Bias   SD    ŜD    CR     Bias   SD    ŜD    CR
0.8  Intensity  λ0      0.00  0.04  0.04  0.94    0.01  0.06  0.06  0.95
                β1     −0.05  0.12  0.11  0.88   −0.05  0.08  0.08  0.90
                β2     −0.02  0.10  0.09  0.94   −0.03  0.06  0.06  0.95
     Mark       ξ       0.07  0.62  0.61  0.94    0.04  0.20  0.23  0.96
                α0      0.00  0.23  0.22  0.93    0.01  0.16  0.16  0.95
                α1      0.03  0.13  0.13  0.96    0.01  0.10  0.10  0.94
                α2      0.03  0.25  0.24  0.96    0.01  0.19  0.18  0.95
1    Intensity  λ0      0.00  0.04  0.04  0.94    0.00  0.06  0.06  0.94
                β1     −0.06  0.11  0.11  0.93   −0.04  0.09  0.08  0.89
                β2     −0.04  0.08  0.09  0.94   −0.02  0.06  0.06  0.93
     Mark       ξ       0.10  0.64  0.63  0.94    0.09  0.22  0.23  0.94
                α0      0.01  0.23  0.23  0.97   −0.03  0.16  0.16  0.94
                α1      0.03  0.15  0.14  0.92    0.01  0.11  0.10  0.92
                α2      0.02  0.27  0.25  0.93    0.02  0.17  0.18  0.96
2    Intensity  λ0      0.00  0.04  0.04  0.95    0.01  0.06  0.06  0.94
                β1     −0.05  0.11  0.11  0.92   −0.05  0.08  0.08  0.90
                β2     −0.04  0.09  0.09  0.91   −0.04  0.06  0.06  0.92
     Mark       ξ       0.06  0.73  0.70  0.94    0.07  0.28  0.25  0.93
                α0      0.03  0.29  0.26  0.93   −0.01  0.20  0.19  0.94
                α1      0.06  0.21  0.21  0.94    0.05  0.15  0.15  0.94
                α2      0.03  0.31  0.28  0.93    0.03  0.19  0.20  0.94

model. The intensity model and the coefficient values remained the same as in

Section 4.5.1. The mark model had six covariates (Z_1, Z_2, ..., Z_6). Except for Z_2, these covariates were generated independently from the standard normal distribution. Covariate Z_2 again had two forms: standard normal or Bernoulli with rate 0.5. The values ξ = α_0 = 0.5 and α_2 = 1 were fixed. Three levels of α_1 were considered: α_1 ∈ {0.8, 1, 2}. Three scenarios of (α_3, ..., α_6) were considered, (0, 0, 0, 0), (1, 1, 0, 0), and (1, 1, 1, 1), representing more and more covariates present in the data generating model. For each setting, 200 datasets were generated. For each dataset, the spike-slab priors in (4.5) were specified for the parameters α_i, i ∈ {1, 2, 3, 4, 5, 6}. The priors for the other parameters were kept the same as in (4.4). Other settings, such as the integration grid and the MCMC, were the same as in the previous study.

Tables 6 and 7 summarize the percentages of the 200 replicates in which the variable selection decision was correct for each covariate and for all covariates as a whole. Interestingly, the unimportant covariates are correctly excluded in all cases. The accuracy rates of correctly selecting the important variables individually decrease as more candidate variables are included. Changing Z_2 from continuous to binary gives less accurate selection decisions for this variable. More points with a higher λ_0 improves the accuracy rate. The selection performance for all variables as a whole depends on the worst individual variable selection result, with much lower accuracy for binary Z_2 than for continuous Z_2, especially when λ_0 is lower. A lower magnitude of α_1 leads to lower accuracy in selecting Z_1 as well as the whole model correctly. We also experimented with an

Table 6: Percentages of correct decisions on the variables in 200 replicates with continuous Z2. The significant parameters except α1 all equal 1. All covariates were generated from the standard normal distribution.

                        α1 = 0.8          α1 = 1            α1 = 2
Parameter  Non-zero  λ0 = 0.5  λ0 = 1  λ0 = 0.5  λ0 = 1  λ0 = 0.5  λ0 = 1

(α3, α4, α5, α6) = (0, 0, 0, 0)
α1         Yes          84       89       95       96       99      100
α2         Yes          96       97       94       98       94       92
α3         No          100      100      100      100      100      100
α4         No          100      100      100      100      100      100
α5         No          100      100      100      100      100      100
α6         No          100      100      100      100      100      100
α          –            80       86       89       94       93       92

(α3, α4, α5, α6) = (1, 1, 0, 0)
α1         Yes          77       87       94       94      100      100
α2         Yes          94       96       95       96       92       94
α3         Yes          94       96       93       96       94       98
α4         Yes          91       96       93       95       91       98
α5         No          100      100      100      100      100      100
α6         No          100      100      100      100      100      100
α          –            60       78       78       81       78       90

(α3, α4, α5, α6) = (1, 1, 1, 1)
α1         Yes          74       85       90       96      100      100
α2         Yes          95       93       94       95       94       94
α3         Yes          97       98       94       97       95       92
α4         Yes          90       93       96       97       91       96
α5         Yes          92       95       89       96       89       95
α6         Yes          93       94       93       96       87       97
α          –            61       71       76       87       80       83

Table 7: Percentages of correct decisions on the variables in 200 replicates with binary Z2. The significant parameters except α1 all equal 1. Z2 was generated from Bernoulli(0.5), while the other covariates were from the standard normal distribution.

                        α1 = 0.8          α1 = 1            α1 = 2
Parameter  Non-zero  λ0 = 0.5  λ0 = 1  λ0 = 0.5  λ0 = 1  λ0 = 0.5  λ0 = 1

(α3, α4, α5, α6) = (0, 0, 0, 0)
α1         Yes          80       87       95       96      100      100
α2         Yes          64       90       71       87       57       83
α3         No          100      100      100      100      100      100
α4         No          100      100      100      100      100      100
α5         No          100      100      100      100      100      100
α6         No          100      100      100      100      100      100
α          —            52       79       67       83       57       83

(α3, α4, α5, α6) = (1, 1, 0, 0)
α1         Yes          77       89       94       98      100      100
α2         Yes          65       85       62       82       58       97
α3         Yes          92       95       96       96       95       95
α4         Yes          94       94       94       96       93       94
α5         No          100      100      100      100      100      100
α6         No          100      100      100      100      100      100
α          —            44       68       52       74       53       76

(α3, α4, α5, α6) = (1, 1, 1, 1)
α1         Yes          69       85       93       97       99      100
α2         Yes          62       78       55       76       47       75
α3         Yes          93       95       92       97       91       97
α4         Yes          92       96       91       97       85       96
α5         Yes          92       96       93       93       89       96
α6         Yes          94       95       93       92       92       97
α          —            36       63       42       70       34       70

even lower α1 = 0.5 (results not shown), in which case the correct selection percentage for the whole model drops to as low as 10% in some settings.

4.6 Real Data Analysis

The proposed methods were applied to analyze the shot chart of each of the four players in Figure 14. Because there were few shots behind the backboard line or too far beyond the 3-point line, we focused on the region of the half court with vertical coordinate y ∈ [−0.75, 30] feet, where y = −0.75 is the line of the backboard. In evaluating the joint log-likelihood, this region was evenly partitioned into a 123 × 200 grid for the numerical integration in Equation (4.3).

Table 8 summarizes the covariates used in the intensity model and the mark model.

We defined two distance covariates: distance to basket and distance to 3-point line. The former one is set to 0 for locations outside of the 3-point line and the latter one is set to 0 for locations inside the 3-point line. Both of them were rounded to the nearest foot.

In order to improve the convergence of the MCMC, these variables were standardized (centered by the mean and scaled by the standard deviation). Intensity may also be related to the shot angle (using the center of the basket rim as the origin). We divided the court into six areas according to the shot angle relative to the backboard line: [−π/2, π/6), [π/6, π/3), [π/3, π/2), [π/2, 2π/3), [2π/3, 5π/6), and [5π/6, 3π/2]. Dummy variables were created using [−π/2, π/6) as the reference.
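The covariate construction just described can be sketched as below. This is an illustration only: the 3-point line is simplified to a circular arc of radius 23.75 ft (the real NBA line is flat in the corners), and the function names are hypothetical, not the dissertation's code.

```python
import numpy as np

THREE_PT_RADIUS = 23.75  # ft; simplified circular 3-point arc (assumption)

def covariates(x, y):
    """Intensity-model covariates for shots at (x, y), basket at the origin."""
    d = np.hypot(x, y)
    beyond3 = (d > THREE_PT_RADIUS).astype(float)             # indicator
    dist_basket = np.where(beyond3 == 0, np.round(d), 0.0)    # 0 for 3-pt shots
    dist_3pt = np.where(beyond3 == 1,
                        np.round(d - THREE_PT_RADIUS), 0.0)   # 0 for 2-pt shots
    # shot angle relative to the backboard line, binned into the six intervals
    theta = np.arctan2(y, x)
    edges = [-np.pi / 2, np.pi / 6, np.pi / 3, np.pi / 2,
             2 * np.pi / 3, 5 * np.pi / 6, 3 * np.pi / 2]
    angle_bin = np.digitize(theta, edges) - 1   # 0 = reference [-pi/2, pi/6)
    dummies = np.eye(6)[angle_bin][..., 1:]     # drop the reference level
    return beyond3, dist_basket, dist_3pt, dummies

def standardize(v):
    """Center by the mean and scale by the standard deviation."""
    return (v - v.mean()) / v.std()

d_std = standardize(np.array([5.0, 10.0, 23.0]))
```

The rounding to the nearest foot happens before standardization, matching the order described in the text.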

Table 8: Covariates used in the intensity model and the mark model.

Model      Covariate                  Explanation
intensity  beyond 3-point line        indicator (1 = beyond 3-point line)
           distance to basket         standardized; 0 for 3-point shots
           distance to 3-point line   standardized; 0 for 2-point shots
           shot angle                 relative to backboard line; 6 levels with
                                      [−π/2, π/6) as reference
mark       intensity                  shot intensity
           game period                5 levels with the first period as reference
           seconds left               time in seconds left towards the end of the
                                      period, divided by 100
           opponent                   indicator (1 = opponent made playoff last season)
           beyond 3-point line        same as above
           distance to basket         same as above
           distance to 3-point line   same as above
           shot angle                 same as above

Table 9: Summaries of mDIC and mLPML for the mark model and DIC for the full joint model.

                   Mark model                              Joint model
                   mDIC                mLPML               DIC
Player        ξ ≠ 0     ξ = 0     ξ ≠ 0     ξ = 0     ξ ≠ 0     ξ = 0
Curry        1041.7    1038.8    −520.9    −521.3     151.1     148.2
Durant       1386.4    1454.1    −693.2    −729.1     335.0     402.8
Harden       1720.1    1809.2    −860.2    −918.0    1516.7    1605.9
James        1758.6    1809.2    −879.3    −909.1    1750.6    1802.0

The joint model in (4.1)–(4.2) was fitted with the covariates in Table 8. For the mark model, the spike-slab prior (4.5) was imposed on each element of α except the intercept. The priors of the other parameters were set as in (4.4) with hyperparameters σ² = δ² = 100 and a = b = 0.01. To check the importance of intensity as a covariate in the mark component, we fitted the model both with and without the restriction ξ = 0. The two mark models were compared with mDIC and mLPML. In addition, the DIC for the full joint model, which, unlike the LPML, can be easily computed, was also obtained to compare the joint fit of the intensity and mark models with and without ξ = 0. For each model fit, the trace plots of the MCMC were checked and the convergence of all parameters was confirmed.
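Since the exact form of prior (4.5) is specified elsewhere in the dissertation, the sketch below only illustrates the generic spike-and-slab idea with a Bernoulli-normal mixture: each coefficient is drawn either from a narrow spike around zero (excluded) or from a diffuse slab (included). All parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def spike_slab_draw(p, incl_prob=0.5, spike_sd=0.01, slab_sd=10.0):
    """One prior draw of p coefficients under a generic spike-and-slab."""
    gamma = rng.binomial(1, incl_prob, size=p)   # inclusion indicators
    sd = np.where(gamma == 1, slab_sd, spike_sd)
    return gamma, rng.normal(0.0, sd)

gamma, alpha = spike_slab_draw(p=8)
# In the MCMC, averaging the sampled indicators over the posterior draws
# gives posterior inclusion probabilities, which drive variable selection.
```

The variables reported in Tables 10–11 are those whose posterior inclusion probabilities exceed a selection threshold under such a prior.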

Table 9 summarizes the mDIC and mLPML for the mark model and the DIC for the full joint model. The smallest difference is 2.9 in mDIC and 0.4 in mLPML, both for Curry; the largest difference is 89.1 in mDIC and 57.8 in mLPML, both for Harden. The DIC has a rule of thumb similar to that of the AIC (Spiegelhalter et al., 2002, p. 613): a difference larger than 10 is substantial, while a difference of about 2–3 does not give evidence to support one model over the other. For the LPML, a difference less than 0.5 is "not worth more than a bare mention" and a difference larger than 4.5 can be considered "very strong" (Kass and Raftery, 1995). With these guidelines applied to mDIC and mLPML, the mark model with shot intensity included as a covariate has a clear advantage over the model without it for all players except Curry. For Curry, the two criteria favor different models, but both differences are too small to support a firm conclusion.
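As an illustration, the rule-of-thumb thresholds can be applied mechanically to the Table 9 values; the numbers below are copied from the table and the thresholds follow the guidelines just cited.

```python
# player: (mDIC, xi != 0), (mDIC, xi = 0), (mLPML, xi != 0), (mLPML, xi = 0)
table9 = {
    "Curry":  (1041.7, 1038.8, -520.9, -521.3),
    "Durant": (1386.4, 1454.1, -693.2, -729.1),
    "Harden": (1720.1, 1809.2, -860.2, -918.0),
    "James":  (1758.6, 1809.2, -879.3, -909.1),
}
for player, (dic1, dic0, lpml1, lpml0) in table9.items():
    d_dic = dic0 - dic1     # > 10 is substantial support for the xi != 0 model
    d_lpml = lpml1 - lpml0  # > 4.5 is "very strong" support
    strong = d_dic > 10 and d_lpml > 4.5
    print(f"{player}: dmDIC = {d_dic:.1f}, dmLPML = {d_lpml:.1f}, strong = {strong}")
```

This reproduces the conclusion in the text: strong support for ξ ≠ 0 for Durant, Harden, and James, but not for Curry.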

We also obtained the DIC for the whole joint model for all four players. The differences in the DIC of the joint models are almost the same as those in the mDIC of the mark models, suggesting that the DIC for the intensity model is almost the same with and without ξ = 0. This is expected because the marks may contain little information about the intensities.

Figure 15 presents the fitted shot intensities of the four players on the same scale.

The fitted intensities appear to capture the spatial patterns of the shot charts in Figure 14. The intensities are highest at the origin (the basket) and gradually decrease as the distance increases. An obvious increase in intensity is observed at the 3-point line, followed by a faster rate of decrease as the distance increases further. Regions with angles closer to vertical relative to the backboard have higher intensities than regions with angles closer to horizontal. The fitted intensities are lowest just inside the 3-point line, where the shot angles are poor. Between the players, it is obvious that Curry has a higher intensity just outside the 3-point line than James, although James made almost twice as many

Figure 15: Intensity fit results of Curry, Durant, Harden and James on the same scale. Red means higher intensity.

Table 10: Data analysis results using the intensity-dependent model for Curry and Durant.

Player   Model      Covariate                  Posterior   Posterior   95% Credible
                                               Mean        SD          Interval

Curry    Intensity  baseline (λ0)                0.25        0.03      ( 0.19,  0.32)
                    beyond 3-point line         −2.12        0.22      (−2.56, −1.70)
                    distance to basket          −1.39        0.07      (−1.54, −1.25)
                    distance to 3-point line    −2.28        0.17      (−2.61, −1.96)
                    angle [π/6, π/3]             0.69        0.15      ( 0.40,  0.98)
                    angle [π/3, π/2]             1.03        0.14      ( 0.75,  1.30)
                    angle [π/2, 2π/3]            0.78        0.15      ( 0.49,  1.06)
                    angle [2π/3, 5π/6]           0.58        0.15      ( 0.29,  0.88)
                    angle [5π/6, 3π/2]           0.06        0.18      (−0.28,  0.40)
         Mark       intercept                    0.32        0.18      (−0.02,  0.67)
                    beyond 3-point line         −0.88        0.37      (−1.91, −0.21)
                    fifth period                 7.10        6.40      (−0.10, 22.09)

Durant   Intensity  baseline (λ0)                0.65        0.07      ( 0.53,  0.79)
                    beyond 3-point line         −3.21        0.22      (−3.64, −2.79)
                    distance to basket          −0.98        0.05      (−1.07, −0.89)
                    distance to 3-point line    −2.32        0.17      (−2.65, −1.99)
                    angle [π/6, π/3]             0.71        0.12      ( 0.48,  0.95)
                    angle [π/3, π/2]             0.84        0.12      ( 0.61,  1.07)
                    angle [π/2, 2π/3]            0.65        0.12      ( 0.42,  0.90)
                    angle [2π/3, 5π/6]           0.42        0.13      ( 0.17,  0.67)
                    angle [5π/6, 3π/2]          −0.21        0.15      (−0.49,  0.08)
         Mark       intercept                   −0.39        0.18      (−0.73, −0.01)
                    intensity (λ)                0.33        0.07      ( 0.19,  0.46)

total shots as Curry.

Estimates of the model parameters for the four players are summarized in Tables 10–11, including the posterior mean, posterior standard deviation, and 95% credible interval (constructed from the lower and upper 2.5% percentiles of the posterior sample). The results from the intensity models provide more detailed insights into how the covariates affect the shot intensity as visualized in Figure 15. Regions with shot angles in [π/6, 5π/6]

Table 11: Real data analysis results using the intensity-dependent model for Harden and James.

Player   Model      Covariate                  Posterior   Posterior   95% Credible
                                               Mean        SD          Interval

Harden   Intensity  baseline (λ0)                0.18        0.02      ( 0.14,  0.23)
                    beyond 3-point line         −2.12        0.18      (−2.46, −1.77)
                    distance to basket          −2.44        0.07      (−2.58, −2.31)
                    distance to 3-point line    −2.49        0.13      (−2.75, −2.24)
                    angle [π/6, π/3]             1.02        0.14      ( 0.76,  1.29)
                    angle [π/3, π/2]             1.41        0.13      ( 1.16,  1.67)
                    angle [π/2, 2π/3]            1.57        0.13      ( 1.32,  1.82)
                    angle [2π/3, 5π/6]           1.27        0.14      ( 1.01,  1.54)
                    angle [5π/6, 3π/2]           0.29        0.16      (−0.02,  0.60)
         Mark       intercept                   −0.58        0.17      (−0.94, −0.26)
                    intensity (λ)                0.07        0.01      ( 0.04,  0.09)
                    angle [5π/6, 3π/2]           0.57        0.35      ( 0.00,  1.18)

James    Intensity  baseline (λ0)                0.74        0.06      ( 0.62,  0.87)
                    beyond 3-point line         −3.15        0.22      (−3.59, −2.72)
                    distance to basket          −2.10        0.05      (−2.20, −2.01)
                    distance to 3-point line    −2.32        0.17      (−2.66, −1.98)
                    angle [π/6, π/3]             0.42        0.09      ( 0.24,  0.61)
                    angle [π/3, π/2]             0.58        0.09      ( 0.40,  0.76)
                    angle [π/2, 2π/3]            0.58        0.09      ( 0.40,  0.77)
                    angle [2π/3, 5π/6]           0.41        0.10      ( 0.22,  0.60)
                    angle [5π/6, 3π/2]          −0.21        0.11      (−0.43,  0.00)
         Mark       intercept                   −0.72        0.15      (−1.01, −0.43)
                    intensity (λ)                0.08        0.01      ( 0.07,  0.11)

have higher shot intensities than those with poorer angles for all four players. The estimated coefficients of the 3-point indicator suggest that Curry and Harden tend to shoot 3-pointers more often than James and Durant. The estimated coefficients of the distance to basket are closer to zero for Curry and Durant than for Harden and James, suggesting that the 2-point shot intensities of the former decrease at a slower rate than those of the latter. For the mark model, only results for the variables selected by the variable selection procedure are reported, in addition to the intercept and the shot intensity. The coefficient of the intensity is significantly positive for Durant, Harden, and James; that is, higher success rates are expected at locations where these players make more shots.

Shot intensity does not directly influence Curry's shot accuracy, which makes Curry a truly outstanding shooter: his accuracy is relatively stable over the whole court, except that his 3-point accuracy is lower than his 2-point accuracy. He also has marginally higher accuracy during overtime than during the first quarter, but the number of observations in overtime is too small to make this conclusion reliable. For Durant and James, no other covariates are selected once the intensity is included in the mark model. Harden has marginally higher accuracy for shot angles in [5π/6, 3π/2] than for angles in [−π/2, π/6].

Chapter 5

Future Work

5.1 Spatially Dynamic Variable Selection

Spatial point processes are widely used to study the relationship between the occurrence of events in space and spatial covariates. The variable selection problem for spatial point process models with spatially varying coefficients has not yet received much attention. It is possible that a covariate is significant in some regions but insignificant in others, which means that the variable selection results should themselves be spatially varying.

The traditional spike-and-slab prior for variable selection cannot accommodate a spatially varying set of significant covariates. Some work has been done to modify the spike-and-slab prior and make it dynamic, either spatially or temporally. Andersen et al. (2017) generalized the spike-and-slab prior distribution to encode a priori correlation of the support of the solution in both space and time by imposing a transformed Gaussian process on the spike-and-slab probabilities. Andersen et al. (2014) proposed a novel structured spike-and-slab prior, which allows one to incorporate a priori knowledge of the sparsity pattern by imposing a spatial Gaussian process on the spike-and-slab probabilities. Rockova and McAlinn (2017) used a dynamic spike-and-slab prior constructed as a mixture of two processes: a spike process for the irrelevant coefficients and a slab autoregressive process for the active coefficients; the mixing weights are themselves time-varying and depend on a lagged value of the series.

Other methods focus on the detection of spatially or temporally varying signals. Jhuang et al. (2019) proposed a Bayesian spatial model for sparse signal detection in image data using continuous shrinkage priors. Kalli and Griffin (2014) used a prior that allows the shrinkage of the regression coefficients to change suitably over time.

One direction of future work is to incorporate shrinkage priors, such as a spatial horseshoe prior, into spatial point process models to capture the spatially dynamic sparsity of covariate effects.

5.2 Clustered Regression Coefficients Model for Spatial Point Process

For the spatial covariate effects in a spatial point process, not only may the significance of covariates be spatially varying or clustered; the magnitudes of the significant covariates may also display a spatial clustering pattern. Such clustered covariate effect patterns are often observed in various areas, such as real estate applications, spatial epidemiology, and environmental science.

There are different methods in the literature to detect spatial clusters, including scan statistic methods (Kulldorff et al., 2006; Neill, 2012; Shu et al., 2012) and LASSO-based variable selection methods (Xu and Gangnon, 2016), but most of them focus only on the cluster effect of the spatial response. Spatial cluster detection of spatial regression coefficients has been attracting more attention in various fields recently. Solutions to this problem so far include quasi-likelihood based methods (Lin, 2014; Lin et al., 2016), hypothesis testing based methods (Lee et al., 2017), and penalized approaches incorporating spatial neighborhood information (Lee et al., 2019). Bayesian approaches are also frequently discussed in this field. Ma et al. (2019) recently proposed a Bayesian spatially clustered linear regression model with a DP prior, which accounts for the spatially dependent structure and clusters the covariate effects simultaneously. Zhao et al. (2020) focused on the clustering pattern of the regression coefficients of a Poisson regression model by introducing a spatial smoothness constraint through an MRF.

Incorporating different cluster detection methods into the regression coefficients of spatial point processes is an interesting area to investigate in the future.

5.3 Multivariate Point Process Model

The spatial point pattern data we have discussed consist of only one type of point, which can be called a univariate spatial point pattern. It is often the case in real life that more than one type of event happens in the same spatial domain. All the different types of spatial points put together result in a multivariate spatial point pattern, whose observations carry labels indicating their types. This kind of data is particularly noticeable in plant ecology, where there may be tens or hundreds of types (species) (Flügge et al., 2014; Baldeck et al., 2013; Kanagaraj et al., 2011; Punchi-Manage et al., 2013). Different types of events often influence each other, so it is more reasonable and efficient to model them together as a multivariate spatial point process.

There has been less focus on multivariate spatial point processes than on the univariate case. Rajala et al. (2018) proposed a method based on Gibbs point processes to detect significant interactions in very large multivariate spatial point patterns; the model is fitted using a pseudo-likelihood approximation, and the significant interactions are automatically selected using a group lasso penalty. DeYoreo and Kottas (2018) focused on multivariate ordinal regression and applied a Dirichlet process mixture of multivariate normals to model the joint distribution of random covariates and latent responses, such that a general dependence structure in the multivariate response distribution is induced. Gruner and Johnson (2001) analytically derived the Kullback-Leibler distance between two Poisson point processes. Taddy (2010) built a Bayesian semiparametric framework for marked spatial point processes in which the correlated process densities are captured using a novel class of dependent stick-breaking mixture models. Eckardt and Mateu (2019) considered a multivariate planar point process with quantitative marks and presented a graphical modeling approach for such a model. Doss (1989) proved that the second-order distribution used to assess the dependence structure in a stationary bivariate point process has an asymptotically normal approximation. Allard et al. (2001) proposed a global test for dependence between two point processes; the method is based on local rotations of one process while the second remains fixed, and it can map the local association between the two processes.

With various methods available to assess the dependence or distance between point processes, and relatively little work on the joint modeling of multivariate spatial point processes, it is a meaningful direction to extend the methods discussed in the previous chapters, including the CRP, the distance dependent CRP, the semiparametric PCRP-NHPP, and the marked point process model, to the multivariate case.

Appendix A

Appendix

A.1 BITC Validation Under Gaussian Mixture Model

BIC is a widely used criterion for frequentist methods. We defined BITC to choose the power used in the PCRP in order to obtain a good estimate of the number of components. The performance of this criterion was validated in a simulation study under the spatial point process model.

As a supplementary study, the effectiveness of BITC is investigated in this section under another model via simulation. A simple Gaussian mixture (GM) model was considered. With data X = {X_i}_{i=1}^N and corresponding membership index vector z = {z_i}_{i=1}^N, the GM model can be written as:

Pr(z_i = k) = φ_k,   k = 1, ..., K,

X_i | z_i ∼ Normal(µ_{z_i}, σ²_{z_i}),

where K is the number of components of the GM model and φ = {φ_k}_{k=1}^K is the vector of mixture weights. A common prior on z for Bayesian inference is the DP, and two alternative ways to specify it are the symmetric Dirichlet prior and the PCRP. The symmetric Dirichlet prior requires the number of components K to be a prespecified constant; the proposed BITC can be used to choose K. The existing Bayesian criteria LPML and DIC are also calculated here for comparison.

With the Bayesian method used to fit the model, the priors can be set as:

φ ∼ Symmetric-Dirichlet_K(α),
µ_i ∼ Normal(µ_0, λσ²_i),                     (A.1)
σ²_i ∼ Inv-Gamma(ν, σ²_0),
i = 1, 2, ..., K,

where α, µ_0, λ, ν, and σ_0 are hyperparameters.

The number of components K is set to a constant. Different values of K can be used to fit the model, and BITC can be used to choose the optimal one. The package "nimble" in R can fit this hierarchical model directly.
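For illustration, the full conditional updates implied by (A.1) can also be sketched directly as a Gibbs sampler; the Python sketch below is not the dissertation's implementation (which uses nimble in R), and the hyperparameter defaults are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_gmm(x, K, n_iter=200, alpha=1.0, mu0=0.0, lam=100.0, nu=2.0, s0=1.0):
    """Gibbs sampler for the GM model with the conjugate priors in (A.1).

    alpha, mu0, lam, nu, s0 play the roles of the hyperparameters
    α, µ0, λ, ν, σ0² above (illustrative default values)."""
    n = len(x)
    phi = np.full(K, 1.0 / K)
    mu = rng.normal(x.mean(), x.std(), size=K)
    sig2 = np.ones(K)
    for _ in range(n_iter):
        # z_i | phi, mu, sig2: categorical with normal likelihood weights
        logp = np.log(phi) - 0.5 * np.log(sig2) - (x[:, None] - mu) ** 2 / (2 * sig2)
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = (p.cumsum(axis=1) > rng.random(n)[:, None]).argmax(axis=1)
        counts = np.bincount(z, minlength=K)
        # phi | z ~ Dirichlet(alpha + counts)
        phi = rng.dirichlet(alpha + counts)
        for k in range(K):
            xk = x[z == k]
            # mu_k | sig2_k, z, x: conjugate normal update
            prec = 1.0 / lam + counts[k]
            mu[k] = rng.normal((mu0 / lam + xk.sum()) / prec,
                               np.sqrt(sig2[k] / prec))
            # sig2_k | mu_k, z, x: conjugate inverse-gamma update
            a = nu + 0.5 * (counts[k] + 1)
            b = s0 + 0.5 * ((xk - mu[k]) ** 2).sum() + (mu[k] - mu0) ** 2 / (2 * lam)
            sig2[k] = 1.0 / rng.gamma(a, 1.0 / b)
    return z, phi, mu, sig2
```

Run on data from Setting 1 with K = 3, the posterior draws of µ should concentrate near (1, 5, 10) up to label switching, provided the chain has converged.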

Two simulation settings were considered. Setting 1 has 3 mixture components with mixture weights φ = (0.1, 0.5, 0.4) and corresponding means and variances µ = (1, 5, 10) and σ² = (1, 1, 1). Setting 2 has 6 mixture components with weights φ = (1/4, 1/8, 1/8, 1/8, 1/8, 1/4) and corresponding means and variances µ = (−13, −7, −3, 1, 5, 10) and σ² = (1, 1, 1, 1, 1, 1). For each setting, two sample sizes, N = 500 and N = 1000, were used. The candidate values of K used to fit the model were (2, 3, 4, 5) for Setting 1 and 2 to 10 in steps of 1 for Setting 2.
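Generating one replicate from these settings is straightforward; the sketch below mirrors the data-generating process just described (the seed is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(2020)

def simulate_gmm(n, phi, mu, sigma2):
    """Draw n points from the GM model: memberships, then normal draws."""
    z = rng.choice(len(phi), size=n, p=phi)
    x = rng.normal(np.asarray(mu, float)[z], np.sqrt(np.asarray(sigma2, float)[z]))
    return x, z

# Setting 1: three components, N = 500
x1, z1 = simulate_gmm(500, phi=[0.1, 0.5, 0.4], mu=[1, 5, 10], sigma2=[1, 1, 1])

# Setting 2: six components, N = 1000
x2, z2 = simulate_gmm(1000,
                      phi=[1/4, 1/8, 1/8, 1/8, 1/8, 1/4],
                      mu=[-13, -7, -3, 1, 5, 10],
                      sigma2=[1, 1, 1, 1, 1, 1])
```

Each of the 100 replicates in the study corresponds to one such draw, refitted with every candidate K.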

Figure 16 shows the histograms of K selected by BITC, LPML, and DIC. For Setting 1, with three components, BITC gave the correct K in 98 and 73 out of 100 replicates under the two sample sizes, while LPML made the correct decision only 22 and 9 times, and DIC only 0 and 6 times. Under Setting 2 there are more mixture components and the problem is more challenging. BITC still gives satisfactory performance compared with LPML and DIC, and its performance improves as the sample size increases: out of 100 replicates, BITC recovered the true K 44 and 95 times under the two sample sizes, while LPML and DIC gave the correct K only three and two times, respectively, even when the sample size increased.

We can also see from the histograms that LPML and DIC always tend to give a larger estimate of K than the true value. This is because their penalty on the number of parameters is weak.

For the estimation of the parameters µ and σ, the results are summarized in Table 12. Since the estimated number of components may differ across replicates, it is hard to directly summarize µ̂ and σ̂ over the 100 replicates. For each replicate, we grouped the observations into K groups according to the true membership vector z, and the estimates of µ_i and σ_i for i = 1, ..., K were calculated as the averages of the estimates over all observations in each group. The averages and standard deviations (SD) of these estimates over the 100 replicates are summarized in the table. The estimation

Figure 16: Histograms of K̂ selected by BITC, LPML, and DIC under the GM model with a symmetric Dirichlet prior; panel (a) shows Setting 1 and panel (b) shows Setting 2.

Table 12: Simulation results of parameter estimation for Gaussian Mixture model with symmetric Dirichlet process prior.

                         true      N = 500           N = 1000
Setting    K   par       value     bias     SD       bias     SD

Setting 1  3   µ1          1       0.283   0.247     0.227   0.151
               µ2          5      −0.026   0.070    −0.016   0.047
               µ3         10      −0.150   0.084    −0.081   0.060
               σ1          1      −0.033   0.123    −0.032   0.089
               σ2          1       0.047   0.050     0.023   0.039
               σ3          1       0.293   0.054     0.149   0.040

Setting 2  6   µ1        −13       0.183   0.099     0.080   0.062
               µ2         −7       1.067   0.680     0.256   0.135
               µ3         −3      −0.620   0.989     0.005   0.137
               µ4          1      −0.031   0.276     0.006   0.124
               µ5          5      −0.436   0.341    −0.176   0.155
               µ6         10      −0.184   0.089    −0.087   0.072
               σ1          1       0.640   0.064     0.318   0.039
               σ2          1       0.949   0.427     0.332   0.088
               σ3          1       0.677   0.589     0.072   0.101
               σ4          1       0.030   0.156    −0.069   0.069
               σ5          1       0.444   0.181     0.218   0.122
               σ6          1       0.432   0.064     0.210   0.051

results are quite accurate according to the table, although the estimates of µ are on average better than those of σ. Increasing the sample size results in better estimation performance, which can easily be seen from the reduced bias and standard deviation.

Bibliography

Allard, D., A. Brix, and J. Chadoeuf (2001). Testing local independence between two point processes. Biometrics 57 (2), 508–517.

Andersen, M. R., A. Vehtari, O. Winther, and L. K. Hansen (2017). Bayesian inference for spatio-temporal spike-and-slab priors. Journal of Machine Learning Research 18 (1), 5076–5133.

Andersen, M. R., O. Winther, and L. K. Hansen (2014). Bayesian inference for struc- tured spike and slab priors. In Advances in Neural Information Processing Systems, pp. 1745–1753.

Baddeley, A., Y.-M. Chang, Y. Song, and R. Turner (2012). Nonparametric estimation of the dependence of a spatial point process on spatial covariates. Statistics and Its Interface 5 (2), 221–236.

Baddeley, A., E. Rubak, and R. Turner (2015). Spatial Point Patterns: Methodology and Applications with R. Chapman and Hall/CRC.

Baddeley, A., R. Turner, et al. (2005). Spatstat: an R package for analyzing spatial point patterns. Journal of Statistical Software 12 (6), 1–42.

Baldeck, C. A., K. E. Harms, J. B. Yavitt, R. John, B. L. Turner, R. Valencia, H. Navarrete, S. J. Davies, G. B. Chuyong, D. Kenfack, et al. (2013). Soil resources and topography shape local tree community structure in tropical forests. Proceedings of the Royal Society B: Biological Sciences 280 (1753), 20122532.

Berhane, K., M. Hauptmann, and B. Langholz (2008). Using tensor product splines in modeling exposure–time–response relationships: Application to the Colorado Plateau Uranium Miners cohort. Statistics in Medicine 27 (26), 5484–5496.

Blackwell, D., J. B. MacQueen, et al. (1973). Ferguson distributions via P´olya urn schemes. The Annals of Statistics 1 (2), 353–355.

Blei, D. M. and P. I. Frazier (2011). Distance dependent Chinese restaurant processes. Journal of Machine Learning Research 12 (Aug), 2461–2488.

Brunsdon, C., A. S. Fotheringham, and M. E. Charlton (1996). Geographically weighted regression: a method for exploring spatial nonstationarity. Geographical Analysis 28 (4), 281–298.

Chen, M.-H., Q.-M. Shao, and J. G. Ibrahim (2012). Monte Carlo Methods in Bayesian Computation. Springer Science & Business Media.

Chiu, S. N., D. Stoyan, W. S. Kendall, and J. Mecke (2013). Stochastic Geometry and Its Applications. John Wiley & Sons.

Condit, R., R. Perez, S. Aguilar, S. Lao, R. Foster, and S. P. Hubbell (2019). Complete Data from the Barro Colorado 50-ha Plot: 423617 Trees, 35 Years, 2019 Version. DataONE.

Cressie, N. (1992). Statistics for spatial data. Terra Nova 4 (5), 613–617.

Dahl, D. B. (2006). Model-based clustering for expression data via a Dirichlet process mixture model. Bayesian Inference for Gene Expression and Proteomics 4, 201–218.

de Valpine, P., D. Turek, C. J. Paciorek, C. Anderson-Bergman, D. T. Lang, and R. Bodik (2017). Programming with models: Writing statistical algorithms for general model structures with NIMBLE. Journal of Computational and Graphical Statistics 26 (2), 403–413.

DeYoreo, M. and A. Kottas (2018). Bayesian nonparametric modeling for multivariate ordinal regression. Journal of Computational and Graphical Statistics 27 (1), 71–84.

Diggle, P. (1985). A kernel method for smoothing point process data. Journal of the Royal Statistical Society: Series C (Applied Statistics) 34 (2), 138–147.

Diggle, P. (2003). Statistical Analysis of Spatial Point Patterns. Arnold, London, UK.

Diggle, P. J. (2013). Statistical Analysis of Spatial and Spatio-Temporal Point Pat- terns. Chapman and Hall/CRC.

Diggle, P. J., J. Besag, and J. T. Gleaves (1976). Statistical analysis of spatial point patterns by means of distance methods. Biometrics 32 (3), 659–667.

Doss, H. (1989). On estimating the dependence between two point processes. The Annals of Statistics 17 (2), 749–763.

Eckardt, M. and J. Mateu (2019). Analysing multivariate spatial point processes with continuous marks: A graphical modelling approach. International Statistical Review 87 (1), 44–67.

Flügge, A. J., S. C. Olhede, and D. J. Murrell (2014). A method to detect subcommunities from multivariate spatial associations. Methods in Ecology and Evolution 5 (11), 1214–1224.

Geisser, S. and W. F. Eddy (1979). A predictive approach to model selection. Journal of the American Statistical Association 74 (365), 153–160.

Gelfand, A. E. and D. K. Dey (1994). Bayesian model choice: Asymptotics and exact calculations. Journal of the Royal Statistical Society. Series B (Methodological) 56 (3), 501–514.

Geman, S. and D. Geman (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on pattern analysis and machine intelligence PAMI-6 (6), 721–741.

Geng, J., A. Bhattacharya, and D. Pati (2019). Probabilistic community detection with unknown number of communities. Journal of the American Statistical Associa- tion 114 (526), 893–905.

Geng, J., W. Shi, and G. Hu (2019). Bayesian nonparametric nonhomogeneous Poisson process with applications to USGS earthquake data.

George, E. I. and R. E. McCulloch (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association 88 (423), 881–889.

Gruner, C. M. and D. H. Johnson (2001). Calculation of the Kullback-Leibler distance between point process models. In 2001 IEEE International Conference on Acoustics, Speech, and . Proceedings (Cat. No. 01CH37221), Volume 6, pp. 3437–3440. IEEE.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 (1), 97–109.

Ho, L. P. and D. Stoyan (2008). Modelling marked point patterns by intensity-marked Cox processes. Statistics & Probability Letters 78 (10), 1194–1199.

Hu, G. and J. Bradley (2018, 03). A Bayesian spatial-temporal model with latent multivariate log-gamma random effects with application to earthquake magnitudes. Stat 7 (1), e179.

Hu, G., J. Geng, Y. Xue, and H. Sang (2020). Bayesian spatial homogeneity pursuit of functional data: an application to the US income distribution.

Hu, G., F. Huffer, and M.-H. Chen (2019). New development of Bayesian variable selection criteria for spatial point process with applications.

Illian, J., A. Penttinen, H. Stoyan, and D. Stoyan (2008). Statistical Analysis and Modelling of Spatial Point Patterns, Volume 70. John Wiley & Sons.

Ishwaran, H. and J. S. Rao (2005). Spike and slab variable selection: Frequentist and Bayesian strategies. The Annals of Statistics 33 (2), 730–773.

Ishwaran, H. and M. Zarepour (2002). Dirichlet prior sieves in finite normal mixtures. Statistica Sinica 12 (3), 941–963.

Jhuang, A.-T., M. Fuentes, J. L. Jones, G. Esteves, C. M. Fancher, M. Furman, and B. J. Reich (2019). Spatial signal detection using continuous shrinkage priors. Technometrics 61 (4), 494–506.

Jiang, H. and N. Serban (2012). Clustering random curves under spatial interdepen- dence with application to service accessibility. Technometrics 54 (2), 108–119.

Jiao, J., G. Hu, and J. Yan (2019). A Bayesian Joint Model for Spatial Point Processes with Application to Basketball Shot Chart.

Kalli, M. and J. E. Griffin (2014). Time-varying sparsity in dynamic regression models. Journal of 178 (2), 779–793.

Kanagaraj, R., T. Wiegand, L. S. Comita, and A. Huth (2011). Tropical tree species assemblages in topographical habitats change in time and with life stage. Journal of Ecology 99 (6), 1441–1452.

Kass, R. E. and A. E. Raftery (1995). Bayes factors. Journal of the american statistical association 90 (430), 773–795.

Kim, H.-M., B. K. Mallick, and C. Holmes (2005). Analyzing nonstationary spatial data using piecewise Gaussian processes. Journal of the American Statistical Associ- ation 100 (470), 653–668.

Knorr-Held, L. and G. Raßer (2000). Bayesian detection of clusters and discontinuities in disease maps. Biometrics 56 (1), 13–21.

Kulldorff, M., L. Huang, L. Pickle, and L. Duczmal (2006). An elliptic spatial scan statistic. Statistics in medicine 25 (22), 3929–3943.

Lambert, D. (1992). Zero-inflated poisson regression, with an application to defects in manufacturing. Technometrics 34 (1), 1–14.

Lee, J., R. E. Gangnon, and J. Zhu (2017). Cluster detection of spatial regression coefficients. Statistics in medicine 36 (7), 1118–1133.

Lee, J., Y. Sun, and H. H. Chang (2019). Spatial cluster detection of regression coefficients in a mixed-effects model. Environmetrics, e2578.

Leininger, T. J., A. E. Gelfand, et al. (2017). Bayesian inference and model assessment for spatial point patterns using posterior predictive samples. Bayesian Analysis 12 (1), 1–30.

Li, F. and H. Sang (2019). Spatial homogeneity pursuit of regression coefficients for large datasets. Journal of the American Statistical Association 114 (527), 1050–1062.

Lin, P.-S. (2014). Generalized scan statistics for disease surveillance. Scandinavian Journal of Statistics 41 (3), 791–808.

Lin, P.-S., Y.-H. Kung, and M. Clayton (2016). Spatial scan statistics for detection of multiple clusters with arbitrary shapes. Biometrics 72 (4), 1226–1234.

Lu, J., M. Li, and D. Dunson (2018). Reducing over-clustering via the powered Chinese restaurant process.

Ma, Z., M.-H. Chen, and G. Hu (2018). Bayesian hierarchical spatial regression models for spatial data in the presence of missing covariates with applications. Technical Report 18-22, University of Connecticut, Department of Statistics.

Ma, Z., Y. Xue, and G. Hu (2019). Bayesian heterogeneity pursuit regression models for spatially dependent data.

MacEachern, S. N. and P. Müller (1998). Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics 7 (2), 223–238.

Malsiner-Walli, G. and H. Wagner (2018). Comparing spike and slab priors for Bayesian variable selection.

Miller, A., L. Bornn, R. Adams, and K. Goldsberry (2014). Factorized point process intensities: A spatial analysis of professional basketball. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML'14, pp. 235–243.

Miller, J. W. (2014). Nonparametric and Variable-Dimension Bayesian Mixture Mod- els: Analysis, Comparison, and New Methods. Ph. D. thesis, Citeseer.

Miller, J. W. and M. T. Harrison (2013). A simple example of Dirichlet process mix- ture inconsistency for the number of components. In Advances in neural information processing systems, pp. 199–206.

Miller, J. W. and M. T. Harrison (2018). Mixture models with a prior on the number of components. Journal of the American Statistical Association 113 (521), 340–356.

Mrkvička, T., F. Goreaud, and J. Chadœuf (2011). Spatial prediction of the mark of a location-dependent marked point process: How the use of a parametric model may improve prediction. Kybernetika 47 (5), 696–714.

Murray, I., Z. Ghahramani, and D. MacKay (2012). MCMC for doubly-intractable distributions.

Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9 (2), 249–265.

Neill, D. B. (2012). Fast subset scan for spatial pattern detection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74 (2), 337–360.

Orbanz, P. and J. M. Buhmann (2008). Nonparametric Bayesian image segmentation. International Journal of Computer Vision 77 (1-3), 25–45.

Pitman, J. (1995). Exchangeable and partially exchangeable random partitions. Prob- ability Theory and Related Fields 102 (2), 145–158.

Punchi-Manage, R., S. Getzin, T. Wiegand, R. Kanagaraj, C. Savitri Gunatilleke, I. Nimal Gunatilleke, K. Wiegand, and A. Huth (2013). Effects of topography on structuring local species assemblages in a Sri Lankan mixed dipterocarp forest. Journal of Ecology 101 (1), 149–160.

Rajala, T., D. Murrell, and S. Olhede (2018). Detecting multivariate interactions in spatial point patterns with Gibbs models and variable selection. Journal of the Royal Statistical Society: Series C (Applied Statistics) 67 (5), 1237–1273.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66 (336), 846–850.

Rockova, V. and K. McAlinn (2017). Dynamic variable selection with spike-and-slab process priors.

Schoenberg, F. P. (2003). Multidimensional residual analysis of point process models for earthquake occurrences. Journal of the American Statistical Association 98 (464), 789–795.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica 4 (2), 639–650.

Shu, L., W. Jiang, and K.-L. Tsui (2012). A standardized scan statistic for detecting spatial clusters with estimated parameters. Naval Research Logistics (NRL) 59 (6), 397–410.

Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. Van Der Linde (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (4), 583–639.

Taddy, M. A. (2010). Autoregressive mixture models for dynamic spatial Poisson processes: Application to tracking intensity of violent crime. Journal of the American Statistical Association 105 (492), 1403–1417.

Thurman, A. L., R. Fu, Y. Guan, and J. Zhu (2015). Regularized estimating equations for model selection of clustered spatial point processes. Statistica Sinica 25 (1), 173–188.

Thurman, A. L. and J. Zhu (2014). Variable selection for spatial Poisson point processes via a regularization method. Statistical Methodology 17, 113–125.

Walker, A. M. (1969). On the asymptotic behaviour of posterior distributions. Journal of the Royal Statistical Society: Series B (Methodological) 31 (1), 80–88.

Wang, X., M.-H. Chen, and J. Yan (2013). Bayesian dynamic regression models for interval censored survival data with application to children dental health. Lifetime Data Analysis 19 (3), 297–316.

Wang, Y. R. and P. J. Bickel (2017). Likelihood-based model selection for stochastic block models. The Annals of Statistics 45 (2), 500–528.

Winkler, G. (2012). Image Analysis, Random Fields and Markov Chain Monte Carlo Methods: A Mathematical Introduction, Volume 27. Springer Science & Business Media.

Xie, F. and Y. Xu (2019). Bayesian repulsive Gaussian mixture model. Journal of the American Statistical Association 0 (0), 1–29.

Xu, J. and R. E. Gangnon (2016). Stepwise and stagewise approaches for spatial cluster detection. Spatial and Spatio-temporal Epidemiology 17, 59–74.

Yue, Y. R. and J. M. Loh (2011). Bayesian semiparametric intensity estimation for inhomogeneous spatial point processes. Biometrics 67 (3), 937–946.

Zhao, P., H.-C. Yang, D. K. Dey, and G. Hu (2020). Bayesian spatial homogeneity pursuit regression for count value data.