
THEORETICAL AND ALGORITHMIC DEVELOPMENTS IN MONTE CARLO

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Rajib Paul, M.Sc.

*****

The Ohio State University

2008

Dissertation Committee:

Professor L. Mark Berliner, Adviser
Professor Noel Cressie
Professor Steven N. MacEachern

Approved by

Adviser
Graduate Program in Statistics

© Copyright by

Rajib Paul

2008

ABSTRACT

This PhD thesis is concerned with three problems: (1) “Bayesian Change Point Analysis for Mechanistic Modeling,” (2) “Assessing Convergence and Mixing of Markov Chain Monte Carlo via Stratification,” and (3) “Bayesian Analysis via Diffusion Monte Carlo.”

In Bayesian parametric change point modeling, the primary challenges are detection of the number and locations of change points and obtaining the posterior distribution of the other unknown parameters in the model. Chapter 2 suggests approaches for dealing with the computational burden arising in change point models that involve mechanistic reasoning as well as random explanatory variables. We present an example involving flow velocities of ice sheets in the Lambert Glacial Basin of East Antarctica. We pay special attention to the development of reasonable priors that can reflect a variety of prior information.

In Chapter 3 we apply the notion of post-stratification to develop assessment tools for MCMC analysis. These tools are based on a comparison of two natural variance estimates. Based on these estimates we propose a test statistic that helps in checking convergence and mixing of MCMC. Our method is illustrated using a Bayesian change-point model for ice flow velocity in East Antarctica, a logistic regression model, a latent variable model for arsenic concentration in public water systems in Arizona, and a regime-switching model for Pacific sea surface temperatures.

Markov chains based on Langevin-type diffusions sometimes offer potentially useful methods in cases for which other MCMC methods are challenging. The key idea is to develop a process whose stationary distribution is the target posterior, using diffusions represented as solutions to stochastic differential equations (SDEs). The main challenge is to solve the required SDEs. Naive discretizations like the Euler method can lead to lack of ergodicity. Hence, more sophisticated discretizations or Metropolis-Hastings correction steps are prescribed. Chapter 4 of my dissertation research deals with the development of diffusion-based algorithms for posterior distributions arising from non-conjugate Gaussian models with non-linear mean and variance functions.

Dedicated to my family, friends, and teachers

ACKNOWLEDGMENTS

Words are not enough to thank those people who mean a lot to me.

I express my profuse thanks and deepest appreciation to my friend, philosopher, and guide, Dr. L. Mark Berliner, for his continued support, patience, encouragement, and invaluable suggestions during my PhD program. Having him as an adviser is the best gift from the Almighty. His probability classes are cherished memories for me.

I am highly indebted to Dr. Noel Cressie for motivating and enlightening me in various ways and also for his parent-like affection. It was a great opportunity and a wonderful experience for me to work with him in the Spatial Statistics and Environmental Sciences (SSES) program at The Ohio State University.

I would like to convey my heartfelt thanks to Dr. Steven N. MacEachern for his very precious intellectual guidance and for being my co-author on one of my dissertation problems. I have greatly enjoyed working with him.

I am very much grateful to Dr. Catherine A. Calder, Dr. Peter F. Craigmile, Dr. Thomas J. Santner, Dr. Radu Herbei, Terry England, and Emily Kang for their help during my PhD program.

Finally, I would like to acknowledge my loving parents for their support and affection.

VITA

1978 ...... Born - Calcutta, INDIA

2000 ...... B.Sc. Statistics, Calcutta University

2002 ...... M.Sc. Statistics, Indian Institute of Technology, Kanpur

2002-2004 ...... Graduate Teaching Associate, Department of Statistics, The Ohio State University

2004-present ...... Graduate Research Associate, Department of Statistics, The Ohio State University

PUBLICATIONS

Research Publications

Santner, Thomas J., Craigmile, Peter F., Calder, Catherine A., and Paul, Rajib, “Demographic and Behavioral Modifiers of Arsenic Exposure Pathways: A Bayesian Hierarchical Analysis of NHEXAS Data.” Journal of Environmental Science and Technology, in press.

FIELDS OF STUDY

Major Field: Statistics

TABLE OF CONTENTS

Page

Abstract...... ii

Dedication...... iv

Acknowledgments...... v

Vita ...... vi

List of Tables ...... ix

List of Figures ...... x

Chapters:

1. Markov Chain Monte Carlo ...... 1

2. Bayesian Change Point Analysis for Mechanistic Modeling ...... 4

2.1 Introduction ...... 4
2.2 Bayesian Hierarchical Modeling of Ice Flow Velocities ...... 7
2.2.1 Physical Theory ...... 7
2.2.2 Data ...... 8
2.2.3 Bayesian Modeling ...... 9
2.2.4 Detecting the Number of Change Points ...... 13
2.3 Implementing MCMC ...... 15
2.4 Results ...... 17
2.5 Discussion ...... 20
2.6 Appendix: Full Conditional Distributions ...... 24

3. Assessing Convergence and Mixing of MCMC via Stratification ..... 28

3.1 Introduction ...... 28
3.2 Motivation ...... 31
3.3 A New Method ...... 33
3.3.1 Choosing n, K, and J ...... 38
3.3.2 Assessing burn-in ...... 38
3.3.3 Parametric bootstrap sample size ...... 41
3.4 Example 1: Bayesian Change Point Analysis: Velocity of Antarctic Glaciers ...... 44
3.5 Example 2: Latent Variable Models: Arsenic Concentrations in Public Water System (PWS) ...... 45
3.6 Example 3: Logistic Regression (WinBUGS manual) ...... 48
3.7 Example 4: Regime-switching model: Pacific Sea Surface Temperature (SST) ...... 50
3.8 Power of the Test ...... 51
3.9 Discussion ...... 54

4. Bayesian Analysis via Diffusion Monte Carlo ...... 58

4.1 Introduction ...... 58
4.2 Background ...... 61
4.3 Univariate Case ...... 64
4.4 Multivariate Settings ...... 67
4.4.1 Multivariate Bayesian Models ...... 67
4.5 Discretization ...... 69
4.6 Univariate Examples ...... 70
4.6.1 Normal-Cauchy Model ...... 70
4.6.2 Gaussian Model with Nonlinear Mean Function ...... 71
4.6.3 Computational Example: Nonlinear Model ...... 71
4.6.4 Mixture Models ...... 75
4.6.5 Conditional Autoregressive (CAR) Model ...... 75
4.7 Multivariate Examples ...... 77
4.7.1 Hierarchical Models ...... 77
4.7.2 Normal-Gamma Model ...... 79
4.7.3 Conditional Autoregressive Model with Multivariate Parameters ...... 83
4.7.4 An Exchangeable Model ...... 88
4.8 Discussion ...... 90

Bibliography ...... 92

LIST OF TABLES

Table Page

2.1 Summary of wavelet coefficients of the $H\tau^3$ model in terms of posterior mean (Post. Mean) in $(10^{18}\,\mathrm{m\,(kPa)^3})$ and posterior standard deviation (Post. SDev) in $(10^{18}\,\mathrm{m\,(kPa)^3})$ ...... 17

2.2 DIC values for m = 2, ..., 5 ...... 20

2.3 Posterior summaries of the slopes of the regression lines in $(\mathrm{a}^{-1}(\mathrm{kPa})^{-3})$ ...... 23

2.4 Posterior mean of the change point locations ...... 23

3.1 Beetles dataset for logistic regression ...... 48

3.2 Acceptance rates for Geweke’s test, Gelman-Rubin test, and our new test...... 52

4.1 Comparing posterior summaries from diffusion MCMC and Gibbs sampler for the normal-gamma model: $E_{DM}(\cdot \mid Y)$ = posterior mean based on diffusion MCMC; $SD_{DM}(\cdot \mid Y)$ = posterior standard deviation based on diffusion MCMC; $E_{GS}(\cdot \mid Y)$ = posterior mean based on Gibbs sampler; $SD_{GS}(\cdot \mid Y)$ = posterior standard deviation based on Gibbs sampler ...... 80

4.2 Posterior summaries of the parameters involved in the CAR model (4.67): $E_{DM}(\cdot \mid Y)$ = posterior mean based on diffusion MCMC; $SD_{DM}(\cdot \mid Y)$ = posterior standard deviation based on diffusion MCMC; $E_{GG}(\cdot \mid Y)$ = posterior mean based on griddy Gibbs sampler; $SD_{GG}(\cdot \mid Y)$ = posterior standard deviation based on griddy Gibbs sampler ...... 88

4.3 Hellinger distances for the parameters involved in the CAR model (4.67). This compares the distance between two estimated densities based on the samples generated using diffusion MCMC and griddy Gibbs sampler ...... 88

LIST OF FIGURES

Figure Page

2.1 Illustration of Using the Prior (2.15) ...... 12

2.2 Observed and posterior means of the wavelet model for $H\tau^3$. Blue lines indicate observed values and red dots indicate several realizations from the posterior distribution. (a) m = 2, (b) m = 3, (c) m = 4, and (d) m = 5 ...... 18

2.3 Observed and posterior means of the velocity model. Cyan circles indicate observed velocities and blue dots indicate posterior means. (a) m = 2, (b) m = 3, (c) m = 4, and (d) m =5...... 21

2.4 Residual: Observed - posterior means of the velocity model. (a) m = 2, (b) m = 3, (c) m = 4, and (d) m =5...... 22

3.1 Trace-plot of AR(1) process defined in (3.1) ...... 32

3.2 Assessing burn-in for the AR(1) process described in (3.16) using a window size of 100 ...... 39

3.3 Assessing burn-in for the slope parameter of the linear change point problem using a window size of 100 ...... 40

3.4 Acceptance regions constructed for (3.17) when α = 0.2 using various bootstrap sample sizes: (a) 1000, (b) 10,000, (c) 25,000, and (d) 50,000 ...... 42

3.5 Acceptance regions constructed for (3.17) when α = 0.998 using various bootstrap sample sizes: (a) 1000, (b) 10,000, (c) 25,000, and (d) 50,000 ...... 43

3.6 Histograms of change point locations ...... 46

3.7 Observed and posterior means of ice-flow velocity...... 46

3.8 Scatterplot of location of first change point ...... 47

3.9 Trace plots of simulated logistic regression parameters...... 49

3.10 Scatter plot for $A^{(1)}_{\mathrm{Sept08}}$: the first EOF coefficient for Sept’08 ...... 51

3.11 Trace plot and acf plot of an MCMC chain generated using (3.20) ...... 56

3.12 Density estimates from 10 separate AR(1) chains generated using (3.20). The thick black line indicates the true stationary distribution. 57

4.1 Simulated data from a Gaussian model with mean function (4.44) . . 74

4.2 Estimated cumulative distribution functions for L based on griddy-Gibbs and diffusion MCMC implementations ...... 74

4.3 Negative log-posterior of the spatial dependence parameter γ for the CAR model described in 4.6.5 ...... 77

4.4 Estimated posterior cumulative distribution functions of the spatial dependence parameter γ for the CAR model described in 4.6.5 . . . . 78

4.5 Comparing posterior quantiles for a normal-gamma model based on samples from diffusion MCMC and Gibbs sampler: (a) for mean x1, (b) for precision x2 ...... 81

4.6 Comparing cumulative distribution functions of mean parameter x1 for a normal-gamma model based on samples from diffusion MCMC and Gibbs sampler ...... 82

4.7 Comparing cumulative distribution functions of precision parameter x2 for a normal-gamma model based on samples from diffusion MCMC and Gibbs sampler ...... 82

4.8 Trace-plots of parameters involved in the CAR model described by (4.67) 85

4.9 Posterior densities of parameters involved in the CAR model described by(4.67)...... 86

4.10 Cumulative density estimates for the parameters in the CAR model described by (4.67). These estimates are based on samples generated from diffusion MCMC (blue lines) and griddy Gibbs sampler (red lines) ...... 87

CHAPTER 1

MARKOV CHAIN MONTE CARLO

Markov chain Monte Carlo (MCMC) provides an extremely powerful computational technique for drawing inferences from posterior distributions that are analytically intractable. Two important introductions to MCMC are Geman and Geman (1984) and Gelfand and Smith (1990).

In the Bayesian paradigm the structure of a statistical model has two components: (1) Data model $P(D \mid \theta)$: this is often termed the likelihood and specifies the conditional distribution of the observations $D$ given unknown parameters $\theta$. (2) Prior $P(\theta)$: this distribution reflects prior knowledge about the unknown parameters $\theta$.

The goal is to find the posterior distribution $P(\theta \mid D)$. Applying Bayes' theorem, one can write
$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{\int P(D \mid \theta)\, P(\theta)\, d\theta}.$$
Often the integration in the denominator is not available in closed form. MCMC is a technique to simulate from the posterior, even though we cannot normalize it.

The idea is to develop a Markov chain whose stationary distribution is known to coincide with the target posterior distribution. One then runs that chain, knowing that eventually realizations from the chain approximately form a dependent sample from the posterior. Those realizations are then used to estimate features of the posterior (e.g., posterior expectations of interesting quantities, predictive densities, etc.). See Bernardo and Smith (1994), Gilks, Richardson and Spiegelhalter (1996), and Robert and Casella (2004) for additional discussion.

Though major advances have been made using MCMC, some problems remain difficult, at least when attempted by standard approaches to developing an MCMC analysis. For example, Gibbs sampling is a powerful subclass of MCMC. In Gibbs sampling each iteration step for generating a parameter or a block of parameters requires updated conditioning on the values of the remaining parameters. That is, we require computation of the full conditional distributions in closed form. In some settings, nonlinearity and/or nonconjugacy of certain components of a large model render standard Gibbs unusable. Moreover, for some models the dimensions of the parameters can be variable. Metropolis-Hastings algorithms and Gibbs-Metropolis hybrids can be suggested, though these approaches can be taxing. For Metropolis-Hastings we require a proposal distribution. A sample is generated from this proposal distribution and then it is accepted or rejected based on certain weights. The choice of proposal distribution impacts the acceptance probability and hence the mixing of the chain. Guessing an appropriate proposal distribution can be very difficult in some situations.
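To make the generic recipe concrete, here is a minimal random-walk Metropolis sketch added for illustration; the Gaussian likelihood, Cauchy prior, step size, and chain length are assumptions for this example, not quantities from the dissertation:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(theta, data):
    # Unnormalized log posterior: Gaussian likelihood, standard Cauchy prior.
    return -0.5 * np.sum((data - theta) ** 2) - np.log(1.0 + theta ** 2)

def metropolis(data, n_iter=5000, step=0.5):
    theta = 0.0
    chain = np.empty(n_iter)
    for k in range(n_iter):
        prop = theta + step * rng.normal()  # symmetric random-walk proposal
        # Accept with probability min(1, posterior ratio); only the
        # unnormalized posterior is needed, so the intractable
        # normalizing constant cancels.
        if np.log(rng.uniform()) < log_post(prop, data) - log_post(theta, data):
            theta = prop
        chain[k] = theta
    return chain

data = rng.normal(1.0, 1.0, size=50)
chain = metropolis(data)
```

The retained draws, after discarding an initial burn-in portion, form a dependent sample whose long-run distribution is the posterior; for instance, `chain[1000:].mean()` estimates the posterior mean.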

An important practical question is how long we need to run the chain and how we will assess convergence. Along with convergence, another issue that often arises is the “mixing” of the Markov chain. If our algorithm rapidly explores all the important regions of the posterior distribution, then we say that we have a well-mixed chain. For example, if our posterior distribution is multimodal, then a well-mixed chain will explore all the modes. It is important to assess convergence and mixing, and sometimes remedial actions are needed.
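One widely used check can be sketched as follows; this is a simplified, illustrative version of Geweke's diagnostic with naive iid standard errors, not the stratification method developed in Chapter 3:

```python
import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    """Compare the mean of the first 10% of a chain with the mean of the
    last 50%. Geweke's actual diagnostic replaces the naive iid standard
    errors used here with spectral-density estimates of the variance."""
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[-int(last * n):]
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se

rng = np.random.default_rng(1)
chain = rng.normal(size=10_000)  # stand-in chain that is already stationary
z = geweke_z(chain)              # a large |z| suggests the chain has not settled
```

A chain still drifting toward the stationary distribution produces a large early-versus-late discrepancy and hence a large |z|, while a converged chain gives a z-score of roughly standard normal size.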

Several numerical challenges arise while coding an MCMC algorithm. These challenges are problem specific and no general rule can be prescribed. Clever techniques and programming skills are necessary. See Craigmile, Calder, Li, Paul and Cressie (2007) for additional discussion.

My thesis is motivated by the fact that for many Bayesian models a fast and reliable Markov chain Monte Carlo algorithm is essential. This thesis is comprised of three major chapters. Chapter 2 develops a Bayesian change point model for ice flow velocity in East Antarctica. In this problem the dimension of the parameter space is unknown. We propose an analysis scheme for tackling such problems in an efficient way. In Chapter 3 we propose a method for assessing convergence and mixing of a Markov chain Monte Carlo algorithm via stratification. Chapter 4 sheds some light on developing fast, diffusion-based MCMC algorithms.

CHAPTER 2

BAYESIAN CHANGE POINT ANALYSIS FOR MECHANISTIC MODELING

2.1 Introduction

In regression modeling it is often the case that a single equation is unable to capture the behavior of the entire dataset, though several local curves can model the data reasonably well. Statistical analysis then involves detection and estimation of change points and estimation of the parameters of the implied regression curves. Commonly used techniques are maximum likelihood estimation (MLE) and likelihood ratio tests (LRT) for determining the existence and locations of change points. In the Bayesian approach, one develops prior distributions on all regression parameters and the numbers and locations of change points. Bayesian change point analyses have been developed by many authors including Carlin, Gelfand and Smith (1992), Barry and Hartigan (1993), Stephens (1994), and Green (1995).

Change point analyses can be broadly classified into parametric and nonparametric approaches. Nonparametric approaches include free-knot splines, smoothing splines, and wavelet analysis. For examples and discussion, see Denison, Mallick and Smith (1998), Biller (2000), Dimatteo, Genovese and Kass (2001), Vidakovic (1999), and Muller and Vidakovic (1999).

We focus on parametric procedures because our models arise from scientific reasoning. Such models are often termed mechanistic; see Box, Hunter and Hunter (1978), Chapter 12. Berliner (2003) described the notion as physical-statistical modeling. The goal is to merge approximate scientific laws with statistical analysis and uncertainty management. Bayesian hierarchical modeling (BHM) provides a useful approach, but there are significant challenges in model-building and computing. We explore some suggestions for dealing with these issues. Our discussion is restricted to locally linear regression models with a single explanatory variable. The methods can be extended to more general cases. Another aspect of our modeling is that we can impose a probability model on the explanatory variable. That is, we treat cases in which the explanatory variable or covariate is observed imprecisely.

Our example involves modeling ice flow velocities of an ice sheet in the Lambert Glacial Basin in east Antarctica. The mechanistic aspect of our modeling uses a simple geophysical formula that relates ice-flow velocities to the surface and basal topographies of the ice sheets.

We consider fitting local linear regression models for response variable $V$ and explanatory variable $Z = g_\theta(Y, x)$, where $g_\theta(\cdot, \cdot)$ is a continuous function of auxiliary variables $Y$ and $x$, an element in some domain $\mathcal{X}$, and parameters $\theta$, assumed to be known in this section. Consider datasets composed of $n$ observations $\{V_i, Y_i, x_i\}$, $i = 1, \ldots, n$, and define $Z_i = g_\theta(Y_i, x_i)$, $i = 1, \ldots, n$. Let $\mathbf{V}$, $\mathbf{Y}$, $\mathbf{Z}$, and $\mathbf{x}$ denote the corresponding $n \times 1$ vectors.

$g_\theta$ may not be a monotone function of $Y$ for all $x$ and $\theta$. This means that we cannot define positions of change points uniquely depending only on $Z = g_\theta(Y, x)$. Assume that $\mathcal{X} = [x_L, x_U]$ is a compact subset of the real line. Our assumption that $x$ is monotone leads us to define the change point locations on $\mathcal{X}$. For the moment, assume that there are $m$ change points with locations $r_1, \ldots, r_m$ in the domain $\mathcal{X}$. Due to our continuity assumption, we have that $\mathcal{X}$ is partitioned into $m + 1$ disjoint intervals, $\mathcal{X}_k = [x_{r_{k-1}}, x_{r_k})$, $k = 1, \ldots, m+1$, where $x_{r_0} = x_L$ and $x_{r_{m+1}} = x_U$.

Conditional on the true values of $Z_1, \ldots, Z_n$, we model the responses $V_i$, $i = 1, \ldots, n$, as
$$V_i = \sum_{j=1}^{m+1} a_j I(x_i \in \mathcal{X}_j) + \sum_{j=1}^{m+1} b_j I(x_i \in \mathcal{X}_j)\, Z_i + \sum_{j=1}^{m+1} \sigma_j I(x_i \in \mathcal{X}_j)\, \epsilon_i, \qquad (2.1)$$
where the $\epsilon_i$'s are independent and identically distributed (iid) from some distribution with mean zero and variance 1, and $I(\cdot)$ is the indicator function such that $I(A) = 1$ if $A$ occurs and 0 otherwise. The regression parameters comprise a $3(m+1) \times 1$ vector $\boldsymbol{\mu} = (\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_{m+1})'$, where each $\boldsymbol{\mu}_j = (a_j, b_j, \sigma_j)'$. Let $\mathbf{V}_j$ be the vector of $V$-values with locations $x$ lying in $\mathcal{X}_j$, $j = 1, \ldots, m+1$. Define vectors $\mathbf{Z}_j$ and $\mathbf{x}_j$ similarly.

Assuming conditional independence among the $V_i$ given $\mathbf{Z}$, $m$, $\mathbf{r}$, and $\boldsymbol{\mu}$, we write the likelihood as
$$[\mathbf{V} \mid \mathbf{Z}, m, \mathbf{r}, \boldsymbol{\mu}] = \prod_{j=1}^{m+1} [\mathbf{V}_j \mid \mathbf{Z}_j, \boldsymbol{\mu}_j], \qquad (2.2)$$
where $[u]$ denotes the distribution of $u$ and $[u \mid w]$ denotes the conditional distribution of $u$ given $w$.
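To make the data model concrete, here is a small simulation sketch of (2.1), added for illustration; the single change point, parameter values, and Gaussian errors are assumptions for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# One change point (m = 1): X_1 = [0, 1) and X_2 = [1, 2].
x = np.linspace(0.0, 2.0, 200)
z = 3.0 * np.sin(x)                      # stand-in for Z = g_theta(Y, x)
cuts = np.array([0.0, 1.0, 2.0 + 1e-9])  # partition boundaries of X
a = np.array([1.0, 4.0])                 # intercepts a_j
b = np.array([2.0, 0.5])                 # slopes b_j
sigma = np.array([0.1, 0.3])             # error scales sigma_j

j = np.digitize(x, cuts) - 1             # index of the X_j containing each x_i
eps = rng.normal(size=x.size)            # iid, mean 0, variance 1
v = a[j] + b[j] * z + sigma[j] * eps     # equation (2.1); indicators become indexing
```

The three indicator sums in (2.1) collapse to simple indexing: each observation uses the intercept, slope, and error scale of its own interval.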

It is convenient to develop priors for $m$, $\mathbf{r}$, and $\boldsymbol{\mu}$ hierarchically. We formulate component distributions using the ordering
$$[m, \mathbf{r}, \boldsymbol{\mu}] = [\boldsymbol{\mu} \mid m, \mathbf{r}]\,[\mathbf{r} \mid m]\,[m]. \qquad (2.3)$$

The resulting posterior can be written as
$$[m, \mathbf{r}, \boldsymbol{\mu} \mid \mathbf{V}, \mathbf{Z}] = \frac{[\mathbf{V} \mid \mathbf{Z}, m, \mathbf{r}, \boldsymbol{\mu}]\,[\boldsymbol{\mu} \mid m, \mathbf{r}]\,[\mathbf{r} \mid m]\,[m]}{\sum_m \int [\mathbf{V} \mid \mathbf{Z}, m, \mathbf{r}, \boldsymbol{\mu}]\,[\boldsymbol{\mu} \mid m, \mathbf{r}]\,[\mathbf{r} \mid m]\,[m]\, d\boldsymbol{\mu}\, d\mathbf{r}}. \qquad (2.4)$$

The posterior distribution in (2.4) is not analytically tractable. We rely on MCMC simulation. A key challenge to implementing MCMC here involves the varying dimensions due to the unknown number of change points. Reversible jump MCMC, developed by Green (1995), is very useful in such settings but is also fairly complicated and typically slow to converge in high-dimensional cases. Our suggestion is to run separate MCMC fixing the number of change points at a few possible candidate values and then draw inferences from these outputs. Similar techniques have been applied by Chib (1995). In Section 2.6 we will discuss an empirical Bayes procedure for detecting the number of change points based on the Deviance Information Criterion (DIC) (Spiegelhalter, Best, Carlin and van der Linde (2002)).

The next section describes the physical reasoning behind our models and various prior selections and their justifications. We discuss a change point model where we impose a continuity constraint on the model. Section 2.3 presents approaches to Markov chain Monte Carlo (MCMC) implementation for posterior inference. Section 2.4 describes our results and Section 2.5 is a brief summary.

2.2 Bayesian Hierarchical Modeling of Ice Flow Velocities

2.2.1 Physical Theory

We focus on modeling the ice flow velocity of an ice sheet in the Lambert Glacial Basin of east Antarctica. We use a simple physics-based approximation for velocity depending on the surface and basal topographies of the ice sheet. In a preliminary data analysis we found that the simple model appears to be reasonable locally, though a need for multiple change points is very clear.

The approximate model used here is well known in glaciology; see Paterson (1994, Ch. 11) for discussion. The model states that the surface velocity $v$ at location $x$ is obtained as
$$v(x) = v_{sl}(x) + 0.50\, A\, H(x)\, \tau^3(x), \qquad (2.5)$$
where $v_{sl}(x)$ is sliding velocity, $A$ is a flow parameter, and $H$ denotes ice thickness. In this article, we assume that the sliding velocity is an unknown constant (i.e., $v_{sl}(x) = v_0$ for all $x \in \mathcal{X}$). The quantity $\tau$ approximates driving stress and is modeled as
$$\tau(x) = -\rho\, g\, H(x)\, \frac{ds(x)}{dx}, \qquad (2.6)$$
where $\rho$ is the density of the ice, $g$ is the gravitational constant, and $s$ is surface elevation. We set $\rho = 1000$ kg/m$^3$ and $g = 9.81$ m/s$^2$.

2.2.2 Data
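A direct transcription of (2.5) and (2.6) as code, included here as an illustrative sketch; the numerical inputs in the check below are made up:

```python
RHO = 1000.0  # ice density (kg/m^3), as set in the text
G = 9.81      # gravitational constant (m/s^2)

def driving_stress(h, ds_dx):
    """Equation (2.6): tau(x) = -rho * g * H(x) * ds/dx."""
    return -RHO * G * h * ds_dx

def surface_velocity(h, ds_dx, v_sl, a_flow):
    """Equation (2.5): v(x) = v_sl + 0.5 * A * H(x) * tau(x)^3."""
    tau = driving_stress(h, ds_dx)
    return v_sl + 0.5 * a_flow * h * tau ** 3
```

Note the sign convention in (2.6): a downhill surface slope (negative ds/dx) yields a positive driving stress and hence a positive flow contribution.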

2.2.2 Data

The dataset used here was obtained using remote sensing technology, particularly laser altimetry, radar altimetry, and inferometric synthetic aperture radar (see Wen,

Jezek, Csatho, Herzfeld and Fasness (2007)). The main components of the dataset are observed ice flow velocities, Vi, i =1,...,nv, measured in meters per year (m/a), surface elevation, S , l = 1, , n (in m), and basal topography observations (m). l ···

Ice thickness estimates, Hl are obtained by subtracting basal topography observa- tions from surface elevation observations. The numerical derivatives of the surface topography with respect to the spatial locations x were obtained as follows. For three consecutive observation locations, xl−1, xl, and xl+1, we estimate the surface derivative at xl by s(x ) s(x ) s′(x )= l+1 − l−1 . (2.7) l x x l+1 − l−1 8 At the first and last observation locations, one-sided numerical derivatives were used.

Note that we have velocity observations for only nv (< n) locations.
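Incidentally, for evenly spaced locations NumPy's `gradient` uses exactly this scheme, central differences in the interior and one-sided differences at the two ends, so the derivative step can be sketched as follows (the elevations below are illustrative numbers, not the Antarctic data):

```python
import numpy as np

x = np.arange(5.0)                             # evenly spaced locations
s = np.array([100.0, 98.0, 95.0, 91.0, 86.0])  # surface elevations at x

ds_dx = np.gradient(s, x)  # (2.7) at interior points; one-sided at the ends
# Interior check: ds_dx[1] == (s[2] - s[0]) / (x[2] - x[0]) == -2.5
```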

2.2.3 Bayesian Modeling

Model Formulation

Applying (2.5) and the structure and notation of Section 2.1, it appears that we are to develop local regression models of the form
$$V_i \approx a + b Z_i, \qquad (2.8)$$
where $a$ plays the role of sliding velocity, $b$ is $0.50\, A$, and
$$Z_i = g(H_i, s'(x_i), x_i) = \hat{H}(x_i)\, \hat{\tau}^3(x_i), \qquad (2.9)$$
where $\hat{\cdot}$ indicates data-based, plug-in estimation. However, it is well known among glaciologists that direct use of (2.9) leads to $Z$'s that are ill-behaved in the sense that they are too noisy. A small contribution to this noisy behavior is likely due to measurement errors in the data. However, a major contributor is wild behavior of the estimated surface derivatives. See Paterson (1994), Ch. 11, and Berliner, Jezek, Cressie, Kim, Lam and van der Veen (2008) for further discussion.

We deal with this smoothing issue by viewing the $Z_l$ defined in (2.9) as observational estimates of the true values of $\tilde{Z}_l = H(x_l)\, \tau^3(x_l)$. We model the $\tilde{Z}_l$ using Daubechies orthogonal wavelets, in particular “D8” wavelets (Bruce and Gao (1996), Ch. 2; Shen and Strang (1998)). Let $\mathbf{V}$, $\mathbf{Z}$, and $\tilde{\mathbf{Z}}$ be the $n_v \times 1$ vector of velocity observations, the $n \times 1$ vector of values of the $Z_l$'s defined in (2.9), and the $n \times 1$ vector of hypothesized true values $\tilde{Z}$, respectively. We assume that
$$\mathbf{Z} = \tilde{\mathbf{Z}} + \mathbf{e} \qquad (2.10)$$
$$= \mathbf{W}\mathbf{C} + \mathbf{e}, \qquad (2.11)$$
where $\mathbf{W}$ is the wavelet basis function matrix, constructed at a particular resolution level $\nu$ using “reflection” boundary conditions, and $\mathbf{C}$ is a $c_\nu \times 1$ vector of wavelet coefficients; hence, $\mathbf{W}$ is an $n \times c_\nu$ matrix. The errors $\mathbf{e}$ are assumed to follow a zero-mean Gaussian distribution with covariance matrix $\sigma_z^2 \mathbf{I}_n$, where $\mathbf{I}_n$ is the identity matrix of order $n$.
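Conditionally on the basis, the smoothing in (2.10)-(2.11) is linear regression of Z on the columns of W. A sketch with a generic smooth basis standing in for the D8 wavelet matrix; the cosine basis, dimensions, and signal below are illustrative assumptions, not the construction used in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

n, c = 256, 8                                  # n observations, c basis functions
t = np.arange(n) / n

# Stand-in basis matrix W (n x c); the dissertation uses a D8 wavelet basis.
W = np.cos(np.pi * np.outer(t, np.arange(c)))

z_true = 5.0 * np.sin(2 * np.pi * t)           # hypothetical smooth Z-tilde
z = z_true + rng.normal(0.0, 1.0, size=n)      # noisy Z = Z-tilde + e

C_hat, *_ = np.linalg.lstsq(W, z, rcond=None)  # least-squares coefficients
z_smooth = W @ C_hat                           # fitted Z-tilde = W C-hat
```

With a Gaussian prior on C, the full conditional for C is also Gaussian and centered near this least-squares fit; a coarse resolution (small c) is what produces heavy smoothing.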

We next use the change-point model formulation in Section 2.1. For each observation, assuming there are $m$ change points at locations $\mathbf{r}$ and conditioning on $\tilde{\mathbf{Z}}$, we have
$$V_i = \sum_{j=1}^{m+1} a_j I(x_i \in \mathcal{X}_j) + \sum_{j=1}^{m+1} b_j I(x_i \in \mathcal{X}_j)\, \tilde{Z}_i + \sum_{j=1}^{m+1} \sigma_j I(x_i \in \mathcal{X}_j)\, \epsilon_i, \qquad (2.12)$$
where the $\epsilon_i$'s are iid standard normal random variables. Hence, as in (2.2), the likelihood function $\mathcal{L} = [\mathbf{V} \mid \tilde{\mathbf{Z}}, \mathbf{a}, \mathbf{b}, \boldsymbol{\sigma}^2, \mathbf{r}]$ is
$$\mathcal{L} = \prod_{j=1}^{m+1} \frac{1}{(\sqrt{2\pi}\,\sigma_j)^{n_j}} \exp\!\left( -\frac{1}{2\sigma_j^2} (\mathbf{V}_j - a_j \mathbf{1}_{n_j} - b_j \tilde{\mathbf{Z}}_j)^t (\mathbf{V}_j - a_j \mathbf{1}_{n_j} - b_j \tilde{\mathbf{Z}}_j) \right), \qquad (2.13)$$
where each $\mathbf{V}_j$ is the $n_j \times 1$ vector of $V$-values in the $j$th partition element $\mathcal{X}_j$, $j = 1, \ldots, m+1$, the $\tilde{\mathbf{Z}}_j$ are defined similarly, $\mathbf{a}$, $\mathbf{b}$, and $\boldsymbol{\sigma}$ are $(m+1) \times 1$ vectors of intercept, slope, and variance parameters, respectively, and $\mathbf{1}_M$ is the $M \times 1$ vector of 1's. Note that $\sum_{j=1}^{m+1} n_j = n_v$.

Priors on Parameters

Slopes and Intercepts. The assumed continuity constraint that the regression lines across various regions are connected at the change points translates to the restrictions
$$a_j + b_j \tilde{Z}_{r_j} = a_{j+1} + b_{j+1} \tilde{Z}_{r_j} \quad \text{for } j = 1, \ldots, m. \qquad (2.14)$$
These constraints imply that there are only $2(m+1) - m$ free parameters. We express the $m$ intercepts $a_j$, $j = 2, \ldots, m+1$, in terms of the other parameters. Hence, we specify priors for $a_1$ and the $b_j$, $j = 1, \ldots, m+1$. We assume $a_1 \sim \mathrm{Gau}(0, \tau_{a_1}^2)$ and $b_1 \sim \mathrm{Gau}(0, \tau_{b_1}^2)$. For $j = 2, \ldots, m+1$, we model the slopes hierarchically:

$$[b_j \mid b_{j-1}, \tau_{b_j}^2] = \frac{1}{\sqrt{2\pi}\,\tau_{b_j}^3}\, (b_j - b_{j-1})^2 \exp\!\left( -\frac{1}{2\tau_{b_j}^2} (b_j - b_{j-1})^2 \right). \qquad (2.15)$$

Our motivation for choosing this prior is illustrated in Figure 2.1. We are to regress $y$ on $x$, and suspect a single change point at location $r$. Let $\alpha_1 + \beta_1 x$ and $\alpha_2 + \beta_2 x$ be the regression lines before and after the change point, respectively. If we need the change point, then either the slopes and/or the intercepts are fairly different for these two regression lines. Since $\alpha_2$ is determined by the other three parameters, we focus on $\beta_2$. The probability density $[\beta_2 \mid \beta_1, \sigma^2]$ as described in (2.15) is symmetric about $\beta_1$, but assigns very little probability near $\beta_1$. The modes of this density are at $\beta_1 \pm \sqrt{2\sigma^2}$. Hence, $\beta_2$ values away from $\beta_1$ are favored; $\sigma$ quantifies this separation.

For Gaussian data models, these bimodal densities are “conjugate,” so that updating can be done in closed form. Also, we can ensure monotonicity with high probability by choosing small $\sigma$. To force monotonicity with probability one, we can truncate the density at $\beta_1$, incurring very minor numerical issues.
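A quick numerical check of the prior (2.15), added here for illustration: the density is proportional to (b − b₀)² exp(−(b − b₀)²/(2τ²)), so its modes sit at b₀ ± √2·τ and it is near zero at b₀ itself.

```python
import numpy as np

def slope_prior_density(b, b_prev, tau):
    """The bimodal density (2.15): symmetric about b_prev, near zero at
    b_prev, with modes at b_prev +/- sqrt(2) * tau."""
    d2 = (b - b_prev) ** 2
    return d2 * np.exp(-d2 / (2.0 * tau ** 2)) / (np.sqrt(2.0 * np.pi) * tau ** 3)

b = np.linspace(-5.0, 5.0, 100_001)
pdf = slope_prior_density(b, b_prev=0.0, tau=1.0)

mode = abs(b[np.argmax(pdf)])        # numerically close to sqrt(2)
mass = pdf.sum() * (b[1] - b[0])     # close to 1: the density is normalized
```

The normalizing constant √(2π)·τ³ follows from ∫ x² exp(−x²/(2τ²)) dx = √(2π)·τ³, which the Riemann sum above confirms numerically.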

Variances. For all variance parameters $\{\sigma_j^2, \tau_{a_1}^2, \tau_{b_j}^2,\ j = 1, \ldots, m+1\}$ we use inverse-gamma priors with parameters $(\eta, \gamma)$, where $\eta$ denotes the shape parameter and $\gamma$ denotes the scale parameter. Since we do not have strong prior knowledge for these variance parameters, we used diffuse but proper inverse-gamma distributions. Specifically, we set $\eta = 0.001$ and $\gamma = 0.001$ for all the variance parameters.

Wavelet Model. We assumed that $\mathbf{C}$ is a Gaussian random vector with mean $\mathbf{0}$, and a prior with covariance matrix $\sigma_C^2 \mathbf{I}_{c(\nu)}$ ($\mathbf{I}_{c(\nu)}$ is an identity matrix of order $c(\nu)$) is used for $\mathbf{C}$. One may also use a noninformative proper inverse-gamma prior on the error variance $\sigma_z^2$. However, we treat $\sigma_z^2$ as a control parameter, rather than a measurement error variance. The details are discussed in the results section. We assumed that $\sigma_C^2$ follows an inverse-gamma prior distribution with $\eta = 0.001$ and $\gamma = 0.001$.

[Figure 2.1: Illustration of Using the Prior (2.15). Top panel: the regression lines $\alpha_1 + \beta_1 x$ and $\alpha_2 + \beta_2 x$ on either side of a change point $r$. Bottom panel: the density $f(\beta_2 \mid \beta_1, \sigma^2)$, with modes at $\beta_1 \pm 2^{1/2}\sigma$.]

Change Points. Conditional on the number of change points, $m$, we assign an order-restricted, discrete uniform prior for the locations $\mathbf{r}$:
$$[\mathbf{r} = (r_1, \ldots, r_m) \mid m] = K_m\, I(x_L < x_{r_1} < \cdots < x_{r_m} < x_U), \qquad (2.16)$$
where $K_m$ is a normalizing constant. However, we found it convenient to modify this prior by adding the restriction that the difference between the indices of every pair of consecutive change points must be at least $d > 1$. This minimal separation $d$ is an extra parameter which prevents two change points from being too close. Discussion of the selection and impact of the choice of $d$ is given in Section 2.4.
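One simple way to draw from an order-restricted discrete uniform prior with a minimal index separation d is rejection sampling, sketched below for illustration; this is not necessarily the authors' implementation:

```python
import numpy as np

def sample_change_points(n_locs, m, d, rng):
    """Draw m ordered interior indices from {1, ..., n_locs - 2} such that
    consecutive indices differ by at least d (rejection sampling)."""
    while True:
        r = np.sort(rng.choice(np.arange(1, n_locs - 1), size=m, replace=False))
        if m == 1 or np.min(np.diff(r)) >= d:
            return r

rng = np.random.default_rng(0)
r = sample_change_points(n_locs=100, m=3, d=5, rng=rng)
```

Because every admissible ordered configuration is equally likely under uniform sampling, rejecting the too-close configurations leaves a uniform distribution on the restricted support, which is exactly the modified prior.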

2.2.4 Detecting the Number of Change Points

As mentioned in Section 2.1, inference for the number of change points $m$ requires an MCMC algorithm which can move across model spaces of varying dimensions. We use a rather simplistic, but effective, approach. We run separate Gibbs samplers for a collection of candidates for $m$. We then compare results and assess the competing models using model selection methods. In our example, though we examined the Deviance Information Criterion (DIC) (Spiegelhalter et al., 2002), our main suggestions were developed based on residual analyses.

Given $m$, the appropriate log-likelihood function $L_m(\boldsymbol{\theta}_m; \mathbf{V}, \tilde{\mathbf{Z}})$ in our model selection context is
$$L_m = -\sum_{j=1}^{m+1} \frac{n_j}{2} \log(2\pi\sigma_j^2) - \sum_{j=1}^{m+1} \frac{1}{2\sigma_j^2} \big( \mathbf{V}_j - a_j \mathbf{1}_{n_j} - b_j \tilde{\mathbf{Z}}_j \big)^t \big( \mathbf{V}_j - a_j \mathbf{1}_{n_j} - b_j \tilde{\mathbf{Z}}_j \big), \qquad (2.17)$$
where $\boldsymbol{\theta}_m = \{\mathbf{a}_m, \mathbf{b}_m, \boldsymbol{\sigma}_m^2\}$. Note that we have not included the contribution from the wavelet modeling,

$$L_{\text{wavelet}} = -\frac{n}{2} \log(2\pi\sigma_z^2) - \frac{1}{2\sigma_z^2} (\mathbf{Z} - \mathbf{W}\mathbf{C})^t (\mathbf{Z} - \mathbf{W}\mathbf{C}), \qquad (2.18)$$
in (2.17). We omitted this term because we use a single wavelet model for all $m$. That is, the smoothing performed should not influence the selection of $m$ in our case. If one considers the more complex set-up in which the selected wavelet resolution could vary with $m$, then the likelihood should be adjusted appropriately. We did not do this because the level of smoothing is a scientifically interesting characteristic. Further, the smoothing is done to produce a good model for velocities, rather than a good model for $\tilde{\mathbf{Z}}$ in its own right. Indeed, the wavelet model used does not provide a faithful reconstruction of the $\mathbf{Z}$ data. Hence, the poor fit of that part of the model, though not a problem for our purposes, would force the likelihood to be extremely small.

The saturated DIC is defined as

DIC = 2[L (E(θ V, Z); V, E(Z˜ V, Z)) 2E(L (θ ; V, Z˜) V, Z)+ log(h(V, Z˜))], m m| | − m m | (2.19)

where log(h(V, Z˜)) is a standardizing function generally obtained by replacing V, Z˜ { } by E(V Z˜), Z˜ in the likelihood. For our case this quantity is m+1 nj log(2πσ2). { | } − j 2 j (k) For each iteration k of the MCMC (after burn-in), we compute Lm Pand estimate the posterior expectation of log-likelihood E(L (θ ; V, Z˜) V, Z) by 1 K L (k). The m m | K k=1 m P 14 first term in (2.19) is obtained as L (θˆ ; V, Eˆ(Z˜ V, Z)) where θˆ = 1 K θ (k) m m | m K k=1 m and Eˆ(Z˜ V, Z)= 1 K Z˜ (k) = 1 K WC(k), the estimated posterior meansP from | K k=1 K k=1 the MCMC samples.P We estimatedPh(V, Z˜) by plugging in the estimated posterior

2 ˜ mean of σj ’s. For model comparison purposes, some authors set h(V, Z) equal to 1,

yielding what is known as the non-standardized deviance information criterion. The

DIC accounts for the fit of the model along with the complexity of the model in terms

of number of independent unknown estimable parameters. For further discussion of

the properties of DIC, see Spiegelhalter et al. (2002) and Trevisani and Gelfand

(2003).
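Given stored per-iteration log-likelihood values, the DIC computation described above reduces to a few lines. The following Python sketch is illustrative only; the function and argument names are ours, not the dissertation's code:

```python
import numpy as np

def saturated_dic(loglik_draws, loglik_at_plugin, log_h):
    """Saturated DIC in the form of (2.19):
    DIC = 2 * [ L(plug-in posterior means) - 2 * E(L | data) + log h ],
    where E(L | data) is estimated by averaging the per-iteration
    log-likelihoods L^(k) over the kept MCMC draws."""
    expected_loglik = np.mean(loglik_draws)
    return 2.0 * (loglik_at_plugin - 2.0 * expected_loglik + log_h)
```

Setting `log_h = 0` recovers the non-standardized version mentioned above.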

We next discuss choosing candidate points for m. As mentioned by Chib (1995)

in the context of detecting the number of mixtures for Gaussian mixture models, a

very large number of changepoints may result from over-fitting which in turn leads

to nonconvergence of MCMC algorithms. We suggest starting with a small number

of change points, and then increasing the number until there are indications of over-

fitting (such as two consecutive change points being very close) and/or DIC values do

not change or change in a reverse direction.

2.3 Implementing MCMC

We ran the chain for 50, 000 iterations. The first 10, 000 samples were discarded

as burn-in. Convergence seems plausible based on inspection of trace and running

mean plots, as well as application of Geweke’s test (Geweke, 1992). As shown in

the Appendix, the full conditional distributions of the intercepts and slopes can be

expressed as a quantity times a normal distribution (equations 2.29, 2.32, 2.35). We

used importance sampling techniques to generate from these distributions, using the

corresponding normal distributions as proposal distributions. Specifically, we generated 1,000 samples from each of those normal distributions. We then resampled from

these values using the proper probability weights.
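This resampling step is a generic sampling-importance-resampling (SIR) scheme. A minimal Python sketch, assuming the target's unnormalized log density can be evaluated on an array of candidates (names and defaults are ours):

```python
import numpy as np

def sir_draw(log_target, mu, sd, n_prop=1000, rng=None):
    """Sampling-importance-resampling: draw one value from a density
    proportional to exp(log_target(x)) using a N(mu, sd^2) proposal.
    `log_target` must accept a NumPy array of candidate values."""
    rng = np.random.default_rng() if rng is None else rng
    props = rng.normal(mu, sd, size=n_prop)
    # log weight = log target - log proposal density (constants cancel
    # in the normalization below)
    log_w = log_target(props) + 0.5 * ((props - mu) / sd) ** 2
    w = np.exp(log_w - log_w.max())   # subtract max to stabilize
    return rng.choice(props, p=w / w.sum())
```

When the target equals the proposal, the weights are uniform and the draw is simply one of the proposal samples.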

We set the variance parameter $\sigma_z^2$ at $0.1\ (10^{18}\,\mathrm{m\,(kPa)^3})$. We used a very smooth version of the wavelet decomposition based on three coefficients. The result does not provide a good fit for the observed $H\tau^3$. Our goal is to develop a reasonable model for ice flow velocities, and this choice of wavelet resolution level performs well for that purpose.

Recall that we view $\sigma_z^2$ as a control parameter rather than a measurement error variance. We used inverse gamma priors for all other variance parameters. The hyperparameters for these inverse gamma distributions are set at $(\frac{1}{100}, \frac{1}{100})$. For the locations of the change points we used a griddy Gibbs sampling technique (Ritter and Tanner (1992); Casella and George (1992)). For simplicity we assume that the change points can occur only at observation locations. The corresponding weight for a candidate point over the grid is derived as expression (2.41) in the Appendix. In implementing this approach, we must compute expressions of the form

$$\exp\left(-\frac{1}{2\sigma^2}(Y-\mu)'(Y-\mu)\right). \qquad (2.20)$$

If the dimension (say $k$) of $Y$ is very large, then the above expression will become

zero and our sampler will fail due to numerical underflow. Berliner and Wikle (2006)

proposed a quick solution to this. If $Y$ follows $\mathrm{Gau}(\mu, \sigma^2 I)$ (which holds in our case), then $(Y-\mu)'(Y-\mu)/\sigma^2$ follows a chi-squared distribution with $k$ degrees of freedom and

$$E\left[\exp\left(-\frac{1}{2\sigma^2}(Y-\mu)'(Y-\mu)\right)\right] = 2^{-k/2}. \qquad (2.21)$$

Hence, we replace (2.20) by the expression

$$\exp\left(\frac{k}{2}\log(2) - \frac{1}{2\sigma^2}(Y-\mu)'(Y-\mu)\right). \qquad (2.22)$$

In our case this adjustment fixed the numerical issue.
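A minimal sketch of the standardized expression (2.22), under the assumption that `y` and `mu` are NumPy vectors:

```python
import numpy as np

def standardized_weight(y, mu, sigma):
    """Evaluate (2.22): exp((k/2) log 2 - ||y - mu||^2 / (2 sigma^2)),
    i.e. the Gaussian kernel (2.20) rescaled by its expected value
    2^(-k/2) under y ~ Gau(mu, sigma^2 I)."""
    k = y.size
    quad = np.sum((y - mu) ** 2)
    return np.exp(0.5 * k * np.log(2.0) - quad / (2.0 * sigma ** 2))
```

At $y = \mu$ the value is $2^{k/2}$ rather than 1, reflecting the rescaling by the chi-squared expectation.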

2.4 Results

We ran MCMC samplers with fixed numbers of change points m equal to 2, 3, 4,

and 5. We did not consider m = 0 or 1 because plots of the data clearly indicate that

these two models would not be competitive. Fig. 2.2 shows the MCMC realizations

of $H\tau^3$ and Table 2.1 tabulates the posterior means and standard deviations of three

wavelet coefficients for the different change point models.

Parameter   Model   Post. Mean   Post. SDev
C1          m = 2   -116.28      0.14
            m = 3   -116.46      0.19
            m = 4   -116.34      0.23
            m = 5   -116.53      0.23
C2          m = 2   -101.23      0.41
            m = 3   -101.04      0.20
            m = 4   -101.05      0.25
            m = 5   -100.21      0.84
C3          m = 2   -45.02       0.11
            m = 3   -44.97       0.11
            m = 4   -44.97       0.13
            m = 5   -44.77       0.17

Table 2.1: Summary of the wavelet coefficients of the $H\tau^3$ model: posterior mean (Post. Mean) and posterior standard deviation (Post. SDev), both in $10^{18}\,\mathrm{m\,(kPa)^3}$

Fig. 2.3 shows the observed velocities in cyan circles and posterior means of the velocity models as blue dots. The black vertical lines indicate the posterior mean

Figure 2.2: Observed and posterior means of the wavelet model for $H\tau^3$. Blue lines indicate observed values and red dots indicate several realizations from the posterior distribution. (a) m = 2, (b) m = 3, (c) m = 4, and (d) m = 5.

of the locations of the change points. From these plots, we see the need for more

than two change points. Fig. 2.4 shows the residuals of these four models. We define

the residuals as the difference between the observed velocities and posterior means.

From the residual plots it is quite obvious that models with 3 or 4 change points are

reasonable for this dataset.

Recall that we impose the condition that two consecutive change points must be

at least d observations apart. We repeated the analyses for d = 4, 10, 20 and 30. In

terms of model fitting our results are insensitive to these choices of d.

Note that the residuals show patterns suggesting that the geophysical equation

(2.5) is not sufficient for capturing the trend of the velocity data. We did not elaborate the model to include a correlation structure of the errors, since this would not be of interest to the geophysicists. Instead, we believe that an improved model for the sliding velocity (vs in equation (2.5)) will improve the fit and be more scientifically useful (also see Berliner et al. 2008). We will pursue this direction elsewhere.

Table 2.2 shows the DIC values for the candidate models. Models with small DIC values are preferred over models with large DIC values. These values also suggest that the models with three or four change points are reasonable. Following Dempster

(1977)'s philosophy, we are reluctant to claim one of these two models as the unequivocal best one. Table 2.3 shows the posterior means and posterior standard deviations of the parameters in the velocity models with three and four change points. The slope parameters are physically meaningful in that they provide an indirect indication of the temperature of the associated region (Paterson (1994)). The change points suggest that the temperature over the Lambert basin is not constant, and comparing the values of b with their previously obtained theoretical values, we find indications of higher-than-expected temperatures. We note that one should not make too much of these

DIC comparisons in this case. Recall that DIC assumes independent errors. However,

our residual plots clearly indicate unexplained dependence structures of the errors in

each case. As mentioned in the previous paragraph, we will develop alternative model

classes elsewhere.

The posterior means of change point locations are tabulated in Table 2.4. We can

see that for m = 3, the change points occur around 53.8km, 55.9km, and 66.9km from the origin of the glacier. With m = 4, the change points occur near 50.9km, 55.1km,

58.1km, and 66.9km from the origin of the glacier. From expert knowledge we know that there are patches of water under these ice sheets. It would be interesting to see whether the change point locations obtained by us match with those potholes of water.

Model   DIC
m = 2   1370.05
m = 3   1215.23
m = 4   1167.31
m = 5   1573.63

Table 2.2: DIC values for m =2,..., 5.

2.5 Discussion

We presented a Bayesian analysis of a hierarchical model that arises from geophys-

ical reasoning. As mentioned by Berliner (2003), sometimes it is better to use simple

physical models in conjunction with Bayesian hierarchical models, rather than using

deterministic and complicated physical models or purely statistical models. A crucial


Figure 2.3: Observed and posterior means of the velocity model. Cyan circles indicate observed velocities and blue dots indicate posterior means. (a) m = 2, (b) m = 3, (c) m = 4, and (d) m = 5.


Figure 2.4: Residuals (observed velocities minus posterior means) of the velocity model. (a) m = 2, (b) m = 3, (c) m = 4, and (d) m = 5.

             m = 3                             m = 4
Parameter    Mean           SDev              Mean           SDev
b1           5.5 × 10^-17   7.5 × 10^-18      5.8 × 10^-17   5.5 × 10^-18
b2           1.2 × 10^-15   3.9 × 10^-17      2.8 × 10^-16   3.5 × 10^-17
b3           2.2 × 10^-16   0.25 × 10^-17     8.4 × 10^-16   2.5 × 10^-17
b4           6.2 × 10^-16   0.87 × 10^-17     2 × 10^-16     0.4 × 10^-17
b5           -              -                 6.6 × 10^-16   0.82 × 10^-17

Table 2.3: Posterior summaries of the slopes of the regression lines (in $a^{-1}(\mathrm{kPa})^{-3}$)

Change point   m = 3       m = 4
r1             53.799 km   50.999 km
r2             55.999 km   55.099 km
r3             66.998 km   58.099 km
r4             -           66.999 km

Table 2.4: Posterior mean of the change point locations

question for the scientific community is how we are going to trade off between physical and statistical modeling. Obviously, answering this question is not simple.

A thorough discussion between physicists and statisticians is needed to understand the scientific phenomenon of interest. Too simple a physical model with implausible assumptions can be excluded. For example, ice flow velocities can also be explained in terms of the thickness of the ice sheets. But thickness alone is certainly not sufficient to explain the complex dynamics of ice flow.

Another issue that often arises is the nature and degree of smoothing. Berliner et

al. (2008) smooth surface and basal topographies separately and then put a corrector

process on $\tau$. In contrast, we do not smooth surface and basal topographies; instead we use a wavelet smoother directly on $H\tau^3$. The degree of smoothing is important for the

scientists to understand which assumptions can be imposed on their physical models to make them valid.

We have also constructed an efficient MCMC implementation for a fairly complicated Bayesian hierarchical model in which the dimension of the parameter space is unknown. One may question the use of DIC as a model selection criterion. DIC can be interpreted properly for simple Bayesian models (such as Bayesian GLMs), but for complex models the properties of DIC are not well understood. Moreover, the residuals of our models are not independent, so some modification of the definition of DIC is necessary. But we still use DIC since it penalizes for model fit as well as for the number of unknown parameters (which would not be the case if we used posterior probabilities or residual sums of squares). One may use other model selection criteria

(such as AIC or BIC) when the posterior mean is not a good summary of the posterior distribution (e.g., for skewed distributions).

2.6 Appendix: Full Conditional Distributions

Let $W_M$ denote the rows of the $W$ matrix (wavelet basis functions) for those spatial locations where velocity observations are missing, and let $W_v$ denote those for which we have velocity observations. Again we partition $W_v$ according to the change point locations. Hence,

$$W = \begin{pmatrix} W_M \\ W_1 \\ W_2 \\ \vdots \\ W_{m+1} \end{pmatrix},$$

where $W_j$ denotes the rows of $W_v$ which fall in change point region $j$. The full

conditional distribution for $C$ is Gaussian with mean $\mu_C$ and variance $\Sigma_C$, where

$$\Sigma_C = \left(\frac{W'W}{\sigma_z^2} + \frac{I}{\sigma_C^2} + \sum_{j=1}^{m+1} \frac{b_j^2\, W_j' W_j}{\sigma_j^2}\right)^{-1} \qquad (2.23)$$

and

$$\mu_C = \Sigma_C \left(\frac{W'Z}{\sigma_z^2} + \sum_{j=1}^{m+1} \frac{b_j\, W_j'(V_j - a_j 1_{n_j})}{\sigma_j^2}\right). \qquad (2.24)$$

Sample the variance parameter $\sigma_C^2$ from

$$IG\left(\eta_C + \frac{3}{2},\ \left(\frac{1}{\gamma_C} + 0.5\, C'C\right)^{-1}\right). \qquad (2.25)$$

Due to the continuity conditions (2.14), the intercepts $a_j$, $j = 2, \ldots, m+1$, are deterministic functions of other parameters.

Define:

$$T_{11} = \tilde{Z}_{r_1} 1_{n_2}$$

$$T_{12} = \tilde{Z}_2 - \tilde{Z}_{r_1} 1_{n_2}$$

For $j = 2, \ldots, m$: $T_{j1} = \tilde{Z}_{r_1} 1_{n_{j+1}}$;
for $k = 2, \ldots, j$: $T_{jk} = \left(\tilde{Z}_{r_k} - \tilde{Z}_{r_{k-1}}\right) 1_{n_{j+1}}$;
and $T_{j,j+1} = \tilde{Z}_{j+1} - \tilde{Z}_{r_j} 1_{n_{j+1}}$.

The full conditional for the intercept $a_1$ is

$$[a_1 \mid \text{rest}] = \mathrm{Gau}(\mu_{a_1}, \Sigma_{a_1}), \qquad (2.26)$$

where

$$\Sigma_{a_1} = \left(\sum_{j=1}^{m+1} \frac{n_j}{\sigma_j^2} + \frac{1}{\tau_{a_1}^2}\right)^{-1} \qquad (2.27)$$

and

$$\mu_{a_1} = \Sigma_{a_1}\left(\sum_{j=1}^{m+1} \frac{1_{n_j}' V_j}{\sigma_j^2} - \frac{1_{n_1}' \tilde{Z}_1\, b_1}{\sigma_1^2} - \sum_{j=1}^{m} \sum_{k=1}^{j+1} \frac{1_{n_{j+1}}' T_{jk}\, b_k}{\sigma_{j+1}^2}\right). \qquad (2.28)$$

The full conditional distribution for slope $b_1$ is

$$[b_1 \mid \text{rest}] \propto (b_2 - b_1)^2\, \mathrm{Gau}(\mu_{b_1}, \Sigma_{b_1}), \qquad (2.29)$$

where

$$\Sigma_{b_1} = \left(\frac{\tilde{Z}_1' \tilde{Z}_1}{\sigma_1^2} + \sum_{j=1}^{m} \frac{T_{j1}' T_{j1}}{\sigma_{j+1}^2} + \frac{1}{\tau_{b_1}^2} + \frac{1}{\tau_{b_2}^2}\right)^{-1} \qquad (2.30)$$

and

$$\mu_{b_1} = \Sigma_{b_1}\left(\frac{\tilde{Z}_1' V_1}{\sigma_1^2} - \frac{1_{n_1}' \tilde{Z}_1\, a_1}{\sigma_1^2} + \sum_{j=1}^{m} \frac{T_{j1}' V_{j+1}}{\sigma_{j+1}^2} - \sum_{j=1}^{m} \frac{a_1\, T_{j1}' 1_{n_{j+1}}}{\sigma_{j+1}^2} - \sum_{j=1}^{m} \sum_{l=2}^{j+1} \frac{T_{j1}' T_{jl}\, b_l}{\sigma_{j+1}^2} - \frac{b_2}{\tau_{b_2}^2}\right). \qquad (2.31)$$

The full conditional distributions for the slope parameters $b_k$, $k = 2, \ldots, m$, are

$$[b_k \mid \text{rest}] \propto (b_k - b_{k-1})^2 (b_{k+1} - b_k)^2\, \mathrm{Gau}(\mu_{b_k}, \Sigma_{b_k}), \qquad (2.32)$$

where

$$\Sigma_{b_k} = \left(\sum_{j=k-1}^{m} \frac{T_{jk}' T_{jk}}{\sigma_{j+1}^2} + \frac{1}{\tau_{b_k}^2} + \frac{1}{\tau_{b_{(k+1)}}^2}\right)^{-1} \qquad (2.33)$$

and

$$\mu_{b_k} = \Sigma_{b_k}\left(\sum_{j=k-1}^{m} \frac{T_{jk}' V_{j+1}}{\sigma_{j+1}^2} - \sum_{j=k-1}^{m} \frac{T_{jk}' 1_{n_{j+1}}\, a_1}{\sigma_{j+1}^2} - \sum_{j=k-1}^{m} \sum_{l \neq k} \frac{T_{jk}' T_{jl}\, b_l}{\sigma_{j+1}^2} - \frac{b_{k-1}}{\tau_{b_k}^2} - \frac{b_{k+1}}{\tau_{b_{(k+1)}}^2}\right). \qquad (2.34)$$

The full conditional distribution for slope $b_{m+1}$ is

$$[b_{m+1} \mid \text{rest}] \propto (b_{m+1} - b_m)^2\, \mathrm{Gau}(\mu_{b_{m+1}}, \Sigma_{b_{m+1}}), \qquad (2.35)$$

where

$$\Sigma_{b_{m+1}} = \left(\frac{T_{m(m+1)}' T_{m(m+1)}}{\sigma_{m+1}^2} + \frac{1}{\tau_{b_{(m+1)}}^2}\right)^{-1} \qquad (2.36)$$

and

$$\mu_{b_{m+1}} = \Sigma_{b_{m+1}}\left(\frac{T_{m(m+1)}' V_{m+1}}{\sigma_{m+1}^2} - \frac{T_{m(m+1)}' 1_{n_{m+1}}\, a_1}{\sigma_{m+1}^2} - \sum_{l=1}^{m} \frac{T_{m(m+1)}' T_{ml}\, b_l}{\sigma_{m+1}^2} - \frac{b_m}{\tau_{b_{(m+1)}}^2}\right). \qquad (2.37)$$

$$[\sigma_j^2 \mid \text{rest}] = IG\left(\eta_0 + \frac{n_j}{2},\ \left(\frac{1}{\gamma_0} + \frac{1}{2}\left(V_j - a_j 1_{n_j} - \tilde{Z}_j b_j\right)^t\left(V_j - a_j 1_{n_j} - \tilde{Z}_j b_j\right)\right)^{-1}\right) \qquad (2.38)$$

$$[\tau_{a_1}^2 \mid \text{rest}] = IG\left(\eta_{a_1} + \frac{1}{2},\ \left(\frac{1}{\gamma_{a_1}} + \frac{1}{2} a_1^2\right)^{-1}\right) \qquad (2.39)$$

For $j = 1, \ldots, m+1$:

$$[\tau_{b_j}^2 \mid \text{rest}] = IG\left(\eta_{b_j} + \frac{3}{2},\ \left(\frac{1}{\gamma_{b_j}} + \frac{1}{2} b_j^2\right)^{-1}\right) \qquad (2.40)$$

To generate realizations for the positions of the change points we adopted the griddy Gibbs sampling technique. The weights associated with $r_j$ are $w_j$:

$$w_j \propto \frac{1}{\sigma_j^{n_j}\, \sigma_{j+1}^{n_{j+1}}} \exp\left(-\frac{1}{2\sigma_j^2}\left(V_j - a_j 1_{n_j} - b_j\tilde{Z}_j\right)^t\left(V_j - a_j 1_{n_j} - b_j\tilde{Z}_j\right)\right) \times \exp\left(-\frac{1}{2\sigma_{j+1}^2}\left(V_{j+1} - a_{j+1} 1_{n_{j+1}} - b_{j+1}\tilde{Z}_{j+1}\right)^t\left(V_{j+1} - a_{j+1} 1_{n_{j+1}} - b_{j+1}\tilde{Z}_{j+1}\right)\right) \qquad (2.41)$$

$r_j$ is sampled from $(r_{j-1} + d,\ r_{j+1} - d)$ using the $w_j$'s.
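A generic griddy Gibbs update of this kind — evaluate unnormalized log weights such as (2.41) over a candidate grid, normalize, and sample — can be sketched as follows. This is illustrative only; `log_weight` stands in for the logarithm of (2.41):

```python
import numpy as np

def griddy_gibbs_draw(grid, log_weight, rng=None):
    """One griddy Gibbs update: evaluate unnormalized log weights over a
    finite candidate grid, normalize in a numerically safe way, and draw
    one grid point with the resulting probabilities."""
    rng = np.random.default_rng() if rng is None else rng
    lw = np.array([log_weight(g) for g in grid], dtype=float)
    w = np.exp(lw - lw.max())   # subtract the max to avoid underflow
    return rng.choice(grid, p=w / w.sum())
```

Working with log weights and subtracting the maximum serves the same purpose as the standardization in (2.22).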

CHAPTER 3

ASSESSING CONVERGENCE AND MIXING OF MCMC VIA STRATIFICATION

3.1 Introduction

In many Bayesian analyses the posterior distribution is analytically intractable,

and hence we often rely on Markov Chain Monte Carlo (MCMC) simulation. The

idea of MCMC is to construct a Markov Chain in such a way that the chain has the

posterior distribution as its stationary distribution. If we run the algorithm for a

sufficiently “long” time, the produced samples will be a good approximation of our

targeted posterior distribution. The obvious questions are, how long do we need to

run the chain and how will we assess convergence. Along with convergence, another

issue that often arises is “mixing” of the Markov chain. If our algorithm explores all

the important regions of the posterior distribution, then we will say that we have a

well-mixed chain. For example, if our posterior distribution is multimodal, then a

well-mixed chain will explore all the modes.

There are several methods in the literature for assessing convergence. The description we provide here is very brief. For a detailed comparative study of various convergence diagnostics see Cowles and Carlin (1996) and Chapter 12 of Robert and

Casella (2004).

28 The most commonly used MCMC diagnostic tool is the Gelman-Rubin statistic

(Gelman and Rubin (1992)). This method requires multiple chains with overly dispersed starting values. The key quantity in their method is the ratio of between- and within-chain variances of the multiple chains. As the within-chain variance dominates the between-chain variance, their diagnostic ratio approaches 1, which supports approximate stationarity of the produced chain.
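A basic version of the Gelman–Rubin statistic can be sketched as follows (a simplified form of the diagnostic, not any author's implementation; names are ours):

```python
import numpy as np

def gelman_rubin(chains):
    """Simplified Gelman-Rubin potential scale reduction factor for an
    (M, N) array of M parallel chains, each of length N."""
    M, N = chains.shape
    chain_means = chains.mean(axis=1)
    B = N * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (N - 1) / N * W + B / N         # pooled variance estimate
    return np.sqrt(var_hat / W)
```

Values close to 1 are taken as evidence of approximate stationarity.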

Geweke (Geweke (1992)) proposed a method for assessing convergence by testing

the equality of the means of first nf % and last nl% of the iterates. His test assumes

independence between the two sample averages based on the first nf % and last nl%

of the samples. Usually nf is chosen to be 10 and nl is set at 50. The variances of

these sample averages are estimated using a spectral density method. This method

does not use the remaining sample, nor does it address the issue of mixing.

Another method was proposed by Raftery and Lewis (1992) to detect the burn-

in and thinning interval of an MCMC sampler. Suppose someone is interested in estimating the probability $P(\theta \le \theta^* \mid Y)$, where $\theta$ is the parameter of interest, $\theta^*$ is a specified quantity, and $Y$ denotes the data. They require that the estimate should be within $\pm r$ of the truth with probability $s$. Based on this requirement their method finds the burn-in period and the minimum number of iterations needed to achieve this

requirement after burn-in. They also find the thinning interval based on the minimum

number of iterations needed to obtain the desired accuracy for a set of independent

samples.

Several other methods have been proposed by Roberts (1992), Ritter and Tanner

(1992), and Zellner and Min (1995). These methods are mathematically rigorous and

can be applied to assess the convergence to a joint posterior distribution without

considering each parameter separately. However, these methods are constructed for application to Gibbs sampling. Often they are computationally inefficient. Liu, Liu and Rubin (1992) proposed a method which requires multiple chains with dispersed initial values. This method is also specific to Gibbs sampling and involves extensive computational overhead.

Heidelberger and Welch (1983) devised a technique for estimating confidence intervals for means based on spectral density and Brownian bridge arguments. Their method is applied sequentially to assess "burn-in" and, if the test for convergence supports the null hypothesis, then a confidence interval for the mean is readily available.

This method requires $\phi$-mixing or geometrically ergodic chains. Mykland, Tierney and Yu (1995) developed a method based on regeneration theory and the split-chain method. They used scaled regeneration quantile (SRQ) plots to assess stationarity. This method requires writing the MCMC algorithm in a very specific way. For complex models this step is infeasible.

Recently Jones, Haran, Caffo and Neath (2006) and Flegal, Haran and Jones

(2008) developed a method for assessing burn-in based on the width of the prediction interval of the mean of the parameter of interest. They use regeneration theory and batch-mean methods.

None of these methods prescribe procedures for checking both convergence and mixing together. We propose a computationally efficient technique to assess both con- vergence and mixing. Our method is based on stratification and batch-mean methods

(Jones et al. (2006); Flegal et al. (2008)). We compare the estimated variances of the asymptotic distributions of two consistent estimators. If they are close enough, we

30 say that there is no evidence for non-convergence or poor mixing. Our method does

not require multiple chains, but it can be applied to multiple chains.

The rest of this chapter is organized as follows. In Section 3.2 we present a motivational example. Section 3.3 presents the details of our approach. In Sections

3.4, 3.5, 3.6, and 3.7, we apply our method to various Bayesian hierarchical models and real datasets. Section 3.8 provides a numerical assessment of the power of our test. Section 3.9 presents concluding remarks.

3.2 Motivation

Trace plots are widely used in assessing convergence. For a large number of

parameters (e.g., latent variable models, Berliner, Wikle and Cressie (2000), large

spatial models, Sanso, Forest and Zantedeschi (2008)) it is implausible to inspect all

trace plots. Also, trace plots may be misleading. For example, consider an AR(1)

process of the following form:

$$X_t - \mu_t = 0.22\,(X_{t-1} - \mu_{t-1}) + \epsilon_t, \qquad (3.1)$$

where the $\epsilon_t$'s are independent and identically distributed $N(0, 0.1)$ variates, $\epsilon_t$ and $\mu_t$

are independent, and the transition probabilities for the µ-process are

$$P(\mu_t \mid \mu_{t-1} = 20) = \begin{cases} 0.998 & \text{for } \mu_t = 20 \\ 0.002 & \text{for } \mu_t = 15 \end{cases}$$

$$P(\mu_t \mid \mu_{t-1} = 15) = \begin{cases} 0.4 & \text{for } \mu_t = 20 \\ 0.6 & \text{for } \mu_t = 15 \end{cases}$$

The invariant distribution for the µ-process is

$$P(\mu) = \begin{cases} 0.995 & \text{for } \mu = 20 \\ 0.005 & \text{for } \mu = 15 \end{cases}$$

Hence $E(\mu) = 19.975$.

Under the stationary distribution, $E(X_t) = E(\mu)$ for all $t$. We set $X_0 = E(X)$ and $\mu_0 = 20$, and simulated the process for 101,000 iterations. The first 1,000 of these iterations were then discarded. The trace plot for the resulting sample is shown in

Fig. 3.1. It is difficult to draw any inferences regarding the convergence and mixing from this trace plot. This simple example is emblematic of potential behavior for multimodal distributions.
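The process (3.1) is straightforward to simulate directly. In the sketch below we take the noise variance to be 0.1 (our reading of the text's $N(0, 0.1)$); the function name is ours:

```python
import numpy as np

def simulate_regime_ar1(n, rng=None):
    """Simulate the regime-switching AR(1) in (3.1):
    X_t - mu_t = 0.22 (X_{t-1} - mu_{t-1}) + eps_t,
    with mu_t a two-state Markov chain on {20, 15}."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.empty(n)
    mu_prev, x_prev = 20.0, 19.975            # mu_0 = 20, X_0 = E(X)
    for t in range(n):
        if mu_prev == 20.0:                   # transition probabilities
            mu = 20.0 if rng.random() < 0.998 else 15.0
        else:
            mu = 20.0 if rng.random() < 0.4 else 15.0
        # eps_t ~ N(0, 0.1), i.e. standard deviation sqrt(0.1)
        x[t] = mu + 0.22 * (x_prev - mu_prev) + rng.normal(0.0, np.sqrt(0.1))
        mu_prev, x_prev = mu, x[t]
    return x
```

Plotting the output reproduces the qualitative behavior of Fig. 3.1: long stretches near 20 with rare excursions toward 15.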


Figure 3.1: Trace plot of the AR(1) process defined in (3.1)

3.3 A New Method

Let θ be a parameter of interest. Assume we have an MCMC sample of size N,

$\theta_1, \ldots, \theta_N$. We divide this sample into $K$ batches, each of size $n$ (i.e., $N = K \times n$). For $k = 1, \ldots, K$, let $\bar{\theta}_{[k]}$ be the $k$th batch mean. A natural estimator of the posterior mean, $E(\theta \mid \text{Data})$, is the sample mean, $E_1 = \frac{1}{N}\sum_{i=1}^{N} \theta_i = \frac{1}{K}\sum_{k=1}^{K} \bar{\theta}_{[k]}$.

We will construct another estimator, $E_2$, based on stratification. We partition the parameter space $\Theta$ into $J$ "nontrivial" strata, $A_1, \ldots, A_J$. By "nontrivial" we mean that each stratum $A_j$ has positive probability under the stationary distribution ($\pi$). Define two quantities for the $k$th batch as follows:

$$\bar{Z}^P_{[k]j} = \frac{1}{n}\sum_{i=nk+1}^{n(k+1)} Z^P_{ij} \qquad (3.2)$$

$$\bar{Z}^\theta_{[k]j} = \frac{1}{n}\sum_{i=nk+1}^{n(k+1)} Z^\theta_{ij}, \qquad (3.3)$$

where the index $j$ denotes the $j$th stratum, $j = 1, \ldots, J$, and

$$Z^P_{ij} = I(\theta_i \in A_j), \qquad Z^\theta_{ij} = \theta_i\, I(\theta_i \in A_j),$$

where $I(\cdot)$ is the indicator function: $I(A) = 1$ if event $A$ occurs and is zero otherwise. Note that $\bar{Z}^P_{[k]J} = 1 - \sum_{j=1}^{J-1} \bar{Z}^P_{[k]j}$. Let $\mathbf{Z}$ denote the vector of the $(2J-1)K$ $\bar{Z}_{[k]}$'s. An estimate of the probability that $\theta$ belongs to the $j$th stratum is $\bar{P}_j = \frac{1}{K}\sum_{k=1}^{K} \bar{Z}^P_{[k]j}$. We define

$$E_2 = \frac{1}{K}\sum_{j=1}^{J}\sum_{k=1}^{K} \frac{\bar{P}_j}{\bar{Z}^P_{[k]j}}\, \bar{Z}^\theta_{[k]j}. \qquad (3.4)$$

Note that we can write the sample mean $E_1$ in terms of the $\bar{Z}_{[k]j}$'s as follows:

$$E_1 = \frac{1}{K}\sum_{j=1}^{J}\sum_{k=1}^{K} \bar{Z}^\theta_{[k]j}. \qquad (3.5)$$

on the full sample to be close to those estimated from each of the batches. In that

P j situation the ratios P will be close to 1, and so E2 will be approximately equal to Z[k]j

E1.

We use the following form of the Ergodic Theorem (e.g., Robert and Casella

(2004)) to study the consistency properties of these two estimators:

Theorem 3.3.1 Let $\theta_1, \ldots, \theta_N$ be realizations of a Harris recurrent Markov chain with a $\sigma$-finite invariant measure $\pi$. If $f, h \in L^1(\pi)$ with $\int h(\theta)\,d\pi(\theta) \neq 0$, then:

$$\lim_{N\to\infty} \frac{\frac{1}{N}\sum_{t=1}^{N} f(\theta_t)}{\frac{1}{N}\sum_{t=1}^{N} h(\theta_t)} = \frac{\int f(\theta)\,d\pi(\theta)}{\int h(\theta)\,d\pi(\theta)}.$$

Under the conditions of the above theorem, for any fixed number of batches $K$, $E_1 \stackrel{a.s.}{\longrightarrow} E(\theta \mid \text{Data})$ and $E_2 \stackrel{a.s.}{\longrightarrow} E(\theta \mid \text{Data})$ as $n \to \infty$.

Next, we develop estimates of the variances of E1 and E2. In the expression for

E2 given in (3.4), the batch proportions in the denominator may take on zero values.

Hence, the variance of E2 does not exist. We will find the variance of the asymptotic

distribution (as $n \to \infty$) of $E_2$ and will compare it with that of $E_1$. For numerical purposes we will set a lower bound for the batch proportions ($\bar{Z}^P_{[k]j}$).

We apply the Markov Chain Central Limit Theorem (MCCLT) to Z and then use a Taylor series expansion (or delta method). For a detailed statement and various versions of the MCCLT, see Jones et al. (2006) and Robert and Casella (2004). For a Harris-recurrent ergodic chain application of the MCCLT may require geometric or polynomial ergodicity, or α-mixing chains, depending on θ. Assuming that required conditions hold, we apply the Cramer-Wold device (Basu (2004), Ch. 9) to conclude

that

$$\sqrt{n}\left(\mathbf{Z} - E(\mathbf{Z})\right) \stackrel{D}{\to} MVN(0, \Sigma_Z) \text{ as } n \to \infty, \qquad (3.6)$$

where $\Sigma_Z$ is the variance of the asymptotic distribution of $\mathbf{Z}$. Applying the delta method to $E_2 = g(\mathbf{Z})$, we find the asymptotic distribution of $E_2$:

$$\sqrt{n}\left(E_2 - g(\mathbf{Z}_o)\right) \simeq N\!\left(d_2^t\left(E_\pi \mathbf{Z} - g(\mathbf{Z}_o)\right),\ d_2^t \Sigma_Z d_2\right), \qquad (3.7)$$

where $d_2 = \frac{dE_2}{d\mathbf{Z}}$ is evaluated at the observed values of $\mathbf{Z}$, denoted by $\mathbf{Z}_o$. Similarly,

applying the delta method to E1 = h(Z), we find its asymptotic distribution:

$$\sqrt{n}\left(E_1 - h(\mathbf{Z}_o)\right) \simeq N\!\left(d_1^t\left(E_\pi \mathbf{Z} - h(\mathbf{Z}_o)\right),\ d_1^t \Sigma_Z d_1\right), \qquad (3.8)$$

where $d_1 = \frac{dE_1}{d\mathbf{Z}}$ evaluated at $\mathbf{Z}_o$. Note that $d_1 = \left[\frac{1}{K}\, 1_{JK\times 1},\ 0_{(J-1)K\times 1}\right]$

and $d_2 = \left[\left(\frac{dE_2}{d\bar{Z}^P_{[k]j}}\right), \left(\frac{dE_2}{d\bar{Z}^\theta_{[k]j}}\right)\right]$, where

$$\frac{dE_2}{d\bar{Z}^P_{[k]j}} = \frac{1}{K^2}\sum_{k'=1}^{K} \frac{\bar{Z}^\theta_{[k']j}}{\bar{Z}^P_{[k']j}} - \frac{1}{K}\,\frac{\bar{P}_j\, \bar{Z}^\theta_{[k]j}}{\left(\bar{Z}^P_{[k]j}\right)^2} - \frac{1}{K^2}\sum_{k'=1}^{K} \frac{\bar{Z}^\theta_{[k']J}}{1 - \sum_{j'=1}^{J-1} \bar{Z}^P_{[k']j'}} + \frac{1}{K}\,\frac{\left(1 - \sum_{j'=1}^{J-1} \bar{P}_{j'}\right) \bar{Z}^\theta_{[k]J}}{\left(1 - \sum_{j'=1}^{J-1} \bar{Z}^P_{[k]j'}\right)^2}$$

$$\frac{dE_2}{d\bar{Z}^\theta_{[k]j}} = \frac{1}{K}\,\frac{\bar{P}_j}{\bar{Z}^P_{[k]j}}.$$

Applying the ergodic theorem we have:

$$d_2 - d_1 \stackrel{a.s.}{\longrightarrow} 0 \text{ as } n \to \infty. \qquad (3.9)$$

The estimated variance of the asymptotic distribution of $E_1$ is $\hat{V}_1 = \frac{1}{n} d_1^t \hat{\Sigma}_Z d_1$ and that of $E_2$ is $\hat{V}_2 = \frac{1}{n} d_2^t \hat{\Sigma}_Z d_2$, where $\hat{\Sigma}_Z$ is an estimator of $\Sigma_Z$.

Theorem 3.3.2 Let $V_i$ denote the variance of the asymptotic distribution of $E_i$, $i = 1, 2$. Then $V_1 = V_2$.

Proof: From (3.8) we know $V_1 = d_1^t \Sigma_Z d_1$. Let $d$ indicate the derivative $\frac{dE_2}{d\mathbf{Z}}$ evaluated at $E(\mathbf{Z})$. Applying the delta method to $E_2$, we find that $V_2 = d^t \Sigma_Z d$. Simple algebra shows $d_1 = d$. Hence, $V_1 = V_2$.

Remark: Note that in (3.7), we applied the delta method to E2 by evaluating the

derivative $\frac{dE_2}{d\mathbf{Z}}$ at the observed $\mathbf{Z}$. Thus we have a consistent (as $n \to \infty$) estimator $\hat{V}_2$ of $V_2$ by (3.9).

As the batch size $n$ tends to $\infty$, the batch means are asymptotically uncorrelated under mild conditions. This result was proved by Law and Carson (1979), as stated in the following lemma:

Lemma 3.3.3 (Law and Carson) Let $\{\theta_i\}_{i=1,\ldots,N}$ be a sample of size $N$. Split this sample into $K$ batches of equal size $n$. Define $C_l = \mathrm{Cov}(\theta_t, \theta_{t+l})$ for $l = 1, \ldots, N-1$, and $C_j(n) = \mathrm{Cov}(\bar{\theta}_{[t]}, \bar{\theta}_{[t+j]})$ for $j = 1, \ldots, K-1$, where $\bar{\theta}_{[t]}$ denotes the $t$th batch mean. Let $\rho_j(n) = \frac{C_j(n)}{C_0(n)}$. If $\sum_l C_l < \infty$, then $\rho_j(n) \to 0$ as $n \to \infty$ for all $j = 1, \ldots, K-1$.

Hence, $\Sigma_Z$ approaches a block-diagonal matrix $\Sigma \otimes I_K$, where $\Sigma$ is a symmetric, positive-definite $(2J-1)\times(2J-1)$ matrix and $I_K$ is the identity matrix of rank $K$.

Jones, Presnell and Rosenthal (2002)), and (2) use batch-based methods (e.g., Jones

et al. (2006)). Application of renewal theory requires writing the MCMC algorithm

in a specific format (split-chain method). Since this is a daunting task for complex

models, we use the batch-mean method for our estimation. It follows that the diagonal

elements of $\Sigma$ can be estimated by

$$\hat{\sigma}^P_{j,j} = \frac{n}{K-1}\sum_{k=1}^{K}\left(\bar{Z}^P_{[k]j} - \bar{Z}^P_j\right)^2 \qquad (3.10)$$

$$\hat{\sigma}^\theta_{j,j} = \frac{n}{K-1}\sum_{k=1}^{K}\left(\bar{Z}^\theta_{[k]j} - \bar{Z}^\theta_j\right)^2, \qquad (3.11)$$

K n P P P P σˆP = (Z Z )(Z Z ) (3.12) i,j K 1 [k]i − i [k]j − j − k=1 XK n P P θ θ σˆPθ = (Z Z )(Z Z ) (3.13) i,j K 1 [k]i − i [k]j − j − k=1 XK n θ θ θ θ σˆθ = (Z Z )(Z Z ). (3.14) i,j K 1 [k]i − i [k]j − j − Xk=1 Damerdji (1994) and Jones et al. (2006) show that, under certain assumptions, if n and K both tend to , then these estimates are consistent. Note that these ∞ ˆ 1 t ˆ ˆ 1 t ˆ estimates are also functions of Z: V1 = n d ΣZ d = f1(Z) and V2 = n d2ΣZ d2 = f2(Z). A careful application of Slutsky’s theorem and (3.9) imply that the asymptotic (n tends to ) distributions of nVˆ and nVˆ are the same. We use this fact to construct ∞ 1 2 a parametric bootstrap (see Casella and Berger (2002), Ch. 10; Davison and Hinkley

(1997), Ch. 2) test to compare Vˆ1 and Vˆ2. Recall that by the MCCLT

√n Z E(Z) MV N(0, Σ ) (3.15) − → Z  We approximate the distribution of Vˆ1 = f1(Z) based on generated samples from

(3.15) using the estimates Eˆ(Z) and Σˆ Z . If the observed value of Vˆ2 falls between the 0.025 and 0.975 quantiles of Vˆ1, then we find no evidence of non-stationarity and poor mixing.

Remark: $\hat{V}_1$ is more stable than $\hat{V}_2$, so we perform our test by generating samples for $\hat{V}_1$ instead of generating samples for any function involving $\hat{V}_2$.
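The estimation of $\Sigma$ from batch means and the bootstrap comparison itself have a simple structure once the delta-method derivatives are abstracted away. In the sketch below, `v1_of_z` stands in for $f_1$ and all names are ours:

```python
import numpy as np

def batch_cov(zbar, n):
    """Estimate Sigma as in (3.10)-(3.14): n/(K-1) times the sum of
    outer products of centered batch-mean vectors (rows of `zbar`)."""
    return n * np.cov(zbar, rowvar=False, ddof=1)

def bootstrap_test(v1_of_z, z_mean, sigma_z, v2_obs, n_boot=1000, rng=None):
    """Parametric bootstrap check: simulate Z from the normal
    approximation (3.15), recompute V1-hat for each draw via `v1_of_z`,
    and test whether the observed V2-hat lies inside the central 95%
    interval. True means no evidence of non-stationarity/poor mixing."""
    rng = np.random.default_rng() if rng is None else rng
    draws = rng.multivariate_normal(z_mean, sigma_z, size=n_boot)
    v1 = np.array([v1_of_z(z) for z in draws])
    lo, hi = np.quantile(v1, [0.025, 0.975])
    return bool(lo <= v2_obs <= hi)
```

The default of 1,000 bootstrap draws matches the recommendation given in Section 3.3.3.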

3.3.1 Choosing n, K, and J

The above method requires specification of several tuning parameters: $n$ (the batch size), $K$ (the number of batches), and $J$ (the number of strata). Since our derivations depend on several asymptotic results where $n$ and $K$ both tend to infinity, it

is essential that we choose n and K sufficiently large for the validity of our derived

results. Jones et al. (2006) suggested that one choose n = K = √N, where N is the length of the full chain. From experience with application of our method in several examples, we suggest the choice of K close to 30.

Next, consider the issue of how to stratify the parameter space. For mixture models and regime switching models (Berliner et al. (2000)), we have an idea of different regimes and can exploit this knowledge for stratification. For further details, see the example in Section 3.7. When we do not have such information, we can use sample quantiles or likelihood based selections (see Section 3.6). We can also obtain useful information for stratification from exploratory data analysis (see Section

3.5). If resources permit, we recommend use of preliminary/pilot runs of the MCMC algorithm to establish a useful stratification.

3.3.2 Assessing burn-in

Our method can be applied sequentially to assess burn-in of an MCMC imple- mentation. First, select a window size, say w. Next, reject the first w iterations

and compute the ratio $\hat{V}_2/\hat{V}_1$ based on the remaining output. Then reject the first $2w$ iterations and compute $\hat{V}_2/\hat{V}_1$. Continue this process until the ratio is sufficiently close to 1. The point at which this occurs is a reasonable estimate of the burn-in time.
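The sequential procedure can be sketched as follows (illustrative only; `ratio_fn` stands in for the computation of $\hat{V}_2/\hat{V}_1$ on a chain segment, and the tolerance is an assumption of ours):

```python
import numpy as np

def estimate_burnin(chain, ratio_fn, w=100, tol=0.05):
    """Sequentially drop the first w, 2w, ... iterations; return the
    first cut point at which the ratio V2-hat / V1-hat, computed by
    `ratio_fn` on the remaining segment, is within `tol` of 1."""
    for cut in range(0, chain.size, w):
        if abs(ratio_fn(chain[cut:]) - 1.0) < tol:
            return cut
    return chain.size   # the ratio never stabilized within this chain
```

The returned cut point plays the role of the estimated burn-in time in the examples below.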

38 As an illustration, we consider an AR(1) process of the form

$$X_{t+1} = 0.99\, X_t + \epsilon_{t+1}, \qquad (3.16)$$

where the $\epsilon_{t+1}$'s are independent and identically distributed $N(0, 0.0001^2)$ variates.

The stationary distribution of this chain is $N\!\left(0, \frac{0.0001^2}{1 - 0.99^2}\right)$. We set $X_0 = 10{,}000$, generate 100,000 samples, and partition the samples into 20 batches of size 5,000. We stratify the $X$ space as $X > 0$ and $X \le 0$. We set $w = 100$ and keep the number of batches fixed (at 20) by adjusting the batch sizes accordingly. From Fig. 3.2 we can

see the stabilization after discarding around 1, 100 initial iterations. Hence burn-in

appears to occur after about 1, 100 iterations.


Figure 3.2: Assessing burn-in for the AR(1) process described in (3.16) using a window size of 100.

As a second example, we consider the slope parameter of the linear change point

analysis problem described in Section 3.5. We generated 100, 000 iterates, split the whole chain into 20 batches and adjusted the batch-lengths accordingly. The window size w is set at 100. We computed the ratio of two variances and plotted the ratios

against the number of samples discarded in the calculations. In Fig. 3.3 we see that

in all the cases, the ratios are close to 1, suggesting that approximate convergence to

the stationary distribution is extremely rapid.


Figure 3.3: Assessing burn-in for the slope parameter of the linear change point problem using a window size of 100.

3.3.3 Parametric bootstrap sample size

In this section we discuss the effect of bootstrap sample size on our test. We

considered an AR(1) process of the form

$$X_t = \alpha X_{t-1} + \epsilon_t, \qquad (3.17)$$

where $|\alpha| < 1$ and the $\epsilon_t$'s are a white noise process. We generated a chain of size 120,000 and split it into 30 batches of size 4,000. We chose $\alpha$ and the variance of the noise in such a way that the stationary distribution of the chain is a zero-mean

Gaussian distribution with variance 1. For the results in Fig. 3.4, we set α =0.2. We repeated this process for 50 separate Markov chains generated from (3.17). In each of the 50 cases, the test was performed for four different bootstrap sample sizes: 1, 000,

10, 000, 25, 000, and 50, 000.

From the plots we see that in all the cases, our test supports the claim that the

chains are well-mixed. The black dots in these plots indicate the observed values of

V̂₂. The lengths of the confidence intervals vary little as the bootstrap sample size varies.

We repeated the entire procedure for α = 0.998. All of our tests reject the claim

that the chain is well-mixed. Fig. 3.5 shows the bootstrap intervals. In these plots

all sample values of V̂₂ lie to the right of the bootstrap intervals and off the plot.

The conclusions are the same for 50 separate Markov chains. Again, the lengths of

the confidence intervals are similar for the various bootstrap sample sizes. From the

evidence in these examples and further studies not presented here, we recommend a default of 1,000 bootstrap samples.
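The stability of the interval endpoints seen in Figs. 3.4 and 3.5 is easy to illustrate. The sketch below is ours, not the thesis's implementation: it uses a standard normal as a stand-in for the statistic's null distribution (the thesis obtains that distribution by parametric bootstrap), and simply shows that the central-interval endpoints move very little once the bootstrap sample size reaches a few thousand.

```python
# Effect of the number of bootstrap samples B on the endpoints of a central
# acceptance region.  The N(0,1) null distribution here is a stand-in.
import random

def acceptance_region(draws, level=0.95):
    """Central interval of the sorted bootstrap draws."""
    s = sorted(draws)
    lo = s[int((1.0 - level) / 2.0 * len(s))]
    hi = s[int((1.0 + level) / 2.0 * len(s)) - 1]
    return lo, hi

rng = random.Random(42)
for b in (1_000, 10_000, 25_000, 50_000):
    lo, hi = acceptance_region([rng.gauss(0.0, 1.0) for _ in range(b)])
    print(b, round(hi - lo, 3))   # the interval length changes little with B
```

With a N(0, 1) null, all four lengths hover near the theoretical 2 × 1.96, which mirrors the behavior reported above for the bootstrap intervals.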


Figure 3.4: Acceptance regions constructed for (3.17) when α = 0.2 using various bootstrap sample sizes: (a) 1,000, (b) 10,000, (c) 25,000, and (d) 50,000.


Figure 3.5: Acceptance regions constructed for (3.17) when α = 0.998 using various bootstrap sample sizes: (a) 1,000, (b) 10,000, (c) 25,000, and (d) 50,000.

3.4 Example 1: Bayesian Change Point Analysis: Velocity of Antarctic Glaciers

We return to the Bayesian change point models for surface velocity of an ice stream in east Antarctica discussed in the previous chapter. The flow velocity v at location x is modeled using a geophysical equation from Chapter 11 of Paterson (1994):

v(x) = v₀ + 0.50 A H(x) τ³(x), (3.18)

where v₀ is the sliding velocity, A is a flow parameter, and H denotes ice thickness. The quantity τ approximates driving stress and is modeled as

τ(x) = −ρ g H(x) ds(x)/dx, (3.19)

where ρ is the density of the ice, g is the acceleration due to gravity, and s is surface elevation. They set ρ = 1000 kg/m³ and g = 9.81 m/s². So, conditioning on H(x)τ³(x), it is a simple linear regression problem. But, as discussed earlier, fitting a single

equation based on (3.18) throughout the entire range of the data is not adequate.

Hence, a Bayesian change point analysis was performed.

We consider one example involving three change points at unknown locations. It

was assumed that a change point can occur only at observed locations. The regression

lines are constrained to be continuous at the change points. Fig. 3.6 shows the

histograms of MCMC-sampled locations of the three change points. We see that

the first change point (r₁) has a bimodal posterior distribution with a dip near the 150th observation. The posterior distributions of the second (r₂) and the third (r₃) change points appear to be unimodal. The source of the bimodality in r₁ is seen in Fig. 3.7.

We plot the means of the regression model conditioned on r₁ ≤ 150 and r₁ > 150. We see the trade-off in error reduction in these plots. We note that the unconditional mean of the regression model performs poorly in terms of model fit. In the region where the velocity observations are within 75 m/yr to 125 m/yr and with similar values of

H(x)τ³(x), we have two layers of velocity observations. When the first change point occurs before the 150th observation, the fitted line passes through the cluster of points in the lower layer. Similarly, when r₁ is beyond the 150th observation, the fitted line passes through the cluster of points in the upper layer.

We now assess stationarity and mixing using our method. Fig. 3.8 shows a trace plot and a scatterplot of r1. Inspection of the trace plot does not suggest the bimodal nature, but the scatterplot indicates the two modes. We conduct our test based on an

MCMC sample of size N = 80,000, after discarding the first 20,000 iterates. We split this sample into K = 40 batches, each of size n = 2,000. We stratify the parameter space of the first change point based on exploratory data analysis. Our classes are: r₁ ≤ 52, 52 < r₁ ≤ 120, and r₁ > 120. The acceptance region (0.1778, 0.4472) is obtained using the Monte Carlo bootstrap method described earlier with 1,000 bootstrap samples. The estimated asymptotic variance of E₂ is V̂₂ = 0.2964. Hence, we conclude that we have no evidence of non-convergence nor of poor mixing.

3.5 Example 2: Latent Variable Models: Arsenic Concentrations in Public Water Systems (PWS)

Craigmile et al. (2007) developed a latent variable model for arsenic (As) concentrations in public water systems (PWS) in Arizona. The essential structure of their model has three parts: (1) a data model for the observed log As concentrations (Y), (2) a process model for the true log As concentrations (X), and (3) prior distributions for all unknown mean and precision parameters. The dataset involves

1161 PWS’s. When a concentration measurement falls below the minimum detection


Figure 3.6: Histograms of change point locations.


Figure 3.7: Observed and posterior means of ice-flow velocity, conditioned on r₁ ≤ 150 (left panel) and r₁ > 150 (right panel).


Figure 3.8: Trace plot and scatterplot of the location of the first change point.

level (MDL) of the measuring device, that value rather than the actual observation is reported. The dataset includes such censored observations.

Gaussian models with unknown log-As concentrations as their means are used as data models. Censored observations are assumed to follow truncated normal distributions with upper bounds at the MDL values. For exploratory data analysis, the censored observations are replaced by MDL/2 and are treated as regular observations. Under the data model for an observation Y_i we know E(Y_i | X_i) = X_i. We exploit this fact for our stratification. For each PWS the mean As concentration, Ȳ_i, where i = 1, ..., 1161, was calculated from the observations. The parameter space for the true log-As concentrations (X_i) was stratified as: X_i ≤ Ȳ_i and X_i > Ȳ_i.

For 945 PWS's the estimated V̂₂'s fall within the acceptance region, but for the remaining 216 PWS's the V̂₂'s are outside the acceptance region. Hence, we conclude that the chain does not converge or mix well. Modifications of the models to obtain appropriate convergence are discussed in Craigmile et al. (2007).

Concentration (x_i)   Number of beetles (n_i)   Number killed (r_i)
1.6907                59                        6
1.7242                60                        13
1.7552                62                        18
1.7842                56                        28
1.8113                63                        52
1.8369                59                        52
1.8610                62                        61
1.8839                60                        60

Table 3.1: Beetles dataset for logistic regression

3.6 Example 3: Logistic Regression (WinBUGS manual)

Dobson (1983) analyzed binary dose-response data (Bliss (1935)). Table 3.1 shows

the dataset. The numbers of beetles (r_i) killed after a five-hour exposure to carbon disulphide at 8 different concentrations (x_i) were recorded. A Bayesian hierarchical

model is constructed. We assume r_i ∼ Binomial(n_i, p_i), for i = 1, ..., 8. The p_i's are modeled as log(p_i/(1 − p_i)) = α + βx_i. We used non-informative Gaussian priors: α ∼ N(0, τ₁) and β ∼ N(0, τ₂). The variance parameters are assumed to follow inverse gamma distributions with scale parameter 0.001 and shape parameter 0.001.
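The model is small enough to prototype outside WinBUGS. The sketch below is a simplified stand-in for the analysis, not the thesis's code: it uses flat priors in place of the normal/inverse-gamma hierarchy, and the starting values and proposal scales are illustrative guesses.

```python
# Random-walk Metropolis sketch for the beetles logistic regression,
# under flat priors (a simplification of the hierarchy described above).
import math, random

# Table 3.1: concentration x_i, number of beetles n_i, number killed r_i
DATA = [(1.6907, 59, 6), (1.7242, 60, 13), (1.7552, 62, 18),
        (1.7842, 56, 28), (1.8113, 63, 52), (1.8369, 59, 52),
        (1.8610, 62, 61), (1.8839, 60, 60)]

def log_lik(a, b):
    """Binomial log-likelihood with logit(p_i) = a + b * x_i."""
    ll = 0.0
    for x, n, r in DATA:
        eta = a + b * x
        # r*log(p) + (n-r)*log(1-p) simplifies to r*eta - n*log(1 + e^eta)
        ll += r * eta - n * math.log1p(math.exp(eta))
    return ll

def metropolis(n_iter=20_000, seed=1):
    """Random-walk Metropolis on (alpha, beta); flat priors cancel in the ratio."""
    rng = random.Random(seed)
    a, b = -60.0, 34.0            # start near the reported estimates
    cur = log_lik(a, b)
    draws = []
    for _ in range(n_iter):
        a_prop = a + rng.gauss(0.0, 1.0)
        b_prop = b + rng.gauss(0.0, 0.6)
        new = log_lik(a_prop, b_prop)
        if math.log(rng.random()) < new - cur:
            a, b, cur = a_prop, b_prop, new
        draws.append((a, b))
    return draws

draws = metropolis()
a_mean = sum(d[0] for d in draws) / len(draws)
b_mean = sum(d[1] for d in draws) / len(draws)
```

Because the posterior of (α, β) is strongly correlated along a ridge, an independent random walk on each coordinate mixes slowly, which is consistent with the diagnostics reported for this example below.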

The posterior distribution is analytically intractable. We used WinBUGS for

MCMC simulation. The trace plots for α and β are shown in Fig. 3.9.

These plots indicate high autocorrelation and slow mixing. We ran the chain for

160,000 iterates and discarded the first 10,000 iterates. We divide the remaining 150,000 samples into 25 batches, each of size 6,000. We obtained point estimates of the parameters using the 'gee' R library: α̂ = −60.7175 and β̂ = 34.2703. Based on these estimates, we partition the parameter space of α as: α > α̂ and α ≤ α̂. For β,


Figure 3.9: Trace plots of the simulated logistic regression parameters.

we used: β > β̂ and β ≤ β̂. Using our bootstrap method, we constructed acceptance regions for our test. For α, the acceptance region is (0.391, 1.384), while the observed V̂₂ is 2.6 × 10¹⁹. For β, the acceptance region is (0.098, 0.342), while the observed V̂₂ is 1.3 × 10¹⁹. These results suggest that convergence and/or mixing of the chain is questionable.

3.7 Example 4: Regime-switching model: Pacific Sea Surface Temperature (SST)

Berliner et al. (2000) developed a Bayesian dynamic forecasting model for predic-

tion of the tropical Pacific SST monthly anomalies with seven-month leads. They use

a principal component dimension reduction (EOF’s) with time-varying coefficients.

Their analysis provides (1) qualitative understanding of SST, (2) prediction of regime

status commonly associated with warmer-than-normal (i.e., "El Niño", state "0"),

normal (state "1"), and cooler-than-normal (i.e., "La Niña", state "2") states, and

(3) forecast information and the uncertainties associated with the forecasts.

An MCMC algorithm was run using SST anomalies for the period of January

1971 through February 2008. The model predicts the SST anomalies for September

2008. We consider two parameters for illustrative purposes: (1) the regression coefficient of the first principal component for September 2008 (A_Sept08) and (2) the regime indicator (J_Sept08). Note that J_Sept08 can take values 0, 1, or 2.

We ran the chain for 11,000 iterates and discarded the first 1,000. We split the remaining N = 10,000 iterates into K = 20 batches of equal size n = 500. Our method cannot be applied directly to the regime indicator J_Sept08 if we stratify the parameter space of J_Sept08 as J_Sept08 = 0, J_Sept08 = 1, and J_Sept08 = 2: the variance of J_Sept08 within each stratum would then be zero, and E₁ − E₂ would have a degenerate distribution at 0. Hence we define the strata for A_Sept08 based on J_Sept08 as: A_Sept08(J_Sept08 = 0), A_Sept08(J_Sept08 = 1), and A_Sept08(J_Sept08 = 2). Applying the parametric bootstrap method, the acceptance region is (0.0136, 0.0503) and V̂₂ = 0.0304. We conclude that we

do not have evidence of non-convergence or poor mixing. The color-coded scatterplot

(Fig. 3.10) also supports the fact that all three regimes are visited regularly according

to their respective posterior probabilities.


Figure 3.10: Scatterplot of A^(1)_Sept08, the first EOF coefficient for September 2008.

3.8 Power of the Test

In this section we provide a numerical assessment of the power of our test. Qualitatively, our null hypothesis is that the chain has converged and mixed well. We quantify this by forming a null hypothesis that the estimates of the variances of the asymptotic distributions of two estimators (E₁ and E₂) are stochastically equivalent.

As a simple illustration, we consider the test's performance for a Markov chain that is an autoregressive process of order one (AR(1)). We simulated 1,000 independent AR(1) chains (each of size 100,000) defined by

X_t = 0.995 X_{t−1} + e_t, (3.20)

where e_t ∼ N(0, (0.1)²) and the initial state X₀ is generated from the stationary distribution. The stationary distribution is easily shown to be N(0, 0.01/(1 − 0.995²)). Though it was not necessary, we exclude the first 20,000 iterations from each chain.

Fig. 3.11 shows the trace plot for 80,000 iterations of one of the chains. Table 3.2 compares the results obtained from conducting our test, Geweke's test, and the Gelman-

Rubin test on the 1,000 simulated chains. Geweke’s test is conducted using the first

10% and the last 50% iterates. The Gelman-Rubin test is conducted on each chain by splitting the chain into 8 groups, each of length 9000. We excluded 1,000 iterates between groups to obtain approximate independence. That is, we pretend that these

8 groups are 8 separate MCMC runs. Our test is conducted using 20 batches, each of size 4,000. We stratified the X space based on the conditions X > −2 and X ≤ −2.

Method         Geweke     Gelman-Rubin   New
Geweke         824/1000   824/1000       19/1000
Gelman-Rubin   824/1000   1000/1000      22/1000
New            19/1000    22/1000        22/1000

Table 3.2: Acceptance rates for Geweke’s test, Gelman-Rubin test, and our new test

Table 3.2 summarizes the results of this experiment. The diagonal entries are

the proportions of the cases where each method supports the null hypotheses. Off-

diagonal entries are the proportions of cases where both methods support the null

hypotheses. Since we initialized the chain in its stationary distribution, convergence is

not an issue and the tests should respond to apparent levels of mixing. The Gelman-

Rubin test fails to reject the null, supporting adequate convergence, in all 1,000 cases. The Geweke test arises from a null hypothesis of equality of the means of

the first 10% and the last 50% of the iterates. Hence, the test may be thought of as a

test for convergence rather than mixing. Indeed, neither of these tests were developed

to test mixing. However our method tries to assess mixing as well as stationarity. The

AR(1) process is an example of slow mixing chain when the autocorrelation parameter

is large (near one). For the process in (3.20) the autocorrelation for values separated

by 500 time steps ("lag 500") is 0.08. Fig. 3.12 shows density plots using 80,000 iterates of 10 such chains. The thick black line is the true stationary distribution.

The differences in density estimates show that there is substantial variation in the chains.

Next, we conducted our test using a different stratification, namely, using the conditions X > 1 and X ≤ 1. Our method supported the null hypothesis in 872 of the 1000 cases. Stratifying using X > 0 and X ≤ 0, our method supports the null hypothesis in 999 cases. Note that under the stationary distribution, P(X > 0) =

0.50, P(X > 1) = 0.1589, and P(X > 2) = 0.022. This suggests that when the selection of strata is arbitrary, our test should be performed for several selections of strata if possible. This is especially true when there appears to be one stratum that is dominant (i.e., has very large empirical probability). Our default suggestion is to stratify the parameter space into three groups based on the empirical 10% and 90% quantiles, x₁ and x₂; that is, the three conditions X ≤ x₁, x₁ < X ≤ x₂, and X > x₂. For a new set of 1,000 independent chains, our method supported the null hypothesis in 68% of the cases.

This percentage rate is 89% when we apply our test using 20 batches each of size

15,000. With 60 batches, each of size 5,000, we accepted in none of the 1,000 cases.

Hence, our method is sensitive to both the number and sizes of the batches. Our recommendation is to keep the number of batches around 30.
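The default quantile-based stratification suggested above can be sketched in a few lines. This is our own helper, not the thesis's code; the nearest-rank quantile rule is a simplification.

```python
# Cut sampled values into three strata at the empirical 10% / 90% quantiles.
import random

def quantile(sorted_vals, q):
    """Nearest-rank empirical quantile (a rough stand-in)."""
    idx = min(len(sorted_vals) - 1, int(q * len(sorted_vals)))
    return sorted_vals[idx]

def stratify(chain):
    s = sorted(chain)
    lo, hi = quantile(s, 0.10), quantile(s, 0.90)
    strata = {"low": [], "mid": [], "high": []}
    for x in chain:
        key = "low" if x <= lo else ("mid" if x <= hi else "high")
        strata[key].append(x)
    return lo, hi, strata

# On 10,000 uniform draws the strata hold roughly 10% / 80% / 10% of the chain.
rng = random.Random(0)
lo, hi, strata = stratify([rng.random() for _ in range(10_000)])
```

In practice one would apply `stratify` to the retained MCMC iterates of a single parameter and then compute the batch-wise stratified estimates on the resulting groups.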

3.9 Discussion

We proposed a method for diagnosing convergence to stationarity as well as mixing of a Markov chain based on a single run. Our technique can also be applied to multiple chains by combining the chains or treating each chain as a separate batch.

If the posterior distribution is multimodal, the chain may be trapped in a particu- lar mode. Hence, multiple chains with dispersed initial values are often recommended.

If one has reasonable guesses for the number and locations of the modes, our method enables incorporation of that information through the stratification of the parameter space. For example, if our hierarchical model includes a mixture of normals, we can anticipate a multimodal posterior distribution. In that case we can use the prior modes to stratify the parameter space.

Sometimes the Gelman and Rubin statistic (Gelman and Rubin (1992)) is useful in detecting lack of mixing but our method is more sensitive to this issue. We have illustrated this fact in Section 3.8 using an AR(1) model. The main drawback of the

Gelman and Rubin statistic is that it requires around 10 multiple chains for its proper implementation. But in cases for which each iteration is very expensive, their method is inefficient. It is typically wise to run a single, long chain instead of multiple chains of necessarily smaller lengths.

Our method is easy to apply and computationally efficient. Though it requires specifications of some tuning parameters and stratifications of the parameter space, our test may be more effective than others. The additional overhead of computing our test statistics for a few selections of control parameters and stratifications is relatively minor, especially compared to running multiple chains of sufficient length.

Our method is 'essentially univariate': in the case of a multivariate parameter set-up we have to conduct the test for each parameter separately, and if all of these tests support the null hypotheses then we conclude that there is no evidence of poor mixing or non-convergence. One can also perform this test on −2

times the log of the posterior density to investigate the convergence of a joint posterior

distribution. Then the normalizing constants have to be estimated from the available

samples.

We plan to automate our method so that it can be included in the 'CODA' package.

Often we are interested in the tails of the posterior distribution. Many MCMC

algorithms perform poorly in estimating the tails. So if we do not have any idea

about how to stratify the parameter space, then our default suggestion would be

stratifying the parameter space into three groups using the upper 10% and lower 10%

quantiles.


Figure 3.11: Trace plot and ACF plot of an MCMC chain generated using (3.20).


Figure 3.12: Density estimates from 10 separate AR(1) chains generated using (3.20). The thick black line indicates the true stationary distribution.

CHAPTER 4

BAYESIAN ANALYSIS VIA DIFFUSION MONTE CARLO

4.1 Introduction

While the advent of MCMC has led to major advances in the application of

Bayesian analysis, some problems remain difficult, at least when attempted by standard approaches to developing an MCMC analysis. For example, nonlinearity and/or nonconjugacy of certain components of a large model may render standard Gibbs sampling unusable. Metropolis-Hastings algorithms and Gibbs-Metropolis hybrids can be suggested, though these approaches can be taxing.

In this chapter we explore a strategy for developing a Monte Carlo analysis based on diffusion processes. We make no claim that it is optimal or better than other MCMC strategies; rather, it offers a potentially useful method in some cases for which other strategies are challenging or inefficient. The idea of diffusion MCMC is the same as that of other MCMC procedures: we develop a stochastic process whose stationary distribution is the target posterior. This development is done using diffusions represented as solutions to stochastic differential equations.

Consider a one-dimensional stochastic differential equation (SDE)

dx(t) = b(x)dt + σ(x)dW(t),  t > 0, (4.1)

where b is a drift function, σ > 0 is a volatility function, and dW(t) represents white noise. Specifically, {W(t) : t ≥ 0} is a standard Brownian motion process or a Wiener process. When the volatility function σ(x) is chosen to

be a constant, the resulting process is often called a “Langevin diffusion”. Further,

the initial state of the process, X(0), is a random variable with a specified distribution.

Assuming that X(0) and {W(t) : t ≥ 0} are independent, solutions are obtained (or defined) by integrating both sides of (4.1). To do so, one relies on definitions of

stochastic integrals. For further details on diffusions, see Breiman (1992), Ch. 16, and Rogers and Williams (1994), Ch. 7.

The key fact is that solutions {X(t) : t ≥ 0} of such SDEs are Markov stochastic processes with continuous sample paths. Assuming that X(0) has density function

(w.r.t. Lebesgue measure) p(x, 0), then for each t > 0, X(t) has a density function,

say p(x, t). The temporal evolution of this quantity is our primary concern here.

Under mild conditions, p(x, t) is a solution to a partial differential equation (PDE)

known as the Fokker-Planck Equation (FPE) or Kolmogorov Forward Equation:

∂p/∂t = 0.50 ∂²(σ²p)/∂x² − ∂(bp)/∂x. (4.2)

Our interest here is in stationary solutions. Specifically, if a stationary solution exists,

it must be functionally independent of time, so the partial derivative w.r.t. t is zero.

The resulting equation for a stationary density p is

0.50 ∂²(σ²p)/∂x² = ∂(bp)/∂x. (4.3)

Here we do not “solve” (4.3) for p. Rather, we specify p to be the target posterior

for our Bayesian model and then find appropriate functions b and σ leading to that p as the stationary density. It is critical to note that we can do so even if the normalizer of the posterior cannot be found: the normalizer cancels on both sides of (4.3). Having completed this step, we can simulate the diffusion and proceed as in MCMC. This is typically accomplished by forming a discrete-time approximation to (4.1), that is, a Markov chain approximation to the continuous-time process.
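A minimal sketch of such a discrete-time approximation is the Euler-Maruyama scheme; the step size and the Ornstein-Uhlenbeck test case below are our illustrative choices, not the thesis's settings.

```python
# Euler-Maruyama discretization of dx = b(x) dt + sigma(x) dW:
#   x_{k+1} = x_k + b(x_k)*dt + sigma(x_k)*sqrt(dt)*Z_k,  Z_k ~ N(0, 1).
import math, random

def euler_maruyama(b, sigma, x0, dt, n_steps, seed=0):
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(n_steps):
        x = x + b(x) * dt + sigma(x) * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

# Check on dx = -x dt + dW (Ornstein-Uhlenbeck), whose stationary law is
# N(0, 0.5): the long-run sample variance should be near 0.5.
path = euler_maruyama(lambda x: -x, lambda x: 1.0, 0.0, 0.01, 200_000)
```

The discretization leaves an O(dt) bias in the stationary distribution, which is exactly the issue motivating the correction steps discussed next.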

There are several challenges in implementation. Typically, we are able to construct a

SDE having the target posterior distribution as its stationary distribution, but solving that SDE is not easy. Difficulties arise due to nonlinearity and the discretization of time. Several numerical approximations have been suggested to solve a nonlinear SDE. However, these approximations can lead to stationary distributions that are only approximations to our target posterior distribution. Often, a Metropolis-Hastings correction step is suggested in these situations; for example, see Stramer and Tweedie (1999a), Stramer and Tweedie (1999b), Roberts and Stramer (2001), and Roberts and Stramer (2002). In this chapter we suggest choices for the drift and volatility functions that, when combined with efficient simulation approaches, lead to an acceptable set of MCMC samples without the Metropolis-Hastings correction.

This will lead to comparatively fast and cheap algorithms.

Some aspects of the theory of SDE and FPE used in this chapter are reviewed in Section 4.2. In Section 4.3 we describe the basic formulation of a diffusion with stationary density coinciding with a target posterior in the univariate case. Though univariate Bayesian problems typically would not be done via MCMC, the illustrations are useful and can sometimes be embedded into hybrid samplers involving several

Gibbs steps and several diffusion MCMC steps. Section 4.4 presents analyses for multivariate Bayesian models, including hierarchical models. Section 4.5 indicates

60 how discrete-time approximations for diffusions are formulated. We illustrate the technique using several models in Sections 4.6 and 4.7. Section 4.8 concludes the chapter.

4.2 Background

This is a very brief discussion. For details see Gardiner (1985), Lasota and Mackey (1994), Liptser and Shiryayev (1977), and Protter (2005).

By a solution to the SDE in (4.1), we mean a stochastic process {X(t) : t ≥ 0} satisfying (i) a measurability condition known as nonanticipating (i.e., for each t, X(t) is F_t-measurable, where F_t is the σ-field generated by {W(s) : 0 ≤ s ≤ t}), and (ii) for every t > 0, (4.1) is satisfied with probability 1. Solutions are representable as integrals

X(t) = ∫₀ᵗ b(X(s)) ds + ∫₀ᵗ σ(X(s)) dW(s) + X(0), (4.4)

after giving precise meaning to the second term. The basics are similar to those underlying standard Riemann (or Riemann-Stieltjes) integrals. Partition the time interval [0, t] into k intervals with endpoints t₀ = 0, t₁, t₂, ..., t_k = t and consider the sum

S_k = Σ_{i=1}^{k} X(t_{i−1}) Δ_iW, (4.5)

where Δ_iW = W(t_i) − W(t_{i−1}). By properties of Brownian motion, the Δ_iW are mutually independent, normal random variables with means equal to zero and variances equal to t_i − t_{i−1} (i.e., the lengths of the intervals). To compute S_k, we need to "know" X(t_{i−1}), but this is ensured by the nonanticipating requirement. In fact, that assumption would allow evaluation of X at any point in the interval [t_{i−1}, t_i]. However, we get different final answers to the integral depending on where the evaluations are made. Using the left endpoints of the intervals as in (4.5) leads to Itô integrals. The remaining steps are to take the limit of the S_k as the partition mesh size tends to zero, and to define the integral as the result if we get the same answer for all such partitions.
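The left-endpoint construction can be checked numerically. Taking the integrand to be W itself on [0, 1] (our choice of example), the Itô integral of W dW has the known value (W(1)² − 1)/2, and the left-endpoint sum S_k computed on the same Brownian path should be close to it for a fine partition.

```python
# Left-endpoint (Ito) sums as in (4.5), for the integrand W on [0, 1].
import math, random

def ito_sum(k=100_000, seed=3):
    """Return the left-endpoint sum S_k and the exact Ito value
    (W(1)^2 - 1)/2 for the same simulated Brownian path."""
    rng = random.Random(seed)
    dt = 1.0 / k
    w, s = 0.0, 0.0
    for _ in range(k):
        dw = rng.gauss(0.0, math.sqrt(dt))
        s += w * dw              # integrand evaluated at the left endpoint
        w += dw
    return s, (w * w - 1.0) / 2.0

s_k, exact = ito_sum()           # the two values should nearly coincide
```

Evaluating the integrand at midpoints instead would converge to the Stratonovich value W(1)²/2, which is one way to see why the choice of evaluation point matters.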

Our main focus here involves the temporal evolution of the probability density function, p(x, t), of X(t) assuming the density for the initial state X(0) is specified to be p(x, 0).

Theorem 4.2.1 (FPE) Assume b and σ satisfy the Lipschitz condition:

|b(x) − b(y)| + |σ(x) − σ(y)| ≤ K|x − y|, (4.6)

for some constant K and for all x, y, and that b and σ are appropriately differentiable. Then p(x, t) is the solution to the partial differential equation

∂p/∂t = 0.50 ∂²(σ²p)/∂x² − ∂(bp)/∂x (4.7)

subject to the initial condition p(x, 0) and the assumption that X(0) and {W(t) : t ≥ 0} are independent.

The Lipschitz condition (4.6) ensures that the solution to (4.1) exists and is unique in the sense that two processes satisfying the SDE must have the same probability distributions.

Since we do not solve the FPE in this chapter, boundary conditions on p(x, t) (e.g., integrability, support of p, etc.) are not needed directly. However, properties of both b and σ do play a role in determining such properties of p. Similarly, in general, growth conditions imposed on b and σ lead to bounds on the moments of X(t). We do not consider such conditions here because integrability properties are already determined

by the target posterior. On the other hand, the b and σ we select in a given problem

A general theory, primarily due to W. Feller (Feller (1954)), relating the boundary behavior of p to the structure of b and σ is available. Consider, for example, modeling a precision using a gamma distribution: this is a non-negative parameter, and negative samples are not desirable. If we impose the following conditions on b and σ, then the probability mass will not escape the support region as t → ∞.

Proposition 4.2.2 1. Consider the support set [a₁, ∞) for −∞ < a₁. Assume the Lipschitz condition (4.6) and σ²(a₁) = 0, b(a₁) > 0, and X(0) ≥ a₁ with probability 1.

2. Consider the support set (−∞, a₂] for a₂ < ∞. Assume that the Lipschitz conditions are satisfied and σ²(a₂) = 0 and b(a₂) < 0, and X(0) < a₂ with probability 1.

3. Consider the support set X = [a₁, a₂] for −∞ < a₁ and a₂ < ∞. Assume that the Lipschitz conditions are satisfied and σ²(a₁) = 0, σ²(a₂) = 0, b(a₁) > 0 and b(a₂) < 0, and X(0) ∈ X.

Note that the form of weak convergence here, being true for densities, is actually

“strong” weak convergence (i.e., probabilities converge to those computed under the stationary density for all Borel sets). As we will see in Section 4.4, under mild conditions, our diffusions are ergodic, meaning that expectations under the stationary distributions can be estimated in a strongly consistent manner from a realization of the diffusion (after a burn-in period).

4.3 Univariate Case

Consider a Bayesian model with data Y and unknown quantity X. Our model is

Y | x ∼ g(y | x) and X ∼ π(x). The posterior density is π(x | y) ∝ g(y | x)π(x) by Bayes' Theorem.

Integrating both sides of (4.3), we see that our purposes are satisfied if we can

find a pair of suitable functions b, σ for which

0.50 ∂(σ²p)/∂x = b p. (4.8)

The general solution of (4.8) is

p(x) = (c σ²(x))⁻¹ exp{ ∫₀ˣ 2b(z)/σ²(z) dz }, (4.9)

where the constant c is a normalizer. Since we also want

p(x) ∝ g(y | x)π(x), (4.10)

direct computation yields the following result.

Theorem 4.3.1 Assume that b(x) and σ(x) are Lipschitz and satisfy

b = 0.50σ² ∂log(gπσ²)/∂x = 0.50σ² ( g₍ₓ₎/g + π₍ₓ₎/π + (σ²)₍ₓ₎/σ² ), (4.11)

where for a function f of several variables

f₍z₎ = ∂f/∂z. (4.12)

Then the stationary distribution of the corresponding stochastic process coincides

with the target posterior distribution.

Note that we have a single equation (4.11) upon which to base our selection of

two functions. That is, the problem is not well-posed; alternatively, there may be many

pairs b, σ that work for a fixed Bayesian model. We suggest the following strategy for

finding a usable pair b, σ:

1. Compute the fixed components

g₍ₓ₎/g + π₍ₓ₎/π.

2. Choose σ to achieve both the Lipschitz condition and the needed boundary conditions as indicated in Proposition 4.2.2.

If successful, we are “done” in the sense that a satisfactory diffusion results, though

there may be more efficient (in the sense of convergence rate) selections.

Note that if the elements of the data vector Y = (Y₁, ..., Yₙ) are conditionally independent with Y_j | x ∼ g_j(y_j | x), j = 1, ..., n, then

g₍ₓ₎/g = Σ_{j=1}^{n} (g_j)₍ₓ₎(y_j | x) / g_j(y_j | x). (4.13)

Hence, Theorem 4.3.1 may yield simple Monte Carlo approaches for combining data from highly different experiments and likelihoods. Such problems can be challenging for other methods.
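As a concrete illustration of (4.11) and (4.13) with constant σ, the drift is 0.50σ² times a sum of per-observation score terms plus the prior score. The toy model below (one Gaussian datum and one Poisson datum observing the same x, with a N(0, 10) prior) is our own example, not one from the thesis.

```python
# Drift built from per-observation scores, as in (4.11) with (4.13):
#   b(x) = 0.5*sigma^2 * ( sum_j d/dx log g_j(y_j | x) + d/dx log pi(x) ).
import math

def drift(x, y1=1.2, y2=3, prior_var=10.0, sigma=1.0):
    score = y1 - x                  # d/dx log N(y1 | x, 1)
    score += y2 - math.exp(x)       # d/dx log Poisson(y2 | rate e^x)
    score += -x / prior_var         # d/dx log N(x | 0, prior_var)
    return 0.5 * sigma ** 2 * score
```

The drift is positive below the high-posterior region and negative above it, so the diffusion is pushed toward the posterior mode even though the two likelihoods have quite different forms.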

Also, recall that if the likelihood admits a sufficient statistic, say T(y), for x, the Neyman factorization implies g(y | x) = G(T(y) | x)H(y). Hence, we have the simplification

g₍ₓ₎/g = G₍ₓ₎/G. (4.14)

To clarify the notions, we present a basic example.

Example 4.3.2 Assume that for τ² known, Y | x ∼ N(x, τ²) and X ∼ N(µ, η²). Of course, we know that X | y is normal with easily computed mean and variance (these will appear below). Hence, the details of this example provide clarification of the technique.

Applying (4.11) with the choice σ(x) = σ, some constant, yields

b(x) = 0.50σ² ( y/τ² + µ/η² − x(1/τ² + 1/η²) ). (4.15)

To indicate how the convergence to stationarity works, we begin by actually solving the

SDE. Let α = 0.50σ²(τ⁻²y + η⁻²µ) and β = 0.50σ²(τ⁻² + η⁻²). Then we have

dx = (α − βx)dt + σdW(t). (4.16)

Applying the transformation z = x exp(βt) and the total differential

dz = ((α − βx)dt + σdW(t))e^{βt} + xβe^{βt}dt + h.o.t.,

where the h.o.t. involve terms of order dt² and dt dW(t) (and can therefore be ignored in the solution), we have

z(t) = ∫₀ᵗ αe^{βs} ds + σ ∫₀ᵗ e^{βs} dW(s) + z(0). (4.17)

Note that z(0) = x(0). Transforming back to "x", we have

x(t) = ∫₀ᵗ αe^{−β(t−s)} ds + σ ∫₀ᵗ e^{−β(t−s)} dW(s) + x(0)e^{−βt}. (4.18)

It follows that

E(X(t)) = ∫₀ᵗ αe^{−β(t−s)} ds + E(X(0))e^{−βt} (4.19)
= (α/β)(1 − e^{−βt}) + E(X(0))e^{−βt}. (4.20)

It can also be shown that

var(X(t)) = σ² var( ∫₀ᵗ e^{−β(t−s)} dW(s) ) + var(X(0)e^{−βt}) (4.21)
= σ² ∫₀ᵗ e^{−2β(t−s)} ds + var(X(0)e^{−βt}) (4.22)
= σ²(2β)⁻¹(1 − e^{−2βt}) + var(X(0))e^{−2βt}. (4.23)

Returning to the original parametrization, we conclude that as t → ∞,

E(X(t)) → (1/τ² + 1/η²)⁻¹ (y/τ² + µ/η²) and var(X(t)) → (1/τ² + 1/η²)⁻¹, (4.24)

which are of course the usual posterior mean and variance for this Bayesian model.

Note that for each t, X(t) is also normally distributed, and the convergence rate to the stationary distribution is exponentially fast.
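The limits in (4.24) can also be checked by simulating (4.16) directly. The sketch below uses a simple Euler discretization with illustrative step size and run length (our choices, not prescribed by the example); the discretization adds a small O(dt) bias to the variance.

```python
# Simulate the diffusion (4.16) for the normal-normal model and compare the
# long-run sample moments with the posterior mean and variance in (4.24).
import math, random

def langevin_normal(y, mu, tau2, eta2, sigma=1.0, dt=0.01,
                    n_steps=400_000, seed=7):
    """Euler steps of dx = (alpha - beta*x) dt + sigma dW, with alpha and
    beta defined as below (4.16)."""
    alpha = 0.5 * sigma ** 2 * (y / tau2 + mu / eta2)
    beta = 0.5 * sigma ** 2 * (1.0 / tau2 + 1.0 / eta2)
    rng = random.Random(seed)
    x, xs = 0.0, []
    for _ in range(n_steps):
        x += (alpha - beta * x) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        xs.append(x)
    return xs[n_steps // 10:]    # drop an initial burn-in segment

# With y = 2, mu = 0, tau2 = eta2 = 1, (4.24) gives posterior mean 1.0 and
# posterior variance 0.5; the sample moments should be close to these.
xs = langevin_normal(y=2.0, mu=0.0, tau2=1.0, eta2=1.0)
m = sum(xs) / len(xs)
v = sum((u - m) ** 2 for u in xs) / len(xs)
```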

4.4 Multivariate Settings

The FPE extends to the multivariate case. Let \{X(t) : t \ge 0\} be a d-dimensional stochastic process solving the system of stochastic differential equations

dx(t) = b(x)\,dt + \sigma(x)\,dW(t), \quad (4.25)

with initial state X(0) \sim p(x, 0), where (i) W is a d-dimensional vector whose elements are independent standard Brownian motions and \sigma is a d \times d matrix such that A = \sigma\sigma' is positive definite. Let a_{ij} denote the (i, j) element of A. One often

assumes that the quantities \frac{\partial^2 a_{ij}}{\partial x_k \partial x_l} are locally uniformly Hölder continuous on R^d. The corresponding FPE is

\frac{\partial p}{\partial t} = 0.50 \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j}(a_{ij}\,p) - \sum_i \frac{\partial}{\partial x_i}(b_i\,p). \quad (4.26)

As above, stationary solutions satisfy

0.50 \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j}(a_{ij}\,p) = \sum_i \frac{\partial}{\partial x_i}(b_i\,p). \quad (4.27)

4.4.1 Multivariate Bayesian Models

Consider the model Y \mid x_1, \ldots, x_d \sim g(y \mid x_1, \ldots, x_d) with prior density \pi(x_1, \ldots, x_d) for d unknowns (processes of interest and/or model parameters). The search for

acceptable specifications of b and \sigma is now very complicated in general, in that we have a system of d differential equations but d + d^2 choices to make. To simplify the analysis, assume that \sigma is diagonal (i.e., all off-diagonal elements are set to 0).

Not only does this reduce the number of choices to be made to 2d, it also leads to a

comparatively simple system of equations. That is, (4.27) can be written as

0.50 \sum_i \frac{\partial^2}{\partial x_i^2}(a_{ii}\,p) = \sum_i \frac{\partial}{\partial x_i}(b_i\,p). \quad (4.28)

Hence, we can treat (4.28) as d univariate problems: for i = 1, \ldots, d, we require

0.50 \frac{\partial^2}{\partial x_i^2}(a_{ii}\,p) = \frac{\partial}{\partial x_i}(b_i\,p). \quad (4.29)

Hence, as in (4.11), we consider

b_i = 0.50\sigma_i^2 \left( \frac{\partial \log(g\,\pi\,\sigma_i^2)}{\partial x_i} \right), \quad (4.30)

separately for i = 1, \ldots, d.

Remark It is typical that g and π are products of a variety of terms (e.g., for conditionally independent observations, g is a product; π is often represented as a product of hierarchical components). In such cases,

\frac{\partial \log(g\pi)}{\partial x_i} = \frac{\partial \log(g^{(i)}\pi^{(i)})}{\partial x_i}, \quad (4.31)

where the superscripts indicate that only those components of g and \pi that explicitly

depend on xi are involved in the calculation. This parallels the familiar step in

computing full conditionals in setting up a conventional Gibbs Sampler. Namely, for

each i, one computes the distributions

[x_i \mid \text{all other } x_j] = \frac{g^{(i)}\pi^{(i)}}{\int g^{(i)}\pi^{(i)}\,dx_i}. \quad (4.32)

4.5 Discretization

The main challenge in diffusion based MCMC methods is developing discretized approximations to solutions of the selected SDE. There are several methods available in the literature. Here, we consider two methods: (1) Euler discretization and (2)

Shoji-Ozaki discretization (Shoji and Ozaki (1998)). These methods are discussed in detail in Roberts and Stramer (2002).

Euler discretization approximates the solution of (4.25) using

X_{t+1} = X_t + b(X_t)h + \sigma(X_t)h^{1/2} Z_{t+1}, \quad (4.33)

where h > 0 is the discretization step-size and Zt+1 is a realization from a standard

multivariate Gaussian distribution. The corresponding transition distribution is

X_{t+1} \mid X_t \sim \text{Gaussian}\left( X_t + b(X_t)h,\; h\,\sigma(X_t)'\sigma(X_t) \right). \quad (4.34)

Unfortunately, this discretization often leads to a stationary distribution that is

not the target posterior distribution. However, since this discretization is simple and

computationally efficient, it is often used as a proposal distribution for a Metropolis-

Hastings algorithm.
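In one dimension, using the Euler step (4.33) as the proposal in a Metropolis-Hastings algorithm gives the familiar Metropolis-adjusted Langevin scheme. A hedged sketch; `logpost` and `score` are user-supplied placeholders, and the standard-normal target below is only illustrative:

```python
import math
import random

def euler_mh_step(x, logpost, score, h, sigma=1.0):
    """One Euler proposal (4.33) corrected by a Metropolis-Hastings step.

    logpost(x): log target density up to a constant (a placeholder name);
    score(x): its derivative, so b(x) = 0.5 sigma^2 score(x) as in (4.11)
    with constant sigma.
    """
    def b(z):
        return 0.5 * sigma ** 2 * score(z)

    prop = x + b(x) * h + sigma * math.sqrt(h) * random.gauss(0.0, 1.0)

    def log_q(to, frm):
        # log of the Gaussian transition density (4.34); the normalizing
        # constant cancels in the acceptance ratio
        return -0.5 * (to - (frm + b(frm) * h)) ** 2 / (h * sigma ** 2)

    log_accept = (logpost(prop) - logpost(x)) + (log_q(x, prop) - log_q(prop, x))
    return prop if math.log(random.random()) < log_accept else x

# Illustrative target: a standard normal "posterior".
random.seed(1)
x, draws = 0.0, []
for _ in range(5000):
    x = euler_mh_step(x, lambda z: -0.5 * z * z, lambda z: -z, h=0.5)
    draws.append(x)
```

The correction step restores the exact stationary distribution that the raw Euler chain would miss.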

Shoji and Ozaki (1998) proposed an alternative. In their method the non-linear

drift function b(Xt) in (4.25) is approximated using Taylor series. For a small time

interval rh t< (r + 1)h, where r =0, 1,..., consider ≤

b(X_t) \approx b(X_{rh}) + J(X_{rh})(X_t - X_{rh}), \quad (4.35)

where J(X) is the Jacobian matrix consisting of the first-order derivatives of b. In

each of the intervals rh t< (r + 1)h, σ(X) is held constant and approximated by ≤

69 σ(Xrh). These approximations lead to locally linear stochastic differential equations

which can be solved easily. Thus, we obtain a time-homogeneous Markov chain with

transition distribution:

X_{t+1} \mid X_t \sim \text{Gaussian}(\mu_h(X_t), \Sigma_h(X_t)), \quad (4.36)

where

\mu_h = X_t + J^{-1}(X_t)[\exp(J(X_t)h) - I]\,b(X_t) \quad (4.37)

and Σh(Xt) is a solution to the system of equations

J(X_t)\Sigma_h(X_t) + \Sigma_h(X_t)J'(X_t) = \exp(J(X_t)h)\,\sigma'(X_t)\sigma(X_t)\exp(J'(X_t)h) - \sigma'(X_t)\sigma(X_t). \quad (4.38)

Note that these equations have a unique solution provided no two eigenvalues of J sum to zero; in particular, this holds when J is negative definite.

To implement the Shoji-Ozaki method, we need to compute the exponential of a d \times d matrix. When d is large, this computation may be very expensive. In these situations we suggest a hybrid MCMC sampler involving Gibbs steps and diffusion

MCMC steps.
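In the scalar case the Shoji-Ozaki transition (4.36)-(4.38) reduces to simple closed forms. A sketch, assuming constant volatility σ on each interval; the linear-drift example values are illustrative:

```python
import math
import random

def shoji_ozaki_moments(x, b, dbdx, sigma, h):
    """Scalar Shoji-Ozaki transition moments: mean from (4.37) and, from the
    scalar version of (4.38), variance sigma^2 (exp(2Jh) - 1) / (2J)."""
    J = dbdx(x)
    mean = x + (math.exp(J * h) - 1.0) / J * b(x)
    var = sigma ** 2 * (math.exp(2.0 * J * h) - 1.0) / (2.0 * J)
    return mean, var

def shoji_ozaki_step(x, b, dbdx, sigma, h):
    """Draw X_{t+1} | X_t from the Gaussian transition (4.36)."""
    mean, var = shoji_ozaki_moments(x, b, dbdx, sigma, h)
    return random.gauss(mean, math.sqrt(var))

# For a linear drift b(x) = alpha - beta x (the conjugate normal model of
# Section 4.3), the scheme is exact: its moments match the OU solution (4.18).
alpha, beta, sigma, h = 0.5, 0.625, 1.0, 0.1
mean, var = shoji_ozaki_moments(1.0, lambda x: alpha - beta * x,
                                lambda x: -beta, sigma, h)
```

Exactness for linear drift is one reason the scheme outperforms the Euler step on near-Gaussian targets.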

4.6 Univariate Examples

4.6.1 Normal-Cauchy Model

Assume that Y \mid X \sim N(X, \tau^2), \tau^2 known, and the prior for X is a Cauchy distribution with median \mu and scale parameter one. Setting \sigma to be a constant and

applying (4.11), we have

b(x) = 0.5\sigma^2 \left( \frac{y}{\tau^2} + \frac{2\mu}{1 + (x - \mu)^2} - x\left( \frac{1}{\tau^2} + \frac{2}{1 + (x - \mu)^2} \right) \right). \quad (4.39)

The Lipschitz condition is satisfied. This is suggestive of the potential of diffusion

MCMC when non-conjugate models are considered.
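The drift (4.39) is half σ² times the score of the posterior, so it vanishes at a posterior mode. A small numerical check; the values of y, τ², µ and the search grid are illustrative choices:

```python
import math

# Drift (4.39) for the normal-Cauchy model; y, tau2, mu, sigma2 and the grid
# below are illustrative choices, not values from the text.
y, tau2, mu, sigma2 = 2.0, 1.0, 0.0, 1.0

def b(x):
    return 0.5 * sigma2 * ((y - x) / tau2 - 2.0 * (x - mu) / (1.0 + (x - mu) ** 2))

def log_post(x):
    # log posterior up to a constant: normal likelihood times Cauchy prior
    return -0.5 * (y - x) ** 2 / tau2 - math.log(1.0 + (x - mu) ** 2)

# The drift is proportional to the score, so it vanishes at a posterior mode:
grid = [i / 1000.0 for i in range(-5000, 5001)]
mode = max(grid, key=log_post)
```

For these illustrative values the posterior is unimodal with mode x = 1, and b(1) = 0.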

4.6.2 Gaussian Model with Nonlinear Mean Function

Assume that Y \mid x \sim N(f(x), \tau^2) for a differentiable function f(x) and known \tau^2. The prior model is X \sim N(\mu, \eta^2). For constant \sigma, (4.11) yields

b(x) = 0.50\sigma^2 \left( \frac{y - f(x)}{\tau^2}\frac{df}{dx} - \frac{x - \mu}{\eta^2} \right). \quad (4.40)

Note that b may be unruly depending on f. If necessary, we can allow \sigma^2 to depend

on x and choose it to control b. Specifically, we consider

b(x) = 0.50\sigma^2(x) \left( \frac{y - f(x)}{\tau^2}\frac{df}{dx} - \frac{x - \mu}{\eta^2} + \sigma^{-2}(x)\frac{\partial \sigma^2(x)}{\partial x} \right). \quad (4.41)

If we select

\sigma^2(x) = \exp(-0.50(f(x) - y)^2), \quad (4.42)

then

b(x) = 0.50\sigma^2(x) \left( \left( \frac{1}{\tau^2} + 1 \right)(y - f(x))\frac{df}{dx} - \frac{x - \mu}{\eta^2} \right). \quad (4.43)

These selections of b and \sigma^2 are Lipschitz.

4.6.3 Computational Example: Nonlinear Model

The following expression is a simplified model for the surface topography of an ice sheet as a function of location z:

f(z) = \alpha + K\left( L^{4/3} - (L - z)^{4/3} \right)^{3/8}, \quad (4.44)

where \alpha, K, and L are parameters. See Paterson (1994, Ch. 11) and Berliner et al.

(2008) for discussion.

We assume that our data are conditionally independent observations, with measurement errors, of the surface altitude. Our likelihood function g(Y; \alpha, K, L, \omega) is assumed to be

\prod_{i=1}^n \sqrt{\frac{\omega}{2\pi}} \exp\left( -0.5\,\omega\,(Y_i - f(z_i))^2 \right), \quad (4.45)

where \omega is a precision parameter and z_i, i = 1, \ldots, n, are observation locations. In

our computations, we consider n = 500 equally spaced locations ranging from 0 m

to 2500 m. Let z be the vector of these values. We simulated a dataset Y using the

values α = 2 m, K = 0.5, L = 3000 m, and ω = 0.25. Fig. 4.1 shows the data Y

along with the mean function f(z).
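Data of the kind shown in Fig. 4.1 can be regenerated from (4.44)-(4.45) with the stated values α = 2, K = 0.5, L = 3000, and ω = 0.25; a sketch (the random seed is an arbitrary choice):

```python
import math
import random

def f(z, alpha=2.0, K=0.5, L=3000.0):
    """Ice-sheet surface topography (4.44)."""
    return alpha + K * (L ** (4.0 / 3.0) - (L - z) ** (4.0 / 3.0)) ** (3.0 / 8.0)

# n = 500 equally spaced locations from 0 m to 2500 m; precision omega = 0.25
# gives a measurement standard deviation of 2 m, as in (4.45).
n, omega = 500, 0.25
z = [2500.0 * i / (n - 1) for i in range(n)]
random.seed(3)
Y = [f(zi) + random.gauss(0.0, 1.0 / math.sqrt(omega)) for zi in z]
```

With these values the profile rises monotonically from f(0) = 2 m to roughly 28 m at z = 2500 m, matching the scale of Fig. 4.1.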

We suppose that only L is unknown and must be larger than the largest value of

the z_i; namely, the location of the origin of the ice sheet. Specifically, we truncated a normal distribution with mean zero and variance 10,000 to the range L > 2500.

A diffusion based MCMC algorithm is constructed for L.

To illustrate diffusion MCMC using the Shoji-Ozaki discretization, we defined the drift function to be

b(L) = 0.5\sigma^2(L) \left( \frac{d \log(g(2, 0.5, L, 0.25)\,\pi(L)\,\sigma^2(L))}{dL} \right), \quad (4.46)

where \sigma^2(L) is defined to be

(Y - f(z))'(Y - f(z))\,I(L > \max(z)).

The transition distribution for generating L_{t+1} at the (t+1)^{th} step conditioning on

Lt is:

\text{Gaussian}(\mu_L^{t+1}, V_L^{t+1}),

where

\mu_L^{t+1} = L_t + J^{-1}(L_t)[\exp(J(L_t)h) - 1]\,b(L_t), \quad (4.47)

V_L^{t+1} = 0.5\,\sigma^2(L_t)\,J^{-1}(L_t)[\exp(2J(L_t)h) - 1], \quad (4.48)

where J(L_t) = \frac{db(L)}{dL}\big|_{L = L_t} and h = 0.0001. For comparison, we also ran an intense griddy-Gibbs algorithm. The algorithm was run for an extremely long time, leading to an accurate estimate of the posterior.

Fig. 4.2 displays the estimated cumulative distribution functions of L obtained from the griddy-Gibbs and diffusion MCMC procedures. We see that the diffusion-based algorithm provides nearly identical results to the griddy-Gibbs approach.

For further comparisons, we computed the Hellinger distance (Hellinger (1909)), \rho, between the estimated densities obtained from the two algorithms. Recall that \rho is

the L2 distance between two probability measures P and Q, assumed to be absolutely

continuous with respect to another probability measure µ; specifically,

\rho^2 = \frac{1}{2}\int \left( \sqrt{\frac{dP}{d\mu}} - \sqrt{\frac{dQ}{d\mu}} \right)^2 d\mu. \quad (4.49)

Here \frac{dP}{d\mu} and \frac{dQ}{d\mu} are Radon-Nikodym derivatives of P and Q with respect to \mu, and their square roots are both square integrable. Often, \rho is computed using

\rho = \sqrt{1 - BD(P, Q)}, \quad (4.50)

where BD(P, Q) is the Bhattacharyya coefficient, defined for two discrete probability distributions P and Q over a domain \mathcal{X} by

BD(P, Q) = \sum_{x \in \mathcal{X}} \sqrt{P(x)Q(x)}. \quad (4.51)

Note that \rho takes values in [0, 1]; smaller values of \rho indicate closeness of the two probability measures. For our example, the computed value of \rho is 0.0002.



Figure 4.1: Simulated data from a Gaussian model with mean function (4.44)


Figure 4.2: Estimated cumulative distribution functions for L based on griddy-Gibbs and diffusion MCMC implementations.

4.6.4 Mixture Models

Suppose Y_1, \ldots, Y_n \mid x are independent and identically distributed according to a finite mixture of normal densities given by

g(y \mid x) = \prod_{j=1}^n \left( \sum_{i=1}^m \frac{q_i}{\sqrt{2\pi\tau_i^2}} \exp(-0.50(y_j - x)^2/\tau_i^2) \right). \quad (4.52)

Applying (4.13), we obtain

\frac{g'}{g}(x) = \sum_{j=1}^n (y_j - x)\,r(y_j, x),

where

r(y_j, x) = \frac{\sum_{i=1}^m q_i \tau_i^{-3} \exp(-0.50(y_j - x)^2/\tau_i^2)}{\sum_{i=1}^m q_i \tau_i^{-1} \exp(-0.50(y_j - x)^2/\tau_i^2)}.

For a normal prior X \sim N(\mu, \eta^2) and constant \sigma, (4.11) yields

b(x) = 0.50\sigma^2 \left( \sum_{j=1}^n y_j\, r(y_j, x) + \frac{\mu}{\eta^2} - x\left( \sum_{j=1}^n r(y_j, x) + \frac{1}{\eta^2} \right) \right). \quad (4.53)

This choice is Lipschitz. Similar steps can be used to treat mixture priors.
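The ratio r(y_j, x) and the drift entering (4.53) can be coded directly. A sketch; the mixture weights and variances below are illustrative, and with a single component the drift reduces to the conjugate normal case of Section 4.3:

```python
import math

def r(yj, x, q, tau2):
    """Ratio r(y_j, x) entering the mixture drift (4.53); q are mixture
    weights and tau2 the component variances tau_i^2."""
    num = sum(qi * t2 ** -1.5 * math.exp(-0.5 * (yj - x) ** 2 / t2)
              for qi, t2 in zip(q, tau2))
    den = sum(qi * t2 ** -0.5 * math.exp(-0.5 * (yj - x) ** 2 / t2)
              for qi, t2 in zip(q, tau2))
    return num / den

def drift(x, ys, q, tau2, mu, eta2, sigma2=1.0):
    """Drift (4.53) for a normal N(mu, eta^2) prior and constant sigma."""
    rs = [r(yj, x, q, tau2) for yj in ys]
    return 0.5 * sigma2 * (sum(yj * rj for yj, rj in zip(ys, rs))
                           + mu / eta2
                           - x * (sum(rs) + 1.0 / eta2))
```

With m = 1 and component variance τ², r(y_j, x) = 1/τ² for every observation, recovering the linear drift of the normal-normal model.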

4.6.5 Conditional Autoregressive (CAR) Model

We consider a conditional autoregressive model for a vector Y. That is, we assume that Y follows a zero-mean multivariate Gaussian distribution with covariance matrix

(I - \gamma A)^{-1}, where \gamma is a spatial dependence parameter and A is the adjacency matrix. See Cressie (1993, Ch. 6) for details of CAR models.

A vector of length 1001 was generated from this distribution with \gamma = 0.2, and A defined by the nearest-neighbor structure

A_{ij} = 1 \text{ if } |i - j| = 1, \qquad A_{ij} = 0 \text{ otherwise}. \quad (4.54)

For the covariance matrix to be non-negative definite, we need \gamma to be in the interval (\lambda_1, \lambda_2), where \lambda_1 < 0 is the inverse of the minimum eigenvalue of A and \lambda_2 > 0 is the inverse of the maximum eigenvalue of A. Fig. 4.3 shows the plot of the negative log-likelihood for the simulated data. Here \lambda_1 = -0.5 and \lambda_2 = 0.5. Note that the restricted maximum likelihood estimate (RMLE) of \gamma is 0.19. Next, we assumed a

uniform prior on γ over the interval (λ1,λ2). We implemented a diffusion MCMC

algorithm using

b(\gamma) = \frac{1}{2} Y'(I - \gamma A)Y \left( -\frac{1}{2}\sum_{i=1}^{1001} \frac{a_i}{1 - \gamma a_i} + \frac{1}{2} Y'AY \right) - \frac{1}{2} Y'AY, \quad (4.55)

where the a_i's are eigenvalues of A. We set

\sigma^2(\gamma) = \begin{cases} 0 & \text{for } \gamma \le \lambda_1 \\ Y'(I - \gamma A)Y & \text{for } \gamma \in (\lambda_1, \lambda_2) \\ 0 & \text{for } \gamma \ge \lambda_2. \end{cases} \quad (4.56)

Note that these choices of b(\gamma) and \sigma^2(\gamma) satisfy the conditions of Proposition 4.2.2. This guarantees

that the generated samples will be in the proper domain. We used the Shoji-Ozaki discretization,

where the sampling distribution for γi+1 given γi is Gaussian with mean µi+1 and

variance vi+1 given by

\mu_{i+1} = \gamma_i + J^{-1}(\exp(Jh) - 1)\,b(\gamma_i) \quad (4.57)

and

v_{i+1} = \frac{1}{2}\sigma^2(\gamma_i)\,J^{-1}(\exp(2Jh) - 1), \quad (4.58)

where J is the first order derivative of b(γ) with respect to γ evaluated at γi. We set

the discretization time step to 0.0001. The estimated posterior mean of γ is 0.1906

and the estimated standard deviation is 0.0268. Hence, we see that our algorithm provides a reasonable set of samples for \gamma. We again implemented an intense

griddy-Gibbs algorithm. The resulting estimated posterior distribution functions of

γ are plotted in Fig. 4.4. Note that these estimates are virtually identical. Finally, the estimated Hellinger distance between these densities is 0.0004.
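The admissible interval (λ₁, λ₂) for γ can be computed from the extreme eigenvalues of A. For the nearest-neighbor (path-graph) A of (4.54), the eigenvalues have a closed form, 2cos(kπ/(n+1)), a standard linear-algebra result assumed here rather than stated in the text:

```python
import math

def gamma_range(n):
    """Admissible (lambda1, lambda2) for gamma in the nearest-neighbor CAR model.

    Assumption (standard result, not from the text): the adjacency matrix of
    a path of n sites has eigenvalues 2 cos(k pi / (n + 1)), k = 1, ..., n,
    so the extremes are +/- 2 cos(pi / (n + 1)).
    """
    eigs = [2.0 * math.cos(k * math.pi / (n + 1)) for k in range(1, n + 1)]
    return 1.0 / min(eigs), 1.0 / max(eigs)

lam1, lam2 = gamma_range(1001)
```

For n = 1001 this gives λ₁ ≈ −0.5 and λ₂ ≈ 0.5, matching the values reported in the text.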


Figure 4.3: Negative log-likelihood function of the spatial dependence parameter \gamma for the CAR model described in Section 4.6.5

4.7 Multivariate Examples

4.7.1 Hierarchical Models

Consider a Bayesian model with likelihood function

Y \mid x_1, \ldots, x_d \sim g(y \mid x_1, \ldots, x_d) \quad (4.59)

and hierarchical prior

\pi(x_1, \ldots, x_d) = \pi^1(x_1 \mid x_2, \ldots, x_d)\,\pi^2(x_2 \mid x_3, \ldots, x_d)\cdots\pi^d(x_d). \quad (4.60)


Figure 4.4: Estimated posterior cumulative distribution functions of the spatial dependence parameter \gamma for the CAR model described in Section 4.6.5

For a function f(x_1, \ldots, x_d), we define

f_{(i)} = \frac{\partial f}{\partial x_i}, \quad i = 1, \ldots, d.

Hence, (4.30) can be written as

b_1 = 0.50\sigma_1^2 \left( \frac{g_{(1)}}{g} + \frac{\pi^1_{(1)}}{\pi^1} + \frac{\sigma^2_{1(1)}}{\sigma_1^2} \right) \quad (4.61)

b_2 = 0.50\sigma_2^2 \left( \frac{g_{(2)}}{g} + \left( \frac{\pi^1_{(2)}}{\pi^1} + \frac{\pi^2_{(2)}}{\pi^2} \right) + \frac{\sigma^2_{2(2)}}{\sigma_2^2} \right) \quad (4.62)

\vdots

b_d = 0.50\sigma_d^2 \left( \frac{g_{(d)}}{g} + \sum_{i=1}^d \frac{\pi^i_{(d)}}{\pi^i} + \frac{\sigma^2_{d(d)}}{\sigma_d^2} \right). \quad (4.63)

4.7.2 Normal-Gamma Model

Assume that Y_1, \ldots, Y_n \mid x_1, x_2 are an iid sample from a Gaussian(x_1, 1/x_2) distribution, and that X_1, X_2 are independent a priori, X_1 \sim N(\mu, \eta^2), and X_2 \sim \Gamma(\alpha, \beta), a gamma distribution. The system of equations (4.61)-(4.63) reduces to

b_1 = 0.50\sigma_1^2 \left( x_2 \sum_{i=1}^n (y_i - x_1) - \frac{x_1 - \mu}{\eta^2} + \frac{\sigma^2_{1(1)}}{\sigma_1^2} \right) \quad (4.64)

b_2 = 0.50\sigma_2^2 \left( \frac{n/2 + \alpha - 1}{x_2} - 0.50\sum_{i=1}^n (y_i - x_1)^2 - \beta^{-1} + \frac{\sigma^2_{2(2)}}{\sigma_2^2} \right). \quad (4.65)

Note that contributions of the form x_2^{-1} in b_2 create difficulties for the Lipschitz conditions. However, this can be readily adjusted. For example, consider the selections

\sigma_1^2 = x_2 \quad \text{and} \quad \sigma_2^2 = x_2^2 \exp(-\gamma x_2), \quad (4.66)

γ > 0. Computation yields

b_1 = \frac{1}{2} x_2 \left( x_2 \sum_{i=1}^n (y_i - x_1) - \frac{x_1 - \mu}{\eta^2} + \frac{1}{x_2} \right)

b_2 = \frac{x_2}{2} \exp(-\gamma x_2) \left( \frac{n}{2} + \alpha + 1 - \frac{x_2}{2}\sum_{i=1}^n (y_i - x_1)^2 - \frac{x_2}{\beta} - \gamma x_2 \right),

which are Lipschitz. The Jacobian matrix is:

\begin{pmatrix} \frac{\partial b_1}{\partial x_1} & \frac{\partial b_1}{\partial x_2} \\ \frac{\partial b_2}{\partial x_1} & \frac{\partial b_2}{\partial x_2} \end{pmatrix}.

Finally, note that \sigma_2^2 = 0 at x_2 = 0 and b_2 > 0 at x_2 = 0; hence 0 \le X_2 < \infty with probability one by Proposition 4.2.2.

To illustrate this model, we generated a dataset from a Gaussian distribution with

mean 2 and variance 1. We set the hyperparameters as follows: µ = 0, η2 = 10000,

α =0.001, and β =0.001. The discretization step h is chosen to be 0.0001 and we set

Parameter | E_DM(· | Y) | SD_DM(· | Y) | E_GS(· | Y) | SD_GS(· | Y)
x_1       | 2.0012      | 0.0108       | 2.0012      | 0.0109
x_2       | 0.8317      | 0.0117       | 0.8326      | 0.0129

Table 4.1: Comparing posterior summaries from diffusion MCMC and the Gibbs sampler for the normal-gamma model: E_DM(· | Y) = posterior mean based on diffusion MCMC; SD_DM(· | Y) = posterior standard deviation based on diffusion MCMC; E_GS(· | Y) = posterior mean based on the Gibbs sampler; SD_GS(· | Y) = posterior standard deviation based on the Gibbs sampler.

γ = 1. We generated an MCMC sample of size 100, 000 using the diffusion MCMC

algorithm with the Shoji-Ozaki discretization. This is a conjugate model and we can

easily find the full conditional distributions for x1 and x2. The Gibbs sampler was

used to generate an MCMC sample of the same size.

Table 4.1 compares the estimated posterior means and posterior standard deviations of these parameters based on the two sampling algorithms. We can see that the posterior means from diffusion MCMC are close to those of the Gibbs sampler. Posterior standard deviations from diffusion MCMC are within 1% of those of the Gibbs sampler.

Fig. 4.5 compares the quantiles of the samples generated from the two sampling algorithms. We plotted the quantiles from diffusion MCMC along the X-axis and those of the Gibbs sampler along the Y-axis. Notice the departures at the tails, though the mid-quantiles are very close. Figures 4.6 and 4.7 compare the estimated cumulative distribution functions (cdf) for x_1 and x_2. We see that for x_1, the two cdf's overlap

closely, but some departures are noted for x_2. We also computed the Hellinger distances (\rho) using the estimated cdf's. For x_1, \rho is 0.0001, and for x_2 its value is 0.0017.
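The Gibbs comparison for this model rests on the conjugate full conditionals of x₁ and x₂. A sketch, assuming the Γ(α, β) prior is read with scale β (rate 1/β), matching the −β⁻¹ term in (4.65); the conditional forms below are standard conjugate results, not spelled out in the text:

```python
import random

# Hyperparameter values as in the text: mu = 0, eta^2 = 10000, alpha = beta = 0.001.
# Assumption: Gamma(alpha, beta) is parametrized by shape alpha and scale beta.
mu, eta2, alpha, beta = 0.0, 10000.0, 0.001, 0.001

def x1_conditional(x2, ys):
    """Mean and variance of the full conditional x1 | x2, y (conjugate normal)."""
    n = len(ys)
    prec = n * x2 + 1.0 / eta2
    return (x2 * sum(ys) + mu / eta2) / prec, 1.0 / prec

def x2_conditional(x1, ys):
    """Shape and rate of the full conditional x2 | x1, y (conjugate gamma)."""
    n = len(ys)
    return alpha + 0.5 * n, 1.0 / beta + 0.5 * sum((y - x1) ** 2 for y in ys)

def gibbs(ys, n_iter=1000, seed=4):
    """Plain Gibbs sampler alternating the two full conditionals."""
    random.seed(seed)
    x1, x2, out = 0.0, 1.0, []
    for _ in range(n_iter):
        m, v = x1_conditional(x2, ys)
        x1 = random.gauss(m, v ** 0.5)
        shape, rate = x2_conditional(x1, ys)
        x2 = random.gammavariate(shape, 1.0 / rate)  # gammavariate takes scale
        out.append((x1, x2))
    return out
```

Running `gibbs` on a dataset centered at 2 gives posterior draws for x₁ concentrated near the sample mean, as in Table 4.1.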



Figure 4.5: Comparing posterior quantiles for a normal-gamma model based on samples from diffusion MCMC and Gibbs sampler: (a) for mean x_1 (b) for precision x_2


Figure 4.6: Comparing cumulative distribution functions of mean parameter x1 for a normal-gamma model based on samples from diffusion MCMC and Gibbs sampler


Figure 4.7: Comparing cumulative distribution functions of precision parameter x2 for a normal-gamma model based on samples from diffusion MCMC and Gibbs sampler

4.7.3 Conditional Autoregressive Model with Multivariate Parameters

We consider a CAR model with mean function defined by the linear function

\beta_0 + \beta_1 X, where X is a covariate. The inverse of the covariance matrix is \omega(I - \gamma A), where \omega is a precision parameter, \gamma is the spatial dependence parameter, and A is the neighborhood matrix defined in (4.54). The likelihood, L(Y \mid \beta_0, \beta_1, \omega, \gamma), is

(2\pi)^{-1001/2}\,|\omega(I - \gamma A)|^{1/2} \exp\left( -0.5\,\omega\,(Y - \beta_0 \mathbf{1}_{1001} - \beta_1 X)'(I - \gamma A)(Y - \beta_0 \mathbf{1}_{1001} - \beta_1 X) \right), \quad (4.67)

where \mathbf{1}_{1001} is a 1001 \times 1 vector of 1's. We assume independent normal priors with means zero and variances 10^4 for \beta_0 and \beta_1. We assumed \omega has a gamma prior distribution with shape parameter 0.001 and scale parameter 0.001. We also assumed that \gamma has the uniform prior distribution described above.

We embedded the diffusion based MCMC step for γ in a Gibbs sampler, thereby constructing the following hybrid MCMC algorithm:

1. Sample (\beta_0, \beta_1)' from Gaussian(\mu_\beta, V_\beta), where

V_\beta = \left( X'(I - \gamma A)X + \frac{I}{10^4} \right)^{-1}

and

\mu_\beta = V_\beta X'(I - \gamma A)Y.

2. Sample \omega from a gamma distribution with shape parameter 1001/2 + 0.001 and scale parameter

\left( \frac{1}{0.001} + 0.5\,(Y - \beta_0 \mathbf{1}_{1001} - \beta_1 X)'(I - \gamma A)(Y - \beta_0 \mathbf{1}_{1001} - \beta_1 X) \right)^{-1}.

3. Sample \gamma using the distribution described in (4.57)-(4.58). One may want to repeat this step a few times to ensure burn-in, but in this example the convergence is immediate.

We simulate a dataset Y of length 1001 using \beta_0 = 2, \beta_1 = 3, \omega = 1/16, and \gamma = 0.2. Our algorithm is run for 10,000 iterations and we discard the first 2,000

samples as burn-in. Trace-plots are shown in Fig. 4.8. These plots suggest satisfactory

convergence and mixing.

Fig. 4.9 displays the posterior density estimates of the four parameters of the

CAR model described in (4.67). Table 4.2 summarizes the posterior means and posterior standard deviations of these parameters along with the true values. For comparison purposes we ran a griddy Gibbs sampler for this model. Table 4.2 also lists the posterior summaries from this griddy Gibbs sampler. We can see that these summaries are sufficiently close. In Fig. 4.10 we plot the cumulative distribution function (cdf) estimates for the four parameters based on the samples from these two algorithms.

The blue lines indicate the cdf's from diffusion MCMC and the red lines those of the griddy Gibbs sampler. These two cdf's are hardly distinguishable. We also tabulate the Hellinger distances in Table 4.3; these values are reasonably small. From these comparisons we can conclude that our diffusion MCMC algorithm performs well enough to estimate the posterior distribution arising from this CAR model. Our hybrid algorithm is very fast: each iteration takes 0.66 seconds, whereas for the griddy Gibbs sampler each iteration takes 27.94 seconds.
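The hybrid scheme of steps 1-3 can be organized as below. This is only a skeleton: `draw_beta`, `draw_omega`, and `diffusion_step_gamma` are hypothetical placeholder names for the three updates, to be supplied by the user, not implementations from the text:

```python
import random

def hybrid_sampler(draw_beta, draw_omega, diffusion_step_gamma,
                   init, n_iter=10000, burn_in=2000, seed=5):
    """Skeleton of the hybrid Gibbs / diffusion MCMC of Section 4.7.3.

    draw_beta, draw_omega: closed-form full-conditional samplers (steps 1-2);
    diffusion_step_gamma: one Shoji-Ozaki update for gamma (step 3).
    All three are user-supplied placeholders.
    """
    random.seed(seed)
    beta, omega, gamma = init
    kept = []
    for it in range(n_iter):
        beta = draw_beta(omega, gamma)
        omega = draw_omega(beta, gamma)
        gamma = diffusion_step_gamma(beta, omega, gamma)
        if it >= burn_in:
            kept.append((beta, omega, gamma))
    return kept
```

Only the γ update needs the diffusion machinery; the other two draws are cheap, which is what makes the hybrid fast relative to a full griddy Gibbs sweep.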



Figure 4.8: Trace-plots of parameters involved in the CAR model described by (4.67)



Figure 4.9: Posterior densities of parameters involved in the CAR model described by (4.67)



Figure 4.10: Cumulative distribution function estimates for the parameters in the CAR model described by (4.67). These estimates are based on samples generated from diffusion MCMC (blue lines) and griddy Gibbs sampler (red lines).

Parameter | True Value | E_DM(· | Y) | SD_DM(· | Y) | E_GG(· | Y) | SD_GG(· | Y)
\beta_0   | 2          | 1.7659      | 0.3244       | 1.7706      | 0.3256
\beta_1   | 3          | 3.000       | 0.0006       | 3.000       | 0.0006
\omega    | 0.0625     | 0.0612      | 0.0028       | 0.0612      | 0.0028
\gamma    | 0.2        | 0.1876      | 0.0278       | 0.1856      | 0.0277

Table 4.2: Posterior summaries of the parameters involved in the CAR model (4.67). E_DM(· | Y) = posterior mean based on diffusion MCMC; SD_DM(· | Y) = posterior standard deviation based on diffusion MCMC; E_GG(· | Y) = posterior mean based on the griddy Gibbs sampler; SD_GG(· | Y) = posterior standard deviation based on the griddy Gibbs sampler.

Parameter | Hellinger Distance
\beta_0   | 0.00023
\beta_1   | 0.00019
\omega    | 0.00011
\gamma    | 0.00044

Table 4.3: Hellinger Distances for the parameters involved in the CAR model (4.67). This compares the distance between two estimated densities based on the samples generated using diffusion MCMC and griddy Gibbs sampler.

4.7.4 An Exchangeable Model

Consider an exchangeable model for a multivariate normal n-vector Y with common mean X_1, variance X_2, and correlation X_3. Namely, assume that Y \mid x_1, x_2, x_3 \sim N(x_1 \mathbf{1}, x_2 C(x_3)), where \mathbf{1} is an n-vector of 1's and

C(x_3) = \begin{pmatrix} 1 & x_3 & \cdots & x_3 \\ x_3 & 1 & \cdots & x_3 \\ \vdots & \vdots & \ddots & \vdots \\ x_3 & x_3 & \cdots & 1 \end{pmatrix}. \quad (4.68)

We assume that X_1, X_2, X_3 are independent a priori: X_1 \sim N(\mu, \eta^2); X_2 \sim I\Gamma(\alpha, \beta); and X_3 \sim \pi_3, where \pi_3 is uniformly distributed over the interval (-1, 1). Recall that the determinant |C| = (1 - x_3)^{n-1}(1 + (n-1)x_3) and the inverse of C is

C^{-1} = \frac{1}{(1 - x_3)(1 + (n-1)x_3)} \begin{pmatrix} 1 + (n-2)x_3 & -x_3 & \cdots & -x_3 \\ -x_3 & 1 + (n-2)x_3 & \cdots & -x_3 \\ \vdots & \vdots & \ddots & \vdots \\ -x_3 & -x_3 & \cdots & 1 + (n-2)x_3 \end{pmatrix}.

Let v = \mathbf{1}'C^{-1}\mathbf{1}, \hat{\mu} = v^{-1}\mathbf{1}'C^{-1}Y, and r = (y - x_1\mathbf{1})'C^{-1}(y - x_1\mathbf{1}). Note that

\frac{d|C^{-1}|}{dx_3} = \frac{n(n-1)\,x_3\,(1 - x_3)^{n-2}}{|C|^2} \quad (4.69)

and

\frac{dC^{-1}}{dx_3} = \frac{1}{(1 - x_3)(1 + (n-1)x_3)} \begin{pmatrix} n-2 & -1 & \cdots & -1 \\ -1 & n-2 & \cdots & -1 \\ \vdots & \vdots & \ddots & \vdots \\ -1 & -1 & \cdots & n-2 \end{pmatrix} + \frac{2(n-1)x_3 - (n-2)}{(1 - x_3)(1 + (n-1)x_3)}\, C^{-1}. \quad (4.70)

The system of equations (4.61)-(4.63) reduces to

b_1 = 0.50\sigma_1^2 \left( \frac{\hat{\mu}}{x_2 v^{-1}} + \frac{\mu}{\eta^2} - x_1\left( \frac{1}{x_2 v^{-1}} + \frac{1}{\eta^2} \right) + \frac{\sigma^2_{1(1)}}{\sigma_1^2} \right)

b_2 = 0.50\sigma_2^2 \left( \frac{0.50\,r + \beta^{-1}}{x_2^2} - \frac{\alpha + 0.50\,n}{x_2} + \frac{\sigma^2_{2(2)}}{\sigma_2^2} \right)

b_3 = 0.50\sigma_3^2 \left( \frac{1}{2}\,\frac{d|C^{-1}|/dx_3}{|C^{-1}|} - \frac{1}{2x_2}(Y - x_1\mathbf{1})'\frac{dC^{-1}}{dx_3}(Y - x_1\mathbf{1}) + \frac{\sigma^2_{3(3)}}{\sigma_3^2} \right).

Our suggestions for the volatility parameters are as follows:

\sigma_1^2 = x_2\,(Y - x_1\mathbf{1})'\,C^{-1}\,(Y - x_1\mathbf{1})\; I(x_2 > 0)\, I(x_3 \in (-1, 1))

\sigma_2^2 = x_2^2\; I(x_2 > 0)

\sigma_3^2 = x_2\,(Y - x_1\mathbf{1})'\,C^{-1}\,(Y - x_1\mathbf{1})\; I(x_2 > 0)\, I(x_3 \in (-1, 1)).

These choices of diffusion parameters satisfy Lipschitz conditions.

4.8 Discussion

In this chapter we focused on the selection of drift and volatility specifications which can lead to a successful implementation of diffusion-based Markov chain Monte

Carlo. We also studied the closeness of the targeted posterior distributions to the stationary distributions of the generated chains.

We do not discourage use of a Metropolis-Hastings (M-H) correction step; rather, our point is that for large spatial and climate models in which Metropolis-Hastings algorithms are expensive, the techniques discussed here are useful and cheaper.

Often choosing a proposal distribution for M-H algorithms is difficult. Using priors as proposals can lead to poor mixing and very low acceptance probabilities. In those situations the transition kernels which we developed here can be used as proposals.

Computing the exponential of a matrix in Shoji and Ozaki's method can be very expensive. But in practice we often face situations in which the full conditional distributions are available for most of the parameters, and only a few parameters require Metropolis-Hastings steps. In those situations hybrid samplers with Gibbs steps and diffusion MCMC steps can be very useful. We illustrated this point using a CAR model.

Roberts and Stramer (2002) prove that under certain conditions the diffusion

MCMC algorithms with Shoji and Ozaki’s discretization are geometrically ergodic.

But their regularity conditions are very strong and restrict attention to the class of target distributions with Gaussian tails. We plan to investigate the rate of convergence for some specific models of interest.

In all our examples we chose the discretization step h to be 0.0001. The value of h is crucial since it will impact the stationary distribution and the rate of convergence.

We do not recommend using too small an h. A very small h will not allow the chain to move freely, leading to a poorly mixing chain. The situation is analogous to the Metropolis-Hastings algorithm: an overly precise proposal distribution will increase the acceptance probability, but at the same time interesting regions of the posterior distribution will remain unexplored. Euler's discretization is more sensitive to the choice of h than

Shoji and Ozaki’s. Suggestions about choosing h are available in Roberts and Stramer

(2002).

BIBLIOGRAPHY

Barry, D. and Hartigan, J. A. (1993). A Bayesian Analysis for Change Point Problems. Journal of the American Statistical Association 88 (421), 309–319.

Basu, A. K. (2004). Measure Theory and Probability. Prentice-Hall, India.

Berliner, L. M. (2003). Physical-statistical modeling in geophysics. Journal of Geo- physical Research 108 (D24), 8776,doi: 10.1029/2002JD002865.

Berliner, L. M. and Wikle, C. K. (2006). Approximate importance sampling Monte Carlo for data assimilation. Physica D 230, 37–49.

Berliner, L. M., Jezek, K., Cressie, N., Kim, Y., Lam, E. and van der Veen, C. J. (2008). Modeling dynamic controls on ice streams: A Bayesian statistical ap- proach. Journal of Glaciology, in press.

Berliner, L. M., Wikle, C. K. and Cressie, N. (2000). Long-lead prediction of Pacific SSTs via Bayesian dynamic modeling. Journal of Climate 13, 3953–3968.

Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. John Wiley & Sons,Inc., New York.

Biller, C. (2000). Adaptive Bayesian regression splines in semiparametric generalized linear models. Journal of Computational and Graphical Statistics 9, 122–140.

Bliss, C. I. (1935). The calculation of the dosage-mortality curve. Annals of Applied Biology 22, 134–167.

Box, G. E. P., Hunter, W. G. and Hunter, J. S. (1978). Statistics for Experimenters. Wiley, London.

Breiman, L. (1992). Probability. SIAM, Philadelphia.

Bruce, A. and Gao, H. (1996). Applied Wavelet Analysis with S-Plus. Springer-Verlag, New York.

Carlin, B. P., Gelfand, A. E. and Smith, A. F. M. (1992). Hierarchical Bayesian Analysis of Changepoint Problems. Applied Statistics 41 (2), 389–405.

Casella, G. and Berger, R. L. (2002). Statistical Inference, 2nd Edition. Duxbury, Pacific Grove, CA.

Casella, G. and George, E. I. (1992). Explaining the Gibbs Sampler. The American Statistician 46 (3), 167–174.

Chib, S. (1995). Marginal Likelihood From the Gibbs Output. Journal of the Amer- ican Statistical Association 90, 1313–1321.

Cowles, M. K. and Carlin, B. P. (1996). Markov Chain Monte Carlo Convergence Diagnostics: A comparative Review. Journal of the American Statistical Asso- ciation 91 (434), 883–904.

Craigmile, P. F., Calder, C. A., Li, H., Paul, R. and Cressie, N. (2007). Hierarchical model building, fitting, and checking: A behind-the-scenes look at a Bayesian analysis of arsenic exposure pathways. Department of Statistics Technical Report No. 806, The Ohio State University.

Cressie, N. (1993). Statistics for Spatial Data. Wiley, New York.

Damerdji, H. (1994). Strong consistency of the variance estimator in steady-state simulation output analysis. Mathematics of Operations Research 19, 494–512.

Davison, A. C. and Hinkley, D. V. (1997). Bootstrap Methods and their Application. Cambridge University Press, UK.

Dempster, A. P. (1997). The direct use of likelihood for significance testing. Statistics and Computing 7, 247–252.

Denison, D. G. T., Mallick, B. K. and Smith, A. F. M. (1998). Automatic Bayesian curve fitting. Journal of Royal Statistical Society, Series-B 60 (2), 333–350.

Dimatteo, I., Genovese, C. R. and Kass, R. E. (2001). Bayesian curve-fitting with free-knot splines. Biometrika 88 (4), 1055–1071.

Dobson, A. J. (1983). An introduction to statistical modelling. Chapman and Hall, London.

Feller, W. (1954). Diffusion Processes in One Dimension. Transactions of the Amer- ican Mathematical Society 77 (1), 1–31.

Flegal, J. M., Haran, M. and Jones, G. L. (2008). Markov Chain Monte Carlo: Can We Trust the Third Significant Figure? Statistical Science, to appear.

Gardiner, C. W. (1985). Handbook of Stochastic Methods for Physics, Chemistry and the Natural Sciences 2/e. Springer-Verlag, New York.

93 Gelfand, A. E. and Smith, A. F. M. (1990). Sampling based approaches to calculating marginal densities. Journal of the American Statistical Association 85, 398 – 409.

Gelman, A. and Rubin, D. B. (1992). Inference From Iterative Simulation Using Multiple Sequences. (with discussion), Statistical Science 7, 457–511.

Geman, S. and Geman, D. (1984). Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741.

Geweke, J. (1992). Evaluating the Accuracy of Sampling-Based Approaches to the Calculation of Posterior Moments. Bayesian Statistics (J. M. Bernardo et al., editors) 4, 169–193.

Gilks, W. R., Richardson, S. and Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall.

Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82 (4), 711–732.

Heidelberger, P. and Welch, P. D. (1983). Simulation Run Length Control in the Presence of an Initial Transient. Operations Research 31, 1109–1140.

Hellinger, E. (1909). Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. J. Reine Angew. Math. 136, 210–271.

Hobert, J. P., Jones, G. L., Presnell, B. and Rosenthal, J. S. (2002). On the ap- plicability of regenerative simulation in Markov chain Monte Carlo. Biometrika 89, 731–743.

Jones, G. L., Haran, M., Caffo, B. S. and Neath, R. (2006). Fixed-Width Output Analysis for Markov Chain Monte Carlo. Journal of the American Statistical Association 101, 1537–1547.

Lasota, A. and Mackey, M. C. (1994). Chaos, Fractals, and Noise 2/e. Springer- Verlag, New York.

Law, A. M. and Carson, J. S. (1979). A Sequential Procedure for Determining the Length of a Steady-State Simulation. Operations Research 27 (5), 1011–1025.

Liptser, R. S. and Shiryayev, A. N. (1977). Statistics of Random Processes I: General Theory. Springer-Verlag, New York.

Liu, C., Liu, J. and Rubin, D. B. (1992). A Variational Control Variable for Assessing the Convergence of the Gibbs Sampler. Proceedings of the American Statistical Association, Statistical Computing Section, pp. 74–78.

Müller, P. and Vidakovic, B. (1999). Bayesian Inference in Wavelet-Based Models. Springer, New York.

Mykland, P., Tierney, L. and Yu, B. (1995). Regeneration in Markov Chain Samplers. Journal of the American Statistical Association 90 (429), 233–241.

Paterson, W. S. B. (1994). Physics of Glaciers, 3rd edition edn. Butterworth Heine- mann, New York.

Protter, P. E. (2005). Stochastic Integration and Differential Equations. Springer- Verlag, New York.

Raftery, A. E. and Lewis, S. M. (1992). How many iterations in the Gibbs sampler? Bayesian Statistics (J. M. Bernardo et al. editors) 4, 763–773.

Ritter, C. and Tanner, M. A. (1992). Facilitating the Gibbs Sampler: The Gibbs Stopper and the Griddy-Gibbs Sampler. Journal of the American Statistical Association 87, 861–868.

Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods, 2nd ed. Springer, New York.

Roberts, G. O. (1992). Convergence Diagnostics of the Gibbs Sampler. Bayesian Statistics (J. M. Bernardo et al. editors) 4, 775–782.

Roberts, G. O. and Stramer, O. (2001). On inference for partially observed nonlinear diffusion models using the Metropolis–Hastings algorithm. Biometrika 88 (3), 603–621.

Roberts, G. O. and Stramer, O. (2002). Tempered Langevin. Methodology and Computing in Applied Probability 1 (3), 307–328.

Rogers, L. C. G. and Williams, D. (1994). Diffusions, Markov Processes, and Martingales. Cambridge University Press, London.

Sansó, B., Forest, C. E. and Zantedeschi, D. (2008). Inferring Climate System Properties Using a Computer Model. Bayesian Analysis 3 (1), 1–38.

Shen, J. and Strang, G. (1998). Asymptotics of Daubechies Filters, Scaling Functions, and Wavelets. Applied and Computational Harmonic Analysis 5 (3), 312–331.

Shoji, I. and Ozaki, T. (1998). A statistical method of estimation and simulation for systems of stochastic differential equations. Biometrika 85 (1), 240–243.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P. and van der Linde, A. (2002). Bayesian measures of model complexity and fit (with discussion). Journal of the Royal Statistical Society, Series B 64 (4), 583–639.

Stephens, D. A. (1994). Bayesian Retrospective Multiple-Changepoint Identification. Applied Statistics 43 (1), 159–178.

Stramer, O. and Tweedie, R. L. (1999a). Langevin-Type Models I: Diffusions with Given Stationary Distributions and their Discretizations. Methodology and Computing in Applied Probability 1 (3), 283–306.

Stramer, O. and Tweedie, R. L. (1999b). Langevin-Type Models II: Self-Targeting Candidates for MCMC Algorithms. Methodology and Computing in Applied Probability 1 (3), 307–328.

Trevisani, M. and Gelfand, A. E. (2003). Inequalities between Expected Marginal Log-Likelihoods, with Implications for Likelihood-Based Model Complexity and Comparison Measures. Canadian Journal of Statistics 31 (3), 239–250.

Vidakovic, B. (1999). Statistical Modeling by Wavelets. Wiley Series in Probability and Statistics.

Wen, J., Jezek, K. C., Csatho, B. M., Herzfeld, U. C. and Fasness, K. L. (2007). Mass budgets of the Lambert, Mellor and Fisher Glaciers and basal fluxes beneath their flowbands on Amery Ice Shelf. Science in China Series D: Earth Sciences 50 (11), 1693–1706.

Zellner, A. and Min, C. K. (1995). Gibbs Sampler Convergence Criteria. Journal of the American Statistical Association 90, 921–927.
