<<

Calibrated Bayes factors for model selection and model averaging

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Pingbo Lu, B.S.

Graduate Program in Statistics

The Ohio State University

2012

Dissertation Committee:

Dr. Steven N. MacEachern, Co-Advisor

Dr. Xinyi Xu, Co-Advisor

Dr. Christopher M. Hans

© Copyright by

Pingbo Lu

2012

Abstract

The discussion of model selection and model averaging has a long history and remains an active topic today. Arguably, the Bayesian ideal of model comparison based on

Bayes factors successfully overcomes the difficulties and drawbacks for which frequen-

tist hypothesis testing has been criticized. By putting prior distributions on model-

specific parameters, Bayesian models will be completed hierarchically and thus many

statistical summaries, such as posterior model probabilities, model-specific posteri-

or distributions and model-average posterior distributions can be derived naturally.

These summaries enable us to compare inference summaries such as the predictive

distributions.

Importantly, a Bayes factor between two models can be greatly affected by the

prior distributions on the model parameters. When prior information is weak, very

dispersed proper prior distributions are often used. This is known to create a problem

for the Bayes factor when comparing models that differ in dimension, and it is of

even greater concern when one of the models is of infinite dimension. Therefore, we

propose an innovative criterion called the calibrated Bayes factor, which uses training samples to calibrate the prior distributions so that they achieve a reasonable level of “information”. The calibrated Bayes factor is then computed as the Bayes factor over the remaining data. The level of “information” is tied to the concentration of the training-updated prior distributions, which is carefully evaluated by monitoring

the distribution of the symmetrized Kullback-Leibler divergence between two likelihood functions drawn independently from the training-updated prior distribution. Markov chain Monte Carlo algorithms are widely used in this research to generate parameter draws from training-updated prior distributions. Subsampling is applied to reduce dependence among parameter draws.

In this thesis, we mainly focus on comparisons of one-sample (i.i.d) models and the variable selection problem under regression settings with a variety of model specific prior distributions. We illustrate through simulation studies that the calibrated Bayes factor yields robust and reliable model preferences under various true models. We further demonstrate use of the calibrated Bayes factor on obesity data from the Ohio

Family Health Survey (one-sample model) and the ozone data originally studied in

Breiman and Friedman (1985) (variable selection problem). The calibrated Bayes factor is applicable to a large variety of model comparison problems because it makes no assumption on model forms (parametric or nonparametric) and can be used for both proper and improper priors.

This is dedicated to my family, especially my grandmother, my parents, my wife Zhijia, and those I love.

Acknowledgments

First, I would like to express my gratitude to my co-advisors, Dr. Steven MacEachern and Dr. Xinyi Xu, for inspiring this research and guiding me with endless patience.

They have been so generous in sharing their creative ideas with me and have always directed me in the right way. I sincerely appreciate their continual academic and financial support over these years and the time they spent checking every page of this thesis. I feel very fortunate to have had the opportunity to learn from them. Without their patient guidance and unselfish help, this work could not have been done. I also want to thank The Ohio

State University, Department of Statistics and Dr. Prem Goel for training me and funding me as a graduate associate during my study.

I also wish to thank Dr. Chris Hans and Dr. Peter Craigmile for serving on my dissertation committee and my candidacy exam committee, and for their great comments and suggestions. Their generous help and support deserve more than thanks.

My gratitude also goes to Beth Duffy and all her family for their generosity and kindness. I want to thank them for hosting me when I first arrived in Columbus and for being continuously supportive. It has always been my pleasure to have you as my host family.

Finally, I would like to thank all my friends at Ohio State. I always feel lucky to have met them and to know them. My special thanks also go to my family for their constant trust and encouragement.

Working with everyone who made this thesis possible has been a precious experience.

Vita

August 6, 1983 ...... Born - Tianjin, China

2006 ...... B.S. Mathematics, Beijing Institute of Technology

2008 - present ...... Graduate Teaching/Research Associate, The Ohio State University

Fields of Study

Major Field: Statistics

Table of Contents

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

Vita ...... vii

List of Tables ...... xi

List of Figures ...... xii

1. Introduction ...... 1

1.1 Bayesian Model Comparison and Bayes Factors ...... 1
1.1.1 Introduction to the Comparison of Probability Models ...... 1
1.1.2 Definition of Bayes Factors ...... 3
1.2 Interpretation of the Bayes Factor ...... 4
1.3 Advantages of Using the Bayes Factor for Hypothesis Testing ...... 6
1.4 Motivation for a “New” Bayes Factor ...... 8
1.4.1 Impact of the prior distribution ...... 8
1.4.2 Example 1: Lindley’s Paradox ...... 10
1.4.3 Example 2: An i.i.d. Case ...... 11
1.4.4 Remarks ...... 17

2. Computational Techniques ...... 20

2.1 Algorithms ...... 20
2.1.1 The Gibbs Sampler ...... 21
2.1.2 The Metropolis-Hastings Algorithm ...... 22
2.1.3 Remarks ...... 24

2.2 Computation of Marginal Likelihoods ...... 25
2.2.1 The Laplace Approximation ...... 25
2.2.2 Monte Carlo Integration and Importance Sampling ...... 26
2.2.3 The Harmonic Mean Estimator ...... 28
2.2.4 Chib’s MCMC-based Methods ...... 29

3. The Calibrated Bayes Factor ...... 37

3.1 Alternative Forms of the Bayes Factor ...... 38
3.1.1 Pseudo-Bayes Factors ...... 38
3.1.2 Posterior Bayes Factors ...... 39
3.1.3 Partial and Fractional Bayes Factors ...... 40
3.1.4 Intrinsic Bayes Factors ...... 41
3.1.5 Remarks ...... 42
3.2 Information Metric ...... 44
3.3 Calibrating Bayes Factors ...... 45
3.4 Example 2 revisited ...... 46
3.5 One-sample Model ...... 47
3.5.1 Target Information Level ...... 47
3.5.2 Calibrating Priors ...... 49
3.5.3 Simulations ...... 51
3.5.4 Study 1: Adult Males, Aged 18-24 ...... 56
3.6 Two-sample Model ...... 61
3.6.1 Study 2: Diabetes Group vs. Non-Diabetes Group ...... 61
3.7 Griffin’s MDP Priors ...... 69

4. Model ...... 72

4.1 Background ...... 72
4.2 Model Specification ...... 74
4.3 Prior Distributions ...... 75
4.3.1 Improper Priors ...... 75
4.3.2 Proper Priors ...... 77
4.4 Calibration ...... 82
4.4.1 Information Metric ...... 82
4.4.2 Target of Concentration ...... 83
4.4.3 Calibration sample size ...... 84
4.5 Simulations ...... 85
4.5.1 Independent normal priors on low dimension β ...... 86
4.5.2 Horseshoe Priors on high dimension β ...... 96
4.6 Case Study ...... 99
4.6.1 Description of Ozone data ...... 99

4.6.2 Pairwise Comparisons among Model 1 - Model 4 ...... 100
4.6.3 Comparisons with Model 5 ...... 104

5. Conclusions ...... 107

Bibliography ...... 110

Appendices 119

A. An Algorithm for Calibrated Sample Size Search ...... 119

B. An Overview of the Ozone Data ...... 121

List of Tables

Table Page

1.1 Jeffreys’ scale of evidence ...... 5

1.2 Kass and Raftery’s scale of evidence ...... 6

3.1 Summary statistics for OFHS (Male Adult, Aged 18-24) ...... 57

3.2 Probability of obtaining more non-diabetes observations ...... 64

3.3 Summary statistics for OFHS (Married Male, Aged 55-64 High School Graduates with Diabetes)...... 65

3.4 Summary statistics for OFHS (Married Male, Aged 55-64 High School Graduates without Diabetes)...... 67

4.1 Quantiles of χ²(1) · χ²(p) ...... 84

4.2 Estimation of coefficients for Models 1-4 ...... 101

B.1 Variables used in the ozone data ...... 121

List of Figures

Figure Page

1.1 Motivation Example ...... 13

3.1 Cumulative distribution function of SKL samples versus χ²(1) ...... 48

3.2 Small departure from normality ...... 53

3.3 Moderate departure from normality ...... 53

3.4 Large departure from normality ...... 54

3.5 Super large departure from normality ...... 54

3.6 Calibration sample size versus departure from normality on the log10 scale ...... 55

3.7 Histogram of BMI data (Male Adult) ...... 58

3.8 Bayes factors for BMI data (Male Adult) ...... 60

3.9 The mean log Bayes factor for the two-sample model ...... 62

3.10 Histogram of BMI data (Diabetes Group) ...... 65

3.11 Bayes factors for BMI data (Diabetes Group) ...... 66

3.12 Histogram of BMI data (Non-Diabetes Group) ...... 68

3.13 Bayes factors for BMI data (Non-Diabetes Group) ...... 69

4.1 Box-plot for the posterior draws of βh’s and λh’s ...... 80

4.2 Plot of estimates against horseshoe posterior ...... 81

4.3 Cumulative distribution functions of SKL samples versus χ²(1) · χ²(p) ...... 88

4.4 Bayes factors for multiple prefixed zeros in βγ ...... 89

4.5 Bayes factors for one prefixed zero in βγ ...... 90

4.6 Bayes factors under multicollinearity ...... 92

4.7 Impact from the diffuseness of priors on original and calibrated Bayes factors ...... 94

4.8 An example of Bayes factor with two turning points ...... 95

4.9 Bayes factor for comparing high-dimensional models I ...... 97

4.10 Bayes factor for comparing high-dimensional models II ...... 98

4.11 Bayes factors for comparisons among Models 1-4 ...... 102

4.12 Bayes factors for comparing Models 1-4 against Model 5 ...... 105

B.1 Matrices scatter plots for Ozone data ...... 122

Chapter 1: Introduction

1.1 Bayesian Model Comparison and Bayes Factors

1.1.1 Introduction to the Comparison of Probability Models

Selection of probability models is a fundamental decision-making technique and a popular data analytic topic, as it is widely employed throughout the sciences. As a description of a typical study, a researcher collects data from an experiment. The data are in the form of measurements on a set of aspects of the observed subjects.

The researcher is not sure whether all of the variables affect the response variables, and thus wants to determine which measurements are important to the outcomes.

Such natural questions lead to a discussion of model selection. An acceptable ap- proach to model selection should, at a minimum, yield results that can be interpreted numerically and scientifically. In addition, to gain widespread acceptance and use, an approach should be easy to carry out and should be applicable to a wide variety of models.

When data are available, formal statistical techniques play a major role in model selection. Numerous methods have been proposed and discussed to deal with a variety of

model selection questions, from both the frequentist and Bayesian communities. Well-known statistical methods of model selection include hypothesis testing, forward and backward variable selection, and criterion-based methods such as AIC, BIC, Mallows

Cp, cross-validation and use of the Bayes factor. Before getting into the details, it is

necessary to point out that some model selection techniques are designed to maintain

a subset of models for further consideration, and the subset may or may not consist

of a single model. An overall review of model selection techniques can be found in

Kadane and Lazar (2004).

Although this thesis primarily develops a novel Bayesian approach to model se-

lection, a few remarks on frequentist hypothesis testing seem a proper starting point.

First, the standard view in the frequentist community is that parameters are unknown

constants. Thus, the likelihood of the data under a particular model is a function of

the parameters in the model, and cannot be fully evaluated. The Neyman-Pearson

school portrays frequentist hypothesis testing as an approach of choosing one model

from two on the basis of the data. A hypothesis test contrasts a null hypothesis

(H0) with an alternative hypothesis (Ha). The null hypothesis usually represents a simpler model nested in the alternative hypothesis. For example, when tossing a coin, one might ask the question: Is it a fair coin? This question can be interpreted as two models: The small model is that the data are the result of independent flips of a fair coin. The large model is that the data are the result of independent flips of a biased coin. To set up the hypothesis test, the space spanned by both models is partitioned into a null hypothesis and an alternative hypothesis. Here, H0 : P(Head) = 0.5

and Ha : P (Head) ≠ 0.5. The null hypothesis corresponds to the small model while the alternative hypothesis is the complement (in the union of the two models) of the

small model. The decision from the hypothesis test can be based on the p-value, defined formally in a simple setting as the probability of obtaining a result as unusual as or more unusual than the observed result, given that the null hypothesis is true. A p-value less than a pre-specified critical value (α) implies the observed data cast strong doubt on the simpler model, thus leading to rejection of the null hypothesis. In more complex settings, the large body of work on frequentist hypothesis testing seeks to describe how to capture the notion of “unusual”, how to define the decision criterion when the p-value depends on parameters in the null model, how to set the level α of the test, how to handle non-nested models, and more. Lehmann and Romano’s

(2005) authoritative work provided many details.

In contrast to a frequentist model, a Bayesian model also requires a prior distribution on the parameters. This enables the Bayesian to compute a single value for the likelihood of the model by integrating out the parameters. This single value of the likelihood is referred to as the marginal likelihood, and it can be used to compare models. The marginal likelihood is the fundamental quantity upon which the Bayes factor, and by extension Bayesian hypothesis tests, are based. The next subsection describes the connection between the Bayes factor and Bayesian hypothesis tests.

1.1.2 Definition of Bayes Factors

Jeffreys (1961) developed the Bayesian approach to hypothesis testing. He focused on a scientific setting and considered each hypothesis to represent a particular scien- tific theory. From Jeffreys’ viewpoint, statistical models are introduced to represent the probability of the data according to each of the two theories.

Suppose one has observed data Y = {y1, y2, . . . , yn}, has two statistical models labeled M1 and M2, and wants to choose one over the other by means of a hypothesis test. P(Y | Mk) or mk(Y) represents the marginal likelihood (or integrated likelihood), which can be computed as

mk(Y) = ∫ fk(Y | θk) πk(θk) dθk,   k = 1, 2.   (1.1)

In this expression, fk(Y | θk) denotes the likelihood under model Mk conditional on the model-specific parameters θk, and πk(θk) is the prior distribution on θk.

Following the Bayesian approach, one focuses on the posterior distribution on the models, obtained via Bayes’ theorem

P(Mk | Y) = P(Y | Mk) P(Mk) / [ P(Y | M1) P(M1) + P(Y | M2) P(M2) ],   k = 1, 2;

so that

P(M1 | Y) / P(M2 | Y) = [ P(Y | M1) / P(Y | M2) ] · [ P(M1) / P(M2) ],   (1.2)

where P(Mk) represents the prior model probability and P(Mk | Y) represents the posterior model probability. The Bayes factor is defined as

BF12 = P(Y | M1) / P(Y | M2) = m1(Y) / m2(Y).   (1.3)

We note that it is the ratio of integrated likelihoods.
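As a concrete illustration of how (1.1) and (1.3) fit together, the sketch below computes both marginal likelihoods by numerical quadrature for a toy normal-mean problem. The data, the two prior scales, and the known σ = 1 are all hypothetical choices of ours, not values from this thesis:

```python
import numpy as np
from scipy import integrate, stats

rng = np.random.default_rng(0)
y = rng.normal(loc=0.3, scale=1.0, size=20)  # toy data; sigma = 1 treated as known

def marginal_likelihood(y, prior_sd):
    """m(Y) = integral of f(Y | theta) * pi(theta) dtheta, by adaptive quadrature."""
    def integrand(theta):
        log_lik = stats.norm.logpdf(y, loc=theta, scale=1.0).sum()
        return np.exp(log_lik) * stats.norm.pdf(theta, loc=0.0, scale=prior_sd)
    # the integrand is sharply peaked near the sample mean, so flag that point
    val, _ = integrate.quad(integrand, -10, 10, points=[float(y.mean())])
    return val

# M1: theta ~ N(0, 1); M2: the same sampling model with a very diffuse N(0, 100^2) prior
bf12 = marginal_likelihood(y, 1.0) / marginal_likelihood(y, 100.0)
print(f"BF12 = {bf12:.1f}")  # the diffuse prior is heavily penalized, so BF12 >> 1
```

Even though both "models" share the sampling distribution, the diffuse prior spreads its mass thinly and earns a much smaller marginal likelihood; this is exactly the prior sensitivity taken up in Section 1.4.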

1.2 Interpretation of the Bayes Factor

According to Lavine and Schervish (1999), the Bayes factor measures whether the data Y have increased or decreased the belief in M1 relative to M2. In other words, given (1.3), if identical prior probabilities are assigned to M1 and M2, i.e. P (M1) =

P(M2) = 0.5, then the Bayes factor is the ratio of posterior probabilities. Due to the completeness of Bayesian models, the entire derivation is natural and mathematically exact. This simple fact illustrates the rationale for use of the Bayes factor in model selection.

Bayes factors compare models in pairs and serve as a one-number summary of the evidence provided by the data in favor of one scientific theory, represented by a statistical model, as opposed to another. Once the Bayes factor is obtained, one needs a reasonable criterion to evaluate the strength of evidence indicated by this number. Several scales of interpretation of the Bayes factor have been proposed.

Here, we present two of the most popular: Jeffreys’ (1961) seminal scale for the strength of evidence, and Kass and Raftery’s (1995) more recent scale. Originally,

Jeffreys suggested interpreting the observed Bayes factor in half-units on the log10 scale (Table 1.1). The corresponding critical values on a natural logarithm scale are also presented, to facilitate comparison with the scale suggested by Kass and Raftery

(1995) in Table 1.2. The table also includes a column giving the Bayes factor itself.

These values may be interpreted as odds. Note that the same scale can be used to interpret the evidence against M1 by inverting the value of the Bayes factor.

Table 1.1: Jeffreys’ scale of evidence

BF12       log10(BF12)   loge(BF12)   Evidence against M2
1 - 3.2    0 - 0.5       0 - 1.2      Not worth more than a bare mention
3.2 - 10   0.5 - 1.0     1.2 - 2.3    Substantial
10 - 100   1.0 - 2.0     2.3 - 4.6    Strong
> 100      > 2.0         > 4.6        Decisive

Table 1.2: Kass and Raftery’s scale of evidence

BF12           loge(BF12)   Evidence against M2
1 - 2.7        0 - 1.0      Not worth more than a bare mention
2.7 - 20.1     1.0 - 3.0    Positive
20.1 - 148.4   3.0 - 5.0    Strong
> 148.4        > 5.0        Very strong

Kass and Raftery (1995) emphasized that these categories are not a calibration of the Bayes factor, but rather a rough descriptive statement about the standard of evidence in scientific investigation.
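For convenience, Table 1.2 can be encoded as a small lookup function. This is our own utility sketch (the function name and the handling of BF12 < 1 are our choices), not code from the thesis:

```python
import math

def kass_raftery_category(bf12: float) -> str:
    """Classify evidence against M2 on Kass and Raftery's (1995) log_e scale.

    For bf12 < 1 (evidence favoring M2), invert the Bayes factor first and
    read the result as evidence against M1.
    """
    log_bf = math.log(bf12)
    if log_bf < 1.0:
        return "Not worth more than a bare mention"
    if log_bf < 3.0:
        return "Positive"
    if log_bf < 5.0:
        return "Strong"
    return "Very strong"

print(kass_raftery_category(6.89))   # log(6.89) ~ 1.93 -> "Positive"
print(kass_raftery_category(200.0))  # log(200) ~ 5.3  -> "Very strong"
```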

To end this discussion, we note that, before the data Y are observed, the Bayes factor given in (1.3) is a random variable whose distribution depends on the distribution of

Y and on the models being entertained. With these factors in play, the decision rule that compares the observed Bayes factor to a predetermined value, independent of the distribution of the Bayes factor, seems, to some, to be insufficient. Recently, the interpretation of the Bayes factor based on this perspective has been studied by

Wang and Gelfand (2002), Vlachos and Gelfand (2003) and Garcia-Donato and Chen

(2005). No further discussion of this aspect of the Bayes factor is given in this thesis.

1.3 Advantages of Using the Bayes Factor for Hypothesis Testing

In this subsection, we present several advantages of using the Bayes factor for the purpose of choosing one model from two under a zero-one loss function. One important motivation for the use of the Bayes factor for hypothesis testing is the deficiency of frequentist computations, such as those based on the p-value. The conflict

between the p-value and (the lower bound of) the Bayes factor or the posterior model probabilities has already been widely discussed in the literature. A null hypothesis may be strongly rejected by a tiny p-value, but supported by the Bayes factor. An explanation is that although a null hypothesis clearly conflicts with the data (small p-value), the alternative hypothesis may not fit the data any better. A number of concrete examples and extensive discussions can be found in Berger and Delampady

(1987), Berger and Sellke (1987) and Delampady and Berger (1990). Berger and

Pericchi (1996a) provide several other arguments in favor of Bayesian methods over non-Bayesian methods for model selection and hypothesis testing.

In addition, frequentist hypothesis tests are literally unable to assign a probability to a hypothesis while Bayesian methods provide direct probability statements about the hypotheses. As a consequence, accounting for model uncertainty, one can also average the inference or estimation results yielded by all models under consideration, with the posterior probabilities of models as weights.

Second, the Bayes factor places the null and alternative hypotheses on an equal footing. For example, analysis of non-nested models or hypotheses is very difficult via frequentist hypothesis testing, while this task can be accomplished easily with the

Bayes factor. Recall that in frequentist hypothesis testing, the one-number summary, the p-value, is a probability given the null hypothesis is true. Reversing the null and alternative hypotheses yields a different p-value, possibly resulting in a different conclusion. This point was mentioned by Kass and Raftery (1995) as well, who also pointed out the difficulties in implementing Cox’s approach (1961, 1962) for the comparison of non-nested models. In this case, the choice of which model to set as

the null hypothesis is quite arbitrary. The Bayes factor deserves credit for escaping

from this restriction naturally.

Last but not least, Bayesian methods can deal with more than two hypotheses si-

multaneously. Bayesian methods compare marginal likelihoods, which do not depend

on the other models under consideration, and thus do not suffer when comparing

three or more models. In comparison, a frequentist hypothesis test can deal with

only two models at a time.

1.4 Motivation for a “New” Bayes Factor.

1.4.1 Impact of the prior distribution

According to the definition of marginal likelihood in expression (1.1), it is clear

that the Bayes factor can be viewed as a “weighted likelihood” (Wald, 1947) ratio of M1 and M2, with the priors being the “weighting functions”. As a consequence of

(1.1), the model-specific prior distributions determine the marginal distribution of the data, and so play a key role in determining the Bayes factor as well. Unfortunately, as Kass and Raftery (1995) state “prior parameter densities may be hard to set when there is no such information.” In this case, following Laplace’s principle of insufficient reason, the prior within a model πi is often specified by means of an

automated rule which has been designed to capture the notion of “little information”

or no information. See, for example, Jeffreys (1961), Bernardo (1979), Berger and

Bernardo (1992), Kass and Wasserman (1996), and Berger, Bernardo and Sun (2009).

Use of automatically specified priors is commonplace, especially in settings where a

single model is under consideration and the model is to be used for inference.

The impact of the prior πi can be enormous, as is most clearly seen in the fruitless attempt to compute the Bayes factor for comparing two models which have improper prior distributions. If one uses improper priors, say πi(θi) = ci hi(θi) or πi(θi) ∝ hi(θi), which can merely be specified up to unknown constants ci, formal calculation yields

BF12 = [ ∫ f1(Y | θ1) π1(θ1) dθ1 ] / [ ∫ f2(Y | θ2) π2(θ2) dθ2 ] = [ c1 ∫ f1(Y | θ1) h1(θ1) dθ1 ] / [ c2 ∫ f2(Y | θ2) h2(θ2) dθ2 ].

Since c1 and c2 are arbitrary, the Bayes factor cannot be determined.

To overcome this well-known obstruction to Bayesian model comparison, substan-

tial work on alternatives to the conventional Bayes factor has been done. Examples

include the pseudo-Bayes factor of Geisser and Eddy (1979), the posterior Bayes fac-

tor of Aitkin (1991), the fractional and partial Bayes factors of O’Hagan (1995) and

the intrinsic Bayes factor of Berger and Pericchi (1996a, 1996b, 2001, 2004). All of

these suggest using part (or all) of the data as training information to update model

specific priors, so that normalizing constants ci become well-defined. The key feature

of the updating is to have a pair of proper partial posterior distributions. More de-

tails will be discussed in Chapter 2. Interested readers should see Gelfand and Dey

(1994) and Vlachos and Gelfand (2003) for more discussion and comparison of these

alternatives.

Generally speaking, most of these works focus on “proper-izing” improper priors

to obtain computable alternatives. In practice, when there is little prior information

about parameters, indeed, a proper prior enables one to obtain well-defined Bayes

factors. However, we realize and plan to show that even when the priors are proper,

it is still beneficial to further update the priors with more of the data. In addition, we

notice that over a range of sample sizes, the Bayes factor can provide very different

model preferences with systematic patterns apparent. This suggests the need for a

new criterion for model comparison. The new criterion will be strongly tied to the

Bayes factor, but will acknowledge the arbitrariness of “vague, proper priors” and will yield more consistent model preferences as the sample size varies. In the next subsection, we give two examples of Bayes factors. One example aims to show the impact of priors; the other illustrates how misleading a Bayes factor can be.

1.4.2 Example 1: Lindley’s Paradox

The famous Lindley’s paradox is a good example to emphasize the importance of avoiding overly diffuse priors. It gives a situation where the Bayesian and frequentist approaches to a hypothesis testing problem give opposite results for certain choices of the prior. This phenomenon was first studied by Jeffreys (1961) and discussed by Lindley (1957) and Shafer (1982). Here, we borrow an illustrative example from

Shafer (1982) to help readers understand this interesting phenomenon. Suppose θ is the mean parameter of a normally distributed random variable Y. The

variance σ² is known. The null hypothesis H0 : θ = θ0 has prior probability π0 = 0.5

and the alternative hypothesis H1 : θ ∼ N(θ0, τ²) has prior probability 1 − π0 = 0.5. Thus, given a single observation Y = y and (1.3), we have

P(H0 | Y = y) / P(H1 | Y = y) = [ π0 / (1 − π0) ] · [ m0(Y = y) / m1(Y = y) ] = m0(Y = y) / m1(Y = y),

where mi(Y) represents the marginal density function under Hi evaluated at Y:

m0(Y) = (2πσ²)^(−1/2) exp( −(Y − θ0)² / (2σ²) )

and

m1(Y) = (2π(σ² + τ²))^(−1/2) exp( −(Y − θ0)² / (2(σ² + τ²)) ).

Under the frequentist approach, a small value of m0(Y) implies evidence against H0.

However, if τ is much larger than σ, perhaps m1(Y) is even smaller

than m0(Y ) at certain values of y, and so a Bayesian would conclude H0 : θ = θ0.

Therefore, in spite of strong evidence that θ is equal to or very near θ0, a sufficiently

diffuse prior can always lead to a choice of H0.
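The paradox is easy to reproduce numerically from the two marginal densities above. In the sketch below the observation sits three standard deviations from θ0 (a two-sided p-value of about 0.003), yet the posterior odds swing toward H0 as τ grows; all numeric settings (θ0 = 0, σ = 1, y = 3) are illustrative choices of ours:

```python
import math

def lindley_posterior_odds(y, theta0=0.0, sigma=1.0, tau=100.0, pi0=0.5):
    """Posterior odds P(H0 | y) / P(H1 | y) for the normal-mean test above."""
    m0 = math.exp(-(y - theta0) ** 2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)
    v1 = sigma**2 + tau**2  # marginal variance of Y under H1
    m1 = math.exp(-(y - theta0) ** 2 / (2 * v1)) / math.sqrt(2 * math.pi * v1)
    return (pi0 / (1.0 - pi0)) * (m0 / m1)

# y = 3 is three sigma from theta0; the odds favor H0 once tau is large enough
for tau in (10.0, 100.0, 1000.0):
    odds = lindley_posterior_odds(3.0, tau=tau)
    print(f"tau = {tau:7.1f}: posterior odds for H0 = {odds:.3f}")
```

At τ = 10 the odds still favor H1, but they cross 1 as τ grows, matching the discussion above: a sufficiently diffuse prior always ends up favoring H0.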

1.4.3 Example 2: An i.i.d. Case

The use of conventional priors can lead to additional troubles. The following

simulation illustrates this point. Data are independently and identically generated

from a univariate skew-normal distribution. The density function can be written as

f(y) = (2/ω) φ(y/ω) Φ(αy/ω),

where ϕ and Φ denote the PDF and CDF of a standard normal distribution. We set

α = 2.5 and ω = 1.5, which produces a variance approximately equal to one. A plot of such a density function is shown in Panel A of Figure 1.1.
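The claim that α = 2.5 and ω = 1.5 give a variance of roughly one can be checked against the standard skew-normal moment formula, Var(Y) = ω²(1 − 2δ²/π) with δ = α/√(1 + α²). The sketch below uses SciPy's skew-normal implementation; the simulation size and seed are our own choices:

```python
import numpy as np
from scipy import stats

alpha, omega = 2.5, 1.5  # shape and scale from the example
delta = alpha / np.sqrt(1 + alpha**2)
var_theory = omega**2 * (1 - 2 * delta**2 / np.pi)
print(f"theoretical variance = {var_theory:.3f}")  # 1.015, i.e. approximately one

# empirical check of f(y) = (2/omega) * phi(y/omega) * Phi(alpha * y / omega)
sn = stats.skewnorm(a=alpha, loc=0.0, scale=omega)
y = sn.rvs(size=100_000, random_state=np.random.default_rng(1))
print(f"sample variance      = {y.var():.3f}")
```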

Assume we have two such models to entertain:

1. M1, a Gaussian parametric model:

θ ∼ N(µ, τ²)

Yi | θ, σ² ∼iid N(θ, σ²),   i = 1, · · · , n.   (1.4)

2. M2, a mixture of Dirichlet processes (MDP) nonparametric model:

G ∼ DP(M = 2, N(µ, τ²))

θi | G ∼iid G,

Yi | θi, σ² ∼ N(θi, σ²),   i = 1, · · · , n.   (1.5)

11 The hyper-parameters in both models are assumed to be independent and to follow

the common prior distributions

µ ∼ N(0, 500),   σ² ∼ IG(7, 0.3),   τ² ∼ IG(11, 9.5).

It is common practice to match the predictive distribution for a single case under the

non-parametric Bayesian model to the predictive distribution under a corresponding

parametric model. Note that π1(µ, σ², τ²) = π2(µ, σ², τ²) guarantees

BF12(Y1) = 1,

for any Y1 ∈ (−∞, ∞). See Ishwaran and James (2002) and Griffin (2010) for ad-

ditional commentary on prior determination for MDP non-parametric models. In

addition, note that the strong priors for σ2 and τ 2 force the quantity σ2/(σ2 + τ 2) to concentrate near 1/20, which implies the likelihood functions proposed by the MDP priors tend to be ”fuzzy” (having a lot of small bumps). Also, note that neither model matches the true data-generating mechanism.

The MDP model is a semi-parametric Bayesian model of infinite dimension. Al-

ternative representations focus on finite-dimensional representations of a portion of

the model. These representations grow in size as the number of observations increases. For this MDP model, in the range of sample sizes we consider, a finite mixture

of normal densities can be viewed as an approximation to a density function drawn

from the model. Thus the MDP model has much wider support than conventional

parametric models, such as M1. Introduction and more discussion of the MDP model

can be found in Ferguson (1973) and Antoniak (1974). Sethuraman (1994) contains

an alternative construction of the Dirichlet process. A set of Markov chain Monte Carlo algorithms for fitting MDP models has been discussed and significantly developed by Escobar (1988, 1994), MacEachern (1994, 1998), Escobar and West (1995), Bush and MacEachern (1996), MacEachern and Müller (1998), MacEachern, Clyde

and Liu (1999), Neal (2000), Ishwaran and James (2001), Dahl (2003), Jain and Neal

(2004, 2007), Walker (2007), Papaspiliopoulos and Roberts (2008) and Kalli, Griffin,

and Walker (2011). A simple MCMC algorithm is used to obtain posterior draws

under M1.

(Figure 1.1 appears here. Panel B reference lines: Lim1 = −1.426, Lim2 = −1.398.)

Figure 1.1: Panel A presents the density function of the data generation process. Panel B presents the expected log-predictive likelihoods at some sample sizes with a pointwise confidence band, where Lim1 marks the expected log-likelihood of the best-fitting normal (with same mean and variance as f) and Lim2 marks that of the true density. Panel C presents the mean of the log Bayes factor against the sample size.

To estimate the Bayes factor in favor of the Gaussian parametric model against the MDP nonparametric model, we simulate 300 data sets each consisting of n = 350 observations, and then calculate the log Bayes factor based on each data set. The mean log Bayes factor is 1.93, corresponding to a Bayes factor of

B̂12 = e^1.93 ≈ 6.89,

which provides substantial support for the Gaussian parametric model according to

Jeffreys’ criterion.

On the other hand, we examine the final predictive performances of these two models by estimating the expected log predictive densities E[log mi(Yj |Y(−j))], where

Y(−j) is the data without Yj, j = 1, 2, ··· , 350. Note that, provided our models have

finite KL divergences from the skew-normal,

E[log m1(Yj | Y(−j))] − E[log m2(Yj | Y(−j))] = ∫ f(Yj) log [ f(Yj) / m2(Yj | Y(−j)) ] dYj − ∫ f(Yj) log [ f(Yj) / m1(Yj | Y(−j)) ] dYj

= KL(f, m2) − KL(f, m1), where f is the true skew-normal density and KL(f, ·) is the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) from f to another density function. Thus, comparing E[log mi(Yj | Y(−j))] is equivalent to comparing the closeness of the predictive densities mi(Yj | Y(−j)) to the true density f(Yj) under the KL divergence. In addition, since the KL divergence is non-negative, for any likelihood function m2 sampled from the posterior distribution of the MDP model after an update on a training

data set, we have KL(f, m2) ≥ 0, which implies

∫ f(Y) log [ f(Y) / m2(Y) ] dY = ∫ f(Y) log f(Y) dY − ∫ f(Y) log m2(Y) dY = Lim2 − E[log m2(Y)] ≥ 0.

We also know that under the Gaussian model M1, the posterior will asymptotically converge to a point mass at the best-fitting likelihood in the support of the prior. It is easy to see that

Lim1 − E[log m1(Y )] ≥ 0.

Based on the same 300 data sets, we obtain

Ê[log m1(Y351 | Y(1:350))] = −1.4267 and Ê[log m2(Y351 | Y(1:350))] = −1.4072,

with standard errors 0.002 and 0.002, which indicates that the Gaussian parametric model is inferior in predictive performance.

In summary, in this example, the Bayes factor and the predictive distributions provide different model preferences based on the same data. To understand this discrepancy, we take a closer look at the relationship between the log Bayes factor and the log predictive density. Direct calculation shows that

log BF12(Y) = log[m1(Y1:n) / m2(Y1:n)] = ∑j=1..n [log m1(Yj | Y1:(j−1)) − log m2(Yj | Y1:(j−1))],  (1.6)

that is, the log Bayes factor can be represented as the sum of a sequence of log predictive density differences. Thus, it can be viewed as the cumulative effect of the differences in the log predictive performances.
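Identity (1.6) can be checked numerically whenever the one-step-ahead predictive densities are available in closed form. The sketch below (illustrative only, not part of the dissertation's simulation study) uses two conjugate models Yj | θ ~ N(θ, 1) with θ ~ N(0, τ²), differing only in the prior variance τ²; the sum of sequential log predictives then reproduces the log marginal likelihood exactly:

```python
import numpy as np

def seq_log_predictives(y, tau2, sigma2=1.0):
    """Log predictive densities log m(y_j | y_{1:j-1}) for the conjugate model
    y_j | theta ~ N(theta, sigma2), theta ~ N(0, tau2)."""
    out = []
    post_prec, post_mean = 1.0 / tau2, 0.0
    for yj in y:
        pred_var = sigma2 + 1.0 / post_prec        # one-step-ahead predictive variance
        out.append(-0.5 * (np.log(2 * np.pi * pred_var)
                           + (yj - post_mean) ** 2 / pred_var))
        # conjugate posterior update after observing yj
        post_mean = (post_prec * post_mean + yj / sigma2) / (post_prec + 1.0 / sigma2)
        post_prec += 1.0 / sigma2
    return np.array(out)

rng = np.random.default_rng(0)
y = rng.normal(0.3, 1.0, size=50)
lp1 = seq_log_predictives(y, tau2=1.0)     # model 1: moderate prior
lp2 = seq_log_predictives(y, tau2=100.0)   # model 2: very diffuse prior
log_bf = float(np.sum(lp1 - lp2))          # log BF12 via (1.6)
```

Summing lp1 − lp2 over j gives log BF12 as in (1.6), while summing lp1 alone recovers log m1(Y1:n); the running partial sums trace out exactly the kind of cumulative-effect curve discussed in the text.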

In Figure 1.1, we plot E[log mi(Yj | Y1:(j−1))], i = 1, 2, and E[log B12] against various sample sizes. As shown in Panel B, when the sample size is small (between 20 and

125), the Gaussian parametric model provides better predictive performance than the

MDP nonparametric model. This is because the prior on the MDP model is much more diffuse and puts little mass around the true model and little mass around ap- proximations as good as a good-fitting Gaussian model. However, when the sample size is large enough (more than 150), the MDP model can provide a better approx- imation to the true model because of its larger support. Indeed, when the sample size n goes to infinity, the predictive density under the MDP nonparametric model appears to go to the true skew-normal distribution, while the predictive density under the Gaussian parametric model will converge to the Gaussian model closest to the true distribution (Berk 1966), that is, the Gaussian model with the same mean and the same variance as the true model.

Assembling this information, the expected log Bayes factor in favor of the Gaussian parametric model against the MDP nonparametric model, which is the sum of the expected log predictive density differences, should first increase when the sample size is small and then decrease when the sample size becomes larger. This is exactly what we observe in Panel C. At moderate sample sizes (between 150 and 350), although the predictive performance of the MDP model is better and the log Bayes factor is on a decreasing trend, the log Bayes factor is still positive and sizeable because of the big lead built up by the Gaussian parametric model at the beginning. In this sense,

the Bayes factor is very slow to reflect the change in predictive performance of the

two models.

1.4.4 Remarks

As shown in Figure 1.1, for the comparison of the Gaussian parametric model and the MDP model, the posterior predictive likelihood under M1 has a higher expectation when the training sample size is very small, which entails an increasing trend in the log Bayes factor by (1.6). However, due to the broader class of likelihoods supported by M2 and the departure from Gaussianity under the data-generating process, once there is enough training data (175 observations or more), the ordering of expected posterior predictive likelihoods between M1 and M2 switches. As a consequence, the corresponding log Bayes factor will start to turn down. However, at the same time, if the cumulative effect from smaller sample sizes has led to a large Bayes factor, this criterion fails to immediately produce the correct model preference. Furthermore, provided there is enough data, the cumulative effect will eventually be neutralized by adding more training observations, and then the Bayes factor will tend to pick the “better” model. Therefore, as long as a switch of model superiority based on posterior predictive performance exists as the training data size goes up, there will be a range of sample sizes for which the Bayes factor tends to produce a misleading model preference. Moreover, this range of sample sizes tends to be larger if the cumulative effect is very significant and the gap between expected posterior predictive likelihoods is very narrow. Interestingly, we realize, and will show, that there may exist more than one switching point of the expected posterior predictive likelihoods as the training sample size increases, so the log Bayes factor curve may

present two turning points. In those cases, we give an example to show that our new

criterion for Bayesian model comparison seems to work very well.

To sum up, using conventional parametric models to fit data eases model interpretation and description, and makes thrifty use of computational resources. Thus it has a long history of application and development. However, as a tradeoff, the limited form of the likelihood may constrain the predictive ability, motivating the introduction of a non-parametric aspect, resulting in semi-parametric models. To compare two models, from Jeffreys’ (1961) viewpoint, the marginal probability or the marginal likelihood of the observed data determines model preference. According to the examples mentioned in the last subsection, if a very diffuse prior distribution arbitrarily assigns too much prior “guess” to regions that are not evidenced by observed data, the impact of the likelihood f(Y | θ) on the marginal likelihood will nearly vanish. Therefore, our goal is to develop a new criterion, based on which the model preference will be more consistent with the superiority in terms of the expected predictive likelihood, which tends to be more robust to a change of priors.

Finally, it is worthwhile to point out that in conjugate cases, the integral defining the marginal likelihood (1.3) can be analytically evaluated. However, in complicated problems, the integration is often intractable, and so numerical methods must be used both to fit the model and to compute the Bayes factor. As computing capability has developed, several numerical methods of estimating this integral, such as Markov chain Monte Carlo (MCMC) methods, the Laplace approximation, importance sampling, and reciprocal importance sampling, have been proposed. A brief review of these approaches can be found in Chapter 2. More detailed description and discussion can be found in Gelfand and Smith (1990), Smith and Gelfand (1992), Smith and Roberts (1993), Gelfand and Dey (1994), Carlin and Chib (1995), Chib (1995), DiCiccio, Kass, Raftery and Wasserman (1997), Chib and Jeliazkov (2001), Basu and Chib (2003), Congdon (2004) and Lenk (2009). In the rest of this thesis, Chapter 3 defines the new criterion we propose for Bayesian hypothesis tests and presents solid examples of simulation and real data analysis when comparing one-sample models (or i.i.d. models). Chapter 4 carefully extends the discussion to the variable selection problem for the linear regression model and provides simulation and real data examples; there the identically-distributed assumption is relaxed. Chapter 5 wraps up with a discussion of this thesis and presents some interesting directions for future work related to the results in this dissertation.

Chapter 2: Computational Techniques

2.1 Markov Chain Monte Carlo Algorithms

The study of the posterior distribution

π(θ | Y) = f(Y | θ)π(θ) / ∫ f(Y | θ)π(θ)dθ  (2.1)

is of great interest to the Bayesian community, as inference about θ or about future

data connected to θ relies on the posterior. Unfortunately, when the model and

prior distribution do not form a conjugate pair, one cannot evaluate the integral

in the denominator of (2.1), and so there is no useful closed form for the posterior

distribution. The denominator of (2.1) is referred to as the marginal likelihood.

The discovery of Markov chain Monte Carlo algorithms has boosted Bayesian

statistics to new heights by providing a general tool for simulating observations from

the posterior distribution. The idea is to create an iterative sampling process whose

stationary distribution is the target distribution π(θ | Y ). In other words, if we are

able to construct a desired transition kernel properly, we can simulate a trajectory

of a Markov chain with this transition kernel, and the trajectory can be regarded as

samples from the target distribution. This ergodicity occurs after the chain has passed

the transient stage and the effect of the fixed starting value has become so small that

it can be ignored. Thus, given a set of such random draws, θ(1), θ(2), θ(3), ..., θ(G), from the posterior distribution (or the target distribution), one can estimate virtually any summary of interest of the posterior distribution directly from the simulation.

For a simple example, the posterior mean of a parameter θ, E[θ | Y], can be estimated by the sample mean of θ(1), θ(2), θ(3), ..., θ(G). The origin of this technique can be traced back to the 1950s. Due to the enormous literature on this technique, it is impossible to provide a complete perspective on MCMC and its applications. We refer the interested reader to Kass, Carlin, Gelman and Neal (1998) and Cappé and Robert (2000) for good reviews of MCMC methods and their application in statistics, and for an entrance to the literature. MCMC methods are described in many books, among which are Liu (2001), Chen, Shao and Ibrahim (2000) and Robert and Casella (2004).

For statistical users, there are two basic MCMC methods that should be under-

stood, the Gibbs sampler and the Metropolis-Hastings (M-H) algorithm. Both of

these methods are implemented in this thesis for generating observations from poste-

rior distributions. In this section, we provide a simple review of these two algorithms.

In addition, we will see in the next section that both of these algorithms can be easily

modified to produce MCMC estimates of the marginal likelihood.

2.1.1 The Gibbs Sampler

The Gibbs sampler was introduced in the context of image restoration by Geman

and Geman (1984). Gelfand and Smith (1990) generalized the technique, extending it

to a wide array of Bayesian models. The Gibbs sampler is now one of the best known

model fitting techniques. It is a special case of the Metropolis-Hastings algorithm,

which will be reviewed in the next subsection, and it has many advantages in terms of

ease of computation. A short review of the Gibbs sampler is given in this subsection.

Suppose we partition a vector of parameters θ into m blocks; say, θ = (θ1, θ2, . . . , θm).

Let θ(t) denote the state of θ at iteration t (t = 0, 1, 2,...). A general Gibbs sampler

can be described as follows

1. Select an initial value θ(0).

2. Iteratively update
   θ1(t+1) from f(θ1 | Y, θ2(t), ..., θm(t)),
   θ2(t+1) from f(θ2 | Y, θ1(t+1), θ3(t), ..., θm(t)),
   ...
   θm(t+1) from f(θm | Y, θ1(t+1), θ2(t+1), ..., θm−1(t+1)),

where f(θi | Y, θ1, ..., θi−1, θi+1, ..., θm) are referred to as the full conditional distributions.

When it is not analytically feasible to derive a full conditional distribution, or when

the distribution cannot be sampled directly, one may consider the rejection method

or the Metropolis-Hastings algorithm, leading to a hybrid Gibbs sampler. Note that

partitioning θ into the components θ1, θ2, . . . , θm replaces sampling θ by sequentially

sampling θ1, θ2, . . . , θm from the conditional distributions of lower-dimensional blocks of components. This significantly relieves the curse of dimensionality and improves the performance of the algorithm.
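The two steps above can be illustrated on a toy target where the full conditionals are known exactly: a bivariate normal with unit variances and correlation ρ. The sketch below is illustrative (the model, seeds and settings are assumptions, not drawn from the dissertation):

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter=20000, burn_in=2000, seed=1):
    """Gibbs sampler for (theta1, theta2) ~ N(0, [[1, rho], [rho, 1]]),
    using the exact full conditionals theta1 | theta2 ~ N(rho*theta2, 1 - rho^2)
    and symmetrically for theta2."""
    rng = np.random.default_rng(seed)
    theta1, theta2 = 0.0, 0.0                  # step 1: initial value
    cond_sd = np.sqrt(1.0 - rho ** 2)
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):                    # step 2: cycle through the blocks
        theta1 = rng.normal(rho * theta2, cond_sd)
        theta2 = rng.normal(rho * theta1, cond_sd)
        draws[t] = theta1, theta2
    return draws[burn_in:]

draws = gibbs_bivariate_normal(rho=0.8)
post_mean = draws.mean(axis=0)                 # posterior summaries from the draws
post_corr = np.corrcoef(draws.T)[0, 1]
```

Because each block is drawn from its exact full conditional, no proposal is ever rejected; the only practical concern is mixing speed, which degrades as ρ approaches one.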

2.1.2 The Metropolis-Hastings Algorithm

The Metropolis-Hastings (M-H) algorithm was originally developed by Metropolis,

Rosenbluth, Rosenbluth, Teller and Teller (1953) and generalized by Hastings (1970).

As we have mentioned, this method can be used when it is difficult to simulate

from full conditionals. This method requires a proposal distribution q(θ′ | ·), whose

22 density can be analytically evaluated and is easily sampled. The general procedure is described as follows

1. Select an initial value θ(0).

2. Sample θ′ from q(θ′ | θ(t)) and independently sample µ from the uniform distribution on (0, 1).

3. If µ ≤ α(θ′, θ(t)) = min{ [f(Y | θ′)π(θ′)q(θ(t) | θ′)] / [f(Y | θ(t))π(θ(t))q(θ′ | θ(t))], 1 }, then set θ(t+1) = θ′; else set θ(t+1) = θ(t).

Note that the Markov chain constructed in this fashion is guaranteed to be reversible.

The performance of an M-H algorithm can be evaluated by tracking the acceptance rate Pr(µ ≤ α(θ′, θ(t))). A huge acceptance rate (say ≥ 80%) implies that the draws approximately follow the proposal distribution, with the target distribution having little impact on acceptance. A high acceptance rate is often a consequence of a proposal distribution that focuses on small changes to θ. A tiny acceptance rate (say ≤ 10%) implies the chain has long stretches where it does not move, leading to a poor evaluation of the posterior. A low acceptance rate is often a consequence of a proposal distribution that proposes extremely large changes to θ. Roberts, Gelman and Gilks (1997) and Roberts and Rosenthal (2001) discuss optimal acceptance rates and how to monitor the acceptance rate as an indicator of the quality of the Metropolis-Hastings algorithm. The optimal acceptance rate can be achieved by adjusting the location and scale of the proposal distribution. One can simply use trial-and-error to accomplish this task.
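In the common special case of a symmetric random-walk proposal, the q-ratio in the acceptance probability cancels, and the algorithm above reduces to the sketch below. The target, proposal scale, and function names are illustrative assumptions:

```python
import numpy as np

def metropolis(log_target, theta0, prop_sd, n_iter=20000, seed=2):
    """Random-walk Metropolis: with the symmetric proposal
    q(theta' | theta) = N(theta, prop_sd^2), the q-terms in alpha cancel."""
    rng = np.random.default_rng(seed)
    theta, lt = theta0, log_target(theta0)
    draws, accepted = np.empty(n_iter), 0
    for t in range(n_iter):
        prop = theta + prop_sd * rng.normal()          # step 2: propose
        lt_prop = log_target(prop)
        if np.log(rng.uniform()) <= lt_prop - lt:      # step 3: accept/reject
            theta, lt = prop, lt_prop
            accepted += 1
        draws[t] = theta
    return draws, accepted / n_iter

# unnormalized log posterior proportional to a N(2, 1) density
draws, acc_rate = metropolis(lambda th: -0.5 * (th - 2.0) ** 2, 0.0, prop_sd=2.4)
```

Rerunning with prop_sd = 0.1 or prop_sd = 50 exhibits the high- and low-acceptance pathologies described above; a scale around 2.4 times the target standard deviation gives an acceptance rate near the one-dimensional optimum.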

2.1.3 Remarks

The Gibbs sampler is a special case of the Metropolis-Hastings algorithm with acceptance rate always equal to one. This is easily seen since, with q(θ′ | θ(t)) pro- portional to a full conditional distribution, α(θ′, θ(t)) reduces to one. Both the Gibbs sampler and the Metropolis-Hastings algorithm have their advantages and disadvan- tages. Conditional conjugacy makes determination of the full conditional distributions easier and often leads to an effective Gibbs sampler. M-H does not require compu- tation of the normalizing constant, and so is useful in non-conjugate settings. In addition, as we mentioned before, one can always combine the two algorithms when a closed form of the full conditional for a subset of θ is not available. No single pro- cedure dominates the others for all applications. The form of f(Y | θ)π(θ) determines which method should be applied.

In practice, the convergence and mixing of an MCMC algorithm are important. The justification for the algorithms is strictly asymptotic and there are few small-sample results. Considerable theoretical work has been devoted to evaluating the convergence of MCMC methods, as in Rosenthal (2002) and Jones and Hobert (2004). Empirical methods for assessing convergence have also been developed, for example, Gelman and Rubin (1992), Raftery and Lewis (1992), Geweke (1992), Roberts (1992), Ritter and Tanner (1992), Liu, Liu, and Rubin (1992), Garren and Smith (1993), Johnson

(1994), Zellner and Min (1995), and a review by Cowles and Carlin (1996). We do not present a review here. Instead, we merely note that a thorough assessment of the convergence and mixing of a proposed algorithm is required for sound practice.

2.2 Computation of Marginal Likelihood

As stated in (1.3), the Bayes factor is defined as the ratio of marginal likelihoods

under two models. Computation of the marginal likelihood involves an integral to

eliminate model specific parameters. When a hierarchical model is not fully conjugate,

such an integral may not be evaluated analytically; thus numerical methods must be

applied. In this section, we present a short review of a subset of such numerical

methods for computing the marginal likelihood.

2.2.1 The Laplace Approximation

First of all, the marginal likelihood can be expressed as

m(Y) = ∫ f(Y | θ)π(θ)dθ = f(Y | θ)π(θ) / π(θ | Y),  (2.2)

where π(θ | Y) represents the posterior distribution. Most posterior-simulation methods

are based on this equation, since under a typical parametric setting the likelihood

function and prior distribution are specified in a closed form and thus can be evaluated

analytically. Specifically, the Laplace approximation is obtained by using a normal

distribution to approximate the posterior distribution, which is, in turn, proportional

to f(Y | θ)π(θ). Using a multivariate normal kernel and a Taylor series expansion (up to the quadratic term), one can derive the mean and variance as θ̃ and Σ̃ = (Iθ̃)⁻¹, where θ̃ denotes the posterior mode and Iθ̃ is the observed Fisher information matrix, evaluated at θ = θ̃. The Laplace approximation of m(Y) becomes

(2π)^(d/2) |Σ̃|^(1/2) f(Y | θ̃)π(θ̃),

where d represents the dimension of θ. Of course, an accurate normal approximation of the posterior requires the likelihood function to be peaked around θ̃, which will be the case for large samples and a relatively diffuse prior distribution (e.g., Edwards, Lindman and Savage, 1963). Tierney and Kadane (1986) developed the Laplace

approximation and proved that the approximation error is of order O(n−1). A mod-

ification of the Laplace approximation can be found in DiCiccio, Kass, Raftery and

Wasserman (1997).

2.2.2 Monte Carlo Integration and Importance Sampling

Monte Carlo integration and importance sampling are common techniques for estimating expectations by simulation. They both aim at approximating the

integral in (2.2) directly. Note that π(θ) is a density function, so the integral can be written as Eπ(θ)[f(Y | θ)]. This allows us to estimate the marginal likelihood as

m̂(Y) = (1/N) Σi f(Y | θi),

if θi are independent samples from the prior distribution and Eπ(θ)[f(Y | θ)] exists.

The strong law of large numbers guarantees consistency of m̂(Y) for m(Y) as the

simulation sample size (N) grows. Furthermore, if Varπ(θ)[f(Y | θ)] is finite, the central limit theorem implies that

√N (m̂(Y) − m(Y)) →D N(0, Varπ(θ)[f(Y | θ)]).

Since this method is based on a large set of simulated samples from the prior distri-

bution, it is called Monte Carlo integration.
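A minimal sketch of Monte Carlo integration (an illustration with assumed model and seed): for Yi ~ N(θ, 1) with θ ~ N(0, 1), the exact marginal is a multivariate normal density, so the prior-sampling estimator can be checked directly. The average is formed on the log scale for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(0.5, 1.0, size=5)
N = 200000

theta = rng.normal(0.0, 1.0, size=N)            # independent draws from the prior
# log f(Y | theta_i) for every prior draw (model: Y_j | theta ~ N(theta, 1))
ll = -0.5 * (len(y) * np.log(2 * np.pi)
             + ((y[None, :] - theta[:, None]) ** 2).sum(axis=1))
# stabilized log of the average likelihood
log_m_hat = np.log(np.mean(np.exp(ll - ll.max()))) + ll.max()

# exact marginal: Y ~ N(0, I + 11') under the conjugate prior theta ~ N(0, 1)
n = len(y)
cov = np.eye(n) + np.ones((n, n))
_, logdet = np.linalg.slogdet(cov)
log_m_exact = -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(cov, y))
```

With only n = 5 observations the prior overlaps the posterior well and the estimator is accurate; as n grows, most prior draws have negligible likelihood and the method becomes inefficient, which motivates importance sampling.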

Consider (2.2) again. Suppose that sampling from the prior distribution is very

difficult or expensive, but there exists another density h(θ) that is very close to π(θ)

and is easily sampled. Then one can rewrite the integral (2.2) as

m(Y) = ∫ f(Y | θ)π(θ)dθ = ∫ [f(Y | θ)π(θ) / h(θ)] h(θ)dθ = Eh(θ)[f(Y | θ)π(θ) / h(θ)].

This expression suggests drawing θi independently from h(θ) and using a weighted

average to estimate m(Y ):

m̂(Y) = (1/N) Σi f(Y | θi)ω(θi),

with weights ω(θi) = π(θi)/h(θi). Note that if π(θ) is known only up to an unknown

constant, one can empirically standardize the weights, replacing ω(θi) with

ω∗(θi) = N ω(θi) / Σj ω(θj),  (2.3)

so that (1/N) Σi ω∗(θi) = 1, mirroring the fact that Eh(θ)[ω(θ)] = 1. Generally speaking, a wise choice of h(θ) requires experience and a deep understanding of Monte Carlo integration; we do not pursue it further here. However, one can assess the performance of importance sampling by computing the effective sample size. When π(θ) is exactly known and the unstandardized weights are used, the effective sample size (ESS) is

ESS = N / (1 + v̂ar[ω(θ)]).

When π(θ) is known up to a constant and the standardized weights are used as in

(2.3), the effective sample size (ESS) is

ESS = N / (1 + ĉv²[ω∗(θ)]),

where ĉv is the sample standard deviation divided by the sample mean, i.e., the sample coefficient of variation.
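The following sketch (with illustrative assumptions: Yi ~ N(θ, 1), θ ~ N(0, 1), and an importance density h that is a normal centered at the sample mean) estimates m(Y) by importance sampling and computes the ESS from the standardized weights (2.3):

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(0.5, 1.0, size=5)
N = 50000

def log_norm_pdf(x, mean, sd):
    return -0.5 * np.log(2 * np.pi) - np.log(sd) - 0.5 * ((x - mean) / sd) ** 2

theta = rng.normal(y.mean(), 1.0, size=N)       # draws from h(theta) = N(ybar, 1)
log_w = log_norm_pdf(theta, 0.0, 1.0) - log_norm_pdf(theta, y.mean(), 1.0)  # log pi/h
ll = -0.5 * (len(y) * np.log(2 * np.pi)
             + ((y[None, :] - theta[:, None]) ** 2).sum(axis=1))
log_terms = ll + log_w                           # log of f(Y|theta) * pi(theta)/h(theta)
log_m_hat = np.log(np.mean(np.exp(log_terms - log_terms.max()))) + log_terms.max()

# standardized weights (2.3) and the effective sample size
w = np.exp(log_w - log_w.max())
w_std = N * w / w.sum()
ess = N / (1.0 + (w_std.std() / w_std.mean()) ** 2)
```

A badly mismatched h (for example, one much narrower than π) drives the ESS toward a tiny fraction of N, which is the practical warning sign mentioned in the text.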

2.2.3 The Harmonic Mean Estimator

The harmonic mean estimator of the marginal likelihood first appeared in Newton and Raftery (1994). This method is based on sampling from the posterior distribution, which can be accomplished by running MCMC. Since π(θ | Y)/f(Y | θ) = π(θ)/m(Y), integrating both sides with respect to θ shows that the marginal likelihood can be rewritten as

m(Y) = [ ∫ π(θ | Y) / f(Y | θ) dθ ]⁻¹.

Thus a harmonic mean estimator can be computed from the sample as

m̂(Y) = [ (1/N) Σi f⁻¹(Y | θi) ]⁻¹.

The consistency of the estimator follows from the law of large numbers. Computation of the estimator is transparent.

However, this estimator might not satisfy a central limit theorem. Consider an individual term in Σi f⁻¹(Y | θi), which we denote by f⁻¹(Y | θ). For the first moment,

Eπ(θ|Y)[f⁻¹(Y | θ)] = ∫ [1/f(Y | θ)] · [f(Y | θ)π(θ)/m(Y)] dθ = ∫ π(θ)/m(Y) dθ = 1/m(Y),

but for the second moment,

Eπ(θ|Y)[f⁻²(Y | θ)] = ∫ [1/f²(Y | θ)] · [f(Y | θ)π(θ)/m(Y)] dθ = (1/m(Y)) ∫ π(θ)/f(Y | θ) dθ,

which is not necessarily finite.

When a central limit theorem does not hold, in practice, a θi with very small

likelihood may be sampled, leading to a huge impact on the average. Lenk (2009)

notes that the harmonic mean estimator systematically overestimates the marginal

likelihood and proposes several corrections to improve the accuracy of the estimator

and its performance in model selection problems.
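The harmonic mean estimator is simple to compute from posterior draws, as the sketch below illustrates for a conjugate normal model where the exact marginal is available for comparison (the model and seed are illustrative assumptions); the average of f⁻¹ is again formed on the log scale:

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(0.5, 1.0, size=5)
n, G = len(y), 100000

# exact conjugate posterior of theta (prior N(0, 1), likelihood N(theta, 1))
post_mean, post_sd = n * y.mean() / (n + 1), 1.0 / np.sqrt(n + 1)
theta = rng.normal(post_mean, post_sd, size=G)   # "posterior draws" (here exact)

ll = -0.5 * (n * np.log(2 * np.pi)
             + ((y[None, :] - theta[:, None]) ** 2).sum(axis=1))
# harmonic mean: m_hat = [ (1/G) sum_i f^{-1}(Y | theta_i) ]^{-1}, on the log scale
neg_ll = -ll
log_m_hm = -(np.log(np.mean(np.exp(neg_ll - neg_ll.max()))) + neg_ll.max())

# exact marginal for comparison: Y ~ N(0, I + 11')
cov = np.eye(n) + np.ones((n, n))
_, logdet = np.linalg.slogdet(cov)
log_m_exact = -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(cov, y))
```

Even in this easy conjugate example the estimate is dominated by the rare posterior draws with the smallest likelihood; repeating the run with different seeds shows much larger run-to-run variability than the Monte Carlo and importance sampling estimators, in line with the possibly infinite second moment derived above.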

2.2.4 Chib’s MCMC-based Methods

Chib has proposed a series of MCMC-based methods for estimating the logarithm

of the marginal likelihood. The motivation for his methods is that

log m(Y ) = log f(Y | θ) + log π(θ) − log π(θ | Y ), (2.4)

which can be obtained by taking logarithm of both sides of (2.2). For most parametric

models, both f(Y | θ) and π(θ) are easy to evaluate. A good estimator of log π(θ | Y) will then produce a good estimator of the marginal likelihood, since expression (2.4) holds for all θ on the support of the posterior distribution. To estimate the marginal likelihood, one can evaluate the expression at an arbitrary value of θ, say θ∗. This gives one flexibility in the choice of θ∗. Chib (1995) claimed that for a given number of posterior draws, the density is likely to be more accurately estimated at a high density point. A common choice is to set the coordinates of θ∗ equal to their (estimated) posterior means. We do this in the sequel, and the results are stable and accurate.

An additional advantage shared by most of Chib’s methods is that they are compatible with MCMC simulation, especially Metropolis-Hastings and Gibbs sampling methods, with only a little modification. Thus they are very easy to program, once one has already written the code to implement an MCMC fit of a model.

A Metropolis-Hastings-based Method

First, we present a brief review of the estimation of π(θ∗ | Y) based on the Metropolis-Hastings algorithm. This method was proposed by Chib and Jeliazkov (2001). Recall that q(θ∗ | θ) denotes the proposal density for the transition from θ to θ∗ and define

α(θ∗, θ | Y) = min{ [f(Y | θ∗)π(θ∗)q(θ | θ∗)] / [f(Y | θ)π(θ)q(θ∗ | θ)], 1 }.  (2.5)

Therefore the transition kernel (from θ to θ∗) of the M-H algorithm is p(θ∗, θ | Y ) =

α(θ∗, θ | Y )q(θ∗ | θ). From the reversibility of the Markov chain under M-H, we have

p(θ∗, θ | Y )π(θ | Y ) = p(θ, θ∗ | Y )π(θ∗ | Y )

α(θ∗, θ | Y )q(θ∗ | θ)π(θ | Y ) = α(θ, θ∗ | Y )q(θ | θ∗)π(θ∗ | Y ).

Upon integrating both sides of this equation with respect to θ, one obtains

π(θ∗ | Y) = [∫ α(θ∗, θ | Y)q(θ∗ | θ)π(θ | Y)dθ] / [∫ α(θ, θ∗ | Y)q(θ | θ∗)dθ] = Eπ(θ|Y)[α(θ∗, θ | Y)q(θ∗ | θ)] / Eq(θ|θ∗)[α(θ, θ∗ | Y)],

which implies that a consistent estimator of the posterior ordinate can be obtained as

π̂(θ = θ∗ | Y) = [ (1/M) Σj α(θ∗, θ(j) | Y)q(θ∗ | θ(j)) ] / [ (1/N) Σi α(θ(i), θ∗ | Y) ],

where {θ(j)} are the M draws from the posterior distribution and {θ(i)} are sampled

N times from q(θ | θ∗). For a sound implementation, M and N should be fairly

large numbers. Examples of this method and a brief discussion of its simulation error

can be found in Chib and Jeliazkov (2001). In the event that the M-H algorithm is embedded as a particular sampling step of a Gibbs sampler, say sampling θ1 from the full conditional f(θ1 | θ0, Y) where {θ0, θ1} represents a partition of θ, one can still employ the idea to estimate the conditional posterior ordinate π(θ1 = θ1∗ | θ0, Y) by computing

π̂(θ1 = θ1∗ | θ0, Y) = [ (1/M) Σj α(θ1∗, θ1(j) | θ0, Y)q(θ1∗ | θ1(j)) ] / [ (1/N) Σi α(θ1(i), θ1∗ | θ0, Y) ],

where {θ1(j)} are the M draws from the distribution π(θ1 | θ0, Y) and {θ1(i)} are the N

draws from q(θ1 | θ1∗). All the quantities involved should be accessible from running the M-H step. Furthermore, the full posterior ordinate π(θ1 = θ1∗, θ0 | Y) can be estimated by π̂(θ1 = θ1∗ | θ0, Y)π̂(θ0 | Y). A Gibbs-sampling-based estimate of π̂(θ0 = θ0∗ | Y), proposed by Chib (1995), will be discussed in the next subsection.

A Gibbs-based Method

Recall that Gibbs sampling is one of the most widely used MCMC techniques,

especially for those models with conditionally conjugate priors, so that the full condi-

tionals can be obtained in standard form and easily sampled. Chib (1995) proposed

a Gibbs-based method for estimating the posterior ordinate π(θ | Y ), which does not

require new programming and is straightforward to implement. As mentioned earlier,

the Gibbs sampler requires a partition of θ = (θ1, θ2, . . . , θm). Correspondingly, this

method splits the full posterior ordinate and estimates each sub-ordinate separately.

In other words,

π(θ∗ = (θ1∗, θ2∗, ..., θm∗) | Y) = π(θ1∗ | Y)π(θ2∗ | Y, θ1∗) ... π(θm∗ | Y, θ1∗, θ2∗, ..., θm−1∗).  (2.6)

We can see that the first term on the right hand side is a reduced conditional, which

can be estimated by

π̂(θ1∗ | Y) = (1/G) Σg π(θ1∗ | Y, θ2(g), ..., θm(g)),  (2.7)

where {θ2(g), ..., θm(g)} represent draws from the posterior. Evaluating the full conditional π(θ1∗ | Y, θ2, ..., θm) should not be difficult, because of conditional conjugacy. For the other reduced conditional terms on the right hand side of (2.6), the apparent estimator is

π̂(θr∗ | Y, θ1∗, ..., θr−1∗) = (1/G) Σg π(θr∗ | Y, θ1∗, ..., θr−1∗, θr+1(g), ..., θm(g)),  (2.8)

where {θr+1(g), ..., θm(g)} represents G (a large integer) draws from π(θr+1, ..., θm | Y, θ1∗, ..., θr−1∗). A simple solution to this sampling complication is to continue sampling for additional iterations, but to fix {θ1, ..., θr−1} at {θ1∗, ..., θr−1∗} while running these additional iterations. Chib (1995) also discusses calculation of the numerical standard error of the estimator.
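A compact illustration of the Gibbs-based method (an assumed model and settings, not an example from the dissertation): a normal sample with semi-conjugate priors µ ~ N(0, τ²) and σ² ~ inverse-gamma(a, b), a two-block Gibbs sampler, and the Chib (1995) estimate of log m(Y) via (2.4), with π̂(µ∗ | Y) the Rao-Blackwellized average (2.7) and π(σ²∗ | Y, µ∗) available in closed form:

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(6)
y = rng.normal(1.0, 1.5, size=30)
n, ybar = len(y), y.mean()
tau2, a, b = 10.0, 3.0, 3.0            # priors: mu ~ N(0, tau2), sig2 ~ IG(a, b)

def log_lik(mu, sig2):
    return -0.5 * (n * np.log(2 * np.pi * sig2) + np.sum((y - mu) ** 2) / sig2)

def log_norm(x, m, v):                  # normal log density
    return -0.5 * (np.log(2 * np.pi * v) + (x - m) ** 2 / v)

def log_ig(x, shape, rate):             # inverse-gamma log density
    return shape * np.log(rate) - lgamma(shape) - (shape + 1) * np.log(x) - rate / x

# two-block Gibbs sampler for (mu, sig2); both full conditionals are standard
G, burn, mu, sig2 = 21000, 1000, ybar, 1.0
mus, sig2s = np.empty(G), np.empty(G)
for g in range(G):
    v1 = 1.0 / (1.0 / tau2 + n / sig2)
    mu = rng.normal(v1 * n * ybar / sig2, np.sqrt(v1))
    sig2 = 1.0 / rng.gamma(a + n / 2, 1.0 / (b + 0.5 * np.sum((y - mu) ** 2)))
    mus[g], sig2s[g] = mu, sig2
mus, sig2s = mus[burn:], sig2s[burn:]

# Chib (1995): evaluate (2.4) at a high-density point (mu*, sig2*)
mu_s, sig2_s = np.median(mus), np.median(sig2s)
v1s = 1.0 / (1.0 / tau2 + n / sig2s)
terms = log_norm(mu_s, v1s * n * ybar / sig2s, v1s)      # Rao-Blackwellized (2.7)
log_post_mu = np.log(np.mean(np.exp(terms - terms.max()))) + terms.max()
log_post_sig2 = log_ig(sig2_s, a + n / 2, b + 0.5 * np.sum((y - mu_s) ** 2))
log_m_chib = (log_lik(mu_s, sig2_s) + log_norm(mu_s, 0.0, tau2) + log_ig(sig2_s, a, b)
              - log_post_mu - log_post_sig2)
```

Because m = 2 here and σ² is the last block, its reduced conditional is the full conditional, so no additional Gibbs run with fixed blocks is needed; for m > 2 the continued-sampling scheme described above supplies the draws required by (2.8).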

A Method for the MDP Model

These two MCMC-based methods have proven successful for estimating the poste-

rior ordinate for parametric Bayesian models. Basu and Chib (2003) gave an extensive

discussion on how to apply them, especially the Gibbs-based method, for estimating

the marginal likelihood of a mixture of Dirichlet processes (MDP) model. In this

subsection, we give a brief review of the Dirichlet process model and Basu and Chib’s

method for estimating the marginal likelihood.

First, following the notation used in Basu and Chib (2003), a mixture of Dirichlet

processes model can be generally described as

Ψ = (Φ, κ, M) ∼ π

G | M,G0 ∼ DP (M,G0(· | κ))

θ1, . . . , θn | G ∼iid G,

Yi | θi, Φ, xi ∼ f(· | θi, Φ, xi), i = 1, ··· , n, (2.9)

where the {xi} are fixed covariates. The θi are latent parameters and the f(·|θi, Φ, xi) fall in a parametric family of densities. A typical example of the MDP model has already been given as Example 2 in Section 1.4.

The key feature of the model is that, instead of assuming a particular parametric

family for G, the distribution is modeled by a Dirichlet process (DP), so that G

has an extremely large support. The MDP model was first introduced by Ferguson

(1973) and further studied by Antoniak (1974). Escobar (1988, 1994) created the first

Markov chain Monte Carlo sampler for a covariate-free version of the MDP model,

making use of Blackwell and MacQueen’s (1973) Pólya urn characterization of the

Dirichlet process. Escobar and West (1995) introduced MCMC sampling for the mass

parameter M, MacEachern (1994) identified slow mixing of the MCMC algorithm

and described a collapsed algorithm, Bush and MacEachern (1996)

drew the distinction between fixed and random effects and introduced a re-mixing

step to improve mixing, MacEachern and Müller (1998) and Neal (2000) developed

algorithms for non-conjugate MDP models. Dahl (2003) and Jain and Neal (2004)

developed split-merge moves to further improve mixing. Alternatively, sampling G

from a DP can be expressed in terms of the “stick-breaking” construction described

33 by Sethuraman and Tiwari (1982) and Sethuraman (1994), which follows from the

Pólya urn scheme:

G(·) = ∑i≥1 pi δi(·),  (2.10)

where δi ∼iid G0(· | κ) (i = 1, 2, ...) are point masses,

p1 = V1,
pi = Vi ∏j<i (1 − Vj), i = 2, 3, ...,
Vi ∼iid Beta(1, M), i = 1, 2, ....

If one truncates the sum of (2.10) at a large integer, the model becomes a finite mixture

model of the sort discussed by Diebolt and Robert (1994). The resulting finite mixture

can be very close to the MDP model under a variety of metrics (Ishwaran and James,

2001), and finite mixture techniques can be used to sample from this approximation.
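The stick-breaking construction (2.10) is straightforward to simulate once truncated. The sketch below (illustrative names and settings, not from the dissertation) draws a truncated G ~ DP(M, G0) with G0 = N(0, 1), then generates data from the resulting MDP model as in (2.9):

```python
import numpy as np

rng = np.random.default_rng(7)

def truncated_dp_sample(M, base_sampler, K):
    """One stick-breaking draw of G ~ DP(M, G0), truncated at K atoms as in (2.10)."""
    V = rng.beta(1.0, M, size=K)
    V[-1] = 1.0                                  # make the truncated weights sum to one
    p = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    atoms = base_sampler(K)                      # iid atom locations from G0
    return p, atoms

p, atoms = truncated_dp_sample(M=2.0, base_sampler=lambda k: rng.normal(0.0, 1.0, k), K=200)

# theta_1, ..., theta_n | G ~ G, then Y_i | theta_i ~ N(theta_i, 0.5^2), as in (2.9)
theta = rng.choice(atoms, size=500, p=p)
Y = rng.normal(theta, 0.5)
```

Setting the last V to one forces the truncated weights to sum to exactly one; Ishwaran and James (2001) discuss how large the truncation level K must be for this finite approximation to be accurate.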

The second goal of this subsection is to present the method of Basu and Chib

(2003) for estimating the marginal likelihood of a MDP model. Recall that one

can split the marginal likelihood into three pieces and evaluate them at any point

θ∗ = (Φ∗, κ∗,M ∗) in the support of the posterior. Under this model, the pieces are

the prior ordinate π(Φ∗, κ∗, M∗), the likelihood ordinate L(Y | Φ∗, κ∗, M∗, G0) and the posterior ordinate π(Φ∗, κ∗, M∗ | Y, G0). Note that the likelihood ordinate is

L(Y | Φ∗, κ∗, M∗, G0) = ∫ f(Y | Φ∗, G)dP(G | M∗, G0, κ∗),

since the distribution for G is a distribution over the space of distributions. This sug-

gests that we may encounter difficulties when evaluating the integral in the likelihood

ordinate.

First, using the MCMC-based methods we reviewed in the last two subsections, the posterior ordinate can be divided into a set of subordinates. Each can be estimated within the MCMC algorithm when fitting the model, either by running M-H or Gibbs sampling, or a combination of both if necessary. Based on collapsed sequential importance sampling (MacEachern, Clyde and Liu, 1999), Basu and Chib proposed a method for estimating the likelihood ordinate efficiently for the conjugate MDP model. First, let Yi denote the ith observation in Y and Y(i) denote the first i observations

in Y. {θ1,i−1∗, ..., θki−1,i−1∗} represents the set of ki−1 unique values in {θ1, ..., θi−1}, and {n1,i−1, ..., nki−1,i−1} are the numbers of times that each of {θ1,i−1∗, ..., θki−1,i−1∗} appears in {θ1, ..., θi−1}, respectively. This method can be described as follows. Compute µ1 = f(Y1 | Ψ∗) = ∫ f(Y1 | θ, Φ∗)dG0(θ | κ∗) and set s1 equal to one. Then, for i = 2, ..., n, perform the following steps sequentially:

1. Compute the predictive probability

µi(g) = f(Yi | Y(i−1), s(i−1)(g), Ψ∗, G0)
  = [M∗/(M∗ + i − 1)] ∫ f(Yi | θ, Φ∗)dG0(θ | κ∗) + ∑j=1..ki−1 [nj,i−1/(M∗ + i − 1)] ∫ f(Yi | θ, Φ∗)dHj,i−1(θ | κ∗),

where Hj,i−1(θ | κ∗) is the posterior distribution of θ based on the prior G0 and

observations {Yl : l ≤ i − 1 and sl = j}. When f and G0 are conjugate, both

integrals can be obtained in closed form.

2. Draw si(g) from the categorical distribution

P(si = j | Y(i), s(i−1)(g), Ψ∗)
  = c [M∗/(M∗ + i − 1)] ∫ f(Yi | θ, Φ∗)dG0(θ | κ∗),  j = ki−1 + 1,
  = c [nj,i−1/(M∗ + i − 1)] ∫ f(Yi | θ, Φ∗)dHj,i−1(θ | κ∗),  1 ≤ j ≤ ki−1,

where c is the normalizing constant.

Finally, calculate ω(g) = ∏i=1..n µi(g), and then estimate the likelihood ordinate as

L̂(Y | Ψ∗, G0) = (1/G) ∑g=1..G ω(g).

More detail and examples regarding this method can be found in Basu and Chib

(2003).
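For the conjugate normal case Yi | θi ~ N(θi, σ²) with G0 = N(0, τ²) (no covariates, and Φ∗ = σ² held fixed), both integrals in the steps above are Gaussian, and the sequential importance sampling estimator of the likelihood ordinate can be sketched as follows; the function names, particle count, and numerical settings are illustrative assumptions:

```python
import numpy as np

def log_norm(x, m, v):
    return -0.5 * (np.log(2 * np.pi * v) + (x - m) ** 2 / v)

def sis_log_likelihood_ordinate(y, M, sig2, tau2, n_particles=2000, seed=8):
    """Collapsed sequential importance sampling estimate (on the log scale) of the
    likelihood ordinate for the conjugate model Y_i | theta_i ~ N(theta_i, sig2),
    G0 = N(0, tau2), following the two steps described above."""
    rng = np.random.default_rng(seed)
    log_w = np.empty(n_particles)
    for g in range(n_particles):
        counts, sums, lw = [], [], 0.0            # per-cluster sizes n_j and sums of y
        for i, yi in enumerate(y):
            comps = []
            for nj, sj in zip(counts, sums):      # join existing cluster j
                vj = 1.0 / (1.0 / tau2 + nj / sig2)   # H_{j,i-1} is N(mj, vj)
                mj = vj * sj / sig2
                comps.append(np.log(nj / (M + i)) + log_norm(yi, mj, sig2 + vj))
            # open a new cluster (theta drawn from G0)
            comps.append(np.log(M / (M + i)) + log_norm(yi, 0.0, sig2 + tau2))
            comps = np.array(comps)
            log_mu_i = np.logaddexp.reduce(comps)  # log predictive, step 1
            lw += log_mu_i
            probs = np.exp(comps - log_mu_i)
            j = rng.choice(len(probs), p=probs / probs.sum())   # step 2
            if j == len(counts):
                counts.append(1); sums.append(yi)
            else:
                counts[j] += 1; sums[j] += yi
        log_w[g] = lw
    m = log_w.max()
    return float(np.log(np.mean(np.exp(log_w - m))) + m)

log_L = sis_log_likelihood_ordinate(np.array([0.3, -0.5, 1.2]), M=1.0, sig2=1.0, tau2=2.0)
```

For n = 2 every particle produces the same weight, so the estimate matches the exact two-observation marginal under the Pólya urn; sampling variability enters only from n = 3 onward, when the configuration of {s1, ..., si−1} differs across particles.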

Chapter 3: The Calibrated Bayes Factor

In this chapter, we briefly review various forms of Bayes factors, propose our method of the calibrated Bayes factor and investigate its theoretical properties. The major thrust of this thesis is to propose a model selection method called the calibrated

Bayes factor, which calibrates the prior distributions to a certain “information” level before starting computation of the Bayes factor. To increase the amount of information in a diffuse prior, we use part of the data as a training sample, so that the posterior reaches the desired information level. We then use this partial posterior distribution as a new prior and compute the Bayes factor based on the remainder of the data. To calibrate the prior distribution to a certain information level, we first need a metric to measure the amount of information contained in a prior/posterior distribution, and then need to set a target to define the desired level of information.

This should depend on the class to which a particular model belongs. In addition, we discuss the desired level of information for the one-sample model and present a set of examples in this chapter. We extend the idea to regression models in the next chapter.

3.1 Alternative Forms of the Bayes Factor

We assume that the observations are independently distributed, thus we randomly sample multiple training sets with a fixed size from the data, compute the Bayes factor based on each training sample, and then average these Bayes factors to make model choices. This idea can be traced back to Lempers (1971), who used half of the data as a training sample to update improper prior distributions, such that the posterior distributions are proper. Following his idea, several new schemes of selection of training samples have been proposed in the Bayesian community, which, in turn, has led to a set of alternative Bayes factors. All of these methods assume that the models under comparison are parametric and that the prior distributions are improper. Thus, these methods cannot be directly applied to model comparisons involving nonparametric models, where the above problem of the Bayes factor is more pronounced, because in these models a proper prior distribution is generally required to produce a proper posterior distribution. In contrast, our method makes no assumption on model forms (parametric or nonparametric) or on the integrability of priors (proper or improper), and so is applicable in a larger variety of model comparison problems. The details of the calibrated Bayes factor method are described in this chapter. We begin by giving a review of these well-known alternative Bayes factors. A discussion of the asymptotic behavior of these criteria can be found in

Gelfand and Dey (1994).

3.1.1 Pseudo-Bayes Factors

Geisser and Eddy (1979) extended Lempers’ (1971) approach and proposed the

“pseudo-Bayes factor,” which uses a set of n − 1 observations as the training sample.

Given a set of independent data, according to (1.1), the marginal likelihood under model \(M_k\) is \(m_k(Y) = \int \big[\prod_i f_k(Y_i \mid \theta_k)\big]\pi_k(\theta_k)\,d\theta_k\). The authors suggested that one compare the following quantity

\[
m_k^{Ps}(Y) = \prod_{i=1}^{n} f_k(Y_i \mid Y_{-i}) = \prod_{i=1}^{n} \int f_k(Y_i \mid \theta_k)\,\pi_k(\theta_k \mid Y_{-i})\,d\theta_k,
\]

where Y−i = (Y1,Y2,...,Yi−1,Yi+1,...,Yn), and choose the model with the greatest

\(m_k^{Ps}(Y)\) value as the most appropriate model. Their argument includes two points. First, it is apparent that if the Bayesian marginal density is used, then the joint distribution of \(Y\) is predictively dependent; however, for computational simplicity, the product of the conditional predictive densities is used rather than the joint predictive density. Second, the conditional predictive density of \(Y_i\) depends both on the whole remainder of the data, \(Y_{-i}\), and on the prior distribution, whereas the joint predictive density depends only on the prior distribution. From the authors' point of view, this tends to put too much emphasis on the prior distribution. Rather than answering the question "Which model best explains the observed data?", the pseudo-Bayes factor tends to select the model which best predicts future observations from the same data-generating process, since the criterion is to maximize the product of the conditional predictive densities.
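As a sketch, for a normal model with known variance \(\sigma^2\) and a conjugate \(N(0, v_0)\) prior on the mean, each leave-one-out predictive density entering \(m_k^{Ps}(Y)\) is available in closed form. The helper below uses hypothetical names (it is not from Geisser and Eddy) and accumulates the log pseudo-marginal; the pseudo-Bayes factor is then the difference of two such quantities.

```python
import math

def norm_pdf(y, mean, var):
    """Gaussian density with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_pseudo_marginal(y, sigma2, v0):
    """log m^Ps(Y) = sum_i log f(y_i | Y_{-i}) for Y_i ~ N(theta, sigma2),
    theta ~ N(0, v0); each leave-one-out predictive is normal."""
    total = 0.0
    for i in range(len(y)):
        rest = y[:i] + y[i + 1:]
        # conjugate posterior of theta given Y_{-i}
        post_var = 1.0 / (1.0 / v0 + len(rest) / sigma2)
        post_mean = post_var * sum(rest) / sigma2
        # predictive density of the held-out point
        total += math.log(norm_pdf(y[i], post_mean, sigma2 + post_var))
    return total
```

When the prior collapses to a point mass at zero, each predictive reduces to \(N(y_i; 0, \sigma^2)\), which gives a quick numerical check on the conjugate updates.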

3.1.2 Posterior Bayes Factors

Aitkin (1991) proposed the idea of the posterior Bayes factor. For model comparison, he suggested using the criterion
\[
m_k^{Po}(Y) = \int f_k(Y \mid \theta_k)\,\pi_k(\theta_k \mid Y)\,d\theta_k,
\]

where \(\pi_k(\theta_k \mid Y)\) is the full posterior distribution under model \(M_k\). The motivation is that in a comparison of nested models, an objective diffuse prior tends to reduce the marginal probability of the observed data to zero, which should not be considered as strong support for the null model (simpler model) but as strong rejection of the more complex model. Therefore, in spite of a lack of prior information, a tenable prior should assign enough weight to reasonable parts of the parameter space when one is to compare the models. The author claimed that use of the posterior distribution in place of the prior distribution has several advantages, including reduced sensitivity to variations in the prior and the avoidance of Lindley's paradox (Example 1). However, it is transparently clear that this criterion suffers from non-coherence, resulting from double use of the data (in both the likelihood and the posterior). In fact, this criticism can be lessened by using part of the data to update the prior distribution to a partial-posterior distribution and then using the remainder of the data for model comparison.

3.1.3 Partial and Fractional Bayes Factors

The partial Bayes factor and fractional Bayes factor were introduced by O'Hagan (1991, 1995). For the partial Bayes factor, O'Hagan (1991) proposed using a proportion \(b\) of the data (represented as \(Y^{(b)}\)) for training and computing the marginal likelihood for the rest of the data (represented as \(Y^{(-b)}\)). Thus
\[
m_k^{Pa}(Y) = \int f_k(Y^{(-b)} \mid \theta_k)\,\pi_k(\theta_k \mid Y^{(b)})\,d\theta_k.
\]

O'Hagan states that the partial Bayes factor is coherent. Note that under the partition scheme \(Y = \{Y^{(b)}, Y^{(-b)}\}\), the size of the training data set will go to infinity as the sample size grows.

More recently, O'Hagan (1995) proposed another criterion for model comparison,

named the fractional Bayes factor. To avoid the arbitrariness of selecting a training

sample, the author proposed using a fractional power b (0 < b < 1) of the full

likelihood to update the prior. In other words,
\[
\pi_k^{(b)}(\theta_k \mid Y) = \frac{[f_k(Y \mid \theta_k)]^{b}\,\pi_k(\theta_k)}{\int [f_k(Y \mid \theta_k)]^{b}\,\pi_k(\theta_k)\,d\theta_k}.
\]

One then computes the marginal likelihood of the remaining fraction of the likelihood with respect to \(\pi_k^{(b)}(\theta_k \mid Y)\), so that
\[
m_k^{Fr}(Y) = \int [f_k(Y \mid \theta_k)]^{1-b}\,\pi_k^{(b)}(\theta_k \mid Y)\,d\theta_k.
\]

O’Hagan has shown that the fractional Bayes factor is asymptotically equivalent to the

partial Bayes factor and mentions that consistency of this criterion actually requires

b → 0 as n → ∞.
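For a one-parameter model, the fractional marginal \(q_k(b, Y) = \int [f_k(Y \mid \theta_k)]^{b}\,\pi_k(\theta_k)\,d\theta_k\) and the ordinary marginal can both be approximated on a grid, and the fractional Bayes factor is the ratio of the per-model factors \(m_k(Y)/q_k(b, Y)\) across the two models. The sketch below is illustrative (trapezoid rule, hypothetical names), not O'Hagan's own computation.

```python
import math

def frac_marginal_ratio(y, sigma2, prior_pdf, b, grid):
    """m_k(Y) / q_k(b, Y), with q_k(b, Y) = ∫ f(Y|θ)^b π(θ) dθ,
    for Y_i ~ N(θ, sigma2), approximated by the trapezoid rule on a θ grid."""
    def loglik(theta):
        n = len(y)
        return (-n / 2) * math.log(2 * math.pi * sigma2) \
               - sum((yi - theta) ** 2 for yi in y) / (2 * sigma2)

    def trap(vals):
        return sum((vals[i] + vals[i + 1]) / 2 * (grid[i + 1] - grid[i])
                   for i in range(len(grid) - 1))

    full = trap([math.exp(loglik(t)) * prior_pdf(t) for t in grid])
    frac = trap([math.exp(b * loglik(t)) * prior_pdf(t) for t in grid])
    return full / frac
```

At \(b = 1\) the two integrands coincide, so the factor is exactly one; as \(b \to 0\) it tends to the ordinary marginal likelihood divided by one.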

3.1.4 Intrinsic Bayes Factors

The primary goal of the intrinsic Bayes factor of Berger and Pericchi (1996a,

1996b, 2001, 2004) is to resolve the problem of unjustifiable Bayes factors when

improper prior distributions are used. A detailed explanation of this problem will

be given in the next subsection. Berger and Pericchi (1996a) defined a proper training sample \(Y^{(I)}\) as a subset of the data such that its marginal likelihood, \(m_k(Y^{(I)})\), is positive and finite (i.e., \(0 < m_k(Y^{(I)}) < \infty\)). In addition, a training sample is called a minimal training sample (denoted MTS) if the posterior is proper but no proper subset yields a proper posterior.

Berger and Pericchi proposed using a MTS to update the improper priors as they

prefer to leave as much data as possible to provide model comparison. Since given a

Bayesian model, a MTS is not necessarily unique and the size of the MTS depends

primarily on the form of the prior distribution and the form of the likelihood, Berger and Pericchi suggest averaging the updated priors over all possible MTSs from the full data. In an asymptotic sense, this removes the arbitrariness of the selection of the MTS.

A pair of prior distributions that yields a Bayes factor asymptotically equivalent to the intrinsic Bayes factor is called a pair of intrinsic priors. If we use \(\pi_k^{In}(\theta_k)\) to denote the intrinsic prior under model \(M_k\), the criterion can be written as
\[
m_k^{In}(Y) = \int f_k(Y \mid \theta_k)\,\pi_k^{In}(\theta_k)\,d\theta_k.
\]

The intrinsic Bayes factor is arguably the most widely used model comparison crite- rion when model specific priors are improper. More discussion of the intrinsic priors can be found in Chapter 4.

3.1.5 Remarks

We conclude this section by showing why Bayes factors with improper priors are undefined and by presenting a general result from using training samples. The latter result is very important for this thesis, especially for the computation of the calibrated

Bayes factor. First, suppose one uses improper priors which can be specified only up to unknown constants \(c_k\), say \(\pi_k(\theta_k) = c_k h_k(\theta_k)\). Formal calculation would yield
\[
BF_{12}(Y) = \frac{\int f_1(Y \mid \theta_1)\,\pi_1(\theta_1)\,d\theta_1}{\int f_2(Y \mid \theta_2)\,\pi_2(\theta_2)\,d\theta_2}
= \frac{c_1}{c_2}\cdot\frac{\int f_1(Y \mid \theta_1)\,h_1(\theta_1)\,d\theta_1}{\int f_2(Y \mid \theta_2)\,h_2(\theta_2)\,d\theta_2}. \tag{3.1}
\]

Since c1 and c2 are arbitrary, the Bayes factor cannot be fully determined, leading to an inability to compute posterior model probabilities. On the other hand, regardless of the model index k, the impropriety of model specific priors may not cause severe

troubles in estimation problems, since the posterior distribution
\[
\pi(\theta \mid Y) = \frac{f(Y \mid \theta)\,\pi(\theta)}{\int f(Y \mid \theta)\,\pi(\theta)\,d\theta}
= \frac{c\,f(Y \mid \theta)\,h(\theta)}{c\int f(Y \mid \theta)\,h(\theta)\,d\theta}
= \frac{f(Y \mid \theta)\,h(\theta)}{\int f(Y \mid \theta)\,h(\theta)\,d\theta}
\]

can be fully determined. In addition, the influence of the prior distribution can be

controlled in the context of estimation. An improper prior is diffuse and is often

chosen to represent the notion of ignorance, so, with a decent amount of data, the

posterior distribution will be dominated by the likelihood f(Y |θ). In contrast, the marginal likelihood is the average of the likelihood weighted by the prior distribution.

This weighting, always by the prior, does not wash out as the sample size increases.

Second, a general result can be derived if the following two conditions are satisfied: (1) One uses part of the data as a training sample to update the prior distribution

and computes the marginal likelihood for the remainder; (2) Data are assumed to be

independently sampled. Let \(\{Y^{(T)}, Y^{(R)}\}\) denote a partition of \(Y\), where \(Y^{(T)}\) represents the training subset and \(Y^{(R)}\) is the remainder of the data. Under model \(k\), the new marginal likelihood \(m_k(Y^{(R)} \mid Y^{(T)})\) can be written as
\[
\begin{aligned}
m_k(Y^{(R)} \mid Y^{(T)}) &= \int f_k(Y^{(R)} \mid \theta_k)\,\pi_k(\theta_k \mid Y^{(T)})\,d\theta_k\\
&= \int f_k(Y^{(R)} \mid \theta_k)\,\frac{f_k(Y^{(T)} \mid \theta_k)\,\pi_k(\theta_k)}{\int f_k(Y^{(T)} \mid \theta_k)\,\pi_k(\theta_k)\,d\theta_k}\,d\theta_k\\
&= \frac{\int f_k(Y^{(R)} \mid \theta_k)\,f_k(Y^{(T)} \mid \theta_k)\,\pi_k(\theta_k)\,d\theta_k}{\int f_k(Y^{(T)} \mid \theta_k)\,\pi_k(\theta_k)\,d\theta_k}\\
&= \frac{\int f_k(Y \mid \theta_k)\,\pi_k(\theta_k)\,d\theta_k}{\int f_k(Y^{(T)} \mid \theta_k)\,\pi_k(\theta_k)\,d\theta_k}
= \frac{m_k(Y)}{m_k(Y^{(T)})}, \qquad k = 1, 2.
\end{aligned}
\]

In the second line of the equations, \(\pi_k(\theta_k)\) need not be a proper distribution, but merely needs to yield a finite integral \(\int f_k(Y^{(T)} \mid \theta_k)\,\pi_k(\theta_k)\,d\theta_k\). Any arbitrary constants

cancel. Therefore on the log scale, we have

\[
\log BF(Y^{(R)} \mid Y^{(T)}) = \log BF(Y) - \log BF(Y^{(T)}). \tag{3.2}
\]

This equation implies that as long as the partition scheme is invariant across models,

using a training sample to update prior distributions is equivalent to removing the

original Bayes factor for the training sample from the original Bayes factor for the full

data on the log scale. This general result is a key observation. In (3.2), any formal

version of the Bayes factor will lead to the same result.
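The factorization \(m_k(Y^{(R)} \mid Y^{(T)}) = m_k(Y)/m_k(Y^{(T)})\) underlying (3.2) can be checked numerically in the conjugate normal-normal model, where the marginal likelihood decomposes into one-step-ahead predictive densities. The sketch below assumes a known variance; the names are illustrative.

```python
import math

def log_marglik(y, sigma2, m0, v0):
    """log ∫ ∏ N(y_i; θ, σ²) dN(θ; m0, v0) via sequential conjugate predictives."""
    lm, post_m, post_v = 0.0, m0, v0
    for yi in y:
        # one-step-ahead predictive: N(yi; post_m, sigma2 + post_v)
        lm += -0.5 * math.log(2 * math.pi * (sigma2 + post_v)) \
              - (yi - post_m) ** 2 / (2 * (sigma2 + post_v))
        # conjugate posterior update
        new_v = 1.0 / (1.0 / post_v + 1.0 / sigma2)
        post_m = new_v * (post_m / post_v + yi / sigma2)
        post_v = new_v
    return lm

def posterior_params(y, sigma2, m0, v0):
    """Batch conjugate posterior (mean, variance) of θ given the sample y."""
    v = 1.0 / (1.0 / v0 + len(y) / sigma2)
    m = v * (m0 / v0 + sum(y) / sigma2)
    return m, v
```

Computing \(m(Y^{(R)} \mid Y^{(T)})\) directly with the training-updated prior, and comparing it with the difference of full-data and training-sample log marginals, reproduces the identity to machine precision.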

3.2 Information Metric

Suppose that \(Y_i \mid \theta \sim f_\theta(\cdot)\), \(i = 1, 2, \ldots, n\), and \(\theta \sim \pi\), where \(\pi\) could be a prior

distribution or a (partial) posterior distribution updated on a training sample. It is

difficult to create a good measure of the amount of information contained in a general

distribution π. The commonly used Fisher information is defined for parametric

models, but it is not applicable when π is nonparametric. The effective sample size

is sometimes used to measure the information in a prior distribution, but it can be

hard to compute when π is not conjugate.

To define the information metric for a general π, we measure the distance between

the distributions \(f_{\theta^{(1)}}(\cdot)\) and \(f_{\theta^{(2)}}(\cdot)\), where \(\theta^{(1)}\) and \(\theta^{(2)}\) are two random draws from \(\pi\). When \(\pi\) is highly concentrated (high information), \(\theta^{(1)}\) and \(\theta^{(2)}\) tend to be close to each other, and thus \(f_{\theta^{(1)}}(\cdot)\) and \(f_{\theta^{(2)}}(\cdot)\) would also be close. On the other hand, when \(\pi\) is very diffuse (low information), \(\theta^{(1)}\) and \(\theta^{(2)}\) tend to be far apart, and \(f_{\theta^{(1)}}(\cdot)\) and \(f_{\theta^{(2)}}(\cdot)\) would be very different. Note that this information metric is well defined no matter whether \(\pi\) is parametric or nonparametric. We measure the closeness of \(f_{\theta^{(1)}}(\cdot)\) and \(f_{\theta^{(2)}}(\cdot)\) by the symmetrized Kullback-Leibler (SKL) divergence

\[
SKL(f_{\theta^{(1)}}, f_{\theta^{(2)}})
= \frac{1}{2}\int f_{\theta^{(1)}}(y)\,\log\frac{f_{\theta^{(1)}}(y)}{f_{\theta^{(2)}}(y)}\,dy
+ \frac{1}{2}\int f_{\theta^{(2)}}(y)\,\log\frac{f_{\theta^{(2)}}(y)}{f_{\theta^{(1)}}(y)}\,dy. \tag{3.3}
\]

The joint distribution of \((\theta^{(1)}, \theta^{(2)})\) induces a distribution on SKL. We will evaluate the information contained in \(\pi\) using percentiles of this distribution of SKL.

3.3 Calibrating Bayes Factors

To calibrate the original priors to reach the desired information level, we need to

increase the amount of information contained in the priors. This can certainly be

done by adjusting the parameters in the original priors (such as the scale).

However, when the prior information is weak, it is hard to decide where to center the

more concentrated priors.

Instead, we propose to use part of the data as a training sample to update the prior

distributions and to use the remainder of the data to compute the (partial) Bayes

factor. As mentioned in Chapter 2, this approach has been used with the intrinsic

Bayes factor and with other versions of the Bayes factor based on improper prior

distributions. When the observations are independent, we randomly select multiple

training samples of the same size from the data, compute the Bayes factor based on

each training sample, and then average these Bayes factors to make model choices.

For a statistical model, the calibration sample size is defined as the minimum size

of a training sample such that the updated prior is beyond the target concentration

level.

Let \(s_k\) \((k = 1, 2)\) represent the calibration sample sizes for models \(M_1\) and \(M_2\).

We wish to have both prior distributions calibrated, and so we take \(s = \max\{s_1, s_2\}\).

Based on a training sample \(Y_{(s)} \subset Y\) of size \(s\), the partial Bayes factor satisfies \(\log PB_{12}(Y/Y_{(s)} \mid Y_{(s)}) = \log B_{12}(Y) - \log B_{12}(Y_{(s)})\). Let \(\{Y^1_{(s)}, Y^2_{(s)}, \ldots, Y^H_{(s)}\}\) denote all possible subsets of \(Y\) of size \(s\). Then the calibrated log Bayes factor \(\log B^*_{12}(Y)\) is defined by
\[
\log B^*_{12}(Y) = \log B_{12}(Y) - \frac{1}{H}\sum_{h=1}^{H} \log B_{12}(Y^h_{(s)}). \tag{3.4}
\]
When \(H\) is a very large number, we may approximate \(\log B^*_{12}(Y)\) by using only a portion of these \(H\) subsets.
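Equation (3.4) reduces to simple bookkeeping once a routine for \(\log B_{12}\) is available. The sketch below (the interface is hypothetical) averages the training-sample correction over random size-\(s\) subsets when enumerating all \(H\) subsets is infeasible.

```python
import random

def calibrated_log_bf(y, s, log_bf, n_subsets=200, seed=1):
    """log B*_12(Y) per equation (3.4): the full-data log Bayes factor minus the
    average log Bayes factor over random size-s training subsets.

    `log_bf` is a caller-supplied routine mapping a data list to log B_12."""
    rng = random.Random(seed)
    full = log_bf(y)
    corrections = [log_bf(rng.sample(y, s)) for _ in range(n_subsets)]
    return full - sum(corrections) / n_subsets
```

With a stand-in \(\log B_{12}\) that depends only on the sample size, the subtraction structure of (3.4) is visible directly.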

3.4 Example 2 revisited

In the following sections, starting with one-sample models, we present

more simulations and real data examples. Therefore, we conclude this chapter by

revisiting the one-sample model mentioned before. For Example 2 described in Chapter 1, a simple search led to a calibration sample size of 50. At the full sample size

n = 350, the mean log calibrated Bayes factor is -2.61, corresponding to a Bayes factor

of 13.62 in favor of the MDP model, which qualifies as strong evidence under Jeffreys’

criterion. This model choice is consistent with the superior predictive performance

of the MDP model based on the full data. Note that after calibration, the peak of

the mean log Bayes factor curve is 0.85, corresponding to a Bayes factor of 2.34 in

favor of the parametric model. Under Jeffreys’ criterion, this peak evidence is not

worth more than a bare mention. Therefore, the parametric model never accumulates

a significant lead.

46 3.5 One-sample Model

This section focuses on the one-sample (i.i.d.) model. First, the appropriate target information level will be discussed. Then we propose an algorithm to identify the calibration sample size. A batch of simulations and real data examples will be presented afterward to show the impact of calibration on the Bayes factor. Here, we are interested in comparing parametric models to the MDP model.

3.5.1 Target Information Level

To set a target for our prior calibration, we need to choose a benchmark prior and then require the updated priors to contain at least as much information as this benchmark prior. In order to perform a reasonable analysis where subjective input has little impact on the final conclusion, we set the benchmark to be a “minimally informative” prior – the unit information prior suggested in Kass and Wasserman

(1995), which contains the same amount of (Fisher) information as a single observation.

For example, for the Gaussian model \(Y \sim N(\theta, \sigma^2)\), it is easy to verify that a unit information prior on \(\theta\) is \(N(\theta_0, \sigma^2)\), where \(\theta_0\) is any constant. The SKL distance between the distributions \(f_{\theta^{(1)}}\) and \(f_{\theta^{(2)}}\) in the Gaussian model is
\[
\begin{aligned}
SKL(f_{\theta^{(1)}}, f_{\theta^{(2)}})
&= \frac{1}{2}\int \phi_\sigma(y - \theta^{(1)})\,\log\frac{\phi_\sigma(y - \theta^{(1)})}{\phi_\sigma(y - \theta^{(2)})}\,dy
+ \frac{1}{2}\int \phi_\sigma(y - \theta^{(2)})\,\log\frac{\phi_\sigma(y - \theta^{(2)})}{\phi_\sigma(y - \theta^{(1)})}\,dy\\
&= \frac{(\theta^{(1)} - \theta^{(2)})^2}{2\sigma^2},
\end{aligned}
\]
where \(\phi_\sigma(\cdot)\) is the Gaussian density function with mean 0 and standard deviation \(\sigma\). Under the unit information prior, \(\theta^{(1)} - \theta^{(2)} \sim N(0, 2\sigma^2)\), and thus the SKL distance \((\theta^{(1)} - \theta^{(2)})^2/2\sigma^2 \sim \chi^2_{(1)}\). Note that this distribution is free of \(\theta_0\) and \(\sigma^2\). We require the calibrated priors to contain at least as much information as that in this chi-square distribution.
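The \(\chi^2_{(1)}\) law of the SKL under the unit information prior is easy to verify by simulation: draw \(\theta^{(1)}, \theta^{(2)}\) independently from \(N(\theta_0, \sigma^2)\) and apply the closed-form Gaussian SKL. The sketch below is illustrative only.

```python
import math
import random

def skl_gaussian(theta1, theta2, sigma2):
    """Closed-form SKL between N(theta1, sigma2) and N(theta2, sigma2)."""
    return (theta1 - theta2) ** 2 / (2 * sigma2)

def skl_draws(sigma2, theta0=0.0, n=100000, seed=0):
    """Monte Carlo draws of the SKL induced by the unit information prior
    N(theta0, sigma2); by the argument above these follow a chi-square(1) law,
    regardless of theta0 and sigma2."""
    rng = random.Random(seed)
    sd = math.sqrt(sigma2)
    return [skl_gaussian(rng.gauss(theta0, sd), rng.gauss(theta0, sd), sigma2)
            for _ in range(n)]
```

The first quartile of the simulated draws should sit near the \(\chi^2_{(1)}\) first quartile (about 0.1015) for any choice of \(\theta_0\) and \(\sigma^2\).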

Figure 3.1: In the left plot, the horizontal axis presents the cumulative distribution function (CDF) of the \(\chi^2_{(1)}\) distribution and the vertical axis presents that of the SKL distribution under the Gaussian parametric model with calibration sample size \(m \in \{2, 10, 50\}\). In the right plot, the horizontal axis presents the CDF of the \(\chi^2_{(1)}\) distribution and the vertical axis presents that of the SKL distribution under the MDP model with calibration sample size \(m\). The dashed lines are reference lines for equality of the two CDFs.

In complex models, such as the MDP model, the distribution of SKL might not have an explicit expression. Furthermore, the distributions of SKL under the target prior and under the calibrated priors might not be stochastically ordered. This prevents us from using the entire distribution of SKL as a criterion. Instead, we calibrate priors using a percentile of the SKL distribution. Since the distribution of SKL is usually strongly right-skewed, the estimates of the low percentiles are more stable than those for the upper percentiles. We choose to use the first quartile of an SKL

48 distribution as our measure of information level, but other percentiles can be used as

well. The percentile used for calibration can be considered to be a tuning parameter

in this method. Our experience suggested that using a lower percentile usually leads

to a more concentrated posterior, and using an upper percentile usually leads to a less

concentrated posterior.

Example 2 revisited. Figure 3.1 contrasts the cumulative distribution function (CDF) of the \(\chi^2_{(1)}\) distribution with the SKL distributions under the Gaussian parametric model and the MDP model for a range of calibration sample sizes. Our specific choice has been the first quartile. Simulations show that the calibration sample size and the performance of our method are robust to modest changes in the specified quantile. Figure 3.1 shows that for the Gaussian parametric model, when the calibration sample size is 2, the first quartile of the SKL distribution matches that of the \(\chi^2_{(1)}\) distribution, and for the MDP model, when the calibration sample size is around 50, the first quartile of the SKL distribution matches that of the \(\chi^2_{(1)}\) distribution.

3.5.2 Calibrating Priors

To choose an appropriate training sample size and to update the priors, we im-

plement the following algorithm.

• Step 1. Randomly draw a training sample with a pre-specified sample size

from the data.

• Step 2. Update the original prior based on this training sample, running two

independent Markov chain Monte Carlo (MCMC) chains. After discarding a burn-in period, retain a subsample of \(M\) draws from each chain, say, \(\Theta^{(1)} = (\theta^{(1)}_1, \ldots, \theta^{(1)}_M)\) and \(\Theta^{(2)} = (\theta^{(2)}_1, \ldots, \theta^{(2)}_M)\).

• Step 3. Permute the draws of \(\Theta^{(2)}\) and pair them with the draws of \(\Theta^{(1)}\) to obtain \(M\) pairs \((\theta^{(1)}_j, \theta^{(2)}_j)\). Compute \(SKL(f(y \mid \theta^{(1)}_j), f(y \mid \theta^{(2)}_j))\) based on each pair.

• Step 4. Repeat Steps 1 to 3 \(N\) times, where \(N\) is a large number. Pool all \(MN\)

values of the SKLs to compute the confidence interval for the first quartile of

the SKL distribution.

• Step 5. Compare this confidence interval with the first quartile of the SKL

distribution under the unit information prior. If the confidence interval covers

the target first quartile, terminate the search and report the current sample

size as the calibration sample size. Otherwise modify the sample size using the

search strategy described in the Appendix and repeat Steps 1 to 5.

Note that running two independent chains in Step 2 and subsampling in Step 3 help to ensure that the eventual MN draws are approximately independent, justifying a simple approximate calculation for the confidence interval in Step 4. Our investigation shows that the confidence interval is insensitive to the length of the MCMC runs.
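The search in Steps 1-5 can be caricatured as follows. Here `skl_sampler` stands in for the MCMC machinery that returns pooled SKL draws for a given training size, the confidence interval is the standard order-statistic (binomial-based) interval for a quantile, and the simple step-by-one loop is illustrative rather than the Appendix's actual search strategy.

```python
import math

def quartile_ci(skl_values, p=0.25, z=1.96):
    """Order-statistic confidence interval for the p-th quantile, treating the
    pooled SKL draws as approximately independent."""
    xs = sorted(skl_values)
    n = len(xs)
    half_width = z * math.sqrt(p * (1 - p) * n)
    lo = xs[max(0, int(p * n - half_width))]
    hi = xs[min(n - 1, int(p * n + half_width))]
    return lo, hi

def search_calibration_size(skl_sampler, target_q, m_start=2, m_max=500, step=1):
    """Increase the training size m until the CI for the first quartile of the
    SKL distribution covers the target quartile, or the prior is already beyond
    the target concentration (hi <= target)."""
    m = m_start
    while m <= m_max:
        lo, hi = quartile_ci(skl_sampler(m))
        if hi <= target_q or lo <= target_q <= hi:
            return m
        m += step
    return None
```

With a toy sampler whose SKL values shrink like \(1/m\), the loop stops at the first size whose interval brackets the target.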

The computations for calibration sample size search often rely on the same kind of MCMC required to fit the models, and so necessitate little additional coding. For

the Gaussian model, \(M_1\), we would need draws of \((\mu, \sigma^2, \tau^2)\), which are produced by the standard MCMC output. For the MDP model, \(M_2\), we would need draws of

G. We obtain a close approximation to G by running an MCMC chain with the algorithm of Bush and MacEachern (1996) and then using the output of the chain to produce a finite approximation to G, as described in Gelfand and Kottas (2002) and

Ishwaran and James (2001, 2002). Once an approximate draw of G has been obtained,

50 numerical integration (based on a fine grid of points and the trapezoid rule) is used

to approximate the SKL.

3.5.3 Simulations

To investigate the patterns of log Bayes factors with different sample sizes and to

illustrate the effect of calibration, we conduct simulation studies to compare the Gaussian parametric model to the MDP model under various true distributions. We select

the following distributions to represent distributions with different shape features:

• Skew-normal with varying skewness parameter \(\alpha\) (representing skewed distributions)

• Student-t with varying degrees of freedom \(\nu\) (representing thick-tailed distributions)

• Symmetric mixture of normals with varying component means \(\pm\xi\) (representing multi-modal distributions)

In all cases, the distributions have been centered and scaled to have mean 0 and standard deviation 1. In addition, by specifying \(\alpha\), \(\nu\), and \(\xi\), we tune the KL divergences from the true distribution to the best fitting Gaussian distribution.

We place the common independent prior distributions \(\mu \sim N(0, 500)\), \(\sigma^2 \sim IG(3, 0.1)\), and \(\tau^2 \sim IG(3, 1.9)\) on the parameters in the Gaussian and the MDP

model, so that the marginal distributions of a single observation are the same. That

is, the Bayes factor based on any single observation is one. Note that for the prior

distributions on \(\tau^2\) and \(\sigma^2\), the 19 to 1 ratio in the scale parameters allows the MDP

model to capture local variation in the density, and that the moderate shape parameters allow departures from this ratio. Since in all cases the true data generating processes are non-Gaussian, we can envision that with a sufficiently large sample size, the MDP model would outperform the Gaussian model.

In Figures 3.2 through 3.5, the upper panel displays the true sampling densities, the lower panel displays the expected log Bayes factor curves, and the dots on the curves represent the calibration sample sizes. The expected log Bayes factors are calculated as the sums of the expected differences in the log predictive likelihoods, and the calibration sample sizes are obtained from the search strategy in Appendix

A. We evaluate the expected predictive likelihoods at designated training sample sizes via Monte Carlo integration, and use linear interpolation for other sample sizes.

For the Gaussian model, since it is not fully conjugate, we jointly sample (σ2, τ 2) by

Metropolis-Hastings, leading to a hybrid Gibbs sampler. For the MDP model, since it is fully conjugate, we apply the collapsed Gibbs sampler of MacEachern (1998).

All inferences in the simulation results are based on 10,000 iterations of a sampler following a burn-in period of 500 iterations. Under the MDP model, the integrals needed to compute the SKL are computed numerically by the trapezoidal rule with

3,000 grid points over the range \([-5, 5]\). In addition, to approximate the expectation of the log predictive likelihoods by the central limit theorem (CLT), the existence of the fourth moment of the data generating process is required. Therefore, we cannot add the Bayes factor curves under standardized \(t\) distributions in Figures 3.4 and 3.5, since their degrees of freedom are too small. For the same reason, we cannot add the log Bayes factor curve for the skew-normal distribution in Figure 3.5.
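For reference, the trapezoid-rule SKL used for the density draws amounts to the following grid-based sketch (the function names and arguments are illustrative). On a sufficiently wide grid it recovers the closed-form Gaussian value \((\theta^{(1)} - \theta^{(2)})^2/2\sigma^2\).

```python
import math

def skl_trapezoid(f1, f2, grid):
    """Symmetrized KL divergence between densities f1 and f2, each evaluated on
    a fine grid, with both integrals approximated by the trapezoid rule."""
    def integrand(f, g):
        # f(y) * log(f(y)/g(y)), with the 0 * log 0 convention
        return [fi * math.log(fi / gi) if fi > 0 else 0.0
                for fi, gi in zip(f, g)]

    def trap(vals):
        return sum((vals[i] + vals[i + 1]) / 2 * (grid[i + 1] - grid[i])
                   for i in range(len(grid) - 1))

    v1 = [f1(x) for x in grid]
    v2 = [f2(x) for x in grid]
    return 0.5 * trap(integrand(v1, v2)) + 0.5 * trap(integrand(v2, v1))
```

For two unit-variance Gaussians with means 0 and 1, the closed form gives \(SKL = 1/2\), which the grid approximation matches closely.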


Figure 3.2: Sampling densities of Y and expected log Bayes factors for small divergences from the Gaussian, α = 1.645, ν = 9.608, ξ = 0.711. Calibration sample sizes are marked on the lower panel.


Figure 3.3: Sampling densities of Y and expected log Bayes factors for moderate divergences from the Gaussian, α = 2.612, ν = 5.617, ξ = 0.902. Calibration sample sizes are marked on the lower panel.


Figure 3.4: Sampling densities of Y and expected log Bayes factors for large divergences from the Gaussian, α = 7.121, ξ = 1.213. Calibration sample sizes are marked on the lower panel. The t distribution with this divergence does not admit a CLT for the estimand, and so is not presented.


Figure 3.5: Sampling density of Y and expected log Bayes factors for a very large divergence from the Gaussian, ξ = 1.772. Calibration sample size is marked on the lower panel. The t and skew-normal distributions with this divergence result in unstable behavior, and so are not presented.

Figure 3.6: Calibration sample size versus departure from normality for the \(t\), skew normal, and mixture of normals families. The departures from normality are measured by Kullback-Leibler divergence on the \(\log_{10}\) scale.

The pattern of the log Bayes factors obviously changes as the KL divergence from the true distributions to the best fitting Gaussian distributions changes. When the KL divergence is small or medium (as in Figures 3.2 or 3.3), the expected log Bayes factor curves first increase rapidly to large numbers, representing substantial evidence in favor of the Gaussian model, and then decline slowly, representing more and more evidence favoring the MDP model. When the true distribution is very close to normal, the slope of the decline can be very shallow and the log Bayes factor can remain positive for a large range of sample sizes; that is, the Gaussian model will be favored unless the sample size is large or very large. On the other hand, when the KL divergence is large or very large (as in Figures 3.4 and 3.5), the log Bayes factors start declining after a few observations, which means that the MDP model picks up the discrepancies very quickly and the Gaussian never builds up a meaningful lead. The problem of the Bayes factor is most pronounced in Figure 3.3, where the log Bayes factor curves are clearly on the decline but have not dropped below zero.

In all cases, the calibration is driven by the MDP model rather than the Gaussian model (which is typically calibrated after two or three observations). In each figure

(where the true distributions have the same KL divergence to the best fitting Gaussian distributions), the calibration sample size varies little; and across different figures, the further the underlying true distribution is from normality, the larger the calibration sample size will be (as in Figure 3.6). It is worth noting that after updating on the training sample, the peaks of the expected log calibrated Bayes factor curves remain below two for all cases, leading to better agreement between the Bayes factors and the models' predictive performances.

3.5.4 Study 1: Adult Males, Aged 18-24

Obesity is a serious issue in the United States, as it is linked to increased risk of a variety of negative health outcomes, including high blood pressure, heart disease, type 2 diabetes, certain cancers, and other chronic conditions. These conditions carry human costs of a shorter lifespan and lesser quality of life. Additionally, there is an economic cost, borne by the health-care system. Through time, the rate of obesity has increased. Current estimates are that perhaps 34.2 percent of adults are overweight and that an additional 33.8 percent are obese (Flegal, Carroll, Ogden and

Curtin 2010). The Ohio Family Health Survey (OFHS) has been designed to capture

data on the health of Ohioans. The survey includes self-reported data on height and weight of individuals, as well as general demographic information such as age and gender. The technical definitions of overweight and obese are based on body mass index (BMI), which is defined as body weight in kilograms divided by the square of height in meters. Specifically, a person is considered overweight if his or her BMI is at least 25, and obese if it is at least 30. Exclusions apply to particular subgroups, such as pregnant women.

To avoid exceptional observations and to ensure some degree of homogeneity, we

first focus on young adult males, aged 18 - 24, who responded to the 2008 OFHS.

We eliminated individuals with missing values, and ended up with 855 cases. A preliminary look at the data showed strong right-skewness, confirmed in the summary presented in Table 3.1. We transformed the data, working with the natural logarithm of BMI. After transformation, the data are passably skew-normal, with the maximum likelihood estimate of the skewness parameter \(\hat{\alpha}_{MLE} = 2.41\). Figure 3.7 presents a kernel density estimate along with the maximum likelihood estimate of the skew normal density.

Table 3.1: Summary statistics for OFHS (Male Adult, Aged 18-24)

              Min.    Q1     Median   Mean    Q3     Max.    S.D.
  BMI         13.93   21.86  24.38    25.61   28.24  71.12   5.55
  log(BMI)    2.63    3.09   3.19     3.22    3.34   4.26    0.20

To illustrate the benefits of calibrating the Bayes factor, we first consider the comparison of a normal model to an MDP model. The log-transformed BMI measurements were standardized to have mean 0 and standard deviation 1. The prior distributions

Figure 3.7: Density estimate for log(BMI) of adult males, aged 18-24. The dotted vertical lines mark the maximum and minimum observations. The solid line is a kernel density estimate of log(BMI). The dashed line provides the skew normal density with parameters equal to the maximum likelihood estimates.

for the normal model (with parameters specified in Chapter 1) are

\[
\mu \sim N(0, 500), \qquad \sigma^2 \sim IG(7, 0.3), \qquad \tau^2 \sim IG(11, 9.5).
\]

For the full data set, we used a simulation-based method proposed by Basu and Chib

(2003) to obtain the marginal likelihood of the MDP model. The marginal likelihood of the normal model was obtained by combining methods described in Chib (1995) and Chib and Jeliazkov (2001) with 300 replicates. The log Bayes factor for the full data set is about -12.19 with standard error 0.01. This translates to a Bayes factor of

196,811 favoring the MDP model, which we take to be conclusive evidence that the

MDP model is superior to the normal model.

We further investigate the expected cumulative log Bayes factor for a range of smaller sample sizes. For each sample size m, we generated N = 300 distinct partitions of the full data set, say {[Y^(h)_{1:m}, Y^(h)_{(m+1):855}], h = 1, ..., N}. The expected log predictive likelihood E_f[log f_k(Y_{m+1} | Y_{1:m})], k = 1, 2, was approximated as

    (1/N) Σ_{h=1}^{N} [ (1/(855 − m)) Σ_{y ∈ Y^(h)_{(m+1):855}} log f_k(y | Y^(h)_{1:m}) ].   (3.5)
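The approximation (3.5) is just an average of held-out log predictive densities over random partitions. A minimal sketch, with a toy conjugate normal model standing in for the normal and MDP predictive densities f_k (the data, model, and all settings below are illustrative, not the OFHS computation):

```python
import numpy as np

rng = np.random.default_rng(1)

def log_pred_normal(y_new, y_train, s2=1.0, prior_var=500.0):
    """Posterior predictive log density of y_new under a toy conjugate
    model mu ~ N(0, prior_var), Y ~ N(mu, s2) -- a stand-in for f_k."""
    n = len(y_train)
    post_var = 1.0 / (1.0 / prior_var + n / s2)
    post_mean = post_var * y_train.sum() / s2
    v = post_var + s2                           # predictive variance
    return -0.5 * np.log(2 * np.pi * v) - 0.5 * (y_new - post_mean) ** 2 / v

def expected_log_pred(y, m, n_part=300):
    """Approximation (3.5): average held-out log predictive density over
    n_part random partitions [Y_{1:m}, Y_{(m+1):n}] of the data."""
    n, total = len(y), 0.0
    for _ in range(n_part):
        perm = rng.permutation(n)
        train, test = y[perm[:m]], y[perm[m:]]
        total += log_pred_normal(test, train).mean()
    return total / n_part

y = rng.normal(0.0, 1.0, 855)                   # synthetic stand-in data
print(expected_log_pred(y, m=50))
```

For the BMI analysis the predictive densities come from MCMC output rather than a closed form, but the partition-and-average structure is the same.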

To compute the log predictive likelihoods via Monte Carlo integration, we used 5,000

MCMC iterations, following 500 burn-in iterations. Other computational details are identical to those of the simulations in the last subsection. As illustrated in Figure 3.8, the uncalibrated expected-cumulative-log-Bayes-factor curve (the solid curve) reaches its peak of 4.64 at a sample size of 106. This corresponds to a Bayes factor of 104 in favor of the Gaussian model. This qualifies as decisive (Jeffreys) evidence in favor of the parametric model – evidence that is eventually, conclusively overturned.

Moreover, at a larger sample size of 250, although the MDP model provides better predictive performance (evidenced by the downward trend of the curve), the uncalibrated expected-cumulative-log-Bayes-factor is still 2.92, corresponding to strong evidence of 18.5-to-one odds against the MDP model.

In contrast, our calibration suggests commencing the Bayes factor calculation after

50 observations. The peak of the calibrated curve is 0.64, corresponding to a Bayes factor of 1.9 that provides very weak model preference. At the full sample size, the eventual calibrated value is -16.18, which leads to a Bayes factor of 10.6 million to one in favor of the MDP model. We find the swing from inconclusive evidence for modest sample sizes to conclusive evidence in favor of the MDP model for the full sample far more palatable than the swing from strong/very strong evidence in one direction to conclusive evidence in the opposite direction.

Figure 3.8: Mean log-Bayes factor vs. sample size. The solid line is for the comparison of the normal model with the MDP model, and the dashed line is for the comparison of the skew normal model with the MDP model.

Figure 3.8 also contrasts a skew normal analysis to the MDP analysis (as illustrated by the dashed line). For the skew normal model, a Gaussian prior distribution with mean 0 and variance 5 was placed on the skewness parameter. The prior distribution on the mean and variance matched that of the Gaussian analysis. The expected cumulative log Bayes factor curve increases (favoring the skew normal model) to a peak of 10.16 at a sample size of 801, which corresponds to a Bayes factor of 25,848 in favor of the skew normal model. After that, the curve starts to move downward (not shown in the plot), though with a very shallow slope, and drops to 9.75 at the full sample size 855. Our calibration still suggests commencing the Bayes factor calculation after

50 observations, since the calibration sample size is controlled by the MDP model.

The peak of the calibrated curve is 5.80, corresponding to a Bayes factor of 328 in favor of the skew normal model. The eventual calibrated value at the full sample size is 5.38, corresponding to a Bayes factor of 217, which still decisively favors the skew normal model. This shows that when the predictive performances of the parametric model and the MDP model are similar at a finite sample size, the calibrated Bayes factor may favor the simpler parametric model, although the evidence for the simpler model is weaker than that in the original Bayes factor.

3.6 Two-sample Model

3.6.1 Study 2: Diabetes Group vs. Non-Diabetes Group

In addition, we study another subset of the BMI data. We focus on older (aged 55-64) married males who have a high school diploma or more education. The data are log-transformed and divided into two groups by diabetes status, in order to make a contrast. The diabetes group includes n1 = 412 observations and tends to skew to the right; the maximum likelihood estimate of the skewness parameter is α̂_MLE = 2.05. The non-diabetes group includes n2 = 1,795 observations and tends to be right-skewed as well, though with lesser skewness; the maximum likelihood estimate of the skewness parameter is α̂_MLE = 1.25. First, it is justifiable to assume that the two groups have different average BMIs. We want to use the calibrated Bayes factor to test whether the distributions of the diabetes group (k = 1) and the non-diabetes group (k = 2) differ beyond their location parameters. Therefore we compare the pair of models:


Figure 3.9: The solid line and the green points represent the mean of the log Bayes factor. The red points provide a 95% confidence band.

1. M1, a mean-shift model:

α ∼ N(µα, σα²)

σ² ∼ IG(a, b)

µk ∼ iid N(µ0, σ0²), k = 1, 2;

Yik,k ∼ iid SN(µk, σ², α), k = 1, 2; i1 = 1, ..., n1; i2 = 1, ..., n2;   (3.6)

2. M2, a more general model:

αk ∼ iid N(µα, σα²), k = 1, 2;

σk² ∼ iid IG(a, b), k = 1, 2;

µk ∼ iid N(µ0, σ0²), k = 1, 2;

Yik,k ∼ iid SN(µk, σk², αk), k = 1, 2; i1 = 1, ..., n1; i2 = 1, ..., n2,   (3.7)

where SN(µ, σ², α) represents a skew-normal distribution with mean µ, variance σ² and shape parameter α. We set hyperparameters µ0 = 0, σ0² = 500, a = 0.1, b = 0.1, µα = 0 and σα² = 5 for both models. The original log-Bayes-factor is evaluated via Monte Carlo integration with 5,000 draws from the prior. For each sample size, we randomly generate 10,000 sub-samples to estimate the expectation and plot the results against sample size in Figure 3.9. The log-Bayes-factor for the full data set is about -239 with standard error 10, corresponding to a Bayes factor of 1.60 × 10⁻¹⁰⁴, which is, to put it mildly, very strong evidence in favor of M2.
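The "Monte Carlo integration with 5,000 draws from the prior" is a simple average of the likelihood over prior draws, stabilized with the log-sum-exp trick. A sketch for a toy normal model (not the skew-normal two-group models; the data and hyperparameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, 50)                            # stand-in data

def log_marginal_mc(y, n_draws=5000, mu0=0.0, s0_2=500.0, a=0.1, b=0.1):
    """Monte Carlo estimate of log m(y) = log E_prior[f(y | mu, sigma^2)]
    for a toy model with mu ~ N(mu0, s0_2) and sigma^2 ~ IG(a, b)."""
    mu = rng.normal(mu0, np.sqrt(s0_2), n_draws)
    sig2 = 1.0 / rng.gamma(a, 1.0 / b, n_draws)         # inverse-gamma draws
    ll = (-0.5 * len(y) * np.log(2 * np.pi * sig2)
          - 0.5 * ((y[:, None] - mu) ** 2).sum(axis=0) / sig2)
    m = ll.max()                                        # log-sum-exp trick
    return m + np.log(np.mean(np.exp(ll - m)))

print("log marginal likelihood estimate:", log_marginal_mc(y))
```

A log Bayes factor is then the difference of two such estimates, one per model; with diffuse priors the estimator is noisy, which is why the thesis reports standard errors.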

To calibrate the Bayes factor, first, it is easy to see that the calibration sample size is determined by M2, as M1 is nested within M2. We then identify the calibration sample size by implementing the following steps on M2:

• Step 1: Regardless of group index, randomly draw s observations from the entire data set (diabetes and non-diabetes) with equal probability;

• Step 2: Retain the observations from the diabetes group and use them to update the prior distribution of (µ2, σ2², α2);

• Step 3: Repeat Step 1 and Step 2 N (large enough) times;

• Step 4: Evaluate the first quartile Q1(s) of the distribution of SKL based on the mixture of the N partially updated priors of (µ2, σ2², α2);

• Step 5: Define the calibration sample size as inf{s : Q1(s) < q1}, where q1 is the first quartile of the χ²(1) distribution.
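Steps 1-5 can be sketched with a deliberately simplified stand-in: a conjugate normal-inverse-gamma model in place of M2, so the partially updated prior and the SKL between two drawn normal likelihoods have closed forms. Everything below (data, hyperparameters, the hard-coded χ²(1) quartile) is illustrative, not the thesis's skew-normal computation:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 2207)              # stand-in for the pooled data
M0, K0, A0, B0 = 0.0, 1.0 / 500.0, 0.1, 0.1    # diffuse normal-inverse-gamma prior
Q1_CHI2_1 = 0.1015                             # first quartile of chi-square(1)

def skl(mu1, v1, mu2, v2):
    """Symmetrized KL (with the 1/2 weights of (3.3)) between the
    N(mu1, v1) and N(mu2, v2) likelihoods."""
    d2 = (mu1 - mu2) ** 2
    return 0.25 * ((v1 + d2) / v2 + (v2 + d2) / v1) - 0.5

def skl_q1(s, n_rep=300, n_draws=200):
    """Q1(s): pool SKL draws over n_rep random training samples (Steps 1-4)."""
    pooled = []
    for _ in range(n_rep):
        x = rng.choice(data, size=s, replace=False)            # Step 1
        kn = K0 + s                                            # Step 2: conjugate update
        mn = (K0 * M0 + s * x.mean()) / kn
        an = A0 + s / 2.0
        bn = (B0 + 0.5 * ((x - x.mean()) ** 2).sum()
              + K0 * s * (x.mean() - M0) ** 2 / (2 * kn))
        v = 1.0 / rng.gamma(an, 1.0 / bn, (n_draws, 2))        # sigma^2 ~ IG(an, bn)
        mu = rng.normal(mn, np.sqrt(v / kn))
        pooled.append(skl(mu[:, 0], v[:, 0], mu[:, 1], v[:, 1]))
    return np.quantile(np.concatenate(pooled), 0.25)

s = 1
while s < 200 and skl_q1(s) >= Q1_CHI2_1:                      # Step 5
    s += 1
print("calibration sample size:", s)
```

For the real two-sample model the partially updated prior is not conjugate, so Step 2 requires MCMC; the outer search over s is unchanged.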

In Step 2, we only retain the observations from the diabetes group because this group has far fewer observations than does the non-diabetes group. According to Table

3.2, when the sample size is greater than eight, it is very likely (with probability greater than 97.5%) that a sample would include more non-diabetes observations.

Therefore, a sub-sample that is able to provide a sufficient update for the prior of

2 2 (µ2, σ2, α2) should typically provide a stronger update for the prior of (µ1, σ1, α1). As discussed earlier, when the distribution of SKL is not available in a closed form, one can obtain SKL samples by running simulations and construct a narrow confidence interval for Q1(s).

Table 3.2: Probability of obtaining more non-diabetes observations

  Sample size   6      7      8      9      10     11     12     > 12
  P(nD < nND)   91.7%  97.4%  95.5%  98.5%  97.5%  99.2%  98.6%  > 99.2%
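With the diabetes proportion p = 412/2207, the entries of Table 3.2 are, to close approximation, binomial probabilities P(nD ≤ ⌊(s−1)/2⌋). A quick check (sampling is actually without replacement, so the hypergeometric distribution would be exact; the binomial sketch below already closely reproduces the table):

```python
import math

P_DIABETES = 412 / (412 + 1795)   # proportion of diabetes observations

def prob_more_nondiabetes(s, p=P_DIABETES):
    """P(nD < nND) for a size-s sample: nD < s - nD, i.e. nD <= floor((s-1)/2).
    Binomial approximation to sampling without replacement."""
    k_max = (s - 1) // 2
    return sum(math.comb(s, k) * p**k * (1 - p) ** (s - k)
               for k in range(k_max + 1))

for s in range(6, 13):
    print(s, round(100 * prob_more_nondiabetes(s), 1))
```

The non-monotone pattern in the table (e.g. s = 7 vs. s = 8) comes from the floor in ⌊(s−1)/2⌋: an even sample size counts ties nD = nND as failures.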

By implementing these steps, it turns out that the calibration sample size is 58, where the expected log-Bayes-factor is about -6.91. After the calibration, the log Bayes factor reduces to -232, leading to a calibrated Bayes factor of 1.75 × 10⁻¹⁰¹, which is still very decisive evidence supporting M2.

Since both the Bayes factor and the calibrated Bayes factor suggest the two groups differ not only in location but also in variance and skewness, it is plausible to analyze the diabetes group and the non-diabetes group separately. We again standardize each group to have mean 0 and standard deviation 1. For each group, we compare the same normal model and skew normal model individually to the same MDP model. All computational details are identical to Study 1.

Table 3.3: Summary statistics for OFHS (Married Male, Aged 55-64, High School Graduates with Diabetes)

            Min.    Q1      Median   Mean    Q3      Max.    S.D.
  BMI       16.70   27.86   31.19    32.48   35.56   73.89   6.93
  log(BMI)  2.82    3.33    3.44     3.46    3.57    4.30    0.19


Figure 3.10: The dotted vertical lines mark the maximum and minimum observations. The solid line is a kernel density estimate of log(BMI). The dashed line provides the skew normal density with parameters equal to the maximum likelihood estimates.

Table 3.3 and Figure 3.10 present a glance at the distribution of the unstandardized diabetes group. In Figure 3.10, note that a tiny bump appears around log(BMI) = 4 in both the histogram and the kernel density estimate. This bump cannot be picked up by a normal model or a skew normal model. Thus both parametric models fail to fit the data.

For the comparison between the normal model and the MDP model, the log Bayes factor for the full data is about -3.20 with standard error 0.01, leading to a Bayes factor of 24.5 favoring the MDP model. According to the solid curve in Figure 3.11, the peak of the expected-log-Bayes-factor is 5.02, which is observed at sample size 100, and it turns out the calibration sample size is 54, with expected-log-Bayes-factor 4.44. Thus after calibration, the peak is reduced to 0.58, corresponding to a Bayes factor of 1.78 favoring the normal model. The calibrated Bayes factor for the full data becomes -7.64. The corresponding Bayes factor is about 2,000 to one favoring the MDP model, which is considered decisive evidence against the normal model, according to Jeffreys' criterion.

Figure 3.11: Mean log-Bayes factor vs. sample size. The solid line is for the comparison of the normal model with the MDP model, and the dashed line is for the comparison of the skew normal model with the MDP model.

For the other comparison, the log Bayes factor turns out to be 5.03 with standard error 0.09, implying a Bayes factor of 153.0 favoring the skew normal model. Note that according to the dashed curve in Figure 3.11, the peak is about 7.11 (at sample size 169), and there is thus an apparent drop-off of the Bayes factor afterward. Recall that this implies the MDP model is actually producing higher expected predictive likelihood than the skew normal model. The calibration sample size (derived from the MDP model) appears to be 54, leading to a reduction of the peak of the expected-log-Bayes-factor curve by 5.15. The log Bayes factor for the full data set declines to -0.12, leading to a Bayes factor slightly less than one. With a Bayes factor so close to one, there is essentially no evidence suggesting that one model is better than the other.

Table 3.4: Summary statistics for OFHS (Married Male, Aged 55-64, High School Graduates without Diabetes)

            Min.    Q1      Median   Mean    Q3      Max.    S.D.
  BMI       13.03   25.60   27.73    28.46   30.78   53.16   4.61
  log(BMI)  2.57    3.24    3.32     3.34    3.43    3.97    0.16

In contrast, we look at the non-diabetes group. A quick glance at Table 3.4 and

Figure 3.12 arguably suggests the data are normally distributed. However for the


Figure 3.12: The dotted vertical lines mark the maximum and minimum observations. The solid line is a kernel density estimate of log(BMI). The dashed line provides the skew normal density with parameters equal to the maximum likelihood estimates.

comparison between the normal model and the MDP model, the log Bayes factor for the full data turns out to be -22.33 with standard error 0.06. The corresponding Bayes factor is about 5 billion to one in favor of the MDP model, which is indeed decisive evidence supporting the MDP model. The calibration occurs at sample size 54, leading to a reduction of the peak expected-log-Bayes-factor (observed at sample size 170) from 7.31 to 1.96. The log calibrated Bayes factor is -27.68, implying the calibrated Bayes factor favors the MDP model roughly 200 times as much as the original Bayes factor. Furthermore, based on the full data, the log Bayes factor for comparing the skew normal model to the MDP model is about -17.56 with standard error 0.24, corresponding to a Bayes factor of about 42 million to one favoring the

MDP model. The calibration sample size turns out to be 54 with expected-log-Bayes-factor 5.41. Thus, similarly, the calibrated Bayes factor favors the MDP model roughly 200 times as much as the original Bayes factor, which should definitely be considered conclusive evidence supporting the MDP model.

Figure 3.13: Mean log-Bayes factor vs. sample size. The solid line is for the comparison of the normal model with the MDP model, and the dashed line is for the comparison of the skew normal model with the MDP model.

3.7 Griffin's MDP Priors

Apart from the comparison of the models mentioned above, we also studied another prior specification for the MDP model, described in Griffin (2010), and compared it with a parametric normal model with a corresponding prior distribution. The two models can be expressed as follows:

1. M1, a Gaussian parametric model:

Yi | µ, σ² ∼ iid N(µ, σ²), i = 1, ..., n.   (3.8)

2. M2, Griffin’s mixture of Dirichlet processes (MDP) nonparametric model:

G ∼ DP(M = 2, N(µ, (1 − α)σ²))

θi | G ∼ iid G,

Yi | θi, σ² ∼ N(θi, ασ²), i = 1, ..., n.   (3.9)

The hyper-parameters in both models are assumed to be independent and to follow

the common prior distributions

µ ∼ N(µ0, ρ²), σ² ∼ IG(aσ², bσ²), α ∼ Beta(aα, bα).

One advantage of Griffin's MDP model is the ease of interpretation of model parameters. Parameters µ and σ² can be interpreted as the location and scale of the prior marginal predictive distribution. More interestingly, the scale-splitting parameter α efficiently controls the number of Gaussian components, leading to a means of adjusting the smoothness of predictive distributions. Moreover, in most discussions of MDP models, noninformative priors are not recommended. Thus when data-based priors are used, determination of hyperparameters for the prior distributions of µ and σ² can be very easy. The choice of prior distribution for α is extensively discussed in Griffin's paper.

We have also run similar simulations to compare this pair of models with µ0 = 0, ρ² = 500, aσ² = 12 and bσ² = 11. The prior for σ² strongly forces the scale parameter σ² to concentrate around one, since the data-generating processes are standardized to guarantee mean 0 and standard deviation 1. (aα = 1, bα = 1) are the hyperparameters for α recommended by Griffin. Under non-Gaussian data, such as skew-normal, Student-t and symmetric mixtures of normals, Griffin's MDP model with this prior distribution for α can catch up with the normal model in terms of expected marginal predictive likelihood with a very small training data set (fewer than 10 observations). Consequently, the expected-cumulative-log-Bayes-factor does not suffer badly from the cumulative effect, resulting in a model preference that is almost consistent with predictive performance. Alternatively, simulation shows that if a prior distribution for α strongly forces the splitting ratio to be about 1/19, for example (aα = 5, bα = 95), the patterns of expected-cumulative-log-Bayes-factors and the calibration sample sizes are very similar to the results of comparing the former pair of M1 and M2, where the ratio τ²/σ² is actually concentrated around 1/19 as well.

Chapter 4: Linear Regression Model

4.1 Background

Linear regression models are the mainstay of statistical estimation and prediction.

In many scenarios, they provide good approximation to the relationship between

explanatory variables and the outcome of interest (Gelman, Carlin, Stern and Rubin,

2003). Model selection (or variable selection), which identifies a particular subset of

available predictors that very “efficiently” explains the data, is an important problem

in regression analysis. For a particular data set, the words “efficiently” and “explain”

loosely imply a good balance between the number of predictors and the likelihood for

the responses. In other words, we tend to use as few predictor variables as possible

to sufficiently explain the variation of the response variable. In this chapter, we will

extend our methods for the multivariate normal model to model comparison in the

linear regression setting.

The general setup is as follows: let Y = Xiβi + ϵi for i = 1, 2, where Xi is a given (n × ki) design matrix of rank ki < n and ϵi ∼ N(0, σ²I). A common approach is to compare different models under some pre-chosen criterion such as Mallows' Cp, AIC or BIC. Another approach is to place a prior π(βi, σ²) on the parameters in each model and then compare models by their posterior probabilities when the prior probabilities

72 on the models are all equal. (See Smith and Spiegelhalter (1980), Pericchi (1984),

George and McCulloch (1993, 1995, 1997) and Geweke (1996) for detailed literature review on this approach.) The ratio of marginal likelihoods between two models is the Bayes factor.

When there is no clear prior information or when the model dimension is very high, diffuse priors are often used on model parameters. For improper priors, Berger and

Pericchi (1996a, 2004) advocated using “minimal training samples” to update the priors so that the partial posteriors are proper, and then to use the partial posteriors in place of the priors with the rest of the data for regression model selection. However, the method of minimal training samples cannot be applied to proper but diffuse priors and cannot be applied to model comparisons involving infinite dimensional models.

Instead, we will extend our proposed calibrated Bayes factor approach to develop a robust and widely applicable method for regression model selection.

Recall that for the multivariate model, we monitor the distribution of SKL to track the concentration level of the updated prior and suggest using the first quartile of the χ2(1) distribution as the criterion of sufficient updating. For a regression model, we are prevented from using the same criterion because the distribution of

SKL depends not only on the distribution of the prior but also on the variation of predictors in the model. Therefore, we need to take this factor into consideration and adjust the criterion accordingly. Note that equation (3.2) can still be employed to calibrate the original Bayes factor since the assumption of independence is valid. In this chapter, we present our solution to this adjustment and give a variety of examples of simulations and real data analysis to illustrate the impact and the adaptability of the calibration.

73 4.2 Model Specification

First, we introduce our fundamental model setting and notation. Let X represent an n × p design matrix, where n is the sample size. Each column of X denotes the realizations of a predictor variable and each row of X denotes an observed subject. Subjects are assumed to be mutually independent. An n × 1 vector Y represents the responses from the n subjects. A p-dimensional vector β represents the coefficients for the predictors and intercept. In addition, if all involved predictors and responses are centered to have average zero, the least squares regression surface passes through the origin, and the intercept term can be eliminated. We follow common practice and do so here. As usual, σ² represents the common variance shared by all subjects. Based on this notation, a Bayesian linear regression model can be expressed as follows:

β, σ² ∼ π(β, σ²)

Y ∼ N(Xβ, σ²I)   (4.1)

where π(β, σ²) denotes the prior distribution of the parameters (β, σ²), which may or may not depend on the design matrix X, and I represents the identity matrix. In this thesis, we investigate several forms of π(β, σ²). A brief introduction to these priors is

presented in the next section.

In addition, when comparing a pair of nested models, suppose βγ is the p-dimensional vector of coefficients for the simpler model where, without loss of generality, components βγ,h1, βγ,h2, ..., βγ,hK are fixed at zero, and β is the p-dimensional vector of coefficients for the full model. It is easy to see that βγ is equivalent to putting the constraint C′β = 0 on β, where C is a p × K matrix with C[hk,k] (k = 1, 2, ..., K) equal to one and the other entries equal to zero.

4.3 Prior Distributions

4.3.1 Improper Priors

The first class of improper priors that we consider is a class of noninformative priors of the form

π^N(β, σ) ∝ σ^{−(1+q)}, q ≥ −1, or equivalently,

π^N(β, σ²) ∝ (σ²)^{−(2+q)/2}, q ≥ −1.   (4.2)

In particular, when q = 0, this prior is the default prior suggested by Jeffreys (1961) for general location-scale problems. This class of priors was considered in Berger and

Pericchi (1996b) for the linear regression problem. Since the normalizing constant is undefined, the marginal likelihood and thus the Bayes factor based on such a prior do not exist, whereas a minimal training sample can be used to yield a proper partial posterior, leading to the intrinsic Bayes factor. When q > −1, a minimal training sample (Y_MTS, X_MTS) is a sample of size p + 1 such that (X′_MTS X_MTS) is non-singular. And when q = −1, a minimal training sample is a sample of size p + 2 such that (X′_MTS X_MTS) is non-singular. The marginal likelihood is proportional to

2^{q/2} Γ((n+q−p)/2) / [ π^{(n−p)/2} SSE^{(n+q−p)/2} |X′_MTS X_MTS|^{1/2} ],

where SSE = Y′(I − X_MTS(X′_MTS X_MTS)^{−1} X′_MTS)Y. Interested readers should also read Berger and Pericchi (1996b, 2001).

75 The second class of improper priors that we consider is a class of intrinsic priors

defined in Berger and Pericchi (1996a). The Bayes factor under an intrinsic prior is

asymptotically equal to the intrinsic Bayes factor under the noninformative prior. The

form of the intrinsic prior distribution depends on both models under consideration and

is usually not unique. Berger and Pericchi (1996a) gave a general solution for intrinsic

priors in the nested model scenario in their paper. Please also see Dmochowski

(1994) for more references. In addition, following their work, Casella and Moreno

(2006) specifically wrote out a pair of intrinsic priors for comparing two nested linear

regression models. Suppose βγ and σγ are the vector of coefficients and the standard

deviation for the reduced model, which is nested in the more complex model with

coefficients β and standard deviation σ. To simplify the expressions, we assume

βγ and β have the same dimension p, but that some of the coefficients in βγ are

fixed at zero. The intrinsic prior for (βγ, σγ) given in Casella and Moreno (2006) is

π^I(βγ, σγ) = π^N(βγ, σγ), i.e., (4.2) with q = 1. Consequently, the intrinsic prior for

(β, σ) can be represented in conditional form as

π^I(β, σ | βγ, σγ) ∝ Np(β | βγ, (σγ² + σ²)W^{−1}) · (1/σγ) (1 + σ²/σγ²)^{−3/2},

where W = Z′Z and Z is a theoretical design matrix of dimension (p + 1) × p, and

Np(· | β, Σ) represents a p-dimensional multivariate normal density with mean vector β and covariance matrix Σ. Obviously, such intrinsic priors are improper priors, but the Bayes factor comparing them is well defined since the normalizing constants for the improper parts cancel.

76 In fact, let Xγ denote the design matrix under the simpler model with pγ (number

of non-zero values in βγ) columns; then the Bayes factor is given by

BFγ = |X′γXγ|^{1/2} (Y′(In − Hγ)Y)^{−(n−kγ+1)/2} Iγ^{−1},   (4.3)

where

Hγ = Xγ(X′γXγ)^{−1}X′γ,

Iγ = ∫₀^{π/2} [ |Aγ(ψ)|^{1/2} |B(ψ)|^{1/2} Eγ(ψ)^{(n−kγ+1)/2} ]^{−1} dψ,

B(ψ) = (sin²ψ) In + X W^{−1} X′,

Aγ(ψ) = X′γ B^{−1}(ψ) Xγ, and

Eγ(ψ) = Y′( B^{−1}(ψ) − B^{−1}(ψ) Xγ Aγ^{−1}(ψ) X′γ B^{−1}(ψ) )Y.

4.3.2 Proper Priors

In addition to the improper priors, we also consider several commonly used proper

but diffuse priors. We represent a prior π(β, σ2) as π(β, σ2) = π(β | σ2)π(σ2) and our

focus is on the specification of the conditional distribution π(β | σ2).

We choose the prior on σ², that is π(σ²), to be an inverse-gamma distribution IG(a, b) with density function

f(σ² | a, b) = (b^a / Γ(a)) (σ²)^{−a−1} exp(−b/σ²),

because this form is conditionally conjugate with the regression likelihood (4.1). We

set a = b = 0.1.

A natural choice for π(β | σ2) is a multivariate normal distribution of the form

β ∼ N(µβ = 0, Σβ).

Note that the prior mean is zero because the prior will be used to test whether a particular subset of β is fairly close to zero. The covariance matrix Σβ is usually set to be a diagonal matrix Diag{σ²β1, ..., σ²βp}, so that the regression coefficients are not only independent of σ² but also mutually independent among themselves, that is,

β ∼ N(µβ = 0, Σβ = Diag{σ²β1, ..., σ²βp}).

Specifically, when σ²β1 = σ²β2 = ... = σ²βp = σ²β, the prior becomes

β ∼ N(0, σ²β I).

Another class of proper conditional priors for β is the famous g-prior introduced

by Zellner (1986). It can be represented as

β | σ², X, g ∼ N(µβ = 0, g σ²(X′X)^{−1}),

where g is a hyperparameter. The g-prior has been widely adopted in normal linear regression analysis because of its conjugacy with the likelihood in the regression model and the consequent computational efficiency in evaluating marginal likelihoods. Under a regular regression setting, g = n leads to a “unit information prior”, since (X′X)/σ² represents the observed information about β. More choices of g are discussed in Liang, Paulo, Molina, Clyde, and Berger (2008).
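With known σ², the g-prior is conjugate: β | Y, σ² ∼ N((g/(1+g)) β̂, (g/(1+g)) σ²(X′X)⁻¹), where β̂ is the least squares estimate. A small sketch with synthetic data (all settings illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 100, 3, 1.0
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, 0.0, -2.0])
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

g = n                                          # unit information prior
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # least squares estimate
shrink = g / (1.0 + g)
post_mean = shrink * beta_hat                  # E[beta | Y, sigma^2] under the g-prior
post_cov = shrink * sigma2 * np.linalg.inv(X.T @ X)

print(beta_hat, post_mean)
```

With g = n the shrinkage factor n/(n+1) is nearly one, so the "unit information" posterior mean pulls β̂ only slightly toward the prior mean of zero.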

In addition, Liang, Paulo, Molina, Clyde, and Berger (2008) extend this idea, proposing mixtures of g-priors. One of them assigns an inverse-gamma prior distribution, IG(g | 1/2, n/2), on g. It is easy to prove that this mixture of g-priors is equivalent to a multivariate Cauchy distribution:

π_Mg(β | σ²) = [ Γ((p+1)/2) |X′X/(nσ²)|^{1/2} / π^{(p+1)/2} ] (1 + β′X′Xβ/(nσ²))^{−(p+1)/2}.

The multivariate Cauchy prior for regression coefficients was introduced by Zellner and Siow (1980), who developed suitable multivariate extensions of Jeffreys' prior in the univariate normal mean problem. Note that both the proper g-prior and the proper mixture of g-priors require a design matrix X for which (X′X) is non-singular.

Both priors are conditionally conjugate with the regression likelihood (4.1), thus the posterior can be easily obtained by running a Gibbs sampler.

Besides Gaussian and Cauchy priors, we study the horseshoe prior proposed by

Carvalho, Polson and Scott (2009, 2010), which can be written in a hierarchical form

as follows:

τ ∼ C⁺(0, σ)

λh ∼ iid C⁺(0, 1)

βh ∼ N(µβh = 0, λh²τ²),   h = 1, 2, ..., p,

where βh denotes a component of β and C⁺(c0, γ) denotes a half-Cauchy distribution with location parameter c0 and scale parameter γ. Carvalho, Polson and Scott (2009,

2010) showed that this prior density is heavy-tailed and goes to infinity at zero. When

the observed data Yh is relatively far from the prior mean, the corresponding shrinkage coefficient tends to be near zero, implying little shrinkage, thus the posterior mean E[βh | Y] is dominated by the data. However, if Yh is relatively close to the prior mean,

the posterior mean will be shrunken toward zero substantially since the distribution

of this shrinkage coefficient will be concentrated around one. Thus it works very well

for estimating sparse location parameters in multivariate normal mean problems.
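Draws from the horseshoe hierarchy above are easy to generate, since a half-Cauchy variate is the absolute value of a Cauchy variate; the sketch below is illustrative, not the thesis's MCMC:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_horseshoe_prior(p, sigma=1.0, size=10000):
    """Draw `size` vectors beta from the horseshoe prior:
    tau ~ C+(0, sigma), lambda_h ~ C+(0, 1), beta_h ~ N(0, lambda_h^2 tau^2)."""
    tau = np.abs(sigma * rng.standard_cauchy(size))[:, None]
    lam = np.abs(rng.standard_cauchy((size, p)))
    return rng.normal(0.0, lam * tau)

beta = sample_horseshoe_prior(p=20)
# Heavy tails plus a spike at zero: many tiny draws, a few enormous ones.
print(np.mean(np.abs(beta) < 0.1), np.max(np.abs(beta)))
```

The printed summary illustrates the two features discussed above: a large fraction of draws concentrates very near zero while the Cauchy tails occasionally produce huge values.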

79 0 500 1000 1500

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

λ i −20 0 20 40 60 80

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

β i

Figure 4.1: Box-plot for the posterior draws of βh’s and λh’s.

To illustrate the shrinkage property of this prior, we consider the following normal model:

σ² ∼ IG(0.1, 0.1)

β | σ² ∼ πHS(β | σ²)

Yhj ∼ N(βh, σ²),   j = 1, 2.

Yhj (j = 1, 2) are actually simulated from a normal distribution N(βh, σ²), where β1 = ... = β10 = 2, β11 = ... = β15 = 4, β16 = ... = β18 = 15, β19 = 30, β20 = 60, and σ² = 50. Posterior draws were generated by running MCMC for

20,000 iterations after a long enough burn-in period. Some results are presented in

Figure 4.1 and Figure 4.2. The shrinkage characteristics of the horseshoe prior can be observed in Figure 4.2, where β̂h = E[βh | Y] is plotted against the least squares estimate Ȳh = (Yh1 + Yh2)/2. The important differences occur when Ȳh < 10 and when Ȳh is relatively large. For the former case, it is easy to see that posterior means based on the horseshoe prior are apparently closer to zero, compared to the least squares estimates. For the latter case, as Ȳh departs from zero (the prior mean), the differences vanish.


Figure 4.2: Plot of least squares estimates of βh's against horseshoe posterior averages. βt in the legend represents the true β values. The dashed line represents equal least squares estimates and posterior averages.

This property of the horseshoe prior is desirable for estimating high dimensional sparse parameters. In particular, it can be employed in regression models to distinguish between significant and insignificant predictors. More discussion of the horseshoe prior can be found in Carvalho, Polson and Scott (2009, 2010).

4.4 Calibration

4.4.1 Information Metric

Recall that our idea of calibrated Bayes factor is to use training data to update the

original model-specific prior until the updated prior achieves a certain concentration

level. For linear regression models, following the approach to the one-sample model,

we employ the symmetrized Kullback-Leibler divergence (SKL) defined in (3.3) to

measure the distance between two likelihood functions and determine the calibration

sample size by tracking the concentration level of SKL samples.

Let (β^(1), σ²^(1)) and (β^(2), σ²^(2)) denote two independent draws from the partially

updated prior, let X0 denote one (random) realization of all of the predictors involved

in the model and let ϕσ(·) denote a univariate normal likelihood function with mean

zero and standard deviation σ, then the SKL distance between two normal linear

regression models at X0 is

SKL = (1/2) ∫ φσ(1)(y − X0β^(1)) log[ φσ(1)(y − X0β^(1)) / φσ(2)(y − X0β^(2)) ] dy
    + (1/2) ∫ φσ(2)(y − X0β^(2)) log[ φσ(2)(y − X0β^(2)) / φσ(1)(y − X0β^(1)) ] dy.   (4.4)

For linear regression models, the distribution of (4.4) depends not only on the partially updated prior, but also on the distribution of X0. Following the approach to

the multivariate normal model, the dependence can be removed by pooling all SKL samples over the sampling distribution of X0 and the sampling distribution of training samples (of the same size) to get the marginal distribution. Given a particular model under consideration, we use the corresponding marginal distribution of SKL to evaluate the concentration level based on partially updated priors. However, when data are collected in practice, the underlying sampling distribution of X0 is not generally accessible. To solve this problem, we suggest using the empirical distribution by sampling X0 from the observed design matrix X with probability P(X0 = Xi) = 1/n (i = 1, 2, . . . , n), where Xi denotes the ith row of X.

4.4.2 Target of Concentration

To determine the target of concentration for the prior of a p-dimensional model,

we study the SKL for a p-dimensional linear regression model with the same design

matrix X, known variance σ² and the unit-information prior on β (Zellner's g-prior with g = n) as the benchmark model. Under this model, (4.4) becomes

\[
\mathrm{SKL}\left(\phi_{\sigma}(y - X_0\beta^{(1)}),\ \phi_{\sigma}(y - X_0\beta^{(2)})\right)
= \frac{\left(X_0\beta^{(1)} - X_0\beta^{(2)}\right)^2}{2\sigma^2}. \tag{4.5}
\]

According to the unit information prior, we obtain

\[
\beta^{(1)} - \beta^{(2)} \sim N\!\left(0,\ 2n\sigma^2 (X'X)^{-1}\right),
\]

and so

\[
\frac{X_0\left(\beta^{(1)} - \beta^{(2)}\right)}{\sqrt{2\sigma^2}} \sim N\!\left(0,\ nX_0(X'X)^{-1}X_0'\right).
\]

Therefore the SKL defined in (4.5) follows the distribution of A·χ²_(1), where the subscript denotes the degrees of freedom and the scalar A = nX0(X′X)⁻¹X0′. The arbitrariness from the choice of X0 leads to a mixture of scaled χ²_(1) distributions, the concentration level of which is considered to be the target that a calibrated prior should achieve.

Specifically, when the sampling distributions of X0 and the rows of X are identical and in the form of a multivariate normal with mean zero and variance-covariance matrix ΣX, it is easy to prove that as n goes to infinity the distribution of A approaches a χ²_(p) distribution (since n(X′X)⁻¹ → ΣX⁻¹), and thus the SKL follows the distribution of the product of χ²_(1) and χ²_(p), where p is the number of columns of X (the dimension of the model). The closed form of this CDF does not exist, so we use numerical integration to evaluate its quantiles. The results are presented in Table 4.1. As mentioned before, the sampling distribution of X0 is generally unknown. Instead, we use the empirical distribution by sampling X0 from X with equal probabilities. Then the corresponding distribution of SKL would still be a mixture of scaled χ²_(1), and the quantiles can be estimated by sampling methods with confidence intervals.

quantile   p = 10   p = 9    p = 8    p = 7    p = 6    p = 5    p = 4    p = 3    p = 2    p = 1
0.05       0.0334   0.0296   0.0256   0.0217   0.0178   0.0139   0.0100   0.0062   0.0026   0.0002
0.10       0.1346   0.1188   0.1031   0.0873   0.0716   0.0560   0.0405   0.0253   0.0111   0.0012
0.20       0.5493   0.4849   0.4211   0.3574   0.2936   0.2302   0.1676   0.1065   0.0498   0.0079
0.25       0.8706   0.7698   0.6686   0.5679   0.4670   0.3673   0.2684   0.1721   0.0828   0.0148
0.30       1.2778   1.1298   0.9821   0.8349   0.6877   0.5421   0.3980   0.2577   0.1272   0.0253
0.40       2.3863   2.1136   1.8404   1.5679   1.2970   1.0276   0.7623   0.5037   0.2610   0.0622
0.45       3.1180   2.7629   2.4089   2.0558   1.7034   1.3542   1.0102   0.6740   0.3574   0.0922
0.50       3.9949   3.5444   3.0928   2.6440   2.1965   1.7522   1.3138   0.8860   0.4803   0.1333

Table 4.1: Some quantiles of χ²_(1)·χ²_(p): the bisection method is used to search for the quantiles listed. Integration is approximated by the trapezoid rule.
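The tabulated quantiles can be sanity-checked by plain Monte Carlo, simulating the product of independent χ²_(1) and χ²_(p) draws. The sketch below is ours (sample size, seed and function name are illustrative choices, not from the dissertation):

```python
import random

def chi2_product_quantile(p, q, n=300_000, seed=7):
    """Monte Carlo q-quantile of (chi-square(1)) * (chi-square(p))."""
    rng = random.Random(seed)
    draws = sorted(
        rng.gauss(0.0, 1.0) ** 2 * sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(p))
        for _ in range(n)
    )
    return draws[int(q * n)]
```

For instance, `chi2_product_quantile(10, 0.5)` should land near the tabulated median 3.9949, and `chi2_product_quantile(1, 0.5)` near 0.1333, up to Monte Carlo error.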

4.4.3 Calibration sample size

As in the multivariate normal model, we evaluate the concentration level of a

(calibrated) prior by the first quartile of the pooled SKL distribution. Under each model Mk, we choose a calibration sample size sk so that the calibrated priors are as concentrated as the unit information prior. For simulation studies, we sample both the Xi (i = 1, 2, . . . , n) and X0 from a multivariate normal distribution, and set the target distribution to be χ²_(1)·χ²_(p), which is the asymptotic distribution of SKL under the benchmark model. For the real data analysis, where the sampling distribution of X is not available, the target distribution becomes a mixture of the distributions of [nX0(X′X)⁻¹X0′]·χ²_(1), which can be estimated by sampling X0 from the Xi's.

4.5 Simulations

In this section, we use simulation studies to investigate the calibration effect on the Bayes factor in a regression setup. Suppose that the observations Y are generated from the following normal regression model:

\[
Y = X\beta_t + \epsilon, \qquad \epsilon \mid \sigma_t^2 \sim N(0, \sigma_t^2 I), \tag{4.6}
\]

where each row of X is generated from a multivariate normal distribution Np(µX = 0, ΣX) with a pre-specified ΣX, and βt is the regression coefficient. In the different regression models under comparison, the dimensions of X and β can differ. When there is no strong prior information on the regression coefficient β, a common prior is a diffuse multivariate normal prior. The normal form of this prior density guarantees conditional conjugacy, and the assumption of independence among coefficients enables one to specify the variance-covariance matrix easily. Moreover, we place an inverse-gamma prior IG(0.1, 0.1) on the variance σ² throughout.
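The data-generating process (4.6) is straightforward to simulate. A minimal sketch follows; the sample size, seed, and the parameter values (taken from Study 1 below) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, beta_t, Sigma_X, sigma2_t):
    """One draw from model (4.6): rows of X ~ N(0, Sigma_X), Y = X beta_t + eps."""
    p = len(beta_t)
    X = rng.multivariate_normal(np.zeros(p), Sigma_X, size=n)
    eps = rng.normal(0.0, np.sqrt(sigma2_t), size=n)
    return X, X @ beta_t + eps

# Study 1 setup: independent predictors, sigma_t^2 = 50
beta_t = np.array([2, 2, 2, 4, 4, 4, 4, 12, 15, 60], dtype=float)
X, Y = simulate(100, beta_t, np.eye(10), 50.0)
```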

We study regression model comparisons under the calibrated Bayes factor in a variety of scenarios. The two models under comparison are nested in all cases. We

first consider comparisons between two regression models where the predictors are independent and the reduced model excludes one or more predictors. We then consider comparisons between regression models where the predictors are dependent, with several possible collinear situations. In these comparisons, we always set the distance

between the full model and the reduced model to be the same fixed value. Moreover,

we investigate regression model comparisons under different diffuseness levels of prior

distributions and regression model comparison where the expected Bayes factor plot

has multiple turning points when the sample size increases.

4.5.1 Independent normal priors on low dimension β

Study 1: Independent predictors and multiple (K = 3) fixed zeros in βγ

The data is generated from a normal regression model with independent predictors: each row of X is generated from N(0, I10), βt = (2, 2, 2, 4, 4, 4, 4, 12, 15, 60)′, and σ²_t = 50. The reduced model M1 sets the first three coordinates to be zero, that is, βγ,1 = βγ,2 = βγ,3 = 0. The full model M2 allows all 10 coordinates to vary freely.

Suppose we use SSE to represent the error sum of squares under the full model

and use SSEγ to represent the error sum of squares under the reduced model, then

SSEγ ≤ SSE. The average distance between the full model and the reduced model can be measured by

\[
D = \lim_{n \to \infty} E\!\left[\frac{SSE_\gamma - SSE}{n\sigma_t^2}\right],
\]

which is the expected average difference of the error sum of squares between the full model and the reduced model in the asymptotic sense. If the full model is true, that is β = βt and σ² = σ²_t, simple algebra shows that (SSEγ − SSE)/σ²_t follows a noncentral χ²(p − K, λ) distribution, and thus

\[
E\!\left[\frac{SSE_\gamma - SSE}{\sigma_t^2}\right] = (p - K) + 2\lambda,
\]

where

\[
\lambda = \frac{1}{2\sigma_t^2}\,(C'\beta)'\left[C'(X'X)^{-1}C\right]^{-1}(C'\beta)
\]

is the noncentrality parameter. Because (p − K)/n → 0 and X′X/n → ΣX (because µX = 0) as n → ∞, we obtain

\[
D = \frac{1}{\sigma_t^2}\,(C'\beta)'\left[C'\Sigma_X^{-1}C\right]^{-1}(C'\beta). \tag{4.7}
\]

By formula (4.7), the distance between the reduced model and the full model for our values of β is D = 0.24. The empirical CDFs of SKL under M1 and M2 are plotted against the empirical CDFs of SKL under the benchmark model in Figure 4.3. We draw 10000 samples from the partial posterior using the Gibbs sampler and then subsample to generate nearly independent SKL samples.
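Formula (4.7) is easy to evaluate numerically; the sketch below (function and variable names are ours) reproduces D = 0.24 for this study, where C selects the three coordinates the reduced model fixes at zero, ΣX = I10, and σ²_t = 50:

```python
import numpy as np

def model_distance(beta_t, Sigma_X, sigma2_t, zero_idx):
    """D of formula (4.7), with C selecting the coordinates fixed at zero."""
    C = np.eye(len(beta_t))[:, zero_idx]     # p x K selection matrix
    Cb = C.T @ beta_t                        # C' beta
    M = C.T @ np.linalg.inv(Sigma_X) @ C     # C' Sigma_X^{-1} C
    return float(Cb @ np.linalg.solve(M, Cb)) / sigma2_t

beta_t = np.array([2, 2, 2, 4, 4, 4, 4, 12, 15, 60], dtype=float)
D = model_distance(beta_t, np.eye(10), 50.0, [0, 1, 2])   # D = 12/50 = 0.24
```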

The plot of expected-log-Bayes-factors against sample size is shown in Figure 4.4.

Given each sample size, we simulate 1000 data sets from the true model, calculate the Bayes factor based on each of them, and then take the average of these Bayes factors. To obtain the marginal likelihood, we calculate the integrals with respect to β and βγ analytically and approximate the integral with respect to σ² using Monte Carlo integration with 10000 draws from the prior. The calibration sample sizes turn out to be s1 = 10 and s2 = 12 respectively, thus we calibrate at sample size 12.

[Figure 4.3: left panel, benchmark model with p = 7 and training sample sizes m = 7, 10, 12; right panel, benchmark model with p = 10 and m = 10, 12, 14.]

Figure 4.3: The vertical axis represents the empirical CDF of SKL samples. The horizontal axis represents the CDF of χ²_(1)·χ²_(p). The red dashed line is the reference line with slope 1. The left panel illustrates the relationship under the reduced model, and the right panel illustrates the relationship under the full model. In each panel, the relationship is plotted for 300 different training samples of size m.

At sample size 22, the original expected-log-Bayes-factor reaches its peak of 4.19, corresponding to a Bayes factor of 66.12, which is considered as strong evidence against M2. After the calibration, the peak drops to 1.34, which is about 3.81 on the original scale and is considered as substantial evidence in Jeffreys' scale and positive evidence in Kass and Raftery's scale against M2, respectively.

Study 2: Independent predictors and one (K = 1) fixed zero in βγ In this study, we still let the predictors be independent and maintain the distance between

[Figure 4.4: the horizontal dashed line marks E(logBF) = 2.85, the amount removed by calibration.]

Figure 4.4: The green dots represent the expected-log-Bayes-factor while the red dots denote the endpoints of the 95% confidence intervals. The vertical dashed line represents the calibration sample size and the horizontal dashed line is marked at the amount that should be removed from the original expected-log-Bayes-factor when it is calibrated.

the reduced model and the full model at D = 0.24. However, we now let the reduced model and the full model differ by only one dimension. Suppose that each row of the design matrix X is generated from N(0, I8), βt = (βt,1, 4, 4, 4, 4, 12, 15, 60)′ and σ²_t = 50, and that the reduced model sets the first coordinate βγ,1 to zero. To maintain the model distance D = 0.24, we must choose the constant βt,1 accordingly; by formula (4.7), we set βt,1 = 2√3. Figure 4.5 shows the plot of the expected-log-Bayes-factor as the sample size increases up to 100. Comparing it with the Bayes factor in Study 1, we can see that the two curves tend to approach the same slope as the sample size increases. The calibration sample size turns out to be 11. Computational details are identical to Study 1. After calibration, the peak of the expected-log-Bayes-factor curve is reduced from 0.80 to 0.12, which implies that on the original scale the peak of the Bayes factor is halved from 2.23 to 1.13, a level interpreted by earlier works as very doubtful evidence supporting M1.

[Figure 4.5: the horizontal dashed line marks E(logBF) = 0.68, the amount removed by calibration.]

Figure 4.5: The green dots represent the expected-log-Bayes-factor while the red dots denote the 95% confidence intervals. The vertical dashed line represents the calibration sample size and the horizontal dashed line is marked at the amount that should be removed from the original expected- log-Bayes-factor when it is calibrated.

Study 3: Multicollinearity among predictor variables We now switch our attention to the scenarios where the predictors are dependent. Each row of the design matrix X is drawn from N(0, ΣX), where ΣX has three possible structures. The first, ΣX^(1), is the p × p compound-symmetric matrix with 1's on the diagonal and 0.5 in every off-diagonal position. The second is block diagonal,

\[
\Sigma_X^{(2)} = \begin{pmatrix} V_K & 0 \\ 0 & V_{p-K} \end{pmatrix},
\]

where Vm denotes the m × m compound-symmetric matrix with 1's on the diagonal and 0.5 off the diagonal. The third, ΣX^(3), is the p × p tridiagonal matrix with 1's on the diagonal, 0.5 on the first off-diagonals, and 0 elsewhere.


Figure 4.6: The top row shows the curves of the original expected-log-Bayes-factor. The bottom row shows the curves of calibrated Bayes factors. The blue curves represent the dependent cases and the black curves represent the independent case. Panels A, B and C correspond to ΣX^(1), ΣX^(2) and ΣX^(3) respectively.

With each structure of the variance-covariance matrix, we simulate data from the regression model with βt = (2, 2, 2, 4, 4, 4, 4, 12, 15, 60)′. To retain the model distance D = 0.24, we set σ²_t as 34.38, 100 and 76.04 respectively by formula (4.7). The models in comparison and computational details are identical to Study

1. For all three covariance structures, the calibration sample sizes turn out to be 12.

The original and calibrated expected-log-Bayes-factor curves are presented in Figure

4.6 and compared with the curve from Study 1. The plots in the top row show the original expected-log-Bayes-factors and the plots in the bottom row show the calibrated ones. It is easy to see that the gap between the expected-log-Bayes-factor curves for the independent case and the dependent cases is significantly reduced after calibration, especially for the first covariance structure. Thus the impact of collinearity is neutralized to a certain extent by the calibration.
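The σ²_t values used in this study can be recovered from formula (4.7) by solving D = 0.24 for σ²_t. A sketch under our reading of the three covariance structures (p = 10, K = 3; names are ours):

```python
import numpy as np

def required_sigma2(beta_t, Sigma_X, zero_idx, D_target=0.24):
    """Solve (4.7) for sigma_t^2 so the model distance equals D_target."""
    C = np.eye(len(beta_t))[:, zero_idx]
    Cb = C.T @ beta_t
    M = C.T @ np.linalg.inv(Sigma_X) @ C
    return float(Cb @ np.linalg.solve(M, Cb)) / D_target

beta_t = np.array([2, 2, 2, 4, 4, 4, 4, 12, 15, 60], dtype=float)
zeros = [0, 1, 2]

S1 = np.full((10, 10), 0.5) + 0.5 * np.eye(10)                # compound symmetric
V = lambda m: np.full((m, m), 0.5) + 0.5 * np.eye(m)
S2 = np.block([[V(3), np.zeros((3, 7))],
               [np.zeros((7, 3)), V(7)]])                     # block diagonal
S3 = np.eye(10) + 0.5 * (np.eye(10, k=1) + np.eye(10, k=-1))  # tridiagonal

# required_sigma2(beta_t, S1, zeros) -> 34.375   (reported as 34.38)
# required_sigma2(beta_t, S2, zeros) -> 100.0
# required_sigma2(beta_t, S3, zeros) -> 76.0417  (reported as 76.04)
```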

Study 4: Diffuseness of the prior on coefficients The goal of this study is to investigate how the Bayes factor and the calibrated Bayes factor change under different diffuseness levels of the prior on the coefficients. We reconsider the setup of Study 1, but increase the diffuseness of the priors on the coefficients by inflating σ²_β to 500, 5000, 50000 and 500000 respectively. Computational details are the same as in Study 1. The expected-log-Bayes-factor curves are presented in the top panel of Figure 4.7. It is easy to see that the swings reach their peaks, with different heights, at almost the same sample size, and then tend to be quite parallel after their peaks. The results of the calibration are presented in the bottom panel. All of the curves of the calibrated Bayes factors are very similar. Thus the diffuseness of the prior can have a very strong impact on the Bayes factor, and this impact can be limited by calibration.


Figure 4.7: The top panel shows the expected-log-Bayes-factors as σ²_β varies from 500 to 500,000. The green dots represent the expected-log-Bayes-factor while the red dots indicate 95% confidence intervals. The bottom panel shows the calibrated Bayes factors individually.

Study 5: Bayes factors with two turning points

[Figure 4.8: the first turning point is at E(logBF) = −1.58; the second is at E(logBF) = 10.55.]

Figure 4.8: The green dots represent the expected-log-Bayes-factor while the red dots are the 95% confidence intervals. The vertical dashed line denotes the calibration sample size. The bottom horizontal dashed line is marked at the first turning point. The top horizontal dashed line is marked at the second turning point.

This study illustrates a very interesting phenomenon that appears in a comparison between a medium-dimension regression model and a low-dimension regression model. The data are generated from a 50-dimensional model, where βt includes 15 replicates of two, then 20 replicates of four, 5 replicates of twelve, 5 replicates of fifteen and 5 replicates of sixty. Each row of X is generated from N(0, I50) and σ²_t = 50. The full model (M2) under comparison includes all 50 predictors and the reduced model only allows the last 5 predictors to be nonzero (the predictors with true coefficient 60). For both models, the prior distributions for the nonzero coefficients are independent normal with mean 0 and σ²_β = 500. The curve of the expected-log-Bayes-factor is presented in Figure 4.8. It is interesting that the curve has two turning points. The first (a local minimum) is observed at sample size 6 and the second (a local maximum) is at sample size 45. According to (1.6), this implies that, in terms of predictive performance, there exist two switches of superiority as the sample size increases. The reason is that the reduced model has all of the most influential predictors. When the sample size is very small, estimation of the reduced model is inaccurate. When the sample size is moderate, the reduced model captures the most important parts, and so appears better. When the sample size is large enough, the less-influential predictors are picked up, and so the full model is better.

The final calibration sample size turns out to be 51, thus both turning points are

truncated. The calibrated Bayes factor goes down from the very beginning, and thus

always gives the correct model preference.

4.5.2 Horseshoe Priors on high dimension β

As we mentioned earlier, the horseshoe prior is very efficient at dealing with sparsity, and thus can be applied for the purpose of variable selection. For this high dimensional scenario, we want to find out how the calibrated Bayes factor performs when the models under consideration use this prior. We generate data from the normal regression model and draw the rows of the design matrix X from N(0, ΣX = I). The true coefficient vector βt is 500-dimensional, including 400 replicates of zero, 90 replicates of two and 10 replicates of twenty, and σ²_t = 50. We study two sets of comparisons.


Figure 4.9: The green dots represent the expected-log-Bayes-factor while the red dots denote the 95% confidence intervals. The vertical dashed line represents the calibration sample size and the horizontal dashed line is marked at the amount that should be removed from the original expected- log-Bayes-factor when it is calibrated.

In the first comparison, the full model M2 includes the entire design matrix, and M1 represents a reduced model where the first 400 components of βγ are fixed at zero. For the full model M2, we place the horseshoe prior on β. For the reduced model M1, we place the independent Gaussian prior N100(0, 500I) on the nonzero components. In both models, an inverse-gamma prior IG(0.1, 0.1) is placed on σ².

The expected-log-Bayes-factor curve can be observed in Figure 4.9. We simulated 300 different data sets to evaluate the expectation. The marginal likelihood for M1 is approximated via Monte Carlo integration. Posterior draws are obtained by running MCMC and then subsampled to remove dependence. Given posterior draws, the integrals involved in (3.3) were evaluated analytically. The peak of 18.79 is observed at sample size 60. It turns out that the calibration sample size ends up at 102, leading to a downshift of 4.05 on the curve. We can see that the calibrated Bayes factor keeps going down, and thus M2 is preferred.


Figure 4.10: The green dots represent the expected-log-Bayes-factor while the red dots denote the 95% confidence intervals. The vertical dashed line represents the calibration sample size and the horizontal dashed line is marked at the amount that should be removed from the original expected- log-Bayes-factor when it is calibrated.

In the second comparison, for the reduced model M1, we replace the Gaussian priors on the nonzero components of βγ by a 100-dimensional horseshoe prior. The curve of the expected-log-Bayes-factor is shown in Figure 4.10. The entire curve stays below zero, so the Bayes factor prefers M2 from the very beginning. The final calibration sample size is 11. The calibration sample size is dramatically reduced from that in the first comparison because the calibration sample size for this pair is determined by M1, given that the horseshoe prior is a very strong prior with an infinitely tall spike at zero. The calibration leads to an upshift of 1.16, but the remaining part of the curve is still below zero. The calibrated Bayes factor thus tends to prefer M2 for any sample size as well.

4.6 Case Study

4.6.1 Description of Ozone data

In this section, we present an analysis of Breiman and Friedman’s (1985) well-

known ozone data. The data set consists of daily measurements (rounded up to the

nearest integers) of ozone concentration and eight meteorological quantities measured

in the Los Angeles basin for 330 days in 1976. The data set is included in the

software package R. Among others, Casella and Moreno (2006) and Yu, MacEachern

and Peruggia (2011) have analyzed this data set for the purposes of Bayesian model

averaging and model comparison. The former used the original ozone measurements

and the latter took the natural logarithm of ozone to improve the normality of the

residuals. In Yu, MacEachern and Peruggia (2011), three analysts independently

built models, each based on a subset of 110 of the 330 observations. We consider a

set of four models developed by Analyst 1. The models all include a sine curve for

the predictor “Day” (X10) to force it to be periodic with period 1 year. Appendix B

includes more information on the data set.

The comparisons in this section focus on five models along with a variety of priors for the models: four from Analyst 1 and the model favored in Casella and Moreno's (2006) analysis. The prior distributions include a mixture-of-g prior, an intrinsic prior and a conventional improper prior. The models are compared on the basis of log-Bayes-factor plots and in terms of original (where applicable) and calibrated Bayes factors.

4.6.2 Pairwise Comparisons among Model 1 - Model 4

The four models from Analyst 1 and their least squares estimates of coefficients are

presented in Table 4.2. The least squares analysis performed via classical hypothesis

tests at conventional levels suggests that predictors X4, X6, X9 and T (X10) remain in

the model. The expected-log-Bayes-factors are shown in Figure 4.11. The log Bayes

factors based on the conventional improper prior and the intrinsic prior are calculated

via the closed form expressions presented earlier, while the log Bayes factors based on

the mixture of g-priors are calculated via Monte Carlo integration with 5000 draws

from the prior distributions. Model 2 and Model 3 are not nested, thus the Bayes

factor for the intrinsic prior distribution is not well defined, and so the curve is missing

from the figure.

Table 4.2: Estimation of coefficients for Model 1-4 (* p-value is greater than 0.01)

Variable    Model 1         Model 2          Model 3          Model 4
Intercept   0.55            *2.15            0.63             *−1.64
X1          —               *−2.93 × 10⁻⁴    —                *4.20 × 10⁻⁴
X4          2.02 × 10⁻²     2.23 × 10⁻²      1.81 × 10⁻²      1.55 × 10⁻²
X6          −1.30 × 10⁻⁴    −1.31 × 10⁻⁴     −1.13 × 10⁻⁴     −1.12 × 10⁻⁴
X7          —               —                *1.89 × 10⁻³     *2.37 × 10⁻³
X7²         —               —                −9.41 × 10⁻⁵     −9.88 × 10⁻⁵
X9          −1.05 × 10⁻³    −1.06 × 10⁻³     −1.14 × 10⁻³     −1.10 × 10⁻³
T(X10)      3.99 × 10⁻¹     3.78 × 10⁻¹      4.59 × 10⁻¹      4.62 × 10⁻¹

Model 1 vs. Model 2 Based on the mixture of g-priors, the conventional improper prior and the intrinsic prior respectively, the log Bayes factors for the full data set are about 3.46, 3.85 and 3.50. The calibration sample sizes are 3, 9 and 8, leading to reduction of log Bayes factors by 0.12, 1.02 and 0.17 respectively. Thus the calibrated Bayes factor based on the mixture of g-priors is 28.31; the calibrated Bayes factor based on the conventional prior is 16.95; the calibrated Bayes factor based on the intrinsic prior is 27.96. They all provide strong evidence supporting Model 1, the simpler model, according to Jeffreys’ interpretation.

Model 1 vs. Model 3 Based on the mixture of g-priors, the conventional improper prior and the intrinsic prior respectively, the log Bayes factors for the full data set are about -10.04, -9.88 and -8.88. The calibration sample sizes are 3, 10 and

9, leading to reductions of the log Bayes factors by 0.21, 1.45 and 0.90 respectively. Thus the calibrated Bayes factor based on the mixture of g-priors is about 3.56 × 10⁻⁵; the calibrated Bayes factor based on the conventional prior is about 1.20 × 10⁻⁵; the calibrated Bayes factor based on the intrinsic prior is about 5.66 × 10⁻⁵. They all provide very strong or decisive evidence favoring Model 3.

[Figure 4.11 panels: Model 1 vs. Model 2, Model 1 vs. Model 3, Model 1 vs. Model 4, Model 2 vs. Model 3, Model 2 vs. Model 4, Model 3 vs. Model 4.]

Figure 4.11: The red lines represent the expected-log-Bayes-factors for the mixture of g-priors; the green lines represent the expected-log-Bayes-factors for the conventional improper prior; the blue lines represent the expected-log-Bayes-factors for the intrinsic prior. The calibration sample sizes are marked on the curves as black points.

102 Model 1 vs. Model 4 Based on the mixture of g-priors, the conventional improper prior and the intrinsic prior respectively, the log Bayes factors for the full data set are about -6.81, -7.26 and -6.47. The calibration sample sizes are 3, 11 and

10, leading to reduction of log Bayes factors by 0.30, 1.78 and 1.10 respectively. Thus the calibrated Bayes factor based on the mixture of g-priors is about 8.16 × 10−4;

the calibrated Bayes factor based on the conventional prior is about 1.20 × 10−4; the calibrated Bayes factor based on the intrinsic prior is about 5.14 × 10−4. They all provide very strong or decisive evidence favoring Model 4.

Model 2 vs. Model 3 For this comparison, since Models 2 and 3 are not nested,

the intrinsic Bayes factor cannot be plotted. Based on the mixture of g-priors and

the conventional improper prior respectively, the log Bayes factors for the full data

set are about -13.50 and -13.00. The calibration sample sizes are 3 and 10, leading

to reductions of the log Bayes factors by 0.09 and 0.95 respectively. Thus the calibrated Bayes factor based on the mixture of g-priors is about 1.26 × 10⁻⁶; the calibrated Bayes factor based on the conventional prior is about 0.87 × 10⁻⁶. They both provide very strong or decisive evidence favoring Model 3.

Model 2 vs. Model 4 Based on the mixture of g-priors, the conventional improper prior and the intrinsic prior respectively, the log Bayes factors for the full data set are about -10.27, -10.14 and -9.81. The calibration sample sizes are 3, 11 and

10, leading to reduction of log Bayes factors by 0.18, 1.41 and 0.78 respectively. Thus the calibrated Bayes factor based on the mixture of g-priors is about 2.88 × 10−5;

the calibrated Bayes factor based on the conventional prior is about 0.97 × 10−5; the calibrated Bayes factor based on the intrinsic prior is about 2.50 × 10−5. They all provide very strong or decisive evidence favoring Model 4.

103 Model 3 vs. Model 4 Based on the mixture of g-priors, the conventional improper prior and the intrinsic prior respectively, the log Bayes factors for the full data set are about 3.23, 3.59 and 3.20. The calibration sample sizes are 2, 11 and

10, leading to reduction of log Bayes factors by 0.04, 1.01 and 0.16 respectively.

Thus the calibrated Bayes factor based on the mixture of g-priors is about 24.37; the calibrated Bayes factor based on the conventional prior is about 13.22; the calibrated

Bayes factor based on the intrinsic prior is about 20.90. They all provide strong evidence supporting Model 3.

Assembling this information, we find that the calibrated Bayes factor has shown

performance that is robust to the selection of the prior. The order of model preference

based on the calibrated Bayes factor, from best to worst, is Model 3, Model 4, Model

1 and Model 2, irrespective of which prior from among our set is chosen.

4.6.3 Comparisons with Model 5

Casella and Moreno (2006) programmed a MCMC-based stochastic search and

identified the “best” model (with the largest posterior probability) using only linear

predictors. The model they chose included X3, X4 and X6. Recalling that Casella and

Moreno used the original ozone measurements as opposed to the natural logarithm,

one must adjust the marginal likelihood function to compare this model to Models 1-4.

The curves of expected-log-Bayes-factors are presented in Figure 4.12. Since Model

5 is not nested with any of Models 1-4, the intrinsic Bayes factor is not defined,

thus is omitted from the comparison. Computational details are the same as in the last subsection.

[Figure 4.12 panels: Model 1 vs. Model 5, Model 2 vs. Model 5, Model 3 vs. Model 5, Model 4 vs. Model 5.]

Figure 4.12: The red line represents the expected-log-Bayes-factor for the mixture of g-priors; the green line represents the expected-log-Bayes-factor for the conventional improper prior. The calibration sample sizes are marked on their curves as black points.

First, note that Model 5 is vastly inferior to each of Models 1-4. The log Bayes factors and the calibrated log Bayes factors for the full data set are all beyond 55, which translates to a (calibrated) Bayes factor in excess of 7 × 1023.

For the comparison between Model 1 and Model 5, the calibration sample sizes are

5 and 8 for the mixture of g-priors and the conventional improper prior respectively, leading to a rise of 2.19 and 0.54 from the original log Bayes factors. For the compar- ison between Model 2 and Model 5, the calibration sample sizes are 5 and 9, leading to a rise of 2.56 and 1.00 from the original log Bayes factors. For the comparison between Model 3 and Model 5, the calibration sample sizes are 5 and 10 respectively, corresponding to a rise of 2.76 and 1.32 from the original log Bayes factors. For the comparison between Model 4 and Model 5, the calibration sample sizes are 5 and 11 for the mixture of g-priors and the conventional improper prior respectively, leading to a rise of 2.99 and 1.51 from the original log Bayes factors.

Chapter 5: Conclusions

For model comparison under Bayes factors, the use of proper diffuse priors is suspect when the models under consideration have very different dimensions. In this thesis, we propose a novel method to calibrate the prior distributions, so that the partial posterior distributions updated on a training sample achieve a reasonable level of concentration, and then to use the remainder of the data to obtain the calibrated

Bayes factor. We demonstrate through simulations and data examples that such a calibrated Bayes factor provides reliable and robust model preferences, which are more consistent with predictive model performances.

The implications of this work extend beyond model choice to other settings where the Bayes factor is used. Model averaging implicitly makes use of the Bayes factor, after translating the model odds into model probabilities. In practice, the prior distribution over models is often specified without regard for the prior distributions over model-specific parameters. Our results suggest that updating the model probabilities should commence only after the model-specific posterior distributions have achieved a reasonable concentration.
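The translation from Bayes factors to model probabilities can be sketched as follows. With prior model probabilities p(M_i) and Bayes factors B_i (each model against a common reference), the posterior model probability is p(M_i | y) = p(M_i) B_i / Σ_j p(M_j) B_j. The `model_probabilities` helper, the three log Bayes factors, and the uniform prior over models below are illustrative constructions, not values from the thesis.

```python
import math

def model_probabilities(log_bfs, priors):
    """Posterior model probabilities from log Bayes factors (each model
    vs. a common reference) and prior model probabilities."""
    # Work on the log scale and subtract the max for numerical stability.
    m = max(log_bfs)
    weights = [p * math.exp(l - m) for l, p in zip(log_bfs, priors)]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative values: three models, uniform prior over models.
probs = model_probabilities([0.0, 2.3, 4.6], [1 / 3, 1 / 3, 1 / 3])
print(probs)
```

Subtracting the maximum before exponentiating avoids overflow when the log Bayes factors are large, as they are in the comparisons above.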

There are many variations on the calibration that we present. We have chosen one based on the symmetrized Kullback-Leibler divergence. Such a measure seems appropriate for calibration of a Bayesian prior distribution, as it is driven by the likelihood. Alternative calibrations could rely on other measures of discrepancy, such as the Hellinger or total variation distance. These types of calibration do not depend on the parameterization of the model, nor even on the existence of a (finite-dimensional) parameterization. They allow us to work with infinite-dimensional models in a way that alternatives such as Fisher information do not. High-dimensional models, when fit with a finite amount of data, behave similarly to infinite-dimensional models, and so we believe the calibration developed here will be appropriate for these models.
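For intuition about these alternatives, both the symmetrized Kullback-Leibler (SKL) divergence and the Hellinger distance have closed forms when the two likelihoods are Gaussian. The sketch below (with illustrative means and variances) contrasts the unbounded SKL with the Hellinger distance, which is bounded by 1.

```python
import math

def kl_normal(mu1, s1, mu2, s2):
    """KL(N(mu1, s1^2) || N(mu2, s2^2)) in closed form."""
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def skl_normal(mu1, s1, mu2, s2):
    """Symmetrized Kullback-Leibler divergence between two normals."""
    return kl_normal(mu1, s1, mu2, s2) + kl_normal(mu2, s2, mu1, s1)

def hellinger_normal(mu1, s1, mu2, s2):
    """Hellinger distance between two normals; always lies in [0, 1]."""
    h2 = 1 - math.sqrt(2 * s1 * s2 / (s1**2 + s2**2)) * math.exp(
        -((mu1 - mu2) ** 2) / (4 * (s1**2 + s2**2))
    )
    return math.sqrt(h2)

# Two unit-variance likelihoods differing only in their mean: the SKL grows
# without bound as the means separate (it equals the squared separation),
# while the Hellinger distance saturates at 1.
print(skl_normal(0, 1, 3, 1))
print(hellinger_normal(0, 1, 3, 1))
```

The bounded measures (Hellinger, total variation) trade the SKL's sensitivity to extreme discrepancies for robustness, which is one reason the choice of discrepancy matters for calibration.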

We have focused on training samples of a fixed size. As Berger and Pericchi (2004) persuasively argue, allowing variation in the training sample size can eliminate certain biases that arise when observations carry differing amounts of “information”. When the calibration suggests very different training sample sizes, some improvement is to be expected, although this improvement may be outweighed by computational costs.

Finally, there are several promising directions for future work that should be pointed out. We note that details of the prior distribution do impact the calibration. As an example, for the MDP model that we have used, the split between the variance of the base measure (τ²) and the variance of the smoothing kernel (σ²) drives the smoothness of the densities drawn from the model. Relatively smooth densities are more easily mimicked by a large σ and small τ, and training sample sizes are correspondingly smaller when the prior distribution more easily allows this type of split. Griffin (2010) has a particularly attractive prior distribution for the pair τ² and σ² in which he controls the split between the two. His distribution leads to markedly smaller training sample sizes. In addition, the horseshoe prior is very efficient for dealing with sparsity and is therefore promising for variable selection problems. Recall that this prior has both an infinite spike at zero and heavy tails. The calibration sample size for a model with such a strong prior is fairly small. Thus the connection between the concentration level of a prior distribution and the calibration sample size merits further study. Last but not least, we can produce examples where the curve of the expected-log-Bayes-factor has two turning points, which implies two switches of superiority in terms of predictive power as the training sample size increases. This phenomenon deserves a deeper study.

Bibliography

Aitkin, M. (1991), “Posterior Bayes Factors (with discussion),” Journal of the Royal Statistical Society, Ser. B, 53, 111-142.

Antoniak, C. E. (1974), “Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems,” The Annals of Statistics, 2, 1152-1174.

Basu, S. and Chib, S. (2003), “Marginal Likelihood and Bayes Factors for Dirichlet Process Mixture Models,” Journal of the American Statistical Association, 98, 224-235.

Berger, J. O. and Bernardo, J. M. (1992), “On the Development of Reference Priors” (with discussion), in Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.), 35-60. Oxford Univ. Press.

Berger, J. O., Bernardo, J. M. and Sun, D. (2009), “The Formal Definition of Reference Priors,” The Annals of Statistics, 37, 905-938.

Berger, J. O. and Delampady, M. (1987), “Testing Precise Hypotheses,” Statistical Science, 2, 317-352.

Berger, J. O. and Pericchi, L. (1996a), “The Intrinsic Bayes Factor for Model Selection and Prediction,” Journal of the American Statistical Association, 91, 109-122.

—— (1996b), “The Intrinsic Bayes Factor for Linear Models,” in Bayesian Statistics 5 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.), 23-42. Oxford Univ. Press.

—— (2001), “Objective Bayesian methods for model selection: Introduction and comparison (with discussion),” In Model Selection (P. Lahiri, ed.). 135-207. IMS, Beachwood, OH.

—— (2004), “Training Samples in Objective Bayesian Model Selection,” The Annals of Statistics, 32, 841-869.

Berger, J. O. and Sellke, T. (1987), “Testing a Point Null Hypothesis: The Irreconcilability of P-values and Evidence,” Journal of the American Statistical Association, 82, 112-122.

Bernardo, J. M. (1979), “Reference Posterior Distributions for Bayesian Inference (with discussion),” Journal of the Royal Statistical Society, Ser. B, 41, 113-148.

Berk, R. H. (1966), “Limiting Behavior of Posterior Distributions when the Model is Incorrect,” The Annals of Mathematical Statistics, 37, 51-58.

Blackwell, D. and MacQueen, J. B. (1973), “Ferguson Distributions Via Pólya Urn Schemes,” The Annals of Statistics, 1, 353-355.

Breiman, L. and Friedman, J. (1985), “Estimating Optimal Transformations for Multiple Regression and Correlation,” Journal of the American Statistical Association, 80, 580-598.

Bush, C. A. and MacEachern, S. N. (1996), “A Semiparametric Bayesian Model for Randomized Block Designs,” Biometrika, 83, 275-285.

Cappé, O. and Robert, C. P. (2000), “Markov Chain Monte Carlo: 10 Years and Still Running,” Journal of the American Statistical Association, 95, 1282-1286.

Carlin, B. P. and Chib, S. (1995), “Bayesian Model Choice via Markov Chain Monte Carlo Methods,” Journal of the Royal Statistical Society, Ser. B, 57, 473-484.

Casella, G. and Moreno, E. (2006), “Objective Bayesian Variable Selection,” Journal of the American Statistical Association, 101, 157-167.

Carvalho, C. M., Polson, N. G. and Scott, J. G. (2009), “Handling Sparsity via the Horseshoe,” Proceedings of the 12th International Conference on Artificial Intelligence and Statistics.

—— (2010), “The Horseshoe Estimator for Sparse Signals,” Biometrika, 97, 465-480.

Chen, M.-H., Shao, Q.-M. and Ibrahim, J. G. (2000), Monte Carlo Methods in Bayesian Computation, 1st Edition, Springer.

Chib, S. (1995), “Marginal Likelihood from the Gibbs Output,” Journal of the American Statistical Association, 90, 1313-1321.

Chib, S. and Jeliazkov, I. (2001), “Marginal Likelihood from the Metropolis-Hastings Output,” Journal of the American Statistical Association, 96, 270-281.

Cowles, M. K. and Carlin, B. P. (1996), “Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review,” Journal of the American Statistical Association, 91, 883-904.

Cox, D. R. (1961), “Tests of Separate Families of Hypotheses,” in Proceedings of the Fourth Berkeley Symposium, 1, 105-123.

—— (1962), “Further Results on Tests of Separate Families of Hypotheses,” Journal of the Royal Statistical Society, Ser. B, 24, 406-424.

Congdon, P. (2004), “Bayesian model choice based on Monte Carlo estimates of posterior model probabilities,” Computational Statistics and Data Analysis, 50, 346-357.

Dahl, D. B. (2003), “An Improved Merge-Split Sampler for Conjugate Dirichlet Process Mixture Models,” Tech. Rep. 1086, Department of Statistics, University of Wisconsin.

Delampady, M. and Berger, J. O. (1990), “Lower Bounds on Bayes Factors for Multinomial Distributions, With Application to Chi-Squared Tests of Fit,” The Annals of Statistics, 18, 1295-1316.

DiCiccio, T. J., Kass, R. E., Raftery, A. and Wasserman, L. (1997), “Computing Bayes Factors by Combining Simulation and Asymptotic Approximations,” Journal of the American Statistical Association, 92, 903-915.

Diebolt, J. and Robert C. P. (1994), “Estimation of finite mixture distributions through Bayesian sampling,” J. R. Statist. Soc. B, 56, 363-375.

Dmochowski, J. (1994), “Intrinsic Priors via Kullback-Leibler Geometry,” in Bayesian Statistics 5, eds. J. M. Bernardo et al., London: Oxford University Press, 543-549.

Edwards, W., Lindman, H. and Savage, L. J. (1963), “Bayesian Statistical Inference for Psychological Research,” Psychological Review, 70, 193-242.

Escobar, M. D. (1988), “Estimating the Means of Several Normal Populations by Nonparametric Estimation of the Distribution of the Means,” Unpublished Ph.D. thesis, Yale University, Dept. of Statistics.

—— (1994), “Estimating Normal Means With a Dirichlet Process Prior,” Journal of the American Statistical Association, 89, 268-277.

Escobar, M. D. and West, M. (1995), “Bayesian Density Estimation and Inference Using Mixtures,” Journal of the American Statistical Association, 90, 577-588.

Ferguson, T. S. (1973), “A Bayesian Analysis of Some Nonparametric Problems,” The Annals of Statistics, 1, 209-230.

Flegal, K. M., Carroll, M. D., Ogden, C. L., and Curtin, L. R. (2010), “Prevalence and Trends in Obesity Among US Adults, 1999-2008,” Journal of the American Medical Association, 303, 235-241.

Garcia-Donato, G. and Chen, M.-H. (2005), “Calibrating Bayes Factor under Prior Predictive Distributions,” Statistica Sinica, 15, 359-380.

Garren, S. T. and Smith, R. L. (1993), “Convergence Diagnostics for Markov Chain Samplers,” technical report, University of North Carolina, Dept. of Statistics.

Geisser, S. and Eddy, W. F. (1979), “A predictive approach to model selection,” Journal of the American Statistical Association, 74, 153-160.

Gelfand, A. E. and Dey, D. K. (1994), “Bayesian Model Choice: Asymptotics and Exact Calculations,” Journal of the Royal Statistical Society, Ser. B, 56, 501-514.

Gelfand, A. E. and Kottas, A. (2002), “A Computational Approach for Full Nonparametric Bayesian Inference under Dirichlet Process Mixture Models,” Journal of Computational and Graphical Statistics, 11, 289-305.

Gelfand, A. E. and Smith, A. F. M. (1990), “Sampling-Based Approaches to Calculating Marginal Densities,” Journal of the American Statistical Association, 85, 398-409.

Gelman, A. and Rubin, D. B. (1992), “Inference from Iterative Simulation using Multiple Sequences” (with discussion), Statistical Science, 7, 457-511.

Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (2003), Bayesian Data Analysis, 2nd ed., CRC Press.

Geman, S. and Geman, D. (1984), “Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images,” IEEE Trans. Pattern Anal. Mach. Intell., 6, 721-741.

George, E. I. and McCulloch, R. E. (1993), “Variable Selection via Gibbs Sampling,” Journal of the American Statistical Association, 88, 881-889.

—— (1995), “Stochastic Search Variable Selection,” in Markov Chain Monte Carlo in Practice, eds. W. R. Gilks et al., London: Chapman & Hall, 339-348.

—— (1997), “Approaches for Bayesian Variable Selection,” Statistica Sinica, 7, 339-373.

Geweke, J. (1992), “Evaluating the Accuracy of Sampling-Based Approaches to the Calculation of Posterior Moments,” in Bayesian Statistics 4, eds. J. M. Bernardo, J. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, 169-193.

Geweke, J. (1996), “Variable Selection and Model Comparison in Regression,” in Bayesian Statistics 5, eds. J. M. Bernardo et al., Oxford University Press, pp. 169- 194.

Griffin, J. E. (2010), “Default priors for density estimation with mixture models,” Bayesian Analysis, 5, 45-64.

Hastings, W. K. (1970), “Monte Carlo Sampling Methods Using Markov Chains and Their Applications,” Biometrika, 57, 97-109.

Ishwaran, H. and James, L. F. (2001), “Gibbs Sampling Methods for Stick-Breaking Priors,” Journal of the American Statistical Association, 96, 161-174.

—— (2002), “Approximate Dirichlet Process Computing in Finite Normal Mixtures: Smoothing and Prior Information,” Journal of Computational and Graphical Statistics, 11, 508-532.

Jain, S. and Neal, R. M. (2004), “A Split-Merge Markov Chain Monte Carlo Procedure for the Dirichlet Process Mixture Model,” Journal of Computational and Graphical Statistics, 13, 158-182.

—— (2007), “Splitting and Merging Components of a Nonconjugate Dirichlet Process Mixture Model,” Bayesian Analysis, 2, 445-472.

Jeffreys, H. (1961), Theory of Probability, 3rd ed. Oxford Univ. Press.

Johnson, V. E. (1994), “Studying Convergence of Markov Chain Monte Carlo Algorithms Using Coupled Sample Paths,” Journal of the American Statistical Association, 91, 154-166.

Jones, G. L. and Hobert, J. P. (2004), “Sufficient burn-in for Gibbs samplers for a hierarchical random effects model,” Annals of Statistics, 32, 784-817.

Kadane, J. B. and Lazar, N. A. (2004), “Methods and Criteria for Model Selection,” Journal of the American Statistical Association, 99, 279-290.

Kalli, M., Griffin, J. E. and Walker, S. G. (2011), “Slice sampling mixture models,” Statistics and Computing, 21, 93-105.

Kass, R. E., Carlin, B. P., Gelman, A. and Neal, R. M. (1998), “Markov Chain Monte Carlo in Practice: A Roundtable Discussion,” The American Statistician, 52, 93-100.

Kass, R. E. and Raftery, A. E. (1995), “Bayes Factors,” Journal of the American Statistical Association, 90, 773-795.

Kass, R. E. and Wasserman, L. (1995), “A Reference Bayesian Test for Nested Hypotheses and Its Relationship to the Schwarz Criterion,” Journal of the American Statistical Association, 90, 928-934.

—— (1996), “The Selection of Prior Distributions by Formal Rules,” Journal of the American Statistical Association, 91, 1343-1370.

Kullback S. and Leibler R. A. (1951), “On Information and Sufficiency,” The Annals of Mathematical Statistics, 22, 79-86.

Laplace, P. S. (1820), Théorie Analytique des Probabilités, Courcier, Paris.

Lavine, M. and Schervish, M. J. (1999), “Bayes Factors: What They Are and What They Are Not,” The American Statistician, 53, 119-122.

Lehmann, E. L. and Romano, J. P. (2005), Testing Statistical Hypotheses, 3rd Edition, Springer.

Lempers, F. B. (1971), Posterior Probabilities of Alternative Linear Models, Rotterdam: University Press.

Lenk, P. (2009), “Simulation Pseudo-Bias Correction to the Harmonic Mean Estimator of Integrated Likelihoods,” Journal of Computational and Graphical Statistics, 18, 941-960.

Liang, F., Paulo, R., Molina, G., Clyde, M., and Berger, J. (2008), “Mixtures of g-Priors for Bayesian Variable Selection,” Journal of the American Statistical Association, 103, 410-423.

Liu, C., Liu, J., and Rubin, D. B. (1992), “A Variational Control Variable for Assessing the Convergence of the Gibbs Sampler,” in Proceedings of the American Statistical Association, Statistical Computing Section, 74-78.

Liu, J. (2001), Monte Carlo Strategies in Scientific Computing, 1st Edition, Springer.

Lindley, D. V. (1957), “A statistical paradox,” Biometrika, 44, 187-192.

MacEachern, S. N. (1994), “Estimating normal means with a conjugate style Dirichlet process prior,” Communications in Statistics - Simulation and Computation, 23, 727-741.

—— (1998), “Computational Methods for Mixture of Dirichlet Process Models,” in Dey, Dipak D., Müller, Peter, and Sinha, Debajyoti (Eds.), Practical Nonparametric and Semiparametric Bayesian Statistics, 23-44. New York: Springer.

MacEachern, S. N., Clyde, M. and Liu, J. S. (1999), “Sequential Importance Sampling for Nonparametric Bayes Models: The Next Generation,” The Canadian Journal of Statistics, 27, 251-267.

MacEachern, S. N. and Müller, P. (1998), “Estimating Mixture of Dirichlet Process Models,” Journal of Computational and Graphical Statistics, 7, 223-238.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953), “Equations of State Calculations by Fast Computing Machines,” Journal of Chemical Physics, 21, 1087-1092.

Neal, R. M. (2000), “Markov Chain Sampling Methods for Dirichlet Process Mixture Models,” Journal of Computational and Graphical Statistics, 9, 249-265.

Newton, M., and Raftery, A. (1994), “Approximate Bayesian Inference With Weighted Likelihood Bootstrap,” Journal of the Royal Statistical Society, Ser. B, 56, 3-48.

O’Hagan, A. (1991), Discussion on “Posterior Bayes factors,” by M. Aitkin, Journal of the Royal Statistical Society, Ser. B, 53, 136.

—— (1995), “Fractional Bayes Factors for Model Comparison” (with discussion), Journal of the Royal Statistical Society, Ser. B, 57, 99-138.

Papaspiliopoulos, O. and Roberts, G. O. (2008), “Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models,” Biometrika, 95, 169-186.

Pericchi, L. R. (1984), “An Alternative to the Standard Bayesian Procedure for Discrimination Between Normal Linear Models,” Biometrika, 71, 575-586.

Raftery, A. E. and Lewis, S. (1992), “How Many Iterations in the Gibbs Sampler?,” in Bayesian Statistics 4, eds. J. M. Bernardo, J. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, 763-773.

Ritter, C. and Tanner, M. A. (1992), “Facilitating the Gibbs Sampler: The Gibbs Stopper and the Griddy-Gibbs Sampler,” Journal of the American Statistical Association, 87, 861-868.

Robert, C. and Casella, G. (2004), Monte Carlo Statistical Methods, 2nd Edition, Springer.

Roberts, G. O. (1992), “Convergence Diagnostics of the Gibbs Sampler,” in Bayesian Statistics 4, eds. J. M. Bernardo, J. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, 775-782.

Roberts, G. O., Gelman, A. and Gilks, W. R. (1997), “Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms,” The Annals of Applied Probability, 7, 110-120.

Roberts, G. O., and Rosenthal, J. S. (2001), “Optimal Scaling for Various Metropolis-Hastings Algorithms,” Statistical Science, 16, 351-367.

Rosenthal, J. S. (2002), “Quantitative convergence rates of Markov chains: A simple account,” Electronic Communications in Probability, 7, 123-128.

Sethuraman, J. (1994), “A Constructive Definition of Dirichlet Priors,” Statistica Sinica, 4, 639-650.

Sethuraman, J. and Tiwari, R. C. (1982), “Convergence of Dirichlet measures and the interpretation of their parameter,” Statistical and Related Topics III, 2, 305-315.

Shafer, G. (1982), “Lindley’s Paradox,” Journal of the American Statistical Association, 77, 325-334.

Smith, A. F. M. and Gelfand, A. E. (1992), “Bayesian Statistics without Tears: A Sampling-Resampling Perspective,” The American Statistician, 46, 84-88.

Smith, A. F. M. and Roberts, G. O. (1993), “Bayesian Computation Via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods,” Journal of the Royal Statistical Society, Ser. B, 55, 3-23.

Smith, A. F. M. and Spiegelhalter, D. J. (1980), “Bayes Factors and Choice Criteria for Linear Models,” Journal of the Royal Statistical Society, Ser. B, 42, 213-220.

Tierney, L. and Kadane, J. B. (1986), “Accurate Approximations for Posterior Moments and Marginal Densities,” Journal of the American Statistical Association, 81, 82-86.

Vlachos, P. K. and Gelfand, A. E. (2003), “On the Calibration of Bayesian Model Choice Criteria,” Journal of Statistical Planning and Inference, 111, 223-234.

Wald, A. (1947), Sequential Analysis, John Wiley & Sons.

Walker, S. G. (2007), “Sampling the Dirichlet mixture model with slices,” Communications in Statistics - Simulation and Computation, 36, 45-54.

Wang, F. and Gelfand, A. E. (2002), “A Simulation-based Approach to Bayesian Sample Size Determination for Performance under a Given Model and for Separating Models,” Statistical Science, 17, 193-208.

Yu, Q., MacEachern, S. N. and Peruggia, M. (2011), “Bayesian Synthesis: Combining subjective analyses, with an application to ozone data,” The Annals of Applied Statistics, 5, 1678-1698.

Zellner, A. (1986) “On assessing prior distributions and Bayesian regression analysis with g-prior distributions,” Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, 233-243. North-Holland/Elsevier.

Zellner, A. and Min, C.-K. (1995), “Gibbs Sampler Convergence Criteria,” Journal of the American Statistical Association, 90, 921-927.

Zellner, A. and Siow, A. (1980), “Posterior odds ratios for selected regression hypotheses,” in Bayesian Statistics: Proceedings of the First International Meeting held in Valencia (Spain), 585-603.

Appendix A: An Algorithm for Calibrated Sample Size Search

In this appendix, we develop an algorithm that searches for the appropriate training sample size for prior calibration. We describe this algorithm using the Gaussian model as an example, but it is widely applicable to various model comparison setups. The values for the initial step size, the number of replicates, and the length of the MCMC runs should be adjusted to fit the problem at hand.

As described in Section 3.2, under the Gaussian model N(θ, σ²), the unit information prior is θ ∼ N(θ0, σ²) and the corresponding SKL distribution is a scaled χ²(1). We evaluate the amount of information contained in a prior/posterior by the first quartile of its SKL distribution. Let q1 denote the first quartile of the χ²(1) distribution (which is approximately 0.1015) and let CI(s) denote a 90% confidence interval for the first quartile of the SKL distribution when the calibration sample size is s. Our search strategy for an appropriate calibration sample size consists of the following three stages:

• Stage 1 (Bounding search): setting the upper and lower bounds for the calibration sample size.
Step 1.1: Initialize the calibration sample size s to 1, set the step size to 32, and set the number of replicates to 100 (or another small number, to save computation time).
Step 1.2: Compute CI(s) using the above evaluation procedure.
Step 1.3: Compare CI(s) with q1:

– If CI(s) covers q1, go to Stage 3;

– If CI(s) lies entirely to the left of q1, go to Stage 2;

– If CI(s) lies entirely to the right of q1, increase s by the step size and go back to Step 1.2.

• Stage 2 (Rough search): finding a rough range for the calibration sample size.
Step 2.1: Halve the step size. If the step size is less than one, go to Stage 3.
Step 2.2: Compute CI(s) using the above evaluation procedure.
Step 2.3: Compare CI(s) with q1:

119 – If CI(s) covers q1, go to Stage 3;

– If CI(s) lies entirely to the left of q1, decrease s by the step size and go back to Step 2.1;

– If CI(s) lies entirely to the right of q1, increase s by the step size and go back to Step 2.1.

• Stage 3 (Accurate search): providing an accurate value for the calibration sample size.
Step 3.1: Increase the number of replicates N to 800 (or another large number, for accurate estimation).
Step 3.2: Compute CI(s) using the above evaluation procedure.
Step 3.3: Compare CI(s) with q1:

– If CI(s) covers q1, compute CI(s + 1) and compare it with q1. If CI(s + 1) also covers q1, reset s to s + 1 and go back to Step 3.2. If CI(s + 1) does not cover q1, terminate the search and report s as the calibration sample size;

– If CI(s) lies entirely to the left of q1, compute CI(s − 1) and compare it with q1. If CI(s − 1) covers q1 or is also entirely to the left of q1, reset s to s − 1 and go back to Step 3.2. If CI(s − 1) is entirely to the right of q1, terminate the search and report s as the calibration sample size;

– If CI(s) lies entirely to the right of q1, compute CI(s + 1) and compare it with q1. If CI(s + 1) covers q1 or is also entirely to the right of q1, reset s to s + 1 and go back to Step 3.2. If CI(s + 1) is entirely to the left of q1, terminate the search and report s + 1 as the calibration sample size.

Under this algorithm, when there is one and only one sample size satisfying the condition that CI(s) covers q1, this sample size will be reported; when there are multiple sample sizes satisfying the condition that CI(s) covers q1, the largest of them will be reported; and when no sample size satisfies the condition that CI(s) covers q1, the smallest sample size with CI(s) entirely to the left of q1 will be reported.
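The three-stage search above can be sketched in code. The evaluation of CI(s) is problem-specific (it requires repeatedly fitting the model on training samples, typically by MCMC), so it is abstracted below as a caller-supplied function; the `toy_ci` curve and its crossing point are illustrative stand-ins, not part of the algorithm. The target q1 is computed from the χ²(1) relationship: if Z ∼ N(0, 1) then Z² ∼ χ²(1), so the 0.25-quantile is Φ⁻¹(0.625)².

```python
from statistics import NormalDist

# First quartile of chi-squared(1): approximately 0.1015, as in the text.
Q1 = NormalDist().inv_cdf(0.625) ** 2

def calibrate(ci, q1=Q1, init_step=32, small_reps=100, large_reps=800):
    """Three-stage search for the calibration sample size.

    `ci(s, n_reps)` must return a (lo, hi) 90% confidence interval for the
    first quartile of the SKL distribution at training sample size s.
    """
    # Stage 1 (bounding search): march right in big steps while CI(s)
    # lies entirely to the right of q1.
    s, step = 1, init_step
    lo, hi = ci(s, small_reps)
    while lo > q1:
        s += step
        lo, hi = ci(s, small_reps)

    # Stage 2 (rough search): entered only if CI(s) lies entirely to the
    # left of q1; halve the step size and move toward the target.
    if hi < q1:
        while True:
            step //= 2
            if step < 1:
                break
            lo, hi = ci(s, small_reps)
            if lo <= q1 <= hi:
                break
            s += step if lo > q1 else -step

    # Stage 3 (accurate search): single steps with many more replicates.
    while True:
        lo, hi = ci(s, large_reps)
        if lo <= q1 <= hi:                       # covers: report largest covering s
            lo2, hi2 = ci(s + 1, large_reps)
            if lo2 <= q1 <= hi2:
                s += 1
            else:
                return s
        elif hi < q1:                            # entirely left: step down
            lo2, hi2 = ci(s - 1, large_reps)
            if lo2 > q1:                         # jumped past q1
                return s
            s -= 1
        else:                                    # entirely right: step up
            lo2, hi2 = ci(s + 1, large_reps)
            if hi2 < q1:                         # jumped past q1
                return s + 1
            s += 1

# Toy CI for illustration only: an SKL quartile shrinking like 20*Q1/s,
# with a half-width that narrows as the number of replicates grows.
def toy_ci(s, n_reps):
    half = Q1 * (0.5 if n_reps <= 100 else 0.05)
    center = 20 * Q1 / s
    return center - half, center + half

print(calibrate(toy_ci))  # reports 21 for this toy curve
```

With this deterministic toy curve, sample sizes 20 and 21 both cover q1 at the large replicate count, and the search reports the larger of the two, matching the tie-breaking rule stated above.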

Appendix B: An Overview of the Ozone Data

In this appendix, we present the description of the variables in the ozone data (Table B.1) and the matrix of pairwise scatter plots (Figure B.1) for these variables. X5 is an extra variable documented in R, which is very strongly correlated with X4 and was not considered in any of the earlier works mentioned in this thesis. The least squares sine curve with period one year based on X10 can be written as

T(X10) = −0.6773 sin(2π X10/366 + 1.4021) + 2.2221.
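The fitted curve is easy to evaluate directly; a minimal check of its range (the bounds 2.2221 ± 0.6773 follow from the level and amplitude of the fitted sine) can be written as:

```python
import math

def T(x10):
    """Least squares sine curve (period one year) for calendar day x10."""
    return -0.6773 * math.sin(2 * math.pi * x10 / 366 + 1.4021) + 2.2221

# The fitted seasonal component stays within 2.2221 +/- 0.6773
# over the calendar days 1-366.
values = [T(d) for d in range(1, 367)]
print(min(values), max(values))
```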

Table B.1: Variables used in the ozone data

Variable   Description
Y          Daily maximum one-hour-average ozone reading, on the natural log scale
X1         500 millibar pressure height (m) measured at Vandenberg AFB
X2         Wind speed (mph) at Los Angeles International Airport (LAX)
X3         Humidity (%) at LAX
X4         Temperature (degrees F) measured at Sandburg, CA
X5         Temperature (degrees F) measured at El Monte, CA
X6         Inversion base height (feet) at LAX
X7         Pressure gradient (mm Hg) from LAX to Daggett, CA
X8         Inversion base temperature (degrees F) at LAX
X9         Visibility (miles) measured at LAX
X10        Calendar day (1-366)

Figure B.1: Pairwise scatter plots of the variables in the ozone data.