
Order Number 9411946

The role of hierarchical priors in robust Bayesian inference

George, Robert Emerson, Ph.D.

The Ohio State University, 1993

UMI, 300 N. Zeeb Rd., Ann Arbor, MI 48106

THE ROLE OF HIERARCHICAL PRIORS
IN ROBUST BAYESIAN INFERENCE

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Robert Emerson George, B.A., M.S.

The Ohio State University 1993

Dissertation Committee:

Prem K. Goel
Mark Berliner
Saul Blumenthal

Adviser, Department of Statistics

To Mrs. Sharia J. Keebaugh, With Deep Appreciation for Showing Me the Majesty of the Mathematical Sciences

ACKNOWLEDGEMENTS

I wish to express my gratitude to my adviser, Dr. Prem Goel: without his invaluable advice, farsighted guidance, and unflagging enthusiasm, this thesis could not have been completed. I also wish to thank Drs. Mark Berliner and Saul Blumenthal, the other members of my Committee; and Dr. Steve MacEachern, who after serving on my General Examination Committee was prevented by scheduling conflicts from serving on my Committee. Also, I have benefitted greatly from the efficiency, knowledge, and eagerness to help of the staffs of the Statistical Computing Laboratory and of The Ohio

State University Libraries. Finally, I thank my family for all the innumerable kindnesses, many great and some seemingly small, they have so willingly shown me throughout my life.

VITA

July 5, 1966 Born - Urbana, Ohio

1988 B.A., Wittenberg University, Springfield, Ohio

1988-1991 National Science Foundation Fellow, Department of Statistics, The Ohio State University, Columbus, Ohio

1990 M.S., Department of Statistics, The Ohio State University, Columbus, Ohio

1991-1992 Graduate Teaching / Consulting Assistant, Department of Statistics, The Ohio State University, Columbus, Ohio

1992-1993 Fellow, Graduate School, The Ohio State University, Columbus, Ohio

PUBLICATIONS

1987. "Some Heuristics for Solving Elementary Word Problems." Spectrum: Writing at Wittenberg (2), 59-61.

FIELDS OF STUDY

Major Field: Statistics

Studies in Decision Theory, Sequential Analysis (Dr. Prem Goel)
Studies in Bayesian Inference (Dr. Mark Berliner)
Studies in Statistical Computing (Dr. Jason Hsu, Dr. Elizabeth Stasny)
Studies in Multivariate Analysis (Dr. Sue Leurgans, Dr. Joseph Verducci)

TABLE OF CONTENTS

DEDICATION

ACKNOWLEDGEMENTS

VITA

LIST OF TABLES

LIST OF FIGURES

CHAPTER

I. INTRODUCTION AND SUMMARY
   Overview
   Some Notation and Terminology
   Earlier Work on Hierarchical Priors
   Information Theory and Hierarchical Priors
   A Survey of Bayesian Robustness
   Summary

II. BAYESIAN ROBUSTNESS UNDER SQUARED ERROR LOSS
   General Remarks
   Estimation of a Normal Mean
   Estimation of the Exponential Parameter

III. BAYESIAN ROBUSTNESS AND KULLBACK-LEIBLER DISTANCE
   Kullback-Leibler Distance
   The Kullback-Leibler Approach to the Normal Problem
   Some Further Results on Kullback-Leibler Distance

IV. BAYESIAN ROBUSTNESS AND FINITE MIXTURES
   Introduction
   Main Results

V. HIERARCHICAL PRIORS AND Γ-MINIMAXITY
   Γ-Minimaxity: General Remarks
   Γ-Minimax Rules as Hierarchical Bayes Rules
   Γ-Minimax Rules when Γ Contains Two Priors
   A Γ-Minimax Regret Procedure for Testing a Normal Mean

VI. CONCLUSIONS AND FUTURE AVENUES OF RESEARCH
   Summary of Chapters II-V
   Future Avenues of Research

APPENDICES

A. SOME FORTRAN AND PASCAL PROGRAMS USED IN SECTION 2.3
B. FINITE MIXTURE DISTRIBUTIONS
C. SOME EXACT AND APPROXIMATE FORMULAE
   Introduction
   Normal Likelihood, Cauchy Prior
   Normal Likelihood, Double Exponential Prior
   Approximate Computational Formulae
D. SOME FORTRAN AND PASCAL PROGRAMS USED IN CHAPTER IV
E. SOME FORTRAN PROGRAMS USED IN CHAPTER V

LIST OF REFERENCES

LIST OF TABLES

TABLES

1. Comparison of Hierarchical and "Best-Guess" Rules
2. Ratio of Regret for the Hierarchical Rule vs. Incorrect Benchmark Rule
3. Ratio of Risk for the Hierarchical Rule vs. the Benchmark Rule
4. Ratio of Regret for the Hierarchical Rule vs. Incorrect Benchmark Rule for Γ Containing Three Priors
5. Table of Risk for the Hierarchical Rule vs. Risk for the Benchmark Rule for Γ Containing Three Priors
6. Ratio of Regret for Optimal Hierarchical Rule to Regret for Incorrect Rule
7. Ratio of Regret for Approximate vs. Optimal Hierarchical Rule

LIST OF FIGURES

FIGURES

1. The Behavior of fP and fM
2. The Behavior of the Hierarchical and "Best Guess" Prior
3. Behavior of the Tails of the Three Priors
4. A Sketch of the Behavior of Regret Functions

CHAPTER I

INTRODUCTION AND SUMMARY

1.1: Overview

Two of the most extensively- and intensively-studied areas of Bayesian decision theory are those centered upon hierarchical-prior models and upon Bayesian robustness.

The degree of interest in these areas is perhaps not surprising, for one can argue that hierarchical priors are (for reasons which will be discussed below) "as Bayesian as even a Bayesian can get," while on the other hand the issue of Bayesian robustness is one which must be confronted if serious criticisms of the Bayesian paradigm are to be addressed. In this thesis we will discuss various approaches to using hierarchical priors to achieve Bayesian robustness in a variety of situations. Section 1.2 establishes some fundamental notation and terminology which will be used in later chapters. Sections 1.3 and 1.4 give a brief survey of certain topics pertaining to hierarchical Bayesian models,¹ while Section 1.5 surveys Bayesian robustness. Section 1.6 summarizes the results of Chapters II through V; Chapter VI reviews and integrates earlier material, and discusses various problems to be examined in the future.

¹ Those two sections by no means constitute an exhaustive survey of all the important or elegant work done in hierarchical Bayesian methods, and exclusion of any particular work is in no way reflective of a negative evaluation of that work. Rather, topics presented are those which convey something of the history of the development of hierarchical models.

1.2: Some Notation and Terminology

Before proceeding further, we review some notation and terminology: the data x are realizations of a random variable X defined on a sample space X. The distribution of

X is

f(x|θ), θ ∈ Θ, (1.2.1)

with θ unknown. (Note that Θ could be any index set: in particular, attention is not restricted to parametric families, as we shall see in Chapter IV.) We denote the prior distribution on Θ by π, and the distribution of θ|X = x (i.e., the posterior distribution) by π(·|x). The marginal (or predictive) distribution of X is given by

m(x) = ∫_Θ f(x|θ) dπ(θ). (1.2.2)

The action space A consists of all possible actions (or decisions) open to the statistician, and for each a ∈ A, θ ∈ Θ there is associated a loss L(θ, a). We shall assume that all loss functions discussed herein are bounded from below. The posterior expected loss of an action a is given by

ρ(π, x, a) = E_{π(·|x)}[L(θ, a)]. (1.2.3)

A (possibly randomized) decision rule mapping X into A will be denoted by δ, and the set of all such decision rules will be denoted by D*. The frequentist risk function R(θ, δ) and the Bayes risk r(π, δ) for the decision rule δ are, respectively,

R(θ, δ) = E_{f(x|θ)}[L(θ, δ(X))], (1.2.4)

and

r(π, δ) = E_{π(θ)}[R(θ, δ)]. (1.2.5)

It is well-known (Berger, 1985) that, for loss functions bounded from below,

r(π, δ) = E_{m(X)}[ρ(π, X, δ(X))]. (1.2.6)
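The identity (1.2.6) can be checked directly in a toy discrete model (a sketch; the two-point parameter space, Bernoulli likelihood, 0-1 loss, and the particular rule δ are all illustrative assumptions, not part of the text above):

```python
# Two-point parameter space, Bernoulli likelihood, 0-1 loss.
# Verifies r(pi, delta) = E_m[ rho(pi, X, delta(X)) ]  -- identity (1.2.6).

thetas = [0.3, 0.7]           # Theta = {0.3, 0.7}: theta is P(X = 1)
prior = {0.3: 0.4, 0.7: 0.6}  # pi(theta)

def f(x, theta):              # likelihood f(x | theta), x in {0, 1}
    return theta if x == 1 else 1.0 - theta

def loss(theta, a):           # 0-1 loss: a correct "guess" costs nothing
    return 0.0 if theta == a else 1.0

def delta(x):                 # a simple (deliberately crude) decision rule
    return 0.7 if x == 1 else 0.3

def m(x):                     # marginal m(x), as in (1.2.2)
    return sum(f(x, t) * prior[t] for t in thetas)

def posterior(t, x):          # pi(theta | x) by Bayes' theorem
    return f(x, t) * prior[t] / m(x)

def rho(x, a):                # posterior expected loss, as in (1.2.3)
    return sum(loss(t, a) * posterior(t, x) for t in thetas)

def R(theta):                 # frequentist risk, as in (1.2.4)
    return sum(loss(theta, delta(x)) * f(x, theta) for x in (0, 1))

bayes_risk = sum(R(t) * prior[t] for t in thetas)             # (1.2.5)
via_posterior = sum(rho(x, delta(x)) * m(x) for x in (0, 1))  # (1.2.6)
print(abs(bayes_risk - via_posterior) < 1e-9)  # True: the two sides agree
```

Both computations give Bayes risk 0.3 here; the agreement is exact (up to rounding), as (1.2.6) promises.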

The above notation is used throughout this thesis, although at times certain extensions and modifications are made.

1.3: Earlier Work on Hierarchical Priors

When one thinks of hierarchical priors, the name of I. J. Good immediately comes to mind. We begin this section by discussing Good's contribution to the theoretical and philosophical justification for hierarchical models. Good (Good, 1980) views "[t]he notion of a hierarchy of different types, orders, levels, or stages of probability" as arising naturally in three distinct settings. Before delineating these settings, some terminology is in order. By a "physical probability" Good means "an intrinsic property of the material world, existing irrespective of minds and logic . . . psychological probability is a degree of belief or intensity of conviction that is used . . . for making decisions." A "subjective probability" is one of a set of psychological probabilities which are coherent in the sense that those probabilities cannot lead their adherents into bets which are certain to lose. A "logical probability . . . is a rational intensity of conviction, implicit in the given information . . . such that if a person does not agree with it he is wrong (Good, 1965)." Subjective probabilities are properly viewed as approximations to logical probabilities (Good, 1976).

The three situations (Good, 1980) wherein hierarchies of probabilities can arise are:

(i) Hierarchies of physical probabilities. Good says that the "meaning . . . is made clear merely by mentioning populations, superpopulations, and super-duper-populations." As one moves "downward" from the superpopulation to the subpopulations, probabilities will often change: this is the idea behind advertising which "targets" a particular subset of the population. The probability that a teenage consumer will buy a skateboard is much different from, and higher than, the probability that a consumer will buy a skateboard.

(ii) Situations that involve more than one "kind" of probability (e.g., physical and subjective). By introducing a hierarchical structure, one can simplify the problem somewhat and separate the various kinds of probabilities: for instance, the first stage might be physical while the second stage is subjective. Consider an insect colony which is known from biological theory to undergo births and deaths as a random process characterized by certain parameters: this is a case of physical probabilities. But if the parameters are not known, one's probabilistic "guesses" about them are subjective probabilities. Combining the two yields a typical Bayesian model (although not a hierarchical model in the commonly-accepted sense, because there is only one stage of prior knowledge).

(iii) Hierarchies of subjective probabilities. Good emphasizes that, despite the important role subjective probabilities play in Bayesian inference, those probabilities are invariably vague. He deals with this vagueness by using hierarchical priors, as is evident from the following quote (Good, 1980):

[O]ne way of trying to cope with it is to allow for the confidence that you feel in your judgements and to represent this confidence by probabilities of a higher type . . . I still stand by the following two comments . . . the higher the type the woollier the probabilities . . . the higher the type the less the wooliness matters provided . . . the calculations do not become too complicated.

In other words, since subjective probabilities are unlikely to be known exactly, modeling them as random quantities is expedient (and also conforms to the Bayesian paradigm). The higher stages of modeling may be even more vague (or "wooly") but still an advantage has been gained: the inexactness of the lower stage has been acknowledged, and even in some sense quantified, rather than ignored or denied. Though this paradigm has been the basis for much significant work since its proposal by Good in the 1950's, it is not universally accepted. L. J. Savage (Savage, 1972) objected on two distinct grounds:

But such a program seems to meet insurmountable difficulties . . . If the primary [i.e., first stage] probability of an event B were a random variable b with respect to secondary [i.e., second stage] probability, then B would have a composite probability, by which I mean the (secondary) expectation of b. Composite probability would then play the allegedly villainous role that secondary probability was intended to obviate, and nothing would have been accomplished . . . [Furthermore] once second order probabilities are introduced, the introduction of an endless hierarchy seems inescapable. Such a hierarchy seems very difficult to interpret, and it seems at best to make the theory less realistic, not more.

Any argument from L. J. Savage merits attention, but later work has served to weaken his criticisms to quite an extent: work discussed later in this section concerns situations wherein the second stage parameters are of intrinsic interest (i.e., the statistician wishes to conduct inference on the hyperparameters), and in Section 1.4 it will be seen that in many scenarios a many-staged hierarchy is of little use! The point about a multistage hierarchy collapsing into a single-stage structure (which was "pointed out to [Savage] by Max Woodbury (Savage, 1972, p. 58)") is of course correct; but it does not directly address Good's contention that the multistage structure facilitates an understanding of the model.

As early as 1953, Good was applying the hierarchical paradigm to contingency tables, and his work is the earliest hierarchical approach referred to in Berger's survey of hierarchical models (Berger, 1985). Good's work on hierarchical Bayesian techniques (see Good, 1983 for a bibliography), though substantial in nature, has focused mainly on discrete or categorical data, while we now turn our attention to a topic pertaining to continuous random variables: the normal linear model. The earliest application of hierarchical Bayesian methodology to the normal linear model appears to be the work of John Rogers, which was begun in 1969. Rogers' work focused on "the estimation of a [emphasis in original] (univariate) normal mean when the prior contains hyperparameters (Good, 1980)." Rogers used normal and Cauchy priors, and estimated the hyperparameters by Type II maximum likelihood methods. So complicated were the problems that intentions to study the multivariate case were abandoned (Good, 1980). His thesis (Rogers, 1974) was apparently never published. Hierarchical Bayesian work on normal linear models has tended to build instead upon three very well-known papers by Lindley and Smith. The first of the three (Lindley and Smith, 1972) examined models with the following structure:

y | θ1 ~ N(A1θ1, C1), (1.3.1)

θ1 | θ2 ~ N(A2θ2, C2), (1.3.2)

θ2 | θ3 ~ N(A3θ3, C3), (1.3.3)

with θ3, A1, A2, A3, and the dispersion matrices C1, C2, and C3 known. Lindley and Smith proceed to find the posterior distribution of θ1 | y, θ3, A1, A2, A3, C1, C2, C3 (which is also normal, an example of the conjugacy principle), from which all Bayesian inference on θ1 can be performed. Typically, the second stage (1.3.2) is suggested "rather naturally" by the design of the experiment, but at the third stage (1.3.3) "we find ourselves in a position where prior knowledge is weak (Lindley and Smith, 1972)." In such situations Lindley and Smith take C3⁻¹ = 0; in other words, they assume an infinitely-dispersed prior at the third stage.
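The first two stages can be made concrete in the simplest special case. The sketch below assumes A1 = I, A2 = a column of ones, θ2 = m a known scalar, and C1 = σ²I, C2 = τ²I, so that the model reduces to exchangeable normal means; the particular numbers are arbitrary:

```python
# Two-stage normal linear model in its simplest diagonal form:
#   y_i | theta_i ~ N(theta_i, sigma2),   theta_i ~ N(m, tau2),  m known.
# Each posterior mean is the precision-weighted average of the observation
# y_i and the second-stage mean m, i.e. the data are shrunk toward m.

def posterior_mean(y_i, m, sigma2, tau2):
    w = (1.0 / sigma2) / (1.0 / sigma2 + 1.0 / tau2)  # weight on the data
    return w * y_i + (1.0 - w) * m

y = [2.0, 5.0, 11.0]
m, sigma2, tau2 = 6.0, 1.0, 3.0
est = [posterior_mean(yi, m, sigma2, tau2) for yi in y]
print(est)  # each y_i is pulled 25% of the way toward m = 6
```

With σ² = 1 and τ² = 3 the data weight is 0.75, so the estimates are [3.0, 5.25, 9.75]: closer to m than the raw observations, as the conjugate theory dictates.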

There is much more to the formulation (1.3.1)-(1.3.3) than may be initially apparent, for many widely-used statistical designs are contained therein. Lindley and Smith illustrate two-factor experimental designs, multiple regression with exchangeability between regressions, and multiple regressions with exchangeability within regressions. Each of these is straightforward once appropriate Ai have been defined, since each is a special case of this very general formulation. The Ci are in fact nuisance parameters but in practice should be treated as unknown (i.e., random in the Bayesian context). The theory developed by Lindley and Smith for this situation is necessarily less elegant than that for the known-Ci case, since the integrals which must be evaluated lack closed forms. "We therefore consider an approximation . . . [which] yields the bulk, though not unfortunately all, of the information required for estimation." The method approximates the mean by the mode. Smith (Smith, 1973a) describes means of conducting inference on the hyperparameter θ2. The hierarchical structure is no longer merely a tool for "sharpening" one's understanding, as Good discussed; rather, the hyperparameters assume a practical importance in their own right. "The second stage [i.e., (1.3.2)] describes the form of relationship posited between parameters at the first stage [i.e., (1.3.1)] while the third stage [i.e., (1.3.3)] describes knowledge about the form of that relationship (Smith, 1973a)." The Bayesian model therefore greatly expands the areas which statisticians may explore: "inability to make inference about both the first-stage and second-stage parameters is a major difficulty in the sampling theory approach (Smith, 1973a)." Finally, Smith (Smith, 1973b) explores some designed-experiments problems in the context of the hierarchical Bayesian linear model.
Further work in that direction has been done by Smith and Verdinelli (1980), Verdinelli and Giovagnoli (1985), and Toman and Notz (1991). Hierarchical Bayes methodology continues to be an active research area, so active that relatively few developments can be discussed in depth here. Norberg (Norberg, 1989) has explored hierarchical conjugate priors; Geisser (Geisser, 1990) has used hierarchical priors in the exponential survival prediction problem. Datta and Ghosh (Datta and Ghosh, 1991) have used the hierarchical methodology in small area estimation; Lenk (Lenk, 1991) has used it in Bayesian nonparametric density estimation. Morris and Normand (Morris and Normand, 1992) have studied meta-analysis ("the science and art of combining results from similar and independent experiments") with the aid of the hierarchical Bayes methodology. Wolpert and Warren-Hicks (Wolpert and Warren-Hicks, 1992) have used hierarchical models to combine field and laboratory data in stream-acidification studies.

1.4: Information Theory and Hierarchical Priors

Good viewed hierarchical models as a means of expressing and comprehending one's uncertainty about one's own subjective probabilities, while stating that the higher stages were often "wooly". On the other hand, Savage was concerned with the prospect of an infinite hierarchy. Two papers discussed in this section will make more rigorous the notion of "wooliness" and also indicate that an infinite hierarchy is neither necessary nor useful for most situations. Goel and DeGroot (1981) discuss a number of information measures appropriate to a Bayesian setting. "One general type of measure that is useful [Goel and DeGroot write] . . . is obtained by regarding the expected information about θ [the parameter vector] as the difference between some measures of the uncertainty in the prior distribution of θ and the expected uncertainty in the posterior distribution."

Definition 1.4.1: An uncertainty function is a concave, measurable function mapping S into the real numbers, where S is a convex family of probability distributions.

Definition 1.4.2: The expected information in an experiment is defined as

I(X, ξ; U) = U(ξ) − E[U(ξ(X))], (1.4.1)

where X, ξ ∈ S, ξ(X), and U are respectively the observable random variable, the prior distribution of θ, the posterior distribution of θ given the realization of X, and an uncertainty function. By varying U many different information measures can be obtained:

U1(ξ) = var(θ): I(X, ξ; U1) = var(θ) − E[var(θ|X = x)] (1.4.2)

U2(ξ) = ln[var(θ)]: I(X, ξ; U2) = ln[var(θ)] − E[ln var(θ|X = x)] (1.4.3)

U3(ξ) = −∫_Θ ξ(θ) ln(ξ(θ)) dμ(θ): Shannon Information (1.4.4)

(In (1.4.4), μ denotes some σ-finite measure on the parameter space Θ.)
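For the variance-based uncertainty function U1, the expected information has a closed form in the conjugate normal model: with θ ~ N(0, τ²) and X|θ ~ N(θ, σ²), the posterior variance τ²σ²/(τ² + σ²) does not depend on x, so I(X, ξ; U1) = τ² − τ²σ²/(τ² + σ²) = τ⁴/(τ² + σ²), which by the law of total variance equals var(E[θ|X]). A Monte Carlo sketch (the particular variances below are arbitrary choices, not from the text):

```python
import random

# Conjugate normal model: theta ~ N(0, tau2), X | theta ~ N(theta, sigma2).
# With U1(xi) = var(theta), the expected information (1.4.2) is
#   I = tau2 - tau2*sigma2/(tau2 + sigma2) = tau2**2/(tau2 + sigma2),
# which equals var(E[theta | X]) by the law of total variance.

random.seed(1)
tau2, sigma2 = 2.0, 1.0
exact = tau2 ** 2 / (tau2 + sigma2)          # = 4/3

n = 100_000
shrink = tau2 / (tau2 + sigma2)              # posterior mean is shrink * x
post_means = []
for _ in range(n):
    theta = random.gauss(0.0, tau2 ** 0.5)   # draw theta from the prior
    x = random.gauss(theta, sigma2 ** 0.5)   # draw an observation
    post_means.append(shrink * x)

mean = sum(post_means) / n
mc = sum((pm - mean) ** 2 for pm in post_means) / n  # var of posterior means
print(exact, mc)  # the Monte Carlo estimate should be close to 4/3
```

The simulated variance of the posterior mean matches the closed form τ⁴/(τ² + σ²) to Monte Carlo accuracy.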

Information may also be measured by some type of distance between two densities. In a Bayesian context, one is interested in the distance between the prior and posterior densities: the greater the distance, the more information (about θ) was gained in the experiment. For instance, such a measure can be based upon the Rényi entropy function.

A general hierarchical Bayesian model has the following structure: let the observable random vector be denoted by θ0 and let g(θ0|θ1) denote the generalized probability density function (gpdf) of θ0 with respect to some σ-finite measure, where θ1 is an unknown parameter vector. The gpdf at the ith level is denoted by g(θi−1|θi), and it is assumed that at some level k, the hyperparameter θk in the distribution of θk−1 is known.

For α ≠ 1, let the Rényi information measure of order α at level i of the hierarchical model, given the observed value θ0, be denoted by I_α(i). Goel and DeGroot prove that, for α > 0, I_α(i) is a decreasing function of i (i = 1, . . . , k) for every θ0 and θk.
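The monotonicity can be illustrated in a small conjugate hierarchy, using Kullback-Leibler distance (the α → 1 limit of the Rényi family) between each level's posterior and prior; all of the variances below are assumptions of this sketch, not results from the text:

```python
import math

# Hierarchy: x | theta ~ N(theta, 1), theta | mu ~ N(mu, 1), mu ~ N(0, 1).
# KL(posterior || prior) at each level measures how much the single
# observation x taught us about that level of the hierarchy.

def kl_normal(m1, v1, m0, v0):
    """KL divergence between N(m1, v1) and N(m0, v0)."""
    return 0.5 * (math.log(v0 / v1) + v1 / v0 + (m1 - m0) ** 2 / v0 - 1.0)

x = 2.0
# Level 1: marginally theta ~ N(0, 2); posterior theta | x ~ N(2x/3, 2/3).
info_theta = kl_normal(2.0 * x / 3.0, 2.0 / 3.0, 0.0, 2.0)
# Level 2: x | mu ~ N(mu, 2) with prior mu ~ N(0, 1); mu | x ~ N(x/3, 2/3).
info_mu = kl_normal(x / 3.0, 2.0 / 3.0, 0.0, 1.0)

print(info_theta, info_mu)
print(info_theta > info_mu)  # True: information decreases up the hierarchy
```

At x = 2 the first-level information is roughly 0.66 nats against roughly 0.26 nats at the hyperparameter level: the data are markedly less informative about μ than about θ, in line with the Goel-DeGroot result.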

As a corollary of the above result, the Kullback-Leibler Information (Kullback, 1959) also decreases as a function of i. Indeed, one can conclude that for many information measures "the information about the hyperparameters decreases as one moves to higher levels away from the data." Hence a statistician who can obtain only limited data is wise to model only a few stages of hyperparameters, because very little can be learned about them. This result is impressive from a technical point of view, but it is also significant philosophically. We have seen that Savage was troubled by the prospect of an infinite hierarchy. Another statistician (Levi, 1973) quipped: "Good is prepared to define second order probability distributions . . . and third order probability distributions over these, etc., until he gets tired." The result of Goel and DeGroot suggests that there are formal reasons why Good can afford to "get tired" and still remain secure in the knowledge that nothing of great importance has been omitted. Likewise, Good's concept of "wooliness" is given a more formal interpretation: hyperparameters that are "far" from the data (that is, out on a distant, "wild-and-wooly" frontier) have "little to do with" the data in that the data provides little information about them.

Goel has also developed "a more general and unified approach (Goel, 1983)."

Definition 1.4.3: Let μ1 and μ2 be two probability measures and λ denote a σ-finite measure dominating both, with corresponding gpdfs denoted by g1 and g2, respectively. Furthermore, let φ(u) denote an arbitrary convex function defined on the interval (0, ∞). Then the φ-divergence of μ1 and μ2 is defined by

I_φ(μ1, μ2) = ∫ φ(g1(θ)/g2(θ)) g2(θ) dλ(θ). (1.4.7)

Definition 1.4.4 extends the φ-divergence to the successive levels of a hierarchical model, and Goel proved that this divergence decreases as one moves away from the data, for every θ0, θk, and k ≥ 3. By appropriate choice of φ, the information-decrease results of (Goel and DeGroot, 1981) follow as immediate corollaries of this result.

1.5: A Survey of Bayesian Robustness Professor James O. Berger, one of the most active researchers in the area of Bayesian robustness, lists two assumptions which, taken together, are fundamental to his philosophy of Bayesian robustness:

Assumption I. In any statistical investigation, one will ultimately be faced with . . . decisions which involve uncertainties. Of interest is the information available about these uncertainties after seeing the data, and the only trustworthy and sensible measures of this information are Bayesian posterior measures . . .

Assumption II. Prior distributions [on the parameters of interest] can never be quantified or elicited exactly (i.e., without error), especially in a finite amount of time (Berger, 1984).

Assumption I leads one ineluctably to the Bayesian paradigm, of course. However, Assumption II leads one to examine the robustness of the resulting analysis: that is, have

"errors" or "inexactitudes" in the prior seriously affected the posterior analysis? That such concerns are not ill-founded may readily be seen: consider this simple but illustrative example. Example 1.5.1: (Berger, 1985) Suppose that x|e~N (e,i). (1.5.1)

Further, assume that the prior of θ is "felt" (or "known" or "assumed") to have support (−∞, ∞) and first, second, and third quartiles equal to −1, 0, and 1. These assumptions correspond to infinitely many densities, of which two are

π^C(θ) = [π(1 + θ²)]⁻¹ (1.5.2)

and

π^N(θ) = [2π(2.19)]^(−1/2) exp(−θ²/(2 · 2.19)). (1.5.3)

(The former density is Cauchy, the latter normal, hence the superscripts.) In the absence of more extensive prior knowledge, the priors are equally suitable. Assume the problem is estimation of θ under squared error loss. If X = 4.5 is observed, the posterior estimate of θ is 4.01 if (1.5.2) is used and 3.09 if (1.5.3) is used. If X = 10 is observed, the respective estimates are 9.80 and 6.87. Clearly choice of prior can lead to dramatically different inference for certain observations! (Yet if X = 0 had been observed, the posterior estimate would have been 0 for either prior.) In some cases, there might exist sufficient prior knowledge to lead to a single "true" prior π_T(θ), but "the prior π_T . . . is exactly nailed down only after an infinite process of elicitation (Berger, 1984)." Also, if the decision (or inference) is to be the consensus of a group, then "the priors of all members of the group must be considered . . . hence . . . one is left, at the end of the elicitation process, with a set Γ of prior distributions which reflect true prior beliefs (Berger, 1984)." The issue of Bayesian robustness is of extreme practical importance in terms of increasing the acceptance of Bayesian methodology by the research community:

The only truly overwhelming problem facing Bayesians is that of convincing non-Bayesians that the Bayesian viewpoint is correct. The major stumbling block in the entire controversy is that Bayesians (as a whole, not individually) have not openly admitted the validity of Assumption II and been willing to accept its consequences. This allows the non-Bayesian to refuse to think about Assumption I, because he feels certain that Assumption II is correct and hence that the Bayesians must be wrong (Berger, 1984).

Though the case for Bayesian robustness is strong, Bayesians do not unanimously concur with Berger's assessment. There are two potential criticisms: (1) "Some Bayesians argue that . . . the approximate prior π_A that one arrives at after a finite amount of time is your true prior at the moment, and should hence be used as such (Berger, 1984)." But, as Berger proceeds to point out, since further reflection would probably lead to a "new" π_A, in no reasonable sense of the word "truth" does π_A represent a "true prior." (2) Lindley (1984), while not rejecting Assumption II, comes to a conclusion different from that reached by Berger:

My own view about Assumption II is that we should learn to measure probabilities . . . Surveyors do not deplore Euclidean geometry because they cannot measure distances without error: they use techniques like least-squares. And they discover that angles are easier to "elicit" than distances: perhaps log-odds are better than probabilities. We need . . . to develop the equivalent of the surveyor's least squares.

Lindley's point is well-taken, but it seems unlikely that the need to consider robustness issues can ever be entirely obviated. It seems more likely that, however sharp one's prior knowledge becomes, that knowledge will never be sufficiently sharp to rule out all distributions but one. Of course, the "smaller" (in whatever sense it is appropriate to consider) Γ can be made, the less crucial robustness becomes; but if Γ contains even two elements, robustness has some role to play (Berger, 1984).
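The sensitivity displayed in Example 1.5.1 is easy to reproduce by direct numerical integration, since the posterior mean under the Cauchy prior has no closed form. The following is a sketch; the grid bounds and step size are arbitrary choices:

```python
import math

# Example 1.5.1: X | theta ~ N(theta, 1), squared error loss, and two priors
# with quartiles -1, 0, 1: Cauchy(0, 1) and N(0, 2.19).  The posterior mean
# is computed by a plain Riemann sum over a wide grid.

def posterior_mean(x, prior):
    lo, hi, n = -60.0, 60.0, 60_000
    h = (hi - lo) / n
    num = den = 0.0
    for i in range(n + 1):
        t = lo + i * h
        w = math.exp(-0.5 * (x - t) ** 2) * prior(t)   # likelihood * prior
        num += t * w
        den += w
    return num / den                                    # normalizers cancel

cauchy = lambda t: 1.0 / (math.pi * (1.0 + t * t))
normal = lambda t: math.exp(-t * t / (2.0 * 2.19))      # N(0, 2.19), unnormalized

for x in (4.5, 10.0):
    print(x, round(posterior_mean(x, cauchy), 2), round(posterior_mean(x, normal), 2))
# Expect roughly 4.01 vs 3.09 at x = 4.5, and 9.80 vs 6.87 at x = 10:
# the flat-tailed Cauchy prior lets an extreme observation speak for itself,
# while the normal prior pulls it sharply toward zero.
```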

Having convinced oneself of the need to consider the robustness of a Bayesian analysis, what concrete steps might be taken to "robustify" Bayesian analysis? A vast body of work exists in this area, and so a number of methods which have been extensively studied can be described only in very general terms:

(i) The set of priors Γ can be "updated", or altered based upon the data. However, "the data is not to be used to shape your beliefs, but only to indicate how this narrowing down should be done . . . (Berger, 1984)."

(ii) One can initially choose a prior from among the class of "robust" priors, that is, priors which may be misspecified (to a slight degree) without leading to a lack of posterior robustness. (The word "robust" is therefore slightly abused when applied to the prior.) There are certain "rules of thumb" along these lines (see Berger, 1984): conjugate priors, for instance, tend to be non-robust.

(iii) An empirical Bayes approach will often be robust, but "all [emphasis in the original] the data . . . [not] just the 'past' data" should be used to estimate the hyperparameters (Berger, 1984).

(iv) The so-called "Γ-minimax approach" can be used to achieve procedure robustness. In a given decision problem, with set of potential priors Γ and the set of randomized decision rules D*, the Γ-minimax value for the problem is (Berger, 1985):

r_Γ = inf_{δ ∈ D*} sup_{π ∈ Γ} r(π, δ). (1.5.4)
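To make (1.5.4) concrete, consider estimating a normal mean under squared error loss with Γ containing two N(0, τ²) priors, and restrict attention (for illustration only) to linear rules δ_c(x) = cx, whose Bayes risk has the closed form r(π_τ, δ_c) = (c − 1)²τ² + c². A grid search then approximates the Γ-minimax value over that restricted class; the two priors and the grid are assumptions of this sketch:

```python
# X | theta ~ N(theta, 1); Gamma = {N(0, 1), N(0, 4)}; rules delta_c(x) = c*x.
# Bayes risk: r(tau2, c) = (c - 1)**2 * tau2 + c**2  (shrinkage bias + noise).
# Gamma-minimax over the linear class: minimize the worst-case Bayes risk.

taus2 = [1.0, 4.0]

def bayes_risk(tau2, c):
    return (c - 1.0) ** 2 * tau2 + c ** 2

best_c, best_val = None, float("inf")
for i in range(1001):                        # c on a grid over [0, 1]
    c = i / 1000.0
    worst = max(bayes_risk(t, c) for t in taus2)
    if worst < best_val:
        best_c, best_val = c, worst

print(best_c, best_val)
# c = 0.8 with worst-case risk 0.8: the Bayes rule for the *wider* prior,
# because the more diffuse prior is always the least favorable of the two.
```

Note that the true Γ-minimax rule need not be linear in general; the grid search here only sketches the inf-sup computation in (1.5.4) for a tractable subclass.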

Choosing a Γ-minimax rule δ* (i.e., a rule δ* such that sup_{π ∈ Γ} r(π, δ*) = r_Γ) will minimize the maximum Bayes risk one might incur. Of course, there are the "standard" Bayesian criticisms of the minimax principle² and also an additional caveat: "[I]t is often very difficult to come up with any method of calculating a Γ-minimax rule (Berger, 1985)." Nevertheless, so doing will lead to procedure robustness, at least in the "minimax" sense of robustness. This avenue will be explored in greater detail in Chapter V.

(v) "A sometimes useful solution to robustness problems is to ignore data whose modeling causes the nonrobustness (Berger, 1985)." Of course, to do so one must be convinced that the data ignored do not exert much influence on the posterior distributions in which one is actually interested; prior information and so-called "common sense" can come into play in making this determination.

(vi) A sixth approach, which has been extensively examined by many researchers, is what might be called the posterior range method. Generally speaking, the underlying method is to report not one single quantity but rather a range of quantities corresponding to the various elements of Γ. Papers addressing this topic include: Berliner and Goel, 1990; Berger and O'Hagan, 1988a and 1988b; Berger and Berliner, 1986; Sivaganesan, 1988; and Berger and Sivaganesan, 1988.

(vii) Finally, a lack of posterior robustness can be attacked by a truly Bayesian method: hierarchical modeling. If the posterior inference varies widely for different elements π ∈ Γ, then "[t]he natural Bayesian inclination would be to put some 'metaprior' on Γ itself and use the resulting Bayes rule (Berger, 1984)." Although a hierarchical prior is equivalent to a single-stage prior (since the "intermediate" hyperparameters can be integrated out), the hierarchical prior will typically have relatively flat tails (Berger, 1985) and thus will tend to lead to more robust inference. Admittedly, this solution is only "ad hoc" in that the metaprior is "simply some arbitrarily chosen distribution used as a technical device to obtain an answer (Berger, 1984)" rather than a distribution based on some type or degree of prior knowledge. Lindley (1984) has also spoken of the potential value of hierarchical priors as a means of obtaining robustness:

² "In problems of statistical decision theory, the [minimax] principle can lead to very bad results, and works well only when it happens to coincide with reasonable prior information (Berger, 1985)."

I met an example recently where the decision maker had to provide a [prior]. . but the decision maker only felt comfortable with a range of variances, so a distribution was placed on the variance. This is surely the better way out of the problem of a class of priors: put a prior over the class. Or, expressed differently, use a hierarchical model. If normal or Cauchy seems doubtful put probability a on one and 1 - a on the other; or better use t with a distribution on the degrees of freedom. As Berger notes above, this is the "natural" Bayesian way to deal with an unknown entity, in this case a prior: model it as random. We will explore the use of hierarchical priors in subsequent chapters, despite Berger's cautionary note: "analysis with metapriors can be very formidable. . . Also, there is nothing to guarantee that the resulting answer will be good (Berger, 1984). " We wilt 16 see that hierarchical priors can be a useful means of conducting robust inference in a

variety of settings.
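Lindley's normal-or-Cauchy suggestion is easy to visualize numerically. The sketch below (Python; the component parameters and the mixing weight a are arbitrary illustrative choices, not taken from the text) builds such a two-component hierarchical prior and shows that it inherits the flat tails mentioned above:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Standard normal-family density
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def cauchy_pdf(x, loc=0.0, scale=1.0):
    # Cauchy density, the heavy-tailed alternative in Lindley's example
    return 1.0 / (math.pi * scale * (1 + ((x - loc) / scale) ** 2))

def mixture_prior(x, a=0.5):
    # Hierarchical (mixture) prior: probability a on the normal form,
    # probability 1 - a on the Cauchy form.
    return a * normal_pdf(x) + (1 - a) * cauchy_pdf(x)

# Far from the center, the mixture is enormously heavier-tailed than the
# normal component alone: the Cauchy component dominates.
ratio_far_out = mixture_prior(10.0) / normal_pdf(10.0)
```

Even with equal weight on the normal component, the mixture's tail behavior is governed by its heaviest-tailed component, which is the mechanism behind the "relatively flat tails" claim above.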

1.6: Summary

Each of the next four chapters deals with the use of a hierarchical prior structure to achieve some form of robustness in a Bayesian context. Chapter II deals with one of the most common decision problems treated in almost all texts on Bayesian inference:

parameter estimation under squared error loss. Our goal is robustness in the sense of mitigating the consequences of imperfect knowledge of the hyperparameter. More specifically, we work within the following framework: there exists a benchmark prior π₀(θ) = π(θ|α₀) with known functional form π(·|·) but unknown hyperparameter α₀. We can deal with ignorance of the hyperparameter in one of two ways. Firstly, we can make a "best guess" as to what the benchmark hyperparameter is and then use as our prior π₁(θ) = π(θ|α₁), where α₁ is our "best guess" as to the parameter α₀. Alternately, we can use a hierarchical prior of the form:

π₂(θ) = ∫_A π(θ|α) dπ*(α),   (1-6.1)

for some distribution π* on A. (Note that the lower stage of the hierarchical structure is the known functional form π(·|·).) Under what conditions is (1-6.1) "preferable" (in a regret sense which will be specified in Chapter II) to the "best guess" model? We present theoretical results by Goel and O'Hagan which show that, in the normal prior-normal likelihood setting, with fixed and known variances, there must always exist a hyperprior which can achieve this goal. Also, we present a simulation study which, within its rather limited scope, indicates that hierarchical priors can achieve the same goal in the simple exponential prior - simple exponential likelihood setting.

In Chapter III we again use the benchmark-best guess framework, but our criterion for robustness is different: we are interested in coming close to the benchmark posterior in terms of information distance. Our options, a "best guess" approach and a hierarchical approach, are as before. It is shown that in the normal setting there must always exist a hierarchical prior which can achieve this goal. Furthermore, two results which confirm and make formal some intuitively appealing notions about the behavior of posteriors as a function of the behavior of priors are proved.

Chapter IV consists largely of a simulation study which demonstrates how robustness against uncertainty about the functional form of the prior can be achieved through the use of a hierarchical prior. The likelihood is normal, but the prior may be either Cauchy, double exponential, or normal. To undertake this study, it was necessary to find closed forms for the posterior mean and variance, and for the predictive distribution, which correspond to each of those three priors: these formulae are presented in Appendix

C. This framework is quite different from that of Chapter II, in which the prior was known to be normal, but the hyperparameter was not uniquely specified. A simulation study is not as strong as a theoretical analysis; however, it provides substantial evidence of the robustness one achieves via hierarchical modeling of the prior. The results indicate that a hierarchical approach can protect one from misspecification of the prior while at the same time being very competitive with the optimal approach (that is, the approach in which the true prior is known).

Chapter V presents additional theoretical results in this thesis. It is shown that the hierarchical Bayesian approach can be used to find Γ-minimax rules, which were discussed in Section 1.5. The foundation of Chapter V is the definition of a derived decision problem based on the original decision problem (i.e., the true problem of interest) with the property that a minimax rule for the derived problem is a Γ-minimax rule for the original problem. General results are given in Section 5.2, and then the process is illustrated for two decision problems in Sections 5.3 and 5.4.

CHAPTER II

BAYESIAN ROBUSTNESS UNDER SQUARED-ERROR LOSS

2.1: General Remarks

A very common situation in Bayesian inference is the estimation of the parameter θ under squared error loss

L(θ, a) = (θ - a)².   (2.1.1)

It is well-known that the Bayes rule for this setting is the posterior mean. The posterior mean is determined, of course, by the prior π(θ) and the likelihood f(x|θ): consequently, two statisticians who use different priors will report different Bayes rules. As we have seen in Example 1.5.1, the use of different priors can lead to much different inference. In this chapter, as well as Chapters III and IV, we will assume that the prior is known to lie in some class Γ with some known properties (e.g., functional form or, in Chapter IV, parametric or structural properties). This assumption is common to much of the work surveyed in Section 1.5. However, we will make an additional assumption: the existence of a particular element π₀ ∈ Γ against which the statistical inference will be evaluated. This prior π₀ will be called the "benchmark prior." The statistician does not know π₀, but must attempt to conduct inference which resembles as closely as possible the inference that would be conducted by someone who knew π₀. In our context, therefore,

"robustness" refers to how closely the inference actually executed matches the "benchmark inference" based on π₀.¹ This is a major departure from many of the methods cited in Section 1.5, where "robustness" usually referred to the extent to which inference was stable over all the elements of Γ. One may think of that type of robustness as being oriented toward precision (consistency), while the robustness studied in this chapter is oriented toward accuracy (nearness to an accepted, desired, or true value).

One way to formally quantify this robustness is to define the regret r*(π,δ) (Berger, 1985, p. 378) of a rule δ against a prior π:

Definition 2.1.1: The regret r*(π,δ) of a rule δ against a prior π is given by

r*(π,δ) = r(π,δ) - r(π,δ^π),   (2.1.2)

where δ^π is the Bayes rule corresponding to the prior π.

In words, the regret is the additional consequence one pays for using the "incorrect" rule rather than the Bayes rule. Note that an immediate consequence of Definition 2.1.1 is:

r*(π,δ) ≥ 0, for all δ ∈ D*,   (2.1.3)

with equality obtaining only if δ is Bayes against π. Here and in Chapter IV we will make considerable use of the following very simple, and well-known, result:

Proposition 2.1.1: Let δ^π(x) = E^{π(θ|x)}[θ]. Then

r*(π,δ) = E^{m(x)}[(δ^π(X) - δ(X))²].   (2.1.4)

¹ One can interpret this "benchmark prior" in the following manner: imagine a statistician S working for a boss B. The boss B will evaluate S's performance against his/her (that is, against B's) own prior. Hence B's prior will serve as a benchmark. One dictionary definition of "benchmark" (Mish, 1983) is "something that serves as a standard by which others may be measured." We have avoided the use of the terminology "true prior" because the notion of "truth" raises philosophical issues which are both difficult and, for our purposes, unnecessary to confront. The other potential choice of terminology was "reference prior", which has been used in a distinctly different context (Bernardo, 1979).

Proof: Note that

r*(π,δ) = r(π,δ) - r(π,δ^π)

= E^{m(x)}[ E^{π(θ|x)}[ (δ(x) - θ)² - (δ^π(x) - θ)² ] ]

= E^{m(x)}[ E^{π(θ|x)}[ (δ(x) - δ^π(x))(δ(x) + δ^π(x) - 2θ) ] ]

= E^{m(x)}[ (δ(x) - δ^π(x))(δ(x) + δ^π(x) - 2E^{π(θ|x)}[θ]) ].   (2.1.5)

However, E^{π(θ|x)}[θ] = δ^π(x); therefore, (2.1.5) yields (2.1.4). □
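Proposition 2.1.1 is easy to check by simulation. The sketch below (Python; the normal-normal setup and the deliberately non-Bayes rule δ(x) = 0.8x are arbitrary illustrative choices) estimates both sides of (2.1.4) from the same joint draws of (θ, X); here θ ~ N(0,1), X|θ ~ N(θ,1), so the Bayes rule is δ^π(x) = x/2 and the exact regret is 0.18:

```python
import random

random.seed(1)

def bayes_rule(x):
    # Posterior mean under theta ~ N(0,1), X|theta ~ N(theta,1)
    return 0.5 * x

def delta(x):
    # A deliberately non-Bayes rule
    return 0.8 * x

M = 200_000
lhs = 0.0   # r(pi, delta) - r(pi, delta^pi), estimated from realized losses
rhs = 0.0   # E^{m(x)}[(delta^pi(X) - delta(X))^2], the right side of (2.1.4)
for _ in range(M):
    theta = random.gauss(0.0, 1.0)
    x = random.gauss(theta, 1.0)     # (theta, X) drawn jointly; X has marginal m
    lhs += (theta - delta(x)) ** 2 - (theta - bayes_rule(x)) ** 2
    rhs += (bayes_rule(x) - delta(x)) ** 2
lhs /= M
rhs /= M
```

Both averages estimate the same quantity, the regret of δ, which for these particular choices equals E^{m}[(0.3X)²] = 0.09 · 2 = 0.18.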

Confronted with the task of attempting to conduct inference which resembles that which would result from the benchmark prior, the statistician will choose one of these two options:

Option A: Make a "best guess", say π₁ ∈ Γ, as to the benchmark prior, and use this "best guess" prior to conduct posterior inference;

Option B: Instead of using the prior π₁ ∈ Γ, place a hyperprior π* upon Γ and use the resulting hierarchical prior π₂ = ∫_Γ π dπ*.² (Note that possibly π₂ ∉ Γ.)

To make the setting more concrete, let us assume that the prior distribution is assumed [or known] to have functional form

π(θ|α)   (2.1.6)

for some hyperparameter α ∈ A. There exists a benchmark value α₀, and a corresponding benchmark prior π₀(θ) = π(θ|α₀). Then Option A and Option B become:

Option A: Make a "best guess" as to α₀, say α₁, and use this "best guess" prior π₁(θ) = π(θ|α₁);

Option B: Place a hyperprior π*(α) on A (with π*(α) centered, in some sense, about the "best guess" α₁) and conduct inference using

π₂(θ) = ∫_A π(θ|α) dπ*(α).   (2.1.7)

² Continuing the S[tatistician]-B[oss] scenario, one may interpret the use of a hyperprior as S "hedging" his/her bets against possible mistakes in the "best guess."

2.2: Estimation of a Normal Mean

In this section we present results from some unpublished work of Goel and O'Hagan, who have extensively studied this problem for both the univariate and multivariate normal settings; we discuss their results for the former problem here.

Let the likelihood and functional form of the prior be given by

X|θ ~ N(θ, η²),   (2.2.1)

and

θ ~ N(α, σ²).   (2.2.2)

In particular, we assume that the variances η² and σ² are known. The benchmark and "best guess" priors are then

θ ~ N(α₀, σ²),   (2.2.3)

and

θ ~ N(α₁, σ²).   (2.2.4)

The hyperprior is assumed to be

α ~ N(α₁, τ²),   (2.2.5)

for some τ² chosen by the statistician. Thus, the prior π₂ corresponding to the use of (2.2.5) is

θ ~ N(α₁, σ² + τ²).   (2.2.6)
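The reduction of the two-stage prior (2.2.2)/(2.2.5) to the single-stage form (2.2.6) can be confirmed by simulation. A quick sketch (Python; the values α₁ = 1, σ² = 2, τ² = 3 are arbitrary illustrative choices):

```python
import random
import statistics

random.seed(2)

alpha1, sigma2, tau2 = 1.0, 2.0, 3.0   # illustrative hyperparameter values
draws = []
for _ in range(100_000):
    alpha = random.gauss(alpha1, tau2 ** 0.5)    # alpha ~ N(alpha1, tau^2), as in (2.2.5)
    theta = random.gauss(alpha, sigma2 ** 0.5)   # theta|alpha ~ N(alpha, sigma^2), as in (2.2.2)
    draws.append(theta)

# Marginally, theta should be N(alpha1, sigma^2 + tau^2) = N(1, 5), as in (2.2.6).
m = statistics.fmean(draws)
v = statistics.pvariance(draws)
```

The sample mean and variance of the two-stage draws match the single-stage N(α₁, σ² + τ²) values, which is the sense in which a hierarchical prior is "equivalent to a single-stage prior."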

In general, in these specifications and in the interpretation of our results, variances are the natural parameters. However, in the intermediate stages, it will be expedient to do the computations with precisions rather than variances. Define

γ = 1/η²,  ρ = 1/σ²,  λ = 1/τ²,   (2.2.7)

and

ξ = 1/(σ² + τ²) = 1/(1/ρ + 1/λ) = λρ/(λ + ρ).   (2.2.8)

With this notation, the Bayes rules corresponding to the above options are (DeGroot, 1970):

δⁱ(x) = (ρ αᵢ + γ x)/(ρ + γ),  i = 0, 1;   (2.2.9)

and:

δ²(x) = (ξ α₁ + γ x)/(ξ + γ).   (2.2.10)

The following questions arise when one wants to compare Options A and B: Does there exist a hierarchical prior which will reduce regret, that is, does there exist a value of τ² such that r*(π₀,δ²) < r*(π₀,δ¹)? Can one find an "optimal" value of τ² that maximizes r*(π₀,δ¹) - r*(π₀,δ²)? Are there instances in which all hierarchical models will be superior to Option A? These questions are answered by the following result:

Theorem 2.2.1: Under the notation defined thus far:

(a) The value of r*(π₀,δ¹) - r*(π₀,δ²) is maximized when τ² = (α₀ - α₁)².

(b) The value of r*(π₀,δ¹) - r*(π₀,δ²) is positive if 0 < τ² < (α₀ - α₁)².

(c) The value of r*(π₀,δ¹) - r*(π₀,δ²) is positive for all τ² > 0 if:

(α₀ - α₁)² > η² + σ².   (2.2.11)

Proof: From Proposition 2.1.1, it follows that

r*(π₀,δⁱ) = E^{m₀(x)}[(δ⁰(X) - δⁱ(X))²],  i = 1, 2.   (2.2.12)

Equations (2.2.9) and (2.2.12) yield

r*(π₀,δ¹) = ρ²Δ²/(ρ + γ)²,   (2.2.13)

where

Δ = α₀ - α₁.   (2.2.14)

From (2.2.9) and (2.2.10) we see

δ⁰(x) - δ²(x) = (ρ α₀ + γ x)/(ρ + γ) - (ξ α₁ + γ x)/(ξ + γ).   (2.2.15)

Straightforward algebraic simplification of (2.2.15) yields

δ⁰(x) - δ²(x) = [ρ/(ρ + γ)] (-ω(x - α₀) + (1 - ω)Δ),   (2.2.16)

where

ω = γρ/(λρ + γλ + γρ).   (2.2.17)

Consequently, by (2.2.12):

r*(π₀,δ²) = E^{m₀(x)}[ (ρ/(ρ + γ))² (-ω(X - α₀) + (1 - ω)Δ)² ]

= (ρ/(ρ + γ))² E^{m₀(x)}[(-ω(X - α₀) + (1 - ω)Δ)²]

= (ρ/(ρ + γ))² { ω² E^{m₀(x)}[(X - α₀)²] - 2ω(1 - ω)Δ E^{m₀(x)}[(X - α₀)] + (1 - ω)²Δ² }.   (2.2.18)

But under m₀(x) we have that

X ~ N(α₀, γ⁻¹ + ρ⁻¹),   (2.2.19)

so that (2.2.18) becomes

r*(π₀,δ²) = (ρ/(ρ + γ))² { ω²(γ⁻¹ + ρ⁻¹) + (1 - ω)²Δ² }.   (2.2.20)

It follows from (2.2.13) and (2.2.20) that:

r*(π₀,δ¹) - r*(π₀,δ²) = (ρ/(ρ + γ))² { Δ² - ω²(γ⁻¹ + ρ⁻¹) - (1 - ω)²Δ² }.   (2.2.21)

We must study the behavior of (2.2.21) as ω varies (that is, as τ² varies). Differentiating (2.2.21) wrt τ² yields

d[r*(π₀,δ¹) - r*(π₀,δ²)]/dτ² = (ρ/(ρ + γ))² { -2ω(γ⁻¹ + ρ⁻¹) + 2(1 - ω)Δ² } (dω/dλ)(dλ/dτ²).   (2.2.22)

Now dω/dλ < 0 and dλ/dτ² < 0 for all τ² > 0, so that dω/dτ² > 0. Hence, (2.2.22) implies that

sgn( d[r*(π₀,δ¹) - r*(π₀,δ²)]/dτ² ) = sgn( -ω(γ⁻¹ + ρ⁻¹) + (1 - ω)Δ² ).   (2.2.23)

Note that

-ω(γ⁻¹ + ρ⁻¹) + (1 - ω)Δ² < 0 ⟺ ω(γ⁻¹ + ρ⁻¹ + Δ²) > Δ²

⟺ [γρ/(λρ + γλ + γρ)] (γ⁻¹ + ρ⁻¹ + Δ²) > Δ²

⟺ (γ + ρ + γρΔ²)/(λρ + γλ + γρ) > Δ²

⟺ (γ + ρ + γρΔ²) > Δ²(λρ + γλ + γρ)

⟺ (γ + ρ) > Δ²λ(ρ + γ)

⟺ 1/λ > Δ²

⟺ τ² > Δ².   (2.2.24)

From (2.2.7) and (2.2.24) it is at once clear that r*(π₀,δ¹) - r*(π₀,δ²) increases for τ² < Δ²

and decreases for τ² > Δ², so that choosing τ² = Δ² = (α₀ - α₁)² will maximize r*(π₀,δ¹) - r*(π₀,δ²). Hence, (a) follows.

Apropos (b), we see from the above discussion that r*(π₀,δ¹) - r*(π₀,δ²) is an increasing function of τ² for τ² ∈ (0, (α₀ - α₁)²); since r*(π₀,δ¹) = r*(π₀,δ²) for τ² = 0, it is apparent that r*(π₀,δ¹) - r*(π₀,δ²) > 0 for τ² ∈ (0, (α₀ - α₁)²). Hence, (b) obtains.

Apropos (c), observe that r*(π₀,δ¹) - r*(π₀,δ²) is a decreasing function of τ² for τ² > (α₀ - α₁)². It follows from (2.2.7) and (2.2.17) that ω → 1 as τ² → ∞. Therefore, (2.2.21) implies that:

lim_{τ²→∞} ( r*(π₀,δ¹) - r*(π₀,δ²) ) = (ρ/(ρ + γ))² { Δ² - (γ⁻¹ + ρ⁻¹) }.   (2.2.25)

Thus, when (2.2.11) obtains, r*(π₀,δ¹) - r*(π₀,δ²) increases from 0 up to a positive number, and then decreases down to another positive number. Hence (c) follows. □

We make two observations at this point. The first is that if (α₀ - α₁)² < η² + σ², a

very large value of τ² may, in fact, be detrimental. Further, we observe that in this context the optimal choice of τ² depends solely on |α₀ - α₁|, the absolute difference between the benchmark α₀ and our "best guess" α₁. It does not depend on either the likelihood variance η² or the prior variance σ². This may be somewhat contrary to one's intuition: if in Problem 1, |α₀ - α₁| = 1, η = σ = 1 while in Problem 2, |α₀ - α₁| = 1, η = 10⁻⁴, and σ = 10⁴, this result tells us that τ² = 1 is the best hyperprior variance for both. This becomes somewhat less strange, however, when we reflect upon what the hyperprior π* in fact is, and what it should reflect. The hyperprior π* is only a tool to get a better answer: it does not model an actual "state of nature." Nor should it incorporate any knowledge we have about the benchmark value α₀, since all such knowledge that we might have should go into our determination of the "best guess" value α₁. Rather, π* should reflect only our degree of confidence in the accuracy of our "best guess" α₁.

Can we avoid the difficulty of needing some knowledge (such as a lower bound) of Δ² before we are able to choose τ²? Since we have seen that there are some values of η², σ², and Δ² for which any positive τ² is beneficial, one might expect that there are some values of η², σ², and τ² which will improve on the "best-guess" model for any value of Δ². However, this is not the case: from (2.2.21) it is clear that:

lim_{Δ²→0} ( r*(π₀,δ¹) - r*(π₀,δ²) ) = -(ρ/(ρ + γ))² ω²(γ⁻¹ + ρ⁻¹) < 0.   (2.2.26)
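The content of Theorem 2.2.1 can be checked numerically from the closed form (2.2.21). A sketch (Python; the values of η², σ², α₀, and α₁ are arbitrary illustrative choices) evaluates the regret difference over a grid of τ² values and confirms that the maximum occurs at τ² = (α₀ - α₁)², and that every τ² > 0 helps when (2.2.11) holds:

```python
def regret_difference(tau2, eta2, sigma2, delta):
    # Closed form (2.2.21) in the precision notation of (2.2.7)-(2.2.8):
    # gamma = 1/eta^2, rho = 1/sigma^2, lambda = 1/tau^2.
    gamma, rho = 1.0 / eta2, 1.0 / sigma2
    lam = 1.0 / tau2
    omega = gamma * rho / (lam * rho + gamma * lam + gamma * rho)   # (2.2.17)
    c = (rho / (rho + gamma)) ** 2
    d2 = delta ** 2
    return c * (d2 - omega ** 2 * (1 / gamma + 1 / rho) - (1 - omega) ** 2 * d2)

eta2, sigma2 = 1.0, 1.0
alpha0, alpha1 = 2.0, 0.5        # Delta^2 = 2.25 > eta^2 + sigma^2 = 2, so (2.2.11) holds
delta = alpha0 - alpha1

grid = [0.01 * k for k in range(1, 1001)]   # tau^2 over (0, 10]
values = [regret_difference(t, eta2, sigma2, delta) for t in grid]
best_tau2 = grid[values.index(max(values))]   # part (a): should be Delta^2 = 2.25
all_positive = all(v > 0 for v in values)     # part (c): expected True here
```

The same function with (α₀ - α₁)² < η² + σ² will show the regret difference turning negative for large τ², illustrating the first observation above.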

2.3: Estimation of the Exponential Parameter

In Section 2.2 we noticed that, for the normal likelihood-normal prior setting, adding a hyperprior with variance less than or equal to the square of the difference between the benchmark α₀ and best guess α₁ reduced Bayes regret. How do hierarchical priors perform in a different, non-normal setting? In this section we explore the use of hierarchical priors to find a more robust method of solving a simple estimation problem. Our approach is heuristic and exploratory, rather than rigorous: we neither seek general theorems nor carry out optimality studies, but rather confine ourselves to simulation studies of a few specific cases. This section is intended to be no more than a tentative step beyond the normality assumption, an attempt to "test" (in an informal, non-inferential sense) the use of hierarchical priors to conduct more robust inference.

Our likelihood and prior are respectively

f(x|θ) = θ exp(-xθ) 1_(0,∞)(x),   (2.3.1)

and

π(θ|α) = α exp(-θα) 1_(0,∞)(θ),  α ∈ A = (0, ∞).   (2.3.2)

We wish to find an estimate, under squared error loss, of the mean of X:

E[X|θ] = 1/θ.   (2.3.3)

It is well-known (see DeGroot, 1970) that the posterior distribution of θ|X = x, α is given by

[θ|X = x, α] ~ Gamma(2, x + α),   (2.3.4)

so that the Bayes estimator of (2.3.3) is the posterior expectation:

E^{π(θ|x,α)}[1/θ] = x + α.   (2.3.5)

But what if α is not precisely specified? As in Section 2.1, suppose that we have only a "best-guess" value α₁ but that our inference will be evaluated against a benchmark value α₀. We consider placing a hyperprior π* on α: we wish to learn whether using the hyperprior will lead to a smaller Bayes regret than the use of the "best-guess" value α₁.

What hyperprior shall we choose as our π*? In Section 2.2 we naturally chose a normal hyperprior, thus preserving conjugacy. However, there does not seem to be any hyperprior which will preserve conjugacy in the exponential setting. We will consider two priors: a Gamma density and an inverse Gaussian (see Johnson and Kotz, 1970) density. We will not attempt to argue rigorously that these densities are in any sense the "optimal" choice, but we will cite three reasons why they are intuitively appealing:

(i) Both have support equal to A = (0, ∞). It is not necessarily the case that the support of the hyperprior "should" be equal to the hyperparameter space, but by placing some probability along the entire hyperparameter space, we are "hedging our bets" in a very cautious way.

(ii) Both have mean and variance functionally independent, in the sense that choice of a particular mean (which in our case will be the "best guess" value α₁) will not restrict our choice of variance. This is an important and by no means ubiquitous property. In particular, three other distributions which one might consider (the exponential, the normal truncated to (0, 2α₁), and the uniform³) lack this property. Since we wish to explore whether the use of a hyperprior variance chosen relative to the extent to which α₀ and α₁ differ⁴ is beneficial, this functional independence is an extremely important property for our hyperprior to possess.

(iii) The desired Bayes rules can be efficiently and (relatively) easily approximated by a well-known and widely-used method. For a given hyperprior π*, the posterior mean (conditional on X = x, with α integrated out of the system) is

δ*(x) = [ ∫₀^∞ ∫₀^∞ (1/θ) θ exp(-θx) α exp(-αθ) π*(α) dα dθ ] / [ ∫₀^∞ ∫₀^∞ θ exp(-θx) α exp(-αθ) π*(α) dα dθ ].   (2.3.6)

Now, for neither the Gamma hyperprior

π*(α) = C_G(a,b) α^(a-1) exp(-αb),   (2.3.7)

nor the inverse Gaussian hyperprior

π*(α) = C_IG(μ,λ) α^(-3/2) exp( -λ(α - μ)²/(2μ²α) ),   (2.3.8)

³ Since our hyperprior must have nonnegative support, a uniform hyperprior centered at α₁ can have variance no greater than α₁²/3.

⁴ "Extent to which two quantities differ" is a vague term, which will be clarified shortly.

closed form. However, the Bayes rules can be approximated with GauB-Laguerre quadrature, a common method of numerical integration which will "converge to the true

value of the integral for almost any conceivable function which one meets in practice (Stroud and Secrest, 1966). "5 GauB quadrature has many variations, depending upon the

region of integration in which one is interested. Laguerre quadrature approximates

integrals of the form:

Jh(t)exp(-t)dt (2.3.9) 0 by a sum of the form:

S w kh(tk), (2.3.10) k-l

for nodes 0 < t, < ...< tM < qo and weights wk, k = with the nodes and weights independent of the particular function h(.) There exists software (IMSL, 1989a) which

can compute the nodes and weights for values of M . For the (2.3.7) and (2.3.8) in question, the integrals in (2.3.6) have the form given in (2.3.9). For the Gamma case, making the substitution u = ab yields

a“exp(-ab)da^ u* exp(-u)du^ x + a j x + u/b J (2.3.11) a* exp(-ab)da^ f ( u“u exp(-u)du exi (x + a )2 (x+u/b)2

5 The Tiemey-Kadane asymptotic formula (Tierney and Kadane, 1986), which is widely-used and quite useful in certain situations, was considered as an alternative but was rejected since the use of T-K would require finding the mode of the density J7t(0|a)n‘(a )d a , which is complicated. Hence the resort to "old 0 but trusty" quadrature methods. 31 while substituting t = — - for the inverse Gaussian case yields: 2m

a exp - ^ ( a + M2/a)jd a a exp (a + M2/ot)jda 2p' 8 *(x)= J (x+a)a3/2 (x+a) a V2

exp(-t)exp rat I V t f p(~l)e’ip( - A ) dt = J (2.3.12) (x + 2M2tA)Vt (x + 2p 2t /\f i/t

The algorithm used (see Appendix A for the program listing) was as follows:

To approximate B = ∫₀^∞ g(t) exp(-t) dt to desirable tolerance ε:

(1) p ← 9;
(2) B^(p)(g) ← the 2^p-point Laguerre quadrature approximation of B;
(3) p ← p + 1;
(4) B^(p)(g) ← the 2^p-point Laguerre quadrature approximation of B;
(5) If |B^(p)(g) - B^(p-1)(g)| < ε B^(p)(g), then return B^(p)(g) and terminate algorithm⁶;
(6) Otherwise, go to Step (3).

Two points are in order. The first is that, although it is known that, since g is continuous,

B^(p)(g) → B as p → ∞,   (2.3.13)

⁶ This stopping criterion is one of two general methods of terminating an integration algorithm: the other requires some approximation or bound on the error (see Davis and Rabinowitz, 1984, p. 419), which we cannot readily obtain.

it is possible that the above algorithm will be "fooled" if the terms of the sequence {B^(p)(g)} become very close together before nearing B. To mitigate this danger, we choose a very small tolerance, namely 0.00005%. Secondly, the above represents the "ideal" algorithm, rather than the one actually implemented. For large values of p, computing the 2^p nodes and weights is very time-consuming; moreover, storing the nodes and weights requires a great deal of memory. The DECsystem 5500, on which the code was run, took approximately two hours to generate the nodes and weights needed for p = 9, ..., 15; the file containing these nodes and weights required over 4,030,000 bytes. Therefore, p cannot be permitted to grow indefinitely large, and so the algorithm actually implemented terminates if p = 15, whether the desired tolerance has been achieved by then or not, and returns B^(15)(g) along with an estimate of the tolerance which was achieved.

Having chosen two functional forms for use as hyperpriors, we turn now to the choice of hyperparameters. The mean and variance of (2.3.7) are given by, respectively:

a/b and a/b²,   (2.3.14)

and the mean and variance of (2.3.8) are given by

μ and μ³/λ.   (2.3.15)

Hence, as noted in (ii), we have complete freedom to choose the hyperparameters. We shall assume that the hyperprior has mean equal to the "best guess" α₁. As for the hypervariance σ*², we want σ*² to be related in some sense to the "difference" between α₀ and α₁. For the normal setting, the difference we used was the algebraic difference |α₀ - α₁|. But in that setting θ was a location parameter. In this setting it is a scale parameter. Consequently, we use as our measure of the "difference" between α₀ and α₁ the ratio

min{ (α₀/α₁)², (α₁/α₀)² }.   (2.3.16)

The results of Section 2.2, if applied to this section, would suggest that the use of (2.3.7) or (2.3.8), with mean α₁ and variance less than or equal to (2.3.16), will lead to a smaller regret than the use of (2.3.2) with parameter α = α₁. Is this the case? We proceed directly to answer this question for four examples. We assume that

α₀ = 1.   (2.3.17)

Our values of α₁ are {α₀ ± 0.25, α₀ ± 0.50} = {0.50, 0.75, 1.25, 1.50}. It follows from (2.3.5) that the regret for the "best-guess" rule is:

r*(π(θ|α₀), δ¹) = (α₀ - α₁)².   (2.3.18)

For each α₁, we use five hyperprior variances:

σ*² ∈ { d, d/2, d/4, d/8, d/16 },  where d = min{ (α₀/α₁)², (α₁/α₀)² }.   (2.3.19)

The regret for the hierarchical rules is approximated by simulation. Using IMSL, we generate pseudorandom uniform variates uᵢ, i = 1, ..., M, and transform each into a pseudorandom variable xᵢ having density function m₀(x) = α₀/(x + α₀)². Then we compute (using the algorithm given above) δ*_G(xᵢ) and δ*_IG(xᵢ) for each of the M values of xᵢ. We then estimate the regret for δ*_G and δ*_IG by the quantities

T_G = (1/M) Σ_{i=1}^{M} (δ*_G(xᵢ) - xᵢ - α₀)²,   (2.3.20)

and

T_I = (1/M) Σ_{i=1}^{M} (δ*_IG(xᵢ) - xᵢ - α₀)².   (2.3.21)

We repeat the above process Q times, computing T_G(q), T_I(q), q = 1, ..., Q; we then compute

T̄_G = (1/Q) Σ_{q=1}^{Q} T_G(q),  T̄_I = (1/Q) Σ_{q=1}^{Q} T_I(q),   (2.3.22)

and:

S_G = sqrt( (1/(Q-1)) Σ_{q=1}^{Q} (T_G(q) - T̄_G)² ),  S_I = sqrt( (1/(Q-1)) Σ_{q=1}^{Q} (T_I(q) - T̄_I)² ).   (2.3.23)

This process was executed for Q = 10, M = 1000. The algorithm failed to meet the desired tolerance a total of five times, each time when attempting to compute the denominator for the Gamma rule. However, the tolerances which were achieved for those cases were under 0.007. The primary results are presented in Table 1. We note first of all the striking similarity between the regrets for the Gamma and inverse Gaussian hyperpriors when σ*² is small. This similarity in regrets can be attributed to the similarity between the two rules: as σ*² becomes small, both the inverse Gaussian and Gamma hyperpriors tend toward distributions degenerate at α₁.
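The quadrature step of this simulation is straightforward to reproduce with modern tools. The sketch below (Python with NumPy; the values x = 0.8, a = 4, b = 4 are arbitrary illustrative choices, giving a Gamma hyperprior with mean a/b = 1) evaluates the Gamma-hyperprior rule (2.3.11) by Gauss-Laguerre quadrature with doubling node counts, in the spirit of the algorithm above but with far smaller node counts:

```python
import numpy as np

def gamma_rule(x, a, b, tol=1e-6, max_pow=7):
    """Approximate the hierarchical rule (2.3.11): the ratio of the
    Gauss-Laguerre approximations of integral u^a e^{-u} / (x + u/b) du
    and integral u^a e^{-u} / (x + u/b)^2 du, doubling the node count
    until successive estimates agree to relative tolerance tol."""
    prev = None
    est = None
    for p in range(4, max_pow + 1):
        nodes, weights = np.polynomial.laguerre.laggauss(2 ** p)
        num = np.sum(weights * nodes ** a / (x + nodes / b))
        den = np.sum(weights * nodes ** a / (x + nodes / b) ** 2)
        est = num / den
        if prev is not None and abs(est - prev) <= tol * abs(est):
            return est
        prev = est
    return est   # tolerance not met; return the finest approximation

x, a, b = 0.8, 4.0, 4.0        # Gamma(4, 4) hyperprior: mean 1, variance 0.25
rule_value = gamma_rule(x, a, b)
```

Since δ*_G(x) = x + E[α | x] for a posterior weight proportional to α^a e^{-αb}/(x + α)², the rule value must lie strictly above x and below x + (a + 1)/b; node counts are kept at 128 or fewer here because very high-degree Laguerre rules are numerically delicate.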

The notion that min{ (α₀/α₁)², (α₁/α₀)² } is the best hyperprior variance is clearly false: for α₁ = 0.75, 1.25 that choice does not even work to reduce regret below the best guess model.

On the other hand, we see as before (at least for these particular examples) that a sufficiently small σ*² exists which will reduce regret, although when |α₀ - α₁| = 0.25 the reduction is extremely small (less than five percent). In closing, we implore the reader to bear in mind the exploratory and tentative nature of this section, and to avoid undue generalizations on the basis of these few examples. The only goal of this section was to look beyond the normal case, and see how hyperpriors behaved in such a setting.

Table 1: Comparison of Hierarchical and "Best-Guess" Rules

α₁      σ*²       T̄_G (S_G)         T̄_G/(α₁-α₀)²   T̄_I (S_I)         T̄_I/(α₁-α₀)²
0.500   0.25000   0.1699 (0.0072)   0.6795         0.1912 (0.0061)   0.7646
0.500   0.12500   0.2023 (0.0029)   0.8094         0.2098 (0.0031)   0.8391
0.500   0.06250   0.2227 (0.0011)   0.8906         0.2250 (0.0011)   0.9001
0.500   0.03125   0.2364 (0.0008)   0.9454         0.2370 (0.0008)   0.9480
0.500   0.01563   0.2429 (0.0004)   0.9717         0.2431 (0.0004)   0.9724
0.750   0.56250   0.1104 (0.0035)   1.7666         0.1047 (0.0022)   1.6749
0.750   0.28125   0.0691 (0.0028)   1.1062         0.0685 (0.0024)   1.0960
0.750   0.14063   0.0611 (0.0010)   0.9775         0.0611 (0.0008)   0.9783
0.750   0.07031   0.0603 (0.0008)   0.9641         0.0603 (0.0007)   0.9654
0.750   0.03516   0.0607 (0.0004)   0.9719         0.0608 (0.0004)   0.9725
1.250   0.64000   0.1320 (0.0057)   2.1119         0.1141 (0.0054)   1.8253
1.250   0.32000   0.0761 (0.0020)   1.2179         0.0731 (0.0020)   1.1696
1.250   0.16000   0.0646 (0.0014)   1.0331         0.0642 (0.0014)   1.0267
1.250   0.08000   0.0622 (0.0006)   0.9946         0.0621 (0.0006)   0.9940
1.250   0.04000   0.0618 (0.0004)   0.9882         0.0618 (0.0004)   0.9882
1.500   0.44444   0.2467 (0.0035)   0.9868         0.2438 (0.0032)   0.9754
1.500   0.22222   0.2402 (0.0025)   0.9607         0.2401 (0.0024)   0.9605
1.500   0.11111   0.2426 (0.0012)   0.9703         0.2427 (0.0012)   0.9708
1.500   0.05556   0.2458 (0.0008)   0.9832         0.2459 (0.0007)   0.9834
1.500   0.02778   0.2475 (0.0002)   0.9902         0.2476 (0.0002)   0.9902

CHAPTER III

BAYESIAN ROBUSTNESS AND KULLBACK-LEIBLER DISTANCE

3.1: Kullback-Leibler Distance In Chapter n we explored the use of hierarchical priors as a means of achieving robustness in a particular context: parameter estimation under squared error loss. While that context occurs quite frequently in Bayesian analysis, it is nevertheless a rather narrow framework. In this chapter we shall apply a different, broader sense to the term "robustness", a sense not keyed to a particular decision problem. Rather, we shall strive for robustness with respect to the posterior densities themselves. To do so we must introduce additional notation and terminology: Definition 3.1.1: Let g, and g2 be probability [density] functions on a probability space Y dominated by a measure p - Define

(3.1.1) iCSi.Sa) is well-known as the Kullback-Leibler Information Distance between g, and g2

(Kullback and Leibler, 1951, Kullback, 1952). l(g1;g2) has been used extensively in a variety of situations involving modeling, inference, and optimal choice of experiments: it is a quantity measuring how much g, differs from g2(Csiszar, 1975).

36 37 Before proceeding, we first list certain properties of l(g,,g2) (Kullback, 1959): l(g1,g2)> 0foranyg„g2,Y, andp; (3.1.2)

*(&.&) =0=>g, =g2a.e.[p]; (3.1.3)

l(&,g2) * % > & ) (3.1.4)
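For two normal densities, (3.1.1) has the familiar closed form I(N(μ₁,σ₁²), N(μ₂,σ₂²)) = ½[ ln(σ₂²/σ₁²) + (σ₁² + (μ₁-μ₂)²)/σ₂² - 1 ]. The sketch below (Python; the parameter values are arbitrary illustrative choices) checks this against direct numerical integration of (3.1.1) and illustrates the asymmetry (3.1.4):

```python
import math

def normal_pdf(y, mu, var):
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def kl_closed_form(mu1, var1, mu2, var2):
    # I(N(mu1,var1), N(mu2,var2)) for normal densities
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def kl_numeric(mu1, var1, mu2, var2, lo=-30.0, hi=30.0, n=100_000):
    # Trapezoidal approximation of (3.1.1): integral of g1 * ln(g1/g2)
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        y = lo + k * h
        g1 = normal_pdf(y, mu1, var1)
        if g1 > 0.0:
            term = g1 * math.log(g1 / normal_pdf(y, mu2, var2))
            total += 0.5 * term if k in (0, n) else term
    return total * h

kl12 = kl_numeric(0.0, 1.0, 1.0, 2.0)
kl21 = kl_numeric(1.0, 2.0, 0.0, 1.0)
```

The numeric and closed-form values agree, and kl12 ≠ kl21, which is exactly the lack of symmetry recorded in (3.1.4); this closed form is also what drives the normal-setting computations of Section 3.2.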

As before, let f(x|θ) denote a likelihood, and Γ = {π(θ|α): α ∈ A}, with μ_A a σ-finite measure on the hyperparameter space A. Let πᵢ(θ) = π(θ|αᵢ), i = 0, 1 denote the benchmark and best-guess priors respectively, and let π₂(θ) = ∫_A π(θ|α) π*(α) dμ_A(α) denote the prior which corresponds to the hierarchical structure with hyperprior π*(α). Also, let mᵢ(x) and πᵢ(θ|x), i = 0, 1, 2 denote the marginals and posteriors. The hierarchical prior would be preferable (in the sense of Kullback-Leibler information) if

I(π₀(θ|x), π₁(θ|x)) ≥ I(π₀(θ|x), π₂(θ|x)),  ∀x ∈ X,   (3.1.5)

since (3.1.5) implies that π₂(θ|x) is "closer" to π₀(θ|x) than π₁(θ|x) is to π₀(θ|x). Note that we are interested in I(π₀(θ|x), πᵢ(θ|x)), i = 1, 2 rather than in I(πᵢ(θ|x), π₀(θ|x)), i = 1, 2, because π₀(θ|x) is the benchmark posterior, against which our efforts will be judged. However, (3.1.5) may not hold a posteriori for all x. Hence we will instead view the hierarchical prior as preferable if

E^{m₀(x)}[I(π₀(θ|X), π₁(θ|X))] ≥ E^{m₀(x)}[I(π₀(θ|X), π₂(θ|X))].   (3.1.6)

Therefore, our primary subject of study will be the difference between the left- and right-hand sides of (3.1.6). The following result will be useful in the next section:

Lemma 3.1.1: E^{m₀(x)}[I(π₀(θ|X), π₁(θ|X))] - E^{m₀(x)}[I(π₀(θ|X), π₂(θ|X))]

= E^{π₀(θ)}[ ln( π₂(θ)/π₁(θ) ) ] - E^{m₀(x)}[ ln( m₂(X)/m₁(X) ) ].   (3.1.7)

Proof: It follows from (3.1.1) that

E^{m₀(x)}[I(π₀(θ|X), π₁(θ|X))] - E^{m₀(x)}[I(π₀(θ|X), π₂(θ|X))]

= E^{m₀(x)}[ E^{π₀(θ|x)}[ ln( π₀(θ|X)/π₁(θ|X) ) - ln( π₀(θ|X)/π₂(θ|X) ) ] ]

= E^{m₀(x)}[ E^{π₀(θ|x)}[ ln( π₂(θ|X)/π₁(θ|X) ) ] ]

= E^{m₀(x)}[ E^{π₀(θ|x)}[ ln( π₂(θ) m₁(X) / (π₁(θ) m₂(X)) ) ] ]

= E^{m₀(x)}[ E^{π₀(θ|x)}[ ln( π₂(θ)/π₁(θ) ) + ln( m₁(X)/m₂(X) ) ] ].   (3.1.8)

Since the first term in the inner expectation does not involve x, its expectation may be found with respect to the marginal distribution π₀(θ) of θ. Similarly, the second term does not involve θ, hence its expectation may be found with respect to the marginal distribution of x, which is m₀(x). The result (3.1.7) then follows. □

3.2: The Kullback-Leibler Approach to the Normal Problem

We adhere to the notation of Section 2.2. Let the likelihood and functional form of the prior be given by (2.2.1) and (2.2.2), and the various priors by (2.2.3)-(2.2.6). We are interested in the following questions (which correspond to similar questions addressed in Section 2.2 with regard to squared error loss):

(i) When is the hierarchical model better, i.e., when is (3.1.6) satisfied?

(ii) What is the "best" choice of τ², i.e., what τ² maximizes E^{m₀(x)}[I(π₀(θ|X), π₁(θ|X))] - E^{m₀(x)}[I(π₀(θ|X), π₂(θ|X))]?

(iii) Under what conditions is (3.1.6) satisfied for all values of τ²?

The following theorem, which corresponds closely in form to Theorem 2.2.1, answers these questions:

Theorem 3.2.1: Define

I(τ², σ², η², Δ²) = E^{m₀(x)}[I(π₀(θ|X), π₁(θ|X))] - E^{m₀(x)}[I(π₀(θ|X), π₂(θ|X))].   (3.2.1)

Then, under the notation defined thus far:

(a) The value of I(τ², σ², η², Δ²) is maximized when τ² = Δ².

(b) The value of I(τ², σ², η², Δ²) is positive if 0 < τ² < Δ².

(c) The value of I(τ², σ², η², Δ²) is positive for all τ² > 0 if:

Δ² ≥ [σ²(η² + σ²)/η²] ln( (η² + σ²)/σ² ).   (3.2.2)

Proof: Note that, by Lemma 3.1.1, I(τ², σ², η², Δ²) = E^{π₀(θ)}[ln(π₂(θ)/π₁(θ))] - E^{m₀(x)}[ln(m₂(X)/m₁(X))]. Since π₁ and π₂ are the N(α₁, σ²) and N(α₁, σ² + τ²) densities,

E^{π₀(θ)}[ ln( π₂(θ)/π₁(θ) ) ] = ½{ ln( σ²/(σ² + τ²) ) + (σ² + Δ²)/σ² - (σ² + Δ²)/(σ² + τ²) }.   (3.2.3)

Similarly, since m₁ and m₂ are the N(α₁, σ² + η²) and N(α₁, σ² + η² + τ²) densities,

E^{m₀(x)}[ ln( m₂(X)/m₁(X) ) ] = ½{ ln( (σ² + η²)/(σ² + η² + τ²) ) + (σ² + η² + Δ²)/(σ² + η²) - (σ² + η² + Δ²)/(σ² + η² + τ²) }.   (3.2.4)

For t, d ≥ 0, a > 0, define

ι*(t, a, d) = ln( a/(a + t) ) + (a + d)/a - (a + d)/(a + t).   (3.2.5)

Clearly, from (3.1.7), (3.2.3), and (3.2.4):

I(τ², σ², η², Δ²) = ½( ι*(τ², σ², Δ²) - ι*(τ², σ² + η², Δ²) ).   (3.2.6)

For convenience, define

ῑ(t) = ι*(t, σ², Δ²) - ι*(t, σ² + η², Δ²).   (3.2.7)

Differentiating (3.2.5) wrt t yields

dι*(t, a, d)/dt = -1/(a + t) + (a + d)/(a + t)².   (3.2.8)

Combining (3.2.7) with (3.2.8) one obtains

dῑ(t)/dt = -1/(σ² + t) + (σ² + Δ²)/(σ² + t)² + 1/(η² + σ² + t) - (η² + σ² + Δ²)/(η² + σ² + t)²

= (Δ² - t)[ 1/(σ² + t)² - 1/(η² + σ² + t)² ].   (3.2.9)

Since 1/(σ² + t)² > 1/(η² + σ² + t)², (3.2.9) implies

sgn( dῑ(t)/dt ) = sgn( Δ² - t ),   (3.2.10)

so that ῑ(t) is maximized for t = Δ², which proves (a). Furthermore, since

ῑ(0) = ln(1) + 0 - ln(1) - 0 = 0,   (3.2.11)

and since ῑ(t) is strictly increasing for t ∈ [0, Δ²), it follows that choosing τ² ∈ (0, Δ²) will lead to a hierarchical model which is "superior" to the "best-guess" model, which proves

(b). Finally, to prove (c), we rewrite (3.2.7) as

ῑ(t) = ln( σ²(η² + σ² + t) / ((σ² + t)(η² + σ²)) ) + (σ² + Δ²) t / (σ²(σ² + t)) - (η² + σ² + Δ²) t / ((η² + σ²)(η² + σ² + t)),   (3.2.12)

so that

lim_{t→∞} ῑ(t) = ln( σ²/(η² + σ²) ) + (σ² + Δ²)/σ² - (η² + σ² + Δ²)/(η² + σ²)

= ln( σ²/(η² + σ²) ) + η²Δ² / (σ²(η² + σ²)).   (3.2.13)

Since ῑ(t) is strictly decreasing and continuous on (Δ², ∞), and positive at t = Δ², it follows from (3.2.13) that if the right-hand side of (3.2.13) is greater than or equal to zero, then ῑ(t) is positive on (Δ², ∞), and therefore on (0, ∞). But

ln( σ²/(η² + σ²) ) + η²Δ²/(σ²(η² + σ²)) ≥ 0 ⟺ Δ² ≥ [σ²(η² + σ²)/η²] ln( (η² + σ²)/σ² ).   (3.2.14)

Hence (c) obtains. □

Theorem 3.2.1 suggests that one must have some knowledge of Δ² to choose a hyperprior that will improve upon the "best-guess" prior. This was also the case in the squared error loss problem.

Theorem 3.2.2: There exist no three positive numbers η², σ², and τ² such that

$$I(\tau^2,\sigma^2,\eta^2,\Delta^2)>0\tag{3.2.15}$$
for all values of $\Delta^2$.

Proof: If (3.2.15) obtained for all values of $\Delta^2$, then
$$\lim_{\Delta^2\to0}\big(I(\tau^2,\sigma^2,\eta^2,\Delta^2)\big)=I(\tau^2,\sigma^2,\eta^2,0)\ \ge\ 0.\tag{3.2.16}$$

But
$$I(\tau^2,\sigma^2,\eta^2,0)=\tfrac12\left[\ln\frac{\sigma^2(\eta^2+\sigma^2+\tau^2)}{(\sigma^2+\tau^2)(\eta^2+\sigma^2)}+\frac{\tau^2}{\sigma^2+\tau^2}-\frac{\tau^2}{\eta^2+\sigma^2+\tau^2}\right].\tag{3.2.17}$$

Clearly, from (3.2.17), $I(0,\sigma^2,\eta^2,0)=0$. Furthermore, it follows from (3.2.9) that
$$\frac{\partial I(\tau^2,\sigma^2,\eta^2,0)}{\partial\tau^2}=-\frac{\tau^2}{2}\left[\frac{1}{(\sigma^2+\tau^2)^2}-\frac{1}{(\eta^2+\sigma^2+\tau^2)^2}\right]<0.\tag{3.2.18}$$

Hence, (3.2.16) cannot obtain for any $\tau^2>0$. $\Box$
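The closed forms above lend themselves to a quick numerical audit. The following sketch (ours, not part of the dissertation; all function names are hypothetical) evaluates $I(\tau^2,\sigma^2,\eta^2,\Delta^2)$ through (3.2.5)-(3.2.6), cross-checks that expression against the normal Kullback-Leibler decomposition underlying (3.2.3)-(3.2.4), and confirms the qualitative content of Theorems 3.2.1 and 3.2.2:

```python
import math

def istar(t, a, d):
    # i*(t, a, d) as in (3.2.5)
    return math.log(a / (a + t)) + t / (a + t) + t * d / (a * (a + t))

def I(tau2, sigma2, eta2, delta2):
    # Expected KL improvement of the hierarchical over the best-guess
    # prior, as in (3.2.6)
    return 0.5 * (istar(tau2, sigma2, delta2) - istar(tau2, sigma2 + eta2, delta2))

def kl_normal(m0, v0, m1, v1):
    # Kullback-Leibler divergence between N(m0, v0) and N(m1, v1)
    return 0.5 * (math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

sigma2, eta2, a0, a1, tau2 = 1.0, 1.5, 2.0, 0.5, 0.7
delta2 = (a0 - a1) ** 2
# (KL of priors minus KL of marginals), best-guess term minus
# hierarchical term, must agree with the closed form (3.2.6)
direct = (kl_normal(a0, sigma2, a1, sigma2)
          - kl_normal(a0, eta2 + sigma2, a1, eta2 + sigma2)) \
       - (kl_normal(a0, sigma2, a1, sigma2 + tau2)
          - kl_normal(a0, eta2 + sigma2, a1, eta2 + sigma2 + tau2))
assert abs(direct - I(tau2, sigma2, eta2, delta2)) < 1e-12

# Theorem 3.2.1(a): over a grid, I is maximized at tau^2 = Delta^2 = 2.25
best = max((0.01 * k for k in range(1, 1001)), key=lambda t: I(t, sigma2, eta2, delta2))
print(best)                           # 2.25
print(I(5.0, sigma2, eta2, 0.0) < 0)  # Theorem 3.2.2: I < 0 once Delta^2 = 0
```

The grid maximizer lands exactly on $\Delta^2$, and taking $\Delta^2\to0$ with $\tau^2>0$ drives $I$ negative, in line with Theorem 3.2.2.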

3.3: Some Further Results on Kullback-Leibler Information Distance

The posterior distribution one obtains is formed from the prior distribution one chooses and the likelihood. Intuition suggests that the same likelihood will draw two priors "closer" together. "Closeness" could be defined in many ways; in the spirit of this chapter, one might hope that if the hierarchical prior distribution $N(a_1,\sigma^2+\tau^2)$ is nearer (in the Kullback-Leibler sense) to the benchmark prior distribution $N(a_0,\sigma^2)$ than the "best-guess" prior $N(a_1,\sigma^2)$ is to $N(a_0,\sigma^2)$, then the corresponding hierarchical posterior will come closer to the benchmark posterior than will the "best-guess" posterior. A result along these lines can be established as a consequence of the following lemma:

Lemma 3.3.1: Define the functions
$$g_i(\varepsilon)=a_i+b_i\varepsilon,\quad i=1,2,\tag{3.3.1}$$
with
$$a_1<a_2<0\quad\text{and}\quad b_1>b_2>0.\tag{3.3.2}$$
Define
$$\varepsilon_i=-\frac{a_i}{b_i},\quad i=1,2,\tag{3.3.3}$$
and
$$\varepsilon^*=\frac{a_2-a_1}{b_1-b_2}.\tag{3.3.4}$$
(Note that $\varepsilon_i$ is, in geometrical terms, the x-intercept of $g_i$, and that $\varepsilon^*$ is the unique point at which $g_1$ and $g_2$ intersect.) Then
$$\varepsilon^*<\varepsilon_1\iff\varepsilon_2>\varepsilon_1.\tag{3.3.5}$$

Proof. Observe that
$$\varepsilon^*<\varepsilon_1\iff\frac{a_2-a_1}{b_1-b_2}<-\frac{a_1}{b_1}\iff a_2b_1-a_1b_1<-a_1b_1+a_1b_2\iff a_2b_1<a_1b_2$$
$$\iff\frac{b_1}{a_1}<\frac{b_2}{a_2}\ \ \text{(dividing by }a_1a_2>0\text{)}\iff-\frac{a_2}{b_2}>-\frac{a_1}{b_1}\iff\varepsilon_2>\varepsilon_1,$$
which establishes (3.3.5). $\Box$

Theorem 3.3.1. Suppose that for some positive values of $\tau^2$, $\sigma^2$, and $(a_0-a_1)^2$ we have
$$I(\pi_0(\theta),\pi_1(\theta))-I(\pi_0(\theta),\pi_2(\theta))>0.\tag{3.3.6}$$
Then for any value of $\eta^2$,
$$I(\tau^2,\sigma^2,\eta^2,\Delta^2)=E^{m_0(X)}\big[I(\pi_0(\theta\mid X),\pi_1(\theta\mid X))-I(\pi_0(\theta\mid X),\pi_2(\theta\mid X))\big]\ \ge\ 0.\tag{3.3.7}$$
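Before giving the proof, here is a small numerical illustration (ours; the variable names are hypothetical). Taking $\tau^2>\Delta^2$, so that Theorem 3.2.1(b) gives no guarantee, the prior-level condition (3.3.6) still forces $I\ge0$ for every $\eta^2$, exactly as the theorem asserts. The functions $\psi_1$ and $\psi_2$ used in the proof are both expressible through $i^*$ of (3.2.5):

```python
import math

def istar(t, a, d):
    # i*(t, a, d) of (3.2.5); in the proof's notation,
    # psi_1(w) = i*(w1, w2, w4) and psi_2(w) = i*(w1, w2 + w3, w4)
    return math.log(a / (a + t)) + t / (a + t) + t * d / (a * (a + t))

tau2, sigma2, delta2 = 4.0, 1.0, 2.5        # note tau2 > delta2
psi1 = istar(tau2, sigma2, delta2)          # prior-level improvement; free of eta^2
assert psi1 > 0                             # hypothesis (3.3.6) holds
for eta2 in (0.1, 0.5, 1.0, 5.0, 50.0):
    psi2 = istar(tau2, sigma2 + eta2, delta2)
    I = 0.5 * (psi1 - psi2)                 # posterior-level improvement, (3.3.38)
    print(eta2, I > 0)                      # True for every eta^2
```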

Proof. Let $\mathbb{R}_+=(0,\infty)$ denote the positive reals, and define the functions $\psi_i:\mathbb{R}_+^4\to(-\infty,\infty)$, $i=1,2$, by
$$\psi_1(w_1,w_2,w_3,w_4)=\ln\frac{w_2}{w_2+w_1}+\frac{w_1}{w_2+w_1}+\frac{w_1w_4}{w_2(w_2+w_1)}\tag{3.3.8}$$
and
$$\psi_2(w_1,w_2,w_3,w_4)=\ln\frac{w_3+w_2}{w_3+w_2+w_1}+\frac{w_1}{w_3+w_2+w_1}+\frac{w_1w_4}{(w_3+w_2)(w_3+w_2+w_1)}.\tag{3.3.9}$$

Also, define
$$S_1=\{w\in\mathbb{R}_+^4:\psi_1(w)>0\}\tag{3.3.10}$$
and
$$S_2=\{w\in\mathbb{R}_+^4:\psi_1(w)\ge\psi_2(w)\}.\tag{3.3.11}$$

We now show that
$$S_1\subset S_2.\tag{3.3.12}$$

Let $(w_1,w_2,w_3,w_4)\in S_1$ be arbitrary. Define
$$z=\frac{w_1}{w_2+w_1},\quad u=w_2,\tag{3.3.13}$$
and
$$z^*=\frac{w_1}{w_3+w_2+w_1},\quad u^*=w_3+w_2.\tag{3.3.14}$$

Define the functions $f_P:\mathbb{R}_+\to(-\infty,\infty)$ and $f_M:\mathbb{R}_+\to(-\infty,\infty)$ by
$$f_P(\varepsilon)=\ln(1-z)+z+\frac{z\varepsilon}{u},\tag{3.3.15}$$
$$f_M(\varepsilon)=\ln(1-z^*)+z^*+\frac{z^*\varepsilon}{u^*}.\tag{3.3.16}$$

Observe that
$$f_P(w_4)=\psi_1(w_1,w_2,w_3,w_4)\tag{3.3.17}$$
and
$$f_M(w_4)=\psi_2(w_1,w_2,w_3,w_4).\tag{3.3.18}$$

We shall apply Lemma 3.3.1 to $f_M$ and $f_P$. Define
$$h(t)=\ln(1-t)+t.\tag{3.3.19}$$

Since $h'(t)=-\frac{1}{1-t}+1=-\frac{t}{1-t}<0$, it follows that $h(t)$ is decreasing on $(0,1)$; also, $h(0)=0$. But $0<z^*<z<1$, so that $\ln(1-z)+z<\ln(1-z^*)+z^*<0$. Further, note that
$$\frac{z}{u}>\frac{z^*}{u}>\frac{z^*}{u^*}>0.\tag{3.3.20}$$

Hence (3.3.2) obtains for $a_1=\ln(1-z)+z$, $b_1=z/u$, $a_2=\ln(1-z^*)+z^*$, and $b_2=z^*/u^*$. Let $\varepsilon_P$ and $\varepsilon_M$ denote the x-intercepts of $f_P$ and $f_M$; observe that
$$\varepsilon_P=-u\left(\frac{\ln(1-z)}{z}+1\right)=-w_2\left[\left(\frac{w_2+w_1}{w_1}\right)\ln\frac{w_2}{w_2+w_1}+1\right]\tag{3.3.21}$$
and
$$\varepsilon_M=-u^*\left(\frac{\ln(1-z^*)}{z^*}+1\right)=-(w_3+w_2)\left[\left(\frac{w_3+w_2+w_1}{w_1}\right)\ln\frac{w_3+w_2}{w_3+w_2+w_1}+1\right].\tag{3.3.22}$$

Substitution of $\lambda_P=w_2/w_1$ into (3.3.21) yields
$$\varepsilon_P=-\lambda_Pw_1\left[(1+\lambda_P)\ln\frac{\lambda_P}{\lambda_P+1}+1\right].\tag{3.3.23}$$

Substitution of $\lambda_M=(w_3+w_2)/w_1$ into (3.3.22) yields
$$\varepsilon_M=-\lambda_Mw_1\left[(1+\lambda_M)\ln\frac{\lambda_M}{\lambda_M+1}+1\right].\tag{3.3.24}$$

Define
$$\Psi(\lambda)=\lambda\left[(1+\lambda)\ln\frac{\lambda}{\lambda+1}+1\right]\tag{3.3.25}$$
and note, by comparing (3.3.25) with (3.3.23) and (3.3.24), that
$$\varepsilon_P=-w_1\Psi(\lambda_P)\quad\text{and}\quad\varepsilon_M=-w_1\Psi(\lambda_M).\tag{3.3.26}$$

We shall show that $\Psi(\lambda)$ is a decreasing function of $\lambda$. Differentiating (3.3.25) yields

$$\Psi'(\lambda)=(1+2\lambda)\ln\frac{\lambda}{\lambda+1}+2.\tag{3.3.27}$$

Hence
$$\Psi'(\lambda)<0\iff2+(1+2\lambda)\ln\frac{\lambda}{\lambda+1}<0\iff\ln\frac{\lambda+1}{\lambda}>\frac{2}{1+2\lambda}.\tag{3.3.28}$$

But the right-hand inequality in (3.3.28) obtains for all positive $\lambda$ (see Mitrinović, 1970, p. 273). Consequently, $\Psi(\lambda)$ is a decreasing function of $\lambda$. Since $\lambda_P<\lambda_M$, it follows from (3.3.26) that $\varepsilon_P<\varepsilon_M$. Thus, by Lemma 3.3.1, the point at which $f_P$ and $f_M$ intersect, which we shall call $\gamma$, lies below $\varepsilon_P$: $\gamma<\varepsilon_P$. Since the slope of $f_P$ is greater than the slope of $f_M$ (by (3.3.20)), $f_P(\varepsilon)>f_M(\varepsilon)$ for large values of $\varepsilon$. Hence
$$f_P(\varepsilon)-f_M(\varepsilon)>0\quad\text{for }\varepsilon>\gamma,\tag{3.3.31}$$
and

$$f_P(\varepsilon)-f_M(\varepsilon)<0\quad\text{for }\varepsilon<\gamma.\tag{3.3.32}$$

(The situation is illustrated in Figure 1.) Now, by (3.3.17) and the fact that $(w_1,w_2,w_3,w_4)\in S_1$, it follows that
$$w_4>\varepsilon_P>\gamma.\tag{3.3.33}$$

But since $w_4>\gamma$,
$$f_P(w_4)>f_M(w_4)=\psi_2(w_1,w_2,w_3,w_4),\tag{3.3.34}$$
from which it follows that
$$(w_1,w_2,w_3,w_4)\in S_2.\tag{3.3.35}$$

Now, if (3.3.6) obtains, then from (3.2.3) and (3.2.4) it follows that
$$\psi_1(\tau^2,\sigma^2,\eta^2,\Delta^2)>0\quad\text{for all }\eta^2\tag{3.3.36}$$
(indeed, $\psi_1$ does not depend on its third argument), so that $(\tau^2,\sigma^2,\eta^2,\Delta^2)\in S_1$. But then $(\tau^2,\sigma^2,\eta^2,\Delta^2)\in S_2$, so that
$$\psi_1(\tau^2,\sigma^2,\eta^2,\Delta^2)-\psi_2(\tau^2,\sigma^2,\eta^2,\Delta^2)\ \ge\ 0.\tag{3.3.37}$$

However, by Lemma 3.1.1, (3.2.3), and (3.2.4), one obtains
$$I(\tau^2,\sigma^2,\eta^2,\Delta^2)=\tfrac12\big[\psi_1(\tau^2,\sigma^2,\eta^2,\Delta^2)-\psi_2(\tau^2,\sigma^2,\eta^2,\Delta^2)\big],\tag{3.3.38}$$
so the result follows from (3.3.37). $\Box$

Observe that Figure 1 suggests that the converse of Theorem 3.3.1 is not true, for at $\varepsilon=\tfrac12(\gamma+\varepsilon_P)$ we have $f_P(\varepsilon)<0$ but nevertheless $f_M(\varepsilon)<f_P(\varepsilon)$. A numerical example confirms this: there are values of $\sigma^2$, $\tau^2$, $\eta^2$, and $(a_0-a_1)^2$ for which
$$E^{\pi_0(\theta)}\left[\ln\frac{\pi_2(\theta)}{\pi_1(\theta)}\right]=-0.0003>-0.0005=E^{m_0(X)}\left[\ln\frac{m_2(X)}{m_1(X)}\right],\tag{3.3.39}$$
so that $I(\tau^2,\sigma^2,\eta^2,\Delta^2)>0$ even though (3.3.6) fails.

Figure 1: The Behavior of $f_P$ and $f_M$

Note that these values of $\sigma^2$, $\tau^2$, and $(a_0-a_1)^2$ are not "contrived" but instead reflect a realistic scenario.

Theorem 3.3.1 has an intuitively appealing corollary. One might suspect that the hierarchical approach is preferable when the hierarchical prior evaluated at the benchmark value $a_0$ is greater than the best-guess prior evaluated at the benchmark value $a_0$. This is shown to be the case under the conditions of Theorem 3.3.1:

Corollary 3.3.1: Using the notation we have defined thus far, if for some values of $\tau^2$, $\sigma^2$, $a_0$, and $a_1$,
$$\pi_2(a_0)>\pi_1(a_0),\tag{3.3.40}$$
then for any value of $\eta^2$,
$$I(\tau^2,\sigma^2,\eta^2,\Delta^2)=E^{m_0(X)}\big[I(\pi_0(\theta\mid X),\pi_1(\theta\mid X))-I(\pi_0(\theta\mid X),\pi_2(\theta\mid X))\big]\ \ge\ 0.\tag{3.3.41}$$

Proof. Assume (3.3.40). Define

$$H^*(b)=2\ln\frac{\pi_2(b)}{\pi_1(b)},\quad b\in(-\infty,\infty).\tag{3.3.42}$$

Clearly (3.3.40) obtains iff $H^*(a_0)>0$. Further observe that
$$H^*(b)=\ln\frac{\sigma^2}{\sigma^2+\tau^2}+\frac{\tau^2(b-a_1)^2}{\sigma^2(\sigma^2+\tau^2)};\tag{3.3.43}$$
$$H^*(a_1)=\ln\frac{\sigma^2}{\sigma^2+\tau^2}<0;\tag{3.3.44}$$
$$\frac{\partial H^*(b)}{\partial b}=\frac{2\tau^2(b-a_1)}{\sigma^2(\sigma^2+\tau^2)};\tag{3.3.45}$$
and
$$\lim_{b\to\pm\infty}H^*(b)=\infty.\tag{3.3.46}$$

It follows from (3.3.43)-(3.3.46), and the continuity of $H^*(b)$, that $H^*(b)$ has exactly one root, say $b_1$, in $(-\infty,a_1)$ and one root, say $b_2$, in $(a_1,\infty)$. Since $H^*(b)$ is symmetric about $a_1$, it follows that
$$a_1-b_1=b_2-a_1.\tag{3.3.47}$$
(Figure 2 on p. 51 illustrates the situation when $a_1=5$, $\sigma^2=2$, and $\tau^2=1$.)

Now, (3.3.40) implies $a_0\in(-\infty,b_1)\cup(b_2,\infty)$, and consequently
$$(a_0-a_1)^2>(a_1-b_1)^2=(a_1-b_2)^2.\tag{3.3.48}$$

But we can evaluate $(a_1-b_1)^2$ by setting (3.3.43) equal to zero:
$$(b_1-a_1)^2=\frac{\sigma^2(\sigma^2+\tau^2)}{\tau^2}\ln\frac{\sigma^2+\tau^2}{\sigma^2}.\tag{3.3.49}$$

Therefore (3.3.48) implies that
$$(a_0-a_1)^2>\frac{\sigma^2(\sigma^2+\tau^2)}{\tau^2}\ln\frac{\sigma^2+\tau^2}{\sigma^2}.\tag{3.3.50}$$

But note that (3.3.6) may be written as $E^{\pi_0(\theta)}[\ln(\pi_2(\theta)/\pi_1(\theta))]>0$, and
$$E^{\pi_0(\theta)}\left[\ln\frac{\pi_2(\theta)}{\pi_1(\theta)}\right]>0\iff\sigma^2+(a_0-a_1)^2>\frac{\sigma^2(\sigma^2+\tau^2)}{\tau^2}\ln\frac{\sigma^2+\tau^2}{\sigma^2}.\tag{3.3.51}$$

It follows that if (3.3.50) holds, then (3.3.51) also holds. Hence the result follows from Theorem 3.3.1. $\Box$
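Corollary 3.3.1 can likewise be audited numerically (our sketch; all names are hypothetical): whenever a randomly drawn configuration satisfies (3.3.40), the quantity $I(\tau^2,\sigma^2,\eta^2,\Delta^2)$ computed through (3.2.5)-(3.2.6) should never be negative:

```python
import math, random

def norm_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def istar(t, a, d):
    # i*(t, a, d) of (3.2.5)
    return math.log(a / (a + t)) + t / (a + t) + t * d / (a * (a + t))

random.seed(1)
checked = 0
for _ in range(2000):
    sigma2, tau2, eta2 = (random.uniform(0.1, 5.0) for _ in range(3))
    a0, a1 = random.uniform(-5.0, 5.0), random.uniform(-5.0, 5.0)
    # (3.3.40): the hierarchical prior N(a1, sigma2 + tau2) exceeds the
    # best-guess prior N(a1, sigma2) at the benchmark mean a0
    if norm_pdf(a0, a1, sigma2 + tau2) > norm_pdf(a0, a1, sigma2):
        delta2 = (a0 - a1) ** 2
        I = 0.5 * (istar(tau2, sigma2, delta2) - istar(tau2, sigma2 + eta2, delta2))
        assert I > -1e-9     # (3.3.41), up to floating-point tolerance
        checked += 1
print(checked, "configurations satisfied (3.3.40); none violated (3.3.41)")
```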

Figure 2: The Behavior of the Hierarchical and "Best Guess" Prior

CHAPTER IV

BAYESIAN ROBUSTNESS AND FINITE MIXTURES

4.1: Introduction

In Chapters II and III, we have assumed that the functional form of the benchmark prior $\pi_0(\theta)$ was known but that we were uncertain about some parameter in the prior (such as the prior mean). In some instances, however, the opposite situation is more reasonable: one can assume that some parametric or structural properties of the prior are known exactly but one is uncertain about the prior's functional form. Specifically, in this chapter the benchmark prior $\pi_0$ is assumed to lie in some class of distributions $\Gamma$, with each element of $\Gamma$ satisfying certain properties $P_1,P_2,\ldots,P_m$. However, $\Gamma$ will not be defined by $P_1,P_2,\ldots,P_m$ but rather will consist only of a finite subset of the class of all distributions satisfying $P_1,P_2,\ldots,P_m$. In other words, each distribution in $\Gamma$ will satisfy $P_1,P_2,\ldots,P_m$, but only some of the distributions which satisfy $P_1,P_2,\ldots,P_m$ will be included in $\Gamma$. In fact, the elements of the set $\Gamma$ which we study in this chapter are all well-known "named" distributions. The hierarchical model will then correspond to a finite mixture,¹ with the hyperprior placed on $\Gamma$ assigning some positive probability to each of the elements therein. We will focus our attention on hyperpriors which place equal mass on each prior in $\Gamma$, and will consider estimation under squared error loss. Computation with

1 Because of the extensive use of mixture distributions here and elsewhere in this dissertation, a brief discussion of that topic is given in Appendix B.

such priors will be facilitated by the following proposition, which is a direct extension of a well-known result (Berger, 1985) and which is proved in Appendix B:

Proposition 4.1.1. Let $f(x\mid\theta)$, $\theta\in\Theta$, denote a likelihood, and let $\pi_1(\theta),\pi_2(\theta),\ldots,\pi_k(\theta)$ be probability measures with support $\Theta$. Define
$$m_i(x)=\int_\Theta f(x\mid\theta)\,d\pi_i(\theta),\quad i=1,\ldots,k;\tag{4.1.1}$$
$$\pi_i(\theta\mid x)=\frac{f(x\mid\theta)\pi_i(\theta)}{m_i(x)},\quad i=1,\ldots,k;\tag{4.1.2}$$
$$\delta^i(x)=E^{\pi_i(\theta\mid x)}[\theta],\quad i=1,\ldots,k.\tag{4.1.3}$$
(That is, (4.1.1), (4.1.2), and (4.1.3) denote respectively the marginal, the posterior, and the posterior mean based on $\pi_i(\theta)$.) Let $e=(e_1,e_2,\ldots,e_k)$ denote a so-called probability vector (i.e., a vector such that $\sum_{i=1}^ke_i=1$ and $e_i>0\ \forall i$). Define the mixture prior
$$\pi_e(\theta)=\sum_{i=1}^ke_i\pi_i(\theta).\tag{4.1.4}$$
Then the posterior mean of $\theta\mid X=x$ based on the prior (4.1.4) is given by
$$\delta^e(x)=\sum_{i=1}^k\varepsilon_i(x)\,\delta^i(x),\tag{4.1.5}$$
where
$$\varepsilon_i(x)=\frac{e_im_i(x)}{\sum_{j=1}^ke_jm_j(x)}\tag{4.1.6}$$
denotes the posterior probability of the $i$th element of $\Gamma$. Note that $\delta^e(x)$ is a data-driven mixture of the $\delta^i(x)$, with posterior weights proportional to $e_im_i(x)$, $i=1,2,\ldots,k$.
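Proposition 4.1.1 is straightforward to implement when the components are conjugate. The sketch below (ours; the $N(\theta,1)$ likelihood anticipates Section 4.2, and all names are hypothetical) computes $\delta^e(x)$ through (4.1.5) and (4.1.6) for a mixture of normal priors:

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_posterior_mean(x, components, e):
    """Posterior mean under the mixture prior (4.1.4).

    components: list of (mean, var) pairs for normal priors pi_i;
    e: prior probabilities (a probability vector);
    likelihood: X | theta ~ N(theta, 1), so everything is closed-form.
    """
    # m_i(x): the marginal of X under pi_i is N(mean_i, var_i + 1)
    m = [normal_pdf(x, mu, v + 1.0) for mu, v in components]
    # delta^i(x): conjugate-normal posterior mean under pi_i
    delta = [(mu + v * x) / (v + 1.0) for mu, v in components]
    # (4.1.6): posterior weights eps_i(x) proportional to e_i * m_i(x)
    w = [ei * mi for ei, mi in zip(e, m)]
    total = sum(w)
    # (4.1.5): data-driven mixture of the component posterior means
    return sum(wi / total * di for wi, di in zip(w, delta))

# With x = 0, the weight shifts almost entirely onto the N(0, 1)
# component, so the estimate sits near its posterior mean 0, not near 2.5.
print(mixture_posterior_mean(0.0, [(0.0, 1.0), (5.0, 1.0)], [0.5, 0.5]))
```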

4.2: Main Results

In the remainder of this chapter, we will consider the following setting. Conditional on $\theta$, the observable random variable $X$ is $N(\theta,1)$, and $\theta$ itself has a density $\pi(\theta)$ which is known to satisfy:
$$P_1:\ \pi(\theta)>0\ \ \forall\theta\in(-\infty,\infty);\tag{4.2.1}$$
$$P_2:\ \pi(\theta)=\pi(-\theta)\ \ \forall\theta;\tag{4.2.2}$$
$$P_3:\ \int_{-\infty}^{u}d\pi(\theta)=0.8413447\ \text{ for some known }u.\tag{4.2.3}$$

Thus, $P_1$ and $P_2$ are structural properties while $P_3$ is a parametric property. $P_1$ simply ensures that the support of $\pi$ is the real line, while $P_2$ ensures that $\pi$ is symmetric. The magnitude of $u$ in $P_3$ determines the extent to which $\pi$ is diffuse: the larger $u$, the more diffuse the prior $\pi$. There exist an uncountably infinite number of distributions which satisfy $P_1$, $P_2$, and $P_3$, but we will restrict our attention to only three well-known distributions:

$$\pi_C(\theta)=\frac{1}{\pi\beta_C\left[1+(\theta/\beta_C)^2\right]};\tag{4.2.4}$$
$$\pi_D(\theta)=\frac{1}{2\beta_D}\exp\left(-\left|\theta/\beta_D\right|\right);\tag{4.2.5}$$
$$\pi_N(\theta)=\frac{1}{\sqrt{2\pi}\,\beta_N}\exp\left(-\frac{\theta^2}{2\beta_N^2}\right);\tag{4.2.6}$$
where $\beta_C$, $\beta_D$, and $\beta_N$ are chosen to satisfy $P_3$. (The subscripts are explained by the fact that (4.2.4), (4.2.5), and (4.2.6) denote respectively Cauchy, double exponential, and normal densities.) Specifically, we choose $\beta_N=u$. Approximate values for $\beta_C$ and $\beta_D$ in terms of $u$ follow easily from integration:
$$\int_{-\infty}^{u}\pi_C(\theta)\,d\theta\approx0.8413447\ \Rightarrow\ \beta_C\approx2.814914u;\tag{4.2.7}$$
$$\int_{-\infty}^{u}\pi_D(\theta)\,d\theta\approx0.8413447\ \Rightarrow\ \beta_D\approx0.8711754u.\tag{4.2.8}$$

Further, let $m_t(x)$, $\pi_t(\theta\mid x)$, $\delta^t(x)$, and $V_t(x)$, $t=C,D,N$, denote respectively the marginal distribution of $X$, the posterior distribution of $\theta$, the posterior mean (i.e., the Bayes estimator) of $\theta$, and the posterior expected loss of the Bayes estimator. For the normal case, these quantities are well known; the other formulae are given in Propositions C.2.1 and C.3.1 of Appendix C.

First we will consider the situation where $\Gamma$ has only two elements:
$$\Gamma=\{\pi_s,\pi_p\}\subset\{\pi_C,\pi_D,\pi_N\}.\tag{4.2.9}$$

(We will expand to $\Gamma=\{\pi_C,\pi_D,\pi_N\}$ subsequently.) We emphasize that we know $\Gamma$ exactly: our uncertainty is whether the true prior is $\pi_s$ or $\pi_p$. We are primarily interested in studying the regret of a decision rule $\delta$:
$$r^*(\pi_t,\delta)=E^{m_t(X)}\left[E^{\pi_t(\theta\mid X)}\big[(\theta-\delta(X))^2\big]-V_t(X)\right]=r(\pi_t,\delta)-r(\pi_t,\delta^t),\quad t=s,p.\tag{4.2.10}$$

As noted in Section 2.1, a more convenient expression for $r^*(\pi_t,\delta)$ is
$$r^*(\pi_t,\delta)=E^{m_t(X)}\big[(\delta^t(X)-\delta(X))^2\big],\tag{4.2.11}$$
provided all relevant expectations exist.

What would be desirable properties of the rule $\delta(x)$ which we use to estimate $\theta$?

Since we do not know whether $\pi_p(\theta)$ or $\pi_s(\theta)$ is the benchmark prior against which our inference will be evaluated, we would want $\delta(x)$ to perform well in terms of regret measured against both $\pi_p(\theta)$ and $\pi_s(\theta)$:
$$r^*(\pi_t,\delta)\text{ is small for }t=p\text{ and }t=s.\tag{4.2.12}$$

In particular, we would want $\delta$ to incur less regret (evaluated against the benchmark) than the rule with respect to the other prior in $\Gamma$:
$$\frac{r^*(\pi_t,\delta)}{r^*(\pi_t,\delta^{t'})}<1\quad\text{for }(t,t')=(p,s)\text{ and for }(t,t')=(s,p).\tag{4.2.13}$$

Yet at the same time we would want $\delta(x)$ to be "competitive" with the optimal rule:
$$\frac{r(\pi_t,\delta)}{r(\pi_t,\delta^t)}\approx1\quad\text{for }t=p\text{ and for }t=s.\tag{4.2.14}$$

Of course, $r(\pi_t,\delta)/r(\pi_t,\delta^t)\ge1$, since $\delta^t$ is the optimal rule for $\pi_t$. We will consider hierarchical priors of the form
$$\pi_h(t\mid\varepsilon)=\begin{cases}\varepsilon,&t=p,\\ 1-\varepsilon,&t=s.\end{cases}\tag{4.2.15}$$

We are in fact dealing with mixture distributions, inasmuch as (4.2.15) can be expressed as the single-stage prior
$$\pi_H(\theta;p,s,\varepsilon)=\varepsilon\pi_p(\theta)+(1-\varepsilon)\pi_s(\theta),\tag{4.2.16}$$

which is simply a finite mixture, and hence we know from Proposition 4.1.1 that the Bayes rule with respect to the prior in (4.2.16) is
$$\delta^{H(s,p,\varepsilon)}(x)=\frac{\varepsilon m_p(x)\delta^p(x)+(1-\varepsilon)m_s(x)\delta^s(x)}{\varepsilon m_p(x)+(1-\varepsilon)m_s(x)}.\tag{4.2.17}$$

We can show that any prior of the form (4.2.16) will incur a smaller regret than the incorrect alternative; note that Proposition 4.2.1 applies to any two distributions $\pi_t$ and $\pi_{t'}$:

Proposition 4.2.1. If $r^*(\pi_t,\delta^{t'})<\infty$, then
$$\frac{r^*(\pi_t,\delta^{H(t,t',\varepsilon)})}{r^*(\pi_t,\delta^{t'})}<1\tag{4.2.18}$$
for all $\varepsilon\in[0,1)$.

Proof: From (4.2.11) we have that

r’(jtt,5H(l,l'e)) = Em,wp 5'(x )-S H(M'E)(X))2J

r , , em,(X)5[(X)+(l-e)m,(X)8’(X) = E1m ,(x ) em,(X)+(l-sK{X)

Bm,(x)51(x )+ (l-e)m t.(x)51(x)-em t(x)6,(x )-(l-e)m ,.(x )6 t'(x)> emt(x )+ (l-e)m t,(X)

(S'(X)-8‘(X))! em,(x)+(l-e)m,.(x)J

< E""('li[(5 '(x )-5 , (X))’] = r’(jct ,5l ), (4.2.19)

which completes the proof. □ We shall present numerical evidence indicating that a hierarchical prior of the form (4.2.15) with e = 0,5 will often satisfy (4.2.13) and (4.2.14). (Note that we do not claim

that this is an optimal choice of $\varepsilon$; it is merely the simplest, most convenient choice.) To study whether (4.2.13) and (4.2.14) obtain, we must evaluate
$$g(t,t')=\frac{r^*(\pi_t,\delta^{H(t,t',\frac12)})}{r^*(\pi_t,\delta^{t'})}\quad\text{for }(t,t')=(p,s)\text{ and }(t,t')=(s,p),\tag{4.2.20}$$
and
$$h(t,t')=\frac{r(\pi_t,\delta^{H(t,t',\frac12)})}{r(\pi_t,\delta^t)}\quad\text{for }(t,t')=(p,s)\text{ and }(t,t')=(s,p).\tag{4.2.21}$$

Ideally, $g$ and $h$ could be found analytically, but in practice the expectations defy analytic techniques. For instance, consider the case in which $\Gamma=\{\pi_D,\pi_N\}$. It follows from

(4.2.11) that we will have to evaluate
$$r^*\!\left(\pi_D,\delta^{H(D,N,\frac12)}\right)=\int m_D(u)\left(\frac{m_N(u)}{m_D(u)+m_N(u)}\right)^{\!2}\big(\delta^N(u)-\delta^D(u)\big)^2\,du.\tag{4.2.22}$$

As shown in Appendix C, $\delta^D$ is a ratio of two sums of quantities involving Mills' ratio! Analytic evaluation of (4.2.22) seems intractable, and (4.2.22) appears to represent a much simpler case than
$$\int m_D(u)\left(\frac{m_C(u)}{m_D(u)+m_C(u)}\right)^{\!2}\big(\delta^C(u)-\delta^D(u)\big)^2\,du\tag{4.2.23}$$
for the case $\Gamma=\{\pi_C,\pi_D\}$. Hence we are motivated to employ simulation techniques.²

We have available (via IMSL, 1987) the ability to generate pseudorandom variates from the Cauchy and normal distributions; code to generate pseudorandom variates from the double exponential distribution can be written using IMSL code for the generation of pseudorandom exponential variates. For a given $\{\pi_s,\pi_p\}\subset\{\pi_C,\pi_D,\pi_N\}$, we can generate a large number $N$ of values from $\pi_s(\theta)$: call these values $\theta_1,\theta_2,\ldots,\theta_N$. Then, for each of these $\theta_j$, generate a value from $f(x\mid\theta=\theta_j)$; call the resulting values $x_1,x_2,\ldots,x_N$. Clearly, $x_j\sim m_s$, $1\le j\le N$. Then we compute
$$R_s^{H(s,p,\frac12)}=\frac1N\sum_{j=1}^N\big(\delta^s(x_j)-\delta^{H(s,p,\frac12)}(x_j)\big)^2;\tag{4.2.24}$$
$$R_p=\frac1N\sum_{j=1}^N\big(\delta^s(x_j)-\delta^p(x_j)\big)^2;\tag{4.2.25}$$
and
$$L_s=\frac1N\sum_{j=1}^NV_s(x_j),\tag{4.2.26}$$
referring to Appendix C as needed (for the posterior means and variances, and for the marginal densities).

² An alternative to simulation, at least when the benchmark is double exponential, might be Hermite quadrature (Stroud and Secrest, 1966). However, simulation appears far more straightforward and, in any case, there does not appear to be a quadrature method which could approximate the integrals needed when the benchmark is Cauchy.

Since

$$E\big[R_s^{H(s,p,\frac12)}\big]=r^*\big(\pi_s,\delta^{H(s,p,\frac12)}\big),\tag{4.2.27}$$
$$E[R_p]=r^*(\pi_s,\delta^p),\tag{4.2.28}$$
and
$$E[L_s]=r(\pi_s,\delta^s),\tag{4.2.29}$$
one may estimate $r^*(\pi_s,\delta^{H(s,p,\frac12)})$ by $R_s^{H(s,p,\frac12)}$, $r^*(\pi_s,\delta^p)$ by $R_p$, and $r(\pi_s,\delta^s)$ by $L_s$. Then, from (4.2.20) and (4.2.21), one may estimate $g(\cdot)$ and $h(\cdot)$ by $G_p$ and $H_p$, where
$$G_p=\frac{R_s^{H(s,p,\frac12)}}{R_p}\tag{4.2.30}$$
and
$$H_p=\frac{L_s+R_s^{H(s,p,\frac12)}}{L_s}.\tag{4.2.31}$$

The program to actually implement this strategy is given in Appendix D; we note that the steps were executed $M=10$ times with $N=25{,}000$ for each of the six³ possible combinations. The program calculates the relevant quantities for each of the $M$ simulations:
$$R_s^{H(s,p,\frac12)}(m),\quad m=1,\ldots,M;\tag{4.2.32}$$
$$R_p(m),\quad m=1,\ldots,M;\tag{4.2.33}$$
$$L_s(m),\quad m=1,\ldots,M;\tag{4.2.34}$$
$$G_p(m),\quad m=1,\ldots,M;\tag{4.2.35}$$
$$H_p(m),\quad m=1,\ldots,M.\tag{4.2.36}$$

³ That is, $(s,p)=(C,D)$; $(s,p)=(C,N)$; $(s,p)=(D,C)$; $(s,p)=(D,N)$; $(s,p)=(N,C)$; and $(s,p)=(N,D)$.

The program then computes the mean and standard deviation of the functions $G_p(\cdot)$ and $H_p(\cdot)$:
$$\bar G(s,p)=\frac1M\sum_{m=1}^MG_p(m);\tag{4.2.37}$$
$$\sigma(g,s,p)=\sqrt{\frac{1}{M-1}\sum_{m=1}^M\big(G_p(m)-\bar G(s,p)\big)^2};\tag{4.2.38}$$
$$\bar H(s,p)=\frac1M\sum_{m=1}^MH_p(m);\tag{4.2.39}$$
$$\sigma(h,s,p)=\sqrt{\frac{1}{M-1}\sum_{m=1}^M\big(H_p(m)-\bar H(s,p)\big)^2}.\tag{4.2.40}$$

(Had $\sigma$ been large in some case, $M$ and/or $N$ could have been increased; as it happened, $\sigma$ was sufficiently small to make us comfortable with our estimates.) In Table 2, for each $u\backslash(s,p)$, the first row gives $\bar g(s,p)$ and the second row gives $\sigma(g,s,p)$. The third column of Table 2 could have been predicted without doing any simulation, since "[t]he simple linear estimator has an infinite expected risk if the loss is squared error and the true prior has infinite variance" (Rubin, 1977).⁴ The numbers appear promising: even excluding the extreme case of the third column, the savings in regret range from 60% to 98%. Hence, the hierarchical prior is much better than the "incorrect" alternative. What is equally heartening is that the hierarchical prior is often not much worse than the benchmark, as seen in Table 3. In Table 3, for each $u\backslash(s,p)$, the first row gives $\bar h(s,p)$ and the second row gives $\sigma(h,s,p)$. For $u\le0.50$, when the non-benchmark is Cauchy, the

⁴ Rubin's paper is somewhat related to this chapter. Rubin examined "the behavior of Bayes estimates when the true prior and assumed prior are different members of this set [normal, double exponential, logistic, and Cauchy] of distributions." His parameterization differs from ours, but he concluded that if the true distribution had "not extreme tails" then the choice of a wrong [e.g., Cauchy] prior did not incur overly severe consequences.

Table 2: Ratio of Regret for the Hierarchical Rule vs. Incorrect Benchmark Rule

u\(s,p)    (C,D)     (C,N)     (D,C)     (D,N)     (N,C)     (N,D)
0.25      0.0183    0.0000    0.3845    0.2344    0.3883    0.2662
         (0.000)   (0.000)   (0.003)   (0.001)   (0.004)   (0.001)
0.50      0.0423    0.0000    0.3312    0.1523    0.3529    0.3451
         (0.001)   (0.000)   (0.004)   (0.007)   (0.004)   (0.008)
0.75      0.0662    0.0000    0.2847    0.1000    0.3295    0.3940
         (0.001)   (0.000)   (0.002)   (0.002)   (0.004)   (0.009)
1.00      0.0863    0.0000    0.2516    0.0802    0.3155    0.4073
         (0.001)   (0.000)   (0.002)   (0.003)   (0.003)   (0.007)
1.25      0.1022    0.0000    0.2264    0.0750    0.3047    0.3965
         (0.001)   (0.000)   (0.002)   (0.003)   (0.003)   (0.005)
1.50      0.1153    0.0000    0.2075    0.0785    0.2952    0.3763
         (0.001)   (0.000)   (0.001)   (0.002)   (0.005)   (0.007)
1.75      0.1259    0.0000    0.1950    0.0815    0.2929    0.3623
         (0.001)   (0.000)   (0.002)   (0.002)   (0.003)   (0.007)
2.00      0.1341    0.0000    0.1838    0.0873    0.2912    0.3503
         (0.001)   (0.000)   (0.001)   (0.002)   (0.002)   (0.005)

Table 3: Ratio of Risk for the Hierarchical Rule vs. the Benchmark Rule

u\(s,p)    (C,D)     (C,N)     (D,C)     (D,N)     (N,C)     (N,D)
0.25      1.0699    1.0753    1.9659    1.0025    2.5655    1.0039
         (0.000)   (0.000)   (0.012)   (0.000)   (0.027)   (0.000)
0.50      1.0719    1.0842    1.3388    1.0067    1.5855    1.0134
         (0.001)   (0.001)   (0.006)   (0.000)   (0.010)   (0.001)
0.75      1.0620    1.0758    1.1625    1.0080    1.2970    1.0174
         (0.001)   (0.001)   (0.002)   (0.000)   (0.007)   (0.001)
1.00      1.0509    1.0626    1.0949    1.0080    1.1757    1.0169
         (0.000)   (0.000)   (0.001)   (0.000)   (0.002)   (0.000)
1.25      1.0415    1.0504    1.0612    1.0077    1.1144    1.0151
         (0.000)   (0.000)   (0.001)   (0.000)   (0.002)   (0.001)
1.50      1.0341    1.0403    1.0427    1.0074    1.0787    1.0130
         (0.000)   (0.000)   (0.000)   (0.000)   (0.002)   (0.001)
1.75      1.0283    1.0326    1.0317    1.0071    1.0588    1.0117
         (0.000)   (0.000)   (0.000)   (0.000)   (0.001)   (0.000)
2.00      1.0237    1.0266    1.0243    1.0067    1.0454    1.0105
         (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)

mixture does not perform well; elsewhere, the additional consequence entailed by using the mixture rather than the benchmark is usually under ten percent, and sometimes (in the double exponential-normal setting) under one percent! Why do mixture priors lead to such satisfactory results? The following example illustrates what is happening.

Example 4.2.1: Recall that the Bayes rule for the mixture prior (with $\varepsilon=\tfrac12$) is

$$\delta^{H(s,p)}(x)=\frac{m_s(x)}{m_s(x)+m_p(x)}\,\delta^s(x)+\frac{m_p(x)}{m_s(x)+m_p(x)}\,\delta^p(x).\tag{4.2.41}$$

The observed data determine which of the two rules $\delta^s(x)$ and $\delta^p(x)$ will exercise the greater influence on $\delta^{H(s,p)}(x)$. For instance, let us suppose that our benchmark prior is Cauchy and that the alternative is normal (i.e., suppose that $(s,p)=(C,N)$). Further, suppose that $u=1.00$ and that the value $x=5.0$ is observed. Then a simple calculation yields
$$\delta^C(5.0)=4.6884;\qquad\delta^N(5.0)=2.5000;\tag{4.2.42}$$
$$\frac{m_C(5.0)}{m_C(5.0)+m_N(5.0)}\approx0.9815;\qquad\frac{m_N(5.0)}{m_C(5.0)+m_N(5.0)}\approx0.0185.\tag{4.2.43}$$

Alternately, $m_C(5.0)/m_N(5.0)\approx53.05$, so that the benchmark rule $\delta^C$ exerts more influence on the hierarchical rule than does the alternative rule $\delta^N$, by a factor of more than fifty. (The above discussion is not offered as a rigorous proof, but rather as an appeal to the reader's intuition.) This phenomenon was observed as a special case by Berger (1985), who noted that the "posterior mean [corresponding to the mixture prior] will be strongly weighted toward $\delta^C(x)$ when $x$ is moderate or large." Tables 2 and 3 suggest that it applies in double exponential settings as well.

For $\Gamma=\{\pi_C,\pi_D,\pi_N\}$, a simple hierarchical prior will assign probability $\tfrac13$ to each member of $\Gamma$, i.e.:

$$\pi_H(\theta)=\frac{\pi_C(\theta)+\pi_D(\theta)+\pi_N(\theta)}{3},\tag{4.2.44}$$
and by Proposition 4.1.1 the Bayes rule corresponding to (4.2.44) is
$$\delta^H(x)=\frac{m_C(x)\delta^C(x)+m_D(x)\delta^D(x)+m_N(x)\delta^N(x)}{m_C(x)+m_D(x)+m_N(x)}.\tag{4.2.45}$$

The following quantities measure the relative regrets (risks) of the decision rule corresponding to the hierarchical prior versus the decision rule corresponding to the incorrect prior:
$$g_3(t,t')=\frac{r^*(\pi_t,\delta^H)}{r^*(\pi_t,\delta^{t'})}\quad\text{for }\pi_t,\pi_{t'}\in\Gamma,\ t\ne t';\tag{4.2.46}$$
$$h_3(t)=\frac{r(\pi_t,\delta^H)}{r(\pi_t,\delta^t)}\quad\text{for }\pi_t\in\Gamma.\tag{4.2.47}$$

The interpretation of $h_3(t)$ is straightforward: the smaller it is, the more competitive the hierarchical rule is with the benchmark rule. The interpretation of $g_3(t,t')$ requires more explanation: $g_3(t,t')$ compares the regret incurred by $\delta^H$ with the regret incurred by $\delta^{t'}$ when the benchmark is $\pi_t$, with $t\ne t'$. We again use simulation to approximate $g_3$ and $h_3$; the details are so similar to those of the earlier case that we will not describe them. A complete listing of computer code is provided in Appendix D; as before we used $M=10$ runs of $N=25{,}000$ simulations. The sample standard deviations were sufficiently low that we feel comfortable with the resulting estimates, given in Tables 4 and 5. In Table 4, for each $u\backslash(s,p)$, the first row gives $\bar g_3(s,p)$ and the second row gives $\sigma(g_3,s,p)$; in Table 5, for each $u\backslash(s)$, the first row gives $\bar h_3(s)$ and the second row gives $\sigma(h_3,s)$.

Table 4: Ratio of Regret for the Hierarchical Rule vs. Incorrect Benchmark Rule for Γ Containing Three Priors

u\(s,p)    (C,D)     (C,N)     (D,C)     (D,N)      (N,C)     (N,D)
0.25      0.0379    0.0000    0.2236    53.3211    0.2442    66.5467
         (0.001)   (0.000)   (0.003)   (0.565)    (0.004)   (0.731)
0.50      0.0873    0.0000    0.1743    4.0525     0.2319    9.9202
         (0.001)   (0.000)   (0.005)   (0.154)    (0.004)   (0.114)
0.75      0.1340    0.0000    0.1358    0.9712     0.2235    4.5725
         (0.002)   (0.000)   (0.001)   (0.013)    (0.004)   (0.045)
1.00      0.1694    0.0000    0.1149    0.4343     0.2200    2.9465
         (0.001)   (0.000)   (0.002)   (0.009)    (0.003)   (0.012)
1.25      0.1941    0.0000    0.1050    0.2780     0.2176    2.1492
         (0.002)   (0.000)   (0.002)   (0.009)    (0.003)   (0.004)
1.50      0.2109    0.0000    0.1013    0.2216     0.2156    1.6677
         (0.002)   (0.000)   (0.001)   (0.004)    (0.005)   (0.007)
1.75      0.2228    0.0000    0.1034    0.1930     0.2203    1.3672
         (0.002)   (0.000)   (0.001)   (0.004)    (0.003)   (0.007)
2.00      0.2301    0.0000    0.1059    0.1828     0.2249    1.1689
         (0.002)   (0.000)   (0.001)   (0.003)    (0.002)   (0.006)

Looking at Table 4, we note that the Cauchy rule pulls the hierarchical rule away from the benchmark rule when the benchmark is normal (double exponential); consequently, in columns 5 and 7, we see that, for smaller $u$, the hierarchical rule often incurs more regret than the double exponential (normal) alternative (although the hierarchical rule is far superior to the Cauchy rule). In other words, the hierarchical rule "loses" to the non-Cauchy alternative. (Note, therefore, that Proposition 4.2.1 does not extend to the case in which $\Gamma$ contains three elements.) Regarding Table 5, we note that for $u\ge1$ (that is, when the normal prior is more diffuse than the normal likelihood), the extra consequence incurred by use of the hierarchical rule as opposed to the benchmark rule is under thirteen percent, and decreases steadily for all three benchmarks. For small $u$, the additional consequence is heavy: quite heavy for the normal benchmark, much less so for the Cauchy, with the double exponential occupying an intermediate position. Again, what we are observing here is explained partly by the fact that the normal and double exponential priors are more similar to each other than either is to the Cauchy: for instance, for $u\ge1.5$, the additional consequence entailed by the hierarchical prior for the Cauchy benchmark is the most severe consequence of the three benchmarks. This can be explained by considering the tail behavior of the three priors. As seen in Figure 3 ($u=1.5$), the tail of $\pi_C$ is much heavier than the tail of either $\pi_N$ or $\pi_D$. Still, the data pull the hierarchical rule nearer to the benchmark rule, so that this comparatively severe penalty is under 6.5%!

Table 5: Ratio of Risk of the Hierarchical Rule vs. Risk for the Benchmark Rule for Γ Containing Three Priors

u\(s)       C         D         N
0.25      1.1453    1.5619    1.9846
         (0.001)   (0.011)   (0.024)
0.50      1.1484    1.1783    1.3848
         (0.001)   (0.005)   (0.009)
0.75      1.1255    1.0775    1.2015
         (0.001)   (0.001)   (0.006)
1.00      1.1000    1.0434    1.1225
         (0.000)   (0.001)   (0.002)
1.25      1.0787    1.0284    1.0817
         (0.001)   (0.001)   (0.002)
1.50      1.0623    1.0208    1.0575
         (0.000)   (0.000)   (0.002)
1.75      1.0501    1.0168    1.0442
         (0.000)   (0.000)   (0.001)
2.00      1.0408    1.0140    1.0351
         (0.000)   (0.000)   (0.000)

Taken together, these simulations constitute evidence of the value of hierarchical priors as tools for conducting more robust Bayesian inference. In general, the hierarchical rule is much better than the "wrong" rule and not a great deal worse than the benchmark rule, especially for larger values of $u$.
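The simulation scheme of (4.2.24)-(4.2.31) is easy to prototype outside IMSL. The sketch below (ours; a small-scale stand-in for the Appendix D program, with all names hypothetical) treats the pair $(s,p)=(N,D)$ at $u=1$, replacing the Mills'-ratio formulas of Appendix C with plain numerical quadrature for the marginals and posterior means:

```python
import math, random

def quad_posterior(x, prior_pdf, lo=-30.0, hi=30.0, n=600):
    """Trapezoidal quadrature for the marginal m(x) and posterior mean
    delta(x) under the likelihood X | theta ~ N(theta, 1).  (Appendix C
    gives closed forms via Mills' ratio; quadrature keeps this sketch
    self-contained.)"""
    h = (hi - lo) / n
    m = num = 0.0
    for k in range(n + 1):
        th = lo + k * h
        w = h * (0.5 if k in (0, n) else 1.0)
        f = math.exp(-(x - th) ** 2 / 2.0) / math.sqrt(2.0 * math.pi) * prior_pdf(th)
        m += w * f
        num += w * th * f
    return m, num / m

u = 1.0
bN, bD = u, 0.8711754 * u                  # scales satisfying P3; see (4.2.8)
piN = lambda th: math.exp(-th * th / (2.0 * bN * bN)) / (math.sqrt(2.0 * math.pi) * bN)
piD = lambda th: math.exp(-abs(th) / bD) / (2.0 * bD)

random.seed(1)
N = 500
reg_H = reg_D = 0.0
for _ in range(N):
    theta = random.gauss(0.0, bN)          # theta_j ~ pi_s = pi_N
    x = random.gauss(theta, 1.0)           # x_j ~ N(theta_j, 1), so x_j ~ m_N
    mN, dN = quad_posterior(x, piN)
    mD, dD = quad_posterior(x, piD)
    dH = (mN * dN + mD * dD) / (mN + mD)   # (4.2.17) with eps = 1/2
    reg_H += (dN - dH) ** 2 / N            # as in (4.2.24)
    reg_D += (dN - dD) ** 2 / N            # as in (4.2.25)
print(reg_H / reg_D)   # estimate of g(N, D); should land near the Table 2 entry 0.4073
```

With this small run ($N=500$ rather than $25{,}000$) the ratio carries visible Monte Carlo noise, but it reproduces the order of magnitude of the corresponding Table 2 entry.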

Figure 3: Behavior of the Tails of the Three Priors

CHAPTER V

HIERARCHICAL PRIORS AND Γ-MINIMAXITY

5.1: Γ-Minimaxity: General Remarks

Heretofore we have assumed that there exists a particular element of Γ, namely the benchmark prior, against which the statistical inference will be evaluated. In this chapter we drop that assumption and indicate the usefulness of hierarchical priors in achieving a type of robustness over all elements of Γ.

Before the Γ-minimax terminology originated (apparently in Blum and Rosenblatt, 1967), Robbins (1964) proposed the Γ-minimax principle: "When nothing is known about [Γ] it [i.e., the Γ-minimax principle] is therefore not unreasonable." Menges (1966) has proposed the use of Γ-minimax rules¹ as a sort of "compromise" between the Bayesian paradigm, a desirable framework which may be difficult to implement "because of lack of knowledge of the a priori distribution", and the minimax (or "pure minimax", in Menges' parlance) paradigm. Menges views the Γ-minimax principle as combining Bayesian elements (the use of Bayes risks $r(\pi,\delta)$) and minimax elements (choosing a $\delta'$ such that $\sup_{\pi\in\Gamma}r(\pi,\delta')=\inf_{\delta\in D^*}\sup_{\pi\in\Gamma}r(\pi,\delta)$). Furthermore, Menges did not intend for the supremum to be taken over the set of all distributions, but rather over the "space of uncertainty", a subset of the set of all possible prior distributions.

¹ What is now called a "Γ-minimax rule" Menges calls an "extended Bayes solution", and what is now called "minimaxity" Menges calls "pure minimaxity."

The smaller this subset is, the "more dominant" the "Bayes character" of the solution will be. Thus, Menges was careful to distinguish between Γ-minimaxity and minimaxity, with the former being more Bayesian. Later, Berger (1985) wrote:

    When a Γ-minimax or Γ-minimax regret analysis can be carried out, the answer is usually quite attractive. The Γ-minimax rule is usually a Bayes rule [emphasis added] that is compatible with the specified prior information, and it is usually robust (in some sense) over Γ.

In the next section we establish the connection between Γ-minimax rules and Bayes rules corresponding to hierarchical priors.

5.2: Γ-Minimax Rules as Hierarchical Bayes Rules

We begin by reviewing the general framework for what we shall call the "original" decision problem (the problem for which we are seeking a Γ-minimax decision rule). The state of nature $\theta$ is an element of the space $\Theta$; there is a class Γ of prior distributions $\pi$ on $\Theta$, which reflect our prior beliefs. (We assume that each element of Γ is dominated by a $\sigma$-finite measure $\mu$.) The data space is $\mathcal{X}$ (on which is defined a $\sigma$-finite measure $\nu$); the distribution of the data given a particular state of nature is $f(x\mid\theta)$. The action space and loss function are, respectively, $A$ and $L:\Theta\times A\to\mathbb{R}$, and $D^*$ is the set of randomized decision rules. Finally, the frequentist risk, the posterior expected loss, the Bayes risk, and the Bayes regret, respectively, are given by
$$R(\theta,\delta)=E^{f(x\mid\theta)}\big[L(\theta,\delta(X))\big];\tag{5.2.1}$$
$$\rho(\pi,x,a)=E^{\pi(\theta\mid x)}\big[L(\theta,a)\big];\tag{5.2.2}$$
$$r(\pi,\delta)=E^{\pi(\theta)}\big[R(\theta,\delta)\big];\tag{5.2.3}$$
$$r^*(\pi,\delta)=r(\pi,\delta)-\inf_{d\in D^*}r(\pi,d).\tag{5.2.4}$$

We now apply a technique used by Berliner (1985) in the context of "Γ-admissibility".

Definition 5.2.1: A decision rule $\delta_1\in D^*$ Γ-dominates a decision rule $\delta_2\in D^*$ if
$$r(\pi,\delta_1)\le r(\pi,\delta_2)\quad\text{for all }\pi\in\Gamma\tag{5.2.5}$$
and
$$r(\pi',\delta_1)<r(\pi',\delta_2)\quad\text{for some }\pi'\in\Gamma.\tag{5.2.6}$$

Definition 5.2.2: A decision rule $\delta\in D^*$ is Γ-inadmissible if there exists some rule $\delta'\in D^*$ which Γ-dominates $\delta$. A decision rule $\delta\in D^*$ is Γ-admissible if there exists no rule $\delta'\in D^*$ which Γ-dominates $\delta$.

Berliner's technique is to define a decision problem which we will call the "derived" decision problem (since it utilizes some notation from the original decision problem) and which Berliner calls the "Γ-problem". In the derived problem, the "parameter space" is Γ, and thus the loss function needs to be defined accordingly. Berliner remarks that the definition of the new decision problem suggests an "immediate relationship to Γ-minimax or Γ-minimax regret analysis"; it is this relationship, and the introduction of a prior (or hyperprior) on the new "parameter space" Γ, that we explore in the remainder of this section.

As mentioned above, the state of nature in the derived problem is π, π ∈ Γ. Let

H = {η : η is a probability measure on Γ}.  (5.2.7)

Note that the measures in H are not necessarily dominated by a σ-finite measure, and hence need not have probability densities. Furthermore, if Γ cannot be parameterized, then one must take extreme care in defining probability measures η on Γ. The data space X is as before, but the distribution of the data given a particular state of nature is

f_D(x|π) = ∫_Θ f(x|θ)π(θ) dμ(θ).  (5.2.8)

The action space A and set of decision rules D* are as before, but the derived loss function² L_D: Γ × A × X → ℝ is given by

² The reader will note that the derived loss function L_D depends upon the observed value x. While this is not a commonly-seen practice, the general decision-theoretic framework allows the loss to depend on the observation x (see Rubin, 1968). In practice, loss functions have been assumed to depend on the observed

L_D(π, a, x) = E^{π(θ|x)}[L(θ, a)].  (5.2.9)

Then the derived (frequentist) risk and derived expected posterior loss are

R_D(π, δ) = E^{f_D(x|π)}[L_D(π, δ(x), x)]  (5.2.10)

and

ρ_D(η, x, a) = E^{η^x(π)}[L_D(π, a, x)].  (5.2.11)

In (5.2.11), η^x is the conditional probability measure on Γ given the prior measure η and the observation x. Combining (5.2.9) and (5.2.11) yields

ρ_D(η, x, a) = ∫_Γ E^{π(θ|x)}[L(θ, a)] f_D(x|π) dη(π) / ∫_Γ f_D(x|π) dη(π).  (5.2.12)

Finally, the derived Bayes risk is

r_D(η, δ) = E^{η(π)}[R_D(π, δ)] = ∫_Γ R_D(π, δ) dη(π).  (5.2.13)

We note that since L_D is bounded below, an analog to (1.2.6) exists for the derived problem. Specifically, if we define

f_D^η(x) = ∫_Γ f_D(x|π) dη(π),  (5.2.14)

then

r_D(η, δ) = E^{f_D^η(x)}[ρ_D(η, X, δ(X))].  (5.2.15)

We now show that the derived decision problem possesses a crucial property.

[Footnote 2, continued:] values: for instance, Wald (1950) allowed the cost of collecting an observation, which was part of the loss function, to depend on the observed value. We point out that L_D is derived from quantities which are already defined in the original problem: in particular, we are not asking that a loss function for each value of θ, a, and x be specified "from scratch." The derived loss function is used solely to compute the derived risk, and to find the Bayes action in the derived problem: in the former case, x is already involved (see Strasser, 1985) and in the latter, x is assumed known (for the expected posterior loss conditional on x).

Lemma 5.2.1: Any decision rule δ ∈ D* which is minimax for the derived problem is a Γ-minimax decision rule for the original decision problem.

Proof: For any decision rule δ ∈ D*, we have

R_D(π, δ) = E^{f_D(x|π)}[L_D(π, δ(x), x)]

= E^{f_D(x|π)}[∫_Θ L(θ, δ(x)) π(θ|x) dμ(θ)]

= ∫_X [∫_Θ L(θ, δ(x)) π(θ|x) dμ(θ)] f_D(x|π) dν(x)

= ∫_X ∫_Θ L(θ, δ(x)) [π(θ)f(x|θ)/f_D(x|π)] f_D(x|π) dμ(θ) dν(x).  (5.2.16)

Now, since L is bounded below, one can reverse the order of integration in (5.2.16), yielding

R_D(π, δ) = ∫_Θ ∫_X L(θ, δ(x)) π(θ) f(x|θ) dν(x) dμ(θ) = r(π, δ).  (5.2.17)

In other words, the Bayes risk of a decision rule δ in the original problem is the frequentist risk of δ in the derived problem. By definition, a minimax rule δ* in the derived problem satisfies

sup_{π∈Γ} R_D(π, δ*) = inf_{δ∈D*} sup_{π∈Γ} R_D(π, δ).  (5.2.18)

But using (5.2.17), it follows from (5.2.18) that

sup_{π∈Γ} r(π, δ*) = inf_{δ∈D*} sup_{π∈Γ} r(π, δ),  (5.2.19)

hence δ* is Γ-minimax in the original problem. □

A second lemma is necessary before we state the main results of this section.

Lemma 5.2.2: Let η be some probability measure on Γ. Then an action a ∈ A is a Bayes action in the derived problem iff it is a Bayes action in the original problem wrt the prior density

π^η(θ) = ∫_Γ π(θ) dη(π).  (5.2.20)

Proof: First we will verify that (5.2.20) is indeed a density wrt the σ-finite measure μ. Note that for any measurable subset B ⊂ Θ, we have

P[θ ∈ B | π] = ∫_B π(θ) dμ(θ).  (5.2.21)

But then

P[θ ∈ B | π^η] = ∫_Γ [∫_B π(θ) dμ(θ)] dη(π),  (5.2.22)

and by Fubini's Theorem the order of integration in (5.2.22) can be reversed, yielding

P[θ ∈ B | π^η] = ∫_B [∫_Γ π(θ) dη(π)] dμ(θ).  (5.2.23)

It follows from (5.2.23) that π^η is a density wrt μ. Now, an action a′ ∈ A is a Bayes action in the original decision problem wrt (5.2.20) iff

E^{π^η(θ|x)}[L(θ, a′)] = inf_{a∈A} E^{π^η(θ|x)}[L(θ, a)].  (5.2.24)

Thus a Bayes action for the original decision problem must minimize

∫_Θ L(θ, a) π^η(θ) f(x|θ) dμ(θ) / ∫_Θ π^η(θ) f(x|θ) dμ(θ).  (5.2.25)

But the denominator of (5.2.25) is constant wrt a. Hence, in fact, a Bayes action for the original decision problem must minimize

∫_Θ L(θ, a) π^η(θ) f(x|θ) dμ(θ) = ∫_Θ L(θ, a) [∫_Γ π(θ) dη(π)] f(x|θ) dμ(θ).  (5.2.26)

An action a″ ∈ A is a Bayes action wrt the measure η for the derived decision problem iff

ρ_D(η, x, a″) = inf_{a∈A} ρ_D(η, x, a).  (5.2.27)

However, it follows from (5.2.8) and (5.2.12) that

ρ_D(η, x, a) = ∫_Γ [∫_Θ L(θ, a) π(θ|x) dμ(θ)] f_D(x|π) dη(π) / ∫_Γ f_D(x|π) dη(π)

= ∫_Γ ∫_Θ L(θ, a) π(θ) f(x|θ) dμ(θ) dη(π) / ∫_Γ f_D(x|π) dη(π).  (5.2.28)

Again, only the numerator of (5.2.28) involves a; hence a Bayes action for the derived problem must be chosen to minimize

∫_Γ ∫_Θ L(θ, a) π(θ) f(x|θ) dμ(θ) dη(π).  (5.2.29)

But since L is bounded below, it follows from Fubini's Theorem that (5.2.29) is equivalent to the right-hand side of (5.2.26), which completes the proof. □

What does Lemma 5.2.2 say? Merely that use of a hierarchical prior structure (with η as the hyperprior) in the original decision problem leads to the same Bayes rule as use of η as the (single-stage) prior in the derived decision problem. From now on, a Bayes rule "wrt η" will mean both a Bayes rule in the original problem with hierarchical prior (5.2.20) and a Bayes rule for the derived problem wrt the prior measure η on Γ.
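Lemma 5.2.2 can be checked numerically. The sketch below is our own illustration (the finite parameter space, the two priors, the squared-error loss, and the hyperprior weight w are all assumed, not taken from the text): the Bayes action under the mixture prior π^η is compared with the Bayes action obtained by minimizing the numerator of the derived posterior loss (5.2.28).

```python
# Numerical check (ours) of Lemma 5.2.2: with hyperprior eta = (w, 1-w) on
# Gamma = {pi0, pi1}, the Bayes action under the mixture prior pi_eta equals
# the Bayes action in the derived problem.

THETA = [0, 1, 2]
X = [0, 1]
A = [0, 1, 2]
f = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.5, 1: 0.5}, 2: {0: 0.2, 1: 0.8}}  # f(x|theta)
pi0 = {0: 0.5, 1: 0.3, 2: 0.2}
pi1 = {0: 0.1, 1: 0.2, 2: 0.7}
w = 0.35                                    # eta({pi0}) = w, eta({pi1}) = 1 - w
L = lambda theta, a: (theta - a) ** 2       # squared-error loss

def bayes_action_mixture(x):
    # Original problem with hierarchical prior pi_eta = w*pi0 + (1-w)*pi1
    pi_eta = {t: w * pi0[t] + (1 - w) * pi1[t] for t in THETA}
    post = {t: pi_eta[t] * f[t][x] for t in THETA}    # unnormalized posterior
    return min(A, key=lambda a: sum(post[t] * L(t, a) for t in THETA))

def bayes_action_derived(x):
    # Derived problem: minimize the numerator of rho_D(eta, x, a), eq. (5.2.28)
    def num(a):
        s0 = sum(pi0[t] * f[t][x] * L(t, a) for t in THETA)
        s1 = sum(pi1[t] * f[t][x] * L(t, a) for t in THETA)
        return w * s0 + (1 - w) * s1
    return min(A, key=num)

agree = all(bayes_action_mixture(x) == bayes_action_derived(x) for x in X)
```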

We now possess the technical tools to prove several results on Γ-minimaxity; each result corresponds directly to a well-known result pertaining to minimaxity. The proofs given in standard texts for the minimax results assume that one is working with prior probability densities rather than prior probability measures. However, we cannot assume that the prior measures on Γ are dominated by a σ-finite measure.

Theorem 5.2.1: Let η* be a probability measure on Γ with corresponding Bayes rule δ*. If

r_D(η*, δ*) = sup_{π∈Γ} R_D(π, δ*),  (5.2.30)

then δ* is Γ-minimax (for the original problem). Furthermore, if δ* is the unique Bayes rule corresponding to η*, it is the unique Γ-minimax rule (for the original problem).

Proof: We shall follow the argument used by Lehmann for the minimax counterpart to this result (see Theorem 4.2.1, p. 249, Lehmann, 1983). For any decision rule δ ∈ D*, we have

sup_{π∈Γ} R_D(π, δ) ≥ ∫_Γ R_D(π, δ) dη*(π) = r_D(η*, δ).  (5.2.31)

But since δ* is Bayes wrt η*, it follows from (5.2.15) that

r_D(η*, δ) ≥ r_D(η*, δ*).  (5.2.32)

Thus, it follows from (5.2.30)–(5.2.32) that

sup_{π∈Γ} R_D(π, δ) ≥ sup_{π∈Γ} R_D(π, δ*),  (5.2.33)

which, together with Lemma 5.2.1, establishes the Γ-minimaxity of δ*. If δ* is the unique Bayes rule, then the inequality in (5.2.32) is strict, establishing δ* as the unique Γ-minimax rule. □

In effect, η* is the "least favorable probability measure" on Γ, analogous to the least favorable distribution (Lehmann, 1983).

Definition 5.2.3: If (5.2.30) obtains for some probability measure η* on Γ, then η* is the least favorable [hyper]prior on Γ.

The following result extends the concept of "equalizer rules" (Berger, 1985) to the derived decision problem.

Theorem 5.2.2: Let η* be a probability measure on Γ. Let δ* denote the Bayes rule for the resulting hierarchical prior π^{η*}(θ), as given by (5.2.20). If for some constant K,

R_D(π, δ*) = K  ∀π ∈ Γ,  (5.2.34)

then δ* is a Γ-minimax rule (for the original problem).

Proof: We observe, in a manner analogous to that of Lehmann (Corollary 4.2.1, Lehmann, 1983), that if (5.2.34) obtains, then (5.2.30) obtains, and the result follows from Theorem 5.2.1. □

Another minimaxity result can be extended to the Γ-minimax setting.

Theorem 5.2.3: Let {η_j} be a sequence of probability measures on Γ to which corresponds a sequence of Bayes rules {δ_j} (i.e., δ_j is Bayes wrt η_j). Suppose that for some decision rule δ* it holds that

sup_{π∈Γ} R_D(π, δ*) ≤ lim sup_{j→∞} r_D(η_j, δ_j).  (5.2.35)

Then δ* is a Γ-minimax rule.

Proof: We apply a method used to prove the analogous minimaxity result (see Zacks, 1971, p. 290, Theorem 6.5.2). If δ* is not Γ-minimax, then there exists some rule δ′ such that

sup_{π∈Γ} R_D(π, δ′) < sup_{π∈Γ} R_D(π, δ*).  (5.2.36)

But since δ_j is Bayes wrt η_j, it follows that

r_D(η_j, δ_j) ≤ ∫_Γ R_D(π, δ′) dη_j(π) ≤ sup_{π∈Γ} R_D(π, δ′)  (5.2.37)

for j ≥ 1. Combining (5.2.35) and (5.2.37), we obtain

sup_{π∈Γ} R_D(π, δ*) ≤ sup_{π∈Γ} R_D(π, δ′),  (5.2.38)

contradicting (5.2.36) and yielding the result. □

Finally, a fourth result on Γ-minimaxity follows directly from a well-known theorem of decision theory. If Γ is finite, then all probability measures on Γ are dominated by counting measure. The following result may be obtained from the Minimax Theorem and Lemmas 5.2.1 and 5.2.2.

Theorem 5.2.4: Let Γ be finite: assume

Γ = {π₁, π₂, …, π_k}.  (5.2.39)

Define, for any decision rule δ, the function R_D: D* → ℝ^k by

R_D(δ) = (R_D(π₁, δ), R_D(π₂, δ), …, R_D(π_k, δ))′.  (5.2.40)

Furthermore, define the subset S ⊂ ℝ^k by

S = {R_D(δ): δ ∈ D*}.  (5.2.41)

If S is bounded below and closed below (i.e., if it contains each of its lower boundary points), then there exists a prior distribution η′ on Γ such that the Bayes rule δ^{η′} for the prior given by

π^{η′}(θ) = Σ_{j=1}^{k} η′(π_j) π_j(θ)  (5.2.42)

will be a Γ-minimax rule.

Proof: As noted above, this result is an immediate consequence of Lemmas 5.2.1, 5.2.2, and the Minimax Theorem (see Ferguson, 1967, pp. 85ff for a concise proof). One merely observes that δ^{η′} is also Bayes in the derived decision problem, and hence minimax in the derived decision problem, which is to say Γ-minimax in the original problem. □

The primary concern with the application of Theorem 5.2.4 is the question of closedness (from below) of S. Certain results on closedness are known; it has also been remarked in the context of the Minimax Theorem for finite parameter spaces that "only very rarely will S not be closed in statistical problems" (Berger, 1985).

Before turning to the issue of Γ-minimax regret, we will briefly comment on some of the other recent work done with regard to Γ-minimaxity, remarking on a tie-in with the results of this section. Under fairly general assumptions (see Wald, 1950, pp. 59ff), a minimax rule will be Bayes (with respect to the least favorable prior) or limiting Bayes.

Therefore, under analogous assumptions, a Γ-minimax rule will be Bayes with respect to the least favorable prior on Γ. Moreover, it is often the case that Γ is sufficiently rich that it contains all mixtures of elements of Γ. Under such circumstances, we have the following simple proposition.

Proposition 5.2.1: Assume the class Γ contains all mixtures of elements of Γ, and assume that there exists a least favorable [hyper]prior on Γ. Then a Γ-minimax rule exists which is Bayes with respect to some element of Γ.

Proof: Let η* be the least favorable hyperprior on Γ. Then, by Theorem 5.2.1, the Bayes rule corresponding to the prior

π^{η*}(θ) = ∫_Γ π(θ) dη*(π)  (5.2.43)

is Γ-minimax. But π^{η*}(θ) ∈ Γ, since Γ contains all mixtures of elements of Γ. □

Proposition 5.2.1 specifies one setting in which the Γ-minimax rule is Bayes with respect to an element of Γ. Thus, it tells us when it makes sense to seek a Γ-minimax rule by seeking a least favorable prior in the class Γ. In several settings (e.g., Berger, 1979; Chen and Eichenauer, 1988; Eichenauer-Herrmann, 1990) the classes Γ are sufficiently rich that Proposition 5.2.1 applies (for instance, symmetric unimodal priors are considered by Eichenauer-Herrmann).

Example 5.2.1: Eichenauer, Lehn, and Rettig (1988, hereafter E-L-R) assume a random sample X₁, X₂, …, X_n with likelihood

f(x|θ) = (1/Γ(α)) θ^α x^{α−1} exp(−θx).  (5.2.44)

The estimand is

β(θ) = E^{f(x|θ)}[X] = α/θ,  (5.2.45)

and the loss function is squared error. Their class of priors is

Γ = {π: E^π[β(θ)] ≤ μ, E^π[β²(θ)] ≤ τ},  (5.2.46)

with

τ > μ², μ > 0.  (5.2.47)

Clearly, Γ contains all mixtures of elements of Γ. Using saddle-point theory, E-L-R show that the Γ-minimax rule is given by

δ*(x₁, …, x_n) = (μτ + α(τ − μ²) Σᵢ xᵢ) / (τ + nα(τ − μ²)),  (5.2.48)

which is Bayes against the prior

π*(θ) = Gamma(γ*, c*),  (5.2.49)

where

γ* = 1 + τ/(τ − μ²)  and  c* = μτ/(α(τ − μ²)).  (5.2.50)
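The quantities in (5.2.48)–(5.2.50) can be checked for internal consistency. The sketch below is our own (the parameter values and the sample are hypothetical); it verifies that π* has the stated moments and that the posterior mean of β(θ) under the Gamma(γ*, c*) prior (shape γ*, rate c*) reproduces the closed form (5.2.48).

```python
# Consistency check (ours) for Example 5.2.1: under theta ~ Gamma(gstar, cstar)
# (shape, rate), the posterior given x_1..x_n is Gamma(gstar + n*alpha,
# cstar + sum(x)), and the posterior mean of beta(theta) = alpha/theta
# reproduces the closed-form rule (5.2.48).

alpha, mu, tau = 2.0, 1.5, 3.0          # arbitrary values with tau > mu^2 > 0
gstar = 1.0 + tau / (tau - mu ** 2)     # gamma* in (5.2.50)
cstar = mu * tau / (alpha * (tau - mu ** 2))

# Prior moments of beta = alpha/theta, using E[theta^-1] = c/(g-1) and
# E[theta^-2] = c^2/((g-1)(g-2)) for a Gamma(shape=g, rate=c) variable
Ebeta = alpha * cstar / (gstar - 1.0)
Ebeta2 = alpha ** 2 * cstar ** 2 / ((gstar - 1.0) * (gstar - 2.0))

xs = [0.8, 2.1, 1.4]                    # hypothetical sample, n = 3
n, S = len(xs), sum(xs)
post_mean = alpha * (cstar + S) / (gstar + n * alpha - 1.0)   # E[alpha/theta | x]
closed_form = (mu * tau + alpha * (tau - mu ** 2) * S) / (tau + n * alpha * (tau - mu ** 2))
```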

Theorem 5.2.1 can also be used to prove this result. One can see that π* ∈ Γ, since E^{π*(θ)}[β(θ)] = μ and E^{π*(θ)}[β²(θ)] = τ; furthermore, E-L-R show that

r(π*, δ*) = sup_{π∈Γ} r(π, δ*).  (5.2.51)

Consequently, by Theorem 5.2.1, δ* is the unique minimax rule for the derived decision problem, and hence the unique Γ-minimax rule for the original problem.

Example 5.2.2: Eichenauer-Herrmann (1990) studies the decision problem with squared-error loss, parameter space Θ = [−m, m] for some positive m, and

Γ = {π: π(θ) = π(−θ), π nonincreasing on [0, m]}.  (5.2.52)

(The likelihood is assumed only to satisfy certain technical conditions which imply the convexity of the risk function.) It is clear that Γ is closed under mixtures. Eichenauer-Herrmann then shows that the uniform distribution on Θ (call it π_U) is least favorable in Γ, in the sense that for the Bayes rule δ_U with respect to π_U we have

r(π_U, δ_U) = sup_{π∈Γ} r(π, δ_U).  (5.2.53)

Therefore, δ_U is Γ-minimax by Theorem 5.2.1; Eichenauer-Herrmann cites results from saddle-point theory to obtain the same result.

We have shown that Γ-minimax rules correspond to hierarchical Bayes rules. An alternate choice for the derived loss function establishes a correspondence between Γ-minimax regret rules and hierarchical Bayes rules.

Definition 5.2.4: A decision rule δ′ is a Γ-minimax regret rule if

sup_{π∈Γ} r*(π, δ′) = inf_{δ∈D*} sup_{π∈Γ} r*(π, δ),  (5.2.54)

where r*(π, δ) is given by (5.2.4).

Berger (1985) notes, "The minimax regret principle works somewhat better [than the minimax principle] in statistical situations." If, instead of (5.2.9), the derived loss function is

L*_D(π, a, x) = E^{π(θ|x)}[L(θ, a)] − inf_{δ∈D*} r(π, δ),  (5.2.55)

then the risk in the derived problem is equal to the Bayes regret (5.2.4):

R*_D(π, δ) = E^{f_D(x|π)}[L*_D(π, δ(x), x)] = r*(π, δ).  (5.2.56)

Furthermore, since inf_{δ∈D*} r(π, δ) does not involve a, any rule which is Bayes in the derived problem for the loss L*_D is Bayes for the loss L_D, and hence (by Lemma 5.2.2) Bayes for the corresponding hierarchical prior in the original problem. Therefore one could replace "Γ-minimax" with "Γ-minimax regret", R_D with R*_D, and so forth in Theorems 5.2.1–5.2.4 to establish a relationship between Γ-minimax regret rules and hierarchical priors. Indeed, that is precisely what is done in Sections 5.3 (for estimation under squared-error loss) and 5.4 (for testing a normal mean).

5.3: Γ-Minimax Regret Rules When Γ Contains Two Priors

In this section we shall use the regret version of Theorem 5.2.2 to find a Γ-minimax regret rule when Γ = {π₀, π₁} and the decision problem is parameter estimation under squared-error loss. (Note that π₀ and π₁ play identical roles in the problem: we are no longer using π₀ and π₁ to denote, respectively, the benchmark and best-guess prior.) We will develop a methodology based upon a hierarchical prior which will yield a Γ-minimax regret rule; our development will establish a simple method of finding that rule. The data x will have a distribution conditional on θ which we denote by f(x|θ); further, mᵢ(x) = ∫_Θ f(x|θ)πᵢ(θ) dμ(θ), i = 0, 1, for an appropriate measure μ on Θ. We assume that π₀ and π₁ share a common support, so the marginals m₀ and m₁ will also have a common support, which we denote by X, upon which is defined some measure ν. Our action space will be A = Θ, and our loss function will be squared error:

L(θ, a) = (θ − a)².  (5.3.1)

Our goal is a decision rule δ′ with the Γ-minimax regret property:

max{r*(π₀, δ′), r*(π₁, δ′)} = inf_{δ∈D*} max{r*(π₀, δ), r*(π₁, δ)}.  (5.3.2)

We will find the Γ-minimax regret rule by placing a hyperprior on Γ. As noted earlier, priors on a finite Γ will be formally equivalent to mixtures:

π_ε(θ) = (1 − ε)π₀(θ) + επ₁(θ), ε ∈ [0, 1].  (5.3.3)

Using δⁱ to denote the Bayes rule when the prior πᵢ, i = 0, 1, is used, the Bayes rule when the prior (5.3.3) is used is given by (Berger, 1985):

δ_ε(x) = (1 − λ(x, ε))δ⁰(x) + λ(x, ε)δ¹(x),  (5.3.4)

where

λ(x, ε) = εm₁(x) / ((1 − ε)m₀(x) + εm₁(x)).  (5.3.5)
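The representation (5.3.4)–(5.3.5) can be verified numerically. The following sketch is ours, using a Beta–Binomial setting with illustrative hyperparameters (none of these particular values come from the text): the posterior mean under the mixture prior (5.3.3), computed by direct numerical integration, is compared with the λ-weighted combination (5.3.4).

```python
# Check (ours) that the Bayes rule under the mixture prior (5.3.3) equals the
# lambda-weighted combination (5.3.4), in a Beta-Binomial setting.
from math import comb, gamma

n, eps = 5, 0.3
a0, b0, a1, b1 = 4.0, 4.0, 1.0, 7.0

def beta_pdf(t, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * t ** (a - 1) * (1 - t) ** (b - 1)

def marginal(x, a, b):                  # m_i(x), closed form as in (5.3.51)
    return comb(n, x) * gamma(a + b) * gamma(a + x) * gamma(b + n - x) / (
        gamma(a) * gamma(b) * gamma(a + b + n))

def post_mean_mixture(x):               # posterior mean under (1-eps)*pi0 + eps*pi1,
    N = 20000                           # by midpoint-rule numerical integration
    num = den = 0.0
    for k in range(N):
        t = (k + 0.5) / N
        prior = (1 - eps) * beta_pdf(t, a0, b0) + eps * beta_pdf(t, a1, b1)
        lik = comb(n, x) * t ** x * (1 - t) ** (n - x)
        num += t * prior * lik
        den += prior * lik
    return num / den

errs = []
for x in range(n + 1):
    d0 = (a0 + x) / (a0 + b0 + n)       # delta^0(x), cf. (5.3.50)
    d1 = (a1 + x) / (a1 + b1 + n)       # delta^1(x)
    m0, m1 = marginal(x, a0, b0), marginal(x, a1, b1)
    lam = eps * m1 / ((1 - eps) * m0 + eps * m1)          # (5.3.5)
    combo = (1 - lam) * d0 + lam * d1                     # (5.3.4)
    errs.append(abs(post_mean_mixture(x) - combo))
max_err = max(errs)
```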

Two points are in order. The first is that our notation is consistent: if ε = 0 or ε = 1, then the prior defined by (5.3.3) agrees with π₀ or π₁, respectively.³ The second is that unless ε = 0 or ε = 1, the prior given by (5.3.3) is not even potentially the "true" prior. Rather, as before, we are using a hierarchical prior not because we believe it reflects a "true state of nature" but because doing so will enable us to find the Γ-minimax regret rule.

We will adopt some shorthand notation to simplify our presentation; for i = 0 or 1, let

ζⁱ(ε) = r*(πᵢ, δ_ε).  (5.3.6)

One can show (cf. Chapter II, Section 1) that

ζⁱ(ε) = E^{mᵢ(x)}[(δⁱ(X) − δ_ε(X))²].  (5.3.7)

By the Γ-minimax regret analog to Theorem 5.2.2, if for some ε* we have

ζ⁰(ε*) = ζ¹(ε*),  (5.3.8)

then δ_{ε*} will be a Γ-minimax regret rule. Before proceeding to the main result of this section, we examine two intuitively appealing notions:

Notion A: "Since π₀ and π₁ play equivalent roles in our selection of an ε* to satisfy (5.3.8), it will follow from symmetry that ε* = 1/2."

Notion B: "Should there exist for some priors π₀ and π₁ an ε ≠ 1/2 such that (5.3.8) obtains for ε* = ε, then by the symmetry underlying the problem (i.e., 'reverse' the roles of π₀ and π₁) (5.3.8) will also obtain for ε* = 1 − ε."

Despite the apparent logic of the two statements, we shall see that neither Notion A nor Notion B obtains.

Theorem 5.3.1: Using the notation defined in this chapter up to this time, assume that ζ⁰(1) < ∞ and ζ¹(0) < ∞, and that there exists a set X′ ⊂ X with ν(X′) > 0 and

δ⁰(x) ≠ δ¹(x) ∀x ∈ X′.  (5.3.9)

Then there exists a unique ε* ∈ (0, 1) for which (5.3.8) obtains.

Proof: Our argument will be based on the facts that ζ⁰(ε) is a strictly increasing function of ε with ζ⁰(1) > 0, and ζ¹(ε) is a strictly decreasing function of ε with ζ¹(0) > 0; we will also use some continuity properties of ζ⁰(ε) and ζ¹(ε).

³ In fact, the desire to achieve this consistency has led us to "reverse" the definition of λ used by Berger.

To simplify the notation, define

K(x) = (δ¹(x) − δ⁰(x))².  (5.3.10)

From (5.3.7) we then have

ζ⁰(ε) = E^{m₀(x)}[(λ(X, ε))² K(X)]  (5.3.11)

and

ζ¹(ε) = E^{m₁(x)}[(1 − λ(X, ε))² K(X)].  (5.3.12)

Note that for x ∈ X, λ(x, ε) is a non-negative, differentiable function of ε:

∂/∂ε [λ(x, ε)] = [((1 − ε)m₀(x) + εm₁(x))m₁(x) − εm₁(x)(m₁(x) − m₀(x))] / ((1 − ε)m₀(x) + εm₁(x))²

= m₁(x)((1 − ε)m₀(x) + εm₁(x) − εm₁(x) + εm₀(x)) / ((1 − ε)m₀(x) + εm₁(x))²

= m₁(x)m₀(x) / ((1 − ε)m₀(x) + εm₁(x))² > 0 ∀x ∈ X.  (5.3.13)

(The assumption that m₀ and m₁ share a common support X guarantees that the numerator is strictly positive for every x ∈ X.)

From (5.3.13) we deduce:

ε′ < ε″ ⇒ λ(x, ε′) < λ(x, ε″) ∀x ∈ X

⇒ λ²(x, ε′) < λ²(x, ε″) ∀x ∈ X.  (5.3.14)

From (5.3.9), (5.3.11), and (5.3.14) one deduces that

ζ⁰(ε′) = ∫_X λ²(x, ε′)K(x)m₀(x) dν(x) < ∫_X λ²(x, ε″)K(x)m₀(x) dν(x) = ζ⁰(ε″).  (5.3.15)

That is, ζ⁰(ε) is a strictly increasing function of ε. Also, since λ(x, 0) = 0,

0 = ζ⁰(0) < ζ⁰(1).  (5.3.16)

To prove that ζ⁰(ε) is a continuous function (of ε), let e ∈ (0, 1) be given. Then for s ∈ (0, 1),

0 ≤ |ζ⁰(s) − ζ⁰(e)| = |E^{m₀(x)}[λ²(X, s)K(X)] − E^{m₀(x)}[λ²(X, e)K(X)]|

= |E^{m₀(x)}[(λ²(X, s) − λ²(X, e))K(X)]|

≤ E^{m₀(x)}[|(λ²(X, s) − λ²(X, e))K(X)|].  (5.3.17)

Observe that

|λ²(X, s) − λ²(X, e)| = |λ(X, s) − λ(X, e)| |λ(X, s) + λ(X, e)|.  (5.3.18)

By (5.3.5), λ(x, t) ≤ 1 ∀t ∈ [0, 1], x ∈ X, and consequently

|λ(X, s) + λ(X, e)| ≤ 2.  (5.3.19)

Furthermore,

|λ(X, s) − λ(X, e)| = |sm₁(X)/((1 − s)m₀(X) + sm₁(X)) − em₁(X)/((1 − e)m₀(X) + em₁(X))|

= |s − e| m₀(X)m₁(X) / [((1 − s)m₀(X) + sm₁(X))((1 − e)m₀(X) + em₁(X))]

≤ |s − e| m₀(X)m₁(X) / [(1 − s)m₀(X) · em₁(X)]

= |s − e| / ((1 − s)e).  (5.3.20)

Combining (5.3.20) with (5.3.17) and (5.3.19) yields

0 ≤ |ζ⁰(s) − ζ⁰(e)| ≤ (2|s − e|/((1 − s)e)) E^{m₀(x)}[K(X)] = (2|s − e|/((1 − s)e)) E^{m₀(x)}[(δ¹(X) − δ⁰(X))²]

= 2|s − e| ζ⁰(1) / ((1 − s)e).  (5.3.21)

Since it is assumed that ζ⁰(1) < ∞,

lim_{s→e} 2|s − e| ζ⁰(1) / ((1 − s)e) = 0, e ∈ (0, 1).  (5.3.22)

Consequently, application of the so-called "Squeezing Theorem" (Anton, 1984) to (5.3.21) implies that ζ⁰ is continuous on (0, 1).

Analogous computations will show that ζ¹(ε) is strictly decreasing and continuous for ε ∈ (0, 1) and that

ζ¹(0) > ζ¹(1) = 0.  (5.3.23)

Now, define for ε ∈ [0, 1]

g(ε) = ζ¹(ε) − ζ⁰(ε).  (5.3.24)

We will establish that g has a unique root on (0, 1), which will complete the proof. As the difference of two functions which are continuous on (0, 1), g is continuous on (0, 1). Furthermore, as the sum of two functions strictly decreasing on [0, 1] (i.e., g(ε) = ζ¹(ε) + (−ζ⁰(ε))), g is strictly decreasing on [0, 1]. Since ζ¹(1) = ζ⁰(0) = 0, it follows that

g(0) = ζ¹(0) and g(1) = −ζ⁰(1).  (5.3.25)

For convenience, we define

g⁺(0) = sup_{t∈(0,1)} {g(t)} = lim_{t↓0} g(t)  (5.3.26)

and

g⁻(1) = inf_{t∈(0,1)} {g(t)} = lim_{t↑1} g(t).  (5.3.27)

Note that

ζ¹(0) = g(0) ≥ g⁺(0) > g⁻(1) ≥ g(1) = −ζ⁰(1),  (5.3.28)

where the strict inequality follows from the strict monotonicity of g; the other two inequalities may or may not be strict. We shall now show that g⁺(0) > 0 and g⁻(1) < 0.

Note that since ζ¹ is strictly decreasing, lim_{t↓0} ζ¹(t) exists; define a = lim_{t↓0} ζ¹(t). Since a is the supremum of a set of positive numbers, a is itself positive. Next we shall show that

lim_{s↓0} ζ⁰(s) = ζ⁰(0) = 0.  (5.3.29)

We have

0 ≤ ζ⁰(s) = ∫_X (sm₁(x)/((1 − s)m₀(x) + sm₁(x)))² m₀(x)K(x) dν(x)

≤ ∫_X (sm₁(x)/((1 − s)m₀(x))) m₀(x)K(x) dν(x)

= (s/(1 − s)) ∫_X m₁(x)K(x) dν(x) = (s/(1 − s)) ζ¹(0).  (5.3.30)

Application of the Squeezing Theorem to (5.3.30) yields

lim_{s↓0} ζ⁰(s) = 0.  (5.3.31)

Therefore,

lim_{t↓0} g(t) = lim_{t↓0} ζ¹(t) − lim_{t↓0} ζ⁰(t) = a − 0 = a > 0.  (5.3.32)

The argument that g⁻(1) < 0 is similar.

Since g⁻(1) < 0 and g⁺(0) > 0, the continuity of g on the interval (0, 1) implies that there exists some number t₀ ∈ (0, 1) with g(t₀) = 0. Also, since g is strictly decreasing, there exists only one such number t₀. Therefore, by Theorem 5.2.2 (regret version), one achieves Γ-minimax-regret optimality by choosing ε* = t₀. □

By way of a geometrical interpretation (to which we will allude in Example 5.3.1), we observe that

g(t) = ζ¹(t) − ζ⁰(t) > 0 iff t < t₀  (5.3.33)

and

g(t) = ζ¹(t) − ζ⁰(t) < 0 iff t > t₀.  (5.3.34)

From (5.3.33) and (5.3.34) it follows that

max{ζ⁰(t), ζ¹(t)} = ζ¹(t) for t < t₀,  (5.3.35)

and

max{ζ⁰(t), ζ¹(t)} = ζ⁰(t) for t > t₀.  (5.3.36)

The argument is illustrated visually in Figure 4, which is intended to clarify, rather than to prove. In particular, it must be emphasized that nothing about the concavity of either ζ⁰ or ζ¹ has been or will be proved: Figure 4 is simply a sketch to convey a visual impression of the behavior of ζ⁰ and ζ¹.

For two specific priors, the value ε* can easily be approximated to within any desired error by application of the bisection algorithm (Burden and Faires, 1985) to the function g. Unfortunately, general exact formulae are elusive. As noted above, "Notion A" (that ε* = 1/2) is incorrect, as we shall see in Example 5.3.1. Exact formulae can be derived only for certain special cases. One such case occurs when both the likelihood and the prior belong to a symmetric location-parameter family: the symmetry will guarantee that ε* = 1/2.


Figure 4: A Sketch of the Behavior of Regret Functions
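The bisection procedure just described can be sketched as follows. The code is our own illustration; it assumes the Beta–Binomial setting of Example 5.3.1 below (priors Beta(4, 4) and Beta(1, 7), with n = 2 observations, for which K = 9/100).

```python
# Sketch (ours) of finding eps* by bisection on g(eps) = zeta1(eps) - zeta0(eps),
# in the Beta-Binomial setting of Example 5.3.1 (n = 2, Beta(4,4) vs Beta(1,7)).
from math import comb, gamma

n = 2
a0, b0, a1, b1 = 4.0, 4.0, 1.0, 7.0

def marginal(x, a, b):                          # m(x), eq. (5.3.51)
    return comb(n, x) * gamma(a + b) * gamma(a + x) * gamma(b + n - x) / (
        gamma(a) * gamma(b) * gamma(a + b + n))

m0 = [marginal(x, a0, b0) for x in range(n + 1)]
m1 = [marginal(x, a1, b1) for x in range(n + 1)]
K = [((a0 + x) / (a0 + b0 + n) - (a1 + x) / (a1 + b1 + n)) ** 2 for x in range(n + 1)]

def lam(x, e):                                  # lambda(x, eps), eq. (5.3.5)
    return e * m1[x] / ((1 - e) * m0[x] + e * m1[x])

def zeta0(e):                                   # eq. (5.3.11)
    return sum(m0[x] * lam(x, e) ** 2 * K[x] for x in range(n + 1))

def zeta1(e):                                   # eq. (5.3.12)
    return sum(m1[x] * (1 - lam(x, e)) ** 2 * K[x] for x in range(n + 1))

def g(e):
    return zeta1(e) - zeta0(e)

lo, hi = 1e-9, 1 - 1e-9                         # g(lo) > 0 > g(hi), g decreasing
for _ in range(60):                             # bisection, Theorem 5.3.1
    mid = (lo + hi) / 2
    if g(mid) > 0:
        lo = mid
    else:
        hi = mid
eps_star = (lo + hi) / 2
```

For these two priors g(1/2) < 0, so the root (and hence ε*) lies below 1/2, in agreement with Example 5.3.1.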

Theorem 5.3.2: Assume that

f(x|θ) = f(x − θ);  (5.3.37)

πᵢ(θ) = π(θ − aᵢ), i = 0, 1,  (5.3.38)

for functions f(·) and π(·), both symmetric about 0, and that Θ = X = ℝ. Then (5.3.8) is satisfied for ε* = 1/2.

Proof: We shall show that ζ⁰(ε) = ζ¹(1 − ε) and then invoke Theorem 5.3.1.

Define

m(x) = ∫_{−∞}^{∞} f(x − θ)π(θ) dθ  (5.3.39)

and

δ(x) = ∫_{−∞}^{∞} θ f(x − θ)π(θ) dθ / m(x).  (5.3.40)

Since both π and f are symmetric about zero, it is easily shown that m is symmetric about zero and δ is antisymmetric about zero (δ(−x) = −δ(x)). Also, for i = 0 or 1 and for any x ∈ (−∞, ∞),

mᵢ(x) = ∫_{−∞}^{∞} f(x − θ)π(θ − aᵢ) dθ = ∫_{−∞}^{∞} f(x − w − aᵢ)π(w) dw = m(x − aᵢ)  (5.3.41)

and

δⁱ(x) = δ(x − aᵢ) + aᵢ.  (5.3.42)

In particular,

δ⁰(x + a₁) = δ(x + a₁ − a₀) + a₀ = −δ(−x + a₀ − a₁) + a₀

= −δ¹(−x + a₀) + a₁ + a₀.  (5.3.43)

Now, we have:

ζ⁰(ε) = E^{m₀(x)}[(λ(X, ε))²K(X)] = ∫ ε²(m₁(x))² m₀(x)K(x) / ((1 − ε)m₀(x) + εm₁(x))² dx

= ∫ ε²(m(x − a₁))² m(x − a₀)K(x) / ((1 − ε)m(x − a₀) + εm(x − a₁))² dx.  (5.3.44)

Substituting u = x − a₁ into (5.3.44) and using the symmetry of m yield:

ζ⁰(ε) = ∫ ε²(m(u))² m(u + a₁ − a₀)K(u + a₁) / ((1 − ε)m(u + a₁ − a₀) + εm(u))² du

= ∫ ε²(m(−u))² m(−u − a₁ + a₀)K(u + a₁) / ((1 − ε)m(−u − a₁ + a₀) + εm(−u))² du

= ∫ ε²(m(−u))² m(−u − a₁ + a₀)(δ⁰(u + a₁) − δ¹(u + a₁))² / ((1 − ε)m(−u − a₁ + a₀) + εm(−u))² du

= ∫ ε²(m(−u))² m(−u − a₁ + a₀)(a₀ + a₁ − δ¹(−u + a₀) − δ¹(u + a₁))² / ((1 − ε)m(−u − a₁ + a₀) + εm(−u))² du.  (5.3.45)

Substituting v = −u + a₀ into (5.3.45) yields:

ζ⁰(ε) = ∫ ε²(m(v − a₀))² m(v − a₁)(a₀ + a₁ − δ¹(v) − δ¹(a₀ − v + a₁))² / ((1 − ε)m(v − a₁) + εm(v − a₀))² dv

= ∫ ε²(m₀(v))² m₁(v)(a₀ + a₁ − δ¹(v) − (a₀ + a₁ − δ⁰(v)))² / ((1 − ε)m₁(v) + εm₀(v))² dv

= ∫ ε²(m₀(v))² m₁(v)(δ⁰(v) − δ¹(v))² / ((1 − ε)m₁(v) + εm₀(v))² dv

= E^{m₁(x)}[(1 − λ(X, 1 − ε))²K(X)] = ζ¹(1 − ε).  (5.3.46)

From (5.3.46) it is clear that

ζ⁰(1/2) = ζ¹(1 − 1/2) = ζ¹(1/2).  (5.3.47)

Theorem 5.3.1 guarantees that there exists a unique point of equality for ζ⁰ and ζ¹ and, moreover, that this point of equality satisfies (5.3.8). Hence the result follows from (5.3.47). □

But it is not the case that ε* = 1/2 in general, as shown by the following example:

Example 5.3.1: Let

f(x|θ) = C(n, x) θˣ(1 − θ)^{n−x}, x = 0, 1, …, n,  (5.3.48)

and

π(θ|a, b) = [Γ(a + b)/(Γ(a)Γ(b))] θ^{a−1}(1 − θ)^{b−1}, θ ∈ Θ = (0, 1),  (5.3.49)

so that X|θ ~ Binomial(n, θ) and θ ~ Beta(a, b). It is known (DeGroot, 1970) that the posterior mean in this setting is

δ(x) = (a + x)/(a + b + n).  (5.3.50)

It is also known (Aitchison and Dunsmore, 1976) that

m(x) = C(n, x) Γ(a + b)Γ(a + x)Γ(b + n − x) / [Γ(a)Γ(b)Γ(a + b + n)].  (5.3.51)

Let the priors π₀ and π₁ be given by Beta(a₀, b₀) and Beta(a₁, b₁), with a₀ = 4 = b₀; a₁ = 1; and b₁ = 7; and let the likelihood be Binomial(n, θ) with n = 2. From (5.3.50) it follows that K(x) does not depend on x:

K = K(x) = ((a₀ + x)/(a₀ + b₀ + n) − (a₁ + x)/(a₁ + b₁ + n))² = (3/(n + 8))² = 9/100.  (5.3.52)

Simple computations using (5.3.5), (5.3.50), and (5.3.52) yield:

ζ⁰(1/2) = 0.017485 > 0.015383 = ζ¹(1/2).  (5.3.53)
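The figures in (5.3.53) can be reproduced directly; the short check below is our own:

```python
# Numerical check (ours) of (5.3.53): zeta0(1/2) and zeta1(1/2) for
# pi0 = Beta(4,4), pi1 = Beta(1,7), X|theta ~ Binomial(n = 2, theta).
from math import comb, gamma

n = 2
a0, b0, a1, b1 = 4.0, 4.0, 1.0, 7.0

def marginal(x, a, b):                               # eq. (5.3.51)
    return comb(n, x) * gamma(a + b) * gamma(a + x) * gamma(b + n - x) / (
        gamma(a) * gamma(b) * gamma(a + b + n))

K = 9.0 / 100.0                                      # eq. (5.3.52)
z0 = z1 = 0.0
for x in range(n + 1):
    m0, m1 = marginal(x, a0, b0), marginal(x, a1, b1)
    lam = m1 / (m0 + m1)                             # lambda(x, 1/2), eq. (5.3.5)
    z0 += m0 * lam ** 2 * K                          # zeta0(1/2), eq. (5.3.11)
    z1 += m1 * (1 - lam) ** 2 * K                    # zeta1(1/2), eq. (5.3.12)

print(round(z0, 6), round(z1, 6))                    # 0.017485 0.015383
```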

It follows from Figure 4 that ε* must be less than 1/2. (The true value of ε* could of course be found to any desired degree of accuracy by use of the bisection algorithm.) Having disposed of Notions A and B, we close this section with a simple example in which ε* = 1/2.

K=K(x) = f-i±i tti_Y =Lid>_Y (5.3.54) ^a + b + n b + a + n / \,b + a + n / is not a function of x. Furthermore, it is easy to check via (5.3.51) that for x = 0,l,---,n: mofn-x) =m,(x). (5.3.55) 91

Now,

C(V2) = K£ m ,(M j.V2) =K£, . (5.3.56)

Applying (5.3.55) and then making the substitution k = n - j into (5.3.56) yields:

C°(V2)=K± m'(”~ 2 = K±m,(kXl-a.(k.l/2))J =C(l/2). (5.3.57) j=°(m,(n - j) + m0(n - j)j

We then invoke Theorem 5.3.1 to conclude that e' = 1/2.

5.4: A Γ-Minimax Regret Procedure for Testing a Normal Mean

In this section we will devise a Γ-minimax regret procedure for testing a simple null hypothesis versus a simple alternative hypothesis. The development herein parallels that of Section 5.3, although because our problem is much more specific, we are spared the lengthy arguments concerning continuity of regret functions. The presentation is much simpler, but nevertheless represents an interesting application of this theory to a problem defined with a loss function different from those we have previously examined.

We assume that

X|θ ~ N(θ, 1);  (5.4.1)

θ ∈ Θ = {θ₀, θ₁}, (θ₀ < θ₁, both known);  (5.4.2)

π(θ|α) = α I_{{θ₁}}(θ) + (1 − α) I_{{θ₀}}(θ);  (5.4.3)

Γ = {π(·|α₀), π(·|α₁)}, α₀ < α₁;  (5.4.4)

Θ = {θ₀, θ₁};  (5.4.5)

A = {θ₀, θ₁};  (5.4.6)

L(θ, a) = 1 if θ ≠ a; 0 if θ = a.  (5.4.7)

We want to conduct a Bayesian "test" using L as our loss function: more general loss functions and more general hypotheses have been considered (see DeGroot, 1970, and Berger, 1985). Essentially, the approach is that of the so-called classification problem (Mardia, Kent, and Bibby, 1979). We will declare θ to be θ₁ if the posterior probability (given X = x) that θ = θ₁ is greater than or equal to the posterior probability that θ = θ₀; otherwise, we will declare θ to be θ₀. It can be shown that, for a given prior probability α, the Bayes rule declares θ to be θ₁ if and only if

x ≥ θ̄ + (1/γ) ln((1 − α)/α),  (5.4.8)

where

θ̄ = (θ₀ + θ₁)/2  (5.4.9)

and

γ = θ₁ − θ₀.  (5.4.10)

(For brevity, write c(α) = θ̄ + (1/γ) ln((1 − α)/α) for the cutoff in (5.4.8).)

Letting δ^{α′} denote the rule described above for α = α′, we need a formula for the Bayes risk r(π(·|α″), δ^{α′}) = E^{π(θ|α″)}[E^{f(x|θ)}[L(θ, δ^{α′}(X))]]:

r(π(·|α″), δ^{α′}) = (1 − α″) Pr[X ≥ c(α′) | X ~ N(θ₀, 1)] + α″ Pr[X < c(α′) | X ~ N(θ₁, 1)]

= (1 − α″)(1 − Φ(c(α′) − θ₀)) + α″ Φ(c(α′) − θ₁).  (5.4.11)

Hence, from (5.4.11), the Bayes regret is given by

r*(π(·|α″), δ^{α′}) = r(π(·|α″), δ^{α′}) − r(π(·|α″), δ^{α″}).  (5.4.12)

We want a Γ-minimax regret rule δ*:

max_{i=0,1} {r*(π(·|αᵢ), δ*)} = inf_{δ∈D*} max_{i=0,1} {r*(π(·|αᵢ), δ)}.  (5.4.13)

We shall place a hyperprior on Γ such that the resulting Bayes rule will have constant Bayes regret over Γ, and then invoke the regret analog to Theorem 5.2.2.

A hierarchical prior on Γ will be of the form

π_ε(α) = ε I_{{α₁}}(α) + (1 − ε) I_{{α₀}}(α),  (5.4.14)

for any ε ∈ [0, 1]. Finding the Bayes rule δ_ε for the prior structure given by (5.4.14) and (5.4.3) is simple: we need only compute

α_ε = Pr[θ = θ₁ | α ~ π_ε] = α₀ + ε(α₁ − α₀),⁴  (5.4.15)

and then it follows from (5.4.8) (take α = α_ε) that δ_ε declares θ to be θ₁ if and only if

x ≥ θ̄ + (1/γ) ln((1 − α_ε)/α_ε).  (5.4.16)

We now prove the following result (cf. Theorem 5.3.1):

Theorem 5.4.1: There exists a unique ε* ∈ (0, 1) such that

r*(π(·|α₀), δ_{ε*}) = r*(π(·|α₁), δ_{ε*}).  (5.4.17)

Proof: To simplify notation, define for v₁, v₂ ∈ [0, 1] the function ξ:

ξ(v₁, v₂) = r*(π(·|v₁), δ^{v₂}).  (5.4.18)

It is clear from (5.4.12) that ξ is continuous for all v₁, v₂ ∈ [0, 1]. We will show that ξ(α₀, α_ε) and ξ(α₁, α_ε) intersect in a unique point.

It is clear from (5.4.12) that

ξ(α₀, α_ε) = 0 for ε = 0,  (5.4.19)

and

ξ(α₁, α_ε) = 0 for ε = 1.  (5.4.20)

Since

dα_ε/dε = α₁ − α₀ > 0,  (5.4.21)

we have for i = 0 or i = 1, with φ denoting the standard normal density,

∂ξ(αᵢ, α_ε)/∂ε = [αᵢ φ(c(α_ε) − θ₁) − (1 − αᵢ) φ(c(α_ε) − θ₀)] (dc(α_ε)/dα_ε)(α₁ − α₀).  (5.4.22)

Now dc(α)/dα = −1/(γα(1 − α)) < 0; furthermore, since φ(c − θ₁)/φ(c − θ₀) = exp(γ(c − θ̄)), we have

sgn[αᵢ φ(c(α_ε) − θ₁) − (1 − αᵢ) φ(c(α_ε) − θ₀)] = sgn[c(α_ε) − c(αᵢ)] = sgn[αᵢ − α_ε].  (5.4.23)

Consequently,

∂ξ(αᵢ, α_ε)/∂ε > 0 if and only if α_ε > αᵢ.  (5.4.24)

But from (5.4.15) it is apparent that

α₁ > α_ε > α₀  ∀ε ∈ (0, 1).  (5.4.25)

Combining (5.4.22)–(5.4.25) makes it clear that ξ(α₁, α_ε) and ξ(α₀, α_ε) are, respectively, strictly decreasing and strictly increasing on (0, 1). The result then follows from (5.4.19), (5.4.20), and the continuity of ξ(α₁, α_ε) and ξ(α₀, α_ε). □

⁴ We note that this notation is consistent for ε = 1 and ε = 0.
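The construction in the proof can be sketched numerically. The code below is our own illustration for θ₀ = 0, θ₁ = 1 and the pair (α₀, α₁) = (0.1, 0.9); bisection is applied to ξ(α₁, α_ε) − ξ(α₀, α_ε), which the proof shows is strictly decreasing with a sign change on (0, 1).

```python
# Sketch (ours): find eps* by bisection for theta0 = 0, theta1 = 1,
# (alpha0, alpha1) = (0.1, 0.9), and compute the two regret ratios.
from math import log, sqrt, erf

def Phi(z):                                      # standard normal cdf
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

theta0, theta1 = 0.0, 1.0
a0, a1 = 0.1, 0.9

def c(alpha):                                    # cutoff of delta^alpha, eq. (5.4.8)
    return (theta0 + theta1) / 2 + log((1 - alpha) / alpha) / (theta1 - theta0)

def bayes_risk(a2, a_rule):                      # r(pi(.|a2), delta^{a_rule}), (5.4.11)
    cc = c(a_rule)
    return (1 - a2) * (1 - Phi(cc - theta0)) + a2 * Phi(cc - theta1)

def xi(v1, v2):                                  # Bayes regret, eqs. (5.4.12)/(5.4.18)
    return bayes_risk(v1, v2) - bayes_risk(v1, v1)

def alpha_eps(e):                                # eq. (5.4.15)
    return a0 + e * (a1 - a0)

lo, hi = 1e-9, 1 - 1e-9                          # bisection on xi(a1,.) - xi(a0,.)
for _ in range(60):
    mid = (lo + hi) / 2
    if xi(a1, alpha_eps(mid)) - xi(a0, alpha_eps(mid)) > 0:
        lo = mid
    else:
        hi = mid
eps_star = (lo + hi) / 2
G0 = xi(a0, alpha_eps(eps_star)) / xi(a0, a1)    # regret ratios, cf. (5.4.26)
G1 = xi(a1, alpha_eps(eps_star)) / xi(a1, a0)
```

For this symmetric pair (α₀ = 1 − α₁) one obtains ε* = 1/2 and G₀ = G₁ ≈ 0.27566, matching the corresponding row of Table 6 below.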

The bisection algorithm can be used to approximate ε* to any desired degree of accuracy. This was done by a program given in Appendix E, to an accuracy of 2⁻³⁵ ≈ 3 × 10⁻¹¹, for θ₀ = 0 and θ₁ = 1 and various values of α₀ and α₁; the results are presented in Table 6. Ratios of regrets are given by

G₀(α₀, α₁) = ξ(α₀, α_{ε*})/ξ(α₀, α₁) and G₁(α₁, α₀) = ξ(α₁, α_{ε*})/ξ(α₁, α₀).  (5.4.26)

From Columns 4 and 5 we see that the use of the optimal ε* reduces the regret by over 70% for these values of α₀ and α₁. Clearly, it is not the case that ε* = 1/2 for all α₀ and α₁. Indeed, an exact formula for ε* does not seem to exist.

Table 6: Ratio of Regret for Optimal Hierarchical Rule to Regret for Incorrect Rule

  α₀       α₁       ε*        G₀(α₀,α₁)  G₁(α₁,α₀)
  0.10000  0.20000  0.52236   0.23130    0.27810
  0.10000  0.30000  0.52579   0.23019    0.28756
  0.10000  0.40000  0.52448   0.23247    0.29017
  0.10000  0.50000  0.52160   0.23585    0.28982
  0.10000  0.60000  0.51796   0.24016    0.28794
  0.10000  0.70000  0.51357   0.24615    0.28503
  0.10000  0.80000  0.50795   0.25601    0.28108
  0.10000  0.90000  0.50000   0.27566    0.27566
  0.20000  0.30000  0.50779   0.24312    0.25898
  0.20000  0.40000  0.51005   0.24199    0.26317
  0.20000  0.50000  0.50979   0.24319    0.26476
  0.20000  0.60000  0.50803   0.24604    0.26470
  0.20000  0.70000  0.50492   0.25111    0.26334
  0.20000  0.80000  0.50000   0.26062    0.26062
  0.20000  0.90000  0.49205   0.28108    0.25601
  0.30000  0.40000  0.50311   0.24725    0.25354
  0.30000  0.50000  0.50375   0.24728    0.25505
  0.30000  0.60000  0.50269   0.24930    0.25511
  0.30000  0.70000  0.50000   0.25387    0.25387
  0.30000  0.80000  0.49508   0.26334    0.25111
  0.30000  0.90000  0.48643   0.28503    0.24615
  0.40000  0.50000  0.50086   0.24936    0.25109
  0.40000  0.60000  0.50000   0.25086    0.25086
  0.40000  0.70000  0.49731   0.25511    0.24930
  0.40000  0.80000  0.49197   0.26470    0.24604
  0.40000  0.90000  0.48204   0.28794    0.24016
  0.50000  0.60000  0.49914   0.25109    0.24936
  0.50000  0.70000  0.49625   0.25505    0.24728
  0.50000  0.80000  0.49021   0.26476    0.24319
  0.50000  0.90000  0.47840   0.28982    0.23585
  0.60000  0.70000  0.49689   0.25354    0.24725
  0.60000  0.80000  0.48995   0.26317    0.24199
  0.60000  0.90000  0.47552   0.29017    0.23247
  0.70000  0.80000  0.49221   0.25898    0.24312
  0.70000  0.90000  0.47421   0.28756    0.23019
  0.80000  0.90000  0.47764   0.27810    0.23130

However, Table 6

suggests two results, which are immediate consequences of the following proposition.

Proposition 5.4.1: Let v_1, v_2 ∈ (0,1). Then:

ξ(v_1, v_2) = ξ(1 − v_1, 1 − v_2). (5.4.27)

Proof: Observe that:

ln[v_i/(1 − v_i)] = −ln[(1 − v_i)/v_i], i = 1, 2, (5.4.28)

and:

(e − e_0) = −(e − e_1). (5.4.29)

One then uses (5.4.28) and (5.4.29), and the symmetry of the normal density φ, to re-arrange ξ(v_1, v_2) in

such a way as to establish (5.4.27). □

Proposition 5.4.2: With the notation defined in this chapter, let a_0 = 1 − a_1. Then e* = 1/2, and

ξ(a_0, a_{e*}) / ξ(a_0, a_1) = ξ(a_1, a_{e*}) / ξ(a_1, a_0). (5.4.30)

Proof: Since a_0 = 1 − a_1, a_{1/2} = (a_0 + a_1)/2 = (a_0 + 1 − a_0)/2 = 1/2. From (5.4.27) it follows that ξ(a_0, 1/2) = ξ(1 − a_0, 1/2) = ξ(a_1, 1/2); from Theorem 5.4.1, it follows that e* = 1/2 is the unique point of intersection. Also, ξ(a_0, a_1) = ξ(1 − a_0, 1 − a_1) = ξ(a_1, a_0), so that (5.4.30) follows. □

Proposition 5.4.3: For any numbers a_0 and a_1 such that 0 < a_0 < a_1 < 1, suppose that for some e' ∈ (0,1):

ξ(a_0, a_{e'}) = ξ(a_1, a_{e'}). (5.4.31)

Then

ξ(1 − a_0, 1 − a_{e'}) = ξ(1 − a_1, 1 − a_{e'}), (5.4.32)

ξ(a_0, a_{e'}) / ξ(a_0, a_1) = ξ(1 − a_0, 1 − a_{e'}) / ξ(1 − a_0, 1 − a_1), (5.4.33)

and

ξ(a_1, a_{e'}) / ξ(a_1, a_0) = ξ(1 − a_1, 1 − a_{e'}) / ξ(1 − a_1, 1 − a_0). (5.4.34)

(Proposition 5.4.3 would be useful if we wished to construct tables of e*, since it would reduce the number of times we would have to implement the bisection algorithm: once we have found the point of intersection a_{e'} for (a_0, a_1), we would know that 1 − a_{e'} is the intersection point for (1 − a_1, 1 − a_0).)

Proof: The result is immediate from Proposition 5.4.1 and the assumption (5.4.31). □

Note from Table 6 that typically e* ≈ 1/2, which might suggest that if one lacks

access to an implementation of the bisection algorithm, then the use of 1/2 will lead to

"favorable" results, even if strict minimaxity is not realized. This is generally the case, as

indicated by Table 7, where:

G'_0(a_0, a_1) = ξ(a_0, a_{1/2}) / ξ(a_0, a_{e*}), (5.4.35)

and:

G'_1(a_1, a_0) = ξ(a_1, a_{1/2}) / ξ(a_1, a_{e*}). (5.4.36)

The program which generated those figures is listed in Appendix E; as before, θ_0 = 0 and θ_1 = 1.

Table 7: Ratio of Regret for Approximate vs. Optimal Hierarchical Rule

a_0    a_1    G'_0(a_0,a_1)  G'_1(a_1,a_0)
0.100  0.200  0.907  1.088
0.100  0.300  0.892  1.103
0.100  0.400  0.896  1.099
0.100  0.500  0.908  1.088
0.100  0.600  0.924  1.074
0.100  0.700  0.942  1.056
0.100  0.800  0.966  1.033
0.200  0.300  0.968  1.031
0.200  0.400  0.959  1.040
0.200  0.500  0.960  1.039
0.200  0.600  0.967  1.032
0.200  0.700  0.980  1.020
0.200  0.900  1.033  0.966
0.300  0.400  0.987  1.012
0.300  0.500  0.985  1.015
0.300  0.600  0.989  1.011
0.300  0.800  1.020  0.980
0.300  0.900  1.056  0.942
0.400  0.500  0.997  1.003
0.400  0.700  1.011  0.989
0.400  0.800  1.032  0.967
0.400  0.900  1.074  0.924

CHAPTER VI

CONCLUSIONS AND FUTURE AVENUES OF RESEARCH

6.1: Summary of Chapters II-V

All of the work in this dissertation deals with two aspects of Bayesian decision

theory: one is hierarchical modeling, the other is Bayesian robustness. The former is a

means toward an end, and the latter is the end. However, a hierarchical prior is a relatively straightforward and single-faceted entity, while "robustness" can have many meanings. Essentially, we have applied two meanings to Bayesian robustness. In Chapters II and III,

"robustness" means trying to come close to some "benchmark" — a benchmark decision rule in Chapter II, a benchmark posterior in Chapter III. As noted earlier, this represents

something of a departure from some of the earlier work in Bayesian robustness (such as the posterior-range method discussed in Section 1.5), which defined robustness in terms of

the extent to which Bayesian inference varied as the prior varied over Γ. One might say that the posterior-range method treats robustness as precision, while the benchmark-best guess paradigm treats robustness as accuracy.

In Chapter V robustness is interpreted in a Γ-minimax sense, the goal being to find procedures which minimize the maximum Bayes risk (or, more commonly in Sections 5.3 and 5.4, the regret). The minimax paradigm is of course conservative: when, in Sections 5.3 and 5.4, we find what amount to equalizer rules (over the index set Γ) we tacitly

accept incurring a certain Bayes regret, fully aware that it might be possible to reduce that regret but equally aware that we are subject to a regret whose magnitude is bounded above. In other words, we have agreed to pay a specified regret in exchange for which we are safeguarded from paying a heavier regret.

Chapter IV is based in part on benchmark-best guess principles and in part on an informal application of the Γ-minimax principles. As in Chapter II, regrets are computed against a specific benchmark, be it Cauchy, double exponential, or normal: but the simulation evidence of Tables 2 through 5 shows that (at least in this case) the hierarchical rule possesses the following "pseudo-Γ-minimax" property: by using the hierarchical rule we incur a certain magnitude of regret, but we know that the magnitude of this regret is less than (in fact, generally much less than) the maximum regret that we could incur otherwise. Fortunately, the numerical evidence also suggests (again, at least in this case) that the magnitude of the regret is fairly small: the hierarchical rule is generally competitive with the benchmark rule.

Why are hierarchical priors so useful in both the benchmark-best guess context and in the Γ-minimax context? We would suggest two points which may help one understand why hierarchical priors on Γ are useful tools in the pursuit of robustness:

(i) Hierarchical priors, in an utterly natural and indeed Bayesian way, allow us to acknowledge the uncertainty in our assumptions and, moreover, to incorporate it directly into the model. By admitting that the hyperparameters are unknown and modeling them as random quantities, we are acknowledging the uncertainty in our assumptions. Then, the computation of the posterior distribution via Bayes' theorem automatically deals with that uncertainty.
This phenomenon is quite apparent in the normal settings (Sections 2.2 and 3.2): our uncertainty regarding the benchmark leads to the hierarchical prior having larger variance than the single-stage prior, which in turn brings the posterior inference closer to the inference that would correspond to use of the benchmark prior.

(ii) At least when Γ is finite, hierarchical priors, which is to say mixtures, can allow the data to "correct" mistakes in one's model: or, perhaps more precisely, the data can teach the statistician about the model even as the inference is being conducted. This phenomenon is most apparent in Section 4.2, particularly Example 4.2.1, where it is seen that the hierarchical rule performs much better than the "incorrect" rule and not a great deal worse than the "correct" rule. As one sees, particularly in Example 4.2.1, the data are

"helping" reduce Bayes regret. The conceptual benefit of using hierarchical priors, of course, is that one needs no additional theoretical justification: nor does one need to collect additional data for the refinement of one’s assumptions. For instance, in Section 4.2, a person determined to use one of the priors in T (i.e., a person determined not to use a hierarchical model) might opt to collect data in a pilot study to enable him/her to select a prior from T. The statistician could draw a sample and try to determine from which predictive density -- mc, mD, or 1% - the sample had come. Then more data would have to be collected for the actual inference. Of course, the two data sets could be merged; but doing so could require some nontrivial theoretical justification. The hierarchical prior eliminates the need for a pilot study in this context 1 and advances one directly to the second-stage of data collection - without requiring any additional justification! The hierarchical prior, in other words, performs a task automatically that the statistician might find difficult!

1 Pilot studies can be useful or essential in other contexts. The point is that here a pilot study need not be conducted to help the statistician home in on the "proper" element of Γ.
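The re-weighting mechanism sketched in (ii) is just the mixture-posterior computation of Proposition B.2 (Appendix B): the posterior weight on each element of Γ is proportional to its prior weight times its predictive density at the data. A small illustration in Python — the two-component normal setup and all numbers are hypothetical, not the models of Chapter IV:

```python
import math

# Hypothetical setup: a hierarchical prior puts weight 1/2 on each of two
# candidate N(mu_i, 1) priors for theta; with a N(theta, 1) likelihood the
# marginal of x under prior i is N(mu_i, 2).  Posterior weights eps_i(x)
# are proportional to eps_i * m_i(x), as in Proposition B.2.
def marginal(x, mu):
    # m_i(x) for a N(mu, 1) prior under a N(theta, 1) likelihood
    return math.exp(-(x - mu) ** 2 / 4.0) / math.sqrt(4.0 * math.pi)

def posterior_weights(x, mus, eps):
    w = [e * marginal(x, mu) for e, mu in zip(eps, mus)]
    s = sum(w)
    return [wi / s for wi in w]

# An observation near mu = 0 moves almost all the weight to that component:
print(posterior_weights(0.2, mus=[0.0, 5.0], eps=[0.5, 0.5]))
```

The data thus "teach" the model which element of Γ is plausible, with no pilot study.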

6.2: Future Avenues of Research

In this section, we will discuss, in very general terms, some "next steps" that could be taken. Each builds upon the foundation already laid. Little if any work has been done on any of them: they represent only interesting and important questions which may or may not have an elegant or even tractable answer.

The most obvious "next step" would be an attempt to extend some of the results of Section 2.2 and Chapter III to non-normal distributions. Unfortunately, the proofs given for Theorem 2.2.1 and the results in Sections 3.2 and 3.3 use the normality assumption very heavily. Nevertheless, it might be possible to use the theory of total positivity and convolution properties (see Karlin, 1968) to obtain results along these lines. Specifically, consider the "best guess" and hierarchical models for a location parameter setting:

θ ~ π(θ − θ_1) (6.2.1)

and:

x|θ ~ f(x − θ) (6.2.2)

for the "best guess" model, and:

α ~ π*(α − α_1), (6.2.3)

θ|α ~ π(θ − α) (6.2.4)

for the hierarchical model. Convolving each prior with the likelihood (i.e., combining (6.2.1), and (6.2.3)-(6.2.4), with (6.2.2)) will produce two distributions "closer to each other" (in some appropriately-defined sense) than the original distributions π_1 and π_2. Yet the distributions so produced are the marginals m_1 and m_2. Since Lemma 3.1.1 applies to all distributions, arguments along these lines could lead to general results about robustness in the Kullback-Leibler sense.²

Another, much more specialized "next step" would be a more systematic and thorough study of decision problems for exponential distributions, one of which was treated in an exploratory and ad hoc fashion in Section 2.3. The theory of Laplace transforms (see Saff and Snider (1976) for a general and elementary discussion; see

Billingsley (1986) for some discussion of the role of the Laplace transform in probability theory) might make possible the proof of some results, since the hierarchical prior

∫_A exp(−θα) π*(α) dα (6.2.5)

is the Laplace transform of the hyperprior π*(α), evaluated at θ. Such results might, in some contexts, also extend to regular exponential families.

The simulation study presented in Chapter IV, while considerably more extensive and elaborate than the one in Section 2.3, still suffers some of the limitations shared by all simulation studies relative to theoretical studies. The intractability of the integrals needed for the Bayes risks and regrets seems insurmountable, but it might still be possible to establish at least some upper or lower bounds relevant to the problem (e.g., "The ratio of the Bayes risk for the hierarchical rule to the Bayes risk for the benchmark rule will never exceed 1.25 when Properties A, B, C, and D obtain"). We have only considered what Berger (Berger, 1984) has called procedure robustness; that is, our measures of robustness have been averaged over all x which might

be observed. We have ignored posterior robustness, which deals only with whatever data x was actually observed. This choice was both natural and legitimate, since one goal of this research is to promote hierarchical priors as a necessary component of the non-statistician's "toolbox" of statistical methods. Thus, the non-statistician must have evidence, before selecting a method, that it can reasonably be expected to perform well for whatever data he/she collects. On the other hand, posterior robustness is an important property as well, in some sense more in keeping with the Bayesian paradigm. A study of when (i.e., for what sets of x-values) a particular hierarchical prior reduces regret would be quite valuable: the results would not be especially useful in encouraging non-statisticians to use hierarchical priors, but they could help the statistician better understand the underlying process and provide insights into conditional inference.

2 Other information measures could of course be studied, but the asymmetry of the Kullback-Leibler number, which distinguishes clearly the role of the benchmark prior, makes it a very attractive criterion for our purposes.

APPENDIX A

SOME FORTRAN AND PASCAL PROGRAMS USED IN SECTION 2.3

c this program uses an imsl routine to generate the nodes and weights
c needed for Gauss-Laguerre quadrature
      integer numpts,k,j
      double precision qxfix,qx,qw,wk
      dimension qxfix(32768),qx(32768),qw(32768),wk(32768)
      intrinsic dabs
      external dg2rul
c get weights and nodes needed to do generalized Laguerre quadrature
      numpts = 512
      do 2, k = 9, 15
         call dg2rul(numpts,6,0.0d0,0.0d0,0,qxfix,qx,qw,wk)
         do 3, j = 1, numpts
            write(6,4) qx(j)
            write(6,4) qw(j)
    4       format(d30.15)
    3    continue
         numpts = numpts*2
    2 continue
      end

c this module contains functions to compute the bayes rule when
c a gamma and inverse Gaussian hyperprior are used
      double precision function gamrul(x,a,b)
      double precision a,b,x,nodes,wghts,tol,esn,esd,cesn,cesd,tol2
      double precision dabs
      integer j,c,numpts
      dimension nodes(15,32768), wghts(15,32768)
      intrinsic dabs
      common nodes,wghts
      tol = 0.00005d0
      j = 9
      numpts = 512
      esn = 0.0d0
      do 301, c = 1, numpts
         esn = esn + wghts(j,c)*nodes(j,c)**a/(x + nodes(j,c)/b)
  301 continue
  307 continue
      j = j + 1
      numpts = numpts*2
      cesn = 0.0d0
      do 302, c = 1, numpts
         cesn = cesn + wghts(j,c)*nodes(j,c)**a/(x + nodes(j,c)/b)
  302 continue
      tol2 = dabs((cesn - esn)/cesn)
      if (tol2.lt.tol) goto 303
      if (j.lt.15) goto 500
      write(6,1001) x,esn,cesn,tol2
 1001 format('n ',d12.6,' ',d12.6,' ',d12.6,' ',d12.6)
      goto 303
  500 esn = cesn
      goto 307
  303 continue
      j = 9
      numpts = 512
      esd = 0.0d0
      do 401, c = 1, numpts
         esd = esd + wghts(j,c)*nodes(j,c)**a/(x + nodes(j,c)/b)**2.0d0
  401 continue
  407 continue
      j = j + 1
      numpts = numpts*2
      cesd = 0.0d0
      do 402, c = 1, numpts
         cesd = cesd + wghts(j,c)*nodes(j,c)**a/(x + nodes(j,c)/b)**2.0d0
  402 continue
      tol2 = dabs((cesd - esd)/cesd)
      if (tol2.lt.tol) goto 501
      if (j.lt.15) goto 507
      write(6,1002) x,esd,cesd,tol2
 1002 format('d ',d12.6,' ',d12.6,' ',d12.6,' ',d12.6)
      goto 501
  507 esd = cesd
      goto 407
  501 gamrul = cesn/cesd
      return
      end
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
      double precision function igrul(x,mu,la)
      double precision mu,la,x,nodes,wghts,tol,esn,esd,cesn,cesd,tol2
      double precision dabs,dsqrt,dexp,temp
      integer j,c,numpts
      dimension nodes(15,32768), wghts(15,32768)
      intrinsic dabs,dexp,dsqrt
      common nodes,wghts
      tol = 0.00005d0
      j = 9
      numpts = 512
      esn = 0.0d0
      do 301, c = 1, numpts
         temp = dsqrt(nodes(j,c))*(x + 2.0d0*mu*mu*nodes(j,c)/la)
         temp = dexp(-0.25d0*la*la/(mu*mu*nodes(j,c)))/temp
         esn = esn + wghts(j,c)*temp
  301 continue
  307 continue
      j = j + 1
      numpts = numpts*2
      cesn = 0.0d0
      do 302, c = 1, numpts
         temp = dsqrt(nodes(j,c))*(x + 2.0d0*mu*mu*nodes(j,c)/la)
         temp = dexp(-0.25d0*la*la/(mu*mu*nodes(j,c)))/temp
         cesn = cesn + wghts(j,c)*temp
  302 continue
      tol2 = dabs((cesn - esn)/cesn)
      if ((tol2.lt.tol).or.(j.eq.15)) goto 303
      esn = cesn
      goto 307
  303 continue
      if (tol2.ge.tol) then
         write(6,898) x,esn,cesn,tol2
      endif
  898 format('n ',d12.6,' ',d12.6,' ',d12.6,' ',d12.6)
      j = 9
      numpts = 512
      esd = 0.0d0
      do 401, c = 1, numpts
         temp = dsqrt(nodes(j,c))*(x + 2.0d0*mu*mu*nodes(j,c)/la)**2.0d0
         temp = dexp(-0.25d0*la*la/(mu*mu*nodes(j,c)))/temp
         esd = esd + wghts(j,c)*temp
  401 continue
  407 continue
      j = j + 1
      numpts = numpts*2
      cesd = 0.0d0
      do 402, c = 1, numpts
         temp = dsqrt(nodes(j,c))*(x + 2.0d0*mu*mu*nodes(j,c)/la)**2.0d0
         temp = dexp(-0.25d0*la*la/(mu*mu*nodes(j,c)))/temp
         cesd = cesd + wghts(j,c)*temp
  402 continue
      tol2 = dabs((cesd - esd)/cesd)
      if ((tol2.lt.tol).or.(j.eq.15)) goto 403
      esd = cesd
      goto 407
  403 continue
      if (tol2.ge.tol) then
         write(6,998) x,esd,cesd,tol2
      endif
  998 format('d ',d12.6,' ',d12.6,' ',d12.6,' ',d12.6)
      igrul = cesn/cesd
      return
      end

c this program conducts the simulations used to compare gamma,
c inverse Gaussian, and "best-guess" priors
      integer numsims,numpts,k,j,i,i2,i3,n
      double precision x,nodes,wghts,u,v,temp
      double precision igrule,grule,a1,reging,reggam
      double precision dabs,gamrul,igrul
      dimension nodes(15,32768), wghts(15,32768),u(25000)
      dimension v(5)
      common nodes,wghts
      intrinsic dabs
      external gamrul,igrul,dmun
      numsims = 10
c get weights and nodes needed to do generalized Laguerre quadrature
      numpts = 512
      do 2, k = 9, 15
         do 3, j = 1, numpts
            read(5,55) nodes(k,j)
            read(5,55) wghts(k,j)
   55       format(d30.15)
    3    continue
         numpts = numpts*2
    2 continue
      a1 = 0.25d0
      n = 1000
      do 100, i = 1, 5
         a1 = a1 + 0.25d0
         if (i.eq.3) goto 100
         write(6,310) a1
  310    format(d10.5)
         if (a1.gt.1.0d0) then
            v(1) = 1.0d0/a1
         else
            v(1) = a1
         endif
         v(1) = v(1)**2.0d0
         do 155, i2 = 2, 5
            v(i2) = v(i2-1)/2.0d0
  155    continue
         do 200, i2 = 1, 5
            write(6,88) v(i2)
   88       format(d20.10)
            do 400, i3 = 1, numsims
               call dmun(n,u)
               reggam = 0.0d0
               reging = 0.0d0
               do 405, k = 1, n
                  x = u(k)/(1.0d0 - u(k))
                  grule = gamrul(x,a1*a1/v(i2),a1/v(i2))
                  temp = x + 1.0d0 - grule
                  reggam = reggam + temp**2.0d0/n
  131             continue
                  igrule = igrul(x,a1,a1**3.0d0/v(i2))
                  temp = x + 1.0d0 - igrule
                  reging = reging + temp**2.0d0/n
  132             continue
  405          continue
               write(6,821) reggam
  821          format(d20.10)
               write(6,822) reging
  822          format(d20.10)
  400       continue
  200    continue
  100 continue
      end

program Inout;
{$N+}
{this program processes the output of the above program and
 presents it in tabular form}
type
  dataset = array[1..50] of extended;
var
  gamma, invgau: dataset;
  variance, a1, mg, mi, sg, si, regret: extended;
  k, numsims, i, j, m: integer;
  fo, f2: text;

function mean(x: dataset; n: integer): extended;
var
  k: integer;
  t: extended;
begin
  t := 0.0;
  for k := 1 to n do
    t := t + x[k];
  mean := t/n;
end;
{ }
function sdev(x: dataset; n: integer): extended;
var
  k: integer;
  t, mu: extended;
begin
  mu := mean(x, n);
  t := 0.0;
  for k := 1 to n do
    t := t + (x[k] - mu)*(x[k] - mu);
  sdev := sqrt(t/(n - 1));
end;
{ }
begin
  numsims := 10;
  assign(f2, 'inexp');
  reset(f2);
  assign(fo, 'outexp');
  rewrite(fo);
  for m := 1 to 4 do
  begin
    readln(f2, a1);
    regret := (a1 - 1.0)*(a1 - 1.0);
    for j := 1 to 5 do
    begin
      readln(f2, variance);
      for k := 1 to numsims do
      begin
        readln(f2, gamma[k]);
        readln(f2, invgau[k]);
      end;
      mg := mean(gamma, numsims);
      sg := sdev(gamma, numsims);
      mi := mean(invgau, numsims);
      si := sdev(invgau, numsims);
      write(fo, a1:8:3, ',', variance:8:5, ',', mg:8:4, '(', sg:6:4, '),');
      write(fo, mg/regret:8:4, ',', mi:8:4, '(', si:6:4, '),');
      writeln(fo, mi/regret:8:4);
    end;
  end;
  close(fo);
  close(f2);
end.

FINITE MIXTURE DISTRIBUTIONS

This appendix discusses finite mixture distributions, an extremely important topic of which considerable use is made in Chapters IV and V. The goal is to give the reader a formal characterization of mixture distributions; to distinguish between the various types of parameters involved in a mixture distribution; to explain the interrelationship between mixture distributions and the Bayesian paradigm; and to state and prove one result necessary for Chapter IV: the goal is not to attempt a comprehensive discussion of a topic that has been dealt with at book-length, nor even a survey of mixtures. We begin by defining the topic of this appendix, adopting the terminology of

Everitt and Hand (Everitt and Hand, 1981):

Definition B.1: Let g(y;η) be a d-dimensional probability density function depending on an m-dimensional parameter vector η, and let H(η) be an m-dimensional cumulative distribution function. Then

h(y) = ∫ g(y;η) dH(η) (B.1)

is called a mixture density, and H(η) is called the mixing distribution.

Definition B.2: Let g, H, h, y, and η be as in Definition B.1. If the support of H is a finite set of points η_1, η_2, ..., η_c, then we say that

h(y) = Σ_{j=1}^{c} g(y; η_j) H({η_j}) (B.2)

is a finite mixture. To simplify our notation, we define

p_j = H({η_j}), j = 1, ..., c. (B.3)

The values p_j are often called mixing weights and the densities g(y; η_j) are often called component densities (Titterington, Smith, and Makov, 1985). As Everitt and Hand observe, the parameters associated with finite mixture distributions may be divided into three classes:

(i) The number c of component densities;

(ii) The mixing weights p_j, j = 1, ..., c;

(iii) The parameter vectors η_j, j = 1, ..., c.¹
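As a concrete instance of (B.2)-(B.3), the following Python sketch evaluates a finite mixture density; the three normal components and their mixing weights are purely illustrative choices, not taken from the dissertation:

```python
import math

# A finite mixture (B.2): component densities g(y; eta_j) with mixing
# weights p_j.  Hypothetical choice: three normal components.
def normal_pdf(y, mu, sigma):
    return math.exp(-((y - mu) / sigma) ** 2 / 2.0) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(y, weights, params):
    """h(y) = sum_j p_j g(y; eta_j), with eta_j = (mu_j, sigma_j)."""
    return sum(p * normal_pdf(y, mu, s) for p, (mu, s) in zip(weights, params))

p = [0.2, 0.5, 0.3]                        # mixing weights, summing to 1
etas = [(-1.0, 1.0), (0.0, 0.5), (2.0, 1.5)]
print(mixture_pdf(0.0, p, etas))
```

Since the weights sum to one, h integrates to one; the component parameters here play the role of the (known) η_j of footnote 1.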

There is clearly a potential difficulty with identifiability:

Definition B.3 (Everitt and Hand, 1981): Let a class ℑ of finite mixtures have the property that, for any two elements h_1(y) = Σ_{j=1}^{c_1} p_j g(y; η_j) and h_2(y) = Σ_{i=1}^{c_2} p*_i g(y; η*_i) of ℑ, the following implication holds. If

h_1(y) = h_2(y), (B.4)

then:

c_1 = c_2; (B.5)

and, for each j ∈ {1, 2, ..., c_1}, there exists some i ∈ {1, 2, ..., c_2} such that

p_j = p*_i and η_j = η*_i. (B.6)

Then we say that the class ℑ is identifiable. Clearly, if a class ℑ is not identifiable, all the estimation problems (i)-(iii) are ill-posed. Fortunately, the following simple result characterizes identifiable classes of mixtures:

Theorem B.1 (Yakowitz and Spragins, 1968): A class ℑ of mixtures is identifiable if and

1 In the contexts we consider, these η_j, j = 1, ..., c, will be known and hence need not be estimated.

only if the component densities are linearly independent over the set of real numbers.

Methods of attacking problems (i), (ii), and (iii) exist and are well-known (Everitt and Hand, 1981), but they will not be recapitulated here. Instead we will outline two reasons why mixtures figure prominently in Bayesian analysis.

Firstly, Bayesian inference using finite mixtures as priors is often much easier than one might initially expect. The following proposition generalizes in a very straightforward manner a useful result (Berger, 1985) from the case k = 2 to a general k:

Proposition B.2: Let f(x|θ), θ ∈ Θ, denote a likelihood, and let π_1(θ), π_2(θ), ..., π_k(θ) be

probability [density] functions with support Θ. Define

m_i(x) = ∫_Θ f(x|θ) π_i(θ) dλ(θ), i = 1, ..., k; (B.7)

π_i(θ|x) = f(x|θ) π_i(θ) / m_i(x), i = 1, ..., k; (B.8)

δ_i(x) = E^{π_i(θ|x)}[θ], i = 1, ..., k. (B.9)

(That is, (B.7), (B.8), and (B.9) denote respectively the marginal, the posterior, and the posterior mean based on π_i(θ).) Let ε = (ε_1, ε_2, ..., ε_k) denote a so-called probability vector (i.e., a vector such that Σ_i ε_i = 1 and ε_i ≥ 0 for all i). Define the mixture prior by

π*(θ) = Σ_{i=1}^{k} ε_i π_i(θ). (B.10)

Then the posterior distribution of θ|X = x based on the prior distribution (B.10) is given by

π*(θ|x) = Σ_{j=1}^{k} ε_j(x) π_j(θ|x), (B.11)

where

ε_i(x) = ε_i m_i(x) / Σ_{j=1}^{k} ε_j m_j(x). (B.12)

Furthermore, the posterior mean of θ|X = x based on the prior distribution (B.10) is given by

δ*(x) = Σ_{j=1}^{k} ε_j(x) δ_j(x). (B.13)

Proof: Let m*(x) denote the marginal of x when the prior (B.10) is placed on θ:

m*(x) = ∫_Θ f(x|θ) π*(θ) dλ(θ) = Σ_{i=1}^{k} ε_i ∫_Θ f(x|θ) π_i(θ) dλ(θ) = Σ_{i=1}^{k} ε_i m_i(x). (B.14)

Then the posterior distribution π*(θ|x) when the prior (B.10) is placed on θ is given by:

π*(θ|x) = f(x|θ) π*(θ) / m*(x) = Σ_{i=1}^{k} ε_i f(x|θ) π_i(θ) / Σ_{i=1}^{k} ε_i m_i(x). (B.15)

But

f(x|θ) π_i(θ) = m_i(x) π_i(θ|x). (B.16)

Combining (B.16) and (B.15) yields (B.11). It follows from (B.11) that the posterior mean corresponding to the prior given in (B.10) is

δ*(x) = E^{π*(θ|x)}[θ] = Σ_{i=1}^{k} ε_i m_i(x) E^{π_i(θ|x)}[θ] / Σ_{i=1}^{k} ε_i m_i(x) = Σ_{i=1}^{k} ε_i m_i(x) δ_i(x) / Σ_{i=1}^{k} ε_i m_i(x), (B.17)

which completes the proof of Proposition B.2. □

Secondly, and more importantly for our purposes, the link between mixtures and hierarchical priors is readily apparent from Definition B.1: let g(·;·) represent the prior

(conditional) density of the parameter y given the hyperparameter η, while H(η) represents the hyperprior placed on η. Then (B.1) denotes the single-stage prior which results from integrating the hyperparameter η out of the hierarchical structure.

APPENDIX C

SOME EXACT AND APPROXIMATE FORMULAE

C.1: Introduction

In this Appendix we develop formulae for the posterior mean and variance, and for the marginal density of the data, for two likelihood-prior combinations. The results developed herein are used in Chapter IV; they are presented separately in this Appendix because they are in some cases complicated and yet tangential to the topic of Chapter IV.

C.2: Normal Likelihood, Cauchy Prior

Assume that

f(x|θ) = (1/√(2π)) exp(−(x − θ)²/2) (C.2.1)

and

π_c(θ) = β_c / (π(β_c² + θ²)). (C.2.2)

We wish to derive expressions for

m_c(x) = ∫ f(x|θ) π_c(θ) dθ; (C.2.3)

δ_c(x) = E^{π_c(θ|x)}[θ]; (C.2.4)

and

V_c(x) = E^{π_c(θ|x)}[(θ − δ_c(x))²], (C.2.5)

where π_c(θ|x) is the posterior density of θ given X = x.

Note that

δ_c(x) = ∫ θ f(x|θ) π_c(θ) dθ / m_c(x). (C.2.6)

The numerator and denominator in (C.2.6) are given by

∫ θ f(x|θ) π_c(θ) dθ = ∫_{−∞}^{∞} β_c θ exp(−(x − θ)²/2) / (π√(2π)(β_c² + θ²)) dθ (C.2.7)

and

m_c(x) = ∫_{−∞}^{∞} β_c exp(−(x − θ)²/2) / (π√(2π)(β_c² + θ²)) dθ. (C.2.8)

Now, both integrals clearly exist and are finite: the absolute value of the integrand in

(C.2.7) is bounded above by a constant multiple of |θ| exp(−(x − θ)²/2), and the absolute value of the integrand in (C.2.8) is bounded above by a constant multiple of

exp(−(x − θ)²/2), both of which are integrable with finite values of the integrals.

However, in actual evaluation of the integrals in (C.2.6) we exploit a method devised by

Professor Herman Rubin. First we define Mills' Ratio (see Kendall and Stuart, 1969): Definition C.2.1: For a constant t, Mills' Ratio R(t) is defined as

R(t) = (1 − Φ(t)) / φ(t) = ∫_t^∞ exp(−u²/2) du / exp(−t²/2). (C.2.9)

Rubin (1977) notes that

R(β_c − ix) = ∫_{−∞}^{∞} exp(−(u − x)²/2) / (√(2π)(β_c − iu)) du, (C.2.10)

where i² = −1. This observation led Rubin to the expressions for the marginal and the posterior mean in the following result:

Proposition C.2.1: Using the notation defined thus far,

m_c(x) = Re(R(β_c − ix)) / π; (C.2.11)

δ_c(x) = β_c Im(R(β_c − ix)) / Re(R(β_c − ix)); (C.2.12)

and

V_c(x) = β_c / Re(R(β_c − ix)) − β_c² − [β_c Im(R(β_c − ix)) / Re(R(β_c − ix))]². (C.2.13)

Proof: We begin with

"exp(-(u-x)2/2) (pc+iu)exp(-(u-x)2/2) du= J du \ >/27t(Pc -iu ) >/27t(pc -iu)(pc +iu)

(Pc +iu)exp(-(u-x)2/2) du ^ n ( P c - i V )

(Pc +iu)exp(-(u-x)2/2)> = i du V 27rM +u2l 121

IW p H u - x )2/ 2) uexp(-(u-x)2/2) J du + i" ir f du 7 2 tc (pc+u2) ■v27c ~a (Pc + u2)

Pcexp(-(x-0)2/2) tit ® Pcu e x p (-(u -x )2/2) = * J du. (C.2.14) W 2 tc(Pc + 02) itV2n(Pc +u2)

From (C.2.7), (C.2.8), and (C.2.14) it is clear that (C.2.11) and (C.2.12) obtain. A similar analysis yields the expected posterior loss V_c(x) as follows:

E^{π_c(θ|x)}[θ²] = ∫_{−∞}^{∞} β_c θ² exp(−(x − θ)²/2) / (π√(2π)(β_c² + θ²) m_c(x)) dθ

= ∫_{−∞}^{∞} (θ² + β_c² − β_c²) β_c exp(−(x − θ)²/2) / (π√(2π)(β_c² + θ²) m_c(x)) dθ

= (β_c/(π m_c(x))) ∫_{−∞}^{∞} exp(−(x − θ)²/2)/√(2π) dθ − (β_c²/m_c(x)) ∫_{−∞}^{∞} β_c exp(−(x − θ)²/2) / (π√(2π)(β_c² + θ²)) dθ

= β_c/(π m_c(x)) − β_c²

= β_c/Re(R(β_c − ix)) − β_c² (by (C.2.11)). (C.2.15)

Thus, (C.2.13) follows from (C.2.12) and (C.2.15). □

A routine which enables one to evaluate Mills' Ratio for complex arguments exists in IMSL (IMSL, 1989b): the IMSL functions ERFC and CERFE¹ are defined by

ERFC(z) = (2/√π) ∫_z^∞ exp(−u²) du, (C.2.16)

and:

CERFE(z) = exp(−z²) ERFC(−iz). (C.2.17)

To use these functions to evaluate Rubin's formula, we must express Mills' Ratio in terms of ERFC:

R(t) = ∫_t^∞ exp(−u²/2) du / exp(−t²/2) = √2 ∫_{t/√2}^∞ exp(−y²) dy / exp(−t²/2)   (letting y = u/√2)

= √(π/2) ERFC(t/√2) / exp(−t²/2). (C.2.18)

Evaluating (C.2.18) for the complex argument z yields

R(z) = √(π/2) ERFC(z/√2) / exp(−z²/2)

= √(π/2) [exp(−(iz/√2)²) ERFC(−i(iz/√2))] exp((iz/√2)²) / exp(−z²/2)

= √(π/2) exp(−(iz/√2)²) ERFC(−i(iz/√2))   (since exp((iz/√2)²) = exp(−z²/2))

= √(π/2) CERFE(iz/√2). (C.2.19)

1 ZERFE, the double-precision analog to CERFE, was the function actually used.

Hence, from (C.2.19),

R(β_c − ix) = √(π/2) CERFE(i(β_c − ix)/√2)

= √(π/2) CERFE((β_c i + x)/√2). (C.2.20)
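In the absence of IMSL, Rubin's identity (C.2.10) and the formula (C.2.12) can be checked with plain numerical quadrature; the Python sketch below (grid sizes and integration limits are ad hoc choices) computes R(β_c − ix) directly from (C.2.10) and compares the resulting posterior mean with a direct evaluation of (C.2.6):

```python
import cmath, math

# Numerical check of Rubin's device (C.2.10)-(C.2.12), with a simple
# trapezoid rule standing in for IMSL's complex-erfc routine.
def trapezoid(f, a, b, n=20000):
    h = (b - a) / n
    return h * (0.5 * f(a) + 0.5 * f(b) + sum(f(a + k * h) for k in range(1, n)))

def mills_complex(beta, x):
    # R(beta - ix) = int exp(-(u-x)^2/2) / (sqrt(2 pi)(beta - iu)) du
    g = lambda u: cmath.exp(-(u - x) ** 2 / 2.0) / (math.sqrt(2.0 * math.pi) * (beta - 1j * u))
    return trapezoid(g, x - 12.0, x + 12.0)

def cauchy_posterior_mean(beta, x):
    R = mills_complex(beta, x)
    return beta * R.imag / R.real          # (C.2.12)

def cauchy_posterior_mean_direct(beta, x):
    # direct evaluation of (C.2.6); the common constants cancel
    w = lambda t: math.exp(-(x - t) ** 2 / 2.0) / (beta ** 2 + t ** 2)
    num = trapezoid(lambda t: t * w(t), x - 12.0, x + 12.0)
    den = trapezoid(w, x - 12.0, x + 12.0)
    return num / den

print(cauchy_posterior_mean(1.0, 2.0), cauchy_posterior_mean_direct(1.0, 2.0))
```

The two evaluations agree to many digits, since the integrands decay like exp(−(u − x)²/2) and the truncated tails are negligible.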

C.3: Normal Likelihood, Double Exponential Prior

As before,

f(x|θ) = (1/√(2π)) exp(−(x − θ)²/2). (C.3.1)

But we place a different prior on θ:

π_D(θ) = (1/(2τ)) exp(−|θ|/τ). (C.3.2)

We wish to derive expressions for

m_D(x) = ∫_{−∞}^{∞} f(x|θ) π_D(θ) dθ, (C.3.3)

δ_D(x) = E^{π_D(θ|x)}[θ], (C.3.4)

and:

V_D(x) = E^{π_D(θ|x)}[(θ − δ_D(x))²], (C.3.5)

where π_D(θ|x) is the posterior distribution of θ given X = x. The results are summarized in the following proposition:

Proposition C.3.1: Using the notation defined thus far,

m_D(x) = (φ(x)/(2τ)) (R(x + 1/τ) + R(−(x − 1/τ))); (C.3.6)

δ_D(x) = [(x + 1/τ) R(x + 1/τ) + (x − 1/τ) R(−(x − 1/τ))] / [R(x + 1/τ) + R(−(x − 1/τ))]; (C.3.7)

and

V_D(x) = {R(x + 1/τ)[1 + (x + 1/τ)²] + R(−(x − 1/τ))[1 + (x − 1/τ)²] − 2/τ} / (R(x + 1/τ) + R(−(x − 1/τ)))

− ([(x + 1/τ) R(x + 1/τ) + (x − 1/τ) R(−(x − 1/τ))] / [R(x + 1/τ) + R(−(x − 1/τ))])². (C.3.8)

Proof: Substituting (C.3.3) into (C.3.4), we obtain

δ_D(x) = E^{π_D(θ|x)}[θ] = ∫ θ f(x|θ) π_D(θ) dθ / ∫ f(x|θ) π_D(θ) dθ = ∫ θ f(x|θ) π_D(θ) dθ / m_D(x). (C.3.9)

Now,

∫ θ f(x|θ) π_D(θ) dθ = ∫_{−∞}^{0} θ exp((−τ(x − θ)² + 2θ)/(2τ)) / (2τ√(2π)) dθ + ∫_{0}^{∞} θ exp((−τ(x − θ)² − 2θ)/(2τ)) / (2τ√(2π)) dθ. (C.3.10)

Rewriting the arguments of the exponential functions yields

(−τ(x² − 2xθ + θ²) + 2θ)/(2τ) = (−τθ² + 2θ(xτ + 1) − τx²)/(2τ) = −(1/2)(θ − (x + 1/τ))² + ((x + 1/τ)² − x²)/2 (C.3.11)

and:

(−τ(x² − 2xθ + θ²) − 2θ)/(2τ) = −(1/2)(θ − (x − 1/τ))² + ((x − 1/τ)² − x²)/2. (C.3.12)

Substituting (C.3.11) and (C.3.12) into (C.3.10) and re-arranging, we obtain:

∫ θ f(x|θ) π_D(θ) dθ = [exp(((x + 1/τ)² − x²)/2) / (2τ)] Φ(−(x + 1/τ)) ∫_{−∞}^{0} θ φ(θ − (x + 1/τ)) / Φ(−(x + 1/τ)) dθ

+ [exp(((x − 1/τ)² − x²)/2) / (2τ)] Φ(x − 1/τ) ∫_{0}^{∞} θ φ(θ − (x − 1/τ)) / Φ(x − 1/τ) dθ. (C.3.13)

The integrals in (C.3.13) represent, respectively, E[S_1] and E[S_2], where S_1 and S_2 have the truncated normal distributions (Johnson and Kotz, 1970):

S_1 ~ N_T(x + 1/τ, 1, −∞, 0), (C.3.14)

S_2 ~ N_T(x − 1/τ, 1, 0, ∞). (C.3.15)

Johnson and Kotz provide formulae for the first and second moments (the latter will be used later). Using Mills' Ratio as defined in (C.2.9), we have

E[S_1] = x + 1/τ − 1/R(x + 1/τ); (C.3.16)

E[S_2] = x − 1/τ + 1/R(−(x − 1/τ)); (C.3.17)

E[S_1²] = 1 + (x + 1/τ)² − (x + 1/τ)/R(x + 1/τ); (C.3.18)

E[S_2²] = 1 + (x − 1/τ)² + (x − 1/τ)/R(−(x − 1/τ)). (C.3.19)

Substituting (C.3.16) and (C.3.17) into (C.3.13) yields

∫ θ f(x|θ) π_D(θ) dθ = [exp(x/τ + 1/(2τ²)) Φ(−(x + 1/τ)) / (2τ)] [x + 1/τ − 1/R(x + 1/τ)]

+ [exp(−x/τ + 1/(2τ²)) Φ(x − 1/τ) / (2τ)] [x − 1/τ + 1/R(−(x − 1/τ))]

= (φ(x)/(2τ)) [(x + 1/τ) R(x + 1/τ) + (x − 1/τ) R(−(x − 1/τ))], (C.3.20)

where we have used Φ(−(x + 1/τ)) = φ(x) exp(−x/τ − 1/(2τ²)) R(x + 1/τ) and Φ(x − 1/τ) = φ(x) exp(x/τ − 1/(2τ²)) R(−(x − 1/τ)).

Furthermore,

m_D(x) = ∫_{−∞}^{0} exp((−τ(x − θ)² + 2θ)/(2τ)) / (2τ√(2π)) dθ + ∫_{0}^{∞} exp((−τ(x − θ)² − 2θ)/(2τ)) / (2τ√(2π)) dθ, (C.3.21)

and, using (C.3.11) and (C.3.12), (C.3.21) can be simplified to (C.3.6). Then combining (C.3.6) and (C.3.20) yields (C.3.7). Furthermore, we can evaluate E^{π_D(θ|x)}[θ²] by the same methods:

∫ θ² f(x|θ) π_D(θ) dθ = [exp(x/τ + 1/(2τ²)) Φ(−(x + 1/τ)) / (2τ)] E[S_1²] + [exp(−x/τ + 1/(2τ²)) Φ(x − 1/τ) / (2τ)] E[S_2²] (C.3.22)

= (φ(x)/(2τ)) {R(x + 1/τ)[1 + (x + 1/τ)²] − (x + 1/τ) + R(−(x − 1/τ))[1 + (x − 1/τ)²] + (x − 1/τ)}

= (φ(x)/(2τ)) {R(x + 1/τ)[1 + (x + 1/τ)²] + R(−(x − 1/τ))[1 + (x − 1/τ)²] − 2/τ}. (C.3.23)

From (C.3.6) and (C.3.23), it follows that:

E^{π_D(θ|x)}[θ²] = ∫ θ² f(x|θ) π_D(θ) dθ / m_D(x)

= {R(x + 1/τ)[1 + (x + 1/τ)²] + R(−(x − 1/τ))[1 + (x − 1/τ)²] − 2/τ} / (R(x + 1/τ) + R(−(x − 1/τ))). (C.3.24)

Then, combining (C.3.7) and (C.3.24) yields (C.3.8), which completes the proof. □

C.4: Approximate Computational Formulae

Some computing packages (e.g., IMSL, 1987) have routines to compute Mills' Ratio² directly, relieving the statistician of the need to utilize numerical integration routines. Unfortunately, for reasons of underflow/overflow, IMSL cannot evaluate R(t) for |t| "large", with "large" having different meanings on different computer systems. However, there exist simple approximations which are quite close to δ_D(x), E^{π_D(θ|x)}[θ²], and m_D(x).

Define for x ≥ 0:

δ_D^A(x) = x − 1/τ; (C.4.1)

m_D^A(x) = exp(−|x|/τ + 1/(2τ²)) / (2τ); (C.4.2)

and:

E_D^A(x) = 1 + (x − 1/τ)². (C.4.3)

2 But one must note: the IMSL functions AMILLR and DMILLR take Mills' Ratio to be φ(t)/(1 − Φ(t)) rather than (1 − Φ(t))/φ(t). Why this is the case is unclear from the documentation: since both underflow and overflow can undermine the computation of Mills' Ratio, the advantages gained from this nonstandard definition are not apparent.

Note that δ_D(x) and δ_D^A(x) are odd functions, and that m_D(x), m_D^A(x), E^{π_D(θ|x)}[θ²], and E_D^A(x) are even functions. Hence the approximations are as good for negative values of x as for positive values of x. How good they are is shown by the following result:

Proposition C.4.1: For any values x > 0, τ > 0 we have:

    |δ_D(x) − δ̃_D(x)| ≤ (2/τ) R(x + 1/τ)/R(−(x − 1/τ));                  (C.4.4)

    |m_D(x) − m̃_D(x)| ≤ (1/(2τ))[φ(x)R(x + 1/τ)
        + exp(−x/τ + 1/(2τ²))(1 − Φ(x − 1/τ))];                           (C.4.5)

    |E^{π_D(θ|x)}[θ²] − Ẽ_D(x)| < 6/(τR(−(x − 1/τ))).                     (C.4.6)
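These bounds are easily verified numerically at moderate arguments. A sketch (Python, with helper names of my own devising; the Mills' ratio is computed directly from erfc, which is adequate away from the extreme tails):

```python
import math

def phi(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def R(t):
    # Mills' ratio (1 - Phi(t)) / phi(t)
    return 0.5 * math.erfc(t / math.sqrt(2.0)) / phi(t)

def gap_and_bound_mean(x, tau):
    # |delta_D(x) - delta_tilde(x)| and the (C.4.4) bound, for x > 0
    t1, t2 = x + 1.0 / tau, x - 1.0 / tau
    delta = (t1 * R(t1) + t2 * R(-t2)) / (R(t1) + R(-t2))
    return abs(delta - (x - 1.0 / tau)), (2.0 / tau) * R(t1) / R(-t2)

def gap_and_bound_marginal(x, tau):
    # |m_D(x) - m_tilde(x)| and the (C.4.5) bound, for x > 0
    t1, t2 = x + 1.0 / tau, x - 1.0 / tau
    m = phi(x) * (R(t1) + R(-t2)) / (2.0 * tau)
    m_tilde = math.exp(-x / tau + 1.0 / (2.0 * tau * tau)) / (2.0 * tau)
    bound = (phi(x) * R(t1)
             + math.exp(-x / tau + 1.0 / (2.0 * tau * tau)) * (1.0 - Phi(t2))
             ) / (2.0 * tau)
    return abs(m - m_tilde), bound

def gap_and_bound_second_moment(x, tau):
    # |E[theta^2 | x] - E_tilde(x)| and the (C.4.6) bound, for x > 0
    t1, t2 = x + 1.0 / tau, x - 1.0 / tau
    e2 = (R(t1) * (1.0 + t1 * t1) + R(-t2) * (1.0 + t2 * t2) - 2.0 / tau) \
        / (R(t1) + R(-t2))
    return abs(e2 - (1.0 + t2 * t2)), 6.0 / (tau * R(-t2))
```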

Proof. To prove (C.4.4), rewrite (C.3.8) as:

    δ_D(x) = { x[R(x + 1/τ) + R(−(x − 1/τ))]
                 + (1/τ)[R(x + 1/τ) − R(−(x − 1/τ))] } / [R(x + 1/τ) + R(−(x − 1/τ))]

           = x + [R(x + 1/τ) − R(−(x − 1/τ))] / ( τ[R(x + 1/τ) + R(−(x − 1/τ))] ).   (C.4.7)

From (C.4.7) and (C.4.1), since x > 0:

    |δ_D(x) − δ̃_D(x)| = (1/τ) | [R(x + 1/τ) − R(−(x − 1/τ))]/[R(x + 1/τ) + R(−(x − 1/τ))] + 1 |

        = (2/τ) R(x + 1/τ)/[R(x + 1/τ) + R(−(x − 1/τ))]

        ≤ (2/τ) R(x + 1/τ)/R(−(x − 1/τ)).                                 (C.4.8)

Apropos (C.4.5), since m_D(x) = φ(x)[R(x + 1/τ) + R(−(x − 1/τ))]/(2τ), m̃_D(x) = exp(−x/τ + 1/(2τ²))/(2τ) for x > 0, and φ(x)R(−(x − 1/τ)) = exp(−x/τ + 1/(2τ²))Φ(x − 1/τ):

    |m_D(x) − m̃_D(x)|
        = | (1/(2τ))[φ(x)R(x + 1/τ) + φ(x)R(−(x − 1/τ))] − exp(−x/τ + 1/(2τ²))/(2τ) |

        = (1/(2τ)) | φ(x)R(x + 1/τ) − exp(−x/τ + 1/(2τ²))(1 − Φ(x − 1/τ)) |

        ≤ (1/(2τ))[φ(x)R(x + 1/τ) + exp(−x/τ + 1/(2τ²))(1 − Φ(x − 1/τ))].   (C.4.9)

Finally,

    |E^{π_D(θ|x)}[θ²] − Ẽ_D(x)|
        = | { R(x + 1/τ)[1 + (x + 1/τ)²] + R(−(x − 1/τ))[1 + (x − 1/τ)²] − 2/τ }
              / [R(x + 1/τ) + R(−(x − 1/τ))] − [1 + (x − 1/τ)²] |

        = | R(x + 1/τ)[(x + 1/τ)² − (x − 1/τ)²] − 2/τ | / [R(x + 1/τ) + R(−(x − 1/τ))]

        = | (4x/τ)R(x + 1/τ) − 2/τ | / [R(x + 1/τ) + R(−(x − 1/τ))]

        ≤ [4xR(x + 1/τ) + 2] / [τR(−(x − 1/τ))]

        ≤ [4(x + 1/τ)R(x + 1/τ) + 2] / [τR(−(x − 1/τ))].                  (C.4.10)

But it has been shown (Gordon, 1941) that R(y) < 1/y for y > 0, so (C.4.6) follows from (C.4.10), completing the proof of Proposition C.4.1. □

It can easily be shown that ∂R(t)/∂t < 0 for all t; therefore, the right-hand sides of (C.4.4) and (C.4.6) are strictly decreasing in x. Likewise, the right-hand side of (C.4.5) is decreasing in x. For x = 10, τ = 1/4,

    R(x + 1/τ)/R(−(x − 1/τ)) < 10⁻⁸,                                      (C.4.11)

and:

    (1/(2τ))[φ(x)R(x + 1/τ) + exp(−x/τ + 1/(2τ²))(1 − Φ(x − 1/τ))]
        < 2[10⁻²² + (1.3)(10⁻¹⁴)(10⁻⁹)];                                  (C.4.12)

while

    6/(τR(−(x − 1/τ))) < 10⁻⁶.                                            (C.4.13)

APPENDIX D

SOME FORTRAN AND PASCAL PROGRAMS USED IN CHAPTER IV

c this module implements the formulae given in appendix c for posterior
c mean, variance and predictive density
c
c this function finds the bayes rule for the cauchy prior
c and as a "sideshow" computes the marginal evaluated at obs
c and the posterior variance
      double precision function estc()

      double precision obs,mn,md,mc,nscal,cscal,dscal,pi,a,b
      double precision dsqrt,dreal,dimag,pvc,pvd,pvn
      double complex zerfe,z,dcmplx,mill

      common obs,mn,md,mc,pvc,pvd,pvn,nscal,cscal,dscal
      intrinsic dreal,dsqrt,dimag,dcmplx
      external zerfe

      pi = 3.1415926535897930d0
      z = dcmplx(obs/dsqrt(2.0d0),cscal/dsqrt(2.0d0))
      mill = dsqrt(pi/2.0d0)*zerfe(z)
      a = dreal(mill)
      b = dimag(mill)
      mc = a/pi
      pvc = cscal/a - cscal*cscal - (cscal*b/a)**2.0d0
      estc = cscal*b/a
      return
      end
ccccccccccccccccccccccccccccccccccccccccccccccccc
c this function finds the bayes rule for the d.e. prior
c and as a "sideshow" computes the marginal evaluated at obs
c and the posterior variance evaluated at obs
      double precision function estd()

      double precision t1,t2,t3,dexp,dabs,dmillr,pvc,pvd,pvn
      double precision obs,mn,md,mc,nscal,cscal,dscal,dsqrt,ub,pi

      common obs,mn,md,mc,pvc,pvd,pvn,nscal,cscal,dscal
      intrinsic dexp,dabs,dsqrt
      external dmillr

      pi = 3.1415926535897930d0
      ub = 13.0d0
c imsl's mills ratio will work only for arguments with absolute value
c smaller than a number greater than 13, hence when our argument
c exceeds the upper bound ub we must use approximation techniques
      t1 = obs + 1.0d0/dscal
      t2 = obs - 1.0d0/dscal
      if ((dabs(t1).le.ub).and.(dabs(t2).le.ub)) then
c obs is small enough that we have no need of approximations
      md = 1.0d0/dmillr(t1) + 1.0d0/dmillr(-1.0d0*t2)
      t3 = t1/dmillr(t1) + t2/dmillr(-1.0d0*t2)
      t3 = t3/md
      pvd = 1.0d0/dmillr(t1)*(1.0d0 + t1*t1)
      pvd = pvd + 1.0d0/dmillr(-1.0d0*t2)*(1.0d0 + t2*t2)
      pvd = (pvd - 2.0d0/dscal)/md - t3*t3
      md = md*dexp(-0.50d0*obs*obs)/(2.0d0*dscal*dsqrt(2.0d0*pi))
      else
c we will use asymptotic approximations
      t3 = obs - dabs(obs)/(obs*dscal)
      md = dexp(-1.0d0*dabs(obs)/dscal + 0.5d0/(dscal*dscal))
      md = md/(2.0d0*dscal)
      pvd = 1.0d0
      endif
      estd = t3
      return
      end
ccccccccccccccccccccccccccccccccccccccccccccccccc
c this function returns the bayes rule for a normal prior
c and as a "sideshow" computes the marginal evaluated at obs
c and the posterior variance
      double precision function estn()

      double precision dexp,dsqrt,pvc,pvd,pvn
      double precision obs,mn,md,mc,nscal,cscal,dscal,pi,t

      common obs,mn,md,mc,pvc,pvd,pvn,nscal,cscal,dscal
      intrinsic dexp,dsqrt

      pi = 3.1415926535897930d0
      mn = dexp(-1.0d0*obs*obs/(2.0d0 + 2.0d0*nscal*nscal))
      t = 2.0d0*pi*(1.0d0 + nscal*nscal)
      mn = mn/dsqrt(t)
      pvn = nscal*nscal/(1.0d0 + nscal*nscal)
      estn = obs*nscal*nscal/(1.0d0 + nscal*nscal)
      return
      end
ccccccccccccccccccccccccccccccccccccccccccccccccc
c this program will explore hierarchical models (mixtures) for
c cauchy priors
      double precision obs,mn,md,mc,pvc,pvd,pvn,pvbayes
      double precision nscal,cscal,dscal,estc,estn,estd
      double precision ln,ld,lc,lh,rn,rd,rc,rh,th,x,en
      double precision ze,ed,ec,eh,ebayes,mbayes
      double precision ehn,ehc,ehd,rhn,rhc,rhd,lhn,lhc,lhd,lbayes
      integer k,n,co,j,l,s
      dimension x(25000),th(25000)

      common obs,mn,md,mc,pvc,pvd,pvn,nscal,cscal,dscal
      external estc,estn,estd

      equivalence (ebayes,ec),(mbayes,mc),(lbayes,lc),(pvbayes,pvc)

      parameter(ze = 0.0d0)
      n = 25000
      s = 10
      do 2, j = 1,8
      nscal = j/4.0d0
      dscal = nscal*0.871175403731d0
      cscal = nscal*2.8149142149d0

      do 555, co = 1,s

      call drnchy(25000,th)
      call drnnoa(25000,x)
      data ln,ld,lc,lh,lhn,lhc,lhd/ze,ze,ze,ze,ze,ze,ze/
      data rn,rd,rc,rh,rhn,rhc,rhd/ze,ze,ze,ze,ze,ze,ze/
      do 1, k = 1,n
      obs = x(k) + cscal*th(k)
      en = estn()
      ed = estd()
      ec = estc()
      ehn = (mbayes*ebayes + mn*en)/(mbayes + mn)
      ehd = (mbayes*ebayes + md*ed)/(mbayes + md)
      ehc = (mbayes*ebayes + mc*ec)/(mbayes + mc)
      eh = (mc*ec + md*ed + mn*en)/(mc + md + mn)
      lbayes = lbayes + pvbayes
      rn = rn + (en - ebayes)*(en - ebayes)
      rd = rd + (ed - ebayes)*(ed - ebayes)
      rc = rc + (ec - ebayes)*(ec - ebayes)
      rh = rh + (eh - ebayes)*(eh - ebayes)
      rhn = rhn + (ehn - ebayes)*(ehn - ebayes)
      rhd = rhd + (ehd - ebayes)*(ehd - ebayes)
      rhc = rhc + (ehc - ebayes)*(ehc - ebayes)
    1 continue

      lbayes = lbayes/n
      rn = rn/n
      rd = rd/n
      rc = rc/n
      rh = rh/n
      rhn = rhn/n
      rhd = rhd/n
      rhc = rhc/n
      ln = rn + lbayes
      ld = rd + lbayes
      lc = rc + lbayes
      lh = rh + lbayes
      lhn = rhn + lbayes
      lhd = rhd + lbayes
      lhc = rhc + lbayes
      write(6,4) rh/rd
      write(6,4) rh/rn
      write(6,4) lh/lbayes
      write(6,4) rhd/rd
      write(6,4) rhn/rn
      write(6,4) lhd/lbayes
      write(6,4) lhn/lbayes
  555 continue
    4 format(d30.15)
    2 continue
      end
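The combination used throughout this program — the hierarchical (mixture-hyperprior) estimate is the marginal-density-weighted average of the component Bayes rules, as in eh = (mc*ec + md*ed + mn*en)/(mc + md + mn) — can be isolated in a few lines. The sketch below (Python, for compactness; all names are mine) uses two normal component priors, so that every ingredient is available in closed form, and checks the weighted combination against direct numerical integration under the equally weighted mixture prior:

```python
import math

def normal_pdf(z, sd):
    return math.exp(-0.5 * (z / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def component(x, s):
    # marginal m(x) and posterior mean for a N(0, s^2) prior,
    # with x | theta ~ N(theta, 1): both are closed-form
    m = normal_pdf(x, math.sqrt(1.0 + s * s))
    delta = x * s * s / (1.0 + s * s)
    return m, delta

def mixture_posterior_mean(x, scales):
    # hierarchical rule: weight each component rule by its marginal
    pairs = [component(x, s) for s in scales]
    return sum(m * d for m, d in pairs) / sum(m for m, _ in pairs)

def brute_force_mean(x, scales, lo=-15.0, hi=15.0, n=300001):
    # posterior mean under the equally weighted mixture prior, by Riemann sum
    h = (hi - lo) / (n - 1)
    num = den = 0.0
    for i in range(n):
        th = lo + i * h
        prior = sum(normal_pdf(th, s) for s in scales) / len(scales)
        w = normal_pdf(x - th, 1.0) * prior
        num += th * w
        den += w
    return num / den
```

The hierarchical estimate always lies between the most and least shrunken of the component rules, with the data (through the marginals) deciding how much weight each component deserves.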

(A version of the above program, with n = 2500, used 1.6 seconds of system time, and 240 seconds of user time, on a DECsystem 5500.)

c this program will explore hierarchical models (mixtures) for
c double exp. priors
      double precision obs,mn,md,mc,pvc,pvd,pvn,pvbayes
      double precision nscal,cscal,dscal,estc,estn,estd
      double precision ln,ld,lc,lh,rn,rd,rc,rh,th,x,en
      double precision ze,ed,ec,eh,ebayes,mbayes
      double precision ehn,ehc,ehd,rhn,rhc,rhd,lhn,lhc,lhd,lbayes
      integer k,n,co,j,l,s
      dimension x(25000),th(25000),l(25000)

      common obs,mn,md,mc,pvc,pvd,pvn,nscal,cscal,dscal
      external estc,estn,estd

      equivalence (ebayes,ed),(mbayes,md),(lbayes,ld),(pvbayes,pvd)

      parameter(ze = 0.0d0)
      n = 25000
      s = 10
      do 2, j = 1,8
      nscal = j/4.0d0
      dscal = nscal*0.871175403731d0
      cscal = nscal*2.8149142149d0

      do 555, co = 1,s

      call rnund(25000,2,l)
      call drnexp(25000,th)
      call drnnoa(25000,x)
      data ln,ld,lc,lh,lhn,lhc,lhd/ze,ze,ze,ze,ze,ze,ze/
      data rn,rd,rc,rh,rhn,rhc,rhd/ze,ze,ze,ze,ze,ze,ze/
      do 1, k = 1,n
      if (l(k).eq.1) then
      obs = x(k) + dscal*th(k)
      else
      obs = x(k) - dscal*th(k)
      endif
      en = estn()
      ed = estd()
      ec = estc()
      ehn = (mbayes*ebayes + mn*en)/(mbayes + mn)
      ehd = (mbayes*ebayes + md*ed)/(mbayes + md)
      ehc = (mbayes*ebayes + mc*ec)/(mbayes + mc)
      eh = (mc*ec + md*ed + mn*en)/(mc + md + mn)
      lbayes = lbayes + pvbayes
      rn = rn + (en - ebayes)*(en - ebayes)
      rd = rd + (ed - ebayes)*(ed - ebayes)
      rc = rc + (ec - ebayes)*(ec - ebayes)
      rh = rh + (eh - ebayes)*(eh - ebayes)
      rhn = rhn + (ehn - ebayes)*(ehn - ebayes)
      rhd = rhd + (ehd - ebayes)*(ehd - ebayes)
      rhc = rhc + (ehc - ebayes)*(ehc - ebayes)
    1 continue

      lbayes = lbayes/n
      rn = rn/n
      rd = rd/n
      rc = rc/n
      rh = rh/n
      rhn = rhn/n
      rhd = rhd/n
      rhc = rhc/n
      ln = rn + lbayes
      ld = rd + lbayes
      lc = rc + lbayes
      lh = rh + lbayes
      lhn = rhn + lbayes
      lhd = rhd + lbayes
      lhc = rhc + lbayes
      write(6,4) rh/rc
      write(6,4) rh/rn
      write(6,4) lh/lbayes
      write(6,4) rhc/rc
      write(6,4) rhn/rn
      write(6,4) lhc/lbayes
      write(6,4) lhn/lbayes
  555 continue

    4 format(d30.15)
    2 continue
      end
ccccccccccccccccccccccccccccccccccccccccccccccccc
c this program will explore hierarchical models (mixtures) for
c normal priors
      double precision obs,mn,md,mc,pvc,pvd,pvn,pvbayes
      double precision nscal,cscal,dscal,estc,estn,estd
      double precision ln,ld,lc,lh,rn,rd,rc,rh,th,x,en
      double precision ze,ed,ec,eh,ebayes,mbayes
      double precision ehn,ehc,ehd,rhn,rhc,rhd,lhn,lhc,lhd,lbayes
      integer k,n,co,j,l,s
      dimension x(25000),th(25000)

      common obs,mn,md,mc,pvc,pvd,pvn,nscal,cscal,dscal
      external estc,estn,estd

      equivalence (ebayes,en),(mbayes,mn),(lbayes,ln),(pvbayes,pvn)

      parameter(ze = 0.0d0)
      n = 25000
      s = 10
      do 2, j = 1,8
      nscal = j/4.0d0
      dscal = nscal*0.871175403731d0
      cscal = nscal*2.8149142149d0

      do 555, co = 1,s

      call drnnoa(25000,th)
      call drnnoa(25000,x)
      data ln,ld,lc,lh,lhn,lhc,lhd/ze,ze,ze,ze,ze,ze,ze/
      data rn,rd,rc,rh,rhn,rhc,rhd/ze,ze,ze,ze,ze,ze,ze/
      do 1, k = 1,n

      obs = x(k) + nscal*th(k)
      en = estn()
      ed = estd()
      ec = estc()
      ehn = (mbayes*ebayes + mn*en)/(mbayes + mn)
      ehd = (mbayes*ebayes + md*ed)/(mbayes + md)
      ehc = (mbayes*ebayes + mc*ec)/(mbayes + mc)
      eh = (mc*ec + md*ed + mn*en)/(mc + md + mn)
      lbayes = lbayes + pvbayes
      rn = rn + (en - ebayes)*(en - ebayes)
      rd = rd + (ed - ebayes)*(ed - ebayes)
      rc = rc + (ec - ebayes)*(ec - ebayes)
      rh = rh + (eh - ebayes)*(eh - ebayes)
      rhn = rhn + (ehn - ebayes)*(ehn - ebayes)
      rhd = rhd + (ehd - ebayes)*(ehd - ebayes)
      rhc = rhc + (ehc - ebayes)*(ehc - ebayes)
    1 continue

      lbayes = lbayes/n
      rn = rn/n
      rd = rd/n
      rc = rc/n
      rh = rh/n
      rhn = rhn/n
      rhd = rhd/n
      rhc = rhc/n
      ln = rn + lbayes
      ld = rd + lbayes
      lc = rc + lbayes
      lh = rh + lbayes
      lhn = rhn + lbayes
      lhd = rhd + lbayes
      lhc = rhc + lbayes
      write(6,4) rh/rc
      write(6,4) rh/rd
      write(6,4) lh/lbayes
      write(6,4) rhc/rc
      write(6,4) rhd/rd
      write(6,4) lhc/lbayes
      write(6,4) lhd/lbayes
  555 continue

    4 format(d30.15)
    2 continue
      end

program inout;
{$N+}
{this program inputs the above output and summarizes/reports}
{the results in tabular form}
type
  garray = array[1..8] of extended;
  dataset = array[1..50] of extended;
var
  ic,id,inor: text;
  tregc3d,tregc3n,tregd3c,tregd3n,tregn3c,tregn3d: dataset;
  trisc2d,trisc2n,trisd2c,trisd2n,trisn2c,trisn2d: dataset;
  tregc2d,tregc2n,tregd2c,tregd2n,tregn2c,tregn2d: dataset;
  trisc3,trisd3,trisn3: dataset;
  k,j,numsims: integer;
  o3tenri,o3tenre,o2tenri,o2tenre: text;
  risc3,risd3,risn3: garray;
  regc3d,regc3n,regd3c,regd3n,regn3c,regn3d: garray;
  risc2d,risc2n,risd2c,risd2n,risn2c,risn2d: garray;
  regc2d,regc2n,regd2c,regd2n,regn2c,regn2d: garray;
  sdrisc3,sdrisd3,sdrisn3: garray;
  sdregc3d,sdregc3n,sdregd3c,sdregd3n,sdregn3c,sdregn3d: garray;
  sdrisc2d,sdrisc2n,sdrisd2c,sdrisd2n,sdrisn2c,sdrisn2d: garray;
  sdregc2d,sdregc2n,sdregd2c,sdregd2n,sdregn2c,sdregn2d: garray;
{--------------------------------------------------}
function mean(x: dataset; n: integer): extended;
var
  k: integer;
  t: extended;
begin
  t := 0.0;
  for k := 1 to n do t := t + x[k];
  mean := t/n;
end;
{--------------------------------------------------}
function sdev(x: dataset; n: integer): extended;
var
  k: integer;
  t,mu: extended;
begin
  mu := mean(x,n);
  t := 0.0;
  for k := 1 to n do t := t + (x[k] - mu)*(x[k] - mu);
  sdev := sqrt(t/(n-1));
end;
{--------------------------------------------------}
begin
  numsims := 10;
  assign(ic,'c:\thesis\mixture\op10c');
  reset(ic);
  for k := 1 to 8 do
  begin
    for j := 1 to numsims do
    begin
      readln(ic,tregc3d[j]);
      readln(ic,tregc3n[j]);
      readln(ic,trisc3[j]);
      readln(ic,tregc2d[j]);
      readln(ic,tregc2n[j]);
      readln(ic,trisc2d[j]);
      readln(ic,trisc2n[j]);
    end;
    regc3d[k] := mean(tregc3d,numsims);
    regc3n[k] := mean(tregc3n,numsims);
    risc3[k] := mean(trisc3,numsims);
    regc2d[k] := mean(tregc2d,numsims);
    regc2n[k] := mean(tregc2n,numsims);
    risc2d[k] := mean(trisc2d,numsims);
    risc2n[k] := mean(trisc2n,numsims);
    sdregc3d[k] := sdev(tregc3d,numsims);
    sdregc3n[k] := sdev(tregc3n,numsims);
    sdrisc3[k] := sdev(trisc3,numsims);
    sdregc2d[k] := sdev(tregc2d,numsims);
    sdregc2n[k] := sdev(tregc2n,numsims);
    sdrisc2d[k] := sdev(trisc2d,numsims);
    sdrisc2n[k] := sdev(trisc2n,numsims);
  end;
  close(ic);
  assign(id,'c:\thesis\mixture\op10d');
  reset(id);
  for k := 1 to 8 do
  begin
    for j := 1 to numsims do
    begin
      readln(id,tregd3c[j]);
      readln(id,tregd3n[j]);
      readln(id,trisd3[j]);
      readln(id,tregd2c[j]);
      readln(id,tregd2n[j]);
      readln(id,trisd2c[j]);
      readln(id,trisd2n[j]);
    end;
    regd3c[k] := mean(tregd3c,numsims);
    regd3n[k] := mean(tregd3n,numsims);
    risd3[k] := mean(trisd3,numsims);
    regd2c[k] := mean(tregd2c,numsims);
    regd2n[k] := mean(tregd2n,numsims);
    risd2c[k] := mean(trisd2c,numsims);
    risd2n[k] := mean(trisd2n,numsims);
    sdregd3c[k] := sdev(tregd3c,numsims);
    sdregd3n[k] := sdev(tregd3n,numsims);
    sdrisd3[k] := sdev(trisd3,numsims);
    sdregd2c[k] := sdev(tregd2c,numsims);
    sdregd2n[k] := sdev(tregd2n,numsims);
    sdrisd2c[k] := sdev(trisd2c,numsims);
    sdrisd2n[k] := sdev(trisd2n,numsims);
  end;
  close(id);
  assign(inor,'c:\thesis\mixture\op10n');
  reset(inor);
  for k := 1 to 8 do
  begin
    for j := 1 to numsims do
    begin
      readln(inor,tregn3c[j]);
      readln(inor,tregn3d[j]);
      readln(inor,trisn3[j]);
      readln(inor,tregn2c[j]);
      readln(inor,tregn2d[j]);
      readln(inor,trisn2c[j]);
      readln(inor,trisn2d[j]);
    end;
    regn3c[k] := mean(tregn3c,numsims);
    regn3d[k] := mean(tregn3d,numsims);
    risn3[k] := mean(trisn3,numsims);
    regn2c[k] := mean(tregn2c,numsims);
    regn2d[k] := mean(tregn2d,numsims);
    risn2c[k] := mean(trisn2c,numsims);
    risn2d[k] := mean(trisn2d,numsims);
    sdregn3c[k] := sdev(tregn3c,numsims);
    sdregn3d[k] := sdev(tregn3d,numsims);
    sdrisn3[k] := sdev(trisn3,numsims);
    sdregn2c[k] := sdev(tregn2c,numsims);
    sdregn2d[k] := sdev(tregn2d,numsims);
    sdrisn2c[k] := sdev(trisn2c,numsims);
    sdrisn2d[k] := sdev(trisn2d,numsims);
  end;
  close(inor);
  assign(o3tenri,'c:\thesis\mixture\o3tenri');
  rewrite(o3tenri);
  for k := 1 to 8 do
  begin
    write(o3tenri,(k/4):3:2,',',risc3[k]:6:4,',',sdrisc3[k]:6:4,',');
    write(o3tenri,risd3[k]:6:4,',',sdrisd3[k]:6:4,',');
    writeln(o3tenri,risn3[k]:6:4,',',sdrisn3[k]:6:4);
  end;
  close(o3tenri);
  assign(o3tenre,'c:\thesis\mixture\o3tenre');
  rewrite(o3tenre);
  for k := 1 to 8 do
  begin
    write(o3tenre,(k/4):3:2,',',regc3d[k]:6:4,',',sdregc3d[k]:6:4,',');
    write(o3tenre,regc3n[k]:6:4,',',sdregc3n[k]:6:4,',');
    write(o3tenre,regd3c[k]:6:4,',',sdregd3c[k]:6:4,',');
    write(o3tenre,regd3n[k]:6:4,',',sdregd3n[k]:6:4,',');
    write(o3tenre,regn3c[k]:6:4,',',sdregn3c[k]:6:4,',');
    writeln(o3tenre,regn3d[k]:6:4,',',sdregn3d[k]:6:4);
  end;
  close(o3tenre);
  assign(o2tenre,'c:\thesis\mixture\o2tenre');
  rewrite(o2tenre);
  for k := 1 to 8 do
  begin
    write(o2tenre,(k/4):3:2,',',regc2d[k]:6:4,',',sdregc2d[k]:6:4,',');
    write(o2tenre,regc2n[k]:6:4,',',sdregc2n[k]:6:4,',');
    write(o2tenre,regd2c[k]:6:4,',',sdregd2c[k]:6:4,',');
    write(o2tenre,regd2n[k]:6:4,',',sdregd2n[k]:6:4,',');
    write(o2tenre,regn2c[k]:6:4,',',sdregn2c[k]:6:4,',');
    writeln(o2tenre,regn2d[k]:6:4,',',sdregn2d[k]:6:4);
  end;
  close(o2tenre);
  assign(o2tenri,'c:\thesis\mixture\o2tenri');
  rewrite(o2tenri);
  for k := 1 to 8 do
  begin
    write(o2tenri,(k/4):3:2,',',risc2d[k]:6:4,',',sdrisc2d[k]:6:4,',');
    write(o2tenri,risc2n[k]:6:4,',',sdrisc2n[k]:6:4,',');
    write(o2tenri,risd2c[k]:6:4,',',sdrisd2c[k]:6:4,',');
    write(o2tenri,risd2n[k]:6:4,',',sdrisd2n[k]:6:4,',');
    write(o2tenri,risn2c[k]:6:4,',',sdrisn2c[k]:6:4,',');
    writeln(o2tenri,risn2d[k]:6:4,',',sdrisn2d[k]:6:4);
  end;
  close(o2tenri);
end.

APPENDIX E

SOME FORTRAN PROGRAMS USED IN CHAPTER V

c this program is used in section 5.4 to find gamma-minimax regret rules
c for the problem of testing a simple vs. simple hypothesis in the
c normal context

      integer m,n,j
      double precision a0,a1,aeps,eps,r1,r0,regret,left,right

      external regret

      do 300, m = 1,9
      a0 = m/10.0d0
      do 200, n = m+1,9
      a1 = n/10.0d0
      left = 0.0d0
      right = 1.0d0
      do 100, j = 1,34
      eps = 0.5d0*(left + right)
      aeps = eps*a1 + (1.0d0 - eps)*a0
      if ((regret(a1,aeps)).lt.(regret(a0,aeps))) right = eps
      if ((regret(a1,aeps)).gt.(regret(a0,aeps))) left = eps
  100 continue
      r0 = regret(a0,aeps)/regret(a0,a1)
      r1 = regret(a1,aeps)/regret(a1,a0)
      write(6,1) a0,a1,eps,r0,r1
    1 format(d10.5,',',d10.5,',',d10.5,',',d10.5,',',d10.5)
  200 continue
  300 continue
      end
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
c this function finds the regret of using the rule based on prior
c probability au if the true prior probability is at
      double precision function regret(at,au)

      double precision at,au,risk

      external risk

      regret = risk(at,au) - risk(at,at)
      return
      end
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
c this function finds the risk of using the rule based on prior
c probability au if the true prior probability is at
      double precision function risk(at,au)

      double precision at,au,dlog,dnordf,t

      external dnordf,dlog

      t = dlog((1.0d0 - au)/au) + 0.5d0

      risk = at*dnordf(t - 1.0d0) + (1.0d0 - at)*dnordf(-1.0d0*t)
      return
      end
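The bisection above can be sketched compactly in Python (names are mine; the IMSL routine dnordf is replaced by an erfc-based normal c.d.f.). The rule sought equalizes the regrets at the two extreme prior probabilities a0 and a1:

```python
import math

def Phi(z):
    # standard normal c.d.f. (stands in for IMSL's dnordf)
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def risk(at, au):
    # risk of the rule built from prior probability au when the true
    # prior probability is at, exactly as in the function above
    t = math.log((1.0 - au) / au) + 0.5
    return at * Phi(t - 1.0) + (1.0 - at) * Phi(-t)

def regret(at, au):
    return risk(at, au) - risk(at, at)

def minimax_regret_prior(a0, a1, iters=60):
    # bisect on eps, as in the do-100 loop, until the regrets at the
    # two extreme priors a0 and a1 are equalized
    left, right = 0.0, 1.0
    for _ in range(iters):
        eps = 0.5 * (left + right)
        aeps = eps * a1 + (1.0 - eps) * a0
        if regret(a1, aeps) < regret(a0, aeps):
            right = eps
        else:
            left = eps
    eps = 0.5 * (left + right)
    return eps * a1 + (1.0 - eps) * a0
```

After convergence the two regrets agree essentially to machine precision, which is the equal-regret property the Section 5.4 computations exploit.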

c this program compares the use of the gamma-minimax regret rule
c with the rule based on hyperprior probability = 1/2
      integer m,n,j
      double precision a0,a1,aeps,eps,r1,r0,regret,left,right
      double precision app0,app1

      external regret

      do 300, m = 1,4
      a0 = m/10.0d0
      do 200, n = m+1,9
      if ((m+n).eq.10) goto 200
      a1 = n/10.0d0
      left = 0.0d0
      right = 1.0d0
      do 100, j = 1,34
      eps = 0.5d0*(left + right)
      aeps = eps*a1 + (1.0d0 - eps)*a0
      if ((regret(a1,aeps)).lt.(regret(a0,aeps))) right = eps
      if ((regret(a1,aeps)).gt.(regret(a0,aeps))) left = eps
  100 continue
      r0 = regret(a0,aeps)/regret(a0,a1)
      r1 = regret(a1,aeps)/regret(a1,a0)
      app0 = regret(a0,0.5d0*(a0+a1))/regret(a0,a1)
      app1 = regret(a1,0.5d0*(a0+a1))/regret(a1,a0)
      write(6,1) a0,a1,(app0/r0),(app1/r1)
    1 format(d10.5,',',d10.5,',',d10.5,',',d10.5)
  200 continue
  300 continue
      end

LIST OF REFERENCES

Aitchison, J. and Dunsmore, I. 1975. Statistical Prediction Analysis. Cambridge University Press: London.

Anton, H. 1984. Calculus with Analytic Geometry. Wiley: New York.

Berger, J. O. 1984. "The Robust Bayesian Viewpoint." In Robustness of Bayesian Analyses, J. Kadane (Ed.). North-Holland: Amsterdam.

Berger, J. O. 1985. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag: New York.

Berger, J. O. and Berliner, L. M. 1986. "Robust Bayes and Empirical Bayes Analysis with ε-contaminated Priors." Ann. Statist. (14), 461-486.

Berger, J. O. and O'Hagan, A. 1988a. "Ranges of Posterior Probabilities for Unimodal Priors with Specified Quantiles." In Bayesian Statistics 3, J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith (Eds.). Oxford University Press: Oxford.

Berger, J. O. and O'Hagan, A. 1988b. "Ranges of Posterior Probabilities for Quasiunimodal Priors with Specified Quantiles." J. Amer. Stat. Assoc. (83), 503-508.

Berger, J. O. and Sivaganesan, S. 1989. "Ranges of Posterior Measures for Priors with Unimodal Contaminations." Ann. Statist. (17), 868-889.

Berger, R. 1979. "Gamma Minimax Robustness of Bayes Rules." Comm. Stat. (8), 543-560.

Berliner, L. M. and Goel, P. K. 1990. "Incorporating Partial Prior Information: Ranges of Posterior Probabilities." In Bayesian and Likelihood Methods in Statistics and Econometrics, S. Geisser, J. S. Hodges, S. J. Press, and A. Zellner (Eds.). North-Holland: Amsterdam.

Bernardo, J. M. 1979. "Reference Posterior Distributions for Bayesian Inference." J. Roy. Statist. Soc. Series B (41), 113-147.

Billingsley, P. 1986. Probability and Measure. Wiley: New York.

Blum, J. R. and Rosenblatt, J. 1967. "On Partial a priori Information in Statistical Inference. " Ann. Math. Statist. (38), 1671-1678.

Chen, L. and Eichenauer, J. 1988. "Two Point Priors and Γ-Minimax Estimating in Families of Uniform Distributions." Statistische Hefte (29), 45-57.

Csiszár, I. 1975. "I-Divergence Geometry of Probability Distributions and Minimization Problems." Ann. Prob. (3), 146-158.

Datta, G. and Ghosh, M. 1991. "Bayesian Prediction in Linear Models: Applications to Small Area Estimation." Ann. Statist. (19), 1748-1770.

Davis, P. J., and Rabinowitz, P. 1984. Methods of Numerical Integration. Academic Press: New York.

DeGroot, M. H. 1970. Optimal Statistical Decisions. McGraw-Hill: New York.

Eichenauer, J., Lehn, J., and Rettig, S. 1988. "A Gamma-Minimax Result in Credibility Theory." Insurance: Mathematics and Economics (7), 49-57.

Eichenauer-Herrmann, J. 1990. "A Gamma-Minimax Result for the Class of Symmetric and Unimodal Priors." Statistische Hefte (31), 301-304.

Everitt, B., and Hand, D. 1981. Finite Mixture Distributions. Chapman and Hall: London.

Ferguson, T. 1967. Mathematical Statistics: A Decision-Theoretic Approach. Academic Press: New York.

Geisser, S. 1990. "On Hierarchical Bayes Procedures for Predicting Simple Exponential Survival." Biometrics (46), 225-230.

Goel, P. K. 1983. "Information Measures and Bayesian Hierarchical Models." J. Amer. Stat. Assoc. (78), 408-410.

Goel, P. K. and DeGroot, M. H. 1981. "Information about Hyperparameters in Hierarchical Models." J. Amer. Stat. Assoc. (76), 140-147.

Good, I. J. 1965. The Estimation of Probabilities: An Essay on Modern Bayesian Methods. MIT Press: Cambridge.

Good, I. J. 1976. "The Bayesian Influence, or How To Sweep Subjectivism Under the Carpet." In Foundations of Statistical Inference, V. P. Godambe and D. A. Sprott (Eds.). Holt, Rinehart, and Winston: Toronto.

Good, I. J. 1980. "Some History of the Hierarchical Bayesian Methodology." In Bayesian Statistics I, J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith (Eds.). University Press: Valencia.

Good, I. J. 1983. Good Thinking: The Foundations of Probability and Its Applications. University of Minnesota Press: Minneapolis.

Gordon, R. D. 1941. "Values of Mills' Ratio of Area to Bounding Ordinate and of the Normal Probability Integral for Large Values of the Argument." Ann. Math. Statist. (12), 364-366.

IMSL. 1987. User's Manual: STAT/Library. IMSL: Houston.

IMSL. 1989a. User's Manual: MATH/Library. IMSL: Houston.

IMSL. 1989b. User's Manual: SFUN/Library. IMSL: Houston.

Johnson, N., and Kotz, S. 1970. Continuous Univariate Distributions, Vol. I. Wiley: New York.

Karlin, S. 1968. Total Positivity, Vol. I. Stanford University Press: Stanford.

Kendall, M. G., and Stuart, A. 1969. The Advanced Theory of Statistics, Vol. I: Distribution Theory. Hafner: New York.

Kullback, S. 1959. Information Theory and Statistics. Dover: New York.

Kullback, S. 1952. "An Application of Information Theory to Multivariate Analysis." Ann. Math. Statist. (23), 88-102.

Kullback, S. and Leibler, R. A. 1951. "On Information and Sufficiency. " Ann. Math. Statist. (22), 79-86.

Lehmann, E. 1983. Theory of Point Estimation. Wiley: New York.

Lenk, P. 1991. "Towards a Practicable Bayesian Nonparametric Density Estimator." Biometrika (78), 531-543.

Levi, I. 1973. Inductive Logic and the Improvement of Knowledge. Technical report, Columbia University.

Lindley, D. V. and Smith, A. F. M. 1972. "Bayes Estimates for the Linear Model." J. Roy. Statist. Soc. Series B (34), 1-41.

Lindley, D. V. 1984. Contained in Berger (1984).

Mardia, K., Kent, J. and Bibby, J. 1979. Multivariate Analysis. Academic Press: New York.

Menges, G. 1966. "On the Bayesification of the Minimax Principle." Unternehmensforschung (10), 81-91.

Mish, F. C. (Ed.-in-Chief). 1983. Webster's Ninth New Collegiate Dictionary. Merriam- Webster: Springfield, Massachusetts.

Mitrinovic, D. S. 1970. Analytic Inequalities. Springer-Verlag: New York.

Morris, C. N., and Normand, S. L. 1992. "Hierarchical Models for Combining Information and for Meta-Analyses." In Bayesian Statistics 4, J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith (Eds.). Clarendon Press: Oxford.

Norberg, R. 1989. "A Class of Conjugate Hierarchical Priors for Gammoid Likelihoods." Scand. Actuarial J. (no volume), 177-193.

Robbins, H. 1964. "The Empirical Bayes Approach to Statistical Decision Problems." Ann. Math. Statist. (35), 1-20.

Rogers, J. M. 1974. Some Measures of Compromises Between Bayesian and non-Bayesian Statistical Methods. Doctoral dissertation, VPISU.

Rubin, H. 1968. Lecture notes on measure-theoretic decision theory (unpublished).

Rubin, H. 1977. "Robust Bayesian Estimation. " In Statistical Decision Theory and Related Topics II, S. S. Gupta and D. S. Moore (Eds.). Academic Press: New York.

Saff, E. B. and Snider, A. D. 1976. Fundamentals of Complex Analysis for Mathematics, Science, and Engineering. Prentice-Hall: Englewood Cliffs.

Savage, L. J. 1972. The Foundations of Statistics. Dover: New York.

Sivaganesan, S. 1988. "Range of Posterior Measures for Priors with Arbitrary Contaminations." Comm. Statist. - Theory and Methods (17), 1591-1612.

Smith, A. F. M. 1973a. "A General Bayesian Linear Model." J. Roy. Statist. Soc. Series B (35), 67-76.

Smith, A. F. M. 1973b. "Bayes Estimates in the One-Way and Two-Way Models." Biometrika (60), 319-330.

Smith, A. F. M. and Verdinelli, I. 1980. "A Note on Bayes Designs for Inference Using a Hierarchical Linear Model." Biometrika (67), 613-619.

Strasser, H. 1985. Mathematical Theory of Statistics: Statistical Experiments and Asymptotic Decision Theory. De Gruyter: Berlin.

Stroud, A. H., and Secrest, D. 1966. Gaussian Quadrature Formulas. Prentice-Hall: Englewood Cliffs.

Tierney, L. and Kadane, J. B. 1986. "Accurate Approximations for Posterior Moments and Marginal Densities." J. Amer. Stat. Assoc. (81), 82-86.

Titterington, D., Smith, A. F. M., and Makov, U. 1985. Statistical Analysis of Finite Mixture Distributions. Wiley: New York.

Toman, B., and Notz, W. I. 1991. "Bayesian optimal experimental design for treatment- control comparisons in the presence of two-way heterogeneity." J. Stat. Plan. Inf. (27), 51-63.

Verdinelli, I. and Giovagnoli, A. 1985. "Optimal Block Designs Under a Hierarchical Model." In Bayesian Statistics 2, J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith (Eds.). North-Holland: Amsterdam.

Wald, A. 1950. Statistical Decision Functions. Wiley: New York.

Wolpert, R. L., and Warren-Hicks, W. J. 1992. "Bayesian Hierarchical Logistic Models for Combining Field and Laboratory Survival Data." In Bayesian Statistics 4, J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith (Eds.). Clarendon Press: Oxford.

Yakowitz, S. and Spragins, J. 1968. "On the Identifiability of Finite Mixtures." Ann. Math. Statist. (39), 209-214.

Zacks, S. 1971. The Theory of Statistical Inference. Wiley: New York.