
BAYESIAN INFERENCE

Harry V. Roberts, University of Chicago

(A revised version of this paper will appear in the forthcoming International Encyclopedia of the Social Sciences.)

Bayesian inference, or Bayesian statistics, is an approach to statistical inference based on the theory of subjective probability. A formal Bayesian analysis leads to probabilistic assessments of the object of uncertainty. For example, a Bayesian inference might be: "The probability is .95 that the mean of a normal distribution lies between 12.1 and 23.7." The number .95 represents a degree of belief, either in the sense of "subjective probability consistent" or "subjective probability rational" (see PROBABILITY: PHILOSOPHY AND INTERPRETATION, which should be read in conjunction with the present article); .95 need not and typically does not correspond to any "objective" long-run relative frequency. Very roughly, a degree of belief of .95 can be interpreted as betting odds of 95 to 5 or 19 to 1. A degree of belief is always potentially a basis for action; for example, it may be combined with utilities by the principle of maximization of expected utility (see STATISTICAL DECISION THEORY).

By contrast, the traditional or "classical" approach to inference leads to probabilistic statements about the method by which a particular inference is obtained. Thus a classical inference might be: "A .95 confidence interval for the mean of a normal distribution extends from 12.1 to 23.7." The number .95 here represents a long-run relative frequency, namely the frequency with which intervals obtained by the method that resulted in the present interval would in fact include the unknown mean. (It is not to be inferred from the fact that we used the same numbers, .95, 12.1, and 23.7, in both illustrations that there will necessarily be a numerical coincidence between the two approaches.)
The term "Bayesian" arises from an elementary theorem of probability theory, named after the Rev. Thomas Bayes, an English clergyman of the 18th century, who first enunciated it and proposed its use in inference. Bayes' theorem is typically used in the process of making Bayesian inferences, as will be explained below. For a number of historical reasons, however, current interest in Bayesian inference is quite recent -- dating, say, from the 1950's. Hence the term "neo-Bayesian" is sometimes used instead of "Bayesian."

An Illustration of Bayesian Inference

For a simple illustration of the Bayesian approach, consider the problem of making inferences about a Bernoulli process with parameter p. A Bernoulli process can be visualized in terms of repeated independent tosses of a not necessarily "fair" coin. It generates "heads" and "tails" in such a way that the conditional probability of heads on a single trial is always equal to a parameter p regardless of the previous history of heads and tails.

Suppose first that we have no direct sample evidence from the process. Based on experience with similar processes, introspection, general knowledge, etc., we may be willing to translate our judgments about the process into probabilistic terms. For example, we might assess a (subjective) probability distribution for p̃ (the tilde "~" indicates that we are now thinking of the parameter p as a random variable). Such a distribution is called a prior distribution because it is usually assessed prior to sample evidence. Purely for illustration, suppose that the prior distribution of p̃ is uniform on the interval from 0 to 1: the probability that p̃ lies in any subinterval is that subinterval's length, no matter where the subinterval is located between 0 and 1. Now suppose that we observe heads, heads, and tails on three tosses of the coin. The probability of observing this sample, conditional on p, is

    p^2 (1-p).

If we regard this expression as a function of p, it is called the likelihood function of the sample. Bayes' theorem shows how to use the likelihood function in conjunction with the prior distribution to obtain a revised or posterior distribution of p̃. "Posterior" means after the sample evidence, and the posterior distribution represents a reconciliation of sample evidence and prior judgment. In terms of inferences about p̃, we may write Bayes' theorem in words as

    posterior (density) at p, given the observed sample =

        prior (density) at p × likelihood of the observed sample given p
        -----------------------------------------------------------------
                  prior probability of the observed sample

Expressed mathematically,

$$ f''(p \mid r, n) \;=\; \frac{f'(p)\, p^r (1-p)^{n-r}}{\int_0^1 f'(p)\, p^r (1-p)^{n-r}\, dp} $$

where f'(p) denotes the prior density of p̃, p^r (1-p)^(n-r) denotes the likelihood if r heads are observed in n trials, and f''(p|r,n) denotes the posterior density of p̃ given the sample evidence.
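For readers who want to see the formula in action, here is a minimal numerical sketch (not part of the original article) of a grid approximation to Bayes' theorem for a Bernoulli parameter; the function and variable names are our own illustrative choices.

```python
# A minimal numerical sketch of the formula above: the prior density of p is
# multiplied by the likelihood p^r (1-p)^(n-r) and renormalized on a grid.
import numpy as np

def bernoulli_posterior(prior_density, r, n, grid):
    """Grid approximation to f''(p | r, n) = f'(p) p^r (1-p)^(n-r) / denominator."""
    likelihood = grid**r * (1.0 - grid)**(n - r)
    unnormalized = prior_density * likelihood
    dp = grid[1] - grid[0]
    return unnormalized / (unnormalized.sum() * dp)   # divide by the integral

grid = np.linspace(0.0, 1.0, 2001)
prior = np.ones_like(grid)                            # uniform prior: f'(p) = 1
posterior = bernoulli_posterior(prior, r=2, n=3, grid=grid)
print(posterior[1000])   # density near p = .5; analytically 12(.5)^2(.5) = 1.5
```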

In our example, f'(p) = 1, (0 < p < 1), r = 2, n = 3, and

    p^r (1-p)^(n-r) = p^2 (1-p),

so

    f''(p | r = 2, n = 3) = 12 p^2 (1-p),   (0 < p < 1),
                          = 0 otherwise.
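The constant 12 is just the reciprocal of the denominator in Bayes' formula; spelling out the integral (a step added here for clarity):

$$ \int_0^1 p^2(1-p)\,dp = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}, \qquad\text{so}\qquad f''(p \mid r=2,\ n=3) = \frac{p^2(1-p)}{1/12} = 12\,p^2(1-p). $$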

Thus we emerge from the analysis with an explicit probability distribution for p̃. This distribution characterizes fully our judgments about p̃. It could be applied in a formal decision-theoretic analysis in which utilities of alternative acts are functions of p. For example, we might make a Bayesian point estimate of p (each possible point estimate is regarded as an act), and the seriousness of an estimation error ("loss") might be proportional to the square of the error. The best point estimate can then be shown to be the mean of the posterior distribution; in our example, this would be .6.
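As an illustration (ours, not the article's), the claim that the posterior mean minimizes expected squared-error loss can be checked by brute force:

```python
# Brute-force check that expected squared-error loss under the posterior
# 12 p^2 (1-p) is minimized at the posterior mean, .6.
import numpy as np

dp = 1e-4
p = np.arange(dp / 2, 1.0, dp)            # midpoint grid on (0, 1)
posterior = 12 * p**2 * (1 - p)

candidates = np.arange(0.0, 1.0001, 0.001)
risk = [((p - a)**2 * posterior).sum() * dp for a in candidates]
print(candidates[np.argmin(risk)])        # 0.6, the mean of the posterior
```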

Or, we might wish to describe certain aspects of the posterior distribution for summary purposes; it can be shown, for example, that

    P(p̃ < .194) = .025 and P(p̃ > .932) = .025,

so a .95 "credible interval" for p̃ extends from .194 to .932. Again, it can be shown that P(p̃ > .5) = .688: the posterior probability that the coin is "biased" in favor of heads is a little over 2/3.
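These numbers are easy to reproduce: under the uniform prior the posterior 12 p^2 (1-p) is a beta density with parameters 3 and 2, so in a short sketch (ours, using scipy):

```python
# Checking the tail probabilities of the posterior, which is Beta(3, 2).
from scipy.stats import beta

posterior = beta(3, 2)            # Beta(r + 1, n - r + 1) for a uniform prior
print(posterior.cdf(0.194))       # about .025
print(posterior.sf(0.932))        # about .025, so (.194, .932) is a .95 interval
print(posterior.sf(0.5))          # about .688: probability of bias toward heads
```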
The Likelihood Principle

In our example, the effect of the sample evidence was wholly transmitted by the likelihood function. All we needed to know from the sample was p^r (1-p)^(n-r); the actual sequence of individual observations was irrelevant so long as we believed the assumption of a Bernoulli process. In general, a full Bayesian analysis requires as inputs for Bayes' theorem only the likelihood function and the prior distribution. Thus the import of the sample evidence is fully reflected in the likelihood function, a principle known as the likelihood principle (see also LIKELIHOOD). Alternatively, given that the sample is drawn from a Bernoulli process, the import of the sample is fully reflected in the numbers r and n, which are called sufficient statistics (see SUFFICIENCY).

The likelihood principle implies certain consequences that do not accord with traditional ideas. Here are examples: (1) Once the data are in, there is no distinction between sequential analysis and analysis for fixed sample size. In the Bernoulli example, successive samples of n1 and n2 trials with r1 and r2 successes could be analyzed as one pooled sample of n1 + n2 trials with r1 + r2 successes. Alternatively, a posterior distribution could be computed after the first sample of n1; this distribution could then serve as a prior distribution for the second sample; finally, a second posterior distribution could be computed after the second sample of n2. By either route the posterior distribution after n1 + n2 observations would be the same (see the sketch at the end of this section). Under almost any situation that is likely to arise in practice, the "stopping rule" by which sampling is terminated is irrelevant to the analysis of the sample. For example, it would not matter whether r successes in n trials were obtained by fixing r in advance and observing the rth success on the nth trial, or by fixing n in advance and counting r successes in the n trials. (2) For the purpose of statistical reporting, the likelihood function is the important information to be conveyed. If a reader wants to perform his own Bayesian analysis, he needs the likelihood function, not a posterior distribution based on someone else's prior, nor traditional analyses such as significance tests, from which it may be difficult or impossible to recover the likelihood function.
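Consequence (1) can be seen in two lines of conjugate-prior arithmetic. The sketch below (our illustration, with made-up sample numbers) uses the standard fact that a beta prior combined with Bernoulli evidence yields a beta posterior.

```python
# With a Beta(a, b) prior, r successes in n Bernoulli trials give a
# Beta(a + r, b + n - r) posterior; pooling two samples or updating
# sequentially therefore lands on the same posterior.
def update(a, b, r, n):
    return a + r, b + (n - r)

a0, b0 = 1, 1                              # uniform prior
r1, n1, r2, n2 = 2, 3, 5, 9                # two successive samples (illustrative)

pooled = update(a0, b0, r1 + r2, n1 + n2)
after_first = update(a0, b0, r1, n1)       # posterior after the first sample...
sequential = update(*after_first, r2, n2)  # ...is the prior for the second
print(pooled, sequential, pooled == sequential)   # identical parameters
```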
Vagueness about Prior Probabilities

In our example we assessed the prior distribution of p̃ as a uniform distribution from 0 to 1. It is sometimes thought that such an assessment means that we "know" p̃ is so distributed, and that our claim to knowledge might be verified or refuted in some way. It is indeed possible to imagine situations in which the distribution of p̃ might be known, as when one coin is to be drawn at random from a number of coins, each of which has a "known" p determined by a very large number of tosses. The frequency distribution of these p's would then serve as a prior distribution, and all statisticians would apply Bayes' theorem in analyzing the sample evidence. But such an example would be unusual. Typically, in making an inference about p̃ for a particular coin, the prior distribution of p̃ is not a description of some distribution of p's but rather a tool for expressing judgments about p̃ based on evidence other than the evidence of the particular sample to be analyzed.

Not only do we rarely "know" the prior distribution of p̃, but we are typically more or less vague when we try to assess it. This vagueness is comparable to the vagueness that surrounds many decisions in everyday life. For example, a person may decide to offer $21,250 for a house he wishes to buy, even though he may be quite vague about what amount he "should" offer. Similarly, in statistical inference we may assess a prior distribution in the face of a certain amount of vagueness. If we are not willing to do so, we cannot pursue a formal Bayesian analysis and must evaluate sample evidence intuitively, perhaps aided by the tools of descriptive statistics and classical inference.

Vagueness about prior probabilities is not the only kind of vagueness to be faced in statistical analysis, and the other kinds of vagueness are equally troublesome for approaches to statistics that do not use prior probabilities. Vagueness about the likelihood function, that is, about the process generating the data, is typically substantial and hard to deal with. Moreover, both classical and Bayesian decision theory bring in the idea of utility, and utilities often are vague.

In assessing prior probabilities, skillful self-interrogation is needed in order to mitigate vagueness. Self-interrogation may be made more systematic and illuminating in several ways. (1) Direct judgmental assessment. In assessing the prior distribution of p̃, for example, we might ask, "For what value p would we be indifferent to an even money bet that p̃ is above or below this value?" (Answer is the .50-fractile, or median.) Then, "If we were told that p̃ is above the .50-fractile just assessed, but nothing more, for what value of p would we now be indifferent in such a bet?" (Answer is the .75-fractile.) Similarly we might locate other key fractiles, or key relative heights on the density function. (2) Translation to equivalent but hypothetical prior sample evidence.
For example, we might feel that our prior opinion about p̃ is roughly what it would have been if we had initially held a uniform prior and then seen r heads in n hypothetical trials from the process. The implied posterior distribution would serve as the prior. (3) Contemplation of possible sample outcomes. Sometimes we may find it easy to decide directly what our posterior distribution would be if a certain hypothetical sample outcome were to materialize. We can then work backwards to see the prior distribution thereby implied. Of course, this approach is likely to be helpful only if the hypothetical sample outcomes are easy to assimilate. For example, if we make a certain technical assumption about the general shape of the prior distribution (beta distribution), the answers to the following two simply-stated questions imply a prior distribution of p̃: (a) How do we assess the probability of heads on a single trial? (b) If we were to observe a head on a single trial (this is the hypothetical future outcome), how would we assess the probability of heads on a second trial?
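Under the beta assumption, the answers to questions (a) and (b) determine the prior's two parameters. The algebra is standard for the beta-Bernoulli model, though the sketch below is our own: it solves the two equations P(head) = a/(a+b) and P(head | one head observed) = (a+1)/(a+b+1), with illustrative inputs.

```python
# Recovering a beta prior for p from the answers to questions (a) and (b).
def beta_from_two_questions(q1, q2):
    """q1 = assessed P(head); q2 = assessed P(head given one head seen).

    Solves q1 = a/(a+b) and q2 = (a+1)/(a+b+1); requires q2 > q1.
    """
    a = q1 * (1 - q2) / (q2 - q1)
    b = a * (1 - q1) / q1
    return a, b

# Answering 1/2 and 2/3 recovers the uniform prior Beta(1, 1) of our example.
print(beta_from_two_questions(0.5, 2 / 3))
```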
These approaches are intended only to be suggestive. If several approaches to self-interrogation lead to substantially different prior distributions, we must either try to remove the internal inconsistency or be content with an intuitive analysis. Actually, from the point of view of "subjective probability consistent," the discovery of internal inconsistency in one's judgments is the only route toward more "rational" decisions. The danger is not that internal inconsistencies will be revealed but that they will be suppressed by self-deception or glossed over by lethargy.

It may happen that vagueness affects only unimportant aspects of the prior distribution: theoretical or empirical analysis may show that the posterior distribution is insensitive to these aspects of the distribution. For example, we may be vague about many aspects of the prior distribution, yet feel that it is nearly uniform over all values of the parameter for which the likelihood function is not essentially zero. This has been called a diffuse, informationless, or locally uniform prior distribution. These terms are to be interpreted relative to the spread of the likelihood function, which depends on the sample size; a prior that is diffuse relative to a large sample may not be diffuse relative to a small one. If the prior distribution is diffuse, the posterior distribution can be easily approximated from the assumption of a strictly uniform prior distribution. The latter assumption, known historically as Bayes' postulate (not to be confused with Bayes' theorem), is regarded as a device that leads to good approximations in certain circumstances, although supporters of "subjective probability rational" sometimes regard it as more than that in their approach to Bayesian inference. The uniform prior is also useful for statistical reporting, since it leads to posterior distributions from which the likelihood is easily recovered and presents the results in a form readily usable to any reader whose prior distribution is diffuse.
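The practical force of a locally uniform prior is easy to demonstrate numerically. In the sketch below (our illustration, with invented numbers), two visibly different smooth priors yield almost the same posterior once the sample is moderately large.

```python
# Two different smooth priors, one uniform and one sloped, give nearly the
# same posterior mean when the likelihood is concentrated by a large sample.
import numpy as np

dp = 1e-4
p = np.arange(dp / 2, 1.0, dp)
r, n = 600, 1000                                   # illustrative large sample

loglik = r * np.log(p) + (n - r) * np.log(1 - p)
likelihood = np.exp(loglik - loglik.max())         # scaled to avoid underflow

for prior in (np.ones_like(p), 2 * (1 - p)):       # uniform and sloped priors
    post = prior * likelihood
    post /= post.sum() * dp
    print(round((p * post).sum() * dp, 4))         # means ~.5998 and ~.5992
```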
Probabilistic Prediction

A distribution, prior or posterior, of the parameter of a Bernoulli process implies a probabilistic prediction for any future sample to be drawn from the process, assuming that the stopping rule is given. For example, the denominator on the right-hand side of the Bayes formula for Bernoulli sampling displayed earlier can be interpreted as the probability of obtaining the particular sample actually observed, given the prior distribution of p̃. While a person's subjective probability distribution of p̃ cannot be said to be "right" or "wrong," there are better and worse subjective distributions, and the test is predictive accuracy. Thus if Mr. A and Mr. B each has a distribution for p̃, and a new sample is then observed, we can calculate the probability of the sample in the light of each prior distribution. The ratio of these probabilities, technically a marginal likelihood ratio, measures the extent to which the data favor A over B or vice versa. This idea has important consequences for evaluating judgments and selecting statistical models.

In connection with the previous paragraph a separate point is worth making. The posterior distributions of A and B are bound to grow closer together as sample evidence piles up, so long as neither of the priors was dogmatic. An example of a dogmatic prior would be the opinion that p is exactly .5.
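A marginal likelihood ratio of this kind can be computed directly. The sketch below is our own construction, with invented priors and data: it pits a uniform prior against one concentrated near .5 and asks which better predicted a lopsided sample.

```python
# Marginal likelihood of an observed Bernoulli sample under two priors for p,
# and the ratio that measures how strongly the data favor one prior.
from scipy.stats import beta
from scipy.integrate import quad

def marginal_likelihood(prior, r, n):
    """Probability of the sample: integral of p^r (1-p)^(n-r) against the prior."""
    value, _ = quad(lambda p: prior.pdf(p) * p**r * (1 - p)**(n - r), 0, 1)
    return value

prior_A = beta(1, 1)           # Mr. A: uniform prior
prior_B = beta(20, 20)         # Mr. B: prior concentrated near p = .5
r, n = 9, 10                   # observed sample: 9 heads in 10 tosses

ratio = marginal_likelihood(prior_A, r, n) / marginal_likelihood(prior_B, r, n)
print(ratio)                   # well above 1: the data favor A over B
```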
Multivariate Inference and Nuisance Parameters

Thus far we have used one basic example, inferences about a Bernoulli process. To introduce some additional concepts, we now turn to inferences about the mean μ̃ of a normal distribution with unknown variance σ̃². In this case we begin with a joint prior distribution for μ̃ and σ̃². The likelihood function is now a function of two variables, μ and σ². An inspection of the likelihood function will show not only that the sequence of observations is irrelevant to inference, but also that the magnitudes are irrelevant except insofar as they help determine the sample mean and the sample variance s², which, along with the sample size n, are the sufficient statistics of this example (see SUFFICIENCY). The prior distribution combines with the likelihood essentially as before except that a double integration (or double summation) is needed instead of a single integration (or summation). The result is a joint posterior distribution of μ̃ and σ̃².

If we are interested only in μ̃, then σ̃² is said to be a nuisance parameter. In principle it is simple to deal with a nuisance parameter: we "integrate it out" of the posterior distribution. In our example this means that we must find the marginal distribution of μ̃ from the joint posterior distribution of μ̃ and σ̃².

Multivariate problems and nuisance parameters can always be dealt with by the approach just described. The integrations required may demand heavy computation, but the task is straightforward. A more difficult problem is that of assessing multivariate prior distributions, and research is needed to find better techniques for overcoming the problems presented by vagueness in such assessments.
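A small grid computation shows the mechanics of integrating out σ². The data, grid ranges, and flat joint prior below are our own illustrative choices, not the article's.

```python
# Joint posterior of (mu, sigma2) for normal data on a grid, then summation
# over sigma2 to obtain the marginal posterior of mu alone.
import numpy as np

data = np.array([12.1, 15.3, 18.2, 14.7, 16.9])    # illustrative observations
mu = np.linspace(5.0, 25.0, 201)
sigma2 = np.linspace(0.5, 50.0, 200)
M, S2 = np.meshgrid(mu, sigma2, indexing="ij")

n = len(data)
sse = ((data[None, None, :] - M[..., None])**2).sum(axis=-1)
loglik = -0.5 * n * np.log(2 * np.pi * S2) - sse / (2 * S2)

joint = np.exp(loglik - loglik.max())              # flat prior: posterior ∝ likelihood
joint /= joint.sum()
marginal_mu = joint.sum(axis=1)                    # "integrate out" sigma2
print(mu[np.argmax(marginal_mu)])                  # mode near the sample mean, 15.44
```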
Design of Experiments and Surveys

So far we have talked only about problems of analysis of samples, without saying anything about what kind of sample evidence, and how much, should be sought. This kind of problem is known as a problem of design. A formal Bayesian solution of a design problem requires that we look beyond the posterior distribution to the ultimate decisions that will be made in the light of this distribution: the best design depends on the purposes to be served by collecting the data. Given the specific purpose and the principle of maximization of expected utility, it is possible to calculate the expected utility of the best act for any particular sample outcome. We can repeat this for each possible sample outcome for a given sample design. Next, we can weight each such utility by the probability of the corresponding outcome in the light of the prior distribution. This gives an overall expected utility for any proposed design. Finally, we pick the sample design with the highest expected utility. For two-action problems -- e.g., deciding whether a new medical treatment is better or worse than a standard treatment -- this procedure does not conflict with the traditional approach of selecting designs by comparing operating characteristics, although it formalizes certain things -- prior probabilities and utilities -- that often are treated intuitively in the traditional approach.
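The procedure just described can be written out concretely. The sketch below is entirely our illustration: the prior, the utilities, and the sampling cost are invented. It evaluates the expected utility of several sample sizes for a stylized two-action problem, using the beta-binomial predictive distribution of the number of successes.

```python
# Preposterior analysis: for each design (sample size n), average the utility
# of the best terminal act over the predictive distribution of the outcome r.
import numpy as np
from math import comb
from scipy.special import betaln

def expected_utility(n, a=1.0, b=1.0, cost_per_obs=0.002):
    total = 0.0
    for r in range(n + 1):
        # Predictive (beta-binomial) probability of r successes in n trials.
        pred = comb(n, r) * np.exp(betaln(a + r, b + n - r) - betaln(a, b))
        post_mean = (a + r) / (a + b + n)
        # Two actions: adopt the new treatment (utility = expected gain over
        # the standard, post_mean - .5) or keep the standard (utility 0).
        total += pred * max(post_mean - 0.5, 0.0)
    return total - cost_per_obs * n

for n in (0, 5, 10, 20, 40):
    print(n, round(expected_utility(n), 4))        # pick the n with highest value
```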
Comparison of Bayesian and Classical Inference

Certain common statistical practices are subject to criticism either from the point of view of Bayesian or of classical theory: for example, estimation problems are frequently regarded as tests of null hypotheses, and .05 or .01 significance levels are used inflexibly. Bayesian and classical theory are in many respects closer to each other than either is to everyday practice. In comparing the two approaches, therefore, we shall confine the discussion to the level of underlying theory.

In one sense the basic difference is the acceptance of subjective probability judgment as a formal component of Bayesian inference. This does not mean that classical theorists would disavow judgment, only that they would apply it informally after the "purely statistical" analysis is finished: judgment is the "second span in the bridge of inference." Building on subjective probability, Bayesian theory is a unified theory, whereas classical theory is diverse and ad hoc. In this sense Bayesian theory is simpler. But in another sense Bayesian theory is more complex because it incorporates more into the formal analysis. Consider a famous controversy of classical statistics, the problem of comparing the means of two normal distributions with possibly unequal and unknown variances (the so-called "Behrens-Fisher" problem). Conceptually this problem poses major difficulties for some classical theories (not Fisher's fiducial inference; see FIDUCIAL INFERENCE), but none for Bayesian theory. In application, however, the Bayesian approach faces the problem of assessing a prior distribution involving four random variables. Moreover, there may be messy computational work after the prior distribution has been assessed.

In many applications, however, a credible interval emerging from the assumption of a diffuse prior distribution is identical or nearly identical to the corresponding confidence interval (a numerical sketch follows at the end of this section). There is a difference of interpretation, illustrated in the opening two paragraphs of this article, but in practice many people interpret the classical result in the Bayesian way. There often are numerical similarities between the results of Bayesian and classical analyses of the same data, but there can also be substantial differences, for example, when the prior distribution is non-diffuse and when a genuine null hypothesis is to be tested.

Often it may happen that the problem of vagueness, discussed at some length above, makes a formal Bayesian analysis seem unwise. In this event Bayesian theory may still be of some value in selecting a descriptive analysis or a classical technique that conforms well to the general Bayesian approach, and perhaps in modifying the classical technique. For example, many of the classical developments in sample surveys and analysis of experiments can be given rough Bayesian interpretations when vagueness about the likelihood (as opposed to prior probabilities) prevents a full Bayesian analysis. Moreover, even an abortive Bayesian analysis may contribute insight into a problem.

Bayesian inference has as yet received much less theoretical study than has classical inference. It is hard at this writing to predict how far Bayesian theory will lead in modification and reinterpretation of classical theory. Before a fully Bayesian replacement is available there is certainly no need to discard those classical techniques that seem roughly compatible with the Bayesian approach; indeed, many classical techniques are, under certain conditions, good approximations to fully Bayesian ones. In the meanwhile, the interplay between the two approaches promises to lead to fruitful developments in statistical inference, and the Bayesian approach promises to illuminate a number of problems -- such as allowance for selectivity -- that are otherwise hard to cope with.
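The numerical agreement mentioned above can be seen in the standard normal-mean case: with a diffuse prior on the mean and variance, the .95 credible interval for the mean coincides with the classical Student-t confidence interval. A sketch, with invented data:

```python
# For normal data with a diffuse prior, the .95 credible interval for the
# mean equals the classical t interval; only the interpretation differs.
import numpy as np
from scipy.stats import t

data = np.array([12.9, 17.4, 21.0, 14.6, 19.8, 16.3])   # illustrative data
n, m, s = len(data), data.mean(), data.std(ddof=1)

half_width = t.ppf(0.975, df=n - 1) * s / np.sqrt(n)
print(m - half_width, m + half_width)   # the same numbers, read two ways
```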
A Few Suggestions for Further Reading

The first book-length development of Bayesian inference, which emphasizes heavily the decision-theoretic foundations of the subject, is Robert Schlaifer, Probability and Statistics for Business Decisions (New York: McGraw-Hill Book Company, Inc., 1959). A more technical development of the subject is given by Howard Raiffa and Robert Schlaifer, Applied Statistical Decision Theory (Boston: Division of Research, Graduate School of Business Administration, Harvard University, 1961). An excellent short introduction with an extensive bibliography is Leonard J. Savage, "Bayesian Statistics," in Robert E. Machol and Paul Gray, eds., Recent Developments in Information and Decision Processes (New York: The Macmillan Company, 1962). An interesting application of Bayesian inference is given, along with a penetrating discussion of underlying philosophy and a comparison with the corresponding classical analysis, in Frederick Mosteller and David L. Wallace, "Inference in an Authorship Problem," Journal of the American Statistical Association (Vol. 58, 1963), pp. 275-310. A fuller description of this study will be found in Mosteller and Wallace, Inference and Disputed Authorship: The Federalist Papers (Reading, Massachusetts: Addison-Wesley Publishing Company, Inc., in press). This study gives a specific example of how one might cope with vagueness about the likelihood function. Another example is to be found in George E. P. Box and George C. Tiao, "A Further Look at Robustness via Bayes' Theorem," Biometrika (Vol. 49, 1962), pp. 419-432. A thorough development of Bayesian inference from the viewpoint of "subjective probability rational" is to be found in Harold Jeffreys, Theory of Probability (Oxford: Clarendon Press, 3rd edition, 1961).