Bayesian Inference
Harry V. Roberts, University of Chicago

(A revised version of this paper will appear in the forthcoming International Encyclopedia of the Social Sciences.)

Bayesian inference or Bayesian statistics is an approach to statistical inference based on the theory of subjective probability. A formal Bayesian analysis leads to probabilistic assessments of the object of uncertainty. For example, a Bayesian inference might be: "The probability is .95 that the mean of a normal distribution lies between 12.1 and 23.7." The number .95 represents a degree of belief, either in the sense of "subjective probability consistent" or "subjective probability rational" (see PROBABILITY: PHILOSOPHY AND INTERPRETATION, which should be read in conjunction with the present article); .95 need not and typically does not correspond to any "objective" long-run relative frequency. Very roughly, a degree of belief of .95 can be interpreted as betting odds of 95 to 5 or 19 to 1. A degree of belief is always potentially a basis for action; for example, it may be combined with utilities by the principle of maximization of expected utility (see STATISTICAL DECISION THEORY).

By contrast, the traditional or "classical" approach to inference leads to probabilistic statements about the method by which a particular inference is obtained. Thus a classical inference might be: "A .95 confidence interval for the mean of a normal distribution extends from 12.1 to 23.7." The number .95 here represents a long-run relative frequency, namely the frequency with which intervals obtained by the method that resulted in the present interval would in fact include the unknown mean. (It is not to be inferred from the fact that we used the same numbers, .95, 12.1, and 23.7, in both illustrations that there will necessarily be a numerical coincidence between the two approaches.)

The term "Bayesian" arises from an elementary theorem of probability theory, named after the Rev. Thomas Bayes, an English clergyman of the 18th century, who first enunciated it and proposed its use in inference. Bayes' theorem is typically used in the process of making Bayesian inferences, as will be explained below. For a number of historical reasons, however, current interest in Bayesian inference is quite recent -- dating, say, from the 1950's. Hence the term "neo-Bayesian" is sometimes used instead of "Bayesian."

An Illustration of Bayesian Inference

For a simple illustration of the Bayesian approach, consider the problem of making inferences about a Bernoulli process with parameter p. A Bernoulli process can be visualized in terms of repeated independent tosses of a not-necessarily "fair" coin. It generates "heads" and "tails" in such a way that the conditional probability of heads on a single trial is always equal to a parameter p regardless of the previous history of heads and tails.

Suppose first that we have no direct sample evidence from the process. Based on experience with similar processes, introspection, general knowledge, etc., we may be willing to translate our judgments about the process into probabilistic terms. For example, we might assess a (subjective) probability distribution for p̃ (the tilde "~" indicates that we are now thinking of the parameter p as a random variable). Such a distribution is called a prior distribution because it is usually assessed prior to sample evidence. Purely for illustration, suppose that the prior distribution of p̃ is uniform on the interval from 0 to 1: the probability that p̃ lies in any subinterval is that subinterval's length, no matter where the subinterval is located between 0 and 1. Now suppose that we observe heads, heads, and tails on three tosses of the coin. The probability of observing this sample, conditional on p, is

    p^2(1-p).

If we regard this expression as a function of p, it is called the likelihood function of the sample. Bayes' theorem shows how to use the likelihood function in conjunction with the prior distribution to obtain a revised or posterior distribution of p̃. "Posterior" means after the sample evidence, and the posterior distribution represents a reconciliation of sample evidence and prior judgment. In terms of inferences about p, we may write Bayes' theorem in words as

    Posterior probability (density) at p, given the observed sample =

        Prior probability (density) at p x likelihood
        ---------------------------------------------
         Prior probability of the observed sample

Expressed mathematically,

    f''(p|r,n) = f'(p) p^r (1-p)^(n-r) / [ integral from 0 to 1 of f'(p) p^r (1-p)^(n-r) dp ],

where f'(p) denotes the prior density of p̃, p^r(1-p)^(n-r) denotes the likelihood if r heads are observed in n trials, and f''(p|r,n) denotes the posterior density of p̃ given the sample evidence.

In our example, f'(p) = 1 (0 < p < 1), r = 2, n = 3, and

    p^r(1-p)^(n-r) = p^2(1-p),

so

    f''(p|r = 2, n = 3) = 12 p^2(1-p),   0 < p < 1,
                        = 0              otherwise.
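As a check on the algebra, Bayes' theorem can also be evaluated numerically: tabulate prior times likelihood on a grid of p values and divide by an approximation to the integral in the denominator. A minimal Python sketch (numpy, the grid size, and the simple Riemann-sum integral are incidental choices, not anything taken from the article) reproduces the closed-form posterior 12 p^2(1-p):

```python
import numpy as np

# Worked example from the text: uniform prior on (0, 1), r = 2 heads in n = 3 tosses.
r, n = 2, 3
p = np.linspace(0.0, 1.0, 100_001)        # grid of values of the parameter p
prior = np.ones_like(p)                   # f'(p) = 1, the uniform prior density
likelihood = p**r * (1.0 - p)**(n - r)    # p^r (1-p)^(n-r)

# Bayes' theorem: posterior density is prior x likelihood divided by the
# integral of prior x likelihood over 0 < p < 1 (approximated by a Riemann sum).
unnormalized = prior * likelihood
dp = p[1] - p[0]
posterior = unnormalized / (unnormalized.sum() * dp)

# Compare with the closed-form posterior 12 p^2 (1-p) derived in the text.
closed_form = 12.0 * p**2 * (1.0 - p)
print(np.abs(posterior - closed_form).max())   # tiny: numerical error only
```

The same grid computation applies to any prior density f'(p) tabulated on the grid; nothing beyond the prior and the likelihood enters the calculation.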
Thus we emerge from the analysis with an explicit probability distribution for p̃. This distribution characterizes fully our judgments about p̃. It could be applied in a formal decision-theoretic analysis in which utilities of alternative acts are functions of p. For example, we might make a Bayesian point estimate of p (each possible point estimate is regarded as an act), and the seriousness of an estimation error ("loss") might be proportional to the square of the error. The best point estimate can then be shown to be the mean of the posterior distribution; in our example, this would be .6. Or, we might wish to describe certain aspects of the posterior distribution for summary purposes; it can be shown, for example, that

    P(p̃ < .194) = .025  and  P(p̃ > .932) = .025,

so a .95 "credible interval" for p̃ extends from .194 to .932. Again, it can be shown that P(p̃ > .5) = .688: the posterior probability that the coin is "biased" in favor of heads is a little over 2/3.
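These figures can be checked directly. With the uniform prior, the posterior density 12 p^2(1-p) is the beta density with parameters 3 and 2 (in general, Beta(r + 1, n - r + 1) for a uniform prior), so a few lines of Python with scipy suffice as a sketch of the verification:

```python
from scipy.stats import beta

# Posterior for a uniform prior and r = 2 heads in n = 3 tosses: Beta(3, 2),
# whose density is 12 p^2 (1-p) on (0, 1).
posterior = beta(3, 2)

print(posterior.mean())       # 0.6    -- Bayesian point estimate under squared-error loss
print(posterior.ppf(0.025))   # ~0.194 -- lower endpoint of the .95 credible interval
print(posterior.ppf(0.975))   # ~0.932 -- upper endpoint of the .95 credible interval
print(posterior.sf(0.5))      # ~0.688 -- posterior probability that the coin favors heads
```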
The Likelihood Principle

In our example, the effect of the sample evidence was wholly transmitted by the likelihood function. All we needed to know from the sample was p^r(1-p)^(n-r); the actual sequence of individual observations was irrelevant so long as we believed the assumption of a Bernoulli process. In general, a full Bayesian analysis requires as inputs for Bayes' theorem only the likelihood function and the prior distribution. Thus the import of the sample evidence is fully reflected in the likelihood function, a principle known as the likelihood principle (see also LIKELIHOOD). Alternatively, given that the sample is drawn from a Bernoulli process, the import of the sample is fully reflected in the numbers r and n, which are called sufficient statistics (see SUFFICIENCY).

The likelihood principle implies certain consequences that do not accord with traditional ideas. Here are examples: (1) Once the data are in, there is no distinction between sequential analysis and analysis for fixed sample size (see the numerical sketch at the end of this article). In the Bernoulli example, successive samples of n1 and n2 trials with r1 and r2 successes could be analyzed as one pooled sample of n1 + n2 trials with r1 + r2 successes. Alternatively, a posterior distribution could be computed after the first sample of n1; this distribution could then serve as a prior distribution for the second sample; finally, a second posterior distribution could be computed after the second sample of n2. By either route the posterior distribution after n1 + n2 observations would be the same. Under almost any situation that is likely to arise in practice, the "stopping rule" by which sampling is terminated is irrelevant to the analysis of the sample. For example, it would not matter whether r successes in n trials were obtained by fixing r in advance and observing the rth success on the nth trial, or by fixing n in advance and counting r successes in the n trials. (2) For the purpose of statistical reporting, the likelihood function is the important information to be conveyed. If a reader wants to perform his own Bayesian analysis, he needs the likelihood function, not a posterior distribution based on someone else's prior, nor traditional analyses such as significance tests, from which it may be difficult or impossible to recover the likelihood function.

Vagueness about Prior Probabilities

In our example we assessed the prior distribution of p̃ as a uniform distribution from 0 to 1. It is sometimes thought that such an assessment means that we "know" p̃ is so distributed, and that our claim to knowledge might be verified or refuted in some way. It is indeed possible to imagine situations in which the distribution of p̃ might be known, as when one coin is to be drawn at random from a number of coins, each of which has a "known" p determined by a very large number of tosses. The frequency distribution of these p's would then serve as a prior distribution, and all statisticians would apply Bayes' theorem in analyzing the sample evidence. But such an example would be unusual. Typically, in making an inference about p̃ for a particular coin, the prior distribution of p̃ is not a description of some distribution of p's but rather a tool for expressing judgments about p̃ based on evidence other than the evidence of the particular sample to be analyzed.

Not only do we rarely "know" the prior distribution of p̃, but we are typically more or less vague when we try to assess it. This vagueness is comparable to the vagueness that surrounds many decisions in everyday life. For example, a person may decide to offer $21,250 for a house he wishes to buy, even though he may be quite vague about what amount he "should" offer.
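Point (1) of the likelihood-principle discussion lends itself to a short numerical illustration. For a Bernoulli process with a beta prior (the uniform prior is Beta(1, 1)), observing r successes in n trials turns a Beta(a, b) prior into a Beta(a + r, b + n - r) posterior; the sketch below, with sample sizes and success counts that are purely hypothetical, shows that pooling the two samples and updating sequentially lead to the same posterior:

```python
from scipy.stats import beta

def update(prior_ab, r, n):
    """Beta posterior parameters after observing r successes in n Bernoulli trials."""
    a, b = prior_ab
    return (a + r, b + (n - r))

# Hypothetical successive samples, chosen purely for illustration.
n1, r1 = 10, 4
n2, r2 = 20, 13

uniform_prior = (1, 1)   # the uniform prior on (0, 1) is Beta(1, 1)

# Route 1: analyze one pooled sample of n1 + n2 trials with r1 + r2 successes.
pooled = update(uniform_prior, r1 + r2, n1 + n2)

# Route 2: the posterior after the first sample serves as the prior for the second.
after_first = update(uniform_prior, r1, n1)
sequential = update(after_first, r2, n2)

print(pooled, sequential)                          # (18, 14) by either route
print(beta(*pooled).mean(), beta(*sequential).mean())   # identical posterior summaries
```

The stopping-rule remark holds for the same reason: fixing n in advance or sampling until the rth success both give likelihoods proportional to p^r(1-p)^(n-r), and proportional likelihoods lead to the same posterior.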