SIAM/ASA J. UNCERTAINTY QUANTIFICATION, Vol. 3, pp. 586–598
© 2015 Society for Industrial and Applied Mathematics and American Statistical Association

Constraints versus Priors*

Philip B. Stark†

Abstract. There are deep and important philosophical differences between Bayesian and frequentist approaches to quantifying uncertainty. However, some practitioners choose between these approaches primarily on the basis of convenience. For instance, the ability to incorporate parameter constraints is sometimes cited as a reason to use Bayesian methods. This reflects two misunderstandings: First, frequentist methods can indeed incorporate constraints on parameter values. Second, it ignores the crucial question of what the result of the analysis will mean. Bayesian and frequentist measures of uncertainty have similar-sounding names but quite different meanings. For instance, Bayesian uncertainties typically involve expectations with respect to the posterior distribution of the parameter, holding the data fixed; frequentist uncertainties typically involve expectations with respect to the distribution of the data, holding the parameter fixed. Bayesian methods, including methods incorporating parameter constraints, require supplementing the constraints with a prior probability distribution for parameter values. This can cause frequentist and Bayesian estimates and their nominal uncertainties to differ substantially, even when the prior is "uninformative." This paper gives simple examples where "uninformative" priors are, in fact, extremely informative, and sketches how to measure how much information the prior adds to the constraint. Bayesian methods can have good frequentist behavior, and a frequentist can use Bayesian methods and quantify the uncertainty by frequentist means—but absent a meaningful prior, Bayesian uncertainty measures lack meaning. The paper ends with brief reflections on practice.

Key words. frequentist methods, Bayesian methods, uncertainty quantification, constraints, priors, risk

AMS subject classifications. 62A01, 62C10, 62C20, 62G15, 62P99

DOI. 10.1137/130920721

*Received by the editors May 13, 2013; accepted for publication (in revised form) June 1, 2015; published electronically July 23, 2015. http://www.siam.org/journals/juq/3/92072.html
†Department of Statistics, University of California, Berkeley, CA 94720 ([email protected]).

1. Introduction. This paper is primarily about philosophy and interpretation rather than computation. It sketches differences in the assumptions frequentist and Bayesian methods require and differences in meaning between frequentist and Bayesian measures of uncertainty, and recapitulates some well-known (within statistics) connections between frequentist and Bayesian methods that are less well known in the wider uncertainty quantification community. It also addresses the misperception that frequentist methods cannot incorporate constraints on unknown parameters, and flags some circumstances in which frequentists might choose to use Bayesian methods without using Bayesian interpretations of the results.

This paper focuses on making inferences about a (possibly infinite-dimensional) vector θ known to be in the subset Θ of a Hilbert space H, from observations related to that vector. In examples below, the set Θ is a "ball" in a norm or seminorm. A one-dimensional special case is to estimate a single, real-valued parameter that is known to be between 0 and 1. Then the Hilbert space H is the real line, the norm is the absolute value, and the set Θ is a ball of radius 1/2 centered at the point 1/2. This one-dimensional problem does arise in practice: The parameter might be a mortality rate; the fraction of a healthcare invoice that is fraudulent; the percentage by which the margin of victory of an election is overstated; the fraction of incident energy absorbed by a material; the ratio of applied load to failure strength for a structure such as a bridge, dam, building, or engine mount (given that the structure did not fail); the ratio of an individual's exposure to a toxic substance to the exposure that would be lethal (given that the individual did not die of the exposure); or the ratio of the speed of a particle to the velocity of light.
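To make this special case concrete, here is a minimal numerical sketch, not from the paper, of how a frequentist analysis can use such a constraint directly. It assumes a single observation Y ∼ N(θ, 1) with θ known to lie in Θ = [0, 1]; the noise level, confidence level, and all names are illustrative assumptions. Intersecting a standard 95% confidence interval with Θ covers the true θ exactly when the unconstrained interval does, so coverage is preserved while the interval can only shrink:

```python
import numpy as np

rng = np.random.default_rng(0)

def constrained_ci(y, sigma=1.0, z=1.96, lo=0.0, hi=1.0):
    """Intersect the usual 95% interval y +/- z*sigma with the
    constraint set Theta = [lo, hi].  Since theta is known to lie in
    Theta, the intersection covers theta exactly when the unconstrained
    interval does, so coverage stays at about 95%."""
    return max(lo, y - z * sigma), min(hi, y + z * sigma)

# Simulated coverage check for a true theta in Theta = [0, 1].
theta = 0.1
ys = theta + rng.standard_normal(100_000)
hits = [lo_ <= theta <= hi_ for lo_, hi_ in map(constrained_ci, ys)]
print(f"empirical coverage: {np.mean(hits):.3f}")   # ~0.95
```

Here the constrained interval has length at most 1, while the unconstrained interval has length 3.92; no prior distribution on [0, 1] was needed.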
Useful constraints on high-dimensional parameters, even infinite-dimensional parameters such as functions of time and space, often can be expressed as the assertion that a norm or seminorm is bounded by a constant. For example, we might be interested in the electric field generated by some device, as a function of position in space; we might know a priori that the energy of that field—a quadratic norm—is less than 1, in suitable units. Or we might be interested in the spatial variation of ground motion caused by a large earthquake; we might have an a priori upper bound on the total energy released by the earthquake. Constraints of this kind can reduce the uncertainty of parameter estimates, sometimes substantially; they can even render otherwise insoluble problems solvable [1, 2, 3, 4, 5, 6, 7].

The two most common approaches to dealing with problems of statistical estimation and inference are frequentist and Bayesian. The frequentist approach uses constraints directly: The estimation or inference problem is restricted to the set of parameters that satisfy the constraints. In the Bayesian or "subjective" approach, the analyst imposes the constraints by choosing a prior probability distribution that assigns probability 1 to the set of parameters that satisfy the constraint. These two approaches result in uncertainty estimates that differ substantially in definition and interpretation, and that can differ substantially numerically. This paper recounts some of the philosophical and definitional differences.

2. Notation. The notation follows that in [8]. We want to use measurements to learn about the (unknown) state of the world, denoted by θ, in a mathematical representation of some physical system. In many interesting cases, θ is a function of position and/or time, which in principle would require infinitely many numbers to describe perfectly—it is an element of an infinite-dimensional space. Mathematics, physics, or previous experiments tell us that θ satisfies a constraint like those described in the introduction. The set Θ expresses this constraint: We know a priori that θ ∈ Θ.

The measurements Y are related to the state of the world θ through a measurement model, which gives the probability distribution P_η of Y for each possible state of the world η ∈ Θ. That is, if θ = η, then Y ∼ P_η, where ∼ means "has probability distribution" or "is distributed as." We shall assume that the possible values of Y are n-vectors of real numbers, i.e., elements of R^n. Depending on θ, Y may be more likely to take values in some subsets of R^n than in others. For any particular set A ⊂ R^n, P_η{Y ∈ A} generally will not be equal to P_η′{Y ∈ A} unless η = η′. However, it can be the case that η ≠ η′ and yet P_η{Y ∈ A} = P_η′{Y ∈ A} for all (measurable) subsets A ⊂ R^n. (In that case, θ is not identifiable.)

We shall assume that there is a known measure μ that dominates the set of distributions P ≡ {P_η : η ∈ Θ}, so we can use "densities" even if some members of P have atoms. This assumption makes it easier to define likelihoods, which are required for the Bayesian framework; it also implies that all the measures in P are defined on a common sigma-algebra A of subsets of R^n, which avoids some potential pathologies. With respect to μ, the density of P_η at y is

(2.1)  p_η(y) ≡ dP_η/dμ |_y.

For any fixed y, the likelihood of η given Y = y is p_η(y), viewed as a function of η alone.
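As a hedged illustration of the measurement model and the likelihood, not part of the original text, suppose each state of the world η ∈ Θ = [0, 1] determines the distribution P_η = N(η, σ²) of a single measurement; the Gaussian form and the value of σ are assumptions made only for this sketch:

```python
import numpy as np
from scipy.stats import norm

SIGMA = 0.25          # assumed measurement noise level

def density(y, eta, sigma=SIGMA):
    """p_eta(y): density of P_eta = N(eta, sigma^2) evaluated at y,
    i.e., equation (2.1) for this assumed Gaussian model."""
    return norm.pdf(y, loc=eta, scale=sigma)

# For a fixed observation y, the likelihood is the same expression
# read as a function of eta alone, restricted to the constraint set.
y_obs = 0.4
etas = np.linspace(0.0, 1.0, 101)     # grid over Theta = [0, 1]
likelihood = density(y_obs, etas)
print(etas[np.argmax(likelihood)])    # grid MLE: 0.4
```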
While we might like to know θ, it is often impossible to estimate it with any useful level of accuracy; in many problems, θ is not even identifiable. But it still may be possible and scientifically interesting to estimate a parameter λ = λ[θ], a property of θ. The parameter λ might be the average of θ over some volume of space or time, a norm or seminorm of θ, or the number of local maxima θ has, for instance. We shall assume that the possible values of λ[θ] are also elements of a Hilbert space.

3. The Bayesian approach. The Bayesian approach starts with a prior probability distribution π on Θ and the likelihood function p_η(y). (To have a prior probability distribution involves some technical restrictions that will not be considered here.) Together, π and p_η imply a joint distribution of θ and Y. The marginal distribution or predictive distribution of Y is

(3.1)  m(y) = ∫_Θ p_η(y) π(dη).

On observing that Y = y, we assimilate the information by applying Bayes' rule to find the posterior distribution of θ given Y = y:

(3.2)  π(dη | Y = y) = p_η(y) π(dη) / m(y).

(The marginal density m(y) can vanish, but the probability that it does is zero.) In principle, this solves the problem: All the information in the prior and the data is now contained in the posterior distribution. The posterior distribution π_λ(dℓ | Y = y) of λ[θ] is the distribution induced by the posterior distribution of θ: For any (measurable) set A of possible values of λ,

(3.3)  Pr(λ[θ] ∈ A | Y = y) = ∫_{ℓ ∈ A} π_λ(dℓ | Y = y) ≡ ∫_{η : λ[η] ∈ A} π(dη | Y = y).

If we get additional data, we apply Bayes' rule again, using the current posterior as the new prior.
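The computations in (3.1)–(3.3), and the sequential use of the posterior as the next prior, can be illustrated with a small grid approximation. This is only a numerical sketch under the same assumed Gaussian measurement model as above, with a uniform density on Θ = [0, 1] standing in for π; the observations, grid size, and noise level are all illustrative:

```python
import numpy as np
from scipy.stats import norm

SIGMA = 0.25
etas = np.linspace(0.0, 1.0, 1001)    # grid over Theta = [0, 1]
d_eta = etas[1] - etas[0]

prior = np.ones_like(etas)            # uniform prior density on Theta
prior /= np.sum(prior) * d_eta        # normalize

def update(prior, y, sigma=SIGMA):
    """One application of Bayes' rule (3.2) on the grid:
    posterior = likelihood * prior / marginal, with the marginal
    m(y) of (3.1) computed by quadrature over Theta."""
    like = norm.pdf(y, loc=etas, scale=sigma)
    marginal = np.sum(like * prior) * d_eta   # m(y), eq. (3.1)
    return like * prior / marginal

posterior = update(prior, y=0.4)
# With more data, the current posterior becomes the new prior:
posterior = update(posterior, y=0.55)

mean = np.sum(etas * posterior) * d_eta       # posterior mean of theta
print(f"posterior mean after two observations: {mean:.3f}")

# Induced posterior (3.3) for lambda[theta] = theta and A = [0, 0.5]:
prob_A = np.sum(posterior[etas <= 0.5]) * d_eta
print(f"Pr(theta <= 0.5 | data) = {prob_A:.3f}")
```

Note how the constraint and the prior enter differently: the grid restricts attention to Θ (the constraint), while the uniform weights are an additional assumption, exactly the distinction the paper goes on to examine.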