Bridging Bayesian, frequentist and fiducial (BFF) inferences using confidence distribution
Suzanne Thornton
Department of Mathematics and Statistics, Swarthmore College, Swarthmore, PA 19081, USA
Min-ge Xie
Department of Statistics, Rutgers University, New Brunswick, NJ 08854, USA
Email addresses: [email protected] and [email protected]. The research is supported in part by NSF research grants DMS1737857, DMS1812048, DMS2015373, and DMS2027855.
Abstract
Bayesian, frequentist and fiducial (BFF) inferences are much more congruous than they have historically been perceived to be in the scientific community (cf., Reid and Cox 2015; Kass 2011; Efron 1998). Most practitioners are probably more familiar with the two dominant statistical inferential paradigms, Bayesian inference and frequentist inference. The third, lesser-known fiducial inference paradigm was pioneered by R.A. Fisher in an attempt to define an inversion procedure for inference as an alternative to Bayes' theorem. Although each paradigm has its own strengths and limitations owing to its different philosophical underpinnings, this article intends to bridge these different inferential methodologies through the lens of confidence distribution theory and Monte Carlo simulation procedures. This article attempts to understand how these three distinct paradigms, Bayesian, frequentist, and fiducial inference, can be unified and compared on a foundational level, thereby increasing the range of possible techniques available to both statistical theorists and practitioners across all fields.
Key words: confidence distribution, Bayesian posterior, bootstrap estimator, fiducial inversion, frequentist inference
1 Introduction
One motivation for developing confidence distributions is to address a simple question: "Can frequentist inference rely on a distribution function, or a 'distribution estimator', in a manner similar to a Bayesian posterior?" (cf., Xie and Singh 2013). The answer is affirmative; in fact, many statisticians consider confidence distributions to be the "frequentist analogues" of Bayesian posterior distributions (Schweder and Hjort, 2003). Although the concept of confidence distributions is developed completely within a frequentist framework, this special type of estimator also serves as a promising potential unifier among Bayesian, frequentist, and fiducial methods. In the subsequent sections we first provide a brief review of the concept of confidence distributions and develop an appreciation for the flexibility of the inferential framework they provide. We then highlight two deeply rooted connections, concerning estimation and uncertainty quantification, among confidence distributions, bootstrap distributions, fiducial distributions, and Bayesian posterior functions. These connections provide a unified platform from which we can compare and combine inference across different paradigms, and they also allow us to discover a new assertion: across all paradigms, a parameter has both a "fixed" and a "random" version, where the fixed version is the unknown ("true" or "realized") target quantity of interest and the random version describes the uncertainty of our inference concerning the target quantity.
2 Confidence Distribution: A Distribution Estimator
Traditionally, statistical analysis uses a single point or perhaps an interval to estimate an unknown target parameter. However, another valid option is to instead use a sample-dependent function, specifically, a sample-dependent distribution function on the parameter space, to estimate the parameter of interest. This is exactly the motivation for developing and utilizing confidence distributions. A confidence distribution is a sample-dependent distribution (or density) function, but its main purpose is similar to that of any other statistical estimator. The technical definition of a confidence distribution is therefore a simple, prescriptive one. In the definition below, Y and Θ denote the sample space and parameter space, respectively.
Definition 1 (Confidence Distribution) A sample-dependent function on the parameter space, i.e., a function on Y × Θ, H(·) = H(Y, ·), is called a confidence distribution (CD) for θ ∈ Θ if

[R1] for each given sample Y = y ∈ Y, the function H(·) = H(y, ·) is a distribution function on the parameter space; and

[R2] the function can provide confidence intervals (or regions) of all levels for the parameter θ.

If [R2] holds asymptotically (or approximately), then H(·) is called an asymptotic (or approximate) CD (aCD).
The definition of a CD consists of two requirements, [R1] and [R2], and is analogous to the definitions of a consistent or an unbiased estimator in point estimation. Specifically, consistent or unbiased point estimators are estimators such that [R1] the estimator is a sample-dependent point in the parameter space, and [R2] the estimator satisfies a particular performance-based criterion. For a consistent estimator, [R2] is that the estimator approaches the true parameter value as the sample size increases; for an unbiased estimator, [R2] is that the expectation of the estimator equals the true parameter value.
When the parameter θ is scalar, requirement [R2] can be expressed as the requirement that at the true parameter value θ = θ0, H(θ0) ≡ H(Y, θ0) as a function of the random data follows a standard uniform distribution (cf. e.g., Xie and Singh 2013; Schweder and Hjort 2016). When θ is a vector, we may not have a simple or neat mathematical expression for [R2]. In some cases, a simultaneous CD of a certain form for multiple parameters may not even exist (cf., Xie and Singh 2013; Schweder and Hjort 2016). Fortunately, requirement [R2] in Definition 1 only asks for one set of confidence regions of all levels, which suffices for drawing inference statements. In this manner, there are several options for defining multivariate CDs, e.g., Singh et al. (2007); Xie and Singh (2013); Schweder and Hjort (2016); Liu et al. (2020).
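When a closed form is available, the scalar version of requirement [R2] can also be checked by simulation. The following minimal sketch (assuming, for illustration only, the normal model of Example 1 below with CD $H(\theta) = \Phi(\sqrt{n}(\theta - \bar{y}_n))$; the sample size, seed, and true value are arbitrary choices) verifies that H(Y, θ₀) behaves like a standard uniform random variable across repeated samples.

```python
import numpy as np
from scipy.stats import norm, kstest

# A minimal simulation check of [R2] for a scalar parameter: at the true
# value theta_0, H(Y, theta_0) should follow a standard uniform distribution.
# Assumed model (for illustration only): Y_1, ..., Y_n iid N(theta_0, 1),
# with CD H(theta) = Phi(sqrt(n) * (theta - ybar_n)).
rng = np.random.default_rng(0)
theta0, n, n_rep = 2.0, 25, 10_000

ybar = rng.normal(theta0, 1.0, size=(n_rep, n)).mean(axis=1)
H_at_theta0 = norm.cdf(np.sqrt(n) * (theta0 - ybar))

# If [R2] holds, these values are indistinguishable from U(0, 1) draws.
print(kstest(H_at_theta0, "uniform"))
```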
The idea of using a distribution function as an estimator is not new or unique to CD theory. A bootstrap distribution is a distribution estimator for the unknown parameter (Efron, 1998). In Bayesian inference, the posterior distribution is also a distribution estimator for the unknown parameter. The benefit of using an entire function as an estimator, rather than an interval or a point estimator, is that these distribution functions carry a greater wealth of information about the parameter of interest. A CD estimator, like a Bayesian posterior distribution, enables a statistician to draw inferential conclusions about the unknown parameter. For example, if one is provided with a CD for the parameter of interest, one can readily derive a point estimate, an interval estimate, and a p-value, among other features.
The simple example presented next will be revisited throughout the remainder of this article as an illustrative example.
Example 1 (Inference for Gaussian Data) Suppose we have a random sample of data, $y = (y_1, \ldots, y_n)$, from a N(θ, 1) distribution. To estimate the unknown mean θ we may consider 1) a point estimate, e.g., $\bar{y}_n = \frac{1}{n}\sum_{i=1}^n y_i$; 2) an interval estimate, e.g., $(\bar{y}_n - 1.96/\sqrt{n},\ \bar{y}_n + 1.96/\sqrt{n})$; or 3) a distribution estimate, e.g., N(ȳ_n, 1/n). This distribution estimate is actually a CD for the unknown parameter θ.

By itself, the CD function N(ȳ_n, 1/n) provides information about θ, including a point estimate, e.g., ȳ_n; an interval estimate, e.g., $(\bar{y}_n - \Phi^{-1}(1 - \frac{\alpha}{2})/\sqrt{n},\ \bar{y}_n + \Phi^{-1}(1 - \frac{\alpha}{2})/\sqrt{n})$; and a p-value, e.g., $1 - \Phi(\sqrt{n}(b - \bar{y}_n))$ for a one-sided test concerning a null value b.
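As a concrete illustration, the sketch below reads these three summaries off the CD N(ȳ_n, 1/n). The data are simulated and the null value b is a hypothetical choice, so this is an illustrative computation under those assumptions rather than a prescribed procedure.

```python
import numpy as np
from scipy.stats import norm

# Reading inferential summaries off the CD N(ybar_n, 1/n) of Example 1.
# The data are simulated for illustration, and b is a hypothetical null
# value for the one-sided test; alpha is the significance level.
rng = np.random.default_rng(1)
n, theta_true, b, alpha = 25, 2.0, 1.5, 0.05

ybar = rng.normal(theta_true, 1.0, size=n).mean()

point_estimate = ybar                                  # CD mean/median
half_width = norm.ppf(1 - alpha / 2) / np.sqrt(n)
ci = (ybar - half_width, ybar + half_width)            # level-(1-alpha) CI
p_value = 1 - norm.cdf(np.sqrt(n) * (b - ybar))        # p-value 1 - H(b)

print(point_estimate, ci, p_value)
```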
Figure 1: A graphical illustration of some of the inferential information contained in a CD, from Xie and Singh (2013). Examples pictured include point estimators (mode $\hat{\theta}$, median $M_n$, and mean $\bar{\theta}$), a 95% confidence interval, and a one-sided p-value.
Another important concept related to CD-based inference is that of a CD-random variable. For a given sample, a CD is a distribution function defined on the parameter space. We can construct
a random variable (vector), $\theta^*_{\mathrm{CD}}$, on the parameter space such that, conditional upon the observed sample of data, $\theta^*_{\mathrm{CD}}$ follows the CD. This $\theta^*_{\mathrm{CD}}$ can be viewed as a random estimator of θ (cf., Xie and Singh 2013).

Definition 2 Let Y = y be a given sample of data and H(θ) be a CD for θ. Then $\theta^*_{\mathrm{CD}} \mid y \sim H(\cdot)$ is referred to as a CD-random variable.
In the context of Example 1, we can simulate a CD-random variable (henceforth CD-r.v.) by
generating $\theta^*_{\mathrm{CD}} \mid \bar{y}_n \sim N(\bar{y}_n, 1/n)$. Defining the CD-r.v. by conditioning on data is similar to the conditioning required in bootstrap and Bayesian inferences. We will further explore these connections in Section 3.

CD-based inference is incredibly flexible and can be presented in three forms: as a density, a cumulative distribution function, or a confidence curve. Here, a confidence curve is simply a mathematical representation of all confidence intervals for the parameter θ, with every level α marked along the vertical axis. In Example 1 for instance, a confidence density is $h(\theta) = \frac{1}{\sqrt{2\pi/n}} \exp\{-\frac{n}{2}(\theta - \bar{y}_n)^2\}$; a CD in cumulative distribution function form is $H(\theta) = \Phi(\sqrt{n}(\theta - \bar{y}_n))$; and a confidence curve is the function $CV(\theta) = 2\min\{H(\theta), 1 - H(\theta)\} = 2\min\{\Phi(\sqrt{n}(\theta - \bar{y}_n)),\ 1 - \Phi(\sqrt{n}(\theta - \bar{y}_n))\}$. See also Figure 2(a).
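The following sketch, using placeholder values for n and ȳ_n in place of observed data, draws from the CD-r.v. and evaluates all three presentations on a grid; cutting the confidence curve at height α recovers the level-(1 − α) confidence interval.

```python
import numpy as np
from scipy.stats import norm

# The three presentations of the CD in Example 1, plus CD-r.v. draws.
# n and ybar are placeholder values standing in for observed data.
rng = np.random.default_rng(2)
n, ybar = 25, 2.1

# CD-random variable: given the data, theta*_CD ~ N(ybar, 1/n).
theta_cd = rng.normal(ybar, 1.0 / np.sqrt(n), size=10_000)

theta = np.linspace(ybar - 1.0, ybar + 1.0, 401)       # grid on Theta
h = norm.pdf(theta, loc=ybar, scale=1.0 / np.sqrt(n))  # confidence density
H = norm.cdf(np.sqrt(n) * (theta - ybar))              # CD (cdf form)
cv = 2 * np.minimum(H, 1 - H)                          # confidence curve

# Cutting the confidence curve at height alpha recovers the (1 - alpha) CI.
alpha = 0.05
print(theta[cv >= alpha][[0, -1]])
```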
The concept of a CD is also useful in the case where the sample of data is limited in size and drawn from a discrete distribution. In these situations, although (asymptotic) CDs may not be attainable, one may instead examine the difference between the discrete distribution estimator and the standard uniform distribution to get an idea of the under/over coverage of the confidence intervals derived from the CD. The definitions of lower and upper CDs are provided below for a scalar parameter θ ∈ Θ. Upper and lower CDs thus provide inferential statements in applications to discrete distributions with finite sample sizes.
Definition 3 A function $H^+(\cdot) = H^+(Y, \cdot)$ on Y × Θ → [0, 1] is said to be an upper CD for θ ∈ Θ if (i) for each given Y = y ∈ Y, $H^+(\cdot)$ is a monotonic increasing function on Θ with values ranging within (0, 1); and (ii) at the true parameter value θ = θ₀, $H^+(\theta_0) = H^+(Y, \theta_0)$, as a function of the sample Y, is stochastically less than or equal to a uniformly distributed random variable U ∼ U(0, 1), i.e.,

$$\Pr(H^+(Y, \theta_0) \le t) \ge t, \quad \text{for all } t \in (0, 1). \tag{1}$$

A lower CD $H^-(\cdot) = H^-(Y, \cdot)$ for parameter θ can be defined similarly, but with (1) replaced by

$$\Pr(H^-(Y, \theta_0) \le t) \le t, \quad \text{for all } t \in (0, 1).$$
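As a concrete check of these inequalities, the Monte Carlo sketch below uses a Binomial(m, p₀) observation with one standard construction of upper and lower CDs, $H^+(y, p) = \Pr_p(X > y)$ and $H^-(y, p) = \Pr_p(X \ge y)$; this particular construction is our assumption for illustration, as the text does not specify one.

```python
import numpy as np
from scipy.stats import binom

# Monte Carlo check of the stochastic dominance inequalities for a
# Binomial(m, p0) observation. The constructions below are one standard
# choice (an assumption, not from the text): H_plus(y, p) = Pr_p(X > y)
# and H_minus(y, p) = Pr_p(X >= y), both increasing in p for fixed y.
rng = np.random.default_rng(3)
m, p0, n_rep = 20, 0.3, 100_000

y = rng.binomial(m, p0, size=n_rep)
H_plus = binom.sf(y, m, p0)       # Pr_{p0}(X > y), evaluated at true p0
H_minus = binom.sf(y - 1, m, p0)  # Pr_{p0}(X >= y)

for t in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(t,
          (H_plus <= t).mean(),   # should be >= t   (upper CD, eq. (1))
          (H_minus <= t).mean())  # should be <= t   (lower CD)
```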
Even if condition (i) above does not hold, $H^+(\cdot)$ and $H^-(\cdot)$ are still referred to as upper and lower CDs, respectively. This is because, regardless of whether the monotonicity condition holds, the stochastic dominance inequalities in the definition of upper and lower CDs ensure that, for any α ∈ (0, 1), we have