
Bridging Bayesian, frequentist and fiducial (BFF) inferences using confidence distribution

Suzanne Thornton

Department of Mathematics and Statistics, Swarthmore College, Swarthmore, PA 19081, USA

Min-ge Xie

Department of Statistics, Rutgers University, New Brunswick, NJ 08854, USA

arXiv:2012.04464v1 [stat.ME] 8 Dec 2020

Email addresses: [email protected] and [email protected]. The research is supported in part by NSF research grants DMS1737857, DMS1812048, DMS2015373 and DMS2027855.

Abstract

Bayesian, frequentist and fiducial (BFF) inferences are much more congruous than they have been perceived historically in the scientific community (cf., Reid and Cox 2015; Kass 2011; Efron 1998). Most practitioners are probably more familiar with the two dominant statistical inferential paradigms, Bayesian inference and frequentist inference. The third, lesser-known fiducial inference paradigm was pioneered by R.A. Fisher in an attempt to define an inversion procedure for inference as an alternative to Bayes' theorem. Although each paradigm has its own strengths and limitations subject to their different philosophical underpinnings, this article intends to bridge these different inferential methodologies through the lenses of confidence distribution theory and Monte-Carlo simulation procedures. This article attempts to understand how these three distinct paradigms, Bayesian, frequentist, and fiducial inference, can be unified and compared on a foundational level, thereby increasing the range of possible techniques available to both statistical theorists and practitioners across all fields.

Key words: confidence distribution, Bayesian posterior, Bootstrap estimator, Fiducial inversion, frequentist inference

1 Introduction

One motivation for developing confidence distributions is to address a simple question: "Can frequentist inference rely on a distribution function, or a 'distribution estimator', in a manner similar to a Bayesian posterior?" (cf., Xie and Singh 2013). The answer is affirmative; in fact, many consider confidence distributions to be the "frequentist analogues" of Bayesian posterior distributions (Schweder and Hjort, 2003). Although the concept of confidence distributions is developed completely within a frequentist framework, this special type of estimator also serves as a promising potential unifier among Bayesian, frequentist, and fiducial methods. In the subsequent sections we first provide a brief review of the concept of confidence distributions and develop an appreciation for the flexibility of the inferential framework provided by confidence distributions, and then we highlight two deeply rooted connections on estimation and uncertainty quantification

among confidence distributions, bootstrap distributions, fiducial distributions, and Bayesian posterior functions. These connections provide us with a unified platform from which we can compare and combine inference across different paradigms, and which also allows us to discover a new assertion that, across all paradigms, a parameter has both a "fixed" and a "random" version, where the fixed version is the unknown ("true" or "realized") target quantity of interest and the random version describes the uncertainty of our inference concerning the target quantity.

2 Confidence Distribution: A Distribution Estimator

Traditionally, statistical analysis uses a single point or perhaps an interval to estimate an unknown target parameter. However, another valid option is to instead consider using a sample-dependent function, specifically, a sample-dependent distribution function on the parameter space, to estimate the parameter of interest. This is exactly the motivation for developing and utilizing confidence distributions. A confidence distribution is a sample-dependent distribution (or density) function, but its main purpose is similar to that of any other statistical estimator. The technical definition of a confidence distribution is therefore a simple, prescriptive definition. In the definition below Y and Θ are the sample and parameter space, respectively.

Definition 1 (Confidence Distribution) A sample-dependent function on the parameter space, i.e. a function on Y × Θ, H(·) = H(Y, ·), is called a confidence distribution (CD) for θ ∈ Θ if:
[R1] For each given sample Y = y ∈ Y, the function H(·) = H(y, ·) is a distribution function on the parameter space; and
[R2] The function can provide confidence intervals (or regions) of all levels for the parameter θ.
If [R2] holds asymptotically (or approximately), then H(·) is called an asymptotic (or approximate) CD (aCD).

The definition of a CD consists of two requirements, i.e., [R1] and [R2], and is analogous to the definitions of a consistent or an unbiased estimator in point estimation. Specifically, consistent or unbiased point estimators are estimators such that [R1] the estimator is a sample-dependent point in the parameter space and [R2] the estimator satisfies a particular performance-based criterion. For a consistent estimator, [R2] is that the estimator approaches the true parameter value as the sample

size increases; for unbiased estimators, [R2] is that the expectation of the estimator is equal to the true parameter value.

When the parameter θ is scalar, requirement [R2] can be expressed as the requirement that, at the true parameter value θ = θ0, H(θ0) ≡ H(Y, θ0), as a function of the random sample Y, follows a standard uniform distribution (cf. e.g., Xie and Singh 2013; Schweder and Hjort 2016). When θ is a vector, we may not have a simple or neat mathematical expression for [R2]. In some cases, a simultaneous CD of a certain form for multiple parameters may not even exist (cf., Xie and Singh 2013; Schweder and Hjort 2016). Fortunately, requirement [R2] in Definition 1 only asks for one set of confidence regions of all levels, which suffices for drawing inference statements. In this manner, there are several options for defining multivariate CDs, e.g., Singh et al. (2007); Xie and Singh (2013); Schweder and Hjort (2016); Liu et al. (2020).

The idea of using a distribution function as an estimator is not new or unique to CD theory. A bootstrap distribution is a distribution estimator for the unknown parameter (Efron, 1998). In Bayesian inference, the posterior distribution is also a distribution estimator for the unknown parameter. The benefit of using an entire function as an estimator, rather than an interval or a point estimator, is that these distribution functions carry a greater wealth of information about the parameter of interest. A CD estimator, like a Bayesian posterior distribution, enables a practitioner to draw inferential conclusions about the unknown parameter. For example, if one is provided with a CD for the parameter of interest, one can readily derive a point estimate, an interval estimate, and a p-value, among other features.

The simple example presented next will be revisited throughout the remainder of the chapter as an illustrative example.

Example 1 (Inference for Gaussian Data) Suppose we have a random sample of data, y = (y1, . . . , yn), from a N(θ, 1) distribution. To estimate the unknown θ we may consider 1) a point estimate, e.g. ȳn = (1/n) Σᵢ₌₁ⁿ yᵢ; 2) an interval estimate, e.g. (ȳn − 1.96/√n, ȳn + 1.96/√n); or 3) a distribution estimate, e.g. N(ȳn, 1/n). This distribution estimate is actually a CD for the unknown parameter θ.

By itself, the CD function N(ȳn, 1/n) provides information about θ including a point estimate, e.g. ȳn; an interval estimate, e.g. (ȳn − Φ⁻¹(1 − α/2)/√n, ȳn + Φ⁻¹(1 − α/2)/√n); and a p-value, e.g. 1 − Φ(√n(b − ȳn)) for a hypothesized value b.
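As a small illustration, these summaries can all be read directly off the CD N(ȳn, 1/n). The Python sketch below uses only the standard library; the function name and data are our own illustration, not from the paper:

```python
from statistics import NormalDist

def gaussian_mean_cd_summaries(y, alpha=0.05, b=0.0):
    """Read a point estimate, a (1 - alpha) interval, and a one-sided
    p-value for H0: theta <= b off the CD N(ybar, 1/n)."""
    n = len(y)
    ybar = sum(y) / n
    cd = NormalDist(mu=ybar, sigma=1.0 / n ** 0.5)   # the CD as a distribution
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)      # Phi^{-1}(1 - alpha/2)
    interval = (ybar - z / n ** 0.5, ybar + z / n ** 0.5)
    p_value = 1.0 - cd.cdf(b)                        # 1 - Phi(sqrt(n)(b - ybar))
    return ybar, interval, p_value
```

All three summaries come from the same sample-dependent distribution, which is the point of treating the CD itself as the estimator.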

Figure 1: A graphical illustration of some of the inferential information contained in a CD, from Xie and Singh (2013). Examples pictured include point estimators (θ̂, Mn, and mean θ̄), a 95% CI, and a one-sided p-value.

Another important concept related to CD-based inference is that of a CD-random variable. For a given sample, a CD is a distribution function defined on the parameter space. We can construct

a random variable (vector), θ*CD, on the parameter space such that, conditional upon the observed sample of data, θ*CD follows the CD. This θ*CD can be viewed as a random estimator of θ (cf., Xie and Singh 2013).

Definition 2 Let Y = y be a given sample of data and H(θ) be a CD for θ. Then θ*CD | y ∼ H(·) is referred to as a CD-random variable.

In the context of Example 1, we can simulate a CD-random variable (henceforth CD-r.v.) by generating θ*CD | ȳn ∼ N(ȳn, 1/n). Defining the CD-r.v. by conditioning on data is similar to the conditioning required in bootstrap and Bayesian inferences. We will further explore these connections in Section 3. CD-based inference is incredibly flexible and can be presented in three forms: in a density, a cumulative distribution function, or a confidence curve form. Here, a confidence curve is simply a mathematical representation of all confidence intervals for parameter θ along every level of α on the

vertical axis. In Example 1 for instance, a confidence density is h(θ) = √(n/(2π)) exp{−(n/2)(θ − ȳn)²}; a CD in the cumulative distribution function form is H(θ) = Φ(√n(θ − ȳn)); and a confidence curve is the function CV(θ) = 2 min{H(θ), 1 − H(θ)} = 2 min{Φ(√n(θ − ȳn)), 1 − Φ(√n(θ − ȳn))}. Also, see Figure 2 (a).
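The three presentations can be built numerically from one another. The following standard-library sketch (names are ours, not the paper's) constructs h, H, and CV for Example 1:

```python
import math
from statistics import NormalDist

def cd_three_forms(ybar, n):
    """Return the confidence density h, the CD in cdf form H, and the
    confidence curve CV for the CD N(ybar, 1/n) of Example 1."""
    cd = NormalDist(mu=ybar, sigma=1.0 / math.sqrt(n))
    h = cd.pdf                      # h(t) = sqrt(n/(2*pi)) exp(-(n/2)(t - ybar)^2)
    H = cd.cdf                      # H(t) = Phi(sqrt(n)(t - ybar))
    def CV(t):                      # every confidence interval, read at height alpha
        return 2.0 * min(H(t), 1.0 - H(t))
    return h, H, CV
```

The confidence curve peaks at 1 at the median of the CD and falls toward 0 away from it, so a horizontal cut at height α recovers the (1 − α) interval.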

The concept of a CD is also useful in the case where the sample of data is limited in size and drawn from a discrete distribution. In these situations, although (asymptotic) CDs may not be attainable, one may instead examine the difference between the discrete distribution estimator and the standard uniform distribution to get an idea of the under/over coverage of the confidence intervals derived from the CD. The definitions of lower and upper CDs are provided below for a scalar parameter θ ∈ Θ. Upper and lower CDs thus provide inferential statements in applications to discrete distributions with finite sample sizes.

Definition 3 A function H⁺(·) = H⁺(Y, ·) on Y × Θ → [0, 1] is said to be an upper CD for θ ∈ Θ if (i) for each given Y = y ∈ Y, H⁺(·) is a monotonic increasing function on Θ with values ranging within (0, 1); and (ii) at the true parameter value θ = θ0, H⁺(θ0) = H⁺(Y, θ0), as a function of the sample Y, is stochastically less than or equal to a uniformly distributed random variable U ∼ U(0, 1), i.e.,

Pr(H⁺(Y, θ0) ≤ t) ≥ t, for all t ∈ (0, 1).    (1)

A lower CD H⁻(·) = H⁻(Y, ·) for parameter θ can be defined similarly, but with (1) replaced by

Pr(H⁻(Y, θ0) ≤ t) ≤ t, for all t ∈ (0, 1).

Even if condition (i) above does not hold, H⁺(·) and H⁻(·) are still referred to as upper and lower CDs, respectively. This is because, regardless of whether or not the monotonicity condition holds, the stochastic dominance inequalities in the definitions of upper and lower CDs imply that for any α ∈ (0, 1) we have

Pr(θ0 ∈ {θ : H⁺(Y, θ) ≤ α}) ≥ α and Pr(θ0 ∈ {θ : H⁻(Y, θ) ≤ α}) ≤ α.

Thus, a level (1 − α) confidence interval (or set) {θ : H⁺(Y, θ) ≤ 1 − α} or {θ : H⁻(Y, θ) ≥ α} has a guaranteed coverage rate of at least (1 − α)100%. Without condition (i), however, H⁺(·) and H⁻(·) may not be distribution functions and we may lose the "nested-ness" of the confidence intervals (or sets). This means that a level (1 − α) confidence set, C1−α, may not necessarily be contained in a corresponding level (1 − α′) confidence set, C1−α′, for some α′ < α. The following is an example of upper and lower CDs for the parameter of a discrete distribution. For more details on this example, see Hannig and Xie (2012). Another more complex example can be found in Luo et al. (2020), which links Fisher's sharp null test in causal inference with the concepts of lower and upper CDs and extended confidence curves.

Example 2 (Upper and Lower CDs from a Binomial Sample) Suppose some sample Y is from a Binomial(n, p0) distribution with realization y. Let Hn(p, y) = P(Y > y) = Σ_{k>y} C(n, k) p^k (1 − p)^{n−k}. It can be shown that Pr(Hn(p0, Y) ≤ t) ≥ t and Pr(Hn(p0, Y − 1) ≤ t) ≤ t for all t ∈ (0, 1); that is, Hn(p, Y) is an upper CD and Hn(p, Y − 1) is a lower CD for p0.
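The two stochastic inequalities are easy to check by Monte Carlo. The sketch below is our own illustration (sample size, p0, t, and seed are arbitrary), using only the standard library:

```python
import math
import random

def H_upper(n, p, y):
    """H_n(p, y) = P(Y > y) for Y ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(y + 1, n + 1))

random.seed(1)
n, p0, t = 20, 0.3, 0.25
draws = [sum(random.random() < p0 for _ in range(n)) for _ in range(10_000)]

# Empirical versions of Pr(H_n(p0, Y) <= t) >= t  (upper CD) and
# Pr(H_n(p0, Y - 1) <= t) <= t  (lower CD):
upper_freq = sum(H_upper(n, p0, y) <= t for y in draws) / len(draws)
lower_freq = sum(H_upper(n, p0, y - 1) <= t for y in draws) / len(draws)
```

The gap between `upper_freq` and `lower_freq` reflects the discreteness of the binomial: the exact coverage of a nominal-level interval sits between the two.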

As we demonstrate in the remaining sections of this chapter, the concept of a CD includes a broad class of common inferential methods. Among this class of methods are bootstrap distributions, (normalized) likelihood functions, (normalized) empirical likelihood functions, p-value functions, fiducial distributions, some informative priors and Bayesian posterior distributions, and more. As summarized in Cox (2013), the usefulness of inference from a CD perspective lies in its ability to "provide simple and interpretable summaries of what can reasonably be learned from data (and an assumed model)". Regardless of whether the setting is parametric or nonparametric, normal or non-normal, exact or asymptotic, CD inference is possible as long as one can create confidence intervals (or regions) of all levels for the parameter of interest. This fact illustrates the appeal of CD-based inference as a union among different inferential paradigms. Before moving on to the next section, we present a CD framework for a couple of common inference problems in likelihood inference and hypothesis testing, and consider some relevant examples.

Likelihood-based approaches and their extensions are arguably the most commonly used methodology to answer questions of statistical inference. Asymptotic, large-sample arguments are frequently used to quantify the uncertainty in our point estimators from likelihood-based approaches. The very same consequences of the central limit theorem also provide a connection between the likelihood function and a CD, provided ∫ L(θ | data) dθ < ∞, where L(θ | data) is the likelihood function or a generalized likelihood function.

For instance, Fraser and McDunnough (1984) showed that a normalized likelihood function is an aCD and Singh et al. (2007) demonstrated that a normalized profile likelihood function is also an aCD. Therefore, the inferential conclusions based on these CDs match the conclusions based on the likelihood or profile likelihood functions. Higher order asymptotic developments relating

likelihood inference to CDs include, for instance, Hall (1992); Reid and Fraser (2010); Pierce and Bellio (2017), among others.

Example 3 Again, suppose we have a random sample of data, y = (y1, . . . , yn), from a N(θ, 1) distribution. The likelihood function is L(θ | y1, . . . , yn) ∝ Πᵢ f(yᵢ | θ) ∝ exp{−(1/2) Σᵢ (yᵢ − θ)²} ∝ exp{−(n/2)(ȳn − θ)²}. Normalizing the likelihood with respect to θ we get L(θ | y)/∫ L(θ | y)dθ = √(n/(2π)) exp{−(n/2)(θ − ȳn)²}, which is exactly the density function of the CD: a N(ȳn, 1/n) distribution function. Furthermore, since we are only dividing by a constant when normalizing (standardizing) the likelihood function, the mode of the CD is the same as that of the likelihood. Thus, inferential conclusions drawn from the likelihood function L(θ | y) are exactly the same as the inferential conclusions drawn from the CD function N(ȳn, 1/n).
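This normalization is straightforward to verify numerically. The sketch below (our own illustration, standard library only) normalizes exp{−(n/2)(θ − ȳn)²} on a grid and compares it with the N(ȳn, 1/n) density:

```python
import math
from statistics import NormalDist

ybar, n = 10.0, 12

def lik(t):
    """Likelihood L(theta | y), known only up to a multiplicative constant."""
    return math.exp(-0.5 * n * (t - ybar) ** 2)

# Normalize over a grid wide enough to hold essentially all of the mass.
step = 4.0 / 20_000
grid = [ybar - 2.0 + i * step for i in range(20_001)]
Z = sum(lik(t) for t in grid) * step                  # Riemann sum for ∫ L dθ

cd_density = NormalDist(ybar, 1.0 / math.sqrt(n)).pdf  # density of N(ybar, 1/n)
max_abs_err = max(abs(lik(t) / Z - cd_density(t)) for t in grid)
```

Up to quadrature error, the normalized likelihood and the CD density coincide pointwise, which is the content of Example 3.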

Likelihood-based inference represents only one particular generative method for obtaining a CD, however, since CDs can be obtained in many other cases without likelihood functions. We discuss some of these settings without likelihood functions later in the chapter.

Hypothesis tests are another common inferential method intimately linked to CDs. Consider a one-sided hypothesis testing problem K0 : θ ≤ θ0 versus K1 : θ > θ0, or a two-sided test K0 : θ = θ0 versus K1 : θ ≠ θ0. Denote by pn = pn(θ0) the p-value from a testing method. The p-value, pn(θ0), depends on both the observed sample data and the testing value of the parameter, θ0. As θ0 varies in the parameter space Θ, pn = pn(θ0) is a function on the parameter space called a p-value function or a significance function; see, e.g., Fraser (1991). Often, pn(θ0) (as a function of the random sample) follows a standard uniform distribution under the null hypothesis. As a function of θ0, pn(θ0) is typically monotonically increasing in the one-sided hypothesis test setting, and first increases and then decreases in the case of the two-sided hypothesis test. Singh et al. (2007) and Xie and Singh (2013) show that the p-value function from the one-sided test corresponds to a regular CD function and the p-value function from a two-sided test corresponds to a confidence curve.

Example 4 In the setting of Example 1, for the one-sided hypothesis test for θ, K0 : θ = θ0 versus K1 : θ > θ0, the p-value is calculated by pn(θ0) = P(Ȳ > ȳn) = Φ(√n(θ0 − ȳn)). As θ0 varies in Θ, pn(θ0) is the cumulative distribution function of the distribution N(ȳn, 1/n) and is therefore a CD for θ. For the two-sided hypothesis test K0 : θ = θ0 versus K1 : θ ≠ θ0, the p-value is calculated by 2 min{p(θ0), 1 − p(θ0)} = 2 min{Φ(√n(θ0 − ȳn)), Φ(√n(ȳn − θ0))}, which is a confidence curve for θ.

8 The next example considers the Mann-Whitney test for determining whether two independent samples are drawn from the same population. For a careful discussion on setting up the hypotheses for a Mann-Whitney test, we refer the reader to Divine et al. (2018).

Example 5 Suppose we observe two independent random samples (x1, . . . , xn1) and (y1, . . . , yn2) where each sample is drawn from an unknown continuous distribution (say, F(·) and G(·), respectively) that may only differ in location. (Without loss of generality, suppose n1 < n2.) In this case, the Mann-Whitney test for a difference in the population distributions is equivalent to testing the null that K0 : F(t) = G(t), for all t, versus the alternative K1 : F(t) = G(t − θ), for every t and for some θ ≠ 0. More generally, this test can be restated in terms of a particular location shift θ0 > 0 where we consider K0 : θ = θ0 versus K1 : θ ≠ θ0.

The test statistic for K0 is Uθ0 = n1n2 + n1(n1 + 1)/2 − R1, where R1 = R1(θ0) is the sum of the ranks of the observations (x1, . . . , xn1) when pooled together with the shifted sample (y1 + θ0, . . . , yn2 + θ0) (Hollander et al., 2014). Let H(t) = P(Uθ0 ≤ t | θ = θ0) be the distribution of Uθ0 under the null hypothesis. Then the p-value for this test is calculated by 2 min{p(θ0), 1 − p(θ0)}, which is a confidence curve for θ. Here p(θ0) = H(uθ0,obs) and uθ0,obs represents the observed value of Uθ0. See also Figure 2 (b).
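To make this concrete, here is a small Python sketch (our own, standard library only) of the confidence-curve value at a hypothesized shift θ0. It uses the rank-sum statistic above with a normal approximation to its null distribution, an assumption we add for simplicity (the exact null law could be used instead), and it assumes no ties after shifting:

```python
import math
from statistics import NormalDist

def mw_confidence_curve(x, y, theta0):
    """Confidence-curve value 2 * min{p(theta0), 1 - p(theta0)} for the
    location shift, with p(theta0) from a normal approximation to U."""
    n1, n2 = len(x), len(y)
    pooled = sorted(x + [v + theta0 for v in y])
    rank = {v: i + 1 for i, v in enumerate(pooled)}    # assumes no ties
    r1 = sum(rank[v] for v in x)
    u = n1 * n2 + n1 * (n1 + 1) / 2 - r1               # U_{theta0}
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    p = NormalDist().cdf((u - mean) / sd)              # p(theta0) = H(u_obs)
    return 2.0 * min(p, 1.0 - p)
```

Evaluating this function over a grid of θ0 values traces out the confidence curve shown in Figure 2 (b): it is highest near plausible shifts and drops toward 0 for implausible ones.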

As with likelihood-based inference, hypothesis tests represent only one of many other generative methods by which a CD can be obtained.

3 Two Levels of Inferential Unity Through Confidence Distributions

Now we will explore the key arguments that point to CD theory as a novel connection among the three dominant paradigms of statistical inference. In subsection 3.1, we describe a conceptual union of BFF methods, highlighting that CDs, Bayesian posteriors and fiducial distribution functions are each defined as distribution (or function) estimators for the parameters of interest. This commonality provides the statistical practitioner with a way to combine studies across different inferential paradigms. In subsection 3.2, we examine a deeply rooted union among BFF methods, wherein the generation of artificial data (or Monte-Carlo simulations) can be used to quantify the uncertainty of these various distribution estimators. We specifically emphasize a common theme among

Figure 2: Each row displays three representations of a confidence distribution: a CD density, a CD curve, and a confidence curve (CV). In plot (a), the sample of data (n = 12) is from a N(θ0 = 10, 1) distribution; the panel shows CD-inference for θ in the setting of Example 1. In plot (b), two independent random samples (n1 = 10, n2 = 9) are from two t5 distributions, respectively, where the center locations of the two t5 distributions are θ0 = 1 apart; the panel shows CD-inference for the location θ in the setting of Example 5.

every paradigm is the duality of parameters: the random version which describes the inferential uncertainty and the fixed version which represents the target of interest.

3.1 Distribution Estimators: Performance-based Properties vs. Deductive Reasoning

CD theory presents a unified inferential scheme that connects all three statistical paradigms, Bayesian, fiducial, and frequentist inference, at a high level as a class of distribution (function) estimators for a target parameter. In any of these frameworks, statistical conclusions rely on sample-dependent (distribution) functions on the parameter space, and these functions are used to make inferential conclusions by way of point estimation, interval estimation, hypothesis testing, and more.

As we saw in Section 2, the definition of a CD originates from a pragmatic, behaviorist point of view; a CD requires a performance-based property. In contrast, fiducial and Bayesian estimators are defined based on deductive reasoning and therefore require that specific procedures are followed to deduce any inferential conclusions. These procedural conditions include techniques such as solving model equations, minimizing loss functions, or using Bayes' formula. This distinction between a pragmatically defined CD and other procedurally defined distribution estimators underscores a harmony inherent in statistical logic that has nothing to do with the choice of a particular inferential method. For instance, one may consider fiducial and Bayesian procedure-based distribution estimators as possible means to the end of a more general behavioral statistical concept, that of a CD. In this way, one can imagine how fiducial distributions and Bayesian posterior distributions (among others) are to CDs as the MLE is to a consistent estimator. This relationship is illustrated in Table 1 alongside a parallel comparison from point estimation: the definition of a consistent (point) estimator is performance-based, while an MLE or M-estimator is procedure-based and provides a possible means to obtain consistent estimators. Furthermore, an MLE or M-estimator does not need to be, but often is, a consistent estimator; and a consistent estimator can be derived by other means. Similarly, a fiducial distribution or a Bayesian posterior provides a possible means to obtain a CD. Furthermore, a fiducial distribution or a Bayesian posterior do not need to be CDs (though they often are, especially under a large sample setting); and a CD can come from other procedures as well.

Table 1: The CD as a unifying behavioral statistical concept.

Estimation Type | Descriptive             | Procedure-wise
Point           | Consistent estimator    | MLE, M-estimation, ...
Distribution    | Confidence distribution | Fiducial distribution, p-value function, bootstrap distribution, (normalized) likelihood, Bayesian posterior, ...

By approaching the problem of statistical inference from the more general perspective of well-behaved distribution estimators, CD theory allows us to have a common platform to compare (and even combine) inference across different paradigms in ways that would otherwise not be possible. The example below (found in Cui and Xie (2020)) uses a CD as an inferential tool to combine the results of four independent studies on simple cluster of differentiation 4 (cd-4) data (see DiCiccio and Efron 1996 for more on this type of data).

Example 6 Suppose we obtain parameter estimates for some cd-4 count data and simulate four independent data sets from the bivariate normal distribution N((μ1, μ2)′, Σ), with Σ = [σ1², ρσ1σ2; ρσ1σ2, σ2²]. Assume each study makes its inferential conclusions independent of the others. Inferential conclusions for each of the four data sets were determined by one of: Fisher's Z method (Fisher 1915; Signorell et mult. al. 2020); the bias-corrected and accelerated (BCa) bootstrap method (DiCiccio and Efron 1996; Davison and Hinkley 1997; Angelo and Ripley 2020); the profile likelihood approach (Li et al. 2018); or a Bayesian posterior from a uniform prior (Bååth, 2014), respectively.

Let p1(ρ) = p1(ρ, y1) be the p-value function from the right-sided test using Fisher's Z method, h2(ρ) = h2(ρ, y2) be the bootstrap distribution using the bootstrap BCa method, h3(ρ) = h3(ρ | y3) be the normalized profile likelihood function, and p4(ρ) = p4(ρ | y4) be the posterior distribution. We treat each of them as a CD, summarizing the inferential information in each study. A combined CD that incorporates the information from all four studies can be explicitly expressed as

Hc(ρ) = Gc{ gc( ∫₋₁^ρ p1(s)ds, ∫₋₁^ρ h2(s)ds, ∫₋₁^ρ h3(s)ds, ∫₋₁^ρ p4(s)ds ) },

where gc(u1, u2, u3, u4) is a given monotonic mapping function from the quartic cube to the real line, (0, 1)⁴ → IR, and Gc(t) = P(gc(U1, U2, U3, U4) ≤ t) with (U1, . . . , U4) being IID U(0, 1) random

Figure 3: A visual comparison of four individual CDs for ρ in Example 6 derived from four independent studies (dashed black lines) to the combined CD that incorporates information from all four studies (solid red line).

variables. See Singh et al. (2005); Xie et al. (2011) for more detailed developments on CD combining methodologies.

In the illustrations presented in Figure 3 and Table 2, our combination uses gc(u1, . . . , u4) = DE⁻¹(u1) + · · · + DE⁻¹(u4), which leads to a Bahadur-efficient combination (Singh et al., 2005). Here, DE⁻¹(·) is the inverse cumulative distribution function of the standard Laplace distribution. Figure 3 compares the individual CDs derived from each of the four different inferential methods to the combined CD that incorporates information from all four studies, and Table 2 summarizes the relative performance of each approach. Clearly, the combined inference is accurate and effective, demonstrating how one may combine inference across different paradigms.
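The combining recipe can be sketched directly. In the Python snippet below (our own illustration, not the paper's code; the Monte-Carlo evaluation of Gc and the inputs are assumptions), each study's CD is reduced to its cdf value Hi(ρ) at a fixed ρ, the values are mapped through DE⁻¹ and summed, and Gc is estimated by simulation under U(0, 1):

```python
import math
import random

def laplace_inv(u):
    """D_E^{-1}(u): inverse cdf of the standard Laplace distribution."""
    return math.log(2.0 * u) if u < 0.5 else -math.log(2.0 * (1.0 - u))

def combine_cd_values(cd_values, n_mc=200_000, seed=7):
    """Combined CD value G_c(g_c(H_1(rho), ..., H_k(rho))), with g_c the
    sum of Laplace quantiles and G_c estimated by Monte Carlo."""
    g = sum(laplace_inv(u) for u in cd_values)
    rng = random.Random(seed)
    k = len(cd_values)
    hits = sum(
        sum(laplace_inv(rng.random()) for _ in range(k)) <= g
        for _ in range(n_mc)
    )
    return hits / n_mc
```

Evaluating `combine_cd_values` over a grid of ρ values yields the solid combined curve of Figure 3; studies agreeing on a value of ρ push the combined CD sharply toward it, which is why the combined interval in Table 2 is so much shorter.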

Table 2: Coverage results for five different inferential methods mentioned in Example 6, the last row of which is a CD approach that combines individual-study inference from the four independent studies. Coverage is computed for the parameter ρ over 200 replications. Each observed sample was drawn from a bivariate normal model with true parameter values μ1 = 3.288, μ2 = 4.093, ρ = 0.723, σ1² = 0.657, σ2² = 1.346.

Method                | 95% CI Coverage | Mean length (sd)
Fisher's Z method     | 0.948           | 0.484 (0.140)
Bootstrap BCa         | 0.936           | 0.464 (0.156)
Profile likelihood    | 0.918           | 0.436 (0.131)
Bayes (uniform prior) | 0.964           | 0.522 (0.128)
Combined CD           | 0.954           | 0.226 (0.041)

3.2 Connecting the Roots of Statistical Inference Through Artificial Sampling and the Duality of Parameters

Statistical inference is primarily concerned with the quantification of uncertainty inherent in sampling data from a population. Traditionally in practice, we observe only a single copy of the data and we assess this uncertainty by assuming some statistical model accurately describes the behavior of the population at large. There are, however, alternative means to assess this uncertainty; for example, one popular approach creates many copies of artificial (or "fake") data using Monte-Carlo procedures. Aided by modern advancements in computer science and technology, these Monte-Carlo methods have proven to be useful and effective, and there exists a rich literature exploring these various techniques, which include the bootstrap method, approximate Bayesian computing (ABC) methods, generalized fiducial inference (GFI), knock-off sample and permutation methods, and more; see, e.g., Efron and Tibshirani (1993); Ernst (2004); Barber and Candès (2015); Hannig et al. (2016); Robert (2016); Li and Fearnhead (2018).

In this section, we explore how CD theory connects Bayesian, frequentist and fiducial inferential frameworks through three prominent artificial sampling procedures: bootstrap sampling, generalized fiducial inference, and approximate Bayesian computing. In developing these connections, a

union among Bayesian, frequentist and fiducial inferences becomes apparent as two common themes emerge: the development of an inversion argument and the role of two versions (fixed and random) of model parameters.

3.2.1 Random Estimators and Inferential Inversion Arguments

Bootstrap Random Estimators

The bootstrap method mimics sampling randomness by simulating artificial data through a resampling process, and the randomness of the "fake" data is used to draw inferential conclusions about the parameter of interest. The mechanism that makes this simulation approach successful is called the bootstrap central limit theorem (Singh, 1981; Freedman and Bickel, 1981). This theorem states general conditions under which, for a parameter θ with true value θ0 and an estimator θ̂, we have, as the sample size n → ∞,

(θ*BT − θ̂) | data ∼ (θ̂ − θ) | θ = θ0,    (2)

where θ*BT represents the bootstrap estimator, that is, the estimator as a function of a set of bootstrap sample data. The bootstrap central limit theorem supports inferential conclusions based on the bootstrapped sample data because, in equation (2), the simulated artificial (resampling) variability of the bootstrap estimator, conditioned upon the observed sample data, matches the sampling variability of the estimator as a function of a random sample from the study population. Consider again the setting of Example 1. The bootstrap central limit theorem asserts that since Ȳn − θ0 ∼ N(0, 1/n), it follows that, conditional upon the observed sample, (ȳ*BT − ȳn) | ȳn ∼ N(0, 1/n) as well. When the bootstrap procedure is applicable, the bootstrap estimator is closely connected

to the concept of a CD-r.v. With the CD-r.v. θ*CD | ȳn ∼ N(ȳn, 1/n) from Example 1, we have (θ*CD − ȳn) | ȳn ∼ (Ȳn − θ) | θ = θ0, where both follow N(0, 1/n). If θ*CD is replaced by a bootstrap sample mean, ȳ*BT, the statement above matches exactly the bootstrap central limit theorem (2). Whereas the bootstrap central limit theorem is the key property ensuring the validity of bootstrap inference, a similar argument underscores the inferential validity of CD methods, that is

(θ*CD − θ̂) | data ∼ (θ̂ − θ) | θ = θ0,    (3)

where θ̂ is a point estimator (often the MLE) of θ; cf., Xie and Singh (2013). Comparing equations (2) and (3), θ*CD essentially performs the same as a bootstrap estimator, θ*BT. Their common trait is that the simulated artificial variability of the random estimator, θ*CD or θ*BT, conditioned upon the observed sample of data, matches the sampling variability of the estimator θ̂ as a function of a random sample from the population.
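The bootstrap side of this comparison is easy to simulate. The sketch below (our own illustration using Python's standard library; sample size and seed are arbitrary) draws one sample from N(θ0, 1), generates bootstrap means, and checks that ȳ*BT − ȳn, given the data, behaves like N(0, 1/n) as the bootstrap central limit theorem (2) asserts:

```python
import random
import statistics

random.seed(3)
n, theta0 = 50, 10.0
y = [random.gauss(theta0, 1.0) for _ in range(n)]       # one observed sample
ybar = sum(y) / n

# Bootstrap copies of the sample mean, conditional on the observed data:
boot_means = [sum(random.choices(y, k=n)) / n for _ in range(5_000)]
centered = [b - ybar for b in boot_means]               # ybar*_BT - ybar | data

# By (2), 'centered' should look like N(0, 1/n); 1/sqrt(50) ≈ 0.141.
boot_sd = statistics.stdev(centered)
```

The same draws, shifted by ȳn, could equally be read as realizations of the CD-r.v. θ*CD of Example 1, which is exactly the correspondence between (2) and (3).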

In contrast to the bootstrap procedure, CD-r.v.s can be obtained through many different simulation schemes beyond resampling. As discussed in an earlier section, a CD random estimator can be simulated from a CD obtained through normalized likelihood methods or p-value functions. The CD can also be obtained from pivot statistics or even from Bayesian or fiducial procedures, as we will discuss next. Thus, although the bootstrap estimator functions much in the same way as a CD-r.v., the concept of a CD is actually much broader than that of a bootstrap distribution, as will be made apparent in the following sections.

Fiducial and Generalized Fiducial Inference

In modern-day statistics, R.A. Fisher's fiducial method is understood as an inversion method to solve a structural model (or an algorithmic model) for the model parameter θ. Assume the sample data is generated by the following algorithmic model

Y = G(θ, U),    (4)

where G(·, ·) is a general model of a known form, θ is the parameter of interest and U ∼ D(·) is some unobserved random noise vector following a known distribution. In Example 1, the data model is Yi = θ + Ui, i = 1, . . . , n, in the form of (4), where Y = (Y1, . . . , Yn)′ and U = (U1, . . . , Un)′ ∼ N(0, I). In classical fiducial inference, model (4) is also known as a structural model, especially when Y is replaced with a summary statistic; cf. Fraser (1966, 1968). In Example 1, for instance, we have Ȳ = θ + Ū, where Ȳ = Σ Yi/n and Ū = Σ Ui/n, respectively. R.A. Fisher suggested a fiducial inversion process whereby one would re-write θ = Ȳ − Ū and asserted that for any observed Ȳ = ȳ, we have θ = ȳ − Ū. Since Ū ∼ N(0, 1/n), the fiducial distribution of θ is then N(ȳ, 1/n). This fiducial distribution N(ȳ, 1/n) is identical to the CD described in Example 1 and is an effective distribution estimator to draw inference for the unknown θ. However, there is an (in)famous paradox in this classical fiducial argument, i.e., the "hidden subjectivity" underlying the fiducial claim that the sample Y = y can be observed while U can remain a random sample (Dempster, 1963). Because the data Y and the random U in (4) are completely dependent, if one is given, the other must be as well (even if not observed). When Y = y is observed, the correct equation for Example

16 1 is θ =y ¯ − u¯, rather than θ =y ¯ − U¯. In this equation, u¯ is the unobserved realization of U¯ corresponding to observed sample realization y.

With a slight adjustment in perspective, namely by treating the fiducial method as a stochastic inversion algorithm that solves for a random estimator of θ, we can avoid this paradoxical reasoning and use the fiducial distribution in a manner consistent with frequentist statistical inference. Specifically, once the data are observed, i.e. Y = y, the corresponding value of U = u is also realized. Model (4) is then

y = G(θ, u),     (5)

where u is an unobserved realization from the known distribution D(·). One may simulate an artificial copy of U, say u∗ ∼ D(·); a random estimator θ∗ (also known as a fiducial sample) is then the solution of y = G(θ∗, u∗). Repeatedly solving this equation with multiple copies of u∗ leads to multiple copies of θ∗. The underlying distribution of θ∗ is a fiducial distribution. In Example 1, for instance, ū∗ ∼ N(0, 1/n) and θ∗|y ∼ N(ȳ, 1/n). As we will soon see, these θ∗ play the same role as bootstrap estimators or CD-r.v.s.
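The stochastic inversion just described can be sketched for Example 1, where solving ȳ = θ∗ + ū∗ for each artificial draw ū∗ ∼ N(0, 1/n) reproduces the fiducial distribution N(ȳ, 1/n); the observed value of ȳ below is an illustrative number, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Setting of Example 1 on the summary-statistic scale: ybar = theta + ubar,
# with ubar ~ N(0, 1/n).
n = 30
ybar = 5.1  # observed sample mean (illustrative value)

# Fiducial stochastic inversion: draw artificial copies ubar* ~ N(0, 1/n)
# and solve ybar = theta* + ubar* for theta*.
B = 200_000
ubar_star = rng.normal(0.0, 1.0 / np.sqrt(n), size=B)
theta_star = ybar - ubar_star  # fiducial sample

# The empirical distribution of theta* approximates N(ybar, 1/n),
# the fiducial distribution (and the CD) of Example 1.
print(theta_star.mean(), theta_star.std())
```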

Although θ0 is always a solution to the equation y = G(θ, u) with the realized pair (y, u), when u is replaced by u∗, a solution to y = G(θ, u∗) may not exist. The so-called generalized fiducial inference method addresses this issue by introducing an optimization procedure under an ε-approximation:

θ∗_{FD,ε} = argmin_{θ∗ ∈ {θ∗: ||y − G(θ∗, u∗)||² ≤ ε}} ||y − G(θ∗, u∗)||²,     (6)

where ε → 0 at a fast rate (Hannig, 2009; Hannig et al., 2016). Hannig (2009) proved a fiducial Bernstein von Mises Theorem which states: under general regularity assumptions,

(θ∗_{FD,ε} − θ̂) | data ∼ (θ̂ − θ) | θ = θ0,     (7)

as n → ∞ and ε → 0 at a fast enough rate, where both sides above follow a normal distribution centered at zero with variance equal to the inverse of the Fisher information of the MLE, θ̂.
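As a small illustration of the minimization in (6), consider the full-data model of Example 1: for an artificial u∗ the n equations y = θ∗ + u∗ generally have no common solution, but the least-squares minimizer of ||y − G(θ∗, u∗)||² has the closed form θ∗ = ȳ − ū∗, so no numerical optimizer or explicit tolerance ε is needed in this linear case. All numerical values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Full-data model of Example 1: y_i = theta + u_i, i = 1..n.
n, theta0 = 30, 5.0
y = rng.normal(theta0, 1.0, size=n)

# For each artificial u*, minimize sum_i (y_i - theta - u*_i)^2 over theta.
# The minimizer is the closed form theta* = mean(y - u*) = ybar - ubar*.
B = 50_000
u_star = rng.standard_normal((B, n))
theta_star = (y[None, :] - u_star).mean(axis=1)

# The generalized fiducial sample again approximates N(ybar, 1/n).
print(theta_star.mean(), theta_star.std())
```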

The randomness of θ∗_{FD,ε}, conditional upon the observed data Y = y, is inherited from the simulation of the artificial u∗. As with the bootstrap connection to a CD-r.v. stated in equation (3), equation (7) establishes a common trait in that the simulated artificial variability of the fiducial estimator, θ∗_{FD,ε}, conditioned upon the observed sample data, matches the sampling variability of the estimator θ̂ as a function of a random sample from the study population. Thus, the fiducial Bernstein von Mises Theorem helps establish an equivalence among the fiducial sample (θ∗_FD), a CD-r.v. (θ∗_CD), and the bootstrap estimator (θ∗_BT).

Approximate Bayesian Computing and Bayesian Inference

The approximate Bayesian computing (ABC) method is a Bayesian algorithm that attempts to obtain a Bayesian posterior without a direct use of Bayes’ formula. It can be viewed as a Bayesian stochastic inversion algorithm to solve the same model equation (5), where y is observed data generated from model (4) and θ and u are unknown realizations from known prior and error distributions π(θ) and D(u), respectively. The premise for establishing inferential conclusions for parameter θ in ABC is the following rejection algorithm.

Approximate Bayesian Computing (ABC) Algorithm

Step 1: Simulate θ∗ ∼ π(θ) and u∗ ∼ D(·) and compute y∗ = G(θ∗, u∗).

Step 2: If y∗ "matches" y_obs (i.e., y∗ ≈ y_obs), then retain θ∗; otherwise repeat Step 1.

Effectively, the set of retained θ∗ in the ABC algorithm satisfies the equation y ≈ G(θ∗, u∗), so this stochastic algorithm solves equation (5) for θ∗. This inversion bears a striking resemblance to the fiducial stochastic inversion described above, except that the Bayesian inversion method assumes a prior π(θ) and is limited to finding those θ∗ that correspond to simulations from this prior. In contrast to the θ∗_FD obtained from the fiducial inversion algorithm, the θ∗ in ABC contain the information of both π(·) and D(·).

In real applications of the ABC algorithm, the matching y ≈ y∗ in Step 2 is difficult to achieve and is typically replaced by requiring a match of summary statistics while tolerating a small degree of mismatch. Thus, in practice, Step 2 of the ABC Algorithm is replaced with Step 2' below, where d(·, ·) is a distance metric (often Euclidean distance), t(·) is a summary statistic and ε > 0 is the tolerance for mismatch:

Step 2': If d(t(y∗), t(y_obs)) < ε, then retain θ∗; otherwise repeat Step 1.
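The rejection scheme with Step 2' can be sketched for a normal-mean model; the N(0, 3²) prior, the tolerance ε = 0.05 and all numerical values below are illustrative choices, not settings from the paper. Since the summary t(y) = ȳ is sufficient here, the accepted draws approximate the exact conjugate posterior.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative setting: y_i = theta + u_i with u_i ~ N(0, 1), n = 30,
# prior theta ~ N(0, 3^2), summary t(y) = ybar, tolerance eps = 0.05.
n, theta0 = 30, 5.0
y_obs = rng.normal(theta0, 1.0, size=n)
t_obs = y_obs.mean()

B, eps = 400_000, 0.05
theta = rng.normal(0.0, 3.0, size=B)  # Step 1: theta* ~ pi(theta)
# Step 1 (cont.): y* = G(theta*, u*); here t(y*) = ybar* ~ N(theta*, 1/n)
# exactly, so we draw the summary statistic directly.
t_sim = theta + rng.normal(0.0, 1.0 / np.sqrt(n), size=B)
accept = np.abs(t_sim - t_obs) < eps  # Step 2': d(t(y*), t(y_obs)) < eps
abc_posterior = theta[accept]

# With a sufficient summary, the retained draws approximate the conjugate
# posterior N(n*t_obs/(n + 1/9), 1/(n + 1/9)).
print(accept.sum(), abc_posterior.mean(), abc_posterior.std())
```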

Given the observed data, the underlying distribution of the output θ∗ from the ABC algorithm is called the ABC posterior, which we denote by f_a(θ | y). If the summary statistic t(·) is sufficient, then the ABC posterior, f_a(θ | y), matches (approximately) the target posterior distribution, f(θ | y), derived from Bayes' formula; cf., e.g., Frazier et al. (2018); Thornton (2019). Indeed, the ABC method is a computational approach that can be used to obtain the posterior distribution without the use of Bayes' formula or a direct evaluation of the likelihood function. Note that, given the observed data y, the randomness in θ∗ is induced by the artificial Monte-Carlo simulations θ∗ ∼ π(·) and u∗ ∼ D(·).

The Bayesian Bernstein von Mises Theorem is a well-known bridge relating Bayesian and frequentist inference when the sample size n → ∞; cf., e.g., van der Vaart (1998). Denote by θ0 the realized value of θ from the prior conditional on which the observed data y are generated, and let

θ∗_BY be an outcome from an ABC Algorithm or, more generally, a Monte-Carlo sample drawn from the posterior distribution. We can reword the Bayesian Bernstein von Mises theorem as follows: under some general regularity conditions and as n → ∞,

(θ∗_BY − θ̂) | data ∼ (θ̂ − θ) | θ = θ0,     (8)

where both sides above follow a normal distribution centered at zero with variance equal to the inverse of the Fisher information of the MLE, θ̂.

In the form of (8), the Bayesian Bernstein von Mises Theorem is analogous to the statements in equations (2), (3) and (7). The only distinction is that in Bayesian inference there is the additional model assumption of the prior, and this information is used to draw inferential conclusions in addition to the model uncertainty of U. The common conclusion from all four of these analogous statements is that the random estimators θ∗_BT, θ∗_CD, θ∗_FD and θ∗_BY behave the same for inference. These random estimators are useful precisely because the artificial Monte-Carlo randomness in each of these estimators matches the sampling variability of θ̂ inherited from the model population. This leads us to the final piece of the puzzle connecting all three types of inference, Bayesian, frequentist and fiducial, through the construction of random estimators (θ∗_BY, θ∗_FD, and θ∗_BT) that can all be described as a CD-r.v., θ∗_CD.

In real applications of ABC, it is common to have little to no knowledge of the sufficiency or near-sufficiency of the summary statistic, t(·). In such situations, the ABC posterior distribution can be quite different from the targeted posterior distribution. Thus, if the summary statistic used in an ABC algorithm is not sufficient, the ABC posterior often cannot provide Bayesian inference in the usual sense. However, Thornton (2019); Thornton et al. (2018) have shown

that, under some regularity conditions, the ABC posterior f_a(θ | y) is often a valid CD, even if the summary statistic used is not sufficient.

Figure 4: Given a sample of size n = 30 from a Cauchy(θ, 1) distribution, we can plot the ABC posterior distribution (in black) for t1 = ȳ (solid line) and t2 = Median(y) (dashed line). In these simulations we set θ0 = 10. The target posterior is shown in gray.


In the following Cauchy example, no sufficient summary statistic exists for the location parameter θ (besides the data itself) and, in this example, the ABC posterior distributions based on either the sample mean or the sample median as the summary statistic are quite different and do not provide an approximation to the targeted posterior distribution f(θ|y). However, the ABC posterior samples θ∗_ABC, obtained using either the sample mean or the sample median as the summary statistic, are samples from a CD and can therefore be used for valid frequentist inferential statements. Example 7 below provides a demonstration. As highlighted in Figure 4 and Table 3, with a tighter density curve, the ABC posterior based on the sample median is more efficient than that based on the sample mean. See Thornton (2019); Thornton et al. (2018) for more discussion.

Example 7 (Cauchy data) Suppose we observe y = (y1, y2, . . . , yn), an IID sample from a Cauchy(θ, 1) distribution. When t(·) is not a sufficient statistic, f_a(θ | y) does not converge to the target posterior (see Figure 4); however, f_a(θ | y) is a CD for θ. This is true for both t1 = ȳ and t2 = Median(y). Although the efficiency of the CDs differs depending on the choice of summary statistic, conclusions from either method are valid according to the definition of a CD, as illustrated in the simulations for Table 3.

Table 3: Coverage results for the ABC approximate posterior and the Bayesian posterior from Example 7. Coverage is computed for the location parameter θ over 200 replications. Each observed sample of size n = 30 was drawn from a Cauchy(θ, 1) distribution where θ0 = 10.

Method       t(·)         95% CI Coverage    Mean length (sd)
ABC          ȳ            1.0                7.409 (0.679)
ABC          Median(y)    0.955              0.902 (0.0312)
Posterior    -            0.94               0.805 (0.238)
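A miniature version of the Example 7 comparison can be run with a vectorized rejection sampler. The flat U(0, 20) prior, the tolerance ε = 0.5 and the Monte-Carlo sizes below are illustrative choices, not the settings behind Figure 4 and Table 3; the qualitative conclusion, that the median-based ABC posterior is far tighter than the mean-based one, is the point.

```python
import numpy as np

rng = np.random.default_rng(4)

# Example 7 setting: y ~ Cauchy(theta, 1), n = 30, theta0 = 10.
n, theta0 = 30, 10.0
y_obs = theta0 + rng.standard_cauchy(n)
t1_obs, t2_obs = y_obs.mean(), np.median(y_obs)

B, eps = 200_000, 0.5
theta = rng.uniform(0.0, 20.0, size=B)  # illustrative flat prior on (0, 20)
y_sim = theta[:, None] + rng.standard_cauchy((B, n))

# Step 2' with t1 = sample mean versus t2 = sample median:
keep_mean = theta[np.abs(y_sim.mean(axis=1) - t1_obs) < eps]
keep_med = theta[np.abs(np.median(y_sim, axis=1) - t2_obs) < eps]

iqr = lambda x: np.subtract(*np.percentile(x, [75, 25]))
print(len(keep_mean), iqr(keep_mean))  # mean summary: wide, heavy-tailed
print(len(keep_med), iqr(keep_med))    # median summary: much tighter
```

Both sets of retained draws are usable for frequentist inference as CD samples; the median-based summary simply yields shorter intervals, mirroring the lengths in Table 3.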

3.2.2 Two Versions of a Parameter: Fixed or Random

In statistics, we have traditionally distinguished inferential perspectives by asserting that the model parameter is random in Bayesian inference but is a fixed, unknown constant in frequentist inference. From this perspective, these inferential paradigms seem to conflict. Can we reconcile these different understandings of a parameter to create a meaningful bridge between the information gained from either perspective? Our answer to this question is an emphatic "yes" and we argue in the remainder of this subsection that there are two reconcilable versions of the parameter, random and fixed, in each of the Bayesian, frequentist and fiducial inferential paradigms. First consider a Bayesian perspective of a simple experiment known as Bayes' Billiard Table Experiment, from whence the field of Bayesian statistics was developed:

Assume that billiard balls are rolled on a line of length one, with uniform probability. Ball W is rolled and stops at θ0 ∈ (0, 1). Then a different ball O is rolled n times under the same assumptions. Let y be the number of times that ball O stops to the left of ball W. Bayes proposed the motivating question: "Given y, what inference can we make concerning θ0?" (Bayes, 1764)

In modern statistical terms, Bayes' Billiard Table experiment collects a binomial sample of data assuming a U(0, 1) prior distribution. In this context, θ0 (where ball W landed) is a realization from the prior distribution, θ ∼ U(0, 1), and the sampling scheme follows a binomial model y|θ = θ0 ∼ Bin(n, θ0). For the sake of discussion and without loss of generality, suppose ball W lands at the location θ0 = 0.38963 and suppose that, out of the n = 14 times ball O was rolled, we observe y = 5 instances where O landed to the left of ball W. This y = 5 is a realization from Y | θ = 0.38963 ∼ Bin(n = 14, θ = 0.38963). By the typical conjugate calculation, we can derive the posterior distribution as θ | y = 5 ∼ Beta(6, 10). We emphasize that the target quantity we are interested in estimating is the fixed, unknown value 0.38963, the location where ball W landed, which resulted in the observation y = 5. In this experiment, the target of interest is not a random quantity θ that follows either a U(0, 1) or a Beta(6, 10) distribution.
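The conjugate update behind Beta(6, 10) is short enough to verify directly; the sketch below uses only the quantities stated in the text (a U(0, 1) = Beta(1, 1) prior, n = 14, y = 5).

```python
import numpy as np

rng = np.random.default_rng(5)

# Bayes' billiard experiment: U(0,1) = Beta(1,1) prior and y | theta ~ Bin(n, theta).
n, y = 14, 5
a_post, b_post = 1 + y, 1 + n - y  # conjugate update: Beta(1 + y, 1 + n - y)
print(a_post, b_post)  # Beta(6, 10), as derived in the text

# Monte-Carlo draws from the posterior quantify the uncertainty about the
# fixed realized value theta0 = 0.38963 that generated y = 5.
draws = rng.beta(a_post, b_post, size=100_000)
print(draws.mean())                        # close to the posterior mean 6/16
print(np.quantile(draws, [0.025, 0.975]))  # a central 95% credible interval
```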

Bayes’ Billiard Table experiment clearly displays two versions of the parameter at play. The scientific, application-oriented point of view, is interested in the fixed version of the parameter that generated the observed data, i.e. θ0. There is uncertainty however in exactly how to evaluate θ0; to address this uncertainty, a Bayesian inferential approach elects to work with a random version of θ through the prior and posterior distributions. As stated in Berger et al. (2015), in a Bayesian perspective “parameters must have a distribution describing the available information about their values... this is not a description of their variability (since they are fixed unknown quantities), but a description of the uncertainty about their true values.”

This duality is actually present in every inferential paradigm where the model parameter has two versions: (a) a fixed, unknown true (or realized) value which generated the observed sample of data, and (b) a random version that is used to describe the uncertainty (but not the variability) about the fixed, unknown parameter value. The duality of parameters was arguably first recognized in the fiducial argument, although the philosophical underpinnings of the fiducial interpretation have probably generated more confusion and controversy than clarity (see, e.g., Rao 1973). Our discussion in the previous subsection helps us better understand how these two versions of a parameter apply across each inferential paradigm, as summarized in Table 4.

From Table 4, the fixed version is the target of interest from which the sample data have been generated. It has the same standard interpretation in frequentist and fiducial inferences, and it refers to the realized parameter value, θ0, from the prior under the Bayesian framework. The random version is used to describe the uncertainty in our inference about the fixed version. This version has a natural interpretation in Bayesian inference and corresponds to the CD-r.v. (or bootstrap estimator in special cases) and fiducial sample in the frequentist and fiducial frameworks, respectively.

Table 4: Common traits among different inferential paradigms.

                  CD                 Bootstrap                      Fiducial                  Bayesian
Random version    CD-r.v. (θ∗_CD)    Bootstrap estimator (θ∗_BT)    Fiducial sample (θ∗_FD)   Random parameter, prior/posterior sample (θ∗_BY)
Fixed version     True parameter value (θ0)    True parameter value (θ0)    True parameter value (θ0)    Realized parameter value (θ0)

Whereas much previous research has highlighted differences among Bayesian, frequentist and fiducial inference, this unifying perspective inspired by CDs can deepen our understanding of the foundational principles of statistical inference, and it provides a philosophical framework whereby potentially any inferential method can benefit from any of the three dominant paradigms. For instance, recognizing and establishing the random version of parameters in frequentist and fiducial inferences suggests that the powerful statistical computational tools that have been so successful in many Bayesian applications can also be applied to frequentist and fiducial inference. Alternatively, introducing the interpretation of a fixed target parameter in the context of Bayesian inference can help directly and unambiguously connect the interpretation of parameters to the physical meanings of parameters in physics and other applied scientific fields. Furthermore, the CD-based connection between bootstrap sampling and approximate Bayesian computing reinforces the usefulness of developing new artificial sampling methods to address more difficult inference problems that lie beyond the reach of likelihood-based inference and the central limit theorem (and its extensions). Some related research developments attempting likelihood-free inference include finite sampling inference in inferential models and repro sampling approaches for inference concerning discrete parameters, such as the number of clusters. See, e.g., Liu et al. (2018); Xie (2020); Wang and Xie (2020); Luo et al. (2020).

4 Discussion and Concluding Remarks

A CD is a sample-dependent distribution function on the parameter space that can represent confidence intervals (or regions) of all levels for some parameter of interest. The CD approach summarizes the information available from the observed data in a distribution form, when possible. An emerging theme in the development of CD estimators is that any approach, regardless of whether it is Bayesian, fiducial, or frequentist, can potentially be unified under the concept of CDs as long as the method can be used to build confidence intervals of all levels (exactly or asymptotically). This unification provides a platform to compare and also combine a variety of "distribution estimates" derived by various procedural methods.

In this article, we have also demonstrated that each of the frequentist, fiducial, and Bayesian inferential frameworks shares two important features: an ability to describe parameter uncertainty with a random version and a view that there is some fixed, unknown value as the target of interest. The random version of the parameter is associated with distribution estimators, namely, a posterior distribution, fiducial distribution or CD, across all inferential paradigms. The common theorem critical to the success of each of these inferential frameworks aligns the variability of the random estimator (conditional upon the observed sample) with the model uncertainty about the fixed version of the parameter. These key similarities hint at the broad range of inference problems that may be solvable by simulating artificial Monte-Carlo samples that mimic the observed data. Indeed, we can re-frame the task of inference (quantifying the sampling variability in parameter estimation using only a single copy of data) as matching the variability in the random version of the parameter with the uncertainty in parameter estimation inherited through random sampling. This inferential task can often be achieved through artificial sampling or other Monte-Carlo simulations. In addition to supporting inferential unity among different statistical paradigms, this CD-based perspective also promises many new methodological developments that can provide novel inferential solutions in cases where traditional solutions were previously impossible. This potential for new developments includes prediction approaches, testing methods, simulation schemes, and ways of combining information from various inferential sources (e.g., Xie et al. 2011; Clagget et al. 2014; Liu et al. 2015; Hannig and Xie 2012; Shen et al. 2018; Vovk et al. 2019; Shen et al. 2020; Cai et al. 2020; Liu et al. 2020).

We conclude our discussion by noting that, although we have emphasized the unity of Bayesian, frequentist and fiducial inferences, these procedures are distinct in several aspects. A Bayesian method includes an additional model assumption through the prior distribution. Depending on the situation, this additional assumption may result in different inference results, especially in a finite sample setting. Our alignment of equations (2), (3), (7) and (8) has involved the large sample bootstrap CLT and the fiducial and Bayesian Bernstein von Mises theorems. The large sample Bayesian Bernstein von Mises theorem mitigates the impact of the prior on inference in favor of growing data to make this alignment possible. Although this alignment can be extended to higher order (cf., e.g., higher order results in Hall 1992; Reid and Fraser 2010; Berger et al. 2015 for bootstrap, likelihood inference and objective Bayes, respectively), the impact of the prior assumption can be sizable in finite sample circumstances, resulting in a mismatch with the frequentist conclusion (e.g., Fraser, 2011; Reid and Cox, 2015). Furthermore, the conclusions reached through a fiducial inversion procedure or through a Bayesian approach do not automatically (even if they do typically) lead to correct frequentist inference with respect to a confidence distribution. This issue is similar to the fact that an MLE is usually, but not automatically, a consistent estimator, and that a consistent estimator need not be an MLE. Hence, the differences among Bayesian, frequentist and fiducial procedures do not prevent us from arguing for the overall congruence of these methods, because both the connection to CDs and the importance of the two versions of the model parameter ultimately start each procedure across these paradigms on the same footing.

References

Canty, A. and B. D. Ripley (2020). boot: Bootstrap R (S-Plus) Functions. R package version 1.3-25.

Bååth, R. (2014). Bayesian First Aid: A package that implements Bayesian alternatives to the classical *.test functions in R. In UseR! 2014 - the International R User Conference.

Barber, R. F. and E. J. Candès (2015). Controlling the false discovery rate via knockoffs. Annals of Statistics 43 (5), 2055–2085.

Bayes, T. (1764). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions, Royal Society, London.

Berger, J. O., J. M. Bernardo, and D. Sun (2015). Overall objective priors. Bayesian Analysis 10 (1), 189–221.

Cai, C., R. Chen, and M. Xie (2020). Individualized group learning. Journal of the American Statistical Association. Under review; available at https://arxiv.org/abs/1906.05533.

Clagget, B., M. Xie, and L. Tian (2014). Meta-analysis with fixed, unknown, study-specific parameters. Journal of the American Statistical Association 109, 1667–1671.

Cox, D. R. (2013). Discussion of confidence distribution, the frequentist distribution estimator of a parameter: a review. International Statistical Review 81, 40–41.

Cui, Y. and M. Xie (2020). Confidence Distribution and Distribution Estimation for Modern Statistical Inference. In Springer Handbook of Engineering Statistics, 2nd Edition. Springer.

Davison, A. C. and D. V. Hinkley (1997). Bootstrap Methods and their Application. Number 1. Cambridge University Press.

Dempster, A. P. (1963). Further examples of inconsistencies in the fiducial argument. The Annals of Mathematical Statistics 34, 884–891.

DiCiccio, T. J. and B. Efron (1996). Bootstrap confidence intervals. Statistical Science 11, 189–228.

Divine, G. W., H. J. Norton, A. E. Baron, and E. Juarez-Colunga (2018). The Wilcoxon-Mann-Whitney procedure fails as a test of medians. The American Statistician 72 (3), 278–286.

Efron, B. (1998). R. A. Fisher in the 21st century. Statistical Science 13, 95–122.

Efron, B. and R. J. Tibshirani (1993). An Introduction to the Bootstrap. Number 57 in Monographs on Statistics and Applied Probability. Boca Raton, Florida, USA: Chapman & Hall/CRC.

Ernst, M. D. (2004). Permutation methods: A basis for exact inference. Statistical Science 19 (4), 676–685.

Fisher, R. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 10, 507–521.

Fraser, D. (1991). Statistical inference: Likelihood to significance. Journal of the American Statis- tical Association 86, 258–265.

Fraser, D. A. S. (1966). Structural probability and a generalization. Biometrika 53, 1–9.

Fraser, D. A. S. (1968). The Structure of Inference. Wiley.

Fraser, D. A. S. (2011). Is Bayes posterior just quick and dirty confidence? Statistical Science 26, 299–316.

Fraser, D. A. S. and P. McDunnough (1984). Further remarks on asymptotic normality of likelihood and conditional analyses. The Canadian Journal of Statistics 12, 183–190.

Frazier, D. T., G. M. Martin, C. P. Robert, and J. Rousseau (2018). Asymptotic properties of approximate Bayesian computation. Biometrika 105 (3), 593–607.

Freedman, D. A. and P. J. Bickel (1981). Some asymptotic theory for the bootstrap. Annals of Statistics 9 (6), 1196–1217.

Hall, P. (1992). On the removal of skewness by transformation. Journal of the Royal Statistical Society: Series B (Methodological) 54 (1), 221–228.

Hannig, J. (2009). On generalized fiducial inference. Statistica Sinica 19, 491–544.

Hannig, J., H. Iyer, R. C. S. Lai, and T. C. M. Lee (2016). Generalized fiducial inference: A review and new results. Journal of the American Statistical Association 111 (515), 1346–1361.

Hannig, J. and M. Xie (2012). On Dempster-Shafer recombinations of confidence distributions. Electronic Journal of Statistics 6, 1943–1966.

Hollander, M., D. A. Wolfe, and E. Chicken (2014). Nonparametric Statistical Methods. Wiley Series in Probability and Statistics. John Wiley & Sons, Incorporated.

Kass, R. E. (2011). Statistical inference: The big picture. Statistical Science 26, 1–9.

Li, W. and P. Fearnhead (2018). On the asymptotic efficiency of approximate Bayesian computation estimators. Biometrika 105 (2), 286–299.

Li, Y., B. W. Shedden, and J. A. Gillespie (2018). Profile likelihood estimation of the correlation coefficient in the presence of left, right or interval censoring and missing data. The R Journal 10 (2), 159.

Liu, D., R. Liu, and M. Xie (2015). Multivariate meta-analysis of heterogeneous studies using only summary statistics: efficiency and robustness. Journal of the American Statistical Association 110, 326–340.

Liu, D., R. Liu, and M. Xie (2020). Nonparametric fusion learning: synthesize inferences from diverse sources using depth confidence distribution. Revision for Journal of the American Statistical Association.

Liu, S., L. Tian, S. Lee, and M. Xie (2018). Exact inference on meta-analysis with generalized fixed-effects and random-effects models. Biostatistics & Epidemiology 2 (1), 1–22.

Luo, X., T. Dasgupta, M. Xie, and R. Liu (2020). Leveraging the fisher test using confidence distributions: Inference, combination and fusion learning. ArXiv e-prints, arXiv:2004.08472v1 .

Pierce, D. A. and R. Bellio (2017). Modern likelihood-frequentist inference. International Statistical Review 85 (3), 519–541.

Rao, C. (1973). Linear Statistical Inference and Its Applications. Wiley series in probability and mathematical statistics: Probability and mathematical statistics. Wiley.

Reid, N. and D. R. Cox (2015). On some principles of statistical inference. International Statistical Review 83 (2), 293–308.

Reid, N. and D. A. S. Fraser (2010). Mean log likelihood and higher-order approximations. Biometrika 97, 159–170.

Robert, C. P. (2016). Approximate Bayesian Computation: A Survey on Recent Results. Monte Carlo and Quasi-Monte Carlo Methods. Springer, Cham.

Schweder, T. and N. Hjort (2016). Confidence, Likelihood and Probability. Cambridge, U.K.: Cambridge University Press.

Schweder, T. and N. L. Hjort (2002). Confidence and likelihood. Scandinavian Journal of Statis- tics 29, 309–332.

Schweder, T. and N. L. Hjort (2003). Frequentist analogues of priors and posteriors. In Economet- rics and the Philosophy of Economics: Theory-Data Confrontations in Economics, pp. 285–317. Princeton University Press.

Shen, J., R. Y. Liu, and M. Xie (2018). Prediction with confidence: a general framework for predictive inference. Journal of Statistical Planning and Inference 195, 126–140.

Shen, J., R. Y. Liu, and M. Xie (2020). iFusion: Individualized fusion learning. Journal of the American Statistical Association 115 (531), 1251–1267.

Signorell, A. et mult. al. (2020). DescTools: Tools for Descriptive Statistics. R package version 0.99.38.

Singh, K. (1981). On the asymptotic accuracy of Efron’s bootstrap. The Annals of Statistics 9 (6), 1187–1195.

Singh, K., M. Xie, and W. E. Strawderman (2005). Combining information from independent sources through confidence distributions. The Annals of Statistics 33, 159–183.

Singh, K., M. Xie, and W. E. Strawderman (2007). Confidence distribution (CD) - distribution estimator of a parameter. In IMS Lecture Notes-Monograph Series, Volume 54, pp. 132–150. Institute of Mathematical Statistics.

Thornton, S. (2019). Advanced Computing Methods for Statistical Inference. Ph. D. thesis, Rutgers, The State University of New Jersey.

Thornton, S., W. Li, and M. Xie (2018). Approximate confidence distribution computing: An effective likelihood-free method with statistical guarantees. ArXiv e-prints, arXiv:1705.1034.

van der Vaart, A. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.

Vovk, V., J. Shen, V. Manokhin, and M. Xie (2019). Nonparametric predictive distributions by conformal prediction. Machine Learning 108, 445–474.

Wang, P. and M. Xie (2020). Repro Sampling Method for Statistical Inference of High Dimensional Linear Models. Research Manuscript.

Xie, M. (2020). Repro Sampling Method for Statistical Inference. Research Manuscript.

Xie, M. and K. Singh (2013). Confidence distribution, the frequentist distribution estimator of a parameter (with discussions). International Statistical Review 81, 3–39.

Xie, M., K. Singh, and W. E. Strawderman (2011). Confidence distributions and a unifying framework for meta-analysis. Journal of the American Statistical Association 106, 320–333.
