
Two Statistical Paradigms: Bayesian versus Frequentist

Steven Janke

April 2012

Probability versus Statistics

Probability Problem: Given some symmetry, what is the probability that an event occurs? (Example: The probability of 3 heads when a fair coin is flipped 5 times.)

Statistical Problem: Given some data, what is the underlying symmetry? (Example: Four heads resulted from tossing a coin 5 times. Is it a fair coin?)

Statistics Problems

All statistical problems begin with a model for the problem. Then there are different flavors for the actual question:
Estimation (Find the mean of the population.)
Interval estimation (Confidence interval for the mean.)
Hypothesis testing (Is the mean greater than 10?)
Model fit (Could one variable be linearly related to another?)

Estimators and Loss Functions

Definition (Estimator): An estimator $\hat\theta$ is a random variable that is a function of the data and reasonably approximates $\theta$. For example, $\hat\theta = \bar X = \frac{1}{n}\sum_{i=1}^n X_i$.

Loss Functions:
Squared Error: $L(\theta - \hat\theta) = (\theta - \hat\theta)^2$
Absolute Error: $L(\theta - \hat\theta) = |\theta - \hat\theta|$
Linex: $L(\theta - \hat\theta) = \exp(c(\theta - \hat\theta)) - c(\theta - \hat\theta) - 1$

Comparing Estimators

Definition (Risk): An estimator's risk is $R(\theta, \hat\theta) = E[L(\theta, \hat\theta)]$.

For two estimators $\hat\theta_1$ and $\hat\theta_2$, if $R(\theta, \hat\theta_1) < R(\theta, \hat\theta_2)$ for all $\theta \in \Theta$, then $\hat\theta_2$ is called inadmissible.

How do you find "good" estimators? What does "good" mean? How do you definitively rank estimators?

Frequentist Approach

Philosophical Principles:
Probability as relative frequency (repeatable trials). (Law of Large Numbers.)
$\theta$ is a fixed quantity, but the data can vary.

Definition (Likelihood Function)

Suppose the population density is $f_\theta$ and the data $(x_1, x_2, \ldots, x_n)$ are sampled independently; then the Likelihood Function is $L(\theta) = \prod_{i=1}^n f_\theta(x_i)$.

The value of $\theta$ (as a function of the data) that gives the maximum likelihood is a good candidate for an estimator $\hat\theta$.

Maximum Likelihood for Normal and Binomial

Population is Normal($\mu$, $\sigma^2$): $L(\mu) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$. Maximum likelihood gives $\hat\mu = \bar X = \frac{1}{n}\sum_{i=1}^n x_i$.
Population is Binomial(n, p): Flip a coin n times with P[head] = p. $L(p) = p^k(1-p)^{n-k}$, where k is the number of heads. Maximum likelihood gives $\hat p = \frac{k}{n}$.
Both estimators are unbiased: $E\hat p = \frac{np}{n} = p$ and $E\hat\mu = \frac{n\mu}{n} = \mu$. Among all unbiased estimators, these estimators minimize the squared error loss.
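A minimal sketch (not from the talk; it uses k = 4 heads in n = 5 tosses, matching the earlier coin example) checking numerically that the binomial MLE is the closed-form value k/n:

```python
# A minimal sketch: minimize the negative log-likelihood of Binomial(n, p)
# and compare with the closed-form MLE p_hat = k/n.
import numpy as np
from scipy.optimize import minimize_scalar

k, n = 4, 5  # observed 4 heads in 5 tosses (assumed for illustration)

def neg_log_likelihood(p):
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical MLE :", result.x)   # ~0.8
print("closed form k/n:", k / n)     # 0.8
```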

Frequentist Conclusions

Definition (Confidence Interval): $\bar X$ is a random variable with mean $\mu$ and distribution $N(\mu, \frac{\sigma^2}{n})$. The random interval $(\bar X - 1.96\frac{\sigma}{\sqrt{n}},\ \bar X + 1.96\frac{\sigma}{\sqrt{n}})$ is a confidence interval. The probability that it covers $\mu$ is 0.95.
>>> The probability that $\mu$ is in the interval is not 0.95!
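A minimal simulation sketch (mu = 10, sigma = 2, n = 25 are assumed values) of the frequentist reading: the interval is what varies from sample to sample, and it covers the fixed mu about 95% of the time.

```python
# Repeated-sampling coverage of the 95% confidence interval for a normal mean.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 10.0, 2.0, 25, 100_000  # assumed values

covered = 0
for _ in range(trials):
    xbar = rng.normal(mu, sigma, size=n).mean()
    half_width = 1.96 * sigma / np.sqrt(n)
    if xbar - half_width < mu < xbar + half_width:
        covered += 1

print(covered / trials)  # close to 0.95
```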

Definition (Hypothesis Testing): Null Hypothesis: $\mu = 10$. Calculate $z^* = \frac{\bar X - 10}{\sigma/\sqrt{n}}$. The p-value is $P[Z \le -|z^*|] + P[Z \ge |z^*|]$.
>>> The p-value is not the probability that the null hypothesis is true.

Space of Estimators

Methods: Moments, Linear, Invariant
Characteristics: Unbiased, Sufficient, Consistent, Minimum Variance, Admissible

Frequentist Justification: Three Theorems

Theorem (Rao-Blackwell): A statistic T such that the conditional distribution of the data given T does not depend on $\theta$ (the parameter) is called sufficient. Replacing an unbiased estimator by its conditional expectation given a sufficient statistic never increases the variance.

Theorem (Lehmann-Scheffé): If T is a complete sufficient statistic for $\theta$ and $\hat\theta = h(T)$ is an unbiased estimator of $\theta$, then $\hat\theta$ has the smallest possible variance among unbiased estimators of $\theta$. ($\hat\theta$ is the UMVUE of $\theta$.)

Theorem (Cramér-Rao)

Let a population have a "regular" distribution with density $f_\theta(x)$ and let $\hat\theta$ be an unbiased estimator of $\theta$. Then $\operatorname{Var}(\hat\theta) \ge \left(n\,E\!\left[\left(\frac{\partial}{\partial\theta}\ln f_\theta(X)\right)^2\right]\right)^{-1}$.

Problems Choosing Estimators

How does the Frequentist choose an estimator?
Example (Variance)
Maximum likelihood estimator: $\frac{\sum_{i=1}^n (x_i - \bar x)^2}{n}$
Unbiased estimator: $\frac{\sum_{i=1}^n (x_i - \bar x)^2}{n-1}$
Minimum mean squared error: $\frac{\sum_{i=1}^n (x_i - \bar x)^2}{n+1}$
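A minimal simulation sketch (a hypothetical normal sample with variance 4 and n = 10) comparing the three divisors: dividing by n+1 trades bias for a smaller mean squared error.

```python
# Monte Carlo mean squared error for the three variance estimators.
import numpy as np

rng = np.random.default_rng(1)
true_var, n, trials = 4.0, 10, 200_000     # assumed values
divisors = {"n": n, "n-1": n - 1, "n+1": n + 1}
sq_err = {name: 0.0 for name in divisors}

for _ in range(trials):
    x = rng.normal(0.0, np.sqrt(true_var), size=n)
    ss = np.sum((x - x.mean()) ** 2)
    for name, d in divisors.items():
        sq_err[name] += (ss / d - true_var) ** 2

for name, total in sq_err.items():
    print(f"divide by {name}: MSE ~ {total / trials:.4f}")
```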

Example (Mean): Estimate the IQ of an individual. Say the score on a test was 130. Suppose the scoring machine reported all scores less than 100 as 100. Now the Frequentist must change the estimate of IQ, even though the observed score of 130 was never affected by the truncation.

What is the p-value?

The p-value depends on events that didn't happen.
Example (Flipping Coins)
Flipping a coin n = 17 times: K = 13 heads implies the p-value is $P[K \le 4 \text{ or } K \ge 13] = 0.049$.
If the experimenter had instead stopped as soon as there were at least 4 heads and 4 tails, the p-value would be 0.021.
If the experimenter stops at n = 17 and continues to n = 44 if the data are inconclusive, then the p-value is > 0.049.
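A minimal sketch verifying the first p-value in the example: with 17 tosses of a fair coin and 13 heads, the two-sided binomial tail probability is about 0.049.

```python
# Two-sided binomial p-value: P[K <= 4 or K >= 13] under Binomial(17, 0.5).
from scipy.stats import binom

n, k = 17, 13
p_value = binom.cdf(4, n, 0.5) + binom.sf(k - 1, n, 0.5)  # sf(k-1) = P[K >= k]
print(round(p_value, 3))  # 0.049
```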

The p-value is already suspect since it is NOT the probability that $H_0$ is true.

Several Normal Means at Once

Example (Estimating Several Normal Means)

Suppose we are estimating k normal means $\mu_1, \mu_2, \ldots, \mu_k$ from populations with a common variance.

Then the Frequentist estimate is $\bar{\mathbf{x}} = (\bar x_1, \bar x_2, \ldots, \bar x_k)$ under Euclidean distance as the loss. Stein (1956) showed that $\bar{\mathbf{x}}$ is not even admissible! A better estimator is $\hat\mu(\bar{\mathbf{x}}) = \left(1 - \frac{k-2}{\|\bar{\mathbf{x}}\|^2}\right)\cdot\bar{\mathbf{x}}$. The better estimator can be thought of as a Bayesian estimator with the correct prior.
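A minimal simulation sketch (hypothetical means in [-1, 1], unit variance, one observation per mean) comparing the total squared error of the usual estimate with the Stein-type shrinkage estimate above.

```python
# Risk comparison: usual estimate x versus James-Stein (1 - (k-2)/||x||^2) x.
import numpy as np

rng = np.random.default_rng(2)
mu = np.linspace(-1.0, 1.0, 10)      # k = 10 true means (assumed for illustration)
k, trials = len(mu), 50_000

err_usual = err_js = 0.0
for _ in range(trials):
    x = rng.normal(mu, 1.0)                       # one N(mu_i, 1) draw per mean
    shrink = 1.0 - (k - 2) / np.sum(x ** 2)
    err_usual += np.sum((x - mu) ** 2)
    err_js += np.sum((shrink * x - mu) ** 2)

print("risk, usual estimate:", err_usual / trials)   # about k = 10
print("risk, James-Stein   :", err_js / trials)      # smaller
```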

Bayes Theorem

Richard Price communicated Rev. Thomas Bayes's paper to the Royal Society (1763). Given data, Bayes found the probability that the binomial parameter p lies in a given interval. One of the many lemmas is now known as "Bayes Theorem":

$$P[A|B] = \frac{P[AB]}{P[B]} = \frac{P[B|A]P[A]}{P[B]} = \frac{P[B|A]P[A]}{P[B|A]P[A] + P[B|A^c]P[A^c]}$$
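A minimal numeric sketch of the formula; the three probabilities are made-up values, used only for illustration.

```python
# Bayes' Theorem with assumed numbers:
# P[A|B] = P[B|A]P[A] / (P[B|A]P[A] + P[B|A^c]P[A^c]).
p_A = 0.01           # prior probability of A      (assumed)
p_B_given_A = 0.95   # P[B | A]                    (assumed)
p_B_given_Ac = 0.10  # P[B | not A]                (assumed)

numerator = p_B_given_A * p_A
denominator = numerator + p_B_given_Ac * (1 - p_A)
print(numerator / denominator)  # about 0.088
```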

Conditional probability is defined after defining conditional expectation as an integral over a sub-sigma-field of sets. The conditional distribution is then
$$f(x|y) = \frac{f(x,y)}{f_Y(y)} = \frac{f(y|x)f_X(x)}{f_Y(y)}$$

Bayesian Approach

Assume that the parameter under investigation is a random variable and that the data are fixed.
Decide on a prior distribution for the parameter. (Bayes considered p to be uniformly distributed from 0 to 1.)
Calculate the distribution of the parameter given the data. (Posterior distribution.)
To minimize squared error loss, take the mean of the posterior distribution as the estimator.

Bayesian Approach for the Binomial

Assume a binomial model with n, p (flipping a coin.) Let K be the number of heads (the data).

For a prior distribution on p take the Beta distribution: $f_p(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1}(1-p)^{\beta-1}$ for $0 < p < 1$.

$$f(p|k) = \frac{f(k|p)\,f_p(p)}{f_K(k)} = C \cdot p^{\alpha+k-1}(1-p)^{\beta+n-k-1}$$

Hence, $E[p|k] = \frac{\alpha+k}{\alpha+\beta+n}$.

With the uniform prior ($\alpha = 1$, $\beta = 1$), the estimator is $\frac{k+1}{n+2} = \left(\frac{n}{n+2}\right)\frac{k}{n} + \left(\frac{2}{n+2}\right)\frac{1}{2}$.
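A minimal sketch of the Beta-Binomial update (again using the k = 4 heads in n = 5 tosses example): the posterior is Beta(α + k, β + n − k) and its mean is (α + k)/(α + β + n).

```python
# Posterior for p after observing k heads in n tosses with a Beta(alpha, beta) prior.
from scipy.stats import beta

a, b = 1, 1                   # uniform prior
k, n = 4, 5                   # observed data (assumed for illustration)

posterior = beta(a + k, b + n - k)
print("posterior mean:", posterior.mean())             # (k+1)/(n+2) = 5/7 ~ 0.714
print("95% credible interval:", posterior.ppf([0.025, 0.975]))
```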

Bayesian Approach for the Normal

If $\mu$ is the mean of a normal population, a reasonable prior distribution for $\mu$ could be another normal distribution, $N(\mu_o, \sigma_o^2)$.
$$f(\mu|\bar x) = \frac{\sqrt{n}}{2\pi\sigma\sigma_o\, g(\bar x)}\, e^{-\frac{n(\bar x - \mu)^2}{2\sigma^2}}\, e^{-\frac{(\mu-\mu_o)^2}{2\sigma_o^2}} = C\, e^{-\frac{(\mu-\mu_1)^2}{2\sigma_1^2}}$$
Hence, $E[\mu|\bar x] = \bar x\left(\frac{n\sigma_o^2}{n\sigma_o^2 + \sigma^2}\right) + \mu_o\left(\frac{\sigma^2}{n\sigma_o^2 + \sigma^2}\right)$.
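A minimal sketch of the posterior-mean formula with hypothetical numbers (the prior N(100, 15²), known population sd 10, n = 25, and sample mean 104 are all assumptions for illustration):

```python
# Normal-Normal update: posterior mean is a precision-weighted average of xbar and mu0.
import numpy as np

mu0, sigma0 = 100.0, 15.0     # prior mean and sd for mu        (assumed)
sigma, n = 10.0, 25           # known population sd, sample size (assumed)
xbar = 104.0                  # observed sample mean             (assumed)

w = n * sigma0**2 / (n * sigma0**2 + sigma**2)            # weight on the data
post_mean = w * xbar + (1 - w) * mu0
post_var = sigma**2 * sigma0**2 / (n * sigma0**2 + sigma**2)
print(post_mean, np.sqrt(post_var))
```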

Bayesian Approach under other Loss functions

Note: If we use absolute error loss, then the Bayes estimator is the median of the posterior distribution. If we use Linex loss, then the Bayes estimator is $-\frac{1}{c}\ln\{E[e^{-c\theta}]\}$ (expectation taken under the posterior).

Calculating the Mean Squared Error

If the loss function is squared error, then the mean squared error (MSE) measures expected loss.

$$MSE = E(\hat\theta - \theta)^2 = (E(\hat\theta) - \theta)^2 + E(\hat\theta - E(\hat\theta))^2 \quad \text{(Squared Bias + Variance)}$$

For $\hat\theta = \frac{k}{n}$: $MSE = 0 + \frac{p(1-p)}{n}$.
For $\hat\theta = \frac{k+1}{n+2}$: $MSE = \left(\frac{1-2p}{n+2}\right)^2 + \frac{n}{(n+2)^2}\, p(1-p)$.
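A minimal sketch computing the two MSE curves as functions of p (n = 10 is an assumed sample size), following the bias² + variance decomposition above.

```python
# MSE of k/n versus (k+1)/(n+2) as a function of p.
import numpy as np

n = 10                               # assumed
p = np.linspace(0.0, 1.0, 101)

mse_freq = p * (1 - p) / n
mse_bayes = ((1 - 2 * p) / (n + 2)) ** 2 + n * p * (1 - p) / (n + 2) ** 2

# Near p = 1/2 the Bayes estimator (k+1)/(n+2) has smaller MSE;
# near 0 or 1 the frequentist estimator k/n wins.
print(mse_freq[50], mse_bayes[50])   # at p = 0.5
print(mse_freq[5], mse_bayes[5])     # at p = 0.05
```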

Comparing the MSE

Figure: Comparing MSE for Freq and Bayes estimators

Bayesian Confidence (Credible) Interval

$\Theta$ and X are the random variables representing the parameter of interest and the data. The functions u(x) and v(x) are arbitrary and f is the posterior distribution. Then, $P[u(x) < \Theta < v(x) \mid X = x] = \int_{u(x)}^{v(x)} f(\theta|x)\, d\theta$.

Definition (Credible Interval): If u(x) and v(x) are picked so the probability is 0.95, the interval [u(x), v(x)] is a 95% credible interval for $\Theta$.

For the normal distribution, $\frac{n\bar x\,\sigma_o^2 + \mu_o\sigma^2}{n\sigma_o^2 + \sigma^2} \pm 1.96\sqrt{\frac{\sigma^2\sigma_o^2}{n\sigma_o^2 + \sigma^2}}$ is a 95% credible interval for $\mu$.
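A minimal sketch of this interval, reusing the same assumed numbers as the Normal-Normal sketch earlier (all values hypothetical):

```python
# 95% credible interval for mu: posterior mean +/- 1.96 * posterior sd.
import numpy as np

mu0, sigma0, sigma, n, xbar = 100.0, 15.0, 10.0, 25, 104.0   # assumed values
post_mean = (n * xbar * sigma0**2 + mu0 * sigma**2) / (n * sigma0**2 + sigma**2)
post_sd = np.sqrt(sigma**2 * sigma0**2 / (n * sigma0**2 + sigma**2))
print(post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd)
```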

Comparing Estimates

Frequentist and Bayesian estimates for p in the Binomial:

                      (k, n)     Estimate   95% Conf. Int.    Width
  Frequentist         (2, 3)     0.667      (0.125, 0.982)    0.857
                      (30, 45)   0.667      (0.509, 0.796)    0.287
                      (60, 90)   0.667      (0.559, 0.760)    0.201
  Bayes (Beta(1,1))   (2, 3)     0.600      (0.235, 0.964)    0.729
                      (30, 45)   0.660      (0.527, 0.793)    0.266
                      (60, 90)   0.663      (0.567, 0.759)    0.191
  Bayes (Beta(4,4))   (2, 3)     0.545      (0.270, 0.820)    0.550
                      (30, 45)   0.642      (0.514, 0.769)    0.254
                      (60, 90)   0.653      (0.560, 0.747)    0.187
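A minimal sketch for the (k, n) = (30, 45) row. The Beta(1,1) posterior mean and equal-tailed credible interval can be checked directly; the frequentist interval shown is a Wald interval (an assumption, since the talk does not say which frequentist interval was used), so it differs slightly from the table.

```python
# One row of the comparison: Wald interval (assumed) versus Beta(1,1) credible interval.
import numpy as np
from scipy.stats import beta

k, n = 30, 45
p_hat = k / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
print("frequentist (Wald):", round(p_hat, 3),
      (round(p_hat - 1.96 * se, 3), round(p_hat + 1.96 * se, 3)))

post = beta(1 + k, 1 + n - k)
lo, hi = post.ppf([0.025, 0.975])
print("Bayes Beta(1,1)   :", round(post.mean(), 3), (round(lo, 3), round(hi, 3)))
```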

Bayesian Hypothesis testing

For the coin example, $H_0: p = \frac{1}{2}$ and $H_1: p \ne \frac{1}{2}$. Suppose K = 13, n = 17.
Objective stance: $P[H_0 \text{ is true}] = \frac{1}{2}$.
Subjective input: the largest plausible value of p, say $p_o$ (under $H_1$, p is uniform on $(1-p_o, p_o)$).

$$P[H_0 \mid K = k] = \frac{P[K=k \mid H_0]\,P[H_0]}{P[K=k \mid H_0]\,P[H_0] + P[K=k \mid H_1]\,P[H_1]} = \frac{2^{-17}\cdot 0.5}{2^{-17}\cdot 0.5 + 0.5\cdot\frac{1}{2p_o - 1}\int_{1-p_o}^{p_o} p^{13}(1-p)^4\, dp}$$

The probability depends on $p_o$, but ranges from 0.21 to 0.5, giving only mild evidence against $H_0$. Selecting a uniform distribution for p between $1 - p_o$ and $p_o$ is not very restrictive: symmetric distributions give the same range. More generally, Bayesians calculate the Bayes factor between the hypotheses.
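A minimal sketch evaluating the formula above for a few values of $p_o$ (uniform prior on $(1-p_o, p_o)$ under $H_1$, prior odds 1:1):

```python
# P[H0 | K = 13 heads in 17 tosses] as a function of the prior bound p_o.
from scipy.integrate import quad

k, n = 13, 17

def posterior_prob_H0(po):
    marginal_H1, _ = quad(lambda p: p**k * (1 - p)**(n - k), 1 - po, po)
    marginal_H1 /= (2 * po - 1)        # density of the uniform prior on (1-po, po)
    marginal_H0 = 0.5**n
    return 0.5 * marginal_H0 / (0.5 * marginal_H0 + 0.5 * marginal_H1)

for po in (0.6, 0.8, 0.95, 1.0):
    print(po, round(posterior_prob_H0(po), 3))
# The values stay roughly between 0.21 and 0.5, far above the p-value of 0.049.
```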

Noninformative Priors

If objectivity is an issue, pick priors which affect the posterior distribution the least.
Improper priors: select a uniform distribution on the real line. Formally, this gives a proper posterior for the normal and identifies $\bar X$ as the estimator.
Reference priors (maximize "distance" from the posterior), Jeffreys priors (scale invariant), minimum entropy priors.
Conjugate priors: convenient, easily interpreted.

Subjective Probability Axioms

Definition (Weak Order): If A and B are events, then $A \preceq B$ means B is at least as likely as A. (Informally, we are willing to place a bet that the odds of B are at least as large as the odds for A.) The relation $\preceq$ is referred to as a weak order.

Axioms:
For any pair of events A and B, either $A \preceq B$ or $B \preceq A$.
For disjoint sets, $A_1 \preceq B_1$ and $A_2 \preceq B_2$ implies $A_1 \cup A_2 \preceq B_1 \cup B_2$.
For any event A, $\emptyset \preceq A$.
If $\{A_i\}$ is a decreasing sequence of events and $A_i \succeq B$ for every i, then $\bigcap_{i=1}^{\infty} A_i \succeq B$.
For the uniform distribution on [0, 1], any event is comparable to a subinterval I ($A \preceq I$ or $I \preceq A$).
Conclusion: The relation $\preceq$ is in agreement with a probability measure.

Theoretical Bayesian Results

Theorem (Bayes Estimators): Under squared error loss, if $\hat\theta$ is an unbiased estimator of $\theta$, then it is not a Bayes estimator under any proper prior.

Theorem (Complete Class Theorem) Under some weak assumptions, every “good” estimator is a “Bayes” estimator with respect to some prior distribution.

When is the Bayesian estimator better?

Definition (Bayes Risk): There are two sources of randomness: the estimator and the parameter. The expected loss is $E_{F_\theta}[L(\theta, \hat\theta)]$.

Let $G_o$ be the true distribution of $\theta$. Then the Bayes risk is $r(G_o, \hat\theta) = E_{G_o} E_{F_\theta}[L(\theta, \hat\theta)]$.

To compare a Bayesian estimator with prior G (call it $\hat\theta_G$) to a Frequentist estimator $\hat\theta$, compare the Bayes risks.

When is $r(G_o, \hat\theta_G) \le r(G_o, \hat\theta)$?

Threshold Theorem

Theorem (Samaniego, 2010)

If the Bayes estimator $\hat\theta_G$ under squared error loss has the form

$$\hat\theta_G = (1 - \eta)E_G\theta + \eta\hat\theta$$

where $\hat\theta$ is a sufficient, unbiased estimator of $\theta$ and $\eta \in [0, 1)$, then

$$r(G_o, \hat\theta_G) \le r(G_o, \hat\theta)$$

if and only if

$$\operatorname{Var}_{G_o}(\theta) + (E_G\theta - E_{G_o}\theta)^2 \le \frac{1+\eta}{1-\eta}\, r(G_o, \hat\theta)$$

Proof of Threshold Theorem

Proof:

$$E_{F_\theta}(\hat\theta_G - \theta)^2 = E_{F_\theta}\!\left(\eta(\hat\theta - \theta) + (1-\eta)(E_G\theta - \theta)\right)^2 = \eta^2 E_{F_\theta}(\hat\theta - \theta)^2 + (1-\eta)^2(E_G\theta - \theta)^2$$
(The cross term vanishes because $\hat\theta$ is unbiased: $E_{F_\theta}(\hat\theta - \theta) = 0$.)

Expectation of both sides with respect to Go gives:

$$r(G_o, \hat\theta_G) = \eta^2\, r(G_o, \hat\theta) + (1-\eta)^2\, E_{G_o}(\theta - E_G\theta)^2 = \eta^2\, r(G_o, \hat\theta) + (1-\eta)^2\left(\operatorname{Var}_{G_o}(\theta) + (E_G\theta - E_{G_o}\theta)^2\right)$$

So $r(G_o, \hat\theta_G) \le r(G_o, \hat\theta)$ if and only if
$$\operatorname{Var}_{G_o}(\theta) + (E_G\theta - E_{G_o}\theta)^2 \le \frac{1+\eta}{1-\eta}\, r(G_o, \hat\theta).$$
Q.E.D.

Example of Threshold Theorem

Example (Binomial Model)
Problem: Estimate the proportion of "long" words starting the pages of Of Human Bondage.
The Bayes approach is to use $\frac{k+\alpha}{n+\alpha+\beta}$.
Could just specify the prior mean $p^* = \frac{\alpha}{\alpha+\beta}$ and $\eta \in [0, 1)$, which expresses confidence in $\frac{k}{n}$.
The true prior was a degenerate distribution with mean 0.3008, and $r(G_o, \hat p) = 0.02103$.
Bayes is superior if $(p^* - 0.3008)^2 < \frac{1+\eta}{1-\eta}(0.02103)$.
Superior pairs $(p^*, \eta)$ account for 55% of the square's area.
Empirical result: 88 out of 99 Bayesian estimators were superior.
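A minimal Monte Carlo sketch checking the "55% of the square" claim: estimate the fraction of the unit square of pairs $(p^*, \eta)$ for which the threshold condition holds.

```python
# Fraction of (p_star, eta) in [0,1) x [0,1) satisfying
# (p_star - 0.3008)^2 < ((1 + eta)/(1 - eta)) * 0.02103.
import numpy as np

rng = np.random.default_rng(3)
p_star = rng.uniform(0.0, 1.0, 1_000_000)
eta = rng.uniform(0.0, 1.0, 1_000_000)

superior = (p_star - 0.3008) ** 2 < (1 + eta) / (1 - eta) * 0.02103
print(superior.mean())  # about 0.55
```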

Lingering Questions

What does objective mean? Should subjective probability add to the inference process? What is a good estimator?

A frequentist uses careful arguments and events that did not happen to answer the wrong question, while a Bayesian answers the right question by making assumptions that nobody can fully embrace.
