
David Giles Bayesian

3. Properties of Bayes Estimators & Tests

(i) Some general results for Bayes Decision Rules. (ii) Bayes estimators - some exact results. (iii) Bayes estimators - some asymptotic (large-n) results. (iv) Bayesian interval estimation.


General Results for Bayes Decision Rules

• Recall: Bayes' Rules and MEL Rules are (generally) equivalent.

• Minimum Expected Loss (MEL) Rule: "Act so as to Minimize (posterior) Expected Loss":

$\int_{\Omega} L(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}})\, p(\boldsymbol{\theta}|\boldsymbol{y})\, d\boldsymbol{\theta}$

• Bayes' Rule: "Act so as to Minimize Average Risk" (often called the "Bayes' Risk"):

$r(\hat{\boldsymbol{\theta}}) = \int R(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}$

• The action will involve selecting an estimator, or rejecting some hypothesis, for instance.
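To make the MEL rule concrete, here is a minimal numerical sketch (not from these notes): it assumes a hypothetical Beta(3, 5) posterior and quadratic loss, and checks that the action minimizing posterior expected loss is the posterior mean.

```python
# Minimal sketch: minimize posterior expected loss over a grid of actions.
# The Beta(3, 5) posterior and quadratic loss are illustrative assumptions only.
import numpy as np
from scipy import stats

posterior = stats.beta(3, 5)                    # hypothetical posterior p(theta | y)
theta = np.linspace(0.001, 0.999, 2001)         # grid over the parameter space
weights = posterior.pdf(theta)
weights /= weights.sum()                        # normalized posterior weights

actions = np.linspace(0.001, 0.999, 2001)       # candidate point estimates
expected_loss = [(weights * (theta - a) ** 2).sum() for a in actions]
mel_estimate = actions[np.argmin(expected_loss)]

print(mel_estimate)       # approximately 0.375 = 3 / (3 + 5)
print(posterior.mean())   # under quadratic loss the Bayes estimator is the posterior mean
```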


• We have the following general results:

(i) A Mini-max Rule always exists.

(ii) Any decision rule that is Admissible must be a Bayes' Rule with respect to some prior distribution.

(iii) If the prior distribution is "proper", every Bayes' Rule is Admissible.

(iv) If a Bayes' Rule has constant risk, then it is Mini-max.

(v) If a Mini-max Rule corresponds to a unique Bayes' Rule, then this Mini-max Rule is also unique.
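A standard textbook illustration of result (iv), not from these notes: for $X \sim \text{Binomial}(n, \theta)$ under quadratic loss, the Bayes estimator based on a Beta$(\sqrt{n}/2, \sqrt{n}/2)$ prior is

$\hat{\theta} = \dfrac{X + \sqrt{n}/2}{n + \sqrt{n}}, \qquad R(\theta, \hat{\theta}) = \dfrac{n\theta(1-\theta) + \frac{n}{4}(1 - 2\theta)^{2}}{(n + \sqrt{n})^{2}} = \dfrac{1}{4(\sqrt{n} + 1)^{2}},$

which is constant in $\theta$, so this Bayes rule is also Mini-max.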


Bayes Estimators – Some Exact Results

• As long as the prior p.d.f., $p(\boldsymbol{\theta})$, is "proper", the corresponding Bayes' estimator is Admissible.

• Bayesians are not really interested in the properties of their estimators in a "repeated sampling" situation.

• They are interested in behaviour that's conditional on the data in the current sample.

• Contrast "Bayesian Posterior Probability Intervals" (or, "Bayesian Credible Intervals") with "Confidence Intervals".

• However, note that Bayes' estimators may be biased or unbiased in finite samples.


Bayes Estimators – Some Asymptotic Results

• Intuitively, we'd expect that as the sample size grows, the Likelihood Function will dominate the Prior.

• In the limit, we might expect that Bayes' estimators will converge to MLEs.

• An exception will be if the prior is "totally dogmatic" (degenerate).

• So, not surprisingly, Bayes estimators are weakly consistent.

• The principal asymptotic result associated with Bayes' estimators is the so-called "Bernstein-von Mises Theorem".


Sergei Bernstein (1880 – 1968) and Richard von Mises (1883 – 1953)


Theorem: Unless $p(\boldsymbol{\theta})$ is degenerate, $\lim_{n \to \infty}\{p(\boldsymbol{\theta}|\boldsymbol{y})\} = N[\tilde{\boldsymbol{\theta}}, I(\tilde{\boldsymbol{\theta}})^{-1}]$, where $\tilde{\boldsymbol{\theta}}$ is the MLE of $\boldsymbol{\theta}$.

Proof (for the scalar case): From Bayes' Theorem,

$p(\theta|\boldsymbol{y}) \propto p(\theta)L(\theta|\boldsymbol{y}) = p(\theta)\exp\{\log L(\theta|\boldsymbol{y})\}$

• Take a Taylor series expansion of $p(\theta)$ about the MLE, $\tilde{\theta}$:

$p(\theta) = p(\tilde{\theta}) + (\theta - \tilde{\theta})p'(\tilde{\theta}) + \frac{1}{2}(\theta - \tilde{\theta})^{2}p''(\tilde{\theta}) + \cdots$

$= p(\tilde{\theta})\left[1 + (\theta - \tilde{\theta})\frac{p'(\tilde{\theta})}{p(\tilde{\theta})} + \frac{1}{2}(\theta - \tilde{\theta})^{2}\frac{p''(\tilde{\theta})}{p(\tilde{\theta})} + \cdots\right]$


$\propto \left[1 + (\theta - \tilde{\theta})\frac{p'(\tilde{\theta})}{p(\tilde{\theta})} + \frac{1}{2}(\theta - \tilde{\theta})^{2}\frac{p''(\tilde{\theta})}{p(\tilde{\theta})} + \cdots\right]$

• Let $l(\theta) = \log L(\theta)$.

• $\exp\{\log L(\theta|\boldsymbol{y})\} = \exp\{l(\theta)\}$

$= \exp\left\{l(\tilde{\theta}) + (\theta - \tilde{\theta})l'(\tilde{\theta}) + \frac{1}{2}(\theta - \tilde{\theta})^{2}l''(\tilde{\theta}) + \cdots\right\}$

• Note that $l'(\tilde{\theta}) = 0$ (the MLE sets the score to zero).

• So,

$\exp\{l(\theta)\} = \exp\{l(\tilde{\theta})\}\,\exp\left\{\frac{1}{2}(\theta - \tilde{\theta})^{2}l''(\tilde{\theta})\right\}\exp\left\{\frac{1}{6}(\theta - \tilde{\theta})^{3}l'''(\tilde{\theta})\right\}\cdots$


Or,

$\exp\{l(\theta)\} \propto \exp\left\{\frac{1}{2}(\theta - \tilde{\theta})^{2}l''(\tilde{\theta})\right\}\exp\left\{\frac{1}{6}(\theta - \tilde{\theta})^{3}l'''(\tilde{\theta})\right\}\cdots$

• Expand this exponential:

$\exp\{l(\theta)\} \propto \exp\left\{\frac{1}{2}(\theta - \tilde{\theta})^{2}l''(\tilde{\theta})\right\}\left[1 + \frac{1}{6}(\theta - \tilde{\theta})^{3}l'''(\tilde{\theta}) + \cdots\right]$

• The leading term is

$\exp\{l(\theta)\} \propto \exp\left\{\frac{1}{2}(\theta - \tilde{\theta})^{2}l''(\tilde{\theta})\right\}$

• This will apply for large n.

In this case,

$p(\theta|\boldsymbol{y}) \propto \exp\{l(\theta)\}\,p(\theta)$


$\propto \exp\left\{\frac{1}{2}(\theta - \tilde{\theta})^{2}l''(\tilde{\theta})\right\}\left[1 + (\theta - \tilde{\theta})\frac{p'(\tilde{\theta})}{p(\tilde{\theta})} + \frac{1}{2}(\theta - \tilde{\theta})^{2}\frac{p''(\tilde{\theta})}{p(\tilde{\theta})} + \cdots\right]$

If n is large enough –

$p(\theta|\boldsymbol{y}) \propto \exp\left\{\frac{1}{2}(\theta - \tilde{\theta})^{2}l''(\tilde{\theta})\right\}$.

This is the kernel of a Normal density, centered at $\tilde{\theta}$, and with a variance of $-1/l''(\tilde{\theta})$.

• Asymptotically, our Bayes estimator under a zero-one loss function will coincide with the MLE.
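As a quick numerical check of this result (a sketch, using simulated exponential data and the Gamma prior of Example 2 below; the sample size and seed are arbitrary assumptions):

```python
# Sketch: compare the exact posterior with the Bernstein-von Mises Normal
# approximation N[theta_mle, -1 / l''(theta_mle)] for exponential data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta_true, n = 2.0, 500
y = rng.exponential(scale=1.0 / theta_true, size=n)   # simulated data (assumption)

theta_mle = 1.0 / y.mean()              # MLE of the exponential rate
var_approx = theta_mle ** 2 / n         # -1 / l''(theta_mle) = theta_mle^2 / n

alpha, beta = 2.0, 4.0                  # Gamma prior parameters (as in Example 2)
posterior = stats.gamma(a=n + alpha, scale=1.0 / (n * y.mean() + 1.0 / beta))
approx = stats.norm(loc=theta_mle, scale=np.sqrt(var_approx))

grid = np.linspace(theta_mle - 3 * np.sqrt(var_approx),
                   theta_mle + 3 * np.sqrt(var_approx), 7)
print(np.column_stack([posterior.pdf(grid), approx.pdf(grid)]))  # columns nearly agree
```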


Bayesian Interval Estimation

• Consider the Bayesian counterpart to a Confidence Interval.

• Totally different interpretation – nothing to do with repeated sampling.

• An interval of the form $(\theta_L, \theta_U)$ such that

$\Pr[(\theta_L < \theta < \theta_U)\,|\,\boldsymbol{y}] = p\%$

is a $p\%$ Bayesian Posterior Probability Interval, or a $p\%$ Bayesian Credible Interval, for $\theta$.

• Obvious extension to the case where $\boldsymbol{\theta}$ is a vector: a $p\%$ Bayesian Posterior Probability Region, or a $p\%$ Bayesian Credible Region, for $\boldsymbol{\theta}$.


• When we construct a (frequentist) Confidence Interval or Region, we usually try to make the interval as short (small) as possible for the desired confidence level.

• We can also consider constructing an "optimal" Bayesian Posterior Probability Region, as follows:

Given a posterior density function, $p(\boldsymbol{\theta}|\boldsymbol{y})$, let A be a subset of the parameter space such that:

(i) $\Pr[\boldsymbol{\theta} \in A\,|\,\boldsymbol{y}] = (1 - \alpha)$

(ii) For all $\boldsymbol{\theta}_1 \in A$ and $\boldsymbol{\theta}_2 \notin A$, $p(\boldsymbol{\theta}_1|\boldsymbol{y}) \geq p(\boldsymbol{\theta}_2|\boldsymbol{y})$,

then A is a Highest Posterior Density (HPD) Region of content $(1 - \alpha)$ for $\boldsymbol{\theta}$.


• For a given $\alpha$, the HPD region has the smallest possible volume.

• If $p(\boldsymbol{\theta}|\boldsymbol{y})$ is not uniform over every region of the parameter space, then the HPD region is unique.

• A BPI (BCI) can be used in an obvious way to test simple null hypotheses, just as a C.I. is used for this purpose by non-Bayesians.

• See more of this later in the course.


Figure: a posterior p.d.f., with the regions [a1, a2], [b1, b2] ∪ [b3, b4], and [c1, c2] marked on the horizontal axis.
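A minimal numerical sketch (not from these notes) of constructing an HPD region like those pictured above: keep the grid points with the highest posterior density until they account for probability content $(1 - \alpha)$. The Gamma(4, 2) posterior here is an illustrative assumption.

```python
# Sketch: find a 95% HPD region by thresholding the posterior density on a grid.
import numpy as np
from scipy import stats

posterior = stats.gamma(a=4, scale=2)        # hypothetical posterior p(theta | y)
theta = np.linspace(0.01, 40.0, 20000)
dx = theta[1] - theta[0]
dens = posterior.pdf(theta)

order = np.argsort(dens)[::-1]               # grid points, highest density first
content = np.cumsum(dens[order]) * dx        # accumulated posterior probability
inside = np.sort(order[content <= 0.95])     # points inside the 95% HPD region

print(theta[inside].min(), theta[inside].max())   # endpoints; one interval here,
                                                  # since this density is unimodal
```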


Example 1

$y_i \sim N[\mu, \sigma_0^2]$ ; $\sigma_0^2$ is known.

• Before we see the sample of data, we have prior beliefs about the value of $\mu$:

$p(\mu) = p(\mu|\sigma_0^2) \sim N[\bar{\mu}, \bar{v}]$

That is,

$p(\mu) \propto \exp\left\{-\frac{1}{2\bar{v}}(\mu - \bar{\mu})^{2}\right\}$

• Now we take a random sample of data:

$\boldsymbol{y} = (y_1, y_2, \ldots, y_n)$

• The joint data density (i.e., the Likelihood Function) is:


$p(\boldsymbol{y}|\mu, \sigma_0^2) = L(\mu|\boldsymbol{y}, \sigma_0^2) \propto \exp\left\{-\frac{1}{2\sigma_0^2}\sum_{i=1}^{n}(y_i - \mu)^{2}\right\}$

• Bayes' Theorem:

$p(\mu|\boldsymbol{y}, \sigma_0^2) \propto p(\mu|\sigma_0^2)\,p(\boldsymbol{y}|\mu, \sigma_0^2)$

• So, completing the square in $\mu$,

$p(\mu|\boldsymbol{y}, \sigma_0^2) \propto \exp\left\{-\frac{1}{2}\left(\frac{1}{\bar{v}} + \frac{n}{\sigma_0^2}\right)\left(\mu - \frac{\frac{\bar{\mu}}{\bar{v}} + \frac{n\bar{y}}{\sigma_0^2}}{\frac{1}{\bar{v}} + \frac{n}{\sigma_0^2}}\right)^{2}\right\}$

• The Posterior distribution for $\mu$ is $N[\bar{\bar{\mu}}, \bar{\bar{v}}]$, where


$\bar{\bar{\mu}} = \dfrac{\left(\frac{1}{\bar{v}}\right)\bar{\mu} + \left(\frac{n}{\sigma_0^2}\right)\bar{y}}{\left(\frac{1}{\bar{v}} + \frac{n}{\sigma_0^2}\right)}$ ; $\quad \dfrac{1}{\bar{\bar{v}}} = \left(\frac{1}{\bar{v}} + \frac{n}{\sigma_0^2}\right)$

• So, a 95% HPD interval for $\mu$ is obtained as follows –

$\Pr[-1.96 < Z < 1.96] = 95\%$

$\Pr\left[-1.96 < \frac{(\mu - \bar{\bar{\mu}})}{\sqrt{\bar{\bar{v}}}} < 1.96 \,\Big|\, \boldsymbol{y}\right] = 95\%$

$\Pr\left[\bar{\bar{\mu}} - 1.96\sqrt{\bar{\bar{v}}} < \mu < \bar{\bar{\mu}} + 1.96\sqrt{\bar{\bar{v}}} \,\Big|\, \boldsymbol{y}\right] = 95\%$


• The 95% HPD interval for $\mu$ is the interval

$\left\{\bar{\bar{\mu}} - 1.96\sqrt{\bar{\bar{v}}}\ ;\ \bar{\bar{\mu}} + 1.96\sqrt{\bar{\bar{v}}}\right\}$

• The (posterior) probability that $\mu$ lies in this interval is 95%.
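A short calculation for Example 1. The prior values, $\sigma_0^2$, and the data below are hypothetical, chosen only to illustrate the formulas:

```python
# Example 1: posterior mean, posterior variance, and the 95% HPD interval
# for mu when sigma_0^2 is known.  All numbers here are illustrative assumptions.
import numpy as np
from scipy import stats

mu_bar, v_bar = 10.0, 4.0                # prior mean and prior variance for mu
sigma0_sq = 9.0                          # known data variance
y = np.array([11.2, 9.5, 12.1, 10.8, 9.9, 11.5])
n, ybar = len(y), y.mean()

post_prec = 1.0 / v_bar + n / sigma0_sq          # 1 / v_doublebar
v_dd = 1.0 / post_prec
mu_dd = (mu_bar / v_bar + n * ybar / sigma0_sq) * v_dd

z = stats.norm.ppf(0.975)                        # = 1.96
print(mu_dd - z * np.sqrt(v_dd), mu_dd + z * np.sqrt(v_dd))   # 95% HPD interval for mu
```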

Example 2

• Random sample of n observations from an Exponential distribution, so

$p(y_i|\theta) = \theta\exp(-\theta y_i)$ ; $\theta > 0$

• The prior density for $\theta$ is chosen to be Gamma$(\alpha, \beta)$, where $\alpha, \beta > 0$:

$p(\theta) \propto \theta^{\alpha - 1}\exp\left(-\frac{\theta}{\beta}\right)$

($\alpha$ is the shape parameter; $\beta$ is the scale parameter.)


• So, the likelihood function is

$p(\boldsymbol{y}|\theta) = \theta^{n}\exp(-n\bar{y}\theta)$

• Bayes’ Theorem:

$p(\theta|\boldsymbol{y}) \propto \theta^{n + \alpha - 1}\exp\{-(n\bar{y} + \beta^{-1})\theta\}$

• This posterior is Gamma, with parameters $(n + \alpha)$ and $1/(n\bar{y} + \beta^{-1})$.

• Another example of Natural Conjugacy.

• Bayes point estimators of $\theta$:

(i) Quadratic Loss: $\hat{\theta} = (n + \alpha)/(n\bar{y} + \beta^{-1})$ (the posterior mean).

(ii) Zero-One Loss: $\hat{\theta} = (n + \alpha - 1)/(n\bar{y} + \beta^{-1})$ (the posterior mode).


• Suppose that $\alpha = 2$ and $\beta = 4$.

• If $n = 2$ and $\bar{y} = 0.125$, what is the posterior probability of the BCI [3.49, 15.5]?

• In this case the posterior density is Gamma [4 , 2].

• This is the same as a Chi-Square density with 8 degrees of freedom.

• Because Gamma$[(v/2), 2]$ = Chi-Square$(v)$.

• $\Pr[\chi^2_{(8)} < 3.49] = 0.10$, and $\Pr[\chi^2_{(8)} < 15.5] = 0.95$.

• So, the posterior probability for this BCI is (0.95 – 0.10) = 0.85.

• What are the Bayes point estimates of $\theta$ under quadratic and 0-1 losses?
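A short numerical check of Example 2, using scipy's Gamma and Chi-Square distributions; the point-estimate lines simply evaluate the formulas given earlier at these values of $n$, $\bar{y}$, $\alpha$, and $\beta$:

```python
# Check the posterior probability of the BCI [3.49, 15.5] and the Bayes point
# estimates for Example 2 (alpha = 2, beta = 4, n = 2, ybar = 0.125).
from scipy import stats

alpha, beta = 2.0, 4.0
n, ybar = 2, 0.125

shape = n + alpha                           # = 4
rate = n * ybar + 1.0 / beta                # = 0.5, so the scale is 2: Gamma[4, 2]

posterior = stats.gamma(a=shape, scale=1.0 / rate)
print(posterior.cdf(15.5) - posterior.cdf(3.49))          # ~ 0.85

# Equivalent Chi-Square(8) calculation, since Gamma[(v/2), 2] = Chi-Square(v):
print(stats.chi2(8).cdf(15.5) - stats.chi2(8).cdf(3.49))  # ~ 0.85

print(shape / rate)          # quadratic loss: posterior mean = 8
print((shape - 1) / rate)    # zero-one loss:  posterior mode = 6
```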
