
Two Statistical Paradigms: Bayesian versus Frequentist

Steven Janke

April 2012

Probability versus Statistics

Probability Problem: Given some symmetry, what is the probability that an event occurs? (Example: The probability of 3 heads when a fair coin is flipped 5 times.)

Statistical Problem: Given some data, what is the underlying symmetry? (Example: Four heads resulted from tossing a coin 5 times. Is it a fair coin?)

Statistics Problems

All statistical problems begin with a model for the problem. Then there are different flavors for the actual question:
Estimation (Find the mean of the population.)
Interval estimation (Confidence interval for the mean.)
Hypothesis testing (Is the mean greater than 10?)
Model fit (Could one variable be linearly related to another?)

Estimators and Loss Functions

Definition (Estimator): An estimator $\hat\theta$ is a random variable that is a function of the data and reasonably approximates $\theta$. For example, $\hat\theta = \bar X = \frac{1}{n}\sum_{i=1}^n X_i$.

Loss Functions:
Squared Error: $L(\theta - \hat\theta) = (\theta - \hat\theta)^2$
Absolute Error: $L(\theta - \hat\theta) = |\theta - \hat\theta|$
Linex: $L(\theta - \hat\theta) = \exp(c(\theta - \hat\theta)) - c(\theta - \hat\theta) - 1$

Comparing Estimators

Definition (Risk): An estimator's risk is $R(\theta, \hat\theta) = E[L(\theta, \hat\theta)]$.

For two estimators $\hat\theta_1$ and $\hat\theta_2$, if $R(\theta, \hat\theta_1) < R(\theta, \hat\theta_2)$ for all $\theta \in \Theta$, then $\hat\theta_2$ is called inadmissible.

How do you find "good" estimators? What does "good" mean? How do you definitively rank estimators?

Frequentist Approach

Philosophical Principles:
Probability as relative frequency (repeatable trials). (Law of Large Numbers.)
$\theta$ is a fixed quantity, but the data can vary.

Definition (Likelihood Function)

Suppose the population density is $f_\theta$ and the data $(x_1, x_2, \ldots, x_n)$ are sampled independently; then the Likelihood Function is $L(\theta) = \prod_{i=1}^n f_\theta(x_i)$.

The value of $\theta$ (as a function of the data) that gives the maximum likelihood is a good candidate for an estimator $\hat\theta$.

Maximum Likelihood for Normal and Binomial

Population is Normal($\mu$, $\sigma^2$): $L(\mu) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$. Maximum likelihood gives $\hat\mu = \bar X = \frac{1}{n}\sum_{i=1}^n x_i$.
Population is Binomial(n, p): Flip a coin n times with P[head] = p. $L(p) = p^k(1-p)^{n-k}$, where k is the number of heads. Maximum likelihood gives $\hat p = \frac{k}{n}$.
Both estimators are unbiased: $E\hat p = \frac{np}{n} = p$ and $E\hat\mu = \frac{n\mu}{n} = \mu$. Among all unbiased estimators, these estimators minimize the squared error loss.
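A minimal sketch (not from the talk; it uses k = 4 heads in n = 5 tosses, matching the earlier coin example) checking numerically that the binomial MLE is the closed-form value k/n:

```python
# A minimal sketch: minimize the negative log-likelihood of Binomial(n, p)
# and compare with the closed-form MLE p_hat = k/n.
import numpy as np
from scipy.optimize import minimize_scalar

k, n = 4, 5  # observed 4 heads in 5 tosses (assumed for illustration)

def neg_log_likelihood(p):
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical MLE :", result.x)   # ~0.8
print("closed form k/n:", k / n)     # 0.8
```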

Frequentist Conclusions

Definition (Confidence Interval): $\bar X$ is a random variable with mean $\mu$ and distribution $N(\mu, \frac{\sigma^2}{n})$. The random interval $(\bar X - 1.96\frac{\sigma}{\sqrt{n}},\ \bar X + 1.96\frac{\sigma}{\sqrt{n}})$ is a confidence interval. The probability that it covers $\mu$ is 0.95.
>>> The probability that $\mu$ is in the interval is not 0.95!
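A minimal simulation sketch (mu = 10, sigma = 2, n = 25 are assumed values) of the frequentist reading: the interval is what varies from sample to sample, and it covers the fixed mu about 95% of the time.

```python
# Repeated-sampling coverage of the 95% confidence interval for a normal mean.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 10.0, 2.0, 25, 100_000  # assumed values

covered = 0
for _ in range(trials):
    xbar = rng.normal(mu, sigma, size=n).mean()
    half_width = 1.96 * sigma / np.sqrt(n)
    if xbar - half_width < mu < xbar + half_width:
        covered += 1

print(covered / trials)  # close to 0.95
```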

Definition (Hypothesis Testing): Null Hypothesis: $\mu = 10$. Calculate $z^* = \frac{\bar X - 10}{\sigma/\sqrt{n}}$. The p-value is $P[Z \le -|z^*|] + P[Z \ge |z^*|]$.
>>> The p-value is not the probability that the null hypothesis is true.

Space of Estimators

Methods: Moments, Linear, Invariant
Characteristics: Unbiased, Sufficient, Consistent, Minimum Variance, Admissible

Frequentist Justification: Three Theorems

Theorem (Rao-Blackwell): A statistic T such that the conditional distribution of the data given T does not depend on $\theta$ (the parameter) is called sufficient. Replacing an unbiased estimator by its conditional expectation given a sufficient statistic never increases the variance.

Theorem (Lehmann-Scheffé): If T is a complete sufficient statistic for $\theta$ and $\hat\theta = h(T)$ is an unbiased estimator of $\theta$, then $\hat\theta$ has the smallest possible variance among unbiased estimators of $\theta$. ($\hat\theta$ is the UMVUE of $\theta$.)

Theorem (Cramér-Rao)

Let a population have a "regular" distribution with density $f_\theta(x)$ and let $\hat\theta$ be an unbiased estimator of $\theta$. Then $\operatorname{Var}(\hat\theta) \ge \left(n\,E\!\left[\left(\frac{\partial}{\partial\theta}\ln f_\theta(X)\right)^2\right]\right)^{-1}$.

Problems Choosing Estimators

How does the Frequentist choose an estimator?
Example (Variance)
Maximum likelihood estimator: $\frac{\sum_{i=1}^n (x_i - \bar x)^2}{n}$
Unbiased estimator: $\frac{\sum_{i=1}^n (x_i - \bar x)^2}{n-1}$
Minimum mean squared error: $\frac{\sum_{i=1}^n (x_i - \bar x)^2}{n+1}$
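A minimal simulation sketch (a hypothetical normal sample with variance 4 and n = 10) comparing the three divisors: dividing by n+1 trades bias for a smaller mean squared error.

```python
# Monte Carlo mean squared error for the three variance estimators.
import numpy as np

rng = np.random.default_rng(1)
true_var, n, trials = 4.0, 10, 200_000     # assumed values
divisors = {"n": n, "n-1": n - 1, "n+1": n + 1}
sq_err = {name: 0.0 for name in divisors}

for _ in range(trials):
    x = rng.normal(0.0, np.sqrt(true_var), size=n)
    ss = np.sum((x - x.mean()) ** 2)
    for name, d in divisors.items():
        sq_err[name] += (ss / d - true_var) ** 2

for name, total in sq_err.items():
    print(f"divide by {name}: MSE ~ {total / trials:.4f}")
```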

Example (Mean): Estimate the IQ of an individual. Say the score on a test was 130. Suppose the scoring machine reported all scores less than 100 as 100. Now the Frequentist must change the estimate of IQ, even though the observed score of 130 was never affected by the truncation.

What is the p-value?

The p-value depends on events that didn't happen.
Example (Flipping Coins)
Flipping a coin n = 17 times: K = 13 heads implies the p-value is $P[K \le 4 \text{ or } K \ge 13] = 0.049$.
If the experimenter had instead stopped as soon as there were at least 4 heads and 4 tails, the p-value would be 0.021.
If the experimenter stops at n = 17 and continues to n = 44 if the data are inconclusive, then the p-value is > 0.049.
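A minimal sketch verifying the first p-value in the example: with 17 tosses of a fair coin and 13 heads, the two-sided binomial tail probability is about 0.049.

```python
# Two-sided binomial p-value: P[K <= 4 or K >= 13] under Binomial(17, 0.5).
from scipy.stats import binom

n, k = 17, 13
p_value = binom.cdf(4, n, 0.5) + binom.sf(k - 1, n, 0.5)  # sf(k-1) = P[K >= k]
print(round(p_value, 3))  # 0.049
```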

The p-value is already suspect since it is NOT the probability that $H_0$ is true.

Several Normal Means at Once

Example (Estimating Several Normal Means)

Suppose we are estimating k normal means $\mu_1, \mu_2, \ldots, \mu_k$ from populations with a common variance.

Then the Frequentist estimate is $\bar{\mathbf{x}} = (\bar x_1, \bar x_2, \ldots, \bar x_k)$ under Euclidean distance as the loss. Stein (1956) showed that $\bar{\mathbf{x}}$ is not even admissible! A better estimator is $\hat\mu(\bar{\mathbf{x}}) = \left(1 - \frac{k-2}{\|\bar{\mathbf{x}}\|^2}\right)\cdot\bar{\mathbf{x}}$. The better estimator can be thought of as a Bayesian estimator with the correct prior.
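A minimal simulation sketch (hypothetical means in [-1, 1], unit variance, one observation per mean) comparing the total squared error of the usual estimate with the Stein-type shrinkage estimate above.

```python
# Risk comparison: usual estimate x versus James-Stein (1 - (k-2)/||x||^2) x.
import numpy as np

rng = np.random.default_rng(2)
mu = np.linspace(-1.0, 1.0, 10)      # k = 10 true means (assumed for illustration)
k, trials = len(mu), 50_000

err_usual = err_js = 0.0
for _ in range(trials):
    x = rng.normal(mu, 1.0)                       # one N(mu_i, 1) draw per mean
    shrink = 1.0 - (k - 2) / np.sum(x ** 2)
    err_usual += np.sum((x - mu) ** 2)
    err_js += np.sum((shrink * x - mu) ** 2)

print("risk, usual estimate:", err_usual / trials)   # about k = 10
print("risk, James-Stein   :", err_js / trials)      # smaller
```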

Bayes Theorem

Richard Price communicated Rev. Thomas Bayes's paper to the Royal Society (1763). Given data, Bayes found the probability that the binomial parameter p lies in a given interval. One of the many lemmas is now known as "Bayes Theorem":

$$P[A|B] = \frac{P[AB]}{P[B]} = \frac{P[B|A]P[A]}{P[B]} = \frac{P[B|A]P[A]}{P[B|A]P[A] + P[B|A^c]P[A^c]}$$
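A minimal numeric sketch of the formula; the three probabilities are made-up values, used only for illustration.

```python
# Bayes' Theorem with assumed numbers:
# P[A|B] = P[B|A]P[A] / (P[B|A]P[A] + P[B|A^c]P[A^c]).
p_A = 0.01           # prior probability of A      (assumed)
p_B_given_A = 0.95   # P[B | A]                    (assumed)
p_B_given_Ac = 0.10  # P[B | not A]                (assumed)

numerator = p_B_given_A * p_A
denominator = numerator + p_B_given_Ac * (1 - p_A)
print(numerator / denominator)  # about 0.088
```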

Conditional probability is defined after defining conditional expectation as an integral over a sub-sigma-field of sets. The conditional distribution is then
$$f(x|y) = \frac{f(x,y)}{f_Y(y)} = \frac{f(y|x)f_X(x)}{f_Y(y)}$$

Bayesian Approach

Assume that the parameter under investigation is a random variable and that the data are fixed.
Decide on a prior distribution for the parameter. (Bayes considered p to be uniformly distributed from 0 to 1.)
Calculate the distribution of the parameter given the data. (Posterior distribution.)
To minimize squared error loss, take the mean of the posterior distribution as the estimator.

Bayesian Approach for the Binomial

Assume a binomial model with n, p (flipping a coin.) Let K be the number of heads (the data).

For a prior distribution on p take the Beta distribution: $f_p(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1}(1-p)^{\beta-1}$ for $0 < p < 1$.

$$f(p|k) = \frac{f(k|p)\,f_p(p)}{f_K(k)} = C \cdot p^{\alpha+k-1}(1-p)^{\beta+n-k-1}$$

Hence, $E[p|k] = \frac{\alpha+k}{\alpha+\beta+n}$.

With the uniform prior ($\alpha = 1$, $\beta = 1$), the estimator is $\frac{k+1}{n+2} = \left(\frac{n}{n+2}\right)\frac{k}{n} + \left(\frac{2}{n+2}\right)\frac{1}{2}$.
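A minimal sketch of the Beta-Binomial update (again using the k = 4 heads in n = 5 tosses example): the posterior is Beta(α + k, β + n − k) and its mean is (α + k)/(α + β + n).

```python
# Posterior for p after observing k heads in n tosses with a Beta(alpha, beta) prior.
from scipy.stats import beta

a, b = 1, 1                   # uniform prior
k, n = 4, 5                   # observed data (assumed for illustration)

posterior = beta(a + k, b + n - k)
print("posterior mean:", posterior.mean())             # (k+1)/(n+2) = 5/7 ~ 0.714
print("95% credible interval:", posterior.ppf([0.025, 0.975]))
```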

Bayesian Approach for the Normal

If $\mu$ is the mean of a normal population, a reasonable prior distribution for $\mu$ could be another normal distribution, $N(\mu_o, \sigma_o^2)$.
$$f(\mu|\bar x) = \frac{\sqrt{n}}{2\pi\sigma\sigma_o\, g(\bar x)}\, e^{-\frac{n(\bar x - \mu)^2}{2\sigma^2}}\, e^{-\frac{(\mu-\mu_o)^2}{2\sigma_o^2}} = C\, e^{-\frac{(\mu-\mu_1)^2}{2\sigma_1^2}}$$
Hence, $E[\mu|\bar x] = \bar x\left(\frac{n\sigma_o^2}{n\sigma_o^2 + \sigma^2}\right) + \mu_o\left(\frac{\sigma^2}{n\sigma_o^2 + \sigma^2}\right)$.
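A minimal sketch of the posterior-mean formula with hypothetical numbers (the prior N(100, 15²), known population sd 10, n = 25, and sample mean 104 are all assumptions for illustration):

```python
# Normal-Normal update: posterior mean is a precision-weighted average of xbar and mu0.
import numpy as np

mu0, sigma0 = 100.0, 15.0     # prior mean and sd for mu        (assumed)
sigma, n = 10.0, 25           # known population sd, sample size (assumed)
xbar = 104.0                  # observed sample mean             (assumed)

w = n * sigma0**2 / (n * sigma0**2 + sigma**2)            # weight on the data
post_mean = w * xbar + (1 - w) * mu0
post_var = sigma**2 * sigma0**2 / (n * sigma0**2 + sigma**2)
print(post_mean, np.sqrt(post_var))
```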

Bayesian Approach under other Loss functions

Note: If we use absolute error loss, then the Bayes estimator is the median of the posterior distribution. If we use Linex loss, then the Bayes estimator is $-\frac{1}{c}\ln\{E[e^{-c\theta}]\}$ (expectation taken under the posterior).

Calculating the Mean Squared Error

If the loss function is squared error, then the mean squared error (MSE) measures expected loss.

$$MSE = E(\hat\theta - \theta)^2 = (E(\hat\theta) - \theta)^2 + E(\hat\theta - E(\hat\theta))^2 \quad \text{(Squared Bias + Variance)}$$

For $\hat\theta = \frac{k}{n}$: $MSE = 0 + \frac{p(1-p)}{n}$.
For $\hat\theta = \frac{k+1}{n+2}$: $MSE = \left(\frac{1-2p}{n+2}\right)^2 + \frac{n}{(n+2)^2}\, p(1-p)$.
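A minimal sketch computing the two MSE curves as functions of p (n = 10 is an assumed sample size), following the bias² + variance decomposition above.

```python
# MSE of k/n versus (k+1)/(n+2) as a function of p.
import numpy as np

n = 10                               # assumed
p = np.linspace(0.0, 1.0, 101)

mse_freq = p * (1 - p) / n
mse_bayes = ((1 - 2 * p) / (n + 2)) ** 2 + n * p * (1 - p) / (n + 2) ** 2

# Near p = 1/2 the Bayes estimator (k+1)/(n+2) has smaller MSE;
# near 0 or 1 the frequentist estimator k/n wins.
print(mse_freq[50], mse_bayes[50])   # at p = 0.5
print(mse_freq[5], mse_bayes[5])     # at p = 0.05
```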

Comparing the MSE

Figure: Comparing MSE for Freq and Bayes estimators

Bayesian Confidence (Credible) Interval

$\Theta$ and X are the random variables representing the parameter of interest and the data. The functions u(x) and v(x) are arbitrary and f is the posterior distribution. Then, $P[u(x) < \Theta < v(x) \mid X = x] = \int_{u(x)}^{v(x)} f(\theta|x)\, d\theta$.

Definition (Credible Interval): If u(x) and v(x) are picked so the probability is 0.95, the interval [u(x), v(x)] is a 95% credible interval for $\Theta$.

For the normal distribution, $\frac{n\bar x\,\sigma_o^2 + \mu_o\sigma^2}{n\sigma_o^2 + \sigma^2} \pm 1.96\sqrt{\frac{\sigma^2\sigma_o^2}{n\sigma_o^2 + \sigma^2}}$ is a 95% credible interval for $\mu$.
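A minimal sketch of this interval, reusing the same assumed numbers as the Normal-Normal sketch earlier (all values hypothetical):

```python
# 95% credible interval for mu: posterior mean +/- 1.96 * posterior sd.
import numpy as np

mu0, sigma0, sigma, n, xbar = 100.0, 15.0, 10.0, 25, 104.0   # assumed values
post_mean = (n * xbar * sigma0**2 + mu0 * sigma**2) / (n * sigma0**2 + sigma**2)
post_sd = np.sqrt(sigma**2 * sigma0**2 / (n * sigma0**2 + sigma**2))
print(post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd)
```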

Comparing Estimates

Frequentist and Bayesian estimates for p in the Binomial:

                      (k, n)     Estimate   95% Conf. Int.    Width
  Frequentist         (2, 3)     0.667      (0.125, 0.982)    0.857
                      (30, 45)   0.667      (0.509, 0.796)    0.287
                      (60, 90)   0.667      (0.559, 0.760)    0.201
  Bayes (Beta(1,1))   (2, 3)     0.600      (0.235, 0.964)    0.729
                      (30, 45)   0.660      (0.527, 0.793)    0.266
                      (60, 90)   0.663      (0.567, 0.759)    0.191
  Bayes (Beta(4,4))   (2, 3)     0.545      (0.270, 0.820)    0.550
                      (30, 45)   0.642      (0.514, 0.769)    0.254
                      (60, 90)   0.653      (0.560, 0.747)    0.187
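A minimal sketch for the (k, n) = (30, 45) row. The Beta(1,1) posterior mean and equal-tailed credible interval can be checked directly; the frequentist interval shown is a Wald interval (an assumption, since the talk does not say which frequentist interval was used), so it differs slightly from the table.

```python
# One row of the comparison: Wald interval (assumed) versus Beta(1,1) credible interval.
import numpy as np
from scipy.stats import beta

k, n = 30, 45
p_hat = k / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
print("frequentist (Wald):", round(p_hat, 3),
      (round(p_hat - 1.96 * se, 3), round(p_hat + 1.96 * se, 3)))

post = beta(1 + k, 1 + n - k)
lo, hi = post.ppf([0.025, 0.975])
print("Bayes Beta(1,1)   :", round(post.mean(), 3), (round(lo, 3), round(hi, 3)))
```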

Bayesian Hypothesis testing

For the coin example, $H_0: p = \frac{1}{2}$ and $H_1: p \ne \frac{1}{2}$. Suppose K = 13, n = 17.
Objective stance: $P[H_0 \text{ is true}] = \frac{1}{2}$.
Subjective input: the largest plausible value of p, say $p_o$ (under $H_1$, p is uniform on $(1-p_o, p_o)$).

$$P[H_0 \mid K = k] = \frac{P[K=k \mid H_0]\,P[H_0]}{P[K=k \mid H_0]\,P[H_0] + P[K=k \mid H_1]\,P[H_1]} = \frac{2^{-17}\cdot 0.5}{2^{-17}\cdot 0.5 + 0.5\cdot\frac{1}{2p_o - 1}\int_{1-p_o}^{p_o} p^{13}(1-p)^4\, dp}$$

The probability depends on $p_o$, but ranges from 0.21 to 0.5, giving only mild evidence against $H_0$. Selecting a uniform distribution for p between $1 - p_o$ and $p_o$ is not very restrictive: symmetric distributions give the same range. More generally, Bayesians calculate the Bayes factor between the hypotheses.
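A minimal sketch evaluating the formula above for a few values of $p_o$ (uniform prior on $(1-p_o, p_o)$ under $H_1$, prior odds 1:1):

```python
# P[H0 | K = 13 heads in 17 tosses] as a function of the prior bound p_o.
from scipy.integrate import quad

k, n = 13, 17

def posterior_prob_H0(po):
    marginal_H1, _ = quad(lambda p: p**k * (1 - p)**(n - k), 1 - po, po)
    marginal_H1 /= (2 * po - 1)        # density of the uniform prior on (1-po, po)
    marginal_H0 = 0.5**n
    return 0.5 * marginal_H0 / (0.5 * marginal_H0 + 0.5 * marginal_H1)

for po in (0.6, 0.8, 0.95, 1.0):
    print(po, round(posterior_prob_H0(po), 3))
# The values stay roughly between 0.21 and 0.5, far above the p-value of 0.049.
```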

Noninformative Priors

If objectivity is an issue, pick priors which affect the posterior distribution the least.
Improper priors: select a uniform distribution on the real line. Formally, this gives a proper posterior for the normal and identifies $\bar X$ as the estimator.
Reference priors (maximize "distance" from the posterior), Jeffreys priors (scale invariant), minimum entropy priors.
Conjugate priors: convenient, easily interpreted.

Subjective Probability Axioms

Definition (Weak Order): If A and B are events, then $A \preceq B$ means B is at least as likely as A. (Informally, we are willing to place a bet that the odds of B are at least as large as the odds for A.) The relation $\preceq$ is referred to as a weak order.

Axioms:
For any pair of events A and B, either $A \preceq B$ or $B \preceq A$.
For disjoint sets, $A_1 \preceq B_1$ and $A_2 \preceq B_2$ implies $A_1 \cup A_2 \preceq B_1 \cup B_2$.
For any event A, $\emptyset \preceq A$.
If $\{A_i\}$ is a decreasing sequence of events and $A_i \succeq B$ for every i, then $\bigcap_{i=1}^{\infty} A_i \succeq B$.
For the uniform distribution on [0, 1], any event is comparable to a subinterval I ($A \preceq I$ or $I \preceq A$).
Conclusion: The relation $\preceq$ is in agreement with a probability measure.

Theoretical Bayesian Results

Theorem (Bayes Estimators): Under squared error loss, if $\hat\theta$ is an unbiased estimator of $\theta$, then it is not a Bayes estimator under any proper prior.

Theorem (Complete Class Theorem) Under some weak assumptions, every “good” estimator is a “Bayes” estimator with respect to some prior distribution.

When is the Bayesian estimator better?

Definition (Bayes Risk): There are two sources of randomness: the estimator and the parameter. The expected loss is $E_{F_\theta}[L(\theta, \hat\theta)]$.

Let $G_o$ be the true distribution of $\theta$. Then the Bayes risk is $r(G_o, \hat\theta) = E_{G_o} E_{F_\theta}[L(\theta, \hat\theta)]$.

To compare a Bayesian estimator with prior G (call it $\hat\theta_G$) to a Frequentist estimator $\hat\theta$, compare the Bayes risks.

When is $r(G_o, \hat\theta_G) \le r(G_o, \hat\theta)$?

Threshold Theorem

Theorem (Samaniego, 2010)

If the Bayes estimator $\hat\theta_G$ under squared error loss has the form

$$\hat\theta_G = (1 - \eta)E_G\theta + \eta\hat\theta$$

where $\hat\theta$ is a sufficient, unbiased estimator of $\theta$ and $\eta \in [0, 1)$, then

$$r(G_o, \hat\theta_G) \le r(G_o, \hat\theta)$$

if and only if

$$\operatorname{Var}_{G_o}(\theta) + (E_G\theta - E_{G_o}\theta)^2 \le \frac{1+\eta}{1-\eta}\, r(G_o, \hat\theta)$$

Proof of Threshold Theorem

Proof:

$$E_{F_\theta}(\hat\theta_G - \theta)^2 = E_{F_\theta}\!\left(\eta(\hat\theta - \theta) + (1-\eta)(E_G\theta - \theta)\right)^2 = \eta^2 E_{F_\theta}(\hat\theta - \theta)^2 + (1-\eta)^2(E_G\theta - \theta)^2$$
(The cross term vanishes because $\hat\theta$ is unbiased: $E_{F_\theta}(\hat\theta - \theta) = 0$.)

Expectation of both sides with respect to Go gives:

$$r(G_o, \hat\theta_G) = \eta^2\, r(G_o, \hat\theta) + (1-\eta)^2\, E_{G_o}(\theta - E_G\theta)^2 = \eta^2\, r(G_o, \hat\theta) + (1-\eta)^2\left(\operatorname{Var}_{G_o}(\theta) + (E_G\theta - E_{G_o}\theta)^2\right)$$

So $r(G_o, \hat\theta_G) \le r(G_o, \hat\theta)$ if and only if
$$\operatorname{Var}_{G_o}(\theta) + (E_G\theta - E_{G_o}\theta)^2 \le \frac{1+\eta}{1-\eta}\, r(G_o, \hat\theta).$$
Q.E.D.

Example of Threshold Theorem

Example (Binomial Model)
Problem: Estimate the proportion of "long" words starting the pages of Of Human Bondage.
The Bayes approach is to use $\frac{k+\alpha}{n+\alpha+\beta}$.
Could just specify the prior mean $p^* = \frac{\alpha}{\alpha+\beta}$ and $\eta \in [0, 1)$, which expresses confidence in $\frac{k}{n}$.
The true prior was a degenerate distribution with mean 0.3008, and $r(G_o, \hat p) = 0.02103$.
Bayes is superior if $(p^* - 0.3008)^2 < \frac{1+\eta}{1-\eta}(0.02103)$.
Superior pairs $(p^*, \eta)$ account for 55% of the square's area.
Empirical result: 88 out of 99 Bayesian estimators were superior.
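A minimal Monte Carlo sketch checking the "55% of the square" claim: estimate the fraction of the unit square of pairs $(p^*, \eta)$ for which the threshold condition holds.

```python
# Fraction of (p_star, eta) in [0,1) x [0,1) satisfying
# (p_star - 0.3008)^2 < ((1 + eta)/(1 - eta)) * 0.02103.
import numpy as np

rng = np.random.default_rng(3)
p_star = rng.uniform(0.0, 1.0, 1_000_000)
eta = rng.uniform(0.0, 1.0, 1_000_000)

superior = (p_star - 0.3008) ** 2 < (1 + eta) / (1 - eta) * 0.02103
print(superior.mean())  # about 0.55
```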

Lingering Questions

What does objective mean? Should subjective probability add to the inference process? What is a good estimator?

A frequentist uses careful arguments and events that did not happen to answer the wrong question, while a Bayesian answers the right question by making assumptions that nobody can fully embrace.
