Chapter 7: Estimation Sections

7.1 Statistical Inference
Bayesian Methods:
7.2 Prior and Posterior Distributions
7.3 Conjugate Prior Distributions
7.4 Bayes Estimators
Frequentist Methods:
7.5 Maximum Likelihood Estimators
7.6 Properties of Maximum Likelihood Estimators
Skip: p. 434-441 (EM algorithm and Sampling Plans)
7.7 Sufficient Statistics
Skip: 7.8 Jointly Sufficient Statistics
Skip: 7.9 Improving an Estimator

7.1 Statistical Inference

We have seen statistical models in the form of probability distributions:

f (x|θ)

In this section the general notation for any parameter will be θ. The parameter space will be denoted by Ω.

For example:
The lifetime of a Christmas light series follows the Expo(θ) distribution.
The average of 63 poured drinks is approximately normal with mean θ.
The number of people that have a disease out of a group of N people follows the Binomial(N, θ) distribution.

In practice the value of the parameter θ is unknown.


Statistical Inference: Given the data we have observed, what can we say about θ?

I.e. we observe random variables X1,..., Xn that we assume follow our model, and then we want to draw probabilistic conclusions about the parameter θ.

For example: If I tested 5 Christmas light series from the same manufacturer and they lasted for

21, 103, 76, 88 and 96 days.

Assuming that the lifetimes are independent and follow Expo(θ), what does this data set tell me about θ?

Statistical Inference – Another example

Say I take a random sample of 100 people and test them all for a disease. If 3 of them have the disease, what can I say about θ = the prevalence of the disease in the population?

Say I estimate θ as θ̂ = 3/100 = 3%. How sure am I about this number? I want uncertainty bounds on my estimate. Can I be confident that the prevalence of the disease is higher than 2%?

Examples of different types of inference

Prediction: Predict random variables that have not yet been observed. E.g. if we test 40 more people for the disease, how many people do we predict have the disease?

Estimation: Estimate (predict) the unknown parameter θ. E.g. we estimated the prevalence of the disease as θ̂ = 3%.


Making decisions: Hypothesis testing. E.g. if the disease affects 2% or more of the population, the state will launch a costly public health campaign. Can we be confident that θ is higher than 2%?

Experimental Design: What and how much data should we collect? E.g. how do I select people for my sample? How many do I need to be comfortable making decisions based on my analysis? Often limited by time and/or budget constraints.

Bayesian vs. Frequentist Inference

Should a parameter θ be treated as a random variable? E.g. consider the prevalence of a disease.
Frequentists: No. The proportion q of the population that has the disease is not a random phenomenon but a fixed number that is simply unknown.
Example: 95% confidence interval: we wish to find random variables T1 and T2 that satisfy the probabilistic statement P(T1 ≤ q ≤ T2) ≥ 0.95

Interpretation: P(T1 ≤ q ≤ T2) is the probability that the random interval [T1, T2] covers q


Should a parameter be treated as a random variable? E.g. consider the prevalence of a disease.
Bayesians: Yes. The proportion Q of the population that has the disease is unknown, and the distribution of Q is a subjective distribution that expresses the experimenter's (prior) beliefs about Q.
Example: 95% credible interval: we wish to find constants t1 and t2 that satisfy the probabilistic statement P(t1 ≤ Q ≤ t2 | data) ≥ 0.95

Interpretation: P(t1 ≤ Q ≤ t2) is the probability that the parameter Q is in the interval [t1, t2].

7.2 Prior and Posterior Distributions

Prior distribution: The distribution we assign to parameters before observing the random variables. Notation for the prior pdf/pf: we will use p(θ); the book uses ξ(θ).

Likelihood: When the joint pdf/pf f(x|θ) is regarded as a function of θ for given observations x1,..., xn it is called the likelihood function.

Posterior distribution: The conditional distribution of the parameters θ given the observed random variables X1,..., Xn. Notation for the posterior pdf/pf: we will use

p(θ|x1,..., xn) = p(θ|x)

Bayesian Inference

Theorem 7.2.1: Calculating the posterior

Let X1,..., Xn be a random sample with pdf/pf f(x|θ) and let p(θ) be the prior pdf/pf of θ. Then the posterior pdf/pf is

p(θ|x) = f(x1|θ) × ··· × f(xn|θ) p(θ) / g(x)

where

g(x) = ∫Ω f(x1|θ) × ··· × f(xn|θ) p(θ) dθ

is the marginal distribution of X1,..., Xn

Example: Binomial Likelihood and a Beta prior

I take a random sample of 100 people and test them all for a disease. Assume that

Likelihood: X|θ ∼ Binomial(100, θ), where X denotes the number of people with the disease.
Prior: θ ∼ Beta(2, 10).

I observe X = 3 and I want to find the posterior distribution of θ.
Generally: Find the posterior distribution of θ when X|θ ∼ Binomial(n, θ) and θ ∼ Beta(α, β), where n, α and β are known.


Notice how the posterior is more concentrated than the prior. After seeing the data we know more about θ
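The Beta-Binomial update can be verified numerically. The following is a minimal sketch (not from the slides) using SciPy; it assumes the conjugate result that a Beta(α, β) prior combined with x positives out of n Binomial trials gives a Beta(α + x, β + n − x) posterior.

```python
# Minimal sketch of the disease example: prior Beta(2, 10), data X = 3 out of n = 100.
# With a Binomial(n, theta) likelihood and a Beta(alpha, beta) prior,
# the posterior is Beta(alpha + x, beta + n - x).
from scipy import stats

alpha, beta = 2, 10      # prior hyperparameters
n, x = 100, 3            # sample size and observed number of positives

prior = stats.beta(alpha, beta)
posterior = stats.beta(alpha + x, beta + n - x)   # Beta(5, 107)

print("prior mean:     ", prior.mean())       # 2/12  = 0.167
print("posterior mean: ", posterior.mean())   # 5/112 = 0.045
print("prior sd:       ", prior.std())
print("posterior sd:   ", posterior.std())    # much smaller: the posterior is more concentrated
```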

Bayesian Inference

Recall the formula for the posterior distribution:

p(θ|x) = f(x1|θ) × ··· × f(xn|θ) p(θ) / g(x)

where g(x) = ∫Ω f(x|θ) p(θ) dθ is the marginal distribution. g(x) does not depend on θ. We can therefore write

p(θ|x) ∝ f (x|θ)p(θ)

In many cases we can recognize the form of the distribution of θ from f(x|θ)p(θ), eliminating the need to calculate the marginal distribution. Example: the Binomial-Beta case.

Sequential Updates

If our observations are a random sample, we can do Bayesian Analysis sequentially: Each time we use the posterior from the previous step as a prior:

p(θ|x1) ∝ f (x1|θ)p(θ)

p(θ|x1, x2) ∝ f (x2|θ)p(θ|x1)

p(θ|x1, x2, x3) ∝ f(x3|θ)p(θ|x1, x2)
⋮

p(θ|x1,... xn) ∝ f (xn|θ)p(θ|x1,..., xn−1)

For example: Say I test 40 more people for the disease and 2 tested positive. What is the new posterior?
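A small sketch of this sequential update in code, assuming the same Beta(2, 10) prior and the two samples mentioned above (3 of 100, then 2 of 40); the helper update_beta is hypothetical, written here only to illustrate the conjugate update.

```python
# Sequential Bayesian updating with a conjugate Beta prior:
# each posterior is again a Beta and serves as the prior for the next batch.
def update_beta(alpha, beta, x, n):
    """Posterior hyperparameters after x successes in n Binomial trials."""
    return alpha + x, beta + n - x

a, b = 2, 10                      # prior Beta(2, 10)
a, b = update_beta(a, b, 3, 100)  # after the first sample  -> Beta(5, 107)
a, b = update_beta(a, b, 2, 40)   # after the second sample -> Beta(7, 145)
print(a, b)                       # same as a single update with 5 positives out of 140
```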

Prior distributions

The prior distribution should reflect what we know a priori about θ

For example: Beta(2, 10) puts almost all of the density below 0.5 and has mean 2/(2 + 10) ≈ 0.167, saying that a prevalence of more than 50% is very unlikely.

Using Beta(1, 1), i.e. the Uniform(0, 1) indicates that a priori all values between 0 and 1 are equally likely.

Choosing a prior

We need to choose prior distributions carefully

We need a distribution (e.g. Beta) and its hyperparameters (e.g. α, β).

When hyperparameters are difficult to interpret we can sometimes set a mean and a variance and solve for the parameters. E.g.: What Beta prior has mean 0.1 and variance 0.1²?
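One way to do this matching is sketched below; it assumes the standard Beta facts that the mean is α/(α + β) and the variance is m(1 − m)/(α + β + 1), and the function name is made up for illustration.

```python
# Sketch: choose Beta hyperparameters from a target prior mean m and variance v.
# From mean = alpha/(alpha+beta) and variance = m(1-m)/(alpha+beta+1):
#   alpha + beta = m(1-m)/v - 1   (requires v < m(1-m))
def beta_from_mean_var(m, v):
    s = m * (1 - m) / v - 1      # alpha + beta
    return m * s, (1 - m) * s    # alpha, beta

print(beta_from_mean_var(0.1, 0.1**2))   # (0.8, 7.2), i.e. a Beta(0.8, 7.2) prior
```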

If more than one option seems sensible, we perform sensitivity analysis: We compare the posteriors we get when using the different priors.

Sensitivity analysis – Binomial-Beta example

Notice: The posterior mean is always between the prior mean and the observed proportion 0.03.

Effect of sample size and prior variance

The posterior is influenced both by the sample size and the prior variance:
The larger the sample size, the less the prior influences the posterior.
The larger the prior variance, the less the prior influences the posterior.

Example – Normal distribution

Let X1,..., Xn be a random sample from N(θ, σ²) where σ² is known. Let the prior distribution of θ be N(µ0, ν0²) where µ0 and ν0² are known. Show that the posterior distribution p(θ|x) is N(µ1, ν1²) where

µ1 = (σ²µ0 + nν0²x̄n) / (σ² + nν0²)  and  ν1² = σ²ν0² / (σ² + nν0²)

The posterior mean is a linear combination of the prior mean µ0 and the observed sample mean.

What happens when ν0² → ∞? What happens when ν0² → 0? What happens when n → ∞?
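A minimal sketch of this posterior update in code, assuming the formulas above; the data values and hyperparameters below are placeholders, chosen only to illustrate the limiting behaviour asked about.

```python
# Normal likelihood (sigma^2 known) with a Normal prior on theta:
# posterior mean and variance from the formulas on this slide.
import numpy as np

def normal_posterior(x, sigma2, mu0, nu0_2):
    n, xbar = len(x), np.mean(x)
    mu1 = (sigma2 * mu0 + n * nu0_2 * xbar) / (sigma2 + n * nu0_2)
    nu1_2 = (sigma2 * nu0_2) / (sigma2 + n * nu0_2)
    return mu1, nu1_2

x = [4.2, 5.1, 3.8, 4.9, 5.4]                                 # placeholder data
print(normal_posterior(x, sigma2=1.0, mu0=0.0, nu0_2=100.0))  # large prior variance: mean near x-bar
print(normal_posterior(x, sigma2=1.0, mu0=0.0, nu0_2=1e-6))   # tiny prior variance: mean near mu0
```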


7.3 Conjugate Prior Distributions

Def: Conjugate Priors

Let X1, X2,... be a random sample from f (x|θ). A family Ψ of distributions is called a conjugate family of prior distributions if for any prior distribution p(θ) in Ψ the posterior distribution p(θ|x) is also in Ψ

Likelihood                  Conjugate prior for θ
Bernoulli(θ)                The Beta distributions
Poisson(θ)                  The Gamma distributions
N(θ, σ²), σ² known          The Normal distributions
Exponential(θ)              The Gamma distributions

We have already seen the Bernoulli-Beta and Normal-Normal cases.

Updating the prior distribution

Suppose the proportion θ of defective items in a large shipment is unknown. The prior distribution of θ is Beta(α, β). 2 items are selected. What is your updated belief after observing the two items?

Conjugate prior families

The Gamma distributions are a conjugate family for the Poisson(θ) likelihood:

If X1,..., Xn are i.i.d. Poisson(θ) and θ ∼ Gamma(α, β) then the posterior is

Gamma(α + x1 + ··· + xn, β + n)

The Gamma distributions are a conjugate family for the Expo(θ) likelihood:

If X1,..., Xn are i.i.d. Expo(θ) and θ ∼ Gamma(α, β) then the posterior is

Gamma(α + n, β + x1 + ··· + xn)
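Both conjugate updates above reduce to one-line parameter updates. The sketch below assumes the stated results; the Poisson data are placeholders, and the exponential data reuse the Christmas-light lifetimes from Section 7.1.

```python
# Gamma conjugate updates for Poisson and Exponential likelihoods.
def gamma_posterior_poisson(alpha, beta, x):
    # Poisson(theta) likelihood: posterior is Gamma(alpha + sum(x), beta + n)
    return alpha + sum(x), beta + len(x)

def gamma_posterior_expo(alpha, beta, x):
    # Expo(theta) likelihood: posterior is Gamma(alpha + n, beta + sum(x))
    return alpha + len(x), beta + sum(x)

print(gamma_posterior_poisson(2.0, 1.0, [3, 0, 2, 4]))        # Gamma(11.0, 5.0)
print(gamma_posterior_expo(2.0, 1.0, [21, 103, 76, 88, 96]))  # Gamma(7.0, 385.0)
```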

Improper priors

Improper Prior: A “pdf” p(θ) with ∫ p(θ) dθ = ∞.
Used to put more emphasis on the data and downplay the prior.
Used when there is little or no prior information about θ.
Caution: We always need to check that the posterior pdf is proper (integrates to 1)!
Example: Let X1,..., Xn be i.i.d. N(θ, σ²) and p(θ) = 1 for θ ∈ R. Note: here the prior variance is ∞. Then the posterior is N(x̄n, σ²/n).

Chapter 7 – continued: Estimation Sections

7.1 Statistical Inference
Bayesian Methods:
7.2 Prior and Posterior Distributions
7.3 Conjugate Prior Distributions
7.4 Bayes Estimators
Frequentist Methods:
7.5 Maximum Likelihood Estimators
7.6 Properties of Maximum Likelihood Estimators
Skip: p. 434-441 (EM algorithm and Sampling Plans)
7.7 Sufficient Statistics
Skip: 7.8 Jointly Sufficient Statistics
Skip: 7.9 Improving an Estimator

7.4 Bayes Estimators

In principle, Bayesian inference is the posterior distribution. However, people often wish to estimate the unknown parameter θ with a single number.
A statistic: Any function of the observable random variables X1,..., Xn, T = r(X1, X2,..., Xn). Example: the sample mean X̄n is a statistic.

Def: Estimator / Estimate

Suppose our observable data X1,..., Xn is i.i.d. f (x|θ), θ ∈ Ω ⊂ R. Estimator of θ: A real valued function δ(X1,..., Xn)

Estimate of θ: δ(x1,..., xn), i.e. estimator evaluated at the observed values

An estimator is a statistic and a random variable

Bayes Estimator

Def: Loss function: A real valued function L(θ, a) where θ ∈ Ω and a ∈ R.

L(θ, a) = what we lose by using a as an estimate when θ is the true value of the parameter.
Examples: Squared error loss function: L(θ, a) = (θ − a)². Absolute error loss function: L(θ, a) = |θ − a|.


Idea: Choose an estimator δ(X) so that we minimize the expected loss

Def: Bayes Estimator – Minimum expected loss
An estimator δ∗(X) is called the Bayesian estimator of θ if for all possible observations x of X the expected loss is minimized. For given X = x the expected loss is

E(L(θ, a)|x) = ∫Ω L(θ, a) p(θ|x) dθ

Let a∗(x) be the value of a where the minimum is obtained. Then δ∗(x) = a∗(x) is the Bayesian estimate of θ and δ∗(X) is the Bayesian estimator of θ.


For squared error loss: The posterior mean, δ∗(X) = E(θ|X).
min_a E(L(θ, a)|x) = min_a E((θ − a)² | x). The mean of θ|x minimizes this, i.e. the posterior mean.

For absolute error loss: The posterior median.

min_a E(L(θ, a)|x) = min_a E(|θ − a| | x). The median of θ|x minimizes this, i.e. the posterior median.

The posterior mean is the more common estimator because it is often difficult to obtain a closed-form expression for the posterior median.
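For the disease example, both estimates are easy to compute numerically. A small sketch, assuming the Beta(5, 107) posterior derived earlier and using SciPy for the median:

```python
# Bayes estimates of theta under the two loss functions, for posterior Beta(5, 107).
from scipy import stats

posterior = stats.beta(5, 107)
print("posterior mean (squared error loss):   ", posterior.mean())    # about 0.045
print("posterior median (absolute error loss):", posterior.median())  # slightly below the mean
```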

Examples

Normal Bayes estimator, with respect to squared error loss: If X1,..., Xn are N(θ, σ²) and θ ∼ N(µ0, ν0²) then the Bayesian estimator of θ is

δ∗(X) = (σ²µ0 + nν0²X̄n) / (σ² + nν0²)

Binomial Bayes estimator, with respect to squared error loss: If X ∼ Binomial(n, θ) and θ ∼ Beta(α, β) then the Bayesian estimator of θ is

δ∗(X) = (α + X) / (α + β + n)

Bayesian Inference – Pros and cons

Pros: Gives a coherent theory for statistical inference such as estimation. Allows for incorporation of prior scientific knowledge about parameters

Cons: Selecting scientifically meaningful prior distributions (and loss functions) is often difficult, especially in high dimensions.

7.5 Maximum Likelihood Estimators – Frequentist Inference

Likelihood When the joint pdf/pf f (x|θ) is regarded as a function of θ for given observations x1,..., xn it is called the likelihood function.

Maximum likelihood estimator (MLE): For any given observations x we pick the θ ∈ Ω that maximizes f(x|θ).

Given X = x, the maximum likelihood estimate (MLE) will be a function of x. Notation: θ̂ = δ(X).
Potentially confusing notation: sometimes θ̂ is used for both the estimator and the estimate.
Note: The MLE is required to be in the parameter space Ω.
Often it is easier to maximize the log-likelihood L(θ) = log(f(x|θ)).

Examples

Let X ∼ Binomial(n, θ) where n is given. Find the maximum likelihood estimator of θ. Say we observe X = 3, what is the maximum likelihood estimate of θ?

Let X1,..., Xn be i.i.d. N(µ, σ²). Find the MLE of µ when σ² is known. Find the MLE of µ and σ² (both unknown).

Let X1,..., Xn be i.i.d. Uniform[0, θ], where θ > 0. Find θ̂.

Let X1,..., Xn be i.i.d. Uniform[θ, θ + 1]. Find θ̂.
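For the first exercise, the maximization can also be checked numerically. A quick sketch (assuming SciPy) that maximizes the Binomial log-likelihood over a grid and compares it with the closed form x/n:

```python
# Grid search for the Binomial MLE, compared with the closed-form answer x/n.
import numpy as np
from scipy import stats

n, x = 100, 3
thetas = np.linspace(1e-4, 1 - 1e-4, 10_000)
loglik = stats.binom.logpmf(x, n, thetas)              # log-likelihood at each grid point
print("numerical MLE:  ", thetas[np.argmax(loglik)])   # about 0.03
print("closed form x/n:", x / n)
```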

MLE

Intuition: We pick the parameter that makes the observed data most likely

But: The likelihood is not a pdf/pf. If the likelihood of θ2 is larger than the likelihood of θ1, i.e. f(x|θ2) > f(x|θ1), it does NOT mean that θ2 is more likely. Recall: θ is not random here.
Limitations:
Does not always exist.
Not always appropriate: we cannot incorporate “external” (prior) knowledge.
May not be unique.

Chapter 7 – continued: Estimation Sections

7.1 Statistical Inference
Bayesian Methods:
7.2 Prior and Posterior Distributions
7.3 Conjugate Prior Distributions
7.4 Bayes Estimators
Frequentist Methods:
7.5 Maximum Likelihood Estimators
7.6 Properties of Maximum Likelihood Estimators
Skip: p. 434-441 (EM algorithm and Sampling Plans)
Skip: 7.7 Sufficient Statistics
Skip: 7.8 Jointly Sufficient Statistics
Skip: 7.9 Improving an Estimator

7.6 Properties of Maximum Likelihood Estimators

Theorem 7.6.2: MLE’s are invariant If θˆ is the MLE of θ and g(θ) is a function of θ then g(θˆ) is the MLE of g(θ)

Example: Let p̂ be the MLE of a probability parameter, e.g. the p in Binomial(n, p). Then the MLE of the odds p/(1−p) is p̂/(1−p̂).
In general this does not hold for Bayes estimators. E.g. for squared error loss, E(g(θ)|x) ≠ g(E(θ|x)).

Computation

For MLE’s In many practical situations the maximization we need is not available analytically or too cumbersome There exist many numerical optimization methods, Newton’s Method (see definition 7.6.2) is one example. For Bayesian estimators In many practical situations the posterior distribution is not available in closed form This happens if we cannot evaluate the integral for the marginal distribution In stead people either approximate the posterior distribution or take random samples from it, e.g. using (MCMC) methods

Method of Moments (MOM)

Let X1,..., Xn be i.i.d. from f(x|θ) where θ is k-dimensional.
The j-th sample moment is defined as mj = (1/n)(X1^j + ··· + Xn^j).
Method of moments (MOM) estimator: match the theoretical moments and the sample moments and solve for the parameters:

m1 = E(X1|θ), m2 = E(X1²|θ), ..., mk = E(X1^k|θ)

Example:

Let X1,..., Xn be i.i.d. Gamma(α, β). Then

E(X) = α/β  and  E(X²) = α(α + 1)/β²

Find the MOM estimator of α and β
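One possible solution sketch, which assumes the algebra β = m1/(m2 − m1²) and α = m1·β that the exercise asks you to derive, checked on simulated data:

```python
# Method of moments for Gamma(alpha, beta): match m1 = E(X) = alpha/beta and
# m2 = E(X^2) = alpha(alpha+1)/beta^2, then solve for alpha and beta.
import numpy as np

def gamma_mom(x):
    x = np.asarray(x)
    m1, m2 = np.mean(x), np.mean(x ** 2)
    beta = m1 / (m2 - m1 ** 2)     # m2 - m1^2 is the (biased) sample variance
    alpha = m1 * beta
    return alpha, beta

rng = np.random.default_rng(1)
x = rng.gamma(shape=3.0, scale=1 / 2.0, size=500)    # simulated: alpha = 3, beta = 2
print(gamma_mom(x))                                  # roughly (3, 2)
```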

7.7 Sufficient Statistics

A statistic: T = r(X1,..., Xn)

Def: Sufficient Statistics

Let X1,..., Xn be a random sample from f(x|θ) and let T be a statistic. If the conditional distribution of

X1,..., Xn|T = t

does not depend on θ then T is called a sufficient statistic

The idea: It is just as good to have the observed sufficient statistic as it is to have the individual observations X1,..., Xn. We can limit our search for a good estimator to sufficient statistics.


Theorem 7.7.1: Factorization Criterion

Let X1,..., Xn be a random sample from f(x|θ) where θ ∈ Ω is unknown. A statistic T = r(X1,..., Xn) is a sufficient statistic for θ if and only if, for all x ∈ R^n and all θ ∈ Ω, the joint pdf/pf fn(x|θ) can be factored as fn(x|θ) = u(x)v(r(x), θ), where the functions u and v are nonnegative.

The function u may depend on x but not on θ.
The function v depends on θ but depends on x only through the value of the statistic r(x).

Both MLEs and Bayesian estimators depend on data only through sufficient statistics.
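A standard worked example (not on these slides) may help: let X1,..., Xn be i.i.d. Poisson(θ). Then fn(x|θ) = ∏ e^(−θ) θ^(xi) / xi! = [1/(x1!···xn!)] · [e^(−nθ) θ^(x1+···+xn)]. Taking u(x) = 1/(x1!···xn!) and v(t, θ) = e^(−nθ) θ^t with t = r(x) = x1 + ··· + xn, the factorization criterion shows that T = X1 + ··· + Xn is a sufficient statistic for θ.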


END OF CHAPTER 7
