
Introduction to Bayesian Estimation

March 2, 2016

The Plan

1. What is Bayesian estimation? How is it different from frequentist methods?
2. The 4 Bayesian principles
3. Overview of main concepts in Bayesian analysis

Main reading: Ch. 1 in Gary Koop’s Bayesian Econometrics

Additional material: Christian P. Robert’s The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation; Gelman, Carlin, Stern, Dunson, Vehtari and Rubin’s Bayesian Data Analysis

Probability and statistics; What’s the difference?

Probability is a branch of mathematics

I There is little disagreement about whether the theorems follow from the axioms

Statistics is an inversion problem: What is a good probabilistic description of the world, given the observed outcomes?

I There is some disagreement about how we interpret data/observations and how we make inference about unobservable parameters

Why probabilistic models?

Is the world characterized by randomness?

I Is the weather random?

I Is a coin flip random?

I ECB interest rates?

It is difficult to say with certainty whether something is “truly” random.

Two schools of statistics

What is the meaning of probability, randomness and uncertainty?

Two main schools of thought:

I The classical (or frequentist) view is that probability corresponds to the frequency of occurrence in repeated experiments

I The Bayesian view is that probabilities are statements about our state of knowledge, i.e. a subjective view.

The difference has implications for how we interpret estimated statistical models and there is no general agreement about which approach is “better”.

Frequentist vs Bayesian

Variance of the estimator vs distribution of the parameter

I Bayesians think of parameters as having distributions while frequentists conduct inference by thinking about the variance of an estimator

Frequentist confidence intervals:

I Conditional on the null hypothesis and repeated draws from the population of equal sample length, what is the interval that θ̂ lies in 95 per cent of the time?

Bayesian probability intervals:

I Conditional on the observed data, a prior distribution of θ and a functional form (i.e. model), what is the shortest interval that with 95% probability contains θ?

Frequentist vs Bayesian Coin Flippers

After flipping a coin 10 times and counting the number of heads, what is the probability that the next flip will come up heads?

I A Bayesian with a uniform prior would say that the probability of the next flip coming up heads is x/10, where x is the number of times the flip came up heads in the observed sample
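The Bayesian answer can be sketched numerically: with a uniform prior the posterior over θ is proportional to the likelihood θ^x (1 − θ)^(10−x), whose mode is x/10. A minimal sketch (the sample of x = 7 heads is a made-up illustration):

```python
# Posterior for the heads probability theta after x heads in n flips,
# under a uniform prior: p(theta | x) is proportional to theta^x (1 - theta)^(n - x).
n, x = 10, 7  # hypothetical sample: 7 heads in 10 flips

# Evaluate the unnormalized posterior on a grid of theta values.
grid = [i / 1000 for i in range(1001)]
post = [t**x * (1 - t)**(n - x) for t in grid]

# With a flat prior the posterior mode equals the sample fraction x/n.
mode = grid[post.index(max(post))]
print(mode)  # -> 0.7
```
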

I A frequentist cannot answer this question: A frequentist p-value describes the probability of observing a more extreme outcome (i.e. more than x heads) when sampling repeatedly from the distribution implied by the null hypothesis.

Bayesian statistics

“Bayesianism has obviously come a long way. It used to be that you could tell a Bayesian by his tendency to hold meetings in isolated parts of Spain and his obsession with coherence, self-interrogations, and other manifestations of paranoia. Things have changed...” Peter Clifford, 1993

Quote arrived here via Jesus Fernandez-Villaverde (U Penn)

Bayesian statistics

Bayesians used to be fringe types, but are now more or less the mainstream in macro

I This is largely due to increased computing power

Bayesian methods have several advantages:

I Facilitates incorporating information from outside of sample

I Easy to compute confidence/probability intervals of functions of parameters

I Good small sample properties

The subjective view of probability

What does subjective mean?

I Coin flip: What is prob(heads) after the flip but before looking?

Important: Probabilities can be viewed as statements about our knowledge of a parameter, even if the parameter is a constant

I E.g. treating risk aversion as a random variable does not imply that we think of it as varying across time. Subjective does not mean that anything goes!

The subjective view of probability

Is a subjective view of probability useful?

I Easy to incorporate information from outside the sample

I One can use “objective”, or non-informative priors to isolate the sample information

I Explicitly stating priors is a transparent way to report the input from the researcher

4 Principles of Bayesian Statistics

1. The Likelihood Principle
2. The Sufficiency Principle
3. The Conditionality Principle
4. The Stopping Rule Principle

These are principles that appeal to “common sense”.

The Likelihood Principle

All the information about a parameter from an experiment is contained in the likelihood function.

The Likelihood Principle

The likelihood principle can be derived from two other principles:

I The Sufficiency Principle

I If two samples imply the same sufficient statistic (given a model that admits such a thing), then they should lead to the same inference about θ

I Example: Two samples from a normal distribution with the same mean and variance should lead to the same inference about the parameters µ and σ²

I The Conditionality Principle

I If two possible experiments on the parameter θ are available, and one of the experiments is chosen with probability p, then the inference on θ should only depend on the chosen experiment

The Stopping Rule Principle

The Stopping Rule Principle is an implication of the Likelihood Principle

If a sequence of experiments, E1, E2, ..., is directed by a stopping rule, τ, which indicates when the experiment should stop, inference about θ should depend on τ only through the resulting sample.

The Stopping Rule Principle: Example

In an experiment to find out the fraction of the population that watched a TV show, a researcher found 9 viewers and 3 non-viewers. Two possible models:

1. The researcher interviewed 12 people and thus observed x ∼ B(12, θ)
2. The researcher questioned N people until he obtained 3 non-viewers, with N ∼ Neg(3, 1 − θ)

The likelihood principle implies that the inference about θ should be identical whichever model was used. A frequentist would draw different conclusions depending on whether he thought (1) or (2) generated the experiment. In fact, H0 : θ > 0.5 would be rejected if the experiment is (2) but not if it is (1).

The Stopping Rule Principle
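The example above can be checked numerically: the two likelihoods are proportional in θ, so Bayesian inference coincides, while the one-sided p-values under θ = 0.5 differ across the two sampling schemes. A sketch:

```python
from math import comb

theta0 = 0.5  # null value for the viewer fraction

# Scheme 1: n = 12 interviews fixed, x = 9 viewers observed, x ~ B(12, theta).
# p-value: probability of 9 or more viewers out of 12 under the null.
p_binom = sum(comb(12, k) * theta0**k * (1 - theta0)**(12 - k) for k in range(9, 13))

# Scheme 2: interview until 3 non-viewers, stopping at N = 12, N ~ Neg(3, 1 - theta).
# p-value: probability of needing 12 or more interviews under the null,
# i.e. at most 2 non-viewers among the first 11 interviews.
p_negbin = sum(comb(11, k) * theta0**11 for k in range(0, 3))

print(round(p_binom, 4), round(p_negbin, 4))  # -> 0.073 0.0327

def lik_ratio(t):
    # Binomial vs negative binomial likelihood of the same data: both are
    # const * theta^9 (1 - theta)^3, so their ratio is constant in theta
    # and the posterior for theta is identical under either scheme.
    return (comb(12, 9) * t**9 * (1 - t)**3) / (comb(11, 2) * t**9 * (1 - t)**3)
```

So under the null, scheme (2) gives a p-value below 0.05 while scheme (1) does not, exactly as the slide states, even though the Bayesian posterior is the same in both cases.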

“I learned the stopping rule principle from Professor Barnard, in conversation in the summer of 1952. Frankly, I thought it a scandal that anyone in the profession could advance an idea so patently wrong, even as today I can scarcely believe that some people resist an idea so patently right” Savage (1962)

Quote (again) arrived here via Jesus Fernandez-Villaverde (U Penn)

The main components in Bayes Rule

Deriving Bayes’ Rule

p (A, B) = p(A | B)p(B)

Symmetrically

p(A, B) = p(B | A)p(A)

Implying

p(A | B)p(B) = p(B | A)p(A)

or

p(A | B) = p(B | A)p(A) / p(B)

which is known as Bayes’ Rule.

Data and parameters
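As a quick sanity check of the rule, a discrete toy example (all numbers hypothetical): suppose the unknown takes one of two values and we observe a single Bernoulli success.

```python
# Bayes' rule with a two-point parameter: theta in {0.3, 0.7},
# prior probability 0.5 on each, and one observed success y = 1.
prior = {0.3: 0.5, 0.7: 0.5}
likelihood = {t: t for t in prior}   # p(y = 1 | theta) = theta

# p(y) = sum over theta of p(y | theta) p(theta)
p_y = sum(likelihood[t] * prior[t] for t in prior)

# p(theta | y) = p(y | theta) p(theta) / p(y)
posterior = {t: likelihood[t] * prior[t] / p_y for t in prior}
print(posterior)  # posterior puts more weight on theta = 0.7 after a success
```
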

The purpose of Bayesian analysis is to use the data y to learn about the “parameters” θ

I Parameters of a model

I Or anything not directly observed

Replace A and B in Bayes’ rule with θ and y to get

p(θ | y) = p(y | θ)p(θ) / p(y)

The probability density p(θ | y) then describes what we know about θ, given the data.

Parameters as random variables

Since p(θ | y) is a density function, we thus treat θ as a random variable with a distribution. This is the subjective view of probability

I Express subjective probabilities through the prior

I More importantly, the randomness of a parameter is subjective in the sense of depending on what information is available to the person making the statement.

The density p(θ | y) is proportional to the prior times the likelihood function

p(θ | y) ∝ p(y | θ) × p(θ)

(posterior ∝ likelihood function × prior)

But what is the prior and the likelihood function?

The prior density p(θ)

Sometimes we know more about the parameters than what the data tells us, i.e. we have some prior information. The prior density p(θ) summarizes what we know about the parameters θ before observing the data y.

I For a DSGE model, we may have information about ”deep” parameters

I The sign of some parameters may be restricted by theory, e.g. risk aversion should be positive
I The discount rate is the inverse of average real interest rates
I Price stickiness can be measured by surveys

I We may know something about the mean of a process

Choosing priors

How influential the priors will be for the posterior is a choice:

I It is always possible to choose priors such that a given result is achieved no matter what the sample information is (i.e. dogmatic priors)

I It is also possible to choose priors such that they do not influence the posterior (i.e. so-called non-informative priors)

Important: Priors are a choice and must be motivated.

Choosing prior distributions

The beta distribution is a good choice when a parameter is in [0, 1]:

P(x) = x^(a−1) (1 − x)^(b−1) / B(a, b)

where

B(a, b) = (a − 1)!(b − 1)! / (a + b − 1)!

Easier to parameterize using expressions for the mean, mode and variance:

µ = a / (a + b),  x̂ = (a − 1) / (a + b − 2)

σ² = ab / [ (a + b)² (a + b + 1) ]

Examples of beta distributions holding mean fixed

[Figure: beta densities with the mean held fixed]

Examples of beta distributions holding s.d. fixed

[Figure: beta densities with the standard deviation held fixed]

Choosing prior distributions
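The beta mean and variance expressions can be inverted to translate a desired prior mean and standard deviation into (a, b); a minimal sketch (the function name is mine):

```python
def beta_params(mu, sigma):
    """Return (a, b) of a beta density with mean mu and s.d. sigma.

    Inverts mu = a/(a+b) and sigma^2 = ab/((a+b)^2 (a+b+1)):
    with nu = a + b, sigma^2 = mu (1 - mu) / (nu + 1).
    """
    nu = mu * (1 - mu) / sigma**2 - 1
    if nu <= 0:
        raise ValueError("sigma is too large for a beta density with this mean")
    return mu * nu, (1 - mu) * nu

a, b = beta_params(0.5, 0.1)
print(round(a, 6), round(b, 6))  # -> 12.0 12.0
```
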

The inverse gamma distribution is a good choice when a parameter is positive:

P(x) = [ bᵃ / Γ(a) ] (1/x)^(a+1) exp(−b/x)

where Γ(a) = (a − 1)!

Again, easier to parameterize using expressions for the mean, mode and variance:

µ = b / (a − 1); a > 1,  x̂ = b / (a + 1)

σ² = b² / [ (a − 1)² (a − 2) ]; a > 2

Examples of inverse gamma distributions
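As in the beta case, the mean and variance expressions can be inverted: from µ = b/(a − 1) and σ² = b²/((a − 1)²(a − 2)) one gets a = µ²/σ² + 2 and b = µ(a − 1). A sketch (the function name is mine):

```python
def inv_gamma_params(mu, sigma):
    """Return (a, b) of an inverse gamma density with mean mu and s.d. sigma."""
    a = (mu / sigma)**2 + 2   # from sigma^2 / mu^2 = 1/(a - 2)
    b = mu * (a - 1)          # from mu = b/(a - 1)
    return a, b

a, b = inv_gamma_params(1.0, 1.0)
print(a, b)  # -> 3.0 2.0
```
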

[Figure: examples of inverse gamma densities]

Conjugate Priors

Conjugate priors are particularly convenient:

I Combining distributions that are members of a conjugate family results in a new distribution that is a member of the same family

Useful, but only insofar as the priors are chosen to actually reflect prior beliefs rather than just for analytical convenience.

Improper Priors

Improper priors are priors that are not probability density functions in the sense that they do not integrate to 1.

I Can still be used as a form of uninformative priors for some types of analysis

I The uniform distribution U(−∞, ∞) is popular

I The mode of the posterior then coincides with the MLE.

The likelihood function

The likelihood function is the probability density of observing the data y conditional on a statistical model and the parameters θ.

I For instance, if the data y conditional on θ are normally distributed we have that

p(y | θ) = (2π)^(−T/2) |Ω|^(−1/2) exp( −(1/2) (y − µ)′ Ω^(−1) (y − µ) )

where Ω and µ are either elements or functions of θ.

The posterior density p(θ | y)
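For the scalar iid case (µ a scalar, Ω = σ²I) the log of the likelihood above reduces to −(T/2) log 2π − (T/2) log σ² − Σ(y_t − µ)²/(2σ²), which is simple to evaluate; a sketch:

```python
from math import log, pi

def normal_loglik(y, mu, sigma2):
    """Log-likelihood of an iid sample y from N(mu, sigma2):
    -T/2 log(2 pi) - T/2 log(sigma2) - sum((y_t - mu)^2) / (2 sigma2)."""
    T = len(y)
    ssr = sum((yt - mu)**2 for yt in y)
    return -0.5 * T * log(2 * pi) - 0.5 * T * log(sigma2) - ssr / (2 * sigma2)

print(round(normal_loglik([1.0, 2.0], 0.0, 1.0), 4))  # -> -4.3379
```
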

The posterior density is often the object of fundamental interest in Bayesian estimation.

I The posterior is the “result”.

I We are interested in the entire conditional distribution of the parameters θ

Model comparison

I We are interested in the entire conditional distribution of the parameters θ Model comparison

Bayesian statistics is not concerned with “rejecting” models, but with assigning probabilities to different models

I We can index different models by i = 1, 2, ...m

p(θ | y, Mi) = p(y | θ, Mi) p(θ | Mi) / p(y | Mi)

Model comparison

The posterior model probability is given by

p(Mi | y) = p(y | Mi) p(Mi) / p(y)

where p(y | Mi) is called the marginal likelihood. It can be computed from

∫ p(θ | y, Mi) dθ = ∫ [ p(y | θ, Mi) p(θ | Mi) / p(y | Mi) ] dθ

by using that ∫ p(θ | y, Mi) dθ = 1, so that

p(y | Mi) = ∫ p(y | θ, Mi) p(θ | Mi) dθ

The posterior odds ratio

The posterior odds ratio is the relative probability of two models conditional on the data:

p(Mi | y) / p(Mj | y) = [ p(y | Mi) p(Mi) ] / [ p(y | Mj) p(Mj) ]

It is made up of the prior odds ratio p(Mi) / p(Mj) and the Bayes factor p(y | Mi) / p(y | Mj).

Concepts of Bayesian analysis
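The model-comparison quantities above can be approximated on a grid for a toy coin-flip comparison (both models and the data are hypothetical): M1 puts a uniform prior on θ, M2 dogmatically sets θ = 0.5, and each marginal likelihood integrates the binomial likelihood against its prior.

```python
from math import comb

x, n = 9, 12   # hypothetical data: 9 heads in 12 flips

def lik(t):
    # binomial likelihood p(y | theta)
    return comb(n, x) * t**x * (1 - t)**(n - x)

# M1: uniform prior on theta -> p(y | M1) = integral of lik(t) dt (midpoint sum)
grid = [(i + 0.5) / 1000 for i in range(1000)]
ml_1 = sum(lik(t) for t in grid) / 1000

# M2: dogmatic prior theta = 0.5 -> p(y | M2) = lik(0.5)
ml_2 = lik(0.5)

# With equal prior model probabilities the posterior odds equal the Bayes factor.
print(round(ml_1 / ml_2, 2))  # -> 1.43
```
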

Prior densities, likelihood functions, posterior densities, posterior odds ratios, Bayes factors etc. are part of almost all Bayesian analysis.

I Data, choice of prior densities and the likelihood function are the inputs into the analysis

I Posterior densities etc. are the outputs

The rest of the course is about how to choose these inputs and how to compute the outputs.

Bayesian computation

More specifics about the outputs:

I Posterior mean E(θ | y) = ∫ θ p(θ | y) dθ
I Posterior variance var(θ | y) = E(θ² | y) − [E(θ | y)]²

I Posterior prob(θi > 0)

These objects can all be written in the form

E(g(θ) | y) = ∫ g(θ) p(θ | y) dθ

where g(θ) is the function of interest.

Posterior simulation

There are only a few cases when the expected value of functions of interest can be derived analytically.

I Instead, we rely on posterior simulation and Monte Carlo integration.

I Posterior simulation consists of constructing a sample from the posterior distribution p(θ | y)

Monte Carlo integration then uses that

ĝ_S = (1/S) Σ_{s=1}^{S} g(θ^(s))

and that lim_{S→∞} ĝ_S = E(g(θ) | y), where θ^(s) is a draw from the posterior distribution.

The end product of Bayesian statistics
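A quick illustration of Monte Carlo integration, using a posterior we can already sample from directly (a Beta(10, 4) density, chosen arbitrarily as a stand-in for p(θ | y)): the sample average of g(θ^(s)) approximates E(g(θ) | y).

```python
import random

random.seed(0)
S = 200_000

# Draws from a Beta(10, 4) "posterior" (stand-in for p(theta | y)).
draws = [random.betavariate(10, 4) for _ in range(S)]

# Monte Carlo estimates of E(theta | y) and prob(theta > 0.5 | y).
post_mean = sum(draws) / S                       # exact value: 10/14, about 0.714
prob_gt_half = sum(d > 0.5 for d in draws) / S
print(round(post_mean, 3), round(prob_gt_half, 3))
```
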

Most of Bayesian econometrics consists of simulating distributions of parameters using numerical methods.

I A simulated posterior is a numerical approximation to the distribution p(y | θ)p(θ)

I This is useful since the distribution p(y | θ)p(θ) (by Bayes’ rule) is proportional to p(θ | y)

I We rely on ergodicity, i.e. that the moments of the constructed sample correspond to the moments of the distribution p(θ | y)

The most popular procedure to simulate the posterior is called the Random Walk Metropolis Algorithm

Sampling from a distribution
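A minimal sketch of the Random Walk Metropolis algorithm, here targeting the standard normal so the output can be checked against known moments (step size and chain length are arbitrary choices; real applications replace `log_target` with the log posterior):

```python
import math
import random

random.seed(1)

def log_target(x):
    """Log of the (unnormalized) target density: standard normal."""
    return -0.5 * x * x

def rw_metropolis(n_draws, step=1.0, x0=0.0):
    """Random Walk Metropolis: propose x + N(0, step^2), accept with
    probability min(1, target(proposal) / target(current))."""
    x, out = x0, []
    for _ in range(n_draws):
        prop = x + random.gauss(0.0, step)
        if math.log(random.random()) < log_target(prop) - log_target(x):
            x = prop          # accept the proposal
        out.append(x)         # on rejection, keep (repeat) the current x
    return out

chain = rw_metropolis(50_000)
mean = sum(chain) / len(chain)
var = sum((c - mean)**2 for c in chain) / len(chain)
print(round(mean, 2), round(var, 2))  # should be close to 0 and 1
```
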

Example:

I Standard Normal distribution N(0, 1)

If the distribution is ergodic, the sample shares all the properties of the true distribution asymptotically

I Why don’t we just compute the moments directly?

I When we can, we should (as in the Normal distribution’s case)
I Not always possible, either for tractability reasons or because of the computational burden

Sample from Standard Normal

[Figures: trace plots and histograms of draws from N(0, 1) for increasing sample sizes]

What is a Markov Chain?

A process for which the distribution of next-period variables is independent of the past, once we condition on the current state

I A Markov chain is a stochastic process with the Markov property

I “Markov chain” is sometimes taken to mean only processes with a countable number of states

I Here, the term will be used in the broader sense

Markov chain Monte Carlo (MCMC) methods provide a way of generating samples that share the properties of the target density (i.e. the object of interest)

What is a Markov Chain?

Markov property

p(θj+1 | θj ) = p(θj+1 | θj , θj−1, θj−2, ..., θ1)

Transition density function p(θj+1 | θj ) describes the distribution of θj+1 conditional on θj .

I In most applications, we know the conditional transition density and can figure out unconditional properties like E(θ) and E(θ²)

I MCMC methods can be used to do the opposite: Determine a particular conditional transition density such that the unconditional distribution converges to that of the target distribution.

Summing up

I The subjective view of probability

I The Bayesian Principles

I Bayesians condition on the data

I Priors and likelihood functions are chosen by the researcher and must be motivated

I The posterior density is the main object of interest

I We often need to use numerical methods to approximate the posterior density

Next time: Regression model with natural conjugate priors

That’s it for today.