
Faculty of Life Sciences

Frequentist and Bayesian statistics

Claus Ekstrøm E-mail: [email protected]

Outline

1 Frequentists and Bayesians
  • What is a probability?
  • Interpretation of results / inference
2 Comparisons
3 Markov chain Monte Carlo

Slide 2— PhD (Aug 23rd 2011) — Frequentist and Bayesian statistics

What is a probability? Two schools in statistics: frequentists and Bayesians.

Slide 3— PhD (Aug 23rd 2011) — Frequentist and Bayesian statistics

Frequentist school

School of Jerzy Neyman, Egon Pearson and Ronald Fisher.

Slide 4— PhD (Aug 23rd 2011) — Frequentist and Bayesian statistics

Bayesian school

“School” of Thomas Bayes

P(H|D) = P(D|H) · P(H) / ∫ P(D|H) · P(H) dH

Slide 5— PhD (Aug 23rd 2011) — Frequentist and Bayesian statistics

Frequentists

Frequentists talk about probabilities in relation to experiments with a random component. The relative frequency of an event A is defined as

P(A) = (number of outcomes consistent with A) / (number of experiments)

The probability of the event A is the limiting relative frequency.

[Figure: relative frequency of A plotted against the number of experiments n, settling down as n grows]
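A minimal sketch of the limiting relative frequency idea (the coin-toss setup is an assumption, not from the slides):

set.seed(1)
n <- 100
tosses <- rbinom(n, size = 1, prob = 0.5)       # fair coin tosses
rel.freq <- cumsum(tosses) / seq_len(n)         # running relative frequency of heads
plot(seq_len(n), rel.freq, type = "l", ylim = c(0, 1),
     xlab = "n", ylab = "Relative frequency")
abline(h = 0.5, lty = 2)                        # the limiting value P(heads) = 0.5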

Slide 6— PhD (Aug 23rd 2011) — Frequentist and Bayesian statistics

Frequentists — 2

The definition restricts the things we can assign probabilities to: What is the probability of there being life on Mars 100 billion years ago? We assume that there is an unknown but fixed underlying parameter, θ, for a population (e.g., the mean height of Danish men). Random variation (environmental factors, measurement errors, ...) means that each observation does not result in the true value.

Slide 7— PhD (Aug 23rd 2011) — Frequentist and Bayesian statistics

The meta-experiment idea Frequentists think of meta-experiments and consider the current dataset as a single realization from all possible datasets.

167.2 cm 175.5 cm 187.7 cm 182.0 cm

Slide 8— PhD (Aug 23rd 2011) — Frequentist and Bayesian statistics

Confidence intervals

Thus a frequentist believes that a population mean is real, but unknown and unknowable, and can only be estimated from the data. Knowing the distribution of the sample mean, he constructs a confidence interval, centered at the sample mean.
• Either the true mean is in the interval or it is not. We cannot say there is a 95% probability (a long-run fraction of intervals having this characteristic) that the true mean is in this particular interval, because it is either already in it, or it is not.
• Reason: the true mean is a fixed value, which does not have a distribution.
• The sample mean does have a distribution! We must therefore use statements like: "95% of similar intervals would contain the true mean, if each interval were constructed from a different random sample like this one."

Slide 9— PhD (Aug 23rd 2011) — Frequentist and Bayesian statistics

Maximum likelihood

How will the frequentist estimate the parameter? Answer: maximum likelihood.

Basic idea
Our best estimate of the parameter(s) is the one(s) that makes our observed data most likely. We know what we have observed so far (our data). Our best "guess" is therefore to select the parameters that make our observations most likely.

Binomial distribution:

P(Y = y) = (n choose y) p^y (1 − p)^(n−y)
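A minimal sketch of the maximum likelihood idea for this binomial model (the counts y = 7, n = 10 are hypothetical, not from the slides):

y <- 7; n <- 10
loglik <- function(p) dbinom(y, size = n, prob = p, log = TRUE)   # binomial log-likelihood
fit <- optimize(loglik, interval = c(0, 1), maximum = TRUE)       # maximize over p
fit$maximum        # close to the analytic MLE y/n = 0.7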

Slide 10— PhD (Aug 23rd 2011) — Frequentist and Bayesian statistics

Bayesians

Each investigator is entitled to his/her personal belief ... the prior information. There are no fixed values for parameters, but a distribution. Thumb tack pin pointing down: all distributions are subjective, and yours is as good as mine. We can still talk about the mean, but it is the mean of my distribution. In many cases one tries to circumvent the subjectivity by using vague priors.

[Figure: a subjective prior distribution for Theta on (0, 1)]

Slide 11— PhD (Aug 23rd 2011) — Frequentist and Bayesian statistics

Credibility intervals

Bayesians have an altogether different world-view. They say that only the data are real. The population mean is an abstraction, and as such some values are more believable than others based on the data and their prior beliefs. The Bayesian constructs a credibility interval, centered near the sample mean, but tempered by "prior" beliefs concerning the mean. Now the Bayesian can say what the frequentist cannot: "There is a 95% probability (degree of believability) that this interval contains the mean."

Slide 12— PhD (Aug 23rd 2011) — Frequentist and Bayesian statistics

Comparison

             Advantages                                       Disadvantages
Frequentist  Objective; calculations                          Confidence intervals (not quite the desired)
Bayesian     Credibility intervals (usually the desired);     Subjective; calculations
             complex models

Slide 13— PhD (Aug 23rd 2011) — Frequentist and Bayesian statistics

In summary

• A frequentist is a person whose long-run ambition is to be wrong 5% of the time.
• A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.

A frequentist uses impeccable logic to answer the wrong question, while a Bayesian answers the right question by making assumptions that nobody can fully believe in. (P. G. Hamer)

Slide 14— PhD (Aug 23rd 2011) — Frequentist and Bayesian statistics

Jury duty

Slide 15— PhD (Aug 23rd 2011) — Frequentist and Bayesian statistics

Example: speed of light

What is the speed of light in vacuum “really”? Results (m/s) 299792459.2 299792460.0 299792456.3 299792458.1 299792459.5

Slide 16— PhD (Aug 23rd 2011) — Frequentist and Bayesian statistics

Example: the frequentist solution

The average of our observations is an estimate of the true, fixed (but unknown) speed of light, θ̂ = 299792458.6. Conclusion: If we were to repeat this sequence of 5 measurements a large number of times, approximately 95% of my estimators would be within 1.83 m/s of the true speed of light. However, on this particular occasion, where I have already calculated my statistic, I have no clue how close I actually am to the true value, but I feel comfortable that I am doing okay because of certain properties that my estimator has on repeated use.
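A minimal R sketch of the interval behind the 1.83 m/s statement, using the five measurements above (the one-sample t interval is an assumption about how the slide's number was obtained):

y <- c(299792459.2, 299792460.0, 299792456.3, 299792458.1, 299792459.5)
mean(y)                 # 299792458.6, the estimate on the slide
t.test(y)$conf.int      # 95% confidence interval, roughly mean(y) +/- 1.83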

Slide 17— PhD (Aug 23rd 2011) — Frequentist and Bayesian statistics

Example: Bayesian solution

The observations are a fixed realization from the underlying distribution of the true speed of light.
1. "Guess" what the distribution of the speed of light is (the prior distribution).
2. Use Bayes' theorem to modify/update the prior distribution based on the observed data.
3. The modified distribution is called the posterior distribution.
The posterior distribution holds the information about the true speed of light, and this distribution is entirely subjective.

Slide 18— PhD (Aug 23rd 2011) — Frequentist and Bayesian statistics

Markov Chain Monte Carlo

Having a likelihood does not necessarily make it easy to work with. In Bayesian statistics the posterior distribution contains all relevant information about the parameters. Statistical inference is often calculated from summaries (integrals)

J = ∫ L(x) dx

However, these evaluations are not necessarily easy.

Slide 19— PhD (Aug 23rd 2011) — Frequentist and Bayesian statistics

Bayesian modelling, Markov Chain Monte Carlo, Graphical Models

Søren Højsgaard Department of Mathematical Sciences Aalborg University, Denmark August 13, 2012

Contents

1 Bayesian modeling

2 Inference

3 Bayesian models based on DAGs
  3.1 Example: Independent samples
  3.2 Example: Linear regression
  3.3 Example: Random regression model

4 Computations using Monte Carlo methods
  4.1 Rejection sampling
  4.2 Example: Rejection sampling
  4.3 Sampling importance resampling (SIR)*
  4.4 Markov Chain Monte Carlo methods
  4.5 The Metropolis–Hastings algorithm
  4.6 Special cases
  4.7 Example: Metropolis–Hastings algorithm
  4.8 Single component Metropolis–Hastings
  4.9 Gibbs sampler*
  4.10 Sampling in high dimensions – problems

5 Conditional independence

1 Bayesian modeling

• In a Bayesian setting, parameters are treated as random quantities on equal footing with the random variables.

• The joint distribution of a parameter (vector) θ and data (vector) y is specified through a prior distribution π(θ) for θ and a conditional distribution p(y | θ) of data for a fixed value of θ.

• This leads to the joint distribution p(y, θ) = p(y | θ)π(θ)

• The prior distribution π(θ) represents our knowledge (or uncertainty) about θ before data have been observed.

• After observing data y, the posterior distribution π∗(θ) of θ is obtained by conditioning on data, which gives

  π∗(θ) = p(θ|y) = p(y|θ)π(θ) / p(y) ∝ L(θ)π(θ)

  where L(θ) = p(y | θ) is the likelihood and the marginal density p(y) = ∫ p(y | θ)π(θ)dθ is the normalizing constant.

2 Inference

• With respect to inference, we might be interested in the posterior mean of some function g(θ):

  E(g(θ)|π∗) = ∫ g(θ)π∗(θ)dθ

• However, usually π∗(θ) can not be found analytically because the normalizing constant p(y) = ∫ p(y | θ)π(θ)dθ is intractable.

• In such cases one will often resort to sampling based methods: If we can draw samples θ(1), . . . , θ(N) from π∗(θ) we can do just as well:

  E(g(θ)|π∗) ≈ (1/N) Σ_i g(θ(i))

  The question is then how to draw samples from π∗(θ) where π∗(θ) is only known up to the normalizing constant.

• In some cases simple Monte Carlo methods will do (e.g. rejection sampling). • Most often, however, we have to use Markov Chain Monte Carlo Methods (e.g. the Metropolis–Hastings algorithm)
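To make the sample-average approximation concrete, a minimal sketch (the Beta posterior and the choice g(θ) = θ² are assumptions, not from the notes):

set.seed(1)
N <- 100000
theta <- rbeta(N, 3, 5)        # pretend these are draws from the posterior pi*(theta)
mean(theta^2)                  # Monte Carlo estimate of E(g(theta) | y) with g(theta) = theta^2
3 * 4 / (8 * 9)                # exact value for Beta(3, 5): a(a+1)/((a+b)(a+b+1)) = 1/6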

3 Bayesian models based on DAGs

The graph is a directed acyclic graph (DAG) where the nodes represent random quantities.


Figure 1: Directed acyclic graph. Nodes represent random quantities.

A joint distribution for x = (xA, xS, xT , xL, xB, xE, xX , xD) can be specified as a product of conditional distributions,

p(x) = p(xA)p(xT|xA)p(xS)p(xL|xS)p(xB|xS)p(xE|xT, xL)p(xX|xE)p(xD|xE, xB)

• Notice: The specification has the form

  p(x) = Π_v p(xv | xpa(v))

  where pa(v) denotes the parents of v in the directed acyclic graph.

• Hence, we define a complex multivariate distribution by multiplying conditional uni- variate densities.

• Notice also that we use x here as a generic symbol for a random quantity rather than using y to represent data and θ to represent parameters.

• This makes sense in a Bayesian setting where there is no conceptual difference between parameters and data.

3.1 Example: Independent samples

Joint distribution:

p(x1, . . . , x5, θ) = π(θ) Π_i p(xi | θ)

For example we may have: xi | θ ∼ N(θ, 1) and θ ∼ N(0, 1).
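A minimal R sketch of the exact posterior in this model (the simulated data are an assumption); for xi | θ ∼ N(θ, 1) and θ ∼ N(0, 1), the posterior is θ | x ∼ N(n x̄/(n + 1), 1/(n + 1)):

set.seed(1)
x <- rnorm(5, mean = 2, sd = 1)      # 5 hypothetical observations
n <- length(x)
post.mean <- n * mean(x) / (n + 1)   # conjugate normal-normal update
post.var  <- 1 / (n + 1)
c(post.mean, post.var)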

Figure 2: Left: Representation of a Bayesian model for simple sampling. The picture to the right represents the same, but the plate allows a more compact representation.

Figure 3: Graphical representations of a traditional linear regression model with unknown intercept α, slope β, and variance σ². In the representation to the left, the means µi have been represented explicitly.

3.2 Example: Linear regression

Regression model:

Yi ∼ N(µi, σ²),  µi = α + βxi,  i = 1, . . . , N

In this example, the parameters are θ = (α, β, σ). To complete the model specification we therefore need to specify a prior π(θ).

The xi’s are just explanatory variables and the µi’s are deterministic functions of their parents.

3.3 Example: Random regression model

Weights have been measured weekly for 30 rats over 5 weeks. Observations yij are weights of rat i at age xj. Random regression model:

yij ∼ N(αi + βi(xj − x̄), σc²)

Figure 4: Graphical representation of a random coefficient regression model for the growth of rats.

and

αi ∼ N(αc, σα²),  βi ∼ N(βc, σβ²)

4 Computations using Monte Carlo methods

Consider a random variable (vector) X with density / probability mass function p(x). We shall call p(x) the target distribution (from which we want to sample). In many real world applications

• we can not directly draw samples from p.

• p is only known up to a constant of proportionality; that is

p(x) = k(x)/c

where k() is known and the normalizing constant c is unknown.

We reserve h(x) for a proposal distribution which is a distribution from which we can draw samples.

4.1 Rejection sampling

Let p(x) = k(x)/c be a density where k() is known and c is unknown. Let h(x) be a proposal distribution from which we can draw samples. Suppose we can find a constant M such that k(x) < Mh(x) for all x. The algorithm is then

1. Draw a sample x ∼ h(·). Draw u ∼ U(0, 1).

2. Set α = k(x)/(M h(x)).

3. If u < α, accept x.

The accepted values x1, . . . , xN are a random sample from p(·). Notice:

• It is tricky to choose a good proposal distribution h(). It should have support at least as large as p() and preferably heavier tails than p().

• It is desirable to choose M as small as possible. In practice this is difficult so one tends to take a conservative (large) choice of M whereby only few proposed values are accepted. Thus it is difficult to make rejection sampling efficient.

4.2 Example: Rejection sampling

k <- function(x, a=.4, b=.08){ exp(a*(x-a)^2 - b*x^4) }
x <- seq(-4, 4, 0.1)
plot(x, k(x), type='l')

[Figure: the unnormalized target k(x) plotted over [-4, 4]]

We take a uniform distribution on [−4, 4] (with density 1/8 = 0.125) as proposal distribution.

h <- function(x){ rep.int(0.125, length(x)) }
N <- 100000
x.h <- runif(N, -4, 4)
u <- runif(N)
## The choice of M is critical
M1 <- round(max(k(x)/h(x))) + 1
M2 <- 100*round(max(k(x)/h(x))) + 1
M3 <- 1000*round(max(k(x)/h(x))) + 1
acc1 <- u < k(x.h)/(M1*h(x.h))
acc2 <- u < k(x.h)/(M2*h(x.h))
acc3 <- u < k(x.h)/(M3*h(x.h))
## Fraction of accepted values
sum(acc1)/N

[1] 0.31316

sum(acc2)/N

[1] 0.00336

sum(acc3)/N

[1] 0.00038

x.acc1 <- x.h[acc1]
x.acc2 <- x.h[acc2]
x.acc3 <- x.h[acc3]

par(mfrow=c(2,2), mar=c(2,2,1,1))
plot(x, k(x), type='l')
barplot(table(round(x.acc1,1))/length(x.acc1))
barplot(table(round(x.acc2,1))/length(x.acc2))
barplot(table(round(x.acc3,1))/length(x.acc3))

[Figure: the target k(x) and histograms of the accepted samples for M1, M2 and M3]

Choice of proposal h() and of M is critical.

4.3 Sampling importance resampling (SIR)*

Even when M is not readily available, we may generate approximate samples from p as follows.

1. Draw samples x^1, . . . , x^N ∼ h(x).

2. Calculate importance weights wi = p(x^i)/h(x^i).

3. Normalize the weights as qi = wi / Σ_j wj.

4. Resample from {x^1, . . . , x^N}, where x^i is drawn with probability qi.

These last samples are approximately samples from p. This scheme also works if p is only known up to proportionality, essentially because the normalizing constant cancels out in step 3 above. Notice that those samples from h which "fit best to p" are the ones most likely to appear in the resample. But it is also worth noticing that if h is a poor approximation to p, then the "best samples from h" are not necessarily good samples in the sense of resembling p.
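A minimal sketch of SIR in R, reusing the unnormalized target k() from the rejection sampling example above (the sample sizes are arbitrary):

k <- function(x, a = .4, b = .08) exp(a * (x - a)^2 - b * x^4)
set.seed(1)
N <- 100000
x.h <- runif(N, -4, 4)                 # proposals from h = Uniform(-4, 4)
w   <- k(x.h) / dunif(x.h, -4, 4)      # importance weights; unnormalized target is fine
q   <- w / sum(w)                      # normalized weights
x.sir <- sample(x.h, size = 10000, replace = TRUE, prob = q)   # resample
hist(x.sir, breaks = 50)               # approximately distributed as p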

4.4 Markov Chain Monte Carlo methods

A drawback of the rejection algorithm is that it is difficult to suggest a proposal distribution h which leads to an efficient algorithm, and it is difficult to find M. A way around this problem is to let the proposed values depend on the last accepted value: if x0 is a "likely" value from p, then so, probably, is a proposed value x which is "close" to x0. Hence the proposal distribution will now be conditional on the last accepted value and have the form h(x|x0). This leads to schemes (described below) for drawing samples x1, . . . , xN, and these samples will, under certain conditions, form an ergodic Markov chain with p(x) as its stationary distribution. Hence, the expected value of any function of x can be calculated approximately as

∫ g(x)p(x)dx ≈ (1/N) Σ_i g(x^i).

4.5 The Metropolis–Hastings algorithm

Given an accepted value xt−1:

1. Draw x ∼ h(·|x^(t−1)). Draw u ∼ U(0, 1).

2. Calculate the acceptance probability α = min( 1, [p(x) h(x^(t−1)|x)] / [p(x^(t−1)) h(x|x^(t−1))] ).

3. If u < α set x^t = x; else set x^t = x^(t−1).

After a burn–in period the samples x1, x2,... will be samples from p(·). Notice:

• The samples x1, x2,... will be correlated.

• The algorithm also works if p is only known up to proportionality (because the normalizing constant cancels when calculating the acceptance probability).

• We must be able to both sample from h() and evaluate the density.

4.6 Special cases

Metropolis algorithm (a special case of the Metropolis–Hastings algorithm) The proposal distribution is symmetrical, i.e. h(x|x0) = h(x0|x) for all pairs (x, x0). Hence the acceptance probability is α = min(1, p(x)/p(x^(t−1))).

Random–walk Metropolis A popular choice for the proposal in a Metropolis algorithm is h(x|x0) = g(x − x0) where g is symmetric, e.g.

x = x0 + e,  e ∼ g

Example: x = x0 + e with e ∼ N(0, σ²)

The independence sampler (a special case of the Metropolis–Hastings algorithm) The proposal h(x|x0) = h(x) does not depend on x0. The acceptance probability becomes α = min( 1, [p(x) h(x^(t−1))] / [p(x^(t−1)) h(x)] ). For this sampler to work well, h should be a good approximation to p.

4.7 Example: Metropolis–Hastings algorithm

x.acc4 <- x.acc5 <- rep.int(NA, N)
u <- runif(N)
std <- 1   ## Spread of proposal
xc <- 0; acc.count <- 0
for (ii in 1:N){
  xp <- rnorm(1, mean=xc, sd=std)
  alpha <- min(1, (k(xp)/k(xc)) * (dnorm(xc, mean=xp, sd=std)/dnorm(xp, mean=xc, sd=std)))
  acc.count <- acc.count + (u[ii] < alpha)
  x.acc4[ii] <- xc <- ifelse(u[ii] < alpha, xp, xc)
}
## Fraction of accepted *new* proposals
acc.count/N

[1] 0.72907

std <- .01   ## Spread of proposal
xc <- 0; acc.count <- 0
for (ii in 1:N){
  xp <- rnorm(1, mean=xc, sd=std)
  alpha <- min(1, (k(xp)/k(xc)) * (dnorm(xc, mean=xp, sd=std)/dnorm(xp, mean=xc, sd=std)))
  acc.count <- acc.count + (u[ii] < alpha)
  x.acc5[ii] <- xc <- ifelse(u[ii] < alpha, xp, xc)
}
## Fraction of accepted *new* proposals
acc.count/N

[1] 0.99685

par(mfrow=c(2,2), mar=c(2,2,1,1))
plot(x, k(x), type='l')
barplot(table(round(x.acc4,1))/length(x.acc4))
barplot(table(round(x.acc5,1))/length(x.acc5))

[Figure: the target k(x) and histograms of the Metropolis–Hastings samples for the two proposal spreads]

4.8 Single component Metropolis–Hastings

Suppose x is a vector. Instead of updating the entire x it is often more convenient and computationally efficient to update x in blocks.

We partition x into blocks, for example x = (x1, x2, x3).

Suppose that we have a sample x^(t−1) = (x1^(t−1), x2^(t−1), x3^(t−1)) and that x1 has already been updated to x1^t in the current iteration. The task is to update x2.

To do so we specify a proposal distribution h2 from which we can sample candidate values for x2:

1. Draw x2 ∼ h2(·|x1^t, x2^(t−1), x3^(t−1)). Draw u ∼ U(0, 1).

2. Calculate the acceptance probability
   α = min( 1, [p(x2|x1^t, x3^(t−1)) h2(x2^(t−1)|x1^t, x2, x3^(t−1))] / [p(x2^(t−1)|x1^t, x3^(t−1)) h2(x2|x1^t, x2^(t−1), x3^(t−1))] )

3. If u < α set x2^t = x2; else set x2^t = x2^(t−1).

4.9 Gibbs sampler*

The Gibbs sampler is a special case of single component Metropolis–Hastings. The proposal distribution h2(x2|x1^t, x2^(t−1), x3^(t−1)) for updating x2 is

p(x2|x1^t, x3^(t−1))

Hence for the Gibbs sampler, (i) proposed values are always accepted, but (ii) it is required that we can sample from the conditionals p(xi|x−i). One version of the algorithm is as follows. Suppose a sample x^k = (x1^k, x2^k, x3^k) is available.

1. Sample x1^(k+1) ∼ p(x1|x2^k, x3^k)

2. Sample x2^(k+1) ∼ p(x2|x1^(k+1), x3^k)

3. Sample x3^(k+1) ∼ p(x3|x1^(k+1), x2^(k+1))

4. Set x^(k+1) = (x1^(k+1), x2^(k+1), x3^(k+1))

The sequence x1, x2,... then consists of (correlated) samples from p(x).
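A minimal sketch of a Gibbs sampler in R for a target that is not taken from the notes: a bivariate normal with correlation ρ, whose full conditionals are available in closed form (x1 | x2 ∼ N(ρ x2, 1 − ρ²) and vice versa):

set.seed(1)
N <- 10000; rho <- 0.8
x1 <- x2 <- numeric(N)
for (ii in 2:N) {
  x1[ii] <- rnorm(1, mean = rho * x2[ii - 1], sd = sqrt(1 - rho^2))  # sample x1 | x2
  x2[ii] <- rnorm(1, mean = rho * x1[ii],     sd = sqrt(1 - rho^2))  # sample x2 | x1
}
cor(x1, x2)   # close to rho after a burn-in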

4.10 Sampling in high dimensions – problems

Suppose we want to draw samples from p(x) where x = (x1, x2, . . . xK ). In principle we do sampling by Metropolis–Hastings, but in practice there can be a problem:

Suppose K = 80 and that each component xi is discrete with 10 levels. Then the total state space for x will have 10^80 configurations. This is a huge number; it is of the same order as estimates of the number of atoms in the universe! Hence we can not even build a computer large enough to store the entire joint distribution p(x).

To proceed we must utilize that p(x) may have a structure that allows us to avoid the "curse of dimensionality". This structure is defined in terms of conditional independence.

5 Conditional independence

Let X,Y,Z be random variables. Write f() for a generic density or probability mass function. We say that X and Y are conditionally independent given Z if

f(x, y | z) = f(x | z)f(y | z).

We write this as X ⊥⊥ Y | Z. An equivalent characterization is that f(x, y, z) factorizes as

f(x, y, z) = g(x, z)h(y, z)

that is, as a product of two functions g() and h(), where g() does not depend on y and h() does not depend on x. This is known as the factorization criterion. A third characterization is that p(x|y, z) = p(x|z), that is, the conditional distribution of X given Y, Z does not depend on Y.

[Figure: the DAG from Figure 1, with nodes xA, xS, xT, xL, xB, xE, xX, xD]

Consider sampling xB given all other variables. Group terms in the factorization:

p(x) = [ p(xA)p(xT|xA)p(xS)p(xL|xS)p(xE|xT, xL)p(xX|xE) ] [ p(xB|xS)p(xD|xE, xB) ]
     = g(xA, xT, xS, xL, xX, xE) k(xB, xS, xD, xE)

so xB ⊥⊥ xA, xT, xL, xX | xS, xD, xE

Put differently, xB depends directly on its parents, its children and its children's other parents (the so-called Markov blanket). Once these variables are known, all other variables are irrelevant. We have

p(xB|"the rest") = k(xB, xS, xD, xE) / ∫ k(xB, xS, xD, xE)dxB ∝ k(xB, xS, xD, xE)

So to sample xB we only need a small part of the model, namely k(xB, xS, xD, xE). Thereby we will not be caught by the dimensionality problems. This is why sampling based methods can be made to work in large scale problems.

L5: Normal and multiparameter models
Tuesday 14th August 2012, afternoon

Lyle Gurrin

Bayesian Data Analysis 13 – 17 August 2012, Copenhagen

Analysis for the normal distribution: Unknown mean, known variance Normal model underlies much statistical modelling. We start with the simplest case, assuming the variance is known:

1. Just one data point. 2. General case of a “sample” of data with many data points.

Normal and multiparameter models 79/ 258

Likelihood of one data point

Consider a single observation y from a normal distribution with mean θ and variance σ², with σ² known. The sampling distribution is:

p(y|θ) = (1/(√(2π)σ)) exp(−(y − θ)²/(2σ²)).   (12)

Normal and multiparameter models 80/ 258

Conjugate prior and posterior distributions

This likelihood is the exponential of a quadratic form in θ, so the conjugate prior distribution must have the same form; parameterize this family of conjugate densities as

p(θ) ∝ exp(−(θ − μ0)²/(2τ0²));   (13)

i.e. θ ∼ N(μ0, τ0²), with hyperparameters μ0 and τ0².

For now we assume μ0 and τ0 to be known.

Normal and multiparameter models 81/ 258

Posterior distribution

From the conjugate form of the prior distribution, the posterior distribution for θ is also normal:

p(θ|y) ∝ exp( −(1/2)[ (y − θ)²/σ² + (θ − μ0)²/τ0² ] ).   (14)

Some algebra is required, however, to reveal its form (recall that in the posterior distribution everything except θ is regarded as constant).

Normal and multiparameter models 82/ 258

Parameters of the posterior distribution

Algebraic rearrangement gives

p(θ|y) ∝ exp( −(θ − μ1)²/(2τ1²) ),   (15)

that is, the posterior distribution θ|y is N(μ1, τ1²), where

μ1 = (μ0/τ0² + y/σ²) / (1/τ0² + 1/σ²)   and   1/τ1² = 1/τ0² + 1/σ².   (16)

Normal and multiparameter models 83/ 258 Precisions of the prior and posterior distributions In manipulating normal distributions, the inverse of the variance or precision plays a prominent role. For normal data and normal prior distribution, each with known precision, we have

1/τ1² = 1/τ0² + 1/σ²

posterior precision = prior precision + data precision.

Normal and multiparameter models 84/ 258
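A minimal R sketch of the precision-weighted update in equation (16) (the numeric inputs below are hypothetical):

post.normal <- function(y, sigma, mu0, tau0) {
  prec1 <- 1 / tau0^2 + 1 / sigma^2                 # posterior precision
  mu1   <- (mu0 / tau0^2 + y / sigma^2) / prec1     # posterior mean (precision-weighted average)
  c(mean = mu1, sd = sqrt(1 / prec1))
}
post.normal(y = 3, sigma = 1, mu0 = 0, tau0 = 2)    # hypothetical observation and prior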

Interpreting the posterior mean, μ1

There are several ways of interpreting the form of the posterior mean μ1. In equation (16):

μ1 = (μ0/τ0² + y/σ²) / (1/τ0² + 1/σ²)

posterior mean = weighted average of the prior mean and the observed value y, with weights proportional to the precisions.

Normal and multiparameter models 85/ 258

Interpreting the posterior mean, μ1

Alternatively, μ1 = prior mean "adjusted" toward the observed y:

μ1 = μ0 + (y − μ0) τ0²/(σ² + τ0²),   (17)

or μ1 = data "shrunk" toward the prior mean:

μ1 = y − (y − μ0) σ²/(σ² + τ0²).   (18)

In both cases, the posterior mean μ1 is a compromise between the prior mean and the observed value.

Normal and multiparameter models 86/ 258

Interpretation of μ1 for extreme cases

In the extreme cases, the posterior mean μ1 equals the prior mean or the observed value y.

μ1 = μ0 if y = μ0 or τ0² = 0;

μ1 = y if y = μ0 or σ² = 0.

What is the correct interpretation for each scenario?

Normal and multiparameter models 87/ 258

Normal model with multiple observations

The normal model with a single observation can easily be extended to the more realistic situation where we have a sample of independent and identically distributed observations y = (y1, . . . , yn). We can proceed formally, from

p(θ|y) ∝ p(θ)p(y|θ) = p(θ) Π_{i=1}^{n} p(yi|θ)

where p(yi|θ) = N(yi|θ, σ²), with algebra similar to that above. The posterior distribution depends on y only through the sample mean, ȳ = (1/n) Σ_i yi, which is a sufficient statistic in this model.

Normal and multiparameter models 88/ 258

Normal model via the sample mean

In fact, since ȳ|θ, σ² ∼ N(θ, σ²/n), we can apply the results for a single normal observation: p(θ|y1, . . . , yn) = p(θ|ȳ) = N(θ|μn, τn²), where

μn = (μ0/τ0² + n ȳ/σ²) / (1/τ0² + n/σ²)

and

1/τn² = 1/τ0² + n/σ².

Normal and multiparameter models 89/ 258

Limits for large n and large τ0²

The prior precision, 1/τ0², and the data precision, n/σ², play equivalent roles; if n is large, the posterior distribution is largely determined by σ² and the sample value ȳ. As τ0² → ∞ with n fixed, or as n → ∞ with τ0² fixed, we have:

p(θ|y) ≈ N(θ|ȳ, σ²/n)   (19)

Normal and multiparameter models 90/ 258

Limits for large n and large τ0²

A prior distribution with large τ0², and thus low precision, captures prior beliefs that are diffuse over the range of θ where the likelihood is substantial. Compare the well-known result of classical statistics:

ȳ|θ, σ² ∼ N(θ, σ²/n)   (20)

which leads to the use of

ȳ ± 1.96 σ/√n   (21)

as a 95% confidence interval for θ. The Bayesian approach gives the same result for a noninformative prior.

Normal and multiparameter models 91/ 258

Multiparameter models: Introduction The reality of applied statistics: there are always several (maybe many) unknown parameters! BUT the interest usually lies in only a few of these (parameters of interest) while others are regarded as nuisance parameters for which we have no interest in making inferences but which are required in order to construct a realistic model. At this point the simple conceptual framework of the Bayesian approach reveals its principal advantage over other forms of inference.

Normal and multiparameter models 92/ 258 The Bayesian approach The Bayesian approach is clear: Obtain the joint posterior distribution of all unknowns, then integrate over the nuisance parameters to leave the marginal posterior distribution for the parameters of interest. Alternatively using simulation, draw samples from the entire joint posterior distribution (even this may be computationally difficult), look at the parameters of interest and ignore the rest.

Normal and multiparameter models 93/ 258

Averaging over nuisance parameters

To begin exploring the ideas of joint and marginal distributions, suppose that θ has two parts, so θ = (θ1, θ2). We are interested only in θ1, with θ2 considered a "nuisance" parameter. For example:

y|μ, σ² ∼ N(μ, σ²),

with both μ (= θ1) and σ² (= θ2) unknown. Interest usually focusses on μ.

Normal and multiparameter models 94/ 258

Averaging over nuisance parameters

AIM: To obtain the conditional distribution p(θ1|y) of the parameters of interest θ1. This can be derived from joint posterior density,

p(θ1,θ2|y) ∝ p(y|θ1,θ2)p(θ1,θ2),

by averaging or integrating over θ2:

p(θ1|y) = ∫ p(θ1, θ2|y) dθ2.

Normal and multiparameter models 95/ 258 Factoring the joint posterior Alternatively, the joint posterior density can be factored to yield: 

p(θ1|y) = ∫ p(θ1|θ2, y) p(θ2|y) dθ2,   (22)

showing the posterior distribution p(θ1|y) as a mixture of the conditional posterior distributions given the nuisance parameter θ2, where p(θ2|y) is a weighting function for the different possible values of θ2.

Normal and multiparameter models 96/ 258

Mixtures of conditionals

• The weights depend on the posterior density of θ2, so on a combination of evidence from the data and the prior model.
• What if θ2 is known to have a particular value?

The averaging over nuisance parameters can be interpreted very generally: θ2 can be categorical (discrete) and may take only a few possible values representing, for example, different sub-models.

Normal and multiparameter models 97/ 258

A strategy for computation We rarely evaluate integral (22) explicitly, but it suggests an important strategy for constructing and computing with multiparameter models, using simulation:

1. Draw θ2 from its marginal posterior distribution. 2. Draw θ1 from conditional posterior distribution, given the drawn value of θ2.

Normal and multiparameter models 98/ 258 Conditional simulation In this way the integration in (22) is performed indirectly.

In fact we can alter step 1 to draw θ2 from its conditional posterior distribution given θ1. Iterating the procedure will ultimately generate samples from the marginal posterior distribution of both θ1 and θ2. This is the much vaunted Gibbs sampler.

Normal and multiparameter models 99/ 258

Multiparameter models: Normal mean & variance

Consider a vector y of n i.i.d. univariate observations from a N(μ, σ²) distribution. We begin by analysing the model under the convenient assumption of a noninformative prior distribution (which is easily extended to informative priors). Following the discussion on noninformative priors, we assume prior independence of the location and scale parameters and take p(μ, σ²) to be uniform on (μ, log σ): p(μ, σ²) ∝ (σ²)^(−1).

Normal and multiparameter models 100/ 258

The joint posterior distribution, p(μ, σ²|y)

Under the improper prior distribution the joint posterior distribution is proportional to the likelihood × the factor 1/σ²:

p(μ, σ²|y) ∝ σ^(−n−2) exp( −(1/(2σ²)) Σ_{i=1}^{n} (yi − μ)² )
           = σ^(−n−2) exp( −(1/(2σ²)) [ Σ_{i=1}^{n} (yi − ȳ)² + n(ȳ − μ)² ] )
           = σ^(−n−2) exp( −(1/(2σ²)) [ (n − 1)s² + n(ȳ − μ)² ] ),   (23)

where s² = (1/(n − 1)) Σ_{i=1}^{n} (yi − ȳ)² is the sample variance of the yi's. The sufficient statistics are s² and ȳ.

Normal and multiparameter models 101/ 258

The conditional posterior dist'n, p(μ|σ², y)

We can factor the joint posterior density by considering first the conditional distribution p(μ|σ², y), and then the marginal p(σ²|y). We can use a previous result for the mean μ of a normal distribution with known variance:

μ|σ², y ∼ N(ȳ, σ²/n).   (24)

Normal and multiparameter models 102/ 258

The marginal posterior distribution, p(σ²|y)

This requires averaging the joint distribution (23) over μ, that is, evaluating the simple normal integral

∫ exp( −(n/(2σ²))(ȳ − μ)² ) dμ = √(2πσ²/n);

thus,

p(σ²|y) ∝ (σ²)^(−(n+1)/2) exp( −(n − 1)s²/(2σ²) ),   (25)

which is a scaled inverse-χ² density:

σ²|y ∼ Inv-χ²(n − 1, s²).   (26)

Normal and multiparameter models 103/ 258

Product of conditional and marginal We have therefore factored (23) as the product of conditional and marginal densities

p(μ, σ²|y) = p(μ|σ², y) p(σ²|y).

Normal and multiparameter models 104/ 258

Parallel between Bayes & frequentist results

As with the one-parameter normal results, there is a remarkable parallel with sampling theory:

Bayes:        (n − 1)s²/σ² | y ∼ χ²_(n−1)
Frequentist:  (n − 1)s²/σ² | μ, σ² ∼ χ²_(n−1)

Conditional on the values of μ and σ², the sampling distribution of the appropriately scaled sufficient statistic (n − 1)s²/σ² is chi-squared with n − 1 d.f.

Normal and multiparameter models 105/ 258

Analytic form of the marginal posterior distribution of μ

μ is typically the estimand of interest, so the ultimate objective of the Bayesian analysis is the marginal posterior distribution of μ. This can be obtained by integrating σ² out of the joint posterior distribution. It is easily done by simulation: first draw σ² from (26), then draw μ from (24). The posterior distribution of μ can be thought of as a mixture of normal distributions mixed over the scaled inverse chi-squared distribution for the variance, a rare case where analytic results are available.
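A minimal R sketch of this simulation scheme (the data vector is hypothetical); a scaled Inv-χ²(ν, s²) draw is ν s² divided by a χ²_ν draw:

set.seed(1)
y <- c(4.2, 5.1, 3.8, 4.9, 5.4, 4.4)                 # hypothetical data
n <- length(y); ybar <- mean(y); s2 <- var(y)
N <- 10000
sigma2 <- (n - 1) * s2 / rchisq(N, df = n - 1)        # draws of sigma^2 from (26)
mu     <- rnorm(N, mean = ybar, sd = sqrt(sigma2 / n))# draws of mu from (24), given sigma^2
quantile(mu, c(.025, .5, .975))                       # matches the t_{n-1}(ybar, s^2/n) result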

Normal and multiparameter models 106/ 258

Performing the integration

We start by integrating the joint posterior density over σ²:

p(μ|y) = ∫₀^∞ p(μ, σ²|y) dσ²

This can be evaluated using the substitution

z = A/(2σ²),  where A = (n − 1)s² + n(μ − ȳ)².

Normal and multiparameter models 107/ 258

Marginal posterior distribution of μ

We recognise (!) that the result is an unnormalized gamma integral:

p(μ|y) ∝ A^(−n/2) ∫₀^∞ z^((n−2)/2) exp(−z) dz
       ∝ [ (n − 1)s² + n(μ − ȳ)² ]^(−n/2)
       ∝ [ 1 + n(μ − ȳ)²/((n − 1)s²) ]^(−n/2).

This is the t_(n−1)(ȳ, s²/n) density.

Normal and multiparameter models 108/ 258

Marginal posterior distribution of μ

Equivalently, under the noninformative uniform prior distribution on (μ, log σ), the posterior distribution of μ satisfies

(μ − ȳ)/(s/√n) | y ∼ t_(n−1),

where t_(n−1) is the standard Student-t density (location 0, scale 1) with n − 1 degrees of freedom.

Normal and multiparameter models 109/ 258

Comparing the sampling theory

Again it is useful to compare with the sampling theory: under the sampling distribution p(y|μ, σ²),

(ȳ − μ)/(s/√n) | μ, σ² ∼ t_(n−1).

The ratio (ȳ − μ)/(s/√n) is called a pivotal quantity: its sampling distribution does not depend on the nuisance parameter σ², and its posterior distribution does not depend on the data (this helps in sampling-theory inference by eliminating difficulties associated with the nuisance parameter σ²).

Normal and multiparameter models 110/ 258

L6: Multiparameter models example: Bioassay experiment
Tuesday 14th August 2012, afternoon

Lyle Gurrin

Bayesian Data Analysis 13 – 17 August 2012, Copenhagen

Multiparameter models Few multiparameter sampling models allow explicit calculation of the posterior distribution. Data analysis for such models is usually achieved with simulation (especially MCMC methods). We will illustrate with a nonconjugate model for data from a bioassay experiment using a two-parameter generalised linear model.

Multiparameter models example: Bioassay experiment 111/ 258

Scientific problem

In drug development, acute toxicity tests are performed in animals. Various dose levels of the compound are administered to batches of animals. The animals' responses are typically characterised by a binary outcome: alive or dead, tumour or no tumour, response or no response, etc.

Multiparameter models example: Bioassay experiment 112/ 258 Data Structure Such an experiment gives rise to data of the form

(xi, ni, yi);  i = 1, . . . , k   (27)

where
xi is the i-th dose level (i = 1, . . . , k),
ni is the number of animals given the i-th dose level,
yi is the number of animals with a positive outcome (tumour, death, response).

Multiparameter models example: Bioassay experiment 113/ 258

Example Data

For the example data, twenty animals were tested, five at each of four dose levels.

Dose, xi (log g/ml)   Number of animals, ni   Number of deaths, yi
      −0.863                    5                       0
      −0.296                    5                       1
      −0.053                    5                       3
       0.727                    5                       5

Racine A, Grieve AP, Fluhler H, Smith AFM (1986). Bayesian methods in practice: experiences in the pharmaceutical industry (with discussion). Applied Statistics 35, 93-150.

Multiparameter models example: Bioassay experiment 114/ 258

Sampling model at each dose level

Within dosage level i: the animals are assumed to be exchangeable (there is no information to distinguish among them). We model the outcomes as independent given the same probability of death θi, which leads to the familiar binomial sampling model:

yi|θi ∼ Bin(ni, θi)   (28)

Multiparameter models example: Bioassay experiment 115/ 258 Setting up a model across dose levels Modelling the response at several dosage levels requires a relationship between the θi’s and xi’s.

We start by assuming that each θi is an independent parameter. We relax this assumption tomorrow when we develop hierarchical models.

There are many possibilities for relating the θi’s to the xi’s, but a popular and reasonable choice is a logistic regression model:

logit(θi)=log(θi/(1 − θi)) = α + βxi (29)

Multiparameter models example: Bioassay experiment 116/ 258

Setting up a model across dose levels

We present an analysis based on a prior distribution for (α, β) that is independent and locally uniform in the two parameters, that is, p(α, β) ∝ 1, so an improper "noninformative" distribution. We need to check that the posterior distribution is proper (details not shown).

Multiparameter models example: Bioassay experiment 117/ 258

Describing the posterior distribution

The form of the posterior distribution:

p(α, β|y) ∝ p(α, β) p(y|α, β)
          ∝ Π_{i=1}^{k} [ e^(α+βxi) / (1 + e^(α+βxi)) ]^(yi) [ 1 / (1 + e^(α+βxi)) ]^(ni−yi)

One approach would be to use a normal approximation centered at the posterior mode (α̃ = 0.87, β̃ = 7.91). This is similar to the classical approach of obtaining maximum likelihood estimates (e.g. by running glm in R). Asymptotic standard errors can be obtained via ML theory.

Multiparameter models example: Bioassay experiment 118/ 258

Bioassay graph

[Figure: contour plot of the posterior density of the parameters (alpha, beta)]

Multiparameter models example: Bioassay experiment 119/ 258

Discrete approx. to the post. density (1)

We illustrate computing the joint posterior distribution for (α, β) at a grid of points in 2 dimensions:

1. We begin with a rough estimate of the parameters.

• Since logit(E(yi/ni)) = α + βxi, we obtain rough estimates of α and β using a linear regression of logit(yi/ni) on xi.
• Set y1 = 0.5, y4 = 4.5 to enable the calculation.
• α̂ = 0.1, β̂ = 2.9 (standard errors 0.3 and 0.5).

Multiparameter models example: Bioassay experiment 120/ 258

Discrete approx. to the post. density (2)

2. Evaluate the posterior on a 200 × 200 grid; use the range [−5, 10] × [−10, 40].
3. Use R to produce a contour plot (lines of equal posterior density).
4. Renormalize on the grid so that Σ_α Σ_β p(α, β|y) = 1 (i.e., create a discrete approximation to the posterior).
5. Sample from the marginal distribution of one parameter, p(α|y) = Σ_β p(α, β|y).

Multiparameter models example: Bioassay experiment 121/ 258

Discrete approx. to the post. density (3)

6. Sample from the conditional distribution of the second parameter, p(β|α, y).
7. We can improve the sampling slightly by drawing from a linear interpolation between grid points (a rough R sketch of steps 2-6 is given below).

Alternative: exact posterior using advanced computation (methods covered later)
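A minimal R sketch of the grid computation referred to above (a coarser 100 × 100 grid is used here to keep it fast; everything else follows the data and grid ranges given on the slides):

x <- c(-0.863, -0.296, -0.053, 0.727)
n <- c(5, 5, 5, 5)
y <- c(0, 1, 3, 5)
alpha <- seq(-5, 10, length = 100)
beta  <- seq(-10, 40, length = 100)
logpost <- function(a, b) {                       # log posterior up to a constant (flat prior)
  theta <- plogis(a + b * x)                      # inverse logit
  sum(dbinom(y, size = n, prob = theta, log = TRUE))
}
lp <- outer(alpha, beta, Vectorize(logpost))
post <- exp(lp - max(lp)); post <- post / sum(post)   # renormalize on the grid
contour(alpha, beta, post, xlab = "alpha", ylab = "beta")
set.seed(1)
idx <- sample(length(post), 1000, replace = TRUE, prob = post)  # draws from the discrete approx.
a.draw <- alpha[row(post)[idx]]; b.draw <- beta[col(post)[idx]]
mean(b.draw > 0)                                  # Pr(beta > 0 | y)
quantile(-a.draw / b.draw, c(.025, .5, .975))     # posterior summary of LD50 = -alpha/beta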

Multiparameter models example: Bioassay experiment 122/ 258

Posterior inference

Quantities of interest:

• the parameters (α, β);
• LD50 = the dose at which Pr(death) is 0.5, i.e. LD50 = −α/β.
  – This is meaningless if β ≤ 0 (substance not harmful).
  – We perform inference in two steps: (i) Pr(β > 0|y); (ii) the posterior distribution of LD50 conditional on β > 0.

Multiparameter models example: Bioassay experiment 123/ 258

Results

We take 1000 simulation draws of (α, β) from the grid (a different posterior sample than the results in the book). Note that β > 0 for all 1000 draws.

Summary of the posterior distribution (posterior quantiles):

        2.5%    25%    50%    75%   97.5%
α       −0.6    0.6    1.3    2.0     4.1
β        3.5    7.5   11.0   15.2    26.0
LD50   −0.28  −0.16  −0.11  −0.06    0.12

Multiparameter models example: Bioassay experiment 124/ 258

Lessons from simple examples

The lack of multiparameter models with explicit posterior distributions is not necessarily a barrier to analysis. We can use simulation, maybe after replacing sophisticated models with hierarchical or conditional models (possibly invoking a normal approximation in some cases).

Multiparameter models example: Bioassay experiment 125/ 258

The four steps of Bayesian inference

1. Write the likelihood p(y|θ).
2. Generate the posterior as p(θ|y) ∝ p(θ)p(y|θ), by including well formulated information in p(θ) or else using p(θ) = constant.
3. Get crude estimates for θ as a starting point or for comparison.
4. Draw simulations θ^1, θ^2, . . . , θ^L (summaries for inference) and predictions ỹ^1, ỹ^2, . . . , ỹ^K for each θ^l.

Multiparameter models example: Bioassay experiment 126/ 258

Laplace approximations and the INLA approach – an alternative to MCMC

Søren Højsgaard Department of Mathematical Sciences Aalborg University, Denmark August 13, 2012

Contents

1 Introduction

2 The Laplace approximation
  2.1 Example: Approximating a χ²_k distribution
  2.2 Summary

3 The INLA approach
  3.1 Linear regression with INLA
  3.2 Random regression model with INLA

4 Summary

1 Introduction

Recall the setting in a Bayesian analysis: We wish to find the posterior distribution of θ given data,

π∗(θ) = p(y|θ)π(θ) / p(y)

where p(y) = ∫ p(y|θ)π(θ)dθ is the normalizing constant, which is usually intractable to calculate.

• MCMC methods can be difficult to work with in practice.

• An alternative approach to obtaining an approximation to the posterior is by the Laplace approximation.

• The Laplace approximation is implemented in the INLA program for a wide range of models.

• We shall just outline the ideas behind Laplace approximations and show a few ex- amples of using INLA. No detailed treatment!

2 The Laplace approximation

• The setting is that we have a density p(x) = k(x)/c which is only known up to proportionality. That is, k(x) is known but c is unknown (but given as c = ∫ k(x)dx).

• Example: k(x) = x^(k/2−1) exp(−x/2). Can we then find an approximate value for c? Can we find a distribution p̃(x) which is a good approximation to p(x)?

• In fact k(x) is the central part of the density of a χ²_k distribution, and the normalizing constant is c = 2^(k/2) Γ(k/2). But let us assume that we do not know that.

• We approximate p(x) by a normal distribution, so we need to find the mean and variance of this approximating distribution.

• Idea: The log of a normal density is a quadratic function so we approximate log k(x) by a quadratic function.

We get

log k(x) ≈ log k(x∗) + Dx log k(x∗)(x − x∗) + (1/2) D²xx log k(x∗)(x − x∗)²

As x∗ we choose the mode of k(x) (assume that there is only one). In this case Dx log k(x∗) = 0, so we get

log k(x) ≈ log k(x∗) + (1/2) D²xx log k(x∗)(x − x∗)²

Let σ² = −1 / D²xx log k(x∗). Inserting this gives

log k(x) ≈ log k(x∗) − (x − x∗)²/(2σ²)

So log k(x) has the form of the log of a normal N(x∗, σ²) density. This means that p(x) is approximated by a N(x∗, σ²) density. Now exponentiate and integrate:

∫ exp(log k(x)) dx ≈ exp(log k(x∗)) ∫ exp(−(x − x∗)²/(2σ²)) dx

c = ∫ k(x)dx ≈ exp(log k(x∗)) √(2πσ²)

Put differently,

p(x) = k(x)/c ≈ [ k(x∗) exp(−(x − x∗)²/(2σ²)) ] / [ k(x∗) √(2πσ²) ] = exp(−(x − x∗)²/(2σ²)) / √(2πσ²)

2.1 Example: Approximating a χ²_k distribution

Suppose we were only given that p(x) = k(x)/c with k(x) = x^(k/2−1) exp(−x/2).
We have log k(x) = (k/2 − 1) log x − x/2 and

Dx log k(x) = (k/2 − 1)x^(−1) − 1/2,    D²xx log k(x) = −(k/2 − 1)x^(−2)

Solving Dx log k(x) = 0 gives x∗ = k − 2.
Inserting x∗ = k − 2 gives σ² = −1 / D²xx log k(x∗) = 2(k − 2).
Hence p(x) ≈ p̃(x) = N(k − 2, 2(k − 2)).

Notice: If X ∼ χ²_k then E(X) = k and Var(X) = 2k. Moreover, X has the same distribution as a sum of k independent squared N(0, 1) variables, so by the central limit theorem X is approximately N(k, 2k). Hence the result above is not too surprising.

R> library(gplots)
R> par(mfrow=c(2,2), mar=c(2,2,1,1))
R> laplace <- function(kk){
+    xx <- seq(0, kk+3*sqrt(2*kk), .1)
+    dd1 <- dchisq(xx, df=kk)
+    dd2 <- dnorm(xx, mean=kk-2, sd=sqrt(2*(kk-2)))
+    dd3 <- dnorm(xx, mean=kk, sd=sqrt(2*kk))
+    ylim <- c(0, max(c(dd1,dd2,dd3)*1.1))
+    plot(xx, dd1, type='l', ylim=ylim, lwd=2, main=sprintf("k=%s",kk))
+    lines(xx, dd2, col=2, lty=2, lwd=2)
+    lines(xx, dd3, col=3, lty=3, lwd=2)
+    smartlegend(x = 'right', y = 'top',
+      legend = c("p(x)", "N(k-2,2(k-2))", "N(k,2k)"), col = 1:3, lty = 1:3)
+  }
R> laplace(kk=3)
R> laplace(kk=6)
R> laplace(kk=10)
R> laplace(kk=20)

[Figure: the χ²_k density p(x) compared with the Laplace approximation N(k − 2, 2(k − 2)) and the CLT approximation N(k, 2k), for k = 3, 6, 10 and 20]

2.2 Summary If k(x) can be evaluated easily (not always the case), the scheme above can be automated:

1. Use a numerical optimizer to find the mode of k(x).

2. Use a numerical device to calculate the Hessian matrix.

In a Bayesian setting we have k(θ) = p(y|θ)π(θ) and the approach above is – in principle – directly applicable.
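A minimal R sketch of this automation, reusing the χ²_k example with k = 6 (the use of optim() here is an illustration of the scheme, not the INLA implementation):

k <- function(x, kk = 6) x^(kk/2 - 1) * exp(-x / 2)      # unnormalized chi-square(6) density
fit <- optim(par = 1, fn = function(x) -log(k(x)), method = "L-BFGS-B",
             lower = 1e-6, hessian = TRUE)               # numerical mode and Hessian
x.star   <- fit$par                                      # mode, here k - 2 = 4
sigma2   <- 1 / fit$hessian[1, 1]                        # minus inverse second derivative of log k
c.approx <- k(x.star) * sqrt(2 * pi * sigma2)            # approximate normalizing constant
c(x.star, sigma2, c.approx, 2^(6/2) * gamma(6/2))        # compare with the exact c = 2^(k/2) Gamma(k/2)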

3 The INLA approach

• The difficult part arises when there are latent variables: In a Bayesian setting, if y = (yo, yl) where yo is observed and yl is latent (missing), then the likelihood is obtained by integration:

  L(θ) = p(yo|θ) = ∫ p(yo, yl|θ) dyl

• INLA (integrated nested Laplace approximation) handles a quite large class of latent variable models. There is a price though: INLA only produces marginal posteriors.

3.1 Linear regression with INLA Paired measurements of speed and stopping distance.

R> head(cars)

  speed dist speed2
1     4    2     16
2     4   10     16
3     7    4     49
4     7   22     49
5     8   16     64
6     9   10     81

Physical theory suggests that dist ∝ speed².

R> plot(dist~speed, data=cars)
R> lines(fitted(lm(dist~I(speed^2),data=cars))~speed, data=cars)

5 ● 120 100 ●

● ● ● 80 ● ● ● ● ● ● 60 dist ● ● ● ● ● ● ● ● ● ● ● ● 40 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

20 ● ● ● ● ● ● ● ● 0

5 10 15 20 25

speed

6 R> m1 <- lm(dist~speed, data=cars) R> summary(m1)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12

R> library(INLA)
R> m2 <- inla(dist~speed, data=cars)
R> summary(m2)

Call:
"inla(formula = dist ~ speed, data = cars)"

Time used:
 Pre-processing   Running inla   Post-processing   Total
 0.10800600       0.11300707     0.04600191        0.26701498

Fixed effects:
                   mean        sd  0.025quant   0.5quant  0.975quant  kld
(Intercept) -17.568865  6.669401  -30.716658  -17.56905   -4.414572    0
speed         3.931745  0.410035    3.123373    3.93175    4.740422    0

The model has no random effects

Model hyperparameters:
                                            mean        sd  0.025quant  0.5quant  0.975quant
Precision for the Gaussian observations 0.0043932 0.0008806   0.0028843 0.0043217   0.0063242

Expected number of effective parameters(std dev): 2.00(2.803e-05)
Number of equivalent replicates : 25.00

Marginal Likelihood: -220.91
Warning: Interpret the marginal likelihood with care if the prior model is improper.

Parameter estimates with lm() and inla() approaches are similar here. One thing to notice is the random noise

R> summary(m1)$sigma^2

[1] 236.5317

R> summary(m2)$hyperpar[1,1]

[1] 0.004393164

In lm() the normal distribution is parameterized as N(·, σ²) with σ² being the variance. In inla(), the normal distribution is parameterized as N(·, τ) with τ being the precision, i.e. τ = 1/σ².

R> 1/summary(m1)$sigma^2

[1] 0.004227763

R> m1 <- lm(dist~speed+I(speed^2), data=cars)
R> cars <- transform(cars, speed2=speed^2)
R> m2 <- inla(dist~speed+speed2, data=cars)
R> coef(summary(m1))

              Estimate   Std. Error    t value   Pr(>|t|)
(Intercept)  2.4701378  14.81716473  0.1667079  0.8683151
speed        0.9132876   2.03422044  0.4489620  0.6555224
I(speed^2)   0.0999593   0.06596821  1.5152647  0.1364024

R> summary(m2)$fixed

                  mean           sd    0.025quant   0.5quant  0.975quant  kld
(Intercept)  2.4957912  14.58413029  -26.25568570  2.4953382  31.2618287    0
speed        0.9096222   2.00190544   -3.03722660  0.9096534   4.8579391    0
speed2       0.1000757   0.06492574   -0.02792013  0.1000737   0.2281365    0

3.2 Random regression model with INLA

R> fetal <- read.csv("http://BendixCarstensen.com/Bayes/Cph-2012/data/fetal.csv", header=TRUE)
R> library(lattice)
R> xyplot(hc~tga, groups=id, data=fetal, type='l', lty=2)

8 400

350

300

250 hc

200

150

100

14 16 18 20 tga

R> library(lme4) R> library(INLA) R> ## Random intercept + random slope (uncorrelated) R> ## lmer() version R> mm1 <- lmer( sqrt(hc) ~ tga + (1|id) + (tga-1|id), data=fetal ) R> ## inla() version R> fetal <- transform(fetal, tgac=tga) R> im1 <- inla(sqrt(hc)~tga + f(id)+f(tgac), data=fetal)

4 Summary

• MCMC methodology is applicable to a very large class of statistical models,

9 • But: Getting MCMC based methods up and running may take a long time. Once they are up and running it takes time to do the analysis.

• Then, if we are not happy with the model, we change the model and start over again...

• For a more narrow (but still rich) class of models, INLA provides an alternative.

• INLA is based on integrated Laplace approximations. Model specification with INLA is relatively easy. Model fitting is fast.

• With INLA, there is a price to pay for these benefits: We only get the marginal posteriors of each parameter; there is no (direct) possibility of getting the joint posterior. That is, if θ = (θ1, θ2) is two dimensional, then we obtain the posteriors π∗(θ1) and π∗(θ2), but we do not get π∗(θ1, θ2).

• Moreover, the documentation of INLA is somewhat rough but will surely improve over the years to come. The URL to follow is http://www.r-inla.org/

L8: Assessing Convergence
Wednesday 15th August 2012, morning

Lyle Gurrin

Bayesian Data Analysis 13 – 17 August 2012, Copenhagen

Inference from iterative simulation Basic method of inference from iterative simulation: Use the collection of all simulated draws from the posterior distribution p(θ|y) to summarise the posterior density and to compute quantiles, moments etc. Posterior predictive simulations of unobserved outcomes y˜ can be obtained by simulation conditional on the drawn values of θ. Inference using iterative simulation draws does, however, require care...

Assessing convergence 128/ 258

Difficulties with iterative simulation 1. Too few iterations generate simulations that are not representative of the target distribution. Even at convergence the early iterations are still influenced by the starting values. 2. Within-sequence correlation: Inference based on simulations from correlated draws will be less precise than those using the same number of independent draws.

Assessing convergence 129/ 258 Within-sequence correlation Serial correlation in the simulations is not necessarily a problem since:

 At convergence the draws are all identically distributed as p(θ|y).  We ignore the order of simulation draws for summary and inference.

But correlation causes inefficiencies, reducing the effective number of simulation draws.

Assessing convergence 130/ 258

Within-sequence correlation Should we thin the sequences by keeping every kth simulation draw and discarding the rest? Useful to skip iterations in problems with a large number of parameters (to save computer storage) or built-in serial correlation due to restricted jumping/proposal distributions. Thinned sequences treated in the same way for summary and inference.

Assessing convergence 131/ 258

Challenges of iterative simulation The Markov chain must be started somewhere, and initial values must be selected for the unknown parameters. In theory the choice of initial values will have no influence on the eventual samples from the Markov chain. In practice convergence will be improved and numerical problems avoided if reasonable initial values can be chosen.

Assessing convergence 132/ 258 Diagnosing convergence It is generally accepted that the only way to diagnose convergence is to

1. Run multiple chains from a diverse set of initial parameter values.
2. Use formal diagnostics to check whether the chains, up to expected chance variability, come from the same equilibrium distribution, which is assumed to be the posterior of interest.

Assessing convergence 133/ 258

Diagnosing convergence

Checking whether the values sampled from a Markov chain (possibly with many dimensions) have converged to the equilibrium distribution is not straightforward. Lack of convergence might be diagnosed simply by observing erratic behaviour of the sampled values... but a steady trajectory does not necessarily mean that the chain is sampling from the correct posterior distribution: is it stuck in a particular area of the parameter space? Is this a result of the choice of initial values?

Assessing convergence 134/ 258

Handling iterative simulations A strategy for inference from iterative simulations:

1. Simulate multiple sequences with starting points dispersed throughout the sample space.
2. Monitor the convergence of all quantities of interest by comparing variation between and within simulated sequences until these are almost equal.
3. If no convergence, then alter the algorithm.
4. Discard a burn-in (and/or thin) the simulation sequences prior to inference.

Assessing convergence 135/ 258

Discarding early iterations as a burn-in

Discarding early iterations, known as burn-in, can reduce the effect of the starting distribution. Simulated values of θ^t, for large enough t, should be close to the target distribution p(θ|y). Depending on the context, different burn-in fractions can be appropriate. For any reasonable number of simulations, discarding the first half is a conservative choice.

Assessing convergence 136/ 258

Formally assessing convergence For overdispersed starting points, the within-sequence variation will be much less than the between sequence variation. Once the sequences have mixed, the two variance components will be almost equal.

Assessing convergence 137/ 258

[Figures (slides 138-142): scatter plots of theta1 against theta2 showing several simulated sequences, started from overdispersed points, gradually mixing over the target distribution]

Assessing convergence 138-142/ 258

Monitoring convergence using multiple chains

• Run several sequences in parallel.
• Calculate two estimates of the standard deviation (SD) of each component of (θ|y):
  – an underestimate from the SD within each sequence;
  – an overestimate from the SD of the mixture of sequences.
• Calculate the potential scale reduction factor:

R̂ = [mixture-of-sequences estimate of SD(θ|y)] / [within-sequence estimate of SD(θ|y)]

Assessing convergence 143/ 258

• Initially R̂ is large (use overdispersed starting points).
• At convergence, R̂ = 1 (each sequence has made a complete tour).
• Monitor R̂ for all parameters and quantities of interest; stop the simulations when they are all near 1 (e.g. below 1.2).
• At approximate convergence, simulation noise ("MCMC error") is minor compared to the posterior uncertainty about θ.

Assessing convergence 144/ 258

Monitoring scalar estimands

Monitor each scalar estimand and other scalar quantities of interest separately. We may also monitor the log of the posterior density (which is computed as part of the Metropolis algorithm). Since assessing convergence is based on means and variances, it is sensible to transform scalar estimands to be approximately normally distributed.

Assessing convergence 145/ 258

Monitoring convergence of each scalar estimand Suppose we've simulated m parallel sequences or chains, each of length n (after discarding the burn-in). For each scalar estimand ψ we label the simulation draws as ψij (i = 1, 2, …, n; j = 1, 2, …, m), and we compute B and W, the between- and within-sequence variances:

Assessing convergence 146/ 258

Between- and within-sequence variation Between-sequence variation:

B = (n/(m − 1)) Σ_{j=1}^m (ψ̄.j − ψ̄..)², where ψ̄.j = (1/n) Σ_{i=1}^n ψij and ψ̄.. = (1/m) Σ_{j=1}^m ψ̄.j

Within-sequence variation:

W = (1/m) Σ_{j=1}^m sj², where sj² = (1/(n − 1)) Σ_{i=1}^n (ψij − ψ̄.j)²

Assessing convergence 147/ 258 Estimating the marginal posterior variance We can estimate var(ψ|y), the marginal posterior variance of the estimand using a weighted average of B and W :

var⁺(ψ|y) = ((n − 1)/n) W + (1/n) B

This overestimates the posterior variance assuming an overdispersed starting distribution, but is unbiased under stationarity (start with the target distribution) or in the limit as n → ∞.

Assessing convergence 148/ 258

Estimating the marginal posterior variance For finite n, the within-sequence variance will be an underestimate of var(ψ|y) because individual sequences will not have ranged over the target distribution and will be less variable. In the limit the expected value of W approaches var(ψ|y).

Assessing convergence 149/ 258

The scale reduction factor We monitor convergence of the iterative simulation by estimating the factor R̂ by which the scale of the current distribution for ψ might be reduced in the limit as the number of iterations n → ∞:

R̂ = sqrt( var⁺(ψ|y) / W ),

which declines to 1 as n →∞. If R and hence the potential scale reduction is high, further simulations may improve inference about the target distribution of the estimand ψ.
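As a hedged illustration (not part of the course material), the calculation above can be written in a few lines of R; the function assumes sims is an n × m matrix holding the n retained draws of one scalar estimand ψ from each of m chains.

# Potential scale reduction factor for one scalar estimand.
# sims: n x m matrix (rows = iterations after burn-in, columns = chains)
rhat <- function(sims) {
  n <- nrow(sims); m <- ncol(sims)
  psi.bar.j <- colMeans(sims)               # per-chain means
  B <- n * var(psi.bar.j)                   # between-sequence variance
  W <- mean(apply(sims, 2, var))            # within-sequence variance
  var.plus <- (n - 1) / n * W + B / n       # estimate of var(psi | y)
  sqrt(var.plus / W)                        # R-hat
}

In practice the same diagnostic is usually obtained from the coda package (gelman.diag()), which is how JAGS output is typically monitored.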

Assessing convergence 150/ 258 The scale reduction factor It is straightforward to calculate R̂ for all scalar estimands of interest, and it can be accessed in JAGS using routines in CODA. The condition of R̂ being "near" 1 depends on the problem at hand; values below 1.1 are usually acceptable. This avoids the need to examine time-series graphs etc. Simulations may still be far from convergence if some areas of the target distribution were not well captured by the starting values and are "hard to reach".

Assessing convergence 151/ 258

The effective sample size If the mn simulation draws were truly independent, then the between-sequence variance B would be an unbiased estimate of var(ψ|y), and we would have mn independent simulations from the m chains. If simulations of ψ within each sequence are autocorrelated, B will be larger (in expectation) than var(ψ|y). Define the effective number of independent draws as:

n_eff = mn · var⁺(ψ|y) / B
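Continuing the illustrative sketch above, the effective number of independent draws follows directly from B and var⁺(ψ|y):

# Effective sample size for one scalar estimand, from the same n x m matrix of draws.
n.eff <- function(sims) {
  n <- nrow(sims); m <- ncol(sims)
  B <- n * var(colMeans(sims))
  W <- mean(apply(sims, 2, var))
  var.plus <- (n - 1) / n * W + B / n
  m * n * var.plus / B                      # n_eff = m * n * var+ / B
}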

Assessing convergence 152/ 258 L9: Hierarchical models Wednesday 15th August 2012, afternoon

Lyle Gurrin

Bayesian Data Analysis 13 – 17 August 2012, Copenhagen Bioassay example continued Let's return to the simple bioassay example: A single (α, β) may be inadequate to fit a combined data set (several experiments). Imagine repeated bioassays with the same compound, where the (αj, βj) are the parameters from the different bioassays.

Separate unrelated (αj,βj) are likely to “overfit” data (only 4 points in each data set). Information about the parameters of one bioassay can be obtained from others’ data.

Hierarchical models 153/ 258

Hierarchical models A natural prior distribution arises by assuming the (αj,βj)’s are a sample from a common population distribution. We’d be better off estimating the parameters governing the population distribution of (αj,βj) rather than each (αj,βj) separately.

Hierarchical models 154/ 258

Hierarchical models To do this we introduce new parameters that govern the population distribution, called hyperparameters. The distribution of observed outcomes are conditional on parameters which themselves have a probability specification, known as a hierarchical or multilevel model.

Hierarchical models 155/ 258 Specifying a hierarchical model Joint posterior distribution for the parameters (θ) and hyperparameters (φ):

p(θ, φ|y) ∝ p(θ, φ) p(y|θ, φ)
          ∝ p(θ, φ) p(y|θ)        (y independent of φ given θ)
          ∝ p(φ) p(θ|φ) p(y|θ),

Computation is often carried out in two steps: 1. Inference for θ as if we knew φ using the posterior conditional distribution p(θ|y, φ). 2. Inference for φ based on posterior marginal distribution p(φ|y).

Hierarchical models 156/ 258

Specifying a hierarchical model The model is specified in nested stages:

• p(y|θ) = the sampling distribution of the data.
• p(θ|φ) = the population distribution for θ given φ.
• p(φ) = the prior distribution for φ.
• More levels are possible.

The hyperprior distribution at the highest level is often chosen to be noninformative.

Hierarchical models 157/ 258

Example 1: Meta-analysis of clinical trials Reference: Gelman et al., 5.6 & 19.4 Spiegelhalter et al., 3.17 Meta-analysis aims to summarise and integrate findings from research studies addressing the same scientific question. It involves combining information from several parallel data sources, so is closely connected to hierarchical modelling. There are well known frequentist methods too. We’ll reinforce some of the concepts of hierarchical modelling in a meta-analysis of clinical trials data.

Hierarchical models 158/ 258 MgSO4 for MI Our data come from 8 randomised controlled clinical trials, each with two groups of heart attack patients receiving (or not) intravenous magnesium sulfate. The frequency of mortality due to heart attack range from 1% to 17%, and samples sizes from less than 50 to more than 2300.

Hierarchical models 159/ 258

Trial          Magnesium group      Control group
               deaths  patients     deaths  patients
Morton              1        40          2        36
Rasmussen           9       135         23       135
Smith               2       200          7       200
Abraham             1        48          1        46
Feldstedt          10       150          8       148
Schechter           1        59          9        56
Ceremuzynski        1        25          3        23
LIMIT-2            90      1159        118      1157

Hierarchical models 160/ 258

Normal approximation to the likelihood Here we use the results of one analytic approach to produce a point estimate of the log odds ratio and an asymptotic standard error that can be regarded as approximately a normal mean and standard deviation. We use the notation yj and σj² to be consistent with the earlier lecture.

Hierarchical models 161/ 258 More data for the MgSO4 example

Trial          Magnesium group      Control group       Estimated      Estimated
               deaths  patients     deaths  patients     log(OR) yj     SD sj
Morton              1        40          2        36        −0.83          1.25
Rasmussen           9       135         23       135        −1.06          0.41
Smith               2       200          7       200        −1.28          0.81
Abraham             1        48          1        46        −0.04          1.43
Feldstedt          10       150          8       148         0.22          0.49
Schechter           1        59          9        56        −2.41          1.07
Ceremuzynski        1        25          3        23        −1.28          1.19
LIMIT-2            90      1159        118      1157        −0.30          0.15

Hierarchical models 162/ 258

A hierarchical model: Stage 1 The first stage of the hierarchical model assumes that:

yj | θj, σj² ∼ N(θj, σj²),   (30)

The simplification of known variances is reasonable with large sample sizes (but see the online examples that use the “true” binomial sampling distribution).

Hierarchical models 163/ 258

Possible assumptions about the θj's

1. Studies are identical replications, so θj = μ for all j (no heterogeneity)... or
2. No comparability between studies, so that each study provides no information about the other (complete heterogeneity)... or
3. Studies are exchangeable but not identical or completely unrelated, so a compromise between 1 and 2.

Hierarchical models 164/ 258 A hierarchical model: Stage 2 The second stage of the hierarchical model assumes that the trial means θj are exchangeable with a normal distribution

θj ∼ N(μ, τ²).   (31)

Hierarchical models 165/ 258

A hierarchical model: Stage 2 If μ and τ² are fixed and known, then the conditional posterior distributions of the θj's are independent, and

θj | μ, τ, y ∼ N(θ̂j, Vj), where

θ̂j = ( yj/σj² + μ/τ² ) / ( 1/σj² + 1/τ² )   and   Vj = 1 / ( 1/σj² + 1/τ² ).

Note that the posterior mean is a precision-weighted average of the prior population mean and the observed yj representing the treatment effect in the jth group. Hierarchical models 166/ 258

Posterior distn’s for the θj’s given y, μ, τ

The expression for the posterior distribution of θj can be rearranged as

θj | yj ∼ N( Bj μ + (1 − Bj) yj, (1 − Bj) σj² )   (32)

where Bj = σj²/(σj² + τ²) is the weight given to the prior mean. Ignoring data from the other trials is equivalent to setting τ² = ∞, that is, Bj = 0. The classical pooled result arises from τ² → 0, that is, Bj = 1.
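To make the shrinkage concrete, here is a hedged R sketch using the yj and sj values from the table above; the values of μ and τ are plugged in purely for illustration (they are not the estimates derived later in the lecture).

y   <- c(-0.83, -1.06, -1.28, -0.04, 0.22, -2.41, -1.28, -0.30)   # log(OR) estimates
s   <- c( 1.25,  0.41,  0.81,  1.43, 0.49,  1.07,  1.19,  0.15)   # their SDs
mu  <- -0.5    # illustrative population mean
tau <-  0.35   # illustrative between-trial SD
B     <- s^2 / (s^2 + tau^2)        # weight given to the prior mean
theta <- B * mu + (1 - B) * y       # posterior means of the theta_j
V     <- (1 - B) * s^2              # posterior variances
round(cbind(B, theta, V), 2)

Small trials (large sj) are shrunk heavily towards μ, while a large trial such as LIMIT-2 (sj = 0.15) is shrunk very little.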

Hierarchical models 167/ 258 Conditional prior distribution for μ A uniform conditional prior distribution p(μ|τ)=1 for μ leads to the following posterior distribution:

μ | τ, y ∼ N(μ̂, Vμ)

where μ̂ is the precision-weighted average of the yj values, and Vμ⁻¹ is the total precision:

μ̂ = ( Σ_{j=1}^J yj/(σj² + τ²) ) / ( Σ_{j=1}^J 1/(σj² + τ²) )   and   Vμ⁻¹ = Σ_{j=1}^J 1/(σj² + τ²).

τ² → 0 gives the classical result where Bj = 1.

Hierarchical models 168/ 258

The exchangeable model and shrinkage The exchangeable model therefore leads to narrower posterior intervals for the θj's than the "independence" model, but they are shrunk towards the prior mean response. The degree of shrinkage depends on the variability between studies, measured by τ², and the precision of the estimate of the treatment effect from the individual trial, measured by σj².

Hierarchical models 169/ 258

The full hierarchical model The hierarchical model is completed by specifying a prior distribution for τ - we’ll use the noninformative prior p(τ)=1. Nevertheless, p(τ|y) is a complicated function of τ:

p(τ|y) ∝ ( Π_{j=1}^J N(yj | μ̂, σj² + τ²) ) / N(μ̂ | μ̂, Vμ)
       ∝ Vμ^{1/2} Π_{j=1}^J (σj² + τ²)^{−1/2} exp( −(yj − μ̂)² / (2(σj² + τ²)) )

Hierarchical models 170/ 258 The profile likelihood for τ A tractable alternative to the marginal posterior distribution is the profile likelihood for τ, derived by replacing μ in the joint likelihood for μ and τ by its conditional maximum likelihood estimate μ̂(τ) given the value of τ. This summarises the support for different values of τ and is easily evaluated as

Π_{j=1}^J N(yj | μ̂(τ), σj² + τ²).
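A hedged sketch of evaluating this profile likelihood on a grid of τ values in R, reusing the yj and sj vectors above (the grid and the cut-off line are illustrative choices):

profile.loglik <- function(tau, y, s) {
  v     <- s^2 + tau^2
  muhat <- sum(y / v) / sum(1 / v)            # conditional MLE of mu given tau
  sum(dnorm(y, muhat, sqrt(v), log = TRUE))   # sum_j log N(yj | muhat(tau), sj^2 + tau^2)
}
tau.grid <- seq(0, 2, by = 0.01)
pl <- sapply(tau.grid, profile.loglik, y = y, s = s)
plot(tau.grid, pl - max(pl), type = "l",
     xlab = "tau", ylab = "Profile log(likelihood)")
abline(h = -qchisq(0.95, 1) / 2, lty = 2)     # roughly the -1.96^2/2 cut-off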

Hierarchical models 171/ 258

[Figure: profile log(likelihood) for τ (0 ≤ τ ≤ 2), shown together with the trial-specific log(OR) estimates and the overall estimate.]

Hierarchical models 172/ 258

Estimates of τ The maximum likelihood estimate is τ̂ = 0, although values of τ with a profile log(likelihood) above −1.96²/2 ≈ −2 might be considered as being reasonably supported by the data. τ̂ = 0 would not appear to be a robust choice as an estimate since non-zero values of τ, which are well supported by the data, can have a strong influence on the conclusions. We shall assume, for illustration, the method-of-moments estimator τ̂ = 0.35.

Hierarchical models 173/ 258 Results of the meta-analysis

Trial          Magnesium group      Control group       Estimated     Estimated   Shrinkage
               deaths  patients     deaths  patients     log(OR) yk    SD sk       Bk
Morton              1        40          2        36       −0.83         1.25        0.90
Rasmussen           9       135         23       135       −1.06         0.41        0.50
Smith               2       200          7       200       −1.28         0.81        0.80
Abraham             1        48          1        46       −0.04         1.43        0.92
Feldstedt          10       150          8       148        0.22         0.49        0.59
Schechter           1        59          9        56       −2.41         1.07        0.87
Ceremuzynski        1        25          3        23       −1.28         1.19        0.89
LIMIT-2            90      1159        118      1157       −0.30         0.15        0.11

Hierarchical models 174/ 258

[Figure: forest plot of mortality log(OR) (−3 to 1) for trials 1_Morton to 8_LIMIT-2 and the overall estimate (9_Overall), showing both fixed-effect and random-effect (shrunk) estimates; values to the left favour magnesium, values to the right favour placebo.]

Hierarchical models 175/ 258

Example 2: Statistical measures of fetal growth The fetal origins hypothesis proposes that fetal adaptation to an adverse intrauterine environment programs permanent physiological change. It is now acknowledged that there is an inverse relationship between low birthweight and subsequent elevated blood pressure (Huxley R. Lancet (2002)). We'll look at quantifying fetal growth using statistical summary measures derived from serial fetal biometric data.

Hierarchical models 176/ 258 Western Australian Pregnancy Cohort Study Subjects received routine ultrasound examination at 18 weeks gestation. Additional ultrasound examination at 24, 28, 34, 38 weeks gestation as part of a randomised trial of the safety of repeated antenatal ultrasound.

Hierarchical models 177/ 258

Criteria for growth data These data exclude multiples, infants born at less than 37 weeks gestation, or those with maternal or fetal disease. Required agreement within 7 days between gestational age based on the last menstrual period and the ultrasound examination at 18 weeks. 3450 ultrasound measurements of five fetal dimensions (BPD, OFD, HC, AC, FL) on 707 fetuses. We'll look at HC = head circumference.

Hierarchical models 178/ 258

Statistical modelling Yij is the measured head circumference for the i-th fetus at the j-th timepoint, tij = 18, 24, 28, 34, 38 weeks. The number of measurements on an individual fetus varies from 1 to 7. The aim is to model the relationship between head circumference Yij and gestational age tij.

Hierarchical models 179/ 258 Head Circumference

[Figure: head circumference (mm, 100–400) against gestational age (weeks, 15–45) for all fetuses, with the 10th, 50th and 90th centile curves.]

Hierarchical models 180/ 258

Modelling strategy We follow the methodology in Royston P. Stat. Med. (1995): Transform both sides of the regression equation to establish an approximately linear relationship between the transformed outcome Yij^(λ) and the transformed timescale g(tij). The longitudinal design suggests the use of a linear mixed model.

Hierarchical models 181/ 258

Transformation of Yij Use the familiar Box-Cox transformation:

Yij^(λ) = (Yij^λ − 1)/λ   if λ ≠ 0
        = log(Yij)        if λ = 0

We account for gestational age when choosing λ by fitting a cubic polynomial in time tij.

Hierarchical models 182/ 258 Transformed tij We assume that Yij^(λ) is linear in a second-degree fractional polynomial in tij (Royston P, Altman DG Appl. Stat. (1994)).

g(tij) = ξ0 + ξ1 tij^(p1) + ξ2 tij^(p2)

tij^(p1) = tij^p1     if p1 ≠ 0
         = log(tij)   if p1 = 0

If p1 = p2 then let tij^(p2) = tij^(p1) log(tij).

Hierarchical models 183/ 258

Fractional polynomials in tij

So g(tij) = ξ0 + ξ1 tij^(p1) + ξ2 tij^(p2). Select p1, p2 from {−3, −2, −1, −1/2, 0, 1/2, 1, 2, 3}. Use a grid search to find the p1, p2 providing the best fit to Yij^(λ). Estimate ξ1 and ξ2 using maximum likelihood, with separate intercepts for each subject.

Let Xij = tij^(p1) + (ξ2/ξ1) tij^(p2).

Hierarchical models 184/ 258

Transformations for head circumference For head circumference we find that λˆ =0.56 ≈ 0.50, equivalent to the square root transformation. We use a quadratic transformation of gestational age:

Xij = tij − 0.0116638 tij².
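A two-line sketch of the two transformations (the gestational ages and head circumferences below are made-up illustrative values; the constant 0.0116638 is the one quoted above):

t  <- c(18, 24, 28, 34, 38)          # gestational age (weeks)
hc <- c(155, 225, 270, 310, 330)     # illustrative head circumferences (mm)
Y  <- sqrt(hc)                       # lambda-hat = 0.56, rounded to the square root
X  <- t - 0.0116638 * t^2            # quadratic transformation of gestational age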

Hierarchical models 185/ 258 Simple linear model The simplest linear model would be:

Yij = β0 + β1Xij + εij,

where εij ∼ N(0, σε²).

Hierarchical models 186/ 258

Mixed linear model We can extend this to a mixed linear model by allowing subject-specific intercepts and gradients:

Yij =(β0 + u0i)+(β1 + u1i)Xij + εij,

where εij ∼ N(0, σε²) and

(u0i, u1i)′ ∼ N2( (0, 0)′, [ σ0²  σ01 ; σ01  σ1² ] )

with cov(εij,ui)=0.

Hierarchical models 187/ 258

References centiles The models can be used to derive reference centiles:

var(Yij) = var(u0i) + 2 cov(u0i, u1i) Xij + var(u1i) Xij² + var(εij)
         = σ0² + 2 σ01 Xij + σ1² Xij² + σε²

The variance of Yij (square root of head circumference) is quadratic in gestational age.
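As a hedged sketch, the fitted mean and this variance function can be turned into reference centiles and back-transformed to the mm scale by squaring; every parameter value below is an illustrative placeholder, not a fitted value from the study.

beta0 <- 2.0;  beta1 <- 1.0                   # fixed-effect intercept and slope (illustrative)
sig0sq <- 0.5; sig1sq <- 0.1; sig01 <- 0      # random-effect (co)variances (illustrative)
sigesq <- 0.2                                 # residual variance (illustrative)
X <- seq(14, 22, by = 0.5)                    # transformed gestational age
mu.Y  <- beta0 + beta1 * X                    # mean of sqrt(head circumference)
var.Y <- sig0sq + 2 * sig01 * X + sig1sq * X^2 + sigesq
centile <- function(p) (mu.Y + qnorm(p) * sqrt(var.Y))^2   # back to mm by squaring
matplot(X, cbind(centile(0.1), centile(0.5), centile(0.9)), type = "l",
        lty = c(2, 1, 2), xlab = "Transformed gestational age",
        ylab = "Head circumference (mm)")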

Hierarchical models 188/ 258 Head Circumference

[Figure: head circumference (mm, 100–400) against gestational age (weeks, 15–45) with the fitted 1st, 10th, 50th, 90th and 99th reference centiles.]

Hierarchical models 189/ 258

Measure of fetal growth We can also use the models to derive measures of growth or change in size: Yi1^(λ) and Yi2^(λ) are bivariate normal, so Yi2^(λ) given Yi1^(λ) is univariate normal. The "conditional Z-score" is

Z_{2|1} = ( Yi2^(λ) − E(Yi2^(λ) | Yi1^(λ)) ) / sqrt( var(Yi2^(λ) | Yi1^(λ)) )

We used measures at 38 weeks gestation conditional on the value at 18 weeks gestation, relating these to birthweight and subsequent blood pressure in childhood.

Hierarchical models 190/ 258

[Figures, slides 191–198: serial measurements for two example fetuses, shown as head circumference (mm) against gestational age (weeks) and as SQRT head circumference against transformed gestational age, with the subject-specific and population fitted lines and the 95% population limits.]

Hierarchical models 191–198/ 258

Fitting the model in JAGS The subject-specific index runs from i = 1 to 707.

u[i,1:2] ~ dmnorm((0,0), Omega.beta[,])

u[i,1] (= u0i) and u[i,2] (= u1i) are multivariate normally distributed.

mu.beta[1] is the fixed effects intercept β0. mu.beta[2] is the fixed effects gradient β1. Omega.beta[,] is the inverse of the random effects variance-covariance matrix.

Hierarchical models 199/ 258

Fitting the model in JAGS The observation-specific index runs from j = 1 to 3097.

mu[j] <- (mu.beta[1] + u[id[j],1]) + (mu.beta[2] + u[id[j],2])*X[j]

The observation-specific mean mu[j] uses id[j] as the first index of the random effect array u[1:707,1:2] to pick up the id number of the jth observation and select the correct row of the random effect array. The final step adds the error term to the mean: Y[j] ~ dnorm(mu[j], tau.e)

Hierarchical models 200/ 258 Prior distributions for model parameters We use the “standard” noninformative prior distributions for the fixed effects β0, β1 and σe. We use a Wishart prior distribution for the precision matrix Omega.beta, which requires

• A degrees-of-freedom parameter (we use the smallest possible value, 2, the rank of the precision matrix).
• A scale matrix representing our prior guess at the order of magnitude of the covariance matrix of the random effects (we assume σ0² = 0.5, σ1² = 0.1 and σ01 = 0).
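Pulling the pieces quoted on the last few slides together, a minimal sketch of how such a model could be run from R with rjags is given below. The simulated data, the prior constants and all names other than mu.beta, Omega.beta, tau.e, u, id, X and Y are illustrative assumptions, not the course's actual code or data.

library(rjags)   # rjags also loads coda

## simulated stand-in data (the real analysis has 707 subjects and 3097 observations)
set.seed(1)
n.subj <- 50; n.obs <- 4
id <- rep(1:n.subj, each = n.obs)
X  <- rnorm(length(id))
u.true <- cbind(rnorm(n.subj, 0, 0.7), rnorm(n.subj, 0, 0.3))
Y  <- (15 + u.true[id, 1]) + (0.5 + u.true[id, 2]) * X + rnorm(length(id), 0, 0.4)

model.string <- "
model {
  for (j in 1:N) {
    mu[j] <- (mu.beta[1] + u[id[j],1]) + (mu.beta[2] + u[id[j],2]) * X[j]
    Y[j] ~ dnorm(mu[j], tau.e)
  }
  for (i in 1:I) {
    u[i,1:2] ~ dmnorm(zero[1:2], Omega.beta[1:2,1:2])   # random intercept and slope
  }
  mu.beta[1] ~ dnorm(0, 1.0E-6)                          # vague priors for fixed effects
  mu.beta[2] ~ dnorm(0, 1.0E-6)
  tau.e ~ dgamma(0.001, 0.001)                           # residual precision
  Omega.beta[1:2,1:2] ~ dwish(R[1:2,1:2], 2)             # Wishart prior, df = 2
}
"

data.list <- list(Y = Y, X = X, id = id, N = length(Y), I = n.subj,
                  zero = c(0, 0), R = diag(c(0.5, 0.1)))   # prior scale guess as above
jm <- jags.model(textConnection(model.string), data = data.list, n.chains = 3)
update(jm, 1000)                                           # burn-in
post <- coda.samples(jm, c("mu.beta", "tau.e"), n.iter = 5000)
gelman.diag(post)                                          # R-hat for the monitored nodes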

Hierarchical models 201/ 258 L10: Analysis twin data and GLMMs Thursday 16th August 2012, morning

Lyle Gurrin

Bayesian Data Analysis 13 – 17 August 2012, Copenhagen

This lecture

• A model for paired data.
• Extend this to a genetic model for twin data.
• Fitting these models in BUGS.
• An introduction to the example on mammographic (breast) density.

Analysis of twin data and GLMMs 202/ 258 Paired data Paired data provide an excellent example of Bayesian methods in BUGS since these data are

• the simplest type of correlated data structure;
• naturally represented as a hierarchical model.

We begin with the basic model to capture correlation in paired data, and then extend this to accommodate different correlation in monozygous (MZ) and dizygous (DZ) twin pairs.

Analysis of twin data and GLMMs 203/ 258

Notation for paired data Suppose we have a continuously valued outcome y measured on each individual of n twin pairs. Let yij denote the measurement of the j-th individual in the i-th twin pair, where j = 1, 2 and i = 1, 2, …, n. We do not consider additional measurements on exposure variables at this stage.

Analysis of twin data and GLMMs 204/ 258

A model for a single pair We assume that, for the i-th pair, yi1 and yi2 have common mean

E(yi1)=E(yi2)=ai,

where (for now) ai is assumed to be fixed. It is also assumed that the two measurements have common variance

var(yi1) = var(yi2) = σe².

Conditional on the value of ai, yi1 and yi2 are uncorrelated: cov(yi1,yi2)=0.

Analysis of twin data and GLMMs 205/ 258 A model for a single pair We can write this model as

yi1 = ai + εi1
yi2 = ai + εi2

where var(εi1) = var(εi2) = σe² and cov(εi1, εi2) = 0, with an implicit assumption that

εij ∼ N(0, σe²).

Analysis of twin data and GLMMs 206/ 258

A hierarchical model for paired data We can extend this simple structure for paired data to a hierarchical model by assuming a normal population distribution for the pair-specific means ai

ai ∼ N(μ, σa²).

Analysis of twin data and GLMMs 207/ 258

A hierarchical model for paired data This is an example of the hierarchical normal-normal model studied in lectures earlier. The sampling model is (for j =1, 2)

yij | ai, σe² ∼ N(ai, σe²)

The pair-specific mean model is (for j =1, 2)

ai | μ, σa² ∼ N(μ, σa²)

Analysis of twin data and GLMMs 208/ 258 Unconditional mean and variance of yij We can use the iterative expectation and variance formulae:

E(yij) = E(E(yij|ai)) = E(ai) = μ.
var(yij) = var(E(yij|ai)) + E(var(yij|ai)) = var(ai) + E(σe²) = σa² + σe².

So yij ∼ N(μ, σa² + σe²).

Analysis of twin data and GLMMs 209/ 258

Bivariate distribution of (yi1,yi2)

In fact the joint distribution of yi1 and yi2 is bivariate normal:

(yi1, yi2)′ ∼ N2( (μ, μ)′, [ σa² + σe²   σa² ; σa²   σa² + σe² ] )

BUGS does allow vector nodes to have a multivariate normal (MVN) distribution, but it requires careful specification of the parameters defining the MVN distribution.

Analysis of twin data and GLMMs 210/ 258

Introduction to twin data Measurements on twins are a special case of paired data and provide a natural matched design. Outcomes (continuous or binary) are possibly influenced by both genetic and environmental factors. There has been extensive development of statistical methods to deal with quantitative traits, concordance/discordance, affected sib-pairs etc. Studying basic models for twins provides a good introduction to analysing family data.

Analysis of twin data and GLMMs 211/ 258 Introduction to twin data We will now extend the simple paired model to accommodate monozygous (MZ) and dizygous (DZ) pairs. MZ twins are “identical” and share all of their genes. DZ twins are siblings and share on average half of their genes (same relationship as siblings like brother and sister).

Analysis of twin data and GLMMs 212/ 258

Within-pair covariation in twin data If a quantitative trait (continuous outcome variable) is under the influence of shared genetic factors then we expect the within-pair covariation to be smaller in DZ pairs than in MZ pairs:

cov_DZ(yi1, yi2) = ρ_DZ:MZ · cov_MZ(yi1, yi2)

So ρDZ:MZ is the ratio of covariances between DZ and MZ pairs. We assume that var(yij) is the same in MZ and DZ twins and so ρDZ:MZ is also the ratio of the within-pair correlation in DZ and MZ pairs.

Analysis of twin data and GLMMs 213/ 258

Interpretation of ρDZ:MZ

• If ρ_DZ:MZ = 1 then the outcome is no more correlated between individuals in MZ pairs than in DZ pairs, so no evidence of genetic influence.
• If ρ_DZ:MZ = 1/2 then we have an additive genetic model, also known as the "classical twin model".
• If 0 < ρ_DZ:MZ < 1 and ρ_DZ:MZ ≠ 1/2 then any genetic model will need to be non-additive (e.g. gene-gene interaction) or incorporate a contribution to variation from shared environment.

Analysis of twin data and GLMMs 214/ 258 Specifying the additive model in BUGS Recall our original model:

yi1 = ai + εi1 yi2 = ai + εi2

where

var(εij) = σe²,   cov(εi1, εi2) = 0,   ai ∼ N(μ, σa²)

Analysis of twin data and GLMMs 215/ 258

Random effect sharing in MZ and DZ pairs

Now instead of just one random effect ai per pair, extend the model to incorporate three random effects per pair, ai1, ai2 and ai3, all i.i.d. N(μ, σa²). Regardless of zygosity, both individuals in a twin pair share the first random effect ai1. Individuals in MZ pairs share the second random effect ai2.

For DZ pairs, one member of the pair receives ai2 and the other member receives ai3. So DZ pairs share less than MZ pairs. Scaling appropriately by ρ = ρ_DZ:MZ we have...

Analysis of twin data and GLMMs 216/ 258

Model equations For MZ pairs

yi1 = √ρ ai1 + √(1 − ρ) ai2 + εi1
yi2 = √ρ ai1 + √(1 − ρ) ai2 + εi2

For DZ pairs

yi1 = √ρ ai1 + √(1 − ρ) ai2 + εi1
yi2 = √ρ ai1 + √(1 − ρ) ai3 + εi2

Analysis of twin data and GLMMs 217/ 258 Covariance and variance for MZ and DZ pairs

Since MZ pairs share both ai1 and ai2, the within-pair covariance of yi1 and yi2 is (√ρ)² σa² + (√(1 − ρ))² σa² = σa².

DZ pairs share only ai1, so the corresponding within-pair covariance is (√ρ)² σa² = ρ σa². The variance of yij is, however, σa² + σe² for both MZ and DZ pairs.
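A quick simulation check of this algebra (illustrative parameter values; the random effects are centred at zero here for simplicity):

set.seed(42)
n <- 1e5; rho <- 0.5; sig.a <- 1; sig.e <- 0.5
a1 <- rnorm(n, 0, sig.a); a2 <- rnorm(n, 0, sig.a); a3 <- rnorm(n, 0, sig.a)
## MZ pairs: both members share a1 and a2
y1.mz <- sqrt(rho) * a1 + sqrt(1 - rho) * a2 + rnorm(n, 0, sig.e)
y2.mz <- sqrt(rho) * a1 + sqrt(1 - rho) * a2 + rnorm(n, 0, sig.e)
## DZ pairs: the second member gets a3 instead of a2
y1.dz <- sqrt(rho) * a1 + sqrt(1 - rho) * a2 + rnorm(n, 0, sig.e)
y2.dz <- sqrt(rho) * a1 + sqrt(1 - rho) * a3 + rnorm(n, 0, sig.e)
c(cov.MZ = cov(y1.mz, y2.mz),   # approx. sigma_a^2             = 1
  cov.DZ = cov(y1.dz, y2.dz),   # approx. rho * sigma_a^2       = 0.5
  var    = var(y1.mz))          # approx. sigma_a^2 + sigma_e^2 = 1.25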

Analysis of twin data and GLMMs 218/ 258

Model equations Clearly the model for MZ pairs

yi1 = √ρ ai1 + √(1 − ρ) ai2 + εi1
yi2 = √ρ ai1 + √(1 − ρ) ai2 + εi2

is identical to the original model

yi1 = ai1 + εi1
yi2 = ai1 + εi2

but computations in BUGS are easier if both members of the pair share some random effects but not others if they are DZ.

Analysis of twin data and GLMMs 219/ 258

Application: Mammographic density Female twin pairs (571 MZ, 380 DZ) aged 40 to 70 were recruited in Australia and North America. The outcome is percent mammographic density, the ratio of dense tissue to non-dense tissue from the mammographic scan. Age-adjusted mammographic density is a risk factor for breast cancer. The correlation for percent density was 0.63 in MZ pairs and 0.27 in DZ pairs, adjusted for age and location. Other risk factors for breast density: height, weight, reproductive, diet.

Analysis of twin data and GLMMs 220/ 258

Percent breast density and age at mammogram
[Figure: percent breast density (%, 0–100) against age at mammogram (40–70 years).]

Analysis of twin data and GLMMs 221/ 258

Generalised Linear Mixed Models

Analysis of twin data and GLMMs 222/ 258

Melbourne Sexual Health Clinic (MSHC)

23 sexual health physicians diagnosing each patient with either:
• pelvic inflammatory disease (PID)
• genital warts

Are there differences between physicians in the proportion diagnosed with PID or warts?

consults  PID  warts
      80    1      1
     816   41     46
     726   12     37
    2891   38    137
      79    4      4
    1876   34     73
     469    8     27
    1124   13     76
     210   10      6
     539    8     28
    1950   22    101
    1697   24     86
     811   13     56
     908   52     48
     944   19     65
     832   10     33
    1482   10     62
     456    0     20
     420    0      8
    1258    3     58
    1101    1     22
     109    0      3
    1006    2     62

Analysis of twin data and GLMMs 223/ 258 Diagnosis frequency in MSHC

The proportion of patients diagnosed varies:
• PID: 0% – 5.73%; 1.49% (weighted), 1.72% (unweighted)
• Warts: 1.25% – 6.91%; 4.86% (weighted), 4.59% (unweighted)

Are the warts percentages "better spread" than the PID percentages (most less than 2% but four are around 5%)?

consults  PID (%)  warts (%)
      80     1.25       1.25
     816     5.02       5.64
     726     1.65       5.10
    2891     1.31       4.74
      79     5.06       5.06
    1876     1.81       3.89
     469     1.71       5.76
    1124     1.16       6.76
     210     4.76       2.86
     539     1.48       5.19
    1950     1.13       5.18
    1697     1.41       5.07
     811     1.60       6.91
     908     5.73       5.29
     944     2.01       6.89
     832     1.20       3.97
    1482     0.67       4.18
     456     0.00       4.39
     420     0.00       1.90
    1258     0.24       4.61
    1101     0.09       2.00
     109     0.00       2.75
    1006     0.20       6.16

Analysis of twin data and GLMMs 224/ 258

Data Structure For each outcome (PID or warts) the data are of the form

(ni,yi); i =1,...,23

where

ni is the number of consultations (patients seen) by physician i. yi is the number of patients diagnosed with the condition.

Analysis of twin data and GLMMs 225/ 258

Sampling model for each physician For each physician, the patients (i.e. their outcomes) are assumed to be exchangeable (there is no information to distinguish one patient from another). We model the outcomes within-physician as independent given a physician-specific probability of diagnosis θi, which leads to the familiar binomial sampling model:

yi|θi ∼ Bin(ni,θi)

Analysis of twin data and GLMMs 226/ 258 Setting up a model across physicians

Typical assumption is that each θi is an independent parameter, that is, a fixed effect. We can re-express this model using logistic regression:

logit(θi)=log(θi/(1 − θi)) = αi

where αi is the physician-specific log odds of diagnosis with the condition.

Analysis of twin data and GLMMs 227/ 258

Estimating the fixed effects MLE: Estimates are:

θ̂i = yi/ni,   α̂i = log( yi/(ni − yi) )

Bayes: Estimate θ as the posterior mean using some (beta?) prior distribution (see Lecture 1).

An even more extreme assumption is that αi = α for some common log-odds of diagnosis α. But this provides no way of quantifying variability in the frequency of diagnosis - is there a compromise?

Analysis of twin data and GLMMs 228/ 258

A hierarchical model Replace independence with exchangeability - allow the αi to be drawn from a "population" distribution of diagnosis frequencies: αi ∼ N(μ, τ²).

Then assume the yi are conditionally independent binomial random variables given the αi:

yi | αi ∼ Bin(ni, expit(αi)), where expit(αi) = logit⁻¹(αi) = exp(αi)/(1 + exp(αi)) = θi.

Analysis of twin data and GLMMs 229/ 258 Logistic-normal GLMM

This logistic-normal model, where the αi’s are given a distribution, and so are random rather than fixed effects, is an example of a generalised linear mixed model (GLMM). The model is easily implemented in BUGS to generate posterior distributions for αi’s and hyperparameters μ and τ. This simple model is available in Stata and SAS but without much flexibility to extend to random coefficients.
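For comparison, a hedged sketch of a non-Bayesian counterpart of this logistic-normal model, fitted with lme4::glmer using the consults and warts counts from the table above (the data-frame and variable names are assumptions for the illustration):

library(lme4)
consults <- c(80, 816, 726, 2891, 79, 1876, 469, 1124, 210, 539, 1950, 1697,
              811, 908, 944, 832, 1482, 456, 420, 1258, 1101, 109, 1006)
warts    <- c(1, 46, 37, 137, 4, 73, 27, 76, 6, 28, 101, 86,
              56, 48, 65, 33, 62, 20, 8, 58, 22, 3, 62)
d <- data.frame(doctor = factor(seq_along(consults)), consults, warts)
fit <- glmer(cbind(warts, consults - warts) ~ 1 + (1 | doctor),
             family = binomial, data = d)
summary(fit)   # the fixed intercept plays the role of mu, the random-effect SD that of tau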

Analysis of twin data and GLMMs 230/ 258

Fixed and random effects for warts

[Figure: for each doctor, the fixed-effect and random-effect (shrunk) estimates of the proportion diagnosed with warts, with 95% CIs and the overall mean (x-axis 0.00–0.12).]

Analysis of twin data and GLMMs 231/ 258

Fixed and random effects for PID

[Figure: for each doctor, the fixed-effect and random-effect (shrunk) estimates of the proportion diagnosed with PID, with 95% CIs and the overall mean (x-axis 0.00–0.12).]

Analysis of twin data and GLMMs 232/ 258 Summary output from BUGS Warts:

        mean    sd   2.5%    50%  97.5%  Rhat  n.eff
mu     -3.02  0.08  -3.19  -3.02  -2.87     1   5700
tau2    0.10  0.06   0.03   0.09   0.24     1  15000
tau     0.31  0.08   0.18   0.30   0.49     1  15000

PID:

        mean    sd   2.5%    50%  97.5%  Rhat  n.eff
mu     -4.57  0.32  -5.27  -4.55  -3.98     1   1000
tau2    1.67  0.84   0.65   1.47   3.82     1  15000
tau     1.26  0.29   0.81   1.21   1.95     1  15000

Analysis of twin data and GLMMs 233/ 258

Summary output from BUGS PID after removing those physicians with PID frequency > 2.5%:

        mean    sd   2.5%    50%  97.5%  Rhat  n.eff
mu     -4.86  0.27  -5.50  -4.83  -4.39     1   4500
tau2    1.07  0.70   0.29   0.90   2.93     1   4100
tau     0.99  0.30   0.54   0.95   1.71     1   4100

Analysis of twin data and GLMMs 234/ 258

Posterior densities for τ

[Figure: posterior densities of τ for the warts and PID models (τ from 0 to 3, density 0–5; N = 15000, bandwidth = 0.01019).]

Analysis of twin data and GLMMs 235/ 258 L11: Penalised loss functions for model comparison and the DIC Friday 17th August 2012, morning

Lyle Gurrin

Bayesian Data Analysis 13 – 17 August 2012, Copenhagen

The need to compare models Model choice is an important part of data analysis. This lecture: Present the Deviance Information Criterion (DIC) and relate it to a cross-validation procedure for model checking. When should we use the DIC?

Penalised loss functions for model comparison and the DIC 236/ 258

Posterior predictive checking One approach to Bayesian model checking is based on hypothetical replicates of the same process that generated the data. In posterior predictive checking (Chapter 6 of BDA), replicate datasets are simulated using draws from the posterior distribution of the model parameters. The adequacy of the model is assessed by the faithfulness with which the replicates reproduce key features of the original data.

Penalised loss functions for model comparison and the DIC 237/ 258 Two independent datasets Suppose that we have training data Z and test data Y = {Y1,Y2,...,Yn}. We assess model adequacy using a loss function L(Y, Z) which measures the ability to make prediction of Y from Z. Suitable loss functions are derived from scoring rules which measure the utility of a probabilistic forecast of Y represented by a probability distribution p(y).

Penalised loss functions for model comparison and the DIC 238/ 258

The log scoring rule One sensible scoring rule is the log scoring rule

A log{p(y)} + B(y)

for essentially arbitrary constant A and function B(.) of the data Y.

Penalised loss functions for model comparison and the DIC 239/ 258

Parametric models Consider the situation where all candidate models share a common vector of parameters θ (called the focus) and differ only in the (prior) structure for θ. Assume also that Y and Z are conditionally independent given θ, so that p(Y|θ, Z) = p(Y|θ). The log scoring rule then becomes the log likelihood of the data Y as a function of θ, or equivalently the deviance −2 log{p(Y|θ)}.

Penalised loss functions for model comparison and the DIC 240/ 258 Loss functions Two suggested loss functions are the “plug-in” deviance

Lp(Y, Z) = −2 log[ p{Y | θ̄(Z)} ]

where θ̄(Z) = E(θ|Z), and the expected deviance

Le(Y, Z) = −2 ∫ log{p(Y|θ)} p(θ|Z) dθ

where the expectation is taken over the posterior distribution of θ given Z with Y considered fixed.

Penalised loss functions for model comparison and the DIC 241/ 258

Loss functions Both loss functions are derived from the deviance, but there are important differences between Lp and Le.

• The plug-in deviance is sensitive to reparametrisation but the expected deviance is co-ordinate free.
• The plug-in deviance gives equal loss to all models that yield the same posterior expectation of θ, regardless of precision.
• The expected deviance is a function of the full posterior distribution of θ given Z so takes precision into account.

Penalised loss functions for model comparison and the DIC 242/ 258

The problem How to proceed when there are no training data? An obvious idea is to use the test data to estimate θ and assess the fit of the model, that is, re-use the data to create the loss function L(Y, Y). But this is optimistic. “In-sample” prediction will always do better than “out-of-sample” prediction.

Penalised loss functions for model comparison and the DIC 243/ 258 Quantifying the optimism We can get an idea of the degree of optimism for loss functions that decompose into a sum of contributions from each Yi (e.g. independence)

L(Y, Z) = Σ_{i=1}^n L(Yi, Z)

We can gauge the optimism by comparing L(Yi, Y) with L(Yi, Y−i) where Y−i is the data with Yi removed.

Penalised loss functions for model comparison and the DIC 244/ 258

Quantifying the optimism

The expected decrease in loss from using L(Yi, Y) in place of L(Yi, Y−i) is

p_opt_i = E{ L(Yi, Y−i) − L(Yi, Y) | Y−i }

which is the optimism of L(Yi, Y). The loss function

L(Yi, Y) + p_opt_i

has the same expectation given Y−i as the cross-validation loss L(Yi, Y−i) so is equivalent for an observer who has not seen Yi.

Penalised loss functions for model comparison and the DIC 245/ 258

Penalised loss function

The same argument applies to each Yi in turn. Proposal: Use the sum of the penalised loss functions L(Y, Y) + p_opt to assess model accuracy, where

p_opt = Σ_{i=1}^n p_opt_i

is the cost of using the data twice.

Penalised loss functions for model comparison and the DIC 246/ 258 Deviance Information Criterion (DIC) The DIC is defined as

DIC = D̄ + pD

where D̄ = E(D|Y) is a measure of model fit and pD is the "effective number of parameters", a measure of model complexity defined by

pD = D̄ − D{θ̄(Y)}
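A toy sketch of how D̄, pD and DIC are assembled, for a normal-mean model with known variance; the data and the "posterior draws" below are simulated purely for illustration (this is not the course's example).

set.seed(1)
y <- rnorm(20, mean = 1, sd = 1)
theta.draws <- rnorm(5000, mean = mean(y), sd = 1 / sqrt(length(y)))  # posterior draws
deviance <- function(theta) -2 * sum(dnorm(y, theta, 1, log = TRUE))
dev.samples <- sapply(theta.draws, deviance)
Dbar <- mean(dev.samples)                    # posterior mean deviance (fit)
pD   <- Dbar - deviance(mean(theta.draws))   # effective number of parameters
DIC  <- Dbar + pD
c(Dbar = Dbar, pD = pD, DIC = DIC)           # pD should be close to 1 here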

Penalised loss functions for model comparison and the DIC 247/ 258

The effective number of parameters Using previous notation, D = Le(Y, Y) and p D{θ(Y)} = L (Y, Y),sopD canbewrittenas

e p pD = L (Y, Y) − L (Y, Y)

pD can be decomposed into the sum of individual contributions n n e p pD = pD = L (Yi, Y) − L (Yi, Y) i=1 i i=1

Penalised loss functions for model comparison and the DIC 248/ 258

The normal linear model We can do the algebra explicitly for the normal linear model that assumes the Yi are scalar. Both the expected and plug-in deviance have a penalty term like pD_i/(1 − pD_i). But the large-sample behaviour depends on the dimension of θ.

Penalised loss functions for model comparison and the DIC 249/ 258 The normal linear model θ p O n−1 If the dimension of is fixed then Di is ( ) so the penalised losses can be written as

−1 D + kpD + O(n )

where k =1for the plug-in deviance and k =2for the expected deviance. So the penalised plug-in deviance is the same as the DIC for regular linear models with scalar outcomes.

Penalised loss functions for model comparison and the DIC 250/ 258

Example: DIC in hierarchical models In random effects models, however, it is quite common for the dimension of θ to increase with n. The behaviour of the penalised plug-in deviance may be different from DIC. We illustrate this with the normal-normal hierarchical model, also known as the one-way random-effects analysis of variance (ANOVA).

Penalised loss functions for model comparison and the DIC 251/ 258

The Normal-normal hierarchical model The two-level hierarchical model assumes that:

Yi | θi ∼ N(θi, σi²),   θi | μ, τ ∼ N(μ, τ²)

where the variances σ1², σ2², …, σn² are fixed and μ and τ are given noninformative priors.

Penalised loss functions for model comparison and the DIC 252/ 258 The Normal-normal hierarchical model Assume candidate models are indexed by τ,andthe deviance is defined as  n 2 D(θ)= [(yi − θi)/σi] i=1 It can be shown that the contribution to the effective number of parameters from observation i is ρ − ρ p ρ i(1 i) Di = i + n j=1ρj

2 2 2 where ρi = τ /(τ + σi ) is the intra-class correlation coefficient.

Penalised loss functions for model comparison and the DIC 253/ 258

In the limit as τ² → 0 There are two limiting situations. In the limit as τ² → 0, the ANOVA model tends to a pooled model where all observations have the same prior mean μ. In this limit, pD = 1 and both the DIC and penalised plug-in deviance are equal to

Σ_{j=1}^n [ (Yj − Ȳ)/σj ]² + 2   (33)

where

Ȳ = ( Σ_{j=1}^n Yj/σj² ) / ( Σ_{j=1}^n 1/σj² ).   (34)

Penalised loss functions for model comparison and the DIC 254/ 258

In the limit as τ² → ∞ In the limit as τ² → ∞, the ANOVA model tends to a saturated fixed-effects model, in which Y−i contains no information about the mean of Yi. In this limit, pD = n and DIC = 2n, but the penalised plug-in deviance tends to infinity. So there is strong disagreement between DIC and the penalised plug-in deviance in this case.

Penalised loss functions for model comparison and the DIC 255/ 258 Conclusion For linear models with scalar outcomes, DIC is a good approximation to the penalised plug-in p  i deviance whenever Di 1 for all , which implies pD  n.

So pD/n may be used as an indicator of the validity of DIC in such models.

Penalised loss functions for model comparison and the DIC 256/ 258

Further reading Plummer M. (2008). Penalized loss functions for Bayesian model comparison. Biostatistics, 1–17. Spiegelhalter D, Best N, Carlin B, van der Linde A. (2002). Bayesian measures of model complexity and fit (with discussion). Journal of the Royal Statistical Society, Series B 64, 583–639. The WinBUGS website, which includes slides from the "IceBUGS" workshop in 2006.

Penalised loss functions for model comparison and the DIC 257/ 258 L12: Comparing methods of measurement in Stata, SAS, R and BUGS Friday 17th August 2012, afternoon

Bendix Carstensen

Bayesian Data Analysis 13 – 17 August 2012, Copenhagen Comparing methods of measurement in Stata, SAS, R and BUGS Comparing methods of measurement using the MethComp package in R...

Comparing methods of measurement in Stata, SAS, R and BUGS 258/ 258 Method Comparison studies

Bendix Carstensen Steno Diabetes Center, Denmark & Department of Biostatistics, University of Copenhagen [email protected] http:\BendixCarstensen.com August 2012 PDAwBuR

Two methods for measuring fat content in human milk:

[Figure: Trig against Gerber measurements of fat content (both 1–6). The relationship looks like y1 = a + b·y2.]

1/ 33 Two methods — one measurement by each How large is the difference between a measurement with method 1 and one with method 2 on a (randomly chosen) person?

Di = y2i − y1i,   D̄,   s.d.(D)

"Limits of agreement:"

D̄ ± 2 × s.d.(D)

95% prediction interval for the difference between a measurement by method 1 and one by method 2. [?, ?]
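A two-line R sketch of this calculation, with made-up measurements y1 and y2 (one per person by each method):

y1 <- c(4.5, 5.1, 3.9, 6.2, 5.5)
y2 <- c(4.7, 5.0, 4.2, 6.4, 5.3)
D  <- y2 - y1
mean(D) + c(-2, 2) * sd(D)    # limits of agreement: D-bar +/- 2 s.d.(D)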

2/ 33

Limits of agreement:

Plot differences (Di) against averages (Ai).

[Figure: Trig − Gerber against (Trig + Gerber)/2, with horizontal lines at −0.17, −0.00 and 0.17 (the limits of agreement and the mean difference).]

3/ 33

Model in “Limits of agreement” Methods m =1,...,M, applied to i =1,...,I individuals:

ymi = αm + μi + emi,   emi ∼ N(0, σm²)   (measurement error)

• Two-way analysis of variance model, with unequal variances in columns.
• Different variances are not identifiable without replicate measurements for M = 2 because the variances cannot be separated.

4/ 33 A more general model

y1i = α1 + β1 μi + e1i,   e1i ∼ N(0, σ1²)
y2i = α2 + β2 μi + e2i,   e2i ∼ N(0, σ2²)

• Work out the prediction of y1 given an observation of y2 in terms of these parameters.
• Work out how differences relate to averages in terms of these parameters.
• ... including the variances for deriving prediction intervals.

5/ 33

Conversion equation with prediction limits

[Figure: capillary blood against venous plasma (both 4–14) with the conversion line and prediction limits.]

6/ 33

Replicate measurements Fat data; exchangeable replicates:

item  repl   KL   SL
   1     1  4.5  4.9
   1     2  4.4  5.0
   1     3  4.7  4.8
   3     1  6.4  6.5
   3     2  6.2  6.4
   3     3  6.5  6.1

Oximetry data; linked replicates:

item  repl    CO  pulse
   1     1  78.0     71
   1     2  76.4     72
   1     3  77.2     73
   2     1  68.7     68
   2     2  67.6     67
   2     3  68.3     68

Linked or exchangeable replicates!

7/ 33 Extension of the simple model: exchangeable replicates

ymir = αm + μi + cmi + emir
s.d.(cmi) = τm — "matrix" effect
s.d.(emir) = σm — measurement error

• Replicates within (m, i) are needed to separate τ and σ.
• Even with replicates, the separate τs are only estimable if M > 2.
• Still assumes that the difference between methods is constant.
• Assumes exchangeability of replicates.

8/ 33

Extension of the model: linked replicates

ymir = αm + μi + air + cmi + emir
s.d.(air) = ω — between replicates
s.d.(cmi) = τm — "matrix" effect
s.d.(emir) = σm — measurement error

• Still assumes that the difference between methods is constant.
• Replicates are linked between methods: air is common across methods, i.e. the first replicate on a person is made under similar conditions for all methods (i.e. at a specific day or the like).

9/ 33

Extension with non-constant bias

ymir = αm + βm μi + random effects

There is now a scaling between the methods. Methods do not measure on the same scale — the relative scaling is estimated; between methods 1 and 2 the scale is β2/β1. Consequence: Multiplication of all measurements on one method by a fixed number does not change the results of the analysis: the corresponding β is multiplied by the same factor, as are the variance components for this method.

10/ 33 Variance components Two-way interactions:

ymir = αm + βm(μi + air + cmi)+emir

The random effects cmi and emir have variances specific for each method.

But air does not depend on m — must be scaled to each of the methods by the corresponding βm.

Implies that ω = s.d.(air) is irrelevant — the scale is arbitrary. The relevant quantities are βmω —the between replicate variation within item as measured on the mth scale.

11/ 33

Predicting method 2 from method 1

y10r = α1 + β1(μ0 + a0r + c10) + e10r
y20r = α2 + β2(μ0 + a0r + c20) + e20r
⇓
y20r = α2 + (β2/β1)(y10r − α1 − e10r) + β2(−c10 + c20) + e20r

The random effects have expectation 0, so:

E(y20 | y10) = ŷ20 = α2 + (β2/β1)(y10 − α1)

12/ 33

y20r = α2 + (β2/β1)(y10r − α1 − e10r) + β2(−c10 + c20) + e20r

var(ŷ20 | y10) = (β2/β1)²(β1²τ1² + σ1²) + (β2²τ2² + σ2²)

The slope of the prediction line from method 1 to method 2 is β2/β1. The width of the prediction interval is:

2 × 2 × sqrt( (β2/β1)²(β1²τ1² + σ1²) + (β2²τ2² + σ2²) )

13/ 33

If we do the prediction the other way round (y1|y2) we get the same relationship, i.e. a line with the inverse slope, β1/β2. The width of the prediction interval in this direction is (by permutation of indices):

2 × 2 × sqrt( (β1²τ1² + σ1²) + (β1/β2)²(β2²τ2² + σ2²) )
= 2 × 2 × (β1/β2) sqrt( (β2/β1)²(β1²τ1² + σ1²) + (β2²τ2² + σ2²) )

i.e. if we draw the prediction limits as straight lines they can be used both ways.

14/ 33

Conversion equation with prediction limits

[Figure: capillary blood against venous plasma (both 4–14) with the conversion line and prediction limits.]

15/ 33

[Figure: pulse against CO (both 20–100) with conversion lines and prediction limits: pulse = 2.11 + 0.94·CO (SD 6.00); CO = −2.25 + 1.06·pulse (SD 6.39).]

16/ 33

Variance components

ymir = αm + βm(μi + air + cmi) + emir

The total variance of a measurement is: βm²ω² + βm²τm² + σm²

These are the variance components returned by AltReg or MCmcmc using print.MCmcmc and shown by post.MCmcmc.

17/ 33

Backtransformation for plotting

prpulse  <- seq(20,100,1)
lprpulse <- log( prpulse / (100-prpulse) )
lprCO    <- ARoxt["CO",2] + ARoxt["CO",4]*lprpulse
lprCOlo  <- ARoxt["CO",2] + ARoxt["CO",4]*lprpulse - 2*sd.CO.pred
lprCOhi  <- ARoxt["CO",2] + ARoxt["CO",4]*lprpulse + 2*sd.CO.pred
prCO <- 100/(1+exp(-cbind( lprCO, lprCOlo, lprCOhi )))
prCO[nrow(prCO),] <- 100

But this is not necessary; it is implemented in plot.MethComp:

plot( ARoxt, pl.type="conv" )

18/ 33

[Figure: pulse against CO (both 0–100), as drawn by plot.MethComp, with conversion lines and prediction limits: pulse = 2.03 + 0.94·CO (SD 6.01); CO = −2.16 + 1.06·pulse (SD 6.38).]

19/ 33

prpulse <- cbind( prpulse, prpulse, prpulse )
with( to.wide(ox),
      plot( (CO+pulse)/2, CO-pulse, pch=16,
            ylim=c(-40,40), xlim=c(20,100), xaxs="i", yaxs="i" ) )
abline( h=-4:4*10, v=2:10*10, col=gray(0.8) )
matlines( (prCO+prpulse)/2, prCO-prpulse,
          lwd=c(3,1,1), col="blue", lty=1 )

But this is not necessary; it is implemented in plot.MethComp:

plot( ARoxt, pl.type="BA" )

20/ 33

[Figure: Bland–Altman plot of pulse − CO (−40 to 40) against (CO + pulse)/2 (0–100), with the converted prediction limits: pulse = 2.03 + 0.94·CO (SD 6.01); CO = −2.16 + 1.06·pulse (SD 6.38).]

21/ 33

Implementation in JAGS

ymir = αm + βm(μi + air + cmi) + emir

Non-linear hierarchical model: implement in JAGS.

• The model is symmetrical in methods.
• The mean is overparametrized.
• Choose a prior (and hence posterior!) for the μs with finite support.
• Keeps the chains nicely in place.

This is the philosophy in the function MCmcmc.

22/ 33 Results from fitting the model

The posterior dist’n of (αm,βm,μi) is singular.

But the relevant translation quantities are identifiable:

α2|1 = α2 − α1β2/β1

β2|1 = β2/β1

So are the variance components.

Posterior medians used to devise prediction equations with limits.

23/ 33

The MethComp package for R Implemented model:

ymir = αm + βm(μi + air + cmi)+emir

• Replicates required.
• R2WinBUGS, BRugs or JAGS is required.
• Dataframe with variables meth, item, repl and y (a Meth object).
• The function MCmcmc writes a BUGS program, initial values and data to files.
• Runs BUGS and sucks the results back into R, and gives a nice overview of the conversion equations.

24/ 33

Example output: Oximetry

> library(MethComp)
Loading required package: nlme
> data(ox)
> ox <- Meth(ox)
The following variables from the dataframe "ox" are used as the Meth variables:
meth: meth
item: item
repl: repl
y: y
        #Replicates
Method  1  2  3   #Items  #Obs: 354  Values:  min   med   max
CO      1  4 56       61        177           22.2  78.6  93.5
pulse   1  4 56       61        177           24.0  75.0  94.0
> system.time( MCox <- MCmcmc( ox, n.iter=10000 ) )
   user  system elapsed
 115.07    0.31  118.62
> system.time( Jox <- MCmcmc( ox, n.iter=10000, program="jags"
   user  system elapsed
  82.26    0.04   82.47

25/ 33 > Jox

Conversion between methods:
              alpha   beta  sd.pred  int(t-f)  slope(t-f)  sd(t-f)
To:    From:
CO     CO     0.000  1.000    2.486     0.000       0.000    2.486
       pulse -5.386  1.108    5.026    -5.110       0.102    4.769
pulse  CO     4.862  0.903    4.540     5.110      -0.102    4.772
       pulse  0.000  1.000    5.986     0.000       0.000    5.986

Variance components (sd):
        s.d.
Method    IxR    MxI    res
CO      3.797  3.242  1.758
pulse   3.410  2.919  4.233

26/ 33

Variance components with 95 % cred.int.:
 method        CO                      pulse
 qnt          50%    2.5%   97.5%      50%    2.5%   97.5%
 SD IxR     3.797   3.078   4.525    3.410   2.795   4.097
    MxI     3.242   2.418   4.342    2.919   2.167   3.935
    res     1.758   0.446   2.733    4.233   3.559   4.917
    tot     5.326   4.676   6.151    6.219   5.569   6.940

Mean parameters with 95 % cred.int.:
                      50%     2.5%    97.5%  P(>0/1)
 alpha[pulse.CO]    4.863   -2.831   12.202    0.877
 alpha[CO.pulse]   -5.384  -15.227    2.828    0.123
 beta[pulse.CO]     0.903    0.807    1.003    0.029
 beta[CO.pulse]     1.108    0.997    1.239    0.971

27/ 33

Transformed variance components

> Jox <- MCmcmc( ox, n.iter=10000, program="jags" )
> MethComp( Jox )$VarComp
        s.d.
Method       IxR      MxI      res
CO      3.832491 3.199204 1.670209
pulse   3.422194 2.862554 4.266671

> tJox <- MCmcmc( ox, n.iter=10000, program="jags", Transform="p
> MethComp(tJox)$VarComp
        s.d.
Method        IxR       MxI       res
CO      0.2575410 0.1811183 0.1243838
pulse   0.2232714 0.1565227 0.2031203

28/ 33 Transformation If the data do not exhibit:

• a linear relationship between methods
• constant variation across the range of measurements

— transform by some function, e.g. logit, and then do the analysis. Report on the original scale.

29/ 33

[Figure: CO against pulse (both 0–100) with conversion lines: CO = −6.00 + 1.12·pulse (5.07); pulse = 5.38 + 0.90·CO (4.55).]

30/ 33

[Figure: the same CO against pulse plot shown without the conversion equations.]

31/ 33

[Figure: Bland–Altman plot of CO − pulse (−40 to 40) against (CO + pulse)/2 (0–100), with CO = −6.00 + 1.12·pulse (5.07), pulse = 5.38 + 0.90·CO (4.55) and CO − pulse = −5.68 + 0.11·(CO + pulse)/2 (4.80).]

32/ 33

[Figure: the same Bland–Altman plot shown without the conversion equations.]

33/ 33 Method Comparison studies

Bendix Carstensen Steno Diabetes Center, Denmark & Department of Biostatistics, University of Copenhagen [email protected] http:\BendixCarstensen.com August 2012 PDAwBuR

Two methods for measuring fat content in human milk:

●● The ● ● ● ● relationship

● ●

Tr ig looks like: ● ● ●●● ● ● ●●● ●● ● ● y1 = a + by2 ●● ● ● ● ● ● ●● ● ● ● ● ● 123456 ● ●

123456 Gerber

1/ 33

Two methods — one measurement by each How large is the difference between a measurement with method 1 and one with method 2 on a (randomly chosen) person?

Di = y2i − y1i, D,¯ s.d.(D)

“Limits of agreement:”

D¯ ± 2 × s.d.(D)

95% prediction interval for the difference between a measurement by method 1 and one by method 2. [?, ?]

2/ 33 Limits of agreement:

● 0.17 Plot ● ● ● ● ● ● ● ● ● differences ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● −0.00 D ● ● ● ●● ● ( i) versus ● ● ● ● ● ● ● ● Trig − Gerber Trig ● averages −0.17 ● A ● ( i). −0.4 −0.2 0.0 0.2 0.4

123456 ( Trig + Gerber ) / 2

3/ 33

Model in “Limits of agreement” Methods m =1,...,M, applied to i =1,...,I individuals:

ymi = αm + μi + emi 2 emi ∼N(0,σm) measurement error

 Two-way analysis of variance model, with unequal variances in columns.  Different variances are not identifiable without replicate measurements for M =2because the variances cannot be separated.

4/ 33

A more general model

2 y1i = α1 + β1μi + e1i,e1i ∼N(0,σ1) 2 y2i = α2 + β2μi + e2i,e2i ∼N(0,σ2)

 Work out the prediction of y1 given an observation of y2 in terms of these parameters.  Work out how differences relate to averages in terms of these parameters.  . . . including the variances for deriving prediction intervals.

5/ 33 A more general model

2 y1i = α1 + β1μi + e1i,e1i ∼N(0,σ1) 2 y2i = α2 + β2μi + e2i,e2i ∼N(0,σ2)

 Work out the prediction of y1 given an observation of y2 in terms of these parameters.  Work out how differences relate to averages in terms of these parameters.  . . . including the variances for deriving prediction intervals.

5/ 33

A more general model

2 y1i = α1 + β1μi + e1i,e1i ∼N(0,σ1) 2 y2i = α2 + β2μi + e2i,e2i ∼N(0,σ2)

 Work out the prediction of y1 given an observation of y2 in terms of these parameters.  Work out how differences relate to averages in terms of these parameters.  . . . including the variances for deriving prediction intervals.

5/ 33

Conversion equation with prediction limits

14

12 ●

●● 10 ● ● ● ● ● ● ● ● ● ● 8 ● Capillary blood ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● 6 ● ● ●● ● ● ● ●

● 4

468101214 Venous plasma

6/ 33 Conversion equation with prediction limits

14

12 ●

●● 10 ●● ● ● ● ● ● ● ● ● ● 8 ● Capillary blood ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● 6 ● ● ●● ● ● ● ●

● 4

468101214 Venous plasma

6/ 33

Conversion equation with prediction limits

14

12 ●

●● 10 ● ● ● ● ● ● ● ● ● ● ● 8 ● Capillary blood ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● 6 ● ● ●● ● ● ● ●

● 4

468101214 Venous plasma

6/ 33

Conversion equation with prediction limits

14

12 ●

●● 10 ●● ● ● ● ● ● ● ● ● ● ● 8 ● Capillary blood ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● 6 ● ● ●● ● ● ● ●

● 4

468101214 Venous plasma

6/ 33 Replicate measurements Fat data; exchangeable replicates: item repl KL SL 1 1 4.5 4.9 1 2 4.4 5.0 1 3 4.7 4.8 3 1 6.4 6.5 3 2 6.2 6.4 3 3 6.5 6.1

Oximetry data; linked replicates: item repl CO pulse 1 1 78.0 71 1 2 76.4 72 1 3 77.2 73 2 1 68.7 68 2 2 67.6 67 2 3 68.3 68

Linked or exchangeable replicates!

7/ 33

Extension of the simple model: exchangeable replicates

ymir = αm + μi + cmi + emir s.d.(cmi)=τm — “matrix”-effect s.d.(emir)=σm — measurement error

 Replicates within (m, i) are needed to separate τ and σ.  Even with replicates, the separate τs are only estimable if M>2.  Still assumes that the difference between methods is constant.  Assumes exchangeability of replicates.

8/ 33

Extension of the model: linked replicates ymir = αm + μi + air + cmi + emir s.d.(air)=ω — between replicates s.d.(cmi)=τm — “matrix”-effect s.d.(emir)=σm — measurement error

 Still assumes that the difference between methods is constant.  Replicates are linked between methods: air is common across methods, i.e. the first replicate on a person is made under similar conditions for all methods (i.e. at a specific day or the like).

9/ 33 Extension with non-constant bias

ymir = αm + βmμi + random effects There is now a scaling between the methods. Methods do not measure on the same scale — the relative scaling is estimated, between method 1 and 2 the scale is β2/β1. Consequence: Multiplication of all measurements on one method by a fixed number does not change results of analysis: The corresponding β is multiplied by the same factor as is the variance components for this method.

10/ 33

Variance components

Two-way interactions:

y_mir = α_m + β_m(μ_i + a_ir + c_mi) + e_mir

The random effects c_mi and e_mir have variances specific to each method.

But a_ir does not depend on m — it must be scaled to each of the methods by the corresponding β_m.

This implies that ω = s.d.(a_ir) is irrelevant — its scale is arbitrary. The relevant quantities are β_mω, the between-replicate variation within item as measured on the mth method's scale.

11/ 33

Predicting method 2 from method 1

y_10r = α_1 + β_1(μ_0 + a_0r + c_10) + e_10r
y_20r = α_2 + β_2(μ_0 + a_0r + c_20) + e_20r
          ⇓
y_20r = α_2 + (β_2/β_1)(y_10r − α_1 − e_10r) + β_2(−c_10 + c_20) + e_20r

The random effects have expectation 0, so:

E(y_20 | y_10) = ŷ_20 = α_2 + (β_2/β_1)(y_10 − α_1)

12/ 33

y_20r = α_2 + (β_2/β_1)(y_10r − α_1 − e_10r) + β_2(−c_10 + c_20) + e_20r

var(ŷ_20 | y_10) = (β_2/β_1)²(β_1²τ_1² + σ_1²) + (β_2²τ_2² + σ_2²)

The slope of the prediction line from method 1 to method 2 is β_2/β_1. The width of the prediction interval is:

  2 × 2 × √( (β_2/β_1)²(β_1²τ_1² + σ_1²) + (β_2²τ_2² + σ_2²) )
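A small R sketch of these formulas (hypothetical parameter values, not estimates from the slides), drawing the conversion line from method 1 to method 2 with approximate 95% prediction limits:

alpha <- c( 0.5, -0.2 )    # hypothetical alpha_1, alpha_2
beta  <- c( 1.00, 0.95 )   # hypothetical beta_1, beta_2
tau   <- c( 0.30, 0.25 )   # hypothetical tau_1, tau_2
sigma <- c( 0.20, 0.35 )   # hypothetical sigma_1, sigma_2

y1 <- seq( 4, 14, 0.1 )
y2.hat  <- alpha[2] + (beta[2]/beta[1]) * ( y1 - alpha[1] )
sd.pred <- sqrt( (beta[2]/beta[1])^2 * (beta[1]^2*tau[1]^2 + sigma[1]^2) +
                 (beta[2]^2*tau[2]^2  + sigma[2]^2) )
matplot( y1, cbind( y2.hat, y2.hat - 2*sd.pred, y2.hat + 2*sd.pred ),
         type="l", lty=c(1,2,2), col="blue",
         xlab="Method 1", ylab="Method 2" )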

13/ 33


If we do the prediction the other way round (y_1 | y_2) we get the same relationship, i.e. a line with the inverse slope, β_1/β_2. The width of the prediction interval in this direction is (by permutation of indices):

  2 × 2 × √( (β_1/β_2)²(β_2²τ_2² + σ_2²) + (β_1²τ_1² + σ_1²) )
    = (β_1/β_2) × 2 × 2 × √( (β_2/β_1)²(β_1²τ_1² + σ_1²) + (β_2²τ_2² + σ_2²) )

i.e. if we draw the prediction limits as straight lines they can be used both ways.

14/ 33

Conversion equation with prediction limits

[Scatter plot: Capillary blood versus Venous plasma (both 4–14), with the conversion line and prediction limits.]

15/ 33

[Scatter plot of pulse versus CO (both 20–100), with conversion lines and prediction limits:
  pulse = 2.11 + 0.94 CO  ( 6.00 )
  CO = −2.25 + 1.06 pulse  ( 6.39 )]

16/ 33

Variance components

y_mir = α_m + β_m(μ_i + a_ir + c_mi) + e_mir

The total variance of a measurement is β_m²ω² + β_m²τ_m² + σ_m², i.e. a total standard deviation of

  √( β_m²ω² + β_m²τ_m² + σ_m² )

These are the variance components returned by AltReg or MCmcmc (using print.MCmcmc) and shown by post.MCmcmc.
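A quick numerical check (using the CO variance components from the output later in these slides): the total s.d. is the square root of the sum of the squared components.

sqrt( 3.797^2 + 3.242^2 + 1.758^2 )   # about 5.29, cf. the reported "tot" of 5.326
# (posterior medians of the components and of the total are summarized separately,
#  so they need not combine exactly)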

17/ 33

Backtransformation for plotting

prpulse  <- seq(20,100,1)                    # grid of pulse oximetry values (%)
lprpulse <- log( prpulse / (100-prpulse) )   # logit of the percentages
lprCO    <- ARoxt["CO",2] + ARoxt["CO",4]*lprpulse                  # predicted CO, logit scale
lprCOlo  <- ARoxt["CO",2] + ARoxt["CO",4]*lprpulse - 2*sd.CO.pred   # lower prediction limit
lprCOhi  <- ARoxt["CO",2] + ARoxt["CO",4]*lprpulse + 2*sd.CO.pred   # upper prediction limit
prCO <- 100/(1+exp(-cbind( lprCO, lprCOlo, lprCOhi )))              # back to the % scale
prCO[nrow(prCO),] <- 100                     # fix the upper endpoint at 100%

But this is not necessary; it is implemented in plot.MethComp:

plot( ARoxt, pl.type="conv" )

18/ 33

[Scatter plot of pulse versus CO (both 0–100), with conversion lines and prediction limits:
  pulse = 2.03 + 0.94 CO  ( 6.01 )
  CO = −2.16 + 1.06 pulse  ( 6.38 )]

19/ 33

Transformation to a Bland-Altman plot

Just convert to the differences versus the averages:

prpulse <- cbind( prpulse, prpulse, prpulse )   # same x-values for all three curves
with( to.wide(ox),
      plot( (CO+pulse)/2, CO-pulse, pch=16,
            ylim=c(-40,40), xlim=c(20,100), xaxs="i", yaxs="i" ) )
abline( h=-4:4*10, v=2:10*10, col=gray(0.8) )   # background grid
matlines( (prCO+prpulse)/2, prCO-prpulse,       # conversion line and prediction limits
          lwd=c(3,1,1), col="blue", lty=1 )     #   on the difference-average scale

But this is not necessary; it is implemented in plot.MethComp:

plot( ARoxt, pl.type="BA" )

20/ 33

[Bland-Altman style plot of pulse − CO versus (CO + pulse)/2 (0–100, differences within ±40), with conversion lines and prediction limits and the relations:
  pulse = 2.03 + 0.94 CO  ( 6.01 )
  CO = −2.16 + 1.06 pulse  ( 6.38 )]

21/ 33

Implementation in JAGS

y_mir = α_m + β_m(μ_i + a_ir + c_mi) + e_mir

Non-linear hierarchical model: implement in JAGS.

• The model is symmetrical in methods.
• The mean is overparametrized.
• Choose a prior (and hence posterior!) for the μs with finite support.
• This keeps the chains nicely in place.

This is the philosophy in the function MCmcmc.

22/ 33

Results from fitting the model

The posterior distribution of (α_m, β_m, μ_i) is singular.

But the relevant translation quantities are identifiable:

α_2|1 = α_2 − α_1 β_2/β_1

β_2|1 = β_2/β_1

So are the variance components.

Posterior medians used to devise prediction equations with limits.
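A hedged illustration of this reparametrization (hypothetical posterior medians, not the slides' output):

alpha <- c( 10.2, 8.7 )    # hypothetical posterior medians of alpha_1, alpha_2
beta  <- c( 1.05, 0.98 )   # hypothetical posterior medians of beta_1, beta_2
beta.2.1  <- beta[2] / beta[1]                 # slope for converting method 1 to 2
alpha.2.1 <- alpha[2] - alpha[1] * beta.2.1    # intercept for converting method 1 to 2
c( alpha.2.1, beta.2.1 )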

23/ 33

The MethComp package for R

Implemented model:

y_mir = α_m + β_m(μ_i + a_ir + c_mi) + e_mir

• Replicates required.
• R2WinBUGS, BRugs or JAGS is required.
• Data frame with variables meth, item, repl and y (a Meth object).
• The function MCmcmc writes a BUGS program, initial values and data to files.
• It runs BUGS, reads the results back into R, and gives a nice overview of the conversion equations.

24/ 33

Example output: Oximetry

> library(MethComp)
Loading required package: nlme
> data(ox)
> ox <- Meth(ox)
The following variables from the dataframe "ox" are used as the Meth variables:
meth: meth
item: item
repl: repl
y: y

             #Replicates
Method  1  2  3 #Items #Obs: 354 Values:  min  med  max
  CO    1  4 56     61        177         22.2 78.6 93.5
 pulse  1  4 56     61        177         24.0 75.0 94.0

> system.time( MCox <- MCmcmc( ox, n.iter=10000 ) )
   user  system elapsed
 115.07    0.31  118.62
> system.time( Jox <- MCmcmc( ox, n.iter=10000, program="jags" ) )
   user  system elapsed
  82.26    0.04   82.47

25/ 33 > Jox

Conversion between methods:
              alpha  beta sd.pred int(t-f) slope(t-f) sd(t-f)
 To:   From:
 CO    CO     0.000 1.000   2.486    0.000      0.000   2.486
       pulse -5.386 1.108   5.026   -5.110      0.102   4.769
 pulse CO     4.862 0.903   4.540    5.110     -0.102   4.772
       pulse  0.000 1.000   5.986    0.000      0.000   5.986

Variance components (sd):
        s.d.
Method    IxR   MxI   res
 CO     3.797 3.242 1.758
 pulse  3.410 2.919 4.233

26/ 33

Variance components with 95 % cred.int.:
method      CO                    pulse
qnt        50%   2.5%  97.5%      50%   2.5%  97.5%
SD
 IxR     3.797  3.078  4.525    3.410  2.795  4.097
 MxI     3.242  2.418  4.342    2.919  2.167  3.935
 res     1.758  0.446  2.733    4.233  3.559  4.917
 tot     5.326  4.676  6.151    6.219  5.569  6.940

Mean parameters with 95 % cred.int.:
                     50%     2.5%   97.5% P(>0/1)
alpha[pulse.CO]    4.863   -2.831  12.202   0.877
alpha[CO.pulse]   -5.384  -15.227   2.828   0.123
 beta[pulse.CO]    0.903    0.807   1.003   0.029
 beta[CO.pulse]    1.108    0.997   1.239   0.971

27/ 33

Transformed variance components

> Jox <- MCmcmc( ox, n.iter=10000, program="jags" )
> MethComp( Jox )$VarComp
        s.d.
Method       IxR      MxI      res
 CO     3.832491 3.199204 1.670209
 pulse  3.422194 2.862554 4.266671

> tJox <- MCmcmc( ox, n.iter=10000, program="jags", Transform="p
> MethComp( tJox )$VarComp
        s.d.
Method        IxR       MxI       res
 CO     0.2575410 0.1811183 0.1243838
 pulse  0.2232714 0.1565227 0.2031203

28/ 33

Transformation

If the data do not exhibit:

• a linear relationship between methods
• constant variation across the range of measurements

— transform by some function, e.g. the logit, and then do the analysis. Report on the original scale.
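A minimal sketch of such a transformation for percentage data, mirroring the back-transformation code earlier in these slides (the function names are illustrative, not MethComp functions):

to.pct.logit   <- function(p) log( p/(100-p) )   # % scale -> logit scale
from.pct.logit <- function(l) 100/(1+exp(-l))    # logit scale -> % scale
from.pct.logit( to.pct.logit( c(25, 50, 75) ) )  # returns 25 50 75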

29/ 33

[Scatter plot of CO versus pulse (both 0–100), with conversion lines and prediction limits:
  CO = −6.00 + 1.12 pulse  ( 5.07 )
  pulse = 5.38 + 0.90 CO  ( 4.55 )]

30/ 33

[The same scatter plot of CO versus pulse, shown without the printed equations.]

31/ 33

[Bland-Altman style plot of CO − pulse versus (CO + pulse)/2 (0–100, differences within ±40), with the relations:
  CO = −6.00 + 1.12 pulse  ( 5.07 )
  pulse = 5.38 + 0.90 CO  ( 4.55 )
  CO − pulse = −5.68 + 0.11 (CO + pulse)/2  ( 4.80 )]

32/ 33

[The same Bland-Altman style plot of CO − pulse versus (CO + pulse)/2, shown without the printed equations.]

33/ 33