MODELS

Likelihood and Reference Bayesian Analysis

Mike West

May 3, 1999

1 Straight Line Regression Models

We begin with the simple straight line regression model
$$ y_i = \alpha + \beta x_i + \epsilon_i $$
where the design points $x_i$ are fixed in advance, and the measurement/observation errors $\epsilon_i$ are independent and normally distributed, $\epsilon_i \sim N(0, \sigma^2)$, for each $i = 1, \ldots, n$. In this context, we have looked at general modelling questions, and the fitting of least squares estimates of $\alpha$ and $\beta$. Now we turn to more formal likelihood and Bayesian analyses.

1.1 Likelihood and MLEs

The formal parametric inference problem is a multi-parameter problem: we require inferences on the three parameters $(\alpha, \beta, \sigma^2)$. The likelihood function has a simple enough form, as we now show. Throughout, we do not indicate the design points in conditioning statements, though they are implicitly conditioned upon. Write $Y = \{y_1, \ldots, y_n\}$ and $X = \{x_1, \ldots, x_n\}$. Given $X$ and the model parameters, each $y_i$ is the corresponding zero-mean normal random quantity $\epsilon_i$ plus the term $\alpha + \beta x_i$, so that $y_i$ is normal with this term as its mean and variance $\sigma^2$. Also, since the $\epsilon_i$ are independent, so are the $y_i$. Thus
$$ (y_i \mid \alpha, \beta, \sigma^2) \sim N(\alpha + \beta x_i, \sigma^2) $$
with conditional density function
$$ p(y_i \mid \alpha, \beta, \sigma^2) = \exp\{-(y_i - \alpha - \beta x_i)^2/(2\sigma^2)\}/(2\pi\sigma^2)^{1/2} $$
for each $i$. Also, by independence, the joint density function is
$$ p(Y \mid \alpha, \beta, \sigma^2) = \prod_{i=1}^n p(y_i \mid \alpha, \beta, \sigma^2). $$

Given the observed response values $Y$ this provides the likelihood function for the three parameters: the joint density above, evaluated at the observed and hence now fixed values of $Y$, now viewed as a function of $(\alpha, \beta, \sigma^2)$ as they vary across possible parameter values. It is clear that this likelihood function is given by
$$ p(Y \mid \alpha, \beta, \sigma^2) \propto \exp(-Q(\alpha, \beta)/(2\sigma^2))/\sigma^n $$
where
$$ Q(\alpha, \beta) = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2 $$
and where a constant term $(1/2\pi)^{n/2}$ has been dropped.

Let us look at computing the joint maximum likelihood estimates (MLEs) of the three parameters. This involves finding the values $(\hat\alpha, \hat\beta, \hat\sigma^2)$ such that $p(Y \mid \hat\alpha, \hat\beta, \hat\sigma^2) > p(Y \mid \alpha, \beta, \sigma^2)$ for any other parameter values $(\alpha, \beta, \sigma^2)$. We do this as follows.

- For any fixed value of $\sigma^2$, the likelihood function is a strictly decreasing function of $Q(\alpha, \beta)$. Hence, changing $(\alpha, \beta)$ to decrease the value of $Q$ implies that the value of the likelihood function increases. Clearly, choosing $(\alpha, \beta)$ to minimise $Q$ implies that we maximise the likelihood function. As a result, the MLEs of $(\alpha, \beta)$ for any specific value of $\sigma^2$ are simply the LSEs.

- From the earlier discussion of least squares estimation, we know that the LSEs $(\hat\alpha, \hat\beta)$ do not, in fact, depend on the value of the variance $\sigma^2$. As a result, the full three-dimensional maximisation is solved at the LSE values $(\hat\alpha, \hat\beta)$ and by choosing $\hat\sigma^2$ to maximise $p(Y \mid \hat\alpha, \hat\beta, \sigma^2)$ as a function of just $\sigma^2$. This trivially leads to
$$ \hat\sigma^2 = \sum_{i=1}^n \hat\epsilon_i^2 / n $$
where $\hat\epsilon_i = y_i - \hat\alpha - \hat\beta x_i$ for each $i$; this MLE of $\sigma^2$ is the usual residual sum of squares with divisor $n$.
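As a quick numerical illustration of these formulae, here is a minimal Python sketch (using numpy) that computes the LSEs and the MLE of $\sigma^2$ for simulated straight-line data; all names and values (x, y, alpha_hat, and so on) are hypothetical and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data from a straight line model (illustrative values only).
n = 30
x = np.linspace(0.0, 10.0, n)
alpha_true, beta_true, sigma_true = 2.0, 0.5, 1.0
y = alpha_true + beta_true * x + rng.normal(0.0, sigma_true, size=n)

# Least squares estimates: beta_hat = S_xy / S_xx, alpha_hat = ybar - beta_hat * xbar.
xbar, ybar = x.mean(), y.mean()
S_xx = np.sum((x - xbar) ** 2)
S_xy = np.sum((x - xbar) * (y - ybar))
beta_hat = S_xy / S_xx
alpha_hat = ybar - beta_hat * xbar

# Fitted residuals and the MLE of sigma^2 (divisor n, as in the text).
resid = y - alpha_hat - beta_hat * x
sigma2_mle = np.sum(resid ** 2) / n

print(alpha_hat, beta_hat, sigma2_mle)
```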

1.2 Reference Bayesian Analyses

1.2.1 Parametrisation and Reference Prior

To develop the reference Bayesian posterior distribution for $(\alpha, \beta, \sigma^2)$ it is traditional to reparameterise from $\sigma^2$ to the precision parameter $\phi = 1/\sigma^2$. This is done simply for clarity of exposition and ease of development. Plugging $\sigma^2 = 1/\phi$ into the likelihood function leads simply to
$$ p(Y \mid \alpha, \beta, \phi) \propto \phi^{n/2} \exp(-\phi Q(\alpha, \beta)/2), $$
defining the joint likelihood function for the three parameters $(\alpha, \beta, \phi)$.

Bayesian inference requires a prior $p(\alpha, \beta, \phi)$, a trivariate prior density defined on the model parameter space. Here we use the traditional reference prior in which:

- The three parameters are independent, so $p(\alpha, \beta, \phi) = p(\alpha)p(\beta)p(\phi)$.

- As a function of $\alpha$, the log-likelihood function is quadratic, indicating that the likelihood function will contribute a normal form in $\alpha$. Hence a normal prior for $\alpha$ would be conjugate, leading to a normal posterior. On this basis, a normal prior with an extremely large variance (as in reference analysis of normal models) will represent a vague or uninformative prior position. Taking the formal limit of a normal prior with a variance tending to infinity provides the traditional reference prior $p(\alpha) \propto$ constant.

- The same reasoning applies to $\beta$, leading to the traditional reference prior $p(\beta) \propto$ constant.

- As a function of $\phi$ alone, the likelihood function has the same form as that of a gamma density function in $\phi$. Hence a gamma prior for $\phi$ would be conjugate, leading to a gamma posterior. On this basis, a gamma prior with very small defining parameters (as in reference analysis of Poisson or exponential models) will represent a vague or uninformative prior position. Taking the formal limit of the $\phi \sim \text{Gamma}(a, b)$ prior at $a = b = 0$ provides the traditional reference prior $p(\phi) \propto \phi^{-1}$.

Combining these components produces the standard non-informative/reference prior $p(\alpha, \beta, \phi) \propto \phi^{-1}$. Then Bayes' theorem leads to the reference posterior
$$ p(\alpha, \beta, \phi \mid Y) \propto p(\alpha, \beta, \phi)\, p(Y \mid \alpha, \beta, \phi) \propto \phi^{n/2 - 1} \exp(-\phi Q(\alpha, \beta)/2), $$
over real-valued $\alpha$ and $\beta$ and $\phi > 0$. This is a joint density for the three quantities, and reference inference follows by exploring and summarising its properties. Notice that the posterior is almost the normalised likelihood function; the only difference is in the prior term $\phi^{-1}$. We quote key features of this reference posterior in the following sections.

1.2.2 Marginal Reference Posterior for $\phi$

The marginal posterior for $\phi$ is available by integrating over $(\alpha, \beta)$, i.e.,
$$ p(\phi \mid Y) = \int\!\!\int p(\alpha, \beta, \phi \mid Y)\, d\alpha\, d\beta $$
where the range of integration is $-\infty < \alpha, \beta < \infty$. It can be shown that this yields the simple form
$$ p(\phi \mid Y) \propto \phi^{a-1} \exp\{-b\phi\} $$
where $a = (n-2)/2$ and $b = \sum_{i=1}^n \hat\epsilon_i^2 / 2$. As a result, the posterior for $\phi$ is simply $\text{Gamma}(a, b)$ with these values of $a, b$. In particular, the posterior mean is
$$ E(\phi \mid Y) = 1/s^2 $$

where
$$ s^2 = \sum_{i=1}^n \hat\epsilon_i^2 / (n-2). $$
Since $E(\phi \mid Y)$ is a point estimate of $\phi = 1/\sigma^2$, then $s^2$ is a corresponding point estimate of $\sigma^2$. It is referred to as the residual variance estimate, as it is a sample variance computed from the fitted residuals $\hat\epsilon_i$. Note that, unlike the MLE $\hat\sigma^2$, the estimate $s^2$ has a divisor $n-2$. The common-sense interpretation of this is that the effective number of observations is the actual total $n$ reduced by the number of fitted parameters, here just two. The term $n-2$ is called the residual degrees of freedom, reflecting this adjustment from $n$, the initial degrees of freedom. One implication is that $s^2 > \hat\sigma^2$, reflecting a more conservative estimate of variance after accounting for the estimation of the two parameters.
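As a small numerical sketch of these quantities, the snippet below computes $a$, $b$, $s^2$ and the Gamma posterior mean for $\phi$ from a vector of fitted residuals; the residual values are hypothetical, and scipy's gamma distribution is parameterised by shape and scale, so the rate $b$ enters as scale $= 1/b$.

```python
import numpy as np
from scipy import stats

# resid: fitted residuals y_i - alpha_hat - beta_hat * x_i (hypothetical example values).
resid = np.array([0.3, -0.5, 0.1, 0.8, -0.4, -0.2, 0.6, -0.7])
n = len(resid)

a = (n - 2) / 2.0                    # shape of the Gamma posterior for phi
b = np.sum(resid ** 2) / 2.0         # rate of the Gamma posterior for phi
s2 = np.sum(resid ** 2) / (n - 2)    # residual variance estimate

phi_posterior = stats.gamma(a, scale=1.0 / b)
print(phi_posterior.mean(), 1.0 / s2)   # these two agree: E(phi | Y) = 1/s^2
```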

1.2.3 Marginal Reference Posterior for $(\alpha, \beta)$

The marginal posterior for $(\alpha, \beta)$ is obtained as
$$ p(\alpha, \beta \mid Y) = \int p(\alpha, \beta, \phi \mid Y)\, d\phi $$
and turns out to be a bivariate T distribution. Of key practical relevance are the implied univariate posterior margins for $\alpha$, for $\beta$, and for linear functions of $(\alpha, \beta)$ alone, and these are all univariate T distributions (see Appendix for details of T distributions). Specific univariate margins are as follows.

- Define $v^2 = 1/S_{xx}$. The univariate posterior for $\beta$ has density function
$$ p(\beta \mid Y) \propto \{(n-2) + (\beta - \hat\beta)^2/(s^2 v^2)\}^{-(n-1)/2}, $$
and this is the density of a T distribution with $n-2$ degrees of freedom, mode $\hat\beta$ and scale $sv$. By way of notation we have
$$ (\beta \mid Y) \sim T_{n-2}(\hat\beta, s^2 v^2). $$
As long as $n > 4$, it is also true that $E(\beta \mid Y) = \hat\beta$ and $V(\beta \mid Y) = c s^2 v^2$ with $c = (n-2)/(n-4)$. The posterior is symmetric and normal-shaped about the mode, though has heavier tails than a normal posterior. We can write
$$ \beta = \hat\beta + s v t \quad\text{and}\quad t = (\beta - \hat\beta)/(sv) $$
where the random quantity $t \sim T_{n-2}(0, 1)$. Posterior quantiles and intervals for $\beta$ follow from those of the standard Student T distribution: if $t_p$ is the $100p\%$ quantile of $t$, then that for $\beta$ is simply $\hat\beta + s v t_p$. The term $sv$ is called the posterior standard error of the coefficient.

For large degrees of freedom, the Student T distribution approaches the standard normal, in which case we have the approximation $t \approx N(0, 1)$ and so
$$ (\beta \mid Y) \approx N(\hat\beta, s^2 v^2). $$
Otherwise, we can view the distribution informally as "like the normal but with a little bit of additional uncertainty."

- The univariate margin for $\alpha$ is similarly a T distribution with $n-2$ degrees of freedom, mode $\hat\alpha$ and scale $s v_\alpha$ where $v_\alpha^2 = n^{-1} + \bar x^2 / S_{xx}$, i.e.,
$$ (\alpha \mid Y) \sim T_{n-2}(\hat\alpha, s^2 v_\alpha^2). $$

- Under $p(\alpha, \beta \mid Y)$ the two parameters are generally correlated. Assuming $n > 4$ so that second moments of the T distribution exist, the posterior covariance is $s^2 c_{\alpha,\beta}(n-2)/(n-4)$ where $c_{\alpha,\beta} = -\bar x / S_{xx}$. Note that $(n-2)/(n-4) \approx 1$ when $n$ is large, when the posterior is approximately normal; in that case, the posterior covariance is just the term $s^2 c_{\alpha,\beta}$ above. Note also that the covariance, hence the correlation, is zero if and only if $\bar x = 0$. This correlation is relevant when considering posterior inferences and predictions involving linear functions of the two parameters, as now follows (a small numerical sketch of these marginal summaries appears after this list).
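The following sketch (Python with scipy.stats, continuing the same hypothetical simulated data used earlier) computes the quantities defining these T margins: the scales $sv$ and $sv_\alpha$, and 95% posterior intervals for $\beta$ and $\alpha$. All variable names and values are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 30
x = np.linspace(0.0, 10.0, n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

xbar, ybar = x.mean(), y.mean()
S_xx = np.sum((x - xbar) ** 2)
beta_hat = np.sum((x - xbar) * (y - ybar)) / S_xx
alpha_hat = ybar - beta_hat * xbar
resid = y - alpha_hat - beta_hat * x
s = np.sqrt(np.sum(resid ** 2) / (n - 2))

# Posterior scales: v^2 = 1/S_xx for beta, v_alpha^2 = 1/n + xbar^2/S_xx for alpha.
v_beta = np.sqrt(1.0 / S_xx)
v_alpha = np.sqrt(1.0 / n + xbar ** 2 / S_xx)

# 95% equal-tails (= HPD) posterior intervals from the T_{n-2}(0,1) quantile.
t975 = stats.t.ppf(0.975, df=n - 2)
beta_interval = (beta_hat - s * v_beta * t975, beta_hat + s * v_beta * t975)
alpha_interval = (alpha_hat - s * v_alpha * t975, alpha_hat + s * v_alpha * t975)
print(beta_interval, alpha_interval)
```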

1.2.4 Intervals for $\beta$ and Significance of the Regression

The standard T test of the significance of the regression fit can be understood as an assessment of the support for the value $\beta = 0$ under the marginal posterior distribution. Note that, if the assumption of a linear regression model is really appropriate, then $\beta = 0$ would imply no relationship between the $x$ and $y$ variables, whereas large values of $|\beta|$ imply a strong relationship.

Under the symmetric posterior distribution $(\beta \mid Y) \sim T_{n-2}(\hat\beta, s^2 v^2)$, HPD intervals coincide with equal-tails intervals. Specifically, the $100(1-p)\%$ posterior interval is
$$ \hat\beta \pm s v\, |t_{p/2}| $$
for any $p$, where $t_{p/2}$ is the $100(p/2)\%$ quantile of $T_{n-2}(0, 1)$. For example, $p = 0.05$ means that $t_{0.025}$ is the lower 2.5% point of the standard T distribution, and the above interval is a 95% equal-tails, HPD interval for $\beta$. If the value $\beta = 0$ lies outside this interval, then we are assured that the regression is significant at at least the 95% level; that is, $\beta = 0$ is among the 5% least likely values of the parameter.

The standard p-value for the test of $\beta = 0$ can be interpreted as the posterior probability of values $\beta$ that have lower posterior density than $\beta = 0$. To compute this we need to find the probability outside the HPD interval defined with one end-point at $\beta = 0$. From the form of the posterior density function, it easily follows (as detailed in the Appendix) that this excludes all values $\beta$ such that $|t| > |\hat\beta/(sv)|$ where $t = (\beta - \hat\beta)/(sv)$. The resulting "tail-area" p-value is therefore simply
$$ p\text{-value} = 2\Pr(t > |\hat\beta/(sv)|) $$
where $t \sim T_{n-2}(0, 1)$. The observed quantity $\hat\beta/(sv)$ here is called the standardised T test statistic; it is simply the size of the estimated coefficient relative to its standard error. It is tempting to refer to the p-value as the "probability of $\beta = 0$," although it is obviously not quite that; it is, however, a standard measure of the significance of the fit of the model, with a low p-value commensurate with a significant fit.
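A minimal sketch of this calculation, using hypothetical summary values in place of a real fit (in practice beta_hat, s, v_beta and n would come from the fitted model): the standardised T statistic and its two-sided tail-area p-value follow from scipy's survival function for the Student T distribution.

```python
from scipy import stats

# Hypothetical summary values; in practice these come from the fitted model.
beta_hat, s, v_beta, n = 0.48, 0.95, 0.035, 30

T = beta_hat / (s * v_beta)                    # standardised T statistic
p_value = 2.0 * stats.t.sf(abs(T), df=n - 2)   # 2 Pr(t > |T|) under T_{n-2}(0,1)
print(T, p_value)
```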

1.2.5 F Tests, ANOVA and Deviances

It is traditional to summarise the statistical significance of the regression fit with a summary F test and the associated analysis of variation, historically called the analysis of variance, or ANOVA. This adds nothing new methodologically; rather, the F test and ANOVA presentation is simply another way of summarising the goodness of fit in terms of $R^2$ and the p-value based on the T test for $\beta = 0$, as discussed in the previous section. One reason for restating the conclusions in these different terms is simply historical precedent. A second, more important reason is that the extension of significance assessment to multiple regression models (models with more than one predictor variable) cannot be done with T tests alone, and inherently involves ANOVA ideas. Hence it is useful to explain the connection in this simplest of cases.

Begin by noting that the p-value from the T test of the significance of the regression is equivalent to
$$ p\text{-value} = \Pr(F > (\hat\beta/(sv))^2) $$
where $F = t^2$, a random quantity obtained as the square of $t \sim T_{n-2}(0, 1)$. The name of the distribution of $F$ is the F distribution with $(1, n-2)$ degrees of freedom; we write $F \sim F_{1, n-2}$. Such distributions are well known, tabulated and available in computer software. Hence we could use the upper tail-area of the $F_{1, n-2}$ to compute the p-value, as an alternative to the $T_{n-2}(0, 1)$.

Write
$$ f_{\rm obs} = (\hat\beta/(sv))^2 = \frac{\hat\beta^2/v^2}{s^2}. $$
It is easily shown (see exercises) that the numerator term in $f_{\rm obs}$ reduces to
$$ \hat\beta^2/v^2 = S_{yy} - (n-2)s^2 $$
where $(n-2)s^2 = \sum_{i=1}^n (y_i - \hat\alpha - \hat\beta x_i)^2$ is the residual sum of squares from the model fit. As a result, it is trivial to compute
$$ f_{\rm obs} = (S_{yy} - (n-2)s^2)/s^2 $$
and the implied p-value.

The above expression is intuitively reasonable. The term $S_{yy} - (n-2)s^2$ is called the fitted sum of squares, or the variation explained by the regression of $y$ on $x$. This reflects the fact that it is the difference between the total variation in the response data ($S_{yy}$) and the residual variation remaining after the model has been fitted. If the variation explained is large, then $f_{\rm obs}$ will be large and the p-value small; just how large the reduction in variation due to the model must be is relative to the scale of the data, so that the variation explained is assessed relative to $s^2$, as is clear in the formula for $f_{\rm obs}$.

In terms of the quadratic notation $Q(\alpha, \beta)$, recall that $S_{yy} = Q(\bar y, 0)$ and the residual sum of squares is $(n-2)s^2 = Q(\hat\alpha, \hat\beta)$. So, in this notation, the explained variation is simply $Q(\bar y, 0) - Q(\hat\alpha, \hat\beta)$. In some areas of modern statistics, these quadratic, sums-of-squares measures are referred to as deviances. Thus $Q(\bar y, 0)$ is the total deviance, and $Q(\hat\alpha, \hat\beta)$ is the residual deviance from the fitted model; the difference is the deviance explained by the regression on $x$, or the reduction in deviance due to the model. ANOVA, the analysis of variance, is simply a name for the representation of total variability in the response data using the above components. In modern times the term analysis of deviance is more apt, as it generalises to non-normal and non-linear regression models. The summary is simple: Total deviance = Deviance explained + Residual deviance, and the F test measures the significance of the model fit by assessing just how large the "deviance explained" term is.
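A minimal sketch of this deviance decomposition, under the same hypothetical straight-line fit used above: it computes the total deviance $S_{yy}$, the residual deviance $(n-2)s^2$, the explained deviance, $f_{\rm obs}$, and the p-value from the $F_{1, n-2}$ upper tail.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 30
x = np.linspace(0.0, 10.0, n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

xbar, ybar = x.mean(), y.mean()
S_xx = np.sum((x - xbar) ** 2)
beta_hat = np.sum((x - xbar) * (y - ybar)) / S_xx
alpha_hat = ybar - beta_hat * xbar

S_yy = np.sum((y - ybar) ** 2)                           # total deviance Q(ybar, 0)
resid_dev = np.sum((y - alpha_hat - beta_hat * x) ** 2)  # residual deviance Q(alpha_hat, beta_hat)
explained = S_yy - resid_dev                             # deviance explained by the regression
s2 = resid_dev / (n - 2)

f_obs = explained / s2
p_value = stats.f.sf(f_obs, dfn=1, dfd=n - 2)            # upper tail of F_{1, n-2}
print(f_obs, p_value)
```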

1.2.6 Honest Prediction

Consider predicting a new case $y_{n+1}$ at a further, or future, design point $x_{n+1}$. Formally, this requires the evaluation of the posterior predictive distribution $p(y_{n+1} \mid Y)$, i.e., simply the distribution for the new value conditional on the data observed so far (as above, the $x_i$ values are implicit and ignored in the notation). It can be shown that this leads to a T distribution, and we denote this by
$$ (y_{n+1} \mid Y) \sim T_{n-2}(\hat y, s^2 v_y^2) $$
where

- $\hat y = \hat\alpha + \hat\beta x_{n+1}$, just the fitted line at $x_{n+1}$; and

- $v_y^2 = 1 + w^2$ with
$$ w^2 = v_\alpha^2 + x_{n+1}^2 v^2 + 2 x_{n+1} c_{\alpha,\beta}, $$
where $c_{\alpha,\beta} = -\bar x / S_{xx}$ was given above in the posterior covariance between $\alpha$ and $\beta$.

It is easy to motivate this by considering the model directly, that is
$$ y_{n+1} = \alpha + \beta x_{n+1} + \epsilon_{n+1} $$
where $\epsilon_{n+1}$ is the deviation from the line for the new case and so is independent of past data. Taking expectations across this equation gives $E(y_{n+1} \mid Y) = \hat y$, using linearity of expectations. The scale factor $s v_y$ is similarly computed; note that the posterior correlation between $\alpha$ and $\beta$ enters into its computation. Note the following:

- For large $n$, the predictive distribution is close to normal, in which case
$$ (y_{n+1} \mid Y) \approx N(\hat y, s^2 v_y^2). $$

- Since $v_y^2 = 1 + w^2$ we have $s^2 v_y^2 = s^2 + s^2 w^2$. Here the first $s^2$ is the estimate of the variance $\sigma^2$ of $\epsilon_{n+1}$, whereas the term $s^2 w^2$ measures the uncertainty about the line $\alpha + \beta x_{n+1}$ at the new design point.

- With very large data sets the posterior for $(\alpha, \beta)$ will be quite precise, and $w^2$ will be small, in which case the predictive distribution is almost $N(\hat y, s^2)$. Otherwise, the predictions using the above results will be "honest" in the sense that the additional term $s^2 w^2$ appropriately reflects the additional uncertainty in predicting due to uncertainty about the regression line parameters.

- It can be shown that the above formula for $w^2$ reduces to
$$ w^2 = n^{-1} + (x_{n+1} - \bar x)^2 / S_{xx}. $$
As a result, $w^2$ will be small when $x_{n+1}$ is close to $\bar x$, but grows larger for larger values of $|x_{n+1} - \bar x|$. This means that the spread of the predictive distribution is greater for new design points that are further away from the past design points; in this sense, predictions also appropriately reflect increased uncertainty in extrapolating outside the range of past experience. Nevertheless, extrapolation must always be approached with caution, unless substantive theory exists to indicate the validity of extending the straight line regression assumption into regions where no data are available.

In practice, computer programs work with matrix formulations of linear models, so that the explicit details of the calculations above are never needed for actual numerical work. They are important, however, for understanding and interpretation.
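A short sketch of the predictive calculation, again under hypothetical simulated data: it evaluates $\hat y$, $w^2 = n^{-1} + (x_{n+1} - \bar x)^2/S_{xx}$, and a 95% predictive interval from the $T_{n-2}$ quantile.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 30
x = np.linspace(0.0, 10.0, n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

xbar, ybar = x.mean(), y.mean()
S_xx = np.sum((x - xbar) ** 2)
beta_hat = np.sum((x - xbar) * (y - ybar)) / S_xx
alpha_hat = ybar - beta_hat * xbar
s = np.sqrt(np.sum((y - alpha_hat - beta_hat * x) ** 2) / (n - 2))

x_new = 12.0                                   # a new design point (extrapolating slightly)
y_hat = alpha_hat + beta_hat * x_new           # fitted line at x_new
w2 = 1.0 / n + (x_new - xbar) ** 2 / S_xx      # parameter-uncertainty contribution
v_y = np.sqrt(1.0 + w2)

t975 = stats.t.ppf(0.975, df=n - 2)
interval = (y_hat - s * v_y * t975, y_hat + s * v_y * t975)
print(y_hat, interval)
```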

2 Sampling Theoretic Inference

The standard sampling-theoretic inference framework leads to essentially the same methods (the same point estimates of $(\alpha, \beta, \sigma^2)$, the same intervals and tests) in terms of numerical values, although, of course, their interpretations and rationale are quite different to those in the Bayesian framework. The sampling theory results are summarised here for comparison.

Under the assumption that the straight line model generates the data, consider the statistics $(\hat\alpha, \hat\beta, s^2)$ as functions of the random quantities $Y$ prior to their being observed. These have a joint sampling distribution $p(\hat\alpha, \hat\beta, s^2 \mid \alpha, \beta, \sigma^2)$ that is well known (e.g., DeGroot, chapter 10). Remember that this is the repeat-sampling distribution of the estimators based on the model, assuming the parameters are fixed at their "true" values. This trivariate sampling distribution has the following features.

^

 The estimators are unbiased; that is, E ^ j  = ; E  j  = and

2 2 2

E ^ j = : They are also consistent, in that they converge in prob-

ability to their resp ective parameters as n tends to in nity.

 The statistic  ^ =sv is distributed as T 0; 1: Note that this de-

n2

scrib es sampling variability in b oth ^ in the numerator and s in the de-

nominator. As a result, con dence intervals for have the form

^  sv t

p=2

for any probability p and where t is the 100p=2 quantile of T 0; 1:

n2

p=2

^

 The statistic  =sv is also distributed as T 0; 1: Note that this

n2

^

describ es sampling variability in b oth in the numerator and s in the

denominator. As a result, con dence intervals for have the form

^

 sv t

p=2

for any probability p and where t is the 100p=2 quantile of T 0; 1:

n2

p=2

2 2

 The statistic s has a sampling distribution that dep ends only on  ; and

2 2 2

is given bys j   Gamman 2=2; n 2=2 : Equivalently, the

2 2

quantity s n 2= is distributed as Gamman 2=2; 1=2 whichis

the chi-squared distribution with n 2 degrees of freedom.

Note that the point estimates and confidence intervals for the regression parameters coincide numerically with those in the reference Bayesian analysis, and that the same point estimate of $\sigma^2$ is generated too. These numerical equivalences extend to predictions.
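The repeat-sampling claims above are easy to check by simulation. The following hedged sketch (all settings illustrative) repeatedly generates data from a fixed straight line, refits by least squares each time, and compares the average of $\hat\beta$ with the true $\beta$ and the variance of $(\hat\beta - \beta)/(sv)$ with the $T_{n-2}$ variance $(n-2)/(n-4)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, beta, sigma = 25, 1.0, 0.7, 2.0
x = np.linspace(0.0, 5.0, n)
S_xx = np.sum((x - x.mean()) ** 2)

t_stats, beta_hats = [], []
for _ in range(5000):
    y = alpha + beta * x + rng.normal(0.0, sigma, size=n)
    b_hat = np.sum((x - x.mean()) * (y - y.mean())) / S_xx
    a_hat = y.mean() - b_hat * x.mean()
    s = np.sqrt(np.sum((y - a_hat - b_hat * x) ** 2) / (n - 2))
    beta_hats.append(b_hat)
    t_stats.append((b_hat - beta) / (s / np.sqrt(S_xx)))

print(np.mean(beta_hats), beta)               # unbiasedness: average of beta_hat near beta
print(np.var(t_stats), (n - 2) / (n - 4))     # variance of the T_{n-2}(0,1) pivot
```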

On the issue of testing the significance of the regression model fit, the sampling-theoretic interpretation of the p-value may be based on the following reasoning. Condition on the hypothesis that $\beta = 0$, so that the sampling distribution theory above implies $\hat\beta/(sv) \sim T_{n-2}(0, 1)$. Write $t = \hat\beta/(sv)$. Then large values of the statistic $|t|$ are rare if the hypothesis that $\beta = 0$ is actually true, and just how rare can be measured by probabilities with respect to this sampling distribution. Write $t_{\rm obs}$ for the observed value of $\hat\beta/(sv)$ in this data set. Then the probability of observing a future, hypothetical data set $Y$ that gives rise to a value of $t$ at least as extreme as $t_{\rm obs}$ is simply
$$ \Pr(|t| > |t_{\rm obs}|) = 2\Pr(t > |t_{\rm obs}|), $$
and this is easily computed from the standard $T_{n-2}(0, 1)$ distribution of $t$. Notice that, though the conceptual basis for this calculation is radically different from the conditional Bayesian approach, the numerical results are again the same.

The above is the standard sampling theory, or frequentist, definition of the observed significance level, or p-value, of the test of the hypothesis that $\beta = 0$. It coincides with the Bayesian p-value based on tail areas under the posterior density for $\beta$. The numerical correspondence of the associated F test follows immediately.

It turns out that this test has other optimality properties from a sampling theoretic viewpoint. In particular, it is a likelihood ratio test of the hypothesis $\beta = 0$ compared to all other values of $\beta$, and a uniformly most powerful test in addition (DeGroot, chapter 10).

3 Multiple Linear Regression Models: Summary

The course slides provide coverage of the multiple linear regression model, notation, mathematical structure, examples, and theory. This section is a supplement on the basic theoretical results (no proofs) related to the reference posterior distribution and its uses.

The model is
$$ y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i $$
for observations $i = 1, \ldots, n$. In vector/matrix notation we have
$$ y_i = x_i'\beta + \epsilon_i $$
and
$$ y = X\beta + \epsilon $$
where:

- the $n \times 1$ response vector $y$ has elements $y_i$;

- the $n \times k$ design matrix $X$ has rows $x_i'$, where
$$ x_i' = (1, x_{i1}, \ldots, x_{ip}) $$
and, of course, $k = p + 1$;

- the $k \times 1$ regression parameter vector $\beta$ has elements $\beta_i$;

- the $n \times 1$ error or deviation vector $\epsilon$ has elements $\epsilon_i$.

In detail,
$$ y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad \epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}, $$
$$ X = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} $$
and
$$ \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}. $$
Row $i$ of $X$ is just $x_i'$, containing the values of all predictor variables, in order, for observation $i$. Each later column of $X$ contains the $n$ values of the corresponding predictor variable, while the first column has all entries equal to 1, corresponding to the intercept term of the model.
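The following sketch shows one way to assemble these objects in Python with numpy, for a hypothetical data set with $n = 5$ observations and $p = 2$ predictors; the numbers are purely illustrative and nothing here is specific to the notes beyond the definitions above.

```python
import numpy as np

# Hypothetical raw predictor values: n = 5 observations on p = 2 predictors.
predictors = np.array([[0.5, 1.2],
                       [1.0, 0.7],
                       [1.5, 2.3],
                       [2.0, 1.9],
                       [2.5, 3.1]])
y = np.array([2.1, 2.0, 3.4, 3.3, 4.2])

n, p = predictors.shape
k = p + 1

# Design matrix X: a leading column of ones (intercept) followed by the predictors.
X = np.column_stack([np.ones(n), predictors])
print(X.shape)   # (n, k) = (5, 3)
```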

4 Multiple Linear Regression Models: Reference Posterior Distribution

The reference posterior $p(\beta \mid Y)$ is a multivariate T distribution. The parameter vector $\beta$ contains $k$ parameters, so the posterior is a joint distribution for these $k$ parameters, or a $k$-variate distribution. The multivariate T distribution has $n - k$ degrees of freedom, and we write it as
$$ (\beta \mid Y) \sim T_{n-k}(\hat\beta, s^2 V) $$
with the following ingredients:

- The dimension $k$ is implicit, and not made explicit in the notation;

- $V$ is the $k \times k$ matrix defined by
$$ V = (X'X)^{-1}, $$
and arises in the formula for $\hat\beta$, the LSE of $\beta$, defined by
$$ \hat\beta = V X' y; $$

- $s^2$ is the residual estimate of the variance $\sigma^2$, given by
$$ s^2 = Q(\hat\beta)/(n - k) $$
where
$$ Q(\hat\beta) = \sum_{i=1}^n \hat\epsilon_i^2 $$
is the residual sum of squares of the model fit, or the residual deviance, based on fitted residuals
$$ \hat\epsilon_i = y_i - x_i'\hat\beta $$
for each $i$. As in the straight line model, $s^2$ is the usual estimate of $\sigma^2$, computed as a sample variance in which the sample values are the fitted residuals, and the denominator $n - k$ corrects the sample size $n$ by subtracting the number of parameters estimated in $\beta$.
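A minimal numpy sketch of these ingredients, assuming a design matrix X and response vector y as in the previous snippet (hypothetical values): it computes $V$, the LSE, the fitted residuals, $s^2$, and the marginal posterior standard errors $s v_i$ from the diagonal of $V$.

```python
import numpy as np

# X (n x k, with intercept column) and y as assembled earlier; hypothetical values.
X = np.column_stack([np.ones(5), [0.5, 1.0, 1.5, 2.0, 2.5], [1.2, 0.7, 2.3, 1.9, 3.1]])
y = np.array([2.1, 2.0, 3.4, 3.3, 4.2])
n, k = X.shape

V = np.linalg.inv(X.T @ X)          # V = (X'X)^{-1}
beta_hat = V @ X.T @ y              # LSE of beta
resid = y - X @ beta_hat            # fitted residuals
s2 = np.sum(resid ** 2) / (n - k)   # residual variance estimate, divisor n - k

std_errors = np.sqrt(s2 * np.diag(V))   # marginal posterior scales s * v_i
print(beta_hat, s2, std_errors)
```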

4.1 Univariate marginal posteriors

One of the key properties of the joint posterior distribution above is that the implied univariate marginal posterior for any element $\beta_i$ of $\beta$ is a Student T distribution. Specifically, for $i = 1, \ldots, k$,
$$ (\beta_i \mid Y) \sim T_{n-k}(\hat\beta_i, s^2 v_i^2) $$
where $\hat\beta_i$ is the corresponding estimate from $\hat\beta$ and $v_i^2$ is the corresponding diagonal element of $V$, i.e., $v_i^2 = V_{ii}$.

Inference on any one parameter individually involves summarising this T distribution, in the usual ways.

4.2 Other marginal posteriors

A second key property of the joint posterior distribution above is that the implied multivariate marginal posterior for any subset of $r$ elements of $\beta$ is also a multivariate T distribution. This is of most interest in connection with developing measures of importance of subsets of predictor variables in explaining variation observed in the response variable. We return to this below in discussing subset F tests of such questions.

4.3 Honest prediction

A further key implication of the model and its analysis is that predictive distributions for new/future response variables are also T distributions. Specifically, consider a future predictor vector $x_{n+1}$. The posterior predictive distribution for the associated future response outcome
$$ y_{n+1} = x_{n+1}'\beta + \epsilon_{n+1} $$
is given by
$$ (y_{n+1} \mid Y) \sim T_{n-k}(\hat y_{n+1}, s^2 v_y^2) $$
where

- the point prediction is simply
$$ \hat y_{n+1} = x_{n+1}'\hat\beta = \hat\beta_0 + \hat\beta_1 x_{n+1,1} + \cdots + \hat\beta_p x_{n+1,p}, $$
or just the value of the fitted regression function at the new design point;

- $v_y^2 = 1 + w^2$ where $w^2 = x_{n+1}' V x_{n+1}$.

The term $s v_y$ is called the predictive standard error. Note that $w$, and hence $v_y$, depend on $x_{n+1}$ (though the notation does not make this explicit), and this dependence is important in reflecting differing degrees of uncertainty about $y_{n+1}$ at different design points. In particular, as in the simple straight line regression model, predictions at new points $x_{n+1}$ that are far from the region of previous experience will have higher standard errors, since $w$, and hence $v_y$, will be larger.

Further, note that the term $w$ appears to reflect the estimation uncertainty: the uncertainty due to lack of precise knowledge of $\beta$. With large sample sizes, the posterior for $\beta$ will be very concentrated about its mean of $\hat\beta$, $V$ will have small elements, and hence $w$ will be small. In such cases, two effects arise: first, $v_y^2 \approx 1$, so that the predictive standard error will be approximately $s$; secondly, the T distribution, with a very large degrees of freedom, will be approximately normal. In such cases, therefore, we may use the simpler approximate predictive distribution
$$ (y_{n+1} \mid Y) \approx N(\hat y_{n+1}, s^2). $$
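Continuing the same hypothetical fit, here is a sketch of the predictive summaries for a new predictor vector $x_{n+1}$: the point prediction, $w^2 = x_{n+1}'Vx_{n+1}$, the predictive standard error $s v_y$, and a 95% predictive interval from the $T_{n-k}$ quantile. The new predictor values are invented for illustration.

```python
import numpy as np
from scipy import stats

# Quantities from the fit sketched above (hypothetical values).
X = np.column_stack([np.ones(5), [0.5, 1.0, 1.5, 2.0, 2.5], [1.2, 0.7, 2.3, 1.9, 3.1]])
y = np.array([2.1, 2.0, 3.4, 3.3, 4.2])
n, k = X.shape
V = np.linalg.inv(X.T @ X)
beta_hat = V @ X.T @ y
s = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - k))

x_new = np.array([1.0, 3.0, 2.5])          # includes the leading 1 for the intercept
y_hat = x_new @ beta_hat                   # point prediction
w2 = x_new @ V @ x_new                     # parameter-uncertainty contribution
sv_y = s * np.sqrt(1.0 + w2)               # predictive standard error

t975 = stats.t.ppf(0.975, df=n - k)
print(y_hat, (y_hat - sv_y * t975, y_hat + sv_y * t975))
```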

5 Assessing subsets of predictors

5.1 Posterior assessment of parameters

Standard subset F tests provide numerical measures of the contributions of subsets of predictor variables to the overall fit of the model. Consider a specific subset of $r$ of the predictor variables, numbered $j_1, \ldots, j_r$, so that the corresponding regression parameters are
$$ \gamma = \begin{pmatrix} \beta_{j_1} \\ \vdots \\ \beta_{j_r} \end{pmatrix}. $$

Under the reference posterior distribution, the $r$-vector parameter $\gamma$ has a marginal multivariate $T_{n-k}$ distribution. One way to assess how important the $r$ predictors in question are is to assess how much support this posterior gives to values near $\gamma = 0$. The reasoning is parallel to that used in univariate cases: if the point $\gamma = 0$ is unsupported by the data, and so well out "in the tails" of the posterior, then the corresponding predictors play a meaningful role in the model fit. If, on the other hand, the point $\gamma = 0$ is close to the mean of the posterior and so is a likely value, the predictors have less relevance. Notice that this is explored with no reference to the other predictors; it is therefore to be understood that this assessment of $\gamma$ is conditional upon the other covariates being in the model.

To formally assess the support for $\gamma = 0$ we

- identify all values of $\gamma$ that have lower posterior density than $\gamma = 0$, and then

- find the posterior probability of those values, to deliver a p-value assessing just how extreme $\gamma = 0$ is.

The formal interpretation of such a p-value is that it is the posterior probability on values $\gamma$ that lie outside the highest posterior density region defined by $\gamma = 0$; the common-sense terminology is that the p-value is the probability on values of $\gamma$ less well supported by the data than $\gamma = 0$. A small p-value indicates that $\gamma$ is most probably different to 0, and so indicates that the $r$ predictors in question contribute significantly to the model fit.

Standard theory tells us that the resulting p-value may be easily computed as a tail-area under a univariate F distribution. The details and derivation are not given here, but the formula for the threshold of the F distribution is both easy to remember and important to interpret. The result is that
$$ p\text{-value} = \Pr(F > f_{\rm obs}), $$
where the random variable $F$ has an F distribution on $r$ and $n-k$ degrees of freedom, or $F \sim F_{r, n-k}$, and where $f_{\rm obs}$ is the threshold value computed from the posterior. The formula for $f_{\rm obs}$ is discussed in the following subsection. Note that the p-value is trivially computed from software packages with cumulative distribution functions of F distributions: it is just $1 - P(f_{\rm obs})$ where $P$ here stands for the cdf of $F_{r, n-k}$.

5.2 F test statistics

Call the model above Model A. In Model A, we have $p$ predictor variables and are interested in whether or not a specific subset of $r$ of them are really relevant in terms of statistical measures of fit. Consider a different linear model, Model B, that is just Model A with the $r$ predictor variables in question removed. We can trivially fit Model B as well, compute the usual posterior distributions and numerical summaries, and use these to make comparisons with Model A. Model A has more predictor variables than Model B: Model A has $p$ predictors, but Model B has only $p - r$. In addition, we say that Model B is nested within Model A, since it arises as a special case of Model A when $\gamma = 0$. If the $r$ predictors in question are really not relevant in describing observed variations in the response data, then the two models will produce similar fits. If, on the other hand, these specific predictors are really related to the response, Model A will produce a different and better fit, better in the sense of explaining more observed variation. To this end, we compare the residual sums of squares, or deviances, from the two models.

Write $Q_A$ and $Q_B$ for the residual deviances in Models A and B, respectively. We can interpret $Q_B - Q_A$ as the decrease in deviance in moving from the simpler Model B to the more elaborate Model A. A large difference is indicative of a significantly improved fit, suggesting the $r$ predictors are relevant. To measure how large the decrease in deviance is, we first note that the reduced deviance under Model A is achieved at the cost of $r$ extra parameters, so that we might better consider the change in deviance per parameter, or $(Q_B - Q_A)/r$. Second, deviances are measured on a scale that is the square of the response, or that of $\sigma^2$ and its estimates; to standardise to a dimensionless measure we divide by the best estimate of $\sigma^2$ available, namely $s_A^2$, the residual estimate of variance under the larger Model A. This leads us to the standardised, per-parameter "reduction in deviance" measure $(Q_B - Q_A)/(r s_A^2)$ as a natural summary of just how much the $r$ predictors in question improve model fit in the context of the other predictors already in Model B.

The F distribution theory mentioned in the previous section identifies just this standardised deviance measure as the critical threshold value, i.e.,
$$ f_{\rm obs} = (Q_B - Q_A)/(r s_A^2) $$
or
$$ f_{\rm obs} = \frac{(\text{difference in deviance})/(\text{difference in number of parameters})}{\text{residual estimate of variance}}, $$
where the "differences" in deviances and number of parameters compare elaborating the data description from Model B to Model A, and the "residual estimate of variance" is that in the more elaborate Model A. The resulting p-value simply converts this common-sense measure of "difference" between the two models to a probability scale.

Finally, note the special case when we consider $r = p$ and assess all predictor variables. In this case, the F test is a test of overall regression fit, with the simpler "baseline" Model B being the random sample model with no predictors. Many computer packages routinely generate the $f_{\rm obs}$ threshold, and its corresponding p-value under the $F_{p, n-p-1}$ distribution, as a summary of overall model fit.
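A hedged sketch of the subset F test, fitting nested Models A and B by least squares on hypothetical simulated data: it removes the last $r$ columns of X to form Model B, compares residual deviances, and converts $f_{\rm obs}$ into a p-value via the $F_{r, n-k}$ upper tail. All names and settings are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p, r = 40, 3, 2                            # n observations, p predictors, test the last r of them
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 0.8, 0.0, 0.0])    # the last two predictors are irrelevant here
y = X @ beta_true + rng.normal(0.0, 1.0, size=n)
k = p + 1

def residual_deviance(X):
    """Residual sum of squares from the least squares fit of y on X."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    return np.sum((y - X @ beta_hat) ** 2)

Q_A = residual_deviance(X)                    # Model A: all predictors
Q_B = residual_deviance(X[:, :-r])            # Model B: last r predictors removed
s2_A = Q_A / (n - k)

f_obs = (Q_B - Q_A) / (r * s2_A)
p_value = stats.f.sf(f_obs, dfn=r, dfd=n - k)
print(f_obs, p_value)
```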

5.3 Cautionary note

As usual, it is important to be aware that p-values will tend to be smaller for larger sample sizes, the F test being more "powerful" in identifying increasingly small deviations of $\gamma$ from 0 when $n$ is very large. Hence it is important to be aware that, if dealing with very large samples, most subsets of potential predictors may be found to be statistically significant even though their inclusion in a model may lead to only minor changes in inferences and predictions. Generally, and again as in all statistical models, choice of candidate predictor variables should be based as much as possible on underlying scientific relevance and validity, and on empirical performance in out-of-sample predictions.

Appendix: T distributions

A real-valued random quantity $t$ has a standard Student T distribution with $k > 0$ degrees of freedom if the density function is
$$ p(t) \propto \{k + t^2\}^{-(k+1)/2} $$
with normalising constant $a = k^{k/2}\,\Gamma((k+1)/2)/\{\sqrt{\pi}\,\Gamma(k/2)\}$. We write $t \sim T_k(0, 1)$. The density is symmetric about zero with a shape similar to the normal density curve. For high values of $k$, say exceeding 20, the density is very close to normal. For lower values of $k$, the density is heavier-tailed than normal, giving higher probability to more extreme values. If $k > 1$ the mean exists and is $E(t) = 0$. If $k > 2$ the variance exists and is $V(t) = k/(k-2)$; for large $k$, the variance approaches 1 as the density approaches the standard normal.

For any constants $m$ and $v$, with $v > 0$, define the random quantity $x = m + vt$. Then $x$ has a T distribution with location $m$ (the mode of the density, and the mean if $k > 1$) and scale $v$. We write $x \sim T_k(m, v^2)$. The variance is $V(x) = v^2 k/(k-2)$, so that $v^2$ is approximately the variance for large $k$. The density is, by transformation,
$$ p(x) \propto \{k + (x - m)^2/v^2\}^{-(k+1)/2}. $$
For large $k$, the distribution of $x$ is approximately $N(m, v^2)$. Otherwise, we can view the distribution informally as "$N(m, v^2)$ with a little bit of additional uncertainty."

Quantiles of the Student T distributions are tabulated and available in computer software for any value of $k$. Suppose that $t_p$ is the $100p\%$ quantile of the standard distribution, i.e., such that $\Pr(t \le t_p) = p$. Then the corresponding quantile of $x$ is simply $m + v t_p$.
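In scipy, for example, the standard T quantile $t_p$ is given by stats.t.ppf(p, df=k), so the quantile of a scaled-and-shifted $x = m + vt$ follows directly; the numbers below are purely illustrative.

```python
from scipy import stats

k, m, v, p = 8, 1.5, 0.4, 0.975      # illustrative degrees of freedom, location, scale, probability
t_p = stats.t.ppf(p, df=k)           # quantile of the standard T_k(0,1)
x_p = m + v * t_p                    # corresponding quantile of x ~ T_k(m, v^2)
print(t_p, x_p)
```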

Appendix: T test in straight line regression

Under the reference posterior $(\beta \mid Y) \sim T_{n-2}(\hat\beta, s^2 v^2)$, the values of $\beta$ having lower density than $\beta = 0$ simply satisfy
$$ p(\beta \mid Y) < p(0 \mid Y) $$
where $p(\beta \mid Y) = c\{(n-2) + (\beta - \hat\beta)^2/(s^2 v^2)\}^{-(n-1)/2}$ is the posterior density. The value at zero is just
$$ p(0 \mid Y) = c\{(n-2) + \hat\beta^2/(s^2 v^2)\}^{-(n-1)/2} = c\{(n-2) + T^2\}^{-(n-1)/2} $$
where $T = \hat\beta/(sv)$ is the standardised T statistic. Hence $p(\beta \mid Y) < p(0 \mid Y)$ only if
$$ \{(n-2) + (\beta - \hat\beta)^2/(s^2 v^2)\} > \{(n-2) + T^2\}, $$
which reduces to
$$ t^2 > T^2 \quad\text{or}\quad |t| > |T| $$
where $t = (\beta - \hat\beta)/(sv)$ is a random quantity with the standard $T_{n-2}(0, 1)$ distribution.

As a result, the posterior probability of values $\beta$ having lower density than $\beta = 0$ is just $\Pr(|t| > |T|) = 2\Pr(t > |T|)$, since the standard T distribution is symmetric about zero. This is the standard p-value for the test of $\beta = 0$.

6 Exercises

1. Verify the formulae for the LSEs $(\hat\alpha, \hat\beta)$ in the straight line regression model.

2. In the straight line regression model, consider the residual sum of squares $Q(\hat\alpha, \hat\beta) = \sum_{i=1}^n (y_i - \hat\alpha - \hat\beta x_i)^2$. By substituting the identity $\hat\alpha = \bar y - \hat\beta \bar x$, show that $Q(\hat\alpha, \hat\beta) = S_{yy} - \hat\beta^2/v^2$. Deduce that $\hat\beta^2/v^2$ is the fitted sum of squares, or deviance explained, by the regression model.

3.

4.

5.