MODELS

Likelihood and Reference Bayesian Analysis

Mike West

May 3, 1999

1 Straight Line Regression Models

We begin with the simple straight line regression model
$$ y_i = \alpha + \beta x_i + \epsilon_i $$
where the design points $x_i$ are fixed in advance, and the measurement/observation errors $\epsilon_i$ are independent and normally distributed, $\epsilon_i \sim N(0, \sigma^2)$, for each $i = 1, \ldots, n$. In this context, we have looked at general modelling questions, and the fitting of least squares estimates of $\alpha$ and $\beta$. Now we turn to more formal likelihood and Bayesian analyses.

1.1 Likelihood and MLEs

The formal parametric inference problem is a multi-parameter problem: we require inferences on the three parameters $(\alpha, \beta, \sigma^2)$. The likelihood function has a simple enough form, as we now show. Throughout, we do not indicate the design points in conditioning statements, though they are implicitly conditioned upon. Write $Y = \{y_1, \ldots, y_n\}$ and $X = \{x_1, \ldots, x_n\}$. Given $X$ and the model parameters, each $y_i$ is the corresponding zero-mean normal random quantity $\epsilon_i$ plus the term $\alpha + \beta x_i$, so that $y_i$ is normal with this term as its mean and variance $\sigma^2$. Also, since the $\epsilon_i$ are independent, so are the $y_i$. Thus
$$ (y_i \mid \alpha, \beta, \sigma^2) \sim N(\alpha + \beta x_i, \sigma^2) $$
with conditional density function
$$ p(y_i \mid \alpha, \beta, \sigma^2) = \exp\{-(y_i - \alpha - \beta x_i)^2/(2\sigma^2)\}/(2\pi\sigma^2)^{1/2} $$
for each $i$. Also, by independence, the joint density function is
$$ p(Y \mid \alpha, \beta, \sigma^2) = \prod_{i=1}^n p(y_i \mid \alpha, \beta, \sigma^2). $$

Given the observed response values $Y$ this provides the likelihood function for the three parameters: the joint density above, evaluated at the observed and hence now fixed values of $Y$, now viewed as a function of $(\alpha, \beta, \sigma^2)$ as they vary across possible parameter values. It is clear that this likelihood function is given by
$$ p(Y \mid \alpha, \beta, \sigma^2) \propto \exp(-Q(\alpha, \beta)/(2\sigma^2))/\sigma^n $$
where
$$ Q(\alpha, \beta) = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2 $$
and where a constant term $(1/2\pi)^{n/2}$ has been dropped.

Let us look at computing the joint maximum likelihood estimates (MLEs) of the three parameters. This involves finding the values $(\hat\alpha, \hat\beta, \hat\sigma^2)$ such that $p(Y \mid \hat\alpha, \hat\beta, \hat\sigma^2) > p(Y \mid \alpha, \beta, \sigma^2)$ for any other parameter values $(\alpha, \beta, \sigma^2)$. We do this as follows.

- For any fixed value of $\sigma^2$, the likelihood function is a strictly decreasing function of $Q(\alpha, \beta)$. Hence, changing $(\alpha, \beta)$ to decrease the value of $Q$ implies that the value of the likelihood function increases. Clearly, choosing $(\alpha, \beta)$ to minimise $Q$ implies that we maximise the likelihood function. As a result, the MLEs of $(\alpha, \beta)$ for any specific value of $\sigma^2$ are simply the LSEs.

- From the earlier discussion of least squares estimation, we know that the LSEs $(\hat\alpha, \hat\beta)$ do not, in fact, depend on the value of the variance $\sigma^2$. As a result, the full three-dimensional maximisation is solved at the LSE values $(\hat\alpha, \hat\beta)$ and by choosing $\hat\sigma^2$ to maximise $p(Y \mid \hat\alpha, \hat\beta, \sigma^2)$ as a function of just $\sigma^2$. This trivially leads to
$$ \hat\sigma^2 = \sum_{i=1}^n \hat\epsilon_i^2 / n $$
where $\hat\epsilon_i = y_i - \hat\alpha - \hat\beta x_i$ for each $i$; this MLE of $\sigma^2$ is the usual residual sum of squares with divisor $n$.
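As a quick numerical illustration of these formulae, here is a minimal Python sketch (using numpy) that computes the LSEs and the MLE of $\sigma^2$ for simulated straight-line data; all names and values (x, y, alpha_hat, and so on) are hypothetical and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data from a straight line model (illustrative values only).
n = 30
x = np.linspace(0.0, 10.0, n)
alpha_true, beta_true, sigma_true = 2.0, 0.5, 1.0
y = alpha_true + beta_true * x + rng.normal(0.0, sigma_true, size=n)

# Least squares estimates: beta_hat = S_xy / S_xx, alpha_hat = ybar - beta_hat * xbar.
xbar, ybar = x.mean(), y.mean()
S_xx = np.sum((x - xbar) ** 2)
S_xy = np.sum((x - xbar) * (y - ybar))
beta_hat = S_xy / S_xx
alpha_hat = ybar - beta_hat * xbar

# Fitted residuals and the MLE of sigma^2 (divisor n, as in the text).
resid = y - alpha_hat - beta_hat * x
sigma2_mle = np.sum(resid ** 2) / n

print(alpha_hat, beta_hat, sigma2_mle)
```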

1.2 Reference Bayesian Analyses

1.2.1 Parametrisation and Reference Prior

To develop the reference Bayesian posterior distribution for $(\alpha, \beta, \sigma^2)$ it is traditional to reparameterise from $\sigma^2$ to the precision parameter $\phi = 1/\sigma^2$. This is done simply for clarity of exposition and ease of development. Plugging $\sigma^2 = 1/\phi$ into the likelihood function leads simply to
$$ p(Y \mid \alpha, \beta, \phi) \propto \phi^{n/2} \exp(-\phi Q(\alpha, \beta)/2), $$
defining the joint likelihood function for the three parameters $(\alpha, \beta, \phi)$.

Bayesian inference requires a prior $p(\alpha, \beta, \phi)$, a trivariate prior density defined on the model parameter space. Here we use the traditional reference prior in which:

- The three parameters are independent, so $p(\alpha, \beta, \phi) = p(\alpha)p(\beta)p(\phi)$.

- As a function of $\alpha$, the log-likelihood function is quadratic, indicating that the likelihood function will contribute a normal form in $\alpha$. Hence a normal prior for $\alpha$ would be conjugate, leading to a normal posterior. On this basis, a normal prior with an extremely large variance (as in reference analysis of normal models) will represent a vague or uninformative prior position. Taking the formal limit of a normal prior with a variance tending to infinity provides the traditional reference prior $p(\alpha) \propto$ constant.

- The same reasoning applies to $\beta$, leading to the traditional reference prior $p(\beta) \propto$ constant.

- As a function of $\phi$ alone, the likelihood function has the same form as that of a gamma density function in $\phi$. Hence a gamma prior for $\phi$ would be conjugate, leading to a gamma posterior. On this basis, a gamma prior with very small defining parameters (as in reference analysis of Poisson or exponential models) will represent a vague or uninformative prior position. Taking the formal limit of the $\phi \sim \text{Gamma}(a, b)$ prior at $a = b = 0$ provides the traditional reference prior $p(\phi) \propto \phi^{-1}$.

Combining these components produces the standard non-informative/reference prior $p(\alpha, \beta, \phi) \propto \phi^{-1}$. Then Bayes' theorem leads to the reference posterior
$$ p(\alpha, \beta, \phi \mid Y) \propto p(\alpha, \beta, \phi)\, p(Y \mid \alpha, \beta, \phi) \propto \phi^{n/2 - 1} \exp(-\phi Q(\alpha, \beta)/2), $$
over real-valued $\alpha$ and $\beta$ and $\phi > 0$. This is a joint density for the three quantities, and reference inference follows by exploring and summarising its properties. Notice that the posterior is almost the normalised likelihood function; the only difference is in the prior term $\phi^{-1}$. We quote key features of this reference posterior in the following sections.

1.2.2 Marginal Reference Posterior for $\phi$

The marginal posterior for $\phi$ is available by integrating over $(\alpha, \beta)$, i.e.,
$$ p(\phi \mid Y) = \int\!\!\int p(\alpha, \beta, \phi \mid Y)\, d\alpha\, d\beta $$
where the range of integration is $-\infty < \alpha, \beta < \infty$. It can be shown that this yields the simple form
$$ p(\phi \mid Y) \propto \phi^{a-1} \exp\{-b\phi\} $$
where $a = (n-2)/2$ and $b = \sum_{i=1}^n \hat\epsilon_i^2 / 2$. As a result, the posterior for $\phi$ is simply $\text{Gamma}(a, b)$ with these values of $a, b$. In particular, the posterior mean is
$$ E(\phi \mid Y) = 1/s^2 $$

where
$$ s^2 = \sum_{i=1}^n \hat\epsilon_i^2 / (n-2). $$
Since $E(\phi \mid Y)$ is a point estimate of $\phi = 1/\sigma^2$, then $s^2$ is a corresponding point estimate of $\sigma^2$. It is referred to as the residual variance estimate, as it is a sample variance computed from the fitted residuals $\hat\epsilon_i$. Note that, unlike the MLE $\hat\sigma^2$, the estimate $s^2$ has a divisor $n-2$. The common-sense interpretation of this is that the effective number of observations is the actual total $n$ reduced by the number of fitted parameters, here just two. The term $n-2$ is called the residual degrees of freedom, reflecting this adjustment from $n$, the initial degrees of freedom. One implication is that $s^2 > \hat\sigma^2$, reflecting a more conservative estimate of variance after accounting for the estimation of the two parameters.
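As a small numerical sketch of these quantities, the snippet below computes $a$, $b$, $s^2$ and the Gamma posterior mean for $\phi$ from a vector of fitted residuals; the residual values are hypothetical, and scipy's gamma distribution is parameterised by shape and scale, so the rate $b$ enters as scale $= 1/b$.

```python
import numpy as np
from scipy import stats

# resid: fitted residuals y_i - alpha_hat - beta_hat * x_i (hypothetical example values).
resid = np.array([0.3, -0.5, 0.1, 0.8, -0.4, -0.2, 0.6, -0.7])
n = len(resid)

a = (n - 2) / 2.0                    # shape of the Gamma posterior for phi
b = np.sum(resid ** 2) / 2.0         # rate of the Gamma posterior for phi
s2 = np.sum(resid ** 2) / (n - 2)    # residual variance estimate

phi_posterior = stats.gamma(a, scale=1.0 / b)
print(phi_posterior.mean(), 1.0 / s2)   # these two agree: E(phi | Y) = 1/s^2
```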

1.2.3 Marginal Reference Posterior for $(\alpha, \beta)$

The marginal posterior for $(\alpha, \beta)$ is obtained as
$$ p(\alpha, \beta \mid Y) = \int p(\alpha, \beta, \phi \mid Y)\, d\phi $$
and turns out to be a bivariate T distribution. Of key practical relevance are the implied univariate posterior margins for $\alpha$, for $\beta$, and for linear functions of $(\alpha, \beta)$ alone, and these are all univariate T distributions (see Appendix for details of T distributions). Specific univariate margins are as follows.

- Define $v^2 = 1/S_{xx}$. The univariate posterior for $\beta$ has density function
$$ p(\beta \mid Y) \propto \{(n-2) + (\beta - \hat\beta)^2/(s^2 v^2)\}^{-(n-1)/2}, $$
and this is the density of a T distribution with $n-2$ degrees of freedom, mode $\hat\beta$ and scale $sv$. By way of notation we have
$$ (\beta \mid Y) \sim T_{n-2}(\hat\beta, s^2 v^2). $$
As long as $n > 4$, it is also true that $E(\beta \mid Y) = \hat\beta$ and $V(\beta \mid Y) = c s^2 v^2$ with $c = (n-2)/(n-4)$. The posterior is symmetric and normal-shaped about the mode, though has heavier tails than a normal posterior. We can write
$$ \beta = \hat\beta + s v t \quad\text{and}\quad t = (\beta - \hat\beta)/(sv) $$
where the random quantity $t \sim T_{n-2}(0, 1)$. Posterior quantiles and intervals for $\beta$ follow from those of the standard Student T distribution: if $t_p$ is the $100p\%$ quantile of $t$, then that for $\beta$ is simply $\hat\beta + s v t_p$. The term $sv$ is called the posterior standard error of the coefficient.

For large degrees of freedom, the Student T distribution approaches the standard normal, in which case we have the approximation $t \approx N(0, 1)$ and so
$$ (\beta \mid Y) \approx N(\hat\beta, s^2 v^2). $$
Otherwise, we can view the distribution informally as "like the normal but with a little bit of additional uncertainty."

- The univariate margin for $\alpha$ is similarly a T distribution with $n-2$ degrees of freedom, mode $\hat\alpha$ and scale $s v_\alpha$ where $v_\alpha^2 = n^{-1} + \bar x^2 / S_{xx}$, i.e.,
$$ (\alpha \mid Y) \sim T_{n-2}(\hat\alpha, s^2 v_\alpha^2). $$

- Under $p(\alpha, \beta \mid Y)$ the two parameters are generally correlated. Assuming $n > 4$ so that second moments of the T distribution exist, the posterior covariance is $s^2 c_{\alpha,\beta}(n-2)/(n-4)$ where $c_{\alpha,\beta} = -\bar x / S_{xx}$. Note that $(n-2)/(n-4) \approx 1$ when $n$ is large, when the posterior is approximately normal; in that case, the posterior covariance is just the term $s^2 c_{\alpha,\beta}$ above. Note also that the covariance, hence the correlation, is zero if and only if $\bar x = 0$. This correlation is relevant when considering posterior inferences and predictions involving linear functions of the two parameters, as now follows (a small numerical sketch of these marginal summaries appears after this list).
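The following sketch (Python with scipy.stats, continuing the same hypothetical simulated data used earlier) computes the quantities defining these T margins: the scales $sv$ and $sv_\alpha$, and 95% posterior intervals for $\beta$ and $\alpha$. All variable names and values are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 30
x = np.linspace(0.0, 10.0, n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

xbar, ybar = x.mean(), y.mean()
S_xx = np.sum((x - xbar) ** 2)
beta_hat = np.sum((x - xbar) * (y - ybar)) / S_xx
alpha_hat = ybar - beta_hat * xbar
resid = y - alpha_hat - beta_hat * x
s = np.sqrt(np.sum(resid ** 2) / (n - 2))

# Posterior scales: v^2 = 1/S_xx for beta, v_alpha^2 = 1/n + xbar^2/S_xx for alpha.
v_beta = np.sqrt(1.0 / S_xx)
v_alpha = np.sqrt(1.0 / n + xbar ** 2 / S_xx)

# 95% equal-tails (= HPD) posterior intervals from the T_{n-2}(0,1) quantile.
t975 = stats.t.ppf(0.975, df=n - 2)
beta_interval = (beta_hat - s * v_beta * t975, beta_hat + s * v_beta * t975)
alpha_interval = (alpha_hat - s * v_alpha * t975, alpha_hat + s * v_alpha * t975)
print(beta_interval, alpha_interval)
```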

1.2.4 Intervals for $\beta$ and Significance of the Regression

The standard T test of the significance of the regression fit can be understood as an assessment of the support for the value $\beta = 0$ under the marginal posterior distribution. Note that, if the assumption of a linear regression model is really appropriate, then $\beta = 0$ would imply no relationship between the $x$ and $y$ variables, whereas large values of $|\beta|$ imply a strong relationship.

Under the symmetric posterior distribution $(\beta \mid Y) \sim T_{n-2}(\hat\beta, s^2 v^2)$, HPD intervals coincide with equal-tails intervals. Specifically, the $100(1-p)\%$ posterior interval is
$$ \hat\beta \pm s v\, |t_{p/2}| $$
for any $p$, where $t_{p/2}$ is the $100(p/2)\%$ quantile of $T_{n-2}(0, 1)$. For example, $p = 0.05$ means that $t_{0.025}$ is the lower 2.5% point of the standard T distribution, and the above interval is a 95% equal-tails, HPD interval for $\beta$. If the value $\beta = 0$ lies outside this interval, then we are assured that the regression is significant at at least the 95% level; that is, $\beta = 0$ is among the 5% least likely values of the parameter.

The standard p-value for the test of $\beta = 0$ can be interpreted as the posterior probability of values $\beta$ that have lower posterior density than $\beta = 0$. To compute this we need to find the probability outside the HPD interval defined with one end-point at $\beta = 0$. From the form of the posterior density function, it easily follows (as detailed in the Appendix) that this excludes all values $\beta$ such that $|t| > |\hat\beta/(sv)|$ where $t = (\beta - \hat\beta)/(sv)$. The resulting "tail-area" p-value is therefore simply
$$ p\text{-value} = 2\Pr(t > |\hat\beta/(sv)|) $$
where $t \sim T_{n-2}(0, 1)$. The observed quantity $\hat\beta/(sv)$ here is called the standardised T test statistic; it is simply the size of the estimated coefficient relative to its standard error. It is tempting to refer to the p-value as the "probability of $\beta = 0$," although it is obviously not quite that; it is, however, a standard measure of the significance of the fit of the model, with a low p-value commensurate with a significant fit.
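A minimal sketch of this calculation, using hypothetical summary values in place of a real fit (in practice beta_hat, s, v_beta and n would come from the fitted model): the standardised T statistic and its two-sided tail-area p-value follow from scipy's survival function for the Student T distribution.

```python
from scipy import stats

# Hypothetical summary values; in practice these come from the fitted model.
beta_hat, s, v_beta, n = 0.48, 0.95, 0.035, 30

T = beta_hat / (s * v_beta)                    # standardised T statistic
p_value = 2.0 * stats.t.sf(abs(T), df=n - 2)   # 2 Pr(t > |T|) under T_{n-2}(0,1)
print(T, p_value)
```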

1.2.5 F Tests, ANOVA and Deviances

It is traditional to summarise the statistical significance of the regression fit with a summary F test and the associated analysis of variation, historically called the analysis of variance, or ANOVA. This adds nothing new methodologically; rather, the F test and ANOVA presentation is simply another way of summarising the goodness of fit in terms of $R^2$ and the p-value based on the T test for $\beta = 0$, as discussed in the previous section. One reason for restating the conclusions in these different terms is simply historical precedent. A second, more important reason is that the extension of significance assessment to multiple regression models (models with more than one predictor variable) cannot be done with T tests alone, and inherently involves ANOVA ideas. Hence it is useful to explain the connection in this simplest of cases.

Begin by noting that the p-value from the T test of the significance of the regression is equivalent to
$$ p\text{-value} = \Pr(F > (\hat\beta/(sv))^2) $$
where $F = t^2$, a random quantity obtained as the square of $t \sim T_{n-2}(0, 1)$. The name of the distribution of $F$ is the F distribution with $(1, n-2)$ degrees of freedom; we write $F \sim F_{1, n-2}$. Such distributions are well known, tabulated and available in computer software. Hence we could use the upper tail-area of the $F_{1, n-2}$ to compute the p-value, as an alternative to the $T_{n-2}(0, 1)$.

Write
$$ f_{\rm obs} = (\hat\beta/(sv))^2 = \frac{\hat\beta^2/v^2}{s^2}. $$
It is easily shown (see exercises) that the numerator term in $f_{\rm obs}$ reduces to
$$ \hat\beta^2/v^2 = S_{yy} - (n-2)s^2 $$
where $(n-2)s^2 = \sum_{i=1}^n (y_i - \hat\alpha - \hat\beta x_i)^2$ is the residual sum of squares from the model fit. As a result, it is trivial to compute
$$ f_{\rm obs} = (S_{yy} - (n-2)s^2)/s^2 $$
and the implied p-value.

The above expression is intuitively reasonable. The term $S_{yy} - (n-2)s^2$ is called the fitted sum of squares, or the variation explained by the regression of $y$ on $x$. This reflects the fact that it is the difference between the total variation in the response data ($S_{yy}$) and the residual variation remaining after the model has been fitted. If the variation explained is large, then $f_{\rm obs}$ will be large and the p-value small; just how large the reduction in variation due to the model must be is relative to the scale of the data, so that the variation explained is assessed relative to $s^2$, as is clear in the formula for $f_{\rm obs}$.

In terms of the quadratic notation $Q(\alpha, \beta)$, recall that $S_{yy} = Q(\bar y, 0)$ and the residual sum of squares is $(n-2)s^2 = Q(\hat\alpha, \hat\beta)$. So, in this notation, the explained variation is simply $Q(\bar y, 0) - Q(\hat\alpha, \hat\beta)$. In some areas of modern statistics, these quadratic, sums-of-squares measures are referred to as deviances. Thus $Q(\bar y, 0)$ is the total deviance, and $Q(\hat\alpha, \hat\beta)$ is the residual deviance from the fitted model; the difference is the deviance explained by the regression on $x$, or the reduction in deviance due to the model. ANOVA, the analysis of variance, is simply a name for the representation of total variability in the response data using the above components. In modern times the term analysis of deviance is more apt, as it generalises to non-normal and non-linear regression models. The summary is simple: Total deviance = Deviance explained + Residual deviance, and the F test measures the significance of the model fit by assessing just how large the "deviance explained" term is.
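A minimal sketch of this deviance decomposition, under the same hypothetical straight-line fit used above: it computes the total deviance $S_{yy}$, the residual deviance $(n-2)s^2$, the explained deviance, $f_{\rm obs}$, and the p-value from the $F_{1, n-2}$ upper tail.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 30
x = np.linspace(0.0, 10.0, n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

xbar, ybar = x.mean(), y.mean()
S_xx = np.sum((x - xbar) ** 2)
beta_hat = np.sum((x - xbar) * (y - ybar)) / S_xx
alpha_hat = ybar - beta_hat * xbar

S_yy = np.sum((y - ybar) ** 2)                           # total deviance Q(ybar, 0)
resid_dev = np.sum((y - alpha_hat - beta_hat * x) ** 2)  # residual deviance Q(alpha_hat, beta_hat)
explained = S_yy - resid_dev                             # deviance explained by the regression
s2 = resid_dev / (n - 2)

f_obs = explained / s2
p_value = stats.f.sf(f_obs, dfn=1, dfd=n - 2)            # upper tail of F_{1, n-2}
print(f_obs, p_value)
```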

1.2.6 Honest Prediction

Consider predicting a new case $y_{n+1}$ at a further, or future, design point $x_{n+1}$. Formally, this requires the evaluation of the posterior predictive distribution $p(y_{n+1} \mid Y)$, i.e., simply the distribution for the new value conditional on the data observed so far (as above, the $x_i$ values are implicit and ignored in the notation). It can be shown that this leads to a T distribution, and we denote this by
$$ (y_{n+1} \mid Y) \sim T_{n-2}(\hat y, s^2 v_y^2) $$
where

- $\hat y = \hat\alpha + \hat\beta x_{n+1}$, just the fitted line at $x_{n+1}$; and

- $v_y^2 = 1 + w^2$ with
$$ w^2 = v_\alpha^2 + x_{n+1}^2 v^2 + 2 x_{n+1} c_{\alpha,\beta}, $$
where $c_{\alpha,\beta} = -\bar x / S_{xx}$ was given above in the posterior covariance between $\alpha$ and $\beta$.

It is easy to motivate this by considering the model directly, that is
$$ y_{n+1} = \alpha + \beta x_{n+1} + \epsilon_{n+1} $$
where $\epsilon_{n+1}$ is the deviation from the line for the new case and so is independent of past data. Taking expectations across this equation gives $E(y_{n+1} \mid Y) = \hat y$, using linearity of expectations. The scale factor $s v_y$ is similarly computed; note that the posterior correlation between $\alpha$ and $\beta$ enters into its computation. Note the following:

- For large $n$, the predictive distribution is close to normal, in which case
$$ (y_{n+1} \mid Y) \approx N(\hat y, s^2 v_y^2). $$

- Since $v_y^2 = 1 + w^2$ we have $s^2 v_y^2 = s^2 + s^2 w^2$. Here the first $s^2$ is the estimate of the variance $\sigma^2$ of $\epsilon_{n+1}$, whereas the term $s^2 w^2$ measures the uncertainty about the line $\alpha + \beta x_{n+1}$ at the new design point.

- With very large data sets the posterior for $(\alpha, \beta)$ will be quite precise, and $w^2$ will be small, in which case the predictive distribution is almost $N(\hat y, s^2)$. Otherwise, the predictions using the above results will be "honest" in the sense that the additional term $s^2 w^2$ appropriately reflects the additional uncertainty in predicting due to uncertainty about the regression line parameters.

- It can be shown that the above formula for $w^2$ reduces to
$$ w^2 = n^{-1} + (x_{n+1} - \bar x)^2 / S_{xx}. $$
As a result, $w^2$ will be small when $x_{n+1}$ is close to $\bar x$, but grows larger for larger values of $|x_{n+1} - \bar x|$. This means that the spread of the predictive distribution is greater for new design points that are further away from the past design points; in this sense, predictions also appropriately reflect increased uncertainty in extrapolating outside the range of past experience. Nevertheless, extrapolation must always be approached with caution, unless substantive theory exists to indicate the validity of extending the straight line regression assumption into regions where no data are available.

In practice, computer programs work with matrix formulations of linear models, so that the explicit details of the calculations above are never needed for actual numerical work. They are important, however, for understanding and interpretation.
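A short sketch of the predictive calculation, again under hypothetical simulated data: it evaluates $\hat y$, $w^2 = n^{-1} + (x_{n+1} - \bar x)^2/S_{xx}$, and a 95% predictive interval from the $T_{n-2}$ quantile.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 30
x = np.linspace(0.0, 10.0, n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

xbar, ybar = x.mean(), y.mean()
S_xx = np.sum((x - xbar) ** 2)
beta_hat = np.sum((x - xbar) * (y - ybar)) / S_xx
alpha_hat = ybar - beta_hat * xbar
s = np.sqrt(np.sum((y - alpha_hat - beta_hat * x) ** 2) / (n - 2))

x_new = 12.0                                   # a new design point (extrapolating slightly)
y_hat = alpha_hat + beta_hat * x_new           # fitted line at x_new
w2 = 1.0 / n + (x_new - xbar) ** 2 / S_xx      # parameter-uncertainty contribution
v_y = np.sqrt(1.0 + w2)

t975 = stats.t.ppf(0.975, df=n - 2)
interval = (y_hat - s * v_y * t975, y_hat + s * v_y * t975)
print(y_hat, interval)
```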

2 Sampling Theoretic Inference

The standard sampling-theoretic inference framework leads to essentially the same methods (the same point estimates of $(\alpha, \beta, \sigma^2)$, the same intervals and tests) in terms of numerical values, although, of course, their interpretations and rationale are quite different to those in the Bayesian framework. The sampling theory results are summarised here for comparison.

Under the assumption that the straight line model generates the data, consider the statistics $(\hat\alpha, \hat\beta, s^2)$ as functions of the random quantities $Y$ prior to their being observed. These have a joint sampling distribution $p(\hat\alpha, \hat\beta, s^2 \mid \alpha, \beta, \sigma^2)$ that is well known (e.g., DeGroot, chapter 10). Remember that this is the repeat-sampling distribution of the estimators based on the model, assuming the parameters are fixed at their "true" values. This trivariate sampling distribution has the following features.

^

 The estimators are unbiased; that is, E ^ j  = ; E  j  = and

2 2 2

E ^ j = : They are also consistent, in that they converge in prob-

ability to their resp ective parameters as n tends to in nity.

 The statistic  ^ =sv is distributed as T 0; 1: Note that this de-

n2

scrib es sampling variability in b oth ^ in the numerator and s in the de-

nominator. As a result, con dence intervals for have the form

^  sv t

p=2

for any probability p and where t is the 100p=2 quantile of T 0; 1:

n2

p=2

^

 The statistic  =sv is also distributed as T 0; 1: Note that this

n2

^

describ es sampling variability in b oth in the numerator and s in the

denominator. As a result, con dence intervals for have the form

^

 sv t

p=2

for any probability p and where t is the 100p=2 quantile of T 0; 1:

n2

p=2

2 2

 The statistic s has a sampling distribution that dep ends only on  ; and

2 2 2

is given bys j   Gamman 2=2; n 2=2 : Equivalently, the

2 2

quantity s n 2= is distributed as Gamman 2=2; 1=2 whichis

the chi-squared distribution with n 2 degrees of freedom.

Note that the point estimates and confidence intervals for the regression parameters coincide numerically with those in the reference Bayesian analysis, and that the same point estimate of $\sigma^2$ is generated too. These numerical equivalences extend to predictions.
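The repeat-sampling claims above are easy to check by simulation. The following hedged sketch (all settings illustrative) repeatedly generates data from a fixed straight line, refits by least squares each time, and compares the average of $\hat\beta$ with the true $\beta$ and the variance of $(\hat\beta - \beta)/(sv)$ with the $T_{n-2}$ variance $(n-2)/(n-4)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, beta, sigma = 25, 1.0, 0.7, 2.0
x = np.linspace(0.0, 5.0, n)
S_xx = np.sum((x - x.mean()) ** 2)

t_stats, beta_hats = [], []
for _ in range(5000):
    y = alpha + beta * x + rng.normal(0.0, sigma, size=n)
    b_hat = np.sum((x - x.mean()) * (y - y.mean())) / S_xx
    a_hat = y.mean() - b_hat * x.mean()
    s = np.sqrt(np.sum((y - a_hat - b_hat * x) ** 2) / (n - 2))
    beta_hats.append(b_hat)
    t_stats.append((b_hat - beta) / (s / np.sqrt(S_xx)))

print(np.mean(beta_hats), beta)               # unbiasedness: average of beta_hat near beta
print(np.var(t_stats), (n - 2) / (n - 4))     # variance of the T_{n-2}(0,1) pivot
```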

On the issue of testing the significance of the regression model fit, the sampling-theoretic interpretation of the p-value may be based on the following reasoning. Condition on the hypothesis that $\beta = 0$, so that the sampling distribution theory above implies $\hat\beta/(sv) \sim T_{n-2}(0, 1)$. Write $t = \hat\beta/(sv)$. Then large values of the statistic $|t|$ are rare if the hypothesis that $\beta = 0$ is actually true, and just how rare can be measured by probabilities with respect to this sampling distribution. Write $t_{\rm obs}$ for the observed value of $\hat\beta/(sv)$ in this data set. Then the probability of observing a future, hypothetical data set $Y$ that gives rise to a value of $t$ at least as extreme as $t_{\rm obs}$ is simply
$$ \Pr(|t| > |t_{\rm obs}|) = 2\Pr(t > |t_{\rm obs}|), $$
and this is easily computed from the standard $T_{n-2}(0, 1)$ distribution of $t$. Notice that, though the conceptual basis for this calculation is radically different from the conditional Bayesian approach, the numerical results are again the same.

The above is the standard sampling theory, or frequentist, definition of the observed significance level, or p-value, of the test of the hypothesis that $\beta = 0$. It coincides with the Bayesian p-value based on tail areas under the posterior density for $\beta$. The numerical correspondence of the associated F test follows immediately.

It turns out that this test has other optimality properties from a sampling theoretic viewpoint. In particular, it is a likelihood ratio test of the hypothesis $\beta = 0$ compared to all other values of $\beta$, and a uniformly most powerful test in addition (DeGroot, chapter 10).

3 Multiple Linear Regression Models: Summary

The course slides provide coverage of the multiple linear regression model, notation, mathematical structure, examples, and theory. This section is a supplement on the basic theoretical results (no proofs) related to the reference posterior distribution and its uses.

The model is
$$ y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i $$
for observations $i = 1, \ldots, n$. In vector/matrix notation we have
$$ y_i = x_i'\beta + \epsilon_i $$
and
$$ y = X\beta + \epsilon $$
where:

- the $n \times 1$ response vector $y$ has elements $y_i$;

- the $n \times k$ design matrix $X$ has rows $x_i'$, where
$$ x_i' = (1, x_{i1}, \ldots, x_{ip}) $$
and, of course, $k = p + 1$;

- the $k \times 1$ regression parameter vector $\beta$ has elements $\beta_i$;

- the $n \times 1$ error or deviation vector $\epsilon$ has elements $\epsilon_i$.

In detail,
$$ y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad \epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}, $$
$$ X = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} $$
and
$$ \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}. $$
Row $i$ of $X$ is just $x_i'$, containing the values of all predictor variables, in order, for observation $i$. Each later column of $X$ contains the $n$ values of the corresponding predictor variable, while the first column has all entries equal to 1, corresponding to the intercept term of the model.
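The following sketch shows one way to assemble these objects in Python with numpy, for a hypothetical data set with $n = 5$ observations and $p = 2$ predictors; the numbers are purely illustrative and nothing here is specific to the notes beyond the definitions above.

```python
import numpy as np

# Hypothetical raw predictor values: n = 5 observations on p = 2 predictors.
predictors = np.array([[0.5, 1.2],
                       [1.0, 0.7],
                       [1.5, 2.3],
                       [2.0, 1.9],
                       [2.5, 3.1]])
y = np.array([2.1, 2.0, 3.4, 3.3, 4.2])

n, p = predictors.shape
k = p + 1

# Design matrix X: a leading column of ones (intercept) followed by the predictors.
X = np.column_stack([np.ones(n), predictors])
print(X.shape)   # (n, k) = (5, 3)
```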

4 Multiple Linear Regression Models: Reference Posterior Distribution

The reference posterior $p(\beta \mid Y)$ is a multivariate T distribution. The parameter vector $\beta$ contains $k$ parameters, so the posterior is a joint distribution for these $k$ parameters, or a $k$-variate distribution. The multivariate T distribution has $n - k$ degrees of freedom, and we write it as
$$ (\beta \mid Y) \sim T_{n-k}(\hat\beta, s^2 V) $$
with the following ingredients:

- The dimension $k$ is implicit, and not made explicit in the notation;

- $V$ is the $k \times k$ matrix defined by
$$ V = (X'X)^{-1}, $$
and arises in the formula for $\hat\beta$, the LSE of $\beta$, defined by
$$ \hat\beta = V X' y; $$

- $s^2$ is the residual estimate of the variance $\sigma^2$, given by
$$ s^2 = Q(\hat\beta)/(n - k) $$
where
$$ Q(\hat\beta) = \sum_{i=1}^n \hat\epsilon_i^2 $$
is the residual sum of squares of the model fit, or the residual deviance, based on fitted residuals
$$ \hat\epsilon_i = y_i - x_i'\hat\beta $$
for each $i$. As in the straight line model, $s^2$ is the usual estimate of $\sigma^2$, computed as a sample variance in which the sample values are the fitted residuals, and the denominator $n - k$ corrects the sample size $n$ by subtracting the number of parameters estimated in $\beta$.
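A minimal numpy sketch of these ingredients, assuming a design matrix X and response vector y as in the previous snippet (hypothetical values): it computes $V$, the LSE, the fitted residuals, $s^2$, and the marginal posterior standard errors $s v_i$ from the diagonal of $V$.

```python
import numpy as np

# X (n x k, with intercept column) and y as assembled earlier; hypothetical values.
X = np.column_stack([np.ones(5), [0.5, 1.0, 1.5, 2.0, 2.5], [1.2, 0.7, 2.3, 1.9, 3.1]])
y = np.array([2.1, 2.0, 3.4, 3.3, 4.2])
n, k = X.shape

V = np.linalg.inv(X.T @ X)          # V = (X'X)^{-1}
beta_hat = V @ X.T @ y              # LSE of beta
resid = y - X @ beta_hat            # fitted residuals
s2 = np.sum(resid ** 2) / (n - k)   # residual variance estimate, divisor n - k

std_errors = np.sqrt(s2 * np.diag(V))   # marginal posterior scales s * v_i
print(beta_hat, s2, std_errors)
```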

4.1 Univariate marginal posteriors

One of the key properties of the joint posterior distribution above is that the implied univariate marginal posterior for any element $\beta_i$ of $\beta$ is a Student T distribution. Specifically, for $i = 1, \ldots, k$,
$$ (\beta_i \mid Y) \sim T_{n-k}(\hat\beta_i, s^2 v_i^2) $$
where $\hat\beta_i$ is the corresponding estimate from $\hat\beta$ and $v_i^2$ is the corresponding diagonal element of $V$, i.e., $v_i^2 = V_{ii}$.

Inference on any one parameter individually involves summarising this T distribution, in the usual ways.

4.2 Other marginal posteriors

A second key property of the joint posterior distribution above is that the implied multivariate marginal posterior for any subset of $r$ elements of $\beta$ is also a multivariate T distribution. This is of most interest in connection with developing measures of importance of subsets of predictor variables in explaining variation observed in the response variable. We return to this below in discussing subset F tests of such questions.

4.3 Honest prediction

A further key implication of the model and its analysis is that predictive distributions for new/future response variables are also T distributions. Specifically, consider a future predictor vector $x_{n+1}$. The posterior predictive distribution for the associated future response outcome
$$ y_{n+1} = x_{n+1}'\beta + \epsilon_{n+1} $$
is given by
$$ (y_{n+1} \mid Y) \sim T_{n-k}(\hat y_{n+1}, s^2 v_y^2) $$
where

- the point prediction is simply
$$ \hat y_{n+1} = x_{n+1}'\hat\beta = \hat\beta_0 + \hat\beta_1 x_{n+1,1} + \cdots + \hat\beta_p x_{n+1,p}, $$
or just the value of the fitted regression function at the new design point;

- $v_y^2 = 1 + w^2$ where $w^2 = x_{n+1}' V x_{n+1}$.

The term $s v_y$ is called the predictive standard error. Note that $w$, and hence $v_y$, depend on $x_{n+1}$ (though the notation does not make this explicit), and this dependence is important in reflecting differing degrees of uncertainty about $y_{n+1}$ at different design points. In particular, as in the simple straight line regression model, predictions at new points $x_{n+1}$ that are far from the region of previous experience will have higher standard errors, since $w$, and hence $v_y$, will be larger.

Further, note that the term $w$ appears to reflect the estimation uncertainty: the uncertainty due to lack of precise knowledge of $\beta$. With large sample sizes, the posterior for $\beta$ will be very concentrated about its mean of $\hat\beta$, $V$ will have small elements, and hence $w$ will be small. In such cases, two effects arise: first, $v_y^2 \approx 1$, so that the predictive standard error will be approximately $s$; secondly, the T distribution, with a very large degrees of freedom, will be approximately normal. In such cases, therefore, we may use the simpler approximate predictive distribution
$$ (y_{n+1} \mid Y) \approx N(\hat y_{n+1}, s^2). $$
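Continuing the same hypothetical fit, here is a sketch of the predictive summaries for a new predictor vector $x_{n+1}$: the point prediction, $w^2 = x_{n+1}'Vx_{n+1}$, the predictive standard error $s v_y$, and a 95% predictive interval from the $T_{n-k}$ quantile. The new predictor values are invented for illustration.

```python
import numpy as np
from scipy import stats

# Quantities from the fit sketched above (hypothetical values).
X = np.column_stack([np.ones(5), [0.5, 1.0, 1.5, 2.0, 2.5], [1.2, 0.7, 2.3, 1.9, 3.1]])
y = np.array([2.1, 2.0, 3.4, 3.3, 4.2])
n, k = X.shape
V = np.linalg.inv(X.T @ X)
beta_hat = V @ X.T @ y
s = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - k))

x_new = np.array([1.0, 3.0, 2.5])          # includes the leading 1 for the intercept
y_hat = x_new @ beta_hat                   # point prediction
w2 = x_new @ V @ x_new                     # parameter-uncertainty contribution
sv_y = s * np.sqrt(1.0 + w2)               # predictive standard error

t975 = stats.t.ppf(0.975, df=n - k)
print(y_hat, (y_hat - sv_y * t975, y_hat + sv_y * t975))
```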

5 Assessing subsets of predictors

5.1 Posterior assessment of parameters

Standard subset F tests provide numerical measures of the contributions of subsets of predictor variables to the overall fit of the model. Consider a specific subset of $r$ of the predictor variables, numbered $j_1, \ldots, j_r$, so that the corresponding regression parameters are
$$ \gamma = \begin{pmatrix} \beta_{j_1} \\ \vdots \\ \beta_{j_r} \end{pmatrix}. $$

Under the reference posterior distribution, the $r$-vector parameter $\gamma$ has a marginal multivariate $T_{n-k}$ distribution. One way to assess how important the $r$ predictors in question are is to assess how much support this posterior gives to values near $\gamma = 0$. The reasoning is parallel to that used in univariate cases: if the point $\gamma = 0$ is unsupported by the data, and so well out "in the tails" of the posterior, then the corresponding predictors play a meaningful role in the model fit. If, on the other hand, the point $\gamma = 0$ is close to the mean of the posterior and so is a likely value, the predictors have less relevance. Notice that this is explored with no reference to the other predictors; it is therefore to be understood that this assessment of $\gamma$ is conditional upon the other covariates being in the model.

To formally assess the support for $\gamma = 0$ we

- identify all values of $\gamma$ that have lower posterior density than $\gamma = 0$, and then

- find the posterior probability of those values, to deliver a p-value assessing just how extreme $\gamma = 0$ is.

The formal interpretation of such a p-value is that it is the posterior probability on values $\gamma$ that lie outside the highest posterior density region defined by $\gamma = 0$; the common-sense terminology is that the p-value is the probability on values of $\gamma$ less well supported by the data than $\gamma = 0$. A small p-value indicates that $\gamma$ is most probably different to 0, and so indicates that the $r$ predictors in question contribute significantly to the model fit.

Standard theory tells us that the resulting p-value may be easily computed as a tail-area under a univariate F distribution. The details and derivation are not given here, but the formula for the threshold of the F distribution is both easy to remember and important to interpret. The result is that
$$ p\text{-value} = \Pr(F > f_{\rm obs}), $$
where the random variable $F$ has an F distribution on $r$ and $n-k$ degrees of freedom, or $F \sim F_{r, n-k}$, and where $f_{\rm obs}$ is the threshold value computed from the posterior. The formula for $f_{\rm obs}$ is discussed in the following subsection. Note that the p-value is trivially computed from software packages with cumulative distribution functions of F distributions: it is just $1 - P(f_{\rm obs})$ where $P$ here stands for the cdf of $F_{r, n-k}$.

5.2 F test statistics

Call the model above Model A. In Model A, we have $p$ predictor variables and are interested in whether or not a specific subset of $r$ of them are really relevant in terms of statistical measures of fit. Consider a different linear model, Model B, that is just Model A with the $r$ predictor variables in question removed. We can trivially fit Model B as well, compute the usual posterior distributions and numerical summaries, and use these to make comparisons with Model A. Model A has more predictor variables than Model B: Model A has $p$ predictors, but Model B has only $p - r$. In addition, we say that Model B is nested within Model A, since it arises as a special case of Model A when $\gamma = 0$. If the $r$ predictors in question are really not relevant in describing observed variations in the response data, then the two models will produce similar fits. If, on the other hand, these specific predictors are really related to the response, Model A will produce a different and better fit, better in the sense of explaining more observed variation. To this end, we compare the residual sums of squares, or deviances, from the two models.

Write $Q_A$ and $Q_B$ for the residual deviances in Models A and B, respectively. We can interpret $Q_B - Q_A$ as the decrease in deviance in moving from the simpler Model B to the more elaborate Model A. A large difference is indicative of a significantly improved fit, suggesting the $r$ predictors are relevant. To measure how large the decrease in deviance is, we first note that the reduced deviance under Model A is achieved at the cost of $r$ extra parameters, so that we might better consider the change in deviance per parameter, or $(Q_B - Q_A)/r$. Second, deviances are measured on a scale that is the square of the response, or that of $\sigma^2$ and its estimates; to standardise to a dimensionless measure we divide by the best estimate of $\sigma^2$ available, namely $s_A^2$, the residual estimate of variance under the larger Model A. This leads us to the standardised, per-parameter "reduction in deviance" measure $(Q_B - Q_A)/(r s_A^2)$ as a natural summary of just how much the $r$ predictors in question improve model fit in the context of the other predictors already in Model B.

The F distribution theory mentioned in the previous section identifies just this standardised deviance measure as the critical threshold value, i.e.,
$$ f_{\rm obs} = (Q_B - Q_A)/(r s_A^2) $$
or
$$ f_{\rm obs} = \frac{(\text{difference in deviance})/(\text{difference in number of parameters})}{\text{residual estimate of variance}}, $$
where the "differences" in deviances and number of parameters compare elaborating the data description from Model B to Model A, and the "residual estimate of variance" is that in the more elaborate Model A. The resulting p-value simply converts this common-sense measure of "difference" between the two models to a probability scale.

Finally, note the special case when we consider $r = p$ and assess all predictor variables. In this case, the F test is a test of overall regression fit, with the simpler "baseline" Model B being the random sample model with no predictors. Many computer packages routinely generate the $f_{\rm obs}$ threshold, and its corresponding p-value under the $F_{p, n-p-1}$ distribution, as a summary of overall model fit.
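A hedged sketch of the subset F test, fitting nested Models A and B by least squares on hypothetical simulated data: it removes the last $r$ columns of X to form Model B, compares residual deviances, and converts $f_{\rm obs}$ into a p-value via the $F_{r, n-k}$ upper tail. All names and settings are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p, r = 40, 3, 2                            # n observations, p predictors, test the last r of them
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 0.8, 0.0, 0.0])    # the last two predictors are irrelevant here
y = X @ beta_true + rng.normal(0.0, 1.0, size=n)
k = p + 1

def residual_deviance(X):
    """Residual sum of squares from the least squares fit of y on X."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    return np.sum((y - X @ beta_hat) ** 2)

Q_A = residual_deviance(X)                    # Model A: all predictors
Q_B = residual_deviance(X[:, :-r])            # Model B: last r predictors removed
s2_A = Q_A / (n - k)

f_obs = (Q_B - Q_A) / (r * s2_A)
p_value = stats.f.sf(f_obs, dfn=r, dfd=n - k)
print(f_obs, p_value)
```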

5.3 Cautionary note

As usual, it is important to be aware that p-values will tend to be smaller for larger sample sizes, the F test being more "powerful" in identifying increasingly small deviations of $\gamma$ from 0 when $n$ is very large. Hence it is important to be aware that, if dealing with very large samples, most subsets of potential predictors may be found to be statistically significant even though their inclusion in a model may lead to only minor changes in inferences and predictions. Generally, and again as in all statistical models, choice of candidate predictor variables should be based as much as possible on underlying scientific relevance and validity, and on empirical performance in out-of-sample predictions.

Appendix: T distributions

A real-valued random quantity $t$ has a standard Student T distribution with $k > 0$ degrees of freedom if the density function is
$$ p(t) \propto \{k + t^2\}^{-(k+1)/2} $$
with normalising constant $a = k^{k/2}\,\Gamma((k+1)/2)/\{\sqrt{\pi}\,\Gamma(k/2)\}$. We write $t \sim T_k(0, 1)$. The density is symmetric about zero with a shape similar to the normal density curve. For high values of $k$, say exceeding 20, the density is very close to normal. For lower values of $k$, the density is heavier-tailed than normal, giving higher probability to more extreme values. If $k > 1$ the mean exists and is $E(t) = 0$. If $k > 2$ the variance exists and is $V(t) = k/(k-2)$; for large $k$, the variance approaches 1 as the density approaches the standard normal.

For any constants $m$ and $v$, with $v > 0$, define the random quantity $x = m + vt$. Then $x$ has a T distribution with location $m$ (the mode of the density, and the mean if $k > 1$) and scale $v$. We write $x \sim T_k(m, v^2)$. The variance is $V(x) = v^2 k/(k-2)$, so that $v^2$ is approximately the variance for large $k$. The density is, by transformation,
$$ p(x) \propto \{k + (x - m)^2/v^2\}^{-(k+1)/2}. $$
For large $k$, the distribution of $x$ is approximately $N(m, v^2)$. Otherwise, we can view the distribution informally as "$N(m, v^2)$ with a little bit of additional uncertainty."

Quantiles of the Student T distributions are tabulated and available in computer software for any value of $k$. Suppose that $t_p$ is the $100p\%$ quantile of the standard distribution, i.e., such that $\Pr(t \le t_p) = p$. Then the corresponding quantile of $x$ is simply $m + v t_p$.
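In scipy, for example, the standard T quantile $t_p$ is given by stats.t.ppf(p, df=k), so the quantile of a scaled-and-shifted $x = m + vt$ follows directly; the numbers below are purely illustrative.

```python
from scipy import stats

k, m, v, p = 8, 1.5, 0.4, 0.975      # illustrative degrees of freedom, location, scale, probability
t_p = stats.t.ppf(p, df=k)           # quantile of the standard T_k(0,1)
x_p = m + v * t_p                    # corresponding quantile of x ~ T_k(m, v^2)
print(t_p, x_p)
```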

Appendix: T test in straight line regression

Under the reference posterior $(\beta \mid Y) \sim T_{n-2}(\hat\beta, s^2 v^2)$, the values of $\beta$ having lower density than $\beta = 0$ simply satisfy
$$ p(\beta \mid Y) < p(0 \mid Y) $$
where $p(\beta \mid Y) = c\{(n-2) + (\beta - \hat\beta)^2/(s^2 v^2)\}^{-(n-1)/2}$ is the posterior density. The value at zero is just
$$ p(0 \mid Y) = c\{(n-2) + \hat\beta^2/(s^2 v^2)\}^{-(n-1)/2} = c\{(n-2) + T^2\}^{-(n-1)/2} $$
where $T = \hat\beta/(sv)$ is the standardised T statistic. Hence $p(\beta \mid Y) < p(0 \mid Y)$ only if
$$ \{(n-2) + (\beta - \hat\beta)^2/(s^2 v^2)\} > \{(n-2) + T^2\}, $$
which reduces to
$$ t^2 > T^2 \quad\text{or}\quad |t| > |T| $$
where $t = (\beta - \hat\beta)/(sv)$ is a random quantity with the standard $T_{n-2}(0, 1)$ distribution.

As a result, the posterior probability of values $\beta$ having lower density than $\beta = 0$ is just $\Pr(|t| > |T|) = 2\Pr(t > |T|)$, since the standard T distribution is symmetric about zero. This is the standard p-value for the test of $\beta = 0$.

6 Exercises

1. Verify the formulae for the LSEs $(\hat\alpha, \hat\beta)$ in the straight line regression model.

2. In the straight line regression model, consider the residual sum of squares $Q(\hat\alpha, \hat\beta) = \sum_{i=1}^n (y_i - \hat\alpha - \hat\beta x_i)^2$. By substituting the identity $\hat\alpha = \bar y - \hat\beta \bar x$, show that $Q(\hat\alpha, \hat\beta) = S_{yy} - \hat\beta^2/v^2$. Deduce that $\hat\beta^2/v^2$ is the fitted sum of squares, or deviance explained, by the regression model.

3.

4.

5.