COMPUTATIONAL STATISTICS
LINEAR REGRESSION
Luca Bortolussi
Department of Mathematics and Geosciences, University of Trieste
Office 238, third floor, H2bis
[email protected]
Trieste, Winter Semester 2015/2016

MARGINAL LIKELIHOOD

The marginal likelihood p(t | α, β), appearing in the denominator of Bayes' theorem, can be used to identify good values of α and β, known as hyperparameters. Intuitively, we can place a prior distribution over α and β, compute their posterior, and use it in a fully Bayesian treatment of the regression:

    p(\alpha, \beta \mid \mathbf{t}) \propto p(\mathbf{t} \mid \alpha, \beta)\, p(\alpha, \beta)

If we assume the posterior is peaked around its mode, we can take the MAP estimate as an approximation of the full posterior for α and β. If the prior is flat, this boils down to the maximum likelihood solution.

Hence we need to optimise the marginal likelihood, which can be computed as

    \log p(\mathbf{t} \mid \alpha, \beta) = \frac{M}{2}\log\alpha + \frac{N}{2}\log\beta - E(\mathbf{m}_N) - \frac{1}{2}\log|\mathbf{S}_N^{-1}| - \frac{N}{2}\log 2\pi

with

    E(\mathbf{m}_N) = \frac{\beta}{2}\,\lVert \mathbf{t} - \Phi \mathbf{m}_N \rVert^2 + \frac{\alpha}{2}\,\mathbf{m}_N^T \mathbf{m}_N

This optimisation problem can be solved with any general-purpose optimisation routine, or with specialised methods (see Bishop); a numerical sketch is given after the dual regression discussion below.

BAYESIAN MODEL COMPARISON

Consider two different models, M_1 and M_2: which one best explains the data D? In a Bayesian setting, we may place a prior p(M_j) on the models and compute the posterior

    p(\mathcal{M}_j \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathcal{M}_j)\, p(\mathcal{M}_j)}{\sum_k p(\mathcal{D} \mid \mathcal{M}_k)\, p(\mathcal{M}_k)}

As we typically have additional parameters w, the term p(D | M_j) is the model evidence, or marginal likelihood. The ratio p(D | M_1) / p(D | M_2) is known as the Bayes factor.

In Bayesian model comparison we can take two approaches. We can compute the predictive distribution for each model and average it by the posterior model probability:

    p(t \mid \mathcal{D}) = \sum_j p(t \mid \mathcal{D}, \mathcal{M}_j)\, p(\mathcal{M}_j \mid \mathcal{D})

Alternatively, we can choose the model with the larger Bayes factor. This picks the correct model on average: assuming M_1 is the true model, the expected log Bayes factor is

    \int p(\mathcal{D} \mid \mathcal{M}_1) \log \frac{p(\mathcal{D} \mid \mathcal{M}_1)}{p(\mathcal{D} \mid \mathcal{M}_2)} \, d\mathcal{D} > 0

OUTLINE

1 LINEAR REGRESSION MODELS
2 BAYESIAN LINEAR REGRESSION
3 DUAL REPRESENTATION AND KERNELS

DUAL REPRESENTATION

Consider a regression problem with data (x_i, y_i) and a linear model wᵀφ(x). We can restrict the choice of w to the linear subspace spanned by φ(x_1), ..., φ(x_N), since any w_⊥ orthogonal to this subspace gives a contribution w_⊥ᵀφ(x_i) = 0 on the input points:

    \mathbf{w} = \sum_{j=1}^{N} a_j\, \phi(\mathbf{x}_j)

The coefficients a are known as the dual variables. By defining the kernel k(x_i, x_j) := φ(x_i)ᵀφ(x_j), we can write

    \mathbf{w}^T \phi(\mathbf{x}_i) = \mathbf{a}^T K_i

where K_i is the i-th column of the Gram matrix K, with K_{ij} = k(x_i, x_j).

DUAL REGRESSION PROBLEM

In the dual variables, we have to optimise the regularised regression objective

    E_d(\mathbf{a}) + \lambda E_W(\mathbf{a}) = \sum_{i=1}^{N} \left( t_i - \mathbf{a}^T K_i \right)^2 + \lambda\, \mathbf{a}^T K \mathbf{a}

Differentiating with respect to a and setting the gradient to zero, we obtain the solution

    \hat{\mathbf{a}} = (K + \lambda I)^{-1}\, \mathbf{t}

At a new input x*, the prediction is then

    y(\mathbf{x}^*) = \mathbf{k}_*^T (K + \lambda I)^{-1}\, \mathbf{t}, \qquad \mathbf{k}_* = \left( k(\mathbf{x}^*, \mathbf{x}_1), \ldots, k(\mathbf{x}^*, \mathbf{x}_N) \right)^T

The computational cost of solving the primal problem is O(M³), while the dual costs O(N³). Both can be prohibitive if N and M are large; in that case, one can optimise the log likelihood directly, using gradient-based methods.
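To make the dual solution concrete, here is a minimal NumPy sketch of â = (K + λI)⁻¹t and of the prediction y(x*) = k*ᵀâ. It uses a Gaussian (RBF) kernel, one possible choice enabled by the kernel trick discussed on the next slide; the kernel width gamma, the regularisation lam and the toy data are illustrative assumptions, not values from the slides.

    import numpy as np

    def rbf_kernel(X1, X2, gamma=1.0):
        # Gram matrix entries k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2);
        # any other valid kernel could be plugged in here (kernel trick).
        sq_dist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq_dist)

    def fit_dual(X, t, lam=0.1, gamma=1.0):
        # Dual solution a_hat = (K + lam I)^{-1} t; this solve is the O(N^3) step.
        K = rbf_kernel(X, X, gamma)
        return np.linalg.solve(K + lam * np.eye(len(X)), t)

    def predict(X_train, a_hat, X_new, gamma=1.0):
        # y(x*) = k_*^T a_hat, with k_* = (k(x*, x_1), ..., k(x*, x_N))^T.
        return rbf_kernel(X_new, X_train, gamma) @ a_hat

    # Illustrative usage on synthetic 1-d data:
    rng = np.random.default_rng(0)
    X = rng.uniform(-3.0, 3.0, size=(50, 1))
    t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
    a_hat = fit_dual(X, t, lam=0.1, gamma=0.5)
    y_star = predict(X, a_hat, np.array([[0.5]]), gamma=0.5)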
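Returning to the evidence maximisation discussed above, the following sketch evaluates log p(t | α, β) for a fixed design matrix Φ, following the formula on the marginal likelihood slide. The toy polynomial basis and the particular α, β values are illustrative assumptions; in practice one would hand this function to an optimiser or use the re-estimation equations in Bishop.

    import numpy as np

    def log_evidence(Phi, t, alpha, beta):
        # log p(t | alpha, beta) for Bayesian linear regression with a Gaussian
        # prior N(0, alpha^{-1} I) on w and Gaussian noise of precision beta.
        N, M = Phi.shape
        A = alpha * np.eye(M) + beta * Phi.T @ Phi          # A = S_N^{-1}
        m_N = beta * np.linalg.solve(A, Phi.T @ t)          # posterior mean m_N
        E = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
        _, logdet_A = np.linalg.slogdet(A)
        return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
                - E - 0.5 * logdet_A - 0.5 * N * np.log(2.0 * np.pi))

    # Illustrative usage with a toy polynomial design matrix:
    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, 30)
    t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(30)
    Phi = np.vander(x, 6, increasing=True)                  # basis 1, x, ..., x^5
    print(log_evidence(Phi, t, alpha=0.5, beta=25.0))
    # Log evidences of two candidate models (e.g. different bases) can be
    # subtracted to obtain a log Bayes factor.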
THE KERNEL TRICK

The dual objective function depends only on scalar products of input vectors. We can therefore replace the Euclidean scalar product with any (non-linear) scalar product. This is usually done by directly specifying a non-linear kernel function k(x_i, x_j): the kernel trick. It enables us to work with a more general set of basis functions, even countably infinite ones; see Gaussian processes. The same dual procedure applies to other algorithms, notably linear classification and SVMs.

COMPUTATIONAL STATISTICS
LINEAR CLASSIFICATION
Luca Bortolussi
Department of Mathematics and Geosciences, University of Trieste
Office 238, third floor, H2bis
[email protected]
Trieste, Winter Semester 2015/2016

OUTLINE

1 LINEAR CLASSIFIERS
2 LOGISTIC REGRESSION
3 LAPLACE APPROXIMATION
4 BAYESIAN LOGISTIC REGRESSION
5 CONSTRAINED OPTIMISATION
6 SUPPORT VECTOR MACHINES

INTRODUCTION

Data: (x_i, t_i). Outputs are discrete, either binary or multiclass (K classes), and are also denoted by y_i. Classes are denoted by C_1, ..., C_K.

Discriminant function: we construct a function f(x) ∈ {1, ..., K} associating a class with each input.

Generative approach: we consider a prior over classes, p(C_k), and the class-conditional densities p(x | C_k), from a parametric family. We learn the class-conditional densities from data and then compute the class posterior

    p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})}

Discriminative approach: we learn directly a model for the class posterior p(C_k | x), typically as p(C_1 | x) = f(wᵀφ(x)). The function f is called an activation function (and f^{-1} a link function).

ENCODING OF THE OUTPUT

For a binary classification problem we usually choose t_n ∈ {0, 1}, interpreted as the "probability" of belonging to class C_1. In some circumstances (perceptron, SVM), we will prefer the encoding t_n ∈ {−1, 1}. For a multiclass problem, we usually adopt a boolean (1-of-K) encoding: t_n = (t_{n,1}, ..., t_{n,K}) with t_{n,j} ∈ {0, 1}, and x_n is in class C_k if and only if t_{n,k} = 1 and t_{n,j} = 0 for all j ≠ k.
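A minimal sketch of the 1-of-K boolean encoding just described; the example labels and the value of K are illustrative.

    import numpy as np

    def one_hot(labels, K):
        # 1-of-K encoding: row n has t_{n,k} = 1 iff x_n belongs to class C_k.
        T = np.zeros((len(labels), K), dtype=int)
        T[np.arange(len(labels)), labels] = 1
        return T

    print(one_hot(np.array([0, 2, 1, 2]), K=3))
    # The {-1, +1} encoding preferred by the perceptron and SVMs is obtained
    # from the binary {0, 1} encoding as t_pm = 2 * t01 - 1.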
